The flowchart below summarizes the steps taken to build the M. abscessus model proteome, and the brief description beneath it complements the chart. For full details, please refer to the original manuscript:
Marcin J Skwark, Pedro H M Torres, Liviu Copoiu, Bridget Bannerman, R Andres Floto, Tom L Blundell, Mabellini: a genome-wide database for understanding the structural proteome and evaluating prospective antimicrobial targets of the emerging pathogen Mycobacterium abscessus, Database, Volume 2019, Issue 1, 2019, baz113, https://doi.org/10.1093/database/baz113
Representative sequences for the proteome of M. abscessus, retrieved from UniProt [1] (Proteome ID: UP000007137), were organised as individual sequences, each of which formed a starting point for the pipeline. For each of these target sequences longer than 250 residues, we perform domain decomposition through a HMMER [2] search against the Pfam-A [3] database.
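As an illustration of the length rule above, the following minimal sketch reads FASTA-formatted text and separates targets that would go through domain decomposition from those searched only as full-length sequences. The function names and the route labels are hypothetical and are not taken from the Vivace code; only the 250-residue threshold comes from the description.

```python
from typing import Dict, Iterator, List, Tuple

MIN_LENGTH_FOR_DOMAIN_SPLIT = 250  # residues; threshold stated in the text


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the identifier.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def partition_targets(fasta_text: str) -> Dict[str, List[str]]:
    """Route each target by length (hypothetical route labels)."""
    routes: Dict[str, List[str]] = {"decompose": [], "full_length_only": []}
    for name, seq in read_fasta(fasta_text):
        # Only sequences strictly longer than the threshold are decomposed.
        if len(seq) > MIN_LENGTH_FOR_DOMAIN_SPLIT:
            routes["decompose"].append(name)
        else:
            routes["full_length_only"].append(name)
    return routes
```

A sequence of exactly 250 residues is not decomposed in this sketch, following the wording "longer than 250 residues".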
The full sequence and those of the identified domains are subjected to PSI-BLAST [4] searches against the UniRef100 database. The resulting sequences are re-aligned using MAFFT [5] and used to predict structural characteristics of the protein (e.g. secondary structure and solvent accessibility). This information is in turn fed into MELODY (part of the FUGUE [6] suite), a program that condenses it into a compact representation (a profile) for querying our protein profile database (TOCCATA). Using each of these profiles, Vivace performs a FUGUE-based profile-profile search against the TOCCATA database.
In parallel to the FUGUE search, Vivace conducts a BLAST [7] search against pdbaa [8], a list of the protein sequences in the PDB. Hits with a sequence identity of at least 70% are located in the TOCCATA database, and each profile that contains such a protein, or a subset of it, is automatically included in the subsequent processing, even if the FUGUE search did not detect it. In this way, we ensure that profiles containing very close homologues of the target are always included, even when the initial FUGUE search has already identified good hits.
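The identity cut-off can be sketched as a small filter over BLAST results. This example assumes the hits are available in BLAST+ tabular format (`-outfmt 6`), whose first three columns are query ID, subject ID and percent identity; the description does not say which output format Vivace actually parses, and `close_homologues` is a hypothetical name.

```python
from typing import List, NamedTuple

MIN_IDENTITY = 70.0  # percent; threshold stated in the text


class BlastHit(NamedTuple):
    query: str
    subject: str
    identity: float  # percent identity reported by BLAST


def close_homologues(tabular: str, min_identity: float = MIN_IDENTITY) -> List[BlastHit]:
    """Keep hits whose percent identity is at least `min_identity`.

    Assumes tab-separated BLAST+ output (outfmt 6): columns 1 to 3 are
    query ID, subject ID and percent identity.
    """
    hits = []
    for line in tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hit = BlastHit(fields[0], fields[1], float(fields[2]))
        if hit.identity >= min_identity:
            hits.append(hit)
    return hits
```

The subject identifiers returned here would then be looked up in TOCCATA to find the profiles that contain them; that lookup is not shown because the layout of the profile database is not described.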
Of the identified profiles, those with low confidence (FUGUE Z-score below 4.0) are discarded. Profiles retrieved through BLAST searches have no associated Z-score and are therefore always retained. All retained profiles then undergo a grouping process to reduce redundancy. Profiles are grouped together if they cover overlapping spans of the query sequence (over 75% overlap) and share at least one SCOP [9] or CATH [10] family; otherwise they form new hit groups. Each group is then processed further, and the query sequences are trimmed to match the length of the selected profiles, inside which the potential template structures reside.
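The filtering and grouping rules can be sketched as follows. Several details are assumptions made for illustration only: the overlap is measured relative to the shorter of the two spans, each new profile is compared with the first member of every existing group, and the `ProfileHit` record and function names are hypothetical. Only the 4.0 Z-score threshold, the 75% overlap threshold and the shared-family requirement come from the description.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

MIN_Z_SCORE = 4.0    # FUGUE hits below this are discarded (from the text)
MIN_OVERLAP = 0.75   # grouping requires more than 75% overlap (from the text)


@dataclass(frozen=True)
class ProfileHit:
    name: str
    start: int                       # first query residue covered (1-based, inclusive)
    end: int                         # last query residue covered (inclusive)
    families: FrozenSet[str]         # SCOP and/or CATH family identifiers
    z_score: Optional[float] = None  # None marks a BLAST-derived hit


def is_retained(hit: ProfileHit) -> bool:
    """BLAST-derived hits have no Z-score and are always kept."""
    return hit.z_score is None or hit.z_score >= MIN_Z_SCORE


def overlap_fraction(a: ProfileHit, b: ProfileHit) -> float:
    """Shared span divided by the shorter span (an assumed definition)."""
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    shorter = min(a.end - a.start, b.end - b.start) + 1
    return max(shared, 0) / shorter


def group_hits(hits: List[ProfileHit]) -> List[List[ProfileHit]]:
    """Greedily place each retained hit into the first compatible group."""
    groups: List[List[ProfileHit]] = []
    for hit in filter(is_retained, hits):
        for group in groups:
            representative = group[0]
            if (overlap_fraction(hit, representative) > MIN_OVERLAP
                    and hit.families & representative.families):
                group.append(hit)
                break
        else:
            groups.append([hit])  # no compatible group: start a new one
    return groups
```

Because the grouping is greedy, the result can depend on the order of the input hits; sorting them by Z-score beforehand would be one way to make the outcome reproducible.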
In each of the selected profiles, the potential templates are classified into functional states according to their bound ligands and cofactors. For each state, Vivace selects up to five templates for modelling, ensuring maximal diversity of template structures in terms of amino acid sequence while maximizing their expected quality.
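One simple way to trade sequence diversity against expected quality is a greedy selection, sketched below. This is not the procedure Vivace uses, which the description does not specify: the `Template` record, the state labels, the quality score in [0, 1] and the scoring rule (quality multiplied by one minus the highest identity to an already chosen template) are all assumptions for illustration. Only the limit of five templates per state comes from the text.

```python
from typing import Dict, List, NamedTuple

MAX_TEMPLATES_PER_STATE = 5  # limit stated in the text


class Template(NamedTuple):
    name: str
    state: str        # hypothetical functional-state label, e.g. "apo" or "holo"
    quality: float    # hypothetical quality score in [0, 1], higher is better
    aligned_seq: str  # template row from the profile alignment ("-" for gaps)


def identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def select_templates(templates: List[Template],
                     limit: int = MAX_TEMPLATES_PER_STATE) -> Dict[str, List[Template]]:
    """Pick up to `limit` templates per state, favouring quality and diversity."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    by_state: Dict[str, List[Template]] = {}
    for template in templates:
        by_state.setdefault(template.state, []).append(template)

    selected: Dict[str, List[Template]] = {}
    for state, members in by_state.items():
        pool = sorted(members, key=lambda t: t.quality, reverse=True)
        chosen = [pool.pop(0)]  # start from the highest-quality template
        while pool and len(chosen) < limit:
            # Penalise candidates that closely resemble a chosen template.
            best = max(
                pool,
                key=lambda t: t.quality
                * (1.0 - max(identity(t.aligned_seq, c.aligned_seq) for c in chosen)),
            )
            pool.remove(best)
            chosen.append(best)
        selected[state] = chosen
    return selected
```

With this scoring rule, a template identical in sequence to one already chosen scores zero and is picked only when no more diverse candidate remains.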
The selected templates in each state are re-aligned to the target sequence using BATON (manuscript in preparation), a modified open implementation of the COMPARER method [11]. Based on this alignment, Vivace proceeds to the comparative modelling stage, using MODELLER [12]. Models are assessed with the DOPE [13] and GA341 [14] potentials.
Models are then filtered to remove those with extensive main-chain clashes, large poorly resolved loops and/or loosely interacting ligands. Finally, they are selected according to several quality metrics (PconsD [15], MolProbity [16], GOAP-AG [17], GOAP-score and SOAP [18]).
1. Bateman, A. et al. UniProt: The universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169 (2017)
2. Eddy, S. R. Multiple alignment using hidden Markov models. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3, 114–120 (1995)
3. Finn, R. D. et al. Pfam: The protein families database. Nucleic Acids Res. 42, D222–D230 (2014)
4. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997)
5. Katoh, K., Misawa, K., Kuma, K. & Miyata, T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066 (2002)
6. Shi, J., Blundell, T. L. & Mizuguchi, K. FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310, 243–257 (2001)
7. Camacho, C. et al. BLAST+: Architecture and applications. BMC Bioinformatics 10, 1–9 (2009)
8. Wang, G. & Dunbrack, R. L. PISCES: a protein sequence culling server. Bioinformatics 19, 1589–1591 (2003)
9. Fox, N. K., Brenner, S. E. & Chandonia, J. M. SCOPe: Structural Classification of Proteins - Extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 42, D304–D309 (2014)
10. Sillitoe, I. et al. CATH: Comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res. 43, D376–D381 (2015)
11. Šali, A. & Blundell, T. L. Definition of general topological equivalence in protein structures. A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol. 212, 403–428 (1990)
12. Šali, A. & Blundell, T. L. Comparative Protein Modelling by Satisfaction of Spatial Restraints. J. Mol. Biol. 234, 779–815 (1993)
13. Melo, F., Sánchez, R. & Sali, A. Statistical potentials for fold assessment. Protein Sci. 11, 430–448 (2002)
14. Melo, F. Fold assessment for comparative protein structure modeling. Protein Sci. 16, 2412–2426 (2007)
15. Skwark, M. J. & Elofsson, A. PconsD: Ultra rapid, accurate model quality assessment for protein structure prediction. Bioinformatics 29, 1817–1818 (2013)
16. Chen, V. B. et al. MolProbity: All-atom structure validation for macromolecular crystallography. Acta Crystallogr. Sect. D Biol. Crystallogr. 66, 12–21 (2010)
17. Zhou, H. & Skolnick, J. GOAP: A generalized orientation-dependent, all-atom statistical potential for protein structure prediction. Biophys. J. 101, 2043–2052 (2011)
18. Dong, G. Q. et al. Optimized atomic statistical potentials: Assessment of protein interfaces and loops. Bioinformatics 29, 3158–3166 (2013)