A 3D SEQUENCE-INDEPENDENT REPRESENTATION OF THE PROTEIN DATA-BANK

Citation
D. Fischer et al., A 3D SEQUENCE-INDEPENDENT REPRESENTATION OF THE PROTEIN DATA-BANK, Protein engineering, 8(10), 1995, pp. 981-997
Number of citations
66
Subject categories
Biology
Journal title
Protein Engineering
Journal ISSN
02692139
Volume
8
Issue
10
Year of publication
1995
Pages
981 - 997
Database
ISI
SICI code
0269-2139(1995)8:10<981:A3SROT>2.0.ZU;2-H
Abstract
Here we address the following questions: How many structurally different entries are there in the Protein Data Bank (PDB)? How do the proteins populate the structural universe? To investigate these questions, a structurally non-redundant set of representative entries was selected from the PDB. Construction of such a dataset is not trivial: (i) the considerable size of the PDB requires a large number of comparisons (there were more than 3250 structures of protein chains available in May 1994); (ii) the PDB is highly redundant, containing many structurally similar entries, not necessarily with significant sequence homology; and (iii) there is no clear-cut definition of structural similarity. The latter depends on the criteria and methods used. Here, we analyze structural similarity ignoring protein topology. To date, representative sets have been selected either by hand, by sequence comparison techniques which ignore the three-dimensional (3D) structures of the proteins, or by using sequence comparisons followed by linear structural comparison (i.e., the topology, or the sequential order of the chains, is enforced in the structural comparison). Here we describe a 3D sequence-independent, automated and efficient method to obtain a representative set of protein molecules from the PDB which contains all unique structures and which is structurally non-redundant.
The method has two novel features. The first is the use of strictly structural criteria in the selection process without taking into account the sequence information. To this end we employ a fast structural comparison algorithm which requires on average ~2 s per pairwise comparison on a workstation. The second novel feature is the iterative application of a heuristic clustering algorithm that greatly reduces the number of comparisons required. We obtain a representative set of 220 chains with resolution better than 3.0 Å, or 268 chains including lower resolution entries, NMR entries and models. The resulting set can serve as a basis for extensive structural classification and studies of 3D recurring motifs and of sequence-structure relationships. The clustering algorithm succeeds in classifying into the same structural family chains with no significant sequence homology, e.g. all the globins in one single group, all the trypsin-like serine proteases in another, or all the immunoglobulin-like folds in a third. In addition, unexpected structural similarities of interest have been automatically detected between pairs of chains. A cluster analysis of the representative structures demonstrates the way the 'structural universe' is populated.
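
Illustrative note (not part of the published abstract): the figures above show why a comparison-saving heuristic matters. Comparing 3250 chains exhaustively would take 3250 × 3249 / 2 = 5,279,625 pairwise comparisons, roughly 122 days at about 2 s each. The abstract does not specify the clustering heuristic, so the Python sketch below uses a generic greedy scheme as a stand-in: each item is compared only with the current cluster representatives. The names `all_pairs_count`, `greedy_representatives` and `is_similar` are hypothetical, and the numeric toy data merely replaces a real structural comparison.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def all_pairs_count(n: int) -> int:
    """Number of comparisons needed to compare n items exhaustively."""
    return n * (n - 1) // 2


def greedy_representatives(
    items: Sequence[T],
    is_similar: Callable[[T, T], bool],
) -> Dict[int, List[int]]:
    """Greedy clustering against current representatives only.

    Returns a mapping from each representative's index to the indices of
    the items assigned to it. This is an assumed, generic heuristic and
    NOT the algorithm described in the paper.
    """
    clusters: Dict[int, List[int]] = {}
    for i, item in enumerate(items):
        for rep in clusters:
            if is_similar(items[rep], item):
                clusters[rep].append(i)  # joins the first matching cluster
                break
        else:
            clusters[i] = [i]  # no match: item becomes a new representative
    return clusters


if __name__ == "__main__":
    # Toy stand-in for structures: numbers that are "similar" when within 1.0.
    toy = [0.0, 0.5, 5.0, 5.2, 0.9, 10.0]
    calls = 0

    def close(a: float, b: float) -> bool:
        global calls
        calls += 1
        return abs(a - b) <= 1.0

    result = greedy_representatives(toy, close)
    print("clusters:", result)
    print("comparisons used:", calls, "of", all_pairs_count(len(toy)))
    print("exhaustive comparisons for 3250 chains:", all_pairs_count(3250))
```

The outcome of such a greedy scheme depends on the input order and on the similarity threshold, which may be one reason a method like the one described would apply its heuristic iteratively. The sketch makes no attempt to reproduce that iteration.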