Using structure recurrence to define protein domains
Abstract
Domains are basic units of protein structure and are essential for exploring protein fold space and structure evolution. With the NIH Protein Structure Initiative and other structural genomics initiatives worldwide, the number of protein structures in the PDB is increasing dramatically, and domain parsing needs to be done automatically. Most existing structural domain parsing programs consider the compactness of the domains and/or the number and strength of internal (intra-domain) versus external (inter-domain) contacts. Here we present a completely different approach. Taking advantage of the growing number of known structures in the PDB, we parse chains solely by the recurrence of similar structures in the structural database. A non-redundant set of 6373 protein chains was selected as the target data set, and 128 benchmark chains from pDomains were used as query chains. For each query chain, one-against-all structure comparisons with the target set were performed using VAST. The VAST cliques were then collected, and the protein residues were clustered using mathematical procedures akin to those used for analyzing microarray data. These clusters define the domains. NDO scores were used to compare the results with SCOP and CATH domain boundaries as well as with those from other parsing programs. Our algorithm gave results comparable to those of several existing programs. It handles segmented domains as well as it handles non-segmented domains. The structures that contribute the cliques defining a domain may carry information about the domain's distant evolutionary relationships.
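The clustering step can be illustrated with a minimal sketch. It is not the procedure used in the paper, whose details the abstract does not give. It assumes that each clique is available as a set of query-residue indices covered by one structural alignment (a hypothetical representation), describes each residue by the set of cliques that cover it, and joins residues whose profiles have a Jaccard similarity at or above an arbitrary threshold. The function names and the threshold value are illustrative assumptions.

```python
from itertools import combinations


def residue_profiles(n_residues, cliques):
    """Return, for each residue, the set of clique indices that cover it."""
    profiles = [set() for _ in range(n_residues)]
    for k, clique in enumerate(cliques):
        for r in clique:
            profiles[r].add(k)
    return profiles


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_residues(n_residues, cliques, threshold=0.5):
    """Single-linkage grouping of residues by clique co-occurrence.

    Assumption: a simple stand-in for the clustering described in the
    abstract, not the authors' actual method.
    """
    profiles = residue_profiles(n_residues, cliques)
    parent = list(range(n_residues))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n_residues), 2):
        if jaccard(profiles[i], profiles[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for r in range(n_residues):
        groups.setdefault(find(r), []).append(r)
    return sorted(groups.values())


# Toy input: residues 0-2 and 8-9 recur together, as do residues 3-7.
toy_cliques = [{0, 1, 2, 8, 9}, {0, 1, 2, 9}, {3, 4, 5, 6, 7}, {4, 5, 6, 7}]
print(cluster_residues(10, toy_cliques))
```

In the toy input the first group is discontinuous in sequence (residues 0-2 and 8-9), which shows why grouping by recurrence, without any contiguity constraint, treats segmented and non-segmented domains alike.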
Origin: Publisher files allowed on an open archive