A comprehensive approach to clustering of expressed human gene sequence: The sequence tag alignment and consensus knowledge base

Citation
R.T. Miller et al., A comprehensive approach to clustering of expressed human gene sequence: The sequence tag alignment and consensus knowledge base, GENOME RES, 9(11), 1999, pp. 1143-1155
Citations number
32
Subject Categories
Molecular Biology & Genetics
Journal title
GENOME RESEARCH
ISSN journal
1088-9051
Volume
9
Issue
11
Year of publication
1999
Pages
1143 - 1155
Database
ISI
SICI code
1088-9051(199911)9:11<1143:ACATCO>2.0.ZU;2-Z
Abstract
The expressed human genome is being sequenced and analyzed by disparate groups producing disparate data. The majority of the identified coding portion is in the form of expressed sequence tags (ESTs). The need to discover exonic representation and expression forms of full-length cDNAs for each human gene is frustrated by the partial and variable-quality nature of this data delivery. A highly redundant human EST data set has been processed into integrated and unified expressed transcript indices that consist of hierarchically organized human transcript consensi reflecting gene expression forms and genetic polymorphism within an index class. The expression index and its intermediate outputs include cleaned transcript sequence, expression, and alignment information and a higher fidelity subset, SANIGENE. The STACK_PACK clustering system has been applied to dbEST release 121598 (GenBank version 110). Sixty-four percent of 1,313,103 Homo sapiens ESTs are condensed into 143,885 tissue-level multiple sequence clusters; linking through clone-ID annotations produces 68,701 total assemblies, such that 81% of the original input set is captured in a STACK multiple sequence or linked cluster. Indexing of alignments by substituent EST accession allows browsing of the data structure and its cross-links to UniGene. STACK metaclusters consolidate a greater number of ESTs by a factor of 1.86 with respect to the corresponding UniGene build. Fidelity comparison with genome reference sequence AC004106 demonstrates consensus expression clusters that reflect significantly lower spurious repeat sequence content and capture alternate splicing within a whole body index cluster and three STACK v.2.3 tissue-level clusters. Statistics of a staggered release whole body index build of STACK v.2.0 are presented.
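Illustrative note
The abstract mentions that sequence clusters are linked through clone-ID annotations into larger assemblies. The Python sketch below shows one generic way such linking could work, using a union-find structure to merge any clusters whose member ESTs share a clone ID. It is not the STACK_PACK implementation; the function name, the dictionary layout, and the accession and clone identifiers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def link_clusters_by_clone(cluster_members, est_clone):
    """Group clusters that share at least one clone ID (hypothetical helper).

    cluster_members: dict mapping cluster ID -> list of EST accessions
    est_clone: dict mapping EST accession -> clone ID (ESTs may be absent)
    Returns a sorted list of sorted lists of cluster IDs.
    """
    # Each cluster starts as its own group.
    parent = {c: c for c in cluster_members}

    def find(c):
        # Follow parent links to the group representative, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller ID as representative for deterministic output.
            parent[max(ra, rb)] = min(ra, rb)

    # Remember the first cluster seen for each clone ID, and merge later
    # clusters carrying the same clone ID into it.
    first_cluster_for_clone = {}
    for cluster_id, ests in cluster_members.items():
        for est in ests:
            clone = est_clone.get(est)
            if clone is None:
                continue  # EST has no clone annotation; it links nothing.
            if clone in first_cluster_for_clone:
                union(cluster_id, first_cluster_for_clone[clone])
            else:
                first_cluster_for_clone[clone] = cluster_id

    groups = defaultdict(list)
    for c in cluster_members:
        groups[find(c)].append(c)
    return sorted(sorted(g) for g in groups.values())


# Made-up identifiers, for illustration only.
example_clusters = {"c1": ["est1", "est2"], "c2": ["est3"], "c3": ["est4"]}
example_clones = {"est1": "cloneA", "est3": "cloneA", "est4": "cloneB"}
example_links = link_clusters_by_clone(example_clusters, example_clones)
```

In this toy input, clusters c1 and c2 each contain an EST annotated with the same clone, so they fall into one linked group, while c3 remains on its own.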