A comprehensive approach to clustering of expressed human gene sequence: The sequence tag alignment and consensus knowledge base

Authors

Miller, RT Christoffels, AG Gopalakrishnan, C Burke, J Ptitsyn, AA Broveak, TR Hide, WA

Citation

R.T. Miller et al., A comprehensive approach to clustering of expressed human gene sequence: The sequence tag alignment and consensus knowledge base, GENOME RES, 9(11), 1999, pp. 1143-1155

Citations number

Subject categories

Molecular Biology & Genetics

Journal title

GENOME RESEARCH

ISSN journal

1088-9051

Volume

9

Issue

11

Year of publication

1999

Pages

1143 - 1155

Database

ISI

SICI code

1088-9051(199911)9:11<1143:ACATCO>2.0.ZU;2-Z

Abstract

The expressed human genome is being sequenced and analyzed by disparate groups producing disparate data. The majority of the identified coding portion is in the form of expressed sequence tags (ESTs). The need to discover exonic representation and expression forms of full-length cDNAs for each human gene is frustrated by the partial and variable quality nature of this data delivery. A highly redundant human EST data set has been processed into integrated and unified expressed transcript indices that consist of hierarchically organized human transcript consensi reflecting gene expression forms and genetic polymorphism within an index class. The expression index and its intermediate outputs include cleaned transcript sequence, expression, and alignment information and a higher fidelity subset, SANIGENE. The STACK_PACK clustering system has been applied to dbEST release 121598 (GenBank version 110). Sixty-four percent of 1,313,103 Homo sapiens ESTs are condensed into 143,885 tissue level multiple sequence clusters; linking through clone-ID annotations produces 68,701 total assemblies, such that 81% of the original input set is captured in a STACK multiple sequence or linked cluster. Indexing of alignments by substituent EST accession allows browsing of the data structure and its cross-links to UniGene. STACK metaclusters consolidate a greater number of ESTs by a factor of 1.86 with respect to the corresponding UniGene build. Fidelity comparison with genome reference sequence AC004106 demonstrates consensus expression clusters that reflect significantly lower spurious repeat sequence content and capture alternate splicing within a whole body index cluster and three STACK v.2.3 tissue-level clusters. Statistics of a staggered release whole body index build of STACK v.2.0 are presented.