R.T. Miller et al., A comprehensive approach to clustering of expressed human gene sequence: The sequence tag alignment and consensus knowledge base, GENOME RES, 9(11), 1999, pp. 1143-1155
The expressed human genome is being sequenced and analyzed by disparate
groups producing disparate data. The majority of the identified coding portion
is in the form of expressed sequence tags (ESTs). The need to discover
exonic representation and expression forms of full-length cDNAs for each human
gene is frustrated by the partial and variable quality nature of this data
delivery. A highly redundant human EST data set has been processed into
integrated and unified expressed transcript indices that consist of
hierarchically organized human transcript consensi reflecting gene expression forms
and genetic polymorphism within an index class. The expression index and
its intermediate outputs include cleaned transcript sequence, expression, and
alignment information and a higher fidelity subset, SANIGENE. The
STACK_PACK clustering system has been applied to dbEST release 121598
(GenBank version 110). Sixty-four percent of 1,313,103 Homo sapiens ESTs are
condensed into 143,885 tissue level multiple sequence clusters; linking
through clone-ID annotations produces 68,701 total assemblies, such that 81%
of the original input set is captured in a STACK multiple sequence or linked
cluster. Indexing of alignments by substituent EST accession allows browsing
of the data structure and its cross-links to UniGene. STACK metaclusters
consolidate a greater number of ESTs by a factor of 1.86 with respect to the
corresponding UniGene build. Fidelity comparison with genome reference
sequence AC004106 demonstrates consensus expression clusters that reflect significantly
lower spurious repeat sequence content and capture alternate splicing
within a whole body index cluster and three STACK v.2.3 tissue-level clusters.
Statistics of a staggered release whole body index build of STACK v.2.0 are
presented.
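
The abstract states that tissue-level clusters are joined into larger assemblies by linking through clone-ID annotations, but it does not describe how STACK_PACK implements that step. The sketch below illustrates the general idea only: it uses a plain union-find structure, chosen here for illustration and not taken from the paper, to group clusters that share at least one clone ID. The function name, the input layout (a dict from cluster id to a set of clone IDs), and the example identifiers are all hypothetical.

```python
from collections import defaultdict


def link_clusters_by_clone(cluster_clones):
    """Group cluster ids that share at least one clone ID.

    cluster_clones: dict mapping a cluster id to the set of clone IDs
    annotated on its member ESTs (hypothetical input layout).
    Returns a sorted list of sorted lists of cluster ids, one per group.
    """
    parent = {cid: cid for cid in cluster_clones}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # The first cluster seen for each clone ID acts as the anchor; every
    # later cluster carrying the same clone ID is merged into its group.
    first_seen = {}
    for cid, clones in cluster_clones.items():
        for clone in clones:
            if clone in first_seen:
                union(first_seen[clone], cid)
            else:
                first_seen[clone] = cid

    groups = defaultdict(list)
    for cid in cluster_clones:
        groups[find(cid)].append(cid)
    return sorted(sorted(members) for members in groups.values())


# Made-up identifiers, not data from STACK or dbEST.
example = {
    "c1": {"cloneA"},
    "c2": {"cloneA", "cloneB"},
    "c3": {"cloneB"},
    "c4": {"cloneC"},
    "c5": set(),  # no clone annotation, so it stays unlinked
}
linked = link_clusters_by_clone(example)
print(linked)
```

In this toy input, c1, c2 and c3 end up in one linked group because they are chained through cloneA and cloneB, while c4 and c5 remain on their own. A real pipeline would presumably also have to handle missing or erroneous clone annotations, which this sketch ignores.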