S. Saxonov et al., EID: the Exon-Intron Database - an exhaustive database of protein-coding intron-containing genes, NUCL ACID R, 28(1), 2000, pp. 185-190
To aid studies of molecular evolution and to assist in gene prediction
research, we have constructed an Exon-Intron Database (EID) in FASTA format.
Currently, the database is derived from GenBank release 112, and it contains
51 289 protein-coding genes (287 209 exons) that harbor introns, along with
extensive descriptions of each gene and its DNA and protein sequences, as
well as splice motif information. There is 17% redundancy inherited from
GenBank; a purge at the 99% identity level reduced the database to 42 460
genes (243 589 exons). We have created subdatabases of genes whose intron
positions have been experimentally determined. One such database,
constructed by comparing genomic and mRNA sequences, contains 11 242 genes
(62 474 exons). A larger database of 22 196 genes (105 595 exons) was
constructed by selecting on keywords to eliminate computer-predicted genes.
By examining the two nucleotides adjacent to the intron boundary, we infer
that there is a 2% rate of errors or other deviations from the standard
GT...AG motif in nuclear genes. This criterion can be used to eliminate 4921
genes from the overall database. Various tools are provided to enable
generation of user-specific subsets of the EID. The EID distribution can be
obtained from http://mcb.harvard.edu/gilbert/EID.
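
The splice-motif criterion described in the abstract (checking the two nucleotides at each end of an intron against the standard GT...AG pattern) can be illustrated with a minimal Python sketch. This is not the authors' tooling and does not parse the EID file format: the function names, the 0-based half-open intron coordinates, and the (sequence, intron list) input shape are all assumptions made for illustration.

```python
def intron_has_standard_motif(genomic_seq, intron_start, intron_end):
    """Return True if the intron starts with GT and ends with AG.

    Assumption: the intron is genomic_seq[intron_start:intron_end],
    i.e. 0-based, half-open coordinates on the coding strand.
    """
    intron = genomic_seq[intron_start:intron_end].upper()
    if len(intron) < 4:
        # Too short to hold both a donor and an acceptor dinucleotide.
        return False
    return intron[:2] == "GT" and intron[-2:] == "AG"


def nonstandard_fraction(genes):
    """Fraction of genes with at least one intron that is not GT...AG.

    Assumption: `genes` is an iterable of (sequence, introns) pairs,
    where `introns` is a list of (start, end) tuples.
    """
    genes = list(genes)
    if not genes:
        return 0.0
    flagged = sum(
        1
        for seq, introns in genes
        if not all(intron_has_standard_motif(seq, s, e) for s, e in introns)
    )
    return flagged / len(genes)
```

A gene would be dropped under this criterion when any of its introns fails the check; whether the 2% figure quoted above is counted per intron or per gene is not stated in the abstract, so the per-gene fraction here is only one possible reading.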