SpliceDB: database of canonical and non-canonical mammalian splice sites

Citation
M. Burset et al., SpliceDB: database of canonical and non-canonical mammalian splice sites, NUCL ACID R, 29(1), 2001, pp. 255-259
Number of citations
7
Subject Categories
Biochemistry & Biophysics
Journal title
NUCLEIC ACIDS RESEARCH
Journal ISSN
0305-1048
Volume
29
Issue
1
Year of publication
2001
Pages
255 - 259
Database
ISI
SICI code
0305-1048(20010101)29:1<255:SDOCAN>2.0.ZU;2-M
Abstract
A database (SpliceDB) of known mammalian splice site sequences has been developed. We extracted 43 337 splice pairs from mammalian divisions of the gene-centered Infogene database, including sites from incomplete or alternatively spliced genes. Known EST sequences supported 22 815 of them. After discarding sequences with putative errors and ambiguous location of splice junctions, the verified dataset includes 22 489 entries. Of these, 98.71% contain canonical GT-AG junctions (22 199 entries) and 0.56% have non-canonical GC-AG splice site pairs. The remainder (0.73%) is spread across many small groups (with a maximum size of 0.05%). We especially studied non-canonical splice sites, which comprise 3.73% of GenBank annotated splice pairs. EST alignments allowed us to verify only the exonic part of splice sites. To check the conserved dinucleotides, we compared sequences of human non-canonical splice sites with sequences from the high throughput genome sequencing project (HTG). Out of 171 human non-canonical and EST-supported splice pairs, 156 (91.23%) had a clear match in the human HTG. After sequence analysis they can be classified as: 79 GC-AG pairs (of which one was an error that was corrected to GC-AG), 61 errors corrected to GT-AG canonical pairs, six AT-AC pairs (of which two were errors corrected to AT-AC), one case produced from a non-existent intron, seven cases found in HTG that had been deposited to GenBank, and finally only two other cases of supported non-canonical splice pairs. The information about verified splice site sequences for canonical and non-canonical sites is presented in SpliceDB with the supporting evidence. We also built weight matrices for the major splice groups, which can be incorporated into gene prediction programs. SpliceDB is available at the computational genomic Web server of the Sanger Centre: http://genomic.sanger.ac.uk/spldb/SpliceDB.html and at http://www.softberry.com/spldb/SpliceDB.html.
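The grouping of splice pairs by their terminal intron dinucleotides (GT-AG, GC-AG, AT-AC) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names and the toy intron sequences are hypothetical, and the sketch assumes each intron is given 5' to 3' as a plain string. The last lines only recompute the percentages and the HTG breakdown quoted in the abstract.

```python
from collections import Counter


def classify_splice_pair(intron: str) -> str:
    """Return the donor-acceptor dinucleotide pair of an intron, e.g. 'GT-AG'.

    Assumes the intron sequence is given 5' to 3' on the coding strand.
    """
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron must be at least 4 nt long")
    return f"{seq[:2]}-{seq[-2:]}"


def pair_percentages(introns):
    """Map each dinucleotide pair to its share (in percent) of all introns."""
    counts = Counter(classify_splice_pair(s) for s in introns)
    total = sum(counts.values())
    return {pair: 100.0 * n / total for pair, n in counts.items()}


# Toy introns made up for illustration; these are NOT SpliceDB entries.
toy_introns = [
    "GTAAGTCCCTTTCAG",  # GT-AG (canonical)
    "GTGAGTAAACCCAG",   # GT-AG (canonical)
    "GCAAGTTTTTCTAG",   # GC-AG
    "ATATCCTTTTCCAC",   # AT-AC
]
percentages = pair_percentages(toy_introns)

# Recompute figures quoted in the abstract from its own counts.
canonical_share = round(100 * 22199 / 22489, 2)  # GT-AG share of verified set
htg_share = round(100 * 156 / 171, 2)            # pairs with a clear HTG match
htg_breakdown = 79 + 61 + 6 + 1 + 7 + 2          # categories of the 156 matches
```

With the toy input, half of the introns fall into the GT-AG group and a quarter each into GC-AG and AT-AC; the recomputed shares agree with the 98.71% and 91.23% stated in the abstract, and the six HTG categories sum to 156.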