A database (SpliceDB) of known mammalian splice site sequences has been
developed. We extracted 43 337 splice pairs from mammalian divisions of the
gene-centered Infogene database, including sites from incomplete or
alternatively spliced genes. Known EST sequences supported 22 815 of them.
After discarding sequences with putative errors and ambiguous location of
splice junctions the verified dataset includes 22 489 entries. Of these,
98.71% contain canonical GT-AG junctions (22 199 entries) and 0.56% have
non-canonical
GC-AG splice site pairs. The remainder (0.73%) is spread across many small
groups, each accounting for at most 0.05%. We paid particular attention to
non-canonical splice sites, which comprise 3.73% of GenBank annotated splice
pairs. EST alignments allowed us to verify only the exonic part of splice
sites. To check
the conserved dinucleotides we compared sequences of human non-canonical
splice sites with sequences from the high throughput genome sequencing
project (HTG). Out of 171 human non-canonical and EST-supported splice pairs,
156 (91.23%) had a clear match in the human HTG. After sequence analysis they
can be classified as follows: 79 GC-AG pairs (one of which was an error
corrected to GC-AG); 61 errors corrected to canonical GT-AG pairs; six AT-AC
pairs (two of which were errors corrected to AT-AC); one case produced from a
non-existent intron; seven cases found in HTG that were deposited to GenBank;
and only two other cases of supported non-canonical splice pairs. The
information about verified splice site sequences for canonical and
non-canonical sites is presented in SpliceDB with the supporting evidence. We
also built weight matrices for the major splice
groups, which can be incorporated into gene prediction programs. SpliceDB is
available at the computational genomic Web server of the Sanger Centre:
http://genomic.sanger.ac.uk/spldb/SpliceDB.html and at
http://www.softberry.com/spldb/SpliceDB.html.
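
The two operations described above, grouping splice pairs by their terminal
intron dinucleotides and building a weight matrix for one group, can be
illustrated with a minimal Python sketch. It is not the procedure used to
build SpliceDB: the function names, the toy intron sequences, the six-base
donor window, the pseudocount and the uniform background frequency of 0.25 are
all assumptions made for illustration, and only the bases A, C, G and T are
handled.

```python
import math
from collections import Counter


def classify_pair(intron: str) -> str:
    """Label an intron by its first and last two bases, e.g. 'GT-AG'."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to have two terminal dinucleotides")
    return f"{seq[:2]}-{seq[-2:]}"


def weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix from equal-length aligned sites.

    The pseudocount and uniform background are illustrative assumptions.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(length):
        counts = Counter(s[pos].upper() for s in sites)
        matrix.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def score(matrix, site):
    """Sum the per-position weights of a candidate site."""
    return sum(column[base] for column, base in zip(matrix, site.upper()))


# Made-up introns, not taken from SpliceDB.
introns = [
    "GTAAGTCCCTTTCAG",
    "GTGAGTAAACCCTAG",
    "GCAAGTTTTTTTCAG",
    "ATATCCTTTTTTAC",
]

# Tally the splice pair groups (GT-AG, GC-AG, AT-AC, ...).
groups = Counter(classify_pair(i) for i in introns)

# Weight matrix over the first six intron bases of the GT-AG group.
donors = [i[:6] for i in introns if classify_pair(i) == "GT-AG"]
pwm = weight_matrix(donors)
```

A gene prediction program could call something like `score(pwm, window)` on
each candidate donor window and keep the windows above a chosen threshold;
the threshold and window definition would have to come from the real data.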