A set of 43 337 splice junction pairs was extracted from mammalian GenBank
annotated genes. Expressed sequence tag (EST) sequences support 22 489 of
them. Of these, 98.71% contain canonical dinucleotides: GT and AG for donor
and acceptor sites, respectively; 0.56% hold non-canonical GC-AG splice
site pairs; and the remaining 0.73% occur in many small groups (with a
maximum size of 0.05%). Studying these groups, we observe that many of them
contain splicing dinucleotides shifted from the annotated splice junction by
one position. After close examination of such cases, we present a new
classification consisting of only eight observed types of splice site
pairs (out
of 256 a priori possible combinations). EST alignments allow us to verify
the exonic part of the splice sites, but many non-canonical cases may be due
to intron sequencing errors. This idea is given substantial support when
we compare the sequences of human genes having non-canonical splice sites
with the corresponding sequences deposited in GenBank by high-throughput
genome sequencing (HTG) projects. A high proportion (156 out of 171) of the
human non-canonical and EST-supported splice-site sequences had a clear
match in the human HTG. After correction, they can be classified as: 79
GC-AG pairs (of which one was an error that corrected to GC-AG); 61 errors
that were corrected to canonical GT-AG pairs; six AT-AC pairs (of which two
were errors that corrected to AT-AC); one case produced from a non-existent
intron; seven cases found in HTG that were deposited to GenBank; and,
finally, only two remaining cases of supported non-canonical splice sites.
If we assume that approximately the
same situation is true for the whole set of annotated mammalian
non-canonical splice sites, then 99.24% of splice site pairs should be
GT-AG, 0.69% GC-AG, 0.05% AT-AC and finally only 0.02% could consist of
other types
of non-canonical splice sites. We analyze several characteristics of
EST-verified splice sites and build weight matrices for the major groups,
which can be incorporated into gene prediction programs. We also present a
set of
EST-verified canonical splice sites larger by two orders of magnitude than
the current one (22 199 entries versus approximately 600) and, finally, a
set of 290 EST-supported non-canonical splice sites. Both sets should be
significant for future investigations of the splicing mechanism.
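
The extrapolated percentages quoted above (99.24%, 0.69%, 0.05% and 0.02%)
can be reproduced with simple arithmetic. The sketch below is one possible
reading, not a method stated in the text. It assumes that the eight
HTG-matched cases that fall outside the four splice-site classes (the one
non-existent intron and the seven cases found in HTG) are left out of the
denominator, and that the resulting proportions are applied to all 290
EST-supported non-canonical pairs. The function and variable names are
hypothetical.

```python
# Counts taken from the text.
EST_SUPPORTED = 22489   # EST-supported splice junction pairs
CANONICAL = 22199       # EST-verified GT-AG pairs (98.71%)
NON_CANONICAL = 290     # EST-supported non-canonical pairs

# Outcome of the HTG comparison for the matched human cases.
# Assumption: the 8 cases outside these classes (1 non-existent intron,
# 7 found in HTG) are excluded, leaving 148 of the 156 matched cases.
HTG_OUTCOME = {"GT-AG": 61, "GC-AG": 79, "AT-AC": 6, "other": 2}


def extrapolate_shares():
    """Return the estimated percentage of each splice-site class."""
    classified = sum(HTG_OUTCOME.values())  # 148
    non_canonical_share = NON_CANONICAL / EST_SUPPORTED
    # Redistribute the non-canonical share according to the HTG outcome.
    shares = {
        site: count / classified * non_canonical_share
        for site, count in HTG_OUTCOME.items()
    }
    # Sequencing errors corrected to GT-AG add to the canonical share.
    shares["GT-AG"] += CANONICAL / EST_SUPPORTED
    return {site: round(100 * share, 2) for site, share in shares.items()}


if __name__ == "__main__":
    for site, percent in extrapolate_shares().items():
        print(f"{site}: {percent}%")
```

Under these assumptions the rounded results agree with the figures in the
text; a different treatment of the eight excluded cases would shift the
estimates slightly.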