Although the sequencing of the human genome is complete, identification of
encoded genes and determination of their structures remain a major
challenge. In this report, we introduce a method that effectively uses full-length
mouse cDNAs to complement efforts in carrying out these difficult tasks. A
total of 61,227 RIKEN mouse cDNAs (21,076 full-length and 40,151 EST
sequences containing certain redundancies) were aligned with the draft human
sequences. We found 35,141 non-redundant genomic regions that showed a
significant alignment with the mouse cDNAs. We analyzed the structures and
compositional properties of the regions detected by the full-length cDNAs,
including cross-species comparisons, and noted a systematic bias of GENSCAN against
exons of small size and/or low GC content. Of the cDNAs that located the
35,141 genomic regions, 3,217 did not match any sequence of known human
genes or ESTs. Among those 3,217 cDNAs, 1,141 showed no significant
similarity to any protein sequence in the GenBank non-redundant protein
database and thus are candidates for novel genes.