Motivation: In cDNA sequencing projects, it is vital to know whether the protein coding region of a sequence is complete, or whether errors have occurred during library construction. Here we present a linear discriminant approach that predicts this completeness by estimating the probability of each ATG being the initiation codon.

Results: Because of the current shortage of full-length cDNA data on which to base this work, tests were performed on a non-redundant set of 660 initiation codon-containing DNA sequences that had been conceptually spliced into mRNA/cDNA. As a negative control, we also used an edited set of the same sequences that contained only the region following the initiation codon. Using the criterion that only a single prediction is allowed for each sequence, a cut-off was selected at which discrimination of the positive and negative sets was equal. At this cut-off, 67% of each set could be correctly distinguished, with the correct ATG codon also being identified in the positive set. Reliability could be increased further by raising the cut-off or by including homologues; the relative merits of the two options are discussed.
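
The decision rule described above (score every ATG, allow a single prediction per sequence, and accept it only if it reaches the cut-off) can be illustrated with a minimal sketch. The features, weights and cut-off value below are hypothetical placeholders chosen for illustration; they are not the discriminant variables or trained coefficients used in this work, and the score is a plain linear sum rather than a calibrated probability.

```python
import re

# Hypothetical weights and cut-off for illustration only; NOT from the paper.
WEIGHTS = {
    "purine_at_minus3": 1.0,     # A/G three bases upstream of the ATG
    "g_at_plus4": 0.5,           # G immediately after the ATG
    "upstream_atg_count": -0.4,  # penalty for each ATG further upstream
    "orf_length_codons": 0.01,   # reward for a longer open reading frame
}
CUTOFF = 1.5
STOP_CODONS = {"TAA", "TAG", "TGA"}


def atg_positions(seq):
    """Return the 0-based start index of every ATG in seq."""
    return [m.start() for m in re.finditer("ATG", seq)]


def orf_length_codons(seq, start):
    """Count in-frame codons from start up to (not including) the first stop."""
    n = 0
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            break
        n += 1
    return n


def score_atg(seq, pos):
    """Linear discriminant score: weighted sum of simple context features."""
    features = {
        "purine_at_minus3": 1.0 if pos >= 3 and seq[pos - 3] in "AG" else 0.0,
        "g_at_plus4": 1.0 if pos + 3 < len(seq) and seq[pos + 3] == "G" else 0.0,
        "upstream_atg_count": float(seq.count("ATG", 0, pos)),
        "orf_length_codons": float(orf_length_codons(seq, pos)),
    }
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


def predict_start(seq, cutoff=CUTOFF):
    """Return the position of the single best-scoring ATG, or None.

    None means no ATG reached the cut-off, i.e. the sequence is predicted
    to lack its initiation codon (incomplete coding region).
    """
    seq = seq.upper()
    candidates = [(score_atg(seq, p), p) for p in atg_positions(seq)]
    if not candidates:
        return None
    best_score, best_pos = max(candidates)
    return best_pos if best_score >= cutoff else None
```

In this sketch, raising `cutoff` trades sensitivity on complete sequences for fewer false predictions on truncated ones, which mirrors the reliability trade-off mentioned above.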