Single-nucleotide polymorphisms (SNPs) are the most abundant form of human
genetic variation and a resource for mapping complex genetic traits(1).
The large volume of data produced by high-throughput sequencing projects
is a rich and largely untapped source of SNPs (refs 2-5). We present here
a unified approach to the discovery of variations in genetic sequence data
of arbitrary DNA sources. We propose to use the rapidly emerging genomic
sequence(6,7) as a template on which to layer often unmapped, fragmentary
sequence data(8-11) and to use base quality values(12) to discern true
allelic variations from sequencing errors. By taking advantage of the
genomic sequence we
are able to use simpler yet more accurate methods for sequence
organization: fragment clustering, paralogue identification and multiple
alignment. We
analyse these sequences with a novel Bayesian inference engine, POLYBAYES,
to calculate the probability that a given site is polymorphic. Rigorous
treatment of base quality permits completely automated evaluation of the
full length of all sequences, without limitations on alignment depth. We
demonstrate this approach by accurate SNP predictions in human ESTs
aligned to finished and working-draft quality genomic sequences, a data
set representative of the typical challenges of sequence-based SNP
discovery.
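
To make the role of base quality values concrete, the following is a minimal
sketch of a Bayesian polymorphism test at a single alignment column. It is a
toy model and not the POLYBAYES algorithm: it assumes Phred-scaled qualities
(error probability 10^(-Q/10)), errors spread evenly over the three other
bases, a uniform prior over all non-uniform base assignments, and an
illustrative prior polymorphism rate of 0.003. The function names are
hypothetical.

```python
from typing import Sequence

BASES = "ACGT"


def phred_to_error(q: float) -> float:
    """Convert a Phred-scaled base quality into an error probability."""
    return 10.0 ** (-q / 10.0)


def site_polymorphism_posterior(
    bases: Sequence[str],
    quals: Sequence[float],
    prior_poly: float = 0.003,  # illustrative assumption, not from the text
) -> float:
    """Posterior probability that one alignment column is polymorphic.

    Toy model (an assumption of this sketch): each read samples one true
    base; the site is monomorphic if all true bases agree, polymorphic
    otherwise, with a uniform prior inside each hypothesis.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    n = len(bases)
    if n < 2:
        raise ValueError("at least two aligned reads are required")
    if any(b not in BASES for b in bases):
        raise ValueError("observed bases must be one of A, C, G, T")

    errors = [phred_to_error(q) for q in quals]

    # s = sum over the four "all reads share true base b" assignments of
    # the probability of the observed column.
    s = 0.0
    for true_base in BASES:
        like = 1.0
        for obs, e in zip(bases, errors):
            like *= (1.0 - e) if obs == true_base else e / 3.0
        s += like

    # Summed over every possible assignment of true bases the likelihood
    # is exactly 1, so the non-uniform assignments contribute 1 - s.
    like_mono = s / 4.0
    like_poly = max(0.0, 1.0 - s) / (4 ** n - 4)

    num = prior_poly * like_poly
    return num / (num + (1.0 - prior_poly) * like_mono)
```

Under these assumptions, a mismatching base with a low quality value yields
a much smaller posterior than the same mismatch with a high quality value,
which is the behaviour the quality-aware approach relies on.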