An efficient test for comparing sequence diversity between two populations

Citation
P.B. Gilbert et al., An efficient test for comparing sequence diversity between two populations, J COMPUT BI, 8(2), 2001, pp. 123-139
Number of citations
37
Subject Categories
Biochemistry & Biophysics
Journal title
JOURNAL OF COMPUTATIONAL BIOLOGY
ISSN journal
1066-5277
Volume
8
Issue
2
Year of publication
2001
Pages
123 - 139
Database
ISI
SICI code
1066-5277(2001)8:2<123:AETFCS>2.0.ZU;2-B
Abstract
We address the problem of comparing interindividual genomic sequence diversity between two populations. Although the methods are general, for concreteness we focus on comparing two human immunodeficiency virus (HIV) infected populations. From a viral isolate(s) taken from each individual in a sample of persons from each population, suppose one or multiple measurements are made on the genetic sequence of a coding region of HIV. Given a definition of genetic distance between sequences, the goal is to test if the distribution of interindividual distances differs between populations. If distances between all pairs of sequences within each group are used, then data-dependencies arising from the use of multiple sequences from individuals invalidates the use of a standard two-sample test such as the t-test. Where this problem has been recognized, a typical solution has been to apply a standard test to a reduced dataset comprised of one sequence or a consensus sequence from each patient. Disadvantages of this procedure are that the conclusion of the test depends on the choice of utilized sequences, often an arbitrary decision, and exclusion of replicate sequences from the analysis may needlessly sacrifice statistical power. We present a new test free of these drawbacks, which is based on a statistic that linearly combines all possible standard test statistics calculated from independent sequence subsamples. We describe statistical power advantages of the test and illustrate its use by application to nucleotide sequence distances measured from HIV-1 infected populations in southern Africa (GenBank accession numbers AF110959-AF110981) and North America/Europe. The test makes minimal assumptions, is maximally efficient and objective, and is broadly applicable.
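
Illustrative sketch (not from the paper)
The abstract describes a statistic that combines standard test statistics computed over independent sequence subsamples, but it does not give the weights, the subsampling scheme, or the reference distribution. The Python sketch below is only one possible reading, under stated assumptions: Hamming distance stands in for the genetic distance, a subsample takes one sequence per individual, individuals are paired in a fixed disjoint order so that the distances within a subsample do not share sequences, the per-subsample statistic is a Welch t statistic, and the combination is an equal-weight average. All function names are hypothetical, and no p-value is computed.

```python
from itertools import product
from math import sqrt
from statistics import mean, variance


def hamming(a, b):
    """Count positions at which two equal-length sequences differ (assumed distance)."""
    return sum(x != y for x, y in zip(a, b))


def disjoint_pair_distances(seqs):
    """Distances for individuals paired as (0,1), (2,3), ...

    Each sequence is used in at most one distance, so the distances do not
    share sequences. The fixed pairing order is a simplification.
    """
    return [hamming(seqs[i], seqs[i + 1]) for i in range(0, len(seqs) - 1, 2)]


def welch_t(x, y):
    """Welch two-sample t statistic.

    Raises ZeroDivisionError if both samples have zero variance.
    """
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


def averaged_statistic(group_a, group_b, stat=welch_t):
    """Equal-weight average of `stat` over every one-sequence-per-individual subsample.

    Each group is a list of individuals; each individual is a list of sequences.
    """
    values = []
    for pick_a in product(*group_a):
        for pick_b in product(*group_b):
            values.append(
                stat(disjoint_pair_distances(pick_a), disjoint_pair_distances(pick_b))
            )
    return mean(values)


# Toy data: the first individual in group A contributes two replicate sequences.
group_a = [["AAAA", "AAAT"], ["AAAT"], ["AATT"], ["TTTT"]]
group_b = [["CCCC"], ["CCCC"], ["GGGG"], ["GGGG"]]
result = averaged_statistic(group_a, group_b)
```

Because every replicate sequence enters some subsample, the result does not depend on an arbitrary choice of one sequence per individual, which is the drawback of the reduced-dataset approach that the abstract points out. Enumerating all subsamples grows quickly with the number of replicates, so this form is only practical for small examples.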