One of the basic issues that arises in functional genomics is how to predict the subcellular location of proteins that are deduced from gene and genome sequencing. In particular, one would like to be able to readily specify which proteins are soluble and which are inserted in a membrane. Traditional methods of distinguishing between these two locations have relied on extensive, time-consuming biochemical studies. The alternative approach has been to make inferences based on a visual search of the amino acid sequences of presumed gene products for stretches of hydrophobic amino acids. This numerical, sequence-based approach is usually seen as a first approximation pending more reliable biochemical data. The recent availability of large and complete sequence data sets for several organisms allows us to determine just how accurate such a numerical approach could be, and to attempt to minimize and quantify the error involved. We have optimized a statistical approach to protein location determination. Using our approach, we have determined that surprisingly few proteins are misallocated using the numerical method. We also examine the biological implications of the success of this technique.
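
As an illustration of what a numerical search for hydrophobic stretches can look like, the following is a minimal sketch and not the optimized statistical approach described above. It assumes the Kyte-Doolittle hydropathy scale, a 19-residue sliding window, and a mean-hydropathy cutoff of 1.6; the function names `window_hydropathy` and `looks_membrane` are hypothetical and chosen only for this example.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, window=19):
    """Return the mean hydropathy of every full window along the sequence."""
    seq = seq.upper()
    if len(seq) < window:
        return []
    # Unknown residue codes (e.g. X) are scored as neutral: an assumption.
    scores = [KD.get(aa, 0.0) for aa in seq]
    total = sum(scores[:window])
    means = [total / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]
        means.append(total / window)
    return means


def looks_membrane(seq, window=19, threshold=1.6):
    """Flag a sequence if any window reaches the assumed hydropathy cutoff."""
    return any(m >= threshold for m in window_hydropathy(seq, window))


if __name__ == "__main__":
    print(looks_membrane("L" * 25))     # long hydrophobic stretch
    print(looks_membrane("DEKR" * 10))  # charged, hydrophilic sequence
```

The window length and cutoff control the trade-off between missed membrane proteins and soluble proteins flagged in error, which is the kind of misallocation the abstract sets out to quantify.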