A statistical basis for testing the significance of mass spectrometric protein identification results

Citation
J. Eriksson et al., A statistical basis for testing the significance of mass spectrometric protein identification results, ANALYT CHEM, 72(5), 2000, pp. 999-1005
Number of citations
33
Subject categories
Chemistry & Analysis; Spectroscopy/Instrumentation/Analytical Sciences
Journal title
ANALYTICAL CHEMISTRY
Journal ISSN
0003-2700
Volume
72
Issue
5
Year of publication
2000
Pages
999 - 1005
Database
ISI
SICI code
0003-2700(20000301)72:5<999:ASBFTT>2.0.ZU;2-N
Abstract
A method for testing the significance of mass spectrometric (MS) protein identification results is presented. MS proteolytic peptide mapping and genome database searching provide a rapid, sensitive, and potentially accurate means for identifying proteins. Database search algorithms detect matches between the proteolytic peptide masses in an MS peptide map and the theoretical proteolytic peptide masses of the proteins in a genome database. The number of matching masses is used to compute a score, S, for each protein, and the protein that yields the best score is taken as the identification result. There is a risk of obtaining a false result, because masses determined by MS are not unique; i.e., each mass in a peptide map can randomly match one or several proteins in a genome database. A false result is obtained when the score, S, due to random matching cannot be discerned from the score due to matching with a real protein in the sample. We therefore introduce the frequency function, f(S), for false (random) identification results as a basis for testing at what significance level, α, one can reject a null hypothesis, H_0: "the result is false". The significance is tested by comparing an experimental score, S_E, with a critical score, S_C, required for a significant result at the level α. If S_E ≥ S_C, H_0 is rejected. f(S) and S_C were obtained by simulations utilizing random tryptic peptide maps generated from a genome database. The critical score, S_C, was studied as a function of the number of masses in the peptide map, the mass accuracy, the degree of incomplete enzymatic cleavage, the protein mass range, and the size of the genome. With S_C known for a variety of experimental constraints, significance testing can be fully automated and integrated with database searching software used for protein identification.
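The test described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the score S is taken to be simply the number of matching masses (the abstract only says the score is computed from that number), the "genome database" is a toy list of random mass lists, and a random peptide map is drawn by sampling masses from the whole database. All function names (count_matches, best_score, simulate_null_scores, critical_score, is_significant) are hypothetical.

```python
import random


def count_matches(peptide_map, protein_masses, tol):
    """Assumed score S: number of map masses within tol of any theoretical mass."""
    return sum(any(abs(m - t) <= tol for t in protein_masses) for m in peptide_map)


def best_score(peptide_map, database, tol):
    """Best score over all proteins; the top-scoring protein is the reported result."""
    return max(count_matches(peptide_map, protein, tol) for protein in database)


def simulate_null_scores(database, n_masses, tol, n_trials, rng):
    """Sample f(S): best scores obtained from random peptide maps (false results)."""
    pool = [m for protein in database for m in protein]
    return [
        best_score(rng.sample(pool, n_masses), database, tol)
        for _ in range(n_trials)
    ]


def critical_score(null_scores, alpha):
    """Smallest integer S_C whose empirical tail fraction P(S >= S_C | H_0) is <= alpha."""
    n = len(null_scores)
    for s in sorted(set(null_scores)):
        tail = sum(1 for x in null_scores if x >= s) / n
        if tail <= alpha:
            return s
    # No simulated score is rare enough; require more than the largest one seen.
    return max(null_scores) + 1


def is_significant(s_e, s_c):
    """Reject H_0 ("the result is false") when S_E >= S_C."""
    return s_e >= s_c


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy database (assumption): 50 proteins with 20 peptide masses each, in Da.
    database = [[rng.uniform(800.0, 3000.0) for _ in range(20)] for _ in range(50)]
    null_scores = simulate_null_scores(database, n_masses=10, tol=0.5,
                                       n_trials=200, rng=rng)
    s_c = critical_score(null_scores, alpha=0.05)
    print("critical score S_C:", s_c)
    print("S_E = 8 significant?", is_significant(8, s_c))
```

In this sketch S_C depends on the map size (n_masses), the mass tolerance (tol), and the database size, which mirrors the dependence on experimental constraints that the abstract reports; incomplete cleavage and protein mass range are not modeled here.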