Validation and verification of regression in small data sets

Citation
Ha. Martens and P. Dardenne, Validation and verification of regression in small data sets, CHEM INTELL, 44(1-2), 1998, pp. 99-121
Number of citations
9
Subject categories
Spectroscopy /Instrumentation/Analytical Sciences
Journal title
CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS
Journal ISSN
0169-7439
Volume
44
Issue
1-2
Year of publication
1998
Pages
99 - 121
Database
ISI
SICI code
0169-7439(199812)44:1-2<99:VAVORI>2.0.ZU;2-#
Abstract
Four different methods of using small data sets in multivariate modelling are compared with respect to predictive precision in the long run. The modelling in this case concerns multivariate calibration: ŷ = f(X). The study consists of a Monte Carlo simulation within a large data base of real data; X = NIR reflectance spectra and y = protein percentage, measured in 922 whole maize plant samples. Small data sets (40-120 objects) were repeatedly selected at random from the data base, each time simulating the situation of having only a small set of samples available for estimating, optimizing and assessing the calibration model. The 'true' apparent prediction error was each time controlled in the remaining data base. This was replicated 100 times in order to study the statistical performance of the four different validation methods. In each Monte Carlo replicate, the splitting of the available data set into calibration set and test set was compared to full cross validation. The results demonstrated that removing samples from an already limited set of available samples to an independent VALIDATION TEST SET seriously reduced the predictive performance of the calibrated models, and at the same time gave uncertain, systematically over-optimistic assessment of the models' predictive performance. Full CROSS VALIDATION gave improved predictive performance, and gave only slightly over-optimistic assessment of this predictive performance. Further removal of even more of the available samples for use in an independent VERIFICATION TEST SET gave in-the-long-run correct, although uncertain estimates of the predictive performance of the calibrated models, but this performance level had seriously deteriorated. Alternative verification of the model's predictive performance by the method of CROSS VERIFICATION gave results very similar to those of the cross validation.
These results from real data correspond closely to previous findings for artificially simulated data. It appears that full cross validation is superior to both the use of independent validation test set and independent verification test set. (C) 1998 Elsevier Science B.V. All rights reserved.