D.B. Holiday et al., PRESS-RELATED STATISTICS - REGRESSION TOOLS FOR CROSS-VALIDATION AND CASE DIAGNOSTICS, Medicine and Science in Sports and Exercise, 27(4), 1995, pp. 612-620
In the health science literature, a common approach to validating a
regression equation is data-splitting, in which a portion of the data is
used to fit the model (the fitting sample) and the remainder (the
validation sample) is used to estimate future performance. The R² and SEE
obtained by predicting the validation sample with the fitting-sample
equation are proper estimates of future performance, tending to correct
for the natural upward bias of the R² and SEE obtained from the fitting
sample alone. Data-splitting has several disadvantages, however. These
include: 1) the difficulty, arbitrariness, and inconvenience of matching
samples; 2) the need to report two sets of statistics to determine
homogeneity; and 3) the lack of equation stability due to diluted sample
size. The PRESS statistic and associated residuals do not require the data
to be split, yield alternative unbiased estimates of R² and SEE, and
provide useful case diagnostics. This procedure is easy to use and widely
available in modern statistical packages, but it is rarely utilized. The
two methods are contrasted here using a simulation from original data for
predicting body density from anthropometric measurements of a group of 117
women. The PRESS approach is particularly appropriate for smaller
datasets; methods of reporting these statistics are recommended.
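
The PRESS idea summarized in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so the definitions below are the standard textbook ones and should be read as assumptions: each PRESS residual is the ordinary residual divided by (1 - leverage), PRESS is the sum of their squares, the PRESS-based R² is 1 - PRESS/SST, and the PRESS-based SEE is taken as sqrt(PRESS/n). The function names and the synthetic data (117 cases, three made-up predictors) are hypothetical and are not the study's data.

```python
import numpy as np


def press_statistics(X, y):
    """PRESS-related statistics for an ordinary least-squares fit with intercept."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.asarray(X, dtype=float).reshape(n, -1)
    A = np.column_stack([np.ones(n), X])           # design matrix with intercept

    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # full-sample OLS coefficients
    resid = y - A @ beta                           # ordinary residuals

    # Leverages are the diagonal of the hat matrix; with a reduced QR
    # factorization A = QR they equal the row sums of squares of Q.
    Q, _ = np.linalg.qr(A)
    leverage = np.sum(Q ** 2, axis=1)

    # Leave-one-out (PRESS) residuals, obtained without refitting the model.
    press_resid = resid / (1.0 - leverage)

    press = float(np.sum(press_resid ** 2))
    sse = float(np.sum(resid ** 2))
    sst = float(np.sum((y - y.mean()) ** 2))
    return {
        "press": press,
        "r2_fit": 1.0 - sse / sst,                 # ordinary, optimistic R^2
        "r2_press": 1.0 - press / sst,             # PRESS-based R^2
        "see_press": float(np.sqrt(press / n)),    # assumed definition of PRESS-based SEE
        "press_residuals": press_resid,
    }


def press_by_refitting(X, y):
    """Brute-force check: refit the model n times, each time leaving one case out."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.asarray(X, dtype=float).reshape(n, -1)
    A = np.column_stack([np.ones(n), X])
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        total += (y[i] - A[i] @ beta) ** 2
    return float(total)


# Synthetic illustration only: 117 cases and three invented predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(117, 3))
y = 1.05 - 0.010 * X[:, 0] - 0.005 * X[:, 1] + rng.normal(scale=0.008, size=117)

stats = press_statistics(X, y)
print("R2 (fit):  ", round(stats["r2_fit"], 4))
print("R2 (PRESS):", round(stats["r2_press"], 4))
print("SEE (PRESS):", round(stats["see_press"], 5))
```

The leverage shortcut gives the same PRESS as refitting the model once per omitted case, which is why no data-splitting is needed. Unusually large entries in `press_residuals` flag cases that the model predicts poorly when they are left out, which is the case-diagnostic use the abstract mentions.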