M. Defernez and E. K. Kemsley, Avoiding overfitting in the analysis of high-dimensional data with artificial neural networks (ANNs), Analyst, 124(11), 1999, pp. 1675-1681
Complex data analysis is becoming more easily accessible to analytical
chemists, including natural computation methods such as artificial neural
networks (ANNs). Unfortunately, in many of these methods, inappropriate
choices of model parameters can lead to overfitting. This study concerns
overfitting issues in the use of ANNs to classify complex, high-dimensional
data (where the number of variables far exceeds the number of specimens). We
examine whether a parameter rho, equal to the ratio of the number of
observations in the training set to the number of connections in the network,
can be used as an indicator to forecast overfitting. Networks possessing
different rho values were trained using as inputs either raw data or scores
obtained from principal component analysis (PCA). A primary finding was that
different data sets behave very differently. For data sets with either
abundant or scant information related to the proposed group structure,
overfitting was little influenced by rho, whereas for intermediate cases some
dependence was found, although it was not possible to specify values of rho
which prevented overfitting altogether. The use of a tuning set, to control
termination of training and guard against overtraining, did not necessarily
prevent overfitting from taking place. However, for data containing scant
group-related information, the use of a tuning set reduced the likelihood and
magnitude of overfitting, although not eliminating it entirely. For other
data sets, little difference in the nature of overfitting arose from the two
modes of termination. Small data sets (in terms of number of specimens) were
more likely to produce overfit ANNs, as were input layers comprising large
numbers of PC scores. Hence, for high-dimensional data, the use of a limited
number of PC scores as inputs, a tuning set to prevent overtraining and a
test set to detect and guard against overfitting are recommended.
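
The parameter rho defined above can be illustrated with a short calculation. The following is a minimal sketch, not the authors' code: it assumes a fully connected feed-forward network described by its layer sizes, and it treats the counting of bias weights as an optional choice, since the abstract does not say whether biases count as connections. The function names, the layer sizes and the specimen count are hypothetical values chosen only for illustration.

```python
def count_connections(layer_sizes, include_bias=False):
    """Count the weights in a fully connected feed-forward network.

    layer_sizes lists the number of units per layer, input first.
    include_bias is an assumption of this sketch: the abstract does not
    state whether bias weights are counted as connections.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out      # one weight per pair of units
        if include_bias:
            total += n_out         # one bias weight per receiving unit
    return total


def rho(n_train, layer_sizes, include_bias=False):
    """Ratio of training observations to network connections."""
    return n_train / count_connections(layer_sizes, include_bias)


# Hypothetical example: 60 training specimens, 4 hidden units, 3 classes.
rho_raw = rho(60, [500, 4, 3])   # 500 raw variables as inputs
rho_pcs = rho(60, [5, 4, 3])     # 5 PC scores as inputs

print(f"rho with raw inputs: {rho_raw:.4f}")
print(f"rho with PC scores:  {rho_pcs:.4f}")
```

Under these assumed sizes, replacing 500 raw variables with 5 PC scores cuts the number of connections from 2012 to 32, which raises rho from roughly 0.03 to 1.875. As the abstract notes, a higher rho did not by itself guarantee freedom from overfitting, so a value like this is only a rough indicator and does not replace a test set.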