Avoiding overfitting in the analysis of high-dimensional data with artificial neural networks (ANNs)

Citation
M. Defernez and E. K. Kemsley, Avoiding overfitting in the analysis of high-dimensional data with artificial neural networks (ANNs), ANALYST, 124(11), 1999, pp. 1675-1681
Number of citations
12
Subject categories
Chemistry & Analysis; Spectroscopy/Instrumentation/Analytical Sciences
Journal title
ANALYST
Journal ISSN
0003-2654
Volume
124
Issue
11
Year of publication
1999
Pages
1675-1681
Database
ISI
SICI code
0003-2654(199911)124:11<1675:AOITAO>2.0.ZU;2-O
Abstract
Complex data analysis is becoming more easily accessible to analytical chemists, including natural computation methods such as artificial neural networks (ANNs). Unfortunately, in many of these methods, inappropriate choices of model parameters can lead to overfitting. This study concerns overfitting issues in the use of ANNs to classify complex, high-dimensional data (where the number of variables far exceeds the number of specimens). We examine whether a parameter rho, equal to the ratio of the number of observations in the training set to the number of connections in the network, can be used as an indicator to forecast overfitting. Networks possessing different rho values were trained using as inputs either raw data or scores obtained from principal component analysis (PCA). A primary finding was that different data sets behave very differently. For data sets with either abundant or scant information related to the proposed group structure, overfitting was little influenced by rho, whereas for intermediate cases some dependence was found, although it was not possible to specify values of rho which prevented overfitting altogether. The use of a tuning set, to control termination of training and guard against overtraining, did not necessarily prevent overfitting from taking place. However, for data containing scant group-related information, the use of a tuning set reduced the likelihood and magnitude of overfitting, although it did not eliminate it entirely. For other data sets, little difference in the nature of overfitting arose from the two modes of termination. Small data sets (in terms of number of specimens) were more likely to produce overfit ANNs, as were input layers comprising large numbers of PC scores. Hence, for high-dimensional data, the use of a limited number of PC scores as inputs, a tuning set to prevent overtraining and a test set to detect and guard against overfitting are recommended.
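The ratio rho described in the abstract can be illustrated with a minimal sketch. It assumes a fully connected network with a single hidden layer, and it treats the counting of bias connections as optional, because the abstract does not say how connections were counted. The function names and the example sizes (60 training specimens, 1000 raw variables, 5 PC scores, 3 hidden units, 2 output classes) are hypothetical and are not taken from the paper.

```python
def count_connections(n_inputs, n_hidden, n_outputs, include_bias=True):
    """Count weights in a fully connected network with one hidden layer.

    Assumption: one hidden layer; whether bias weights count as
    connections is not stated in the abstract, so it is left as a switch.
    """
    bias = 1 if include_bias else 0
    input_to_hidden = (n_inputs + bias) * n_hidden
    hidden_to_output = (n_hidden + bias) * n_outputs
    return input_to_hidden + hidden_to_output


def rho(n_train, n_inputs, n_hidden, n_outputs, include_bias=True):
    """Ratio of training observations to network connections."""
    return n_train / count_connections(n_inputs, n_hidden, n_outputs, include_bias)


# Hypothetical sizes, chosen only to show how the choice of inputs moves rho.
n_train, n_hidden, n_outputs = 60, 3, 2

rho_raw = rho(n_train, n_inputs=1000, n_hidden=n_hidden, n_outputs=n_outputs)
rho_pcs = rho(n_train, n_inputs=5, n_hidden=n_hidden, n_outputs=n_outputs)

print(f"rho with 1000 raw variables as inputs: {rho_raw:.4f}")
print(f"rho with 5 PC scores as inputs:        {rho_pcs:.4f}")
```

Under these assumptions, replacing the raw variables with a handful of PC scores raises rho by roughly two orders of magnitude, since the input layer dominates the connection count. As the abstract notes, though, no value of rho was found that prevented overfitting altogether, so a held-out test set is still needed.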