Avoiding overfitting in the analysis of high-dimensional data with artificial neural networks (ANNs)

Authors

Defernez, M; Kemsley, EK

Citation

M. Defernez and E.K. Kemsley, Avoiding overfitting in the analysis of high-dimensional data with artificial neural networks (ANNs), ANALYST, 124(11), 1999, pp. 1675-1681

Subject Categories

Chemistry & Analysis; Spectroscopy/Instrumentation/Analytical Sciences

Journal title

ANALYST

ISSN journal

0003-2654

Volume

124

Issue

11

Year of publication

1999

Pages

1675 - 1681

Database

ISI

SICI code

0003-2654(199911)124:11<1675:AOITAO>2.0.ZU;2-O

Abstract

Complex data analysis is becoming more easily accessible to analytical chemists, including natural computation methods such as artificial neural networks (ANNs). Unfortunately, in many of these methods, inappropriate choices of model parameters can lead to overfitting. This study concerns overfitting issues in the use of ANNs to classify complex, high-dimensional data (where the number of variables far exceeds the number of specimens). We examine whether a parameter rho, equal to the ratio of the number of observations in the training set to the number of connections in the network, can be used as an indicator to forecast overfitting. Networks possessing different rho values were trained using as inputs either raw data or scores obtained from principal component analysis (PCA). A primary finding was that different data sets behave very differently. For data sets with either abundant or scant information related to the proposed group structure, overfitting was little influenced by rho, whereas for intermediate cases some dependence was found, although it was not possible to specify values of rho which prevented overfitting altogether. The use of a tuning set, to control termination of training and guard against overtraining, did not necessarily prevent overfitting from taking place. However, for data containing scant group-related information, the use of a tuning set reduced the likelihood and magnitude of overfitting, although not eliminating it entirely. For other data sets, little difference in the nature of overfitting arose from the two modes of termination. Small data sets (in terms of number of specimens) were more likely to produce overfit ANNs, as were input layers comprising large numbers of PC scores. Hence, for high-dimensional data, the use of a limited number of PC scores as inputs, a tuning set to prevent overtraining and a test set to detect and guard against overfitting are recommended.
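
The parameter rho defined in the abstract can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes a fully connected network with a single hidden layer, and it treats the question of whether bias terms count as connections as an open choice (the `include_biases` flag is hypothetical). The function names and the example layer sizes are likewise illustrative only.

```python
def count_connections(n_inputs, n_hidden, n_outputs, include_biases=False):
    """Count connections in a fully connected one-hidden-layer network.

    Assumption: every input feeds every hidden unit and every hidden unit
    feeds every output unit. Whether biases count is left to the caller.
    """
    connections = n_inputs * n_hidden + n_hidden * n_outputs
    if include_biases:
        connections += n_hidden + n_outputs
    return connections


def rho(n_training, n_inputs, n_hidden, n_outputs, include_biases=False):
    """Ratio of training-set observations to network connections."""
    return n_training / count_connections(
        n_inputs, n_hidden, n_outputs, include_biases
    )


if __name__ == "__main__":
    # Hypothetical sizes: 60 training specimens, 3 hidden units, 2 classes.
    n_training, n_hidden, n_outputs = 60, 3, 2

    # Raw high-dimensional input (500 variables) versus 5 PC scores.
    for label, n_inputs in [("raw data", 500), ("5 PC scores", 5)]:
        value = rho(n_training, n_inputs, n_hidden, n_outputs)
        print(f"{label}: rho = {value:.3f}")
```

With these made-up sizes, replacing 500 raw variables with 5 PC scores shrinks the network from 1506 to 21 connections, so rho rises from well below 1 to above 1. This only shows how the ratio responds to input dimensionality; according to the abstract, no value of rho was found that prevents overfitting altogether.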