Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size

Authors

Sahiner, B Chan, HP Petrick, N Wagner, RF Hadjiiski, L

Citation

B. Sahiner et al., Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size, MED PHYS, 27(7), 2000, pp. 1509-1522

Citations number

Subject Categories

Radiology, Nuclear Medicine & Imaging; Medical Research Diagnosis & Treatment

Journal title

MEDICAL PHYSICS

ISSN journal

0094-2405

Volume

27

Issue

7

Year of publication

2000

Pages

1509 - 1522

Database

ISI

SICI code

0094-2405(200007)27:7<1509:FSACPI>2.0.ZU;2-1

Abstract

In computer-aided diagnosis (CAD), a frequently used approach for distinguishing normal and abnormal cases is first to extract potentially useful features for the classification task. Effective features are then selected from this entire pool of available features. Finally, a classifier is designed using the selected features. In this study, we investigated the effect of finite sample size on classification accuracy when classifier design involves stepwise feature selection in linear discriminant analysis, which is the most commonly used feature selection algorithm for linear classifiers. The feature selection and the classifier coefficient estimation steps were considered to be cascading stages in the classifier design process. We compared the performance of the classifier when feature selection was performed on the design samples alone and on the entire set of available samples, which consisted of design and test samples. The area A(z) under the receiver operating characteristic curve was used as our performance measure. After linear classifier coefficient estimation using the design samples, we studied the hold-out and resubstitution performance estimates. The two classes were assumed to have multidimensional Gaussian distributions, with a large number of features available for feature selection. We investigated the dependence of feature selection performance on the covariance matrices and means for the two classes, and examined the effects of sample size, number of available features, and parameters of stepwise feature selection on classifier bias. Our results indicated that the resubstitution estimate was always optimistically biased, except in cases where the parameters of stepwise feature selection were chosen such that too few features were selected by the stepwise procedure. When feature selection was performed using only the design samples, the hold-out estimate was always pessimistically biased. When feature selection was performed using the entire finite sample space, the hold-out estimates could be pessimistically or optimistically biased, depending on the number of features available for selection, the number of available samples, and their statistical distribution. For our simulation conditions, these estimates were always pessimistically (conservatively) biased if the ratio of the total number of available samples per class to the number of available features was greater than five. (C) 2000 American Association of Physicists in Medicine. [S0094-2405(00)01607-2].
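The comparison described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' simulation; it is a minimal illustration under stated assumptions. Both classes are drawn from the same Gaussian distribution (identity covariance, zero mean difference), so the true A(z) of any classifier is 0.5 and any departure from 0.5 is estimation bias. Feature selection uses a simple univariate ranking as a stand-in for the stepwise procedure studied in the paper. The function names (`auc`, `select_features`, `lda_weights`, `simulate`) and the default sample sizes are hypothetical choices made for this example only.

```python
import numpy as np

def auc(neg, pos):
    """Area under the ROC curve via the Mann-Whitney statistic (ties count 1/2)."""
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def select_features(x0, x1, k):
    """Keep the k features with the largest |standardized mean difference|.
    ASSUMPTION: univariate ranking, a stand-in for stepwise selection."""
    pooled_sd = np.sqrt(0.5 * (x0.var(axis=0, ddof=1) + x1.var(axis=0, ddof=1)))
    score = np.abs(x1.mean(axis=0) - x0.mean(axis=0)) / pooled_sd
    return np.argsort(score)[-k:]

def lda_weights(x0, x1):
    """Fisher linear discriminant: w = S_pooled^{-1} (mean1 - mean0)."""
    s = 0.5 * (np.cov(x0, rowvar=False) + np.cov(x1, rowvar=False))
    return np.linalg.solve(np.atleast_2d(s), x1.mean(axis=0) - x0.mean(axis=0))

def simulate(n_per_class=15, n_features=100, k=5, n_trials=200, seed=0):
    """Mean A(z) estimates over n_trials when the true A(z) is 0.5.
    ASSUMPTION: design and test sets each hold n_per_class cases per class."""
    rng = np.random.default_rng(seed)
    resub, hold_design, hold_all = [], [], []
    for _trial in range(n_trials):
        # Design (d) and test (t) samples for class 0 and class 1.
        d0 = rng.standard_normal((n_per_class, n_features))
        d1 = rng.standard_normal((n_per_class, n_features))
        t0 = rng.standard_normal((n_per_class, n_features))
        t1 = rng.standard_normal((n_per_class, n_features))

        # Case 1: features selected on the design samples alone.
        idx = select_features(d0, d1, k)
        w = lda_weights(d0[:, idx], d1[:, idx])
        resub.append(auc(d0[:, idx] @ w, d1[:, idx] @ w))
        hold_design.append(auc(t0[:, idx] @ w, t1[:, idx] @ w))

        # Case 2: features selected on design + test samples together;
        # classifier coefficients are still estimated on the design samples.
        idx = select_features(np.vstack([d0, t0]), np.vstack([d1, t1]), k)
        w = lda_weights(d0[:, idx], d1[:, idx])
        hold_all.append(auc(t0[:, idx] @ w, t1[:, idx] @ w))
    return {
        "resubstitution": float(np.mean(resub)),
        "holdout_select_on_design": float(np.mean(hold_design)),
        "holdout_select_on_all": float(np.mean(hold_all)),
    }

if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

With the default settings the ratio of samples per class to available features is well below five, which is the regime where the abstract reports that selecting features on the entire sample can make the hold-out estimate optimistic. In this sketch the resubstitution mean should sit well above 0.5, the design-only hold-out mean close to 0.5, and the hold-out mean with selection on all samples above the design-only one.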