H. S. Lynn and C. E. McCulloch, "Using principal component analysis and correspondence analysis for estimation in latent variable models," Journal of the American Statistical Association, 95(450), 2000, pp. 561–572.
Correspondence analysis (CA) and principal component analysis (PCA) are often used to describe multivariate data. In certain applications they have been used for estimation in latent variable models. The theoretical basis for such inference is assessed in generalized linear models where the linear predictor equals alpha(j) + x(i)beta(j) or a(j) - b(j)(x(i) - u(j))^2, (i = 1, ..., n; j = 1, ..., m), and x(i) is treated as a latent fixed effect. The PCA and CA eigenvectors/column scores are evaluated as estimators of beta(j) and u(j), respectively. With m fixed and n → ∞, consistent estimators cannot be obtained, due to the incidental parameters problem, unless sufficient "moment" conditions are imposed on the x(i). PCA is equivalent to maximum likelihood estimation for the linear Gaussian model and gives a consistent estimator of beta(j) (up to a scale change) when the second sample moment of the x(i) is positive and finite in the limit. It is inconsistent for the Poisson and Bernoulli distributions, but when b(j) is constant, its first and/or second eigenvectors can consistently estimate u(j) (up to a location and scale change) for the quadratic Gaussian model. In contrast, the CA estimator is always inconsistent. For finite samples, however, the CA column scores often have high correlations with the u(j)'s, especially when the response curves are spread out relative to one another. The correlations obtained from PCA are usually weaker, although the second PCA eigenvector can sometimes do much better than the first, and for incidence data with tightly clustered response curves its performance is comparable to that of CA. For small sample sizes, PCA and particularly CA are competitive alternatives to maximum likelihood and may be preferred because of their computational ease.
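The claim that PCA consistently estimates beta(j) up to scale in the linear Gaussian model can be illustrated by simulation. The sketch below (not from the paper; the sample sizes, loadings, and noise scale are hypothetical choices) generates data from y(ij) = alpha(j) + x(i)beta(j) + noise and checks that the leading eigenvector of the sample covariance aligns with beta(j).

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 5                                 # n subjects, m items (illustrative sizes)
beta = np.array([1.0, 2.0, -1.0, 0.5, 1.5])   # hypothetical loadings beta(j)
alpha = rng.normal(size=m)                     # item intercepts alpha(j)
x = rng.normal(size=n)                         # latent fixed effects x(i)

# Linear Gaussian model: y_ij = alpha_j + x_i * beta_j + eps_ij
Y = alpha + np.outer(x, beta) + rng.normal(scale=0.3, size=(n, m))

# PCA: leading eigenvector of the sample covariance across the m columns
cov = np.cov(Y, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
v1 = eigvecs[:, -1]                            # first principal axis

# v1 estimates beta only up to sign and scale, so compare directions
v1 = v1 * np.sign(v1 @ beta)
corr = (v1 @ beta) / (np.linalg.norm(v1) * np.linalg.norm(beta))
print(corr)                                    # should be close to 1 for large n
```

With m fixed and n growing, the cosine between the leading eigenvector and beta(j) approaches 1, consistent with the abstract's "up to a scale change" qualification; the same construction with Poisson or Bernoulli responses would exhibit the inconsistency the paper describes.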