Assessing differential item functioning among multiple groups: A comparison of three Mantel-Haenszel procedures

Authors
Citation
Rd. Penfield, Assessing differential item functioning among multiple groups: A comparison of three Mantel-Haenszel procedures, APPL MEAS E, 14(3), 2001, pp. 235-259
Number of citations
37
Subject Categories
Education
Journal title
APPLIED MEASUREMENT IN EDUCATION
ISSN journal
08957347
Volume
14
Issue
3
Year of publication
2001
Pages
235 - 259
Database
ISI
SICI code
0895-7347(2001)14:3<235:ADIFAM>2.0.ZU;2-E
Abstract
It is often the case in performing a differential item functioning (DIF) analysis that comparisons are made between a single reference group and multiple focal groups. Conducting a separate test of DIF for each focal group has several undesirable qualities: (a) the Type I error rate will exceed the intended nominal level if the level of significance for each individual test is not appropriately adjusted, (b) the power may not be as high as a single test that assesses DIF among all groups simultaneously, and (c) substantial time and computing resources are required. These drawbacks are potentially avoided by using a procedure that has the capacity to assess DIF across all groups simultaneously. In this study I compare the performance of three methods of assessing DIF across multiple demographic groups: the Mantel-Haenszel chi-square statistic with no adjustment to the alpha level, the Mantel-Haenszel chi-square statistic with a Bonferroni adjusted alpha level, and the Generalized Mantel-Haenszel statistic (GMH) that offers a single test of significance across all groups. Simulations were conducted in which there was a single reference group and 1, 2, 3, and 4 focal groups, having from 1 to all of the focal groups in a given condition experiencing DIF. Additional conditions that were varied included group size, focal group ability distribution, and magnitude of matching criterion contamination. The results suggest that GMH is in general the most appropriate procedure because its Type I error rate remained at the nominal level of 0.05, and its power was consistently among the highest.
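Point (a) of the abstract can be illustrated with a small sketch. It is not taken from the article: the function names are hypothetical, and the familywise formula assumes the separate tests are independent, which is only an approximation when every focal group is compared against the same reference group. It shows how the chance of at least one false rejection grows with the number of focal groups at an unadjusted alpha of 0.05, and what per-test alpha a Bonferroni adjustment would use instead.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type I error across n_tests tests.

    Assumption (not from the article): the tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level under a Bonferroni adjustment."""
    return alpha / n_tests


if __name__ == "__main__":
    nominal = 0.05  # nominal level named in the abstract
    for n_focal in (1, 2, 3, 4):  # focal-group counts used in the simulations
        unadjusted = familywise_error_rate(nominal, n_focal)
        per_test = bonferroni_alpha(nominal, n_focal)
        adjusted = familywise_error_rate(per_test, n_focal)
        print(
            f"focal groups={n_focal}  "
            f"unadjusted familywise rate={unadjusted:.4f}  "
            f"Bonferroni per-test alpha={per_test:.4f}  "
            f"adjusted familywise rate={adjusted:.4f}"
        )
```

Under the independence assumption, the unadjusted familywise rate rises above the nominal level as soon as there is more than one focal group, while the Bonferroni-adjusted rate stays at or below it.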