X. Fan and M. Chen, "Published studies of interrater reliability often overestimate reliability: Computing the correct coefficient," Educational and Psychological Measurement, 60(4), 2000, pp. 532-542
It is erroneous to generalize the interrater reliability coefficient estimated from two or more raters rating only a (small) portion of the sample to the rest of the sample data for which only one rater is used for scoring, although such generalization is often made implicitly in practice. If the interrater reliability estimate from part of a sample is available, the score reliability for the rest of the sample data for which only one rater is used for scoring can be estimated both within the framework of classical reliability theory and within that of generalizability theory. As intuitively expected, score reliability when only one rater is used for scoring is lower than the score reliability obtained when two raters are used. The authors provide a sample of published studies in different disciplines that inappropriately generalized reliability coefficients involving several raters to scores generated by a single rater.