H. Danker-Hopfe and W. M. Herrmann, Interrater reliability of sleep stage scoring according to Rechtschaffen and Kales rules (RKR): A review and methodological considerations, KLIN NEUROP, 32(2), 2001, pp. 89-99
A literature review has been conducted on the interrater reliability of sleep stage scoring according to the Rechtschaffen and Kales rules, both between two raters and among more than two raters. These results have been compared with the interrater reliability between visual scorings and semiautomatic as well as fully automated scorings. For single-night scorings the interrater reliability varies between 61% and 96%, while at the group level the agreement between visual scorings varies between 85% and 95%, with an average of approximately 89%. The interrater reliability between visual and automatic scoring at the group level varies between 70% and 95%, with an average of about 83%.
The interrater reliability of sleep stage scorings varies with the number and the experience of the scorers, the choice of the 100% reference (if two or more human experts are involved), the number of stages that are distinguished, the sample (healthy subjects vs. patients with sleep disturbances), the age of the subjects, and the choice of the statistical method to estimate the interrater reliability. Based on the review of interrater reliability data, methodological considerations on the measurement of interrater reliability are presented and discussed. For variables measured on different scales (quantitative sleep parameters measured on a metric scale vs. sleep stages as qualitative variables measured on a nominal scale), different approaches to estimate interrater reliability are used.
For sleep parameters measured on a metric scale, the advantages and disadvantages of correlation statistics on the one hand and of approaches that test group differences on the other are discussed. Among the approaches of correlation analysis, the intraclass correlation should be the method of choice, and with regard to approaches that test group differences, the paired nature of the data has to be considered. Only a combination of both statistical approaches yields a comprehensive impression of the interrater reliability of the scoring results.
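To make the two complementary approaches concrete, the sketch below computes an intraclass correlation and a paired t-test for a metric sleep parameter scored by two raters. The data, the use of total sleep time as the parameter, and the particular ICC form (ICC(2,1), two-way random effects, absolute agreement) are assumptions chosen for illustration and are not taken from the reviewed studies.

import numpy as np
from scipy import stats

# Hypothetical example data (not from the reviewed studies): total sleep time
# in minutes for 8 recordings, scored independently by two raters.
rater_a = np.array([412.0, 388.5, 430.0, 401.5, 455.0, 397.0, 420.5, 409.0])
rater_b = np.array([405.5, 392.0, 426.5, 410.0, 450.0, 401.5, 418.0, 415.5])
ratings = np.column_stack([rater_a, rater_b])   # shape: (subjects, raters)
n, k = ratings.shape

# Mean squares of the two-way layout (subjects x raters) needed for ICC(2,1)
grand = ratings.mean()
ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()    # between subjects
ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()    # between raters
ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols  # residual
ms_rows = ss_rows / (n - 1)
ms_cols = ss_cols / (k - 1)
ms_err = ss_err / ((n - 1) * (k - 1))

# ICC(2,1): two-way random effects, absolute agreement, single rater
icc = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Paired test for a systematic difference between the two raters
t_stat, p_value = stats.ttest_rel(rater_a, rater_b)

print(f"ICC(2,1) = {icc:.3f}, paired t = {t_stat:.2f}, p = {p_value:.3f}")

A high correlation alone does not rule out a systematic offset between the raters, and a non-significant paired test alone does not imply that individual recordings agree closely, which is why both views are needed.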
For sleep stages, which represent nominally scaled qualitative data, agreement is commonly expressed as a percentage. Although this is a simple measure which is readily understood, it is not an adequate index of agreement, since it makes no allowance for agreement between scorers that might be attributed just to chance. This disadvantage is overcome by the kappa statistic (by Cohen for two scorers and by Fleiss for more than two scorers), which expresses the difference between observed and chance agreement in relation to the maximum possible excess of observed over chance agreement.
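In symbols, writing p_o for the observed proportion of agreement and p_e for the proportion of agreement expected by chance, this corresponds to the standard definition

\kappa = \frac{p_o - p_e}{1 - p_e}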
Kappa usually varies between 0 (agreement is equal to chance) and 1 (complete agreement between scorers). Values <0, which are rarely observed, indicate that there is a systematic deviation in agreement.
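As a minimal illustration of the chance correction, the sketch below computes both the simple percentage agreement and Cohen's kappa for two raters' epoch-by-epoch stage labels. The 20-epoch hypnogram fragments are invented for the example and do not come from the reviewed studies.

import numpy as np

# Hypothetical stage labels (W, S1-S4, REM) for 20 epochs, scored by two raters;
# invented data, used only to illustrate the computation.
stages = ["W", "S1", "S2", "S3", "S4", "REM"]
rater_a = np.array(["W", "W", "S1", "S2", "S2", "S2", "S3", "S3", "S4", "S4",
                    "S2", "S2", "REM", "REM", "REM", "S2", "S1", "W", "S2", "S2"])
rater_b = np.array(["W", "S1", "S1", "S2", "S2", "S3", "S3", "S3", "S4", "S3",
                    "S2", "S2", "REM", "REM", "S2", "S2", "S1", "W", "S2", "S2"])

# Simple percentage agreement (no correction for chance)
p_obs = np.mean(rater_a == rater_b)

# Chance agreement expected from the two raters' marginal stage distributions
p_a = np.array([np.mean(rater_a == s) for s in stages])
p_b = np.array([np.mean(rater_b == s) for s in stages])
p_chance = np.sum(p_a * p_b)

# Cohen's kappa: excess of observed over chance agreement, relative to the
# maximum possible excess
kappa = (p_obs - p_chance) / (1 - p_chance)

print(f"agreement = {p_obs:.1%}, chance = {p_chance:.1%}, kappa = {kappa:.2f}")

For more than two raters, Fleiss' generalization replaces the single pairwise comparison with the average agreement over all pairs of ratings per epoch, but the chance-correction logic is the same.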