This study reports on the generalizability of the different skills assessed in the oral certification examinations in Internal Medicine of the Royal College of Physicians and Surgeons of Canada. Assessments from the 1992 examination were examined prospectively to determine (i) inter-rater reliability, (ii) correlation between morning and afternoon sessions, and (iii) overall test reliability. While inter-rater reliability was acceptable and in the range reported in previous studies, generalizability across sessions was very low, ranging from 0.30 to 0.47, presumably reflecting content specificity. As a consequence, overall test reliability was also low, ranging from 0.57 to 0.69. Collapsing the overall scores into three decision categories (pass, borderline, fail) lowered test reliability still further. Strategies to resolve this problem are suggested.
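The two quantitative findings above can be illustrated with a small simulation: low cross-session correlation of continuous scores, and further loss of agreement once scores are collapsed into pass/borderline/fail categories. The sketch below is purely illustrative and uses assumed parameters (200 simulated candidates, arbitrary cut scores of 45 and 55, session noise chosen so the cross-session correlation lands in a plausible range); none of these values come from the study itself.

```python
import random
import statistics

random.seed(0)

# Simulate candidates: two session scores share a latent ability
# plus independent session-specific noise (assumed, illustrative values).
n = 200
ability = [random.gauss(50, 10) for _ in range(n)]
session1 = [a + random.gauss(0, 12) for a in ability]
session2 = [a + random.gauss(0, 12) for a in ability]

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def trichotomize(score):
    """Collapse a continuous score into fail=0 / borderline=1 / pass=2
    at hypothetical cut scores of 45 and 55."""
    if score < 45:
        return 0
    if score < 55:
        return 1
    return 2

def cohen_kappa(x, y):
    """Chance-corrected agreement (Cohen's kappa) for two categorical lists."""
    cats = sorted(set(x) | set(y))
    m = len(x)
    p_obs = sum(a == b for a, b in zip(x, y)) / m
    p_exp = sum((x.count(c) / m) * (y.count(c) / m) for c in cats)
    return (p_obs - p_exp) / (1 - p_exp)

r = pearson(session1, session2)
k = cohen_kappa([trichotomize(s) for s in session1],
                [trichotomize(s) for s in session2])
print(f"cross-session correlation (continuous scores): {r:.2f}")
print(f"cross-session kappa (pass/borderline/fail):    {k:.2f}")
```

Because trichotomizing discards within-category score information, the categorical agreement is typically lower than the continuous correlation, mirroring the abstract's observation that collapsing scores into three decision categories reduced reliability further.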