Purpose. The aim of this study was to investigate to what extent ratings of tutor performance remain stable in the long term. At many schools, teaching performance is assessed, and these evaluations are consulted as part of the decision-making process for promotion, tenure, and salary. Because this information may have summative value, it is crucial that the reliability of the data be assessed. A previous study had shown that a single evaluation of a tutor is reliable when the responses of six students are used (interrater reliability). The present study focused on the stability of tutor evaluations over repeated occasions of evaluation.

Method. A generalizability study was conducted to estimate the number of occasions required to demonstrate stability. The study took place during three academic years (1992-93, 1993-94, and 1994-95) at the problem-based medical school of the University of Limburg (now Maastricht University). A total of 291 ratings were analyzed (97 tutors rated during three sequential tutoring occasions). Two types of scores were used: an aggregate score calculated from ratings of 13 items, and an overall judgment.

Results. The results indicate that when the scores are used to interpret the precision of individual scores, two evaluation occasions should be available for the overall judgment and four occasions for the aggregate score. If the tutor scores are consulted only to determine whether performances are above or below a cutoff score, a reliable decision can be made after only a single occasion of evaluation.

Conclusion. The results demonstrate that data collected over an extended period of time can be reliably used as part of the decision-making process for promotion, salary, and tenure.
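The core of the decision-study (D-study) reasoning behind these recommendations can be sketched with the standard generalizability coefficient, which rises as error variance is averaged over more occasions. The abstract does not report the estimated variance components, so the numbers below are purely illustrative assumptions, not the study's data:

```python
def g_coefficient(var_tutor: float, var_error: float, n_occasions: int) -> float:
    """D-study generalizability coefficient for a mean score over
    n_occasions: true (tutor) variance divided by true variance plus
    error variance averaged across occasions."""
    return var_tutor / (var_tutor + var_error / n_occasions)

# Illustrative variance components (assumed, not from the study):
# with equal tutor and error variance, averaging over more occasions
# steadily raises the coefficient toward 1.
print(g_coefficient(1.0, 1.0, 1))  # 0.5
print(g_coefficient(1.0, 1.0, 4))  # 0.8
```

In this framework, a target reliability (often 0.80) is fixed and `n_occasions` is increased until the coefficient reaches it, which is how a study of this kind arrives at recommendations such as "two occasions for the overall judgment, four for the aggregate score."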