M.D. Brundage et al., Assessing the Reliability of Two Toxicity Scales: Implications for Interpreting Toxicity Data, Journal of the National Cancer Institute, 85(14), 1993, pp. 1138-1148.
Background: The toxicity of a given cancer therapy is an important end point in clinical trials examining the potential costs and benefits of that therapy. Treatment-related toxicity is conventionally measured with one of several toxicity criteria grading scales, even though the reliability and validity of these scales have not been established.

Purpose: We determined the reliability of the National Cancer Institute of Canada Clinical Trials Group (NCIC-CTG) expanded toxicity scale and the World Health Organization (WHO) standard toxicity scale by use of a clinical simulation of actual patients.

Methods: Seven experienced data managers each interviewed 12 simulated patients and scored their respective acute toxic effects. Inter-rater agreement (agreement between multiple raters of the same case) was calculated using the kappa (κ) statistic across all seven randomly assigned raters for each of 18 toxicity categories (13 NCIC-CTG and five WHO categories). Intra-rater agreement (agreement within the same rater on one case rated on separate occasions) was calculated using kappa over repeated cases (where raters were blinded to the repeated nature of the subjects).
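The multi-rater kappa described above can be illustrated with a minimal sketch in the style of Fleiss' kappa for agreement among a fixed number of raters. The grading scale (0-4), the example ratings, and all names below are illustrative assumptions, not data or code from the paper.

```python
# Minimal sketch of Fleiss' kappa: chance-corrected agreement among
# several raters grading the same cases. Example data are hypothetical.

def fleiss_kappa(counts):
    """counts[i][j] = number of raters assigning case i to grade j."""
    n_cases = len(counts)
    n_raters = sum(counts[0])          # raters per case (seven in the study)
    n_grades = len(counts[0])

    # Observed agreement: fraction of rater pairs agreeing on each case,
    # averaged over cases.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_cases

    # Chance agreement: sum of squared marginal proportions of each grade.
    totals = [sum(row[j] for row in counts) for j in range(n_grades)]
    p_chance = sum((t / (n_cases * n_raters)) ** 2 for t in totals)

    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical example: 3 cases, 7 raters, toxicity grades 0-4.
ratings = [
    [5, 2, 0, 0, 0],   # case 1: five raters gave grade 0, two gave grade 1
    [0, 1, 4, 2, 0],
    [0, 0, 0, 3, 4],
]
print(f"kappa = {fleiss_kappa(ratings):.2f}")
```

A kappa of 1.0 indicates perfect agreement and 0 indicates agreement no better than chance, which is why values near zero for some clinical categories (see Results) signal poor reliability.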
Proportions of agreement (an estimate of the probability of two randomly selected raters assigning the same toxicity grade to a given case) were also calculated for inter-rater agreement. Since minor lack of agreement might have adversely affected these statistics of agreement, both the kappa and proportion-of-agreement analyses were repeated for the following condensed grading categories: none (0) versus low-grade (1 or 2) versus high-grade (3 or 4) toxicity present.
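The proportion-of-agreement statistic and the condensed regrading lend themselves to a similarly small sketch. The function names, grade lists, and collapsing rule implementation below are illustrative assumptions; the paper does not publish code.

```python
# Sketch of the proportion-of-agreement statistic and the condensed
# regrading (none vs. low-grade vs. high-grade). Example data are hypothetical.

from itertools import combinations

def proportion_agreement(ratings_per_case):
    """ratings_per_case[i] = grades that all raters gave case i.

    Returns the fraction of rater pairs, pooled over cases, assigning the
    same grade: an estimate of the probability that two randomly selected
    raters agree on a given case.
    """
    agree = total = 0
    for grades in ratings_per_case:
        for a, b in combinations(grades, 2):
            agree += (a == b)
            total += 1
    return agree / total

def condense(grade):
    """Collapse grades 0-4: none (0), low-grade (1-2), high-grade (3-4)."""
    return 0 if grade == 0 else (1 if grade <= 2 else 2)

# Hypothetical grades from 7 raters on 3 cases.
cases = [
    [0, 0, 0, 0, 0, 1, 1],
    [1, 2, 2, 2, 2, 3, 3],
    [3, 3, 3, 4, 4, 4, 4],
]
print(f"raw agreement:       {proportion_agreement(cases):.2f}")
condensed = [[condense(g) for g in grades] for grades in cases]
print(f"condensed agreement: {proportion_agreement(condensed):.2f}")
```

Condensing can only merge disagreements away, never create them, so agreement statistics on the condensed scale bound those on the full scale from above; residual disagreement after condensing therefore reflects genuinely divergent judgments rather than one-grade scoring differences.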
Results: Modest levels of inter-rater reliability were demonstrated in this study, with kappa values that ranged from 0.50 to 1.00 in laboratory-based categories and from -0.04 to 0.82 for clinically based categories. Proportions of agreement for clinical categories ranged from 0.52 to 0.98. Condensing the toxicity grades improved the statistics of agreement, but substantial lack of agreement remained (kappa range, -0.04 to 0.82; proportions of agreement range, 0.67 to 0.98).
Conclusions: Experienced data managers, when interviewing patients, draw varying conclusions regarding the toxic effects experienced by such patients. Neither the NCIC-CTG expanded toxicity scale nor the WHO standard toxicity scale demonstrated clear superiority in reliability, although the breadth of toxic effects recorded differed.