The purpose of this study is to examine whether the reviewers on item review committees can accurately identify test items that exhibit a variety of flaws. An instrument with 75 items was constructed and administered to 39 reviewers who were operational members of an item review committee. After undergoing training, the 39 reviewers were asked to examine the 75 items and indicate whether each item exhibited cultural or technical flaws. There were 8 cultural flaw categories (e.g., "Does the item unfairly favor males or females?") and 8 technical flaw categories (e.g., "Is the item content inaccurate or factually incorrect?"). The accuracy of the reviewers was defined in terms of the match between the judged classifications and the a priori classifications of the items into flaw categories. A new approach based on item response theory for examining rater accuracy was used to analyze the data (Engelhard, 1996). The data suggest that it is easier to identify some types of item flaws than others; specifically, the reviewers were more accurate in identifying items with cultural flaws than items with technical flaws. The reviewers exhibited fairly high overall accuracy rates, ranging from 83% to 94%, and there were statistically significant differences in judgmental accuracy between the reviewers. Suggestions for future research on judgmental accuracy and the implications of this study for identifying biased items are discussed.
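As a concrete illustration of the accuracy measure described above (the match between a reviewer's judged classifications and the a priori classifications), the following minimal Python sketch computes accuracy as a simple proportion of matching items. The function name, category labels, and data are hypothetical and not drawn from the study; note that the study itself applies an item response theory approach (Engelhard, 1996) rather than raw proportions.

```python
# Illustrative sketch only: accuracy here is the proportion of items for which
# a reviewer's judged flaw classification matches the a priori classification.
# All names and data below are hypothetical, not from the study.

def reviewer_accuracy(judged, a_priori):
    """Proportion of items whose judged category matches the a priori one."""
    if len(judged) != len(a_priori):
        raise ValueError("classification lists must be the same length")
    matches = sum(j == a for j, a in zip(judged, a_priori))
    return matches / len(judged)

# Hypothetical classifications for five items.
judged   = ["cultural", "none", "technical", "none", "cultural"]
a_priori = ["cultural", "none", "technical", "cultural", "cultural"]

print(reviewer_accuracy(judged, a_priori))  # 4 of 5 items match -> 0.8
```

A proportion like this treats every item as equally difficult to judge; an IRT-based approach instead models item and rater characteristics jointly, which is why it can separate reviewer accuracy from item difficulty.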