The reliability of scores on four forms of the Test of English as a
Foreign Language (TOEFL) was estimated using a hybrid IRT model. It was
found that there was very little difference between their overall
reliability when the testlet items were assumed to be independent and when
their dependence was modeled. A larger difference in reliability was
found when test sections were analyzed individually. There, assuming
independence overestimated reliability by as much as 40% for the
reading comprehension testlets, with the
longer testlets of the newest form of TOEFL showing the most local
dependence. The listening comprehension testlets exhibited much less
local dependence. We also found that the test was unidimensional enough
for the use of univariate item response theory (IRT) to be efficacious,
and that the reading comprehension testlets showed essentially no
differential functioning by sex.