In the past two decades, striking examples of allegedly inferior likelihood ratio tests (LRT) have appeared in the statistical literature. These examples, which arise in multiparameter hypothesis testing problems, have several common features. In each case the null hypothesis is composite, the size α LRT is not similar and hence biased, and competing size α tests can be constructed that are less biased, or even unbiased, and that dominate the LRT in the sense of being everywhere more powerful. It is therefore asserted that in these examples and, by implication, many other testing problems, the LR criterion produces "inferior," "deficient," "undesirable," or "flawed" statistical procedures.
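For readers less familiar with this terminology, the standard definitions may be stated roughly as follows; the notation (Θ₀, Θ, L, φ, α, c_α) is generic and not drawn from any particular example discussed here.

\[
  \lambda(x) \;=\; \frac{\sup_{\theta \in \Theta_0} L(\theta; x)}{\sup_{\theta \in \Theta} L(\theta; x)},
  \qquad \text{the LRT rejects } H_0 : \theta \in \Theta_0 \text{ when } \lambda(x) \le c_\alpha .
\]

A test $\phi$ has size $\alpha$ if $\sup_{\theta \in \Theta_0} E_\theta[\phi(X)] = \alpha$; it is similar if its rejection probability equals $\alpha$ throughout the null hypothesis (in some formulations, on its boundary), and unbiased if in addition $E_\theta[\phi(X)] \ge \alpha$ for every alternative $\theta \in \Theta \setminus \Theta_0$. A test $\phi_1$ dominates $\phi_2$ in power if $E_\theta[\phi_1(X)] \ge E_\theta[\phi_2(X)]$ for all alternatives $\theta$, with strict inequality for some $\theta$.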
This message, which appears to be proliferating, is wrong. In each example it is the allegedly superior test that is flawed, not the LRT. At worst, the "superior" tests provide unwarranted and inappropriate inferences and have been deemed scientifically unacceptable by applied statisticians. This reinforces the well-documented but oft-neglected fact that the Neyman-Pearson theory desideratum of a more (or most) powerful size α test may be scientifically inappropriate; the same is true for the criteria of unbiasedness and α-admissibility. Although the LR criterion is not infallible, we believe that it remains a generally reasonable first option for non-Bayesian parametric hypothesis-testing problems.