Many experiments have compared query languages, but their findings are difficult to combine because the experiments used different settings and procedures. Before combining the experiments in a meta-analysis, it is proposed that these differences be checked for possible effects. The most important measure of user performance in these experiments is query accuracy, which has been determined using many different grading schemes. The schemes are therefore checked for possible effects on hypothesis rejection: they are applied to two sets of queries from two different experiments, and the outcomes are examined for any effects attributable to the grading schemes. The results show that the experimental outcomes are robust to the choice of grading scheme.
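The robustness check described above can be sketched as follows. Everything in this example is hypothetical: the scheme names, the per-query accuracy scores, the two-group comparison, and the critical value are illustrative stand-ins, not data or procedures from the experiments themselves. The idea is simply to re-run the same hypothesis test under each grading scheme and see whether the rejection decision changes.

```python
# Sketch: grade the same queries under several (hypothetical) grading
# schemes, repeat the hypothesis test comparing two query languages
# under each scheme, and check whether the rejection decision flips.
from statistics import mean, stdev
from math import sqrt

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

# Hypothetical per-query accuracy scores for two query languages,
# graded under two schemes (e.g. strict right/wrong vs. partial credit).
scores = {
    "strict":  {"lang_A": [1, 1, 1, 1, 0, 1, 1, 1],
                "lang_B": [0, 0, 1, 0, 0, 0, 0, 0]},
    "partial": {"lang_A": [1.0, 0.9, 0.4, 1.0, 0.5, 0.8, 1.0, 0.3],
                "lang_B": [0.2, 0.4, 0.8, 0.1, 0.3, 0.9, 0.2, 0.4]},
}

CRITICAL = 2.145  # approximate two-sided critical value, alpha=0.05, df~14

decisions = {}
for scheme, groups in scores.items():
    t = welch_t(groups["lang_A"], groups["lang_B"])
    decisions[scheme] = abs(t) > CRITICAL  # reject H0 of equal accuracy?

# The outcome is "robust" if every grading scheme yields the same decision.
robust = len(set(decisions.values())) == 1
print(decisions, "robust:", robust)
```

With these illustrative scores both schemes lead to the same rejection decision, which is the robustness property the study reports for its real data.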