Generally accepted standards for testing and validating ecosystem models would benefit both modellers and model users. Universally applicable test procedures are difficult to prescribe, given the diversity of modelling approaches and the many uses for models. However, the generally accepted scientific principles of documentation and disclosure provide a useful framework for devising general standards for model evaluation. Adequately documenting model tests requires explicit performance criteria, and explicit benchmarks against which model performance is compared. A model's validity, reliability, and accuracy can be most meaningfully judged by explicit comparison against the available alternatives. In contrast, current practice is often characterized by vague, subjective claims that model predictions show 'acceptable' agreement with data; such claims provide little basis for choosing among alternative models. Strict model tests (those that invalid models are unlikely to pass) are the only ones capable of convincing rational skeptics that a model is probably valid. However, 'false positive' rates as low as 10% can substantially erode the power of validation tests, making them insufficiently strict to convince rational skeptics.
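As a minimal sketch of why even a 10% false-positive rate matters (a Bayes' theorem calculation of our own; the skeptical prior of 0.1 and the assumption that all valid models pass are illustrative, not figures from the text):

```python
# Minimal sketch (not from the paper): Bayes' theorem applied to a validation test.
# Assumptions: the "rational skeptic" holds a low prior that the model is valid,
# every valid model passes the test, and invalid models pass at the stated
# false-positive rate. All numbers other than the 10% rate are illustrative.

def prob_valid_given_pass(prior_valid: float,
                          false_positive_rate: float,
                          true_positive_rate: float = 1.0) -> float:
    """Posterior probability that the model is valid, given that it passed."""
    p_pass = (true_positive_rate * prior_valid
              + false_positive_rate * (1.0 - prior_valid))
    return true_positive_rate * prior_valid / p_pass

for fpr in (0.50, 0.10, 0.01):
    posterior = prob_valid_given_pass(prior_valid=0.1, false_positive_rate=fpr)
    print(f"false-positive rate {fpr:.0%}: P(valid | passed) = {posterior:.2f}")
# Even at a 10% false-positive rate, a skeptic starting from P(valid) = 0.1
# is moved only to about 0.53 by the model passing the test.
```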
Validation tests are often undermined by excessive parameter calibration and overuse of ad hoc model features. Tests are also often divorced from the conditions under which a model will be used, particularly when it is designed to forecast beyond the range of historical experience. In such situations, data from laboratory and field manipulation experiments can provide particularly effective tests, because one can create experimental conditions quite different from historical data, and because experimental data can provide a more precisely defined 'target' for the model to hit. We present a simple demonstration showing that the two most common methods for comparing model predictions to environmental time series (plotting model time series against data time series, and plotting predicted versus observed values) have little diagnostic power.
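The following synthetic sketch illustrates this kind of demonstration (it is our own construction, not the example used in the paper): a model that reproduces only the seasonal cycle, and entirely misses the response of interest, still achieves a high predicted-versus-observed R².

```python
# Synthetic sketch (our own construction, not the paper's demonstration):
# a "model" that reproduces only the seasonal cycle, and ignores the process
# of real interest (the response to a driver), still looks good when its time
# series is overlaid on the data or plotted as predicted vs. observed.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(240)                          # 20 years of monthly time steps
season = 10.0 * np.sin(2 * np.pi * t / 12)  # dominant seasonal cycle
driver = rng.normal(size=t.size)            # e.g. a climate or treatment driver
observed = season + 2.0 * driver + rng.normal(scale=1.0, size=t.size)

model = season                              # "wrong" model: no driver response at all

# Predicted-vs-observed R^2, the usual summary of such plots
resid = observed - model
r_squared = 1.0 - resid.var() / observed.var()
print(f"predicted vs. observed R^2 = {r_squared:.2f}")   # ~0.9 despite the wrong model
```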
We propose that it may be more useful to statistically extract the relationships of primary interest from the time series, and test the model directly against them.
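As a sketch of what such a test might look like for the synthetic example above (the regression-on-a-driver approach is our illustration, not necessarily the authors' procedure):

```python
# Sketch of testing against an extracted relationship (our own illustration,
# continuing the synthetic example above, not a procedure taken from the paper).
# The relationship of primary interest is the response to the driver; we
# estimate it from both the data and the model output after removing the
# seasonal cycle, and compare the two estimates directly.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(240)
season = 10.0 * np.sin(2 * np.pi * t / 12)
driver = rng.normal(size=t.size)
observed = season + 2.0 * driver + rng.normal(scale=1.0, size=t.size)
model = season                                   # same "wrong" model as above

def driver_response(series, t, driver):
    """Slope of the deseasonalized series with respect to the driver."""
    monthly_mean = np.array([series[t % 12 == m].mean() for m in range(12)])
    anomalies = series - monthly_mean[t % 12]
    slope, _ = np.polyfit(driver, anomalies, 1)
    return slope

print(f"response in data:  {driver_response(observed, t, driver):+.2f}")  # close to the imposed +2
print(f"response in model: {driver_response(model, t, driver):+.2f}")     # near 0: the model fails this test
```

On this extracted relationship the "wrong" model fails decisively, even though its raw time series and predicted-versus-observed plot look acceptable.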