The dearth of published empirical data on major industrial systems has been one of the reasons that software engineering has failed to establish a proper scientific basis. In this paper, we hope to provide a small contribution to the body of empirical knowledge. We describe a number of results from a quantitative study of faults and failures in two releases of a major commercial system. We tested a range of basic software engineering hypotheses relating to: the Pareto principle of distribution of faults and failures; the use of early fault data to predict later fault and failure data; metrics for fault prediction; and benchmarking fault data. For example, we found strong evidence that a small number of modules contain most of the faults discovered in prerelease testing, and that a very small number of modules contain most of the faults discovered in operation. In neither case, however, is this explained by the size or complexity of the modules. We found no evidence to support previous claims relating module size to fault density, nor did we find evidence that popular complexity metrics are good predictors of either fault-prone or failure-prone modules. We confirmed that the number of faults discovered in prerelease testing is an order of magnitude greater than the number discovered in 12 months of operational use. We also found fairly stable numbers of faults discovered at corresponding testing phases. Our most surprising and important result was strong evidence of a counter-intuitive relationship between pre- and postrelease faults: the modules that are most fault-prone prerelease are among the least fault-prone postrelease, while, conversely, the modules that are most fault-prone postrelease are among the least fault-prone prerelease. This observation has serious ramifications for the commonly used fault density measure: not only is it misleading as a surrogate quality measure, but its extensive use in previous metrics studies is shown to be flawed. Our results provide data points toward an empirical picture of the software development process. However, even the strong results we observed are not generally valid as software engineering laws, because they fail to take account of basic explanatory data, notably testing effort and operational usage. After all, a module that has not been tested or used will reveal no faults, irrespective of its size, complexity, or any other factor.
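As a minimal sketch of the two measures discussed above, the Pareto-style concentration of faults and the fault density surrogate can be computed as follows. The module names and fault counts here are entirely hypothetical illustrations, not data from the study:

```python
# Hypothetical (module, faults, KLOC) records for a 10-module system.
# These numbers are invented for illustration only.
modules = [
    ("m01", 120, 4.0), ("m02", 95, 2.5), ("m03", 60, 5.0),
    ("m04", 12, 1.0), ("m05", 8, 3.0), ("m06", 5, 0.5),
    ("m07", 4, 2.0), ("m08", 3, 1.5), ("m09", 2, 0.8), ("m10", 1, 0.2),
]

def pareto_share(records, top_fraction=0.2):
    """Fraction of all faults contained in the top `top_fraction` of
    modules ranked by fault count (a Pareto-style concentration check)."""
    ranked = sorted(records, key=lambda r: r[1], reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    total = sum(r[1] for r in ranked)
    return sum(r[1] for r in ranked[:k]) / total

def fault_density(faults, kloc):
    """Faults per thousand lines of code -- the surrogate quality
    measure whose use the study argues is misleading."""
    return faults / kloc

share = pareto_share(modules)
print(f"top 20% of modules hold {share:.0%} of the faults")
print(f"fault density of m01: {fault_density(120, 4.0):.1f} faults/KLOC")
```

Note that a high prerelease fault density for a module says nothing here about its postrelease behaviour; the study's counter-intuitive result is precisely that the two rankings can be near-inverses of each other.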