L.C. Briand et al., A comprehensive evaluation of capture-recapture models for estimating software defect content, IEEE Transactions on Software Engineering, 26(6), 2000, pp. 518-540
An important requirement to control the inspection of software artifacts is
to be able to decide, based on more objective information, whether the
inspection can stop or whether it should continue to achieve a suitable level
of artifact quality. A prediction of the number of remaining defects in an
inspected artifact can be used for decision making. Several studies in
software engineering have considered capture-recapture models, originally
proposed by biologists to estimate animal populations, to make a prediction.
However, few studies compare the actual number of remaining defects to the one
predicted by a capture-recapture model on real software engineering
artifacts. Thus, there is little work looking at the robustness of
capture-recapture models under realistic software engineering conditions,
where it is expected that some of their assumptions will be violated.
Simulations have been performed, but no definite conclusions can be drawn
regarding the degree
of accuracy of such models under realistic inspection conditions and the
factors affecting this accuracy. Furthermore, the existing studies focused on
a subset of the existing capture-recapture models. Thus, a more exhaustive
comparison is still missing. In this study, we focus on traditional
inspections and estimate, based on actual inspection data, the degree of
accuracy of relevant, state-of-the-art capture-recapture models as they have been
proposed in biology and for which statistical estimators exist. In order to
assess their robustness, we look at the impact of the number of inspectors
and the number of actual defects on the estimators' accuracy based on
actual inspection data. Our results show that models are strongly affected by
the number of inspectors and, therefore, one must consider this factor before
using capture-recapture models. When the number of inspectors is too small,
no model is sufficiently accurate and underestimation may be substantial.
In addition, some models perform better than others in a large number of
conditions and plausible reasons are discussed. Based on our analyses, we
recommend using a model taking into account that defects have different
probabilities of being detected and the corresponding Jackknife Estimator.
Furthermore, we attempt to calibrate the prediction models based on their
relative error, as previously computed on other inspections. Although this
approach is intuitive and straightforward, we identified theoretical
limitations to it, which were then confirmed by the data.
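
As an illustration only, the following minimal sketch shows how a
jackknife-style defect content estimate and its relative error might be
computed from inspection data. It assumes the first-order jackknife form
known from the capture-recapture literature (distinct defects found, plus a
correction based on defects found by exactly one inspector), which is simpler
than the estimator evaluated in the study. The function names and the sample
data are hypothetical and do not come from the paper.

```python
def jackknife_first_order(detections):
    """Estimate total defect content with the first-order jackknife.

    detections: list of sets, one per inspector, each holding the ids of
    the defects that inspector found (hypothetical input format).
    """
    k = len(detections)
    if k < 2:
        raise ValueError("at least two inspectors are required")

    # Count how many inspectors found each defect.
    counts = {}
    for found in detections:
        for defect in found:
            counts[defect] = counts.get(defect, 0) + 1

    distinct = len(counts)                              # defects found by anyone
    f1 = sum(1 for c in counts.values() if c == 1)      # found by exactly one inspector

    # First-order jackknife: D + ((k - 1) / k) * f1
    return distinct + (k - 1) * f1 / k


def relative_error(estimated, actual):
    """Usual relative error; negative values indicate underestimation."""
    return (estimated - actual) / actual


# Made-up example: three inspectors, defects identified by integer ids.
inspectors = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
estimate = jackknife_first_order(inspectors)
print(estimate)
print(relative_error(estimate, actual=8))
```

With only a few inspectors the correction term rests on very little data,
which is in line with the abstract's warning that no model is sufficiently
accurate when the number of inspectors is too small.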