A retry policy for a computer system that is optimal in some sense is
usually derived under the unrealistic assumption that fault
characteristics are known a priori and remain unchanged throughout the
mission lifetime. In that case, the optimal retry period depends only
upon the system's status at the time of fault detection. We propose to
remedy this deficiency by formulating the optimal retry problem as a
Bayesian decision problem in which not only the time of fault detection
but also the results of earlier retries are used to estimate the current
fault characteristics. Previous knowledge about fault characteristics is
represented by the prior distributions of fault-related parameters,
which are updated whenever new samples are obtained from the retry and
detection mechanisms. A new fault classification scheme is proposed to
assign a temporal fault type (permanent, intermittent, or transient) to
each detected fault so that the corresponding fault parameters can be
estimated. The estimated fault parameters are then used to derive the
optimal retry period, that is, the one that minimizes the mean task
completion time. Efficient algorithms are developed to determine the
optimal retry period on-line upon detection of each fault. The proposed
retry policy is evaluated by comparing it with a number of
fixed-retry-period policies, and it is always found to outperform them.
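
As a rough illustration of the idea of updating a prior from retry outcomes and then choosing a retry period, the following Python sketch treats a deliberately simplified case. It is not the algorithm developed in this work. Every modeling choice in it is an assumption made for illustration: only transient faults are considered, the active duration of a fault is taken to be exponential with an unknown rate, the prior on that rate is a Gamma distribution, and abandoning retry is assumed to incur a fixed time penalty `fallback_cost` (for example, a rollback or restart). The names `GammaBelief`, `expected_cost`, and `optimal_retry_period` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GammaBelief:
    """Assumed Gamma(shape, rate) belief over the rate at which a fault disappears."""
    shape: float
    rate: float

    def update(self, elapsed: float, recovered: bool) -> "GammaBelief":
        # Conjugate update for exponential durations. A retry that succeeds
        # after `elapsed` is an exact observation; a retry abandoned after
        # `elapsed` is right-censored and only adds exposure time.
        return GammaBelief(self.shape + (1.0 if recovered else 0.0),
                           self.rate + elapsed)

    def survival(self, r: float) -> float:
        # Posterior-predictive probability that the fault is still active
        # after retrying for a period r.
        return (self.rate / (self.rate + r)) ** self.shape


def expected_cost(belief: GammaBelief, r: float, fallback_cost: float) -> float:
    """Expected time lost: E[min(duration, r)] + fallback_cost * P(duration > r)."""
    a, b = belief.shape, belief.rate
    if abs(a - 1.0) < 1e-12:
        retry_time = b * math.log((b + r) / b)
    else:
        retry_time = b ** a * ((b + r) ** (1.0 - a) - b ** (1.0 - a)) / (1.0 - a)
    return retry_time + fallback_cost * belief.survival(r)


def optimal_retry_period(belief: GammaBelief, fallback_cost: float) -> float:
    # Under the assumptions above, the derivative of expected_cost in r is
    # survival(r) * (1 - fallback_cost * shape / (rate + r)), so the minimum
    # is at r = shape * fallback_cost - rate, clipped at zero.
    return max(0.0, belief.shape * fallback_cost - belief.rate)


belief = GammaBelief(shape=2.0, rate=1.0)
first_period = optimal_retry_period(belief, fallback_cost=5.0)

# A retry that fails for the whole period makes a long-lived fault more
# plausible, so the next recommended retry period shrinks.
belief_after_failure = belief.update(elapsed=first_period, recovered=False)
next_period = optimal_retry_period(belief_after_failure, fallback_cost=5.0)
```

In this toy model the recommended period depends on the history of retry outcomes through the updated belief, which is the qualitative behavior that a fixed-retry-period policy cannot reproduce.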