A large class of sequential decision-making problems under uncertainty, in which the underlying probability structure is a Markov process, can be modeled as stochastic dynamic programs (referred to, in general, as Markov decision problems or MDPs). However, the computational complexity of the classical MDP algorithms, such as value iteration and policy iteration, is prohibitive, growing intractably with the size of the problem and its related data. Furthermore, these techniques require, for each action, the one-step transition probability and reward matrices, which are often unrealistic to obtain for large and complex systems. Recently, there has been much interest in a simulation-based stochastic approximation framework called reinforcement learning (RL) for computing near-optimal policies for MDPs. RL has been successfully applied to very large problems, such as elevator scheduling and dynamic channel allocation in cellular telephone systems. In this paper, we extend RL to a more general class of decision tasks referred to as semi-Markov decision problems (SMDPs); in particular, we focus on SMDPs under the average-reward criterion. We present a new model-free RL algorithm called SMART (Semi-Markov Average Reward Technique) and study it in detail on a combinatorially large problem: determining the optimal preventive maintenance schedule of a production inventory system. Numerical results from both the theoretical model and the RL algorithm are presented and compared.
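
For reference, the average-reward criterion for SMDPs is commonly formalized as the gain of a stationary policy \pi, namely the ratio of expected accumulated reward to expected accumulated time over decision epochs (the notation below is ours; the abstract itself does not define it):

    \rho^{\pi} = \lim_{n \to \infty}
        \frac{E\left[ \sum_{k=0}^{n-1} r\big(s_k, \pi(s_k)\big) \right]}
             {E\left[ \sum_{k=0}^{n-1} \tau\big(s_k, \pi(s_k)\big) \right]},

where r(s, a) is the expected lump reward and \tau(s, a) the expected sojourn time at the k-th decision epoch; the objective is to find a policy maximizing \rho^{\pi}.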
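
The abstract does not spell out the SMART update rule. The following minimal Python sketch illustrates a SMART-style model-free update for average-reward SMDPs, assuming a temporal-difference rule of the form R(s,a) <- R(s,a) + alpha * (r - rho * tau + max_a' R(s',a') - R(s,a)), with the reward-rate estimate rho maintained as cumulative reward over cumulative simulated time. All names here (smart_update, rho_state, the greedy flag) are illustrative, not taken from the paper.

    from collections import defaultdict

    def smart_update(R, rho_state, s, a, r, tau, s_next, actions, alpha, greedy):
        """One SMART-style update for an average-reward semi-Markov decision problem.

        R         : mapping (state, action) -> relative action value
        rho_state : [total_reward, total_time] accumulators for the reward-rate estimate
        r, tau    : lump reward and sojourn time of the simulated transition (s, a) -> s_next
        greedy    : whether the action taken was greedy; the rate estimate is
                    refreshed only on greedy (non-exploratory) steps
        """
        total_reward, total_time = rho_state
        rho = total_reward / total_time if total_time > 0 else 0.0
        best_next = max(R[(s_next, b)] for b in actions)
        # Temporal-difference update: lump reward net of the time-weighted
        # average reward, plus the best relative value of the next state.
        R[(s, a)] += alpha * (r - rho * tau + best_next - R[(s, a)])
        if greedy:
            rho_state[0] += r
            rho_state[1] += tau

    # Illustrative usage with a table of action values initialized to zero:
    R = defaultdict(float)
    rho_state = [0.0, 0.0]
    smart_update(R, rho_state, s=0, a=1, r=5.0, tau=2.0, s_next=0,
                 actions=[0, 1], alpha=0.1, greedy=True)

The key design point in such average-reward methods is that each transition's reward is discounted not by a fixed factor but by the time the transition consumes, via the term rho * tau, so that long sojourn times are penalized in proportion to the estimated reward rate.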