Solving semi-Markov decision problems using average reward reinforcement learning

Citation
T.K. Das et al., Solving semi-Markov decision problems using average reward reinforcement learning, MANAG SCI, 45(4), 1999, pp. 560-574
Number of citations
37
Subject categories
Management
Journal title
MANAGEMENT SCIENCE
ISSN journal
0025-1909
Volume
45
Issue
4
Year of publication
1999
Pages
560 - 574
Database
ISI
SICI code
0025-1909(199904)45:4<560:SSDPUA>2.0.ZU;2-H
Abstract
A large class of problems of sequential decision making under uncertainty, in which the underlying probability structure is a Markov process, can be modeled as stochastic dynamic programs (referred to, in general, as Markov decision problems or MDPs). However, the computational complexity of the classical MDP algorithms, such as value iteration and policy iteration, is prohibitive and can grow intractably with the size of the problem and its related data. Furthermore, these techniques require for each action the one-step transition probability and reward matrices, and obtaining these is often unrealistic for large and complex systems. Recently, there has been much interest in a simulation-based stochastic approximation framework called reinforcement learning (RL) for computing near-optimal policies for MDPs. RL has been successfully applied to very large problems, such as elevator scheduling and dynamic channel allocation in cellular telephone systems. In this paper, we extend RL to a more general class of decision tasks that are referred to as semi-Markov decision problems (SMDPs). In particular, we focus on SMDPs under the average-reward criterion. We present a new model-free RL algorithm called SMART (Semi-Markov Average Reward Technique). We present a detailed study of this algorithm on a combinatorially large problem of determining the optimal preventive maintenance schedule of a production inventory system. Numerical results from both the theoretical model and the RL algorithm are presented and compared.
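
Note
The full SMART algorithm is developed in the paper itself; as a rough illustration of the flavor of a model-free average-reward update for SMDPs, the sketch below shows a tabular rule in which the immediate reward is offset by the estimated average reward per unit time multiplied by the sojourn time. The environment interface (reset/step returning a reward and a sojourn time tau), the exploration and learning-rate constants, and all variable names are illustrative assumptions, not taken from the paper.

```python
import random
from collections import defaultdict

def smart_sketch(env, actions, steps=10_000, alpha=0.1, epsilon=0.1):
    # Illustrative tabular sketch of a SMART-style update; env is a hypothetical
    # simulator whose step() returns (next_state, reward, tau), with tau the
    # sojourn time spent in the current state under the chosen action.
    R = defaultdict(float)            # action-value table R[(state, action)]
    rho = 0.0                         # estimate of average reward per unit time
    total_reward, total_time = 0.0, 0.0

    state = env.reset()
    for _ in range(steps):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action, greedy = random.choice(actions), False
        else:
            action, greedy = max(actions, key=lambda a: R[(state, a)]), True

        next_state, reward, tau = env.step(action)

        # average-reward SMDP update: subtract rho * tau to account for elapsed time
        target = reward - rho * tau + max(R[(next_state, a)] for a in actions)
        R[(state, action)] += alpha * (target - R[(state, action)])

        # refresh the average-reward estimate only on greedy (non-exploratory) steps
        if greedy:
            total_reward += reward
            total_time += tau
            rho = total_reward / total_time

        state = next_state
    return R, rho
```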