A large class of sequential decision-making problems under uncertainty, in which the underlying probability structure is a Markov process, can be modeled as stochastic dynamic programs (referred to, in general, as Markov decision problems or MDPs). However, the computational complexity of the classical MDP algorithms, such as value iteration and policy iteration, is prohibitive, growing intractably with the size of the problem and its related data. Furthermore, these techniques require, for each action, the one-step transition probability and reward matrices, which are often unrealistic to obtain for large and complex systems. Recently, there has been much interest in a simulation-based stochastic approximation framework called reinforcement learning (RL) for computing near-optimal policies for MDPs. RL has been successfully applied to very large problems, such as elevator scheduling and dynamic channel allocation in cellular telephone systems. In this paper, we extend RL to a more general class of decision tasks referred to as semi-Markov decision problems (SMDPs); in particular, we focus on SMDPs under the average-reward criterion. We present a new model-free RL algorithm called SMART (Semi-Markov Average Reward Technique) and study it in detail on a combinatorially large problem: determining the optimal preventive maintenance schedule of a production inventory system. Numerical results from both the theoretical model and the RL algorithm are presented and compared.
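
For reference, the average-reward criterion for SMDPs is commonly formalized as the gain of a stationary policy \pi, namely the ratio of expected accumulated reward to expected accumulated time over decision epochs (the notation below is ours; the abstract itself does not define it):

    \rho^{\pi} = \lim_{n \to \infty}
        \frac{E\left[ \sum_{k=0}^{n-1} r\big(s_k, \pi(s_k)\big) \right]}
             {E\left[ \sum_{k=0}^{n-1} \tau\big(s_k, \pi(s_k)\big) \right]},

where r(s, a) is the expected lump reward and \tau(s, a) the expected sojourn time at the k-th decision epoch; the objective is to find a policy maximizing \rho^{\pi}.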
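
The abstract does not spell out the SMART update rule. The following minimal Python sketch illustrates a SMART-style model-free update for average-reward SMDPs, assuming a temporal-difference rule of the form R(s,a) <- R(s,a) + alpha * (r - rho * tau + max_a' R(s',a') - R(s,a)), with the reward-rate estimate rho maintained as cumulative reward over cumulative simulated time. All names here (smart_update, rho_state, the greedy flag) are illustrative, not taken from the paper.

    from collections import defaultdict

    def smart_update(R, rho_state, s, a, r, tau, s_next, actions, alpha, greedy):
        """One SMART-style update for an average-reward semi-Markov decision problem.

        R         : mapping (state, action) -> relative action value
        rho_state : [total_reward, total_time] accumulators for the reward-rate estimate
        r, tau    : lump reward and sojourn time of the simulated transition (s, a) -> s_next
        greedy    : whether the action taken was greedy; the rate estimate is
                    refreshed only on greedy (non-exploratory) steps
        """
        total_reward, total_time = rho_state
        rho = total_reward / total_time if total_time > 0 else 0.0
        best_next = max(R[(s_next, b)] for b in actions)
        # Temporal-difference update: lump reward net of the time-weighted
        # average reward, plus the best relative value of the next state.
        R[(s, a)] += alpha * (r - rho * tau + best_next - R[(s, a)])
        if greedy:
            rho_state[0] += r
            rho_state[1] += tau

    # Illustrative usage with a table of action values initialized to zero:
    R = defaultdict(float)
    rho_state = [0.0, 0.0]
    smart_update(R, rho_state, s=0, a=1, r=5.0, tau=2.0, s_next=0,
                 actions=[0, 1], alpha=0.1, greedy=True)

The key design point in such average-reward methods is that each transition's reward is discounted not by a fixed factor but by the time the transition consumes, via the term rho * tau, so that long sojourn times are penalized in proportion to the estimated reward rate.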