Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter β ∈ [0, 1) (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter β is related to the mixing time of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
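To make the storage claim concrete, the following Python sketch illustrates the shape of a GPOMDP-style estimator under an assumed environment and policy interface (`env`, `sample_action`, and their signatures are illustrative assumptions, not the paper's notation): the only per-step state is an eligibility trace and a running gradient average, each the size of the parameter vector, which is where the "twice the number of policy parameters" bound comes from.

```python
import numpy as np

def gpomdp_gradient_estimate(env, sample_action, theta, beta, T):
    """Sketch of a GPOMDP-style estimate of the average-reward gradient.

    Assumed (hypothetical) interface:
      env.reset() -> initial observation
      env.step(action) -> (next observation, reward)
      sample_action(theta, obs) -> (action, grad_log_prob), where
        grad_log_prob is the gradient w.r.t. theta of the log-probability
        of the sampled action under the parameterized stochastic policy.
    """
    z = np.zeros_like(theta)      # eligibility trace, same size as theta
    delta = np.zeros_like(theta)  # running gradient estimate, same size as theta
    obs = env.reset()
    for t in range(T):
        action, grad_log_prob = sample_action(theta, obs)
        obs, reward = env.step(action)
        z = beta * z + grad_log_prob              # discount past credit by beta
        delta += (reward * z - delta) / (t + 1)   # running average of reward * trace
    return delta
```

In this sketch, taking β closer to 1 lengthens the memory of the eligibility trace, which reduces the bias of the estimate at the cost of increased variance; this is the bias-variance trade-off referred to above.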