Reinforcement Learning (RL) is the study of programs that improve their performance by receiving rewards and punishments from the environment. Most RL methods optimize the discounted total reward received by an agent, while, in many domains, the natural criterion is to optimize the average reward per time step. In this paper, we introduce a model-based average-reward reinforcement learning method called H-learning and show that it converges more quickly and robustly than its discounted counterpart in the domain of scheduling a simulated Automatic Guided Vehicle (AGV). We also introduce a version of H-learning that automatically explores the unexplored parts of the state space, while always choosing greedy actions with respect to the current value function. We show that this "Auto-exploratory H-learning" performs better than the previously studied exploration strategies. To scale H-learning to larger state spaces, we extend it to learn action models and reward functions in the form of dynamic Bayesian networks, and to approximate its value function using local linear regression. We show that both of these extensions are effective in significantly reducing the space requirements of H-learning and making it converge faster on some AGV scheduling tasks. (C) 1998 Published by Elsevier Science B.V.
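For concreteness, the average-reward criterion and the optimality equation that model-based average-reward methods such as H-learning solve can be sketched as follows; the notation here is our own shorthand (assuming a unichain MDP with learned transition model $P$ and reward function $r$), not quoted from the paper:
\[
\rho^{\pi} \;=\; \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}\!\left[\sum_{t=0}^{N-1} r_t \,\middle|\, \pi\right],
\qquad
h(s) \;=\; \max_{a} \Big[\, r(s,a) \;-\; \rho \;+\; \sum_{s'} P(s' \mid s, a)\, h(s') \Big],
\]
where $\rho$ is the gain (average reward per step) of an optimal policy and $h$ is the associated bias (relative value) function, in contrast to the discounted criterion $\mathbb{E}\big[\sum_{t} \gamma^{t} r_t\big]$ optimized by methods such as Q-learning.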