Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity to interact with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of such value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.
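
For concreteness, the canonical instance of such a value-function-based update is the standard tabular Q-learning rule (a textbook formulation stated here only for illustration; the symbols $\alpha_t$, $\gamma$, $r_t$, $s_t$, and $a_t$ denote the usual learning rate, discount factor, reward, state, and action, none of which are defined in this abstract):

$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t \left[ r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') \right].$$

The asynchronous character referred to above is visible in this rule: each step revises only the single visited state-action pair $(s_t, a_t)$, whereas the simpler synchronous counterpart applies the corresponding update to every state-action pair at once.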