An important application of reinforcement learning (RL) is to finite-state control problems, and one of the most difficult problems in learning for control is balancing the exploration/exploitation tradeoff. Existing theoretical results for RL give very little guidance on reasonable ways to perform exploration. In this paper, we examine the convergence of single-step on-policy RL algorithms for control. On-policy algorithms cannot separate exploration from learning and therefore must confront the exploration problem directly. We prove convergence results for several related on-policy algorithms with both decaying exploration and persistent exploration. We also provide examples of exploration strategies that can be followed during learning that result in convergence to both optimal values and optimal policies.
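For concreteness, the sketch below illustrates the kind of setup the abstract refers to: a single-step on-policy update (a Sarsa-style rule) paired with a decaying epsilon-greedy exploration schedule. The toy two-state MDP, the specific decay rate, and the step-size schedule are assumptions chosen for illustration; they are not taken from the paper itself.

```python
# Illustrative sketch only: a single-step on-policy (Sarsa-style) update with
# decaying epsilon-greedy exploration. The two-state MDP below is a made-up
# example, not one from the paper.

import random

N_STATES, N_ACTIONS = 2, 2
GAMMA = 0.9

def step(state, action):
    """Toy deterministic MDP: action 1 moves toward state 1, which pays reward 1."""
    next_state = 1 if action == 1 else 0
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
visits = [[0] * N_ACTIONS for _ in range(N_STATES)]

def epsilon_greedy(state, t):
    """Decaying exploration: epsilon shrinks toward zero, but slowly enough
    that every action keeps being tried."""
    epsilon = 1.0 / (1 + t) ** 0.5
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)                       # explore
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])      # exploit

state = 0
action = epsilon_greedy(state, 0)
for t in range(1, 50001):
    next_state, reward = step(state, action)
    next_action = epsilon_greedy(next_state, t)    # on-policy: the action actually taken next
    visits[state][action] += 1
    alpha = 1.0 / visits[state][action]            # decaying step size
    # Single-step on-policy update: bootstrap from the action the behavior policy will take.
    Q[state][action] += alpha * (reward + GAMMA * Q[next_state][next_action] - Q[state][action])
    state, action = next_state, next_action

print(Q)  # in this toy chain, Q[s][1] should dominate Q[s][0] in both states
```

Because the exploration probability decays to zero while still letting every state-action pair be visited infinitely often, the behavior policy itself becomes greedy in the limit; this is the rough intuition behind the "decaying exploration" results the abstract mentions.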