This paper proposes a simulation-based algorithm for optimizing the average reward in a finite-state Markov reward process that depends on a set of parameters. As a special case, the method applies to Markov decision processes where optimization takes place within a parametrized set of policies. The algorithm relies on the regenerative structure of finite-state Markov processes, involves the simulation of a single sample path, and can be implemented online. A convergence result (with probability 1) is provided.
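To make the idea concrete, the following is a minimal, hypothetical sketch of this style of method, not the paper's actual algorithm. It assumes a toy two-state Markov reward process in which the probability of staying in state `s` is `sigmoid(theta[s])`, takes state 0 as the regeneration state, and applies a likelihood-ratio gradient estimate of the average reward at each regeneration along a single simulated sample path. All names (`simulate_and_optimize`, `theta`, `eta`) and the specific update rules are illustrative assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def simulate_and_optimize(r, n_cycles=2000, lr=0.02, seed=0):
    """Illustrative sketch (not the paper's algorithm): optimize the
    average reward of a two-state Markov reward process by simulating a
    single sample path and taking a likelihood-ratio gradient step at
    each return to the regeneration state (state 0)."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]   # in state s, P(stay in s) = sigmoid(theta[s])
    eta = 0.0            # running estimate of the average reward
    for _ in range(n_cycles):
        s = 0
        z = [0.0, 0.0]   # cumulative score d(log P)/d(theta) since regeneration
        g = [0.0, 0.0]   # gradient estimate accumulated over the cycle
        cycle_r = 0.0
        cycle_len = 0
        while True:
            # credit the current reward to all earlier transitions in the cycle
            cycle_r += r[s]
            cycle_len += 1
            for k in range(2):
                g[k] += (r[s] - eta) * z[k]
            # sample the next transition and record its score
            p_stay = sigmoid(theta[s])
            stay = rng.random() < p_stay
            z[s] += (1.0 - p_stay) if stay else -p_stay
            s = s if stay else 1 - s
            if s == 0:   # regeneration: end the cycle and update online
                break
        eta += 0.1 * (cycle_r / cycle_len - eta)
        for k in range(2):
            theta[k] += lr * g[k] / cycle_len
    return theta, eta

# State 0 pays reward 1 per step and state 1 pays nothing, so the updates
# should drive theta[0] up (stay in state 0) and theta[1] down (leave state 1).
theta, eta = simulate_and_optimize([1.0, 0.0])
```

Because the parameter update happens only at regeneration times and uses quantities accumulated along the current cycle, the whole procedure runs online from one sample path, in the spirit of the abstract.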