M. G. Lagoudakis and R. Parr. (2001, Dec.). Model-free least-squares policy iteration. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.4345&rep=rep1&type=pdf
We propose a new approach to reinforcement learning which combines least-squares function approximation with policy iteration. Our method is model-free and completely off-policy. We are motivated by the least-squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems; however, it has heretofore not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least-Squares Policy Iteration (LSPI), addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We test LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
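A minimal sketch of the LSPI loop described in the abstract, assuming a user-supplied linear feature map `phi(s, a)` and a finite action set; the regularization term, convergence tolerance, and other hyperparameters are illustrative choices, not values taken from the paper.

```python
import numpy as np

def lstd_q(samples, phi, policy, n_features, gamma=0.95):
    """One LSTD-Q evaluation step: fit linear Q-function weights for `policy`
    from off-policy samples of the form (s, a, r, s_next, done)."""
    A = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for s, a, r, s_next, done in samples:
        f = phi(s, a)
        # Bootstrap with the action the evaluated policy would take in s_next.
        f_next = np.zeros(n_features) if done else phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    # Small ridge term (an assumption here) keeps the system well-conditioned.
    return np.linalg.solve(A + 1e-6 * np.eye(n_features), b)

def lspi(samples, phi, actions, n_features, gamma=0.95, tol=1e-4, max_iter=50):
    """Least-Squares Policy Iteration: alternate LSTD-Q policy evaluation with
    greedy policy improvement until the weight vector stops changing."""
    w = np.zeros(n_features)
    for _ in range(max_iter):
        # Greedy policy with respect to the current Q-function estimate.
        policy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
        w_new = lstd_q(samples, phi, policy, n_features, gamma)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```

Because the samples enter only through the accumulated matrix `A` and vector `b`, the same batch of transitions (e.g., from random trials, as in the bicycle experiment) can be reused at every policy-improvement step.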