Ioannis Rexakis, "Directed exploration of policy space in reinforcement learning", Doctoral Dissertation, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2018
https://doi.org/10.26233/heallink.tuc.78690
Reinforcement learning refers to a broad class of learning problems. Autonomous agents typically try to learn how to achieve their goal solely by interacting with their environment. They perform a trial-and-error search and receive delayed rewards (or penalties). The challenge is to learn a good or even optimal decision policy, one that maximizes the total long-term reward. A decision policy for an autonomous agent is the knowledge of what to do in any possible state in order to achieve the long-term goal efficiently.
Several recent learning approaches within decision making under uncertainty suggest the use of classifiers for the compact (approximate) representation of policies. However, the space of possible policies, even under such structured representations, is huge and must be searched carefully to avoid computationally expensive policy simulations.
In this dissertation, our first contribution uncovers policy structure by deriving optimal policies for two standard two-dimensional reinforcement learning domains, namely the Inverted Pendulum and the Mountain Car. We found that optimal policies have significant structure and a high degree of locality, i.e., dominant actions persist over large continuous areas within the state space. This observation provides sufficient justification for using classifiers for approximate policy representation.
Our second and main contribution is the proposal of two Directed Policy Search algorithms for the efficient exploration of the policy space provided by Support Vector Machines and Relevance Vector Machines. The first algorithm exploits the structure of the classifiers used for policy representation. The second algorithm uses an importance function, based on action prevalence, to rank the states. In both approaches, the search over the state space is focused on areas where the dominant action changes. This directed focus on critical parts of the state space iteratively refines and improves the underlying policy and delivers excellent control policies in only a few iterations with a relatively small rollout budget, yielding significant savings in computation time.
We demonstrate the proposed algorithms and compare them to prior work on three standard reinforcement learning domains: Inverted Pendulum (two-dimensional), Mountain Car (two-dimensional), and Acrobot (four-dimensional). Additionally, we demonstrate the scalability of the proposed approaches on the problem of learning how to control a 4-Link, Under-Actuated, Planar Robot, an eight-dimensional problem that is well known in the control theory community. In all cases, the proposed approaches strike a balance between efficiency and effort, yielding sufficiently good policies without an excessive number of learning steps.
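
The idea of ranking states by action prevalence can be illustrated with a small sketch. The abstract does not define the importance function, so everything below is an assumption made for illustration: the helper names `action_prevalence` and `rank_by_importance`, the use of k nearest neighbours, and the score `1 - prevalence` are hypothetical stand-ins, not the dissertation's method. The sketch only shows how states near a change of dominant action could be pushed to the front of a rollout queue.

```python
import math
from collections import Counter

def action_prevalence(state, labeled, k=5):
    """Share of the k nearest labeled states that agree on the most common action.

    Hypothetical helper: `labeled` is a list of (state, action) pairs.
    """
    nearest = sorted(labeled, key=lambda item: math.dist(state, item[0]))[:k]
    counts = Counter(action for _, action in nearest)
    return counts.most_common(1)[0][1] / len(nearest)

def rank_by_importance(candidates, labeled, k=5):
    """Return (importance, state) pairs, most important first.

    Assumed importance score: 1 - prevalence, so states in regions where
    no single action dominates are ranked highest.
    """
    scored = [(1.0 - action_prevalence(s, labeled, k), s) for s in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

# Toy policy on a line embedded in 2-D: action 0 for x < 0, action 1 otherwise.
labeled = [((x / 10.0, 0.0), 0 if x < 0 else 1) for x in range(-10, 11)]
candidates = [(-0.85, 0.0), (0.02, 0.0), (0.75, 0.0)]
ranked = rank_by_importance(candidates, labeled, k=5)
```

In this toy setting the candidate closest to the point where the action switches receives the highest score, while candidates deep inside a single-action region score zero and would not consume rollout budget first.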