The work with title "A multi-modal Q-learning approach using normalized advantage functions and deep neural networks" by Petridis Christos is licensed under Creative Commons Attribution 4.0 International
Bibliographic Citation
Christos Petridis, "A multi-modal Q-learning approach using normalized advantage functions and deep neural networks", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2019
https://doi.org/10.26233/heallink.tuc.82891
Reinforcement Learning, a branch of Machine Learning geared towards the development of Autonomous Agents, has evolved rapidly in recent years as a means of solving sequential decision problems. The development of robust Deep Neural Networks has also played a crucial role in this success. The combination of these two areas eventually led to Deep Reinforcement Learning, a state-of-the-art field that has already demonstrated great potential and tremendous results in continuous control tasks.

In order to contribute to this effort, the present thesis investigates an extension of Normalized Advantage Functions (NAFs) to multi-modal representations, such as multiple quadratics and Radial Basis Functions (RBFs). More specifically, we focus on a continuous variant of the well-known Q-learning algorithm with experience replay, combined with the NAF representation and deep neural networks. The original NAF representation is unimodal by design, given that its quadratic advantage function offers only one mode, so performance may be lost due to the inability to explore and capture complex representations with multiple modes. To tackle this problem, this thesis proposes two multi-modal representations as a simple solution: the first uses multiple quadratic terms, whereas the second uses RBFs. In each case, the action advantage is formulated by two different methods. The first uses the sum of equally weighted advantage terms, which are derived as outputs of the neural network; the second applies the argmax operator over the advantage terms. Both methods avoid any direct interaction with the neural network, thus making the proposed architectures more efficient.

To evaluate our implementation, simulation tests were run on Roboschool, an open-source platform integrated into the broader OpenAI Gym framework, which provides different environments for testing reinforcement learning algorithms. In our case, we used six environments (pendulum, inverted pendulum, inverted double pendulum, humanoid, ant, walker2d), which feature different simulated robots and consist of continuous control tasks. Our results showed a significant improvement in the performance and efficiency of the proposed multi-modal algorithm compared to the original unimodal one, albeit at the cost of some increase in computation time. We observed that the outcome for each task differs, as it depends on the values of several hyper-parameters, with batch normalization, learning rate, and exploration noise being the most sensitive ones. This thesis is a first step towards a full-scale extension to multi-modal representations and their application to more complex environments, yielding even more robust solutions to continuous control tasks.
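For concreteness, the formulations described in the abstract can be sketched as follows; the notation (number of modes M, means \mu_i, positive-definite matrices P_i, weights w_i) is our own shorthand for the abstract's description, not necessarily the thesis's exact symbols. The original NAF decomposes the Q-function as

Q(s, a) = V(s) + A(s, a),
A(s, a) = -\tfrac{1}{2}\, (a - \mu(s))^\top P(s)\, (a - \mu(s)),

which is maximized by the single mode a = \mu(s). The two multi-modal compositions of the advantage would then read

A(s, a) = \sum_{i=1}^{M} A_i(s, a)   (sum of equally weighted terms),
A(s, a) = \max_{i} A_i(s, a)         (the maximizing term, i.e. the "argmax" method),

where each A_i(s, a) is either a quadratic term -\tfrac{1}{2}\, (a - \mu_i(s))^\top P_i(s)\, (a - \mu_i(s)) or, in the RBF variant, a kernel such as w_i(s)\, \exp(-\beta_i \lVert a - \mu_i(s) \rVert^2), with all parameters produced as outputs of the neural network.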
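A minimal NumPy sketch of the two composition rules for the multi-quadratic case, assuming the NAF-style lower-triangular parameterization P = L L^\top; function names and shapes are hypothetical illustrations, not the thesis's actual code:

import numpy as np

def quadratic_advantage(a, mu, L):
    """Single NAF-style quadratic term: -1/2 (a - mu)^T P (a - mu), with P = L L^T."""
    P = L @ L.T
    d = a - mu
    return -0.5 * d @ P @ d

def multimodal_advantage(a, mus, Ls, mode="sum"):
    """Combine M quadratic advantage terms by summation or by taking the maximum."""
    terms = np.array([quadratic_advantage(a, mu, L) for mu, L in zip(mus, Ls)])
    return terms.sum() if mode == "sum" else terms.max()

# Toy example: 2-D action, M = 3 modes with random "network outputs".
rng = np.random.default_rng(0)
a = rng.standard_normal(2)
mus = rng.standard_normal((3, 2))
Ls = [np.tril(rng.standard_normal((2, 2))) for _ in range(3)]
print(multimodal_advantage(a, mus, Ls, mode="sum"))
print(multimodal_advantage(a, mus, Ls, mode="max"))

Because the combination happens after the network has emitted all \mu_i and L_i in a single forward pass, neither rule requires any additional queries of the network, which is consistent with the abstract's efficiency remark.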