Institutional Repository
Technical University of Crete

A multi-modal Q-learning approach using normalized advantage functions and deep neural networks

Petridis Christos

URI: http://purl.tuc.gr/dl/dias/5B1A91B3-A0A3-44C2-A68A-F569935872DD
Year: 2019
Type of Item: Diploma Work
Bibliographic Citation: Christos Petridis, "A multi-modal Q-learning approach using normalized advantage functions and deep neural networks", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2019. https://doi.org/10.26233/heallink.tuc.82891

Summary

Reinforcement Learning, a branch of Machine Learning geared towards the development of autonomous agents, has evolved rapidly in recent years as a means of solving sequential decision problems. The development of robust Deep Neural Networks has also played a crucial role in this success. The combination of these two areas eventually led to Deep Reinforcement Learning, a state-of-the-art field that has already demonstrated great potential and impressive results in continuous control tasks.

To contribute to this effort, the present thesis investigates an extension of Normalized Advantage Functions (NAFs) to multi-modal representations, such as multiple quadratics and Radial Basis Functions (RBFs). More specifically, we focus on a continuous variant of the well-known Q-learning algorithm with experience replay, combined with the NAF representation and deep neural networks. The original NAF representation is unimodal by design, since the quadratic advantage function offers only one mode; this can cost performance because such a representation cannot explore and capture complex representations with multiple modes. To tackle this problem, the thesis proposes two simple multi-modal representations: the first uses multiple quadratic terms, whereas the second uses RBFs. In each case, the action advantage is formulated by two different methods. The first takes the sum of equally weighted advantage terms, which are derived as outputs of the neural network; the second applies the argmax operator over the advantage terms. Both methods avoid any direct interaction with the neural network, making the proposed architectures more efficient.

To evaluate our implementation, simulation tests were run on RoboSchool, an open-source platform integrated into the broader OpenAI Gym framework, which provides a variety of environments for testing reinforcement learning algorithms. We used six environments (pendulum, inverted pendulum, inverted double pendulum, humanoid, ant, walker2d), which feature different simulated robots and consist of continuous control tasks. Our results showed a significant improvement in the performance and efficiency of the proposed multi-modal algorithm over the original unimodal one, albeit at the cost of some increase in computation time. We observed that the outcome for each task depends on the values of several hyper-parameters, with batch normalization, learning rate, and exploration noise being the most sensitive. This thesis is a first step towards a full-scale extension to multi-modal representations and their application to more complex environments, yielding even more robust solutions to continuous control tasks.
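
To make the representations above concrete, the following minimal Python/NumPy sketch illustrates the unimodal quadratic NAF advantage, an RBF-style advantage term, and the two combination rules described in the summary (an equally weighted sum and a max over K advantage terms). The function names, the exact RBF form, and the equal weights of 1/K are illustrative assumptions, not the thesis's precise formulation; in NAF, the mean action mu(s), the positive-definite matrix P(s), and the state value V(s) are all produced by the neural network.

    import numpy as np

    def quadratic_advantage(a, mu, P):
        # Unimodal NAF advantage: A(s, a) = -1/2 (a - mu(s))^T P(s) (a - mu(s)).
        d = a - mu
        return -0.5 * d @ P @ d

    def rbf_advantage(a, c, sigma, w):
        # Hypothetical RBF advantage term: w * exp(-||a - c||^2 / (2 sigma^2));
        # the thesis's exact RBF parameterization may differ.
        return w * np.exp(-np.sum((a - c) ** 2) / (2.0 * sigma ** 2))

    def q_sum(v, advantages):
        # Method 1: equally weighted sum of the K advantage terms
        # (weights of 1/K are assumed here).
        return v + sum(advantages) / len(advantages)

    def q_max(v, advantages):
        # Method 2: the argmax operator over the advantage terms,
        # i.e. Q is formed from the dominant mode only.
        return v + max(advantages)

    # Toy usage with K = 2 quadratic modes in a 2-D action space.
    a = np.array([0.3, -0.1])
    v = 1.5                                    # state value V(s)
    mus = [np.zeros(2), np.array([1.0, 1.0])]  # per-mode mean actions mu_k(s)
    Ps = [np.eye(2), 2.0 * np.eye(2)]          # per-mode matrices P_k(s)
    advs = [quadratic_advantage(a, mu, P) for mu, P in zip(mus, Ps)]
    print(q_sum(v, advs), q_max(v, advs))

With a single quadratic term, both combination rules reduce to the original unimodal NAF, which is why they can be seen as direct extensions of that representation.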
