The work titled "A systematic evaluation of the PPO algorithm for deep reinforcement learning in lane-free autonomous driving" by Akrivopoulos Grigorios is licensed under a Creative Commons Attribution 4.0 International license.
Bibliographic Citation
Grigorios Akrivopoulos, "A systematic evaluation of the PPO algorithm for deep reinforcement learning in lane-free autonomous driving", Diploma Work, School of Production Engineering and Management, Technical University of Crete, Chania, Greece, 2024
https://doi.org/10.26233/heallink.tuc.101775
Lane-free traffic is a novel paradigm targeting environments comprised entirely of Connected and Automated Vehicles (CAVs), where CAVs do not adhere to traffic lanes but may occupy any lateral position within the road boundaries. This gives rise to many research opportunities and innovative applications. At the same time, the field of Deep Reinforcement Learning (DRL) has gained momentum and continues to advance rapidly, with active lines of research in autonomous driving applications. Specifically, Proximal Policy Optimization (PPO) is a recently introduced on-policy DRL algorithm and is considered one of the most prominent for modern DRL applications. To date, research on DRL in lane-free traffic has examined algorithms unrelated to PPO, or to on-policy algorithms in general. To this end, we build upon existing work on DRL in single-agent lane-free environments, where a CAV, acting as the learning agent, has the task of learning a lane-free vehicle movement strategy while navigating a road populated with other CAVs. To apply PPO effectively in this setting, we extend an existing Markov Decision Process formulation of the problem with several new components and systematically evaluate their influence on the agent’s learning performance. First, we put forward an image-based state representation of surrounding traffic that captures the two-dimensional movement of CAVs, and compare it with the existing vector-based state input. Then, we formulate and examine different reward function terms that are better suited to PPO. Moreover, we develop a blocking environment setting in which the agent’s actions are filtered under certain critical conditions. There, instead of the fully unconstrained learning environment, we observe the impact of a practical constraint that better guides the learning process away from the local maxima we commonly encountered in practice.
Our experimental evaluation shows the improvement that each of the above-mentioned enhancements for PPO provides in the single-agent lane-free environment. The results indicate the agent’s capacity to learn strategies that overcome the inferior-quality solutions initially observed under the original formulation, which targeted other methods. Motivated by these results, we believe that the proposed enhancements can serve as groundwork for future endeavours with PPO and other DRL methods in lane-free traffic.