Michail Gemistos, "Implementation of an intelligent agent for the AIBIRDS competition", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2019
https://doi.org/10.26233/heallink.tuc.83630
The broad field of Artificial Intelligence (AI) strives to reproduce human behavior on machines. Machine Learning, as a subfield, and more specifically Reinforcement Learning (RL), enables autonomous agents to take suitable actions under different circumstances through a trial-and-error learning process, without being programmed for every possible scenario they may encounter. Since 2013, the International Joint Conference on Artificial Intelligence (IJCAI) has hosted the Angry Birds AI Competition (AIBIRDS), where various AI agents compete on the Angry Birds computer game. The agents compete on unknown game levels without any human intervention.

In this thesis, we designed two agents for AIBIRDS following the principles of two well-known RL algorithms, namely Q-Learning and Least Squares Policy Iteration (LSPI). Both of them are model-free RL algorithms that try to learn the best action at each step (policy) for any given game scene. Since the action and state spaces of the game are extremely large, and due to the absence of a model describing the transition from one state to the next under a chosen action, we used an approximation architecture to represent the learned Q values, which estimate the quality of each action in each state. The approximation uses a set of eight basis functions (features) we designed, which aim to describe a game scene effectively, each weighted by its own parameter (weight).

In our experiments, the Q-Learning agent is trained for 20,000 iterations, updating its weights incrementally during the course of training and converging to their final values when the iterations are completed. At each iteration, the Q-Learning agent stores locally each observed sample of interaction with the game, which includes the current state, the action taken, the new state, and the reward gained. The LSPI agent is then trained on the stored set of samples to find its own set of weights and thus its own policy. When training ends for both Q-Learning and LSPI on the same observed samples, we test each agent on 54 different levels taken directly from the AIBIRDS competition: 34 of them are the levels our agents were trained on and 20 are completely new to the agents. The Q-Learning agent successfully completes 68% of these levels and the LSPI agent 81% of them, occasionally performing precise shots with amazing results.
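To illustrate the two learning schemes described above, the following is a minimal sketch (not the thesis code) of Q-learning with linear function approximation over eight basis functions and an LSPI-style batch solve over stored (state, action, reward, next-state) samples. The feature map phi, the discount factor, the step size, the regularization term, and the iteration count are all assumed placeholders; the thesis designed its own eight features for Angry Birds scenes.

import numpy as np

NUM_FEATURES = 8          # eight hand-designed basis functions, as in the thesis
GAMMA = 0.9               # discount factor (assumed value)
ALPHA = 0.01              # Q-learning step size (assumed value)

def phi(state, action):
    """Hypothetical feature map describing a game scene and a shot choice."""
    return np.zeros(NUM_FEATURES)  # placeholder; replace with real features

def q_value(weights, state, action):
    """Approximate Q value as a weighted sum of the basis functions."""
    return float(weights @ phi(state, action))

def q_learning_update(weights, sample, actions):
    """Incremental weight update from one (s, a, r, s') interaction sample."""
    s, a, r, s_next = sample
    best_next = max(q_value(weights, s_next, b) for b in actions)
    td_error = r + GAMMA * best_next - q_value(weights, s, a)
    return weights + ALPHA * td_error * phi(s, a)

def lspi_weights(samples, actions, iterations=10):
    """Repeated LSTDQ solves (policy iteration) over the stored sample set."""
    w = np.zeros(NUM_FEATURES)
    for _ in range(iterations):
        A = np.zeros((NUM_FEATURES, NUM_FEATURES))
        b = np.zeros(NUM_FEATURES)
        for s, a, r, s_next in samples:
            a_next = max(actions, key=lambda c: q_value(w, s_next, c))
            f = phi(s, a)
            A += np.outer(f, f - GAMMA * phi(s_next, a_next))
            b += r * f
        # small ridge term (assumed) keeps the system solvable
        w = np.linalg.solve(A + 1e-6 * np.eye(NUM_FEATURES), b)
    return w

In this setup the Q-Learning agent would call q_learning_update once per observed sample during its 20,000 training iterations, while the LSPI agent would run lspi_weights once over the full stored sample set to obtain its own weight vector and policy.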