Alternating policy iteration: an analysis and future directions

Bacharis Athanasios

Full record

URI:

http://purl.tuc.gr/dl/dias/6FEFC5DC-E7CD-439E-8A37-E1DBD4D6E364

Year

2019

Type of Item

Diploma Work

License

Details

Bibliographic Citation

Αθανάσιος Μπαχάρης, "Εναλλασσόμενη επανάληψη πολιτικής: μία ανάλυση και μελλοντικές κατευθύνσεις", Διπλωματική Εργασία, Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών, Πολυτεχνείο Κρήτης, Χανιά, Ελλάς, 2019 https://doi.org/10.26233/heallink.tuc.83626

Appears in Collections

Diploma Works in Community School of Electrical and Computer Engineering

Diploma Works in Community Intelligent Systems Laboratory

Summary

Markov Decision Processes (MDPs) constitute a powerful mathematical model for decision making under uncertainty. They have been used widely in a number of application areas such as economics, operation research, health care and robotics. In their fundamental form, solving an MDP to derive its optimal policy is computationally expensive, and the problem is only exacerbated in its high dimensions (i.e., in large state-action spaces). To this end, a number of approximate solution methods have been proposed over time, tackling time and space complexity in various ways. An interesting approach has been proposed in 2015, by Panagopoulos et al, that utilizes an iterative optimization method to optimize over state-action sub-spaces. Although the idea of iteratively optimizing over sub-spaces, is not new in optimization theory, this algorithm was perhaps the first to propose such an approach in the context of MDPs. The same paper also illustrates the success of such an approach in controlling a solar tracking system. Nevertheless, that work does not illustrate clearly how this new algorithm scales along with problem size, nor how it compares with typical policy iteration or value iteration approaches; and could not be used in environments that do not allow the execution of the actions computed after optimization in each separate dimensions. Intuitively, this corresponds to situations where we have information aliasing phenomena. Information aliasing is a concept which appears in many scientific fields, such as telecommunications and robotics, and describes the loss of information due to dimensionality reduction.As such, in this thesis we provide a novel variant of the alternating policy iteration algorithm that resolves the aforementioned aliasing issues, and provide a comparison with policy iteration and value iteration. We show empirically that Aliasing Aware Alternating Policy Iteration (AAAPI) can converge to the optimal solutions (policies), in the presence of information aliasing phenomena. Also, the computational complexity of this algorithm is directly related to the intensity of information aliasing. In environments where information aliasing is not intense, AAAPI converges faster than policy iteration and value iteration; but in high-aliasing environments like the maze-grid, the AAAPI convergence rate is substantially reduced. Finally, we provide a discussion on a possible AAAPI multi-agent extension.

Αναζήτηση

Πλοήγηση

Ο Χώρος μου

Alternating policy iteration: an analysis and future directions

Bacharis Athanasios

Summary

Available Files

Services

Export

Share

Statistics

Metadata & Content in a METS Package:

Metadata in Format: