<efrbr:recordSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:efrbr="http://vfrbr.info/efrbr/1.1" xmlns:efrbr-work="http://vfrbr.info/efrbr/1.1/work" xmlns:efrbr-expression="http://vfrbr.info/efrbr/1.1/expression" xmlns:efrbr-manifestation="http://vfrbr.info/efrbr/1.1/manifestation" xmlns:efrbr-person="http://vfrbr.info/efrbr/1.1/person" xmlns:efrbr-corporateBody="http://vfrbr.info/efrbr/1.1/corporateBody" xmlns:efrbr-concept="http://vfrbr.info/efrbr/1.1/concept" xmlns:efrbr-structure="http://vfrbr.info/efrbr/1.1/structure" xmlns:efrbr-responsible="http://vfrbr.info/efrbr/1.1/responsible" xmlns:efrbr-subject="http://vfrbr.info/efrbr/1.1/subject" xmlns:efrbr-other="http://vfrbr.info/efrbr/1.1/other" xsi:schemaLocation="http://vfrbr.info/efrbr/1.1 http://vfrbr.info/schemas/1.1/efrbr.xsd"><efrbr:entities><efrbr-work:work identifier="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3"><efrbr-work:titleOfTheWork>Directed exploration of policy space in reinforcement learning</efrbr-work:titleOfTheWork></efrbr-work:work><efrbr-expression:expression identifier="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3"><efrbr-expression:titleOfTheExpression>Directed exploration of policy space in reinforcement learning</efrbr-expression:titleOfTheExpression><efrbr-expression:titleOfTheExpression>Κατευθυνόμενη αναζήτηση του χώρου πολιτικών στην ενισχυτική μάθηση</efrbr-expression:titleOfTheExpression><efrbr-expression:formOfExpression vocabulary="DIAS:TYPES">
            Διδακτορική Διατριβή
            Doctoral Dissertation
         </efrbr-expression:formOfExpression><efrbr-expression:dateOfExpression type="issued">2018-09-07</efrbr-expression:dateOfExpression><efrbr-expression:dateOfExpression type="published">2018</efrbr-expression:dateOfExpression><efrbr-expression:languageOfExpression vocabulary="iso639-1">en</efrbr-expression:languageOfExpression><efrbr-expression:summarizationOfContent>Reinforcement learning refers to a broad class of learning problems. Autonomous agents typically try to learn how to achieve their goal solely by interacting with their environment. They perform a trial-and-error search and they receive delayed rewards (or penalties). The challenge is to learn a good or even optimal decision policy, one that maximizes the total long-term reward. A decision policy for an autonomous agent is the knowledge of what to do in any possible state in order to achieve the long-term goal efficiently.

Several recent learning approaches within decision making under uncertainty suggest the use of classifiers for the compact (approximate) representation of policies. However, the space of possible policies, even under such structured representations, is huge and must be searched carefully to avoid computationally expensive policy simulations.

In this dissertation, our first contribution uncovers policy structure by deriving optimal policies for two standard two-dimensional reinforcement learning domains, namely the Inverted Pendulum and the Mountain Car. We found that optimal policies have significant structure and a high degree of locality, i.e., dominant actions persist over large continuous areas of the state space. This observation justifies the use of classifiers for approximate policy representation.

Our second and main contribution is the proposal of two Directed Policy Search algorithms for the efficient exploration of the policy space represented by Support Vector Machines and Relevance Vector Machines. The first algorithm exploits the structure of the classifiers used for policy representation. The second algorithm ranks states with an importance function based on action prevalence. In both approaches, the search over the state space is focused on areas where the dominant action changes. This directed focus on critical parts of the state space iteratively refines and improves the underlying policy and delivers excellent control policies in only a few iterations with a relatively small rollout budget, yielding significant savings in computation time.

We demonstrate the proposed algorithms and compare them to prior work on three standard reinforcement learning domains: Inverted Pendulum (two-dimensional), Mountain Car (two-dimensional), and Acrobot (four-dimensional). Additionally, we demonstrate the scalability of the proposed approaches on the problem of learning how to control a 4-Link, Under-Actuated, Planar Robot, which corresponds to an eight-dimensional problem that is well known in the control theory community. In all cases, the proposed approaches strike a balance between efficiency and effort, yielding sufficiently good policies without an excessive number of learning steps.</efrbr-expression:summarizationOfContent><efrbr-expression:summarizationOfContent>Η ενισχυτική μάθηση αναφέρεται σε μια ευρεία κατηγορία προβλημάτων μάθησης. Οι αυτόνομες οντότητες τυπικά προσπαθούν να μάθουν να επιτυγχάνουν το στόχο τους αποκλειστικά μέσω της αλληλεπίδρασης με το περιβάλλον τους. Κάνουν διερευνητικές προσπάθειες αναζήτησης μέσω δοκιμών και ελέγχων και λαμβάνουν με καθυστέρηση ανταμοιβές (ή ποινές). Η πρόκληση είναι να μάθουν μια ικανοποιητική ή ακόμα και βέλτιστη πολιτική λήψης αποφάσεων, η οποία να μεγιστοποιεί τη συνολική μακροπρόθεσμη ανταμοιβή. Μια πολιτική λήψης αποφάσεων για μια αυτόνομη οντότητα είναι η γνώση του τι πρέπει να κάνει σε κάθε πιθανή κατάσταση προκειμένου να επιτευχθεί αποτελεσματικά ο μακροπρόθεσμος στόχος.

Πολλές πρόσφατες προσεγγίσεις μάθησης για τη λήψη αποφάσεων υπό αβεβαιότητα προτείνουν τη χρήση ταξινομητών για την συμπαγή (προσεγγιστική) αναπαράσταση πολιτικών. Ωστόσο, ο χώρος των πιθανών πολιτικών, ακόμα και κάτω από τέτοιες δομημένες αναπαραστάσεις, είναι τεράστιος και πρέπει να αναζητηθεί προσεκτικά για να αποφευχθούν υπολογιστικά ακριβές προσομοιώσεις πολιτικών.

Σε αυτή τη διατριβή, η πρώτη μας συμβολή σχετίζεται με την ανίχνευση δομής σε βέλτιστες πολιτικές. Εξετάσαμε βέλτιστες πολιτικές για δύο βασικά πεδία ενισχυτικής μάθησης δύο διαστάσεων, το Inverted Pendulum και το Mountain Car. Διαπιστώσαμε ότι οι βέλτιστες πολιτικές τους έχουν σημαντική δομή και υψηλό βαθμό τοπικότητας, δηλαδή οι κυρίαρχες ενέργειες παραμένουν ίδιες σε μεγάλες συνεχείς περιοχές εντός του χώρου καταστάσεων. Η παρατήρηση αυτή παρέχει επαρκή αιτιολόγηση για την καταλληλότητα των ταξινομητών για προσεγγιστική αναπαράσταση πολιτικών.

Η δεύτερη και κύρια συμβολή μας είναι η πρόταση δύο αλγορίθμων για την κατευθυνόμενη αναζήτηση του χώρου πολιτικών με τη χρήση των ταξινομητών SVM και RVM. Ο πρώτος αλγόριθμος εκμεταλλεύεται τη δομή των ταξινομητών που χρησιμοποιούνται για την αναπαράσταση της πολιτικής. Ο δεύτερος αλγόριθμος χρησιμοποιεί μια συνάρτηση σημαντικότητας των καταστάσεων, βάσει της επικράτησης των ενεργειών. Και στις δύο προσεγγίσεις, η αναζήτηση στον χώρο καταστάσεων επικεντρώνεται σε περιοχές όπου υπάρχει αλλαγή κυρίαρχης ενέργειας. Αυτή η κατευθυνόμενη εστίαση σε κρίσιμα τμήματα του χώρου καταστάσεων οδηγεί επαναληπτικά σε εκλέπτυνση και βελτίωση της τρέχουσας πολιτικής. Λίγες μόνο επαναλήψεις αρκούν για την παραγωγή εξαιρετικών πολιτικών με σχετικά χαμηλό αριθμό προσομοιώσεων, καταλήγοντας σε σημαντική εξοικονόμηση χρόνου.

Παρουσιάζουμε τους προτεινόμενους αλγόριθμους και τους συγκρίνουμε με τις προηγούμενες εργασίες σε τρία βασικά πεδία μελέτης της ενισχυτικής μάθησης: Inverted Pendulum (δύο διαστάσεων), Mountain Car (δύο διαστάσεων) και Acrobot (τεσσάρων διαστάσεων). Επιπροσθέτως, επιδεικνύουμε την επεκτασιμότητα των προτεινόμενων προσεγγίσεων στο πρόβλημα της μάθησης για τον έλεγχο ενός 4-Link Planar Robot, το οποίο αντιστοιχεί σε ένα πρόβλημα οκτώ διαστάσεων, γνωστό στην κοινότητα της θεωρίας ελέγχου. Σε όλες τις περιπτώσεις, οι προτεινόμενες προσεγγίσεις επιτυγχάνουν μια ισορροπία μεταξύ αποτελεσματικότητας και προσπάθειας, αποδίδοντας επαρκώς καλές πολιτικές σε σύντομο χρονικό διάστημα, χωρίς υπερβολικό αριθμό βημάτων μάθησης.</efrbr-expression:summarizationOfContent><efrbr-expression:contextForTheExpression>Διατριβή που παρεδόθη για να καλύψει μερικώς τις απαιτήσεις απόκτησης του Διδακτορικού Διπλώματος στη Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών του Πολυτεχνείου Κρήτης</efrbr-expression:contextForTheExpression><efrbr-expression:contextForTheExpression>Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the School of Electrical and Computer Engineering of the Technical University of Crete, Greece</efrbr-expression:contextForTheExpression><efrbr-expression:useRestrictionsOnTheExpression type="creative-commons">http://creativecommons.org/licenses/by/4.0/</efrbr-expression:useRestrictionsOnTheExpression><efrbr-expression:note type="academic unit">Πολυτεχνείο Κρήτης::Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών</efrbr-expression:note></efrbr-expression:expression><efrbr-manifestation:manifestation identifier="http://purl.tuc.gr/dl/dias/F793DAC8-CD6A-4497-BBD3-128BAF81A735"><efrbr-manifestation:titleOfTheManifestation>Rexakis_Ioannis_PhD_2018.pdf</efrbr-manifestation:titleOfTheManifestation><efrbr-manifestation:publicationDistribution><efrbr-manifestation:placeOfPublicationDistribution 
type="distribution">Chania [Greece]</efrbr-manifestation:placeOfPublicationDistribution><efrbr-manifestation:publisherDistributor type="distributor">Library of TUC</efrbr-manifestation:publisherDistributor><efrbr-manifestation:dateOfPublicationDistribution>2018-09-07</efrbr-manifestation:dateOfPublicationDistribution></efrbr-manifestation:publicationDistribution><efrbr-manifestation:formOfCarrier>application/pdf</efrbr-manifestation:formOfCarrier><efrbr-manifestation:extentOfTheCarrier>3.0 MB</efrbr-manifestation:extentOfTheCarrier><efrbr-manifestation:accessRestrictionsOnTheManifestation>free</efrbr-manifestation:accessRestrictionsOnTheManifestation></efrbr-manifestation:manifestation><efrbr-person:person identifier="http://users.isc.tuc.gr/~irexakis"><efrbr-person:nameOfPerson vocabulary="TUC:LDAP">
            Rexakis Ioannis
            Ρεξακης Ιωαννης
         </efrbr-person:nameOfPerson></efrbr-person:person><efrbr-person:person identifier="http://users.isc.tuc.gr/~lagoudakis"><efrbr-person:nameOfPerson vocabulary="TUC:LDAP">
            Lagoudakis Michail
            Λαγουδακης Μιχαηλ
         </efrbr-person:nameOfPerson></efrbr-person:person><efrbr-person:person identifier="http://users.isc.tuc.gr/~epetrakis"><efrbr-person:nameOfPerson vocabulary="TUC:LDAP">
            Petrakis Evripidis
            Πετρακης Ευριπιδης
         </efrbr-person:nameOfPerson></efrbr-person:person><efrbr-person:person identifier="http://users.isc.tuc.gr/~apotamianos"><efrbr-person:nameOfPerson vocabulary="TUC:LDAP">
            Potamianos Alexandros
            Ποταμιανος Αλεξανδρος
         </efrbr-person:nameOfPerson></efrbr-person:person><efrbr-person:person identifier="http://users.isc.tuc.gr/~gchalkiadakis"><efrbr-person:nameOfPerson vocabulary="TUC:LDAP">
            Chalkiadakis Georgios
            Χαλκιαδακης Γεωργιος
         </efrbr-person:nameOfPerson></efrbr-person:person><efrbr-person:person identifier="http://users.isc.tuc.gr/~mzervakis"><efrbr-person:nameOfPerson vocabulary="TUC:LDAP">
            Zervakis Michail
            Ζερβακης Μιχαηλ
         </efrbr-person:nameOfPerson></efrbr-person:person><efrbr-person:person identifier="88E30D22-C379-418E-8BD5-7F4DA6864AA3"><efrbr-person:nameOfPerson vocabulary="">
            Blekas Konstantinos
            Μπλεκας Κωνσταντινος
         </efrbr-person:nameOfPerson></efrbr-person:person><efrbr-person:person identifier="http://users.isc.tuc.gr/~nvlassis"><efrbr-person:nameOfPerson vocabulary="TUC:LDAP">
            Vlassis Nikolaos
            Βλασσης Νικολαος
         </efrbr-person:nameOfPerson></efrbr-person:person><efrbr-corporateBody:corporateBody identifier="3F4594A9-CA1E-4CB5-B9C9-66DEFE6A41E1"><efrbr-corporateBody:nameOfTheCorporateBody vocabulary="">
            Πολυτεχνείο Κρήτης
            Technical University of Crete
         </efrbr-corporateBody:nameOfTheCorporateBody></efrbr-corporateBody:corporateBody><efrbr-concept:concept identifier="577E9B65-604B-4AB6-825D-A18FF4851006"><efrbr-concept:termForTheConcept>
            Μηχανική μάθηση
            Machine learning
         </efrbr-concept:termForTheConcept></efrbr-concept:concept><efrbr-concept:concept identifier="20572022-F440-4BE8-9DBB-099F3473274E"><efrbr-concept:termForTheConcept>
            Ενισχυτική μάθηση
            Reinforcement learning
         </efrbr-concept:termForTheConcept></efrbr-concept:concept><efrbr-concept:concept identifier="AAFCAE0A-61B1-4519-AEEC-6A2F1884A3BC"><efrbr-concept:termForTheConcept>
            Ταξινόμηση
            Classification
         </efrbr-concept:termForTheConcept></efrbr-concept:concept><efrbr-concept:concept identifier="A53AFA3A-7498-497F-9AEC-6606A64B3FAB"><efrbr-concept:termForTheConcept>
            Λήψη αποφάσεων υπό αβεβαιότητα
            Decision making under uncertainty
         </efrbr-concept:termForTheConcept></efrbr-concept:concept><efrbr-concept:concept identifier="F603CC8D-F43D-4396-81B4-C6EA397F6492"><efrbr-concept:termForTheConcept>
            Προβλήματα ελέγχου
            Control problems
         </efrbr-concept:termForTheConcept></efrbr-concept:concept><efrbr-concept:concept identifier="C5AB06BC-A98B-4B64-BD19-D0FDDC79F73D"><efrbr-concept:termForTheConcept>
            Πολυδιάστατοι χώροι
            Multi-dimensional spaces
         </efrbr-concept:termForTheConcept></efrbr-concept:concept><efrbr-concept:concept identifier="92E0210A-AA71-4E8C-86B7-6C98045D78FE"><efrbr-concept:termForTheConcept>
            Classifiers
            Ταξινομητές
         </efrbr-concept:termForTheConcept></efrbr-concept:concept><efrbr-concept:concept identifier="F151B606-7D88-4C7F-AAA7-FD86A13FB9B4"><efrbr-concept:termForTheConcept>
            Policy representation
            Αναπαράσταση πολιτικών
         </efrbr-concept:termForTheConcept></efrbr-concept:concept><efrbr-concept:concept identifier="D84CE0F8-B7B4-4503-BC98-A9BA586F9705"><efrbr-concept:termForTheConcept>
            Directed resampling
            Κατευθυνόμενη δειγματοληψία
         </efrbr-concept:termForTheConcept></efrbr-concept:concept></efrbr:entities><efrbr:relationships><efrbr-structure:structureRelations><efrbr-structure:realizedThrough sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="expression" targetURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3"/><efrbr-structure:embodiedIn sourceEntity="expression" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="manifestation" targetURI="http://purl.tuc.gr/dl/dias/F793DAC8-CD6A-4497-BBD3-128BAF81A735"/></efrbr-structure:structureRelations><efrbr-responsible:responsibleRelations><efrbr-responsible:createdBy sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="person" targetURI="http://users.isc.tuc.gr/~irexakis"/><efrbr-responsible:realizedBy sourceEntity="expression" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="person" targetURI="http://users.isc.tuc.gr/~irexakis" role="author"/><efrbr-responsible:realizedBy sourceEntity="expression" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="person" targetURI="http://users.isc.tuc.gr/~lagoudakis" role="http://purl.tuc.gr/dl/dias/vocabs/contributor-roles/1"/><efrbr-responsible:realizedBy sourceEntity="expression" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="person" targetURI="http://users.isc.tuc.gr/~epetrakis" role="http://purl.tuc.gr/dl/dias/vocabs/contributor-roles/2"/><efrbr-responsible:realizedBy sourceEntity="expression" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="person" targetURI="http://users.isc.tuc.gr/~apotamianos" role="http://purl.tuc.gr/dl/dias/vocabs/contributor-roles/2"/><efrbr-responsible:realizedBy sourceEntity="expression" 
sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="person" targetURI="http://users.isc.tuc.gr/~gchalkiadakis" role="http://purl.tuc.gr/dl/dias/vocabs/contributor-roles/2"/><efrbr-responsible:realizedBy sourceEntity="expression" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="person" targetURI="http://users.isc.tuc.gr/~mzervakis" role="http://purl.tuc.gr/dl/dias/vocabs/contributor-roles/2"/><efrbr-responsible:realizedBy sourceEntity="expression" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="person" targetURI="88E30D22-C379-418E-8BD5-7F4DA6864AA3" role="http://purl.tuc.gr/dl/dias/vocabs/contributor-roles/2"/><efrbr-responsible:realizedBy sourceEntity="expression" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="person" targetURI="http://users.isc.tuc.gr/~nvlassis" role="http://purl.tuc.gr/dl/dias/vocabs/contributor-roles/2"/><efrbr-responsible:realizedBy sourceEntity="expression" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="person" targetURI="3F4594A9-CA1E-4CB5-B9C9-66DEFE6A41E1" role="publisher"/></efrbr-responsible:responsibleRelations><efrbr-subject:subjectRelations><efrbr-subject:hasSubject sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="concept" targetURI="577E9B65-604B-4AB6-825D-A18FF4851006"/><efrbr-subject:hasSubject sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="concept" targetURI="20572022-F440-4BE8-9DBB-099F3473274E"/><efrbr-subject:hasSubject sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="concept" targetURI="AAFCAE0A-61B1-4519-AEEC-6A2F1884A3BC"/><efrbr-subject:hasSubject sourceEntity="work" 
sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="concept" targetURI="A53AFA3A-7498-497F-9AEC-6606A64B3FAB"/><efrbr-subject:hasSubject sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="concept" targetURI="F603CC8D-F43D-4396-81B4-C6EA397F6492"/><efrbr-subject:hasSubject sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="concept" targetURI="C5AB06BC-A98B-4B64-BD19-D0FDDC79F73D"/><efrbr-subject:hasSubject sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="concept" targetURI="92E0210A-AA71-4E8C-86B7-6C98045D78FE"/><efrbr-subject:hasSubject sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="concept" targetURI="F151B606-7D88-4C7F-AAA7-FD86A13FB9B4"/><efrbr-subject:hasSubject sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/B71794BD-645B-4A9A-977A-FDC713DF4DE3" targetEntity="concept" targetURI="D84CE0F8-B7B4-4503-BC98-A9BA586F9705"/></efrbr-subject:subjectRelations><efrbr-other:otherRelations/></efrbr:relationships></efrbr:recordSet>