<efrbr:recordSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:efrbr="http://vfrbr.info/efrbr/1.1" xmlns:efrbr-work="http://vfrbr.info/efrbr/1.1/work" xmlns:efrbr-expression="http://vfrbr.info/efrbr/1.1/expression" xmlns:efrbr-manifestation="http://vfrbr.info/efrbr/1.1/manifestation" xmlns:efrbr-person="http://vfrbr.info/efrbr/1.1/person" xmlns:efrbr-corporateBody="http://vfrbr.info/efrbr/1.1/corporateBody" xmlns:efrbr-concept="http://vfrbr.info/efrbr/1.1/concept" xmlns:efrbr-structure="http://vfrbr.info/efrbr/1.1/structure" xmlns:efrbr-responsible="http://vfrbr.info/efrbr/1.1/responsible" xmlns:efrbr-subject="http://vfrbr.info/efrbr/1.1/subject" xmlns:efrbr-other="http://vfrbr.info/efrbr/1.1/other" xsi:schemaLocation="http://vfrbr.info/efrbr/1.1 http://vfrbr.info/schemas/1.1/efrbr.xsd"><efrbr:entities><efrbr-work:work identifier="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26"><efrbr-work:titleOfTheWork>Visual recognition of text in images for question answering using deep learning</efrbr-work:titleOfTheWork></efrbr-work:work><efrbr-expression:expression identifier="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26"><efrbr-expression:titleOfTheExpression>Visual recognition of text in images for question answering using deep learning</efrbr-expression:titleOfTheExpression><efrbr-expression:titleOfTheExpression>Οπτική αναγνώριση κειμένου σε εικόνες για ερωταπαντήσεις με χρήση βαθιάς μάθησης</efrbr-expression:titleOfTheExpression><efrbr-expression:formOfExpression vocabulary="DIAS:TYPES">
            Διπλωματική Εργασία
            Diploma Work
         </efrbr-expression:formOfExpression><efrbr-expression:dateOfExpression type="issued">2024-08-01</efrbr-expression:dateOfExpression><efrbr-expression:dateOfExpression type="published">2024</efrbr-expression:dateOfExpression><efrbr-expression:languageOfExpression vocabulary="iso639-1">en</efrbr-expression:languageOfExpression><efrbr-expression:summarizationOfContent>Visual Question Answering (VQA) is a complex challenge that combines the domains of Computer Vision and Natural Language Processing. The key idea behind VQA is to automatically answer questions, posed as natural language text, about the content of a digital color image that is also provided as part of the input. The answer is likewise delivered as natural language text. This diploma thesis explores the development of a VQA model that builds on existing systems trained on millions of samples using deep learning techniques. More specifically, the two systems utilized are EfficientNetB0 as the image feature extractor and BERT for question embedding. The feature maps generated by these two components are concatenated and subsequently passed through a convolutional neural network architecture with two dense layers, which is responsible for making predictions. The goal of this architecture is to correctly classify each input, consisting of a question and an image, into an answer selected from a predefined set of 500 possible responses. Training the model involved leveraging Colab Pro GPUs, experimenting with various configurations to optimize performance, and employing a range of callbacks for enhanced training stability. The resulting model demonstrated good performance in many cases, accurately recognizing objects, understanding scenes, and performing spatial reasoning to answer questions related to the input image. These results are illustrated through a series of correctly and incorrectly predicted answers on selected instances. 
Finally, limitations, future extensions and potential applications of the proposed approach are discussed. </efrbr-expression:summarizationOfContent><efrbr-expression:summarizationOfContent>Η Απάντηση Ερωτήσεων μέσω Οπτικών Δεδομένων (Visual Question Answering, VQA) είναι μια σύνθετη πρόκληση που συνδυάζει τους τομείς της Υπολογιστικής Όρασης και της Επεξεργασίας Φυσικής Γλώσσας. Η βασική ιδέα πίσω από το VQA είναι να μπορεί να απαντά κανείς αυτόματα σε ερωτήσεις, που παρέχονται με τη μορφή κειμένου φυσικής γλώσσας, σχετικά με το περιεχόμενο μιας ψηφιακής έγχρωμης εικόνας, που παρέχεται επίσης ως μέρος της εισόδου. Η απάντηση πρέπει να παραδοθεί επίσης στην ίδια μορφή κειμένου φυσικής γλώσσας. Η παρούσα διπλωματική εργασία διερευνά την ανάπτυξη ενός μοντέλου VQA, αξιοποιώντας υπάρχοντα συστήματα, εκπαιδευμένα σε εκατομμύρια δεδομένα χρησιμοποιώντας τεχνικές βαθιάς μηχανικής μάθησης. Πιο συγκεκριμένα, τα δύο συστήματα που αξιοποιήθηκαν είναι: το EfficientNetB0 ως εργαλείο εξαγωγής χαρακτηριστικών εικόνας και το BERT για την ενσωμάτωση ερωτήσεων. Οι χάρτες χαρακτηριστικών που παράγονται από αυτά τα δύο στοιχεία συνενώνονται και στη συνέχεια τροφοδοτούνται σε μια συνελικτική αρχιτεκτονική νευρωνικού δικτύου με δύο πυκνά στρώματα, τα οποία είναι υπεύθυνα για την πραγματοποίηση προβλέψεων. Ο στόχος της αρχιτεκτονικής αυτού του μοντέλου είναι να ταξινομήσει σωστά τις εισόδους, που αποτελούνται από μια ερώτηση και μια εικόνα, σε απαντήσεις που επιλέγονται από ένα προκαθορισμένο σύνολο 500 πιθανών επιλογών. Η εκπαίδευση του μοντέλου περιελάμβανε κατάλληλη χρήση των Pro GPUs του Colab, καθώς και πειραματισμό με διάφορες διαμορφώσεις για τη βελτιστοποίηση της απόδοσης και την εφαρμογή μιας σειράς από callbacks για τη βελτίωση της σταθερότητας της εκπαίδευσης. 
Το μοντέλο που προέκυψε επέδειξε καλή απόδοση σε πολλές περιπτώσεις, αναγνωρίζοντας αντικείμενα με ακρίβεια, κατανοώντας σκηνές και εκτελώντας χωροταξικούς συλλογισμούς για να απαντάει σε ερωτήσεις σχετικές με την εικόνα εισόδου. Αυτά τα αποτελέσματα παρουσιάζονται μέσω μιας σειράς σωστών και λανθασμένων προβλεπόμενων απαντήσεων σε επιλεγμένες περιπτώσεις. Τέλος, συζητούνται εκτενώς περιορισμοί, μελλοντικές επεκτάσεις και πιθανές εφαρμογές της προτεινόμενης προσέγγισης.</efrbr-expression:summarizationOfContent><efrbr-expression:contextForTheExpression>Διπλωματική Εργασία που υποβλήθηκε στη σχολή ΗΜΜΥ του Πολ. Κρήτης για την πλήρωση προϋποθέσεων λήψης του Πτυχίου.</efrbr-expression:contextForTheExpression><efrbr-expression:useRestrictionsOnTheExpression type="creative-commons">http://creativecommons.org/licenses/by/4.0/</efrbr-expression:useRestrictionsOnTheExpression><efrbr-expression:note type="academic unit">Πολυτεχνείο Κρήτης::Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών</efrbr-expression:note></efrbr-expression:expression><efrbr-manifestation:manifestation identifier="https://dias.library.tuc.gr/view/100600"><efrbr-manifestation:titleOfTheManifestation>Vlachos_Konstantinos_Dip_2024.pdf</efrbr-manifestation:titleOfTheManifestation><efrbr-manifestation:publicationDistribution><efrbr-manifestation:placeOfPublicationDistribution type="distribution">Chania [Greece]</efrbr-manifestation:placeOfPublicationDistribution><efrbr-manifestation:publisherDistributor type="distributor">Library of TUC</efrbr-manifestation:publisherDistributor><efrbr-manifestation:dateOfPublicationDistribution>2024-07-31</efrbr-manifestation:dateOfPublicationDistribution></efrbr-manifestation:publicationDistribution><efrbr-manifestation:formOfCarrier>application/pdf</efrbr-manifestation:formOfCarrier><efrbr-manifestation:extentOfTheCarrier>25.0 
MB</efrbr-manifestation:extentOfTheCarrier><efrbr-manifestation:accessRestrictionsOnTheManifestation>free</efrbr-manifestation:accessRestrictionsOnTheManifestation></efrbr-manifestation:manifestation><efrbr-person:person identifier="http://users.isc.tuc.gr/~kvlachos1"><efrbr-person:nameOfPerson vocabulary="TUC:LDAP">
            Vlachos Konstantinos
            Βλαχος Κωνσταντινος
         </efrbr-person:nameOfPerson></efrbr-person:person><efrbr-person:person identifier="http://users.isc.tuc.gr/~mzervakis"><efrbr-person:nameOfPerson vocabulary="TUC:LDAP">
            Zervakis Michail
            Ζερβακης Μιχαηλ
         </efrbr-person:nameOfPerson></efrbr-person:person><efrbr-person:person identifier="http://users.isc.tuc.gr/~lagoudakis"><efrbr-person:nameOfPerson vocabulary="TUC:LDAP">
            Lagoudakis Michail
            Λαγουδακης Μιχαηλ
         </efrbr-person:nameOfPerson></efrbr-person:person><efrbr-person:person identifier="http://users.isc.tuc.gr/~ppartsinevelos"><efrbr-person:nameOfPerson vocabulary="TUC:LDAP">
            Partsinevelos Panagiotis
            Παρτσινεβελος Παναγιωτης
         </efrbr-person:nameOfPerson></efrbr-person:person><efrbr-corporateBody:corporateBody identifier="C7A559D5-8DB7-4DE0-A880-7C598D30ACBB"><efrbr-corporateBody:nameOfTheCorporateBody vocabulary="">
            Πολυτεχνείο Κρήτης
            Technical University of Crete
         </efrbr-corporateBody:nameOfTheCorporateBody></efrbr-corporateBody:corporateBody><efrbr-concept:concept identifier="A0A856CD-0694-452C-99EA-90B70DBF5352"><efrbr-concept:termForTheConcept>
            Visual question answering
         </efrbr-concept:termForTheConcept></efrbr-concept:concept><efrbr-concept:concept identifier="2711C11A-B4B7-4EDD-8251-A99BBCF25954"><efrbr-concept:termForTheConcept>
            Computer vision
         </efrbr-concept:termForTheConcept></efrbr-concept:concept><efrbr-concept:concept identifier="F702CDDF-E90B-49EE-BB2A-266F88508E3E"><efrbr-concept:termForTheConcept>
            Natural language processing
         </efrbr-concept:termForTheConcept></efrbr-concept:concept><efrbr-concept:concept identifier="11992AE3-BB27-4193-AD16-267FFCB4CEDA"><efrbr-concept:termForTheConcept>
            Convolutional neural network
         </efrbr-concept:termForTheConcept></efrbr-concept:concept><efrbr-concept:concept identifier="3F102B0C-B38C-4826-83DF-F40AEA83704D"><efrbr-concept:termForTheConcept>
            Recurrent neural network
         </efrbr-concept:termForTheConcept></efrbr-concept:concept></efrbr:entities><efrbr:relationships><efrbr-structure:structureRelations><efrbr-structure:realizedThrough sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26" targetEntity="expression" targetURI="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26"/><efrbr-structure:embodiedIn sourceEntity="expression" sourceURI="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26" targetEntity="manifestation" targetURI="http://purl.tuc.gr/dl/dias/C64A1E80-7873-4566-885E-D01846AA57E4"/></efrbr-structure:structureRelations><efrbr-responsible:responsibleRelations><efrbr-responsible:createdBy sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26" targetEntity="person" targetURI="http://users.isc.tuc.gr/~kvlachos1"/><efrbr-responsible:realizedBy sourceEntity="expression" sourceURI="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26" targetEntity="person" targetURI="http://users.isc.tuc.gr/~kvlachos1" role="author"/><efrbr-responsible:realizedBy sourceEntity="expression" sourceURI="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26" targetEntity="person" targetURI="http://users.isc.tuc.gr/~mzervakis" role="http://purl.tuc.gr/dl/dias/vocabs/contributor-roles/2"/><efrbr-responsible:realizedBy sourceEntity="expression" sourceURI="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26" targetEntity="person" targetURI="http://users.isc.tuc.gr/~lagoudakis" role="http://purl.tuc.gr/dl/dias/vocabs/contributor-roles/1"/><efrbr-responsible:realizedBy sourceEntity="expression" sourceURI="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26" targetEntity="person" targetURI="http://users.isc.tuc.gr/~ppartsinevelos" role="http://purl.tuc.gr/dl/dias/vocabs/contributor-roles/2"/><efrbr-responsible:realizedBy sourceEntity="expression" 
sourceURI="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26" targetEntity="person" targetURI="C7A559D5-8DB7-4DE0-A880-7C598D30ACBB" role="publisher"/></efrbr-responsible:responsibleRelations><efrbr-subject:subjectRelations><efrbr-subject:hasSubject sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26" targetEntity="concept" targetURI="A0A856CD-0694-452C-99EA-90B70DBF5352"/><efrbr-subject:hasSubject sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26" targetEntity="concept" targetURI="2711C11A-B4B7-4EDD-8251-A99BBCF25954"/><efrbr-subject:hasSubject sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26" targetEntity="concept" targetURI="F702CDDF-E90B-49EE-BB2A-266F88508E3E"/><efrbr-subject:hasSubject sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26" targetEntity="concept" targetURI="11992AE3-BB27-4193-AD16-267FFCB4CEDA"/><efrbr-subject:hasSubject sourceEntity="work" sourceURI="http://purl.tuc.gr/dl/dias/CCD4EDF5-7A8E-4B87-83AD-394B19281B26" targetEntity="concept" targetURI="3F102B0C-B38C-4826-83DF-F40AEA83704D"/></efrbr-subject:subjectRelations><efrbr-other:otherRelations/></efrbr:relationships></efrbr:recordSet>