The work titled "Implementation of decision trees for data streams on the Spark Streaming platform" by Ziakas Christos is made available under the Creative Commons Attribution 4.0 International license
Bibliographic Citation
Christos Ziakas, "Implementation of decision trees for data streams on the Spark Streaming platform", Diploma Thesis, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2018
https://doi.org/10.26233/heallink.tuc.78767
In the era of big data, enormous amounts of data are created, replicated and transferred every day. Current technology for handling and analyzing vast amounts of data allows us to develop applications for various problems (e.g., DNA sequence analysis, medical imaging, traffic control) that could not previously be solved efficiently. More precisely, the time required to process large volumes of data can be minimized by using distributed computing platforms such as Apache Spark. The Apache Spark framework includes various implementations for large-scale machine learning, distributed data stream processing and parallel graph analytics. The Spark Streaming platform provides scalable and fault-tolerant data stream processing. However, only a limited number of distributed incremental machine learning algorithms are implemented in the Spark Streaming platform.

In this thesis, we propose a parallel implementation of an incremental and scalable tree learning method for classification in Spark Streaming, the Hoeffding decision tree. Our proposed implementation performs horizontal data parallelism in the shared-nothing architecture of Spark. The Hoeffding bound guarantees with high confidence that the Hoeffding decision tree is asymptotically identical to a batch-learning one. The high-dimensional statistics required for evaluating splits are stored as sparse matrices in main memory across the Spark cluster. These statistics are updated as soon as new training instances become available. Furthermore, distributed computations are performed in order to identify the optimal split and assess whether the splitting criterion is satisfied. The generated model is used to classify colors based on the spectral signature of each color. Each color has a different chemical composition and, as a consequence, a different spectral signature.
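
To illustrate the splitting criterion mentioned above, the following is a minimal single-machine sketch and not the thesis's distributed Spark implementation. It uses the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), and splits a leaf when the gap between the two best attributes exceeds epsilon. The names `LeafStats`, `hoeffding_bound` and `should_split`, the use of information gain as the split measure, nominal attributes, and the default `delta` are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def hoeffding_bound(value_range, delta, n):
    """Standard Hoeffding bound: sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


class LeafStats:
    """Hypothetical per-leaf sufficient statistics for nominal attributes."""

    def __init__(self):
        self.n = 0
        self.class_counts = defaultdict(int)
        # counts[attribute][value][label] -> number of instances seen
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def update(self, features, label):
        """Incrementally fold one training instance into the counters."""
        self.n += 1
        self.class_counts[label] += 1
        for attr, value in features.items():
            self.counts[attr][value][label] += 1

    def info_gain(self, attr):
        """Information gain of splitting this leaf on `attr`."""
        base = entropy(list(self.class_counts.values()))
        remainder = 0.0
        for by_label in self.counts[attr].values():
            subset = sum(by_label.values())
            remainder += subset / self.n * entropy(list(by_label.values()))
        return base - remainder

    def should_split(self, delta=1e-7):
        """Return the attribute to split on, or None if evidence is insufficient."""
        gains = sorted(
            ((self.info_gain(a), a) for a in self.counts),
            key=lambda pair: pair[0],
            reverse=True,
        )
        if len(gains) < 2:
            return None
        (best, attr), (second, _) = gains[0], gains[1]
        # Range of information gain is log2(number of classes).
        value_range = math.log2(max(len(self.class_counts), 2))
        eps = hoeffding_bound(value_range, delta, self.n)
        return attr if best - second > eps else None
```

In this sketch, `update` would be called once per arriving instance and `should_split` checked periodically. With few instances the bound is wide and no split is made, and as `n` grows the bound shrinks until the best attribute can be separated from the runner-up with confidence 1 - delta.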