The work titled "Implementation of decision trees for data streams in the Spark Streaming platform" by Ziakas Christos is licensed under Creative Commons Attribution 4.0 International
Bibliographic Citation
Christos Ziakas, "Implementation of decision trees for data streams in the Spark Streaming platform", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2018
https://doi.org/10.26233/heallink.tuc.78767
In the era of big data, enormous amounts of data are created, replicated, and transferred every day. Current technology for handling and analyzing vast amounts of data allows us to develop applications for problems (e.g., DNA sequence analysis, medical imaging, traffic control) that could not previously be solved efficiently. In particular, the time required to process large volumes of data can be reduced substantially by using distributed computing platforms such as Apache Spark. The Apache Spark framework includes implementations for large-scale machine learning, distributed data stream processing, and parallel graph analytics. The Spark Streaming platform provides scalable and fault-tolerant data stream processing. However, only a limited number of distributed incremental machine learning algorithms have been implemented for the Spark Streaming platform.

In this thesis, we propose a parallel implementation of an incremental and scalable tree learning method for classification in Spark Streaming, the Hoeffding decision tree. Our implementation uses horizontal data parallelism in the shared-nothing architecture of Spark. The Hoeffding bound guarantees with high confidence that the Hoeffding decision tree is asymptotically identical to a batch-learning one. The high-dimensional statistics required for evaluating splits are stored as sparse matrices in main memory across the Spark cluster. These statistics are updated as soon as new training instances become available. Furthermore, distributed computations are performed to identify the optimal split and to assess whether the splitting criterion is satisfied. The generated model is used to classify colors based on the spectral signature of each color. Each color has a different chemical composition and, as a consequence, a different spectral signature.
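
The splitting criterion mentioned above can be illustrated with a minimal, single-machine sketch of the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), where R is the range of the split metric, delta is the allowed error probability, and n is the number of instances seen at a leaf. The function names `hoeffding_bound` and `should_split`, the parameter values, and the use of plain Python instead of Spark are assumptions made for illustration; this is not the thesis implementation.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range (R): range of the split metric; for information gain
                     with c classes this is commonly taken as log2(c).
    delta:           allowed probability that the bound does not hold.
    n:               number of training instances observed at the leaf.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the best attribute beats the runner-up by more than epsilon."""
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > epsilon


# Hypothetical gains for the two best attributes at a leaf (binary classes, R = 1).
print(should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=200))
print(should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000))
```

With the same observed gains, the decision changes only because more instances have arrived: epsilon shrinks as n grows, so a leaf defers splitting until the stream has supplied enough evidence.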