Clustering Big Data Streams in Apache Flink

Bitsakis Theodoros

URI	http://purl.tuc.gr/dl/dias/BEEF5DFC-1996-42F3-A0E6-3F3D10EB5CF3	-
Identifier	https://doi.org/10.26233/heallink.tuc.79102	-
Language	en	-
Extent	62 pages	en
Title	Clustering Big Data Streams in Apache Flink	en
Title	Συσταδοποίηση Μεγάλων Ροών Δεδομένων στο Apache Flink	el
Creator	Bitsakis Theodoros	en
Creator	Μπιτσακης Θεοδωρος	el
Contributor [Thesis Supervisor]	Deligiannakis Antonios	en
Contributor [Thesis Supervisor]	Δεληγιαννακης Αντωνιος	el
Contributor [Committee Member]	Garofalakis Minos	en
Contributor [Committee Member]	Γαροφαλακης Μινως	el
Contributor [Committee Member]	Samoladas Vasilis	en
Contributor [Committee Member]	Σαμολαδας Βασιλης	el
Publisher	Πολυτεχνείο Κρήτης	el
Publisher	Technical University of Crete	en
Academic Unit	Technical University of Crete::School of Electrical and Computer Engineering	en
Academic Unit	Πολυτεχνείο Κρήτης::Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών	el
Content Summary	We live in the era of Big Data, where massive amounts of information are generated continuously from numerous types of sources. Today’s goal is to apply techniques that take into consideration the volume, the variety and the velocity of the data, in order to gain insight that could not be revealed with traditional data processing application software. Cluster analysis is a technique that groups a set of objects such that objects in the same group have similar properties. It is commonly used in the fields of machine learning, data mining, statistical data analysis, pattern recognition and bioinformatics. In this thesis, we propose a parallel implementation of the well-known unsupervised learning algorithm StreamKM++ for clustering data streams in an online fashion. For the development phase, the Apache Flink framework is chosen as a distributed streaming engine offering high-throughput, low-latency and fault-tolerant computations over unbounded and bounded data streams. Initially, we introduce the theoretical background of the implemented algorithm and the distributed framework. Afterwards, we propose a parallel implementation which computes the set of cluster centers after the input dataset has been consumed. In addition, we propose an alternative implementation which periodically issues requests for the re-evaluation of the cluster centers. Finally, we develop a program that exploits the Queryable State feature of Flink to allow the user to query the most up-to-date values of the cluster centers. Experimental evaluation shows that increasing the level of parallelism significantly reduces the running time while slightly improving the quality of the clustering.	en
Type of Item	Διπλωματική Εργασία	el
Type of Item	Diploma Work	en
License	http://creativecommons.org/licenses/by/4.0/	en
Date of Item	2018-10-11	-
Date of Publication	2018	-
Subject	Apache Flink	en
Subject	Data Stream Clustering	en
Bibliographic Citation	Theodoros Bitsakis, "Clustering Big Data Streams in Apache Flink", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2018	en
Bibliographic Citation	Θεόδωρος Μπιτσάκης, "Συσταδοποίηση Μεγάλων Ροών Δεδομένων στο Apache Flink", Διπλωματική Εργασία, Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών, Πολυτεχνείο Κρήτης, Χανιά, Ελλάς, 2018	el
