Dimitra Gkoutziouli, "Synopses over streaming data at Apache Flink", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2019
https://doi.org/10.26233/heallink.tuc.83503
A growing number of applications demand algorithms and data structures that enable the efficient processing of data sets ranging from gigabytes to terabytes to petabytes. Massive amounts of information generated continuously from numerous types of sources are called Big Data: data of greater variety, arriving in increasing volumes and with ever-higher velocity. Nowadays, many applications receive data in a streaming fashion and must process it on the fly as it arrives. Because working on such massive data sets in full is often inefficient, data structures called synopses are essential for managing them. Synopses summarize the data set and provide approximate answers to queries. One of the main families of synopses is sketches. A sketch of a large amount of data is a small data structure that can compute or approximate certain characteristics of the original data. In this diploma thesis we focus on several streaming sketch algorithms: Bloom Filter, Count-Min, Flajolet-Martin and AMS sketches. We propose a parallel implementation of query registration for these sketches, in which sketches are updated as more data arrives, new sketch instances are inserted dynamically at run time, and several functions are computed. These functions may estimate the frequency of elements, the number of distinct elements, or report whether an element exists in a stream. To develop this, we used the Apache Flink framework. Flink is a distributed streaming engine that offers high-throughput, low-latency and fault-tolerant computations over unbounded and bounded data streams. First, we present the theoretical background of the implemented algorithms and the distributed framework. Then, we explain the implementation, which uses a Kafka connector, the transformations of the DataStream API and, finally, the Queryable State feature of Flink.
Through this method, users can query the most up-to-date values of the sketches, while other platforms can use this information as a source.
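
To make the idea of a sketch concrete, the following is a minimal Count-Min sketch in plain Python. It is an illustration only and not the thesis implementation, which runs on Apache Flink. The class name, the default width and depth, and the use of salted BLAKE2b hashes are all assumptions made for this example. The merge step shows why such sketches suit parallel processing: partial sketches built on separate partitions of a stream can be combined by adding their counters.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter backed by a depth x width table.

    Hypothetical illustration; names and parameters are not from the thesis.
    """

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, obtained by salting BLAKE2b with the
        # row number (an assumed choice; any pairwise-independent family works).
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum over the rows
        # is the tightest estimate and never falls below the true count.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )

    def merge(self, other):
        # Sketches with identical dimensions and hash functions combine by
        # element-wise addition of their counters.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch dimensions must match")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.table[r] = [a + b for a, b in zip(self.table[r], other.table[r])]
        return merged


# Two partitions of a stream, each summarized independently and then merged.
left, right = CountMinSketch(), CountMinSketch()
for item in ["a", "b", "a"]:
    left.add(item)
for item in ["a", "c"]:
    right.add(item)
merged = left.merge(right)
print(merged.estimate("a"))  # at least 3, the true count of "a"
```

With non-negative updates the estimate may overcount because of hash collisions but never undercounts. Choosing a larger width reduces the expected error, and a larger depth reduces the probability of a large error.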