Panagiotis Giannakopoulos, "Supporting elasticity in Flink", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2020
https://doi.org/10.26233/heallink.tuc.86993
Apache Flink is an open source framework that supports high-throughput, low latency data processing, as well as event processing. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. At a basic level, Flink programs consist of streams and operators which apply transformations on data. The number of operator subtasks correspond to the parallelism of that particular operator and regulates the allocated resources for the execution of the operator. As a result, it is possible to set the desirable overall parallelism of the program by adjusting accordingly the parallelism of each operator. Although, Flink currently lacks the ability to automatically adjust to the resource needs of a running program and therefore it cannot adapt the program to varying workload. Therefore, such an adjustion can only be done with human intervention. The lack of dynamic resource allocation could lead to performance drop or to allocated resources remaining unused for long time (in the case of over or under utilization of resources respectively). In order to address this issue, we propose a statistical machine learning methodology which is implemented as a software agent that runs in parallel with Flink. The agent monitors the running program and adjusts the allocated resources to the incoming workload. The agent acts proactively by predicting the forthcoming workload in order to maintain the performance of the application within acceptable limits (i.e. defined in the form of SLAs) while minimizing the utilization of resources. This is achieved by adjusting (i.e. scaling-up or down) the computational resources to the actual and future needs of the application. To do so, a statistical machine learning model is used with online training in order to approach an optimal policy for scaling. As a proof of concept, we designed and implemented an infrastructure on the cloud which assess the efficiency of such scaling method in a Flink cluster. We run an exhaustive set of experiments using synthetic and real workloads available on the internet. The experimental results are a good support to our claims of efficiency.