Το work with title Optimizing stream processing efficiency and cost: TALOS Task level autoscaler for Apache Flink platform by Ntouni Ourania is licensed under Creative Commons Attribution 4.0 International
Bibliographic Citation
Ourania Ntouni, "Optimizing stream processing efficiency and cost: TALOS Task level autoscaler for Apache Flink platform", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2024
https://doi.org/10.26233/heallink.tuc.99094
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Big Data plat- forms, due to their dynamic nature, often face fluctuations in data streams workloads, leading to potential over-provisioning or under-provisioning be- cause of static resource allocation. Most existing solutions solve the re- source adaptation problem by scaling the entire Job. These solutions are sub-optimal since not all Tasks are equally stressed and need not be scaled. Therefore, we decided to develop an agent that manages the resources of a Flink job during runtime, at a Task level. This thesis presents TALOS, an innovative task autoscaler specifically designed for Apache Flink. TALOS dynamically changes the parallelism of tasks within Flink jobs in response to real-time workload fluctuations. TALOS is a threshold-based agent that utilizes a combination of metrics, such as Kafka consumer lag, throughput, backpressure, buffer metrics, and idleness to make scaling decisions. This al- gorithm targets not only in optimizing the performance of the Flink Job, but also in minimizing the cost in cloud environments, especially for long run- ning applications. Notable is, that TALOS monitors each task separately, without taking into consideration output and input rate of upstream or downstream tasks, respectively. In this thesis, we prove that our model not only successfully maintains the performance of the application while min- imizing infrastructure costs, but can provide a better performance-to-cost ratio compared to already existing work on Flink autoscaling. Our system is compared against Flink Kubernetes Operator autoscaler, a threshold based algorithm, and both of the agents are tested in pretentious workloads.