Institutional Repository [SANDBOX]
Technical University of Crete

Migrating state between jobs in Apache Spark

Kalogerakis Stefanos

URI: http://purl.tuc.gr/dl/dias/980F4111-F7BA-470D-B865-D18A6A06E176
Year: 2020
Type of Item: Diploma Work
Bibliographic Citation: Stefanos Kalogerakis, "Migrating state between jobs in Apache Spark", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2020. https://doi.org/10.26233/heallink.tuc.87951

Summary

Data is now generated at an unprecedented rate and affects every aspect of everyday life. As its volume grows, more and more organizations adopt techniques for handling that data in real time and for evolving their business strategy. One critical challenge is ensuring fault tolerance and high availability of the data. At times, the heterogeneous systems responsible for data processing must interrupt their operation to update their infrastructure; at other times, system failures occur. Migration techniques that prevent data loss are therefore becoming increasingly important.

In this thesis, we propose a state migration algorithm implemented on Apache Spark’s Structured Streaming API. This API offers a fast, scalable solution for processing complex workloads and ensures fault tolerance through its checkpointing mechanism. The algorithm handles state across different jobs and covers scenarios in which users wish to split, merge, or remotely deploy workflows in each job with no data loss. Users thereby gain complete control over workflow operators and can influence their execution at will. To demonstrate that our implementation works, we used the RapidMiner Studio workflow designer to present complete and detailed test cases for each of these scenarios.
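
To make "split or merge with no data loss" concrete, here is a minimal sketch in plain Python. It is not the algorithm proposed in the thesis and it does not use any Spark API. It assumes state is a mapping from key to a running count; the names `split_state` and `merge_state`, the sensor keys, and the routing rule are all hypothetical.

```python
from typing import Callable, Dict, Tuple

State = Dict[str, int]  # assumed shape: key -> running count


def split_state(state: State, goes_left: Callable[[str], bool]) -> Tuple[State, State]:
    """Partition keyed state into two disjoint states, one per target job."""
    left = {k: v for k, v in state.items() if goes_left(k)}
    right = {k: v for k, v in state.items() if not goes_left(k)}
    return left, right


def merge_state(a: State, b: State) -> State:
    """Combine two states; counts for keys present in both are added."""
    merged = dict(a)
    for key, count in b.items():
        merged[key] = merged.get(key, 0) + count
    return merged


# Hypothetical state held by one job before migration.
original = {"sensor-a": 3, "sensor-b": 5, "sensor-c": 2}

# Split by an illustrative routing rule, then merge the halves back.
left, right = split_state(original, lambda key: key < "sensor-b")
restored = merge_state(left, right)
```

In this toy setting, "no data loss" means that every key lands in exactly one half after a split, and that merging the halves reproduces the original state.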
