Institutional Repository [SANDBOX]
Technical University of Crete

Extreme-Scale online machine learning on stream processing platforms

Konidaris Vissarion-Bertcholnt

URI: http://purl.tuc.gr/dl/dias/52EF7270-E7E6-47C1-A576-5044960EE29D
Year: 2022
Type of Item: Master Thesis
Bibliographic Citation: Vissarion-Bertcholnt Konidaris, "Extreme-Scale online machine learning on stream processing platforms", Master Thesis, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2022. https://doi.org/10.26233/heallink.tuc.92838

Summary

Online Machine Learning (OML) techniques train models over continuous, unbounded streams of training items while simultaneously serving predictions on the same or another unlabeled stream. The explosion in the volume and complexity of digital information generated online is making OML techniques essential for modern analytics and forecasting applications, because they can handle massive, unbounded and, most importantly, inherently non-static data. Having observed that popular Machine Learning (ML) toolchains offer only limited support for the OML setting, we designed the Online Machine Learning and Data Mining (OMLDM) component, a state-of-the-art engine for effortlessly deploying OML pipelines on streaming platforms. Our prototype, built on Apache Flink, validates our architecture and identifies issues that current streaming platforms should address to support OML. To achieve high performance, OMLDM supports distributed online learning through the Parameter Server paradigm. We identify the communication cost of synchronizing distributed learners as the major impediment to scalability. To overcome this obstacle, our engine supports several popular model synchronization strategies. In addition, we propose and evaluate a novel synchronization strategy, Functional Dynamic Averaging (FDA), that minimizes prediction loss and network communication simultaneously. We demonstrate experimentally that FDA outperforms current model synchronization strategies in many settings.
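
To make the Parameter Server idea in the summary concrete, the following is a minimal, single-process sketch of distributed online learning with periodic model averaging. It is an illustration only: it is not the OMLDM implementation, it does not use Apache Flink, and it is not the FDA strategy proposed in the thesis. The class names (`ParameterServer`, `Worker`), the `run` function, the synthetic linear data, and parameters such as `sync_every` are all assumptions made for this example.

```python
import random


class ParameterServer:
    """Hypothetical server: holds the global model and averages worker models."""

    def __init__(self, dim):
        self.weights = [0.0] * dim

    def synchronize(self, worker_models):
        # Plain averaging of the local models (a common baseline strategy).
        n = len(worker_models)
        self.weights = [sum(ws) / n for ws in zip(*worker_models)]
        return list(self.weights)


class Worker:
    """Hypothetical learner: online linear regression trained with SGD."""

    def __init__(self, dim, lr=0.05):
        self.weights = [0.0] * dim
        self.lr = lr

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x))

    def fit_one(self, x, y):
        # One SGD step on the squared error of a single streamed item.
        err = self.predict(x) - y
        self.weights = [w - self.lr * err * xi for w, xi in zip(self.weights, x)]
        return err * err


def run(num_workers=4, sync_every=10, steps=2000, seed=0):
    """Simulate workers that each consume their own stream and sync periodically."""
    rng = random.Random(seed)
    true_w = [2.0, -1.0, 0.5]  # assumed ground-truth model for the synthetic stream
    server = ParameterServer(len(true_w))
    workers = [Worker(len(true_w)) for _ in range(num_workers)]
    syncs = 0
    for t in range(1, steps + 1):
        for worker in workers:
            x = [rng.uniform(-1.0, 1.0) for _ in true_w]
            y = sum(a * b for a, b in zip(true_w, x))
            worker.fit_one(x, y)
        if t % sync_every == 0:
            # Each synchronization costs one round of network communication.
            global_w = server.synchronize([w.weights for w in workers])
            for worker in workers:
                worker.weights = list(global_w)
            syncs += 1
    return server.weights, syncs
```

In this sketch, a larger `sync_every` reduces the number of synchronization rounds at the price of letting local models drift apart between rounds. That tension between communication cost and prediction quality is the trade-off the summary identifies as the main obstacle to scalability.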
