Finding correlated attributes in datasets at Flink

Anastasiou Michalis

URI:

http://purl.tuc.gr/dl/dias/3EEF5888-54C6-48F8-A416-4C6AAF85F504

Year

2023

Type of Item

Diploma Work

Bibliographic Citation

Michalis Anastasiou, "Finding correlated attributes in datasets at Flink", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2023. https://doi.org/10.26233/heallink.tuc.96534

Appears in Collections

Diploma Works in Community School of Electrical and Computer Engineering

Diploma Works in Community Distributed Multimedia Information Systems and Applications Laboratory

Summary

The rapid development of technology generates huge amounts of data every day, with a volume ten times greater than it was five years ago, so the modern era is rightly described as the era of Big Data. Studying this data is essential both in academia and in industry, since it makes drawing conclusions much easier. The aim of this thesis is to find correlated data in real time in order to extract data that can be used to predict similarity. Because of the volume of data involved, the thesis processes thousands of data streams in a distributed and parallel manner in order to find the k most similar streams. Computing the similarity of thousands of large data streams directly would be too costly, so an algorithm was implemented that samples the data, with the goal of reducing the data size without the risk of information loss. This algorithm was developed within the Synopses Data Engine, a platform built on top of the Apache Flink framework whose main function is to support several synopses running in parallel and distributed in real time. Once the synopsis algorithm was complete, a mathematical model for finding the similarity between synopses was developed. The model consists of the Pearson correlation together with the standard error of sampling, obtained using the Fisher Z transformation. To verify efficiency and correctness, the system was first designed and tested locally, where initial experiments were conducted and verified. It was then deployed remotely in a distributed setting, where the final experiments were conducted, achieving positive and satisfactory results.
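The similarity model described in the summary can be illustrated with a small, single-machine sketch. This is not the Synopses Data Engine or Flink implementation: it is plain Python, it assumes a uniform random sample shared by all streams (the summary does not say which sampling scheme the thesis uses), and the names `pearson`, `fisher_interval`, and `top_k_pairs` are hypothetical. It computes the Pearson correlation on the sample, attaches an approximate 95% confidence interval using the Fisher Z transformation, and returns the k most correlated pairs.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant sample as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def fisher_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for r via the Fisher Z transformation."""
    if n <= 3:
        raise ValueError("the Fisher Z standard error needs more than 3 samples")
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r)                     # Fisher Z transform of r
    se = 1.0 / math.sqrt(n - 3)           # standard error of z for sample size n
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def top_k_pairs(streams, k, sample_size, seed=0):
    """Rank stream pairs by the correlation of a shared uniform sample.

    `streams` maps a stream name to a list of numeric values.
    Uniform sampling is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    length = min(len(s) for s in streams.values())
    idx = sorted(rng.sample(range(length), min(sample_size, length)))
    samples = {name: [s[i] for i in idx] for name, s in streams.items()}
    scored = []
    for a, b in combinations(sorted(samples), 2):
        r = pearson(samples[a], samples[b])
        low, high = fisher_interval(r, len(idx))
        scored.append((a, b, r, low, high))
    scored.sort(key=lambda t: t[2], reverse=True)  # highest correlation first
    return scored[:k]
```

A smaller sample widens the interval returned by `fisher_interval`, which gives one way to see the trade-off between reducing data size and trusting the estimated correlation.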
