Nikolaos Kafritsas, "A caching platform for large scale data-intensive distributed applications", Diploma Work, School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece, 2019
https://doi.org/10.26233/heallink.tuc.83098
In the last decade, data processing systems have increasingly relied on main memory in order to speed up computations and boost performance. In this direction, many breakthroughs have been made in stream processing systems, which must meet rigorous demands and achieve sub-second latency along with high throughput. These advances were made feasible by the availability of large amounts of DRAM at a plummeting cost and by the rapid evolution of in-memory databases. However, in the Big Data era, maintaining such huge amounts of data in memory is impossible. On the other hand, falling back to disk-based databases is prohibitively expensive in terms of disk latency. The ideal scenario would combine the high access speed of memory with the large capacity and low price of disk, which hinges on the ability to utilize both main memory and disk effectively. Consequently, a solution that combines the benefits of both worlds is highly desirable.

This diploma thesis tackles the aforementioned problem by proposing an alternative architecture. More specifically, hot data are stored in memory, while cold data are moved to disk in a transactionally safe manner as the database grows in size. Because data initially reside in memory, this architecture reverses the traditional storage hierarchy of disk-based systems: the disk is treated as extended storage for evicted, cold data, not as the primary host for the whole dataset. Based on this architecture, a multi-layered platform is presented that is highly scalable and can operate in a distributed manner. The memory layer acts as a cache with configurable capacity and provides several eviction policies, the most important being a variation of the traditional LFU policy. In particular, data regarded as cold can return to memory if they become hot again, a case that occurs when the data distribution changes in online processing. Thanks to this feature and the sub-second latency achieved, the platform can also perform efficiently in a streaming environment and can serve as a stateful memory component in a real-time architecture. The disk layer is flexible and elastic, meaning that users can plug in the database of their choice as disk-based storage for cold data. Finally, the platform is tested in different scenarios under heavy load, and the benchmarks show that it performs very well, achieving throughput on the order of thousands of elements per second.
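The abstract does not spell out the mechanics of the tiered design, but the hot/cold data movement it describes can be illustrated with a minimal sketch. The Java code below is an assumption-laden illustration, not the thesis's implementation: DiskStore is a hypothetical interface standing in for the pluggable disk layer, the eviction uses a naive O(n) frequency scan rather than the thesis's LFU variation, and all names are invented for this example. Entries demoted to disk are promoted back into memory when accessed again, mirroring the described ability of cold data to become hot.

import java.util.HashMap;
import java.util.Map;

// Hypothetical interface for the pluggable disk layer; the platform lets
// users substitute the database of their choice here.
interface DiskStore<K, V> {
    void put(K key, V value);
    V get(K key);            // returns null if the key is absent
    void remove(K key);
}

// Minimal two-tier cache sketch: an in-memory layer of fixed capacity that
// demotes the least-frequently-used entry to disk, and promotes entries
// back to memory when they are accessed again.
class TieredCache<K, V> {
    private final int capacity;
    private final Map<K, V> memory = new HashMap<>();
    private final Map<K, Integer> freq = new HashMap<>();
    private final DiskStore<K, V> disk;

    TieredCache(int capacity, DiskStore<K, V> disk) {
        this.capacity = capacity;
        this.disk = disk;
    }

    public synchronized void put(K key, V value) {
        if (!memory.containsKey(key) && memory.size() >= capacity) {
            evictColdest();
        }
        memory.put(key, value);
        freq.merge(key, 1, Integer::sum);
    }

    public synchronized V get(K key) {
        V value = memory.get(key);
        if (value == null) {
            value = disk.get(key);   // miss in memory: check cold storage
            if (value != null) {
                disk.remove(key);
                put(key, value);     // promote: cold data became hot again
            }
            return value;
        }
        freq.merge(key, 1, Integer::sum);
        return value;
    }

    // Demote the entry with the lowest access frequency to the disk layer.
    // A naive linear scan; a production LFU would use a frequency index.
    private void evictColdest() {
        K coldest = null;
        int min = Integer.MAX_VALUE;
        for (Map.Entry<K, Integer> e : freq.entrySet()) {
            if (e.getValue() < min) {
                min = e.getValue();
                coldest = e.getKey();
            }
        }
        if (coldest != null) {
            disk.put(coldest, memory.remove(coldest));
            freq.remove(coldest);    // frequency resets if the key returns
        }
    }
}

Note that in this sketch a promoted entry restarts with a fresh frequency count, which is one plausible reading of an LFU "variation" that lets the cache adapt when the data distribution shifts during online processing; the thesis's actual policy may differ.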