Το έργο με τίτλο SPARTAN: A model-based semantic compression system for massive data tables από τον/τους δημιουργό/ούς Babu Shivnath, Garofalakis Minos, Rastogi Rajeev διατίθεται με την άδεια Creative Commons Αναφορά Δημιουργού 4.0 Διεθνές
Βιβλιογραφική Αναφορά
S. Babu, M. Garofalakis and R. Rastogi, "SPARTAN: A model-based semantic compression system for massive data tables", in International Conference on Management of Data (SIGMOD 2001), May 2001, pp. 283-294.
While a variety of lossy compression schemes have been developed for certainforms of digital data (e.g., images, audio, video), the area of lossycompression techniques for arbitrary data tables has been left relatively unexplored.Nevertheless, such techniques are clearly motivated by the everincreasingdata collection rates of modern enterprises and the need for effective,guaranteed-quality approximate answers to queries over massiverelational data sets. In this paper, we propose SPARTAN , a system thattakes advantage of attribute semantics and data-mining models to performlossy compression of massive data tables. SPARTAN is based on thenovel idea of exploiting predictive data correlations and prescribed errortolerances for individual attributes to construct concise and accurate Classificationand Regression Tree (CaRT) models for entire columns of a table.More precisely, SPARTAN selects a certain subset of attributes forwhich no values are explicitly stored in the compressed table; instead, conciseCaRTs that predict these values (within the prescribed error bounds)are maintained. To restrict the huge search space and construction costof possible CaRT predictors, SPARTAN employs sophisticated learningtechniques and novel combinatorial optimization algorithms. Our experimentationwith several real-life data sets offers convincing evidence of theeffectiveness of SPARTAN ’s model-based approach – SPARTAN isable to consistently yield substantially better compression ratios than existingsemantic or syntactic compression tools (e.g., gzip) while utilizing onlysmall data samples for model inference.