Join-distinct aggregate estimation over update streams

Ganguly, Sumit, Garofalakis Minos, Kumar Amit, Rastogi Rajeev

URI	http://purl.tuc.gr/dl/dias/5325F237-4135-4A25-ABF2-A2CFBC74B7CF	-
Αναγνωριστικό	http://cse.iitk.ac.in/users/sganguly/jd.pdf	-
Γλώσσα	en	-
Μέγεθος	12 pages	en
Τίτλος	Join-distinct aggregate estimation over update streams	en
Δημιουργός	Ganguly, Sumit	en
Δημιουργός	Garofalakis Minos	en
Δημιουργός	Γαροφαλακης Μινως	el
Δημιουργός	Kumar Amit	en
Δημιουργός	Rastogi Rajeev	en
Εκδότης	Association for Computing Machinery	en
Περίληψη	There is growing interest in algorithms for processing and querying continuous data streams (i.e., data that is seen only once in a fixed order) with limited memory resources. Providing (perhaps approximate) answers to queries over such streams is a crucial requirement for many application environments; examples include large IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. The ability to estimate the number of distinct (sub)tuples in the result of a join operation correlating two data streams (i.e., the cardinality of a projection with duplicate elimination over a join) is an important requirement for several data-analysis scenarios. For instance, to enable real-time traffic analysis and load balancing, a network-monitoring application may need to estimate the number of distinct (source, destination) IP-address pairs occurring in the stream of IP packets observed by router R1, where the source address is also seen in packets routed through a different router R2. Earlier work has presented solutions to the individual problems of distinct counting and join-size estimation (without duplicate elimination) over streams. These solutions, however, are fundamentally different and extending or combining them to handle our more complex “Join-Distinct” estimation problem is far from obvious. In this paper, we propose the first space-efficient algorithmic solution to the general Join-Distinct estimation problem over continuous data streams (our techniques can actually handle general update streams comprising tuple deletions as well as insertions). Our estimators are probabilistic in nature and rely on novel algorithms for building and combining a new class of hash-based synopses (termed “JD sketches”) for individual update streams. We demonstrate that our algorithms can provide low error, high-confidence Join-Distinct estimates using only small space and small processing time per update. In fact, we present lower bounds showing that the space usage of our estimators is within small factors of the best possible for the Join-Distinct problem. Preliminary experimental results verify the effectiveness of our approach.	en
Τύπος	Πλήρης Δημοσίευση σε Συνέδριο	el
Τύπος	Conference Full Paper	en
Άδεια Χρήσης	http://creativecommons.org/licenses/by/4.0/	en
Ημερομηνία	2015-12-01	-
Ημερομηνία Δημοσίευσης	2005	-
Θεματική Κατηγορία	Management systems	en
Θεματική Κατηγορία	Query-processing algorithms	en
Βιβλιογραφική Αναφορά	S. Ganguly, M. Garofalakis, A. Kumar and R. Rastogi, "Join-distinct aggregate estimation over update streams", in 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 2005.	en

Search

Browse

My Space

Join-distinct aggregate estimation over update streams

Ganguly, Sumit, Garofalakis Minos, Kumar Amit, Rastogi Rajeev

Υπηρεσίες

Εξαγωγή

Κοινοποίηση

Στατιστικά

Μεταδεδομένων & Περιεχομένου σε METS:

Μεταδεδομένων σε Μορφότυπο: