Content Summary | Recent years have seen growing interest in effective algorithms for
summarizing and querying massive, high-speed data streams. Randomized sketch
synopses provide accurate approximations for general-purpose summaries of the
streaming data distribution (e.g., wavelets). The focus of existing work has typically
been on minimizing space requirements of the maintained synopsis — however,
to effectively support high-speed data-stream analysis, a crucial practical
requirement is to also optimize: (1) the update time for incorporating a streaming
data element in the sketch, and (2) the query time for producing an approximate
summary (e.g., the top wavelet coefficients) from the sketch. Such time
costs must be small enough to cope with rapid stream-arrival rates and the realtime
querying requirements of typical streaming applications (e.g., ISP network
monitoring). With cheap and plentiful memory, space is often only a secondary
concern after query/update time costs.
In this paper, we propose the first fast solution to the problem of tracking wavelet
representations of one-dimensional and multi-dimensional data streams, based on
a novel stream synopsis, the Group-Count Sketch (GCS). By imposing a hierarchical
structure of groups over the data and applying the GCS, our algorithms
can quickly recover the most important wavelet coefficients with guaranteed accuracy.
A tradeoff between query time and update time is established, by varying
the hierarchical structure of groups, allowing the right balance to be found for
specific data stream. Experimental analysis confirms this tradeoff, and shows that
all our methods significantly outperform previously known methods in terms of
both update time and query time, while maintaining a high level of accuracy. | en |