Finding Hierarchical Heavy Hitters in Data Streams

Flip Korn

AT&T Lab
Florham Park, New Jersey, USA

Aggregation along hierarchies is a critical summary and analysis technique in many online applications including network management (IP clustering, DoS attack monitoring, etc.). We develop one such aggregate view based on certain hierarchically organized sets of large-valued regions (``heavy hitters''). The problem we discuss in this talk is that of finding Hierarchical Heavy Hitters (HHH). In one dimension, the problem is as follows: given a hierarchy (e.g., IP address space) and a fraction $\phi$, find all HHH nodes that have a total number of descendants in the data stream no smaller than $\phi$ of the total number of elements in the data stream, after discounting the descendant nodes that are HHH nodes. For multi-dimensional data, we identify ``overlap'' and ``split'' variants, depending on how an aggregate computed for a child node in the multi- dimensional hierarchy is propagated to its parent(s). The resulting summary gives a topological ``cartogram'' of the hierarchical data. For data streams, we present online algorithms that find approximate HHHs, with proven accuracy guarantees. Our experiments demonstrate fewer false-positives and less memory usage, by several factors, over the straightforward approach. In fact, our proposed HHH algorithms yielded outputs very similar (virtually identical, in many cases) to their offline counterparts.

About the speaker