Data Streams: Algorithms and Applications

by S. Muthukrishnan

Presentation by Ramesh Sridharan and Matthew Johnson, Part 2

Formalism [Sec. 4]

We consider input streams, which represent underlying, shorter signals. We will use a_1, a_2, ..., a_t, ... to represent the input stream, where a_t arrives at time t. This stream describes some underlying signal A[i] for i ∈ [1, N], for some dimensionality N, which we would like to query. There are three typical models used:

  • Time Series: a_t = A[t]
  • Cash Register: a_t = (j, I_t), and A_t[j] = A_{t-1}[j] + I_t, where I_t ≥ 0.
  • Turnstile: As above, but no restriction on I_t. In the strict turnstile model, A_t[j] ≥ 0 ∀j, ∀t.
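To make the three models concrete, the following minimal sketch (illustrative only; the function and variable names are my own, not part of the notes) shows how a single stream item would modify the underlying signal A in each model:

def apply_update(A, t, item, model):
    """Apply one stream item to the underlying signal A (a list of length N).

    Indices are 0-based here for simplicity, unlike the 1-based notation above.
    - "time_series":   the item arriving at time t is simply A[t].
    - "cash_register": the item is (j, I_t) with I_t >= 0, so A[j] can only grow.
    - "turnstile":     the item is (j, I_t) with I_t of either sign.
    """
    if model == "time_series":
        A[t] = item
    elif model in ("cash_register", "turnstile"):
        j, I_t = item
        if model == "cash_register" and I_t < 0:
            raise ValueError("cash register updates must be non-negative")
        A[j] += I_t
        # In the *strict* turnstile model we would additionally require A[j] >= 0.
    else:
        raise ValueError("unknown model")
    return A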

Basic Mathematical Techniques [Sec. 5] (continued)

Random Projections

Moments estimation

Here, we want to estimate the kth moment of a stream: F_k = Σ_i A[i]^k. This is useful in many practical settings, as we will see over the next few weeks. In this section, we focus on F_2.

We consider random ±1 vectors v_ij of length N whose entries are four-wise independent, and define X_ij = ⟨A, v_ij⟩ = Σ_ℓ A[ℓ] v_ij[ℓ].

We can show E[X_ij^2] = F_2 by considering the square of the sum above and noting that, in expectation, the cross terms are 0. We can also show that var(X_ij^2) ≤ 2 F_2^2 using a similar approach for X_ij^4, the second moment of the random variable X_ij^2.

To obtain an approximation that lies within (1 ± ε) F_2 with probability greater than 1 − δ, we take i in the range {1, ..., 16/ε^2} and j in the range {1, ..., 2 log(1/δ)}, and for each j we average X_ij^2 across i, calling the result Y_j. By the Chebyshev inequality, the probability that a single Y_j deviates from F_2 by more than ε F_2 is bounded by a small constant (at most 1/8). We then take the median of the Y_j's. Unless more than half of the Y_j's deviate from F_2 by more than ε F_2, the median will be within the desired range. The probability of this error event is bounded by the Chernoff bound as δ, so with probability 1 − δ we have the desired bounds on our estimate.
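The following is a minimal, illustrative sketch of the resulting median-of-means estimator. The function name and parameters are my own; a real implementation would replace the explicitly stored ±1 vectors with four-wise independent hash functions so that only O(ε^{-2} log(1/δ)) counters are kept.

import math
import random
import statistics

def estimate_F2(stream, N, eps=0.5, delta=0.1, seed=0):
    """Estimate F2 = sum_i A[i]^2 from cash register updates (index, increment).

    Keeps s1*s2 running inner products X[j][i] = <A, v_ij>; averages the squares
    within each group of s1 (Chebyshev step), then takes the median across the
    s2 groups (Chernoff step)."""
    rng = random.Random(seed)
    s1 = int(16 / eps ** 2)                       # copies averaged per group
    s2 = max(1, round(2 * math.log(1 / delta)))   # number of groups for the median
    # Explicit random +/-1 vectors stand in for four-wise independent hashes.
    signs = [[[rng.choice((-1, 1)) for _ in range(N)] for _ in range(s1)]
             for _ in range(s2)]
    X = [[0.0] * s1 for _ in range(s2)]

    for index, increment in stream:               # update a_t = (j, I_t)
        for j in range(s2):
            for i in range(s1):
                X[j][i] += signs[j][i][index] * increment

    Y = [sum(x * x for x in group) / s1 for group in X]
    return statistics.median(Y)

For example, the stream [(3, 2), (3, 1), (7, 5)] yields A[3] = 3 and A[7] = 5, so the estimate should concentrate around F_2 = 9 + 25 = 34.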

Count-min sketch

We often want to keep track of A[i] for all i, but this violates our space constraints. So, instead of maintaining A[i] for all i, we maintain a two-dimensional d × w array called count, where w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉. Associated with the array are d hash functions h_1, ..., h_d : {1, ..., N} → {1, ..., w}. When we receive an update a_t = (j, I_t), for each hash function h_k we update count[k, h_k(j)] to count[k, h_k(j)] + I_t; that is, each cell maintains the cumulative sum of all updates whose index hashes to that value.

This allows us to efficiently solve the point-estimation problem, i.e. finding A[i] for an arbitrary i. Our estimate is

Â[i] = min_j count[j, h_j(i)]

This is (certainly) bounded from below by A[i] and (with probability at least 1 − δ) bounded from above by A[i] + ε ||A||_1. Note that count[j, h_j(i)] accumulates not only the increments belonging to index i, but also the increments of any other index that hashes to the same cell. These "extra values" are non-negative, which is why Â[i] is bounded from below by A[i]. The upper bound comes from applying the Markov inequality to the probability P(Â[i] ≥ A[i] + ε ||A||_1). This is the probability P(count[j, h_j(i)] ≥ A[i] + ε ||A||_1 ∀j), i.e. the probability that in every row the sum of the "extra values" is at least ε ||A||_1. The expected "extra weight" in a row is at most ||A||_1 / w = ε ||A||_1 / e, so by the Markov inequality a single row exceeds ε ||A||_1 with probability at most 1/e; since the d hash functions are chosen independently, these probabilities multiply, giving an overall error probability of at most e^{-d} ≤ δ, the desired result. Note that many of the problems posed in earlier sections can be solved using this technique.
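A compact sketch of this structure is below. The class name and the particular pairwise-independent hash family are my own choices, not part of the notes; any pairwise-independent family works.

import math
import random

class CountMinSketch:
    """d = ceil(ln(1/delta)) rows of width w = ceil(e/eps); for non-negative
    updates, the point query returns A_hat[i] with
    A[i] <= A_hat[i] <= A[i] + eps * ||A||_1, with probability >= 1 - delta."""

    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        self.count = [[0] * self.w for _ in range(self.d)]
        self.p = 2 ** 61 - 1                      # prime modulus for the hash family
        rng = random.Random(seed)
        self.coeffs = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(self.d)]    # one (a, b) pair per row

    def _h(self, k, x):
        a, b = self.coeffs[k]
        return ((a * x + b) % self.p) % self.w    # pairwise-independent hash h_k

    def update(self, j, increment):               # process a_t = (j, I_t)
        for k in range(self.d):
            self.count[k][self._h(k, j)] += increment

    def point_query(self, i):                     # A_hat[i] = min_k count[k, h_k(i)]
        return min(self.count[k][self._h(k, i)] for k in range(self.d))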

Sampling

Estimating Number of Distinct Elements

The problem is to estimate D = |{i : A[i] ≠ 0}|. If A[i] is the number of occurrences of i in the stream, D is the number of distinct items; more generally, D is the size of the support of A. One way of estimating D in the cash register model keeps a bit vector c of length log_2 N and uses a hash function f : [1, N] → {1, 2, ..., log_2 N} such that P[f(i) = j] = 2^{-j}; any update to item i sets c[f(i)] to 1. An unbiased estimate of the number of distinct items is given by 2^{k(c)}, where k(c) is the lowest index j such that c[j] = 0. Intuitively, since the probability that an item is mapped to the counter at index j is 2^{-j}, if there are D distinct items we expect D/2 of them to be mapped to c[1], D/4 to c[2], and so on. However, this relies on the existence of a fully random hash function, so the scheme has been extended to work with a hash function that can be stored in O((ε^{-2} log log m + log m · log(1/ε)) log(1/δ)) space. For the turnstile model, the methods for estimating D use L_p-sum estimation for small p.
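A single-copy sketch of that counter is shown below. It is purely illustrative: it uses a cryptographic hash in place of the idealized fully random hash function, and real implementations average many independent copies to reduce variance.

import hashlib
import math

def estimate_distinct(stream, N):
    """Flajolet-Martin-style estimate of D, the number of distinct items.

    `stream` is the sequence of updated indices i (increments are irrelevant here).
    An item i is mapped to level f(i) = 1 + (trailing zero bits of hash(i)),
    so P[f(i) = j] is roughly 2^-j; c[j] records whether any item hit level j.
    The estimate is 2^{k(c)}, where k(c) is the lowest j with c[j] = 0."""
    levels = math.ceil(math.log2(N)) + 1
    c = [0] * (levels + 2)
    for i in stream:
        h = int.from_bytes(hashlib.sha1(str(i).encode()).digest()[:8], "big")
        j = 1
        while (h & 1) == 0 and j <= levels:
            h >>= 1
            j += 1
        c[j] = 1
    k = 1
    while k < len(c) and c[k]:
        k += 1
    return 2 ** k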

Basic Algorithmic Techniques [Sec. 6]

The Algorithmic Techniques section differs from the Mathematical Techniques section in that it focuses on more deterministic settings, where the main innovations lie in careful data structure design.

Estimating wavelet coefficients

In the time series model, consider the problem of approximating the signal by its B largest Haar wavelet coefficients (see Figure 1 for a depiction of the Haar wavelets). Because of the time-localization of the Haar wavelets, we can essentially walk along the signal while keeping two data structures: a heap of the B largest coefficients so far, and a list of log N straddling coefficients, i.e. the "in-progress" coefficients. The meaning of the straddling coefficients and the relationship between the two structures is best visualized by drawing the Haar wavelets on a binary tree sitting on top of the signal. Using the above method, we can compute the best B-term approximation to the signal in the Haar wavelet domain in O(B + log N) space.
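A rough one-pass sketch of this procedure is given below, assuming the signal length is a power of two; the function and structure names are mine, and zero coefficients are simply skipped.

import heapq
import math

def top_B_haar(signal, B):
    """One-pass computation of the B largest-magnitude Haar detail coefficients
    of a length-2^m time series, keeping only a size-B heap plus one
    "straddling" (in-progress) average per level: O(B + log N) space."""
    heap = []      # min-heap of (|coef|, (level, position), coef)
    pending = {}   # level -> the single unpaired scaled average at that level
    seen = {}      # level -> number of completed coefficients at that level

    def keep(coef, label):
        if coef == 0:
            return
        entry = (abs(coef), label, coef)
        if len(heap) < B:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)

    for value in signal:
        avg, level = float(value), 0
        # Whenever a pair at some level is complete, emit its detail coefficient
        # and push the (normalized) average one level up.
        while level in pending:
            left = pending.pop(level)
            keep((left - avg) / math.sqrt(2), (level, seen.get(level, 0)))
            seen[level] = seen.get(level, 0) + 1
            avg = (left + avg) / math.sqrt(2)
            level += 1
        pending[level] = avg

    return sorted(heap, reverse=True)   # the B largest detail coefficients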

Deterministic heavy hitter with sparsity

Although we cannot deterministically solve the heavy hitters problem in the general case, we can if we impose a sparsity constraint on A: we assume that no more than k indices of A are nonzero, and we want to find those k indices and/or their corresponding values. Take x consecutive primes p_1, ..., p_x, each larger than k, where x = (k − 1) log_k N + 1. For each prime p_j, construct a table T_j of size p_j, in which each index i is mapped to i mod p_j. Our update rule for an update (i, I_i) is, for every table j,

T_j[i mod p_j] := T_j[i mod p_j] + I_i

We then claim that each nonzero index will have at least one table where it is the only nonzero index in its entry. Two indices can share the same entry in at most log_k N tables: otherwise, their difference would be divisible by more than log_k N distinct primes, which would imply that the difference is larger than N (since it is the product of more than log_k N numbers greater than k). Since a fixed nonzero index can therefore collide with each of the other k − 1 nonzero indices in at most log_k N tables, (k − 1) log_k N + 1 tables guarantee at least one collision-free table for it. We could then estimate Â[i] by

Â[i] = (1/x) Σ_{j=1}^{x} T_j[i mod p_j]
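A small sketch of this construction follows; the class and helper names are my own, and the naive prime search is only for illustration.

import math

def primes_above(k, count):
    """Return `count` consecutive primes larger than k (naive trial division)."""
    primes, candidate = [], k + 1
    while len(primes) < count:
        if candidate > 1 and all(candidate % d for d in range(2, math.isqrt(candidate) + 1)):
            primes.append(candidate)
        candidate += 1
    return primes

class SparsePointEstimator:
    """Deterministic structure for k-sparse signals over [0, N):
    x = (k-1)*ceil(log_k N) + 1 tables, table j indexed by i mod p_j."""

    def __init__(self, N, k):
        assert k >= 2, "sketch assumes k >= 2 so that log_k N is defined"
        x = (k - 1) * math.ceil(math.log(N, k)) + 1
        self.primes = primes_above(k, x)
        self.tables = [[0] * p for p in self.primes]

    def update(self, i, delta):
        # T_j[i mod p_j] := T_j[i mod p_j] + I_i, applied to every table.
        for T, p in zip(self.tables, self.primes):
            T[i % p] += delta

    def estimate(self, i):
        # A_hat[i] = (1/x) * sum_j T_j[i mod p_j], as in the notes above.
        vals = [T[i % p] for T, p in zip(self.tables, self.primes)]
        return sum(vals) / len(vals)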