Data Streams: Algorithms and Applications

by S. Muthukrishnan

Presentation by Ramesh Sridharan and Matthew Johnson, Part 2

Formalism [Sec. 4]

We consider input streams, which represent underlying, shorter signals. We will use a_1, a_2, ..., a_t, ... to represent the input stream, where a_t arrives at time t. This stream describes some underlying signal A[i] for i ∈ [1, N], for some dimensionality N, which we would like to query. There are three typical models used:

  • Time Series: a_t = A[t]
  • Cash Register: a_t = (j, I_t), and A_t[j] = A_{t-1}[j] + I_t, where I_t ≥ 0.
  • Turnstile: As above, but no restriction on I_t. In the strict turnstile model, A_t[j] ≥ 0 ∀j, ∀t.
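To make the three models concrete, the following minimal sketch (illustrative only; the function and variable names are my own, not part of the notes) shows how a single stream item would modify the underlying signal A in each model:

def apply_update(A, t, item, model):
    """Apply one stream item to the underlying signal A (a list of length N).

    Indices are 0-based here for simplicity, unlike the 1-based notation above.
    - "time_series":   the item arriving at time t is simply A[t].
    - "cash_register": the item is (j, I_t) with I_t >= 0, so A[j] can only grow.
    - "turnstile":     the item is (j, I_t) with I_t of either sign.
    """
    if model == "time_series":
        A[t] = item
    elif model in ("cash_register", "turnstile"):
        j, I_t = item
        if model == "cash_register" and I_t < 0:
            raise ValueError("cash register updates must be non-negative")
        A[j] += I_t
        # In the *strict* turnstile model we would additionally require A[j] >= 0.
    else:
        raise ValueError("unknown model")
    return A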

Basic Mathematical Techniques [Sec. 5] (continued)

Random Projections

Moments estimation

Here, we want to estimate the kth moment of a stream: F_k = Σ_i A[i]^k. This is useful in many practical settings, as we will see over the next few weeks. In this section, we focus on F_2.

We consider random ±1 vectors v_ij of length N whose entries are four-wise independent, and define X_ij = ⟨A, v_ij⟩ = Σ_ℓ A[ℓ] v_ij[ℓ].

We can show E[X_ij^2] = F_2 by considering the square of the sum above and noting that, in expectation, the cross terms are 0. We can also show that var(X_ij^2) ≤ 2 F_2^2 using a similar approach for X_ij^4, the second moment of the random variable X_ij^2.

To obtain an approximation that lies within (1 ± ε) F_2 with probability greater than 1 − δ, we take i in the range {1, ..., 16/ε^2} and j in the range {1, ..., 2 log(1/δ)}, and for each j we average X_ij^2 across i, calling the result Y_j. By the Chebyshev inequality, the probability that a single Y_j deviates from F_2 by more than ε F_2 is bounded by a small constant (at most 1/8). We then take the median of the Y_j's. Unless more than half of the Y_j's deviate from F_2 by more than ε F_2, the median will be within the desired range. The probability of this error event is bounded by the Chernoff bound as δ, so with probability 1 − δ we have the desired bounds on our estimate.
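The following is a minimal, illustrative sketch of the resulting median-of-means estimator. The function name and parameters are my own; a real implementation would replace the explicitly stored ±1 vectors with four-wise independent hash functions so that only O(ε^{-2} log(1/δ)) counters are kept.

import math
import random
import statistics

def estimate_F2(stream, N, eps=0.5, delta=0.1, seed=0):
    """Estimate F2 = sum_i A[i]^2 from cash register updates (index, increment).

    Keeps s1*s2 running inner products X[j][i] = <A, v_ij>; averages the squares
    within each group of s1 (Chebyshev step), then takes the median across the
    s2 groups (Chernoff step)."""
    rng = random.Random(seed)
    s1 = int(16 / eps ** 2)                       # copies averaged per group
    s2 = max(1, round(2 * math.log(1 / delta)))   # number of groups for the median
    # Explicit random +/-1 vectors stand in for four-wise independent hashes.
    signs = [[[rng.choice((-1, 1)) for _ in range(N)] for _ in range(s1)]
             for _ in range(s2)]
    X = [[0.0] * s1 for _ in range(s2)]

    for index, increment in stream:               # update a_t = (j, I_t)
        for j in range(s2):
            for i in range(s1):
                X[j][i] += signs[j][i][index] * increment

    Y = [sum(x * x for x in group) / s1 for group in X]
    return statistics.median(Y)

For example, the stream [(3, 2), (3, 1), (7, 5)] yields A[3] = 3 and A[7] = 5, so the estimate should concentrate around F_2 = 9 + 25 = 34.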

Count-min sketch

We often want to keep track of A[i] for all i, but this violates our space constraints. So, instead of maintaining A[i] for all i, we maintain a two-dimensional d × w array called count, where w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉. Associated with the array are d hash functions h_1, ..., h_d : {1, ..., N} → {1, ..., w}. When we receive an update a_t = (j, I_t), for each hash function h_k we update count[k, h_k(j)] to count[k, h_k(j)] + I_t; that is, each cell maintains the cumulative sum of all updates whose index hashes to that value.

This allows us to efficiently solve the point-estimation problem, i.e. finding A[i] for an arbitrary i. Our estimate is

Â[i] = min_j count[j, h_j(i)]

This is (certainly) bounded from below by A[i] and (with probability at least 1 − δ) bounded from above by A[i] + ε ||A||_1. Note that count[j, h_j(i)] accumulates not only the increments belonging to index i, but also the increments of any other index that hashes to the same cell. These "extra values" are non-negative, which is why Â[i] is bounded from below by A[i]. The upper bound comes from applying the Markov inequality to the probability P(Â[i] ≥ A[i] + ε ||A||_1). This is the probability P(count[j, h_j(i)] ≥ A[i] + ε ||A||_1 ∀j), i.e. the probability that in every row the sum of the "extra values" is at least ε ||A||_1. The expected "extra weight" in a row is at most ||A||_1 / w = ε ||A||_1 / e, so by the Markov inequality a single row exceeds ε ||A||_1 with probability at most 1/e; since the d hash functions are chosen independently, these probabilities multiply, giving an overall error probability of at most e^{-d} ≤ δ, the desired result. Note that many of the problems posed in earlier sections can be solved using this technique.
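A compact sketch of this structure is below. The class name and the particular pairwise-independent hash family are my own choices, not part of the notes; any pairwise-independent family works.

import math
import random

class CountMinSketch:
    """d = ceil(ln(1/delta)) rows of width w = ceil(e/eps); for non-negative
    updates, the point query returns A_hat[i] with
    A[i] <= A_hat[i] <= A[i] + eps * ||A||_1, with probability >= 1 - delta."""

    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        self.count = [[0] * self.w for _ in range(self.d)]
        self.p = 2 ** 61 - 1                      # prime modulus for the hash family
        rng = random.Random(seed)
        self.coeffs = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(self.d)]    # one (a, b) pair per row

    def _h(self, k, x):
        a, b = self.coeffs[k]
        return ((a * x + b) % self.p) % self.w    # pairwise-independent hash h_k

    def update(self, j, increment):               # process a_t = (j, I_t)
        for k in range(self.d):
            self.count[k][self._h(k, j)] += increment

    def point_query(self, i):                     # A_hat[i] = min_k count[k, h_k(i)]
        return min(self.count[k][self._h(k, i)] for k in range(self.d))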

Sampling

Estimating Number of Distinct Elements

The problem is to estimate D = |{i : A[i] ≠ 0}|. If A[i] is the number of occurrences of i in the stream, D is the number of distinct items; more generally, D is the size of the support of A. One way of estimating D in the cash register model keeps a bit vector c of length log_2 N and uses a hash function f : [1, N] → {1, 2, ..., log_2 N} such that P[f(i) = j] = 2^{-j}; any update to item i sets c[f(i)] to 1. An unbiased estimate of the number of distinct items is given by 2^{k(c)}, where k(c) is the lowest index j such that c[j] = 0. Intuitively, since the probability that an item is mapped to the counter at index j is 2^{-j}, if there are D distinct items we expect D/2 of them to be mapped to c[1], D/4 to c[2], and so on. However, this relies on the existence of a fully random hash function, so the scheme has been extended to work with a hash function that can be stored in O((ε^{-2} log log m + log m · log(1/ε)) log(1/δ)) space. For the turnstile model, the methods for estimating D use L_p-sum estimation for small p.
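A single-copy sketch of that counter is shown below. It is purely illustrative: it uses a cryptographic hash in place of the idealized fully random hash function, and real implementations average many independent copies to reduce variance.

import hashlib
import math

def estimate_distinct(stream, N):
    """Flajolet-Martin-style estimate of D, the number of distinct items.

    `stream` is the sequence of updated indices i (increments are irrelevant here).
    An item i is mapped to level f(i) = 1 + (trailing zero bits of hash(i)),
    so P[f(i) = j] is roughly 2^-j; c[j] records whether any item hit level j.
    The estimate is 2^{k(c)}, where k(c) is the lowest j with c[j] = 0."""
    levels = math.ceil(math.log2(N)) + 1
    c = [0] * (levels + 2)
    for i in stream:
        h = int.from_bytes(hashlib.sha1(str(i).encode()).digest()[:8], "big")
        j = 1
        while (h & 1) == 0 and j <= levels:
            h >>= 1
            j += 1
        c[j] = 1
    k = 1
    while k < len(c) and c[k]:
        k += 1
    return 2 ** k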

Basic Algorithmic Techniques [Sec. 6]

The Algorithmic Techniques section differs from the Mathematical Techniques section in that it focuses on more deterministic settings, where the main innovations lie in careful data structure design.

Estimating wavelet coefficients

In the time series model, consider the problem of approximating the signal by its B largest Haar wavelet coefficients (see Figure 1 for a depiction of the Haar wavelets). Because of the time-localization of the Haar wavelets, we can essentially walk along the signal while keeping two data structures: a heap of the B largest coefficients so far, and a list of log N straddling coefficients, i.e. the "in-progress" coefficients. The meaning of the straddling coefficients and the relationship between the two structures is best visualized by drawing the Haar wavelets on a binary tree sitting on top of the signal. Using the above method, we can compute the best B-term approximation to the signal in the Haar wavelet domain in O(B + log N) space.
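A rough one-pass sketch of this procedure is given below, assuming the signal length is a power of two; the function and structure names are mine, and zero coefficients are simply skipped.

import heapq
import math

def top_B_haar(signal, B):
    """One-pass computation of the B largest-magnitude Haar detail coefficients
    of a length-2^m time series, keeping only a size-B heap plus one
    "straddling" (in-progress) average per level: O(B + log N) space."""
    heap = []      # min-heap of (|coef|, (level, position), coef)
    pending = {}   # level -> the single unpaired scaled average at that level
    seen = {}      # level -> number of completed coefficients at that level

    def keep(coef, label):
        if coef == 0:
            return
        entry = (abs(coef), label, coef)
        if len(heap) < B:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)

    for value in signal:
        avg, level = float(value), 0
        # Whenever a pair at some level is complete, emit its detail coefficient
        # and push the (normalized) average one level up.
        while level in pending:
            left = pending.pop(level)
            keep((left - avg) / math.sqrt(2), (level, seen.get(level, 0)))
            seen[level] = seen.get(level, 0) + 1
            avg = (left + avg) / math.sqrt(2)
            level += 1
        pending[level] = avg

    return sorted(heap, reverse=True)   # the B largest detail coefficients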

Deterministic heavy hitter with sparsity

Although we cannot deterministically solve the heavy hitters problem in the general case, we can if we impose a sparsity constraint on A: we assume that no more than k indices of A are nonzero, and we want to find those k indices and/or their corresponding values. Take x consecutive primes p_1, ..., p_x, each larger than k, where x = (k − 1) log_k N + 1. For each prime p_j, construct a table T_j of size p_j, in which each index i is mapped to i mod p_j. Our update rule for an update (i, I_i) is, for every table j,

T_j[i mod p_j] := T_j[i mod p_j] + I_i

We then claim that each nonzero index will have at least one table where it is the only nonzero index in its entry. Two indices can share the same entry in at most log_k N tables: otherwise, their difference would be divisible by more than log_k N distinct primes, which would imply that the difference is larger than N (since it is the product of more than log_k N numbers greater than k). Since a fixed nonzero index can therefore collide with each of the other k − 1 nonzero indices in at most log_k N tables, (k − 1) log_k N + 1 tables guarantee at least one collision-free table for it. We could then estimate Â[i] by

Â[i] = (1/x) Σ_{j=1}^{x} T_j[i mod p_j]
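A small sketch of this construction follows; the class and helper names are my own, and the naive prime search is only for illustration.

import math

def primes_above(k, count):
    """Return `count` consecutive primes larger than k (naive trial division)."""
    primes, candidate = [], k + 1
    while len(primes) < count:
        if candidate > 1 and all(candidate % d for d in range(2, math.isqrt(candidate) + 1)):
            primes.append(candidate)
        candidate += 1
    return primes

class SparsePointEstimator:
    """Deterministic structure for k-sparse signals over [0, N):
    x = (k-1)*ceil(log_k N) + 1 tables, table j indexed by i mod p_j."""

    def __init__(self, N, k):
        assert k >= 2, "sketch assumes k >= 2 so that log_k N is defined"
        x = (k - 1) * math.ceil(math.log(N, k)) + 1
        self.primes = primes_above(k, x)
        self.tables = [[0] * p for p in self.primes]

    def update(self, i, delta):
        # T_j[i mod p_j] := T_j[i mod p_j] + I_i, applied to every table.
        for T, p in zip(self.tables, self.primes):
            T[i % p] += delta

    def estimate(self, i):
        # A_hat[i] = (1/x) * sum_j T_j[i mod p_j], as in the notes above.
        vals = [T[i % p] for T, p in zip(self.tables, self.primes)]
        return sum(vals) / len(vals)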