

These notes cover techniques for estimating moments and the number of distinct elements in data streams. They introduce three models of input streams (time series, cash register, and turnstile), focus on estimating the second moment F_2 using random projections and on the count-min sketch, and touch on estimating the number of distinct elements D using hash functions and L_p-sum estimation.
by S. Muthukrishnan
Presentation by Ramesh Sridharan and Matthew Johnson, Part 2
We consider input streams that represent an underlying signal. We write a_1, a_2, …, a_t, … for the input stream, where a_t arrives at time t. The stream describes some underlying signal A[i] for i ∈ [1, N], for some dimensionality N, which we would like to query. Three models are typically used: the time series model, in which a_t is the value A[t] itself, so the signal arrives in index order; the cash register model, in which a_t = (j, I_t) with I_t ≥ 0, meaning A[j] is incremented by I_t and entries only grow; and the turnstile model, in which the increments I_t may also be negative, so deletions are allowed.
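To make the three models concrete, here is a minimal Python illustration; the stream contents, the value of N, and the tuple update format are our own choices, not from the notes:

```python
N = 8
A = [0] * N

# Time series model: a_t is A[t] itself, arriving in index order.
for t, value in enumerate([3, 1, 4, 1, 5, 9, 2, 6]):
    A[t] = value

# Cash register model: updates a_t = (j, I_t) with I_t >= 0,
# so each entry of A only ever increases.
A = [0] * N
for j, inc in [(2, 1), (5, 3), (2, 2)]:
    A[j] += inc

# Turnstile model: increments may also be negative (deletions allowed),
# so entries of A can shrink as well as grow.
for j, inc in [(2, -1), (5, -3)]:
    A[j] += inc
```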
Moments estimation
Here, we want to estimate the kth moment of a stream: F_k = ∑_i A[i]^k. This is useful in many practical settings, as we will see over the next few weeks. In this section, we focus on F_2. We consider random vectors x_ij of length N whose elements are ±1 and four-wise independent, and define X_ij = 〈A, x_ij〉 = ∑_ℓ A[ℓ] x_ij[ℓ]. We can show E[X_ij²] = F_2 by expanding the square of the sum above and noting that the cross terms vanish in expectation. We can also show that Var(X_ij²) ≤ 2F_2² by a similar expansion of X_ij⁴, the second moment of the random variable X_ij². To obtain an approximation that lies within (1 ± ε)F_2 with probability greater than 1 − δ, we take i in the range {1, …, 16/ε²} and j in the range {1, …, 2 log(1/δ)}, and for each j form the average of X_ij² across i, called Y_j. By Chebyshev's inequality, the probability that Y_j deviates from F_2 by more than εF_2 is bounded by a constant (at most 1/8). We then take the median of the Y_j. Unless more than half of the Y_j deviate from F_2 by εF_2, the median will be within the desired range. The probability of this error event is bounded by a Chernoff bound as at most δ, so with probability 1 − δ we have the desired bounds on our estimate.
Count-min sketch
We often want to keep track of A[i] for all i, but this violates our space constraints. So instead of maintaining A[i] for every i, we maintain a 2-dimensional d × w array called count, where w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉. Associated with the array are d hash functions h_1, …, h_d : {1, …, N} → {1, …, w}. When we receive an update a_t = (j, I_t), for each hash function h_k we update count[k, h_k(j)] to count[k, h_k(j)] + I_t; that is, each cell maintains the cumulative sum of all updates whose index hashes to that value. This allows us to efficiently solve the point-estimation problem, i.e., finding A[i] for an arbitrary i. Our estimate is

Â[i] = min_j count[j, h_j(i)]
This is (certainly) bounded from below by A[i] and (with probability at least 1 − δ) bounded from above by A[i] + ε‖A‖₁. Note that count[j, h_j(i)] contains not only the I_t's corresponding to index i, but also the I_t's corresponding to any other index that hashes to the same value. Since these "extra values" are non-negative (assuming the cash register model), every counter is at least A[i], which gives the lower bound on Â[i]. The upper bound comes from applying Markov's inequality to the probability P(Â[i] ≥ A[i] + ε‖A‖₁). This is the probability that count[j, h_j(i)] ≥ A[i] + ε‖A‖₁ for all j, i.e., that in every row the sum of the "extra values" is at least ε‖A‖₁. The expected "extra weight" in one row is at most ‖A‖₁/w ≤ (ε/e)‖A‖₁, so by Markov's inequality a single row fails with probability at most 1/e. Since the d rows use independently chosen hash functions, we can multiply these probabilities, giving failure probability at most e^{−d} ≤ δ. Note that many of the problems described in earlier sections can be solved using this technique.
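A minimal count-min sketch along these lines is shown below; Python's built-in hash salted with a random value stands in for the pairwise-independent hash family assumed in the analysis, and the class itself (names, interface) is our own:

```python
import math
import random

class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)         # width w = ceil(e / eps)
        self.d = math.ceil(math.log(1 / delta))  # depth d = ceil(ln(1/delta))
        self.count = [[0] * self.w for _ in range(self.d)]
        rng = random.Random(seed)
        # one salt per row; hash((salt, j)) stands in for h_k(j)
        self.salts = [rng.getrandbits(64) for _ in range(self.d)]

    def update(self, j, inc=1):
        """Process update (j, I): add I to one cell in each row."""
        for k, salt in enumerate(self.salts):
            self.count[k][hash((salt, j)) % self.w] += inc

    def point_estimate(self, i):
        """Min over rows: always >= A[i] (for non-negative updates),
        and <= A[i] + eps * ||A||_1 with probability >= 1 - delta."""
        return min(self.count[k][hash((salt, i)) % self.w]
                   for k, salt in enumerate(self.salts))

cms = CountMinSketch(eps=0.01, delta=0.01)
for word in ["a", "b", "a", "c", "a"]:
    cms.update(word)
print(cms.point_estimate("a"))  # 3, possibly inflated by collisions
```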
Estimating Number of Distinct Elements
The problem is to estimate D = |{i : A[i] ≠ 0}|. If A[i] is the number of occurrences of i in the stream, D is the number of distinct items; more generally, D is the size of the support of A. One way of estimating D in the cash register model keeps a bit vector c of length log₂ N and uses a hash function f : [1, N] → {1, 2, …, log₂ N} such that P[f(i) = j] = 2^{−j}; any update to item i sets c[f(i)] to 1. An unbiased estimate of the number of distinct items is given by 2^{k(c)}, where k(c) is the lowest index j such that c[j] = 0. Intuitively, if the probability that an item is mapped to the counter at index j is 2^{−j}, then with D distinct items we expect D/2 of them to be mapped to c[1], D/4 to c[2], and so on. However, this relies on the existence of a fully random hash function, and so the method has been extended to work with a hash function that can be stored in O((ε^{−2} log log m + log m log(1/ε)) log(1/δ)) space. For the turnstile model, the methods for estimating D use L_p-sum estimation for small p.
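The bit-vector scheme might look like the following in Python; SHA-256 stands in for the idealized random hash f, with the position of the lowest-order set bit of the hash value playing the role of f(i), so that P[f(i) = j] = 2^{−j}:

```python
import hashlib

def fm_distinct_estimate(stream, num_bits=32):
    """Estimate the number of distinct items as 2^k(c), where k(c)
    is the lowest index j with c[j] = 0."""
    c = [0] * num_bits
    for item in stream:
        h = int.from_bytes(
            hashlib.sha256(repr(item).encode()).digest()[:8], "big")
        # f(i) = position of the lowest-order 1 bit (1-based);
        # this equals j with probability 2^-j
        j = (h & -h).bit_length() if h else num_bits
        c[min(j, num_bits) - 1] = 1
    # k(c): lowest (1-based) index j with c[j] = 0
    k = next((idx + 1 for idx, bit in enumerate(c) if bit == 0),
             num_bits + 1)
    return 2 ** k

print(fm_distinct_estimate(["a", "b", "a", "c", "d", "b"]))  # true D = 4
```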
The Algorithmic Techniques section differs from the Mathematical Techniques section in that it focuses on more deterministic settings, where the main innovations lie in careful data-structure design.
Estimating wavelet coefficients
In the time series model, consider the problem of approximating the signal using its B largest Haar wavelet coefficients (see Figure 1 for a depiction of the Haar wavelets). Because the Haar wavelets are localized in time, we can essentially walk along the signal while keeping two data structures: a heap of the B largest coefficients seen so far, and a list of log N straddling coefficients, i.e., the "in-progress" coefficients. The meaning of the straddling coefficients and the relationship between these structures is best visualized by drawing the Haar wavelets as a binary tree sitting on top of the signal. Using this method, we can compute the best B-term approximation to the signal in the Haar wavelet domain in O(B + log N) space, as sketched below.
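A sketch of this walk follows, assuming the signal length is a power of two. The dictionary `partial` holds the log N straddling (in-progress) smooth values, one per level of the binary tree, and `top` is a size-B min-heap keyed on coefficient magnitude; both names and the orthonormal (divide-by-√2) cascade normalization are our choices:

```python
import heapq
import math

def streaming_top_b_haar(stream, B):
    """One pass over a length-2^L signal, returning the B largest
    (by magnitude) Haar detail coefficients with their (level, position)."""
    partial = {}  # level -> straddling smooth value awaiting its pair
    top = []      # min-heap of (|coef|, (level, position), coef)

    def offer(label, coef):
        heapq.heappush(top, (abs(coef), label, coef))
        if len(top) > B:          # keep only the B largest seen so far
            heapq.heappop(top)

    for t, x in enumerate(stream):
        level, val = 0, float(x)
        # Each completed pair emits a detail coefficient and promotes a
        # smooth value one level up the binary tree over the signal.
        while level in partial:
            left = partial.pop(level)
            offer((level, t >> (level + 1)), (left - val) / math.sqrt(2))
            val = (left + val) / math.sqrt(2)
            level += 1
        partial[level] = val      # at most log N straddling values live here
    # partial now holds only the single top-level scaling coefficient
    return sorted(top, reverse=True)

print(streaming_top_b_haar([4, 2, 5, 5, 1, 1, 3, 7], B=3))
```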
Deterministic heavy hitter with sparsity
Although we cannot deterministically solve the heavy hitters problem in the general case, we can if we impose a sparsity constraint on A: we assume that no more than k indices have nonzero values in A, and we want to find those k indices and/or their corresponding values in A. Take the x consecutive primes p_1, …, p_x larger than k, where x = (k − 1) log_k N + 1. For each prime p_j, construct a table T_j of size p_j, in which each index i is mapped to entry i mod p_j. Our update rule for an update (i, I_i) is
T_j[i mod p_j] := T_j[i mod p_j] + I_i

We then claim that each index will have at least one table in which it is the only nonzero index in its entry. Two indices can share the same entry in at most log_k N tables; otherwise, their difference would be divisible by more than log_k N primes, which would imply that the difference is larger than N (since it would be divisible by the product of more than log_k N numbers each greater than k, and k^{log_k N} = N). We can repeat this argument for every pairing with the other k − 1 nonzero indices, requiring (k − 1) log_k N + 1 tables. We can then estimate A[i] by

Â[i] = (1/x) ∑_{j=1}^{x} T_j[i mod p_j]
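A small sketch of this scheme follows; the naive trial-division prime search and the class interface are ours, and the estimator returns the average over tables exactly as in the formula above (for strictly k-sparse signals one could instead decode from the guaranteed collision-free entries):

```python
import math

def primes_above(k, x):
    """The x smallest primes strictly larger than k (naive trial division)."""
    primes, p = [], k
    while len(primes) < x:
        p += 1
        if all(p % q for q in range(2, math.isqrt(p) + 1)):
            primes.append(p)
    return primes

class SparseHeavyHitters:
    def __init__(self, N, k):
        # x = (k - 1) log_k N + 1 tables (rounded up for non-integer logs)
        x = (k - 1) * math.ceil(math.log(N, k)) + 1
        self.primes = primes_above(k, x)
        self.tables = [[0] * p for p in self.primes]  # table T_j has size p_j

    def update(self, i, inc):
        """Update (i, I_i): T_j[i mod p_j] += I_i in every table."""
        for T, p in zip(self.tables, self.primes):
            T[i % p] += inc

    def estimate(self, i):
        """A-hat[i] = (1/x) * sum_j T_j[i mod p_j]."""
        total = sum(T[i % p] for T, p in zip(self.tables, self.primes))
        return total / len(self.primes)

sk = SparseHeavyHitters(N=1024, k=2)
sk.update(7, 5.0)
sk.update(100, 3.0)
# True A[7] = 5.0; the average is slightly inflated by the few tables
# where index 100 collides with index 7.
print(round(sk.estimate(7), 2))
```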