TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich TTI–Chicago
November 8, 2010
The entropy of a RV A ∈ {a_1, ..., a_m},

$$H(A) \triangleq \sum_{i=1}^{m} p(a_i) \log \frac{1}{p(a_i)} = -\sum_{i=1}^{m} p(a_i) \log p(a_i),$$

is the (asymptotically) optimal codelength for a sequence of outcomes of that RV.
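A minimal Python sketch of this computation, using the 3-letter alphabet from the Huffman example further below:

```python
import math

def entropy(p, base=2):
    """Shannon entropy H(A) = -sum_i p_i log p_i (in bits for base 2)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

# True distribution p of the 3-letter alphabet used in the Huffman example below.
p = [0.5, 0.2, 0.3]
print(entropy(p))  # ~1.485 bits: no code can do better per symbol on average
```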
Minimum Description Length (MDL) principle:

$$\operatorname*{argmin}_{\hat\theta} DL(X, \hat\theta) \approx \operatorname*{argmin}_{\hat\theta} \left[ -\sum_{i} \log p\left(x_i \mid \hat\theta\right) + DL(\hat\theta) \right]$$

where DL(θ̂) is the cost of describing the model itself. BIC is an approximation of MDL.
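A rough sketch of this two-part description-length view; the likelihood values and parameter counts below are made up purely for illustration:

```python
import math

def description_length(neg_log_lik, num_params, n):
    """Two-part code: data cost -sum_i log p(x_i | theta_hat) (in nats)
    plus an approximate model cost of (d/2) log n nats, as in BIC."""
    return neg_log_lik + 0.5 * num_params * math.log(n)

# Hypothetical numbers: a 2- vs a 3-component mixture fit to n = 1000 points.
print(description_length(1400.0, num_params=5, n=1000))  # ~1417.3
print(description_length(1390.0, num_params=8, n=1000))  # ~1417.6: the extra fit doesn't pay for the extra parameters
```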
Suppose we have a discrete random variable X with distribution p, p_i ≜ Pr(X = i), i = 1, ..., m. The optimal code (knowing p) has expected length per observation

$$L(p) = -\sum_{i=1}^{m} p_i \log p_i.$$
Suppose now we think (estimate) the distribution is p̂ = q. Coding with q while the data actually follow p gives expected length

$$L(q) = -\sum_{i=1}^{m} p_i \log q_i.$$
3-letter alphabet, true probabilities p(a) = 0.5, p(b) = 0.2, p(c) = 0.3. Estimate from a (small) sample text: q(a) = 0.35, q(b) = 0.25, q(c) = 0.4. Huffman codes built from each assumed distribution, with expected length under the true p:

assumed distribution P    a     b     c     L(P)
p                         0     10    11    1.5 bits
q                         10    11    0     1.7 bits
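A small sketch verifying the two expected code lengths in the table (codeword assignments copied from above):

```python
# Verifying the expected code lengths in the table above.
p = {'a': 0.5, 'b': 0.2, 'c': 0.3}          # true distribution
code_p = {'a': '0', 'b': '10', 'c': '11'}   # Huffman code built assuming p
code_q = {'a': '10', 'b': '11', 'c': '0'}   # Huffman code built assuming q

def expected_length(true_dist, code):
    """Average codeword length when symbols actually arrive according to true_dist."""
    return sum(true_dist[s] * len(code[s]) for s in true_dist)

print(expected_length(p, code_p))  # 1.5 bits
print(expected_length(p, code_q))  # 1.7 bits: the price of coding with the wrong model
```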
The cost of estimating p by q:

$$D_{KL}(p \,\|\, q) \triangleq L(q) - L(p) = -\sum_{i=1}^{m} p_i \log q_i + \sum_{i=1}^{m} p_i \log p_i = \sum_{i=1}^{m} p_i \left(\log p_i - \log q_i\right) = \sum_{i=1}^{m} p_i \log \frac{p_i}{q_i},$$

called the Kullback-Leibler divergence between p and q.
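In code, D_KL for the alphabet example above. Note that this idealized cost (about 0.068 bits/symbol) is smaller than the 0.2-bit gap in the Huffman table, since Huffman codewords are restricted to integer lengths:

```python
import math

def kl_divergence(p, q, base=2):
    """D_KL(p || q) = sum_i p_i log(p_i / q_i): excess bits per symbol from coding with q."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.2, 0.3]
q = [0.35, 0.25, 0.4]
print(kl_divergence(p, q))  # ~0.068 bits per symbol
```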
A result from information theory:

$$D_{KL}(p \,\|\, q) = \sum_{i=1}^{m} p_i \log \frac{p_i}{q_i}$$

D_KL(p || q) ≥ 0 for any p, q
D_KL(p || q) = 0 if and only if p ≡ q
It's asymmetric: in general D_KL(p || q) ≠ D_KL(q || p)
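A quick numerical check of the asymmetry, on a made-up pair of distributions chosen so the gap is easy to see:

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits."""
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)

# Made-up distributions: a peaked p against a uniform q.
p = [0.9, 0.05, 0.05]
q = [1/3, 1/3, 1/3]
print(kl(p, q))  # ~1.016 bits
print(kl(q, p))  # ~1.347 bits: a different number
```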
Recall: X are observed, Z are hidden; by the chain rule

$$p(X, Z \mid \theta) = p(Z \mid X, \theta)\, p(X \mid \theta)$$

$$\log p(X, Z \mid \theta) - \log p(Z \mid X, \theta) = \log p(X \mid \theta)$$

Now take the expectation w.r.t. p(Z | X, θ^old); the right-hand side does not depend on Z, so it is unchanged:

$$\underbrace{\mathbb{E}_{p(Z \mid X, \theta^{old})}\left[\log p(X, Z \mid \theta)\right]}_{Q(\theta;\, \theta^{old})} - \mathbb{E}_{p(Z \mid X, \theta^{old})}\left[\log p(Z \mid X, \theta)\right] = \log p(X \mid \theta)$$

that is,

$$\log p(X \mid \theta) = Q(\theta; \theta^{old}) - \mathbb{E}_{p(Z \mid X, \theta^{old})}\left[\log p(Z \mid X, \theta)\right]$$
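A sketch checking this identity numerically on a toy two-component 1-D Gaussian mixture (all parameter values below are made up); the point is that the equality holds for any θ, not just a maximizer:

```python
import math

def gauss(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def joint(x, k, theta):
    """p(x, z = k | theta) for a 2-component mixture; theta = (weights, means, variances)."""
    w, mu, var = theta
    return w[k] * gauss(x, mu[k], var[k])

def posterior(x, theta):
    """p(z | x, theta) as a length-2 list."""
    j = [joint(x, k, theta) for k in (0, 1)]
    return [jk / sum(j) for jk in j]

x = 1.3                                            # a single made-up observation
theta_old = ([0.5, 0.5], [0.0, 2.0], [1.0, 1.0])   # made-up "old" parameters
theta     = ([0.4, 0.6], [0.3, 1.8], [1.2, 0.9])   # an arbitrary other theta

r = posterior(x, theta_old)                        # expectations are w.r.t. p(z | x, theta_old)
Q = sum(r[k] * math.log(joint(x, k, theta)) for k in (0, 1))
H = sum(r[k] * math.log(posterior(x, theta)[k]) for k in (0, 1))
print(Q - H)                                               # these two numbers agree:
print(math.log(sum(joint(x, k, theta) for k in (0, 1))))   # log p(x | theta)
```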
Since θ^new = argmax_θ Q(θ; θ^old), we have Q(θ^new; θ^old) ≥ Q(θ^old; θ^old). Also,

$$\mathbb{E}_{p(Z \mid X, \theta^{old})}\left[\log p\left(Z \mid X, \theta^{old}\right)\right] - \mathbb{E}_{p(Z \mid X, \theta^{old})}\left[\log p\left(Z \mid X, \theta^{new}\right)\right] = \sum_{Z} p\left(Z \mid X, \theta^{old}\right) \log \frac{p\left(Z \mid X, \theta^{old}\right)}{p\left(Z \mid X, \theta^{new}\right)} = D_{KL}\left(p\left(Z \mid X, \theta^{old}\right) \,\|\, p\left(Z \mid X, \theta^{new}\right)\right) \geq 0.$$

So, log p(X | θ^new) − log p(X | θ^old) is the sum of these two non-negative terms, hence ≥ 0: an EM iteration can never decrease the likelihood.
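A minimal EM loop for a two-component 1-D Gaussian mixture (synthetic data, made-up initialization) that prints the log-likelihood each iteration; by the argument above it can never decrease:

```python
import math, random

def gauss(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Synthetic 1-D data from a known two-component mixture.
random.seed(0)
X = [random.gauss(-2, 1) for _ in range(150)] + [random.gauss(3, 1) for _ in range(150)]

# Deliberately poor made-up initialization.
w, mu, var = [0.5, 0.5], [-0.5, 0.5], [1.0, 1.0]

for it in range(10):
    # E-step: responsibilities r[i][k] = p(z_i = k | x_i, theta_old)
    r = []
    for x in X:
        j = [w[k] * gauss(x, mu[k], var[k]) for k in (0, 1)]
        r.append([jk / sum(j) for jk in j])
    # M-step: theta_new = argmax_theta Q(theta; theta_old)
    for k in (0, 1):
        nk = sum(ri[k] for ri in r)
        w[k] = nk / len(X)
        mu[k] = sum(ri[k] * x for ri, x in zip(r, X)) / nk
        var[k] = sum(ri[k] * (x - mu[k]) ** 2 for ri, x in zip(r, X)) / nk
    ll = sum(math.log(sum(w[k] * gauss(x, mu[k], var[k]) for k in (0, 1))) for x in X)
    print(f"iteration {it}: log-likelihood {ll:.3f}")  # monotonically non-decreasing
```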
Example: [figure omitted: not recoverable from the extraction]