Lecture notes from TTIC 31020 (Introduction to Machine Learning) on boosting: combining classifiers through voting, greedy assembly of classifier combinations, and boosting algorithms that maintain weights on the training data and update them based on the classification so far.
TTIC 31020: Introduction to Machine Learning
Instructor: Greg Shakhnarovich
TTI–Chicago
October 27, 2010
General form of SVM with kernel K:

    max_α   ∑_{i=1}^N α_i  −  (1/2) ∑_{i,j=1}^N α_i α_j y_i y_j K(x_i, x_j)
    s.t.    0 ≤ α_i ≤ C

Classification of x:

    ŷ = sign( ŵ_0 + ∑_{i: α_i > 0} α_i y_i K(x_i, x) )
A positive-definite kernel corresponds to a dot product in some feature space.
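As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the kernel classification rule above, assuming the dual variables α_i, the offset ŵ_0, and the support vectors have already been produced by a QP solver; the RBF kernel and all names and values are illustrative.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel -- one example of a positive-definite kernel."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_predict(x, support_x, support_y, alpha, w0, kernel=rbf_kernel):
    """Kernel SVM decision rule: yhat = sign(w0 + sum_{i: alpha_i > 0} alpha_i y_i K(x_i, x))."""
    s = w0
    for x_i, y_i, a_i in zip(support_x, support_y, alpha):
        if a_i > 0:                       # only support vectors contribute
            s += a_i * y_i * kernel(x_i, x)
    return int(np.sign(s))

# Toy usage with made-up dual variables (in practice they come from the QP solver).
support_x = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
support_y = [+1, -1]
alpha = [0.7, 0.7]
print(svm_predict(np.array([0.1, 0.9]), support_x, support_y, alpha, w0=0.0))
```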
A similar idea: combine classifiers h_1(x), ..., h_m(x):

    H(x) = α_1 h_1(x) + ... + α_m h_m(x),

where α_j is the vote assigned to classifier h_j.
Prediction: ŷ(x) = sign H(x).
Classifiers h_j can be simple (e.g., based on a single feature).
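A minimal sketch of such a weighted vote, assuming the simple classifiers are single-feature threshold rules (decision stumps); the stump parameters and the votes α_j below are made up purely for illustration.

```python
import numpy as np

def stump(feature, threshold, flip=False):
    """A very simple classifier h(x) in {-1, +1}, based on a single feature."""
    s = -1 if flip else 1
    return lambda x: s * (1 if x[feature] > threshold else -1)

def combined_predict(x, classifiers, votes):
    """H(x) = alpha_1 h_1(x) + ... + alpha_m h_m(x); predict yhat = sign(H(x))."""
    H = sum(a * h(x) for a, h in zip(votes, classifiers))
    return int(np.sign(H))

# Three weak classifiers on a 2-d input, with made-up votes alpha_j.
hs = [stump(0, 0.5), stump(1, -0.2), stump(0, 1.5, flip=True)]
alphas = [0.8, 0.4, 0.3]
print(combined_predict(np.array([1.0, 0.0]), hs, alphas))
```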
Consider a family of classifiers H parametrized by θ. Setting θ_1: minimize the training error

    ∑_{i=1}^N L_{0/1}(h(x_i; θ_1), y_i).

How do we set θ_2? We would like to minimize the training error of the combination,

    ∑_{i=1}^N L_{0/1}(H(x_i), y_i),

where H(x) = sign(α_1 h(x; θ_1) + α_2 h(x; θ_2)).
Adaptive boosting: weights on the training examples are updated based on the classification so far.
Votes: assigned based on the weighted error ε_j in the j-th iteration,

    α_j = (1/2) log( (1 − ε_j) / ε_j ).

The only requirement: ε_j < 1/2.
Weights: updated and normalized, so that ∑_i W_i^(j) = 1.
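A quick numerical illustration of the vote formula (the helper name is arbitrary, not from the slides): the smaller the weighted error ε_j, the larger the vote, and ε_j → 1/2 drives the vote to zero.

```python
import numpy as np

def vote(eps):
    """AdaBoost vote for a classifier with weighted error eps < 1/2:
    alpha = 0.5 * log((1 - eps) / eps)."""
    return 0.5 * np.log((1 - eps) / eps)

# Smaller weighted error -> larger vote; eps close to 1/2 -> vote close to 0.
for eps in (0.1, 0.3, 0.45, 0.499):
    print(f"eps = {eps:>5}:  alpha = {vote(eps):.3f}")
```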
We will use the exponential loss to measure the quality of the classifier:

    L(H(x), y) = e^{−y·H(x)},

so that the total loss on the training set is

    ∑_{i=1}^N L(H(x_i), y_i) = ∑_{i=1}^N e^{−y_i·H(x_i)}.

[Plot: the exponential loss as a function of y·H(x), compared with the 0/1 loss.]
The exponential loss is a differentiable approximation (bound) of the 0/1 loss.
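As a small check (not from the slides), the snippet below tabulates the two losses at a few arbitrarily chosen values of y·H(x): the exponential loss upper-bounds the 0/1 loss and is differentiable everywhere.

```python
import numpy as np

# Compare the 0/1 loss and the exponential loss as functions of the margin y*H(x).
margins = np.linspace(-1.5, 1.5, 7)
loss01 = (margins <= 0).astype(float)   # 1 if misclassified (y*H(x) <= 0), else 0
loss_exp = np.exp(-margins)             # e^{-y*H(x)}

for m, l0, le in zip(margins, loss01, loss_exp):
    print(f"y*H = {m:+.2f}   0/1 loss = {l0:.0f}   exp loss = {le:.3f}")
# The exponential loss is always >= the 0/1 loss, and it is smooth in H.
```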
Denote H_m(x) = α_1 h_1(x) + ... + α_m h_m(x).
Suppose we add α_m · h_m(x) to H_{m−1}:

    L(H_m, X) = ∑_{i=1}^N e^{−y_i·[H_{m−1}(x_i) + α_m h_m(x_i)]}
              = ∑_{i=1}^N e^{−y_i H_{m−1}(x_i) − α_m y_i h_m(x_i)}
              = ∑_{i=1}^N e^{−y_i H_{m−1}(x_i)} · e^{−α_m y_i h_m(x_i)}
Exponential loss after the m-th iteration:

    L(H_m, X) = ∑_{i=1}^N e^{−y_i H_{m−1}(x_i)} · e^{−y_i α_m h_m(x_i)}
              = ∑_{i=1}^N W_i^(m−1) · e^{−y_i α_m h_m(x_i)},

where W_i^(m−1) = e^{−y_i H_{m−1}(x_i)} is fixed and the second factor is what we need to optimize.
W_i^(m−1) captures the "history" of classification of x_i by H_{m−1}.
Optimization: choose α_m and h_m = h(x; θ_m) that minimize the (weighted) exponential loss at iteration m:

    ∑_{i=1}^N W_i^(m−1) e^{−α_m y_i h_m(x_i)}
        = e^{−α_m} ∑_{i: y_i = h_m(x_i)} W_i^(m−1)  +  e^{α_m} ∑_{i: y_i ≠ h_m(x_i)} W_i^(m−1).

For any α_m > 0, minimizing this over h_m ⇒ minimizing the weighted training error.
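The split above can be checked numerically. The sketch below uses made-up weights and weak-classifier outputs (not from the slides); it also computes the α_m that minimizes the split form, α_m = (1/2) log(W_correct / W_wrong), which for normalized weights is exactly the vote formula (1/2) log((1 − ε_m)/ε_m).

```python
import numpy as np

# Made-up weights, labels, and weak-classifier outputs for one boosting iteration.
W = np.array([0.10, 0.05, 0.20, 0.15, 0.10, 0.10, 0.20, 0.10])  # sums to 1
y = np.array([+1, +1, -1, -1, +1, -1, +1, -1])
h = np.array([+1, +1, -1, +1, +1, -1, +1, +1])   # wrong on two of the examples

def weighted_exp_loss(alpha):
    """sum_i W_i * exp(-alpha * y_i * h(x_i))"""
    return np.sum(W * np.exp(-alpha * y * h))

# Split of the sum: e^{-alpha} * (weight of correct) + e^{+alpha} * (weight of wrong).
W_correct = W[y == h].sum()
W_wrong = W[y != h].sum()                        # with normalized weights, this is eps_m
alpha = 0.5 * np.log(W_correct / W_wrong)        # the alpha minimizing the split form

print(weighted_exp_loss(alpha))                                 # loss at the optimum
print(np.exp(-alpha) * W_correct + np.exp(alpha) * W_wrong)     # identical value
print("weighted error eps_m:", W_wrong)
```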
We can normalize the weights:

    W_i^(m−1) = e^{−y_i H_{m−1}(x_i)} / ∑_{j=1}^N e^{−y_j H_{m−1}(x_j)}.
The AdaBoost algorithm (a code sketch follows below):

1 Initialize the weights: W_i^(0) = 1/N.
2 Iterate for m = 1, ..., M:
  - fit h_m to the weighted data and compute its weighted error
        ε_m = ∑_{i: y_i ≠ h_m(x_i)} W_i^(m−1);
  - set the vote α_m = (1/2) log( (1 − ε_m) / ε_m );
  - update the weights, normalized so that ∑_i W_i^(m) = 1:
        W_i^(m) ∝ W_i^(m−1) e^{−α_m y_i h_m(x_i)}.
3 The combined classifier: sign( ∑_{m=1}^M α_m h_m(x) ).
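Here is a minimal, self-contained Python sketch of the algorithm. The choice of decision stumps as weak learners, the exhaustive stump search, the function names, and the toy data are assumptions for illustration, not part of the slides.

```python
import numpy as np

def fit_stump(X, y, W):
    """Exhaustive search for the decision stump (feature, threshold, sign)
    with the smallest weighted error on the training data."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.where(X[:, j] > t, 1, -1)
                err = np.sum(W[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, t, s)
    return best                              # (weighted error, feature, threshold, sign)

def stump_predict(X, j, t, s):
    return s * np.where(X[:, j] > t, 1, -1)

def adaboost(X, y, M=10):
    """A minimal AdaBoost sketch with decision stumps as the weak learners."""
    N = X.shape[0]
    W = np.full(N, 1.0 / N)                  # step 1: uniform initial weights
    stumps, alphas = [], []
    for _ in range(M):                       # step 2: boosting iterations
        eps, j, t, s = fit_stump(X, y, W)
        if eps >= 0.5 or eps < 1e-12:        # stop if no useful weak classifier is found
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        W = W * np.exp(-alpha * y * stump_predict(X, j, t, s))
        W /= W.sum()                         # normalize so the weights sum to 1
        stumps.append((j, t, s))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    """Step 3: the combined classifier sign(sum_m alpha_m h_m(x))."""
    H = sum(a * stump_predict(X, j, t, s) for a, (j, t, s) in zip(alphas, stumps))
    return np.sign(H)

# Toy 1-d data with a "+ + - - + +" pattern that no single stump can fit.
X = np.arange(6, dtype=float).reshape(-1, 1)
y = np.array([+1, +1, -1, -1, +1, +1])
stumps, alphas = adaboost(X, y, M=20)
print(len(alphas), "rounds, training accuracy:", np.mean(predict(X, stumps, alphas) == y))
```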
Training error of H goes down.
Weighted error ε_m goes up ⇒ votes α_m go down.
Exponential loss goes strictly down.
[Plot over boosting iterations: training error of H, average exponential loss, and weighted error of h_m.]
Typical behavior: test error can still decrease after training error is flat (even zero).
We can define the margin of an example:

    γ(x_i) = y_i · ( α_1 h_1(x_i) + ... + α_m h_m(x_i) ) / ( α_1 + ... + α_m ).

γ(x_i) ∈ [−1, 1], and it is positive iff H(x_i) = y_i; this is a measure of confidence in the correct decision.
Iterations of AdaBoost increase the margin of training examples!
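A tiny numerical illustration of the normalized margin, with made-up weak-classifier outputs and votes (all values are arbitrary, chosen only to make the formula concrete).

```python
import numpy as np

# Margins for a toy combination of m = 3 classifiers on 3 examples.
alphas = np.array([0.8, 0.5, 0.3])
h_outputs = np.array([[+1, +1, -1],     # row i holds h_1(x_i), h_2(x_i), h_3(x_i)
                      [+1, -1, +1],
                      [-1, -1, -1]])
y = np.array([+1, +1, -1])

H = h_outputs @ alphas                  # alpha_1 h_1(x_i) + ... + alpha_m h_m(x_i)
gamma = y * H / alphas.sum()            # normalized margin, always in [-1, 1]
print(np.round(gamma, 3))               # positive entries are correctly classified
```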