CS 573: Algorithms, Fall 2013 November 12, 2013
22.1.0.1 Labeling...
(A) Given examples: a database of cars.
(B) We would like to determine which cars are sport cars.
(C) Each car record is interpreted as a point in high dimensions.
(D) Example: a sport car with 4 doors, manufactured in 1997 by Quaky (manufacturer ID 6), becomes the point (4, 1997, 6), labeled as a sport car.
(E) A tractor made by General Mess (manufacturer ID 3) in 1998 becomes the point (0, 1998, 3), labeled as not a sport car. (Both records are encoded in the sketch below.)
(F) Real world: hundreds of attributes, in some cases even millions of attributes!
(G) Automate this classification process: label sport/regular cars automatically.
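A minimal sketch of this representation in Python (the variable name and the ±1 label encoding are illustrative assumptions; the two records are the ones above):

```python
# Each record: (number of doors, year, manufacturer ID); label +1 for a
# sport car and -1 otherwise.
labeled_examples = [
    ((4, 1997, 6), +1),   # sport car by Quaky (manufacturer ID 6)
    ((0, 1998, 3), -1),   # tractor by General Mess (manufacturer ID 3)
]
```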
22.1.0.2 Automatic classification...
(A) A learning algorithm:
(a) is given several (or many) classified examples...
(b) ...develops its own conjecture for a rule of classification...
(c) ...and can then use it to classify new data.
(B) Learning = training + classifying.
(C) Learn a function f : ℝ^d → {−1, 1}.
(D) Challenge: f might have infinite complexity...
(E) ...a rare situation in the real world. Assume learnable functions.
(F) Example: red and blue points that are linearly separable.
(G) Trying to learn a line ℓ that separates the red points from the blue points.
22.1.0.3 Linear separability example...
22.1.0.4 Learning linear separation
(A) Given red and blue points, how do we compute the separating line ℓ?
(B) A line/plane/hyperplane is the zero set of a linear function.
(C) Form: ∀x ∈ ℝ^d: f(x) = ⟨a, x⟩ + b, where a = (a_1, ..., a_d) ∈ ℝ^d and b ∈ ℝ. Here ⟨a, x⟩ = Σ_i a_i x_i is the dot product of a and x.
(D) Classification is done by computing the sign of f(x): sign(f(x)).
(E) If sign(f(x)) is negative, x is not in the class; if positive, it is inside.
(F) A set of training examples:
S = { (x_1, y_1), ..., (x_n, y_n) },
where x_i ∈ ℝ^d and y_i ∈ {−1, 1}, for i = 1, ..., n.
22.1.0.5 Classification...
(A) A linear classifier is a pair h = (w, b), where w ∈ ℝ^d and b ∈ ℝ.
(B) The classification of x ∈ ℝ^d is sign(⟨w, x⟩ + b).
(C) For a labeled example (x, y), h classifies (x, y) correctly if sign(⟨w, x⟩ + b) = y.
(D) Assume a linear classifier exists.
(E) Given n labeled examples, how do we compute a linear classifier for them?
(F) Use linear programming... (a sketch follows below).
(G) We are looking for (w, b) such that for every (x_i, y_i) we have sign(⟨w, x_i⟩ + b) = y_i; that is,
⟨w, x_i⟩ + b ≥ 0 if y_i = 1, and ⟨w, x_i⟩ + b ≤ 0 if y_i = −1.
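A minimal sketch of this linear program in Python, using scipy.optimize.linprog (the solver choice and the function names are assumptions, not part of the notes). It asks for y_i(⟨w, x_i⟩ + b) ≥ 1 rather than ≥ 0: whenever a strict separator exists it can be rescaled to satisfy this, and the stronger constraint rules out the trivial solution w = 0, b = 0.

```python
import numpy as np
from scipy.optimize import linprog

def learn_linear_classifier(X, y):
    """Find (w, b) with y_i * (<w, x_i> + b) >= 1 for every example, as an LP
    feasibility problem.  X is an n-by-d array of points, y an array of labels
    in {-1, +1}.  Returns (w, b), or None if no such classifier exists."""
    n, d = X.shape
    # Variables z = (w_1, ..., w_d, b); constraints -y_i * (<w, x_i> + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    c = np.zeros(d + 1)  # no objective: this is a pure feasibility problem
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
    if not res.success:
        return None
    return res.x[:d], res.x[d]

def classify(w, b, x):
    """sign(<w, x> + b), mapping 0 to +1."""
    return 1 if np.dot(w, x) + b >= 0 else -1
```

Any feasible point of the LP is a valid classifier; there is nothing to optimize.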
22.1.0.11 Claim by figure...
[Figure: a "hard" instance and an "easy" instance, each showing points within radius R, the optimal separator w_opt, and the margins γ and γ′ respectively.]
22.1.0.12 Proof of Perceptron convergence...
(A) Idea of proof: the perceptron weight vector converges to w_opt.
(B) (Squared) distance between (R^2/γ) w_opt and the k-th update vector w_k:
α_k = ‖w_k − (R^2/γ) w_opt‖^2.
(C) Quantify the change between α_k and α_{k+1}.
(D) The example being misclassified is (x, y).
22.1.0.13 Proof of Perceptron convergence...
(A) The example being misclassified is (x, y) (both are constants).
(B) w_{k+1} ← w_k + y·x.
(C) α_{k+1} = ‖w_{k+1} − (R^2/γ) w_opt‖^2
= ‖w_k + y·x − (R^2/γ) w_opt‖^2
= ‖(w_k − (R^2/γ) w_opt) + y·x‖^2
= ⟨(w_k − (R^2/γ) w_opt) + y·x, (w_k − (R^2/γ) w_opt) + y·x⟩
= ⟨w_k − (R^2/γ) w_opt, w_k − (R^2/γ) w_opt⟩ + 2y⟨w_k − (R^2/γ) w_opt, x⟩ + y^2⟨x, x⟩
= α_k + 2y⟨w_k − (R^2/γ) w_opt, x⟩ + ‖x‖^2,
since y ∈ {−1, 1} implies y^2 = 1.
22.1.0.14 Proof of Perceptron convergence...
(A) We proved: α_{k+1} = α_k + 2y⟨w_k − (R^2/γ) w_opt, x⟩ + ‖x‖^2.
(B) (x, y) is misclassified: sign(⟨w_k, x⟩) ≠ y.
(C) ⟹ sign(y⟨w_k, x⟩) = −1.
(D) ⟹ y⟨w_k, x⟩ < 0.
(E) ‖x‖ ≤ R ⟹
α_{k+1} ≤ α_k + R^2 + 2y⟨w_k, x⟩ − 2y⟨(R^2/γ) w_opt, x⟩ ≤ α_k + R^2 − (2R^2/γ) y⟨w_opt, x⟩,
(F) ... since 2y⟨w_k, x⟩ < 0.
22.1.0.15 Proof of Perceptron convergence...
(A) Proved: α_{k+1} ≤ α_k + R^2 − (2R^2/γ) y⟨w_opt, x⟩.
(B) sign(⟨w_opt, x⟩) = y.
(C) By the margin assumption: y⟨w_opt, x⟩ ≥ γ, for all (x, y) ∈ S.
(D) α_{k+1} ≤ α_k + R^2 − (2R^2/γ) y⟨w_opt, x⟩ ≤ α_k + R^2 − (2R^2/γ)·γ = α_k + R^2 − 2R^2 = α_k − R^2.
22.1.0.16 Proof of Perceptron convergence...
(A) We have: α_{k+1} ≤ α_k − R^2.
(B) α_0 = ‖0 − (R^2/γ) w_opt‖^2 = (R^4/γ^2) ‖w_opt‖^2 = R^4/γ^2, as w_opt is a unit vector.
(C) ∀i: α_i ≥ 0.
(D) Q: What is the maximum number of classification errors the algorithm can make?
(E) ... which equals the number of updates.
(F) ... and the number of updates is ≤ α_0/R^2 ...
(G) A: ≤ α_0/R^2 = R^2/γ^2. (A sketch of the algorithm follows below.)
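For reference, a minimal sketch of the algorithm whose mistakes were just counted (function and variable names are mine, not from the notes). It is the version used in the proof: no bias term, and the update w_{k+1} ← w_k + y·x on any misclassified example; a bias can be handled by the usual trick of appending a constant coordinate 1 to every x.

```python
import numpy as np

def perceptron(examples):
    """Perceptron as analyzed above.  `examples` is a list of (x, y) pairs with
    x a numpy array and y in {-1, +1}.  Returns (w, number_of_updates).
    If the examples are separable with margin gamma by a unit vector w_opt and
    ||x|| <= R for every example, the loop makes at most R^2 / gamma^2 updates.
    (It loops forever if the examples are not linearly separable.)"""
    w = np.zeros(len(examples[0][0]))
    updates = 0
    while True:
        # Find any misclassified example; sign(0) != y, so w_0 = 0 always updates.
        mistake = next(((x, y) for x, y in examples
                        if np.sign(np.dot(w, x)) != y), None)
        if mistake is None:        # every example is classified correctly
            return w, updates
        x, y = mistake
        w = w + y * x              # the update w_{k+1} <- w_k + y * x
        updates += 1
```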
(C) ∀(x, y, x^2 + y^2) ∈ ℓ(B): ax + by + c(x^2 + y^2) + d ≥ 0.
(D) U(h) = { (x, y) | h((x, y, x^2 + y^2)) ≤ 0 }.
(E) If U(h) is a disk ⟹ R ⊂ U(h) and B ∩ U(h) = ∅.
(F) U(h) ≡ ax + by + c(x^2 + y^2) ≤ −d
(G) ⟺ (x^2 + (a/c)x) + (y^2 + (b/c)y) ≤ −d/c
(H) ⟺ (x + a/(2c))^2 + (y + b/(2c))^2 ≤ (a^2 + b^2)/(4c^2) − d/c.
(I) This is a disk in the plane, as claimed. (A small sketch of this conversion follows below.)
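A small sketch of step (H) in code, assuming c > 0 (the function name is mine): given the coefficients of a linear classifier h learned on the lifted points, it returns the center and radius of the disk U(h).

```python
import math

def disk_from_lifted_classifier(a, b, c, d):
    """Convert h(x, y) = a*x + b*y + c*(x^2 + y^2) + d into the disk
    {(x, y) : h(x, y) <= 0} by completing the square, as in (G)-(H):
        (x + a/(2c))^2 + (y + b/(2c))^2 <= (a^2 + b^2)/(4*c^2) - d/c.
    Requires c > 0; otherwise {h <= 0} is a halfplane or the complement of a disk."""
    assert c > 0
    center = (-a / (2 * c), -b / (2 * c))
    radius_sq = (a * a + b * b) / (4 * c * c) - d / c
    return center, math.sqrt(max(radius_sq, 0.0))  # empty disk if radius_sq < 0
```

For example, a = b = 0, c = 1, d = −1 encodes x^2 + y^2 ≤ 1 and yields center (0, 0) and radius 1.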
22.2.0.22 A closing comment...
Linear separability is a powerful technique that can be used to learn concepts considerably more complicated than a separating hyperplane. The lifting technique shown above is known as linearization, or the kernel technique.
22.3.0.23 A Little Bit On VC Dimension
(A) Q: How complex is the function we are trying to learn?
(B) The VC dimension is one way of capturing this notion (VC = Vapnik and Chervonenkis, 1971).
(C) A matter of expressivity: what is harder to learn?
(a) A rectangle in the plane. (b) A halfplane. (c) A convex polygon with k sides.
22.3.0.24 Thinking about concepts as binary functions...
(A) X = {p_1, p_2, ..., p_m}: points in the plane.
(B) H: the set of all halfplanes.
(C) A halfplane r ∈ H defines a binary vector r(X) = (b_1, ..., b_m), where b_i = 1 if and only if p_i is inside r.
(D) The possible binary vectors generated by halfplanes: U(X, H) = { r(X) | r ∈ H }.
(E) A set X of m elements is shattered by a set of ranges R if |U(X, R)| = 2^m.
(F) What does this mean?
(G) The VC dimension of a set of ranges R is the size of the largest set that it can shatter.
22.3.1 Examples
22.3.1.1 Examples
What is the VC dimension of circles in the plane? X is a set of n points in the plane, and C is the set of all circles. Take X = {p, q, r, s}.
What subsets of X can we generate by a circle?
[Figure: the four points p, q, r, s in the plane.]
22.3.1.2 Subsets realized by disks
{}, {r}, {p}, {q}, {s}, {p, s}, {p, q}, {p, r}, {r, q}, {q, s}, {r, p, q}, {p, r, s}, {p, s, q}, {s, q, r}, and {r, p, q, s}.
We got only 15 sets. There is one set which is not there. Which one?
The VC dimension of circles in the plane is 3.
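This count can be checked by brute force, combining the lifting trick of Section 22.2 with the LP feasibility test from the linear-separation sketch above. The coordinates below are an assumption of mine, chosen as one configuration consistent with the list above (the exact count depends on where the four points are placed); the same test without the lifting, and without fixing the coefficient of x^2 + y^2, enumerates halfplane dichotomies instead.

```python
from itertools import combinations

import numpy as np
from scipy.optimize import linprog

# Assumed coordinates for p, q, r, s (illustrative; chosen so that the
# realizable subsets match the list above).
points = {'p': (5.0, 0.1), 'q': (5.0, -0.1), 'r': (0.0, 0.0), 's': (10.0, 0.0)}

def disk_realizes(inside, outside, eps=1e-6):
    """Is there a disk containing every point of `inside` and no point of `outside`?
    A disk is exactly {(x, y) : a*x + b*y + (x^2 + y^2) + d <= 0} (coefficient of
    x^2 + y^2 scaled to 1), so this is an LP feasibility problem in (a, b, d),
    with a tiny margin eps to insist on strict containment and exclusion."""
    A_ub, b_ub = [], []
    for x, y in inside:    # a*x + b*y + d <= -(x^2 + y^2) - eps
        A_ub.append([x, y, 1.0])
        b_ub.append(-(x * x + y * y) - eps)
    for x, y in outside:   # a*x + b*y + d >= -(x^2 + y^2) + eps
        A_ub.append([-x, -y, -1.0])
        b_ub.append((x * x + y * y) - eps)
    res = linprog(np.zeros(3), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * 3)
    return res.success

names = sorted(points)                      # ['p', 'q', 'r', 's']
realized, missing = [], []
for k in range(len(names) + 1):
    for subset in combinations(names, k):
        inside = [points[n] for n in subset]
        outside = [points[n] for n in names if n not in subset]
        (realized if disk_realizes(inside, outside) else missing).append(set(subset))

print(len(realized), "subsets realized by disks")   # 15 for these coordinates
print("not realized:", missing)                     # the pair {r, s}
```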
22.3.1.3 Sauer’s Lemma
Lemma 22.3.1 (Sauer's Lemma). If R has VC dimension d, then |U(X, R)| = O(m^d), where m is the size of X.