This quiz was given by Madam Amrita Ahuja in an Artificial Intelligence class at the Central University of Jammu and Kashmir. It covers: learning, hypothesis classes, the standard perceptron algorithm, linear separability, linear and Gaussian kernels, and SVMs.
[Figure: six candidate separators for the data, labeled A through F.]
Consider a classification problem with two real-valued inputs. For each of the following algorithms, specify all of the separators below that it could have generated, and explain why. If it could not have generated any of the separators, explain why not.
B — the SVM algorithm will find some separator in the space that maximizes the margin, even if the data are not linearly separable
E — a small sigma results in a classifier that more tightly fits the training data because the Gaussian bumps at each point are narrower
D (or F) — a larger sigma results in a classifier that generalizes better because the Gaussian bumps at each point are wider. D is the separator actually generated by an SVM with Gaussian kernel, σ = 1, but we accepted F because it is difficult to tell which of these two would be generated without actually running the algorithm.
Assume that we are using an SVM with a polynomial kernel of degree 2. You are given the following support vectors:
i   x1   x2   y
1    1    2   +1
2    2    1   −1

(The labels must be opposite, since the α values are equal and Σ_i αi yi = 0.)
The α values for each of these support vectors are equal to 0.05.
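For concreteness, here is a minimal sketch of evaluating the resulting decision function. The kernel form K(u, v) = (1 + u·v)² and the zero default offset are assumptions, not given above:

```python
import numpy as np

# Assumed degree-2 polynomial kernel; the exact form is not specified above.
def poly_kernel(u, v):
    return (1.0 + np.dot(u, v)) ** 2

X_sv = np.array([[1.0, 2.0], [2.0, 1.0]])    # support vectors from the table
y_sv = np.array([+1.0, -1.0])                # their labels
alpha = np.array([0.05, 0.05])               # given: all alphas equal 0.05

def decision(x, b=0.0):
    """SVM output f(x) = sum_i alpha_i y_i K(x_i, x) + b (b unspecified here)."""
    return sum(a * yi * poly_kernel(sv, x)
               for a, yi, sv in zip(alpha, y_sv, X_sv)) + b

print(decision(np.array([1.0, 2.0])))        # lands on the positive side
```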
A physician wants to use a neural network to predict whether patients have a disease, based on the results of a battery of tests. He has assigned a cost of c01 to false positives (generating an output of 1 when it ought to have been 0), and a cost of c10 to false negatives (generating an output of 0 when it ought to have been 1). The cost of a correct answer is 0. The neural network is just a single sigmoid unit, which computes the following function:
g(x̄) = s(w̄ · x̄)

with s(z) being the usual sigmoid function.
He decides to train the network by minimizing the cost-weighted error:

E(w̄) = c10 · Σ_{i : yi = 1} (g(x̄i) − yi)² + c01 · Σ_{i : yi = 0} (g(x̄i) − yi)²
Describe, in English that is not simply a direct paraphrase of the mathematics, what it measures. Answer: the squared prediction error, weighted so that mistakes on patients who have the disease (yi = 1) cost c10 and mistakes on healthy patients (yi = 0) cost c01.
The gradient of E with respect to the weights is:

∂E/∂w̄ = 2 c10 · Σ_{i : yi = 1} (g(x̄i) − yi)(ds/dzi) x̄i + 2 c01 · Σ_{i : yi = 0} (g(x̄i) − yi)(ds/dzi) x̄i

where

ds/dzi = g(x̄i)(1 − g(x̄i))
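A minimal numpy sketch of this gradient; the function and variable names here are illustrative, with labels yi in {0, 1}:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_gradient(w, X, y, c10, c01):
    """Gradient of E(w) = c10*sum_{yi=1}(g - y)^2 + c01*sum_{yi=0}(g - y)^2
    for a single sigmoid unit g(x) = s(w . x); labels y are in {0, 1}."""
    g = sigmoid(X @ w)                     # predictions g(x_i), shape (n,)
    ds_dz = g * (1.0 - g)                  # sigmoid derivative at each point
    cost = np.where(y == 1, c10, c01)      # c10 on diseased, c01 on healthy
    return X.T @ (2.0 * cost * (g - y) * ds_dz)   # sum_i of (...) * x_i
```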
In all the parts of this problem we will be dealing with one-dimensional data, that is, a set of points (xi) with only one feature (called simply x). The points are in two classes given by the value of yi. We will show you the points on the x axis, labeled by their class values; we also give you a table of values.
i   xi   yi
1    1    0
2    2    1
3    3    1
4    4    0
5    6    1
6    7    1
7   10    0
8   11    1
Assume that each of the units of a neural net uses one of the following output functions of the total activation z (instead of the usual sigmoid s(z)):

l(z) = z

f(z) = 0 if z < −1
f(z) = 1 if z > 1
f(z) = 0.5(z + 1) otherwise
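For concreteness, a short sketch of this second output function (the name "ramp" is mine, not the exam's; np.clip expresses the same three cases):

```python
import numpy as np

def ramp(z):
    """f(z) = 0 for z < -1, 1 for z > 1, and 0.5*(z + 1) in between."""
    return np.clip(0.5 * (z + 1.0), 0.0, 1.0)

print(ramp(np.array([-2.0, 0.0, 2.0])))      # -> [0.  0.5 1. ]
```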
What are the values for the αi and the offset b that would give the maximal margin linear classifier for the two data points shown below? You should be able to find the answer without deriving it from the dual Lagrangian.
i   xi   yi
1    0   +1
2    4   −1
We know that w = Σ_i αi xi yi. Thus:

w = α1 x1 y1 + α2 x2 y2
w = α1 (0)(1) + α2 (4)(−1)
w = −4 α2

We know further that Σ_i yi αi = 0, so the alphas must be equal. Lastly, we know that the margin for the support vectors is 1, so w x1 + b = 1, which tells us that b = 1, and w x2 + b = −1, which tells us that w = −0.5. Thus we know that α1 = α2 = 1/8.
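A quick numeric check of these values:

```python
# Check the maximal-margin solution for x1 = 0 (y = +1), x2 = 4 (y = -1).
alpha1 = alpha2 = 1.0 / 8.0
w = alpha1 * 0 * (+1) + alpha2 * 4 * (-1)    # w = sum_i alpha_i x_i y_i
b = 1.0                                      # from w*x1 + b = +1 at x1 = 0
assert w == -0.5
assert w * 0 + b == +1.0                     # margin constraint at x1
assert w * 4 + b == -1.0                     # margin constraint at x2
```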
Grady Ent decides to train a single sigmoid unit using the following error function:
E(w) = (1/2) Σ_i (y(x^i, w) − y*_i)² + (β/2) Σ_j wj²

where y(x^i, w) = s(x^i · w), with s(z) = 1/(1 + e^(−z)) being our usual sigmoid function.
∂E/∂wj = ∂/∂wj [ (1/2) Σ_i (y(x^i, w) − y*_i)² ] + ∂/∂wj [ (β/2) Σ_j wj² ]
       = Σ_i (∂E/∂y)(∂y/∂z)(∂z/∂wj) + β wj
       = Σ_i (y − y*_i) y (1 − y) x^i_j + β wj
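A minimal sketch of one gradient-descent step using this gradient (numpy; the step size eta and all names are assumptions). Note how the β wj term shrinks each weight toward zero, which is the effect of the regularizer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ridge_step(w, X, y_star, beta, eta=0.1):
    """One step of gradient descent on
    E(w) = 0.5 * sum_i (y(x^i, w) - y*_i)^2 + 0.5 * beta * sum_j w_j^2."""
    y = sigmoid(X @ w)                                  # unit outputs
    grad = X.T @ ((y - y_star) * y * (1.0 - y)) + beta * w
    return w - eta * grad
```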
Following are some different strategies for pruning decision trees. We assume that we grow the decision tree until there is one or a small number of elements in each leaf. Then, we prune by deleting individual leaves of the tree until the score of the tree starts to get worse. The question is how to score each possible pruning of the tree. For each possible definition of the score below, explain whether or not it would be a good idea and give a reason why or why not.
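As a sketch of the pruning procedure just described, under assumptions of my own: a simple binary-tree representation, a pluggable score function, and the stopping rule "keep pruning while the score does not get worse":

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    prediction: int                      # majority class stored at this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def prunable(root):
    """Internal nodes whose two children are both leaves: deleting the
    leaves turns such a node into a leaf itself."""
    found, stack = [], [root]
    while stack:
        n = stack.pop()
        if n is None or n.is_leaf():
            continue
        if (n.left is not None and n.right is not None
                and n.left.is_leaf() and n.right.is_leaf()):
            found.append(n)
        stack.extend([n.left, n.right])
    return found

def greedy_prune(root, score):
    """Delete leaf pairs one pruning at a time, keeping any pruning that
    does not make score(root) worse; stop when all candidates make it worse."""
    pruned = True
    while pruned:
        pruned = False
        current = score(root)
        for node in prunable(root):
            saved = (node.left, node.right)
            node.left = node.right = None       # tentatively collapse to a leaf
            if score(root) >= current:
                pruned = True                   # keep this pruning
                break
            node.left, node.right = saved       # score got worse: undo
    return root
```

The question below is about which score function to plug in; the loop itself is the same for every choice.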
Problem 4: Learning (25 points)
Part A: (5 Points)
Since the cost of using a nearest neighbor classifier grows with the size of the training set, sometimes one tries to eliminate redundant points from the training set. These are points whose removal does not affect the behavior of the classifier for any possible new point.
[Figure: 2-D training points labeled + and −, with the nearest-neighbor decision boundary sketched between the classes. The boundary shown is only approximate.]
Let the Voronoi cell of a training point be the set of points that are closer to it than to any other training point. The Voronoi cell of a redundant point borders only Voronoi cells of points of the same class.
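This test can be sketched computationally: in general position, two Voronoi cells share a boundary exactly when their points are joined by a Delaunay edge, so a candidate redundant point is one whose Delaunay neighbors all share its class. A sketch assuming scipy (note this is a one-shot test; removing a point changes the diagram, so iterated removal would need to recompute it):

```python
import numpy as np
from scipy.spatial import Delaunay

def redundant_candidates(X, y):
    """Indices of training points all of whose Voronoi neighbors
    (Delaunay-adjacent points) share their own class label."""
    tri = Delaunay(X)
    indptr, indices = tri.vertex_neighbor_vertices
    out = []
    for i in range(len(X)):
        neighbors = indices[indptr[i]:indptr[i + 1]]
        if np.all(y[neighbors] == y[i]):
            out.append(i)
    return out
```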
Part C: (10 Points)
In this network, all the units are sigmoid except unit 5 which is linear (its output is simply the weighted sum of its inputs). All the bias weights are zero. The dashed connections have weights of -1, all the other connections (solid lines) have weights of 1.
w2,3 = (−1) − (1)(1)(0.5) = −1.5.
Part D: (10 Points)
Draw a plausible classifier output curve for a trained SVM, indicating the classifier output for every feature value in the range shown. Do this twice, once assuming that the standard deviation (σ) is very small relative to the distance between adjacent training points and again assuming that the standard deviation (σ) is about double the distance between adjacent training points.
Small standard deviation (σ):

[Plot: SVM classifier output vs. feature value, for training points labeled +, −, −, +.]

Large standard deviation (σ):

[Plot: very approximate SVM classifier output vs. feature value, for the same +, −, −, + training points.]
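To see the two regimes concretely, one can train a Gaussian-kernel SVM at two bandwidths; a sketch assuming scikit-learn, with gamma = 1/(2σ²) standing in for the σ above:

```python
import numpy as np
from sklearn.svm import SVC

# 1-D training data in the same + - - + pattern as the figure.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([+1, -1, -1, +1])

xs = np.linspace(-1.0, 4.0, 200).reshape(-1, 1)
for sigma in (0.1, 2.0):                       # small vs. large bandwidth
    clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma**2)).fit(X, y)
    out = clf.decision_function(xs)            # the classifier output curve
    print(f"sigma={sigma}: output range [{out.min():.2f}, {out.max():.2f}]")
```

With small σ the output spikes near each training point and collapses toward zero elsewhere (the tight fit from the earlier answer); with large σ it varies smoothly across the whole range.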