This quiz was given by Madam Amrita Ahuja in an Artificial Intelligence class at the Central University of Jammu and Kashmir. It covers: learning, hypothesis classes, the standard perceptron algorithm, linear separability, linear and Gaussian kernels, and SVMs.
[Figure: six candidate separators for the data, labeled A through F.]
Consider a classification problem with two real-valued inputs. For each of the following algorithms, specify all of the separators below that it could have generated, and explain why. If it could not have generated any of the separators, explain why not.
B — the SVM algorithm will find some separator in the space that maximizes the margin, even if the data are not linearly separable
E — a small sigma results in a classifier that more tightly fits the training data because the Gaussian bumps at each point are narrower
D (or F) — a larger sigma results in a classifier that generalizes better because the Gaussian bumps at each point are wider. D is the separator actually generated by an SVM with Gaussian kernel, σ = 1, but we accepted F because it is difficult to tell which of these two would be generated without actually running the algorithm.
Assume that we are using an SVM with a polynomial kernel of degree 2. You are given the following support vectors:
i   x1   x2   y
1    1    2   +1
2    2    1   −1

(The labels must be opposite, since the α values are equal and Σ_i αi yi = 0.)
The α values for each of these support vectors are equal to 0.05.
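For concreteness, here is a minimal sketch of evaluating the resulting decision function. The kernel form K(u, v) = (1 + u·v)² and the zero default offset are assumptions, not given above:

```python
import numpy as np

# Assumed degree-2 polynomial kernel; the exact form is not specified above.
def poly_kernel(u, v):
    return (1.0 + np.dot(u, v)) ** 2

X_sv = np.array([[1.0, 2.0], [2.0, 1.0]])    # support vectors from the table
y_sv = np.array([+1.0, -1.0])                # their labels
alpha = np.array([0.05, 0.05])               # given: all alphas equal 0.05

def decision(x, b=0.0):
    """SVM output f(x) = sum_i alpha_i y_i K(x_i, x) + b (b unspecified here)."""
    return sum(a * yi * poly_kernel(sv, x)
               for a, yi, sv in zip(alpha, y_sv, X_sv)) + b

print(decision(np.array([1.0, 2.0])))        # lands on the positive side
```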
A physician wants to use a neural network to predict whether patients have a disease, based on the results of a battery of tests. He has assigned a cost of c01 to false positives (generating an output of 1 when it ought to have been 0), and a cost of c10 to false negatives (generating an output of 0 when it ought to have been 1). The cost of a correct answer is 0. The neural network is just a single sigmoid unit, which computes the following function:
g(x̄) = s(w̄ · x̄)

with s(z) being the usual sigmoid function.
He decides to train the network by minimizing the cost-weighted error:

E(w̄) = c10 · Σ_{i : yi = 1} (g(x̄i) − yi)² + c01 · Σ_{i : yi = 0} (g(x̄i) − yi)²
Describe, in English that is not simply a direct paraphrase of the mathematics, what it measures. Answer: the squared prediction error, weighted so that mistakes on patients who have the disease (yi = 1) cost c10 and mistakes on healthy patients (yi = 0) cost c01.
The gradient of E with respect to the weights is:

∂E/∂w̄ = 2 c10 · Σ_{i : yi = 1} (g(x̄i) − yi)(ds/dzi) x̄i + 2 c01 · Σ_{i : yi = 0} (g(x̄i) − yi)(ds/dzi) x̄i

where

ds/dzi = g(x̄i)(1 − g(x̄i))
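A minimal numpy sketch of this gradient; the function and variable names here are illustrative, with labels yi in {0, 1}:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_gradient(w, X, y, c10, c01):
    """Gradient of E(w) = c10*sum_{yi=1}(g - y)^2 + c01*sum_{yi=0}(g - y)^2
    for a single sigmoid unit g(x) = s(w . x); labels y are in {0, 1}."""
    g = sigmoid(X @ w)                     # predictions g(x_i), shape (n,)
    ds_dz = g * (1.0 - g)                  # sigmoid derivative at each point
    cost = np.where(y == 1, c10, c01)      # c10 on diseased, c01 on healthy
    return X.T @ (2.0 * cost * (g - y) * ds_dz)   # sum_i of (...) * x_i
```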
In all the parts of this problem we will be dealing with one-dimensional data, that is, a set of points (xi) with only one feature (called simply x). The points are in two classes given by the value of yi. We will show you the points on the x axis, labeled by their class values; we also give you a table of values.
i   xi   yi
1    1    0
2    2    1
3    3    1
4    4    0
5    6    1
6    7    1
7   10    0
8   11    1
Assume that each of the units of a neural net uses one of the following output functions of the total activation z (instead of the usual sigmoid s(z)):

l(z) = z

f(z) = 0 if z < −1
f(z) = 1 if z > 1
f(z) = 0.5(z + 1) otherwise
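For concreteness, a short sketch of this second output function (the name "ramp" is mine, not the exam's; np.clip expresses the same three cases):

```python
import numpy as np

def ramp(z):
    """f(z) = 0 for z < -1, 1 for z > 1, and 0.5*(z + 1) in between."""
    return np.clip(0.5 * (z + 1.0), 0.0, 1.0)

print(ramp(np.array([-2.0, 0.0, 2.0])))      # -> [0.  0.5 1. ]
```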
What are the values for the αi and the offset b that would give the maximal margin linear classifier for the two data points shown below? You should be able to find the answer without deriving it from the dual Lagrangian.
i   xi   yi
1    0   +1
2    4   −1
We know that w = Σ_i αi xi yi. Thus:

w = α1 x1 y1 + α2 x2 y2
w = α1 (0)(1) + α2 (4)(−1)
w = −4 α2

We know further that Σ_i yi αi = 0, so the alphas must be equal. Lastly, we know that the margin for the support vectors is 1, so w x1 + b = 1, which tells us that b = 1, and w x2 + b = −1, which tells us that w = −0.5. Thus we know that α1 = α2 = 1/8.
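A quick numeric check of these values:

```python
# Check the maximal-margin solution for x1 = 0 (y = +1), x2 = 4 (y = -1).
alpha1 = alpha2 = 1.0 / 8.0
w = alpha1 * 0 * (+1) + alpha2 * 4 * (-1)    # w = sum_i alpha_i x_i y_i
b = 1.0                                      # from w*x1 + b = +1 at x1 = 0
assert w == -0.5
assert w * 0 + b == +1.0                     # margin constraint at x1
assert w * 4 + b == -1.0                     # margin constraint at x2
```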
Grady Ent decides to train a single sigmoid unit using the following error function:
E(w) = (1/2) Σ_i (y(x^i, w) − y*_i)² + (β/2) Σ_j wj²

where y(x^i, w) = s(x^i · w), with s(z) = 1/(1 + e^(−z)) being our usual sigmoid function.
∂E/∂wj = ∂/∂wj [ (1/2) Σ_i (y(x^i, w) − y*_i)² ] + ∂/∂wj [ (β/2) Σ_j wj² ]
       = Σ_i (∂E/∂y)(∂y/∂z)(∂z/∂wj) + β wj
       = Σ_i (y − y*_i) y (1 − y) x^i_j + β wj
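A minimal sketch of one gradient-descent step using this gradient (numpy; the step size eta and all names are assumptions). Note how the β wj term shrinks each weight toward zero, which is the effect of the regularizer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ridge_step(w, X, y_star, beta, eta=0.1):
    """One step of gradient descent on
    E(w) = 0.5 * sum_i (y(x^i, w) - y*_i)^2 + 0.5 * beta * sum_j w_j^2."""
    y = sigmoid(X @ w)                                  # unit outputs
    grad = X.T @ ((y - y_star) * y * (1.0 - y)) + beta * w
    return w - eta * grad
```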
Following are some different strategies for pruning decision trees. We assume that we grow the decision tree until there is one or a small number of elements in each leaf. Then, we prune by deleting individual leaves of the tree until the score of the tree starts to get worse. The question is how to score each possible pruning of the tree. For each possible definition of the score below, explain whether or not it would be a good idea and give a reason why or why not.
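As a sketch of the pruning procedure just described, under assumptions of my own: a simple binary-tree representation, a pluggable score function, and the stopping rule "keep pruning while the score does not get worse":

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    prediction: int                      # majority class stored at this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def prunable(root):
    """Internal nodes whose two children are both leaves: deleting the
    leaves turns such a node into a leaf itself."""
    found, stack = [], [root]
    while stack:
        n = stack.pop()
        if n is None or n.is_leaf():
            continue
        if (n.left is not None and n.right is not None
                and n.left.is_leaf() and n.right.is_leaf()):
            found.append(n)
        stack.extend([n.left, n.right])
    return found

def greedy_prune(root, score):
    """Delete leaf pairs one pruning at a time, keeping any pruning that
    does not make score(root) worse; stop when all candidates make it worse."""
    pruned = True
    while pruned:
        pruned = False
        current = score(root)
        for node in prunable(root):
            saved = (node.left, node.right)
            node.left = node.right = None       # tentatively collapse to a leaf
            if score(root) >= current:
                pruned = True                   # keep this pruning
                break
            node.left, node.right = saved       # score got worse: undo
    return root
```

The question below is about which score function to plug in; the loop itself is the same for every choice.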
Problem 4: Learning (25 points)
Part A: (5 Points)
Since the cost of using a nearest neighbor classifier grows with the size of the training set, sometimes one tries to eliminate redundant points from the training set. These are points whose removal does not affect the behavior of the classifier for any possible new point.
[Figure: 2-D training points labeled + and −, with the nearest-neighbor decision boundary sketched between the classes. The boundary shown is only approximate.]
Let the Voronoi cell of a training point be the set of points that are closer to it than to any other training point. The Voronoi cell of a redundant point borders only Voronoi cells of points of the same class.
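This test can be sketched computationally: in general position, two Voronoi cells share a boundary exactly when their points are joined by a Delaunay edge, so a candidate redundant point is one whose Delaunay neighbors all share its class. A sketch assuming scipy (note this is a one-shot test; removing a point changes the diagram, so iterated removal would need to recompute it):

```python
import numpy as np
from scipy.spatial import Delaunay

def redundant_candidates(X, y):
    """Indices of training points all of whose Voronoi neighbors
    (Delaunay-adjacent points) share their own class label."""
    tri = Delaunay(X)
    indptr, indices = tri.vertex_neighbor_vertices
    out = []
    for i in range(len(X)):
        neighbors = indices[indptr[i]:indptr[i + 1]]
        if np.all(y[neighbors] == y[i]):
            out.append(i)
    return out
```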
Part C: (10 Points)
In this network, all the units are sigmoid except unit 5 which is linear (its output is simply the weighted sum of its inputs). All the bias weights are zero. The dashed connections have weights of -1, all the other connections (solid lines) have weights of 1.
w2,3 = (−1) − (1)(1)(0.5) = −1.5.
Part D: (10 Points)
Draw a plausible classifier output curve for a trained SVM, indicating the classifier output for every feature value in the range shown. Do this twice, once assuming that the standard deviation (σ) is very small relative to the distance between adjacent training points and again assuming that the standard deviation (σ) is about double the distance between adjacent training points.
Small standard deviation (σ):

[Plot: SVM classifier output vs. feature value, for training points labeled +, −, −, +.]

Large standard deviation (σ):

[Plot: very approximate SVM classifier output vs. feature value, for the same +, −, −, + training points.]
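To see the two regimes concretely, one can train a Gaussian-kernel SVM at two bandwidths; a sketch assuming scikit-learn, with gamma = 1/(2σ²) standing in for the σ above:

```python
import numpy as np
from sklearn.svm import SVC

# 1-D training data in the same + - - + pattern as the figure.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([+1, -1, -1, +1])

xs = np.linspace(-1.0, 4.0, 200).reshape(-1, 1)
for sigma in (0.1, 2.0):                       # small vs. large bandwidth
    clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma**2)).fit(X, y)
    out = clf.decision_function(xs)            # the classifier output curve
    print(f"sigma={sigma}: output range [{out.min():.2f}, {out.max():.2f}]")
```

With small σ the output spikes near each training point and collapses toward zero elsewhere (the tight fit from the earlier answer); with large σ it varies smoothly across the whole range.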