
10-601 Machine Learning, Midterm Exam

Instructors: Tom Mitchell, Ziv Bar-Joseph

Monday 22nd October, 2012

There are 5 questions, for a total of 100 points. This exam has 16 pages; make sure you have all pages before you begin. This exam is open book, open notes, but no computers or other electronic devices.

Good luck!

Name:

Andrew ID:

Question                        Points   Score
Short Answers                   20
Comparison of ML algorithms     20
Regression                      20
Bayes Net                       20
Overfitting and PAC Learning    20
Total:                          100

Question 1. Short Answers

True False Questions. (a) [1 point] We can get multiple locally optimal solutions if we solve a linear regression problem by minimizing the sum of squared errors using gradient descent. True False

Solution: False. The sum of squared errors is convex in the weights, so gradient descent can only converge to the global optimum.
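To make the convexity argument concrete, here is a small sketch (mine, not part of the exam) showing gradient descent on the SSE objective reaching the same solution from several starting points; the data and step size are illustrative.

```python
import numpy as np

# Toy data for one-feature linear regression (illustrative; not from the exam).
rng = np.random.default_rng(0)
X = rng.normal(size=50)
Y = 3.0 * X + rng.normal(scale=0.5, size=50)

def sse_gradient(a):
    # d/da sum_i (Y_i - a X_i)^2 = -2 sum_i (Y_i - a X_i) X_i
    return -2.0 * np.sum((Y - a * X) * X)

# Run gradient descent from several different starting points.
for a0 in (-10.0, 0.0, 10.0):
    a = a0
    for _ in range(200):
        a -= 0.001 * sse_gradient(a)
    print(f"start={a0:+6.1f}  ->  a={a:.4f}")  # all starts converge to the same value
```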

(b) [1 point] When a decision tree is grown to full depth, it is more likely to fit the noise in the data. True False

Solution: True

(c) [1 point] When the hypothesis space is richer, overfitting is more likely. True False

Solution: True

(d) [1 point] When the feature space is larger, overfitting is more likely. True False

Solution: True

(e) [1 point] We can use gradient descent to learn a Gaussian Mixture Model. True False

Solution: True

Short Questions. (f) [3 points] Can you represent the following boolean function with a single logistic threshold unit (i.e., a single unit from a neural network)? If yes, show the weights. If not, explain why not in 1-2 sentences.

A  B  f(A,B)
1  1  0
0  0  0
1  0  1
0  1  0
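The preview omits the solution to (f). Since the table above is f(A,B) = A AND NOT B, which is linearly separable, the answer is yes; the sketch below verifies one workable choice of weights (my choice, not an official one).

```python
# Hedged sketch: verify that one logistic threshold unit represents f(A,B) = A AND NOT B.
# Weights are one workable choice (w0 = -0.5, wA = 1, wB = -1), not an official answer.
w0, wA, wB = -0.5, 1.0, -1.0
truth_table = {(1, 1): 0, (0, 0): 0, (1, 0): 1, (0, 1): 0}  # from the exam table

for (A, B), f in truth_table.items():
    prediction = int(w0 + wA * A + wB * B > 0)  # threshold the weighted sum at 0
    assert prediction == f, (A, B)
print("A single threshold unit reproduces the table.")
```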

(g) [3 points] Suppose we clustered a set of N data points using two different clustering algorithms: k-means and Gaussian mixtures. In both cases we obtained 5 clusters and in both cases the centers of the clusters are exactly the same. Can 3 points that are assigned to different clusters in the k-means solution be assigned to the same cluster in the Gaussian mixture solution? If no, explain. If so, sketch an example or explain in 1-2 sentences.

Solution: Yes. k-means assigns each data point to a unique cluster based on its distance to the cluster center. Gaussian mixture clustering gives a soft (probabilistic) assignment to each data point. Therefore, even if cluster centers are identical in both methods, if the Gaussian mixture components have large variances (components are spread around their centers), points on the edges between clusters may be given different assignments in the Gaussian mixture solution.
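A minimal 1-D illustration of this effect, with centers, spreads, and the test point chosen by me: the point is closer to the first center, so k-means assigns it there, but a wide second component can still dominate the mixture posterior.

```python
import numpy as np
from scipy.stats import norm

# Two 1-D cluster centers shared by both methods (illustrative values of mine).
mu = np.array([0.0, 4.0])
# Gaussian mixture: equal weights, but very different spreads.
sigma = np.array([0.5, 3.0])

x = 1.5  # a point closer (in plain Euclidean distance) to the first center

# k-means: hard assignment by distance to the nearest center.
kmeans_cluster = np.argmin(np.abs(x - mu))

# Mixture: soft assignment by posterior probability of each component.
dens = norm.pdf(x, loc=mu, scale=sigma)
posterior = dens / dens.sum()
gmm_cluster = np.argmax(posterior)

print(kmeans_cluster, gmm_cluster, posterior)
# k-means picks center 0; the wide second component wins the mixture posterior.
```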

Circle the correct answer(s).

(h) [3 points] As the number of training examples goes to infinity, your model trained on that data will have: A. Lower variance B. Higher variance C. Same variance

Solution: Lower variance

(i) [3 points] As the number of training examples goes to infinity, your model trained on that data will have: A. Lower bias B. Higher bias C. Same bias

Solution: Same bias

(j) [3 points] Suppose you are given an EM algorithm that finds maximum likelihood estimates for a model with latent variables. You are asked to modify the algorithm so that it finds MAP estimates instead. Which step or steps do you need to modify: A. Expectation B. Maximization C. No modification necessary D. Both

Solution: Maximization. The E-step (computing the posterior over the latent variables) is unchanged; the prior enters only the objective maximized in the M-step, which becomes expected log-likelihood plus log-prior.
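To make the asymmetry concrete, here is a sketch (mine, using a Bernoulli mixture component with a Beta prior as illustrative choices) of how only the M-step update changes when moving from MLE to MAP:

```python
import numpy as np

# Sketch of the M-step change only (not a full EM loop). In MLE-EM the M-step
# maximizes the expected log-likelihood; in MAP-EM it maximizes that plus the
# log-prior. Here: updating one Bernoulli mixture component's parameter, with
# a Beta(alpha, beta) prior as my illustrative choice.
def m_step_mle(resp, x):
    # resp: E-step responsibilities for this component; x: binary data vector.
    return np.sum(resp * x) / np.sum(resp)

def m_step_map(resp, x, alpha=2.0, beta=2.0):
    # Same expected counts, plus the prior's pseudo-counts from the log-prior term.
    return (np.sum(resp * x) + alpha - 1) / (np.sum(resp) + alpha + beta - 2)
```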

Question 2. Comparison of ML algorithms

Assume we have a set of data from patients who have visited UPMC hospital during the year 2011. A set of features (e.g., temperature, height) has also been extracted for each patient. Our goal is to decide whether a new visiting patient has any of diabetes, heart disease, or Alzheimer's (a patient can have one or more of these diseases).

(a) [3 points] We have decided to use a neural network to solve this problem. We have two choices: either to train a separate neural network for each of the diseases or to train a single neural network with one output neuron for each disease, but with a shared hidden layer. Which method do you prefer? Justify your answer.

Solution: 1. A neural network with a shared hidden layer can capture dependencies between diseases. It can be shown that in some cases, when there is a dependency between the output nodes, having a shared node in the hidden layer can improve the accuracy. 2. If there is no dependency between diseases (output neurons), then we would prefer to have a separate neural network for each disease.
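For concreteness, here is a forward-pass sketch of the shared-hidden-layer option; the layer sizes and random weights are illustrative, and training is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_features, n_hidden, n_diseases = 5, 8, 3  # sizes are illustrative

# One network, one shared hidden layer, one sigmoid output per disease.
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))   # input -> shared hidden
W2 = rng.normal(scale=0.1, size=(n_hidden, n_diseases))   # hidden -> 3 outputs

def predict(x):
    h = sigmoid(x @ W1)      # shared representation used by all three outputs
    return sigmoid(h @ W2)   # P(diabetes), P(heart disease), P(Alzheimer's)

print(predict(rng.normal(size=n_features)))
```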

(b) [3 points] Some patient features are expensive to collect (e.g., brain scans) whereas others are not (e.g., temperature). Therefore, we have decided to first ask our classification algorithm to predict whether a patient has a disease, and if the classifier is 80% confident that the patient has a disease, then we will do additional examinations to collect additional patient features. In this case, which classification method do you recommend: neural networks, decision trees, or naive Bayes? Justify your answer in one or two sentences.

Solution: We expect students to explain how each of these learning techniques can be used to output a confidence value (any of these techniques can be modified to provide a confidence value). In addition, naive Bayes is preferable to the other methods since we can still use it for classification when the values of some of the features are unknown. We gave partial credit to those who mentioned neural networks because of their non-linear decision boundary, or decision trees since they give us an interpretable answer.

(c) Assume that we use a logistic regression learning algorithm to train a classifier for each disease. The classifier is trained to obtain MAP estimates for the logistic regression weights W. Our MAP estimator optimizes the objective

$$W \leftarrow \arg\max_W \; \ln\Big[ P(W) \prod_l P(Y^l \mid X^l, W) \Big]$$

where l refers to the l-th training example. We adopt a Gaussian prior with zero mean for the weights $W = \langle w_1, \ldots, w_n \rangle$, making the above objective equivalent to:

$$W \leftarrow \arg\max_W \; \Big[ -C \sum_i w_i^2 + \sum_l \ln P(Y^l \mid X^l, W) \Big]$$

Note C here is a constant, and we re-run our learning algorithm with different values of C. Please answer each of these true/false questions, and explain/justify your answer in no more than 2 sentences. i. [2 points] The average log-probability of the training data can never increase as we increase C. True False
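The preview cuts off the solution to (i), but the claimed behavior is easy to observe. Below is a sketch (toy data and hyperparameters mine) that trains the penalized objective above by gradient ascent and prints the average training log-probability as C increases.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary data (illustrative, not the exam's patient data).
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def train(C, steps=2000, lr=0.1):
    """Maximize sum_l ln P(y_l | x_l, w) - C * sum_i w_i^2 by gradient ascent."""
    w = np.zeros(2)
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (y - p) - 2.0 * C * w   # d/dw [log-likelihood - C ||w||^2]
        w += lr * grad / len(X)
    return w

for C in (0.0, 1.0, 10.0, 100.0):
    w = train(C)
    p = sigmoid(X @ w)
    avg_ll = np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    print(f"C={C:6.1f}  avg train log-prob = {avg_ll:.4f}")  # does not increase with C
```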

(d) Decision boundary

[Figure 1: Labeled training set; panels (a) and (b).]

i. [2 points] Figure 1(a) illustrates a subset of our training data when we have only two features: $X_1$ and $X_2$. Draw the decision boundary for the logistic regression that we explained in part (c).

Solution: The decision boundary for logistic regression is linear. One candidate solution which classifies all the data correctly is shown in Figure 1. We will accept other possible solutions, since the decision boundary depends on the value of C (it is possible for the trained classifier to misclassify a few of the training data points if we choose a large value of C).

ii. [3 points] Now assume that we add a new data point as it is shown in Figure 1(b). How does it change the decision boundary that you drew in Figure 1(a)? Answer this by drawing both the old and the new boundary.

Solution: We expect the decision boundary to move a little toward the new data point.

(e) [3 points] Assume that we record information on all the patients who visit UPMC every day. However, for many of these patients we don't know if they have any of the diseases. Can we still improve the accuracy of our classifier using these data? If yes, explain how, and if no, justify your answer.

Solution: Yes, by using EM. In the class, we showed how EM can improve the accuracy of our classifier using both labeled and unlabeled data. For more details, please look at http://www.cs.cmu.edu/~tom/10601_fall2012/slides/GrMod3_10_9_2012.pdf, page 6.

Question 3. Regression

Consider real-valued variables X and Y. The Y variable is generated, conditional on X, from the following process:

$$\epsilon \sim N(0, \sigma^2), \qquad Y = aX + \epsilon$$

where every $\epsilon$ is an independent variable, called a noise term, which is drawn from a Gaussian distribution with mean 0 and standard deviation $\sigma$. This is a one-feature linear regression model, where a is the only weight parameter. The conditional probability of Y has distribution $p(Y \mid X, a) \sim N(aX, \sigma^2)$, so it can be written as

$$p(Y \mid X, a) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y - aX)^2 \right)$$

The following questions are all about this model.

MLE estimation

(a) [3 points] Assume we have a training dataset of n pairs $(X_i, Y_i)$ for i = 1..n, and σ is known. Which of the following equations correctly represent the maximum likelihood problem for estimating a? Say yes or no to each one. More than one of them should have the answer "yes."

[Solution: no ] $\arg\max_a \sum_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$

[Solution: yes ] $\arg\max_a \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$

[Solution: no ] $\arg\max_a \sum_i \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$

[Solution: yes ] $\arg\max_a \prod_i \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right)$

[Solution: no ] $\arg\max_a \sum_i (Y_i - aX_i)^2$

[Solution: yes ] $\arg\min_a \sum_i (Y_i - aX_i)^2$

(b) [7 points] Derive the maximum likelihood estimate of the parameter a in terms of the training example Xi’s and Yi’s. We recommend you start with the simplest form of the problem you found above.

Solution: Starting from the equivalent problem $\arg\min_a \sum_i (Y_i - aX_i)^2$, set the derivative with respect to a to zero:

$$\frac{\partial}{\partial a} \sum_i (Y_i - aX_i)^2 = -2 \sum_i (Y_i - aX_i) X_i = 0 \quad\Longrightarrow\quad \hat{a}_{MLE} = \frac{\sum_i X_i Y_i}{\sum_i X_i^2}$$
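As a quick sanity check (data simulated by me from the stated model), the closed form agrees with a generic least-squares solver:

```python
import numpy as np

# Check the closed form a_MLE = (sum_i X_i Y_i) / (sum_i X_i^2) against a
# generic least-squares solver, on data simulated from the stated model.
rng = np.random.default_rng(3)
a_true, sigma = 2.5, 1.0
X = rng.normal(size=200)
Y = a_true * X + rng.normal(scale=sigma, size=200)

a_closed_form = np.sum(X * Y) / np.sum(X ** 2)
a_lstsq = np.linalg.lstsq(X[:, None], Y, rcond=None)[0][0]

print(a_closed_form, a_lstsq)  # agree, and both are near a_true
```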

MAP estimation

(c) Now assume a zero-mean Gaussian prior over the weight, $p(a \mid \lambda) \sim N(0, \lambda^2)$, with prior width parameter λ. For each limit below, indicate how each quantity changes.

                                  p(a|λ) prior probability:    p(Y_1...Y_n|X_1...X_n, a) conditional    |a_MLE − a_MAP|:
                                  wider, narrower, or same?    likelihood: wider, narrower, or same?    increase or decrease?
As λ → ∞                          [Solution: wider]            [Solution: same]                         [Solution: decrease]
As λ → 0                          [Solution: narrower]         [Solution: same]                         [Solution: increase]
More data: as n → ∞ (fixed λ)     [Solution: same]             [Solution: narrower]                     [Solution: decrease]

(d) [7 points] Assume σ = 1, and a fixed prior parameter λ. Solve for the MAP estimate of a,

$$\arg\max_a \; \big[ \ln p(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n, a) + \ln p(a \mid \lambda) \big]$$

Your solution should be in terms of the $X_i$'s, $Y_i$'s, and λ.

Solution:

$$\frac{\partial}{\partial a} \big[ \log p(Y \mid X, a) + \log p(a \mid \lambda) \big] = \frac{\partial \ell}{\partial a} + \frac{\partial \log p(a \mid \lambda)}{\partial a}$$

To stay sane, let's look at it as maximization, not minimization. (It's easy to get signs wrong by trying to use the squared error minimization form from before.) Since σ = 1, the log-likelihood and its derivative are

$$\ell(a) = \log \left[ \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y_i - aX_i)^2 \right) \right]$$

$$\ell(a) = -\log Z - \frac{1}{2} \sum_i (Y_i - aX_i)^2 \tag{8}$$

$$\frac{\partial \ell}{\partial a} = -\sum_i (Y_i - aX_i)(-X_i) \tag{9}$$

$$= \sum_i (Y_i - aX_i) X_i \tag{10}$$

$$= \sum_i X_i Y_i - aX_i^2 \tag{11}$$

Next get the partial derivative for the log-prior.

$$\frac{\partial \log p(a \mid \lambda)}{\partial a} = \frac{\partial}{\partial a} \left[ -\log(\sqrt{2\pi}\,\lambda) - \frac{1}{2\lambda^2} a^2 \right] = -\frac{a}{\lambda^2}$$

The full partial is the sum of that and the log-likelihood derivative, which we did before. Setting it to zero,

$$\frac{\partial \ell}{\partial a} + \frac{\partial \log p(a \mid \lambda)}{\partial a} = \sum_i X_i Y_i - aX_i^2 - \frac{a}{\lambda^2} = 0$$

$$a = \frac{\sum_i X_i Y_i}{\left( \sum_i X_i^2 \right) + 1/\lambda^2}$$

Partial credit: 1 point for writing out the log posterior, and/or doing some derivative. 1 point for getting the derivative correct.

For the full solution: deduct a point for a sign error. (There are many potential places for flipping signs.) Deduct a point for having $n/\lambda^2$: this results from wrapping a sum around the log-prior. (Only the log-likelihood has a $\sum_i$ around it, since it is the probability of drawing each data point. The parameter a is drawn only once.)

Some people didn’t set σ = 1 and kept σ to the end. We simply gave credit if substituting σ = 1 gave the right answer; a few people may have derived the wrong answer but we didn’t carefully check all these cases.

People who gave gradient descent rules were graded similarly to before: 4 points if correct; deduct one for a sign error.
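A small numeric sketch of this formula (simulated data and λ values of my choosing) makes the limits from part (c) visible:

```python
import numpy as np

# Sketch of the MAP formula a = (sum X_i Y_i) / (sum X_i^2 + 1/lambda^2),
# showing shrinkage toward 0 for a narrow prior and MLE-like behavior for a wide one.
rng = np.random.default_rng(4)
X = rng.normal(size=30)
Y = 2.5 * X + rng.normal(size=30)  # sigma = 1, as in part (d)

a_mle = np.sum(X * Y) / np.sum(X ** 2)
for lam in (0.1, 1.0, 10.0, 100.0):
    a_map = np.sum(X * Y) / (np.sum(X ** 2) + 1.0 / lam ** 2)
    print(f"lambda={lam:6.1f}  a_MAP={a_map:.4f}   (a_MLE={a_mle:.4f})")
# As lambda -> infinity the prior widens and a_MAP -> a_MLE;
# as lambda -> 0 the prior concentrates at 0 and a_MAP -> 0.
```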

Question 4. Bayes Net

(e) [3 points] From your answer to (d), can you say $X_{13}$ and $X_{33}$ are independent? Why?

Solution: No. Conditional independence doesn’t imply marginal independence.

(f) [3 points] Can you say the same thing when $X_{22} = 1$? In other words, can you say $X_{13}$ and $X_{33}$ are independent given $X_{22} = 1$? Why?

Solution: Yes. $X_{22}$ is the only parent of $X_{33}$, and $X_{13}$ is a nondescendant of $X_{33}$, so by the rule from lecture we can say they are independent given $X_{22} = 1$.

(g) [2 points] Replace $X_{21}$ and $X_{22}$ by a single new variable $X_2$ whose value is a pair of boolean values, defined as $X_2 = \langle X_{21}, X_{22} \rangle$. Draw the new Bayes net B′ after the change.

Solution:

[Diagram of B′: top row $X_{11}, X_{12}, X_{13}$; middle, the merged node $X_2 = (X_{21}, X_{22})$; bottom row $X_{31}, X_{32}, X_{33}$.]

(h) [3 points] Do all the conditional independences in B hold in the new network B′? If not, write one that is true in B but not in B′. Consider only the variables present in both B and B′.

Solution: No. For instance, $X_{32}$ is not conditionally independent of $X_{33}$ given $X_{22}$ anymore.

• Note: We noticed the problem description was a bit ambiguous, so we also accepted yes as a correct answer.

Question 5. Overfitting and PAC Learning

(a) Solution: False. The variance in test accuracy will decrease as we increase the size of the test set.

(b) Short answers.

i. [2 points] Given the above plot of training and test accuracy, which size decision tree would you choose to use to classify future examples? Give a one-sentence justification.

Solution: The tree with 10 nodes. This has the highest test accuracy of any of the trees, and hence the highest expected true accuracy.

ii. [2 points] What is the amount of overfitting in the tree you selected?

Solution: overfitting = training accuracy minus test accuracy = 0.77 − 0.74 = 0.03.

Let us consider the above plot of training and test error from the perspective of agnostic PAC bounds. Consider the agnostic PAC bound we discussed in class:

$$m \ge \frac{1}{2\epsilon^2} \left( \ln |H| + \ln(1/\delta) \right)$$

where $\epsilon$ is defined to be the difference between $error_{true}(h)$ and $error_{train}(h)$ for any hypothesis h output by the learner.

iii. [2 points] State in one carefully worded sentence what the above PAC bound guarantees about the two curves in our decision tree plot above.

Solution: If we train on m examples drawn at random from P(X), then with probability $(1 - \delta)$ the overfitting (difference between training and true accuracy) for each hypothesis in the plot will be less than or equal to $\epsilon$. Note that the true accuracy is the expected value of the test accuracy, taken over different randomly drawn test sets.

iv. [2 points] Assume we used 200 training examples to produce the above decision tree plot. If we wish to reduce the overfitting to half of what we observe there, how many training examples would you suggest we use? Justify your answer in terms of the agnostic PAC bound, in no more than two sentences.

Solution: The bound shows that m grows as $\frac{1}{2\epsilon^2}$. Therefore if we wish to halve $\epsilon$, it will suffice to increase m by a factor of 4. We should use 200 × 4 = 800 training examples.
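A quick sketch of this scaling (the values of ln|H| and δ below are my illustrative choices; the exam does not fix them):

```python
import math

def pac_sample_bound(epsilon, delta, ln_H):
    """m >= (1 / (2 epsilon^2)) * (ln|H| + ln(1/delta)) -- the agnostic bound above."""
    return (ln_H + math.log(1.0 / delta)) / (2.0 * epsilon ** 2)

# Illustrative numbers only.
ln_H, delta = 50.0, 0.05
for eps in (0.10, 0.05):
    print(f"epsilon={eps:.2f}  m >= {pac_sample_bound(eps, delta, ln_H):,.0f}")
# Halving epsilon multiplies the required m by exactly 4, matching the 200 -> 800 argument.
```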

v. [2 points] Give a one sentence explanation of why you are not certain that your recommended number of training examples will reduce overfitting by exactly one half.

Solution: There are several reasons, including the following. 1. Our PAC theory result gives a bound, not an equality, so 800 examples might decrease overfitting by more than half. 2. The "observed" overfitting is based on the test set accuracy, which is only an estimate of the true accuracy, so it may vary from the true accuracy, and our "observed" overfitting will vary accordingly.

(c) You decide to estimate the probability θ that a particular coin will turn up heads, by flipping it 10 times. You notice that if you repeat this experiment, each time obtaining a new set of 10 coin flips, you get different resulting estimates. You repeat the experiment N = 20 times, obtaining estimates $\hat{\theta}^1, \hat{\theta}^2, \ldots, \hat{\theta}^{20}$. You calculate the variance in these estimates as

$$\mathrm{var} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{\theta}^i - \theta^{mean} \right)^2$$

where $\theta^{mean}$ is the mean of your estimates $\hat{\theta}^1, \hat{\theta}^2, \ldots, \hat{\theta}^{20}$.

i. [4 points] Which do you expect to produce a smaller value for var: a maximum likelihood estimator (MLE), or a maximum a posteriori (MAP) estimator that uses a Beta prior? Assume both estimators are given the same data. Justify your answer in one sentence.

Solution: We should expect the MAP estimate to produce a smaller value for var, because using the Beta prior is equivalent to adding in a fixed set of "hallucinated" training examples that will not vary from experiment to experiment.
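This can be checked by simulation; the sketch below uses a Beta(3, 3) prior and a true θ of 0.7, both my illustrative choices:

```python
import numpy as np

# Simulation sketch of part (c)i: variance of repeated MLE vs MAP (Beta prior)
# estimates of a coin's heads probability.
rng = np.random.default_rng(5)
theta, n_flips, N = 0.7, 10, 20
alpha = beta = 3.0  # "hallucinated" pseudo-counts from the Beta prior

mle, mapp = [], []
for _ in range(N):
    heads = rng.binomial(n_flips, theta)
    mle.append(heads / n_flips)
    # MAP mode of the Beta posterior: (heads + alpha - 1) / (n + alpha + beta - 2)
    mapp.append((heads + alpha - 1) / (n_flips + alpha + beta - 2))

print("var(MLE) =", np.var(mle))   # larger
print("var(MAP) =", np.var(mapp))  # smaller: fixed pseudo-counts damp the variation
```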