Exam Questions with Solutions
There are 5 questions, for a total of 100 points. This exam has 16 pages; make sure you have all pages before you begin. This exam is open book, open notes, but no computers or other electronic devices.
Good luck!
Name:
Andrew ID:
Question                        Points   Score
Short Answers                     20
Comparison of ML algorithms       20
Regression                        20
Bayes Net                         20
Overfitting and PAC Learning      20
Total:                           100
True False Questions. (a) [1 point] We can get multiple locally optimal solutions if we solve a linear regression problem by minimizing the sum of squared errors using gradient descent. True False
Solution: False. The sum of squared errors is a convex function of the weights, so gradient descent converges to the single global minimum regardless of initialization.
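A quick numerical illustration of this (a sketch assuming NumPy; the data, step size, and tolerance are invented): because the objective is convex, two very different initializations reach the same weights.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def gd(w, lr=0.01, steps=5000):
    # Plain gradient descent on the mean squared error.
    for _ in range(steps):
        w = w - lr * (2 / len(y)) * X.T @ (X @ w - y)
    return w

w_a = gd(np.zeros(3))
w_b = gd(rng.normal(scale=10.0, size=3))
print(np.allclose(w_a, w_b, atol=1e-6))  # True: one global optimum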
(b) [1 point] When a decision tree is grown to full depth, it is more likely to fit the noise in the data. True False
Solution: True
(c) [1 point] When the hypothesis space is richer, overfitting is more likely. True False
Solution: True
(d) [1 point] When the feature space is larger, overfitting is more likely. True False
Solution: True
(e) [1 point] We can use gradient descent to learn a Gaussian Mixture Model. True False
Solution: True. The GMM log-likelihood is a differentiable function of the means, covariances, and mixing weights, so gradient-based optimization is possible (EM is simply more common).
Short Questions. (f) [3 points] Can you represent the following boolean function with a single logistic threshold unit (i.e., a single unit from a neural network)? If yes, show the weights. If not, explain why not in 1–2 sentences.
A  B  f(A,B)
1  1    0
0  0    0
1  0    1
0  1    0

Solution: Yes. f(A,B) = A AND (NOT B) is linearly separable: for example, weights w_A = 1, w_B = −1 with threshold 0.5 (output 1 iff w_A·A + w_B·B > 0.5) reproduce the table.
(g) [3 points] Suppose we clustered a set of N data points using two different clustering algorithms: k-means and Gaussian mixtures. In both cases we obtained 5 clusters and in both cases the centers of the clusters are exactly the same. Can 3 points that are assigned to different clusters in the k-means solution be assigned to the same cluster in the Gaussian mixture solution? If no, explain. If so, sketch an example or explain in 1-2 sentences.
Solution: Yes. k-means assigns each data point to a unique cluster based on its distance to the cluster center. Gaussian mixture clustering gives soft (probabilistic) assignments to each data point. Therefore, even if cluster centers are identical in both methods, if Gaussian mixture components have large variances (components are spread around their center), points on the edges between clusters may be given different assignments in the Gaussian mixture solution.
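A minimal numerical sketch of this (assuming NumPy and SciPy; the 1-D centers, variances, and points are invented): the wide component can claim a point that sits closer to the other center.

import numpy as np
from scipy.stats import norm

centers = np.array([0.0, 4.0])
sds = np.array([5.0, 0.5])        # component 0 is wide, component 1 is narrow
points = np.array([0.0, 2.5])

# k-means: hard assignment by distance to the (identical) centers.
km = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)

# Gaussian mixture with equal mixing weights: assignment by responsibility.
dens = norm.pdf(points[:, None], loc=centers[None, :], scale=sds[None, :])
gmm = np.argmax(dens, axis=1)

print(km)   # [0 1]: the points fall in different k-means clusters
print(gmm)  # [0 0]: the wide component claims both points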
Circle the correct answer(s).
(h) [3 points] As the number of training examples goes to infinity, your model trained on that data will have: A. Lower variance B. Higher variance C. Same variance
Solution: Lower variance
(i) [3 points] As the number of training examples goes to infinity, your model trained on that data will have: A. Lower bias B. Higher bias C. Same bias
Solution: Same bias
(j) [3 points] Suppose you are given an EM algorithm that finds maximum likelihood estimates for a model with latent variables. You are asked to modify the algorithm so that it finds MAP estimates instead. Which step or steps do you need to modify: A. Expectation B. Maximization C. No modification necessary D. Both
Solution: Maximization. The prior term enters the objective being maximized in the M-step; the E-step computation of the posterior over the latent variables is unchanged.
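To make the modified M-step concrete, here is a sketch for one parameter, the mixing weights, under an assumed symmetric Dirichlet(α) prior (the responsibilities below are invented; conjugate priors on the other parameters are handled analogously):

import numpy as np

# Responsibilities from an unchanged E-step: shape (n examples, K components).
resp = np.array([[0.9, 0.1],
                 [0.8, 0.2],
                 [0.3, 0.7]])
Nk = resp.sum(axis=0)          # expected counts per component
K = len(Nk)

pi_mle = Nk / Nk.sum()         # ML M-step
alpha = 2.0                    # Dirichlet prior strength (assumed)
pi_map = (Nk + alpha - 1) / (Nk.sum() + K * (alpha - 1))  # MAP M-step

print(pi_mle, pi_map)          # the prior smooths the weights toward uniform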
Assume we have a set of data from patients who have visited UPMC hospital during the year 2011. A set of features (e.g., temperature, height) has also been extracted for each patient. Our goal is to decide whether a new visiting patient has any of diabetes, heart disease, or Alzheimer's disease (a patient can have one or more of these diseases).
(a) [3 points] We have decided to use a neural network to solve this problem. We have two choices: either to train a separate neural network for each of the diseases or to train a single neural network with one output neuron for each disease, but with a shared hidden layer. Which method do you prefer? Justify your answer.
Solution:
1. A neural network with a shared hidden layer can capture dependencies between diseases. It can be shown that in some cases, when there is a dependency between the output nodes, having a shared node in the hidden layer can improve the accuracy.
2. If there is no dependency between diseases (output neurons), then we would prefer to have a separate neural network for each disease.
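For concreteness, a minimal PyTorch sketch of the shared-hidden-layer option (the layer sizes and names are invented for illustration):

import torch.nn as nn

n_features, n_hidden, n_diseases = 30, 16, 3

# One shared hidden layer feeding one sigmoid output per disease.
model = nn.Sequential(
    nn.Linear(n_features, n_hidden),
    nn.ReLU(),
    nn.Linear(n_hidden, n_diseases),
    nn.Sigmoid(),              # independent probability for each disease
)
loss_fn = nn.BCELoss()         # trained jointly on all three binary labels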
(b) [3 points] Some patient features are expensive to collect (e.g., brain scans) whereas others are not (e.g., temperature). Therefore, we have decided to first ask our classification algorithm to predict whether a patient has a disease, and if the classifier is 80% confident that the patient has a disease, then we will do additional examinations to collect additional patient features. In this case, which classification method do you recommend: neural networks, decision trees, or naive Bayes? Justify your answer in one or two sentences.
Solution: We expect students to explain how each of these learning techniques can be used to output a confidence value (any of them can be modified to provide one). In addition, naive Bayes is preferable here since it can still be used for classification when the values of some features are unknown. We gave partial credit to those who chose the neural network because of its non-linear decision boundary, or the decision tree because it gives an interpretable answer.
(c) Assume that we use a logistic regression learning algorithm to train a classifier for each disease. The classifier is trained to obtain MAP estimates for the logistic regression weights W. Our MAP estimator optimizes the objective

$$W \leftarrow \arg\max_W \; \ln\Big[ P(W) \prod_l P(Y^l \mid X^l, W) \Big]$$

where l refers to the l-th training example. We adopt a Gaussian prior with zero mean for the weights W = ⟨w_1, …, w_n⟩, making the above objective equivalent to:

$$W \leftarrow \arg\max_W \; -C \sum_i w_i^2 + \sum_l \ln P(Y^l \mid X^l, W)$$
Note C here is a constant, and we re-run our learning algorithm with different values of C. Please answer each of these true/false questions, and explain/justify your answer in no more than 2 sentences.

i. [2 points] The average log-probability of the training data can never increase as we increase C. True False

Solution: True. Increasing C penalizes large weights more heavily, pulling the learned weights further from the unregularized optimum, so the attainable training log-probability can only stay the same or decrease.
(d) Decision boundary

[Figure 1: Labeled training set with features X_1 and X_2; panel (a) shows the original data, panel (b) adds one new data point.]
i. [2 points] Figure 1(a) illustrates a subset of our training data when we have only two features: X_1 and X_2. Draw the decision boundary for the logistic regression that we explained in part (c).
Solution: The decision boundary for logistic regression is linear. One candidate solution which classifies all the data correctly is shown in Figure 1. We will accept other possible solutions, since the decision boundary depends on the value of C (it is possible for the trained classifier to misclassify a few of the training data if we choose a large value of C).
ii. [3 points] Now assume that we add a new data point, as shown in Figure 1(b). How does it change the decision boundary that you drew in Figure 1(a)? Answer this by drawing both the old and the new boundary.
Solution: We expect the decision boundary to move a little toward the new data point.
(e) [3 points] Assume that we record information on all the patients who visit UPMC every day. However, for many of these patients we don't know if they have any of the diseases. Can we still improve the accuracy of our classifier using these data? If yes, explain how; if no, justify your answer.
Solution: Yes, by using EM. In class, we showed how EM can improve the accuracy of our classifier using both labeled and unlabeled data. For more details, see http://www.cs.cmu.edu/~tom/10601_fall2012/slides/GrMod3_10_9_2012.pdf, page 6.
Consider real-valued variables X and Y. The Y variable is generated, conditional on X, from the following process:

$$\epsilon \sim N(0, \sigma^2), \qquad Y = aX + \epsilon$$

where every ε is an independent variable, called a noise term, which is drawn from a Gaussian distribution with mean 0 and standard deviation σ. This is a one-feature linear regression model, where a is the only weight parameter. The conditional probability of Y has distribution p(Y | X, a) ∼ N(aX, σ²), so it can be written as

$$p(Y \mid X, a) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{1}{2\sigma^2} (Y - aX)^2 \Big)$$
The following questions are all about this model.
(a) [3 points] Assume we have a training dataset of n pairs (X_i, Y_i) for i = 1..n, and σ is known. Which of the following equations correctly represent the maximum likelihood problem for estimating a? Say yes or no to each one. More than one of them should have the answer "yes."
[Solution: no]  $\arg\max_a \sum_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\Big)$

[Solution: yes]  $\arg\max_a \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\Big)$

[Solution: no]  $\arg\max_a \sum_i \exp\Big(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\Big)$

[Solution: yes]  $\arg\max_a \prod_i \exp\Big(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\Big)$

[Solution: no]  $\arg\max_a \sum_i (Y_i - aX_i)^2$

[Solution: yes]  $\arg\min_a \sum_i (Y_i - aX_i)^2$
(b) [7 points] Derive the maximum likelihood estimate of the parameter a in terms of the training example Xi’s and Yi’s. We recommend you start with the simplest form of the problem you found above.
Solution: Using the simplest equivalent form from above, we minimize $\sum_i (Y_i - aX_i)^2$. Setting the derivative with respect to a to zero:

$$\frac{\partial}{\partial a} \sum_i (Y_i - aX_i)^2 = -2 \sum_i X_i (Y_i - aX_i) = 0 \;\Rightarrow\; a_{MLE} = \frac{\sum_i X_i Y_i}{\sum_i X_i^2}$$
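A quick numerical check of this closed form (a sketch assuming NumPy; a = 2 and the data are synthetic):

import numpy as np

rng = np.random.default_rng(1)
a_true, n = 2.0, 10_000
X = rng.normal(size=n)
Y = a_true * X + rng.normal(size=n)        # sigma = 1

a_mle = np.sum(X * Y) / np.sum(X ** 2)     # the closed form derived above
print(a_mle)                               # close to 2.0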
(c) Now suppose we use a Gaussian prior p(a | λ) ∼ N(0, λ²) on the weight a and compute the MAP estimate. For each scenario below, say whether the prior p(a|λ) becomes wider, narrower, or stays the same; whether the conditional likelihood p(Y_1 … Y_n | X_1 … X_n, a) becomes wider, narrower, or stays the same; and whether |a_MLE − a_MAP| increases or decreases.

As λ → ∞: prior [Solution: wider], conditional likelihood [Solution: same], |a_MLE − a_MAP| [Solution: decrease]
As λ → 0: prior [Solution: narrower], conditional likelihood [Solution: same], |a_MLE − a_MAP| [Solution: increase]
More data, as n → ∞ (fixed λ): prior [Solution: same], conditional likelihood [Solution: narrower], |a_MLE − a_MAP| [Solution: decrease]
(d) [7 points] Assume σ = 1, and a fixed prior parameter λ. Solve for the MAP estimate of a,
$$\hat{a}_{MAP} = \arg\max_a \big[ \ln p(Y_1..Y_n \mid X_1..X_n, a) + \ln p(a \mid \lambda) \big]$$
Your solution should be in terms of Xi’s, Yi’s, and λ.
Solution: The derivative of the log-posterior decomposes as

$$\frac{\partial}{\partial a}\big[\log p(Y \mid X, a) + \log p(a \mid \lambda)\big] = \frac{\partial \ell(a)}{\partial a} + \frac{\partial \log p(a \mid \lambda)}{\partial a}$$

To stay sane, let's look at it as maximization, not minimization. (It's easy to get signs wrong by trying to use the squared-error minimization form from before.) Since σ = 1, the log-likelihood and its derivative are

$$\ell(a) = \log \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{1}{2\sigma^2}(Y_i - aX_i)^2\Big)$$
$$\ell(a) = -\log Z - \frac{1}{2}\sum_i (Y_i - aX_i)^2 \qquad (8)$$
$$\frac{\partial \ell}{\partial a} = -\sum_i (Y_i - aX_i)(-X_i) \qquad (9)$$
$$= \sum_i (Y_i - aX_i) X_i \qquad (10)$$
$$= \sum_i X_i Y_i - a X_i^2 \qquad (11)$$

Next get the partial derivative for the log-prior.

$$\frac{\partial \log p(a \mid \lambda)}{\partial a} = \frac{\partial}{\partial a}\Big[-\log(\sqrt{2\pi}\,\lambda) - \frac{a^2}{2\lambda^2}\Big] = -\frac{a}{\lambda^2}$$

The full partial is the sum of that and the log-likelihood derivative, which we did before. Setting it to zero and solving:

$$\frac{\partial}{\partial a}\big[\log p(Y \mid X, a) + \log p(a \mid \lambda)\big] = \sum_i (X_i Y_i - a X_i^2) - \frac{a}{\lambda^2} = 0$$

$$a_{MAP} = \frac{\sum_i X_i Y_i}{\sum_i X_i^2 + 1/\lambda^2}$$
Partial credit: 1 point for writing out the log-posterior and/or doing some of the derivative. 1 point for getting the derivative correct.
For the full solution: deduct a point for a sign error (there are many potential places for flipping signs). Deduct a point for having n/λ²: this results from wrapping a sum over data points around the log-prior. (Only the log-likelihood has a $\sum_i$ around it, since it is the probability of drawing each data point; the parameter a is drawn only once.)
Some people didn't set σ = 1 and kept σ to the end. We simply gave credit if substituting σ = 1 gave the right answer; a few people may have derived the wrong answer, but we didn't carefully check all of these cases.
People who wrote gradient descent update rules were graded similarly as before: 4 points if correct, deduct one for a sign error.
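The same kind of synthetic check works for the MAP estimate (a sketch assuming NumPy; λ = 0.5 and the small n are chosen to make the shrinkage visible):

import numpy as np

rng = np.random.default_rng(2)
a_true, n, lam = 2.0, 20, 0.5
X = rng.normal(size=n)
Y = a_true * X + rng.normal(size=n)        # sigma = 1, as in part (d)

a_mle = np.sum(X * Y) / np.sum(X ** 2)
a_map = np.sum(X * Y) / (np.sum(X ** 2) + 1.0 / lam ** 2)
print(a_mle, a_map)  # a_map is pulled toward 0; the gap shrinks as lam or n grows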
(e) [3 points] From your answer to (d), can you say X_13 and X_33 are independent? Why?

Solution: No. Conditional independence doesn't imply marginal independence.

(f) [3 points] Can you say the same thing when X_22 = 1? In other words, can you say X_13 and X_33 are independent given X_22 = 1? Why?

Solution: Yes. X_22 is the only parent of X_33, and X_13 is a nondescendant of X_33, so by the rule in the lecture we can say they are independent given X_22 = 1.

(g) [2 points] Replace X_21 and X_22 by a single new variable X_2 whose value is a pair of boolean values, defined as X_2 = ⟨X_21, X_22⟩. Draw the new Bayes net B′ after the change.
Solution: [Figure: the new Bayes net B′, with nodes X_11, X_12, X_13 and the merged node X_2 = ⟨X_21, X_22⟩.]
(h) [3 points] Do all the conditional independences in B hold in the new network B′? If not, write one that is true in B but not in B′. Consider only the variables present in both B and B′.
Solution: No. For instance, X_32 is not conditionally independent of X_33 given X_22 anymore.
Solution: False. The variance in test accuracy will decrease as we increase the size of the test set.
(b) Short answers.
i. [2 points] Given the above plot of training and test accuracy, which size decision tree would you choose to use to classify future examples? Give a one-sentence justification.
Solution: The tree with 10 nodes. This has the highest test accuracy of any of the trees, and hence the highest expected true accuracy.
ii. [2 points] What is the amount of overfitting in the tree you selected?
Solution: overfitting = training accuracy minus test accuracy = 0.77 − 0.74 = 0.03.
Let us consider the above plot of training and test error from the perspective of agnostic PAC bounds. Consider the agnostic PAC bound we discussed in class:
$$m \geq \frac{1}{2\epsilon^2}\big(\ln|H| + \ln(1/\delta)\big)$$

where ε is defined to be the difference between error_true(h) and error_train(h) for any hypothesis h output by the learner.
iii. [2 points] State in one carefully worded sentence what the above PAC bound guarantees about the two curves in our decision tree plot above.
Solution: If we train on m examples drawn at random from P(X), then with probability (1 − δ) the overfitting (difference between training and true accuracy) for each hypothesis in the plot will be less than or equal to ε. Note that the true accuracy is the expected value of the test accuracy, taken over different randomly drawn test sets.
iv. [2 points] Assume we used 200 training examples to produce the above decision tree plot. If we wish to reduce the overfitting to half of what we observe there, how many training examples would you suggest we use? Justify your answer in terms of the agnostic PAC bound, in no more than two sentences.
Solution: The bound shows that m grows as 1/(2ε²). Therefore, if we wish to halve ε, it suffices to increase m by a factor of 4. We should use 200 × 4 = 800 training examples.
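The scaling is easy to verify numerically (a sketch; the |H| and δ values are arbitrary placeholders):

import math

def pac_m(eps, h_size=10_000, delta=0.05):
    # Agnostic PAC bound: m >= (1 / (2 eps^2)) (ln|H| + ln(1/delta)).
    return (math.log(h_size) + math.log(1 / delta)) / (2 * eps ** 2)

print(pac_m(0.1) / pac_m(0.2))   # 4.0: halving eps quadruples the required m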
v. [2 points] Give a one sentence explanation of why you are not certain that your recommended number of training examples will reduce overfitting by exactly one half.
Solution: There are several reasons, including the following. 1. Our PAC theory result gives a bound, not an equality, so 800 examples might decrease overfitting by more than half. 2. The "observed" overfitting is computed from the test-set accuracy, which is only an estimate of the true accuracy, so it may vary from the true accuracy, and our "observed" overfitting will vary accordingly.
(c) You decide to estimate the probability θ that a particular coin will turn up heads by flipping it 10 times. You notice that if you repeat this experiment, each time obtaining a new set of 10 coin flips, you get different resulting estimates. You repeat the experiment N = 20 times, obtaining estimates $\hat\theta^1, \hat\theta^2, \ldots, \hat\theta^{20}$. You calculate the variance in these estimates as

$$\mathrm{var} = \frac{1}{N} \sum_{i=1}^{N} (\hat\theta^i - \theta_{mean})^2$$

where $\theta_{mean}$ is the mean of your estimates $\hat\theta^1, \hat\theta^2, \ldots, \hat\theta^{20}$.

i. [4 points] Which do you expect to produce a smaller value for var: a maximum likelihood estimator (MLE), or a maximum a posteriori (MAP) estimator that uses a Beta prior? Assume both estimators are given the same data. Justify your answer in one sentence.
Solution: We should expect the MAP estimate to produce a smaller value for var, because using the Beta prior is equivalent to adding in a fixed set of "hallucinated" training examples that will not vary from experiment to experiment.
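A short simulation makes this concrete (a sketch assuming NumPy; θ = 0.6 and the Beta(5, 5) prior are invented for illustration):

import numpy as np

rng = np.random.default_rng(3)
theta, flips, N = 0.6, 10, 20
heads = rng.binomial(flips, theta, size=N)    # 20 repeats of 10 flips each

a, b = 5, 5                                   # Beta(5, 5) prior
mle = heads / flips
map_ = (heads + a - 1) / (flips + a + b - 2)  # mode of the Beta posterior

print(mle.var(), map_.var())                  # the MAP variance is smaller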