
Will Monroe
CS 109
Lecture Notes #20
August 9, 2017

Parameter Learning

Based on a chapter by Chris Piech

We have learned many different distributions for random variables, and all of those distributions had parameters: the numbers that you provide as input when you define a random variable. So far when we were working with random variables, we either were explicitly told the values of the parameters, or we could divine the values by understanding the process that was generating the random variables.

What if we don’t know the values of the parameters and we can’t estimate them from our own expert knowledge? What if instead of knowing the random variables, we have a lot of examples of data generated with the same underlying distribution? In this chapter we are going to learn formal ways of estimating parameters from data.

These ideas are critical for artificial intelligence. Almost all modern machine learning algorithms work like this: (1) Specify a probabilistic model that has parameters. (2) Learn the value of those parameters from data.

Parameters

Before we dive into parameter estimation, first let’s revisit the concept of parameters. Given a model, the parameters are the numbers that yield the actual distribution. In the case of a Bernoulli random variable, the single parameter was the value p. In the case of a Uniform random variable, the parameters are the a and b values that define the min and max value. Here is a list of random variables and the corresponding parameters. From now on, we are going to use the notation θ to be a vector of all the parameters:

Distribution         Parameters
Bernoulli(p)         θ = p
Poisson(λ)           θ = λ
Uniform(a, b)        θ = (a, b)
Normal(μ, σ²)        θ = (μ, σ²)
Y = mX + b           θ = (m, b)
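To make θ concrete in code, here is a minimal sketch (assuming Python with scipy available; the specific numbers are only illustrative) that builds a few of these distributions by plugging in parameter values:

# A minimal sketch (not from the notes): plugging parameter values theta
# into library distributions. Requires scipy; the numbers are illustrative.
from scipy import stats

p = 0.3                      # Bernoulli(p):        theta = p
lam = 4.0                    # Poisson(lambda):     theta = lambda
a, b = 2.0, 5.0              # Uniform(a, b):       theta = (a, b)
mu, sigma2 = 1.0, 4.0        # Normal(mu, sigma^2): theta = (mu, sigma^2)

bern = stats.bernoulli(p)
pois = stats.poisson(lam)
unif = stats.uniform(loc=a, scale=b - a)          # scipy parameterizes Uniform by (loc, width)
norm = stats.norm(loc=mu, scale=sigma2 ** 0.5)    # scipy takes the std dev, not the variance

print(bern.pmf(1), pois.pmf(2), unif.pdf(3.0), norm.pdf(0.0))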

In the real world often you don’t know the “true” parameters, but you get to observe data. Next up, we will explore how we can use data to estimate the model parameters.

It turns out there isn’t just one way to estimate the value of parameters. There are two main approaches: Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP). Both of these approaches assume that your data are IID samples: X_1, X_2, ..., X_n, where all X_i are independent and have the same distribution.

Maximum Likelihood

Our first algorithm for estimating parameters is called maximum likelihood estimation (MLE). The central idea behind MLE is to select the parameters (θ) that make the observed data the most likely.

The data that we are going to use to estimate the parameters are going to be n independent and identically distributed (IID) samples: X_1, X_2, ..., X_n.

Likelihood

We made the assumption that our data are identically distributed. This means that they must have either the same probability mass function (if the data are discrete) or the same probability density function (if the data are continuous). To simplify our conversation about parameter estimation, we are going to use the notation f(X | θ) to refer to this shared PMF or PDF. Our new notation is interesting in two ways. First, we have now included a conditional on θ, which is our way of indicating that the likelihood of different values of X depends on the values of our parameters. Second, we are going to use the same symbol f for both discrete and continuous distributions.

What does likelihood mean and how is “likelihood” different than “probability”? In the case of discrete distributions, likelihood is a synonym for the joint probability of your data. In the case of continuous distributions, likelihood refers to the joint probability density of your data.

Since we assumed each data point is independent, the likelihood of all our data is the product of the likelihood of each data point. Mathematically, the likelihood of our data given parameters θ is:

L(θ) = ∏_{i=1}^n f(X_i | θ)

For different values of parameters, the likelihood of our data will be different. If we have correct parameters, our data will be much more probable than if we have incorrect parameters. For that reason we write likelihood as a function of our parameters (θ).
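To connect the formula to something computable, here is a small sketch (an illustration under the assumption that you supply the model’s PMF or PDF as a function f) that evaluates the likelihood, and its log, of a data set for a given θ:

import math

def likelihood(data, theta, f):
    """L(theta) = product of f(x_i | theta) over the IID samples."""
    L = 1.0
    for x in data:
        L *= f(x, theta)
    return L

def log_likelihood(data, theta, f):
    """Sum of log f(x_i | theta); numerically safer than the raw product."""
    return sum(math.log(f(x, theta)) for x in data)

# Example with a Bernoulli PMF: f(x | p) = p^x * (1 - p)^(1 - x)
bern_pmf = lambda x, p: p ** x * (1 - p) ** (1 - x)
data = [1, 0, 1, 1, 0]
print(likelihood(data, 0.6, bern_pmf), log_likelihood(data, 0.6, bern_pmf))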

Maximization

In maximum likelihood estimation (MLE) our goal is to choose values of our parameters (θ) that maximize the likelihood function from the previous section. We are going to use the notation θ̂ to represent the best choice of values for our parameters. Formally, MLE assumes that:

θ̂ = argmax_θ L(θ)

“Arg max” is short for argument of the maximum. The arg max of a function is the value of the domain at which the function is maximized. It applies for domains of any dimension.

A cool property of arg max is that since log is a monotone function, the arg max of a function is the same as the arg max of the log of the function! That’s nice because logs make the math simpler.
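Here is a quick numerical sanity check of that property (a sketch only, using a coarse grid of candidate parameters rather than calculus): maximizing L(θ) and maximizing log L(θ) select the same θ̂ for a small Bernoulli data set.

import math

data = [1, 1, 0, 1, 0, 1, 1, 0]            # 5 heads out of 8 flips
grid = [i / 1000 for i in range(1, 1000)]  # candidate values of p in (0, 1)

def L(p):
    # likelihood: product of the Bernoulli PMF over the data
    return math.prod(p if x == 1 else (1 - p) for x in data)

def LL(p):
    # log likelihood: sum of the log PMF over the data
    return sum(math.log(p if x == 1 else 1 - p) for x in data)

p_hat_L = max(grid, key=L)
p_hat_LL = max(grid, key=LL)
print(p_hat_L, p_hat_LL)   # both ~0.625 = 5/8: same arg max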

Normal MLE Estimation

Practice is key. Next, we will estimate the best parameter values for a normal distribution. All we have access to are n samples from our normal, which we represent as IID random variables X_1, X_2, ..., X_n. We assume that for all i, X_i ∼ N(μ = θ_0, σ² = θ_1). This example seems trickier because a normal has two parameters that we have to estimate. In this case, θ is a vector with two values. The first is the mean (μ) parameter, and the second is the variance (σ²) parameter.

L(θ) = ∏_{i=1}^n f(X_i | θ)

     = ∏_{i=1}^n (1/√(2πθ_1)) e^(−(X_i − θ_0)²/(2θ_1))            (likelihood for a continuous variable is the PDF)

LL(θ) = ∑_{i=1}^n log[ (1/√(2πθ_1)) e^(−(X_i − θ_0)²/(2θ_1)) ]    (we want to calculate the log likelihood)

      = ∑_{i=1}^n [ −log(√(2πθ_1)) − (1/(2θ_1))(X_i − θ_0)² ]

Again, the last step of MLE is to choose values of θ that maximize the log likelihood function. In this case, we can calculate the partial derivative of the LL function with respect to both θ_0 and θ_1, set both equations equal to 0, and then solve for the values of θ. Doing so results in the values μ̂ = θ̂_0 and σ̂² = θ̂_1 that maximize likelihood. The result is:

μ̂ = (1/n) ∑_{i=1}^n X_i        and        σ̂² = (1/n) ∑_{i=1}^n (X_i − μ̂)²
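These closed-form estimates are easy to verify numerically. The following sketch (assuming numpy is installed; the true parameter values are arbitrary) computes μ̂ and σ̂² directly from simulated samples:

import numpy as np

rng = np.random.default_rng(0)
true_mu, true_var = 3.0, 4.0
X = rng.normal(true_mu, np.sqrt(true_var), size=10_000)   # IID samples

mu_hat = X.sum() / len(X)                       # (1/n) * sum of X_i
var_hat = ((X - mu_hat) ** 2).sum() / len(X)    # (1/n) * sum of (X_i - mu_hat)^2

print(mu_hat, var_hat)   # close to 3.0 and 4.0 (note: the MLE divides by n, not n - 1)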

Maximum A Posteriori Estimation

MLE is great, but it is not the only way to estimate parameters! This section introduces an alternate algorithm, Maximum A Posteriori (MAP). The paradigm of MAP is that we should choose the value for our parameters that is the most likely given the data. At first blush this might seem the same as MLE; however, remember that MLE chooses the value of parameters that makes the data most likely. Formally, for IID random variables X_1, ..., X_n:

θ_MAP = argmax_θ f(θ | X_1, X_2, ..., X_n)

In the equation above we are trying to calculate the conditional probability of unobserved random variables given observed random variables. When that is the case, think Bayes’ Theorem! Expand the function f using the continuous version of Bayes’ Theorem:

θ_MAP = argmax_θ f(θ | X_1, X_2, ..., X_n)

      = argmax_θ f(X_1, X_2, ..., X_n | θ) g(θ) / h(X_1, X_2, ..., X_n)    (by Bayes’ Theorem)

Note that f, g, and h are all probability densities. I used different symbols to make it explicit that they may be different functions. Now we are going to leverage two observations. First, the data are assumed to be IID, so we can decompose the density of the data given θ. Second, the denominator is a constant with respect to θ. As such, its value does not affect the arg max, and we can drop that term. Mathematically:

θ_MAP = argmax_θ [ ∏_{i=1}^n f(X_i | θ) ] g(θ) / h(X_1, X_2, ..., X_n)    (since the samples are IID)

      = argmax_θ [ ∏_{i=1}^n f(X_i | θ) ] g(θ)                            (since h is constant with respect to θ)

As before, it will be more convenient to find the arg max of the log of the MAP function, which gives us the final form for MAP estimation of parameters.

θ_MAP = argmax_θ [ log(g(θ)) + ∑_{i=1}^n log(f(X_i | θ)) ]

Using Bayesian terminology, the MAP estimate is the mode of the “posterior” distribution for θ. If you look at this equation side by side with the MLE equation you will notice that MAP is the arg max of the exact same function plus a term for the log of the prior.
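To see the final expression in action, here is a sketch (an illustration, not part of the original notes) that evaluates log(g(θ)) + ∑ log(f(X_i | θ)) over a grid for a Bernoulli likelihood with a Beta prior on p, and takes the arg max:

import math

data = [1, 1, 0, 1, 1, 1, 0, 1]            # 6 heads out of 8 flips
a, b = 3.0, 3.0                             # Beta(a, b) prior on p (chosen only for illustration)

def log_prior(p):
    # log of the Beta(a, b) density, dropping the normalizing constant
    # (it does not depend on p, so it cannot affect the arg max)
    return (a - 1) * math.log(p) + (b - 1) * math.log(1 - p)

def log_lik(p):
    return sum(math.log(p if x == 1 else 1 - p) for x in data)

grid = [i / 1000 for i in range(1, 1000)]
p_map = max(grid, key=lambda p: log_prior(p) + log_lik(p))
p_mle = max(grid, key=log_lik)
print(p_map, p_mle)   # MAP (~0.667) is pulled toward the prior mean; the MLE is 6/8 = 0.75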

Parameter Priors

In order to get ready for the world of MAP estimation, we are going to need to brush up on our distributions. We will need reasonable distributions for each of our different parameters. For example, if you are predicting a Poisson distribution, what is the right random variable type for the prior of λ?

A desideratum for prior distributions is that the resulting posterior distribution has the same functional form as the prior. We call these “conjugate” priors. In the case where you are updating your belief many times, conjugate priors make the math much easier to program.
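For example (a sketch that relies on the standard Beta-Bernoulli conjugacy result): if p has a Beta(a, b) prior and we observe Bernoulli data, the posterior is again a Beta, so each belief update amounts to adding counts.

def update_beta_bernoulli(a, b, data):
    """Beta(a, b) prior on p + Bernoulli observations -> Beta posterior.
    Conjugacy: the posterior is Beta(a + #heads, b + #tails)."""
    heads = sum(data)
    tails = len(data) - heads
    return a + heads, b + tails

a, b = 2.0, 2.0                                    # prior pseudo-counts (illustrative choice)
a, b = update_beta_bernoulli(a, b, [1, 0, 1, 1])   # first batch of flips
a, b = update_beta_bernoulli(a, b, [1, 1, 0])      # the posterior becomes the new prior
print(a, b)                                        # Beta(7.0, 4.0) after 5 heads and 2 tails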

Here is a list of different parameters and the distributions most often used for their priors:

Parameter            Distribution
Bernoulli p          Beta
Binomial p           Beta
Poisson λ            Gamma
Exponential λ        Gamma
Multinomial p_i      Dirichlet
Normal μ             Normal
Normal σ²            Inverse Gamma

We won’t cover the inverse gamma distribution in this class. The remaining two, Dirichlet and gamma, you will not be required to know, but details for them are included below for completeness.

The distributions used to represent your “prior” belief about a random variable will often have their own parameters. For example, a Beta distribution is defined using two parameters (a, b). Do we have to use parameter estimation to evaluate a and b too? No. Those parameters are called “hyperparameters”. That is a term we reserve for parameters in our model that we fix before running parameter estimation. Before you run MAP you decide on the values of (a, b).
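As a small illustration of fixing hyperparameters before estimation (using the standard closed-form mode of a Beta posterior, which is not derived in this excerpt): once (a, b) are chosen, the Beta-Bernoulli MAP estimate can be computed directly.

def map_bernoulli_p(data, a, b):
    """MAP estimate of a Bernoulli p under a Beta(a, b) prior.
    This is the mode of the Beta(a + heads, b + tails) posterior
    (valid when both posterior parameters exceed 1)."""
    heads = sum(data)
    n = len(data)
    return (heads + a - 1) / (n + a + b - 2)

data = [1, 1, 0, 1, 1, 1, 0, 1]          # 6 heads out of 8 flips
print(map_bernoulli_p(data, a=3, b=3))   # 0.666...; compare to the MLE 6/8 = 0.75
print(map_bernoulli_p(data, a=1, b=1))   # with a flat Beta(1, 1) prior, MAP equals the MLE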