Least-Squares Estimation:

Recall that the projection of y onto C(X), the set of all vectors of the form Xb for b ∈ R^{k+1}, yields the closest point in C(X) to y. That is, p(y|C(X)) yields the minimizer of

    Q(β) = ‖y − Xβ‖²   (the least-squares criterion)

This leads to the estimator β̂ given by the solution of

    X^T Xβ = X^T y   (the normal equations)

or

    β̂ = (X^T X)^{−1} X^T y.
All of this has already been established back when we studied projections (see pp. 30–31). Alternatively, we could use calculus:

To find a stationary point (maximum, minimum, or saddle point) of Q(β), we set the partial derivative of Q(β) equal to zero and solve:

    ∂Q(β)/∂β = ∂/∂β (y − Xβ)^T (y − Xβ) = ∂/∂β (y^T y − 2y^T Xβ + β^T (X^T X)β) = −2X^T y + 2X^T Xβ.

Here we've used the vector differentiation formulas ∂(c^T z)/∂z = c and ∂(z^T Az)/∂z = 2Az for symmetric A (see §2.14 of our text).

Setting this result equal to zero, we obtain the normal equations, which have solution β̂ = (X^T X)^{−1} X^T y. That this is a minimum, rather than a maximum or saddle point, can be verified by checking the second derivative matrix of Q(β):

    ∂²Q(β)/∂β∂β^T = 2X^T X,

which is positive definite (Result 7, p. 54); therefore β̂ is a minimizer.
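The following numpy sketch (an illustration added to these notes, using simulated data and made-up parameter values) solves the normal equations directly and checks the result against a generic least-squares routine.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 50, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # n x (k+1), full rank
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(scale=0.3, size=n)

    # Solve the normal equations X^T X beta = X^T y
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Same answer from a generic least-squares routine
    beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(beta_hat, beta_ls))   # True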
Example — Simple Linear Regression

Consider the case k = 1:

    y_i = β_0 + β_1 x_i + e_i,   i = 1, ..., n,

where e_1, ..., e_n are i.i.d., each with mean 0 and variance σ². Then the model equation becomes

    (y_1, y_2, ..., y_n)^T = [1 x_1; 1 x_2; ...; 1 x_n] (β_0, β_1)^T + (e_1, e_2, ..., e_n)^T,

i.e., y = Xβ + e, where X is the n × 2 matrix whose ith row is (1, x_i) and β = (β_0, β_1)^T (matrix rows separated by semicolons).

It follows that

    X^T X = [n, Σ_i x_i; Σ_i x_i, Σ_i x_i²],   X^T y = (Σ_i y_i, Σ_i x_i y_i)^T,

and

    (X^T X)^{−1} = 1/{n Σ_i x_i² − (Σ_i x_i)²} [Σ_i x_i², −Σ_i x_i; −Σ_i x_i, n].

Therefore, β̂ = (X^T X)^{−1} X^T y yields

    β̂ = (β̂_0, β̂_1)^T = 1/{n Σ_i x_i² − (Σ_i x_i)²} ( Σ_i x_i² Σ_i y_i − Σ_i x_i Σ_i x_i y_i ,  −(Σ_i x_i)(Σ_i y_i) + n Σ_i x_i y_i )^T.

After a bit of algebra, these estimators simplify to

    β̂_1 = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)² = S_xy / S_xx   and   β̂_0 = ȳ − β̂_1 x̄.
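A quick numerical check (added here, not part of the original notes; the data are simulated): the closed-form slope S_xy/S_xx and intercept ȳ − β̂_1 x̄ agree with the matrix formula (X^T X)^{−1} X^T y applied to the two-column design matrix.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=30)
    y = 2.0 + 0.7 * x + rng.normal(scale=1.0, size=30)

    # Closed-form simple linear regression estimates
    Sxx = np.sum((x - x.mean()) ** 2)
    Sxy = np.sum((x - x.mean()) * (y - y.mean()))
    b1 = Sxy / Sxx
    b0 = y.mean() - b1 * x.mean()

    # Matrix formula beta_hat = (X^T X)^{-1} X^T y
    X = np.column_stack([np.ones_like(x), x])
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    print(np.allclose([b0, b1], beta_hat))   # True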
Example — Simple Linear Regression (Continued)

Result 2 on the previous page says that for var(y) = σ²I, var(β̂) = σ²(X^T X)^{−1}. Therefore, in the simple linear regression case,

    var( (β̂_0, β̂_1)^T ) = σ²(X^T X)^{−1}
                         = σ²/{n Σ_i x_i² − (Σ_i x_i)²} [Σ_i x_i², −Σ_i x_i; −Σ_i x_i, n]
                         = σ²/Σ_i (x_i − x̄)² [n^{−1} Σ_i x_i², −x̄; −x̄, 1].

Thus,

    var(β̂_0) = σ² (Σ_i x_i²/n) / Σ_i (x_i − x̄)² = σ² { 1/n + x̄²/Σ_i (x_i − x̄)² },

    var(β̂_1) = σ² / Σ_i (x_i − x̄)²,

and

    cov(β̂_0, β̂_1) = −σ² x̄ / Σ_i (x_i − x̄)².

Notice that when x̄ > 0, cov(β̂_0, β̂_1) is negative, meaning that the slope and intercept are inversely related. That is, over repeated samples from the same model, the intercept will tend to decrease when the slope increases.
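These variance expressions can be checked numerically. The sketch below (an addition, with arbitrary made-up x values and σ²) compares σ²(X^T X)^{−1} to the closed forms for var(β̂_0), var(β̂_1), and cov(β̂_0, β̂_1).

    import numpy as np

    x = np.array([1.0, 2.0, 3.5, 5.0, 7.5, 9.0])
    n = len(x)
    sigma2 = 4.0
    X = np.column_stack([np.ones(n), x])

    V = sigma2 * np.linalg.inv(X.T @ X)   # var(beta_hat) = sigma^2 (X^T X)^{-1}

    Sxx = np.sum((x - x.mean()) ** 2)
    var_b0 = sigma2 * (1.0 / n + x.mean() ** 2 / Sxx)
    var_b1 = sigma2 / Sxx
    cov_b0_b1 = -sigma2 * x.mean() / Sxx

    print(np.allclose(V, [[var_b0, cov_b0_b1], [cov_b0_b1, var_b1]]))   # True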
Gauss-Markov Theorem:

We have seen that in the spherical-errors, full-rank linear model, the least-squares estimator β̂ = (X^T X)^{−1} X^T y is unbiased and is a linear estimator.

The following theorem states that in the class of linear, unbiased estimators, the least-squares estimator is optimal (or best) in the sense that it has minimum variance among all estimators in this class.

Gauss-Markov Theorem: Consider the linear model y = Xβ + e, where X is n × (k + 1) of rank k + 1, where n > k + 1, E(e) = 0, and var(e) = σ²I. The least-squares estimators β̂_j, j = 0, 1, ..., k (the elements of β̂ = (X^T X)^{−1} X^T y), have minimum variance among all linear unbiased estimators.
Proof: Write β̂_j as β̂_j = c^T β̂, where c is the indicator vector containing a 1 in the (j + 1)st position and 0's elsewhere. Then β̂_j = c^T (X^T X)^{−1} X^T y = a^T y, where a = X(X^T X)^{−1} c. The quantity being estimated is β_j = c^T β = c^T (X^T X)^{−1} X^T μ = a^T μ, where μ = Xβ.

Consider an arbitrary linear estimator β̃_j = d^T y of β_j. For such an estimator to be unbiased, it must satisfy E(β̃_j) = E(d^T y) = d^T μ = a^T μ for any μ ∈ C(X). I.e.,

    d^T μ − a^T μ = 0 ⇒ (d − a)^T μ = 0 for all μ ∈ C(X),

or (d − a) ⊥ C(X). Then

    β̃_j = d^T y = a^T y + (d − a)^T y = β̂_j + (d − a)^T y.

The random variables on the right-hand side, β̂_j and (d − a)^T y, have covariance

    cov(a^T y, (d − a)^T y) = a^T var(y)(d − a) = σ² a^T (d − a) = σ² (d^T a − a^T a).

Since d^T μ = a^T μ for any μ ∈ C(X) and a = X(X^T X)^{−1} c ∈ C(X), it follows that d^T a = a^T a, so that

    cov(a^T y, (d − a)^T y) = σ² (d^T a − a^T a) = σ² (a^T a − a^T a) = 0.

Therefore,

    var(β̃_j) = var(β̂_j) + var((d − a)^T y) ≥ var(β̂_j),

so the least-squares estimator β̂_j has the smallest variance among linear unbiased estimators of β_j.
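To make the theorem concrete, here is a small simulation sketch (an addition to the notes, with simulated data): it compares the least-squares slope to another linear unbiased estimator of β_1, the two-point slope (y_n − y_1)/(x_n − x_1), over repeated samples. Both are unbiased, but the least-squares slope has smaller variance, as the theorem guarantees.

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0, 10, 20)
    beta0, beta1, sigma = 1.0, 0.5, 1.0
    X = np.column_stack([np.ones_like(x), x])

    ls_slopes, alt_slopes = [], []
    for _ in range(20000):
        y = beta0 + beta1 * x + rng.normal(scale=sigma, size=x.size)
        ls_slopes.append(np.linalg.solve(X.T @ X, X.T @ y)[1])
        alt_slopes.append((y[-1] - y[0]) / (x[-1] - x[0]))   # another linear unbiased estimator

    print(np.mean(ls_slopes), np.mean(alt_slopes))   # both close to 0.5 (unbiased)
    print(np.var(ls_slopes) < np.var(alt_slopes))    # True: LS slope has smaller variance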
An additional property of least-squares estimation is that the estimated mean μ̂ = X(X^T X)^{−1} X^T y is invariant to (doesn't change as a result of) linear changes of scale in the explanatory variables.

That is, consider the linear models

    y = [1 x_11 x_12 ··· x_1k; 1 x_21 x_22 ··· x_2k; ...; 1 x_n1 x_n2 ··· x_nk] β + e = Xβ + e

and

    y = [1 c_1 x_11 c_2 x_12 ··· c_k x_1k; 1 c_1 x_21 c_2 x_22 ··· c_k x_2k; ...; 1 c_1 x_n1 c_2 x_n2 ··· c_k x_nk] β* + e = Zβ* + e

for nonzero constants c_1, ..., c_k. Then μ̂, the least-squares estimator of E(y), is the same in both of these two models. This follows from a more general theorem:

Theorem: In the linear model y = Xβ + e, where E(e) = 0 and X is of full rank, μ̂, the least-squares estimator of E(y), is invariant to a full-rank linear transformation of X.

Proof: A full-rank linear transformation of X is given by Z = XH, where H is square and of full rank. In the original (untransformed) linear model, μ̂ = X(X^T X)^{−1} X^T y = P_{C(X)} y. In the transformed model y = Zβ* + e, μ̂ = Z(Z^T Z)^{−1} Z^T y = P_{C(Z)} y = P_{C(XH)} y. So, it suffices to show that P_{C(X)} = P_{C(XH)}, i.e., that C(X) = C(XH). This is true because if x ∈ C(XH) then x = XHb for some b ⇒ x = Xc where c = Hb ⇒ x ∈ C(X) ⇒ C(XH) ⊂ C(X). In addition, if x ∈ C(X) then x = Xd for some d ⇒ x = XHH^{−1}d = XHa where a = H^{−1}d ⇒ x ∈ C(XH) ⇒ C(X) ⊂ C(XH). Therefore, C(X) = C(XH), and the two projections coincide.

The special case in which the jth explanatory variable is rescaled by a constant c_j occurs when H = diag(1, c_1, c_2, ..., c_k).
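A numerical illustration of this invariance (added, with simulated data): rescaling the columns of X by a full-rank diagonal H changes β̂ but leaves the fitted values μ̂ = P_{C(X)} y unchanged.

    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 25, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
    y = rng.normal(size=n)

    H = np.diag([1.0, 2.0, -0.5, 10.0])   # full-rank rescaling; intercept column left alone
    Z = X @ H

    mu_hat_X = X @ np.linalg.solve(X.T @ X, X.T @ y)
    mu_hat_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)

    print(np.allclose(mu_hat_X, mu_hat_Z))   # True: fitted values are invariant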
Maximum Likelihood Estimation:

Least squares provides a simple, intuitively reasonable criterion for estimation: if we want to estimate a parameter describing μ, the mean of y, then choose the parameter value that minimizes the squared distance between y and μ. If var(y) = σ²I, then the resulting estimator is BLUE (optimal, in some sense).

• Least squares requires assumptions about only the mean and variance-covariance matrix (the first two moments) of y.

• However, least squares only provides a criterion for estimating parameters describing the mean (e.g., β); it says nothing about how to estimate parameters describing the variance (e.g., σ²) or other aspects of the distribution of y.

An alternative method of estimation is maximum likelihood estimation.

• Maximum likelihood requires the specification of the entire distribution of y (up to some unknown parameters), rather than just the mean and variance of that distribution.

• In return, it provides a criterion for estimating any of the parameters describing the distribution of y, including parameters describing the mean (e.g., β), variance (σ²), or any other aspect of the distribution.

• Maximum likelihood estimation is therefore both more demanding and more powerful than least squares in certain senses. It can provide estimators of all sorts of parameters in a broad array of model types, including models much more complex than those for which least squares is appropriate; but it requires stronger assumptions than least squares.
Suppose y has joint p.d.f. f(y; γ), depending on an unknown parameter vector γ = (γ_1, ..., γ_p)^T ∈ Γ. Regarded as a function of γ for the observed data y, f(y; γ) is called the likelihood function, L(γ; y). Under independence,

    L(γ; y) = ∏_{i=1}^n f(y_i; γ).

Since it is easier to work with sums than products, it is useful to note that in general

    arg max_γ L(γ; y) = arg max_γ log L(γ; y),   where we write ℓ(γ; y) ≡ log L(γ; y).

Therefore, we define an MLE of γ as a γ̂ such that

    ℓ(γ̂; y) ≥ ℓ(γ; y) for all γ ∈ Γ.

If Γ is an open set, then γ̂ must satisfy (if it exists)

    ∂ℓ(γ̂)/∂γ_j = 0,   j = 1, ..., p,

or, in vector form,

    ∂ℓ(γ̂; y)/∂γ = ( ∂ℓ(γ̂)/∂γ_1, ..., ∂ℓ(γ̂)/∂γ_p )^T = 0   (the likelihood equation, a.k.a. score equation).
In the classical linear model, the unknown parameters of the model are β and σ², so the pair (β, σ²) plays the role of γ.

Under assumption A5 that e ∼ N_n(0, σ² I_n), it follows that y ∼ N_n(Xβ, σ² I_n), so the likelihood function is given by

    L(β, σ²; y) = (2πσ²)^{−n/2} exp{ −‖y − Xβ‖² / (2σ²) },   for β ∈ R^{k+1} and σ² > 0.

The log-likelihood is a bit easier to work with, and has the same maximizers. It is given by

    ℓ(β, σ²; y) = −(n/2) log(2π) − (n/2) log(σ²) − ‖y − Xβ‖² / (2σ²).
We can maximize this function with respect to β and σ² in two steps: first maximize with respect to β treating σ² as fixed; then plug that estimator back into the log-likelihood function and maximize with respect to σ².

For fixed σ², maximizing ℓ(β, σ²; y) is equivalent to maximizing the third term, −‖y − Xβ‖²/(2σ²), or, equivalently, minimizing ‖y − Xβ‖². This is just what we do in least squares, and it leads to the estimator β̂ = (X^T X)^{−1} X^T y.
Next we plug this estimator back into the log-likelihood (this gives what's known as the profile log-likelihood for σ²):

    ℓ(β̂, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − ‖y − Xβ̂‖² / (2σ²),

and maximize with respect to σ². Setting ∂ℓ(β̂, σ²)/∂σ² = 0 and solving gives the MLE

    σ̂² = ‖y − Xβ̂‖² / n.
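As a check on the two-step maximization (an added sketch, with simulated data), the code below evaluates the profile log-likelihood on a grid of σ² values and confirms that it peaks near σ̂² = ‖y − Xβ̂‖²/n.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 40
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ beta_hat) ** 2)

    def profile_loglik(sigma2):
        # profile log-likelihood l(beta_hat, sigma2)
        return -0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2) - rss / (2 * sigma2)

    grid = np.linspace(0.1, 5.0, 2000)
    sigma2_grid_max = grid[np.argmax(profile_loglik(grid))]

    print(sigma2_grid_max, rss / n)   # grid maximizer is close to the MLE rss/n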
Estimation of σ²:

• An advantage of ML estimation in the classical linear model is that it provides a method for estimating all parameters in the model, β and σ².

• Least squares, by itself, only provides an estimator of β.

We've seen that the LS and ML estimators of β coincide. However, the MLE of σ² is not the usually preferred estimator of σ², and it is not the estimator of σ² that is typically combined with LS estimation of β.

Why not? Because σ̂² is biased.

That E(σ̂²) ≠ σ² can easily be established using our results for taking expected values of quadratic forms:
    E(σ̂²) = E{ (1/n) ‖P_{C(X)^⊥} y‖² }
          = (1/n) E{ (P_{C(X)^⊥} y)^T P_{C(X)^⊥} y }
          = (1/n) E( y^T P_{C(X)^⊥} y )
          = (1/n) { σ² dim(C(X)^⊥) + ‖P_{C(X)^⊥} Xβ‖² }   (the second term is 0, because Xβ ∈ C(X))
          = (σ²/n) dim(C(X)^⊥)
          = (σ²/n) {n − dim(C(X))}   (where dim(C(X)) = rank(X))
          = (σ²/n) {n − (k + 1)}.
Therefore, the MLE σ̂² is biased by a multiplicative factor of (n − k − 1)/n, and an alternative, unbiased estimator of σ² can easily be constructed as

    s² ≡ {n/(n − k − 1)} σ̂² = ‖y − Xβ̂‖² / (n − k − 1),

or, more generally (that is, for X not necessarily of full rank),

    s² = ‖y − Xβ̂‖² / {n − rank(X)}.

• s², rather than σ̂², is generally the preferred estimator of σ². In fact, it can be shown that in the spherical-errors linear model, s² is the best (minimum variance) estimator of σ² in the class of quadratic (in y) unbiased estimators.

• In the special case in which X = j_n (the model contains only an intercept, or constant term), so that C(X) = L(j_n), we get β̂ = ȳ, μ̂ = ȳ j_n, and rank(X) = 1. Therefore, s² becomes the usual sample variance from the one-sample problem:

    s² = ‖y − ȳ j_n‖² / (n − 1) = {1/(n − 1)} Σ_{i=1}^n (y_i − ȳ)².
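A short simulation sketch (added here, with simulated data and arbitrary parameter values) shows the downward bias of σ̂² = RSS/n and that s² = RSS/(n − k − 1) is unbiased on average over repeated samples.

    import numpy as np

    rng = np.random.default_rng(5)
    n, sigma2 = 15, 4.0
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # k + 1 = 3 columns
    beta = np.array([1.0, -1.0, 0.5])

    mle, unbiased = [], []
    for _ in range(20000):
        y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
        rss = np.sum((y - X @ np.linalg.solve(X.T @ X, X.T @ y)) ** 2)
        mle.append(rss / n)                      # sigma2_hat (MLE), biased by (n-k-1)/n
        unbiased.append(rss / (n - X.shape[1]))  # s^2

    print(np.mean(mle), sigma2 * (n - 3) / n)    # both near 3.2
    print(np.mean(unbiased))                     # near 4.0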
If e has a normal distribution, then by part 3 of the theorem on p. 85,

    ‖y − Xβ̂‖² / σ² ∼ χ²(n − rank(X)),

and, since the central χ²(m) distribution has mean m and variance 2m,

    s² = {σ² / (n − rank(X))} · ‖y − Xβ̂‖²/σ²,   with ‖y − Xβ̂‖²/σ² ∼ χ²(n − rank(X)),

implies

    E(s²) = {σ² / (n − rank(X))} {n − rank(X)} = σ²,

and

    var(s²) = [σ⁴ / {n − rank(X)}²] · 2{n − rank(X)} = 2σ⁴ / {n − rank(X)}.
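These moment formulas can also be checked by simulation under normal errors. The sketch below (an addition, with simulated data) compares the empirical variance of s² to 2σ⁴/(n − rank(X)).

    import numpy as np

    rng = np.random.default_rng(6)
    n, sigma2 = 20, 2.0
    X = np.column_stack([np.ones(n), rng.normal(size=n)])   # rank(X) = 2
    beta = np.array([0.5, 1.5])

    s2_vals = []
    for _ in range(50000):
        y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
        rss = np.sum((y - X @ np.linalg.solve(X.T @ X, X.T @ y)) ** 2)
        s2_vals.append(rss / (n - 2))

    print(np.var(s2_vals), 2 * sigma2 ** 2 / (n - 2))   # empirical vs theoretical 2*sigma^4/(n - rank(X))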
Our model is the classical linear model with normal errors:

    y = Xβ + e,   e ∼ N(0, σ² I_n).

We first need the concept of a complete sufficient statistic:

Sufficiency: Let y be a random vector with p.d.f. f(y; θ) depending on an unknown k × 1 parameter θ. Let T(y) be an r × 1 vector-valued statistic that is a function of y. Then T(y) is said to be a sufficient statistic for θ if and only if the conditional distribution of y given the value of T(y) does not depend upon θ.

• Heuristically, a sufficient statistic T captures all of the information in the data y relevant to θ. Once we know T, there's no more information in y about θ.

The property of completeness is needed as well, but it is somewhat technical. Briefly, it ensures that if a function of the sufficient statistic exists that is unbiased for the quantity being estimated, then it is unique.

Completeness: A vector-valued sufficient statistic T(y) is said to be complete if and only if E{h(T(y))} = 0 for all θ implies Pr{h(T(y)) = 0} = 1 for all θ.

Theorem: If T(y) is a complete sufficient statistic, then f(T(y)) is a minimum variance unbiased estimator of E{f(T(y))}.

Proof: This theorem is known as the Lehmann-Scheffé Theorem, and its proof follows easily from the Rao-Blackwell Theorem. See, e.g., Bickel and Doksum, p. 122, or Casella and Berger, p. 320.

In the linear model, the p.d.f. of y depends upon β and σ², so the pair (β, σ²) plays the role of θ.

Is there a complete sufficient statistic for (β, σ²) in the classical linear model?
Yes, by the following result:
Theorem: Let θ = (θ_1, ..., θ_r)^T and let y be a random vector with probability density function

    f(y) = c(θ) exp{ Σ_{i=1}^r θ_i T_i(y) } h(y).

Then T(y) = (T_1(y), ..., T_r(y))^T is a complete sufficient statistic, provided that neither θ nor T(y) satisfies any linear constraints.

• A density of this form is said to belong to the exponential family of distributions. For this family, which includes the normal distribution, it is easy to find a complete sufficient statistic.
Consider the classical linear model

    y = Xβ + e,   e ∼ N(0, σ² I_n).

The density of y can be written as

    f(y; β, σ²) = (2π)^{−n/2} (σ²)^{−n/2} exp{ −(y − Xβ)^T (y − Xβ) / (2σ²) }
                = c_1(σ²) exp{ −(y^T y − 2β^T X^T y + β^T X^T Xβ) / (2σ²) }
                = c_2(β, σ²) exp[ {−1/(2σ²)} y^T y + (σ^{−2} β^T)(X^T y) ].

If we reparameterize in terms of θ, where

    θ_1 = −1/(2σ²),   (θ_2, ..., θ_{k+2})^T = σ^{−2} β,

then this density can be seen to be of the exponential form, with vector-valued complete sufficient statistic

    T(y) = ( y^T y, (X^T y)^T )^T.
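As an illustration of sufficiency (an addition to the notes, with simulated data): both β̂ and s² can be computed from (y^T y, X^T y) alone, together with the fixed design matrix X, without revisiting the individual y_i.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 30
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

    # The complete sufficient statistic: (y^T y, X^T y)
    yTy = y @ y
    XTy = X.T @ y

    # beta_hat and s^2 reconstructed from the sufficient statistic (X is known and fixed)
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, XTy)
    rss = yTy - XTy @ beta_hat           # ||y - X beta_hat||^2 = y^T y - (X^T y)^T beta_hat
    s2 = rss / (n - np.linalg.matrix_rank(X))

    print(np.allclose(rss, np.sum((y - X @ beta_hat) ** 2)))   # True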
Generalized Least Squares

Up to now, we have assumed var(e) = σ²I in our linear model. There are two aspects to this assumption: (i) uncorrelatedness (var(e) is diagonal), and (ii) homoscedasticity (the diagonal elements of var(e) are all the same). Now we relax these assumptions simultaneously by considering a more general variance-covariance structure. We now consider the linear model

    y = Xβ + e,   where E(e) = 0, var(e) = σ² V,

where X is full rank as before, and where V is a known positive definite matrix.

• Since V is assumed known, there is only one unknown variance-covariance parameter to be estimated, σ².

• The case in which V is unknown complicates things substantially, so we postpone discussion of this case. V unknown can be handled via ML estimation, and we'll talk about that later. Of course, V unknown is the typical scenario in practice, but there are cases when V would be known.

• Example: consider a simple linear regression model with uncorrelated, but heteroscedastic, errors:

    y_i = β_0 + β_1 x_i + e_i,

where the e_i's are independent, each with mean 0, and var(e_i) = σ² x_i. In this case, var(e) = σ² V, where V = diag(x_1, ..., x_n), a known matrix of constants.
Estimation of β and σ² when var(e) = σ² V:

A nice feature of the model

    y = Xβ + e, where var(e) = σ² V,   (1)

is that, although it is not a Gauss-Markov (spherical-errors) model, it is simple to transform it into a Gauss-Markov model. This allows us to apply what we've learned about the spherical-errors case to obtain methods and results for the non-spherical case.

Since V is known and positive definite, it is possible to find a matrix Q such that V = QQ^T (e.g., Q^T could be the Cholesky factor of V).

Multiplying both sides of the model equation in (1) by the known matrix Q^{−1}, it follows that the following transformed model holds as well:

    Q^{−1} y = Q^{−1} Xβ + Q^{−1} e,

or

    ỹ = X̃β + ẽ, where var(ẽ) = σ² I,   (2)

where ỹ = Q^{−1} y, X̃ = Q^{−1} X, and ẽ = Q^{−1} e. Here

    E(ẽ) = Q^{−1} E(e) = Q^{−1} 0 = 0

and

    var(ẽ) = Q^{−1} var(e) (Q^{−1})^T = σ² Q^{−1} V (Q^{−1})^T = σ² Q^{−1} QQ^T (Q^{−1})^T = σ² I.
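A sketch of this transformation in practice (an addition, using the heteroscedastic example V = diag(x_1, ..., x_n) from above with simulated data): form Q from a Cholesky factorization of V, transform, and apply ordinary least squares to the transformed model. Substituting X̃ = Q^{−1}X and ỹ = Q^{−1}y into (X̃^T X̃)^{−1} X̃^T ỹ shows this equals (X^T V^{−1} X)^{−1} X^T V^{−1} y, which the code verifies numerically.

    import numpy as np

    rng = np.random.default_rng(8)
    n = 40
    x = rng.uniform(1.0, 5.0, size=n)           # positive x_i, so V = diag(x) is positive definite
    X = np.column_stack([np.ones(n), x])
    beta = np.array([2.0, 1.0])
    e = rng.normal(scale=np.sqrt(0.5 * x))      # var(e_i) = sigma^2 * x_i with sigma^2 = 0.5
    y = X @ beta + e

    V = np.diag(x)
    Q = np.linalg.cholesky(V)                   # V = Q Q^T
    Qinv = np.linalg.inv(Q)

    # Transformed (Gauss-Markov) model: y_tilde = X_tilde beta + e_tilde, var(e_tilde) = sigma^2 I
    y_t, X_t = Qinv @ y, Qinv @ X
    beta_gls = np.linalg.solve(X_t.T @ X_t, X_t.T @ y_t)

    # The same estimator written directly in terms of V^{-1}
    Vinv = np.linalg.inv(V)
    beta_direct = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

    print(np.allclose(beta_gls, beta_direct))   # True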