
Least-Squares Estimation:

Recall that the projection of y onto C(X), the set of all vectors of the form Xb for b ∈ ℝᵏ⁺¹, yields the closest point in C(X) to y. That is, p(y|C(X)) yields the minimizer of

    Q(β) = ‖y − Xβ‖²    (the least-squares criterion)

This leads to the estimator β̂ given by the solution of

    XᵀXβ = Xᵀy    (the normal equations)

or β̂ = (XᵀX)⁻¹Xᵀy.

All of this has already been established back when we studied projections (see pp. 30–31). Alternatively, we could use calculus:

To find a stationary point (maximum, minimum, or saddle point) of Q(β), we set the partial derivative of Q(β) equal to zero and solve:

    ∂Q(β)/∂β = ∂/∂β (y − Xβ)ᵀ(y − Xβ) = ∂/∂β (yᵀy − 2yᵀXβ + βᵀ(XᵀX)β) = 0 − 2Xᵀy + 2XᵀXβ

Here we've used the vector differentiation formulas ∂(cᵀz)/∂z = c and ∂(zᵀAz)/∂z = 2Az (see §2.14 of our text).

Setting this result equal to zero, we obtain the normal equations, which have solution β̂ = (XᵀX)⁻¹Xᵀy. That this is a minimum rather than a maximum or saddle point can be verified by checking the second derivative matrix of Q(β):

    ∂²Q(β)/∂β∂βᵀ = 2XᵀX,

which is positive definite (Result 7, p. 54); therefore β̂ is a minimum.
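
As a quick numerical companion to the algebra above, here is a minimal sketch (made-up data, assuming NumPy is available) that computes β̂ both from the normal equations and with a numerically preferable least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n observations, k = 2 explanatory variables plus an intercept.
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # n x (k+1) design matrix
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Normal equations: (X^T X) beta = X^T y
beta_hat_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimator via a least-squares routine (better conditioned numerically).
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat_normal)
print(beta_hat_lstsq)  # agrees with the normal-equations solution
```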

Example — Simple Linear Regression

Consider the case k = 1:

    yᵢ = β₀ + β₁xᵢ + eᵢ,    i = 1, ..., n,

where e₁, ..., eₙ are i.i.d., each with mean 0 and variance σ². Then the model equation becomes

    [ y₁ ]   [ 1  x₁ ]            [ e₁ ]
    [ y₂ ] = [ 1  x₂ ] [ β₀ ]  +  [ e₂ ]
    [ ⋮  ]   [ ⋮   ⋮ ] [ β₁ ]     [ ⋮  ]
    [ yₙ ]   [ 1  xₙ ]            [ eₙ ]

(the n × 2 matrix of 1's and xᵢ's is X). It follows that

    XᵀX = [   n      Σᵢxᵢ  ],    Xᵀy = [  Σᵢyᵢ  ],
          [ Σᵢxᵢ    Σᵢxᵢ²  ]           [ Σᵢxᵢyᵢ ]

    (XᵀX)⁻¹ = 1/(n Σᵢxᵢ² − (Σᵢxᵢ)²) [  Σᵢxᵢ²   −Σᵢxᵢ ]
                                     [ −Σᵢxᵢ      n   ]

Therefore, β̂ = (XᵀX)⁻¹Xᵀy yields

    β̂ = [ β̂₀ ] = 1/(n Σᵢxᵢ² − (Σᵢxᵢ)²) [  Σᵢxᵢ² Σᵢyᵢ − Σᵢxᵢ Σᵢxᵢyᵢ  ]
        [ β̂₁ ]                          [ −(Σᵢxᵢ)(Σᵢyᵢ) + n Σᵢxᵢyᵢ  ]

After a bit of algebra, these estimators simplify to

    β̂₁ = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / Σᵢ(xᵢ − x̄)² = Sₓᵧ/Sₓₓ    and    β̂₀ = ȳ − β̂₁x̄.
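
To make the simplification concrete, the following sketch (made-up data, assuming NumPy) computes β̂₁ = Sₓᵧ/Sₓₓ and β̂₀ = ȳ − β̂₁x̄ and confirms they match the matrix formula (XᵀX)⁻¹Xᵀy:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.7 * x + rng.normal(scale=1.0, size=n)

# Closed-form simple linear regression estimators.
xbar, ybar = x.mean(), y.mean()
Sxy = np.sum((x - xbar) * (y - ybar))
Sxx = np.sum((x - xbar) ** 2)
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

# Matrix formula with X = [1, x].
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(b0, b1)
print(beta_hat)  # same two numbers
```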

Example — Simple Linear Regression (Continued)

Result 2 on the previous page says that for var(y) = σ²I, var(β̂) = σ²(XᵀX)⁻¹. Therefore, in the simple linear regression case,

    var[ β̂₀ ] = σ²(XᵀX)⁻¹ = σ²/(n Σᵢxᵢ² − (Σᵢxᵢ)²) [  Σᵢxᵢ²   −Σᵢxᵢ ]
       [ β̂₁ ]                                       [ −Σᵢxᵢ      n   ]

                          = σ²/Σᵢ(xᵢ − x̄)² [ n⁻¹Σᵢxᵢ²   −x̄ ]
                                            [    −x̄       1 ]

Thus,

    var(β̂₀) = σ² (Σᵢxᵢ²/n) / Σᵢ(xᵢ − x̄)² = σ² [ 1/n + x̄²/Σᵢ(xᵢ − x̄)² ],

    var(β̂₁) = σ² / Σᵢ(xᵢ − x̄)²,    and    cov(β̂₀, β̂₁) = −σ²x̄ / Σᵢ(xᵢ − x̄)².

• Note that if x̄ > 0, then cov(β̂₀, β̂₁) is negative, meaning that the slope and intercept are inversely related. That is, over repeated samples from the same model, the intercept will tend to decrease when the slope increases.
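
A small simulation can illustrate these formulas. This sketch (made-up x values with x̄ > 0, assuming NumPy) compares the empirical variance-covariance matrix of (β̂₀, β̂₁) over repeated samples with σ²(XᵀX)⁻¹, and shows the negative covariance:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 30, 1.5
x = rng.uniform(1, 5, size=n)          # xbar > 0, so cov(b0_hat, b1_hat) should be negative
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 0.5])

# Repeatedly simulate y from the same model and re-fit.
fits = []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    fits.append(np.linalg.solve(X.T @ X, X.T @ y))
fits = np.array(fits)

print(np.cov(fits.T))                        # empirical var-cov matrix of (b0_hat, b1_hat)
print(sigma**2 * np.linalg.inv(X.T @ X))     # theoretical sigma^2 (X^T X)^{-1}
```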

Gauss-Markov Theorem:

We have seen that in the spherical-errors, full-rank linear model, the least-squares estimator β̂ = (XᵀX)⁻¹Xᵀy is unbiased and it is a linear estimator.

The following theorem states that in the class of linear and unbiased estimators, the least-squares estimator is optimal (or best) in the sense that it has minimum variance among all estimators in this class.

Gauss-Markov Theorem: Consider the linear model y = Xβ + e where X is n × (k + 1) of rank k + 1, where n > k + 1, E(e) = 0, and var(e) = σ²I. The least-squares estimators β̂ⱼ, j = 0, 1, ..., k (the elements of β̂ = (XᵀX)⁻¹Xᵀy) have minimum variance among all linear unbiased estimators.

Proof: Write β̂ⱼ as β̂ⱼ = cᵀβ̂ where c is the indicator vector containing a 1 in the (j + 1)st position and 0's elsewhere. Then

    β̂ⱼ = cᵀ(XᵀX)⁻¹Xᵀy = aᵀy,    where a = X(XᵀX)⁻¹c.

The quantity being estimated is βⱼ = cᵀβ = cᵀ(XᵀX)⁻¹Xᵀμ = aᵀμ, where μ = Xβ.

Consider an arbitrary linear estimator β̃ⱼ = dᵀy of βⱼ. For such an estimator to be unbiased, it must satisfy E(β̃ⱼ) = E(dᵀy) = dᵀμ = aᵀμ for any μ ∈ C(X). I.e.,

    dᵀμ − aᵀμ = 0  ⇒  (d − a)ᵀμ = 0 for all μ ∈ C(X),

or (d − a) ⊥ C(X). Then

    β̃ⱼ = dᵀy = aᵀy + (d − a)ᵀy = β̂ⱼ + (d − a)ᵀy.

The random variables on the right-hand side, β̂ⱼ and (d − a)ᵀy, have covariance

    cov(aᵀy, (d − a)ᵀy) = aᵀ var(y) (d − a) = σ² aᵀ(d − a) = σ²(dᵀa − aᵀa).

Since dᵀμ = aᵀμ for any μ ∈ C(X) and a = X(XᵀX)⁻¹c ∈ C(X), it follows that dᵀa = aᵀa, so that

    cov(aᵀy, (d − a)ᵀy) = σ²(dᵀa − aᵀa) = σ²(aᵀa − aᵀa) = 0.

Because this covariance is zero,

    var(β̃ⱼ) = var(β̂ⱼ) + var((d − a)ᵀy) ≥ var(β̂ⱼ),

so no linear unbiased estimator of βⱼ has smaller variance than the least-squares estimator.
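
To see the theorem in action, here is a minimal simulation sketch (made-up data, assuming NumPy): it compares the least-squares slope with another linear unbiased estimator of the slope, the two-point estimator (yₙ − y₁)/(xₙ − x₁). Under the model assumptions both are approximately unbiased, but the least-squares slope should show the smaller variance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 25, 1.0
x = np.sort(rng.uniform(0, 10, size=n))   # fixed design, x_1 < ... < x_n
X = np.column_stack([np.ones(n), x])
beta0, beta1 = 1.0, 0.4

ls_slopes, twopoint_slopes = [], []
for _ in range(20000):
    y = beta0 + beta1 * x + rng.normal(scale=sigma, size=n)
    ls_slopes.append(np.linalg.solve(X.T @ X, X.T @ y)[1])   # least-squares slope
    twopoint_slopes.append((y[-1] - y[0]) / (x[-1] - x[0]))  # another linear unbiased estimator

print(np.mean(ls_slopes), np.mean(twopoint_slopes))   # both close to beta1 = 0.4 (unbiased)
print(np.var(ls_slopes), np.var(twopoint_slopes))     # least-squares variance is the smaller one
```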

An additional property of least-squares estimation is that the estimated mean μ̂ = X(XᵀX)⁻¹Xᵀy is invariant to (doesn't change as a result of) linear changes of scale in the explanatory variables.

That is, consider the linear models

    y = [ 1  x₁₁  x₁₂  ⋯  x₁ₖ ]
        [ 1  x₂₁  x₂₂  ⋯  x₂ₖ ]  β + e        (design matrix = X)
        [ ⋮    ⋮    ⋮        ⋮ ]
        [ 1  xₙ₁  xₙ₂  ⋯  xₙₖ ]

and

    y = [ 1  c₁x₁₁  c₂x₁₂  ⋯  cₖx₁ₖ ]
        [ 1  c₁x₂₁  c₂x₂₂  ⋯  cₖx₂ₖ ]  β* + e   (design matrix = Z)
        [ ⋮     ⋮      ⋮          ⋮  ]
        [ 1  c₁xₙ₁  c₂xₙ₂  ⋯  cₖxₙₖ ]

Then μ̂, the least-squares estimator of E(y), is the same in both of these two models. This follows from a more general theorem:

Theorem: In the linear model y = Xβ + e where E(e) = 0 and X is of full rank, μ̂, the least-squares estimator of E(y), is invariant to a full-rank linear transformation of X.

Proof: A full-rank linear transformation of X is given by Z = XH, where H is square and of full rank. In the original (untransformed) linear model, μ̂ = X(XᵀX)⁻¹Xᵀy = P_{C(X)} y. In the transformed model y = Zβ* + e, μ̂ = Z(ZᵀZ)⁻¹Zᵀy = P_{C(Z)} y = P_{C(XH)} y. So it suffices to show that P_{C(X)} = P_{C(XH)}. This is true because if x ∈ C(XH), then x = XHb for some b ⇒ x = Xc where c = Hb ⇒ x ∈ C(X) ⇒ C(XH) ⊂ C(X). In addition, if x ∈ C(X), then x = Xd for some d ⇒ x = XHH⁻¹d = XHa where a = H⁻¹d ⇒ x ∈ C(XH) ⇒ C(X) ⊂ C(XH). Therefore, C(X) = C(XH).

• The simple case described above, where each of the xⱼ's is rescaled by a constant cⱼ, occurs when H = diag(1, c₁, c₂, ..., cₖ).
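
The following sketch (made-up data, assuming NumPy) checks this invariance numerically: rescaling the non-intercept columns of X changes β̂, but it leaves the fitted values μ̂ unchanged:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Rescale the explanatory variables: H = diag(1, c1, c2).
H = np.diag([1.0, 10.0, 0.25])
Z = X @ H

mu_hat_X = X @ np.linalg.solve(X.T @ X, X.T @ y)
mu_hat_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)

print(np.allclose(mu_hat_X, mu_hat_Z))  # True: the fitted mean is invariant to the rescaling
```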

Maximum Likelihood Estimation:

Least-squares provides a simple, intuitively reasonable criterion for estimation. If we want to estimate a parameter describing μ, the mean of y, then we choose the parameter value that minimizes the squared distance between y and μ. If var(y) = σ²I, then the resulting estimator is BLUE (optimal, in some sense).

• Least-squares is based only on assumptions concerning the mean and variance-covariance matrix (the first two moments) of y.

• Least-squares tells us how to estimate parameters associated with the mean (e.g., β) but nothing about how to estimate parameters describing the variance (e.g., σ²) or other aspects of the distribution of y.

An alternative method of estimation is maximum likelihood estimation.

• Maximum likelihood requires the specification of the entire distribution of y (up to some unknown parameters), rather than just the mean and variance of that distribution.

• ML estimation provides a criterion of estimation for any parameter describing the distribution of y, including parameters describing the mean (e.g., β), the variance (σ²), or any other aspect of the distribution.

• Thus, ML estimation is simultaneously more general and less general than least squares in certain senses. It can provide estimators of all sorts of parameters in a broad array of model types, including models much more complex than those for which least-squares is appropriate; but it requires stronger assumptions than least-squares.

Under independence,

    L(γ; y) = ∏ᵢ₌₁ⁿ f(yᵢ; γ).

Since it's easier to work with sums than products, it's useful to note that in general

    arg max_γ L(γ; y) = arg max_γ ℓ(γ; y),    where ℓ(γ; y) ≡ log L(γ; y) is the log-likelihood.

Therefore, we define an MLE of γ as a γ̂ such that

    ℓ(γ̂; y) ≥ ℓ(γ; y)    for all γ ∈ Γ.

If Γ is an open set, then γ̂ must satisfy (if it exists)

    ∂ℓ(γ̂)/∂γⱼ = 0,    j = 1, ..., p,

or, in vector form,

    ∂ℓ(γ̂; y)/∂γ = ( ∂ℓ(γ̂)/∂γ₁, ..., ∂ℓ(γ̂)/∂γₚ )ᵀ = 0    (the likelihood equation, a.k.a. score equation).

In the classical linear model, the unknown parameters of the model are β and σ², so the pair (β, σ²) plays the role of γ.

Under the assumption A5 that e ∼ Nₙ(0, σ²Iₙ), it follows that y ∼ Nₙ(Xβ, σ²Iₙ), so the likelihood function is given by

    L(β, σ²; y) = (2πσ²)^(−n/2) exp{ −‖y − Xβ‖²/(2σ²) },    for β ∈ ℝᵏ⁺¹ and σ² > 0.

The log-likelihood is a bit easier to work with, and it has the same maximizers. It is given by

    ℓ(β, σ²; y) = −(n/2) log(2π) − (n/2) log(σ²) − ‖y − Xβ‖²/(2σ²).

We can maximize this function with respect to β and σ² in two steps: first maximize with respect to β treating σ² as fixed; then plug that estimator back into the log-likelihood function and maximize with respect to σ².

For fixed σ², maximizing ℓ(β, σ²; y) is equivalent to maximizing the third term, −‖y − Xβ‖²/(2σ²), or, equivalently, minimizing ‖y − Xβ‖². This is just what we do in least-squares, and it leads to the estimator β̂ = (XᵀX)⁻¹Xᵀy.

Next we plug this estimator back into the log-likelihood (this gives what's known as the profile log-likelihood for σ²):

    ℓ(β̂, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − ‖y − Xβ̂‖²/(2σ²)

and maximize with respect to σ². Setting ∂ℓ(β̂, σ²)/∂σ² = 0 and solving gives the MLE σ̂² = ‖y − Xβ̂‖²/n.
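
As a numerical check (made-up data, assuming NumPy; the grid and function name are illustrative), the sketch below evaluates the profile log-likelihood over a grid of σ² values and confirms that it peaks at σ̂² = ‖y − Xβ̂‖²/n:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=2.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta_hat) ** 2)

def profile_loglik(sigma2):
    # l(beta_hat, sigma^2) = -(n/2) log(2 pi) - (n/2) log(sigma^2) - RSS / (2 sigma^2)
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2) - rss / (2 * sigma2)

grid = np.linspace(0.5, 10.0, 2000)
print(grid[np.argmax(profile_loglik(grid))])  # grid maximizer of the profile log-likelihood
print(rss / n)                                # MLE sigma^2_hat = RSS / n (they agree closely)
```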

Estimation of σ²:

• Maximum likelihood estimation provides a unified approach to estimating all parameters in the model, β and σ².

• In contrast, least-squares estimation only provides an estimator of β.

We've seen that the LS and ML estimators of β coincide. However, the MLE of σ² is not the usually preferred estimator of σ², and it is not the estimator of σ² that is typically combined with LS estimation of β.

Why not? Because σ̂² is biased.

That E(σ̂²) ≠ σ² can easily be established using our results for taking expected values of quadratic forms:

    E(σ̂²) = E( (1/n) ‖P_{C(X)⊥} y‖² )
          = (1/n) E( (P_{C(X)⊥} y)ᵀ P_{C(X)⊥} y )
          = (1/n) E( yᵀ P_{C(X)⊥} y )
          = (1/n) { σ² dim(C(X)⊥) + ‖P_{C(X)⊥} Xβ‖² }     (the last term = 0, because Xβ ∈ C(X))
          = (σ²/n) dim(C(X)⊥)
          = (σ²/n) { n − dim(C(X)) }                      (dim(C(X)) = rank(X))
          = (σ²/n) { n − (k + 1) }.

Therefore, the MLE σ̂² is biased by a multiplicative factor of (n − k − 1)/n, and an alternative unbiased estimator of σ² can easily be constructed as

    s² ≡ n/(n − k − 1) σ̂² = ‖y − Xβ̂‖² / (n − k − 1),

or, more generally (that is, for X not necessarily of full rank),

    s² = ‖y − Xβ̂‖² / (n − rank(X)).

• s², rather than σ̂², is generally the preferred estimator of σ². In fact, it can be shown that in the spherical-errors linear model, s² is the best (minimum variance) estimator of σ² in the class of quadratic (in y) unbiased estimators.

• In the special case that Xβ = μjₙ (i.e., the model contains only an intercept, or constant term), so that C(X) = L(jₙ), we get β̂ = μ̂ = ȳ and rank(X) = 1. Therefore, s² becomes the usual sample variance from the one-sample problem:

    s² = ‖y − ȳjₙ‖² / (n − 1) = (1/(n − 1)) Σᵢ₌₁ⁿ (yᵢ − ȳ)².
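
The following sketch (made-up data, assuming NumPy) simulates repeated samples and compares the averages of σ̂² = ‖y − Xβ̂‖²/n and s² = ‖y − Xβ̂‖²/(n − k − 1) with the true σ², illustrating the (n − k − 1)/n bias factor:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, sigma2 = 20, 3, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = rng.normal(size=k + 1)

mle_vals, s2_vals = [], []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    rss = np.sum((y - X @ np.linalg.solve(X.T @ X, X.T @ y)) ** 2)
    mle_vals.append(rss / n)            # MLE: biased downward
    s2_vals.append(rss / (n - k - 1))   # s^2: unbiased

print(np.mean(mle_vals), sigma2 * (n - k - 1) / n)  # both roughly 3.2
print(np.mean(s2_vals), sigma2)                     # both roughly 4.0
```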

If e has a normal distribution, then by part 3 of the theorem on p. 85,

    ‖y − Xβ̂‖² / σ² ∼ χ²(n − rank(X)),

and, since the central χ²(m) distribution has mean m and variance 2m,

    s² = σ²/(n − rank(X)) · [ ‖y − Xβ̂‖²/σ² ]     (the bracketed quantity ∼ χ²(n − rank(X)))

implies

    E(s²) = σ²/(n − rank(X)) · { n − rank(X) } = σ²,

and

    var(s²) = σ⁴/{n − rank(X)}² · 2{ n − rank(X) } = 2σ⁴/(n − rank(X)).
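
Under normal errors this can be checked by simulation. The sketch below (made-up data, assuming NumPy) compares the empirical mean and variance of ‖y − Xβ̂‖²/σ² with n − rank(X) and 2(n − rank(X)):

```python
import numpy as np

rng = np.random.default_rng(7)
n, sigma2 = 15, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # rank(X) = 3 here
beta = np.array([1.0, -1.0, 0.5])
df = n - np.linalg.matrix_rank(X)

stats = []
for _ in range(50000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    rss = np.sum((y - X @ np.linalg.solve(X.T @ X, X.T @ y)) ** 2)
    stats.append(rss / sigma2)          # should be chi-square with df = n - rank(X)

print(np.mean(stats), df)               # mean close to df
print(np.var(stats), 2 * df)            # variance close to 2*df
```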

Our model is the classical linear model with normal errors:

    y = Xβ + e,    e ∼ Nₙ(0, σ²Iₙ).

We first need the concept of a complete sufficient statistic.

Sufficiency: Let y be a random vector with p.d.f. f(y; θ) depending on an unknown k × 1 parameter θ. Let T(y) be an r × 1 vector-valued statistic that is a function of y. Then T(y) is said to be a sufficient statistic for θ if and only if the conditional distribution of y given the value of T(y) does not depend upon θ.

• If T is sufficient for θ then, loosely, T summarizes all of the information in the data y relevant to θ. Once we know T, there is no more information in y about θ.

The property of completeness is needed as well, but it is somewhat technical. Briefly, it ensures that if a function of the sufficient statistic exists that is unbiased for the quantity being estimated, then it is unique.

Completeness: A vector-valued sufficient statistic T(y) is said to be complete if and only if E{h(T(y))} = 0 for all θ implies Pr{h(T(y)) = 0} = 1 for all θ.

Theorem: If T(y) is a complete sufficient statistic, then f(T(y)) is a minimum variance unbiased estimator of E{f(T(y))}.

Proof: This theorem is known as the Lehmann-Scheffé Theorem and its proof follows easily from the Rao-Blackwell Theorem. See, e.g., Bickel and Doksum, p. 122, or Casella and Berger, p. 320.

In the linear model, the p.d.f. of y depends upon β and σ², so the pair (β, σ²) plays the role of θ.

Is there a complete sufficient statistic for (β, σ²) in the classical linear model? Yes, by the following result:

Theorem: Let θ = (θ₁, ..., θᵣ)ᵀ and let y be a random vector with probability density function

    f(y) = c(θ) exp{ Σᵢ₌₁ʳ θᵢ Tᵢ(y) } h(y).

Then T(y) = (T₁(y), ..., Tᵣ(y))ᵀ is a complete sufficient statistic provided that neither θ nor T(y) satisfies any linear constraints.

• The density function in the above theorem describes the exponential family of distributions. For this family, which includes the normal distribution, it is easy to find a complete sufficient statistic.

Consider the classical linear model

    y = Xβ + e,    e ∼ Nₙ(0, σ²Iₙ).

The density of y can be written as

    f(y; β, σ²) = (2π)^(−n/2) (σ²)^(−n/2) exp{ −(y − Xβ)ᵀ(y − Xβ)/(2σ²) }
                = c₁(σ²) exp{ −(yᵀy − 2βᵀXᵀy + βᵀXᵀXβ)/(2σ²) }
                = c₂(β, σ²) exp{ (−1/(2σ²)) yᵀy + (σ⁻²βᵀ)(Xᵀy) }.

If we reparameterize in terms of θ, where

    θ₁ = −1/(2σ²),    (θ₂, ..., θₖ₊₂)ᵀ = σ⁻²β,

then this density can be seen to be of the exponential form, with vector-valued complete sufficient statistic

    T(y) = ( yᵀy, (Xᵀy)ᵀ )ᵀ.
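
To see why (yᵀy, Xᵀy) is all that matters (with X treated as fixed and known), here is a small sketch (made-up data, assuming NumPy) that computes β̂ and s² using only XᵀX, Xᵀy, and yᵀy, never returning to the individual yᵢ:

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

# Reduce the data to the sufficient statistics (X is regarded as fixed).
XtX, Xty, yty = X.T @ X, X.T @ y, y @ y

beta_hat = np.linalg.solve(XtX, Xty)   # depends on the data only through X^T y
rss = yty - Xty @ beta_hat             # y^T y - y^T X (X^T X)^{-1} X^T y
s2 = rss / (n - k - 1)

print(beta_hat, s2)
print(np.sum((y - X @ beta_hat) ** 2) / (n - k - 1))  # same s^2 computed directly
```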

Generalized Least Squares

Up to now, we have assumed var(e) = σ²I in our linear model. There are two aspects to this assumption: (i) uncorrelatedness (var(e) is diagonal), and (ii) homoscedasticity (the diagonal elements of var(e) are all the same).

Now we relax these assumptions simultaneously by considering a more general variance-covariance structure. We now consider the linear model

    y = Xβ + e,    where E(e) = 0, var(e) = σ²V,

where X is full rank as before, and where V is a known positive definite matrix.

• Note that we assume V is known, so there is still only one variance-covariance parameter to be estimated, σ².

• In the context of least-squares, allowing V to be unknown complicates things substantially, so we postpone discussion of this case. Unknown V can be handled via ML estimation, and we'll talk about that later. Of course, unknown V is the typical scenario in practice, but there are cases when V would be known.

• A good example of such a situation is the simple linear regression model with uncorrelated, but heteroscedastic, errors:

    yᵢ = β₀ + β₁xᵢ + eᵢ,

  where the eᵢ's are independent, each with mean 0, and var(eᵢ) = σ²xᵢ. In this case, var(e) = σ²V where V = diag(x₁, ..., xₙ), a known matrix of constants.

Estimation of β and σ² when var(e) = σ²V:

A nice feature of the model

    y = Xβ + e,    where var(e) = σ²V,    (1)

is that, although it is not a Gauss-Markov (spherical errors) model, it is simple to transform this model into a Gauss-Markov model. This allows us to apply what we've learned about the spherical-errors case to obtain methods and results for the non-spherical case.

Since V is known and positive definite, it is possible to find a matrix Q such that V = QQᵀ (e.g., Qᵀ could be the Cholesky factor of V).

Multiplying both sides of the model equation in (1) by the known matrix Q⁻¹, it follows that the following transformed model holds as well:

    Q⁻¹y = Q⁻¹Xβ + Q⁻¹e

or

    ỹ = X̃β + ẽ,    where var(ẽ) = σ²I,    (2)

where ỹ = Q⁻¹y, X̃ = Q⁻¹X, and ẽ = Q⁻¹e.

• Notice that model (2) is a Gauss-Markov model because

    E(ẽ) = Q⁻¹E(e) = Q⁻¹0 = 0

and

    var(ẽ) = Q⁻¹ var(e) (Q⁻¹)ᵀ = σ² Q⁻¹V(Q⁻¹)ᵀ = σ² Q⁻¹QQᵀ(Q⁻¹)ᵀ = σ²I.
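
The sketch below (made-up data, assuming NumPy) carries out this transformation for the heteroscedastic example above, with V = diag(x₁, ..., xₙ): it forms ỹ = Q⁻¹y and X̃ = Q⁻¹X from a factor Q with V = QQᵀ and then applies ordinary least squares to the transformed model. As a standard identity not derived in the excerpt above, the result coincides with (XᵀV⁻¹X)⁻¹XᵀV⁻¹y:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 40
x = rng.uniform(1, 5, size=n)                  # positive, so V = diag(x) is positive definite
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 0.5])

# Heteroscedastic errors: var(e_i) = sigma^2 * x_i, i.e. var(e) = sigma^2 * V.
sigma = 0.8
V = np.diag(x)
e = rng.normal(scale=sigma * np.sqrt(x))
y = X @ beta + e

# Take Q lower triangular with V = Q Q^T (NumPy's Cholesky factorization).
Q = np.linalg.cholesky(V)
Qinv = np.linalg.inv(Q)
y_t, X_t = Qinv @ y, Qinv @ X                  # transformed (Gauss-Markov) model

beta_gls = np.linalg.solve(X_t.T @ X_t, X_t.T @ y_t)

# Standard identity (assumption noted in the lead-in): same as (X^T V^{-1} X)^{-1} X^T V^{-1} y.
Vinv = np.linalg.inv(V)
print(beta_gls)
print(np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y))
```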