Least-Squares Estimation:

Recall that the projection of y onto C(X), the set of all vectors of the form Xb for b ∈ R^{k+1}, yields the closest point in C(X) to y. That is, p(y|C(X)) yields the minimizer of

    Q(β) = ‖y − Xβ‖²   (the least-squares criterion)

This leads to the estimator β̂ given by the solution of

    X^T Xβ = X^T y   (the normal equations)

or

    β̂ = (X^T X)^{−1} X^T y.
All of this has already been established back when we studied projections (see pp. 30–31). Alternatively, we could use calculus:

To find a stationary point (maximum, minimum, or saddle point) of Q(β), we set the partial derivative of Q(β) equal to zero and solve:

    ∂Q(β)/∂β = ∂/∂β (y − Xβ)^T (y − Xβ) = ∂/∂β (y^T y − 2y^T Xβ + β^T (X^T X)β) = −2X^T y + 2X^T Xβ.

Here we've used the vector differentiation formulas ∂(c^T z)/∂z = c and ∂(z^T Az)/∂z = 2Az for symmetric A (see §2.14 of our text).

Setting this result equal to zero, we obtain the normal equations, which have solution β̂ = (X^T X)^{−1} X^T y. That this is a minimum, rather than a maximum or saddle point, can be verified by checking the second derivative matrix of Q(β):

    ∂²Q(β)/∂β∂β^T = 2X^T X,

which is positive definite (Result 7, p. 54); therefore β̂ is a minimizer.
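The following numpy sketch (an illustration added to these notes, using simulated data and made-up parameter values) solves the normal equations directly and checks the result against a generic least-squares routine.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 50, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # n x (k+1), full rank
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(scale=0.3, size=n)

    # Solve the normal equations X^T X beta = X^T y
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Same answer from a generic least-squares routine
    beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(beta_hat, beta_ls))   # True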
Example — Simple Linear Regression

Consider the case k = 1:

    y_i = β_0 + β_1 x_i + e_i,   i = 1, ..., n,

where e_1, ..., e_n are i.i.d., each with mean 0 and variance σ². Then the model equation becomes

    (y_1, y_2, ..., y_n)^T = [1 x_1; 1 x_2; ...; 1 x_n] (β_0, β_1)^T + (e_1, e_2, ..., e_n)^T,

i.e., y = Xβ + e, where X is the n × 2 matrix whose ith row is (1, x_i) and β = (β_0, β_1)^T (matrix rows separated by semicolons).

It follows that

    X^T X = [n, Σ_i x_i; Σ_i x_i, Σ_i x_i²],   X^T y = (Σ_i y_i, Σ_i x_i y_i)^T,

and

    (X^T X)^{−1} = 1/{n Σ_i x_i² − (Σ_i x_i)²} [Σ_i x_i², −Σ_i x_i; −Σ_i x_i, n].

Therefore, β̂ = (X^T X)^{−1} X^T y yields

    β̂ = (β̂_0, β̂_1)^T = 1/{n Σ_i x_i² − (Σ_i x_i)²} ( Σ_i x_i² Σ_i y_i − Σ_i x_i Σ_i x_i y_i ,  −(Σ_i x_i)(Σ_i y_i) + n Σ_i x_i y_i )^T.

After a bit of algebra, these estimators simplify to

    β̂_1 = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)² = S_xy / S_xx   and   β̂_0 = ȳ − β̂_1 x̄.
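A quick numerical check (added here, not part of the original notes; the data are simulated): the closed-form slope S_xy/S_xx and intercept ȳ − β̂_1 x̄ agree with the matrix formula (X^T X)^{−1} X^T y applied to the two-column design matrix.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=30)
    y = 2.0 + 0.7 * x + rng.normal(scale=1.0, size=30)

    # Closed-form simple linear regression estimates
    Sxx = np.sum((x - x.mean()) ** 2)
    Sxy = np.sum((x - x.mean()) * (y - y.mean()))
    b1 = Sxy / Sxx
    b0 = y.mean() - b1 * x.mean()

    # Matrix formula beta_hat = (X^T X)^{-1} X^T y
    X = np.column_stack([np.ones_like(x), x])
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    print(np.allclose([b0, b1], beta_hat))   # True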
Example — Simple Linear Regression (Continued)

Result 2 on the previous page says that for var(y) = σ²I, var(β̂) = σ²(X^T X)^{−1}. Therefore, in the simple linear regression case,

    var( (β̂_0, β̂_1)^T ) = σ²(X^T X)^{−1}
                         = σ²/{n Σ_i x_i² − (Σ_i x_i)²} [Σ_i x_i², −Σ_i x_i; −Σ_i x_i, n]
                         = σ²/Σ_i (x_i − x̄)² [n^{−1} Σ_i x_i², −x̄; −x̄, 1].

Thus,

    var(β̂_0) = σ² (Σ_i x_i²/n) / Σ_i (x_i − x̄)² = σ² { 1/n + x̄²/Σ_i (x_i − x̄)² },

    var(β̂_1) = σ² / Σ_i (x_i − x̄)²,

and

    cov(β̂_0, β̂_1) = −σ² x̄ / Σ_i (x_i − x̄)².

Notice that when x̄ > 0, cov(β̂_0, β̂_1) is negative, meaning that the slope and intercept are inversely related. That is, over repeated samples from the same model, the intercept will tend to decrease when the slope increases.
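These variance expressions can be checked numerically. The sketch below (an addition, with arbitrary made-up x values and σ²) compares σ²(X^T X)^{−1} to the closed forms for var(β̂_0), var(β̂_1), and cov(β̂_0, β̂_1).

    import numpy as np

    x = np.array([1.0, 2.0, 3.5, 5.0, 7.5, 9.0])
    n = len(x)
    sigma2 = 4.0
    X = np.column_stack([np.ones(n), x])

    V = sigma2 * np.linalg.inv(X.T @ X)   # var(beta_hat) = sigma^2 (X^T X)^{-1}

    Sxx = np.sum((x - x.mean()) ** 2)
    var_b0 = sigma2 * (1.0 / n + x.mean() ** 2 / Sxx)
    var_b1 = sigma2 / Sxx
    cov_b0_b1 = -sigma2 * x.mean() / Sxx

    print(np.allclose(V, [[var_b0, cov_b0_b1], [cov_b0_b1, var_b1]]))   # True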
Gauss-Markov Theorem:

We have seen that in the spherical-errors, full-rank linear model, the least-squares estimator β̂ = (X^T X)^{−1} X^T y is unbiased and is a linear estimator.

The following theorem states that in the class of linear, unbiased estimators, the least-squares estimator is optimal (or best) in the sense that it has minimum variance among all estimators in this class.

Gauss-Markov Theorem: Consider the linear model y = Xβ + e, where X is n × (k + 1) of rank k + 1, where n > k + 1, E(e) = 0, and var(e) = σ²I. The least-squares estimators β̂_j, j = 0, 1, ..., k (the elements of β̂ = (X^T X)^{−1} X^T y), have minimum variance among all linear unbiased estimators.
Proof: Write β̂_j as β̂_j = c^T β̂, where c is the indicator vector containing a 1 in the (j + 1)st position and 0's elsewhere. Then β̂_j = c^T (X^T X)^{−1} X^T y = a^T y, where a = X(X^T X)^{−1} c. The quantity being estimated is β_j = c^T β = c^T (X^T X)^{−1} X^T μ = a^T μ, where μ = Xβ.

Consider an arbitrary linear estimator β̃_j = d^T y of β_j. For such an estimator to be unbiased, it must satisfy E(β̃_j) = E(d^T y) = d^T μ = a^T μ for any μ ∈ C(X). I.e.,

    d^T μ − a^T μ = 0 ⇒ (d − a)^T μ = 0 for all μ ∈ C(X),

or (d − a) ⊥ C(X). Then

    β̃_j = d^T y = a^T y + (d − a)^T y = β̂_j + (d − a)^T y.

The random variables on the right-hand side, β̂_j and (d − a)^T y, have covariance

    cov(a^T y, (d − a)^T y) = a^T var(y)(d − a) = σ² a^T (d − a) = σ² (d^T a − a^T a).

Since d^T μ = a^T μ for any μ ∈ C(X) and a = X(X^T X)^{−1} c ∈ C(X), it follows that d^T a = a^T a, so that

    cov(a^T y, (d − a)^T y) = σ² (d^T a − a^T a) = σ² (a^T a − a^T a) = 0.

Therefore,

    var(β̃_j) = var(β̂_j) + var((d − a)^T y) ≥ var(β̂_j),

so the least-squares estimator β̂_j has the smallest variance among linear unbiased estimators of β_j.
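To make the theorem concrete, here is a small simulation sketch (an addition to the notes, with simulated data): it compares the least-squares slope to another linear unbiased estimator of β_1, the two-point slope (y_n − y_1)/(x_n − x_1), over repeated samples. Both are unbiased, but the least-squares slope has smaller variance, as the theorem guarantees.

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0, 10, 20)
    beta0, beta1, sigma = 1.0, 0.5, 1.0
    X = np.column_stack([np.ones_like(x), x])

    ls_slopes, alt_slopes = [], []
    for _ in range(20000):
        y = beta0 + beta1 * x + rng.normal(scale=sigma, size=x.size)
        ls_slopes.append(np.linalg.solve(X.T @ X, X.T @ y)[1])
        alt_slopes.append((y[-1] - y[0]) / (x[-1] - x[0]))   # another linear unbiased estimator

    print(np.mean(ls_slopes), np.mean(alt_slopes))   # both close to 0.5 (unbiased)
    print(np.var(ls_slopes) < np.var(alt_slopes))    # True: LS slope has smaller variance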
An additional property of least-squares estimation is that the estimated mean μ̂ = X(X^T X)^{−1} X^T y is invariant to (doesn't change as a result of) linear changes of scale in the explanatory variables.

That is, consider the linear models

    y = [1 x_11 x_12 ··· x_1k; 1 x_21 x_22 ··· x_2k; ...; 1 x_n1 x_n2 ··· x_nk] β + e = Xβ + e

and

    y = [1 c_1 x_11 c_2 x_12 ··· c_k x_1k; 1 c_1 x_21 c_2 x_22 ··· c_k x_2k; ...; 1 c_1 x_n1 c_2 x_n2 ··· c_k x_nk] β* + e = Zβ* + e

for nonzero constants c_1, ..., c_k. Then μ̂, the least-squares estimator of E(y), is the same in both of these two models. This follows from a more general theorem:

Theorem: In the linear model y = Xβ + e, where E(e) = 0 and X is of full rank, μ̂, the least-squares estimator of E(y), is invariant to a full-rank linear transformation of X.

Proof: A full-rank linear transformation of X is given by Z = XH, where H is square and of full rank. In the original (untransformed) linear model, μ̂ = X(X^T X)^{−1} X^T y = P_{C(X)} y. In the transformed model y = Zβ* + e, μ̂ = Z(Z^T Z)^{−1} Z^T y = P_{C(Z)} y = P_{C(XH)} y. So, it suffices to show that P_{C(X)} = P_{C(XH)}, i.e., that C(X) = C(XH). This is true because if x ∈ C(XH) then x = XHb for some b ⇒ x = Xc where c = Hb ⇒ x ∈ C(X) ⇒ C(XH) ⊂ C(X). In addition, if x ∈ C(X) then x = Xd for some d ⇒ x = XHH^{−1}d = XHa where a = H^{−1}d ⇒ x ∈ C(XH) ⇒ C(X) ⊂ C(XH). Therefore, C(X) = C(XH), and the two projections coincide.

The special case in which the jth explanatory variable is rescaled by a constant c_j occurs when H = diag(1, c_1, c_2, ..., c_k).
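A numerical illustration of this invariance (added, with simulated data): rescaling the columns of X by a full-rank diagonal H changes β̂ but leaves the fitted values μ̂ = P_{C(X)} y unchanged.

    import numpy as np

    rng = np.random.default_rng(3)
    n, k = 25, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
    y = rng.normal(size=n)

    H = np.diag([1.0, 2.0, -0.5, 10.0])   # full-rank rescaling; intercept column left alone
    Z = X @ H

    mu_hat_X = X @ np.linalg.solve(X.T @ X, X.T @ y)
    mu_hat_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T @ y)

    print(np.allclose(mu_hat_X, mu_hat_Z))   # True: fitted values are invariant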
Maximum Likelihood Estimation:

Least squares provides a simple, intuitively reasonable criterion for estimation: if we want to estimate a parameter describing μ, the mean of y, then choose the parameter value that minimizes the squared distance between y and μ. If var(y) = σ²I, then the resulting estimator is BLUE (optimal, in some sense).

• Least squares requires assumptions about only the mean and variance-covariance matrix (the first two moments) of y.

• However, least squares only provides a criterion for estimating parameters describing the mean (e.g., β); it says nothing about how to estimate parameters describing the variance (e.g., σ²) or other aspects of the distribution of y.

An alternative method of estimation is maximum likelihood estimation.

• Maximum likelihood requires the specification of the entire distribution of y (up to some unknown parameters), rather than just the mean and variance of that distribution.

• In return, it provides a criterion for estimating any of the parameters describing the distribution of y, including parameters describing the mean (e.g., β), variance (σ²), or any other aspect of the distribution.

• Maximum likelihood estimation is therefore both more demanding and more powerful than least squares in certain senses. It can provide estimators of all sorts of parameters in a broad array of model types, including models much more complex than those for which least squares is appropriate; but it requires stronger assumptions than least squares.
Suppose y has joint p.d.f. f(y; γ), depending on an unknown parameter vector γ = (γ_1, ..., γ_p)^T ∈ Γ. Regarded as a function of γ for the observed data y, f(y; γ) is called the likelihood function, L(γ; y). Under independence,

    L(γ; y) = ∏_{i=1}^n f(y_i; γ).

Since it is easier to work with sums than products, it is useful to note that in general

    arg max_γ L(γ; y) = arg max_γ log L(γ; y),   where we write ℓ(γ; y) ≡ log L(γ; y).

Therefore, we define an MLE of γ as a γ̂ such that

    ℓ(γ̂; y) ≥ ℓ(γ; y) for all γ ∈ Γ.

If Γ is an open set, then γ̂ must satisfy (if it exists)

    ∂ℓ(γ̂)/∂γ_j = 0,   j = 1, ..., p,

or, in vector form,

    ∂ℓ(γ̂; y)/∂γ = ( ∂ℓ(γ̂)/∂γ_1, ..., ∂ℓ(γ̂)/∂γ_p )^T = 0   (the likelihood equation, a.k.a. score equation).
In the classical linear model, the unknown parameters of the model are β and σ², so the pair (β, σ²) plays the role of γ.

Under assumption A5 that e ∼ N_n(0, σ² I_n), it follows that y ∼ N_n(Xβ, σ² I_n), so the likelihood function is given by

    L(β, σ²; y) = (2πσ²)^{−n/2} exp{ −‖y − Xβ‖² / (2σ²) },   for β ∈ R^{k+1} and σ² > 0.

The log-likelihood is a bit easier to work with, and has the same maximizers. It is given by

    ℓ(β, σ²; y) = −(n/2) log(2π) − (n/2) log(σ²) − ‖y − Xβ‖² / (2σ²).
We can maximize this function with respect to β and σ² in two steps: first maximize with respect to β treating σ² as fixed; then plug that estimator back into the log-likelihood function and maximize with respect to σ².

For fixed σ², maximizing ℓ(β, σ²; y) is equivalent to maximizing the third term, −‖y − Xβ‖²/(2σ²), or, equivalently, minimizing ‖y − Xβ‖². This is just what we do in least squares, and it leads to the estimator β̂ = (X^T X)^{−1} X^T y.
Next we plug this estimator back into the log-likelihood (this gives what's known as the profile log-likelihood for σ²):

    ℓ(β̂, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − ‖y − Xβ̂‖² / (2σ²),

and maximize with respect to σ². Setting ∂ℓ(β̂, σ²)/∂σ² = 0 and solving gives the MLE

    σ̂² = ‖y − Xβ̂‖² / n.
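As a check on the two-step maximization (an added sketch, with simulated data), the code below evaluates the profile log-likelihood on a grid of σ² values and confirms that it peaks near σ̂² = ‖y − Xβ̂‖²/n.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 40
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ beta_hat) ** 2)

    def profile_loglik(sigma2):
        # profile log-likelihood l(beta_hat, sigma2)
        return -0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2) - rss / (2 * sigma2)

    grid = np.linspace(0.1, 5.0, 2000)
    sigma2_grid_max = grid[np.argmax(profile_loglik(grid))]

    print(sigma2_grid_max, rss / n)   # grid maximizer is close to the MLE rss/n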
Estimation of σ²:

• An advantage of ML estimation in the classical linear model is that it provides a method for estimating all parameters in the model, β and σ².

• Least squares, by itself, only provides an estimator of β.

We've seen that the LS and ML estimators of β coincide. However, the MLE of σ² is not the usually preferred estimator of σ², and it is not the estimator of σ² that is typically combined with LS estimation of β.

Why not? Because σ̂² is biased.

That E(σ̂²) ≠ σ² can easily be established using our results for taking expected values of quadratic forms:
    E(σ̂²) = E{ (1/n) ‖P_{C(X)^⊥} y‖² }
          = (1/n) E{ (P_{C(X)^⊥} y)^T P_{C(X)^⊥} y }
          = (1/n) E( y^T P_{C(X)^⊥} y )
          = (1/n) { σ² dim(C(X)^⊥) + ‖P_{C(X)^⊥} Xβ‖² }   (the second term is 0, because Xβ ∈ C(X))
          = (σ²/n) dim(C(X)^⊥)
          = (σ²/n) {n − dim(C(X))}   (where dim(C(X)) = rank(X))
          = (σ²/n) {n − (k + 1)}.
Therefore, the MLE σ̂² is biased by a multiplicative factor of (n − k − 1)/n, and an alternative, unbiased estimator of σ² can easily be constructed as

    s² ≡ {n/(n − k − 1)} σ̂² = ‖y − Xβ̂‖² / (n − k − 1),

or, more generally (that is, for X not necessarily of full rank),

    s² = ‖y − Xβ̂‖² / {n − rank(X)}.

• s², rather than σ̂², is generally the preferred estimator of σ². In fact, it can be shown that in the spherical-errors linear model, s² is the best (minimum variance) estimator of σ² in the class of quadratic (in y) unbiased estimators.

• In the special case in which X = j_n (the model contains only an intercept, or constant term), so that C(X) = L(j_n), we get β̂ = ȳ, μ̂ = ȳ j_n, and rank(X) = 1. Therefore, s² becomes the usual sample variance from the one-sample problem:

    s² = ‖y − ȳ j_n‖² / (n − 1) = {1/(n − 1)} Σ_{i=1}^n (y_i − ȳ)².
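A short simulation sketch (added here, with simulated data and arbitrary parameter values) shows the downward bias of σ̂² = RSS/n and that s² = RSS/(n − k − 1) is unbiased on average over repeated samples.

    import numpy as np

    rng = np.random.default_rng(5)
    n, sigma2 = 15, 4.0
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # k + 1 = 3 columns
    beta = np.array([1.0, -1.0, 0.5])

    mle, unbiased = [], []
    for _ in range(20000):
        y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
        rss = np.sum((y - X @ np.linalg.solve(X.T @ X, X.T @ y)) ** 2)
        mle.append(rss / n)                      # sigma2_hat (MLE), biased by (n-k-1)/n
        unbiased.append(rss / (n - X.shape[1]))  # s^2

    print(np.mean(mle), sigma2 * (n - 3) / n)    # both near 3.2
    print(np.mean(unbiased))                     # near 4.0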
If e has a normal distribution, then by part 3 of the theorem on p. 85,

    ‖y − Xβ̂‖² / σ² ∼ χ²(n − rank(X)),

and, since the central χ²(m) distribution has mean m and variance 2m,

    s² = {σ² / (n − rank(X))} · ‖y − Xβ̂‖²/σ²,   with ‖y − Xβ̂‖²/σ² ∼ χ²(n − rank(X)),

implies

    E(s²) = {σ² / (n − rank(X))} {n − rank(X)} = σ²,

and

    var(s²) = [σ⁴ / {n − rank(X)}²] · 2{n − rank(X)} = 2σ⁴ / {n − rank(X)}.
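These moment formulas can also be checked by simulation under normal errors. The sketch below (an addition, with simulated data) compares the empirical variance of s² to 2σ⁴/(n − rank(X)).

    import numpy as np

    rng = np.random.default_rng(6)
    n, sigma2 = 20, 2.0
    X = np.column_stack([np.ones(n), rng.normal(size=n)])   # rank(X) = 2
    beta = np.array([0.5, 1.5])

    s2_vals = []
    for _ in range(50000):
        y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
        rss = np.sum((y - X @ np.linalg.solve(X.T @ X, X.T @ y)) ** 2)
        s2_vals.append(rss / (n - 2))

    print(np.var(s2_vals), 2 * sigma2 ** 2 / (n - 2))   # empirical vs theoretical 2*sigma^4/(n - rank(X))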
Our model is the classical linear model with normal errors:

    y = Xβ + e,   e ∼ N(0, σ² I_n).

We first need the concept of a complete sufficient statistic:

Sufficiency: Let y be a random vector with p.d.f. f(y; θ) depending on an unknown k × 1 parameter θ. Let T(y) be an r × 1 vector-valued statistic that is a function of y. Then T(y) is said to be a sufficient statistic for θ if and only if the conditional distribution of y given the value of T(y) does not depend upon θ.

• Heuristically, a sufficient statistic T captures all of the information in the data y relevant to θ. Once we know T, there's no more information in y about θ.

The property of completeness is needed as well, but it is somewhat technical. Briefly, it ensures that if a function of the sufficient statistic exists that is unbiased for the quantity being estimated, then it is unique.

Completeness: A vector-valued sufficient statistic T(y) is said to be complete if and only if E{h(T(y))} = 0 for all θ implies Pr{h(T(y)) = 0} = 1 for all θ.

Theorem: If T(y) is a complete sufficient statistic, then f(T(y)) is a minimum variance unbiased estimator of E{f(T(y))}.

Proof: This theorem is known as the Lehmann-Scheffé Theorem, and its proof follows easily from the Rao-Blackwell Theorem. See, e.g., Bickel and Doksum, p. 122, or Casella and Berger, p. 320.

In the linear model, the p.d.f. of y depends upon β and σ², so the pair (β, σ²) plays the role of θ.

Is there a complete sufficient statistic for (β, σ²) in the classical linear model?
Yes, by the following result:
Theorem: Let θ = (θ_1, ..., θ_r)^T and let y be a random vector with probability density function

    f(y) = c(θ) exp{ Σ_{i=1}^r θ_i T_i(y) } h(y).

Then T(y) = (T_1(y), ..., T_r(y))^T is a complete sufficient statistic, provided that neither θ nor T(y) satisfies any linear constraints.

• A density of this form is said to belong to the exponential family of distributions. For this family, which includes the normal distribution, it is easy to find a complete sufficient statistic.
Consider the classical linear model

    y = Xβ + e,   e ∼ N(0, σ² I_n).

The density of y can be written as

    f(y; β, σ²) = (2π)^{−n/2} (σ²)^{−n/2} exp{ −(y − Xβ)^T (y − Xβ) / (2σ²) }
                = c_1(σ²) exp{ −(y^T y − 2β^T X^T y + β^T X^T Xβ) / (2σ²) }
                = c_2(β, σ²) exp[ {−1/(2σ²)} y^T y + (σ^{−2} β^T)(X^T y) ].

If we reparameterize in terms of θ, where

    θ_1 = −1/(2σ²),   (θ_2, ..., θ_{k+2})^T = σ^{−2} β,

then this density can be seen to be of the exponential form, with vector-valued complete sufficient statistic

    T(y) = ( y^T y, (X^T y)^T )^T.
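As an illustration of sufficiency (an addition to the notes, with simulated data): both β̂ and s² can be computed from (y^T y, X^T y) alone, together with the fixed design matrix X, without revisiting the individual y_i.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 30
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

    # The complete sufficient statistic: (y^T y, X^T y)
    yTy = y @ y
    XTy = X.T @ y

    # beta_hat and s^2 reconstructed from the sufficient statistic (X is known and fixed)
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, XTy)
    rss = yTy - XTy @ beta_hat           # ||y - X beta_hat||^2 = y^T y - (X^T y)^T beta_hat
    s2 = rss / (n - np.linalg.matrix_rank(X))

    print(np.allclose(rss, np.sum((y - X @ beta_hat) ** 2)))   # True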
Generalized Least Squares

Up to now, we have assumed var(e) = σ²I in our linear model. There are two aspects to this assumption: (i) uncorrelatedness (var(e) is diagonal), and (ii) homoscedasticity (the diagonal elements of var(e) are all the same). Now we relax these assumptions simultaneously by considering a more general variance-covariance structure. We now consider the linear model

    y = Xβ + e,   where E(e) = 0, var(e) = σ² V,

where X is full rank as before, and where V is a known positive definite matrix.

• Since V is assumed known, there is only one unknown variance-covariance parameter to be estimated, σ².

• The case in which V is unknown complicates things substantially, so we postpone discussion of this case. V unknown can be handled via ML estimation, and we'll talk about that later. Of course, V unknown is the typical scenario in practice, but there are cases when V would be known.

• Example: consider a simple linear regression model with uncorrelated, but heteroscedastic, errors:

    y_i = β_0 + β_1 x_i + e_i,

where the e_i's are independent, each with mean 0, and var(e_i) = σ² x_i. In this case, var(e) = σ² V, where V = diag(x_1, ..., x_n), a known matrix of constants.
Estimation of β and σ² when var(e) = σ² V:

A nice feature of the model

    y = Xβ + e, where var(e) = σ² V,   (1)

is that, although it is not a Gauss-Markov (spherical-errors) model, it is simple to transform it into a Gauss-Markov model. This allows us to apply what we've learned about the spherical-errors case to obtain methods and results for the non-spherical case.

Since V is known and positive definite, it is possible to find a matrix Q such that V = QQ^T (e.g., Q^T could be the Cholesky factor of V).

Multiplying both sides of the model equation in (1) by the known matrix Q^{−1}, it follows that the following transformed model holds as well:

    Q^{−1} y = Q^{−1} Xβ + Q^{−1} e,

or

    ỹ = X̃β + ẽ, where var(ẽ) = σ² I,   (2)

where ỹ = Q^{−1} y, X̃ = Q^{−1} X, and ẽ = Q^{−1} e. Here

    E(ẽ) = Q^{−1} E(e) = Q^{−1} 0 = 0

and

    var(ẽ) = Q^{−1} var(e) (Q^{−1})^T = σ² Q^{−1} V (Q^{−1})^T = σ² Q^{−1} QQ^T (Q^{−1})^T = σ² I.
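A sketch of this transformation in practice (an addition, using the heteroscedastic example V = diag(x_1, ..., x_n) from above with simulated data): form Q from a Cholesky factorization of V, transform, and apply ordinary least squares to the transformed model. Substituting X̃ = Q^{−1}X and ỹ = Q^{−1}y into (X̃^T X̃)^{−1} X̃^T ỹ shows this equals (X^T V^{−1} X)^{−1} X^T V^{−1} y, which the code verifies numerically.

    import numpy as np

    rng = np.random.default_rng(8)
    n = 40
    x = rng.uniform(1.0, 5.0, size=n)           # positive x_i, so V = diag(x) is positive definite
    X = np.column_stack([np.ones(n), x])
    beta = np.array([2.0, 1.0])
    e = rng.normal(scale=np.sqrt(0.5 * x))      # var(e_i) = sigma^2 * x_i with sigma^2 = 0.5
    y = X @ beta + e

    V = np.diag(x)
    Q = np.linalg.cholesky(V)                   # V = Q Q^T
    Qinv = np.linalg.inv(Q)

    # Transformed (Gauss-Markov) model: y_tilde = X_tilde beta + e_tilde, var(e_tilde) = sigma^2 I
    y_t, X_t = Qinv @ y, Qinv @ X
    beta_gls = np.linalg.solve(X_t.T @ X_t, X_t.T @ y_t)

    # The same estimator written directly in terms of V^{-1}
    Vinv = np.linalg.inv(V)
    beta_direct = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

    print(np.allclose(beta_gls, beta_direct))   # True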