These notes in English will closely follow Mathematische Statistik, by H.R. Künsch (2005), but are as yet incomplete. Mathematische Statistik can be used as supplementary reading material in German.
Mathematical rigor and clarity often bite each other. In some places, not all subtleties are fully presented. A snake will indicate this.
Statistics is about the mathematical modeling of observable phenomena, using stochastic models, and about analyzing data: estimating parameters of the model and testing hypotheses. In these notes, we study various estimation and testing procedures. We consider their theoretical properties and we investigate various notions of optimality.
1.1 Some notation and model assumptions
The data consist of measurements (observations) $x_1, \ldots, x_n$, which are regarded as realizations of random variables $X_1, \ldots, X_n$. In most of the notes, the $X_i$ are real-valued: $X_i \in \mathbb{R}$ (for $i = 1, \ldots, n$), although we will also consider some extensions to vector-valued observations.
Example 1.1.1 Fizeau and Foucault developed methods for estimating the speed of light (1849, 1850), which were later improved by Newcomb and Michelson. The main idea is to pass light from a rapidly rotating mirror to a fixed mirror and back to the rotating mirror. An estimate of the velocity of light is obtained, taking into account the speed of the rotating mirror, the distance travelled, and the displacement of the light as it returns to the rotating mirror.
[Fig. 1: sketch of the rotating-mirror setup for measuring the speed of light.]
The data are Newcomb’s measurements of the passage time it took light to travel from his lab, to a mirror on the Washington Monument, and back to his lab.
distance: 7.44373 km.
66 measurements on 3 consecutive days
first measurement: 0.000024828 seconds = 24828 nanoseconds
The dataset has the deviations from 24800 nanoseconds.
The measurements on 3 different days:
[Figure: the deviations X plotted against observation number t, in three panels: day 1, day 2, day 3.]
All measurements in one plot:
[Figure: all 66 measurements X plotted against observation number t in a single plot.]
The class $\mathcal{F}_0$ is for example modeled as the class of all symmetric distributions, that is,
$$\mathcal{F}_0 := \{F_0 : F_0(x) = 1 - F_0(-x)\ \forall\, x\}. \qquad (1.2)$$
This is an infinite-dimensional collection: it is not parametrized by a finite-dimensional parameter. We then call $F_0$ an infinite-dimensional parameter.
A finite-dimensional model is for example
$$\mathcal{F}_0 := \{\Phi(\cdot/\sigma) : \sigma > 0\}, \qquad (1.3)$$
where $\Phi$ is the standard normal distribution function.
Thus, the location model is
$$X_i = \mu + \epsilon_i, \quad i = 1, \ldots, n,$$
with $\epsilon_1, \ldots, \epsilon_n$ i.i.d. and, under model (1.2), symmetrically but otherwise unknown distributed and, under model (1.3), $N(0, \sigma^2)$-distributed with unknown variance $\sigma^2$.
1.2 Estimation
A parameter is an aspect of the unknown distribution. An estimator T is some given function T (X) of the observations X. The estimator is constructed to estimate some unknown parameter, γ say.
In Example 1.1.2, one may consider the following estimators $\hat\mu$ of $\mu$:
$$\hat\mu_1 := \frac{1}{n} \sum_{i=1}^n X_i.$$
Note that $\hat\mu_1$ minimizes the squared loss
$$\sum_{i=1}^n (X_i - \mu)^2.$$
It can be shown that $\hat\mu_1$ is a “good” estimator if the model (1.3) holds. When (1.3) is not true, in particular when there are outliers (large, “wrong” observations) (Ausreisser), then one has to apply a more robust estimator.
$$\hat\mu_2 := \begin{cases} X_{((n+1)/2)} & \text{when } n \text{ is odd}, \\ \{X_{(n/2)} + X_{(n/2+1)}\}/2 & \text{when } n \text{ is even}, \end{cases}$$
where $X_{(1)} \le \cdots \le X_{(n)}$ are the order statistics. Note that $\hat\mu_2$ is a minimizer of the absolute loss
$$\sum_{i=1}^n |X_i - \mu|.$$
$$\hat\mu_3 := \arg\min_\mu \sum_{i=1}^n \rho(X_i - \mu), \qquad (1.4)$$
where
$$\rho(x) = \begin{cases} x^2 & \text{if } |x| \le k, \\ k(2|x| - k) & \text{if } |x| > k, \end{cases}$$
with $k > 0$ some given threshold (this $\rho$ is the Huber loss).
$$\hat\mu_4 := \frac{1}{n - 2[n\alpha]} \sum_{i=[n\alpha]+1}^{n - [n\alpha]} X_{(i)},$$
the $\alpha$-trimmed mean.
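As a concrete sketch (our own illustration, assuming numpy and scipy are available; the threshold $k = 1.345$ and trimming fraction $\alpha = 0.1$ are illustrative defaults, not values from the text), the four estimators can be computed as follows:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mu_hat_1(x):
    """Sample mean: the minimizer of the squared loss."""
    return np.mean(x)

def mu_hat_2(x):
    """Sample median: a minimizer of the absolute loss."""
    return np.median(x)

def mu_hat_3(x, k=1.345):
    """M-estimator (1.4): rho is quadratic near 0, linear in the tails."""
    x = np.asarray(x, float)
    def rho(u):
        return np.where(np.abs(u) <= k, u ** 2, k * (2 * np.abs(u) - k))
    # One-dimensional minimization of sum_i rho(X_i - mu) over mu.
    return minimize_scalar(lambda m: np.sum(rho(x - m))).x

def mu_hat_4(x, alpha=0.1):
    """alpha-trimmed mean: average X_([n*alpha]+1), ..., X_(n-[n*alpha])."""
    xs, n = np.sort(x), len(x)
    m = int(n * alpha)
    return np.mean(xs[m:n - m])
```

On a sample containing outliers, $\hat\mu_1$ is dragged along with them, whereas $\hat\mu_2$, $\hat\mu_3$ and $\hat\mu_4$ remain stable.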
Note. To avoid misunderstanding, we note that e.g. in (1.4), $\mu$ is used as the variable over which one minimizes, whereas in (1.1), $\mu$ is a parameter. These are actually distinct concepts, but it is a general convention to abuse notation and employ the same symbol $\mu$. When further developing the theory (see Chapter 6) we shall often introduce a new symbol for the variable; e.g., (1.4) is then written as
$$\hat\mu_3 := \arg\min_c \sum_{i=1}^n \rho(X_i - c).$$
An example of a nonparametric estimator is the empirical distribution function
$$\hat F_n(\cdot) := \frac{1}{n} \#\{X_i \le \cdot,\ 1 \le i \le n\}.$$
This is an estimator of the theoretical distribution function
$$F(\cdot) := P(X \le \cdot).$$
Any reasonable estimator is constructed according to the so-called plug-in principle (Einsetzprinzip). That is, the parameter of interest $\gamma$ is written as $\gamma = Q(F)$, with $Q$ some given map. The empirical distribution $\hat F_n$ is then “plugged in”, to obtain the estimator $T := Q(\hat F_n)$. (We note however that problems can arise: e.g., $Q(\hat F_n)$ may not be well-defined ....)
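A minimal sketch of the plug-in principle (our own illustration; the helper names are ours): compute $\hat F_n$ and plug it into the median map $Q_2$.

```python
import numpy as np

def ecdf(x):
    """Return the empirical distribution function F_n-hat of the sample x."""
    xs, n = np.sort(np.asarray(x, float)), len(x)
    # F_n-hat(t) = (number of X_i <= t) / n
    return lambda t: np.searchsorted(xs, t, side="right") / n

def plug_in_median(x):
    """Q2 plugged in: the smallest x with F_n-hat(x) >= 1/2."""
    xs, n = np.sort(np.asarray(x, float)), len(x)
    return xs[int(np.ceil(n / 2)) - 1]
```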
Examples are the above estimators $\hat\mu_1, \ldots, \hat\mu_4$ of the location parameter $\mu$. We define the maps
$$Q_1(F) := \int x \, dF(x)$$
(the mean, or center of gravity, of $F$),
$$Q_2(F) := F^{-1}(1/2)$$
(the median of $F$), and
$$Q_3(F) := \arg\min_\mu \int \rho(x - \mu) \, dF(x).$$
Breakdown point. Let, for $m \le n$,
$$\epsilon(m) := \sup_{x_1^*, \ldots, x_m^*} |\hat\mu(x_1^*, \ldots, x_m^*, X_{m+1}, \ldots, X_n)|.$$
If $\epsilon(m) = \infty$, we say that with $m$ outliers the estimator can break down. The breakdown point is defined as
$$\epsilon^* := \min\{m : \epsilon(m) = \infty\}/n.$$
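A quick numerical illustration (not a proof; the sample below is simulated): replace $m$ observations by gross outliers and watch whether the estimate stays bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
for m in (1, 5, 10):
    x_bad = x.copy()
    x_bad[:m] = 1e12                     # m gross outliers
    print(m, np.mean(x_bad), np.median(x_bad))
# The mean explodes already for m = 1 (breakdown point 1/n), while the
# median stays bounded until about half the sample is corrupted.
```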
1.5 Confidence intervals
Consider the location model (Example 1.1.2).
Definition. A subset $I = I(X) \subset \mathbb{R}$, depending (only) on the data $X = (X_1, \ldots, X_n)$, is called a confidence set (Vertrauensbereich) for $\mu$, at level $1 - \alpha$, if
$$\mathbb{P}_{\mu, F_0}(\mu \in I) \ge 1 - \alpha, \quad \forall\ \mu \in \mathbb{R},\ F_0 \in \mathcal{F}_0.$$
A confidence interval is of the form
$$I := [\underline{\mu}, \bar{\mu}],$$
where the boundaries $\underline{\mu} = \underline{\mu}(X)$ and $\bar{\mu} = \bar{\mu}(X)$ depend (only) on the data $X$.
Let, for each $\mu_0 \in \mathbb{R}$, $\phi(X, \mu_0) \in \{0, 1\}$ be a test at level $\alpha$ for the hypothesis
$$H_{\mu_0}: \mu = \mu_0.$$
Thus, we reject $H_{\mu_0}$ if and only if $\phi(X, \mu_0) = 1$, and
$$\mathbb{P}_{\mu_0, F_0}(\phi(X, \mu_0) = 1) \le \alpha.$$
Then
$$I(X) := \{\mu : \phi(X, \mu) = 0\}$$
is a $(1 - \alpha)$-confidence set for $\mu$.
Conversely, if $I(X)$ is a $(1 - \alpha)$-confidence set for $\mu$, then, for all $\mu_0$, the test $\phi(X, \mu_0)$ defined as
$$\phi(X, \mu_0) = \begin{cases} 1 & \text{if } \mu_0 \notin I(X), \\ 0 & \text{else}, \end{cases}$$
is a test at level $\alpha$ of $H_{\mu_0}$.
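In the normal location model (1.3) this duality can be made concrete: inverting the two-sided one-sample t-test gives the usual t-interval. A sketch (scipy assumed; the function names are ours):

```python
import numpy as np
from scipy import stats

def phi(x, mu0, alpha=0.05):
    """phi(X, mu0) = 1 iff H_{mu0}: mu = mu0 is rejected at level alpha."""
    n = len(x)
    t = (np.mean(x) - mu0) / (np.std(x, ddof=1) / np.sqrt(n))
    return int(abs(t) > stats.t.ppf(1 - alpha / 2, df=n - 1))

def confidence_interval(x, alpha=0.05):
    """I(X) = {mu : phi(X, mu) = 0}, available here in closed form."""
    n = len(x)
    half = stats.t.ppf(1 - alpha / 2, df=n - 1) * np.std(x, ddof=1) / np.sqrt(n)
    return np.mean(x) - half, np.mean(x) + half
```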
1.6 Intermezzo: quantile functions
Let $F$ be a distribution function. Then $F$ is càdlàg (continue à droite, limite à gauche). Define the quantile functions
$$q_F^+(u) := \sup\{x : F(x) \le u\}$$
and
$$q_F^-(u) := \inf\{x : F(x) \ge u\} =: F^{-1}(u).$$
It holds that $F(q_F^-(u)) \ge u$ and, for all $h > 0$, $F(q_F^+(u) - h) \le u$. Hence
$$F(q_F^+(u)-) := \lim_{h \downarrow 0} F(q_F^+(u) - h) \le u.$$
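For a step function $F$, for instance the ECDF of a sorted sample, the two quantile functions can be written down directly; a sketch for $u \in (0, 1)$ (our own illustration):

```python
import numpy as np

def q_minus(xs, u):
    """q^-_F(u) = inf{x : F(x) >= u} = F^{-1}(u), F the ECDF of sorted xs."""
    return xs[int(np.ceil(len(xs) * u)) - 1]

def q_plus(xs, u):
    """q^+_F(u) = sup{x : F(x) <= u}, F the ECDF of sorted xs."""
    # F jumps by 1/n at each order statistic, so F(x) <= u holds for all
    # x < X_(floor(n*u)+1); the supremum is that order statistic.
    return xs[min(int(np.floor(len(xs) * u)), len(xs) - 1)]

xs = np.array([1.0, 2.0, 3.0, 4.0])
assert q_minus(xs, 0.5) == 2.0 and q_plus(xs, 0.5) == 3.0
```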
1.7 How to construct tests and confidence sets
Consider a model class $\mathcal{P} := \{P_\theta : \theta \in \Theta\}$. Moreover, consider a space $\Gamma$ and a map
$$g: \Theta \to \Gamma, \quad \gamma := g(\theta).$$
We think of $\gamma$ as the parameter of interest (as in the plug-in principle, with $\gamma = Q(P_\theta) = g(\theta)$).
For instance, in Example 1.1.2, the parameter space is $\Theta := \{\theta = (\mu, F_0) : \mu \in \mathbb{R},\ F_0 \in \mathcal{F}_0\}$, and, when $\mu$ is the parameter of interest, $g(\mu, F_0) = \mu$.
To test
$$H_{\gamma_0}: \gamma = \gamma_0,$$
we look for a pivot (Tür-Angel). This is a function $Z(X, \gamma)$ depending on the data $X$ and on the parameter $\gamma$, such that for all $\theta \in \Theta$, the distribution
$$\mathbb{P}_\theta(Z(X, g(\theta)) \le \cdot) =: G(\cdot)$$
does not depend on $\theta$. We note that to find a pivot is unfortunately not always possible. However, if we do have a pivot $Z(X, \gamma)$ with distribution $G$, we can compute its quantile functions
$$q_L := q_G^+\!\left(\frac{\alpha}{2}\right), \quad q_R := q_G^-\!\left(1 - \frac{\alpha}{2}\right),$$
and the test
$$\phi(X, \gamma_0) := \begin{cases} 1 & \text{if } Z(X, \gamma_0) \notin [q_L, q_R], \\ 0 & \text{else}, \end{cases}$$
has level $\alpha$. When an exact pivot is not available, one may instead use an asymptotic pivot: a function $Z_n(X, \gamma)$ whose distribution under $\mathbb{P}_\theta$ converges, for every $\theta$, to a fixed limit $G$. For example, in the location model with finite variance, the studentized mean $Z_n(X, \mu) := \sqrt{n}(\hat\mu_1 - \mu)/S_n$, with $S_n^2$ the sample variance, is an asymptotic pivot, with limiting distribution $G = \Phi$.
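A small simulation (our own illustration) of the asymptotic pivot property of the studentized mean: its distribution is close to $N(0, 1)$ even for non-normal errors, whatever $\mu$ is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, mu, B = 50, 3.7, 2000
z = np.empty(B)
for b in range(B):
    x = mu + rng.exponential(size=n) - 1.0   # non-normal, centered errors
    z[b] = np.sqrt(n) * (np.mean(x) - mu) / np.std(x, ddof=1)
print(stats.kstest(z, "norm"))   # closeness to N(0,1) improves with n
```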
Comparison of confidence intervals and tests. When comparing confidence intervals, the aim is usually to take the one with smallest length on average (keeping the level at $1 - \alpha$). In the case of tests, we look for the one with maximal power. In the location model, this leads to studying the expected length
$$\mathbb{E}_{\mu, F_0} |\bar\mu(X) - \underline{\mu}(X)|$$
for $(1 - \alpha)$-confidence sets $[\underline{\mu}, \bar\mu]$, or to studying the power of the test $\phi(X, \mu_0)$ at level $\alpha$. Recall that the power is $\mathbb{P}_{\mu, F_0}(\phi(X, \mu_0) = 1)$ for values $\mu \ne \mu_0$.
1.8 An illustration: the two-sample problem
Consider the following data, concerning weight gain/loss. The control group x had their usual diet, and the treatment group y obtained a special diet, designed for preventing weight gain. The study was carried out to test whether the diet works.
[Table 2: the observations $x$ (control group) and $y$ (treatment group), together with their ranks rank($x$) and rank($y$) in the pooled sample.]
Let $n$ ($m$) be the sample size of the control group $x$ (treatment group $y$). The mean in group $x$ ($y$) is denoted by $\bar x$ ($\bar y$). The sums of squares are
$$SS_x := \sum_{i=1}^n (x_i - \bar x)^2, \quad SS_y := \sum_{j=1}^m (y_j - \bar y)^2.$$
So in this study, one has $n = m = 5$ and the values $\bar x = 6.4$, $\bar y = 0$, $SS_x = 161.2$ and $SS_y = 114$. The ranks, rank($x$) and rank($y$), are the rank-numbers when putting all $n + m$ data together (e.g., $y_3 = -6$ is the smallest observation and hence rank($y_3$) = 1).
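The summary statistics and pooled ranks can be computed as follows (a sketch; the entries of Table 2 are not reproduced here, so plug in the actual $x$ and $y$):

```python
import numpy as np
from scipy.stats import rankdata

def two_sample_summary(x, y):
    """Group means, sums of squares, and ranks in the pooled sample."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    ss_x = np.sum((x - x.mean()) ** 2)
    ss_y = np.sum((y - y.mean()) ** 2)
    ranks = rankdata(np.concatenate([x, y]))   # ranks among all n + m values
    return x.mean(), y.mean(), ss_x, ss_y, ranks[:len(x)], ranks[len(x):]
```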
We assume that the data are realizations of two independent samples, say $X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_m)$, where $X_1, \ldots, X_n$ are i.i.d. with distribution function $F_X$, and $Y_1, \ldots, Y_m$ are i.i.d. with distribution function $F_Y$. The distribution functions $F_X$ and $F_Y$ may be in whole or in part unknown. The testing problem is: $H_0: F_X = F_Y$ against a one- or two-sided alternative.
The classical two-sample Student test is based on the assumption that the data come from a normal distribution. Moreover, it is assumed that the variances of $F_X$ and $F_Y$ are equal. Thus,
$$(F_X, F_Y) \in \left\{ \left( \Phi\!\left(\frac{\cdot - \mu}{\sigma}\right),\ \Phi\!\left(\frac{\cdot - (\mu + \gamma)}{\sigma}\right) \right) : \mu \in \mathbb{R},\ \sigma > 0,\ \gamma \in \Gamma \right\}.$$
Here, $\Gamma \supset \{0\}$ is the range of shifts in mean one considers, e.g. $\Gamma = \mathbb{R}$ for two-sided situations, and $\Gamma = (-\infty, 0]$ for a one-sided situation. The testing problem reduces to $H_0: \gamma = 0$.
We now look for a pivot $Z(X, Y, \gamma)$. Define the sample means
$$\bar X := \frac{1}{n} \sum_{i=1}^n X_i, \quad \bar Y := \frac{1}{m} \sum_{j=1}^m Y_j,$$
and the pooled sample variance
$$S^2 := \frac{1}{m + n - 2} \left\{ \sum_{i=1}^n (X_i - \bar X)^2 + \sum_{j=1}^m (Y_j - \bar Y)^2 \right\}.$$
Note that $\bar X$ has expectation $\mu$ and variance $\sigma^2/n$, and $\bar Y$ has expectation $\mu + \gamma$ and variance $\sigma^2/m$. So $\bar Y - \bar X$ has expectation $\gamma$ and variance
$$\frac{\sigma^2}{n} + \frac{\sigma^2}{m} = \sigma^2 \left( \frac{n + m}{nm} \right).$$
The normality assumption implies that
$$\bar Y - \bar X \ \text{ is } \ N\!\left(\gamma,\ \sigma^2 \left( \frac{n + m}{nm} \right)\right)\text{-distributed}.$$
Hence
$$\sqrt{\frac{nm}{n + m}} \left( \frac{\bar Y - \bar X - \gamma}{\sigma} \right) \ \text{ is } \ N(0, 1)\text{-distributed}.$$
To arrive at a pivot, we now plug in the estimate $S$ for the unknown $\sigma$:
$$Z(X, Y, \gamma) := \sqrt{\frac{nm}{n + m}} \left( \frac{\bar Y - \bar X - \gamma}{S} \right).$$
Indeed, $Z(X, Y, \gamma)$ has a distribution $G$ which does not depend on unknown parameters. The distribution $G$ is Student($n + m - 2$) (the Student distribution with $n + m - 2$ degrees of freedom). As test statistic for $H_0: \gamma = 0$, we therefore take $T = T^{\rm Student} := Z(X, Y, 0)$.
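A direct transcription of the pivot and the resulting level-$\alpha$ test (two-sided case; the function names are ours):

```python
import numpy as np
from scipy import stats

def z_pivot(x, y, gamma=0.0):
    """Z(X, Y, gamma) = sqrt(nm/(n+m)) (Ybar - Xbar - gamma) / S."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, m = len(x), len(y)
    s2 = (np.sum((x - x.mean()) ** 2) + np.sum((y - y.mean()) ** 2)) / (m + n - 2)
    return np.sqrt(n * m / (n + m)) * (y.mean() - x.mean() - gamma) / np.sqrt(s2)

def student_test(x, y, alpha=0.05):
    """Reject H0: gamma = 0 iff |Z(X, Y, 0)| exceeds the Student(n+m-2) quantile."""
    q = stats.t.ppf(1 - alpha / 2, df=len(x) + len(y) - 2)
    return int(abs(z_pivot(x, y, 0.0)) > q)
```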
An alternative is Wilcoxon's test, which is based only on the ranks: let $R_i$ denote the rank of $X_i$ in the pooled sample of size $N := n + m$, and take as test statistic the rank sum $T = T^{\rm Wilcoxon} := \sum_{i=1}^n R_i$. Large values of $T$ mean that the $X_i$ are generally larger than the $Y_j$, and hence indicate evidence against $H_0$.
To check whether or not the observed value of the test statistic is compatible with the null-hypothesis, we need to know its null-distribution, that is, the distribution under $H_0$. Under $H_0: F_X = F_Y$, the vector of ranks $(R_1, \ldots, R_n)$ has the same distribution as $n$ random draws without replacement from the numbers $\{1, \ldots, N\}$. That is, if we let
$$r := (r_1, \ldots, r_n, r_{n+1}, \ldots, r_N)$$
denote a permutation of $\{1, \ldots, N\}$, then
$$\mathbb{P}\big((R_1, \ldots, R_n, R_{n+1}, \ldots, R_N) = r\big) = \frac{1}{N!}$$
(see Theorem 1.8.1), and hence
$$\mathbb{P}_{H_0}(T = t) = \frac{\#\{r : \sum_{i=1}^n r_i = t\}}{N!}.$$
This can also be written as
$$\mathbb{P}_{H_0}(T = t) = \frac{1}{\binom{N}{n}}\ \#\Big\{ r_1 < \cdots < r_n,\ r_{n+1} < \cdots < r_N : \sum_{i=1}^n r_i = t \Big\}.$$
So clearly, the null-distribution of T does not depend on FX or FY. It does however depend on the sample sizes n and m. It is tabulated for n and m small or moderately large. For large n and m, a normal approximation of the null-distribution can be used.
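The exact null-distribution can be enumerated directly from the last display, since each $n$-subset of $\{1, \ldots, N\}$ is equally likely to be the set of $x$-ranks. A sketch for small $n$ and $m$:

```python
from itertools import combinations

def wilcoxon_null(n, m):
    """P_{H0}(T = t) for the rank sum T, enumerating all C(N, n) rank sets."""
    N = n + m
    counts = {}
    for ranks in combinations(range(1, N + 1), n):
        t = sum(ranks)
        counts[t] = counts.get(t, 0) + 1
    total = sum(counts.values())            # equals binom(N, n)
    return {t: c / total for t, c in sorted(counts.items())}

# Example: n = m = 5 as in the diet study.
null = wilcoxon_null(5, 5)
```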
Theorem 1.8.1 formally derives the null-distribution of the test, and actually proves that the order statistics and the ranks are independent. The latter result will be of interest in Example 2.10.4.
For two random variables $X$ and $Y$, we use the notation $X \overset{\mathcal{D}}{=} Y$ when $X$ and $Y$ have the same distribution.
Theorem 1.8.1 Let $Z_1, \ldots, Z_N$ be i.i.d. with continuous distribution $F$ on $\mathbb{R}$. Then $(Z_{(1)}, \ldots, Z_{(N)})$ and $R := (R_1, \ldots, R_N)$ are independent, and for all permutations $r := (r_1, \ldots, r_N)$,
$$\mathbb{P}(R = r) = \frac{1}{N!}.$$
Proof. Let $Z_{Q_i} := Z_{(i)}$, and $Q := (Q_1, \ldots, Q_N)$. Then
$$R = r \iff Q = r^{-1} := q,$$
where $r^{-1}$ is the inverse permutation of $r$.$^1$ For all permutations $q$ and all measurable maps $f$,
$$f(Z_1, \ldots, Z_N) \overset{\mathcal{D}}{=} f(Z_{q_1}, \ldots, Z_{q_N}).$$
Therefore, for all measurable sets $A \subset \mathbb{R}^N$ and all permutations $q$,
$$\mathbb{P}\big((Z_1, \ldots, Z_N) \in A,\ Z_1 < \cdots < Z_N\big) = \mathbb{P}\big((Z_{q_1}, \ldots, Z_{q_N}) \in A,\ Z_{q_1} < \cdots < Z_{q_N}\big).$$
Because there are $N!$ permutations, we see that for any $q$,
$$\mathbb{P}\big((Z_{(1)}, \ldots, Z_{(N)}) \in A\big) = N!\ \mathbb{P}\big((Z_{q_1}, \ldots, Z_{q_N}) \in A,\ Z_{q_1} < \cdots < Z_{q_N}\big) = N!\ \mathbb{P}\big((Z_{(1)}, \ldots, Z_{(N)}) \in A,\ R = r\big),$$
where $r = q^{-1}$. Thus we have shown that for all measurable $A$, and for all $r$,
$$\mathbb{P}\big((Z_{(1)}, \ldots, Z_{(N)}) \in A,\ R = r\big) = \frac{1}{N!}\ \mathbb{P}\big((Z_{(1)}, \ldots, Z_{(N)}) \in A\big). \qquad (1.5)$$
Take $A = \mathbb{R}^N$ to find that (1.5) implies
$$\mathbb{P}(R = r) = \frac{1}{N!}.$$
Plug this back into (1.5) to see that we have the product structure
$$\mathbb{P}\big((Z_{(1)}, \ldots, Z_{(N)}) \in A,\ R = r\big) = \mathbb{P}\big((Z_{(1)}, \ldots, Z_{(N)}) \in A\big)\ \mathbb{P}(R = r),$$
which holds for all measurable $A$. In other words, $(Z_{(1)}, \ldots, Z_{(N)})$ and $R$ are independent. $\sqcup\!\sqcap$
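A quick Monte Carlo check of the theorem, as a sketch: for $N = 3$, each of the $3! = 6$ rank vectors should occur with frequency close to $1/6$.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(2)
N, B = 3, 60000
counts = {p: 0 for p in permutations(range(1, N + 1))}
for _ in range(B):
    z = rng.normal(size=N)
    r = tuple(int(k) for k in z.argsort().argsort() + 1)   # rank vector of z
    counts[r] += 1
print({p: round(c / B, 3) for p, c in counts.items()})     # all near 1/6
```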
Because Wilcoxon's test is only based on the ranks, and does not rely on the assumption of normality, one may expect that, when the data are in fact normally distributed, Wilcoxon's test has less power than Student's test. The loss

$^1$ Here is an example, with $N = 3$: $(z_1, z_2, z_3) = (5, 6, 4)$, so that $(r_1, r_2, r_3) = (2, 3, 1)$ and $(q_1, q_2, q_3) = (3, 1, 2)$.