
Mathematical Statistics

Sara van de Geer

September 2010

  • 1 Introduction
    • 1.1 Some notation and model assumptions
    • 1.2 Estimation
    • 1.3 Comparison of estimators: risk functions
    • 1.4 Comparison of estimators: sensitivity
    • 1.5 Confidence intervals
      • 1.5.1 Equivalence confidence sets and tests
    • 1.6 Intermezzo: quantile functions
    • 1.7 How to construct tests and confidence sets
    • 1.8 An illustration: the two-sample problem
      • 1.8.1 Assuming normality
      • 1.8.2 A nonparametric test
      • 1.8.3 Comparison of Student’s test and Wilcoxon’s test
    • 1.9 How to construct estimators
      • 1.9.1 Plug-in estimators
      • 1.9.2 The method of moments
      • 1.9.3 Likelihood methods
  • 2 Decision theory
    • 2.1 Decisions and their risk
    • 2.2 Admissibility
    • 2.3 Minimaxity
    • 2.4 Bayes decisions
    • 2.5 Intermezzo: conditional distributions
    • 2.6 Bayes methods
    • 2.7 Discussion of Bayesian approach (to be written)
    • 2.8 Integrating parameters out (to be written)
    • 2.9 Intermezzo: some distribution theory
      • 2.9.1 The multinomial distribution
      • 2.9.2 The Poisson distribution
      • 2.9.3 The distribution of the maximum of two random variables
    • 2.10 Sufficiency
      • 2.10.1 Rao-Blackwell
      • 2.10.2 Factorization Theorem of Neyman
      • 2.10.3 Exponential families
      • 2.10.4 Canonical form of an exponential family
      • 2.10.5 Minimal sufficiency
  • 3 Unbiased estimators
    • 3.1 What is an unbiased estimator?
    • 3.2 UMVU estimators
      • 3.2.1 Complete statistics
    • 3.3 The Cramér-Rao lower bound
    • 3.4 Higher-dimensional extensions
    • 3.5 Uniformly most powerful tests
      • 3.5.1 An example
      • 3.5.2 UMP tests and exponential families
      • 3.5.3 Unbiased tests
      • 3.5.4 Conditional tests
  • 4 Equivariant statistics
    • 4.1 Equivariance in the location model
    • 4.2 Equivariance in the location-scale model (to be written)
  • 5 Proving admissibility and minimaxity
    • 5.1 Minimaxity
    • 5.2 Admissibility
    • 5.3 Inadmissibility in higher-dimensional settings (to be written)
  • 6 Asymptotic theory
    • 6.1 Types of convergence
      • 6.1.1 Stochastic order symbols
      • 6.1.2 Some implications of convergence
    • 6.2 Consistency and asymptotic normality
      • 6.2.1 Asymptotic linearity
      • 6.2.2 The δ-technique
    • 6.3 M-estimators
      • 6.3.1 Consistency of M-estimators
      • 6.3.2 Asymptotic normality of M-estimators
    • 6.4 Plug-in estimators
      • 6.4.1 Consistency of plug-in estimators
      • 6.4.2 Asymptotic normality of plug-in estimators
    • 6.5 Asymptotic relative efficiency
    • 6.6 Asymptotic Cramér-Rao lower bound
      • 6.6.1 Le Cam’s 3rd Lemma
    • 6.7 Asymptotic confidence intervals and tests
      • 6.7.1 Maximum likelihood
      • 6.7.2 Likelihood ratio tests
    • 6.8 Complexity regularization (to be written)
  • 7 Literature


These notes in English will closely follow Mathematische Statistik, by H.R. Künsch (2005), but are as yet incomplete. Mathematische Statistik can be used as supplementary reading material in German.

Mathematical rigor and clarity are often at odds. In some places, not all subtleties are fully presented. A snake will indicate this.

Chapter 1

Introduction

Statistics is about the mathematical modeling of observable phenomena, using stochastic models, and about analyzing data: estimating parameters of the model and testing hypotheses. In these notes, we study various estimation and testing procedures. We consider their theoretical properties and we investigate various notions of optimality.

1.1 Some notation and model assumptions

The data consist of measurements (observations) $x_1, \ldots, x_n$, which are regarded as realizations of random variables $X_1, \ldots, X_n$. In most of the notes, the $X_i$ are real-valued: $X_i \in \mathbb{R}$ (for $i = 1, \ldots, n$), although we will also consider some extensions to vector-valued observations.

Example 1.1.1 Fizeau and Foucault developed methods for estimating the speed of light (1849, 1850), which were later improved by Newcomb and Michelson. The main idea is to pass light from a rapidly rotating mirror to a fixed mirror and back to the rotating mirror. An estimate of the velocity of light is obtained, taking into account the speed of the rotating mirror, the distance travelled, and the displacement of the light as it returns to the rotating mirror.

Fig. 1

The data are Newcomb's measurements of the passage time it took light to travel from his lab to a mirror on the Washington Monument and back.


distance: 7.44373 km

66 measurements on 3 consecutive days

first measurement: 0.000024828 seconds = 24828 nanoseconds

The dataset records the deviations from 24800 nanoseconds.

The measurements on 3 different days:

[Three scatter plots of the deviations $X$ against observation number $t$, one panel per day (day 1, day 2, day 3).]

All measurements in one plot:

[Scatter plot of all 66 deviations $X$ against observation number $t$.]


The class $\mathcal{F}_0$ is for example modeled as the class of all symmetric distributions, that is,
$$\mathcal{F}_0 := \{F_0 : F_0(x) = 1 - F_0(-x)\ \forall x\}. \qquad (1.2)$$

This is an infinite-dimensional collection: it is not parametrized by a finite-dimensional parameter. We then call $F_0$ an infinite-dimensional parameter.

A finite-dimensional model is for example

$$\mathcal{F}_0 := \{\Phi(\cdot/\sigma) : \sigma > 0\}, \qquad (1.3)$$

where Φ is the standard normal distribution function.

Thus, the location model is

$$X_i = \mu + \epsilon_i, \quad i = 1, \ldots, n,$$

with $\epsilon_1, \ldots, \epsilon_n$ i.i.d. and, under model (1.2), symmetrically distributed but otherwise unknown, and, under model (1.3), $N(0, \sigma^2)$-distributed with unknown variance $\sigma^2$.
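
To make the two specifications concrete, here is a small simulation sketch (hypothetical code, assuming Python with NumPy) that generates data from the location model: once with normal noise as in (1.3), and once with symmetric but non-normal noise, as allowed under (1.2); the Laplace choice is only an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu = 100, 2.5

# Model (1.3): noise is N(0, sigma^2); here sigma = 1 for illustration.
x_normal = mu + rng.normal(loc=0.0, scale=1.0, size=n)

# Model (1.2): noise only needs a symmetric distribution, e.g. heavy-tailed Laplace.
x_symmetric = mu + rng.laplace(loc=0.0, scale=1.0, size=n)
```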

1.2 Estimation

A parameter is an aspect of the unknown distribution. An estimator $T$ is some given function $T(X)$ of the observations $X$. The estimator is constructed to estimate some unknown parameter, $\gamma$ say.

In Example 1.1.2, one may consider the following estimators $\hat\mu$ of $\mu$:

  • The average
$$\hat\mu_1 := \frac{1}{n} \sum_{i=1}^n X_i.$$
Note that $\hat\mu_1$ minimizes the squared loss
$$\sum_{i=1}^n (X_i - \mu)^2.$$
It can be shown that $\hat\mu_1$ is a "good" estimator if the model (1.3) holds. When (1.3) is not true, in particular when there are outliers (large, "wrong" observations) (Ausreisser), then one has to apply a more robust estimator.

  • The (sample) median is
$$\hat\mu_2 := \begin{cases} X_{((n+1)/2)} & \text{when } n \text{ is odd} \\ \{X_{(n/2)} + X_{(n/2+1)}\}/2 & \text{when } n \text{ is even}, \end{cases}$$
where $X_{(1)} \le \cdots \le X_{(n)}$ are the order statistics. Note that $\hat\mu_2$ is a minimizer of the absolute loss
$$\sum_{i=1}^n |X_i - \mu|.$$

  • The Huber estimator is
$$\hat\mu_3 := \arg\min_\mu \sum_{i=1}^n \rho(X_i - \mu), \qquad (1.4)$$
where
$$\rho(x) = \begin{cases} x^2 & \text{if } |x| \le k \\ k(2|x| - k) & \text{if } |x| > k \end{cases}$$
with $k > 0$ some given threshold.

  • We finally mention the $\alpha$-trimmed mean, defined, for some $0 < \alpha < 1$, as
$$\hat\mu_4 := \frac{1}{n - 2[n\alpha]} \sum_{i=[n\alpha]+1}^{n-[n\alpha]} X_{(i)}.$$
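
The four estimators are easy to compute in practice. The following sketch (hypothetical code, assuming NumPy and SciPy are available) evaluates $\hat\mu_1, \ldots, \hat\mu_4$ on a sample, finding the Huber estimator by numerically minimizing the loss in (1.4); the threshold $k = 1.345$ is only a common illustrative choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber_rho(x, k):
    # rho(x) = x^2 if |x| <= k, and k(2|x| - k) if |x| > k.
    return np.where(np.abs(x) <= k, x**2, k * (2.0 * np.abs(x) - k))

def location_estimators(x, k=1.345, alpha=0.1):
    x = np.asarray(x)
    n = len(x)
    mu1 = np.mean(x)                                                # the average
    mu2 = np.median(x)                                              # the sample median
    mu3 = minimize_scalar(lambda m: np.sum(huber_rho(x - m, k))).x  # Huber estimator
    t = int(n * alpha)                                              # t = [n * alpha]
    mu4 = np.mean(np.sort(x)[t:n - t])                              # alpha-trimmed mean
    return mu1, mu2, mu3, mu4
```

On contaminated samples, $\hat\mu_2$, $\hat\mu_3$ and $\hat\mu_4$ typically move far less than $\hat\mu_1$, in line with the robustness discussion above.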

Note To avoid misunderstanding, we note that, e.g., in (1.4), $\mu$ is the variable over which we minimize, whereas in (1.1), $\mu$ is a parameter. These are actually distinct concepts, but it is a general convention to abuse notation and employ the same symbol $\mu$. When further developing the theory (see Chapter 6) we shall often introduce a new symbol for the variable; e.g., (1.4) is then written as
$$\hat\mu_3 := \arg\min_c \sum_{i=1}^n \rho(X_i - c).$$

An example of a nonparametric estimator is the empirical distribution function
$$\hat F_n(\cdot) := \frac{1}{n} \#\{X_i \le \cdot,\ 1 \le i \le n\}.$$
This is an estimator of the theoretical distribution function
$$F(\cdot) := P(X \le \cdot).$$

Any reasonable estimator is constructed according to the so-called plug-in principle (Einsetzprinzip). That is, the parameter of interest $\gamma$ is written as $\gamma = Q(F)$, with $Q$ some given map. The empirical distribution function $\hat F_n$ is then "plugged in" to obtain the estimator $T := Q(\hat F_n)$. (We note however that problems can arise, e.g. $Q(\hat F_n)$ may not be well-defined....)
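
A minimal sketch of the plug-in principle (hypothetical Python): $\hat F_n$ is built from the sample, and applying a functional $Q$ to $\hat F_n$ yields the estimator; here the mean functional, the map $Q_1$ defined just below, serves as example.

```python
import numpy as np

def edf(x):
    """Empirical distribution function: t -> (1/n) #{X_i <= t}."""
    x = np.asarray(x)
    return lambda t: np.mean(x <= t)

x = np.array([1.2, -0.3, 0.8, 2.1, 0.0])
F_hat = edf(x)
print(F_hat(0.5))    # fraction of observations <= 0.5, here 2/5
# Plugging F_hat into Q1(F) = int x dF(x) gives the sample mean:
print(np.mean(x))
```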

Examples are the above estimators $\hat\mu_1, \ldots, \hat\mu_4$ of the location parameter $\mu$. We define the maps
$$Q_1(F) := \int x \, dF(x)$$
(the mean, or center of gravity, of $F$), and
$$Q_2(F) := F^{-1}(1/2)$$
(the median of $F$), and
$$Q_3(F) := \arg\min_\mu \int \rho(\cdot - \mu) \, dF,$$
[...]


Break down point Let, for $m \le n$,
$$\epsilon(m) := \sup_{x_1^*, \ldots, x_m^*} |\hat\mu(x_1^*, \ldots, x_m^*, X_{m+1}, \ldots, X_n)|.$$
If $\epsilon(m) = \infty$, we say that with $m$ outliers the estimator can break down. The break down point is defined as
$$\epsilon^* := \min\{m : \epsilon(m) = \infty\}/n.$$
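
This definition can be probed empirically; a hypothetical sketch: sending $m$ observations to infinity shows that the average breaks down already at $m = 1$ (so $\epsilon^* = 1/n$), while the median of $n = 20$ points survives $m = 9$ outliers and breaks down at $m = 10$ (so $\epsilon^* \approx 1/2$).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=20)              # n = 20 clean observations

for m in (1, 9, 10):
    x_corrupt = x.copy()
    x_corrupt[:m] = 1e12             # m observations pushed towards infinity
    print(m, np.mean(x_corrupt), np.median(x_corrupt))
# The mean explodes already for m = 1; the median stays bounded up to m = 9.
```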

1.5 Confidence intervals

Consider the location model (Example 1.1.2).

Definition A subset $I = I(X) \subset \mathbb{R}$, depending (only) on the data $X = (X_1, \ldots, X_n)$, is called a confidence set (Vertrauensbereich) for $\mu$, at level $1 - \alpha$, if
$$\mathbb{P}_{\mu, F_0}(\mu \in I) \ge 1 - \alpha \quad \forall\ \mu \in \mathbb{R},\ F_0 \in \mathcal{F}_0.$$

A confidence interval is of the form
$$I := [\underline\mu, \bar\mu],$$
where the boundaries $\underline\mu = \underline\mu(X)$ and $\bar\mu = \bar\mu(X)$ depend (only) on the data $X$.

1.5.1 Equivalence confidence sets and tests

Let, for each $\mu_0 \in \mathbb{R}$, $\phi(X, \mu_0) \in \{0, 1\}$ be a test at level $\alpha$ for the hypothesis
$$H_{\mu_0}:\ \mu = \mu_0.$$
Thus, we reject $H_{\mu_0}$ if and only if $\phi(X, \mu_0) = 1$, and
$$\mathbb{P}_{\mu_0, F_0}(\phi(X, \mu_0) = 1) \le \alpha.$$
Then
$$I(X) := \{\mu :\ \phi(X, \mu) = 0\}$$
is a $(1 - \alpha)$-confidence set for $\mu$.

Conversely, if $I(X)$ is a $(1 - \alpha)$-confidence set for $\mu$, then, for all $\mu_0$, the test $\phi(X, \mu_0)$ defined as
$$\phi(X, \mu_0) = \begin{cases} 1 & \text{if } \mu_0 \notin I(X) \\ 0 & \text{else} \end{cases}$$
is a test at level $\alpha$ of $H_{\mu_0}$.
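
The first direction can be illustrated numerically (a hypothetical sketch, assuming SciPy): invert a one-sample Student test over a grid of candidate values $\mu_0$ and keep the values that are not rejected; the resulting set essentially reproduces the classical $t$-interval.

```python
import numpy as np
from scipy import stats

def confidence_set_by_inversion(x, alpha=0.05):
    """All mu_0 on a grid for which the level-alpha t-test does not reject."""
    grid = np.linspace(x.min() - 1.0, x.max() + 1.0, 2001)
    keep = [m0 for m0 in grid
            if stats.ttest_1samp(x, popmean=m0).pvalue >= alpha]
    return min(keep), max(keep)

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, size=30)
print(confidence_set_by_inversion(x))
```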


1.6 Intermezzo: quantile functions

Let $F$ be a distribution function. Then $F$ is càdlàg (continue à droite, limite à gauche). Define the quantile functions
$$q_F^+(u) := \sup\{x :\ F(x) \le u\}$$
and
$$q_F^-(u) := \inf\{x :\ F(x) \ge u\} =: F^{-1}(u).$$
It holds that
$$F(q_F^-(u)) \ge u$$
and, for all $h > 0$,
$$F(q_F^+(u) - h) \le u.$$
Hence
$$F(q_F^+(u)-) := \lim_{h \downarrow 0} F(q_F^+(u) - h) \le u.$$
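
For discrete distributions the two quantile functions may differ. A hypothetical sketch evaluating $q_F^+$ and $q_F^-$ for the empirical distribution function of a sample:

```python
import numpy as np

def q_minus(x, u):
    """q_F^-(u) = inf{t : F(t) >= u}, F the empirical distribution of x."""
    xs, n = np.sort(x), len(x)
    return xs[np.searchsorted(np.arange(1, n + 1) / n, u, side='left')]

def q_plus(x, u):
    """q_F^+(u) = sup{t : F(t) <= u}, F the empirical distribution of x."""
    xs, n = np.sort(x), len(x)
    idx = np.searchsorted(np.arange(1, n + 1) / n, u, side='right')
    return np.inf if idx == n else xs[idx]

x = [1.0, 2.0, 3.0, 4.0]
print(q_minus(x, 0.5), q_plus(x, 0.5))   # 2.0 and 3.0: here the two differ
```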

1.7 How to construct tests and confidence sets

Consider a model class $\mathcal{P} := \{P_\theta :\ \theta \in \Theta\}$. Moreover, consider a space $\Gamma$ and a map
$$g:\ \Theta \to \Gamma, \quad g(\theta) =: \gamma.$$
We think of $\gamma$ as the parameter of interest (as in the plug-in principle, with $\gamma = Q(P_\theta) = g(\theta)$).

For instance, in Example 1.1.2, the parameter space is $\Theta := \{\theta = (\mu, F_0) :\ \mu \in \mathbb{R},\ F_0 \in \mathcal{F}_0\}$, and, when $\mu$ is the parameter of interest, $g(\mu, F_0) = \mu$.

To test
$$H_{\gamma_0}:\ \gamma = \gamma_0,$$
we look for a pivot (Tür-Angel). This is a function $Z(X, \gamma)$, depending on the data $X$ and on the parameter $\gamma$, such that for all $\theta \in \Theta$, the distribution
$$\mathbb{P}_\theta(Z(X, g(\theta)) \le \cdot) =: G(\cdot)$$
does not depend on $\theta$. We note that to find a pivot is unfortunately not always possible. However, if we do have a pivot $Z(X, \gamma)$ with distribution $G$, we can compute its quantile functions
$$q_L := q_G^+\!\left(\frac{\alpha}{2}\right), \quad q_R := q_G^-\!\left(1 - \frac{\alpha}{2}\right),$$
and the test
$$\phi(X, \gamma_0) := \begin{cases} 1 & \text{if } Z(X, \gamma_0) \notin [q_L, q_R] \\ 0 & \text{else} \end{cases}$$
has level $\alpha$.


[...] is an asymptotic pivot, with limiting distribution $G = \Phi$.

Comparison of confidence intervals and tests When comparing confidence intervals, the aim is usually to take the one with smallest length on average (keeping the level at $1 - \alpha$). In the case of tests, we look for the one with maximal power. In the location model, this leads to studying
$$\mathbb{E}_{\mu, F_0} |\bar\mu(X) - \underline\mu(X)|$$
for $(1 - \alpha)$-confidence sets $[\underline\mu, \bar\mu]$, or to studying the power of the test $\phi(X, \mu_0)$ at level $\alpha$. Recall that the power is $\mathbb{P}_{\mu, F_0}(\phi(X, \mu_0) = 1)$ for values $\mu \ne \mu_0$.

1.8 An illustration: the two-sample problem

Consider the following data, concerning weight gain/loss. The control group x had their usual diet, and the treatment group y obtained a special diet, designed to prevent weight gain. The study was carried out to test whether the diet works.

[Table 2: the observations x of the control group and y of the treatment group, together with their ranks rank(x) and rank(y) in the combined sample.]

Let $n$ ($m$) be the sample size of the control group x (treatment group y). The mean in group x (y) is denoted by $\bar x$ ($\bar y$). The sums of squares are $SS_x := \sum_{i=1}^n (x_i - \bar x)^2$ and $SS_y := \sum_{j=1}^m (y_j - \bar y)^2$. So in this study, one has $n = m = 5$ and the values $\bar x = 6.4$, $\bar y = 0$, $SS_x = 161.2$ and $SS_y = 114$. The ranks, rank(x) and rank(y), are the rank-numbers when putting all $n + m$ data together (e.g., $y_3 = -6$ is the smallest observation and hence rank$(y_3) = 1$).

We assume that the data are realizations of two independent samples, say $X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_m)$, where $X_1, \ldots, X_n$ are i.i.d. with distribution function $F_X$, and $Y_1, \ldots, Y_m$ are i.i.d. with distribution function $F_Y$. The distribution functions $F_X$ and $F_Y$ may be in whole or in part unknown. The testing problem is $H_0:\ F_X = F_Y$ against a one- or two-sided alternative.


1.8.1 Assuming normality

The classical two-sample Student test is based on the assumption that the data come from a normal distribution. Moreover, it is assumed that the variances of $F_X$ and $F_Y$ are equal. Thus,
$$(F_X, F_Y) \in \left\{ \left( \Phi\!\left(\frac{\cdot - \mu}{\sigma}\right),\ \Phi\!\left(\frac{\cdot - (\mu + \gamma)}{\sigma}\right) \right) :\ \mu \in \mathbb{R},\ \sigma > 0,\ \gamma \in \Gamma \right\}.$$
Here, $\Gamma \supset \{0\}$ is the range of shifts in mean one considers, e.g. $\Gamma = \mathbb{R}$ for a two-sided situation, and $\Gamma = (-\infty, 0]$ for a one-sided situation. The testing problem reduces to
$$H_0:\ \gamma = 0.$$

We now look for a pivot $Z(X, Y, \gamma)$. Define the sample means
$$\bar X := \frac{1}{n} \sum_{i=1}^n X_i, \quad \bar Y := \frac{1}{m} \sum_{j=1}^m Y_j,$$
and the pooled sample variance
$$S^2 := \frac{1}{m + n - 2} \left\{ \sum_{i=1}^n (X_i - \bar X)^2 + \sum_{j=1}^m (Y_j - \bar Y)^2 \right\}.$$

Note that $\bar X$ has expectation $\mu$ and variance $\sigma^2/n$, and $\bar Y$ has expectation $\mu + \gamma$ and variance $\sigma^2/m$. So $\bar Y - \bar X$ has expectation $\gamma$ and variance
$$\frac{\sigma^2}{n} + \frac{\sigma^2}{m} = \sigma^2 \left( \frac{n + m}{nm} \right).$$
The normality assumption implies that
$$\bar Y - \bar X \ \text{is}\ N\!\left(\gamma,\ \sigma^2 \frac{n + m}{nm}\right)\text{-distributed}.$$
Hence
$$\sqrt{\frac{nm}{n + m}}\ \frac{\bar Y - \bar X - \gamma}{\sigma} \ \text{is}\ N(0, 1)\text{-distributed}.$$

To arrive at a pivot, we now plug in the estimate $S$ for the unknown $\sigma$:
$$Z(X, Y, \gamma) := \sqrt{\frac{nm}{n + m}}\ \frac{\bar Y - \bar X - \gamma}{S}.$$
Indeed, $Z(X, Y, \gamma)$ has a distribution $G$ which does not depend on unknown parameters. The distribution $G$ is Student$(n + m - 2)$ (the Student distribution with $n + m - 2$ degrees of freedom). As test statistic for $H_0:\ \gamma = 0$, we therefore take
$$T = T^{\mathrm{Student}} := Z(X, Y, 0).$$
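
With the summary statistics of the diet study above ($n = m = 5$, $\bar x = 6.4$, $\bar y = 0$, $SS_x = 161.2$, $SS_y = 114$), the observed test statistic can be computed directly; a hypothetical sketch (assuming SciPy for the Student distribution):

```python
import numpy as np
from scipy import stats

n, m = 5, 5
xbar, ybar = 6.4, 0.0
SSx, SSy = 161.2, 114.0

S2 = (SSx + SSy) / (m + n - 2)               # pooled variance: 275.2 / 8 = 34.4
T = np.sqrt(n * m / (n + m)) * (ybar - xbar) / np.sqrt(S2)
p = 2 * stats.t.cdf(-abs(T), df=n + m - 2)   # two-sided p-value
print(T, p)   # T is about -1.73, p about 0.12: H_0 is not rejected at the 5% level
```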


Large values of $T$ mean that the $X_i$ are generally larger than the $Y_j$, and hence indicate evidence against $H_0$.

To check whether or not the observed value of the test statistic is compatible with the null-hypothesis, we need to know its null-distribution, that is, the distribution under $H_0$. Under $H_0:\ F_X = F_Y$, the vector of ranks $(R_1, \ldots, R_n)$ has the same distribution as $n$ random draws without replacement from the numbers $\{1, \ldots, N\}$. That is, if we let
$$r := (r_1, \ldots, r_n, r_{n+1}, \ldots, r_N)$$
denote a permutation of $\{1, \ldots, N\}$, then
$$\mathbb{P}_{H_0}\big( (R_1, \ldots, R_n, R_{n+1}, \ldots, R_N) = r \big) = \frac{1}{N!}$$
(see Theorem 1.8.1), and hence
$$\mathbb{P}_{H_0}(T = t) = \frac{\#\{r :\ \sum_{i=1}^n r_i = t\}}{N!}.$$
This can also be written as
$$\mathbb{P}_{H_0}(T = t) = \frac{1}{\binom{N}{n}}\ \#\left\{ r_1 < \cdots < r_n,\ r_{n+1} < \cdots < r_N :\ \sum_{i=1}^n r_i = t \right\}.$$

So clearly, the null-distribution of $T$ does not depend on $F_X$ or $F_Y$. It does however depend on the sample sizes $n$ and $m$. It is tabulated for $n$ and $m$ small or moderately large. For large $n$ and $m$, a normal approximation of the null-distribution can be used.
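
For small $n$ and $m$, the exact null-distribution is a short enumeration; a hypothetical sketch, following the combinatorial formula above (each of the $\binom{N}{n}$ possible rank sets for the $X$-sample is equally likely under $H_0$); the observed value $t_{\mathrm{obs}} = 18$ is only an illustrative assumption:

```python
from itertools import combinations
from math import comb
from collections import Counter

def rank_sum_null(n, m):
    """Exact null-distribution of T = sum of the ranks of the X-sample."""
    N = n + m
    counts = Counter(sum(s) for s in combinations(range(1, N + 1), n))
    return {t: c / comb(N, n) for t, c in sorted(counts.items())}

null = rank_sum_null(5, 5)
t_obs = 18                           # a hypothetical observed rank sum
print(sum(p for t, p in null.items() if t <= t_obs))   # one-sided p-value
```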

Theorem 1.8.1 formally derives the null-distribution of the test, and actually proves that the order statistics and the ranks are independent. The latter result will be of interest in Example 2.10.4.

For two random variables $X$ and $Y$, use the notation
$$X \overset{\mathcal{D}}{=} Y$$
when $X$ and $Y$ have the same distribution.

Theorem 1.8.1 Let $Z_1, \ldots, Z_N$ be i.i.d. with continuous distribution $F$ on $\mathbb{R}$. Then $(Z_{(1)}, \ldots, Z_{(N)})$ and $R := (R_1, \ldots, R_N)$ are independent, and for all permutations $r := (r_1, \ldots, r_N)$,
$$\mathbb{P}(R = r) = \frac{1}{N!}.$$

Proof. Let $Z_{Q_i} := Z_{(i)}$, and $Q := (Q_1, \ldots, Q_N)$. Then
$$R = r\ \Leftrightarrow\ Q = r^{-1} =: q,$$
where $r^{-1}$ is the inverse permutation of $r$.¹ For all permutations $q$ and all measurable maps $f$,
$$f(Z_1, \ldots, Z_N) \overset{\mathcal{D}}{=} f(Z_{q_1}, \ldots, Z_{q_N}).$$
Therefore, for all measurable sets $A \subset \mathbb{R}^N$ and all permutations $q$,
$$\mathbb{P}\big( (Z_1, \ldots, Z_N) \in A,\ Z_1 < \cdots < Z_N \big) = \mathbb{P}\big( (Z_{q_1}, \ldots, Z_{q_N}) \in A,\ Z_{q_1} < \cdots < Z_{q_N} \big).$$
Because there are $N!$ permutations, we see that for any $q$,
$$\mathbb{P}\big( (Z_{(1)}, \ldots, Z_{(N)}) \in A \big) = N!\ \mathbb{P}\big( (Z_{q_1}, \ldots, Z_{q_N}) \in A,\ Z_{q_1} < \cdots < Z_{q_N} \big) = N!\ \mathbb{P}\big( (Z_{(1)}, \ldots, Z_{(N)}) \in A,\ R = r \big),$$
where $r = q^{-1}$. Thus we have shown that for all measurable $A$, and for all $r$,
$$\mathbb{P}\big( (Z_{(1)}, \ldots, Z_{(N)}) \in A,\ R = r \big) = \frac{1}{N!}\ \mathbb{P}\big( (Z_{(1)}, \ldots, Z_{(N)}) \in A \big). \qquad (1.5)$$
Take $A = \mathbb{R}^N$ to find that (1.5) implies
$$\mathbb{P}(R = r) = \frac{1}{N!}.$$
Plug this back into (1.5) to see that we have the product structure
$$\mathbb{P}\big( (Z_{(1)}, \ldots, Z_{(N)}) \in A,\ R = r \big) = \mathbb{P}\big( (Z_{(1)}, \ldots, Z_{(N)}) \in A \big)\ \mathbb{P}(R = r),$$
which holds for all measurable $A$. In other words, $(Z_{(1)}, \ldots, Z_{(N)})$ and $R$ are independent. □
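
A quick Monte Carlo check of the theorem (hypothetical code): whatever continuous $F$ is used, each of the $N!$ rank vectors should occur with relative frequency close to $1/N!$.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)
N, reps = 3, 60000
counts = Counter()
for _ in range(reps):
    z = rng.exponential(size=N)                   # any continuous F will do
    ranks = tuple(np.argsort(np.argsort(z)) + 1)  # the rank vector (R_1, ..., R_N)
    counts[ranks] += 1

for r, c in sorted(counts.items()):
    print(r, c / reps)   # each of the 3! = 6 rank vectors has frequency ~ 1/6
```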

1.8.3 Comparison of Student’s test and Wilcoxon’s test

Because Wilcoxon's test is only based on the ranks, and does not rely on the assumption of normality, one may expect that, when the data are in fact normally distributed, Wilcoxon's test will have less power than Student's test. The loss [...]

¹ Here is an example, with $N = 3$:
$$(z_1, z_2, z_3) = (5, 6, 4), \quad (r_1, r_2, r_3) = (2, 3, 1), \quad (q_1, q_2, q_3) = (3, 1, 2).$$