
STAT 9220

Lecture 6

Statistical Decision Theory

Greg Rempala

Department of Biostatistics

Medical College of Georgia

Feb 17, 2009

6.1 Basics

Let X be a sample from a population P ∈ P. A statistical decision is an action
we take after observing X concerning (i.e., a conclusion about) P.

Let A denote the set of allowable actions and let F_A be a σ-field on A. Then
the measurable space (A, F_A) is called the action space. Let X be the range of
X and F_X be a σ-field on X. A decision rule is a measurable function (a
statistic) T from (X, F_X) to (A, F_A). Typically a decision rule is assessed
through a loss function L, where

L : P × A → R_+.

If X = x is observed, the loss is L(P, T(x)). The average loss of the decision
rule T,

R_T(P) = E[L(P, T(X))],

is called the risk. If P is a parametric family indexed by θ ∈ Θ, the loss and
risk are denoted by L(θ, a) and R_T(θ). A rule T_1 is as good as T_2 if and
only if

R_{T_1}(P) ≤ R_{T_2}(P) for every P ∈ P,

and is better than T_2 if, in addition, R_{T_1}(P) < R_{T_2}(P) for at least
one P ∈ P. Two decision rules T_1 and T_2 are equivalent if and only if
R_{T_1}(P) = R_{T_2}(P) for all P ∈ P.

It is also possible to consider randomized decision rules, i.e., functions δ on
X × F_A such that, for every A ∈ F_A, δ(·, A) is a Borel function and, for
every x ∈ X, δ(x, ·) is a probability measure on (A, F_A). To select an action
in A, one simulates a random element according to δ(x, ·).

In estimation problems we can also consider other loss functions, e.g.,
L(θ, a) = |θ − a| (absolute error loss).
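Since the risk R_T(P) is just an expectation, it can be approximated by
simulation when no closed form is available. Below is a minimal Python sketch,
not from the notes: the normal model, the sample size, and the choice of the
sample mean as the rule T are illustrative assumptions.

import numpy as np

def monte_carlo_risk(decision_rule, loss, draw_sample, theta, reps=100_000, seed=0):
    """Approximate R_T(theta) = E[L(theta, T(X))] by averaging losses over
    simulated samples X ~ P_theta."""
    rng = np.random.default_rng(seed)
    losses = [loss(theta, decision_rule(draw_sample(rng))) for _ in range(reps)]
    return float(np.mean(losses))

# Illustrative setup: X_1, ..., X_n i.i.d. N(theta, 1), T(X) = sample mean.
theta, n = 2.0, 25
draw = lambda rng: rng.normal(theta, 1.0, size=n)

sq_risk = monte_carlo_risk(np.mean, lambda t, a: (t - a) ** 2, draw, theta)
abs_risk = monte_carlo_risk(np.mean, lambda t, a: abs(t - a), draw, theta)
print(sq_risk)   # close to 1/n = 0.04, the exact risk under squared error loss
print(abs_risk)  # close to sqrt(2/(pi*n)), the exact risk under absolute loss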

Example 6.1.2 (Hypothesis testing). Let P be a family of distributions with
P = P_0 ∪ P_1 and P_0 ∩ P_1 = ∅. A hypothesis testing problem can be
formulated as that of deciding which of the following two statements is true:

H_0 : P ∈ P_0 versus H_1 : P ∈ P_1.

Here, H_0 is called the null hypothesis and H_1 is called the alternative
hypothesis. The action space is A = {0, 1}, where 0 is the action of accepting
H_0 and 1 is the action of rejecting H_0. A decision rule is called a test,
T : X → {0, 1}, so T(X) = 1_C(X), where C ∈ F_X is called the rejection region
or critical region for testing H_0 versus H_1. A loss function in this problem
is the 0–1 loss:

L(P, a) = 0 if a correct decision is made, 1 otherwise.

Under this loss, the risk is

R_T(P) = P(T(X) = 1) = P(X ∈ C)  if P ∈ P_0,
R_T(P) = P(T(X) = 0) = P(X ∉ C)  if P ∈ P_1.
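Under the 0–1 loss, the risk is thus the type I error probability on P_0 and
the type II error probability on P_1. A minimal sketch (not from the notes;
the normal model, n, and cutoff c are illustrative, and scipy is assumed
available) evaluating both branches exactly for a one-sided test of a normal
mean:

import numpy as np
from scipy.stats import norm

# Test H_0: mu <= 0 vs H_1: mu > 0 from X_1,...,X_n i.i.d. N(mu, 1),
# using T(X) = 1{mean(X) > c}.  Under 0-1 loss,
#   R_T(mu) = P(Xbar > c) for mu <= 0  (reject H_0 when it is true),
#   R_T(mu) = P(Xbar <= c) for mu > 0  (accept H_0 when it is false).
n, c = 25, 0.33  # c chosen so the level at mu = 0 is about 0.05

def risk(mu):
    # Xbar ~ N(mu, 1/n), so both branches are normal tail probabilities.
    p_reject = norm.sf(c, loc=mu, scale=1 / np.sqrt(n))
    return p_reject if mu <= 0 else 1 - p_reject

for mu in (-0.2, 0.0, 0.2, 0.5):
    print(f"mu = {mu:+.1f}: R_T(mu) = {risk(mu):.3f}")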

Example 6.1.3. Let X_1, ..., X_n be i.i.d. random variables from a population
P ∈ P, where P is the family of populations having finite mean μ and variance
σ². Consider the estimation of μ (A = R) under the squared error loss
L(μ, a) = (μ − a)². Let T be the class of all linear functions of
X = (X_1, ..., X_n), i.e., T(X) = ∑_{i=1}^n c_i X_i with known c_i ∈ R,
i = 1, ..., n. Then

R_T(P) = E[μ − T(X)]²
       = E[∑_{i=1}^n c_i X_i − μ]²
       = E[∑_{i=1}^n c_i (X_i − μ) + (∑_{i=1}^n c_i − 1)μ]²
       = σ² ∑_{i=1}^n c_i² + μ² (∑_{i=1}^n c_i − 1)².

(a) We show that there is no T(X) that minimizes R_T(P) uniformly in
P = (μ, σ²). The minimum of R_T(P) as a function of (c_1, ..., c_n) is attained
at c_1 = ··· = c_n = μ²/(σ² + nμ²), which depends on P. Hence there are no
c_i's that minimize R_T(P) uniformly over P.

(b) Consider now the subclass T_0 ⊂ T of rules with c_i's satisfying
∑_{i=1}^n c_i = 1. Then R_T(P) = σ² ∑_{i=1}^n c_i² if T ∈ T_0. Minimizing
σ² ∑_{i=1}^n c_i² subject to ∑_{i=1}^n c_i = 1 leads to the optimal solution
c_i = 1/n for all i. Thus, the sample mean X̄ is T_0-optimal.

Example 6.1.4. Assume that the sample X has the binomial distribution b(n, θ)
with an unknown θ ∈ (0, 1) and a fixed integer n > 1. Consider the hypothesis
testing problem described in Example 6.1.2 with H_0 : θ ∈ (0, θ_0] versus
H_1 : θ ∈ (θ_0, 1), where θ_0 ∈ (0, 1) is a fixed value. Suppose that we are
only interested in the following class of nonrandomized decision rules:
T = {T_j : j = 0, 1, ..., n − 1}, where T_j(X) = 1_{{j+1,...,n}}(X). The risk
function for T_j under the 0–1 loss is

R_{T_j}(θ) = P(X > j) 1_{(0,θ_0]}(θ) + P(X ≤ j) 1_{(θ_0,1)}(θ).

6.2 Example: Minimizing MSE

One of the most important aspects of statistical decision theory is that it
formalizes the optimality of statistical estimators: under squared error loss
the risk of an estimator is its mean squared error (MSE), so an optimal
estimator is one that minimizes the MSE.

Example 6.2.1. Let X_1, ..., X_n be i.i.d. from an unknown c.d.f. F. Suppose
that the parameter of interest is ϑ = 1 − F(t) for a fixed t > 0.

a) If F is not in a parametric family, then a nonparametric estimator of F(t)
is the empirical c.d.f.

F_n(t) = (1/n) ∑_{i=1}^n 1_{(−∞,t]}(X_i), t ∈ R.

Since 1_{(−∞,t]}(X_i) = Y_i ∈ {0, 1} and ∑_{i=1}^n Y_i ∼ Binomial(n, 1 − ϑ),
we have F_n(t) = Ȳ and

Var(F_n(t)) = mse_{F_n(t)}(P) = F(t)[1 − F(t)]/n = ϑ(1 − ϑ)/n.

Consequently, F_n(t) is an unbiased nonparametric estimator of 1 − ϑ. By
linearity of expectations, an unbiased estimator of ϑ is U(X) = 1 − F_n(t),
which has the same variance and mse as F_n(t).

b) The estimator U(X) can be improved in terms of the mse if there is further
information about F. Suppose that F is the c.d.f. of the exponential
distribution E(0, θ) with an unknown θ > 0. Then F(t) = 1 − e^{−t/θ} and
ϑ = e^{−t/θ}. The sample mean X̄ is sufficient for θ. Now take
Ũ(X) = E[U(X) | X̄] = h(X̄). Then mse(Ũ) ≤ mse(U), with strict inequality
unless U is already a function of X̄. Indeed,

E[U(X) − ϑ]²
  = E{U(X) − E[U(X)|X̄] + E[U(X)|X̄] − ϑ}²
  = E{U(X) − E[U(X)|X̄]}² + E{ϑ − E[U(X)|X̄]}²
    + 2 E{[U(X) − E[U(X)|X̄]][E[U(X)|X̄] − ϑ]}.

Conditioning the cross term on X̄, and using the facts that E[U(X)|X̄] − ϑ is a
function of X̄ and that E{U(X) − E[U(X)|X̄] | X̄} = 0, gives

E{[U(X) − E[U(X)|X̄]][E[U(X)|X̄] − ϑ]}
  = E{ E{U(X) − E[U(X)|X̄] | X̄} · [E[U(X)|X̄] − ϑ] } = 0,

so that

E[U(X) − ϑ]² = E{U(X) − E[U(X)|X̄]}² + E{ϑ − E[U(X)|X̄]}²
             ≥ E{ϑ − E[U(X)|X̄]}² = mse(Ũ).

This method of improving estimators is sometimes called Blackwellization,
after one of its inventors, David Blackwell.
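A simulation comparing mse(U) and mse(Ũ) for this exponential setup. The
closed form used for h below is not derived in the notes; it follows from the
standard fact that, given S = nX̄, the ratio X_1/S has a Beta(1, n−1)
distribution, so h(X̄) = P(X_1 > t | S) = (1 − t/(nX̄))^{n−1} when nX̄ > t and
0 otherwise. The values of n, t, and θ are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, t, theta = 10, 1.0, 2.0
vartheta = np.exp(-t / theta)            # target parameter e^{-t/theta}

reps = 200_000
x = rng.exponential(theta, size=(reps, n))
U = (x > t).mean(axis=1)                 # U(X) = 1 - F_n(t)
s = x.sum(axis=1)                        # S = n * Xbar, the sufficient statistic
# Blackwellized estimator h(Xbar) = E[U | Xbar] = (1 - t/S)^{n-1} for S > t
U_tilde = np.where(s > t, np.clip(1 - t / s, 0, None) ** (n - 1), 0.0)

print(((U - vartheta) ** 2).mean())        # mse(U), about vartheta(1-vartheta)/n
print(((U_tilde - vartheta) ** 2).mean())  # mse(U~), strictly smaller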