Chapter 5
Distributions
5.1 Theoretical Distributions for Data
It is not necessary to read this chapter thoroughly. If you put it under your
pillow, you may stay up at night tossing and turning. If you read it consci-
entiously, you are guaranteed to fall asleep. Peruse it. Make sure that you
know the meaning of a probability density function and a cumulative density
function, standard or Zscore, and note the names of the important theoretical
distributions–normal, lognormal, t,Fand χ2. If you can understand the basic
concepts of theoretical distributions, then it is much easier to understand the
logic of hypothesis testing that will be presented in a later chapter.
In Section X.X, we gave a definition of empirical and theoretical distribu-
tions. Here, we expand on theoretical distributions and then briefly discuss how
to examine the fit of theoretical distributions to empirical distributions of real
data.
Atheoreticaldistributionisgeneratedbyamathematicalfunctionthathas
three major properties:
(1) the function gives the relative frequency of a score as a function of the
value of the score and other mathematical unknowns (i.e., parameters);
(2) the area under the curve generated by the function between two points
gives the relative likelihood of randomly selecting a score between those two
points;
(3) the area under the curve from its lowest possible value to its highest
possible value is 1.0.
What constitutes a “score” depends on the measurement scale of the variable.
For a c a te g or i ca l o r ca t eg o ri z ed var i ab l e, t h e “s c or e w oul d b e a cla ss or g rou p,
while for a continuous variable like weight, it would be the actual numerical
weight. The mathematical function depends on both the measurement scale and
the type of problem at hand. It is easiest to learn about theoretical distributions
by dividing them into two types— those that apply to categorical variables and
those that apply to continuous variables.
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18

Partial preview of the text

Download Selected Distribution Models: Normal, Lognormal, Extreme, Multivariate Normal Distributions and more Lecture notes Statistics in PDF only on Docsity!

Chapter 5

Distributions

5.1 Theoretical Distributions for Data

It is not necessary to read this chapter thoroughly. If you put it under your pillow, you may stay up at night tossing and turning. If you read it conscientiously, you are guaranteed to fall asleep. Peruse it. Make sure that you know the meaning of a probability density function and a cumulative density function, standard or Z score, and note the names of the important theoretical distributions: normal, lognormal, t, F, and χ². If you can understand the basic concepts of theoretical distributions, then it is much easier to understand the logic of hypothesis testing that will be presented in a later chapter.

In Section X.X, we gave a definition of empirical and theoretical distributions. Here, we expand on theoretical distributions and then briefly discuss how to examine the fit of theoretical distributions to empirical distributions of real data.

A theoretical distribution is generated by a mathematical function that has three major properties:

(1) the function gives the relative frequency of a score as a function of the value of the score and other mathematical unknowns (i.e., parameters);

(2) the area under the curve generated by the function between two points gives the relative likelihood of randomly selecting a score between those two points;

(3) the area under the curve from its lowest possible value to its highest possible value is 1.0.

What constitutes a “score” depends on the measurement scale of the variable. For a categorical or categorized variable, the “score” would be a class or group, while for a continuous variable like weight, it would be the actual numerical weight. The mathematical function depends on both the measurement scale and the type of problem at hand. It is easiest to learn about theoretical distributions by dividing them into two types: those that apply to categorical variables and those that apply to continuous variables.

Before we begin, however, there is a matter of jargon. The mathematical function that gives the probability of observing a value of X as a function of X is called the probability density function, often abbreviated as pdf. The mathematical function that gives the probability of observing a value of X below some given value, say $X_1$, is called the probability distribution function or cumulative distribution function or cdf. The cumulative distribution function is the integral of the probability density function; that is, it gives the area under the curve up to a given value of X, and the difference between its values at two points, say $X_1$ and $X_2$, gives the area under the curve between them.
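As a concrete illustration of the pdf/cdf relationship, here is a minimal R sketch (in the style of the code in Table 5.1 below); the choice of the standard normal and of the interval from −1 to 1 is ours:

# dnorm() is a pdf; pnorm() is the corresponding cdf.
# The area under the pdf between two points equals the difference
# of the cdf evaluated at those points.
x1 <- -1
x2 <-  1
area_pdf <- integrate(dnorm, lower = x1, upper = x2)$value   # numerical integral of the pdf
area_cdf <- pnorm(x2) - pnorm(x1)                            # F(x2) - F(x1)
c(area_pdf, area_cdf)                                        # both about 0.6827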

5.2 Theoretical Distributions for Categorical Variables

In neuroscience, the most frequently encountered categorical variable is the binary variable, which can take only one of two mutually exclusive values. Statisticians call this a Bernoulli variable. Sex of animal is a binary variable. A rat could be male or female, but it cannot have a sex of “neither” or “both.” If a study has two groups, a control and a treatment condition, then “group” is a Bernoulli variable. Rats can belong to only one of the two groups.

The pdf for a binary variable is intuitive. One of the groups (male or female, control or treatment, it makes no difference) has a frequency of p and the other, a frequency of (1 − p). Hence, there is only one parameter for a Bernoulli distribution: p. For sex and group, the frequency is usually fixed by the investigator. For other variables, however, the frequency may be a free parameter, i.e., one does not know the value beforehand or fix the experiment to achieve a desired frequency (usually equal frequencies). Consider a phenotype like central nervous system seizure. In an experimental situation a rat may have at least one seizure or may not have a seizure. In this case the value of p may be unknown beforehand. The statistical issue may be whether the p for a treatment group is within statistical sampling error of the p for a control group.

A second theoretical distribution for categorical variables is the binomial distribution. This is a souped-up Bernoulli. If there are n total objects, events, trials, etc., the binomial gives the probability that r of these will have the outcome of interest. For example, if there are n = 12 rats in the control group, the binomial will give the probability that, say, r = 4 of them will have a seizure and the remaining (n − r) = 8 do not have a seizure. The pdf for the binomial is

$$\Pr(r \text{ of } n) = \frac{n!}{r!(n-r)!}\,p^{r}(1-p)^{n-r}.$$
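As a quick check of this formula, here is a minimal R sketch of the rat example; the seizure probability p = 0.3 is made up purely for illustration (the text does not specify a value):

# P(exactly r = 4 seizures among n = 12 rats), assuming p = 0.3
n <- 12
r <- 4
p <- 0.3
choose(n, r) * p^r * (1 - p)^(n - r)   # direct use of the binomial pdf above
dbinom(r, size = n, prob = p)          # built-in equivalent; both about 0.231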

A more general distribution is the multinomial. The binomial treats two mutually exclusive outcomes. The multinomial deals with any number of mutually exclusive outcomes. For example, if there were four different outcomes for the 12 rats, the multinomial gives the probability that $r_1$ of them will have the first outcome; $r_2$, the second; $r_3$, the third; and $(12 - r_1 - r_2 - r_3)$, the fourth. There are only a few, albeit important, applications in neuroscience for the multinomial.

Because the normal curve is a probability density function, the area under the curve between two values gives the probability of randomly selecting a score between those values. For example, the probability of randomly picking a value of X between the mean and one standard deviation above the mean (μ + σ) is .3413. Similarly, the probability of selecting a score less than two standard deviations below the mean equals .0013 + .0215 = .0228. The probability distribution function or cumulative distribution function of the normal curve gives the probability of randomly picking a value of X below some predetermined value, say $\tilde{X}$. The equation is the integral of the equation for the normal curve from its lowest value (negative infinity) to $\tilde{X}$, or

$$F(\tilde{X}) = \int_{-\infty}^{\tilde{X}} \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}\right]dX. \qquad (5.2)$$

A graph of the function is illustrated in Figure 5.2. This function is also called the probit function, and it plays an important role in many different aspects of statistics. One procedure relevant to neuroscience is probit analysis, which is used to predict the presence or absence of a response as a function of, say, dose of a drug (see Section X.X).
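To make the idea concrete, here is a minimal sketch of a probit analysis in R using glm() with a probit link; the doses and response counts are made up for illustration:

# Probit regression of response on dose (hypothetical data).
# The fitted curve is a cumulative normal in dose; in this
# parametrization the intercept is -mu/sigma and the slope is 1/sigma.
dose      <- c(1, 2, 4, 8, 16, 32)
n_tested  <- rep(20, 6)
n_respond <- c(1, 3, 8, 13, 17, 19)
fit <- glm(cbind(n_respond, n_tested - n_respond) ~ dose,
           family = binomial(link = "probit"))
summary(fit)
predict(fit, newdata = data.frame(dose = 10), type = "response")   # predicted response probability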

Figure 5.2: The cumulative normal distribution.

Here, one would fit the probit model in Figure 5.2 or Equation 5.2 to observed data on drug dose ($\tilde{X}$) and the probability of a response at that dose ($F(\tilde{X})$). The two unknowns that define the curve would be μ and σ.

5.3.1.1 The Standard Normal Distribution and Z Scores.

The mathematics behind the area under the normal curve is quite complicated, and there is no simple closed-form solution to Equation 5.2. Instead, a series of numerical approximations is used to arrive at the area under the normal curve. Before the advent of digital computers, statisticians relied on printed tables to get the area under the curve. If we think for a minute, this presented a certain problem. Because μ can be any numerical value, and σ can take any value greater than 0, there are an infinite number of normal curves. Did statisticians need an infinite number of tables?

Table 5.1: SAS and R code for calculating the area under the normal curve.

SAS code:

DATA Null;
   x = 90;
   mu = 100;
   sigma = 15;
   z = (x - mu) / sigma;
   pz = CDF('NORMAL', z);
   PUT pz=;
RUN;

R code:

x <- 90
mu <- 100
sigma <- 15
z <- (x - mu) / sigma
pz <- pnorm(z)
pz

The answer is, “No.” Statisticians calculated the areas under one and only one distribution and then transformed their own distributions to this distribution. That distribution is the standard normal distribution, and it is defined as a normal distribution with a mean of 0.0 and a standard deviation of 1.0. (To grasp the value of these numbers, substitute 0 for μ and 1 for σ in Equations 5.1 and 5.2 and note how this simplifies the algebra.) Scores from the standard normal distribution are called Z scores and may be calculated by the formula

$$Z = \frac{X - \bar{X}}{s_X}. \qquad (5.3)$$

Here, Z is the score from the standard normal, X is the score to be transformed, $\bar{X}$ is the mean of that distribution (i.e., the distribution of X), and $s_X$ is the standard deviation of that distribution. For example, if IQ is distributed as a normal with a mean of 100 and a standard deviation of 15, then the Z score equivalent of an IQ of 108 is

$$Z = \frac{X - \bar{X}}{s_X} = \frac{108 - 100}{15} = 0.533.$$

If we wanted to know the frequency of people in the population with IQs less than 90, then we would first convert 90 into a Z score:

$$Z = \frac{X - \bar{X}}{s_X} = \frac{90 - 100}{15} = -0.667.$$

Then we would find the area under the standard normal curve that is less than a Z score of −.667. The answer would be .252. In today’s world of readily accessible computers, numerical algorithms have replaced tables for calculations of areas under all statistical distributions.¹ For example, the SAS and R code in Table 5.1 performs this calculation for an IQ of 90.

¹ Paradoxically, teaching statistics is the one area where tables are still used. The statistics student is forced to look up areas under the curve in tables, only to completely abandon that approach to analyze real data.

Going in the other direction, we can ask what IQ cuts off the top 10% of the population. The Z score below which 90% of the standard normal curve lies is 1.282, so

$$X = 15(1.282) + 100 = 119.23.$$

Rounding off, we would say that the top 10% of the population has an IQ of 119 or above. Once again, calculations for areas under the curve are seldom done by hand anymore (with the notable exception of introductory statistics students). The SAS and R code that can be used to solve this problem is given in Table 5.2. It is obvious that the cumulative density function can be used to calculate the area of the normal curve between any two values. For example, what proportion of the population has IQs between 90 and 120? Here we have two X values, the lower ($X_L$) equaling 90 and the higher ($X_H$) being 120. Let us first calculate the area under the curve from negative infinity to 120. Translating the raw score to a Z score gives

$$Z_H = \frac{X_H - \bar{X}}{s_X} = \frac{120 - 100}{15} = 1.333,$$

and the area under the standard normal curve from negative infinity to Z = 1.333 is .909. Next we calculate the area under the curve from negative infinity to an IQ of 90. Here,

$$Z_L = \frac{X_L - \bar{X}}{s_X} = \frac{90 - 100}{15} = -0.667,$$

and the area under the standard normal curve from negative infinity to Z = −.667 is .252. Thus far, our situation is identical to panels A and B of Figure 5.3. That is, we have two areas under the curve, each starting at negative infinity. To find the area between 90 and 120, we only need to subtract the smaller area from the larger area. Hence, Prob(90 ≤ IQ ≤ 120) = Prob(IQ ≤ 120) − Prob(IQ ≤ 90) = .909 − .252 = .657, or about 66% of IQ scores will lie between 90 and 120.
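The same quantities can be obtained directly in R; a minimal sketch in the style of Table 5.1, simply reproducing the IQ examples above:

# Area between IQs of 90 and 120 for a normal with mean 100 and SD 15
mu    <- 100
sigma <- 15
pnorm(120, mean = mu, sd = sigma) - pnorm(90, mean = mu, sd = sigma)   # about 0.657

# IQ that cuts off the top 10% of the population
qnorm(0.90, mean = mu, sd = sigma)                                     # about 119.2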

5.3.1.2 Standard Normal Scores, Standard Scores, and Z Scores: Terminological Problems

There is significant confusion, especially among introductory students, among the terms standard scores, standard normal scores, and Z scores. This is no fault of the student; statisticians use the terms equivocally. Let us spend some time to note the different meanings of the terms. Any distribution, no matter what its shape, can be transformed into a distribution with a mean of 0 and a standard deviation of 1 by the application of Equation 5.3. All one has to do is subtract the mean and then divide the result by the standard deviation. The fundamental shape of the distribution will not change. Its location will move from the old mean to 0, and the change in standard deviation is effectively the same as looking at an object under a magnifying glass.

Figure 5.3: Calculating the area between two values of the normal curve.

Figure 5.4: Examples of lognormal distributions.

often seen in modeling the onset or relapse of disorders. The exponential distribution is often used to measure the time for a continuous process to change state. The classic example of an exponential process is the time it takes for a radioactive particle to decay, but it can also be used to model the initiation and termination of biological processes. Counts are, strictly speaking, not continuous variables, but in some cases they may be treated as such. Here, the Poisson distribution is useful for variables like the number of events of interest that occur between two time points. It is used in Poisson regression and log-linear models to analyze count variables.
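For illustration, here is a minimal R sketch of the two distributions just mentioned; the rate parameters are made up:

# Poisson: P(exactly 3 events in an interval), assuming a mean of 2 events
dpois(3, lambda = 2)    # about 0.180

# Exponential: P(a process changes state within 5 time units), assuming rate 0.4
pexp(5, rate = 0.4)     # about 0.865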

5.4 Theoretical Distributions for Statistics

We have discussed distributions in the sense of “scores” that could be measured (or were actually measured) on observations like people, rats, or cell cultures. Distributions, however, are mathematical functions that can be applied to anything. One of their most important applications is to statistics themselves. Instead of imagining that we reach into a hat and randomly pick out a score below X, think of reaching into a hat of means and picking a mean below $\bar{X}$ (see Section X.X). Or perhaps we can deal with the probability of randomly picking a variance greater than 6.2 for some particular situation. The mathematics of the probability of picking scores is the same as those for picking statistics, but with statistics come some special distributions. We mention them here, but deal with them in the chapter on statistical inference.

5.4.1 The t Distribution

The t distribution is a bell-shaped curve that resembles a normal distribution. Indeed, it is impossible on visual inspection to distinguish a t curve from a normal curve. Whereas there is one and only one standard normal curve, there is a whole family of t distributions, each one depending on its degrees of freedom or df (see Section X.X). Those with small degrees of freedom depart most from the normal curve. As the degrees of freedom increase, the t becomes closer and closer to a normal. Technically, the t distribution equals the normal when the degrees of freedom equal infinity, but there is little difference between the two when the df is greater than 30. Figure 5.5 depicts a normal distribution along with three t distributions with different degrees of freedom.

It is useful to think of the t distribution as a substitute for the normal distribution when we do not know a particular variance and instead must estimate it from fallible data. As the degrees of freedom increase, the amount of error in our estimate of the variance decreases, and the t approaches the normal. One of the most important uses of the t distribution is when we compare some types of observed statistics to their hypothesized values. If θ is an observed statistic, E(θ) is the expected or hypothesized value of the statistic, and $\sigma_\theta$ is the standard deviation of the statistic, then for many (but not all) statistics

the quantity
$$\frac{\theta - E(\theta)}{\sigma_\theta}$$

is distributed as a normal. When we substitute $\sigma_\theta$ with a fallible estimate from observed data, then this quantity is distributed as a t distribution. For example, suppose that θ is the difference between the mean of a control group and the mean of an experimental group, and we hypothesize that the difference is 0. Then $\sigma_\theta$ is the standard deviation of the difference between two means. This is precisely the logic behind the t test for two independent samples, one of the statistical tests used most often in neuroscience.
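A minimal R sketch of this logic on simulated data (the group means, SD, and sample sizes are made up for illustration):

# Two-sample t test: (difference in means - 0) / estimated SE of the difference
set.seed(1)
control   <- rnorm(12, mean = 100, sd = 15)
treatment <- rnorm(12, mean = 110, sd = 15)
t.test(treatment, control, var.equal = TRUE)

# The t distribution approaches the normal as df grows
pt(1.96, df = c(5, 30, 1000))   # compare with pnorm(1.96) = 0.975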

5.4.2 The F Distribution

The F distribution is the ratio of two variances. In the analysis of variance or ANOVA (a statistical technique) and in an analysis of variance table (a summary table applicable to a number of statistical techniques), we obtain two different estimates of the same population variance and compute an F statistic. The F distribution has two different degrees of freedom. The first df is for the variance in the numerator and the second is for the one in the denominator. We deal with the F distribution in more detail in regression, ANOVA, and the general linear model (GLM). Some F distributions are presented in Figure 5.6.
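A minimal R sketch of an F statistic as a ratio of two sample variances; the simulated data are made up for illustration:

set.seed(2)
g1 <- rnorm(10)                     # 10 observations, variance estimated with 9 df
g2 <- rnorm(16)                     # 16 observations, variance estimated with 15 df
F_obs <- var(g1) / var(g2)          # ratio of the two variance estimates
F_obs
pf(F_obs, df1 = 9, df2 = 15, lower.tail = FALSE)   # upper-tail area under F(9, 15)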

5.4.3 The Chi Square (χ²) Distribution

The chi square (χ²) distribution is very important in inferential statistics, but it does not have a simple meaning. The most frequent use of the χ² distribution is to compare a distribution predicted from a model (i.e., a hypothesized distribution) to an observed distribution. The χ² statistic gives the discrepancy (in squared units) between the predicted distribution and the observed distribution. The larger the value of χ², the greater the discrepancy between the two. Like the t distribution, there is a family of χ² distributions, each associated with its degrees of freedom. Figure 5.7 illustrates some chi square distributions.
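A minimal R sketch of this use of χ²; the observed counts and hypothesized proportions are made up for illustration:

# Compare observed counts in three categories with a hypothesized distribution
observed  <- c(18, 30, 12)
predicted <- c(0.25, 0.50, 0.25)        # model-predicted proportions
chisq.test(observed, p = predicted)     # larger X-squared = larger discrepancy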


Figure 5.6: Examples of F distributions (df = 1, 2, 5, and 15).

Lecture 21. The Multivariate Normal Distribution

21.1 Definitions and Comments

The joint moment-generating function of $X_1, \ldots, X_n$ [also called the moment-generating function of the random vector $(X_1, \ldots, X_n)$] is defined by

$$M(t_1, \ldots, t_n) = E[\exp(t_1 X_1 + \cdots + t_n X_n)].$$

Just as in the one-dimensional case, the moment-generating function determines the density uniquely. The random variables $X_1, \ldots, X_n$ are said to have the multivariate normal distribution or to be jointly Gaussian (we also say that the random vector $(X_1, \ldots, X_n)$ is Gaussian) if

$$M(t_1, \ldots, t_n) = \exp(t_1\mu_1 + \cdots + t_n\mu_n)\exp\left(\frac{1}{2}\sum_{i,j=1}^{n} t_i a_{ij} t_j\right),$$

where the $t_i$ and $\mu_j$ are arbitrary real numbers, and the matrix $A = [a_{ij}]$ is symmetric and positive definite.

Before we do anything else, let us indicate the notational scheme we will be using. Vectors will be written with an underbar and are assumed to be column vectors unless otherwise specified. If t is a column vector with components $t_1, \ldots, t_n$, then to save space we write $t = (t_1, \ldots, t_n)'$. The row vector with these components is the transpose of t, written $t'$. The moment-generating function of jointly Gaussian random variables has the form

$$M(t_1, \ldots, t_n) = \exp(t'\mu)\exp\left(\frac{1}{2}t'At\right).$$

We can describe Gaussian random vectors much more concretely.

21.2 Theorem

Jointly Gaussian random variables arise from nonsingular linear transformations of independent normal random variables.

Proof. Let $X_1, \ldots, X_n$ be independent, with $X_i$ normal $(0, \lambda_i)$, and let $X = (X_1, \ldots, X_n)'$. Let $Y = BX + \mu$, where B is nonsingular. Then Y is Gaussian, as can be seen by computing the moment-generating function of Y:

$$M_Y(t) = E[\exp(t'Y)] = E[\exp(t'BX)]\exp(t'\mu).$$

But

$$E[\exp(u'X)] = \prod_{i=1}^{n} E[\exp(u_i X_i)] = \exp\left(\sum_{i=1}^{n} \lambda_i u_i^2/2\right) = \exp\left(\frac{1}{2}u'Du\right),$$

where D is a diagonal matrix with the $\lambda_i$ down the main diagonal. Set $u = B't$, so $u' = t'B$; then

$$M_Y(t) = \exp(t'\mu)\exp\left(\frac{1}{2}t'BDB't\right),$$

and $BDB'$ is symmetric since D is symmetric. Since $t'BDB't = u'Du$, which is greater than 0 except when $u = 0$ (equivalently when $t = 0$, because B is nonsingular), $BDB'$ is positive definite, and consequently Y is Gaussian.

Conversely, suppose that the moment-generating function of Y is $\exp(t'\mu)\exp[(1/2)t'At]$, where A is symmetric and positive definite. Let L be an orthogonal matrix such that $L'AL = D$, where D is the diagonal matrix of eigenvalues of A. Set $X = L'(Y - \mu)$, so that $Y = \mu + LX$. The moment-generating function of X is

$$E[\exp(t'X)] = \exp(-t'L'\mu)\,E[\exp(t'L'Y)].$$

The last term is the moment-generating function of Y with $t'$ replaced by $t'L'$, or equivalently, t replaced by Lt. Thus the moment-generating function of X becomes

$$\exp(-t'L'\mu)\exp(t'L'\mu)\exp\left(\frac{1}{2}t'L'ALt\right).$$

This reduces to

$$\exp\left(\frac{1}{2}t'Dt\right) = \exp\left(\frac{1}{2}\sum_{i=1}^{n} \lambda_i t_i^2\right).$$

Therefore the $X_i$ are independent, with $X_i$ normal $(0, \lambda_i)$. ♣
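The theorem can be checked by simulation. A minimal R sketch, with B, μ, and the λ values made up for illustration:

# Independent X_i ~ normal(0, lambda_i); Y = B X + mu should be Gaussian
# with mean mu and covariance B D B'.
set.seed(3)
lambda <- c(1, 2, 0.5)
D <- diag(lambda)
B <- matrix(c(1, 0.5, 0,
              0, 1,   0.3,
              0.2, 0, 1), nrow = 3, byrow = TRUE)   # nonsingular
mu <- c(1, -2, 0)
X <- sapply(lambda, function(l) rnorm(1e5, mean = 0, sd = sqrt(l)))  # 100000 x 3 draws
Y <- t(B %*% t(X) + mu)                # each row is one draw of Y
colMeans(Y)                            # close to mu
round(cov(Y) - B %*% D %*% t(B), 2)    # close to the zero matrix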

21.3 A Geometric Interpretation

Assume for simplicity that the random variables $X_i$ have zero mean. If $E(U) = E(V) = 0$, then the covariance of U and V is $E(UV)$, which can be regarded as an inner product. Then $Y_1 - \mu_1, \ldots, Y_n - \mu_n$ span an n-dimensional space, and $X_1, \ldots, X_n$ is an orthogonal basis for that space. We will see later in the lecture that orthogonality is equivalent to independence. (Orthogonality means that the $X_i$ are uncorrelated, i.e., $E(X_i X_j) = 0$ for $i \neq j$.)

21.4 Theorem

Let $Y = \mu + LX$ as in the proof of (21.2), and let A be the symmetric, positive definite matrix appearing in the moment-generating function of the Gaussian random vector Y. Then $E(Y_i) = \mu_i$ for all i, and furthermore, A is the covariance matrix of the $Y_i$; in other words, $a_{ij} = \mathrm{Cov}(Y_i, Y_j)$ (and $a_{ii} = \mathrm{Cov}(Y_i, Y_i) = \mathrm{Var}\, Y_i$).

It follows that the means of the $Y_i$ and their covariance matrix determine the moment-generating function, and therefore the density.

21.7 Theorem

If $X_1, \ldots, X_n$ are jointly Gaussian and uncorrelated ($\mathrm{Cov}(X_i, X_j) = 0$ for all $i \neq j$), then the $X_i$ are independent.

Proof. The moment-generating function of $X = (X_1, \ldots, X_n)$ is

$$M_X(t) = \exp(t'\mu)\exp\left(\frac{1}{2}t'Kt\right),$$

where K is a diagonal matrix with entries $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$ down the main diagonal and 0's elsewhere. Thus

$$M_X(t) = \prod_{i=1}^{n} \exp(t_i\mu_i)\exp\left(\frac{1}{2}\sigma_i^2 t_i^2\right),$$

which is the joint moment-generating function of independent random variables $X_1, \ldots, X_n$, where $X_i$ is normal $(\mu_i, \sigma_i^2)$. ♣

21.8 A Conditional Density

Let $X_1, \ldots, X_n$ be jointly Gaussian. We find the conditional density of $X_n$ given $X_1, \ldots, X_{n-1}$:

$$f(x_n \mid x_1, \ldots, x_{n-1}) = \frac{f(x_1, \ldots, x_n)}{f(x_1, \ldots, x_{n-1})}$$

with

$$f(x_1, \ldots, x_n) = (2\pi)^{-n/2}(\det K)^{-1/2}\exp\left[-\frac{1}{2}\sum_{i,j=1}^{n} y_i q_{ij} y_j\right],$$

where $Q = K^{-1} = [q_{ij}]$ and $y_i = x_i - \mu_i$. Also,

$$f(x_1, \ldots, x_{n-1}) = \int_{-\infty}^{\infty} f(x_1, \ldots, x_{n-1}, x_n)\, dx_n = B(y_1, \ldots, y_{n-1}).$$

Now

$$\sum_{i,j=1}^{n} y_i q_{ij} y_j = \sum_{i,j=1}^{n-1} y_i q_{ij} y_j + y_n\sum_{j=1}^{n-1} q_{nj} y_j + y_n\sum_{i=1}^{n-1} q_{in} y_i + q_{nn} y_n^2.$$

Thus the conditional density has the form

$$\frac{A(y_1, \ldots, y_{n-1})}{B(y_1, \ldots, y_{n-1})}\exp\left[-\left(Cy_n^2 + D(y_1, \ldots, y_{n-1})\,y_n\right)\right]$$

with $C = \frac{1}{2}q_{nn}$ and $D = \sum_{j=1}^{n-1} q_{nj} y_j = \sum_{i=1}^{n-1} q_{in} y_i$, since $Q = K^{-1}$ is symmetric. The conditional density may now be expressed as

$$\frac{A}{B}\,\exp\left(\frac{D^2}{4C}\right)\exp\left[-C\left(y_n + \frac{D}{2C}\right)^2\right].$$

We conclude that

given $X_1, \ldots, X_{n-1}$, $X_n$ is normal.

The conditional variance of $X_n$ (the same as the conditional variance of $Y_n = X_n - \mu_n$) is

$$\frac{1}{2C} = \frac{1}{q_{nn}},$$

because $\frac{1}{2\sigma^2} = C$, so $\sigma^2 = \frac{1}{2C}$. Thus

$$\mathrm{Var}(X_n \mid X_1, \ldots, X_{n-1}) = \frac{1}{q_{nn}},$$

and the conditional mean of $Y_n$ is

$$-\frac{D}{2C} = -\frac{1}{q_{nn}}\sum_{j=1}^{n-1} q_{nj} Y_j,$$

so the conditional mean of $X_n$ is

$$E(X_n \mid X_1, \ldots, X_{n-1}) = \mu_n - \frac{1}{q_{nn}}\sum_{j=1}^{n-1} q_{nj}(X_j - \mu_j).$$

Recall from Lecture 18 that $E(Y \mid X)$ is the best estimate of Y based on X, in the sense that the mean square error is minimized. In the jointly Gaussian case, the best estimate of $X_n$ based on $X_1, \ldots, X_{n-1}$ is linear, and it follows that the best linear estimate is in fact the best overall estimate. This has important practical applications, since linear systems are usually much easier than nonlinear systems to implement and analyze.
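A minimal R sketch checking these formulas in the bivariate case; the mean vector and covariance matrix are made up for illustration:

# Conditional mean and variance of X2 given X1 via the precision matrix Q = K^{-1}
mu <- c(100, 50)
K  <- matrix(c(225, 60,
                60, 100), nrow = 2)
Q  <- solve(K)
x1 <- 130
cond_mean <- mu[2] - (1 / Q[2, 2]) * Q[2, 1] * (x1 - mu[1])   # formula from this section
cond_var  <- 1 / Q[2, 2]
# The familiar bivariate-normal regression form gives the same numbers
c(cond_mean, mu[2] + (K[1, 2] / K[1, 1]) * (x1 - mu[1]))       # both 58
c(cond_var,  K[2, 2] - K[1, 2]^2 / K[1, 1])                    # both 84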

Problems

1. Let K be the covariance matrix of arbitrary random variables $X_1, \ldots, X_n$. Assume that K is nonsingular to avoid degenerate cases. Show that K is symmetric and positive definite. What can you conclude if K is singular?
2. If X is a Gaussian n-vector and $Y = AX$ with A nonsingular, show that Y is Gaussian.
3. If $X_1, \ldots, X_n$ are jointly Gaussian, show that $X_1, \ldots, X_m$ are jointly Gaussian for $m \leq n$.
4. If $X_1, \ldots, X_n$ are jointly Gaussian, show that $c_1 X_1 + \cdots + c_n X_n$ is a normal random variable (assuming it is nondegenerate, i.e., not identically constant).