Measuring Central Tendency & Dispersion: Mean, Median, Mode, Percentiles, Quartiles, Range, Study notes of Statistics

These notes cover various measures of central tendency and dispersion, including the sample mean, median, mode, percentiles, quartiles, range, interquartile range, population variance, sample variance, standard deviation, Chebyshev's theorem, and the empirical rule. Central tendency refers to identifying the most common value or values in a dataset, while dispersion measures the spread or variability of the data. Examples and formulas are given for calculating each measure.

Typology: Study notes

Pre 2010

Uploaded on 08/19/2009

koofers-user-myt

Econ 5

Introduction to Statistics

Asatar Bair, Ph.D.

Department of Economics

City College of San Francisco

abair@ccsf.edu

Lectures on Chapter 3

Measures of Location

  • Mean
  • Median
  • Mode
  • Percentiles
  • Quartiles

Mean

The arithmetic mean is one measure of location. It is a measure of central tendency, or an average. It is sometimes referred to as “the average” or “the mean”, although there are other means and averages.

Central tendency

Central tendency is an important concept: we want to know whether any particular values are more commonly observed, and whether the data are clustered in a certain range.

Sample Mean

  • If the data are from a sample, the mean is denoted by:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The summation sign $\sum$ means "add up the values":

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$

Population Mean

  • If the data are from a population, the mean is denoted by:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

Home Prices

The following is a sample of the prices of 20 homes in San Francisco. The data are in ascending order, in thousands of dollars.

  420    455    465    472    512
  514    554    575    580    600
  625    630    670    670    810
1,250  1,480  2,700  3,400  4,500
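As a rough illustration of the sample mean formula, the sketch below averages the 20 home prices. It assumes the last price is 4,500, the largest value used in the range calculation later in these notes, because the listing itself is cut off; the names `home_prices` and `sample_mean` are illustrative only.

```python
# Home prices in thousands of dollars, in ascending order.
# Assumption: the final value is 4,500, taken from the range calculation
# later in these notes, since the listing itself is truncated.
home_prices = [
    420, 455, 465, 472, 512,
    514, 554, 575, 580, 600,
    625, 630, 670, 670, 810,
    1250, 1480, 2700, 3400, 4500,
]


def sample_mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


mean_price = sample_mean(home_prices)

# Count how many homes cost less than the mean.
below_mean = sum(1 for price in home_prices if price < mean_price)

print(f"mean = {mean_price}")  # 21882 / 20 = 1094.1
print(f"{below_mean} of {len(home_prices)} homes are priced below the mean")
```

Most of the homes fall below the mean because a few very expensive homes pull the average upward.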

Distribution of wealth is skewed

[Figure: bar chart of wealth held in 2004 by the bottom 90% and the top 1%; vertical axis from $0 to $20 trillion]

Average wealth, bottom 50% = $22,300

Avg. wealth, 50-90% = $313,

Average wealth, top 1% = $15 mil

Source: http://www.federalreserve.gov/pubs/feds/2006/200613/200613pap.pdf

Bi-polar (bimodal) distributions

Say there’s a land populated exclusively by gnomes (height: 1.5-2.5 ft.) and giants (height: 9-11 ft.).

The mean (6 ft.) does not accurately describe the central tendencies of the population data: nobody in this population is anywhere near 6 ft. tall.

[Figure: histogram of heights with two separate clusters, labeled Gnome and Giant]

Mean with high variance

Rolling a six-sided die:

  • possible outcomes: 1, 2, 3, 4, 5, 6
  • mean = 3.5

Does this mean you will roll a 3.5? No. The mean here is not a central tendency in the sense that you are more likely to observe the mean value.
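A small sketch of the die example, assuming a fair die so that every face is equally likely. It uses exact fractions to show that the mean is not one of the values that can actually be rolled.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]  # faces of a fair six-sided die

# With equally likely faces, the mean is the plain average of the faces.
die_mean = Fraction(sum(outcomes), len(outcomes))

print(float(die_mean))       # 3.5
print(die_mean in outcomes)  # False: no roll ever equals the mean
```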

Average does not mean “normal”

Most cultures propagate strong pressures to conform to prevailing standards of behavior, appearance, etc. Standards change, leading to a constant search for what is normal.

“Most Americans have an above-average number of legs”:

  • number of Americans with 3+ legs = 0
  • number of Americans with 1 or 0 legs > 0
  • this means the mean number of legs < 2

[Cartoon caption: “I’m above average!”]

Median

  • The median of a data set is the value in
the middle when the data items are
arranged in ascending order.
  • If there is an odd number of items, the
median is the value of the middle item.
  • If there is an even number of items, the
median is the midpoint of the values for
the middle two items.
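The odd and even cases above can be sketched as a short function. This is an illustrative implementation, not the course's own code; the home price list assumes the final value is 4,500, as used in the range calculation later in these notes.

```python
def median(values):
    """Middle value of the sorted data; midpoint of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of items: the single middle item.
        return ordered[middle]
    # Even number of items: midpoint of the two middle items.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Home prices in thousands of dollars (last value assumed to be 4,500).
home_prices = [
    420, 455, 465, 472, 512, 514, 554, 575, 580, 600,
    625, 630, 670, 670, 810, 1250, 1480, 2700, 3400, 4500,
]

print(median(home_prices))  # midpoint of the 10th and 11th values: 612.5
print(median([7, 1, 4]))    # odd case: 4
```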

Mode and central tendency

The mode is rarely a good measure of central tendency; it fails when a data set is large and there is either no mode or there are many repeated values, which may make the mode less meaningful.

Percentiles

  • The pth percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more.

To find the pth percentile of a data set:

  • Arrange the data in ascending order.
  • Compute the index i, the position of the pth percentile: i = (p/100) n
  • If i is not an integer, round up: the pth percentile is the value in the position given by the next integer above i (e.g. i = 3.2 rounds up to position 4).
  • If i is an integer, the pth percentile is the midpoint of the values in positions i and (i + 1).
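The procedure above can be sketched in code as follows. This is one possible implementation of the rule in these notes, which differs from the interpolation methods many software packages use by default. It assumes p is a whole number strictly between 0 and 100, and that the last home price is 4,500 (taken from the range calculation later in these notes).

```python
def percentile(values, p):
    """pth percentile using the index rule i = (p / 100) * n from these notes."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # Integer arithmetic avoids floating-point trouble when testing
    # whether i = p * n / 100 is a whole number.
    whole, remainder = divmod(p * n, 100)
    if remainder:
        # i is not an integer: round up to position whole + 1 (list index `whole`).
        return ordered[whole]
    # i is an integer: midpoint of the values in positions i and i + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2


# Home prices in thousands of dollars (last value assumed to be 4,500).
home_prices = [
    420, 455, 465, 472, 512, 514, 554, 575, 580, 600,
    625, 630, 670, 670, 810, 1250, 1480, 2700, 3400, 4500,
]

print(percentile(home_prices, 90))  # i = 18: midpoint of 2,700 and 3,400
print(percentile(home_prices, 78))  # i = 15.6: the 16th value
```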

Example: Home prices

90th Percentile:
i = (p/100) n = (90/100) 20 = 18

Since i is an integer, the 90th percentile is the midpoint of the 18th and 19th data values:

(2,700 + 3,400) / 2 = 3,050

Example: Home prices

78th Percentile:
i = (p/100) n = (78/100) 20 = 15.6

Round up to 16 (even if it were 15.1 we’d round up), so the 78th percentile is the 16th data value: 1,250.

Quartiles

  • Quartiles are specific percentiles
    • First Quartile = 25th Percentile
    • Second Quartile = 50th Percentile = Median
    • Third Quartile = 75th Percentile

For the home prices:

  • 25th percentile: (512 + 514) / 2 = 513
  • 50th percentile: (600 + 625) / 2 = 612.5
  • 75th percentile: (810 + 1,250) / 2 = 1,030

Measuring variability

The concept of variability is an important one in statistics and probability, for it’s one of the foundations of the concept of risk and uncertainty.

The minimum variability would be a set of numbers that does not change at all: essentially the same number repeated.

When the data do vary, we want to know: how much? Because we’re not accustomed to thinking about variability, the number can be hard to interpret.

Range

The simplest measure of variability:

Range = Largest value - Smallest value

It is very sensitive to extreme highs and lows.

Range of home prices = 4,500 - 420 = 4,080

Interquartile range

This solves the problem of high and low extreme values by considering the difference between the third and first quartiles:

IQR = Q3 - Q1

IQR of home prices = 1,030 - 513 = 517
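To see why the interquartile range is less sensitive to extremes than the range, the sketch below recomputes both after inflating the most expensive home. The inflated price of 45,000 is a made-up value for illustration, and the helper `iqr_of_20` hard-codes the quartile positions for exactly 20 observations (5th/6th and 15th/16th values) rather than implementing a general rule.

```python
# Home prices in thousands of dollars (largest value 4,500, as in the range above).
home_prices = [
    420, 455, 465, 472, 512, 514, 554, 575, 580, 600,
    625, 630, 670, 670, 810, 1250, 1480, 2700, 3400, 4500,
]


def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def iqr_of_20(values):
    """IQR for exactly 20 values, using the quartile positions from these notes."""
    ordered = sorted(values)
    if len(ordered) != 20:
        raise ValueError("this sketch only handles 20 observations")
    q1 = (ordered[4] + ordered[5]) / 2    # midpoint of the 5th and 6th values
    q3 = (ordered[14] + ordered[15]) / 2  # midpoint of the 15th and 16th values
    return q3 - q1


# Hypothetical variation: the most expensive home costs ten times as much.
inflated = home_prices[:-1] + [45000]

print(value_range(home_prices), iqr_of_20(home_prices))  # 4080 517.0
print(value_range(inflated), iqr_of_20(inflated))        # 44580 517.0
```

The range grows more than tenfold while the IQR does not move, because the extreme value lies outside the middle half of the data.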

Population Variance

  • If the data are from a population, the variance is denoted by:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Sample Variance

  • If the data are from a sample, the variance is denoted by:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Sample variance: alternative formula

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$

Standard deviation

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Sample standard deviation: $s = \sqrt{s^2}$
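The variance and standard deviation formulas can be checked against each other with a short sketch. The eight data values are hypothetical, chosen so the arithmetic is easy to follow by hand, and the function names are illustrative.

```python
import math


def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the sample mean, dividing by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Alternative formula: (sum of x_i^2 - n * xbar^2) / (n - 1)."""
    n = len(values)
    xbar = sum(values) / n
    return (sum(x ** 2 for x in values) - n * xbar ** 2) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values; mean is 5

print(population_variance(data))             # 32 / 8 = 4.0
print(math.sqrt(population_variance(data)))  # population standard deviation: 2.0
print(sample_variance(data))                 # 32 / 7
print(sample_variance_shortcut(data))        # same value as the line above
print(math.sqrt(sample_variance(data)))      # sample standard deviation
```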

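Sample Variance and Sample Standard Deviation

  • A common question is, why divide by (n - 1)? Why not divide by n?

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Dividing by (n - 1) makes $s^2$ an unbiased estimator of the population variance $\sigma^2$; dividing by n would underestimate it on average.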
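One way to explore the (n - 1) question is to enumerate every possible sample from a tiny population and average the two candidate formulas. The sketch below assumes a population consisting of the six faces of a die and samples of size 2 drawn with replacement; these choices are illustrative and not from the lecture.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]  # assumed population: faces of a die
N = len(population)
mu = Fraction(sum(population), N)
sigma_squared = sum((x - mu) ** 2 for x in population) / N  # true variance


def sum_squared_deviations(sample):
    """Sum of squared deviations of a sample from its own mean (exact)."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)


n = 2
samples = list(product(population, repeat=n))  # all 36 ordered samples

# Average each estimator over every possible sample.
avg_divide_by_n = sum(sum_squared_deviations(s) / n for s in samples) / len(samples)
avg_divide_by_n_minus_1 = (
    sum(sum_squared_deviations(s) / (n - 1) for s in samples) / len(samples)
)

print(sigma_squared)            # 35/12
print(avg_divide_by_n_minus_1)  # 35/12: matches the population variance
print(avg_divide_by_n)          # 35/24: too small on average
```

Averaged over all samples, dividing by (n - 1) recovers the population variance exactly, while dividing by n falls short by a factor of (n - 1)/n.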

Using Excel

The output for the home price data looks like this:

[Figure: Excel output for the home price data]

Coefficient of Variation

This statistic looks at the standard deviation in relation to the mean:

$$\text{Coefficient of variation} = \frac{\text{Standard deviation}}{\text{Mean}} \times 100$$
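A minimal sketch of the coefficient of variation, assuming the sample standard deviation is the one intended (the notes do not say which). The data are hypothetical; rescaling them by 1,000 illustrates that the statistic does not depend on the units.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


data = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical values
rescaled = [x * 1000 for x in data]    # same data in units 1,000 times smaller

print(round(coefficient_of_variation(data), 2))
print(round(coefficient_of_variation(rescaled), 2))  # same as the line above
```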

Distribution shape

Looking at the shape of the frequency histogram gives us information about the central tendency or tendencies of the data.

There are many kinds of histograms: symmetric, skewed, uniform, and bi-polar are some examples.

Symmetric Distribution

[Figure: histogram; frequency axis from 0 to 100, bins 1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35]

Skewed-right Distribution

[Figure: histogram on the same axes]

Look at the “tail” of the distribution; a longer tail on the right side means the distribution is skewed right.

Skewed-left Distribution

[Figure: histogram on the same axes]

Uniform Distribution

[Figure: histogram on the same axes]

Bi-polar Distribution

[Figure: histogram on the same axes]

Chebyshev’s Theorem

The proportion of data values within z standard deviations of the mean is at least equal to

$$1 - \frac{1}{z^2}$$

for all z > 1.

For z = 2, at least 0.75, or 75%, of the data values are within 2 standard deviations of the mean (above or below):

$$1 - \frac{1}{z^2} = 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 0.75$$

The incredible thing about Chebyshev’s theorem is that it holds for all distributions, regardless of the shape.

A more formal version: let c be any constant greater than zero. For any distribution of X with mean μ and variance σ²:

$$P(|X - \mu| \ge c) \le \frac{\sigma^2}{c^2}$$

Setting c = zσ gives $P(|X - \mu| \ge z\sigma) \le 1/z^2$, which is the statement above.
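As a rough check of Chebyshev's bound, the sketch below compares it with the share of home prices that actually fall within z standard deviations of the mean. It treats the 20 prices as a complete population (so it uses the population standard deviation) and assumes the last price is 4,500; both are assumptions made for illustration.

```python
import statistics


def chebyshev_bound(z):
    """Minimum proportion of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("the bound is only informative for z > 1")
    return 1 - 1 / z ** 2


def fraction_within(values, z):
    """Observed share of values within z population standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    inside = sum(1 for x in values if abs(x - mean) <= z * sd)
    return inside / len(values)


# Home prices in thousands of dollars (last value assumed to be 4,500).
home_prices = [
    420, 455, 465, 472, 512, 514, 554, 575, 580, 600,
    625, 630, 670, 670, 810, 1250, 1480, 2700, 3400, 4500,
]

for z in (2, 3):
    print(z, chebyshev_bound(z), fraction_within(home_prices, z))
```

Even for this strongly skewed data set the observed shares stay above the guaranteed minimum.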

Empirical rule

We can use a different rule for data which seem to have the bell-shaped, or normal, distribution:

  • about 68% of the data are within 1 standard deviation of the mean;
  • about 95% are within 2;
  • nearly all are within 3.

[Figure: bell curve with regions marked 68%, 95%, and nearly 100%]
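The percentages in the empirical rule can be reproduced from an exactly normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); real data that only look bell-shaped will match these figures approximately at best.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(z):
    """Probability that a normal value lies within z standard deviations of its mean."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)


for z in (1, 2, 3):
    print(f"within {z} standard deviation(s): {share_within(z):.4f}")
```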

Outliers

Since the overwhelming majority of the data (at least 89% by Chebyshev’s theorem, and almost 100% for bell-shaped data) lie within 3 standard deviations of the mean, it’s good to review any data points with z-scores less than -3 or greater than 3. Such points may be outliers:

  • they could be errors, which should be removed;
  • they could be evidence of something unusual in the data.
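A z-score measures how many standard deviations a value lies from the mean, z = (x - mean) / standard deviation. The sketch below flags home prices whose z-scores fall outside ±3. It assumes the sample standard deviation and a largest price of 4,500; the function names are illustrative.

```python
import statistics


def z_scores(values):
    """Distance of each value from the mean, in sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]


def flag_outliers(values, threshold=3):
    """Values whose z-score is below -threshold or above +threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Home prices in thousands of dollars (last value assumed to be 4,500).
home_prices = [
    420, 455, 465, 472, 512, 514, 554, 575, 580, 600,
    625, 630, 670, 670, 810, 1250, 1480, 2700, 3400, 4500,
]

print(flag_outliers(home_prices))  # candidates to review, not automatic deletions
```

A flagged value is only a candidate for review; whether it is an error or a genuine extreme still has to be judged from the data's context.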

Five number summary

A quick way of summarizing the data is to consider the following five numbers:

  1. Smallest value: 420
  2. First quartile (Q1): 513
  3. Second quartile (Q2): 612.5
  4. Third quartile (Q3): 1,030
  5. Largest value: 4,500

Correlation Coefficient

This measure solves the problem of units that plagues the covariance:

Sample: $r_{xy} = \frac{s_{xy}}{s_x s_y}$

Population: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$

Correlation Coefficient

One benefit of the correlation coefficient is that it also tells us about the strength of the linear relationship between x and y:

Correlation coefficient      Relationship between x and y
0                            none
1                            perfect positive
-1                           perfect negative
positive, close to zero      weak positive
negative, close to zero      weak negative
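The sketch below computes the sample covariance and the sample correlation coefficient for small hypothetical data sets. The covariance formula used here (dividing by n - 1) is the standard sample definition and is an assumption, since these notes do not show it. Rescaling y changes the covariance but leaves the correlation untouched, which is the units problem the coefficient solves.

```python
import math


def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """r_xy = s_xy / (s_x * s_y), using sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


xs = [1, 2, 3, 4]                    # hypothetical data
ys_up = [2, 4, 6, 8]                 # rises exactly in step with x
ys_down = [8, 6, 4, 2]               # falls exactly in step with x
ys_cents = [y * 100 for y in ys_up]  # same data in different units

print(sample_correlation(xs, ys_up))    # approximately 1
print(sample_correlation(xs, ys_down))  # approximately -1
print(sample_covariance(xs, ys_up), sample_covariance(xs, ys_cents))
print(sample_correlation(xs, ys_cents))  # unchanged by the change of units
```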

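Positive linear relationship

Covariance and correlation coefficient are positive.

[Figure: scatter plot of y against x]

Negative linear relationship

Covariance and correlation coefficient are negative.

[Figure: scatter plot of y against x]

No linear relationship

Covariance and correlation coefficient are zero.

[Figure: scatter plot of y against x]

A perfect positive linear relationship

Correlation coefficient = 1

[Figure: scatter plot of y against x]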

Weighted mean

Sometimes the arithmetic mean does not give us an accurate measure of the central tendency of a data set, because certain values occur with much greater frequency than others. For example, with this data, the arithmetic mean cost would be $3, but the overwhelming majority of the time, the cost is $5.

Cost    Purchases
$1         50
$2        120
$3        120
$4        110
$5      1,500

Weighted mean

To overcome this problem, we use the weighted mean:

$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$

where

  • $x_i$ = value of observation i
  • $w_i$ = weight for observation i
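Applied to the cost and purchase table from the previous slide, the weighted mean formula can be sketched as follows, using the number of purchases as the weights. The function name `weighted_mean` is illustrative.

```python
import math

costs = [1, 2, 3, 4, 5]                # dollars
purchases = [50, 120, 120, 110, 1500]  # number of purchases at each cost


def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


plain_mean = sum(costs) / len(costs)
weighted = weighted_mean(costs, purchases)

print(plain_mean)          # 3.0
print(round(weighted, 2))  # 8590 / 1900, about 4.52
```

The weighted mean sits close to $5 because 1,500 of the 1,900 purchases were made at that cost.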