













These notes cover various measures of central tendency and dispersion, including the sample mean, median, mode, percentiles, quartiles, range, interquartile range, population variance, sample variance, standard deviation, Chebyshev's theorem, and the empirical rule. Central tendency refers to identifying the most common value or values in a dataset, while dispersion measures the spread or variability of the data. Examples and formulas are given for calculating each measure.
Asatar Bair, Ph.D.
Department of Economics
City College of San Francisco
abair@ccsf.edu
Lectures on Chapter 3
Central tendency
Central tendency is an important concept;
we want to know if any particular values
are more commonly observed, if the data is
clustered in a certain range
Sample mean:
$\bar{x} = \frac{\sum x}{n}$
Summation sign ($\sum$): it means add up the values
$\sum x = x_1 + x_2 + \dots + x_n$
Population Mean
$\mu = \frac{\sum x}{N}$
The following is a sample of the prices of 20
homes in San Francisco. The data are in
ascending order, in thousands of dollars.
420 455 465 472 512
514 554 575 580 600
625 630 670 670 810
1,250 1,480 2,700 3,400 4,500
Distribution of wealth is skewed
[Figure: bar chart for 2004 comparing the bottom 90% and the top 1%; vertical axis from $0 to $20 trillion]
http://www.federalreserve.gov/pubs/feds/2006/200613/200613pap.pdf
Average wealth, bottom 50% = $22,300
Avg. wealth, 50-90% = $313,
Average wealth, top 1% = $15 mil
Bi-polar (bimodal) distributions
Say there’s a land populated exclusively by
gnomes (height: 1.5-2.5 ft.) and giants (height:
9-11 ft.);
the mean (6 ft.) does not accurately describe the
central tendencies of the population data, because
no one in the population is anywhere near 6 ft. tall.
[Figure: bar chart with categories Gnome and Giant]
Rolling a six-sided die;
possible outcomes: 1, 2, 3, 4, 5, 6
mean = 3.5
does this mean you will roll a 3.5?
no: the mean is not a central tendency in the sense that you are
more likely to observe the mean value.
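The die example can be checked in a few lines of Python. This is only a minimal sketch; the variable names are illustrative and do not come from the lecture.

```python
# Possible outcomes of one roll of a fair six-sided die
outcomes = [1, 2, 3, 4, 5, 6]

# Every outcome is equally likely, so the mean is the simple average
mean = sum(outcomes) / len(outcomes)

# The mean is a valid summary, yet it is not a value you can ever roll
mean_is_possible_roll = mean in outcomes

print(mean, mean_is_possible_roll)
```

The mean is a useful long-run average, but here it is not itself an observable outcome.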
Mean with high
variance
Most cultures propagate strong pressures to
conform to prevailing standards of behavior,
appearance, etc.
Standards change, leading to a constant
search for what is normal;
“Most Americans have an above-average
number of legs”;
number of Americans with 3+ legs = 0
number of Americans with 1 or 0 legs > 0
this means the mean number of legs < 2
“I’m above average!”
Median
Mode and central tendency
The mode is rarely a good measure of
central tendency;
it fails when a data set is large and there is
either no mode or there are many repeated
values, which may make the mode less
meaningful.
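As a rough illustration, the three measures can be compared on the home price sample using Python's standard `statistics` module. The last price (4500) is an assumption taken from the range calculation in these notes, since the value is cut off in the data listing.

```python
import statistics

# Home prices in thousands of dollars, sorted ascending.
# Assumption: the last value is 4500, as used in the range calculation.
prices = [420, 455, 465, 472, 512, 514, 554, 575, 580, 600,
          625, 630, 670, 670, 810, 1250, 1480, 2700, 3400, 4500]

mean = statistics.mean(prices)      # sum of the values divided by n
median = statistics.median(prices)  # midpoint of the 10th and 11th values
mode = statistics.mode(prices)      # most frequently occurring value

print(mean, median, mode)
```

Under that assumption the mean comes out far above the median, because a few very expensive homes pull it upward, and the mode rests on a single repeated value.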
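To find the pth percentile, sort the data in ascending order and compute the position $i = (p/100)\,n$, where n is the number of data values.
If i is not an integer, round up: the pth percentile is the value in the position of the next integer, e.g. i = 3.2 → 4.
If i is an integer, the pth percentile is the midpoint of the
values in positions i and (i + 1).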
The midpoint of the 18th and 19th data
values (2,700 and 3,400):
$\frac{2700 + 3400}{2} = 3050$
For the 78th percentile of the 20 home prices, i = (78/100)(20) = 15.6.
Round up to 16 (even if i were 15.1 we’d
round up), so the 78th percentile is
the 16th data value: 1,250.
25th percentile: $\frac{512 + 514}{2} = 513$
50th percentile: $\frac{600 + 625}{2} = 612.5$
75th percentile: $\frac{810 + 1250}{2} = 1030$
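The percentile rule used in these slides (round a non-integer position up; take the midpoint of positions i and i + 1 when i is an integer) can be sketched in Python. The position formula i = (p/100)n is an assumption here: it reproduces the 78th percentile and quartile results above, but other textbooks and software use different conventions. The function name `percentile` and the final price of 4500 (taken from the range calculation) are also assumptions.

```python
import math

# Home prices in thousands of dollars, sorted ascending.
# Assumption: the last value is 4500, as used in the range calculation.
prices = [420, 455, 465, 472, 512, 514, 554, 575, 580, 600,
          625, 630, 670, 670, 810, 1250, 1480, 2700, 3400, 4500]


def percentile(sorted_data, p):
    """Return the p-th percentile (0 < p < 100) of data sorted ascending."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(sorted_data)
    i = p * n / 100  # assumed position formula: i = (p/100) * n
    if i != int(i):
        # i is not an integer: round up and take the value in that position
        return sorted_data[math.ceil(i) - 1]
    # i is an integer: midpoint of the values in positions i and i + 1
    i = int(i)
    return (sorted_data[i - 1] + sorted_data[i]) / 2


for p in (25, 50, 75, 78):
    print(p, percentile(prices, p))
```

Positions are 1-based in the slides, so the code subtracts 1 when indexing the Python list.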
Measuring variability
The concept of variability is an important one
in statistics and probability, for it’s one of the
foundations of the concepts of risk and
uncertainty;
the minimum variability would be a set of
numbers that does not change at all --
essentially the same number repeated;
when the data do vary, we want to know:
how much?
because we’re not accustomed to thinking
about variability, the number can be hard to
interpret.
Range
The simplest measure of variability:
Range = Largest value - Smallest value
very sensitive to extreme highs and lows
Range of home prices = 4,500 - 420 = 4,080
Interquartile range
This solves the problem of high and low
extreme values by considering the
difference between the third and first
quartiles:
$\text{IQR} = Q_3 - Q_1$
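IQR of home prices = 1030 - 513 = 517
Population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample variance:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Population standard deviation = $\sigma = \sqrt{\sigma^2}$
Sample standard deviation = $s = \sqrt{s^2}$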
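The variance and standard deviation formulas can be traced step by step on the home price data. This is a sketch under two assumptions: the standard definitions are used (divide by N for a population, by n - 1 for a sample), and the last price is 4500, as in the range calculation.

```python
import math

# Home prices in thousands of dollars.
# Assumption: the last value is 4500, as used in the range calculation.
prices = [420, 455, 465, 472, 512, 514, 554, 575, 580, 600,
          625, 630, 670, 670, 810, 1250, 1480, 2700, 3400, 4500]

n = len(prices)
mean = sum(prices) / n

# Squared deviation of each value from the mean
squared_deviations = [(x - mean) ** 2 for x in prices]

# Treating the 20 homes as a whole population: divide by N
population_variance = sum(squared_deviations) / n
# Treating them as a sample: divide by n - 1
sample_variance = sum(squared_deviations) / (n - 1)

# Each standard deviation is the square root of its variance
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(population_variance, sample_variance)
print(population_sd, sample_sd)
```

Dividing by n - 1 makes the sample variance slightly larger than the population variance computed from the same numbers.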
Using Excel
The output for the home price data looks
like this:
[Figure: Excel output for the home price data]
Coefficient of Variation
This statistic looks at the standard
deviation in relation to the mean:
$\text{Coefficient of variation} = \frac{\text{Standard deviation}}{\text{Mean}}$
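For the home price sample, the ratio can be computed as below. This sketch assumes the sample standard deviation is the one intended and that the last price is 4500, as in the range calculation.

```python
import statistics

# Home prices in thousands of dollars.
# Assumption: the last value is 4500, as used in the range calculation.
prices = [420, 455, 465, 472, 512, 514, 554, 575, 580, 600,
          625, 630, 670, 670, 810, 1250, 1480, 2700, 3400, 4500]

mean = statistics.mean(prices)
sd = statistics.stdev(prices)  # sample standard deviation (n - 1 divisor)

# Coefficient of variation: standard deviation relative to the mean
cv = sd / mean

print(round(cv, 3))
```

Because the ratio has no units, it can be used to compare the variability of data sets measured on different scales.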
Distribution shape
Looking at the shape of the frequency
histogram gives us information about the
central tendency or tendencies of the data
There are many kinds of histograms:
symmetric, skewed, uniform, and bi-polar
are some examples
Symmetric Distribution
[Figure: histogram; vertical axis 0 to 100, class intervals 1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35]
Skewed-right Distribution
[Figure: histogram; vertical axis 0 to 100, class intervals 1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35]
Look at the “tail” of the
distribution; a longer tail on
the right side means the
distribution is skewed right.
Skewed-left Distribution
[Figure: histogram; vertical axis 0 to 100, class intervals 1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35]
Uniform Distribution
[Figure: histogram; vertical axis 0 to 100, class intervals 1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35]
Bi-polar Distribution
[Figure: histogram; vertical axis 0 to 100, class intervals 1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35]
Chebyshev’s Theorem
The proportion of data values
within z standard
deviations of the mean is at
least equal to:
$\left(1 - \frac{1}{z^2}\right)$
for all z > 1
Chebyshev’s Theorem
For z = 2, at least 0.75, or 75%, of the data
values are within 2 standard deviations of
the mean (above or below):
$\left(1 - \frac{1}{z^2}\right) = \left(1 - \frac{1}{2^2}\right) = \left(1 - \frac{1}{4}\right) = \frac{3}{4} = 0.75$
Chebyshev’s Theorem
The incredible thing about
Chebyshev’s theorem is that it
holds for all distributions,
regardless of the shape.
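A quick check on the strongly skewed home price data illustrates the point. This is a sketch only: the helper names are hypothetical, the 20 prices are treated as the whole data set (so the population standard deviation is used), and the last price is assumed to be 4500, as in the range calculation.

```python
import statistics

# Home prices in thousands of dollars.
# Assumption: the last value is 4500, as used in the range calculation.
prices = [420, 455, 465, 472, 512, 514, 554, 575, 580, 600,
          625, 630, 670, 670, 810, 1250, 1480, 2700, 3400, 4500]

mean = statistics.mean(prices)
sd = statistics.pstdev(prices)  # population standard deviation of this set


def chebyshev_bound(z):
    """Minimum proportion of data within z standard deviations (z > 1)."""
    if z <= 1:
        raise ValueError("the bound applies for z > 1")
    return 1 - 1 / z ** 2


def proportion_within(data, z):
    """Observed share of values within z standard deviations of the mean."""
    inside = [x for x in data if abs(x - mean) <= z * sd]
    return len(inside) / len(data)


for z in (2, 3):
    print(z, chebyshev_bound(z), proportion_within(prices, z))
```

The observed proportions sit above the bound in both cases, which is all the theorem promises; it gives a floor, not an estimate.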
Chebyshev’s Theorem
A more formal version:
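$P(|X - \mu| \ge k) \le \frac{\sigma^2}{k^2}$
for a random variable X with mean $\mu$ and variance $\sigma^2$, and any k > 0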
Empirical rule
We can use a different rule for data which
seem to have the bell-shaped, or normal
distribution;
about 68% of the data are within 1
standard deviation from the mean;
about 95% are within 2;
nearly all are within 3.
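These percentages can be recovered from the normal distribution itself. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution within k standard deviations of the mean
shares = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in shares.items():
    print(k, round(share, 4))
```

The rule is only as good as the bell-shaped assumption; for skewed data such as the home prices, Chebyshev's theorem is the safer guide.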
Empirical rule
[Figure: bell curve with bands marking 68%, 95%, and nearly 100% of the data]
Outliers
Since the overwhelming majority (at least 89% by
Chebyshev’s theorem, almost 100% for bell-shaped data)
of the data lie within 3
standard deviations of the mean, it’s good
to review any data points with z-scores
less than -3 or greater than 3;
such points may be outliers;
they could be errors, which should be
removed;
they could be evidence of something unusual
in the data.
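A screening step along these lines might look as follows. The sketch assumes the usual definition of a z-score, z = (x - mean) / standard deviation, uses the sample standard deviation, and takes the last home price to be 4500, as in the range calculation.

```python
import statistics

# Home prices in thousands of dollars.
# Assumption: the last value is 4500, as used in the range calculation.
prices = [420, 455, 465, 472, 512, 514, 554, 575, 580, 600,
          625, 630, 670, 670, 810, 1250, 1480, 2700, 3400, 4500]

mean = statistics.mean(prices)
sd = statistics.stdev(prices)  # sample standard deviation

# z-score: how many standard deviations a value lies from the mean
z_scores = [(x - mean) / sd for x in prices]

# Flag values more than 3 standard deviations from the mean for review
flagged = [x for x, z in zip(prices, z_scores) if abs(z) > 3]

print(flagged)
```

A flagged value is a prompt to look more closely, not proof of an error; an expensive home may simply be an expensive home.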
Five number summary
A quick way of summarizing the data is to
consider the following five numbers: the smallest value, the first quartile, the median, the third quartile, and the largest value.
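For the home price data, the summary can be assembled from the quartiles worked out earlier in these notes. This sketch assumes the five numbers are the smallest value, first quartile, median, third quartile, and largest value, and that the last price is 4500, as in the range calculation; the helper `midpoint` is a hypothetical name.

```python
# Home prices in thousands of dollars, sorted ascending.
# Assumption: the last value is 4500, as used in the range calculation.
prices = [420, 455, 465, 472, 512, 514, 554, 575, 580, 600,
          625, 630, 670, 670, 810, 1250, 1480, 2700, 3400, 4500]


def midpoint(a, b):
    """Average of two neighbouring data values."""
    return (a + b) / 2


smallest = prices[0]
q1 = midpoint(prices[4], prices[5])       # 5th and 6th values
median = midpoint(prices[9], prices[10])  # 10th and 11th values
q3 = midpoint(prices[14], prices[15])     # 15th and 16th values
largest = prices[-1]

five_number_summary = (smallest, q1, median, q3, largest)

# The range and interquartile range follow directly from the summary
data_range = largest - smallest
iqr = q3 - q1

print(five_number_summary, data_range, iqr)
```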
Correlation Coefficient
This measure solves the problem of units
that plagues the covariance:
Sample: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
Population: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
Correlation Coefficient
One benefit of the correlation coefficient is that
it also tells us about the strength of the linear
relationship between x and y:
Correlation coefficient      Relationship between x and y
0                            none
1                            perfect positive
positive, close to zero      weak positive
negative, close to zero      weak negative
Positive linear relationship
Covariance, Correlation coefficient positive
[Figure: scatter plot of y against x]
Negative linear relationship
Covariance, Correlation coefficient negative
[Figure: scatter plot of y against x]
No linear relationship
Covariance, Correlation coefficient zero
[Figure: scatter plot of y against x]
A perfect positive linear relationship
Correlation coefficient = 1
[Figure: scatter plot of y against x]
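The sample correlation coefficient can be sketched in Python as the sample covariance divided by the product of the two sample standard deviations. The covariance formula used here, with an n - 1 divisor, is the standard definition and is an assumption, since it does not appear in the text above. The data are made-up illustration values, not from the lecture.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(values):
    """Sample standard deviation (the covariance of a variable with itself)."""
    return math.sqrt(sample_covariance(values, values))


def correlation(xs, ys):
    """Sample correlation coefficient r_xy = s_xy / (s_x * s_y)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Hypothetical data for illustration only
x = [1, 2, 3, 4, 5]
y_up = [3, 5, 7, 9, 11]      # rises exactly in step with x
y_down = [10, 8, 6, 4, 2]    # falls exactly in step with x
x_rescaled = [100 * v for v in x]  # same x values in different units

print(correlation(x, y_up), correlation(x, y_down))
print(sample_covariance(x, y_up), sample_covariance(x_rescaled, y_up))
print(correlation(x_rescaled, y_up))
```

Rescaling x multiplies the covariance by the same factor but leaves the correlation coefficient unchanged, which is the sense in which it avoids the problem of units.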
Weighted mean
Sometimes the arithmetic mean
does not give us an accurate
measure of the central tendency
of a data set, because certain
values occur with much greater
frequency than others;
for example, with this data, the
arithmetic mean cost would be
$3, but the overwhelming
majority of the time, the cost is $5.
Cost    Purchases
$1      50
$2      120
$3      120
$4      110
$5      1500
Weighted mean
To overcome this problem, we use the
weighted mean:
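$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
where $w_i$ is the weight (here, the number of purchases) for the value $x_i$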
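Applied to the cost and purchase table above, the weighted mean can be computed as follows. This is a minimal sketch that assumes the number of purchases serves as the weight for each cost; the variable names are illustrative.

```python
# Cost in dollars and number of purchases at each cost (from the table)
costs = [1, 2, 3, 4, 5]
purchases = [50, 120, 120, 110, 1500]

# Plain arithmetic mean of the five distinct costs, ignoring frequency
arithmetic_mean = sum(costs) / len(costs)

# Weighted mean: each cost counts in proportion to its purchases
weighted_mean = sum(w * x for w, x in zip(purchases, costs)) / sum(purchases)

print(arithmetic_mean, round(weighted_mean, 2))
```

The weighted mean lands much closer to $5 than the arithmetic mean does, reflecting the fact that most purchases were made at that cost.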