






This section covers measures of position, including z-scores, percentiles, and quartiles, which determine the relative position of data values within a set. It also covers identifying outliers using these measures.
In Section 3.1, we determined measures of central tendency, which describe the typical data value. Section 3.2 discussed measures of dispersion, which describe the amount of spread in a set of data. In this section, we discuss measures of position, which describe the relative position of a certain data value within the entire set of data.
➊ Determine and interpret z-scores
➋ Interpret percentiles
➌ Determine and interpret quartiles
➍ Determine and interpret the interquartile range
➎ Check a set of data for outliers
SECTION 3.4 Measures of Position and Outliers 155
If a data value is larger than the mean, the z-score is positive. If a data value is smaller than the mean, the z-score is negative. If the data value equals the mean, the z-score is zero. A z-score measures the number of standard deviations an observation is above or below the mean. For example, a z-score of 1.24 means the data value is 1.24 standard deviations above the mean. A z-score of -2.31 means the data value is 2.31 standard deviations below the mean.
In Other Words The z-score provides a way to compare apples to oranges by converting variables with different centers or spreads to variables with the same center (0) and spread (1).
➊ Determine and Interpret z-Scores

At the end of the 2014 season, the Los Angeles Angels led the American League with 773 runs scored, while the Colorado Rockies led the National League with 755 runs scored. It appears that the Angels are the better run-producing team. However, this comparison is unfair because the two teams play in different leagues. The Angels play in the American League, where the designated hitter bats for the pitcher, whereas the Rockies play in the National League, where the pitcher must bat (pitchers are typically poor hitters). To compare the two teams’ scoring of runs, we need to determine their relative standings in their respective leagues. We can do this using a z-score.
Definition The z-score represents the distance that a data value is from the mean in terms of the number of standard deviations. We find it by subtracting the mean from the data value and dividing this result by the standard deviation. There is both a population z-score and a sample z-score:

Population z-score: \[ z = \frac{x - \mu}{\sigma} \]
Sample z-score: \[ z = \frac{x - \bar{x}}{s} \]

The z-score is unitless. It has mean 0 and standard deviation 1.
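The claim that z-scores have mean 0 and standard deviation 1 can be checked numerically. The short Python sketch below standardizes a small made-up data set (the values are hypothetical and chosen only for illustration) with the population formula, then recomputes the mean and standard deviation of the resulting z-scores. The function name is an assumption of this sketch, not something from the text.

```python
import statistics


def population_z_scores(values):
    """Return the population z-score (x - mu) / sigma for each value."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]


# Hypothetical data, not taken from the text.
data = [2, 4, 4, 4, 5, 5, 7, 9]
z = population_z_scores(data)

# Up to rounding error, the z-scores have mean 0 and standard deviation 1.
print(round(statistics.fmean(z), 10))
print(round(statistics.pstdev(z), 10))
```

Any data set with at least two distinct values would behave the same way; only the units and the original center and spread are removed.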
Now Work Problem 5
EXAMPLE 1 Comparing z-Scores

Problem Determine whether the Los Angeles Angels or the Colorado Rockies had a relatively better run-producing season. The Angels scored 773 runs and play in the American League, where the mean number of runs scored was μ = 677.4 and the standard deviation was σ = 51.7 runs. The Rockies scored 755 runs and play in the National League, where the mean number of runs scored was μ = 640.0 and the standard deviation was σ = 55.9 runs.

Approach To determine which team had the relatively better run-producing season, compute each team’s z-score. The team with the higher z-score had the better season. Because we know the values of the population parameters, compute the population z-score.

Solution We compute each team’s z-score, rounded to two decimal places.

Angels: \[ z = \frac{x - \mu}{\sigma} = \frac{773 - 677.4}{51.7} = 1.85 \]
Rockies: \[ z = \frac{x - \mu}{\sigma} = \frac{755 - 640.0}{55.9} = 2.06 \]

So the Angels had run production 1.85 standard deviations above the mean, while the Rockies had run production 2.06 standard deviations above the mean. Therefore, the Rockies had a relatively better year at scoring runs than the Angels.
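The arithmetic in Example 1 can be reproduced with a few lines of Python. This is only a sketch: the helper name `z_score` is an assumption for illustration, while the runs, means, and standard deviations are the values given in the Problem.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev


# Values from Example 1 (2014 season).
angels = z_score(773, 677.4, 51.7)   # American League
rockies = z_score(755, 640.0, 55.9)  # National League

print(f"Angels:  z = {angels:.2f}")
print(f"Rockies: z = {rockies:.2f}")

# The team with the larger z-score had the relatively better season.
better = "Rockies" if rockies > angels else "Angels"
print(better)
```

Comparing the z-scores rather than the raw run totals is what puts the two leagues on a common scale.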
• The first quartile, denoted Q1, divides the bottom 25% of the data from the top 75%. Therefore, the first quartile is equivalent to the 25th percentile.
• The second quartile, Q2, divides the bottom 50% of the data from the top 50%; it is equivalent to the 50th percentile or the median.
• The third quartile, Q3, divides the bottom 75% of the data from the top 25%; it is equivalent to the 75th percentile.

Figure 18 illustrates the concept of quartiles.

In Other Words The first quartile, Q1, is equivalent to the 25th percentile, P25. The second quartile, Q2, is equivalent to the 50th percentile, P50, which is equivalent to the median, M. Finally, the third quartile, Q3, is equivalent to the 75th percentile, P75.

[Figure 18: A number line running from the smallest data value to the largest data value, split by Q1, Q2 (the median), and Q3 into four parts, each containing 25% of the data.]

Finding Quartiles
Step 1 Arrange the data in ascending order.
Step 2 Determine the median, M, or second quartile, Q2.
Step 3 Divide the data set into halves: the observations below (to the left of) M and the observations above M. The first quartile, Q1, is the median of the bottom half of the data and the third quartile, Q3, is the median of the top half of the data.

In Other Words To find Q2, determine the median of the data set. To find Q1, determine the median of the “lower half” of the data set. To find Q3, determine the median of the “upper half” of the data set.

If the number of observations is odd, do not include the median when determining Q1 and Q3 by hand.

EXAMPLE 3 Finding and Interpreting Quartiles

Problem The Highway Loss Data Institute routinely collects data on collision coverage claims. Collision coverage insures against physical damage to an insured individual’s vehicle. The data in Table 16 represent a random sample of 18 collision coverage claims based on data obtained from the Highway Loss Data Institute. Find and interpret the first, second, and third quartiles for collision coverage claims.

Table 16
$6751 $9908 $3461 $2336 $21,147 $
$189 $1185 $370 $1414 $4668 $
$10,034 $735 $802 $618 $180 $

Approach Follow the steps given above.

Solution
Step 1 The data written in ascending order are given as follows:

$180 $189 $370 $618 $735 $802 $1185 $1414 $1657 $1953 $2332 $2336 $3461 $4668 $6751 $9908 $10,034 $21,147

Step 2 There are n = 18 observations, so the median, or second quartile, Q2, is the mean of the 9th and 10th observations. Therefore,

\[ M = Q_2 = \frac{\$1657 + \$1953}{2} = \$1805 \]

Step 3 The median of the bottom half of the data is the first quartile, Q1. As shown next, the median of these data is the 5th observation, so Q1 = $735.

$180 $189 $370 $618 $735 $802 $1185 $1414 $1657
158 CHAPTER 3 Numerically Summarizing Data
The median of the top half of the data is the third quartile, Q3. As shown next, the median of these data is the 5th observation, so Q3 = $4668.

$1953 $2332 $2336 $3461 $4668 $6751 $9908 $10,034 $21,147
Interpretation Interpret the quartiles as percentiles. For example, 25% of the collision claims are less than or equal to the first quartile, $735, and 75% of the collision claims are greater than $735. Also, 50% of the collision claims are less than or equal to $1805, the second quartile, and 50% of the collision claims are greater than $1805. Finally, 75% of the collision claims are less than or equal to $4668, the third quartile, and 25% of the collision claims are greater than $4668.
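The by-hand procedure for finding quartiles can be sketched in Python as follows. The function name `quartiles_by_halves` is an assumption for illustration. The data are the 18 collision claims in ascending order; the 9th value, 1657, is taken from the median computation in Step 2, and the largest value, 21147, from Table 16.

```python
import statistics


def quartiles_by_halves(values):
    """Quartiles by the median-of-halves method.

    When the number of observations is odd, the median itself is left out
    of both halves, as in the by-hand rule.
    """
    data = sorted(values)                 # Step 1: ascending order
    n = len(data)
    half = n // 2
    q2 = statistics.median(data)          # Step 2: the median
    lower = data[:half]                   # Step 3: bottom half
    upper = data[half + 1:] if n % 2 else data[half:]  # top half
    return statistics.median(lower), q2, statistics.median(upper)


claims = [180, 189, 370, 618, 735, 802, 1185, 1414, 1657,
          1953, 2332, 2336, 3461, 4668, 6751, 9908, 10034, 21147]

q1, q2, q3 = quartiles_by_halves(claims)
print(q1, q2, q3)
```

With these data the function returns the same three quartiles found in Example 3. Other software may report slightly different values because it interpolates between observations instead of taking medians of halves.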
EXAMPLE 4 Finding Quartiles Using Technology

Problem Find the quartiles of the collision coverage claims data in Table 16.

Approach Use both StatCrunch and Minitab to obtain the quartiles. The steps for obtaining quartiles using a TI-83/84 Plus graphing calculator, Minitab, Excel, and StatCrunch are given in the Technology Step-by-Step on pages 160–161.

Solution The results obtained from StatCrunch [Figure 19(a)] agree with our “by hand” solution. In Figure 19(b), notice that the first quartile, 706, and the third quartile, 5189, reported by Minitab disagree with our “by hand” and StatCrunch results. This difference is due to the fact that StatCrunch and Minitab use different algorithms for obtaining quartiles.

Now Work Problem 21(b)

Using Technology Statistical packages may use different formulas for obtaining the quartiles, so results may differ slightly.
Figure 19

(a) [StatCrunch output for the claims data]

(b) Minitab output:

Descriptive statistics: Claim

Variable  N   N*  Mean  SE Mean  StDev  Minimum  Q1   Median  Q3    Maximum
Claim     18  0   3874  1250     5302   180      706  1805    5189  21147
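To illustrate how two quartile algorithms can disagree on the same data, the Python sketch below compares the median-of-halves method with the interpolation rule used by `statistics.quantiles` with `method="exclusive"`, which locates the quartile at position (n + 1)p in the sorted data. That this is the rule Minitab applies is an assumption of this sketch; the code only shows that it reproduces the values 706 and 5189 after rounding. The 9th claim, 1657, is inferred from the median computation in Example 3.

```python
import statistics

claims = [180, 189, 370, 618, 735, 802, 1185, 1414, 1657,
          1953, 2332, 2336, 3461, 4668, 6751, 9908, 10034, 21147]

# Median-of-halves ("by hand"): n = 18 is even, so each half has 9 values.
by_halves = [
    statistics.median(claims[:9]),
    statistics.median(claims),
    statistics.median(claims[9:]),
]

# Interpolation at position (n + 1) * p in the sorted data.
interpolated = statistics.quantiles(claims, n=4, method="exclusive")

print("by halves:   ", by_halves)
print("interpolated:", interpolated)
```

Both methods agree on the median. They differ on Q1 and Q3 because the interpolation rule lands between two observations, while the by-hand rule picks an actual data value.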
➍ Determine and Interpret the Interquartile Range

So far we have discussed three measures of dispersion: range, standard deviation, and variance, none of which is resistant. Quartiles, however, are resistant. For this reason, quartiles are used to define a fourth measure of dispersion.

Definition The interquartile range, IQR, is the range of the middle 50% of the observations in a data set. That is, the IQR is the difference between the third and first quartiles and is found using the formula

IQR = Q3 - Q1
The interpretation of the interquartile range is similar to that of the range and standard deviation. That is, the more spread a set of data has, the higher the interquartile range will be.
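The following Python sketch illustrates what it means for the interquartile range to be resistant. It computes the IQR and the range of the collision claims data, then repeats the calculation after the largest claim is replaced by a value ten times as large. The inflated value is hypothetical, the helper name `iqr` is an assumption, and the quartiles are found by the median-of-halves method.

```python
import statistics


def iqr(values):
    """Interquartile range Q3 - Q1, with quartiles found as medians of halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return statistics.median(upper) - statistics.median(lower)


claims = [180, 189, 370, 618, 735, 802, 1185, 1414, 1657,
          1953, 2332, 2336, 3461, 4668, 6751, 9908, 10034, 21147]

# Hypothetical change: make the largest claim ten times larger.
inflated = claims[:-1] + [211470]

print("IQR:  ", iqr(claims), "->", iqr(inflated))
print("range:", max(claims) - min(claims), "->", max(inflated) - min(inflated))
```

The range changes dramatically while the IQR does not move, because the IQR depends only on the middle 50% of the observations.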
(approximately $175,000), it probably would be an outlier, because this car costs much more than the typical European automobile. The value of this car would be considered unusual because it is not a typical value from the data set. Use the following steps to check for outliers using quartiles.
Checking for Outliers by Using Quartiles
Step 1 Determine the first and third quartiles of the data.
Step 2 Compute the interquartile range.
Step 3 Determine the fences. Fences serve as cutoff points for determining outliers.
    Lower fence = Q1 - 1.5(IQR)
    Upper fence = Q3 + 1.5(IQR)
Step 4 If a data value is less than the lower fence or greater than the upper fence, it is considered an outlier.
EXAMPLE 6 Checking for Outliers

Problem Check the collision coverage claims data in Table 16 for outliers.

Approach Follow the preceding steps. Any data value that is less than the lower fence or greater than the upper fence will be considered an outlier.

Solution
Step 1 The quartiles found in Example 3 are Q1 = $735 and Q3 = $4668.
Step 2 The interquartile range, IQR, is
    IQR = Q3 - Q1 = $4668 - $735 = $3933
Step 3 The lower fence, LF, is
    LF = Q1 - 1.5(IQR) = $735 - 1.5($3933) = -$5164.5
The upper fence, UF, is
    UF = Q3 + 1.5(IQR) = $4668 + 1.5($3933) = $10,567.5
Step 4 There are no observations below the lower fence. However, there is an observation above the upper fence. The claim of $21,147 is an outlier.

Now Work Problem 21(d)
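The four steps for checking outliers translate directly into code. The Python sketch below applies them to the collision claims data, using the quartiles from Example 3. The function name `fences` is an assumption for illustration, and the 9th claim, 1657, is inferred from the median computation in Example 3.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


claims = [180, 189, 370, 618, 735, 802, 1185, 1414, 1657,
          1953, 2332, 2336, 3461, 4668, 6751, 9908, 10034, 21147]

# Step 1: quartiles from Example 3. Steps 2 and 3: IQR and fences.
lower, upper = fences(735, 4668)

# Step 4: any value outside the fences is considered an outlier.
outliers = [x for x in claims if x < lower or x > upper]

print(lower, upper)
print(outliers)
```

A negative lower fence, as here, simply means that no claim can fall below it, since claim amounts cannot be negative.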
Technology Step-by-Step Determining Quartiles
TI-83/84 Plus Follow the same steps given to compute the mean and median from raw data. (Section 3.1)
Minitab Follow the same steps given to compute the mean and median from raw data. (Section 3.1)
Excel
1. Enter the raw data into column A.
2. With the Data Analysis ToolPak enabled, select the Data tab and click on Data Analysis.
3. Select Rank and Percentile from the Data Analysis window. Press OK.
4. With the cursor in the Input Range cell, highlight the data. Press OK.
StatCrunch Follow the same steps given to compute the mean and median from raw data. (Section 3.1)
3.4 Assess Your Understanding

Vocabulary

1. The ________ represents the number of standard deviations an observation is from the mean.
2. The ________ of a data set is a value such that k percent of the observations are less than or equal to the value.
3. ________ divide data sets into fourths.
4. The ________ is the range of the middle 50% of the observations in a data set.

Applying the Concepts

5. Birth Weights Babies born after a gestation period of 32–35 weeks have a mean weight of 2600 grams and a standard deviation of 660 grams. Babies born after a gestation period of 40 weeks have a mean weight of 3500 grams and a standard deviation of 470 grams. Suppose a 34-week gestation period baby weighs 2400 grams and a 40-week gestation period baby weighs 3300 grams. What is the z-score for the 34-week gestation period baby? What is the z-score for the 40-week gestation period baby? Which baby weighs less relative to the gestation period?

6. Birth Weights Babies born after a gestation period of 32–35 weeks have a mean weight of 2600 grams and a standard deviation of 660 grams. Babies born after a gestation period of 40 weeks have a mean weight of 3500 grams and a standard deviation of 470 grams. Suppose a 34-week gestation period baby weighs 3000 grams and a 40-week gestation period baby weighs 3900 grams. What is the z-score for the 34-week gestation period baby? What is the z-score for the 40-week gestation period baby? Which baby weighs less relative to the gestation period?

7. Men versus Women The average 20- to 29-year-old man is 69.6 inches tall, with a standard deviation of 3.0 inches, while the average 20- to 29-year-old woman is 64.1 inches tall, with a standard deviation of 3.8 inches. Who is relatively taller, a 75-inch man or a 70-inch woman?
Source: CDC Vital and Health Statistics, Advance Data, Number 361, July 5, 2005

8. Men versus Women The average 20- to 29-year-old man is 69.6 inches tall, with a standard deviation of 3.0 inches, while the average 20- to 29-year-old woman is 64.1 inches tall, with a standard deviation of 3.8 inches. Who is relatively taller, a 67-inch man or a 62-inch woman?
Source: CDC Vital and Health Statistics, Advance Data, Number 361, July 5, 2005

9. ERA Champions In 2014, Clayton Kershaw of the Los Angeles Dodgers had the lowest earned-run average (ERA is the mean number of runs yielded per nine innings pitched) of any starting pitcher in the National League, with an ERA of 1.77. Also in 2014, Felix Hernandez of the Seattle Mariners had the lowest ERA of any starting pitcher in the American League with an ERA of 2.14. In the National League, the mean ERA in 2014 was 3.430 and the standard deviation was 0.721. In the American League, the mean ERA in 2014 was 3.598 and the standard deviation was 0.762. Which player had the better year relative to his peers? Why?

10. Batting Champions The highest batting average ever recorded in Major League Baseball was by Ted Williams in 1941 when he hit 0.406. That year, the mean and standard deviation for batting average were 0.2806 and 0.0328. In 2014, Jose Altuve was the American League batting champion, with a batting average of 0.341. In 2014, the mean and standard deviation for batting average were 0.2679 and 0.0282. Who had the better year relative to his peers, Williams or Altuve? Why?

11. Swim Ryan Murphy, nephew of the author, swims for the University of California at Berkeley. Ryan’s best time in the 100-meter backstroke is 45.3 seconds. The mean of all NCAA swimmers in this event is 48.62 seconds with a standard deviation of 0.98 second. Ryan’s best time in the 200-meter backstroke is 99.32 seconds. The mean of all NCAA swimmers in this event is 106.58 seconds with a standard deviation of 2. seconds. In which race is Ryan better?

12. Triathlon Roberto finishes a triathlon (750-meter swim, 5-kilometer run, and 20-kilometer bicycle) in 63.2 minutes. Among all men in the race, the mean finishing time was 69.4 minutes with a standard deviation of 8.9 minutes. Zandra finishes the same triathlon in 79.3 minutes. Among all women in the race, the mean finishing time was 84.7 minutes with a standard deviation of 7.4 minutes. Who did better in relation to their gender?

13. School Admissions A highly selective boarding school will only admit students who place at least 1.5 standard deviations above the mean on a standardized test that has a mean of 200 and a standard deviation of 26. What is the minimum score that an applicant must make on the test to be accepted?

14. Quality Control A manufacturer of bolts has a quality-control policy that requires it to destroy any bolts that are more than 2 standard deviations from the mean. The quality-control engineer knows that the bolts coming off the assembly line have
31.5 36.0 37.8 38.4 40.1 42.
34.3 36.3 37.9 38.8 40.6 42.
34.5 37.4 38.0 39.3 41.4 43.
35.5 37.5 38.3 39.5 41.5 47.
Source: www.fueleconomy.gov

(a) Compute the z-score corresponding to the individual who obtained 36.3 miles per gallon. Interpret this result.
(b) Determine the quartiles.
(c) Compute and interpret the interquartile range, IQR.
(d) Determine the lower and upper fences. Are there any outliers?
22. Hemoglobin in Cats The following data represent the hemoglobin (in g/dL) for 20 randomly selected cats.
5.7 8.9 9.6 10.6 11.
7.7 9.4 9.9 10.7 12.
7.8 9.5 10.0 11.0 13.
8.7 9.6 10.3 11.2 13.
Source: Joliet Junior College Veterinarian Technology Program

(a) Compute the z-score corresponding to the hemoglobin of Blackie, 7.8 g/dL. Interpret this result.
(b) Determine the quartiles.
(c) Compute and interpret the interquartile range, IQR.
(d) Determine the lower and upper fences. Are there any outliers?
23. Rate of Return of Google The following data represent the monthly rate of return of Google common stock from its inception in January 2007 through November 2014.

-0.10 -0.02 0.00 0.02 -0.10 0.03 0.04 -0.15 -0.
0.02 0.01 -0.18 -0.10 -0.18 0.14 0.07 -0.01 0.
0.03 0.10 -0.17 -0.10 0.05 0.05 0.08 0.08 -0.
0.06 0.25 -0.07 -0.02 0.10 0.01 0.09 -0.07 0.
0.05 -0.02 0.30 -0.14 0.00 0.05 0.06 -0.08 0.
Source: Yahoo!Finance

(a) Determine and interpret the quartiles.
(b) Check the data set for outliers.
24. CO2 Emissions The following data represent the carbon dioxide emissions from the consumption of energy per capita (total carbon dioxide emissions, in tons, divided by total population) for the countries of Europe.

1.31 5.38 10.36 5.73 3.57 5.40 6.
8.59 9.46 6.48 11.06 7.94 4.63 6.
14.87 9.94 10.06 10.71 15.86 6.93 3.
4.09 9.91 161.57 7.82 8.70 8.33 9.
7.31 16.75 9.95 23.87 7.76 8.
Source: Carbon Dioxide Information Analysis Center

(a) Determine and interpret the quartiles.
(b) Is the observation corresponding to Albania, 1.31, an outlier?
25. Fraud Detection As part of its “Customers First” program, a cellular phone company monitors monthly phone usage. The program identifies unusual use and alerts the customer that their phone may have been used by another person. The data below represent the monthly phone use in minutes of a customer enrolled in this program for the past 20 months. The phone company decides to use the upper fence as the cutoff point for the number of minutes at which the customer should be contacted. What is the cutoff point?

346 345 489 358 471 442 466 505 466 372
442 461 515 549 437 480 490 429 470 516
26. Stolen Credit Card A credit card company has a fraud-detection service that determines if a card has any unusual activity. The company maintains a database of daily charges on a customer’s credit card. Days when the card was inactive are excluded from the database. If a day’s worth of charges appears unusual, the customer is contacted to make sure that the credit card has not been compromised. Use the following daily charges (rounded to the nearest dollar) to determine the amount the daily charges must exceed before the customer is contacted.

143 166 113 188 133 90 89 98 95 112
111 79 46 20 112 70 174 68 101 212
27. Student Survey of Income A survey of 50 randomly selected full-time Joliet Junior College students was conducted during the Fall 2015 semester. In the survey, the students were asked to disclose their weekly income from employment. If the student did not work, $0 was entered.
0 262 0 635 0 0 671 244 521 476
100 650 454 95 12,777 567 310 527 0 67
736 83 159 0 547 188 389 300 719 0
367 316 0 0 181 479 0 82 579 289
375 347 331 281 628 0 203 149 0 403

(a) Check the data set for outliers.
(b) Draw a histogram of the data and label the outliers on the histogram.
(c) Provide an explanation for the outliers.
28. Student Survey of Entertainment Spending A survey of 40 randomly selected full-time Joliet Junior College students was conducted in the Fall 2015 semester. In the survey, the students were asked to disclose their weekly spending on entertainment. The results of the survey are as follows:
21 54 64 33 65 32 21 16 22 39
67 54 22 51 26 14 115 7 80 59
20 33 13 36 36 10 12 101 1000 26
38 8 28 28 75 50 27 35 9 48
(a) Check the data set for outliers.
(b) Draw a histogram of the data and label the outliers on the histogram.
(c) Provide an explanation for the outliers.
29. Pulse Rate Use the results of Problem 21 in Section 3. and Problem 19 in Section 3.2 to compute the z-scores for all the students. Compute the mean and standard deviation of these z-scores.

30. Travel Time Use the results of Problem 22 in Section 3. and Problem 20 in Section 3.2 to compute the z-scores for all the students. Compute the mean and standard deviation of these z-scores.

31. Fraud Detection Revisited Use the fraud-detection data from Problem 25 to do the following.
(a) Determine the standard deviation and interquartile range of the data.
(b) Suppose the month in which the customer used 346 minutes was not actually that customer’s phone. That particular month, the customer did not use her phone at all, so 0 minutes were used. How does changing the observation from 346 to 0 affect the standard deviation and interquartile range? What property does this illustrate?
Explaining the Concepts
32. Write a paragraph that explains the meaning of percentiles.
33. Suppose you received the highest score on an exam. Your friend scored the second-highest score, yet you both were in the 99th percentile. How can this be?
34. Morningstar is a mutual fund rating agency. It ranks a fund’s performance by using one to five stars. A one-star mutual fund is in the bottom 10% of its investment class; a five-star mutual fund is at the 90th percentile of its investment class. Interpret the meaning of a five-star mutual fund.
35. When outliers are discovered, should they always be removed from the data set before further analysis?
36. Mensa is an organization designed for people of high intelligence. One qualifies for Mensa if one’s intelligence is measured at or above the 98th percentile. Explain what this means.
37. Explain the advantage of using z-scores to compare observations from two different data sets.
38. Explain the circumstances for which the interquartile range is the preferred measure of dispersion. What is an advantage that the standard deviation has over the interquartile range?
39. Explain what each quartile represents.