















An insight into a statistical investigation aimed at improving the survival rate and care of British babies at birth. It covers the importance of collecting and analysing large datasets, summarizing data through tables, graphs and diagrams, and the use of MINITAB for data analysis. The investigation focuses on birth-weight data and their relation to whether or not the mother smokes.
An example of a statistical investigation, the Survey of British Births, is used in order to introduce the study of statistics. Tables, graphs and diagrams are shown to provide convenient initial summaries of the information in data.
Statistics is the science that studies the collection and interpretation of numerical data. An example of an investigation using statistics was the Survey of British Births, 1970. The aim of that investigation was to improve the survival rate and the care of British babies at and soon after birth, by collecting and analysing data on new-born babies. Data collected on one or two babies would be of little use, because the items of information being measured would vary substantially from one baby to another. For example, their weights at birth would vary considerably. The aim was to gain information about a population, namely ‘all British births’.
A population is the collection of items under discussion. It may be finite, as in this example, or infinite; it may be real, as here, or hypothetical (as in some of the models we set up later in the book).
It was not practicable to collect full information on every birth in Britain, even for a single year. A smaller population was specified which was expected to be very similar to the whole population of British births. This smaller population consisted of all babies who were born (alive or dead) after the 24th week of gestation, between 0001 hours on Sunday 5 April and 2400 hours on Saturday 11 April 1970. A large amount of information was collected about each baby, its mother, and the circumstances of the birth. Data recorded included the birth-weight and sex of the baby, the place of birth (home, hospital or elsewhere), and whether the mother smoked or not. Some of these data are measured (e.g. weight), whereas others are classified (e.g. sex must be male or female). We call these classified records attributes. Each of the quantities or attributes recorded is called a variate.
A variate is any quantity or attribute whose value varies from one unit of investigation to another.
The units in this example were the individual babies in the population. Note that ‘home’, ‘hospital’ and ‘elsewhere’ are the three possible values of the variate ‘place of birth’. It is convenient to stretch the meaning of the word ‘value’ to include cases like this, rather than invent a new word for variates that are not numerical.
An observation is the value taken by a variate for a particular unit of investigation.
The number of variates used exceeded 100, and the number of babies in the population investigated was over 17 000. There were therefore about 17 000 observations on each of 100-plus variates. However, as is usual in any large or complex statistical investigation, there were ‘missing data’: not every variate was recorded on every baby. Mishaps led to some of these measurements not being available, though in fact there were few such mishaps in this study. Some variates were simply not applicable to all babies, e.g. the cause of death applied only to the dead babies. Even so, the total number of observations – the mass of data – was enormous.
The original ‘raw’ data were collected by filling in questionnaire forms. These data were then transferred to punched cards so that a computer could be used to deal with them. The prime problem with such an amount of data is to summarize them so that they can be interpreted.
Variates differ in nature, and the methods of analysis appropriate to a variate will obviously depend on its nature. We can distinguish between quantitative variates (like the birth-weight of the baby, or the daily number of cigarettes smoked by the mother at a specified time during pregnancy) and qualitative variates, or attributes (such as the sex of the baby or the place of birth).
A quantitative variate is a variate whose values are numerical.
A qualitative variate or attribute is a variate whose values are not numerical.
Quantitative variates can also be divided into two types: they may be continuous, if they can take any value we care to specify within some range, or discrete, if their values change by steps or jumps. Thus birth-weight is continuous, because there is no reason why a baby should not have a weight of 6.943762 lb – even if no scales could measure it this accurately! However, a variate like the number of
draw up frequency tables, one for each population. This time the variate being summarized is weight, a continuous measurement. Instead of looking at the frequency of each variate-value that occurs we first group the values into intervals, that is, subdivisions of the total range of possible values of the variate. In this example the variate is conveniently collected into the class-intervals 1–500, 501–1000, ..., 4501–5000, 5001–5500 grams.
A class-interval is a subdivision of the total range of values which a (continuous) variate may take.
The class-frequency is the number of observations of the variate which fall in a given interval.
The frequency distribution of a (continuous) variate is the set of class-intervals for the variate, together with the associated class-frequencies.
Table 1.2 shows the frequency distribution of birth-weight in the two populations. We cannot make a direct comparison of these two sets of frequencies because the total frequencies in the two populations differ. In order to obtain two sets of figures expressed in the same units, which can be compared, we calculate the relative frequencies as shown.

Table 1.2
Interval (g)                  1–    501–   1001–   1501–   2001–   2501–
Frequency S†                   4      38      72     123     479    1661
Frequency N†                   4      20      25      73     236    1089
Relative frequency S†      0.001   0.005   0.010   0.017   0.067      0.
Relative frequency N†      0.001   0.003   0.004   0.011   0.036      0.

Interval (g)               3001–   3501–   4001–   4501–   5001–
Frequency S†                2831    1560     343      36       2
Frequency N†                2530    1927     551     101       7
Relative frequency S†      0.396   0.218   0.048   0.005      0.
Relative frequency N†      0.385   0.294   0.084   0.015      0.
† S refers to the population of babies whose mothers smoke; N refers to the population of babies whose mothers have never smoked.

Comparison is helped by drawing a graph. The frequency polygon illustrating a set of frequencies, or, as here, a set of relative frequencies, is obtained by plotting class-frequencies or relative frequencies as y-values against the centre-points of class-intervals as x-values. Then the plotted points are joined by straight lines. In Fig. 1.1, the two frequency polygons for the populations are shown on the same graph. One can see that although the two distributions have a similar shape, the distribution of birth-weight for babies whose mothers smoke is to the left of the corresponding distribution for those whose mothers have never smoked. Considered as a whole, the birth-weights for babies whose mothers smoke are rather smaller than those for babies whose mothers have never smoked. There is a great deal of overlapping of the two populations: many individual babies in the ‘smokers’ population have weights greater than individuals in the ‘non-smokers’ population. Nevertheless the difference between the two populations viewed in their entirety is quite distinctive.
The apparent evidence about smoking from this survey is not conclusive. The women chose for themselves whether or not to smoke. There is a possibility, which cannot be excluded and should not be overlooked, that the women who did smoke might be different in other ways (e.g. richer or more nervous) from those who did not. The difference in birth-weight of the babies might be a result of some other unknown factor, not of smoking. Before using results to make claims about possible causes, an investigator should always check up, so far as possible, whether other factors may be influencing the results.
This example has illustrated how simple tables and graphs help to summarize a great deal of information. In the examples which now follow we consider the practical processes of constructing tables and graphs in more detail.
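The relative frequencies in Table 1.2, and the points plotted for a frequency polygon, can be reproduced with a few lines of code. The following is a minimal Python sketch, not part of the original text: the class-frequencies are copied from Table 1.2, while the names `relative_frequencies` and `class_centres` are illustrative, and the centre of an interval such as 1–500 is assumed to be (1 + 500)/2.

```python
# Class-frequencies copied from Table 1.2.
# Intervals are 1-500, 501-1000, ..., 5001-5500 grams.
LOWER_LIMITS = [1, 501, 1001, 1501, 2001, 2501, 3001, 3501, 4001, 4501, 5001]
FREQ_SMOKERS = [4, 38, 72, 123, 479, 1661, 2831, 1560, 343, 36, 2]
FREQ_NEVER_SMOKED = [4, 20, 25, 73, 236, 1089, 2530, 1927, 551, 101, 7]
CLASS_WIDTH = 500  # grams


def relative_frequencies(frequencies):
    """Divide each class-frequency by the total frequency."""
    total = sum(frequencies)
    return [f / total for f in frequencies]


def class_centres(lower_limits, width=CLASS_WIDTH):
    """Centre-point of each class-interval.

    Assumption: the interval written 1-500 runs from 1 to 500 inclusive,
    so its centre is (1 + 500) / 2 = 250.5.
    """
    return [(low + (low + width - 1)) / 2 for low in lower_limits]


rel_s = relative_frequencies(FREQ_SMOKERS)
rel_n = relative_frequencies(FREQ_NEVER_SMOKED)
centres = class_centres(LOWER_LIMITS)

# Each printed row is one plotted point per population: (centre, relative frequency).
for centre, rs, rn in zip(centres, rel_s, rel_n):
    print(f"{centre:7.1f}  S: {rs:.3f}  N: {rn:.3f}")
```

Dividing by the total is what puts the two populations on a common scale, so the two polygons can be drawn on the same axes even though the total frequencies differ.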
[Fig. 1.1: frequency polygons of birth-weight for the two populations. Horizontal axis: Birth-weight, grams; vertical axis: % Frequency.]
The height of the bars is proportional to the frequency of each variate-value. The bars may be thickened, though still centred on the variate-values, to make the bars more obvious. The thickness has no significance. The bars must be kept distinct (cf. the histogram in the next example), to show that the variate-values are distinct.
Measurement of the lengths of 100 eggs of cuckoos gave the following results (all readings in millimetres):

22.5, 20.1, 23.3, 22.9, 23.1, 22.0, 22.3, 23.6, 24.7, 23.7,
24.0, 20.4, 21.3, 22.0, 24.2, 21.7, 21.0, 20.1, 21.9, 21.9,
21.7, 22.6, 20.9, 21.6, 22.2, 22.5, 22.2, 24.3, 22.3, 22.6,
20.1, 22.0, 22.8, 22.0, 22.4, 22.3, 20.6, 22.1, 21.9, 23.0,
22.0, 22.0, 22.1, 22.0, 19.6, 22.8, 22.0, 23.4, 23.8, 23.3,
22.5, 22.3, 21.9, 22.0, 21.7, 23.3, 22.2, 22.3, 22.8, 22.9,
23.7, 22.0, 21.9, 22.2, 24.4, 22.7, 23.3, 24.0, 23.6, 22.1,
21.8, 21.1, 23.4, 23.8, 23.3, 24.0, 23.5, 23.2, 24.0, 22.4,
23.9, 22.0, 23.9, 20.9, 23.8, 25.0, 24.0, 21.7, 23.8, 22.8,
23.1, 23.1, 23.5, 23.0, 23.0, 21.8, 23.0, 23.3, 22.4, 22.4.

Construct a frequency table. Illustrate the data graphically by means of: (a) a stem-and-leaf diagram; (b) a histogram; (c) a cumulative frequency diagram.
With data such as these the practice is to record to the nearest value for the number of decimal places chosen. Thus a recorded value of 22.5 represents a number in the interval 22.45–22.55 mm.
A suitable class-interval must be chosen. The maximum and minimum observations are 25.0 and 19.6. The range of the observations is therefore 25.0 − 19.6 = 5.4. (More precisely it is 25.05 − 19.55 = 5.50, but this is an unnecessary refinement here.) About 10 intervals is usually suitable; division of the range by 10 gives a value of 0.54, which suggests a class-interval of 0.5 mm. The end-points of the interval should be chosen so that no observation can fall on them. They will thus be expressed to one place more of decimals than the actual observations. Suitable intervals are 19.45–19.95, 19.95–20.45, .... The class-centres are 19.7, 20.2, .... An alternative form of specification of the intervals is 19.5–19.9, 20.0–20.4, .... A tally count for the cuckoo-egg measurements is given in the table below.

Interval (mm)    Centre (mm)    Frequency
19.5–19.9        19.                    1
20.0–20.4        20.                    4
20.5–20.9        20.                    3
21.0–21.4        21.                    3
21.5–21.9        21.                   12
22.0–22.4        22.                   27
22.5–22.9        22.                   12
23.0–23.4        23.                   16
23.5–23.9        23.                   12
24.0–24.4        24.                    8
24.5–24.9        24.                    1
25.0–25.4        25.                    1
                                    -----
Total                                 100

(a) An alternative to a tally count table is a stem-and-leaf diagram (or stem plot) invented by J. W. Tukey (American, 1915–2000). It has the virtue of retaining all the information on individual values; we may think of it as a concise way of writing down a set of numbers. A stem-and-leaf diagram for the cuckoo-egg data takes the form:
Length of cuckoo eggs (in tenths of mm)                   Depth
19 |                                                          0
19 | 6                                                        1
20 | 1 4 1 1                                                  5
20 | 9 6 9                                                    8
21 | 3 0 1                                                   11
21 | 7 9 9 7 6 9 9 7 9 8 7 8                                 23
22 | 0 3 0 2 2 3 0 0 4 3 1 0 0 1 0 0 3 0 2 3 0 2 1 4 0 4 4   50
22 | 5 9 6 5 6 8 8 5 8 9 7 8                                 50
23 | 3 1 0 4 3 3 3 4 3 2 1 1 0 0 0 3                         38
23 | 6 7 8 7 6 8 5 9 9 8 8 5                                 22
24 | 0 2 3 4 0 0 0 0                                         10
24 | 7                                                        2
25 | 0                                                        1

Stem unit = 1 mm; Leaf unit = 0.1 mm
Tukey calls each row of the diagram a stem; the stem labels such as 19 or 20 are put to the left of the ruled line while the digits to the right of the line are the leaves on the stems. A measurement 20.1 is represented on the third row, i.e. the stem with label ‘20’, by the digit, or leaf, ‘1’. The remaining three leaves on the same stem in this example represent the measurements 20.4, 20.1 and 20.1. The intervals used are identical with those in the tally count. The diagram above was constructed by reading along the rows of the data and adding the leaves one by one. Intervals should be of 5 units, as for the cuckoo eggs, or of 2 or 10 units.
The column on the right might be used for a frequency count, as in the tally count above. Tukey recommends using it, as we have done, for depth, which is a measure of how deep any observation is in the distribution. The column gives cumulative depths counting from each end of the distribution. Thus, on the ‘21’ stem the cumulative depth is 11 = 1 + 4 + 3 + 3, the sum of the frequency counts of the earlier stems. For the cuckoo-egg data the median (see p. 24) happens to fall on an interval boundary, because it has depth ½(100 + 1) = 50.5. If this does not happen, there will be a central interval that cannot sensibly be cumulated with the totals from each end. The convention is to give the frequency count for that interval in brackets (see the example on p. 90).
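The construction of a stem-and-leaf diagram with a depth column, as described above, can be sketched in Python. This is an illustrative sketch under stated assumptions rather than Tukey's or MINITAB's own procedure: the function name `stem_and_leaf` and its parameter are hypothetical, values are assumed to be recorded to one decimal place, and a row is given a bracketed count only when more than half the observations lie at or beyond it from both ends.

```python
EGG_LENGTHS_MM = [
    22.5, 20.1, 23.3, 22.9, 23.1, 22.0, 22.3, 23.6, 24.7, 23.7,
    24.0, 20.4, 21.3, 22.0, 24.2, 21.7, 21.0, 20.1, 21.9, 21.9,
    21.7, 22.6, 20.9, 21.6, 22.2, 22.5, 22.2, 24.3, 22.3, 22.6,
    20.1, 22.0, 22.8, 22.0, 22.4, 22.3, 20.6, 22.1, 21.9, 23.0,
    22.0, 22.0, 22.1, 22.0, 19.6, 22.8, 22.0, 23.4, 23.8, 23.3,
    22.5, 22.3, 21.9, 22.0, 21.7, 23.3, 22.2, 22.3, 22.8, 22.9,
    23.7, 22.0, 21.9, 22.2, 24.4, 22.7, 23.3, 24.0, 23.6, 22.1,
    21.8, 21.1, 23.4, 23.8, 23.3, 24.0, 23.5, 23.2, 24.0, 22.4,
    23.9, 22.0, 23.9, 20.9, 23.8, 25.0, 24.0, 21.7, 23.8, 22.8,
    23.1, 23.1, 23.5, 23.0, 23.0, 21.8, 23.0, 23.3, 22.4, 22.4,
]


def stem_and_leaf(values, leaves_per_row=5):
    """Return a list of (stem, leaves, depth) rows.

    Assumes values are recorded to one decimal place, so each value is
    converted to a whole number of tenths before being split into a
    stem (units) and a leaf (tenths).
    """
    tenths = [round(v * 10) for v in values]
    rows_per_stem = 10 // leaves_per_row
    first_row = (min(tenths) // 10) * rows_per_stem  # start at the lowest stem
    last_row = max(tenths) // leaves_per_row
    leaves = {row: [] for row in range(first_row, last_row + 1)}
    for t in tenths:  # leaves are added in data order, as in the text
        leaves[t // leaves_per_row].append(t % 10)

    n = len(tenths)
    rows = []
    from_top = 0
    for row in range(first_row, last_row + 1):
        count = len(leaves[row])
        from_top += count
        from_bottom = n - from_top + count
        if from_top > n / 2 and from_bottom > n / 2:
            depth = f"({count})"  # central row: bracketed frequency count
        else:
            depth = str(min(from_top, from_bottom))
        rows.append((row * leaves_per_row // 10, leaves[row], depth))
    return rows


rows = stem_and_leaf(EGG_LENGTHS_MM)
for stem, row_leaves, depth in rows:
    print(f"{stem:2d} | {' '.join(map(str, row_leaves)):<53} {depth:>4}")
```

Working in whole tenths avoids floating-point surprises when a value such as 20.1 is split into stem and leaf.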
observations that fall below the upper end-point of the class-interval. Hence it is the upper end-point that we put in the table and use in the graph.
Upper end-point    Cumulative frequency
19.45                                 0
19.95                                 1
20.45                                 5
20.95                                 8
21.45                                11
21.95                                23
22.45                                50
22.                                  62
23.45                                78
23.95                                90
24.45                                98
24.95                                99
25.                                 100
Figure 1.4 shows the cumulative frequencies plotted against the corresponding end-points. We must always make the first cumulative frequency in our table 0 to give a starting point for our diagram. This 0 will correspond to the upper end-point of the interval below the one that contains the first observation. The diagram that we have drawn in Fig. 1.4 may be called a cumulative frequency polygon, since it consists entirely of straight lines. When a statistician wants to make inferences about a population on the basis of a sample from it, the first step is very often to smooth this polygon into a curve. This will be illustrated in Fig. 4.2 (p. 77).
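The class-frequencies for the cuckoo-egg data and the cumulative frequencies plotted in Fig. 1.4 can be checked with a short Python sketch. It is offered as an illustration only: the function names are hypothetical, the first class is taken to start at a recorded value of 19.5 mm with a class-interval of 0.5 mm as in the text, and each upper end-point is taken to lie 0.05 mm above the largest recorded value in its class.

```python
EGG_LENGTHS_MM = [
    22.5, 20.1, 23.3, 22.9, 23.1, 22.0, 22.3, 23.6, 24.7, 23.7,
    24.0, 20.4, 21.3, 22.0, 24.2, 21.7, 21.0, 20.1, 21.9, 21.9,
    21.7, 22.6, 20.9, 21.6, 22.2, 22.5, 22.2, 24.3, 22.3, 22.6,
    20.1, 22.0, 22.8, 22.0, 22.4, 22.3, 20.6, 22.1, 21.9, 23.0,
    22.0, 22.0, 22.1, 22.0, 19.6, 22.8, 22.0, 23.4, 23.8, 23.3,
    22.5, 22.3, 21.9, 22.0, 21.7, 23.3, 22.2, 22.3, 22.8, 22.9,
    23.7, 22.0, 21.9, 22.2, 24.4, 22.7, 23.3, 24.0, 23.6, 22.1,
    21.8, 21.1, 23.4, 23.8, 23.3, 24.0, 23.5, 23.2, 24.0, 22.4,
    23.9, 22.0, 23.9, 20.9, 23.8, 25.0, 24.0, 21.7, 23.8, 22.8,
    23.1, 23.1, 23.5, 23.0, 23.0, 21.8, 23.0, 23.3, 22.4, 22.4,
]

FIRST_LOWER_TENTHS = 195  # first class starts at a recorded 19.5 mm
WIDTH_TENTHS = 5          # class-interval of 0.5 mm


def class_frequencies(values):
    """Count the observations falling in each class-interval."""
    tenths = [round(v * 10) for v in values]  # whole tenths of a millimetre
    n_classes = (max(tenths) - FIRST_LOWER_TENTHS) // WIDTH_TENTHS + 1
    counts = [0] * n_classes
    for t in tenths:
        counts[(t - FIRST_LOWER_TENTHS) // WIDTH_TENTHS] += 1
    return counts


def cumulative_table(counts):
    """Pair each upper end-point (mm) with its cumulative frequency.

    The first row has cumulative frequency 0; its end-point is the upper
    end-point of the interval below the first occupied one.
    """
    table = [((FIRST_LOWER_TENTHS - 0.5) / 10, 0)]
    running = 0
    for i, count in enumerate(counts):
        running += count
        upper = (FIRST_LOWER_TENTHS + (i + 1) * WIDTH_TENTHS - 0.5) / 10
        table.append((upper, running))
    return table


counts = class_frequencies(EGG_LENGTHS_MM)
table = cumulative_table(counts)
for upper, cumulative in table:
    print(f"{upper:6.2f}  {cumulative:4d}")
```

Joining the printed points by straight lines gives the cumulative frequency polygon.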
[Fig. 1.4: cumulative frequency polygon for the cuckoo-egg data. Horizontal axis: Length of eggs (mm); vertical axis: Cumulative frequency.]
A certain disease affects children in their early years, and sometimes kills them. The frequency table of the age at death, in years, of 95 children dying from this disease is:
Age at death (years)    0–    1–    2–    3–    4–    5–10    Total
Frequency               10    40    20    10     5      10       95
Draw a histogram to represent the data.
Note that the convention with age is to record not to the nearest number of years but to the number of completed years. A child said to be aged 3 years has an age in the interval 3 to 4 years, including the lower end-point but excluding the upper end-point; in symbols, if the age is x, 3 ≤ x < 4. This interval is most conveniently denoted 3–, and this convention has been used in the preceding table. The class-centre of the class-interval 3– is 3.5 years.
The histogram (Fig. 1.5) is drawn in a similar manner to that in Example 1. except that we must take care with the final class-interval 5–10, since it is longer than the other class-intervals. We do not draw a rectangle with height equal to the frequency, 10, over the final interval, because that does not represent the data fairly. The incidence of the disease wanes with age after age 1; we see that there were 10 cases per year of age at age 3, 5 cases per year of age at age 4 and 2
[Fig. 1.5: histogram of age at death. Horizontal axis: Age at death (years); vertical axis: Frequency density (number per year of age).]
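The heights used in a histogram with unequal class-intervals are frequency densities: each frequency divided by the width of its interval. A minimal Python sketch for the age-at-death table follows; the interval end-points are read from the table (with 5–10 taken as 5 ≤ x < 10), and the function name `frequency_densities` is illustrative.

```python
# (lower end-point, upper end-point) in years, read from the frequency table.
AGE_INTERVALS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 10)]
FREQUENCIES = [10, 40, 20, 10, 5, 10]


def frequency_densities(intervals, frequencies):
    """Frequency per unit of the variate (here: deaths per year of age)."""
    return [f / (upper - lower) for (lower, upper), f in zip(intervals, frequencies)]


densities = frequency_densities(AGE_INTERVALS, FREQUENCIES)

# The area of each rectangle (height x width) recovers the class-frequency.
areas = [d * (upper - lower) for d, (lower, upper) in zip(densities, AGE_INTERVALS)]

for (lower, upper), d in zip(AGE_INTERVALS, densities):
    print(f"{lower}-{upper}: {d:g} per year of age")
```

Because area rather than height represents frequency, the final interval 5–10 is drawn with height 2, not 10.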
In a table use is made of the geometrical ordering of rows and columns to exhibit relations between the numbers in the table. In diagrams (which we take to include graphs), the magnitude of numbers is represented geometrically in order to aid comparisons. In this chapter, frequency diagrams have been considered. A simple plot of the values of the observations on a line is often useful when assessing the nature of data and detecting wild observations (see Fig. 5.1, p. 90). Plots of pairs of variates (see scatter diagrams, p. 93) are also useful.
These diagrams are adequate for the projects suggested in this book but the reader will find elaborations of these diagrams, and other types, in the presentation of statistical data in newspapers and magazines. The reader should inspect examples critically and decide whether they are helpful or whether they mislead. As with tables, we give a check-list of points to heed when constructing graphs.
1.3.2 Points to note when constructing graphs
1.4 THE USE OF COMPUTERS
Over the past fifty years the greatest change in the study and practice of statistics, as in many human activities, has been the increasing use of computers. Computation has always been important in statistics. In the early 1950s statisticians carried out their analyses by literally turning the handles of calculating machines. Large investigations used punched card machinery, as was done with the 1970 Survey of British Births, which is discussed earlier in this chapter. Over time, desk calculating machines became electric, then electronic. Simultaneously, the immediate postwar experimental computers were the inspiration for huge mainframe computers occupying a whole room. These decreased in size as they used, in turn, valves, transistors and silicon chips. The two streams came together in the personal computer that we use today.
We see three main uses for computers in the study of statistics:
(1) to facilitate the analysis of data sets which are too large to be investigated by hand;
(2) to aid the drawing of graphs and diagrams, which allow critical study of data before analysis, and improve presentation of results;
(3) to simulate probability models in order to illustrate probability and statistical theory.
As students become more familiar with computers the size of a set of data that can be analysed more conveniently by computer, rather than by hand, becomes smaller and smaller. Yet there is still a place for pencil and paper analysis. The teacher, having just marked thirty scripts, can gain an immediate impression of the overall performance of his pupils by setting out the marks in a stem-and-leaf diagram such as that we describe on p. 8. Scratching down the numbers, to use John Tukey’s phrase, gives an acquaintance with every item of information. Manipulating the data on a computer is more impersonal; odd, and therefore important, items can be missed.
Instructions may be given to MINITAB in two ways. As with Windows, items on a menu bar may be clicked and dialogue boxes completed. We shall call this the menu method. Alternatively, commands may be typed in the Session window. We shall call this the session command method.
Suppose data are entered into Column 1. The column of data is automatically labelled C1. To produce a histogram from the data click Graph on the menu bar. A menu appears; from this click on Histogram. In the dialogue box that appears on the screen select Simple (it is the default setting and probably already selected) and click OK. In the next dialogue box enter C1 under Graph variables, ignoring other choices, and click OK. (MINITAB uses ‘variable’ as an alternative term to ‘column’.) The graph will be displayed. We might express the procedure more concisely as
Graph > Histogram / Select Simple / Enter Graph variable /
The symbol ‘>’ represents movement of the cursor from one word to another and the symbol ‘/’ represents a click on a menu item or on an OK box. Occasionally two successive clicks will be necessary, and this is denoted by ‘//’. The relevant words on the computer screen are shown here in bold type. The alternative method is to use session commands. Click anywhere in the Session window and then click Editor on the menu bar. Click on Enable Session Commands. A prompt ‘MTB>’ appears on the left-hand side of the window. To produce a histogram type
after the prompt and press ‘Return’. Again a graph appears, in its own window. For clarity, we shall write the command words in capitals, but this is not necessary when using MINITAB. Either upper or lower case letters may be used for commands and arguments; only the first four letters of a command word are necessary.
We recommend that readers enable the session commands even if they intend using only the menu approach. If this is done, the session commands equivalent to the menu commands appear in the Session window and provide a record of the analysis.
To enter data into a worksheet, choose say the first column and click the cell below ‘C1’. The name of the variate may be entered. This is usually desirable but is not necessary. Use the arrow down key (↓) to move the cursor to the cell below and enter the first value. Repeat this procedure to enter the whole set.
Data may also be entered through the Session window. For data that are to go into column 1 use
SET C
A prompt ‘DATA>’ appears. Type in the data, separated by spaces or commas, over one or more lines and signal completion by typing END. For data in several columns use READ, which also provokes the DATA prompt.
Of the forms of data that MINITAB offers we shall make use of columns and stored constants. A column may contain numbers or text and is referred to by C plus a number, as in C1 or C23, or by name. A name may be typed at the head of a column or assigned by, for example, the command
NAME C1 ‘Length’
Individual entries in a column may be referred to as C1(1), C3(2), etc.
A stored constant is a single number or a text string. It is referred to by K plus a number, as in K15. It can be given a name by the NAME command. Its value may be displayed by use of commands such as
PRINT K
or
PRINT ‘density’
To discover the names of columns and constants in use, type the session command
INFO
To carry out arithmetic, MINITAB offers a calculator obtained by choosing
Calc > Calculator /
Typical operations are adding two columns and putting the result in a third column, or transforming data by, for example, constructing a new column of values which are the logarithms of an original column of data. Equivalent operations can be carried out by using LET in session commands such as
LET C3 = C1 + C
LET C2 = LOGE(C1)
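For readers working outside MINITAB, the two kinds of column operation just described (adding two columns entry by entry, and forming a column of logarithms) can be sketched in Python. This is an illustration only: the column contents and the names `column_a`, `column_b` and so on are made up, and LOGE is taken here to mean the natural logarithm.

```python
import math

# Hypothetical column contents, for illustration only.
column_a = [1.0, 2.0, 4.0]
column_b = [0.5, 1.5, 2.5]

# Entry-by-entry sum of two columns, stored as a third column.
column_sum = [a + b for a, b in zip(column_a, column_b)]

# Natural logarithm of each entry of a column, stored as a new column.
column_log = [math.log(a) for a in column_a]

print(column_sum)
print([round(x, 4) for x in column_log])
```

As with the worksheet, the original columns are left unchanged and the results are held in new columns.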
1.6 MINITAB: GRAPHS, DIAGRAMS AND TABLES
1.6.1 Graphs and diagrams
MINITAB can be used to produce the bar charts, histograms and stem-and-leaf diagrams that we have described in this chapter. With data in a column of the worksheet, a bar chart is produced by choosing
Graph > Bar Chart / Select Simple / Enter Categorical variable, i.e. column name /
To obtain a histogram choose
Graph > Histogram / Select Simple / Enter Graph variable /
1.7 EXERCISES ON CHAPTER 1
47, 61, 53, 43, 46, 46, 68, 48, 72, 57, 48, 54, 41, 63, 49, 42, 58, 65, 45, 44, 43, 51, 45, 38, 48, 46, 44, 52, 43, 47.
Group these times into a frequency table using eight equal class-intervals, the first of which contains measured times in the range 35–39 seconds. Draw a histogram of the grouped frequency distribution. Which is the modal class? [Note. See Definition 2.8.] (Welsh)

(a) Construct a stem-and-leaf diagram for these data.
(b) What are the advantages of a stem-and-leaf display, when compared with a histogram, for illustrating these data? (IOS)
Height       26  27  28  29  30  31  32  33  34  35  36  37  38
Frequency     1   0   2   1   0   1   3   0   1   0   4   1   2

Height       39  40  41  42  43  44  45  46  47  48  49  50  51
Frequency     6  10  12   8   6  15  17  20  13   9  12   7   8