


SIXTH EDITION

Applied Multivariate

Statistical Analysis

RICHARD A. JOHNSON

University of Wisconsin-Madison

DEAN W. WICHERN

Texas A&M University

PEARSON

Prentice Hall

Upper Saddle River, New Jersey 07458

Library of Congress Cataloging-in-Publication Data

Johnson, Richard A.
Applied multivariate statistical analysis / Richard A. Johnson, Dean W. Wichern. - 6th ed.
p. cm.
Includes index.
ISBN 0-13-187715-1

  1. Statistical Analysis

CIP Data Available

Executive Acquisitions Editor: Petra Recter. Vice President and Editorial Director, Mathematics: Christine Hoag. Project Manager: Michael Bell. Production Editor: Debbie Ryan. Senior Managing Editor: Linda Mihatov Behrens. Manufacturing Buyer: Maura Zaldivar. Associate Director of Operations: Alexis Heydt-Long. Marketing Manager: Wayne Parkins. Marketing Assistant: Jennifer de Leeuwerk. Editorial Assistant/Print Supplements Editor: Joanne Wendelken. Art Director: Jayne Conte. Director of Creative Service: Paul Belfanti. Cover Designer: Bruce Kenselaar. Art Studio: Laserwords.

© 2007 Pearson Education, Inc. Pearson Prentice Hall Pearson Education, Inc. Upper Saddle River, NJ 07458

All rights reserved. No part of this book may be reproduced, in any form Or by any means, without permission in writing from the publisher.

Pearson Prentice Hall™ is a trademark of Pearson Education, Inc.

Printed in the United States of America 10 9 8 7 6 5 4 3 2 1

ISBN-13: 978-0-13-187715-1

ISBN-10: 0-13-187715-1

Pearson Education Ltd., London. Pearson Education Australia Pty. Limited, Sydney. Pearson Education Singapore, Pte. Ltd. Pearson Education North Asia Ltd., Hong Kong. Pearson Education Canada, Ltd., Toronto. Pearson Educación de México, S.A. de C.V. Pearson Education-Japan, Tokyo. Pearson Education Malaysia, Pte. Ltd.

To

the memory of my mother and my father.

R. A. J.

To Dorothy, Michael, and Andrew.

D. W. W.

Contents

Supplement 2A: Vectors and Matrices: Basic Concepts 82
    Vectors, 82
    Matrices, 87
Exercises 103
References 110

3 SAMPLE GEOMETRY AND RANDOM SAMPLING 111
3.1 Introduction 111
3.2 The Geometry of the Sample 111
3.3 Random Samples and the Expected Values of the Sample Mean and Covariance Matrix 119
3.4 Generalized Variance 123
    Situations in which the Generalized Sample Variance Is Zero, 129
    Generalized Variance Determined by |R| and Its Geometrical Interpretation, 134
    Another Generalization of Variance, 137
3.5 Sample Mean, Covariance, and Correlation As Matrix Operations 137
3.6 Sample Values of Linear Combinations of Variables 140
Exercises 144
References 148

4 THE MULTIVARIATE NORMAL DISTRIBUTION 149
4.1 Introduction 149
4.2 The Multivariate Normal Density and Its Properties 149
    Additional Properties of the Multivariate Normal Distribution, 156
4.3 Sampling from a Multivariate Normal Distribution and Maximum Likelihood Estimation 168
    The Multivariate Normal Likelihood, 168
    Maximum Likelihood Estimation of μ and Σ, 170
    Sufficient Statistics, 173
The Sampling Distribution of X̄ and S 173
    Properties of the Wishart Distribution, 174
Large-Sample Behavior of X̄ and S 175
Assessing the Assumption of Normality 177
    Evaluating the Normality of the Univariate Marginal Distributions, 177
    Evaluating Bivariate Normality, 182
Detecting Outliers and Cleaning Data 187
    Steps for Detecting Outliers, 189
Transformations to Near Normality 192
    Transforming Multivariate Observations, 195
Exercises 200
References 208

5 INFERENCES ABOUT A MEAN VECTOR 210
5.1 Introduction 210
5.2 The Plausibility of μ0 as a Value for a Normal Population Mean 210
5.3 Hotelling's T² and Likelihood Ratio Tests 216
    General Likelihood Ratio Method, 219
5.4 Confidence Regions and Simultaneous Comparisons of Component Means 220
    Simultaneous Confidence Statements, 223
    A Comparison of Simultaneous Confidence Intervals with One-at-a-Time Intervals, 229
    The Bonferroni Method of Multiple Comparisons, 232
5.5 Large Sample Inferences about a Population Mean Vector 234
5.6 Multivariate Quality Control Charts 239
    Charts for Monitoring a Sample of Individual Multivariate Observations for Stability, 241
    Control Regions for Future Individual Observations, 247
    Control Ellipse for Future Observations, 248
    T²-Chart for Future Observations, 248
    Control Charts Based on Subsample Means, 249
    Control Regions for Future Subsample Observations, 251
5.7 Inferences about Mean Vectors when Some Observations Are Missing 251
5.8 Difficulties Due to Time Dependence in Multivariate Observations 256
Supplement 5A: Simultaneous Confidence Intervals and Ellipses as Shadows of the p-Dimensional Ellipsoids 258
Exercises 261
References 272

6 COMPARISONS OF SEVERAL MULTIVARIATE MEANS 273
6.1 Introduction 273
6.2 Paired Comparisons and a Repeated Measures Design 273
    Paired Comparisons, 273
    A Repeated Measures Design for Comparing Treatments, 279
6.3 Comparing Mean Vectors from Two Populations 284
    Assumptions Concerning the Structure of the Data, 284
    Further Assumptions When n1 and n2 Are Small, 285
    Simultaneous Confidence Intervals, 288
    The Two-Sample Situation When Σ1 ≠ Σ2
    An Approximation to the Distribution of T² for Normal Populations When Sample Sizes Are Not Large, 294
6.4 Comparing Several Multivariate Population Means (One-Way MANOVA) 296
    Assumptions about the Structure of the Data for One-Way MANOVA, 296

    A Summary of Univariate ANOVA
    Multivariate Analysis of Variance (MANOVA)
6.5 Simultaneous Confidence Intervals for Treatment Effects
6.6 Testing for Equality of Covariance Matrices
6.7 Two-Way Multivariate Analysis of Variance
    Univariate Two-Way Fixed-Effects Model with Interaction
    Multivariate Two-Way Fixed-Effects Model with Interaction
6.8 Profile Analysis
6.9 Repeated Measures Designs and Growth Curves
6.10 Perspectives and a Strategy for Analyzing Multivariate Models
Exercises
References

7 MULTIVARIATE LINEAR REGRESSION MODELS
7.1 Introduction
7.2 The Classical Linear Regression Model
7.3 Least Squares Estimation
    Sum-of-Squares Decomposition
    Geometry of Least Squares
    Sampling Properties of Classical Least Squares Estimators
7.4 Inferences About the Regression Model
    Inferences Concerning the Regression Parameters
    Likelihood Ratio Tests for the Regression Parameters
7.5 Inferences from the Estimated Regression Function
    Estimating the Regression Function at z0
    Forecasting a New Observation at z0
7.6 Model Checking and Other Aspects of Regression
    Does the Model Fit?
    Leverage and Influence
    Additional Problems in Linear Regression
7.7 Multivariate Multiple Regression
    Likelihood Ratio Tests for Regression Parameters
    Other Multivariate Test Statistics
    Predictions from Multivariate Multiple Regressions
7.8 The Concept of Linear Regression
    Prediction of Several Variables
    Partial Correlation Coefficient
7.9 Comparing the Two Formulations of the Regression Model
    Mean Corrected Form of the Regression Model
    Relating the Formulations
7.10 Multiple Regression Models with Time Dependent Errors
Supplement 7A: The Distribution of the Likelihood Ratio for the Multivariate Multiple Regression Model
Exercises
References

8 PRINCIPAL COMPONENTS
8.1 Introduction
8.2 Population Principal Components
    Principal Components Obtained from Standardized Variables
    Principal Components for Covariance Matrices with Special Structures
8.3 Summarizing Sample Variation by Principal Components
    The Number of Principal Components
    Interpretation of the Sample Principal Components
    Standardizing the Sample Principal Components
8.4 Graphing the Principal Components
8.5 Large Sample Inferences
    Large Sample Properties of λ̂j and êj
    Testing for the Equal Correlation Structure
8.6 Monitoring Quality with Principal Components
    Checking a Given Set of Measurements for Stability
    Controlling Future Values
Supplement 8A: The Geometry of the Sample Principal Component Approximation
    The p-Dimensional Geometrical Interpretation
    The n-Dimensional Geometrical Interpretation
Exercises
References

9 FACTOR ANALYSIS AND INFERENCE FOR STRUCTURED COVARIANCE MATRICES 430
9.1 Introduction
9.2 The Orthogonal Factor Model
9.3 Methods of Estimation
    The Principal Component (and Principal Factor) Method
    A Modified Approach - the Principal Factor Solution
    The Maximum Likelihood Method
    A Large Sample Test for the Number of Common Factors
9.4 Factor Rotation 504
    Oblique Rotations
9.5 Factor Scores
    The Weighted Least Squares Method
    The Regression Method
9.6 Perspectives and a Strategy for Factor Analysis
Supplement 9A: Some Computational Details for Maximum Likelihood Estimation
    Recommended Computational Scheme
    Maximum Likelihood Estimators of ρ = L_z L_z' + Ψ_z
Exercises
References


Preface

INTENDED AUDIENCE

This book originally grew out of our lecture notes for an "Applied Multivariate Analysis" course offered jointly by the Statistics Department and the School of Business at the University of Wisconsin-Madison. Applied Multivariate Statistical Analysis, Sixth Edition, is concerned with statistical methods for describing and analyzing multivariate data. Data analysis, while interesting with one variable, becomes truly fascinating and challenging when several variables are involved. Researchers in the biological, physical, and social sciences frequently collect measurements on several variables. Modern computer packages readily provide the numerical results to rather complex statistical analyses. We have tried to provide readers with the supporting knowledge necessary for making proper interpretations, selecting appropriate techniques, and understanding their strengths and weaknesses. We hope our discussions will meet the needs of experimental scientists, in a wide variety of subject matter areas, as a readable introduction to the statistical analysis of multivariate observations.

LEVEL

Our aim is to present the concepts and methods of multivariate analysis at a level that is readily understandable by readers who have taken two or more statistics courses. We emphasize the applications of multivariate methods and, consequently, have attempted to make the mathematics as palatable as possible. We avoid the use of calculus. On the other hand, the concepts of a matrix and of matrix manipulations are important. We do not assume the reader is familiar with matrix algebra. Rather, we introduce matrices as they appear naturally in our discussions, and we then show how they simplify the presentation of multivariate models and techniques.

The introductory account of matrix algebra, in Chapter 2, highlights the more important matrix algebra results as they apply to multivariate analysis. The Chapter 2 supplement provides a summary of matrix algebra results for those with little or no previous exposure to the subject. This supplementary material helps make the book self-contained and is used to complete proofs. The proofs may be ignored on the first reading. In this way we hope to make the book accessible to a wide audience.

In our attempt to make the study of multivariate analysis appealing to a large audience of both practitioners and theoreticians, we have had to sacrifice consistency of level. Some sections are harder than others. In particular, we have summarized a voluminous amount of material on regression in Chapter 7. The resulting presentation is rather succinct and difficult the first time through. We hope instructors will be able to compensate for the unevenness in level by judiciously choosing those sections, and subsections, appropriate for their students and by toning them down if necessary.

ORGANIZATION AND APPROACH

The methodological "tools" of multivariate analysis are contained in Chapters 5 through 12. These chapters represent the heart of the book, but they cannot be assimilated without much of the material in the introductory Chapters 1 through 4. Even those readers with a good knowledge of matrix algebra or those willing to accept the mathematical results on faith should, at the very least, peruse Chapter 3, "Sample Geometry," and Chapter 4, "Multivariate Normal Distribution."

Our approach in the methodological chapters is to keep the discussion direct and uncluttered. Typically, we start with a formulation of the population models, delineate the corresponding sample results, and liberally illustrate everything with examples. The examples are of two types: those that are simple and whose calculations can be easily done by hand, and those that rely on real-world data and computer software. These will provide an opportunity to (1) duplicate our analyses, (2) carry out the analyses dictated by exercises, or (3) analyze the data using methods other than the ones we have used or suggested.

The division of the methodological chapters (5 through 12) into three units allows instructors some flexibility in tailoring a course to their needs. Possible sequences for a one-semester (two quarter) course are indicated schematically. Each instructor will undoubtedly omit certain sections from some chapters to cover a broader collection of topics than is indicated by these two choices.

Getting Started: Chapters 1-4

For most students, we would suggest a quick pass through the first four chapters (concentrating primarily on the material in Chapter 1; Sections 2.1, 2.2, 2.3, 2.5, 2.6, and 3.6; and the "assessing normality" material in Chapter 4) followed by a selection of methodological topics. For example, one might discuss the comparison of mean vectors, principal components, factor analysis, discriminant analysis and clustering. The discussions could feature the many "worked out" examples included in these sections of the text. Instructors may rely on diagrams and verbal descriptions to teach the corresponding theoretical developments. If the students have uniformly strong mathematical backgrounds, much of the book can successfully be covered in one term.

We have found individual data-analysis projects useful for integrating material from several of the methods chapters. Here, our rather complete treatments of multivariate analysis of variance (MANOVA), regression analysis, factor analysis, canonical correlation, discriminant analysis, and so forth are helpful, even though they may not be specifically covered in lectures.

CHANGES TO THE SIXTH EDITION

New material. Users of the previous editions will notice several major changes in the sixth edition.

  • Twelve new data sets including national track records for men and women, psychological profile scores, car body assembly measurements, cell phone tower breakdowns, pulp and paper properties measurements, Mali family farm data, stock price rates of return, and Concho water snake data.
  • Thirty-seven new exercises and twenty revised exercises, with many of these exercises based on the new data sets.
  • Four new data based examples and fifteen revised examples.
  • Six new or expanded sections:
  1. Section 6.6 Testing for Equality of Covariance Matrices
  2. Section 11.7 Logistic Regression and Classification
  3. Section 12.5 Clustering Based on Statistical Models
  4. Expanded Section 6.3 to include "An Approximation to the Distribution of T² for Normal Populations When Sample Sizes Are Not Large"
  5. Expanded Sections 7.6 and 7.7 to include Akaike's Information Criterion
  6. Consolidated previous Sections 11.3 and 11.5 on two group discriminant analysis into single Section 11.

Web Site. To make the methods of multivariate analysis more prominent in the text, we have removed the long proofs of Results 7.2, 7.4, 7.10 and 10. and placed them on a web site accessible through www.prenhall.com/statistics. Click on "Multivariate Statistics" and then click on our book. In addition, all full data sets saved as ASCII files that are used in the book are available on the web site.

Instructors' Solutions Manual. An Instructors' Solutions Manual is available on the author's website accessible through www.prenhall.com/statistics. For information on additional for-sale supplements that may be used with the book or additional titles of interest, please visit the Prentice Hall web site at www.prenhall.com.

Chapter 1

ASPECTS OF MULTIVARIATE ANALYSIS

1.1 Introduction

Scientific inquiry is an iterative learning process. Objectives pertaining to the explanation of a social or physical phenomenon must be specified and then tested by gathering and analyzing data. In turn, an analysis of the data gathered by experimentation or observation will usually suggest a modified explanation of the phenomenon. Throughout this iterative learning process, variables are often added or deleted from the study. Thus, the complexities of most phenomena require an investigator to collect observations on many different variables. This book is concerned with statistical methods designed to elicit information from these kinds of data sets. Because the data include simultaneous measurements on many variables, this body of methodology is called multivariate analysis.

The need to understand the relationships between many variables makes multivariate analysis an inherently difficult subject. Often, the human mind is overwhelmed by the sheer bulk of the data. Additionally, more mathematics is required to derive multivariate statistical techniques for making inferences than in a univariate setting. We have chosen to provide explanations based upon algebraic concepts and to avoid the derivations of statistical results that require the calculus of many variables. Our objective is to introduce several useful multivariate techniques in a clear manner, making heavy use of illustrative examples and a minimum of mathematics. Nonetheless, some mathematical sophistication and a desire to think quantitatively will be required.

Most of our emphasis will be on the analysis of measurements obtained without actively controlling or manipulating any of the variables on which the measurements are made. Only in Chapters 6 and 7 shall we treat a few experimental plans (designs) for generating data that prescribe the active manipulation of important variables. Although the experimental design is ordinarily the most important part of a scientific investigation, it is frequently impossible to control the


generation of appropriate data in certain disciplines. (This is true, for example, in business, economics, ecology, geology, and sociology.) You should consult [6] and [7] for detailed accounts of design principles that, fortunately, also apply to multivariate situations.

It will become increasingly clear that many multivariate methods are based upon an underlying probability model known as the multivariate normal distribution. Other methods are ad hoc in nature and are justified by logical or commonsense arguments. Regardless of their origin, multivariate techniques must, invariably, be implemented on a computer. Recent advances in computer technology have been accompanied by the development of rather sophisticated statistical software packages, making the implementation step easier.

Multivariate analysis is a "mixed bag." It is difficult to establish a classification scheme for multivariate techniques that is both widely accepted and indicates the appropriateness of the techniques. One classification distinguishes techniques designed to study interdependent relationships from those designed to study dependent relationships. Another classifies techniques according to the number of populations and the number of sets of variables being studied. Chapters in this text are divided into sections according to inference about treatment means, inference about covariance structure, and techniques for sorting or grouping. This should not, however, be considered an attempt to place each method into a slot. Rather, the choice of methods and the types of analyses employed are largely determined by the objectives of the investigation. In Section 1.2, we list a smaller number of practical problems designed to illustrate the connection between the choice of a statistical method and the objectives of the study. These problems, plus the examples in the text, should provide you with an appreciation of the applicability of multivariate techniques across different fields.
The objectives of scientific investigations to which multivariate methods most naturally lend themselves include the following:

1. Data reduction or structural simplification. The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
2. Sorting and grouping. Groups of "similar" objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
3. Investigation of the dependence among variables. The nature of the relationships among variables is of interest. Are all the variables mutually independent or are one or more variables dependent on the others? If so, how?
4. Prediction. Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables.
5. Hypothesis construction and testing. Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.

We conclude this brief overview of multivariate analysis with a quotation from F. H. C. Marriott [19], page 89. The statement was made in a discussion of cluster analysis, but we feel it is appropriate for a broader range of methods. You should keep it in mind whenever you attempt or read about a data analysis. It allows one to maintain a proper perspective and not be overwhelmed by the elegance of some of the theory:

If the results disagree with informed opinion, do not admit a simple logical interpretation, and do not show up clearly in a graphical presentation, they are probably wrong. There is no magic about numerical methods, and many ways in which they can break down. They are a valuable aid to the interpretation of data, not sausage machines automatically transforming bodies of numbers into packets of scientific fact.

1.2 Applications of Multivariate Techniques

The published applications of multivariate methods have increased tremendously in recent years. It is now difficult to cover the variety of real-world applications of these methods with brief discussions, as we did in earlier editions of this book. However, in order to give some indication of the usefulness of multivariate techniques, we offer the following short descriptions of the results of studies from several disciplines. These descriptions are organized according to the categories of objectives given in the previous section. Of course, many of our examples are multifaceted and could be placed in more than one category.

Data reduction or simplification

  • Using data on several variables related to cancer patient responses to radiotherapy, a simple measure of patient response to radiotherapy was constructed. (See Exercise 1.15.)
  • Track records from many nations were used to develop an index of performance for both male and female athletes. (See [8] and [22].)
  • Multispectral image data collected by a high-altitude scanner were reduced to a form that could be viewed as images (pictures) of a shoreline in two dimensions. (See [23].)
  • Data on several variables relating to yield and protein content were used to create an index to select parents of subsequent generations of improved bean plants. (See [13].)
  • A matrix of tactic similarities was developed from aggregate data derived from professional mediators. From this matrix the number of dimensions by which professional mediators judge the tactics they use in resolving disputes was determined. (See [21].)

Sorting and grouping

  • Data on several variables related to computer use were employed to create clusters of categories of computer jobs that allow a better determination of existing (or planned) computer utilization. (See [2].)
  • Measurements of several physiological variables were used to develop a screening procedure that discriminates alcoholics from nonalcoholics. (See [26].)
  • Data related to responses to visual stimuli were used to develop a rule for separating people suffering from a multiple-sclerosis-caused visual pathology from those not suffering from the disease. (See Exercise 1.14.)


Example 1.1 (A data array) A selection of four receipts from a university bookstore was obtained in order to investigate the nature of book sales. Each receipt provided, among other things, the number of books sold and the total amount of each sale. Let the first variable be total dollar sales and the second variable be number of books sold. Then we can regard the corresponding numbers on the receipts as four measurements on two variables. Suppose the data, in tabular form, are

Variable 1 (dollar sales): 42 52 48 58 Variable 2 (number of books): 4 5 4 3

Using the notation just introduced, we have

x_{11} = 42  x_{21} = 52  x_{31} = 48  x_{41} = 58
x_{12} = 4   x_{22} = 5   x_{32} = 4   x_{42} = 3

and the data array X is

X = \begin{bmatrix} 42 & 4 \\ 52 & 5 \\ 48 & 4 \\ 58 & 3 \end{bmatrix}

with four rows and two columns.

Considering data in the form of arrays facilitates the exposition of the subject matter and allows numerical calculations to be performed in an orderly and efficient manner. The efficiency is twofold, as gains are attained in both (1) describing numerical calculations as operations on arrays and (2) the implementation of the calculations on computers, which now use many languages and statistical packages to perform array operations. We consider the manipulation of arrays of numbers in Chapter 2. At this point, we are concerned only with their value as devices for displaying data.
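As a small illustration of such an array in software (our own sketch, assuming NumPy; it is not code from the book), the receipts of Example 1.1 can be stored with rows as items and columns as variables:

```python
import numpy as np

# Data array X from Example 1.1: row j holds receipt j; column 1 is dollar
# sales, column 2 is number of books, so X[j-1, k-1] corresponds to x_jk.
X = np.array([[42, 4],
              [52, 5],
              [48, 4],
              [58, 3]])

print(X.shape)  # (4, 2): four rows (items) and two columns (variables)
print(X[3, 0])  # 58, i.e. x_41, the dollar sales on the fourth receipt
```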

Descriptive Statistics

A large data set is bulky, and its very mass poses a serious obstacle to any attempt to visually extract pertinent information. Much of the information contained in the data can be assessed by calculating certain summary numbers, known as descriptive statistics. For example, the arithmetic average, or sample mean, is a descriptive statistic that provides a measure of location, that is, a "central value" for a set of numbers. And the average of the squares of the distances of all of the numbers from the mean provides a measure of the spread, or variation, in the numbers.

We shall rely most heavily on descriptive statistics that measure location, variation, and linear association. The formal definitions of these quantities follow.

Let x_{11}, x_{21}, ..., x_{n1} be n measurements on the first variable. Then the arithmetic average of these measurements is

\bar{x}_1 = \frac{1}{n} \sum_{j=1}^{n} x_{j1}

If the n measurements represent a subset of the full set of measurements that might have been observed, then \bar{x}_1 is also called the sample mean for the first variable. We adopt this terminology because the bulk of this book is devoted to procedures designed to analyze samples of measurements from larger collections. The sample mean can be computed from the n measurements on each of the p variables, so that, in general, there will be p sample means:

\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk}, \quad k = 1, 2, \ldots, p \qquad (1-1)

A measure of spread is provided by the sample variance, defined for n measurements on the first variable as

s_1^2 = \frac{1}{n} \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2

where \bar{x}_1 is the sample mean of the x_{j1}'s. In general, for p variables, we have

s_k^2 = \frac{1}{n} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \quad k = 1, 2, \ldots, p \qquad (1-2)

Two comments are in order. First, many authors define the sample variance with a divisor of n - 1 rather than n. Later we shall see that there are theoretical reasons for doing this, and it is particularly appropriate if the number of measurements, n, is small. The two versions of the sample variance will always be differentiated by displaying the appropriate expression.

Second, although the s^2 notation is traditionally used to indicate the sample variance, we shall eventually consider an array of quantities in which the sample variances lie along the main diagonal. In this situation, it is convenient to use double subscripts on the variances in order to indicate their positions in the array. Therefore, we introduce the notation s_{kk} to denote the sample variance computed from measurements on the kth variable, and we have the notational identities

s_k^2 = s_{kk}, \quad k = 1, 2, \ldots, p \qquad (1-3)
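Formulas (1-1) and (1-2) are easy to check numerically. The following sketch (ours, not the book's; it assumes NumPy) applies them to the Example 1.1 receipts, using the divisor n as in (1-2):

```python
import numpy as np

# Receipts from Example 1.1: columns are (dollar sales, number of books).
X = np.array([[42, 4], [52, 5], [48, 4], [58, 3]], dtype=float)
n, p = X.shape

x_bar = X.mean(axis=0)                     # (1-1): equals [50., 4.] here
s_sq = ((X - x_bar) ** 2).sum(axis=0) / n  # (1-2): equals [34., 0.5] here

print(x_bar, s_sq)
```

Note that `np.var(X, axis=0)` gives the same result, since NumPy's default `ddof=0` corresponds to the divisor n; passing `ddof=1` yields the n - 1 version discussed above.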

The square root of the sample variance, \sqrt{s_{kk}}, is known as the sample standard deviation. This measure of variation uses the same units as the observations.

Consider n pairs of measurements on each of variables 1 and 2:

\begin{bmatrix} x_{11} \\ x_{12} \end{bmatrix}, \begin{bmatrix} x_{21} \\ x_{22} \end{bmatrix}, \ldots, \begin{bmatrix} x_{n1} \\ x_{n2} \end{bmatrix}

That is, x_{j1} and x_{j2} are observed on the jth experimental item (j = 1, 2, ..., n). A measure of linear association between the measurements of variables 1 and 2 is provided by the sample covariance

s_{12} = \frac{1}{n} \sum_{j=1}^{n} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2)

or the average product of the deviations from their respective means. If large values for one variable are observed in conjunction with large values for the other variable, and the small values also occur together, s_{12} will be positive. If large values from one variable occur with small values for the other variable, s_{12} will be negative. If there is no particular association between the values for the two variables, s_{12} will be approximately zero.

The sample covariance

s_{ik} = \frac{1}{n} \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k), \quad i = 1, 2, \ldots, p, \quad k = 1, 2, \ldots, p \qquad (1-4)

measures the association between the ith and kth variables. We note that the covariance reduces to the sample variance when i = k. Moreover, s_{ik} = s_{ki} for all i and k.

The final descriptive statistic considered here is the sample correlation coefficient (or Pearson's product-moment correlation coefficient; see [14]). This measure of the linear association between two variables does not depend on the units of measurement. The sample correlation coefficient for the ith and kth variables is defined as

r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}} = \frac{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)}{\sqrt{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2} \sqrt{\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2}}

for i = 1, 2, ..., p and k = 1, 2, ..., p. Note r_{ik} = r_{ki} for all i and k.
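A brief numerical check of (1-4) and of r_{ik} for the Example 1.1 receipts (our own sketch in Python/NumPy, not code from the text):

```python
import numpy as np

# Sample covariance (1-4) and correlation for the Example 1.1 receipts,
# computed with divisor n as in the text.
X = np.array([[42, 4], [52, 5], [48, 4], [58, 3]], dtype=float)
n = X.shape[0]
d = X - X.mean(axis=0)            # deviations from the sample means

S = d.T @ d / n                   # covariance matrix; diagonal holds s_11, s_22
R = S / np.sqrt(np.outer(np.diag(S), np.diag(S)))  # r_ik = s_ik / sqrt(s_ii s_kk)

s12 = S[0, 1]   # equals -1.5: larger sales pair with fewer books on these receipts
r12 = R[0, 1]   # equals -1.5 / sqrt(34 * 0.5), about -0.36
```

`np.corrcoef(X, rowvar=False)` returns the same matrix R, illustrating the remark below that r_{ik} is unaffected by whether n or n - 1 is used as the divisor.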

The sample correlation coefficient is a standardized version of the sample covariance, where the product of the square roots of the sample variances provides the standardization. Notice that r_ik has the same value whether n or n - 1 is chosen as the common divisor for s_ii, s_kk, and s_ik.
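To make definitions (1-4) and (1-5) concrete, here is a small sketch (not part of the text) that computes s_12 and r_12 for a hypothetical pair of variables and verifies that the correlation is unchanged when n - 1 replaces n as the divisor. The data values are made up for illustration only.

```python
# Hypothetical paired measurements on variables 1 and 2 (n = 7 items).
x1 = [3.0, 4.0, 2.0, 6.0, 8.0, 2.0, 5.0]
x2 = [5.0, 5.5, 4.0, 7.0, 10.0, 5.0, 7.5]

n = len(x1)
xbar1 = sum(x1) / n
xbar2 = sum(x2) / n

# Divisor n, as in (1-3) and (1-4).
s11 = sum((a - xbar1) ** 2 for a in x1) / n
s22 = sum((b - xbar2) ** 2 for b in x2) / n
s12 = sum((a - xbar1) * (b - xbar2) for a, b in zip(x1, x2)) / n

# Correlation (1-5): the common divisor cancels in the ratio.
r12 = s12 / (s11 ** 0.5 * s22 ** 0.5)

# Same computation with divisor n - 1: every s is rescaled by n/(n - 1),
# so r12 is unchanged.
r12_alt = (s12 * n / (n - 1)) / (
    (s11 * n / (n - 1)) ** 0.5 * (s22 * n / (n - 1)) ** 0.5
)
```

Because large x_1 values pair with large x_2 values here, s_12 and r_12 come out positive, in line with the discussion above.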

The sample correlation coefficient r_ik can also be viewed as a sample covariance. Suppose the original values x_ji and x_jk are replaced by the standardized values (x_ji - x̄_i)/√s_ii and (x_jk - x̄_k)/√s_kk. The standardized values are commensurable because both sets are centered at zero and expressed in standard deviation units. The sample correlation coefficient is just the sample covariance of the standardized observations.

Although the signs of the sample correlation and the sample covariance are the same, the correlation is ordinarily easier to interpret because its magnitude is bounded. To summarize, the sample correlation r has the following properties:

1. The value of r must be between -1 and +1 inclusive.
2. Here r measures the strength of the linear association. If r = 0, this implies a lack of linear association between the components. Otherwise, the sign of r indicates the direction of the association: r < 0 implies a tendency for one value in the pair to be larger than its average when the other is smaller than its average; and r > 0 implies a tendency for one value of the pair to be large when the other value is large and also for both values to be small together.
3. The value of r_ik remains unchanged if the measurements of the ith variable are changed to y_ji = a x_ji + b, j = 1, 2, ..., n, and the values of the kth variable are changed to y_jk = c x_jk + d, j = 1, 2, ..., n, provided that the constants a and c have the same sign.
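Property 3 is easy to check numerically. The sketch below uses made-up data and constants: any transformation with a and c of the same sign leaves r unchanged, while reversing the sign of one constant reverses the sign of r.

```python
def corr(x, y):
    """Sample correlation (1-5); the divisor cancels, so none is applied."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx ** 0.5 * syy ** 0.5)

# Hypothetical paired measurements.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 9.0]

r = corr(x, y)

# y_j = a*x_j + b with a = 3, and c = 0.5, d = -2: same-sign constants.
r_same_sign = corr([3.0 * a + 10.0 for a in x], [0.5 * b - 2.0 for b in y])

# Now a = -3 and c = 0.5: opposite signs reverse the correlation.
r_opp_sign = corr([-3.0 * a + 10.0 for a in x], [0.5 * b - 2.0 for b in y])
```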


The Organization of Data, 9

The quantities s_ik and r_ik do not, in general, convey all there is to know about the association between two variables. Nonlinear associations can exist that are not revealed by these descriptive statistics. Covariance and correlation provide measures of linear association, or association along a line. Their values are less informative for other kinds of association. On the other hand, these quantities can be very sensitive to "wild" observations ("outliers") and may indicate association when, in fact, little exists. In spite of these shortcomings, covariance and correlation coefficients are routinely calculated and analyzed. They provide cogent numerical summaries of association when the data do not exhibit obvious nonlinear patterns of association and when wild observations are not present.

Suspect observations must be accounted for by correcting obvious recording mistakes and by taking actions consistent with the identified causes. The values of s_ik and r_ik should be quoted both with and without these observations.

The sum of squares of the deviations from the mean and the sum of cross-product deviations are often of interest themselves. These quantities are

    w_kk = Σ_{j=1}^{n} (x_jk - x̄_k)²,    k = 1, 2, ..., p    (1-6)

and

    w_ik = Σ_{j=1}^{n} (x_ji - x̄_i)(x_jk - x̄_k),    i = 1, 2, ..., p,  k = 1, 2, ..., p    (1-7)
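Since (1-4) divides the cross-product sum by n, the quantities in (1-6) and (1-7) satisfy s_kk = w_kk/n and s_ik = w_ik/n. A brief sketch with hypothetical numbers:

```python
# Hypothetical measurements on two variables for n = 3 items.
x1 = [2.0, 3.0, 4.0]
x2 = [1.0, 2.0, 6.0]
n = len(x1)
m1 = sum(x1) / n
m2 = sum(x2) / n

# Sum of squared deviations (1-6) with k = 1, and cross products (1-7)
# with i = 1, k = 2.
w11 = sum((a - m1) ** 2 for a in x1)
w12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))

# Dividing by n recovers the sample variance and covariance of (1-3), (1-4).
s11 = w11 / n
s12 = w12 / n
```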

The descriptive statistics computed from n measurements on p variables can also be organized into arrays.

Arrays of Basic Descriptive Statistics

Sample means:

    x̄ = [ x̄_1 ]
        [ x̄_2 ]
        [  :  ]
        [ x̄_p ]

Sample variances and covariances:

    S_n = [ s_11  s_12  ...  s_1p ]
          [ s_21  s_22  ...  s_2p ]        (1-8)
          [  :     :           :  ]
          [ s_p1  s_p2  ...  s_pp ]

Sample correlations:

    R = [ 1     r_12  ...  r_1p ]
        [ r_21  1     ...  r_2p ]
        [  :     :          :   ]
        [ r_p1  r_p2  ...  1    ]
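The arrays in (1-8) are straightforward to assemble by machine. The following sketch uses the NumPy library and a made-up 4 × 2 data matrix (rows are items, columns are variables) to compute x̄, S_n with divisor n, and R:

```python
import numpy as np

# Hypothetical data matrix X: n = 4 items measured on p = 2 variables.
X = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [5.0, 4.0],
              [7.0, 5.0]])
n, p = X.shape

xbar = X.mean(axis=0)        # p-vector of sample means
D = X - xbar                 # matrix of deviations from the means
Sn = (D.T @ D) / n           # p x p array of variances/covariances, divisor n
d = np.sqrt(np.diag(Sn))     # sample standard deviations
R = Sn / np.outer(d, d)      # sample correlations; diagonal entries are 1
```

Note that S_n is symmetric with the sample variances on its main diagonal, and R has the same pattern with ones on the diagonal, exactly as in (1-8).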


Also shown in Figure 1.1 are separate plots of the observed values of variable 1 and the observed values of variable 2, respectively. These plots are called (marginal) dot diagrams. They can be obtained from the original observations or by projecting the points in the scatter diagram onto each coordinate axis.

The information contained in the single-variable dot diagrams can be used to calculate the sample means x̄_1 and x̄_2 and the sample variances s_11 and s_22. (See Exercise 1.1.) The scatter diagram indicates the orientation of the points, and their coordinates can be used to calculate the sample covariance s_12. In the scatter diagram of Figure 1.1, large values of x_1 occur with large values of x_2 and small values of x_1 with small values of x_2. Hence, s_12 will be positive.

Dot diagrams and scatter plots contain different kinds of information. The information in the marginal dot diagrams is not sufficient for constructing the scatter plot. As an illustration, suppose the data preceding Figure 1.1 had been paired differently, so that the measurements on the variables x_1 and x_2 were as follows:

    Variable 1 (x_1):  5   2   5   ...
    Variable 2 (x_2):  5  10   ...

(We have simply rearranged the values of variable 1.) The scatter and dot diagrams for the "new" data are shown in Figure 1.2. Comparing Figures 1.1 and 1.2, we find that the marginal dot diagrams are the same, but that the scatter diagrams are decidedly different. In Figure 1.2, large values of x_1 are paired with small values of x_2 and small values of x_1 with large values of x_2. Consequently, the descriptive statistics for the individual variables x̄_1, x̄_2, s_11, and s_22 remain unchanged, but the sample covariance s_12, which measures the association between pairs of variables, will now be negative.

The different orientations of the data in Figures 1.1 and 1.2 are not discernible from the marginal dot diagrams alone. At the same time, the fact that the marginal dot diagrams are the same in the two cases is not immediately apparent from the scatter plots. The two types of graphical procedures complement one another; they are not competitors.

The next two examples further illustrate the information that can be conveyed by a graphic display.
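The sign change just described can be verified directly: permuting the values of one variable leaves both marginal summaries (the means and variances) unchanged but can reverse the sign of s_12. A sketch with hypothetical stand-in values (not the actual data of Figures 1.1 and 1.2):

```python
# Originally paired so that large x1 values go with large x2 values.
x1 = [2.0, 3.0, 5.0, 6.0, 8.0]
x2 = [4.0, 5.0, 6.0, 7.0, 10.0]

# Same x1 values, rearranged so large pairs with small.
x1_rearranged = sorted(x1, reverse=True)

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

s12 = cov(x1, x2)                    # positive: values move together
s12_new = cov(x1_rearranged, x2)     # negative: pairing reversed
```

The marginal quantities mean(x1) and var(x1) are identical before and after the rearrangement; only the pairing, and hence the covariance, changes.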

Figure 1.2 Scatter plot and dot diagrams for the rearranged data.
f 1 I I f •


Example 1.3 (The effect of unusual observations on sample correlations) Some financial data representing jobs and productivity for the 16 largest publishing firms appeared in an article in Forbes magazine on April 30, 1990. The data for the pair of variables x_1 = employees (jobs) and x_2 = profits per employee (productivity) are graphed in Figure 1.3. We have labeled two "unusual" observations. Dun & Bradstreet is the largest firm in terms of number of employees, but is "typical" in terms of profits per employee. Time Warner has a "typical" number of employees, but comparatively small (negative) profits per employee.

Figure 1.3 Profits per employee and number of employees for 16 publishing firms. (Horizontal axis: employees, in thousands.)

The sample correlation coefficient computed from the values of x_1 and x_2 is

    r_12 = -.39  for all 16 firms
         = -.56  for all firms but Dun & Bradstreet
         = -.39  for all firms but Time Warner
         = -.50  for all firms but Dun & Bradstreet and Time Warner

It is clear that atypical observations can have a considerable effect on the sample correlation coefficient.
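This sensitivity is easy to reproduce. The sketch below uses fabricated data (the Forbes figures are not reproduced here): five points with a strong positive linear trend plus one wild observation whose presence drives the overall correlation negative.

```python
import numpy as np

# Five well-behaved points plus one wild observation (the last pair).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([2.1, 1.9, 3.2, 3.8, 5.1, 0.0])

def sample_corr(u, v):
    """Sample correlation coefficient, as in (1-5)."""
    du, dv = u - u.mean(), v - v.mean()
    return float((du * dv).sum() / np.sqrt((du * du).sum() * (dv * dv).sum()))

r_all = sample_corr(x, y)              # dominated by the outlier
r_trim = sample_corr(x[:-1], y[:-1])   # remaining points are strongly linear
```

As the text recommends, both values, with and without the suspect observation, should be reported.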

Example 1.4 (A scatter plot for baseball data) In a July 17, 1978, article on money in sports, Sports Illustrated magazine provided data on x_1 = player payroll for National League East baseball teams. We have added data on x_2 = won-lost percentage for 1977. The results are given in Table 1.1.

The scatter plot in Figure 1.4 supports the claim that a championship team can be bought. Of course, this cause-effect relationship cannot be substantiated, because the experiment did not include a random assignment of payrolls. Thus, statistics cannot answer the question: Could the Mets have won with $4 million to spend on player salaries?


Table 1.1 1977 Salary and Final Record for the National League East

    Team                    x_1 = player payroll    x_2 = won-lost percentage
    Philadelphia Phillies   3,497,...               ...
    Pittsburgh Pirates      2,485,...               ...
    St. Louis Cardinals     1,782,...               ...
    Chicago Cubs            1,725,...               ...
    Montreal Expos          1,645,...               ...
    New York Mets           1,469,...               ...

Figure 1.4 Salaries and won-lost percentage from Table 1.1. (Horizontal axis: player payroll in millions of dollars.)

To construct the scatter plot in Figure 1.4, we have regarded the six paired observations in Table 1.1 as the coordinates of six points in two-dimensional space. The figure allows us to examine visually the grouping of teams with respect to the variables total payroll and won-lost percentage.

Example 1.5 (Multiple scatter plots for paper strength measurements) Paper is manufactured in continuous sheets several feet wide. Because of the orientation of fibers within the paper, it has a different strength when measured in the direction produced by the machine than when measured across, or at right angles to, the machine direction. Table 1.2 shows the measured values of

    x_1 = density (grams/cubic centimeter)
    x_2 = strength (pounds) in the machine direction
    x_3 = strength (pounds) in the cross direction

A novel graphic presentation of these data appears in Figure 1.5, page 16. The scatter plots are arranged as the off-diagonal elements of a covariance array and box plots as the diagonal elements. The latter are on a different scale with this


Table 1.2 Paper-Quality Measurements

                              Strength
    Specimen   Density   Machine direction   Cross direction
     1         .801      121.41              70.
     2         .824      127.70              72.
     3         .841      129.20              78.
     4         .816      131.80              74.
     5         .840      135.10              71.
     6         .842      131.50              78.
     7         .820      126.70              69.
     8         .802      115.10              73.
     9         .828      130.80              79.
    10         .819      124.60              76.
    11         .826      118.31              70.
    12         .802      114.20              72.
    13         .810      120.30              68.
    14         .802      115.70              68.
    15         .832      117.51              71.
    16         .796      109.81              53.
    17         .759      109.10              50.
    18         .770      115.10              51.
    19         .759      118.31              50.
    20         .772      112.60              53.
    21         .806      116.20              56.
    22         .803      118.00              70.70
    23         .845      131.00              74.
    24         .822      125.70              68.
    25         .971      126.10              72.
    26         .816      125.80              70.
    27         .836      125.50              76.
    28         .815      127.80              76.
    29         .822      130.50              80.
    30         .822      127.90              75.
    31         .843      123.90              78.
    32         .824      124.10              71.
    33         .788      120.80              68.
    34         .782      107.40              54.
    35         .795      120.70              70.
    36         .805      121.91              73.
    37         .836      122.31              74.
    38         .788      110.60              53.
    39         .772      103.51              48.
    40         .776      110.71              53.
    41         .758      113.80              52.


Figure 1.6 3D scatter plot of lizard data from Table 1.3.

... in the scatter plot. Figure 1.7 gives the three-dimensional scatter plot for the standardized variables. Most of the variation can be explained by a single variable determined by a line through the cloud of points.
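Standardization of the kind used for Figure 1.7 replaces each column by its deviations from the column mean, divided by the column standard deviation. A sketch with placeholder values (not the actual lizard measurements of Table 1.3):

```python
import numpy as np

# Hypothetical stand-in for a small 4 x 3 data matrix (rows = animals,
# columns = three size variables).
X = np.array([[2.4,  95.0, 110.0],
              [3.1, 120.0, 135.0],
              [2.8, 105.0, 120.0],
              [3.5, 130.0, 142.0]])

# Subtract each column mean and divide by each column standard deviation
# (divisor n here; n - 1 would only rescale).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
```

Each column of Z now has mean 0 and standard deviation 1, so the three variables are commensurable and a common scale can be used on all three axes of the plot.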

Figure 1.7 3D scatter plot of standardized lizard data.

A three-dimensional scatter plot can often reveal group structure.

Example 1.7 (Looking for group structure in three dimensions) Referring to Example 1.6, it is interesting to see if male and female lizards occupy different parts of the three-dimensional space containing the size data. The gender, by row, for the lizard data in Table 1.3 are

    fmffmfmfmfmfm mmmfmmmffmff

Data Displays and Pictorial Representations 19

Figure 1.8 repeats the scatter plot for the original variables but with males marked by solid circles and females by open circles. Clearly, males are typically larger than females.

Figure 1.8 3D scatter plot of male and female lizards.


p Points in n Dimensions. The n observations of the p variables can also be regarded as p points in n-dimensional space. Each column of X determines one of the points. The ith column,

    [ x_1i ]
    [ x_2i ]
    [  :   ]
    [ x_ni ]

consisting of all n measurements on the ith variable, determines the ith point.

In Chapter 3, we show how the closeness of points in n dimensions can be related to measures of association between the corresponding variables.

1.4 Data Displays and Pictorial Representations

The rapid development of powerful personal computers and workstations has led to a proliferation of sophisticated statistical software for data analysis and graphics. It is often possible, for example, to sit at one's desk and examine the nature of multidimensional data with clever computer-generated pictures. These pictures are valuable aids in understanding data and often prevent many false starts and subsequent inferential problems.

As we shall see in Chapters 8 and 12, there are several techniques that seek to represent p-dimensional observations in few dimensions such that the original distances (or similarities) between pairs of observations are (nearly) preserved. In general, if multidimensional observations can be represented in two dimensions, then outliers, relationships, and distinguishable groupings can often be discerned by eye. We shall discuss and illustrate several methods for displaying multivariate data in two dimensions. One good source for more discussion of graphical methods is [11].


Linking Multiple Two-Dimensional Scatter Plots

One of the more exciting new graphical procedures involves electronically connecting many two-dimensional scatter plots.

Example 1.8 (Linked scatter plots and brushing) To illustrate linked two-dimensional scatter plots, we refer to the paper-quality data in Table 1.2. These data represent measurements on the variables x_1 = density, x_2 = strength in the machine direction, and x_3 = strength in the cross direction. Figure 1.9 shows two-dimensional scatter plots for pairs of these variables organized as a 3 × 3 array. For example, the picture in the upper left-hand corner of the figure is a scatter plot of the pairs of observations (x_1, x_3). That is, the x_1 values are plotted along the horizontal axis, and the x_3 values are plotted along the vertical axis. The lower right-hand corner of the figure contains a scatter plot of the observations (x_3, x_1). That is, the axes are reversed. Corresponding interpretations hold for the other scatter plots in the figure. Notice that the variables and their three-digit ranges are indicated in the boxes along the SW-NE diagonal.

The operation of marking (selecting) the obvious outlier in the (x_1, x_3) scatter plot of Figure 1.9 creates Figure 1.10(a), where the outlier is labeled as specimen 25 and the same data point is highlighted in all the scatter plots. Specimen 25 also appears to be an outlier in the (x_1, x_2) scatter plot but not in the (x_2, x_3) scatter plot. The operation of deleting this specimen leads to the modified scatter plots of Figure 1.10(b).

From Figure 1.10, we notice that some points in, for example, the (x_2, x_3) scatter plot seem to be disconnected from the others. Selecting these points, using the (dashed) rectangle (see page 22), highlights the selected points in all of the other scatter plots and leads to the display in Figure 1.11(a). Further checking revealed that specimens 16-21, specimen 34, and specimens 38-41 were actually specimens
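Although brushing is interactive, a simple non-graphical screen can suggest which observations deserve a closer look. The sketch below flags values whose standardized deviation is large, using stand-in numbers rather than the actual density column of Table 1.2; in the real data, specimen 25 is the point such a screen would surface.

```python
import numpy as np

# Hypothetical density-like values; the sixth entry plays the role of
# an unusually dense specimen.
density = np.array([0.80, 0.82, 0.84, 0.81, 0.83, 0.97, 0.80, 0.82])

# Standardize, then flag observations lying far from the mean.
z = (density - density.mean()) / density.std()
flagged = np.where(np.abs(z) > 2.0)[0]   # indices of suspect observations
```

Flagged points are candidates for a recording-error check, not automatic deletion; as emphasized earlier, summary statistics should be quoted both with and without them.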

Figure 1.9 Scatter plots for the paper-quality data of Table 1.2.

Figure 1.10 Modified scatter plots for the paper-quality data with outlier (25) (a) selected and (b) deleted.