









6.1 Introduction
Experimental design is a vast topic. As one thinks about the information derived from scientific studies, one confronts difficult issues in statistical theory and the limits of knowledge. In this chapter, we confine our discussion to a few of the most important issues in experimental design. This will enable students with no background in behavior research to critically evaluate psychological experiments, and to better understand the nature of empirical research in cognitive science.
Experimental psychology is a young science. The first laboratory of experimental psychology was established just over 100 years ago. Consequently, there are a great many mysteries about human behavior, perception, and performance that have not yet been solved. This makes it an exciting time to engage in psychological research: the field is young enough that there is still a great deal to do, and it is not difficult to think up interesting experiments.
The goal of this chapter is to guide the reader in planning and implementing experiments, and in thinking about good experimental design. A “good” experiment is one in which variables are carefully controlled or accounted for so that one can draw reasonable conclusions from the experiment’s outcome.
6.2 The Goals of Scientific Research
Generally, scientific research has four goals:
From “Experimental Design in Psychoacoustic Research,” chapter 23 in Music, Cognition, and Computerized Sound (Cambridge, MA: MIT Press, 1999), 299–328. Reprinted with permission.
systematic biases that could influence descriptions (goal 1). By studying a phenomenon, one frequently develops the ability to predict certain behaviors or outcomes (goal 2), although prediction is possible without an understanding of underlying causes (we’ll look at some examples in a moment). Controlled experiments are one tool that scientists use to reveal underlying causes so that they can advance from merely predicting behavior to understanding the cause of behavior (goal 3). Explaining behavior (goal 4) requires more than just a knowledge of causes; it requires a detailed understanding of the mechanisms by which the causal factors perform their functions.
To illustrate the distinction between the four goals of scientific research, consider the history of astronomy. The earliest astronomers were able to describe the positions and motions of the stars in the heavens, although they had no ability to predict where a given body would appear in the sky at a future date. Through careful observations and documentation, later astronomers became quite skillful at predicting planetary and stellar motion, although they lacked an understanding of the underlying factors that caused this motion. Newton’s laws of motion and Einstein’s special and general theories of relativity, taken together, showed that gravity and the contour of the space–time continuum cause the motions we observe. Precisely how gravity and the topology of space–time accomplish this still remains unclear. Thus, astronomy has advanced to the determination of causes of stellar motion (goal 3), although a full explanation remains elusive. That is, saying that gravity is responsible for astronomical motion only puts a name on things; it does not tell us how gravity actually works.
As an illustration from behavioral science, one might note that people who listen to loud music tend to lose their high-frequency hearing (description). Based on a number of observations, one can predict that individuals with normal hearing who listen to enough loud music will suffer hearing loss (prediction). A controlled experiment can determine that the loud music is the cause of the hearing loss (determining causality). Finally, study of the cochlea and basilar membrane, and observation of damage to the delicate hair cells after exposure to high-pressure sound waves, meets the fourth goal (explanation).
6.3 Three Types of Scientific Studies
In science there are three broad classes of studies: controlled studies, correlational studies, and descriptive studies. Often the type of study you will be able to do is determined by practicality, cost, or ethics, not directly by your own choice.
6.3.1 Controlled Studies (“True Experiments”)
In a controlled experiment, the researcher starts with a group of subjects and randomly assigns them to an experimental condition. The point of random assignment is to control for extraneous variables that might affect the outcome of the experiment: variables that are different from the variable(s) being studied. With random assignment, one can be reasonably certain that any differences among the experimental groups were caused by the variable(s) manipulated in the experiment.
116 Daniel J. Levitin
bottom half of the students in the silence condition. Then if the result of the experiment was that the music listeners as a group tended to perform better on their next exam, one could argue that this was not because they listened to music while they studied, but because they were the better students to begin with.
Again, the theory behind random assignment is to have groups of subjects who start out the same. Ideally, each group will have similar distributions on every conceivable dimension: age, sex, ethnicity, IQ, and variables that you might not think are important, such as handedness, astrological sign, or favorite television show. Random assignment makes it unlikely that there will be any large systematic differences between the groups.
A similar design flaw would arise if the experimental conditions were different. For example, if the music-listening group studied in a well-lit room with windows, and the silence group studied in a dark, windowless basement, any difference between the groups could be due to the different environments. The room conditions become confounded with the music-listening conditions, such that it is impossible to deduce which of the two is the causal factor.
Performing random assignment of subjects is straightforward. Conceptually, one wants to mix the subjects’ names or numbers thoroughly, then draw them out of a hat. Realistically, one of the easiest ways to do this is to generate a different random number for each subject, and then sort the random numbers. If n equals the total number of subjects you have, and g equals the number of groups you are dividing them into, the first n/g subjects will comprise the first group, the next n/g will comprise the second group, and so on.
If the results of a controlled experiment indicate a difference between groups, the next question is whether these findings are generalizable.
If your initial group of subjects (the large group, before you randomly assigned subjects to conditions) was also randomly selected (called random sampling or random selection, as opposed to random assignment), this is a reasonable conclusion to draw. However, there are almost always some constraints on one’s initial choice of subjects, and this constrains generalizability. For example, if all the subjects you studied in your music-listening experiment lived in fraternities, the finding might not generalize to people who do not live in fraternities. If you want to be able to generalize to all college students, you would need to take a representative sample of all college students. One way to do this is to choose your subjects randomly, such that each member of the population you are considering (college students) has an equal likelihood of being placed in the experiment.
There are some interesting issues in representative sampling that are beyond the scope of this chapter. For example, if you wanted to take a representative sample of all American college students and you chose American college students randomly, it is possible that you would be choosing several students from some of the larger colleges, such as the University of Michigan, and you might not choose any students at all from some of the smaller colleges, such as Bennington College; this would limit the applicability of your findings to the colleges that were represented in your sample. One solution is to conduct a stratified sample, in which you first randomly select colleges (making it just as likely that you’ll choose large and small colleges) and then randomly select the same number of students from each of those colleges. This ensures that colleges of different sizes are represented in the sample. You then weight the data from each college in accordance with the percentage contribution each college makes to the total student population of your sample. (For further reading, see Shaughnessy and Zechmeister 1994.)
Choosing subjects randomly requires careful planning. If you try to take a random sample of Stanford students by standing in front of the Braun Music Building and stopping every third person coming out, you might be selecting a greater percentage of music students than actually exists on campus. Yet truly random samples are not always practical. Much psychological research is conducted on college students who are taking an introductory psychology class, and are required to participate in an experiment for course credit. It is not at all clear whether American college students taking introductory psychology are representative of students in general, or of people in the world in general, so one should be careful not to overgeneralize findings from these studies.
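The random-assignment procedure described earlier in this section (give each subject a random number, sort on those numbers, then cut the sorted list into g groups of n/g subjects) can be sketched in Python. This is only an illustrative sketch: the function name assign_to_groups, the seed parameter, and the rule for dealing out leftover subjects when n is not divisible by g are assumptions made for the example, not part of the chapter.

```python
import random


def assign_to_groups(subjects, n_groups, seed=None):
    """Randomly assign subjects to n_groups groups of (nearly) equal size.

    Hypothetical helper for illustration only.
    """
    subjects = list(subjects)
    rng = random.Random(seed)  # a seed makes the assignment reproducible

    # Give every subject its own random number, then sort by that number.
    keys = {s: rng.random() for s in subjects}
    shuffled = sorted(subjects, key=keys.__getitem__)

    # The first n/g subjects form group 1, the next n/g form group 2, ...
    size = len(shuffled) // n_groups
    groups = [shuffled[i * size:(i + 1) * size] for i in range(n_groups)]

    # Assumption: leftover subjects are dealt out one per group.
    for i, s in enumerate(shuffled[size * n_groups:]):
        groups[i].append(s)
    return groups


# Fifty numbered subjects divided among five conditions.
groups = assign_to_groups(range(1, 51), n_groups=5, seed=2024)
for number, members in enumerate(groups, start=1):
    print(f"Group {number}: {sorted(members)}")
```

Note that this handles random assignment only. Random sampling, that is, deciding which people enter the subject pool in the first place, is a separate step and is what governs generalizability.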
6.3.2 Correlational Studies
A second type of study is the correlational study (figure 6.2). Because it is not always practical or ethical to perform random assignments, scientists are sometimes forced to rely on patterns of co-occurrence, or correlations between events. The classic example of a correlational study is the link between cigarette smoking and cancer. Few educated people today doubt that smokers are more likely to die of lung cancer than are nonsmokers. However, in the history of scientific research there has never been a controlled experiment with human subjects on this topic. Such an experiment would take a group of healthy nonsmokers, and randomly assign them to two groups, a smoking group and a nonsmoking group. Then the experimenter would simply wait until most of the people in the study have died, and compare the average ages and causes of death of the two groups. Because our hypothesis is that smoking causes cancer, it would clearly be unethical to ask people to smoke who otherwise would not.
The scientific evidence we have that smoking causes cancer is correlational. That is, when we look at smokers as a group, a higher percentage of them do indeed develop fatal cancers, and die earlier, than do nonsmokers. But without a controlled study, the possibility exists that there is a third factor (a mysterious “factor x”) that causes people both to smoke and to develop cancer. Perhaps there is some enzyme in the body that gives people a nicotine craving, and this same enzyme causes fatal cancers. This would account for both outcomes, the kinds of people who smoke and the rate of cancers among them, and it would show that there is no causal link between smoking and cancer.
In correlational studies, a great deal of effort is devoted to trying to uncover differences between the two groups studied in order to identify any causal factors that might exist. In the case of smoking, none have been discovered so far, but the failure to discover a third causal factor does not prove that one does not exist. It is an axiom in the philosophy of science that one can prove only the presence of something; one can’t prove the absence of something, because it could always be just around the corner, waiting to be discovered in the next experiment (Hempel 1966). In the real world, behaviors and diseases are usually brought
Experimental Design in Psychological Research 119
the core of other planets, but to learn more about the origins of the universe. In psychology, we might want to know the part of the brain that is activated when someone performs a mental calculation, or the number of pounds of fresh green peas the average Canadian eats in a year (figure 6.3). Our goal in these cases is not to contrast individuals but to acquire some basic data about the nature of things. Of course, descriptive studies can be used to establish “norms,” so that we can compare people against the average, but as their name implies, the primary goal in descriptive experiments is often just to describe something that had not been described before. Descriptive studies are every bit as useful as controlled experiments and correlational studies; sometimes, in fact, they are even more valuable because they lay the foundation for further experimental work.
6.4 Design Flaws in Experimental Design
6.4.1 Clever Hans
There are many examples of flawed studies or flawed conclusions that illustrate the difficulties in controlling extraneous variables. Perhaps the most famous case is that of Clever Hans.
Clever Hans was a horse owned by a German mathematics teacher around the turn of the twentieth century. Hans became famous following many demonstrations in which he could perform simple addition and subtraction, read German, and answer simple questions by tapping his hoof on the ground (Watson 1967). One of the first things that skeptics wondered (as you might) is whether Hans would continue to be clever when someone other than his owner asked the questions, or when Hans was asked questions that he had never heard before. In both these cases, Hans continued to perform brilliantly, tapping out the sums or differences for arithmetic problems.
In 1904, a scientific commission was formed to investigate Hans’s abilities more carefully. The commission discovered, after rigorous testing, that Hans could never answer a question if the questioner did not also know the answer,
Figure 6. In a descriptive study, the researcher seeks to describe some aspect of the state of the world, such as people’s consumption of green peas.
or if Hans could not see his questioner. It was finally discovered that Hans had become very adept at picking up subtle (and probably unintentional) movements on the part of the questioner that cued him as to when he should stop tapping his foot. Suppose a questioner asked Hans to add 7 and 3. Hans would start tapping his hoof, and keep on tapping until the questioner stopped him by saying “Right! Ten!” or, more subtly, by moving slightly when the correct answer was reached. You can see how important it is to ensure that extraneous cues or biases do not intrude into an experimental situation.
6.4.2 Infants’ Perception of Musical Structure
In studies of infants’ perception of music, infants typically sit in their mother’s lap while music phrases are played over a speaker. Infants tend to turn their heads toward a novel or surprising event, and this is the dependent variable in many infant studies; the point at which the infants turn their heads indicates when they perceive a difference in whatever is being played. Suppose you ran such a study and found that the infants were able to distinguish Mozart selections that were played normally from selections of equal length that began or ended in the middle of a musical phrase. You might take this as evidence that the infants have an innate understanding of musical phraseology.
Are there alternative explanations for the results? Suppose that in the experimental design, the mothers could hear the music, too. The mothers might unconsciously cue the infants to changes in the stimulus that they (the mothers) detect. A simple solution is to have the mothers wear headphones playing white noise, so that their perception of the music is masked.
6.4.3 Computers, Timing, and Other Pitfalls
It is very important that you not take anything for granted as you design a careful experiment and control extraneous variables. For example, psychologists studying visual perception frequently present their stimuli on a computer using the Macintosh or Windows operating system. In a computer program, the code may specify that an image is to remain on the computer monitor for a precise number of milliseconds. Just because you specify this does not make it happen, however. Monitors have a refresh rate (60 or 75 Hz is typical), so the “on time” of an image will always be an integer multiple of the refresh cycle (13.33 milliseconds for a 75 Hz refresh rate) no matter what you instruct the computer to do in your code. To make things worse, the Macintosh and Windows operating systems do not guarantee “refresh cycle accuracy” in their updating, so an instruction to put a new image on the screen may be delayed an unknown amount of time.
It is important, therefore, always to verify, using some external means, that the things you think are happening in your experiment are actually happening. Just because you leave the volume control on your amplifier at the same spot doesn’t mean the volume of a sound stimulus you are playing will be the same from day to day. You should measure the output and not take the knob position for granted. Just because a frequency generator is set for 1000 Hz does not mean it is putting out a 1000 Hz signal. It is good science for you to measure the output frequency yourself.
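To make the refresh-cycle point concrete, the short Python sketch below computes the on time a monitor can actually deliver for a requested duration. It assumes, purely for illustration, that the image stays up for a whole number of refresh cycles and that the request is rounded up to the next cycle; real presentation software may round differently, and the operating-system delays mentioned above are not modeled. The function name actual_on_time_ms is hypothetical.

```python
import math


def actual_on_time_ms(requested_ms, refresh_hz):
    """Return (frames, duration_ms) for the shortest displayable duration
    that is at least as long as the requested one.

    Assumption: durations are whole numbers of refresh cycles, rounded up.
    """
    frame_ms = 1000.0 / refresh_hz  # one refresh cycle, e.g. 13.33 ms at 75 Hz
    # The small epsilon guards against floating-point noise when the
    # request is an exact multiple of the refresh cycle.
    frames = max(1, math.ceil(requested_ms / frame_ms - 1e-9))
    return frames, frames * frame_ms


for hz in (60, 75):
    frames, shown = actual_on_time_ms(50, hz)
    print(f"{hz} Hz: requested 50 ms, shown for {frames} frames = {shown:.2f} ms")
```

At 75 Hz a 50 ms request becomes 4 frames, or about 53.33 ms, under these assumptions. A calculation like this only tells you what to expect; it does not replace measuring the display with external equipment.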
Testing 50 subjects might not be practical. An alternative is a within-subjects design, in which every subject is tested in every condition (also called a repeated measures design). In this example, a total of ten subjects could be randomly divided into the five conditions, so that two subjects experience each condition for a given period of time. Then the subjects switch to another condition. By the time the experiment is completed, ten observations have been collected in each cell, and only ten subjects are required.
The advantage of each subject experiencing each condition is that you can obtain measures of how each individual is affected by the manipulation, something you cannot do in the between-subjects design. It might be the case that some people do well in one type of condition and other people do poorly in it, and the within-subjects design is the best way to show this. The obvious advantage to the within-subjects design is the smaller number of subjects required. But there are disadvantages as well.
One disadvantage is demand characteristics. Because each subject experiences each condition, they are not as naive about the experimental manipulation. Their performance could be influenced by a conscious or unconscious desire to make one of the conditions work better. Another problem is carryover effects. Suppose you were studying the effect of Prozac on learning, and that the half-life of the drug is 48 hours. The group that gets the drug first might still be under its influence when they are switched to the nondrug condition. This is a carryover effect. In the music-listening experiment, it is possible that listening to rock music creates anxiety or exhilaration that might last into the next condition.
A third disadvantage of within-subjects designs is order effects, and these are particularly troublesome in psychophysical experiments. An order effect is similar to a carryover effect, and it concerns how responses in an experiment might be influenced by the order in which the stimuli or conditions are presented. For instance, in studies of speech discrimination, subjects can habituate (become used to, or become more sensitive) to certain sounds, altering their threshold for the discriminability of related sounds. A subject who habituates to a certain sound may respond differently to the sound immediately following it than he/she normally would. For these reasons, it is important to counterbalance the order of presentations; presenting the same order to every subject makes it difficult to account for any effects that are due merely to order.
One way to reduce order effects is to present the stimuli or conditions in random order. In some studies, this is sufficient, but to be really careful about order effects, the random order simply is not rigorous enough. The solution is to use every possible order. In a within-subjects design, each subject would
Table 6. Between-subjects experiment on music and study habits
Condition            Only while studying    Only while not studying
Music: Classical     Subjects 1–10          Subjects 11–
Music: Rock          Subjects 21–30         Subjects 31–
No music             Subjects 41–50         Subjects 41–
complete the experiment with each order. In a between-subjects design, different subjects would be assigned different orders. The choice will often depend on the available resources (time and availability of subjects). The number of possible orders is N! (“N factorial”), where N equals the number of stimuli. With two stimuli there are two possible orders (2! = 2 × 1); with three stimuli there are six possible orders (3! = 3 × 2 × 1); with six stimuli there are 720 possible orders (6! = 6 × 5 × 4 × 3 × 2 × 1). Seven hundred twenty orders is not practical for a within-subjects design, or for a between-subjects design. One solution in this case is to create a set of orders that presents each stimulus in each serial position. A method for accomplishing this involves using the Latin Square. For even-numbered N, the size of the Latin Square will be N × N; therefore, with six stimuli you would need only 6 orders (the rows of a 6 × 6 square, 36 presentations in all), not 720. For odd-numbered N, the size of the Latin Square will be N × 2N (that is, 2N orders). Details of this technique are covered in experimental design texts such as Kirk (1982) and Shaughnessy and Zechmeister (1994).
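The contrast between using every possible order and using a Latin Square can be illustrated with a few lines of Python. The sketch below builds the simplest kind of Latin Square, a cyclic one, in which each row is the previous row shifted by one position. This guarantees that every stimulus appears once in every serial position, but it is an assumption of this example rather than the specific construction used in the texts cited above; a cyclic square does not balance which stimulus immediately precedes which. The stimulus labels A–F are placeholders.

```python
import math


def cyclic_latin_square(stimuli):
    """Return N presentation orders (rows) in which each stimulus occupies
    each serial position exactly once. Illustrative cyclic construction."""
    n = len(stimuli)
    return [[stimuli[(row + col) % n] for col in range(n)] for row in range(n)]


stimuli = ["A", "B", "C", "D", "E", "F"]  # placeholder labels

print("All possible orders:", math.factorial(len(stimuli)))  # 720

square = cyclic_latin_square(stimuli)
print("Latin Square orders:", len(square))  # 6
for order in square:
    print(" ".join(order))
```

Each row is one presentation order, so six stimuli yield a 6 × 6 square: six orders of six stimuli each, compared with 720 orders for full counterbalancing.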
6.7 Ethical Considerations in Using Human Subjects
Some experiments on human subjects in the 1960s and 1970s raised questions about how human subjects are treated in behavioral experiments. As a result, guidelines for human experimentation were established. The American Psychological Association, a voluntary organization of psychologists, formulated a code of ethical principles (American Psychological Association 1992). In addition, most universities have established committees to review and approve research using human subjects. The purpose of these committees is to ensure that subjects are treated ethically, and that fair and humane procedures are followed. In some universities, experiments performed for course work or experiments done as “pilot studies” do not require approval, but these rules vary from place to place, so it is important to determine the requirements at your institution before engaging in any human subject research.
It is also important to understand the following four basic principles of ethics in human subject research:
tests are beyond the scope of this chapter, and the reader is referred to the statistics textbooks mentioned earlier.
Significance Testing
Suppose you wish to observe differences in interval identification ability between brass players and string players. The question is whether the difference you observe between the two groups can be wholly accounted for by measurement and performance error, or whether a difference of the size you observe indicates a true difference in the abilities of these musicians. Significance tests provide the user with a “p value,” the probability of obtaining a result at least as extreme as the one observed if chance alone were operating. By convention, if the p value is less than .05, meaning that a result this extreme would arise by chance less than 5% of the time, scientists accept the result as statistically significant. Of course, p < .05 is arbitrary, and it doesn’t deal directly with the opposite case, in which a genuine effect exists but the statistical test fails to detect it (a power analysis is required for this). In many studies, the probability of failing to detect an effect, when it exists, can soar to 80% (Schmidt 1996). An additional problem with a criterion of 5% is that a researcher who measures 20 different effects is likely to measure one as significant by chance, even if no real effect exists.
Statistical significance tests, such as the analysis of variance (ANOVA), the F-test, chi-square test, and t-test, are methods to determine the probability that observed values in an experiment differ only as a result of measurement errors. For details about how to choose and conduct the appropriate tests, or to learn more about the theory behind them, consult a statistics textbook (e.g., Daniel 1990; Glenberg 1988; Hayes 1988).
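The remark about measuring 20 different effects can be checked with a little arithmetic. The Python sketch below assumes, for illustration only, that the 20 tests are independent and that no real effect exists in any of them; under those assumptions the expected number of spuriously significant results is 20 × .05 = 1, and the chance of at least one is 1 − (1 − .05)^20. The function name familywise_error_rate is introduced here and does not come from the chapter.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among n_tests independent
    tests when every null hypothesis is true (illustrative assumption)."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_tests, alpha = 20, 0.05
print("Expected false positives:", n_tests * alpha)  # 1.0
print(f"P(at least one false positive): {familywise_error_rate(n_tests, alpha):.3f}")
```

With 20 tests at the .05 level the probability of at least one false positive comes out at roughly 0.64, which is why a single significant result among many tested effects deserves caution.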
Alternatives to Classical Significance Testing
Because of problems with traditional significance testing, there is a movement, at the vanguard of applied statistics and psychology, to move away from “p value” tests and to rely on alternative methods, such as Bayesian inference, effect sizes, confidence intervals, and meta-analyses (refer to Cohen 1994; Hunter and Schmidt 1990; Schmidt 1996). Yet many people persist in clinging to the belief that the most important thing to do with experimental data is to test them for statistical significance. There is great pressure from peer-reviewed journals to perform significance tests, because so many people were taught to use them. The fact is, the whole point of significance testing is to determine whether a result is repeatable when one doesn’t have the resources to repeat an experiment.
Let us return to the hypothetical example mentioned earlier, in which we examined the effect of music on study habits using a “within-subjects” design (each subject is in each condition). One possible outcome is that the mean test scores did not differ significantly among conditions by an analysis of variance (ANOVA). Yet suppose that, ignoring the means, every subject in the music-listening condition had a higher score than in the no-music condition. We are not interested in the size of the difference now, only in the direction of the difference. The null hypothesis predicts that the manipulation would have no effect at all, and that half of the subjects should show a difference in one direction and half in the other. The probability of all 10 subjects showing an effect in the same direction is 1/2^10, or about 0.001, which is highly significant. Ten out of 10 subjects indicates repeatability. The technique just described is called the sign test, because we are looking only at the arithmetic sign of the differences between groups (positive or negative).
Often, a good alternative to significance tests is estimates of confidence intervals. These determine with a given probability (e.g., 95%) the range of values within which the true population parameters lie. Another alternative is an analysis of conditional probabilities. That is, if you observe a difference between two groups on some measure, determine whether a subject’s membership in one group or the other will improve your ability to predict his/her score on the dependent variable, compared with not knowing what group he/she was in (an example of this analysis is in Levitin 1994a). A good overview of these alternative statistical methods is contained in the paper by Schmidt (1996).
Aside from statistical analyses, in most studies you will want to compute the mean and standard deviation of your dependent variable. If you had distinct treatment groups, you will want to know the individual means and standard deviations for each group. If you had two continuous variables, you will probably want to compute the correlation, which is an index of how much one variable is related to the other. Always provide a table of means and standard deviations as part of your report.
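The sign-test calculation described in this section can be reproduced in a few lines of Python. The sketch below is a minimal illustration under the usual assumption that, if the manipulation has no effect, each subject is equally likely to show a positive or a negative difference. The function name sign_test_p and its two_sided option are introduced for this example. The one-sided value corresponds to a direction predicted in advance, as with the 1/2^10 figure for 10 subjects; the two-sided value, which counts a unanimous result in either direction, is twice as large.

```python
from math import comb


def sign_test_p(n_positive, n_total, two_sided=False):
    """Sign-test p value: probability of a split at least this lopsided
    when each sign is equally likely (illustrative helper)."""
    if two_sided:
        k = max(n_positive, n_total - n_positive)
    else:
        k = n_positive
    # Upper tail of a binomial(n_total, 0.5) distribution, from k upward.
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail) if two_sided else tail


print(sign_test_p(10, 10))                  # 0.0009765625, i.e. 1/1024
print(sign_test_p(10, 10, two_sided=True))  # 0.001953125
print(sign_test_p(8, 10))                   # 0.0546875
```

The last line shows how quickly the evidence weakens: with 8 of 10 subjects in the predicted direction, the one-sided p value already exceeds the conventional .05 criterion.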
6.8.2 Qualitative Analysis, or “How to Succeed in Statistics without Significance Testing”
If you have not had a course in statistics, you are probably at some advantage over anyone who has. Many people who have taken statistics courses rush to plug the numbers into a computer package to test for statistical significance. Unfortunately, students are not always perfectly clear on exactly what it is they are testing or why they are testing it.
The first thing one should do with experimental data is to graph them in a way that clarifies the relation between the data and the hypothesis. Forget about statistical significance testing: what does the pattern of data suggest? Graph everything you can think of (individual subject data, subject averages, averages across conditions) and see what patterns emerge. Roger Shepard has pointed out that the human brain is not very adept at scanning a table of numbers and picking out patterns, but is much better at picking out patterns in a visual display.
Depending on what you are studying, you might want to use a bar graph, a line graph, or a bivariate scatter plot. As a general rule, even though many of the popular graphing and spreadsheet packages will allow you to make pseudo-three-dimensional graphs, don’t ever use three dimensions unless the third dimension actually represents a variable. Nothing is more confusing than a graph with extraneous information.
If you are making several graphs of the same data (such as individual subject graphs), make sure that each graph is the same size and that the axes are scaled identically from one graph to another, in order to facilitate comparison. Be sure all your axes are clearly labeled, and don’t divide the axis numbers into units that aren’t meaningful (for example, in a histogram with “number of subjects” on the ordinate, the scale shouldn’t include half numbers because subjects come only in whole numbers).
———. (1994b). Problems in Applying the Kolmogorov-Smirnov Test: The Need for Circular Statistics in Psychology. Technical Report #94-07. University of Oregon, Institute of Cognitive & Decision Sciences.
Schmidt, F. L. (1996). “Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for the Training of Researchers.” Psychological Methods, VI (2): 115–129.
Shaughnessy, J. J., and E. B. Zechmeister. (1994). Research Methods in Psychology. Third edition. New York: McGraw-Hill.
Stern, A. W. (1993). “Natural Pitch and the A440 Scale.” Stanford University, CCRMA. (Unpublished report).
Watson, J. B. (1967). Behavior: An Introduction to Comparative Psychology. New York: Holt, Rinehart and Winston. First published 1914.
Zar, J. H. (1984). Biostatistical Analysis. Second edition. Englewood Cliffs, N.J.: Prentice-Hall.