









The crux of neuroscience is estimating whether a treatment group differs from a control group on some response, whether different doses of a drug are associated with a systematic difference in response, or a host of other questions. All of these queries have one thing in common—they ask the scientist to make inferences about the descriptive statistics from a study. This is the domain of inferential statistics and the generic topic is hypothesis testing. Many topics in this chapter have been touched on in earlier discussion of parametric statistics (Section X.X), Terminology (Section X.X), and Distributions (Section X.X). Here these disparate concepts are presented together in a unified framework.

In this chapter, we first present the logic behind hypothesis testing—a section that can be skimmed without any loss of information provided that one pays attention to the definitions. We then outline the three major modes of hypothesis testing—the test statistic/p value approach, the critical value approach, and the confidence limits approach. With modern computers, almost everyone uses the test statistic/p value approach. Having some familiarity with the other two approaches, however, increases understanding of inferential statistics.
In the hierarchy of mathematics, statistics is a subset of probability theory. Thus, inferential statistics always involves the probability distribution for a statistic. It is easier to examine this distribution using a specific statistic than it is to treat it in general terms. Hence, we start with the mean.
Figure 7.1: Sampling Distribution of the Means.
[Schematic: individual scores $X_1, X_2, \ldots, X_N$ are drawn from the hat of individuals, and their mean, $\bar{X} = (X_1 + X_2 + \cdots + X_N)/N$, is dropped into the hat of means.]
Figure 7.1 gives a schematic for sampling means, a topic that we touched on in Section X.X. There is a large hat containing an infinite number of observations, say people in this case. We reach into the hat, randomly select a person, and record their score on a variable, X. We do this for N individuals. We then calculate the mean of the scores on a separate piece of paper and toss that into another hat, the hat of means. Finally, we repeat this an infinite number of times. The distribution of means in the hat of means is called the sampling distribution of the means. In general, if we were to perform this exercise for any statistic, the distribution in the hat of statistics is called the sampling distribution of that statistic.

Now comes the trick. We can treat all the means in the hat of means as if they were raw scores. Hence, we can ask questions such as "What is the probability of randomly picking a mean greater than 82.4?" To answer such a question we must know the distribution in the hat. That distribution depends on two things: (1) the distribution of the raw scores, and (2) the sample size. Let μ and σ denote, respectively, the mean and standard deviation of the raw scores in the hat of individuals. If these raw scores have a normal distribution, then the means in the hat of means will also have a normal distribution. The mean of that distribution equals μ, and its standard deviation, called the standard error of the mean, is $\sigma_{\bar{X}} = \sigma/\sqrt{N}$.
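To make the hat of means concrete, here is a minimal R sketch that simulates it. The population values (μ = 100, σ = 15), the sample size of 25, and the threshold of 82.4 from the question above are illustrative assumptions for the demonstration, not values from a real study.

```r
# Simulate the sampling distribution of the mean ("the hat of means").
set.seed(42)                      # for reproducibility
mu        <- 100                  # mean of the raw scores (assumed)
sigma     <- 15                   # standard deviation of the raw scores (assumed)
N         <- 25                   # scores drawn per sample (assumed)
n_samples <- 10000                # number of trips to the hat

# Each trip: draw N raw scores and toss their mean into the hat of means.
hat_of_means <- replicate(n_samples, mean(rnorm(N, mean = mu, sd = sigma)))

mean(hat_of_means)        # close to mu
sd(hat_of_means)          # close to sigma / sqrt(N) = 3, the standard error
mean(hat_of_means > 82.4) # proportion of sampled means greater than 82.4
```

With these settings the simulated standard deviation of the hat of means comes out near σ/√N, which is the standard error described above.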
For example, what is the probability of drawing a sample of 42 people whose mean is 96 or less? The mean of interest is 96, the population mean is 100, the population standard deviation is 15, and the sample size is 42. Hence, the Z value is
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{N}} = \frac{96 - 100}{15/\sqrt{42}} = -1.73$$
The area under the standard normal curve from negative infinity to −1.73 is .04. Hence, the probability of observing a mean of 96 or less based on a random sample of 42 is .04.

Let's take a more interesting problem. The mean IQ of the 23 third graders in Mrs. Smith's class is 108.6. Are the children in Mrs. Smith's class representative of the general population with respect to IQ? Let's start with the hypothesis that the students are not representative of the general population. Then the mean of 108.6 is sampled from a hat of means where the overall mean is μ (μ ≠ 100) and the standard deviation is $15/\sqrt{23} = 3.128$, assuming that the standard deviation in the class is the population standard deviation.^1 If we try to compute a Z from Equation 7.2, we automatically run into trouble. We have numbers for $\bar{X}$ (108.6) and for $\sigma_{\bar{X}}$ (3.128), but what is the numerical value for μ? There is none. The hypothesis that the students are not representative implies that μ ≠ 100, but it does not give a specific numerical value for μ. Hence, we cannot calculate any probabilities of sampling a mean of 108.6 according to this hypothesis.
Figure 7.2: Normal curve with a mean of 100 and standard deviation of 3.128.
Hmmm. Are we stuck? Actually, no. We can still get information about the question but, using the vernacular, in a bass-ackward way. Let us temporarily examine the hypothesis that the students are representative of the general population. Now we can actually calculate a Z because this hypothesis implies that μ = 100. We can then examine areas under the curve. If the probability of observing a mean of 108.6 is remote, then we can reject this hypothesis and conclude that the students are not representative. This is indeed a logical approach although it sounds somewhat tortuous.

The first task—at least for learning purposes—is to draw the normal curve for the distribution of means according to the hypothesis that Mrs. Smith's students are representative of the general population. This is just a normal curve with a mean of 100 and a standard deviation of 3.128 and is depicted in Figure 7.2. (For the moment, ignore the tails of the distribution.)
^1 That assumption could be relaxed but it would complicate the situation.
We now want to define the "remote probabilities" of this curve. Naturally, these will be at the two tails of the curve. (At this point, you may wonder why we use both tails when the mean IQ for Mrs. Smith's students is clearly above the mean. Why not just use the upper tail? The answer is that we should always set up the hypothesis tests before gathering or looking at the data. Hence, we want the most unlikely outcomes at both ends of the curve because the class could be unrepresentative by having either low IQs or high IQs.)

Now the question becomes, "Just how remote should the probabilities be?" You may be surprised to learn that there is no rigorous, mathematical answer to this question. Instead, many decades ago, scientists arrived at an "educated guess" that won consensus, and that tradition has been carried on to this day. For most purposes, scientists consider a "remote probability" to be the 5% most unlikely outcomes. We will have more to say about this later. Let us just accept this criterion for the time being.

Given that the remote probabilities are divided between the two tails of the normal curve, we want to find the lower cutoff for a normal curve with a mean of 100 and standard deviation of 3.128 that has 2.5% of the curve below it. Then we must find the upper cutoff so that 2.5% of the curve is above it. We start with the lower cutoff and must find the Z score that separates the bottom 2.5% of the normal curve from the upper 97.5%.^2 Using the appropriate function provided with statistical packages, that Z score is −1.96. Now we use Equation 7.2 to solve for $\bar{X}$:

$$\bar{X} = \mu + Z\sigma_{\bar{X}} = 100 + (-1.96)(3.128) = 93.87$$
We now repeat this exercise for the upper portion of the curve. The Z score separating the bottom 97.5% from the top 2.5% is 1.96. Substituting this into Equation X.X gives the upper cutoff as 106.13. The shaded areas at the tails of Figure 7.2 give these cutoffs, and Table X.X presents code from SAS and R that calculates these cut points. Hence, we will reject the hypothesis that the students in Mrs. Smith's class are representative of the general population if their mean is less than 93.87 or greater than 106.13. Their observed mean is 108.6, which is greater than 106.13. Hence, we reject the hypothesis and conclude that the students are not representative of the general population.
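The same cutoffs can be obtained directly with a statistical package. The R sketch below restates the critical value calculation described above, using qnorm() as the "appropriate function"; it is only a gloss on the arithmetic, not the code from the text's Table X.X.

```r
# Critical value approach for Mrs. Smith's class (two-tailed, alpha = .05).
mu    <- 100               # population mean under the null hypothesis
sigma <- 15                # population standard deviation
N     <- 23                # number of students
se    <- sigma / sqrt(N)   # standard error of the mean, about 3.128
alpha <- 0.05

cut_low  <- mu + qnorm(alpha / 2)     * se   # about  93.87
cut_high <- mu + qnorm(1 - alpha / 2) * se   # about 106.13

x_bar <- 108.6                               # observed class mean
x_bar < cut_low || x_bar > cut_high          # TRUE, so reject the null hypothesis
```

Because 108.6 falls above the upper cutoff, the code reaches the same conclusion as the hand calculation.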
If you grasp the logic behind this, then fine—you know the meaning of statistical inference.
^2 Most statistical packages have routines that can directly find the cutoff without first converting to Z scores. The longer route is taken here because it will assist learning in other topics about hypothesis testing.
This is an intolerable situation for statisticians, and so they have developed more jargon so that students are forced to memorize the terms and regurgitate them on tests.

The hypothesis that Mrs. Smith's students are representative of the population is called the null hypothesis, which is denoted as H_0, the subscript being a zero and not an uppercase letter O. From a mathematical standpoint, the null hypothesis is a hypothesis that provides concrete numerical values so that the sampling distribution of a statistic (e.g., the mean in the Mrs. Smith example) can be calculated. From a common sense view, the "null" in the null hypothesis means that the hypothesis lacks positive, distinguishing characteristics. It is an "empty" hypothesis. The purpose in research is to reject the null hypothesis and conclude that there is evidence for the hypothesis logically opposite to the null hypothesis. This logical alternative is called the alternative hypothesis, usually denoted as H_A.

The percent of outcomes regarded as "remote" or, in other words, the percent of most unlikely outcomes is called the alpha or α level. By convention, the α level is set at .05 or 5%. In special cases, it may be set to lower or higher values. The α level may also be looked upon as the false positive rate. If the null hypothesis is true, then we will incorrectly reject it α percent of the time.

A test of a hypothesis that splits the α level in half, one half used for the upper tail and the other half for the lower tail of the sampling distribution, is called a two-tailed test or two-sided test. Such a hypothesis is called a nondirectional hypothesis. When a hypothesis clearly pertains to only one side of the sampling distribution, it is called a directional hypothesis and the test is termed a one-tailed test or one-sided test. An example would be a hypothesis that administration of a drug might increase locomotor activity. There are more terms to learn, but they will be introduced in the course of the remaining discussion of hypothesis testing.
7.2 The Three Approaches to Hypothesis Testing
The three approaches to hypothesis testing are: (1) the critical values approach; (2) the test-statistic/p value approach; and (3) the confidence limits approach. All three are mathematically equivalent and will always lead to the same conclusion. In this section, these approaches are outlined in the abstract. This is meant as a reference section, so there is no need to commit these approaches to memory. Applications and examples of the methods will be provided later when specific types of problems for hypothesis testing are discussed. By far the most prevalent approach is the test-statistic/p value one because this is the way in which modern statistical software usually presents the results.

One important issue is the symmetry of critical values and confidence limits. The examples used below all use symmetrical critical values and confidence intervals. That is because we deal with issues about the mean. For other statistics, like the correlation coefficient, they may not be symmetrical. When the correlation is small, critical values and confidence limits are for all practical purposes symmetric.
Table 7.2: The critical value approach to hypothesis testing.
1. State the null hypothesis (H_0) and the alternative hypothesis (H_A).
2. Establish whether the test is one-tailed or two-tailed. (NB: all stat packages default to two-tailed testing, so most statisticians recommend two-tailed testing.)
3. Establish the probability of a false positive finding (a.k.a. the α level).
4. Establish the sample size (see Section X.X).
5. Draw the distribution of the statistic for which you will establish the critical values. The mean of this distribution will be the mean under the null hypothesis. You will have to look up the formula for the standard deviation.
6. Find the α most unlikely outcomes on the distribution. The procedure for doing this will vary according to the statistic for which the critical value(s) are being calculated and the direction of testing (i.e., one-tailed versus two-tailed).
7. Look up (or calculate) the values of the distribution from step 5 that correspond to these most unlikely outcomes. These become the upper and lower critical values for the test.
8. Gather the data and compute the observed statistic.
9. If the observed statistic is less than its lower critical value or greater than its upper critical value, then reject H_0.
As the correlation becomes greater in either direction, however, they become more asymmetrical. The test-statistic/p value approach can compute the p value using nonsymmetric distributions.
The critical values approach was used above in the Mrs. Smith example. In a one-sided test, the critical value of a statistic is the number that separates the α most unlikely outcomes from the (1 − α) most probable outcomes. In the case of a two-sided test, there will be two critical values. The first is the cutoff for the lower .5α least likely outcomes and the second is the cutoff for the upper .5α least likely outcomes. In this approach the critical values are established first. Then an observed statistic (e.g., a mean, a difference in means, a correlation) is compared to the critical value(s). If the observed statistic exceeds its critical value(s) in the appropriate direction, then the null hypothesis is rejected. The approach is outlined in Table 7.2.

The critical value approach has a definite advantage over the other approaches—one can set up the critical values before the experiment is conducted. It is also
Table 7.4: Steps in the confidence limit approach to hypothesis testing.
1. State the null hypothesis (H_0) and the alternative hypothesis (H_A).
2. Establish whether the test is one-tailed or two-tailed. (NB: all stat packages default to two-tailed testing, so most statisticians recommend two-tailed testing.)
3. Establish the probability of a false positive finding (a.k.a. the α level).
4. Establish the sample size (see Section X.X).
5. Calculate the observed descriptive statistic.
6. Find the α most unlikely outcomes on the distribution around the observed statistic. Note that confidence intervals are always calculated as two-tailed probabilities.
7. If the value of the statistic under the null hypothesis is not located within this interval, then reject H_0.
interval estimates are provided in phrases such as "the estimate of the mean was 27.6 ± 4.2." Confidence limits are always given in terms of (1 − α) units expressed as a percent. Hence, if α is .05, we speak of the 95% confidence interval. A confidence limit is a plus or minus interval such that, if an infinite number of random samples were selected, the interval would capture the population parameter (1 − α) percent of the time. That is a mouthful, so let's step back and explain it. Suppose that we repeatedly sampled 25 people from the general population and recorded their mean IQ. The means in the hat of means in this case would be normally distributed with a mean of 100 and a standard deviation of $15/\sqrt{25} = 3$.
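The phrase "capture the population parameter (1 − α) percent of the time" can be checked by simulation. The R sketch below uses the same illustrative population (μ = 100, σ = 15, samples of 25) and assumes σ is known; it is a demonstration of the definition, not part of the original example.

```r
# Coverage of the 95% confidence interval for the mean (sigma known).
set.seed(7)
mu <- 100; sigma <- 15; N <- 25
se     <- sigma / sqrt(N)             # = 3
n_reps <- 10000

covered <- replicate(n_reps, {
  x_bar <- mean(rnorm(N, mu, sigma))  # one mean from the hat of means
  lower <- x_bar - 1.96 * se
  upper <- x_bar + 1.96 * se
  lower <= mu && mu <= upper          # does this interval capture mu?
})
mean(covered)                         # close to .95
```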
Table 7.5: Correct and incorrect decisions in hypothesis testing.

                    Decision
State of H_0    Reject H_0    Do not reject H_0    Probability
True            α             (1 − α)              1.0
False           (1 − β)       β                    1.0
$$-1.96 = \frac{X_L - \bar{X}}{\sigma_{\bar{X}}} = \frac{X_L - 108.6}{3.128}$$
Solving for X_L gives the lower confidence limit as X_L = 102.47. The Z separating the upper 2.5% from the lower 97.5% is 1.96. Substituting this into Equation 7.2 gives the upper confidence limit as X_U = 114.73. Consequently, the confidence interval for the mean IQ in Mrs. Smith's class is between 102.47 and 114.73.

The final step in using a confidence interval is to examine whether the interval includes the statistic for the null hypothesis. If the statistic is not located within the interval, then reject the null hypothesis. Otherwise, do not reject the null hypothesis. The mean IQ for the general population (100) is the statistic for the null hypothesis. It is not located within the confidence interval. Hence, we reject the null hypothesis that Mrs. Smith's class is a random sample of the general population.
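As a sketch, the same confidence limits can be computed in R from the numbers in the example above.

```r
# 95% confidence interval for the mean IQ of Mrs. Smith's class.
x_bar <- 108.6                 # observed class mean
se    <- 15 / sqrt(23)         # standard error, about 3.128
ci    <- x_bar + c(-1, 1) * qnorm(0.975) * se
ci                             # about 102.47 to 114.73

100 >= ci[1] && 100 <= ci[2]   # FALSE: 100 lies outside, so reject the null hypothesis
```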
7.3 Issues in Hypothesis Testing
7.3.0.1 The Yin and Yang (or α and β) of Hypothesis Testing
In hypothesis testing there are two hypotheses, the null hypothesis and the alternative hypothesis. Because we test the null hypothesis, there are two decisions about it. We can either reject the null hypothesis or fail to reject it. This leads to two other possible decisions about the alternative hypothesis—reject it or do not reject it. The two by two contingency table given in Table 7.5 summarizes these decisions. The probabilities are stated as conditional probabilities given the null hypothesis. For example, given that the null hypothesis is true, the probability of rejecting it (and generating a false positive decision) is α. The probability of not rejecting it must be (1 − α). The table introduces one more statistical term—β, or the probability of a false negative error. Here, the null hypothesis is false but we err in failing to reject it.

To complicate matters, the term Type I error is used as a synonym for a false positive judgment or the rejection of a null hypothesis when the null hypothesis is in fact true. In slightly different words, you conclude that your substantive hypothesis has been confirmed when, in fact, that hypothesis is false. A Type II error is equivalent to a false negative error or failure to reject the null hypothesis when it is in fact false.
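The statement that a true null hypothesis is incorrectly rejected α percent of the time can also be illustrated by simulation. The sketch below reuses the IQ population from earlier as an assumed example; over many replications under a true null, the two-tailed Z test rejects in roughly 5% of samples.

```r
# False positive (Type I error) rate when the null hypothesis is true.
set.seed(13)
mu <- 100; sigma <- 15; N <- 25
se    <- sigma / sqrt(N)
alpha <- 0.05
crit  <- qnorm(1 - alpha / 2)     # 1.96 for a two-tailed test

reject <- replicate(10000, {
  z <- (mean(rnorm(N, mu, sigma)) - mu) / se  # Z when H0 is true
  abs(z) > crit                               # reject H0?
})
mean(reject)                      # close to alpha = .05
```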
The false discovery rate (FDR) is the proportion of all statistically significant findings in which the null hypothesis is in fact true. Calculation of the FDR depends on the nature of the problem and on the mathematical model which, in special circumstances, can be quite complicated. A technique proposed by Simes (1986) and Benjamini and Yekutieli (2001), however, may be applied in discrete brain areas. Let t denote the total number of statistical tests. Rank the tests according to their p values so that test 1 has the lowest p value and test t the highest. Starting with the highest-ranked test, find the cutoff point such that the p value for the jth test is less than or equal to
$$p_j \le \frac{j}{t}\,\alpha.$$
Then reject the null hypotheses for all lower-ranked tests, from j down to 1.
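This procedure takes only a few lines of code. The R sketch below applies it to the p values listed in Table 7.6; as a cross-check it also uses R's built-in p.adjust() with the Benjamini-Hochberg step-up method, which implements the same rule (a "BY" method for the Benjamini-Yekutieli variant is also available). The code is a gloss on the described procedure, not code from the original text.

```r
# Step-up false discovery rate procedure for the p values in Table 7.6.
p     <- c(.0059, .0063, .0370, .0691, .1200, .2547)  # already sorted, lowest first
alpha <- 0.05
t     <- length(p)

cutoffs <- (1:t) / t * alpha       # j * alpha / t for each rank j
j <- max(which(p <= cutoffs), 0)   # largest rank whose p value is at or below its cutoff
j                                  # 2: reject the null hypotheses for tests 1 and 2

# Equivalent decision using the built-in adjustment:
which(p.adjust(p, method = "BH") <= alpha)   # tests 1 and 2
```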
Table 7.6: Controlling the false discovery rate in multiple hypothesis testing.

Test (j)    p_j      jα/t
1           .0059    .0083
2           .0063    .0167
3           .0370    .0250
4           .0691    .0333
5           .1200    .0417
6           .2547    .0500
Table 7.6 provides an example of the calculations required for using the false discovery rate. There are t = 6 total tests ordered by their p values. The quantity jα/6 is calculated for each test. Starting with the sixth test, we compare the observed p value with this quantity and work our way upwards in the table until the p value is less than its critical value. This occurs at the second test. Hence, we reject the null hypothesis for the first and for the second test.

There are two areas of neuroscience in which multiple hypothesis testing requires very specialized techniques. By the time this is published, microarray technology will permit over one million polymorphisms to be assessed for those mammals most often used in neuroscience. Similarly, the number of RNA transcripts available for gene-expression studies will have greatly increased. Testing the prevalence of polymorphisms in, say, bipolar patients and controls requires the same number of statistical tests as the number of polymorphisms. With over one million statistical tests, a large number are certain to be significant just by chance. To make matters worse, these tests are "correlated" and not independent. If a false positive is found at one locus, there is a good chance that polymorphisms close to that locus will also be significant by chance because of linkage disequilibrium.^3

The second area is fine-grained imaging studies. Many studies require canvassing responses to stimuli across a number of brain areas, not all of which are independent. As in the gene arrays, the extent of statistical independence (or lack of it) depends on temporal and spatial autocorrelation among areas. Once again, these techniques are beyond the scope of an introductory text.
^3 Linkage disequilibrium is the rule and not the exception between closely spaced polymorphisms on a DNA strand. It refers to the fact that one can predict the value of the polymorphism at a locus given the value of a polymorphism at a second, nearby locus.
7.3.0.3 Two-tailed or One-tailed?
In the past, considerable ink was devoted to the debate over using one-tailed versus two-tailed testing. Ironically, the emergence of statistical software has almost stifled this controversy. Most packages automatically generate two-tailed p levels. Being lazy—or perhaps psychologically inclined to treat the printed word as fact—we have become accustomed to accepting the output at face value. This may be useful for some areas of science, but it is definitely not efficient for some areas of neuroscience. Hence, some discussion of the issue is required.

Mrs. Smith's class illustrates the problem (see Section X.X). The mean IQ was 108.6. Why perform a two-tailed test when the observed mean is clearly above the population mean of 100? The reason a two-tailed test is necessary in this case is that our judgment of the direction was made a posteriori, or after the fact. In other words, we looked at the mean and then made the decision that if Mrs. Smith's class is unrepresentative, it differs at the high end. This approach leads to problems because we are not testing the hypothesis that Mrs. Smith's class is smarter than normal. We are testing the null hypothesis that the class is a random sample of the general population. If we set the alpha level to .05 and if the null hypothesis is true, then looking at the mean and using a directional test based on that information will give a false positive rate of 10%, not 5%.

Let's change the problem. Suppose we do not know the mean IQ of the class but we know beforehand that Mrs. Smith's school is located in a wealthy suburb. The correlation between family income and IQ of offspring is about .30, so it is reasonable to conclude that Mrs. Smith's students will differ by being smarter than average. Note the difference between this latest scenario and the original problem. Here, we have two pieces of information (wealthy suburb and the correlation between income and IQ) that predict the direction of the effect. Furthermore—and absolutely crucial to the issue at hand—we predict the direction of the effect before knowing the mean or, for that matter, even gathering the data. When these conditions are met—strong theoretical justification and ignorance of the data—a one-tailed test is justified.

We can generalize this scenario to conclude that all exploratory research should use two-tailed tests. In addition, if there is any question about using one- or two-tailed tests, the default should be two-tailed. That is the major reason why two-tailed tests are used in the major statistical packages. Hence, if a colleague asks whether temperature regulation was disrupted by the drug used in your latest experiment and you reply that you have not looked at that, then you should test it using a two-tailed test.

Most neuroscience research is experimental, and scientists do not design an experiment without good reason. This holds a fortiori for research described in grants. For a funded grant, the original investigator must have had good reason, based on theory and/or prior empirical data, for proposing an experiment and hypothesizing the outcome of that experiment. In addition, the logic was vetted by the scientific committee that recommended funding for the study. Hence,