Jul 23, 2017

The overconfident use of the NHST

In 1927 Edwin B. Wilson said:

That is why I ask: What is statistics, not what are statistics, nor yet what is statistical method? And I venture to suggest in a tentative, undogmatic sort of way that it is largely because of lack of knowledge of what statistics is that the person untrained in it trusts himself with a tool quite as dangerous as any he may pick out from the whole armamentarium of scientific methodology.

Wilson, Edwin Bidwell. “What Is Statistics?” Science, vol. 65, no. 1694, 1927, pp. 581–587. JSTOR, www.jstor.org/stable/1652319.

Statistics employs a set of measurement tools that are difficult to understand and are often misapplied. In a 2016 article, Greenland et al. state:

Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so—and yet these misinterpretations dominate much of the scientific literature.


Much statistical teaching and practice has developed a strong (and unhealthy) focus on the idea that the main aim of a study should be to test null hypotheses. In fact most descriptions of statistical testing focus only on testing null hypotheses, and the entire topic has been called “Null Hypothesis Significance Testing” (NHST). This exclusive focus on null hypotheses contributes to misunderstanding of tests. Adding to the misunderstanding is that many authors (including R.A. Fisher) use “null hypothesis” to refer to any test hypothesis, even though this usage is at odds with other authors and with ordinary English definitions of “null”—as are statistical usages of “significance” and “confidence.”


The general definition of a P value may help one to understand why statistical tests tell us much less than what many think they do: Not only does a P value not tell us whether the hypothesis targeted for testing is true or not; it says nothing specifically related to that hypothesis unless we can be completely assured that every other assumption used for its computation is correct—an assurance that is lacking in far too many studies.

Nonetheless, the P value can be viewed as a continuous measure of the compatibility between the data and the entire model used to compute it, ranging from 0 for complete incompatibility to 1 for perfect compatibility, and in this sense may be viewed as measuring the fit of the model to the data. Too often, however, the P value is degraded into a dichotomy in which results are declared “statistically significant” if P falls on or below a cut-off (usually 0.05) and declared “nonsignificant” otherwise. The terms “significance level” and “alpha level” (α) are often used to refer to the cut-off; however, the term “significance level” invites confusion of the cut-off with the P value itself. Their difference is profound: the cut-off value α is supposed to be fixed in advance and is thus part of the study design, unchanged in light of the data. In contrast, the P value is a number computed from the data and thus an analysis result, unknown until it is computed.
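
As a small illustration of the dichotomy described in the passage above, the sketch below computes two-sided P values for two nearly identical test statistics. It assumes a simple z-test against a standard normal reference distribution; the function name `two_sided_p` and the z values 1.95 and 1.97 are my own hypothetical choices, not taken from Greenland et al.

```python
import math

ALPHA = 0.05  # cut-off fixed in advance, part of the study design


def two_sided_p(z):
    """Two-sided P value for a z statistic under a standard normal model."""
    return math.erfc(abs(z) / math.sqrt(2))


# Two hypothetical studies whose test statistics barely differ.
for z in (1.95, 1.97):
    p = two_sided_p(z)
    verdict = "significant" if p <= ALPHA else "nonsignificant"
    print(f"z = {z:.2f}  p = {p:.4f}  -> {verdict}")
```

The two P values differ by less than 0.003, yet the fixed cut-off gives them opposite labels.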

In the article, Greenland et al. list many of the misconceptions currently found in the practice of statistics. This is not new; others have pointed out the misuse of statistics before. Cohen was one such critic. He wrote:

Despite my career-long identification with statistical inference, I believe, together with such luminaries as Meehl (1978), Tukey (1977), and Gigerenzer (Gigerenzer & Murray, 1987), that hypothesis testing has been greatly overemphasized in psychology and in the other disciplines that use it. It has diverted our attention from crucial issues. Mesmerized by a single all-purpose, mechanized, "objective" ritual in which we convert numbers into other numbers and get a yes-no answer, we have come to neglect close scrutiny of where the numbers came from. Recall that in his delightful parable about averaging the numbers on football jerseys, Lord (1953) pointed out that "the numbers don't know where they came from." But surely we must know where they came from and should be far more concerned with why and what and how well we are measuring, manipulating conditions, and selecting our samples.

We have also lost sight of the fact that the error variance in our observations should challenge us to efforts to reduce it and not simply to thoughtlessly tuck it into the denominator of an F or t test.


The implications of the things I have learned (so far) are not consonant with much of what I see about me as standard statistical practice. The prevailing yes-no decision at the magic .05 level from a single research is a far cry from the use of informed judgment. Science simply doesn't work that way. A successful piece of research doesn't conclusively settle an issue, it just makes some theoretical proposition to some degree more likely. Only successful future replication in the same and different settings (as might be found through meta-analysis) provides an approach to settling the issue. How much more likely this single research makes the proposition depends on many things, but not on whether p is equal to or greater than .05: .05 is not a cliff but a convenient reference point along the possibility-probability continuum. There is no ontological basis for dichotomous decision making in psychological inquiry. The point was neatly made by Rosnow and Rosenthal (1989) last year in the American Psychologist. They wrote "surely, God loves the .06 nearly as much as the .05" (p. 1277). To which I say amen!

In 1994 Cohen criticized the NHST as follows:

What's wrong with NHST? Well, among many other things it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! What we want to know is "Given these data what is the probability that Ho is true?" But as most of us know, what it tells us is "Given that Ho is true, what is the probability of these (or more extreme) data?" These are not the same, as has been pointed out many times over the years by the contributors to the Morrison-Henkel (1970) book, among others, and, more recently and emphatically, by Meehl (1978, 1986, 1990a, 1990b), Gigerenzer (1993), Falk and Greenbaum (in press), and yours truly (Cohen, 1990).


When one tests Ho, one is finding the probability that the data (D) could have arisen if Ho were true, P(D|Ho). If that probability is small, then it can be concluded that if Ho is true, then D is unlikely. Now, what really is at issue, what is always the real issue, is the probability that Ho is true, given the data, P(Ho|D), the inverse probability. When one rejects Ho, one wants to conclude that Ho is unlikely, say, p < .01. The very reason the statistical test is done is to be able to reject Ho because of its unlikelihood! But that is the posterior probability, available only through Bayes's theorem, for which one needs to know P(Ho), the probability of the null hypothesis before the experiment, the "prior" probability.
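
To make Cohen's distinction concrete, here is a minimal sketch of Bayes's theorem for the case where Ho and a single alternative H1 are the only hypotheses. All of the numbers are hypothetical and chosen by me for illustration: D stands for "the test rejects at the .05 level", P(D|Ho) is set to .05, P(D|H1) is an assumed power of .50, and two different priors P(Ho) are tried.

```python
def posterior_h0(p_d_given_h0, p_d_given_h1, prior_h0):
    """P(Ho|D) by Bayes's theorem when Ho and H1 are the only hypotheses."""
    prior_h1 = 1.0 - prior_h0
    joint_h0 = p_d_given_h0 * prior_h0  # P(D|Ho) * P(Ho)
    joint_h1 = p_d_given_h1 * prior_h1  # P(D|H1) * P(H1)
    return joint_h0 / (joint_h0 + joint_h1)


# Hypothetical inputs: D is "the test rejects at the .05 level".
P_D_GIVEN_H0 = 0.05  # chance of rejecting when Ho is true
P_D_GIVEN_H1 = 0.50  # chance of rejecting when H1 is true (assumed power)

for prior in (0.5, 0.9):
    post = posterior_h0(P_D_GIVEN_H0, P_D_GIVEN_H1, prior)
    print(f"P(Ho) = {prior:.1f}  ->  P(Ho|D) = {post:.3f}")
```

With these made-up inputs, P(D|Ho) is .05 in both cases, while P(Ho|D) comes out near .09 or .47 depending on the prior. The same test result can leave the null hypothesis anywhere from unlikely to nearly a coin flip.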

Andrew Gelman, in a response to the ASA statement on p-values, recently wrote of NHST:

Ultimately the problem is not with p-values but with null hypothesis significance testing, that parody of falsificationism in which straw-man null hypothesis A is rejected and this is taken as evidence in favor of preferred alternative B (see Gelman 2014). Whenever this sort of reasoning is being done, the problems discussed above will arise. Confidence intervals, credible intervals, Bayes factors, cross-validation: you name the method, it can and will be twisted, even if inadvertently, to create the appearance of strong evidence where none exists.


What went wrong? How is it that we know that design, data collection, and interpretation of results in context are so important—and yet the practice of statistics is so associated with p-values, a typically misused and misunderstood data summary that is problematic even in the rare cases where it can be mathematically interpreted?

I put much of the blame on statistical education, for two reasons.

First, in our courses and textbooks (my own included), we tend to take the “dataset” and even the statistical model as given, reducing statistics to a mathematical or computational problem of inference and encouraging students and practitioners to think of their data as given. Even when we discuss the design of surveys and experiments, we typically focus on the choice of sample size, not on the importance of valid and reliable measurements. The result is often an attitude that any measurement will do, and a blind quest for statistical significance.

Second, it seems to me that statistics is often sold as a sort of alchemy that transmutes randomness into certainty, an “uncertainty laundering” that begins with data and concludes with success as measured by statistical significance. Again, I do not exempt my own books from this criticism: we present neatly packaged analyses with clear conclusions. This is what is expected—demanded—of subject-matter journals. Just try publishing a result with p = 0.20. If researchers have been trained with the expectation that they will get statistical significance if they work hard and play by the rules, if granting agencies demand power analyses in which researchers must claim 80% certainty that they will attain statistical significance, and if that threshold is required for publication, it is no surprise that researchers will routinely satisfy this criterion, and publish, and publish, and publish, even in the absence of any real effects, or in the context of effects that are so variable as to be undetectable in the studies that are being conducted (Gelman and Carlin 2014).
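
The last point in the passage above, significance "even in the absence of any real effects", can be sketched with a small simulation. This is my own simplified setup, not Gelman's: every simulated study draws its data from a standard normal distribution with a true effect of exactly zero, the standard deviation is treated as known, and the function names and sample sizes are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided P value for a z statistic under a standard normal model."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_studies(n_studies, n_per_study, alpha=0.05, seed=0):
    """Fraction of studies reaching p <= alpha when the true effect is zero."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_study)]
        mean = sum(sample) / n_per_study
        z = mean * math.sqrt(n_per_study)  # assumes a known sd of 1
        if two_sided_p(z) <= alpha:
            significant += 1
    return significant / n_studies


share = simulate_null_studies(n_studies=10_000, n_per_study=20)
print(f"share of null studies reaching significance: {share:.3f}")
```

Roughly one in twenty of these effect-free studies clears the threshold by chance. If significance is the filter for publication, those are the ones that get published.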

In his book Statistical Reasoning in Medicine, Lemuel Moyé wrote about the common practice of focusing only on the p-value and disregarding other aspects of statistical and scientific practice:

Replacing the careful consideration of a research effort’s (1) methodology, (2) sample size, (3) magnitude of the effect of interest, and (4) variability of that effect size with a simple, hasty look at the p-value is a scientific thought-crime.

Lemuel A. Moyé, Statistical Reasoning in Medicine: The Intuitive P-value Primer, 2nd ed.
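
Moyé's list can be illustrated with a short sketch showing why the p-value alone cannot stand in for sample size and effect magnitude. It assumes a z-test with a known standard deviation; the two studies and their numbers are hypothetical and chosen by me so that the test statistics coincide.

```python
import math


def z_statistic(effect, sd, n):
    """z statistic for a mean difference `effect` with known sd and sample size n."""
    return effect / (sd / math.sqrt(n))


def two_sided_p(z):
    """Two-sided P value for a z statistic under a standard normal model."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical studies: (label, effect, sd, n)
studies = [
    ("tiny effect, huge sample", 0.02, 1.0, 10_000),
    ("large effect, small sample", 0.50, 1.0, 16),
]

for label, effect, sd, n in studies:
    p = two_sided_p(z_statistic(effect, sd, n))
    print(f"{label}: effect = {effect}, n = {n}, p = {p:.4f}")
```

Both studies yield the same p-value (about 0.0455), although one effect is twenty-five times larger than the other. A hasty look at p cannot tell them apart.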

Too little effort still goes into understanding how these statistical tools should be used, and the conclusions drawn from them, often false, are interpreted in a deterministic, overconfident manner.

The bold text is my emphasis.

