# Chapter 6. What Does "Not Significant" Really Mean?

Thus far, we have used statistical methods to reach conclusions by seeing how compatible the observations were with the null hypothesis that the treatment had no effect. When the data were unlikely to occur if this null hypothesis were true, we rejected it and concluded that the treatment had an effect. We used a test statistic (*F*, *t*, *z*, or χ²) to quantify the difference between the actual observations and those we would expect if the null hypothesis of no effect were true. We concluded that the treatment had an effect if the value of this test statistic was bigger than 95% of the values that would occur if the treatment had no effect. When this is so, it is common for medical investigators to report a *statistically significant* effect. On the other hand, when the test statistic is not big enough to reject the hypothesis of no treatment effect, investigators often report *no statistically significant difference* and then discuss their results as if they had proven that the treatment had no effect. *All they really did was fail to demonstrate that it did have an effect.* The distinction between positively demonstrating that a treatment had no effect and failing to demonstrate that it did have an effect is subtle but very important, especially in light of the small numbers of subjects included in most clinical studies.*

As already mentioned in our discussion of the *t* test, the ability to detect a treatment effect with a given level of confidence depends on the size of the treatment effect, the variability within the population, and the size of the samples used in the study. Just as bigger samples make it more likely that you will be able to detect an effect, smaller samples make it harder. In practical terms, this means that a study of a therapy that involves only a few subjects and fails to reject the null hypothesis of no treatment effect may have arrived at this result because the sample was too small for the statistical procedure to have the *power* to detect the effect, even though the treatment did have an effect. Conversely, considerations of the power of a test permit you to compute the sample size needed to detect a treatment effect of a given size that you believe is present.
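To make the dependence on effect size, variability, and sample size concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the procedure developed in this chapter: it uses the normal approximation for a two-sided comparison of two group means with equal group sizes and a known common standard deviation, rather than the *t* distribution. The function names (`approx_power`, `approx_sample_size`) and the parameter names are hypothetical and chosen only for this example.

```python
from math import ceil, sqrt
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1).
_STD_NORMAL = NormalDist()


def approx_power(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    Assumptions (for illustration only): normal approximation, equal
    group sizes, and a known common standard deviation `sigma`.
    `delta` is the true difference between the population means.
    """
    # Critical value that the test statistic must exceed in absolute value.
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # True difference expressed in units of its standard error.
    shift = delta / (sigma * sqrt(2 / n_per_group))
    # Probability that the statistic falls beyond either critical value.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def approx_sample_size(delta, sigma, power=0.80, alpha=0.05):
    """Approximate subjects needed per group to reach the requested power.

    Uses the same normal approximation and ignores the negligible
    probability of rejecting in the wrong direction.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    n = 2 * ((z_crit + z_power) * sigma / delta) ** 2
    return ceil(n)


if __name__ == "__main__":
    # Same effect size and variability, increasing sample size.
    for n in (5, 10, 20, 40):
        print(n, round(approx_power(delta=1.0, sigma=1.0, n_per_group=n), 3))
    print("n per group for 80% power:", approx_sample_size(delta=1.0, sigma=1.0))
```

With the effect size and variability held fixed, the computed power rises as the number of subjects per group increases, which is the pattern described above. When `delta` is zero, the function returns `alpha`, the chance of reporting an effect that is not there. Because the sketch relies on the normal distribution, it will be somewhat optimistic for the small samples typical of clinical studies.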

* This problem is particularly encountered in small clinical studies in which there are no “failures” in the treatment group. This situation often leads to overly optimistic assessments of therapeutic efficacy. See Hanley JA, Lippman-Hand A. If nothing goes wrong, is everything all right? Interpreting zero numerators. *JAMA.* 1983;249:1743–1745.

Now, we make a radical departure from everything that has preceded: we assume that the treatment *does* have an effect.

Figure 6-1 shows the same population of people we studied in Figure 4-3 except that this time the drug given to increase daily urine production works. ...