
Don't Be Fooled by Flimsy Findings

It is surprisingly easy to make an "effect" go away.

Key points

  • Many statistically significant findings are actually not very meaningful.
  • It is not easy to gauge the credibility of these findings.
  • The peelback method can help people judge the robustness of a finding.

We are used to reading reports that a study was done and that the participants showed the effect. The participants in the experimental group outperformed those in the control group. Often the report adds that the effect was statistically significant, perhaps at the .05 level or even at the .01 level.

So it is easy to form the impression that all the participants showed the effect. However, that would be wrong.

In tests of statistical significance, the p-value is computed from a test statistic that is essentially the difference between group means divided by (or expressed in terms of) the variability in the scores; effect sizes are built the same way. Statistical significance is with respect to group averages. So to what extent does a difference between group means represent the differences between individuals?

Perhaps the participants all showed the effect, but only to a very small extent, and the large number of participants was enough to produce statistical significance. Statisticians have worried about this possibility and have invented measures of effect size to clarify the finding. Unfortunately, few reports, especially in the media, include any information about effect size, probably because adding more details like this just muddies the picture and confuses the reader.

Perhaps only some of the participants showed the effect, but showed it to a large extent, outweighing those who didn’t show any effect at all. Simple measures of variability can reveal whether this is the case, but again, many lay readers won’t understand or be interested in the meaning of variability, and most of the reports they receive won’t include standard deviations.
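
For readers who like to see the machinery, here is a minimal sketch in Python. The scores are invented placeholders, not data from any actual study: the sketch runs a two-group t-test to get a p-value, computes Cohen's d as one common effect-size measure, and prints each group's standard deviation as a simple measure of variability.

    # Illustration only: the scores below are invented placeholders, not real data.
    import numpy as np
    from scipy import stats

    experimental = np.array([0.62, 0.71, 0.55, 0.68, 0.74, 0.59, 0.66, 0.70, 0.63, 0.58])
    control = np.array([0.51, 0.49, 0.57, 0.46, 0.60, 0.53, 0.44, 0.55, 0.50, 0.48])

    # p-value from a two-tailed independent-samples t-test (it compares group averages)
    t_stat, p_value = stats.ttest_ind(experimental, control)

    # Cohen's d: difference between group means divided by the pooled standard deviation
    pooled_sd = np.sqrt((experimental.var(ddof=1) + control.var(ddof=1)) / 2)
    cohens_d = (experimental.mean() - control.mean()) / pooled_sd

    print(f"p = {p_value:.3f}, Cohen's d = {cohens_d:.2f}")
    print(f"SD (experimental) = {experimental.std(ddof=1):.3f}, SD (control) = {control.std(ddof=1):.3f}")

A tiny p-value by itself says little about how large or how widespread the effect is; the effect size and the standard deviations carry that information.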

So let’s try another approach: making it very easy for readers to grasp how pervasive an effect is.

To explore this question, we obtained data from an actual funded study (not conducted by us) comparing performance on a task by two groups: one aided by an AI system (experimental group) and one not (control group). There were 30 participants in each group, a reasonable number. This particular data set was selected because the distributions seemed approximately normal (by the "eyeball" test).

You can see the results below. The distribution for the experimental condition is shown in purple and the distribution for the control condition is shown in black. The figure shows that the two distributions overlap a good deal.

Overlapping Distributions for Experimental and Control Groups. Source: Robert Hoffman

So there’s an effect here, and it is statistically significant: p<.001, using a two-tailed t-test. But it certainly isn’t universal. We need to keep that in mind when we discuss findings such as these.

But what would it take to make this significant effect at the p<.001 level disappear?

The Peelback Method

We can progressively peel away the extremes. First, we removed the data for the two participants in the experimental group who scored the highest and the one participant in the control group who scored the lowest.

Bingo.

The proportion of correct responses in the experimental group dropped from 65% to 54%, while the proportion correct in the control group increased a little, from 48% to 49%. The t-test now shows p = .334. Not even close to being counted as statistically significant. So the initial “effect” doesn’t seem very robust.

If statistical significance were still achieved after this first peel, we could keep peeling and recomputing until the p<.05 threshold was crossed. We might find that we had to do a lot of peeling. In that case, we would have much more confidence in the conclusion about the statistical significance. But if statistical significance disappears merely by dropping three of the 60 participants, then how seriously can we take the results? Or how seriously should we take the t-test?

This general method could be turned into an actual proportional "metric": the number of data points that must be peeled away relative to the total sample size. In the present case, that number is 3/60. The smaller that ratio, the more tenuous the statistical effect.
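
For researchers who want to try this on their own data, here is a minimal sketch of the peelback procedure in Python. It is one rough way of operationalizing the idea (removing one extreme score from each group per peel), not code from the study described above: it keeps peeling and re-running the t-test until the p-value rises above .05, then reports the number of removed data points relative to the total sample size.

    # A sketch of the peelback method: peel away extreme scores until the effect
    # stops being statistically significant, then report how much peeling it took.
    # This is one way to operationalize the idea, not the code used in the study.
    from scipy import stats

    def peelback_ratio(experimental, control, alpha=0.05):
        exp = sorted(experimental)   # ascending: the last element is the top scorer
        ctl = sorted(control)        # ascending: the first element is the bottom scorer
        total_n = len(exp) + len(ctl)
        removed = 0

        # Keep peeling while the effect stays significant and enough scores remain.
        while len(exp) > 2 and len(ctl) > 2:
            _, p = stats.ttest_ind(exp, ctl)
            if p >= alpha:           # the "effect" has disappeared
                break
            exp.pop()                # drop the highest experimental score
            ctl.pop(0)               # drop the lowest control score
            removed += 2

        return removed, total_n, removed / total_n

Calling peelback_ratio(experimental_scores, control_scores) returns the number of data points removed, the total sample size, and their ratio; the smaller that ratio, the more the statistical effect hinges on a few extreme participants.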

Side note: We did not cherry-pick this example. We simply wanted a straightforward data set that yielded statistically significant results. We had no idea in advance that the example would illustrate our thesis so well.

Conclusion

For lay readers of psychology research reports, the peelback method might be much easier to grasp than other kinds of statistics such as effect sizes.

Researchers themselves might find it a useful exercise to apply the peelback method to their own experiments. If they have the courage. Researchers could then consider which participants were responsible for the findings and what these participants were like.

But let's think again about what the peeling entails. Some statistics textbooks refer to the problem of "outliers," and even present procedures enabling researchers to justify the removal of data from outliers. The concept seems to be that "outliers" just add noise to the data, hiding the "true" effects.

The highest-performing participants in an experimental group (in studies like the one referenced here) demonstrate what is humanly possible. The worst-performing participants in both the experimental and the control groups might be pointers to issues of motivation or selection. The "best" and the "worst" performers are the individuals who should be studied in greater detail, say by in-depth post-experimental cognitive interviews.

Unfortunately, many psychological studies do not include in-depth post-experimental cognitive interviews as a key part of their method, and even when interviews are conducted, the results are usually given short shrift in the research reports. Data about what people are thinking will always clarify the meaning and "significance" of the numerical results.

So this little exploration of a simple idea exposes some substantive issues and traps in research methodology. Not the least of these is a cautionary tale: never confuse a statistical effect (which is about groups) with a causal effect that an independent variable might have on individuals.

Robert Hoffman is a co-author of this post.
