Awash in P-values – and adrift in the data
A key focus of market research is the interpretation of differences. Marketing teams who want to know how worried they ought to be or what actions to take routinely ask, “Is that difference significant?” or “What’s the p-value?” And while answering that basic question is just a matter of straightforward calculation, the real relevance of the answer is more elusive. These days, no one is short on data. What we’re often missing are the guidelines to assess what truly matters.
The statistical community has finally decided to cut itself loose from null hypothesis testing … The insights community has yet to join them
Prominent statisticians and behavioral scientists have engaged in colorful p-thrashing for many years without being able to dislodge the p-value from everyday practice. But change has come at last. Discomfort with the way p-values are widely misused and misinterpreted has led authoritative organizations like the American Statistical Association and the American Psychological Association to move away from null hypothesis significance testing (NHST) in favor of estimation. One academic journal has gone so far as to say it will not publish p-values at all. The overwhelming preference is for estimation frameworks that emphasize the magnitude of the difference between any two numbers, not the probability of observing that difference by mere chance.
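To make the estimation idea concrete, here is a minimal sketch in Python of reporting a difference as a magnitude with an interval around it rather than as a p-value. The brand scores (46% vs. 40%), the sample sizes, and the function name `difference_with_interval` are illustrative assumptions, and the simple Wald interval is just one of several ways to build such a range.

```python
from math import sqrt
from statistics import NormalDist


def difference_with_interval(p1, n1, p2, n2, confidence=0.95):
    """Return (difference, lower, upper) for p1 - p2 using a Wald interval.

    p1, p2 are observed proportions; n1, n2 are the two sample sizes.
    """
    diff = p1 - p2
    # Standard error of the difference between two independent proportions.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # Two-sided critical value for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical example: 46% vs. 40% favorability, 500 respondents per wave.
diff, low, high = difference_with_interval(0.46, 500, 0.40, 500)
print(f"Difference: {diff:+.1%} (95% interval {low:+.1%} to {high:+.1%})")
```

The output puts the size of the gap front and center and shows how much it could plausibly vary, which is the information a marketing team needs to decide whether the gap is worth acting on.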
While consensus about the need to retire p is overwhelming, this tsunami has yet to reach the shores of the insights community, a curious island off the coast of academic research with its own dialect and its own priorities. Research habits die especially hard when they are put to the service of tracking brand metrics over time, where consistency of approach is generally favored over innovation (all claims to the contrary notwithstanding). The fact that proprietary market research data are largely sheltered from public view also means that the methods for analyzing them are shielded from public debate.
The eternal quest for significance – and what it actually yields
Null hypothesis significance testing is meant to tell us whether the observed difference between two estimates should be treated as probably real or as the product of chance, that is, of sampling error. If the observed difference fails to reach our designated threshold of significance (e.g., 95% confidence), the difference is deemed probably not real. If the difference does reach significance, we can assume it is real, though not necessarily important or meaningful.
In data, as in life generally, it is helpful to distinguish the highly probable from the improbable. But “real” and “meaningful” are two different notions: “statistically significant” does not equate to “consequential.” Conversely, differences that fail to meet the test of significance (simply because the sample is insufficiently powered) can still be real and potentially quite consequential. Thus, significance testing on its own doesn’t tell us how much we should care about the differences we see or what actions to take. A myopic focus on p-values can divert attention from other critical considerations in interpreting and prioritizing research findings.
Here are a few of the reasons why we cannot necessarily tie a secure anchor to significance testing.
Statistical significance is heavily influenced by sample size.
By design, p-values give greater weight to larger samples on the presumption that more observations reduce error, and, indeed, very large samples are better able to withstand the effects of outliers. The flip side is that if the sample size is large enough, almost any observed difference will qualify as statistically significant. We have been famously misdirected to change medical practice by tiny but statistically significant correlations found in huge health studies, for instance. Conversely, if the sample size is small (for instance, because the customer universe is limited), real and meaningful differences may be discounted for lack of formal significance.
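A short Python sketch illustrates the point. It holds a two-point gap between two scores constant and changes only the number of respondents per group. The scores (42% vs. 40%), the sample sizes, and the function name `two_proportion_p_value` are assumptions chosen for illustration; the calculation is a standard pooled two-proportion z-test with equal group sizes.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(p1, p2, n_per_group):
    """Two-sided p-value for a pooled two-proportion z-test, equal group sizes."""
    # With equal group sizes, the pooled proportion is the simple average.
    pooled = (p1 + p2) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = abs(p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical awareness scores: the same 2-point gap at three sample sizes.
for n in (200, 2_000, 20_000):
    p = two_proportion_p_value(0.42, 0.40, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:>6}: p = {p:.4f} ({verdict})")
```

The gap never changes, yet only the largest sample clears the conventional 0.05 threshold. The verdict says more about how many people were interviewed than about whether two points matter to the business.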
Even results deemed statistically significant frequently fail to replicate.
One key reason for wholesale defection from p is a “crisis of replication” that has plagued scientific inquiry for decades. The reasons for this are complex – too involved for this article – but they point to a basic problem. We are lured into a false sense of confidence about our data when we look to p for its credibility.
Statistical significance is a binary idea in a world shaded by gray.
While people may be tempted to blur the line when calls are close, significance testing is a binary idea. A p-value is either significant or it is not. You can, of course, grade on a curve by setting the threshold you want to achieve, but because the test is designed to reflect sample size as much as magnitude of difference, there is often a painful arbitrariness to the outcome. A difference deemed significant with n=100 might fail to qualify as significant with just n=99. To state that a number “tends toward” significance is a statistical “wink” that violates the basic premise of the test, though it’s in line with the cloudier nature of reality, which routinely plays out on a continuum.
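The arbitrariness is easy to demonstrate. The Python sketch below searches for the smallest per-group sample size at which a fixed difference first crosses p < 0.05, then compares it with the sample size one respondent smaller. The scores (50% vs. 36%) and the function names are hypothetical, and holding the observed proportions fixed while n changes is a simplification made for the sake of the illustration.

```python
from math import sqrt
from statistics import NormalDist


def p_value(p1, p2, n_per_group):
    """Two-sided p-value for a pooled two-proportion z-test, equal group sizes."""
    pooled = (p1 + p2) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    return 2 * (1 - NormalDist().cdf(abs(p1 - p2) / se))


def first_significant_n(p1, p2, alpha=0.05, max_n=100_000):
    """Smallest per-group n at which the fixed difference reaches p < alpha."""
    for n in range(2, max_n + 1):
        if p_value(p1, p2, n) < alpha:
            return n
    return None  # never significant within the search range


# Hypothetical scores: 50% vs. 36%, with the observed gap held fixed.
n_star = first_significant_n(0.50, 0.36)
for n in (n_star - 1, n_star):
    p = p_value(0.50, 0.36, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n}: p = {p:.4f} ({verdict})")
```

The two p-values are nearly identical, but one lands on the "significant" side of the line and the other does not. Nothing about the underlying difference has changed between the two rows.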