Moving away from p-values

#1
There has been a recent push to move away from relying on statistical significance and using p < 0.05 as a threshold for reporting results. What do practicing statisticians think? Is there any merit to it? I have seen some guides on how to craft your language when reporting statistical results (e.g., Smith (2020) https://doi.org/10.1007/s12237-019-00679-y), but does anyone know how to report multiple comparisons? Especially if you have numerous comparisons to discuss (i.e., more than 10), it would be impractical to use boxplots. Thanks in advance
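For concreteness, the kind of compact summary I have in mind (a rough sketch only, with made-up group names and simulated data; statsmodels' Tukey HSD output is just one possible format) would be a single table of effect estimates with adjusted confidence intervals, one row per comparison:

```python
# Sketch: report many pairwise comparisons as one table of mean differences
# with adjusted confidence intervals instead of a wall of boxplots.
# Group names and data are invented purely for illustration.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
groups = np.repeat([f"site_{i}" for i in range(5)], 20)   # 5 groups -> 10 pairwise comparisons
values = rng.normal(loc=np.repeat([0.0, 0.2, 0.5, 0.5, 1.0], 20), scale=1.0)

result = pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05)
print(result.summary())   # one row per comparison: mean difference, adjusted p, CI, reject flag
```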
 

Miner

TS Contributor
#3
I find this "controversy" interesting. My background is in industrial statistics, and there is no replication crisis or p-value bashing in this field. This leads me to believe that there are certain, fundamental differences in how frequentist methods are applied in other disciplines and that these differences may be the issue.

Some possible differences:
  • We are only interested in large effects that stand out from the background noise (no Andrew Gelman's kangaroo here). Changes cost money, so they have to pay for themselves.
  • Results must be replicated or they are not implemented (too costly)
  • We don't publish to gain publicity (results become proprietary trade secrets)
  • There is no motivation to exaggerate or make outrageous claims (you will lose your job if you are wrong)
 

noetsi

Fortran must die
#4
I doubt statisticians agree on this one way or the other. Academics rarely do. :p I also doubt they have any easy way to know what they think.

The problem with p values is that you can have a significant p value while your effect size is meaningless. I have seen people judge the value of something purely by which p value is lower. At work, people who don't have a statistical background treat statistical significance as critical: if a result is significant they see it as important, and if it is not, they don't. This despite the fact that we have populations.

When you combine this with issues of statistical power (a major issue for some tests) and generalizability (something it seems to me statisticians in the social sciences pay too little attention to), I agree we should move away from a focus on p values. We should stress effect size and how likely it is that you can generalize from your data to the population.
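A quick illustration of the first point, with simulated numbers invented purely for this example: given a huge sample, even a trivial difference comes out "significant" while the effect size stays negligible.

```python
# Sketch: with a very large sample, a practically negligible difference
# still yields a small p-value. Simulated data, for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200_000
a = rng.normal(loc=0.00, scale=1.0, size=n)
b = rng.normal(loc=0.01, scale=1.0, size=n)   # true difference of 0.01 SD -- practically negligible

t_stat, p = stats.ttest_ind(a, b)
cohens_d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
print(f"p = {p:.4g}, Cohen's d = {cohens_d:.3f}")   # p is typically far below 0.05 while d stays near 0.01
```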
 

Miner

TS Contributor
#5
The problem with p values is that you can have a significant p value while your effect size is meaningless. I have seen people judge the value of something purely by which p value is lower. At work, people who don't have a statistical background treat statistical significance as critical: if a result is significant they see it as important, and if it is not, they don't. This despite the fact that we have populations.
This is an application problem. They should decide in advance what effect size is important, then calculate the necessary sample size (see table below). The key is to keep the effect size and sample size in balance so that a significant p-value means an effect size of practical value. The only way you get a significant p-value and an unimportant effect size is if your sample size was too large. This wasn't a problem in Fisher's day because data were hard to come by. Today there is almost too much data of unknown quality, and no questions are asked about the data's pedigree.

When you combine this with issues of statistical power (a major issue for some tests) and generalizability (something it seems to me statisticians in the social sciences pay too little attention to), I agree we should move away from a focus on p values. We should stress effect size and how likely it is that you can generalize from your data to the population.
I would agree that the p-value should not be the sole focus, but I disagree that it is inherently bad. Focusing on the p-value alone is like a doctor focusing solely on your blood pressure and ignoring all of the other vital signs. That doesn't mean blood pressure is not important, but it must be weighed together with the other vitals to form a clear picture.

[Attached image: table of required sample sizes for given effect sizes]
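As a rough stand-in for that kind of table, here is a minimal sketch of the calculation (the effect sizes, alpha, and power below are arbitrary choices for illustration, not values from the original attachment):

```python
# Sketch of "decide the effect size first, then find the sample size":
# required n per group for a two-sample t-test at several effect sizes.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):   # Cohen's conventional small / medium / large effect sizes
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                                       alternative="two-sided")
    print(f"effect size d = {d}: roughly {n_per_group:.0f} per group")
```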
 

noetsi

Fortran must die
#6
"This is an application problem. They should decide in advance what effect size is important, then calculate the necessary sample size (see table below)."

That is not the way things are commonly done, in my observation, and I doubt it will change. Do you really think many people decide what effect size is important before they run the numbers, even academics? I never did, and I have never read anyone who suggested it. And I have read a lot of articles, including in elite journals, over the years.

I don't know what "inherently bad" means. I think there should be more emphasis on effect size and less on p values. In much of the social science literature, p values are all that come up; effect sizes are rarely discussed. It is simply: this is important and this is not, depending on whether it is below the p threshold. Whether you can generalize from your cases to the population is apparently not that important (although there is always a one-sentence warning about this at the end of the document, which I think is entirely ignored).
 

Miner

TS Contributor
#7
I emphasize planning the design in advance and thinking carefully about the sample sizes required, because samples are expensive to obtain. You are interfering with production uptime and probably generating some scrap or rework to boot. I am just saying that some disciplines could learn from the practices of others; we don't have all the answers. For example, many of our engineers still hold to the one-factor-at-a-time approach when DOE (design of experiments) is much better. On the positive side, we don't have a replication crisis or questioning of fundamentals.

Your comment about generalization is on point. Most experiments are performed under tightly controlled conditions and will not scale up to the noise found in the real world. I cover the concept of narrow and broad inference space extensively in my classes.
 

noetsi

Fortran must die
#8
You work in a very different world than I do, Miner. For one thing, I generally have populations. And you have a well-defined process, while even basic stuff is in dispute in my world. :p Your laws are well defined; mine are totally unknown.

I work in the world of vocational rehabilitation, where even why we spend money (and how) is a mystery. I do think that in general too much attention is given to statistics, because that is what most are taught in the social sciences (and only the basics), and far too little to DOE. Personally I think the latter is the gold mine that needs far more focus than it gets.

I agree about learning across disciplines, although there is a danger that those who adopt a method may not understand what they are adopting.
 
#9
Well, p values are like the poor; they're always with us.
https://www.vox.com/science-and-hea...values-statistical-significance-redefine-0005
The argument confuses p values, alpha, and the meaning of the word "significant".
We can't adjust p values; they are what they are. We can adjust alpha to wherever we want it. Where we made a major mistake was with the word "significant" and our understanding of its definition.
In statland, significant means "p is closer to mu than alpha." In the real world, significant means "important".
The problem starts with statfolk confusing the two meanings, not understanding the meanings, and misinforming the non-statfolk.
If the readers of statworks do not understand the meanings, shame on them.
If the statfolk do not include alternate alphas, such as .01 and .10 in their statwork, shame on them.
If we, the statteachers, don't explain clearly, shame on us.
And, we should change "significance" to "pass/fail" or "accept/reject".
From my book:
"In normal English, "significant" means important, while in Statistics "significant" means probably true (not due to chance). A research finding may be true without being important. When statisticians say a result is "highly significant" they mean it is very probably true. They do not (necessarily) mean it is highly important.
When the terms enter the non-statistical or real world, trouble begins. They have certain specific meanings to statisticians that are NOT the meanings that non-statistician English speakers understand."
While fixing poor choices of stattalk, we should fix/delete/change any "confidence" and ALL "margin of error" use; MOE is about the clueless newsfolk announcing to the uncaring.
 

ondansetron

TS Contributor
#10
...
"In normal English, "significant" means important, while in Statistics "significant" means probably true (not due to chance). A research finding may be true without being important. When statisticians say a result is "highly significant" they mean it is very probably true.
This just isn't true. A very low p-value does not mean very probably true. This is part of the huge misunderstanding by (often) nonstatisticians. A p-value is calculated under the assumption that the null hypothesis is true. P-values do not quantify the probability of "truth" for any "result" or hypothesis.
 
#11
"A very low p-value does not mean very probably true." Agree, I never said so.
"A p-value is calculated under the assumption that the null hypothesis is true." Not by me. I don't think this or any assumption affects the calculation of p.

"P-values do not quantify the probability of "truth" for any "result" or hypothesis." Not in the sense that p = xx.xx% sure that Ho is accepted; but certainly in the sense that p = .49 = 49% = 49/100 makes me more sure that Ho should be accepted than does p = .06 = 6% = 6/100.
Remember, it's statistics, we can be precise AND vague, can't prove anything.
 

ondansetron

TS Contributor
#12
"A very low p-value does not mean very probably true." Agree, I never said so.
"A p-value is calculated under the assumption that the null hypothesis is true." Not by me. I don't think this or any assumption affects the calculation of p.

"P-values do not quantify the probability of "truth" for any "result" or hypothesis." Not in the sense that p = xx.xx% sure that Ho is accepted; but certainly in the sense that p = .49 = 49% = 49/100 makes me more sure that Ho should be accepted than does p = .06 = 6% = 6/100.
Remember, it's statistics, we can be precise AND vague, can't prove anything.
Let's clarify: your post claimed that "...while in Statistics "significant" means probably true (not due to chance)...When statisticians say a result is "highly significant" they mean it is very probably true." None of that is true.

Then you also claim you [don't assume Ho is true when calculating a p-value]; how are you calculating p-values? This is literally embedded in the calculation. Either you're unaware of the assumption or you're not calculating a p-value. This is definitional; assume different null hypotheses and get different p-values on the same data.
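A quick sketch of what I mean, using simulated data invented for illustration: the same sample, tested against different null values, produces different p-values, because the hypothesized null value enters the test statistic directly.

```python
# Sketch: one sample, three different null hypotheses, three different p-values.
# Data are simulated purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=0.3, scale=1.0, size=50)   # one simulated sample

for mu0 in (0.0, 0.3, 1.0):                   # three different null hypotheses about the mean
    t_stat, p = stats.ttest_1samp(x, popmean=mu0)
    print(f"H0: mu = {mu0}  ->  p = {p:.4f}")
```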

Your last statement that for any p1>p2 you should be [more sure of accepting Ho] is also not generally true. P-values have nothing to do with accepting Ho.

This is part of the issue embedded in "p-value controversy."
 
#13
"Then you also claim you [don't assume Ho is true when calculating a p-value]; how are you calculating p-values? This is literally embedded in the calculation. Either you're unaware of the assumption or you're not calculating a p-value. This is definitional; assume different null hypotheses and get different p-values on the same data."
Perhaps I'm wrong.
With two sets of numbers distributed Normal, calculate the mean and standard deviation of each. Estimate the distribution of sample means of one set. The mean of the other set lies somewhere on that distribution, a certain distance from its mean. I call that the p value.
We're done operating on the two sets of numbers.
We now write Ho: mu1 = mu2
and
Ha: mu1 > mu2, or
mu1 < mu2, or
mu1 ≠ mu2

The p value will sorta change, but sorta not, as we change Ha: p can be 10% or 90% or 20% depending on which Ha is chosen. But it's like saying Boston to Chicago is 1,000 miles, Chicago to Los Angeles is 2,000 miles, and Boston to Los Angeles is 3,000 miles: three numbers describe Chicago's location, but Chicago's location hasn't changed.
While Ha changes p, I don't see where Ho changes p.
Perhaps I'm wrong.
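For concreteness, a small sketch with simulated data (invented for this example, and assuming a reasonably recent SciPy) of how the computed p value moves as Ha changes on the same two samples:

```python
# Sketch: the same two samples give different p-values as the alternative
# hypothesis changes (two-sided vs. each one-sided direction).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.normal(loc=0.0, scale=1.0, size=30)
b = rng.normal(loc=0.4, scale=1.0, size=30)

for alt in ("two-sided", "less", "greater"):   # Ha: means differ / mean(a) < mean(b) / mean(a) > mean(b)
    t_stat, p = stats.ttest_ind(a, b, alternative=alt)
    print(f"alternative = {alt!r}: p = {p:.4f}")
```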
 
#14
"Let's clarify: your post claimed that "...while in Statistics "significant" means probably true (not due to chance)...When statisticians say a result is "highly significant" they mean it is very probably true." None of that is true."
Statistics is about translating numbers into words, and that is darn difficult. I'm talking here about the interpretation, the definitions, commonly used.
If x bar is closer to mu than alpha, the result of the test/look is significant.
If x bar is closer to mu than alpha, and close to alpha, the result of the test/look is less significant. p is small.
If x bar is closer to mu than alpha, and far from alpha, the result of the test/look is more significant. p is large.
"Significant" means probably true (not due to chance). Alpha is the % of variation due to chance, the probability of an alpha error, the accept/reject threshold. x bar closer to mu than alpha = p > alpha = significant = accept Ho: mu1 = mu2. If I'm wrong, tell me why.
 
#15
"Your last statement that for any p1>p2 you should be [more sure of accepting Ho] is also not generally true. P-values have nothing to do with accepting Ho."
You are correct; my words are incorrect. If x bar being closer to mu than alpha is the accept condition, then accept/reject has nothing to do with p: with alpha = 5%, you reject when p = 3% and you reject when p = 4.99%.
However, the greater p/alpha, the more sure I am of the test result. A digital person lives in accept/reject land. I live in an analog world, where, as p approaches 50%, my smile widens. Why am I wrong?
 

noetsi

Fortran must die
#16
What p means depends on whether you use Bayesian or frequentist methods.

"In statland, significant means "p is closer to mu than alpha. In the real world, significant means "important". Well that is true I think the key is it means the effect size is substantively important. Not statistically significant, a very different concept. This is the heart of the problem, and why I am not a great fan of p values. Most people are not even statisticians of course. I have problems with this every time a p value comes up.