Is a Bonferroni correction needed for my particular dataset?

I received a review of my manuscript in which one of the three reviewers advised that I will need to apply a Bonferroni correction due to multiple comparisons. However, based on the examples I have reviewed, I am not certain that a Bonferroni correction is needed in this case. Here is the description of my data and current analysis:

I collected data from 10 human subjects. Each subject's data were recorded during 5 testing sessions held on 5 different dates. Each subject trained an artificial intelligence program to perform a specified task, and during each session, 3 different methods of training the AI program were used (the same 3 methods were used for all subjects and across all testing sessions; the human subjects provided the data for one of the methods, while the other 2 methods used algorithms to generate the associated data). For each session, I perform a Kruskal-Wallis ANOVA with post hoc multiple comparisons to compare, using 3 different outcome metrics, how successfully the set of 10 subjects trained the AI program under the 3 tested methods. The comparison between each pair of methods is reported as either significantly different or not based on a critical p-value of 0.05.

In case it is relevant, the 5 sessions in which each subject participated "built upon" each other; e.g. the AI program that was trained in Session 1 was used as the starting point for Session 2 for the same subject, so that the AI program's performance gradually improved over the 5 sessions performed by each human subject.

I am wondering whether a Bonferroni correction is really necessary, since for each of the 5 sessions, the data was collected separately from the data of the other sessions, so that performing parallel analyses on the 5 sessions might not necessitate a correction of the p-value in the same way that performing numerous comparisons on the *same* dataset would.

Or, since I am comparing 3 different outcome measures for the 3 conditions being compared for each session, would I need a Bonferroni correction factor of 3 for the 3 outcome measures being used?

Thanks in advance for any advice you can provide.


Well-Known Member
The issue isn't whether the data come from the same data set or not. It is a question of protection against making a false positive claim. Setting a critical value for significance of p < 0.05 gives you 95% protection against a false positive for any single test. I think of it as being like Russian roulette with one bullet in a 20-chamber revolver: there is one chance in 20 of shooting yourself in the foot.
Now, imagine that the responses are in fact perfectly random and there is no connection or difference between the responses for any of your factors. When you do your basic analyses on this random data you will get a series of p-values - it's not entirely clear how many, but at least 5. Ideally, all these p-values will be > 0.05, but of course, with 5 p-values you now have 5 chances of shooting yourself in the foot. Your protection against a false positive has been eroded from 95% to about 77% (0.95^5 ≈ 0.77).
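The erosion described above can be computed directly. A minimal sketch, assuming the tests are independent, so the chance of at least one false positive among k tests at level alpha is 1 - (1 - alpha)^k:

```python
# Family-wise error rate for k independent tests, each run at alpha = 0.05.
alpha = 0.05

for k in (1, 5, 15):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests: P(at least one false positive) = {fwer:.3f}")
```

With 5 tests the protection drops to about 77% (FWER ≈ 0.226), and with 15 tests to about 46% (FWER ≈ 0.537).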
This problem of the erosion of protection against a false positive when there are multiple p-values is one that has always plagued statisticians. Many approaches have been proposed, but none really solve the problem. The simplest approach is the Bonferroni correction, and the others are not much different.
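For illustration, one of the common alternatives to plain Bonferroni is the Holm step-down procedure, which is uniformly at least as powerful while controlling the same family-wise error rate. A minimal sketch (the helper name `holm_reject` is hypothetical, not from this thread):

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down procedure.

    Sort the m p-values ascending; compare the smallest to alpha/m,
    the next to alpha/(m-1), and so on, stopping at the first
    non-rejection. Returns a list of booleans in the original order.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if pvalues[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are also retained
    return reject
```

For example, with p-values [0.001, 0.04, 0.03] and alpha = 0.05, only the first is rejected: 0.001 passes 0.05/3, but 0.03 fails 0.05/2, and the procedure stops there.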
Thanks for your response, katxt; it sounds like I should plan to perform an adjustment of my critical p-value.

However, to return to my original post, should this be a correction factor of 3 since I am reporting 3 different outcome metrics, or should this be a correction factor of 5 since I have 5 different overall datasets from the 5 different data collection sessions? To clarify, I am reporting 3 different outcome metrics for each of 5 data collection sessions, for a total of 15 p-values.

Thanks to you or to any other poster who can provide clarification of this point.


TS Contributor
I believe that if you're making 15 comparisons, then you should use 15 to make your adjustment. In other words, if c is the total number of comparisons being reported, you would use c rather than the number of data sets or the number of outcome variables on its own.
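A minimal sketch of the arithmetic, assuming the 15 reported p-values (3 outcome metrics × 5 sessions) are treated as one family:

```python
# Bonferroni adjustment: divide the critical p-value by the number
# of comparisons in the family.
alpha = 0.05
c = 15  # 3 outcome metrics x 5 sessions

bonferroni_alpha = alpha / c
print(f"Bonferroni-adjusted critical p-value: {bonferroni_alpha:.4f}")
# A comparison is then reported as significant only if p < 0.0033.
```

Equivalently, you can leave the threshold at 0.05 and multiply each raw p-value by c (capping at 1) before comparing.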