What's the correct correction factor? Multiple one-sample and paired tests...

5tudent

New Member
This question concerns the chance of Type I error in hypothesis testing with multiple statistical tests. The data come from a series of experiments. To make it easier to describe clearly, we could consider the experiment as having 3 parameters that could be varied. We could call these A, B, and C. A has five levels, B has four levels, and C has three levels. Some combinations of A, B, and C don't go together, so when we remove those we have 42 distinct combinations left.

For each of the 42 combinations, we test the combination of treatment parameters on each of seven subjects to yield a single real number as a result. Per the professor's specification, it is the ratio of the alternative treatment result to the standard treatment result. We separately compute the mean for each of these 42 treatments and test the hypothesis that each of the 42 mean values is equal to one, i.e. standard treatment is equivalent to alternative treatment. Is the Bonferroni adjustment required here even though each of the 42 t-tests is considered independent of the others? (The trials are still each being done on the same seven subjects at different times.)
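To make the correction question concrete, here is a minimal pure-Python sketch of the two standard family-wise adjustments that would apply to the 42 one-sample t-tests: Bonferroni and the Holm step-down procedure (which controls the same error rate but is uniformly more powerful). The p-values shown are made-up placeholders; in practice they would come from testing H0: mean ratio = 1 for each treatment combination.

```python
# Family-wise error corrections for m simultaneous tests.
# The p-values used in the example are hypothetical placeholders.

def bonferroni(pvals):
    """Bonferroni-adjusted p-values: each raw p times m, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down adjusted p-values.

    Sorts the raw p-values, multiplies the k-th smallest by (m - k + 1),
    and enforces monotonicity, capping at 1. Controls the family-wise
    error rate like Bonferroni, but rejects at least as often.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Hypothetical example with three raw p-values (m = 3 for illustration;
# in the experiment described above m would be 42):
raw = [0.001, 0.04, 0.20]
print(bonferroni(raw))
print(holm(raw))
```

With 42 tests, Bonferroni multiplies every raw p-value by 42, so a result must reach roughly p < 0.0012 to stay below 0.05; Holm is slightly less punishing for all but the smallest p-value.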

In addition, the professor says we must also stratify according to the five levels of A to perform pairwise t-tests between those levels. (He did not specify how to handle levels of B and C.) The null hypothesis is that the means are equal. Statistics professors have rightly cautioned against "data dredging," i.e., performing too many statistical tests on the same set of data in search of something significant. If it were the pairwise t-tests only, perhaps the Bonferroni adjustment would help, but performing one-sample t-tests and then pairwise t-tests on the same data is troubling.

Just 2 questions:
1. Is some other test better than a pairwise t-test here?
2. Is some appropriate correction possible for these different kinds of tests on the same data?

This post applies to part of the problem mentioned above, but no-one has answered the post so far:

This other post may apply, but no-one gave a definitive answer so far:

This similar one went unanswered as well:

Karabiner

TS Contributor
The data come from a series of experiments. To make it easier to describe clearly, we could consider the experiment as having 3 parameters that could be varied. We could call these A, B, and C. A has five levels, B has four levels, and C has three levels. Some combinations of A, B, and C don't go together, so when we remove those we have 42 distinct combinations left.

For each of the 42 combinations, we test the combination of treatment parameters on each of seven subjects to yield a single real number as a result.
etc. pp.

Could you perhaps describe what this is actually all about (research topic, research question,
study design, measurements, sample size)? Such completely abstract descriptions are not only
very boring and difficult to follow, but also tend to miss crucial points of the problem.

With kind regards

K.

PeterFlom

New Member
While I agree with Karabiner (I have a feeling I will be doing that a lot), I will hazard a guess (but I warn you that it may be off, because you have not given context).

1. If you are trying to test whether the 5 levels of A are all equal in their effects, the natural thing to do seems to be an ANOVA/regression. ANOVA is a generalization of the t-test.

2. Opinions differ. This is more a philosophical question than a statistical one. My own view is that p-values are pretty much useless (they answer the wrong question). In addition, people forget that lowering Type I error inevitably increases Type II error. Type II error may well be worse.

But are you "data-dredging"? Hard to say. It depends on why you are doing this whole process in the first place.
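The one-way ANOVA suggested above can be sketched in pure Python. This is only an illustration with hypothetical force-ratio data: the F statistic is computed from between- and within-group sums of squares, and the p-value would come from an F distribution with (k − 1, N − k) degrees of freedom (e.g. via `scipy.stats.f.sf`, not shown here). Note that because the same seven subjects appear in every condition, a repeated-measures ANOVA would arguably be the more appropriate model; the independent-groups version below is just the simplest sketch of the idea.

```python
# One-way ANOVA F statistic, computed by hand for illustration.
# Each inner list is one level of A (pulse width); the numbers are
# hypothetical force ratios, not real data.

def one_way_anova_F(groups):
    k = len(groups)                                # number of levels
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: group sizes times squared
    # deviations of group means from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares: squared deviations of each
    # observation from its own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2
                    for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical ratios for three of the five pulse widths:
groups = [[1.0, 1.1, 0.9], [1.2, 1.3, 1.1], [0.8, 0.9, 1.0]]
print(one_way_anova_F(groups))
```

A significant omnibus F would then justify follow-up pairwise comparisons (with a correction such as Tukey's HSD), rather than running all the pairwise t-tests up front.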

5tudent

New Member
Thank you for your advice. Yes, you bring up a good point about p-values, and I agree wholeheartedly. But the professor wants a p-value to report in a manuscript, and since I'm still working on getting my first manuscript published, I have to comply with the demand. About "data dredging," the description of the goal is mentioned in the response to Karabiner's post. Thank you!

5tudent

New Member
etc. pp.

Could you perhaps describe what this is actually all about (research topic, research question,
study design, measurements, sample size)? Such completely abstract descriptions are not only
very boring and difficult to follow, but also tend to miss crucial points of the problem.

With kind regards

K.
Thank you for your reply. The research question and experiment are about optimizing direct stimulation of muscle, preferably not indirectly through the motor nerve. (We control this sometimes by modulating the current level and sometimes with tetrodotoxin.) Ideally we would try to maximize the force output for the chosen amount of power input. Toward this end, we issue a one-second control pulse, one experimental pulse train, then another control pulse. This is done to prevent temperature fluctuations from causing large changes in the measure of interest over the course of the experiment.

For each control pulse or experimental pulse train, we measure the peak force output. The ratio of the peak force from the experimental train to the average of the peak forces from the two control pulses then indicates whether the experimental pulse train is better. The hypothesized mean for each is one, which would indicate that the experimental train and the control pulse give the same peak force output and there is no benefit to the experimental train. A ratio of two would indicate an experimental train that produced double the peak force of the control pulse.
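The outcome measure described above reduces to a simple computation per trial, sketched here with hypothetical peak-force values (the function name and numbers are illustrative, not from the actual analysis code):

```python
# The per-trial outcome: experimental train's peak force divided by
# the average of the peak forces from the control pulses issued
# before and after it. A ratio of 1 means no benefit over control.

def force_ratio(peak_exp, peak_ctrl_before, peak_ctrl_after):
    return peak_exp / ((peak_ctrl_before + peak_ctrl_after) / 2.0)

# Hypothetical trial: experimental train doubles the control peak force.
print(force_ratio(2.4, 1.2, 1.2))  # -> 2.0
```

Averaging the bracketing control pulses is what cancels out the slow temperature drift mentioned above, since both controls are measured close in time to the experimental train.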

There are 5 pulse widths, 4 frequencies, and 3 electric current levels. (...two turtle-doves, and a partridge in a pear tree. ) The 42 one-sample t-tests show whether each combination of pulse width, frequency, and current level is better than the control pulse. Like most biological tissues, the muscle is nonlinear in its response, so the results are not what you would expect from linear thinking. That is, doubling the electric current does not, as a rule, double the force output. Given that a parametric mathematical model for this tissue is not yet available, the statistical tests help characterize the behavior so the math modelers can work on developing a coupled-differential-equations model for this particular gastrointestinal tissue.

The professor also wishes to see if statistical analysis can show whether a particular pulse width (or level of "A" in the original post) is better or worse than the others. Since he wanted to show which one was better, the pairwise t-test seemed easy enough, but perhaps you would know of a test that is better and would allow for correction to avoid Type I error. Thank you for your advice!

Last edited:

Karabiner

TS Contributor
You could discuss the expected false discovery rate at the 5% level and,
alternatively, at a more conservative 1% level, if all 42 hypotheses
were wrong. Personally, I'd choose 1% in order to reduce Type I error
risk a bit while not reducing power too much. I would discuss any
"significant" findings with respect to the multiple testing problem.

With kind regards

K.

5tudent

New Member
You could discuss the expected false discovery rate at the 5% level and,
alternatively, at a more conservative 1% level, if all 42 hypotheses
were wrong. Personally, I'd choose 1% in order to reduce Type I error
risk a bit while not reducing power too much. I would discuss any
"significant" findings with respect to the multiple testing problem.

With kind regards

K.
Thank you very much, Karabiner!

Happy holidays to you!

Best regards,
5tudent