Non-proportion percentage as a response

#1
Moderator note: This thread was split from a previous thread as it didn't directly address the question in the previous thread and deserves its own space

Hi everybody,
I know this is a somewhat old topic, but I hope someone can give me advice. I have a similar situation (ANOVA with percentages), but my percentages are not derived from counts.

I have to define the appropriate statistical analysis for an experiment to be carried out.
We have 7 groups with equal size (n=4). The factor is a nominal variable.
The measured variables are concentrations in specific areas of the sample and their corresponding percentages, calculated as "concentration in area" / "total concentration".

I have read that percentages may not meet the normality assumption of ANOVA, especially when values are extreme (close to 0% or 100%).
My idea was to:
- use ANOVA if the percentages are in the range 20%-80% (found elsewhere)
- use Kruskal-Wallis if extreme values are present.
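A pre-specified rule like the one above could be written down as follows. This is only a minimal sketch: the function names, the example measurements, and the exact handling of the 20% and 80% boundaries are assumptions made for illustration, not part of the planned experiment.

```python
def area_percentage(area_conc, total_conc):
    """Percentage of the total concentration that is found in one area."""
    if total_conc <= 0:
        raise ValueError("total concentration must be positive")
    return 100.0 * area_conc / total_conc


def choose_test(percentages, low=20.0, high=80.0):
    """Return the name of the pre-specified test for a flat list of percentages.

    Assumption: a single value outside [low, high] is enough to switch
    to the rank-based test.
    """
    if all(low <= p <= high for p in percentages):
        return "anova"
    return "kruskal-wallis"


# Hypothetical measurements: (concentration in area, total concentration)
samples = [(2.1, 6.0), (3.3, 6.4), (2.8, 5.9), (0.2, 6.1)]
percentages = [area_percentage(a, t) for a, t in samples]
print(choose_test(percentages))
```

The last hypothetical sample is close to 3%, so the rule falls back to Kruskal-Wallis here.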

Would someone agree on this solution?
 
#2
Re: ANOVA with percentages

Yes, or:

- do ANOVA on all values. The parameter estimates will still be unbiased, but the tests may be "wrong" because the constant-variance and normality assumptions can fail.

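- do ANOVA estimated by weighted least squares (WLS), with weights inversely proportional to the variance. If the variance is assumed to behave like p*(1-p), where p is the proportion (0<p<1), the weights are proportional to 1/(p*(1-p)).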

- think of the dependent variable (the proportion) as Beta distributed and do "beta regression" (just as Poisson regression treats the dependent variable as Poisson distributed). The dependent variable must then not be exactly 0 or 1; if it can be, one possibility is zero-inflated beta regression.

- rescale the dependent variable (call it "y", where 0<y<1) with the logit function, y2 = log(y/(1-y)). The new variable is no longer restricted to the 0 to 1 interval but can range (almost) from minus infinity to plus infinity.

- use an "offset", so that: log("concentration in area") = b1*log("total concentration") + intercept + b2*x2 + other variables, where b1 is fixed at exactly 1 and is therefore not estimated. Most good software can estimate a model with an offset.
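The logit rescaling from the list above can be sketched in a few lines. The example proportions are hypothetical, and the sketch refuses values of exactly 0 or 1 rather than guessing at an adjustment for them.

```python
import math


def logit(y):
    """Map a proportion 0 < y < 1 to the whole real line."""
    if not 0.0 < y < 1.0:
        raise ValueError("logit is only defined for 0 < y < 1")
    return math.log(y / (1.0 - y))


def inv_logit(z):
    """Map a logit-scale value back to a proportion."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical proportions ("concentration in area" / "total concentration")
proportions = [0.05, 0.35, 0.50, 0.92]
transformed = [logit(y) for y in proportions]
back = [inv_logit(z) for z in transformed]
print(transformed)
print(back)
```

Group means estimated on the logit scale can be passed through `inv_logit` to report them as proportions again.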

I believe that Kruskal-Wallis needs a minimum sample size to be able to show significance at all, but I don't remember the details. In any case, n=4 is "small".
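One way to look at the minimum-sample-size question is to count how many distinct ways the ranks can be split among the groups, since an exact (permutation) p-value cannot be smaller than the share of arrangements that are at least as extreme as the observed one. The sketch below assumes equal group sizes, no ties, and that only completely separated groups give the largest test statistic; the function names are made up for illustration.

```python
from math import factorial


def n_arrangements(k, n):
    """Number of ways to split k*n distinct ranks into k labelled groups of size n."""
    return factorial(k * n) // factorial(n) ** k


def smallest_exact_p(k, n):
    """Smallest achievable exact p-value under the assumptions stated above.

    Complete separation can occur in k! group orders, all equally extreme.
    """
    return factorial(k) / n_arrangements(k, n)


print(n_arrangements(2, 3), smallest_exact_p(2, 3))  # two groups of three
print(n_arrangements(7, 4), smallest_exact_p(7, 4))  # the design in this thread
```

With two groups of three the floor is 0.1, so 5% significance is impossible. With 7 groups of 4 the floor is far below 0.05, so a significant result is possible in principle; this says nothing about how likely it is, which is the power question.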
 
#3
What is the range of percentages in your data? If they are not too extreme (e.g., if they are in the 20-80% range), you'll probably get a fairly decent model with a t-test or ANOVA.
 
#4
Thanks David. The experiment still needs to be carried out. But we have to document in advance what kind of statistical analysis we will do once we have the data.
 
#5
Thank you!

I see there are many possibilities. Maybe I should have mentioned that the interest is in detecting significant differences between groups.
I am not a statistician, so I will probably come up with silly questions now:
- Re. beta regression: so far I have done linear regression with continuous or ordinal predictors. In this case the predictor (factor) would be nominal. How can I do beta regression?
- in the "offset" solution what are "x2 and other variables"?

At the moment I am thinking of sticking to the solution I find most understandable, i.e. ANOVA or Kruskal-Wallis or the logit transformation.
 

Karabiner

TS Contributor
#6
With only n=28 and a percentage as the DV, I wouldn't necessarily rely on ANOVA to give correct answers. On the other hand, Kruskal-Wallis looks like the logical alternative, but with 7 groups and n=4 per group, statistical power would be very low (or the calculation might not even be possible). Isn't there an opportunity to increase the total sample size, or to combine categories?
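For small groups, the Kruskal-Wallis statistic can be paired with a Monte Carlo permutation p-value instead of the chi-square approximation. The following is a sketch under stated assumptions, not a validated implementation: H is computed without a tie correction, the p-value is approximated from random shuffles, and the seven groups of four proportions are invented.

```python
import random


def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r


def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    r = ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(r[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * total - 3.0 * (n_total + 1)


def permutation_p(groups, n_perm=2000, seed=0):
    """Monte Carlo permutation p-value for H (group labels shuffled at random)."""
    rng = random.Random(seed)
    observed = kruskal_h(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for n in sizes:
            shuffled.append(pooled[start:start + n])
            start += n
        if kruskal_h(shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical proportions for 7 groups of n=4
data = [
    [0.31, 0.35, 0.29, 0.33], [0.40, 0.38, 0.44, 0.41],
    [0.30, 0.36, 0.34, 0.32], [0.52, 0.47, 0.55, 0.50],
    [0.28, 0.37, 0.39, 0.27], [0.45, 0.43, 0.48, 0.42],
    [0.26, 0.46, 0.49, 0.51],
]
print(kruskal_h(data), permutation_p(data))
```

The calculation itself is possible with n=4 per group; whether realistic differences would be detected is a separate question about power.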

With kind regards

K.
 
#7
Thank you very much for the helpful observations.
I will definitely report the issue of the power and ask if it is possible to increase the sample size or combine categories.

Best regards,
M.
 

Dason

Ambassador to the humans
#8
Hi,

This thread was related to but not directly addressing the previous question so I moved it to its own thread.
 
#9
I see there are many possibilities. Maybe I should have mentioned that the interest is in detecting significant differences between groups.
I am not a statistician, so I will probably come up with silly questions now:
- Re. beta regression: so far I have done linear regression with continuous or ordinal predictors. In this case the predictor (factor) would be nominal. How can I do beta regression?
Good question!

In a usual linear regression model (implicitly with normally distributed values) there can be one or several explanatory variables, such as the group variable/factor. (The model then estimates the mean in each group.) The same is true for beta regression or Poisson regression: you can include your group variable (and get estimates of the underlying unknown population means).
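To make the point about a nominal factor concrete, here is a small sketch of dummy (treatment) coding for an ordinary least-squares model. The labels, the data, and the helper name `dummy_code` are hypothetical. The intercept is the mean of the reference group and each other coefficient is a difference from it; the check at the end confirms that these values satisfy the least-squares normal equations.

```python
def dummy_code(labels, reference):
    """Design matrix rows: an intercept column plus one 0/1 column per non-reference level."""
    levels = [reference] + sorted(set(labels) - {reference})
    rows = [[1.0] + [1.0 if lab == lev else 0.0 for lev in levels[1:]]
            for lab in labels]
    return levels, rows


# Hypothetical data: three groups, two observations each
labels = ["A", "A", "B", "B", "C", "C"]
y = [0.30, 0.34, 0.45, 0.49, 0.60, 0.68]

levels, X = dummy_code(labels, reference="A")

# Group means computed directly from the data
means = {lev: sum(v for v, lab in zip(y, labels) if lab == lev) / labels.count(lev)
         for lev in levels}

# Coefficients under treatment coding: reference mean, then differences from it
beta = [means[levels[0]]] + [means[lev] - means[levels[0]] for lev in levels[1:]]

# Least-squares check: residuals must be orthogonal to every design column
fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
normal_eq = [sum(row[j] * res for row, res in zip(X, residuals))
             for j in range(len(beta))]
print(beta, normal_eq)
```

Choosing a different reference level changes the individual coefficients but not the fitted group means, so the overall between-group comparison is unaffected by the coding.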



- in the "offset" solution what are "x2 and other variables"?
This could just be one or several explanatory factors (like above), just like the group variable.

I suggested the offset model as a possibility. I am not sure if it is as good as the other models (or much worse).
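A rough way to see what the offset does: with b1 fixed at 1, log("total concentration") can be moved to the left-hand side, so the model describes log(area/total). The sketch below assumes the group factor is the only predictor and that the fit is by least squares on the log scale, which is an assumption made here rather than something specified in this thread. Under those assumptions each group's estimate is its mean log ratio. The data and the function name are invented.

```python
import math


def offset_group_estimates(data):
    """Per-group estimate of area/total from log(area) - 1*log(total).

    Exponentiating the mean log ratio gives the geometric mean of the ratio.
    """
    estimates = {}
    for group, pairs in data.items():
        # Subtract the offset (coefficient fixed at 1) from the response
        adjusted = [math.log(area) - math.log(total) for area, total in pairs]
        mean_log_ratio = sum(adjusted) / len(adjusted)
        estimates[group] = math.exp(mean_log_ratio)
    return estimates


# Hypothetical (concentration in area, total concentration) pairs per group
data = {
    "A": [(2.0, 8.0), (1.0, 4.0)],
    "B": [(3.0, 6.0), (4.0, 8.0)],
}
print(offset_group_estimates(data))
```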

I just suggested these models as possibilities, not as recommendations. It was more for the general discussion.

At the moment I am thinking of sticking to the solution I find most understandable, i.e. ANOVA or Kruskal-Wallis or the logit transformation.
I certainly think that you should use something that is understandable!

It is really good to think before you do something. That is experimental design. (On that note, you should have some basis for the sample size of n=4 per group, such as the length of a confidence interval or the power of a test.)
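If historical data give a rough idea of the group means and the spread, power for n=4 per group can be estimated by simulation. The sketch below is one possible approach under loud assumptions: responses are drawn from normal distributions on some analysis scale (for example the logit scale), the test is a permutation test on the one-way ANOVA F statistic (swapped in here so that no distribution tables are needed), and the means and standard deviation are invented.

```python
import random


def f_stat(groups):
    """One-way ANOVA F statistic."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def perm_p(groups, rng, n_perm):
    """Monte Carlo permutation p-value for F (equal group sizes assumed)."""
    observed = f_stat(groups)
    pooled = [v for g in groups for v in g]
    n = len(groups[0])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled = [pooled[i:i + n] for i in range(0, len(pooled), n)]
        if f_stat(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def simulated_power(group_means, sd, n=4, alpha=0.05,
                    n_sim=100, n_perm=199, seed=1):
    """Share of simulated experiments in which the test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        groups = [[rng.gauss(m, sd) for _ in range(n)] for m in group_means]
        if perm_p(groups, rng, n_perm) <= alpha:
            rejections += 1
    return rejections / n_sim


# Hypothetical scenario: six groups alike, one group shifted by 1.5 SD
print(simulated_power([0.0] * 6 + [1.5], sd=1.0))
```

The numbers of simulations and permutations are kept small so the sketch runs quickly; a real planning exercise would use more of both and several plausible scenarios.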

But if you write something down, it would be stupid and unscientific to still use the same model no matter what the data look like (for example, if there are a few big outliers). It could be good to include a footnote saying that suitable methods will be used if unexpected data appear. Surprises happen quite often!
 
#10
Good question!

In a usual linear regression model (implicitly with normally distributed values) there can be one or several explanatory variables, such as the group variable/factor. (The model then estimates the mean in each group.) The same is true for beta regression or Poisson regression: you can include your group variable (and get estimates of the underlying unknown population means).
That sounds reasonable! I guess I was also thinking of how to code the levels of the explanatory variable and how this would affect the results. I should rather study before asking here.

It is really good to think before you do something. That is experimental design. (On that note, you should have some basis for the sample size of n=4 per group, such as the length of a confidence interval or the power of a test.)
Totally agree. I asked my colleagues whether they have historical data, so that we can try to calculate the power of the current design and/or increase the sample size (it was highlighted as a potential point of concern, given that we have 7 groups).

But if you write something down, it would be stupid and unscientific to still use the same model no matter what the data look like (for example, if there are a few big outliers). It could be good to include a footnote saying that suitable methods will be used if unexpected data appear. Surprises happen quite often!
I agree here too. We have included procedures to look for outliers, and different analyses for the different situations we think we may encounter. It is a highly regulated department, so we need to declare all these points in advance, but luckily it is possible to make changes along the way.

I am very thankful for the help.