Please help! Running ANOVA with skewed data in some conditions

ninja

New Member
#1
Hi there, I really hope you can help. I've been looking on a number of forums and have had no luck. Also have been looking through some of the threads here and am feeling more optimistic that someone can help me!

I'm a PhD student, in my final 8 weeks so am desperate for some help!

I'm planning to run a 3 x 2 ANOVA on some of my data. The DV (correct responses) is skewed in about half of the conditions. I transformed those conditions using log10 after adding a constant of 1 (because the data include 0 values). Should I also transform the rest of the conditions' data? I am thinking that if I add the constant to only some conditions I will be exaggerating any differences, so should I transform the rest as well, or at least add 1 to all conditions?

Also, when I've transformed the outcomes, some are still marginally out of range for skew (1.165). Could I still proceed and argue that it is not THAT skewed?

I've seen some people suggesting plotting the Q-Q values for the residuals, but I don't know how to do that. If I need to do this, does anyone know how? Also, is there a way to do this in one go so I can see if the residuals are normally distributed?

I really hope that you can help me! I have 6 papers to write up all following a similar format, so am just trying to make sure I am doing it right before I analyse and write up the rest of the results sections.
 

Karabiner

TS Contributor
#2
"I'm planning to run a 3 x 2 ANOVA on some of my data. The DV (correct responses) is skewed in about half of the conditions."
Could you tell us a bit more? What is the topic of the research,
what are these factors, is this an experimental study or an
observational study, how large is the sample size, what
was actually measured, and in which range do we find the
responses?
"Should I also transform the rest of the conditions' data?"
Yes, of course. You cannot transform the dependent variable
just in some conditions and leave the rest as is. You have to
transform - if necessary at all - the complete DV.
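
As a rough illustration of "transform the complete DV", here is a minimal Python sketch (not the SPSS workflow used in this thread). The condition names and scores are made up; the only point is that the same log10(x + 1) function is applied to every condition, so the added constant cannot distort comparisons between them.

```python
import math

# Hypothetical scores (0-5 correct responses) for three illustrative conditions.
scores = {
    "cond_a": [0, 0, 1, 0, 2, 5],
    "cond_b": [3, 4, 2, 5, 3, 4],
    "cond_c": [0, 1, 0, 0, 3, 1],
}

def log10_plus_one(values):
    """Apply log10(x + 1) to every value; the +1 keeps zeros defined."""
    return [math.log10(v + 1) for v in values]

# Same transformation for every condition, skewed or not.
transformed = {name: log10_plus_one(vals) for name, vals in scores.items()}

for name, vals in transformed.items():
    print(name, [round(v, 3) for v in vals])
```

A raw score of 0 maps to 0 and a raw score of 5 maps to log10(6) in every condition, so the scale stays comparable across the whole design.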

But if your sample size is not tiny, you usually don't have to worry
too much about skewed distributions (unless the skew comes from outliers).
Or is the proportion of zero values perhaps quite high?

"I've seen some people suggesting plotting the Q-Q values for the residuals, but I don't know how to do that."
You run the ANOVA with your software, save the
residuals from that analysis (the residuals from the
whole model), and produce a Q-Q plot of those saved
residuals, if your software provides it.
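
To make the idea concrete, here is a small Python sketch of what a Q-Q plot of residuals is built from. It is an illustration under simplifying assumptions, not what SPSS does internally: the data are invented, and the residual is simply each observation minus its own cell mean (the full factorial between-subjects case; a repeated-measures model would also remove subject effects).

```python
from statistics import NormalDist, mean

def cell_residuals(cells):
    """Residual = observation minus its own cell mean (full factorial model)."""
    residuals = []
    for values in cells.values():
        cell_mean = mean(values)
        residuals.extend(v - cell_mean for v in values)
    return residuals

def qq_points(residuals):
    """Pair each sorted residual with the matching standard-normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    normal = NormalDist()
    # Plotting positions (i - 0.5) / n avoid the undefined 0 and 1 quantiles.
    theoretical = [normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical data: keys are (word type, speed) cells of a 3 x 2 design.
cells = {
    ("type1", "slow"): [0, 0, 1, 3],
    ("type1", "fast"): [0, 0, 0, 2],
    ("type2", "slow"): [1, 2, 2, 5],
    ("type2", "fast"): [0, 1, 1, 4],
    ("type3", "slow"): [2, 3, 3, 4],
    ("type3", "fast"): [0, 0, 2, 3],
}

residuals = cell_residuals(cells)
points = qq_points(residuals)
for theo, obs in points:
    print(f"{theo:7.3f}  {obs:7.3f}")
```

Plotting the second column against the first gives the Q-Q plot. Roughly normal residuals fall close to a straight line, while strong skew shows up as a curve bending away from it at one end.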

With kind regards

K.
 

hlsmith

Not a robit
#3
Karabiner provided some good guidance. You might also want to check whether the residuals are normally distributed when the DV is not transformed. You may not need to transform it at all.
 

ninja

New Member
#4
Hi, thank you so much for the replies! I was feeling pretty alone in the world until I saw them! Thank you!

I'm running a 3 x 2 using a word detection task with different types of words presented at different speeds. It's a small sample (repeated measures) about n=65 after exclusions. The DV was correct responses on the identification task. Answers are in the range 0-5 and there are a lot of zeros.

I'm going to try the plots and see if they look like a line (which is what I think I'm meant to do). I'm using SPSS so hopefully it should be ok to plot the residuals to check before I go in with transformations.

If I do still need to transform, is a skewness of 1.165 acceptable, or does it really have to be within 1? I'm also planning to run regressions on some of the measures after the ANOVAs.
 

ninja

New Member
#5
Just an update, I've added 'residual plot' as an option in SPSS on the ANOVA, but it just comes up with a box with 9 segments (kind of a matrix) with the observed / predicted / std. residual. It wasn't the Q-Q plot I was expecting. Can I tell if the residuals are normally distributed from this?
 

ninja

New Member
#6
I've just managed to save the residuals and run a Q-Q plot on them with the standard options. Nothing really resembles a line, so I think I will transform it all now. Do I need to check again afterwards, or should it be OK if everything is roughly within range for skewness? I'm hoping 1.165 is still OK on that measure?
 

CB

Super Moderator
#7
With N = 65, a tiny bit of skewness isn't really something to worry about. While formally one makes the assumption that the distribution of the DV within each condition is normal, the direct assumption is that the sampling distribution of the coefficients is normal. Even with a data distribution that isn't remotely normal, the sampling distribution with N = 65 is likely to approximate a normal distribution very closely. In your case, with only slight skewness, it just isn't a major concern. The loss of interpretability from transforming your DV doesn't seem worth it here. If you really really want to avoid taking any risks on the normality issue, calculate confidence intervals or significance tests using bootstrapping rather than transforming. (Bootstrapping is readily available in SPSS, R and many other packages.)

In a lot of ways the normality assumption is the least important assumption in a linear model. Instead, worry more about assumptions like homogeneity of variance, independence of errors, and the absence of correlated measurement error. Breaches of these assumptions have much more serious consequences than a lack of normality. E.g., coefficients from the linear model remain unbiased and consistent, and efficient among linear unbiased estimators, even without normality, but that won't be the case if these other assumptions are breached.
 

ninja

New Member
#8
Hi, thanks for the reply : ) Sadly the skewness is only mild after the transformation; it was about 9 for some variables before the log10. However, I think what you are saying is still relevant here, and maybe I can just go ahead with the ANOVA post transformation with a couple of conditions' data still slightly skewed. I can't say yet whether the assumption of homogeneity of variance will be met, but pre transformation I did have to use Greenhouse-Geisser for one of my variables. Hopefully by tomorrow night I will be a bit closer to an answer with that.
 

hlsmith

Not a robit
#9
And don't forget, once you get everything ironed out, to make sure you interpret your transformed results correctly.
 
#10
Hi all, I've managed to finish one of my studies - I'm just totally confused though - with the ANOVA, I now get that it is robust to quite strange distributions in that I won't need to transform just because of that. I've plotted the residuals though as suggested and they aren't normally distributed. Is this a problem? I don't actually know what bootstrapping does, so if I need this can anyone please let me know why I would use it? Thank you sooooo much!!!!
 

CB

Super Moderator
#11
In (conventional) analytic statistics, the sampling distribution of a statistic (e.g., a group mean) is constructed by making assumptions about the data that statistic is based on. E.g., we often assume sampling distributions are normal, based on the assumption that they are estimated from models that have a normally distributed error term. It's the sampling distribution that we use to make inferences from the sample to the population.

In bootstrapping, a sampling distribution is approximated empirically, by sampling with replacement from the original sample a large number of times. The resulting sampling distribution will not generally follow any defined probability distribution exactly. The result is that you do not need to assume that the sampling distribution is normal (or that the errors are normal). The advantage of this method over something like transformation is that you are still applying the same familiar ANOVA, with results that can be interpreted in the usual way; you are just obtaining standard errors, confidence intervals and/or p values in a manner that avoids relying on the normality assumption. You can find a nice intro here.
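
The resampling idea can be sketched in a few lines of Python. This is only an illustration, not the SPSS or R procedure: the per-participant difference scores are invented, the function name is hypothetical, and it uses the simple percentile method for a mean rather than a full bootstrapped ANOVA.

```python
import random
from statistics import mean

def bootstrap_mean_ci(values, n_boot=5000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Each resample draws n values with replacement from the original sample.
    boot_means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = boot_means[int(n_boot * alpha / 2)]
    upper = boot_means[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# Hypothetical difference scores (condition A minus condition B), one per
# participant, skewed with many zeros as in the thread.
diffs = [0, 1, 0, 2, -1, 0, 3, 1, 0, 0, 2, 1, 0, 4, 1, 0]

lower, upper = bootstrap_mean_ci(diffs)
print(f"mean difference = {mean(diffs):.3f}")
print(f"95% bootstrap CI = ({lower:.3f}, {upper:.3f})")
```

Because participants (here, their difference scores) are resampled as whole units, the repeated-measures structure is preserved, and no normality assumption is needed to get the interval.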
 
#12
Thank you for replying! :) Would looking at the coefficient of variation help me see whether the sampling distributions are similar enough? If so, I've calculated that I have a maximum difference of about .55 between the outcome measures' CVs, and some are much less. With N=65 and some skew on the residuals (up to 2), do you think I would still be OK to use ANOVA without transformation or bootstrapping? It's a repeated measures design, so I'm not sure how important homogeneity of variance is in that situation? Think I am getting there now, although my brain hurts haha
 
#13
Also, might as well mention, I'm going to be running multiple regressions and correlations with the data after the ANOVA. Would I be able to follow the same protocol as I'm going to with the ANOVA, I guess depending on what that is going to be (transform or not)?
 

CB

Super Moderator
#14
When checking the homoscedasticity (constant variance) assumption, you need to primarily look at the variances in the individual groups, not the coefficient of variation.
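
A quick Python sketch with made-up numbers shows why the two can disagree. The two hypothetical conditions below have exactly the same variance, yet their coefficients of variation differ because the CV divides by the mean.

```python
from statistics import mean, stdev, variance

# Hypothetical scores: "fast" is "slow" shifted down by one point,
# so the spread is identical and only the mean changes.
conditions = {
    "slow": [1, 2, 3, 4, 5],
    "fast": [0, 1, 2, 3, 4],
}

summary = {}
for name, vals in conditions.items():
    summary[name] = {
        "variance": variance(vals),     # sample variance (n - 1 denominator)
        "cv": stdev(vals) / mean(vals), # coefficient of variation
    }
    print(name, summary[name])

# A common rough check: ratio of the largest to the smallest variance.
variances = [s["variance"] for s in summary.values()]
variance_ratio = max(variances) / min(variances)
print("largest / smallest variance:", variance_ratio)
```

Here the variance ratio is 1 (perfectly homogeneous), while the CVs differ noticeably, so a CV difference by itself says little about the constant-variance assumption.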

Multiple regression has similar assumptions to RM ANOVA (though not quite the same). They both assume homoscedasticity and normality of errors, but are fairly robust to breaches of the latter assumption. So using the same protocol for both would probably be reasonable.