Data Transformation


I have a huge amount of data derived from replicated simulations. If I plot a frequency chart, I get a distribution that looks like a Poisson or an inverse distribution. The highest value (the mode) falls on zero. Because the distribution is not normal, I need to transform my data so that it becomes normally distributed. I tried all variants of square roots, logarithms, arcsines... but the distribution remains non-normal because of the high frequency of zeroes.

Do you have any idea how I can transform my data to normalise them?

Many Thanks

Why can't you use Poisson regression for your analysis instead? The other choice is something called a two-part model. In this type of modeling, you first run a dichotomous model (zero or not zero), then you run a second model, a linear regression on the non-zero values transformed by a logarithm (or something else). Then you combine the two back into one.
You can read about it in the paper "Methods for Improving Regression Analysis for Skewed Continuous or Counted Responses", published in the Annual Review of Public Health in April 2007. Email me if you need a copy of the paper.
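
To make the "combine the two back into one" step concrete, here is a minimal sketch in Python. It is not the method from the paper: it assumes an intercept-only model for a single group (no covariates), so part one is simply the share of non-zero values and part two is the mean of the logged non-zero values. The function name `two_part_summary` and the use of Duan's smearing factor for the back-transformation are my own choices for illustration.

```python
import math


def two_part_summary(values):
    """Intercept-only two-part summary of one group of non-negative values.

    Illustrative sketch: a real two-part model would use a logistic
    regression for part 1 and a linear regression with covariates for part 2.
    """
    positives = [v for v in values if v > 0]
    if not positives:
        raise ValueError("need at least one non-zero value")

    # Part 1: dichotomous model (zero or not zero).
    # Without covariates this is just the proportion of non-zero values.
    p_nonzero = len(positives) / len(values)

    # Part 2: linear model on the log scale for the non-zero values.
    # Without covariates this is just the mean of log(y).
    logs = [math.log(v) for v in positives]
    mean_log = sum(logs) / len(logs)

    # exp(mean_log) alone is the geometric mean, which underestimates the
    # arithmetic mean. Duan's smearing factor (mean of exp(residuals))
    # corrects for that when transforming back.
    smearing = sum(math.exp(x - mean_log) for x in logs) / len(logs)
    mean_if_nonzero = math.exp(mean_log) * smearing

    # Combine the two parts: E[y] = P(y > 0) * E[y | y > 0]
    return {
        "p_nonzero": p_nonzero,
        "mean_log": mean_log,
        "mean_if_nonzero": mean_if_nonzero,
        "expected": p_nonzero * mean_if_nonzero,
    }


if __name__ == "__main__":
    print(two_part_summary([0, 0, 0, 1, 2, 4]))
```

With no covariates the combined estimate collapses to the plain mean of the data, which is a useful sanity check. The approach only earns its keep once each part has its own predictors.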

Jenny Kotlerman

The 'why' of normalization


I need to normalise the data because I have to conduct a repeated-measures 4-factorial/4-variables ANOVA (maybe I will just work with one variable). Since the data are skewed, the statistical method is not applicable.

I use SPSS to conduct the analysis and the amount of data I have is huge (100 subjects, 23 time-dependent repeated measures, 4 between-subjects factors and 1 within-subjects factor), so I think it is easier to transform my data to normalise them than to conduct a non-parametric test (which I don't know how to do!)

I am not a statistician, but I am a population genetics postgraduate student and my study involves a lot of handling data and hypotheses testing.

Many Thanks for the reply :)



TS Contributor
Actually it is a very common misconception that you cannot do ANOVA if the population is not normal.

You need to remember that statistical inference is based on sampling distributions of means, not individual data points, and the sampling distribution of means from a non-normal population will approach a normal distribution as sample size increases. The need for strict adherence to underlying assumptions is very often overstated, in my opinion.
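
A quick way to see this is to simulate it. The sketch below is only an illustration under assumed settings: a Poisson population with mean 0.5 (so the mode is zero), samples of 50, and 2000 replicates. None of these numbers come from the original poster's data. It compares the skewness of the raw values with the skewness of the sample means.

```python
import math
import random


def poisson_draw(rng, lam):
    """One Poisson(lam) draw using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def skewness(xs):
    """Plain moment-based skewness m3 / m2**1.5 (no small-sample correction)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Assumed settings, chosen only for illustration.
LAM, SAMPLE_SIZE, REPLICATES = 0.5, 50, 2000
rng = random.Random(1)  # fixed seed so the run is repeatable

# Individual data points from the zero-heavy population.
raw = [poisson_draw(rng, LAM) for _ in range(5000)]

# Means of many samples drawn from the same population.
sample_means = [
    sum(poisson_draw(rng, LAM) for _ in range(SAMPLE_SIZE)) / SAMPLE_SIZE
    for _ in range(REPLICATES)
]

raw_skew = skewness(raw)
means_skew = skewness(sample_means)
print(f"skewness of raw values:   {raw_skew:.2f}")
print(f"skewness of sample means: {means_skew:.2f}")
```

In theory a Poisson population with mean 0.5 has skewness 1/sqrt(0.5), about 1.41, while means of 50 such values have skewness 1/sqrt(25) = 0.2, so the simulated figures should land near those values.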

In addition, I can cite probably close to a dozen stats textbooks that clearly state that parametric procedures such as ANOVA (and others) are fairly robust to violations of their underlying assumptions, especially if only one assumption is violated.

If you need further justification, my thesis looked at how severe the degree of skewness or kurtosis needs to be in order for ANOVA to be a less desirable test (higher Type I and/or Type II error rate) than a nonparametric version. Answer: it needs to be really, really skewed.
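
Anyone who wants to check this kind of claim for their own situation can estimate the Type I error rate by simulation. The sketch below is not the thesis analysis. It is a much simpler stand-in under assumed settings: a one-way (not repeated-measures) ANOVA with three groups of 30, all drawn from the same Poisson population with mean 0.5, so every "significant" result is a false positive. With exactly three groups the F-test has 2 numerator degrees of freedom, and in that case the p-value has the closed form (1 + 2F/df2)^(-df2/2), so no statistics library is needed.

```python
import math
import random


def poisson_draw(rng, lam):
    """One Poisson(lam) draw using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def anova_p_three_groups(groups):
    """One-way ANOVA p-value for exactly three groups.

    Uses the closed-form upper tail of the F(2, df2) distribution,
    which is valid only when there are three groups.
    """
    if len(groups) != 3:
        raise ValueError("closed form requires exactly three groups")
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df2 = n_total - 3
    if ss_within == 0:
        # Degenerate case: no variation inside any group.
        return 1.0 if ss_between == 0 else 0.0
    f_stat = (ss_between / 2) / (ss_within / df2)
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)


# Assumed settings, chosen only for illustration.
LAM, GROUP_SIZE, REPLICATES, ALPHA = 0.5, 30, 2000, 0.05
rng = random.Random(7)  # fixed seed so the run is repeatable

false_positives = 0
for _ in range(REPLICATES):
    groups = [[poisson_draw(rng, LAM) for _ in range(GROUP_SIZE)] for _ in range(3)]
    if anova_p_three_groups(groups) < ALPHA:
        false_positives += 1

type_i_rate = false_positives / REPLICATES
print(f"estimated Type I error rate at alpha = {ALPHA}: {type_i_rate:.3f}")
```

If the estimated rate stays close to the nominal 0.05, the test is holding up under that degree of skew. Changing `LAM`, `GROUP_SIZE` or the population itself shows how far things can be pushed before it stops holding.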

Here's one more before I get off my soap-box. I hate transforming data. Why? It makes the practical interpretation and application of results very difficult.

The GENMOD procedure in SAS will fit a Poisson model with repeated measures. I agree that the interpretation after transformation is almost impossible, but using ANOVA when the data are clearly Poisson will produce wrong results. There is nothing wrong with doing ANOVA if the data are a bit skewed, but I don't think that is the case here.

Jenny Kotlerman


TS Contributor
I guess it also depends on what you mean by "results." Are you looking for a very precise estimate of an effect size, or maybe the F-statistic? Then yes, ANOVA will produce "wrong" results with an underlying Poisson data set.

However, if you're just trying to judge the significance/non-significance of factors, then I think the risk is pretty low.

ANOVA and Data transformation


The skewness of my data is: 1.987 (SE: 0.51)
The kurtosis is: 3.409 (SE: 0.102)
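
For reference, the sketch below shows how figures of this kind are usually computed: the bias-adjusted sample skewness and excess kurtosis, plus their standard errors, which depend only on the sample size n. The function names are mine, and n = 2300 is an assumption (100 subjects times 23 repeated measures, if every observation is pooled), not something stated in the output above.

```python
import math


def sample_skewness(xs):
    """Bias-adjusted sample skewness (often written G1)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    g1 = m3 / m2 ** 1.5
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


def sample_excess_kurtosis(xs):
    """Bias-adjusted sample excess kurtosis (often written G2)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    g2 = m4 / m2 ** 2 - 3
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)


def se_skewness(n):
    """Standard error of the sample skewness for a sample of size n."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of the sample excess kurtosis for a sample of size n."""
    return 2 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


if __name__ == "__main__":
    n = 2300  # assumption: 100 subjects x 23 repeated measures, pooled
    print(round(se_skewness(n), 3), round(se_kurtosis(n), 3))  # prints 0.051 0.102
```

A common rule of thumb is to divide each statistic by its standard error and treat a ratio well beyond about 2 in absolute value as evidence of non-normality, although with thousands of observations even small departures will cross that line.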

I think the skewness and kurtosis are too far from zero for me to conduct a repeated-measures ANOVA, even if the sphericity test showed a significance of 0.045 (which is, in my opinion, very close to the 0.05 level).

In the other set of simulations (which can't be compared with the first), the sphericity test showed a significance of 0.000.

I am doing the repeated-measures ANOVA for the following reasons:

1. I have to calculate the means and the standard deviation in the within-subjects factors
2. I have to check if there is some form of interaction between factors
3. I have to test whether the calculated means are significantly different.

I hope that is clear...