
Thread: Data Transformation

  1. #1

    Data Transformation



    Hello,

    I have a huge amount of data derived from replicated simulations. If I plot a frequency chart, the distribution looks like a Poisson or an inverse distribution. The highest frequency (the mode) falls on zero. Because the distribution is not normal, I need to transform my data so that it is approximately normally distributed. I tried all variants of square roots, logarithms, arcsines... but the distribution remains non-normal because of the high frequency of zeroes.

    Do you have any idea how I can transform my data to make it normal?

    Many Thanks

    pitto

  2. #2
    Why can't you use Poisson regression for your analysis instead? The other choice is something called a two-part model. In this type of modeling, you first fit a dichotomous model (zero or not zero), then you fit a second model, a linear regression on the non-zero values transformed by a logarithm (or something else). Then you combine the two parts back into one.
    You can read about it in the paper "Methods for Improving Regression Analysis for Skewed Continuous or Counted Responses", published in the Annual Review of Public Health in April 2007. Email me if you need a copy of the paper.
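
A minimal sketch of the two-part idea in Python, assuming the simplest possible case with no predictors: part one is just the share of non-zero values, and part two is the mean of the logged non-zero values. The function name `fit_two_part`, the simulated data, and the use of Duan's smearing factor for the back-transformation are all illustrative assumptions, not taken from the paper mentioned above.

```python
import math
import random
from statistics import mean

def fit_two_part(values):
    """Intercept-only two-part model for non-negative data with many zeroes."""
    positives = [v for v in values if v > 0]
    # Part 1: dichotomous model, here simply the observed share of non-zero values.
    p_nonzero = len(positives) / len(values)
    # Part 2: linear model on the log scale, here simply the mean of log(y) for y > 0.
    logs = [math.log(v) for v in positives]
    mu_log = mean(logs)
    # Smearing factor: mean of exp(residual), used to back-transform the
    # log-scale mean without assuming log-normal errors.
    smear = mean(math.exp(l - mu_log) for l in logs)
    # Combine the parts: E[y] = P(y > 0) * E[y | y > 0].
    expected = p_nonzero * math.exp(mu_log) * smear
    return p_nonzero, mu_log, expected

# Hypothetical data: about 60% zeroes, the rest log-normal.
random.seed(1)
data = [0.0 if random.random() < 0.6 else random.lognormvariate(1.0, 0.5)
        for _ in range(1000)]

p_nonzero, mu_log, expected = fit_two_part(data)
print(f"share non-zero: {p_nonzero:.3f}")
print(f"mean of log(y | y > 0): {mu_log:.3f}")
print(f"combined expected value: {expected:.3f}  (sample mean: {mean(data):.3f})")
```

In this intercept-only case the combined estimate reproduces the plain sample mean, which is a handy sanity check. With real predictors, each part would be a regression (for example a logistic model for part one) rather than a single number.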

    Jenny Kotlerman
    www.****************************.

  3. #3
    JohnM (TS Contributor)
    Basic question - if the data, as a result of a simulation, isn't normal, why do you need to normalize it?

  4. #4

    The 'why' of normalization

    Hello!

    I need to normalise the data because I have to conduct a repeated-measures 4-factorial/4-variables ANOVA (maybe I will just work with one variable). Since the data is skewed, the statistical method is not applicable.

    I use SPSS to conduct the analysis and the amount of data I have is huge (100 subjects, 23 time-dependent repeated measures, 4 between-subjects factors and 1 within-subjects factor), so I think it is easier to transform my data to normalise it than to conduct a non-parametric test (which I don't know how to do!)

    I am not a statistician, but I am a population genetics postgraduate student and my study involves a lot of data handling and hypothesis testing.

    Many Thanks for the reply

    pitto

  5. #5
    JohnM (TS Contributor)
    Actually it is a very common misconception that you cannot do ANOVA if the population is not normal.

    You need to remember that statistical inference is based on sampling distributions of means, not individual data points, and the sampling distribution of means from a non-normal population will approach a normal distribution as sample size increases. The need for strict adherence to underlying assumptions is very often overstated, in my opinion.
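
The point about sampling distributions can be checked with a small simulation. The sketch below assumes a Poisson population with mean 0.5 (so most values are zero), which is a stand-in and not the poster's actual data; the helper names and the number of replications are likewise hypothetical. It draws many samples of size n, takes each sample's mean, and measures how skewed those means are.

```python
import math
import random

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def skewness(xs):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def skew_of_sample_means(n, reps=2000, lam=0.5, seed=0):
    """Skewness of the means of `reps` samples, each of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sample = [poisson_draw(rng, lam) for _ in range(n)]
        means.append(sum(sample) / n)
    return skewness(means)

# Skewness of the sample means for increasing sample sizes.
results = {n: skew_of_sample_means(n) for n in (1, 5, 100)}
for n, s in results.items():
    print(f"n = {n:3d}: skewness of sample means = {s:.2f}")
```

The skewness of the means should shrink roughly in proportion to 1/sqrt(n), which is why a strongly skewed population can still give nearly normal sample means at moderate sample sizes.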

    In addition, I can cite probably close to a dozen stats textbooks that clearly state that parametric procedures such as ANOVA (and others) are fairly robust to violations of their underlying assumptions, especially if only one assumption is violated.

    If you need further justification, my thesis looked at how severe the degree of skewness or kurtosis needs to be in order for ANOVA to be a less desirable test (higher Type I and/or Type II error rate) than a nonparametric version. Answer: it needs to be really, really skewed.

    Here's one more before I get off my soap-box. I hate transforming data. Why? It makes the practical interpretation and application of results very difficult.

  6. #6
    The GENMOD procedure in SAS will fit a Poisson model with repeated measures. I agree that interpretation after transformation is almost impossible, but using ANOVA when the data is clearly Poisson will produce wrong results. There is nothing wrong with doing ANOVA if the data is a bit skewed, but I don't think that is the case here.
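
For readers without SAS, here is a rough sketch of what a Poisson regression does, written in plain Python. It assumes a single binary factor and made-up counts, fits a log-linear model by Newton-Raphson, and ignores the within-subject correlation that a repeated-measures analysis would have to account for. The function name `fit_poisson` and the data are hypothetical; this is not a substitute for GENMOD.

```python
import math

def fit_poisson(x, y, iterations=25):
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson on the Poisson likelihood."""
    # Start from the intercept-only fit: b0 = log(mean(y)), b1 = 0.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical counts with many zeroes in two groups (x = 0 and x = 1).
x = [0] * 8 + [1] * 8
y = [0, 0, 1, 0, 2, 0, 0, 1,   1, 3, 0, 2, 4, 1, 0, 5]

b0, b1 = fit_poisson(x, y)
print(f"intercept (log mean of group 0): {b0:.4f}")
print(f"slope (log rate ratio, group 1 vs group 0): {b1:.4f}")
print(f"estimated rate ratio: {math.exp(b1):.2f}")
```

With one binary factor the fitted coefficients are just the log of the group-0 mean and the log of the ratio of group means, so the result is easy to verify by hand. The advantage over a transformation is that the effect is reported as a rate ratio on the original count scale.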

    Jenny Kotlerman
    www.****************************.com

  7. #7
    JohnM (TS Contributor)
    I guess it also depends on what you mean by "results." Are you looking for a very precise estimate of an effect size, or maybe the F-statistic? Then yes, ANOVA will produce "wrong" results with an underlying Poisson data set.

    However, if you're just trying to judge the significance/non-significance of factors, then I think the risk is pretty low.

  8. #8

    ANOVA and Data transformation

    Hello,

    The skewness of my data is: 1.987 (SE: 0.51)
    The kurtosis is: 3.409 (SE: 0.102)

    I think the skewness and kurtosis are too far from zero for me to conduct a repeated-measures ANOVA, even though the sphericity test showed a significance of 0.045 (which is, in my opinion, very close to the 0.05 level).
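
For reference, skewness and kurtosis figures with standard errors like the ones above can be reproduced outside a statistics package. The sketch below assumes the usual small-sample adjusted estimators and their standard-error formulas under normality; whether these match the exact SPSS output for this data set is an assumption, and the function name and example values are hypothetical.

```python
import math

def skew_kurtosis_with_se(xs):
    """Adjusted sample skewness and excess kurtosis with their standard errors."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    g1 = m3 / m2 ** 1.5          # moment-based skewness
    g2 = m4 / m2 ** 2 - 3.0      # moment-based excess kurtosis
    # Small-sample adjustments.
    skew = math.sqrt(n * (n - 1)) / (n - 2) * g1
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    # Standard errors under normality; they depend only on n.
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew, se_skew, kurt, se_kurt

# Hypothetical zero-heavy counts.
counts = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 0, 3, 0, 0, 5, 0, 1, 0]
skew, se_skew, kurt, se_kurt = skew_kurtosis_with_se(counts)
print(f"skewness = {skew:.3f} (SE {se_skew:.3f}), ratio = {skew / se_skew:.2f}")
print(f"kurtosis = {kurt:.3f} (SE {se_kurt:.3f}), ratio = {kurt / se_kurt:.2f}")
```

A common rule of thumb compares each estimate with about twice its standard error. Note that with large samples even mild skewness will exceed that threshold, so the ratio says more about detectability than about how much the skew matters for ANOVA.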

    For the other set of simulations (which can't be compared to the first), the sphericity test showed a significance of 0.000.

    I am doing the repeated-measures ANOVA for the following reasons:

    1. I have to calculate the means and the standard deviation in the within-subjects factors
    2. I have to check if there is some form of interaction between factors
    3. I have to test whether the calculated means are significantly different.

    I hope this is clear...

    pitto

  9. #9

    I think you should explore the GENMOD procedure in SAS.

    Jenny Kotlerman
    www.****************************.com
