Results 1 to 15 of 22

Thread: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

  1. #1
    TS Contributor
    Points: 22,359, Level: 93
    Level completed: 1%, Points required for next Level: 991
    spunky's Avatar
    Location
    vancouver, canada
    Posts
    2,135
    Thanks
    166
    Thanked 537 Times in 431 Posts

    simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?




    i have been experimenting with different methods to generate correlated, non-normal data for comprehensive-exam purposes and i'm finding something i cannot quite explain and would appreciate some insight.

    for the case of normally-distributed data, the process is very simple:

    (1) decide on the correlation matrix that you want
    (2) do some matrix decomposition on it (cholesky/PCA i think are the most popular ones)
    (3) multiply the decomposition of the correlation matrix times the matrix of (normally-distributed) variables

    the result is a nice multivariate normal distribution with the correlation matrix you intended in (1).
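
    A minimal sketch of steps (1) to (3) for the normal case, using base R only. The seed, sample size, and the names `Z`, `U`, and `X` are illustrative assumptions, not part of the original post.

```r
# Sketch: correlated multivariate-normal data via the Cholesky factor.
set.seed(1)                          # arbitrary seed, for repeatability
R <- matrix(c(1.0, 0.5, 0.5,
              0.5, 1.0, 0.5,
              0.5, 0.5, 1.0), 3, 3)  # (1) target correlation matrix
U <- chol(R)                         # (2) upper-triangular factor: t(U) %*% U equals R
n <- 5000
Z <- matrix(rnorm(n * 3), n, 3)      # independent standard normal columns
X <- Z %*% U                         # (3) columns of X are now correlated
round(cor(X), 2)                     # sample correlations, close to R
```

    Because `chol()` returns the upper-triangular factor, the data matrix is post-multiplied by it; with a lower-triangular factor the multiplication would go the other way round.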

    i have found a couple of articles now that apply the same algorithm but for non-normal data. i reproduced their examples in R and i can see they work as far as preserving the correlation structure goes but that there are important changes in the moments of the distribution. now, i already knew that the matrix multiplication would alter the moments of the data... but what i didn't suspect is that the new datasets look...well... for lack of a better word, more "normally distributed".

    lemme show you what i mean:

    Code: 
    
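    set.seed(123)

    # target correlation matrix
    R <- matrix(c(1.0, 0.5, 0.5,
                  0.5, 1.0, 0.5,
                  0.5, 0.5, 1.0), 3, 3)

    n <- 5000

    # three independent Uniform(0, 1) columns
    unf <- matrix(c(runif(n), runif(n), runif(n)), n, 3)

    # post-multiply by the upper-triangular Cholesky factor to induce R
    datz <- unf %*% chol(R)

    i did a density plot of variable #3 (which is the one that changes the most) as an example: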

    BEFORE THE TRANSFORMATION


    AFTER THE TRANSFORMATION



    and i can even just look at the descriptives and see just how much it changed:

    Code: 
    library(psych)
    
    > describe(unf)
      var    n mean   sd median trimmed  mad min max range  skew kurtosis se
    1   1 5000  0.5 0.29   0.50     0.5 0.37   0   1     1  0.01    -1.20  0
    2   2 5000  0.5 0.29   0.51     0.5 0.38   0   1     1 -0.03    -1.22  0
    3   3 5000  0.5 0.29   0.50     0.5 0.37   0   1     1  0.00    -1.21  0
    
    > describe(datz)
      var    n mean   sd median trimmed  mad  min  max range  skew kurtosis se
    1   1 5000 0.50 0.29   0.50    0.50 0.37 0.00 1.00  1.00  0.01    -1.20  0
    2   2 5000 0.68 0.29   0.69    0.68 0.33 0.00 1.35  1.35 -0.02    -0.80  0
    3   3 5000 0.80 0.29   0.81    0.80 0.32 0.06 1.53  1.48  0.00    -0.65  0

    i mean, variable #3 went from a kurtosis of -1.21 to -0.65! that's roughly half of the (excess) kurtosis gone!

    i also explored the beta distribution with parameters shape1=shape2=0.5 as an example of another symmetric distribution and the same thing appeared:

    BEFORE THE TRANSFORMATION:




    AFTER THE TRANSFORMATION:



    i mean, here it was so extreme it even lost its bimodality!
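
    The code for the beta example is not shown in the post. A rough sketch of what it might look like, assuming the same `R` and `n` as in the uniform example; the names `bet`, `datb`, and the helper `excess_kurtosis` are hypothetical, and base R is used instead of the psych package:

```r
# Sketch: the same Cholesky step applied to Beta(0.5, 0.5) inputs.
set.seed(123)
# moment-based excess kurtosis (0 for a normal distribution)
excess_kurtosis <- function(v) mean((v - mean(v))^4) / mean((v - mean(v))^2)^2 - 3
R <- matrix(c(1.0, 0.5, 0.5,
              0.5, 1.0, 0.5,
              0.5, 0.5, 1.0), 3, 3)
n <- 5000
bet  <- matrix(rbeta(n * 3, shape1 = 0.5, shape2 = 0.5), n, 3)  # U-shaped inputs
datb <- bet %*% chol(R)                                         # transformed data
round(apply(bet,  2, excess_kurtosis), 2)   # inputs: theoretical value is -1.5
round(apply(datb, 2, excess_kurtosis), 2)   # column 1 unchanged, columns 2-3 nearer 0
# plot(density(bet[, 3])); plot(density(datb[, 3]))   # before / after, variable #3
```

    The first column passes through untouched because the first column of `chol(R)` is (1, 0, 0); only the later columns are genuine mixtures of several inputs.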


    i tried it with skewed distributions as well. i can see they don't lose their skewness, but the skewness is also considerably reduced after they get transformed.

    i've been looking around all day on the internet but it doesn't seem like anyone has explored this method within the context of non-normal data. more importantly, nobody seems to explain why what initially looks like non-normal distributions become more "bell-shaped" after the transformation, ESPECIALLY if they are symmetric distributions.

    ideas?
    for all your psychometric needs! https://psychometroscar.wordpress.com/about/

  2. #2
    TS Contributor
    Points: 40,089, Level: 100
    Level completed: 0%, Points required for next Level: 0
    Awards:
    Downloads
    gianmarco's Avatar
    Location
    Italy
    Posts
    1,367
    Thanks
    232
    Thanked 301 Times in 225 Posts

    Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

    I'll explain this, Spunky...
    Just give me 10 years to grasp the issue you're describing :-)

    Gm
    http://cainarchaeology.weebly.com/

  3. The Following User Says Thank You to gianmarco For This Useful Post:

    spunky (09-21-2014)

  4. #3
    Human
    Points: 12,672, Level: 73
    Level completed: 56%, Points required for next Level: 178
    Awards:
    Master Tagger
    GretaGarbo's Avatar
    Posts
    1,361
    Thanks
    455
    Thanked 462 Times in 402 Posts

    Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

    What does the command 'chq' do? And where does the command "describe" come from?

    (Please make a reproducible example Spunky! )

  5. The Following User Says Thank You to GretaGarbo For This Useful Post:

    spunky (09-21-2014)

  6. #4
    TS Contributor
    Points: 40,089, Level: 100
    Level completed: 0%, Points required for next Level: 0
    Awards:
    Downloads
    gianmarco's Avatar
    Location
    Italy
    Posts
    1,367
    Thanks
    232
    Thanked 301 Times in 225 Posts

    Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

    Right Greta! It is exactly what I was about to reply :-)


    p.s.
    Gianmarco's thread-deterioration vengeance: ON
    http://cainarchaeology.weebly.com/

  7. #5
    Devorador de queso
    Points: 95,540, Level: 100
    Level completed: 0%, Points required for next Level: 0
    Awards:
    Posting AwardCommunity AwardDiscussion EnderFrequent Poster
    Dason's Avatar
    Location
    Tampa, FL
    Posts
    12,930
    Thanks
    307
    Thanked 2,629 Times in 2,245 Posts

    Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

    Your original code isn't reproducible. I suspect you changed a variable name at some point.

    My guess as to what is going on is something like "sums of variables... blah blah blah... CLT ... blah blah blah"
    I don't have emotions and sometimes that makes me very sad.

  8. The Following User Says Thank You to Dason For This Useful Post:

    spunky (09-21-2014)

  9. #6
    Human
    Points: 12,672, Level: 73
    Level completed: 56%, Points required for next Level: 178
    Awards:
    Master Tagger
    GretaGarbo's Avatar
    Posts
    1,361
    Thanks
    455
    Thanked 462 Times in 402 Posts

    Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

    Quote Originally Posted by gianmarco View Post
    p.s.
    Gianmarco's thread-deterioration vengeance: ON
    You are right Gianmarco. The suggestion Spunky got was:

    Quote Originally Posted by Dason View Post
    "... blah blah blah... ... ... blah blah blah"
    (I hope you note my correct quotation!)

  10. #7
    Super Moderator
    Points: 13,151, Level: 74
    Level completed: 76%, Points required for next Level: 99
    Dragan's Avatar
    Location
    Illinois, US
    Posts
    2,014
    Thanks
    0
    Thanked 223 Times in 192 Posts

    Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

    Quote Originally Posted by Dason View Post
    Your original code isn't reproducible. I suspect you changed a variable name at some point.

    My guess as to what is going on is something like "sums of variables... blah blah blah... CLT ... blah blah blah"
    Dason: You are correct. That is, your guess is correct that "what is going on" is attributed to the CLT.

    Spunky: Let me give an example to explain what is going on. It may not be what you are referring to in those articles you've looked at - but it makes the point.

    Here we go:

    Let X and E be independent standardized Uniform random variates on the interval [-Sqrt(3), + Sqrt(3)] i.e. both X and E have means of zero, variance of one, skew of zero, and kurtosis of -1.20 .

    Now, apply the following algorithm to create a variable (Y) that would have a correlation of 0.5 with X.

    Y = X*0.5 + Sqrt[1 - 0.5^2]*E

    The result is that Y and X would have a correlation of 0.5 but the kurtosis of Y would be -0.75 i.e. more "normal-like" because of the CLT i.e. Y is the SUM of a function of two other variables (X and E).
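
    The -0.75 figure can be checked both analytically and by simulation. For independent inputs with equal variance and weights a and b satisfying a^2 + b^2 = 1, fourth cumulants add, so the excess kurtosis of Y is (a^4 + b^4) times that of the inputs: (0.0625 + 0.5625) * (-1.20) = -0.75. A small sketch, where the seed, sample size, and the helper `excess_kurtosis` are illustrative assumptions:

```r
# Sketch: simulate Y = a*X + b*E for standardized uniforms and compare with theory.
set.seed(2)
n <- 1e5
# moment-based excess kurtosis (0 for a normal distribution)
excess_kurtosis <- function(v) mean((v - mean(v))^4) / mean((v - mean(v))^2)^2 - 3
X <- runif(n, -sqrt(3), sqrt(3))     # mean 0, variance 1, excess kurtosis -1.2
E <- runif(n, -sqrt(3), sqrt(3))     # independent of X, same distribution
a <- 0.5
b <- sqrt(1 - a^2)
Y <- a * X + b * E
theory <- (a^4 + b^4) * (-1.2)       # cumulant argument: equals -0.75
c(cor = cor(X, Y), kurt_sim = excess_kurtosis(Y), kurt_theory = theory)
```

    The multiplier a^4 + b^4 is below 1 whenever both weights are nonzero, which is why any such mixture is pulled toward zero excess kurtosis.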

    Hope this helps.
    Last edited by Dragan; 09-21-2014 at 05:31 PM. Reason: clarity

  11. The Following User Says Thank You to Dragan For This Useful Post:

    spunky (09-21-2014)

  12. #8
    Cookie Scientist
    Points: 13,431, Level: 75
    Level completed: 46%, Points required for next Level: 219
    Jake's Avatar
    Location
    Austin, TX
    Posts
    1,293
    Thanks
    66
    Thanked 584 Times in 438 Posts

    Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

    Quote Originally Posted by Dason View Post
    My guess as to what is going on is something like "sums of variables... blah blah blah... CLT ... blah blah blah"
    Haha, basically my thoughts exactly... "I'm sure the CLT is lurking around here somewhere...maybe in the matrix multiplication step"
    “In God we trust. All others must bring data.”
    ~W. Edwards Deming

  13. The Following User Says Thank You to Jake For This Useful Post:

    spunky (09-21-2014)

  14. #9
    TS Contributor
    Points: 22,359, Level: 93
    Level completed: 1%, Points required for next Level: 991
    spunky's Avatar
    Location
    vancouver, canada
    Posts
    2,135
    Thanks
    166
    Thanked 537 Times in 431 Posts

    Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

    thanks everyone! sorry, i posted that at 3am so i guess i did skip one step.

    but you're all right (and thnx Dragan for the example!) the sums of random variables + CLT kicking in explanation makes perfect sense now.

    balance has now been restored to my universe...
    for all your psychometric needs! https://psychometroscar.wordpress.com/about/

  15. #10
    Human
    Points: 12,672, Level: 73
    Level completed: 56%, Points required for next Level: 178
    Awards:
    Master Tagger
    GretaGarbo's Avatar
    Posts
    1,361
    Thanks
    455
    Thanked 462 Times in 402 Posts

    Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

    I had no idea what spunky was trying to do since I could not get the code to work.

    Quote Originally Posted by Dason View Post
    My guess as to what is going on is something like "sums of variables... blah blah blah... CLT ... blah blah blah"

    Quote Originally Posted by Dragan View Post
    Dason: You are correct. That is, your guess is correct that "what is going on" is attributed to the CLT.
    This gives me the opportunity to show my favorite quotation:

    Dragan:
    I love acronyms....because they confuse so many people
    So I guess that "CLT" is about "Cilly Little Things"?

    Now I believe that Spunky was trying to do something like in here in equation 7, creating random variables with a certain correlation structure from uncorrelated variables with the help of Cholesky decomposition.

    Isn't it so that the Cholesky decomposition will work for all covariance matrices and correlation matrices? (work in the sense of being able to create the wanted correlation structure.)

    I believe that the theorem about "Cilly Little Things" is also valid if the coefficients are fixed but not necessarily equal, as in the Cholesky decomposition? But I also guess that there must be some restriction on the size of the coefficients relative to the number of variables that are summed? But I guess that, say, a 20 by 20 correlation matrix would give random variables that are quite close to normality? (And I wonder about multivariate normality and marginal normality?) Does the Cholesky decomposition have the property of always giving coefficients small enough so that the theorem about "Cilly Little Things" starts to "kick in" as the number of variables increases?

    Isn't there anybody who is going to suggest that Spunky simulate Cauchy distributed variables as input?

    @Spunky, which package provides "describe()"?

  16. #11
    Devorador de queso
    Points: 95,540, Level: 100
    Level completed: 0%, Points required for next Level: 0
    Awards:
    Posting AwardCommunity AwardDiscussion EnderFrequent Poster
    Dason's Avatar
    Location
    Tampa, FL
    Posts
    12,930
    Thanks
    307
    Thanked 2,629 Times in 2,245 Posts

    Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

    Quote Originally Posted by GretaGarbo View Post
    So I guess that "CLT" is about "Cilly Little Things"?
    I would have thought the mighty GretaGarbo would have run across us using CLT for Central Limit Theorem before
    I don't have emotions and sometimes that makes me very sad.

  17. #12
    TS Contributor
    Points: 22,359, Level: 93
    Level completed: 1%, Points required for next Level: 991
    spunky's Avatar
    Location
    vancouver, canada
    Posts
    2,135
    Thanks
    166
    Thanked 537 Times in 431 Posts

    Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

    Quote Originally Posted by GretaGarbo View Post
    I had no idea what spunky was trying to do since I could not get the code to work.
    sorry Greta, you're right. my code wasn't reproducible until now. i just changed the stuff so it would run on anyone's R. in my defense, i was doing R while tipsy at 3am on a sunday morning, but i just took care of it. i did change a variable name: i was trying several distributions to see whether this was unique to symmetric distributions or if skewed distributions would also become "normalized" (as in "closer to the normal distribution").


    Quote Originally Posted by GretaGarbo View Post
    Now I believe that Spunky was trying to do something like in here in equation 7, creating random variables with a certain correlation structure from uncorrelated variables with the help of Cholesky decomposition.
    you are most correct, that is what i was aiming for.

    Quote Originally Posted by GretaGarbo View Post
    Isn't it so that the Cholesky decomposition will work for all covariance matrices and correlation matrices? (work in the sense of being able to create the wanted correlation structure.)

    I believe that the theorem about "Cilly Little Things" is also valid if the coefficients are fixed but not necessarily equal, as in the Cholesky decomposition? But I also guess that there must be some restriction on the size of the coefficients relative to the number of variables that are summed? But I guess that, say, a 20 by 20 correlation matrix would give random variables that are quite close to normality? (And I wonder about multivariate normality and marginal normality?) Does the Cholesky decomposition have the property of always giving coefficients small enough so that the theorem about "Cilly Little Things" starts to "kick in" as the number of variables increases?
    i don't think there's anything stopping the cholesky decomposition (or PCA, which is more common here in social-science land) from working. the issue i keep finding in my area (social/behavioural sciences) is that people are very, very fond of taking a "black box" approach to simulation studies. by "black box" i mean they don't quite understand what their software of choice is doing and just blindly follow either what someone else did or let the software take care of things (i've found many Mplus users are very prone to this option).

    the main motivation behind this question is that i was doing an overview of methods to test differences of correlation coefficients and found quite a few people take this 'covariance matrix decomposition' approach to generate their data. now, there is nothing wrong when people limit their claims to multivariate normality, because the normal distribution is closed under linear combinations. but i do realize that when people say they're studying the effect of 'non-normality' by using other non-normal distributions and apply this algorithm, many of their variables end up looking quite normal. i guess when you have few variables (like, in my case, 3) things are not too bad. but the moment you start looking at things with 10 or 20 variables, quite a few of those look a little 'too normal' (both through plotting them and by doing normality tests) which makes me question whether the recommendations in these published articles are still valid or not. a very good example of this can actually be found here.
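
    One way to see how much "normalizing" a given correlation matrix causes, without simulating anything: output variable j is a weighted sum of the inputs, with weights taken from column j of `chol(R)`. Those weights have squared sum 1 (the diagonal of R), so for independent, identically distributed inputs the excess kurtosis of variable j is the input value multiplied by the sum of the fourth powers of the weights. The sketch below assumes a hypothetical 20 by 20 equicorrelated matrix with a common correlation of 0.5; neither choice comes from the articles mentioned.

```r
# Sketch: kurtosis shrinkage implied by the Cholesky weights (no simulation needed).
p   <- 20                      # assumed number of variables
rho <- 0.5                     # assumed common correlation
R <- matrix(rho, p, p)
diag(R) <- 1
U <- chol(R)                   # column j holds the weights of output variable j
shrink <- colSums(U^4)         # multiplier applied to the input excess kurtosis
round(shrink, 3)               # starts at 1, 0.625, 0.514, ...
round(shrink * (-1.2), 2)      # implied excess kurtosis for Uniform inputs
```

    How close each variable ends up to a normal shape depends on how evenly its weights are spread; a column dominated by one large weight shrinks less than a column of many similar small weights, so the result varies with both R and the decomposition used.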


    Quote Originally Posted by GretaGarbo View Post
    Isn't there anybody who is going to suggest that Spunky simulate Cauchy distributed variables as input?
    the Cauchy Distribution needs to be exorcised! and i feel repulsed to admit it, but it will feature prominently on my comprehensive doctoral exam, as per Dason's suggestion on a sensible distribution to generate outliers. i'm probably going to take a Gaussian copula approach to this though. but i will still feel dirty
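
    For reference, a rough sketch of one common form of the Gaussian copula approach (not necessarily the exact variant intended here): generate correlated normals, map them to uniforms with `pnorm()`, then push them through the quantile function of the desired marginal. The Beta(0.5, 0.5) margin, seed, and variable names are assumptions chosen for illustration.

```r
# Sketch: Gaussian copula with Beta(0.5, 0.5) margins (base R only).
set.seed(3)
R <- matrix(c(1.0, 0.5, 0.5,
              0.5, 1.0, 0.5,
              0.5, 0.5, 1.0), 3, 3)
n <- 5000
Z <- matrix(rnorm(n * 3), n, 3) %*% chol(R)   # correlated standard normals
P <- pnorm(Z)                                 # Uniform(0, 1) margins, still dependent
X <- matrix(qbeta(P, 0.5, 0.5), n, 3)         # impose the non-normal margins exactly
round(cor(X), 2)                              # near R, but attenuated somewhat
```

    Unlike the matrix-multiplication method, the margins keep their intended shape here; the price is that the Pearson correlations of `X` are typically a little smaller in magnitude than those in `R`, so the input matrix may need adjusting to hit a target exactly.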

    Quote Originally Posted by GretaGarbo View Post
    @Spunky, which package provides "describe()"?
    the psych package. i also updated that on my code so we could have a fully-reproducible example.
    for all your psychometric needs! https://psychometroscar.wordpress.com/about/

  18. #13
    Super Moderator
    Points: 13,151, Level: 74
    Level completed: 76%, Points required for next Level: 99
    Dragan's Avatar
    Location
    Illinois, US
    Posts
    2,014
    Thanks
    0
    Thanked 223 Times in 192 Posts

    Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

    If you want to create a Cauchy distribution for input, then it's a straightforward thing to do.

    Specifically, if we let Z1 and Z2 be independent standard normal deviates, then the ratio X=Z1/Z2 will follow a Cauchy distribution.
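
    A quick sketch of this construction in R, checked against the theoretical quartiles of the standard Cauchy (-1, 0, 1). The seed and sample size are arbitrary; in practice `rcauchy()` gives the same distribution directly.

```r
# Sketch: standard Cauchy draws as a ratio of two independent standard normals.
set.seed(4)
n  <- 1e5
Z1 <- rnorm(n)
Z2 <- rnorm(n)
X  <- Z1 / Z2
probs <- c(0.25, 0.50, 0.75)
# sample quartiles next to the theoretical ones from qcauchy()
rbind(sample = quantile(X, probs), theory = qcauchy(probs))
```

    Quantiles are compared rather than means or variances because the Cauchy distribution has no finite moments, so sample moments never settle down.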

    Maybe I'm missing something here (Greta?).

  19. #14
    Human
    Points: 12,672, Level: 73
    Level completed: 56%, Points required for next Level: 178
    Awards:
    Master Tagger
    GretaGarbo's Avatar
    Posts
    1,361
    Thanks
    455
    Thanked 462 Times in 402 Posts

    Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

    Quote Originally Posted by Dason View Post
    I would have thought the mighty GretaGarbo would have run across us using CLT for Central Limit Theorem before
    So THAT is what it means!!

    Is that abbreviation in the general language also as well known as "USA" or "NATO"?

    (Maybe it is in his language and at his department. From now on we will only talk about CGS - Centrala Gränsvärdessatsen.)

    Does MVC also mean "Many Volatile Contributors"?

    (You are right Gianmarco, thread deterioration is: ON)

  20. #15
    Human
    Points: 12,672, Level: 73
    Level completed: 56%, Points required for next Level: 178
    Awards:
    Master Tagger
    GretaGarbo's Avatar
    Posts
    1,361
    Thanks
    455
    Thanked 462 Times in 402 Posts

    Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?


    Quote Originally Posted by spunky View Post
    by "black box" i mean they don't quite understand what their software of choice is doing and just blindly follow either what someone else did or let the software take care of things (i've found many Mplus users are very prone to this option). the main motivation behind this question is that i was doing an overview of methods to test differences of correlation coefficients and found quite a few people take this 'covariance matrix decomposition' approach to generate their data.
    Now (I believe) that I understand more of the motivation for the study. I guess that if people generate data from a skewed distribution and use the Cholesky transformation, they might believe that they are using a skew distribution when it is in fact quite "normal".

    Are there any distance measures that can summarize how far the generated distribution is from the intended distribution? The Kullback-Leibler distance or the good old Pearson chi-square came to mind. Or is it silly to imagine a one-number quality index? Such a measure would be convenient.

    Will the central limit theorem ensure multivariate normality in this case?

    Quote Originally Posted by spunky View Post
    the psych package. i also updated that on my code so we could have a fully-reproducible example.
    Thanks!




    Quote Originally Posted by Dragan View Post
    Maybe I'm missing something here (Greta?).

    It has been suggested - especially to spunky - that the Cauchy one day will trick him.

    (I saw somewhere - maybe here - that someone was generating random numbers from a t-distribution with degrees of freedom just above 2 (close to infinite variance) and they got strange results.)
