# Thread: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

1. ## simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

i have been experimenting with different methods to generate correlated, non-normal data for my comprehensive exam, and i'm finding something i cannot quite explain. i would appreciate some insight.

for the case of normally-distributed data, the process is very simple:

(1) decide on the correlation matrix that you want
(2) do some matrix decomposition on it (cholesky and PCA are, i think, the most popular ones)
(3) multiply the matrix of (normally-distributed) variables by the decomposition factor

the result is a nice multivariate normal distribution with the correlation matrix you intended in (1).
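for example, the whole recipe in one go (toy numbers; everything here is just illustrative):

```r
set.seed(1)

# (1) the correlation matrix you want
R <- matrix(c(1.0, 0.5,
              0.5, 1.0), 2, 2)

# (2) its cholesky factor (upper-triangular in R)
U <- chol(R)

# (3) multiply uncorrelated standard normals by the factor
Z <- matrix(rnorm(10000 * 2), 10000, 2)
X <- Z %*% U

round(cor(X), 2)  # off-diagonal close to the target 0.5
```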

i have found a couple of articles now that apply the same algorithm to non-normal data. i reproduced their examples in R and i can see they work as far as preserving the correlation structure goes, but there are important changes in the moments of the distribution. now, i already knew that the matrix multiplication would alter the moments of the data... but what i didn't suspect is that the new datasets look... well... for lack of a better word, more "normally distributed".

lemme show you what i mean:

Code:
```
set.seed(123)

# target correlation matrix
R <- matrix(c(1.0, 0.5, 0.5,
              0.5, 1.0, 0.5,
              0.5, 0.5, 1.0), 3, 3)

n <- 5000

# three independent uniform(0, 1) variables
unf <- matrix(runif(n * 3), n, 3)

# impose the correlation structure via the cholesky factor
datz <- unf %*% chol(R)
```
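and just to confirm the correlation part works (same simulation re-run, self-contained so it can be pasted on its own):

```r
set.seed(123)
R <- matrix(c(1.0, 0.5, 0.5,
              0.5, 1.0, 0.5,
              0.5, 0.5, 1.0), 3, 3)
n <- 5000
unf <- matrix(runif(n * 3), n, 3)
datz <- unf %*% chol(R)

round(cor(datz), 2)  # off-diagonals near 0.5: the target structure survives
```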
i did a density plot of variable #3 (which is the one that changes the most) as an example:

[density plot of variable #3: BEFORE THE TRANSFORMATION]

[density plot of variable #3: AFTER THE TRANSFORMATION]

and i can even look at the descriptives and see just how much it changed:

Code:
```
library(psych)

> describe(unf)
  var    n mean   sd median trimmed  mad min max range  skew kurtosis se
1   1 5000  0.5 0.29   0.50     0.5 0.37   0   1     1  0.01    -1.20  0
2   2 5000  0.5 0.29   0.51     0.5 0.38   0   1     1 -0.03    -1.22  0
3   3 5000  0.5 0.29   0.50     0.5 0.37   0   1     1  0.00    -1.21  0

> describe(datz)
  var    n mean   sd median trimmed  mad  min  max range  skew kurtosis se
1   1 5000 0.50 0.29   0.50    0.50 0.37 0.00 1.00  1.00  0.01    -1.20  0
2   2 5000 0.68 0.29   0.69    0.68 0.33 0.00 1.35  1.35 -0.02    -0.80  0
3   3 5000 0.80 0.29   0.81    0.80 0.32 0.06 1.53  1.48  0.00    -0.65  0
```

i mean, variable #3 went from a kurtosis of -1.21 to -0.65! that's almost half the magnitude of the excess kurtosis gone!

i also explored the beta distribution with parameters shape1=shape2=0.5 as an example of another symmetric distribution, and the same thing appeared:

[density plot: BEFORE THE TRANSFORMATION]

[density plot: AFTER THE TRANSFORMATION]

i mean, here it was so extreme it even lost its bimodality!

i tried it with skewed distributions as well. they don't lose their skewness entirely, but the skewness is considerably reduced after the transformation.
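for instance, with a standard exponential input (theoretical skewness = 2) as a stand-in for the skewed cases i tried (just a sketch):

```r
set.seed(42)
R <- matrix(c(1.0, 0.5,
              0.5, 1.0), 2, 2)
n <- 5000

# skewed input: two independent standard exponentials
ex <- matrix(rexp(n * 2), n, 2)
out <- ex %*% chol(R)

skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
skew(ex[, 2])   # near 2
skew(out[, 2])  # clearly smaller, but still positive
```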

i've been looking around the internet all day but it doesn't seem like anyone has explored this method in the context of non-normal data. more importantly, nobody seems to explain why distributions that start out looking non-normal become more "bell-shaped" after the transformation, ESPECIALLY when they are symmetric.

ideas?

2. ## Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

I'll explain this, Spunky...
Just give me 10 years to grasp the issue you're describing :-)

Gm

3. ## The Following User Says Thank You to gianmarco For This Useful Post:

spunky (09-21-2014)

4. ## Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

What does the command 'chq' do? And where does the command "describe" come from?

(Please make a reproducible example Spunky! )

5. ## The Following User Says Thank You to GretaGarbo For This Useful Post:

spunky (09-21-2014)

6. ## Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

Right Greta! It is exactly what I was about to reply :-)

p.s.

7. ## Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

Your original code isn't reproducible. I suspect you changed a variable name at some point.

My guess as to what is going on is something like "sums of variables... blah blah blah... CLT ... blah blah blah"

8. ## The Following User Says Thank You to Dason For This Useful Post:

spunky (09-21-2014)

9. ## Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

Originally Posted by gianmarco
p.s.
You are right Gianmarco. The suggestion Spunky got was:

Originally Posted by Dason
"... blah blah blah... ... ... blah blah blah"
(I hope you note my correct quotation!)

10. ## Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

Originally Posted by Dason
Your original code isn't reproducible. I suspect you changed a variable name at some point.

My guess as to what is going on is something like "sums of variables... blah blah blah... CLT ... blah blah blah"
Dason: You are correct. That is, your guess is correct that "what is going on" is attributed to the CLT.

Spunky: Let me give an example to explain what is going on. It may not be what you are referring to in those articles you've looked at - but it makes the point.

Here we go:

Let X and E be independent standardized uniform random variates on the interval [-Sqrt(3), +Sqrt(3)], i.e. both X and E have a mean of zero, variance of one, skew of zero, and (excess) kurtosis of -1.20.

Now, apply the following algorithm to create a variable (Y) that would have a correlation of 0.5 with X.

Y = X*0.5 + Sqrt[1 - 0.5^2]*E

The result is that Y and X have a correlation of 0.5, but the kurtosis of Y is -0.75, i.e. more "normal-like", because of the CLT: Y is the SUM of (scaled versions of) two other variables, X and E.

Hope this helps.
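A quick simulation bears this out (the -0.75 follows because, for independent unit-variance summands, excess kurtosis combines as a^4*k_X + b^4*k_E, and 0.5^4*(-1.2) + 0.75^2*(-1.2) = -0.75):

```r
set.seed(1)
n <- 2e5
s <- sqrt(3)

# standardized uniforms on [-sqrt(3), sqrt(3)]: mean 0, variance 1, excess kurtosis -1.2
X <- runif(n, -s, s)
E <- runif(n, -s, s)
Y <- 0.5 * X + sqrt(1 - 0.5^2) * E

kurt <- function(x) mean((x - mean(x))^4) / var(x)^2 - 3
cor(X, Y)  # close to 0.5
kurt(Y)    # close to -0.75
```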

11. ## The Following User Says Thank You to Dragan For This Useful Post:

spunky (09-21-2014)

12. ## Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

Originally Posted by Dason
My guess as to what is going on is something like "sums of variables... blah blah blah... CLT ... blah blah blah"
Haha, basically my thoughts exactly... "I'm sure the CLT is lurking around here somewhere...maybe in the matrix multiplication step"

13. ## The Following User Says Thank You to Jake For This Useful Post:

spunky (09-21-2014)

14. ## Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

thanks everyone! sorry, i posted that at 3am so i guess i did skip one step.

but you're all right (and thanks Dragan for the example!). the "sums of random variables + CLT kicking in" explanation makes perfect sense now.

balance has now been restored to my universe...

15. ## Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

I had no idea what spunky was trying to do since I could not get the code to work.

Originally Posted by Dason
My guess as to what is going on is something like "sums of variables... blah blah blah... CLT ... blah blah blah"

Originally Posted by Dragan
Dason: You are correct. That is, your guess is correct that "what is going on" is attributed to the CLT.
This gives me the opportunity to show my favorite quotation:

Dragan:
I love acronyms....because they confuse so many people
So I guess that "CLT" is about "Cilly Little Things"?

Now I believe that Spunky was trying to do something like equation 7 here: creating random variables with a certain correlation structure from uncorrelated variables with the help of the Cholesky decomposition.

Isn't it so that the Cholesky decomposition will work for all covariance matrices and correlation matrices? (work in the sense of being able to create the wanted correlation structure.)

I believe that the theorem about "Cilly Little Things" is valid also if the coefficients are fixed but not necessarily the same, as in the Cholesky decomposition? But I also guess that there must be some restriction on the size of the coefficients relative to the number of variables that are summed? I would guess that, say, a 20 by 20 correlation matrix would give random variables that are quite close to normality? (And I wonder about multivariate normality versus marginal normality?) Does the Cholesky decomposition have the property of always giving coefficients small enough that the theorem about "Cilly Little Things" starts to "kick in" as the number of variables increases?
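A quick numerical poke at this question (equicorrelated matrix with rho = 0.5 and uniform inputs; the sizes are purely illustrative): the marginal kurtosis of the last variable does move toward 0 as the dimension grows, but it seems to level off rather than vanish. If I have the algebra right, the Cholesky factor always leaves a weight of about sqrt(1 - rho) on the variable's own innovation, so the "Cilly Little Things" never fully kick in.

```r
set.seed(7)
n <- 20000
kurt <- function(x) mean((x - mean(x))^4) / var(x)^2 - 3

res <- sapply(c(3, 10, 20), function(d) {
  R <- matrix(0.5, d, d)
  diag(R) <- 1                      # equicorrelated target matrix
  U <- matrix(runif(n * d), n, d)   # uniform inputs, excess kurtosis -1.2
  kurt((U %*% chol(R))[, d])        # last variable mixes the most inputs
})
round(res, 2)  # closer to 0 than the input's -1.2, but leveling off
```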

Isn't anybody going to suggest that Spunky simulate Cauchy-distributed variables as input?

@Spunky, which package gives "describe()"?

16. ## Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

Originally Posted by GretaGarbo
So I guess that "CLT" is about "Cilly Little Things"?
I would have thought the mighty GretaGarbo would have run across us using CLT for Central Limit Theorem before

17. ## Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

Originally Posted by GretaGarbo
I had not idea about what spunky was trying to do since I could not get the code to work.
sorry Greta, you're right. my code wasn't reproducible until now. i just changed the stuff so it would run in anyone's R. in my defense, i was doing R while tipsy at 3am on a sunday morning, but i just took care of it. i did change a variable name. i was trying several distributions to see whether this was unique to symmetric distributions or if skewed distributions would also become "normalized" (as in "closer to the normal distribution").

Originally Posted by GretaGarbo
Now I believe that Spunky was trying to do something like in here in equation 7, creating random variables with a certain correlation structure from uncorrelated variables with the help of Cholesky decomposition.
you are most correct, that is what i was aiming for.

Originally Posted by GretaGarbo
Isn't it so that the Cholesky decomposition will work for all covariance matrices and correlation matrices? (work in the sense of being able to create the wanted correlation structure.)

I believed that the theorem about "Cilly Little Things" is valid also if the coefficients are fixed but not necessarily the same, as in Cholesky decomposition? But I also guess that there must be some restriction on the size of the coefficients relative to the number of variables that are summed? But I guess that, say a 20 by 20 correlation matrix, would give random variables that are quite close to normality? (And I wonder about multivariate normality and marginal normality?) Does the Cholesky decomposition have the property of always giving coefficients small enough so that the theorem about "Cilly Little Things" start to "kick in" as the number of variables increase?
i don't think there's anything stopping the cholesky decomposition (or PCA, which is more common here in social-science land) from working.

the issue i keep finding in my area (social/behavioural sciences) is that people are very, very fond of taking a "black box" approach to simulation studies. by "black box" i mean they don't quite understand what their software of choice is doing and just blindly follow either what someone else did or let the software take care of things (i've found many Mplus users are very prone to this). the main motivation behind this question is that i was doing an overview of methods to test differences between correlation coefficients and found that quite a few people take this 'covariance matrix decomposition' approach to generate their data.

now, there is nothing wrong when people limit their claims to multivariate normality, because the normal distribution is closed under linear combinations. but i do realize that when people say they're studying the effect of 'non-normality' by using non-normal distributions and applying this algorithm, many of their variables end up looking quite normal. i guess when you have few variables (like, in my case, 3) things are not too bad. but the moment you start looking at things with 10 or 20 variables, quite a few of those look a little 'too normal' (both by plotting them and by running normality tests), which makes me question whether the recommendations in these published articles are still valid. a very good example of this can actually be found here.

Originally Posted by GretaGarbo
Isn't there anybody who is going to suggest Spunky to simulate Cauchy distributed variables as input?
the Cauchy distribution needs to be exorcised! and i feel repulsed to admit it, but it will feature prominently in my comprehensive doctoral exam, as per Dason's suggestion of a sensible distribution to generate outliers. i'm probably going to take a Gaussian copula approach to this, though. but i will still feel dirty.

Originally Posted by GretaGarbo
@Spunky, What package did give "describe()"?
the psych package. i also updated that in my code so we could have a fully-reproducible example.

18. ## Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

If you want to create Cauchy-distributed input, it's a straightforward thing to do.

Specifically, if we let Z1 and Z2 be independent standard normal deviates, then the ratio X = Z1/Z2 will follow a (standard) Cauchy distribution.
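For instance (just a sketch; the theoretical quartiles of the standard Cauchy are -1, 0, and +1):

```r
set.seed(2)
n <- 1e5

Z1 <- rnorm(n)
Z2 <- rnorm(n)
X <- Z1 / Z2  # ratio of independent standard normals: standard Cauchy

quantile(X, c(0.25, 0.50, 0.75))  # close to -1, 0, 1
```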

Maybe I'm missing something here (Greta?).

19. ## Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

Originally Posted by Dason
I would have thought the mighty GretaGarbo would have run across us using CLT for Central Limit Theorem before
So THAT is what it means!!

Is that abbreviation as well known in general language as "USA" or "NATO"?

(Maybe it is in his language and at his department. From now on we will only talk about CGS - Centrala Gränsvärdessatsen.)

Does MVC also mean "Many Volatile Contributors"?

(You are right Gianmarco, thread deterioration is: ON)

20. ## Re: simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

Originally Posted by spunky
by "black box" i mean they don't quite understand what their software of choice is doing and just blindly follow either what someone else did or let the software take care of things (i've found many Mplus users are very prone of this option). the main motivation behind this question is because i was doing an overview of methods to test differences of correlation coefficients and found quite a bit of people take this 'covariance matrix decomposition' approach to generate their data.
Now (I believe) that I understand more of the motivation for the study. I guess that if people generate data from a skewed distribution and use the Cholesky transformation, they might believe that they are using a skewed distribution when it is in fact quite "normal".

Are there any distance measures that can summarize how far the generated distribution is from the intended distribution? The Kullback-Leibler divergence came to mind, or good old Pearson chi-square? Or is it silly to imagine a one-number quality index? It would be convenient to have such a measure.

Will the central limit theorem ensure multivariate normality in this case?

Originally Posted by spunky
the psych package. i also updated that on my code so we could have a fully-reproducible example.
Thanks!

Originally Posted by Dragan
Maybe I'm missing something here (Greta?).

It has been suggested - especially to spunky - that the Cauchy one day will trick him.

(I saw somewhere - maybe here - that someone was generating random numbers from a t-distribution with degrees of freedom just above 2 (close to infinite variance) and they got strange results.)