simulating correlated, non-normal data: WHY DOES THIS HAPPEN!?

spunky

Super Moderator
#1
i have been experimenting with different methods of generating correlated, non-normal data for comprehensive-exam purposes, and i'm finding something i cannot quite explain. i'd appreciate some insight.

for the case of normally-distributed data, the process is very simple:

(1) decide on the correlation matrix that you want
(2) do some matrix decomposition on it (cholesky/PCA i think are the most popular ones)
(3) multiply the matrix of (normally-distributed) variables by the decomposition of the correlation matrix

the result is a nice multivariate normal distribution with the correlation matrix you intended in (1).
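in R, this takes only a few lines (a minimal sketch of steps (1)-(3)):

Code:
set.seed(123)

# (1) the target correlation matrix
R <- matrix(c(1.0, 0.5, 0.5,
              0.5, 1.0, 0.5,
              0.5, 0.5, 1.0), 3, 3)

# (2) + (3) multiply uncorrelated standard normals by the cholesky factor
Z <- matrix(rnorm(5000*3), 5000, 3)
X <- Z %*% chol(R)

round(cor(X), 2)   # ~ R, and the marginals are still normal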

i have found a couple of articles now that apply the same algorithm to non-normal data. i reproduced their examples in R and i can see that they work as far as preserving the correlation structure goes, but there are important changes in the moments of the distributions. now, i already knew that the matrix multiplication would alter the moments of the data... but what i didn't suspect is that the new datasets look... well... for lack of a better word, more "normally distributed".

lemme show you what i mean:

Code:
set.seed(123)

# (1) the target correlation matrix
R <- matrix(c(1.0, 0.5, 0.5,
              0.5, 1.0, 0.5,
              0.5, 0.5, 1.0), 3, 3)

n <- 5000

# three columns of independent uniform(0, 1) variables
unf <- matrix(c(runif(n), runif(n), runif(n)), n, 3)

# (2) + (3) impose the correlation structure via the cholesky factor
datz <- unf %*% chol(R)
i did a density plot of variable #3 (which is the one that changes the most) as an example:

BEFORE THE TRANSFORMATION
[density plot of unf[,3]: essentially flat, as a uniform should be]

AFTER THE TRANSFORMATION
[density plot of datz[,3]: noticeably more bell-shaped]

and i can just look at the descriptives and see how much things changed:

Code:
library(psych)

> describe(unf)
  var    n mean   sd median trimmed  mad min max range  skew kurtosis se
1   1 5000  0.5 0.29   0.50     0.5 0.37   0   1     1  0.01    -1.20  0
2   2 5000  0.5 0.29   0.51     0.5 0.38   0   1     1 -0.03    -1.22  0
3   3 5000  0.5 0.29   0.50     0.5 0.37   0   1     1  0.00    -1.21  0

> describe(datz)
  var    n mean   sd median trimmed  mad  min  max range  skew kurtosis se
1   1 5000 0.50 0.29   0.50    0.50 0.37 0.00 1.00  1.00  0.01    -1.20  0
2   2 5000 0.68 0.29   0.69    0.68 0.33 0.00 1.35  1.35 -0.02    -0.80  0
3   3 5000 0.80 0.29   0.81    0.80 0.32 0.06 1.53  1.48  0.00    -0.65  0

i mean, variable #3's kurtosis went from -1.21 to -0.65! that's roughly half the original kurtosis!

i also explored the beta distribution with parameters shape1=shape2=0.5 as an example of another symmetric distribution, and the same thing happened:

BEFORE THE TRANSFORMATION:
[density plot: U-shaped and bimodal, as a beta(0.5, 0.5) should be]

AFTER THE TRANSFORMATION:
[density plot: unimodal and much more bell-shaped]

i mean, here it was so extreme it even lost its bimodality!


i tried it with skewed distributions as well. they don't lose their skewness entirely, but the skewness is considerably reduced after they get transformed.

i've been looking around the internet all day, but it doesn't seem like anyone has explored this method in the context of non-normal data. more importantly, nobody seems to explain why initially non-normal distributions become more "bell-shaped" after the transformation, ESPECIALLY if they are symmetric.

ideas? :)
 

Dason

Ambassador to the humans
#5
Your original code isn't reproducible. I suspect you changed a variable name at some point.

My guess as to what is going on is something like "sums of variables... blah blah blah... CLT ... blah blah blah"
 

Dragan

Super Moderator
#7
Your original code isn't reproducible. I suspect you changed a variable name at some point.

My guess as to what is going on is something like "sums of variables... blah blah blah... CLT ... blah blah blah"
Dason: You are correct. That is, your guess is correct: "what is going on" is attributable to the CLT.

Spunky: Let me give an example to explain what is going on. It may not be what you are referring to in those articles you've looked at - but it makes the point.

Here we go:

Let X and E be independent standardized uniform random variates on the interval [-sqrt(3), +sqrt(3)], i.e. both X and E have mean zero, variance one, skew of zero, and kurtosis of -1.20.

Now, apply the following algorithm to create a variable (Y) that would have a correlation of 0.5 with X.

Y = 0.5*X + sqrt(1 - 0.5^2)*E

The result is that Y and X have a correlation of 0.5, but the kurtosis of Y is -0.75, i.e. more "normal-like", because of the CLT: Y is the SUM of (weighted) functions of two other variables (X and E). (For a weighted sum of independent standardized variables, the excess kurtoses combine as w1^4*k1 + w2^4*k2 = (0.5^4 + sqrt(1 - 0.5^2)^4)*(-1.20) = 0.625*(-1.20) = -0.75.)
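You can check this numerically in R (a quick sketch; kurtosi() from the psych package reports excess kurtosis):

Code:
library(psych)

set.seed(123)
n <- 1e6

# standardized uniforms on [-sqrt(3), sqrt(3)]: mean 0, variance 1, kurtosis -1.20
X <- runif(n, -sqrt(3), sqrt(3))
E <- runif(n, -sqrt(3), sqrt(3))

Y <- 0.5*X + sqrt(1 - 0.5^2)*E

cor(X, Y)     # ~ 0.50
kurtosi(Y)    # ~ -0.75, closer to the normal's 0 than the uniform's -1.20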

Hope this helps.
 

Jake

Cookie Scientist
#8
My guess as to what is going on is something like "sums of variables... blah blah blah... CLT ... blah blah blah"
Haha, basically my thoughts exactly... "I'm sure the CLT is lurking around here somewhere...maybe in the matrix multiplication step"
 

spunky

Super Moderator
#9
thanks everyone! sorry, i posted that at 3am, so i guess i skipped a step.

but you're all right (and thanks Dragan for the example!). the sums-of-random-variables + CLT explanation makes perfect sense now.

balance has now been restored to my universe...
 
GretaGarbo

#10
I had no idea what spunky was trying to do since I could not get the code to work.

My guess as to what is going on is something like "sums of variables... blah blah blah... CLT ... blah blah blah"

Dason: You are correct. That is, your guess is correct: "what is going on" is attributable to the CLT.
This gives me the opportunity to show my favorite quotation:

Dragan:
I love acronyms....because they confuse so many people
So I guess that "CLT" is about "Cilly Little Things"?

Now I believe that Spunky was trying to do something like here, in equation 7: creating random variables with a certain correlation structure from uncorrelated variables with the help of the Cholesky decomposition.

Isn't it the case that the Cholesky decomposition will work for all covariance matrices and correlation matrices? (Work in the sense of being able to create the desired correlation structure.)

I believe the theorem about "Cilly Little Things" is valid also when the coefficients are fixed but not necessarily equal, as in the Cholesky decomposition. But I also guess there must be some restriction on the size of the coefficients relative to the number of variables being summed? I would guess that, say, a 20 by 20 correlation matrix would give random variables quite close to normality. (And I wonder about multivariate normality versus marginal normality?) Does the Cholesky decomposition have the property of always giving coefficients small enough that the theorem about "Cilly Little Things" starts to "kick in" as the number of variables increases?

Isn't anybody going to suggest that Spunky simulate Cauchy-distributed variables as input?

@Spunky, which package gives "describe()"?
 

spunky

Super Moderator
#12
I had no idea what spunky was trying to do since I could not get the code to work.
sorry Greta, you're right, my code wasn't reproducible until now. i just changed things so it will run in anyone's R. in my defense, i was doing R while tipsy at 3am on a sunday morning, but i've taken care of it. i did change a variable name: i was trying several distributions to see whether this was unique to symmetric distributions or whether skewed distributions would also become "normalized" (as in "closer to the normal distribution").


Now I believe that Spunky was trying to do something like here, in equation 7: creating random variables with a certain correlation structure from uncorrelated variables with the help of the Cholesky decomposition.
you are most correct, that is what i was aiming for.

Isn't it the case that the Cholesky decomposition will work for all covariance matrices and correlation matrices? (Work in the sense of being able to create the desired correlation structure.)

I believe the theorem about "Cilly Little Things" is valid also when the coefficients are fixed but not necessarily equal, as in the Cholesky decomposition. But I also guess there must be some restriction on the size of the coefficients relative to the number of variables being summed? I would guess that, say, a 20 by 20 correlation matrix would give random variables quite close to normality. (And I wonder about multivariate normality versus marginal normality?) Does the Cholesky decomposition have the property of always giving coefficients small enough that the theorem about "Cilly Little Things" starts to "kick in" as the number of variables increases?
i don't think there's anything stopping the cholesky decomposition (or PCA, which is more common here in social-science land) from working.

the issue i keep finding in my area (social/behavioural sciences) is that people are very, very fond of taking a "black box" approach to simulation studies. by "black box" i mean they don't quite understand what their software of choice is doing and just blindly follow either what someone else did or let the software take care of things (i've found many Mplus users are very prone to this). the main motivation behind this question is that i was doing an overview of methods to test differences between correlation coefficients and found quite a few people take this 'covariance matrix decomposition' approach to generate their data.

now, there is nothing wrong when people limit their claims to multivariate normality, because the normal distribution is closed under linear combinations. but i do realize that when people say they're studying the effect of 'non-normality' by applying this algorithm to other non-normal distributions, many of their variables end up looking quite normal. i guess when you have few variables (like, in my case, 3) things are not too bad. but the moment you start looking at things with 10 or 20 variables, quite a few of them look a little 'too normal' (both when plotting them and when running normality tests), which makes me question whether the recommendations in these published articles are still valid. a very good example of this can actually be found here.
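just to illustrate, here's a quick sketch scaling my earlier code up to a 20x20 exchangeable correlation matrix (the 0.5 correlations are an arbitrary choice):

Code:
library(psych)

set.seed(123)
k <- 20
n <- 5000

# 20 x 20 correlation matrix with all off-diagonal correlations = 0.5
R20 <- matrix(0.5, k, k)
diag(R20) <- 1

U <- matrix(runif(n*k), n, k)   # uniform marginals, excess kurtosis ~ -1.20
dat20 <- U %*% chol(R20)

describe(dat20)[, c("skew", "kurtosis")]   # kurtoses drift from -1.20 toward 0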


Isn't anybody going to suggest that Spunky simulate Cauchy-distributed variables as input?
the Cauchy distribution needs to be exorcised! and i'm repulsed to admit it, but it will feature prominently in my comprehensive doctoral exam, as per Dason's suggestion of a sensible distribution for generating outliers. i'm probably going to take a Gaussian copula approach to it, though. but i will still feel dirty.

@Spunky, which package gives "describe()"?
the psych package. i also updated my code above so we have a fully-reproducible example.
 

Dragan

Super Moderator
#13
If you want to create a Cauchy distribution for input, then it's a straightforward thing to do.

Specifically, if we let Z1 and Z2 be independent standard normal deviates, then the ratio X = Z1/Z2 will follow a Cauchy distribution.
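For example:

Code:
set.seed(123)
Z1 <- rnorm(10000)
Z2 <- rnorm(10000)
X <- Z1/Z2   # the ratio of two independent standard normals is standard Cauchy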

Maybe I'm missing something here (Greta?).
 
GretaGarbo

#14
I would have thought the mighty GretaGarbo would have run across us using CLT for Central Limit Theorem before :p
So THAT is what it means!!

Is that abbreviation as well known in general language as "USA" or "NATO"?

(Maybe it is in his language and at his department. From now on we will only talk about CGS - Centrala Gränsvärdessatsen, the central limit theorem in Swedish.)

Does MVC also mean "Many Volatile Contributors"?

(You are right Gianmarco, thread deterioration is: ON)
 
GretaGarbo

#15
by "black box" i mean they don't quite understand what their software of choice is doing and just blindly follow either what someone else did or let the software take care of things (i've found many Mplus users are very prone of this option). the main motivation behind this question is because i was doing an overview of methods to test differences of correlation coefficients and found quite a bit of people take this 'covariance matrix decomposition' approach to generate their data.
Now (I believe) I understand more of the motivation for the study. I guess that if people generate data from a skewed distribution and use the Cholesky transformation, they might believe they are using a skewed distribution when it is in fact quite "normal".

Is there any distance measure that can summarize how far the generated distribution is from the intended distribution? Kullback-Leibler divergence came to mind, or good old Pearson chi-square? Or is it silly to imagine a one-number quality index? It would be convenient to have such a measure.
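Something like the Kolmogorov-Smirnov distance could perhaps serve as such an index (a rough sketch, reusing datz from the first post; the KS statistic is used purely descriptively here):

Code:
z <- scale(datz[, 3])[, 1]   # standardize the transformed variable

# distance from the intended (standardized) uniform vs. from a standard normal
ks.test(z, "punif", -sqrt(3), sqrt(3))$statistic
ks.test(z, "pnorm")$statistic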

Will the central limit theorem ensure multivariate normality in this case?

the psych package. i also updated my code above so we have a fully-reproducible example.
Thanks!




Maybe I'm missing something here (Greta?).
:)
It has been suggested - especially to spunky - that the Cauchy will one day trick him.

(I saw somewhere, maybe here, that someone was generating random numbers from a t-distribution with degrees of freedom just above 2 (close to infinite variance) and they got strange results.)
 

Dason

Ambassador to the humans
#16
If you want to create a Cauchy distribution for input, then it's a straight-forward thing to do.

Specifically, if we let Z1 and Z2 be independent standard normal deviates, then the ratio X = Z1/Z2 will follow a Cauchy distribution.

Maybe I'm missing something here (Greta?).
rcauchy would probably work too ;)

So THAT is what it means!!

Is that abbreviation as well known in general language as "USA" or "NATO"?
In general language? Probably not. But it's pretty common in stats courses.
 

Dason

Ambassador to the humans
#18
Yes, in stats courses in the English language. Does every participant here on TalkStats come from a stats course? Or from a class in the English language?
I'm not sure what you're getting at. But I felt pretty safe using CLT in a discussion with spunky.
 

spunky

Super Moderator
#19
Isn't anybody going to suggest that Spunky simulate Cauchy-distributed variables as input?
rcauchy would probably work too ;)
Code:
library(copula)
library(psych)

# gaussian copula with all pairwise correlations = 0.5 and standard cauchy margins
cop1 <- mvdc(normalCopula(c(0.5, 0.5, 0.5), dim = 3, dispstr = "un"),
             c("cauchy", "cauchy", "cauchy"),
             list(list(location = 0, scale = 1),
                  list(location = 0, scale = 1),
                  list(location = 0, scale = 1)))

Q <- rMvdc(1000, cop1)
THERE! THERE YOU HAVE IT! now i have summoned its DARK MAGIC to further taint my thread. just look at these descriptive stats!


Code:
> describe(Q)
  var    n  mean    sd median trimmed  mad     min    max   range   skew kurtosis   se
1   1 1000  1.91 37.16   0.03    0.04 1.44 -167.96 681.39  849.35  14.79    249.4 1.18
2   2 1000 -0.09 21.61  -0.01    0.07 1.44 -480.00 226.15  706.15 -10.85    270.4 0.68
3   3 1000  0.46 34.85   0.02    0.05 1.43 -400.54 805.44 1205.98  11.46    309.2 1.10
look at that! kurtoses over 200! a range of more than 1000!!

and if we plot variable #3:

[density plot of Q[,3]: an extreme central spike with very heavy tails]

LOOK AT IT! IT'S FILTHY! FILTHY!!!! FILTHY!!!

no God or Goddess in this Universe could allow such... such... MONSTROSITY to exist...
 
GretaGarbo

#20
I'm not sure what you're getting at. But I felt pretty safe using CLT in a discussion with spunky.
You know very well that what I am getting at is that many readers here at TalkStats will not understand what you mean by "CLT", because they have not attended a stats course (where theorems are turned into abbreviations) or been at an English-speaking university. Is it clear to you if I talk about CGS or ZGS? (But maybe the last one is obvious for Karabiner.)

Or do you expect Spunky to be the only reader?