i have been experimenting with different methods to generate correlated, non-normal data for comprehensive-exam purposes and i'm finding something i cannot quite explain and would appreciate some insight.
for the case of normally-distributed data, the process is very simple:
(1) decide on the correlation matrix that you want
(2) do some matrix decomposition on it (cholesky/PCA i think are the most popular ones)
(3) multiply the decomposition of the correlation matrix times the matrix of (normally-distributed) variables
the result is a nice multivariate normal distribution with the correlation matrix you intended in (1).
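those three steps can be sketched in a few lines (a sketch of my own in plain Python just to keep it self-contained; the Cholesky factor is computed by hand so nothing is hidden inside a library call):

```python
import random, math

random.seed(1)

def cholesky(R):
    """Lower-triangular L with L * L^T == R (textbook algorithm)."""
    k = len(R)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                L[i][j] = math.sqrt(R[i][i] - s)
            else:
                L[i][j] = (R[i][j] - s) / L[j][j]
    return L

# (1) decide on the correlation matrix you want
R = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
# (2) decompose it
L = cholesky(R)
# (3) multiply: each generated row is L times a vector of independent N(0,1)s
n = 20000
Y = []
for _ in range(n):
    z = [random.gauss(0, 1) for _ in range(3)]
    Y.append([sum(L[i][j] * z[j] for j in range(i + 1)) for i in range(3)])

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cab / math.sqrt(va * vb)

# sample correlation between variables 1 and 3, near the target 0.5
r13 = corr([row[0] for row in Y], [row[2] for row in Y])
```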
i have found a couple of articles now that apply the same algorithm but for non-normal data. i reproduced their examples in R and i can see they work as far as preserving the correlation structure goes but that there are important changes in the moments of the distribution. now, i already knew that the matrix multiplication would alter the moments of the data... but what i didn't suspect is that the new datasets look...well... for lack of a better word, more "normally distributed".
lemme show you what i mean:
i did a density plot of variable #3 (which is the one that changes the most) as an example:

Code:
set.seed(123)
R <- matrix(c(1.0, 0.5, 0.5,
              0.5, 1.0, 0.5,
              0.5, 0.5, 1.0), 3, 3)
n <- 5000
unf <- matrix(c(runif(n), runif(n), runif(n)), n, 3)
datz <- unf %*% chol(R)
[density plots of variable #3: BEFORE THE TRANSFORMATION / AFTER THE TRANSFORMATION]
and i can even just look at the descriptives and see just how much it changed:
Code:
library(psych)
> describe(unf)
  var    n mean   sd median trimmed  mad  min  max range  skew kurtosis se
1   1 5000  0.5 0.29   0.50     0.5 0.37    0    1     1  0.01    -1.20  0
2   2 5000  0.5 0.29   0.51     0.5 0.38    0    1     1 -0.03    -1.22  0
3   3 5000  0.5 0.29   0.50     0.5 0.37    0    1     1  0.00    -1.21  0
> describe(datz)
  var    n mean   sd median trimmed  mad  min  max range  skew kurtosis se
1   1 5000 0.50 0.29   0.50    0.50 0.37 0.00 1.00  1.00  0.01    -1.20  0
2   2 5000 0.68 0.29   0.69    0.68 0.33 0.00 1.35  1.35 -0.02    -0.80  0
3   3 5000 0.80 0.29   0.81    0.80 0.32 0.06 1.53  1.48  0.00    -0.65  0
i mean, variable #3's kurtosis went from -1.21 to -0.65! that's roughly half the kurtosis it started with!
i also explored the beta distribution with parameters shape1=shape2=0.5 as an example of another symmetric distribution, and the same thing appeared:
[density plots: BEFORE THE TRANSFORMATION / AFTER THE TRANSFORMATION]
i mean, here it was so extreme it even lost its bimodality!
i tried it with skewed distributions as well. i can see they don't lose their skewness, but the skewness is also considerably reduced after they get transformed.
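a rough check of *why* the skewness shrinks (my own sketch in Python, not from any of the articles): if Y = a*X + b*E with X and E independent and standardized, the third-moment algebra gives skew(Y) = a^3*skew(X) + b^3*skew(E), and a^3 + b^3 < 1 whenever a^2 + b^2 = 1 with both weights nonzero. standardized exponentials (skew = 2) are used purely as an illustration:

```python
import random, math

random.seed(2)
a = 0.5
b = math.sqrt(1 - a * a)

# exact value from the moment formula: a^3*2 + b^3*2, well below 2
predicted = a**3 * 2 + b**3 * 2

def std_exp():
    # Exp(1) shifted to mean 0: variance 1, skewness 2
    return random.expovariate(1.0) - 1.0

n = 200000
y = [a * std_exp() + b * std_exp() for _ in range(n)]

m = sum(y) / n
m2 = sum((v - m) ** 2 for v in y) / n
m3 = sum((v - m) ** 3 for v in y) / n
sample_skew = m3 / m2 ** 1.5   # should sit near `predicted`
```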
i've been looking around all day on the internet but it doesn't seem like anyone has explored this method within the context of non-normal data. more importantly, nobody seems to explain why what initially looks like non-normal distributions become more "bell-shaped" after the transformation, ESPECIALLY if they are symmetric distributions.
ideas?
for all your psychometric needs! https://psychometroscar.wordpress.com/about/
I'll explain this, Spunky...
Just give 10 years to grasp the issue you're describing :-)
Gm
http://cainarchaeology.weebly.com/
spunky (09-21-2014)
What does the command 'chq' do? And where does the command "describe" come from?
(Please make a reproducible example Spunky! )
Right Greta! That is exactly what I was about to reply :-)
p.s.
Gianmarco's thread-deterioration vengeance: ON
http://cainarchaeology.weebly.com/
Your original code isn't reproducible. I suspect you changed a variable name at some point.
My guess as to what is going on is something like "sums of variables... blah blah blah... CLT ... blah blah blah"
I don't have emotions and sometimes that makes me very sad.
Dason: You are correct. That is, your guess is correct: what is going on is attributable to the CLT.
Spunky: Let me give an example to explain what is going on. It may not be what you are referring to in those articles you've looked at - but it makes the point.
Here we go:
Let X and E be independent standardized Uniform random variates on the interval [-Sqrt(3), + Sqrt(3)] i.e. both X and E have means of zero, variance of one, skew of zero, and kurtosis of -1.20 .
Now, apply the following algorithm to create a variable (Y) that would have a correlation of 0.5 with X.
Y = X*0.5 + Sqrt[1 - 0.5^2]*E
The result is that Y and X would have a correlation of 0.5 but the kurtosis of Y would be -0.75 i.e. more "normal-like" because of the CLT i.e. Y is the SUM of a function of two other variables (X and E).
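That algebra can be spot-checked numerically (a plain-Python sketch of my own, not part of the derivation itself): the fourth-moment formula gives kurt(Y) = 0.5^4*(-1.2) + (0.75)^2*(-1.2) = -0.75, and a simulation lands on the same number:

```python
import random, math

random.seed(3)
s3 = math.sqrt(3)

# exact excess kurtosis of Y from the moment algebra: a^4*k_X + b^4*k_E
exact = 0.5**4 * (-1.2) + (1 - 0.5**2) ** 2 * (-1.2)   # = -0.75

# simulate Y = 0.5*X + sqrt(0.75)*E with X, E ~ Uniform(-sqrt(3), sqrt(3))
n = 200000
y = [0.5 * random.uniform(-s3, s3) + math.sqrt(0.75) * random.uniform(-s3, s3)
     for _ in range(n)]

m = sum(y) / n
m2 = sum((v - m) ** 2 for v in y) / n
m4 = sum((v - m) ** 4 for v in y) / n
sample_kurt = m4 / m2**2 - 3   # should land near -0.75
```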
Hope this helps.
Last edited by Dragan; 09-21-2014 at 05:31 PM. Reason: clarity
thanks everyone! sorry, i posted that at 3am so i guess i did skip one step.
but you're all right (and thnx Dragan for the example!) the "sums of random variables + CLT kicking in" explanation makes perfect sense now.
balance has now been restored to my universe...
for all your psychometric needs! https://psychometroscar.wordpress.com/about/
I had no idea what spunky was trying to do since I could not get the code to work.
This gives me the opportunity to show my favorite quotation:
Dragan: I love acronyms....because they confuse so many people
So I guess that "CLT" is about "Cilly Little Things"?
Now I believe that Spunky was trying to do something like in here in equation 7, creating random variables with a certain correlation structure from uncorrelated variables with the help of Cholesky decomposition.
Isn't it so that the Cholesky decomposition will work for all covariance matrices and correlation matrices? (work in the sense of being able to create the wanted correlation structure.)
I believe that the theorem about "Cilly Little Things" is valid also if the coefficients are fixed but not necessarily the same, as in the Cholesky decomposition? But I also guess that there must be some restriction on the size of the coefficients relative to the number of variables that are summed? I would guess that, say, a 20 by 20 correlation matrix would give random variables that are quite close to normality? (And I wonder about multivariate normality versus marginal normality?) Does the Cholesky decomposition have the property of always giving coefficients small enough that the theorem about "Cilly Little Things" starts to "kick in" as the number of variables increases?
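That question can actually be answered exactly, without simulation (a sketch of my own, using an equicorrelated matrix with rho = 0.5 and uniform inputs): each transformed variable is a fixed linear combination of independent uniforms whose squared weights sum to 1, so its excess kurtosis is exactly -1.2 times the sum of fourth powers of that Cholesky row:

```python
import math

def cholesky(R):
    # textbook lower-triangular Cholesky factorization
    k = len(R)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = math.sqrt(R[i][i] - s) if i == j else (R[i][j] - s) / L[j][j]
    return L

k, rho = 20, 0.5
R = [[1.0 if i == j else rho for j in range(k)] for i in range(k)]
L = cholesky(R)

# variable_i = sum_j L[i][j]*u_j; with uniform inputs (excess kurtosis -1.2)
# and sum_j L[i][j]^2 = 1 (the diagonal of R), the exact excess kurtosis is:
kurt = [-1.2 * sum(L[i][j] ** 4 for j in range(k)) for i in range(k)]
# kurt[0] is the untouched -1.2, kurt[1] is already -0.75 (Dragan's example),
# and later rows keep moving toward 0 -- but they level off well short of it,
# because the diagonal Cholesky weight stays large rather than shrinking.
```

so the "kick in" is real but incomplete here: the later variables get less uniform-looking row by row, yet they never become fully normal, because not all of the Cholesky coefficients become small.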
Isn't anybody going to suggest that Spunky simulate Cauchy distributed variables as input?
@Spunky, which package gives "describe()"?
sorry Greta, you're right. my code wasn't reproducible until now. i just changed the stuff so it would run on anyone's R. in my defense, i was doing R while tipsy at 3am on a sunday morning. but i just took care of it. i did change a variable name. i was trying several distributions to see whether this was unique to symmetric distributions or if skewed distributions would also become "normalized" (as in "closer to the normal distribution")
you are most correct, that is what i was aiming for.
i don't think there's anything stopping the cholesky decomposition (or PCA, which is more common here in social-science land) from working. the issue i keep finding in my area (social/behavioural sciences) is that people are very, very fond of taking a "black box" approach to simulation studies. by "black box" i mean they don't quite understand what their software of choice is doing and just blindly follow either what someone else did or let the software take care of things (i've found many Mplus users are very prone to this option).

the main motivation behind this question is that i was doing an overview of methods to test differences of correlation coefficients and found quite a few people take this 'covariance matrix decomposition' approach to generate their data. now, there is nothing wrong when people limit their claims to multivariate normality, because the normal distribution is closed under linear combinations. but i do realize that when people say they're studying the effect of 'non-normality' by using non-normal distributions and applying this algorithm, many of their variables end up looking quite normal.

i guess when you have few variables (like, in my case, 3) things are not too bad. but the moment you start looking at things with 10 or 20 variables, quite a few of those look a little 'too normal' (both through plotting them and by doing normality tests), which makes me question whether the recommendations in these published articles are still valid or not. a very good example of this can actually be found here.
the Cauchy Distribution needs to be exorcised! and i feel repulsed to admit it, but it will feature prominently on my comprehensive doctoral exam, as per Dason's suggestion on a sensible distribution to generate outliers. i'm probably going to take a Gaussian copula approach to this though. but i will still feel dirty
the psych package. i also updated that on my code so we could have a fully-reproducible example.
for all your psychometric needs! https://psychometroscar.wordpress.com/about/
If you want to create a Cauchy distribution for input, then it's a straight-forward thing to do.
Specifically, if we let Z1 and Z2 be independent standard normal deviates, then the ratio X=Z1/Z2 will follow a Cauchy distribution.
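A minimal sketch of that construction in plain Python (the median and quartiles are the sensible checks here, since the Cauchy has no mean):

```python
import random

random.seed(4)
n = 100001

# ratio of two independent standard normals -> standard Cauchy
x = sorted(random.gauss(0, 1) / random.gauss(0, 1) for _ in range(n))

sample_median = x[n // 2]          # standard Cauchy has median 0
q1, q3 = x[n // 4], x[3 * n // 4]  # and quartiles at -1 and +1
```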
Maybe I'm missing something here (Greta?).
So THAT is what it means!!
Is that abbreviation as well known in the general language as "USA" or "NATO"?
(Maybe it is in his language and at his department. From now on we will only talk about CGS - Centrala Gränsvärdessatsen.)
Does MVC also mean "Many Volatile Contributors"?
(You are right Gianmarco, thread deterioration is: ON)
Now (I believe) I understand more of the motivation for the study. I guess that if people generate data from a skewed distribution and use the Cholesky transformation, they might believe that they are using a skewed distribution when it is in fact quite "normal".
Are there any distance measures that can summarize how far away the generated distribution is from a believed distribution? To my mind came the Kullback-Leibler distance or good old Pearson chi square? Or is it silly to imagine a one-number quality index? It would be convenient with such a measure.
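One concrete one-number option along those lines (a sketch of my own, not from any reference) is the Kolmogorov-Smirnov distance sup |F_n(x) - F_0(x)| between the generated sample and the distribution it is *believed* to follow. As an illustration: the mean of two U(0,1)s is triangular, and its KS distance from the believed U(0,1) is exactly 1/8 = 0.125 (largest gap at x = 0.25):

```python
import random

random.seed(5)
n = 100000

# "believed" to be U(0,1), but actually the mean of two U(0,1)s (triangular)
y = sorted((random.random() + random.random()) / 2 for _ in range(n))

# KS statistic: empirical CDF vs the believed CDF F0(x) = x on [0, 1],
# checking both sides of each jump of the empirical CDF
D = max(max(abs((i + 1) / n - v), abs(i / n - v)) for i, v in enumerate(y))
# D should sit near the theoretical value 0.125
```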
Will the central limit theorem ensure multivariate normality in this case?
Thanks!
It has been suggested - especially to spunky - that the Cauchy one day will trick him.
(I saw somewhere - maybe here - that someone was generating random numbers from a t-distribution with degrees of freedom just above 2 (close to infinite variance) and they got strange results.)