GretaGarbo (09-21-2015)
I have a question but maybe it's the wrong question so I'll state the task first...
I want to make data that looks like the data I'm working with without actually being the data itself. So I want to maintain structure as much as possible and generate an n row data set with similar correlations between variables and distributional shapes of column variables. Let's say this is the data I have (in R):
Code:
set.seed(10)
dat <- data.frame(
    pois_10     = rpois(100, 10),
    binom_5_.2  = rbinom(100, 5, .2),
    binom_1_.2  = rbinom(100, 1, .2),
    runif_0_1   = runif(100),
    chisq_30    = rchisq(100, 30),
    chisq_10    = rchisq(100, 10),
    logistic_0  = rlogis(100),
    logistic_10 = rlogis(100, 10)
)
head(dat)

  pois_10 binom_5_.2 binom_1_.2 runif_0_1 chisq_30  chisq_10 logistic_0 logistic_10
1      10          0          1 0.3791907 30.72111 11.386525 -0.0951142    9.329344
2       9          2          1 0.9144744 50.45788 16.955645  3.9631567    9.288006
3       5          1          0 0.4774175 41.51492  8.591872 -1.1165473    7.216612
4       8          1          1 0.2141185 23.79510 15.054063  4.1400485    6.733500
5       9          0          1 0.7683779 25.35576  8.666049 -2.5407420    9.550825
6      10          2          1 0.9273926 49.16303  5.554433 -1.4603903   12.853794

Is there a way to figure out the distribution and parameters of the data set in order to generate a new, similar data set? I was thinking you could use the Kolmogorov-Smirnov test, compare each column to 10ish common distributions, and select the one with the highest p-value. But I realized I'd have to know the parameters of the distribution in advance.
Code:
x <- rnorm(500)
y <- runif(500)
ks.test(x, "pnorm")
ks.test(y, "pnorm")
ks.test(y, "punif")
ks.test(x, "punif")
ks.test(x, "pt", 4)

yields:
Code:
> ks.test(x, "pnorm")

    One-sample Kolmogorov-Smirnov test

data:  x
D = 0.039414, p-value = 0.419
alternative hypothesis: two-sided

> ks.test(y, "pnorm")

    One-sample Kolmogorov-Smirnov test

data:  y
D = 0.50147, p-value < 0.00000000000000022
alternative hypothesis: two-sided

> ks.test(y, "punif")

    One-sample Kolmogorov-Smirnov test

data:  y
D = 0.031638, p-value = 0.6988
alternative hypothesis: two-sided

> ks.test(x, "punif")

    One-sample Kolmogorov-Smirnov test

data:  x
D = 0.476, p-value < 0.00000000000000022
alternative hypothesis: two-sided

> ks.test(x, "pt", 4)

    One-sample Kolmogorov-Smirnov test

data:  x
D = 0.055559, p-value = 0.09129
alternative hypothesis: two-sided

So I want to create data that looks similar to data I already have, with a similar correlation matrix and similar distributions.
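On the worry about needing the parameters in advance: one option, not something proposed in this thread, is to estimate them from the data first and then rank the candidate families. The sketch below uses `MASS::fitdistr()` for maximum-likelihood estimates and ranks by AIC rather than by KS p-value, because the `ks.test()` p-value is not valid when the parameters were estimated from the same sample. `compare_fits` and its default candidate list are hypothetical choices made for illustration.

```r
library(MASS)  # provides fitdistr()

# Hypothetical helper: fit each candidate family by maximum likelihood
# and return the candidates ordered by AIC (smaller is better).
compare_fits <- function(x,
                         candidates = c("normal", "logistic",
                                        "gamma", "exponential")) {
  res <- lapply(candidates, function(d) {
    # Some families cannot be fitted (e.g. gamma with negative values);
    # those candidates are skipped instead of stopping the comparison.
    fit <- tryCatch(suppressWarnings(fitdistr(x, d)),
                    error = function(e) NULL)
    if (is.null(fit)) return(NULL)
    data.frame(dist = d,
               aic  = -2 * fit$loglik + 2 * length(fit$estimate),
               stringsAsFactors = FALSE)
  })
  res <- do.call(rbind, res)
  res[order(res$aic), , drop = FALSE]
}

# Example on simulated data (not the client data)
set.seed(10)
compare_fits(rchisq(100, 10))
```

The fitted parameters are available from `fitdistr(x, d)$estimate` if you then want to draw new values from the chosen family. Discrete columns would need their own candidate list (for example "Poisson" or "negative binomial").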
If it was all normal the task of generating similar data is pretty easy using something like this:
Code:
mvrnormR <- function(n, mu, sigma) {
    ncols <- ncol(sigma)
    mu <- rep(mu, each = n)  ## not obliged to use a matrix (recycling)
    mu + matrix(rnorm(n * ncols), ncol = ncols) %*% chol(sigma)
}

But it'd be better if we could mimic something like a uniform or Poisson distribution more closely, if that's what the data is shaped like.
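One possible way to get non-normal shapes while keeping the correlation structure is a Gaussian copula with empirical marginals. This is a swapped-in technique, not something from the posts here. The sketch below is base R only and assumes every column is numeric, no column is constant, and the normal-score correlation matrix is positive definite (otherwise `chol()` fails). `simulate_like` is a hypothetical name.

```r
# Hypothetical helper: simulate n rows that mimic the marginal shapes and
# (approximately) the correlations of a numeric data frame `dat`.
simulate_like <- function(dat, n) {
  p <- ncol(dat)
  # 1. Turn each column into normal scores via its ranks
  scores <- sapply(dat, function(x) qnorm(rank(x) / (length(x) + 1)))
  # 2. Correlation of the normal scores drives the dependence structure
  sigma <- cor(scores)
  # 3. Draw correlated standard normals (same idea as mvrnormR with mu = 0)
  z <- matrix(rnorm(n * p), ncol = p) %*% chol(sigma)
  # 4. Map to uniforms, then through each column's empirical quantiles.
  #    type = 1 returns observed values only, so discrete columns stay discrete.
  u <- pnorm(z)
  sim <- lapply(seq_len(p), function(j) {
    as.vector(quantile(dat[[j]], probs = u[, j], type = 1, names = FALSE))
  })
  names(sim) <- names(dat)
  as.data.frame(sim)
}

# Example on toy data (made up for illustration)
set.seed(10)
toy <- data.frame(count = rpois(100, 10), flag = rbinom(100, 1, .2))
toy$score <- toy$count + rlogis(100)
new_dat <- simulate_like(toy, 250)
head(new_dat)
round(cor(toy), 2)
round(cor(new_dat), 2)
```

Because `type = 1` quantiles only ever return values that occur in the original column, the simulated data reuses real values in new combinations; for continuous columns an interpolating type such as `type = 7` would avoid exact reuse. Correlations involving heavily tied columns (such as a binary flag) are only reproduced approximately.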
"If you torture the data long enough it will eventually confess."
-Ronald Harry Coase -
GretaGarbo (09-21-2015)
Is there a reason why you can't just use your dataset saved under another name? Don't get me wrong, simulating a dataset sounds fun, but all of the slight differences would be errors. Now you run the same stats you would have using the original dataset but you have:
parameter + sampling variation + new systematic error from simulating
P.S., real soon this webpage will be able to simulate any dataset:
http://www.studysimulator.com/
trinker (09-21-2015)
The data is our client's data. I can't actually use their data as it would be viewable by others but it needs to be similar.
Can you scrub any identifiers and then just add some noise to the actual data?
I don't have emotions and sometimes that makes me very sad.
Yes, any ideas? Are you thinking of a jitter? How would I know how much to jitter? We don't want to jitter a binary variable as much as a wide-ranging numeric variable.
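One way to make the amount of jitter depend on the column is to scale the noise by each column's own standard deviation and to treat binary columns separately. The sketch below is only an illustration of that idea: `jitter_columns`, `frac` (noise SD as a fraction of the column SD), and `flip` (share of binary entries to swap) are hypothetical names with arbitrary defaults, and it assumes all columns are numeric.

```r
# Hypothetical helper: add column-scaled noise to a numeric data frame.
jitter_columns <- function(dat, frac = 0.1, flip = 0.05) {
  as.data.frame(lapply(dat, function(x) {
    vals <- unique(x)
    if (length(vals) <= 2) {
      # Binary (or constant) column: swap a small random share of entries
      swap <- runif(length(x)) < flip
      if (length(vals) == 2) {
        x[swap] <- ifelse(x[swap] == vals[1], vals[2], vals[1])
      }
      x
    } else {
      # Numeric column: noise SD is a fraction of the column's own SD
      noisy <- x + rnorm(length(x), sd = frac * sd(x))
      # Keep whole-number columns (e.g. counts) whole
      if (all(x == round(x))) round(noisy) else noisy
    }
  }))
}

# Example on toy data (made up for illustration)
set.seed(10)
toy <- data.frame(flag = rbinom(100, 1, .2),
                  count = rpois(100, 10),
                  value = rchisq(100, 30))
head(jitter_columns(toy))
```

By design the jittered values stay close to the originals, so how large `frac` and `flip` need to be depends on how far from the client's data the result has to be.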
Rename the variables X1 to Xk, and perhaps also add a constant. Depending on the dataset, you could then use simulation, as you tried, or partial simulation.