Determine distribution and parameters

trinker

ggplot2orBust
#1
I have a question but maybe it's the wrong question so I'll state the task first...

I want to make data that looks like the data I'm working with without actually being the data itself. So I want to maintain the structure as much as possible and generate an n-row data set with similar correlations between variables and similar distributional shapes for each column. Let's say this is the data I have (in R):

Code:
set.seed(10)

dat <- data.frame(
  pois_10 = rpois(100, 10),
  binom_5_.2 = rbinom(100, 5,.2),
  binom_1_.2 = rbinom(100, 1,.2),
  runif_0_1 = runif(100),
  chisq_30 = rchisq(100, 30),
  chisq_10 = rchisq(100, 10),
  logistic_0 = rlogis(100),
  logistic_10 = rlogis(100, 10)
)

head(dat)

  pois_10 binom_5_.2 binom_1_.2 runif_0_1 chisq_30  chisq_10 logistic_0 logistic_10
1      10          0          1 0.3791907 30.72111 11.386525 -0.0951142    9.329344
2       9          2          1 0.9144744 50.45788 16.955645  3.9631567    9.288006
3       5          1          0 0.4774175 41.51492  8.591872 -1.1165473    7.216612
4       8          1          1 0.2141185 23.79510 15.054063  4.1400485    6.733500
5       9          0          1 0.7683779 25.35576  8.666049 -2.5407420    9.550825
6      10          2          1 0.9273926 49.16303  5.554433 -1.4603903   12.853794
Is there a way to figure out the distribution and parameters of each column in the data set in order to generate a new, similar data set? I was thinking you could use the Kolmogorov-Smirnov test, compare each column against 10 or so common distributions, and select the one with the highest p-value. But I realized I'd have to know the parameters of each distribution in advance.

Code:
x <- rnorm(500)
y <- runif(500)

ks.test(x, "pnorm")
ks.test(y, "pnorm")
ks.test(y, "punif")
ks.test(x, "punif")

ks.test(x, "pt", 4)
yields:

Code:
> ks.test(x, "pnorm")

        One-sample Kolmogorov-Smirnov test

data:  x
D = 0.039414, p-value = 0.419
alternative hypothesis: two-sided

> ks.test(y, "pnorm")

        One-sample Kolmogorov-Smirnov test

data:  y
D = 0.50147, p-value < 0.00000000000000022
alternative hypothesis: two-sided

> ks.test(y, "punif")

        One-sample Kolmogorov-Smirnov test

data:  y
D = 0.031638, p-value = 0.6988
alternative hypothesis: two-sided

> ks.test(x, "punif")

        One-sample Kolmogorov-Smirnov test

data:  x
D = 0.476, p-value < 0.00000000000000022
alternative hypothesis: two-sided

> 
> ks.test(x, "pt", 4)

        One-sample Kolmogorov-Smirnov test

data:  x
D = 0.055559, p-value = 0.09129
alternative hypothesis: two-sided
So I want to create data that looks similar to data I already have, with a similar correlation matrix and similar distributions.
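
One way around needing the parameters in advance is to estimate them from each column by maximum likelihood and then compare the fitted candidates. Below is a minimal sketch, assuming the MASS package (it ships with R) and a hand-picked candidate list for a single positive, continuous column; the vector `x` and the candidate set are illustrative choices. It ranks the fits by AIC instead of the KS p-value, because KS p-values are no longer calibrated once the parameters are estimated from the same sample.

```r
library(MASS)  # fitdistr() does maximum-likelihood fitting

set.seed(10)
x <- rchisq(100, 30)  # stand-in for one positive, continuous column

# Candidate families (an assumed short list; extend as needed)
candidates <- c("normal", "lognormal", "gamma", "logistic")

# Fit each family by maximum likelihood
fits <- lapply(candidates, function(d) suppressWarnings(fitdistr(x, d)))
names(fits) <- candidates

# Lower AIC = better trade-off between fit and number of parameters
aic <- sapply(fits, AIC)
best <- names(which.min(aic))

sort(aic)
fits[[best]]$estimate  # parameters to plug into the matching r*() function
```

Count columns would need a different candidate list (for example "Poisson" or "negative binomial"), and close families such as gamma and lognormal can be hard to tell apart with 100 rows.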

If it were all normal, the task of generating similar data would be pretty easy using something like this:

Code:
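## Draw n rows from a multivariate normal with mean vector mu and covariance sigma
mvrnormR <- function(n, mu, sigma) {
    ncols <- ncol(sigma)
    mu <- rep(mu, each = n) ## not obliged to use a matrix (recycling)
    ## chol(sigma) is upper triangular R with t(R) %*% R == sigma
    mu + matrix(rnorm(n * ncols), ncol = ncols) %*% chol(sigma)
}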
But it'd be better if we could mimic something like a uniform or Poisson distribution more closely, if that's what the data is actually shaped like.
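
One option for non-normal columns is a Gaussian-copula style approach: draw correlated normals as in `mvrnormR`, convert them to uniforms with `pnorm`, then push each uniform through the empirical quantile function of the matching real column. A minimal sketch, assuming all columns are numeric and none is constant; `simulate_like` is a hypothetical helper name, and using the Spearman matrix directly as the latent normal correlation is an approximation.

```r
simulate_like <- function(dat, n) {
  p <- ncol(dat)
  # Rank correlation is used as the target dependence structure (assumption)
  rho <- cor(dat, method = "spearman")
  # Correlated standard normals, one column per variable
  z <- matrix(rnorm(n * p), ncol = p) %*% chol(rho)
  # Each column of u is uniform on (0, 1) but keeps the dependence
  u <- pnorm(z)
  # Inverse empirical CDF (type = 1) returns only values seen in the column,
  # so binary stays binary and counts stay integer
  sim <- lapply(seq_len(p), function(j) {
    quantile(dat[[j]], probs = u[, j], type = 1, names = FALSE)
  })
  sim <- as.data.frame(sim)
  names(sim) <- names(dat)
  sim
}

set.seed(10)
dat <- data.frame(
  pois_10    = rpois(100, 10),
  binom_1_.2 = rbinom(100, 1, .2),
  runif_0_1  = runif(100),
  chisq_30   = rchisq(100, 30)
)
sim <- simulate_like(dat, 200)
head(sim)
# Difference in rank correlations between real and simulated data
round(cor(dat, method = "spearman") - cor(sim, method = "spearman"), 2)
```

Every simulated value here is one that occurs in the original column. If individual values are sensitive, the empirical quantile step could be swapped for a fitted parametric one such as `qpois` or `qgamma`.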
 

hlsmith

Not a robit
#2
Is there a reason why you can't just use your dataset saved under another name? Don't get me wrong, simulating a dataset sounds fun, but all of the slight differences would be errors. Now you run the same stats you would have using the original dataset but you have:


parameter + sampling variation + new systematic error from simulating


P.S., real soon this webpage will be able to simulate any dataset:


http://www.studysimulator.com/
 

trinker

ggplot2orBust
#3
The data is our client's data. I can't actually use their data as it would be viewable by others but it needs to be similar.
 

trinker

ggplot2orBust
#5
Yes, ideas...? Are you thinking of a jitter? How would I know how much to jitter? We wouldn't want to jitter a binary variable as much as a wide-ranging numeric variable.
 

hlsmith

Not a robit
#6
Rename the variables X1-Xk, and perhaps also add a constant. Depending on the dataset, you could then use simulation, as you tried, or partial simulation.
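
The renaming and constant-shift suggestion could look like the sketch below. It assumes a numeric data frame; `mask_data` and the shift value of 100 are hypothetical, and binary columns are left unshifted so they stay 0/1. Adding a constant leaves correlations and distributional shapes unchanged, so on its own it only disguises the column names and locations, not the individual records.

```r
mask_data <- function(dat, shift = 100) {
  out <- dat
  for (j in seq_along(out)) {
    # Leave 0/1 indicator columns alone; shift every other numeric column
    is_binary <- all(out[[j]] %in% c(0, 1))
    if (is.numeric(out[[j]]) && !is_binary) out[[j]] <- out[[j]] + shift
  }
  # Replace the real variable names with X1..Xk
  names(out) <- paste0("X", seq_along(out))
  out
}

set.seed(10)
dat <- data.frame(
  pois_10    = rpois(100, 10),
  binom_1_.2 = rbinom(100, 1, .2),
  chisq_30   = rchisq(100, 30)
)
masked <- mask_data(dat)
head(masked)
# Correlations are unaffected by the shift
all.equal(unname(cor(dat)), unname(cor(masked)))
```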