
Thread: Determine distribution and parameters

  1. #1
    trinker

    Determine distribution and parameters




    I have a question but maybe it's the wrong question so I'll state the task first...

    I want to make data that looks like the data I'm working with without actually being the data itself. So I want to maintain structure as much as possible and generate an n row data set with similar correlations between variables and distributional shapes of column variables. Let's say this is the data I have (in R):

    Code: 
    set.seed(10)
    
    dat <- data.frame(
      pois_10 = rpois(100, 10),
      binom_5_.2 = rbinom(100, 5,.2),
      binom_1_.2 = rbinom(100, 1,.2),
      runif_0_1 = runif(100),
      chisq_30 = rchisq(100, 30),
      chisq_10 = rchisq(100, 10),
      logistic_0 = rlogis(100),
      logistic_10 = rlogis(100, 10)
    )
    
    head(dat)
    
      pois_10 binom_5_.2 binom_1_.2 runif_0_1 chisq_30  chisq_10 logistic_0 logistic_10
    1      10          0          1 0.3791907 30.72111 11.386525 -0.0951142    9.329344
    2       9          2          1 0.9144744 50.45788 16.955645  3.9631567    9.288006
    3       5          1          0 0.4774175 41.51492  8.591872 -1.1165473    7.216612
    4       8          1          1 0.2141185 23.79510 15.054063  4.1400485    6.733500
    5       9          0          1 0.7683779 25.35576  8.666049 -2.5407420    9.550825
    6      10          2          1 0.9273926 49.16303  5.554433 -1.4603903   12.853794
    Is there a way to figure out the distribution and parameters of each column in order to generate a new, similar data set? I was thinking you could use the Kolmogorov-Smirnov test, compare each column against 10 or so common distributions, and select the one with the highest p-value. But I realized I'd have to know the parameters of each distribution in advance.

    Code: 
    x <- rnorm(500)
    y <- runif(500)
    
    ks.test(x, "pnorm")
    ks.test(y, "pnorm")
    ks.test(y, "punif")
    ks.test(x, "punif")
    
    ks.test(x, "pt", 4)
    yields:

    Code: 
    > ks.test(x, "pnorm")
    
            One-sample Kolmogorov-Smirnov test
    
    data:  x
    D = 0.039414, p-value = 0.419
    alternative hypothesis: two-sided
    
    > ks.test(y, "pnorm")
    
            One-sample Kolmogorov-Smirnov test
    
    data:  y
    D = 0.50147, p-value < 0.00000000000000022
    alternative hypothesis: two-sided
    
    > ks.test(y, "punif")
    
            One-sample Kolmogorov-Smirnov test
    
    data:  y
    D = 0.031638, p-value = 0.6988
    alternative hypothesis: two-sided
    
    > ks.test(x, "punif")
    
            One-sample Kolmogorov-Smirnov test
    
    data:  x
    D = 0.476, p-value < 0.00000000000000022
    alternative hypothesis: two-sided
    
    > 
    > ks.test(x, "pt", 4)
    
            One-sample Kolmogorov-Smirnov test
    
    data:  x
    D = 0.055559, p-value = 0.09129
    alternative hypothesis: two-sided
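The "parameters in advance" problem can be sidestepped by estimating them first. Below is a minimal sketch, assuming the `MASS` package is available: `fitdistr()` estimates each candidate's parameters by maximum likelihood, and the candidates are ranked by AIC rather than by a KS p-value, because the `ks.test` documentation notes that its p-values assume parameters that were fixed beforehand and not estimated from the same data. The column `x` and the list of candidate families are stand-ins chosen for illustration.

```r
library(MASS)  # provides fitdistr()

set.seed(10)
x <- rchisq(100, 10)  # stand-in for one positive, continuous column

# Candidate families; fitdistr() estimates each one's parameters by maximum likelihood
candidates <- c("normal", "lognormal", "gamma", "exponential")
fits <- lapply(candidates, function(d) fitdistr(x, d))
names(fits) <- candidates

# Lower AIC means a better trade-off between fit and number of parameters
aic <- sapply(fits, AIC)
best <- names(which.min(aic))

sort(aic)              # candidates ranked from best to worst
fits[[best]]$estimate  # estimated parameters of the best-ranked family
```

The estimated parameters can then be passed to the matching `r*` function (for example `rgamma`) to draw new values. Discrete columns would need their own candidate list, such as `"Poisson"` or `"negative binomial"`.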
    So I want to create data that looks like the data I already have, with a similar correlation matrix and similar distributions.

    If it was all normal the task of generating similar data is pretty easy using something like this:

    Code: 
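    # Draw n rows from a multivariate normal with mean vector mu and covariance sigma
    mvrnormR <- function(n, mu, sigma) {
        ncols <- ncol(sigma)
        mu <- rep(mu, each = n) ## not obliged to use a matrix (recycling)
        # chol() returns the upper-triangular R with t(R) %*% R equal to sigma
        mu + matrix(rnorm(n * ncols), ncol = ncols) %*% chol(sigma)
    }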
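As a usage sketch for the all-normal case, the mean vector and covariance matrix can be estimated from the existing data with `colMeans()` and `cov()` and passed straight to the function. The toy data frame `dat` below is made up for illustration.

```r
# Same function as above, repeated so the example runs on its own
mvrnormR <- function(n, mu, sigma) {
  ncols <- ncol(sigma)
  mu <- rep(mu, each = n)
  mu + matrix(rnorm(n * ncols), ncol = ncols) %*% chol(sigma)
}

set.seed(10)
# Toy "real" data: three roughly normal, correlated columns
dat <- data.frame(a = rnorm(200, 5, 2), b = rnorm(200))
dat$c <- dat$a + rnorm(200)

# Estimate the parameters from the data, then simulate 1000 new rows
sim <- mvrnormR(1000, colMeans(dat), cov(dat))
colnames(sim) <- names(dat)

round(cor(dat), 2)  # correlations in the original
round(cor(sim), 2)  # correlations in the simulated copy
```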
    But it'd be better if we could mimic something like a uniform or Poisson distribution more closely when that's the shape the data actually has.
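One way to keep non-normal shapes is a Gaussian copula: carry the correlation structure on the normal scale, then map each simulated column back through that column's own empirical distribution. This is a rough sketch of that idea, not a method proposed in the thread; `simulate_like` is a hypothetical helper name and the toy data frame is made up.

```r
simulate_like <- function(dat, n) {
  p <- ncol(dat)
  # 1. Turn each column into normal scores via its ranks
  z <- sapply(dat, function(col) qnorm(rank(col) / (length(col) + 1)))
  # 2. Correlation matrix of the normal scores
  sigma <- cor(z)
  # 3. Correlated standard normals, mapped to uniforms on (0, 1)
  u <- pnorm(matrix(rnorm(n * p), ncol = p) %*% chol(sigma))
  # 4. Inverse empirical CDF per column; type = 1 returns only observed values
  sim <- lapply(seq_len(p), function(j) {
    quantile(dat[[j]], probs = u[, j], type = 1, names = FALSE)
  })
  names(sim) <- names(dat)
  as.data.frame(sim)
}

set.seed(10)
x <- rpois(200, 10)
dat <- data.frame(
  count  = x,                     # Poisson-shaped
  flag   = rbinom(200, 1, 0.2),   # binary
  skewed = x + rchisq(200, 10)    # right-skewed and correlated with count
)

sim <- simulate_like(dat, 500)
round(cor(dat), 2)
round(cor(sim), 2)
```

Note that step 4 resamples the observed values themselves, which may not be acceptable for confidential data. Replacing it with a fitted parametric quantile function (for example `qpois` with an estimated `lambda`) avoids reusing real values, at the cost of having to pick a family for each column.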
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  2. The Following User Says Thank You to trinker For This Useful Post:

    GretaGarbo (09-21-2015)

  3. #2
    hlsmith

    Re: Determine distribution and parameters

    Is there a reason why you can't just use your dataset saved under another name? Don't get me wrong, simulating a dataset sounds fun, but all of the slight differences would be errors. Now you run the same stats you would have using the original dataset but you have:


    parameter + sampling variation + new systematic error from simulating


    P.S., real soon this webpage will be able to simulate any dataset:


    http://www.studysimulator.com/

  4. The Following User Says Thank You to hlsmith For This Useful Post:

    trinker (09-21-2015)

  5. #3
    trinker

    Re: Determine distribution and parameters

    The data is our client's. I can't use it directly because it would be viewable by others, but whatever I use needs to look similar.

  6. #4
    Dason

    Re: Determine distribution and parameters

    Can you scrub any identifiers and then just add some noise to the actual data?
    I don't have emotions and sometimes that makes me very sad.

  7. #5
    trinker

    Re: Determine distribution and parameters

    Yes, any ideas? Are you thinking of a jitter? How would I know how much to jitter? We wouldn't want to jitter a binary variable as much as a wide-ranging numeric variable.
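One possible answer to the "how much" question is to scale the noise to each column's own spread and to treat binary columns separately. The sketch below is only an illustration: `add_noise` is a hypothetical helper, and the defaults `frac = 0.1` (noise SD as a fraction of the column SD) and `flip = 0.05` (probability of flipping a binary value) are arbitrary assumptions. Adding noise like this does not by itself guarantee that the original records cannot be recovered.

```r
add_noise <- function(dat, frac = 0.1, flip = 0.05) {
  out <- dat
  for (nm in names(dat)) {
    col <- dat[[nm]]
    n <- length(col)
    if (all(col %in% c(0, 1))) {
      # Binary column: flip each value with a small probability
      flips <- runif(n) < flip
      out[[nm]] <- ifelse(flips, 1 - col, col)
    } else {
      # Numeric column: noise proportional to the column's standard deviation
      noisy <- col + rnorm(n, sd = frac * sd(col))
      # Keep integer-valued columns integer-valued
      if (all(col == round(col))) noisy <- round(noisy)
      out[[nm]] <- noisy
    }
  }
  out
}

set.seed(10)
dat <- data.frame(
  flag  = rbinom(100, 1, 0.2),
  count = rpois(100, 10),
  wide  = rnorm(100, 1000, 250)
)

noisy <- add_noise(dat)
head(noisy)
```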

  8. #6
    hlsmith

    Re: Determine distribution and parameters


    Rename the variables X1-Xk, and perhaps also add a constant. Depending on the dataset, you could use simulation, as you tried, or partial simulation.
