Generate Data for Correlated Continuous and Discrete Variables


I was wondering if anybody could suggest an appropriate approach for the following...

I have one independent variable, X, which takes the values 0, 12.5 and 25. I have two dependent variables, one normal and one binary. The mean of the normal variable is a function of X, and the parameter "p" of the Bernoulli variable is also a function of X.

I want to generate data for the normal variable (by specifying a different mean and s.d for each X) and the binary variable (by specifying a different p for each X), and also have the ability to specify the correlation between them.

I have already done this for the normal and binary variables separately, assuming independence (see below).

I generated 1000 data points for each of the 2 dependent variables for each value of X such that I have

1000 data points from N(0.01, 1)
1000 data points from Bernoulli(1, 0.01)
1000 data points from N(0.201, 1)
1000 data points from Bernoulli(1, 0.201)
1000 data points from N(0.401, 1)
1000 data points from Bernoulli(1, 0.401)

I then collated the data together so effectively I have a table of data constructed of 2 columns where column 1 is the normal data and column 2 is the binary data.
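For reference, the independent setup described above could be sketched roughly as follows in Python with NumPy. This is only an illustration, not the code I actually ran: the pairing of each parameter set with a particular X value, the seed, and the extra X column are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# X value -> (normal mean, normal sd, Bernoulli p).
# Assumption: the three parameter sets listed above map to X = 0, 12.5, 25 in order.
params = {
    0.0: (0.01, 1.0, 0.01),
    12.5: (0.201, 1.0, 0.201),
    25.0: (0.401, 1.0, 0.401),
}
n = 1000  # data points per X value

blocks = []
for x, (mu, sd, p) in params.items():
    normal = rng.normal(mu, sd, size=n)   # continuous response
    binary = rng.binomial(1, p, size=n)   # Bernoulli(p), drawn independently
    blocks.append(np.column_stack([np.full(n, x), normal, binary]))

# Collated table with columns: X, normal, binary
data = np.vstack(blocks)
```

Because the two columns are drawn separately, their correlation within each X group is zero apart from sampling noise.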

But what I really want to do is generate data as above while allowing for dependence between the normal and binary variables. That is, I want to be able to specify the correlation between Column 1 and Column 2 and simulate data based on it, together with the means and "p" parameters I have specified above.

Thanks for any replies.


in my case, if i want to generate data that follows some sort of dependency i can manipulate, i always use copulas. i haven't really given it too much thought, but my first approach would probably be to build an elliptical bivariate copula with a normal distribution as one marginal and a binomial distribution as the other. R can handle these quite nicely, but i know MATLAB is also pretty powerful so i'm sure there's a way to work with copulas there as well.

Thanks for your reply spunky.

Copulas are actually something I have been playing around with, and I have built a Gaussian copula.

I initially tried creating two marginals, one normal and one Bernoulli. However, I've noticed a problem. I specify a high dependency (rank correlation of about 0.8, which translates into a linear correlation rho parameter for the copula of about 0.98) and generate my marginals according to the following parameters:
Normal: N(0,1)
Bernoulli: p = 0.1
giving me two marginals (say 1000 values in each). The 1's in the Bernoulli marginal tend to sit alongside the higher values of the Normal marginal (as I'd expect from the correlation specified). However, when I calculate the bi-serial correlation it doesn't come anywhere near my original 0.8.
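A minimal Python sketch of this kind of construction is below, assuming NumPy and SciPy. The function name `sample_normal_bernoulli` is made up for illustration and the seed and sample size are arbitrary; it is not the code used for the numbers in this post. It draws from a bivariate Gaussian copula and pushes the two uniforms through the normal and Bernoulli inverse CDFs.

```python
import numpy as np
from scipy.stats import norm

def sample_normal_bernoulli(n, mu, sd, p, rho, rng):
    """Hypothetical helper: Gaussian copula with N(mu, sd) and Bernoulli(p) marginals."""
    cov = [[1.0, rho], [rho, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # latent bivariate normal
    u = norm.cdf(z)                                       # uniforms carrying the copula
    y = norm.ppf(u[:, 0], loc=mu, scale=sd)               # normal marginal
    b = (u[:, 1] > 1.0 - p).astype(int)                   # Bernoulli inverse CDF: 1 with prob. p
    return y, b

rng = np.random.default_rng(1)  # arbitrary seed
y, b = sample_normal_bernoulli(100_000, mu=0.0, sd=1.0, p=0.1, rho=0.98, rng=rng)

r = np.corrcoef(y, b)[0, 1]  # Pearson correlation between the two columns
print(f"share of 1's: {b.mean():.3f}, correlation: {r:.3f}")
```

With p = 0.1 the printed correlation should land well below the copula parameter of 0.98, which is the behaviour described above.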

And I think this probably seems reasonable. To illustrate, say I generate 10 values for each marginal rather than 1000:
-0.6358 0
-0.2817 0
-0.2901 0
-0.1780 0
1.0907 0
-0.6585 0
-2.3632 0
-1.0660 0
1.9706 1
-0.1151 0
Then if I calculate the bi-serial correlation between column 1 and column 2, I get 0.63. But I think this is simply because I'm trying to specify such a high correlation while only allowing the probability of a 1 in column 2 to be 0.1.

It's a similar situation if I generate 2 normal variables and dichotomise one of them. I'm "losing" some of the correlation. I can increase the correlation by increasing the probability of getting a 1 in the dichotomised variable, but I don’t really want to do that.
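For the dichotomised case the size of the loss can be written down. This is a sketch under stated assumptions: (Z_1, Z_2) is standard bivariate normal with correlation \rho, the binary variable is B = 1 when Z_2 > c, and c is chosen so that P(B = 1) = p. The symbols c, \phi and \Phi are introduced here for illustration only.

```latex
c = \Phi^{-1}(1 - p), \qquad
\operatorname{Cov}(Z_1, B) = \rho\,\mathbb{E}\!\left[Z_2\,\mathbf{1}\{Z_2 > c\}\right] = \rho\,\phi(c),
\qquad
\operatorname{Corr}(Z_1, B) = \rho\,\frac{\phi(c)}{\sqrt{p\,(1 - p)}}
```

Here \phi and \Phi are the standard normal density and CDF. For p = 0.1, c is about 1.2816 and \phi(c) about 0.1755, so the factor multiplying \rho is about 0.585; even with \rho = 0.98 the Pearson correlation comes out near 0.57. For p = 0.5 the factor is about 0.798, which matches the observation that raising p recovers more of the correlation.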

Does this make sense, that I can’t expect to retrieve my specified correlation (0.8) if I’m only allowing p = 0.1 in the Bernoulli marginal? Is this just something I have to accept?


well, upon re-thinking it, this could be because you're not using the right copula. gaussian copulas are better suited to continuous distributions on their marginals. the probability-integral transformation does not work as nicely with discrete distributions, so the issue may lie right there. perhaps try a copula from the archimedean family? in any case, i'm not all that familiar with the specific problem you're trying to address, so i shouldn't be throwing advice around at random.

if you'd like to seek further guidance, get ahold of "Multivariate Models and Dependence Concepts" by Harry Joe. he is one of the leading authorities on copulas and copula-based modelling, so i'm sure you'll be able to find an answer there.