Generate Grouped Data with Exact Correlations, Means, and SD

I want to generate data with the following constraints. There is one factor and two numeric variables. Note that the desired mean structure is broken down by factor, which is what makes this a challenging problem.

The correlation between the two variables is set (e.g., .50).

The mean and SD for each of the two groups are set (e.g., group 1 has means of 2 and 4 on the two variables, and group 2 has means of 4 and 6). The SD can be set at 1 for all variables.

The code below provides a reproducible example of what doesn't work. Does anyone know how to get the desired structure?

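library(MASS) # provides mvrnorm()

Sigma <- matrix(c(1, .5,
                  .5, 1), ncol = 2)

# mvrnorm() returns a matrix, so convert to a data frame before adding columns
out1 <- as.data.frame(mvrnorm(40, mu = c(2, 4), Sigma = Sigma,
                              empirical = TRUE)) # Generates data for first group
out1$ivbg <- 1 # identifies group 1

out2 <- as.data.frame(mvrnorm(40, mu = c(4, 6), Sigma = Sigma,
                              empirical = TRUE)) # Generates data for second group
out2$ivbg <- 2 # identifies group 2

merged <- rbind(out1, out2) # Put them together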

cor(out1$V1,out1$V2) #Group 1 correlations = .5
sd(out1$V1) #Group 1 SD = 1
sd(out1$V2) #Group 1 SD = 1

cor(out2$V1,out2$V2) #Group 2 correlations = .5
sd(out2$V1) #Group 2 SD = 1
sd(out2$V2) #Group 2 SD = 1

cor(merged$V1,merged$V2) #Merged Correlation = .75
sd(merged$V1) #Merged sd = 1.414
sd(merged$V2) #Merged sd = 1.414
The mean structure is correct, but my goal is a merged dataset that also retains the variance/covariance structure (i.e., r = .5, sd = 1.0).


Can't make spagetti
What you're dealing with is a Gaussian mixture model. In your particular example, you have a mixing probability of 0.5. What you need to figure out is the mean, variance and covariance of the Gaussian mixture, given the mixing proportion you are working with.

For example, take the easiest case. For a 2-component Gaussian mixture with mixing probability p, the population mean is:

[tex]\mu_{12} = p\mu_1 + (1-p)\mu_2[/tex]

You can see this right away from your example. Again, because your code implies that p=0.5 you can do:

> 0.5*mean(out1$V1)+(1-0.5)*mean(out2$V1)
[1] 3
> mean(merged$V1)
[1] 3
The formula for the variance of a mixture is a little bit trickier. I don't remember where I got it from but I have it in my notes as:

[tex]\sigma_{12}^2 = p\sigma_1^2 + (1-p)\sigma_2^2 + [p\mu_1^2 + (1-p)\mu_2^2 - (p\mu_1 + (1-p)\mu_2)^2][/tex]

And you can see it works:

> var(merged$V1)
[1] 2
> .5*var(out1$V1)+.5*var(out2$V1)+.5*mean(out1$V1)^2+0.5*mean(out2$V1)^2-(.5*mean(out1$V1)+.5*mean(out2$V1))^2
[1] 2
The covariance/correlation seems ugly to derive but I guess if you spend some time on it you can get a closed-form expression that may help you out.
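
For what it's worth, the covariance should follow the same pattern as the variance (it is the law of total covariance): a weighted within-group part plus a between-group part driven by the mean difference, i.e. [tex]\Sigma_{12} = p\Sigma_1 + (1-p)\Sigma_2 + p(1-p)(\mu_1-\mu_2)(\mu_1-\mu_2)^T[/tex]. Here is a rough sketch in base R that plugs in the population values from the question. The helper name mixture_cov is made up for illustration, and because these are population moments, cov() on a finite merged sample can differ slightly (n - 1 denominators).

```r
# Population covariance matrix of a two-component mixture:
# weighted within-group covariance + between-group term.
# mixture_cov is a hypothetical helper name, not from any package.
mixture_cov <- function(p, mu1, mu2, Sigma1, Sigma2) {
  d <- mu1 - mu2                      # difference between group mean vectors
  p * Sigma1 + (1 - p) * Sigma2 + p * (1 - p) * tcrossprod(d)
}

# Values from the question: equal groups, sd = 1 and r = .5 within each group
Sigma_w <- matrix(c(1, .5,
                    .5, 1), ncol = 2)
S <- mixture_cov(0.5, mu1 = c(2, 4), mu2 = c(4, 6), Sigma_w, Sigma_w)

S           # variances 2, covariance 1.5
cov2cor(S)  # off-diagonal 0.75, in line with the merged correlation above
```

The diagonal of S reproduces the variance of 2 from the check above, and the off-diagonal term is where the inflated merged correlation comes from.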

If your mixing proportion is always gonna be p=0.5, then I'm sure some things become easier than when you have different proportions.
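
One thing that does get easier with p=0.5: the between-group part of the mixture covariance is just [tex](\mu_1-\mu_2)(\mu_1-\mu_2)^T/4[/tex], so you can work backwards from the overall covariance you want to the within-group covariance you would have to simulate from. Below is a rough sketch in base R, assuming both groups share the same within-group covariance and working with population moments only. The helper names and the second set of means (3 and 5) are hypothetical, chosen just to show a case that works.

```r
# Within-group covariance needed so that an equal-weight (p = 0.5) mixture
# has the target overall covariance. Assumes both groups share it.
# Both helper names are hypothetical.
within_cov_needed <- function(target, mu1, mu2) {
  d <- mu1 - mu2
  target - 0.25 * tcrossprod(d)       # subtract the between-group term
}

# TRUE if a symmetric matrix is positive semi-definite,
# i.e. usable as a covariance matrix
is_valid_cov <- function(S, tol = 1e-8) {
  all(eigen(S, symmetric = TRUE, only.values = TRUE)$values >= -tol)
}

target <- matrix(c(1, .5,
                   .5, 1), ncol = 2)  # overall sd = 1, r = .5

# Means from the question: group means differ by 2 on each variable
W_orig <- within_cov_needed(target, c(2, 4), c(4, 6))
is_valid_cov(W_orig)  # FALSE

# Hypothetical smaller gap: group means differ by 1 on each variable
W_hyp <- within_cov_needed(target, c(2, 4), c(3, 5))
is_valid_cov(W_hyp)   # TRUE
```

If this sketch is right, the exact numbers in the question cannot all hold at once: a mean gap of 2 already contributes a between-group variance of 1, so an overall sd of 1 would leave no within-group variance at all. With a smaller gap (or a larger overall sd), W_hyp could be passed as Sigma to mvrnorm for each group.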