# Thread: Fill in multiple missing values given correlation matrix

1. ## Fill in multiple missing values given correlation matrix

I have a correlation matrix:

Code:
mat <- round(cor(mtcars), 2)

       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
I am given values for 3 of the 11 variables. Is there a way to predict the 8 missing values? If so, how?

Code:
vals <- setNames(c(NA, 8, 170, NA, 3.5, rep(NA, 6)), colnames(mtcars))

  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   NA   8.0 170.0    NA   3.5    NA    NA    NA    NA    NA    NA
I assume so, because you can fit a regression from a correlation matrix alone.

EDIT: Per @Vinux's comment, I also know the column means and column SDs:

Code:
means <- round(colMeans(mtcars), 2)
sds <- round(apply(mtcars, 2, sd), 2)

2. ## Re: Fill in multiple missing values given correlation matrix

Are you assuming all the variables are standardised? If yes, you can think of 8 regression equations to estimate the missing values.

4. ## Re: Fill in multiple missing values given correlation matrix

Standardization is not guaranteed. Why is that a requirement?

5. ## Re: Fill in multiple missing values given correlation matrix

PS they are on the same scale already so standardization may not be necessary.

6. ## Re: Fill in multiple missing values given correlation matrix

Originally Posted by trinker
Standardization is not guaranteed. Why is that a requirement?
If you don't know their means (expectations) and standard deviations, it is not possible to estimate the missing data.

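The correlation of X and Y is the same as the correlation of aX + b and cY + d when a and c are nonzero with the same sign (if their signs differ, only the sign of the correlation flips). Correlation is free from location and scale changes, so the correlation matrix alone carries no information about where each variable is centered or how spread out it is. This is what causes the problem in missing data estimation.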
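A quick check of this invariance, as a minimal sketch on `mtcars` (the constants used for a, b, c, and d are arbitrary and chosen only for illustration):

```r
# Correlation is unchanged by shifting and positively rescaling either variable.
x <- mtcars$wt
y <- mtcars$mpg

r_original <- cor(x, y)
r_rescaled <- cor(10 * x + 3, 0.5 * y - 7)   # a = 10, c = 0.5: same sign
r_flipped  <- cor(-2 * x + 1, 0.5 * y - 7)   # a < 0, c > 0: only the sign flips

c(original = r_original, rescaled = r_rescaled, flipped = r_flipped)
```

Because all three versions share the same correlation magnitude, the correlation alone cannot say whether a predicted value should be near 20 or near 2000.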

7. ## Re: Fill in multiple missing values given correlation matrix

I agree with vinux. I was thinking the same thing. In particular, look at what is needed if you are simulating data from a correlation matrix. Without some indication of where these data are centered and how dispersed they are, the number of possible values is unlimited. It's not like a Sudoku puzzle, where the answers are finite given the constraints.

Follow-up questions: why do you have three values? Are they random values? Do you know the ranges for the variables if they are all on the same scale?

8. ## Re: Fill in multiple missing values given correlation matrix

@vinux I know the column means and sd for the correlation matrix as well as the N. I updated the original question.

9. ## Re: Fill in multiple missing values given correlation matrix

@hlsmith. New participants can select at most 3 variables. The ratings range from 0 to 5; it's a Likert scale. Basically, select three things you find important and rate them (out of 13 things that could be selected).

Also, I thought about your simulation comment. I have done this using something like this: http://blog.revolutionanalytics.com/...ta_with_r.html Pretty easy. I thought about using this to regenerate the data and get the percentages of the other variables given the three known. But it seems that if there's a way to use regression, it may be more efficient and better.
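For comparison, here is a rough sketch of that simulation route. It is not the code from the linked post. It assumes the variables are multivariate normal, uses the unrounded `mtcars` statistics as a stand-in for the published ones, and the seed and sample size are arbitrary:

```r
set.seed(42)                                  # arbitrary seed, for reproducibility
vars  <- c("mpg", "cyl", "disp", "drat")
means <- colMeans(mtcars[, vars])
Sigma <- cov(mtcars[, vars])                  # same as diag(sds) %*% cor %*% diag(sds)

# Draw n multivariate-normal rows: Z %*% chol(Sigma) has covariance Sigma
n   <- 1e5
Z   <- matrix(rnorm(n * length(vars)), nrow = n)
sim <- as.data.frame(sweep(Z %*% chol(Sigma), 2, means, "+"))
names(sim) <- vars

# Regress on the simulated data and predict mpg at the three known values
fit_sim <- lm(mpg ~ cyl + disp + drat, data = sim)
mpg_sim <- predict(fit_sim, data.frame(cyl = 8, disp = 170, drat = 3.5))
mpg_sim
```

The simulated answer only approximates what the regression formulas give directly from the means, SDs, and correlations, which is why the direct route is usually preferable.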

10. ## Re: Fill in multiple missing values given correlation matrix

Originally Posted by trinker
@vinux I know the column means and sd for the correlation matrix as well as the N. I updated the original question.
With the means, SDs, and correlation matrix you can do regressions (for simple regression, $\hat{y} = \bar{y} + r_{xy}\,\frac{s_y}{s_x}\,(x - \bar{x})$). I am guessing the proportion of records with missing data is small (the rule of thumb is less than 5%). If the missing percentage is high, regression-based estimation of the missing data may produce a collinear structure.
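As a minimal sketch of the simple-regression case, the slope is the correlation times the ratio of SDs, and the intercept follows from the two means. The choice of `wt` as predictor and the value `wt = 3` are only for illustration:

```r
# Simple regression of mpg on wt, rebuilt from summary statistics only.
r   <- cor(mtcars$wt, mtcars$mpg)
m_x <- mean(mtcars$wt);  s_x <- sd(mtcars$wt)
m_y <- mean(mtcars$mpg); s_y <- sd(mtcars$mpg)

slope     <- r * s_y / s_x
intercept <- m_y - slope * m_x

# Predicted mpg for a hypothetical car with wt = 3
mpg_hat <- intercept + slope * 3

# Cross-check against lm() fitted on the raw data
fit <- lm(mpg ~ wt, data = mtcars)
all.equal(unname(coef(fit)), c(intercept, slope))
```

With rounded published statistics the coefficients would differ slightly from the `lm()` fit.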

You should also check the type of missing data (see https://en.wikipedia.org/wiki/Missing_data). The regression method assumes the data are MAR (missing at random). You could also replace the missing data with the respective means if you assume MCAR (missing completely at random).
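The mean-replacement option is a one-liner. A small sketch using the `vals` and `means` objects from the original question:

```r
means <- round(colMeans(mtcars), 2)
vals  <- setNames(c(NA, 8, 170, NA, 3.5, rep(NA, 6)), colnames(mtcars))

# Mean imputation: every missing entry gets its column mean,
# regardless of the three observed values.
filled <- vals
filled[is.na(vals)] <- means[is.na(vals)]
filled
```

Note that this ignores the correlation matrix entirely, so every new participant would get the same eight filled-in values.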

12. ## Re: Fill in multiple missing values given correlation matrix

In general, this seems like an approachable problem, though I am a little confused. So there are 13 questions, but you only have data for three of them for each person, and now you want to create data for the other 10 variables? I am probably misinterpreting something, but if so, please describe the background scenario or issue: why the data are missing, and how much.

vinux, good point about the amount of missing and possible collinear structure!

14. ## Re: Fill in multiple missing values given correlation matrix

@hlsmith. Yes and no. There is a correlation table, with column SDs and means, reported in the company's tech manual. For that data set they had info from 300000 participants for all 13 questions. From that point on, the company only required (in fact capped) answers to 3 questions, to focus the participant's attention on self-improvement in just those three constructs. It'd be like taking a self-improvement questionnaire with 13 areas of improvement. It's not likely you have the time or energy to improve in all 13 areas, so we target your attention on just three. Still, the company would like to serve back information about the 10 unrated items based on past participants' behavior, i.e., based on your ratings for three (top three) items and on other people who have done this assessment, we think you'd have rated the other items this way. So in the original data set, from which the correlation table was made, there were 300000 participants with almost no missing values. We want to generalize what we learned from that original data to new participants without giving them the full assessment.

15. ## Re: Fill in multiple missing values given correlation matrix

So what is your current n-value?

And when you say 3, is it the same three items collected for each person or does it vary? If it varies, why?

So is there historic data for some people, but not others? Or no historic data, repeat measures?

16. ## Re: Fill in multiple missing values given correlation matrix

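Here is an example of regression-based missing data estimation using the mean, SD, and correlation matrix information (with cyl, disp, and drat given).

Let $y$ = mpg and $x = (\text{cyl}, \text{disp}, \text{drat})^\top$, with means $\mu_y$ and $\mu_x$ and covariance blocks $\Sigma_{xx}$ (3 × 3) and $\Sigma_{xy}$ (3 × 1).

Let the model be $y = \beta_0 + \beta^\top x + \varepsilon$, with $\beta = \Sigma_{xx}^{-1}\Sigma_{xy}$ and $\beta_0 = \mu_y - \beta^\top \mu_x$.

For example: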

Code:
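## Assuming the means, sds, and cor matrix values are population values
vars  <- c("mpg", "cyl", "disp", "drat")      # columns 1, 2, 3, 5 of mtcars
mat   <- round(cor(mtcars[, vars]), 2)
means <- round(colMeans(mtcars[, vars]), 2)
sds   <- round(apply(mtcars[, vars], 2, sd), 2)

## Converting to a covariance matrix: Cov = D %*% Cor %*% D, with D = diag(sds)
CovM <- diag(sds) %*% mat %*% diag(sds)

## Regression coefficients of mpg (row/col 1) on cyl, disp, drat (rows/cols 2:4)
beta  <- CovM[2:4, 1] %*% solve(CovM[2:4, 2:4])   # 1 x 3 slope vector
beta0 <- means[1] - drop(beta %*% means[2:4])     # intercept

## Predicted mpg for cyl = 8, disp = 170, drat = 3.5
mpg.est <- unname(beta0 + sum(beta * c(8.0, 170.0, 3.5)))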
For more details https://en.wikipedia.org/wiki/Multiv..._distributions

PS: I wanted to write this in a more descriptive way. But too lazy.
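The same calculation extends to all eight missing variables at once through the conditional mean of a multivariate normal, mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o). The following is a sketch under the assumption that unrounded `mtcars` statistics stand in for the published means, SDs, and correlations; the names `obs`, `B`, and `est` are illustrative only:

```r
# Summary statistics as they might appear in a tech manual
mat   <- cor(mtcars)
means <- colMeans(mtcars)
sds   <- apply(mtcars, 2, sd)
Sigma <- diag(sds) %*% mat %*% diag(sds)
dimnames(Sigma) <- dimnames(mat)

vals <- setNames(c(NA, 8, 170, NA, 3.5, rep(NA, 6)), colnames(mtcars))
obs  <- !is.na(vals)                         # TRUE for cyl, disp, drat

# Conditional mean of the missing block given the observed block
B   <- Sigma[!obs, obs] %*% solve(Sigma[obs, obs])   # 8 x 3 coefficient matrix
est <- means[!obs] + drop(B %*% (vals[obs] - means[obs]))
round(est, 2)
```

Each row of `B` holds the slopes of one regression on cyl, disp, and drat, so this is the "8 regression equations" idea done in a single matrix step. With rounded inputs the estimates would shift slightly.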

18. ## Re: Fill in multiple missing values given correlation matrix

trinker,

Just a side note: keep in mind that modern approaches to missing data use multiple imputation. You all appear to be doing what I will call pseudo-scoring of the data. Multiple imputation imputes multiple values for each missing datum to account for its uncertainty. Without multiple imputation you are only accounting for the variability between observations, not the uncertainty in the imputed values themselves. This difference means inferential statistics may be at risk of type I errors.
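To illustrate the difference, here is a simplified sketch that draws several plausible completions from the conditional normal distribution instead of reporting only its mean. It is not full multiple imputation: it treats the means and covariances as known, assumes multivariate normality, and uses unrounded `mtcars` statistics; `m` and the seed are arbitrary:

```r
set.seed(1)                                   # arbitrary seed
means <- colMeans(mtcars)
Sigma <- cov(mtcars)
vals  <- setNames(c(NA, 8, 170, NA, 3.5, rep(NA, 6)), colnames(mtcars))
obs   <- !is.na(vals)

# Conditional mean and covariance of the missing block given the observed one
B         <- Sigma[!obs, obs] %*% solve(Sigma[obs, obs])
cond_mean <- means[!obs] + drop(B %*% (vals[obs] - means[obs]))
cond_cov  <- Sigma[!obs, !obs] - B %*% Sigma[obs, !obs]

# m plausible completions instead of one point estimate
m     <- 5
Z     <- matrix(rnorm(m * sum(!obs)), nrow = m)
draws <- sweep(Z %*% chol(cond_cov), 2, cond_mean, "+")
round(draws, 2)
```

The spread across the rows of `draws` reflects how much the three known values leave undetermined, which a single regression prediction hides.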

19. ## Re: Fill in multiple missing values given correlation matrix

I think the approach vinux is describing is called regression imputation, a form of single imputation.