Fill in multiple missing values given correlation matrix

trinker

ggplot2orBust
#1
I have a correlation matrix:

Code:
mat <- round(cor(mtcars), 2)

       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
I am given values for 3 of the 11 variables. Is there a way to predict the 8 missing values? If so, how?

Code:
vals <- setNames(c(NA, 8, 170, NA, 3.5, rep(NA, 6)), colnames(mtcars))


mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
 NA   8.0 170.0    NA   3.5    NA    NA    NA    NA    NA    NA
I assume so, because you can do regression from a correlation matrix.

EDIT: Per @vinux's comment, I also know the column means and column standard deviations:

Code:
means <- round(colMeans(mtcars), 2)
sds <- round(apply(mtcars, 2, sd), 2)
 

vinux

Dark Knight
#2
Are you assuming all the variables are standardised? If yes, you can think of 8 regression equations to estimate the missing values.
 

vinux

Dark Knight
#5
Quote:
Standardized is not guaranteed. Why is that a requisite?

If you don't know their means (expectations) and standard deviations, it is not possible to estimate the missing data.

The correlation of X and Y is the same as the correlation of aX + b and cY + d (where a and c are nonzero and have the same sign; opposite signs only flip the sign of the correlation). Correlation is invariant to changes of location and scale, so a correlation matrix alone cannot tell you what scale the missing values are on.
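A quick check of that invariance in R, as a minimal sketch using two mtcars columns (the constants 2.5, 10, 0.1, and -3 are arbitrary choices for illustration):

```r
# Correlation is unchanged by linear rescaling with same-sign slopes.
x <- mtcars$disp
y <- mtcars$mpg

r_raw      <- cor(x, y)
r_rescaled <- cor(2.5 * x + 10, 0.1 * y - 3)    # a = 2.5, c = 0.1
r_flipped  <- cor(-2.5 * x + 10, 0.1 * y - 3)   # a < 0 flips the sign only

all.equal(r_raw, r_rescaled)   # TRUE
all.equal(r_raw, -r_flipped)   # TRUE
```

Since very different raw scales give the same correlation, the means and standard deviations are needed to map a prediction back to the original units.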
 

hlsmith

Omega Contributor
#6
I agree with vinux. I was thinking the same thing. In particular, look at what is needed if you are simulating data from a correlation matrix. Without some information on how these data are centered and dispersed, the number of possible values is unlimited. It's not like a Sudoku puzzle, where the answers are finite given the constraints.


Follow-up questions: why do you have three values? Are they random values? Do you know the ranges for the variables, if they are all on the same scale?
 

trinker

ggplot2orBust
#8
@hlsmith. New participants can select at most 3 variables. The range is 0-5; it's a Likert scale. Basically, select three things you find important and rate them (out of 13 things that could be selected).

Also, I thought about your simulation comment. I have done this using something like this: http://blog.revolutionanalytics.com/2016/02/multivariate_data_with_r.html Pretty easy. I thought about using it to regenerate the data and get the percentages of the other variables given the three known. But it seems that if there's a way to use regression, it may be more efficient and better.
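A minimal sketch of that simulation idea, assuming MASS::mvrnorm() and using mtcars as a stand-in for the real data. The names mu, sds, R, Sigma, and sim are illustrative only. It shows that the means and standard deviations are required alongside the correlation matrix to regenerate data on the original scale:

```r
# Rebuild the covariance matrix from summary statistics only.
mu    <- colMeans(mtcars)
sds   <- apply(mtcars, 2, sd)
R     <- cor(mtcars)
Sigma <- diag(sds) %*% R %*% diag(sds)
dimnames(Sigma) <- dimnames(R)

# Simulate multivariate normal data; empirical = TRUE makes the simulated
# sample reproduce mu and Sigma exactly rather than only in expectation.
set.seed(1)
sim <- MASS::mvrnorm(n = 1000, mu = mu, Sigma = Sigma, empirical = TRUE)

# The simulated data carry the same means and correlations as the inputs.
all.equal(colMeans(sim), mu, check.attributes = FALSE)
all.equal(cor(sim), R, check.attributes = FALSE)
```

Getting predictions for one participant this way would mean keeping only simulated rows close to the three known ratings, which wastes most of the draws.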
 

vinux

Dark Knight
#9
Quote:
@vinux I know the column means and sd for the correlation matrix as well as the N. I updated the original question.

With the means, sds, and correlation matrix you can do regressions. For the simple regression [math]y = \alpha + \beta x[/math], the estimates are [math]\hat \beta = \rho \frac{\sigma_y}{\sigma_x}[/math] and [math]\hat\alpha =\bar y- \hat \beta \bar x[/math]. I assume the number of records with missing data is small (the rule of thumb is less than 5%). If the missing percentage is high, regression-based imputation may produce a collinear structure.
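The simple-regression formulas can be checked against lm() in R. This is a small sketch using mpg regressed on wt from mtcars; the variable names are illustrative:

```r
# Simple regression of mpg (y) on wt (x) from summary statistics only.
rho    <- cor(mtcars$wt, mtcars$mpg)
sd_x   <- sd(mtcars$wt)
sd_y   <- sd(mtcars$mpg)
mean_x <- mean(mtcars$wt)
mean_y <- mean(mtcars$mpg)

beta_hat  <- rho * sd_y / sd_x            # slope
alpha_hat <- mean_y - beta_hat * mean_x   # intercept

# Same coefficients as fitting the model to the raw data.
fit <- lm(mpg ~ wt, data = mtcars)
all.equal(unname(coef(fit)), c(alpha_hat, beta_hat))   # TRUE
```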

You should also check the type of missing data (see https://en.wikipedia.org/wiki/Missing_data). The regression method assumes the data are MAR (missing at random). You could also replace the missing data with the respective means if you assume MCAR (missing completely at random).
 

hlsmith

Omega Contributor
#10
In general, this seems like an approachable problem. Though, I am a little confused. So there are 13 questions, but you only have data for three of them for each person. Now you want to create data for the other 10 variables? I am probably misinterpreting something, but if so, please describe the background scenario or issue: why and how much data is missing.


vinux, good point about the amount of missing and possible collinear structure!
 

trinker

ggplot2orBust
#11
@hlsmith. Yes and no. There is a correlation table, along with column sds and means, reported in the company's tech manual. For that data set they had info from 300000 participants for all 13 questions. From that point on, the company only required (in fact capped) answers to 3 questions, to focus the participant's attention on self-improvement in just those three constructs. It'd be like taking a self-improvement questionnaire with 13 areas of improvement. It's not likely you have the time or energy to improve in all 13 areas, so we target your attention on just three. Still, the company would like to serve back information about the 10 unrated items based on past participants' behavior, i.e., based on your ratings for three (top three) items and on other people who have done this assessment, we think you'd have rated the other items this way. So in the original data set, from which the correlation table was made, there were 300000 participants with almost no missing values. We want to generalize what we learned from that original data to new participants without giving them the full assessment.
 

hlsmith

Omega Contributor
#12
So what is your current n-value?


And when you say 3, is it the same three items collected for each person or does it vary? If it varies, why?


So is there historic data for some people, but not others? Or no historic data, repeat measures?
 

vinux

Dark Knight
#13
Here is an example of regression based missing data estimation using mean, sd, and correlation matrix information (with given cyl, disp, and drat)


Let [math] x = cyl, y = disp, z= drat [/math]

Let the model be [math] m_i= b_0 + b_1 x_i + b_2 y_i + b_3 z_i + e_i[/math]

For example, [math] m= mpg[/math].

Code:
## Assuming the means, sds, and cor matrix values are population values
vars  <- c("mpg", "cyl", "disp", "drat")   # columns 1, 2, 3, 5 of mtcars
mat   <- round(cor(mtcars[, vars]), 2)
means <- round(colMeans(mtcars[, vars]), 2)
sds   <- round(apply(mtcars[, vars], 2, sd), 2)

## Convert the correlation matrix to a covariance matrix
CovM <- diag(sds) %*% mat %*% diag(sds)

## Regression coefficients: Cov(mpg, predictors) %*% Cov(predictors)^-1
beta  <- CovM[2:4, 1] %*% solve(CovM[2:4, 2:4])
beta0 <- means[1] - beta %*% means[2:4]

## Predicted mpg for cyl = 8, disp = 170, drat = 3.5
mpg.est <- drop(beta0 + beta %*% c(8.0, 170.0, 3.5))
For more details, see https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions
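The same conditional-distribution formula can give all 8 missing values at once. The following is a sketch under the assumption that the variables are roughly multivariate normal, which is questionable for discrete columns such as vs, am, gear, and carb. It uses unrounded mtcars statistics, and the names known, k, u, B, cond_mu, and cond_cov are illustrative:

```r
# Summary statistics (a tech manual would report rounded versions of these).
mu    <- colMeans(mtcars)
sds   <- apply(mtcars, 2, sd)
R     <- cor(mtcars)
Sigma <- diag(sds) %*% R %*% diag(sds)
dimnames(Sigma) <- dimnames(R)

known <- c(cyl = 8, disp = 170, drat = 3.5)
k     <- names(known)                   # observed variables
u     <- setdiff(colnames(mtcars), k)   # the 8 unobserved variables

# E[u | k] = mu_u + Sigma_uk %*% Sigma_kk^-1 %*% (x_k - mu_k)
B       <- Sigma[u, k] %*% solve(Sigma[k, k])
cond_mu <- drop(mu[u] + B %*% (known - mu[k]))

# Cov[u | k] = Sigma_uu - Sigma_uk %*% Sigma_kk^-1 %*% Sigma_ku
cond_cov <- Sigma[u, u] - B %*% Sigma[k, u]

round(cond_mu, 2)               # predicted values for the 8 missing variables
round(sqrt(diag(cond_cov)), 2)  # their conditional standard deviations
```

Each element of cond_mu matches what a separate linear regression of that variable on cyl, disp, and drat would predict, so this is the 8-regressions idea written as one matrix expression.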


PS: I wanted to write this in a more descriptive way. But too lazy.
 

hlsmith

Omega Contributor
#14
trinker,


Just a side note: keep in mind that modern approaches to missing data use multiple imputation. You all appear to be doing what I will call a pseudo-scoring of data. Multiple imputation imputes several values for each missing datum to account for its uncertainty. Without multiple imputation, you are only accounting for the variability between observations and not for the uncertainty in those individual imputed values. This difference means inferential statistics may be at risk of type I errors.
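One rough way to carry that uncertainty along is to draw several plausible values from the conditional distribution instead of reporting a single prediction. The sketch below is not full multiple imputation: it assumes normality and treats the means and covariances as known, which ignores parameter uncertainty. The names w, cond_mean, cond_var, draws, and interval are illustrative:

```r
vars  <- c("mpg", "cyl", "disp", "drat")
mu    <- colMeans(mtcars[, vars])
Sigma <- cov(mtcars[, vars])   # same as diag(sds) %*% cor %*% diag(sds)

known <- c(cyl = 8, disp = 170, drat = 3.5)
k     <- names(known)

# Regression weights, conditional mean, and conditional variance of mpg.
w         <- solve(Sigma[k, k], Sigma[k, "mpg"])
cond_mean <- mu[["mpg"]] + sum(w * (known - mu[k]))
cond_var  <- Sigma["mpg", "mpg"] - sum(w * Sigma[k, "mpg"])

# Five plausible mpg values rather than one point prediction.
set.seed(42)
draws <- rnorm(5, mean = cond_mean, sd = sqrt(cond_var))

# Approximate 95% prediction interval for the missing value.
interval <- cond_mean + c(-1, 1) * qnorm(0.975) * sqrt(cond_var)
```

The spread of draws, or the width of interval, shows how much is really known about the unrated item given only three ratings.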
 

hlsmith

Omega Contributor
#16
I thought about this for a while. If you have a prior sample with all items completed, and now you have a new sample that only completed 3 items, are you trying to impute the other 10 questions? Did you pick the 3 items that the new sample completed, or did they?

Were the original questions in a particular order? And how do the 3 current items fit into that ordering?

I was thinking that if you did go forward you may want to make your 95% CI wider to account for not using MI (perhaps robust somehow).


Also a BIG question: do you have the current raw data? Do the items seem to have the same correlations with each other in the new sample? You may have to think about whether your other correlations are partial or not. This would possibly get at whether the mechanism of asking people still functions the same way now that the instrument is shorter.
 

hlsmith

Omega Contributor
#17
I didn't click on Jake's links, but I wonder if this process is typically used out of sample. Perhaps it would be more appropriate to call this a new sample that you are scoring?