I have a correlation matrix:
Code:
mat <- round(cor(mtcars), 2)
mat
       mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
I am given 3 values for 3 of the 11 variables:
Code:
vals <- setNames(c(NA, 8, 170, NA, 3.5, rep(NA, 6)), colnames(mtcars))
vals
 mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  NA   8.0 170.0    NA   3.5    NA    NA    NA    NA    NA    NA
Can/is there a way to predict the 8 missing values? If so, how? I assume so, b/c you can do regression from a correlation matrix.
EDIT: Per @Vinux's comment I also know the column means, column sd:
Code:
means <- round(colMeans(mtcars), 2)
sds <- round(apply(mtcars, 2, sd), 2)
"If you torture the data long enough it will eventually confess."
-Ronald Harry Coase -
Are you assuming all the variables are standardised? If yes, you can think of 8 regression equations to estimate the missing values.
In the long run, we're all dead.
trinker (04-26-2016)
Standardization is not guaranteed. Why is that a requisite?
PS they are on the same scale already so standardization may not be necessary.
If you don't know their means (expectations) and standard deviations, it is not possible to do the missing data estimation.
The correlation of X and Y is the same as the correlation of aX + b and cY + d (where a and c are nonzero and have the same sign; if their signs differ, only the sign of the correlation flips). Correlation is free from location and scale changes, so on its own it cannot pin down the missing values.
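A small sketch of that invariance, using two mtcars columns purely as illustration (the constants 2, 10, 0.5, and -3 are arbitrary choices, not from the thread):

```r
# Correlation is unchanged by shifting and positively rescaling either variable.
x <- mtcars$wt
y <- mtcars$mpg

r_raw     <- cor(x, y)
r_shifted <- cor(2 * x + 10, 0.5 * y - 3)    # a = 2, c = 0.5: same sign
r_flipped <- cor(-2 * x + 10, 0.5 * y - 3)   # a = -2, c = 0.5: opposite signs

all.equal(r_raw, r_shifted)    # same correlation
all.equal(r_raw, -r_flipped)   # only the sign flips
```

So the correlation matrix alone cannot tell you whether a variable lives on a 0-5 scale or a 0-500 scale, which is why the means and sds are needed.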
I agree with vinux. I was thinking the same thing. In particular, look at what is needed if you are simulating data from a correlation matrix. Without some proxy for where these data are centered and how dispersed they are, the number of potential values is countless. It's not like a Sudoku puzzle, where the possible answers are finite given the constraints.
Follow-up questions: why do you have three values? Are they random values? Do you know the ranges for the variables if they are all on the same scale?
Stop cowardice, ban guns!
@vinux I know the column means and sd for the correlation matrix as well as the N. I updated the original question.
@hlsmith. New participants can only select 3 variables at most. The ratings range from 0 to 5; it's a Likert scale. Basically, select three things you find important and rate them (out of 13 things that could be selected).
Also, I thought about your simulation comment. I have done this using something like this: http://blog.revolutionanalytics.com/...ta_with_r.html Pretty easy. I thought about using it to regenerate the data and get the percentages of the other variables given the three known. But it seems that if there's a way to use regression, it may be more efficient and better.
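A rough sketch of how the simulation route compares with the regression route, under the loud assumption that the variables are multivariate normal (questionable for Likert items). It uses only four mtcars columns, base R, and a Cholesky factor instead of the approach in the linked post; the seed, sample size, and object names are illustrative choices.

```r
set.seed(1)
vars  <- c("mpg", "cyl", "disp", "drat")
mat   <- round(cor(mtcars[, vars]), 2)
means <- round(colMeans(mtcars[, vars]), 2)
sds   <- round(apply(mtcars[, vars], 2, sd), 2)
covm  <- diag(sds) %*% mat %*% diag(sds)   # correlation -> covariance

# Simulate from N(means, covm): standard normal draws times the Cholesky factor.
n   <- 1e5
z   <- matrix(rnorm(n * length(vars)), nrow = n)
sim <- sweep(z %*% chol(covm), 2, means, "+")
colnames(sim) <- vars
sim <- as.data.frame(sim)

# Regression fitted on the simulated data, then prediction at the known values.
fit     <- lm(mpg ~ cyl + disp + drat, data = sim)
sim_est <- predict(fit, newdata = data.frame(cyl = 8, disp = 170, drat = 3.5))

# The same prediction computed directly from the moments, with no simulation.
beta    <- solve(covm[2:4, 2:4], covm[2:4, 1])
ana_est <- means[[1]] + sum(beta * (c(8, 170, 3.5) - means[2:4]))

c(simulated = unname(sim_est), analytic = ana_est)
```

The two estimates should agree up to simulation noise, which suggests the simulation step adds nothing here beyond what the moments already give.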
With the means, sds, and correlation matrix you can do regressions (for simple regression: slope $b_1 = r_{xy}\, s_y / s_x$ and intercept $b_0 = \bar{y} - b_1 \bar{x}$). I am guessing the number of records with missing data is small (the rule of thumb is less than 5%). If the missing percentage is high, regression-based estimation of the missing data may produce a collinear structure.
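As a quick check of the simple-regression case, here is a minimal sketch that recovers the slope and intercept from summary statistics alone and compares them with lm(). The choice of wt and mpg is only for illustration.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Simple regression from summary statistics only:
# slope = r * sd(y) / sd(x); intercept = mean(y) - slope * mean(x)
r         <- cor(x, y)
slope     <- r * sd(y) / sd(x)
intercept <- mean(y) - slope * mean(x)

from_stats <- c(intercept, slope)
from_lm    <- unname(coef(lm(mpg ~ wt, data = mtcars)))
all.equal(from_stats, from_lm)
```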
You should also check the type of missing data (see https://en.wikipedia.org/wiki/Missing_data). The regression method assumes the data are MAR (missing at random). You could also replace the missing data with the respective means if you assume MCAR (missing completely at random).
In general, this seems like an approachable problem, though I am a little confused. So there are 13 questions, but you only have data for three of them for each person. Now you want to create data for the other 10 variables? I am probably misinterpreting something, but if so, please describe the background scenario or issue: why and how much data is missing.
vinux, good point about the amount of missing and possible collinear structure!
trinker (04-27-2016)
@hlsmith. Yes and no. There is a correlation table, with column sds and means, reported in the company's tech manual. For that data set they had info from 300000 participants for all 13 questions. From that point on, the company only required (in fact capped) answers to 3 questions, to focus the participant's attention on self-improvement in just those three constructs. It'd be like if you took a self-improvement questionnaire and there were 13 areas of improvement. It's not likely you have the time or energy to improve in all 13 areas, so we target your attention to just three. Still, the company would like to serve back information about the 10 unrated items based on past participants' behavior, i.e., based on your ratings for three (top three) items and on other people who have done this assessment, we think you'd have rated the other items this way. So in the original data set, from which the correlation table was made, there were 300000 participants with almost no missing values. We want to generalize what we learned from that original data to new participants without giving them the full assessment.
So what is your current n-value?
And when you say 3, is it the same three items collected for each person or does it vary? If it varies, why?
So is there historic data for some people, but not others? Or no historic data, repeat measures?
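Here is an example of regression-based missing data estimation using the mean, sd, and correlation matrix information (with given cyl, disp, and drat):
Let $Y$ = mpg and $X = (\text{cyl}, \text{disp}, \text{drat})'$, with means $\mu_Y$ and $\mu_X$, and let $\Sigma_{XX}$ be the covariance matrix of $X$ and $\Sigma_{XY}$ the covariance of $X$ with $Y$.
Let the model be $Y = \beta_0 + \beta' X + \varepsilon$, where $\beta = \Sigma_{XX}^{-1} \Sigma_{XY}$ and $\beta_0 = \mu_Y - \beta' \mu_X$.
For example, with cyl = 8, disp = 170, and drat = 3.5: $\widehat{\text{mpg}} = \beta_0 + 8\beta_1 + 170\beta_2 + 3.5\beta_3$.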
For more details: https://en.wikipedia.org/wiki/Multiv..._distributions
Code:
## Assuming the means, sds, and cor matrix values are population values
mat <- round(cor(mtcars[, c(1, 2, 3, 5)]), 2)
means <- round(colMeans(mtcars[, c(1, 2, 3, 5)]), 2)
sds <- round(apply(mtcars[, c(1, 2, 3, 5)], 2, sd), 2)
## converting to cov matrix
CovM <- diag(sds) %*% mat %*% diag(sds)
## Now comes regression coefficients
beta <- CovM[2:4, 1] %*% solve(CovM[2:4, 2:4])
beta0 <- means[1] - beta %*% means[2:4]
## predicted mpg for cyl = 8, disp = 170, drat = 3.5
mpg.est <- beta0 + sum(beta * c(8.0, 170.0, 3.5))
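The same idea can be extended to all 8 missing values at once. The following is only a sketch under the same assumption (means, sds, and correlations treated as population values); predict_missing() is a hypothetical helper written for this illustration, not something from the thread or from a package.

```r
mat   <- round(cor(mtcars), 2)
means <- round(colMeans(mtcars), 2)
sds   <- round(apply(mtcars, 2, sd), 2)

# Known values for a new case; NA marks the values to predict.
vals <- setNames(c(NA, 8, 170, NA, 3.5, rep(NA, 6)), colnames(mtcars))

# Hypothetical helper: conditional mean of the missing block given the
# observed block, computed from the moments only.
predict_missing <- function(vals, means, sds, cormat) {
  covm <- diag(sds) %*% cormat %*% diag(sds)   # correlation -> covariance
  dimnames(covm) <- dimnames(cormat)
  obs <- !is.na(vals)
  # E[X_mis | X_obs] = mu_mis + S_mis,obs %*% solve(S_obs,obs) %*% (x_obs - mu_obs)
  adj <- covm[!obs, obs, drop = FALSE] %*%
    solve(covm[obs, obs, drop = FALSE], vals[obs] - means[obs])
  out <- vals
  out[!obs] <- means[!obs] + drop(adj)
  out
}

est <- predict_missing(vals, means, sds, mat)
round(est, 2)
```

The mpg entry of est should match mpg.est from the four-variable example, since both condition on the same three known values. Note that nothing constrains the predictions to a variable's valid range (for instance vs and am are 0/1 in mtcars), so they may need clipping or rounding in practice.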
PS: I wanted to write this in a more descriptive way. But too lazy.
trinker (04-27-2016)
trinker,
Just a side note: keep in mind that modern approaches to missing data use multiple imputation. You all appear to be doing what I will call a pseudo-scoring of data. Multiple imputation imputes multiple values for the missing data to account for its uncertainty. Without multiple imputation you are only accounting for the variability between observations and not for the uncertainty in those individual measures. This difference means inferential statistics may be at risk of type I errors.
Last edited by hlsmith; 05-01-2016 at 09:18 PM. Reason: i wrote Type II error when i meant Type I, finding significance when truth is no difference.
I think the approach vinux is describing is called regression imputation, a form of single imputation.
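To make the single-versus-multiple distinction concrete, here is a hedged sketch that draws several plausible mpg values from the conditional normal distribution instead of reporting only the conditional mean. This is stochastic regression imputation under a multivariate normal assumption, not full multiple imputation: it ignores uncertainty in the means, sds, and correlations and does no pooling of analyses. The seed and the number of draws are arbitrary.

```r
set.seed(42)
vars  <- c("mpg", "cyl", "disp", "drat")
mat   <- round(cor(mtcars[, vars]), 2)
means <- round(colMeans(mtcars[, vars]), 2)
sds   <- round(apply(mtcars[, vars], 2, sd), 2)
covm  <- diag(sds) %*% mat %*% diag(sds)

x_obs <- c(8, 170, 3.5)                      # known cyl, disp, drat
beta  <- solve(covm[2:4, 2:4], covm[2:4, 1])

# Conditional mean (the single regression imputation) and conditional variance.
cond_mean <- means[[1]] + sum(beta * (x_obs - means[2:4]))
cond_var  <- covm[1, 1] - sum(beta * covm[2:4, 1])

# Five plausible values rather than one point estimate.
draws <- rnorm(5, mean = cond_mean, sd = sqrt(cond_var))
draws
```

The spread of the draws reflects how much mpg can still vary once cyl, disp, and drat are fixed, which a single imputed value hides.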
In God we trust. All others must bring data.
~W. Edwards Deming