
Thread: Fill in multiple missing values given correlation matrix

  1. #1
    ggplot2orBust
    Points: 71,220, Level: 100
    Level completed: 0%, Points required for next Level: 0
    Awards:
    User with most referrers
    trinker's Avatar
    Location
    Buffalo, NY
    Posts
    4,417
    Thanks
    1,811
    Thanked 928 Times in 809 Posts

    Fill in multiple missing values given correlation matrix




    I have a correlation matrix:

    Code: 
    mat <- round(cor(mtcars), 2)
    
           mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
    cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
    disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
    hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
    drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
    wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
    qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
    vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
    am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
    gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
    carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
    I am given values for 3 of the 11 variables. Is there a way to predict the 8 missing values? If so, how?

    Code: 
    vals <- setNames(c(NA, 8, 170, NA, 3.5, rep(NA, 6)), colnames(mtcars))
    
    
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
     NA   8.0 170.0    NA   3.5    NA    NA    NA    NA    NA    NA
    I assume so, because you can run a regression from a correlation matrix.

    EDIT: Per @vinux's comment, I also know the column means and column SDs:

    Code: 
    means <- round(colMeans(mtcars), 2)
    sds <- round(apply(mtcars, 2, sd), 2)
    "If you torture the data long enough it will eventually confess."
    -Ronald Harry Coase -

  2. #2
    Dark Knight
    Points: 6,762, Level: 54
    Level completed: 6%, Points required for next Level: 188
    vinux's Avatar
    Posts
    2,011
    Thanks
    52
    Thanked 241 Times in 205 Posts

    Re: Fill in multiple missing values given correlation matrix

    Are you assuming all the variables are standardised? If yes, you can think of 8 regression equations to estimate the missing values.
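
A minimal sketch of those 8 equations on the standardized scale, assuming every variable is expressed as a z-score. The `z_known` values are hypothetical, and `known`, `missing`, and `B` are names introduced here purely for illustration:

```r
R <- round(cor(mtcars), 2)

known   <- c("cyl", "disp", "drat")
missing <- setdiff(colnames(R), known)

# Hypothetical z-scores for the three known variables (assumption)
z_known <- c(cyl = 1.0, disp = -0.5, drat = 0.2)

# One column of standardized slopes per missing variable:
# B = R_kk^{-1} R_km
B <- solve(R[known, known], R[known, missing])

# Predicted z-scores for the 8 missing variables
z_pred <- drop(crossprod(B, z_known))
```

Each column of `B` is one regression equation, so `z_pred` holds the 8 predictions in standard-deviation units. Without means and SDs they cannot be mapped back to the original scale.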
    In the long run, we're all dead.

  3. The Following User Says Thank You to vinux For This Useful Post:

    trinker (04-26-2016)

  4. #3
    trinker

    Standardization is not guaranteed. Why is that a requirement?

  5. #4
    trinker

    PS: They are already on the same scale, so standardization may not be necessary.

  6. #5
    vinux

    Quote Originally Posted by trinker View Post
    Standardization is not guaranteed. Why is that a requirement?
    If you don't know their means (expectations) and standard deviations, it is not possible to estimate the missing data.

    The correlation of X and Y is the same as the correlation of aX + b and cY + d when a and c are nonzero and have the same sign (if their signs differ, only the sign of the correlation flips). Correlation is invariant to location and scale changes, so a correlation matrix alone cannot tell you where the missing values sit or how spread out they are.
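
A quick check of this invariance in R, using `wt` and `mpg` from `mtcars` as stand-ins for X and Y (the constants are arbitrary choices for illustration):

```r
x <- mtcars$wt
y <- mtcars$mpg

r_xy   <- cor(x, y)
# a and c share a sign: the correlation is unchanged
r_same <- cor(2 * x + 10, 0.5 * y - 3)
# a and c differ in sign: only the sign of the correlation flips
r_flip <- cor(-2 * x + 10, 0.5 * y - 3)

all.equal(r_xy, r_same)
all.equal(r_xy, -r_flip)
```

Very different data sets therefore share one correlation matrix, which is why the means and SDs are needed as well.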

  7. #6
    Omega Contributor
    Points: 38,289, Level: 100
    Level completed: 0%, Points required for next Level: 0
    hlsmith's Avatar
    Location
    Not Ames, IA
    Posts
    6,992
    Thanks
    397
    Thanked 1,185 Times in 1,146 Posts

    Re: Fill in multiple missing values given correlation matrix

    I agree with vinux. I was thinking the same thing. In particular, look at what is needed if you are simulating data from a correlation matrix. Without some information on where these data are centered and how dispersed they are, the number of potential values is unbounded. It's not like a Sudoku puzzle, where the answers are finite given the constraints.


    Follow-up questions: why do you have three values? Are they random values? Do you know the ranges for the variables if they are all on the same scale?
    Stop cowardice, ban guns!

  8. #7
    trinker

    @vinux I know the column means and sd for the correlation matrix as well as the N. I updated the original question.

  9. #8
    trinker

    @hlsmith. New participants can select at most 3 variables. The range is 0-5; it's a Likert scale. Basically, you select three things you find important (out of 13 that could be selected) and rate them.

    Also, I thought about your simulation comment. I have done this before using something like this: http://blog.revolutionanalytics.com/...ta_with_r.html It's pretty easy. I thought about using it to regenerate the data and get the distributions of the other variables given the three known ones. But if there's a way to use regression directly, it would probably be more efficient.
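
A rough sketch of that simulation route, assuming multivariate normality (a strong assumption for Likert items) and using `mtcars` as a stand-in for the real survey data. It is not the code from the linked post; it uses base R only, with a Cholesky factor in place of a package sampler, and `sim`, `fit`, and `mpg_sim` are illustrative names:

```r
set.seed(42)

mu    <- colMeans(mtcars)
Sigma <- cov(mtcars)
n     <- 1e5
p     <- length(mu)

# If Z has independent N(0, 1) columns, Z %*% chol(Sigma) has covariance Sigma
Z   <- matrix(rnorm(n * p), nrow = n, ncol = p)
sim <- as.data.frame(sweep(Z %*% chol(Sigma), 2, mu, "+"))
names(sim) <- names(mu)

# Fit the regression on the simulated data, then predict for the known values
fit     <- lm(mpg ~ cyl + disp + drat, data = sim)
mpg_sim <- predict(fit, newdata = data.frame(cyl = 8, disp = 170, drat = 3.5))
```

With a large simulated sample this should land very close to the value obtained directly from the means and covariance matrix, which is why the direct regression route is cheaper.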

  10. #9
    vinux

    Quote Originally Posted by trinker View Post
    @vinux I know the column means and sd for the correlation matrix as well as the N. I updated the original question.
    With the means, SDs, and correlation matrix you can do regressions. For simple regression, $y = \alpha + \beta x$, with $\hat\beta = \rho \frac{\sigma_y}{\sigma_x}$ and $\hat\alpha = \bar y - \hat\beta \bar x$. I assume the records with missing data are few (a rule of thumb is less than 5%). If the missing percentage is high, regression-based estimation of the missing data may induce a collinear structure in the filled-in data.

    You should also check the type of missing data (see https://en.wikipedia.org/wiki/Missing_data). The regression method assumes the data are MAR (missing at random). You could also replace the missing data with the respective means if you assume MCAR (missing completely at random).
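
To illustrate the simple-regression formulas, here is a small sketch that recovers the slope and intercept of `mpg` on `wt` from summary statistics alone. The variable pair is chosen only as an example, and the `_hat` names are introduced here:

```r
# Summary statistics only: correlation, means, and SDs
rho   <- cor(mtcars$wt, mtcars$mpg)
x_bar <- mean(mtcars$wt);  s_x <- sd(mtcars$wt)
y_bar <- mean(mtcars$mpg); s_y <- sd(mtcars$mpg)

beta_hat  <- rho * s_y / s_x            # slope
alpha_hat <- y_bar - beta_hat * x_bar   # intercept

# Same coefficients as fitting the model on the raw data
coef(lm(mpg ~ wt, data = mtcars))
c(alpha_hat, beta_hat)
```

The two results agree up to floating-point error when unrounded statistics are used; with statistics rounded to two decimals they will differ slightly.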

  11. The Following 2 Users Say Thank You to vinux For This Useful Post:

    hlsmith (04-27-2016), trinker (04-27-2016)

  12. #10
    hlsmith

    In general terms, this seems like an approachable problem, though I am a little confused. So there are 13 questions, but you only have data for three of them for each person, and now you want to create data for the other 10 variables? I am probably misinterpreting something, but if so, please describe the background scenario or issue: why the data are missing and how much is missing.


    vinux, good point about the amount of missing and possible collinear structure!

  13. The Following User Says Thank You to hlsmith For This Useful Post:

    trinker (04-27-2016)

  14. #11
    trinker

    @hlsmith. Yes and no. The company's tech manual reports a correlation table along with the column means and SDs. For that data set they had responses from 300,000 participants on all 13 questions. From that point on, the company only required (in fact capped) answers at 3 questions, to focus each participant's attention on self-improvement in just those three constructs. It'd be like taking a self-improvement questionnaire with 13 areas of improvement: you are not likely to have the time or energy to improve in all 13 areas, so we target your attention on just three. Still, the company would like to serve back information about the 10 unrated items based on past participants' behavior, i.e., based on your ratings for your top three items and on other people who have done this assessment, we think you'd have rated the other items this way. So the original data set, from which the correlation table was made, had 300,000 participants with almost no missing values. We want to generalize what we learned from that original data to new participants without giving them the full assessment.

  15. #12
    hlsmith

    So what is your current n-value?


    And when you say 3, is it the same three items collected for each person or does it vary? If it varies, why?


    So is there historic data for some people, but not others? Or no historic data, repeat measures?

  16. #13
    vinux

    Here is an example of regression-based missing data estimation using the means, SDs, and correlation matrix (with cyl, disp, and drat given).


    Let x = cyl, y = disp, z = drat.

    Let the model be m_i = b_0 + b_1 x_i + b_2 y_i + b_3 z_i + e_i,

    where m is one of the missing variables, for example m = mpg.

    Code: 
    ## Assuming the means, sds, and cor matrix values are population values
    vars <- c("mpg", "cyl", "disp", "drat")   # target first, then the three known variables
    mat <- round(cor(mtcars[, vars]), 2)
    means <- round(colMeans(mtcars[, vars]), 2)
    sds <- round(apply(mtcars[, vars], 2, sd), 2)
    
    ## Convert the correlation matrix to a covariance matrix
    CovM <- diag(sds) %*% mat %*% diag(sds)
    
    ## Regression coefficients: solve Cov(predictors) %*% beta = Cov(predictors, mpg)
    beta <- solve(CovM[2:4, 2:4], CovM[2:4, 1])
    beta0 <- means[1] - sum(beta * means[2:4])
    
    ## Predicted mpg for cyl = 8, disp = 170, drat = 3.5
    mpg.est <- unname(beta0 + sum(beta * c(8.0, 170.0, 3.5)))
    For more details, see https://en.wikipedia.org/wiki/Multiv..._distributions
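
The same idea extends to all 8 missing variables at once through the conditional mean of a multivariate normal, mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o), where o indexes the observed variables and m the missing ones. A sketch under that assumption; `impute_from_summary` is a hypothetical helper written for this thread, not a function from any package:

```r
# Summary statistics only (rounded, as they might appear in a tech manual)
R     <- round(cor(mtcars), 2)
means <- round(colMeans(mtcars), 2)
sds   <- round(apply(mtcars, 2, sd), 2)

# Known values for one new case; NA marks the values to predict
vals <- setNames(c(NA, 8, 170, NA, 3.5, rep(NA, 6)), colnames(mtcars))

impute_from_summary <- function(vals, R, means, sds) {
  obs   <- !is.na(vals)
  Sigma <- diag(sds) %*% R %*% diag(sds)   # correlation -> covariance
  dimnames(Sigma) <- dimnames(R)
  # One column of slopes per missing variable: Sigma_oo^{-1} Sigma_om
  B   <- solve(Sigma[obs, obs], Sigma[obs, !obs])
  # Conditional mean of the missing variables given the observed ones
  est <- means[!obs] + drop(crossprod(B, vals[obs] - means[obs]))
  vals[!obs] <- est
  vals
}

filled <- impute_from_summary(vals, R, means, sds)
```

The three known entries of `filled` are left untouched and the other eight are point predictions. Nothing here keeps the predictions inside a bounded range such as 0-5, so they may need clipping for Likert-type items.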


    PS: I wanted to write this in a more descriptive way. But too lazy.

  17. The Following User Says Thank You to vinux For This Useful Post:

    trinker (04-27-2016)

  18. #14
    hlsmith

    trinker,


    Just a side note: keep in mind that modern approaches to missing data use multiple imputation. You all appear to be doing what I will call a pseudo-scoring of data. Multiple imputation imputes several values for each missing datum to account for its uncertainty. Without it, you are only accounting for the variability between observations and not for the uncertainty in the imputed values themselves. This difference means inferential statistics may be at risk of Type I errors.
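
As a rough illustration of the difference, the sketch below draws several plausible values for `mpg` from its conditional normal distribution instead of reporting only the conditional mean. It treats the means and covariances as known, so it shows only the random-draw step and is not a full multiple-imputation procedure; normality and the use of `mtcars` as stand-in data are assumptions:

```r
set.seed(1)

vars  <- c("mpg", "cyl", "disp", "drat")
mu    <- colMeans(mtcars[, vars])
Sigma <- cov(mtcars[, vars])
x_obs <- c(cyl = 8, disp = 170, drat = 3.5)

# Regression slopes of mpg on the three observed variables
w <- solve(Sigma[-1, -1], Sigma[-1, 1])

cond_mean <- mu[1] + sum(w * (x_obs - mu[-1]))     # single-imputation value
cond_var  <- Sigma[1, 1] - sum(w * Sigma[-1, 1])   # residual (conditional) variance

# Five imputed values that reflect the residual uncertainty
draws <- rnorm(5, mean = cond_mean, sd = sqrt(cond_var))
```

Analyses would then be run once per draw and the results pooled, which is what keeps the standard errors from being understated.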
    Last edited by hlsmith; 05-01-2016 at 09:18 PM. Reason: i wrote Type II error when i meant Type I, finding significance when truth is no difference.

  19. #15
    Cookie Scientist
    Points: 13,431, Level: 75
    Level completed: 46%, Points required for next Level: 219
    Jake's Avatar
    Location
    Austin, TX
    Posts
    1,293
    Thanks
    66
    Thanked 584 Times in 438 Posts

    Re: Fill in multiple missing values given correlation matrix


    I think the approach vinux is describing is called regression imputation, a form of single imputation.
    “In God we trust. All others must bring data.”
    ~W. Edwards Deming
