
Thread: Multicollinearity issue

  1. #1

    Multicollinearity issue




    Hi, I have a multivariable logistic model, and one of my categorical independent variables has 5 possible levels; for each level I can calculate a proportion (the prevalence of the outcome at that level of the variable).

    If I have a low event rate (i.e., a low proportion) in several of those levels of the variable, my confidence intervals widen, and I wanted to confirm mathematically how that's possible. Would the standard error of the proportion increase as the event rate decreases, thereby increasing the width of the confidence interval? What is the formula that describes that phenomenon?
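    To make this concrete, here is a rough numerical sketch I put together (made-up n; I'm assuming the relevant quantity is the delta-method standard error of the estimated log-odds, sqrt(1/(np) + 1/(n(1-p)))):

    Code:
    # Toy illustration: the SE of the estimated log-odds grows as the event
    # rate p shrinks, so the Wald CI for the log-odds widens.
    import math

    n = 200  # hypothetical number of subjects at one level of the variable
    for p in [0.50, 0.20, 0.10, 0.05, 0.01]:
        se = math.sqrt(1 / (n * p) + 1 / (n * (1 - p)))  # delta-method SE
        print(f"p={p:.2f}  SE(log-odds)={se:.3f}  95% CI width={2 * 1.96 * se:.3f}")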

    Second, presumably if the standard error increases, then there is a greater chance of multicollinearity in the model. So am I correct that a decrease in event rate (lower proportions for different levels of one or multiple variables) will increase the chance of multicollinearity and model instability?

    Thanks very much.

  2. #2
    hlsmith

    Re: Multicollinearity issue

    This comes down to understanding what the SE is. With smaller samples and sparsity in the data, you become less confident. Look up the formula for the SE of an odds ratio and plug different values into the classification table; that should show you how low prevalence inflates the SE.
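    For example, something like this (cell counts invented): the SE of a log odds ratio from a 2x2 classification table is sqrt(1/a + 1/b + 1/c + 1/d), so any small cell inflates it.

    Code:
    import math

    def se_log_or(a, b, c, d):
        """SE of the log odds ratio from a 2x2 table with cells a, b, c, d."""
        return math.sqrt(1/a + 1/b + 1/c + 1/d)

    print(se_log_or(50, 50, 50, 50))  # balanced cells: SE ~ 0.28
    print(se_log_or(3, 97, 50, 50))   # low prevalence in one group: SE ~ 0.62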

    As for multicollinearity, I don't believe sparse data groupings really affect the risk of collinearity; they may make the SE estimates worse. But collinearity has to truly exist in the first place; low counts won't create it.

  3. #3
    rogojel

    Re: Multicollinearity issue

    Hi,
    IMHO it goes the other way, actually: high multicollinearity will cause a large SE, through variance inflation, if I am not mistaken.
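    The formula I have in mind (for the linear regression case; I assume the logistic case behaves analogously) is

    SE(\hat{\beta}_j) = \sqrt{ \hat{\sigma}^2 / [ (n-1) Var(x_j) ] \times VIF_j }, with VIF_j = 1 / (1 - R_j^2),

    where R_j^2 comes from regressing x_j on the other predictors, so R_j^2 near 1 (high collinearity) blows up the SE.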
    regards

  4. #4

    Re: Multicollinearity issue

    I believe the formula for the standard error of a regression coefficient shows that as multicollinearity increases, the standard error of the involved coefficient increases, making it harder to determine which of the highly correlated variables is driving the variance in the outcome (i.e., a higher chance of a false negative).
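    Here is a quick simulation I used to convince myself (made-up data, ordinary least squares for simplicity; I'm assuming the logistic case behaves the same way):

    Code:
    # Adding a nearly collinear copy of a predictor inflates its coefficient SE.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
    y = 1.0 + 0.5 * x1 + rng.normal(size=n)

    def coef_se(X, y):
        """OLS coefficient SEs: sqrt of the diagonal of sigma^2 * (X'X)^-1."""
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        sigma2 = np.sum((y - X @ beta) ** 2) / (len(y) - X.shape[1])
        return np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

    X_alone = np.column_stack([np.ones(n), x1])
    X_both = np.column_stack([np.ones(n), x1, x2])
    print("SE of x1 alone:             ", coef_se(X_alone, y)[1])
    print("SE of x1 with collinear x2: ", coef_se(X_both, y)[1])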

    My question would be: is there a specific formula or set of steps I can follow that shows how an increase in the standard error of a regression coefficient makes the p-value higher? Or can we only rely on the theoretical statement above, that increased variance brings a higher chance of a false negative?

  5. #5
    Dason

    Re: Multicollinearity issue

    The test stat is estimate/standard_error, so it decreases as the SE increases, and the p-value is a monotonic function of the test statistic.
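    For instance (numbers made up):

    Code:
    # Same estimate, growing SE: the Wald z shrinks and the two-sided
    # p-value rises monotonically.
    from scipy.stats import norm

    beta = 0.5
    for se in [0.1, 0.2, 0.4, 0.8]:
        z = beta / se
        print(f"SE={se:.1f}  z={z:.2f}  p={2 * norm.sf(abs(z)):.4f}")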

  6. #6

    Re: Multicollinearity issue

    Thanks, that's helpful. But in the context of a multivariable logistic regression model, for instance, I am using a chi-square test to determine the significance of each regression coefficient. Would this be an example of a goodness-of-fit chi-square test?

    And how would I figure out the degrees of freedom for each chi-square test? Is it the total N in the entire model (for all variables), minus k (the total number of variables), minus 1?

    Thanks again.

  7. #7
    hlsmith

    Re: Multicollinearity issue

    Sparsity in the data results in a higher SE, and multicollinearity results in a higher SE. Multicollinearity does not affect the regression effect estimates, but it inflates the variance measures, which decreases confidence.

    Are you talking about Wald stats, which are interpreted on the chi-square distribution? As Dason mentions, the test is the estimate divided by the SE; the higher the SE, the smaller the test stat value.

  8. #8

    Re: Multicollinearity issue

    Yes, I think this is the Wald test for determining whether a regression coefficient is significantly different from the null. I'm wondering how I can take the test statistic from the Wald test and then find the appropriate p-value on the chi-square distribution. I read online that the degrees of freedom would be 'q', but I'm not sure how to calculate that.

  9. #9
    hlsmith

    Re: Multicollinearity issue

    I believe the following: the Wald statistic for a \beta coefficient is

    W = [ \hat{\beta} / SE_{(\hat{\beta})} ]^{2}

    which is \chi^{2} distributed with 1 degree of freedom.


    If you are running a program, you can always just look up a test statistic value and see if you get the same p-value.
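    For example, to check it by hand (made-up numbers):

    Code:
    # Wald chi-square with 1 df: W = (beta / SE)^2, p = upper tail of chi2(1).
    from scipy.stats import chi2

    beta, se = 0.9, 0.35  # hypothetical coefficient and standard error
    w = (beta / se) ** 2
    print(f"W={w:.3f}, p={chi2.sf(w, df=1):.4f}")  # matches the two-sided z-test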

  10. #10

    Re: Multicollinearity issue

    Thanks, that was very helpful. Do you know of any advantages of the Wald approach over the LRT for the purpose of determining whether individual regression coefficients are significant? From what I've read, it seems the LRT is preferred, but Wald is the standard approach used by my research team.

    Also, if I have sparse data and/or multicollinearity, could both of those explain why a model might not converge in SAS?

  11. #11
    hlsmith

    Re: Multicollinearity issue

    I typically use SAS, so feel free to post your code, results, log, or output. I am guessing you are getting the "complete or quasi-complete separation" error. That comes up when you have sparsity and some subgroups perfectly predict the outcome. Say you are looking at predictors of heart failure and you have 5 uninsured patients over the age of 70, all of whom have heart failure, plus some other combinations that play out like this. Overall, this could mean you have overparameterized the model given your sample size. You can increase the number of iterations the model tries during convergence to get it to converge, but the original error should be a mental warning that even if you get the model to converge, it still may not be the best model or fit.


    Yeah, I have also read that the Wald estimates can be less ideal (less stable), but I usually assume the difference is trivial. I just popped online and saw that PROC LOGISTIC doesn't explicitly generate those estimates, but you can use a likelihood ratio test to get at them: run the model with and without the variable and perform a likelihood ratio test on the two models, as sketched below. I'm not sure if you can get the betas directly from that test, though.
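    Here are the mechanics of that with-and-without comparison as a toy sketch (made-up data, Python/statsmodels rather than SAS, just to show the idea):

    Code:
    # Likelihood ratio test: 2 * (logL_full - logL_reduced) is chi-square
    # distributed with df equal to the number of dropped parameters.
    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    rng = np.random.default_rng(1)
    n = 300
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 * x1 + 0.8 * x2))))

    full = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
    reduced = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)

    lr = 2 * (full.llf - reduced.llf)
    df = int(full.df_model - reduced.df_model)  # one dropped term -> df = 1
    print(f"LR={lr:.3f}, df={df}, p={chi2.sf(lr, df):.4f}")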

    I don't think the MC really messes up the MLE process; it just results in inflated SEs.

  12. #12
    noetsi

    Re: Multicollinearity issue


    Sometimes gathering more data will help with multicollinearity; sometimes it won't. It depends on what is causing it in the first place. Generally this is not considered an especially important issue; see, for instance, John Fox's comments on this topic. It has no effect at all on the effect size; it only affects the test of significance.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
