1. ## Multicollinearity issue

Hi, I have a multivariable logistic model, and one of my categorical independent variables has 5 possible levels, with each level calculated as a proportion (prevalence of outcome at a given level of the variable).

If I have a low event rate (i.e. low proportion) in several of those levels of the variable, my confidence interval increases, and I wanted to confirm mathematically how that's possible. Would the standard error of the proportion be increasing as the event rate decreases/as the overall proportion decreases, thereby increasing the width of the confidence interval? What is the formula that describes that phenomenon?

Second, presumably if the standard error increases, then there is a greater chance of multicollinearity in the model. So am I correct that a decrease in event rate (lower proportions for different levels of one or multiple variables) will increase the chance of multicollinearity and model instability?

Thanks very much.

2. ## Re: Multicollinearity issue

This comes down to understanding what the SE is. With smaller samples and sparse data you become less confident. Look up the formula for the SE of an odds ratio and plug different cell counts from the classification table into it. This should show you how low prevalence inflates the SE.
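To make that concrete, here is a quick Python sketch with made-up cell counts (not from any real data), using Woolf's formula for the SE of a log odds ratio from a 2x2 table:

```python
import math

def se_log_odds_ratio(a, b, c, d):
    """Standard error of the log odds ratio for a 2x2 table with
    cell counts a, b, c, d (Woolf's formula)."""
    return math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Common outcome: all four cells are reasonably large, SE is small.
se_common = se_log_odds_ratio(50, 50, 40, 60)   # about 0.29

# Rare outcome: the 1/a and 1/c terms for the small event cells
# dominate the sum and inflate the SE.
se_rare = se_log_odds_ratio(3, 97, 2, 98)       # about 0.92
```

The reciprocals are the key: one or two tiny cells are enough to blow up the whole sum, which is exactly the low-event-rate effect you are asking about.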

As for multicollinearity, I don't believe sparse data groupings really affect the risk of collinearity, though they may make the SE estimates worse. But the collinearity has to truly exist in the first place: low counts won't create it.

3. ## Re: Multicollinearity issue

hi,
imho it goes the other way, actually: high multicollinearity will cause a large SE through variance inflation, if I am not mistaken.
regards
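If it helps, here is a small NumPy simulation of that variance inflation (toy data, and OLS rather than logistic regression to keep the SE formula simple): adding a near-duplicate of a predictor blows up that predictor's standard error, even though the data are otherwise identical.

```python
import numpy as np

def coef_se(X, y):
    """OLS coefficient standard errors: square root of the diagonal of
    s^2 * (X'X)^-1, with s^2 estimated from the residuals."""
    m, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (m - k)
    return np.sqrt(s2 * np.diag(XtX_inv))

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)

# Second predictor nearly independent of x1: VIF close to 1.
X_indep = np.column_stack([np.ones(n), x1, rng.normal(size=n)])

# Second predictor nearly a copy of x1: VIF around 100, so the SE of
# the x1 coefficient is inflated by roughly sqrt(100) = 10.
X_coll = np.column_stack([np.ones(n), x1, x1 + 0.1 * rng.normal(size=n)])

se_indep = coef_se(X_indep, y)[1]   # SE of the x1 coefficient
se_coll = coef_se(X_coll, y)[1]     # much larger under collinearity
```

The multiplier on the SE is sqrt(VIF), where VIF = 1 / (1 - R²) from regressing the predictor on the other predictors.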

4. ## Re: Multicollinearity issue

I believe the formula for the standard error of a regression coefficient shows that as multicollinearity increases, the standard error of the involved coefficients increases, making it harder to determine which of the highly correlated variables is driving the variance in the outcome (i.e. a higher chance of a false negative).

My question would be: is there a specific formula or set of steps I can follow that shows how an increase in the standard error of a regression coefficient makes the p-value higher? Or can we only use the theoretical statement above, that increased variance brings a higher chance of a false negative?

5. ## Re: Multicollinearity issue

The test statistic is estimate/standard_error, so it decreases as the SE increases, and the p-value is a monotonic function of the test statistic.
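A tiny stdlib-Python sketch of that point, with made-up numbers (the two-sided normal p-value via `math.erfc` is a standard identity):

```python
import math

def wald_p(estimate, se):
    """Two-sided p-value for the Wald z statistic z = estimate / SE,
    using the identity p = erfc(|z| / sqrt(2))."""
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2))

# Same estimate, increasing SE: the test statistic shrinks,
# so the p-value climbs toward 1.
ps = [wald_p(0.8, se) for se in (0.2, 0.4, 0.8)]   # z = 4, 2, 1
```

Holding the estimate fixed and doubling the SE halves z, and the p-value rises accordingly; that is the whole mechanism.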

6. ## Re: Multicollinearity issue

Thanks, that's helpful. But in the context of a multivariable logistic regression model for instance, I am using a chi-square test to determine significance of each regression coefficient. Would this be an example of a goodness of fit chi-square test?

And, how would I figure out the degrees of freedom for each chi square test? Is it the total N in the entire model (for all variables), minus k (total number of variables), minus 1?

Thanks again.

7. ## Re: Multicollinearity issue

Sparsity in the data results in a higher SE, and multicollinearity results in a higher SE. Multicollinearity does not affect the regression effect estimates, but it inflates the variance measure, which decreases confidence.

Are you talking about Wald statistics, which are interpreted on the chi-square distribution? As dason mentions, the test is the estimate divided by the SE; the higher the SE, the smaller the test statistic.

8. ## Re: Multicollinearity issue

Yes, I think this is the Wald test for determining whether a regression coefficient is significantly different from the null. I'm wondering how I can take the test statistic from the Wald test and then find the appropriate p-value on the chi-square distribution. I read online that the degrees of freedom would be 'q' but I'm not sure how to calculate that.

9. ## Re: Multicollinearity issue

I believe the following:

The Wald statistic for a coefficient is:

W = (β̂ / SE(β̂))²

which is chi-square distributed with 1 degree of freedom.

If you are running a program you can always just look up a test statistic value and see if you get the same p-value.
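Along those lines, here is a quick numeric check in stdlib Python (the coefficient and SE are invented for illustration). For 1 df the chi-square survival function reduces to erfc(sqrt(W/2)), so the Wald chi-square p-value matches the two-sided z test:

```python
import math

def wald_chisq(beta_hat, se):
    """Wald chi-square statistic W = (beta_hat / SE)^2 and its p-value on
    the chi-square distribution with 1 df. For 1 df the survival function
    reduces to erfc(sqrt(W / 2)), the same as the two-sided z test."""
    w = (beta_hat / se) ** 2
    return w, math.erfc(math.sqrt(w / 2))

# Made-up coefficient and SE, just to illustrate:
w, p = wald_chisq(0.9, 0.35)   # W about 6.61, p about 0.010
```

You can compare a result like this against the Wald chi-square column your software prints for each coefficient to confirm you are reading the right distribution.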

10. ## Re: Multicollinearity issue

Thanks, that was very helpful. Do you know of any advantages of the Wald approach over the LRT, for instance, for the purpose of determining whether individual regression coefficients are significant? From what I've read, it seems the LRT is preferred, but Wald is the standard approach used by my research team.

Also, if I have sparse data and/or multicollinearity, could those both be potential reasons that could explain why a model might not converge in SAS?

11. ## Re: Multicollinearity issue

I typically use SAS, so feel free to post your code, results, log, or output. I am guessing you are getting the "complete or quasi-complete separation" error. That comes when you have sparsity and some subgroups perfectly predict the outcome. Say you are looking at predictors of heart failure and you have 5 uninsured patients over the age of 70 and all have heart failure, plus some other combinations that play out like this. Overall, this could mean you have overparameterized the model given your sample size. You can increase the number of iterations the model attempts to get it to converge, but the original error should be a mental warning that even if the model converges, it still may not be the best model or fit.
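To see why separation stops convergence, here is a stdlib-Python toy (not SAS, and the data are invented): with perfectly separated data the log-likelihood keeps improving as the slope grows, so there is no finite maximum for the iterations to converge to.

```python
import math

# Toy perfectly separated data: every x = 1 subject has the outcome,
# every x = 0 subject does not.
data = [(0, 0)] * 5 + [(1, 1)] * 5

def log_likelihood(beta0, beta1):
    """Logistic log-likelihood of the toy data at (beta0, beta1)."""
    ll = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        ll += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return ll

# As the slope grows (with the intercept kept at -slope/2 so the
# cutpoint stays centered), the likelihood keeps climbing toward its
# supremum of 0 and never attains a maximum at any finite coefficient.
lls = [log_likelihood(-b / 2, b) for b in (2, 10, 20)]
```

The estimates drift off to infinity while the SEs explode, which is the numeric signature behind that SAS message.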

Yeah, I have also read that the Wald estimates can be less ideal (less stable), but I usually assume the difference is trivial. I just popped online and saw that PROC LOGISTic doesn't explicitly generate those estimates, but you can use the model likelihood test to get at them. So run the model with and without the variable and perform a likelihood ratio test on the two models. Not sure if you can get the betas directly from that test?
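Here is a rough sketch of that with-and-without comparison in Python/NumPy (a hand-rolled Newton-Raphson fit on simulated data, since I don't have your SAS setup; in SAS the two -2 Log L values come from the model fit statistics):

```python
import numpy as np

def fit_logistic(X, y, iters=50):
    """Maximum-likelihood logistic fit via Newton-Raphson.
    Returns the coefficient vector and the maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = p * (1.0 - p)                      # IRLS weights
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return beta, float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)          # pure noise: not in the true model
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(0.3 + 1.0 * x1)))).astype(float)

X_full = np.column_stack([np.ones(n), x1, x2])
X_reduced = np.column_stack([np.ones(n), x1])    # model without x2

_, ll_full = fit_logistic(X_full, y)
_, ll_reduced = fit_logistic(X_reduced, y)

# LRT statistic: chi-square with 1 df, since one term was dropped.
lr = 2.0 * (ll_full - ll_reduced)
```

The LRT only compares the two likelihoods, so it tells you whether the variable matters but not the betas themselves; those still come from the full model fit.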

I don't think the multicollinearity really messes up the MLE process; it just results in inflated SEs.

12. ## Re: Multicollinearity issue

Sometimes gathering more data will help with multicollinearity; sometimes it won't. It depends on what is causing it in the first place. Generally this is not considered an especially important issue; see, for instance, John Fox's comments on the topic. It has no effect at all on the effect size; it only affects the test of significance.
