Hello everyone
I am having trouble interpreting some of my results.
I am using logistic regression to infer a model based on measured data. Some of my explanatory variables are continuous (e.g. temperature [°C]) and some are categorical (e.g. time of day [night, morning, day, afternoon, evening]). To investigate multicollinearity issues, I have calculated Generalized Variance Inflation Factors (GVIF) in R (using the car package). R automatically calculates GVIF^(1/(2*Df)), which to my understanding is an estimate of the factor by which the confidence interval of each coefficient is inflated (please correct me if I am wrong).
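For anyone wanting to reproduce the general setup, here is a minimal sketch of the kind of call involved. The data below are simulated by me as a stand-in (they are not the OP's measurements), and the coefficient values are arbitrary:

```r
library(car)  # provides vif(), which reports GVIF, Df, and GVIF^(1/(2*Df)) for factor terms

# simulated stand-in data: binary outcome, one continuous and one categorical predictor
set.seed(1)
d <- data.frame(
  Temperature = runif(300, 15, 30),
  time = factor(sample(c("Night", "Morning", "Day", "Afternoon", "Evening"),
                       300, replace = TRUE))
)
d$event <- rbinom(300, 1, plogis(-3 + 0.12 * d$Temperature))

fit <- glm(event ~ Temperature * time, data = d, family = binomial)
vif(fit)  # one row per model term: GVIF, Df, GVIF^(1/(2*Df))
          # (recent car versions may print a note about interaction terms)
```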
My problem is: How should I interpret the GVIF of interaction terms between continuous and categorical variables?
One of my simple models looks like this:
Code:
Coefficients:
                           Estimate Std. Error z value Pr(>|z|)
(Intercept)                -8.76208    2.54158  -3.447 0.000566 ***
Temperature                 0.08847    0.11441   0.773 0.439363
timeMorning                 3.20504    2.82524   1.134 0.256614
timeDay                     0.72913    2.77043   0.263 0.792409
timeAfternoon              -0.34141    2.77430  -0.123 0.902057
timeEvening                -0.97397    3.16012  -0.308 0.757926
Temperature:timeMorning    -0.02669    0.12782  -0.209 0.834601
Temperature:timeDay         0.06239    0.12415   0.503 0.615302
Temperature:timeAfternoon   0.09116    0.12386   0.736 0.461711
Temperature:timeEvening     0.06535    0.13907   0.470 0.638410

with the following GVIFs:

Code:
                         GVIF Df GVIF^(1/(2*Df))
Temperature      2.091206e+01  1        4.572971
time             1.595779e+08  4       10.601604
Temperature:time 1.899285e+08  4       10.834872

I would like to make a table like the one below:

Code:
             Estimate   std.Dev   std.Err  C.I. 2.5%  C.I. 97.5%  Inflation
Intercept
  Night      -8.76208   2.54158  0.010494      -8.78       -8.74  XX
  Morning    -5.55704  3.800212   0.01569      -5.59       -5.53  XX
  Day        -8.03295  3.759642  0.015523      -8.06       -8.00  XX
  Afternoon  -9.10349  3.762495  0.015535      -9.13       -9.07
  Evening    -9.73605  4.055365  0.016744      -9.77       -9.70
Temperature
  Night       0.08847   0.11441  0.000472     0.0875      0.0894
  Morning     0.06178  0.171545  0.000708     0.0604      0.0632
  Day         0.15086  0.168828  0.000697     0.1495      0.1522
  Afternoon   0.17963  0.168615  0.000696     0.1783      0.1810
  Evening     0.15382  0.180084  0.000744     0.1524      0.1553

My problem is: How do I calculate the inflation of the confidence intervals?
I would really appreciate it if anyone can help!
Last edited by enur; 08-03-2012 at 10:43 AM.
By categorical variable, do you mean ordinal?
You should not think too hard about the VIFs in this scenario. If you do not center your predictors (and it looks like you haven't!), then there will almost always be apparently extreme multicollinearity between the interaction term and the simple effect terms. This makes sense: if you have an interaction term A*B, it should not be surprising that this is highly correlated with A, because half of what comprises A*B is A itself!
However, this multicollinearity is a red herring. It is an artifact of having not centered your predictors and does not actually inflate your confidence intervals to an undue degree.
To see this, take a look at the formula for the 100(1 - α)% confidence interval of the coefficient b_j for a predictor X_j:

b_j ± sqrt( F(1 - α; 1, n - k - 1) ) × sqrt( MSE / (SS_j × Tol_j) )

F(1 - α; 1, n - k - 1) is the critical value of F, MSE is the mean squared error of the model, SS_j is the variation of the predictor (technically its sum of squared deviations: SS_j = Σ(X_ij - X̄_j)²), and Tol_j is the "tolerance" of X_j, which is just 1 - R_j² (with R_j² being the R² from regressing X_j on all the other predictors).
As you can see, as the tolerance decreases (conversely, as the VIF = 1/Tol_j increases), the confidence interval expands. (This also answers your question about what exactly the inflation factor is -- the confidence interval expands with the square root of the VIF.) However, in the situation of predictors that are products of uncentered variables, it turns out that this decrease in tolerance caused by not centering the predictors is counterweighed by an increase in the variation of the predictor, SS_j, so that these two effects cancel out and the width of the confidence interval is net unchanged.
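As a quick sanity check on that formula, here is a toy linear-model example (my own simulated data, nothing to do with the OP's) where the interval built from MSE, SS_j, and the tolerance reproduces what confint() reports:

```r
set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)          # deliberately correlated with x1
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

mse  <- sum(residuals(fit)^2) / df.residual(fit)   # mean squared error
ssj  <- sum((x1 - mean(x1))^2)                     # sum of squared deviations of x1
tolj <- 1 - summary(lm(x1 ~ x2))$r.squared         # tolerance of x1 (= 1/VIF)
se_b1 <- sqrt(mse / (ssj * tolj))                  # standard error of the x1 coefficient

# the t critical value is the square root of the F critical value with (1, n - k - 1) df
coef(fit)["x1"] + c(-1, 1) * qt(0.975, df.residual(fit)) * se_b1
confint(fit)["x1", ]                               # same interval
```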
The following tables might help to illustrate the effect of centering on both multicollinearity and variance:
Uncentered:

Code:
> uncen
      x1 x2 x1x2
 [1,]  7  8   56
 [2,]  4  6   24
 [3,]  9  9   81
 [4,]  6  8   48
 [5,]  6  9   54
 [6,]  6  5   30
 [7,]  6  9   54
 [8,]  6  1    6
 [9,]  8  3   24
[10,]  5  9   45
>
> # correlations
> cor(uncen)
               x1           x2      x1x2
x1    1.000000000 -0.002730559 0.4655388
x2   -0.002730559  1.000000000 0.8715734
x1x2  0.465538807  0.871573389 1.0000000
>
> # variances
> apply(uncen, 2, var)
        x1         x2       x1x2
  2.011111   8.233333 459.733333

Centered:

Code:
> cen
        x1   x2  x1x2
 [1,]  0.7  1.3  0.91
 [2,] -2.3 -0.7  1.61
 [3,]  2.7  2.3  6.21
 [4,] -0.3  1.3 -0.39
 [5,] -0.3  2.3 -0.69
 [6,] -0.3 -1.7  0.51
 [7,] -0.3  2.3 -0.69
 [8,] -0.3 -5.7  1.71
 [9,]  1.7 -3.7 -6.29
[10,] -1.3  2.3 -2.99
>
> # correlations
> cor(cen)
               x1           x2      x1x2
x1    1.000000000 -0.002730559 0.1632143
x2   -0.002730559  1.000000000 0.1961749
x1x2  0.163214305  0.196174937 1.0000000
>
> # variances
> apply(cen, 2, var)
       x1        x2      x1x2
 2.011111  8.233333 10.530667
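To connect those tables back to the confidence-interval point, here is a small simulation of my own (different numbers from the tables above) showing that centering drastically changes the collinearity of the product term but leaves the width of the interaction coefficient's confidence interval exactly unchanged:

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n, mean = 6, sd = 1.5)
x2 <- rnorm(n, mean = 7, sd = 2.5)
y  <- 1 + x1 + x2 + 0.5 * x1 * x2 + rnorm(n)

fit_u <- lm(y ~ x1 * x2)          # uncentered predictors
x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)
fit_c <- lm(y ~ x1c * x2c)        # centered predictors

# the product is strongly correlated with its components only when uncentered
cor(x1 * x2, x1)
cor(x1c * x2c, x1c)

# but the CI for the interaction coefficient has exactly the same width either way
diff(confint(fit_u)["x1:x2", ])
diff(confint(fit_c)["x1c:x2c", ])
```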
“In God we trust. All others must bring data.”
~W. Edwards Deming
enur, what would be the purpose of examining this collinearity?
Thank you for your replies – I really appreciate it!
Hlsmith: yes, when I wrote categorical I meant ordinal - I often make this mistake (I don’t know why).
Jake: I am not really sure what you mean by centering of predictors. Maybe I was not clear enough about my data.
I have been measuring temperature in residential buildings (and some other variables) for a period of time. The variables (including temperature) were measured at 10-minute intervals. Based on the time of day, I have created an ordinal variable called time [night, morning, day, afternoon, evening].
I have also recorded different events (on/off) in the buildings. My aim is to create models that can predict events based on the measured variables. I have used logistic regression with stepwise forward and backward selection of variables to infer the different models (the selection was based on AIC). The model in my example was the simplest I could think of. Most of the inferred models include more variables.
I would like to calculate the possible inflation of the confidence intervals due to collinearity, so I (and others) can be aware of this in the future, when I start using (and validating) the models.
If I understand it correctly, a GVIF^(1/(2*Df)) of 10.6 for the variable ‘time’ means that the effects of ‘time’ on the intercept may be inflated to such an extent that the inferred confidence intervals for the intercept may be up to 10.6 times too large. A GVIF^(1/(2*Df)) of 4.6 for the variable ‘Temperature’ means that the confidence interval for the ‘Temperature’ coefficient may be up to 4.6 times too large, compared to the case with no multicollinearity. My problem is that I have interactions between ‘time’ and ‘Temperature’, resulting in five different coefficients for the variable ‘Temperature’. How do I interpret a GVIF^(1/(2*Df)) of 10.8 for the interaction between Temperature and time?
Can I simply add the GVIFs, so that the temperature confidence intervals may be 4.6+10.8=15.4 times too large?
Any insights are highly appreciated!
Yes, I think I understand the example. To "center" a predictor means to subtract off the mean value of that predictor from all the individual values, so that the new mean is 0. Observe the values of x1 and x2 in the first code block that I posted and compare them to the values of x1 and x2 in the second code block.
At the risk of hijacking (and asked here to avoid the risk of making crap threads with one-liner questions):
If there are two regressors in a 5-regressor cross-sectional regression that have a Pearson correlation of 0.5 with p-value < 0.001, is this bad? I have about 350 observations in the cross-section and the other Gauss-Markov assumptions are intact.
Probably not. What are the VIFs?
Note that even when multicollinearity is a big problem, it's really only a "problem" from the perspective of having a negative influence on power. There is no "assumption" of non-collinearity to be violated. It just works out more nicely to have the predictors be close to orthogonal.
An old thread, but one I have a question on. If I understand Jake's comment (quoted below) correctly, then while the VIF will likely indicate multicollinearity for interaction terms [and possibly the main effects associated with them], the multicollinearity will not affect the tests of statistical significance [through the standard errors] as it normally would. I assume this is because there really is no actual multicollinearity in this case; it is only the VIF that is distorted [although I am not certain of this from the post].
Quote (Jake): "As you can see, as the tolerance decreases (conversely, as the VIF increases), the confidence interval expands. (This also answers your question about what exactly the inflation factor is -- the confidence interval expands with the square root of the VIF.) However, in the situation of predictors that are products of uncentered variables, it turns out that this decrease in tolerance caused by not centering the predictors is counterweighed by an increase in the variance of the predictor, so that these two effects cancel out and the width of the confidence interval is net unchanged."
I assume Josh means main effects when he mentions simple effect terms in his post.
"Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995