What to do when the predictors' coefficients are not what I expected (even though the model seems fine)?

#1
I will first clarify the problem and then ask my questions.

The problem (variable names are masked due to confidentiality):

I ran a binary logistic regression with 5 independent variables (IVs): A, B, C, D, and E. A and B are my main concern; C also concerns me, and I will come back to it later. A and B have two levels each (A1 and A2, B1 and B2). When I ran the estimation, some significant findings appeared. Nevertheless, the direction of the coefficients (beta values) for A and B was the opposite of what I expected! According to most of the literature (not all of it), I expected A and B to have positive betas, while they both had negative coefficients.

First, I checked the models for about three full days. I found no mistake in them, and the directions did not change no matter what changes I made to the models (although in none of those changes did I try dropping the interactions). The log-likelihoods also suggested I was heading in a good direction.

Then I decided to set aside my subjective view of the strange results and trust the regression analysis. I moved on to discussing those strange results and tried to justify the controversial findings. While discussing them, I reached this point: those two variables are heavily interconnected. Firstly, they had a significant interaction. Secondly, the distribution of predictor A was heavily affected by B, and according to the literature, A and B could have opposite effects; this could matter in my sample, which was not balanced in terms of B. That imbalance could confound the effect of A as well.

So I thought maybe these were causing some problems. Then, since C was also strange, I thought maybe the whole model was being badly affected by problems such as multicollinearity. I asked myself, "What will happen if I isolate only A and B in the model?" If the interactions between A, B, and C are sources of bias, can reducing the number of IVs lead to different results? The answer was yes: when I excluded all the other IVs from the model and left only A, B, and A*B, one of the coefficients became favorable and more in line with the literature and common sense. So I could tell that some problems do exist which disrupt the main model (such as multicollinearity, maybe).

Then I decided to examine every strange predictor in isolation. When I excluded the interactions and left only the five IVs, the results seemed much more consistent. Apparently, the problem begins when some specific interactions (but not all of them) are added to the model. After adding them, the directions of the betas for A and C get reversed. It is a little annoying, since adding those specific interactions improves the log-likelihood considerably (from about -75 to -48), so I cannot easily ignore those interactions.
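For concreteness, the comparison I keep referring to is essentially a likelihood-ratio test between the nested models. A rough sketch in Python/statsmodels (I worked in another package; the file name, variable names, and the particular interactions here are only placeholders for my masked ones):

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# hypothetical data frame: binary outcome y plus the masked predictors A..E
df = pd.read_csv("data.csv")

# main-effects model vs. the same model with the suspect interactions added
m_main = smf.logit("y ~ A + B + C + D + E", data=df).fit(disp=0)
m_int = smf.logit("y ~ A + B + C + D + E + A:B + A:C + B:C", data=df).fit(disp=0)

# likelihood-ratio test: twice the gain in log-likelihood, chi-square distributed
lr = 2 * (m_int.llf - m_main.llf)
df_diff = m_int.df_model - m_main.df_model
print(m_main.llf, m_int.llf, lr, stats.chi2.sf(lr, df_diff))
```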

Questions (the main ones are 3 and 4, but answers to the rest are much appreciated as well):

1. When the model acts strangely, but the LRT and the log-likelihood say that it is fine, which one should we trust? Subjective common sense, or the objective statistical measures?

2. Do you think there is a "problem" in my case in the first place? Maybe everything is fine. If you wish, I can provide the raw data too.

3. What would you do in my case? At least three choices can be made: A. Dropping the interactions. B. Not dropping them and reporting the strange model. C. Reporting both the models with and without interactions, and also models with limited numbers of IVs (for example, only A and B), and then trying to argue, subjectively, that "it is the interactions that make the main, large model strange".

4. I am going to do the latter (3.C). But that would be messy and not very good-looking. I wonder if there is an elegant, objective way of finding the source of error in the main model (if there are any errors, of course), so that instead of subjective discussion, I can substantiate my claims with some objective statistical measures. For example, is there a way to highlight the problematic interactions according to some statistic?

5. Do you have any other valuable suggestions or ideas?
 

trinker

ggplot2orBust
#2
I think multicollinearity is not the same as interactions. An interaction changes how x1 affects x2's impact on y. Multicollinearity is when two predictors are so closely aligned that they are basically both taking the same chunk out of y.

What do the standard errors look like?
Are you sure things are dummy coded correctly?

PS I'm not the expert in the room, but I figured I might as well put an idea out there, and if it's wrong one of those experts will correct it :D
 
#3
lol, of course you are. Thanks trinker.

Yeah, you are right, that is not a case of multicollinearity; I should correct that part. Thanks again.

What do the standard errors look like?
They are so small that the p-values come out significant at the 0.01 or even 0.001 level. Those do not look problematic. I can post everything (temporarily) here too; I am gathering the outputs.

Are you sure things are dummy coded correctly?
Yeah, the categorical variables have two levels only, and there is an ordinal variable as well.
 

Dason

Ambassador to the humans
#4
I think what trinker was asking about the coding is: are you sure that your 'success' condition is being coded as 1 (as opposed to being coded as 0 and the failure condition being coded as 1)?
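For example, in Python/pandas something like this makes the coding explicit rather than leaving it to the software's defaults (file and column names are made up):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical data frame

# see which outcome level is which, and how many of each
print(df["outcome"].value_counts())

# recode so that 1 is unambiguously the 'success' condition
df["y"] = (df["outcome"] == "success").astype(int)
```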
 
#5
UPDATE: I think it is a multicollinearity case.

Oh, of course; during the previous three days I checked that so many times!!! I doubted it at first, but the file is OK, and besides, if that were the case, everything would be reversed (not just a couple of things). I am now attaching things in our private room.
 

Dason

Ambassador to the humans
#6
You mention that A and B have slopes opposite of what you expect. If you fit a model with just A do you get a slope that you expect? What about for B? Humans are pretty bad at guessing/understanding slopes with multiple predictors in the model.
 
#7
Well, the literature is a bit controversial about B, so whether it is positive, negative, or even non-significant, it is OK. But there is less controversy over A, so it should be positive. When not too many interactions had been included yet, the beta for A was alright (see the attached file). However, the problem started once I entered specific variables (interactions) into the model.

So I checked the correlation matrices, and there is huge multicollinearity. There are correlations up to 0.9 or maybe more between some of the variables, in fact between many of them. And I read somewhere that it is wise to consider correlations greater than 0.4 as cases of collinearity.
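This is roughly how the check can be done in Python/patsy, so that the interaction columns themselves are included in the correlation matrix (the formula and names are placeholders for my masked variables):

```python
import pandas as pd
from patsy import dmatrix

df = pd.read_csv("data.csv")  # hypothetical file with y and the predictors

# design matrix with the main effects and the interactions actually used in the model
X = dmatrix("A + B + C + D + E + A:B + D:E", data=df, return_type="dataframe")
corr = X.drop(columns="Intercept").corr()

# list the column pairs whose correlation exceeds the 0.4 threshold I mentioned
flagged = [(i, j, round(corr.loc[i, j], 2))
           for i in corr.columns for j in corr.columns
           if i < j and abs(corr.loc[i, j]) > 0.4]
print(flagged)
```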

Overall, I tried to rule out any sources of bad guessing and value the model more than my own subjective mind. Now I have objective evidence that the culprit is multicollinearity, and that is a relief. :) Now I should go find ways to deal with it.
 

Dason

Ambassador to the humans
#8
Multicollinearity gets blamed for increased standard errors. But a slope switching sign isn't something you should 'blame' on multicollinearity. Like I said - it's just hard to understand how variables behave when fitting a model with multiple predictors.
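A tiny simulated example of what I mean (Python/statsmodels, made-up data, nothing to do with your variables): x1's slope on its own and its slope after adjusting for a correlated x2 have opposite signs, and neither model is 'wrong'.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000

# x1 and x2 are strongly correlated; y depends negatively on x1 but strongly
# positively on x2, so the *marginal* association between x1 and y is positive
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)
p = 1 / (1 + np.exp(-(-1.0 * x1 + 2.0 * x2)))
y = rng.binomial(1, p)

m_alone = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)
m_both = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
print(m_alone.params[1])  # positive slope for x1 alone
print(m_both.params[1])   # negative slope for x1 once x2 is in the model
```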
 
#9
Yeah, mostly, but there might be some other consequences of multicollinearity, according to Wikipedia:

"Indicators that multicollinearity may be present in a model:
1. Large changes in the estimated regression coefficients when a predictor variable is added or deleted"

I think my beta gets reversed to offset the effect of the last variable included.
 
#10
SUCCESS! Finally found the perfect model after four or five days of banging my head against the wall. I am going to reward myself. :tup: to myself!
 
#11
Of course, when multicollinearity is present it is possible that the sign of an estimated parameter changes when another collinear variable is included. The values of the betas can "flip around".


Even if the design had been perfectly balanced and perfectly orthogonal (so that all x-variables were uncorrelated with each other and with linear combinations), it would still not have been that easy to interpret the parameter estimates if interaction effects are present. Even with perfect balance, it is good practice to plot an "interaction plot" to look at the combined effect of factors A and B.
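Something as simple as this is what I mean by an interaction plot (a Python sketch with placeholder file and column names; non-parallel lines suggest an interaction):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # hypothetical: binary y plus factors A and B

# mean of the binary outcome in each A-by-B cell, one line per level of B
cell_means = df.groupby(["A", "B"])["y"].mean().unstack("B")
cell_means.plot(marker="o")
plt.xlabel("level of A")
plt.ylabel("proportion of successes")
plt.show()
```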

With multicollinearity it is even more difficult.

One method to avoid mistakes can be to "collapse" the two factors into one, so that the two levels of factor A and the two levels of factor B are combined into a single factor, say G, with four levels. This gives exactly the same model fit as "A + B + A:B", so nothing is statistically gained or lost, but the model with the single factor "G" is not as "tricky"/difficult. Then two levels of G can be compared (e.g. level 2 with level 3).

Next, the model with G can be estimated along with the others, like "G + C + D" (plus possible interactions).
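A sketch of that reparametrisation in Python/statsmodels (file and variable names are placeholders):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data.csv")  # hypothetical: binary y plus factors A, B, C, D

# combine the two 2-level factors into one 4-level factor G
df["G"] = df["A"].astype(str) + "_" + df["B"].astype(str)

# same fit as the A*B parametrisation, just labelled differently
m_ab = smf.logit("y ~ A * B", data=df).fit(disp=0)
m_g = smf.logit("y ~ G", data=df).fit(disp=0)
print(m_ab.llf, m_g.llf)  # identical log-likelihoods

# then G goes into the larger model alongside the other predictors
m_full = smf.logit("y ~ G + C + D", data=df).fit(disp=0)
print(m_full.summary())
```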

The problem with multicollinearity is a problem with the sample, not the population. So one (common) suggestion is to get better data. However, it is often the case that nature and society have a "bad design", so collinear patterns appear.

Also note that in logistic regression the individual values are "weighted" by the different variances, in contrast to a linear regression model estimated by least squares, so what appears to be a balanced design in the linear regression model would not be perfectly balanced in logistic regression.

It also makes me a little bit worried when someone has found "the perfect model" after having struggled with multicollinearity for a long time. There are no perfect models. As Box said: all models are wrong, but some are useful.
 

noetsi

Fortran must die
#12
I am not sure if the concern is interaction or MC. MC can reverse the sign of a variable if it is highly collinear with another variable and that other variable is removed. This is one of the warning signs of high MC. My suggestion is to test for high MC and see what you get (what is your VIF?).

I think what GreatGarbo said about MC would be the standard treatment of that topic, although I don't think you can argue that it's only an issue of the sample. It could well be that the two IVs are confounded in their impact on the DV in the actual population (which is moot, however, since you will rarely if ever know the real population). John Fox covers this topic extensively in Regression Diagnostics (Sage), notably pages 10-20. He does a much better job of explaining what does not work than what does. :p
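If it helps, VIFs are straightforward to pull from the design matrix (a Python sketch; the file name and formula are just examples):

```python
import pandas as pd
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("data.csv")  # hypothetical data

# design matrix for the full model, interactions included
y, X = dmatrices("y ~ A + B + C + D + E + A:B", data=df, return_type="dataframe")

vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs.round(1))  # values well above ~10 are the usual red flag
```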

1. When the model acts strangely, but the LRT and the log-likelihood say that it is fine, which one should we trust? Subjective common sense, or the objective statistical measures?
I would choose the one that makes the most theoretical (substantive) sense to me, if I knew that. I was taught to always try to figure out why you get strange results, and it makes me nervous when I can't explain an anomaly. Trying to figure out why the strange results are occurring is always important (which of course you are doing).

2. Do you think there is a "problem" in my case in the first place? Maybe everything is fine. If you wish, I can provide the raw data too.
What is your VIF?

3. What would you do in my case? At least three choices can be made: A. Dropping the interactions. B. Not dropping them and reporting the strange model. C. Reporting both the models with and without interactions, and also models with limited numbers of IVs (for example, only A and B), and then trying to argue, subjectively, that "it is the interactions that make the main, large model strange".
I would choose C, which I think is how a journal would go as well. You might not present the full results, but at least enough of the summary results, plus why you think they are occurring and what they mean. Having said that, I never submitted to a statistical journal, and it's been a long time since I submitted to any....


Then, since C was also strange, I thought maybe the whole model was being badly affected by problems such as multicollinearity. I asked myself, "What will happen if I isolate only A and B in the model?" If the interactions between A, B, and C are sources of bias, can reducing the number of IVs lead to different results? The answer was yes: when I excluded all the other IVs from the model and left only A, B, and A*B, one of the coefficients became favorable and more in line with the literature and common sense. So I could tell that some problems do exist which disrupt the main model (such as multicollinearity, maybe).
Perhaps what this is showing is that existing theory is wrong: that it has failed to consider the impact of some phenomenon (C) which changes the relationship of the other factors to the DV. Our understanding of reality is often simplistic. The real question here is whether this is simply a sampling issue, where C is distorting your results because of limits on the sample, or a problem in the real world. To answer that, you would need to consider the theoretical implications of the impact of C - why would it logically have this effect?

It could be that existing theory is wrong, or incomplete, and your model is telling you that.
 

hlsmith

Omega Contributor
#13
Whew, this is a long thread with big posts, I will save 15 minutes of my life and skip reading this monster. Vic, I bet you are one of those people with 50 slides jammed full of text, even when they are giving a 3 minute presentation. Word of the day "brevity".
 

noetsi

Fortran must die
#14
Whew, this is a long thread with big posts, I will save 15 minutes of my life and skip reading this monster. Vic, I bet you are one of those people with 50 slides jammed full of text, even when they are giving a 3 minute presentation. Word of the day "brevity".
lol

parsimony :p
 

hlsmith

Omega Contributor
#15
Maybe we need a stepwise model for writing a post. Rank statements - take the highest-ranked statement, rank statements, take the new highest-ranked statement, rank statements, ... Now make sure your post is not saturated. How many words per concept do you need? A long post has a high R^2 since it covers everything, but that is not corrected for length. This seems like a Trinker R program waiting to happen!
 
#17
Guys, thanks a lot for your replies. As I stated, I had already solved this problem (luckily), and I am going to share here what I did to solve it. It was actually pretty simple.

In the same Excel file I had attached in the lounge, you can see that I entered every variable (one by one) into a new model and checked what happened when it was entered. In the Excel file, I highlighted the severe and unfavorable changes to the coefficients in red and commented on them. Noetsi had mentioned the possibility that my beliefs are wrong (and that the model is actually pointing to the wrongness of the commonly held view [theory]). Although I agree that this can happen, it was not the case this time. The correct, uncompromised model was consistent with the commonly held view (except for one variable which had a surprising beta, but its beta was super consistent and I had already accepted that). In the Excel file I highlighted the model with the most desirable result. The problem started when some interactions were added to the model.

Therefore, all I did was pin down the problematic interactions. When I detected the first problematic interaction, I looked at its correlation matrix and verified that it had severe correlation with either or both of its component variables (or even with other variables). So it was confirmed to be a case of multicollinearity between that interaction and the two other variables. Then I removed that problematic interaction from the model and re-ran my code, but with the culprit interaction removed.

My code entered variables one by one. Another problematic interaction then emerged some blocks further on, so I excluded it and went on to hunt down the other ones with the same method. This way, I managed to exclude five interactions.

The nice point, which makes me super happy, is that each and every one of those problematic interactions that made the model "strange" had high or severe correlations with some of the variables, and the objective protocol allowed me to remove them for the sake of remedying multicollinearity.
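In case it helps anyone later, the protocol boils down to something like the following loop (a rough Python sketch; I actually did it step by step in my own software plus the Excel file, and the file name and interaction list here are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data.csv")  # hypothetical data with y and the masked predictors

base = "y ~ A + B + C + D + E"
candidates = ["A:B", "A:C", "B:C", "C:D", "D:E"]  # placeholder interaction terms

kept = []
prev = smf.logit(base, data=df).fit(disp=0)

for term in candidates:
    formula = base + " + " + " + ".join(kept + [term])
    trial = smf.logit(formula, data=df).fit(disp=0)

    # did adding this term flip the sign of any coefficient already in the model?
    common = prev.params.index.intersection(trial.params.index).drop("Intercept")
    flipped = np.sign(prev.params[common]) != np.sign(trial.params[common])

    if flipped.any():
        # a flagged term then gets its correlation matrix checked before removal
        print(f"{term} flips: {flipped[flipped].index.tolist()} -- excluded")
    else:
        kept.append(term)
        prev = trial

print("interactions kept:", kept)
```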

So there remained a model with almost no severe multicollinearity, and a shining result, for which I am beyond glad. Every beta is the way I want it. The coefficient of that single variable is still the opposite of what I expected, but after polishing the model for several days and obtaining many excellent results, I am now quite sure that the surprising coefficient is OK and I should change my mind according to this finding.

-----------------------------------------

@hlsmith, yeah, I can't be brief, but part of it is because English is not my first language and I have a very limited vocabulary and NO idioms to use.

Besides, I think when I need some help, I should ask for it properly. And I personally consider the way I did it "proper": by giving every detail the responder might need...

Many of us reply to posters with "Could you elaborate more on your problem, so that we could help you better?"... Sometimes being brief is good, but whenever being brief can end in such a request, I think it is better not to be brief in the first place.

-----------------------------------

Anyways, thanks all for taking time to kindly participate.
 

noetsi

Fortran must die
#19
One thing that confuses me, victor, is that you seem to be addressing multicollinearity and interaction as if they were the same thing. They aren't at all, as far as I know.

But then I am wrong a lot.... :p
 
#20
Greta, no. I didn't go into details. The protocol was to exclude the variables with correlations greater than 0.4, and that is an objective protocol. ;)

Noetsi, after trinker reminded me of this earlier in the thread, I didn't confuse the two anymore. I said there was multicollinearity between the variable D*E and the variables D, E, and C; between the variable C*D and the variables C and D; etc. Here the variable D*E is an interaction, and this interaction is collinear with D and E... I hope everything is OK, but if I was wrong, please let me know. :)