Doing correct backwards elimination with interaction terms?

#1
I'm wondering about this, and I can't find an answer anywhere. My textbook (Agresti and Finlay 2009) clearly explains how to do backwards elimination on a linear model, but I'm in a situation my textbook doesn't cover. Nor was I able to find anything on Google or on this forum about this specific question.

If we have a complete linear model with interaction terms, we are to remove the predictor with the highest P-value. That much is clear. Let's say X1 is that predictor, and we (have to) remove it. But what about the interaction terms, e.g. X1:X2? Should we keep the interaction term even though we removed predictor X1, or should we discard all interaction terms involving X1 too?

In other words, should we keep interaction terms if the original predictor isn't in the linear model anymore?
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
You keep the main effect terms that are included in the interaction term. So regardless of the p-values of X1 and X2, if you are keeping their interaction term in the model then you keep their main effects in the model.

If you absolutely have to do backwards elimination (say, for a class), you should try to create a rule in your programming to keep them in the model.
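
In R, for example, you don't have to write that rule yourself. A minimal sketch (mydata is a placeholder data set with columns Y, X1, X2): drop1() and step() already respect marginality, so they never offer to drop a main effect while its interaction is still in the model.

Code:
fit <- lm(Y ~ X1 * X2, data = mydata)  # X1 * X2 expands to X1 + X2 + X1:X2
drop1(fit, test = "F")                 # only X1:X2 is eligible for removal
step(fit, direction = "backward")      # backward elimination by AIC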
 
#3
So, if I understand correctly, this is a legit move:

Code:
fit <- lm(Y~X1+X2+X1:X2)
summary(fit)

Predictor:     Estimate:     P-value:
X1                 0.20        0.001***
X2                 0.10         0.75
X1:X2             -0.40        0.005**
From complete model (above) to reduced model (below):

Code:
reduced_fit <- lm(Y~X1+X1:X2)
summary(reduced_fit)

Predictor:     Estimate:     P-value:
X1                 0.60        0.007**
X1:X2             -0.40        0.003**
In other words, I can use the reduced model without problems?
 

hlsmith

Less is more. Stay pure. Stay poor.
#4
No, you use the top model, including X2 regardless of its significance. Also, don't attempt to interpret the main effects; they just get to come along for the ride.
 
#5
Thanks for the clarification!

However, I'm not sure I understand why I can keep X2. My more or less trusty textbook says that we really have to eliminate the insignificant predictors. Could you (or anybody) elaborate on why, or refer to something that does?
 
#6
I would strongly advise you to read the book by Aiken and West (1991), "Multiple Regression: Testing and Interpreting Interactions". Interactions are theory-based terms in the equation and should not be eliminated unless they fail certain conditions AND you properly probe (test) both the full and reduced models.

The model with interactions includes the main predictors, say X1 and X2, and the interaction term, say X1X2. When you estimate the coefficients, the effect of a main predictor is not "constant" anymore but "conditional": the coefficient of X1 is the effect of X1 where X2 = 0 (or at the mean of X2, if X2 is centered). You should also be clear about your research question - that is, what is the primary question you are trying to answer?
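
To see what "conditional" means in practice, here is a small simulated illustration (the data below is made up purely for demonstration):

Code:
set.seed(1)
X1 <- rnorm(200); X2 <- rnorm(200, mean = 5)
Y  <- 1 + 0.5*X1 + 0.3*X2 - 0.4*X1*X2 + rnorm(200)

X2c <- X2 - mean(X2)              # center X2
coef(lm(Y ~ X1*X2))["X1"]         # effect of X1 where X2 = 0
coef(lm(Y ~ X1*X2c))["X1"]        # effect of X1 at the mean of X2
The interaction coefficient is identical in both fits; only the "conditional" main effect moves.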

Read the book, I strongly recommend it :)
 
#7
Thanks for the clarification!

However, I'm not sure I understand why I can keep X2. My more or less trusty textbook says that we really have to eliminate the insignificant predictors. Could you (or anybody) elaborate on why, or refer to something that does?
What your book says about elimination could refer to eliminating insignificant interactions. However, not only should the interaction be insignificant, the share of explained variation attributable to that interaction should also be negligible (a nested-model comparison like the one sketched below). Plus, there are a couple of other theory-related nuances that must be considered before eliminating anything from the model.

As for the main predictors, you should not eliminate one just because it is insignificant, especially if theory says it should be there. An insignificant finding is also a finding, and there should be a reason for it.
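
A minimal sketch of such a comparison in R (the model objects and mydata are hypothetical):

Code:
full    <- lm(Y ~ X1 + X2 + X1:X2, data = mydata)
reduced <- lm(Y ~ X1 + X2,         data = mydata)
anova(reduced, full)                                   # F-test for X1:X2
summary(full)$r.squared - summary(reduced)$r.squared   # R^2 attributable to it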

P.S. For complex models I have not heard it called a "backward" procedure, but rather a "hierarchical step-down examination" (Aiken and West, 1991).
 

noetsi

No cake for spunky
#8
There is an iron rule that you never include an interaction term without its main effects. I assume the text is dealing with linear models generally, not the specific case of interactions, which has its own rules.

It might be noted that many statisticians strongly disagree with backward elimination, period; it has a variety of problems, including capitalizing on chance and not being robust across samples. Unless you have no theory at all, it is a bad way to enter variables into a model.
 
#9
Thanks kiton, I'll definitely pick up a copy of that book, because I'm pretty weary of Agresti and Finlay (2009) and its crash-course mentality and half-explanations. Unfortunately, I don't have the time to read a whole new book for what I'm working on.

I think I get it, but to be sure, would these moves be legit? Or should I do otherwise? Say we build the model progressively:

Code:
fit001 <- lm(Y~X1) # Adds first predictor, everything in order:
summary(fit001)

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.69631    0.27210  17.259  < 2e-16 ***
X1          -0.20985      0.063  -3.339    0.001 ** 
Code:
fit002 <- lm(Y~X1+X2) # Adds second predictor, everything in order:
summary(fit002)

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    6.025      0.389  15.507  < 2e-16 ***
X1            -0.187      0.061  -3.055    0.003 ** 
X2            -0.754      0.162  -4.659 6.47e-06 ***
Code:
fit003 <- lm(Y~X1+X2+X3) # Adds third predictor, which is insignificant; test for interaction:
summary(fit003)

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    6.019      0.389  15.473  < 2e-16 ***
X1            -0.196      0.062  -3.146    0.002 ** 
X2            -0.682      0.184  -3.701    0.000 ***
X3            -0.061      0.075  -0.817    0.415    
Code:
fit004 <- lm(Y~X1+X2+X3+X2:X3) # Adds interaction, which is significant; model is complete and final:
summary(fit004)

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.391      0.425  12.694  < 2e-16 ***
X1            -0.211      0.061  -3.482    0.001 ***
X2            -0.255      0.222  -1.151    0.252    
X3             0.749      0.260   2.885    0.004 ** 
X2:X3         -0.421      0.129  -3.251    0.001 ** 
So summing up: even though predictor X2 is insignificant, fit004 is good to go? It can be used as a final model without any problems?
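
To double-check each step, one could also compare the nested models sequentially, assuming all four fits use the same data set:

Code:
anova(fit001, fit002, fit003, fit004)  # F-test for the term(s) added at each step
The last row of that table tests the contribution of X2:X3 directly.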
 

noetsi

No cake for spunky
#10
Agresti has written a lot of books on econometrics. My sense from one of them was that it was not for someone new to econometrics (I don't know if you are). Half-explanations are common in many statistics texts, and this may be worse for econometrics, although obviously I have read only a handful. The material is very involved and covers such a wide range of methods that doing it justice is very hard. Perhaps because of this, or perhaps because the author assumes the reader already knows many details, they leave out a lot.

I think the wisest course is to come to econometrics after you have a deep background in statistics to start. :p
 
#11
There will be no need for you to read the whole book. It is well structured, and you will be able to look up exactly the kind of analysis you need (1-2 chapters).

As for your model, if your theory says that X1, X2, and X3 must be there, then they all should go in the model whether they are significant or not. Further, if there is a theory-based reason for X2*X3 to exist, then it must go in the model with the rest of the predictors, and in that case fit004 seems to be correct. Keep in mind that the interaction's effect is interpreted as "constant" in this case, while the effects of X1-X3 are "conditional" (check out the proper way to interpret them). There is a reason why X2 is insignificant - you need the book to understand it completely. But in short, there are values of X3 (say high, mean, and low, to keep it simple) at which X2 might be significant or insignificant; see the sketch below.
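
A minimal sketch of that probing in base R, using the fit004 object from above (the three X3 values are illustrative assumptions, not a rule):

Code:
b  <- coef(fit004)
V  <- vcov(fit004)
x3 <- c(low = -1, mean = 0, high = 1)   # illustrative X3 values

slope <- b["X2"] + b["X2:X3"] * x3      # conditional effect of X2 at each X3
se    <- sqrt(V["X2","X2"] + x3^2 * V["X2:X3","X2:X3"] + 2*x3*V["X2","X2:X3"])
cbind(slope, se, t = slope/se)
This is essentially what Aiken and West (1991) call a simple-slopes analysis.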
 

noetsi

No cake for spunky
#12
As for your model, if your theory says that X1, X2, and X3 must be there, then they all should go in the model whether they are significant or not.
I don't think there is any consensus on this at all. My best guess is that the final models of most practitioners and many academics in fact do not include variables found to be not statistically significant, even if their original models did. I assume it makes more sense to include them when you are testing theory than when using the model to predict.
 
#13
Thanks for the replies, everyone. I'm studying social science, and put briefly, we get dragged roughly into the world of statistics, only to be thrown out of it again. It's a true WTF-just-happened experience.

Maybe I should elaborate on the theory/method I'm using (I can't get into details, though). First, I establish a set of predictors that explains Y best, with as many predictors as possible (i.e., I make a working model). Second, I add a new predictor and evaluate its introduction into said model; this new predictor is X3. So introducing X2:X3 is not a theoretical requirement; it is just a step I want to take to make the overall revised model better (roughly as sketched below).
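
For what it's worth, that add-and-evaluate step can be written out in R roughly like this (a sketch; working and mydata are hypothetical stand-ins for my actual model and data):

Code:
working <- lm(Y ~ X1 + X2, data = mydata)
add1(working, scope = ~ . + X3 + X2:X3, test = "F")  # offers X3 first (marginality)
working2 <- update(working, . ~ . + X3)
add1(working2, scope = ~ . + X2:X3, test = "F")      # then the interaction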
 
#14
I don't think there is any consensus on this at all. My best guess is that the final models of most practitioners and many academics in fact do not include variables found to be not statistically significant, even if their original models did. I assume it makes more sense to include them when you are testing theory than when using the model to predict.
You are surely correct. There are, though, certain suggested procedures for eliminating terms from the equation, based on more than significance alone.
 

rogojel

TS Contributor
#15
hi,
a quick explanation of why you should keep the insignificant main effect:

A model without a main effect, say Y = a1*X1 + a3*X1*X2, will become one with both main effects if variable X1 is shifted by a constant, i.e. X1' = X1 + c: substituting X1 = X1' - c gives Y = -a1*c + a1*X1' - a3*c*X2 + a3*X1'*X2, which now contains an X2 main effect. A model that is only valid for one fixed way of measuring a variable is clearly not very satisfactory; that is why it is best to keep both terms. Then the form of the model is invariant, though the coefficients are clearly scale-dependent.
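
Here is a quick simulated demonstration of that point (the data is made up):

Code:
set.seed(1)
X1 <- rnorm(100); X2 <- rnorm(100)
Y  <- 2*X1 + X2 + 1.5*X1*X2 + rnorm(100)
X1s <- X1 + 10                          # shift X1 by a constant

summary(lm(Y ~ X1  + X1:X2))$r.squared  # omits the X2 main effect
summary(lm(Y ~ X1s + X1s:X2))$r.squared # fit changes under the shift
summary(lm(Y ~ X1  * X2))$r.squared     # full model
summary(lm(Y ~ X1s * X2))$r.squared     # invariant under the shift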

regards
 
#16
Having a good night's sleep and a new cup of coffee in my hand surely made an impact. I certainly learned something here. Thanks to everyone!