Why does the coefficient change sign when another variable is added to the OLS model?

#1
Dear all,

I am trying to run an OLS regression in Stata 13, with the log of per capita calories as my dependent variable and the age and years of education of the household head, plus log per capita expenditure, as my independent variables (other controls to be added eventually). When I run the regression with just age and education as control, they are significant and positive. However, as soon as I add log per capita expenditure, education becomes negative and significant.

I am puzzled by this result, since the literature on calorie consumption argues that education of the household head has a positive impact. I understand that education of the household head might reflect a "wealth" effect, but the correlation coefficient between education and expenditure is not that large. I have posted my regression results below, along with summary statistics and pairwise correlations. Could someone help me understand what is going on here? I realize that this sort of problem might (or might not) be overcome using techniques other than OLS, but I have just started learning OLS and would like to understand how to deal with this in OLS, or at least know why OLS cannot deal with it.

Thanks,

Monzur


Code:
. regress log_pccal age_hhhead eduy_hhhead [pw=hhweight], r

Linear regression                                      Number of obs =    3355
                                                       F(  2,  3352) =  105.40
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.0692
                                                       Root MSE      =  .25583

------------------------------------------------------------------------------
             |               Robust
   log_pccal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  age_hhhead |   .0049182   .0003602    13.65   0.000      .004212    .0056244
 eduy_hhhead |   .0075136   .0011997     6.26   0.000     .0051613    .0098659
       _cons |   7.537586   .0171067   440.62   0.000     7.504045    7.571126
------------------------------------------------------------------------------

. regress log_pccal age_hhhead eduy_hhhead log_pcexp [pw=hhweight], r

Linear regression                                      Number of obs =    3355
                                                       F(  3,  3351) =  601.38
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.4123
                                                       Root MSE      =  .20332

------------------------------------------------------------------------------
             |               Robust
   log_pccal |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  age_hhhead |    .001919   .0002945     6.52   0.000     .0013415    .0024964
 eduy_hhhead |  -.0082508    .001044    -7.90   0.000    -.0102977   -.0062039
   log_pcexp |   .3777407   .0100402    37.62   0.000     .3580552    .3974262
       _cons |   4.795607   .0730719    65.63   0.000     4.652337    4.938877
------------------------------------------------------------------------------

. estat vif

    Variable |       VIF       1/VIF  
-------------+----------------------
   log_pcexp |      1.20    0.832228
 eduy_hhhead |      1.16    0.863121
  age_hhhead |      1.07    0.930743
-------------+----------------------
    Mean VIF |      1.14


. su log_pccal eduy_hhhead log_pcexp, d

                          log_pccal
-------------------------------------------------------------
Obs                3698
Mean           7.783589
Std. Dev.       .276406
Variance       .0764003
Skewness       .0350145
Kurtosis       3.511389

            years of education of household head
-------------------------------------------------------------
Obs                3698
Sum of Wgt.        3698
Mean           2.984857
Std. Dev.      3.776812
Variance       14.26431
Skewness       .9461994
Kurtosis       2.751041

              log of hh per capita expenditure
-------------------------------------------------------------
Obs                3698
Sum of Wgt.        3698
Mean           7.762185
Std. Dev.      .4636838
Variance       .2150027
Skewness       .4395734
Kurtosis       3.433132

. pwcorr log_pccal age_hhhead eduy_hhhead log_pcexp, sig

             | log~ccal age_hh~d eduy_h~d log_pc~p
-------------+------------------------------------
   log_pccal |   1.0000
             |
             |
  age_hhhead |   0.2282   1.0000
             |   0.0000
             |
 eduy_hhhead |   0.0855  -0.1133   1.0000
             |   0.0000   0.0000
             |
   log_pcexp |   0.6401   0.1796   0.3254   1.0000
             |   0.0000   0.0000   0.0000
             |
 

spunky

Smelly poop man with doo doo pants.
#2
Re: Why does the coefficient change sign when another variable is added to the OLS mo

When I run the regression with just age and education as control, they are significant and positive. However, as soon as I add log per capita expenditure, education becomes negative and significant.
perhaps you're dealing with a suppressor effect?
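
One rough way to check this in Stata (a sketch, not something taken from the posts in this thread): for OLS on the same sample with the same weights, the education coefficient in the short regression equals its coefficient in the long regression plus the log_pcexp coefficient times education's coefficient in an auxiliary regression of log_pcexp on age and education. The variable names are the OP's; the scalar names and the `insample` marker are made up for illustration.

```stata
* Long regression, as posted by the OP
regress log_pccal age_hhhead eduy_hhhead log_pcexp [pw=hhweight], r

* Hypothetical marker so the auxiliary regression uses the same 3355 observations
generate byte insample = e(sample)
scalar b_edu_long = _b[eduy_hhhead]
scalar b_exp      = _b[log_pcexp]

* Auxiliary regression: the added variable on the original regressors
* (only the point estimates are needed here, so no robust option)
regress log_pcexp age_hhhead eduy_hhhead [pw=hhweight] if insample
scalar delta = _b[eduy_hhhead]

* Short-regression education coefficient implied by the identity
display scalar(b_edu_long) + scalar(b_exp)*scalar(delta)
```

If the identity holds, the displayed value should match the .0075136 from the first regression up to rounding. Working backwards from the posted output, the auxiliary coefficient would have to be about (.0075136 + .0082508)/.3777407 ≈ .0417, i.e. each extra year of education goes with roughly 4% higher per capita expenditure at a given age. That positive link, multiplied by the large expenditure coefficient, is what pushes the education coefficient above zero when expenditure is left out.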
 

noetsi

Fortran must die
#3

You might have multicollinearity, or possibly a moderator effect (where one IV influences the impact of another IV on the DV). I do not know how to test for moderator effects (I don't generally work with moderators), but you can check for multicollinearity by looking at the VIFs. If memory serves, a change in sign when you add a variable is often a sign of one of these effects. This is an example of how multivariate and univariate relationships can be very different.
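
For what it's worth, the `estat vif` output in the first post already speaks to the multicollinearity question. The VIF of regressor j is 1/(1 - R_j^2), where R_j^2 comes from regressing that variable on the other regressors, so the 1/VIF column can be read off directly. This is only a worked reading of the posted numbers; nothing new is estimated.

```latex
\begin{align*}
\mathrm{VIF}_j &= \frac{1}{1 - R_j^2}
  \quad\Longrightarrow\quad R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j} \\
R^2_{\text{log\_pcexp}}   &= 1 - 0.832228 = 0.167772 \\
R^2_{\text{eduy\_hhhead}} &= 1 - 0.863121 = 0.136879 \\
R^2_{\text{age\_hhhead}}  &= 1 - 0.930743 = 0.069257
\end{align*}
```

So the other regressors explain at most about 17% of the variance of any one regressor. VIFs this close to 1 are well below the commonly quoted rule-of-thumb cutoffs (5 or 10), so severe multicollinearity looks unlikely to be the explanation here.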
 

spunky

#5

Is a suppressor and moderator effect essentially the same thing (or perhaps a suppressor effect is one example of a moderator effect)?
they're different but related things... a moderator could be a suppressor, but not all suppressors are moderators. these people do a pretty good job of untangling the whole thing:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2819361/
 

spunky

#6

JEEBEZUZ! just look at the change in the fit of the model!

without the suppressor variable (log_pcexp) your R-squared is 0.0692... so basically zero. but adding your suppressor variable makes the R-squared jump to 0.4123!!!

my money's on the suppressor effect
 

noetsi

#7

JEEBEZUZ! just look at the change in the fit of the model!

without the suppressor variable (log_pcexp) your R-squared is 0.0692... so basically zero. but adding your suppressor variable makes the R-squared jump to 0.4123!!!

my money's on the suppressor effect
Or very few cases :p

Seriously, with 3355 cases that won't be occurring. With a very small sample size you can significantly increase R-squared simply by adding more variables, especially if you already have a lot in the model.
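
A back-of-the-envelope check of that point, using the standard adjusted R-squared formula. The posted output does not show an adjusted R-squared, so this is just arithmetic on the reported numbers, taking n = 3355 observations and k = 3 regressors:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
          = 1 - (1 - 0.4123)\,\frac{3354}{3351}
          \approx 0.4118
```

The penalty for the number of regressors is only about 0.0005, so the jump from 0.0692 to 0.4123 cannot be put down to simply having more variables in the model.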
 

spunky

#8

Or very few cases :p
nope, it's definitely suppression. towards the end the OP provides the correlation matrix among the variables. education and log calories are positively correlated (r = 0.0855), but the regression weight on education changes to the opposite sign (-.0082508) in the presence of the suppressor
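
The sign flip can be reproduced from the posted correlation matrix alone. As a rough sketch that ignores age and the sampling weights (so the number will not match the weighted regression exactly), the standardized coefficient on education in a regression of log calories (y) on education (edu) and log expenditure (exp) is:

```latex
\beta_{edu}
  = \frac{r_{y,edu} - r_{y,exp}\, r_{edu,exp}}{1 - r_{edu,exp}^{2}}
  = \frac{0.0855 - 0.6401 \times 0.3254}{1 - 0.3254^{2}}
  = \frac{0.0855 - 0.2083}{0.8941}
  \approx -0.137
```

The numerator is negative because education's raw correlation with calories (0.0855) is smaller than what it would pick up purely through its link with expenditure (0.6401 x 0.3254 ≈ 0.208). Once expenditure is held fixed, what is left of the education effect is negative, which is the pattern in the OP's second regression.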