Variable selection in multiple regression based on the regression results themselves

#1
Hi everybody,
I have a question about multiple linear regression. Here is the case:
Y = dependent variable
x1, x2, x3 = independent variables
Suppose that I have no collinearity between x1, x2 and x3.

If I find x1 not statistically significant (p-value > alpha), can I remove x1 from my analysis and run the regression again with x2 and x3 only? Are the results of this second model valid?
Sometimes when I tried removing a variable in practice, strange results appeared. For example, in the starting model (with x1, x2 and x3) two of the predictors were statistically significant, but after removing x1 one of the remaining predictors became non-significant. Also, the adjusted R² changes, and in particular it decreases.
Can you help me better understand this phenomenon?
Thank you in advance.
Regards.

N.
 

JesperHP

TS Contributor
#4
You need to figure out whether your model is well specified by running the standard specification tests: autocorrelation, heteroscedasticity, and functional misspecification.

Assuming your model is well specified, an insignificant test result for x1 is a reason for removing the variable from the model (all the more so if its t value is numerically lower than those of the other variables). However, if you have strong theoretical arguments in favour of y depending on x1, you might still want to keep it in the regression model. The case is not clear-cut... several good introductory books on multiple regression are available.


R² (the unadjusted one) is supposed to decrease when you remove a variable; adjusted R² can move in either direction.
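In case it helps, here is a minimal sketch of those three checks in Python with statsmodels (the DataFrame `df` and the column names y, x1, x2, x3 are just placeholders for your data; `linear_reset` needs a reasonably recent statsmodels version):

```python
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import (acorr_breusch_godfrey,
                                          het_breuschpagan, linear_reset)

# Fit the full model; 'df' and the column names are placeholders for your data
model = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

# Heteroscedasticity: Breusch-Pagan test on the residuals
bp_lm, bp_p, _, _ = het_breuschpagan(model.resid, model.model.exog)

# Autocorrelation: Breusch-Godfrey test (mostly relevant for time-series data)
bg_lm, bg_p, _, _ = acorr_breusch_godfrey(model, nlags=2)

# Functional form: Ramsey RESET test using powers of the fitted values
reset = linear_reset(model, power=2, use_f=True)

print(f"Breusch-Pagan p = {bp_p:.3f}")
print(f"Breusch-Godfrey p = {bg_p:.3f}")
print(f"RESET p = {reset.pvalue:.3f}")
```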
 

noetsi

No cake for spunky
#5
If you are asking whether you can remove a variable that empirically shows no effect and rerun the model without it, there are differing opinions on that. It is commonly done, but some argue you should only remove it if you have a theoretical basis for doing so. One reason is that you are estimating the effect size from a sample, and another sample might give a different result.
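To illustrate that sampling point, here is a small made-up simulation (Python, numpy + statsmodels): the same weak true effect comes out significant in some samples and not in others.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def pvalue_of_x1(n=50, beta1=0.2):
    # One simulated sample with a weak true effect of x1 on y
    x1 = rng.normal(size=n)
    y = 1.0 + beta1 * x1 + rng.normal(size=n)
    res = sm.OLS(y, sm.add_constant(x1)).fit()
    return res.pvalues[1]

# Repeat the "study" on fresh samples: significance comes and goes
pvals = [pvalue_of_x1() for _ in range(10)]
print([round(p, 3) for p in pvals])
```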
 
#6
Sometimes I find x1, x2 and x3 statistically significant when considered separately. When I put them together in the same model they become statistically non-significant.
Is this a problem of specification?
Thanks.
regards.

N.
 
#7
Look at Lasso regression and the non-negative garrote.

Apparently, Ridge regression is crap but it achieves a similar goal.

The above are the statistical/data-based approaches to variable selection. The other approach, which is dominant in economics/finance, is to use theory to guide variable selection. In economics, the justification for using theory or qualitative reasoning to select variables is that economic variables are so noisy that you can easily get spurious relationships by chance.

I would prefer the statistical approach even though I never ever see it in financial econometrics.
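If you want to try the data-driven route, a minimal Lasso sketch with scikit-learn might look like this (the design matrix `X` and response `y` are placeholders for your data; the penalty strength is chosen by cross-validation):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Standardize the predictors first: the L1 penalty is scale-sensitive
X_std = StandardScaler().fit_transform(X)

# Cross-validated choice of the penalty
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)
print("alpha chosen by CV:", lasso.alpha_)
print("coefficients:", lasso.coef_)

# Predictors whose coefficients are shrunk exactly to zero are the ones the Lasso "drops"
selected = np.flatnonzero(lasso.coef_ != 0)
print("kept predictors (column indices):", selected)
```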

In addition, if you're doing time-series work you need to make sure your variables are I(0), using the KPSS, PP or ADF tests. For the ADF you need to use the SBIC to guide lag-length selection, and make sure you try a trend and a constant as well as no constant and no trend, etc.
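For those stationarity checks, a quick sketch with statsmodels (assuming a pandas Series called `series`; what statsmodels calls BIC is the SBIC mentioned above, and recent versions use "n"/"c"/"ct" for no constant, constant, and constant plus trend):

```python
from statsmodels.tsa.stattools import adfuller, kpss

# ADF: H0 = unit root (non-stationary); lag length chosen by BIC
for regression in ("n", "c", "ct"):   # no constant, constant, constant + trend
    adf_stat, adf_p, *_ = adfuller(series, regression=regression, autolag="BIC")
    print(f"ADF ({regression}): p = {adf_p:.3f}")

# KPSS: H0 = stationary, so the null is reversed relative to the ADF test
kpss_stat, kpss_p, *_ = kpss(series, regression="c", nlags="auto")
print(f"KPSS: p = {kpss_p:.3f}")
```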
 

JesperHP

TS Contributor
#8
Sometimes I find x1, x2 and x3 statistically significant when considered separately. When I put them together in the same model they become statistically non-significant.
Is this a problem of specification?
Not in general, and therefore: no. But that being said, there could be a problem of misspecification. (And I know this is so imprecise as not to be helpful, but you cannot expect me to lie :) )

Figure out what happens to the variance of your OLS (beta) estimators in a multiple linear regression as you add variables (this should be explained in any introductory book on multiple linear regression). Even though you "suppose that I have no collinearity between x1, x2 and x3", that presumably only rules out perfect collinearity; there can still be multicollinearity, so read about that problem. It is related to the variance of your estimators, which in turn is related to the outcome of your t- and F-tests.
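To make that concrete, here is a small simulation with fabricated data (Python, numpy + statsmodels): two strongly correlated predictors each look significant on their own, but in the joint model the individual t-tests lose significance even though the overall F-test still rejects.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100

# Two highly correlated predictors that both (mildly) affect y
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)
y = 1.0 + 0.3 * x1 + 0.3 * x2 + rng.normal(size=n)

for name, X in [("x1 alone", x1), ("x2 alone", x2),
                ("x1 and x2", np.column_stack([x1, x2]))]:
    res = sm.OLS(y, sm.add_constant(X)).fit()
    print(name, "slope p-values:", np.round(res.pvalues[1:], 3),
          " overall F p-value:", round(res.f_pvalue, 4))
```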

2) Regarding misspecification:
Again, my best advice is probably to pick up a book about multiple linear regression in general. I study economics and have found Wooldridge's Introductory Econometrics to be a fine, practically oriented introduction to multiple linear regression.

Otherwise I would advise you to state more clearly what you are trying to achieve, rather than stating the problem with generic variables x1, x2, x3: what kind of data you are working with, what theoretical hypotheses or simple common-sense expectations you are trying to test, and, often most importantly, assuming this is school-related: what are the teacher's expectations?

That being said, one of the basic assumptions of multiple linear regression is that the model is linear in the parameters, but this does not mean linear in x1, x2, x3 (see the short sketch after this list):
There can be interaction effects (as suggested by hlsmith in #2): the effect of x1 might be higher when x2 is higher.
There can be an increasing or decreasing marginal effect: the effect of x1 might be higher for high values of x1 than for low values of x1.
There could be other important variables missing, resulting in an autocorrelated error term and, worse, bias and inconsistency of the estimators.

and the list goes on...
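As a concrete version of the interaction and curvature points above, a sketch using the statsmodels formula interface (again, `df` and the column names are placeholders):

```python
import statsmodels.formula.api as smf

# Interaction: the effect of x1 is allowed to depend on the level of x2
m_interact = smf.ols("y ~ x1 * x2 + x3", data=df).fit()

# Curvature: a quadratic term lets the marginal effect of x1 change with x1 itself
m_quad = smf.ols("y ~ x1 + I(x1 ** 2) + x2 + x3", data=df).fit()

print(m_interact.summary())
print(m_quad.summary())
```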

If your work is school-related and you're not expected to know about this stuff, the solution is simply to forget you ever heard about it.

My reason for not giving you a better answer is simply that, to be able to understand misspecification tests, you need to understand how multiple linear regression works. This cannot be explained in a simple way in a few lines and is much better explained by people more knowledgeable on the subject than I am, hence the referral to books. The same goes for the misspecification tests themselves...
 

noetsi

No cake for spunky
#9
Ridge Regression intentionally biases the estimates. It is harshly criticized for doing so. I would be cautious about using it.