Regression Methods

#1
Hello,
I have 1 dependent variable and 30 independent variables. I used the forward method to select the important variables. The adjusted R-squared value was very high (close to 1!) and I thought maybe something was wrong with the method. So, according to the forward-method results, I selected only the significant independent variables and ran the regression analysis again with the enter method. But this time the adjusted R-squared was low (about 0.4) and only one of the IVs was significant. I'm confused by these results. I appreciate any help.
 
#2
We would love to hear more information (sample size, etc.). How many of the variables selected by the computer (in your perfectly fitted model) were entered into the model with the adjusted R-squared of 0.4? Also note that the forward selection method does not simply carry the significant variables into the next model; it has its own entry criteria, which are not exactly what you did manually when trying to replicate what the computer does. So, if you enter the same independent variables as in the last model of the forward selection into your manually specified model, the result should be the same as the computerized forward selection (again an adjusted R-squared of about 1.0).

The adjusted R-squared indicates how well your model fits the data, so when you enter more variables, it is possible that your model becomes better as a whole and the adjusted R-squared increases. However, when you prune some of the variables that were non-significant, your model, which depended on them too, becomes weaker at explaining your data, and thus your adjusted R-squared decreases. So it is possible to have an adjusted R-squared close to 1 while the model is very complicated and of little or no practical use (despite being very accurate).
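For reference, the standard formula (with n = sample size and k = number of predictors) makes this trade-off explicit:

adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)

Each added variable costs a degree of freedom, so the adjusted R-squared only rises if the gain in R^2 outweighs that cost.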
 
#3
Dear victorxstc,
In my research n=32. I chose the model with appropriate VIFs and t-test results for each variable in the forward regression. This model kept only 3 of my 30 IVs, with an adjusted R-squared of 1. Then I wanted to test what would happen if I used only these 3 IVs in a new regression analysis with the enter method. But the results weren't the same as the best model selected in the forward regression, and according to the t-tests only one of these 3 IVs was significant. In addition, the adjusted R-squared decreased to 0.4.
 

Englund

TS Contributor
#4
Also, your adjusted R2 will be highly inflated since you've got so many variables. If n-k is small, the inflation of the adjusted R2 will be severe! Next I'll show a simple simulation giving the 95th percentile of the adjusted R2, based on 100 simulations with 30 independent, non-intercorrelated variables and 50 observations: the 95th percentile was estimated to be 0.41 with the forward selection method and 0.61 with backward selection.

The inflation of the adjusted R2 is not as bad if n-k is large, though. When I performed the same simulation as before but with n=300 and k=30, I got 0.06 for both methods.
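For anyone who wants to reproduce this, here is a rough sketch of such a simulation in Python; the 0.05 p-to-enter threshold and the numpy/statsmodels implementation are assumptions, and the exact entry criterion may change the numbers somewhat:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def forward_adj_r2(X, y, p_enter=0.05):
    """Forward selection: at each step, add the candidate predictor with
    the smallest p-value, as long as that p-value is below p_enter.
    Returns the adjusted R-squared of the final model."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {}
        for j in remaining:
            fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            pvals[j] = fit.pvalues[-1]          # p-value of candidate j
        best = min(pvals, key=pvals.get)
        if pvals[best] >= p_enter:
            break
        selected.append(best)
        remaining.remove(best)
    if not selected:
        return 0.0                              # no variable entered
    return sm.OLS(y, sm.add_constant(X[:, selected])).fit().rsquared_adj

n, k, n_sims = 50, 30, 100
results = []
for _ in range(n_sims):
    X = rng.standard_normal((n, k))   # 30 independent, uncorrelated IVs
    y = rng.standard_normal(n)        # pure noise: the true R-squared is 0
    results.append(forward_adj_r2(X, y))

print("95th percentile of adjusted R2:", np.percentile(results, 95))
```

Even though y is unrelated to every predictor, picking the "best" of 30 candidates at each step lets noise masquerade as fit, which is exactly the inflation described above.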

I would strongly recommend against using forward, backward, or all-subsets regression, especially if n-k is small!
 
#6
Dear victorxstc,
In my research n=32. I chose the model with appropriate VIFs and t-test results for each variable in the forward regression. This model kept only 3 of my 30 IVs, with an adjusted R-squared of 1. Then I wanted to test what would happen if I used only these 3 IVs in a new regression analysis with the enter method. But the results weren't the same as the best model selected in the forward regression, and according to the t-tests only one of these 3 IVs was significant. In addition, the adjusted R-squared decreased to 0.4.
You mean you have two exactly identical models with the same three independent variables (one selected through a forward-selection process, and the other an exact replication of the former), plus the intercepts in both models, but their properties like the adjusted R-squared differ? My only guess is that something is not exactly the same in your manual replication. Have you entered any interactions? Or manipulated the intercept? Perhaps the defaults of your statistical program differ between the Enter and Forward Selection methods, so it might be that the two runs are not exactly the same. If the models were completely identical (regardless of how the variables had been selected), the adjusted R-squared would be the same too. Btw, which software do you use?
 
#7
Dear Englund,
According to what you said, it is obvious that my n-k is very small! (n=32 and k=30). So should I omit some of my IVs and do my analysis again? What method should I use this time? I'm an amateur in statistics and don't know exactly what to do. I appreciate your help...
 
#8
Dear victorxstc
Yes, that's exactly what I did... I wonder what is different between these two analyses. I'm an amateur in statistics, and I use SPSS...
 
#9
Dear Englund,
According to what you said, it is obvious that my n-k is very small! (n=32 and k=30). So should I omit some of my IVs and do my analysis again? What method should I use this time? I'm an amateur in statistics and don't know exactly what to do. I appreciate your help...
But you were saying you had entered only 3 IVs (both in the final model of the forward-selection regression and in your manually specified model), right? So your k must be 3, not 30? :confused: Please, if possible, post the SPSS output for both models, so we can see how many and which variables each model incorporates. I believe that if the two models were exactly the same, they would have exactly the same R-squared values, regardless of the procedure that had been used to select the variables.
Besides, you don't seem like an amateur at all. :)
 
#10
Sorry! I mean that in the forward regression I had 30 IVs and the third model kept 3 important IVs, and in the other analysis I entered only these 3 IVs... I expected to get the same results. Anyway, I've attached my results. :)
 

Englund

TS Contributor
#11
Dear Englund,
According to what you said, it is obvious that my n-k is very small! (n=32 and k=30). So should I omit some of my IVs and do my analysis again? What method should I use this time? I'm an amateur in statistics and don't know exactly what to do. I appreciate your help...
Well, if you decide to use a forward/backward selection method, I'd strongly suggest that you evaluate your model on hold-out data (data that was not used when estimating your model). As said before, my simulation showed that the 95th percentile for the adjusted R2 was 0.41 with the forward selection method, even though the expected adjusted R2 was 0. Thus: a highly inflated adjusted R2.

I think 'data mining' methods such as these selection methods are very dangerous to use if you do not know about the side effects. If your model gives a high adjusted R2, then it fits your sample well, but possibly not out-of-sample data at all (given that the adjusted R2 has a low expected value, as it did in my simulation, for example).
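As a sketch of what I mean by hold-out evaluation (Python again; the split point, the pure-noise data, and the 'selected' column indices are all hypothetical, for illustration only):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Pure-noise data, so any apparent fit is overfitting.
n, k = 50, 30
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)

# Hold out part of the sample BEFORE any variable selection.
train, test = slice(0, 35), slice(35, None)

# Pretend forward selection picked these columns on the training rows
# (hypothetical indices).
selected = [2, 11, 23]

fit = sm.OLS(y[train], sm.add_constant(X[train, selected])).fit()
print("in-sample adjusted R2:", fit.rsquared_adj)

# Out-of-sample R2 on the held-out rows.
pred = fit.predict(sm.add_constant(X[test, selected]))
resid = y[test] - pred
ss_res = resid @ resid
ss_tot = ((y[test] - y[test].mean()) ** 2).sum()
print("out-of-sample R2:", 1 - ss_res / ss_tot)
```

If the in-sample fit is real, it should survive on the held-out rows; with selection on noise it typically collapses (the out-of-sample R2 can even be negative).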
 
#12
Sorry! I mean that in the forward regression I had 30 IVs and the third model kept 3 important IVs, and in the other analysis I entered only these 3 IVs... I expected to get the same results. Anyway, I've attached my results. :)
That is indeed strange. The only thing coming to my mind is that your data differ between the two models. I guess you have some missing data in variables other than the three selected ones (agrirangzland, urbPSCV, urbzland). So the forward selection method excluded some of the cases (rows) due to missing values in the other variables. Once you entered only these three variables, the missing cells in the other columns did not interfere with the regression algorithm, so SPSS did not exclude any rows because of them.

Perhaps the reduced n in the forward-selection run (due to listwise deletion of rows with missing values in any of the 30 variables) contributed to the inflated adjusted R-squared of your forward-selection final model; a tiny sketch of this mechanism follows below.
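Here is the sketch (the data frame and variable names are made up; dropna mimics the listwise deletion that SPSS applies by default):

```python
import numpy as np
import pandas as pd

# Made-up data: 5 cases, with missing values only in iv2.
df = pd.DataFrame({
    "dv":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "iv1": [0.1, 0.4, 0.2, 0.9, 0.5],
    "iv2": [1.2, np.nan, 0.7, np.nan, 0.3],
})

# Forward selection starts from ALL candidate IVs, so SPSS listwise-
# deletes every row with a missing value in ANY of them: n drops to 3.
print(len(df.dropna(subset=["iv1", "iv2"])))   # -> 3

# Rerunning with only the complete IVs keeps every row: n stays 5,
# so the two models are fitted on different data.
print(len(df.dropna(subset=["iv1"])))          # -> 5
```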

I can't think of anything else in your case. Please check and update us too. :)
 
#13
Dear victorxstc and Englund,
Thanks a lot for following my posts and for your helpful comments and suggestions. I will check my data and try to solve the problem based on your comments. :)
 
#14
Dear victorxstc,
You were completely right! I checked my data and yes, there are missing data in the other variables; I didn't know that forward selection would exclude those cases and reduce my n!
Again, thank you very much. :)
 

noetsi

Fortran must die
#16
I am confused by the comment that the adjusted R-squared was inflated due to having so many variables. My understanding is that the point of the adjusted R-squared is to avoid getting higher values merely by adding variables (in contrast to the regular R-squared).