Regression: Best Subsets

#1
Hi,

I need help finding the most significant, non-correlated variables in SAS.

The Current Approach:
In the "Analysis" section, I am currently going line by line to find which variables are correlated. I have 187 variables, so that is 2^187 (=1.961e56) combinations I have to go through. I am using the Pearson correlation coefficient to find multicollinearity between variables, and I am eliminating the ones with a correlation > .80.

Ideally I would like to:
1.) Eliminate all of the correlation between the 187 variables
2.) With the non-correlated variables, I would like to find the most significant, best subset of variables to run in a regression

This is taking too long, and I know there is a more mathematical approach to solve this. That is why I am reaching out to people much smarter than myself :)
Thanks for any help!
 

ledzep

Point Mass at Zero
#2
187 is a lot of variables, and checking the correlations one by one by hand for all possible pairs (187 × 186 / 2 = 17,391 of them) is not practical.
You would want to screen those variables first and come up with a reasonable number of them, so that the variance of the predicted values won't go through the roof.

I cannot remember whether all-possible-regressions will give you a final model or not (and it requires all the variables to be quantitative).
I am pretty sure that if you use a sequential method instead of all-possible-regressions, SAS will give you a final model. However, this comes with a warning: stepwise is a bit dicey.

Code:
proc reg data=mydata;
    /* "predictors" is a placeholder for your list of candidate variables */
    model y = predictors / selection=stepwise;
run;
You may want to look at "best subsets" (all possible) regressions in SAS too.
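As a rough sketch of what a best-subsets run could look like in PROC REG: the dataset name mydata, the response y and the short predictor list x1-x10 are placeholders, not names from the original question. SELECTION=ADJRSQ ranks subsets by adjusted R-square, BEST= caps how many models are reported, and CP adds Mallows' Cp for each one.

```sas
/* Sketch only: mydata, y and x1-x10 are placeholder names. */
proc reg data=mydata;
    /* Rank candidate subsets by adjusted R-square, report the */
    /* top five, and print Mallows' Cp alongside each model.   */
    model y = x1-x10 / selection=adjrsq cp best=5;
run;
quit;
```

With all 187 candidates an exhaustive subset search is unlikely to be practical, so this probably only makes sense after the screening step mentioned above.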

I don't know of any other approach, but there must be a more efficient method. It would be good to have the opinions of fellow TSers.

If you have too many correlated variables, then Principal Components Regression may be your friend.
 

jrai

New Member
#5
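To get the correlation for every pair of variables, use the following code. (There are 187 × 186 / 2 = 17,391 distinct pairs; 1.961e56 is the number of possible subsets, not the number of correlations.) Say your variables are named v1 to v187 and stored in the dataset work.orig.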

Code:
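/* Keep only the correlation rows of the PROC CORR output */
proc corr data=orig outp=test(where=(lowcase(_type_)="corr")) noprint;
    var V1--V187;
run;

/* One row per pair; NAME= stores the second variable's name in its */
/* own column (other_var), separate from the _NAME_ BY variable     */
proc transpose data=test out=test1(where=(correlation1<=0.8))
               prefix=correlation name=other_var;
    by _name_ notsorted;
    var V1--V187;
run;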
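This will give you all the correlations less than or equal to 0.8 in the dataset work.test1. Note that the filter is on the signed value, so strongly negative correlations are kept as well; filter on abs(correlation1) if you want to screen on magnitude. You can then work with that dataset to make your selections. SAS is powerful enough to handle this.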
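If what you actually want is the list of pairs to worry about, here is a minimal sketch of one way to pull them out. It assumes the same setup as above (numeric variables v1 to v187 in work.orig); the dataset names corrmat and high_pairs and the columns var1, var2 and r are made up for illustration.

```sas
/* Sketch only: assumes work.orig holds numeric variables v1-v187. */
proc corr data=orig outp=corrmat(where=(lowcase(_type_)="corr")) noprint;
    var v1-v187;
run;

/* One row per distinct pair whose absolute correlation exceeds 0.8. */
data high_pairs(keep=var1 var2 r);
    set corrmat;
    array c{*} v1-v187;
    length var1 var2 $32;
    var1 = _name_;
    /* Row _n_ of the matrix belongs to the _n_-th variable, so   */
    /* starting at _n_ + 1 scans the upper triangle only and each */
    /* pair is written once.                                      */
    do j = _n_ + 1 to dim(c);
        if abs(c{j}) > 0.8 then do;
            var2 = vname(c{j});
            r = c{j};
            output;
        end;
    end;
run;
```

From high_pairs you can decide which member of each pair to drop; that choice is a judgment call and the sketch does not make it for you.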

Let me know how it goes.
 

edi

New Member
#6
Hi,

Why don't you try a principal component analysis (PCA) to reduce the dimension of the data set? Once the principal components (PCs) are identified, run the regression on the PCs and see how it works. Hope this helps.
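A minimal sketch of how that could look in SAS, assuming the data sit in a dataset called mydata with response y and predictors v1 to v187. Those names, the output dataset pcscores and the choice of 10 components are all placeholders; how many components to keep is something you would have to decide from the eigenvalues.

```sas
/* Sketch only: mydata, y, v1-v187 and n=10 are placeholders. */

/* Compute the first 10 principal components (from the correlation */
/* matrix, the default) and save the scores as Prin1-Prin10.       */
proc princomp data=mydata out=pcscores n=10 noprint;
    var v1-v187;
run;

/* Regress the response on the component scores. */
proc reg data=pcscores;
    model y = prin1-prin10;
run;
quit;
```

The scores are uncorrelated with each other, which takes care of the multicollinearity, but each component is a mix of all the original variables, so the coefficients are harder to interpret.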

Cheers!