Regression diagnostics with proc glm or proc reg

#1
I fit my model using proc glm, but now it seems that proc reg should be used for the diagnostics. So, do I need to refit the model with proc reg and create dummy variables (which proc glm avoided), or can the diagnostics be done with proc glm?
 
#3
What diagnostics are you referring to in particular?
Outliers, leverage, Cook's D, multicollinearity (VIF), and whatever else needs to be checked in multiple linear regression (continuous outcome; both categorical and continuous predictors; interactions present).
 

jrai

New Member
#4
1) VIF can be obtained from the tolerance statistic: tolerance = 1/VIF, and it is given by the TOLERANCE option in the MODEL statement of Proc GLM.

2) Cook's D can be written to an output dataset using the COOKD= option in the OUTPUT statement of Proc GLM.

3) If you still need to estimate the model using Proc Reg, then you'll have to create dummies, and if you want the same results the coding has to be done the way Proc GLM does it; otherwise the coefficients will be different.

Proc Reg does give more diagnostic statistics than Proc GLM.
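
For point 3, a minimal sketch of what the dummy coding might look like for the model posted later in this thread (the level values 'F', 1 and 2 are assumptions; Proc GLM uses the last level of a CLASS variable as the reference, so the dummies below do the same):

Code:
data statsclue2;
set statsclue;
gender_f = (gender = 'F'); *1 if female, 0 otherwise - assumed coding;
drug1 = (drugcategory = 1); *last drug category acts as the reference;
drug2 = (drugcategory = 2);
gf_volume = gender_f * volume; *gender*volume interaction term;
run;

proc reg data=statsclue2;
model outcome = gender_f drug1 drug2 volume gf_volume / vif tol;
run;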
 
#5
Thanks so much! Is there anything else (e.g. residuals, leverage, outliers) that I could do with glm? And could you specify the code a little more? For example:

proc glm data=statsclue;
class gender drugcategory;
model outcome=gender drugcategory volume gender*volume;
run;

Where do I put the tolerance thing? Thanks again, this was really helpful.

EDIT: I just put tolerance in the right place (it would help to know about the others, e.g. residuals, Cook's D) and got an impossible number of dummies in the output. I know I should be worried about VIF > 10. What tolerance value should I be worried about? Also, would it be Type 1 or Type 2 tolerance (the output shows these two types)?

And a somewhat unrelated question, for the output with glm: should I read the Type 1 SS or the Type 3 SS to decide which variables to keep in the model? (The p-values for the two outputs aren't always the same.) Thanks a LOT.
 

jrai

New Member
#6
I know I should be worried about vif>10. What tolerance value should I be worried about?
If your criterion is VIF > 10, then the tolerance cutoff should be 0.1. Anything less than 0.1 will indicate multicollinearity.

Also, would it be type 1 or type 2 tolerance (the output shows these two types)?
Type 2 is the same as the Tolerance and corresponding VIF output from Proc Reg. Therefore, I prefer using Type 2.

for the output with glm, should I read the type 1 SS or type 3 SS to decide about which variables to keep in the model? (the p-values for the two outputs aren't always the same).
Any of these can be used if you understand what they are testing. Usually Type 3 makes more intuitive sense. Say you have 3 IVs: a, b & c. The Type 3 statistic for a is calculated by estimating the equation with the intercept, b & c, i.e. excluding a. Therefore, it gives the additional effect of variable a. If the p-value for a comes out insignificant, then the equation can safely be estimated without a.

Type 1 is sequential testing. Say you specified the IVs as b, c & a in the MODEL statement, in that order. Type 1 will then fit the models in sequence: intercept first, followed by intercept + b, followed by intercept + b + c, and so on. Because of this sequential structure it is less intuitive.
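
To see the Type 1 order dependence concretely, a minimal sketch with the variables from this thread: the two runs below should give different Type 1 SS for volume but identical Type 3 SS.

Code:
proc glm data=statsclue;
class drugcategory;
model outcome = volume drugcategory; *Type 1 SS for volume: volume enters first;
run;

proc glm data=statsclue;
class drugcategory;
model outcome = drugcategory volume; *Type 1 SS for volume: now adjusted for drugcategory;
run;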

would help to know about others, eg. residuals, cooks D
Code:
proc glm data=statsclue;
class gender drugcategory;
model outcome=gender drugcategory volume gender*volume/ tolerance;
output out=statsclue1 cookd=cooks_statistics_in_this_var r=residuals_in_this_var;
run;
 
#7
Thanks! I'm not too sure about the output line. Do you mean: output out=statsclue1 cookd=cooks volume r=residuals volume; ?
Basically, I'm not sure what's meant to be written in place of cooks_statistics_in_this_var.

In the context of tolerance, I got values for the different levels instead of the whole variable, e.g. the tolerance for 2 of the drug categories (the 3rd would be the referent, I guess) was <.1. I didn't get a tolerance for drugcategory as a whole. It was also <.1 for some interactions and levels of interactions. What does this mean? Should these be excluded from the model?
 

jrai

New Member
#8
Basically, I'm not sure what's meant to be written in place of cooks_statistics_in_this_var.
This would be replaced by the variable name in which you want to store the results.

In the context of tolerance, I got values for the different levels instead of the whole variable, e.g. the tolerance for 2 of the drug categories (the 3rd would be the referent, I guess) was <.1. I didn't get a tolerance for drugcategory as a whole. It was also <.1 for some interactions and levels of interactions. What does this mean? Should these be excluded from the model?
This is the same as getting VIF for the dummies. Low tolerance means either that there is not much variation within the variable (which I personally think is not a good situation for prediction) or that your dummy takes values closely correlated with some other variable. I'd suggest investigating a bit, but I personally don't worry too much about multicollinearity among dummies.
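
As a minimal sketch of that investigating, one could build the suspect dummy by hand and correlate it with the other predictors (the level value 1 is an assumption, as in the earlier dummy-coding sketch):

Code:
data check;
set statsclue;
drug1 = (drugcategory = 1); *hypothetical dummy for one drug category;
run;

proc corr data=check;
var drug1 volume; *a high correlation here would explain the low tolerance;
run;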
 
#9
That code really didn't work for cookd and residuals, but thanks a bunch, jrai.

Interactions should be kept when investigating tolerance/vif, right? Thanks.
 

jrai

New Member
#10
What was the problem with the code? Did you get any error message?

Yes, interactions should be kept, but often they show high collinearity with the base variables; e.g. x_sq will always show high collinearity with x. You can either leave it like that or use ridge regression.
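
If you do want to try ridge regression, a minimal sketch with Proc Reg, using the hypothetical dummies from the earlier sketch and an arbitrary grid of ridge parameters:

Code:
proc reg data=statsclue2 outest=ridge_est ridge=0 to 0.1 by 0.01;
model outcome = gender_f drug1 drug2 volume gf_volume;
run;

proc print data=ridge_est; *one row of coefficients per ridge parameter;
run;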
 
#11
Ridge regression... gawd, my brain can't handle any new terminology, let alone a new regression. Well, some silly associations got into my model in a highly significant way, and I was hoping to find a statistical reason further along in the analysis to drop them.
Have another question, more for a preceding stage: if, after placing all significant interactions in the model, a formerly significant variable becomes insignificant (but one of its interactions remains significant), is that reason enough to throw out the now-insignificant variable along with its significant interaction?

I didn't get any error message. The command ran and gave everything else in the output without a hint of cookd or residuals.

Thanks.
 

jrai

New Member
#12
I didn't get any error message. The command ran and gave everything else in the output without a hint of cookd or residuals.
Did you check the dataset named work.statsclue1 (if you used the same code)? That is the output dataset where the results are stored.

As for model selection, the ideal way is to keep the base variables if you are keeping the interactions. Following McClave, Benson & Sincich: see if the overall model is useful, as indicated by the F-test, and then see if the interaction is significant. If the interaction is significant, then the tests on the base variables are meaningless, as the significance of the interaction term implies that both variables are important.
 
#13
If the interaction is significant, then the tests on the base variables are meaningless, as the significance of the interaction term implies that both variables are important.
Very useful information. So that's saying that the tests on the base variables are meaningless, right, and NOT that KEEPING the base variables individually is meaningless if they're kept as part of an interaction? I.e., you've got to keep them individually if you're keeping them as part of an interaction(?)

I checked the work folder. It only has the original dataset. Oh, and the log does give a strange message at the bottom:

Variable volume already exists on file WORK.statsclue1, using volume2 instead.

Variable volume already exists on file WORK.statsclue1, using volume3 instead.
 

jrai

New Member
#14
Yes, keep the individual variables if their interaction is being kept.

Statsclue1 should be created in the work folder. There seems to be some problem with the variable names in your code. Post the code and I can check.
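
For what it's worth, that log message is what you'd expect from the guess in post #7, where volume was listed as an extra name after the cookd= and r= keywords; a sketch of a corrected OUTPUT statement (the new variable names are arbitrary):

Code:
proc glm data=statsclue;
class gender drugcategory;
model outcome=gender drugcategory volume gender*volume / tolerance;
output out=statsclue1 cookd=cooks r=resid; *one fresh name per keyword, no clashes with existing variables;
run;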
 
#15
That code for cookd and residuals worked, but my N > 1000, so it's tough looking for influential observations (though I'm not even sure what to do with them if I do find them).

Is there a way to get SAS to print out only the observations of concern, with proc glm?

Thanks.
 
#17
Proc glm will just output the Cook's D in the output dataset. Say the Cook's D statistic is in a dataset named analysis and is given by the variable cook. The influential observations are the ones with Cook's D > 4/n.

proc print data=analysis(where=(cook>4/1001)); *assuming your dataset has 1001 observations;
run;

Once you see the observations with high influence then you investigate them & follow your policy of outliers.
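
If you'd rather not hardcode the 1001, a minimal sketch that computes the 4/n cutoff from the dataset itself:

Code:
data influential;
set analysis nobs=n; *nobs= supplies the observation count at compile time;
if cook > 4/n; *keep only the high-influence observations;
run;

proc print data=influential;
run;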

For more discussion on detecting outliers you can see the following thread: http://www.talkstats.com/showthread.php/23186-Removing-outliers-using-SAS
 
#18
Worked. Got about 90 or so observations. If there are so many, how are they even outliers? :(

From what I've read, one rechecks the data, increases the sample size, or finds an excuse to delete the outliers. I don't think any of this is possible in my case. Will my model be bad if I don't do anything about the outliers? THANKS.
 
#19
Try using this:

proc standard data=statsclue mean=0 std=1 out=temp; *This step finds the standard normal score, i.e. z-score, of the listed numeric variables;
var List_of_IVs;
run;

Then find the observations where the standardized IVs are greater than 3 or less than -3. These are possible outliers. This method will give you a smaller set of possible outliers. If you think these outliers are really rare, or such high values are not the general case or don't make sense, then you can delete them.
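
A minimal sketch of that last step, assuming volume is one of the standardized IVs in temp:

Code:
proc print data=temp;
where abs(volume) > 3; *possible outliers on this IV;
run;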