- Thread starter StatsClue
- Tags regression diagnostics

2) Cook's D can be written to an output dataset using the cookd= option in the output statement of Proc GLM.

3) If you still need to estimate the model using Proc Reg, you'll have to create the dummies yourself. If you want matching results, code them the way Proc GLM does (the last level of each class variable is the reference); otherwise the coefficients might be different.

Proc Reg does give more diagnostic statistics than Proc GLM.
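
A minimal sketch of point 3, assuming gender takes the values 'F' and 'M' and drugcategory takes 'A', 'B' and 'C' (these level values, and the names statsclue_dummies, gender_f, drug_a, drug_b, gender_f_volume and reg_diag, are made up for illustration). Proc GLM treats the last level in sort order as the reference, so 'M' and 'C' get no dummy here:

```sas
/* Hypothetical level values: replace 'F', 'A', 'B' with your own. */
data statsclue_dummies;
  set statsclue;
  /* Leave the dummy missing when the class variable is missing,
     so Proc Reg drops the same observations Proc GLM would. */
  if not missing(gender) then gender_f = (gender = 'F');
  if not missing(drugcategory) then do;
    drug_a = (drugcategory = 'A');
    drug_b = (drugcategory = 'B');
  end;
  /* Matches the gender*volume term; stays missing if either part is missing. */
  gender_f_volume = gender_f * volume;
run;

proc reg data=statsclue_dummies;
  /* vif and tol request variance inflation factors and tolerances */
  model outcome = gender_f drug_a drug_b volume gender_f_volume / vif tol;
  /* cookd= and r= name the new variables written to reg_diag */
  output out=reg_diag cookd=cooks_d r=resid;
run;
quit;
```

With this coding the Proc Reg coefficients should line up with the parameter estimates Proc GLM reports for the non-reference levels.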

proc glm data=statsclue;

class gender drugcategory;

model outcome=gender drugcategory volume gender*volume;

run;

Where do I put the tolerance thing? Thanks again, this was really helpful.

EDIT: I just put tolerance in the right place (it would help to know about the others, e.g. residuals, Cook's D) and got an impossible number of dummies in the output. I know I should be worried about vif>10. What tolerance value should I be worried about? Also, would it be the type 1 or the type 2 tolerance (the output shows these two types)?

And...a somewhat unrelated question: for the output with glm, should I read the type 1 SS or the type 3 SS to decide which variables to keep in the model? (The p-values for the two outputs aren't always the same.) Thanks a LOT.
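
On the tolerance question: under the usual definition (the one Proc Reg reports with its tol option), tolerance is the reciprocal of the VIF, so the vif>10 rule of thumb translates directly. Whether the type 1 and type 2 tolerances printed by Proc GLM follow exactly this definition is an assumption here, so check the documentation before relying on it.

```latex
\mathrm{TOL}_j = 1 - R_j^2, \qquad
\mathrm{VIF}_j = \frac{1}{\mathrm{TOL}_j}
\quad\Longrightarrow\quad
\mathrm{VIF}_j > 10 \iff \mathrm{TOL}_j < 0.1
```

Here R_j^2 is the R-square from regressing the j-th predictor on all the other predictors in the model.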

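Type 1 is sequential testing. Say you specified the IVs as b, c and a in the model statement, in that order. Type 1 fits the models in sequence: intercept first, then intercept + b, then intercept + b + c, and so on. Each test therefore depends on the order in which the terms are listed, which makes it less intuitive. Type 3 tests each term adjusted for all the other terms in the model, so it does not depend on the order.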
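
One way to see the order dependence is to fit the same model twice with the terms listed differently. This is only a sketch using the variable names from this thread; ss1 and ss3 restrict the printed sums of squares to those two types:

```sas
/* Order 1: gender entered first */
proc glm data=statsclue;
  class gender drugcategory;
  model outcome = gender drugcategory volume gender*volume / ss1 ss3;
run;
quit;

/* Order 2: same terms, volume entered first */
proc glm data=statsclue;
  class gender drugcategory;
  model outcome = volume drugcategory gender gender*volume / ss1 ss3;
run;
quit;
```

The type 1 table should change between the two runs for the terms that moved, while the type 3 table should stay the same.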

would help to know about others, eg. residuals, cooks D

Code:

```
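proc glm data=statsclue;
class gender drugcategory;
/* tolerance is a model statement option, so it goes after the slash */
model outcome=gender drugcategory volume gender*volume/ tolerance;
/* cookd= and r= name new variables in the out= dataset; nothing extra is printed */
output out=statsclue1 cookd=cooks_statistics_in_this_var r=residuals_in_this_var;
run;
quit;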
```
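
The output statement writes to a dataset rather than to the printed results, so the new columns have to be looked at separately. A small sketch, assuming the out= dataset is WORK.statsclue1 (the name that appears in the log message later in this thread) and the variable names used above:

```sas
/* Confirm the dataset exists and list its variables */
proc contents data=statsclue1;
run;

/* Show the first 10 rows with the two new diagnostic columns */
proc print data=statsclue1(obs=10);
  var outcome cooks_statistics_in_this_var residuals_in_this_var;
run;
```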

Basically, I'm not sure what you mean to be written in place of cooks_statistics_in_this_var.

In the context of tolerance, I got values for the different levels instead of for the whole variable, e.g. the tolerance for 2 of the drug categories (the 3rd would be the referent, I guess) was <.1. I didn't get a tolerance for drugcategory as a whole. It was also <.1 for some interactions and levels of interactions. What does this mean? Should these be excluded from the model?

I have another question, more for a preceding stage: if, after placing all significant interactions in the model, a formerly significant variable becomes insignificant (but one of its interactions remains significant), is that reason enough to throw out the now insignificant variable along with its significant interaction?

I didn't get any error message. The command ran and gave everything else in the output without a hint of cookd or residuals.

Thanks.

As for model selection, the ideal way is to keep the base variables if you are keeping their interactions. According to McClave, Benson & Sincich, first see whether the overall model is useful, as indicated by the F-test, and then see whether the interaction is significant. If the interaction is significant, the tests on the base variables are meaningless, because a significant interaction term implies that both variables are important.

Checked the work folder. Only has the original dataset. Oh and the log does give a strange message at the bottom:

Variable volume already exists on file WORK.statsclue1, using volume2 instead.

Variable volume already exists on file WORK.statsclue1, using volume3 instead.

That code for cookd and the residuals worked, but my N > 1000, so it's tough looking for influential observations (though I'm not even sure what to do with them if I do find them).

Is there a way to get SAS to print out only the observations of concern, with proc glm?

Thanks.

proc print data=analysis(where=(cook>4/1001)); *assumes 1001 observations, analysis is the out= dataset and cook is the cookd= variable;

run;

Once you see the observations with high influence, investigate them and follow your policy on outliers.
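
If you'd rather not hard-code the sample size, the 4/n cutoff can be computed from the data. This is a sketch only: it assumes the out= dataset is WORK.statsclue1 and the Cook's D variable is cooks_statistics_in_this_var, as in the code earlier in the thread; n_obs and influential are made-up names.

```sas
/* Count the observations that actually have a Cook's D value */
proc sql noprint;
  select count(cooks_statistics_in_this_var) into :n_obs trimmed
  from statsclue1;
quit;

/* Keep only the observations above 4/n, largest Cook's D first */
proc sort data=statsclue1 out=influential;
  where cooks_statistics_in_this_var > 4 / &n_obs;
  by descending cooks_statistics_in_this_var;
run;

proc print data=influential;
run;
```

Sorting in descending order puts the most influential observations at the top, which helps when the list is still long.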

For more discussion on detecting outliers you can see the following thread: http://www.talkstats.com/showthread.php/23186-Removing-outliers-using-SAS

From what I've read, one rechecks the data, increases the sample size or finds an excuse to delete the outliers. I don't think any of this is possible in my case. Will my model be bad if I don't do anything about the outliers? THANKS.

proc standard mean=0 std=1 data= out=temp; *This step replaces the variables listed in the var statement with their z-scores (mean 0, std 1);

var List_of_IVs;

run;

Then find the observations where any IV is more than 3 or less than -3. These are possible outliers. This method will give you a smaller set of candidates. If you think these values are really rare, are not the general case, or don't make sense, then you can delete them.
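
The flagging step could look something like this. It is a sketch that assumes temp is the dataset written by proc standard above and that volume is the only continuous IV; put your own standardized variables in the array. possible_outliers, z and i are made-up names.

```sas
data possible_outliers;
  set temp;
  /* List every standardized IV here; volume is just the one from this thread */
  array z{*} volume;
  do i = 1 to dim(z);
    /* Missing z-scores never pass this test, so they are not flagged */
    if abs(z{i}) > 3 then do;
      output;   /* write the observation once */
      leave;    /* stop checking the remaining IVs */
    end;
  end;
  drop i;
run;

proc print data=possible_outliers;
run;
```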