I fit my model using proc glm, but now it seems that proc reg should be used for the diagnostics. So, do I need to fit the model all over again using proc reg, creating the dummy variables that proc glm avoided, or can the diagnostics be done with proc glm?
What diagnostics are you referring to in particular?
1) VIF can be obtained via the tolerance statistic: tolerance = 1/VIF, and it is given by the TOLERANCE option in the MODEL statement of Proc GLM.
2) Cook's D can be written to an output dataset using the COOKD= option in the OUTPUT statement of Proc GLM.
3) If you still need to estimate the model using Proc Reg then you'll have to create dummies, and if you want the same results the coding has to be done the way Proc GLM does it, else the coefficients may differ.
Proc Reg does give more diagnostic statistics than Proc GLM.
StatsClue (02-06-2012)
Thanks so much! Is there anything else (e.g. residuals, leverage, outliers) that I could do with glm? And could you specify the code a little more? For example:
proc glm data=statsclue;
class gender drugcategory;
model outcome=gender drugcategory volume gender*volume;
run;
Where do I put the tolerance thing? Thanks again, this was really helpful.
EDIT: I just put TOLERANCE in the right place (it would help to know about the others, e.g. residuals, Cook's D) and got an impossible number of dummies in the output. I know I should be worried about VIF > 10. What tolerance value should I be worried about? Also, would it be Type 1 or Type 2 tolerance (the output shows both types)?
And a somewhat unrelated q: for the output with glm, should I read the Type 1 SS or the Type 3 SS to decide which variables to keep in the model? (The p-values for the two aren't always the same.) Thanks a LOT.
Last edited by StatsClue; 02-06-2012 at 10:56 PM.
I know I should be worried about VIF > 10. What tolerance value should I be worried about?
If your criterion is VIF > 10, then the tolerance cutoff should be 0.1. Anything less than 0.1 will indicate multicollinearity.
Also, would it be Type 1 or Type 2 tolerance (the output shows both types)?
Type 2 is the same as the Tolerance & corresponding VIF output from Proc Reg; therefore, I prefer using Type 2.
For the output with glm, should I read the Type 1 SS or the Type 3 SS to decide which variables to keep in the model? (The p-values for the two aren't always the same.)
Either can be used if you understand what it is testing. Usually Type 3 makes more intuitive sense. Say you've 3 IVs: a, b & c. The Type 3 statistic for a is calculated by estimating the equation with intercept, b & c, i.e. excluding a. Therefore, it gives the additional effect of variable a. If the p-value for a comes out insignificant, then the equation can be safely estimated without a.
Type 1 is sequential testing. Say you specified the IVs as b, c & a in the MODEL statement, in that order. Type 1 will fit the model in sequence: intercept first, followed by intercept + b, followed by intercept + b + c, & so on. Because of the sequential structure it is less intuitive.
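You can see the order dependence directly by fitting the same model twice with the terms reordered (a sketch using the thread's dataset and variable names; only the Type I SS should change between runs, while the Type III SS stay the same):

```sas
/* Type I (sequential) SS depend on the order of terms in the MODEL statement */
proc glm data=statsclue;
  class gender drugcategory;
  model outcome = gender drugcategory volume;   /* gender entered first */
run;

proc glm data=statsclue;
  class gender drugcategory;
  model outcome = volume drugcategory gender;   /* gender entered last */
run;
/* Compare the Type I SS for gender across the two runs;
   the Type III SS for gender will match in both. */
```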
would help to know about others, eg. residuals, cooks D
Code:
proc glm data=statsclue;
  class gender drugcategory;
  model outcome = gender drugcategory volume gender*volume / tolerance;
  output out=statsclue1 cookd=cooks_statistics_in_this_var r=residuals_in_this_var;
run;
Thanks! Not too sure of the output line. Do you mean: output out=statsclue1 cookd=cooks volume r=residuals volume; ?
Basically, not sure what you mean to be written in place of cooks_statistics_in_this_var .
In the context of tolerance, I got values for the different levels instead of the whole variable, e.g. the tolerance for the 2 different drug categories (the 3rd would be the referent, I guess) was <.1. I didn't get a tolerance for drugcategory as a whole. It was also <.1 for some interactions and levels of interactions. What does this mean? Should these be excluded from the model?
Basically, not sure what you mean to be written in place of cooks_statistics_in_this_var.
This would be replaced by the variable name in which you want to store the results.
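For instance (a sketch; cooksd and resid here are just placeholder names you could pick yourself, not SAS keywords):

```sas
/* Store Cook's D in a new variable called cooksd and
   the residuals in a new variable called resid,
   both written to the output dataset statsclue1 */
output out=statsclue1 cookd=cooksd r=resid;
```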
In the context of tolerance, I got values for different levels instead of the whole variable [...] Should these be excluded from the model?
This is the same as getting VIF for the dummies. Low tolerance means either that there is not much variation within the variable (which I personally think is not a good case for prediction) or that your dummy takes values closely correlated with some other variable. I'd suggest investigating a bit, but I personally don't worry too much about multicollinearity with dummies.
That code really didn't work for cookd and residuals, but thanks a bunch jrai.
Interactions should be kept when investigating tolerance/vif, right? Thanks.
What was the problem with the code? Did you get any error message?
Yes, interactions should be kept, but often they show high collinearity with the base variables, e.g. x_sq will always show high collinearity with x. Either you can leave it like that, or you can use ridge regression.
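If you do want to try ridge regression, Proc Reg has a RIDGE= option that fits the model over a grid of ridge parameters (a sketch; the dataset and dummy variable names below are hypothetical, since Proc Reg has no CLASS statement and needs the dummies created beforehand):

```sas
/* Ridge regression over a grid of ridge parameters k = 0, 0.01, ..., 0.1.
   Assumes dummies (e.g. gender_m, drug1, drug2) were created in advance. */
proc reg data=statsclue_dummies outest=ridge_est ridge=0 to 0.1 by 0.01;
  model outcome = gender_m drug1 drug2 volume;
run;

/* The coefficient estimates at each ridge value land in ridge_est */
proc print data=ridge_est;
run;
```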
Ridge regression... gawd, my brain can't handle any new terminology, let alone a new regression. Well, some silly associations got into my model in a highly significant way, and I was hoping to find a statistical reason, further in the analysis, to drop them.
Have another q, more for a preceding stage: if, after placing all significant interactions in the model, a formerly significant variable becomes insignificant (but one of its interactions remains significant), is that reason enough to throw out the now-insignificant variable along with its significant interaction?
I didn't get any error message. The command ran and gave everything else in the output without a hint of cookd or residuals.
Thanks.
Did you check the dataset named work.statsclue1 (if you used the same code)? This is the output dataset where the results are stored.
As for model selection, the ideal way is to keep the base variables if you are keeping their interactions. According to McClave, Benson & Sincich, see if the overall model is useful as indicated by the F-test, & then see if the interaction is significant. If the interaction is significant, then the tests on the base variables are meaningless, as the significance of the interaction term implies that both variables are important.
Very useful information. So that's saying that tests on base variables are meaningless, right, and NOT that KEEPING the base variables individually is meaningless if they're being kept as part of interaction, right? i.e., you've gotta keep them individually if you're keeping them as part of interaction..(?)
Checked the work folder. Only has the original dataset. Oh and the log does give a strange message at the bottom:
Variable volume already exists on file WORK.statsclue1, using volume2 instead.
Variable volume already exists on file WORK.statsclue1, using volume3 instead.
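That log note usually appears when the OUTPUT statement assigns a new statistic a name that already exists in the data (here volume), so SAS substitutes volume2/volume3 instead. Giving the statistics fresh, non-conflicting names avoids it (a sketch; cooksd and resid are placeholder names):

```sas
/* One new, unused variable name per statistic keyword;
   don't reuse a name already in the dataset (like volume) */
output out=statsclue1 cookd=cooksd r=resid;
```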
Yes, keep the individual variables if their interaction is being kept.
Statsclue1 should be created in the work folder. There seems to be some problem with the variable names in your code. Post the code & I can check.
That code for cookd and residuals worked, but my N > 1000, so it's tough looking for influential observations (though I'm not even sure what to do with them if I do find them).
Is there a way to get SAS to print out only the observations of concern, with proc glm?
Thanks.
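With a large N, one common approach is to subset the output dataset rather than scan it by eye (a sketch; it assumes the statsclue1 dataset with cooksd/resid names as above, and uses the conventional 4/n cutoff for Cook's D, which is a rule of thumb rather than a strict test):

```sas
/* Print only observations whose Cook's D exceeds 4/n (here n taken as 1000) */
proc print data=statsclue1;
  where cooksd > 4/1000;
  var outcome gender drugcategory volume cooksd resid;
run;
```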