Regression diagnostics: PROC REG versus PROC GLM

noetsi

Fortran must die
#1
PROC GLM has many advantages over PROC REG, such as a CLASS statement. But SAS has chosen not to include in PROC GLM many of the diagnostics that are in PROC REG. The logical solution is to run the model in PROC GLM, then run the same model with diagnostics in PROC REG.

The potential problem with this is that categorical predictors are handled differently in the two: PROC GLM builds the dummy variables through the CLASS statement, while in PROC REG you code them yourself. Will this influence the diagnostics that test for things like unequal error variance, non-linearity, multicollinearity, etc.?

I am concerned that the diagnostics will look OK in PROC REG but would not be OK for the PROC GLM model (that is, they won't be substantively correct at all), because of how dummy variables are treated differently in the two procs.
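
One way to check this, as a sketch rather than a definitive recipe: fit the same model both ways and compare the residuals. The fitted values do not depend on which level is used as the reference, so residual-based diagnostics (unequal variance, non-linearity) should agree between the two procs. VIFs for the dummies, however, do depend on the reference level you pick. The data set (sashelp.cars) and the names dt_all, dt_rear, glm_out and reg_out are only illustrative.

Code:
/* Hand-coded dummies for PROC REG; Front is the reference level here. */
data cars;
	set sashelp.cars;
	dt_all  = (DriveTrain = 'All');
	dt_rear = (DriveTrain = 'Rear');
run;

/* PROC GLM builds its own dummies; by default it zeroes out the last
   sorted level (Rear), so the coefficients differ but the fit does not. */
proc glm data=cars;
	class DriveTrain;
	model MPG_City = DriveTrain Horsepower / solution;
	output out=glm_out r=resid_glm;
run;
quit;

proc reg data=cars;
	model MPG_City = dt_all dt_rear Horsepower / vif;
	output out=reg_out r=resid_reg;
run;
quit;

/* The residuals should match up to numerical noise. */
proc compare base=glm_out(keep=resid_glm rename=(resid_glm=resid))
             compare=reg_out(keep=resid_reg rename=(resid_reg=resid))
             method=absolute criterion=1e-8;
run;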

While I am at it, the Ramsey RESET test is offered as a test for non-linearity. As far as I know it is only available in PROC AUTOREG, and my data is not time series. Does anyone know if it can be used with another form of regression? I have found no examples of this online.
 

Stu

New Member
#2
PROC AUTOREG is the only one. The good news is that PROC AUTOREG can act just like PROC REG, so you can safely use it for non-time-series data. In a nutshell, PROC AUTOREG is just like any of the other regression procedures, with added support for autoregressive errors. If you compare the parameter estimates between PROC REG and PROC AUTOREG, they'll be the same. Just ignore the ACF/PACF plots ;)

Code:
proc autoreg data=sashelp.cars;
	model MPG_City = EngineSize Cylinders Horsepower / reset;
run;

proc reg data=sashelp.cars;
	model MPG_City = EngineSize Cylinders Horsepower;
run;
Interesting note, PROC ARIMA can act like PROC REG, too!

Code:
proc arima data=sashelp.cars;
	identify var=MPG_City crosscorr=(EngineSize Cylinders Horsepower);
	estimate input=(EngineSize Cylinders Horsepower);
run;
 

hlsmith

Not a robit
#3
Hey, I think I remember looking into RESET. So it can be used to test for a linear relationship in regression. Stu, are you saying that I can put my model terms in PROC AUTOREG and run it to get the RESET test? Does PROC AUTOREG have a CLASS option?


Does AUTOREG tolerate categorical or other dependent variables?
 

Stu

New Member
#4
Correct. Think of AUTOREG as PROC GLM with optional autoregressive error terms. It does support the CLASS statement, but there's a caveat.

Code:
proc autoreg data=sashelp.cars;
	class drivetrain;
	model MPG_City = drivetrain  ;
run;
...it's experimental.

Code:
WARNING: The CLASS statement is experimental in this release.
Which means you'll also get this note in the log:


NOTE: Model is not full rank. OLS estimates for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased. The parameter estimate for the following LHS variable is set to 0, since this variable is a linear combination of other RHS variables as shown.

DriveTrain Front = Intercept - DriveTrain All - DriveTrain Rear


You'll think, "DUH. I know this already! I just need a baseline." For now, it appears that with PROC AUTOREG you need to set the baseline level manually, by setting the desired base level to missing.

Code:
data cars;
	set sashelp.cars;
	if(upcase(drivetrain) = 'FRONT') then call missing(drivetrain);
run;

proc autoreg data=cars;
	class drivetrain;
	model MPG_City = drivetrain  ;
run;
The good news is that SAS is smart enough to tell you there's a problem with the model; it won't just give you incorrect estimates without telling you that they're wrong. That's the only bug I can see with the CLASS statement in AUTOREG.

If all else fails, you can always use the tried-and-true method of dummy variables to get the same estimates:

Code:
data cars;
	set sashelp.cars;
	array dt[*] rear all;

	do i = 1 to 2;
		dt[i] = (upcase(drivetrain) = upcase(vname(dt[i]) ) );
	end;

	drop i;
run;

proc autoreg data=cars;
	model MPG_City = rear all;
run;
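
As a cross-check on the dummy coding, you can fit the same one-factor model in PROC GLM with Front as the reference level. This is only a sketch, and it assumes a SAS/STAT release in which PROC GLM's CLASS statement accepts the REF= option. The All and Rear estimates in the solution table should line up with the coefficients on the all and rear dummies from the AUTOREG run, and the intercept should be the Front mean.

Code:
proc glm data=sashelp.cars;
	/* REF= puts Front last, so GLM treats it as the baseline level */
	class drivetrain(ref='Front');
	model MPG_City = drivetrain / solution;
run;
quit;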
 

hlsmith

Not a robit
#5
PROC AUTOREG is the only one. The good news is that PROC AUTOREG can act just like PROC REG, so you can safely use it for non-time-series data. In a nutshell, PROC AUTOREG is just like any of the other regression procedures, with added support for autoregressive errors. If you compare the parameter estimates between PROC REG and PROC AUTOREG, they'll be the same. Just ignore the ACF/PACF plots ;)

Code:
proc autoreg data=sashelp.cars;
    model MPG_City = EngineSize Cylinders Horsepower / ramsey;
run;

proc reg data=sashelp.cars;
    model MPG_City = EngineSize Cylinders Horsepower;
run;

STU,
Thanks for your time. So in the code above you meant "/ RESET" not "/ Ramsey". When I run the code you posted it kicks out Ramsey's RESET test. Do Power 2-4 represent the continuous variables raised to the 2nd, 3rd, or 4th power, for all variables in the model? And since the F-test is significant, this means the model with only main effects is mis-specified, AKA not as good as the power-based models. Is this all correct? Sorry for not looking this up myself, I am just trying to get the answers straight from Cary if I can.
 

Stu

New Member
#6
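Close. Power 2-4 refers to powers of the fitted values from the original model, added as extra regressors: Power 2 adds the squared fitted values, Power 3 adds the squares and cubes, and Power 4 adds the 2nd through 4th powers. Those terms stand in for powers and cross-products of the continuous predictors, so your reading is right in spirit. Rejecting the null hypothesis suggests the main-effects model is mis-specified, i.e. not as good as a model with higher-order terms.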
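
If you want to see what the test is doing, here is a rough by-hand version in PROC REG. It assumes AUTOREG follows the textbook form of Ramsey's RESET, which augments the model with powers of the fitted values and F-tests the added terms; I have not confirmed that the numbers match AUTOREG's output exactly. The names fit, yhat, yhat2, yhat3 and yhat4 are made up for the example.

Code:
/* Step 1: fit the original model and keep the fitted values. */
proc reg data=sashelp.cars noprint;
	model MPG_City = EngineSize Cylinders Horsepower;
	output out=fit p=yhat;
run;
quit;

/* Step 2: build powers of the fitted values. */
data fit;
	set fit;
	yhat2 = yhat**2;
	yhat3 = yhat**3;
	yhat4 = yhat**4;
run;

/* Step 3: refit with the added powers and test them jointly.
   Each TEST statement applies to the MODEL statement before it. */
proc reg data=fit;
	Power2: model MPG_City = EngineSize Cylinders Horsepower yhat2;
	test yhat2 = 0;
	Power3: model MPG_City = EngineSize Cylinders Horsepower yhat2 yhat3;
	test yhat2 = 0, yhat3 = 0;
	Power4: model MPG_City = EngineSize Cylinders Horsepower yhat2 yhat3 yhat4;
	test yhat2 = 0, yhat3 = 0, yhat4 = 0;
run;
quit;

The powers of yhat are highly correlated with each other, so PROC REG may complain about collinearity in the Power4 model.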

As an exploratory exercise, we can plot the independent variables against mpg_city to see where the non-linearity may be. We end up finding that engine size and city MPG appear to be non-linearly related.

Code:
proc sgplot data=sashelp.cars;
	scatter x=mpg_city y=enginesize;
run;


Our model improves by removing the insignificant factor, Cylinders, and adding a squared term for EngineSize (though with this, you'd be better off trying a log or inverse transform on y, but for example's sake let's roll with it).

Code:
proc glm data=sashelp.cars;
	model mpg_city = enginesize enginesize*enginesize horsepower;
run;
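
To see whether the change actually helps, you can rerun RESET on the revised model. This is a sketch: I build the squared term in a data step because I would not count on the MODEL statement in AUTOREG accepting GLM's enginesize*enginesize syntax. The names cars2, EngineSize2 and log_MPG_City are made up. The second step tries the log transform on y mentioned above.

Code:
data cars2;
	set sashelp.cars;
	EngineSize2  = EngineSize**2;   /* squared term built by hand */
	log_MPG_City = log(MPG_City);   /* log transform of the response */
run;

/* Revised model with the squared term */
proc autoreg data=cars2;
	model MPG_City = EngineSize EngineSize2 Horsepower / reset;
run;

/* Alternative: log-transformed response, no squared term */
proc autoreg data=cars2;
	model log_MPG_City = EngineSize Horsepower / reset;
run;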
 