Help with comprehending output from proc reg

#1
So I just ran my very first regression and the data output is pasted below (I've also attached the .txt file)

Analysis of Variance

                                 Sum of           Mean
Source                DF        Squares         Square    F Value    Pr > F

Model                 20    3.670499E11    18352493740       1.58    0.0496
Error               1101    1.277317E13    11601428676
Corrected Total     1121    1.314022E13


Root MSE                107710    R-Square    0.0279
Dependent Mean          100585    Adj R-Sq    0.0103
Coeff Var            107.08322


Parameter Estimates

                                 Parameter      Standard
Variable    Label         DF      Estimate         Error   t Value   Pr > |t|

Intercept   Intercept      1         85167         34579      2.46     0.0139
A           A              1    0.00000861       0.00790      0.00     0.9991
B           B              1       2.98968       2.46392      1.21     0.2252
C           C              1     110.87518      85.72047      1.29     0.1961
D           D              1    -347.03355     127.76966     -2.72     0.0067
E           E              1      94.43187      99.11811      0.95     0.3409
F           F              1   -0.00041549       0.00609     -0.07     0.9456
G           G              1      -5.06148      49.23571     -0.10     0.9181
H           H              1         16426         13104      1.25     0.2103
I           I              1        -45101         22002     -2.05     0.0406
J           J              1        -46856         29047     -1.61     0.1070
K           K              1        -48402         22305     -2.17     0.0302
L           L              1        -30408         21738     -1.40     0.1621
M           M              1        -46714         22433     -2.08     0.0375
N           N              1        -27593         21538     -1.28     0.2004
O           O              1        -31519         23769     -1.33     0.1851
P           P              1        -54905         20912     -2.63     0.0088
Q           Q              1        -24632         34383     -0.72     0.4739
R           R              1     912.08579     377.31024      2.42     0.0158
S           S              1    1210.77268    7284.51610      0.17     0.8680
T           T              1        -22182         16564     -1.34     0.1808





Given this output, the following are the conclusions I have drawn:

DEPD = 85167 + 0.00000861 A + 2.98968 B + 110.87518 C - 347.03355 D + 94.43187 E - 0.00041549 F - 5.06148 G + 16426 H - 45101 I - 46856 J - 48402 K - 30408 L - 46714 M - 27593 N - 31519 O - 54905 P - 24632 Q + 912.08579 R + 1210.77268 S - 22182 T

(However, since all the Pr > |t| values are greater than 0.001 the coefficients are not very accurate. Also since the Standard error values are pretty high, the coefficients are not very accurate.)

Parameters A to T explain 2.79% (R-square value) of the variation shown in DEPD.

1) Is this correct?

In addition,

2) What is F Value and Pr > F?

3) What is t Value?

4) Is there anything else I can get from the SAS output?

Thank you.
 
#2
Mods, i just realised this is probably in the wrong sub forum, could you help me to delete this thread so i can re post in the statistics forum?

Thanks.
 

ledzep

Point Mass at Zero
#5
The first question I have for you is: WHAT IS YOUR OBJECTIVE? Are you trying to screen out the variables and find out which of the variables are significant in explaining the response?

If you are trying to find out which of the variables are important, then running a variable selection approach would be useful (forward, backward or stepwise). If you already know that some of the variables are important, then you can force them into the model (even if they are not significant).

2) What is F Value and Pr > F?
The F test assesses the significance of your model as a whole. To be precise, the F-test tests the null hypothesis that none of the variables is linearly related to the response variable (all slope coefficients are zero) against the alternative that at least one of them is.
Pr > F is the p-value associated with the F-test. It is marginally significant at the 5% level, indicating that at least one of the variables is linearly related to the response variable.
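
To see where these numbers come from, here is a rough check that uses only the figures in your ANOVA table. The data set name f_check and the variable names are made up for illustration, and nothing here reads your actual data.

```sas
/* Recompute F Value, Pr > F and R-Square from the printed ANOVA table */
data f_check;
  ms_model = 18352493740;                    /* Mean Square, Model */
  ms_error = 11601428676;                    /* Mean Square, Error */
  f_value  = ms_model / ms_error;            /* about 1.58 */
  /* area to the right of f_value under F(20, 1101); should be close to the printed 0.0496 */
  p_value  = 1 - probf(f_value, 20, 1101);
  r_square = 3.670499E11 / 1.314022E13;      /* Model SS / Corrected Total SS, about 0.0279 */
  put f_value= p_value= r_square=;           /* writes the results to the log */
run;
```

So the F value is just the model mean square divided by the error mean square, and R-Square is the model sum of squares as a share of the corrected total.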

3) What is t Value?
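The t-value is the test statistic for testing the significance of an individual model coefficient: the parameter estimate divided by its standard error. Pr > |t| is the corresponding two-sided p-value.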
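
As a quick illustration, the row for D in your output can be reproduced like this. It is only a sketch: t_check and the variable names are made up, and the numbers are copied from your Parameter Estimates table.

```sas
/* Recompute t Value and Pr > |t| for variable D */
data t_check;
  estimate = -347.03355;                     /* Parameter Estimate for D */
  std_err  = 127.76966;                      /* Standard Error for D */
  df_error = 1101;                           /* Error DF from the ANOVA table */
  t_value  = estimate / std_err;             /* about -2.72 */
  /* two-sided p-value; should be close to the printed 0.0067 */
  p_value  = 2 * (1 - probt(abs(t_value), df_error));
  put t_value= p_value=;                     /* writes the results to the log */
run;
```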

4) Is there anything else I can get from the SAS output?
Yes, since you have so many variables, i.e. you are in a multiple linear regression setting, you can definitely explore a bit more. You want to know if there is multicollinearity, i.e. if any of the independent variables are strongly correlated with each other. This can be checked using the Variance Inflation Factor (VIF). As a rule of thumb, VIF > 10 means the multicollinearity is serious.
You can also explore the residuals and check the underlying assumptions of the model.
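
For the residual checks, a minimal sketch could look like the following. I'm assuming your data set is called mydata (a placeholder) and reusing the variable names from your output; diag, pred and resid are names I made up.

```sas
/* Refit the model and save predicted values and residuals */
proc reg data=mydata;
  model DEPD = A B C D E F G H I J K L M N O P Q R S T;
  output out=diag p=pred r=resid;
run; quit;

/* Residuals vs. predicted: look for curvature or a funnel shape */
proc sgplot data=diag;
  scatter x=pred y=resid;
  refline 0 / axis=y;
run;

/* Normality tests and a normal Q-Q plot of the residuals */
proc univariate data=diag normal;
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;
```

A residual plot with no pattern around zero and a roughly straight Q-Q plot would suggest the usual assumptions are not badly violated.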
 
#6
Ledzep,
Thank you so much for your reply.

Yes, I am trying to find out which of the independent variables have an effect on the dependent variable and to what extent they have an effect on the dependent variable. I know it seems like I have a lot of independent variables but variables H to P exist because there is 1 qualitative variable with 10 possible options.

Right now am I right to say that from the results it seems like none of the independent variables have an impact on the dependent variable?

Also, could you explain a bit more about Variance Inflation Factor? Do I test the VIF between independent variables or between the dependent variable and the independent variables?

Thank you!
 

ledzep

Point Mass at Zero
#7
Right now am I right to say that from the results it seems like none of the independent variables have an impact on the dependent variable?
The p-values for "D", "I", "K", "M", "P" and "R" are all below 0.05, so these variables are significant at the 5% level of significance.

However, the results may not be reliable if the independent variables are strongly correlated with each other, which a large VIF would indicate.

VIF is a measure of the severity of collinearity. Larger values (>10) mean more correlation between independent variables. You just look at the VIF for a given variable. For example:

Code:
/*fake data*/
data test;
input y x1 x2;
cards;
8  3  6
3  4  1
2  2  2
4  4  3
2  5  4
;
run;

/*Run proc reg with the vif option*/
proc reg data=test;
model y= x1 x2/vif;
run;quit;

/*trimmed output*/

                                         Parameter Estimates

                              Parameter       Standard                              Variance
         Variable     DF       Estimate          Error    t Value    Pr > |t|      Inflation

         Intercept     1        2.61458        3.97927       0.66      0.5787              0
         x1            1       -0.53646        0.96588      -0.56      0.6344        1.00208
         x2            1        0.97396        0.57253       1.70      0.2310        1.00208

Here the VIFs are less than 10. So, there is no serious collinearity between the independent variables.
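
If it helps to see what the VIF is measuring: for each independent variable it is 1/(1 - R-square), where the R-square comes from regressing that variable on the other independent variables. Here is a sketch for x1 with the fake data above. The names aux and vif_x1 are made up, and I'm relying on the RSQUARE option of proc reg to add _RSQ_ to the OUTEST= data set.

```sas
/* Auxiliary regression: x1 on the other independent variable */
proc reg data=test outest=aux rsquare noprint;
  model x1 = x2;
run; quit;

/* VIF for x1 = 1 / (1 - R-square of the auxiliary regression) */
data vif_x1;
  set aux;
  vif = 1 / (1 - _RSQ_);
  put vif=;                /* should be close to the 1.00208 printed above */
run;
```

With only two independent variables the VIF is the same for both, which is why x1 and x2 show the same value in the output.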
Usually running a variable selection is useful, as it helps you screen for the important variables. Once you have screened your variables, you can fit the selected model and run diagnostic checks on the appropriateness of the fitted model.

Code:
/*Run variable selection*/
proc reg data=test ;
model y= x1 x2/selection=stepwise;
run;quit;
 

ledzep

Point Mass at Zero
#8
I know it seems like I have a lot of independent variables but variables H to P exist because there is 1 qualitative variable with 10 possible options.
OK!!! Thanks for this additional information. I was firmly assuming up until now that all the variables were continuous variables (as you used proc reg).
So, H to P are different levels of the same variable. Is it possible to have them as a single column? Then you can use proc glm instead of proc reg and specify the variable as a class variable.

The danger of using proc reg is that it treats every independent variable as continuous, even if it is categorical. To specify the categorical variable correctly, use "proc glm" and tell SAS with a class statement that the variable is a factor, not continuous.
 
#9
Ledzep,
Dude you're like a statistics and SAS jedi!

I will rerun the data with the vif code.

What is the reason you put "selection=stepwise" after the model statement in the second code?

I can definitely get the categorical values into one column, the raw data specifies it as one column and I separated it out. How would I run the code then?

proc glm data=File_name;
model DEPD = A B C D E F G H Q R S T;
run;

In that case what does the coefficient for H represent?

Thanks Ledzep!
 

ledzep

Point Mass at Zero
#10
What is the reason you put "selection=stepwise" after the model statement in the second code?
I thought you're interested to find out which of the variables were significant. Using selection=stepwise will allow you to come up with the variables which were significant for your response in the presence of other variables in the model.

I can definitely get the categorical values into one column, the raw data specifies it as one column and I separated it out. How would I run the code then?

proc glm data=File_name;
model DEPD = A B C D E F G H Q R S T;
run;
Yes, the code is pretty much as you wrote it, with one addition: a class statement.

Code:
proc glm data=File_name;
class H;  *list all your categorical/factor variables here. SAS calls them Class;
model DEPD = A B C D E F G H Q R S T / solution;  *solution prints the parameter estimates;
run;quit;
In that case what does the coefficient for H represent?
It should list 10 different estimates for H, one for each level of H (one of them should be zero, as SAS sets it as the reference category).
 
#11
Ledzep,
I have 1 last question before I re run the regression:

I have 2 other categorical variables (S and T) but these are either yes or no. So S and T have value of 1 for yes and value of 0 for no. Should I leave them as they are or move them to class variables?

Thank you!
 

ledzep

Point Mass at Zero
#12
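You can move S and T to the class list.
If you don't specify that a variable is a class variable, SAS will assume it is continuous and fit a linear effect. For a variable coded 0/1 the fitted model is the same either way; only the sign of the estimate and the reference level change. It matters once a variable has more than two levels.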
 

Dason

Ambassador to the humans
#13
I thought you're interested to find out which of the variables were significant. Using selection=stepwise will allow you to come up with the variables which were significant for your response in the presence of other variables in the model.
Except that stepwise selection is NOT a good procedure. Here is a link explaining a few of the reasons why you really shouldn't use it: http://www.childrensmercy.org/stats/faq/faq12.aspx
 

ledzep

Point Mass at Zero
#14
Just to give you an example of what happens when you don't specify class.

Code:
/*fake data*/
data test;
input y x1 x2;
cards;
8  1  6
3  1  1
2  1  2
4  0  3
2  0  4
;
run;

*x1 is a yes No variable;
Code:
/*With Class specified for x1*/
proc glm data=test ;
class x1;
model y= x1 x2/solution;
run;quit;

*output;
                                         Standard
                Parameter           Estimate             Error    t Value    Pr > |t|

                Intercept        1.229885057 B      1.84671547       0.67      0.5740
                x1        0     -1.850574713 B      1.74372021      -1.06      0.3998
                x1        1      0.000000000 B       .                .         .
                x2               1.034482759        0.49651979       2.08      0.1726

*TWO estimates for x1, one for each level of x1. The highest level is set as the reference category by SAS, hence the 0.
Code:
/*Now, class not told to SAS*/
proc glm data=test ;
model y= x1 x2;
run;quit;

*output;
                                     Standard
                  Parameter         Estimate           Error    t Value    Pr > |t|

                  Intercept     -0.620689655      2.19257205      -0.28      0.8037
                  x1             1.850574713      1.74372021       1.06      0.3998
                  x2             1.034482759      0.49651979       2.08      0.1726


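* SEE that there is only one estimate for x1, as SAS is treating x1 as a continuous variable, i.e. assuming a linear effect. With only two levels the fit is the same as before (same standard error, t value and p-value; the estimate just changes sign and the intercept shifts), but with more than two levels a linear effect would force equally spaced steps between the levels, which a class variable does not.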
 

Dason

Ambassador to the humans
#15
But they would probably just get an error I'm guessing since their categorical variable is probably actually text and not numeric so it wouldn't be able to treat it as continuous.
 
#16
Ledzep,
Thank you for giving an example pointing out the difference with and without treating a class variable as a class variable. It is much clearer now.

For the example you gave, is the correct regression equation

y = -1.850574713 x1 + 1.034482759 x2 + 1.229885057 ?

What happens if there are more than 2 levels for x1 (e.g. 0, 1 and 2 for no, maybe and yes)? Do you then get 2 coefficients for x1? How would you write that in an equation?

Thank you!

Dason,
Thank you for pointing that out. You are right that my categorical data is in text and I was about to run the code with the categorical data as text. I probably wouldn't have been able to understand why I got an error. I will convert all the categorical data to numbers in Excel using the =IF statement.

Thank you.
 
#18
Dason,
I see. Then why will I get an error?

Oh I see. You were saying I would get an error if I treated text variable as a non class variable.

Let me know if I still didn't get your point.
 

ledzep

Point Mass at Zero
#19
Here's an example with >2 levels.
Code:
/*fake data*/
data test;
input y x1 x2;
cards;
8  0  6
3  0  1
4  1  3
2  1  4
7  2  4
6  2  4
;
run;

/*With Class specified for x1*/
proc glm data=test ;
class x1;
model y= x1 x2/solution;
run;quit;

*output;

                                            Standard
                Parameter           Estimate             Error    t Value    Pr > |t|

                Intercept        2.961538462 B      2.04380649       1.45      0.2843
                x1        0     -0.557692308 B      1.56839863      -0.36      0.7562
                x1        1     -3.057692308 B      1.56839863      -1.95      0.1905
                x1        2      0.000000000 B       .                .         .
                x2               0.884615385        0.43087224       2.05      0.1765
(Note that the variables are not significant. These equations are just to get the message across.)

The model is:
for x1=0,
y = 2.96 - 0.56 + 0.88 x2 = 2.40 + 0.88 x2

for x1=1,
y = 2.96 - 3.06 + 0.88 x2 = -0.10 + 0.88 x2

for x1=2,
y = 2.96 + 0.88 x2

So, essentially you get three parallel lines here (same slope but different intercepts). Hope you got the hang of it.
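
If you want to work out the three intercepts yourself, a small data step will do it. This is only a sketch using the estimates from the output above; lines, shift and intercept are made-up names.

```sas
/* Intercept of each parallel line = Intercept estimate + estimate for that level of x1 */
data lines;
  b0     = 2.961538462;    /* Intercept (reference level x1=2) */
  b_x1_0 = -0.557692308;   /* estimate for x1=0 */
  b_x1_1 = -3.057692308;   /* estimate for x1=1 */
  slope  = 0.884615385;    /* estimate for x2, shared by all three lines */
  do x1 = 0 to 2;
    if x1 = 0 then shift = b_x1_0;
    else if x1 = 1 then shift = b_x1_1;
    else shift = 0;        /* reference level adds nothing */
    intercept = b0 + shift;
    put x1= intercept= slope=;
    output;
  end;
run;
```

The log should show intercepts of roughly 2.40, -0.10 and 2.96, each with the same slope of about 0.88.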

And as Dason quickly suggested, using the class statement will do this automatically for you.
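
For instance, you can leave the levels as text and let the class statement handle them. This is a sketch: test_char is a made-up name, and I've simply relabelled 0, 1 and 2 from the fake data above as no, maybe and yes.

```sas
/* Same fake data, but x1 is stored as text */
data test_char;
input y x1 $ x2;
cards;
8  no     6
3  no     1
4  maybe  3
2  maybe  4
7  yes    4
6  yes    4
;
run;

/* class accepts character variables, so no recoding to numbers is needed */
proc glm data=test_char;
class x1;
model y = x1 x2 / solution;
run; quit;
```

By default SAS sorts the levels alphabetically (maybe, no, yes), so yes should end up as the reference level, which matches x1=2 being the reference in the numeric version.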