Mods, I just realised this is probably in the wrong subforum. Could you help me delete this thread so I can repost it in the statistics forum?
Thanks.
So I just ran my very first regression and the data output is pasted below (I've also attached the .txt file)
Analysis of Variance

                             Sum of          Mean
Source             DF       Squares        Square   F Value    Pr > F
Model              20   3.670499E11   18352493740      1.58    0.0496
Error            1101   1.277317E13   11601428676
Corrected Total  1121   1.314022E13

Root MSE            107710    R-Square   0.0279
Dependent Mean      100585    Adj R-Sq   0.0103
Coeff Var        107.08322

Parameter Estimates

                             Parameter     Standard
Variable   Label      DF      Estimate        Error   t Value   Pr > |t|
Intercept  Intercept   1         85167        34579      2.46     0.0139
A          A           1    0.00000861      0.00790      0.00     0.9991
B          B           1       2.98968      2.46392      1.21     0.2252
C          C           1     110.87518     85.72047      1.29     0.1961
D          D           1    -347.03355    127.76966     -2.72     0.0067
E          E           1      94.43187     99.11811      0.95     0.3409
F          F           1   -0.00041549      0.00609     -0.07     0.9456
G          G           1      -5.06148     49.23571     -0.10     0.9181
H          H           1         16426        13104      1.25     0.2103
I          I           1        -45101        22002     -2.05     0.0406
J          J           1        -46856        29047     -1.61     0.1070
K          K           1        -48402        22305     -2.17     0.0302
L          L           1        -30408        21738     -1.40     0.1621
M          M           1        -46714        22433     -2.08     0.0375
N          N           1        -27593        21538     -1.28     0.2004
O          O           1        -31519        23769     -1.33     0.1851
P          P           1        -54905        20912     -2.63     0.0088
Q          Q           1        -24632        34383     -0.72     0.4739
R          R           1     912.08579    377.31024      2.42     0.0158
S          S           1    1210.77268   7284.51610      0.17     0.8680
T          T           1        -22182        16564     -1.34     0.1808
Given this output, the following are the conclusions I have drawn:
DEPD = 85167 + 0.00000861 A + 2.98968 B + 110.87518 C - 347.03355 D + 94.43187 E - 0.00041549 F - 5.06148 G + 16426 H - 45101 I - 46856 J - 48402 K - 30408 L - 46714 M - 27593 N - 31519 O - 54905 P - 24632 Q + 912.08579 R + 1210.77268 S - 22182 T
(However, since all the Pr > |t| values are greater than 0.001 the coefficients are not very accurate. Also since the Standard error values are pretty high, the coefficients are not very accurate.)
Parameters A to T explain 2.79% (R-square value) of the variation shown in DEPD.
1) Is this correct?
In addition,
2) What is F Value and Pr > F?
3) What is t Value?
4) Is there anything else I can get from the SAS output?
Thank you.
Nope - but I will move it for you.
I don't have emotions and sometimes that makes me very sad.
that works out even better, thanks man!
The first question I have for you is: WHAT IS YOUR OBJECTIVE? Are you trying to screen out the variables and find out which of the variables are significant in explaining the response?
If you are trying to find out which of the variables are important, then running a variable selection approach would be useful (forward, backward or stepwise). If you already know that some of the variables are important, then you can force those variables into the model (even if they are not significant).
2) What is F Value and Pr > F?
The F test assesses the significance of your model as a whole. To be precise, it tests the null hypothesis that none of the variables is linearly related to the response variable (all slope coefficients are zero) against the alternative that at least one of them is. The F value is the test statistic, and Pr > F is the p-value associated with it. Here it is marginally significant at the 5% level (0.0496), indicating that at least one of the variables is linearly related to the response variable.
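As a rough check, the F value, R-Square and Adj R-Sq in the posted output can be rebuilt from the sums of squares and degrees of freedom in the ANOVA table. This is only a sketch: the data set name anova_check and its variable names are made up for illustration, and the numbers are copied by hand from the output in the first post.

```sas
/* Rebuild F, R-square and adjusted R-square from the posted ANOVA table. */
/* anova_check and all variable names are hypothetical.                   */
data anova_check;
  ss_model = 3.670499E11;  df_model = 20;
  ss_error = 1.277317E13;  df_error = 1101;
  ss_total = ss_model + ss_error;
  df_total = df_model + df_error;

  ms_model = ss_model / df_model;          /* mean square = SS / DF       */
  ms_error = ss_error / df_error;
  f_value  = ms_model / ms_error;          /* F Value, about 1.58         */
  p_value  = 1 - probf(f_value, df_model, df_error);   /* Pr > F          */

  r_square = ss_model / ss_total;          /* R-Square, about 0.0279      */
  adj_rsq  = 1 - (1 - r_square) * df_total / df_error; /* about 0.0103    */
run;

proc print data=anova_check noobs;
  var f_value p_value r_square adj_rsq;
run;
```

So the F value is just the model mean square divided by the error mean square, and R-Square is the share of the total sum of squares that the model accounts for.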
3) What is t Value?
The t value is the test statistic for testing the significance of an individual model coefficient: the estimate divided by its standard error. Pr > |t| is the corresponding p-value.
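To make that concrete, here is a small sketch that recomputes the t value and a two-sided p-value for three rows of the posted Parameter Estimates table (D, P and S). The data set name t_check and the variable names are made up; the error degrees of freedom (1101) are taken from the ANOVA table.

```sas
/* t value = estimate / standard error; Pr > |t| is the two-sided p-value. */
/* t_check and the variable names are hypothetical.                        */
data t_check;
  input variable $ estimate std_error;
  df_error = 1101;                          /* error DF from the ANOVA table */
  t_value  = estimate / std_error;
  p_value  = 2 * (1 - probt(abs(t_value), df_error));
  datalines;
D -347.03355 127.76966
P -54905 20912
S 1210.77268 7284.51610
;
run;

proc print data=t_check noobs;
run;
```

The t values should come out close to the -2.72, -2.63 and 0.17 shown in the output, which is a quick way to convince yourself where the column comes from.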
4) Is there anything else I can get from the SAS output?
Yes. Since you have so many variables, i.e. you are in a multiple linear regression setting, you can definitely explore a bit more. You want to know whether there is multicollinearity, i.e. whether any of the independent variables are correlated with one another. This can be checked using the Variance Inflation Factor (VIF); a VIF > 10 means the multicollinearity is serious.
You can explore the residuals and check the underlying assumptions of the model.
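One possible way to look at the residuals is sketched below. It assumes the data set is called File_name and uses the variable names from the posted output; the output data set diag and the names fitted and resid are made up for illustration.

```sas
/* Sketch: save fitted values and residuals, then inspect them.      */
/* diag, fitted and resid are hypothetical names.                    */
proc reg data=File_name;
  model DEPD = A B C D E F G H I J K L M N O P Q R S T;
  output out=diag p=fitted r=resid;
run; quit;

/* Residuals against fitted values: look for curvature or a funnel shape. */
proc sgplot data=diag;
  scatter x=fitted y=resid;
  refline 0 / axis=y;
run;

/* Normality of the residuals: tests, histogram and normal Q-Q plot. */
proc univariate data=diag normal;
  var resid;
  histogram resid / normal;
  qqplot resid / normal(mu=est sigma=est);
run;
```

A clear pattern in the first plot, or strongly skewed residuals in the second step, would suggest that the linear model assumptions are doubtful for this data.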
Oh Thou Perelman! Poincare's was for you and Riemann's is for me.
david_q (02-12-2012)
Ledzep,
Thank you so much for your reply.
Yes, I am trying to find out which of the independent variables have an effect on the dependent variable and to what extent they have an effect on the dependent variable. I know it seems like I have a lot of independent variables but variables H to P exist because there is 1 qualitative variable with 10 possible options.
Right now am I right to say that from the results it seems like none of the independent variables have an impact on the dependent variable?
Also, could you explain a bit more about Variance Inflation Factor? Do I test the VIF between independent variables or between the dependent variable and the independent variables?
Thank you!
Right now am I right to say that from the results it seems like none of the independent variables have an impact on the dependent variable?
The p-values for "D", "I", "K", "M", "P" and "R" are all below 0.05, so these variables are significant at the 5% level of significance. This means that these variables are significant for your response variable.
However, the results may not be reliable, as this may not be the right model; a large VIF would be one indication of that.
VIF is a measure of the severity of collinearity. Larger values (>10) mean more correlation between the independent variables. It is computed among the independent variables only: each independent variable gets its own VIF, based on how well it is explained by the other independent variables, so you just look at the VIF for a given variable. For example:

Code:
/*fake data*/
data test;
  input y x1 x2;
  cards;
8 3 6
3 4 1
2 2 2
4 4 3
2 5 4
;
run;

/*Run the regression and request the VIFs*/
proc reg data=test;
  model y = x1 x2 / vif;
run; quit;

/*trimmed output*/
                     Parameter Estimates
              Parameter    Standard                         Variance
Variable  DF   Estimate       Error   t Value   Pr > |t|   Inflation
Intercept  1    2.61458     3.97927      0.66     0.5787           0
x1         1   -0.53646     0.96588     -0.56     0.6344     1.00208
x2         1    0.97396     0.57253      1.70     0.2310     1.00208

Here the VIFs are less than 10. So, there is no collinearity between the independent variables.

Usually running a variable selection is useful, as it will help you screen out the important variables. Once you have screened your variables, you can fit the selected model and run diagnostic checks to assess the appropriateness of the fitted model.

Code:
/*Run variable selection*/
proc reg data=test;
  model y = x1 x2 / selection=stepwise;
run; quit;
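Coming back to the VIF for a moment: the VIF for x1 is 1/(1 - R-square), where the R-square comes from regressing x1 on the other independent variables (here only x2). The sketch below does that by hand on the fake data set test. The names aux and vif_by_hand are made up, and it relies on the EDF option of proc reg writing the model R-square to the OUTEST= data set as _RSQ_.

```sas
/* Auxiliary regression: x1 on the other independent variable(s). */
/* aux and vif_by_hand are hypothetical names.                    */
proc reg data=test outest=aux edf noprint;
  model x1 = x2;
run; quit;

/* VIF for x1 = 1 / (1 - R-square of the auxiliary regression). */
data vif_by_hand;
  set aux;
  vif_x1 = 1 / (1 - _RSQ_);
  keep _RSQ_ vif_x1;
run;

proc print data=vif_by_hand noobs;
run;
```

The result should agree with the Variance Inflation column that the vif option prints for x1, which also shows that the dependent variable plays no part in the calculation.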
OK!!! Thanks for this additional information. I was firmly assuming up until now that all the variables were continuous variables (as you used proc reg).
So, H to P are different levels of the same variable. Is it possible to have them as a single column? Then you can use proc glm and specify the variable as a class variable, instead of using proc reg.
The danger of using proc reg is that it treats every independent variable as continuous, even if it is categorical. To specify the variable correctly you have to use "proc glm" and declare in the class statement that it is a factor, not continuous.
Ledzep,
Dude you're like a statistics and SAS jedi!
I will rerun the data with the vif code.
What is the reason you put "selection=stepwise" after the model statement in the second code?
I can definitely get the categorical values into one column, the raw data specifies it as one column and I separated it out. How would I run the code then?
proc glm data=File_name;
model DEPD = A B C D E F G H Q R S T;
run;
In that case what does the coefficient for H represent?
Thanks Ledzep!
I thought you're interested to find out which of the variables were significant. Using selection=stepwise will allow you to come up with the variables which were significant for your response in the presence of other variables in the model.
I can definitely get the categorical values into one column, the raw data specifies it as one column and I separated it out. How would I run the code then?
proc glm data=File_name;
model DEPD = A B C D E F G H Q R S T;
run;

Yes, the code is pretty much as you said, with the slight addition of a class line.

Code:
proc glm data=File_name;
  class H;   /* list all your categorical/factor variables here; SAS calls them class variables */
  model DEPD = A B C D E F G H Q R S T / solution;   /* solution prints the parameter estimates */
run; quit;

In that case what does the coefficient for H represent?
It should list 10 different estimates for H, one for each level of H (one of them should be zero, as SAS will set it as the reference category).
Ledzep,
I have 1 last question before I re run the regression:
I have 2 other categorical variables (S and T) but these are either yes or no. So S and T have value of 1 for yes and value of 0 for no. Should I leave them as they are or move them to class variables?
Thank you!
You should move S and T to the class list.
If you don't specify that it is a class variable, SAS will assume it is a continuous variable and will fit it as a linear effect.
Except that stepwise selection is NOT a good procedure. Here is a link explaining a few of the reasons why you really shouldn't use it: http://www.childrensmercy.org/stats/faq/faq12.aspx
Just to give you an example of what happens when you don't specify class.
Code:
/*fake data*/
data test;
  input y x1 x2;
  cards;
8 1 6
3 1 1
2 1 2
4 0 3
2 0 4
;
run;
/* x1 is a yes/no variable */

Code:
/*With Class specified for x1*/
proc glm data=test;
  class x1;
  model y = x1 x2 / solution;
run; quit;

/*output*/
                                   Standard
Parameter        Estimate             Error    t Value    Pr > |t|
Intercept     1.229885057 B      1.84671547       0.67      0.5740
x1        0  -1.850574713 B      1.74372021      -1.06      0.3998
x1        1   0.000000000 B       .                .         .
x2            1.034482759        0.49651979       2.08      0.1726

/* TWO estimates for x1, one for each level of x1. SAS sets the highest level as the reference category, hence the 0. */

Code:
/*Now, class not told to SAS*/
proc glm data=test;
  model y = x1 x2;
run; quit;

/*output*/
                                 Standard
Parameter       Estimate            Error    t Value    Pr > |t|
Intercept   -0.620689655       2.19257205      -0.28      0.8037
x1           1.850574713       1.74372021       1.06      0.3998
x2           1.034482759       0.49651979       2.08      0.1726

/* SEE that there is only one estimate for x1, as SAS is assuming x1 is a continuous variable, i.e. assuming a linear effect. However, it is in fact not linear, as we know it is a YES/NO variable. */
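A side note on reading the two outputs together: the sketch below plugs the printed estimates back in and compares what each fit contributes for x1 = 0 and x1 = 1, with x2 held fixed. The data set name compare and the variable names are made up for illustration, and the numbers are copied from the two outputs above.

```sas
/* Compare the two fits using the estimates printed above. */
/* compare and all variable names are hypothetical.        */
data compare;
  /* with CLASS x1 (x1 = 1 is the reference level) */
  b0_class = 1.229885057;   x1_0_class = -1.850574713;
  /* without CLASS (x1 treated as continuous) */
  b0_cont  = -0.620689655;  x1_cont    =  1.850574713;

  do x1 = 0 to 1;
    /* intercept plus the x1 part of the prediction, x2 held fixed */
    part_class = b0_class + (x1 = 0) * x1_0_class;
    part_cont  = b0_cont  + x1 * x1_cont;
    output;
  end;
  keep x1 part_class part_cont;
run;

proc print data=compare noobs;
run;
```

For a variable with only two levels coded 0 and 1, the two columns should agree up to rounding, and the x2 estimate is the same in both outputs, so the two fits describe the same model with a different reference point. The class statement starts to matter in earnest for a variable with three or more levels, such as H, where a single linear slope would force equal steps between the levels.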
But they would probably just get an error I'm guessing since their categorical variable is probably actually text and not numeric so it wouldn't be able to treat it as continuous.