Logistic Regression and Multiple Regression

#1
Hi,

Please could you help me (or direct me to someone who can help me) with the following:

I have a spreadsheet with a number of environmental independent variables (e.g. slope, radiation, etc.) and dependent variables (presence/absence and densities of two plant species, Ls and Ee). Note that my independent variables are recorded in categories and are thus NOT continuous.

I have looked at the help option in STAThelp in my Statistica v.9.1 but it seems as though my data is formatted differently and I am unsure of my results.

Please could you help me with the following:

1. Determining any collinearity between the independent variables.
2. Logistic regression for fire, Ls and Ee presence/absence
3. Multiple regression for Ls and Ee densities?

Your help in this regard will be much appreciated!

Kind regards,

Andrew
 
#3
Hi BGM,

Thank you for your willingness to help and prompt response.

I have already referred to many sites such as Wikipedia regarding multicollinearity tests, multiple regression, etc.

I think mine is more dataset specific:

1. I would like to know how to first perform a logistic regression on the presence/absence of fire in plots, with various environmental independent variables that have been categorised (not continuous) (e.g. slope 0-29; 30-59; 60-90 degrees).

2. When I performed a multiple regression analysis (including 0 values) I found my independent variables to be poor predictors of species density. I realised my mistake and omitted the 0-value plots for species densities. Thereafter my adjusted R-squared was even lower than before. It decreased even further after I grouped certain environmental variables that were correlated.

I am therefore confused as to these results. Is my procedure wrong? Is this to be expected?

Please let me know if I need to clarify any point further.

I thank you for your time and help.

Kind regards,

Andrew
 

BGM

TS Contributor
#4
1. http://en.wikipedia.org/wiki/Dummy_variable_(statistics)

First, the categorical variables should be given a dummy variable coding. E.g. create two dummy variables and assign (0, 0), (1, 0) and (0, 1) to the three categories. I think this is the same as in ordinary regression.

After that, it is just a matter of estimating the parameters by maximum likelihood. As long as you know the likelihood function for the logistic regression model, you can do the maximization numerically. Most statistical software has a built-in function for this.

2. Why is including the "0" values a mistake? Are your data "zero-inflated", or do you mean you included missing values? And what do you mean by "grouped certain environmental variables"?
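
To make point 1 concrete, here is a minimal sketch in plain Python of dummy coding followed by a maximum likelihood fit. The helper names (`dummy_code`, `fit_logistic`), the toy fire data and the use of simple gradient ascent are all assumptions made for illustration; a statistics package would normally use a faster optimiser and also report standard errors.

```python
import math

def dummy_code(value, levels):
    """Return k-1 indicators; the first level is the reference (all zeros)."""
    return [1.0 if value == lev else 0.0 for lev in levels[1:]]

def fit_logistic(X, y, lr=1.0, n_iter=5000):
    """Maximise the Bernoulli log-likelihood by plain gradient ascent.
    Rows of X exclude the intercept, which is estimated as beta[0]."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for row, yi in zip(X, y):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
            resid = yi - 1.0 / (1.0 + math.exp(-z))  # observed minus fitted probability
            grad[0] += resid
            for j, x in enumerate(row):
                grad[j + 1] += resid * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data: slope class of each plot and fire presence (1) / absence (0)
slope_levels = ["0-29", "30-59", "60-90"]
slopes = ["0-29"] * 10 + ["30-59"] * 10 + ["60-90"] * 10
fire = ([1, 1, 0, 0, 0, 0, 0, 0, 0, 0] +
        [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] +
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

X = [dummy_code(s, slope_levels) for s in slopes]
beta = fit_logistic(X, fire)
# Odds of fire in each slope class relative to the reference class "0-29"
odds_ratios = [math.exp(b) for b in beta[1:]]
```

With a single categorical predictor the fitted probabilities simply reproduce the observed proportions in each class (0.2, 0.5 and 0.8 in this toy data), which is a handy check that the coding is doing what you expect.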
 
#5
1. Thank you. I will look at dummy variable coding and get back to you.

2a. My supervisor for my dissertation commented on my work saying that "only plots containing the species are used in the multiple linear regression" and that the poor fit may have been the result of adding these plots with no individuals in them.

2b. By grouping certain environmental variables I mean that I looked for correlation between each pair of explanatory variables (if it was strong, only one of the pair was entered).

Does this make more sense?

I will get onto working with the dummy variable coding.
 

noetsi

Fortran must die
#6
You don't need to do logistic regression just because your independent variables are categorical (just create dummy variables as noted above). You only need it if your dependent variable is categorical.

I don't know how Statistica does logistic regression. With SAS or SPSS there are specific menus for logistic regression. You assign variables and options similar to OLS (although the specific options vary). You will get a model test (similar to the F test for OLS, although it's a log-likelihood test with logistic regression) and the equivalent of t tests for individual variables (they are Wald chi-square tests, but you interpret the p value similarly to a t test). The slopes are not easy to interpret. Instead you commonly look at the odds ratio (which tells you how the odds of being in one state of the dependent variable rather than the other change for a unit change in an independent variable).

After you create dummy variables for the categorical independent variable you interpret them the same way you would a dummy in OLS (except you use the odds ratio).
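
As a rough illustration of the odds ratio and Wald test for a single dummy variable, the sketch below works directly from a 2x2 table of counts. The function name `odds_ratio_wald` and the counts are made up for the example. It relies on the fact that, for one binary predictor, the logistic slope equals the log odds ratio and its standard error is sqrt(1/a + 1/b + 1/c + 1/d).

```python
import math

def odds_ratio_wald(a, b, c, d):
    """a, b = present/absent counts where the dummy is 1;
    c, d = present/absent counts in the reference group (dummy is 0)."""
    odds_ratio = (a * d) / (b * c)          # (a/b) divided by (c/d)
    log_or = math.log(odds_ratio)           # equals the logistic slope for this dummy
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    wald_chi2 = (log_or / se) ** 2          # Wald chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(wald_chi2 / 2))
    return odds_ratio, wald_chi2, p_value

# Hypothetical counts: species present in 30 of 100 north-facing plots
# and in 15 of 100 south-facing plots (south-facing is the reference)
odds_ratio, wald_chi2, p_value = odds_ratio_wald(30, 70, 15, 85)
```

An odds ratio above 1 here would mean the odds of presence are higher on north-facing plots than on the reference (south-facing) plots.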

I don't understand what you mean by removing or including the 0s. You can't remove dummy variable values that are 0. If you did, you would have no variation in the independent variable, which will invalidate your test. If one level of the dummy (say 1) occurs more than 9 times as often as the other (say 0), it will distort your results (the slope or odds ratio). A central assumption of regression is that the variables vary. If they don't, your results will be useless.

After you grouped the variables that were correlated, did you rerun a VIF or tolerance test? In some cases, if you do it wrong, you can add multicollinearity rather than reduce it this way. Also, multicollinearity should not be affecting your overall model (including the R-squared value) as far as I know. It distorts the standard errors, and thus the tests of individual variable significance, not the overall model indicators.
 
#7
Hi BGM,

I looked at dummy variable coding but it would not be suitable for multiple regression of my data. The writer on Wikipedia mentions "Too many dummy variables result in a model that does not provide any general conclusions." My data has Aspect (North-facing, South-facing); Distance from stream (Closer than 50m; Further than 50m); Slope (0-29; 30-59; 60-90); Rockiness (0; 1-25; 26-50; 51-75; 76-100%); Average annual solar radiation levels (6 categories) and soil types (8 soil types) for each plot (each an independent variable). Additionally, the densities of two plant species (dependent variables) were recorded for each plot.

I hope this lends more clarity to my situation.

I thank you again for your time and I truly appreciate it.

Kind regards,

Andrew
 
#8
Hi Noetsi,

Thank you for your helpful response. Please see the reply I gave BGM below regarding dummy variable coding.

1. Regarding the use of logistic regression: I want to use species presence (coded 1) and absence (coded 0) with the variables explained below as independent variables. I will try and run the analysis again as you explained though.

2. The 0-values were not dummy variable values - they were actual values (no plant recorded in the plot).

3. I did not run tolerance tests as I do not know how to do so. Do you have any information on running such tests?

Thank you for your time and help, it is much appreciated.

Kind regards,

Andrew
 

BGM

TS Contributor
#9
Yes, if you have too many categories you will always have difficulties if your sample size is relatively small. You can think of it as running a separate regression for each category, so you need enough data points in each category as well.

You may want to look at the nature of your variables again. I guess some of your variables are ordinal in nature, in which case you could assign a score to represent each category. Some of them are in fact interval data, so perhaps just using a representative point for each class (e.g. the mid-point, or class mark) in the regression will do?
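
One possible reading of the "representative point" idea, sketched in plain Python: replace each class label by its midpoint and treat the result as an ordinary numeric predictor. The helper names and the density values are hypothetical, and the midpoint is only one choice of score.

```python
def class_midpoint(label):
    """Midpoint of a class label such as '30-59'; a single value like '0' maps to itself."""
    parts = [float(p) for p in label.split("-")]
    return sum(parts) / len(parts)

def simple_ols(x, y):
    """Intercept and slope of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical plots: slope class and species density
slope_classes = ["0-29", "30-59", "60-90", "0-29", "30-59", "60-90"]
densities = [2.0, 5.0, 9.0, 3.0, 6.0, 8.0]

scores = [class_midpoint(c) for c in slope_classes]  # 14.5, 44.5, 75.0, ...
intercept, slope = simple_ols(scores, densities)
```

This uses one regression coefficient per variable instead of one per category, at the cost of assuming the effect is roughly linear across the classes.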
 

noetsi

Fortran must die
#10
The more levels you have the more DF you will eat up, which is particularly significant if you have a small n. I had an admittedly extreme case last week where I had as many levels of the categorical IV as data points for the DV. So I had no DF and the model would not run.

If you have that many levels I suggest collapsing them into broader categories if you can.
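
A quick way to see how many DF the categories use up is to count the dummy variables: a variable with k levels needs k - 1 of them. The sketch below uses the level counts listed earlier in the thread; the variable names and the sample sizes tried are illustrative assumptions.

```python
# Number of categories per independent variable, as described in the thread
levels = {
    "aspect": 2,
    "distance_from_stream": 2,
    "slope": 3,
    "rockiness": 5,
    "solar_radiation": 6,
    "soil_type": 8,
}

def n_dummies(levels):
    """Each variable with k levels contributes k - 1 dummy variables."""
    return sum(k - 1 for k in levels.values())

def residual_df(n, levels):
    """Residual degrees of freedom for a main-effects model with an intercept."""
    return n - 1 - n_dummies(levels)

total_dummies = n_dummies(levels)
for n in (21, 60, 520):  # illustrative sample sizes
    print(n, "plots ->", residual_df(n, levels), "residual DF")
```

This counts main effects only; any interaction terms would use up further DF.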

1) That is fine for the DV. Remember that different software models the DV level (0 or 1) differently by default. SAS looks at the IV in terms of predicting 0 in the DV, SPSS in terms of predicting 1. So the signs for the same data will be reversed between SPSS and SAS. The odds ratio will be inverted as well (it becomes its reciprocal).

2) In terms of your original question, you should remove the 0 levels only if that makes substantive sense. Not to make your model look better (that is, to get a higher pseudo R-squared, a lower AIC, etc.).

3) I don't know how to specify VIF in Statistica. I will see if I can find it. You really need to run this if you can. If you can't, an alternative is to run each IV in turn as a DV, with all the other IVs as its predictors. You can use OLS for this if the variable is interval in nature. If you get an R-squared value of .9 or higher you have a problem.

That takes longer. I will see if I can find how to do VIF or tolerance in Statistica for logistic regression.
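
The manual alternative in point 3 can be sketched as follows in plain Python: regress each IV on all the others by OLS, take the R-squared, and convert it to VIF = 1 / (1 - R-squared), so that an R-squared of .9 corresponds to a VIF of 10 (and a tolerance of .1). The function names and the toy columns are assumptions for illustration, and the code is not tied to any particular package.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def r_squared(predictors, y):
    """R-squared of an OLS fit of y on the predictor columns plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + predictors
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)  # normal equations
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vif(columns):
    """VIF for each column; fails if a column is an exact linear
    combination of the others (R-squared of exactly 1)."""
    result = []
    for j, target in enumerate(columns):
        others = [c for k, c in enumerate(columns) if k != j]
        result.append(1.0 / (1.0 - r_squared(others, target)))
    return result

# Hypothetical IVs: x2 is almost a copy of x1, x3 is a 0/1 dummy
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8, 7.1, 8.0]
x3 = [0, 1, 1, 0, 1, 0, 0, 1]
vifs = vif([x1, x2, x3])
```

In this toy data `x1` and `x2` should come out with VIFs far above 10, which is the kind of pair where you would keep only one of the two.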
 
#11
Hi BGM,

1. My sample size is 520 plots so sample size isn't the problem. Will a multiple regression analysis work with these categorised independent variables?

2. What did you mean by using "a representative point"?
 
#12
Hi Noetsi,

My n=520 so sample size is not an issue.

1. What are the DV and IV? I am not a great statistician and am still learning.

2. I will look at the data again.

Many thanks for your help and comments,

Andrew
 

noetsi

Fortran must die
#13
IV is what I call independent variables (there are many names for these; they are called predictor variables in some texts). DV is the dependent variable, also called the response variable. My fault for using the shorthand (which I do so often I take it for granted).
 

jpkelley

TS Contributor
#14
BGM wrote:
2. Why is including the "0" values a mistake? Are your data "zero-inflated", or do you mean you included missing values? And what do you mean by "grouped certain environmental variables"?

This is a good point. I don't understand why your advisor would have suggested that "only plots containing the species are used in the multiple linear regression." Species X could be absent from some plots either because it is truly absent, since it isn't associated with factors x, y, and z (true zeroes), OR because it was misidentified, was absent by chance, or was simply missed by the observer (false zeroes). Since you likely knew all the species and surveyed 100% (i.e. low detection or identification error), you're in the realm of "hurdle" models. As BGM said, this is a type of zero-inflated model. It sounds like your data are appropriate for this.
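
To give a feel for the two-part idea behind a hurdle model, here is a minimal sketch in plain Python. Part 1 uses every plot to describe presence/absence; part 2 uses only the plots where the species is present to describe density. The function name, the log scale for densities and the toy data are assumptions, and this is a descriptive summary rather than a full hurdle likelihood fit.

```python
import math
from collections import defaultdict

def two_part_summary(plots):
    """plots: list of (category, density) pairs.
    Returns {category: (proportion present, mean log density where present)}."""
    groups = defaultdict(list)
    for category, density in plots:
        groups[category].append(density)
    summary = {}
    for category, dens in groups.items():
        positive = [d for d in dens if d > 0]
        p_present = len(positive) / len(dens)  # part 1: all plots, zeros included
        mean_log = (sum(math.log(d) for d in positive) / len(positive)
                    if positive else None)      # part 2: occupied plots only
        summary[category] = (p_present, mean_log)
    return summary

# Hypothetical plots: aspect and density of one species
plots = [("north", 0), ("north", 0), ("north", 2.0), ("north", 8.0),
         ("south", 0), ("south", 1.0), ("south", 1.0), ("south", 4.0)]
summary = two_part_summary(plots)
```

With one categorical predictor, the proportions in part 1 match what a logistic regression on that predictor would fit, and the means in part 2 match an OLS fit on the occupied plots. That is how the zeros can stay in the analysis without being pushed through the density regression.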