# Logistic regression, minimum sample size

#### noetsi

##### Fortran must die
I have a logistic regression with 394 usable cases; 61 of these are 0 and 333 are 1. I have 31 predictors (which cannot be reduced or collapsed). Given various rules of thumb I have seen, I doubt that the 61 cases are enough to generate accurate results.
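A quick back-of-the-envelope check against the common events-per-variable (EPV) rule of thumb (roughly 10 events per predictor; the exact threshold is debated, so this is only a rough guide, not a hard cutoff):

```python
# Events-per-variable (EPV) check for the model described above.
# "Events" here is the size of the rarer outcome class.
events = 61        # cases coded 0 (the less common level)
predictors = 31

epv = events / predictors
print(f"EPV = {epv:.1f}")  # about 2 events per predictor

# A 10-EPV guideline would instead call for roughly:
print(f"events suggested by a 10-EPV rule: {10 * predictors}")
```

At about 2 events per predictor, the doubt expressed above is consistent with that rule of thumb.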

#### noetsi

##### Fortran must die
My concern is that only 4 of the 31 variables were statistically significant at the .05 level. I am concerned that low power is the issue.

#### noetsi

##### Fortran must die
The goal is to show which variables have the greatest impact on overall satisfaction. I am using the Wald score to do so (one of the recommendations by those who think you can actually analyze impact). Note that it is those recommendations, not the slopes, that I am following.

#### noetsi

##### Fortran must die
I don't know the lasso well enough to use it. When the project is more advanced I will try it. I have used it before, just rarely.

#### noetsi

##### Fortran must die
I had a follow-up question. My logistic regression model ran fine. However, to test linearity I created a set of interaction terms, the log of each predictor times the predictor itself (to do a Box-Tidwell test). When I add these 31 terms I get the warning 'complete separation of data points detected' (I usually get this when I accidentally use the DV as a predictor), so the model won't run.

My theory is that with only about 400 data points, 62 in the least common level, and 62 predictors, I simply have too many predictors. With the same data and 31 predictors the model runs fine. Regardless, I was wondering whether I even have to test linearity when all my predictors are ordinal (none are continuous; they are all 4-point Likert data).
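For reference, a minimal sketch of the Box-Tidwell construction described above: for each positive-valued predictor, add a term x*ln(x) and test its coefficient in the logistic model. The data here are simulated, and only the term construction is shown, not the model fit:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 4-point Likert predictors (values 1..4, all positive, so ln(x) is defined)
X = rng.integers(1, 5, size=(394, 3)).astype(float)

# Box-Tidwell terms: each predictor times its own natural log.
# Adding these to the logistic model and testing their coefficients
# against zero is the linearity check described above.
bt_terms = X * np.log(X)

# Augmented design matrix: original predictors plus the BT terms,
# which doubles the predictor count (31 -> 62 in the actual model).
design = np.hstack([X, bt_terms])
print(design.shape)  # (394, 6) for these 3 illustrative predictors
```

Doubling 31 predictors to 62 against only about 62 events makes a separation warning unsurprising.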

#### noetsi

##### Fortran must die
A second, or third, question: GretaGarbo noted the need to test whether the variance is the binomial "n*p*(1-p)" rather than "(sigma^2)*n*p*(1-p)" (which I think is overdispersion).
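One common way to check the binomial-variance assumption (a sketch, not necessarily what GretaGarbo had in mind) is the Pearson dispersion statistic: sum the squared Pearson residuals, divide by the residual degrees of freedom, and compare to 1; values well above 1 suggest overdispersion. All numbers below are made up:

```python
import numpy as np

# Hypothetical grouped binomial data: n trials and y successes per group,
# with fitted probabilities p from some logistic model.
n = np.array([20.0, 25.0, 30.0, 22.0, 18.0])
y = np.array([5.0, 12.0, 20.0, 9.0, 4.0])
p = np.array([0.28, 0.45, 0.63, 0.40, 0.25])

# Pearson residuals use the binomial variance n*p*(1-p)
resid = (y - n * p) / np.sqrt(n * p * (1 - p))

# Dispersion statistic: Pearson chi-square / residual d.f.
# (suppose 2 parameters were estimated, leaving 5 - 2 = 3 d.f.)
dispersion = np.sum(resid**2) / (len(n) - 2)
print(round(dispersion, 2))
```

Note that with ungrouped 0/1 responses (one Bernoulli trial per case) this check is largely uninformative; overdispersion is really only detectable with grouped or clustered binomial data.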

How do you test for that? She also said I should test for a binomial distribution (all values of the DV are 1 and 0).

#### GretaGarbo

##### Human
"She also said I should test for a binomial distribution"

No, I did not say that you "should" test for that. You asked about assumptions, and I mentioned it. Don't overcomplicate things. It is an assumption. Remember: "all models are wrong, but some models are useful" (as Box said).

I would run the 32 explanatory variables. (But maybe I would use the lasso; it is a sort of variable-selection method.) For those which are significant or have a "large" parameter estimate, I would test interactions.

You could run one factor at a time as a factor variable with 4 levels (and all the others as linear effects). If the 4 levels fall on a line, then the linear regression model is fine. A linear regression model can be defended as an approximation from a several-variable Taylor series.
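The one-factor-at-a-time check above amounts to dummy coding that single variable; a minimal sketch with hypothetical data:

```python
import numpy as np

x = np.array([1, 2, 4, 3, 2, 1])  # one 4-level Likert predictor (hypothetical values)

# Dummy-code the 4 levels, with level 1 as reference => 3 dummy columns.
# In the augmented model this variable enters as these 3 dummies while
# every other predictor stays as a single linear term; if the estimated
# level effects fall roughly on a line, the linear coding is defensible.
dummies = np.column_stack([(x == level).astype(int) for level in (2, 3, 4)])
print(dummies)
```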

If all explanatory variables are 4-level Likert items, then all variables have the same scale. (I would say that going from 2 to 4 is an important change.) The most important variable is the one with the highest regression coefficient. But you need to convert back to the probability scale, because your bosses will not understand log(p/(1-p)) = a + b1*x1 + ...; you need to compute p = 1/(1 + exp(-(a + b1*x1 + ...))).
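The back-conversion mentioned above can be sketched in a few lines (the coefficient values are made up for illustration):

```python
import math

def predicted_probability(intercept, coefs, x):
    """Convert log-odds to a probability:
    log(p/(1-p)) = a + b1*x1 + ...  =>  p = 1 / (1 + exp(-(a + b1*x1 + ...)))."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Illustrative: intercept a = -1.0, two Likert predictors at levels 3 and 2
p = predicted_probability(-1.0, [0.8, 0.5], [3, 2])  # log-odds = -1 + 2.4 + 1.0 = 2.4
print(round(p, 3))  # about 0.92
```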

Don't make it too complicated. Simplify for the bosses.

#### noetsi

##### Fortran must die
Sorry GretaGarbo, I misunderstood your original point; I was wondering how I was going to test for independence. I am simplifying it for the bosses, believe me. The primary things I report to them are 1) which variables are significant and 2) which have the most importance [aka impact], as a ranking from high to low. I have used 3 different approaches that I have found recommended:

1) The odds ratio [for values below 1 I use 1/OR, which effectively generates an absolute value].
2) The highest Wald values.
3) A standardized coefficient [one of many that exist; it is the only one SAS generates].
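A sketch of how approaches 1) and 2) could be computed from a fitted model's coefficients and standard errors (all numbers are made up; the OR folding uses 1/OR for values below 1, as described above):

```python
import math

# Hypothetical fitted logistic regression output: {name: (coefficient, std. error)}
fits = {
    "x1": (0.90, 0.30),
    "x2": (-1.20, 0.50),
    "x3": (0.25, 0.10),
}

folded_or = {}  # odds ratio, with OR < 1 replaced by 1/OR (an "absolute" effect size)
wald = {}       # Wald chi-square: (coef / se)^2
for name, (coef, se) in fits.items():
    orat = math.exp(coef)
    folded_or[name] = orat if orat >= 1 else 1.0 / orat
    wald[name] = (coef / se) ** 2

rank = lambda d: sorted(d, key=d.get, reverse=True)
or_rank, wald_rank = rank(folded_or), rank(wald)
print(or_rank)    # ['x2', 'x1', 'x3']
print(wald_rank)  # ['x1', 'x3', 'x2']
```

Even in this toy example the two orderings disagree, illustrating why the approaches need not produce the same ranking.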

My problem is that the rankings from highest to lowest differ somewhat depending on which approach you use, and I have found no consensus on which generates the better results.

I am not sure what you mean by this: treat one predictor as categorical [dummy coding, so you would have 3 dummies] and the rest as linear predictors, then see if the three dummies are significant?

"You could run one factor at a time as a factor variable with 4 levels (and all the others as linear effects). If the 4 levels are on a line then it is fine with the regression model."

Only 4 of my predictors actually are significant at the .05 level.

#### noetsi

##### Fortran must die
Here is an issue that confuses me. In this case it occurs in the context of logistic regression, although I think it applies to most forms of regression.

You have Likert data (4 points in this case). Formally, the non-linearity assumption does not apply to ordinal data. But if you are treating the ordinal variables as continuous (that is, using the odds ratio with them rather than creating a series of dummy variables), do you have to test linearity for them anyway? The Likert-scale variables might be ordinal, but you are treating them as if they were continuous. (I think our Likert data can be treated as continuous, in that it is reasonable to assume the difference between each point is the same, even though formally it is not continuous. This is commonly done, although some disagree.)

While I am asking many questions: if you have ordinal variables and you assume they have to be tested for non-linearity (I used Box-Tidwell), what do you do to determine their importance if they turn out non-linear? Four of my variables were non-linear (although 2 came close to passing the Box-Tidwell test). I am not sure how to determine their importance relative to the other variables if they are non-linear.

Splines or loess don't help, because you can't use those to compare a predictor with the other predictor variables in terms of their impact on the outcome.