Categorical variables - Logistic regression

#1
Hello,

I am trying to fit a logistic regression model. I have 4 numeric variables and 1 categorical variable, all of which are significant. The categorical
variable has 8 levels. I want to bin this categorical variable into 3 levels based on the response. Can I do this?
For example, the new variable would have 3 levels: "GOOD", "AVERAGE" and "BAD".

Level - GOOD

Level A has 25% positive response
Level D has 24% positive response
and Level E has 27% positive response

Level - AVERAGE

Level B has 14% positive response
Level F has 16% positive response


Level - BAD

Level C has 8% positive response
Level G has 7% positive response
and Level H has 5% positive response

Can I group the levels of the categorical variable into fewer bins based on the % positive response?

Please advise. Thank you in advance.
 

jrai

New Member
#2
I think you need to explain more. What is your DV? Is the categorical variable you're trying to group your DV? Explain the current possible values of this categorical variable, and what exactly do you mean by positive response here?

Also, what benefits are you hoping to gain from this grouping exercise?
 
#3
Hi Jrai,

The response variable has 2 levels, 0 and 1.
% positive response = Sum of 1's / Total population
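
As a rough illustration of that formula, here is a minimal Python sketch that computes the share of 1's within each level. It reads "total population" as the number of rows in that level, which is an assumption. The function name `positive_rate_by_level` and the toy data are hypothetical and are not taken from my dataset.

```python
from collections import defaultdict


def positive_rate_by_level(levels, targets):
    """Return {level: share of rows in that level with target == 1}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for level, target in zip(levels, targets):
        totals[level] += 1
        positives[level] += target  # target is 0 or 1, so this counts the 1's
    return {level: positives[level] / totals[level] for level in totals}


# Hypothetical toy data: one categorical level and one 0/1 target per row.
levels = ["A", "A", "A", "A", "B", "B", "C", "C", "C", "C"]
targets = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]

rates = positive_rate_by_level(levels, targets)
for level, rate in sorted(rates.items()):
    print(f"Level {level}: {rate:.0%} positive response")
```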

Response variable -
Target (0,1)
Independent variable -
IV1 - IV4 (Numeric)
IV5 - Categorical (8 levels A-H)
I don't want to keep all 8 levels of IV5, so I decided to use the % positive response to group the categorical variable into a transformed variable (TV) with fewer levels. I described the approach in my first post.

My final model has:
Response variable -
Target (0,1)
Independent variable -
IV1 - IV4 (Numeric)
TV - Categorical (3 levels GOOD, AVERAGE and BAD)

The transformed variable has been created using the target variable. Can I use this approach?

Thanks
 

terzi

TS Contributor
#4
Since your transformed variable would be created from the DV, I don't see the point in using it as a regressor. It is almost guaranteed to come out significant, because the groups were defined by the very outcome you are trying to predict. Besides, that would obscure the real contributions of the 8 original levels of your IV. I'd suggest reducing the number of groups without using information from the DV. Alternatively, if your sample size is large enough, you can keep all 8 levels and code them as seven dummy variables.

If you can tell us what those variables are about and why you are changing the groups based on your response, we may be able to suggest something better.
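
To make the seven-dummy option concrete, here is a minimal Python sketch of reference (treatment) coding for an 8-level variable. The helper `dummy_code` is hypothetical and written only for illustration, and picking "A" as the reference level is an arbitrary assumption. In practice most statistical packages do this coding for you.

```python
def dummy_code(values, reference):
    """Code a categorical variable as 0/1 indicators, omitting the reference level.

    Returns (column_levels, rows): one indicator column per non-reference
    level and one row of indicators per observation.
    """
    column_levels = sorted(set(values) - {reference})
    rows = [[1 if value == level else 0 for level in column_levels]
            for value in values]
    return column_levels, rows


# Hypothetical data: one observation for each of the levels A-H.
iv5 = list("ABCDEFGH")
columns, rows = dummy_code(iv5, reference="A")

print(columns)   # the seven non-reference levels, B through H
print(rows[0])   # the reference level A is all zeros
print(rows[2])   # level C has a single 1, in the column for C
```

Each coefficient on a dummy is then interpreted relative to the omitted reference level.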
 
#5
Hi Terzi,

Thank you for the information. The variable is "Regions", and I wanted to cluster the regions into groups instead of having many individual regions. A few regions have very small sample sizes (< 5% of the data). Can I group those with a region that has a similar % response?
I will create dummy variables for the rest of the regions instead of binning them.

Thanks!
 

terzi

TS Contributor
#6
The grouping will depend on how the regions were originally defined. If these are geographical regions, you can merge the smallest ones with their nearest neighbours. If they came out of some earlier analysis, it would be useful to know how they were developed, so you can find the best way to combine them. I wouldn't recommend using the DV to group them. Maybe you can merge one or two of them and then work with only five or six regions.
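
A minimal Python sketch of that kind of merge, under stated assumptions: the region names, the `merge_into` mapping and the helper `merge_small_regions` are all hypothetical, and the 5% cutoff simply reuses the figure you mentioned. The point is that the mapping is chosen from geography, before looking at the response.

```python
from collections import Counter


def merge_small_regions(regions, merge_into, min_share=0.05):
    """Relabel regions whose share of rows is below min_share.

    merge_into maps a small region to the neighbouring region it should join.
    It is decided in advance, without using the DV. Small regions that are
    missing from the mapping are left unchanged.
    """
    counts = Counter(regions)
    n = len(regions)
    small = {region for region, count in counts.items() if count / n < min_share}
    return [merge_into.get(region, region) if region in small else region
            for region in regions]


# Hypothetical data: "Islands" holds 4% of the rows and sits next to "South".
regions = ["North"] * 50 + ["South"] * 46 + ["Islands"] * 4
merged = merge_small_regions(regions, merge_into={"Islands": "South"})

print(Counter(merged))
```

The merged variable can then be dummy coded like any other categorical regressor.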

Hope this helps