# Grouping Categorical Variables in Regression Differently in Main Effects vs. Interactions

#### FigNewtons1978

##### New Member
All,

I have a peculiar problem that I am trying to solve. What follows is an explanation of the solution that is currently implemented in our group. I am not 100% sure that this is, in fact, a statistically valid approach. So, here it goes.

Subject domain: auto insurance data.

Dependent variable (Y) = Claim Loss $ (continuous). Independent variables (X) = Driver Age (continuous) [X1], Driver Credit Score Group (categorical) [X2], Business Category (categorical) [X3]. In essence, is the total Loss $ in an accident a function of driver age, the driver's credit score (good/bad), and the type of business (pizza delivery vs. long-haul freight delivery)?

Y = F (X1, X2, X3) + Error

The basic question to be answered is: what is the effect of a person's credit score grouping on Claim Loss $?

Some more problem statement and data set-up. X2 (Credit Score) has been grouped into 5 distinct groups, C1, C2, C3, C4 and C5 (based on the credit score of a person). X3 (Business Category) has been grouped into 5 distinct groups, B1, B2, B3, B4 and B5 (based on certain business characteristics). So, in the context of the question above: if you are a driver whose credit score falls into the C1 category, what is the COMBINED EFFECT of credit on the Loss $?

Generally speaking, this is how I would set up a simple multiple regression model (and this is supported by regression theory/literature):

Y = b0 + b1 * Age + b2 * Credit + b3 * Business + b4 * (Credit x Business) + b5 * (Credit x Age)
whereby I get the "main effects" of age, credit and business, and also the "interaction effects" of credit with business and of credit with age.
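Because Credit and Business are categorical, b2, b3, b4 and b5 each stand for a set of coefficients rather than a single number. A sketch of the expanded model, assuming reference (dummy) coding with C1 and B1 as the baseline levels (the coding scheme is my assumption, it is not stated anywhere above):

```latex
\begin{aligned}
Y ={}& b_0 + b_1\,\mathrm{Age}
 + \sum_{j=2}^{5} b_{2,j}\,\mathbf{1}[\mathrm{Credit}=C_j]
 + \sum_{k=2}^{5} b_{3,k}\,\mathbf{1}[\mathrm{Business}=B_k] \\
&+ \sum_{j=2}^{5}\sum_{k=2}^{5} b_{4,jk}\,\mathbf{1}[\mathrm{Credit}=C_j]\,\mathbf{1}[\mathrm{Business}=B_k]
 + \sum_{j=2}^{5} b_{5,j}\,\mathbf{1}[\mathrm{Credit}=C_j]\,\mathrm{Age}
 + \mathrm{Error}
\end{aligned}
```

Note that in this form the same five credit levels C1 to C5 appear in the main effect and in both interactions.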

So, here is the current solution (and I cannot find any academic references/papers that use this approach).

Here is how the independent variables are currently set up, specifically the credit group:

For the main effect of Credit (i.e., coefficient b2) ==> C1, C2, C3, C4, C5 are used as the levels of the independent variable (as is).

For the interaction effect of Credit with Business ==> (C1 + C2) are grouped as "A", (C3 + C4) are grouped as "B", and C5 is left as is (called group "C"). That is, a new column with these new credit groups is created and then interacted with Business.

For the interaction effect of Credit with Age ==> (C1) is grouped as "A1", (C2 + C3 + C4) are grouped as "A2", and C5 is left as is (called group "A3"). That is, a new column with these new credit groups is created and then interacted with Age.

So in essence, the main effect uses ONE grouping of Credit, while each interaction column uses a different grouping depending on the interaction: [A, B, C for Credit x Business] and [A1, A2, A3 for Credit x Age].
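To make the three groupings concrete, here is a minimal sketch of the recoding in Python (the real work is done in SAS; the dictionary, function and column names are hypothetical and only illustrate the mapping described above):

```python
# Original credit levels are used as-is for the main effect.
CREDIT_LEVELS = ["C1", "C2", "C3", "C4", "C5"]

# Coarser grouping used only for the Credit x Business interaction.
CREDIT_FOR_BUSINESS = {"C1": "A", "C2": "A", "C3": "B", "C4": "B", "C5": "C"}

# A different coarser grouping used only for the Credit x Age interaction.
CREDIT_FOR_AGE = {"C1": "A1", "C2": "A2", "C3": "A2", "C4": "A2", "C5": "A3"}


def credit_columns(credit):
    """Return the three credit columns that one policy row ends up with."""
    return {
        "credit_main": credit,
        "credit_x_business": CREDIT_FOR_BUSINESS[credit],
        "credit_x_age": CREDIT_FOR_AGE[credit],
    }


for level in CREDIT_LEVELS:
    print(level, credit_columns(level))
```

For example, C2 shares its Credit x Business group with C1 but shares its Credit x Age group with C3 and C4, so no two of the three columns partition the drivers in the same way.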

Furthermore, here is the kicker when it comes to the regression.

FIRST STEP
==> SOLVE ONLY THE MAIN EFFECTS

Y = b0 + b1 * Age + b2 * Credit + b3 * Business [USING C1 to C5 for credit (as defined initially)]

Obtain the residuals from step 1 above.

SECOND STEP
Regress the residuals obtained from step 1 above on the interaction effects

Residuals from Step 1 = b4 * (Credit[A,B,C] x Business) + b5 * (Credit[A1,A2,A3] x Age)
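The two steps can be sketched as follows. This is only an illustration in plain Python on synthetic data, not our SAS code: the data-generating numbers, the reference levels (C1 and B1), the cell-indicator coding of the interactions and the small `ols` helper are all assumptions made for the sake of a runnable example.

```python
import random

random.seed(0)  # synthetic data only; nothing here comes from the real portfolio

CREDIT = ["C1", "C2", "C3", "C4", "C5"]
BUSINESS = ["B1", "B2", "B3", "B4", "B5"]
TO_ABC = {"C1": "A", "C2": "A", "C3": "B", "C4": "B", "C5": "C"}
TO_A123 = {"C1": "A1", "C2": "A2", "C3": "A2", "C4": "A2", "C5": "A3"}


def ols(X, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    M = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(M[i][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for i in range(k):
            if i != c:
                f = M[i][c] / M[c][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]


def main_effects(age, credit, business):
    # Intercept, age, dummies for C2..C5 and B2..B5 (C1, B1 assumed as reference).
    return ([1.0, age]
            + [float(credit == c) for c in CREDIT[1:]]
            + [float(business == b) for b in BUSINESS[1:]])


def interactions(age, credit, business):
    # Credit[A,B,C] x Business cell indicators, then Credit[A1,A2,A3] x Age slopes.
    cells = [float(TO_ABC[credit] == g and business == b)
             for g in "ABC" for b in BUSINESS]
    slopes = [age * float(TO_A123[credit] == g) for g in ("A1", "A2", "A3")]
    return cells + slopes


# Made-up policies: (age, credit group, business group, loss).
rows = []
for _ in range(600):
    age = random.uniform(18, 75)
    credit = random.choice(CREDIT)
    business = random.choice(BUSINESS)
    loss = 5000 - 20 * age + 300 * CREDIT.index(credit) + random.gauss(0, 500)
    rows.append((age, credit, business, loss))
y = [r[3] for r in rows]

# FIRST STEP: main effects only, using C1..C5.
X1 = [main_effects(a, c, b) for a, c, b, _ in rows]
beta1 = ols(X1, y)
resid = [yi - sum(bj * xj for bj, xj in zip(beta1, x)) for x, yi in zip(X1, y)]

# SECOND STEP: regress the step-1 residuals on the regrouped interaction columns.
X2 = [interactions(a, c, b) for a, c, b, _ in rows]
beta2 = ols(X2, resid)

print(len(beta1), "main-effect coefficients;", len(beta2), "interaction coefficients")
```

In this sketch the step-1 coefficients are never re-estimated once the interaction columns enter, which is exactly the feature of the procedure I am asking about.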

*ALL OF THIS IS DONE IN SAS.

Here is how the output is interpreted and used.

For example: for a person with a credit score of C1, the effect of credit is calculated as ==> b2 (from step 1 - main effect) + b4 (from step 2 - interaction of Credit with Business) + b5 (from step 2 - interaction of Credit with Age).
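Spelled out as a lookup, the calculation looks roughly like the sketch below. The coefficient values are HYPOTHETICAL placeholders chosen for easy arithmetic, not output from our models, and multiplying the Credit x Age coefficient by the driver's age is my assumption about how a slope term would be applied (the description above simply writes b2 + b4 + b5).

```python
# Credit groupings used by the two interaction terms.
TO_ABC = {"C1": "A", "C2": "A", "C3": "B", "C4": "B", "C5": "C"}
TO_A123 = {"C1": "A1", "C2": "A2", "C3": "A2", "C4": "A2", "C5": "A3"}

# HYPOTHETICAL coefficients, for illustration only (not fitted to any data).
b2 = {"C1": 100.0, "C2": 80.0, "C3": 50.0, "C4": 20.0, "C5": 0.0}  # step 1, keyed by C1..C5
b4 = {("A", "B2"): 20.0}  # step 2, keyed by (A/B/C, business); one cell shown
b5 = {"A1": 0.5, "A2": 0.25, "A3": 0.0}  # step 2, slope per year of age, keyed by A1/A2/A3


def combined_credit_effect(credit, business, age):
    """Add the step-1 main effect to the two step-2 interaction terms."""
    main = b2[credit]                                        # looked up under C1..C5
    with_business = b4.get((TO_ABC[credit], business), 0.0)  # looked up under A/B/C
    with_age = b5[TO_A123[credit]] * age                     # assumption: slope times age
    return main + with_business + with_age


print(combined_credit_effect("C1", "B2", 40))  # 100 + 20 + 0.5 * 40 = 140.0
print(combined_credit_effect("C2", "B2", 40))  # 80 + 20 + 0.25 * 40 = 110.0
```

Each of the three terms is looked up under a different grouping of the same credit variable: here C1 and C2 share the Credit x Business term but not the other two.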

So, in essence, my question is: can I use ONE SET of groupings of the categorical variable in the "main effects", different sets of groupings for the "interaction effects", and then JUST ADD THEM (the coefficients) UP TO GET A FINAL "EFFECT" VALUE?

My point is: if the credit grouping had been maintained CONSISTENTLY across the main effect AND the interactions, then the above interpretation would hold. How can we just "add" up the main effects and interaction effects when the grouping of the categorical variable is NOT consistent across them? I am baffled by this. Could anybody shed some light on this? Your help is greatly appreciated.

* I am terribly sorry for setting up such a long explanation before I finally asked the question, but there is no easy way to explain the above problem, and I am still baffled by the solution that is currently being used.