I have a rather peculiar problem. Below is an explanation of the solution currently implemented in our group; I am not 100% sure that it is, in fact, a statistically valid approach. So, here it goes.

Subject domain: auto insurance data.

Dependent variable (Y):

**Claim Loss $** (continuous)

Independent variables (X):

**Driver Age** (continuous) [X1],

**Driver Credit Score Group** (categorical) [X2],

**Business Category** (categorical) [X3]

In essence: is the total Loss $ in an accident a function of driver age, the driver's credit score (good/bad), and the type of business (pizza delivery vs. long-haul freight delivery)?

Y = F (X1, X2, X3) + Error

The basic question to be answered is ==> what is the effect of, for example, a person's credit score grouping on Claim Loss $?

Some more problem statement/data set-up. For example:

X2 (Credit Score) has been grouped into 5 distinct groups, C1, C2, C3, C4 and C5 (based on the credit score of a person)

X3 (Business Category) has been grouped into 5 distinct groups B1, B2, B3, B4 and B5 (based on certain business characteristics)

So, in the context of the question above: if you are a driver whose credit score falls into the C1 category, **what is the COMBINED EFFECT of credit** on the Loss $?

*GENERALLY SPEAKING, this is how I would set up a simple Multiple Regression Model (and this is supported by regression theory/literature)*

**Y = b0 + b1 * Age + b2 * Credit + b3 * Business + b4 * (Credit X Business) + b5 * (Credit x Age)**

With this I get the "main effects" of age, credit and business, and also the "interaction effects" of credit with business and of credit with age.
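As a rough illustration of what that single model looks like once the categorical variables are dummy-coded, here is a minimal sketch in Python with NumPy (the actual work is done in SAS). All data below are simulated stand-ins, the first level of each factor is assumed to be the reference level, and the helper `dummies` is a hypothetical name introduced only for this example. Note that "b2", "b3", "b4" and "b5" each become a block of coefficients rather than a single number.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated stand-ins for the real columns (all values here are made up).
age = rng.uniform(18, 75, size=n)
credit = rng.integers(0, 5, size=n)      # 0..4 stands for C1..C5
business = rng.integers(0, 5, size=n)    # 0..4 stands for B1..B5
loss = rng.gamma(shape=2.0, scale=1000.0, size=n)

def dummies(codes, n_levels):
    """One indicator column per level except the first (reference) level."""
    return (codes[:, None] == np.arange(1, n_levels)).astype(float)

D_credit = dummies(credit, 5)        # 4 columns -> the "b2" block
D_business = dummies(business, 5)    # 4 columns -> the "b3" block

# Credit x Business: every pairwise product of dummy columns (4 * 4 = 16).
cxb = (D_credit[:, :, None] * D_business[:, None, :]).reshape(n, -1)
# Credit x Age: each credit dummy multiplied by age (4 columns).
cxa = D_credit * age[:, None]

# Intercept, age, main effects and both interactions in ONE design matrix.
X = np.column_stack([np.ones(n), age, D_credit, D_business, cxb, cxa])
coef, *_ = np.linalg.lstsq(X, loss, rcond=None)

print(X.shape)   # rows x (1 + 1 + 4 + 4 + 16 + 4) columns
```

Here the same five credit groups feed the main effect and both interactions, which is the consistent set-up the formula above assumes.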

So, here is the current solution **(and I cannot find any academic references/papers that use this approach).**

Here is how the independent variables have been set up currently, specifically the credit group.

For the **main effect of Credit** (i.e., coefficient b2) -> C1, C2, C3, C4, C5 are used as the independent variable (as is).

For the interaction effect of **credit with business** -> (C1 + C2) ==> grouped as "A", (C3 + C4) ==> grouped as "B", and C5 is left as is (called group "C"). That is, a new column with these new credit groups is created and then interacted with Business.

For the interaction effect of **credit with Age** -> (C1) ==> grouped as "A1", (C2 + C3 + C4) ==> grouped as "A2", and C5 is left as is (called group "A3"). That is, a new column with these new credit groups is created and then interacted with Age.

**So in essence, the main effect uses ONE grouping of Credit, and each interaction column uses a different grouping depending on the interaction: [A, B, C for Credit x Business], [A1, A2, A3 for Credit x Age].**???
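To make the three credit columns concrete, here is a small sketch in plain Python of the regrouping described above. The mappings come straight from the description; the dictionary, function, and column names are hypothetical, chosen only for this illustration.

```python
# Credit grouping used only for the Credit x Business interaction.
BUSINESS_INTERACTION_GROUP = {"C1": "A", "C2": "A", "C3": "B", "C4": "B", "C5": "C"}
# Credit grouping used only for the Credit x Age interaction.
AGE_INTERACTION_GROUP = {"C1": "A1", "C2": "A2", "C3": "A2", "C4": "A2", "C5": "A3"}

def add_interaction_groups(row):
    """Return a copy of the row with the two regrouped credit columns added."""
    out = dict(row)
    out["credit_x_business_group"] = BUSINESS_INTERACTION_GROUP[row["credit"]]
    out["credit_x_age_group"] = AGE_INTERACTION_GROUP[row["credit"]]
    return out

rows = [{"credit": c} for c in ("C1", "C2", "C3", "C4", "C5")]
for r in map(add_interaction_groups, rows):
    print(r)
```

Laid out this way, C1 and C2 share group "A" for the business interaction but fall into different groups ("A1" and "A2") for the age interaction, so neither regrouping is a simple coarsening of the other.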

Furthermore, here is the kicker when it comes to the regression itself.

**FIRST STEP**

==> SOLVE ONLY THE MAIN EFFECTS

Y = b0 + b1 * Age + b2 * Credit + b3 * Business [USING C1 to C5 for credit (as defined initially)]

Obtain the residuals from the step one above

**SECOND STEP**

Regress the residuals obtained from step 1 above on the interaction effects

Residuals from Step 1 = b4 * (Credit[A,B,C] X Business) + b5 * (Credit[A1,A2,A3] x Age)
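One way to probe the two-step procedure above is a small simulation that compares it with a single joint fit. The sketch below is in Python with NumPy rather than SAS, uses entirely simulated data with arbitrarily chosen "true" coefficients, and simplifies credit to a hypothetical two-level group `g` interacted with age. It is only meant to show how the comparison could be set up, not to reproduce the real model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Simulated data; the true coefficients below are arbitrary choices.
age = rng.uniform(18, 75, size=n)
g = rng.integers(0, 2, size=n).astype(float)   # hypothetical two-level credit group
inter = g * age                                # interaction column
y = 1000 + 20 * age + 300 * g + 50 * inter + rng.normal(0, 500, size=n)

def ols(X, target):
    """Least-squares coefficients for target ~ X (X already holds any intercept)."""
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

# Step 1: main effects only, then keep the residuals.
X_main = np.column_stack([np.ones(n), age, g])
resid = y - X_main @ ols(X_main, y)

# Step 2: regress the residuals on the interaction column alone
# (no intercept, mirroring the step-2 formula above).
b_two_step = ols(inter[:, None], resid)[0]

# Single joint fit with main effects and interaction together.
X_joint = np.column_stack([X_main, inter])
b_joint = ols(X_joint, y)[-1]

print(f"two-step: {b_two_step:.2f}, joint: {b_joint:.2f}, true value used: 50")
```

In this set-up the interaction column is strongly correlated with the main-effect columns, so step 1 absorbs most of what the interaction would explain and the step-2 coefficient comes out much closer to zero than the joint estimate. Whether the same happens with the real data would have to be checked on that data.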

**ALL OF THIS IS DONE IN SAS.**

Here is how the output is interpreted and used.

For example, for **a person with a credit score of C1**, the effect of credit is calculated as ==>

**b2 (from step 1 - Main Effect) + b4 (from step 2 - Interaction of Credit with Business) + b5 (from step 2 - Interaction of Credit with Age).**
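Spelled out as a lookup, the additive reading above might look like the following Python sketch. It mirrors the description literally: each coefficient is looked up through its own credit grouping and the three values are summed. The function name is hypothetical and the coefficient values are placeholders, not estimates from any real model.

```python
# Which credit grouping each interaction term uses (taken from the description above).
GROUP_FOR_BUSINESS = {"C1": "A", "C2": "A", "C3": "B", "C4": "B", "C5": "C"}
GROUP_FOR_AGE = {"C1": "A1", "C2": "A2", "C3": "A2", "C4": "A2", "C5": "A3"}

def combined_credit_effect(credit, business, b2, b4, b5):
    """Sum the three credit-related coefficients exactly as described above."""
    main = b2[credit]                                        # keyed by C1..C5
    with_business = b4[(GROUP_FOR_BUSINESS[credit], business)]  # keyed by A/B/C
    with_age = b5[GROUP_FOR_AGE[credit]]                     # keyed by A1/A2/A3
    return main + with_business + with_age

# Placeholder coefficients, NOT estimates from any real model.
b2 = {"C1": 1.0, "C2": 2.0}
b4 = {("A", "B1"): 10.0}
b5 = {"A1": 100.0, "A2": 200.0}

print(combined_credit_effect("C1", "B1", b2, b4, b5))  # 1 + 10 + 100 = 111.0
print(combined_credit_effect("C2", "B1", b2, b4, b5))  # 2 + 10 + 200 = 212.0
```

The sketch makes the inconsistency visible: C1 and C2 pick up the same business-interaction coefficient but different main-effect and age-interaction coefficients, because each term is keyed by a different grouping.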

So, in essence --> **my question is** ==> Can I use ONE SET OF groupings of the categorical variable in the "main effects", different sets of groupings for the "interaction effects", and then **JUST ADD THEM (the coefficients)** UP TO GET A FINAL "EFFECT" VALUE? My point is: if the credit grouping had been **maintained CONSISTENTLY across the main effect AND interactions**, then the above interpretation would hold. How can we just "add" up the main effects and interaction effects when the grouping of the categorical variable is NOT consistent across them? I am baffled by this.

**Could anybody shed some light on this ? Your help is greatly appreciated.**

* I am terribly sorry for setting up such a long explanation before I finally asked the question, but there is no easy way to explain the above problem, and I am still baffled by the solution that is currently being used.