What technique to use?

#1
I am studying modeling, and I am looking for a good technique to optimize an insurance process, but I am not quite sure how to accomplish what I want to do.

I have a dataset with several categorical variables (policy characteristics) and one numerical variable (historical losses). What I want to do is choose values for the categorical variables that cluster the numerical variable into 5 groups.

I have considered using regression trees because they are the only method I have seen that splits on the categorical variables to reach an optimal grouping. Am I on the right track?

If it matters, I am working in R, but I also use SAS.

Thanks,
Chris
 

staassis

Active Member
#2
There are many ways to partition the observations into 5 clusters. To choose only one, you need to define the clustering / classification criterion in terms of the numerical variable... In plain English, what are you trying to do?
 
#3
The numerical variable is the percentage of loss. The others are categorical variables like credit score 1-5, etc. I need criteria based on the categorical variables that will sort the policies into 5 groups such that the policies within each group have similar loss percentages, with small deviation around the group mean. In other words, I want a box-and-whiskers plot of loss percentage by group number to be stair-stepped.
 
#4
To explain that a little better: currently, I create a grid with the possible combinations of the categorical variables. Then a value of 1-5 is assigned to each cell, corresponding to a particular price. I would like a process that finds the optimal grouping on which to base this price.
 

staassis

Active Member
#5
If I understand correctly, you want to predict a "category" of the numerical variable (Y) using the categorical variables in your data set. One thing you can do is perform the following three steps:

1. Build the optimal linear model: Y = Beta_0 + Beta_1 * X1 + Beta_2 * X2 + ... + Beta_p * Xp + Epsilon. Here (X1, X2, ..., Xp) are the dummy variables coding the categories of the categorical variables in the data set.
2. Split the range of Y into 5 clusters according to a standard clustering technique (like hierarchical clustering).
3. Classify each new observation into one of the 5 clusters based on where its fitted value Y_Hat = Beta_0 + Beta_1 * X1 + Beta_2 * X2 + ... + Beta_p * Xp falls.
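
The three steps above could be sketched roughly as follows. This is only an illustration in plain Python (the same idea carries over to R or SAS), and several things in it are my assumptions rather than anything from your data: the two rating factors and their levels, the simulated loss percentages, a main-effects-only model, and the particular clustering rule (neighbouring groups of sorted Y values are merged by smallest Ward cost, as a simple 1-D stand-in for hierarchical clustering). The names dummy_row, fit_ols, cluster_cut_points and price_group are made up for this sketch.

```python
import bisect
import random

# Hypothetical rating factors; the first level of each is the reference level.
LEVELS = [[1, 2, 3, 4, 5],   # credit score band
          ["A", "B", "C"]]   # made-up second factor (e.g. territory)

def dummy_row(policy):
    """Intercept plus one 0/1 dummy per non-reference level of each factor."""
    x = [1.0]
    for factor_levels, value in zip(LEVELS, policy):
        x.extend(1.0 if value == lv else 0.0 for lv in factor_levels[1:])
    return x

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

def ward_cost(a, b):
    """Increase in within-group sum of squares if groups a and b are merged."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return len(a) * len(b) / (len(a) + len(b)) * (ma - mb) ** 2

def cluster_cut_points(values, k=5):
    """Merge neighbouring groups of sorted values until k remain.
    Returns the k-1 cut points midway between adjacent groups."""
    clusters = [[v] for v in sorted(values)]
    while len(clusters) > k:
        i = min(range(len(clusters) - 1),
                key=lambda j: ward_cost(clusters[j], clusters[j + 1]))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return [(a[-1] + b[0]) / 2 for a, b in zip(clusters, clusters[1:])]

# Simulated data, purely for illustration (not real losses).
random.seed(0)
policies = [(random.choice(LEVELS[0]), random.choice(LEVELS[1])) for _ in range(300)]
effect = {"A": 0.0, "B": 5.0, "C": 10.0}
loss_pct = [60 - 8 * s + effect[t] + random.gauss(0, 4) for s, t in policies]

# Step 1: linear model on the dummy variables.
beta = fit_ols([dummy_row(p) for p in policies], loss_pct)

# Step 2: split the observed range of Y into 5 clusters.
cuts = cluster_cut_points(loss_pct, k=5)

# Step 3: assign a group (1 = lowest loss) by where the fitted value falls.
def price_group(policy):
    y_hat = sum(b * x for b, x in zip(beta, dummy_row(policy)))
    return bisect.bisect_right(cuts, y_hat) + 1

for score in LEVELS[0]:
    print(score, [price_group((score, t)) for t in LEVELS[1]])
```

The final loop prints the grid of factor combinations with a group number in each cell, which is the table you described filling in by hand. With real data you would probably want to check whether interactions between the factors are needed, and whether a different clustering rule gives more sensible price bands.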