What technique to use?

I am studying modeling, and I am looking for a good technique to optimize an insurance process, but I am not quite sure how to accomplish what I want to do.

I have a dataset with several categorical variables (policy characteristics) and one numerical variable (historical losses). What I want to do is choose values for the categorical variables that cluster the numerical variable into 5 groups.

I have considered using regression trees because they are the only method I have seen that splits on the categorical variables to reach an optimal grouping. Am I on the right track?

If it matters, I am working in R, but I also use SAS.



Active Member
There are many ways to partition the observations into 5 clusters. To choose only one, you need to define the clustering / classification criterion in terms of the numerical variable... In plain English, what are you trying to do?
The numerical variable is the percentage of loss. The others are categorical variables like credit score 1-5, etc. I need criteria based on the categorical variables that will sort the policies into 5 groups such that the policies within each group have a similar mean loss percentage with small deviation. In other words, I want a box-and-whiskers plot of loss percentage by group number to look stair-stepped.
To explain that a little better: currently, I create a grid with the possible combinations of the categorical variables. Then a value of 1-5 is assigned to each cell, corresponding to a particular price. I would like a process that finds the optimal grouping on which to base this price.
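
One way to make "optimal grouping of the grid" concrete is sketched below. It is not a method proposed in this thread: it swaps in a one-dimensional optimal partition by dynamic programming (the idea behind what is often called Fisher-Jenks natural breaks). Each grid cell is reduced to its mean loss, the cells are sorted by that mean, and the sorted list is cut into 5 contiguous groups so that the policy-weighted squared deviation of cell means within groups is as small as possible. The function names, the `(credit_score, region)` cell keys, and the loss figures are all hypothetical, and the sketch assumes the cell means are stable enough to rank on.

```python
from collections import defaultdict

def cell_means(rows):
    """rows: iterable of (cell_key, loss). Returns {cell_key: (n_policies, mean_loss)}."""
    acc = defaultdict(lambda: [0, 0.0])
    for key, loss in rows:
        acc[key][0] += 1
        acc[key][1] += loss
    return {key: (n, total / n) for key, (n, total) in acc.items()}

def optimal_groups(cells, k=5):
    """Assign each cell a group 1..k (1 = lowest mean loss).

    Cells are sorted by mean loss and split into k contiguous groups that
    minimise the policy-weighted sum of squared deviations of cell means
    from their group mean.
    """
    items = sorted(cells.items(), key=lambda kv: kv[1][1])
    m = len(items)
    if m < k:
        raise ValueError("need at least k cells")
    # Prefix sums of weight, weight*mean and weight*mean^2 give O(1) group costs.
    W, S, Q = [0.0], [0.0], [0.0]
    for _, (n, mu) in items:
        W.append(W[-1] + n)
        S.append(S[-1] + n * mu)
        Q.append(Q[-1] + n * mu * mu)

    def cost(i, j):
        """Weighted within-group sum of squares for sorted cells i..j-1."""
        w, s, q = W[j] - W[i], S[j] - S[i], Q[j] - Q[i]
        return q - s * s / w

    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(k + 1)]  # best[g][j]: first j cells in g groups
    cut = [[0] * (m + 1) for _ in range(k + 1)]     # start index of the last group
    best[0][0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, m + 1):
            for i in range(g - 1, j):
                c = best[g - 1][i] + cost(i, j)
                if c < best[g][j]:
                    best[g][j], cut[g][j] = c, i
    # Walk the stored cut positions backwards to label each cell.
    groups, j = {}, m
    for g in range(k, 0, -1):
        i = cut[g][j]
        for key, _ in items[i:j]:
            groups[key] = g
        j = i
    return groups

# Hypothetical policies: ((credit_score, region), loss percentage as a fraction)
rows = [((1, "N"), 0.10), ((1, "N"), 0.12), ((1, "S"), 0.30), ((2, "N"), 0.31),
        ((2, "S"), 0.55), ((3, "N"), 0.80), ((3, "S"), 1.20)]
groups = optimal_groups(cell_means(rows), k=5)
for key in sorted(groups, key=groups.get):
    print(key, "-> group", groups[key])
```

Because the groups are contiguous in sorted mean loss, a box plot of loss by group number is stair-stepped by construction. With sparse cells the raw means are noisy, so some smoothing of the cell means before grouping may be needed; that is outside this sketch.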


Active Member
If I understand correctly, you want to predict a "category" of the numerical variable (Y) using the categorical variables in your data set. One thing you can do is perform the following three steps:

1. Build the optimal linear model: Y = Beta_0 + Beta_1 * X1 + Beta_2 * X2 + ... + Beta_p * Xp + Epsilon. Here (X1, X2, ..., Xp) are the dummy variables coding the categories of the categorical variables in the data set.
2. Split the range of Y into 5 clusters according to a standard clustering technique (like hierarchical clustering).
3. Classify each new observation into one of the 5 clusters based on where its fitted value Y_Hat = Beta_0 + Beta_1 * X1 + Beta_2 * X2 + ... + Beta_p * Xp falls.
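
The three steps above can be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions, not a definitive implementation: the variable names (`credit`, `region`), the toy loss figures, and all helper functions are hypothetical. The model is a plain main-effects least-squares fit, and the clustering is a simple agglomerative scheme that repeatedly merges the two neighbouring clusters with the closest means, standing in for whatever hierarchical clustering method you would actually choose.

```python
from bisect import bisect_right

def dummy_row(record, levels):
    """Step 1 coding: intercept plus one 0/1 dummy per non-reference level."""
    row = [1.0]
    for var, lv in levels.items():
        row += [1.0 if record[var] == level else 0.0 for level in lv[1:]]
    return row

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

def cluster_cutpoints(y, k=5):
    """Step 2: merge neighbouring clusters with the closest means until k remain.

    Returns the k-1 boundaries, each midway between two adjacent clusters.
    """
    clusters = [[v] for v in sorted(y)]
    while len(clusters) > k:
        means = [sum(c) / len(c) for c in clusters]
        i = min(range(len(clusters) - 1), key=lambda j: means[j + 1] - means[j])
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return [(a[-1] + b[0]) / 2 for a, b in zip(clusters, clusters[1:])]

def classify(y_hat, cuts):
    """Step 3: cluster number 1..k according to where the fitted value falls."""
    return bisect_right(cuts, y_hat) + 1

# Hypothetical data: one record per policy, first level of each variable is the reference.
levels = {"credit": [1, 2, 3], "region": ["N", "S"]}
data = [({"credit": 1, "region": "N"}, 0.10), ({"credit": 1, "region": "S"}, 0.14),
        ({"credit": 2, "region": "N"}, 0.30), ({"credit": 2, "region": "S"}, 0.36),
        ({"credit": 3, "region": "N"}, 0.60), ({"credit": 3, "region": "S"}, 0.70)]
X = [dummy_row(rec, levels) for rec, _ in data]
y = [loss for _, loss in data]

beta = fit_ols(X, y)                      # step 1
cuts = cluster_cutpoints(y, k=5)          # step 2
classes = [classify(sum(b * x for b, x in zip(beta, row)), cuts) for row in X]  # step 3
print(classes)
```

The normal-equations solver assumes the dummy-coded design has full column rank; with real data, a statistics package's linear-model routine is the safer choice for step 1.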