My particular approach may be completely wrong, so let me describe the high-level goal first:

The data I'm working with consists of visitor demographic features plus a score for each visitor. Currently I am using 14 demographic features, each with up to 5 levels. I want to determine which partition of the demographic features yields the biggest difference in average score.

I don't want to get too granular, so I will likely omit partitions that cover less than 5% of the data. My end result would ideally be a partition of all visitors into 2-4 groups with different average scores.
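As a rough illustration of the size threshold, here is a minimal sketch that computes, for a single feature, each level's share of the data and its average score, then drops levels below the cutoff. The data frame `visitors`, its values, and the helper `level_summary` are made up for illustration; only `table`, `prop.table`, and `tapply` are base R functions.

```r
# Hypothetical toy data: one demographic feature and a numeric score.
visitors <- data.frame(
  Eyes  = c(rep("Blue", 12), rep("Brown", 7), "Green"),
  Score = c(rep(c(1, 0), 6), rep(1, 7), 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: share and mean score per level of one feature,
# keeping only levels that cover at least `min_share` of the rows.
level_summary <- function(df, feature, score = "Score", min_share = 0.05) {
  share <- prop.table(table(df[[feature]]))
  avg   <- tapply(df[[score]], df[[feature]], mean)
  out <- data.frame(
    level      = names(share),
    share      = as.numeric(share),
    mean_score = as.numeric(avg[names(share)]),
    stringsAsFactors = FALSE
  )
  out[out$share >= min_share, , drop = FALSE]
}

# "Green" covers 1 of 20 rows, so a 10% cutoff removes it.
level_summary(visitors, "Eyes", min_share = 0.10)
```

The same summary could be run per feature to see which levels are large enough to be worth splitting on; how to combine levels into 2-4 groups is a separate question that this sketch does not address.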

***************

I wanted to start small by calculating the average score for every combination of levels from two distinct features. I got stuck building the script to accomplish this. Here is a sample input and the output I want:

Code for Input:

    data = matrix(c("Blue","Blue","Brown","M","M","F","Rep","Dem","Rep",1,0,1), ncol=4)
    dimnames(data) = list(c(1,2,3), c("Eyes","Gender","Politic","Score"))

Input:

      Eyes    Gender Politic Score
    1 "Blue"  "M"    "Rep"   "1"
    2 "Blue"  "M"    "Dem"   "0"
    3 "Brown" "F"    "Rep"   "1"

Output:

    Blue, M: 0.5
    Brown, F: 1
    Blue, Rep: 1
    Blue, Dem: 0
    Brown, Rep: 1
    M, Rep: 1
    M, Dem: 0
    F, Rep: 1
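One way to produce output of this shape, sketched under the assumption that the data is held in a data.frame with a numeric Score column rather than a character matrix: `combn` enumerates the pairs of feature columns, and `aggregate` averages the score within each observed combination of levels. The helper name `pair_means` and its output column names are made up for illustration.

```r
# Same three rows as the sample input, as a data.frame with numeric Score.
df <- data.frame(
  Eyes    = c("Blue", "Blue", "Brown"),
  Gender  = c("M", "M", "F"),
  Politic = c("Rep", "Dem", "Rep"),
  Score   = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mean score for every observed combination of
# levels, for every pair of distinct feature columns.
pair_means <- function(df, score = "Score") {
  features <- setdiff(names(df), score)
  pairs <- combn(features, 2, simplify = FALSE)
  results <- lapply(pairs, function(p) {
    # Group by the two feature columns and average the score.
    agg <- aggregate(df[[score]], by = df[p], FUN = mean)
    data.frame(
      feature1 = p[1], level1 = agg[[1]],
      feature2 = p[2], level2 = agg[[2]],
      mean_score = agg[[3]],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, results)
}

pair_means(df)
```

Only combinations that actually occur in the data appear in the result, which matches the desired output (there is no "Brown, M" row, for example).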

Right now I am stuck on the looping. To start, I am just trying to build a function that creates a matrix of all distinct (answer, question) pairs, i.e. each level paired with the feature it belongs to. When I loop through the questions and then through each answer to a question, I get various errors.

    analyze = function(test_data) {
      x = matrix(ncol=2)
      categories = lapply(test_data, unique)  # created list of all distinct categoric values
      category_names = names(categories)
      for (feature in category_names) {
        for (ans in categories$feature) {
          x = rbind(x, c(ans, feature))
        }
      }
      return(x)
    }
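For comparison, a minimal sketch of the same idea that does run, under two assumptions: the input is a data.frame (on a matrix, `lapply` visits individual cells rather than columns, so the resulting list has no column names), and the list is indexed with `categories[[feature]]`, because `categories$feature` looks for an element literally named "feature". The helper name `level_feature_pairs` and the choice to skip the Score column are illustrative, not part of the original function.

```r
# Hypothetical helper: one row per (level, feature) pair.
level_feature_pairs <- function(test_data, score = "Score") {
  stopifnot(is.data.frame(test_data))
  features <- setdiff(names(test_data), score)
  # Distinct values of each feature column, as a named list.
  categories <- lapply(test_data[features], unique)
  rows <- lapply(features, function(feature) {
    # [[feature]] uses the value held in `feature`; $feature would not.
    cbind(level = as.character(categories[[feature]]), feature = feature)
  })
  do.call(rbind, rows)
}

# The sample input as a data.frame.
df <- data.frame(
  Eyes    = c("Blue", "Blue", "Brown"),
  Gender  = c("M", "M", "F"),
  Politic = c("Rep", "Dem", "Rep"),
  Score   = c(1, 0, 1),
  stringsAsFactors = FALSE
)

level_feature_pairs(df)
```

Building the pieces with `lapply` and binding them once also avoids the leading row of `NA` values that `matrix(ncol=2)` followed by repeated `rbind` would leave in the result.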

*I know this isn't representative of the whole problem described at the beginning; I just tried to simplify it down to an easier problem whose answer will help the most.