Calculate values for all permutations of categorical variables

#1
Hi all,

My particular approach may be completely wrong, so let me provide a high-level goal first:

The data I'm working with is visitor demographic features with a score. Currently I am using 14 demographic features, each with up to 5 levels. I want to determine which partition of demographic features will yield the biggest difference in average score.

I don't want to get too granular, so I will likely omit partitions that make up less than 5% of the data. My end result would ideally be a partition of all visitors into 2-4 groups with different average scores.

***************

So! I wanted to start small and begin by calculating the average score based on every combination of 2 distinct features. I got stuck building the script to accomplish this. Here is a sample input/output:

Code for Input:
data=matrix(c("Blue","Blue","Brown","M","M","F","Rep","Dem","Rep",1,0,1), ncol=4)
dimnames(data)=list(c(1,2,3),c("Eyes","Gender","Politic","Score"))

Input:

  Eyes    Gender Politic Score
1 "Blue"  "M"    "Rep"   "1"
2 "Blue"  "M"    "Dem"   "0"
3 "Brown" "F"    "Rep"   "1"

Output:

Blue, M : 0.5
Brown, F : 1
Blue, Rep : 1
Blue, Dem : 0
Brown, Rep : 1
M, Rep : 1
M, Dem : 0
F, Rep : 1


So right now I am getting stuck at all the looping. To start I am just trying to build a function that creates a matrix of all distinct pairs of answers and questions. When it comes to looping through the questions, THEN each answer to the question, I get various errors.

analyze = function(test_data) {

  x=matrix(ncol=2)
  categories = lapply(test_data, unique) #created list of all distinct categoric values
  category_names = names(categories)

  for (feature in category_names) {
    for (ans in categories$feature){
      x=rbind(x,c(ans,feature))
    }
  }
  return(x)
}


*I know this isn't representative of the whole problem described at the beginning; I just tried to simplify it down to an easier problem whose answer will help the most.
 

Jake

Cookie Scientist
#2
I think you want to check out the outer() and combn() functions. The former crosses all elements of a first vector with all the elements of a second vector and applies some function that you define to each of the pairings. The latter takes a single vector and returns all unique pairings of elements that can be formed from that single vector.
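A minimal sketch of the combn() idea applied to the sample data from post #1. It assumes the data are held in a data frame (so Score stays numeric) rather than a character matrix; visitors, feature_pairs, and pair_means are names made up for this illustration.

```r
# Sample data from post #1, stored as a data frame (an assumption).
visitors <- data.frame(
  Eyes    = c("Blue", "Blue", "Brown"),
  Gender  = c("M", "M", "F"),
  Politic = c("Rep", "Dem", "Rep"),
  Score   = c(1, 0, 1),
  stringsAsFactors = FALSE
)

features <- setdiff(names(visitors), "Score")

# combn() returns every unordered pair of feature names, one pair per column.
feature_pairs <- combn(features, 2)

# Hypothetical helper: mean Score for each observed combination of levels
# of the two features named in `pair`.
pair_means <- function(pair, data) {
  means <- aggregate(data$Score, by = data[pair], FUN = mean)
  data.frame(
    features = paste(pair, collapse = " x "),
    levels   = paste(means[[1]], means[[2]], sep = ", "),
    mean     = means$x,
    stringsAsFactors = FALSE
  )
}

# Apply the helper to each pair of features and stack the results.
result <- do.call(
  rbind,
  lapply(seq_len(ncol(feature_pairs)),
         function(i) pair_means(feature_pairs[, i], visitors))
)
print(result)
```

Only level combinations that actually occur in the data show up in the result. Swapping FUN = mean for FUN = length would give group sizes, which might help with dropping groups below the 5% threshold.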
 

consuli1

Guest
#3
Another possibility is the following.

1. Create 4 non-redundant single-column data frames for Eyes, Gender, Politic and Score using duplicated().
2. Join the Eyes, Gender, Politic and Score data frames using merge() 3 times without a match condition (i.e. a cross join).

The result is a data frame with all possible combinations of levels (order is ignored).
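The two steps above can be sketched roughly as follows. This assumes the sample data from post #1 is held in a data frame, and it leaves Score out on the assumption that it is the outcome rather than a grouping feature. distinct_levels is a hypothetical helper, not a base R function.

```r
# Sample data from post #1, stored as a data frame (an assumption).
visitors <- data.frame(
  Eyes    = c("Blue", "Blue", "Brown"),
  Gender  = c("M", "M", "F"),
  Politic = c("Rep", "Dem", "Rep"),
  Score   = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: single-column data frame holding the distinct
# levels of one feature, found with duplicated().
distinct_levels <- function(data, column) {
  data[!duplicated(data[[column]]), column, drop = FALSE]
}

eyes    <- distinct_levels(visitors, "Eyes")
gender  <- distinct_levels(visitors, "Gender")
politic <- distinct_levels(visitors, "Politic")

# by = NULL makes merge() return the Cartesian product of its inputs.
all_combos <- merge(merge(eyes, gender, by = NULL), politic, by = NULL)
print(all_combos)
```

Note that this enumerates every combination of levels, including ones that never occur in the data (for example Brown, M), so average scores would still need to be joined on afterwards.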

Regards
Consuli
 

trinker

ggplot2orBust
#4
@smithosaurus

When you post code, data frames, or computer output, it's helpful to wrap it in code tags by either:
  1. clicking the pound (#) sign icon, or
  2. wrapping it with [NOPARSE]
    Code:
    some code
    [/NOPARSE]

which produces:
Code:
some code
For more see this (LINK)