K-Means Clustering: Selecting K with Cross-Validation of Prediction SSE

bryangoodrich

Probably A Mammal
#1
I'll try to fill this out more later, but let us just assume we're using the Iris data set.

It's not hard to do some clustering:

Code:
x <- iris
set.seed(666)
N <- nrow(x)
idx <- sample(seq(N), 30)            # hold out 30 rows for testing
train <- x[-idx, 1:4]                # numeric measurement columns only
test <- x[idx, 1:4]
kc <- kmeans(train, 3, nstart = 50)  # 3 centers, best of 50 random starts
train.ss <- kc$tot.withinss          # training total within-cluster SS
# Apply the model to test (assign test rows to the nearest fitted centers)
# Compute test.ss from those assignments
Now I'm looking at how to apply this model to the test set. The DeducerExtras package has a method for applying a model. It basically assigns the data to the nearest centers from the model. Great. That saves me the trouble.

However, I want to calculate the tot.withinss for this assignment! It doesn't provide that. Short of digging into some code, I was wondering if anyone knows what that code would be or has run across it. It's not the most terrible thing in the world to calculate, but why reinvent the wheel when I'm in a hurry? Right!

Once I can get a test.ss, I'll be set to do some cross-validation to find which k minimizes the test.ss. Make sense? Right! Any help is appreciated. Unless it's from Spunky. Then I'll demoralize him ruthlessly.
 

Lazar

Phineas Packard
#2
What about using AIC instead? This should be equivalent to k-fold cross-validation.

bryangoodrich

Probably A Mammal
#3
I prefer 5- or 10-fold CV. I'm not as comfortable with AIC in this situation. All I really need to do is:

1. Assign each point in the test set to a cluster.
2. Calculate the squared distance from each point to its cluster center.
3. Aggregate that across clusters and in total.

I know this can be done with the dist function or something. I just don't want to do it from scratch, but my Google searches keep bringing up other nonsense. Maybe I'll just do it from scratch then. It's not complicated, but will R be fast enough? Maybe this is a case where a little Rcpp will be useful!
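
As a sketch of that plan (nothing anyone posted in this thread; the helper test_sse, the 5-fold setup, and the range of k are all assumptions for illustration), a from-scratch skeleton might look like:

Code:
# Hypothetical skeleton: pick k by mean held-out SSE over 5 folds.
test_sse <- function(x, centers) {
  # squared Euclidean distance from every row of x to every center
  d2 <- apply(centers, 1, function(cen) rowSums(sweep(x, 2, cen)^2))
  sum(apply(d2, 1, min))  # each point contributes its distance to the nearest center
}

x <- as.matrix(iris[, 1:4])
set.seed(666)
folds <- sample(rep(1:5, length.out = nrow(x)))  # shuffled 5-fold partition
ks <- 2:8
cv.sse <- sapply(ks, function(k) {
  mean(sapply(1:5, function(f) {
    kc <- kmeans(x[folds != f, ], k, nstart = 50)  # fit on the other folds
    test_sse(x[folds == f, ], kc$centers)          # score on the held-out fold
  }))
})
ks[which.min(cv.sse)]  # the k that minimizes held-out SSE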
 

TheEcologist

Global Moderator
#4
Lazar said:
What about using AIC instead? This should be equivalent to k-fold cross-validation.
My experience with AIC and ML techniques is that it is way too conservative, which is great when you're doing statistical inference, but really bad if you want to use AIC for prediction, as it almost always underfits the data. I couldn't get a decent pattern recognition algorithm running with AIC, at least not nearly as good as one tuned with CV.

If you have the data, CV is the way to go.
 

TheEcologist

Global Moderator
#5
bryangoodrich said:
I know this can be done with the dist function or something. I just don't want to do it from scratch, but my Google searches keep bringing up other nonsense. Maybe I'll just do it from scratch then. It's not complicated, but will R be fast enough? Maybe this is a case where a little Rcpp will be useful!
Don't go mucking around with Rcpp before profiling identifies a bottleneck that Rcpp can actually solve. It will be a major time waste; I speak from experience. I tend to drop to C only when things like mathematical operators or random number generation are the bottleneck, basically whenever there is no vectorized way to attack my problem, or when I've chained together too many vectorized operations, in which case writing your own custom vectorized routine will help.
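
For instance, a quick check of where the time actually goes before reaching for Rcpp (a toy illustration, not this thread's data; Rprof and summaryRprof ship with base R's utils):

Code:
Rprof("prof.out")                                             # start the sampling profiler
d <- as.matrix(dist(matrix(rnorm(2000 * 5), ncol = 5)))[1, ]  # some work to profile
Rprof(NULL)                                                   # stop profiling
head(summaryRprof("prof.out")$by.self)                        # where the time actually went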
 

bryangoodrich

Probably A Mammal
#6
Great advice, and thanks for the insights. I don't think I should need to write something in C++, but I have to figure out how to do a group-wise (each cluster) sum of squared differences for each 96-dimensional point from its cluster center. It's just Euclidean distance, but I'm having trouble thinking about how to do that vectorized in R. Once I figure it out, I'll post my code here. I can think of how to do it iteratively in C++ more easily than I can in R right now!
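
For what it's worth, one vectorized possibility, as a sketch under assumed names (x an n-by-p matrix of points, centers the k-by-p matrix of centers, cl the vector of cluster assignments):

Code:
# Squared Euclidean distance of each point to its own cluster's center,
# then summed per cluster, with no explicit loop over points.
wss_by_cluster <- function(x, centers, cl) {
  sq <- rowSums((x - centers[cl, , drop = FALSE])^2)  # per-point squared distance
  tapply(sq, cl, sum)                                 # aggregate within each cluster
}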
 

bryangoodrich

Probably A Mammal
#7
Had some fun today after I spent all day getting a data set together. I ran k-means on the hourly energy consumption of nearly 5000 customers to produce 6 clusters. I think I detected our grow houses, because one cluster has 6x or more consumption in the morning and evening (overnight). This was still based on the training sum of squares; I still need to calculate the prediction SS to do this better.

From reading the help page for dist, I guess I could prepend the center to the points assigned to its cluster, take the corresponding row out of the dist matrix, and that should give me what I want. I just have to do that for each cluster and center to get the within SS.

In any case, I found this link while researching k-means. Very educational.

http://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means
 

bryangoodrich

Probably A Mammal
#8
Okay, I toyed around with that dist function today. I like it!

I'm still trying to figure out the "best" way to approach this problem, because on the one hand I have my data set, and on the other hand I have the clustering model. Getting the distances is pretty easy if I simply rbind the center for a given cluster to the top of the matrix of points assigned to it, calculate the distances with dist, and then take the first row of that matrix. That row is the distance of every point from the center. Make sense? Then a simple sum(d^2) will get me the withinss for that center.

The challenge now is to make a clean and intuitive algorithm that can mash up a test data set with the given model, apply the model to that data, then do the above to calculate the prediction sum of squares. Feel free to contribute, but next time I have time to work on this, I'll post my solution. My guess is something along the lines of

Code:
kc <- kmeans(train, k, nstart = 50)  # k-means on the training data with k centers
test <- applyModel(test, kc)  # DeducerExtras helper: returns the test data with assigned clusters appended
# Distances of cluster 1's points from its center; assume assignments sit in column 5
d <- as.matrix(dist(rbind(kc$centers[1, ], test[test[, 5] == 1, -5])))[1, -1]  # drop the zero self-distance
testss1 <- sum(d^2)
Easy, right? I just need to make this process clean, modular, and fast. I'll be doing it on a **** ton of data here!

Note: an alternative is to just use predict.kmeans(kc, test) to get a vector of cluster assignments, then lapply over the centers (or row-apply, I guess) and pass in the assignments for the given center.

Code:
withinss <- function(n, x, centers, assignments) {
    p <- centers[n, ]                     # center of cluster n
    m <- rbind(p, x[assignments == n, ])  # stack the center on top of its assigned points
    sum((as.matrix(dist(m))[1, ])^2)      # squared distances from the center, summed
}

centers <- kc$centers
assignments <- predict.kmeans(kc, test)

lapply(seq(nrow(centers)), withinss, x = test, centers = centers, assignments = assignments)
Clearly I'm passing too much junk around, but that's where my thinking is going, assuming that code even works! lol
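
One caveat: base R's stats package ships no predict method for kmeans objects, so predict.kmeans above has to come from somewhere else. A minimal stand-in consistent with how it's used in these posts might be:

Code:
# Hypothetical stand-in for predict.kmeans: assign each row of newdata
# to the nearest fitted center (base R provides no such method).
predict.kmeans <- function(object, newdata) {
  d2 <- apply(object$centers, 1,
              function(cen) rowSums(sweep(as.matrix(newdata), 2, cen)^2))
  max.col(-d2)  # column (center) index of the smallest squared distance per row
}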
 

bryangoodrich

Probably A Mammal
#9
Toying around tonight, I figured using dist would be hugely inefficient, because it has to calculate all pairwise distances when I only want the one vector of distances from the cluster center. Instead, I'll use closures:

Code:
x <- matrix(sample(1:1000, 10000, TRUE), ncol = 5)  # make a big-ish data set (2000 rows)
p <- c(1, 2, 3, 4, 5)  # arbitrary "center" to get distances from
euclidean_factory <- function(y) function(x) sqrt(sum((y - x)^2))  # closure, baby!
pdist <- euclidean_factory(p)  # distance of a point from p
what_i_want <- apply(x, 1, pdist)  # way faster than the dist method described earlier
I ran this against the dist approach above and it's WAY faster, for obvious reasons.
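
To see the gap (a toy comparison reusing p, x, and pdist from the snippet above; absolute timings will vary by machine):

Code:
system.time(as.matrix(dist(rbind(p, x)))[1, -1])  # all pairwise distances, O(n^2)
system.time(apply(x, 1, pdist))                   # just the n distances we need, O(n)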

By using closures, I can generate the distance function I want from a given centroid at the moment I need to compute the withinss for a given subset of the test data, once I've assigned clusters from a given model. So now it'll be something like

Code:
# ... partitioning, cv, whatever
kc <- kmeans(train, k, nstart = 50)  # define a model for a given k
assignments <- predict.kmeans(kc, test)  # assign the clusters from the given model
centers <- kc$centers
ss <- function(x) sum(x^2)  # Why not ...
withinss <- vector('numeric', k)  # store ss for each k center
for (n in seq(k)) {
    pdist <- euclidean_factory(centers[n, ])  # create this cluster's distance function
    d <- apply(test[assignments == n, , drop = FALSE], 1, pdist)  # distances for the points assigned to cluster n
    withinss[n] <- ss(d)
} # bam! Wrap this in a cross-validation loop
 

bryangoodrich

Probably A Mammal
#10
Well **** .... I actually needed this function the other month and completely forgot I already made a function for doing it! !@$%#

For completeness: the flexclust kcca function is much better, and I like its assignment (predict) method better. It's better at memory management and wraps everything in S4 class objects.
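
A minimal sketch of that route, assuming flexclust is installed and that parameters() pulls the fitted centers:

Code:
library(flexclust)
set.seed(666)
idx   <- sample(nrow(iris), 30)
train <- as.matrix(iris[-idx, 1:4])
test  <- as.matrix(iris[idx, 1:4])
kc <- kcca(train, k = 3, family = kccaFamily("kmeans"))  # S4 kcca object
cl <- predict(kc, newdata = test)                        # nearest-center assignment for new data
sse <- sum(rowSums((test - parameters(kc)[cl, , drop = FALSE])^2))  # held-out SSE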