Can you assign new observations into existing clusters?

jawon

New Member
#1
Have never done cluster analysis. Is a scenario like this possible/recommended...

1. Run 100 observations thru cluster analysis and end up with 4 clusters.
2. Now I have a "cluster model".
3. I get 20 more observations.
4. Run these 20 thru my existing cluster model and have each of them assigned into one of the existing 4 clusters.

So it's kind of a regression approach... you have a model that you use to score future observations. Or would those 4 clusters now be invalid and I'd have to "re-cluster" based on the complete 120 obs?

Thank you!
 

spunky

Super Moderator
#2
you have a model that you use to score future observations. Or would those 4 clusters now be invalid and I'd have to "re-cluster" based on the complete 120 obs?
I would say that, just for the sake of exploring your data carefully, you should probably re-run the clustering with the added 20 observations. Because, in all honesty, if 20 observations end up changing cluster membership a lot, maybe your 'cluster model' wasn't all that great to begin with.

Disclaimer: I prefer model-based clustering methods like finite mixtures over centroid-based clustering algorithms. Which method are you using to cluster your observations?
 

jawon

New Member
#3
Thank you for the response.

The quantities were just to provide a concrete example. My dataset would actually have much more.

My question is more about the concept of assigning NEW observations into EXISTING clusters. Admittedly I have more of a regression modeling lens on and I don't know if that works in the clustering world.
 

bryangoodrich

Probably A Mammal
#4
Which clustering algorithm are you using? That determines how you apply your data mining model. For instance, if you're using k-means, training a model on a set of data with a given similarity measure produces a set of cluster centers. You apply that model by assigning each new observation to whichever center it is nearest to under the measure the model was trained with. The model can be thought of as a pair (k, +), where 'k' represents the centers and '+' represents the measure.

It makes no sense to say you've added observations to the cluster. The cluster isn't the "thing" that you've modeled. What you've modeled is (k, +), and that was trained on the original data set. The model assigns new observations to whichever clusters they get assigned to: on that model, what is most similar to a given center is just whatever the model determines. If you fit a new model, you might end up with some (k', +) with very different clusters. You can also change the number of centers k, or use a completely different similarity measure. The point being, you aren't creating clusters. You're finding centers in that case.
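To make the (k, +) idea concrete, here is a minimal numpy sketch. The centers and new observations are made-up illustrative values, and Euclidean distance stands in for the '+' measure; scoring a new point is just "find the nearest trained center":

```python
import numpy as np

def assign_to_clusters(centers, new_points):
    """Assign each new point to the nearest trained center ('+' = Euclidean)."""
    # pairwise distances, shape (n_new_points, n_centers)
    dists = np.linalg.norm(new_points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

# hypothetical centers 'k' produced by an earlier k-means fit
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

# two new observations to score against the existing model
new_obs = np.array([[1.0, -0.5], [9.0, 11.0]])
print(assign_to_clusters(centers, new_obs))  # → [0 1]
```

Note that nothing here re-fits anything: the centers are fixed, so scoring new data never changes the existing model.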

Of course, this only applies to what k-means produces. If you used hierarchical clustering or k-NN or some other approach, you'd have a different sort of model. A hierarchical clustering model produces a dendrogram that relates every observation, and you'd apply it to test data differently, but the point still remains the same. The clusters you generate aren't the end result; those are products of the model. The model itself is the product of the clustering (cluster centers, dendrogram, etc.).
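For the hierarchical case, a small sketch assuming SciPy is available: the "model" is the linkage matrix (the dendrogram), and cluster labels are something you read off it afterwards, e.g. by cutting the tree. The toy data here is invented for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy data: two obvious groups, near 0 and near 10
X = np.array([[0.0], [0.2], [0.1], [10.0], [10.2], [9.9]])

# the model is the dendrogram encoded in the linkage matrix, not the clusters
Z = linkage(X, method="average")

# cutting the tree into 2 clusters is one way to extract labels from the model
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

This also shows why scoring brand-new observations is less natural here than with k-means: the dendrogram relates only the original observations, so there is no built-in "nearest center" to assign a new point to.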
 

jawon

New Member
#5
Thank you. Despite my inaccurate language, I think I got the answer I was looking for.

Sounds like, using k-means, I can train a model that will result in X clusters. Then I could run new observations through this original model, and these new observations would get assigned to one of the original clusters.
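That train-once, score-later workflow can be sketched with scikit-learn's k-means (assuming that library; the blob data and parameters below are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# "original" data: 100 observations forming two blobs
original = np.vstack([rng.normal(0, 0.5, (50, 2)),
                      rng.normal(5, 0.5, (50, 2))])

# train once: the fitted model retains the cluster centers
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(original)

# later, routinely: score new observations against the same fixed centers
new_obs = np.array([[0.1, 0.2], [5.1, 4.8]])
print(model.predict(new_obs))  # one label per new observation
```

`fit` builds the model from the original data; `predict` only assigns new observations to existing centers and never changes them, which is exactly the ongoing-scoring setup described above.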

Is this a common way of using clusters? I've seen plenty of examples where clusters are created in a one-time analysis to help describe the different segments in the universe. But what I've described is more of an ongoing process, where a model is built and then new observations are routinely assigned to the original clusters.