Clustering univariate measures

#1
Hello,

I'd like some guidance, since I don't know which statistical method would be most appropriate in my case.

I have a dataset with several categories and only one relevant numerical variable; each category appears repeatedly, with several measurements. There are around 1500 data points across ~20 categories (e.g., imagine a list of people's heights in centimetres, where the category is each person's country).

I would like to find a way to group these categories into 5 groups. I understand that a clustering algorithm would be suitable for this task. My final goal is to be able to classify new measurements into one of the 5 clusters with some certainty.

I have tried k-means for this purpose (I'm using R, btw). It clusters the data points by similarity but ignores the category of each data point, so it's not very useful to me. I then numbered the categories from 1 to 20, sorted from largest to smallest, and included that number as an input to k-means. That does seem to group the categories into clusters, but I guess it's using the category number (rank) as part of the distance calculation, which it shouldn't (the groups always come out as consecutive ranks; see attached image).
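
As an illustration of the first attempt, here is a minimal sketch of k-means run on the raw measurements alone. It is written in Python rather than R so that it needs no packages; the data, the category names A to C, and the helper `kmeans_1d` are all made up for the example and are not the real dataset. The point is only that the algorithm never sees the category, so points from one category may end up split across clusters.

```python
import random

def kmeans_1d(values, k, iters=100):
    """Plain Lloyd's algorithm on scalar values; returns (centers, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers over the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign every value to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Hypothetical data: three made-up categories with overlapping heights (cm).
random.seed(0)
data = [(cat, random.gauss(mu, 6.0))
        for cat, mu in [("A", 165.0), ("B", 170.0), ("C", 180.0)]
        for _ in range(50)]

# The category is dropped here: k-means only receives the heights.
centers, labels = kmeans_1d([h for _, h in data], k=2)

# Count how each category's points are distributed over the two clusters.
spread = {}
for (cat, _), lab in zip(data, labels):
    spread.setdefault(cat, [0, 0])[lab] += 1

for cat in sorted(spread):
    print(cat, spread[cat])
```

Each printed row shows how many points of a category landed in each cluster, which makes it easy to see whether a category was kept together or split.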

I have also tried calculating the mean of each category and clustering the means. That gives me a clustering, but wouldn't it ignore all the variability within each category, which is relevant?
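
The mean-based attempt can be sketched in the same way. The example below is in Python with only the standard library; the six categories, their heights, and the helper `kmeans_1d` are assumptions made for illustration. It clusters the per-category means and computes the per-category standard deviation next to them, which shows concretely that the within-category spread is available but plays no part in the grouping. It is not meant as a recommended solution.

```python
import random
from statistics import mean, stdev

def kmeans_1d(values, k, iters=100):
    """Plain Lloyd's algorithm on scalar values; returns (centers, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers over the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Hypothetical data: six made-up categories, 40 heights (cm) each.
random.seed(0)
true_means = {"A": 160.0, "B": 162.0, "C": 170.0,
              "D": 171.0, "E": 180.0, "F": 182.0}
data = {cat: [random.gauss(mu, 6.0) for _ in range(40)]
        for cat, mu in true_means.items()}

# One (mean, standard deviation) summary per category.
summary = {cat: (mean(vals), stdev(vals)) for cat, vals in data.items()}

# Cluster the categories using their means only; the SD is not used.
cats = sorted(summary)
centers, labels = kmeans_1d([summary[c][0] for c in cats], k=3)
groups = dict(zip(cats, labels))

for c in cats:
    m, s = summary[c]
    print(f"{c}: mean={m:.1f} sd={s:.1f} group={groups[c]}")
```

Every category gets exactly one group here, unlike clustering the raw points, but two categories with the same mean and very different spreads would be treated as identical.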

I have searched many forums and tried many examples, but none of them match what I'm trying to do.

So, what would be a correct way of clustering data with just one numerical value and repeated measures?

Thank you in advance.
 

Attachments

gianmarco

TS Contributor
#2