Hello,
I'd like some guidance, since I don't know which statistical method would be most appropriate in my case.
I have a dataset with one categorical variable and only one relevant numerical variable, and each category has many repeated measurements. In total there are around 1500 data points spread over ~20 categories. (For example, imagine a list of people's heights in centimetres, where the category is each person's country.)
I would like to find a way to group these categories into 5 groups. I understand that a clustering algorithm would be suitable for this task. My final goal is to be able to classify new measurements into one of the 5 clusters with some certainty.
I have tried kmeans for this purpose (I'm using R, btw). It clusters the data points by similarity, but it ignores the category of each data point, so it's not very useful to me. I then numbered the categories from 1 to 20, sorted from largest to smallest, and included that number as a second variable in the kmeans clustering. That does seem to group the categories into clusters, but I guess it's using the category number (rank) as part of the distance calculation, and it shouldn't (the groups always come out as consecutive ranks, see attached image).
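To make the second attempt concrete, here is a minimal sketch of that kind of run on simulated data. The column names (`category`, `value`, `rank`), the simulated numbers, and the choice to standardise both columns with `scale()` are assumptions for illustration only, not the real data or the exact call.

```r
set.seed(42)

# Simulated stand-in for the real data: 20 categories x 75 measurements = 1500 rows.
# All names and numbers are hypothetical.
n_cat <- 20
n_per <- 75
true_means <- runif(n_cat, 150, 190)
dat <- data.frame(
  category = rep(seq_len(n_cat), each = n_per),
  value    = rnorm(n_cat * n_per, mean = rep(true_means, each = n_per), sd = 6)
)

# Rank the categories by their observed mean, largest first (rank 1 = largest).
cat_means   <- sapply(split(dat$value, dat$category), mean)
rank_of_cat <- rank(-cat_means)
dat$rank    <- unname(rank_of_cat[as.character(dat$category)])

# k-means on (value, rank). Standardising the columns is an assumption;
# either way the rank enters the Euclidean distance as if it were a measurement.
km <- kmeans(scale(dat[, c("value", "rank")]), centers = 5, nstart = 25)
dat$cluster <- km$cluster

# Number of distinct clusters each category ends up in, and the full cross-table.
clusters_per_cat <- tapply(dat$cluster, dat$category,
                           function(x) length(unique(x)))
print(table(rank = dat$rank, cluster = dat$cluster))
```

Because the rank is treated as a numeric coordinate, categories with neighbouring ranks are close by construction, which would explain why the clusters tend to cover consecutive ranks.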
I have also tried calculating the mean of each category and clustering the means, and that does give me a clustering. But wouldn't that just ignore all the variability within each category, which is relevant?
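A minimal sketch of the means-only approach, again on simulated data, looks like this. The names (`cat_summary`, `new_value`) and all numbers are hypothetical, and assigning a new measurement to the nearest cluster centre is only one possible rule, assumed here for illustration.

```r
set.seed(1)

# Simulated stand-in for the real data (hypothetical names and values).
n_cat <- 20
n_per <- 75
dat <- data.frame(
  category = factor(rep(paste0("cat", seq_len(n_cat)), each = n_per)),
  value    = rnorm(n_cat * n_per,
                   mean = rep(runif(n_cat, 150, 190), each = n_per),
                   sd   = 6)
)

# One row per category: the mean used for clustering, and the within-category
# standard deviation that the clustering never sees.
cat_mean <- tapply(dat$value, dat$category, mean)
cat_sd   <- tapply(dat$value, dat$category, sd)

# k-means on the 20 category means only.
km <- kmeans(as.numeric(cat_mean), centers = 5, nstart = 25)

cat_summary <- data.frame(
  category = names(cat_mean),
  mean     = as.numeric(cat_mean),
  sd       = as.numeric(cat_sd),
  cluster  = km$cluster
)
print(cat_summary[order(cat_summary$mean), ])

# Assumed rule for a new measurement: pick the cluster with the nearest centre.
new_value <- 172
nearest   <- unname(which.min(abs(km$centers[, 1] - new_value)))
print(nearest)
```

Comparing the `sd` column with the gaps between the cluster centres in `km$centers` gives a rough idea of how much individual measurements from different clusters overlap, which is exactly the information the means-only clustering discards.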
I have searched many forums and tried many examples, but none of them cover what I'm trying to do.
So, what would be a correct way of clustering data with just one numerical variable and repeated measurements per category?
Thank you in advance.