Hello,
I'd like some guidance, since I don't know exactly which statistical method would be most appropriate in my case.
I have a dataset with one categorical variable and only one relevant numerical variable; each category appears many times, with one measurement per row. In total I have around 1500 data points spread over ~20 categories. (E.g. imagine a list of people's heights in centimetres, where the category is each person's country.)
I would like to find a way to group these categories into 5 groups. I understand that a clustering algorithm would be suitable for this task. My final goal is to be able to classify new measurements into one of the 5 clusters with some certainty.
I have tried k-means for this purpose (I'm using R, btw). It clusters the data points by similarity but ignores the category of each data point, so it's not very useful to me. I then numbered the categories from 1 to 20, sorted from largest to smallest, and included that number as a second variable in the k-means clustering. That does seem to group the categories into clusters, but I guess it's using the category number (rank) as part of the distance calculation, which it shouldn't (the groups always come out as consecutive ranks, see attached image).
I have also tried calculating the mean of each category and clustering the means, and that does give me a clustering. But wouldn't that just ignore all the variability within each category, which is relevant?
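To make the concern concrete, here is a minimal sketch of that cluster-the-means approach. It is written in Python rather than R purely for illustration, and everything in it is made up for the example: the toy `heights` data, the helper `kmeans_1d` (a plain Lloyd's iteration with a deterministic start, not R's `kmeans`), and the choice of 2 clusters instead of 5.

```python
from statistics import mean, stdev

# Hypothetical toy data: category -> repeated measurements (e.g. heights in cm).
heights = {
    "A": [150.0, 152.0, 154.0],
    "B": [151.0, 153.0, 155.0],
    "C": [170.0, 172.0, 174.0],
    "D": [160.0, 172.0, 184.0],  # same mean as C, much larger spread
}

def kmeans_1d(values, k, iters=100):
    """Plain Lloyd's k-means on a list of numbers (k >= 2).

    Returns one cluster label per input value.
    """
    ordered = sorted(values)
    # Deterministic start: k centres spread evenly over the sorted values.
    centres = [ordered[(i * (len(ordered) - 1)) // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centre.
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centres.append(mean(members) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return labels

# Per-category summary: (mean, sample standard deviation).
summary = {cat: (mean(xs), stdev(xs)) for cat, xs in heights.items()}

# Cluster the categories using their means only.
categories = list(summary)
labels = kmeans_1d([summary[c][0] for c in categories], k=2)

for cat, lab in zip(categories, labels):
    m, s = summary[cat]
    print(f"{cat}: mean={m:.1f} sd={s:.1f} cluster={lab}")
```

In this toy data, categories C and D have the same mean but very different spreads, and clustering on the means alone cannot tell them apart. That is exactly the within-category variability I'm worried about losing.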
I have searched many forums and tried many examples, but none of them deal with a setup like mine.
So, what would be a correct way of clustering data with just one numerical variable and repeated measurements per category?
Thank you in advance.