Which clustering method can I use?

arkm25

New Member
I have a data set which consists of 1 dependent continuous variable and 3 independent categorical variables. I need to find the cluster/group of data points with the smallest within-cluster variance of the dependent variable.

Any suggestions as to which clustering method I can use?

Karabiner

TS Contributor
What is your continuous variable, and how was it measured? Are there peculiarities regarding its distribution (e.g. markedly skewed, or uniform, etc.)? What are your categorical variables, and how many categories do they have? How large is your sample size?

With kind regards

Karabiner

hlsmith

Less is more. Stay pure. Stay poor.
Can you add more detail, such as whether this is a supervised or unsupervised problem? Also, are you looking to repeat the process three times? I am confused by the description. If you can provide real context, that would greatly help.

arkm25

New Member
I am sorry I had to edit my question. What I meant was I have 1 dependent continuous variable and 3 independent categorical variables.
Since there is a dependent (response) variable, this becomes supervised learning.

Some context: The response variable is the time until a factory machine breaks down. The categorical predictor/independent variables are machine type, size (small, medium, large) and location.

I am not looking to repeat a process. Currently I have one data-set and I'm looking to extract just one cluster of data-points.

Best regards
arkm25

arkm25

New Member
I am sorry I had to edit my question. What I meant was I have 1 dependent continuous variable and 3 independent categorical variables.

The continuous variable is the time until a factory machine breaks down. It is simply measured as the time from when a machine is put into operation until it no longer functions. Its distribution seems to resemble an exponential distribution.

The categorical variables, with their respective numbers of levels, are:
Type: 4
Size: 3
Location: 5

Sample size: 83
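
For scale: a full cross of the three factors has 4 x 3 x 5 = 60 cells for 83 observations, so most cells would hold one or two machines and their variances would be very unstable. The following is only a rough sketch of how the within-group variance of the breakdown time could be compared across groups. The level names, the synthetic exponential data, the `min_size` threshold, and the function names are all assumptions made for illustration; they are not from my data set.

```python
import random
import statistics
from collections import defaultdict

# Hypothetical level names; only the level counts (4, 3, 5) come from the post.
TYPES = ["T1", "T2", "T3", "T4"]
SIZES = ["small", "medium", "large"]
LOCATIONS = ["A", "B", "C", "D", "E"]


def make_synthetic_rows(n=83, seed=0):
    """Build placeholder rows with exponential breakdown times (not real data)."""
    rng = random.Random(seed)
    return [
        {
            "type": rng.choice(TYPES),
            "size": rng.choice(SIZES),
            "location": rng.choice(LOCATIONS),
            "time": rng.expovariate(1 / 100.0),  # assumed mean of 100 time units
        }
        for _ in range(n)
    ]


def group_variances(rows, keys, min_size=5):
    """Return (levels, n, sample variance) per group, smallest variance first.

    Groups with fewer than min_size rows are skipped, because a variance
    estimated from one or two machines is not meaningful.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["time"])
    result = [
        (levels, len(times), statistics.variance(times))
        for levels, times in groups.items()
        if len(times) >= min_size
    ]
    return sorted(result, key=lambda item: item[2])


rows = make_synthetic_rows()
n_cells = len(TYPES) * len(SIZES) * len(LOCATIONS)
print(f"{n_cells} possible cells for {len(rows)} rows")

# Group on one factor at a time (or on pairs) rather than on the full cross.
for levels, n, var in group_variances(rows, keys=("size",)):
    print(levels, n, round(var, 1))
```

One caveat with this kind of comparison: if the times really are roughly exponential, the variance of a group is about the square of its mean, so the lowest-variance group will tend to be simply the one with the shortest lifetimes.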

Best regards
arkm25

Karabiner

TS Contributor
Did you already check whether all three categorical variables are associated with time to breakdown? If e.g. there was no association between location and time, then you could possibly leave location out of the clustering process.
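
One possible way to run such a check, sketched here under assumptions: because the breakdown times look skewed, a rank-based Kruskal-Wallis statistic with a permutation p-value avoids a normality assumption. This is just one option, not the only reasonable test. The function names, the synthetic data, and the number of permutations are illustrative choices, and the H statistic is computed without the tie correction (a constant factor that does not change the permutation p-value).

```python
import random
from collections import defaultdict


def rank_with_ties(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_h(ranks, labels):
    """Kruskal-Wallis H statistic (no tie correction) from ranks and group labels."""
    n = len(ranks)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r, g in zip(ranks, labels):
        sums[g] += r
        counts[g] += 1
    between = sum(sums[g] ** 2 / counts[g] for g in sums)
    return 12.0 / (n * (n + 1)) * between - 3 * (n + 1)


def permutation_p_value(times, labels, n_perm=2000, seed=0):
    """Return (observed H, permutation p-value) for one categorical variable."""
    rng = random.Random(seed)
    ranks = rank_with_ties(times)
    observed = kruskal_h(ranks, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if kruskal_h(ranks, shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic placeholder data: 83 machines, size affects the mean, location does not.
rng = random.Random(1)
mean_by_size = {"small": 50.0, "medium": 100.0, "large": 200.0}
sizes = [rng.choice(list(mean_by_size)) for _ in range(83)]
locations = [rng.choice("ABCDE") for _ in range(83)]
times = [rng.expovariate(1 / mean_by_size[s]) for s in sizes]

for name, labels in (("size", sizes), ("location", locations)):
    h, p = permutation_p_value(times, labels)
    print(f"{name}: H = {h:.2f}, permutation p = {p:.4f}")
```

With only 83 machines, a large p-value for a variable is weak evidence of no association, so it would be a reason to consider dropping that variable rather than proof that it is irrelevant.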

arkm25

New Member
Hi,
Yes, they are all associated with time to breakdown. Several other variables in the original data set were removed, and these are the ones that remain.