Dealing with missing values in cluster analysis

#1
I am currently looking for a cluster algorithm to deal with a large amount of systematically missing values. In particular, I need to cluster four-dimensional vectors which may have up to three missing values.

I have already applied common approaches to clear the dataset ahead of clustering (e.g., by listwise deletion, pairwise deletion and imputation with regression models).

For my purpose, however, I would like to dive into more sophisticated cluster algorithms that are able to deal with a high percentage of missing values (>50%).

Questions:
1) Which cluster algorithms are most suitable to determine cluster centers using all available values in a dataset with many missing values?
2) How can these algorithms be applied to the dataset with missing values in e.g. R/SPSS/STATA?
3) One approach that seems promising from the literature is the fuzzy c-means algorithm – can this algorithm be applied in R/SPSS/STATA with the partial distance strategy or optimal completion strategy?

Thank you very much for your help!
 
#3
Thank you for your response. Please find answers to your questions below:
1) Yes, the data is continuous – i.e. the values consist of probabilities [0;1]
2) The missing values account for >50% of all data fields. Within one four-dimensional vector, the missing values could range from zero to three. (e.g. (1,?,?,?);(1,?,?,0.5);(1,?,0.2,0.3);(1;0.5;0.5;0.5))
3) MAR can be assumed.
 

hlsmith

Omega Contributor
#6
Well the go to approaches deal with multiple imputation. My experiences are with categorical variables. R has mice and amelia packages. There is also the EM procedure. I can't remember what it stands for.

But theoretically multiple imputation is recommended. 50% is a lot of missing data. What are you going to do with these data next once you are done?