# Thread: Dealing with missing values in cluster analysis

1. ## Dealing with missing values in cluster analysis

I am currently looking for a cluster algorithm to deal with a large amount of systematically missing values. In particular, I need to cluster four-dimensional vectors which may have up to three missing values.

I have already applied common approaches to clear the dataset ahead of clustering (e.g., by listwise deletion, pairwise deletion and imputation with regression models).

For my purpose, however, I would like to dive into more sophisticated cluster algorithms that are able to deal with a high percentage of missing values (>50%).

Questions:
1) Which cluster algorithms are most suitable to determine cluster centers using all available values in a dataset with many missing values?
2) How can these algorithms be applied to the dataset with missing values in e.g. R/SPSS/STATA?
3) One approach that seems promising from the literature is the fuzzy c-means algorithm – can this algorithm be applied in R/SPSS/STATA with the partial distance strategy or optimal completion strategy?

Thank you very much for your help!

2. ## Re: Dealing with missing values in cluster analysis

You need to provide more info. Are all missing data continuous? How much missingness is there? You are assuming MAR?

3. ## Re: Dealing with missing values in cluster analysis

1) Yes, the data is continuous – i.e. the values consist of probabilities [0;1]
2) The missing values account for >50% of all data fields. Within one four-dimensional vector, the missing values could range from zero to three. (e.g. (1,?,?,?);(1,?,?,0.5);(1,?,0.2,0.3);(1;0.5;0.5;0.5))
3) MAR can be assumed.

4. ## Re: Dealing with missing values in cluster analysis

Your wording is confusing to me. So 4D means 4 continuous variables?

5. ## Re: Dealing with missing values in cluster analysis

Yes, exactly! Sorry for the confusion.

6. ## Re: Dealing with missing values in cluster analysis

Well the go to approaches deal with multiple imputation. My experiences are with categorical variables. R has mice and amelia packages. There is also the EM procedure. I can't remember what it stands for.

But theoretically multiple imputation is recommended. 50% is a lot of missing data. What are you going to do with these data next once you are done?

 Tweet

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts