+ Reply to Thread
Results 1 to 6 of 6

Thread: Dealing with missing values in cluster analysis

  1. #1
    Points: 23, Level: 1
    Level completed: 45%, Points required for next Level: 27

    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Dealing with missing values in cluster analysis




    I am currently looking for a cluster algorithm to deal with a large amount of systematically missing values. In particular, I need to cluster four-dimensional vectors which may have up to three missing values.

    I have already applied common approaches to clear the dataset ahead of clustering (e.g., by listwise deletion, pairwise deletion and imputation with regression models).

    For my purpose, however, I would like to dive into more sophisticated cluster algorithms that are able to deal with a high percentage of missing values (>50%).

    Questions:
    1) Which cluster algorithms are most suitable to determine cluster centers using all available values in a dataset with many missing values?
    2) How can these algorithms be applied to the dataset with missing values in e.g. R/SPSS/STATA?
    3) One approach that seems promising from the literature is the fuzzy c-means algorithm can this algorithm be applied in R/SPSS/STATA with the partial distance strategy or optimal completion strategy?

    Thank you very much for your help!

  2. #2
    Omega Contributor
    Points: 38,374, Level: 100
    Level completed: 0%, Points required for next Level: 0
    hlsmith's Avatar
    Location
    Not Ames, IA
    Posts
    6,998
    Thanks
    398
    Thanked 1,186 Times in 1,147 Posts

    Re: Dealing with missing values in cluster analysis

    You need to provide more info. Are all missing data continuous? How much missingness is there? You are assuming MAR?
    Stop cowardice, ban guns!

  3. #3
    Points: 23, Level: 1
    Level completed: 45%, Points required for next Level: 27

    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Dealing with missing values in cluster analysis

    Thank you for your response. Please find answers to your questions below:
    1) Yes, the data is continuous – i.e. the values consist of probabilities [0;1]
    2) The missing values account for >50% of all data fields. Within one four-dimensional vector, the missing values could range from zero to three. (e.g. (1,?,?,?);(1,?,?,0.5);(1,?,0.2,0.3);(1;0.5;0.5;0.5))
    3) MAR can be assumed.

  4. #4
    Omega Contributor
    Points: 38,374, Level: 100
    Level completed: 0%, Points required for next Level: 0
    hlsmith's Avatar
    Location
    Not Ames, IA
    Posts
    6,998
    Thanks
    398
    Thanked 1,186 Times in 1,147 Posts

    Re: Dealing with missing values in cluster analysis

    Your wording is confusing to me. So 4D means 4 continuous variables?
    Stop cowardice, ban guns!

  5. #5
    Points: 23, Level: 1
    Level completed: 45%, Points required for next Level: 27

    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Dealing with missing values in cluster analysis

    Yes, exactly! Sorry for the confusion.

  6. #6
    Omega Contributor
    Points: 38,374, Level: 100
    Level completed: 0%, Points required for next Level: 0
    hlsmith's Avatar
    Location
    Not Ames, IA
    Posts
    6,998
    Thanks
    398
    Thanked 1,186 Times in 1,147 Posts

    Re: Dealing with missing values in cluster analysis


    Well the go to approaches deal with multiple imputation. My experiences are with categorical variables. R has mice and amelia packages. There is also the EM procedure. I can't remember what it stands for.

    But theoretically multiple imputation is recommended. 50% is a lot of missing data. What are you going to do with these data next once you are done?
    Stop cowardice, ban guns!

+ Reply to Thread

           




Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts






Advertise on Talk Stats