PCA or clustering on binary data?

jk78

New Member
#1
Hello,

I have a dataset consisting of around 30 species of plants (rows) and around 50 variables (my columns) that are medicinal properties of the plant species, and which take on binary values; for instance, if a plant species has anti-fungal properties (one of the variables) it would have value=1 for that variable, otherwise value=0.
I would like to establish similarities between the plant species in terms of their medicinal properties.
I had initially thought about doing a PCA or correspondence analysis, because I thought that it would allow me to plot the plant species in a 2D space and visualise the similarities between them in terms of distance. But I have read about many drawbacks about using these techniques on binary data.
What would you recommend as a first approach to finding similarities in my plant species in terms of their medicinal profile? Is a multivariate technique appropriate, or would some clustering method be better?


Many thanks

Jorma
 
#2
Hi,

I think you could use the technique called "multiple correspondence analysis" (MCA), which is related to PCA but designed for categorical (including binary) data instead of quantitative data. Furthermore, it is applicable to a large set of variables, as in your case. There are several functions to do MCA in R, e.g. MCA() (FactoMineR package) or mca() (MASS package), and several more.

Best
 

gianmarco

TS Contributor
#3
Hello,
I do not believe MCA is appropriate in this context. I would go with PCA or CA. FactoMineR has facilities to perform clustering on PCA or CA results.
As for the issues with binary data, I am not aware of any, but I would like to know more about them if you can provide some references.

Gm
 
#5
What would you recommend as a first approach to finding similarities
I would use principal component analysis (PCA) as a first approach. And I would do a biplot so that both loadings and scores can be plotted in the same diagram. Lots of people do that and that is all they do.

Later maybe you could estimate tetrachoric correlations and re-estimate the PCA-model.

Later, maybe you can get the original measurements (e.g. the concentrations behind the anti-fungal property), transform them so that they look approximately multivariate normal, and then re-estimate the correlations and the PCA. If you have measurements of the anti-fungal property, you could possibly build a predictive model.

PCA is a multivariate technique. But there is no statistical model involved, so you cannot say that there is a lack of fit. The method, or let's call it the algorithm, just decomposes the data into orthogonal components. And if you can gain some insights by that, then fine!

Gianmarco knows a lot about correspondence analysis. I would trust his suggestion about that.
 

bugman

Super Moderator
#6
Are you just looking at some exploratory / descriptive methods here, or is your aim to test for differences between groups?

PCoA can be a good alternative to PCA because it allows you to choose any distance measure (and there are a lot for binary data: http://www-01.ibm.com/support/knowl...s.help/cmd_proximities_sim_measure_binary.htm).

You could also consider MDS or NMDS accompanied by a cluster analysis and simprof test, but this will all depend on your specific goals.
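
To make the PCoA idea concrete, here is a minimal sketch in Python of classical PCoA on an arbitrary distance matrix. It is only an illustration, not the routine any particular package uses: the function name `pcoa`, the toy 0/1 matrix and the choice of a Hamming-type distance are all assumptions made for the example, and NumPy is assumed to be installed.

```python
import numpy as np

def pcoa(dist, n_axes=2):
    """Classical PCoA (metric MDS) of a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Gower double-centering of the squared distances
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_axes]  # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Negative eigenvalues (from non-Euclidean distances) are clipped to zero
    return eigvecs * np.sqrt(np.clip(eigvals, 0, None))

# Hypothetical data: 4 species (rows) by 4 binary properties (columns)
data = np.array([[1, 1, 0, 0],
                 [1, 0, 0, 0],
                 [0, 0, 1, 1],
                 [0, 1, 1, 1]])

# Proportion of mismatching positions for every pair of rows
dist = (data[:, None, :] != data[None, :, :]).mean(axis=2)

coords = pcoa(dist)  # one row of 2D coordinates per species
print(coords)
```

Any other binary distance (for example 1 minus a Dice similarity) can be passed in place of `dist`; only the distance matrix changes, not the ordination step.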
 
#7
You might also consider Principal Coordinates Analysis (PCoA) or NMDS, with a binary similarity/distance measure such as Dice or Hamming.
 

bugman

Super Moderator
#8
You might also consider Principal Coordinates Analysis (PCoA) or NMDS, with a binary similarity/distance measure such as Dice or Hamming.
Thanks ohammer - as mentioned in my response above...


But out of interest, the Dice and Hamming measures are two that I know of, but have never used. What properties do these have that made you suggest them?
 

jk78

New Member
#9
Thank you very much for all these useful replies. It is a huge help.
I'm following bugman and ohammer's suggestions of trying PCoA or NMDS. I have 2 further questions:

- I would also be very interested to know why Dice and Hamming are the similarity measures best adapted to the kind of binary data I have?

- I tried NMDS in R (with the metaMDS function) and I get an error message saying NA values are not allowed in the similarity matrix (when it runs the function cmdscale). What is the most efficient way to deal with NA (not available) values in my data? I was thinking of just setting NA values to 0. I know this isn't ideal because a 0 meaning that the plant species in question has tested negative for a medicinal property isn't the same as a 0 meaning the plant hasn't been tested at all, but is it acceptable to do this as a quick solution or do you think it will skew the results too much?

Thanks very much again

Jorma
 

bugman

Super Moderator
#10
No, that’s not ideal at all. And this is a classic example of why you must think carefully about your metric / similarity / dissimilarity measure: different ones handle zeros differently (some treat shared 0's as evidence that sites or groups are similar, others down-weight them, and others still treat them as differences).

In your data set, are the NA's there because they were not measured? And how many missing values do you have (relative to the total)?
 
#11
Thanks ohammer - as mentioned in my response above...


But out of interest, the Dice and Hamming measures are two that I know of, but have never used. What properties do these have that made you suggest them?
Sorry, don't know how I missed your answer there!

Hamming distance simply counts the number of positions where the two rows differ, so any position with the same value in both rows (both 0 or both 1) counts as agreement. This can be a bit misleading in e.g. ecology because double absence counts as similarity. For example, it is debatable whether the Arctic and an Asian rain forest are similar just because you don't find giraffes in either of them. Dice avoids this problem because it ignores double absences.
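
As a small illustration of the double-absence point, here is a sketch in plain Python. The two rows are made-up data and the function names are just for this example; the Dice coefficient is written in its usual form 2a / (2a + b + c), where a is the number of shared presences and b, c are the presences unique to each row.

```python
def hamming_distance(a, b):
    """Number of positions where two equal-length 0/1 rows differ."""
    return sum(x != y for x, y in zip(a, b))

def simple_matching_similarity(a, b):
    """Fraction of positions that agree, counting 0/0 and 1/1 alike."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

def dice_similarity(a, b):
    """Dice coefficient: shared presences only, double absences ignored."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    total = sum(a) + sum(b)  # equals 2a + b + c
    return 2 * both / total if total else 0.0

# Hypothetical rows: each has one presence, and they share none
arctic = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
rainforest = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(hamming_distance(arctic, rainforest))            # 2
print(simple_matching_similarity(arctic, rainforest))  # 0.8
print(dice_similarity(arctic, rainforest))             # 0.0
```

The matching-based measure calls the two rows 80% similar purely because of the eight shared absences, while Dice gives 0 because nothing present is shared.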
 
#12
A simple way (I don't know if it's the best) to treat missing data when making a distance matrix is to simply skip a variable when it contains a missing value in one or both of the rows to be compared. This is sometimes called pairwise deletion. The program "Past" will do this automatically for PCoA and NMDS when you have a '?' instead of a 0/1 in your data.
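
A rough sketch of pairwise deletion in plain Python, combined with a Dice similarity. This is an assumption about how one might code it by hand, not a description of what Past does internally; missing values are represented by `None`, and the function name and the data are made up.

```python
def dice_pairwise(a, b):
    """Dice similarity using only variables observed in both rows.

    Returns (similarity, n_used). The similarity is None when the jointly
    observed variables contain no presence at all (coefficient undefined).
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    both = sum(1 for x, y in pairs if x == 1 and y == 1)
    total = sum(x + y for x, y in pairs)
    similarity = 2 * both / total if total else None
    return similarity, len(pairs)

def zero_fill(row):
    """The quick fix under discussion: treat every missing value as 0."""
    return [0 if x is None else x for x in row]

# Hypothetical species rows; None marks a property that was never tested
sp_a = [1, None, 0, 1, None]
sp_b = [1, 1, None, 0, None]

print(dice_pairwise(sp_a, sp_b))                        # uses 2 variables
print(dice_pairwise(zero_fill(sp_a), zero_fill(sp_b)))  # uses all 5
```

Returning the number of variables actually used is a deliberate choice: when many values are missing, some pairs of rows may be compared on very few variables, and that is worth checking before trusting the distances.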
 
#13
Thanks bugman and ohammer.
The majority of my values are missing (they were not measured): 861 missing values for a total of 1144 values
So you suggest that I do a pairwise deletion and use a Dice similarity measure?
I'll try using the Past program for my analysis.
Is there any reason to use NMDS instead of PCoA, or vice-versa?
Many thanks again,

Jorma
 
#14
I just tried the Past program and it is awesome. Very useful for testing the different similarity measures.


 
#15
Thanks very much for all the replies.
I am still a bit confused regarding PCoA (sorry for such a basic question):

- Why do you not get any loadings in a PCoA analysis?

I find the distances in a PCoA plot very hard to interpret without knowing which are the most discriminatory variables for each of the axes. If you do not have loadings, what other measure can you rely on to interpret the PCoA plot?

Thanks

Jorma

