Appropriateness of using ANOVA on PC-scores (PCA)

#1
Hi everyone,

I'm new to the forum and wanted to ask about the appropriateness of running a test such as ANOVA or the Mann-Whitney U test on the PC-scores of a Principal Component Analysis. I recently received the critique on a submitted manuscript that it is problematic to run multiple (non-independent) tests on the PC-scores of a PCA using ANOVA. I ran a separate ANOVA on the scores of each principal component to test whether the between-group differences were significant.

Although I realize this may not be a good approach, can anyone explain to me in more detail why this would be problematic? And is it really the case that PC-scores cannot be treated as independent variables?

Thank you,

Jan
 

noetsi

Fortran must die
#2
I have never seen anyone do this. :p What were you testing? That is, what were your dependent and independent variables?
 

spunky

Doesn't actually exist
#3
Although I realize this may not be a good approach, can anyone perhaps explain to me in more detail why this would be problematic? And is it really the case that pc-scores cannot be treated as independent variables?
The most immediate reason this is a suboptimal approach is that your PCA scores are estimated from the data. They are data points that you didn't collect and observe directly, so by treating them as just another set of observations you underestimate the true uncertainty associated with whatever composite variable you are using them on. There are ways to account for that, but none of them is particularly easy or readily available in software (that I know of).
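To make the point about estimated scores concrete, here is a minimal sketch that bootstraps the first principal component and reports how much its loadings move between resamples. The data, the helper names (`first_pc`, `bootstrap_pc1_loadings`) and the choice of a row bootstrap are all assumptions made for illustration; this is one possible way to look at the uncertainty, not a full correction for it.

```python
import numpy as np

def first_pc(data):
    """Return the unit-length loading vector of the first principal component."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def bootstrap_pc1_loadings(data, n_boot=500, seed=0):
    """Re-estimate the PC1 loadings on rows of `data` resampled with replacement."""
    rng = np.random.default_rng(seed)
    reference = first_pc(data)
    n = data.shape[0]
    loadings = np.empty((n_boot, data.shape[1]))
    for b in range(n_boot):
        sample = data[rng.integers(0, n, size=n)]
        v = first_pc(sample)
        # The sign of a principal axis is arbitrary, so align it with the reference.
        if np.dot(v, reference) < 0:
            v = -v
        loadings[b] = v
    return reference, loadings

# Hypothetical data: 60 birds, 4 correlated wing measurements.
rng = np.random.default_rng(1)
size = rng.normal(size=(60, 1))
wings = size @ np.array([[1.0, 0.8, 0.6, 0.4]]) + rng.normal(scale=0.5, size=(60, 4))

reference, loadings = bootstrap_pc1_loadings(wings)
loading_se = loadings.std(axis=0)
print("PC1 loadings:", np.round(reference, 3))
print("bootstrap SE:", np.round(loading_se, 3))
```

Because the scores are the data projected onto these loadings, any spread in the loadings carries over to the scores, and a test that treats the scores as fixed observations ignores that spread.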
 
#4
I have never seen anyone do this. :p What were you testing? That is, what were your dependent and independent variables?
Thanks for your input, noetsi. It is actually a morphometric study on the shape of the wings in different bird species. I did a PCA on the wing measurements of three bird species. In the resulting scatterplot there is some overlap between the different clusters. The idea of the ANOVA was to get a quantitative estimate of how significant the separation between the clusters is.
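For readers who want to see the workflow described above, here is a minimal sketch on simulated data: a PCA via SVD, followed by a separate one-way ANOVA F statistic on each component's scores. Everything about the data (three species, 30 birds each, four measurements) is made up, and the helper names are hypothetical. To keep the sketch dependent on NumPy only, the p-values come from permuting the species labels rather than from the F distribution, which is a swap relative to a classical ANOVA.

```python
import numpy as np

def pc_scores(data):
    """Project centered data onto its principal axes (PCA via SVD)."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt.T

def f_statistic(values, labels):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    groups = [values[labels == g] for g in np.unique(labels)]
    grand_mean = values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def permutation_p(values, labels, n_perm=999, seed=0):
    """Return (observed F, p-value obtained by shuffling the group labels)."""
    rng = np.random.default_rng(seed)
    observed = f_statistic(values, labels)
    hits = 0
    for _ in range(n_perm):
        if f_statistic(values, rng.permutation(labels)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical data: 3 species x 30 birds, 4 wing measurements.
rng = np.random.default_rng(42)
species = np.repeat(np.array([0, 1, 2]), 30)
species_means = np.array([[0.0, 0.0, 0.0, 0.0],
                          [1.5, 1.0, 0.5, 0.0],
                          [3.0, 2.0, 1.0, 0.0]])
wings = species_means[species] + rng.normal(size=(90, 4))

scores = pc_scores(wings)
results = [permutation_p(scores[:, k], species) for k in range(scores.shape[1])]
for k, (f_value, p_value) in enumerate(results, start=1):
    print(f"PC{k}: F = {f_value:.2f}, permutation p = {p_value:.3f}")
```

Note that this produces one test per component, so four components mean four p-values from the same birds. The sketch also treats the scores as if they had been measured directly, which is exactly the simplification questioned in post #3.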
 
#5
The most immediate reason this is a suboptimal approach is that your PCA scores are estimated from the data. They are data points that you didn't collect and observe directly, so by treating them as just another set of observations you underestimate the true uncertainty associated with whatever composite variable you are using them on. There are ways to account for that, but none of them is particularly easy or readily available in software (that I know of).

Thanks spunky, that's a good point, but I have seen a number of people run MANOVA on PC-scores. Does this make more sense from your perspective? It could be argued that, although the PC-scores are still composite variables, a MANOVA includes (almost) all of the components in the analysis, and therefore also the original data points I measured. The critique I got seems to be aimed mainly at the fact that I ran multiple tests on the scores of the individual components, and that these cannot be treated as independent variables. But I can't get my head around the rationale behind this.
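One way to read the multiple-testing part of the critique: every extra per-component test adds another chance of a false positive. The sketch below shows the arithmetic and a Holm step-down adjustment in plain Python. The p-values are hypothetical, and `familywise_error` assumes the tests are independent, which is itself the assumption under dispute here. PC-scores are uncorrelated across the pooled sample by construction, but that alone does not make them independent.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values from four per-component tests.
raw_p = [0.001, 0.020, 0.030, 0.400]
adjusted_p = holm_adjust(raw_p)
fwer = familywise_error(0.05, len(raw_p))

print("familywise error at alpha = 0.05:", round(fwer, 4))
print("Holm-adjusted p-values:", [round(p, 4) for p in adjusted_p])
```

With these made-up numbers, the second and third components fall below 0.05 before adjustment but not after it. A single multivariate test such as MANOVA sidesteps this particular issue by testing all components at once, though it does not address the uncertainty in the scores themselves.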
 

noetsi

Fortran must die
#6
Could you post the reviewers' comments (not sure if that is allowed or not)? There are analyses that do not assume independence, but details are needed.