Principal component analysis (PCA) on clustered data

Hi Everyone,
I am analysing data from physical condition tests on students belonging to different types of schools (general, professional, agricultural). The data are therefore not independent, because students in the same type of school tend to be similar (the intraclass correlation coefficient is about 20%).
Is this a problem for principal component analysis (PCA)? Is there a way to run a PCA that takes a possible "cluster effect" (schools) into account?
Thanks in advance


Probably A Mammal
Not entirely my area, but I see 2 possible approaches

1. Use PCA on each group separately. If the groups are structurally different, I would expect PCA to pull out components representative of each group, which you can then use to analyze underlying phenomena across the groups. It does raise the question of whether you're comparing apples to apples, so to speak, when you compare these representative components.

2. Try to include a feature (or a set of indicator features, i.e. "dummy variables") that represents the groups but is still suitable for use in PCA. I'm not entirely sure of the best encoding (e.g., 0 and 1, or -1 and 1, for each of the n-1 groups?). This method lends itself to other approaches of data transformation, but PCA is definitely not my forte.
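
To make the two approaches concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy scores, the variable pair (sprint, jump), and the helper names first_component, loading_similarity and dummy_code are all hypothetical. It only extracts the first component (by power iteration) and compares group axes with an absolute cosine, which is one simple way to check the "apples to apples" concern, not a formal test. In practice you would probably use a library PCA routine instead.

```python
import math

def first_component(rows, iters=500):
    """Leading principal axis (unit vector) via power iteration on the covariance matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p  # arbitrary starting direction
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def loading_similarity(a, b):
    """Absolute cosine between two unit loading vectors (1.0 means the same axis)."""
    return abs(sum(x * y for x, y in zip(a, b)))

def dummy_code(labels, reference):
    """Build n-1 indicator columns (0/1), leaving out the reference group."""
    levels = sorted(set(labels) - {reference})
    rows = [[1 if lab == lev else 0 for lev in levels] for lab in labels]
    return rows, levels

# Hypothetical toy data: (sprint, jump) scores per student, by school type.
scores = {
    "general": [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)],
    "professional": [(1.0, 1.9), (2.0, 4.1), (3.0, 5.8), (4.0, 8.1)],
    "agricultural": [(1.0, 0.6), (2.0, 0.9), (3.0, 1.6), (4.0, 1.9)],
}

# Approach 1: one PCA per group, then compare the first axes pairwise.
axes = {school: first_component(rows) for school, rows in scores.items()}
sim_gp = loading_similarity(axes["general"], axes["professional"])
sim_ga = loading_similarity(axes["general"], axes["agricultural"])
print("general vs professional:", round(sim_gp, 3))
print("general vs agricultural:", round(sim_ga, 3))

# Approach 2: indicator columns that could be appended to the feature matrix.
labels = [school for school, rows in scores.items() for _ in rows]
indicators, levels = dummy_code(labels, reference="general")
print(levels, indicators[0], indicators[4], indicators[8])
```

A similarity close to 1 suggests two groups share roughly the same first axis; a clearly lower value is a hint that the components are not directly comparable. Whether 0/1 indicators behave well inside a PCA is still the open question raised in point 2.
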
First of all, there are a couple of ways of dealing with clustered data in PCA. And yes, the clustering will affect the analysis, mainly in terms of your effective sample size.

One quick and easy way of handling this problem, if you don't care about the clusters, is to use the intraclass correlation to calculate the effective sample size through a design effect. Then run your PCA on the data.
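
As a rough illustration of that calculation, the sketch below uses the usual design effect formula, deff = 1 + (m - 1) * ICC, where m is the average cluster size, and divides the sample size by it. The ICC of 0.20 and the three school types come from the question; the total of 600 students is a made-up number, and the formula assumes clusters of roughly equal size.

```python
def design_effect(icc, avg_cluster_size):
    """Design effect: deff = 1 + (m - 1) * ICC."""
    return 1.0 + (avg_cluster_size - 1.0) * icc

def effective_sample_size(n, icc, avg_cluster_size):
    """Nominal sample size deflated by the design effect."""
    return n / design_effect(icc, avg_cluster_size)

icc = 0.20         # intraclass correlation reported in the question
n_students = 600   # hypothetical total sample size
n_clusters = 3     # school types: general, professional, agricultural
avg_size = n_students / n_clusters

deff = design_effect(icc, avg_size)
n_eff = effective_sample_size(n_students, icc, avg_size)
print(f"design effect: {deff:.1f}, effective sample size: {n_eff:.1f}")
```

With only three very large clusters the design effect is large, so under these assumed numbers the effective sample size ends up far below the nominal one.
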

But BECAUSE you do care about the groups, what you need to do is test for model invariance: fit your PCA and then check whether the model is invariant between the groups. In particular, you want to test for multi-group invariance.