Principal component analysis

Laura_247
02-24-2011, 08:48 AM
I have a problem with the analysis for my dissertation. The topic is constructed wetlands, and I have measured a set of variables to categorise them, e.g. size, age, % of different land uses in the surroundings... My intention is to use Principal Component Analysis to combine these variables. In addition, I have recorded the abundance of birds in the wetlands, and I want to find out whether it is related to a particular principal component (by regression). The problem is that most of the variables are not normally distributed; for example, the % coniferous in the surroundings is often 0. It seems that the assumption for PCA is the same as for correlation, i.e. normal distribution. I have tried transforming the data, but that does not work for all of the variables. Is it still possible to do the PCA? Any advice on what to do?

ohammer
02-26-2011, 07:12 AM
PCA is simply a method for producing orthogonal axes explaining maximal variance in your data. PCA will do this correctly no matter the distribution. In this sense, PCA does not make any assumptions whatsoever, and you are fine. PCA is not a formal statistical test, but a method for data reduction and visualization.
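To illustrate the point about distributions, here is a minimal sketch in Python with NumPy (not Minitab). The data are made up and deliberately non-normal, and the variable names are only placeholders for the kind of wetland variables described above. It checks that the axes still come out orthogonal and that the scores are uncorrelated, with variances equal to the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Made-up, deliberately non-normal variables (hypothetical names)
pct_conifer = np.where(rng.random(n) < 0.7, 0.0, rng.uniform(0, 60, n))  # mostly zeros
size_ha = rng.lognormal(mean=0.0, sigma=1.0, size=n)                     # right-skewed
age_yr = rng.integers(1, 30, n).astype(float)                            # discrete
X = np.column_stack([pct_conifer, size_ha, age_yr])

# Standardize each variable, then eigen-decompose the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)

# eigh returns ascending eigenvalues; reorder so axis 1 explains the most variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores: the sites projected onto the PCA axes
scores = Z @ eigvecs

# The axes are orthonormal, and the scores are uncorrelated with
# variances equal to the eigenvalues, whatever the input distributions
axes_orthogonal = np.allclose(eigvecs.T @ eigvecs, np.eye(3))
score_cov = np.cov(scores, rowvar=False)
scores_uncorrelated = np.allclose(score_cov, np.diag(eigvals))

print(axes_orthogonal, scores_uncorrelated)
print("eigenvalues:", eigvals)
```

This only shows that the decomposition itself is well defined for such data. Whether the axes are useful is a separate question.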

Another question is whether PCA will give useful results. If you have outliers, nonlinear relationships between variables or strange distributions, then the simple maximization of variance may not succeed in revealing the "underlying connections".

Therefore, normal practice is to just run the PCA and inspect the results in terms of eigenvalues and interpretability of the components. If it "works", you are happy; if not, you try something else!

Laura_247
02-28-2011, 05:22 AM
Thanks for the help!
How do I interpret the loading values, specifically the negative ones? For example, the loading value for % grazed land is -0.47. And how do I interpret eigenvalues, loading plots, score plots and biplots?

ohammer
02-28-2011, 05:30 AM
Which software are you using?

Since you have variables with different units, did you carry out the PCA on variables standardized with respect to variance ("PCA on the correlation matrix")?
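A rough sketch of why this matters, in Python with NumPy and made-up data (the variable names and magnitudes are assumptions for illustration only). It shows that the covariance matrix of the standardized variables is the correlation matrix, and that without standardization the first axis is dominated by whichever variable happens to have the largest numerical variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Made-up variables in very different units (hypothetical names)
area_m2 = rng.normal(5000.0, 2000.0, n)   # large numbers, huge variance
pct_grazed = rng.uniform(0.0, 100.0, n)   # percent
age_yr = rng.uniform(1.0, 20.0, n)        # years
X = np.column_stack([area_m2, pct_grazed, age_yr])

def first_axis(matrix):
    """Return the eigenvector with the largest eigenvalue of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(matrix)
    return vecs[:, np.argmax(vals)]

# PCA on the covariance matrix: axis 1 is almost purely area_m2
cov_axis = first_axis(np.cov(X, rowvar=False))

# Standardizing to unit variance turns the covariance matrix
# into the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
same_matrix = np.allclose(np.cov(Z, rowvar=False), np.corrcoef(X, rowvar=False))

print("axis 1 loadings, covariance PCA:", np.round(cov_axis, 3))
print("cov(standardized) equals correlation matrix:", same_matrix)
```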

A negative loading just means that the variable in question correlates negatively with the axis.

Laura_247
02-28-2011, 09:03 AM
I'm using Minitab and the correlation matrix. So if it correlates negatively with the axis, what does that mean biologically, for example in the case of % grazing in the surrounding land?

ohammer
02-28-2011, 09:26 AM
OK. The eigenvalues describe how much variance is "explained" by each axis. You can inspect these "by eye" to get an idea about which axes are "real" in terms of underlying patterns and which are "noise" (there are also more objective methods for this, my favourite is bootstrapping the PCA and comparing the eigenvalues and their confidence intervals with a "broken stick" model, but this is not so important).
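For the "broken stick" comparison, a minimal sketch in plain Python follows. The eigenvalues are made up for illustration (four variables on the correlation matrix, so they sum to 4), and the bootstrapped confidence intervals are left out. Under the broken-stick model the expected proportion of variance for axis k out of p is (1/p) times the sum of 1/i for i from k to p; an axis is often taken as interesting when its observed proportion exceeds that expectation.

```python
def broken_stick(p):
    """Expected proportion of variance for each of p axes under the broken-stick model."""
    return [sum(1.0 / i for i in range(k, p + 1)) / p for k in range(1, p + 1)]

# Made-up eigenvalues, purely for illustration
eigenvalues = [2.4, 0.9, 0.5, 0.2]

total = sum(eigenvalues)
observed = [ev / total for ev in eigenvalues]
expected = broken_stick(len(eigenvalues))

# True where an axis explains more variance than the broken stick expects
exceeds = [obs > exp for obs, exp in zip(observed, expected)]

for k, (obs, exp, flag) in enumerate(zip(observed, expected, exceeds), start=1):
    print(f"axis {k}: observed {obs:.3f}, broken stick {exp:.3f}, exceeds: {flag}")
```

With these invented numbers only the first axis exceeds its broken-stick expectation.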

The loadings say something about how much each variable is involved in each axis, positively or negatively. This is important in order to interpret the "meaning" of each axis.

The scatter plot of scores shows all your sites plotted in the space spanned by the PCA axes. The biplot is a good way to show the loadings and the scores in the same plot.

If you have a negative loading for % grazing on e.g. Axis 1, it means that sites with negative scores on Axis 1 generally have high % grazing, while sites with large positive scores generally have low % grazing. It also means that along this axis, % grazing correlates positively with other variables that have negative loadings, and negatively with variables that have positive loadings.
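A small sketch of this relationship in Python with NumPy, on made-up data where % grazing and % forest trade off against each other (both names and the data are assumptions, not the wetland data set). For a PCA on the correlation matrix, the correlation between a variable and the axis scores equals the loading times the square root of the eigenvalue, so it always has the same sign as the loading.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 150

# Made-up data: grazing and forest are strongly negatively related
pct_grazed = rng.uniform(0, 100, n)
pct_forest = 100 - pct_grazed + rng.normal(0, 10, n)
size_ha = rng.lognormal(0, 1, n)
X = np.column_stack([pct_grazed, pct_forest, size_ha])

# PCA on the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
vals, vecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
lam1 = vals[-1]          # largest eigenvalue (eigh sorts ascending)
axis1 = vecs[:, -1]      # loadings on Axis 1
scores1 = Z @ axis1      # site scores on Axis 1

# Correlation of each variable with the Axis 1 scores
corr_with_axis = np.array(
    [np.corrcoef(Z[:, j], scores1)[0, 1] for j in range(Z.shape[1])]
)

# Same sign as the loadings: correlation = loading * sqrt(eigenvalue)
matches_loadings = np.allclose(corr_with_axis, axis1 * np.sqrt(lam1))

# Grazing and forest end up with loadings of opposite sign on Axis 1
opposite_signs = axis1[0] * axis1[1] < 0

print("loadings:    ", np.round(axis1, 3))
print("correlations:", np.round(corr_with_axis, 3))
print(matches_loadings, opposite_signs)
```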

You can arbitrarily flip a PCA axis 180 degrees, changing the sign of all its loadings, so whether a particular loading is positive or negative is not in itself interesting, only how this relates to the other variables.
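The flip can be checked numerically. This is a minimal sketch in Python with NumPy, using a made-up 3x3 correlation matrix: negating an eigenvector gives another valid eigenvector with the same eigenvalue, and the relative signs among the loadings are unchanged.

```python
import numpy as np

# Made-up correlation matrix, for illustration only
R = np.array([
    [1.0, -0.6, 0.3],
    [-0.6, 1.0, -0.2],
    [0.3, -0.2, 1.0],
])

vals, vecs = np.linalg.eigh(R)
lam = vals[-1]       # largest eigenvalue
v = vecs[:, -1]      # loadings on Axis 1
flipped = -v         # the same axis, flipped 180 degrees

# Still an eigenvector with the same eigenvalue
still_eigenvector = np.allclose(R @ flipped, lam * flipped)

# Same variance explained along the axis
same_variance = np.isclose(flipped @ R @ flipped, v @ R @ v)

# The sign of each loading relative to every other loading is unchanged
relative_signs_kept = np.array_equal(
    np.sign(np.outer(v, v)), np.sign(np.outer(flipped, flipped))
)

print(still_eigenvector, same_variance, relative_signs_kept)
```

This is also why two programs can report the same axis with opposite signs.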

Phew, all this typing is hard work, I should have referred to some web page instead :(