Reduce number of predictor variables ((multi)collinearity): PCA not helpful

#1
Howdy,
The attached spreadsheet contains the correlation matrix of the predictor variables and the response variable; correlations exist among them. Additionally included are the results of a Principal Components Analysis on my (scaled) predictor variables, run with the prcomp command in R at its default settings.

In my opinion these PCA results do not give me a 'better' variable, or one that is easier to work with: the loadings are diffuse and the number of PCs required to explain the variation is high. I have attached a scree plot and biplot of the prcomp results.
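For reference, the call was essentially this (a minimal sketch on placeholder data; `preds` stands in for my real predictor data frame):

```r
# Placeholder predictors; the real set has more variables
set.seed(1)
preds <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))

pca <- prcomp(preds, scale. = TRUE)  # centre and scale, then PCA

summary(pca)                    # proportion of variance explained by each PC
screeplot(pca, type = "lines")  # scree plot
biplot(pca)                     # scores and loadings together
```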

I'd appreciate your input on the best approach to reducing the variables. I ran stepwise AIC on the above, and the "best" model still had a poor adjusted R² value. I know I should know the data and decide which variables are important; the ones included are those that were pared down from a larger list.

Thanks,
Mike
 

bugman

Super Moderator
#2
I would begin by looking for redundancy in your variables. Are there two highly correlated variables that could serve as proxies for one another? If so, find these and remove them, then re-run the analysis. If that doesn't work, how do the diagnostics of each variable look? Are the relationships linear? Have you transformed them at all? I would scrutinize these points and then apply suitable transformations.
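A quick way to find such pairs (a sketch on toy data; the 0.9 cutoff is just an illustrative choice, not a rule):

```r
# Toy data with one deliberately redundant predictor
set.seed(1)
preds <- data.frame(a = rnorm(50))
preds$b <- preds$a + rnorm(50, sd = 0.1)  # nearly collinear with a
preds$c <- rnorm(50)

r <- cor(preds)
high <- which(abs(r) > 0.9 & upper.tri(r), arr.ind = TRUE)  # off-diagonal pairs only
data.frame(var1 = rownames(r)[high[, 1]],
           var2 = colnames(r)[high[, 2]],
           r    = r[high])  # lists a-b as a candidate pair to prune
```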
 
#3
Some variables are inherently correlated but describe different things. I have removed some variables and re-run the regression, to no avail. I am OK with not finding a well-fitting model, but I want to be sure that I come to that conclusion correctly.

>> If that doesn't work, how do the diagnostics of each variable look? Are they linear?
Some of the variables are roughly normal and others are not; a few have high skew and kurtosis, but that represents reality. I am always afraid that performing transformations makes things even harder to interpret.

Without transforming the variables, is attempting a linear regression wrong? If so, is there another approach to try that wouldn't be?

Thanks again,
Mike
 
#4
My worry is that making all the variables normal would require about five different transformations.
Code:
           skew   kurtosis      se
AREA1prop   2.66     8.43     0.23
AREA2prop   3.54    16.38     0.48
AREA3prop  -0.92     0.07     2.2
AREA4prop   1.15     0.84     1.97
AREA5prop   6.31    47.22     0.28
eda        -2.47    10.76     1.3
lpa        -4.64    32.14     1.14
lsia       -2.5     11.4      1.31
pda        -1.49     5.12     1.43
ppt        -0.43     0.04     0.84
tmax       -1.18     0.81     0.23
tmin       -1.29     1.59     0.22
inso       -1.09     1.35    11.451072
response    0.21    -0.77     0.26
Could I use robust regression?
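By robust regression I mean something like MASS::rlm; a sketch on made-up data, since I haven't tried it on mine yet:

```r
library(MASS)
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)
d$y[1] <- 25  # plant a gross outlier in the response

fit_ols <- lm(y ~ x, data = d)
fit_rob <- rlm(y ~ x, data = d)  # Huber M-estimation by default

coef(fit_ols)
coef(fit_rob)  # slope should be less affected by the outlier
```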

Thanks for the help,
Mike
 

Dason

Ambassador to the humans
#5
We don't require any assumptions about the distribution of the predictor variables. All we care about is whether the error term (estimated by the residuals) is approximately normally distributed.
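In other words, the check goes on the residuals of the fitted model, not on the raw variables. A sketch on toy data with a deliberately skewed predictor:

```r
set.seed(1)
d <- data.frame(x = rexp(100))       # skewed predictor: not a problem in itself
d$y <- 1 + 2 * d$x + rnorm(100)      # errors are what matter, and these are normal

fit <- lm(y ~ x, data = d)
qqnorm(resid(fit)); qqline(resid(fit))  # should track the line closely
shapiro.test(resid(fit))                # formal normality test of the residuals
```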
 

bugman

Super Moderator
#6
PCA performs better when there are linear relationships between variables, because it uses correlations and covariances as measures of association between them. Don't worry about needing multiple transformations; this is common. Outliers will also have a strong influence on your output, which is another reason to transform.
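One way to pick a transformation for the response is a Box-Cox profile (a sketch using MASS::boxcox on made-up data; the chosen lambda suggests the power, with lambda near 0 meaning a log transform):

```r
library(MASS)
set.seed(1)
d <- data.frame(x = runif(100, 1, 10))
d$y <- exp(0.5 * d$x + rnorm(100, sd = 0.3))  # multiplicative errors: log is the natural fix

fit <- lm(y ~ x, data = d)
bc <- boxcox(fit, plotit = FALSE)  # profile likelihood over a grid of lambdas
lambda <- bc$x[which.max(bc$y)]
lambda  # near 0 here, pointing to log(y)
```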
 
#7
I have linear relationships between many of the variables, but the PCA still didn't seem to give me a better option, and I don't think it's wise for me to remove variables unless I need to. With a non-normal data set, is the linear regression invalid? The best linear model (still poorly fitting) does have normally distributed residuals.

Do you have suggestions on how to find the right transformation? I would appreciate that.

I have included a data set should anyone care to view it.

Thanks again,
Cheers,
Mike
 
#8
>> We don't require any assumptions about the distribution of the predictor variables. All we care about is whether the error term (estimated by the residuals) is approximately normally distributed.
Dason, could you please point me to some literature on this? Thanks, Mike
 

cronjob

Guest
#9
If you have enough data, you may consider trying to learn the transformations using something like an additive spline model. I suppose it largely depends on whether or not your goal is interpretation of your model or prediction.
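A sketch of the additive-spline idea using mgcv (toy data; the point is that the smooth is learned from the data rather than chosen by hand):

```r
library(mgcv)  # ships with R as a recommended package
set.seed(1)
d <- data.frame(x = runif(200, 0, 3))
d$y <- sin(2 * d$x) + rnorm(200, sd = 0.2)  # nonlinear truth

fit <- gam(y ~ s(x), data = d)  # s() fits a penalized spline for x
summary(fit)
plot(fit)  # the estimated smooth should approximate sin(2x)
```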

If it's interpretation, you may look into something like ridge regression. It tends to do better with multicollinearity. There are plenty of other penalized regression methods you could look into as well.
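And a ridge sketch using MASS::lm.ridge on toy data with two nearly collinear predictors (glmnet is another common choice for penalized fits):

```r
library(MASS)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # severe collinearity
y  <- x1 + x2 + rnorm(n)

ridge <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.1))
select(ridge)                        # GCV and HKB/LW suggestions for the penalty
coef(ridge)[which.min(ridge$GCV), ]  # coefficients at the GCV-chosen lambda
```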

Good luck!

Andrew
 

bugman

Super Moderator
#11
Mike, just another thought. Did you normalise these data prior to performing PCA?

As far as texts go, look at general texts or ones specific to your field. I like Biometry by Sokal and Rohlf.

This book has useful instructions for transformations: http://www.zoology.unimelb.edu.au/qkstats/

It's geared toward biologists, but the theory applies to all. They also have an excellent chapter on PCA.

Another useful resource is this:

http://udel.edu/~mcdonald/stattransform.html
