Regression with 4 highly collinear IVs and 4 DVs, best approach?



I have a dataset consisting of four continuous independent variables, each taking values between 0 and 1, and four continuous dependent variables, each a score between -3 and 3. The IVs are measurements of the same biomarker in different regions of the brain. The DVs are four separate neuropsychological performance measures. For all variables, higher values are "good" and lower values are "bad". I am interested in seeing whether changes in the IVs can predict changes in the DVs. I also have two continuous covariates which I would like to partial out of the regression.

I have a total of 38 data points for each variable, from 19 patients who were each assessed before and after treatment. Except for the covariates, all variables are normally distributed. Here is a correlation matrix I made in R:

      DV1       DV2       DV3       DV4       IV1       IV2       IV3
DV2   0.60***
DV3   0.30      0.33*
DV4   0.42**    0.25      0.60***
IV1   0.28      0.27      0.47**    0.39*
IV2   0.20      0.20      0.59***   0.31      0.82***
IV3   0.19      0.27      0.55***   0.37*     0.85***   0.92***
IV4   0.23      0.33*     0.33*     0.19      0.60***   0.71***   0.69***

As you can see, the four IVs are highly collinear (though somewhat less so for IV4). This makes sense experimentally, because it simply means the effect is not region specific.

I would like to do two things with these data: 1) create a composite score of the four IVs (potentially excluding IV4, if appropriate), and 2) compute a linear regression between this composite score and each one of the four DVs (with the two covariates partialled out). What is the best approach to this? Principal component analysis? Multivariate regression? I am rather new to multivariate statistics, and just started learning R yesterday, so some detail would be appreciated.



Fortran must die
I don't think 38 data points are enough for a regression model. It will probably run, but your power will be awful, and it is hard to believe so few cases can reasonably represent any real population. Also, pairwise correlations are not the whole story: what you need to worry about is multicollinearity, meaning how well each IV is predicted by all the other predictors together. The way to test for that is to run the regression and check whether the tolerance or VIF (variance inflation factor, the reciprocal of tolerance) violates the usual rules of thumb; commonly cited cutoffs are a VIF above about 10, or above 5 in stricter versions. I don't know how that is done in R, but I am sure it's not difficult (in SAS or SPSS you simply request regression diagnostics).
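
To make the VIF check concrete, here is a minimal sketch in plain Python (not R, and not the output of any statistics package). It relies on the fact that, for standardized predictors, the VIFs are the diagonal entries of the inverse of the predictor correlation matrix. The matrix is typed in from the rounded IV correlations posted above, and the sketch assumes all four IVs would be entered together with no covariates, which is not quite the analysis described in the question. The helper names `invert` and `vif_from_correlations` are made up for this illustration.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Use the row with the largest pivot to keep the arithmetic stable.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    # The right-hand half of the augmented matrix is now the inverse.
    return [row[n:] for row in aug]


def vif_from_correlations(corr):
    """Return one VIF per predictor: the diagonal of the inverse correlation matrix."""
    inverse = invert(corr)
    return [inverse[j][j] for j in range(len(corr))]


# Rounded IV-by-IV correlations copied from the question (order: IV1..IV4).
iv_names = ["IV1", "IV2", "IV3", "IV4"]
iv_corr = [
    [1.00, 0.82, 0.85, 0.60],
    [0.82, 1.00, 0.92, 0.71],
    [0.85, 0.92, 1.00, 0.69],
    [0.60, 0.71, 0.69, 1.00],
]

vifs = vif_from_correlations(iv_corr)
for name, vif in zip(iv_names, vifs):
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
```

With these rounded correlations, IV2 and IV3 should come out with the largest VIFs and IV4 with the smallest, which matches what the correlation matrix suggests by eye. Working from the raw data rather than two-decimal correlations would give slightly different numbers.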

I don't understand how a variable bounded between -3 and 3, or between 0 and 1, is continuous unless values can be fractions. Also, if your 38 data points reflect 19 patients measured before and after treatment, then what you really have is either 19 difference values (the change from point 1 to point 2) or paired values, which are not independent and require a different form of regression. If you really have only 19 cases, that of course makes the power and generalizability of your analysis (and the problems of running regression or any other method) that much worse.
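
The difference-score idea can be sketched as follows. The records below are invented (three patients, one IV value and one DV value each) purely to show the reshaping from two rows per patient to one change score per patient; the function name `change_scores` and the tuple layout are assumptions of this example, not anything from the original dataset.

```python
# Hypothetical long-format records: (patient_id, phase, IV value, DV score).
# The numbers are invented purely to show the reshaping step.
records = [
    (1, "pre", 0.42, -0.8), (1, "post", 0.55, 0.1),
    (2, "pre", 0.61, 0.3), (2, "post", 0.64, 0.9),
    (3, "pre", 0.35, -1.2), (3, "post", 0.50, -0.4),
]


def change_scores(rows):
    """Collapse pre/post rows into one (delta_iv, delta_dv) pair per patient."""
    by_patient = {}
    for patient, phase, iv, dv in rows:
        by_patient.setdefault(patient, {})[phase] = (iv, dv)
    changes = {}
    for patient, phases in sorted(by_patient.items()):
        pre_iv, pre_dv = phases["pre"]
        post_iv, post_dv = phases["post"]
        # Change is defined here as post minus pre, so positive means improvement.
        changes[patient] = (post_iv - pre_iv, post_dv - pre_dv)
    return changes


changes = change_scores(records)
for patient, (d_iv, d_dv) in changes.items():
    print(f"patient {patient}: change in IV = {d_iv:+.2f}, change in DV = {d_dv:+.2f}")
```

Applied to the real data, this would turn 38 rows into 19 independent change scores, which is the sample size any regression on the changes would actually have.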

There are many ways to combine variables. The simplest is just to add their values together and use the result as a composite variable. I have not heard that any one approach is preferred, although I do not do this much. Principal component analysis is normally used when you have a very large number of variables from which you are trying to extract factors. I don't think it makes much sense with 4 variables (and 38 cases will certainly be far too few for it).
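
As a small illustration of the additive composite, assuming the IVs share a common 0 to 1 scale as described in the question, something like the following would do. The observation values are invented, and `composite` is a hypothetical helper; averaging instead of summing only rescales the score and keeps it on the original 0 to 1 range.

```python
from statistics import mean

# Hypothetical IV values (each between 0 and 1) for three observations.
observations = [
    {"IV1": 0.40, "IV2": 0.50, "IV3": 0.45, "IV4": 0.70},
    {"IV1": 0.62, "IV2": 0.58, "IV3": 0.60, "IV4": 0.35},
    {"IV1": 0.25, "IV2": 0.30, "IV3": 0.20, "IV4": 0.55},
]


def composite(row, columns, average=True):
    """Combine the chosen columns of one observation into a single score."""
    values = [row[c] for c in columns]
    return mean(values) if average else sum(values)


# Composite from all four IVs, and a variant that leaves out IV4.
for row in observations:
    all_four = composite(row, ["IV1", "IV2", "IV3", "IV4"])
    without_iv4 = composite(row, ["IV1", "IV2", "IV3"])
    print(f"all four: {all_four:.3f}   without IV4: {without_iv4:.3f}")
```

If the variables were on different scales, they would need standardizing before being combined; that step is omitted here because of the shared scale.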

You can try linear regression, but you should do a power analysis first (G*Power is free and relatively easy to use). Even if the model runs, also consider both the power issue and the separate concern of whether 38 cases can tell you anything about an actual population.
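
For a rough feel of the power problem before opening G*Power, here is a back-of-the-envelope sketch. It uses the Fisher z approximation for a two-sided test of a single correlation, which is a simplification: it ignores the two covariates and assumes independent observations, so it is not a substitute for a proper power analysis of the planned regression. The function name `correlation_power` is made up for this example.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(r, n, alpha=0.05):
    """Approximate power of a two-sided test of H0: rho = 0 (Fisher z method)."""
    if n <= 3:
        raise ValueError("the Fisher z approximation needs n > 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    # Under the alternative, the test statistic is roughly normal with this mean.
    shift = atanh(r) * sqrt(n - 3)
    # Probability of landing in either rejection tail.
    return std_normal.cdf(shift - critical) + std_normal.cdf(-shift - critical)


# Assumed true correlations to try; 0.47 is one of the values observed in the post.
for assumed_r in (0.30, 0.47):
    for n in (19, 38):
        power = correlation_power(assumed_r, n)
        print(f"r = {assumed_r:.2f}, n = {n}: approximate power = {power:.2f}")
```

Comparing n = 19 with n = 38 shows how much is lost if the data really amount to 19 independent cases. Keep in mind that plugging in an observed correlation as the assumed effect size tends to flatter the result.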