Hello, I'm new to this forum. Hopefully I am doing this right!

I am trying to run a multiple regression on a data set with about 250 independent variables. About 200 of these are highly redundant with one another (although each still carries some individual meaning). These 200 redundant variables also account for the overwhelming majority of the correlation with the outcome variable.

I would like to produce a model that treats the 200 redundant variables as a single factor and then allows a meaningful examination of the remaining (non-redundant) variables. Running a PCA on the full set of variables and regressing on the components requires many components, most of which do not separate cleanly enough to support clear conclusions about the individual effects of the non-redundant variables.

I would like to run a PCA on the 200 redundant variables alone (to capture their combined, roughly unified effect) and then include the first principal component as a term in a regression alongside the remaining variables. I would end up with a model like this:

y = a(PC1) + b(x1) + c(x2) + ...
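
To make the idea concrete, here is a minimal sketch in Python with NumPy, run on made-up data rather than my real data set. The sizes, the effect sizes, and names such as `X_red`, `X_other`, and `first_pc_scores` are placeholders I invented for illustration. It runs a PCA on the redundant block only, keeps the scores of the first component, and then fits an ordinary least-squares regression of y on that score plus the standardized remaining variables. I am not claiming this is the correct procedure, only showing what I have in mind.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for the real data (all sizes and effects are invented).
n_obs, n_redundant, n_other = 500, 200, 50
latent = rng.normal(size=n_obs)  # shared factor behind the redundant block
loadings = rng.uniform(0.8, 1.2, size=n_redundant)
X_red = np.outer(latent, loadings) + 0.3 * rng.normal(size=(n_obs, n_redundant))
X_other = rng.normal(size=(n_obs, n_other))  # the non-redundant variables
y = 2.0 * latent + 0.5 * X_other[:, 0] - 0.3 * X_other[:, 1] + rng.normal(size=n_obs)


def first_pc_scores(X):
    """Return the PC1 scores and the fraction of total variance PC1 explains."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize each column
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s[0] ** 2 / np.sum(s ** 2)
    return U[:, 0] * s[0], explained


# PCA on the redundant block only.
pc1, pc1_share = first_pc_scores(X_red)
pc1 = pc1 / pc1.std(ddof=1)  # unit variance, so its coefficient is comparable

# Standardize the remaining predictors, then fit
# y = intercept + a*PC1 + b*x1 + c*x2 + ...
Z_other = (X_other - X_other.mean(axis=0)) / X_other.std(axis=0, ddof=1)
design = np.column_stack([np.ones(n_obs), pc1, Z_other])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

print(f"PC1 explains {pc1_share:.0%} of the redundant block's variance")
print("coefficient on PC1:", round(coef[1], 3))
print("coefficients on x1, x2:", np.round(coef[2:4], 3))
```

Because every predictor is scaled to unit variance, the fitted coefficients can be compared by magnitude. The sign of PC1 is arbitrary, so the sign of its coefficient only means something once the loadings have been checked.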

The model would not be used for prediction, but to clarify the relative effects of the independent variables. This seems logical and effective to me, but I have never encountered anyone doing such a thing (I'm more of a chemist than a statistician). Can anyone tell me whether this is a valid approach? If it is not, is there something else I should do instead?

Thank you in advance for your help!