Hello!
I'm a newcomer to statistics and data analysis, and I'm currently facing a problem that I don't know how to solve. I have a dataset with around 100 variables stored for each observation. Not all of those variables are independent; some are related to each other. I would like a method to find all linear relations between variables (nonlinear ones would also be interesting, but I doubt there's any general method for finding them). For example, I might have a set of three variables x, y, z that are related such that z = 1 - x - y. I would then want to identify those three variables and how they are related.
How can I systematically detect all such linear relations? I understand that for a linear relation between only two variables all I need is the correlation matrix: if any off-diagonal entry is +1 or -1, those two variables are linearly related and I can use linear regression to find the coefficients in the relation. But what if I do not know how many variables form a relation? It could be three, as in the example above, or maybe there's a set of eight variables that have a linear relation. I would then need something like a three- or eight-dimensional analogue of the correlation matrix.
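To make the limitation concrete, here is a minimal sketch of the pairwise check I have in mind. It is written in Python with NumPy rather than SAS, purely for illustration; the function name `exact_pairs`, the tolerance, and the synthetic variables x, y, z, w are my own assumptions and not part of my real dataset.

```python
import numpy as np

def exact_pairs(data, names, tol=1e-8):
    """Return pairs of columns whose correlation is +1 or -1 (within tol)."""
    corr = np.corrcoef(data, rowvar=False)  # columns are variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(abs(corr[i, j]) - 1.0) < tol:
                pairs.append((names[i], names[j]))
    return pairs

# Synthetic data (hypothetical): x and y are independent,
# z depends on both, w depends on x alone.
rng = np.random.default_rng(0)
x = rng.uniform(size=1000)
y = rng.uniform(size=1000)
z = 1 - x - y      # three-variable relation
w = 2 * x + 3      # two-variable relation

data = np.column_stack([x, y, z, w])
found = exact_pairs(data, ["x", "y", "z", "w"])
print(found)  # only the (x, w) pair is reported
```

The check picks up w = 2x + 3 but says nothing about z = 1 - x - y, because no single pair among x, y, z is perfectly correlated. That is exactly the gap I am asking about.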
From what I understand, I could in principle run multiple regression on all subsets of variables and detect those relations that way. However, doing this by brute force seems absolutely hopeless given that there are around 100 variables and about 800,000 observations. The number of possible subsets of a set of 100 variables is just too large.
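For a sense of scale, this is a rough count of how many subsets a brute-force search would have to consider, again as a small Python sketch. The variable names are only illustrative, and I am assuming that only subsets of at least two variables matter.

```python
from math import comb

n_vars = 100

# All subsets with at least two variables: 2^n minus the empty set
# and the n single-variable subsets.
n_subsets = 2**n_vars - n_vars - 1
print(f"{n_subsets:.3e}")   # on the order of 1e30

# Even fixing the subset size leaves a lot of regressions to run.
print(comb(n_vars, 3))      # 161700 subsets of three variables
print(comb(n_vars, 8))      # 186087894300 subsets of eight variables
```

Triples alone might still be feasible, but each regression runs over about 800,000 observations, and the count for eight variables is already out of reach.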
Is there a clever and systematic way to do this without "guessing" from the meaning of the variables which ones might be related and then checking them? (The meaning is not documented anywhere for all of them, only for those that are most often used.) I would like to go through all variables and let the program find the relations without telling it a priori which ones might be worth checking and which ones most likely are not. Preferably the method could be implemented relatively easily in SAS.
Thanks in advance,
Lamazivardi