What are some techniques that are utilized when the number of predictors is greater than the number of observations?
All-subsets regression is one approach used in this case. The technique involves fitting all possible linear models at every subset size and comparing them on some criterion. Forward stepwise regression is sometimes used for this case as well.
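To make the forward stepwise idea concrete, here is a minimal numpy sketch (the function name, toy data, and greedy RSS criterion are my own illustration, not from the thread, which used R's leaps):

```python
import numpy as np

def forward_stepwise(X, y, max_vars):
    """Greedily add, one at a time, the predictor that most reduces
    the residual sum of squares of the least-squares fit."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(min(max_vars, p)):
        best_rss, best_j = None, None
        for j in remaining:
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# toy data: y depends only on columns 0 and 3
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = 2 * X[:, 0] - 3 * X[:, 3] + 0.1 * rng.normal(size=50)
print(forward_stepwise(X, y, 2))  # should select columns 0 and 3 (order may vary)
```

Note that stepwise selection is greedy: unlike all-subsets, it can miss the best model of a given size because an early choice cannot be undone.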
1) Gather more data (probably the best way)
2) Remove or collapse variables. For example, you may decide that one variable really is not critical, that one variable will serve to measure two, and so on. One possibility is to create an index variable that adds several of your variables together. If you have Likert-scale data, a second advantage of this type of combination is that the results will likely be interval while (according to some statisticians, anyhow) Likert data normally is not.
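The index-variable idea in point 2 is just a row-wise sum; a tiny sketch (the "satisfaction" framing and the data are hypothetical):

```python
# Each row: one respondent's answers to three related 5-point Likert items
# (hypothetical satisfaction questions). Summing them collapses three
# predictors into a single index variable.
responses = [
    [4, 5, 4],
    [2, 1, 3],
    [5, 5, 5],
]
satisfaction_index = [sum(row) for row in responses]
print(satisfaction_index)  # [13, 6, 15]
```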
"Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
I tried using subset selection in R, but the number of variables I have is over 500. R just hangs when I use the leaps package.
If you have 500 variables (something that I find amazing), you might try factor analysis and use the factors rather than the variables in your model, if that makes conceptual sense.
I would also suggest all possible subsets using Mallows's Cp as a criterion. SAS handles this well.
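For reference, Mallows's Cp is easy to compute by hand once you have the residual sums of squares; a minimal numpy sketch (function name and toy data are mine):

```python
import numpy as np

def mallows_cp(X_sub, X_full, y):
    """Mallows's Cp for a subset model: SSE_p / s^2 - n + 2p, where s^2 is
    the MSE of the full model and p counts the subset model's parameters.
    A good subset has Cp close to p."""
    n = len(y)

    def sse(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ beta) ** 2)

    p_full = X_full.shape[1]
    s2 = sse(X_full) / (n - p_full)   # full-model MSE
    p = X_sub.shape[1]
    return sse(X_sub) / s2 - n + 2 * p

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])  # intercept + 4 predictors
y = X[:, 1] + 2 * X[:, 2] + rng.normal(size=n)
# The subset holding the intercept and the two true predictors (p = 3)
# should have Cp near 3; the full model's Cp is exactly p_full by construction.
print(mallows_cp(X[:, :3], X, y))
print(mallows_cp(X, X, y))
```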
Is it even possible to estimate the parameters if the number of variables is greater than the number of observations? I guess the degrees of freedom drop below 0 in this case, which means that you cannot find any unique solution (or are there infinitely many solutions? I don't remember).
As far as I know you cannot estimate a model with more parameters than observations. Unique parameters cannot be estimated with 0 or negative DF (I don't understand conceptually what a negative DF even means; sort of like something dissolving before it enters solution in a pH problem).
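To answer the "infinitely many solutions" question above: with p > n the normal-equations matrix is singular, so ordinary least squares has no unique solution, though infinitely many coefficient vectors fit the data exactly. A small numpy demonstration (toy data is mine):

```python
import numpy as np

# More predictors (p = 5) than observations (n = 3): X'X cannot be inverted.
rng = np.random.default_rng(0)
n, p = 3, 5
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

print(np.linalg.matrix_rank(X.T @ X))  # 3: the 5x5 matrix X'X is singular

# lstsq still returns an answer: among the infinitely many coefficient
# vectors with zero residual, it picks the one with minimum norm.
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ beta, y))  # True: the data are fit exactly
```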
Poster,
You have received some great suggestions here. Two questions: first, are you testing these potential independent variables because they make reasonable sense as predictors, or is this a fishing expedition? Second, can you share the context of the scenario? (This might open the door for others who perform comparable research to describe their techniques for your situation.)
This will go down in history as the 1000th post of hlsmith!
"If you torture the data long enough it will eventually confess."
- Ronald Harry Coase
Other possibilities are PCR (principal components regression, which builds on PCA) and PLS (partial least squares). If hlsmith had not cut down on post length I would have told you more about what these abbreviations mean and given some application examples. I don't want to leave a "monster" to read.
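A minimal sketch of the PCR idea in numpy, to show why it works even when p > n: regress y on the first k principal-component scores of X instead of on the raw columns (function name and toy data are mine, and this is illustrative, not a production implementation):

```python
import numpy as np

def pcr(X, y, k):
    """Principal components regression: regress y on the first k PCs of X.
    Usable even when X has more columns than rows, as long as k < n."""
    Xc = X - X.mean(axis=0)            # center predictors
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T             # n x k component scores
    gamma, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    beta = Vt[:k].T @ gamma            # map back to original-variable scale
    intercept = y.mean() - X.mean(axis=0) @ beta
    return beta, intercept

rng = np.random.default_rng(0)
n, p = 30, 100                         # p > n, as in the thread
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=n)
beta, intercept = pcr(X, y, k=10)
preds = X @ beta + intercept
print(np.sum((y - preds) ** 2))        # in-sample residual sum of squares
```

The number of components k plays the role of a tuning parameter; with a 200-observation training set it would normally be chosen by cross-validation on the training data.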
Whenever I run factanal I get an error: Error in solve.default(cv) : system is computationally singular: reciprocal condition number = 3.80806e-21. This means that I cannot use factor analysis.
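That "computationally singular" error is expected here: with more variables than observations, the sample covariance matrix that factanal tries to invert cannot be full rank (centering costs one degree of freedom, so its rank is at most n - 1). A small numpy demonstration of the same phenomenon (toy sizes are mine):

```python
import numpy as np

# n = 10 observations of p = 20 variables: the 20x20 sample covariance
# matrix has rank at most n - 1 = 9, so it is singular and not invertible.
rng = np.random.default_rng(0)
n, p = 10, 20
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)            # 20 x 20 sample covariance
print(S.shape, np.linalg.matrix_rank(S))  # (20, 20) 9
```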
Do you mean that I should run a logistic regression with Mallows Cp as the criterion?
I don't know that code, but my guess is you cannot run the EFA with the number of observations you have. You need enough data to estimate unique parameters, and I don't think you have that. Is it possible to gather more data?
I have 12000 observations and can only use 200 observations for the training set. The rest is for the test set. So I cannot gather more data. The goal is to build a predictive model that has good predictive accuracy for the test set.