I am hoping to publish the results of this in a journal (not a statistics one, an experimental science one).
I have a sample of 71 datapoints and a large number of potential IVs (actually, more than the number of datapoints, but that is another story). I want a multiple linear regression equation that I can later use to predict new data, and I am aware of the need to avoid overfitting.
I have asked three statisticians and been given three very different answers. The first said that, given the number of datapoints, I should use no more than 3 IVs. (He was quite adamant, but gave no reason.) Another said to use Mallows' Cp to determine how many. (This proved to be 7 or 8, depending on the variables used.) The third said to use stepwise regression, which automatically stops when you start to overfit. I did this, and found it ran out to about 24 IVs due to its reliance on coefficient criteria. Interestingly, R-sqr, Adj R-sqr and Pred R-sqr all keep increasing until the 11th IV, so, even for prediction, it didn't seem to overfit up to the 11th IV.
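For readers who want to check the three statistics mentioned above, here is a minimal sketch of how R-sqr, Adj R-sqr and Pred R-sqr can be computed for an ordinary least-squares fit. It assumes Pred R-sqr means the PRESS-based (leave-one-out) version; the function name `r2_summary` and the synthetic data are hypothetical stand-ins, not the dataset from this post.

```python
import numpy as np

def r2_summary(X, y):
    """Return (R2, adjusted R2, predicted R2) for an OLS fit with an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])            # design matrix
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    # Leverages h_i: diagonal of the hat matrix A (A'A)^-1 A'.
    h = np.einsum("ij,ji->i", A, np.linalg.pinv(A))
    sst = np.sum((y - y.mean()) ** 2)
    sse = np.sum(resid ** 2)
    press = np.sum((resid / (1.0 - h)) ** 2)        # leave-one-out residuals
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    pred_r2 = 1.0 - press / sst
    return r2, adj_r2, pred_r2

# Synthetic stand-in for the real data: 71 rows, 11 candidate IVs,
# only the first 3 of which actually drive y (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.standard_normal((71, 11))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + rng.standard_normal(71)

results = [r2_summary(X[:, :k], y) for k in range(1, 12)]
for k, (r2, adj, pred) in enumerate(results, start=1):
    print(f"{k:2d} IVs  R2={r2:.3f}  adj={adj:.3f}  pred={pred:.3f}")
```

Plain R-sqr can never fall when an IV is added, so only the adjusted and predicted versions carry any information about overfitting. Note also that in this sketch the leave-one-out step covers only the fit, not the choice of IVs, so Pred R-sqr can still look optimistic when the IVs were selected using the same 71 points.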
So who is correct?
Personally, I would like to use at least 6 or 7, as this boosts the Pred R-sqr to a reasonable value (about 0.775); there is a tendency in the field to reject values below 0.7.
Since writing that question, I have looked into it a bit further and found the following paper:
http://www.iasbs.ac.ir/chemistry/chemom ... alysis.pdf
This shows that, if you have a large number of candidate IVs, there is a real chance that the next IV picked for the equation is selected because of a chance correlation with the response, rather than because of any genuine contribution from that variable.
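The chance-correlation effect described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's procedure: both the response and every candidate IV are pure noise, the sample size is fixed at 71, and the function name `best_chance_r2` is hypothetical.

```python
import numpy as np

def best_chance_r2(n, m, rng):
    """Largest single-IV R2 when y and all m candidate IVs are pure noise."""
    y = rng.standard_normal(n)
    X = rng.standard_normal((n, m))
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    # Pearson correlation of each noise IV with y.
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return float(np.max(r ** 2))

rng = np.random.default_rng(1)
n = 71
# Average the best R2 over 200 simulated datasets for each pool size.
mean_best = {
    m: float(np.mean([best_chance_r2(n, m, rng) for _ in range(200)]))
    for m in (1, 10, 100)
}
for m, v in mean_best.items():
    print(f"{m:3d} noise IVs: mean best R2 = {v:.3f}")
```

The larger the pool of candidates, the better the best purely random IV looks, which is why the size of the candidate pool matters as well as the number of IVs finally kept.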
Would you consider the following a valid test?
Method: Create an equal-sized set of IVs filled with random data. Then systematically run the best-subsets method on the best model plus each random IV in turn, to see whether any random IV replaces an IV in the best solution.
I did this for my dataset above, and found that one random IV replaced the 6th IV and two others replaced the 7th IV. I therefore concluded that the model is valid up to the 5th IV. However, I am not sure that I can say the 6th IV is invalid, since other tests show that it is not overfitting.
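One possible reading of the random-IV test described above, as a minimal sketch: for a chosen model with k IVs, add one random probe to the pool and ask whether any size-k subset containing the probe fits better (lower residual sum of squares) than the original k IVs. The helper names, the synthetic data and the choice of 50 probes are all assumptions for illustration, not the actual procedure or dataset used here.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def probes_that_displace(X_best, y, probes):
    """Indices of probes that enter the best size-k subset of (k IVs + probe)."""
    k = X_best.shape[1]
    base = sse(X_best, y)
    hits = []
    for j in range(probes.shape[1]):
        pool = np.column_stack([X_best, probes[:, j]])
        # Every size-k subset containing the probe drops exactly one chosen IV.
        for drop in range(k):
            keep = [c for c in range(k + 1) if c != drop]
            if sse(pool[:, keep], y) < base:
                hits.append(j)
                break
    return hits

# Synthetic stand-in data: 5 IVs with real effects plus 1 IV with none.
rng = np.random.default_rng(2)
n = 71
X_strong = rng.standard_normal((n, 5))
x_noise = rng.standard_normal((n, 1))
y = X_strong @ np.array([3.0, -2.0, 2.5, 2.0, -3.0]) + rng.standard_normal(n)
probes = rng.standard_normal((n, 50))               # random probe IVs

hits_5 = probes_that_displace(X_strong, y, probes)
hits_6 = probes_that_displace(np.column_stack([X_strong, x_noise]), y, probes)
print("probes displacing an IV in the 5-IV model:", hits_5)
print("probes displacing an IV in the 6-IV model:", hits_6)
```

In this toy setup no probe can displace a strongly contributing IV, whereas a probe displaces the sixth, effect-free IV whenever it happens to fit the leftover noise slightly better. A displaced IV is therefore only shown to be indistinguishable from noise at this sample size, which is weaker than showing it has no effect.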