How many IVs can I use?

BrenH

New Member
#1
I am hoping to publish the results of this in a journal (not a statistics one, an experimental science one).

I have a sample of 71 datapoints and a large number of potential IVs (actually, more IVs than datapoints, but that is another story). I want a multiple linear regression equation that I can later use to predict new data. I am aware of the risk of overfitting.

I have asked three statisticians and have been given three very different answers. The first said that, given the number of datapoints, I should use no more than 3 IVs. (He was quite adamant, but gave no reason.) The second said to use Mallows' Cp to determine how many. (This proved to be 7 or 8, depending on the variables used.) The third said to use stepwise regression, which stops automatically when you start to overfit. I did this, and found that it ran out to about 24 IVs because it relies on coefficient criteria. Interestingly, R-sqr, Adj R-sqr and Pred R-sqr all keep increasing until the 11th IV, so, even for prediction, the model did not seem to overfit up to the 11th IV.
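For anyone wanting to reproduce the three statistics mentioned above, here is a minimal sketch using NumPy. It assumes ordinary least squares with an intercept and takes Pred R-sqr to be the PRESS-based (leave-one-out) version; the function name `r2_summary` and the synthetic data are illustrative assumptions, not my real dataset.

```python
import numpy as np

def r2_summary(X, y):
    """Return (R2, adjusted R2, predicted R2) for OLS of y on X plus an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    resid = y - A @ beta
    # Leverages h_ii: diagonal of the hat matrix, obtained from a thin QR of A
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q**2, axis=1)
    sst = np.sum((y - y.mean()) ** 2)
    sse = np.sum(resid**2)
    press = np.sum((resid / (1.0 - h)) ** 2)      # leave-one-out residual sum of squares
    r2 = 1.0 - sse / sst
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    pred = 1.0 - press / sst
    return r2, adj, pred

# Synthetic stand-in: 71 points, 10 candidate IVs, only the first 3 matter
rng = np.random.default_rng(0)
n = 71
X = rng.normal(size=(n, 10))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n)

for k in range(1, 11):
    r2, adj, pred = r2_summary(X[:, :k], y)
    print(f"{k:2d} IVs  R2={r2:.3f}  adj={adj:.3f}  pred={pred:.3f}")
```

R-sqr can never fall as IVs are added, so only the adjusted and predicted values carry any information about overfitting here.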

So who is correct?

Personally, I would like to use at least 6 or 7, as this boosts the Pred R-sqr to a reasonable value (to about 0.775) - there is a tendency in the field to reject values below 0.7.

Since writing that question, I have delved into it a bit further and found the following paper:

http://www.iasbs.ac.ir/chemistry/chemom ... alysis.pdf

This shows that, if you have a large number of IVs, there is a real chance that the next IV selected for the equation is picked because of its random behaviour, rather than because of any genuine contribution the variable makes to the variability in the response.

Would you consider the following a valid test?

Method: Create an equal-sized set of IVs filled with random data. Then systematically run the best subsets method on the best model plus each random IV in turn, to see whether any random IV replaces a real IV in the best solution.

I did this for my dataset above, and found that one random IV replaced the 6th IV and two others replaced the 7th IV. My conclusion was therefore that the model is valid up to the 5th IV. However, I am not sure that I can say the 6th IV is invalid, since other tests show that the model is not overfitting at that point.
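A simplified sketch of the random-IV check, in case it helps to see it concretely. This is not the full best subsets search: it only tries swapping each random IV in for one model IV at a time and records the swaps that lower the residual sum of squares. The names `sse` and `probes_that_displace`, the normal distribution used for the random IVs, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares for OLS of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def probes_that_displace(X, y, model_cols, n_probes, seed=0):
    """List (probe index, displaced column) pairs where a one-for-one swap lowers the SSE."""
    rng = np.random.default_rng(seed)
    probes = rng.normal(size=(len(y), n_probes))   # random IVs with no real signal
    base = sse(X[:, model_cols], y)
    winners = []
    for k in range(n_probes):
        for j in range(len(model_cols)):
            kept = [c for i, c in enumerate(model_cols) if i != j]
            trial = np.column_stack([X[:, kept], probes[:, k]])
            if sse(trial, y) < base:
                winners.append((k, model_cols[j]))
                break                              # record each probe at most once
    return winners

# Synthetic stand-in: columns 0-2 carry signal, column 5 is pure noise
rng = np.random.default_rng(1)
X = rng.normal(size=(71, 6))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(size=71)

winners = probes_that_displace(X, y, model_cols=[0, 1, 2, 5], n_probes=6)
print(winners)
```

In this toy setup a random IV should only ever be able to displace the noise column, which is the behaviour the test relies on: an IV that can be replaced by random data is probably not contributing anything real.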
 

Blaz

New Member
#2
This is a rather complicated issue, and one needs to be very careful when choosing an appropriate number of IVs. Numerous factors influence this choice, of which the two most important are the size of the effect you expect and the correlation among the IVs. If you expect large effects and have IVs that are relatively independent, you can use more IVs. But with 71 cases I would have to say that I agree most with the first of your statisticians: no more than 3 or 4, definitely. If your IVs are correlated, 2 or 3 would be the maximum here.

On a related note, whatever you do, please refrain from using stepwise regression in this case.

Hope it helps.
 

terzi

TS Contributor
#3
Have you checked the assumptions of your models? In my experience, with more than 6 or 7 IVs it is harder to satisfy all the assumptions, especially the absence of multicollinearity. I would agree with Blaz: you shouldn't use more than 5 IVs.
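One common way to check for multicollinearity is the variance inflation factor (VIF), which for each IV is 1 / (1 - R-sqr) from regressing that IV on all the others. A minimal NumPy sketch follows; the function name `vif` and the synthetic data are assumptions for illustration, and any cut-off you apply (people often quote 5 or 10) is a rule of thumb rather than a formal test.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X: 1 / (1 - R^2_j) = SST_j / SSE_j."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)                 # every IV except column j
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ beta
        sst = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = sst / (resid @ resid)
    return out

# Synthetic stand-in: x2 is almost a copy of x0, x1 is unrelated
rng = np.random.default_rng(2)
x0 = rng.normal(size=71)
x1 = rng.normal(size=71)
x2 = x0 + 0.1 * rng.normal(size=71)
vifs = vif(np.column_stack([x0, x1, x2]))
print(np.round(vifs, 2))
```

Here the two nearly duplicated IVs should show large VIFs while the unrelated one stays close to 1.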

You do have another option: you could reduce the dimensionality with Principal Component Analysis or Factor Analysis. That way a few components carry the information from many variables, and only those few go into the model.
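As a rough sketch of the PCA route (often called principal component regression), assuming the IVs are standardised first and the components are taken from an SVD. The names `pcr_fit` and `pcr_predict`, the choice of 2 components, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the first n_components principal components of standardised X."""
    mu, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mu) / sd
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:n_components].T
    scores = Z @ V                                   # component scores, one row per case
    A = np.column_stack([np.ones(len(y)), scores])
    gamma, *_ = np.linalg.lstsq(A, y, rcond=None)
    return mu, sd, V, gamma

def pcr_predict(model, X_new):
    """Predict for new rows, reusing the training means, scales and directions."""
    mu, sd, V, gamma = model
    scores = ((X_new - mu) / sd) @ V
    return gamma[0] + scores @ gamma[1:]

# Synthetic stand-in: 20 correlated IVs driven by 2 hidden factors
rng = np.random.default_rng(3)
F = rng.normal(size=(71, 2))
X = F @ rng.normal(size=(2, 20)) + 0.3 * rng.normal(size=(71, 20))
y = F @ np.array([2.0, -1.0]) + 0.5 * rng.normal(size=71)

model = pcr_fit(X, y, n_components=2)
y_hat = pcr_predict(model, X)
print(np.corrcoef(y, y_hat)[0, 1])
```

Keep in mind that PCA picks components without looking at the response, so the leading components are not guaranteed to be the most predictive ones, and the resulting equation is harder to interpret in terms of the original IVs.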