1. ## optimal sample size

I'm performing an OLS regression analysis, and I have the luxury that my sample size is very large (5,000 observations) compared to the number of independent variables (around 10-15). So I thought it might be better to take a random sample with a smaller number of observations. Are there any theories regarding the optimal sample size? My guess was to take a sample of 500-1,000, but that's not really based on solid theory or experience.

What could make it a bit more complicated:
- for 3 independent variables, 30-60% of the data is missing
- the variables (dependent and independent) are skewed
- I expect some multicollinearity

2. ## Re: optimal sample size

> my sample size is very high (5,000 observations) compared to the number of independent variables (around 10-15). So I thought, that it's better to take a random sample with a smaller number of observations.

Why do you think that a smaller sample is better?

> What could make it a bit more complicated:
> - for 3 independent variables 30-60% of the data is missing

It looks like you had better leave them out of the analysis.

> - the variables (dependent and independent) are skewed.

That is not necessarily a problem.

> - I expect some multicollinearity

So you leave some correlated independent variables out, and/or you rely on a large sample size to reduce the inflated standard errors associated with multicollinearity.
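To illustrate the point about standard errors, here is a quick numpy sketch with simulated data (not the actual survey data, of course): two predictors that are almost perfectly correlated still yield usable standard errors once n is large, because the SE of a slope shrinks roughly with the square root of n even under collinearity.

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_se(n, noise_sd):
    """SE of the slope on x1 when x2 is x1 plus a little noise (near-collinear)."""
    x1 = rng.normal(size=n)
    x2 = x1 + noise_sd * rng.normal(size=n)   # correlation with x1 ~ 0.995
    X = np.column_stack([np.ones(n), x1, x2])
    y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / (n - X.shape[1])
    return np.sqrt(sigma2 * np.diag(XtX_inv))[1]

se_small = slope_se(500, 0.1)    # n = 500, heavy collinearity
se_large = slope_se(5000, 0.1)   # same collinearity, ten times the data

print(se_small, se_large)  # the SE at n = 5000 is roughly 3x smaller
```

The collinearity inflates both SEs by the same factor, so the only lever you really have (short of dropping a variable) is n.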

With kind regards

K.

3. ## The Following User Says Thank You to Karabiner For This Useful Post:

surveyman (05-01-2015)

4. ## Re: optimal sample size

Why would you want to randomly sample from a sample? For a start, that would reduce your power. More importantly (assuming your original sample was a random sample), reducing the n should reduce your chances of generalizing to the larger population without error. I can't imagine that ever being a good thing.

If you're concerned your p-values are too low, just note this in your analysis (or concentrate on the effect size rather than the test of statistical significance, which is better anyhow).

I don't think there is a theory of what an "optimal" sample size is; that is a subjective rather than a theoretical judgement. As sample size goes up (assuming no error in your sampling), power goes up and the possibility of generalizing gets better. The "theories" in this area amount to trading off these gains (which tend to be small above 1,000 cases) against the cost of getting the information. But if you already have the information, I can't imagine getting rid of it for any reason (other than errors in the data).
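A small numpy simulation (made-up data, standing in for the 5,000-case survey) makes the cost of subsampling concrete: fitting OLS on a random 500-row subsample inflates every coefficient standard error by roughly sqrt(5000/500) ≈ 3x relative to the full sample.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a dataset roughly like the one described: n = 5,000, 10 predictors.
n, p = 5000, 10
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

def ols_se(X, y):
    """Return OLS coefficient standard errors for design matrix X."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return np.sqrt(sigma2 * np.diag(XtX_inv))

se_full = ols_se(X, y)

# The random subsample of 500 proposed in the question.
idx = rng.choice(n, size=500, replace=False)
se_sub = ols_se(X[idx], y[idx])

print(se_sub.mean() / se_full.mean())  # roughly sqrt(10) ≈ 3
```

In other words, throwing away 90% of the rows buys you nothing and costs you precision on every coefficient.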

Larger samples are usually the "best" solution for multicollinearity, another reason not to get rid of data. There is disagreement over whether you should leave out correlated variables or not. Remember that if the variables you drop improve the overall model fit, you are giving up that capacity to predict. You need to decide which is more important: improving your prediction or evaluating individual variables. Multicollinearity, which has to be very severe to matter, won't reduce the predictive power of the model as a whole (although variables that don't actually predict the DV can potentially cause problems and should not stay in the model).

If you have a lot of missing data you can try approaches such as multiple imputation, but you would need to learn the method and have the software. My guess is that if you are missing huge amounts of data, you should first ask why that occurred.
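To see why 30-60% missingness matters, here is a numpy-only sketch of what listwise (complete-case) deletion does, with simulated data and a hypothetical 40% missingness rate per variable: every additional incomplete variable compounds the loss of rows.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 5000
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)

# Knock out 40% of each variable at random, mimicking the missingness described.
x2_obs = x2.copy()
x2_obs[rng.random(n) < 0.4] = np.nan
x3_obs = x3.copy()
x3_obs[rng.random(n) < 0.4] = np.nan

# Listwise (complete-case) deletion: any row with a NaN anywhere is lost.
complete_one = ~np.isnan(x2_obs)
complete_both = complete_one & ~np.isnan(x3_obs)

print(complete_one.sum())   # roughly 60% of the 5,000 rows survive one variable
print(complete_both.sum())  # only ~36% survive two (0.6 * 0.6 compounds)
```

Multiple imputation (e.g. statsmodels' MICE or scikit-learn's IterativeImputer) avoids this compounding loss, which is why it is worth the learning curve when several variables are incomplete.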

5. ## The Following User Says Thank You to noetsi For This Useful Post:

surveyman (05-01-2015)

6. ## Re: optimal sample size

My initial fear was that a large sample size would lead non-significant variables to show up as significant. I was also unsure whether it would affect the R². But you are right: I should rather look at the coefficients and ignore variables with small beta coefficients. Thanks.

I don't want to exclude the variables with a high percentage of missing values. Those questions/variables weren't relevant for those people, so they weren't asked the question (I use survey data for the regression analysis).

Regarding the multicollinearity: I think I have to read a bit more about that. I don't really understand why a larger sample size helps against multicollinearity. In this case it's a bit more important to estimate the effects of the individual variables than to predict with the whole model.

7. ## Re: optimal sample size

> My initial fear was that a high sample size would lead non-significant variables to be shown as significant.

This is an issue I grapple with a lot myself, and (when you really think about it) it is really silly. High power can never make a variable "significant." All it can do is increase the chance that you will detect an effect that actually exists. This goes to the belief that tests of statistical significance show substantive significance, which dominates the academic and practitioner world and is fundamentally wrong (as CWB reminded me again last week). A low p-value does not make a variable's impact on the DV meaningful, nor does a high p-value show it has no impact, although it is commonly interpreted that way.
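A quick simulated example of that distinction: with a very large n, a slope that explains a trivial share of the variance is still statistically significant. The numbers here (beta = 0.02, n = 200,000) are invented purely to make the point.

```python
import numpy as np

rng = np.random.default_rng(7)

# A truly tiny effect: beta = 0.02 against noise with sd 1.
n = 200_000
x = rng.normal(size=n)
y = 0.02 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - 2)
se = np.sqrt(sigma2 * XtX_inv[1, 1])

t = beta_hat[1] / se
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

print(t, r2)  # |t| far above 1.96, yet R^2 is under 0.1% of variance
```

The test says "this effect is almost certainly not zero"; it says nothing about whether an effect that small is worth caring about.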

If you didn't ask people a specific question, then the data is not missing. Missing data, a major issue, concerns only responses that were actually possible. You should be careful, especially in a publication, about saying data is missing when it was not asked of specific respondents; you will get the same response you got from me.

I suggest reading John Fox's "Regression Diagnostics" (Sage). He has a solid section on MC. He is highly critical of those who see MC as a major problem, and of many of the proposed solutions. If it is critical to estimate individual variables, then MC can be a major issue, although it rarely is. Run a VIF or tolerance test and see what you get. Remember that MC will not affect the slopes at all. It only affects the standard errors, making the significance tests less likely to reject the null than they should be (with the huge data set you have, that is less likely to be an issue).
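A VIF is easy to compute by hand if you want to see what it is doing (statsmodels also ships `variance_inflation_factor` in `statsmodels.stats.outliers_influence`, if I recall correctly). The sketch below uses simulated data: one pair of near-collinear predictors and one independent one.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 5000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # independent of the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2) from regressing column j on the other columns."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1 - (resid @ resid) / (len(X) * np.var(X[:, j]))
    return 1.0 / (1.0 - r2)

print(vif(X, 0), vif(X, 2))  # the collinear column is far above the usual
                             # rule-of-thumb cutoff of 10; x3 sits near 1
```

Tolerance is just the reciprocal, 1 - R², so the two tests carry the same information.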

8. ## The Following User Says Thank You to noetsi For This Useful Post:

surveyman (05-02-2015)

9. ## Re: optimal sample size

Thanks! You are right about the missing values. Then the only remaining issue is probably that I have a smaller N for those independent variables. But if I understood it right, that doesn't affect the slope, only the confidence interval and the statistical test of that particular variable (which I'm less interested in here).

I thought that multicollinearity would affect the slopes, but if only the standard errors are influenced, that indeed doesn't matter much in this case. I will see if I can get the John Fox text to read more about MC.

10. ## Re: optimal sample size

It's not a paper. It's a Sage monograph (a short green book).

