
Thread: optimal sample size

  1. #1

    optimal sample size




    I'm performing a regression analysis (OLS) and I have the luxury that my sample size is very large (5,000 observations) compared to the number of independent variables (around 10-15). So I thought it might be better to take a random sample with a smaller number of observations. Are there any theories regarding the optimal sample size? My guess was to take a sample of 500-1000, but that's not really based on solid theory or experience.

    What could make it a bit more complicated:
    - for 3 independent variables 30-60% of the data is missing
    - the variables (dependent and independent) are skewed.
    - I expect some multicollinearity

  2. #2
    Karabiner

    Re: optimal sample size

    "my sample size is very high (5,000 observations) compared to the number of independent variables (around 10-15). So I thought, that it's better to take a random sample with a smaller number of observations."
    Why do you think that a smaller sample is better?
    "What could make it a bit more complicated:
    - for 3 independent variables 30-60% of the data is missing"
    It looks like you had better leave them out of the analysis.
    "- the variables (dependent and independent) are skewed."
    That is not necessarily a problem.
    "- I expect some multicollinearity"
    So you leave some correlated independent variables out,
    and/or you provide a large sample size in order to reduce
    the inflated standard errors associated with multicollinearity.

    With kind regards

    K.

  3. The Following User Says Thank You to Karabiner For This Useful Post:

    surveyman (05-01-2015)

  4. #3
    noetsi

    Re: optimal sample size

    Why would you want to randomly sample from a sample? To start with, I would think that would reduce your power. More importantly (assuming your original sample was a random sample), reducing the n should reduce your ability to generalize to the larger population without error. I can't imagine that would ever be a good thing.

    If you're concerned your p values are too low, just note this in your analysis (or concentrate on the effect size rather than the test of statistical significance, which is better anyhow).

    I don't think there is a theory of what an "optimal" sample size is; that is a subjective, not a theoretical, judgement. As sample size goes up (assuming no error in your sampling), power goes up and the ability to generalize improves. The "theories" here are really comments about trading off these gains (which tend to be small over 1000 cases) against the cost of getting the information. But if you already have the information, I can't imagine getting rid of it for any reason (other than errors in the data).
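
To make the trade-off concrete, here is a minimal sketch, assuming a simple regression with one predictor and a known residual standard deviation. The function name and the example values (sigma = 1, sd_x = 1) are illustrative and not taken from the thread. It relies on the textbook result that the standard error of the slope is sigma / (sd_x * sqrt(n - 1)).

```python
import math


def slope_standard_error(sigma, sd_x, n):
    """Standard error of an OLS slope in simple regression.

    sigma: residual standard deviation (assumed known here)
    sd_x:  sample standard deviation of the predictor (n - 1 denominator)
    n:     number of observations
    """
    return sigma / (sd_x * math.sqrt(n - 1))


# Illustrative values only: unit residual SD and unit predictor SD.
for n in (100, 500, 1000, 5000):
    print(n, round(slope_standard_error(1.0, 1.0, n), 4))
```

Under these assumptions, going from 1,000 to 5,000 cases shrinks the standard error by a factor of about 2.2 (roughly the square root of 5). The gain is real, but each additional case buys less than the one before.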

    Larger samples are usually the "best" solution for multicollinearity, which is another reason not to throw data away. There is disagreement about whether you should leave out correlated variables. Remember that if the variables you drop improve the overall model fit, you are giving up that predictive capacity in order to better evaluate individual variables. You need to decide which matters more: improving your prediction or evaluating individual variables. Multicollinearity, which has to be very severe to matter, won't reduce the predictive ability of the model (although variables that don't actually predict the DV can cause problems and should not stay in the model).

    If you have a lot of missing data you can try approaches such as multiple imputation, but you would need to learn it and have the software. My guess is that if you are missing huge amounts of data, you should ask why this occurred.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  5. The Following User Says Thank You to noetsi For This Useful Post:

    surveyman (05-01-2015)

  6. #4

    Re: optimal sample size

    My initial fear was that a large sample size would lead non-significant variables to be shown as significant. I was also unsure whether it would affect the R². But you are right, I should rather look at the coefficients and ignore variables with small beta coefficients. Thanks.

    I don't want to exclude the variables with a high percentage of missing values. Those questions/variables weren't relevant for those people, so they didn't get the question (I use survey data for the regression analysis).

    Regarding the multicollinearity: I think I have to read a bit more about that. I don't really understand why a larger sample size helps with multicollinearity. In this case it's a bit more important to estimate the effects of the individual variables than to predict with the whole model.
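
One way to see why sample size helps is the standard expression for the sampling variance of an OLS slope. The notation is an assumption of this note rather than something used in the thread: s_j^2 is the sample variance of predictor x_j, R_j^2 is the R-squared from regressing x_j on the other predictors, and sigma^2 is the error variance.

```latex
\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(n-1)\,s_j^2} \cdot \frac{1}{1 - R_j^2}
```

The second factor is the variance inflation factor (VIF). It depends on how strongly the predictors are correlated, not on n. A larger sample does not remove the collinearity; it shrinks the first factor and so offsets the inflation.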

  7. #5
    noetsi

    Re: optimal sample size

    "My initial fear was that a high sample size would lead non-significant variables to be shown as significant."
    This is an issue I grapple with a lot myself, and it is (when you really think about it) really silly. High power can never make a variable "significant." All it can do is increase the chance that you will detect an effect that actually exists. This goes to the belief that tests of statistical significance actually show substantive significance, which dominates the academic and practitioner world and is fundamentally wrong (as CWB reminded me again last week). A low p value does not make a variable's impact on the DV meaningful, nor does a high p value show it has no impact, although it is commonly interpreted that way.
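
A minimal sketch of this point, assuming a single predictor whose correlation with the DV is a tiny r = 0.04 (an illustrative value, not from the thread) and using a normal approximation to the t distribution for the p value. The function names are hypothetical.

```python
import math


def t_statistic(r, n):
    """t statistic for testing a correlation r (or a simple-regression slope) with n cases."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)


def two_sided_p(t):
    """Two-sided p value from the normal approximation, adequate for large n."""
    return math.erfc(abs(t) / math.sqrt(2.0))


r = 0.04  # explains r**2 = 0.16% of the variance, whatever the sample size
for n in (500, 5000):
    t = t_statistic(r, n)
    print(f"n={n}: t={t:.2f}, p={two_sided_p(t):.4f}, r^2={r ** 2:.4f}")
```

The effect size is identical in both rows; only the p value changes (it falls below 0.05 at n = 5000 but not at n = 500). The larger sample detects the small effect. It does not make the effect bigger or more meaningful.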

    If you didn't ask people a specific question, then it is not missing. Missing data, a major issue, applies only to responses that could have been given. You should be careful, especially in a publication, about saying data is missing when the question was not asked of specific respondents - you will get the same response you got from me.

    I suggest reading John Fox's "Regression Diagnostics", published by Sage. He has a solid section on MC. He is highly critical of those who see MC as a major problem and of many of the solutions. If it is critical to evaluate the effects of individual variables, then MC can be a major issue, although it rarely is. Run a VIF or tolerance test and see what you get. Remember that MC does not bias the slopes. It only inflates the standard errors, making the significance tests less likely to reject the null than they otherwise would be (with the huge data set you have, that is less likely to be an issue).
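
As a rough illustration of the VIF and tolerance check, here is a sketch for the simplest case of two predictors, where the R-squared of one predictor on the other is just their squared correlation. The data are simulated and the function names are hypothetical. With more than two predictors, R_j^2 would instead come from regressing each predictor on all the others.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """Return (VIF, tolerance) for a model with exactly two predictors."""
    tolerance = 1.0 - pearson_r(x1, x2) ** 2
    return 1.0 / tolerance, tolerance


# Simulated predictors with a population correlation of 0.9 (an assumption).
rng = random.Random(0)
x1 = [rng.gauss(0, 1) for _ in range(5000)]
x2 = [0.9 * a + math.sqrt(1 - 0.9 ** 2) * rng.gauss(0, 1) for a in x1]

vif, tol = vif_two_predictors(x1, x2)
print(f"VIF={vif:.2f}, tolerance={tol:.3f}, SE inflation={math.sqrt(vif):.2f}x")
```

The square root of the VIF is the factor by which the standard error is inflated relative to uncorrelated predictors. For real models with many predictors, statsmodels provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.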

  8. The Following User Says Thank You to noetsi For This Useful Post:

    surveyman (05-02-2015)

  9. #6

    Re: optimal sample size

    Thanks! You are right about the missings. Then probably the only consequence is that I have a smaller N for those independent variables. But if I understood it right, that doesn't affect the slope, only the confidence interval and statistical test of that particular variable (in which I'm less interested in this case).

    I thought that multicollinearity would affect the slopes, but if only the standard errors are influenced, then indeed it doesn't really matter much in this case. I will see if I can get the John Fox paper to read more about MC.

  10. #7
    noetsi

    Re: optimal sample size


    It's not a paper. It's a Sage monograph (a short green book).
