Thread: Minimum number of cases for the reference level

  1. #1
    Fortran must die
    Points: 58,790, Level: 100
    Level completed: 0%, Points required for next Level: 0
    noetsi's Avatar
    Posts
    6,532
    Thanks
    692
    Thanked 915 Times in 874 Posts

    Minimum number of cases for the reference level




    Before I ask this: I have the true population, so I am not sure this issue even applies.

    Is there a minimum number of cases the reference level must have for the regression analysis to be valid? I have a population of about 11,000 cases. For substantive reasons, my reference level is one that turns out to have only about a hundred cases. I prefer, for analytical reasons, to keep it as the reference level, but having so few cases in it makes me nervous. There are about 19 variables in the model.

    I have not seen anything for linear regression that says how many cases you need, or whether the ratio of cases in the reference level to cases in one of the related dummies matters.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  2. #2
    Cookie Scientist
    Points: 13,431, Level: 75
    Level completed: 46%, Points required for next Level: 219
    Jake's Avatar
    Location
    Austin, TX
    Posts
    1,293
    Thanks
    66
    Thanked 584 Times in 438 Posts

    Re: Minimum number of cases for the reference level

    No. I mean, you need at least 1 in the reference group (and more than 1 in at least one of the other groups). Any more than that is bonus.
    “In God we trust. All others must bring data.”
    ~W. Edwards Deming

  3. The Following User Says Thank You to Jake For This Useful Post:

    noetsi (04-25-2016)

  4. #3
    Omega Contributor
    Points: 38,303, Level: 100
    Level completed: 0%, Points required for next Level: 0
    hlsmith's Avatar
    Location
    Not Ames, IA
    Posts
    6,993
    Thanks
    397
    Thanked 1,185 Times in 1,146 Posts

    Re: Minimum number of cases for the reference level

    Side note: many things run better with balanced designs, but balance is not mandatory. I believe the SE may get a little larger in some cases (say, for odds ratios; in case-control studies there are optimal tradeoffs for power), and some models take longer to converge (e.g., mixed models). This may be one of those situations where the effect estimates are stable enough and it is the confidence in them that gets compromised.


    If you have reservations, just keep that in mind during your interpretation. If you are only interested in that particular variable, you could always balance the covariates (e.g., the other variables in your model) via matching or propensity scores.
    Stop cowardice, ban guns!

  5. The Following User Says Thank You to hlsmith For This Useful Post:

    noetsi (04-25-2016)

  6. #4
    Cookie Scientist
    Points: 13,431, Level: 75
    Level completed: 46%, Points required for next Level: 219
    Jake's Avatar
    Location
    Austin, TX
    Posts
    1,293
    Thanks
    66
    Thanked 584 Times in 438 Posts

    Re: Minimum number of cases for the reference level

    For a fixed total sample size, parameter estimates are most precise when the data are balanced across all categorical factors. So, for example, it is more efficient to have n=10 in each of two groups than to have n=5 in one group and n=15 in the other. But if the comparison is between, say, having n=10 in both groups vs. having n=10 in one group and n=50 in the other group (in other words, if we are not talking about a fixed total sample size), then the latter will be more efficient, all other things equal, owing to the larger sample size. But the basic point is that there is nothing inherently bad about having data that are highly unbalanced across the categorical predictors.
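    The comparison above can be checked with a small sketch. It assumes the simplest case: two independent groups with a common residual SD, so the standard error of the difference in means is sigma * sqrt(1/n1 + 1/n2). The function name `se_diff` and the default `sigma = 1.0` are illustrative choices, not anything from the thread.

```python
import math

def se_diff(n1: int, n2: int, sigma: float = 1.0) -> float:
    """Standard error of the difference between two group means,
    assuming independent groups with a common residual SD `sigma`."""
    return sigma * math.sqrt(1.0 / n1 + 1.0 / n2)

# Group sizes taken from the example in the post above.
designs = {
    "balanced, N=20 (10 vs 10)": (10, 10),
    "unbalanced, N=20 (5 vs 15)": (5, 15),
    "unbalanced, N=60 (10 vs 50)": (10, 50),
}

for label, (n1, n2) in designs.items():
    # A smaller SE means a more precise estimate of the group difference.
    print(f"{label}: SE = {se_diff(n1, n2):.3f}")
```

    With the total held at 20, the balanced split gives the smaller SE; once the total is allowed to grow to 60, the unbalanced design gives a smaller SE than the balanced 10 vs 10 one.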
    “In God we trust. All others must bring data.”
    ~W. Edwards Deming

  7. The Following User Says Thank You to Jake For This Useful Post:

    noetsi (04-26-2016)

  8. #5
    Fortran must die
    Points: 58,790, Level: 100
    Level completed: 0%, Points required for next Level: 0
    noetsi's Avatar
    Posts
    6,532
    Thanks
    692
    Thanked 915 Times in 874 Posts

    Re: Minimum number of cases for the reference level

    That is good to know, because the non-experimental designs I work with have huge variances between groups. On the other hand, I commonly have at least hundreds, if not thousands, of cases. An issue I had not thought about concerns power. I always thought of power as depending on the total number of cases in the design (in this example, over 11,000). But is power instead tied to the number of cases in one of the subgroups (here, the reference level of one dummy variable)? I assume this would only affect the power for that dummy variable, not the overall model or the other variables in the model.

    But obviously I am not certain.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  9. #6
    Cookie Scientist
    Points: 13,431, Level: 75
    Level completed: 46%, Points required for next Level: 219
    Jake's Avatar
    Location
    Austin, TX
    Posts
    1,293
    Thanks
    66
    Thanked 584 Times in 438 Posts

    Re: Minimum number of cases for the reference level

    Power is a joint function of the total sample size and the degree of balance across the predictor categories (because the latter affects the degree of multicollinearity -- the more unbalanced the groups, the more collinear are the predictors).
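    A rough sketch of that dependence, under simplifying assumptions: a one-way layout with no other covariates, a known residual SD, and a normal approximation to the test of one dummy coefficient (a level versus the reference level). The function name `power_dummy`, the effect size of 0.3 SD, and the group sizes are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

def power_dummy(n_ref: int, n_grp: int, delta: float,
                sigma: float = 1.0, alpha: float = 0.05) -> float:
    """Approximate two-sided power for testing one dummy coefficient
    (group mean minus reference mean), using a normal approximation."""
    std = NormalDist()
    # SE of the dummy coefficient in a one-way layout with common SD sigma.
    se = sigma * (1.0 / n_ref + 1.0 / n_grp) ** 0.5
    z_crit = std.inv_cdf(1.0 - alpha / 2.0)
    shift = delta / se
    # Probability that the test statistic falls in either rejection region.
    return std.cdf(shift - z_crit) + std.cdf(-shift - z_crit)

# Same combined size for the two groups (1,100), different balance.
unbalanced = power_dummy(n_ref=100, n_grp=1000, delta=0.3)
balanced = power_dummy(n_ref=550, n_grp=550, delta=0.3)
print(f"reference n=100 vs group n=1000: power = {unbalanced:.3f}")
print(f"reference n=550 vs group n=550:  power = {balanced:.3f}")
```

    In this simplified setting the small reference group limits the power of every comparison made against it, while a contrast between two non-reference levels depends on the sizes of those two levels instead.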
    “In God we trust. All others must bring data.”
    ~W. Edwards Deming

  10. The Following User Says Thank You to Jake For This Useful Post:

    noetsi (04-26-2016)

  11. #7
    Omega Contributor
    Points: 38,303, Level: 100
    Level completed: 0%, Points required for next Level: 0
    hlsmith's Avatar
    Location
    Not Ames, IA
    Posts
    6,993
    Thanks
    397
    Thanked 1,185 Times in 1,146 Posts

    Re: Minimum number of cases for the reference level

    Originally posted by Jake:
    Power is a joint function of the total sample size and the degree of balance across the predictor categories (because the latter affects the degree of multicollinearity -- the more unbalanced the groups, the more collinear are the predictors).

    Can you provide a hypothetical example of this? In my mind it seems like sparsity of data in subgroups. With continuous variables you have variables held at their mean, but in categorical scenarios you have multiple variables set at their reference group, plus the potential sucking up of degrees of freedom.


    Though your description almost seems more like confounding. I guess it could be like collinearity if some subgroups have so many dimensions that no one in the group has the outcome, which would be a breach of positivity, so that multiple variables seem linked to the outcome by virtue of being in the same subgrouping.
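    One hypothetical example of the collinearity in the quoted post, sketched under stated assumptions: a single three-level factor coded as two dummies, with the remaining cases split evenly between the two non-reference levels. The thread does not say how many levels the real factor has, so the three-level setup, the even split, and the names `dummy_corr` and `vif` are illustrative only. For two dummies of the same factor with level proportions p1 and p2, the correlation is -sqrt(p1*p2 / ((1-p1)*(1-p2))).

```python
import math

def dummy_corr(p1: float, p2: float) -> float:
    """Correlation between two dummy variables of the same factor,
    where p1 and p2 are the proportions of cases in the two coded levels."""
    return -math.sqrt((p1 * p2) / ((1.0 - p1) * (1.0 - p2)))

def vif(r: float) -> float:
    """Variance inflation factor when the model holds only two predictors
    whose correlation is r (an assumption of this sketch)."""
    return 1.0 / (1.0 - r * r)

# Balanced: each of the three levels holds one third of the cases.
r_bal = dummy_corr(1 / 3, 1 / 3)

# Unbalanced: reference level has 100 of 11,000 cases (as in the thread);
# the other 10,900 are assumed to be split evenly across two levels.
p = 5450 / 11000
r_unbal = dummy_corr(p, p)

print(f"balanced:   r = {r_bal:.3f}, VIF = {vif(r_bal):.2f}")
print(f"unbalanced: r = {r_unbal:.3f}, VIF = {vif(r_unbal):.2f}")
```

    As the reference level shrinks, knowing that a case is not in one coded level nearly determines that it is in the other, so the two dummy columns approach perfect negative correlation.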
    Stop cowardice, ban guns!

  12. #8
    Fortran must die
    Points: 58,790, Level: 100
    Level completed: 0%, Points required for next Level: 0
    noetsi's Avatar
    Posts
    6,532
    Thanks
    692
    Thanked 915 Times in 874 Posts

    Re: Minimum number of cases for the reference level


    I know that the number of subgroups is important to power calculations and that you need a certain sample size per group (I know this because I have seen the calculations, not because I understand why it is true). But I don't really understand why a balanced or unbalanced design would matter for power.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995
