
Thread: Balancing data

  1. #1
    TS Contributor

    Location
    Sweden
    Posts
    524
    Thanks
    44
    Thanked 112 Times in 100 Posts

    Balancing data




    Some classification models (let's stick to two groups for simplicity) perform better if we balance the data when there are few successes or fails. For which models is this true, and at what ratio of successes to fails? Does anyone know of a good paper that answers this question?

  2. #2
    Fortran must die
    noetsi
    Posts
    6,532
    Thanks
    692
    Thanked 915 Times in 874 Posts

    Re: Balancing data

    If you are running regression (or, I assume, ANOVA) and you have few cases at one level of a dummy variable, you will end up with an attenuated slope. Tabachnick and Fidell cover this, but I am not sure this is what you are asking, or what you mean by balancing the data.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  3. #3
    TS Contributor

    Re: Balancing data

    noetsi, it is in the DV that the imbalance is a problem. And as I've come to understand it, the imbalance is not a problem per se; it becomes a problem when there are few successes/fails in absolute terms.

    For example: a logit model does not, in general, perform badly when we have 30,000 observations and 300 successes. But the model may not perform well if we have 2,000 observations and only 20 successes, even though the ratio of successes to fails is the same (1% in both cases). One way to "fix" this problem is to keep all 20 successes, randomly select 20 fails, and run the model on this data.
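
    A minimal sketch of the subsampling step described above, using only Python's standard library. The function name `undersample_majority`, the 0/1 coding of the DV, and the seeds are assumptions made for illustration; nothing here comes from a particular package.

```python
import random

def undersample_majority(y, seed=0):
    """Return sorted row indices that keep every case of the rarer class
    and an equally sized random sample (without replacement) of the
    more common class. Assumes y is coded 0/1."""
    rng = random.Random(seed)
    ones = [i for i, v in enumerate(y) if v == 1]
    zeros = [i for i, v in enumerate(y) if v == 0]
    # Whichever class has fewer cases is kept in full.
    minority, majority = (ones, zeros) if len(ones) <= len(zeros) else (zeros, ones)
    sampled = rng.sample(majority, len(minority))
    return sorted(minority + sampled)

# Hypothetical data matching the example: 2,000 observations, 20 successes.
y = [0] * 2000
for i in random.Random(1).sample(range(2000), 20):
    y[i] = 1

keep = undersample_majority(y)
y_balanced = [y[i] for i in keep]
print(len(y_balanced), sum(y_balanced))  # 40 20
```

    The same indices in `keep` would be used to select the matching rows of the predictors before fitting. Note that the balanced subset has a 50% success rate by construction, so predicted probabilities from a model fitted on it are not on the scale of the original 1% rate.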

  4. #4
    Fortran must die
    noetsi

    Re: Balancing data


    According to econometricians (like Gugardi, whose name I can't spell), the number of cases you need for logistic regression is tied to the level of the DV that is least common. So if one percent of your total cases are 0 and you feel you need 500 cases, then you need 500 cases coded 0 (or 50K cases overall). Admittedly, not all agree that logistic regression requires a minimum sample size, although it is a common assumption in texts. And of course this is an extreme example.
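
    The arithmetic behind that rule of thumb can be sketched as follows. The function name `total_cases_needed` is hypothetical, and the rule itself (a fixed minimum count in the rarer level) is taken as given from the paragraph above rather than endorsed.

```python
import math
from fractions import Fraction

def total_cases_needed(min_rare_cases, rare_share):
    """Total sample size required so that the rarer DV level is expected
    to contain at least `min_rare_cases` cases, given that it makes up
    `rare_share` (a proportion in (0, 1]) of all cases."""
    # Fraction(str(...)) avoids binary floating-point rounding of values like 0.01.
    share = Fraction(str(rare_share))
    if not 0 < share <= 1:
        raise ValueError("rare_share must be in (0, 1]")
    return math.ceil(min_rare_cases / share)

print(total_cases_needed(500, 0.01))  # 50000
print(total_cases_needed(20, 0.01))   # 2000
```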

    According to Tabachnick and Fidell, a low percentage of cases at one level of a dummy variable is associated with reduced variability, which can have a significant impact on the estimated parameters. Whether that applies to a DV with two levels (which they do not address in the book I read) I don't know, but it seems that it would.

    In all honesty, Englund, I have not seen the point you just made addressed. That is, no one makes that distinction in terms of the relative number of cases.
