1. ## Balancing data

Some classification models (let's stick to two classes for simplicity) perform better if we balance the data when there are few successes or fails. For which models is this true, and at what ratio of successes to fails? Does anyone know of a good paper that answers this question?

2. ## Re: Balancing data

If you are running regression, or I assume ANOVA, and you have few cases at one level of a dummy variable, you will end up with an attenuated slope. Tabachnick and Fidell cover this, but I am not sure this is what you are asking, or what you mean by balancing the data.

3. ## Re: Balancing data

noetsi, it is in the DV that the imbalance is a problem. And as I've come to understand it, the imbalance is not a problem per se; it becomes a problem when there are few successes/fails in absolute terms.

For example: a logit model does not, in general, perform badly when we have 30000 observations and 300 successes. But the model may not perform well if we have 2000 observations and only 20 successes, despite the same ratio of successes to fails. One way to "fix" this problem is to keep all 20 successes, randomly select 20 fails, and run the model on this data.
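To make the "fix" concrete, here is a minimal sketch of that random undersampling step, using only the Python standard library. The function name `undersample_majority`, the seed values, and the simulated data are my own assumptions for illustration; only the 2000/20 split comes from the example above.

```python
import random

def undersample_majority(rows, labels, seed=0):
    """Keep every minority-class row plus an equally sized random
    sample (without replacement) of the majority class."""
    rng = random.Random(seed)
    pos = [i for i, lab in enumerate(labels) if lab == 1]
    neg = [i for i, lab in enumerate(labels) if lab == 0]
    # Whichever class has fewer rows is treated as the minority.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    sampled = rng.sample(majority, len(minority))
    keep = sorted(minority + sampled)  # preserve the original row order
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Hypothetical data: 2000 observations, 20 successes, 3 made-up predictors.
rng = random.Random(42)
labels = [0] * 2000
for i in rng.sample(range(2000), 20):
    labels[i] = 1
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(2000)]

rows_bal, labels_bal = undersample_majority(rows, labels)
print(len(rows_bal), sum(labels_bal))  # 40 rows, 20 of them successes
```

The balanced sample is what would then be passed to the logit fit; which fails end up in it depends on the seed.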

4. ## Re: Balancing data

According to econometricians (like Gugardi, which I can't spell), the number of cases you need for logistic regression is tied to the level of the DV that is least common. So if one percent of your total cases are 0 and you feel you need 500 cases, then you need 500 0 cases (or 50K cases overall). Admittedly, not everyone agrees that logistic regression requires a minimum sample size, although it is a common assumption in texts. And of course this is an extreme example.
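The arithmetic behind that 50K figure can be sketched in a few lines. The helper name `total_cases_needed` is just something I made up for illustration, and the 500-case minimum is the rule of thumb from above, not an established threshold.

```python
from fractions import Fraction
import math

def total_cases_needed(min_rare_cases, rare_share):
    """Total sample size needed so that the least common DV level
    reaches `min_rare_cases`, given its share of all cases."""
    share = Fraction(rare_share)  # exact arithmetic, e.g. "0.01" -> 1/100
    if not 0 < share <= 1:
        raise ValueError("rare_share must be in (0, 1]")
    return math.ceil(min_rare_cases / share)

print(total_cases_needed(500, "0.01"))  # 50000
print(total_cases_needed(20, "0.01"))   # 2000
```

Passing the share as a string keeps the division exact, so the result is not thrown off by floating-point rounding.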

According to Tabachnick and Fidell, a low percentage of cases at one level of a dummy variable is associated with reduced variability, which can have a significant impact on estimated parameters. Whether that applies to a DV with two levels (which they do not address in the book I read) I don't know, but it seems that it would.

In all honesty, Englund, I have not seen the point you just made addressed. That is, no one makes that distinction in terms of the relative number of cases.
