I'm working with a very large dataset of individuals from 6 different groups/samples. The sizes of these groups differ radically: the largest of the 6 makes up 83.7% of the data, while the smallest is 0.2% of the entire sample. (That group is 6,537 records, so it's not insignificant.)

I have a lot of analysis I need to do, but right now it seems that the large group is swamping the other groups. What alternatives exist for normalizing the data so that no single group overwhelms the analysis?

What exactly is the problem? What type of analysis are you doing?

To start with, I was doing a chi-square test until I realized that it was essentially comparing all the other groups to the distribution of the 83.7% group. I intend to run a logistic regression on the data, but because the one group is so much larger, I worry that in essence I will just be running a logistic regression on that large group. Is there a way to weight the data so the samples are more even? Or would I be better off taking a sample of the large group and working with that?
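To make the two options concrete, here is a minimal sketch in plain Python of what I have in mind. The group labels and sizes are hypothetical toy values (only the 83.7% share mirrors my data), and the function names `balancing_weights` and `downsample_largest` are my own, not from any library. The first gives every record a weight inversely proportional to its group's size, so each group carries the same total weight; the second keeps a random subset of the largest group and leaves the other groups alone.

```python
import random
from collections import Counter


def balancing_weights(groups):
    """Return one weight per record so every group has the same total weight.

    Each record in group g gets n / (k * n_g), where n is the number of
    records, k the number of groups, and n_g the size of group g.
    The weights therefore sum to n.
    """
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]


def downsample_largest(groups, target_size, seed=0):
    """Return record indices, keeping at most `target_size` random records
    from the largest group and every record from the other groups."""
    counts = Counter(groups)
    largest = counts.most_common(1)[0][0]
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    big_idx = [i for i, g in enumerate(groups) if g == largest]
    keep_big = set(rng.sample(big_idx, min(target_size, len(big_idx))))
    return [i for i, g in enumerate(groups) if g != largest or i in keep_big]


# Hypothetical toy data: one institution label per student record.
groups = ["A"] * 837 + ["B"] * 100 + ["C"] * 61 + ["D"] * 2

weights = balancing_weights(groups)
kept = downsample_largest(groups, target_size=100)
```

The weights could then be supplied as case weights to whatever regression routine accepts them, while `kept` is simply the set of rows to analyse. I am not sure which of the two is preferable: weighting uses all the data but lets a few records from tiny groups count heavily, whereas subsampling throws away most of the large group.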

Which variables do you want to analyse?

The data consist of individual students in courses at several different institutions. One institution is MUCH larger than the others. At this point I'm doing exploratory data analysis, trying to understand what influences whether students pass or fail their courses, and whether they stay in or drop out of school. I have all types of variables available. Because I am still at the exploratory stage, I don't really have a good sense of all the analyses I want to do, but that also leaves me open to trying different techniques.

