Automatically determining top number of levels of categorical variable to consider

integrates_to_1

New Member
I am designing an application that will perform a test on an observed sample of counts of a categorical feature. In particular, I want to perform a two-sample goodness-of-fit test (e.g., chi-square) and may want to drop from consideration levels that have low frequency in some cases; however, I think the question is more general.

Note, I am dealing with a variety of features, some of which have only a few levels, some of which can have tens of thousands or more (e.g., street blocks, such as "10xx First Avenue", or medical codes of a disease). Now, due to my application, if the categorical feature has many possible levels, I may want to consider two cases:

1) Many possible levels but the distribution among them is more or less uniform
2) Many possible levels, but the distribution is skewed, with a smaller number of levels containing most of the data and many levels having very low frequency. As an example, think of the 80-20 rule: maybe 80% of the data is contained in the top 20% of levels. For instance, in medical coding you may have many specific codes (levels) for a disease, some of which are relatively common (at least relative to the average), while many belong to very rare diseases (simply because there are a lot of potential diseases), each of which has very few observations.

For my application, I'd like to keep the vast majority of data, but omit the levels that have very low frequency relative to the average. What I am interested in doing is having some statistical rule or measurement that will tell me, particularly in case 2 above, what is the "n" for which I should consider the top-n levels. This should be determined automatically from the data, but may require a parameter to be tuned. Another way to think about this is to determine automatically which small categories (possibly none) should be grouped into an "other" level, and omitted in our case. In case 1 above, I will probably decide to either omit the feature overall from being analyzed if there are too many levels and all are small frequency, or I will perform the analysis without dropping levels if they all have sufficient frequency, even if there are many of them.
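To make the "keep the vast majority of the data" idea concrete, here is a minimal sketch of one possible rule: keep the top-n levels whose cumulative share of observations first reaches some coverage target, and treat the rest as the "other" bucket. The function name and the 0.9 coverage value are my own choices, not something from a standard library, and coverage would be the tunable parameter mentioned above:

```python
from collections import Counter

def top_n_by_coverage(counts, coverage=0.9):
    """Return the levels, in descending-frequency order, whose cumulative
    share of observations first reaches `coverage`; the remaining levels
    would be grouped into an 'other' bucket (or dropped)."""
    total = sum(counts.values())
    kept, running = [], 0
    for level, c in sorted(counts.items(), key=lambda kv: -kv[1]):
        kept.append(level)
        running += c
        if running / total >= coverage:
            break
    return kept

counts = Counter({"A": 800, "B": 120, "C": 50, "D": 20, "E": 10})
print(top_n_by_coverage(counts, coverage=0.9))  # ['A', 'B']
```

Note this rule alone does not distinguish case 1 from case 2: under a near-uniform distribution it still keeps roughly `coverage` of the levels, so a separate check on how many levels were kept (relative to the total) would be needed to decide whether to drop the feature entirely.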

Also, say there are only a few categories. If the distribution is 55/40/5, I will want to consider all of them. But if it's 55/45/approx 0 (the smallest category is very small), I may want to drop it since it may mess up my analysis if it's practically unimportant. So it is primarily an issue of having many levels but may arise in other cases.
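The "drop levels far below the average" rule from the 55/40/5 vs. 55/45/~0 example might be sketched as follows; the cutoff of one tenth of the uniform share 1/k is an assumed tuning value, not something derived:

```python
def drop_rare_levels(freqs, min_fraction_of_mean=0.1):
    """Keep levels whose count is at least `min_fraction_of_mean` times
    the average count total/k; the fraction is a tunable assumption."""
    k = len(freqs)
    total = sum(freqs.values())
    cutoff = min_fraction_of_mean * total / k
    return {lvl: c for lvl, c in freqs.items() if c >= cutoff}

# 55/40/5: the average share is 1/3, and 5% clears 0.1 * (1/3), so all kept
print(drop_rare_levels({"A": 55, "B": 40, "C": 5}))
# 55/45/~0: the tiny level falls below the cutoff and is dropped
print(drop_rare_levels({"A": 550, "B": 449, "C": 1}))
```
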

Surprisingly, I have not been able to find much on this online (perhaps I am searching for the wrong terms); there is research on optimal discretization of numeric variables into categorical ones, but not much on grouping the levels of an already-categorical variable. I should note that there is no "target variable" here, so I can't use decision trees to determine an optimal binning as far as I'm aware (unless maybe using a random binary target would give a meaningful result).

A few of the ideas I've had are:
-some sort of skew test on the frequencies, to detect case 1
-ordering the levels by frequency, then calculating the ratio of successive pairs and seeing where that ratio converges. I believe a similar test is used to distinguish heavy-tailed from long-tailed continuous distributions (using the densities). If there are many small levels that are about equal, the ratio should converge to 1, and I can ignore the levels where this happens.
-something motivated by stick-breaking from the Dirichlet process (similar to above); maybe look at each frequency count and see what its proportion is of the total frequencies remaining, maybe this converges below a given value for levels I should ignore.
-something that uses an entropy measure to measure the variability in the level frequencies
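A minimal sketch of the stick-breaking-motivated idea above: walk the levels in descending-frequency order, compute each level's share of the observations *remaining* at that point, and stop keeping levels once that share drops below a threshold. The function name and the 0.05 threshold are my own illustrative choices:

```python
def cut_by_remaining_share(counts, min_share=0.05):
    """Sort levels by frequency descending; for each level, compute its
    proportion of the total frequency remaining (a stick-breaking style
    share). Stop once that share falls below `min_share` (tunable)."""
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    remaining = sum(counts.values())
    kept = []
    for level, c in ordered:
        if c / remaining < min_share:
            break
        kept.append(level)
        remaining -= c
    return kept

# A skewed (case 2) example: three dominant levels plus 40 rare ones.
counts = {"A": 500, "B": 300, "C": 150}
counts.update({f"rare_{i}": 5 for i in range(40)})
print(cut_by_remaining_share(counts, min_share=0.05))  # ['A', 'B', 'C']
```

One caveat with this rule: it behaves as intended when the tail contains *many* small levels (the remaining total stays large, so each small level's share stays small), but with only a handful of tail levels the remaining-share of a small level can actually be large, so the cut point depends on how long the tail is.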

I'd appreciate any suggestions or leads you may have.