Comparing Distributions - Discrete vs Binned vs Continuous


Stormy
05-02-2006, 09:40 PM
I have a number of different feature types, and I want to compare the distributions of the features measured under different conditions.

Some features have naturally binned/enumerated values so a Chi-square test would be appropriate.

Some of the features are obviously continuous (measured as floating point values) so a Kolmogorov-Smirnov test will do the job.

Where I am confused is on discrete (integer-valued) features that have a large span, e.g. -1000 to 1000. Technically it would seem that the K-S test is not suitable since the data is not continuous. It does not seem right to use a Chi-square test with 2000 degrees of freedom! Of course the data can be binned for Chi-square, but the binning is arbitrary.

In the case where there is a large amount of integer-valued data which is expected to be smooth in nature, is the K-S test still applicable?

What rules of thumb can be applied to bin the data for Chi-square testing?

Many thanks.

JohnM
05-02-2006, 09:50 PM
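Excellent question. With a "large" amount of integer data spread over many distinct values, it's OK to use the K-S test. Ties make the standard K-S p-value somewhat conservative rather than invalid.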
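If the ties in integer data are still a worry, one option (not something proposed in this thread, just a sketch) is to keep the K-S statistic but get the p-value by permutation, which does not rely on the continuity assumption. The function names, the `n_perm` default, and the fixed seed below are all illustrative choices.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    # Evaluate both ECDFs at every distinct observed value, so tied
    # integers are counted together rather than one at a time.
    for x in set(a_sorted) | set(b_sorted):
        fa = bisect_right(a_sorted, x) / n
        fb = bisect_right(b_sorted, x) / m
        d = max(d, abs(fa - fb))
    return d


def ks_permutation_pvalue(a, b, n_perm=500, seed=0):
    """Permutation p-value for the K-S statistic (assumed helper, not a library API)."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    n = len(a)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the condition labels are exchangeable.
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n], pooled[n:]) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    # Made-up integer-valued samples, for illustration only.
    cond_a = [round(rng.gauss(0, 300)) for _ in range(400)]
    cond_b = [round(rng.gauss(40, 300)) for _ in range(400)]
    d, p = ks_permutation_pvalue(cond_a, cond_b)
    print(f"D = {d:.3f}, permutation p = {p:.3f}")
```

The cost is run time: each permutation recomputes the statistic, so `n_perm` trades precision of the p-value against speed.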

I'm not aware of any rules of thumb for when you "switch over" from Chi-square to K-S tests. It's basically a judgment call.
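On the binning side of the question, one commonly cited convention (an assumption here, not a rule stated in this thread) is to choose bins so that every expected count is at least about 5. A minimal sketch that does this with equal-count bins cut at pooled quantiles, then computes the two-sample Chi-square statistic, might look like the following. The function names and the `n_bins` default are hypothetical.

```python
from bisect import bisect_right


def quantile_edges(pooled, n_bins):
    """Interior cut points that put roughly equal pooled counts in each bin."""
    s = sorted(pooled)
    edges = []
    for i in range(1, n_bins):
        cut = s[i * len(s) // n_bins]
        if not edges or cut > edges[-1]:  # skip duplicate cuts caused by ties
            edges.append(cut)
    return edges


def bin_counts(sample, edges):
    """Count values per bin; bin i covers edges[i-1] <= x < edges[i]."""
    counts = [0] * (len(edges) + 1)
    for x in sample:
        counts[bisect_right(edges, x)] += 1
    return counts


def chi_square_two_sample(a, b, n_bins=10):
    """Return (statistic, degrees of freedom, smallest expected count)."""
    edges = quantile_edges(list(a) + list(b), n_bins)
    counts_a, counts_b = bin_counts(a, edges), bin_counts(b, edges)
    na, nb = len(a), len(b)
    total = na + nb
    stat, used_bins, min_expected = 0.0, 0, float("inf")
    for oa, ob in zip(counts_a, counts_b):
        col = oa + ob
        if col == 0:
            continue  # an empty bin contributes nothing
        ea, eb = na * col / total, nb * col / total
        stat += (oa - ea) ** 2 / ea + (ob - eb) ** 2 / eb
        min_expected = min(min_expected, ea, eb)
        used_bins += 1
    return stat, used_bins - 1, min_expected
```

The statistic can then be compared against a Chi-square table with the returned degrees of freedom. If the smallest expected count comes back below about 5, reducing `n_bins` is the usual remedy. Quantile bins remove some of the arbitrariness of hand-picked edges, but the number of bins remains a judgment call.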