Stormy
05-02-2006, 09:40 PM
I have a number of different feature types, and I want to compare the distributions of the features measured under different conditions.
Some features have naturally binned/enumerated values so a Chi-square test would be appropriate.
Some of the features are obviously continuous (measured as floating point values) so a Kolmogorov-Smirnov test will do the job.
Where I am unsure is with discrete (integer-valued) features that span a large range, e.g. -1000 to 1000. Strictly speaking, the K-S test would seem unsuitable because the data are not continuous. It also does not seem right to use a Chi-square test with 2000 degrees of freedom! Of course the data can be binned for Chi-square, but the choice of bins is arbitrary.
In the case where there is a large amount of integer-valued data which is expected to be smooth in nature, is the K-S test still applicable?
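To make the question concrete, here is a minimal sketch of the two-sample K-S statistic computed on integer data. The function name `ks_two_sample_statistic` is my own, not from any library. The empirical CDFs are compared only at distinct observed values, so tied values are handled consistently; the sketch computes the statistic only and makes no claim about whether the usual continuous-case p-value is valid here.

```python
from bisect import bisect_right


def ks_two_sample_statistic(a, b):
    """Return D = max |F_a(x) - F_b(x)| over all observed values x.

    F_a and F_b are the empirical CDFs of the two samples. Evaluating
    them only at distinct values keeps ties from distorting the result.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    n_a, n_b = len(a_sorted), len(b_sorted)
    d = 0.0
    for x in sorted(set(a_sorted) | set(b_sorted)):
        f_a = bisect_right(a_sorted, x) / n_a  # fraction of a that is <= x
        f_b = bisect_right(b_sorted, x) / n_b  # fraction of b that is <= x
        d = max(d, abs(f_a - f_b))
    return d


# Hypothetical integer-valued samples, for illustration only.
d = ks_two_sample_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

The statistic itself is well defined with ties; what I cannot judge is how far the reference distribution can be trusted when the data are integers.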
What rules of thumb can be applied to bin the data for Chi-square testing?
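As an illustration of the kind of binning I have in mind, here is a rough sketch that cuts the pooled data into bins of roughly equal frequency and then computes the usual two-sample Chi-square statistic. The function names and the `n_bins` parameter are my own assumptions. One commonly cited rule of thumb is to keep the expected count in every cell at about 5 or more, but I do not know whether that is the right criterion here.

```python
from bisect import bisect_right


def pooled_quantile_cuts(a, b, n_bins):
    """Interior cut points at roughly equal pooled frequencies (duplicates dropped)."""
    pooled = sorted(list(a) + list(b))
    n = len(pooled)
    return sorted({pooled[(i * n) // n_bins] for i in range(1, n_bins)})


def bin_counts(sample, cuts):
    """Count values per bin; bin i holds cuts[i-1] <= x < cuts[i]."""
    counts = [0] * (len(cuts) + 1)
    for x in sample:
        counts[bisect_right(cuts, x)] += 1
    return counts


def chi_square_two_sample(a, b, n_bins):
    """Return (statistic, degrees of freedom) for the 2 x k table of binned counts."""
    cuts = pooled_quantile_cuts(a, b, n_bins)
    obs_a, obs_b = bin_counts(a, cuts), bin_counts(b, cuts)
    n_a, n_b = len(a), len(b)
    total = n_a + n_b
    stat, used_bins = 0.0, 0
    for o_a, o_b in zip(obs_a, obs_b):
        col = o_a + o_b
        if col == 0:
            continue  # heavy ties can leave a bin empty; skip it
        used_bins += 1
        e_a = n_a * col / total  # expected count if both samples share one distribution
        e_b = n_b * col / total
        stat += (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
    return stat, used_bins - 1


# Hypothetical data: two identical samples give a statistic of zero.
stat, dof = chi_square_two_sample(list(range(100)), list(range(100)), 4)
```

Taking the cut points from the pooled sample at least makes the binning depend on the data rather than on a hand-picked bin width, but the number of bins is still a free choice.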
Many thanks.