Using R for the Kolmogorov-Smirnov test



freebird2008
08-10-2008, 07:01 PM
I have some data consisting of about five columns, and I want to use the KS test in R to test it for normality. A few questions: first, do I need to test one column at a time? Second, where do I get the last two parameters?

mp83
08-11-2008, 04:57 AM
Which are the last two parameters? Do you mean the parameters of the distribution family? The theory of the KS test says that the parameters must be specified in advance by the user (i.e. not estimated from the data). If you insist on using the data, though, various MLE functions can help you.

Yes, you have to test one at a time. But you can write a loop to do that for you.
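
A minimal sketch of such a loop, assuming a simulated data frame `dat` and a hypothetical list `params` of means and standard deviations fixed in advance (neither comes from the thread):

```r
# Hypothetical data frame standing in for the poster's data (simulated).
set.seed(1)
dat <- data.frame(a = rnorm(50), b = rnorm(50, mean = 3, sd = 2), c = rexp(50))

# Hypothesised mean and sd per column, fixed in advance rather than
# estimated from the data (illustrative values only).
params <- list(a = c(0, 1), b = c(3, 2), c = c(1, 1))

# Run ks.test() on each column in turn and collect the p-values.
p_values <- sapply(names(dat), function(nm) {
  ks.test(dat[[nm]], "pnorm",
          mean = params[[nm]][1], sd = params[[nm]][2])$p.value
})
print(p_values)
```

Using sapply() over the column names keeps the result as a named vector, so each p-value stays attached to its column.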

freebird2008
08-11-2008, 08:17 AM
I am trying to find out whether the data are non-normal and heavy-tailed. So I have
ks.test(data, "pnorm", ?, ?). Any idea what those last two parameters should be? Also, I am getting this message:
Warning message:
In ks.test(data, "pnorm", 0, 1) : cannot compute correct p-values with ties
What's that mean?

vinux
08-11-2008, 08:28 AM
Hi freebird,
I think you should use lillie.test(). It is available in the nortest package.

http://pbil.univ-lyon1.fr/library/nortest/html/lillie.test.html

ks.test() is the general one, and the one above (lillie.test) is specifically for testing normality.
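
To illustrate the difference, here is a rough sketch on a simulated sample (the data and seed are assumptions). It runs ks.test() with the mean and sd estimated from the sample, then calls nortest::lillie.test() only if the nortest package happens to be installed:

```r
set.seed(123)
x <- rexp(40)   # simulated, clearly non-normal sample (illustration only)

# Plain K-S with mean and sd estimated from the same data: the distance D is
# computed as usual, but the reported p-value ignores the estimation step.
ks_est <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
print(ks_est$statistic)
print(ks_est$p.value)

# lillie.test() lives in the add-on package nortest, so call it only if present.
if (requireNamespace("nortest", quietly = TRUE)) {
  lf <- nortest::lillie.test(x)
  print(lf$statistic)   # the same distance D as above
  print(lf$p.value)     # p-value adjusted for the estimated parameters
}
```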

mp83
08-11-2008, 08:33 AM
A tie occurs when a value is repeated in the sample. For example, in 2.04, 3, 5.8, 2.04, 6 the value 2.04 is tied.

You have to figure out the parameters yourself. Testing for a heavy tail with a KS test isn't that good, though. You should check against a Cauchy or a t distribution, i.e. a large dispersion parameter should be used.
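
A small sketch of both points, using the five example values above. The hypothesised mean of 4 and sd of 2 are made-up numbers for illustration, and the exact wording of the warning may differ between R versions:

```r
# The example values from the post: 2.04 appears twice, so it is a tie.
x <- c(2.04, 3, 5.8, 2.04, 6)

# duplicated() flags repeated occurrences; unique() reduces them to the tied values.
tied_values <- unique(x[duplicated(x)])
print(tied_values)

# The arguments after "pnorm" are passed on to pnorm(), i.e. they are the
# hypothesised mean and sd. The values 4 and 2 are assumptions for illustration.
warnings_seen <- character(0)
res <- withCallingHandlers(
  ks.test(x, "pnorm", mean = 4, sd = 2),
  warning = function(w) {
    # Record the warning text instead of printing it, then carry on.
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(res$p.value)
print(warnings_seen)
```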

TheEcologist
08-11-2008, 11:25 AM
shapiro.test() is another possibility.

mp83
08-11-2008, 12:24 PM
A Modified Kolmogorov-Smirnov Test Sensitive to Tail Alternatives

David M. Mason and John H. Schuenemeyer

Source: Ann. Statist. Volume 11, Number 3 (1983), 933-946.
Abstract

It is well known that the Kolmogorov-Smirnov (K-S) test exhibits poor sensitivity to deviations from the hypothesized distribution that occur in the tails. A modified version of the K-S test is introduced that is more sensitive than the K-S test to deviations in the tails. The finite and infinite sample distribution along with the consistency properties of the proposed test are studied. Tables of critical values are provided for two versions of the test (one sensitive to heavy tail alternatives and one sensitive to light tail alternatives) and the finite sample properties of these two versions of the test are investigated.


> http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1176346259

Also, check (if anyone is interested):

Goodness-of-fit tests for a heavy tailed distribution
> http://publishing.eur.nl/ir/repub/asset/7031/EI200544.pdf

We have to start using the Anderson-Darling and Cramér-von Mises tests instead of the KS in everyday practice. It's not the 80s!
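
For reference, both statistics can be computed in a few lines of base R. The sketch below assumes a simulated heavy-tailed sample and a normal null with mean and sd estimated from the data. It computes only the statistics, not p-values; the nortest functions ad.test() and cvm.test() provide those.

```r
set.seed(7)
x <- sort(rt(200, df = 3))        # simulated heavy-tailed sample (an assumption)
n <- length(x)
i <- seq_len(n)
z <- (x - mean(x)) / sd(x)        # standardise with estimated mean and sd

# Anderson-Darling statistic: deviations in both tails get extra weight.
# log.p = TRUE avoids log(0) when an extreme point has a fitted CDF of 0 or 1.
log_F  <- pnorm(z, log.p = TRUE)                           # log F(x_(i))
log_S  <- pnorm(rev(z), lower.tail = FALSE, log.p = TRUE)  # log(1 - F(x_(n+1-i)))
A2 <- -n - mean((2 * i - 1) * (log_F + log_S))

# Cramer-von Mises statistic: summed squared distance between fitted and
# empirical CDF at the ordered observations.
W2 <- 1 / (12 * n) + sum((pnorm(z) - (2 * i - 1) / (2 * n))^2)

print(c(A2 = A2, W2 = W2))
```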

freebird2008
08-11-2008, 03:25 PM
I keep getting this error as well,

Error in `[.data.frame`(x, complete.cases(x)) :
undefined columns selected

What's that mean?

I am trying to analyze this data set,

    X1 X2 X3 X4 X5    Y
 1 350  1  2  2  5 1000
 2 350  1  5  5  5 1400
 3 350  0  4  4  4 1200
 4 350  1 20 20  1 1800
 5 425  0 10  2  3 2800
 6 425  1 15 10  3 4000
 7 425  0  1  1  4 2500
 8 425  1  5  5  4 3000
 9 600  1 10  5  2 3500
10 600  0  8  8  3 2800
11 600  0  4  3  4 2900
12 600  1 20 10  2 3800
13 600  1  7  7  5 4200
14 700  1  8  8  1 4600
15 700  0 25 15  5 5000
16 700  1 19 16  4 4600
17 700  0 20 14  5 4700
18 400  0  6  4  3 1800
19 400  1 20  8  3 3400
20 400  0  5  3  5 2000
21 500  1 22 12  3 3200
22 500  1 25 10  3 3200
23 500  0  8  3  4 2800
24 500  0  2  1  5 2400
25 800  1 10 10  3 5200
26 475  1 10  4  3 2400
27 475  0  3  3  4 2400
28 475  1  8  8  2 3000
29 475  1  6  6  4 2800
30 475  0 12  4  3 2500
31 475  0  4  2  5 2100


Any ideas? I've been using the read.table() function.

mp83
08-11-2008, 04:13 PM
OK. Does the file include a header in the first line?

X1 X2 X3 X4 X5 Y <--------------header?
350 1 2 2 5 1000
350 1 5 5 5 1400

If so, simply use read.table(" ", header=TRUE).

This data set definitely has ties :)

EDIT

I don't think the header is to blame. I can't remember what was to blame the last time I reproduced such a message...
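
One plausible cause, offered as an assumption since the thread never confirms it, is passing the whole data frame to a test that expects a single numeric vector. Some test functions subset their input with x[complete.cases(x)], and on a data frame that expression selects columns rather than rows. The sketch below inlines the first eight posted rows through the text argument of read.table() and reproduces that kind of failure:

```r
# First eight rows of the posted data, inlined so the sketch is self-contained.
txt <- "X1 X2 X3 X4 X5 Y
1 350 1 2 2 5 1000
2 350 1 5 5 5 1400
3 350 0 4 4 4 1200
4 350 1 20 20 1 1800
5 425 0 10 2 3 2800
6 425 1 15 10 3 4000
7 425 0 1 1 4 2500
8 425 1 5 5 4 3000"

# The header has one field fewer than the data rows, so read.table() treats
# the first column as row names.
dat <- read.table(text = txt, header = TRUE)
str(dat)

# complete.cases(dat) has one entry per row (8), but single-bracket indexing
# of a data frame selects columns (6), so this subsetting fails.
err <- tryCatch({
  dat[complete.cases(dat)]
  NA_character_
}, error = function(e) conditionMessage(e))
print(err)

# Passing one column at a time avoids the problem.
print(shapiro.test(dat$Y))
```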

freebird2008
08-11-2008, 04:33 PM
Yes, what you have there above is my header. Is there anything I can do to process this dataset even if it has ties? So when I look at a dataset and see that it has ties what does this tell me? What does this tell a statistician when he sees it? Should I look for another dataset maybe? Should I reformat it?

mp83
08-11-2008, 05:30 PM
Ties concentrate the empirical distribution on too few points, which indicates that the population distribution is discrete. Given that this type of test assumes a continuous distribution, another data set would be preferable. For instance, X1 would normally be tested with a chi-square test.

If the file has a header, you have to tell read.table() so, or else the variable names in the first row will be read as data values.
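
As a rough check of how discrete each column is, the sketch below reads the posted data (retyped inline here, so transcription slips are possible) with header = TRUE and counts the distinct values per column:

```r
# The 31 rows posted earlier in the thread, inlined so the sketch runs as-is.
txt <- "X1 X2 X3 X4 X5 Y
1 350 1 2 2 5 1000
2 350 1 5 5 5 1400
3 350 0 4 4 4 1200
4 350 1 20 20 1 1800
5 425 0 10 2 3 2800
6 425 1 15 10 3 4000
7 425 0 1 1 4 2500
8 425 1 5 5 4 3000
9 600 1 10 5 2 3500
10 600 0 8 8 3 2800
11 600 0 4 3 4 2900
12 600 1 20 10 2 3800
13 600 1 7 7 5 4200
14 700 1 8 8 1 4600
15 700 0 25 15 5 5000
16 700 1 19 16 4 4600
17 700 0 20 14 5 4700
18 400 0 6 4 3 1800
19 400 1 20 8 3 3400
20 400 0 5 3 5 2000
21 500 1 22 12 3 3200
22 500 1 25 10 3 3200
23 500 0 8 3 4 2800
24 500 0 2 1 5 2400
25 800 1 10 10 3 5200
26 475 1 10 4 3 2400
27 475 0 3 3 4 2400
28 475 1 8 8 2 3000
29 475 1 6 6 4 2800
30 475 0 12 4 3 2500
31 475 0 4 2 5 2100"
dat <- read.table(text = txt, header = TRUE)

# Distinct values per column; far fewer than nrow(dat) means many ties.
n_distinct <- sapply(dat, function(col) length(unique(col)))
print(n_distinct)

# Frequency table for X1, the kind of summary a chi-square test would work from.
print(table(dat$X1))
```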

Rounds
08-11-2008, 08:03 PM
Why are you checking 'X2' through 'X5' for normality? Incidentally, 'X3' on its face fails to be normal =)~

freebird2008
08-11-2008, 09:04 PM
I am currently doing research, and one of the things I was told to do was study several tests for normality: the Kolmogorov-Smirnov, Cramér-von Mises, Anderson-Darling, etc. This particular data set was one we decided to study.

ltleung
07-08-2012, 09:15 AM
There's a very good text on testing normality: Thode, H.C. (2002) Testing For Normality, Marcel Dekker, New York. It discusses numerous tests of normality, including the K-S test, which uses the supremum of the deviation between the CDFs, the Jarque-Bera test, which uses a combination of moments, etc. With a sample size of 31, Shapiro-Wilk seems to be an appropriate one.
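
A sketch of that suggestion applied to the Y column of the data set posted earlier in the thread. The values are retyped here, so treat them as a transcription. Applying it to Y rather than the X columns is a choice made for illustration.

```r
# Y column from the data set posted earlier in the thread (31 observations).
y <- c(1000, 1400, 1200, 1800, 2800, 4000, 2500, 3000, 3500, 2800, 2900,
       3800, 4200, 4600, 5000, 4600, 4700, 1800, 3400, 2000, 3200, 3200,
       2800, 2400, 5200, 2400, 2400, 3000, 2800, 2500, 2100)

# shapiro.test() is part of base R (package stats) and needs no parameters
# to be specified in advance.
sw <- shapiro.test(y)
print(sw)
```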