
Thread: Using R for the Kolmogorov-Smirnov test

  1. #1

    Using R for the Kolmogorov-Smirnov test



    I have some data that consists of around five columns. I want to use the KS test in R to test it for normality. A few questions. First, do I need to test one column at a time? Second, where do I get the last two parameters?
    Last edited by freebird2008; 08-10-2008 at 07:35 PM.

  2. #2
    Which are the last two parameters? Do you mean the distribution family parameters? The theory of the KS test says that the parameters must be specified in advance by the user (i.e. not estimated from the data). If you insist on using the data, though, various MLE functions can help you.

    Yes, you have to test one at a time. But you can write a loop to do that for you.
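
    A loop over the columns could look like the sketch below. The data frame `df` and its column names are made up for illustration, and the null parameters (mean 0, sd 1) are assumed to have been fixed in advance, as described above.

```r
# Hypothetical data frame standing in for the poster's multi-column data
set.seed(1)
df <- data.frame(a = rnorm(50), b = rnorm(50), c = rexp(50))

# Run a one-sample KS test against N(0, 1) on each column separately.
# mean = 0 and sd = 1 are assumed to be chosen beforehand, not estimated.
p_values <- sapply(df, function(col) {
  ks.test(col, "pnorm", mean = 0, sd = 1)$p.value
})
print(p_values)
```

    sapply() walks over the columns of a data frame, so the result is one p-value per column, named after the column.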

  3. #3

    The last two parameters

    I am trying to find out whether the data is non-normal and heavy tailed. So I have
    ks.test(data, "pnorm", ?, ?)
    Any idea what those last two parameters should be? Also, I am getting this message:
    Warning message:
    In ks.test(data, "pnorm", 0, 1) : cannot compute correct p-values with ties
    What does that mean?

  4. #4
    Hi freebird,
    I think you should use lillie.test(). It is available in the nortest package.

    http://pbil.univ-lyon1.fr/library/no...llie.test.html

    ks.test() is the general one, and lillie.test() is specifically for testing normality.

  5. #5
    A tie occurs when x-values are repeated. For example, in 2.04, 3, 5.8, 2.04, 6 the value 2.04 is tied.
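
    Checking a vector for ties before running the test is straightforward. A small sketch using the five example values from this post:

```r
# Example values from the post; 2.04 appears twice
x <- c(2.04, 3, 5.8, 2.04, 6)

has_ties <- anyDuplicated(x) > 0          # TRUE if any value repeats
tied_values <- unique(x[duplicated(x)])   # the values that repeat

print(has_ties)
print(tied_values)
```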

    You have to figure out the parameters yourself. Testing for a heavy tail with a KS test isn't that good, though. You should check against a Cauchy or a t distribution, i.e. a large dispersion parameter should be used.
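
    For what it's worth, the extra arguments after the distribution name in ks.test() are passed on to that distribution function, so for "pnorm" the two values are the mean and sd of the hypothesized normal. The sketch below uses a simulated heavy-tailed sample (not the poster's data), and the parameter values are assumptions picked only to show the call syntax.

```r
# Simulated heavy-tailed sample; a stand-in for real data
set.seed(42)
x <- rt(200, df = 3)

# Arguments after the CDF name are forwarded to that CDF
res_norm   <- ks.test(x, "pnorm", mean = 0, sd = 1)             # normal null
res_t      <- ks.test(x, "pt", df = 3)                          # t null
res_cauchy <- ks.test(x, "pcauchy", location = 0, scale = 1)    # Cauchy null

print(c(normal = res_norm$p.value,
        t3     = res_t$p.value,
        cauchy = res_cauchy$p.value))
```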

  6. #6
    Quote, originally posted by vinux:
        Hi freebird,
        I think you should use lillie.test(). It is available in the nortest package.

        http://pbil.univ-lyon1.fr/library/no...llie.test.html

        ks.test() is the general one, and lillie.test() is specifically for testing normality.
    shapiro.test() is another possibility.
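
    A minimal sketch of both calls on a simulated sample (not the poster's data). shapiro.test() ships with base R; lillie.test() needs the nortest package, so that call is guarded in case the package is not installed.

```r
# Simulated sample used only to show the calls
set.seed(7)
x <- rnorm(31)

# Shapiro-Wilk test from the base 'stats' package
sw <- shapiro.test(x)
print(sw$p.value)

# Lilliefors test; only run if 'nortest' is available
if (requireNamespace("nortest", quietly = TRUE)) {
  print(nortest::lillie.test(x)$p.value)
}
```

    Neither function takes mean or sd arguments, since both tests handle the unknown parameters of the normal distribution themselves.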

  7. #7
    A Modified Kolmogorov-Smirnov Test Sensitive to Tail Alternatives

    David M. Mason and John H. Schuenemeyer

    Source: Ann. Statist. Volume 11, Number 3 (1983), 933-946.
    Abstract

    It is well known that the Kolmogorov-Smirnov (K-S) test exhibits poor sensitivity to deviations from the hypothesized distribution that occur in the tails. A modified version of the K-S test is introduced that is more sensitive than the K-S test to deviations in the tails. The finite and infinite sample distribution along with the consistency properties of the proposed test are studied. Tables of critical values are provided for two versions of the test (one sensitive to heavy tail alternatives and one sensitive to light tail alternatives) and the finite sample properties of these two versions of the test are investigated.


    > http://projecteuclid.org/DPubS?servi...aos/1176346259

    Also, check (if anyone is interested)

    Goodness-of-fit tests for a heavy tailed distribution
    > http://publishing.eur.nl/ir/repub/as...1/EI200544.pdf

    We have to start using the Anderson-Darling and Cramér-von Mises tests instead of the KS in everyday practice. It's not the 80s!
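
    To make the two alternatives concrete, here is a base-R sketch that computes both statistics from their standard textbook formulas against a fully specified N(0, 1) null. The sample is simulated and the null parameters are assumptions. No p-values are computed here; for the normality case with estimated parameters, the nortest package provides ad.test() and cvm.test().

```r
# Simulated sample, sorted so that x[i] is the i-th order statistic
set.seed(3)
x <- sort(rnorm(40))
n <- length(x)
i <- seq_len(n)

# Hypothesized CDF evaluated at the ordered sample (N(0, 1) assumed)
u <- pnorm(x, mean = 0, sd = 1)

# Anderson-Darling statistic: weights the tails more heavily than KS
a2 <- -n - mean((2 * i - 1) * (log(u) + log(1 - rev(u))))

# Cramer-von Mises statistic: integrated squared CDF difference
w2 <- 1 / (12 * n) + sum((u - (2 * i - 1) / (2 * n))^2)

print(c(anderson_darling = a2, cramer_von_mises = w2))
```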

  8. #8

    Another error

    I keep getting this error as well,

    Error in `[.data.frame`(x, complete.cases(x)) :
    undefined columns selected

    What's that mean?

    I am trying to analyze this data set,

        X1 X2 X3 X4 X5    Y
     1 350  1  2  2  5 1000
     2 350  1  5  5  5 1400
     3 350  0  4  4  4 1200
     4 350  1 20 20  1 1800
     5 425  0 10  2  3 2800
     6 425  1 15 10  3 4000
     7 425  0  1  1  4 2500
     8 425  1  5  5  4 3000
     9 600  1 10  5  2 3500
    10 600  0  8  8  3 2800
    11 600  0  4  3  4 2900
    12 600  1 20 10  2 3800
    13 600  1  7  7  5 4200
    14 700  1  8  8  1 4600
    15 700  0 25 15  5 5000
    16 700  1 19 16  4 4600
    17 700  0 20 14  5 4700
    18 400  0  6  4  3 1800
    19 400  1 20  8  3 3400
    20 400  0  5  3  5 2000
    21 500  1 22 12  3 3200
    22 500  1 25 10  3 3200
    23 500  0  8  3  4 2800
    24 500  0  2  1  5 2400
    25 800  1 10 10  3 5200
    26 475  1 10  4  3 2400
    27 475  0  3  3  4 2400
    28 475  1  8  8  2 3000
    29 475  1  6  6  4 2800
    30 475  0 12  4  3 2500
    31 475  0  4  2  5 2100


    Any ideas? I've been using the read.table() function.
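
    One possible cause, offered as a guess rather than a diagnosis: the error text mentions `[.data.frame`, which suggests the whole data frame was handed to ks.test() instead of a single numeric column. The sketch below reads the first six rows shown above from an inline string (standing in for the real file, whose name is not given) and tests column Y on its own. Estimating the mean and sd from the same data is done here only for illustration; as noted earlier in the thread, KS theory expects them to be fixed in advance.

```r
# First six rows of the posted data, inline instead of a file
txt <- "X1 X2 X3 X4 X5 Y
1 350 1 2 2 5 1000
2 350 1 5 5 5 1400
3 350 0 4 4 4 1200
4 350 1 20 20 1 1800
5 425 0 10 2 3 2800
6 425 1 15 10 3 4000"

# The header has one field fewer than the data rows, so read.table()
# treats the first column as row names
dat <- read.table(text = txt, header = TRUE)
str(dat)

# Pass one numeric column, not the whole data frame
y <- dat$Y
res <- ks.test(y, "pnorm", mean = mean(y), sd = sd(y))
print(res$p.value)
```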

  9. #9
    OK. Does the file include a header in the first line?

    X1 X2 X3 X4 X5 Y <--------------header?
    350 1 2 2 5 1000
    350 1 5 5 5 1400

    If so, simply use read.table(" ",header=T).

    This data set definitely has ties.

    EDIT

    I don't think the header is to blame. I can't remember what was to blame the last time I reproduced such a message...

  10. #10

    Ties in the dataset

    Yes, what you have there above is my header. Is there anything I can do to process this dataset even if it has ties? So when I look at a dataset and see that it has ties what does this tell me? What does this tell a statistician when he sees it? Should I look for another dataset maybe? Should I reformat it?

  11. #11
    Ties make the empirical distribution concentrate on too few points, which indicates that the population distribution is discrete. Given that this type of test assumes a continuous distribution, another data set would be preferred. For instance, X1 would normally be tested with a chi-square test.
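
    A quick way to see how discrete a column is: tabulate it. The sketch below rebuilds the X1 column from the values posted in #8 and counts how often each distinct value occurs.

```r
# X1 column as listed in post #8 (31 observations)
x1 <- rep(c(350, 425, 600, 700, 400, 500, 800, 475),
          times = c(4, 4, 5, 4, 3, 4, 1, 6))

counts <- table(x1)               # frequency of each distinct value
n_distinct <- length(unique(x1))  # number of distinct values

print(counts)
print(n_distinct)
```

    Only 8 distinct values across 31 observations means nearly every observation is tied with at least one other.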

    If the file has a header, you have to let read.table() know, or else the first row would contain the variable names as values.
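
    The effect of the header argument can be seen on a two-row excerpt of the data, read from an inline string rather than a file (the real file name is not given in the thread).

```r
# Header line plus two data rows, without row numbers
txt <- "X1 X2 X3 X4 X5 Y
350 1 2 2 5 1000
350 1 5 5 5 1400"

with_header    <- read.table(text = txt, header = TRUE)
without_header <- read.table(text = txt, header = FALSE)

str(with_header)     # 2 rows, numeric columns named X1 ... Y
str(without_header)  # 3 rows, non-numeric columns named V1 ... V6
```

    Without header = TRUE the names end up as the first data row, so no column is numeric any more and a test such as ks.test() cannot be applied to it directly.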

  12. #12
    Why are you checking 'X2' through 'X5' for normality? Incidentally, 'X3' on its face fails to be normal =)~

  13. #13

    Well...?

    I am currently doing research, and one of the things I was told to do was study several tests for normality: the Kolmogorov-Smirnov, Cramér-von Mises, Anderson-Darling, etc. This particular data set was one we decided to study.

  14. #14

    Re: Using R for the Kolmogorov-Smirnov test


    There's a very good text on testing normality: Thode, H.C. (2002) Testing For Normality, Marcel Dekker, New York. It discusses numerous tests of normality, including the K-S test, which uses the supremum deviation between the CDFs, and the Jarque-Bera test, which uses combined moments. With a sample size of 31, Shapiro-Wilk seems to be an appropriate choice.
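
    The two statistics mentioned above can be written out in a few lines of base R. This is only a sketch on a simulated sample of size 31 with an assumed N(0, 1) null for the K-S part. The Jarque-Bera p-value uses the asymptotic chi-square approximation with 2 degrees of freedom, which can be rough for a sample this small.

```r
# Simulated sample; a stand-in for real data
set.seed(11)
x <- rnorm(31)
n <- length(x)

# K-S statistic: largest gap between the empirical CDF and the N(0, 1) CDF
u <- pnorm(sort(x))
d <- max(seq_len(n) / n - u, u - (seq_len(n) - 1) / n)

# Jarque-Bera statistic: combines sample skewness and kurtosis
m    <- mean(x)
m2   <- mean((x - m)^2)
skew <- mean((x - m)^3) / m2^1.5
kurt <- mean((x - m)^4) / m2^2
jb   <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
jb_p <- pchisq(jb, df = 2, lower.tail = FALSE)

print(c(ks_statistic = d, jarque_bera = jb, jb_p_value = jb_p))
```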
