
Thread: [FAQ] How do I remove or deal with outliers?

  1. #1
    TheEcologist

    How do I remove or deal with outliers?



    Removing outliers can make your data look more normally distributed, but contrary to what is sometimes assumed, outlier removal is subjective: there is no truly objective way of deciding which points to remove.

    The problem, as always, is what the heck does one mean by 'outlier' in these
    contexts. Seems to be like pornography -- "I know it when I see it."
    -- Berton Gunter (quoting Justice Potter Stewart in a discussion about tests
    for outliers)
    R-help (April 2005)


    Always remember that these points are still observations, and you should not throw them out on a whim. You need good reasons to remove an outlier. There are valid ones: measurement errors, incorrectly entered data points, or values that are impossible in real life. If you believe an outlier is an erroneous data point and you can verify this, then feel free to remove it.

    On the other hand, if you have no reason to think your outliers are erroneous measurements, then there is no truly objective way to remove them. They are true observations, and you may have to accept that the assumptions of your test do not match the reality of your situation. You could try a non-parametric test (these are generally less sensitive to outliers) or some other analysis that does not assume your data are normally distributed.
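
    To illustrate why rank-based (non-parametric) methods tend to be less sensitive to outliers, here is a minimal Python sketch. The data are made up for illustration, and the choice of Spearman's rank correlation versus Pearson's correlation is just one convenient example, not a method prescribed in this FAQ. The helper names (pearson, ranks, spearman) are hypothetical, and the ranking helper assumes there are no ties.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up data: y increases with x except for one extreme final value.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 7.0, 8.1, 9.0, -50.0]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

    With these numbers the single extreme point is enough to make the Pearson correlation negative, while the rank-based measure stays positive, because the extreme value counts only as "the smallest rank" rather than by its magnitude.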

    Here's some more online help on the topic here.

    If you're still having trouble with this topic feel free to start a thread on the forum, and be sure to check out our guidelines for efficient posting.
    Last edited by TheEcologist; 03-12-2014 at 11:06 AM.
    The true ideals of great philosophies always seem to get lost somewhere along the road..

  2. #2
    noetsi

    Re: [FAQ] How do I remove or deal with outliers?

    I have worked on this topic because a recent comp question dealt with it. One piece of advice that makes sense to me is that it is legitimate to remove outliers when they totally distort the analysis of the data (for example, by significantly distorting the measure of central tendency you are using). To me, you would not want a single point or a few points to completely change the results.
    "Nobody in their right mind thinks they'll ever get forecasts correct." Dason

  3. #3
    TheEcologist

    Re: [FAQ] How do I remove or deal with outliers?

    I wouldn't completely agree with that logic. It amounts to adjusting reality to fit your model (and hence your choice of statistic for central tendency). If the points are true observations and not errors, you should be adjusting your model (and the corresponding distribution and measure of central tendency) instead. If the mean is influenced too strongly by a few points that are real observations, you should switch to a measure that is more appropriate for your data. Often a robust measure like the median will work much better, without your having to resort to adjusting reality to fit your model.

    One of the best examples of this, which you will often see in politics, is mean versus median income.

    Certain political parties/institutes like to quote the mean income as a measure of prosperity; "the people can see that the mean income in our country has increased, and hence we are doing a darn good job!".
    In most countries, if not all, the mean income gives a distorted view of reality because the very few extremely rich have a strong influence on it. As a result, the mean household income can increase while the majority of people get poorer (a standard scenario in many third-world countries).

    If you instead use the median income as your measure (the amount that divides the income distribution into two equal groups, half with income above it and half below), you get a much better picture of what is going on, because your measure of central tendency fits your data better: you are no longer making the implicit assumption that incomes are distributed evenly and symmetrically.
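
    The mean-versus-median point can be sketched with a toy example in Python. The income figures below are invented purely for illustration (ten households, incomes in thousands) and do not come from any real country.

```python
from statistics import mean, median

# Hypothetical household incomes (in thousands) for two consecutive years.
# Eight of the ten households earn less in year 2, one is unchanged,
# and the richest household earns far more.
year_1 = [20, 22, 25, 27, 30, 32, 35, 40, 60, 500]
year_2 = [18, 20, 23, 25, 28, 30, 33, 38, 60, 900]

mean_1, mean_2 = mean(year_1), mean(year_2)
median_1, median_2 = median(year_1), median(year_2)

print(f"Mean income:   {mean_1:.1f} -> {mean_2:.1f}")
print(f"Median income: {median_1:.1f} -> {median_2:.1f}")

# Count how many households are worse off in year 2.
worse_off = sum(b < a for a, b in zip(year_1, year_2))
print(f"Households worse off: {worse_off} of {len(year_1)}")
```

    In this toy data the mean rises sharply because of a single household, while the median falls, which matches what most households actually experienced.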


    For instance, look at the difference between mean and median income for a number of countries, which gives an indication of income inequality.
    http://en.wikipedia.org/wiki/Mean_household_income

    I was wondering, however, what people here think of using Chebyshev's inequality to identify outliers?
    Last edited by TheEcologist; 05-02-2013 at 02:16 AM.

  5. #4
    hlsmith

    Re: [FAQ] How do I remove or deal with outliers?

    I enjoyed this thread. I remember that in my courses we were always removing outliers. Then I got my first real-life dataset and started removing outliers there too. But I realized I was removing real people's data, and in real life it is not right to remove observations just to make your data fit the model and its measures better. You have to try to make the model fit the data because, like life, prediction models are not perfect.

  7. #5
    hlsmith

    Re: [FAQ] How do I remove or deal with outliers?


    I have never used Chebyshev's inequality for outliers. I have always used a cutoff in standard deviations based on the normal distribution (perhaps Chebyshev's would have been better at times). I know Chebyshev's inequality applies to almost any distribution, and I have been under the assumption that if you know the distribution, you should use that instead of Chebyshev. Are there certain times when Chebyshev's inequality is best?
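
    For reference, Chebyshev's inequality states that for any distribution with finite mean and variance, the probability of an observation lying k or more standard deviations from the mean is at most 1/k^2. The Python sketch below compares that bound with the exact two-sided tail probability under a normal distribution. The function names are hypothetical, and this is only a comparison of the two cutoffs, not a recommended outlier rule.

```python
import math

def chebyshev_bound(k):
    """Upper bound on P(|X - mu| >= k*sigma) for any distribution
    with finite mean and variance (capped at 1)."""
    return min(1.0, 1.0 / k ** 2)

def normal_tail(k):
    """Exact P(|Z| >= k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))

for k in (2, 3, 4):
    print(f"k = {k}: Chebyshev bound = {chebyshev_bound(k):.4f}, "
          f"normal tail = {normal_tail(k):.6f}")
```

    The Chebyshev bound is much looser than the normal tail at every k, so a cutoff derived from it flags far fewer points at the same nominal level. That is the price of making no distributional assumption, and it is consistent with the idea above of using the actual distribution when it is known.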
