[FAQ] How do I remove or deal with outliers?


Global Moderator
How do I remove or deal with outliers?

Removing outliers can make your data look more normal, but contrary to what is sometimes perceived, outlier removal is subjective: there is no truly objective way of removing outliers.

The problem, as always, is what the heck does one mean by 'outlier' in these
contexts. Seems to be like pornography -- "I know it when I see it."
-- Berton Gunter (quoting Justice Potter Stewart in a discussion about tests
for outliers)
R-help (April 2005)

Always remember that these points remain observations, and you should not throw them out on a whim. Instead, you should have good reasons to remove your outliers. There may be many truly valid reasons to remove data points, including outliers caused by measurement errors, incorrectly entered values, or values that are impossible in real life. If you believe that any outliers are erroneous data points and you can validate this, then you should feel free to remove them.
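As a minimal sketch of the "impossible values" case above: suppose (hypothetically) the variable is human age, so anything outside a plausible range must be a data-entry error. The data and cutoffs here are made up purely for illustration.

```python
# Made-up ages; -3 and 999 are impossible for a human and so must be
# data-entry errors -- the one case where removal needs no debate.
ages = [34, 29, 41, -3, 999, 57]

# Keep only values inside an assumed plausible range [0, 125].
valid_ages = [a for a in ages if 0 <= a <= 125]
print(valid_ages)  # [34, 29, 41, 57]
```

Note that this only removes values that could not possibly be correct; it says nothing about extreme-but-possible observations, which is the harder case discussed below.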

On the other hand, if you see no reason why your outliers would be erroneous measurements, then there is no truly objective way to remove them. They are true observations, and you may have to consider that the assumptions of your test do not correspond to the reality of your situation. You could always try a non-parametric test (which, in general, is less sensitive to outliers) or some other analysis that does not require the assumption that your data are normally distributed.
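One way to see why rank-based (non-parametric) methods are less sensitive to outliers: ranks depend only on the ordering of the data, so making the largest observation far more extreme changes the mean drastically but leaves every rank untouched. The numbers below are invented for illustration.

```python
from statistics import mean

def ranks(xs):
    """Return the rank (1 = smallest) of each element of xs."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

sample = [2.1, 3.4, 2.9, 3.0, 3.8]
with_outlier = [2.1, 3.4, 2.9, 3.0, 380.0]  # last point made extreme

print(mean(sample), mean(with_outlier))    # mean jumps from ~3.04 to ~78.3
print(ranks(sample), ranks(with_outlier))  # ranks are identical
```

A rank-based test statistic built from these ranks would therefore give the same answer with or without the extreme point, while a mean-based statistic would not.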


If you're still having trouble with this topic feel free to start a thread on the forum, and be sure to check out our guidelines for efficient posting.


No cake for spunky
I have worked on this topic because a recent comp question dealt with it. One piece of advice that makes sense to me about when it is legitimate to remove outliers: when they totally distort the analysis of the data (for example, significantly distorting the measure of central tendency you are using), it is legitimate to remove them. To me, you would not want to completely change the results because of a single point or a few points.


Global Moderator
I wouldn't completely agree with that logic. It seems like adjusting reality to your model (and hence to your choice of statistic for central tendency). If they are true observations and not erroneous, you should be adjusting your model (and the corresponding distribution with its measure of centre). If the mean is influenced too strongly by a few points (that are real observations), you should switch to a measure that is more appropriate for your data. Often a robust measure like the median will work much better, without you having to resort to adjusting reality to your model.

One of the best examples of this, which you will often see in politics, is the mean versus the median income.

Certain political parties/institutes like to quote the mean income as a measure of prosperity; "the people can see that the mean income in our country has increased, and hence we are doing a darn good job!".
In most countries, if not all, the mean income gives a distorted view of reality, because the very few extremely rich have a strong influence on it. This can lead to a situation where the mean household income increases while the majority of people get poorer (a standard scenario in many third-world countries).

Instead, if you use the median income as your measure (the amount that divides the income distribution into two equal groups, half having income above that amount and half having income below it), you get a much better picture of what is going on: your measure of central tendency fits your data better, and you are not making the implicit assumption that incomes are distributed evenly and symmetrically.

For instance, look at the difference between mean and median income for various countries of the world, which provides a measure of income inequality.
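The gap between mean and median under a right-skewed income distribution can be seen with a tiny made-up example (incomes in thousands; one household is extremely rich):

```python
from statistics import mean, median

# Invented incomes for ten households; the last one is extremely rich.
incomes = [20, 22, 25, 27, 30, 31, 33, 35, 40, 2000]

print(mean(incomes))    # 226.3 -- far above what most households earn
print(median(incomes))  # 30.5  -- half earn less, half earn more
```

Here the mean is pulled to a value larger than nine of the ten incomes, while the median still describes the typical household.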

I was wondering, however, what people here think of using Chebyshev's inequality to identify outliers?


Less is more. Stay pure. Stay poor.
I enjoyed this thread. I remember that in my courses we were always removing outliers. Then I got my first real-life dataset and started removing outliers, but I realized I was removing real people's data. In actual life it is not right to remove observations to make your data better fit the model and its measures; you have to attempt to make the model fit the data, because, like life, prediction models are not perfect.


Less is more. Stay pure. Stay poor.
I have never used Chebyshev's inequality for outliers. I have always used standard deviations from the standard normal distribution (perhaps Chebyshev's would have been better at times). I know Chebyshev's inequality applies to any distribution with finite variance, and I have been under the assumption that if you know the distribution, you should use it instead of Chebyshev's. Are there certain times when Chebyshev's inequality is best?
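To make the comparison in the question above concrete: Chebyshev's inequality guarantees P(|X − μ| ≥ kσ) ≤ 1/k² for any distribution with finite variance, whereas under an assumed normal distribution the same tail probability is far smaller. A quick sketch of the two tail figures for k = 2 and k = 3:

```python
from math import erf, sqrt

def normal_two_sided_tail(k):
    # P(|Z| >= k) for a standard normal Z, via the error function
    return 2 * (1 - 0.5 * (1 + erf(k / sqrt(2))))

def chebyshev_bound(k):
    # Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2, for ANY finite-variance distribution
    return 1.0 / k**2

for k in (2, 3):
    print(k, round(normal_two_sided_tail(k), 4), chebyshev_bound(k))
# k=2: normal tail ~0.0455 vs Chebyshev bound 0.25
# k=3: normal tail ~0.0027 vs Chebyshev bound ~0.111
```

So a "2 standard deviations" rule flags far fewer points under normality than Chebyshev's worst-case bound allows for, which is why Chebyshev-based cutoffs are very conservative; their appeal is precisely that they hold when you do not know the distribution.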