confusion on outliers

#1
I am not able to work out how to identify outliers: when should I go with the std. dev., and when with the median?
My understanding of the std. dev. approach is that if a data point is more than 2 std. dev. away from the mean, we consider it an outlier.
Similarly for the median, we say that any data point that is not between Q1 and Q3 is again an outlier.

So I am confused about which one to choose.

Can you guys help me understand?
 

hlsmith

Not a robit
#2
What is the context behind this question? Do you have your own data sample, or is this for a course/class? There is also the scenario where you really don't have any outliers. The best way to look for them is to graph the data, via histograms, boxplots, and Q-Q plots. But just because an observation is >/< 2 SD doesn't mean it needs to be removed or addressed.
 
#3
I don't have any data set/dump.
The basis of my question is that when the data has outliers, we can visualize it to look for any inconsistencies.
I didn't understand the other part. What do you mean by "just because an observation is >/< 2 SD doesn't mean it needs to be removed or addressed"? Can you elaborate?
I also understand that it depends on the context of the data, but I am not sure how to decide what should be used when. I think a real-world scenario would help.
 

Karabiner

TS Contributor
#4
It is natural, even expected, to find values 2 SD away from the mean. For example, for a normally distributed variable,
about 5% of the values lie more than 2 SD from the mean. So declaring them "outliers" and maybe even removing them seems a bit strange,
since they are perfectly valid values.
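
To put a number on the "about 5%" figure, here is a minimal sketch that computes the two-sided tail probability of a normal distribution from the complementary error function. The function name `normal_tail_fraction` is made up for this illustration.

```python
import math

def normal_tail_fraction(k: float) -> float:
    """Probability that a normal variable lies more than k SD from its mean (both tails)."""
    return math.erfc(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"beyond {k} SD: {normal_tail_fraction(k):.4f}")
# beyond 1 SD: 0.3173
# beyond 2 SD: 0.0455
# beyond 3 SD: 0.0027
```

So in a perfectly normal sample of 1,000 values, roughly 45 are expected beyond 2 SD with nothing wrong with any of them.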

In exploratory data analysis (box-and-whisker plots) there are some suggestions on how to classify "outliers" or "extremes",
so maybe you could do some research there?
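
One widely used box-plot convention, which may or may not be the one meant above, flags points lying more than 1.5 × IQR below Q1 or above Q3. The sketch below assumes that convention; the sample data and the helper names `tukey_fences` and `flag_outliers` are made up, and quartile definitions differ slightly between software packages.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]

data = [12, 13, 13, 14, 15, 15, 16, 17, 18, 40]  # made-up sample
print(tukey_fences(data))   # (8.0, 22.0)
print(flag_outliers(data))  # [40]
```

Note that 12, 13, 17 and 18 all sit outside the Q1 to Q3 range (13.25 to 16.75) without being flagged. By construction about half of any sample lies outside Q1 to Q3, so that range alone cannot define an outlier.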

With kind regards

Karabiner
 

hlsmith

Not a robit
#5
@Karabiner hit the concept I was alluding to. There will always be observations 2 SD from the mean. If you remove them, the values right next to them then become the new 2 SD values.
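
A rough simulation of that effect, using simulated normal data (not a real data set) and a hypothetical helper `trim_beyond_2sd`: each round of trimming shrinks the SD, so the next round finds a fresh batch of "outliers".

```python
import random
import statistics

def trim_beyond_2sd(values):
    """Drop every value more than 2 SD from the mean of the current sample."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= 2 * sd]

rng = random.Random(0)  # fixed seed so the run is repeatable
sample = [rng.gauss(0, 1) for _ in range(10_000)]  # simulated data

removed_per_round = []
current = sample
for _ in range(3):
    trimmed = trim_beyond_2sd(current)
    removed_per_round.append(len(current) - len(trimmed))
    current = trimmed

print(removed_per_round)  # every round still finds points to remove
```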

It is important to review outliers to see if they are erroneous values or not. If they are legitimate values (not mislabeled or incorrect), you can bias your statistics if you remove them. The example I always use now is the Flint, Michigan water crisis in the United States. They were trimming outliers, which were real water quality values. In trimming these water quality values, they ended up missing that the water was beyond safe limits. Many times, if you have real outliers, this may just mean you need to use other analytics or double-check that the assumptions of your analytics are met (e.g., the concept of leverage in linear regression).
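
To illustrate how trimming legitimate high values can hide a problem, here is a small sketch. The readings and the limit are invented for the example and are not the Flint data; the 90th percentile is just one possible summary, and `percentile_90` is a hypothetical helper.

```python
import statistics

# Hypothetical readings and a hypothetical safety limit, for illustration only.
readings = [3, 4, 4, 5, 6, 7, 8, 9, 22, 27]
LIMIT = 15

def percentile_90(values):
    """90th percentile, linearly interpolated between order statistics."""
    return statistics.quantiles(values, n=10, method="inclusive")[-1]

full = percentile_90(readings)          # all readings kept
trimmed = percentile_90(readings[:-2])  # two highest readings dropped as "outliers"

print(f"full sample: {full}  exceeds limit: {full > LIMIT}")
print(f"trimmed:     {trimmed}  exceeds limit: {trimmed > LIMIT}")
```

With every reading kept, the summary is above the limit; with the two highest readings trimmed, it falls below it, even though nothing about the underlying situation has changed.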
 

Miner

TS Contributor
#6
There are a number of statistical tests for outliers, but do not rush to discard outliers. They raise the following questions:
  • Is the measurement process stable?
  • Is the distribution or model wrong?
  • Is some transformation required?
  • Is there an identifiable subset of observations that is important in its different behavior?
Do not discard the outlier unless you can show that there was an error such as transposing digits, a measurement error, recording error, sampling error, etc. It is better to correct the error or use robust statistics. Here is a good and short article on outliers.

1574260583963.png
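
As a sketch of what "robust statistics" can look like in practice, the example below compares the mean and SD with the median and the median absolute deviation (MAD) on made-up values where one entry has transposed digits. Median and MAD are only one choice of robust summary, and `mad` is a helper written for this illustration.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

clean = [15.0, 15.1, 15.2, 15.3, 15.4]       # made-up measurements
with_error = [15.0, 15.1, 51.2, 15.3, 15.4]  # 15.2 recorded as 51.2

for name, data in (("clean", clean), ("with error", with_error)):
    print(
        f"{name:>10}: mean={statistics.fmean(data):.2f} "
        f"sd={statistics.stdev(data):.2f} "
        f"median={statistics.median(data):.2f} mad={mad(data):.2f}"
    )
```

The single bad value pulls the mean from about 15.2 to about 22.4 and inflates the SD, while the median and MAD barely move. This is also one way to think about the original question: the mean/SD pair is itself sensitive to the extreme values it is being used to detect, whereas median-based summaries are not.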
 