# Determining outliers

#### Moemanofnj

##### New Member
Dear folks,
I am interested in statistics, but I never had a course in college and I've had to learn a few basic things myself. This time, I am working on something that is beyond my comprehension. I have a set of environmental data (20 soil samples) which I am playing around with. (See below.) I need to determine the outliers. It appears to me that most methods require a normal or log-normal distribution. The free EPA ProUCL software tells me that my data is not normally distributed. (Looking at the graphs, I suspect that the highest 3 values are causing this?) If the data is not normal, do I need to use non-parametric methods to determine outliers? What free software can I use for this purpose? Or is there a method to justify a normal or log-normal distribution? If I take 3 new samples in the field, at locations next to those with the highest values, and prove consistency, will that help as a last resort? I would greatly appreciate any opinions and suggestions.

6,080
6,420
6,490
6,640
7,160
7,390
7,650
7,740
7,760
8,170
8,290
8,390
8,710
8,930
9,240
9,310
9,670
13,700
14,700
15,400

#### Miner

##### TS Contributor
The typical outlier tests such as Grubbs' or Dixon's will not detect the outliers due to masking (the other high values inflate the standard deviation, so no single point looks extreme). Try the Generalized Extreme Studentized Deviate (ESD) Test, which does detect 3 outliers. See attached pic. Trial 1 is the standard Grubbs' test (single outlier), which was fooled by masking. Trial 2 looks for 2 outliers and trial 3 looks for 3 outliers, both of which were detected.
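For anyone who wants to reproduce this outside Excel, here is a minimal sketch of the generalized ESD procedure in plain Python, run on the 20 values from the first post. The significance level `alpha=0.05` and the upper bound `max_outliers=5` are my assumptions, not settings taken from the attached picture. The standard library has no Student's t quantile, so `t_ppf` below is a hand-rolled numerical helper (Simpson's rule plus bisection); with SciPy installed, `scipy.stats.t.ppf` would normally replace it.

```python
import math
import statistics

aluminum_ppm = [6080, 6420, 6490, 6640, 7160, 7390, 7650, 7740, 7760, 8170,
                8290, 8390, 8710, 8930, 9240, 9310, 9670, 13700, 14700, 15400]

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=400):
    """CDF for x > 0, integrating the density on [0, x] with Simpson's rule."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 + total * h / 3

def t_ppf(p, df):
    """Quantile for p > 0.5, found by bracketing and then bisection."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def generalized_esd(values, max_outliers, alpha=0.05):
    """Return the values flagged as outliers by the generalized ESD test."""
    data = list(values)
    n = len(data)
    removed = []
    n_outliers = 0
    for i in range(1, max_outliers + 1):
        mean = statistics.mean(data)
        sd = statistics.stdev(data)
        # Most extreme remaining observation and its studentized deviate R_i.
        extreme = max(data, key=lambda x: abs(x - mean))
        r_i = abs(extreme - mean) / sd
        # Critical value lambda_i for the i-th step.
        p = 1 - alpha / (2 * (n - i + 1))
        t = t_ppf(p, n - i - 1)
        lam = (n - i) * t / math.sqrt((n - i - 1 + t * t) * (n - i + 1))
        removed.append(extreme)
        data.remove(extreme)
        # The outlier count is the largest i with R_i > lambda_i.
        if r_i > lam:
            n_outliers = i
    return removed[:n_outliers]

single = generalized_esd(aluminum_ppm, max_outliers=1)
several = generalized_esd(aluminum_ppm, max_outliers=5)
print("testing for 1 outlier:", single)
print("testing for up to 5 outliers:", several)
```

Testing for a single outlier flags nothing, while allowing up to five flags the three highest values, which matches the pattern in the three trials described above.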

Of greater import is what you decide to do with this information. Is this indicative of bad test results, or a mixture of different populations?

See http://www.real-statistics.com/stud...generalized-extreme-studentized-deviate-test/ for guidance on how to do this in Excel.

#### CE479

##### New Member
Hi,

Welcome (I saw your intro post elsewhere)!

What do the numbers relate to? Are the high values plausible, or could they be in error? Would you have expected the data to be normally distributed based on the work of others?

#### Miner

##### TS Contributor
Also, a simple boxplot, while less precise, will show these as potential outliers.
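The boxplot rule can be checked without drawing anything. This is a minimal sketch of the usual Tukey fences (1.5 times the interquartile range beyond the quartiles) on the 20 values from the first post. The multiplier `k=1.5` and the default quartile method of `statistics.quantiles` are my assumptions; other software may compute quartiles slightly differently.

```python
import statistics

aluminum_ppm = [6080, 6420, 6490, 6640, 7160, 7390, 7650, 7740, 7760, 8170,
                8290, 8390, 8710, 8930, 9240, 9310, 9670, 13700, 14700, 15400]

def tukey_fences(values, k=1.5):
    """Lower and upper boxplot fences: quartiles -/+ k * IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = tukey_fences(aluminum_ppm)
flagged = [v for v in aluminum_ppm if v < low or v > high]
print(low, high)   # 4105.0 12405.0
print(flagged)     # [13700, 14700, 15400]
```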

#### Moemanofnj

##### New Member
This is very helpful, Miner. And thanks for the reference that I can study further. The 20 values are good sampling results (a naturally occurring metal at concentrations in parts per million) and I don't believe they are from different populations. (Samples were taken at 10 boring locations, all from similar natural soil material.) I may just have missed taking samples at locations that might have yielded results between 9,670 and 13,700? I am attempting to prove that all of these (including the highest 3) can be considered natural background levels in soil, as opposed to an actual discharge of contaminants (outliers).

#### Moemanofnj

##### New Member
Thanks, CE479. The 3 high values are good results from actual lab analysis of aluminum in soil. I expected the data to be normally distributed based on the field observations during collection of my soil samples (homogeneous, uncontaminated, naturally occurring soils).

#### Miner

##### TS Contributor
Keep an open mind about the possibility of two populations. Note the attached probability plots. Sometimes outliers are trying to tell us that our assumptions are wrong.
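One rough way to put a number on what a normal probability plot shows is the correlation between the sorted data and the corresponding normal scores: the closer to 1, the straighter the plot. The sketch below is only an illustration, not the method behind the attached plots. The Blom plotting positions `(i - 0.375) / (n + 0.25)` and the split into "lowest 17" versus "all 20" values are my assumptions.

```python
import math
from statistics import NormalDist, mean

aluminum_ppm = [6080, 6420, 6490, 6640, 7160, 7390, 7650, 7740, 7760, 8170,
                8290, 8390, 8710, 8930, 9240, 9310, 9670, 13700, 14700, 15400]

def probability_plot_correlation(values):
    """Pearson correlation between sorted data and standard normal scores."""
    xs = sorted(values)
    n = len(xs)
    # Normal scores at Blom plotting positions (an assumed convention).
    scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
              for i in range(1, n + 1)]
    mean_x, mean_s = mean(xs), mean(scores)
    cov = sum((x - mean_x) * (s - mean_s) for x, s in zip(xs, scores))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_s = sum((s - mean_s) ** 2 for s in scores)
    return cov / math.sqrt(var_x * var_s)

r_all = probability_plot_correlation(aluminum_ppm)
r_lower = probability_plot_correlation(sorted(aluminum_ppm)[:17])
print(f"all 20 values: r = {r_all:.3f}")
print(f"lowest 17 values: r = {r_lower:.3f}")
```

If the lowest 17 values line up noticeably better on their own than all 20 do together, that is consistent with the high values belonging to a second group, though with only three points in that group it is far from proof.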

#### Moemanofnj

##### New Member
Yes, these probability plots and the box plot do show different populations. Many thanks. I will probably need to collect more samples now to see what's going on at this site.

#### Exorcist

##### New Member
I don't want to start a new thread about dealing with outliers, so I'll just post this here.
How do you feel about deleting observations that are deemed 'outliers'? I've recently been working with a dataset containing financial data over many years. Naturally, outliers are going to show up. Not really having much practical experience in the field of statistics, I'm not entirely sure how to approach the issue at hand. Some suggest simply deleting whichever observations prevent the model from being a better fit. Others say outliers should by no means simply be deleted. Granted, by simply cutting these off (using whatever criterion, such as Cook's distance), I could massively improve the model's fit. But I'm not entirely sure that it serves a purpose. As it's company panel data, the 'outliers' really aren't data errors as such - some firms can simply differ quite a bit when it comes to financial status.
So far common logic has prevented me from simply removing observations that don't meet certain criteria, such as Cook's D < 1/n, especially if those outliers aren't absolutely extreme and don't really seem to affect the results of the regression too much. Since you people are much wiser and more experienced, any comments regarding the deletion of outliers would be welcome.

#### noetsi

##### Fortran must die
This is a hotly debated topic among theorists; personally, I think most practitioners just delete them. All agree that if there is a mistake (you coded something wrong, and this caused the outliers), you should get rid of them. But if the outlier reflects a real data point, some feel you should not delete it and some feel you should.

My own view of outliers is that it depends on why they exist and what analysis you are running. Commonly I run outlier analysis because we want to know what outliers exist, and why. Obviously in this case you are not going to get rid of them; the whole point of the project is to identify them. When it comes to analysis like regression, my own sense is that you should not let a small number of points distort the results, so removing them makes sense. However, with the large number of results I have (thousands of points), individual outliers will not distort the results. The problem is that a set of outliers might, and I have not found a methodology for determining when a group of points influences the regression line, as compared to a single point. Also, rules of thumb such as 3 standard deviations for extreme outliers, or the Tukey boxplot, are influenced by the shape of the distribution, such as skew, which the outliers themselves may cause.

Robust regression is one solution, although I am not sure which of the many approaches to use (and they only work with an interval DV). Another, easier solution is to run the model with and without the outlier and show both results.
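As a minimal sketch of the "with and without" approach, here is a simple least-squares fit run twice in plain Python. The numbers are made up purely for illustration (they are not from anyone's data in this thread), with the last point playing the role of the suspect observation; `ols_fit` is a hypothetical helper name.

```python
def ols_fit(xs, ys):
    """Least-squares (intercept, slope) for the line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Invented example data; the final point is the suspected outlier.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 30.0]

fit_all = ols_fit(xs, ys)
fit_trimmed = ols_fit(xs[:-1], ys[:-1])

print("with suspect point:    intercept=%.2f slope=%.2f" % fit_all)
print("without suspect point: intercept=%.2f slope=%.2f" % fit_trimmed)
```

Reporting both fits side by side lets the reader judge how much the conclusion depends on the suspect point, without the analyst having to decide silently whether to delete it.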