Is there a test to see if a sample (one result, many parameters) is an outlier?

#1
I am posting here because I am not sure really how to look for my answer. I am used to comparing means of two samples, but what happens when you have only one sample but 20 parameters? How do you determine if this is significantly outside the data cloud for a data set that has about 100 samples (each with a value for said 20 parameters)? Any help about how to approach this problem, especially the names of methods that could be used would be helpful.

Here is more background on this question. I apologize for the length. Let's assume that an untested groundwater well (well x) is near a contamination source. We have data on 20 water quality parameters for 100 wells in the area. The idea is that if, for one of the parameters, well x has a value that is one standard deviation above the mean for the 100 wells, this would not indicate contamination - the one-parameter sample is not outside of the 95% confidence interval. However, if we have 5 parameters for well x that are each one standard deviation above the respective means for the 100 wells, then we could multiply the probabilities of these test values being above the 100-well baseline. For example: 0.16^5 = 0.0001048576. In other words, we would expect to see a sample exceed the baseline mean by one standard deviation for these five parameters at the same time in only 105 cases out of a million.

Does this logic make sense?
Also some of these 20 parameters are highly correlated and several would violate the assumptions of a normal distribution.
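
The multiplication step above only holds if the five parameters are independent, which is exactly where the correlation concern bites. A rough simulation sketch on synthetic standard-normal data (not the well data) shows the effect; the correlation of 0.8 and the name `joint_exceedance` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_params = 500_000, 5

def joint_exceedance(rho):
    """Fraction of simulated samples where ALL parameters exceed mean + 1 SD."""
    # Equicorrelated standard normals: one shared factor plus independent noise.
    shared = rng.standard_normal((n_draws, 1))
    noise = rng.standard_normal((n_draws, n_params))
    z = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise
    return float(np.mean(np.all(z > 1.0, axis=1)))

p_independent = joint_exceedance(0.0)   # parameters unrelated to each other
p_correlated = joint_exceedance(0.8)    # assumed strong correlation, illustration only

print(f"product rule 0.16^5       : {0.16 ** 5:.6f}")
print(f"simulated, independent    : {p_independent:.6f}")
print(f"simulated, correlation 0.8: {p_correlated:.6f}")
```

With independent parameters the simulated frequency should land close to the product rule (the exact one-sided tail is nearer 0.159 than 0.16). With strongly correlated parameters the joint exceedance is far more common, so the product would overstate how unusual the well is.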

Any help is much appreciated!!!
 

rogojel

TS Contributor
#2
Hi,
one possibility could be to look at the Mahalanobis distance. If you use Minitab, it is under Multivariate / Principal Components / Graphs, where it is called the outlier plot.
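
For readers without Minitab, here is a minimal sketch of the squared Mahalanobis distance of one sample from a baseline cloud. The data are synthetic stand-ins (100 wells, 3 correlated parameters), and `mahalanobis_sq`, `typical_well` and `odd_well` are illustrative names, not anything from Minitab.

```python
import numpy as np

def mahalanobis_sq(x, baseline):
    """Squared Mahalanobis distance of sample x from the baseline rows."""
    mean = baseline.mean(axis=0)
    cov = np.cov(baseline, rowvar=False)      # sample covariance, n-1 denominator
    diff = x - mean
    # Solve cov @ v = diff instead of inverting the covariance explicitly.
    return float(diff @ np.linalg.solve(cov, diff))

rng = np.random.default_rng(1)
cov_true = np.array([[1.0, 0.8, 0.2],
                     [0.8, 1.0, 0.3],
                     [0.2, 0.3, 1.0]])
baseline = rng.multivariate_normal([10.0, 5.0, 2.0], cov_true, size=100)

typical_well = baseline.mean(axis=0)          # sits at the centre of the cloud
odd_well = np.array([11.0, 4.0, 2.0])         # each value about 1 SD off, but against the correlation

print("typical well d^2:", mahalanobis_sq(typical_well, baseline))
print("odd well d^2    :", mahalanobis_sq(odd_well, baseline))
```

The second well is only about one standard deviation from the mean on each parameter, yet its distance is large because it moves against the correlation between the first two parameters. That is the kind of case that parameter-by-parameter checks miss.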
 
#3
Thanks rogojel!

I looked at my data in Minitab and played with the outlier plot option in the PCA. This is quite close to what I am looking for - So thanks for your help. A few complications still exist though, and I'd be happy to have feedback from anyone who might help me with the following questions:

1. The Mahalanobis distance was created for multivariate normal data. Is there a test that could be done for a dataset with parameters having a variety of distributions?

2. Mahalanobis distance (at least as implemented in Minitab) does not seem to deal well with missing values. Only about half of the rows are returned in the output (after the Mahalanobis distance is calculated). Does this mean that in order to analyze a row (in this case, a well), one would have to remove all parameters having blank values before running the analysis?

3. Is there a way to compare one test well with the large group of test wells without including the test well in with the rest of the data and then running the PCA and Mahalanobis distance? I suppose the question is: can I do a multivariate hypothesis test to see if the sample well is outside of the multivariate space described by all the baseline tests? Again, the sample well has only one measurement for each of about 20 different parameters.
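
One way to picture questions 2 and 3 together, offered as a sketch under assumptions rather than a recommendation: score the test well against the baseline without adding it to the baseline, using only the parameters that were actually measured for that well, and only the baseline wells that have all of those parameters. The function name `d2_observed_only` and the tiny data set are hypothetical.

```python
import numpy as np

def d2_observed_only(x, baseline):
    """Squared Mahalanobis distance of x using only the parameters measured in x.

    Returns (distance squared, number of parameters used, number of baseline wells used).
    """
    observed = ~np.isnan(x)                        # parameters the test well has
    sub = baseline[:, observed]
    complete = sub[~np.isnan(sub).any(axis=1)]     # baseline wells with all of those parameters
    diff = x[observed] - complete.mean(axis=0)
    cov = np.atleast_2d(np.cov(complete, rowvar=False))
    d2 = float(diff @ np.linalg.solve(cov, diff))
    return d2, int(observed.sum()), len(complete)

# Tiny made-up example: the third parameter is missing for the test well
# and for two of the baseline wells.
baseline = np.array([[0.0, 0.0, np.nan],
                     [2.0, 0.0, 1.0],
                     [0.0, 2.0, np.nan],
                     [2.0, 2.0, 5.0]])
test_well = np.array([3.0, 1.0, np.nan])

d2, n_params_used, n_wells_used = d2_observed_only(test_well, baseline)
print(d2, n_params_used, n_wells_used)
```

The number of parameters used matters later, because any cutoff for the distance depends on it. This sketch simply discards incomplete rows for the chosen parameters; imputing missing values would be a different approach with its own assumptions.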

Thanks for your help!
 

rogojel

TS Contributor
#4
hi sparrow,

a quick partial answer:

concerning non-normality, that would not worry me much. If your data was skewed it would only mean that you will get some false alarms, but you have to examine each outlier anyway to decide whether it is really a problem point or not so you have a chance of recognizing those false alarms.
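
If skewness is a worry, one option that does not lean on normality (my own illustration, not something Minitab provides as far as I know) is to compare the test well's distance with the distances of the baseline wells themselves, each computed with that well left out. The sketch uses synthetic skewed data; `empirical_p_value` is a hypothetical name and the result is only an approximate, rank-based p-value.

```python
import numpy as np

def d2(x, ref):
    """Squared Mahalanobis distance of x from the rows of ref."""
    diff = x - ref.mean(axis=0)
    return float(diff @ np.linalg.solve(np.cov(ref, rowvar=False), diff))

def empirical_p_value(x, baseline):
    """Share of baseline wells that look at least as extreme as x."""
    n = len(baseline)
    # Leave-one-out distances: each baseline well is scored against the other n-1.
    loo = np.array([d2(baseline[i], np.delete(baseline, i, axis=0)) for i in range(n)])
    # The +1 terms keep the p-value from being exactly zero.
    return float((np.sum(loo >= d2(x, baseline)) + 1) / (n + 1))

rng = np.random.default_rng(2)
baseline = rng.lognormal(mean=0.0, sigma=0.5, size=(100, 4))   # skewed synthetic data

central_well = np.median(baseline, axis=0)
extreme_well = 3.0 * baseline.max(axis=0)

p_central = empirical_p_value(central_well, baseline)
p_extreme = empirical_p_value(extreme_well, baseline)
print(p_central, p_extreme)
```

With 100 baseline wells the smallest p-value this can report is 1/101, so it cannot support claims much stronger than "more extreme than every baseline well".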
 
#5
Thanks rogojel. Do any other forum members have any suggestions for this issue? It sure would be great to have someone give us the data from their well and be able to tell them that there is a 99% probability that their well is outside of the "normal" baseline dataset, and that therefore it might be considered polluted or at least in need of further analysis. I like rogojel's idea to use Mahalanobis distance, but it doesn't have a confidence interval or other such measure associated with it. And I don't think that most people know how to interpret a raw Mahalanobis distance value.
 

rogojel

TS Contributor
#6
hi sparrow,
I believe that getting a confidence limit for the Mahalanobis distance is not a big problem. IIRC the squared distance is approximately chi-squared distributed, with degrees of freedom equal to the number of parameters.
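
As a sketch of how the chi-squared approximation turns a raw distance into something interpretable: assuming roughly multivariate normal data and a large baseline, the squared distance can be compared with a chi-squared distribution whose degrees of freedom equal the number of parameters. The helper names below are hypothetical, and the example distance of 45 is made up.

```python
from scipy.stats import chi2

def chi2_cutoff(confidence, n_params):
    """Squared-distance cutoff exceeded with probability 1 - confidence."""
    return float(chi2.ppf(confidence, df=n_params))

def chi2_tail_probability(d_squared, n_params):
    """Approximate probability of a squared distance at least this large."""
    return float(chi2.sf(d_squared, df=n_params))

n_params = 20
cutoff_99 = chi2_cutoff(0.99, n_params)
print(f"99% cutoff for {n_params} parameters: {cutoff_99:.2f}")
print(f"tail probability for d^2 = 45: {chi2_tail_probability(45.0, n_params):.4f}")
```

For 20 parameters the 99% cutoff is about 37.6. Because the mean and covariance are estimated from only about 100 wells, this approximation is somewhat optimistic; an F-based limit accounts for that.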

The bigger problem IMHO is the number of missing data.
 

Miner

TS Contributor
#7
Sparrow,
Since you are a Minitab user, try the T-squared multivariate control chart. You can adjust the control limits to your desired level of confidence. Your application of this approach would be the multivariate equivalent of ANOM (Analysis of Means). The null hypothesis is that there is no difference from the GROUP MEAN.
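
For anyone wanting to see the idea outside Minitab, here is a rough sketch of a T-squared check for a single new well against a baseline it was not part of. It uses the textbook F-based upper limit for one future observation, assuming multivariate normal data; Minitab's chart may compute its limits differently, and the function names are illustrative.

```python
import numpy as np
from scipy.stats import f

def t2_statistic(x, baseline):
    """Hotelling T-squared of one new sample against the baseline mean and covariance."""
    diff = x - baseline.mean(axis=0)
    cov = np.cov(baseline, rowvar=False)
    return float(diff @ np.linalg.solve(cov, diff))

def t2_upper_limit(n, p, alpha=0.01):
    """Upper limit for one future observation, given n baseline samples and p parameters."""
    scale = p * (n + 1) * (n - 1) / (n * (n - p))
    return float(scale * f.ppf(1.0 - alpha, p, n - p))

limit = t2_upper_limit(n=100, p=20, alpha=0.01)
print(f"99% limit for 100 wells, 20 parameters: {limit:.1f}")
```

A well whose T-squared exceeds the limit would be flagged at the chosen confidence level. The limit is noticeably larger than the chi-squared cutoff for the same confidence because it allows for the uncertainty in a mean and covariance estimated from only 100 wells.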