Transform data for outlier analysis?

#1
Hey everyone,

I have a data set that is log-normally distributed and want to detect outliers. So far I have used a threshold approach:

vector contains the values of interest.
A value is flagged as an outlier if:

value > median(vector) + 2 * MAD(vector) / 0.6745

where MAD(vector) is the median absolute deviation of vector.
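For anyone who wants to try the rule, here is a minimal Python sketch of the threshold described in this post. The names `mad` and `upper_outliers` and the sample values are made up for illustration. The constant 0.6745 is the 0.75 quantile of the standard normal distribution, so MAD / 0.6745 only estimates the standard deviation if the data are roughly normal.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    center = median(values)
    return median([abs(x - center) for x in values])


def upper_outliers(values, k=2.0):
    """Return the values above median + k * MAD / 0.6745."""
    threshold = median(values) + k * mad(values) / 0.6745
    return [x for x in values if x > threshold]


# Hypothetical example vector, not taken from the poster's data.
vector = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 8.5]
flagged = upper_outliers(vector)
print(flagged)
```

The rule is one-sided, as in the post: only unusually large values are flagged.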

My question is whether this is even applicable. I came across a statistics site that lists the method above among those that assume normality, which is not my case.

If so, does anyone have suggestions on how to transform the data into a normally distributed set? A log transformation does not seem to fit here, because I would then also be dividing by a median computed from log values (log / log).
I would of course also accept a different outlier detection method if somebody has one to suggest.

Any help is greatly appreciated, many thanks in advance.

Rene
 

Dason

Ambassador to the humans
#2
If your data is lognormal, then by definition taking the log transform will give you normality. If that isn't the case, then your data isn't actually lognormal. Is there a reason you think it is lognormal in the first place?
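A rough sketch of what that could look like, assuming all values are strictly positive: compute the median and MAD on the log scale, then transform the threshold back to the original scale. The function names and the sample vector are hypothetical. Note that nothing is divided by a median here; 0.6745 is just a constant.

```python
import math
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    center = median(values)
    return median([abs(x - center) for x in values])


def log_scale_threshold(values, k=2.0):
    """Upper threshold computed on log values, returned on the raw scale.

    Assumes every value is > 0, otherwise math.log raises ValueError.
    """
    logs = [math.log(x) for x in values]
    log_threshold = median(logs) + k * mad(logs) / 0.6745
    return math.exp(log_threshold)


# Hypothetical, roughly geometric sample with one extreme value.
vector = [1.0, 2.0, 4.0, 8.0, 16.0, 1000.0]
limit = log_scale_threshold(vector)
flagged = [x for x in vector if x > limit]
print(flagged)
```

Because the logarithm is monotone, comparing `x > limit` on the raw scale gives the same flags as comparing `log(x)` with the log-scale threshold.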
 

noetsi

Fortran must die
#3
Why are you using MAD if your data is normal? You normally use that if it's not normal. Z scores (or a box-and-whisker plot) are the more common way to detect outliers with normal data. If you log a distribution, the outliers may well differ from those in the non-logged data (regardless of whether your data is normal), so you should take that into consideration when logging data to look for outliers.
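For comparison, a z-score check could be sketched as follows. The cutoff of 3 is a common convention and an assumption here, not something stated in this thread; the function names and the data are illustrative only.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize each value with the sample mean and sample SD."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(x - mu) / sigma for x in values]


def z_outliers(values, cutoff=3.0):
    """Return values whose absolute z score exceeds the cutoff."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > cutoff]


# Hypothetical vector: 19 values near 10 plus one extreme value.
vector = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9,
          10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 9.9, 10.0, 25.0]
flagged = z_outliers(vector)
print(flagged)
```

One caveat for short vectors: with the sample standard deviation, |z| cannot exceed (n - 1) / sqrt(n), so a cutoff of 3 can only trigger when a vector has at least 11 values.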

I recommend reading this:

http://etd.library.pitt.edu/ETD/available/etd-05252006-081925/unrestricted/Seo.pdf
 

Dason

Ambassador to the humans
#4
Why are you using MAD if your data is normal?
They did say they thought the data was log-normal. Plus what's the harm in using something like MAD? If the data actually wasn't normal one could make an argument for it quite easily.
 

noetsi

Fortran must die
#5
Well, if the data is normal, then I would think the median, or something based on it, would be less accurate than something based on a mean. But more practically, people outside academic communities understand and are more interested in Z scores than MAD :) Easier to do, easier to explain, easier to get accepted.

You're right: if the data is not normal, then MAD would be a good idea. But if the data is skewed, MAD inflates the number of outliers. An adjusted boxplot might be the better way to go, since it adjusts for skew.
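A rough sketch of adjusted boxplot fences in the sense of Hubert and Vandervieren, which rescale the Tukey fences with the medcouple (a robust skewness measure). Several things here are assumptions: the medcouple is a naive O(n^2) version that skips tied pairs instead of using the special tie kernel of the full definition, the quartiles use the default method of `statistics.quantiles`, and all names and data are illustrative.

```python
import math
from statistics import median, quantiles


def medcouple_naive(values):
    """Naive medcouple; tied pairs (xi == xj) are skipped (simplification)."""
    m = median(values)
    lower = [x for x in values if x <= m]
    upper = [x for x in values if x >= m]
    h = [((xj - m) - (m - xi)) / (xj - xi)
         for xi in lower for xj in upper if xj != xi]
    return median(h) if h else 0.0


def adjusted_fences(values):
    """Skew-adjusted lower and upper fences based on the medcouple."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    mc = medcouple_naive(values)
    if mc >= 0:
        low = q1 - 1.5 * math.exp(-4 * mc) * iqr
        high = q3 + 1.5 * math.exp(3 * mc) * iqr
    else:
        low = q1 - 1.5 * math.exp(-3 * mc) * iqr
        high = q3 + 1.5 * math.exp(4 * mc) * iqr
    return low, high


# Hypothetical right-skewed sample.
skewed = [1, 2, 4, 8, 16, 32, 64]
low, high = adjusted_fences(skewed)
print(medcouple_naive(skewed), low, high)
```

For right-skewed data the medcouple is positive, so the upper fence moves further out than the plain Tukey fence and fewer large values get flagged. The quadratic cost of this naive version would matter for long vectors.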
 

Dason

Ambassador to the humans
#6
I would think MAD would be easier to explain as opposed to a z score. It seems a lot more intuitive and you don't have to use standard deviation or anything like that...
 

noetsi

Fortran must die
#7
The way it is calculated throws people off. Plus, means are better known than medians (yes, I know that is silly). But mainly, it's very easy to find literature that talks about z scores and far harder to find it for MAD (a calculation I had not heard of until a week ago, despite doing many outlier analyses, including grad stats ones). :)

What is commonly used, especially by the major software, is always easier to sell. As far as I know, neither SPSS nor SAS calculates MAD.
 

Dason

Ambassador to the humans
#8
But you were talking about what people understand. I was making the point that intuitively I think the MAD is more understandable because you don't have to delve into standard deviation (why do we square everything and then take the square root? <- this is a question almost everybody has when learning about variance/standard deviation). Plus if it's the use of the median that you object to note that you can calculate the MAD using the mean - it's just not quite as robust then.
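To illustrate the last point, here is a small sketch comparing the median-based and mean-based absolute deviation on made-up data. The function names are hypothetical.

```python
from statistics import mean, median


def median_abs_dev(values):
    """Median of absolute deviations from the median."""
    center = median(values)
    return median([abs(x - center) for x in values])


def mean_abs_dev(values):
    """Mean of absolute deviations from the mean."""
    center = mean(values)
    return mean([abs(x - center) for x in values])


# Hypothetical data: the second list replaces one value with an extreme one.
clean = [1, 2, 3, 4, 5]
contaminated = [1, 2, 3, 4, 500]

print(median_abs_dev(clean), median_abs_dev(contaminated))
print(mean_abs_dev(clean), mean_abs_dev(contaminated))
```

A single extreme value leaves the median-based version unchanged here but blows up the mean-based one, which is what "not quite as robust" means in practice.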
 

noetsi

Fortran must die
#9
You don't have to delve into anything with a Tukey boxplot :)
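The boxplot rule does not actually need a plot; the fences can be computed directly. A minimal sketch, assuming the usual 1.5 * IQR convention and the default quartile method of Python's `statistics.quantiles` (other quartile definitions give slightly different fences). Names and data are illustrative.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical vector with one large value.
vector = [2, 3, 4, 5, 6, 7, 8, 30]
low, high = tukey_fences(vector)
flagged = [x for x in vector if x < low or x > high]
print(low, high, flagged)
```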

I never realized you could calculate MAD with a mean.
 
#10
Hi everyone,
Nice to see that there is quite some discussion going on.
To refine my previous post: I cannot examine plots manually, as I have to check ~1.4 million vectors separately.
However, I uploaded a small subset of my data which represents the problem nicely (attaching did not work :/).

http://www.file-upload.net/download-3870378/subset.txt.html

Some of the vectors (rows) seem to be distributed normally already, while others are clearly not.
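Since plots are not an option at this scale, the check can be run row by row. A minimal sketch, assuming each row of the file holds one vector as whitespace-separated numbers (the real file layout is not known here, so the parsing is a guess). The function names and the two sample rows are hypothetical.

```python
from statistics import median


def mad_threshold(values, k=2.0):
    """Upper threshold: median + k * MAD / 0.6745."""
    center = median(values)
    mad = median([abs(x - center) for x in values])
    # If more than half the values are identical, MAD is 0 and every
    # value above the median ends up flagged.
    return center + k * mad / 0.6745


def flag_rows(lines, k=2.0):
    """Yield (row_number, flagged_values) for rows that contain outliers."""
    for row_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        values = [float(f) for f in fields]
        limit = mad_threshold(values, k)
        flagged = [x for x in values if x > limit]
        if flagged:
            yield row_number, flagged


# Hypothetical two-row sample standing in for the uploaded subset.
sample = """\
1.0 1.2 0.9 1.1 1.3 1.0 8.5
2.0 2.1 1.9 2.0 2.2 2.1 2.0
"""
results = list(flag_rows(sample.splitlines()))
print(results)
```

Because `flag_rows` is a generator, it can be fed an open file object and processes one row at a time instead of loading all vectors into memory.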

Why are you using MAD if your data is normal? You normally use that if it's not normal. Z scores (or a box-and-whisker plot) are the more common way to detect outliers with normal data. If you log a distribution, the outliers may well differ from those in the non-logged data (regardless of whether your data is normal), so you should take that into consideration when logging data to look for outliers.
That is exactly why I'm posting this question here: I checked the outliers for both cases and they were different. My question is basically: which approach makes more sense and can be justified with regard to the data?
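The difference between the two cases is easy to reproduce. The sketch below applies the same median + 2 * MAD / 0.6745 rule once to raw values and once to their logs; the function names and the geometric sample vector are invented for illustration and are not taken from the uploaded subset.

```python
import math
from statistics import median


def upper_threshold(values, k=2.0):
    """Upper threshold: median + k * MAD / 0.6745."""
    center = median(values)
    mad = median([abs(x - center) for x in values])
    return center + k * mad / 0.6745


def flag_raw(values, k=2.0):
    """Apply the rule to the values as they are."""
    limit = upper_threshold(values, k)
    return [x for x in values if x > limit]


def flag_log(values, k=2.0):
    """Apply the rule to log values; assumes all values are > 0."""
    logs = [math.log(x) for x in values]
    limit = upper_threshold(logs, k)
    return [x for x, lx in zip(values, logs) if lx > limit]


# Hypothetical vector that doubles at every step.
vector = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
print(flag_raw(vector))
print(flag_log(vector))
```

On the raw scale the two largest values are flagged, while on the log scale the vector is perfectly evenly spaced and nothing is flagged. Which answer is right depends on whether the multiplicative (log) scale is the meaningful one for the data.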
The reason I use the median and MAD instead of the mean and SD is advice posted by gianmarco in other threads, see below.

http://www.talkstats.com/showthread.php/13955
http://www.talkstats.com/showthread.php/20516

So if you have recommendations on how to proceed, please tell me.

Best regards
 
#11
Hey everyone,

To refine my previous post: plots are not an option because I'm evaluating 1.4 million vectors.
You can find a subset of the data here:
http://www.file-upload.net/download-3870378/subset.txt.html

Some of the vectors are nearly normally distributed while others are not, but I still need a common method to evaluate both.

Why are you using MAD if your data is normal? You normally use that if it's not normal. Z scores (or a box-and-whisker plot) are the more common way to detect outliers with normal data. If you log a distribution, the outliers may well differ from those in the non-logged data (regardless of whether your data is normal), so you should take that into consideration when logging data to look for outliers.
That is exactly why I am posting here: I compared log and non-log values and got different results in the analysis. I still need to justify the basic approach to outlier detection, and since I am not sure which variant makes more sense, I was hoping to get some advice from more experienced people.
Concerning the use of median and MAD, I followed gianmarco's advice from these posts:

http://www.talkstats.com/showthread.php/17682
http://www.talkstats.com/showthread.php/13955

I am not sure how to proceed, so any advice is appreciated very much.

Best regards