# Outlier detection in Affymetrix samples

#### bontus

##### New Member
Hello everybody,

I have a little question concerning the mixing of robust and "regular" statistics, and hopefully you can help me on this topic (I admit that I'm not very good at statistics).

I'm working with a dataset of Affymetrix HuEx CEL files and have normalized it using RMA.

My resulting data matrix therefore has a design similar to this one:

| ID | sample1 | sample2 | sample3 | sample4 |
|----|---------|---------|---------|---------|
| 1  | 15      | 16      | 37      | 18      |
| 2  | 14      | 17      | 13      | 13      |
| 3  | 20      | 22      | 23      | 17      |

As I'm looking for outliers in each row, I've been following the COPA approach of Tomlins et al.
This means each row/gene was median-centered and then divided by its median absolute deviation (so that each row has median = 0 and MAD = 1).
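For reference, that row-wise transformation can be sketched in Python/numpy like this (the variable names are mine, and rows whose MAD is zero would need special handling):

```python
import numpy as np

# Toy expression matrix: rows = genes/probe sets, columns = samples.
X = np.array([[15, 16, 37, 18],
              [14, 17, 13, 13],
              [20, 22, 23, 17]], dtype=float)

# COPA-style standardization: subtract the row median and divide by the
# row MAD, so every row ends up with median 0 and MAD 1.
row_median = np.median(X, axis=1, keepdims=True)
row_mad = np.median(np.abs(X - row_median), axis=1, keepdims=True)
Z = (X - row_median) / row_mad  # assumes no row has MAD == 0
```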

Now what I would like to know:
is it OK to check for outliers in the resulting data by determining whether a sample value lies above mean(row) + 3*sd(row) or below mean(row) - 3*sd(row)?

First I thought that it would be possible to stay with robust statistics and check for each single value whether it lies above median(row) + 3*mad(row), but I have found no evidence that this has ever been used and I'm not sure whether it is even applicable.

It would be great if you could enlighten me, thank you very much in advance.
Rene

#### afingal

##### New Member
The COPA approach of Tomlins et al. you mention sounds like the standard z-scores that are presented in heat maps of such data. I think you want to look for outliers first, before computing z-scores, because that transformation turns the data into relative measures and some statistical power would be lost.

Are you wanting to identify outliers so they can be removed from the data for replicates of the same sample (to make a tighter analysis), or are the outliers themselves what you are interested in? In other words, is the point of this to identify samples that are vastly different from the rest of the samples in a given row? If the former, then statistical relevance is key. If the latter, the point is to do whatever it takes to reduce the data down to a point where you can identify an interesting trend and then validate it with further experiments. For the most part, you aren't going to get statistically robust data out of a microarray.

There are specific tests for outliers, for example Tukey's outlier test, which can be configured for different cutoffs.

http://www.edgarstat.com/tukeys_outliers_help.cfm

See David Hoaglin, Frederick Mosteller, and John Tukey (editors),
Understanding Robust and Exploratory Data Analysis,
New York, John Wiley & Sons, 1983, pp. 39, 54, 62, 223.
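Tukey's rule flags points outside [Q1 - k*IQR, Q3 + k*IQR], where k = 1.5 gives the conventional fences and k = 3 marks "far out" points. A minimal sketch (the function name is mine):

```python
import numpy as np

def tukey_outliers(x, k=1.5):
    """Boolean mask of values outside Tukey's fences.

    k = 1.5 gives the conventional fences; k = 3.0 the "far out"
    fences. A larger k means a more conservative cutoff.
    """
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)
```

On the first example row this flags the suspicious value: `tukey_outliers([15, 16, 37, 18])` gives `[False, False, True, False]`.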

#### gianmarco

##### TS Contributor
Hi!
I do not know the method you cited.
Generally speaking, different approaches are available to flag values as outliers.
See this and this previous post in this same Forum.

Taking your data's first row into account, this small dataset is a good example of how outliers can "mask" themselves. In fact, if you use the "traditional" method based on mean and standard deviation, no value is flagged as an outlier.
On the other hand, if you use Tukey's method or, to keep with your original mention of them, the one based on the median and the median absolute deviation, 37 is plainly flagged as an outlier.
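The masking effect on that first row is easy to check numerically (a quick sketch; 1.4826 is the usual consistency factor that puts the MAD on the same scale as the standard deviation for normal data):

```python
import numpy as np

x = np.array([15.0, 16.0, 37.0, 18.0])  # first row of the example matrix

# Classical rule: mean +/- 3*sd. The outlier inflates both the mean and
# the sd, so it "masks" itself and nothing is flagged.
classical = np.abs(x - x.mean()) > 3 * x.std(ddof=1)

# Robust rule: median +/- 3*MAD (MAD rescaled by 1.4826 to be
# comparable to the sd under normality). 37 is flagged clearly.
med = np.median(x)
mad = 1.4826 * np.median(np.abs(x - med))
robust = np.abs(x - med) > 3 * mad

print(classical)  # -> [False False False False]
print(robust)     # -> [False False  True False]
```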

As for the details of using the median & MAD to detect outliers, please see the earlier posts I pointed out above.

Hope this helps,
Best Regards
Gm

#### bontus

##### New Member
Hi there and thank you for your quick replies.

In fact I want to investigate the outliers, so I don't want to remove them from the dataset. This is also why I chose the approach of Tomlins et al., as the relative signals of outliers stick out from the rest of the group. Just FYI, the complete COPA compares the 0.95 / 0.9 / 0.75 quantiles of the rows/genes after transformation to z-scores, and checks for dissimilarities between the test and the control group (i.e. whether the test group has very strong outliers while the control does not).
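As I understand that description, the per-gene comparison of quantiles can be sketched like this (the matrix, the planted outliers, and the column grouping are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy standardized matrix: 100 genes x 68 samples (56 test + 12 control),
# assumed already median-centered and MAD-scaled per row as described.
Z = rng.normal(size=(100, 68))
Z[0, :5] += 8.0                      # plant strong outliers in gene 0's test group
test_cols = np.arange(56)            # hypothetical grouping, for illustration
control_cols = np.arange(56, 68)

# Per-gene COPA-style score: the q-th percentile of the standardized
# values, computed within each group. Genes whose test-group score far
# exceeds the control-group score are the outlier-profile candidates.
q = 95
copa_test = np.percentile(Z[:, test_cols], q, axis=1)
copa_control = np.percentile(Z[:, control_cols], q, axis=1)
candidates = np.argsort(copa_test - copa_control)[::-1]  # gene 0 ranks first
```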

I'm thinking about using the median-based method described in this thread, though I do not understand why the MAD is divided by a constant. As my data is already median-centered and scaled by the MAD, the computations become rather easy, which is beneficial for the overall project (I'm a little short on time). Also, when applying the interquartile method, I got more false positives on a test dataset.

I do not know whether it matters or not, but my test group consists of 56 samples, while my control group has 12.
Considering Tukey's method: is it applicable in my case? I really want to know which individual columns house the outliers, as I want to create a profile of the outlier distribution over the different samples (just for one group).

Cheers and many thanks


#### afingal

##### New Member
For those reading the forum, who might not know about these Affymetrix arrays, they are a system where you can measure the expression level of every gene of a human, mouse or other organism all in one experiment. In fact, the newer ones go even farther than that, measuring alternate transcripts of genes and even things like micro RNAs. From a statistical viewpoint, the issue is that you have to work with something on the order of 10^4 or 10^5 measurements for each sample. This gives a lot of opportunity for something to occur by chance. You have to find some way of reducing the data set to a list of the most interesting things so that you can evaluate them but, whatever way you do it, your list is likely to contain one, two or a few things which are there by chance and not for real reasons.

Consider this: there is a limit to the dynamic resolution of the array. What are the smallest and the largest legitimate values the array can possibly measure? To reach a significance level at which a difference could not be by chance out of 10^5 tries, you may require a difference of such magnitude that it could not occur with this technique. Having said that, you are in much better shape with the number of test and control samples you have than is typical of the array experiments I have seen. It must have cost a small fortune to run that many samples.

I think I would be inclined to do a first pass with either of the things you mentioned. That is, either use a difference of about three standard deviations or apply some non-parametric test. The point of the first pass is just to derive a list of things of interest of a manageable size. Then, if I understand the statistics properly, what you would do is apply a post hoc test to each item on your list. Perhaps another forum participant can comment on which test would likely be the best choice.

In any case, you need a really significant p-value to be sure that what you have picked out of that many measurements is something real. Otherwise, you have to accept the fact that, quite likely, some of the items on your list are random and not real. Then you need to do followup experiments to determine which they are.
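The scale of the problem is easy to quantify: even under a pure-noise null, a three-standard-deviation cutoff applied to on the order of 10^5 normally distributed measurements is expected to flag a few hundred of them by chance alone. A back-of-the-envelope sketch:

```python
import math
import numpy as np

n_genes = 100_000                      # order of measurements per array
p_tail = math.erfc(3 / math.sqrt(2))   # P(|Z| > 3) under a normal null, ~0.0027

expected_false_hits = n_genes * p_tail
print(round(expected_false_hits))      # -> 270 flagged by chance alone

# Simulation check: pure noise, no real signal at all.
rng = np.random.default_rng(1)
z = rng.normal(size=n_genes)
print(np.sum(np.abs(z) > 3))           # a few hundred, varying with the seed
```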

#### bontus

##### New Member
In fact this is just a subset of the data and I have 2 other datasets for validation. Don't ask me who paid for all this, I'm just the guy who is supposed to make something out of it.
Which brings me to another question concerning the outliers.

As I said, I have two groups (cancer & normal) and I want to perform an outlier analysis on them to determine whether, for a given gene, outliers are present in the cancer samples while none are found in the normal samples. Now, should I run the outlier search on each group separately, or should I run it on all samples without separating the groups?
This is an essential question, as the transformation to z-scores will be affected (subtracting a different median and dividing by a different MAD).

Can you give me any advice on which approach makes more sense? I would think it makes more sense to separate the samples into groups, as I already know they are different. However, if I assume a null hypothesis of "nothing is changed in cancer", I would go for the whole dataset.

#### bontus

##### New Member
Hi people,

it's been a while since I posted in this thread, so to get some further advice on my problem I will give a short update:

I'm still wondering which outlier method to use, i.e. which one is the most logical choice.
I read through some of the suggested literature (see the link below) and decided to go for the adjusted boxplot, on the assumption that not all of my data is normally distributed after taking the log2 of the normalized expression values (of which I am fairly certain).
Still, since the transformation to z-scores is so commonly used, I cannot let go of the idea of using it. I am now trying to find a reason (or, more precisely, advantages and disadvantages) to choose one of the methods over the other.
When directly comparing the two, the z-score approach flags a much larger number of values as outliers, and the correlation between the two approaches is below 0.5 for the values above the upper limit and even close to zero for those below the lower limit.
To be precise on how the outliers are flagged, I do the calculations for all samples within the data set and then subdivide into groups after outlier detection.
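For concreteness, here is a naive sketch of the adjusted boxplot of Hubert and Vandervieren, with the skewness-adjusted fence constants as I recall them from the paper; the medcouple below is the simple O(n^2) version and ignores the special kernel needed when several values tie with the median:

```python
import numpy as np

def medcouple(x):
    """Naive medcouple: median of h(xi, xj) over xi >= m >= xj, xi != xj.

    Sketch only; it skips the special kernel for values tied with the
    median, so it is reliable only for tie-free data.
    """
    x = np.sort(np.asarray(x, dtype=float))
    m = np.median(x)
    upper = x[x >= m]
    lower = x[x <= m]
    h = [((xi - m) - (m - xj)) / (xi - xj)
         for xi in upper for xj in lower if xi != xj]
    return np.median(h)

def adjusted_boxplot_outliers(x):
    """Boolean mask of points outside the skewness-adjusted fences."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mc = medcouple(x)
    if mc >= 0:
        lo = q1 - 1.5 * np.exp(-4 * mc) * iqr
        hi = q3 + 1.5 * np.exp(3 * mc) * iqr
    else:
        lo = q1 - 1.5 * np.exp(-3 * mc) * iqr
        hi = q3 + 1.5 * np.exp(4 * mc) * iqr
    return (x < lo) | (x > hi)
```

When the medcouple is zero (symmetric data), the fences reduce to Tukey's. Note also that the medcouple is quite unstable for very small samples, which is another argument for computing it over all samples first and splitting into groups afterwards, as you describe.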

Any advice or recommendations are greatly appreciated; I just don't want to have to defend my choice later on without any reason besides "the others did it this way".

Best regards

http://etd.library.pitt.edu/ETD/available/etd-05252006-081925/unrestricted/Seo.pdf