Hello everybody,
I have a small question about mixing robust and "regular" statistics, and I hope you can help me with it (I admit that I'm not very good at statistics).
I'm working with a dataset of Affymetrix HuEx CEL files and have normalized it using RMA.
My resulting data matrix therefore has a layout similar to this one:
ID  sample1  sample2  sample3  sample4
1   15       16       37       18
2   14       17       13       13
3   20       22       23       17
As I'm looking for outliers in each row, I've been following the COPA approach of Tomlins et al.
That is, each row/gene was median-centered and then divided by its median absolute deviation (so that median = 0 and MAD = 1).
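To make the transformation concrete, here is a minimal sketch in Python (standard library only) applied to the three example rows above. The function names `mad` and `copa_transform` are my own, purely for illustration, and the sketch assumes the raw, unscaled MAD; note that R's `mad()` multiplies by 1.4826 by default, in which case the transformed MAD would not come out as exactly 1 under this definition.

```python
from statistics import median


def mad(values):
    """Raw (unscaled) median absolute deviation of a sequence of numbers."""
    m = median(values)
    return median([abs(v - m) for v in values])


def copa_transform(row):
    """Median-center a row and divide it by its raw MAD (illustrative helper)."""
    m = median(row)
    s = mad(row)
    if s == 0:
        # More than half of the values equal the median: scaling is undefined.
        raise ValueError("MAD is zero; row cannot be scaled")
    return [(v - m) / s for v in row]


# The three example rows from the matrix above (gene ID -> sample values).
rows = {
    1: [15, 16, 37, 18],
    2: [14, 17, 13, 13],
    3: [20, 22, 23, 17],
}

transformed = {gene: copa_transform(vals) for gene, vals in rows.items()}

for gene, vals in transformed.items():
    print(gene, [round(v, 2) for v in vals])
```

For row 2, for example, the median is 13.5 and the raw MAD is 0.5, so the transformed values are 1, 7, -1 and -1.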
What I would like to know now is this:
is it OK to check for outliers in the resulting data by determining whether a sample value is above mean(row)+3*sd(row) or below mean(row)-3*sd(row)?
At first I thought it would be possible to stay with robust statistics and check whether each individual value is above median(row)+3*mad(row), but I have found no evidence that this has ever been done, and I'm not sure whether it is even applicable.
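To show exactly which two checks are being compared, here is a minimal Python sketch (standard library only) applied to the first example row. The function names are hypothetical, the factor 3 is taken from the rules as written here, the MAD is the raw, unscaled one, and the sample standard deviation (n-1 denominator) is assumed. The sketch checks both tails for both rules, and it works on the untransformed row; both rules shift and scale together with the data, so they flag the same samples after median-centering and dividing by the MAD.

```python
from statistics import mean, median, stdev


def mad(values):
    """Raw (unscaled) median absolute deviation of a sequence of numbers."""
    m = median(values)
    return median([abs(v - m) for v in values])


def robust_outliers(row, k=3.0):
    """Values further than k raw MADs from the row median."""
    center, spread = median(row), mad(row)
    return [v for v in row if abs(v - center) > k * spread]


def classical_outliers(row, k=3.0):
    """Values further than k sample standard deviations from the row mean."""
    center, spread = mean(row), stdev(row)
    return [v for v in row if abs(v - center) > k * spread]


row = [15, 16, 37, 18]  # gene 1 from the example matrix

print("median/MAD rule:", robust_outliers(row))
print("mean/sd rule:   ", classical_outliers(row))
```

On this toy row the two rules disagree: the median/MAD rule flags 37 (median 17, raw MAD 1.5), while the mean/sd rule flags nothing, because 37 itself pulls the mean up to 21.5 and the standard deviation up to about 10.4.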
It would be great if you could enlighten me. Thank you very much in advance.
Rene