Gene expression analysis - FDR and p-values

jma

New Member
#1
Hi,

I'm using Affymetrix microarrays to check whether there are differences in gene expression between two groups of animals that were treated differently.
I'm using a t-test with permutations (my groups have only three animals each, and I read that permutation tests are preferable when you cannot check the gene expression values for normality). Because of the multiple hypothesis testing problem that arises in microarray experiments, I will use an FDR correction to control the false positives among the p-values.
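To make the setup concrete, here is a minimal sketch of the kind of permutation test I mean, for a single gene (made-up expression values, not my real data):

```python
from itertools import combinations

# Made-up log-expression values for one gene, three animals per group.
group_a = [7.1, 6.8, 7.4]
group_b = [6.2, 6.0, 6.5]
pooled = group_a + group_b
n_a = len(group_a)

def abs_mean_diff(values, idx_a):
    """Absolute difference in group means for a given relabeling."""
    a = [values[i] for i in idx_a]
    b = [values[i] for i in range(len(values)) if i not in idx_a]
    return abs(sum(a) / len(a) - sum(b) / len(b))

observed = abs_mean_diff(pooled, range(n_a))

# With 3 vs 3 animals there are only C(6,3) = 20 relabelings, so we can
# enumerate all of them instead of sampling.
relabelings = list(combinations(range(len(pooled)), n_a))
count = sum(abs_mean_diff(pooled, idx) >= observed for idx in relabelings)
p_value = count / len(relabelings)
print(p_value)  # 0.1 for these toy numbers
```

Note that with only 20 possible relabelings, the smallest attainable two-sided p-value is 2/20 = 0.1, which seems worth keeping in mind before applying any FDR correction.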
I'm new to this kind of testing and have never worked with FDR before, so there are a few things I don't understand.

1- Should the FDR calculation be applied to all the p-values that I get from the t-tests, or only to those under 0.05 (I was thinking of using a 5% significance level)?

2- When I calculate the FDR for all p-values, I get very high FDR values, and many of them are identical. Am I doing something wrong, or is this because I have only around 200 p-values under 0.05 out of around 11,000 in total, so the proportion of "significant" genes is too low?

3- What cut-off value do people normally use? Just to check that I have understood: after I calculate the FDRs, I should sort my list of p-values by FDR value and then select everything below my FDR cut-off, right? That would then mean that, within my new corrected list of p-values, there can be about cutOff*length(new_list) false positives.

Please correct me if I'm wrong. I would really appreciate it if somebody could help me answer these questions, as I am new to this kind of statistical testing.


Best regards
 

Mean Joe

TS Contributor
#2
1) Apply FDR to all p-values

2) This happens often; the FDR "q" value is often the same for many of the results, and it does get high pretty fast. I'm not sure why it happens. I've never dealt with 11,000 p-values...

3) Use .05 divided by number of hypothesis results, incrementally increasing to .05. Let's say, for ease of calculation, that you had 10,000 hypothesis results. The cut-off for the #1 result would be .05/10,000, which equals .000005. The cut-off for the #2 result would be twice that, which equals .000010, the cut-off for the #3 result would be thrice that, which equals .000015, ..., the cut-off for the #10,000 result would be 10,000-times that, which equals .05
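For example, here is a little sketch of that step-up comparison in Python (toy p-values, just to show the mechanics):

```python
# Toy p-values, just for illustration.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
m = len(pvals)
alpha = 0.05

# Sort ascending and compare the i-th smallest p-value to i/m * alpha.
sorted_p = sorted(pvals)
largest_k = 0
for i, p in enumerate(sorted_p, start=1):
    if p <= i / m * alpha:
        largest_k = i  # remember the largest rank that passes

# Declare the hypotheses with the largest_k smallest p-values significant.
rejected = sorted_p[:largest_k]
print(largest_k, rejected)  # 2 [0.001, 0.008]
```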

Here's a little calculator, so you can see some typical results. Notice that there are a lot of similar q-values in this example too.
FDR Calculator in Excel
 

Dason

Ambassador to the humans
#3
1) Apply FDR to all p-values

2) This happens often; the FDR "q" value is often the same for many of the results, and it does get high pretty fast. I'm not sure why it happens. I've never dealt with 11,000 p-values...
It's an artifact of the algorithm used. Basically, if we have a p-value of .07 and a p-value of .08, we would expect the corresponding q-values to retain that ordering (so that the q-value for the p-value of .07 would be lower than the q-value for the p-value of .08). The algorithm that converts p-values to q-values doesn't guarantee this ordering, so if the .07 p-value were converted to a q-value of .2 and the .08 p-value to a q-value of .15, then we change the q-value of .2 to .15 to maintain the ordering with respect to the p-values. There are better justifications for why we do this than what I'm giving, but I think this is the most intuitive.
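A small sketch of that fix, with made-up p-values including the .07/.08 pair:

```python
# Toy p-values; note the .07/.08 pair from the example above.
pvals = [0.07, 0.08, 0.01, 0.03]
m = len(pvals)

order = sorted(range(m), key=lambda i: pvals[i])  # indices of sorted p-values
raw = [pvals[order[i]] * m / (i + 1) for i in range(m)]  # p_(i) * m / i

# Walk from the largest p-value down, taking a running minimum, so the
# q-values keep the same ordering as the p-values.
q_sorted = raw[:]
for i in range(m - 2, -1, -1):
    q_sorted[i] = min(q_sorted[i], q_sorted[i + 1])

# Map the q-values back to the original order of pvals.
q = [0.0] * m
for i, idx in enumerate(order):
    q[idx] = q_sorted[i]
print(q)  # the .07 and .08 p-values both end up with q-value .08 here
```

Before the running minimum, the raw value for .07 is .07 * 4/3 ≈ .093, which is larger than the .08 for the .08 p-value; the backwards pass is exactly what repairs that inversion.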

3) Use .05 divided by number of hypothesis results, incrementally increasing to .05. Let's say, for ease of calculation, that you had 10,000 hypothesis results. The cut-off for the #1 result would be .05/10,000, which equals .000005. The cut-off for the #2 result would be twice that, which equals .000010, the cut-off for the #3 result would be thrice that, which equals .000015, ..., the cut-off for the #10,000 result would be 10,000-times that, which equals .05
I'm not exactly sure what you're saying here, but typically we just use a cut-off based directly on the q-values: we declare any observation with a q-value less than or equal to .05 (or whatever level you want to control the FDR at) to be significant. .05 is a common cutoff, but so are .01 and .1.
 

jma

New Member
#4
Thanks for replying. I have attached a figure of the p-values and the t-values. The distribution of my p-values looks a bit skewed; could that be why I get such strange FDR values? Also, the p-values under 0.05 are only 3% to 5% of the total number of tests.

Best regards
 

WeeG

TS Contributor
#5
The p-values should have a uniform distribution under the null hypothesis. Your graph of p-values is indeed skewed; however, it is skewed in the direction that indicates no real discoveries.

I am not too familiar with q-values, but I have also seen the problem with the algorithm giving the same value for many results.

Why don't you simply perform the Benjamini-Hochberg (BH) procedure? It is very easy to implement. If your tests are correlated, you can use the Benjamini-Yekutieli correction to make the procedure more conservative.
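For what it's worth, the only change Benjamini-Yekutieli makes to the BH cutoffs is an extra harmonic-sum factor in the denominator. A quick sketch, assuming roughly the 10,000-test setting discussed above:

```python
# Compare the rank-1 BH and BY cutoffs for m = 10,000 tests at alpha = .05.
m = 10_000
alpha = 0.05

# BY divides each cutoff by c(m) = 1 + 1/2 + ... + 1/m.
c_m = sum(1.0 / i for i in range(1, m + 1))

rank = 1  # cutoff for the smallest p-value
bh_cutoff = rank * alpha / m          # BH: .05 / 10,000
by_cutoff = rank * alpha / (m * c_m)  # BY: stricter under dependence

print(bh_cutoff, by_cutoff)  # BY is roughly 10x stricter here, since c(m) ~ 9.8
```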