Likelihood of false positives in a robust regression model: gene expression results


I'd like to calculate the probability that gene A, or genes A, B, C and D, occur by chance alone in a gene list of concordant genes derived from two separate methods applied under the same conditions.

1. Significance Analysis of Microarrays (SAM) of Disease vs. Control yielded 500 genes meeting q < 0.05 (5%). So I know that there is a 1 in 20 chance of a gene being there due to chance. This dataset was validated by a second independent batch. The overlap of the two gene lists was 50 genes. Therefore there is a 1 in 400 (1/20 × 1/20) chance of a single gene occurring in the gene list of 50 by chance alone. Right?

2. Independently, the same gene expression data are normalized against a human tissue bank of expression data (fRMA), and robust regression using LIMMA is performed. The resulting gene list of 500 genes meets this criterion: B statistic > 97.5th percentile. This list is included in determining the concordance of the gene lists from 1 and 2.

3. A, B, D are genes which are concordant between 1 and 2.
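
To make the chance-overlap question concrete, here is a minimal sketch of one common way to frame it: if two gene lists were drawn independently at random from the same array, the size of their overlap would follow a hypergeometric distribution. The array size of 20,000 genes is a hypothetical placeholder (it is not stated above), the function names are illustrative, and treating the lists as independent random draws is an assumption that correlated expression would violate.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def expected_overlap(n_total, n_list1, n_list2):
    """Expected number of shared genes if both lists were random draws."""
    return n_list1 * n_list2 / n_total


def overlap_p_value(n_total, n_list1, n_list2, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric null.

    n_total   : genes on the array (hypothetical here)
    n_list1   : size of the first gene list
    n_list2   : size of the second gene list
    n_overlap : observed number of genes in both lists
    """
    log_denom = log_comb(n_total, n_list2)
    total = 0.0
    for k in range(n_overlap, min(n_list1, n_list2) + 1):
        # Skip overlaps that are impossible given the list sizes.
        if n_list2 - k > n_total - n_list1:
            continue
        log_p = (log_comb(n_list1, k)
                 + log_comb(n_total - n_list1, n_list2 - k)
                 - log_denom)
        total += math.exp(log_p)
    return min(total, 1.0)


# Assumed numbers: 20,000 genes on the array (placeholder), two lists of 500,
# and an observed overlap of 50.
N_GENES = 20_000
print(expected_overlap(N_GENES, 500, 500))
print(overlap_p_value(N_GENES, 500, 500, 50))
```

This addresses whether the overlap as a whole is larger than chance would give. It does not by itself give the probability that one particular gene in the overlap is a false positive.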

I want to identify a few candidate genes that have the highest chance of appearing because they are genuinely present in disease, and hopefully biologically important, OR truly ubiquitous in this tissue. Is this presumption correct?

a/ How do I calculate the probability of A being in 3. due to chance?
b/ What is the probability of A, B and D all being there together due to chance?
c/ Any comments about the method and its validity?
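
For a/ and b/, the arithmetic I have in mind can be sketched as below. It takes the 1 in 20 figure at face value and assumes both that the two batches are independent and that the genes are independent of each other. Both are assumptions, not established facts: a q-value is a false discovery rate for the whole list rather than a per-gene probability, and co-regulated genes are generally not independent. The helper name is illustrative only.

```python
from fractions import Fraction


def joint_chance_probability(per_gene_probs):
    """Multiply per-gene chance probabilities, assuming independence."""
    joint = Fraction(1)
    for p in per_gene_probs:
        joint *= Fraction(p)
    return joint


# a/ One gene appearing by chance in both batches (1/20 in each, assumed).
p_single = Fraction(1, 20) * Fraction(1, 20)

# b/ Genes A, B and D all appearing by chance, assuming independence.
p_joint = joint_chance_probability([p_single] * 3)

print(p_single)  # 1/400
print(p_joint)   # 1/64000000
```

If the independence assumptions fail, the true joint probability could be much larger than this product.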

Thank you very much :wave: