# Fisher's exact test, Hypergeometric distribution

#### AnnaS

##### New Member
Hi,
I need some help with Fisher's exact test and the hypergeometric distribution.

My data consists of:

500 mutations - that I want to score as either damaging or benign
500 population allele frequencies (AF) - extracted, for each mutation, from a set of 10 000 controls

For each allele frequency I want to use Fisher's exact test / the hypergeometric distribution to calculate a p-value, and then turn it into a score by taking -log(p-value), so that a high score indicates a damaging mutation.

This is what I have done so far:

To estimate the p-values for the allele frequencies (AF), we use the hypergeometric distribution to get a score that is high when a specific mutation does not occur among the controls in the AF data. Assume we have a total of n/2 individuals, i.e. n alleles (2 per person), and out of these n alleles m come from patients and n-m come from controls. For example, with only two patients we would get m = 2 * 2 = 4 alleles among patients. Also, out of all n alleles, a total of k are of the mutated variant. We let k be fixed, i.e. we condition on the total number of mutated variants. Let X be the number of mutated variants found among cases (i.e. in our mutation data). Under H_0, that there is no difference in mutation risk between cases and controls (and also assuming Hardy-Weinberg equilibrium), X has a hypergeometric distribution:

$$X \sim \mathrm{Hyp}(n, m, k)$$

i.e.

$$P_0(X = x) = \frac{\binom{n-m}{k-x}\binom{m}{x}}{\binom{n}{k}}$$

where $\binom{n}{k}$ is the number of subsets with k elements out of a total of n elements, and the index 0 in $P_0$ just says that it is calculated under the null hypothesis.

Let $x_{obs}$ be the observed value of X. Then we get

p-value = P_0(X > x_obs) = 1 - F(x_obs), where F is the hypergeometric distribution function
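A minimal sketch of this tail probability, using only the Python standard library. The function names and the example numbers are hypothetical, chosen for illustration. The conventional one-sided p-value includes the observed count, P_0(X >= x_obs), so the sketch computes both the strict and the inclusive tail.

```python
from math import comb

def hypergeom_pmf(x, n, m, k):
    """P_0(X = x): x mutated alleles among the m patient alleles,
    given n alleles in total, of which k are mutated."""
    # Outside the support of the distribution the probability is 0.
    if x < max(0, k - (n - m)) or x > min(m, k):
        return 0.0
    return comb(m, x) * comb(n - m, k - x) / comb(n, k)

def upper_tail(x_obs, n, m, k, strict=True):
    """P_0(X > x_obs) if strict, otherwise P_0(X >= x_obs)."""
    start = x_obs + 1 if strict else x_obs
    return sum(hypergeom_pmf(x, n, m, k) for x in range(start, min(m, k) + 1))

# Hypothetical numbers: 2 patients (m = 4), 10 000 controls (n - m = 20 000),
# k = 3 mutated alleles in total, 1 of them observed in a patient.
n, m, k, x_obs = 20004, 4, 3, 1
p_strict = upper_tail(x_obs, n, m, k)                    # P_0(X > 1)
p_inclusive = upper_tail(x_obs, n, m, k, strict=False)   # P_0(X >= 1)
print(p_strict, p_inclusive)
```

If SciPy is available, `scipy.stats.hypergeom` provides the same distribution, but note that its parameter order differs from Hyp(n, m, k) as written here.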

Then we can just transform the p-value to get

Allele freq (AF) score = -log(p-value)

This gives a non-negative score: it is 0 when the p-value is 1 and grows without bound as the p-value approaches 0. A high score corresponds to a strong association between mutation and disease, and a score close to 0 corresponds to a weak association.

CONCLUSION:

INPUT: n, m, k, x_obs

Calculate p-value = P_0(X > x_obs) from the hypergeometric distribution function

AF score = -log(p-value)
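The steps above can be sketched as one function. This is an illustration under assumptions, not a definitive implementation: the name `af_score` is hypothetical, the logarithm is taken as the natural log (the base is not specified above), and a p-value of exactly 0 is mapped to an infinite score.

```python
from math import comb, log, inf

def af_score(n, m, k, x_obs):
    """-log of P_0(X > x_obs) for X ~ Hyp(n, m, k).

    n: total alleles, m: patient alleles, k: mutated alleles in total,
    x_obs: mutated alleles observed among patients (assumed >= 0).
    """
    upper = min(m, k)  # X cannot exceed either m or k
    # Numerator of the strict upper tail, summed as exact integers.
    tail = sum(comb(m, x) * comb(n - m, k - x)
               for x in range(x_obs + 1, upper + 1))
    p_value = tail / comb(n, k)
    # Assumption: treat p = 0 as an infinite score instead of raising an error.
    return inf if p_value == 0 else -log(p_value)

# Hypothetical example: 2 patients, 10 000 controls, 3 mutated alleles in total.
print(af_score(20004, 4, 3, 1))
print(af_score(20004, 4, 3, 2))
```

The more of the k mutated alleles that are found among patients, the smaller the tail probability and the larger the score.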

With this method, though, we need to know the number of controls. But if we know that the sought mutation is not in the AF data, and we know the number of people in the data (10000), then maybe we could do it like this:

m = 2 * (nr of patients)
n-m = 2 * (nr of persons in AF data)
k = total nr of mutations = nr of mutations among patients = x_obs
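One consequence of this setup may be worth spelling out, as a worked sketch under the assumption that the mutation is absent from all controls, so that every mutated allele is in a patient and k equals the observed count. The strict tail P_0(X > k) is then exactly 0, because X cannot exceed k, and -log(p-value) is infinite. With the inclusive tail, which is the usual one-sided convention, the p-value reduces to a single term:

$$P_0(X \ge k) = P_0(X = k) = \frac{\binom{m}{k}\binom{n-m}{0}}{\binom{n}{k}} = \frac{\binom{m}{k}}{\binom{n}{k}}$$

For example, with 2 patients (m = 4), 10000 controls (n = 20004) and k = 1, this gives 4/20004 ≈ 2.0 × 10^{-4}.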

For the 'association part' (since the AF data compares the number of mutations among cases and controls), it is important to know the number of mutations among the controls to get a score. The fewer mutations among controls, the higher the score. This is something Fisher's exact test would take care of: the fewer mutations there are among controls, the lower the p-value and therefore the higher the score = -log(p-value).
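To illustrate this behaviour, here is a rough sketch of a one-sided Fisher's exact test computed directly from the hypergeometric tail, using only the Python standard library. Several things here are assumptions made for illustration: the function name, the 2 patients, the single mutated allele observed among patients, the example allele frequencies, and the reconstruction of the control allele count as AF * 2 * 10000 rounded to an integer.

```python
from math import comb

N_CONTROLS = 10_000   # number of controls, as stated in the post
N_PATIENTS = 2        # hypothetical number of patients

def fisher_one_sided(x_patients, control_af):
    """One-sided Fisher p-value P_0(X >= x_patients) for the 2x2 table
    (patients / controls) x (mutated / not mutated), counted in alleles."""
    m = 2 * N_PATIENTS                 # patient alleles
    n_ctrl = 2 * N_CONTROLS            # control alleles (n - m)
    # Assumption: recover the mutated control allele count from the AF.
    c = round(control_af * n_ctrl)
    n, k = m + n_ctrl, x_patients + c  # total alleles, total mutated alleles
    tail = sum(comb(m, x) * comb(n_ctrl, k - x)
               for x in range(x_patients, min(m, k) + 1))
    return tail / comb(n, k)

# One mutated allele among patients, increasingly common among controls.
for af in (0.0, 0.0005, 0.005, 0.05):
    print(af, fisher_one_sided(1, af))
```

The p-value grows as the mutation becomes more common among controls, so the score -log(p-value) shrinks. If SciPy is available, `scipy.stats.fisher_exact` with `alternative='greater'` on the same 2x2 table should give the same one-sided p-value.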

How should I proceed to get a p-value for each extracted value? If I understand how to do it manually, then I can automate it!

Best,
Anna