# probability of finding 2 random DNA fragments matching between genomes

#### GummyBear

##### New Member
Hi all,

I have a question about probability.
There are 2 genomes (each a long string composed of 4 letters: A, T, G & C), each 150,000 bases (letters) long. Now, I generate 10,000 random fragments (sub-strings) from each genome, each exactly 20 bases long. What is the probability of finding exactly the same fragment (string) in both genomes?

What I've tried:

Since there are 4 possible bases, there can be 4^20 possible fragments in total. So the probability of finding the same fragment in the 2 genomes is (1/4^20), right?
But how does the fact that there are only 10,000 fragments affect the probability? Also, do I have to worry about the genome size (150,000 bases, from which the fragments were obtained)? I am unable to find an answer for this.

Another related question: for the same problem, if I allow some mismatches in the fragment match (say 5 out of 20 bases need not match), how will the probability change? Will it be just (1/4^15) or (1/4^15)*(15 choose 5)?

Any help will be greatly appreciated! Have a great day!

#### asterisk

##### New Member
If the strings are random, the probability that one string matches another string is $$\frac{1}{4^{20}}$$

The probability that one string matches any of a list of 10000 strings is approximately

$$\frac{10000}{4^{20}}$$

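The probability that any string in the first list matches any string in the second list is approximately (strictly this is the expected number of matching pairs, which is close to the probability because the value is much smaller than 1)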

$$\frac{10000^2}{4^{20}}$$
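To put numbers on the three expressions above, here is a minimal Python sketch. It assumes every fragment is an independent, uniformly random 20-base string, which ignores the genome size and the overlap between fragments cut from the same genome. The variable names are only illustrative. It also computes the "at least one match" probability under that independence assumption, to show how close it is to the simple $$\frac{10000^2}{4^{20}}$$ estimate.

```python
from fractions import Fraction
from math import expm1, log1p

FRAGMENT_LEN = 20      # bases per fragment
N_FRAGMENTS = 10_000   # fragments drawn from each genome

# Probability that two independent random 20-base strings are identical.
p_single = Fraction(1, 4 ** FRAGMENT_LEN)

# One string against a list of 10,000 strings (sum over the list).
p_one_vs_list = N_FRAGMENTS * p_single

# Every string in list 1 against every string in list 2:
# this is the expected number of matching pairs.
expected_pairs = N_FRAGMENTS ** 2 * p_single

# Probability of at least one matching pair, ASSUMING all pairs are
# independent: 1 - (1 - p)^(N^2), computed stably with log1p/expm1.
p_at_least_one = -expm1(N_FRAGMENTS ** 2 * log1p(-float(p_single)))

print(f"single pair:       {float(p_single):.3e}")
print(f"one vs list:       {float(p_one_vs_list):.3e}")
print(f"expected pairs:    {float(expected_pairs):.3e}")
print(f"at least one pair: {p_at_least_one:.3e}")
```

The last two numbers agree to several significant digits, which is why the simple product is a reasonable estimate here.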

For the last question, the probability that exactly 15 of the 20 bases match (and the other 5 do not) is

$${{20}\choose{15}}*(1/4)^{15}*(3/4)^5$$

I haven't taken biology since high school, and am far from an expert, but it is my understanding that genomes are not random and there are rules like percentage of G = percentage of C, and percentage of A = percentage of T.

#### GummyBear

##### New Member
Thanks very much for your answer!
Yes, I know the genome composition is not random; this is just to test our simple hypothesis.

BTW for the last answer, I didn't clearly understand your answer. Can you please explain it to me (sorry for my ignorance)? Thanks once again!


#### asterisk

##### New Member
This is a Binomial Distribution

$${{20}\choose{15}}*(1/4)^{15}*(3/4)^5$$

$${{n}\choose{k}}*(p)^{k}*(1-p)^{n-k}$$

n = 20 = number of trials
k = 15 = number of successes
p = 1/4 = probability of success
(1-p) = 3/4 = probability of failure
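As a rough check of the binomial formula with these values, the sketch below evaluates it in Python. The function names are hypothetical. It assumes each base matches independently with probability 1/4. Since "5 out of 20 bases need not match" could also be read as "at least 15 match", the second function sums the formula over k = 15 to 20 for that reading.

```python
from math import comb

def prob_exact_matches(n: int, k: int, p: float = 0.25) -> float:
    """Binomial probability of exactly k matching positions out of n."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def prob_at_least_matches(n: int, k: int, p: float = 0.25) -> float:
    """Probability of k or more matching positions out of n."""
    return sum(prob_exact_matches(n, j, p) for j in range(k, n + 1))

# n = 20 positions, k = 15 matches, p = 1/4 chance that a position matches
exactly_15 = prob_exact_matches(20, 15)
at_least_15 = prob_at_least_matches(20, 15)

print(f"exactly 15 of 20 match:  {exactly_15:.4e}")
print(f"at least 15 of 20 match: {at_least_15:.4e}")
```

The two readings give values of the same order of magnitude, because matches beyond 15 become rapidly less likely.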

#### GummyBear

##### New Member
Thanks very much, asterisk! It was very helpful.