Probability of two random DNA fragments matching between genomes

#1
Hi all,

I have a question about probability.
There are 2 genomes (each a long string composed of 4 letters: A, T, G and C), each 150,000 bases (letters) long. Now, I generate 10,000 random fragments (sub-strings) from each genome, each exactly 20 bases long. What is the probability of finding exactly the same fragment (string) in both genomes?

What I've tried:

Since there are 4 possible bases, there are 4^20 possible fragments. So the probability of finding the same fragment in both genomes is (1/4^20), right?
But how does the fact that there are only 10,000 fragments affect the probability? Also, do I have to worry about the genome size (150,000 bases, from which the fragments were obtained)? I am unable to find an answer to this.

Another related question: for the same problem, if I allow some mismatches within a fragment match (say 5 out of 20 bases need not match), how will the probability change? Will it be just (1/4^15) or (1/4^15)*(15 choose 5)?

Any help will be greatly appreciated! Have a great day!
 
#2
If the strings are random, the probability that one string matches another is \(\frac{1}{4^{20}}\).

The probability that one string matches any of a list of 10000 strings is approximately (matches are so rare that the individual probabilities simply add)

\(\frac{10000}{4^{20}}\)

The probability that any string in the first list matches any string in the second list is approximately

\(\frac{10000^2}{4^{20}}\)
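A minimal Python sketch of the three calculations above, assuming every fragment is an independent, uniformly random 20-mer (so the genome length of 150,000 does not enter) and that all comparisons are independent. The names `p_pair` and `p_any` are only illustrative. It also evaluates the exact form \(1-(1-p)^k\) next to the simple sum, to show how close the two are.

```python
from math import expm1, log1p

ALPHABET = 4      # A, T, G, C
LENGTH = 20       # bases per fragment
N = 10_000        # fragments sampled from each genome

# Chance that two specific random fragments are identical: 1 / 4^20.
p_pair = ALPHABET ** -LENGTH


def p_any(comparisons, p=p_pair):
    """P(at least one match) over independent comparisons: 1 - (1 - p)^k.

    Written with log1p/expm1 so the tiny value of p is not lost to rounding.
    """
    return -expm1(comparisons * log1p(-p))


one_vs_list = p_any(N)        # one fragment against a list of 10,000
list_vs_list = p_any(N * N)   # every fragment in list 1 against every one in list 2

print(f"pair:         {p_pair:.4e}")
print(f"one vs list:  {one_vs_list:.4e}  (simple sum: {N * p_pair:.4e})")
print(f"list vs list: {list_vs_list:.4e}  (simple sum: {N * N * p_pair:.4e})")
```

Under these assumptions the list-versus-list probability comes out at about 9.1e-5, roughly 1 in 11,000, and the simple sum overestimates the exact value only very slightly.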


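For the last question, each position matches with probability 1/4 and mismatches with probability 3/4, so the probability that exactly 15 of the 20 positions match is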

\({{20}\choose{15}}*(1/4)^{15}*(3/4)^5\)
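The binomial expression above can be checked numerically. This is a sketch assuming uniformly random bases and independent positions; the helper names `p_exact` and `p_at_least` are made up for illustration. Since "5 out of 20 bases need not match" could also be read as "up to 5 mismatches allowed", the sketch computes that cumulative version as well.

```python
from math import comb

LENGTH = 20
P_MATCH = 1 / 4   # per-position match probability for uniformly random bases


def p_exact(k, n=LENGTH, p=P_MATCH):
    """P(exactly k of n positions match): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def p_at_least(k, n=LENGTH, p=P_MATCH):
    """P(k or more of n positions match), i.e. at most n - k mismatches."""
    return sum(p_exact(j, n, p) for j in range(k, n + 1))


exactly_15 = p_exact(15)        # the formula given above
at_least_15 = p_at_least(15)    # the "up to 5 mismatches" reading

print(f"exactly 15 matches:  {exactly_15:.4e}")
print(f"at least 15 matches: {at_least_15:.4e}")
```

Both values are for a single pair of fragments. To cover the two lists of 10,000 fragments, the per-pair value would replace \(\frac{1}{4^{20}}\) in the earlier calculation.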

I haven't taken biology since high school, and am far from an expert, but it is my understanding that genomes are not random and that there are rules such as percentage of G = percentage of C and percentage of A = percentage of T.
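To give a rough feel for how non-uniform base composition would change the numbers, here is a small sketch. It assumes both genomes share the same base frequencies and that positions are still independent, which real genomes do not satisfy. The AT-rich frequencies are hypothetical values chosen only to respect A = T and G = C; they are not taken from any real genome.

```python
def p_position_match(freqs):
    """Chance that two independently drawn bases are equal: sum of squared frequencies."""
    assert abs(sum(freqs.values()) - 1.0) < 1e-9, "frequencies must sum to 1"
    return sum(f * f for f in freqs.values())


uniform = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}
# Hypothetical composition with A = T and G = C (illustrative only).
at_rich = {"A": 0.30, "T": 0.30, "G": 0.20, "C": 0.20}

p_uniform = p_position_match(uniform)
p_skewed = p_position_match(at_rich)

# Factor by which a full 20-base match becomes more likely than in the uniform case.
ratio_20mer = (p_skewed / p_uniform) ** 20

print(f"per-position match, uniform: {p_uniform:.4f}")
print(f"per-position match, skewed:  {p_skewed:.4f}")
print(f"20-mer match ratio:          {ratio_20mer:.2f}")
```

Any departure from equal base frequencies raises the per-position match probability above 1/4, so the uniform model gives a lower bound on the chance of a match under this simple model.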
 
#3
Thanks very much for your answer!
Yes, I know the genome composition is not random; this is just to test our simple hypothesis.

BTW, I didn't fully understand your answer to the last question. Could you please explain it to me (sorry for my ignorance)? Thanks once again!
 