Sam Shah, one of the math teacher bloggers that I read, posted a bioinformatics-related question on A biology question that is actually a probability question « Continuous Everywhere but Differentiable Nowhere:
Let’s say you have a sequence of 3 billion nucleotides. What is the probability that there is a sequence of 20 nucleotides that repeats somewhere in the sequence? You may assume that there are 4 nucleotides (A, C, T, G) and when coming up with the 3 billion nucleotide sequence, they are all equally likely to appear.
This is the sort of combinatorics question that comes up a lot in building null models for bioinformatics, when we want to know just how weird something we’ve found really is.
Of course, we usually end up asking for the expected number of occurrences of a particular event, rather than the probability of the event, since expected values are additive even when the events aren’t independent. So let me change the problem to
In a sequence of N bases (independent, uniformly distributed), what is the expected number of k-mers (k≪N). Plug in N=3E9 and k=20.
The probability that any particular k-mer occurs in a particular position is 4-k, so the expected number of occurrences of that k-mer is N/4k, or about 2.7E-3 for the values of N and k given. Oops, we should count both strands, so double that to 5.46E-3.
When the expected number is that small, we can use it equally well as the probability of there being one or more such k-mers. (Note: this assumes 4k ≫ N.)
Now let’s look at each k-mer that actually occurs (all 2N of them), and estimate how many other k-mers match. There are roughly 2N/4k for each (we can ignore little differences like N vs. N-1), so there are 4 N2/4k total pairs. But we’ve counted each pair twice, so the expected number of pairs is only 2 N2/4k, which is 16E6 for N=3E9 and k=20.
We have to take k up to about 32 before we get expected numbers below 1, and up to about 36 before having a repetition is surprising in a uniform random stream.