- Thread starter lamchops

Assume there is a time series: {1,5,3,5,1,5,3,5}. Can one determine the probability that the sequence 1,5,3,5 occurred two times in this series by chance? I would expect that it would be unlikely. Now assume that the series consisted of 10 billion single-digit integers. It occurs to me that the probability of finding 1,5,3,5 two times in that series would be very likely.

Is there a generalized method for determining these probabilities?

Perhaps the problem could be simplified and restated thus: suppose we have a random letter generator that outputs lower case letters of the first 10 letters of the alphabet (a,b,c, . . .,h,i,j), and we use this random letter generator to create a series, say, 1000 letters long. Now we create a 3 letter sequence, say: d, g, c. What is the likelihood that that 3 letter sequence will appear in the 1000 letter series?
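It is indeed straightforward to approximate: each of the 1000 − 3 + 1 = 998 starting positions matches a fixed 3-letter pattern with probability (1/10)^3, so, treating the positions as independent (they aren't quite, since overlapping windows are correlated), the chance of at least one occurrence is roughly 1 − (1 − 10^-3)^998 ≈ 0.63. Here's a minimal Monte Carlo sketch in Python to check that (the function name and trial count are my own choices, not anything from the thread):

```python
import random
import string

def prob_contains(pattern, series_len, alphabet_size=10, trials=2000, seed=0):
    """Monte Carlo estimate of P(pattern appears somewhere in a random
    series of the given length over the given alphabet)."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase[:alphabet_size]
    hits = 0
    for _ in range(trials):
        s = "".join(rng.choice(alphabet) for _ in range(series_len))
        if pattern in s:
            hits += 1
    return hits / trials

# First-order approximation: 998 starting positions, each matching with
# probability (1/10)**3, treated as if independent -- about 0.63
approx = 1 - (1 - 0.1 ** 3) ** (1000 - 3 + 1)
```

The simulated estimate and the closed-form approximation should agree to within Monte Carlo noise.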

It seems to me that this should be a straightforward probability problem.

OK, now let’s assume there is a 1000 element series, as above, comprised of the first ten letters of the alphabet: (a, b, c, d, e, f, g, h, i, j). We don't know whether it is randomly generated or not, but we work from the assumption that it is. We note that a three-letter sequence occurs twice in the series. That shouldn't be so unlikely that it would call into question the assumption the series is randomly generated.

But suppose we find a five-letter sequence that occurs twice in the 1000 element series, or a ten-letter sequence, or a twenty-letter sequence that occurs twice in the series? At some length the occurrence of two identical sequences in the 1000 element series is going to be so unlikely that it refutes the assumption that the series is randomly generated.
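One standard way to make "at some length" precise is to count expected coincidences. In an N-letter random series over a 10-letter alphabet there are about C(N−k+1, 2) pairs of length-k windows, and each pair matches with probability 10^-k, so the expected number of repeated windows is roughly C(N−k+1, 2) · 10^-k. When that number falls far below 1, seeing a repeat is suspicious. This is only a first-order estimate (overlapping windows are correlated), but here is a back-of-the-envelope sketch in Python, with a function name of my own invention:

```python
from math import comb

def expected_repeats(N, k, alphabet_size=10):
    """First-order estimate of the expected number of pairs of identical
    length-k windows in a random N-letter series over the alphabet.
    Ignores correlations between overlapping windows."""
    positions = N - k + 1                       # possible window starts
    return comb(positions, 2) * alphabet_size ** (-k)

# For N = 1000: hundreds of repeated 3-letter windows are expected,
# while a repeated 8-letter window is already a rare event
```

By Markov's inequality the probability of seeing at least one repeat is at most this expected count, which is what turns "so unlikely" into a number.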

I'm guessing that this is some sort of elementary probability/statistics problem which can be generalized as a function of the length of the series and the length of the identical sequences. I'm hoping someone can help me understand how to do this.

Any thoughts?

The possibility of overlapping sequences seems to me to be an excellent point that I hadn't thought of. Also, I guess there isn't any pressing need to have a closed-form solution, since one can calculate the probability for an individual instance.

I feel like I am making some significant progress in understanding this type of problem.

So suppose one had a 1K-letter series as above, and it was not known whether it was randomly generated. And suppose it contained two identical fifty-letter sequences (say one in the first third, not near the beginning of the series, and one in the last third, not near the end). How might one calculate the probability of this happening in a single 1K-letter randomly generated series?

Here's some R code to get you going:

R:

```
library(stringr)
library(gtools)

n <- 7        # word length
nchars <- 2   # alphabet size
alphabet <- letters[1:nchars]

# Enumerate every possible length-n word over the alphabet
words <- data.frame(permutations(nchars, n, v = alphabet, repeats.allowed = TRUE))

# Collapse each row of letters into a single string
makeWord <- function(j) paste(words[j, ], collapse = "")
words_ <- unlist(lapply(seq_len(nrow(words)), makeWord))

# Count occurrences of the target subsequence in each word
subseq <- "ab"
counts_ <- unlist(lapply(words_, function(w) str_count(w, subseq)))
```

I am not really interested in calculating the probability of a fifty-letter sequence occurring twice in a 1K series (not overlapping, etc.). I am interested in understanding how it is done, and in why, in this case, simulation might be a better way than traditional statistical analysis.

So perhaps I should start by going back to the simpler form of the question and ask: how would one calculate the actual probability of a four-letter sequence occurring twice (not overlapping, etc.) in a 1K-letter series (as described above), using traditional statistical analysis? Is this such a difficult problem that simulation would be a preferable approach to using traditional statistical analysis?
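For the four-letter case the traditional calculation is tractable: a fixed 4-letter pattern has expected count λ ≈ 997/10^4 ≈ 0.1 in a 1000-letter series, and under a rough Poisson approximation P(at least two occurrences) ≈ 1 − e^-λ(1 + λ) ≈ 0.005. A Monte Carlo check should land in that neighborhood; here is a sketch in Python (function name, trial count, and seed are my own choices):

```python
import random
import string

def prob_at_least_two(pattern, series_len, alphabet_size=10,
                      trials=5000, seed=1):
    """Monte Carlo estimate of P(pattern occurs at least twice, counting
    non-overlapping occurrences) in a random series."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase[:alphabet_size]
    hits = 0
    for _ in range(trials):
        s = "".join(rng.choice(alphabet) for _ in range(series_len))
        if s.count(pattern) >= 2:   # str.count counts non-overlapping matches
            hits += 1
    return hits / trials
```

So for short patterns both routes work; simulation mainly wins when the combinatorics (overlaps, self-overlapping patterns, "any pattern" rather than a fixed one) make the exact calculation fiddly.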

If you are just interested in testing for randomness, there are many suites of tests. This isn't my field but there seems to be plenty of stuff on the net.

I'm not interested in testing for randomness per se, although the question of randomness is implied in the problem. What I am investigating is at what length a pair of identical substrings in a series becomes so long that the probability they occurred randomly approaches zero. Using simulation, you showed that in a 1K series this substring length is around 8 or greater.

The specific problem I am looking at involves substring pairs of ~50 symbols in a ~40k series.

Using your "rough start" methodology from above, there are 10^50 possible fifty-symbol substrings. I'm not a professional mathematician, but by my estimation that is a really big number, a really, really big number, even relative to a 40K series.

On the one hand, one could just say that it is impossible that two identical 50-symbol substrings could randomly occur in a 40K series. On the other hand, it would be nice to be able to quantify the extent of that impossibility.
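That quantification falls out of the expected-count argument: with a 40,000-symbol series there are about C(39,951, 2) ≈ 8 × 10^8 pairs of 50-symbol windows, each pair matching with probability 10^-50, and by Markov's inequality the probability of even one repeated pair is at most the expected count. A quick sketch in Python (the numbers are the thread's, the variable names mine):

```python
from math import comb

N, k, A = 40_000, 50, 10            # series length, substring length, alphabet size
pairs = comb(N - k + 1, 2)          # candidate pairs of window positions
expected = pairs * A ** (-k)        # expected number of identical pairs

# Markov's inequality: P(at least one repeated 50-symbol window) <= expected,
# which comes out to roughly 8e-42 -- "impossible" made quantitative
```

So the impossibility is on the order of 10^-41, which is about as quantified as impossibility gets.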