# Chance?

#### lamchops

##### New Member
Hi everybody.

I'm trying to understand the methodology for determining the likelihood a sequence that repeats twice in a time series is not by chance. Assuming the two sequences are identical, I'm guessing the likelihood is some function of the length of the sequence and the size of the series.

#### Dason

##### Ambassador to the humans
I don't quite understand what you're asking. Maybe providing some examples could help

#### lamchops

##### New Member
Thanks for the reply. It may help me to better understand my question.

Assume there is a time series: {1,5,3,5,1,5,3,5,}. Can one determine the probability that the sequence1,5,3,9,5 occurred two times in this series by chance? I would expect that it would be unlikely. Now assume that the series consisted of 10 billion single digit integers. It occurs to me that the probability of finding 1,5,3,5 two times in that series would be very likely.

Is there a generalized method for determining these probabilities?

#### lamchops

##### New Member
That's a good question, but I'm not sure it is what I was trying to get at.

Perhaps the problem could be simplified and restated thus: suppose we have a random letter generator that outputs lower case letters of the first 10 letters of the alphabet (a,b,c, . . .,h,i,j), and we use this random letter generator to create a series, say, 1000 letters long. Now we create a 3 letter sequence, say: d, g, c. What is the likelihood that that 3 letter sequence will appear in the 1000 letter series?

It seems to me that this should be a straightforward probability problem.

#### lamchops

##### New Member
So in a 1000 letter series generated by the above random letter generator, one should expect to find approximately 100 "d"s, and approximately 10 "d, g" pairs, and approximately 1 "d, g, c" sequence? Am I on the right track here?

#### fed2

##### Active Member
yeah i think you are on the right track. at least as long as the the letter series is much longer than the sequence you are counting.

#### lamchops

##### New Member
Thanks for the reply,

OK, now let’s assume there is a 1000 element series, as above, comprised of the first ten letters of the alphabet: (a, b, c, d, e, f, g, h, i, j). We don't know whether it is randomly generated or not, but we work from the assumption that it is. We note that a three-letter sequence occurs twice in the series. That shouldn't be so unlikely that it would call into question the assumption the series is randomly generated.

But suppose we find a five-letter sequence that occurs twice in the 1000 element series, or a ten-letter sequence, or a twenty-letter sequence that occurs twice in the series? At some length the occurrence of two identical sequences in the 1000 element series is going to be so unlikely that it refutes the assumption that the series is randomly generated.

I'm guessing that this is some sort of elementary probability/statistics problem which can be generalized as a function the length of the series and the length of the identical sequences. I'm hoping someone can help me understand how to do this.

Any thoughts?

#### katxt

##### Active Member
You can do a simulation for any particular case. I doubt if there is an exact closed form answer because of end effects and the possibility of overlapping sequences.

#### lamchops

##### New Member
Thank you for the help Katxt,

The possibility of overlapping sequences seems to me to be an excellent point that I hadn't thought of. Also, I guess there isn't any pressing need to to have a closed-form solution since one can calculate the probability for an individual instance.

I feel like I am making some significant progress in understanding this type of problem.

So if one had a 1K letter series as above and it was not known if it was randomly generated or not. And if in it there were two identical fifty-letter sequences (let's say one in the first third, not near the beginning of the series, and one in the last third of the series, not near the end). Then how might one calculate the probability of this happening in a single 1K letter randomly generated series?

#### fed2

##### Active Member
simulation will probably explode your computer for even moderate n, but is good place to start. I think when it comes to complicated sequence counting like problems the go to stratagem is to establish some sort of recursive relationship, ie of the form x_n+1 = x_n*2 + ...., or something. Then generating function. Thats from distant memory. For simple problems you might be able to count in a simpler way.

Heres some r code to get you goin'

C-like:
library(stringr);
library(gtools);

n = 7;
nchars = 2
alphabet = letters[1:nchars]

words = data.frame(  permutations(nchars, n, v=alphabet, repeats.allowed = T )   );

makeWord = function(j){
paste(words[j,], collapse = '')
}

words_ = as.vector( unlist(  lapply(1:nrow(words), makeWord )  )  );

subseq = 'ab'

countsubseq = function(j){
str_count(words_[j], subseq)
}

counts_ = as.vector( unlist(  lapply(1:nrow(words), countsubseq )    )  );

#### lamchops

##### New Member
Yeah, I don't want to blow up my computer , , , ,

I am not really interested in calculating the probability of a fifty-letter sequence occurring twice in a 1K series (not overlapping, etc.). I am interested in understanding how it is done. And in this case why simulation might be a better way than using traditional statistical analysis.

So perhaps I should start by going back to the simplier form of the question and ask: how would one calculate the actual probability of a four-letter sequence occurring twice (not overlapping, etc.) in a 1K letter series (as described above), using traditional statistical analysis? Is this such a difficult problem that simulation would be a preferable approach to using traditional statistical analysis?

#### fed2

##### Active Member
for sim just do:

counter=0
generate a sequence
if sequence has 2 occurences then counter + 1

do it until blue in face.

#### katxt

##### Active Member
Here's a rough start.
10 symbols. A full 1000 string has approx 1000 substrings of 4. There are 10^4 or 10000 possible substrings. So the probability that some given substring appears somewhere in the 1000 string is about 10%.
Not what you asked for but it might be useful.

#### lamchops

##### New Member
So if one picks a 4 symbol substring from the rqandomly generated 1k symbol series, the probability that it will occur a second time in the series is . . . . ~10%?

Last edited:

#### katxt

##### Active Member
Yes. But it's virtually certain that there will be at least one matching pair in the 1000 string. Here's what a simulation gave. 2500 runs each of matches in a 1000 string. Probably accurate to nearest % or so.

If you are just interested in testing for randomness, there are many suites of tests. This isn't my field but there seems to be plenty of stuff on the net.

#### lamchops

##### New Member
That simulation was very helpful, thank you.

I'm not interested in testing for randomness per se, although the question of randomness is implied in the problem. What I am investigating is at what length two identical substrings in a series become so large that the probability they occurred randomly in a series approaches zero. Using simulation, you showed that in a 1k series this substring length is around 8 or greater.

The specific problem I am looking at involves substring pairs of ~50 symbols in a ~40k series.

Using your "rough start" methodology from above there are 10^50 possible substrings. I’m not a professional mathematician, but by my estimation, that is a really big number—a really, really big number. Even relative to a 40k series.

On the one hand, one could just say that it is impossible that two 50 symbol substrings could randomly occur in a 40k series. On the other hand, it would be nice to be able to quantify the extent of that impossibility.

Last edited:

#### katxt

##### Active Member
You could probably put an upper limit on it. The chance that the first 50 symbol substring is repeated is about 40 000/10^50. There are 40 000 strings to make a possible match. 40 000*40 000/10^50 <10^-40. (I think, but it's not easy to get your head round this stuff. Pretty unlikely, anyhow.)

#### lamchops

##### New Member
Thanks,

That exercise was very helpful. I went in thinking that the probability of a string occurring twice in a series was a function of the length of the string and the length of the series, but I learned that it is also a function of the number of symbols that can occur at each point (in this case: 10).

While this exercise in combinational probabilities might be of theoretical interest to a first-year mathematics student, it likely doesn’t have much application in the real world of time series of continuous process where the number of symbols that can occur at each point is much, much (dare I say infinitely) larger. Which takes me to the next question I hope someone can help me understand.

Suppose one is examining a time series of a continuous process and it is discovered that two subsets of the series correlate exactly (correlation coefficient ~1). Is there a way of calculating the probability that this has occurred randomly? I would think that this probability would also be a function of the length of the subsets relative to the length of the series. Is there an established statistical methodology for approaching this type of problem?

Last edited: