statistical significance of sequence


Suppose I have a set of sequences of length n with symbols from an alphabet A = {a1, a2, ..., an}.
A particular sequence may appear multiple times in the set.

I would like to know which sequences have frequencies that are statistically significant. That is, I would like to identify sequences whose observed frequency is higher or lower than the frequency expected if the sequences had been generated at random.

To do this with a brute force method, I would determine the probability that the sequence would occur at random.

In particular, suppose we have an alphabet of letters A = {a1, a2, ..., an}. A random sequence is formed by selecting a letter from A with probabilities {p1, p2, ..., pn}, respectively.

So with simulation, I would generate random sequences of length n, and then look at their frequencies.

Then I would look at the difference between the observed sequence frequency in my sample data and the expected "random" frequency to determine which sequences are significant.
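
The simulation idea above could be sketched roughly as follows. This is only an illustration, assuming each letter is drawn independently with fixed probabilities; the function names, the toy alphabet, and the counts are all made up for the example.

```python
import random
from collections import Counter

def simulate_counts(letters, probs, seq_len, n_seqs, rng):
    """Draw n_seqs random sequences with independent letters and count each one."""
    draws = ("".join(rng.choices(letters, weights=probs, k=seq_len))
             for _ in range(n_seqs))
    return Counter(draws)

def empirical_p_value(target, observed_count, letters, probs, seq_len,
                      n_seqs, n_rounds=1000, seed=0):
    """Estimate how often a random set contains `target` at least
    `observed_count` times (one-sided, over-representation only)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rounds):
        counts = simulate_counts(letters, probs, seq_len, n_seqs, rng)
        if counts[target] >= observed_count:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_rounds + 1)

# Toy setting (assumed, not real data): 200 sequences of length 3 over {A, B},
# in which "AAA" was seen 45 times (about 25 would be expected at random).
p = empirical_p_value("AAA", observed_count=45, letters="AB",
                      probs=[0.5, 0.5], seq_len=3, n_seqs=200)
print(p)
```

The smallest p-value this can report is 1 / (n_rounds + 1), so ranking very rare sequences against each other would need many more rounds or an analytic calculation.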

I had two main questions:

1. In general, how should I determine the probabilities of letters {p1, p2, ..., pn}? Can I simply look at my set of sequences and count the frequency at which letters appear and divide this by the total number of letter appearances?

2. Is there an analytic way to determine which sequence is the most statistically significant?
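
For question 1, the counting I have in mind would look something like the sketch below: pool every letter in every sequence and divide by the total. The function name and the sample data are hypothetical.

```python
from collections import Counter

def letter_frequencies(sequences):
    """Pooled relative frequency of each letter across all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)  # counts the individual letters of seq
    total = sum(counts.values())
    return {letter: c / total for letter, c in counts.items()}

# Made-up sample: 5 A's and 7 B's in total.
sample = ["AAB", "ABB", "AAB", "BBB"]
freqs = letter_frequencies(sample)
print(freqs)
```

One thing I am unsure about is that these estimates come from the same data being tested, so a heavily over-represented sequence pushes up the frequencies of its own letters.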

In general, I do know of chi-square tests, Anderson-Darling tests, and KS tests. But these tests compare two distributions. I am mainly interested in identifying "the most statistically significant" sequence, not whether the distribution matches a random distribution or not.

I also know of the seminal paper by Karlin and Altschul titled "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes", but I'm not sure if it is applicable, as the paper seems to assume a scoring scheme.

Randomness tests just tell me whether the sequence is random or not, not whether its frequency is statistically significant.

Thank you.


TS Contributor
Then you want to do a 1-sample proportion test for each alphabet and see which one gives you the smallest p-value?

Thank you for your reply, BGM. In your statement, "Then you want to do a 1-sample proportion test for each alphabet", did you mean for each sequence (versus alphabet)?

Below is how I would apply the proportion test if I understand you correctly:

The proportion would be the fraction of sequences in the set that match the desired sequence.

The null hypothesis would be that the observed sequence was generated at random.
In particular:
H_0: p = p_0, where p_0 is a specific numeric value for the population proportion p.

p_0 would be the proportion expected if the sequences were generated at random.
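
If the letters are assumed to be drawn independently (my assumption for the random model), p_0 for a given sequence is just the product of its letter probabilities. A minimal sketch, with made-up probabilities and a hypothetical function name:

```python
import math

def sequence_probability(seq, letter_probs):
    """Probability of one specific sequence when letters are independent."""
    return math.prod(letter_probs[ch] for ch in seq)

# Assumed letter probabilities, for illustration only.
letter_probs = {"A": 0.5, "B": 0.3, "C": 0.2}
p0 = sequence_probability("ABA", letter_probs)  # 0.5 * 0.3 * 0.5
print(p0)
```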

Alternative hypothesis:

H_A: p > p_0

Under H_0: p = p_0, the test statistic is
z = (p_hat - p_0) / SD(p_hat)

where SD(p_hat) = sqrt( [p_0 * (1 - p_0)] / n )

Here n would be the number of sequences in the sample (not the sequence length).

p-value is the probability, calculated assuming the null hypothesis H_0 is true, of observing a value of the test statistic more extreme than the value we actually observed.

In this case, the p-value is P(Z > z_0), where z_0 is the observed value of the test statistic.

A small p-value means the data we observed would be very unlikely if our null hypothesis H_0 is true.
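
Putting the steps above together, the test for a single sequence might be sketched like this. The function name and the numbers are made up, and the normal approximation is an assumption that may be poor when n * p_0 is small, which I would expect for long sequences.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_proportion_test(count, n, p0):
    """z statistic and upper-tail p-value for H_0: p = p0 vs H_A: p > p0."""
    p_hat = count / n
    sd = sqrt(p0 * (1 - p0) / n)         # SD of p_hat under H_0
    z = (p_hat - p0) / sd
    p_value = 1 - NormalDist().cdf(z)    # P(Z > z) for a standard normal Z
    return z, p_value

# Hypothetical numbers: the sequence appears 30 times among 200 sequences,
# and its probability under the random model is taken to be 0.075.
z, p_value = one_sided_proportion_test(count=30, n=200, p0=0.075)
print(z, p_value)
```

Repeating this for every distinct sequence and sorting by p-value would give a ranking, though the smallest p-value among many tests is not directly comparable to a single-test threshold.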