Let's assume we have a series of letters, X X Z Y Z X Y Y Y Z X Z. In bi-gram format they turn into

X X,

X Z,

Z Y, and so on.
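To make the bi-gram step concrete, here is a minimal Python sketch of how I build the pairs. The variable names (`sequence`, `bigrams`, `bigram_counts`) are just my own choices for illustration.

```python
from collections import Counter

# The example series from above, split into individual letters.
sequence = "X X Z Y Z X Y Y Y Z X Z".split()

# Pair each letter with its successor to get overlapping bi-grams.
bigrams = list(zip(sequence, sequence[1:]))

# Count how often each (first, second) pair occurs.
bigram_counts = Counter(bigrams)

print(bigrams[:3])
print(bigram_counts)
```

A series of n letters gives n - 1 bi-grams, so the 12 letters here produce 11 pairs.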

My analysis is about searching for irregularities, for example whether Xs are followed by more Zs than expected rather than by Ys and Xs. I know I can run a full 3 by 3 analysis (three possible first letters by three possible second letters), but its p-value would not tell me where exactly the irregularity is (if one exists). Moreover, the Xs, Ys, and Zs are not evenly distributed.

I'm thinking about taking the proportion of Xs, Ys, and Zs in the whole corpus, then multiplying each proportion by the total number of bi-grams that start with a given letter, to get the expected counts for the second element. For instance:

In the corpus, there are 40 X, 20 Y, and 20 Z (80 in total), which gives proportions of 0.5, 0.25, and 0.25, summing to 1.

X is followed by 15 X, 10 Y, and 2 Z, for a total of 27. Multiplying this total by each proportion gives expected counts of 13.5 for X, 6.75 for Y, and 6.75 for Z.
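Here is a rough Python sketch of the calculation I have in mind. The function name `expected_followers` is hypothetical, something I made up for this post. The proportions are derived from the raw counts (each count divided by the corpus total of 80) so that they are guaranteed to sum to 1.

```python
def expected_followers(corpus_counts, n_following):
    """Spread n_following bi-grams over the letters in proportion
    to how often each letter occurs in the whole corpus."""
    total = sum(corpus_counts.values())
    return {letter: n_following * count / total
            for letter, count in corpus_counts.items()}

# Counts from the example above (illustrative numbers, not real data).
corpus_counts = {"X": 40, "Y": 20, "Z": 20}
observed_after_x = {"X": 15, "Y": 10, "Z": 2}

# Total number of bi-grams whose first element is X.
n_after_x = sum(observed_after_x.values())

expected_after_x = expected_followers(corpus_counts, n_after_x)
print(expected_after_x)
```

By construction the expected counts add up to the same total as the observed counts, which is what I would need before comparing the two.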

My question is: is it valid to use these numbers as expected values? If there were no anomaly, the distribution of the second elements of the bi-grams should be almost exactly the same as the overall distribution (off by one X, since the first letter of the series never appears as a second element).
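The "off by one" point can be checked on the short example series. This is only a sketch of my reasoning, with variable names of my own choosing; it compares the overall letter counts with the counts of letters that appear in second position.

```python
from collections import Counter

sequence = "X X Z Y Z X Y Y Y Z X Z".split()

# Counts of every letter in the series.
overall = Counter(sequence)

# Counts of letters that occur as the second element of some bi-gram,
# i.e. every letter except the very first one.
second = Counter(sequence[1:])

# Counter subtraction keeps only positive differences.
difference = overall - second
print(difference)
```

The only letter left over is the first letter of the series, so the two distributions differ by a single observation.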

I'm looking forward to your insights.