[Likelihood of embedded Markov chain] Likelihood of Goldman Yang (GY) rate matrix


New Member
This question concerns defining the likelihood of a set of parameters in a rate matrix of an embedded Markov chain.
To get a brief idea of the context I will try to explain the problem with a simpler one first.
I am working with evolutionary models (see http://en.wikipedia.org/wiki/Models_of_DNA_evolution) that try to model the evolution of a DNA sequence (S1) into another (S2). Each position in the sequences of \(n\) length can take one of the values \(\{a,c,g,t\}\). Then

\cdot & \alpha & \beta & \beta \\
\alpha & \cdot & \beta & \beta \\
\beta & \beta & \cdot & \alpha \\
\beta & \beta & \alpha & \cdot \\
where entries are in the order \((a,g,c,t)\) and the diagonal entries are \(-\alpha-2\beta\), so that the rows sum to \(0\).
This is the rate matrix of an embedded markov chain and transition probabilites are given by \(e^{Qt}\). \(t\) can be set to \(1\).
1-p_{ts}-2p_{tv} & p_{ts} & p_{tv} & p_{tv} \\
p_{ts} & 1-p_{ts}-2p_{tv} & p_{tv} & p_{tv} \\
p_{tv} & p_{tv} & 1-p_{ts}-2p_{tv} & p_{ts} \\
p_{tv} & p_{tv} & p_{ts} & 1-p_{ts}-2p_{tv} \\
\(p_{ts} = p_{ts}(\alpha,\beta) = \frac{1}{4} + \frac{1}{4} e^{-4\beta} - \frac{1}{2} e^{-2(\alpha+\beta)}\) and \(p_{tv} = p_{tv}(\beta) = \frac{1}{4} - \frac{1}{4} e^{-4\beta}\)

Assume I know the two sequences and thus can count the number of changes from S1 to S2. Transversions are changes from either a or g in one position to c or t or vica versa and transitions are changes from either \(a \rightarrow g\), \(g \rightarrow a\), \(c \rightarrow t\) or \(t \rightarrow c\). Let \(n_{ts} = n_{ag} + n_{ga} + n_{ct} + n_{tc}\) be the number of transitions, \(n_{tv} = n_{ac} + \cdots + n_{tg}\) (8 terms) the number of transversions, and \(n\) the total number of sites in the alignment. Then the likelihood is given by

\(L(Q;S_1,S_2) = L(\alpha,\beta;S_1,S_2) = P_{\alpha,\beta}(S_1,S_2)\)

As \(S_1\) and \(S_2\) are independent, then
= \prod_{i=1}^n P(S_1(i),S_2(i))
= \prod_{i=1}^n P(S_2(i)|,S_1(i))P(S_1(i))

\(P(S_1(i))\) can be set to the stationary distribution of the matrix (\(\varphi=(\frac{1}{4},\frac{1}{4},\frac{1}{4},\frac{1}{4}))\). In other words, the chance of observing a given letter at any position in S1 is \(\frac{1}{4}\).

This can also be written as

L(\alpha,\beta;S_1(i),S_2(i)) = \frac{1}{4} \prod_{i=1}^n p_{ts}^{1((S_1(i),S_2(i))=ts)}p_{tv}^{1((S_1(i),S_2(i))=tv)}(1-p_{ts}-2p_{tv})^{1(S_1(i)=S_2(i))}

It then follows that
L(\alpha,\beta;S_1(i),S_2(i)) =
\frac{1}{4} P_{ts}^{\sum_{i=1}^n1(S_1(i),S_2(i)=ts)}
= \frac{1}{4}P_{ts}^{n_{ts}}P_{tv}^{n_{tv}}(1-P_{ts}-2P_{tv})^{n-n_{ts}-n_{tv}}
\propto P_{ts}^{n_{ts}}P_{tv}^{n_{tv}}(1-P_{ts}-2P_{tv})^{n-n_{ts}-n_{tv}}

The parameters \(\alpha\) and \(\beta\) can then be estimated by maximizing the log-likelihood using the EM algorithm.

Okay, so this was a rather simple model, the so called Kimura model. However, I would like to do something similar with a slightly more complex model, the Goldman Yang (1994) model (GY).

This is a model on the codon level, i.e. for protein- coding sequence data. What is important is that a codon is a triple of letters from the sequence that 'codes' for an amino acid. There are 20 different amino acids and sequence of DNA can therefore be translated into amino acids following this table: http://en.wikipedia.org/wiki/DNA_codon_table
The codons marked in red are called stop codons, but are ignored here. So assume that none of the sequences S1 or S2 contain any stop codons.

The GY substitution model is a continuous time Markov chain on the 61 sense codons.
So in other words, we have a 61x61 rate matrix of codons and the entries in the matrix will be as follows:

q_{ij} = \begin{cases}
0, & \text{if the two codons differ at more than one position} \\
\varphi_j, & \text{for synonymous transversion}\\
\kappa\varphi_j, & \text{for synonymous transition}\\
\omega\varphi_j, & \text{for non-synonymous transversion}\\
\omega\kappa\varphi_j, & \text{for non-synonymous transition}\\
, where synonymous is when a substitution (transition or transversion) does not result in a change in the amino acid encoded by the codon. Non-synonymous is when the substitution changes the amino acid.
I have trouble defining the likelihood of the parameters in this model, but I would like to define the likelihood in manner similar to the Kimura model, so that I can use the EM-algorithm to estimate the parameters.
I hope the description of the problem is understandable, otherwise I will be happy to elaborate. Any help on defining the likelihood would be appreciated.