21.3. Algorithms on stochastic grammars

Below we give algorithms on stochastic transformational grammars, which play a central role in modern bioinformatics. Two types of transformational grammars are widespread: Hidden Markov Models (HMMs) are used for protein structure prediction and gene finding, while Stochastic Context-Free Grammars (SCFGs) are used for RNA secondary structure prediction.

21.3.1. Hidden Markov Models

We give the formal definition of Hidden Markov Models (HMMs). Let $S$ denote a finite set of states. There are two distinguished states among them, the start and the end state. The states are divided into two classes, emitting and non-emitting states. We assume that only the start and the end states are non-emitting; we will show that this assumption is not too restrictive.

The transformation matrix $M$ contains the transition probabilities $m_{kl}$, the probability that the Markov process jumps to state $l$ from state $k$. Emitting states emit characters from a finite alphabet, $\Sigma$. The probability that state $k$ emits the character $b$ will be denoted by $e_k(b)$. The Markov process starts in the start state, ends in the end state, and walks according to the transition probabilities in $M$. Each emitting state that is visited emits a character, and the emitted characters form a sequence. The process is hidden since the observer observes only the sequence and does not observe the path along which the Markov process walked. There are three important questions for HMMs that can be answered using dynamic programming algorithms.
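In the algorithmic sketches below we will use a small illustrative representation of such a model in Python. The state names, the alphabet and the probability values are assumptions made up for demonstration and do not come from the text:

```python
# A toy HMM representation: dictionaries for transition and emission probabilities.
# States "S1" and "S2" are emitting; "START" and "END" are the non-emitting states.
states = ["S1", "S2"]
trans = {                                   # trans[k][l] = m_{kl}: jump from state k to state l
    "START": {"S1": 0.6, "S2": 0.4},
    "S1":    {"S1": 0.7, "S2": 0.2, "END": 0.1},
    "S2":    {"S1": 0.3, "S2": 0.6, "END": 0.1},
}
emit = {                                    # emit[k][b] = e_k(b): state k emits character b
    "S1": {"a": 0.9, "b": 0.1},
    "S2": {"a": 0.2, "b": 0.8},
}
```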

The first question is the following: given an HMM and a sequence, what is the most likely path that emits the given sequence? The Viterbi algorithm gives the answer to this question. Recall that $X_i$ is the $i$-long prefix of sequence $X$, and $x_i$ is the character in the $i$th position. The dynamic programming answering the first question is based on the observation that we can calculate $v_k(i)$, the probability of the most probable path emitting prefix $X_i$ and being in state $k$, if we have already calculated $v_l(i-1)$ for all possible $l$, since

$$v_k(i) = e_k(x_i) \max_l \left\{ v_l(i-1)\, m_{lk} \right\} .$$

The reason behind the above equation is that the probability of any path is the product of transition and emission probabilities. Among the products having the same last two terms (in our case $m_{lk}\, e_k(x_i)$), the maximal one is that for which the product of the remaining terms is maximal.

The initialisation of the dynamic programming is

$$v_{\text{START}}(0) = 1, \qquad v_k(0) = 0 \quad \text{for all } k \neq \text{START} .$$

Since the end state does not emit a character, the termination of the dynamic programming algorithm is

$$P(X, \pi^*) = \max_k \left\{ v_k(L)\, m_{k,\text{END}} \right\} ,$$

where $L$ is the length of the sequence and $P(X, \pi^*)$ is the probability of the most likely path $\pi^*$ emitting the given sequence. One of the most likely paths can be obtained with a trace-back.
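A minimal Python sketch of the Viterbi recursion and trace-back above, repeating the toy model introduced earlier (all concrete state names and numbers are illustrative assumptions):

```python
# Viterbi algorithm sketch for an HMM with explicit START/END states.
states = ["S1", "S2"]
trans = {
    "START": {"S1": 0.6, "S2": 0.4},
    "S1":    {"S1": 0.7, "S2": 0.2, "END": 0.1},
    "S2":    {"S1": 0.3, "S2": 0.6, "END": 0.1},
}
emit = {"S1": {"a": 0.9, "b": 0.1}, "S2": {"a": 0.2, "b": 0.8}}

def viterbi(x):
    # v[i][k]: probability of the most probable path emitting x[:i+1] and ending in state k
    v = [{k: trans["START"].get(k, 0.0) * emit[k][x[0]] for k in states}]
    back = [{k: "START" for k in states}]
    for i in range(1, len(x)):
        row, ptr = {}, {}
        for k in states:
            best_l = max(states, key=lambda l: v[i - 1][l] * trans[l].get(k, 0.0))
            row[k] = emit[k][x[i]] * v[i - 1][best_l] * trans[best_l].get(k, 0.0)
            ptr[k] = best_l
        v.append(row)
        back.append(ptr)
    # termination: the END state emits nothing
    last = max(states, key=lambda k: v[-1][k] * trans[k].get("END", 0.0))
    best_prob = v[-1][last] * trans[last].get("END", 0.0)
    # trace-back of one most likely path
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return best_prob, list(reversed(path))

print(viterbi("abba"))
```

The table `v` is indexed by sequence position and state; the back pointers record, for each cell, the predecessor state that achieved the maximum.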

The second question is the following: given an HMM and a sequence, what is the probability that the HMM emits the sequence? This probability is the sum of the probabilities of the paths that emit the given sequence. Since the number of paths emitting a given sequence might grow exponentially with the length of the sequence, the naive algorithm that finds all the possible emitting paths and sums their probabilities would be too slow.

The dynamic programming algorithm that quickly calculates the probability in question is called the Forward algorithm. It is very similar to the Viterbi algorithm, only there is a summation instead of a maximisation in it:

$$f_k(i) = e_k(x_i) \sum_l f_l(i-1)\, m_{lk} .$$

Since the END state does not emit, the termination is

$$P(X) = \sum_k f_k(L)\, m_{k,\text{END}} ,$$

where $P(X)$ is the probability that the HMM emits sequence $X$.
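The corresponding Forward sketch differs from the Viterbi sketch only in replacing the maximisation with a summation (same assumed toy model):

```python
# Forward algorithm sketch: sums in place of maxima.
states = ["S1", "S2"]
trans = {
    "START": {"S1": 0.6, "S2": 0.4},
    "S1":    {"S1": 0.7, "S2": 0.2, "END": 0.1},
    "S2":    {"S1": 0.3, "S2": 0.6, "END": 0.1},
}
emit = {"S1": {"a": 0.9, "b": 0.1}, "S2": {"a": 0.2, "b": 0.8}}

def forward(x):
    # f[i][k]: probability of emitting the prefix x[:i+1] and being in state k
    f = [{k: trans["START"].get(k, 0.0) * emit[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({k: emit[k][x[i]] * sum(f[i - 1][l] * trans[l].get(k, 0.0) for l in states)
                  for k in states})
    # termination: sum over transitions into the non-emitting END state
    return sum(f[-1][k] * trans[k].get("END", 0.0) for k in states)

print(forward("abba"))   # P(X): total emission probability of the sequence
```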

The most likely path obtained by the Viterbi algorithm has more and less reliable parts. Therefore we are also interested in the probability

$$P(\pi_i = k \mid X) .$$

This is the third question that we answer with a dynamic programming algorithm. The above-mentioned probability is the sum of the probabilities of the paths that emit $x_i$ in state $k$, divided by the probability that the HMM emits sequence $X$. Since the number of such paths might grow exponentially, the naive algorithm that finds all the possible paths and sums their probabilities is too slow.

To answer the question, we first calculate for each state $k$ and position $i$ the probability $b_k(i)$ that the HMM emits the suffix $x_{i+1} x_{i+2} \ldots x_L$, given that state $k$ emitted $x_i$. This can be calculated with the Backward algorithm, which is similar to the Forward algorithm, only it starts the recursion at the end of the sequence:

$$b_k(i) = \sum_l m_{kl}\, e_l(x_{i+1})\, b_l(i+1) ,$$

with the initialisation $b_k(L) = m_{k,\text{END}}$.

Let $P(\pi_i = k, X)$ denote the joint probability that the HMM emits sequence $X$ and $x_i$ is emitted by state $k$. Then

$$P(\pi_i = k, X) = f_k(i)\, b_k(i) ,$$

and from this

$$P(\pi_i = k \mid X) = \frac{f_k(i)\, b_k(i)}{P(X)} ,$$

which is the needed probability.
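A sketch of the Backward algorithm and of the posterior probabilities $f_k(i)\,b_k(i)/P(X)$, again on the assumed toy model; the forward table is recomputed so that the snippet is self-contained:

```python
# Backward algorithm and posterior decoding sketch.
# b[i][k]: probability of emitting the suffix x[i+1:] given that state k emitted x[i].
states = ["S1", "S2"]
trans = {
    "START": {"S1": 0.6, "S2": 0.4},
    "S1":    {"S1": 0.7, "S2": 0.2, "END": 0.1},
    "S2":    {"S1": 0.3, "S2": 0.6, "END": 0.1},
}
emit = {"S1": {"a": 0.9, "b": 0.1}, "S2": {"a": 0.2, "b": 0.8}}

def forward_table(x):
    f = [{k: trans["START"].get(k, 0.0) * emit[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({k: emit[k][x[i]] * sum(f[i - 1][l] * trans[l].get(k, 0.0) for l in states)
                  for k in states})
    return f

def backward_table(x):
    n = len(x)
    b = [dict() for _ in range(n)]
    b[n - 1] = {k: trans[k].get("END", 0.0) for k in states}        # initialisation
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(trans[k].get(l, 0.0) * emit[l][x[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    return b

def posterior(x):
    f, b = forward_table(x), backward_table(x)
    px = sum(f[-1][k] * trans[k].get("END", 0.0) for k in states)   # P(X)
    # P(pi_i = k | X) = f_k(i) * b_k(i) / P(X)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]

for row in posterior("abba"):
    print(row)        # each row sums to 1 over the emitting states
```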

21.3.2. Stochastic context-free grammars

It can be shown that every context-free grammar can be rewritten into Chomsky normal form. Each rule of a grammar in Chomsky normal form has the form $W_v \to W_y W_z$ or $W_v \to a$, where the $W$s are non-terminal symbols and $a$ is a terminal symbol. In a stochastic grammar, each derivation rule has a probability, a non-negative number, such that the probabilities of the derivation rules of each non-terminal sum up to $1$.

Given an SCFG and a sequence, we can ask questions analogous to the three questions we asked for HMMs: what is the probability of the most likely derivation, what is the probability of deriving the sequence, and what is the probability that a substring has been derived starting from a given non-terminal, given that the SCFG derived the sequence. The first question can be answered with the CYK (Cocke-Younger-Kasami) algorithm, which is the Viterbi-equivalent algorithm for SCFGs. The second question can be answered with the Inside algorithm, the Forward-equivalent for SCFGs. The third question can be answered with the combination of the Inside and Outside algorithms; as expected, the Outside algorithm is analogous to the Backward algorithm. Although the introduced algorithms are analogous to the algorithms used for HMMs, their running time is significantly greater.

Let $t_v(y,z)$ denote the probability of the rule $W_v \to W_y W_z$, and let $e_v(a)$ denote the probability of the rule $W_v \to a$. The Inside algorithm calculates $\alpha(i,j,v)$ for all $i \le j$ and $v$; this is the probability that non-terminal $W_v$ derives the substring from $x_i$ till $x_j$. The initial conditions are:

$$\alpha(i,i,v) = e_v(x_i)$$

for all $i$ and $v$. The main recursion is:

$$\alpha(i,j,v) = \sum_{y=1}^{M} \sum_{z=1}^{M} \sum_{k=i}^{j-1} \alpha(i,k,y)\,\alpha(k+1,j,z)\,t_v(y,z) ,$$

where $M$ is the number of non-terminals. The dynamic programming table is an upper triangular matrix for each non-terminal; the filling-in of the table starts with the main diagonal and is continued with the other diagonals. The derivation probability is $\alpha(1,L,1)$, where $L$ is the length of the sequence and $W_1$ is the starting non-terminal. The running time of the algorithm is $\Theta(L^3 M^3)$, the memory requirement is $\Theta(L^2 M)$.
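A sketch of the Inside algorithm on an assumed toy SCFG in Chomsky normal form (two non-terminals, binary alphabet; the rule probabilities are made up so that they sum to one for each non-terminal):

```python
# Inside algorithm sketch for an SCFG in Chomsky normal form.
M = 2                                        # number of non-terminals, W0 is the start symbol
t = {                                        # t[v][(y, z)]: probability of W_v -> W_y W_z
    0: {(0, 0): 0.3, (1, 1): 0.2},
    1: {(0, 1): 0.1},
}
e = {                                        # e[v][a]: probability of W_v -> a
    0: {"a": 0.3, "b": 0.2},
    1: {"a": 0.4, "b": 0.5},
}

def inside(x):
    n = len(x)
    # alpha[i][j][v]: probability that W_v derives x[i..j] (0-based, inclusive)
    alpha = [[[0.0] * M for _ in range(n)] for _ in range(n)]
    for i in range(n):                       # initialisation on the main diagonal
        for v in range(M):
            alpha[i][i][v] = e[v].get(x[i], 0.0)
    for length in range(2, n + 1):           # fill the table diagonal by diagonal
        for i in range(0, n - length + 1):
            j = i + length - 1
            for v in range(M):
                total = 0.0
                for (y, z), p in t[v].items():
                    for k in range(i, j):    # split point: W_y derives x[i..k], W_z derives x[k+1..j]
                        total += alpha[i][k][y] * alpha[k + 1][j][z] * p
                alpha[i][j][v] = total
    return alpha

x = "abab"
alpha = inside(x)
print(alpha[0][len(x) - 1][0])               # derivation probability of the whole sequence
```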

The Outside algorithm calculates $\beta(i,j,v)$ for all $i \le j$ and $v$; this is the part of the derivation probability of deriving sequence $X$ which is “outside” of the derivation of the substring from $x_i$ till $x_j$ starting from $W_v$. A more formal definition of $\beta(i,j,v)$ is that it is the sum of the probabilities of those derivations in which the substring from $x_i$ till $x_j$ is derived from $W_v$, divided by $\alpha(i,j,v)$. Here we define $0/0$ as $0$. The initial conditions are:

$$\beta(1,L,1) = 1, \qquad \beta(1,L,v) = 0 \quad \text{for all } v \neq 1 .$$

The main recursion is:

$$\beta(i,j,v) = \sum_{y=1}^{M} \sum_{z=1}^{M} \left( \sum_{k=1}^{i-1} \alpha(k,i-1,z)\,\beta(k,j,y)\,t_y(z,v) \;+\; \sum_{k=j+1}^{L} \alpha(j+1,k,z)\,\beta(i,k,y)\,t_y(v,z) \right) .$$

The reasoning behind the above recursion is the following. The non-terminal $W_v$ was derived from a non-terminal $W_y$ together with a non-terminal $W_z$, and the derivation order could be both $W_z W_v$ and $W_v W_z$. The outside probability of non-terminal $W_v$ is the product of the outside probability of $W_y$, the probability of the rule, and the inside probability of $W_z$, summed over all such possibilities. As we can see, inside probabilities are needed to calculate the outside probabilities; this is a significant difference from the Backward algorithm, which can be used without the Forward algorithm.
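A sketch of the Outside recursion, reusing the inside table; the toy grammar repeats the assumptions of the previous sketch. Longer substrings are processed first, since $\beta(i,j,v)$ depends on outside values of longer spans; dividing $\alpha(i,j,v)\,\beta(i,j,v)$ by $\alpha(1,L,1)$ then gives the posterior probability discussed at the end of this subsection.

```python
# Outside algorithm sketch for an SCFG in Chomsky normal form.
M = 2
t = {0: {(0, 0): 0.3, (1, 1): 0.2}, 1: {(0, 1): 0.1}}
e = {0: {"a": 0.3, "b": 0.2}, 1: {"a": 0.4, "b": 0.5}}

def inside(x):
    n = len(x)
    alpha = [[[0.0] * M for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for v in range(M):
            alpha[i][i][v] = e[v].get(x[i], 0.0)
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            for v in range(M):
                alpha[i][j][v] = sum(alpha[i][k][y] * alpha[k + 1][j][z] * p
                                     for (y, z), p in t[v].items()
                                     for k in range(i, j))
    return alpha

def outside(x, alpha):
    n = len(x)
    beta = [[[0.0] * M for _ in range(n)] for _ in range(n)]
    beta[0][n - 1][0] = 1.0                  # initialisation: the start symbol derives everything
    for length in range(n - 1, 0, -1):       # shorter and shorter substrings
        for i in range(0, n - length + 1):
            j = i + length - 1
            for v in range(M):
                total = 0.0
                for y in range(M):
                    for (a, b), p in t[y].items():
                        if b == v:           # W_v is the right child; W_a derives x[k..i-1]
                            for k in range(0, i):
                                total += alpha[k][i - 1][a] * beta[k][j][y] * p
                        if a == v:           # W_v is the left child; W_b derives x[j+1..k]
                            for k in range(j + 1, n):
                                total += alpha[j + 1][k][b] * beta[i][k][y] * p
                beta[i][j][v] = total
    return beta

x = "abab"
alpha = inside(x)
beta = outside(x, alpha)
print(beta[1][1][1])                         # outside probability of W1 over x[1..1]
```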

The CYK algorithm is very similar to the Inside algorithm, only there are maximisations instead of summations:

$$\gamma(i,i,v) = e_v(x_i) ,$$

$$\gamma(i,j,v) = \max_{y,z} \; \max_{i \le k < j} \; \gamma(i,k,y)\,\gamma(k+1,j,z)\,t_v(y,z) .$$

The probability of the most likely derivation is $\gamma(1,L,1)$. The most likely derivation can be obtained with a trace-back.
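A sketch of the CYK algorithm with trace-back on the same assumed toy grammar; compared with the Inside sketch, the inner summation is replaced by a maximisation and a back pointer is stored:

```python
# CYK sketch for an SCFG in Chomsky normal form.
M = 2
t = {0: {(0, 0): 0.3, (1, 1): 0.2}, 1: {(0, 1): 0.1}}
e = {0: {"a": 0.3, "b": 0.2}, 1: {"a": 0.4, "b": 0.5}}

def cyk(x):
    n = len(x)
    gamma = [[[0.0] * M for _ in range(n)] for _ in range(n)]
    back = [[[None] * M for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for v in range(M):
            gamma[i][i][v] = e[v].get(x[i], 0.0)
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length - 1
            for v in range(M):
                for (y, z), p in t[v].items():
                    for k in range(i, j):
                        cand = gamma[i][k][y] * gamma[k + 1][j][z] * p
                        if cand > gamma[i][j][v]:
                            gamma[i][j][v] = cand
                            back[i][j][v] = (k, y, z)
    return gamma, back

def traceback(x, back, i, j, v, rules):
    # collect the rules of one most likely derivation
    if i == j:
        rules.append(f"W{v} -> {x[i]}")
    else:
        k, y, z = back[i][j][v]
        rules.append(f"W{v} -> W{y} W{z}")
        traceback(x, back, i, k, y, rules)
        traceback(x, back, k + 1, j, z, rules)
    return rules

x = "abab"
gamma, back = cyk(x)
print(gamma[0][len(x) - 1][0])               # probability of the most likely derivation
print(traceback(x, back, 0, len(x) - 1, 0, []))
```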

Finally, the probability that the substring from $x_i$ till $x_j$ has been derived by $W_v$, given that the SCFG derived the sequence, is:

$$\frac{\alpha(i,j,v)\,\beta(i,j,v)}{\alpha(1,L,1)} .$$

Exercises

21.3-1 In a regular grammar, each derivation rule is either of the form $W \to aW'$ or of the form $W \to a$, where $W$ and $W'$ are non-terminals and $a$ is a terminal symbol. Show that each HMM can be rewritten as a stochastic regular grammar. On the other hand, there are stochastic regular grammars that cannot be described as HMMs.

21.3-2 Give dynamic programming algorithms that calculate, for a stochastic regular grammar and a sequence,

  • the most likely derivation,

  • the probability of derivation,

  • the probability that character $x_i$ is derived by non-terminal $W$.

21.3-3 An HMM can contain silent states that do not emit any character. Show that any HMM containing silent states other than the start and end states can be rewritten into an HMM that does not contain silent states other than the start and end states and emits sequences with the same probabilities.

21.3-4 Pair Hidden Markov models are Markov models in which states can emit characters not only into one but into two sequences. Some states emit into only one of the sequences, some states emit into both sequences. The observer sees only the sequences and does not see which state emits which characters and which characters are co-emitted. Give the Viterbi, Forward and Backward algorithms for pair-HMMs.

21.3-5 The Viterbi algorithm does not use the fact that probabilities are probabilities, namely, that they are non-negative and sum up to one. Moreover, the Viterbi algorithm still works if we replace the multiplications with additions (say, we calculate the logarithms of the probabilities). Give a modified HMM, namely one in which the “probabilities” do not necessarily sum up to one and might be negative, too, such that the Viterbi algorithm with additions is equivalent to the Gotoh algorithm.

21.3-6 The secondary structure of an RNA sequence is a set of basepairings, in which for all basepairing positions $i \cdot j$ and $k \cdot l$, $i < k$ implies that either $i < k < l < j$ or $i < j < k < l$. The possible basepairings are A–U, U–A, C–G, G–C, G–U and U–G. Give a dynamic programming algorithm that finds the secondary structure containing the maximum number of basepairings for an RNA sequence. This problem was first solved by Nussinov et al. [256].

21.3-7 The derivation rules of the Knudsen-Hein grammar are [202], [201]

$$S \to LS \mid L, \qquad L \to s \mid dFd, \qquad F \to dFd \mid LS ,$$

where $s$ has to be substituted with the possible characters of RNA sequences, and the $d \ldots d$ pairs in the expressions have to be replaced by the possible basepairings. Show that the probability of the derivation of a sequence as well as the most likely derivation can be obtained without rewriting the grammar into Chomsky normal form.