21.4. 21.4 Comparing structures

In this section, we give dynamic programming algorithms for comparing structures. As we can see, aligning labelled rooted trees is a generalisation of sequence alignment. The recursions in the dynamic programming algorithm for comparing HMMs yields a linear equation system due to circular dependencies. However, we still can call it dynamic programming algorithm.

21.4.1. 21.4.1 Aligning labelled, rooted trees

Let be a finite alphabet, and , . Labelling of tree is a function that assigns a character of to each node . If we delete a node from the tree, then the children of the node will become children of the parental node. If we delete the root of the tree, then the tree becomes a forest. Let be a rooted tree labelled with characters from , and let represent the labelling. is an alignment of trees and labelled with characters from if restricting the labeling of to the first (respectively, second) coordinates and deleting nodes labelled with ' ' yields tree (respectively, ). Let be a similarity function. An optimal alignment of trees and is the tree labelled with for which

is maximal. This tree is denoted by . Note that a sequence can be represented with a unary tree, which has a single leaf. Therefore aligning trees is a generalisation of aligning sequences (with linear gap penalty).

Below we will concern only with trees in which each node has a degree at most . The recursion in the dynamic programming algorithm goes on rooted subtrees. A rooted subtree of a tree contains a node of the tree and all nodes that are smaller than . The tree obtained by root is denoted by .

A tree to an empty tree can be aligned only in one way. Two leafes labelled by and can be aligned in three different way. The alignment might contain only one node labelled with or might contain two nodes, one of them is labelled with , the other with . One of the points is the root, the other the leaf.

Similarly, when we align a single leaf to a tree, then in the alignment either the single character of the node is labelled together with a character of the tree or labelled together with ' ' in an independent node. This node can be placed in several ways on tree , however the score of any of them is the same.

After this initialisation, the dynamic programming algorithm aligns greater rooted subtrees using the alignments of smaller rooted subtrees. We assume that we already know the score of the optimal alignments , , , , , , and when aligning subtrees and , where and are the children of and and are the children of . Should one of the nodes have only one child, the dynamic programming reduces the problem of aligning and to less subproblems. We assume that the algorithm also knows the score of the optimal alignments of to the empty tree and the score of the optimal alignment of to the empty tree. Let the labelling of be and the labelling of be . We have to consider constant many subproblems for obtaining the score of the optimal alignment of and . If one of the tree is aligned to one of the children's subtree of the other tree, then the other child and the root of the other tree is labelled together with ' '. If character of is co-labelled with the character of , then the children nodes are aligned together, as well. The last situation is the case when the roots are not aligned in but one of the roots is the root of and the other root is its only child. The children might or might not be aligned together, this is five possible cases altogether.

Since the number of rooted subtrees equals to the number of nodes of the tree, the optimal alignment can be obtained in time, where and are the number of nodes in and .

21.4.2. 21.4.2 Co-emission probability of two HMMs

Let and be Hidden Markov Models. The co-emission probability of the two models is

where the summation is over all possible sequences and is the probability that model emitted sequence . The probability that path emitted sequence is denoted by , a path from the START state till the state is denoted by . Since state can be reached on several paths, this definition is not well-defined, however, this will not cause a problem later on. Since the coemission probability is the sum of the product of emission of paths,

Let denote the path that can be obtained with removing the last state from , and let be the state before in path . (We define similarly and .) Hence

where is the probability of jumping to END from , and

can be also obtained with this equation:

where is the probability that emitted . Equation 21.62 defines a linear equation system for all pairs of emitting states and . The initial conditions are:

Unlike the case of traditional dynamic programming, we do not fill in a dynamic programming table, but solve a linear equation system defined by Equation 21.62. Hence, the coemission probability can be calculated in time, where and are the number of emitting states of the models.

Exercises

21.4-1 Give a dynamic programming algorithm for the local similarities of two trees. This is the score of the most similar subtrees of the two trees. Here subtrees are any consecutive parts of the tree.

21.4-2 Ordered trees are rooted trees in which the children of a node are ordered. The ordered alignment of two ordered trees preserve the orderings in the aligned trees. Give an algorithm that obtains the optimal ordered alignment of two ordered trees and has running time being polynomial with both the maximum number of children and number of nodes.

21.4-3 Consider the infinite Euclidean space whose coordinates are the possible sequences. Each Hidden Markov model is a vector in this space the coordinates of the vector are the emission probabilities of the corresponding sequences. Obtain the angle between two HMMs in this space.

21.4-4 Give an algorithm that calculates the generating function of the length of the emitted sequences of an HMM, that is

where is the probability that the Markov model emitted a sequence with length .

21.4-5 Give an algorithm that calculates the generating function of the length of the emitted sequences of a pair-HMM, that is

where is the probability that the first emitted sequence has length , and the second emitted sequence has length .