21.2 Algorithms on trees

The algorithms introduced in this section work on rooted trees. The dynamic programming is based on reduction to rooted subtrees. As we will see, besides finding optimal solutions, the same technique lets us calculate algebraic expressions in the same running time.

21.2.1 The small parsimony problem

The (weighted) parsimony principle is to describe the changes of biological sequences with the minimum number (minimum weight) of mutations. We will be concerned only with substitutions; that is, the input sequences have the same length, and the problem is to give the evolutionary relationships of the sequences using only substitutions and the parsimony principle. We can define the large and the small parsimony problem. In the large parsimony problem, we do not know the topology of the evolutionary tree showing the evolutionary relationships of the sequences, hence the problem is to find both the best topology and an evolutionary history on the tree. The solution is optimal not only locally but also globally. It has been proved that the large parsimony problem is NP-complete [ 119 ].

The small parsimony problem is to find the most parsimonious evolutionary history on a given tree topology. The solution of the small parsimony problem is only locally optimal: it is optimal for the given topology, but there is no guarantee that it is optimal over all topologies.

Each position of the sequences is scored independently, therefore it is enough to find a solution for the case where there is only one character at each leaf of the tree. In this case, the evolutionary history can be described by labelling the internal nodes with characters. If the two characters at neighbouring vertices are the same, then no mutation happened on the corresponding edge, otherwise one mutation happened. The naive algorithm investigates all possible labellings and selects the most parsimonious one. Obviously, it is too slow, since the number of possible labellings grows exponentially with the number of internal nodes of the tree.
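To make the cost of exhaustive search concrete, the following minimal Python sketch enumerates every labelling of the internal nodes for a single character. The dictionary-based tree encoding, the function names, and the four-leaf example tree are assumptions made for illustration; they do not come from the text.

```python
from itertools import product


def count_mutations(children, labels):
    """Number of edges whose two endpoints carry different characters."""
    return sum(labels[parent] != labels[child]
               for parent, kids in children.items()
               for child in kids)


def naive_small_parsimony(children, leaf_chars, alphabet):
    """Try every labelling of the internal nodes and keep a cheapest one."""
    internal = list(children)          # internal nodes are the dictionary keys
    best_cost, best_labels = None, None
    for assignment in product(alphabet, repeat=len(internal)):
        labels = dict(leaf_chars)      # leaves keep their observed characters
        labels.update(zip(internal, assignment))
        cost = count_mutations(children, labels)
        if best_cost is None or cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels


# Hypothetical example tree ((a, b), (c, d)) with root "r".
children = {"r": ["x", "y"], "x": ["a", "b"], "y": ["c", "d"]}
leaf_chars = {"a": "A", "b": "C", "c": "A", "d": "G"}
cost, labels = naive_small_parsimony(children, leaf_chars, "ACGT")
```

With three internal nodes and four characters the loop already visits 4^3 = 64 labellings, and every additional internal node multiplies this count by the alphabet size.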

The dynamic programming (the Sankoff algorithm) is based on the reduction to smaller subtrees [ 294 ]. Here the definition of subtrees is the following: there is a natural partial ordering on the nodes of the rooted tree such that the root is the greatest node and the leaves are minimal. A subtree is defined by a node, and it contains this node and all nodes that are smaller than it. The given node is the root of the subtree. Suppose that for any child $t$ of the node $r$ and any character $\omega$ we know the minimum number of mutations needed on the subtree with root $t$, given that there is $\omega$ at node $t$. Let $m_{t,\omega}$ denote this number; for a leaf $t$, $m_{t,\omega}$ is $0$ if $\omega$ is the character at the leaf and $\infty$ otherwise. Then

$$m_{r,\omega} = \sum_{t \in D(r)} \min_{\sigma \in \Sigma} \left\{ m_{t,\sigma} + \delta_{\omega,\sigma} \right\} \qquad (21.39)$$

where $D(r)$ is the set of children of $r$, $\Sigma$ is the alphabet, and $\delta_{\omega,\sigma}$ is $0$ if $\omega = \sigma$ and $1$ otherwise.

The minimum number of mutations on the entire tree is $\min_{\omega \in \Sigma} m_{r,\omega}$, where $r$ is the root of the tree. A most parsimonious labelling can be obtained by tracing back through the tree from the root to the leaves, writing at each node a character that attains the minimum in Eqn. 21.39. To do this, we have to store $m_{r,\omega}$ for all $r$ and $\omega$.

The running time of the algorithm is $\Theta(n|\Sigma|^2)$ for one character, where $n$ is the number of nodes of the tree, and $\Theta(nl|\Sigma|^2)$ for entire sequences, where $l$ is the length of the sequences.
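The dynamic programming and the trace-back described above can be sketched in Python as follows. This is a minimal illustration for one character, not a reference implementation: the tree encoding (a dictionary of child lists plus a dictionary of leaf characters), the function name `sankoff`, and the example tree are assumptions. The sketch also assumes the usual base case, in which a leaf has cost 0 for its own character and infinite cost for every other character.

```python
import math


def sankoff(children, leaf_chars, alphabet, root):
    """Return (minimum number of mutations, one most parsimonious labelling)."""
    m = {}  # m[node][char]: minimum mutations in the subtree if node carries char

    def fill(r):
        if r in leaf_chars:            # leaf: only the observed character is allowed
            m[r] = {w: (0 if w == leaf_chars[r] else math.inf) for w in alphabet}
            return
        for t in children[r]:          # post-order: children before their parent
            fill(t)
        m[r] = {
            w: sum(min(m[t][s] + (w != s) for s in alphabet) for t in children[r])
            for w in alphabet
        }

    labels = {}

    def trace(r, w):
        labels[r] = w
        for t in children.get(r, []):
            # Pick a child character that attains the minimum in the recursion.
            best = min(alphabet, key=lambda s: m[t][s] + (w != s))
            trace(t, best)

    fill(root)
    root_char = min(alphabet, key=lambda w: m[root][w])
    trace(root, root_char)
    return m[root][root_char], labels


# Hypothetical example tree ((a, b), (c, d)) with root "r".
children = {"r": ["x", "y"], "x": ["a", "b"], "y": ["c", "d"]}
leaf_chars = {"a": "A", "b": "C", "c": "A", "d": "G"}
cost, labels = sankoff(children, leaf_chars, "ACGT", "r")
```

Each edge is examined once for every pair of characters, which is where the quadratic dependence on the alphabet size comes from. The recursive formulation keeps the sketch short; for very deep trees an explicit stack would avoid Python's recursion limit.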

21.2.2 The Felsenstein algorithm

The input of the Felsenstein algorithm [ 104 ] is a multiple alignment of DNA (or RNA or protein) sequences, an evolutionary tree topology with edge lengths, and a model that gives, for each pair of characters $\sigma_1$ and $\sigma_2$ and each time $t$, the probability that $\sigma_1$ evolves to $\sigma_2$ during time $t$. Let $f_{\sigma_1\sigma_2}(t)$ denote this probability. The equilibrium probability distribution of the characters is denoted by $\pi$. The question is what the likelihood of the tree is, namely, what the probability of observing the sequences at the leaves is, given the evolutionary parameters consisting of the edge lengths and the parameters of the substitution model.

We assume that each position evolves independently, hence the probability of an evolutionary process is the product of the evolutionary probabilities for each position. Therefore it is enough to show how to calculate the likelihood for one sequence position. We show this for the example tree in Figure 21.1. Here $s_i$ denotes the character at node $i$ and $v_j$ is the length of edge $j$. Since we do not know the characters at the internal nodes, we must sum the probabilities for all possible configurations:

$$L = \sum_{s_0} \sum_{s_6} \sum_{s_7} \sum_{s_8} \pi_{s_0} f_{s_0 s_6}(v_6) f_{s_6 s_1}(v_1) f_{s_6 s_2}(v_2) f_{s_0 s_8}(v_8) f_{s_8 s_3}(v_3) f_{s_8 s_7}(v_7) f_{s_7 s_4}(v_4) f_{s_7 s_5}(v_5) \qquad (21.40)$$

If we consider the four-character alphabet of DNA, the summation has $4^4 = 256$ terms, and in the case of $n$ species it would have $4^{n-1}$ terms; that is, the computational time grows exponentially with the number of sequences. However, if we move the factors that do not depend on a summation index out of that summation, we get the following product:

$$L = \sum_{s_0} \pi_{s_0} \left\{ \sum_{s_6} f_{s_0 s_6}(v_6) \left[ f_{s_6 s_1}(v_1) \right] \left[ f_{s_6 s_2}(v_2) \right] \right\} \times \left\{ \sum_{s_8} f_{s_0 s_8}(v_8) \left[ f_{s_8 s_3}(v_3) \right] \left( \sum_{s_7} f_{s_8 s_7}(v_7) \left[ f_{s_7 s_4}(v_4) \right] \left[ f_{s_7 s_5}(v_5) \right] \right) \right\} \qquad (21.41)$$

which can be calculated in significantly less time. Note that the parentheses in (21.41) give the topology of the tree. Each summation can be calculated independently, and then we multiply the results. Hence the running time of calculating the likelihood for one position decreases to $\Theta(|\Sigma|^2 n)$, and the running time of calculating the likelihood for the multiple alignment is $\Theta(|\Sigma|^2 n l)$, where $l$ is the length of the alignment.
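The rearrangement amounts to a post-order traversal: for every node we compute, for each possible character at that node, the probability of the leaves below it, and combine the children by multiplication. The Python sketch below illustrates this for one alignment column. Everything concrete in it is an assumption chosen for illustration: the Jukes-Cantor model stands in for the substitution model (with a uniform equilibrium distribution), edge lengths are indexed by the node at the lower end of the edge, and the five-leaf tree, its edge lengths and its leaf characters are made up. A brute-force sum over all internal labellings is included only to check the result.

```python
import math
from itertools import product

ALPHABET = "ACGT"


def jc69(x, y, t):
    """Jukes-Cantor probability that x becomes y along an edge of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def felsenstein(children, edge_len, leaf_chars, root, f=jc69, pi=None):
    """Likelihood of one alignment column, computed by post-order pruning."""
    pi = pi or {x: 1.0 / len(ALPHABET) for x in ALPHABET}

    def partial(r):
        # partial(r)[x]: probability of the leaves below r, given x at node r
        if r in leaf_chars:
            return {x: 1.0 if x == leaf_chars[r] else 0.0 for x in ALPHABET}
        result = {x: 1.0 for x in ALPHABET}
        for t in children[r]:
            below = partial(t)
            for x in ALPHABET:
                result[x] *= sum(f(x, y, edge_len[t]) * below[y] for y in ALPHABET)
        return result

    at_root = partial(root)
    return sum(pi[x] * at_root[x] for x in ALPHABET)


def brute_force(children, edge_len, leaf_chars, root, f=jc69):
    """Same likelihood as an explicit sum over all internal labellings."""
    internal = list(children)
    total = 0.0
    for assignment in product(ALPHABET, repeat=len(internal)):
        s = dict(leaf_chars)
        s.update(zip(internal, assignment))
        term = 1.0 / len(ALPHABET)     # uniform equilibrium probability at the root
        for parent, kids in children.items():
            for kid in kids:
                term *= f(s[parent], s[kid], edge_len[kid])
        total += term
    return total


# Hypothetical rooted tree with leaves 1..5 and internal nodes 0 (root), 6, 7, 8.
children = {0: [6, 8], 6: [1, 2], 8: [3, 7], 7: [4, 5]}
edge_len = {1: 0.1, 2: 0.2, 3: 0.15, 4: 0.05, 5: 0.3, 6: 0.1, 7: 0.2, 8: 0.25}
leaf_chars = {1: "A", 2: "A", 3: "C", 4: "G", 5: "G"}
likelihood = felsenstein(children, edge_len, leaf_chars, root=0)
```

The pruning version does a constant amount of work per edge and per pair of characters, whereas the brute-force version iterates over all 4^4 labellings of the internal nodes of this tree.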

Figure 21.1. The tree on which we introduce the Felsenstein algorithm. Evolutionary times are denoted with $v$s on the edges of the tree.

Exercises

21.2-1 Give an algorithm for the weighted small parsimony problem, where we want to get a minimum-weight evolutionary labelling given a tree topology and a set of sequences associated with the leaves of the tree.

21.2-2 The gene content of species changes: a gene that can be found in the genome of one species might be absent from another genome. In the simplest model, an existing gene might be deleted from the genome and an absent gene might appear. Give the small parsimony algorithm for this gene content evolution model.

21.2-3 Give an algorithm that obtains the Maximum Likelihood labelling on a tree.

21.2-4 Rewrite the small parsimony problem in the form of (21.40), replacing sums with minimisation, and show that the Sankoff algorithm is based on the same rearrangement as the Felsenstein algorithm.

21.2-5 The Fitch algorithm [ 109 ] works in the following way: each node $r$ is associated with a set of characters, $C_r$. Each leaf is associated with the set containing only the character at that leaf, and each internal node $r$ with set of children $D(r)$ has the set

$$C_r = \begin{cases} \bigcap_{t \in D(r)} C_t & \text{if } \bigcap_{t \in D(r)} C_t \neq \emptyset, \\ \bigcup_{t \in D(r)} C_t & \text{otherwise.} \end{cases}$$

After reaching the root, we select an arbitrary character from $C_r$, where $r$ is the root of the tree. Then, moving from the root towards the leaves, at each node we choose the same character that we chose at its parent if the set of the node contains this character, otherwise an arbitrary character from the set of the node. Show that we get a most parsimonious labelling. What is the running time of this algorithm?

21.2-6 Show that the Sankoff algorithm gives all possible most parsimonious labellings, while there are most parsimonious labellings that cannot be obtained with the Fitch algorithm.