Chapter 21. Bioinformatics

In this chapter, we first present algorithms on sequences, trees and stochastic grammars, then we continue with algorithms for comparing structures and for constructing evolutionary trees, and we finish the chapter with some rarely discussed topics of bioinformatics.

21.1. Algorithms on sequences

In this section, we are going to introduce dynamic programming algorithms working on sequences. Sequences are finite series of characters over a finite alphabet. The basic idea of dynamic programming is that calculations for long sequences can be built up from calculations on substrings of those sequences.

The algorithms introduced here are the most important ones in bioinformatics; they form the basis of several software packages.

21.1.1. Distances of two sequences using linear gap penalty

DNA contains the information of living cells. Before the duplication of cells, the DNA molecules are doubled, and both daughter cells contain one copy of DNA. The replication of DNA is not perfect: the stored information can be changed by random mutations. Random mutations create variants in the population, and these variants evolve into new species.

Given two sequences, we can ask how closely the two species are related, and how many mutations are needed to describe the evolutionary history of the two sequences.

We suppose that mutations are independent of each other; hence, the probability of a series of mutations is the product of the probabilities of the individual mutations. Each mutation is associated with a weight: mutations with high probability get a smaller weight, mutations with low probability get a greater weight. A reasonable choice might be the logarithm of one over the probability of the mutation; in this case the weight of a series of mutations is the sum of the weights of the individual mutations. We also assume that a mutation and its reverse have the same probability; therefore, we study how a sequence can be transformed into another instead of evolving two sequences from a common ancestor. Assuming minimum evolution, we seek the minimum weight series of mutations that transforms one sequence into another. An important question is how we can quickly find such a minimum weight series. The naive algorithm finds all the possible series of mutations and chooses the one with minimum weight. Since the number of possible series of mutations grows exponentially (as we are going to show in this chapter), the naive algorithm is obviously too slow.

We are going to introduce Sellers' algorithm [ 299 ]. Let $\Sigma$ be a finite set of symbols, and let $\Sigma^{*}$ denote the set of finite sequences over $\Sigma$. The $n$-long prefix of $A$ will be denoted by $A_n$, and $a_n$ denotes the $n$th character of $A$. The following transformations can be applied to a sequence:

  • Insertion of symbol $a$ before position $i$.

  • Deletion of symbol $a$ at position $i$.

  • Substitution of symbol $a$ with symbol $b$ at position $i$.

The concatenation of mutations is denoted by the $\circ$ symbol. $\mathcal{T}$ denotes the set of finite concatenations of the above mutations, and $\tau(A) = B$ denotes that the transformation $\tau \in \mathcal{T}$ transforms sequence $A$ into sequence $B$. Let $w\colon \mathcal{T} \to \mathbb{R}^{+} \cup \{0\}$ be a weight function such that for any transformations $\tau_1$, $\tau_2$ and $\tau_3$ satisfying

$$\tau_1 \circ \tau_2 = \tau_3 ,$$

the

$$w(\tau_1) + w(\tau_2) = w(\tau_3)$$

equation also holds. Furthermore, let the weight of a mutation be independent from its position, i.e., let it depend only on the characters involved; we write $w(a,b)$ for the weight of substituting $a$ with $b$, and $w(a,-)$ and $w(-,a)$ for the weights of deleting and inserting $a$. The transformation distance between two sequences, $A$ and $B$, is the minimum weight of transformations transforming $A$ into $B$:

$$\delta(A,B) = \min\{ w(\tau) \mid \tau(A) = B \} .$$

If we assume that $w$ satisfies

$$w(a,a) = 0, \qquad w(a,b) = w(b,a) > 0 \ (a \ne b), \qquad w(a,c) \le w(a,b) + w(b,c)$$

for any $a$, $b$, $c \in \Sigma \cup \{-\}$, then the transformation distance is indeed a metric on $\Sigma^{*}$.

Since $\delta$ is a metric, it is enough to consider transformations that change each position of a sequence at most once. Series of transformations are depicted with sequence alignments. By convention, the sequence at the top is the ancestor and the sequence at the bottom is its descendant. For example, the alignment below shows that there were substitutions at positions three and five, there was an insertion in the first position and a deletion in the eighth position.

  - A U C G U A C A G  

  U A G C A U A - A G  

A pair at a position is called an aligned pair. The weight of the series of transformations described by the alignment is the sum of the weights of the aligned pairs. Each series of mutations can be described by an alignment, and this description is unique up to the permutation of the mutations in the series. Since summation is commutative, the weight of the series of mutations does not depend on the order of the mutations.

We are going to show that the number of possible alignments also grows exponentially with the length of the sequences. Consider an $n$-long sequence $A$ and an $m$-long sequence $B$. The alignments that do not contain this pattern

  # -  

  - #  

where # is an arbitrary character of $\Sigma$, form a subset of the possible alignments. The size of this subset is $\binom{n+m}{m}$, since there is a bijection between this set of alignments and the set of coloured sequences that contain the characters of $A$ and $B$ in increasing order, where the characters of $A$ are coloured with one colour and the characters of $B$ with the other colour. For example, if $n = m$, then $\binom{2n}{n} \approx 4^{n}/\sqrt{\pi n}$.

An alignment whose weight is minimal is called an optimal alignment. Let the set of optimal alignments of $A_i$ and $B_j$ be denoted by $\alpha^{*}(A_i,B_j)$, and let $w(\alpha^{*}(A_i,B_j))$ denote the weight of any alignment in $\alpha^{*}(A_i,B_j)$.

The key to the fast algorithm for finding an optimal alignment is that if we know $w(\alpha^{*}(A_{i-1},B_j))$, $w(\alpha^{*}(A_i,B_{j-1}))$ and $w(\alpha^{*}(A_{i-1},B_{j-1}))$, then we can calculate $w(\alpha^{*}(A_i,B_j))$ in constant time. Indeed, if we delete the last aligned pair of an optimal alignment of $A_i$ and $B_j$, we get an optimal alignment of $A_{i-1}$ and $B_j$, or of $A_i$ and $B_{j-1}$, or of $A_{i-1}$ and $B_{j-1}$, depending on whether the last aligned column depicts a deletion, an insertion, or a substitution or match, respectively. Hence,

$$w(\alpha^{*}(A_i,B_j)) = \min\bigl\{ w(\alpha^{*}(A_{i-1},B_j)) + w(a_i,-),\; w(\alpha^{*}(A_i,B_{j-1})) + w(-,b_j),\; w(\alpha^{*}(A_{i-1},B_{j-1})) + w(a_i,b_j) \bigr\} . \qquad (21.7)$$

The weights of optimal alignments are calculated in the so-called dynamic programming table $D$. The element $d_{i,j}$ of $D$ contains $w(\alpha^{*}(A_i,B_j))$. Comparing an $n$-long and an $m$-long sequence requires the fill-in of an $(n+1)\times(m+1)$ table; the indexing of rows and columns runs from $0$ till $n$ and $m$, respectively. The initial conditions for column $0$ and row $0$ are

$$d_{0,0} = 0, \qquad d_{i,0} = \sum_{k=1}^{i} w(a_k,-), \qquad d_{0,j} = \sum_{k=1}^{j} w(-,b_k) .$$

The table can be filled in using Equation (21.7). The time requirement for the fill-in is $\Theta(nm)$. After filling in the dynamic programming table, the set of all optimal alignments can be found in the following way, called trace-back. We go from the right bottom corner to the left top corner, choosing the cell(s) giving the optimal value of the current cell (there might be more than one such cell). Stepping up from position $d_{i,j}$ means a deletion, stepping to the left means an insertion, and the diagonal steps mean either a substitution or a match depending on whether or not $a_i = b_j$. Each step is represented with an oriented edge; in this way, we get an oriented graph whose vertices are a subset of the cells of the dynamic programming table. The number of optimal alignments might grow exponentially with the length of the sequences; however, the set of optimal alignments can be represented in polynomial time and space, since each path from $d_{n,m}$ to $d_{0,0}$ on the oriented graph obtained in the trace-back gives an optimal alignment.
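The following short Python sketch illustrates the fill-in and the trace-back described above. The unit weight function w is only an illustrative assumption; any weight function satisfying the metric conditions could be plugged in.

def w(a, b):
    """Weight of an aligned pair; '-' denotes the gap symbol."""
    return 0 if a == b else 1          # match = 0; mismatch or gap = 1

def edit_distance_table(A, B):
    n, m = len(A), len(B)
    # (n+1) x (m+1) dynamic programming table, rows and columns indexed from 0
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # initial conditions, column 0
        d[i][0] = d[i - 1][0] + w(A[i - 1], '-')
    for j in range(1, m + 1):          # initial conditions, row 0
        d[0][j] = d[0][j - 1] + w('-', B[j - 1])
    for i in range(1, n + 1):          # fill-in, Theta(nm) time
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + w(A[i - 1], '-'),          # deletion
                          d[i][j - 1] + w('-', B[j - 1]),          # insertion
                          d[i - 1][j - 1] + w(A[i - 1], B[j - 1])) # subst./match
    return d

def trace_back(A, B, d):
    """Recover one optimal alignment by walking from d[n][m] to d[0][0]."""
    i, j, top, bottom = len(A), len(B), [], []
    while i > 0 or j > 0:
        if i > 0 and d[i][j] == d[i - 1][j] + w(A[i - 1], '-'):
            top.append(A[i - 1]); bottom.append('-'); i -= 1       # deletion
        elif j > 0 and d[i][j] == d[i][j - 1] + w('-', B[j - 1]):
            top.append('-'); bottom.append(B[j - 1]); j -= 1       # insertion
        else:
            top.append(A[i - 1]); bottom.append(B[j - 1]); i -= 1; j -= 1
    return ''.join(reversed(top)), ''.join(reversed(bottom))

For example, trace_back("AUCGUACAG", "UAGCAUAAG", edit_distance_table("AUCGUACAG", "UAGCAUAAG")) returns one optimal alignment of the two example sequences above.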

21.1.2. Dynamic programming with arbitrary gap function

Since deletions and insertions get the same weight, their common name is indel or gap, and their weights are called gap penalty. Usually gap penalties do not depend on the deleted or inserted characters. The gap penalties used in the previous section grow linearly with the length of the gap. This means that a long indel is considered as the result of independent insertions or deletions of characters. However, the biological observation is that long indels can be formed in one evolutionary step, and such long indels are penalised too much with the linear gap penalty function. This observation motivated the introduction of more complex gap penalty functions [ 339 ]. The algorithm introduced by Waterman et al. penalises a $k$-long gap with $g(k)$. For example, the weight of this alignment:

  - - A U C G A C G U A C A G  

  U A G U C - - - A U A G A G  

is $g(2) + w(A,G) + w(U,U) + w(C,C) + g(3) + w(G,A) + w(U,U) + w(A,A) + w(C,G) + w(A,A) + w(G,G)$.

We still seek the minimum weight series of transformations transforming one sequence into another, or equivalently, an optimal alignment. Since there might be a long indel at the end of the optimal alignment, in addition to $d_{i-1,j-1}$ we must know all $d_{i,j'}$, $j' < j$, and all $d_{i',j}$, $i' < i$, to calculate $d_{i,j}$. The dynamic programming recursion is given by the following equation:

$$d_{i,j} = \min\Bigl\{ d_{i-1,j-1} + w(a_i,b_j),\; \min_{0 \le i' < i}\{ d_{i',j} + g(i-i') \},\; \min_{0 \le j' < j}\{ d_{i,j'} + g(j-j') \} \Bigr\} .$$

The initial conditions are:

$$d_{0,0} = 0, \qquad d_{i,0} = g(i), \qquad d_{0,j} = g(j) .$$

The time requirement for calculating $d_{i,j}$ is $\Theta(i+j)$, hence the running time of the fill-in part needed to calculate the weight of an optimal alignment is $\Theta(nm(n+m))$. Similarly to the previous algorithm, the set of optimal alignments, represented by paths from $d_{n,m}$ to $d_{0,0}$, can be found in the trace-back part.
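A direct Python transcription of this recursion might look as follows; the weight function w and the gap penalty function g are placeholders. Each cell inspects Theta(i + j) earlier cells, which gives the Theta(nm(n + m)) fill-in time.

def waterman_fill_in(A, B, w, g):
    n, m = len(A), len(B)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):               # initial conditions
        d[i][0] = g(i)
    for j in range(1, m + 1):
        d[0][j] = g(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = d[i - 1][j - 1] + w(A[i - 1], B[j - 1])
            # a gap of length i - i' ending in row i: Theta(i) candidates
            best = min(best, min(d[k][j] + g(i - k) for k in range(i)))
            # a gap of length j - j' ending in column j: Theta(j) candidates
            best = min(best, min(d[i][k] + g(j - k) for k in range(j)))
            d[i][j] = best
    return d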

If $n = m$, then the running time of this algorithm is $\Theta(n^3)$. With restrictions on the gap penalty function, the running time can be decreased. We are going to show two such algorithms in the next two sections.

21.1.3. Gotoh algorithm for affine gap penalty

A gap penalty function is affine if

$$g(k) = uk + v, \qquad u \ge 0,\; v \ge 0 .$$

There exists a $\Theta(nm)$ running time algorithm for affine gap penalties [ 138 ]. Recall that in the Waterman algorithm,

$$d_{i,j} = \min\{ d_{i-1,j-1} + w(a_i,b_j),\; p_{i,j},\; q_{i,j} \} ,$$

where

$$p_{i,j} = \min_{0 \le i' < i}\{ d_{i',j} + g(i-i') \} \qquad \text{and} \qquad q_{i,j} = \min_{0 \le j' < j}\{ d_{i,j'} + g(j-j') \} .$$

The key of the Gotoh algorithm is the following reindexing:

$$p_{i,j} = \min\Bigl\{ d_{i-1,j} + g(1),\; \min_{0 \le i' < i-1}\{ d_{i',j} + g(i-1-i') \} + u \Bigr\} = \min\{ d_{i-1,j} + g(1),\; p_{i-1,j} + u \} ,$$

and similarly

$$q_{i,j} = \min\{ d_{i,j-1} + g(1),\; q_{i,j-1} + u \} .$$

In this way, $p_{i,j}$ and $q_{i,j}$ can be calculated in constant time, hence so can $d_{i,j}$. Thus, the running time of the algorithm remains $\Theta(nm)$, and the algorithm is only a constant factor slower than the dynamic programming algorithm for linear gap penalties.
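A minimal Python sketch of the Gotoh recursions, assuming $g(k) = uk + v$ and a placeholder weight function w:

def gotoh(A, B, w, u, v):
    n, m = len(A), len(B)
    INF = float('inf')
    g = lambda k: u * k + v
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    p = [[INF] * (m + 1) for _ in range(n + 1)]  # best score ending with a row gap
    q = [[INF] * (m + 1) for _ in range(n + 1)]  # best score ending with a column gap
    for i in range(1, n + 1):
        d[i][0] = p[i][0] = g(i)
    for j in range(1, m + 1):
        d[0][j] = q[0][j] = g(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # reindexing: either open a new gap with g(1) or extend an old one by u
            p[i][j] = min(d[i - 1][j] + g(1), p[i - 1][j] + u)
            q[i][j] = min(d[i][j - 1] + g(1), q[i][j - 1] + u)
            d[i][j] = min(d[i - 1][j - 1] + w(A[i - 1], B[j - 1]),
                          p[i][j], q[i][j])
    return d[n][m]

Each cell now needs only a constant number of comparisons, which is exactly why the running time stays quadratic.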

21.1.4. Concave gap penalty

There is no biological justification for the affine gap penalty function [ 36 ], [ 137 ]; its widespread use (for example, in CLUSTAL-W [ 322 ]) is due to its low running time. There is a more realistic gap penalty function for which an algorithm exists whose running time is slightly more than the running time for affine gap penalties, but it is still significantly better than the cubic running time algorithm of Waterman et al. [ 124 ], [ 246 ].

A gap penalty function is concave if for each $k$,

$$g(k+1) - g(k) \le g(k) - g(k-1) ,$$

namely, the increments of the gap extension are penalised less and less. It might happen that such a function starts decreasing after a given point; to avoid this, it is usually assumed that the function increases monotonously. Based on empirical data [ 36 ], if two sequences have evolved for $t$ PAM units [ 80 ], the weight of a $q$-long indel grows as an affine function of $\log q$ (with coefficients depending on $t$), which is also a concave function. (One PAM unit is the time span during which 1% of the sequence changed.) There exists an $O(nm(\lg n + \lg m))$ running time algorithm for concave gap penalty functions. This is a so-called forward looking algorithm. The Forward-Looking algorithm calculates the $i$th row of the dynamic programming table in the following way for an arbitrary gap penalty function:

Forward-Looking(i)

 1  FOR j ← 1 TO m
 2    DO V[j] ← d[i,0] + g(j)
 3       p[j] ← 0
 4  FOR j ← 1 TO m
 5    DO d[i,j] ← min{V[j], d[i−1,j−1] + w(a_i, b_j), q[i,j]}
       ▷ At this step, we suppose that d[i−1,j−1] and q[i,j] are already calculated.
 6       FOR j′ ← j + 1 TO m    ▷ Inner cycle.
 7         DO IF d[i,j] + g(j′ − j) < V[j′]
 8              THEN V[j′] ← d[i,j] + g(j′ − j)
 9                   p[j′] ← j

where $g$ is the gap penalty function and $p$ is a pointer array whose role will be described later. In row $i$, we assume that we have already calculated $d_{i-1,j-1}$ and $q_{i,j}$, the term covering gaps running in the column direction. It is easy to show that the forward looking algorithm makes the same comparisons as the traditional, backward looking algorithm, only in a different order. While the backward looking algorithm calculates $d_{i,j}$ at the $j$th position of the row by looking back to the already calculated entries of the dynamic programming table, the Forward-Looking algorithm has already calculated $d_{i,j}$ by the time it arrives at the $j$th position of the row. On the other hand, it sends forward candidate values for the positions $j' > j$, and by the time cell $j$ is reached, all the needed comparisons of candidate values have been made. Therefore, the Forward-Looking algorithm is not faster than the traditional backward looking algorithm; however, this conception helps accelerate the algorithm.
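The following Python sketch computes one row in the forward-looking order. Here other_term(j) stands for the already calculated terms not involving a within-row gap (the diagonal term and q[i,j]); this decomposition, like the function names, is an assumption of the sketch.

def forward_looking_row(m, d_i0, g, other_term):
    # V[j]: best within-row gap candidate sent forward so far; p[j]: its sender
    V = [None] + [d_i0 + g(j) for j in range(1, m + 1)]   # candidates of cell 0
    p = [0] * (m + 1)
    d = [d_i0] + [0.0] * m
    for j in range(1, m + 1):
        # every cell j' < j has already sent its candidate for position j
        d[j] = min(V[j], other_term(j))
        for jp in range(j + 1, m + 1):        # inner cycle
            cand = d[j] + g(jp - j)
            if cand < V[jp]:                  # cell j sends a better candidate
                V[jp], p[jp] = cand, j
    return d, p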

The key idea is the following.

Lemma 21.1 Let $j$ be the actual cell in the row. If

$$d_{i,j} + g(j'-j) \ge V[j'] ,$$

then for all $j'' > j'$,

$$d_{i,j} + g(j''-j) \ge V[j''] .$$

Proof. From the condition it follows that there is a $k < j$ for which

$$d_{i,j} + g(j'-j) \ge d_{i,k} + g(j'-k) .$$

Let us add $g(j''-j) - g(j'-j)$ to both sides of the inequality:

$$d_{i,j} + g(j''-j) \ge d_{i,k} + g(j'-k) + g(j''-j) - g(j'-j) .$$

For each concave gap penalty function,

$$g(j''-j) - g(j'-j) \ge g(j''-k) - g(j'-k) ,$$

since $j'-k > j'-j$ and the increments of a concave function are non-increasing. Rearranging this and using the previous inequality, and recalling that cell $k$ has already sent the candidate $d_{i,k} + g(j''-k)$ to position $j''$,

$$d_{i,j} + g(j''-j) \ge d_{i,k} + g(j''-k) \ge V[j''] .$$
The idea of the algorithm is to find, with binary search, the position from which the actual cell cannot send forward optimal values. This alone is still not enough for the desired acceleration, since $\Theta(m)$ candidate values might have to be rewritten in the worst case. However, the corollary of the previous lemma leads to the desired acceleration:

Corollary 21.2 Before the $j$th cell sends forward candidate values in the inner cycle of the forward looking algorithm, the cells after cell $j$ form blocks, each block having the same pointer, and the pointer values are decreasing by blocks from left to right.

The pseudocode of the algorithm is the following:

Forward-Looking-Binary-Searching(i)

 1  num ← 1; end[1] ← m; pointer[1] ← 0
 2  FOR j ← 1 TO m
 3    DO find with binary search the block containing position j, and let k be its pointer
 4       d[i,j] ← min{d[i,k] + g(j − k), d[i−1,j−1] + w(a_i, b_j), q[i,j]}
       ▷ At this step, we suppose that d[i−1,j−1] and q[i,j] are already calculated.
 5       IF j < m AND d[i,j] + g(1) < the candidate value stored for position j + 1
 6         THEN find with binary search (first on the blocks, then inside the chosen block) the last position j* for which d[i,j] + g(j* − j) is smaller than the candidate value stored for j*
             ▷ By Lemma 21.1, cell j gives the optimal candidate exactly on the interval (j, j*].
 7              delete the block ends lying in (j, j*]
 8              create a new block with end j* and pointer j
             ▷ Only three values have to be rewritten: the number of blocks, the end of the new block and its pointer.

The algorithm works in the following way: for each row, we maintain a variable storing the number of blocks, a list of positions of block ends, and a list of pointers for the blocks. For each cell $j$, the algorithm finds the last position for which the cell gives an optimal candidate value using binary search: there is first a binary search over the blocks, then over the positions inside the chosen block. It is enough to rewrite three values after the binary searches: the number of blocks, the end of the new block and its pointer. Therefore the most time consuming part is the binary search, which takes $O(\lg m)$ time for each cell.

We do the same for the columns. If the dynamic programming table is filled in row by row, then for each position $j$ in row $i$, the algorithm uses the block system of column $j$. Therefore the running time of the algorithm is $O(nm(\lg n + \lg m))$.

21.1.5. Similarity of two sequences, the Smith-Waterman algorithm

We can measure not only the distance but also the similarity of two sequences. For measuring the similarity of two characters, the most frequently used function is the log-odds:

$$S(a,b) = \log\!\left(\frac{p_{a,b}}{q_a q_b}\right) ,$$

where $p_{a,b}$ is the joint probability of the two characters (namely, the probability of observing them together in an alignment column), and $q_a$ and $q_b$ are the marginal probabilities. The similarity is positive if $p_{a,b} > q_a q_b$, otherwise negative. Similarities are obtained from empirical data; for amino acids, the most commonly used similarities are given by the PAM and BLOSUM matrices.
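A small Python sketch of the log-odds similarity, assuming the joint and marginal probabilities have already been estimated from data:

import math

def log_odds(p_joint, q_a, q_b):
    """S(a,b) = log(p_{a,b} / (q_a * q_b)); positive iff p_{a,b} > q_a * q_b."""
    return math.log(p_joint / (q_a * q_b))

# e.g. log_odds(0.04, 0.1, 0.2) > 0, since 0.04 > 0.1 * 0.2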

If we penalise gaps with negative numbers, then the above described global alignment algorithms work with similarities by changing minimisation to maximisation.

It is possible to define a special problem that works for similarities and does not work for distances: the local similarity problem or local sequence alignment problem [ 305 ]. Given two sequences, a similarity function and a gap penalty function, the problem is to give two substrings of the sequences whose similarity is maximal. A substring of a sequence is a consecutive part of the sequence. The biological motivation of the problem is that some parts of biological sequences evolve slowly, while other parts evolve fast. The local alignment finds the most conserved part of the two sequences. Local alignment is widely used for homology searching in databases. The reason why local alignment works well for homology searching is that the local alignment score can separate homologous and non-homologous sequences better, since the score statistics are not weakened by the variable regions of the sequences.

The Smith-Waterman algorithm works in the following way. The initial conditions are:

$$d_{0,0} = d_{i,0} = d_{0,j} = 0 .$$

Considering linear gap penalty, the dynamic programming table is filled in using the following recursion:

$$d_{i,j} = \max\{ 0,\; d_{i-1,j-1} + S(a_i,b_j),\; d_{i-1,j} + g,\; d_{i,j-1} + g \} .$$

Here $g$, the gap penalty, is a negative number. The best local similarity score of the two sequences is the maximal number in the table. The trace-back starts in the cell having the maximal number and ends when it first reaches a $0$.

It is easy to prove that the alignment obtained in the trace-back will be locally optimal: if the alignment could be extended at the end with a sub-alignment whose similarity is positive, then there would be a greater number in the dynamic programming table; if the alignment could be extended at the beginning with a sub-alignment having positive similarity, then the value at the end of the trace-back would not be $0$.
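A Python sketch of the fill-in, assuming a placeholder similarity function S and a negative linear gap penalty g:

def smith_waterman(A, B, S, g):
    n, m = len(A), len(B)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]   # initial conditions: all 0
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = max(0.0,                                   # local restart
                          d[i - 1][j - 1] + S(A[i - 1], B[j - 1]),
                          d[i - 1][j] + g,                       # deletion
                          d[i][j - 1] + g)                       # insertion
            if d[i][j] > best:
                best, best_pos = d[i][j], (i, j)
    # the trace-back would start at best_pos and stop at the first 0
    return best, best_pos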

21.1.6. Multiple sequence alignment

The multiple sequence alignment problem was introduced by David Sankoff [ 294 ], and by today, multiple sequence alignment has become a central problem in bioinformatics. Dan Gusfield calls it the Holy Grail of bioinformatics [ 148 ]. Multiple alignments are widespread both in database searching and in inferring evolutionary relationships. Using multiple alignments, it is possible to find the conserved parts of a sequence family, the positions that describe the functional properties of the sequence family. As Arthur Lesk said [ 169 ]: What two sequences whisper, a multiple sequence alignment shouts out loud.

The columns of a multiple alignment of $k$ sequences are called aligned $k$-tuples. The dynamic programming for the optimal multiple alignment is the generalisation of the dynamic programming for optimal pairwise alignment. To align $k$ sequences, we have to fill in a $k$-dimensional dynamic programming table. To calculate an entry in this table using linear gap penalty, we have to look back into a $k$-dimensional hypercube, namely to the $2^{k}-1$ neighbouring entries. Therefore, the memory requirement in the case of $k$ sequences, $n$ long each, is $\Theta(n^{k})$, and the running time of the algorithm is $\Theta(2^{k} n^{k})$ if we use linear gap penalty, and even larger with arbitrary gap penalties.

There are two fundamental problems with multiple sequence alignment. The first is algorithmic: it is proven that the multiple sequence alignment problem is NP-complete [ 337 ]. The other problem is methodical: it is not clear how to score a multiple alignment. An objective scoring function could be given only if the evolutionary relationships were known; in that case an aligned $k$-tuple could be scored according to an evolutionary tree [ 264 ].

A heuristic solution for both problems is the iterative sequence alignment [ 105 ], [ 75 ], [ 322 ]. This method first constructs a guide-tree using pairwise distances (such tree-building methods are described in Section 21.5). The guide-tree is then used to construct a multiple alignment. Each leaf is labelled with a sequence; first the sequences in "cherry motives" are aligned to each other, then sequence alignments are aligned to sequences and to other sequence alignments according to the guide-tree. The iterative sequence alignment method uses the "once a gap – always gap" rule. This means that gaps already placed into an alignment cannot be modified when aligning the alignment to another alignment or sequence. The only possibility is to insert all-gap columns into an alignment. The aligned sequences are usually described with a profile. The profile is a $(|\Sigma|+1) \times l$ table, where $l$ is the length of the alignment. A column of a profile contains the statistics of the corresponding aligned $k$-tuple: the frequencies of the characters and of the gap symbol.
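A small Python sketch of profile construction from an alignment, recording the character and gap frequencies column by column as described above:

def profile(alignment, alphabet):
    """alignment: list of equal-length rows over alphabet plus the gap '-'.
    Returns a dict mapping each symbol to its per-column frequencies,
    i.e. a (|alphabet|+1) x l table with '-' as the gap row."""
    k, l = len(alignment), len(alignment[0])
    prof = {s: [0.0] * l for s in list(alphabet) + ['-']}
    for col in range(l):
        for row in alignment:
            prof[row[col]][col] += 1.0 / k
    return prof

# example: profile(["AU-CG", "AUGCG"], "ACGU")['-'][2] == 0.5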

The obtained multiple alignment can be used for constructing another guide-tree, which can be used for another iterative sequence alignment, and this procedure can be iterated till convergence. The rationale behind the iterative alignment heuristic is that the optimal pairwise alignment of closely related sequences will be the same in the optimal multiple alignment. The drawback of the heuristic is that even if this assumption is true, there might be several optimal alignments for two sequences, and their number might grow exponentially with the length of the sequences. For example, let us consider the two optimal alignments of the sequences AUCGGUACAG and AUCAUACAG (each with one gap and one mismatch):

  A U C G G U A C A G  
  A U C - A U A C A G  

  A U C G G U A C A G  
  A U C A - U A C A G  

We cannot choose between the two alignments; however, in a multiple alignment, only one of them might be optimal. For example, if we align the sequence AUCGAU to the two optimal alignments, we get the following locally optimal alignments:

  A U C G G U A C A G  
  A U C - A U A C A G  
  A U C G A U - - - -  

  A U C G G U A C A G  
  A U C A - U A C A G  
  A U C G A U - - - -  

The first alignment is globally optimal; the second alignment is only locally optimal.

Hence, the iterative alignment method yields only a locally optimal alignment. Another problem of this method is that it does not give an upper bound for the goodness of the approximation. In spite of these drawbacks, iterative alignment methods are the most widely used ones for multiple sequence alignment in practice, since they are fast and usually give biologically reasonable alignments. Recently some approximation methods for multiple sequence alignment have been published with known upper bounds for their goodness [ 149 ], [ 283 ]. However, the bounds are not biologically reasonable, and in practice these methods usually give worse results than the heuristic methods.

We must mention a novel greedy method that is not based on dynamic programming. DiAlign [ 249 ], [ 250 ], [ 251 ] first searches for gap-free homologous substrings by pairwise sequence comparison. The gap-free alignments of the homologous substrings are called diagonals, after the diagonals of the dynamic programming table; hence the name of the method: Diagonal Alignment. The diagonals are scored according to their similarity value, and diagonals that are not compatible with high-scoring diagonals get a penalty. Two diagonals are not compatible if they cannot be in the same alignment. After scoring, the diagonals are assembled into a multiple alignment in a greedy way: first the best diagonal is selected, then the best diagonal that is compatible with the first one, then the third best that is compatible with the first two, etc. The multiple alignment is the union of the selected diagonals, which might not cover all the characters in the sequences. Those characters that are not in any of the selected diagonals are marked as "non-alignable". The drawback of the method is that it sometimes introduces too many gaps due to not penalising the gaps at all. However, DiAlign has been one of the best heuristic alignment approaches and is widely used in the bioinformatics community.

21.1.7. Memory-reduction with the Hirschberg algorithm

If we want to calculate only the distance or similarity between two sequences and we are not interested in an optimal alignment, then in the case of linear or affine gap penalties it is very easy to construct an algorithm that uses only linear memory. Indeed, note that the dynamic programming recursion needs only the previous row (in case of filling in the dynamic programming table row by row), so the algorithm does not need to store earlier rows. On the other hand, once the dynamic programming table has reached the last row and forgotten the earlier rows, it is not possible to trace back the optimal alignment. If the dynamic programming table is scrolled again and again in linear memory to trace back the optimal alignment row by row, then the running time grows to $\Theta(n^3)$, where $n$ is the length of the sequences.

However, it is possible to design an algorithm that obtains an optimal alignment in $\Theta(nm)$ running time and uses only linear memory. This is the Hirschberg algorithm [ 165 ], which we are going to introduce for distance-based alignment with linear gap penalty.

We introduce the suffixes of a sequence; a suffix is a substring ending at the end of the sequence. Let $A^{k}$ denote the suffix of $A$ starting with character $a_{k+1}$, so that $A$ is the concatenation of $A_k$ and $A^{k}$.

The Hirschberg algorithm first performs the dynamic programming algorithm for the sequences $A_{n/2}$ and $B$ using linear memory as described above. Similarly, it performs the dynamic programming algorithm for the reverses of the sequences $A^{n/2}$ and $B$.

Based on the two dynamic programming procedures, we know the score of the optimal alignment of $A_{n/2}$ and an arbitrary prefix of $B$, and similarly, the score of the optimal alignment of $A^{n/2}$ and an arbitrary suffix of $B$. From this we can tell the score of the optimal alignment of $A$ and $B$:

$$w(\alpha^{*}(A,B)) = \min_{0 \le j \le m}\bigl\{ w(\alpha^{*}(A_{n/2},B_j)) + w(\alpha^{*}(A^{n/2},B^{j})) \bigr\} , \qquad (21.31)$$

and from this calculation it must be clear that in the optimal alignment of $A$ and $B$, $A_{n/2}$ is aligned with the prefix $B_j$ for which

$$w(\alpha^{*}(A_{n/2},B_j)) + w(\alpha^{*}(A^{n/2},B^{j}))$$

is minimal.

Since we know the previous rows of the dynamic programming tables, we can also tell whether $a_{n/2}$ and $a_{n/2+1}$ are aligned with characters of $B$ or these characters are deleted in the optimal alignment. Similarly, we can tell if any characters of $B$ are inserted between $a_{n/2}$ and $a_{n/2+1}$.

In this way, we get at least two columns of the optimal alignment. Then we do the same for $A_{n/2}$ and the remaining prefix of $B$, and for $A^{n/2}$ and the remaining suffix of $B$. In this way, we get alignment columns at the quarter and at the three-fourths of sequence $A$. In the next iteration, we do the same for the four pairs of sequences, etc., and we iterate until we get all the columns of the optimal alignment.

Obviously, the memory requirement still grows only linearly with the length of the sequences. We show that the running time is still $\Theta(nm)$, where $n$ and $m$ are the lengths of the sequences. This comes from the fact that the running time of the first iteration is $c \cdot nm$ for some constant $c$, while the running time of the second iteration is $c(n/2)j^{*} + c(n/2)(m-j^{*}) = c\,nm/2$, where $j^{*}$ is the position for which we get the minimum distance in Eqn. (21.31). Hence the total running time is:

$$c\,nm\left(1 + \frac{1}{2} + \frac{1}{4} + \cdots\right) \le 2c\,nm = \Theta(nm) .$$
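A compact Python sketch of the whole procedure, assuming a distance-based weight function w and a linear gap penalty g > 0 (both placeholders); last_row computes one row of the dynamic programming table in linear memory:

def last_row(A, B, w, g):
    prev = [j * g for j in range(len(B) + 1)]
    for i in range(1, len(A) + 1):
        cur = [i * g] + [0.0] * len(B)
        for j in range(1, len(B) + 1):
            cur[j] = min(prev[j - 1] + w(A[i - 1], B[j - 1]),  # subst./match
                         prev[j] + g,                          # deletion
                         cur[j - 1] + g)                       # insertion
        prev = cur
    return prev      # prev[j] = distance of A and the j-long prefix of B

def hirschberg(A, B, w, g):
    """Return one optimal alignment as a pair of gapped strings."""
    if len(A) == 0:
        return '-' * len(B), B
    if len(B) == 0:
        return A, '-' * len(A)
    if len(A) == 1:
        # align the single character with its best position in B, or delete it
        best, best_j = (len(B) + 1) * g, 0
        for j in range(1, len(B) + 1):
            score = (len(B) - 1) * g + w(A[0], B[j - 1])
            if score < best:
                best, best_j = score, j
        if best_j == 0:
            return A + '-' * len(B), '-' + B
        return '-' * (best_j - 1) + A + '-' * (len(B) - best_j), B
    mid = len(A) // 2
    left = last_row(A[:mid], B, w, g)                # scores against prefixes of B
    right = last_row(A[mid:][::-1], B[::-1], w, g)   # scores against suffixes of B
    # the split point minimising the total score, as in Eqn. (21.31)
    j_star = min(range(len(B) + 1), key=lambda j: left[j] + right[len(B) - j])
    a1, b1 = hirschberg(A[:mid], B[:j_star], w, g)
    a2, b2 = hirschberg(A[mid:], B[j_star:], w, g)
    return a1 + a2, b1 + b2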

21.1.8. Memory-reduction with corner-cutting

The dynamic programming algorithm reaches the optimal alignment of two sequences by aligning longer and longer prefixes of the two sequences. The algorithm can be accelerated by excluding the bad alignments of prefixes that cannot yield an optimal alignment. Such alignments correspond to paths running near the right top and the left bottom corners of the dynamic programming table; these corners can be cut off, hence the name of the technique.

Most of the corner-cutting algorithms use a test value. This test value is an upper bound of the evolutionary distance between the two sequences. A corner-cutting algorithm using a test value can obtain the optimal alignment of two sequences only if the distance between the sequences is indeed smaller than the test value; otherwise the algorithm stops before reaching the right bottom corner or gives a non-optimal alignment. Therefore these algorithms are useful when we search for sequences similar to a given one and are not interested in sequences that are farther from the query sequence than the test value.

We are going to introduce two algorithms. The Spouge algorithm [ 307 ], [ 306 ] is a generalisation of the Fickett [ 107 ] and the Ukkonen algorithms [ 333 ], [ 332 ]. The other algorithm was given by Gusfield; it is an example of a corner-cutting algorithm that reaches the right bottom corner even if the distance between the two sequences is greater than the test value, but in this case the calculated score is bigger than the test value, indicating that the obtained alignment is not necessarily optimal.

The Spouge algorithm calculates only those $d_{i,j}$ for which

$$d_{i,j} + \bigl|(n-i) - (m-j)\bigr| \cdot g \le t ,$$

where $t$ is the test value, $g$ is the gap penalty, and $n$ and $m$ are the lengths of the sequences. The key observation of the algorithm is that any path going from $d_{i,j}$ to $d_{n,m}$ increases the score of the alignment by at least $|(n-i)-(m-j)| \cdot g$. Therefore, if the distance of the sequences is not greater than the test value, the Spouge algorithm obtains the optimal alignment; otherwise it stops before reaching the right bottom corner.

This algorithm is a generalisation of the Fickett algorithm and the Ukkonen algorithm. Those algorithms also use a test value, but the inequality in the Fickett algorithm is

$$d_{i,j} \le t ,$$

while the inequality in the Ukkonen algorithm is

$$\bigl|(n-i) - (m-j)\bigr| \cdot g \le t .$$

Since in both cases the left-hand side of the inequality is not greater than the left-hand side of the Spouge inequality, the Fickett and the Ukkonen algorithms calculate at least as large a part of the dynamic programming table as the Spouge algorithm. Empirical results showed that the Spouge algorithm is significantly better [ 306 ]. The algorithm can be extended to affine and concave gap penalties, too.

The $k$-difference global alignment problem [ 148 ] asks the following question: is there an alignment of the sequences whose weight is smaller than $k$? The algorithm answering the question has $O(kn)$ running time, where $n$ is the length of the longer sequence. The algorithm is based on the observation that any path from $d_{0,0}$ to $d_{n,m}$ having a score at most $k$ cannot contain a cell $d_{i,j}$ for which $|i-j| \cdot g > k$. Therefore the algorithm calculates only those cells for which $|i-j| \cdot g \le k$ and disregards those neighbours of the border elements for which $|i-j| \cdot g > k$. If there exists an alignment with a score smaller than or equal to $k$, then $d_{n,m} \le k$, and $d_{n,m}$ is indeed the distance of the two sequences. Otherwise $d_{n,m} > k$, and $d_{n,m}$ is not necessarily the score of the optimal alignment, since there might be an alignment that leaves the band defined by the inequality and still has a smaller score than the best alignment inside the band.
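A Python sketch of the banded fill-in, assuming unit mismatch and gap costs (an illustrative choice, i.e. $g = 1$); cells outside the band $|i-j| \le k$ are simply never created:

def k_difference(A, B, k):
    """Return the edit distance if it is at most k; a returned value larger
    than k is not necessarily the true distance. Returns None if even the
    length difference exceeds k."""
    n, m = len(A), len(B)
    if abs(n - m) > k:
        return None                  # every alignment needs more than k indels
    d = {(0, 0): 0}
    for i in range(1, min(n, k) + 1):    # border cells inside the band
        d[(i, 0)] = i
    for j in range(1, min(m, k) + 1):
        d[(0, j)] = j
    for i in range(1, n + 1):
        for j in range(max(1, i - k), min(m, i + k) + 1):
            best = d[(i - 1, j - 1)] + (A[i - 1] != B[j - 1])
            if (i - 1, j) in d:          # neighbours outside the band are skipped
                best = min(best, d[(i - 1, j)] + 1)
            if (i, j - 1) in d:
                best = min(best, d[(i, j - 1)] + 1)
            d[(i, j)] = best
    return d[(n, m)]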

The corner-cutting technique has been extended to multiple sequence alignments scored by the sum-of-pairs scoring scheme [ 57 ]. The sum-of-pairs score is:

$$SP = \sum_{l=1}^{L} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} w\bigl(c_i^{l}, c_j^{l}\bigr) ,$$

where $c^{l}$ is the $l$th aligned $k$-tuple, $w$ is the distance function on $\Sigma \cup \{-\}$ (with $w(-,-) = 0$), $k$ is the number of sequences, $L$ is the number of columns, and $c_i^{l}$ is the character of the multiple alignment in the $i$th row and $l$th column. The $l_i$-suffix of the $i$th sequence is the suffix starting after its $l_i$th character. Let $\hat d_{i,j}(l_i,l_j)$ denote the distance of the optimal alignment of the $l_i$-suffix and the $l_j$-suffix of the $i$th and the $j$th sequences. The Carrillo and Lipman algorithm calculates only those positions $(l_1, l_2, \ldots, l_k)$ for which

$$d(l_1, l_2, \ldots, l_k) + \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \hat d_{i,j}(l_i,l_j) \le t ,$$

where $t$ is the test value and $d(l_1, \ldots, l_k)$ is the distance of the optimal multiple alignment of the corresponding prefixes. The goodness of the algorithm follows from the fact that the sum-of-pairs score of the optimal alignment of the not yet aligned suffixes cannot be smaller than the sum of the scores of the optimal pairwise alignments. This corner-cutting method can align at most six moderately long sequences [ 223 ].

Exercises

21.1-1 Show that the number of possible alignments of an $n$-long and an $m$-long sequence is

$$\sum_{i=0}^{\min\{n,m\}} \frac{(n+m-i)!}{i!\,(n-i)!\,(m-i)!} .$$

21.1-2 Give a series of pairs of sequences and a scoring scheme such that the number of optimal alignments grows exponentially with the length of the sequences.

21.1-3 Give the Hirschberg algorithm for multiple alignments.

21.1-4 Give the Hirschberg algorithm for affine gap penalties.

21.1-5 Give the Smith-Waterman algorithm for affine gap penalties.

21.1-6 Give the Spouge algorithm for affine gap penalties.

21.1-7 Construct an example showing that the optimal multiple alignment of three sequences might contain a pairwise alignment that is only suboptimal.