21.6. 21.6 Miscellaneous topics

In this section, we cover topics that are usually not mentioned in bioinformatics books. We only mention the main results in a nutshell and do not prove theorems.

21.6.1. 21.6.1 Genome rearrangement

The genome of an organism consists of several genes. For each gene, only one strand of the double stranded DNA contains meaningful information, the other strand is the reverse complement. Since the DNA is chemically oriented, we can talk about the direction of a gene. If each gene has one copy in the genome then we can describe the order and directions of genes as a signed permutation, where the signs give the directions of genes.

Given two genomes with the same gene content, represented as a signed permutation then the problem is to give the minimal series of mutations that transform one genome into another. We consider three types of mutations:

  • Reversal A reversal acts on a consecutive part of the signed permutation. It reverse the order of genes on the given part as well as the signs of the genes.

  • Transposition A transposition swaps two consecutive block of genes.

  • Reverted transposition It swaps two consecutive blocks and one of the blocks is reverted. As for reversals, the signs in the reverted block also change.

If we assume that only mutations happened, then we can give an running time algorithm that obtains a shortest series of mutations transforming one genome into another, where is the number of genes.

If we consider other types of mutations, then the complexity of problems is unknown. For transpositions, the best approximation is an approximation [ 96 ], if we consider all possible types of mutations, then the best approximation is a -approximation [ 146 ]. For a wide range of and biologically meaningful weights, the weighted sorting problem for all types of mutations has a -approximation [ 26 ].

If we do not know the signs, then the problem is proved to be NP-complete [ 56 ]. Similarly, the optimal reversal median problem even for three genomes and signed permutations is NP-complete [ 55 ]. The optimal reversal median is a genome that minimises the sum of distances from a set of genomes.

Below we describe the Hannenhalli-Pevzner theorem for the reversal distance of two genomes. Instead of transforming permutation into , we transform into the identical permutation. Based on elementary group theory, it is easy to show that the two problems are equivalent. We assume that we already calculated , and we will denote it simply by .

We transform an long signed permutation into a long unsigned permutation by replacing to and to . Additionally, we frame the unsigned permutation into and . The vertexes of the so-called graph of desire and reality are the numbers of the unsigned permutation together with and . Starting with , we connect every other number in the graph, these are the reality edges. Starting also with , we connect with with an arc, these are the desire edges. An example graph can be seen on Figure 21.9.

Figure 21.9.  Representation of the Representation of the -1,\,+2,\,+5,\,+3,\,+4 signed permutation with an unsigned permutation, and its graph of desire and reality. signed permutation with an unsigned permutation, and its graph of desire and reality.

Representation of the -1,\,+2,\,+5,\,+3,\,+4 signed permutation with an unsigned permutation, and its graph of desire and reality.


Since each vertex in the graph of desire and reality has a degree of two, the graph can be unequivocally decomposed into cycles. We call a cycle a directed cycle if on a walk on the cycle, we go at least once from left to right on a reality cycle and we go at least once from right to left on a reality cycle. Other cycles are unoriented cycles.

The span of a desire edge is the interval between its left and right vertexes. Two cycles overlap if there are two reality edges in the two cycles whose spans intersect. The vertexes of the overlap graph of a signed permutation are the cycles in its graph of desire and reality, two nodes are connected if the two cycles overlap. The overlap graph can be decomposed into components. A component is directed if it contains a directed cycle, otherwise it is unoriented. The span of a component is the interval between its leftmost and rightmost nodes in the graph of desire and reality. An unoriented component is a hurdle if its span does not contain any unoriented component or it contains all unoriented component. Other components are called protected non-hurdles.

A super-hurdle is hurdle for which it is true that if we delete this hurdle then one of the protected non-hurdles becomes a hurdle. A fortress is a permutation in which all hurdles are super-hurdles and their number is odd.

The Hannenhalli-Pevzner theorem is the following:

Theorem 21.9 Given a signed permutation . The minimum number of mutations sorting this permutation into the identical permutation is

where is the length of the permutation, is the number of cycles, is the number of hurdles, and if the permutation is a fortress, otherwise .

The proof of the theorem can be found in the book due to Pevzner.

The reversal distance was calculated in time by Bader et al.. It is very easy to obtain in time. The hard part is to calculate and . The source of the problem is that the overlap graph might contain edges. Therefore the fast algorithm does not obtain the entire overlap graph, only a spanning tree on each component of it.

21.6.2. 21.6.2 Shotgun sequencing

A genome of an organism usually contain significantly more than one million nucleic acids. Using a special biochemical technology, the order of nucleic acids can be obtained, however, the uncertainty grows with the length of the DNA, and becomes absolutely unreliable after about 500 nucleic acids.

A possible solution for overcoming this problem is the following: several copies are made from the original DNA sequence and they are fragmented into small parts that can be sequenced in the above described way. Then the original sequence must be reconstructed from its overlapping fragments. This technique is called shotgun sequencing.

The mathematical definition of the problem is that we want to find the shortest common super-sequence of a set of sequences. Sequence is a super-sequence of if is subsequence of . Recall that a subsequence is not necessarily a consecutive part of the sequence. Maier proved that the shortest common super-sequence problem is NP-complete is the size of the alphabet is at least and conjectured that it is the case if the size is at least . Later on it has been proved that the problem is NP-complete for all non-trivial alphabet [ 284 ].

Similar problem is the shortest common super-string problem, that is also an NP-complete problem [ 125 ]. This later has biological relevance, since we are looking for overlapping substrings. Several approximation algorithms have been published for the shortest common super-string problem. A greedy algorithm finds for each pair of strings the maximal possible overlap, then it tries to find a shortest common super-string by merging the overlapping strings in a greedy way [ 318 ]. The running time of the algorithm is , where is the number of sequences and is the total length of the sequences. This greedy method is proved to be a -approximation [ 46 ]. A modified version being a -approximation also exist, and the conjecture is that the modified version is a -approximation [ 46 ].

The sequencing of DNA is not perfect, insertions, deletions and substitutions might happen during sequencing. Therefore Jiang and Li suggested the shortest -approximative common super-string problem [ 182 ]. Kececioglu and Myers worked out a software package including several heuristic algorithm for the problem [ 196 ]. Later on Myers worked for Celera, which played a very important role in sequencing the human genome. A review paper on the topic can be found in [ 341 ].

Exercises

21.6-1 Show that a fortress contains at least three super-hurdle.

21.6-2 At least how long is a fortress?

  PROBLEMS  

21-1 Concave Smith–Waterman

Give the Smith–Waterman-algorithm for concave gap penalty.

21-2 Concave Spouge

Give Spouge-algorithm for concave gap penalty.

21-3 Serving at a petrol station

There are two rows at a petrol station. Each car needs either petrol or diesel oil. At most two cars can be served at the same time, but only if they need different type of fuel, and the two cars are the first ones in the two rows or the first two in the same row. The serving time is the same not depending on whether two cars are being served or only one. Give a pair-HMM for which the Viterbi-algorithm provides a shortest serving scenario.

21-4 Moments of an HMM

Given an HMM and a sequence. Obtain the mean, variance, th moment of the probabilities of paths emitting the given sequence.

21-5 Moments of a SCFG

Given a SCFG and a sequence. Obtain the mean, variance, th moment of the probabilities of derivations of the given sequence.

21-6 Co-emission probability of two HMMs

Can this probability be calculated in time where and are the number of states in the HMMs?

21-7 Sorting reversals

A sorting reversal is a reversal that decreases the reversal distance of a signed permutation. How can a sorting reversal change the number of cycles and hurdles?

  CHAPTER NOTES  

The first dynamic programming algorithm for aligning biological sequences was given by Needleman and Wunch in 1970 [ 252 ]. Though the concave gap penalty function is biologically more relevant, the affine gap penalty has been the standard soring scheme for aligning biological sequences. For example, one of the most popular multiple alignment program, CLUSTAL-W uses affine gap penalty and iterative sequence alignment [ 322 ]. The edit distance of two strings can calculated faster than time, that is the famous “Four Russians' speedup” [ 19 ]. The running time of the algorithm is , however, it has such a big constant in the running time that it is not worth using it for sequence lengths appear in biological applications. The longest common subsequence problem can be solved using a dynamic programming algorithm similar to the dynamic programming algorithm for aligning sequences. Unlike that algorithm, the algorithm of Hunt and Szymanski creates a graph whose points are the characters of the sequences and , and is connected to iff . Using this graph, the longest common subsequence can be find in time, where is the number of edges in the graph and is the number of nodes [ 171 ]. Although the running time of this algorithm is , since the number of edges might be , in many cases the number of edges is only , and in this cases the running time is only . A very sophisticated version of the corner-cutting method is the diagonal extension technique, which fills in the dynamic programming table by diagonals and does not need a test value. An example for such an algorithm is the algorithm of Wu at al. [ 348 ]. The diff command in the Unix operating system is also based on diagonal extension [ 245 ], having a running time , where and are the lengths of the sequences and is the edit distance between the two sequences. The Knuth-Morris-Pratt string-searching algorithm searches a small pattern in a long string . Its running time is , where and are the length of the sequences [ 203 ]. Landau and Vishkin modified this algorithm such that the modified version can find a pattern in that differs at most in position [ 216 ]. The running time of the algorithm is , the memory requirement is . Although dynamic programming algorithms are the most frequently used techniques for aligning sequences, it is also possible to attack the problem with integer linear programming. Kececioglu and his colleges gave the first linear programming algorithm for aligning sequences [ 195 ]. Their method has been extended to arbitrary gap penalty functions [ 9 ]. Lancia wrote a review paper on the topic [ 215 ] and Pachter and Sturmfels showed the relationship between the dynamic programming and integer linear programming approach in their book Algebraic Statistics for Computational Biology [ 263 ]. The structural alignment considers only the 3D structure of sequences. The optimal structural alignment problem is to find an alignment where we penalise gaps, however, the aligned characters scored not by their similarity but by how close their are in the superposed 3D structures. Several algorithms have been developed for the problem, one of them is the combinatorial extension (CE) algorithm [ 302 ]. For a given topology it is possible to find the Maximum Likelihood labeling [ 279 ]. This algorithm has been integrated into PAML, which is one of the most popular software package for phylogenetic analysis (http://abacus.gene.ucl.ac.uk/software/paml.html). The Maximum Likelihood tree problem is to find for a substitution model and a set of sequences the tree topology and edge lengths for which the likelihood is maximal. Surprisingly, it has only recently been proved that the problem is NP-complete [ 64 ], [ 288 ]. The similar problem, the Ancestral Maximum Likelihood problem has been showed to be NP-complete also only recently [ 4 ]. The AML problem is to find the tree topology, edge lengths and labellings for which the likelihood of a set of sequences is maximal in a given substitution model. The two most popular sequence alignment algorithms based on HMMs are the SAM [ 170 ] and the HMMER (http://hmmer.wustl.edu/) packages. An example for HMM for genome annotation is the work of Pedersen and Hein [ 269 ]. Comparative genome annotation can be done with pair-HMMs like the DoubleScan [ 242 ], (http://www.sanger.ac.uk/Software/analysis/doublescan/) and the Projector [ 243 ], (http://www.sanger.ac.uk/Software/analysis/projector/) programs. Goldman, Thorne and Jones were the first who published an HMM in which the emission probabilities are calculated from evolutionary informations [ 135 ]. It was used for protein secondary structure prediction. The HMM emits alignment columns, the emission probabilities can be calculated with the Felsenstein algorithm. The Knudsen-Hein grammar is used in the PFold program, which is for predicting RNA secondary structures [ 201 ]. This SCFG generates RNA multiple alignments, where the terminal symbols are alignment columns. The derivation probabilities can be calculated with the Felsenstein algorithm, the corresponding substitution model is a single nucleotide or a dinucleotide model, according to the derivation rules. The running time of the Forward algorithm grows squarely with the number of states in the HMM. However, this is not always the fastest algorithm. For a biologically important HMM, it is possible to reduce the running time of the Forward algorithm to with a more sophisticated algorithm [ 226 ], [ 227 ]. However, it is unknown whether or not similar acceleration exist for the Viterbi algorithm. The Zuker-Tinoco model [ 323 ] defines free energies for RNA secondary structure elements, and the free energy of an RNA structure is the sum of free energies of the elements. The Zuker-Sankoff algorithm calculates in time the minimum free energy structure, using memory, where is the length of the RNA sequence. It is also possible to calculate the partition function of the Boltzmann distribution with the same running time and memory requirement [ 240 ]. For a special case of free energies, both the optimal structure and the partition function can be calculated in time, using still only memory [ 231 ]. Two base-pairings, and forms a pseudo-knot if . Predicting the optimal RNA secondary structure in which arbitrary pseudo-knots are allowed is NP-complete [ 230 ]. For special types of pseudo-knots, polynomial running time algorithms exist [ 8 ], [ 230 ], [ 287 ], [ 329 ]. RNA secondary structures can be compared with aligning ordered forests [ 168 ]. Atteson gave a mathematical definition for the goodness of tree-constructing methods, and showed that the Neighbor-Joining algorithm is the best one for some definitions [ 22 ]. Elias and Lagergren recently published an improved algorithm for Neighbor-Joining that has only running time [ 97 ]. There are three possible tree topologies for four species that are called quartets. If we know all the quartets of the tree, it is possible to reconstruct it. It is proved that it is enough to know only the short quartets of a tree that are the quartets of closely related species [ 98 ]. A genome might contain more than one DNA sequences, the DNA sequences are called chromosomes. A genome rearrangement might happen between chromosomes, too, such mutations are called translocations. Hannenhalli gave a running time algorithm for calculating the translocation and reversal distance [ 157 ]. Pisanti and Sagot generalised the problem and gave results for the translocation diameter [ 274 ]. The generalisation of sorting permutations is the problem of finding the minimum length generating word for an element of a group. The problem is known to be NP-complete [ 181 ]. Above the reversal distance and translocation distance problem, only for the block interchange distance exists a polynomial running time algorithm [ 65 ]. We mention that Bill Gates, the owner of Microsoft worked also on sorting permutations, actually, with prefix reversals [ 129 ].

Description of many algorithms of bioinformatics can be found in the book of Pevzner and Jones [ 272 ]. We wrote only about the most important topics of bioinformatics, and we did not cover several topics like recombination, pedigree analysis, character-based tree reconstructing methods, partial digesting, protein threading methods, DNA chip analysis, knowledge representation, biochemical pathways, scale-free networks, etc. We close the chapter with the words of Donald Knuth: “It is hard for me to say confidently that, after fifty more years of explosive growth of computer science, there will still be a lot of fascinating unsolved problems at peoples' fingertips, that it won't be pretty much working on refinements of well-explored things. Maybe all of the simple stuff and the really great stuff has been discovered. It may not be true, but I can't predict an unending growth. I can't be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on, it's at that level.”