In this section, we shell introduce algorithms whose input is a set of objects and distances between objects. The distances might be obtained from pairwise alignments of sequences, however, the introduced algorithms work for any kind of distances. The leaves of the tree are the given objects, and the topology and the lengths of the edges are obtained from the distances. Every weighted tree defines a metric on the leaves of the tree, we define the distance between two leaves as the sum of the weights of edges on the path connecting them. The goodness of algorithms can be measured as the deviation between the input distances and the distances obtained on the tree.
We define two special metrics, the ultrametric and additive metric. The clustering algorithms generate a tree that is always ultrametric. We shell prove that clustering algorithms gives back the ultrametric if the input distances follow a ultrametric, namely, the tree obtained by a clustering algorithm defines exactly the input distances.
Similarly, the Neighbour Joining algorithm creates a tree that represents an additive metric, and whenever the input distances follow an additive metric, the generated tree gives back the input distances.
For both proves, we need the following lemma:
Proof. The proof is based on induction, the induction starts with three points. For three points, there is exactly one possible topology, a star-tree. Let the lengths of the edges connecting points , and with the internal node of the star three be , and , respectively. The lengths of the edges defined by the
equation system, which has a unique solution since the determinant
is not .
For number of points, let us assume that there are two trees representing the same metric. We find a cherry motif on the first tree, with cherries and . A cherry motif is a motif with two leafes whose connecting path has exactly one internal node. Every tree contains at least two cherry motives, a path on the tree that has the maximal number of internal nodes has cherry motives at both ends.
If there is only one internal node on the path connecting and on the other tree, then the length of the corresponding edges in the two cherry motives must be the same, since for any additional point , we must get the same subtree. We define a new metric by deleting points and , and adding a new point . The distance between and any point is , where is the length of the edge connecting with the internal point in the cherry motif. If we delete nodes and , we get a tree that represent this metric and they are the same, according to the induction.
If the path between and contains more than one internal node on the other tree, then we find a contradiction. There is a point on the second tree for which . Consider a such that the path connecting and contains node . From the first tree
while on the second tree
which contradicts that .
It is easy to prove that the three distances between any three points are all equal or two of them equal and the third is smaller in any ultrametric.
Theorem 21.5 If the metric on a finite set of points is ultrametric, then there is exactly one tree that represents it. Furthermore, this tree can be rooted such that the distance between a point and the root is the same for all points.
Proof. Based on the Lemma 21.3, it is enough to construct one ultrametric tree for any ultrametric. We represent ultrametric trees as dendrograms. in this representation, the horizontal edges has length zero. For an example dendrogram, see Figure 21.2.
The proof is based on the induction on the number of leaves. Obviously, we can construct a dendrogram for two leaves. After constructing the dendrogram for leaves, we add leaf to the dendrogram in the following way. We find a leaf in the dendrogram, for which is minimal. Then we walk up on the dendrogram till we reach the distance (we might go upper than the root). The node is connected to the dendrogram at this point, see Figure 21.3.
This dendrogram represents properly the distances between leaf and any other leaf. Indeed, if leaf that is below the new internal node that bonnets leaf , then and from the ultrametric property and the minimality of it follows that . On the other hand, if leaf is not below the new internal point joining leaf , then , and from the ultrametric property it comes that .
It is easy to see that the construction in the proof needs running time, where is the number of objects. We shell give another algorithm that finds the pair of objects and for which is minimal. From the ultrametric property, for any , , hence we can replace the pair of objects and to a new object, and the distance between this new object and any other object is well defined, it is . The objects and are connected at height , and we treat this sub-dendrogram as a single object. We continue the iteration till we have a single object. This algorithm is slower than the previous algorithm, however, this is the basis of the clustering algorithms. The clustering algorithms create a dendrogram even if the input distances do not follow a ultrametric. On the other hand, if the input distances follow a ultrametric, then most of the clustering algorithms gives back this ultrametric.
As we mentioned, the clustering algorithms find the object pair and for which is minimal. The differences come from how the algorithms define the distance between the new object replacing the pair of objects and and any other object. If the new object is denoted by , then the introduced clustering methods define in the following way:
the new distance is the arithmetic mean of the distances between the elements in and : , where and are the number of elements in and .
This method is used when the objects can be embedded into a Euclidean space. Then the distance between two objects can be defined as the distance between the centroids of the elements of the objects. It is not necessary to use the coordinates of the Euclidean space since the distance in question is the distance between point and the point intersecting the edge in proportion in the triangle obtained by points , Ús (see Figure 21.4). This length can be calculated using only , and . Hence the algorithm can be used even if the objects cannot be embedded into a Euclidean space.
The centroid of is the centroid of the centroids of and . This method is related to the centroid method as the single average is related to the UPGMA method. It is again not necessary to know the coordinates of the elements, hence this method can be applied to distances that cannot be embedded into a Euclidean space.
It is easy to show that the first four method gives the dendrogram of the input distances whenever the input distances follow a ultrametric since in this case. However, the
methods do not give the corresponding dendrogram for ultrametric input distances, since will be smaller than (which equals to ).
The central problem of the clustering algorithms is that they give a dendrogram that might not be biologically correct. Indeed, the evolutionary tree of biological sequences can be a dendrogram only if the molecular clock hypothesis holds. The molecular clock hypothesis says that the sequences evolve with the same tempo at each branches of the tree, namely they collect the same number of mutations at a given time span. However, this is usually not true. Therefore biologists want an algorithm that give a ultrametric tree only if the input distances follow a ultrametric. The most popular such method is the Neighbour-Joining algorithm.
Proof. Due to Lemma 21.3, there is at most one such tree, therefore it is enough to construct it. First we give the construction then we prove its goodness.
For three objects we can construct a tree according to (21.66)–(21.68). Assume that we constructed the tree for objects, and we want to add leaf to the tree. First we find the topology and then we give the length of the new edge. For obtaining the new topology we start with any leaf , and let denote the neighbour of leaf . There are at least two other edges going out from , we find two leaves on the paths starting with these two outgoing edges, let and denote these leaves, see Figure 21.5.
The leaf is connected to the edges between and if
Using similar inequalities, we can decide if leaf is before looking from or looking from . If the degree of is greater than , then we find leaves on the other paths and we do the same investigations for , , and points. From the additive property, it follows that inequality can hold at most for one cases. If it holds for , then we connect leaf to the edge connecting and . If the inequality holds for another case, then we derive the maximal subtree of the tree that contains as a leaf and also contains the leaf for which the inequality holds. We define as , then renaming to we continue the searching for the connection place of leaf . If we get equality for all outgoing edges of , then we connect leaf to .
After finding the topology, we obtain the length of the new edge. Leaf is connected to the edge between and , let denote the new internal point, see Figure 21.6/b.
Figure 21.6. Some tree topologies for proving Theorem 21.7.
Now we prove the correctness of the construction. First we show that is well-defined, namely, for all node that is not in the new subtree containing leaves and , . If the new subtree contains then for that was used to find the place of leaf will obviously hold (see Figure 21.6/a). Due to the additive metric property and the place of leaf
Using inequalities Ús a , it follows that
Similarly for all leaves that are not separated from by , it holds that
It is due to the additive metric and the inequality
this later inequality comes from these inequalities
If the degree of is greater than , then similar inequalities hold.
Due to the way of calculating the new edge lengths, is represented properly on the new tree, hence is represented properly for all that is separated from leaf by . Note that might be an earlier .
If leaf is connected to the edge between and (Figure 21.6/b), then due to the definition , is represented properly. From the equation
it follows that
hence is represented properly. It can be similarly shown that for all points that are separated from by , the is represented properly on the tree.
If leaf is connected to node (Figure 21.6), then from the equations
it comes that both and are represents properly on the new tree, and with similar reasoning, it is easy to show that actually for all nodes that is separated from by , is represented properly on the tree.
Hence we construct a tree containing leaf from the tree containing the first leaves, thus proving Theorem 21.7.
It is easy to show that the above algorithm that constructs the tree representing an additive metric takes running time. However, it works only if the input distances follow an additive metric, other wise inequality (21.74) might hold several times, hence we cannot decide where to join leaf to. We shell introduce an algorithm that has running time and gives back the additive tree whenever the input distances follow an additive metric, moreover it generates an additive tree that approximates the input distances if those are not follow an additive metric.
algorithm works in the following way: Given a set with points and a distance function on the points. First we calculate the for each point the sum of the distances from the other points:
Then we find the pair of points for which
is minimal. The length of the edges from points and to the new point are
Then we recalculate distances. We drop points and , and add point . The distance between and any other point is defined as
Proof. From Theorem 21.7 there is exactly one tree that represents the distances. It is enough to prove that the
algorithm always pick a cherry motif on this tree, since a straightforward calculation shows that in this case the calculated edge lengths are proper.
First we prove if and follows a cherry motif then for all , Ús . Indeed, rearranging , we have to prove that
If , then we get that
Now we prove the Theorem 21.8 in an indirect way. Suppose that and does not follow a cherry motif, however, is minimal. From the previous lemma, neither nor are in a cherry motif with other leaves. We find a cherry motif with leaves and and internal node . Let denote the last common node of paths going from to and to . Since is minimal,
Rearranging this we get that
and cases , , and inside the sum cancel each other. For the other , the left hand side is
If joins to the tree via the path between and , then the expression 21.93 might be positive. Let these cases called II. class cases. To avoid contradiction, the sum of absolute values from I. class cases must be less than the sum from the II. class cases.
There is another node on the path connecting and , and we can find a cherry motif after node with leaves and and internal node . Here again the II. class cases have to be more than the I. class cases, but this contradict to the situation with the first cherry motif. Hence and form a cherry motif and we prove Theorem 21.8.
21.5-1 Show that in a ultrametric, three distances coming from three points are all equal or two of them equal and the third is smaller. Prove the similar claim for the three sum of distances coming from four points in an additive metric.
21.5-2 Show that a ultrametric is always an additive metric.
21.5-3 Give an example for a metric that is not additive.
21.5-4 Is it true that all additive metric is a Euclidean metric?
21.5-5 Give the formula that calculates from , and for the centroid method.
21.5-6 Give algorithms that decide in whether or not a metric is
( is the number of points.)