## 20.4. 20.4 Stable partitions and the PT-algorithm

Most index structures used for efficient evaluation of queries of semi-structured databases are based on a partition of the nodes of a graph. The problem of creating indexes can often be reduced to finding the coarsest stable partition.

Definition 20.19 Let be a binary relation on the finite set , that is, . Then is the set of nodes, and is the set of edges. For arbitrary , let and . We say that is stable with respect to for arbitrary and , if or . Let be a partition of , that is, a decomposition of into disjoint sets, or in other words, blocks. Then is stable with respect to , if all blocks of are stable with respect to . is stable with respect to partition , if all blocks of are stable with respect to all blocks of . If is stable with respect to all of its blocks, then partition is stable. Let and be two partitions of . is a refinement of , or is coarser than , if every block of is the union of some blocks of . Given , and , the coarsest stable partition is the coarsest stable refinement of , that is, the stable refinement of that is coarser than any other stable refinement of .

Note that stability is sometimes defined the following way. is stable with respect to if or . This is not a major difference, only the direction of the edges is reversed. So in this case stability is defined with respect to the binary relation instead of , where if and only if , since .

Let and . We will prove that there always exists a unique solution of the problem of finding the coarsest stable partition, and there is an algorithm that finds the solution in time with space requirement . This algorithm was published by R. Paige and R. E. Tarjan in 1987, therefore it will be called the PT-algorithm.

The main idea of the algorithm is that if a block is not stable, then it can be split into two in such a way that the two parts obtained are stable. First we will show a naive method. Then, using the properties of the split operation, we will increase its efficiency by continuing the procedure with the smallest part.

Definition 20.20 Let be a binary relation on , and a partition of . Furthermore, let split() be the refinement of which is obtained by splitting all blocks of that are not disjoint from , that is, and . In this case, add blocks and to the partition instead of . is a splitter of if .

Note that is not stable with respect to if and only if is a splitter of .

Stability and splitting have the following properties, the proofs are left to the Reader.

Claim 20.21 Let and be two subsets of , while and two partitions of . Then

1. Stability is preserved under refinement, that is, if is a refinement of , and is stable with respect to , then is also stable with respect to .

2. Stability is preserved under unification, that is, if is stable with respect to both and , then is stable with respect to .

3. The split operation is monotonic in its second argument, that is, if is a refinement of , then split() is a refinement of split().

4. The split operation is commutative in the following sense. For arbitrary , and , split(, split()) = split(, split()), and the coarsest partition of that is stable with respect to both and is split(, split()).

In the naive algorithm, we refine partition starting from partition , until is stable with respect to all of its blocks. In the refining step, we seek a splitter of that is a union of some blocks of . Note that finding a splitter among the blocks of would be sufficient, but this more general way will help us in improving the algorithm.

`Naive-PT(` `)`

```  1     2
`WHILE`

is not stable   3
`DO`
let  be a splitter of  that is the union of some blocks of    4          5
`RETURN`

```

Note that the same set cannot be used twice during the execution of the algorithm, since stability is preserved under refinement, and the refined partition obtained in step 4 is stable with respect to . The union of the sets used can neither be used later, since stability is also preserved under unification. It is also obvious that a stable partition is stable with respect to any that is a union of some blocks of the partition. The following propositions can be proved easily using these properties.

Claim 20.22 In any step of the algorithm `Naive-PT` , the coarsest stable refinement of is a refinement of the actual partition stored in .

Proof. The proof is by induction on the number of times the cycle is executed. The case is trivial. Suppose that the statement holds for before using the splitter . Let be the coarsest stable refinement of . Since consists of blocks of , and, by induction, is a refinement of , therefore is the union of some blocks of . is stable with respect to all of its blocks and the union of any of its blocks, thus is stable with respect to , that is, . On the other hand, using that the split operation is monotonic, is a refinement of , which is the actual value of .

Claim 20.23 The algorithm `Naive-PT` determines the unique coarsest stable refinement of , while executing the cycle at most times.

Proof. The number of blocks of is obviously at least and at most . Using the split operation, at least one block of is divided into two, so the number of blocks increases. This implies that the cycle is executed at most times. is a stable refinement of when the algorithm terminates, and, using the previous proposition, the coarsest stable refinement of is a refinement of . This can only happen if is the coarsest stable refinement of .

Claim 20.24 If we store the set for all elements of , then the cost of the algorithm `Naive-PT` is at most .

Proof. We can assume, without restricting the validity of the proof, that there are no sinks in the graph, that is, every node has outgoing edges. Then for arbitrary in . Consider a partition , and split all blocks of . Let be the set of the nodes of that have at least one outgoing edge. Then . Now let , that is, the set of sinks of . Set is stable with respect to arbitrary , since , so does not have to be split during the algorithm. Therefore, it is enough to examine partition consisting of blocks instead of , that is, a partition of set . By adding blocks to the coarsest stable refinement of we obviously get the coarsest stable refinement of . This means that there is a preparation phase before the algorithm in which is obtained, and a processing phase after the algorithm in which blocks are added to the coarsest stable refinement obtained by the algorithm. The cost of preparation and processing can be estimated the following way. has at most elements. If, for all in we have , then the preparation and processing requires time.

From now on we will assume that holds for arbitrary in , which implies that . Since we store sets , we can find a splitter among the blocks of partition in time. This, combined with the previous proposition, means that the algorithm can be performed in time.

The algorithm can be executed more efficiently using a better way of finding splitter sets. The main idea of the improved algorithm is that we work with two partitions besides , and a partition that is a refinement of in every step such that is stable with respect to all blocks of . At the start, let and let be the partition consisting only one block, set . The refining step of the algorithm is repeated until .

`PT(` `)`

```  1     2     3
`WHILE`

4
`DO`
let  be a block of  that is not a block of ,              and  a block of  in  for which   5          6          7
`RETURN`

```

Claim 20.25 The result of the `PT` -algorithm is the same as that of algorithm `Naive-PT` .

Proof. At the start, is a stable refinement of with respect to the blocks of . In step 5, a block of is split, thus we obtain a refinement of . In step 6, by refining using splits we ensure that is stable with respect to two new blocks of . The properties of stability mentioned in Proposition 20.21 and the correctness of algorithm `Naive-PT` imply that the `PT` -algorithm also determines the unique coarsest stable refinement of .

In some cases one of the two splits of step 6 can be omitted. A sufficient condition is that is a function of .

Claim 20.26 If for all in , then step 6 of the `PT` -algorithm can be exchanged with .

Proof. Suppose that is stable with respect to a set which is the union of some blocks of . Let be a block of that is a subset of . It is enough to prove that is stable with respect to . Let be a block of . Since the result of a split according to is a stable partition with respect to , either or . Using , we get in the first case, and in the second case, which means that we obtained a stable partition with respect to .

Note that the stability of a partition with respect to and generally does not imply that it is also stable with respect to . If this is true, then the execution cost of the algorithm can be reduced, since the only splits needed are the ones according to because of the reduced sizes.

The two splits of step 6 can cut a block into four parts in the general case. According to the following proposition, one of the two parts gained by the first split of a block remains unchanged at the second split, so the two splits can result in at most three parts. Using this, the efficiency of the algorithm can be improved even in the general case.

Claim 20.27 Let be a stable partition with respect to , where is the union of some blocks of , and let be a block of that is a subset of . Furthermore, let be a block of that is cut into two (proper) parts and by the operation in such a way that none of these is the empty set. Suppose that block is further divided into the nonempty sets and by . Then

1. and if and only if and .

2. and if and only if and .

3. The operation leaves block unchanged.

4. .

Proof. The first two statements follow using the definition of the split operation. To prove the third statement, suppose that was obtained from by a proper decomposition. Then , and since , . All blocks of partition , including , are stable with respect to , which implies . Since , using the first statement, so is stable with respect to the set , therefore a split according to does not divide block . Finally, the fourth statement follows from and .

Denote by the number of nodes in that can be reached from , that is, . Note that if , then .

Since sizes are always halved, an arbitrary in can appear in at most different sets that were used for refinement in the `PT` -algorithm. In the following, we will give an execution of the `PT` algorithm in which the determination of the refinement according to block in steps 5 and 6 of the algorithm costs . Summing this for all blocks used in the algorithm and for all elements of these blocks, we get that the complexity of the algorithm `Efficient-PT` is at most . To give such a realization of the algorithm, we have to choose good data structures for the representation of our data.

• Attach node to all edges of set , and attach the list to all nodes . Then the cost of reading set is proportional to the size of .

• Let partition be a refinement of partition . Represent the blocks of the two partitions by records. A block of partition is simple if it consists of one block of , otherwise it is compound.

• Let be the list of all compound blocks in partition . At start, let , since is the union of the blocks of . If consists of only one block, then is its own coarsest stable refinement, so no further computation is needed.

• For any block of partition , let Q-blocks() be the double-chained list of the blocks of partition the union of which is set . Furthermore, store the values for all in set to which one pointer points from all edges such that is an element of . At start, the value assigned to all nodes is , and make a pointer to all nodes that points to the value .

• For any block of partition , let X-block() be the block of partition in which appears. Furthermore, let be the cardinality of , and the double-chained list of the elements of . Attach a pointer to all elements that points to the block of in which this element appears. Using double chaining any element can be deleted in time.

Using the proof of Proposition 20.24, we can suppose that without restricting the validity. It can be proved that in this case the space requirement for the construction of such data structures is .

`Efficient-PT(` `)`

```  1
`IF`

2
`THEN`

`RETURN`

3     4     5

is the list of the compound blocks of .   6
`WHILE`

7
`DO`
let  be an element of    8       let  be the smaller of the first two elements of    9         10         11         12
`IF`

13
`THEN`

14       Generate set  by reading the edges  of set  for which               is an element of , and for all elements  of this set, compute the             value . 15       Find blocks  and  for all blocks               of  by reading set  16       By reading all edges  of set  for which  is an element of ,              create set  checking the condition               17       Reading set , for all blocks  of ,              determine the sets              and   18
`FOR`
all blocks  of  for which ,  and   19
`DO`

`IF`

is a simple block of   20
`THEN`

21            22       Compute the value  by reading              the edges  of  for which  is an element of . 23
`RETURN`

```

Claim 20.28 The algorithm `Efficient-PT` determines the coarsest stable refinement of . The computation cost of the algorithm is , and its space requirement is .

Proof. The correctness of algorithm follows from the correctness of the `PT` -algorithm and Proposition 20.27. Because of the data structures used, the computation cost of the steps of the cycle is proportional to the number of edges examined and the number of elements of block , which is altogether. Sum this for all blocks used during the refinement and all elements of these blocks. Since the size of is at most half the size of , arbitrary in set can be in at most different sets . Therefore, the total computation cost of the algorithm is . It can be proved easily that a space of size is enough for the storage of the data structures used in the algorithm and their maintenance.

Note that the algorithm could be further improved by contracting some of its steps but that would only decrease computation cost by a constant factor.

Let be the graph that can be obtained from by changing the direction of all edges of . Consider a 1-index in graph determined by the bisimulation . Let and be two classes of the bisimulation, that is, two nodes of . Using the definition of bisimulation, or . Since , this means that is stable with respect to in graph . So the coarsest 1-index of is the coarsest stable refinement of the basic partition of graph .

Corollary 20.29 The coarsest 1-index can be determined using the algorithm `Efficient-PT` . The computation cost of the algorithm is at most , and its space requirement is at most .

Exercises

20.4-1 Prove Proposition 29.21.

20.4-2 Partition is size-stable with respect to set if for arbitrary elements , of a block of . A partition is size-stable if it is size-stable with respect to all its blocks. Prove that the coarsest size-stable refinement of an arbitrary partition can be computed in time.

20.4-3 The 1-index is minimal if no two nodes and with the same label can be contracted, since there exists a node for which is not stable with respect to . Give an example that shows that the minimal 1-index is not unique, therefore it is not the same as the coarsest 1-index.

20.4-4 Prove that in case of an acyclic graph, the minimal 1-index is unique and it is the same as the coarsest 1-index.