20.8. 20.8 Index refresh

In database management we usually have three important aspects in mind. We want space requirement to be as small as possible, queries to be as fast as possible, and insertion, deletion and modification of the database to be as quick as possible. Generally, a result that is good with respect to one of these aspects is worse with respect to another aspect. By adding indexes of typical queries to the database, space requirement increases, but in return we can evaluate queries on indexes which makes them faster. In case of dynamic databases that are often modified we have to keep in mind that not only the original data but also the index has to be modified accordingly. The most costly method which is trivially exact is that we create the index again after every modification to the database. It is worth seeking procedures to get the modified indexes by smaller modifications to those indexes we already have.

Sometimes we index the index or its modification as well. The index of an index is also an index of the original graph, although formally it consists of classes of index nodes, but we can unite the elements of the index nodes belonging to the same class. It is easy to see that by that we get a partition of the nodes of the graph, that is, an index.

In the following, we will discuss those modifications of semi-structured databases when a new graph is attached to the root and when a new edges is added to the graph, since these are the ones we need when creating a new website or a new reference.

Suppose that is the 1-index of graph . Let be a graph that has no common node with . Denote by the 1-index of . Let be the graph obtained by uniting the roots of and . We want to create using and . The following proposition will help us.

Claim 20.45 Let be the 1-index of graph , and let be an arbitrary refinement of . Then .

Proof. Let and be two nodes of . We have to show that and are bisimilar in with respect to the 1-index if and only if and are bisimilar in the index graph with respect to the 1-index of . Let and be bisimilar in with respect to the 1-index. We will prove that there is a bisimulation according to which and are bisimilar in . Since the 1-index is the partition corresponding to the coarsest bisimulation, the given bisimulation is a refinement of the bisimulation corresponding to the 1-index, so and are also bisimilar with respect to the bisimulation corresponding to the 1-index of . Let if and only if and are bisimilar in with respect to the 1-index. Note that since is a refinement of , all elements of and are bisimilar in if . To show that the relation is a bisimulation, let be a parent of , where is a parent of , and is an element of . Then , and are bisimilar in , so there is a parent of for which and are bisimilar in . Therefore is a parent of , and . Since bisimulation is symmetric, relation is also symmetric. We have proved the first part of the proposition.

Let and be bisimilar in with respect to the 1-index of . It is sufficient to show that there is a bisimulation on the nodes of according to which and are bisimilar. Let if and only if with respect to the 1-index of . To prove bisimilarity, let be a parent of . Then is also a parent of . Since and are bisimilar if , there is a parent of for which and are bisimilar with respect to the 1-index of , and is a parent of an element of . Since and are bisimilar, there is a parent of such that and are bisimilar. Using the first part of the proof, it follows that and are bisimilar with respect to the 1-index of . Since bisimilarity is transitive, and are bisimilar with respect to the 1-index of , so . Since relation is symmetric by definition, we get a bisimulation.

As a consequence of this proposition, can be created with the following algorithm for disjoint and .

`Graphaddition-1-Index(` `)`

```  1

is the basic partition according to labels.   2

is the basic partition according to labels.   3

is the 1-index of .   4

is the 1-index of .   5
The 1-indexes are joined at the roots.   6

is the basic partition according to labels.   7

is the 1-index of .   8
`RETURN`

```

To compute the cost of the algorithm, suppose that the 1-index of is given. Then the cost of the creation of is , where and denote the number of nodes and edges of the graph, respectively.

To prove that the algorithm works, we only have to notice that is a refinement of if and are disjoint. This also implies that index is safe and exact, so we can use this as well if we do not want to find the minimal index. This is especially useful if new graphs are added to our graph many times. In this case we use the lazy method, that is, instead of computing the minimal index for every pair, we simply sum the indexes of the addends and then minimize only once.

Claim 20.46 Let be the 1-index of graph , , and let the graphs be disjoint. Then for the 1-index of the union of the graphs joined at the roots.

In the following we will examine what happens to the index if a new edge is added to the graph. Even an operation like this can have significant effects. It is not difficult to construct a graph that contains two identical subgraphs at a distant of 2 from the root which cannot be contracted because of a missing edge. If we add this critical edge to the graph, then the two subgraphs can be contracted, and therefore the size of the index graph decreases to about the half of its original size.

Suppose we added a new edge to graph from to . Denote the new graph by , that is, . Let partition be the 1-index of . If there was an edge from to in , then the index graph does not have to be modified, since there is a parent of the elements of , that is, of all elements bisimilar to , in whose elements are bisimilar to . Therefore .

If there was no edge from to , then we have to add this edge, but this might cause that will no longer be stable with respect to . Let be the partition we get from by splitting in such a way that is in one part and the other elements of are in the other, and leaving all other classes of the partition unchanged. defines its edges the usual way, that is, if there is an edge from an element of a class to an element of another class, then we connect the two classes with an edge directed the same way.

Let partition be the original . Then is a refinement of , and is stable with respect to according to . Note that the same invariant property appeared in the `PT` -algorithm for partitions and . Using Proposition 20.45 it is enough to find a refinement of . If we can find an arbitrary stable refinement of the basic partition of , then, since the 1-index is the coarsest stable partition, this will be a refinement of . is a refinement of the basic partition, that is, the partition according to labels, and so is . So if is stable, then we are done. If it is not, then we can stabilize it using the `PT` -algorithm by starting with the above partitions and . First we have to examine those classes of the partition that contain a children of , because these might lost their stability with respect to the two new classes gained by the split. The `PT` -algorithm stabilizes these by splitting them, but because of this we now have to check their children, since they might have lost stability because of the split, etc. We can obtain a stable refinement using this stability-propagator method. Since we only walk through the nodes that can be reached from , this might not be the coarsest stable refinement. We have shown that the following algorithm computes the 1-index of the graph .

`Edgeaddition-1-Index(` `)`

```  1

is the basic partition according to labels.   2

is the 1-index of .   3
`IF`

If there was an edge from  to ,                    then no modification is needed.  5
`THEN`

`RETURN`

6
Split .   7     8

is the old partition.   9
Add an edge from  to .  10
Replace  with  and .  11
Determine the edges of .  12
Execute the
`PT`
-algorithm               starting with  and . 13

is the coarsest stable refinement.  14
`RETURN`

```

Step 13 can be omitted in practice, since the stable refinement obtained in step 12 is a good enough approximation of the coarsest stable partition, there is only 5% difference between them in size.

In the following we will discuss how FB-indexes and A()-indexes can be refreshed. The difference between FB-indexes and 1-indexes is that in the FB-index, two nodes are in the same similarity class if not only the incoming but also the outgoing paths have the same label sequences. We saw that in order to create the FB-index we have to execute the `PT` -algorithm twice, using it on the graph with the edges reversed at the second time. The FB-index can be refreshed similarly to the 1-index. The following proposition can be proved similarly to Proposition 20.45, therefore we leave it to the Reader.

Claim 20.47 Let be the FB-index of graph , and let be an arbitrary refinement of . Denote by the FB-index of . Then .

As a consequence of the above proposition, the FB-index of can be created using the following algorithm for disjoint and .

`Graphaddition-FB-Index(` `)`

```  1

`FB-Index-Creator(`

`)`

is the FB-index of .   2

`FB-Index-Creator(`

`)`

is the FB-index of .   3
Join the FB-indexes at their roots.   4

`FB-Index-Creator(`

`)`

is the FB-index of .   5
`RETURN`

```

When adding edge , we must keep in mind that stability can be lost in both directions, so not only but also has to be split into , and , , respectively. Let be the partition before the modification, and the partition obtained after the splits. We start the `PT` -algorithm with and in step 3 of the algorithm `FB-Index-Creator` . When stabilizing, we will now walk through all descendants of and all ancestors of .

`Edgeaddition-FB-Index(` `)`

```  1

`FB-index-creator(`

`)`

is the FB-index of .   2
`IF`

If there was an edge from  to ,                    then no modification is needed.  4
`THEN`

`RETURN`

5
Split .   6     7
Split .   8     9

is the old partition.  10
Add an edge form  to .  11
Replace  with  and ,                     with  and .  12
Determine the edges of .  13

`FB-Index-Creator(`

`)`

Start the
`PT`
-algorithm                     with  and  in the algorithm
`FB-Index-Creator`
. 14

`FB-Index-Creator(`

`)`

is the coarsest ancestor-stable              and descendant-stable refinement. 15
`RETURN`

```

Refreshing the A()-index after adding an edge is different than what we have seen. There is no problem with adding a graph though, since the following proposition holds, the proof of which is left to the Reader.

Claim 20.48 Let be the A()-index of graph , and let be an arbitrary refinement of . Denote by the A()-index of . Then .

As a consequence of the above proposition, the A()-index of can be created using the following algorithm for disjoint and .

`Graphaddition-A(` `)-Index(` `)`

```  1

is the basic partition according to labels.   2

`Naive-Approximation(`

`)`

is the A()-index of .   3

is the basic partition according to labels.   4

`Naive-Approximation(`

`)`

is the A()-index of .   5
Join the A()-indexes.   6

is the basic partition according to labels.   7

`Naive-Approximation(`

`)`

is the A()-index of .   8
`RETURN`

```

If we add a new edge to the graph, then, as earlier, first we split into two parts, one of which is , then we have to repair the lost -stabilities walking through the descendants of , but only within a distant of . What causes the problem is that the A()-index contains information only about -bisimilarity, it tells us nothing about -bisimilarity. For example, let be a child of , and let . When stabilizing according to the 1-index, has to be detached from its class if there is an element in this class that is not a children of . This condition is too strong in case of the A(1)-index, and therefore it causes too many unnecessary splits. In this case, should only be detached if there is an element in its class that has no 0-bisimilar parent, that is, that has the same label as . Because of this, if we refreshed the A()-index the above way when adding a new edge, we would get a very bad approximation of the A()-index belonging to the modification, so we use a different method. The main idea is to store all A()-indexes not only the A()-index, where . Yi et al. give an algorithm based on this idea, and creates the A()-index belonging to the modification. The given algorithms can also be used for the deletion of edges with minor modifications, in case of 1-indexes and A()-indexes.

Exercises

20.8-1 Prove Proposition 20.47.

20.8-2 Give an algorithm for the modification of the index when an edge is deleted from the data graph. Examine different indexes. What is the cost of the algorithm?

20.8-3 Give algorithms for the modification of the D()-index when the data graph is modified.

 `PROBLEMS`

`20-1` `Implication problem regarding constraints`

Let and be regular expressions, and two nodes. Let predicate mean that can be reached from by a label sequence that fits to . Denote by the constraint . if and . Let be a finite set of constraints, and a constraint.

• a. Prove that the implication problem is a 2-EXPSPACE problem.

• b. Denote by the constraint . Prove that the implication problem is undecidable with respect to this class.

`20-2` `Transformational distance of trees`

Let the transformational distance of vertex-labeled trees be the minimal number of basic operations with which a tree can be transformed to the other. We can use three basic operations: addition of a new node, deletion of a node, and renaming of a label.

• a. Prove that the transformational distance of trees and can be computed in time, with storage cost of , where is the number of nodes of the tree and is the depth of the tree.

• b. Let and be two trees. Give an algorithm that generates all pairs , where and simulates graphs and , respectively, and the transformational distance of and is less then a given integer . (This operation is called approximate join.)

`20-3` `Queries of distributed databases`

A distributed database is a vertex-labeled directed graph the nodes of which are distributed in partitions (servers). The edges between different partitions are cross references. Communication is by message broadcasting between the servers. An algorithm that evaluates a query is efficient, if the number of communication steps is constant, that is, it does not depend on the data and the query, and the size of the data transmitted during communication only depends on the size of the result of the query and the number of cross references. Prove that an efficient algorithm can be given for the regular query of distributed databases in which the number of communication steps is 4, and the size of data transmitted is , where is the size of the result of the query, and is the number of cross references. (Hint. Try to modify the algorithm `Naive-Evaluation` for this purpose.)

 `CHAPTER NOTES`

This chapter examined those fields of the world of semi-structured databases where the morphisms of graphs could be used. Thus we discussed the creation of schemas and indexes from the algorithmic point of view. The world of semi-structured databases and XML is much broader than that. A short summary of the development, current issues and the possible future development of semi-structured databases can be found in the paper of Vianu [].

The paper of M. Henzinger, T. Henzinger and Kopke [] discusses the computation of the maximal simulation. They extend the concept of simulation to infinite graphs that can be represented efficiently (these are called effective graphs), and prove that for such graphs, it can be determined whether two nodes are similar. In their paper, Corneil and Gotlieb [] deal with quotient graphs and the determination of isomorphism of graphs. Arenas and Libkin [] extend normal forms used in the relational model to XML documents. They show that arbitrary DTD can be rewritten without loss as XNF, a normal form they introduced.

Buneman, Fernandez and Suciu [] introduce a query language, the UnQL, based on structural recursion, where the data model used is defined by bisimulation. Gottlob, Koch and Pichler [] examine the classes of the query language XPath with respect to complexity and parallelization. For an overview of complexity problems we recommend the classical work of Garey and Johnson [] and the paper of Stockmeyer and Meyer [].

The PT-algorithm was first published in the paper of Paige and Tarjan []. The 1-index based on bisimulations is discussed in detail by Milo and Suciu [], where they also introduce the 2-index, and as a generalization of this, the T-index.

The A()-index was introduced by Kaushik, Shenoy, Bohannon and Gudes []. The D()-index first appeared in the work of Chen, Lim and Ong []. The M()-index and the M -index, based on frequent queries, are the results of He and Yang []. FB-indexes of branching queries were first examined by Kaushik, Bohannon, Naughton and Korth []. The algorithms of the modifications of 1-indexes, FB-indexes and A()-indexes were summarized by Kaushik, Bohannon, Naughton and Shenoy []. The methods discussed here are improved and generalized in the work of Yi, He, Stanoi and Yang []. Polyzotis and Garafalakis use a probability model for the study of the selectivity of queries []. Kaushik, Krishnamurthy, Naughton and Ramakrishnan [] suggest the combined use of structural indexes and inverted lists.

The book of Tucker [] and the encyclopedia edited by Khosrow-Pour [] deal with the use of XML in practice.