## 20.2. 20.2 Schemas and simulations

In case of relational databases, schemas play an important role in coding and querying data, query optimization and storing methods that increase efficiency. When working with semi-structured databases, the schema must be obtained from the graph. The schema restricts the possible label strings belonging to the paths of the graph.

Figure 20.4 shows the relational schemas with relations and , respectively, and the corresponding semi-structured description. The labels of the leaves of the tree are the components of the tuples. The directed paths leading from the root to the values contain the label strings database.R.tuple.A, database.R.tuple.B, database.R.tuple.C, database.Q.tuple.C, database.Q.tuple.D. This can be considered the schema of the semi-structured database. Note that the schema is also a graph, as it can be seen on Figure 20.5. The disjoint union of the two graphs is also a graph, on which a simulation mapping can be defined as follows. This way we create a connection between the original graph and the graph corresponding to the schema.

Definition 20.1 Let be a vertex-labeled directed graph, where denotes the set of nodes, the set of edges, the set of labels, and is the label belonging to node . Denote by the set of the start nodes of the edges leading to node . A binary relation ( ) is a simulation, if, for ,

i) and

ii) for all there exists a such that

Node v simulates node u, if there exists a simulation such that . Node u and node v are similar, , if simulates and simulates .

It is easy to see that the empty relation is a simulation, that the union of simulations is a simulation, that there always exists a maximal simulation and that similarity is an equivalence relation. We can write instead of in the above definition, since that only means that the direction of the edges of the graph is reversed.

We say that graph simulates graph if there exists a mapping such that the relation is a simulation on the set .

Two different schemas are used, a lower bound and an upper bound. If the data graph simulates the schema graph , then S is a lower bound of D. Note that this means that all label strings belonging to the directed paths in appear in at some directed path. If simulates , then S is an upper bound of D. In this case, the label strings of also appear in .

In case of semi-structured databases, the schemas which are greatest lower bounds or lowest upper bounds play an important role.

A map between graphs and that preserves edges is called a morphism. Note that is a morphism if and only if simulates . To determine whether a morphism from to exists is an NP-complete problem. We will see below, however, that the calculation of a maximal simulation is a PTIME problem.

Denote by the nodes that simulate . The calculation of the maximal simulation is equivalent to the determination of all sets for . First, our naive calculation will be based on the definition.

`Naive-Maximal-Simulation(` `)`

```  1
`FOR`
all    2
`DO`

3
`WHILE`

4
`DO`

5
`RETURN`

```

Claim 20.2 The algorithm `Naive-Maximal-Simulation` computes the maximal simulation in time if .

Proof. Let us start with the elements of . If an element of does not simulate by definition according to edge , then we remove from set . In this case, we say that we improved set according to edge . If set cannot be improved according to any of the edges, then all elements of simulate . To complete the proof, notice that the `wHILE` cycle consists of at most iterations.

The efficiency of the algorithm can be improved using special data structures. First, introduce a set , which contains , and of the elements of whom we want to find out whether they simulate .

`Improved-Maximal-Simulation(` `)`

```  1
`FOR`
all    2
`DO`

3
`IF`

4
`THEN`

5
`ELSE`

6
`WHILE`

7
`DO`

8
`FOR`
all    9
`DO`

10         11
`RETURN`

```

The `wHILE` cycle of the improved algorithm possesses the following invariant characteristics.

: .

: .

When improving the set according to edge , we check whether an element has parents in . It is sufficient to check that for the elements of instead of because of . Once an element was chosen, it is removed from set .

We can further improve the algorithm if we do not compute the set in the iterations of the `wHILE` cycle but refresh the set dynamically.

`Efficient-Maximal-Simulation(` `)`

```  1
`FOR`
all     2
`DO`

3
`IF`

4
`THEN`

5
`ELSE`

6       7
`WHILE`

8
`DO`

`FOR`
all    9
`DO`

`FOR`
all   10
`DO`

`IF`

11
`THEN`

12
`FOR`
all   13
`DO`

`IF`

14
`THEN`

15         16         17
`RETURN`

```

The above algorithm possesses the following invariant characteristic with respect to the `wHILE` cycle.

: : .

Use an array as a counter for the realization of the algorithm. Let the value be the nonnegative integer . The initial values of the counter are set in time. When element is removed from set , the values must be decreased for all children of . By this we ensure that the innermost `iF` condition can be checked in constant time. At the beginning of the algorithm, the initial values of the sets are set in time if . The setting of sets takes altogether time. For arbitrary nodes and , if is true in the -th iteration of the `wHILE` cycle, then it will be false in the -th iteration for . Since implies , the value of in the -th iteration is a subset of the value of in the -th iteration, and we know that invariant holds. Therefore can be checked in time. is true at most once for all nodes and , since once the condition holds, we remove from set . This implies that the computation of the outer `iF` condition of the `wHILE` cycle takes time.

Thus we have proved the following proposition.

Claim 20.3 The algorithm `Effective-Maximal-Simulation` computes the maximal simulation in time if .

If the inverse of a simulation is also a simulation, then it is called a bisimulation. The empty relation is a bisimulation, and there always exist a maximal bisimulation. The maximal bisimulation can be computed more efficiently than the simulation. The maximal bisimulation can be computed in time using the `PT` algorithm. In case of edge-labeled graphs, the cost is .

We will see that bisimulations play an important role in indexing semi-structured databases, since the quotient graph of a graph with respect to a bisimulation contains the same label strings as the original graph. Note that in practice, instead of simulations, the so-called DTD descriptions are also used as schemas. DTD consists of data type definitions formulated in regular language.

Exercises

20.2-1 Show that simulation does not imply bisimulation.

20.2-2 Define the operation turn-tree for a directed, not necessarily acyclic, vertex-labeled graph the following way. The result of the operation is a not necessarily finite graph , the vertices of which are the directed paths of starting from the root, and the labels of the paths are the corresponding label strings. Connect node with node by an edge if can be obtained from by deletion of its endpoint. Prove that and are similar with respect to the bisimulation.