20.2. 20.2 Schemas and simulations

In case of relational databases, schemas play an important role in coding and querying data, query optimization and storing methods that increase efficiency. When working with semi-structured databases, the schema must be obtained from the graph. The schema restricts the possible label strings belonging to the paths of the graph.

Figure 20.4 shows the relational schemas with relations and , respectively, and the corresponding semi-structured description. The labels of the leaves of the tree are the components of the tuples. The directed paths leading from the root to the values contain the label strings database.R.tuple.A, database.R.tuple.B, database.R.tuple.C, database.Q.tuple.C, database.Q.tuple.D. This can be considered the schema of the semi-structured database. Note that the schema is also a graph, as it can be seen on Figure 20.5. The disjoint union of the two graphs is also a graph, on which a simulation mapping can be defined as follows. This way we create a connection between the original graph and the graph corresponding to the schema.

Figure 20.4.  A relational database in the semi-structured model.

A relational database in the semi-structured model.

Definition 20.1 Let be a vertex-labeled directed graph, where denotes the set of nodes, the set of edges, the set of labels, and is the label belonging to node . Denote by the set of the start nodes of the edges leading to node . A binary relation ( ) is a simulation, if, for ,

i) and

ii) for all there exists a such that

Node v simulates node u, if there exists a simulation such that . Node u and node v are similar, , if simulates and simulates .

Figure 20.5.  The schema of the semi-structured database given in Figure 20.4.

The schema of the semi-structured database given in Figure 20.4.


It is easy to see that the empty relation is a simulation, that the union of simulations is a simulation, that there always exists a maximal simulation and that similarity is an equivalence relation. We can write instead of in the above definition, since that only means that the direction of the edges of the graph is reversed.

We say that graph simulates graph if there exists a mapping such that the relation is a simulation on the set .

Two different schemas are used, a lower bound and an upper bound. If the data graph simulates the schema graph , then S is a lower bound of D. Note that this means that all label strings belonging to the directed paths in appear in at some directed path. If simulates , then S is an upper bound of D. In this case, the label strings of also appear in .

In case of semi-structured databases, the schemas which are greatest lower bounds or lowest upper bounds play an important role.

A map between graphs and that preserves edges is called a morphism. Note that is a morphism if and only if simulates . To determine whether a morphism from to exists is an NP-complete problem. We will see below, however, that the calculation of a maximal simulation is a PTIME problem.

Denote by the nodes that simulate . The calculation of the maximal simulation is equivalent to the determination of all sets for . First, our naive calculation will be based on the definition.

Naive-Maximal-Simulation( )

  1  
                     FOR
                   all    2    
                     DO
                   
                     3  
                     WHILE
                   
                     4    
                     DO
                   
                     5  
                     RETURN
                   
                   
               

Claim 20.2 The algorithm Naive-Maximal-Simulation computes the maximal simulation in time if .

Proof. Let us start with the elements of . If an element of does not simulate by definition according to edge , then we remove from set . In this case, we say that we improved set according to edge . If set cannot be improved according to any of the edges, then all elements of simulate . To complete the proof, notice that the wHILE cycle consists of at most iterations.

The efficiency of the algorithm can be improved using special data structures. First, introduce a set , which contains , and of the elements of whom we want to find out whether they simulate .

Improved-Maximal-Simulation( )

  1  
                     FOR
                   all    2    
                     DO
                   
                     3       
                     IF
                   
                     4          
                     THEN
                   
                     5          
                     ELSE
                   
                     6  
                     WHILE
                   
                     7    
                     DO
                   
                     8       
                     FOR
                   all    9          
                     DO
                   
                    10         11  
                     RETURN
                   
                   
               

The wHILE cycle of the improved algorithm possesses the following invariant characteristics.

: .

: .

When improving the set according to edge , we check whether an element has parents in . It is sufficient to check that for the elements of instead of because of . Once an element was chosen, it is removed from set .

We can further improve the algorithm if we do not compute the set in the iterations of the wHILE cycle but refresh the set dynamically.

Efficient-Maximal-Simulation( )

  1  
                     FOR
                   all     2    
                     DO
                   
                     3       
                     IF
                   
                     4          
                     THEN
                   
                     5          
                     ELSE
                   
                     6       7  
                     WHILE
                   
                     8    
                     DO
                   
                  
                     FOR
                   all    9       
                     DO
                   
                  
                     FOR
                   all   10          
                     DO
                   
                  
                     IF
                   
                    11             
                     THEN
                   
                    12                
                     FOR
                   all   13                   
                     DO
                   
                  
                     IF
                   
                    14                      
                     THEN
                   
                                                  
                   15         16         17  
                     RETURN
                   
                   
               

The above algorithm possesses the following invariant characteristic with respect to the wHILE cycle.

: : .

Use an array as a counter for the realization of the algorithm. Let the value be the nonnegative integer . The initial values of the counter are set in time. When element is removed from set , the values must be decreased for all children of . By this we ensure that the innermost iF condition can be checked in constant time. At the beginning of the algorithm, the initial values of the sets are set in time if . The setting of sets takes altogether time. For arbitrary nodes and , if is true in the -th iteration of the wHILE cycle, then it will be false in the -th iteration for . Since implies , the value of in the -th iteration is a subset of the value of in the -th iteration, and we know that invariant holds. Therefore can be checked in time. is true at most once for all nodes and , since once the condition holds, we remove from set . This implies that the computation of the outer iF condition of the wHILE cycle takes time.

Thus we have proved the following proposition.

Claim 20.3 The algorithm Effective-Maximal-Simulation computes the maximal simulation in time if .

If the inverse of a simulation is also a simulation, then it is called a bisimulation. The empty relation is a bisimulation, and there always exist a maximal bisimulation. The maximal bisimulation can be computed more efficiently than the simulation. The maximal bisimulation can be computed in time using the PT algorithm. In case of edge-labeled graphs, the cost is .

We will see that bisimulations play an important role in indexing semi-structured databases, since the quotient graph of a graph with respect to a bisimulation contains the same label strings as the original graph. Note that in practice, instead of simulations, the so-called DTD descriptions are also used as schemas. DTD consists of data type definitions formulated in regular language.

Exercises

20.2-1 Show that simulation does not imply bisimulation.

20.2-2 Define the operation turn-tree for a directed, not necessarily acyclic, vertex-labeled graph the following way. The result of the operation is a not necessarily finite graph , the vertices of which are the directed paths of starting from the root, and the labels of the paths are the corresponding label strings. Connect node with node by an edge if can be obtained from by deletion of its endpoint. Prove that and are similar with respect to the bisimulation.