20.6. 20.6 D()- and M()-indexes

When using A()-indexes, the value of must be chosen appropriately. If is too large, the size of the index will be too big, and if is too small, the result obtained has to be checked too many times in order to preserve exactness. Nodes of the same class are similar locally, that is, they cannot be distinguished by their distance neighbourhoods, or, more precisely, by the paths of length at most leading to them. The same is used for all nodes, even though there are less important nodes. For instance, some nodes appear very rarely in results of queries in practice, and only the label sequences of the paths passing through them are examined. There is no reason for using a better refinement on the less important nodes. This suggests the idea of using the dynamic D()-index, which assigns different values to the nodes according to queries. Suppose that a set of queries is given. If there is an and an query among them, where and are regular queries, then a partition according to at least 1-bisimulation in case of nodes with label , and according to at least 2-bisimulation in case of nodes with label is needed.

Definition 20.35 Let be the index graph belonging to graph , and to all index node assign a nonnegative integer . Suppose that the nodes of block are -bisimilar. Let the values satisfy the following condition: if there is an edge from to in graph , then . The index having this property is called a D()-index.

The “D” in the notation refers to the word “dynamic”. Note that the A()-index is a special case of the D()-index, since in case of A()-indexes, the elements belonging to any index node are exactly -bisimilar.

Since classification according to labels, that is, the basic partition is an A()-index, and in case of finite graphs, the 1-index is the same as an A()-index for some , these are also special cases of the D()-index. The D()-index, just like any other index, is safe, so it is sufficient to evaluate the queries on them. Results must be checked to ensure exactness. The following proposition states that exactness is guaranteed for some queries, therefore checking can be omitted in case of such queries.

Claim 20.36 Let be a directed path in the D()-index, and suppose that if . Then all elements of fit to the label sequence .

Proof. The proof is by induction on . The case is trivial. By the inductive assumption, all elements of fit to the label sequence . Since there is an edge from node to node in graph , there exist and such that there is an edge from to in graph . This means that fits to the label sequence of length . The elements of are at least -bisimilar, therefore all elements of fit to this label sequence.

Corollary 20.37 The D()-index is exact with respect to label sequence if for all nodes of the index graph that fit to this label sequence.

When creating the D()-index, we will refine the basic partition, that is, the A()-index. We will assign initial values to the classes consisting of nodes with the same label. Suppose we use different values. Let be the set of these values, and denote the elements of by . If the elements of do not satisfy the condition given in the D()-index, then we increase them using the algorithm Weight-Changer , starting with the greatest value, in such a way that they satisfy the condition. Thus, the classes consisting of nodes with the same label will have good values. After this, we refine the classes by splitting them, until all elements of a class are -bisimilar, and assign this to all terms of the split. During this process the edges of the index graph must be refreshed according to the partition obtained by refinement.

Weight-Changer( )

  1     2     3  
                     WHILE
                   
                     4    
                     DO
                   
                  
                     FOR
                   all , where  is a node of the A()-index and    5       
                     DO
                   
                  
                     FOR
                   all , where  is a node of the A()-index                 and there is an edge from  to   6             7       8       9  
                     RETURN
                   
                   
               

It can be checked easily that the computation cost of the algorithm Weight-Changer is , where is the number of edges of the A()-index.

D( )-Index-Creator( )

  1  let  be the A()-index belonging to graph , let  be the set        of nodes of , let  be the set of edges of    2  
                  
                     Weight-Changer(
                     
                     )
                        
                   Changing the initial weights                    according to the condition of the D()-index.  3  
                     FOR
                   
                   
                  
                     TO
                   
                     4    
                     DO FOR
                   all    5       
                     DO
                   
                  
                     IF
                   
                     6          
                     THEN
                   
                  
                     FOR
                   all , where    7             
                     DO
                   
                     8                   9                  10                ,                           11                  12  
                     RETURN
                   
                   
               

In step 7, a split operation is performed. This ensures that the classes consisting of -bisimilar elements are split into equivalence classes according to -bisimilarity. It can be proved that the computation cost of the algorithm D( )-Index-Creator is at most , where is the number of edges of graph , and .

In some cases, the D()-index results in a partition that is too fine, and it is not efficient enough for use because of its huge size. Over-refinement can originate in the following. The algorithm D( )-Index-Creator( ) assigns the same value to the nodes with the same label, although some of these nodes might be less important with respect to queries, or appear more often in results of queries of length much less than , so less fineness would be enough for these nodes. Based on the value assigned to a node, the algorithm Weight-Changer will not decrease the value assigned to the parent node if it is greater than . Thus, if these parents are not very significant nodes considering frequent queries, then this can cause over-refinement. In order to avoid over-refinement, we introduce the M()-index and the M -index, where the “M” refers to the word “mixed”, and the “*” shows that not one index is given but a finite hierarchy of gradually refined indexes. The M()-index is a D()-index the creation algorithm of which not necessarily assigns nodes with the same label to the same -bisimilarity classes.

Let us first examine how a D()-index must be modified if the initial weight of index node is increased. If , then does not change. Otherwise, to ensure that the conditions of the D()-index on weights are satisfied, the weights on the ancestors of must be increased recursively until the weight assigned to the parents is at least . Then, by splitting according to the parents, the fineness of the index nodes obtained will be at least , that is, the elements belonging to them will be at least -bisimilar. This will be achieved using the algorithm Weight-Increaser .

Weight-Increaser( )

  1  
                     IF
                   
                     2    
                     THEN
                   
                  
                     RETURN
                   
                     3  
                     FOR
                   all    4    
                     DO
                   
                  
                  
                     Weight-Increaser(
                     
                     )
                     5  
                     FOR
                   all    6    
                     DO
                   
                     7          8          9  
                     RETURN
                   
                   
               

The following proposition can be easily proved, and with the help of this we will be able to achieve the appropriate fineness in one step, so we will not have to increase step by step anymore.

Claim 20.38 if and only if , and if there is an edge from node to node , then there is a node , from which there is an edge to node and , and, conversely, if there is an edge from node to node , then there is a node , from which there is an edge to node and .

Denote by FRE the set of simple expressions, that is, the label sequences determined by the frequent regular queries. We want to achieve a fineness of the index that ensures that it is exact on the queries belonging to FRE. For this, we have to determine the significant nodes, and modify the algorithm D( )-Index-Creator in such a way that the not significant nodes and their ancestors are always deleted at the refining split.

Let be a frequent simple query. Denote by and the set of nodes that fit to in the index graph and data graph, respectively, that is and . Denote by the fineness of index node in the index graph , then the nodes belonging to are at most -bisimilar.

Refine( )

  1  
                     FOR
                   all    2    
                     DO
                   
                  
                  
                     Refine-Index-Node(
                     
                     )
                     3  
                     WHILE
                   
                   such that length() and  fits to    4    
                     DO
                   
                  
                  
                     Weight-Increaser(
                     
                     )
                     5  
                     RETURN
                   
                   
               

The refinement of the index nodes will be done using the following algorithm. First, we refine the significant parents of index node recursively. Then we split according to its significant parents in such a way that the fineness of the new parts is . The split parts of are kept in set . Lastly, we unite those that do not contain significant nodes, and keep the original fineness of for this united set.

Refine-Index-Node( )

  1  
                     IF
                   
                     2    
                     THEN
                   
                  
                     RETURN
                   
                     3  
                     FOR
                   all    4    
                     DO
                   
                     5       
                     IF
                   
                     6          
                     THEN
                   
                  
                     Refine-Index-Node(
                     
                     )
                     7     8     9  
                     FOR
                   all   10    
                     DO
                   
                  
                     IF
                   
                    11       
                     THEN
                   
                  
                     FOR
                   all   12          
                     DO
                   
                    13               14               15               16               17               18    19  
                     FOR
                   all   20    
                     DO
                   
                  
                     IF
                   
                    21       
                     THEN
                   
                    22            23    24    25    26    27  
                     RETURN
                   
                   
               

The algorithm Refine refines the index graph according to a frequent simple expression in such a way that it splits an index node into not necessarily equally fine parts, and thus avoids over-refinement. If we start from the A()-index, and create the refinement for all frequent queries, then we get an index graph that is exact with respect to frequent queries. This is called the M()-index. The set FRE of frequent queries might change during the process, so the index must be modified dynamically.

Definition 20.39 The M()-index is a D()-index created using the following M( )-Index-Creator algorithm.

M( )-Index-Creator( )

  1  the A() index belonging to graph    2  the nodes of    3  
                     FOR
                   all    4    
                     DO
                   
                     5  the set of edges of    6  
                     FOR
                   all    7    
                     DO
                   
                  
                  
                     Refine(
                     
                     )
                     8  
                     RETURN
                   
                   
               

The M()-index is exact with respect to frequent queries. In case of a not frequent query, we can do the following. The M()-index is also a D()-index, therefore if an index node fits to a simple expression in the index graph , and the fineness of the index node is at least the length of , then all elements of the index node fit to the query in graph . If the fineness of the index node is less, then for all of its elements, we have to check according to Naive-Evaluation whether it is a solution in graph .

When using the M()-index, over-refinement is the least if the lengths of the frequent simple queries are the same. If there are big differences between the lengths of frequent queries, then the index we get might be too fine for the short queries. Create the sequence of gradually finer indexes with which we can get from the A()-index to the M()-index in such a way that, in every step, the fineness of parts obtained by splitting an index node is greater by at most one than that of the original index node. If the whole sequence of indexes is known, then we do not have to use the finest and therefore largest index for the evaluation of a simple query, but one whose fineness corresponds to the length of the query.

Definition 20.40 The M -index is a sequence of indexes such that

  1. Index is an M(i)-index, where .

  2. The fineness of all index nodes in is at most , where .

  3. is a refinement of , where .

  4. If node of index is split in index , and is a set obtained by this split, that is, , then .

  5. Let be a node of index , and . Then for and for all index nodes of such that .

It follows from the definition that in case of M -indexes is the A()-index. The last property says that if the refinement of an index node stops, then its fineness will not change anymore. The M -index possesses the good characteristics of the M()-index, and its structure is also similar: according to frequent queries the index is further refined if it is necessary to make it exact on frequent queries, but now we store and refresh the coarser indexes as well, not only the finest.

When representing the M -index, we can make use of the fact that if an index node is not split anymore, then we do not need to store this node in the new indexes, it is enough to refer to it. Similarly, edges between such nodes do not have to be stored in the sequence of indexes repeatedly, it is enough to refer to them. Creation of the M -index can be done similarly to the M( )-Index-Creator algorithm. The detailed description of the algorithm can be found in the paper of He and Yang.

With the help of the M -index, we can use several strategies for the evaluation of queries. Let be a frequent simple query.

The simplest strategy is to use the index the fineness of which is the same as the length of the query.

M -Index-Naive-Evaluation( )

  1  the M -index corresponding to graph    2  length()   3  
                     RETURN
                   
                  
                     Index-Evaluation(
                     
                     )
                   
               

The evaluation can also be done by gradually evaluating the longer prefixes of the query according to the index the fineness of which is the same as the length of the prefix. For the evaluation of a prefix, consider the partitions of the nodes found during the evaluation of the previous prefix in the next index and from these, seek edges labeled with the following symbol. Let be a simple frequent query, that is, .

M -Index-Evaluation-Top-to-Bottom( )

  1   the M -index corresponding to graph    2     3     4  
                     FOR
                   all       
                   The children of the root in graph .   5    
                     DO
                   
                  
                     IF
                   
                     6       
                     THEN
                   
                     7  
                     FOR
                   
                   
                  
                     TO
                   length()   8    
                     DO
                   
                     9         10       
                  
                     M 
                     
                     -Index-Evaluation-Top-to-Bottom(
                     
                     )
                    11       
                     FOR
                   all       
                   Node  is a node of graph .  12          
                     DO
                   
                  
                     IF
                   
                  , where       
                   The partition of node                     in graph . 13             
                     THEN
                   
                  
                     FOR
                   minden   14                
                     DO
                   
                  
                     FOR
                   all       
                   For all children of                           in graph . 15                   
                     DO
                   
                  
                     IF
                   
                    16                      
                     THEN
                   
                    17  
                     RETURN
                   
                   
               

Our strategy could also be that we first find a subsequence of the label sequence corresponding to the simple query that contains few nodes, that is, its selectivity is large. Then find the fitting nodes in the index corresponding to the length of the subsequence, and using the sequence of indexes see how these nodes are split into new nodes in the finer index corresponding to the length of the query. Finally, starting from these nodes, find the nodes that fit to the remaining part of the original query. The detailed form of the algorithm M -Index-Prefiltered-Evaluation is left to the Reader.

Exercises

20.6-1 Find the detailed form of the algorithm M -Index-Prefiltered-Evaluation . What is the cost of the algorithm?

20.6-2 Prove Proposition 20.38.

20.6-3 Prove that the computation cost of the algorithm Weight-Changer is , where is the number of edges of the A()-index.