The information stored in semi-structured databases can be retrieved using queries. For this, we have to fix the form of the questions, so we give a **query language**, and then define the meaning of questions, that is, the **query evaluation** with respect to a semi-structured database. For efficient evaluation we usually use indexes. The main idea of **indexing** is that we reduce the data stored in the database according to some similarity principle, that is, we create an index that reflects the structure of the original data. The original query is executed in the index, then using the result we find the data corresponding to the index values in the original database. The size of the index is usually much smaller than that of the original database, therefore queries can be executed faster. Note that the inverted list type index used in case of classical databases can be integrated with the schema type indexes introduced below. This is especially advantageous when searching XML documents using keywords.

First we will get acquainted with the query language consisting of regular expressions and the index types used with it.

**Definition 20.4 **
*Given a directed vertex-labeled graph , where denotes the set of vertices, the set of edges and the set of labels. contains two special labels, ROOT and VALUE. The label of vertex is , and the identifier of vertex is . The root is a node with label ROOT, and from which all nodes can be reached via directed paths. If is a leaf, that is, if it has no outgoing edges, then its label is VALUE, and is the value corresponding to leaf . Under the term path we always mean a directed path, that is, a sequence of nodes such that there is an edge from to if . A sequence of labels is called a label sequence or simple expression. Path
fits to the label sequence
if for all .*

We define regular expressions recursively.

**Definition 20.5 **
*Let , where R is a regular expression, and is the empty expression, _ denotes an arbitrary label, . denotes succession, | is the logical OR operation, ? is the optional choice, and * means finite repetition. Denote by L(R) the regular language consisting of the label sequences determined by . Node n fits to a label sequence if there exists a path from the root to node such that fits to the label sequence. Node fits to the regular expression
if there exists a label sequence in the language , to which node fits. The result of the query on graph G determined by the regular expression is the set of nodes that fit to expression .*

Since we are always looking for paths starting from the root when evaluating regular expressions, the first element of the label sequence is always ROOT, which can therefore be omitted.

Note that the set of languages corresponding to regular expressions is closed under intersection, and the problem whether is decidable.

The result of the queries can be computed using the nondeterministic automaton corresponding to the regular expression . The algorithm given recursively is as follows.

*
Naive-Evaluation(
)
*

1 If we were in node in state , then was put in setVisited. 2`Traverse(`

`)`

*
Traverse(
)
*

1`IF`

Visited2`THEN`

3 4 5`RETURN`

all If we get to state from state by reading sign . 6`FOR`

`DO`

7`IF`

8`THEN`

9`Traverse(`

`)`

all If we get to state from state by reading sign . 10`FOR`

`DO`

11`IF`

12`THEN`

all , where Continue the traversal for the children of node recursively. 13`FOR`

`DO`

14`Traverse(`

`)`

`RETURN`

**Claim 20.6 **
*Given a regular query and a graph , the calculation cost of is a polynomial of the number of edges of and the number of different states of the finite nondeterministic automaton corresponding to .*

**Proof. **The sketch of the proof is the following. Let be the finite nondeterministic automaton corresponding to . Denote by the number of states of . Consider the breadth-first traversal corresponding to the algorithm *
Traverse
* of graph with edges, starting from the root. During the traversal we get to a new state of the automaton according to the label of the node, and we store the state reached at the node for each node. If the final state of the automaton is acceptance, then the node is a result. During the traversal, we sometimes have to step back on an edge to ensure we continue to places we have not seen yet. It can be proved that during a traversal every edge is used at most once in every state, so this is the number of steps performed by that automaton. This means steps altogether, which completes the proof.

Two nodes of graph are **indistinguishable with regular expressions** if there is no regular for which one of the nodes is among the results and the other node is not. Of course, if two nodes cannot be distinguished, then their labels are the same. Let us categorize the nodes in such a way that nodes with the same label are in the same class. This way we produce a partition of the set of nodes, which is called the **basic partition**. It can also be seen easily that if two nodes are indistinguishable, then it is also true for the parents. This implies that the set of label sequences corresponding to paths from the root to the indistinguishable nodes is the same. Let for all nodes . Nodes and are indistinguishable if and only if . If the nodes are assigned to classes in such a way that the nodes having the same value are arranged to the same class, then we get a refinement of partition . For this new partition, if a node is among the results of a regular query , then all nodes from the equivalence class of are also among the results of the query.

**Definition 20.7 **
*Given a graph and a partition of that is a refinement of the basic partition, that is, for which the nodes belonging to the same equivalence class have the same label. Then the graph is called an index. The nodes of the index graph are the equivalence classes of partition , and if and only if there exist and such that . If , then is the identifier of index node , , where , and root' is the equivalence class of partition that contains the root of . If , then .*

Given a partition of set , denote by **class(n)** the equivalence class of that contains for . In case of indexes, the notation can also be used instead of .

Note that basically the indexes can be identified with the different partitions of the nodes, so partitions can also be called indexes without causing confusion. Those indexes will be good that are of small size and for which the result of queries is the same on the graph and on the index. Indexes are usually given by an equivalence relation on the nodes, and the partition corresponding to the index consists of the equivalence classes.

**Definition 20.8 **
*Let be the partition for which for a class if and only if . Then the index corresponding to is called a naive index.*

In case of naive indexes, the same language is assigned to all elements of class in partition , which will be denoted by .

**Claim 20.9 **
*Let be a node of the naive index and a regular expression. Then or .*

**Proof. **Let and . Then there exists a label sequence in to which fits, that is, . Since , also fits to this label sequence, so .

*
Naive-Index-Evaluation(
)
*

1 let be the naive index of 2 3all 4`FOR`

5`DO`

`RETURN`

**Claim 20.10 **
*Set produced by the algorithm
*

`Naive-Index-Evaluation`

is equal to .
**Proof. **Because of the previous proposition either all elements of a class are among the results of a query or none of them.

Using naive indexes we can evaluate queries, but, according to the following proposition, not efficiently enough. The proposition was proved by Stockmeyer and Meyer in 1973.

**Claim 20.11 **
*The creation of the naive index needed in the algorithm
*

`Naive-Index-Evaluation`

is PSPACE-complete.The other problem with using naive indexes is that the sets are not necessary disjoint for different , which might cause redundancy in storing.

Because of the above we will try to find a refinement of the partition corresponding to the naive index, which can be created efficiently and can still be used to produce .

**Definition 20.12 **
*Index is safe if for any and label sequence such that fits to the label sequence in graph , class() fits to the label sequence in graph . Index is exact if for any class of the index and label sequence such that fits to the label sequence in graph , arbitrary node fits to the label sequence in graph .*

Safety means that the nodes belonging to the result we obtain by evaluation using the index contain the result of the regular query, that is, , while exactness means that the evaluation using the index does not provide false results, that is, . Using the definitions of exactness and of the edges of the index the following proposition follows.

**Claim 20.13 **
*1. Every index is safe.*

*2. The naive index is safe and exact.*

If is a set of nodes of , then the language , to the label strings of which the elements of fit, was defined using graph . If we wish to indicate this, we use the notation . However, can also be defined using graph , in which is a node. In this case, we can use the notation instead of , which denotes all label sequences to which node fits in graph . for safe and exact indexes, so in this case we can write for simplicity. Then can be computed using , since the size of is usually smaller than that of .

Arbitrary index graph can be queried using the algorithm *
Naive-Evaluation
*. After that join the index nodes obtained. If we use an exact index, then the result will be the same as the result we would have obtained by querying the original graph.

*
Index-Evaluation(
)
*

1 let be the index of 2 3all 4`FOR`

5`DO`

`RETURN`

First, we will define a safe and exact index that can be created efficiently, and is based on the similarity of nodes. We obtain the 1-index this way. Its size can be decreased if we only require similarity locally. The A()-index obtained this way lacks exactness, therefore using the algorithm *
Index-Evaluation
* we can get results that do not belong to the result of the regular query , so we have to test our results to ensure exactness.

**Definition 20.14 **
*Let be an equivalence relation on set such that, for ,*

*i) ,*

*ii) if there is an edge from node to node , then there exists a node for which there is an edge from node to node and .*

*iii) if there is an edge from node to node , then there exists a node for which there is an edge from node to node and .*

*The above equivalence relation is called a bisimulation. Nodes and of a graph are bisimilar if and only if there exists a bisimulation such that .*

**Definition 20.15 **
*Let be the partition consisting of the equivalence classes of a bisimulation. The index defined using partition is called 1-index.*

**Claim 20.16 **
*The 1-index is a refinement of the naive index. If the labels of the ingoing edges of the nodes in graph are different, that is, for and , then if and only if and are bisimilar. *

**Proof. **
if . Let node fit to the label sequence , and let be the node corresponding to label . Then there exists a such that and . fits to the label sequence , so, by induction, also fits to the label sequence , therefore fits to the label sequence . So, if two nodes are in the same class according to the 1-index, then they are in the same class according to the naive index as well.

To prove the second statement of the proposition, it is enough to show that the naive index corresponds to a bisimulation. Let and be in the same class according to the naive index. Then . If , then there exists a label sequence such that the last two nodes corresponding to the labels are and . Since we assumed that the labels of the parents are different, , where and are disjoint, and fits to the sequence , and , while . Since , there exists a such that and . fits to the sequence , and because of the different labels of the parents, so , and by induction, therefore .

**Claim 20.17 **
*The 1-index is safe and exact.*

**Proof. **If fits to the label sequence in graph because of nodes , then, by the definition of the index graph, there exists an edge from to , , that is, fits to the label sequence in graph . To prove exactness, assume that fits to the label sequence in graph because of . Then there are , such that and , that is, . We can see by induction that fits to the label sequence because of nodes , but then fits to the label sequence because of nodes in graph .

If we consider the bisimulation in case of which all nodes are assigned to different partitions, then the graph corresponding to this 1-index is the same as graph . Therefore the size of is at most the size of , and we also have to store the elements of for the nodes of , which means we have to store all nodes of . For faster evaluation of queries we need to find the smallest 1-index, that is, the coarsest 1-index. It can be checked that and are in the same class according to the coarsest 1-index if and only if and are bisimilar.

*
1-Index-Evaluation(
)
*

1 let be the coarsest 1-index of 2`RETURN`

`Index-Evaluation(`

`)`

In the first step of the algorithm, the coarsest 1-index has to be given. This can be reduced to finding the coarsest stable partition, what we will discuss in the next section of this chapter. Thus using the efficient version of the *
PT
*-algorithm, the coarsest 1-index can be found with computation cost and space requirement , where and denote the number of nodes and edges of graph , respectively.

Since graph is safe and exact, it is sufficient to evaluate the query in graph , that is, to find the index nodes that fit to the regular expression . Using Proposition 20.6, the cost of this is a polynomial of the size of graph .

The size of can be estimated using the following parameters. Let be the number of different labels in graph , and **
the diameter of graph
**, that is, the length of the longest directed path. (No node can appear twice in the directed path.) If the graph is a tree, then the diameter is the depth of the tree. We often create websites that form a tree of depth , then we add a navigation bar consisting of elements to each page, that is, we connect each node of the graph to chosen pages. It can be proved that in this case the diameter of the graph is at most . In practice, and are usually very small compared to the size of the graph. The proof of the following proposition can be found in the paper of Milo and Suciu.

**Claim 20.18 **
*Let the number of different labels in graph be at most , and let the diameter of be less than . Then the size of the 1-index defined by an arbitrary bisimulation can be bounded from above with a bound that only depends on and but does not depend on the size of .*

**Exercises**

20.3-1 Show that the index corresponding to the maximal simulation is between the 1-index and the naive index with respect to refinement. Give an example that shows that both inclusions are proper.

20.3-2 Denote by the index corresponding to the maximal simulation. Does hold?

20.3-3 Represent graph and the state transition graph of the automaton corresponding to the regular expression with relational databases. Give an algorithm in a relational query language, for example in PL/SQL, that computes .