19.3. Query rewriting

Answering queries using views, in other words, query rewriting using views, has recently attracted much attention because of its applicability to a wide range of data management problems: query optimisation, providing physical data independence, data and information integration, and the design of data warehouses.

The problem is the following. Assume that a query $Q$ is given over schema $\mathbf{R}$, together with views $V_1, V_2, \ldots, V_n$. Can one answer $Q$ using only the results of the views? Or, what is the largest set of tuples of the answer to $Q$ that can be determined knowing the views? If both the views and the relations of the base schema can be accessed, what is the cheapest way to compute the result of $Q$?

19.3.1. Motivation

Before query rewriting algorithms are studied in detail, some motivating examples of applications are given. The following university database is used throughout this section.

The schemata of the individual relations are as follows:
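A plausible reconstruction of the relation schemata, inferred from the attributes used later in this section (the exact attribute lists are assumptions, not the original listing):

    Professor(PName, Area)
    Course(C-number, Title)
    Teaches(PName, C-number, Semester)
    Registered(Student, C-number, Semester)
    Major(Student, Dept)
    Advisor(PName, Student)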

It is supposed that professors, students and departments are uniquely identified by their names. Tuples of relation Registered show which student took which course in which semester, while Major shows which department a student chose as major (for the sake of convenience it is assumed that each department offers exactly one subject as a possible major).

19.3.1.1.  Query optimisation.

If the computation necessary to answer a query has been performed in advance and the results are stored in a materialised view, then the view can be used to speed up query answering.

Consider the query that looks for pairs (Student, Title), where the student registered for the Ph.D.-level course and the course is taught by a professor of the Database area (the C-number of graduate courses is at least 400, and the Ph.D.-level courses are those with C-number at least 500).

Suppose that the following materialised view, Graduate, is available, which contains the registration data of graduate courses.

View Graduate can be used to answer query (19.43).

It will be faster to compute (19.45) than (19.43), because the natural join of relations Registered and Course has already been performed by view Graduate; furthermore, the view has filtered out the undergraduate courses (which make up the bulk of registration data at most universities). It is worth noting that view Graduate could be used even though it does not agree syntactically with any part of query (19.43).
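For illustration, a plausible reconstruction of the query, the view, and the rewriting in datalog notation (the variable names and the exact form of (19.43)–(19.45) are assumptions inferred from the prose above):

$$\begin{aligned} q(S, T) \leftarrow{} & \textit{Registered}(S, C, \textit{Sem}),\ \textit{Course}(C, T),\ \textit{Teaches}(P, C, \textit{Sem}),\\ & \textit{Professor}(P, \text{"databases"}),\ C \geq 500,\\ \textit{Graduate}(S, T, C, \textit{Sem}) \leftarrow{} & \textit{Registered}(S, C, \textit{Sem}),\ \textit{Course}(C, T),\ C \geq 400,\\ q'(S, T) \leftarrow{} & \textit{Graduate}(S, T, C, \textit{Sem}),\ \textit{Teaches}(P, C, \textit{Sem}),\\ & \textit{Professor}(P, \text{"databases"}),\ C \geq 500. \end{aligned}$$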

On the other hand, it may happen that the original query can be answered faster. If relations Registered and Course have an index on attribute C-number, but there exists no index built for Graduate, then it could be faster to answer query (19.43) directly from the database relations. Thus, the real challenge is not only to decide whether a materialised view could logically be used to answer some query, but also to carry out a thorough cost analysis determining when it is worth using the existing views.

19.3.1.2.  Physical data independence.

One of the principles underlying modern database systems is the separation between the logical view of data and the physical view of data. With the exception of horizontal or vertical partitioning of relations into multiple files, relational database systems are still largely based on a one-to-one correspondence between relations in the schema and the files in which they are stored. In object-oriented systems, maintaining the separation is necessary because the logical schema contains significant redundancy, and does not correspond to a good physical layout. Maintaining physical data independence becomes more crucial in applications where the logical model is introduced as an intermediate level after the physical representation has already been determined. This is common in the storage of XML data in relational databases, and in data integration. In fact, the STORED system stores XML documents in a relational database, and uses views to describe the mapping from XML into relations in the database.

To maintain physical data independence, a widely accepted method is to use views over the logical schema as mechanism for describing the physical storage of data. In particular, Tsatalos, Solomon and Ioannidis use GMAPs (Generalised Multi-level Access Paths) to describe data storage.

A GMAP describes the physical organisation and the indexes of the storage structure. The first clause of the GMAP (as) describes the actual data structure used to store a set of tuples (e.g., a B+-tree, hash index, etc.). The remaining clauses describe the content of the structure, much like a view definition. The given and select clauses describe the available attributes, where the given clause describes the attributes on which the structure is indexed. The definition of the view, given in the where clause, uses infix notation over the conceptual model.

In the example shown in Figure 19.3 the GMAP G1 stores a set of pairs containing students and the departments in which they major, and these pairs are indexed by a B+-tree on attribute Student.name. The GMAP G2 stores an index from names of students to the numbers of courses in which they are registered. The GMAP G3 stores an index from course numbers to departments whose majors are enrolled in the course.
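Based on these descriptions, the three GMAP definitions may have looked roughly as follows; the def_gmap syntax mirrors that of Tsatalos et al., and the concrete attribute names are assumptions:

    def_gmap G1 as b+-tree by
        given Student.name
        select Department
        where Student major Department

    def_gmap G2 as b+-tree by
        given Student.name
        select Course.c-number
        where Student registered Course

    def_gmap G3 as b+-tree by
        given Course.c-number
        select Department
        where Student registered Course and Student major Department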

Figure 19.3. GMAPs for the university domain.

Given that the data is stored in structures described by the GMAPs, the question that arises is how to use these structures to answer queries. Since the logical content of the GMAPs is described by views, answering a query amounts to finding a way of rewriting the query using these views. If there are multiple ways of answering the query using the views, we would like to find the cheapest one. Note that in contrast to the query optimisation context, we must use the views to answer the given query, because all the data is stored in the GMAPs.

Consider the following query, which asks for names of students registered for Ph.D.-level courses and the departments in which these students are majoring.

The query can be answered in two ways. First, since Student.name uniquely identifies a student, we can take the join of G1 and G2, then apply the selection on the course number (C-number $\geq$ 500), and finally a projection eliminates the unnecessary attributes. The other execution plan is to join G3 with G2 and apply the same selection. In fact, this second solution may even be more efficient because G3 has an index on the course number and therefore the intermediate joins may be much smaller.

19.3.1.3.  Data integration.

A data integration system (also known as mediator system) provides a uniform query interface to a multitude of autonomous heterogeneous data sources. Prime examples of data integration applications include enterprise integration, querying multiple sources on the World-Wide Web, and integration of data from distributed scientific experiments.

To provide a uniform interface, a data integration system exposes to the user a mediated schema. A mediated schema is a set of virtual relations, in the sense that they are not stored anywhere. The mediated schema is designed manually for a particular data integration application. To be able to answer queries, the system must also contain a set of source descriptions. A description of a data source specifies the contents of the source, the attributes that can be found in the source, and the (integrity) constraints on the content of the source. A widely adopted approach for specifying source descriptions is to describe the contents of a data source as a view over the mediated schema. This approach facilitates the addition of new data sources and the specification of constraints on the contents of sources.

In order to answer a query, a data integration system needs to translate a query formulated on the mediated schema into one that refers directly to the schemata of the data sources. Since the contents of the data sources are described as views, the translation problem amounts to finding a way to answer a query using a set of views.

Consider as an example the case where the mediated schema exposed to the user is schema University, except that the relations Teaches and Course have an additional attribute identifying the university at which the course is being taught:

Suppose we have the following two data sources. The first source provides a listing of all the courses entitled "Database Systems" taught anywhere and their instructors. This source can be described by the following view definition:

The second source lists Ph.D.-level courses being taught at The Ohio State University (OSU), and is described by the following view definition:

If we were to ask the data integration system who teaches courses titled "Database Systems" at OSU, it would be able to answer the query by applying a selection on the source DB-courses:

On the other hand, suppose we ask for all the graduate-level courses (not just in databases) being offered at OSU. Given that only these two sources are available, the data integration system cannot find all tuples in the answer to the query. Instead, the system can attempt to find the maximal set of tuples in the answer available from the sources. In particular, the system can obtain graduate database courses at OSU from the DB-course source, and the Ph.D.-level courses at OSU from the OSUPhD source. Hence, the following non-recursive datalog program gives the maximal set of answers that can be obtained from the two sources:

Note that courses that are neither Ph.D.-level courses nor database courses will not be returned as answers. Whereas in the contexts of query optimisation and maintaining physical data independence the focus is on finding a query expression that is equivalent with the original query, here the goal is to find a query expression that provides the maximal set of answers obtainable from the views.

19.3.1.4.  Semantic data caching.

If the database is accessed via client-server architecture, the data cached at the client can be semantically modelled as results of certain queries, rather than at the physical level as a set of data pages or tuples. Hence, deciding which data needs to be shipped from the server in order to answer a given query requires an analysis of which parts of the query can be answered by the cached views.

19.3.2. Complexity problems of query rewriting

In this section the theoretical complexity of query rewriting is studied; mostly conjunctive queries are considered. Minimal and complete rewritings will be distinguished. It will be shown that if the query is conjunctive, and the materialised views are also given as results of conjunctive queries, then the rewriting problem is NP-complete, assuming that neither the query nor the view definitions contain comparison atoms. Conjunctive queries will be considered in rule-based form for the sake of convenience.

Assume that query $Q$ is given over schema $\mathbf{R}$.

Definition 19.22 The conjunctive query $Q'$ is a rewriting of query $Q$ using views $\mathcal{V} = \{V_1, V_2, \ldots, V_n\}$, if

  • $Q$ and $Q'$ are equivalent, and

  • $Q'$ contains one or more literals from $\mathcal{V}$.

$Q'$ is said to be locally minimal if no literal can be removed from $Q'$ without violating the equivalence. The rewriting $Q'$ is globally minimal if there exists no rewriting using a smaller number of literals. (The comparison atoms are not counted in the number of literals.)

Example 19.8 Query rewriting. Consider the following query $Q$ and view $V$.

$Q$ can be rewritten using $V$:

View $V$ replaces the first two literals of query $Q$. Note that the view certainly satisfies the third literal of the query, as well. However, that literal cannot be removed from the rewriting, since the variable joining it to the rest does not occur in the head of $V$; if the literal were removed, too, then the natural join it expresses would not be enforced anymore.

In some of the applications, for example in the case of data integration or data warehouses, the database relations are inaccessible and only the views can be accessed; therefore the concept of complete rewriting is introduced.

Definition 19.23 A rewriting $Q'$ of query $Q$ using views $\mathcal{V}$ is called a complete rewriting, if $Q'$ contains only literals of $\mathcal{V}$ and comparison atoms.

Example 19.9 Complete rewriting. Assume that besides the view of Example 19.8 the following view is given, as well:

A complete rewriting of query $Q$ is:

It is important to see that this rewriting cannot be obtained step-by-step, first using only one of the views and then trying to incorporate the other (or in the opposite order), since a relation needed from one view does not occur in the other. Thus, in order to find the complete rewriting, the use of the two views must be considered in parallel, at the same time.

There is a close connection between finding a rewriting and the problem of query containment. The latter was discussed for tableau queries in Section 19.1.3. Homomorphism between tableau queries can be defined for rule-based queries, as well. The only difference is that it is not required in this section that a homomorphism map the head of one rule to the head of the other. (The head of a rule corresponds to the summary row of a tableau.) According to Theorem 19.20 it is NP-complete to decide whether conjunctive query $Q_1$ contains another conjunctive query $Q_2$. This remains true in the case when $Q_2$ may contain comparison atoms, as well. However, if both $Q_1$ and $Q_2$ may contain comparison atoms, then the existence of a homomorphism from $Q_1$ to $Q_2$ is only a sufficient but not necessary condition for the containment of the queries, which is a $\Pi_2^p$-complete problem in that case. The discussion of this latter complexity class is beyond the scope of this chapter, thus it is omitted. The next proposition gives a necessary and sufficient condition for the existence of a rewriting of query $Q$ using view $V$.
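Because containment testing recurs throughout this section, a brute-force homomorphism test is sketched below in Python (the query representation and all function names are ours, not from the original text; constants are not modelled, and the exponential search is unavoidable in the worst case, the problem being NP-complete):

    from itertools import product

    def contains(q1, q2):
        """Test whether conjunctive query q1 contains q2 (no comparison
        atoms): by the Homomorphism Theorem this holds iff there is a
        homomorphism from q1 to q2 that maps the head onto the head.
        A query is (head, body): head is a tuple of variables, body is
        a list of (relation, args) atoms."""
        (head1, body1), (head2, body2) = q1, q2
        if len(head1) != len(head2):
            return False
        # Choose a target atom of q2 for every atom of q1.
        for targets in product(body2, repeat=len(body1)):
            mapping = {}

            def bind(a, b):
                return mapping.setdefault(a, b) == b  # must stay a function

            if not all(bind(a, b) for a, b in zip(head1, head2)):
                continue
            if all(rel1 == rel2 and len(args1) == len(args2)
                   and all(bind(a, b) for a, b in zip(args1, args2))
                   for (rel1, args1), (rel2, args2) in zip(body1, targets)):
                return True
        return False

    # q1(x) <- p(x, y) contains q2(x) <- p(x, x): map y to x.
    q1 = (("x",), [("p", ("x", "y"))])
    q2 = (("x",), [("p", ("x", "x"))])
    print(contains(q1, q2))  # True
    print(contains(q2, q1))  # False: p(x, x) cannot be mapped into q1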

Claim 19.24 Let $Q$ and $V$ be conjunctive queries that may contain comparison atoms. There exists a rewriting of query $Q$ using view $V$ if and only if $\pi_\emptyset(Q) \subseteq \pi_\emptyset(V)$, that is, the projection of $V$ onto the empty attribute set contains that of $Q$.

Proof. Observe that $\pi_\emptyset(Q) \subseteq \pi_\emptyset(V)$ is equivalent with the following statement: if the output of $V$ is empty for some instance, then the output of $Q$ is empty over that instance, as well.

Assume first that there exists a rewriting, that is, a rule equivalent with $Q$ that contains $V$ in its body. If $r$ is an instance such that the result of $V$ over it is empty, then every rule that includes $V$ in its body gives an empty result over $r$, too; in particular, the output of $Q$ is empty.

In order to prove the other direction, assume that if the output of $V$ is empty for some instance, then the output of $Q$ is empty, as well. Let the rule defining $Q$ be

Let $\vec{y}$ be a list of new variables disjoint from the variables of $Q$. Then the query $Q'$, defined by adding the literal $V(\vec{y})$ to the body of $Q$,

satisfies $Q \equiv Q'$. It is clear that $Q' \subseteq Q$. On the other hand, if there exists a valuation of the variables of $Q$ that satisfies the body of $Q$ over some instance $r$, then the output of $Q$ over $r$ is non-empty; hence, by the assumption, the output of $V$ over $r$ is non-empty, too, so there exists a valuation of $\vec{y}$ satisfying the body of $V$. The two valuations together produce in the output of $Q'$ the same tuple that was obtained in the output of $Q$.

As a corollary of Theorem 19.20 and Proposition 19.24 the following theorem is obtained.

Theorem 19.25 Let $Q$ be a conjunctive query that may contain comparison atoms, and let $\mathcal{V}$ be a set of views. If the views in $\mathcal{V}$ are given by conjunctive queries that do not contain comparison atoms, then it is NP-complete to decide whether there exists a rewriting of $Q$ using $\mathcal{V}$.

The proof of Theorem 19.25 is left for the Reader (Exercise 19.3-1).

The proof of Proposition 19.24 uses new variables. However, according to the next lemma, this is not necessary. Another important observation is that when a locally minimal rewriting is sought, it is enough to consider a subset of the database relations occurring in the original query; new database relations need not be introduced.

Lemma 19.26 Let $Q$ be a conjunctive query that does not contain comparison atoms,

furthermore let $\mathcal{V}$ be a set of views that do not contain comparison atoms either.

  1. If $Q'$ is a locally minimal rewriting of $Q$ using $\mathcal{V}$, then the set of database literals in $Q'$ is isomorphic to a subset of the database literals occurring in $Q$.

  2. If

    is a rewriting of $Q$ using the views, then there exists a rewriting

    such that its variables all occur among the variables of $Q$, that is, the rewriting does not introduce new variables.

The details of the proof of Lemma 19.26 are left for the Reader (Exercise 19.3-2). The next lemma is of fundamental importance: a minimal rewriting of $Q$ using $\mathcal{V}$ cannot increase the number of literals.

Lemma 19.27 Let $Q$ be a conjunctive query and $\mathcal{V}$ be a set of views given by conjunctive queries, both without comparison atoms. If the body of $Q$ contains $p$ literals and $Q'$ is a locally minimal rewriting of $Q$ using $\mathcal{V}$, then $Q'$ contains at most $p$ literals.

Proof. Replacing the view literals of $Q'$ by their definitions, query $Q''$ is obtained. Let $\varphi$ be a homomorphism from the body of $Q$ to $Q''$; the existence of $\varphi$ follows from $Q \equiv Q''$ by the Homomorphism Theorem (Theorem 19.18). Each of the $p$ literals of the body of $Q$ is mapped to at most one literal obtained from the expansion of the view definitions. If $Q'$ contains more than $p$ view literals, then the expansion of some view literal in the body of $Q''$ is disjoint from the image of $\varphi$. These view literals can be removed from the body of $Q'$ without violating the equivalence.

Based on Lemma 19.27, the following theorem can be stated about the complexity of minimal rewritings.

Theorem 19.28 Let $Q$ be a conjunctive query and $\mathcal{V}$ be a set of views given by conjunctive queries, both without comparison atoms. Let the body of $Q$ contain $p$ literals.

  1. It is NP-complete to decide whether there exists a rewriting of $Q$ using $\mathcal{V}$ that uses at most $k$ literals (for a given $k$).

  2. It is NP-complete to decide whether there exists a rewriting of $Q$ using $\mathcal{V}$ that uses at most $k$ database literals.

  3. It is NP-complete to decide whether there exists a complete rewriting of $Q$ using $\mathcal{V}$.

Proof. The first statement is proved; the proof of the other two is similar. According to Lemmas 19.27 and 19.26, only such rewritings need to be considered that have at most as many literals as the query itself, contain a subset of the literals of the query, and do not introduce new variables. Such a rewriting and the homomorphisms proving the equivalence can be verified in polynomial time, hence the problem is in NP. In order to prove that it is NP-hard, Theorem 19.25 is used. For a given query $Q$ and view $V$, let $V'$ be the view whose head is the same as the head of $V$, but whose body is the conjunction of the bodies of $Q$ and $V$. It is easy to see that there exists a rewriting using $V'$ with a single literal if and only if there exists a rewriting (with no restriction) using $V$.

19.3.3. Practical algorithms

In this section only complete rewritings are studied. This is not a real restriction, since if database relations are also to be used, then views mirroring the database relations one-to-one can be introduced. The concept of equivalent rewriting introduced in Definition 19.22 is appropriate if the goal of the rewriting is query optimisation or providing physical data independence. However, in the context of data integration or data warehouses, equivalent rewritings cannot always be sought, since not all the necessary data may be available. Thus, the concept of maximally contained rewriting is introduced, which, in contrast to equivalent rewritings, depends on the query language used.

Definition 19.29 Let $Q$ be a query, $\mathcal{V}$ be a set of views, and $\mathcal{L}$ be a query language. $Q'$ is a maximally contained rewriting of $Q$ with respect to $\mathcal{L}$, if

  1. $Q'$ is a query of language $\mathcal{L}$ using only views from $\mathcal{V}$,

  2. $Q$ contains $Q'$,

  3. if a query $Q_1$ of language $\mathcal{L}$ satisfies $Q' \subseteq Q_1 \subseteq Q$, then $Q' \equiv Q_1$.

19.3.3.1.  Query optimisation using materialised views.

Before discussing how a traditional optimiser can be modified in order to use materialised views instead of database relations, it is necessary to survey when a view can be used to answer a given query. Essentially, a view $V$ can be used to answer query $Q$ if the set of database relations in the body of $V$ intersects the set of those in the body of $Q$, and some of the attributes selected by $V$ are also selected by $Q$. Besides this, in the case of equivalent rewriting, if $V$ contains comparison atoms for attributes that also occur in $Q$, then the view must apply a logically equivalent or weaker condition than the query; if a logically stronger condition is applied in the view, then it can still be part of a (maximally) contained rewriting. This is best shown via an example. Consider the query $Q$ over schema University that lists those (professor, student, semester) triplets where the professor is the advisor of the student and, in the given semester, the student registered for some course taught by the professor.

View $V_1$ below can be used to answer $Q$, since it uses the same join condition for relations Registered and Teaches as $Q$ does, as shown by the variables of the same name. Furthermore, $V_1$ selects the attributes Student, PName and Semester, which are necessary in order to properly join with relation Advisor, and for the select clause of the query. Finally, the comparison predicate of $V_1$ is weaker than that of the query.

The following four views illustrate how minor modifications change the usability in answering the query.

View $V_2$ is similar to $V_1$, except that it does not select the attribute PName from relation Teaches, which is needed for the join with relation Advisor and for the selection of the query. Hence, to use $V_2$ in the rewriting, it has to be joined with relation Teaches again. Still, if the join of relations Registered and Teaches is very selective, then employing $V_2$ may actually result in a more efficient query execution plan.

In view $V_3$ the join of relations Registered and Teaches is over attribute C-number only; the equality of the two Semester variables is not required. Since the semester attribute is not selected by $V_3$, the join predicate cannot be applied in the rewriting, and therefore there is little gain in using $V_3$.

View $V_4$ considers only the professors who have at least one area of research. Hence, the view applies an additional condition that does not exist in the query, and it cannot be used in an equivalent rewriting unless union and negation are allowed in the rewriting language. However, if there is an integrity constraint stating that every professor has at least one area of research, then an optimiser should be able to realise that $V_4$ is usable.

Finally, view $V_5$ applies a stronger comparison predicate than the query, and is therefore usable for a contained rewriting, but not for an equivalent rewriting of the query.

System-R style optimisation

Before discussing the changes to traditional optimisation, the principles underlying System-R style optimisers are first recalled briefly. System-R takes a bottom-up approach to building query execution plans. In the first phase, it concentrates on plans of size 1, i.e., it chooses the best access path to every table mentioned in the query. In phase $k$, the algorithm considers plans of size $k$, obtained by combining plans from the previous phases (of sizes $j$ and $k - j$). The algorithm terminates after constructing plans that cover all the relations in the query. The efficiency of System-R stems from the fact that it partitions query execution plans into equivalence classes, and only considers a single execution plan for every equivalence class; a minimal sketch of this bottom-up enumeration follows the list below. Two plans are in the same equivalence class if they

  • cover the same set of relations in the query (and therefore are also of the same size), and

  • produce the answer in the same interesting order.
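The following Python sketch makes the bottom-up strategy concrete; the cost model, the relation names, and the omission of interesting orders, join-predicate checks and access-path selection are all simplifying assumptions of ours:

    from itertools import combinations

    def system_r(scan_costs, join_cost):
        """Bottom-up System-R style enumeration. scan_costs: relation ->
        cost of the best size-1 plan. join_cost(p1, p2): extra cost of
        joining two partial plans. Plans are grouped into equivalence
        classes keyed by the frozenset of relations they cover."""
        best = {frozenset([r]): (c, r) for r, c in scan_costs.items()}
        names = list(scan_costs)
        for size in range(2, len(names) + 1):        # phase k: plans of size k
            for subset in map(frozenset, combinations(names, size)):
                for k in range(1, size):             # combine sizes k and size-k
                    for left in map(frozenset, combinations(sorted(subset), k)):
                        right = subset - left
                        if left in best and right in best:
                            lp, rp = best[left], best[right]
                            cost = lp[0] + rp[0] + join_cost(lp, rp)
                            if subset not in best or cost < best[subset][0]:
                                best[subset] = (cost, f"({lp[1]} JOIN {rp[1]})")
        return best[frozenset(names)]

    # Toy cost model: a join costs the product of the operand costs / 100.
    print(system_r({"Registered": 30, "Course": 10, "Teaches": 20},
                   lambda p1, p2: p1[0] * p2[0] / 100))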

In our context, the query optimiser builds query execution plans by accessing a set of views, rather than a set of database relations. Hence, in addition to the meta-data that the query optimiser has about the materialised views (e.g., statistics, indexes), the optimiser is also given as input the query expressions defining the views. The additional issues that the optimiser needs to consider in the presence of materialised views are as follows.

A. In the first iteration the algorithm needs to decide which views are relevant to the query according to the conditions illustrated above. The corresponding step is trivial in a traditional optimiser: a relation is relevant to the query if it is in the body of the query expression.

B. Since the query execution plans involve joins over views, rather than joins over database relations, plans can no longer be neatly partitioned into equivalence classes which can be explored in increasing size. This observation implies several changes to the traditional algorithm:

1. Termination testing: the algorithm needs to distinguish partial query execution plans of the query from complete query execution plans. The enumeration of the possible join orders terminates when there are no more unexplored partial plans. In contrast, in the traditional setting the algorithm terminates after considering the equivalence classes that include all the relations of the query.

2. Pruning of plans: in the traditional setting, only plans covering the same set of query relations are compared. In contrast, here a partial plan $p$ can be pruned if there exists another partial plan $p'$ such that

(a) $p'$ is cheaper than $p$, and

(b) $p'$ has greater or equal contribution to the query than $p$. Informally, a plan contributes more to the query than another plan if it covers more of the relations in the query and selects more of the necessary attributes.

3. Combining partial plans: in the traditional setting, when two partial plans are combined, the join predicates that involve both plans are explicit in the query, and the enumeration algorithm need only consider the most efficient way to apply these predicates. However, in our case, it may not be obvious a priori which join predicate will yield a correct rewriting of the query, since views are joined rather than database relations directly. Hence, the enumeration algorithm needs to consider several alternative join predicates. Fortunately, in practice, the number of join predicates that need to be considered can be significantly pruned using meta-data about the schema. For example, there is no point in trying to join a string attribute with a numeric one. Furthermore, in some cases knowledge of integrity constraints and the structure of the query can be used to reduce the number of join predicates to be considered. Finally, after considering all the possible join predicates, the optimiser also needs to check whether the resulting plan is still a partial solution to the query.

The following table summarises the comparison of the traditional optimiser versus one that exploits materialised views.

Another method of equivalent rewriting is using transformation rules. The common theme in the works of that area is that replacing some part of the query with a view is considered as another transformation available to the optimiser. These methods are not discussed in detail here.

The query optimisers discussed above were designed to handle cases where the number of views is relatively small (i.e., comparable to the size of the database schema), and cases where an equivalent rewriting is required. In contrast, the context of data integration requires the consideration of a large number of views, since each data source is described by one or more views. In addition, the view definitions may contain many complex predicates, whose goal is to express fine-grained distinctions between the contents of different data sources. Furthermore, in the context of data integration it is often assumed that the views are not complete, i.e., they may contain only a subset of the tuples satisfying their definition. In the following, some algorithms for answering queries using views are described that were developed specifically for the context of data integration.

19.3.3.2.  The Bucket Algorithm.

The goal of the Bucket Algorithm is to reformulate a user query that is posed on a mediated (virtual) schema into a query that refers directly to the available sources. Both the query and the sources are described by conjunctive queries that may include atoms of arithmetic comparison predicates. The set of comparison atoms of query $Q$ is denoted by $C(Q)$.

Since the number of possible rewritings may be exponential in the size of the query, the main idea underlying the Bucket Algorithm is that the number of query rewritings that need to be considered can be drastically reduced if first each subgoal – each relational atom of the query – is considered in isolation, and it is determined which views may be relevant to that subgoal.

The algorithm proceeds as follows. First, a bucket is created for each subgoal of the query that is not in $C(Q)$, containing the views that are relevant to answering that particular subgoal. In the second step, all those conjunctive query rewritings are considered that include one conjunct (view) from each bucket. For each rewriting $Q'$ obtained, it is checked whether it is semantically correct, that is, whether $Q' \subseteq Q$ holds, or whether it can be made semantically correct by adding comparison atoms. Finally, the remaining plans are minimised by pruning redundant subgoals. Algorithm Create-Bucket executes the first step described above. Its input is a set of source descriptions $\mathcal{V}$ and a conjunctive query $Q$ of the form $Q(\vec{x}) \leftarrow r_1(\vec{x}_1), \ldots, r_n(\vec{x}_n), C(Q)$.

Create-Bucket($Q$, $\mathcal{V}$)

     1  FOR $i \leftarrow 1$ TO $n$
     2    DO $Bucket_i \leftarrow \emptyset$
     3       FOR all $V \in \mathcal{V}$
     4          ▷ $V$ is of the form $V(\vec{u}) \leftarrow v_1(\vec{u}_1), \ldots, v_m(\vec{u}_m), C(V)$.
     5          DO FOR $j \leftarrow 1$ TO $m$
     6             DO IF $r_i = v_j$
     7                THEN let $\psi$ be a mapping defined on the variables of $Q$ as follows:
     8                     IF $x$ is the $k$-th variable of $r_i$
     9                        THEN $\psi(x) = y$, where $y$ is the $k$-th variable of $v_j$
    10                        ELSE $\psi(x)$ is a new variable that does not appear in $Q$ or $V$
    11                     ▷ $\psi$ maps the subgoal $r_i$ onto $v_j$ and renames the rest of $Q$.
    12                     IF Satisfiable$'(\psi(C(Q)) \wedge C(V))$
    13                        THEN add $V$ to $Bucket_i$
    14  RETURN $Bucket_1, \ldots, Bucket_n$

Procedure Satisfiable$'$ is the extension of procedure Satisfiable, described in Section 19.1.2, to the case when comparison atoms may occur besides equality atoms. The only necessary change is that for every variable occurring in comparison atoms it must be checked whether all the predicates involving it are satisfiable simultaneously.

The running time of Create-Bucket is a polynomial function of the sizes of $Q$ and $\mathcal{V}$. Indeed, the kernel of the nested loops of lines 3 and 5 runs at most $|\mathcal{V}| \cdot m$ times for each subgoal. The commands of lines 6–13 require constant time, except for line 12; the condition of the IF command in line 12 can be checked in polynomial time.

In order to prove the correctness of procedure Create-Bucket, one should check under what condition a view $V$ is put into $Bucket_i$. In line 6 it is checked whether relation $r_i$ appears as a subgoal in $V$. If not, then obviously $V$ cannot give usable information for subgoal $r_i$ of $Q$. If $r_i$ is a subgoal of $V$, then in lines 9–10 a mapping $\psi$ is constructed that, applied to the variables, establishes the correspondence between subgoals $r_i$ and $v_j$, in accordance with the relations occurring in the heads of $Q$ and $V$, respectively. Finally, in line 12 it is checked whether the comparison atoms contradict the correspondence constructed.
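A compact Python rendering of the first phase may be helpful. It is a sketch under simplifying assumptions of ours: comparison atoms and the Satisfiable$'$ test of line 12 are omitted, only the usual head-variable condition of the Bucket Algorithm is checked, and the data representation is our own.

    def create_bucket(query_head, query_body, views):
        """views: name -> (head_vars, body), body a list of (relation, args)
        subgoals. Returns one bucket of relevant view names per query
        subgoal; without the Satisfiable' test the buckets may be a
        superset of the truly relevant views."""
        buckets = []
        for rel_q, args_q in query_body:
            bucket = set()
            for name, (head_v, body_v) in views.items():
                for rel_v, args_v in body_v:
                    if rel_v != rel_q or len(args_v) != len(args_q):
                        continue
                    # psi maps the variables of the query subgoal onto the
                    # corresponding variables of the view subgoal (lines 7-10).
                    psi = dict(zip(args_q, args_v))
                    # A distinguished (head) variable of the query must map
                    # to a distinguished variable of the view, otherwise the
                    # view cannot expose the attribute to the rewriting.
                    if all(psi[x] in head_v for x in args_q if x in query_head):
                        bucket.add(name)
            buckets.append(bucket)
        return buckets

    # Q(x) <- cites(x, y), cites(y, x), sameArea(x, y)  (cf. Example 19.10 below)
    buckets = create_bucket(
        ("x",),
        [("cites", ("x", "y")), ("cites", ("y", "x")), ("sameArea", ("x", "y"))],
        {"V1": (("a",), [("cites", ("a", "b")), ("cites", ("b", "a"))]),
         "V2": (("c", "d"), [("sameArea", ("c", "d"))])})
    print(buckets)  # [{'V1'}, {'V1'}, {'V2'}]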

In the second step, having constructed the buckets using Create-Bucket, the Bucket Algorithm finds a set of conjunctive query rewritings, each of them being a conjunctive query that includes one conjunct from every bucket. Each of these conjunctive query rewritings represents one way of obtaining part of the answer to $Q$ from the views. The result of the Bucket Algorithm is defined to be the union of the conjunctive query rewritings (since each of the rewritings may contribute different tuples). A given conjunctive query $Q'$ is a conjunctive query rewriting, if

  1. $Q' \subseteq Q$, or

  2. $Q'$ can be extended with comparison atoms so that the previous property holds.

Example 19.10 Bucket algorithm. Consider the following query $Q$ that lists those articles $x$ for which there exists an article $y$ of the same area such that $x$ and $y$ mutually cite each other. There are three views (sources) available: $V_1$, $V_2$ and $V_3$.

In the first step, applying Create-Bucket , the following buckets are constructed.

In the second step the algorithm constructs a conjunctive query $Q'$ from each element of the Cartesian product of the buckets, and checks whether $Q'$ is contained in $Q$. If yes, it is added to the answer.

In our case, the algorithm first tries to match $V_1$ with the other views; however, no correct rewriting is obtained this way. The reason is that $y$ does not appear in the head of $V_1$, hence the join condition of $Q$ – variables $x$ and $y$ occur in relation sameArea, as well – cannot be applied. Then rewritings containing $V_2$ are considered, and it is recognised that by equating the variables in the head of $V_2$ a contained rewriting is obtained. Finally, the algorithm finds that by combining $V_2$ and $V_3$ a rewriting is obtained, as well. This latter is redundant, as simple checking shows, so it can be pruned. Thus, the result of the Bucket Algorithm for query (19.68) is the following (actually equivalent) rewriting:

The strength of the bucket algorithm is that it exploits the predicates in the query to prune significantly the number of candidate conjunctive rewritings that need to be considered. Checking whether a view should belong to a bucket can be done in time polynomial in the size of the query and view definition when the predicates involved are arithmetic comparisons. Hence, if the data sources (i.e., the views) are indeed distinguished by having different comparison predicates, then the resulting buckets will be relatively small.

The main disadvantage of the bucket algorithm is that the Cartesian product of the buckets may still be large. Furthermore, the second step of the algorithm needs to perform a query containment test for every candidate rewriting, which is NP-complete even when no comparison predicates are involved.

19.3.3.3.  Inverse-rules algorithm.

The Inverse-rules Algorithm is a procedure that can be applied more generally than the Bucket Algorithm: it finds a maximally contained rewriting, in polynomial time, for any query given by an arbitrary recursive datalog program that does not contain negation.

The first question is whether, for a given datalog program $P$ and set of conjunctive views $\mathcal{V}$, there exists a datalog program equivalent with $P$ whose edb relations are the relations of $\mathcal{V}$. Unfortunately, this is algorithmically undecidable. Surprisingly, the best, maximally contained rewriting can nevertheless be constructed. In the case when there exists a datalog program equivalent with $P$, the algorithm in fact finds one, since a maximally contained rewriting contains every equivalent rewriting, as well. This seemingly contradicts the fact that the existence of an equivalent rewriting is algorithmically undecidable; however, it is undecidable about the result of the Inverse-rules Algorithm whether it is really equivalent to the original query.

Example 19.11 Equivalent rewriting. Consider the following datalog program $P$, where the edb relations edge and black contain the edges and the vertices coloured black of a graph $G$.

It is easy to check that $P$ lists the endpoints of those paths (more precisely, walks) of graph $G$ whose inner vertices are all black. Assume that only the following two views can be accessed.

$V_1$ stores the edges whose tail is black, while $V_2$ stores those whose head is black. There exists an equivalent rewriting of datalog program $P$ that uses only views $V_1$ and $V_2$ as edb relations:

However, if only $V_1$ or only $V_2$ is accessible alone, then an equivalent rewriting is not possible, since only such paths are obtainable whose starting or, respectively, ending vertex is black.

In order to describe the Inverse-rules Algorithm, it is necessary to introduce the concept of a Horn rule, which is a generalisation of the datalog rule. If function symbols are also allowed in the free tuple of rule (19.27) in Definition 19.11, besides variables and constants, then a Horn rule is obtained. A logic program is a collection of Horn rules; in this sense, a logic program without function symbols is a datalog program. The notions defined for datalog programs (output, fixpoint, and so on) can be defined for logic programs in the same way.

The Inverse-rules Algorithm consists of two steps. First, a logic program is constructed that may contain function symbols. However, these will not occur in recursive rules, thus in the second step they can be eliminated and the logic program can be transformed into a datalog program.

Definition 19.30 The inverse of a view $V$ given by

is the following collection of Horn rules. A rule corresponds to every subgoal $v_i(\vec{u}_i)$; its body is the single literal $V(\vec{u})$. The head of the rule is $v_i(\vec{u}_i')$, where $\vec{u}_i'$ is obtained from $\vec{u}_i$ by preserving the variables appearing in the head of rule (19.74), while a function symbol is written in place of every variable not appearing in the head. Distinct function symbols correspond to distinct variables. The inverse $\mathcal{V}^{-1}$ of a set of views $\mathcal{V}$ is the set of the inverses of all the view rules, where distinct function symbols occur in the inverses of distinct rules.

The idea of the definition of inverses is that if a tuple appears in a view for some constants, then there is a valuation of every variable of the body not appearing in the head that makes the body of the rule true. This "unknown" valuation is denoted by the function symbols applied to the given constants.

Example 19.12 Inverse of views. Let $\mathcal{V}$ be the following collection of views.

Then $\mathcal{V}^{-1}$ consists of the following rules.
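The construction of Definition 19.30 is mechanical, as the following Python sketch shows (the representation of rules and of Skolem terms is our own):

    import itertools

    _counter = itertools.count()

    def invert_view(view_name, head_vars, body):
        """Produce the inverse rules of one view definition.
        body: list of (relation, args) subgoals. Every variable not in
        the head is replaced by a fresh Skolem term f_i(head_vars)."""
        skolem = {}
        def term(x):
            if x in head_vars:
                return x
            if x not in skolem:  # one fresh function symbol per variable
                skolem[x] = ("f%d" % next(_counter), tuple(head_vars))
            return skolem[x]
        rules = []
        for rel, args in body:
            # Head of the inverse rule: the subgoal with Skolemised
            # arguments; body: the single view literal.
            rules.append(((rel, tuple(term(a) for a in args)),
                          (view_name, tuple(head_vars))))
        return rules

    # V(X, Y) <- edge(X, Z), edge(Z, Y)   (cf. Example 19.13 below)
    for head, body in invert_view("V", ("X", "Y"),
                                  [("edge", ("X", "Z")), ("edge", ("Z", "Y"))]):
        print(head, "<-", body)
    # edge(X, f0(X, Y)) <- V(X, Y)  and  edge(f0(X, Y), Y) <- V(X, Y)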

Now, for a given datalog program $P$ and set of views $\mathcal{V}$, the maximally contained rewriting of $P$ using $\mathcal{V}$ can easily be constructed.

First, those rules are deleted from $P$ that contain an edb relation not appearing in the definition of any view from $\mathcal{V}$. The rules of $\mathcal{V}^{-1}$ are added to the datalog program obtained in this way, thus forming the logic program denoted $P \cup \mathcal{V}^{-1}$ in the following. Note that the remaining edb relations of $P$ are idb relations in the logic program $P \cup \mathcal{V}^{-1}$, since they appear in the heads of the rules of $\mathcal{V}^{-1}$. The names of idb relations are arbitrary, so they could be renamed so that their names do not coincide with the names of edb relations of $P$; however, for the sake of better understanding, this is not done in the following example.

Example 19.13 Logic program. Consider the following datalog program $P$ that computes the transitive closure of relation edge.

Assume that only the following materialised view $V$ is accessible, which stores the endpoints of paths of length two. If only this view is usable, then the most that can be expected is to list the endpoints of paths of even length. Since the unique edb relation of datalog program $P$ is edge, which also appears in the definition of $V$, the logic program $P \cup \mathcal{V}^{-1}$ is obtained by adding the rules of $\mathcal{V}^{-1}$ to $P$.

Let the instance of the edb relation edge of datalog program $P$ be the graph $G$ shown in Figure 19.4.

Figure 19.4. The graph G.
Then $\mathcal{V}^{-1}$ introduces three new constants, one Skolem term for each tuple of the view. The idb relation edge of the logic program $P \cup \mathcal{V}^{-1}$ is the graph $G'$ shown in Figure 19.5.

Figure 19.5. The graph G'.
$P \cup \mathcal{V}^{-1}$ computes the transitive closure of graph $G'$. Note that those pairs in the transitive closure that do not contain any of the new constants are exactly the endpoints of the even-length paths of $G$.

The result of the logic program in Example 19.13 can be calculated by procedure Naiv-Datalog, for example. However, for logic programs in general it is not true that the algorithm terminates. Indeed, consider the following logic program.
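A minimal example of such a program (our reconstruction; the original rules may have differed) is

$$q(X) \leftarrow p(X), \qquad q(f(X)) \leftarrow q(X).$$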

If the edb relation $p$ contains the constant $a$, then the output of the program is the infinite sequence $q(a), q(f(a)), q(f(f(a))), \ldots$ In contrast to this, the output of the logic program given by the Inverse-rules Algorithm is guaranteed to be finite, thus the computation terminates in finite time.

Theorem 19.31 For an arbitrary datalog program $P$ and set of conjunctive views $\mathcal{V}$, and for finite instances of the views, there exists a unique minimal fixpoint of the logic program $P \cup \mathcal{V}^{-1}$; furthermore, procedures Naiv-Datalog and Semi-Naive-Datalog give this minimal fixpoint as output.

The essence of the proof of Theorem 19.31 is that function symbols are introduced only by the inverse rules, which are in turn not recursive, thus terms containing nested function symbols are not produced. The details of the proof are left for the Reader (Exercise 19.3-3).

Even if started from the edb relations of a database, the output of a logic program may contain tuples that have function symbols. Thus, a filter is introduced that eliminates the unnecessary tuples. Let the database $D$ be the instance of the edb relations of datalog program $P$. Let $P(D)^{\downarrow}$ denote the set of those tuples from $P(D)$ that do not contain function symbols, and let $P^{\downarrow}$ denote the program that computes $P(D)^{\downarrow}$ for a given instance $D$. The proof of the following theorem exceeds the limitations of the present chapter.

Theorem 19.32 For an arbitrary datalog program $P$ and set of conjunctive views $\mathcal{V}$, the logic program $(P \cup \mathcal{V}^{-1})^{\downarrow}$ is a maximally contained rewriting of $P$ using $\mathcal{V}$. Furthermore, $(P \cup \mathcal{V}^{-1})^{\downarrow}$ can be constructed in time polynomial in the sizes of $P$ and $\mathcal{V}$.

The meaning of Theorem 19.32 is that the simple procedure of adding the inverses of the view definitions to a datalog program results in a logic program that uses the views as much as possible. It is easy to see that $(P \cup \mathcal{V}^{-1})^{\downarrow}$ can be constructed in time polynomial in the sizes of $P$ and $\mathcal{V}$, since a unique inverse rule must be constructed for every subgoal of every view definition.

In order to completely solve the rewriting problem, however, a datalog program needs to be produced that is equivalent with the logic program $(P \cup \mathcal{V}^{-1})^{\downarrow}$. The key to this is the observation that $(P \cup \mathcal{V}^{-1})^{\downarrow}$ contains only finitely many function symbols, and that during a bottom-up evaluation like Naiv-Datalog and its versions, nested function symbols are not produced. With proper bookkeeping, the appearances of function symbols can be kept track of without actually producing the tuples that contain them.

The transformation is done bottom-up, as in procedure Naiv-Datalog. A function symbol appearing in an argument of an idb relation is replaced by the list of its own arguments; at the same time, the name of the idb relation needs to be marked, to remember that this list belongs to the given function symbol. Thus, new "temporary" relation names are introduced. Consider the rule

of the logic program (19.78) in Example 19.13. It is replaced by the rule

The notation means that the first argument of the marked relation is the same as the first argument of edge, while its second and third arguments, together with the function symbol $f$, give the second argument of edge. If a function symbol would become an argument of an idb relation of $P$ during the bottom-up evaluation of $P \cup \mathcal{V}^{-1}$, then a new rule is added to the program with appropriately marked relation names.
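Assuming the rule in question is the inverse rule $\textit{edge}(X, f(X,Y)) \leftarrow V(X,Y)$ from Example 19.13 (our reading of the lost display), the replacement may be written as

$$\textit{edge}^{\langle 1,\, f(2,3) \rangle}(X, X, Y) \leftarrow V(X, Y),$$

where the superscript records that the first argument of the marked relation is the first argument of edge, and that its second and third arguments are the arguments of the function symbol $f$ forming the second argument of edge.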

Example 19.14 Transformation of a logic program into a datalog program. The logic program of Example 19.13 is transformed into the following datalog program by the procedure sketched above. The different phases of the bottom-up execution of Naiv-Datalog are separated by lines.

The datalog program obtained shows clearly which arguments could involve function symbols in the original logic program. However, some of the marked relations never yield tuples free of function symbols during the computation of the output of the program.

A relation is called significant if, in the precedence graph of Definition 19.16, there exists an oriented path from it to the output relation of the program. If a relation is not significant, then its tuples are not needed to compute the output of the program, thus it can be eliminated from the program.

Footnote. Here the definition of precedence graph needs to be extended for the edb relations of the datalog program, as well.

Example 19.15 Eliminating non-significant relations. In the precedence graph of the datalog program obtained in Example 19.14 there exists no directed path from two of the marked relations to the output relation of the program, thus they are not significant, i.e., they can be eliminated together with the rules that involve them. The following datalog program is obtained:

One more simplification step can be performed, which does not decrease the number of necessary derivations during the computation of the output, but avoids redundant data copying. If a relation of the datalog program is defined by a single rule, which in turn contains a single relation in its body, then this relation can be removed from the program and replaced by the relation of the body of the rule defining it, having equated the variables accordingly.

Example 19.16 Avoiding unnecessary data copying. In Example 19.14 two of the relations are defined by a single rule each, and these two rules both have a single relation in their bodies. Hence, program (19.83) can be simplified further.

The datalog program obtained by the two simplification steps above is denoted by $P^*$ in the following. It is clear that there exists a one-to-one correspondence between the bottom-up evaluations of $P^*$ and of $P \cup \mathcal{V}^{-1}$. Since the appearances of function symbols in $P^*$ are kept track of, the output instance obtained is in fact the subset of those tuples of the output of $P \cup \mathcal{V}^{-1}$ that do not contain function symbols.

Theorem 19.33 For an arbitrary datalog program $P$ that does not contain negation, and a set of conjunctive views $\mathcal{V}$, the logic program $(P \cup \mathcal{V}^{-1})^{\downarrow}$ is equivalent with the datalog program $P^*$.

19.3.3.4.  MiniCon.

The main disadvantage of the Bucket Algorithm is that it considers each of the subgoals in isolation, and therefore does not observe most of the interactions between the subgoals of the views. Thus, the buckets may contain many unusable views, and the second phase of the algorithm may become very expensive.

The advantage of the Inverse-rules Algorithm is its conceptual simplicity and modularity: the inverses of the views must be computed only once, and then they can be applied to arbitrary queries given by datalog programs. On the other hand, much of the computational advantage of exploiting the materialised views can be lost. Using the rewriting produced by the algorithm for actually evaluating queries from the views has a significant drawback, since it insists on recomputing the extensions of the database relations.

The MiniCon algorithm addresses the limitations of the previous two algorithms. The key idea underlying the algorithm is a change of perspective: instead of building rewritings for each of the query subgoals, it is considered how each of the variables in the query can interact with the available views. The result is that the second phase of MiniCon needs to consider drastically fewer combinations of views. In the following we return to conjunctive queries, and for the sake of easier understanding only such views are considered that do not contain constants.

The MiniCon Algorithm starts out like the Bucket Algorithm, considering which views contain subgoals that correspond to subgoals in the query. However, once the algorithm finds a partial variable mapping from a subgoal $g$ in the query to a subgoal $g_1$ in a view $V$, it changes perspective and looks at the variables of the query. The algorithm considers the join predicates in the query – which are specified by multiple occurrences of the same variable – and finds the minimal additional set of subgoals that must be mapped to subgoals in $V$, given that $g$ will be mapped to $g_1$. This set of subgoals and mapping information is called a MiniCon Description (MCD). In the second phase the MCDs are combined to produce the query rewritings. The construction of the MCDs makes the most expensive part of the Bucket Algorithm obsolete, namely the checking of containment between the rewritings and the query, because the rule generating the MCDs ensures that their join gives a correct result.

For a given mapping $\tau$, a subgoal $g_1$ of view $V$ is said to cover a subgoal $g$ of query $Q$ if $\tau(g) = g_1$. $Vars(Q)$ and $Vars(V)$ denote the set of variables of the query and of the view, respectively. In order to prove that a rewriting gives only tuples that belong to the output of the query, a homomorphism must be exhibited from the query onto the rewriting. An MCD can be considered as a part of such a homomorphism; hence, these parts can be put together easily.

The rewriting of query $Q$ is a union of conjunctive queries using the views. Some of the variables may be equated in the heads of some of the views, as in the equivalent rewriting (19.70) of Example 19.10. Thus, it is useful to introduce the concept of head homomorphism. A mapping $h \colon Vars(V) \to Vars(V)$ is a head homomorphism if it is the identity on variables that do not occur in the head of $V$, but it may equate variables of the head; for every variable $x$ of the head of $V$, $h(x)$ also appears in the head of $V$, furthermore $h(x) = h(h(x))$. Now, the exact definition of MCD can be given.
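As an illustration (a hypothetical view, not from the original text): for $V(a, b) \leftarrow p(a, c), p(b, c)$, the mapping $h$ with $h(a) = h(b) = a$ and $h(c) = c$ is a head homomorphism; applying it yields $V(a, a) \leftarrow p(a, c), p(a, c)$, the version of the view with its two head variables equated.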

Definition 19.34 The quadruple $C = (h_C, V(\vec{Y})_C, \varphi_C, G_C)$ is a MiniCon Description (MCD) for query $Q$ over view $V$, where

  • $h_C$ is a head homomorphism over $V$,

  • $V(\vec{Y})_C$ is obtained from $V$ by applying $h_C$, that is, $\vec{Y} = h_C(\vec{A})$, where $\vec{A}$ is the set of variables appearing in the head of $V$,

  • $\varphi_C$ is a partial mapping from $Vars(Q)$ to $h_C(Vars(V))$,

  • $G_C$ is a set of subgoals of $Q$ that are covered by some subgoal of $h_C(V)$ using the mapping $\varphi_C$ (note: not all such subgoals are necessarily included in $G_C$).

The procedure constructing MCDs is based on the following proposition.

Claim 19.35 Let $C$ be a MiniCon Description over view $V$ for query $Q$. $C$ can be used for a non-redundant rewriting of $Q$ if the following conditions hold:

C1. for every variable $x$ that is in the head of $Q$ and in the domain of $\varphi_C$, as well, $\varphi_C(x)$ appears in the head of $h_C(V)$, furthermore

C2. if $\varphi_C(y)$ does not appear in the head of $h_C(V)$, then for all subgoals $g$ of $Q$ that contain $y$ it holds that

1. every variable of $g$ appears in the domain of $\varphi_C$, and

2. $\varphi_C(g) \in h_C(V)$.

Clause C1 is the same as in the Bucket Algorithm. Clause C2 means that if a variable $y$ is part of a join predicate which is not enforced by the view, then $\varphi_C(y)$ must be in the head of the view, so that the join predicate can be applied by another subgoal in the rewriting. The procedure Form-MCDs gives the usable MiniCon Descriptions for a conjunctive query $Q$ and a set of conjunctive views $\mathcal{V}$.

Form-MCDs($Q$, $\mathcal{V}$)

     1  $\mathcal{C} \leftarrow \emptyset$
     2  FOR each subgoal $g$ of $Q$
     3    DO FOR $V \in \mathcal{V}$
     4       DO FOR every subgoal $v$ of $V$
     5          DO let $h$ be the least restrictive head homomorphism on $V$ such that there exists a mapping $\varphi$ with $\varphi(g) = h(v)$
     6             IF $h$ and $\varphi$ exist
     7                THEN add to $\mathcal{C}$ any new MCD $C$ that can be constructed where:
     8                     (a) $\varphi_C$ (respectively, $h_C$) is an extension of $\varphi$ (respectively, $h$),
     9                     (b) $G_C$ is the minimal subset of subgoals of $Q$ such that $G_C$, $h_C$ and $\varphi_C$ satisfy Proposition 19.35, and
    10                     (c) it is not possible to extend $\varphi$ and $h$ to $\varphi'_C$ and $h'_C$ such that (b) is satisfied and $G'_C$, as defined in (b), is a subset of $G_C$
    11  RETURN $\mathcal{C}$

Consider again query (19.68) and the views of Example 19.10. Procedure Form-MCDs considers the subgoal $\textit{cites}(x, y)$ of the query first. It does not create an MCD for view $V_1$, because clause C2 of Proposition 19.35 would be violated. Indeed, the condition would require that subgoal $\textit{sameArea}(x, y)$ be also covered by $V_1$ using the mapping $x \to a$, $y \to b$, since $b$ is not in the head of $V_1$.

Footnote. The case of the mapping $x \to b$, $y \to a$ is similar.

For the same reason, no MCD will be created for $V_1$ even when the other subgoals of the query are considered. In a sense, the MiniCon Algorithm shifts some of the work done by the combination step of the Bucket Algorithm to the phase of creating the MCDs by using Form-MCDs. The following table shows the output of procedure Form-MCDs.

Procedure Form-MCDs includes in $G_C$ only the minimal set of subgoals that are necessary in order to satisfy Proposition 19.35. This makes it possible for the second phase of the MiniCon Algorithm to consider only those combinations of MCDs that cover pairwise disjoint subsets of the subgoals of the query.

Claim 19.36 Given a query $Q$, a set of views $\mathcal{V}$, and the set of MCDs $\mathcal{C}$ for $Q$ over the views of $\mathcal{V}$, the only combinations of MCDs that can result in non-redundant rewritings of $Q$ are of the form $C_1, \ldots, C_l$, where

  • C3. $G_{C_1} \cup \cdots \cup G_{C_l}$ contains all the subgoals of $Q$, and

  • C4. $G_{C_i} \cap G_{C_j} = \emptyset$ for every $i \neq j$.

The fact that only such sets of MCDs need to be considered that provide partitions of the subgoals of the query reduces the search space of the algorithm drastically. In order to formulate procedure Combine-MCDs, another notation needs to be introduced. The mapping $\varphi_C$ of an MCD $C$ may map several variables of $Q$ onto the same variable of $h_C(V)$. One arbitrarily chosen representative of each such set is taken, with the only restriction that if there exist variables of the head of $Q$ in the set, then one of those is the chosen one. Let $EC_{\varphi_C}(x)$ denote the representative of the set containing variable $x$. The MCD $C$ is considered in the following to be extended with $EC_{\varphi_C}$, that is, as a quintuple $(h_C, V(\vec{Y})_C, \varphi_C, G_C, EC_{\varphi_C})$. If the MCDs $C_1, \ldots, C_l$ are to be combined, and $\varphi_{C_i}(x) = \varphi_{C_i}(y)$ holds for some $i$, then in the conjunctive rewriting obtained by the join, $x$ and $y$ will be mapped to the same variable. Let $\sim_{C_i}$ denote the equivalence relation determined on the variables of $Q$ by two variables being equivalent if they are mapped onto the same variable by $\varphi_{C_i}$, that is, $x \sim_{C_i} y \iff \varphi_{C_i}(x) = \varphi_{C_i}(y)$. Let $\mathcal{C}$ be the set of MCDs obtained as the output of Form-MCDs.

Combine-MCDs($\mathcal{C}$)

     1  $Answer \leftarrow \emptyset$
     2  FOR every subset $\{C_1, \ldots, C_l\} \subseteq \mathcal{C}$ such that $G_{C_1}, \ldots, G_{C_l}$ is a partition of the subgoals of $Q$
     3    DO define a mapping $\Psi_i$ on $\vec{Y}_i$, the head variables of $V(\vec{Y})_{C_i}$, as follows:
     4       IF there exists a variable $x$ in $Q$ such that $\varphi_{C_i}(x) = y$
     5          THEN $\Psi_i(y) = x$
     6          ELSE $\Psi_i(y)$ is a fresh copy of $y$
     7       let $\sim$ be the transitive closure of $\sim_{C_1} \cup \cdots \cup \sim_{C_l}$
     8       ▷ $\sim$ is an equivalence relation on the variables of $Q$.
     9       choose a representative for each equivalence class of $\sim$
    10       define a mapping $EC$ as follows:
    11       IF $x \in Vars(Q)$
    12          THEN $EC(x)$ is the representative of the equivalence class of $x$ under $\sim$
    13          ELSE $EC(x) = x$
    14       let $Q'$ be given as $Q'(EC(\vec{x})) \leftarrow V_{C_1}(EC(\Psi_1(\vec{Y}_1))), \ldots, V_{C_l}(EC(\Psi_l(\vec{Y}_l)))$
    15       $Answer \leftarrow Answer \cup \{Q'\}$
    16  RETURN $Answer$

The following theorem summarises the properties of the MiniCon Algorithm.

Theorem 19.37 Given a conjunctive query $Q$ and conjunctive views $\mathcal{V}$, both without comparison predicates and constants, the MiniCon Algorithm produces a union of conjunctive queries that is a maximally contained rewriting of $Q$ using $\mathcal{V}$.

The complete proof of Theorem 19.37 exceeds the limitations of the present chapter. However, in Problem 19-1 the Reader is asked to prove that the union of the conjunctive queries obtained as the output of Combine-MCDs is contained in $Q$.

It must be noted that the worst-case running times of the Bucket Algorithm, the Inverse-rules Algorithm and the MiniCon Algorithm are the same: $O(n\,m\,M)^n$, where $n$ is the number of subgoals in the query, $m$ is the maximal number of subgoals in a view, and $M$ is the number of views. However, practical test runs show that in the case of a large number of views (300–400 views) the MiniCon Algorithm is significantly faster than the other two.

Exercises

19.3-1 Prove Theorem 19.25 using Proposition 19.24 and Theorem 19.20.

19.3-2 Prove the two statements of Lemma 19.26. Hint. For the first statement, substitute the view definitions into $Q'$ in place of the view literals. Minimise the obtained query using Theorem 19.19. For the second statement, use Proposition 19.24 to prove that there exists a homomorphism from the body of the conjunctive query defining the view onto the body of $Q$, and show that replacing the new variables by their images under this homomorphism is a good choice.

19.3-3 Prove Theorem 19.31 using the fact that datalog programs have a unique minimal fixpoint.

  PROBLEMS  

19-1 MiniCon is correct

Prove that the output of the MiniCon Algorithm is correct. Hint. It is enough to show that $Q' \subseteq Q$ holds for each conjunctive query $Q'$ given in line 14 of Combine-MCDs. For the latter, construct a homomorphism from $Q$ to $Q'$.

19-2 $P \cup \mathcal{V}^{-1}$ is correct

Prove that each tuple produced by the logic program $P \cup \mathcal{V}^{-1}$ is contained in the output of $P$ (a part of the proof of Theorem 19.32). Hint. Let $t$ be a tuple in the output of $P \cup \mathcal{V}^{-1}$ that does not contain function symbols. Consider the derivation tree of $t$. Its leaves are view literals, since those are the extensional relations of the logic program $P \cup \mathcal{V}^{-1}$. If these leaves are removed from the tree, then the leaves of the remaining tree are edb relations of $P$. Prove that the tree obtained is a derivation tree of $t$ in the datalog program $P$.

19-3 Datalog views

This problem tries to justify why only conjunctive views were considered. Let $\mathcal{V}$ be a set of views and $Q$ be a query. For a given instance $\mathcal{I}$ of the views, the tuple $t$ is a certain answer of query $Q$ if for every database instance $D$ such that the views evaluate to $\mathcal{I}$ over $D$, $t \in Q(D)$ holds, as well.

a. Prove that if the views of $\mathcal{V}$ are given by datalog programs, and query $Q$ is conjunctive and may contain non-equality ($\neq$) predicates, then the question whether, for a given instance of the views, a tuple $t$ is a certain answer of $Q$ is algorithmically undecidable. Hint. Reduce to this question the Post Correspondence Problem, which is the following: given two sets of words $\{w_1, w_2, \ldots, w_n\}$ and $\{w'_1, w'_2, \ldots, w'_n\}$ over the alphabet $\{a, b\}$, decide whether there exists a sequence of indices $i_1, i_2, \ldots, i_k$ (repetition allowed) such that

$$w_{i_1} w_{i_2} \cdots w_{i_k} = w'_{i_1} w'_{i_2} \cdots w'_{i_k}.$$

The Post Correspondence Problem is a well-known algorithmically undecidable problem. Let the view $V$ be given by the following datalog program:

Furthermore, let $Q$ be the following conjunctive query.

Show that for the given instance of the view, the given tuple is a certain answer of query $Q$ if and only if the Post Correspondence Problem with sets $\{w_1, \ldots, w_n\}$ and $\{w'_1, \ldots, w'_n\}$ has no solution.

b. In contrast to the undecidability result of part a., if $\mathcal{V}$ is a set of conjunctive views and the query $Q$ is given by a datalog program $P$, then it is easy to decide about an arbitrary tuple $t$ whether it is a certain answer of $Q$ for a given view instance. Prove that the datalog program $(P \cup \mathcal{V}^{-1})^{\downarrow}$ gives exactly the tuples of the certain answer of $Q$ as output.

  CHAPTER NOTES  

There are several dimensions along which the treatments of the problem of answering queries using views can be classified. Figure 19.6 shows a taxonomy of this work.

Figure 19.6. A taxonomy of work on answering queries using views.
The most significant distinction between the different works is whether their goal is data integration or query optimisation and the maintenance of physical data independence. The key difference between these two classes of works is the output of the algorithm for answering queries using views. In the former case, given a query $Q$ and a set of views $\mathcal{V}$, the goal of the algorithm is to produce an expression $Q'$ that references the views and is either equivalent to or contained in $Q$. In the latter case, the algorithm must go further and produce a (hopefully optimal) query execution plan for answering $Q$ using the views (and possibly the database relations). Here the rewriting must be equivalent to $Q$ in order to ensure the correctness of the plan.

The similarity between these two bodies of work is that both are concerned with the core issue of whether a rewriting of a query is equivalent to, or contained in, the query. However, while logical correctness suffices for the data integration context, it does not for query optimisation, where the cheapest plan using the views must also be found. The complication arises because the optimisation algorithms need to consider views that do not contribute to the logical correctness of the rewriting, but do reduce the cost of the resulting plan. Hence, while the reasoning underlying the algorithms in the data integration context is mostly logical, in the query optimisation case it is both logical and cost-based. On the other hand, an aspect stressed in the data integration context is the importance of dealing with a large number of views, which correspond to data sources. In the context of query optimisation it is generally assumed (though not always) that the number of views is roughly comparable to the size of the schema.

The works on query optimisation can be classified further into System-R style optimisers and transformational optimisers. Examples of the former are the works of Chaudhuri, Krishnamurty, Potomianos and Shim [ 61 ]; and Tsatalos, Solomon, and Ioannidis [ 327 ]. The papers of Florescu, Raschid, and Valduriez [ 112 ]; Bello et al. [ 35 ]; Deutsch, Popa and Tannen [ 89 ]; Zaharioudakis et al. [ 354 ]; furthermore Goldstein and Larson [ 136 ] belong to the latter.

Rewriting algorithms in the data integration context are studied in the works of Yang and Larson [ 351 ]; Levy, Mendelzon, Sagiv and Srivastava [ 155 ]; Qian [ 280 ]; furthermore Lambrecht, Kambhampati and Gnanaprakasam [ 210 ]. The Bucket Algorithm was introduced by Levy, Rajaraman and Ordille [ 156 ]. The Inverse-rules Algorithm was invented by Duschka and Genesereth [ 93 ], [ 94 ]. The MiniCon Algorithm was developed by Pottinger and Halevy [ 277 ], [ 276 ].

Query answering algorithms and the complexity of the problem are studied in papers of Abiteboul and Duschka [ 2 ]; Grahne and Mendelzon [ 140 ]; furthermore Calvanese, De Giacomo, Lenzerini and Vardi [ 54 ].

The STORED system was developed by Deutsch, Fernandez and Suciu [ 88 ]. Semantic caching is discussed in the paper of Yang, Karlapalem and Li [ 350 ]. Extensions of the rewriting problem are studied in [ 49 ], [ 110 ], [ 120 ], [ 208 ], [ 350 ].

Surveys of the area can be found in works of Abiteboul [ 1 ], Florescu, Levy and Mendelzon [ 111 ], Halevy [ 154 ], [ 153 ], furthermore Ullman [ 330 ].

Research of the authors was (partially) supported by Hungarian National Research Fund (OTKA) grants Nos. T034702, T037846T and T042706.