Answering queries using views, in other words, query rewriting using views, has recently attracted much attention and interest. The reason is its applicability to a wide range of data manipulation problems: query optimisation, providing physical data independence, data and information integration, and the planning of data warehouses.
The problem is the following. Assume that a query Q is given over schema R, together with views V1, V2, …, Vn. Can one answer Q using only the results of the views? Which is the largest set of tuples that can be determined knowing the views? If both the views and the relations of the base schema can be accessed, what is the cheapest way to compute the result of Q?
Before query rewriting algorithms are studied in detail, some motivating examples of applications are given. The following university database is used throughout this section.
The schemata of the individual relations are as follows:
It is supposed that professors, students and departments are uniquely identified by their names. Tuples of relation Registered show which student took which course in which semester, while Major shows which department a student chose as major (for the sake of convenience it is assumed that one department has one subject as possible major).
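The relations named here can be collected into a small executable sketch. The attribute lists and the sample tuples below are illustrative assumptions inferred from the surrounding text, not data reproduced from it:

```python
# Hypothetical instance of the University schema; attribute names and
# sample tuples are assumptions for illustration only.
professor = {("Elaine", "Databases"), ("Frank", "Theory")}    # (PName, Area)
course = {(519, "Advanced Databases"), (319, "Data Mining")}  # (CNumber, Title)
teaches = {("Elaine", 519, "Fall"), ("Frank", 319, "Fall")}   # (PName, CNumber, Semester)
registered = {("Ann", 519, "Fall"), ("Bob", 319, "Fall")}     # (Student, CNumber, Semester)
major = {("Ann", "CS"), ("Bob", "Math")}                      # (Student, Dept)

# Tuples of Registered show which student took which course in what
# semester; Major records the single department a student majors in.
courses_of_ann = {c for (s, c, sem) in registered if s == "Ann"}
print(courses_of_ann)
```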
If the computation necessary to answer a query has been performed in advance and the result is stored in some materialised view, then the view can be used to speed up the answering of the query.
Consider the query that looks for those (Student, Title) pairs where the student has registered for the given Ph.D.-level course, and the course is taught by a professor of the Database area (the CNumber of graduate courses is at least 400, and the Ph.D.-level courses are those with CNumber at least 500).
Suppose that the following materialised view is available that contains the registration data of graduate courses.
View Graduate can be used to answer query (19.43).
It will be faster to compute (19.45) than (19.43), because the natural join of relations Registered and Course has already been performed by view Graduate, which has furthermore filtered out the undergraduate courses (that make up the bulk of registration data at most universities). It is worth noting that view Graduate could be used even though it does not agree syntactically with any part of query (19.43).
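The claim that the view-based plan computes the same answer can be checked with a small sketch. The relation contents are assumed sample data; the 400 and 500 thresholds on CNumber follow the description above:

```python
# Base relations (sample data assumed):
registered = {("Ann", 519, "Fall"), ("Bob", 101, "Fall"), ("Ann", 551, "Spring")}
course = {(519, "Adv. Databases"), (101, "Intro"), (551, "Distributed Data")}
teaches = {("Elaine", 519, "Fall"), ("Elaine", 551, "Spring"), ("Greg", 101, "Fall")}
prof = {("Elaine", "Databases"), ("Greg", "Programming")}

# Materialised view Graduate: join of Registered and Course, restricted
# to graduate courses (CNumber >= 400).
graduate = {(s, c, sem, t) for (s, c, sem) in registered
            for (c2, t) in course if c == c2 and c >= 400}

# The query computed directly from the base relations:
direct = {(s, t) for (s, c, sem) in registered
          for (c2, t) in course if c == c2 and c >= 500
          for (p, c3, sem2) in teaches if c3 == c and sem2 == sem
          for (p2, a) in prof if p2 == p and a == "Databases"}

# The same query computed from the view: the Registered-Course join and
# the filtering of undergraduate courses are reused.
via_view = {(s, t) for (s, c, sem, t) in graduate if c >= 500
            for (p, c3, sem2) in teaches if c3 == c and sem2 == sem
            for (p2, a) in prof if p2 == p and a == "Databases"}

assert direct == via_view
print(sorted(direct))
```

The two plans produce identical answers; the saving comes from the view having precomputed the join over the (typically much larger) registration data.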
On the other hand, it may happen that the original query can be answered faster. If relations Registered and Course have an index on attribute CNumber, but no index is built for Graduate, then it could be faster to answer query (19.43) directly from the database relations. Thus, the real challenge is not only to decide whether a materialised view can logically be used to answer some query, but also to perform a thorough cost analysis that determines when using the existing views is worthwhile.
One of the principles underlying modern database systems is the separation between the logical view of data and the physical view of data. With the exception of horizontal or vertical partitioning of relations into multiple files, relational database systems are still largely based on a one-to-one correspondence between relations in the schema and the files in which they are stored. In object-oriented systems, maintaining the separation is necessary because the logical schema contains significant redundancy, and does not correspond to a good physical layout. Maintaining physical data independence becomes more crucial in applications where the logical model is introduced as an intermediate level after the physical representation has already been determined. This is common in the storage of XML data in relational databases, and in data integration. In fact, the Stored system stores XML documents in a relational database, and uses views to describe the mapping from XML into the relations of the database.
To maintain physical data independence, a widely accepted method is to use views over the logical schema as a mechanism for describing the physical storage of data. In particular, Tsatalos, Solomon and Ioannidis use GMAPs (Generalised Multi-level Access Paths) to describe data storage.
A GMAP describes the physical organisation and the indexes of the storage structure. The first clause of the GMAP (the as clause) describes the actual data structure used to store a set of tuples (e.g., a tree, hash index, etc.). The remaining clauses describe the content of the structure, much like a view definition. The given and select clauses describe the available attributes, where the given clause describes the attributes on which the structure is indexed. The definition of the view, given in the where clause, uses infix notation over the conceptual model.
In the example shown in Figure 19.3 the GMAP G1 stores a set of pairs containing students and the departments in which they major, and these pairs are indexed by a tree on attribute Student.name. The GMAP G2 stores an index from names of students to the numbers of courses in which they are registered. The GMAP G3 stores an index from course numbers to departments whose majors are enrolled in the course.
Given that the data is stored in structures described by the GMAPs, the question that arises is how to use these structures to answer queries. Since the logical content of the GMAPs is described by views, answering a query amounts to finding a way of rewriting the query using these views. If there are multiple ways of answering the query using the views, we would like to find the cheapest one. Note that in contrast to the query optimisation context, we must use the views to answer the given query, because all the data is stored in the GMAPs.
Consider the following query, which asks for the names of students registered for Ph.D.-level courses and the departments in which these students are majoring.
The query can be answered in two ways. First, since Student.name uniquely identifies a student, we can take the join of G1 and G2, then apply the selection operator, and finally a projection eliminates the unnecessary attributes. The other execution plan joins G3 with G2 and applies the selection. In fact, this second solution may even be more efficient, because G3 has an index on the course number and therefore the intermediate joins may be much smaller.
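The two execution plans can be sketched over hypothetical GMAP contents. The sample data are assumed, and so is the detail that the second plan still verifies the student's own major through G1:

```python
# Hypothetical logical contents of the three GMAPs (sample data assumed).
g1 = {("Ann", "CS"), ("Bob", "Math")}            # (Student.name, Dept.name): majors
g2 = {("Ann", 519), ("Ann", 101), ("Bob", 519)}  # (Student.name, Course.c-number)
g3 = {(519, "CS"), (519, "Math"), (101, "CS")}   # (c-number, Dept of an enrolled major)

# Plan 1: join G1 with G2 on the student, then select Ph.D.-level courses.
plan1 = {(s, d) for (s, d) in g1 for (s2, c) in g2 if s == s2 and c >= 500}

# Plan 2: start from G3's index on the course number to find the relevant
# Ph.D.-level courses, then join back through G2 and G1 (checking the
# student's own major via G1 is an assumption of this sketch).
phd_courses = {c for (c, d) in g3 if c >= 500}
plan2 = {(s, d) for (s, c) in g2 if c in phd_courses
         for (s2, d) in g1 if s2 == s}

assert plan1 == plan2
print(sorted(plan1))
```

Both plans yield the same pairs; which one is cheaper depends on the available indexes and intermediate result sizes, exactly as argued above.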
A data integration system (also known as a mediator system) provides a uniform query interface to a multitude of autonomous, heterogeneous data sources. Prime examples of data integration applications include enterprise integration, querying multiple sources on the World Wide Web, and integration of data from distributed scientific experiments.
To provide a uniform interface, a data integration system exposes to the user a mediated schema. A mediated schema is a set of virtual relations, in the sense that they are not stored anywhere. The mediated schema is designed manually for a particular data integration application. To be able to answer queries, the system must also contain a set of source descriptions. A description of a data source specifies the contents of the source, the attributes that can be found in the source, and the (integrity) constraints on the content of the source. A widely adopted approach for specifying source descriptions is to describe the contents of a data source as a view over the mediated schema. This approach facilitates the addition of new data sources and the specification of constraints on the contents of sources.
In order to answer a query, a data integration system needs to translate a query formulated over the mediated schema into one that refers directly to the schemata of the data sources. Since the contents of the data sources are described as views, the translation problem amounts to finding a way to answer a query using a set of views.
Consider as an example the case where the mediated schema exposed to the user is schema University, except that the relations Teaches and Course have an additional attribute identifying the university at which the course is being taught:
Suppose we have the following two data sources. The first source provides a listing of all the courses entitled "Database Systems" taught anywhere and their instructors. This source can be described by the following view definition:
The second source lists Ph.D.-level courses being taught at The Ohio State University (OSU), and is described by the following view definition:
If we were to ask the data integration system who teaches courses titled "Database Systems" at OSU, it would be able to answer the query by applying a selection on the source DBcourses:
On the other hand, suppose we ask for all the graduate-level courses (not just in databases) being offered at OSU. Given that only these two sources are available, the data integration system cannot find all tuples of the answer to the query. Instead, the system can attempt to find the maximal set of tuples in the answer available from the sources. In particular, the system can obtain graduate database courses at OSU from the DBcourses source, and the Ph.D.-level courses at OSU from the OSUPhD source. Hence, the following non-recursive datalog program gives the maximal set of answers that can be obtained from the two sources:
Note that courses that are neither Ph.D.-level courses nor database courses will not be returned as answers. Whereas in the contexts of query optimisation and maintaining physical data independence the focus is on finding a query expression that is equivalent with the original query, here the goal is to find a query expression that provides the maximal set of answers obtainable from the views.
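The union of the two contained rewritings can be simulated as follows. The sample data and the attribute layout of the sources are assumptions of this sketch:

```python
# Virtual global database (not directly accessible): (univ, c_number, title).
all_courses = {("OSU", 451, "Database Systems"),
               ("OSU", 551, "Advanced AI"),
               ("OSU", 431, "Operating Systems"),
               ("OSU", 301, "Database Systems"),
               ("MIT", 451, "Database Systems")}

# Source views, the only accessible data:
dbcourses = {(u, c) for (u, c, t) in all_courses if t == "Database Systems"}
osuphd = {c for (u, c, t) in all_courses if u == "OSU" and c >= 500}

# Query: graduate-level (c_number >= 400) courses at OSU.
true_answer = {c for (u, c, t) in all_courses if u == "OSU" and c >= 400}

# Maximally available answer: union of the two contained rewritings.
rewriting = {c for (u, c) in dbcourses if u == "OSU" and c >= 400} | osuphd

assert rewriting < true_answer  # contained, but not complete
print(sorted(rewriting), sorted(true_answer))
```

The graduate course 431 (neither a database course nor Ph.D.-level) is in the true answer but cannot be obtained from the sources, illustrating why only a contained rewriting is possible here.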
If the database is accessed via a client-server architecture, the data cached at the client can be modelled semantically as the results of certain queries, rather than at the physical level as a set of data pages or tuples. Hence, deciding which data needs to be shipped from the server in order to answer a given query requires an analysis of which parts of the query can be answered by the cached views.
In this section the theoretical complexity of query rewriting is studied. Mostly conjunctive queries are considered. Minimal and complete rewritings will be distinguished. It will be shown that if the query is conjunctive and the materialised views are also given as results of conjunctive queries, then the rewriting problem is NP-complete, assuming that neither the query nor the view definitions contain comparison atoms. Conjunctive queries will be considered in rule-based form for the sake of convenience.
Assume that query Q is given over schema R.
Definition 19.22 The conjunctive query Q' is a rewriting of query Q using views V = {V1, V2, …, Vn}, if
Q and Q' are equivalent, and
Q' contains one or more literals from V.
Q' is said to be locally minimal if no literal can be removed from Q' without violating the equivalence. The rewriting Q' is globally minimal, if there exists no rewriting using a smaller number of literals. (The comparison atoms are not counted in the number of literals.)
Example 19.8 Query rewriting. Consider the following query Q and view V.
Q can be rewritten using V:
View V replaces the first two literals of query Q. Note that the view certainly satisfies the third literal of the query as well. However, that literal cannot be removed from the rewriting: the variable joining the two relations does not occur in the head of V, thus if the third literal were removed too, the natural join it expresses would no longer be enforced.
Since in some of the applications, for example data integration or data warehouses, the database relations are inaccessible and only the views can be accessed, the concept of complete rewriting is introduced.
Definition 19.23 A rewriting Q' of query Q using views V is called a complete rewriting, if Q' contains only literals of V and comparison atoms.
Example 19.9 Complete rewriting. Assume that besides view V of Example 19.8 the following view V' is given as well:
A complete rewriting of query Q is:
It is important to see that this rewriting cannot be obtained step by step, first using only V and then trying to incorporate V' (or in the opposite order), since one of the relations of Q does not occur in V. Thus, in order to find the complete rewriting, the use of the two views must be considered in parallel, at the same time.
There is a close connection between finding a rewriting and the problem of query containment. The latter was discussed for tableau queries in Section 19.1.3. Homomorphism between tableau queries can be defined for rule-based queries as well. The only difference is that it is not required in this section that a homomorphism map the head of one rule to the head of the other. (The head of a rule corresponds to the summary row of a tableau.) According to Theorem 19.20 it is NP-complete to decide whether conjunctive query Q1 contains another conjunctive query Q2. This remains true in the case when Q2 may contain comparison atoms as well. However, if both Q1 and Q2 may contain comparison atoms, then the existence of a homomorphism from Q1 to Q2 is only a sufficient but not necessary condition for the containment of the queries, and the containment problem is complete for a class at the second level of the polynomial hierarchy in that case. The discussion of this latter complexity class is beyond the scope of this chapter, thus it is omitted. The next proposition gives a necessary and sufficient condition for the existence of a rewriting of query Q using view V.
Claim 19.24 Let Q and V be conjunctive queries that may contain comparison atoms. There exists a rewriting of query Q using view V if and only if π∅(Q) ⊆ π∅(V), that is, the projection of V to the empty attribute set contains that of Q.
Proof. Observe that π∅(Q) ⊆ π∅(V) is equivalent with the following proposition: if the output of V is empty for some instance, then the output of Q is empty for that instance as well.
Assume first that there exists a rewriting, that is, a rule equivalent with Q that contains V in its body. If I is an instance such that the result of V over I is empty, then every rule that includes V in its body gives an empty result over I, too.
In order to prove the other direction, assume that whenever the output of V is empty for some instance, the output of Q is empty for that instance as well. Let the rule defining Q be given, let Y be a list of new variables disjoint from the variables of Q, and let Q' be the query obtained from Q by adding the literal V(Y) to its body. Then Q' is a rewriting of Q using V, and it satisfies Q ≡ Q'. It is clear that Q' ⊆ Q. On the other hand, if some valuation of the variables of Q satisfies the body of Q over an instance I, then the output of Q over I is nonempty, hence by the assumption the output of V over I is nonempty as well. Extending the fixed valuation by assigning a tuple of the output of V to the variables of Y, the body of Q' is satisfied, thus the same tuple is obtained in the output of Q'.
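The construction in the proof can be checked on a toy example. The concrete queries below are assumptions chosen so that whenever the relation r is nonempty, both Q and V are nonempty, i.e., π∅(Q) ⊆ π∅(V) holds:

```python
# Q(x) :- r(x, y);  V(u) :- r(w, u);  Q'(x) :- r(x, y), V(z)
# with z a fresh variable, as in the proof of Claim 19.24.
def eval_q(r):                       # Q(x) :- r(x, y)
    return {x for (x, y) in r}

def eval_v(r):                       # V(u) :- r(w, u)
    return {y for (x, y) in r}

def eval_q_prime(r):                 # Q'(x) :- r(x, y), V(z)
    v = eval_v(r)
    return {x for (x, y) in r if v}  # the fresh z only requires V nonempty

# Q and Q' agree on every instance, including the empty one.
for r in [set(), {(1, 2)}, {(1, 2), (3, 4)}]:
    assert eval_q(r) == eval_q_prime(r)
print("Q and Q' agree on all test instances")
```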
As a corollary of Theorem 19.20 and Proposition 19.24 the following theorem is obtained.
Theorem 19.25 Let Q be a conjunctive query that may contain comparison atoms, and let V be a set of views. If the views in V are given by conjunctive queries that do not contain comparison atoms, then it is NP-complete to decide whether there exists a rewriting of Q using V.
The proof of Theorem 19.25 is left for the Reader (Exercise 19.31).
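The containment test underlying these results can be sketched by brute force: Q2 ⊆ Q1 holds exactly when there is a homomorphism from Q1 to Q2. The query encoding below (head variable list plus a list of relational atoms) is an assumption of this sketch, and the exhaustive search mirrors the NP-completeness of the problem:

```python
from itertools import product

def contains(q1, q2):
    """Check Q2 <= Q1 (without comparison atoms) by searching for a
    homomorphism from Q1 to Q2. A query is (head_vars, body), where body
    is a list of (relname, args) atoms over variable names."""
    head1, body1 = q1
    head2, body2 = q2
    vars1 = sorted({a for _, args in body1 for a in args} | set(head1))
    terms2 = sorted({a for _, args in body2 for a in args} | set(head2))
    atoms2 = {(r, tuple(args)) for r, args in body2}
    for image in product(terms2, repeat=len(vars1)):
        h = dict(zip(vars1, image))
        if [h[v] for v in head1] != list(head2):
            continue  # a homomorphism must map head to head
        if all((r, tuple(h[a] for a in args)) in atoms2 for r, args in body1):
            return True
    return False

# q_a(x,y) :- r(x,z), r(z,y)        q_b(x,y) :- r(x,u), r(u,y), s(u)
qa = (["x", "y"], [("r", ["x", "z"]), ("r", ["z", "y"])])
qb = (["x", "y"], [("r", ["x", "u"]), ("r", ["u", "y"]), ("s", ["u"])])
print(contains(qa, qb), contains(qb, qa))
```

Here qb ⊆ qa (map z to u), but not conversely, since qa has no atom to cover s(u).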
The proof of Proposition 19.24 uses new variables. However, according to the next lemma, this is not necessary. Another important observation is that when a locally minimal rewriting is sought, it is enough to consider a subset of the database relations occurring in the original query; new database relations need not be introduced.
Lemma 19.26 Let Q be a conjunctive query that does not contain comparison atoms, furthermore let V be a set of views that do not contain comparison atoms either.
If Q' is a locally minimal rewriting of Q using V, then the set of database literals in Q' is isomorphic to a subset of the database literals occurring in Q.
If Q' is a rewriting of Q using the views, then there exists a rewriting Q'' such that every variable of Q'' is a variable of Q, that is, the rewriting does not introduce new variables.
The details of the proof of Lemma 19.26 are left for the Reader (Exercise 19.32). The next lemma is of fundamental importance: a minimal rewriting of Q using V cannot increase the number of literals.
Lemma 19.27 Let Q be a conjunctive query and let V be a set of views given by conjunctive queries, both without comparison atoms. If the body of Q contains p literals and Q' is a locally minimal rewriting of Q using V, then Q' contains at most p literals.
Proof. Replacing the view literals of Q' by their definitions, query Q'' is obtained. Let φ be a homomorphism from the body of Q to Q''. The existence of φ follows from Q ⊆ Q'' by the Homomorphism Theorem (Theorem 19.18). Each of the p literals of the body of Q is mapped to at most one literal obtained from the expansion of the view definitions. If Q' contains more than p view literals, then the expansion of some view literal in the body of Q'' is disjoint from the image of φ. These view literals can be removed from the body of Q' without violating the equivalence.
Based on Lemma 19.27, the following theorem can be stated about the complexity of minimal rewritings.
Theorem 19.28 Let Q be a conjunctive query and let V be a set of views given by conjunctive queries, both without comparison atoms. Let the body of Q contain p literals.
It is NP-complete to decide whether there exists a rewriting of Q using V that uses at most p literals.
It is NP-complete to decide whether there exists a rewriting of Q using V that uses at most p database literals.
It is NP-complete to decide whether there exists a complete rewriting of Q using V.
Proof. The first statement is proved; the proofs of the other two are similar. According to Lemmas 19.27 and 19.26, only such rewritings need to be considered that have at most as many literals as the query itself, contain a subset of the literals of the query, and do not introduce new variables. Such a rewriting and the homomorphisms proving the equivalence can be tested in polynomial time, hence the problem is in NP. In order to prove that it is NP-hard, Theorem 19.25 is used. For a given query Q and view V, let V' be the view whose head is the same as the head of Q, but whose body is the conjunction of the bodies of Q and V. It is easy to see that there exists a rewriting using V' with a single literal if and only if there exists a rewriting (with no restriction) using V.
In this section only complete rewritings are studied. This is not a real restriction, since if database relations are also to be used, then views mirroring the database relations one-to-one can be introduced. The concept of equivalent rewriting introduced in Definition 19.22 is appropriate if the goal of the rewriting is query optimisation or providing physical data independence. However, in the context of data integration or data warehouses equivalent rewritings cannot always be sought, since not all the necessary data may be available. Thus the concept of maximally contained rewriting is introduced, which, in contrast to equivalent rewritings, depends on the query language used.
Definition 19.29 Let Q be a query, V be a set of views, and L be a query language. Q' is a maximally contained rewriting of Q with respect to L, if
Q' is a query of language L using only views from V,
Q contains Q',
if a query Q1 of language L uses only views from V and Q contains Q1, then Q' contains Q1.
Before discussing how a traditional optimiser can be modified in order to use materialised views instead of database relations, it is necessary to survey when a view can be used to answer a given query. Essentially, view V can be used to answer query Q if the set of database relations in the body of V intersects that of Q, and some of the attributes selected by V are also selected by Q. Besides this, in the case of an equivalent rewriting, if V contains comparison atoms for attributes that also occur in Q, then the view must apply a logically equivalent or weaker condition than the query. If a logically stronger condition is applied in the view, it can still be part of a (maximally) contained rewriting. This is best shown via an example. Consider the query Q over schema University that lists those (professor, student, semester) triples where the professor is the advisor of the student, and in the given semester the student registered for some course taught by that professor.
View V1 below can be used to answer Q, since it uses the same join condition for relations Registered and Teaches as Q, as shown by the variables of the same name. Furthermore, V1 selects the attributes Student, PName, Semester, which are necessary in order to join properly with relation Advisor, and for the select clause of the query. Finally, the comparison predicate of V1 is weaker than the predicate of the query.
The following four views illustrate how minor modifications change the usability in answering the query.
View V2 is similar to V1, except that it does not select the attribute PName from relation Teaches, which is needed for the join with relation Advisor and for the selection of the query. Hence, to use V2 in the rewriting, it has to be joined with relation Teaches again. Still, if the join of relations Registered and Teaches is very selective, then employing V2 may actually result in a more efficient query execution plan.
In view V3 the join of relations Registered and Teaches is over attribute CNumber only; the equality of the two semester variables is not required. Since the second semester attribute is not selected by V3, the join predicate cannot be applied in the rewriting, and therefore there is little gain in using V3.
View V4 considers only the professors who have at least one area of research. Hence, the view applies an additional condition that does not exist in the query, and it cannot be used in an equivalent rewriting unless union and negation are allowed in the rewriting language. However, if there is an integrity constraint stating that every professor has at least one area of research, then an optimiser should be able to realise that V4 is usable.
Finally, view V5 applies a stronger predicate than the query, and is therefore usable for a contained rewriting, but not for an equivalent rewriting of the query.
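The role of predicate strength can be illustrated with a small sketch. The relation, its contents, and the concrete thresholds are assumptions; the query selects tuples with course number at least 400:

```python
# Sample data: (student, c_number) pairs, assumed.
registered = {("Ann", 519), ("Bob", 431), ("Eve", 301)}

query = {t for t in registered if t[1] >= 400}

view_weak = {t for t in registered if t[1] >= 300}    # weaker condition
view_strong = {t for t in registered if t[1] >= 500}  # stronger condition

# A view with a weaker predicate supports an equivalent rewriting:
# the query's own predicate is simply re-applied on top of the view.
assert {t for t in view_weak if t[1] >= 400} == query

# A view with a stronger predicate yields only a contained rewriting:
# a proper subset of the answer, with no way to recover the rest.
assert view_strong <= query and view_strong != query
print(sorted(query))
```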
Before discussing the changes to traditional optimisation, the principles underlying System R style optimisers are recalled briefly. System R takes a bottom-up approach to building query execution plans. In the first phase, it concentrates on plans of size 1, i.e., chooses the best access path to every table mentioned in the query. In phase n, the algorithm considers plans of size n, by combining plans obtained in the previous phases (of sizes k and n-k). The algorithm terminates after constructing plans that cover all the relations in the query. The efficiency of System R stems from the fact that it partitions query execution plans into equivalence classes, and only considers a single execution plan for each equivalence class. Two plans are in the same equivalence class if they
cover the same set of relations in the query (and therefore are also of the same size), and
produce the answer in the same interesting order.
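The bottom-up, equivalence-class-based enumeration described above can be sketched in miniature. The cost model (join cost equal to the product of input cardinalities), the cardinalities themselves, and the simplification that the covered relation set is the only interesting plan property are all assumptions of this sketch:

```python
from itertools import combinations
from math import prod

# Three relations with assumed cardinalities; any two plans are joinable.
sizes = {"R": 100, "S": 10, "T": 50}
rels = list(sizes)

def card(cls):
    # Assumed cardinality of an intermediate result: product of inputs.
    return prod(sizes[r] for r in cls)

# Phase 1: best access path for each single relation (scan cost = size).
best = {frozenset([r]): (sizes[r], r) for r in rels}

# Phase n: build plans of size n by combining best plans of smaller
# sizes, keeping a single cheapest plan per equivalence class.
for n in range(2, len(rels) + 1):
    for combo in combinations(rels, n):
        cls = frozenset(combo)
        for k in range(1, n):
            for left in combinations(sorted(combo), k):
                l = frozenset(left)
                r = cls - l
                cost = best[l][0] + best[r][0] + card(l) * card(r)
                plan = (best[l][1], best[r][1])
                if cls not in best or cost < best[cls][0]:
                    best[cls] = (cost, plan)

print(best[frozenset(rels)])
```

With these numbers the cheapest plan joins S and T first, confirming the usual heuristic of starting with the smallest relations.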
In our context, the query optimiser builds query execution plans by accessing a set of views, rather than a set of database relations. Hence, in addition to the metadata that the query optimiser has about the materialised views (e.g., statistics, indexes), the optimiser is also given as input the query expressions defining the views. The additional issues that the optimiser needs to consider in the presence of materialised views are as follows.
A. In the first iteration the algorithm needs to decide which views are relevant to the query according to the conditions illustrated above. The corresponding step is trivial in a traditional optimiser: a relation is relevant to the query if it is in the body of the query expression.
B. Since the query execution plans involve joins over views, rather than joins over database relations, plans can no longer be neatly partitioned into equivalence classes which can be explored in increasing size. This observation implies several changes to the traditional algorithm:
1. Termination testing: the algorithm needs to distinguish partial query execution plans of the query from complete query execution plans. The enumeration of the possible join orders terminates when there are no more unexplored partial plans. In contrast, in the traditional setting the algorithm terminates after considering the equivalence classes that include all the relations of the query.
2. Pruning of plans: in the traditional setting, a partial plan can be pruned if a cheaper plan exists in the same equivalence class. In the presence of views, however, a plan p can be pruned in favour of a plan p' only if p'
(a) is cheaper than p, and
(b) has greater or equal contribution to the query than p. Informally, a plan contributes more to the query than another plan if it covers more of the relations in the query and selects more of the necessary attributes.
3. Combining partial plans: in the traditional setting, when two partial plans are combined, the join predicates that involve both plans are explicit in the query, and the enumeration algorithm need only consider the most efficient way to apply these predicates. However, in our case, it may not be obvious a priori which join predicate will yield a correct rewriting of the query, since views are joined rather than database relations directly. Hence, the enumeration algorithm needs to consider several alternative join predicates. Fortunately, in practice, the number of join predicates that need to be considered can be significantly pruned using metadata about the schema. For example, there is no point in trying to join a string attribute with a numeric one. Furthermore, in some cases knowledge of integrity constraints and the structure of the query can be used to reduce the number of join predicates to be considered. Finally, after considering all the possible join predicates, the optimiser also needs to check whether the resulting plan is still a partial solution to the query.
The following table summarises the comparison of the traditional optimiser versus one that exploits materialised views.
Another method of equivalent rewriting is using transformation rules. The common theme in the works of that area is that replacing some part of the query with a view is considered as another transformation available to the optimiser. These methods are not discussed in detail here.
The query optimisers discussed above were designed to handle cases where the number of views is relatively small (i.e., comparable to the size of the database schema), and cases where an equivalent rewriting is required. In contrast, the context of data integration requires the consideration of a large number of views, since each data source is described by one or more views. In addition, the view definitions may contain many complex predicates, whose goal is to express fine-grained distinctions between the contents of different data sources. Furthermore, in the context of data integration it is often assumed that the views are not complete, i.e., they may only contain a subset of the tuples satisfying their definition. In the following, some algorithms for answering queries using views are described that were developed specifically for the context of data integration.
The goal of the Bucket Algorithm is to reformulate a user query that is posed on a mediated (virtual) schema into a query that refers directly to the available sources. Both the query and the sources are described by conjunctive queries that may include atoms of arithmetic comparison predicates. The set of comparison atoms of query Q is denoted by C(Q).
Since the number of possible rewritings may be exponential in the size of the query, the main idea underlying the bucket algorithm is that the number of query rewritings that need to be considered can be drastically reduced if each subgoal (that is, each relational atom) of the query is first considered in isolation, determining which views may be relevant to that particular subgoal.
The algorithm proceeds as follows. First, a bucket is created for each subgoal of the query that is not a comparison atom, containing the views that are relevant to answering the particular subgoal. In the second step, all such conjunctive query rewritings are considered that include one conjunct (view) from each bucket. For each rewriting Q' obtained it is checked whether it is semantically correct, that is, whether Q' ⊆ Q holds, or whether it can be made semantically correct by adding comparison atoms. Finally, the remaining plans are minimised by pruning redundant subgoals. Algorithm CreateBucket executes the first step described above. Its input is the set of source descriptions V and a conjunctive query Q with subgoals r1(X1), …, rn(Xn).
CreateBucket(V, Q)
 1  FOR i ← 1 TO n
 2    DO Bucket_i ← ∅
 3       FOR all v ∈ V
 4          ▹ v is of form v(Y) ← u1(Y1), …, uk(Yk), C(v)
 5          DO FOR j ← 1 TO k
 6                DO IF uj = ri
 7                     THEN let φ be a mapping defined on the variables of v as follows:
 8                          IF y is the h-th variable of Yj and y occurs in the head of v
 9                            THEN φ(y) ← x, where x is the h-th variable of Xi
10                            ELSE φ(y) is a new variable that does not appear in Q or v
11                          ▹ φ makes subgoals ri(Xi) and uj(Yj) correspond to each other
12                          IF Satisfiable'(C(Q) ∧ φ(C(v)))
13                            THEN add v to Bucket_i
14  RETURN Bucket
Procedure Satisfiable' is the extension of procedure Satisfiable, described in Section 19.1.2, to the case when comparison atoms may occur besides equality atoms. The only necessary change is that for every variable y occurring in comparison atoms it must be checked whether all the predicates involving y are satisfiable simultaneously.
The running time of CreateBucket is a polynomial function of the sizes of Q and V. Indeed, the kernel of the nested loops of lines 3 and 5 is executed once for every pair formed by a view and one of its subgoals. The commands of lines 6–13 require constant time, except for line 12. The condition of the IF command in line 12 can be checked in polynomial time.
In order to prove the correctness of procedure CreateBucket, one should check under what condition a view v is put into Bucket_i. In line 6 it is checked whether relation ri appears as a subgoal in v. If not, then obviously v cannot give usable information for subgoal ri(Xi) of Q. If uj is a subgoal of v over relation ri, then in lines 9–10 a mapping φ is constructed that, applied to the variables, gives the correspondence between subgoals ri(Xi) and uj(Yj), in accordance with the variables occurring in the heads of Q and v, respectively. Finally, in line 12 it is checked whether the comparison atoms contradict the correspondence constructed.
In the second step, having constructed the buckets using CreateBucket, the bucket algorithm finds a set of conjunctive query rewritings, each of them being a conjunctive query that includes one conjunct from every bucket. Each of these conjunctive query rewritings represents one way of obtaining part of the answer to Q from the views. The result of the bucket algorithm is defined to be the union of the conjunctive query rewritings (since each of the rewritings may contribute different tuples). A given conjunctive query Q' is a conjunctive query rewriting, if
Q' ⊆ Q, or
Q' can be extended with comparison atoms so that the previous property holds.
Example 19.10 Bucket algorithm. Consider the following query Q that lists those articles x for which there exists another article y of the same area such that x and y mutually cite each other. There are three views (sources) available, v1, v2 and v3.
In the first step, applying CreateBucket, the following buckets are constructed.
In the second step the algorithm constructs a conjunctive query Q' from each element of the Cartesian product of the buckets, and checks whether Q' is contained in Q. If yes, then Q' is added to the answer.
In our case, the algorithm first tries to match v1 with the other views, however no correct rewriting is obtained this way. The reason is that the cited article does not appear in the head of v1, hence the join condition of Q (variables x and y occur in relation sameArea as well) cannot be applied. Then rewritings containing v3 are considered, and it is recognised that by equating the variables in the head of v3 a contained rewriting is obtained. Finally, the algorithm finds that by combining v3 and v2 a rewriting is obtained as well. This latter is redundant, as simple checking shows, thus it can be pruned. Thus, the result of the bucket algorithm for query (19.68) is the following (actually equivalent) rewriting
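The outcome of the example can be checked on concrete data. The relation contents and the exact view definitions below are assumptions consistent with the description above:

```python
# Assumed instance: mutual citations within the same area.
cites = {(1, 2), (2, 1), (3, 4)}
same_area = {(1, 2), (2, 1), (3, 4)}

# q(x) :- sameArea(x, y), cites(x, y), cites(y, x)
q = {x for (x, y) in same_area if (x, y) in cites and (y, x) in cites}

# v3(f, h) :- cites(f, g), cites(g, h), sameArea(f, g)
v3 = {(f, h) for (f, g) in cites for (g2, h) in cites if g == g2
      if (f, g) in same_area}

# The rewriting equates the head variables of v3: q'(x) :- v3(x, x).
q_rewritten = {f for (f, h) in v3 if f == h}

assert q_rewritten == q  # here the rewriting is in fact equivalent
print(sorted(q))
```

Articles 1 and 2 cite each other within the same area and are returned; the one-way citation between 3 and 4 is correctly excluded by both formulations.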
The strength of the bucket algorithm is that it exploits the predicates in the query to significantly prune the number of candidate conjunctive rewritings that need to be considered. Checking whether a view should belong to a bucket can be done in time polynomial in the size of the query and the view definition when the predicates involved are arithmetic comparisons. Hence, if the data sources (i.e., the views) are indeed distinguished by having different comparison predicates, then the resulting buckets will be relatively small.
The main disadvantage of the bucket algorithm is that the Cartesian product of the buckets may still be large. Furthermore, the second step of the algorithm needs to perform a query containment test for every candidate rewriting, which is NP-complete even when no comparison predicates are involved.
The Inverse-rules Algorithm is a procedure that can be applied more generally than the Bucket Algorithm. It finds, in polynomial time, a maximally contained rewriting for any query given by an arbitrary recursive datalog program that does not contain negation.
The first question is whether, for a given datalog program and set of conjunctive queries , there exists a datalog program equivalent with whose edb relations are relations of . Unfortunately, this is algorithmically undecidable. Surprisingly, the best, maximally contained rewriting can nevertheless be constructed. In the case when there exists a datalog program equivalent with , the algorithm finds that as well, since a maximally contained rewriting contains it. This seemingly contradicts the fact that the existence of an equivalent rewriting is algorithmically undecidable; however, it is undecidable about the result of the Inverse-rules Algorithm whether it is really equivalent to the original query.
Example 19.11 Equivalent rewriting. Consider the following datalog program , where the edb relations edge and contain, respectively, the edges of a graph and its vertices coloured black.
It is easy to check that lists the endpoints of such paths (more precisely walks) of graph whose inner points are all black. Assume that only the following two views can be accessed.
stores edges whose tail is black, while stores those, whose head is black. There exists an equivalent rewriting of datalog program that uses only views and as edb relations:
However, if only one of the two views is accessible alone, then an equivalent rewriting is not possible, since only such paths are obtainable whose starting, or respectively, ending vertex is black.
In order to describe the Inverse-rules Algorithm, it is necessary to introduce the Horn rule, which is a generalisation of the datalog rule. If function symbols are also allowed in the free tuple of rule (19.27) in Definition 19.11, besides variables and constants, then a Horn rule is obtained. A logic program is a collection of Horn rules; in this sense a logic program without function symbols is a datalog program. The related concepts can be defined for logic programs in the same way as for datalog programs.
The Inverse-rules Algorithm consists of two steps. First, a logic program is constructed that may contain function symbols. However, these will not occur in recursive rules, thus in the second step they can be eliminated and the logic program transformed into a datalog program.
Definition 19.30 The inverse of view given by
is the following collection of Horn rules. A rule corresponds to every subgoal , whose body is the literal . The head of the rule is , where is obtained from by preserving the variables appearing in the head of rule (19.74), while a function symbol is written in place of every variable not appearing in the head. Distinct function symbols correspond to distinct variables. The inverse of a set of views is the set , where distinct function symbols occur in the inverses of distinct rules.
The idea of the definition of inverses is that if a tuple appears in a view , for some constants , then there is a valuation of every variable appearing in the head that makes the body of the rule true. This “unknown” valuation is denoted by the function symbol .
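The construction of Definition 19.30 can be illustrated with a small Python sketch. The encoding is ours, not the book's: atoms are (predicate, argument-tuple) pairs, and the Skolem naming scheme f1, f2, ... stands for the function symbols.

```python
from itertools import count

def invert_view(view_name, head_vars, body, counter=None):
    """Return the inverse rules of a conjunctive view.

    One rule is produced per body subgoal: its head is the subgoal with
    every non-head variable replaced by a function term over the head
    variables, and its body is the view literal.  Distinct variables get
    distinct function symbols; the same variable reuses its symbol.
    """
    counter = counter or count(1)
    skolem = {}  # non-head variable -> its function term
    rules = []
    for pred, args in body:
        new_args = []
        for a in args:
            if a in head_vars:
                new_args.append(a)
            else:
                if a not in skolem:
                    skolem[a] = f"f{next(counter)}({', '.join(head_vars)})"
                new_args.append(skolem[a])
        rules.append(((pred, tuple(new_args)), (view_name, head_vars)))
    return rules

# View v(X, Y) :- edge(X, Z), edge(Z, Y)  (endpoints of length-two paths)
rules = invert_view("v", ("X", "Y"),
                    [("edge", ("X", "Z")), ("edge", ("Z", "Y"))])
# edge(X, f1(X, Y)) :- v(X, Y)   and   edge(f1(X, Y), Y) :- v(X, Y)
```

Note that the variable Z, absent from the head, is replaced by the same term f1(X, Y) in both inverse rules, exactly as the definition requires.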
Example 19.12 Inverse of views. Let be the following collection of views.
Then consists of the following rules.
Now, the maximally contained rewriting of datalog program using views can easily be constructed for given and .
First, those rules are deleted from the datalog program that contain an edb relation not appearing in the definition of any view. The inverse rules are then added to the datalog program obtained this way, forming the logic program. Note that the remaining edb relations of the original program are idb relations in the logic program, since they appear in the heads of the inverse rules. The names of idb relations are arbitrary, so they can be renamed so that they do not coincide with the names of the edb relations of the logic program. However, this is not done in the following example, for the sake of better understanding.
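The construction above can be sketched as follows. This is a toy Python encoding of our own: a rule is a pair of its head predicate and the list of its body predicates (predicate names only, since only the pruning criterion matters here).

```python
def build_logic_program(datalog_rules, view_edbs, inverse_rules):
    """Keep a rule of the datalog program only if every predicate in its
    body is either an idb relation of the program or an edb relation that
    occurs in the definition of some view; then append the inverse rules."""
    idb = {head for head, _ in datalog_rules}
    kept = [(head, body) for head, body in datalog_rules
            if all(p in idb or p in view_edbs for p in body)]
    return kept + inverse_rules

# Transitive closure program of Example 19.13 plus the two inverse rules
# of the length-two-path view:
prog = [("q", ["edge"]), ("q", ["edge", "q"])]
inv = [("edge", ["v"]), ("edge", ["v"])]
logic_prog = build_logic_program(prog, {"edge"}, inv)
print(len(logic_prog))  # all four rules survive, since edge occurs in the view
```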
Example 19.13 Logic program. Consider the following datalog program that calculates the transitive closure of relation edge.
Assume that only the following materialised view is accessible, which stores the endpoints of paths of length two. If only this view is usable, then the most that can be expected is to list the endpoints of paths of even length. Since the unique edb relation of datalog program is edge, which also appears in the definition of , the logic program is obtained by adding the rules of to .
Let the instance of the edb relation edge of datalog program be the graph shown on Figure 19.4.
Then introduces three new constants. The idb relation edge of logic program is the graph shown on Figure 19.5.
computes the transitive closure of graph . Note that those pairs in the transitive closure that do not contain any of the new constants are exactly the endpoints of the paths of even length of .
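The behaviour described in Example 19.13 can be reproduced by a short Python sketch (our own encoding, on a hypothetical 4-cycle a → b → c → d → a): each view tuple is expanded by the inverse rules into two edges through a fresh Skolem constant, a naive bottom-up fixpoint computes the transitive closure, and filtering out the Skolem constants leaves exactly the endpoints of the even-length paths.

```python
def even_path_endpoints(view_tuples):
    """view_tuples: endpoints of length-two paths.  Returns the pairs of
    the transitive closure that contain no Skolem constants."""
    edges = set()
    for x, y in view_tuples:
        mid = ("f", x, y)  # Skolem constant introduced by the inverse rules
        edges.add((x, mid))
        edges.add((mid, y))
    closure = set(edges)
    while True:  # naive bottom-up evaluation of the transitive closure
        new = {(a, d) for a, b in closure for c, d in closure if b == c}
        if new <= closure:
            break
        closure |= new
    def real(v):
        return not (isinstance(v, tuple) and v[0] == "f")
    return {(a, b) for a, b in closure if real(a) and real(b)}

# Length-two paths of the 4-cycle a -> b -> c -> d -> a:
cycle_view = [("a", "c"), ("b", "d"), ("c", "a"), ("d", "b")]
print(sorted(even_path_endpoints(cycle_view)))
```

On the 4-cycle every vertex is the endpoint of even-length paths to itself and to the opposite vertex, and the sketch returns exactly those eight pairs.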
The result of logic program in Example 19.13 can be calculated by procedure , for example. However, it is not true for logic programs in general that the algorithm terminates. Indeed, consider the logic program
If the edb relation contains the constant , then the output of the program is the infinite sequence . In contrast, the output of the logic program given by the Inverse-rules Algorithm is guaranteed to be finite, thus the computation terminates in finite time.
Theorem 19.31
For arbitrary datalog program and set of conjunctive views , and for finite instances of the views, there exists a unique minimal fixpoint of the logic program ; furthermore, procedures
NaivDatalog
and
SemiNaiveDatalog
give this minimal fixpoint as output.
The essence of the proof of Theorem 19.31 is that function symbols are only introduced by inverse rules, which are in turn not recursive, thus terms containing nested function symbols are not produced. The details of the proof are left for the Reader (Exercise 19.33).
Even if started from the edb relations of a database, the output of a logic program may contain tuples that have function symbols. Thus, a filter is introduced that eliminates the unnecessary tuples. Let database be the instance of the edb relations of datalog program . denotes the set of those tuples from that do not contain function symbols. Let denote the program that computes for a given instance . The proof of the following theorem exceeds the limitations of the present chapter.
Theorem 19.32 For arbitrary datalog program and set of conjunctive views , the logic program is a maximally contained rewriting of using . Furthermore, can be constructed in time polynomial in the sizes of and .
The meaning of Theorem 19.32 is that the simple procedure of adding the inverses of the view definitions to a datalog program results in a logic program that uses the views as much as possible. It is easy to see that can be constructed in time polynomial in the sizes of and , since a unique inverse rule must be constructed for every subgoal.
In order to completely solve the rewriting problem, however, a datalog program needs to be produced that is equivalent with the logic program . The key to this is the observation that contains only finitely many function symbols, furthermore during a bottom-up evaluation like
NaivDatalog
and its versions, nested function symbols are not produced. With proper bookkeeping the appearance of function symbols can be kept track of, without actually producing the tuples that contain them.
The transformation is done bottom-up as in procedure
NaivDatalog
. The function symbol appearing in the idb relation of is replaced by the list of variables . At the same time the name of the idb relation needs to be marked, to remember that the list belongs to function symbol . Thus, new "temporary" relation names are introduced. Consider the rule
of the logic program (19.78) in Example 19.13. It is replaced by rule
Notation means that the first argument of is the same as the first argument of , while the second and third arguments of , together with function symbol , give the second argument of . If a function symbol were to become an argument of an idb relation of during the bottom-up evaluation of , then a new rule is added to the program with appropriately marked relation names.
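The bookkeeping can be sketched as follows. This is an illustrative Python fragment with our own ad-hoc notation: function terms are tuples, and the pattern of function symbols is recorded in the relation name between angle brackets (the book marks relations differently).

```python
def flatten_atom(pred, args):
    """Replace every function term ("f1", v1, ..., vk) in the argument
    list by its variables, and record in the relation name which argument
    positions carried which function symbols."""
    marks, flat = [], []
    for a in args:
        if isinstance(a, tuple):  # function term, e.g. ("f1", "X", "Y")
            marks.append(a[0])
            flat.extend(a[1:])
        else:
            marks.append("-")     # ordinary variable or constant
            flat.append(a)
    return f"{pred}<{','.join(marks)}>", tuple(flat)

# p(U, f1(X, Y)) becomes the marked relation p<-,f1>(U, X, Y)
print(flatten_atom("p", ("U", ("f1", "X", "Y"))))
```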
Example 19.14
Transformation of a logic program into a datalog program. The logic program of Example 19.13 is transformed into the following datalog program by the procedure sketched above. The different phases of the bottom-up execution of
NaivDatalog
are separated by lines.
The datalog program obtained shows clearly which arguments could involve function symbols in the original logic program. However, some rows containing function symbols never yield tuples free of function symbols during the computation of the output of the program.
Relation is called significant, if in the precedence graph of Definition 19.16 there exists an oriented path from to the output relation of . If is not significant, then the tuples of are not needed to compute the output of the program, thus can be eliminated from the program.
Footnote. Here the definition of precedence graph needs to be extended for the edb relations of the datalog program, as well.
Example 19.15 Eliminating non-significant relations. There exists no directed path in the precedence graph of the datalog program obtained in Example 19.14 from relations and to the output relation of the program; thus they are not significant, i.e., they can be eliminated together with the rules that involve them. The following datalog program is obtained:
One more simplification step can be performed. It does not decrease the number of derivations necessary to compute the output, but it avoids redundant copying of data. If is a relation in the datalog program that is defined by a single rule, which in turn contains a single relation in its body, then can be removed from the program and replaced by the relation of the body of the rule defining , having equated the variables accordingly.
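The copy-avoiding simplification can be sketched as follows (a minimal Python illustration under our own encoding; head variables are assumed to be distinct):

```python
def inline_single_rules(defs, atom):
    """defs maps a relation defined by a single one-atom rule to the pair
    (head variables, body atom).  Rewrite `atom` by repeatedly replacing
    such relations with the body atom of their defining rule, renaming
    variables accordingly."""
    pred, args = atom
    while pred in defs:
        head_vars, (body_pred, body_args) = defs[pred]
        subst = dict(zip(head_vars, args))
        pred = body_pred
        args = tuple(subst.get(a, a) for a in body_args)
    return pred, args

# p(X, Y) :- r(Y, X), so the atom p(a, b) unfolds to r(b, a)
defs = {"p": (("X", "Y"), ("r", ("Y", "X")))}
print(inline_single_rules(defs, ("p", ("a", "b"))))
```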
Example 19.16 Avoiding unnecessary data copying. In Example 19.14 the relations and are each defined by a single rule, and these two rules both have a single relation in their bodies. Hence, program (19.83) can be simplified further.
The datalog program obtained in the two simplification steps above is denoted by . It is clear that there exists a one-to-one correspondence between the bottom-up evaluations of and . Since the function symbols in are kept track of, the output instance obtained is guaranteed to be exactly the subset of tuples of the output of that do not contain function symbols.
Theorem 19.33 For arbitrary datalog program that does not contain negations, and set of conjunctive views , the logic program is equivalent with the datalog program .
The main disadvantage of the Bucket Algorithm is that it considers each of the subgoals in isolation, therefore it misses most of the interactions between the subgoals of the query and those of the views. Thus, the buckets may contain many unusable views, and the second phase of the algorithm may become very expensive.
The advantage of the Inverse-rules Algorithm is its conceptual simplicity and modularity. The inverses of the views need to be computed only once; then they can be applied to arbitrary queries given by datalog programs. On the other hand, much of the computational advantage of exploiting the materialised views can be lost. Using the rewriting produced by the algorithm directly for evaluating queries from the views has a significant drawback, since it insists on recomputing the extensions of the database relations.
The
MiniCon
algorithm addresses the limitations of the previous two algorithms. The key idea underlying the algorithm is a change of perspective: instead of building rewritings for each of the query subgoals, it is considered how each of the variables in the query can interact with the available views. The result is that the second phase of
MiniCon
needs to consider drastically fewer combinations of views. In the following we return to conjunctive queries, and for the sake of easier understanding only such views are considered that do not contain constants.
The
MiniCon
algorithm starts out like the Bucket Algorithm, considering which views contain subgoals that correspond to subgoals of the query. However, once the algorithm finds a partial variable mapping from a subgoal in the query to a subgoal in a view , it changes perspective and looks at the variables of the query. The algorithm considers the join predicates in the query – which are specified by multiple occurrences of the same variable – and finds the minimal additional set of subgoals that must be mapped to subgoals in , given that will be mapped to . This set of subgoals and mapping information is called a MiniCon Description (MCD). In the second phase the MCDs are combined to produce the query rewritings. The construction of the MCDs makes the most expensive part of the Bucket Algorithm obsolete, namely the checking of containment between the rewritings and the query, because the rule generating the MCDs ensures that their join gives a correct result.
For a given mapping, a subgoal of view is said to cover a subgoal of query if the mapping takes the latter into the former. and denote the set of variables of the query and that of the view, respectively. In order to prove that a rewriting gives only tuples that belong to the output of the query, a homomorphism must be exhibited from the query onto the rewriting. An MCD can be considered as a part of such a homomorphism; hence, these parts can be put together easily.
The rewriting of query is a union of conjunctive queries using the views. Some of the variables may be equated in the heads of some of the views, as in the equivalent rewriting (19.70) of Example 19.10. Thus, it is useful to introduce the concept of head homomorphism. The mapping is a head homomorphism if it is the identity on variables that do not occur in the head of , but it may equate variables of the head. For every variable of the head of , also appears in the head of , furthermore . Now the exact definition of an MCD can be given.
Definition 19.34 The quadruple is a MiniCon Description (MCD) for query over view , where
is a head homomorphism over ,
is obtained from by applying , that is , where is the set of variables appearing in the head of ,
is a partial mapping from to ,
is a set of subgoals of that are covered by some subgoal of using the mapping (note: not all such subgoals are necessarily included in ).
The procedure constructing MCDs is based on the following proposition.
Claim 19.35 Let be a MiniCon Description over view for query . can be used for a nonredundant rewriting of if the following conditions hold:
C1. for every variable that is in the head of and also in the domain of , its image appears in the head of , furthermore
C2. if does not appear in the head of , then for every subgoal of that contains it the following holds:
1. every variable of appears in the domain of and
2. .
Clause C1 is the same as in the Bucket Algorithm. Clause C2 means that if a variable is part of a join predicate which is not enforced by the view, then the variable must be in the head of the view, so that the join predicate can be applied by another subgoal in the rewriting. The procedure
FormMCDs
gives the usable MiniCon Descriptions for a conjunctive query and set of conjunctive views .
FormMCDs( , )
 1
 2  FOR each subgoal of
 3    DO FOR
 4      DO FOR every subgoal
 5        DO Let be the least restrictive head homomorphism on , such that there exists a mapping with .
 6        IF and exist
 7          THEN Add to any new MCD that can be constructed where:
 8            (a) (respectively, ) is an extension of (respectively, ),
 9            (b) is the minimal subset of subgoals of such that , and satisfy Proposition 19.35, and
10            (c) it is not possible to extend and to and such that (b) is satisfied, and , as defined in (b), is a subset of .
11  RETURN
Consider again query (19.68) and the views of Example 19.10. Procedure
FormMCDs
considers subgoal of the query first. It does not create an MCD for view , because clause C2 of Proposition 19.35 would be violated. Indeed, the condition would require that subgoal be also covered by using the mapping , , since is not in the head of .
Footnote. The case of , is similar.
For the same reason, no MCD will be created for this view even when the other subgoals of the query are considered. In a sense, the MiniCon Algorithm shifts some of the work done by the combination step of the Bucket Algorithm to the phase of creating the MCDs by using
FormMCDs
. The following table shows the output of procedure
FormMCDs
.
Procedure
FormMCDs
includes in only the minimal set of subgoals necessary to satisfy Proposition 19.35. This makes it possible for the second phase of the MiniCon Algorithm to consider only combinations of MCDs that cover pairwise disjoint subsets of the subgoals of the query.
Claim 19.36 Given a query , a set of views , and the set of MCDs for over the views , the only combinations of MCDs that can result in nonredundant rewritings of are of the form , where
C3. contains all the subgoals of , and
C4. for every .
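Conditions C3 and C4 say that the usable combinations are exactly the partitions of the query's subgoals by the covered-subgoal sets of the MCDs. A small Python sketch (MCDs reduced to their covered-subgoal sets; the names and the three-subgoal query are hypothetical) enumerates them:

```python
def partitions(mcds, goals):
    """Yield the tuples of MCDs whose covered-subgoal sets are pairwise
    disjoint (C4) and together cover all subgoals of the query (C3)."""
    def rec(remaining, chosen, start):
        if not remaining:
            yield tuple(chosen)
            return
        for i in range(start, len(mcds)):
            name, g = mcds[i]
            if g <= remaining:  # disjoint from everything chosen so far
                yield from rec(remaining - g, chosen + [(name, g)], i + 1)
    yield from rec(frozenset(goals), [], 0)

query_goals = {"g1", "g2", "g3"}
mcds = [("M1", frozenset({"g1"})),
        ("M2", frozenset({"g2", "g3"})),
        ("M3", frozenset({"g1", "g2"})),
        ("M4", frozenset({"g3"}))]
combos = list(partitions(mcds, query_goals))
print(len(combos))  # only {g1}+{g2,g3} and {g1,g2}+{g3} partition the query
```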
The fact that only such sets of MCDs need to be considered that provide partitions of the subgoals in the query reduces the search space of the algorithm drastically. In order to formulate procedure
CombineMCDs
, another notation needs to be introduced. The mapping of MCD may map several variables of onto the same variable of . An arbitrary representative of such a set of variables is chosen, with the only restriction that if the set contains variables from the head of , then one of those is chosen. Let denote the representative variable of the set containing . In the following, the MiniCon Description is considered extended with to a quintet . If the MCDs are to be combined, and for some
and holds, then in the conjunctive rewriting obtained by the join , and will be mapped to the same variable. Let denote the equivalence relation determined on the variables of by letting two variables be equivalent if they are mapped onto the same variable by , that is, . Let be the set of MCDs obtained as the output of
FormMCDs
.
CombineMCDs( )
 1
 2  FOR such that is a partition of the subgoals of
 3    DO Define a mapping on as follows:
 4      IF there exists a variable in such that
 5        THEN
 6        ELSE is a fresh copy of
 7      Let be the transitive closure of
 8      ( is an equivalence relation of the variables of .)
 9      Choose a representative for each equivalence class of .
10      Define mapping as follows:
11        IF
12          THEN is the representative of the equivalence class of under
13          ELSE
14      Let be given as
15
16  RETURN Answer
The following theorem summarises the properties of the MiniCon Algorithm.
Theorem 19.37 Given a conjunctive query and conjunctive views , both without comparison predicates and constants, the MiniCon Algorithm produces the union of conjunctive queries that is a maximally contained rewriting of using .
The complete proof of Theorem 19.37 exceeds the limitations of the present chapter. However, in Problem 19-1 the reader is asked to prove that the union of the conjunctive queries obtained as output of
CombineMCDs
is contained in .
It must be noted that the worst-case running times of the Bucket Algorithm, the Inverse-rules Algorithm and the MiniCon Algorithm are the same: , where is the number of subgoals in the query, is the maximal number of subgoals in a view, and is the number of views. However, practical test runs show that in the case of a large number of views (300–400 views) the MiniCon Algorithm is significantly faster than the other two.
Exercises
19.31 Prove Theorem 19.25 using Proposition 19.24 and Theorem 19.20.
19.32 Prove the two statements of Lemma 19.26. Hint. For the first statement, substitute the definitions of the views into . Minimise the query obtained, using Theorem 19.19. For the second statement use Proposition 19.24 to prove that there exists a homomorphism from the body of the conjunctive query defining view to the body of . Show that is a good choice.
19.33 Prove Theorem 19.31 using that datalog programs have unique minimal fixpoint.
PROBLEMS

Prove that the output of the MiniCon Algorithm is correct. Hint. It is enough to show that for each conjunctive query given in line 14 of
CombineMCDs
holds. For the latter, construct a homomorphism from to .
19-2
is correct
Prove that each tuple produced by logic program is contained in the output of (part of the proof of Theorem 19.32). Hint. Let be a tuple in the output of that does not contain function symbols. Consider the derivation tree of . Its leaves are literals, since these are the extensional relations of program . If these leaves are removed from the tree, then the leaves of the remaining tree are edb relations of . Prove that the tree obtained is a derivation tree of in the datalog program .
19-3
Datalog views
This problem tries to justify why only conjunctive views were considered. Let be a set of views and a query. For a given instance of the views, the tuple is a certain answer of query if for any database instance such that , holds as well.
a. Prove that if the views of are given by datalog programs, while query is conjunctive and may contain nonequality (≠) predicates, then it is algorithmically undecidable whether, for a given instance of the views, tuple is a certain answer of . Hint. Reduce the Post Correspondence Problem to this question. The Post Correspondence Problem is the following: given two sets of words and over the alphabet , does there exist a sequence of indices (repetitions allowed) such that
The Post Correspondence Problem is a well-known algorithmically undecidable problem. Let the view be given by the following datalog program:
Furthermore, let be the following conjunctive query.
Show that for the instance of that is given by and , the tuple is a certain answer of query if and only if the Post Correspondence Problem with sets and has no solution.
b. In contrast to the undecidability result of a., if is a set of conjunctive views and query is given by a datalog program , then it is easy to decide about an arbitrary tuple whether it is a certain answer of for a given view instance . Prove that the datalog program gives exactly the tuples of the certain answer of as output.
CHAPTER NOTES

There are several dimensions along which the treatments of the problem “answering queries using views” can be classified. Figure 19.6 shows the taxonomy of the work.
The most significant distinction between the different works is whether their goal is data integration or query optimisation and the maintenance of physical data independence. The key difference between these two classes of works is the output of the algorithm for answering queries using views. In the former case, given a query and a set of views , the goal of the algorithm is to produce an expression that references the views and is either equivalent to or contained in . In the latter case, the algorithm must go further and produce a (hopefully optimal) query execution plan for answering using the views (and possibly the database relations). Here the rewriting must be equivalent to in order to ensure the correctness of the plan.
The similarity between these two bodies of work is that they are concerned with the core issue of whether a rewriting of a query is equivalent to or contained in the query. However, while logical correctness suffices in the data integration context, it does not in the query optimisation context, where we also need to find the cheapest plan using the views. The complication arises because the optimisation algorithms need to consider views that do not contribute to the logical correctness of the rewriting but do reduce the cost of the resulting plan. Hence, while the reasoning underlying the algorithms in the data integration context is mostly logical, in the query optimisation case it is both logical and cost-based. On the other hand, an aspect stressed in the data integration context is the importance of dealing with a large number of views, which correspond to data sources. In the context of query optimisation it is generally assumed (though not always!) that the number of views is roughly comparable to the size of the schema.
The works on query optimisation can be classified further into System-R style optimisers and transformational optimisers. Examples of the former are the works of Chaudhuri, Krishnamurty, Potomianos and Shim [ 61 ]; and of Tsatalos, Solomon, and Ioannidis [ 327 ]. Papers of Florescu, Raschid, and Valduriez [ 112 ]; Bello et al. [ 35 ]; Deutsch, Popa and Tannen [ 89 ]; Zaharioudakis et al. [ 354 ]; furthermore Goldstein and Larson [ 136 ] belong to the latter.
Rewriting algorithms in the data integration context are studied in works of Yang and Larson [ 351 ]; Levy, Mendelzon, Sagiv and Srivastava [ 155 ]; Qian [ 280 ]; furthermore Lambrecht, Kambhampati and Gnanaprakasam [ 210 ]. The Bucket Algorithm was introduced by Levy, Rajaraman and Ordille [ 156 ]. The Inverse-rules Algorithm was invented by Duschka and Genesereth [ 93 ], [ 94 ]. The MiniCon Algorithm was developed by Pottinger and Halevy [ 277 ], [ 276 ].
Query answering algorithms and the complexity of the problem are studied in papers of Abiteboul and Duschka [ 2 ]; Grahne and Mendelzon [ 140 ]; furthermore Calvanese, De Giacomo, Lenzerini and Vardi [ 54 ].
The STORED system was developed by Deutsch, Fernandez and Suciu [ 88 ]. Semantic caching is discussed in the paper of Yang, Karlapalem and Li [ 350 ]. Extensions of the rewriting problem are studied in [ 49 ], [ 110 ], [ 120 ], [ 208 ], [ 350 ].
Surveys of the area can be found in works of Abiteboul [ 1 ], Florescu, Levy and Mendelzon [ 111 ], Halevy [ 154 ], [ 153 ], furthermore Ullman [ 330 ].
Research of the authors was (partially) supported by Hungarian National Research Fund (OTKA) grants Nos. T034702, T037846T and T042706.