In chapter “Relational database design” basic concepts of relational databases were introduced, such as relational schema, relation, instance. Databases were studied from the designer point of view, the main question was how to avoid redundant data storage, or various anomalies arising during the use of the database.
In the present chapter the schema is considered to be given and focus is on fast and efficient ways of answering user queries. First, basic (theoretical) query languages and their connections are reviewed in Section 19.1.
In the second part of this chapter (Section 19.2) views are considered. Informally, a view is nothing else, but result of a query. Use of views in query efficiency, providing physical data independence and data integration is explained.
Finally, the third part of the present chapter (Section 19.3) introduces query rewriting.
Consider the database of cinemas in Budapest. Assume that the schema consists of three relations:
The schemata of individual relations are as follows:
Possible values of instances of each relation are shown on Figure 19.1.
Typical user queries could be:
19.1 Who is the director of “Control”?
19.2 List the names address of those theatres where Kurosawa films are played.
19.3 Give the names of directors who played part in some of their films.
These queries define a mapping from the relations of database schema CinePest to some other schema (in the present case to schemata of single relations). Formally, query and query mapping should be distinguished. The former is a syntactic concept, the latter is a mapping from the set of instances over the input schema to the set of instances over the output schema, that is determined by the query according to some semantic interpretation. However, for both concepts the word “query” is used for the sake of convenience, which one is meant will be clear from the context.
In the remaining of this chapter the most important query languages are reviewed. The expressive powers of query languages need to be compared.
Definition 19.2 Let and be query languages (with appropriate semantics). is dominated by ( is weaker, than ), in notation , if for every query of there exists a query , such that . and are equivalent, if and .
there exist in relations Film, Theater and Show tuples , and
put the tuple into the output relation.
denote different variables that take their values from the domains of the corresponding attributes, respectively. Using the same variables implicitly marked where should stand identical values in different tuples.
Conjunctive queries are the simplest kind of queries, but they are the easiest to handle and have the most good properties. Three equivalent forms will be studied, two of them based on logic, the third one is of algebraic nature. The name comes from first order logic expressions that contain only existential quantors (), furthermore consist of atomic expressions connected with logical “and”, that is conjunction.
The tuple is called free tuple if the 's are variables or constants. This is a generalisation of a tuple of a relational instance. For example, the tuple in Example 19.1 is a free tuple.
where , are relation names from , is a relation name not in , are free tuples. Every variable occurring in must occur in one of , as well.
The rule based conjunctive query is also called a rule for the sake of simplicity. is the head of the rule, is the body of the rule, is called a (relational) atom. It is assumed that each variable of the head also occurs in some atom of the body.
A rule can be considered as some tool that tells how can we deduce newer and newer facts, that is tuples, to include in the output relation. If the variables of the rule can be assigned such values that each atom is true (that is the appropriate tuple is contained in the relation ), then tuple is added to the relation . Since all variables of the head occur in some atoms of the body, one never has to consider infinite domains, since the variables can take values from the actual instance queried. Formally. let be an instance over relational schema , furthermore let be a the query given by rule (19.3). Let denote the set of variables occurring in , and let denote the set of constants that occur in . The image of under is given by
An immediate way of calculating is to consider all possible valuations in some order. There are more efficient algorithms, either by equivalent rewriting of the query, or by using some indices.
An important difference between atoms of the body and the head is that relations are considered given, (physically) stored, while relation is not, it is thought to be calculated by the help of the rule. This justifies the names: 's are extensional relations and is intensional relation.
Query over schema is monotone, if for instances and over , implies . is satisfiable, if there exists an instance , such that . The proof of the next simple observation is left for the Reader (Exercise 19.1-1).
Proposition 19.4 shows the limitations of rule based queries. For example, the query Which theatres play only Kurosawa films? is obviously not monotone, hence cannot be expressed by rules of form (19.3).
If the difference between variables and constants is not considered, then the body of a rule can be viewed as an instance over the schema. This leads to a tabular form of conjunctive queries that is most similar to the visual queries (QBE: Query By Example) of database management system Microsoft Access.
Definition 19.5 A tableau over the schema is a generalisation of an instance over , in the sense that variables may occur in the tuples besides constants. The pair is a tableau query if is a tableau and is a free tuple such that all variables of occur in , as well. The free tuple is the summary.
The summary row of tableau query shows which tuples form the result of the query. The essence of the procedure is that the pattern given by tableau is searched for in the database, and if found then the tuple corresponding to is included in the output relation. More precisely, the mapping is an embedding of tableau into instance , if . The output relation of tableau query consists of all tuples that is an embedding of tableau into instance .
The tableau query answers question 19.2 of the introduction.
The syntax of tableau queries is similar to that of rule based queries. It will be useful later that conditions for one query to contain another one can be easily formulated in the language of tableau queries.
A database consists of relations, and a relation is a set of tuples. The result of a query is also a relation with a given attribute set. It is a natural idea that output of a query could be expressed by algebraic and other operations on relations. The relational algebra consists of the following operations.
Footnote. The relational algebra is the monotone part of the (full) relational algebra introduced later.
Selection: It is of form either or , where and are attributes while is a constant. The operation can be applied to all such relations that has attribute (and ), and its result is relation that has the same set of attributes as has, and consists of all tuples that satisfy the selection condition.
Projection: The form of the operation is , , where 's are distinct attributes. It can be applied to all such relations whose attribute set includes each and its result is the relation that has attribute set ,
that is it consists of the restrictions of tuples in to the attribute set .
Natural join: This operation has been defined earlier in chapter “Relational database design”. Its notation is , its input consists of two (or more) relations , , with attribute sets , , respectively. The attribute set of the output relation is .
Renaming: Attribute renaming is nothing else, but an injective mapping from a finite set of attributes into the set of all attributes. Attribute renaming can be given by the list of pairs , where , which is written usually in the form . The renaming operator maps from inputs over to outputs over . If is a relation over , then
Relational algebra queries are obtained by finitely many applications of the operations above from relational algebra base queries, which are
Input relation: .
Single constant: , where is a constant, is an attribute name.
The mapping given by a relational algebra query can be easily defined via induction on the operation tree. It is easy to see (Exercise 19.1-2) that non-satisfiable queries can be given using relational algebra . There exist no rule based or tableau query equivalent with such a non-satisfiable query. Nevertheless, the following is true.
The proof of Theorem 19.6 consists of three main parts:
Rule based Tableau
Satisfiable relational algebra Rule based
Rule based Satisfiable relational algebra
The first (easy) step is left to the Reader (Exercise 19.1-3). For the second step, it has to be seen first, that the language of rule based queries is closed under composition. More precisely, let be a database, be a query over . If the output relation of is , then in a subsequent query can be used in the same way as any extensional relation of . Thus relation can be defined, then with its help relation can be defined, and so on. Relations are intensional relations. The conjunctive query program is a list of rules
where 's are pairwise distinct and not contained in . In rule body only relations and can occur. is considered to be the output relation of , its evaluation is is done by computing the results of the rules one-by-one in order. It is not hard to see that with appropriate renaming the variables can be substituted by a single rule, as it is shown in the following example.
can be written using and only by the first two rules of (19.6)
It is apparent that some variables had to be renamed to avoid unwanted interaction of rule bodies. Substituting expression (19.7) into the third rule of (19.6) in place of , and appropriately renaming the variables
Thus it is enough to realise each single relational algebra operation by an appropriate rule.
Let denote the list of variables (and constants) corresponding to the common attributes of and , let denote the variables (and constants) corresponding to the attributes occurring only in , while denotes those of corresponding to 's own attributes. Then rule gives exactly relation .
Assume that and the selection condition is of form either or , where are attributes is constant. Then
are the rules sought. The satisfiability of relational algebra query is used here. Indeed, during composition of operations we never obtain an expression where two distinct constants should be equated.
If , then
The renaming operation of relational algebra can be achieved by renaming the appropriate variables, as it was shown in Example 19.4.
For the proof of the third step let us consider rule
By renaming the attributes of relations 's, we may assume without loss of generality that all attribute names are distinct. Then can be constructed that is really a direct product, since the the attribute names are distinct. The constants and multiple occurrences of variables of rule (19.9) can be simulated by appropriate selection operators. The final result is obtained by projecting to the set of attributes corresponding to the variables of relation .
Conjunctive queries are a class of query languages that has many good properties. However, the set of expressible questions are rather narrow. Consider the following.
19.4 List those pairs where one member directed the other member in a film, and vice versa, the other member also directed the first in a film.
19.5 Which theatres show “La Dolce Vita” or “Rashomon”?
19.6 Which are those films of Hitchcock that Hitchcock did not play a part in?
19.7 List those films whose every actor played in some film of Fellini.
19.8 Let us recall the game “Chain-of-Actors”. The first player names an actor/actress, the next another one who played in some film together with the first named. This is continued like that, always a new actor/actress has to be named who played together with the previous one. The winner is that player who could continue the chain last time. List those actors/actresses who could be reached by “Chain-of-Actors” starting with “Marcello Mastroianni”.
Question 19.4 can be easily answered if equalities are also allowed rule bodies, besides relational atoms:
Allowing equalities raises two problems. First, the result of the query could become infinite. For example, the rule based query
results in an infinite number of tuples, since variables and are not bounded by relation , thus there can be an infinite number of evaluations that satisfy the rule body. Hence, the concept of domain restricted query is introduced. Rule based query is domain restricted, if all variables that occur in the rule body also occur in some relational atom.
The second problem is that equality atoms may cause the body of a rule become unsatisfiable, in contrast to Proposition 19.4. For example, query
is domain restricted, however if and are distinct constants, then the answer will be empty. It is easy to check whether a rule based query with equality atoms is satisfiable.
1 Compute the transitive closure of equalities of the body of . 2
IFTwo distinct constants should be equal by transitivity 3
RETURN“Not satisfiable.” 4
It is also true (Exercise 19.1-4) that if a rule based query that contains equality atoms is satisfiable, then there exists a another rule based query without equalities that is equivalent with .
The question 19.5 cannot be expressed with conjunctive queries. However, if the union operator is added to relational algebra, then 19.5 can be expressed in that extended relational algebra:
Rule based queries are also capable of expressing question 19.5 if it is allowed that the same relation is in the head of many distinct rules:
Non-recursive datalog program is a generalisation of this.
where no relation of occurs in a head, the same relation may occur in the head of several rules, furthermore there exists an ordering of the rules such that the relation in the head of does not occur in the body of any rule for .
The semantics of the non-recursive datalog program (19.15) is similar to the conjunctive query program (19.5). The rules are evaluated in the order of Definition 19.7, and if a relation occurs in more than one head then the union of the sets of tuples given by those rules is taken.
The union of tableau queries is denoted by . It is evaluated by individually computing the result of each tableau query , then the union of them is taken. The following holds.
The proof of Theorem 19.8 is similar to that of Theorem 19.6 and it is left to the Reader (Exercise 19.1-5). Let us note that the expressive power of the union of tableau queries is weaker. This is caused by the requirement having the same summary row for each tableau. For example, the non-recursive datalog program query
cannot be realised as union of tableau queries.
The query 19.6 is obviously not monotone. Indeed, suppose that in relation Film there exist tuples about Hitchcock's film Psycho, for example (“Psycho”,“A. Hitchcock”,“A.Perkins”), (“Psycho”,“A. Hitchcock”,“J. Leigh”), , however, the tuple (“Psycho”,“A.Hitchcock”,“A. Hitchcock”) is not included. Then the tuple (“Psycho”) occurs in the output of query 19.6. With some effort one can realize however, that Hitchcock appears in the film Psycho, as “a man in cowboy hat”. If the tuple (“Psycho”,“A. Hitchcock”,“A. Hitchcock”) is added to relation Film as a consequence, then the instance of schema CinePest gets larger, but the output of query 19.6 becomes smaller.
It is not too hard to see that the query languages discussed so far are monotone, hence query 19.6 cannot be formulated with non-recursive datalog program or with some of its equivalents. Nevertheless, if the difference operator is also added to relation algebra, then it becomes capable of expressing queries of type 19.6 For example,
realises exactly query 19.6. Hence, the (full) relational algebra consists of operations . The importance of the relational algebra is shown by the fact, that Codd calls a query language relationally complete, exactly if for all relational algebra query there exists , such that .
If negative literals, that is atoms of the form are also allowed in rule bodies, then the obtained non-recursive datalog with negation, in notation nr-datalog is relationally complete.
where is a relation, is a free tuple, 's are literals, that is expression of form or , such that is a free tuple for . does not occur in the body of the rule. The rule is domain restricted, if each variable that occurs in the rule also occurs in a positive literal (expression of the form ) of the body. Every nr-datalog rule is considered domain restricted, unless it is specified otherwise.
The semantics of rule (19.18) is as follows. Let be a relational schema that contains all relations occurring in the body of , furthermore, let be an instance over . The image of under is
A nr-datalog program over schema is a collection of nr-datalog rules
where relations of schema do not occur in heads of rules, the same relation may appear in more than one rule head, furthermore there exists an ordering of the rules such that the relation of the head of rule does not occur in the head of any rule if .
The computation of the result of nr-datalog program (19.20) applied to instance over schema can be done in the same way as the evaluation of non-recursive datalog program (19.15), with the difference that the individual nr-datalog rules should be interpreted according to (19.19).
expresses query 19.6. Query 19.7 is realised by the nr-datalog program
then (19.23) answers the following query (assuming that all films have unique director)
19.9 List all those films whose every actor played in each film of Fellini,
instead of query 19.7.
It is easy to see that every satisfiable nr-datalog program that contains equality atoms can be replaced by one without equalities. Furthermore the following proposition is true, as well.
Query 19.8 cannot be formulated using the query languages introduced so far. Some a priori information would be needed about how long a chain-of-actors could be formed starting with a given actor/actress. Let us assume that the maximum length of a chain-of-actors starting from “Marcello Mastroianni” is 117. (It would be interesting to know the real value!) Then, the following non-recursive datalog program gives the answer.
Footnote. Arbitrary comparison atoms can be used, as well, similarly to equality atoms. Here makes it sure that all pairs occur at most once in the list.
It is much easier to express query 19.8 using recursion. In fact, the transitive closure of the graph Film-partner needs to be calculated. For the sake of simplicity the definition of Film-partner is changed a little (thus approximately doubling the storage requirement).
The datalog program (19.25) is recursive, since the definition of relation Chain-partner uses the relation itself. Let us suppose for a moment that this is meaningful, then query 19.8 is answered by rule
is a datalog rule, if , the 's are relation names, the 's are free tuples of appropriate length. Every variable of has to occur in one of , as well. The head of the rule is , the body of the rule is . A datalog program is a finite collection of rules of type (19.27). Let be a datalog program. The relation occurring in is extensional if it occurs in only rule bodies, and it is intensional if it occurs in the head of some rule.
If is a valuation of the variables of rule (19.27), then is a realisation of rule (19.27). The extensional (database) schema of consists of the extensional relations of , in notation . The intensional schema of , in notation is defined similarly as consisting of the intensional relations of . Let . The semantics of datalog program is a mapping from the set of instances over to the set of instances over . This can be defined proof theoretically, model theoretically or as a fixpoint of some operator. This latter one is equivalent with the first two, so to save space only the fixpoint theoretical definition is discussed.
There are no negative literals used in Definition 19.11. The main reason of this is that recursion and negation together may be meaningless, or contradictory. Nevertheless, sometimes negative atoms might be necessary. In those cases the semantics of the program will be defined specially.
Let be a datalog program, be an instance over . Fact , that is a tuple consisting of constants is an immediate consequence of and , if either for some relation , or is a realisation of a rule in and each is in . The immediate consequence operator is a mapping from the set of instances over to itself. consists of all immediate consequences of and .
The definition of implies that . Using Proposition 19.12 it follows that
By the monotonicity of and (19.29) we have
that is is a fixpoint. It is easy to see that every fixpoint that contains , also contains for all , that is it contains , as well.
is not optimal, of course, since every fact that becomes included in is calculated again and again at every further execution of the
The idea of
is that it tries to use only recently calculated new facts in the
loop, as much as it is possible, thus avoiding recomputation of known facts. Consider datalog program with , and . For a rule
of where and , the following rules are constructed for and
Relation denotes the change of in iteration . The union of rules corresponding to in layer is denoted by , that is rules of form (19.32) for , . Assume that the list of relations occurring in rules defining the relation is . Let
denote the set of facts (tuples) obtained by applying rules (19.32) to input instance and to relations . The input instance is the actual value of the relations of .
1 those rules of whose body does not contain relation 2
DO4 5 6
FOR8 are the relations of the rules defining . 9
DO10 11 12
UNTILfor all 13
Proof. We will show by induction on that after execution of the loop of lines 6–12 times the value of is , while is equal to for arbitrary . is the result obtained for starting from and applying the immediate consequence operator -times.
For , line 4 calculates exactly for all . In order to prove the induction step, one only needs to see that is exactly equal to , since in lines 9–10 procedure
constructs -t and using that. The value of is , by the induction hypothesis. Additional new tuples are obtained only if that for some relation defining such tuples are considered that are constructed at the last application of , and these are in relations , also by the induction hypothesis.
The halting condition of line 12 means exactly that all relations are unchanged during the application of the immediate consequence operator , thus the algorithm has found its minimal fixpoint. This latter one is exactly the result of datalog program on input instance according to Definition 19.14.
eliminates a large amount of unnecessary calculations, nevertheless it is not optimal on some datalog programs (Exercise 19.1-7). However, analysis of the datalog program and computation based on that can save most of the unnecessary calculations.
Definition 19.16 Let be a datalog program. The precedence graph of is the directed graph defined as follows. Its vertex set consists of the relations of , and is an arc for if there exists a rule in whose head is and whose body contains . is recursive, if contains a directed cycle. Relations and are mutually recursive if the belong to the same strongly connected component of .
Being mutually recursive is an equivalence relation on the set of relations . The main idea of procedure
is that for a relation only those relations have to be computed “simultaneously” with that are in its equivalence class, all other relations defining can be calculated “in advance” and can be considered as relations.
1 Determine the equivalence classes of under mutual recursivity. 2 List the equivalence classes according to a topological order of . 3 There exists no directed path from to in for all . 4
Semi-Naive-Datalogto compute relations of taking relations of as relations for .
Lines 1–2 can be executed in time using depth first search, where and denote the number of vertices and edges of graph , respectively. Proof of correctness of the procedure is left to the Reader (Exercise 19.1-8).
In the present section we return to conjunctive queries. The costliest task in computing result of a query is to generate the natural join of relations. In particular, if there are no indexes available on the common attributes, hence only procedure
IFand can be joined 5
It is clear that the running time of
is . Thus, it is important that in what order is a query executed, since during computation natural joins of relations of various sizes must be calculated. In case of tableau queries the Homomorphism Theorem gives a possibility of a query rewriting that uses less joins than the original.
Let be queries over schema . contains , in notation , if for all instances over schema holds. according to Definition 19.1 iff and . A generalisation of valuations will be needed. Substitution is a mapping from the set of variables to the union of sets of variables and sets of constants that is extended to constants as identity. Extension of substitution to free tuples and tableaux can be defined naturally.
Proof. Assume first, that is a homomorphism from to , and let be an instance over schema . Let . This holds exactly if there exists a valuation that maps tableau into and . It is easy to see that maps tableau into and , that is . Hence, , which in turn is equivalent with .
On the other hand, let us assume that . The idea of the proof is that both, and are applied to the “instance” . The output of is free tuple , hence the output of also contains , that is there exists a embedding of into that maps to . To formalise this argument the instance isomorphic to is constructed.
Let be the set of variables occurring in . For all let be constant that differs from all constants appearing in or , furthermore . Let be the valuation that maps to , furthermore let . is a bijection from to and there are no constants of appearing in , hence well defined on the constants occurring in .
It is clear that , thus using is obtained. That is, there exists a valuation that embeds tableau into , such that . It is not hard to see that is a homomorphism from to .
According to Theorem 19.6 tableau queries and satisfiable relational algebra (without subtraction) are equivalent. The proof shows that the relational algebra expression equivalent with a tableau query is of form , where is the number of rows of the tableau. It implies that if the number of joins is to be minimised, then the number of rows of an equivalent tableau must be minimised.
The tableau query is minimal, if there exists no tableau query that is equivalent with and , that is has fewer rows. It may be surprising, but it is true, that a minimal tableau query equivalent with can be obtained by simply dropping some rows from .
Proof. Let be a minimal query equivalent with . According to the Homomorphism Theorem there exist homomorphisms from to , and from to . Let . It is easy to check that and . But is minimal, hence is minimal, as well.
over the schema of attribute set . The tableau query corresponding to is the following tableau :
Such a homomorphism is sought that maps some rows of to some other rows of , thus sort of “folding” the tableau. The first row cannot be eliminated, since the homomorphism is an identity on free tuple , thus must be mapped to itself. The situation is similar with the third row, as the image of is itself under any homomorphism. However the second row can be eliminated by mapping to and to , respectively. Thus, the minimal tableau equivalent with consists of the first and third rows of . Transforming back to relational algebra expression,
The next theorem states that the question of tableau containment and equivalence is NP-complete, hence tableau minimisation is an NP-hard problem.
19.12 Assume that is obtained from by removing some free tuples. Is it true then that ?
problem will be reduced to the various tableau problems. The input of
problem is a finite set , and a collection of its subsets . It has to be determined whether there exists , such that subsets in cover exactly (that is, for all there exists exactly one such that ).
is known to be an NP-complete problem.
Let be an input of
. A construction is sketched that produces a pair of tableau queries to in polynomial time. This construction can be used to prove the various NP-completeness results.
Let the schema consist of the pairwise distinct attributes . and are tableau queries over schema such that the summary row of both of them is free tuple , where are pairwise distinct variables.
Let be another set of pairwise distinct variables. Tableau consists of rows, for each element of corresponds one. stands in column of attribute in the row of , while stands in column of attribute for all such that holds. In other positions of tableau pairwise distinct new variables stand.
Similarly, consists of rows, one corresponding to each element of . stands in column of attribute in the row of for all such that , furthermore stands in the column of attribute , for all . In other positions of tableau pairwise distinct new variables stand.
The NP-completeness of problem 19.10 follows from that has an exact cover using sets of if and only if holds. The proof, and the proof of the NP-completeness of problems 19.11 and 19.12 are left to the Reader (Exercise 19.1-9).
19.1-1 Prove Proposition 19.4, that is every rule based query is monotone and satisfiable. Hint. For the proof of satisfiability let be the set of constants occurring in query , and let be another constant. For every relation schema in rule (19.3) construct all tuples , where , and is the arity of . Let be the instance obtained so. Prove that is nonempty.
19.1-4 Prove that every rule based query with equality atoms is either equivalent with the empty query , or there exists a rule based query without equality atoms such that . Give a polynomial time algorithm that determines for a rule based query with equality atoms whether holds, and if not, then constructs a rule based query without equality atoms, such that .
For every relation of the fact is in iff it is in , furthermore
for every relation of every fact constructed from constants of is in .
19.1-8 Prove that procedure
stops in finite time for all inputs, and gives correct result. Give an example of an input on which
calculates less number of rows multiple times than
Prove that for tableau queries and of the proof of Theorem 19.20 there exists a homomorphism from to if and only if the
problem has solution.
Prove that the decision problems 19.11 and 19.12 are NP-complete.