Chapter 20. Semi-structured Databases

The use of the internet and the development of the theory of databases mutually affect each other. The contents of web sites are usually stored by databases, while the web sites and the references between them can also be considered a database which has no fixed schema in the usual sense. The contents of the sites and the references between sites are described by the sites themselves, therefore we can only speak of semi-structured data, which can be best characterized by directed labeled graphs. In case of semi-structured data, recursive methods are used more often for giving data structures and queries than in case of classical relational databases. Different problems of databases, e.g. restrictions, dependencies, queries, distributed storage, authorities, uncertainty handling, must all be generalized according to this. Semi-structuredness also raises new questions. Since queries not always form a closed system like they do in case of classical databases, that is, the applicability of queries one after another depends on the type of the result obtained, therefore the problem of checking types becomes more emphasized.

The theoretical establishment of relational databases is closely related to finite modelling theory, while in case of semi-structured databases, automata, especially tree automata are most important.

20.1. 20.1 Semi-structured data and XML

By semi-structured data we mean a directed rooted labeled graph. The root is a special node of the graph with no entering edges. The nodes of the graph are objects distinguished from each other using labels. The objects are either atomic or complex. Complex objects are connected to one or more objects by directed edges. Values are assigned to atomic objects. Two different models are used: either the vertices or the edges are labeled. The latter one is more general, since an edge-labeled graph can be assigned to all vertex-labeled graphs in such a way that the label assigned to the edge is the label assigned to its endpoint. This way we obtain a directed labeled graph for which all inward directed edges from a vertex have the same label. Using this transformation, all concepts, definitions and statements concerning edge-labeled graphs can be rewritten for vertex-labeled graphs.

Figure 20.1.  Edge-labeled graph assigned to a vertex-labeled graph.

Edge-labeled graph assigned to a vertex-labeled graph.


The following method is used to gain a vertex-labeled graph from an edge-labeled graph. If edge has label , then remove this edge, and introduce a new vertex with label , then add edges and . This way we can obtain a vertex-labeled graph of nodes and edges from an edge-labeled graph of vertices and edges. Therefore all algorithms and cost bounds concerning vertex-labeled graphs can be rewritten for edge-labeled graphs.

Figure 20.2.  An edge-labeled graph and the corresponding vertex-labeled graph.

An edge-labeled graph and the corresponding vertex-labeled graph.


Since most books used in practice use vertex-labeled graphs, we will also use vertex-labeled graphs in this chapter.

The XML (eXtensible Markup Language) language was originally designed to describe embedded ordered labeled elements, therefore it can be used to represent trees of semi-structured data. In a wider sense of the XML language, references between the elements can also be given, thus arbitrary semi-structured data can be described using the XML language.

The medusa.inf.elte.hu/forbidden site written in XML language is as follows. We can obtain the vertex-labeled graph of Figure 20.3 naturally from the structural characteristics of the code.

       ≤HTML≥          ≤HEAD≥             ≤TITLE≥403 Forbidden≤/TITLE≥          ≤/HEAD≥          ≤BODY≥             ≤H1≥Forbidden≤/H1≥             You don't have permission to access /forbidden.          ≤ADDRESS≥Apache Server at medusa.inf.elte.hu ≤/ADDRESS≥          ≤/BODY≥       ≤/HTML≥

Figure 20.3.  The graph corresponding to the XML file “forbidden”.

The graph corresponding to the XML file “forbidden”.

Exercises

20.1-1 Give a vertex-labeled graph that represents the structure and formatting of this chapter.

20.1-2 How many different directed vertex-labeled graphs exist with vertices, edges and possible labels? How many of these graphs are not isomorphic? What values can be obtained for , and ?

20.1-3 Consider a tree in which all children of a given node are labeled with different numbers. Prove that the nodes can be labeled with pairs , where and are natural numbers, in such a way that

a. for every node .

b. If is a descendant of , then .

c. If and are siblings and , then .