 |
|
Graphs in SQL
SQL Tips by Donald Burleson |
Graphs are much
more challenging than trees. Even something as basic as a planar
graph drawing is still an open research problem. There are few graph
encodings and all of them are volatile. The set of graph problems is
much richer and those problems are usually difficult. Applied to
graphs the cowboy programmer attitude simply does not work.
This chapter will
offer information regarding the different realms of graph methods:
the recursive with kingdom, the connect by county, and
the materialized transitive closure province. It will show typical
graph queries, beginning with the number of connected components and
ending with aggregated totals.
Schema
Design
Being cautioned
about the complexity of graph problems, this field should be
approached carefully. Its thorough examination should begin with the
schema design. When designing a schema, the first question to ask is
how are graphs defined?
Graphs can be
directed and undirected. This study is devoted almost exclusively to
directed graphs; this is why the adjective is normally omitted.
The directed graph definition is nearly ubiquitous.
Graph Definition
Graph is a set of nodes
connected by edges. Each edge is an ordered pair of nodes,
to which are referred in this chapter as head and tail.
It is straightforward to carry over
this graph definition to the schema design:
table Nodes
(
id integer primary key,
-
);
table Edges
(
head integer references Nodes,
tail integer references Nodes,
?
);
Nodes and edges in a graph are normally weighted, hence the
ellipsis in the schema definition. For example, in a network of
cities connected by roads, each edge is labeled with the distance.
Tree
Constraint
Representing trees
as graphs is a very common database application programming
practice. Consider the familiar Emp table:
select
ename, empno, mgr from Emp
Tree as Graph
In practice, tree schema
design is often reduced to a single table
table
Tree (
id integer primary key,
parentId integer references Tree,
?
);
This is just a
sloppy programming style. The design with Nodes
and Edges
is clean - it separates the two concepts. With a single table it
is easy to get confused when writing a hierarchical query. Also,
consider the tree root node. It has to refer to parentId
= NULL, whereas the
design with Nodes and Edges does not require NULLs.
The most important
concern, however, is that the graph schema, allows graphs which are
not trees. Consider a cycle, for example:
In order to
describe a tree structure in graph terms, a couple of auxiliary
definitions are needed.
1
Directed path is a sequence of edges such that the
previous edge head is the same as the next edge tail.
2
Undirected path is a sequence of edges such that the
previous edge head or tail coincides with the next edge head or
tail.
An undirected path
concept is needed to define a connected graph - a graph where
there is a path between every pair of nodes. Otherwise, by path
we would understand the directed path. For example, the graph in
Figure 6.1 is connected.
Cycle is a
closed path. In a closed path the last edge head is the same as the
first edge tail. An acyclic graph does not contain cycles.
The tree
definition, then, is essentially a constrained graph.
Tree as Subclass of Graph
Tree is an acyclic
connected graph such that any two edges have distinct heads.
While the condition
that any two edges have distinct heads could be declared
effortlessly as a unique key constraint on the head column,
the other two constraints are tough. In Chapter 4 we studied many
tricks which allowed to enforce complex constraints via materialized
views. If there is any hope leveraging that approach for enforcing
tree constraint, then we?d better figure out how to query cycles,
and calculate a number of connected components in a graph, as well.
This does not look
very promising. Therefore, we have to give up on enforcing tree
constraint in a graph model. To be fair, though, only a minority of
database application programmers would insist that enforcing tree
constraint is a matter of life-or-death. More important issues are:
1)
How is a graph model queried?
2)
Are the query methods efficient?