 |
|
Recursive SQL
SQL Tips by Donald Burleson |
DB2 and SQL Server
2005 support the ANSI SQL standard recursive SQL, which renders the
transitive closure effortlessly:
with
TransClosedEdges (tail, head) as
( select tail, head from Edges
union all
select e.tail, ee.head from Edges e, TransClosedEdges ee
where e.head = ee.tail
)
select distinct * from TransClosedEdges
This query looks
artificial at first. It requires a certain educational background to
fully appreciate it.
Consider the
adjacency matrix of a graph. It is a square matrix with dimensions
equal to the number of nodes. It is conventional to enumerate graph
nodes with numbers from 1 to N, therefore, matching
nodes with matrix columns and rows. With this arrangement matrix,
entry aij naturally correspond to an edge from
node i to node j. If there is indeed such an edge in a
graph, then we define aij=1; otherwise, aij=0.
For our purposes
the powers of adjacency matrix are especially interesting. The entry
in column i and row j of the adjacency matrix raised
in the n-th power is the number of paths of length n
from node i to node j.
Although the square
of the adjacency matrix in Figure 6.7 is the adjacency matrix of the
graph shown in Figure 6.8, in general, the power of the adjacency
matrix does not have be an adjacency matrix anymore.
There are various
ways to fix this problem, and these ways will be explained
later. For now, given the adjacency matrix
A, consider a formal series:
By adding the
matrix powers An what kind of matrix
TA is produced? The
reader can easily convince himself that an entry in column i,
row j of matrix An
is the number of paths of length n from node i to node
j in the original graph. Then, an entry in matrix
TA is the number of
paths of any length from node i to node j.
For adjacency
matrices corresponding to directed acyclic graphs the powers would
evaluate to 0 for a sufficiently
large n, and the formal series
TA is finite. In other words, all the paths in
a directed acyclic graph have their length bounded.
Given a transitive
closure series matrix TA,
if any nonzero number is changed into 1, then
TA can be converted
and obtained into an adjacency matrix!
Next, the matrix
powers sum is arranged a little bit differently:
The series in the
parenthesis is again TA,
hence:
It is this
expression that is compared against the recursive with query
in the beginning of the section. First, consider the product of
matrices A and
TA. Matrix
multiplication in SQL can be expressed as:
table A (
i integer, -- column i
j integer, -- column j
val number -- value of A(i,j)
);
table T_A (
i integer,
j integer,
val number
);
select A.i
AS i, T_A.j AS j, sum(A.val*T_A.val) AS val
from A, T
where A.j=T_A.i
group by A.i, T_A.j
The join between A and
TA looks similar to the
join between Edges and TransClosedEdges in the
recursive SQL query that was introduced in the beginning of the
section. It is the aggregation part that appears to make them look
alike.
This is not the
only way to multiply the matrices in SQL, however. Remember that
products and sums of adjacency matrices have nonnegative integer
entries. Therefore, instead of having a single row (i,j,val),
we can have a bag of val identical rows of (i,j).
Schema design, once again, is important!
Counting with Bags
SQL operates with bags of
values, which are not much different from numbers in base-1 number
system. Any record with a nonnegative count field can be converted
into a bag of identical records without this field. Bags are added
with the union all operator, e.g.
select
head, tail from Edges1
union
all
select
head, tail from Edges2
we
multiply them with Cartesian product, e.g.
select
e1.head, e1.tail, e2.head, e2.tail
from
Edges1 e1, Edges2 e2
The bag result could be
aggregated back into a set of records (by counting).
In a bag design
matrix, multiplication reduces to:
table A (
i integer, -- column i
j integer -- column j
);
table T_A (
i integer,
j integer
);
select A.i
AS i, T_A.j AS j
from A, T
where A.j=T_A.i
The right side of the equation that defines a transitive
closure
could be written in
full in SQL as:
select i, j
from A
union all
select A.i AS i, T_A.j AS j
from A, T
where A.j=T_A.i
After renaming
variables appropriately, it becomes indistinguishable from the
recursive view TransClosedEdges definition in the query:
with
TransClosedEdges (tail, head) as
( select tail, head from Edges
union all
select e.tail, ee.head from Edges e, TransClosedEdges ee
where e.head = ee.tail
)
select distinct * from TransClosedEdges
The outer query
transforms a bag back into a set because it is of interest to know
if there are paths from node tail to node head, rather
than their exact number.
Given the matrix
interpretation of the recursive transitive closure query, the matrix
evaluation can be translated into the SQL execution steps as
follows:
A reader who is
already familiar with the recursive SQL from another source may have
already learned an entirely different algorithm. SQL, however, is
about declarative programming and not about the algorithms. Any
algorithm would do as long as it produces the correct result set.
Recursive SQL is known to have many execution strategies: na?e,
incremental (semi-naive), etc. Our algorithm could be coined as
matrix evaluation.
There is a pitfall,
however. The query would never terminate on graphs with cycles.
Indeed, a cycle that passes through node i would eventually
produce the edge (tail=i, head=i) in the TransClosedEdges
relation. The TransClosedEdges relation never shrinks;
therefore this record would remain in the TransClosedEdges.
Then, the join at step 2 would never be empty.
One of the
solutions to the cycle problem is suggested in the DB2 Cookbook
by Graeme Birchall. Recursive with construction is very
powerful since extra columns may be introduced and those columns
would be calculated recursively! One such column fits naturally into
our query - the path expression:
with
TransClosedEdges (tail, head) as
( select tail, head, tail||?.?||head AS path
from
Edges
union all
select e.tail, ee.head, e.tail||?.?||ee.path AS path
from
Edges e, TransClosedEdges ee
where e.head = ee.tail
)
select distinct * from TransClosedEdges
Then Graeme goes on
introducing a function called LOCATE_BLOCK, which could be
used in the where clause as an indicator that the current
node is in the path already:
with
TransClosedEdges (tail, head) as
( select tail, head, tail||?.?||head AS path
from
Edges
union all
select e.tail, ee.head, e.tail||?.?||ee.path AS path
from
Edges e, TransClosedEdges ee
where
e.head = ee.tail
and
LOCATE_BLOCK(e.head, path) = 0
)
select distinct * from TransClosedEdges
There is a more
satisfactory solution to the cycle problem, though. Both DB2 and SQL
Server slightly deviated from the ANSI SQL standard which demands a
union in the recursive query definition. With the set semantics, the
transitive closure relation TransClosedEdges cannot grow
larger than a complete graph -- a graph in which every pair of nodes
is connected.
Let's complete the
section with simple wisdom about performance. Clearly, the
Edges.head column has to be indexed if this query is to be
scaled to a hierarchy of any significant size. The same comment
applies to the Oracle solution discussed next.
Connect By
There are several
ways to query a transitive closure in Oracle, beginning with parsing
the sys_connect_by_path pseudo column and ending with firing
the corellated connect by subquery:
select
a.tail, b.head from Edges a, Edges b
where b.head in (
select head from Edges
connect by prior head = tail
start with tail = a.tail
)
Without question the
most satisfactory method both aesthetically and performance-wise is:
select
connect_by_root(tail), tail
from Edges
connect by prior head = tail
This transitive closure query is succinct, but still the
syntax could be improved. Apparently, the designers were preoccupied
with tree problems, hence the connect_by_root pseudo column.
There is no concept of the root node anywhere in the transitive
closure problem scope. Likewise, the concept of the prior
edge belongs to the solution space rather than is inherent to the
transitive closure problem. Overall, the transitive closure is
defined symmetrically in terms of the head and tail,
but the query above is skewed.
The cycle issue is
no-brainer, as there is a dedicated keyword to take care of it:
select
connect_by_root(tail), tail
from Edges
connect by nocycle prior head = tail