

Transitive Closure
SQL Tips by Donald Burleson 
There are several
key graph concepts that would guide your intuition when writing
queries on graphs:
1)
Reflexive closure of a graph is built by adding missing loops  edges with the same endpoints. Reflexive closure is expressed
easily in SQL:
select
head, tail from Edges  original relation
union
select head, head from Edges ? extra loops
union
select tail, tail from Edges ? more loops
2)
Symmetric closure of a (directed) graph is built by adding an
inversely oriented edge for each edge in the original graph. In SQL:
select
head, tail from Edges  original relation
union
select tail, head from Edges ? inverse arrows
3)
Transitive closure of a (directed) graph is generated by
connecting edges into paths and creating a new edge with the tail
being the beginning of the path and the head being the end. Unlike
the previous two cases, a transitive closure cannot be expressed
with bare SQL essentials  the select, project, and join relational
algebra operators.
For now, let's
assume that we know how to query transitive closure, and demonstrate
armed with these definitions how some problems can be approached.
Here is a puzzle from the comp.databases.oracle.misc forum:
Suppose I have a
table that contains Account# & RelatedAccount# (among other things).
How could I use
CONNECT BY & START WITH in a query to count relationships or
families. For example, in
ACCT
REL_ACCT
Bob Mary
Bob Jane
Jane Bob
Larry Moe
Curly Larry
there are 2
relationship sets (Bob,Mary,Jane & Larry,Moe,Curly). If I
added
Curly
Jane
then there'd be
only 1 larger family. Can I use CONNECT BY & START with to detect
such relationships and count them? In my case I'd be willing to go
any number of levels deep in the recursion.
By ignoring Oracle
specific SQL keywords as inessential to the problem at hand and
using proper graph terminology, the question can be formulated into
just one line:
Find a number of
connected components in a graph.
The Connected
component of a graph is a set of nodes reachable from each
other. A node is reachable from another node if there is an
undirected path between them.
Reachability is an
equivalence relation: it is reflective, symmetric, and transitive.
The reachability relation is obtained by closing the Edges
relation to become reflective, symmetric, and transitive as shown in
Figure 6.5.
Now return to the
problem of finding the number of connected components and assume
that the reachability relation EquivalentNodess has already been
calculated. Then, simply select the smallest node from each
component. Informally this would read:
Select node(s)
such that there is no node with smaller label reachable from it.
Count them.
Formally:
select
count(distinct tail) from EquivalentNodes e
where not exists (
select * from EquivalentNodes ee
where ee.head<e.tail and e.tail=ee.tail
);
Equivalence Relation and
Group By (cont)
In one of the Chapter 1
sidebars we have attributed the incredible efficiency of the
group by
operator to its proximity to one of the most fundamental
mathematical constructions  the equivalence relation. There are
two ways to define an equivalence relation.
The first way is
leveraging the existing equality operator on a domain of values.
The second way is defining an equivalence relation explicitly, as
a set of pairs. The standard group by
operator is not able to understand an equivalence relation defined
explicitly  this is the essence of the problem, which has just
been solved.
Being able to query
the number of connected components earned us an unexpected bonus: a
connected graph can be redefined as a graph that has a single
connected component. And, a connected graph with N nodes and
N1 edges must be a tree. Thus, counting nodes and edges
together with a transitive closure is another opportunity to enforce
tree constraint.
Now that some
important graph closure properties have been established, everything
is ready for transitive closure implementations. Unfortunately, this
story has to branch here since database vendors approach the
hierarchical query differently.