 |
|
Hierarchical Weighted Total
SQL Tips by Donald Burleson |
Before exploring
aggregation on graphs, let's have a quick look at aggregation on
trees. Aggregation on trees is much simpler and has a designated
name: hierarchical total. A typical query is:
Find combined
salary of all the employees under direct and indirect supervision of
King.
This query,
however, has no inherent complexity, and it effectively reduces to
the familiar task of finding a set of all the node's descendants.
In graph context,
the hierarchical total query is commonly referred to as the bill of
materials (BOM). Consider bicycle parts assembly:
Parts with the same
name are assumed to be identical. They are modeled as nodes in an
acyclic directed graph. The edges specify the construction flow:
1 Assemble each wheel from
the required parts (including reflectors).
2
Assemble the frame from the required components (including
reflectors).
3
Assemble a bicycle from the required parts (including frame and
wheels).
Suppose the total
list of parts for the bike assembly is to be ordered. How is the
quantity for each part calculated? Unlike the hierarchical total for
a tree, the quantities cannot simply be added as they multiply along
each path.
Therefore, there
are two levels of aggregation here, multiplication of the quantities
along each path and summation along each alternative path.
Aggregation on Graphs
Two levels of aggregation
fit naturally into in graphs queries. Consider finding a shortest
path between two nodes. First, the distances are added along each
path, and a path is chosen with minimal length.
Double aggregation
is not something unique to graph queries, however. Consider the
following query:
Find the sum of
the salaries grouped by department. Select maximum of them.
When expressing
such a query in SQL, the first sentence is accommodated as an inner
subquery inside the outer query corresponding to the second
sentence:
select
max(salaryExpences) from (
select deptno, sum(sal) salaryExpenses
from Emp
group by dept
)
The hierarchical
weighted total query has the same structure. The first level of
aggregation where edges are joined into paths is analogous to the
group by subquery from the salary expense query. Formally, it is
a transitive closure, which is enhanced with additional aggregates
or a generalized transitive closure.
The following is a
hypothetical SQL syntax for a generalized transitive closure:
select
distinct first(Part), last(SubPart), product
(Quantity)
from AssemblyEdges
connect by prior Part = later SubPart
The unfamiliar syntax requires some clarification:
1
The product is a non-standard aggregate from Chapter 4.
2
The first and last refer to the first and last edges
in the path, correspondingly.
These aggregates are unique to ordered structures such as directed
graphs.
1
Although not used in the example, concatenating edge labels into a
string is one more natural aggregation function, which is unique to
graphs. The list aggregate function is a standard way to
accommodate it.
2
The later keyword is just a syntactic sugar fixing apparent
asymmetry caused by the prior keyword.
3
There is no start with clause, which is, in fact, redundant.
It is an outer query where paths will be restricted to those
originated in the ?Bicycle? node.
The generalized
transitive closure query is enveloped with a second level of
aggregation, which is accomplished by standard means:
select
leaf, sum(factoredQuantity) from (
select product(Quantity) factoredQuantity,
first(Part) root,
last(SubPart) leaf
from AssemblyEdges
connect by prior Part = later SubPart
) where root = ?Bicycle?
group by leaf
Enough theory, so
what is done in the real world to implement an aggregated weighted
total query? Let's start with Oracle, because the proposed syntax
for a generalized transitive closure resembles the Oracle connect
by.
First, we have to
be able to refer to the first and the last edges in the path:
select
connect_by_root(Part) root, SubPart leaf
from AssemblyEdges
connect by prior Part = SubPart
Unlike the
fictional syntax, Oracle treats edges in the path asymmetrically.
Any column from the AssemblyEdges table is assumed to
implicitly refer to the last edge. This design dates back to version
7. The connect_by_root function referring to the first edge
has been added in version 10. It is remarkable how this essentially
ad-hock design proved to be successful in practice.
Next, Oracle syntax
has several more path aggregate functions:
1
level - the length of the path.
2
sys_connect_by_path - essentially the list aggregate
in the fictitious syntax.
3
connect_by_isleaf - an indicator if there is no paths which
contain the current path as a prefix.
4
connect_by_iscycle - an indicator if the first edge is
adjacent to the last.
Unfortunately, we
need an aggregate which is a product of edge weights and not a
string which is a concatenation of weights produced by
sys_connect_by_path. You might be tempted to hack a function
which accepts a string of the edge weights, parses it, and returns
the required aggregate value.
Nah, too easy! Can
the problem be solved without coding a function even a simple one?
Yes it can, although this solution would hardly be more efficient.
The critical idea is representing any path in the graph as a
concatenation of three paths:
1
a prefix path from the start node to some intermediate node i
2
a path consisting of a single weighted edge (i,j)
3
a postfix path from the node i to the end node
Then, the path from
x to y is frozen while interpreting the edge (i,j)
as a variable. All that needs done is to aggregate the weights of
all those edges.
Let's assemble the
solution piece by piece. First, all the paths in the graphs are
expressed as:
with
TClosure as (
select distinct connect_by_root(Part) x, SubPart y,
sys_connect_by_path('['||Part||','||SubPart||'>','
') path
from AssemblyEdges
connect by nocycle Part=prior SubPart
union
select Part, Part, '' from AssemblyEdges
union
select SubPart, SubPart, '' from AssemblyEdges
) -
This is essentially
a reflexive transitive closure relation enhanced with the path
column. Paths are strings of concatenated edges; each edge is
sugarcoated into '['||Part||','||SubPart||'>', which helps
visual perception, but is inessential for the solution.
Next, the two paths
and the intermediate edge are joined together and grouped by paths:
? ,
PathQuantities as (
select t1.x, t2.y,
t1.p||' ['||e.Part||','||e.SubPart||'>'||t2.p,
product(Quantity) Quantity
from TClosure t1, AssemblyEdges e, TClosure t2
where t1.y = e.Part and e.SubPart = t2.x
group by t1.x, t2.y, t1.p||' ['||e.Part||','||e.SubPart||'>'||t2.p)
?
Now all the paths
with quantities aggregated along them are available. Let's
group the paths by the first and last node in the path while adding
the quantities:
select x,
y, sum(Quantity)
from PathQuantities
group by x, y
This query is almost final, it needs only a minor touch:
restricting the node x to ?Bicycle? and interpreting
the y column as a Part in the assembly:
select y
Part, sum(Quantity)
from PathQuantities
where x = ?Bicycle?
group by x, y
The recursive SQL
solution turns out to be quite satisfactory:
with
TCAssembly as (
select Part, SubPart, Quantity AS factoredQuantity
from AssemblyEdges
where Part = ?Bicycle?
union all
select te.Part, e.SubPart, e.Quantity * te.factoredQuantity
from TCAssembly te, AssemblyEdges e
where te.SubPart = e.Part
) select SubPart, sum(Quantity) from TCAssembly group by SubPart
Most importantly,
it accommodates the inner aggregation with the non-standard
aggregate effortlessly! The cycle detection issue that plagued the
recursive SQL in the section on transitive closure is not a problem
for the directed acyclic graphs.
The hierarchical
total turns out to be surprisingly useful in the following two
sections. Selecting all the subsets of items satisfying some
aggregate criteria, Basket generation, appears to have little
to do with the hierarchical total, at first. The other, rather
unexpected application is Comparing hierarchies.