 |
|
Generating Baskets
SQL Tips by Donald Burleson |
Given a set of
items, how much, say, would $500 buy?
select *
from items;
Even though the
problem may sound unbelievably simple, the answer involves finding
all sets of items, not just the item records that cost below $500.
However, what do those sets have to do with the graph problems that
are the focus of this chapter? Well, there is no method on how to
generate sets of items, but even if there were, the second challenge
would be aggregating a price on these sets.
The problem in
terms of graphs is reformulated using the very suggestive Figure
6.14 as the basic idea.
What criterion
connects the items in the graph? Anything that does not list an item
twice would do. For purposes in this chapter, any two items are
connected if the first item precedes lexicographically to the second
one. Then, as in the hierarchical total section, the solution has to
branch in order to accommodate the differences between various
platforms.
In Oracle the idea
is adopted from Figure 6.13. Unlike the previous section, however,
where values were aggregated on edges, values are added together on
nodes. The path is decomposed into the three components as shown in
Figure 6.15.
The transitive
closure relation together with an aggregation query formally is:
with Sets
as (
select distinct connect_by_root(name) x, name y,
case when connect_by_root(name) = name then
''
else
substr(sys_connect_by_path(name,','),
instr(sys_connect_by_path(name,','),',',2))
end p
from items
connect by nocycle name > prior name
), SetTotals as (
select t1.x||t1.p||t2.p,
sum(price) price_sum
from Sets t1, Items i, Sets t2
where t1.y = i.name and i.name = t2.x
group by t1.x||t1.p||t2.p
)
Most of the
attention here has been focused on making the path string look
right. Without this twisted case condition, joining the paths
according to Figure 6.15 would produce strings with duplicate items,
e.g. camera, camera, monitor.
The final query is
now one step away:
select *
from SetTotals
where price_sum < 500
Now that the connect by based solution has been
examined, let's see what recursive SQL can offer. The solution is
embarrassingly simple:
with Sets (maxName,
itemSet, sumPrice) as (
select name, name, price
union
select name, itemSet || ',' name, sumPrice + price
from Sets, T
where name > maxName
and sumPrice + price < 500
) select itemSet, sumPrice from Sets
Not only it is
clearer, but it is also more efficient. Imagine a store with 50K
items. Calculating all the subsets of the items is not a proposition
that is expected to be completed in the remaining lifetime of the
universe. The number of baskets with the aggregated cost limited by
some modest amount, however, is reasonably small. Once again,
applying predicates early is a good idea in general.
Comparing
Hierarchies
Being able to
navigate graph structures is only a part of the story. Comparing
hierarchies is more challenging. First of all, by what criteria are
two hierarchical structures considered as equivalent? In particular,
is the order of siblings important?
Depending on the
context, the reader may lean to one or the other answer.
Likewise, can
children become grandchildren?
Even though a
reader may lean to the conclusion that dummy intermediate nodes are
not allowed, it is easy to provide a counter example when ignoring
certain types of intermediate nodes is imperative.
This ambiguity is
discouraging. Yet, let's ignore it for a while and try to develop
some basic understanding of how trees can be compared. Looking
desperately for a bright idea, I pulled the following entry from
thesaurus.com
Main Entry:
compare
Synonyms:
analyze, approach, balance, bracket, collate, confront, consider,
contemplate, correlate, divide, equal, examine, hang, inspect,
juxtapose, match, match up, measure, observe, oppose, parallel,
ponder, rival, scan, scrutinize, segregate, separate, set against,
size up, study, touch, weigh
The last synonym - weigh - sounds promising. When comparing two things in the physical
world, they are measured on some scale or weighed. Perhaps tree
weights can be compared somehow?
Let's revisit the
idea of the hierarchical total. When calculating the total, a raw
node weight is augmented by the augmented weights of its children.
The process begins at the root node descending recursively down to
the leaves. The trick is to guess the right kind of aggregation,
which accommodates both the tree structure and unaugmented node
weights. Then, for all comparison purposes a tree can be identified
with the augmented weight of its root node.
The easiest
aggregation to try is simply the sum of all tree node weights. This
na?e method, however, fails to take into account the differences in
the tree structure.
Joe Celko noticed
that even though the weights at the root nodes are the same, the
weights at node 2 are different. Thus a comparison is needed
of the two sets of nodes element-by-element with their respective
weights. This is easy enough to do; simply store the weights in
Table1
and Table2
Then, check if the
symmetric difference is empty.
select
AggregatedWeight from Table1
minus
select AggregatedWeight from Table2
union
select AggregatedWeight from Table2
minus
select AggregatedWeight from Table1
Still, the idea of
comparing scalar values rather than sets of values is very
appealing. Sets of integers are well known to be mapped
bijectively to ordinary integers. This may not be a practical
solution, but at least it supports the intuition that a scalar based
tree comparison is possible.
Why is the method
of mapping integer sets into integers unpractical? First, tree nodes
can be labeled with values of dataypes other than integers.
This is easily fixable, since the only important property of the
label for tree comparison purposes is its identity. Any value of any
datatype can be mapped to an integer, and in fact such a map is
commonly known as a hash function. Second, a much more
difficult problem is that the mapping of sets integers to integers
grows very fast. The range of computer integer numbers overflows
pretty easily even for sets of a moderate size. A hash function,
however, provides a satisfactory solution to the range overflowing
problem as well.
The basic premise
of any hash-based method is that the (unlikely) hash code collisions
are tolerable. Given an object a, the chance that there
exists another object x≠a such that hash(a)=hash(x) is
considered as negligible. Likewise, for any objects a, b,
x and y satisfying the equation hash(a)+hash(b)=hash(x)+hash(y)
it must either follow that x=a, y=b or x=b,
y=a. This could be contrasted to the ordinary addition wherein
the equation a+b=x+y is too ambiguous to determine x
and y.
The hierarchical
total query can be defined recursively. At each node labeled p,
which has children c1, c2, ?, cn, we aggregate
the following value:
total(p) = hash(p+total(c1)+total(c2)+?+total(cn))
The recursive
definition is the easiest to implement with recursive SQL. Consider
the employee hierarchy:
table
Employees ( -- tree nodes
id integer primary key,
sal integer
);
table
Subordination ( -- tree edges
mgr integer references Employees,
emp integer references Employees
);
Then, the
hierarchical total query starts with the leaf nodes:
select id,
hash(id) AS total
from Employees
where id not in (select mgr from Subordination)
The recursive step
mimics the recurrence definition:
with
NodeWeights as (
select id, hash(id) AS total
from Employees
where id not in (select mgr from Subordination)
union all
select e.id, e.id+sum(hash(total))
from Subordination s, Employees e, NodeWeights nw
where s.mgr = e.id and s.emp = nw.id
group by e.id
) select weight from NodeWeights
where id not in (select emp from Subordination)
After all hierarchy
nodes are weighted, the outermost query selects the weight at the
root node.
It is difficult to
fit a recursive idea into the connect by framework. Likewise,
when nested sets and intervals were studied, the recursive ideas
were ignored altogether. Could the hash based tree comparison method
be adapted to these contexts as well?
Let's invoke our
familiar path encoding - after all, it reflects a tree structure.
Admittedly, I do not know how to aggregate paths in a subtree of
descendants, but I can suggest adding their hash values instead!
More specifically, a hash value of each node can be multiplied by a
hash value of the node path and summed up the hierarchy. This
definition is free of recursion, and therefore, is exactly what is
required for the hierarchical total method to work.
The method for
basic tree comparison can be amended to meet the exact tree equality
specification. In the beginning of the section there were two tree
equality dilemmas that were left hanging. In the first dilemma, if
the reordering of siblings is allowed, the node summation is
performed exactly as described. Otherwise, the node weights should
be multiplied by the order numbers.
For example, for
the left tree in Figure 6.14 the augmented weight of the root node
has to be recalculated to become 1*1 + 2*1 + 3*2 = 7. If
leaves 2 and 3 are swapped, then the weight at the
root node has to change to 1*1 + 3*1 + 2*2 = 8. A similar
idea applies to the hash-based comparison method, where the
hierarchical total formula becomes:
total(p) = hash(p+1?/span>total(c1)+
2?/span>total(c2)+?+ n?/span>total(cn))
Likewise, in the
second dilemma if intermediate empty nodes in the tree structure are
allowed, then Celko's method works as it is. Otherwise, each
unaugmented node weight is multiplied by the level. For example, for
the left tree in Figure 6.15 the augmented weight of the root node
has to be recalculated to become 1*1 + 2*2 + 2*3 = 11. For
the right tree it is 1*1 + 2*2 + 2*0 + 3*3 = 14. By contrast,
the hash-based tree comparison method considers the tree structures
in Figure 6.15 as different.