 |
|
Nested Intervals
SQL Tips by Donald Burleson |
Querying descendants has to be done
via nested intervals. Given the matrix the interval boundaries are
calculated as:
Of these two numbers which is
interval is lower bound and which is the upper bound? Let's compare
them. Multiplying both numbers to the product of their denominators
and simplifying the result, the problem is reduced to answering if
Here is the determinant expression,
again, which evaluates to 1. Hence, interval boundaries are ordered
as:
Next, how can we be sure those
intervals are indeed nested? Let's compare an arbitrary node
interval ends with that of its children. The nth child
interval encoding is:
Therefore, the child interval
boundaries are:
Note, that the second endpoint is
the same expression as the first one, with n decremented by
1. Therefore, the following must be checked for any n ≥ 1
Both inequalities reduce to correspondingly. This proves that
the parent node interval indeed contains its child interval.
The second property of nested
intervals - sibling node intervals being disjoint - can be proved in
a similar fashion.
Descendants Query
Now that the nested intervals
structure has been covered, querying the node's descendants can be
looked at. As a first approximation, let's take the descendants
query in terms of nested sets as a template and rewrite it in terms
of nested intervals:
select
descendant.*
from
MatrixTreeNodes descendant, MatrixTreeNodes node
where
descendant.a11/descendant.a21 between node.a11/node.a21
and (node.a11-node.a12)/(node.a21-node.a22)
and
node.name = ? -- predicate uniquely identifying a node
Unfortunately, this query would
produce a wrong result. None of the database vendors support the
rational number datatype. The ratios of integers would be silently
casted into float numbers with accompanying errors due to lack of
precision. All the expressions must be rewritten with divisions
within the means of integer arithmetic:
select
descendant.*
from
MatrixTreeNodes descendant, MatrixTreeNodes node
where
descendant.a11*node.a21 >= descendant.a21*node.a11
and
descendant.a11*node.a22 >= descendant.a21*node.a12
and
node.name = ? -- predicate identifying a node uniquely
When the descendant query
performance was explained in the context of nested sets, the index
range scan was emphasized as an efficient way to extract all the
descendant nodes. This idea generalizes to nested intervals,
although interval boundaries must be indexed. Let's enhance the tree
encoding schema design with two function-based indexes:
table
MatrixTreeNodes (
a11
integer,
a12
integer,
a21
integer,
a22
integer
);
CREATE
INDEX idx_left ON MatrixTreeNodes(a11/a21);
CREATE
INDEX idx_right ON MatrixTreeNodes((a11-a12)/(a21-a22));
The query must be rewritten in such
a way that the optimizer can leverage these indexes:
select
descendant.*
from
MatrixTreeNodes descendant, MatrixTreeNodes node
where
descendant.a11*node.a21 >= descendant.a21*node.a11
and
descendant.a11*node.a22 >= descendant.a21*node.a12
and
descendant.a11/descendant.a21
between node.a11/node.a21 - 0.0000001
and
(node.a11-node.a12)/(node.a21-node.a22) + 0.0000001
and
node.name = ? -- predicate uniquely identifying a node
The constant 0.0000001 is
designed to compensate for floating point arithmetic rounding
errors. It essentially is a minimal supported mantissa. Please refer
to your favorite database SQL manual in order to find the exact
value. This way the index range scan would capture all the nodes in
the interval, and possibly, some extra, and the small list of nodes
is filtered with the exact condition.
Ancestor Criteria
Suppose there are two nodes: one
encoded with matrix A, and the other encoded with B.
Node A is an ancestor of B if and only if there is a
directed path from A to B. In matrix terms, there has
to be a sequence of atomic matrices so that after A is
multiplied to it, matrix B is then obtained. By matrix
multiplication associativity law, all those atomic matrices can be
combined into a single matrix X. In other words, node A
is an ancestor of node B if there is matrix X such
that:
A X = B
If matrix A has inverse A-1,
then when multiplying both sides to A-1 the result
is:
X = A-1
B
The formula for the inverse of the 2?2
matrix is where knowledge was leveraged that
the matrices always have a determinant 1. Therefore, given any nodes A and B, matrix X that encodes the path between
them can always be found.
This is absurd, as node B
might not be a descendant of A! Let's examine the phenomenon
more closely. As usual, an example might be handy. Consider the
nodes A=1.7 and B=1.3.2.5 in matrix encoding:
Then, A-1B
evaluates to:
This is not a valid matrix encoding,
however. It violates the constraint that the entries in the upper
row are greater than the ones in the lower row.
Here is more detailed explanation
for why this is happening. Since matrix A is decomposed into
a product of (atomic) matrices, why not leverage the law of
inverse of matrix product:
(P
Q) -1 = Q-1 P-1
In the example:
Hence, A-1B expands into the
following product of atomic matrices and their inverses
where collapses into the identity matrix.
This is because both A=.1.7 and B=.1.3.2.5 start with the same
prefix .1. Therefore, the above expression for A-1B
simplifies to which cannot be further reduced. It
is the multiplication by an atomic matrix inverse that violates the
constraint.
In order to carry over this idea to
SQL, A-1B must be written in a generic form which translates to the descendants
query from the previous section:
select B.*
from
MatrixTreeNodes A, MatrixTreeNodes B
where
B.a21*A.a12 - B.a11*A.a22 > -B.a11*A.a21 + B.a21*A.a11
and
B.a12*A.a22 - B.a22*A.a12 > -B.a22*A.a11 + B.a12*A.a21
and A.name
= ? -- predicate identifying a node uniquely
Admittedly, this query is slightly
more complicated than the nested intervals version. The real
contribution of this section is introducing inverse matrices, which
will be leveraged later when relocating subtrees.