 |
|
Matrix Encoding
SQL Tips by Donald Burleson |
The very basic skill of multiplying
2?2 matrices of integer
numbers is listed as a reminder in Figure 5.8.
Matrix multiplication obeys the same
rules as string concatenation. It is associative
(AB)C = A(BC)
and distributive
AB + AC = A(B+C)
but not commutative
AB
≠
BA
Any materialized path is a
concatenation of atomic materialized paths. For example,
.1.3.2.5 can be viewed as .1 linked to .3, then joined with .2, and
finally connected to .5. Can the same thing be done with matrices?
The trick is to define atomic matrices, such that multiplying them
would produce the matrix encoding for the full path.
Atomic matrices turned out to be
quite simple. In fact every atomic matrix has three constant
entries: 0 in the lower right corner, and 1 in the lower left, and
-1 in the upper right. The upper left entry is the node's sequence
number in the chain of siblings incremented by 1. For example, .5
corresponds to the matrix:
And the result from multiplying the
matrices corresponding to .1, .3, .2, and .5
is:
Although we didn't seem to progress
much so far, we can at least round up matrix tree encoding schema
design
table
MatrixTreeNodes (
a11
integer,
a12
integer,
a21
integer,
a22
integer
);
A lot of questions might emerge in
the reader's mind at this moment, and all of them will be addressed
one by one. The most important of which is how this matrix encoding
is related to nested intervals? Indeed, all these matrix
manipulations should be evaluated from the querying perspective. In
particular, how are the node's descendants and ancestors queried?
And an even more basic question, how are the node's parent and
immediate children found?
Parent
and Children Queries
In the example there are a certain constraints that
node encoding obeys. The entries in the left column are positive,
and the entries in the right column are negative. And absolute
values of the entries in the right column are component-wise smaller
than those in the left column. Likewise, absolute values in the
upper row are greater than the absolute values in the lower row.
These properties are obvious for atomic matrices, but what about
arbitrary nodes?
Consider an arbitrary node with
encoding as:
Its n-th child encoding is
calculated as a matrix product of:
Let us examine this expression
closely. First we notice that the parent left row entries are moved
into the child right row with the sign changed. Now, can we prove
our intuition about the entries in the left row being positive and
the entries in the right row being negative? Sure. Assume this
property holds in the case of the parent encoding. Then, it must
carry over to the child. By induction, it follows that any node will
honor it. A similar line of reasoning proves our insight about
absolute values.
The second very important constraint
that matrix encoding satisfies is:
A reader with basic linear algebra
background most likely recognized the matrix determinant there.
Determinants obey multiplication law: when matrices multiply, their
determinants multiply as well. Therefore, a determinant of the node
encoding matrix is a product of the atomic matrix determinants!
Since all atomic matrices have a determinant equal to 1, the
determinant of any node encoding matrix must be 1.
A determinant constraint reduces the
number of independent matrix entries to three. Given any three
matrix elements, the forth entry is unambiguously calculated from
the determinant constraint equation. And this could be done even
better by reducing the number of independent elements to two.
Given the two elements a11
and a21, relabeled conventionally as a and -b,
plus the unknowns: a12 and a22 relabeled as y
and x, the determinant equation reads:
This is perhaps the most celebrated
equation in the elementary number theory. Its integer solutions are
calculated via the extended Euclidean algorithm. Here is the
algorithm illustrated on the familiar matrix encoding example:
The following equation is solved in
a series of steps illustrated in Figure 5.9:
At the last algorithm iteration, it
is concluded that the values x=11 and y=19 satisfy the
equation.
Is this the only solution? Certainly
not. Consider:
62?/span>107
- 107?/span>62
= 0
Add it to:
11?/span>107
- 19?/span>62
= 0
And the result
(11+62)?/span>107 - (19+107)?/span>62
= 0
which implies another solution of
x=73, y=126! Fortunately, we know that x (i.e. a22)
and y (i.e. a12) has to be smaller than 62 (i.e. a21)
and 107 (i.e. a11), correspondingly. Therefore they
can be dismissed.
The most important implication of
the research in this section is that the combination of a11
with a21 is always unique. We can go as far as reducing the
MatrixTreeNodes definition to these two attributes (and calculate
the other two columns on the fly via extended Euclidean algorithm),
or leave the redundant attributes in the table and just declare the
unique key. The second alternative is chosen, which is justified by
the next step. Knowing that a12 and a22 are always
negative, we are going to store their absolute values. Then, as been
shown already, the child values a12 and a22 have to
refer to some parent identified by a11 and a21. In
other words, a child always refers to its parent explicitly via the
foreign key constraint.
Therefore, the tree schema design is
enhanced as shown below:
table
MatrixTreeNodes (
a11
integer,
a12
integer,
a21
integer,
a22
integer
);
alter table
MatrixTreeNodes
ADD
CONSTRAINT uk_node UNIQUE (a11,a21)
ADD
CONSTRAINT fk_adjacency FOREIGN KEY (a12,a22)
REFERENCES MatrixTreeNodes (a11,a21);
The hierarchy design in which a node
refers to the parent name explicitly is called the adjacency
tree model, and its scope
is actually bigger than trees. The schema for the adjacency model is
the following:
table
AdjacentTreeNodes (
id integer,
parent_id integer
);
alter table
AdjacentTreeNodes
ADD
CONSTRAINT uk_node UNIQUE (id)
ADD
CONSTRAINT fk_adjacency FOREIGN KEY (parent_id)
REFERENCES AdjacentTreeNodes (id);
Unlike matrix encoding, there is no
theory regarding how to choose a set of node identifiers, except for
the obvious restrictions that the id column is a unique
identifier, and parent_id always refers to the parent node.
The general adjacency model is the main topic of the next chapter.
There is one subtle distinction
between matrix and adjacency encodings. What the root node refers
to. In adjacency encoding the root node parent_id has to be
NULL, as there is no parent. In the matrix encoding the extended
Euclidean algorithm is simply applied and the four numbers obtained.
The root node refers to some nonexistent parent! What if the root
node matrix encoding:
is changed into:
Technically, NULLS cannot be forced
into the matrix entries - they would destroy all the algorithms that
were developed so far. It is more reasonable to admit that the
formal referential constraint declaration for matrix encoding is
invalid, and therefore, should be retracted from the schema design.
This is not a big issue, however, given that matrix encoding enjoys
more sophisticated constraints than referential integrity.
Parent
is NULL?
In the
adjacency model the root node refers to the NULL parent. Does it
mean that the query ?Find the root node's parent? cannot be
answered? In the matrix model the root refers to the imaginary
parent, and the query ?Find the root node's parent? returns the
empty set as it supposed to.
Once again, it was possible to
establish direct links between parent and children because the
values of a12 and a22 were negated. From now on, the
generic matrix node encoding will be referred to as:
Now that the informal referential
integrity constraint is available, querying parent and children
nodes becomes obvious.
Find all the employees who report
directly to Jones.
select
child.name
from
MatrixTreeNodes parent, MatrixTreeNodes child
where
parent.name = ?Jones?
and
child.a11 = parent.a12 and child.a21 = parent.a22
Who is Jones? manager?
select
parent.name
from
MatrixTreeNodes parent, MatrixTreeNodes child
where
child.name = ?Jones?
and
child.a11 = parent.a12 and child.a21 = parent.a22