 |
|
Incremental Evaluation
SQL Tips by Donald Burleson |
Transitive closure
enjoys a lot of attention in the database research community.
Proving the impossibility of doing things in a certain way is one of
the favorite theoretical topics. Not surprisingly, it was
established very early that transitive closure cannot be expressed
by simple means of relational algebra, even enhanced with
aggregation and grouping.
We have already seen
the power of the incremental evaluation idea in the database
implementation world. Indexes and materialized views are the most
familiar incremental evaluation structures. Dong et al proved
that transitive closure can be efficiently maintained via an
incremental evaluation system. In this section Dong's approach is
explored, although with some deviation that would simplify the
matter.
Let's revisit the
transitive closure expression in terms of the adjacency matrix:
From a mathematical
perspective this is quite a handsome series. It could be made even
prettier if we consider the identity matrix - the matrix with
ones on the main diagonal and zeros elsewhere. A common notation for
the identity matrix is symbol 1.
Perhaps the most important equality involving the identity matrix
is:
There it shows that
the identity matrix literally begs to be added in front of the
series for the transitive closure:
Speaking in graph
language, what happened is the loops were included at each node.
Formally, T is now both a
reflexive and transitive closure.
Are there any
immediate benefits? Let's multiply T
by 1 - A
All the nonzero
powers of A at the right side
cancel out:
Multiplying both
sides by the inverse of 1 - A
gives an explicit formula for the transitive closure:
Therefore, we might
be able to calculate transitive closure (of directed acyclic graphs,
at least), if we know how to invert matrices in SQL! Unfortunately,
inverting matrices in SQL is difficult. The matrix approach,
however, still shows its practical merit in the scope of an
incremental evaluation system.
In the incremental
evaluation method, the original graph is stored together with
another graph -- its transitive closure. Every time the original
graph is updated that is a new edge is inserted or deleted; the
transitive closure graph has to be changed as well so that the
information in both structures remains consistent. The problem of
transitive closure graph maintenance essentially is finding an
efficient SQL expression for the insertions and deletions in the
transitive closure graph.
Let's continue
pursuing the adjacency matrix approach. Inserting an edge (x,y)
into the original graph produces a new graph with the adjacency
matrix that is the original matrix A incremented by S, the latter
being a matrix with a single nonzero entry in column x, row
y. How would this increment of the adjacency matrix affect
the transitive closure matrix? Well, let's calculate. The new
transitive closure matrix is:
Expanding powers
into polynomials the result is:
Let's rearrange
terms more suggestively:
The first series is
the familiar transitive closure matrix T
of the original not incremented adjacency matrix
A. The second series could be
factored into
which reduces to T S
T. The last series vanishes if the scope is limited to directed
acyclic graphs. Indeed, each term in that series is multiplied by S
at least twice. In other words, each term in the series corresponds
to a path in the graph that goes through the edge (x,y)
twice, which implies a cycle.
Summarizing, the new
transitive closure matrix reduces to:
Incremental Maintenance of
Graph Matrices
When the original
adjacency matrix
A
is incremented by
S,
the transitive closure matrix
T
is incremented by
T S T.
This is a very
succinct result. The result is double checked by comparing it with
the query computing the transitive closure increment in Dong's
paper:
SELECT *
FROM (
SELECT VALUES (a, b)
UNION
SELECT Start = TC.Start, End = b
FROM TC
WHERE TC.End = a
UNION
SELECT Start = a, End = TC.End
FROM TC
WHERE b = TC.Start
UNION
SELECT Start = TC1.Start, End = TC2.End
FROM TC AS TC1, TC AS TC2
WHERE TC1.End = a AND TC2.Start = b
) AS T;
SELECT *
FROM TC?NEW AS T
WHERE NOT EXISTS (SELECT *
FROM TC
WHERE TC.Start=T.Start AND TC.End=T.End)
INTO TEMP DELTA;
INSERT INTO
TC
SELECT *
FROM DELTA;
The first part of the query, which is a union of four blocks
is supposed to correspond to the transitive closure matrix increment
T S T. It appears that they are
very dissimilar. How can this be?
Remember, however,
that we conveniently decided to operate the reflexive transitive
closure matrices. If we go back and represent
T as
1+T?, where T? is not
reflexive anymore, then the increment matrix has to be written as
(1+T?) S (1+T?) which can be
expanded into with four terms as in Dong's method.
Now that we are
confident with the matrix solution, it can be finalized in terms of
SQL. It has already been shown the operating matrices of integers
that count paths in the directed acyclic graph offers an exceptional
clarity. Therefore, let's represent the transitive closure relation
directly after the transitive closure matrix that was investigated
earlier:
table TRC (
-- transitive, reflexive closure
i integer, -- tail
j integer, -- head
val integer
-- weight
)
Once again, it is
the reflexive property and the additional information in the val
column that distinguishes this method formally from Dong's.
However, there is an
ambiguity. What about zero matrix entries? There is an option
available to either store it as (i,j,0) or ignore such rows.
The first option simplifies SQL that maintains the table TRC.
The second option provides a natural means for compressing sparse
matrices.
Sparse Matrices
Sparse matrices save
space. Instead of storing full matrix
i j val
- - ---
1 1
0
1 2
0
1 3
1
2 1
0
2 2
0
2 3
0
3 1
0
3 2
5
3 3
0
it is more economical to
omit zero entries
i j val
- - ---
1 3
1
3 2
5
Matrix addition query
requires a little more care for sparse matrices. Matrix
multiplication, however, is an aggregate. It produces correct
result set either way.
Now all the ground
work for transitive closure maintenance in SQL is complete.
Inserting an edge (x,y) into the adjacency graph has to
trigger a conforming change in the transitive closure table TRC.
The values in the TRC.val column should be incremented
accordingly by the entries of the product of three matrices T S T.
Knowing how to write
a product of two arbitrary matrices in SQL, the product of three
matrices -- A B C -- can be written as a composition of two binary
product operations (A B) C. Alternatively, matrix elements are
summed up at once
which is easy to
translate into SQL:
select A.i
AS i, C.j AS j, sum(A.val*B.val*C.val) AS val
from A, B, C
where A.j=B.i and B.j=C.i
group by A.i, C.j
Please note how
naturally the matrix associativity property goes along with the
relational join associativity.
Actually, the goal
is a simpler matrix product -- T S T.
There is no challenge adapting the general case to our needs:
select t1.i
AS i, t2.j AS j, sum(t1.val*t2.val) AS val
from TRC t1, TRC t2
where t1.j = :x and t2.i = :y
group by t1.i, t2.j
If the option of
storing zero matrix entries is chosen, then the above query fits
naturally into an update statement to the TRC table:
update TRC
set val = (
select val + sum(t1.val*t2.val)
from TRC t1, TRC t2
where t1.j = :x and t2.i = :y
and t1.i = trc.i, t2.j = trc.j
group by t1.i, t2.j
)
It is fascinating
that the transitive closure table maintenance is solvable with a
single update, but this answer is unrealistic for two reasons.
First, the TRC table has to grow at some moment, and this
issue is left out of scope. Second, from a performance perspective,
updating or trying to update all the rows in the TRC table
smells a disaster. It would be helpful to know which rows require
update, and formalize it within the where clause (which is
entirely missing).
Therefore, let's
materialize updates to the TRC table in a designated table
TRCDelta:
insert into
TRCDelta
select t1.i AS i, t2.j AS j, sum(t1.val*t2.val) AS val
from TRC t1, TRC t2
where t1.j = :x and t2.i = :y
group by t1.i, t2.j
To keep this table
small, matrix entries should not be stored with zero values any
longer. On the other hand, by just storing a sparse matrix stored in
the TRC table, the TRCDelta is automatically
calculated as a sparse matrix either.
It is the TRC
table update step that has to be adjusted. There are two cases:
1
All the rows in TRCDelta that do not match TRC rows
have to be inserted:
insert into
TRC
select * from TRCDelta
where (i,j) not in (select i,j from TRC)
1 The rows that match have to increment their values:
update TRC
set val = val + (select val from TRCDelta td
where td.i = trc.i and td.j = trc.j)
where (i,j) in (select i,j from TRCDelta)
This completes the program in the case of an insertion of an
edge. What about a deletion? From the matrix perspective the cases
of inserting an edge and deleting it are symmetric. Following the
earlier development, the same matrix S that corresponds to a single
edge (x,y) could be used, except that it must be subtracted
everywhere. The final result in the matrix form is that transitive
closure matrix T is decremented by T S T.
In the first na?e
update solution for the TRC table all that must be done is to
reverse the sign:
update TRC
set val = (
select val
-
sum(t1.val*t2.val)
from TRC t1, TRC t2
where t1.j = :x and t2.i = :y
and t1.i =
trc.i, t2.j = trc.j
group by t1.i, t2.j
)
Please note, that
this symmetry is made possible because of the extra information that
was stored in the TRC table. Since the number of paths that
are going from node i to node j is known, all the
paths that are affected by the deletion of an edge (x,y) can
simply be subtracted. In Dong's approach, maintenance under
deletion is more complicated.
Carrying over the
solution to sparse matrices requires little insight. The TRCDelta
table that stores the T S T matrix is calculated the same way as in
the edge insertion scenario. Thus, subtracting the T S T from T
brings up two possibilities:
1
An entry in the transitive closure matrix
T is the same as the corresponding entry in the
T S T.
2 An entry
in the transitive closure matrix T
is bigger than the corresponding entry in the
T S T.
In the first case
the entry in the difference matrix T - T S T is 0. Therefore,
all these entries have to be deleted:
delete from
TRC
where (i,j, val) in (select i,j, val from TRCDelta)
All the other
entries have to be adjusted:
update TRC
set val = val - (select val from TRCDelta td
where td.i = trc.i and td.j = trc.j)
where (i,j) in (select i,j from TRCDelta)
Now that several
methods for transitive closure calculation/ maintenance has been
shown, let's again return to applications. Perhaps the most
significant problem that can be expressed in terms of transitive
closure is aggregation on graphs.