 |
|
Interval Coalesce
SQL Tips by Donald Burleson |
Normally, the
shorter the problem statement is, the simpler its formal expression
in SQL. However, there are notable exceptions and interval
coalesce is one of them.
Given a set of
intervals, return the smallest set of intervals that cover them.
This deceptively
simple formulation leaves a pile of questions. First, what is
the intervals table? This one is easy:
table
Intervals (
x integer, -- start of the interval
y integer -- end of the interval
);
ALTER TABLE
Intervals
ADD CONSTRAINT ends_ordering CHECK( x < y );
Perhaps, the from and to, or maybe, start
and end column names may sound more appropriate.
Unfortunately, they are SQL keywords.
Next, what is the
concept of one interval covering the other? Easy: an interval is a
set of points between the interval endpoints x and y.
What about the endpoints, are they included into the interval? For
our purpose, let's agree that both x and y belong to
interval set of points, in other words, the intervals are closed
from both ends. Open and half-open intervals require careful
reexamination of all the inequality predicates in the query which we
are going to write. Therefore, formally: A = {p|x≤p≤y}.
Finally, interval A covers interval B when B is
subset of A: B⊆A.
Armed with these
definitions, we are ready to write the query. The Intervals
table is the only candidate to be placed into the query from
clause, but there is a catch. If we just select some intervals out
of the set we have, we won't get the answer. Think about an easy
case of two overlapping intervals. The answer to the query is an
interval whose endpoints are combined from different records.
Therefore, a selfjoin is needed:
select
fst.x, lst.y
from Intervals fst, Intervals lst
where fst.x < lst.y
and ?
Next, the ?no gaps? condition is introduced. Consider any
interval endpoint between fst.x and fst.y. It has to
be covered by some interval!
Actually, if there
is a gap in a chain of intervals, then both ends of the gaps would
be exposed (that is not covered by other intervals). This
observation allows a small optimization, so that only the x
endpoints will need checked.
Unfortunately, the
?no gaps? condition cannot be naturally translated into SQL, since
it does not support universal quantifiers. The standard
approach is converting the clause into existential quantifiers,
which will be explained in more detail in the section on Relational
Division. Here let's take a little jump and write the whole
condition at once:
select
fst.x, lst.y
from Intervals fst, Intervals lst
where fst.x < lst.y
and not exists (
select * from Intervals int
where int.x > fst.x and int.x < lst.y
and not exists (
select * from Intervals cov
where int.x >= cov.x and int.x =< cov.y
)
) and ?
That's not all. The
chain of intervals has to be maximal, in other words, it
cannot be extended at the either end.
This condition
translates into one extra subquery:
select
fst.x, lst.y
from Intervals fst, Intervals lst
where fst.x < lst.y
and not exists (
select * from Intervals int
where int.x > fst.x and int.x < lst.y
and not exists (
select * from Intervals cov
where int.x >= cov.x and int.x =< cov.y
)
) and not exists (
select * from Intervals cov
where cov.x < fst.x and fst.x <= cov.y
or cov.x <= lst.y and lst.y < cov.y
)
The final query is surprisingly verbose. Terser
expression would involve converting NOT EXISTS subqueries into
counting.
Is (NOT) EXISTS or (NOT)
IN faster?
This is the wrong
question. The EXISTS and IN subqueries are equivalent; in fact,
the optimizer uses the term semijoin
for both. The NOT
EXISTS and NOT IN subqueries are different as far as
NULL semantics is
concerned, but still they are very similar so that the SQL
execution engine uses the term antijoin
for both. It might
be argued that SQL standard should have defined them identically,
leaving explicit control of NULL
semantics to the end user (by means of IS (NOT) NULL
predicate). History proved any attempt to get UNKNOWN
value semantics right as futile, therefore it is better just to
forget about NULLs
and aim for elementary consistency.
Which NOT EXISTS
clause should be used, since two are available? Let's revisit both
conditions:
1
A chain of intervals must have no gaps (Figure 1.8)
2
A chain is maximal if it has uncovered ends (Figure 1.9)
These conditions
are not entirely independent, since every maximal chain of intervals
is delimited by gaps on both ends. A change of perspective is
needed.
Consider any pair
of intervals fst and lst. Suppose fst.x and
lst.y are not covered by other intervals. Are fst and
lst the beginning and the end of a chain? Not necessarily,
because there might be gaps between them. Let's shift our focus onto
a set of fst, lst pairs such that fst.x and lst.y
are not covered by other intervals. If there is a gap between some
pair of intervals fst1, lst1, then it follows that there has
to be another pair fst2, lst2 with the same first element
fst2 = fst1 and some other second element lst2 such that
lst1.y < lst2.y
In other words,
among all the fst, lst pairs with the same fst
element, it is the pair with the minimal lst element that
defines a chain without gaps.
Let's formalize
these ideas. A NOT EXISTS subquery has already been written that
defined a set of fst, lst pairs such that fst.x
and lst.y are not covered by other intervals. Alternatively,
the conditional summation pattern could have been applied.
select fst.x, lst.y, sum(
case when cover.x < fst.x and fst.x <= cover.y
or
cover.x <= lst.y and lst.y < cover.y
then 1 else 0 end
)
from intervals fst, intervals lst, intervals cover
where fst.y <= lst.y
group by fst.x, lst.y
counts if either fst.x or lst.y is covered by some
interval. If the count is 0, then the fst, lst pair
defines a maximal chain (possibly with gaps). Therefore, a formal
query returning all the maximal chains is:
select
fst.x, lst.y
from intervals fst, intervals lst, intervals cover
where fst.y <= lst.y
group by fst.x, lst.y
having sum(
case when cover.x < fst.x and fst.x <= cover.y
or
cover.x <= lst.y and lst.y < cover.y
then 1 else 0 end
) = 0
The final step is selecting all the pairs with the minimal
lst element. All the pairs with the fixed fst element,
which translates into group by are being considered:
select x,
min(y) from (
select fst.x, lst.y
from intervals fst, intervals lst, intervals cover
where fst.y <= lst.y
group by fst.x, lst.y
having sum(
case when cover.x < fst.x and
fst.x <= cover.y
or cover.x <= lst.y and lst.y < cover.y
then 1 else 0 end
) = 0
)
group by x
There is an important special case of the interval coalesce
problem:
Given a set of
integers, partition them into ranges of successive numbers.
This problem can
also be called interval packing, and is the reverse of the
discrete interval sampling, that will be studied in the next
chapter.
If each integer
x is represented as a (closed) interval [x,x+1], then the
problem reduces to interval coalesce. Rod West suggested a much more
elegant solution, however. His key insight was an expression that
groups the numbers within the same range. Then, if we know how to
group integers, the ranges are defined by taking minimum and maximum
inside each group. What criterion identifies each group?
It was shown in the
section on counting how to enumerate rows in the increasing order:
select t.x,
count(*) seq#
from T t, T tt
where tt.x <= t.x
group by t.x
It is the x - seq# expression that remains constant
within each group!
The rest is
straightforward. We group by this expression and calculate
min and max aggregate values, which are demarcating the
beginning and the end of each interval:
select
min(x), max(x) from (
select x, rank() over(order by x) seq# from T
) group by x-seq
Predictably, this problem might occur in a slightly more complicated
context. Our input relation can have one more column, say name,
so that integers are grouped by the name values. Rod's
solution scales up naturally to the new requirements. The additional
grouping column just emerges in the appropriate places:
select
name, min(x), max(x) from (
select x, rank() over(partition by name order by x) seq#
from T
) group by name, x-seq