 |
|
Enumerating Sets of Integers
SQL Tips by Donald Burleson |
When enumerating integer
sets every integer is mapped into a set of integers. Representing an
integer in a binary numbering system provides a natural way to
implement such mapping. Formally, in each
ai
is either 0
or 1. Then, the sequence (a0,
a1,
a2,
a3,
?) is a characteristic vector of a set: i is an element of
the set whenever
ai
= 1. For example, 10 = 21 + 23 maps into
{1,3}. Here is an integer set enumeration for up to N=10:
Unfortunately, this
output cannot be produced in SQL, at least, not yet. Later, a study
of composite data types and how multiple values are aggregated into
strings and collections will be presented. For now, how to build a
relation, ?i is an element of set N? is shown:
The implementation is
easy. Take the IntegerPairs relation and recruit the BITAND
and POWER SQL functions to filter the unwanted pairs (N, i)
pairs:
select y N,
x i from IntegerPairs
where
bitand(power(2,x),y)>0
Although the idea of
pipelining was somewhat discredited in the previous section, it may
be recovered for this purpose. We take an infinite stream of integer
pairs, apply a filter with the bitand(power(2,x),y)>0
predicate and cut a finite segment of the result on the client. With
a non-pipelined IntegerPairs implementation (via Cartesian
product of Integers relations) the execution would hang in an
infinite loop.
Discrete
Interval Sampling
Perhaps the second most
common problem that utilizes the Integers relation is
unpacking the intervals, or leveraging the Signal Processing
terminology Discrete Interval Sampling.
Reduced to essentials the
problem is the following. Given two dates, ?1-jan-2005? and
?31-dec-2005?, for example, return the list of the days in
between. This is trivial, of course, if the Dates relation
has already been filled in with the dates:
select day
from Dates
where day between ?1-jan-2005? and ?31-dec-2005?
If the Dates
relation is not present, why not simply manufacture it? All that is
required is the support of the datetime/interval arithmetic by the
RDBMS of your choice. Perhaps, the easiest approach is adding
numeric values to the date represented in the date data type.
For example, adding 1 to today's date produces tomorrow's date.
Likewise, to_date(?1-jan-2005?)+1 is to_date(?2-jan-2005?),
and so on. Therefore,
select
to_date(?1-jan-2005?)+num
from Integers
where to_date(?1-jan-2005?)+num <= to_date(?31-dec-2005?)
will generate a list of days spanning the whole year. This toy query
is frequently encountered as a part of more challenging puzzles.
Consider the following
hotel reservation application. The Bookings table stores
reservations made against a set of rooms:
Users typically query
what rooms are empty during a certain period. For example, querying
all available rooms in the time period from 2005-05-07 to 2005-05-12
should return:
We start with two
relations: the first containing the list of all the rooms, and the
second containing all the dates between 2005-05-07 and 2005-05-12.
It is very tempting to define the first relation as
select
distinct RoomNo from Bookings
However, it is possible
that a particular room is always empty, and therefore, has no
records in the Bookings table. The whole Bookings
table can be empty! Let's assume that we have the list of all
rooms as one more relation -- the Rooms table. We are just
one step from the solution. If a Cartesian product of the two
relations is generated, the unwanted results (again, with scalar
subquery) can be filtered out:
select
RoomNo, Day
from (
select to_date('07-May-2005')+num Day
from Integers
where to_date('07-May-2005')+num <= to_date('12-May-2005')
), Rooms r
where 0 = ( select count(*) from Bookings b
where
r.RoomNo = b.RoomNo
and Day
between "From" and "To"
);
Although the query has written itself effortlessly, much is left to
be desired about its performance. A seasoned SQL performance
analyst would immediately spot a couple of problems with this
solution. Building a Cartesian product, first, and evaluating every
resulting tuple with a subquery does not sound very promising. But
the Cartesian product is unavoidable: if the Bookings
relation is empty, then the result must be the Cartesian
product. It is an interesting research question, whether one can
reduce the cost associated with subquery evaluation in this example.
An alternative approach
to the problem would reformulate the query from a business
perspective.
1 What
is the list of rooms available during a certain interval?
2 What is the average
ratio of occupied rooms to empty rooms during a certain
period?
Both of these
queries can be answered without discrete interval sampling.