Logarithmic Buckets
SQL Tips by Donald Burleson
A bird's-eye view of
big tables is essential for data analysis, with aggregation and
grouping being the tools of the trade. Standard, equality-based
grouping, however, fails to accommodate continuous distributions of
data. Consider a TaxReturns table, for example. Would
simple grouping by GrossIncome achieve anything? Undoubtedly
it would show many individuals with identical incomes, but the big
picture would escape us because of the sheer report size. Normally,
we would like to know the distribution of incomes by ranges of
values. Choosing the ranges carefully should produce a much more
compact report.
There are two elementary
methods for choosing ranges. With a linear scale, each range
boundary is placed at a uniform offset from the previous one. With a
logarithmic scale, the offset grows exponentially. In the tax
returns example, a familiar linear scale of ranges is 0-10K, 10K-20K, 20K-30K, ..., while the logarithmic scale is ..., 1K-10K, 10K-100K, 100K-1000K, ...
Although a linear scale
is conceptually simpler, a logarithmic scale is more natural. It
appears that a linear scale has more precision where most tax
returns are located, but this is solely by virtue of the magic
number 10K that was chosen as an increment. Had a different figure
been chosen, say 100K, then the precision would have been lost.
In a capitalist system
income is not bounded. There are numerous individuals whose income
goes far beyond the ranges where most people fall. As of 2005 there
were 222 billionaires in the US. They do not all fit in the same
tiny 10K bucket of incomes, of course. Most likely each would
occupy its own range, so 222 10K-wide ranges would be needed
just to accommodate them!
From a logarithmic
perspective, however, a $1B person is "only" 5 ranges away from a
$10K earner. The logarithmic scale report stays compact no matter how
skewed the distribution is. Changing the factor of 10 into, say, 2
would increase the report size by a factor of log2(10) ≈ 3,
which could still qualify as a compact report. In a sense, the
choice of factor in a logarithmic scale is irrelevant.
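As a sketch (it presumes the TaxReturns table and GrossIncome column used in the query developed below; the aliases are mine), switching to factor 2 only means changing the base of the logarithm:

select power(2, floor(log(2, GrossIncome))) as bucket_start,  -- lower boundary of each factor-2 bucket
       count(*) as returns
from TaxReturns
group by floor(log(2, GrossIncome));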
Logarithmic Scale in Programming
I often hear a fellow
programmer suggest "Let's allocate 2K of memory". Why 2K? Well,
this size sounds about right for this particular data structure.
Fine, if the structure is static, but what about dynamic ones,
like any collection type? Allocating 2K every time we need to
grow the structure may be fine by today's standards, but this magic
number 2K would look ridiculous a decade later.
A logarithmic scale,
however, provides an immortal solution: "Last time I allocated X
bytes of memory, so let's allocate twice that this time". In
short, there is no room for linear solutions in a world abiding
by Moore's law.
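To see that schedule spelled out, here is a throwaway Oracle-style row generator (the 2K seed and the ten steps are arbitrary assumptions, purely for illustration):

select 2048 * power(2, level - 1) as bytes_allocated  -- 2K, 4K, 8K, ... doubling each step
from dual
connect by level <= 10;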
Let's implement
logarithmic scale grouping of incomes in SQL. All that is needed
are the power, log, and floor functions:
select power(10, floor(log(10, GrossIncome))), count(1)
from TaxReturns
group by floor(log(10, GrossIncome));
It is the floor(log(10,GrossIncome)) expression that maps
each logarithmically spaced range into a distinct number. We
group by this expression, although for readability of the result
set it is convenient to transform the column expression in the
select list back into conventional salary units by exponentiating it.
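For instance, a variant of the same query (a sketch; the column aliases are mine) that spells out each bucket's boundaries might look like:

select power(10, floor(log(10, GrossIncome)))     as range_start,  -- lower bucket boundary, e.g. 10000
       power(10, floor(log(10, GrossIncome)) + 1) as range_end,    -- upper bucket boundary, e.g. 100000
       count(*)                                   as returns
from TaxReturns
group by floor(log(10, GrossIncome))
order by range_start;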
Skyline Query
Suppose you are shopping
for a new car, and are specifically looking for a big car with
decent gas mileage. Unfortunately, you are trying to satisfy two
conflicting goals. When querying the Cars relation in the
database, all models that are worse than some other model by both
criteria can be ignored. The remaining set of cars is called the
Skyline.
Figure 3.8 shows the
Skyline of all cars in a sample set of three models. The
vertical axis is seating capacity while the horizontal axis is the
gas mileage. The vehicles are positioned in such a way that their
locations match their respective profiles. The Hummer is taller than
the Ferrari F1 racing car, which reflects the difference in their
seating accommodations: 4 vs. 1. The Ferrari protrudes to the right
of the Hummer, because it has superior gas mileage.
The Skyline is visualized
as a contour of solid lines. The Ferrari outline is lost beneath the
Roadster.
More formally, the Skyline
is defined as the set of points that are not dominated by any other
point. A point dominates another point if it is as good or better
in all dimensions and strictly better in at least one. In the above
example, the Roadster with mileage=20 and seating=2 dominates the
Ferrari F1 with mileage=10 and seating=1. This condition can
certainly be expressed in SQL. In this example, the Skyline query is:
select *
from Cars c
where not exists (
   select * from Cars cc
   where (cc.seats >= c.seats and cc.mileage > c.mileage)
      or (cc.seats > c.seats and cc.mileage >= c.mileage)
);
Despite the apparent simplicity of this query, you would not be
pleased with its performance. Even if the seats and mileage
columns were indexed, it would not help much, as on average half of
the records meet each individual inequality condition.
Certainly, not many records satisfy the combined
predicate, but leveraging an index to match it is not possible.
Bitmapped indexes, which excel at Boolean expressions similar to
the one here, demand a constant on one side of the inequality predicate.
There is an efficient way
to answer a Skyline query, although it is not aligned with the
earlier SQL attempt. Order all the data by either
(seats, mileage) or (mileage, seats); here a composite index might
come in handy. Then compare each record with its predecessor, and
discard the one that is dominated by the other, as sketched below.
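A single-pass version of this idea can be written with window functions. The following sketch (Oracle-flavored; it assumes the same seats and mileage columns, while the make column and the aliases are mine) orders the cars so that potential dominators come first, and keeps a record only if its mileage strictly beats the best mileage seen so far. One caveat: exact duplicates of (seats, mileage) collapse to a single representative.

select make, seats, mileage
from (
   select c.make, c.seats, c.mileage,
          max(c.mileage) over (
             order by c.seats desc, c.mileage desc
             rows between unbounded preceding and 1 preceding
          ) as best_mileage_so_far
   from Cars c
) t
where best_mileage_so_far is null     -- first car in scan order: nothing can dominate it
   or mileage > best_mileage_so_far;  -- strictly beats every bigger-or-equal car seen so far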
Let's assume that the example has been extended to four records,
sorted by (seats, mileage):
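For concreteness, assume the following values (the BMW's figures and the Hummer's exact mileage are hypothetical; the Ferrari's and Roadster's come from the figure):

   seats  mileage  model
   1      10       Ferrari F1
   2      15       BMW
   2      20       Roadster
   4      8        Hummer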
Iterating down record by record, the Ferrari would be excluded
first, because it is dominated by its neighbor, the BMW. Then the
BMW itself would be excluded, because it is dominated by the
Roadster. The Hummer and the Roadster survive: neither dominates
the other, so both belong to the Skyline.