 |
|
Histograms in SQL
SQL Tips by Donald Burleson |
The concept of
histograms originated in statistics. Admittedly, statistics never
enjoyed the reputation of being the most exciting math subject. I
never was able to overcome a presumably unfair impression that it is
just a collection of ad-hoc methods. Yet a typical database table
has a huge volume of data, and histograms provide an opportunity to
present it in compact human digestible report.
A histogram can be
built on any (multi-)set of values. In the database world this set
comes from a table column, so values can be numerical, character, or
other datatypes. For purposes in this book, it is convenient to
order the values and then index them as in Figure 3.4
This might not feel
natural at first, but the two dimensions on the graph provide an
insight as to why there are exactly two kinds of histogram.
Equal-Width Histogram
Equal-width
histogram partitions the graph horizontally, as shown on Figure 3.5.
Evidently, there is
something wrong here, either with the term equal- width histogram,
or my unorthodox interpretation. First, the partitioning is
vertical; second, why do buckets have to be of uniform size? Well,
if the coordinates are swapped in an attempt to make partitioning
conform to its name, then it would not be a function anymore. Also,
partitioning does not have to be uniform. Histograms with
logarithmic partitioning will be studied later.
There is one
special type of equal-width histogram - frequency histogram.
It partitions the set of values in the finest possible way, Figure
3.6.
The two kinds of
objects -- (indexed) values and buckets -- can now be associated in
a query. Either of the following questions should be asked:
1
what bucket a particular value falls in, or
2
what is the aggregate value in each bucket (there may be
interest in more than one aggregate function)
In the first case,
the query returns one extra attribute per each record in the input
data. Given the input set of data,
As suggested in
Figure 3.4, the number in the Bucket column is determined by
the Value.
Enough abstract
talk, let's write some SQL queries.
Assign a bucket
number to each record in the IndexedValues table. All the records
with values in the range from 0 to 1 are placed into 0th
bucket, values in the range 2 to 3 go into bucket number 1, etc.
The key to the
solution is identifying what kind of function mapping of indexes
into buckets satisfies the (informal) specification. The
floor(index#/2) does, so the solution is as simple as:
select index#, value, floor(value/2) bucket
from IndexedValues
Often, the mapping
from indexes to buckets is not determined by a mathematical
function, like the combination of floor and division in the previous
example. Consider:
Distribute
IndexedValues records into 4 buckets. If total number of records is
n, then all the records with values in the range from 0 to n/4 are
placed into 0th bucket, values in the range n/4 to n/2 go into
bucket number 1, etc.
This informal query
translates 1-to-1 into SQL:
select index#, value, floor(4*value/n) bucket
from IndexedValues, (select count(*) n from IndexedValues)
Let's move on to the second type of queries - aggregates grouped by
bucket.
Assign a bucket
number to each record in the IndexedValues table. All the records
with values in the range from 0 to 1 are placed into 0th
bucket, values in the range 2 to 3 go into bucket number 1, etc. For
each bucket count the number of values that fall into this bucket.
This version is
just a small increment to the query written earlier:
select floor(value/2) bucket, count(*)
from IndexedValues
group by floor(value/2)
A simplified version of the previous query in verbose form defines a
frequency histogram.
Assign a bucket
number to each record in the IndexedValues table. The record with
value 0 is placed into 0th bucket, value 1 goes into
bucket number 1, etc. For each bucket count the number of values
that fall into this bucket.
It is formally
expressed as celebrated group by query:
select value bucket, count(*)
from IndexedValues
group by value
Equal-Height Histogram
Equal-height
histogram partitions the graph vertically, as shown on Figure 3.7.
The development in
the rest of the section would mimic an equal-width case with the
value and index# roles reversed. There is some asymmetry,
though. Unlike the equal-width case where partitioning values with
the finest granularity led to the introduction of the frequency
histogram, there is nothing interesting about partitioning indexes
to the extreme. Each bucket corresponds one-to-one to the index, so
the concept of a bucket in no longer needed.
Let's proceed
straight to queries. They require little thought other than the
formal substitution of value by index#.
Assign a bucket
number to each record in the IndexedValues table. All the records
with indexes in the range from 0 to 1 are placed into 0th
bucket, indexes in the range 2 to 3 go into bucket number 1, etc.
select index#, value, floor(index#/2) bucket
from IndexedValues
There is one
subtlety for an aggregate query. The record count within each bucket
is not interesting anymore, as in our example we know it's trivially
2. We might ask other aggregate functions, though:
Assign a bucket
number to each record in the IndexedValues table. All the records
with indexes in the range from 0 to 1 are placed into 0th bucket,
indexes in the range from 2 to 3 go into bucket number 1, etc. For
each bucket find the maximum and minimum value.
select floor(index#/2) bucket, min(value), max(value)
from IndexedValues
group by floor(index#/2)