 |
|
Conditional Summation
with
CASE Operator
SQL Tips by Donald Burleson |
The genesis of the conditional summation idiom
is an equivalence between count(*) and sum(1).
Formally,
select
count(*) from emp
is the same as:
select
sum(1) from emp
There is must be
something wrong here. How these two queries can be equivalent? They
produce different result when the Emp relation is empty.
This is merely a
misfortunate historical artifact. Had SQL ANSI standard committee
been in a mathematical mood that day, they would certainly fixed the
definition to make them equivalent. This would make SQL world a tiny
bit simpler.
More important,
however, is a common misconception that count should have any
arguments at all. First, for most practical purposes count =
sum(1), and there is no free variable parameter within the
sum(1) expression. Second, think about how the count
function may be implemented on a low-level. A reasonable code must
look like this:
int count =
0;
for( int i = 0; i< 10; i++)
count = count + 1;
The count variable
is updated during each row processing with the unary increment
operator +1. Unlike count, any ?normal? aggregation
has to use a binary operation during each aggregate incremental
computation step
int sum = 0;
for( int i = 0; i< 10; i++)
sum = sum +
element[i];
that is, + for sum,
∨ for max,
∧ for min, etc.
Therefore, one argument is needed for normal aggregates, and no
arguments for count.
Argument for COUNT
The formal difference
between
select
count(*) from emp
and
select
count(1) from emp
has been the subject of
lengthy investigations on some internet forums. If there were
indeed any implementation and performance difference between the
two, then one can argue that the query optimizer should transform
the query accordingly to eliminate it. In other words, this
counting syntax quirk is not worth 2 cents.
OK, as far as
simple counting is concerned, there does not appear to be any need
for an argument. But what about
select
count(ename) from emp
where only
non-null values of the ename column are counted? The
description of count(ename) in the previous sentence
translates directly into SQL:
select
count(*) from emp where ename is not null
So, count(ename)
is no more than a syntax shortcut.
Well, how about
select
count(distinct ename) from emp
where the count
aggregate function accepts a column expression with a keyword? This
is yet another shortcut:
select
count(*) from (
select distinct empno from emp
)
What if counting two different values at the same time like
this is preferred:
select
count(ename), count(*) from emp
Even though it
looks like SQL has a dedicated syntax shortcut for every imaginable
task, at this point it is easy to argue that these extensions are
nifty at least in some practical cases.
Enter the
conditional summation
pattern. Whenever rows are counted satisfying a certain
criteria such as:
select
count(*) from emp
where sal < 1000
and the where clause seems to hinder the query's
evolution to a more sophisticated form, it can be rewritten
without the where clause:
select
sum(case when sal < 1000 then 1 else 0 end)
from emp
Conditional summation queries scale up nicely to accommodate
more complex requirements. In the example with the familiar Emp
table,
the previous query
is transformed to count the small salaries per each department by
amending it with group by:
select
deptno,
sum(case when sal < 1000
then 1 else 0 end) cnt
from emp
group by deptno
The subtle novelty
here is that the conditional summation query is no longer equivalent
to the former attempt restricting the condition in the where
clause:
select
deptno,count(*) from emp
where sal < 1000
group by deptno
Zero
counts were perfectly legal in the aggregation without the
grouping case. Disappearing zeros are a sign of (yet another) SQL
inconsistency.
Aggregation without
Grouping
An aggregation with no
grouping is, in fact, an aggregation within a single group. If SQL
syntax allowed grouping by the empty set of columns
∅,
then a simple aggregate
select
count(*) from T
could also be represented
as
select
count(*) from T
group by
∅
Without the empty set
syntax, we still can write
select
count(*) from T
group by
0
The 0 pseudo column is a
constant expression, so that the table T is partitioned into a
single group effectively the same way as with the empty set.
Perhaps the most
important rationalization for the conditional summation idiom is
counting by different criteria. Without conditional summation we
would have to count by each individual condition in a dedicated
query block, and combine those counts with a join. The pivot
operator, which will be studied in Chapter 3, is a typical showcase
of this idea.
Before the case
operator became widely available in off-the-shelf RDBMS systems, a
much more ingenious counting method with an indicator function was
employed.