 |
|
Counting in SQL
SQL Tips by Donald Burleson |
Counting is one of the basic patterns that an SQL developer
learns after learning the basics: selection, projection, join, and
subquery. While counting might look deceptively easy in a context of
a single table, it becomes intellectually challenging when two
tables are joined together and grouping applied.
The first half of this chapter will present information to enable
the reader to write counting queries in a complex context.
The second half of the chapter will present information regarding
conditional summation. This pattern is a beautiful
combination of two SQL constructs: the case operator, and
aggregation. Conditional summation has numerous applications,
which will also be presented throughout the remainder of the
chapter.
Counting
Ordered Rows
Let's start with a basic counting problem. Suppose we are given a
list of integers, for example:
and want to enumerate all of them sequentially like this:
Enumerating rows in the increasing order is the same as counting the
number of rows preceding a given row.
SQL enjoys success unparalleled by any rival query language. The
reason for such popularity might be credited to its proximity to the
English language. Examine the informal idea carefully:
Enumerating rows in increasing order is counting how many rows
precede a given row.
Perhaps the most important thing to note is the rows in the source
table are referred to twice: first, to a given row, and
second, to a preceding row. Therefore, the number list must
be joined with itself as shown in Figure 1.1.
Surprisingly, not many
basic SQL tutorials, which are so abundant on the web today,
mention a Cartesian product. A Cartesian product is a join
operator with no join condition
select
A.*, B.* from A, B
Carrying over this idea into a formal SQL query is straightforward.
As it is the first query in this book, it will be performed step by
step. The Cartesian product itself is:
select t.x
x, tt.x y
from T t, T tt
The triangle area below the main diagonal is:
select t.x
x, tt.x y
from T t, T tt
where tt.x <= t.x
And finally, only one column is needed - t.x - to group the
previous result by and count:
select t.x,
count(*) seqNum
from T t, T tt
where tt.x <= t.x
group by t.x
There is a certain harmony of group by operator in a joint
effort with relational join. There are also drawbacks that
we?ll discuss later.
Equivalence Relation and
Group By
Almost any other SQL query
uses the group by operator. Why is the group by
operator so powerful? It is not among the fundamental relational
algebra operators. A partial answer to this fascinating efficiency
is that group by
embodies an equivalence
relation. Indeed, it partitions rows into equivalence classes of
rows with identical values in a column or a group of columns, and
calculates aggregate values per each equivalence class.
So what happens if the programmer modifies the problem slightly and
asks for a list of pairs where each number is coupled with its
predecessor?
Let me provide a typical mathematician's answer, which is remarkable
in a certain way. Given that it has already been shown how to number
list elements successively, it might be tempting to reduce the
current problem to the previous one:
Enumerate all the numbers in the increasing order and match each
sequence number seq# with predecessor seq#-1. Next!
This attitude is, undoubtedly, the most
economical way of thinking, although not necessarily producing the
most efficient SQL. Therefore, let's revisit our original approach,
as illustrated on Figure 1.2.
This translates into the following SQL query:
select t.x,
max(tt.x) predecessor
from T t, T tt
where tt.x < t.x
group by t.x
Both solutions are expressed in a standard SQL
leveraging join and grouping with aggregation. Alternatively,
instead of joining and grouping why not simply calculate the
count or max in place as a correlated scalar
subquery:
select t.x,
(select count(*) from T tt where tt.x <=
t.x) seq#
from T t
group by t.x
The subquery always
returns a single value; this is why it is called scalar. The tt.x
<= t.x predicate connects it to the outer query; this is why it
is called correlated. Arguably, leveraging correlated scalar
subqueries is one the most intuitive techniques in writing SQL
queries.
Is GROUP BY Redundant?
Chris Date asserts that
the group by operator is redundant since
select
deptno, avg(sal) from Emp
group by
deptno
could be rewritten as:
select
distinct deptno,
(select avg(sal) from Emp ee
where e.deptno = ee.deptno)
from Emp
e
Unlike Date, who exploits
this fact as evidence of SQL deficiencies, it is rather viewed as
yet another demonstration of the power of scalar subqueries.
How about counting
rows that are not necessarily distinct? This is where the method
breaks. It is challenging to distinguish duplicate rows by purely
logical means, so various less ?pure? counting methods were devised.
They all, however, require extending the SQL syntactically, which
was the beginning of slipping along the ever increasing language
complexity slope.
Here is how an
analytic SQL extension counts rows:
select x,
rank() over(order by x) seq# from T; -- first problem
select x, lag() over(order by x) seq# from T; -- second problem
Many people suggest that not only is it more efficient, but
more intuitive. The idea that ?analytics rock? can be challenged in
many ways. The syntactic clarity has its cost: the SQL programmer
has to remember (or, at least, lookup) the list of analytic
functions. The performance argument is not evident, since
non-analytical queries are a simpler construction from the optimizer
perspective. A shorter list of physical execution operators implies
fewer query transformation rules, and less dramatic combinatorial
explosion of the optimizer search space.
It might even be
argued that the syntax could be better. The partition by and
order by clauses have similar functionality to the group
by and order by clauses in the main query block. Yet one
name was reused, and the other had been chosen to receive a new
name. Unlike other scalar expressions, which can be placed
anywhere in SQL query where scalar values are accepted, the
analytics clause lives in the scope of the select clause
only. I have never been able to suppress an impression that analytic
extension could be designed in more natural way.