 |
|
A Case for the CASE
Operator
SQL Tips by Donald Burleson |
This section
embarks upon a little bit longer journey. We have already mentioned
SQL proximity to the English language. Consider the following query:
For each
customer, show the number of calls made during the first 6 months
that exceeded the average length of all calls made during the year,
and the number of calls made during the second 6 months that
exceeded the same average length.
True, you have to
read this long sentence more than once. Upon careful examination,
though, you?ll be easily convinced that the apparent complexity is
illusory. The query can be translated into SQL in small increments.
But first, a schema is needed to anchor the SQL query to:
table Calls
(
FromPh integer(10),
ToPh integer(10),
ConnectedAt date,
Length integer
);
The Calls table stores the calls placed on a telephone
network over a period of one year. Each FromPh number
identifies a customer.
Thus the process
begins with translating
For each
customer ?
Into
select
FromPh, ?
from Calls
group by FromPh
The missing part that will be gradually developed in the next
steps is intentionally marked with ellipsis.
At this moment you
may wonder if:
select
distinct FromPh
from Calls
is the easier way to list all the customers. It certainly is, but
now what? This query is complete, it answers the question partially,
but it cannot be expanded to answer the remainder. The group by
clause, on the other hand, is one of the most powerful SQL
constructs.
DISTINCT operator is redundant
Technically, the
distinct operator is a special case of group by. For
any table (or view) T
select
distinct x, y from T
is equivalent to
select
x, y from T
group by
x, y
The next clause:
?, show the
number of calls made during the first 6 months that exceeded the
average length of all calls made during the year, ?
leaves us a choice.
The condition can be placed into the where clause, but then
some difficulty may arise in assembling the query pieces together.
A better alternative is to leverage a familiar conditional summation
pattern:
select
FromPh,
sum( case when - then 1 else 0 end ),
?
from Calls
group by FromPh
Ellipsis means that the condition must be interpreted:
? during the
first 6 months that exceeded the average length of all calls made
during the year ?
This is still a
relatively complex sentence. You may notice that the two variables:
ConnectedAt and Length are involved. The condition
begins with
? during the
first 6 months ?
which is easily
translated into ConnectedAt < ?1-July-2005?. The next
fragment
? the average
length of all calls made during the year ?
is a little bit
trickier. First, the query is ambiguous. Did the author mean the
average length of all the calls in the system, or the average length
for each customer? Both interpretations are perfectly reasonable.
The average length of the call is
select
avg(Length)
from Calls
while the average length of the call per each customer is:
select
FromPh,
avg(Length)
from Calls
group by FromPh
The first interpretation is easier to implement than the
second one, therefore, it is used as an exercise for the reader.
So, given the query
that has been developed so far
select
FromPh,
sum( case when - then 1 else 0 end ),
?
from Calls
group by FromPh
where does the relation
select
FromPh,
avg(Length)
from Calls
group by FromPh
fit in? The only
place that admits arbitrary relations is the from clause.
Relational Closure
The SQL query block inside
the from clause is called inner view. From a logical
perspective there is no difference if a relation within the
from clause (or anywhere in the SQL statement, for that
matter) is a table or a view. It is a manifestation of the
fundamental property of the Relational Model - Relational Closure.
It is common to organize a query in a chain of inner views so that
every step is small and easily comprehendible.
Let's nest the
second query into the first as an inner view:
select
c1.FromPh,
sum( case when - then 1 else 0 end ),
?
from Calls /*as*/ c1, (
select FromPh,
avg(Length) /*as*/ av
from Calls
group by FromPh
) c2
group by FromPh
Aliases c1, c2, and av, were
introduced along the way, which will be helpful for further
development. The c1, in fact, is required to disambiguate the
FromPh column name in the select clause.
Translation of the
informal query into SQL is only a few steps away from completion.
First, the relations c1 and c2 are naturally joined by
the customer id, FromPh. Second, the av alias is the
average length of the call per each customer that was required to
complete the predicate inside the case operator. Thus:
select
c1.FromPh,
sum(case when ConnectedAt < '1-July-2005'
and Length
< av
then 1 else 0 end),
?
from Calls /*as*/ c1, (
select FromPh,
avg(Length) /*as*/ av
from Calls
group by FromPh
) c2
where c1.FromPh = c2.FromPh
group by FromPh
The final clause of the informal query
?and the number
of calls made during the second 6 months that exceeded the same
average length.
is very similar to
the clause that was just analyzed.
Let's explore a
slightly different path. Instead of introducing an inner view c2,
why not calculate the average length for the customer in place as a
correlated scalar subquery:
select
c1.FromPh,
sum(case when ConnectedAt < '1-July-2005'
and Length
< (
select avg(Length)
from Calls c2
where c1.FromPh = c2.FromPh
)
then 1 else 0 end
),
?
from Calls c1
group by FromPh
Which of the two queries, the one with inner view, or the one
with scalar subquery performs better? Well, they are logically
equivalent, aren't they? The SQL engine reserves the right to
transform a query to a logically equivalent one. A curious reader
might want to check if both queries have the same execution plans on
the RDBMS of his choice.
Let's pause and
reflect back a little. The genesis of the solution is the case
operator inside the sum aggregate. It is possible to express
this query in SQL without it. Chantziantoniou et al (the authors of
the article where I borrowed the problem from) followed that route
and introduced a chain of named intermediate views as follows:
create view
AvgCallLengthPerCust as
select FromPh, avg(Length) /*as*/ avgL
from Calls
group by FromPh
create view
ExcAvgDuring1stHalfYear as
select C.FromPh, count(*) /*as*/ count
from Calls /*as*/ C, AvgCallLengthPerCust /*as*/ V
where C.FromPh = V.FromPh AND
Length > avgL AND Date<'1-July-2005'
group by C.FromPh
create view
ExcAvgDuring2ndHalfYear as
select C.FromPh, count(*) /*as*/ count
from Calls /*as*/ C, AvgCallLengthPerCust /*as*/ V
where C.FromPh = V.FromPh AND
Length > avgL AND Date>='1-July-2005'
group by C.FromPh
select
a1.FromPh, a1.cnt, a2.cnt
from ExcAvgDuring1stHalfYear /*as*/ a1,
ExcAvgDuring2ndHalfYear /*as*/ a2
where a1.FromPh=a2.FromPh
Based on this example, Chantziantoniou et al proposed
extending SQL language in such a way that would make writing queries
involving multiple features of the same group easier. As have been
shown earlier, a solution leveraging the case operator makes
this argument less convincing.