 |
|
Summarizing by more
than
one Relation
SQL Tips by Donald Burleson |
The previous
section mentioned that
select
deptno, count(*) from Emp
group by deptno
could be rewritten into an equivalent form leveraging the
correlated scalar subquery:
select
distinct deptno,
(select count(*) from Emp ee
where ee.deptno = e.deptno)
from Emp e
Both queries
project the Emp relation onto the deptno column, and
extend the result with one extra column that counts the number of
rows in each group in the original relation. What about those
deptno values that are missing in the Emp table,
shouldn't they be listed with count 0? Suppose deptno, say,
40 is a valid department on the system, how can the query be changed
to show it with the count 0?
Well, if deptno
= 40 is a valid department, then it should be in some table --
Dept, for example, where it is most likely a primary key.
Then, why not use this table in the outer query:
select
deptno,
(select count(*) from Emp e
where e.deptno = d.deptno)
from Dept d
An added bonus of
having two tables in the query is that the distinct qualifier
is no longer required.
Hugh Darwen's Summarize
Hugh Darwen argued that
group by with aggregation is an operator that requires two
tables as the arguments, in general. The idea of introducing such
an operator in SQL never caught on. Yet, in each practical
situation it might be useful to double check if writing the
group by clause as a one- or two- argument operator is more
appropriate.
SQL is notorious
for allowing multiple ways to express the same query. Listing all
the departments with the employee counts could also be rewritten via
the outer join:
select
d.deptno, d.dname,
sum(case when e.deptno is not null then 1 else 0 end)
from Emp e right outer join Dept d
where d.deptno = e.deptno
group by d.deptno, d.dname
If the conditional summation pattern is reduced to a simple
count(*), then the departments with no employees will count 1
employee instead of 0.
ANSI Join Syntax
It is difficult to argue
about elegance or ugliness of a certain syntax construction. You
just see it or you don't. Comma separated join syntax reflects the
fundamental feature of Relational Algebra, which asserts the
normal form for select-project-join queries. The only kind of join
that escapes it (and therefore, warrants a dedicated syntax) is
the outer join.
It is not only aesthetics.
It is common for production databases to have columns like
CREATED_ON, or COMMENTS
across many tables. In
this case the
NATURAL JOIN syntax is
plain dangerous.
As Anthony Molinaro
eloquently put it: ?Old style is short and sweet and perfect. ANSI
dumbed it down, and for people who've been developing for
sometime, it's wholly unnecessary.?
Which form, scalar
subquery or outer join, is more performant? Surely, the answer
differs between the vendors. Oracle, for example, is better at outer
join optimization than unnesting scalar subqueries in the select
clause. Outer join from the optimizer's perspective has almost
the same rights as normal join. It can be permuted with the other
joins in the join order; it is costed similarly, etc. If the
summarizing query is a part of the other query, chances are the
optimizer may find the better plan when the query is written via
outer join.