 |
|
Product
SQL Tips by Donald Burleson |
Product is another
aggregate function, which is not on the list of built-in SQL
functions. Randomly browsing any mathematical book however reveals
that the product symbol occurs much less frequently than summation.
Therefore, it is unlikely to expect the product to grow outside a
very narrow niche. Yet there are at least two meaningful
applications.
Factorial
If the product aggregate
function were available, it would make the factorial calculation
effortless:
select
prod(num) from integers where num <=10 -- 10!
As always, a missing
feature encourages people to become very inventive. From a math
perspective, any multiplicative property can be translated into an
additive one if the logarithm function is applied. Therefore,
instead of multiplying numbers, their logarithms can be added, and
the result exponentiated. The rewritten factorial query is still
very succinct:
select
exp(sum(ln(num))) from integers where num <=10
At this point most people
realize that product via logarithms does not work in all the cases.
The ln SQL function throws an exception for negative numbers
and zero. After recognizing this fact, however, the problem could be
fixed immediately with a case expression.
Numbers in SQL
If you feel uneasy about
?fixing? an elegant exp(sum(ln(?))) expression with case
analysis, it's an indicator that you as a programmer reached a
certain maturity level. Normally, there are no problems when
generalizing clean and simple concepts in math. What exactly is
broken?
The problem is that the
logarithm is a multivalued function defined on complex numbers.
For example, ln(-1) = i π (taken on principal branch).
Then, e i π = -1 as expected!
Writing a function
calculating factorial is perhaps one of the most ubiquitous
exercises in procedural programming. It is primarily used to
demonstrate the recursion. It is very tempting, therefore, to try
recursive SQL:
with
factorial (n, f) as (
values(1, 1)
union all
select n+1, f*(n+1) from factorial
where n<10
)
select * from factorial
Joe Celko suggested one more method - calculating factorial via
Gamma function. A polynomial approximation with an error of less
than 3*10-7 for 0 ≤ x ≤ 1 is:
Γ(x+1) = 1 - 0.577191652
x + 0.988205891 x2 - 0.897056937 x3 +
0.918206857 * x4 - 0.756704078 x5 +
0.482199394 x6 -
0.193527818 x7
+ 0.035868343 x8
Unfortunately, the error
grows significantly outside of the interval 0 ≤ x ≤ 1. For example,
Γ(6) = 59.589686421
Factorial alone could
hardly be considered a convincing case for the product aggregate
function. While factorial is undoubtedly a cute mathematical object,
in practice it is never interesting to such an extent as to warrant
a dedicated query. Most commonly, factorial occurs as a part of more
complex expression or query, and its value could be calculated on
the spot with an unglamorous procedural method.
Interpolation
Interpolation is more
pragmatic justification for the product aggregate. Consider the
following data:
Can you guess the missing
values?
When this SQL puzzle was
posted on the Oracle OTN forum, somebody immediately reacted
suggesting the lag and leading analytic functions as a
basis for calculating intermediate values by averaging them. The
problem is that the number of intermediate values is not known in
advance, so my first impression was that the solution should require
leveraging recursion or a hierarchical SQL extension, at least. A
breakthrough came from Gabe Romanescu, who posted the following
analytic SQL solution:
select X,Y
,case when Y is null
then yLeft+(rn-1)*(yRight-yLeft)/cnt
else Y end /*as*/ interp_Y
from (
select X, Y
,count(*) over (partition by grpa) cnt
,row_number() over (partition by grpa order by X)
rn
,avg(Y) over (partition by grpa) yLeft
,avg(Y) over (partition by grpd) yRight
from (
select X, Y
,sum(case when Y is
null then 0 else 1 end)
over (order by X)
grpa
,sum(case when Y is
null then 0 else 1 end)
over (order by X
desc) grpd
from data
)
);
As usual in case of
complex queries with multiple levels of inner views, the best way to
understand it is executing the query in small increments. The
inner-most query introduces two new columns, with the sole purpose
to group spans with missing values together.
select X, Y
,sum(case when Y is null then 0 else 1 end)
over (order by X) grpa,
,sum(case when Y is null then 0 else 1 end)
over (order by X desc) grpd
from data;
The next step
select X, Y
,count(*) over (partition by grpa) cnt
,row_number() over (partition by grpa order by X)
rn
,avg(Y) over (partition by grpa) yLeft
,avg(Y) over (partition by grpd) yRight
from (
select X, Y
,sum(case when Y is
null then 0 else 1 end)
over (order by X)
grpa
,sum(case when Y is
null then 0 else 1 end)
over (order by X
desc) grpd
from data
)
calculates four more columns:
1
cnt - length of each span
2
rn - offset inside the span
3
yLeft - the Y value at the left boundary
4
yRight - the Y value at the right boundary
The aggregate function - avg - is pretty arbitrary. There is only one non NULL value
in each partition.
Now, everything is ready
for the final step in which the linear interpolation formula is
applied:
Note that instead of the
xLeft and xRight variables we have cnt = xRight - xLeft and rn = X -
xLeft.
A query leveraging an
analytic SQL extension can be expressed in standard SQL:
select X,
Y,
(select min(bLeft.Y+( bRight.Y - bLeft.Y)*
(bb.X- bLeft.X)/( bRight.X - bLeft.X) )
from data bLeft, data bRight
where bLeft.X < bRight.X
and bLeft.Y is not null and bRight.Y is not null
and not exists(select 0 from data bi
where bLeft.X < X and X < bRight.X
and bi.Y is not null)
and bb.X between bLeft.X and bRight.X) /*as*/ interp_Y
from data bb
Arguably, this solution
is more intuitive. The original relation can be scanned and an extra
column added as a scalar subquery.
The original relation
corresponds to point (X,Y) on the figure. In the subquery
points (bLeft.X,bLeft.Y) and (bRight.X,bRight.Y) are the boundaries
of the span that we are filling in with values. The min
aggregate is bogus. It insures that the subquery in the select
clause is scalar. Without aggregation the subquery returns either
one value or two identical values.
What if the missing
values are not bound with a known value at the end of the range?
This can certainly be addressed as a special case and at the cost of
making the query more complicated. Alternatively, this snag is
nicely solved with non-linear interpolation, in general, and
Lagrange Interpolating Polynomial, in particular
Here is the product
symbol, at last!
Polynomial is a simpler
concept than the piece-wise linear function. A cleaner concept
translates into a simpler SQL, but the interpolation result is
slightly different, of course.
select x,
sum(y*mul) interp_Y from (
select x,j,y, product(a) mul
from (
select bb.X x, bj.X j, bj.Y y, (bb.X-bk.X)/(bj.X-bk.X) a
from data bj,data bk, data bb
where bj.Y is not null and bk.Y is not null
and bj.X!=bk.X
) group by x,j,y
) group by x