|
 |
|
SQL Extensions for Performance
Oracle Database Tips by Donald BurlesonJuly 24, 2015
|
By Kim Floss
Oracle has greatly enhanced their SQL capabilities such
that there are many new functions that return complex data quickly.
Formerly, coding a simple query to use a ranking of items was quite
complex. Now, if
you delve into these capabilities, things are much easier.
Jonathon Lewis, world-renowned Oracle expert, provides a great
introduction into these functions in the following excerpt:
Analytic Functions
In Oracle version 8.1.6, Oracle Corp. introduced a whole
new layer of functionality to its implementation of SQL: the analytic
functions. This article provides a brief introduction to these functions,
describing their use, and then demonstrating the benefit in clarity and
brevity of the code that they offer. This article is based on material
previously presented to the U.K. Oracle User Group at their annual
conference in December 2000.
The Way Things Were
Assume you have a data warehouse describing the sales for
a chain of shops selling computer games. One of your main data tables,
describing sales by location and title sold per week, may look something
like the following:
create table game_sale (
title varchar2(30),
store varchar2(30),
sales number(10,2),
week_ending date
);
How would you deal with the requirement for an end-user
report showing the two best-selling games in each store this week? There are
two traditional approaches: procedural and nonprocedural.
In the procedural case, we simply write a piece of SQL
that sorts the data for a given week by store and period, then use a 3GL to
walk through it one row at a time, making sure we print only the first two
rows for each store. Consider, for example, this very simple PL/SQL solution
that produces output which is correct but perhaps not entirely aesthetically
satisfactory:
declare
cursor c1 is
select store, title, sales
from game_sale
where week_ending = '21-Jan-2001'
order by store, sales desc
;
m_last_row c1%rowtype;
m_out_ct number := 0;
begin
for r1 in c1 loop
if (m_last_row.store != r1.store) then
m_out_ct := 0;
dbms_output.new_line;
end if;
if m_out_ct != 2 then
dbms_output.put_line (
r1.store || ' - ' ||
r1.title || ' - ' || r1.sales
);
m_out_ct := m_out_ct + 1;
end if;
m_last_row := r1;
end loop;
end;
/
Glasgow - Crash Simulator - 1934
Glasgow
- Manic the Gerbil - 913
London
- Dome - 2167
London
- Portal Combat - 1824
PL/SQL procedure successfully completed.
The following nonprocedural solution is rather more
compact. On the plus side, if we are using SQL*Plus, it does allow us to
produce a much tidier output without resorting to more fiddly little bits of
code. On the minus side, the style of thinking required to produce this code
is not really "natural;" thus, it is not immediately obvious that the SQL is
appropriate to the requirement. More significantly, perhaps, this second
solution requires Oracle to perform a correlated subquery for each row we
examine to check how many items sold more than the current item in the
current store this week — this could be a very resource-intensive query.
select
store, title, sales
from
game_sale gs1
where
week_ending = '21-Jan-2001'
and
2 > (
select
count(*)
from
game_sale gs2
where
gs2.week_ending =
gs1.week_ending
and
gs2.store = gs1.store
and
gs2.sales > gs1.sales
)
order by
store, sales desc;
STORE
TITLE
SALES
---------- ---------------- ---------
Glasgow
Crash Simulator
1934
Manic the Gerbil
913
London
Dome
2167
Portal Combat
1824
In fact, both solutions also suffer from logical defects,
partly because the requirement was not specified with sufficient precision,
but also because the code in both cases is too simplistic.
Look at the
PL/SQL code and ask yourself what would happen if
Glasgow
happened to sell 913 copies of Dome — it wouldn't be reported although it
appears to qualify for second place. (In fact, two users running the same
query with a different set of
init.ora
parameters might get
different answers, depending solely on a minor variation in their
init.ora
parameters such as
sort_area_size).
Anyway, if there is a draw for second place, should both items be reported
or should neither item be reported? The requirement was for the two
best-selling games — we have no indication of how we should handle draws
which result in more than two candidate titles.
The SQL option suffers from similar
problems. Take the same scenario where Dome ties with Manic in
Glasgow. As it stands, the SQL solution will
automatically return three lines for Glasgow. At least there is
no risk of the results changing randomly and you can choose to control the
result by changing the inequalities in the SQL, substituting '>'
for '>=',
but you do need to think very carefully about how you are supposed to handle
tied places.
One final point to consider with these solutions is the
next, simple refinement of the requirement. The top two sales in London total 3991, whereas the top two sales
in Glasgow
total 2847; surely the marketing director will want to see the stores sorted
in descending order by volume of sales. How do you do that in PL/SQL or SQL?
An Analytic Solution
The analytic functions offer a
cleaner and simpler solution at the SQL level. We can solve (one version of)
our requirement using the rank()
function. This function behaves like
all the other analytic functions: after any initial selection, manipulation
and grouping of the data, but before applying an order by clause, it will:
-
Break the data into partitions (nothing to do with
partitioned tables or partition views)
-
Optionally sort the data within each partition for
further processing
-
Optionally define "windows" within each partition
(not shown in this example)
-
Add value to the output based on the partitions and
windows.
All this extra processing is cued
by the over()clause as follows:
select
store, title, sales, in_store_rank
ranked
from (
select
store, title, sales,
rank()over (
partition by store
order by sales desc
) as in_store_rank
from game_sale
where week_ending = '21-Jan-2001'
)
where in_store_rank <= 2
order by
store, in_store_rank;
STORE
TITLE
SALES
IN_STORE_RANK
------- ---------------- ------- --------
Glasgow
Crash Simulator
1934
1
Manic the Gerbil
913
2
London
Dome
2167
1
Portal Combat
1824
2
You will note that I have used an in-line view as part of
the query. There are two reasons for this. First, you cannot use an analytic
function in a where clause but we want to test the value of the generated
rank, so we have wrapped our SQL in a pair of parentheses and made it an
in-line view. The second reason is for clarity: if you use a complex SQL
statement to generate the basic solution, it is convenient to improve
clarity by pushing it into an in-line view before applying the analytic
functions to something that now looks like a simple table. I frequently use
in-line views when applying analytic views. (Be warned, however, that I have
found a few exotic cases in which this causes dramatic changes to execution
paths; in particular, when mixing partitioned tables with partition views.
The current status of this problem is "possibly a bug.")
In examining the highlighted (i.e.,
bold faced) code that demonstrates the use of the
rank() function, we see
the partition clause that breaks the data set into separate sections, each
section is then sorted in descending order by the sales column. When this
phase of sorting is complete, Oracle applies a ranking to the intermediate
data and the job is nearly finished. Because we want to report only the top
two rows for each store (partition), we wrap the query into an in-line view,
and restrict our selection to those rows in which the generated rank is less
than or equal to 2.
Note that we then sort the final result set; it is
likely, but not guaranteed, that the data for this example would
coincidently appear in the correct order, but it is important to remember
that any application of analytic functions takes place before, and
independently of, the final order by clause. Do not rely on side-effects;
always include an explicit closing order by if you want your output sorted.
In case you are wondering about
tied places, the rank()function
leaves gaps. If sales of Dome in the
Glasgow
store had matched those of Manic, then both games would have been ranked 2,
and the next placed game would have been ranked 4, with no game in third
place. There is an alternative ranking function,
dense_rank()that would give the results
1,2,2,3,4…, but there is no function for generating the ordering 1,3,3,4,5….
We can be a lot more fanciful than this simple example,
however. We note that London
clearly sells more games than
Glasgow. Perhaps we would like to list the store not
alphabetically, but by total sales. We can do this by introducing another
style of analytic function, partitioning the data again by store, and
sorting on the result of that function.
select store, title, sales
from (
select
store, title, sales,
rank()over (
partition by store
order by sales desc
) as in_store_rank,
sum(sales) over (
partition by store
) as store_totals
from game_sale
where week_ending = '21-Jan-2001'
)
where
in_store_rank <= 2
order by
store_totals desc, in_store_rank;
In the
highlighted (bold faced) code this time, we have taken the familiar
sum()function, and used it in its new
analytic form, demonstrating that analytic functions can use existing
columns as parameters and are not restricted to generating new values as the
parameterless rank()function
does. In this case, we only have a
partition
clause, which splits the data by
store
again and thereby sums the
sales
by
store
without the need for the "traditional"
group by
clause. In this way, we can get "raw data" and
summaries on the same line of a report without having to run two versions of
the query or joining a table to itself. By the way, the example above sorts
the store by total sold in each store; it is left as useful and interesting
exercise to sort the stores by the totals of just the games listed in the
report.
The savings in processing can be
significant, but do remember that each time we do any partitioning we are
introducing more work, typically sorting. However, the
v$sysstatand
v$sesstatviews do not appear to report in
all the work that Oracle is doing, so be a little cautious as you
investigate the possibilities on offer from analytic functions.
There are
numerous functions, old and new, that can be used in this analytic way. The
existing functions, like sum()above are
count(), avg()
, min()
, max()
, variance(), and
stddev().
Apart from the rank()function introduced above, we have
the new dense_rank()
,
percent_rank()
,
and row_number().
There are also a number of new
statistical functions, such as corr()
, var_pop()
, var_samp(), covar_pop(),
covar_samp()
, stddev_pop(), stddev_samp(), and
an entire host of regression functions. There are also the
cume_dist()and
ntile()functions (the latter allowing data
to be split into percentiles, quartiles, etc.). Finally, under the heading
of "miscellaneous but useful" we have the functions
first_value()
, last_value()
, ratio_to_report()
, lead(), and
lag().
Common Requirements
Take a look at a couple of the new functions and see how
they answer some of the most commonly asked questions about SQL. How do I
report the difference between one row and the next? How do I report a figure
as the percentage of the total for a report? How do I make one column report
a running total of the figures in another column?
First, the
lead()function can used to return the
data from the following
row in the output.
lead()
and
lag()take
as their parameters the name of the column you want to report, and the
distance, (expressed as a number of rows) between the current row and the
row you want to report alongside it. In my case, the distance is just one
row; hence, the value 1 appearing in the function in this example.
select
title, this_sale, next_sale,
this_sale - next_sale delta
from (
select title, sales this_sale,
lead(sales,1) over (
partition by store
order by sales desc
) as next_sale
from game_sale
where week_ending = '21-Jan-2001'
);
TITLE
THIS_SALE NEXT_SALE
DELTA
---------------- ---------- --------- --------
Crash Simulator
1934
913
1021
Tonic the Gerbil
913
482
431
Dome
482
315
167
Portal Combat
315
72
243
Age of Umpires
72
Next, to report one column as a
percentage of the total data, we use the
ratio_to_report()function. This function is named
somewhat modestly, as it can be used not only to report the data as a
percentage of the report total, but can also be used to report the data as a
ratio of a partition. I demonstrate this by showing how a query can be
simultaneously partitioned in many ways (bear in mind that this does mean
more sorting, however).
Note how this
report does not need to use an in-line view, because I am not using the
result of an analytic function as the basis for a subsequent
where
clause. Note also how an
empty over()clause
shows that I am treating the entire data set as the partition.
select
store, title, sales,
ratio_to_report(sales) over (
partition by store
) store_ratio,
ratio_to_report(sales) over (
) country_ratio
from game_sale
where week_ending = '21-Jan-2001'
order by
store, sales desc;
Store
Title
Sale St % Co %
------- ---------------- ----- ---- ----
Glasgow
Crash Simulator 1934
.52 .24
Tonic the Gerbil
913 .25
.11
Dome
482 .13
.06
Portal Combat
315 .08
.04
Age of Umpires
72 .02
.01
London
Dome
2167 .49
.27
Portal Combat
1824 .41
.22
Crash Simulator
247 .06
.03
Age of Umpires
110 .03
.01
Tonic the Gerbil
52 .01
.01
Finally we
come to the example of a running total, which also introduces the final
feature of analytic functions, and the
over()
clause, which is the
range or
rows
clause. Thus far, our
analytic functions have been associated with either a single row from a
partition (e.g.,
rank())
or with the entire set of data within the partition (e.g.,
ratio_to_report()
). It is possible, however, to specify that a function
should be applied to a range within the partition.
There are several types
of range specification, such as "within 3 rows of the current row," "within
10 days of the date in the current row," "the week (yes, analytic functions
do know what a week is) preceding the date in the current row," etc. In this
example, we use one of the simplest options, which is "this row and all the
preceding rows in the partition," to make a function produce the running
total of the partition. Note that if you use the
range
or
rows
clause, you must have an
order by
clause in the
over()
clause to allow the ranging calculation to have
a consistent, reproducible meaning.
select
store, week_ending, sales,
sum(sales) over(
partition by store
order by week_ending
range unbounded preceding
) running
from game_sales
where store = 'Glasgow'
order by store, week_ending;
STORE
WEEK ENDING SALES RUNNING
-------- ----------- ----- -------
Glasgow
07-May-2000 4,000
4,000
14-May-2000 5,000
9,500
21-May-2000 6,000
15,000
28-May-2000 7,000
22,000
04-Jun-2000 8,000
30,000
11-Jun-2000 7,802
37,802
18-Jun-2000 7,636
45,438
25-Jun-2000 8,134
53,572
2-Jul-2000 7,815
61,387
9-Jul-2000 8,023
69,410
This is a wonderful demonstration of how a very clean and
simple piece of SQL can now be used to produce a result that previously
required some heavy-duty calculation from the database engine or a carefully
tested piece of procedural code. In this particular case, we also have an
example in which the analytic function appears to be applied with virtually
no overhead to produce a commonly requested form of output.
Conclusion
Analytic functions are very powerful and allow you to
produce useful tabular results with very little programming, but it helps to
use in-line views to increase the clarity of code.
Be careful about performance; some of the functions
available clearly require Oracle to do a lot of sorting. But because the
sort statistics are not reported; you should only apply analytic functions
to relatively small result sets.
One final warning, PL/SQL does not
yet recognize the syntax of analytic functions in static code, so you will
need to resort to ref cursors
and dynamic SQL to implement analytic functions
inside PL/SQL packages.
But, what if you want to do more than that?
What if you want to use a function, but still want to use the index?
Often, in the past, use of functions meant a full-table scan, as the
optimizer would often opt to ignore the index, if you had a function
present. Enter Function Based
Indexes - build your index using the function, and low and behold - now you
are back to using your index.
Also, there are a wealth of other functions present, in addition to ranking,
things like 'top n' which will return your top 5 or top 10 (or top x) items
from a table, or group by functionality that can now do cubes and roll ups,
essentially totaling all the data for you, for each major category.
|