Streamlining Oracle ETL With
Pipelined Table Functions
I've recently been reading Bert Scalzo's
"Oracle DBAs Guide To Data Warehousing And Star Schemas",
and he's got an interesting take on the ETL process that I
thought was worth looking at in more detail.
Background
Bert was the DBA behind the 7-Eleven data
warehouse, one of the first multi-terabyte dimensional data
warehouses built on the Oracle database. The book is a bit
different to other Oracle data warehousing books in that it
concentrates on the implementation details of building a
warehouse, and as it's not written by a member of Oracle staff, the author tends to have forthright opinions on which elements of the technology stack work, and which don't. A particular example of this is where he looks at the various releases of the Oracle database (7, 8i and 9i) and explains in some detail why 7 and 8i don't really cut the mustard as large-scale data warehousing platforms. In another instance, he looks at the evolution of the star transformation, explains why the original implementations (first using nested loops, and later cartesian joins) weren't up to scratch, but then absolutely insists that you use the version that comes with Oracle 9i. As the author's not afraid to offer an opinion, it's an entertaining read, and a bit different to your run-of-the-mill 'new features' book.
One chapter that particularly interested me was
his approach to loading the data warehouse. Right from the
outset, Bert eschews the use of ETL tools, preferring instead to
code the load routines manually. In his opinion, the trouble
with ETL tools is:
- The code they produce is not optimally efficient, and often uses slow-running Java to host and run the processes
- The cost of the tools is often more than the hours they replace
- They allow the project to be staffed by people with little Oracle or data warehousing experience
This is a theme echoed in an
Oracleworld paper by Arup Nanda that I looked at last year,
which again rejected the use of ETL tools, but this time because
the nature of the ETL (lots of small transformations to a large
number of nearly identical tables) was better served by running
a number of Unix shell scripts that in turn generated SQL
scripts to a pattern. I'll come back to the ETL vs. hand coding
debate later on in the article.
Anyway, the ETL process that the author wanted to carry out involved the following steps (sketched in Oracle terms just after this list):

- load data from a large flat file
- apply transformations to the data
- update/insert the data into a dimension table
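To make the shape of that concrete, here's a minimal sketch of the three steps in Oracle terms: an external table exposes the flat file to SQL, and a single MERGE applies the transformation and the update/insert in one pass. The names and columns (sales_feed_ext, customer_dim, the INITCAP transformation) are invented for illustration, not taken from the book:

-- Hypothetical external table exposing the flat file to SQL;
-- directory, file name and columns are assumptions
CREATE TABLE sales_feed_ext (
  cust_id   NUMBER,
  cust_name VARCHAR2(100),
  region    VARCHAR2(30)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY etl_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('sales_feed.dat')
);

-- Transform and upsert into the dimension in a single statement
MERGE INTO customer_dim d
USING (SELECT cust_id,
              INITCAP(cust_name) AS cust_name,  -- stand-in transformation
              UPPER(region)      AS region
       FROM   sales_feed_ext) s
ON (d.cust_id = s.cust_id)
WHEN MATCHED THEN UPDATE
  SET d.cust_name = s.cust_name, d.region = s.region
WHEN NOT MATCHED THEN INSERT (cust_id, cust_name, region)
  VALUES (s.cust_id, s.cust_name, s.region);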
What made this interesting was the fact that the file was very, very big, and the ETL window was small. Therefore, any ETL process had to minimise inter-process wait states, and maximise total concurrent CPU usage.
To illustrate, take a situation where your ETL process starts with a file of 1000 records, and has two stages, each of which takes an hour to run. The least efficient way to process this file is to let the first process handle all 1000 records (taking an hour to do so), then pass the data on to the second process, which will also take an hour to run. Your total run time is two hours. What the author is suggesting with 'minimise inter-process wait states' is that it would be good if your first process could handle, say, 10 records, then pass these on to the second process so that it could start its work. The first process can then get on with the next 10 records, and then pass them on when it's done. Your total run time would then be just a bit over one hour, cutting down dramatically the time taken to perform the load.
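This is precisely what Oracle's pipelined table functions provide: PIPE ROW hands each transformed row to the consumer as soon as it's produced, instead of waiting for the whole stage to finish. A minimal sketch, reusing the invented names from above:

-- Object and collection types describing the rows the function pipes out
CREATE TYPE cust_row AS OBJECT (cust_id NUMBER, cust_name VARCHAR2(100));
/
CREATE TYPE cust_tab AS TABLE OF cust_row;
/
CREATE OR REPLACE FUNCTION transform_stage (p_src SYS_REFCURSOR)
  RETURN cust_tab PIPELINED
IS
  v_id   NUMBER;
  v_name VARCHAR2(100);
BEGIN
  LOOP
    FETCH p_src INTO v_id, v_name;
    EXIT WHEN p_src%NOTFOUND;
    -- each row is handed on immediately, so the consumer
    -- starts work long before this loop has finished
    PIPE ROW (cust_row(v_id, INITCAP(v_name)));
  END LOOP;
  CLOSE p_src;
  RETURN;
END;
/
-- The insert consumes rows as they are piped
INSERT INTO customer_dim (cust_id, cust_name)
SELECT cust_id, cust_name
FROM   TABLE(transform_stage(
         CURSOR(SELECT cust_id, cust_name FROM sales_feed_ext)));

The insert starts receiving rows almost straight away, which is the 'just a bit over one hour' effect described above.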
The 'maximise total concurrent CPU usage' requirement aims to take advantage of the ability of multi-processor servers to carry out more than one task at once. Oracle is very good at this, and has had this feature (known as parallelism) for many years. However, it only really comes into play if you use DML statements (insert, update and so on) to do your load; if you write your load in PL/SQL, using cursors and loops to process your data row by row, you're not going to be able to take advantage of parallel DML. What these two requirements mean, therefore, is that your ideal ETL process is going to use parallel processing, and somehow reduce to a minimum the wait between different ETL stages.
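Pipelined table functions can satisfy both requirements at once. Declaring the function PARALLEL_ENABLE with a PARTITION ... BY ANY clause lets Oracle spread the input cursor across parallel query slaves, and enabling parallel DML parallelises the insert itself. Again a sketch under the same invented names, with the degree of parallelism (4) picked arbitrarily:

-- Same function as before, now parallel-enabled; PARTITION BY ANY
-- tells Oracle it may split the cursor's rows across slaves freely
CREATE OR REPLACE FUNCTION transform_stage (p_src SYS_REFCURSOR)
  RETURN cust_tab PIPELINED
  PARALLEL_ENABLE (PARTITION p_src BY ANY)
IS
  v_id   NUMBER;
  v_name VARCHAR2(100);
BEGIN
  LOOP
    FETCH p_src INTO v_id, v_name;
    EXIT WHEN p_src%NOTFOUND;
    PIPE ROW (cust_row(v_id, INITCAP(v_name)));
  END LOOP;
  CLOSE p_src;
  RETURN;
END;
/
-- Parallel DML is off by default and must be enabled per session
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(customer_dim, 4) */ INTO customer_dim
  (cust_id, cust_name)
SELECT cust_id, cust_name
FROM   TABLE(transform_stage(CURSOR(
         SELECT /*+ PARALLEL(sales_feed_ext, 4) */ cust_id, cust_name
         FROM   sales_feed_ext)));
COMMIT;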