| |
New Developments In Oracle Data
Warehousing
April 22, 2004
I've been giving a lot of thought recently to how I might go about
designing and building future Oracle data warehousing projects. Over
the last few months, I've come across some interesting design
approaches, and coupled with some new technology that's now available
it's probably a good time to have a think about how future projects
might come together.
One of the key messages that keeps coming through, from both
clients and from Oracle, is around near or real-time data warehousing.
Traditionally, data warehouses have been refreshed daily, or weekly or
monthly, and it's been generally accepted that there's always going to
be a lag between the data in the source systems, and the data in the
data warehouse. This isn't necessarily a bad thing - data warehouses
are for analyzing trends, and you wouldn't want to change the
direction of a business based on a few minutes data; however, the role
that data warehouses play is changing, and it's common now to use the
data warehouse as a way of getting a holistic view of the whole
business, in real time, so that split-second decisions can be taken as
soon as an event happens, to prevent a customer defection, or to take
advantage of market movements, for example. If you're going to be
relevant in the data warehousing market, you've got to be capable of
being real-time.
Warehouses that are built using Oracle Warehouse Builder are
generally loaded on a batch basis, using PL/SQL routines build using
OWB that are scheduled using OEM and/or Oracle Workflow. If you want
these routines to be 'near' real-time, what you then do is run them on
a regular, constant basis (say every minute or so) and have them
process changed data from the source system, as opposed to loading the
entire source system each time. The usual way to do this is to have
some sort of 'new' or 'changed' date field in the source system, so
that you can pick up only those records that are new or amended. The
smart way to do this though, as from Oracle 9i, is to use 'Change Data
Capture', a mechanism for automatically feeding changed database
records through to interface tables, or indeed through to the
warehouse fact and dimension tables themselves.
In actual fact, I've never actually used the Oracle 9i change data
capture, because of the way it is implemented - it relies on triggers
being placed on the source tables, which can (a) slow down the source
system and (b) worry the source system DBA as you're changing his data
model. However, I recently had a look at a presentation and paper from
the last Oracleworld by Michael Brey and Michael Schmitz entitled
"Experiences with Real-Time Data Warehousing Using Oracle Database
10g" which
explains how the new 10g version of change data capture, called
'Asynchronous' (as opposed to 'Synchronous') change data capture
works.

In the 10g version, the redo logs are used instead of triggers, making
the process far less invasive and putting no load on the source
system. The drawback of course is that the source system (as well as
the target) have to be 10g, but I can see how you could put a 'real
time' system together, that mines the redo logs, transforms the data
using a PL/SQL package and places the data in the correct fact or
dimension table. More details on Asynchronous Change Data Capture can
be found in the Oracle white paper
"On-Time Data Warehousing with Oracle10g - Information at the Speed of
your Business" available on OTN.
Another interesting approach that the paper talks about deals with
the question of real time vs. fairly static data in the data mart. One
criticism of having a real time data mart or warehouse is that if you
run a query twice in succession, you might get a different answer, as
more data has been loaded since you first run the query. To get around
this, the system designers had a main fact table, which is only loaded
once a day, together with a 'today's data' fact table, which is
updated constantly during the day. If you want a static view of the
data, you just query the main fact table; if you want a completely up
to date view, you SELECT against both tables, and UNION the results.
At the end of the day, the 'today's data' fact table is
instantaneously 'partition exchanged' with the partitioned main fact
table, which was partitioned by day and had an empty partition set
aside for each new day's data.
Following on from this, I attended a
presentation by Jon Page at the recent UKOUG Reporting Tools & BIW
Special Event in March on Oracle's 'Information Architecture'. Oracle
are currently pushing a vision of data warehousing where the warehouse
is integrated into the ERP system; in Oracle's case, using a product
called 'Daily Business Intelligence' (DBI), an add-on to the
e-Business Suite. Interestingly, what Oracle are saying with DBI is
that you don't need a separate warehouse, and you don't need an ETL
process that loads data in batches and applies changes to the
warehouse on a daily, weekly or monthly basis. It looks like DBI is
being populated and refreshed using Fast Refresh Materialized Views -
materialized views that are updated whenever the base tables have data
COMMITed, a process that has little impact on database performance but
has the advantage of keeping the summary tables (or in fact any
derived tables) always up to date.

What this means in effect is that you don't have a long drawn-out
ETL process, with multiple staging tables and mass-copying of data
from source to target - you just create some fast-refresh materialized
views and code the ETL into SELECT statements. A good approach -
although the point I keep coming up with is 'what about complex ETL
processes?' - surely at some point you're going to need intermediate
tables, and surely at some point you're going to have to break out of
standard SQL and use PL/SQL for a more complex process. However, it's
an interesting approach and one I'll bear in mind for whatever I work
on next.
Separate from all this, I also recently took a look at a paper by
Arup Nanda entitled
"A
Multi-Source, Time Variant Data Warehouse Case Study" - you might
have seen a
write-up I gave this paper at the time. The case study describes a
project where a medical claims data warehouse was built for a client,
with a rule that no ETL tool - not even OWB - could be used. The paper
then goes on to describe how they set up the warehouse and the ETL
routines, using nothing more than SQL*Plus and Unix shell scripts.
What was interesting about the paper was the completely different
approach they took to putting a warehouse together; most projects I
work on involve one or more OWB developers, building a series of ETL
mappings that translate into SQL scripts and PL/SQL packages, whilst
the authors of the paper did most of their work using parameterized
CREATE TABLE AS SELECT ('CTAS') scripts that were called from Unix
shell scripts.
What was different about Arup Nanda's approach was that he eschewed
the often long, drawn out graphical ETL process used by OWB and
crafted a number of reusable SQL scripts that bulk-copied data from
source to target using the CTAS statement. In addition, to deal with
changes to data types he then crafted a number of scripts that issues
ALTER TABLE commands to change column types. Lastly, he made extensive
use of (again) daily partitioned fact tables, dropping old partitions
and creating new partitions using generic SQL scripts and alongside
this using locally partitioned indexes that reduced index maintenance
to those partitions directly affected by table loads.
What this said to me was that Oracle 9i warehousing projects don't
necessarily need to use OWB, if the process can be better served by
putting together well crafted reusable SQL scripts called from either
MSDOS batch files or Unix shell scripts. It also highlighted the
central role that partitioning plays in a warehouse, not only for
performance reasons but also due to the potential for partition
exchange loading - the process of swapping out a staging table that
contains data for a period with the corresponding empty partition in
the fact table. In fact, OWB 9.2 now supports partition exchange
loading, so this can also be carried out from the OWB GUI tool without
having to learn the relevant SQL statements. However, in some cases -
where the ETL process is simple but repeated many times with source
data of varying structure - coding SQL statements by hand and
controlling them from shell scripts is sometimes more efficient than
coding them using a graphical ETL tool.
The last development I've been looking at, is 'Agile'
methodologies that take a fundamentally different approach to the
project build than traditional 'waterfall' methodologies.
Traditional data warehouse projects that I've worked on (and
managed) usually work to a project plan where requirements are first
defined, then the system is designed, then coded, tested then rolled
out. Whilst this works well when the project requirements can be
completely designed at the start, it sometimes falls down when
- the requirements aren't fully defined at the start, and in fact
only come out as the project gets going
- changes to requirements, or problems during the build mean that
the project diverges from the plan early on
- the client wants to see some results early, but the project only
delivers something right at the end
- time is tight but the project methodology demands lots of
documentation, and lots of process
- team members have different skills but get treated for planning
purposes as interchangeable
Agile methodologies, such as
Extreme Programming
or Scrum address these
issues by taking a fundamentally different approach.

- Requirements are driven out over development cycles constantly
feeding back into the project
- iterations of the project are delivered regularly and quickly,
so that feedback can be obtained
- the methodology is 'lightweight', only determining objectives
and leaving the developers to decide how to do the work
- There is very little project 'ceremony', only requiring a
minimal amount of project documentation
- What project process there is is mainly focused on setting
objectives and measuring progress
- Team members are recognized as having unique skills which
sometimes cannot be interchanged.
- This agile approach looks excellent for business intelligence &
data warehousing projects, which more often than not don't have
their requirements fully defined at the start ("I don't know what I
want, so show me what can be done and I'll tell you then") and
benefit most from regular releases of data and reports to the
warehouse user base. It's also the case that, in most projects I've
heard of, whilst there is a lot of time and attention paid to the
plan at the start of the project, as time goes on less and less
project documentation actually gets filled in as everyone's backs
are to the wall getting the project delivered - the agile approach
would therefore fit this better, allowing us to focus on that
documentation which is actually essential - setting objectives and
measuring progress - whilst allowing us to dispense with
documentation that is only filled in 'because it should be'.
The
agile approach should therefore help us deliver projects that are
more responsive to customer needs, deliver results earlier and
consistently, and allow developers to concentrate their efforts on
parts of the project that deliver real value. Of course the
potential problem here is that most projects you are asked to quote
for are on a fixed price, fixed requirements basis; the client
believes they have fully defined the requirements as part of the
RFP, and they therefore want you to quote a fixed price for doing
this work. Any changes to requirements (which more often than not
eventually come in) then become change requests which have to then
be separately negotiated. If agile methodologies are going to work,
they'll need to be a partnership between client and the development
team, balancing the benefits from a flexible approach to the design
and build with the fact that it's not possible to completely lock
down the cost and elapsed time at the start of the project.
So what does this mean then for future data warehousing projects?
Well, whilst in the short term in all likelihood things will carry
on as normal, some approaches I'd like to bring into future
warehousing projects would include;
- Increasing use of real-time data, through a combination of
synchronous and asynchronous change data capture, and fast refresh
materialized views
- More use in general of the built-in ETL features in Oracle 9i
and 10g, even when they're not yet supported by OWB (though check
the new New Features documents for later versions, as more and
more Oracle 9i ETL functionality is being exposed through OWB)
- Consideration for the use of generic, parameterized SQL
scripts when this would be a more efficient way to build the ETL
process
- The use of agile methodologies where the customer is agreeable
to an iterative development process, drawing out requirements
through a series of releases and keeping project documentation
focused on setting objectives and measuring process.
Any thoughts from anyone on this approach? Is anyone using the
Oracle 9i features for real time data warehousing, or is anyone
using an agile methodology, such as eXtreme Programming or Scrum to
run their warehousing project? Drop me a line if you've got any
feedback.
|

|
|
|