OWB - Oracle Data Warehouse Builder articles
Don Burleson
Here is a great series of articles on using Oracle Warehouse Builder by Mark Rittman.
OWB warehouse builder
Experts Feature
Implementing Real-Time Data
Warehousing in Oracle10g
OWB: Leader In
Gartner ETL 2004 Magic Quadrant
Yahoo Finance: "REDWOOD SHORES, Calif., Sept. 16 /PRNewswire-FirstCall/
-- Oracle (Nasdaq: ORCL - News) today announced its position in the
leader quadrant in Gartner's ETL 2004 Magic Quadrant. Oracle is the
only database vendor listed in the leader quadrant.
One of 18 vendors in the extract, transform and load (ETL) market
evaluated by Gartner for this year's report, Oracle was recognized
based on its vision and ability to execute. The affordability of
Oracle® Warehouse Builder, and its alignment with the feature-rich
Oracle Database 10g, has helped Oracle grow its ETL customer base to
include more than 3,000 companies. According to Gartner, "leaders" are
performing well today, have a clear vision of market direction and can
execute on that opportunity.
"Many of the world's largest companies are using Oracle Warehouse
Builder for their most pressing ETL needs," said Ray Roccaforte, vice
president of Server Development, Business Intelligence, Oracle Corp.
"Industry adoption of ETL will continue to be fueled by the widespread
deployment of business intelligence solutions. Our ETL tool is an
excellent option for companies implementing such solutions."
Oracle Delivers Powerful Data Warehousing and ETL Offerings
In May 2004, Gartner listed Oracle in the 2004 Data Warehouse DBMS
Leaders' Quadrant for completeness of vision and ability to execute.
Oracle's combined strengths in data warehousing and ETL provide
customers with a comprehensive offering to efficiently and cost
effectively develop, deploy and manage highly scalable decision
support systems.
Business Intelligence market growth has fueled ETL tool adoption such
as Oracle Warehouse Builder that reads data from a database, extracts
the desired data, transforms it into the required form according to
predefined rules, and then loads into a data warehouse for reporting
and analytic requirements.
About Oracle Warehouse Builder 10g
Oracle Warehouse Builder enables the extraction, transformation, and
loading of heterogeneous data to produce quality information in Oracle
Database 10g. It is the only enterprise business intelligence design
tool that manages the full life-cycle of data and metadata integration
for Oracle Database 10g."
Read more
here including a
diagram of the Gartner ETL Magic Quadrant. This is obviously based
on the new Paris release of OWB, which is a step change in
functionality and performance over the 3i and 9i generations of OWB.
What has probably helped most in this release are features such
as the ability to load non-Oracle databases (revealed in
Paul Narth's presentation), the data profiler, and close
integration with Oracle OLAP. Good to see our choice of ETL tool
vindicated.
The Register
Reviews OWB 'Paris'
Oracle rebuilds Warehouse: "Unsurprisingly, Oracle has
announced a whole bunch of new products at Oracle OpenWorld this week.
Some of these, like the new version of 11i, will get lots of media
attention. One that will not is the new release of Oracle Warehouse
Builder (OWB), code-named Paris. However, in its own right it is
pretty significant and it merits some serious consideration.
Perhaps the first and most obvious new feature that will be available
in the Paris release (which is scheduled to be available by the end of
the year) is that you will be able to send data to third party
databases as targets. You will, however, have to use Oracle
Transparent Gateways to do this, which presumably means that you have
to go via an Oracle database. However, as OWB uses the database as its
engine this shouldn't be a hardship.
The second major new facility is data profiling. Just how clever this
is I do not yet know (I am arranging a detailed briefing) but my guess
would be that it is on a par with, or better than, the other ETL
vendors that have introduced comparable capabilities over the last
year, but not as advanced as the specialised products from companies
like Trillium Software.
There is some other nice stuff in OWB. For example, it understands
OLAP and has support for such things as slowly changing dimensions;
the cleansing and matching capabilities provided can be integrated
with data mining algorithms for enrichment purposes; there is a
mapping debugger for transformations; there are extensive new metadata
features and expert capabilities; and so on and so forth."
A very positive review of OWB 'Paris' by
The Register. Read the rest of the article
here.
Introduction To
OWB 10gR2 "Paris" Presentation Now Available
Oracle's Paul Narth gave a presentation earlier today at Openworld in
London, on the new features in OWB 'Paris', otherwise known as OWB 10g
Release 2. The presentation can be
downloaded from the Openworld website, and comes with a number of
screenshots showing the new interface in this next release of OWB.
One of the big new features in OWB 'Paris' is the data profiler, which
allows you to profile your data within the OWB interface, discover the
structural content, identify anomalies, and then report on data
quality once your warehouse is populated.

The mapping editor sports a redesigned interface, has a mapping
operators toolbar, properties bar, and allows you to pan and zoom
around a multi-screen mapping canvas.

One of the big improvements in 'Paris' is the data object editor,
which allows you to define cubes and dimensions in a logical form, and
then deploy them either to relational tables with ROLAP metadata, or
to analytic workspaces. You can then either deploy and populate your
cube in a relational star schema, or, through the new
AWXML feature, directly create and populate analytic workspace
objects without first creating equivalent ROLAP objects.

The impact analysis report has been much improved in Paris, as has
change management in general, which now allows you to perform
'what-if' analysis, and helps you more easily determine the cost of a
project change.

Another interesting feature is support for 'Experts', a scripting
feature (that uses
OMB*Plus) that allows you to define a workflow and guided series
of steps for departmental 'power users' to work with OWB mappings. If
you have a database business 'power user' who commonly extracts his
own data from the database, manipulates it, and then reports against
it, the idea is that you can set up an 'expert' in Paris that allows
him to do all of this, guided by your script, using mappings in OWB.

You can find out more details, and see some more screenshots, in
Paul's PowerPoint presentation.
OWB Performance
Tips
There was an
interesting thread on the OTN OWB Forum the other week, where a
poster asked if anyone had any tips on improving the performance of
Oracle Warehouse Builder. Nikolai Rochnik from the OWB product team
came back with some useful guidance:
"There can be three factors affecting the performance of OWB Client
working with Design Repository. Here is what you can do about them now
as well as what OWB development has done (a lot!) to improve this in
the next release (November 2004). They are in the order of greatest
impact.
1. Design Repository management.
Problem: CHANGELOG table in the OWB design repository is typically
several orders of magnitude smaller in row count than pctree,
intralink, or interlink, so the nested loop plan is best. The
optimizer's row estimate in the absence of gathered statistics does
not serve us very well here.
Solution in the next release:
* Automated gathering of schema statistics (and refresh statistics on
a regular basis)
* Increased use of SQL bind variables
* Improved SQL hints
* Ensured definition of all necessary indexes
* Used COMPRESS for indexes
Workaround for now: Try analyzing OWB design repository with the
following statements. This is only worthwhile if the size of the
repository is large.
exec dbms_stats.gather_table_stats(null, 'PCTREE', null,
DBMS_STATS.AUTO_SAMPLE_SIZE, false, 'FOR ALL COLUMNS SIZE AUTO');
exec dbms_stats.gather_table_stats(null, 'INTERLINK', null,
DBMS_STATS.AUTO_SAMPLE_SIZE, false, 'FOR ALL COLUMNS SIZE AUTO');
exec dbms_stats.gather_table_stats(null, 'INTRALINK', null,
DBMS_STATS.AUTO_SAMPLE_SIZE, false, 'FOR ALL COLUMNS SIZE AUTO');
exec dbms_stats.gather_table_stats(null, 'CHANGELOG', null,
DBMS_STATS.AUTO_SAMPLE_SIZE, false, 'FOR ALL COLUMNS SIZE AUTO');
exec dbms_stats.gather_index_stats(null, 'IDX_PCTREE_PARENTID', null,
DBMS_STATS.AUTO_SAMPLE_SIZE);
exec dbms_stats.gather_index_stats(null, 'IDX_PCTREE_CHILDID', null,
DBMS_STATS.AUTO_SAMPLE_SIZE);
exec dbms_stats.gather_index_stats(null, 'IDX_PCTREE_PARENTID_CHILDID',
null, DBMS_STATS.AUTO_SAMPLE_SIZE);
exec dbms_stats.gather_index_stats(null, 'IDX_PCTREE_CHILDID_PARENTID',
null, DBMS_STATS.AUTO_SAMPLE_SIZE);
exec dbms_stats.gather_index_stats(null, 'IDX_INTERLINK_LINKFROM',
null, DBMS_STATS.AUTO_SAMPLE_SIZE);
exec dbms_stats.gather_index_stats(null, 'IDX_INTERLINK_LINKTO', null,
DBMS_STATS.AUTO_SAMPLE_SIZE);
exec dbms_stats.gather_index_stats(null, 'IDX_INTRALINK_LINKFROM',
null, DBMS_STATS.AUTO_SAMPLE_SIZE);
exec dbms_stats.gather_index_stats(null, 'CHANGELOG_PK', null,
DBMS_STATS.AUTO_SAMPLE_SIZE);
2. Memory management.
Solution in the next release: A lot of improvements to prevent memory
leaks and unnecessary flushing of needed objects.
Workaround for now: Try a higher Java Virtual Machine (JVM) limit for
your OWB Client: edit the -Xmx value in [OWB Home]\owb\bin\win32\owbclient.bat
from the default of 384M. This is only worthwhile if the client machine has
enough real memory to support the VM sizes of all actively running
applications. If Windows Task Manager shows the owbclient java
session's Memory Usage and VM Size values close to the -Xmx value,
OWB starts spending a larger portion of its effort managing memory.
3. Network latency.
Problem: Network latency is why having OWB Client installed local to
the Oracle database containing the design repository will always be
faster, as you observe "I found OWB much faster when Oracle and OWB
running on same machine having 400 MB only." Obviously this does not
solve the problem in a multiple-developer environment, but it is not as
big a factor as the ones above. We are confident that OWB Client 'chatting'
over the network is already optimized."
Adding to this, it's been my experience that the two issues that can
have the biggest impact on OWB performance are first, a client machine
with less than 512MB of memory, and secondly, contention on the disk
unit that houses the OWB design repository. There really is no point
using the OWB client on a machine with 256MB or less of memory (it'll
run extremely slowly, and often crash or hang and lose your uncommitted
work), and more than once we've found that sluggish OWB performance
was down to the disks holding our design repository database also
being used, without us knowing, to hold the database for a large and
active OLTP application.
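If you want a quick way of telling whether the slowness is down to
client/repository 'chattiness' (Nikolai's third factor) rather than
memory or disk, one check is the SQL*Net round-trip count for the OWB
client session. This is just a sketch - you'd substitute the SID of
the owbclient session - but the statistic name is a standard one:
select sn.name, ss.value
from   v$sesstat ss, v$statname sn
where  ss.statistic# = sn.statistic#
and    sn.name = 'SQL*Net roundtrips to/from client'
and    ss.sid = &owb_client_sid;
A session racking up a very large round-trip count relative to the
work it's doing is a good sign that network latency, rather than the
repository itself, is the bottleneck.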
Analytic
Functions In OWB
Analytic Functions In Oracle Warehouse Builder: "Analytic
Functions are an important feature of the Oracle database that allows
the users to enhance SQL's analytical processing capabilities. These
functions enable the user to calculate rankings and percentiles,
moving window calculations, lag/lead analysis, top-bottom analysis,
linear regression analytics and other similar calculation-intense data
processing. This whitepaper describes how to use some of these
analytical functions in the ETL (extraction, transformation and
loading) processing implementation using Oracle Warehouse Builder."
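To give a flavour of the sort of thing the whitepaper covers, here's a
simple example of a ranking and a moving-window calculation. Note that
the SALES table and its columns are just made up for illustration -
they're not from the whitepaper:
select product_id,
       sales_month,
       revenue,
       rank() over (partition by sales_month
                    order by revenue desc) as revenue_rank,
       avg(revenue) over (partition by product_id
                          order by sales_month
                          rows between 2 preceding and current row)
                          as moving_avg_3m
from   sales;
In an OWB mapping you'd typically wrap this kind of SQL in a view, an
expression operator or a PL/SQL transformation that the mapping then
uses.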
Nicholas Goodman
On Enhancing OWB Process Flows
Nicholas Goodman: "There are significant limitations to the
implementation of Workflow processing in OWB 10g r1 and prior. There
are methods available to developers that can add at least some amount
of logic and conditionals such that it might be able to behave like an
application. This section will enumerate how to use the facilities in
OWB currently to build an IF/ELSE workflow pattern and a CASE workflow
pattern."
Tom Burzinski On
Data Warehousing Test Plans
"Companies invest countless hours and hundreds of thousands, sometimes
millions, of dollars in data warehouse technology in pursuit of acing
the big BI exam. Yet, few make a proportionate investment in testing
their systems. The fact is no data warehouse project can succeed
without complete and comprehensive testing."
writes Tom Burzinski for DMReview. "Testing a data warehouse is
more than reviewing the system's reports or other outputs ... The
nature, scale and scope of a data warehouse project may vary depending
on budget, available time, data quality, the data warehouse type and
complexity of data transformation; however, there are four critical
phases of testing that should always be included."
Comparing
Informatica And OWB
A couple of projects that I've worked on recently had chosen Oracle
Warehouse Builder over
Informatica's Powercenter. It was therefore interesting to see a
new article by Rajan Chandras that
looked at the latest version of Powercenter, and to see how
Informatica's offering compares to OWB10g.
According to
"(Re)Enter The Leader", Powercenter has a similar architecture to
OWB:
"PowerCenter 7.0 (I'll call it PC7) is an ETL tool in the classic
mold: data extract, transform, and load logic is constructed in a
(mostly) sequential arrangement of graphical objects that flow from
source to target. The objective is conceptually simple: Read data from
source, transform it as needed, and write it to target. Reality is a
little more complex, of course, and the construction of logic happens
at three levels.
At the lowest level, individual graphical
objects can be sources, targets, or transformations (sources and
targets can be themselves considered as special types of
transformations). A source transformation is used to read from a data
source, and supply that data in sequential row-wise fashion for
subsequent processing. At the other end of the logic stream, the
target transformation receives data (again, in row-wise order) and
writes it out to recipient data structures. The remaining intermediate
transformations do just that — transform data values as required.
Sources, targets, and transformations are
assembled in a daisy chain to form the next level of processing, which
in PC7 is called the "mapping." A mapping is the end-to-end flow of
logic, from one or more source transformation to one or more target
transformations.
The execution of the mapping, called the
"workflow" in PC7, provides the third level of the overall logic. The
workflow provides for the execution of multiple mappings and
dependencies among mappings. In standard programming terms, the
transforms are the syntax and components of the program, the mapping
is the overall program itself, and the workflow is the execution and
production of one or more programs.
There are PC7 components that correspond to
these levels. The PowerCenter Designer is the programming integrated
development environment (IDE), where you "assemble" all the sources,
targets, and transformations to create a mapping. The PowerCenter
Workflow Manager is used to build a workflow around the mapping. The
Workflow Monitor provides production support capabilities for the
workflow. In addition, there are the PowerCenter Repository Manager
and the Repository Server Manager, which provide administration
capabilities for the PC7 Repository (more on this a little
later)."
All sounds familiar, with Powercenter's graphic objects being the
source and target objects you drop on to a mapping canvas, together
with the PL/SQL transformations, mappings being the same in the two
tools, and Powercenter's workflow being the same as the workflow
interface in OWB10g.
Reading the rest of the article, some interesting similarities and
differences were:
-
Informatica
Powerexchange seems similar to Oracle gateways, but with connectors
to Peoplesoft and Siebel in addition to the SAP connectivity that
both tools offer. Support is good in both tools for non-Oracle
databases (DB2, SQL Server, Teradata, Sybase and so on)
-
One major
difference is that OWB will only populate Oracle 8i, 9i or 10g data
warehouses, whilst Informatica works against any major vendor
(thanks Duncan for
pointing that one out, one of those 'so obvious if you're used
to OWB, I forgot to mention it' moments...)
-
Both tools
allow you to build reusable components for transforming data, with
Powercenter's being specific to the tool whilst Oracle's are regular
PL/SQL functions and procedures.
-
Informatica,
like Oracle, are making a big noise about grid computing. "PC7
offers server grid capabilities, too, by which PowerCenter can
distribute loads across heterogeneous Unix, Windows, or Linux-based
computing platforms. Although grid capabilities may seem exciting, I
don't believe they match real-world need for grid computing yet, and
I wouldn't recommend using them in place of other industry grid
solutions."
-
The main
architectural difference between Powercenter and OWB is that
Powercenter has its own ETL engine, which sits on top of the source
and target databases and does its own data movement and
transformation, whilst OWB uses SQL and the built-in ETL functions
in 9i and 10g to move and transform data. Interestingly the article
observes that the Informatica approach can be slower than the
approach used with OWB. "Also,
be aware that ETL tools are in general a slower (if more elegant)
alternative to native SQL processing (such as Oracle PL*SQL or
Microsoft Transact SQL)."
-
Powercenter's
use of web services and distributed computing looks more developed
than OWB's. "PowerCenter
Web services are managed through the Web Services Hub, another
component of the architecture, which supports standards such as
Simple Object Access Protocol (SOAP), Web Services Description
Language (WSDL), and Universal Description, Discovery, and
Integration (UDDI). The architectural components can be collocated
on a single server or spread across diverse servers, which allows
solution parallelism, flexibility, and scalability."
-
Powercenter
starts at around $200,000 (yikes!) although there is a "Flexible
pricing model". OWB is licensed as part of 10gDS which is
around $5000 per named user, although you'll need the Enterprise
Edition of the 8i, 9i or 10g database to provide the ETL
functionality.
Historically, customers chose Informatica when they had lots of
different database sources and targets, and the transformations
between them were complex and interwoven. At one point, if you wanted
a graphical interface for building transformations, tools such as
Informatica,
Constellar,
Genio and so on were the only game in town, and you were looking
at a minimum investment of between $50,000 and $100,000 to deploy
these tools. The introduction of
DTS by Microsoft and OWB by Oracle suddenly changed the market by
providing much of the functionality of these tools as either a free
database component or as a low-cost add-on. Vendors like Informatica
have responded by introducing additional new features (such as web
services integration, distributed loading and transformation, and so
on) but it's now the case that, if you have a fairly straightforward
need to graphically extract, transform and load data, you'll probably
find the vast majority of your needs are now met by tools like OWB, at
a far lower cost.
Interestingly, Informatica also have their own BI query tool called
PowerAnalyzer. Sold separately from PowerCenter but designed to be
used in tandem with their ETL tool, PowerAnalyzer is a web-based query
tool that creates ROLAP queries against Oracle, IBM, Microsoft and
Sybase datasources. Designed to be deployed using J2EE application
servers, it also comes with an Excel interface and, as
Seth Grimes reports for Intelligent Enterprise, a range of
prebuilt analytic applications:
"The analytics dimension features conventional query and reporting and
online analytic processing-style slice-and-dice analysis, and also
optional packaged modules for CRM, financial, HR, supply chain, and
cross-functional analysis. Its ability to visually model analytic
workflows is one that's not yet common. It's intended to facilitate
root-cause analysis, although this capability appears to be limited by
the packaged analytic framework and other architectural strictures.
For example, the highly promoted Excel support doesn't include
database write-back. It also lets analysts embed business logic in
private spreadsheets rather than in a repository, which can prove
limiting when logic locked in a spreadsheet isn't visible to the
workflow modeler and other non-Excel interface elements. Last,
PowerAnalyzer delivers only the mainstream data-analysis functions
that are found in competing BI tools, underscoring Informatica's view
that integration and usability rather than analytic depth are the keys
to market share.
PowerAnalyzer 4.0 is a credible entry in the
larger BI market and will prove compelling for organizations that
require an easy, nondisruptive path for integrating mainstream
analytics into an existing computing environment."
That being said though, PowerAnalyzer is rarely (if ever) deployed by
non-Informatica PowerCenter customers, limiting it to a fairly small
audience who have a particular need for high-end ETL and specific
industry analytic templates. Looks interesting, though, albeit with a
hefty pricetag.
Version And
Environment Management Using OWB10g
One of our customers recently asked me about the best way to manage
multiple environments when using OWB. The customer had a development
environment, together with test, QA and of course production, and
wanted to know how to set up their design and runtime repositories. Do
we create separate design and runtime repositories for dev, test, QA
and prod, do we have just one for all environments, or is the answer
somewhere in between?
The situation is complicated a bit by the fact that OWB's architecture
has changed slightly over the previous few versions. Up until OWB
9.0.3, you had a design repository to which your OWB client connected,
and one or more OWB runtimes, which you installed into every schema
you loaded data in to. With OWB 9.0.4 this changed in that you now
have one runtime repository per target database, and this runtime
repository contains all the packages, and stores all the load results,
for use by all the target schemas in a database. More details on the
new runtime architecture can be found in
Architecture Whitepaper for Warehouse Builder 10g
and
Server-side component (runtime) in Oracle Warehouse Builder 10g
on OTN.
In practice, we've found that different clients use different
approaches to solve this problem. Some of them have a single design
repository, with its contents effectively being the most current view
of the OWB project definition. This definition is then used to
initially populate the dev warehouse environment, and after that it's
used to populate the test and QA environments. Once everyone's happy
that this is ok, it's then used to load up the production warehouse
environment.
The problem with this is that the model within the design repository
doesn't usually reflect what's actually deployed in test, QA and prod.
One way around this is to make backup copies of the design repository
(using the .MDL file export routine) after each deployment, and then
load these up whenever required to see how the design repository
looked when you last deployed to an environment. Another way to deal
with this is to have separate design repositories for dev, test, QA
and prod, and keep these in sync by exporting and importing modules
between the environments using the .MDL file export facility. Yet
another way is to use the snapshot facility in recent versions to
store copies of the repository for later reference.
You'll be interested therefore to know that Oracle have recently
posted a series of OWB case studies, one of which is specifically about
version and environment management using OWB. Entitled "Case
Study 9: How Do I Manage Multiple Versions of my BI Implementation?"
it discusses two scenarios where OWB is used to manage multiple target
environments, such as dev, test, QA and prod. According to the article
introduction:
"After a period of development and testing, one company implements its
BI system in production. The Production version of the system
typically changes as new features are incrementally implemented from
Development, and as Production bugs are discovered and fixed. At the
same time, the Development version of the system continues to evolve
with new functionality. This company now has several individually
changing versions of the system and faces a challenge familiar to all
companies, regardless of how many BI environments they maintain: how
to best manage changes in different versions of the system.
One version of this common scenario is depicted in Figure 9–1, where
the Development environment is consistently more advanced than the
functionality in Production, and QA is somewhere between the two
extremes. Development changes are incrementally propagated to QA and
subsequently to Production. At the same time, Production has its own
cycle of changes, denoted in Figure 9–1 as the shadow environment
labeled ’Production1’, and used for controlled problem solving.
’Production’ and ’Production1’ are at the same stage of development,
and serve to illustrate the errors that occur in Production, which are
fixed and implemented directly in Production, but that must somehow be
merged with Development. Other companies may have fewer or more
differing environments for their BI systems, but the same maintenance
challenges still apply.
Companies may need multiple environments for their BI systems, as
illustrated in Figure 9–1, because they typically implement
incremental changes to the system. However, some companies implement
only whole projects in Production. Figure 9–1 does not apply to these
companies. In this case study, a company finds a problem with a
mapping in Production. The first step is to compare the Production
version of the mapping with the Development version of the mapping, as
illustrated in Figure 9–2. If the mapping is identical in both
environments, the solution is simple: make the changes in either
environment and copy the mapping to override the older version. If the
mapping in Production differs from its Development version, then the
approach depends on whether the BI system is in its initial or mature
phase."
Sounds just like what we are looking for. So what does it recommend?
Firstly, it splits OWB projects into two types: those that are in
development or which have just been deployed (referred to as 'initial
phase') or those that have been up and running for some time (referred
to as 'mature phase').
For initial phase projects, the recommendation is to have a separate
design repository, and runtime repository, for each environment. This
is because, during this early phase, changes are likely to be made to
all of the environments (prod, test, dev and so on) all at the same
time, and it's more convenient to track these changes using separate
repositories. To eventually keep them all in sync, the recommendation
is to use the .MDL export facility to export and import individual
mappings, objects, modules and so on.
For mature phase projects, the recommendation is still to have
separate runtime repositories for each environment, but in this case
just one design repository is used, with the OWB snapshot
facility used to store snapshots of what the dev, prod, test and QA
environments look like. For example, at the start of a system change, your
design repository should reflect what's in production. You'd then take
a snapshot of this, and save it as a copy of production. Then, you'd
make whatever changes you need to this version, referring to this now
as development. When you then go to deploy it to test, you take a
snapshot at that point and save it as a copy of test. In this manner,
you keep copies of the design of each environment, but avoid the
overhead of managing several design repository installations, using
the saved snapshots instead.
The document details the steps you need to go through to migrate
changes from one environment to another using the two approaches, and
has some useful diagrams you might want to 'appropriate' for your
project documentation.
Take a look if you get a chance, and while you're there, you might
want to download the
complete set of ten case studies, which covers areas such as
transferring remote files, handling flat files with varying names, and
rolling out changes to the target schema.
In-Depth
Introduction To Oracle Warehouse Builder 10g
"Whilst many Oracle DBAs build and maintain their data warehouses
using nothing more than UNIX, SQL, and PL/SQL scripts, many are now
starting to use graphical tools such as Oracle Warehouse Builder 10g
(OWB) to help with development. What do tools such as OWB10g offer the
DBA, and how can they help with the warehouse design and build?"
Find out in my latest article for DBAZine entitled
"An Introduction To Oracle Warehouse Builder 10g"
Householding
Warehouse Data Using OWB10g
"Householding is the process of identifying related information from
multiple sources and storing it in the data warehouse such that all
related information about a given party is accessible when looking at
any information about the identified party. A key point is that
householding does not simply refer to the matching up of data about a
family. The need for householding began with a requirement for clear
and concise marketing data. As such, the goal was to identify all
individuals that reside in the same physical residence, hence the name
householding. A marketer could now send one piece of mail to the
household, which saved the marketer considerable dollars in postage
and mail preparation costs. Today, householding refers to any grouping
of information about a given person, family, household or company. The
need today for accurate identification of the same person, family,
household or company across all systems is significantly greater than
when householding was first identified as a required business rule for
data migration. Today's marketing programs target people at many
levels - as an individual, as a head of household or as a member of a
given group of people (i.e., club, residence and/or business)."
write Greg Mancuso and Al Moreno in
"The Common Problem – Working with Merge/Purge and Household" for
DMReview.
As an aside, if you're using OWB, merge/purge and householding are
fairly painless processes due to the inclusion of the name and address
cleansing, and match/merge transformations as of OWB 9.2.
The name and address cleansing functions were originally derived from
the Pure*Integrate technology Oracle licensed from Carleton, and work
with name and address files (which you have to purchase separately)
from companies such as Trillium.

The Match/Merge transformation allows you to first match, and
then merge, warehouse data. You can match records using
business rules, and consolidate multiple records into a single record.

Together with the name and address cleansing transformation,
match/merge can be used to combine multiple customer records into a
household record, as described in the preceding article. What
particularly makes the match/merge transformation useful is the
ability to do 'fuzzy matching' where exact matches between records
can't be found but where similar records need to be combined (where,
for example, an address matches but one or more postcode records can't
be found). For more details, take a look at:
·
Match/Merge Operator
·
Name and Address Operator
from the
OWB 10g Documentation.
Loading XML
Documents Using OWB 10g
Recent versions of Oracle Warehouse Builder come with the ability to
load XML documents into the database. This article explains, in simple
steps, how to load a simple XML document into your data warehouse
using OWB, and refers to OWB 10.1 although it should work with any
version from 9.0.4 upwards. It's based on the examples in the
Using The XML Toolkit chapter in the
OWB 10g Users Guide available on OTN.
The examples are based around a scenario where an 'orders' XML
document is loaded into a 'purchase orders' table. To start off, and
assuming OWB 10.1 is installed together with appropriate runtime,
repository and target schemas, the first thing we want to do is create
a table within our target schema to receive our XML data. Create the
table using this script:
create table Purchase_Orders (
id varchar2(10) not null,
order_date date not null,
shipto_name varchar2(60) not null,
shipto_street varchar2(80) not null,
shipto_city varchar2(30) not null,
shipto_state varchar2(2) not null,
shipto_zip varchar2(9));
Next, create a directory on the Oracle server to hold our XML
documents. In my case, I created a directory called
c:\xml_test
on my Windows 2000 server, and then made sure that the directory was
listed in the UTL_FILE_DIR section of the init.ora file. You'll need
to bounce the database for this setting to become active.
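If you want to double-check that the setting has taken effect after
the restart, you can query the parameter from SQL*Plus - the directory
should appear in the value list:
select value
from   v$parameter
where  name = 'utl_file_dir';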
Now, using notepad or any other text editing program, create a file
called
orders.xml
and type in the following XML:
<ROWSET>
<ROW>
<ID>100</ID>
<ORDER_DATE>2000.12.20</ORDER_DATE>
<SHIPTO_NAME>Adrian Howard</SHIPTO_NAME>
<SHIPTO_STREET>500 Marine World Parkway</SHIPTO_STREET>
<SHIPTO_CITY>Redwood City</SHIPTO_CITY>
<SHIPTO_STATE>CA</SHIPTO_STATE>
<SHIPTO_ZIP>94065</SHIPTO_ZIP>
</ROW>
</ROWSET>
Save this file to your directory on the server. This is the document
you'll load in using OWB. Now, start up OWB and either create a new
project, or open up an existing one.
Now, an important thing to bear in mind is that OWB loads XML
documents into the database using some new PL/SQL transformations,
rather than XML being a new source module type or another type of flat
file you can bring into the project. These new transformations are
WB_XML_LOAD, a procedure that just loads your documents, and
WB_XML_LOAD_F, a function that loads documents and returns the number
of documents loaded.

Therefore, to import an XML document into your warehouse, you have to
go by a kind of 'roundabout' route where you create either a pre or
post-mapping trigger in a mapping, and use the trigger to call either
WB_XML_LOAD or WB_XML_LOAD_F. To illustrate this,
1.
Create a new
mapping and call it XML_LOAD. Drop onto the mapping a pre-mapping
trigger, and select WB_XML_LOAD from the list of transformations.
2.
Drop a
constant on to the mapping, add an attribute to the constant, call
it CONTROL_FILE and make it VARCHAR2(250)
3.
Go into the
properties for this attribute, click on the '...' button in the
expression field, and enter the following into the text box:
'<OWBXMLRuntime>'||
'<XMLSource>'|| '<file>c:\xml_test\orders.xml</file>'|| '</XMLSource>'||
'<targets>'||
'<target dateFormat="yyyy.MM.dd">Purchase_Orders</target>'||
'</targets>'||
'</OWBXMLRuntime>'
4.
Click on the
'validate' button and check that there are no errors. This is your
XML control file that tells OWB where the XML file source is (c:\xml_test\orders.xml)
and what target to load the XML elements in to (purchase_orders).
The dateFormat bit tells Oracle how to parse the date within the XML
file and convert it to Oracle format.
5.
Drag the
constant attribute across to the control_info attribute in your
pre-mapping trigger. Your mapping should now look something like:

If you try and validate the mapping now, you'll get the error message
'VLD-1110: Code cannot be generated for mappings that contain only
constants or map input/output parameters as source and targets'. This
is because we can't run a mapping that consists of just a pre or
post-mapping trigger; to get round this, either deploy this trigger to
an existing mapping, or add a dummy table update to the mapping that
satisfies this requirement. For example, I created a table called
DUMMY within my target schema:
CREATE TABLE DUMMY (DUMMY_ROW VARCHAR2 (1) ) ;
dropped it on to the canvas twice (once as DUMMY, once as DUMMY_1),
and then connected the DUMMY_ROW column from one to the other to
create a table INSERT.

6.
Validate the transformation now, and it should come back with no
errors. Now, start up the Deployment Manager, connect to your runtime
repository, and then navigate down to your XML_LOAD transformation.
Right-click on the transformation, choose SET ACTION > CREATE, and
then choose FILE > GENERATE/DEPLOY. Your screen should look something
like:

7.
Click on
'Deploy', and the results should read 'Deploy Status : Success'.
Then, after clicking 'OK', go back to the Deployment Manager screen,
right-click on the XML_LOAD mapping, and select 'Execute...' .
Accept the defaults, let the mapping run, and your screen should
look like:

8.
If you then start up SQL*Plus and select from the PURCHASE_ORDERS
table, you should find your XML document loaded in.

What we've gone through here is a fairly simple example where the XML
document's elements match exactly the table columns that we're going
to load in to, and we're only loading one document into one
table. Often, you might need to transform the XML document before
loading it into the database, and to do this you would apply an
XSLT stylesheet to match up the elements in the XML document with
the target table's columns. On other occasions, you might want to load
the XML document into multiple tables, putting part of the data in one
table and the rest in others. All of these situations are discussed in
the
Using The XML Toolkit chapter in the OWB 10g Users Guide which can
be accessed from the OTN website.
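As an aside, because WB_XML_LOAD is just a PL/SQL procedure, you can
also call it directly from SQL*Plus with the same control string used
in the mapping above. I've only used it from within mappings, so treat
this as a sketch - you'd run it as a user that has access to the
OWB-supplied transformations:
begin
  -- same control_info string as used in the pre-mapping trigger
  wb_xml_load('<OWBXMLRuntime>'||
              '<XMLSource>'||
              '<file>c:\xml_test\orders.xml</file>'||
              '</XMLSource>'||
              '<targets>'||
              '<target dateFormat="yyyy.MM.dd">Purchase_Orders</target>'||
              '</targets>'||
              '</OWBXMLRuntime>');
end;
/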
Scripting And
Refactoring Using OWB OMB*Plus
I've been doing some research for a paper on OWB10g, and because of
this I've had a chance to look at the scripting support that was
originally introduced with OWB9.0.4. If, like me, you didn't really
pay much attention to this new feature when it first came out, I think
you'll be pleasantly surprised about what is actually a very powerful
new capability.
One of the criticisms leveled at graphical ETL tools such as OWB is
that, whilst they might make a novice Oracle developer more
productive, they actually slow down more experienced developers who
are more used to working with the command line. Whilst tools such as
OWB are great for novice users who need visual prompts when putting
together objects and mappings in a warehouse, DBAs and Unix sysadmins
are more used to defining objects using scripts and often write
programs to generate batches of objects to make the process more
efficient. A good example of this is Arup Nanda's
"A Multi-Source, Time Variant Data Warehouse Case Study" which
describes a data warehouse that had 7,200 tablespaces, 10 tables, 360
partitions and 100 indexes, and loaded data from source to target
using a number of very similar CREATE TABLE AS SELECT statements that
were programmatically generated using a Unix shell script. Clearly, in
this example, generating all these objects and mappings by hand using
the OWB GUI would be far less efficient than writing a series of
reusable scripts.
Another problem that faces developers when working with a GUI
application like OWB is there's no way to record and replay a series
of development steps so that you can apply them in one go to your
project. For example, you might want to add a number of columns
(modified date, modified by etc) to all your tables, and also make
these modifications to your test and production environments. Using
the OWB GUI tool you'd have to go into each project and make the
changes manually, which takes time and opens up the possibility of
introducing user error. Also, not having an automation process
precludes the ability to carry out
agile or test driven development, where changes can be made to
requirements at any stage of the project as long as these changes are
automatically
refactored into the development model.
To address these issues, starting with OWB 9.0.4 a scripting
capability was introduced which allows us to create, alter, generate
or drop both warehouse objects and mappings using the command line.
OWB scripting consists of two components:
-
OMB Scripting,
a programming language based on Tcl that allows us to manipulate
objects in the OWB repository, and
-
OMB*Plus, a
'SQL*Plus' lookalike used for interactively running OMB Scripting
commands
OMB stands for 'Oracle MetaBase', which is the product name for the
repository that stores the business logic and object definitions for
our OWB project. OMB Scripting acts as an interface to the
OWB Java API, translating scripting commands into API calls,
allowing us to work with the OWB repository without using the GUI
interface. Like SQL, OMB commands are divided into MDL (metadata
definition language) commands which allow us to create metadata
objects, and MML (metadata manipulation language) commands which allow
us to change the contents of metadata objects.
For the purposes of the rest of the article, I'll be using OWB10g on
Windows 2000, which can be
downloaded from OTN if you've not got this current version. To
start up the OMB*Plus environment, using the Start Menu navigate to
your OWB program group, go to the Warehouse Builder subgroup and
select 'OWB OMB Plus'. Then, to connect to your repository, use the
OMBCONNECT command (changing the connection details as required):
OMBCONNECT username/password@hostname:port:service_name
Then, to change context to a project, use the OMBCC command:
OMBCC 'MY_PROJECT'
Note the use of CAPITALS with the OMB commands - this is mandatory and
you'll get errors if you use lower case. Now, you can create a new
module using the OMBCREATE MODULE command:
OMBCREATE ORACLE_MODULE 'STAGING_TABLES' \
SET PROPERTIES (DESCRIPTION) VALUES ('Warehouse staging tables')
Note the '\' which indicates that a command carries on to the next
line, and the lack of ';' compared to SQL*Plus. Now, change context to
the new module and create a table in the repository using the
OMBCREATE TABLE command
OMBCC 'STAGING_TABLES'
OMBCREATE TABLE 'STG_CUSTOMERS' \
SET PROPERTIES (DESCRIPTION, BUSINESS_NAME) VALUES ('Customer
staging table','Staging Customers') \
ADD COLUMN 'CUST_ID' \
SET PROPERTIES (DATATYPE, PRECISION, SCALE, NOT_NULL) VALUES
('NUMBER',7,0,1) \
ADD COLUMN 'CUST_NAME' \
SET PROPERTIES (DATATYPE, LENGTH) VALUES ('VARCHAR2',50)
You can then add further columns to the table using the OMBALTER
command:
OMBALTER TABLE 'STG_CUSTOMERS' \
ADD COLUMN 'CUST_FORENAME' \
SET PROPERTIES (DATATYPE, LENGTH) VALUES ('VARCHAR2',30)
Finally, commit your work with the OMBCOMMIT command.

If you open up the OWB GUI, your project should look something like:

with the STAGING_TABLES module being the module you've just created
using OMB*Plus.
The OMB scripting feature is based on
Tcl with extensions for dealing with the OWB repository.
Therefore, you can use all of the Tcl programming constructs to
programatically create or maintain objects in the OWB repository. For
example, you might wish to loop through all of the tables within a
module and add a MODIFIED_DATE column, which you could do as follows:
foreach tableName [OMBLIST TABLES] {
OMBALTER TABLE '$tableName' \
ADD COLUMN 'MODIFIED_DATE' \
SET PROPERTIES (DATATYPE) \
VALUES ('DATE')
puts "Table $tableName altered."
}
You can also build and maintain transformations using OMB scripting.
For example, we might have COUNTRIES and REGIONS tables which we
want to join and load into a GEOGRAPHIES table.
OMBCREATE MAPPING 'LOAD_COUNTRIES_MAP' \
SET PROPERTIES (BUSINESS_NAME, DESCRIPTION) \
VALUES ('Load Countries from HR Mapping', 'Mapping to load the
countries table from HR schema') \
ADD TABLE OPERATOR 'COUNTRIES' BOUND TO TABLE '../HR_SRC/COUNTRIES'\
ADD TABLE OPERATOR 'REGIONS' BOUND TO TABLE '../HR_SRC/REGIONS'\
ADD TABLE OPERATOR 'GEOGRAPHIES' BOUND TO TABLE 'GEOGRAPHIES' \
ADD JOINER OPERATOR 'JOIN' SET PROPERTIES (JOIN_CONDITION) \
VALUES ('INGRP1.REGION_ID = INGRP2.REGION_ID') \
ADD CONNECTION FROM ATTRIBUTE 'COUNTRY_ID' OF GROUP 'INOUTGRP1' OF
OPERATOR 'COUNTRIES' TO \
GROUP 'INGRP1' OF OPERATOR 'JOIN' \
ADD CONNECTION FROM ATTRIBUTE 'COUNTRY_NAME' OF GROUP 'INOUTGRP1' OF
OPERATOR 'COUNTRIES' TO \
GROUP 'INGRP1' OF OPERATOR 'JOIN' \
ADD CONNECTION FROM ATTRIBUTE 'REGION_ID' OF GROUP 'INOUTGRP1' OF
OPERATOR 'COUNTRIES' TO \
GROUP 'INGRP1' OF OPERATOR 'JOIN' \
ADD CONNECTION FROM ATTRIBUTE 'REGION_ID' OF GROUP 'INOUTGRP1' OF
OPERATOR 'REGIONS' TO \
GROUP 'INGRP2' OF OPERATOR 'JOIN' \
ADD CONNECTION FROM ATTRIBUTE 'REGION_NAME' OF GROUP 'INOUTGRP1' OF
OPERATOR 'REGIONS' TO \
GROUP 'INGRP2' OF OPERATOR 'JOIN' \
ADD CONNECTION FROM GROUP 'OUTGRP1' OF OPERATOR 'JOIN' TO \
GROUP 'INOUTGRP1' OF OPERATOR 'GEOGRAPHIES' BY NAME
Then, we can generate and deploy objects and mappings using OMB:
OMBCREATE TRANSIENT DEPLOYMENT_ACTION_PLAN \
'DEPLOY_PLAN' ADD ACTION 'TABLE_DEPLOY' \
SET PROPERTIES (OPERATION) \
VALUES ('CREATE') SET REFERENCE TABLE \
'/MY_PROJECT/STAGING_TABLES/STG_CUSTOMERS'
OMBDEPLOY DEPLOYMENT_ACTION_PLAN 'DEPLOY_PLAN'
OMB scripting can carry out maintenance tasks such as installing
runtime schemas, creating deployment locations, creating runtime
repository connections, and so on.
OMBCREATE RUNTIME_REPOSITORY_CONNECTION 'NEW_RR_CONNECTION' \
SET PROPERTIES (DESCRIPTION, BUSINESS_NAME, \
HOST, PORT, SERVICE_NAME, \
CONNECT_AS_USER, RUNTIME_REPOSITORY_OWNER) \
VALUES ('Runtime Repository', \
'MY_RUNTIME', 'acgdev03', \
1521, 'plusdev1', 'OWBRU', 'OWBRR')
Finally, OMB scripts can be saved to the filesystem, loaded up whilst
in the OMB*Plus environment or executed directly from DOS or Unix by
supplying the script name after the OMB*Plus executable.
ombplus.bat c:\owb\refactoring_script.tcl
Scripting within Oracle Warehouse Builder has three main advantages.
Firstly, it allows us to bulk create objects and mappings, and make
mass changes to them without having to go into the OWB GUI and make
each change manually. This addresses the main reason that Arup Nanda
used shell scripts in his technical paper, and gives power back to the
experienced developer who wants to generate scripts according to a
template.
Secondly, by allowing us to script changes to objects, and more
importantly, be able to repeat these changes without fear of operator
error, it allows us to take a more agile approach to development and
we can refactor development changes reliably and repeatedly.
Lastly, and this is more of a theoretical rather than practical
benefit at the moment, it allows us the possibility of using the
command line to build objects and mappings, but in an abstract form
that isn't directly tied to the particular implementation details for
our particular Oracle platform. For instance, our developer might put
together an OMB script to merge a staging table into a dimension
table, which in Oracle 8i would have been implemented by OWB as two
separate INSERT and UPDATE commands. Once OWB was pointed at an Oracle
9i database however, it would in this case implement the merge using
the MERGE command. The advantage of using OWB is that the logical
transformation (merging into a dimension table) is kept constant,
whilst the actual implementation varies according to the platform.
What OMB brings to this is that you can write scripts that are now at
a more abstract level, building objects and mappings in a logical way
whilst OWB deals with the implementation. Before OMB was around, the
only way you could build things in this abstract way was to use the
OWB GUI, but now you can actually write scripts that have a degree of
abstraction from the actual implementation details for your particular
platform.
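To make that a bit more concrete, here's roughly what the two physical
implementations of the same logical 'merge into a dimension' operation
look like. These are hand-written illustrations against a made-up
CUSTOMER_DIM table (fed from the STG_CUSTOMERS staging table created
earlier), not OWB-generated code:
-- 8i-style implementation: separate UPDATE and INSERT
update customer_dim d
set    d.cust_name = (select s.cust_name
                      from   stg_customers s
                      where  s.cust_id = d.cust_id)
where  exists (select 1 from stg_customers s where s.cust_id = d.cust_id);

insert into customer_dim (cust_id, cust_name)
select s.cust_id, s.cust_name
from   stg_customers s
where  not exists (select 1 from customer_dim d where d.cust_id = s.cust_id);

-- 9i-style implementation: a single MERGE
merge into customer_dim d
using  stg_customers s
on     (d.cust_id = s.cust_id)
when matched then
  update set d.cust_name = s.cust_name
when not matched then
  insert (cust_id, cust_name) values (s.cust_id, s.cust_name);
The OMB script (or the OWB mapping) just describes the logical load;
which of these implementations OWB generates depends on the database
version it's deploying to.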
More details on OMB scripting can be found at:
Publishing OWB
Transformations As Web Services
There's an interesting technical note available on OTN on
invoking OWB processes or mappings as web services.
Traditionally, OWB mappings and processes are run using the OWB
deployment manager, SQL*Plus (using the Execution Template) or through
Oracle Enterprise Manager. However, from OWB 9.0.4 onwards, mappings
and process flows can be called by any application that can
communicate via a web service interface.

The way this is accomplished is to create a PL/SQL procedure that
accepts a set of input parameters, calls the OWB mapping or process,
and outputs the results as output parameters. This procedure is then
published as a PL/SQL web service using Oracle JDeveloper, and the web
service is then deployed to a J2EE application server such as Oracle
OC4J or Apache Tomcat.
The article
describes a scenario where an OWB mapping is put together that
takes as its input a customer address, cleanses the address using the
name and address transformation, then outputs the cleaned up address
as output parameters. This mapping is then published as a web service
and can be called by any J2EE or .NET application, allowing you to
deploy and use OWB's transformation capabilities outside of the normal
OWB environment. The article comes with a sample
.mdl export and
PL/SQL package files and full instructions on how to set up your
own OWB web service. Well worth a look.
Streamlining
Oracle 9i ETL With Pipelined Table Functions
I've recently been reading Bert Scalzo's
"Oracle DBAs Guide To Data Warehousing And Star Schemas", and he's
got an interesting take on the ETL process that I thought worth
looking at in more detail.
Background
Bert was the DBA behind the 7-Eleven data warehouse, one of the first
multi-terabyte dimensional data warehouses built on the Oracle
database. The book is a bit different to other Oracle data warehousing
books in that it concentrates on the implementation details of
building a warehouse, and as it's not by an Oracle member of staff he
tends to have forthright opinions on what elements of the technology
stack work, and what don't. A particular example of this is where he
looks at the various releases of the Oracle database (7, 8i and 9i)
and explains in some detail why 7 and 8i don't really cut the mustard
as large-scale data warehousing platforms. In another instance, he
looks at the evolution of the star transformation, explains why the
original implementations (which were first implemented using nested
loops, and later by Cartesian joins) weren't up to scratch but
then absolutely insists that you use the version that comes with
Oracle 9i. As the author's not afraid to offer an opinion it's an
entertaining read, and a bit different to your run of the mill 'new
features' book.
One chapter that particularly interested me was his approach to
loading the data warehouse. Right from the outset, Bert eschews the
use of ETL tools, preferring instead to code the load routines
manually. In his opinion, the trouble with ETL tools is:
-
The code they
produce is not optimally efficient, and often uses slow-running Java
to host and run the processes
-
The cost of
the tools is often more than the hours they replace
-
They allow the
project to be staffed by people with little Oracle or data
warehousing experience
This is a theme echoed in an
Oracleworld paper by Arup Nanda that I looked at last year, which
again rejected the use of ETL tools, but this time because the nature
of the ETL (lots of small transformations to a large number of nearly
identical tables) was better served by running a number of Unix shell
scripts that in turn generated SQL scripts to a pattern. I'll come
back to the ETL vs. hand coding debate later on in the article.
Anyway, the ETL process that the author wanted to carry out involved
the following steps:
-
load data from
a large flat file
-
apply
transformations to the data
-
update/insert
the data into a dimension table
What made this interesting was the fact that the file was very, very
big, and the ETL window was small. Therefore, any ETL process had to
-
Minimise
inter-process wait states
-
Maximise total
concurrent CPU usage
To illustrate, take a situation where your ETL process starts with a
file of 1000 records, and has two stages, each of which take an hour
to run. The least efficient way to process this file is to let the
first process handle all 1000 records (taking an hour to do so), then
pass the data on to the second process, which will also take an hour
to run. Your total run time is two hours. What the author is
suggesting with 'minimise inter-process wait states' is that it would
be good if your first process could handle for example 10 records,
then pass these on to the second process so that it could start its
work. The first process can then get on with the next 10 records, and
then pass them on when it's done. Your total run time would then be
just a bit over one hour, cutting down dramatically the time to
perform the load.
The 'maximise total concurrent CPU usage' requirement aims to take
advantage of the ability for multi-processor servers to carry out more
than one task at once. Oracle is very good at this, and has had this
feature (known as parallelism) for many years. However, it only really
comes in to play if you use DML statements (insert, update etc) to do
your load; if you write your load in PL/SQL using cursors and loops to
process your data row-by-row, you're not going to be able to take
advantage of parallel DML. What these two requirements mean,
therefore, is that your ideal ETL process is going to use parallel
processing, and somehow reduce to the minimum the wait between
different ETL stages.
Introducing External Tables
One of the key new features in Oracle 9i for business intelligence and
data warehousing was the inclusion of a number of ETL features within
the database, the point of which was to remove the requirement to
purchase a separate ETL engine such as Informatica, Genio or Datastage.
The approach proposed by the author uses these new features to
shortcut the warehouse load process.
If you remember from earlier in the article, our load process had to
carry out three steps:
-
load data from
a large flat file
-
apply
transformations to the data
-
update/insert
the data into a dimension table
Traditionally, you would accomplish this first step by using
SQL*Loader. SQL*Loader allows you to import a flat file into the
database, carry out basic transformations on it (change case,
reformat, reject invalid rows), and if it helps speed up the process,
load it in parallel using direct path. As you can only insert data
using SQL*Loader, and the degree to which transformations can take
place is limited, you'd generally load the data into a staging table
and then process it in a second step.
Oracle 9i however introduced a new feature called
External Tables, which allows you to define a database table over
a flat file. Taking as our example a comma-separated contracts file
that is used to load data into a contracts dimension, the code to
create an external table would be:
create directory inp_dir as '/home/oracle/input_files';
create table contracts_file (contract_id number, descrip
varchar2(50),
init_val_loc_curr number, init_val_adj_amt number)
organization external (
type oracle_loader
default directory inp_dir
access parameters (
fields terminated by ','
)
location ('contracts_file.csv')
)
parallel 10;
The External Table feature allows you to embed the SQL*Loader control
file into the table DDL script, and then allows you to run SELECT
statements against the flat file. You can include the external table
in joins, subqueries and so on, but you can't use the external table
to delete or update data in the flat file. External tables came in
with Oracle 9i, and they're available as data sources in recent
versions of Oracle Warehouse Builder.
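Once the external table is in place, it behaves like any other
(read-only) table as far as queries are concerned. A couple of quick
sanity checks against the definition above:
-- how many rows are in the flat file?
select count(*) from contracts_file;

-- filter and project the file just like a normal table
select contract_id, descrip
from   contracts_file
where  init_val_loc_curr > 10000;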
(Pipelined) Table Functions
Traditionally, once we'd loaded the flat file data into a staging
table via a SQL*Loader script, the next stage would be to transform
the data, either through a series of DML commands or by using a PL/SQL
package. The transformation would have one or more stages, and the
stages would run sequentially, one after the other.
Oracle 9i now gives us the opportunity to improve on this situation,
by allowing us to create our transformations as functions - functions
that take our external table as an input, and output the transformed
data as rows and columns that can be used to update a table. These
functions, known as
Table Functions, can also be 'pipelined',
which makes Oracle start returning rows from the function before
processing has finished. By pipelining table functions, you can string
a number of them together and have them 'pass off' rows to the next
process as soon as a batch of rows are transformed, meeting our
requirement to minimize our inter-process wait states. So how does
this work in practice?
create or replace type rec_contract_type is object (
  contract_id       number,
  descrip           varchar2(50),
  init_val_loc_curr number,
  init_val_adj_amt  number
);
/
create table contracts_dim (
  contract_id       number,
  descrip           varchar2(50),
  init_val_loc_curr number,
  init_val_adj_amt  number
);
create or replace type table_contract_type is table of rec_contract_type;
/
create or replace package transform_pkg as
  type contract_cur is ref cursor return contracts_file%rowtype;
  function go (contract_arg in contract_cur)
    return table_contract_type
    parallel_enable (partition contract_arg by any)
    pipelined;
end;
/
create or replace package body transform_pkg as
  function go (contract_arg in contract_cur)
    return table_contract_type
    parallel_enable (partition contract_arg by any)
    pipelined
  is
    contract_rec_in  rec_contract_type := rec_contract_type(null,null,null,null);
    contract_rec_out rec_contract_type := rec_contract_type(null,null,null,null);
  begin
    loop
      fetch contract_arg into
        contract_rec_in.contract_id, contract_rec_in.descrip,
        contract_rec_in.init_val_loc_curr, contract_rec_in.init_val_adj_amt;
      exit when contract_arg%notfound;
      -- pass the key and descriptive columns through unchanged
      contract_rec_out.contract_id       := contract_rec_in.contract_id;
      contract_rec_out.descrip           := contract_rec_in.descrip;
      contract_rec_out.init_val_loc_curr := contract_rec_in.init_val_loc_curr;
      -- derive the adjusted amount from the local currency value
      contract_rec_out.init_val_adj_amt  := contract_rec_in.init_val_loc_curr * 1.6;
      -- return the transformed row to the consumer straight away
      pipe row (contract_rec_out);
    end loop;
    return;
  end;
end;
/
As the example shows, the function takes as its argument a ref cursor that points to the data being fed into the transformation, and returns a collection of transformed rows that is pipelined back to the caller. The PIPELINED clause tells the function to return rows as soon as a PIPE ROW() call is made, which happens every time a row from the cursor is processed by the function. The function therefore takes as its input a pointer to the results of a SELECT statement, and spits out at the other end the transformed version of those rows and columns.
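If you want to check the transformation on its own before wiring it into a load, you can query the pipelined function directly, feeding it the external table (a simple sanity-check query, not part of the final load):
select *
from   table(transform_pkg.go(
         cursor(select * from contracts_file)))
where  rownum <= 10;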
The Merge Command
What we want to do now is take the transformed data and use it to update our dimension table. In the past, we'd have accomplished this with a combination of INSERT and UPDATE statements, but with Oracle 9i we've now got the MERGE command, which does this in one step. What's more, we can wrap the whole transformation process up into this single MERGE command, referencing the external table and the table function in the same statement as the source for the MERGEd data.
alter session enable parallel dml;

merge /*+ append parallel(d,10) */
into  contracts_dim d
using (select *
       from   table(transform_pkg.go(
                cursor(select /*+ parallel(contracts_file,10) full(contracts_file) */ *
                       from   contracts_file)))) f
on    (d.contract_id = f.contract_id)
when matched then
  update set descrip           = f.descrip,
             init_val_loc_curr = f.init_val_loc_curr,
             init_val_adj_amt  = f.init_val_adj_amt
when not matched then
  insert values (f.contract_id,
                 f.descrip,
                 f.init_val_loc_curr,
                 f.init_val_adj_amt)
/
So there we have it - our complex ETL function all contained within a
single MERGE statement. No separate SQL*Loader phase, no staging
tables, and all piped through and loaded in parallel.
Oracle Warehouse Builder
The good news for OWB users is that recent releases of OWB (9.0.4
upwards) support
table functions as transformations in a mapping process.

The current version of OWB (10g Release 1) supports table functions that you've written and then imported into the OWB repository, whilst the next release, codenamed 'Paris', will allow you to build the table function graphically within OWB. This support for table functions, together with the ability to fine-tune the load parameters and take advantage of 9i ETL features like MERGE and pipelining, actually invalidates Bert's argument that ETL tools don't produce efficient, optimal code - although, to be honest, all of this is fairly new functionality and the book is a couple of years old now.
Conclusion
With the new ETL functionality in Oracle 9i and 10g, plus the support
for this functionality in OWB 9i and 10g, there's really no need now
to split every transformation into a series of staging tables and
separate transformations. If you start to use features such as table
functions, external tables, parallelism and pipelining, you can start
to take advantage of the features Oracle have built into the database
engine specifically for loading data warehouses.
If you want to read more about this subject, you can check out Bert
Scalzo's excellent book
"Oracle DBAs Guide To Data Warehousing And Star Schemas" or check
out the following articles:
·
OTN article on Table Functions
·
OTN OWB Forum thread on Table Functions
·
"Loading and transform external data into Oracle 9i"
·
"CASTing About For a Solution: Using CAST and Table Functions in
PL/SQL"
·
"Using Pipelined Table Functions (Oracle 9i)"
Web Services,
BPEL And Oracle Data Warehousing
One of the news items I
picked up on earlier this week was around the release of Oracle's
Business Activity Monitoring ('BAM') feature for Application Server
10g. From the
number of articles written about BAM, this is obviously a key
product for Oracle, so I decided to take a further look.
According to
Ephraim Schwartz and James Niccolai for Infoworld,
"In response to market demand for more real-time business information,
Oracle announced today it will offer a BAM (business activity
monitoring) tool, Business Activity Manager, as part of an upgrade to
its application server software due out in the middle of the year.
The BAM component will run on the existing Oracle10g Application
server, released at the start of the year, and on Application Server
10.1.2, due out this summer. It will not run on earlier releases such
as Application Server 9i, according to Thomas Kurian, senior vice
president of Oracle Server Technologies.
Included in Business Activity Manager are the ability to detect,
correlate, and analyze separate events such as RFID scans, bar code
readings, or events out of an enterprise application or a supply-chain
process event.
Kurian said the BAM application is unique in that it can correlate
both traditional historical events out of a data warehouse with
real-time events."
So Business Activity Monitoring will be an add-on feature for
Application Server 10g, and looks like it uses the B2B-connectivity
that comes with AS10g. This would seem logical as a big part of
AS10g's appeal as a middleware platform is its ability to get
different ERP and CRM systems talking together, using technologies
such as XML, SOAP, web services and so on. Note the RFID scans mention
(current
flavour of the month with Oracle) and the nod towards integrating
current activity with historical data warehouses, which concurs with
the current Oracle vision of
integrated, real-time data warehousing.
Martin Lamonica for CNET Asia notes that the introduction of
Business Activity Monitoring is part of a general Oracle strategy to
make money out of integrating software:
"The introduction of the new data-gathering software is part of
Oracle's strategy to bring in more revenue from the business of
stitching together incompatible software. Oracle in January introduced
Customer Data Hub, integration software sold as a stand-alone product
that uses XML (Extensible Markup Language)-based protocols called Web
services to centralize information about customers. The company
earlier this year introduced an update to its suite of server-based
transactional software, called Oracle 10g Application Server.
Making disparate systems share information cost-effectively is a
perennial problem for companies and represents billions of dollars in
technology spending. Kurian said many companies have already automated
several individual business processes, such as manufacturing or human
resources, by investing in packaged applications or writing custom
applications. Now, companies are looking to automate "cross-functional
business processes" that span a number of computer systems, he said."
Customer Data Hub was originally mentioned by Oracle
back in February this year and is a way of bringing together data
on all trading entities into a single data store - which is of course
an Oracle database. Customer Data Hub raised a few eyebrows at the
time as it was seen as a bit of an about-face by Oracle, who up until
then had advocated 'Oracle-only' solutions as the only way to deliver
a successful ERP project. When you looked at Customer Data Hub in more
detail though, what you actually saw was in fact quite a consistent
approach - keep your third-party ERP applications (Siebel, Peoplesoft
and so on) but use the Oracle database (via the Customer Data Hub) as
your single store of trading entity data. Business Activity Monitoring
looks to be building on this approach and directly taking on
integration specialists such as WebMethods, Tibco and SeeBeyond, in
the process positioning Application Server 10g as the centre of your
ERP framework. In addition, if you read the CNET Asia article you'll
also see references to something called Business Process Execution
Language, a web services protocol being championed by Oracle, IBM and
BEA as a standards-based way to integrate ERP data.
Duncan Lamb
mentions BAM and BPEL in an article in his (always excellent)
blog:
"CRN
: Daily Archives : Oracle To Update App Server With Business Activity
Monitoring : 2:53 PM EST Thurs., May 13, 2004: "The BAM
data feeds into a central management console, where it can be
correlated, filtered and subjected to analytics, according to Oracle.
The results can be viewed on a portal dashboard. "
According to the article, the BAM capabilities include support for
BPEL, which is coming out the winner in the standards wars for the
"Process Execution" slice of the
alphabet soup of the SOA.
Oracle has not settled on a standard itself however, announcing
it will support both BPEL
(initially proposed by MS and IBM) and WS-Choreography.
Thank goodness for both of them, for as the article says, "Oracle's
move brings clarity to the respective focus of each group."
So what's all this about BPEL? Well, BPEL is a bit of a hot topic in
the application server and web services world at the moment, and in
this SearchWebServices.com article it is defined as
"BPEL (Business Process Execution Language) for Web services is an
XML-based language designed to enable task-sharing for a
distributed computing or
grid computing environment - even across multiple organizations -
using a combination of
Web services. Written by developers from BEA Systems, IBM, and
Microsoft, BPEL combines and replaces IBM's WebServices Flow Language
(WSFL) and Microsoft's XLANG specification. (BPEL is also sometimes
identified as BPELWS or BPEL4WS.)
Using BPEL, a programmer formally describes a business process that
will take place across the Web in such a way that any cooperating
entity can perform one or more steps in the process the same way. In a
supply chain process, for example, a BPEL program might describe a
business protocol that formalizes what pieces of information a product
order consists of, and what exceptions may have to be handled. The
BPEL program would not, however, specify how a given Web service
should process a given order internally."
From an Oracle point of view, one of the best places to start learning
about BPEL is Mike Lehmann's
Web services, BPEL4WS, J2EE blog. Mike is the Product Manager for
Oracle's BPEL engine and has put together a number of articles on web
services and BPEL that you might want to take a look at to get a
heads-up on the technology. As a starter, Mike's put together a couple
of good articles,
"Who Needs Web Services Transactions?" and
"Oracle Backs BPEL4WS and WS-Choreography", together with a
powerpoint presentation, "BPEL:
Business Processes with Web Services" which you might want to take
a look at.
So, how does this affect us? In terms of data warehousing and BI,
probably not much at the moment, and it's probably only going to be
relevant to organisations using Application Server 10g as a kind of
'central customer data hub' in conjunction with the latest release of
Oracle Applications. In the long term though, it's another pointer
towards Oracle's vision of pervasive, real-time business intelligence,
that merges current activity with historical data, and you can see the
Oracle Application Server being a more central part of this process,
using all the interconnect functionality within the application server
platform to bring the disparate data together. What this probably
means to someone like me (who's probably more comfortable using the
RDBMS as the integration engine) is that to stay ahead of the game,
you're going to need to gain a familiarity with web services, BPEL,
XML, SOAP and so on, as increasingly data warehouse sources are going
to be in this form rather than just flat files and database links.
OWB 10g Now
Available For Windows
Nicholas Goodman contacted me with the good news that Oracle
Warehouse Builder 10g has
now been released for Windows.
Although this first release of OWB 10g is mainly concerned with Oracle
Database 10g compatibility as a target database, the
accompanying FAQ mentions the following new features:
-
Simplified
installation - there are no longer Design-Time and Server-Side
installation options.
-
Warehouse
Builder now provides a script that configures the Warehouse Builder
Runtime Service for remote execution of Warehouse Builder's mappings
on Oracle Enterprise Manager nodes.
-
Warehouse
Builder 10g supports the deployment of Process Flows into Computer
Associates Unicenter AutoSys Job Management.
More details on these changes are in the
Release Notes.
The good news therefore is that we can now use Oracle 10g as a
deployment target for OWB. However, the OWB 10g release we're probably
all waiting for is the
'Paris' release, due later in 2004, which looks likely to rework
the way dimensional data models are put together, supporting more
complex star schemas together with more seamless support for CWM
metadata and multidimensional OLAP.
Posted by mark at
10:13 PM |
Comments (0)
Simon Terr On
Data Warehousing And GIS
"Recently I've been involved in a very interesting project involving
data warehousing and geographical information system (GIS) technology.
Although the data warehouse was not real-time enabled, the GIS
technology added a very interesting dimension to this data warehousing
project."
writes Simon Terr for DMReview.
Degenerate
Dimensions Defined
Much of my work involves designing and reviewing dimensional data
models, and an interesting issue that often comes up is how to deal
with data items such as invoice number, order number and so on, that
are not strictly facts - you're not going to want to add them up, or
average them, or perform any other maths on them - but they don't seem
to fit into existing dimensions.
Ralph Kimball coined the term
'Degenerate Dimensions' for these data items, as they perform much
the same function as dimensions: they sit in the fact table and allow
you to limit down or 'slice and dice' your fact table measures, but
they aren't foreign key links through to dimension tables, as all the
information you want - the invoice number, or the order number - is
contained in the degenerate dimension column itself. Degenerate
dimensions are useful as they tie the transactions, or events, in the
fact table back to real-life items - invoices, orders and so on - and
they can be a quick way to group together similar transactions for
further analysis.
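As an illustration (the table and column names here are purely hypothetical), a degenerate dimension is just a column carried on the fact table, with no lookup table behind it:
create table sales_fact (
  day_key        number not null,      -- foreign keys to real dimension tables
  customer_key   number not null,
  product_key    number not null,
  invoice_number varchar2(20),         -- degenerate dimension: no lookup table
  sale_amount    number,
  sale_quantity  number
);

-- the degenerate dimension can still be used to slice and dice, or to
-- pull back all the line items that made up a single invoice
select invoice_number, sum(sale_amount)
from   sales_fact
group by invoice_number;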
The key here is not to go overboard and make these degenerate
dimensions into full dimension tables - for example, an Invoice
dimension - as in all likelihood this dimension table will grow at
the same rate as your fact table. If there is other interesting
information to go with the invoice - for example, who the customer
was, what products were ordered - this is better placed in specific
dimensions for customers and products where it can be stored as a kind
of 'master copy', rather than storing it alongside each order in a
ballooning Invoice dimension.
The other advantage with degenerate dimensions is that they're a lot
easier to build and maintain when using ETL tools such as Oracle
Warehouse Builder, as you don't have to create dimension lookup
tables, create synthetic keys, sequences and so on. Indeed, if you're
loading your dimensional model into a multidimensional database such
as Oracle OLAP, your database will be much smaller in size and easier
to handle if you can keep the number of formal dimensions to a
minimum, as they tend to 'explode' in size the more dimensions you add
to the database.
Judicious use of degenerate dimensions keeps your dimensional model
rational and your database size reasonable, whilst allowing you to
keep useful items in the fact table that help tie the data
warehouse back to the original source systems.
New Metadata
Exchange Options For OWB 9.2
From
http://www.oracle.com/corporate/press/index.html?3003709.html
"Oracle(r) Provides Business Intelligence Users Powerful Metadata
Management
Oracle Corporation and Meta Integration Technology, Inc. (MITI), the
industry leader in metadata integration, today announced a partnership
that will enable Oracle(r) Warehouse Builder customers to more
efficiently manage metadata integration with modeling, extraction and
reporting tools commonly used in business intelligence (BI)
architectures.
The incorporation of MITI's Meta Integration(r) Model Bridge (MIMB)
technology with Oracle Warehouse Builder provides users with greater
control of their business data integration processes. MIMB enables the
movement of metadata across numerous environments such as databases
and data warehouses as well as BI, design, and extract, transfer and
load (ETL) tools."
"This partnership will help customers to utilize their existing
investments in modeling, reporting and ETL tools. Oracle Warehouse
Builder can utilize MIMB for metadata integration with popular BI
products including:
* Data modeling tools such as CA AllFusion ERwin, Sybase
PowerDesigner(r)
* ETL tools such as Informatica PowerCenter
* BI query and reporting tools from Business Objects and Cognos
* Standard metadata repository formats such as OMG CWM(TM)
"Every modeling, ETL, OLAP, BI and reporting tool provides its own
repository that produces and consumes metadata," said Dan Vesset,
research manager, Analytics and Data Warehousing, IDC. "As a result,
it is critical for any new tool to integrate well in the enterprise
metadata flow. Oracle's partnership with MITI enables Oracle Warehouse
Builder to integrate with a wide range of tools, providing users
greater control over metadata integration."
You can download MIMB from the
Meta Integration Technology website and Oracle have also got a
FAQ and
viewlet demonstration available on OTN. Although the press release
only came out recently, the technology actually works with OWB 9.2 and
can be used now. It's a separate download that gets installed
alongside OWB, and once installed gives you a number of additional
options when using the metadata import and export bridges.
From my own point of view the Informatica integration certainly looks
useful, as does that for Business Objects and Cognos. One to take a
look at when I get a spare moment.
New Developments
In Oracle Data Warehousing
I've been giving a lot of thought recently to how I might go about
designing and building future Oracle data warehousing projects. Over
the last few months, I've come across some interesting design
approaches, and coupled with some new technology that's now available
it's probably a good time to have a think about how future projects
might come together.
One of the key messages that keeps coming through, from both clients
and from Oracle, is around near or real-time data warehousing.
Traditionally, data warehouses have been refreshed daily, or weekly or
monthly, and it's been generally accepted that there's always going to
be a lag between the data in the source systems, and the data in the
data warehouse. This isn't necessarily a bad thing - data warehouses
are for analysing trends, and you wouldn't want to change the
direction of a business based on a few minutes' data; however, the role
that data warehouses play is changing, and it's common now to use the
data warehouse as a way of getting a holistic view of the whole
business, in real time, so that split-second decisions can be taken as
soon as an event happens, to prevent a customer defection, or to take
advantage of market movements, for example. If you're going to be
relevant in the data warehousing market, you've got to be capable of
being real-time.
Warehouses that are built using Oracle Warehouse Builder are generally
loaded on a batch basis, using PL/SQL routines built using OWB that
are scheduled using OEM and/or Oracle Workflow. If you want these
routines to be 'near' real-time, what you then do is run them on a
regular, constant basis (say every minute or so) and have them process
changed data from the source system, as opposed to loading the entire
source system each time. The usual way to do this is to have some sort
of 'new' or 'changed' date field in the source system, so that you can
pick up only those records that are new or amended. The smart way to
do this though, from Oracle 9i onwards, is to use 'Change Data Capture', a
mechanism for automatically feeding changed database records through
to interface tables, or indeed through to the warehouse fact and
dimension tables themselves.
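Before looking at Change Data Capture itself, here's roughly what the simple 'changed date' approach looks like - a sketch only, assuming a last_updated column on the source table and an etl_load_log table that records the high-water mark of each load (all of these names are made up for illustration):
select o.*
from   src_orders o
where  o.last_updated > (select max(last_load_time)
                         from   etl_load_log
                         where  source_table = 'SRC_ORDERS');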
In practice, I've never actually used the Oracle 9i change data
capture, because of the way it is implemented - it relies on triggers
being placed on the source tables, which can (a) slow down the source
system and (b) worry the source system DBA as you're changing his data
model. However, I recently had a look at a presentation and paper from
the last Oracleworld by Michael Brey and Michael Schmitz entitled
"Experiences with Real-Time Data Warehousing Using Oracle Database
10g" which
explains how the new 10g version of change data capture, called
'Asynchronous' (as opposed to 'Synchronous') change data capture
works.

In the 10g version, the redo logs are used instead of triggers,
making the process far less invasive and putting no load on the source
system. The drawback, of course, is that the source system (as well as
the target) has to be 10g, but I can see how you could put a 'real
time' system together, that mines the redo logs, transforms the data
using a PL/SQL package and places the data in the correct fact or
dimension table. More details on Asynchronous Change Data Capture can
be found in the Oracle white paper
"On-Time Data Warehousing with Oracle10g - Information at the Speed of
your Business" available on OTN.
Another interesting approach that the paper talks about deals with the
question of real time vs. fairly static data in the data mart. One
criticism of having a real time data mart or warehouse is that if you
run a query twice in succession, you might get a different answer, as
more data has been loaded since you first ran the query. To get around
this, the system designers had a main fact table, which was only loaded
once a day, together with a 'today's data' fact table, which was
updated constantly during the day. If you want a static view of the
data, you just query the main fact table; if you want a completely up
to date view, you SELECT against both tables, and UNION the results.
At the end of the day, the 'today's data' fact table is
instantaneously 'partition exchanged' with the partitioned main fact
table, which was partitioned by day and had an empty partition set
aside for each new day's data.
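In SQL terms, the approach looks something like this (table and partition names are illustrative only):
-- completely up-to-date view: static history plus today's intraday table
select * from sales_fact
union all
select * from sales_fact_today;

-- at the end of the day, swap today's table into the empty daily partition
alter table sales_fact
  exchange partition p_20040514 with table sales_fact_today
  including indexes without validation;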
Following on from this, I attended a
presentation by Jon Page at the recent UKOUG Reporting Tools & BIW
Special Event in March on Oracle's 'Information Architecture'. Oracle
are currently pushing a vision of data warehousing where the warehouse
is integrated into the ERP system; in Oracle's case, using a product
called 'Daily Business Intelligence' (DBI), an add-on to the
e-Business Suite. Interestingly, what Oracle are saying with DBI is
that you don't need a separate warehouse, and you don't need an ETL
process that loads data in batches and applies changes to the
warehouse on a daily, weekly or monthly basis. It looks like DBI is
being populated and refreshed using Fast Refresh Materialized Views -
materialized views that are updated whenever the base tables have data
COMMITed, a process that has little impact on database performance but
has the advantage of keeping the summary tables (or in fact any
derived tables) always up to date.
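I haven't seen the DBI internals, but the general technique looks something like this - a fast refresh, on-commit materialized view over an assumed order_lines table, together with the materialized view log it needs (names and columns are illustrative):
create materialized view log on order_lines
  with rowid, sequence (product_id, line_amount)
  including new values;

create materialized view sales_by_product_mv
  refresh fast on commit
as
select   product_id,
         sum(line_amount)   as total_amount,
         count(line_amount) as cnt_amount,    -- counts needed for fast refresh
         count(*)           as cnt_rows
from     order_lines
group by product_id;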

What this means in effect is that you don't have a long drawn-out ETL
process, with multiple staging tables and mass-copying of data from
source to target - you just create some fast-refresh materialized
views and code the ETL into SELECT statements. It's a good approach -
although the question I keep coming back to is 'what about complex ETL
processes?' - surely at some point you're going to need intermediate
tables, and surely at some point you're going to have to break out of
standard SQL and use PL/SQL for a more complex process. However, it's
an interesting approach and one I'll bear in mind for whatever I work
on next.
Separate from all this, I also recently took a look at a paper by Arup
Nanda entitled
"A Multi-Source, Time Variant Data Warehouse Case Study" - you
might have seen a
write-up I gave this paper at the time. The case study describes a
project where a medical claims data warehouse was built for a client,
with a rule that no ETL tool - not even OWB - could be used. The paper
then goes on to describe how they set up the warehouse and the ETL
routines, using nothing more than SQL*Plus and Unix shell scripts.
What was interesting about the paper was the completely different
approach they took to putting a warehouse together; most projects I
work on involve one or more OWB developers, building a series of ETL
mappings that translate into SQL scripts and PL/SQL packages, whilst
the authors of the paper did most of their work using parameterised
CREATE TABLE AS SELECT ('CTAS') scripts that were called from Unix
shell scripts.
What was different about Arup Nanda's approach was that he eschewed
the often long, drawn out graphical ETL process used by OWB and
crafted a number of reusable SQL scripts that bulk-copied data from
source to target using the CTAS statement. In addition, to deal with
changes to data types he then crafted a number of scripts that issue
ALTER TABLE commands to change column types. Lastly, he made extensive
use of (again) daily partitioned fact tables, dropping old partitions
and creating new partitions using generic SQL scripts and alongside
this using locally partitioned indexes that reduced index maintenance
to those partitions directly affected by table loads.
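The flavour of CTAS script the paper describes looks roughly like this - in the case study the table names, columns and date ranges were substituted in by the calling shell scripts, so everything here is illustrative:
create table stg_claims_200405
  nologging
  parallel 8
as
select /*+ parallel(c,8) full(c) */
       claim_id, member_id, claim_date, claim_amount
from   src_claims c
where  claim_date >= to_date('2004-05-01','YYYY-MM-DD')
and    claim_date <  to_date('2004-06-01','YYYY-MM-DD');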
What this said to me was that Oracle 9i warehousing projects don't
necessarily need to use OWB, if the process can be better served by
putting together well crafted reusable SQL scripts called from either
MSDOS batch files or Unix shell scripts. It also highlighted the
central role that partitioning plays in a warehouse, not only for
performance reasons but also due to the potential for partition
exchange loading - the process of swapping out a staging table that
contains data for a period with the corresponding empty partition in
the fact table. In fact, OWB 9.2 now supports partition exchange
loading, so this can also be carried out from the OWB GUI tool without
having to learn the relevant SQL statements. However, in some cases -
where the ETL process is simple but repeated many times with source
data of varying structure - coding SQL statements by hand and
controlling them from shell scripts is sometimes more efficient than
coding them using a graphical ETL tool.
The last development I've been looking at, is 'Agile'
methodologies that take a fundamentally different approach to the
project build than traditional 'waterfall' methodologies.
Traditional data warehouse projects that I've worked on (and managed)
usually work to a project plan where requirements are first defined,
then the system is designed, then coded, tested then rolled out.
Whilst this works well when the project requirements can be completely
designed at the start, it sometimes falls down when
-
the
requirements aren't fully defined at the start, and in fact only
come out as the project gets going
-
changes to
requirements, or problems during the build mean that the project
diverges from the plan early on
-
the client
wants to see some results early, but the project only delivers
something right at the end
-
time is tight
but the project methodology demands lots of documentation, and lots
of process
-
team members
have different skills but get treated for planning purposes as
interchangeable
Agile methodologies, such as
Extreme Programming or
Scrum address these issues by taking a fundamentally different
approach.

-
Requirements
are driven out over development cycles, constantly feeding back into
the project
-
iterations of
the project are delivered regularly and quickly, so that feedback
can be obtained
-
the
methodology is 'lightweight', only determining objectives and
leaving the developers to decide how to do the work
-
There is very
little project 'ceremony', only requiring a minimal amount of
project documentation
-
What project
process there is, is mainly focussed on setting objectives and
measuring progress
-
Team members
are recognised as having unique skills which sometimes cannot be
interchanged.
This agile approach looks excellent for business intelligence & data
warehousing projects, which more often than not don't have their
requirements fully defined at the start ("I don't know what I want, so
show me what can be done and I'll tell you then") and benefit most
from regular releases of data and reports to the warehouse user base.
It's also the case that, in most projects I've heard of, whilst there
is a lot of time and attention paid to the plan at the start of the
project, as time goes on less and less project documentation actually
gets filled in as everyone's backs are to the wall getting the project
delivered - the agile approach would therefore fit this better,
allowing us to focus on that documentation which is actually essential
- setting objectives and measuring progress - whilst allowing us to
dispense with documentation that is only filled in 'because it should
be'.
The agile approach should therefore help us deliver projects that are
more responsive to customer needs, deliver results earlier and
consistently, and allow developers to concentrate their efforts on
parts of the project that deliver real value. Of course the potential
problem here is that most projects you are asked to quote for are on a
fixed price, fixed requirements basis; the client believes they have
fully defined the requirements as part of the RFP, and they therefore
want you to quote a fixed price for doing this work. Any changes to
requirements (which more often than not eventually come in) then
become change requests which have to then be separately negotiated. If
agile methodologies are going to work, there'll need to be a
partnership between the client and the development team, balancing the
benefits from a flexible approach to the design and build with the
fact that it's not possible to completely lock down the cost and
elapsed time at the start of the project.
So what does this mean then for future data warehousing projects?
Well, whilst in the short term in all likelihood things will carry on
as normal, some approaches I'd like to bring into future warehousing
projects would include;
-
Increasing use
of real-time data, through a combination of synchronous and
asynchronous change data capture, and fast refresh materialized
views
-
More use in
general of the built-in ETL features in Oracle 9i and 10g, even when
they're not yet supported by OWB (though check the new New Features
documents for later versions, as more and more Oracle 9i ETL
functionality is being exposed through OWB)
-
Consideration
for the use of generic, parameterized SQL scripts when this would be
a more efficient way to build the ETL process
-
The use of
agile methodologies where the customer is agreeable to an iterative
development process, drawing out requirements through a series of
releases and keeping project documentation focussed on setting
objectives and measuring progress.
Any thoughts from anyone on this approach? Is anyone using the Oracle
9i features for real time data warehousing, or is anyone using an
agile methodology, such as eXtreme Programming or Scrum to run their
warehousing project? Drop me a line if you've got any feedback.
Guide To
Integrating OWB9.2 And Oracle Workflow
A
recent posting on Oracle Warehouse Builder prompted some questions
from readers on the process of integrating Oracle Workflow with Oracle
Warehouse Builder; this integration has also been a 'hot topic' on the
OTN OWB Forum. The OWB documentation stops short of explaining how
workflow jobs, created using OWB9.2 and deployed using the Deployment
Manager, are actually submitted and run using Oracle Workflow, and
therefore
Sonia Brown put together some notes on how this integration is
achieved.
The document covers the following steps, with a number of screenshots
to illustrate the process;
-
Installing
Oracle Workflow Server
-
Edit init.ora
file
-
Start Oracle
Universal Installer for Oracle Database 9.2
-
Install and
configure Web Server
-
Workflow web
interface virtual directory mappings
-
Installing
Workflow help documents
-
Set up a
directory service for Oracle Workflow
-
Verify your
base URL
-
Implement
Oracle Internet Directory integration (optional)
-
Implement
Single Sign-On integration (optional)
-
Installing
Additional Languages
-
Installing
Workflow Builder Client (Optional)
-
Installation
-
OWB
Configuration Settings
-
Create
Workflow Location
-
Create a
Process Flow using Workflow Location
-
Set Properties
of a Process Flow
-
Deploy the
Process Flow
-
Process Flows
using Workflow Server
-
Launch
process
-
View Process
Information
-
Oracle
Workflow Builder Import (Optional)
-
Process Flow
Import
There's also a
viewlet available on OTN, that shows how Oracle Enterprise Manager
can be used to run the resultant Oracle Workflow processes; one thing
to be aware of is that the preferred database user credentials must be
those of the Runtime Access User, whilst the Runtime Owner's details
are used in the properties page.
Comparing OWB To
Hand-Coding PL/SQL
"Hello Mark,
Stumbled across your web page and blog. Allow me to introduce myself.
I am an Oracle Apps developer and have been investigating the use of
OWB and ETL functionality for use in a conversion project. Have done
much ETL without formal tools before the term existed, but want to
know your opinion of the tool for data conversion from a mainframe
legacy source to Oracle. The main issue is the benefits
(standardization, use of metadata, ???) versus the cost in terms of
ramp-up time. It's a no-brainer to do data conversion with sql*loader
and pl/sql but to implement and learn a tool such as OWB, what in your
opinion is the learning curve? What is the biggest benefit? We have
maybe 30 objects or so that we will be converting. Oracle will be
using the tool for creating interfaces as well. Thanks in advance for
your input."
Good question about OWB. If I was to put my 'marketing' hat on, it'd
be straightforward to list the benefits of using OWB
-
metadata about
objects and mappings is captured during the build process, stored in
the OMG CWM format, and flows through to other Oracle tools such as
Discoverer, 9i OLAP, the database, workflow and so on.
-
building
graphical mappings, rather than hundreds of lines of code, makes it
easier for other developers to work out how a load process works
-
easy creation
of complex 9i metadata around your database objects; both the
dimensions and hierarchies you get with 'normal' Oracle 8i and 9i,
and the CWMLITE / CWM2 metadata required as part of the OLAP Option.
-
with OWB 9.2,
you now get a mapping debugger, plus Oracle workflow integration
-
all of the
extra auditing / error trapping that you get by default with an OWB
mapping
-
easy access to
the Oracle fine-tuning tricks (direct-path INSERT /*+ APPEND */ hints on loads, and so on - see the short example after this list)
-
access to
additional, pay-extra functionality such as name and address cleansing
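On the fine-tuning point above, this is the sort of direct-path load OWB can generate when a mapping is configured for it (reusing the contracts_file and contracts_dim tables from the table function example earlier on this page):
-- direct-path insert: APPEND loads above the high-water mark, bypassing
-- the buffer cache and generating minimal undo
insert /*+ append */ into contracts_dim
select contract_id, descrip, init_val_loc_curr, init_val_adj_amt
from   contracts_file;
commit;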
However, there is an overhead in using OWB, both in getting up and
running, and in building the project. In theory, these overheads are
balanced out by gains in productivity, but depending on what you're
looking to achieve, and the experience of your PL/SQL programmers,
it's worth weighing these up beforehand. Typical overheads include
-
setting up
OWB, plus the repository, runtime schema and so on, and ensuring
good performance of the tool
-
learning the
basics of OWB. Basically, your mappings are set based rather than
procedural, so you have to rethink how you put mappings together,
with each one translating to a single INSERT, UPDATE or MERGE
statement.
-
With OWB 9.0.4
and 9.2, the tool itself has got more complex, with separate
mechanisms for deploying mappings, and the mappings themselves
(which were straight PL/SQL packages, executable from a DOS/Shell
script) now wrapped up in an XML wrapper, together with the concept
of connectors, collections and so forth
-
Being a Java
application, OWB can be a little bit 'sluggish' to work with,
and can also slow down if the underlying database you're storing
definitions and mappings in is under a heavy load.
One of the benefits of working with OWB is that the mappings it
creates are all PL/SQL; in addition, if you want to extend the
functionality available in mappings by creating your own custom
transformations, these are all coded in PL/SQL. Therefore, you don't
need to learn a new scripting language to get the most out of OWB.
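For example, a custom transformation is just an ordinary PL/SQL function that you register in OWB and drop into a mapping - something along these lines (the function name and logic are made up for illustration):
create or replace function format_contract_descrip (p_descrip in varchar2)
  return varchar2
  deterministic   -- same input always gives the same output
is
begin
  return initcap(trim(p_descrip));
end;
/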
However, there are two caveats here:
-
although the
mappings that OWB generates are PL/SQL, and you can then go and
hand-edit the mappings, there's no way of folding these changes back
into the stored mapping model. Therefore, you'd need to reapply
these customisations each time you deployed the mapping.
-
if you write
custom functions to drop into a mapping, be careful that these don't
kill the performance of the mapping. For example, if you include
functions in your mapping that transform the data, or insert values
into control tables as part of the mapping, these can screw with the
query optimiser, leading to full table scans or the mapping
reverting to row-based rather than set-based.
Weighing these all up, on balance I would still go with OWB. The
benefits of code portability amongst the development team (it's easier
to pass on a graphical mapping, rather than someone's PL/SQL code),
the additional error control and auditing put into the code with OWB,
plus the hooks into all the other Oracle functionality (workflow, OEM,
OLAP, Discoverer, CWM metadata) outweigh the steepish learning curve,
plus slightly slower development compared to hand-coding, at least
initially. Once you've got the hang of it, a good OWB developer can be
as fast as a good PL/SQL developer, but you've now got all the quality
control, and metadata in a standard form, which is going to pay off on
a large project.
MOSC Note On
Using The OWB Design Browser With The Standalone OC4J Instance
If you're looking to use the web-based Design Browser that comes with
OWB 9.0.4 and 9.2, and want to deploy it using the new built-in OC4J
instance, there's a new MOSC note No.
237367.1 that goes through all the steps required to set it up.
In the past, to use the Design Browser you had to install it into an
Oracle Portal installation; with the bundled OC4J server that comes
with 9.0.4 and 9.2 you can now run it using the Apache webserver that
comes with the 9i database. The note goes through the creation of the
Gateway Database Access Descriptors Settings, and shows you what
details to enter in to the Design Browser login screen when you get it
up and running.
Information on
Oracle Embedded Data Warehouse
"I am trying to find information about Oracle EDW, which is supposed
to be a datawarehouse out of the box for Oracle applications. Only
thing that pops up at Oracle is the warehouse builder FAQ and that is
no help. Please send me link to any documentation."
If you've got access to MOSC, there's a section on EDW in the Top
Tech Docs area
http://MOSC.oracle.com/MOSC/plsql/showdoc?db=NOT&id=149404.1&blackframe=0
Basically EDW (Embedded Data Warehouse) is a series of patches applied
to Apps 11i, that provides as you say an 'out of the box' data
warehouse that is maintained and populated by a customised version of
OWB. As of EDW 4.1, the version of OWB that's used is 9.0.3, and it
comes with prebuilt collectors or mappings to get data out of Apps and
in to the EDW warehouse.
The nice thing about the version of OWB that is provided with EDW is
that it's got an 'apps integrator', making it easier to get data out
of apps (a bit like the SAP integrator, so I hear) that allows you to
reference data by its business definition, rather than the underlying
apps table. Apart from licensing EDW, AFAIK there's no other way to
get this integrator.
Tips On Using OWB
and AWM To Create Analytic Workspaces
Now that the
9.2.0.4.1 Oracle Database patchset is available, together with the
Analytic Workspace Manager and OWB9.2, I thought I'd take a look at
the different options available for populating an Analytic Workspace
and making the data available for querying using SQL.
Rather than manually building and populating analytic workspaces using
the OLAP Worksheet and OLAP DML, you can now build and populate them
using extensions to Oracle Warehouse Builder, or using the Analytic
Workspace Manager add-in to Enterprise Manager. They work in broadly
similar ways, using the same underlying AWM_ PL/SQL packages supplied
as part of the 9.2.0.4.1 patchset, but there are a few differences in the
approach the tools take, what functionality they provide, and the way
they handle OLAP Catalog metadata, which I've
written up in this technical article.
Migrating OWB
From Windows 2000 To RedHat Linux
"We are looking at migrating our OWB on win2k to linux redhat. Does
anyone have any experience with this? What vrs of redhat? And what if
any advice you would like to offer?"
Back in the days of OWB 9.0.3 I looked at moving my laptop to Redhat
8.0 / SuSE 8, and 9iDS / OWB were my primary Oracle tools.
It's worth double checking what the situation with 9.0.4 and 9.2 is
(as it may have changed) but a couple of issues I came across were;
-
OWB 9.0.3 on
Linux didn't come with the Discoverer Metadata Bridge add-in, and
therefore we couldn't export the star schema to a Disco Business
Area. This may still be the case with 9.0.4 and 9.2, and it may also
be the case with the other metadata bridge feature - the one to 9i
OLAP.
-
In practice we
found that a developer, as well as needing OWB on their laptop, also
needed Discoverer Admin / Desktop, and Reports Builder, and these
aren't included in the Linux 9iDS bundle. Because of this, it's
difficult to completely 'throw out' the windows installations, as
there's no way of building the Discoverer side of things.
In
terms of the product itself - it's fine, and works more or less
exactly the same way as the Windows versions, as it's all written in Java.
Watch out for the above 'gotchas' though.
Oracle Warehouse
Builder 10g News
Details are emerging about the next releases of Oracle Warehouse
Builder, the first of which is due to be launched at Openworld next
week.
The next release of OWB will be known as OWB 10g and will be a bundled
install with the Oracle 10g database, much like the existing OLAP and
Data Mining options with Oracle 9i. I assume that this will only be
the 'server-side' install of OWB (the runtime and repository schemas)
as it wouldn't make much sense to install the client element on the
server. Apart from the bundled install, the functionality of this
first OWB 10g release will be the same as OWB 9.0.2.3.
The version after that, codenamed 'Paris', is expected to have a raft
of new functionality which Oracle are currently keeping close to their
chests. This 'Paris' release has been talked about for quite a while
now; I'd heard this codename being used for the version of OWB that
was to include 9i OLAP functionality, but this is now present to a
limited extent in existing versions. Maybe this forthcoming release
will add to this functionality, hopefully allowing a 9i OLAP Analytic
Workspace to be populated directly rather than having to load a
relational star-schema as an intermediate step.