Early Disk-Based Data Storage
Despite these problems, hashing is
still used within commercial data warehouses, and it remains one of
the fastest ways to store and retrieve disk information. Most Unix
systems can take a symbolic key and convert it into a disk address
in as little as 50 milliseconds. While hashing is a very old
technique, it is still a very powerful method. Many C++ programmers
use hashing to store and retrieve records within their
object-oriented applications.
Figure 1.3 Hashing for disk data storage.
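To make the key-to-address idea concrete, here is a minimal sketch of
hashing a symbolic key to a disk location. It is an illustration only,
not the method used by any particular Unix system or database: the
function names, the choice of CRC-32 as the hash, and the 8 KB block
size are all assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 8192  # assumed block size in bytes (hypothetical)


def key_to_block(key: str, num_blocks: int) -> int:
    """Map a symbolic key to a block number in the range [0, num_blocks)."""
    # CRC-32 stands in for whatever hash function a real system would use.
    return zlib.crc32(key.encode("utf-8")) % num_blocks


def block_to_offset(block: int, block_size: int = BLOCK_SIZE) -> int:
    """Convert a block number into a byte offset within the data file."""
    return block * block_size


if __name__ == "__main__":
    block = key_to_block("CUST-1001", num_blocks=1000)
    print("block:", block, "byte offset:", block_to_offset(block))
```

The same key always yields the same block, so a record can be located
with a single computation and no index lookup. Different keys can map
to the same block, and a real implementation would need a strategy for
handling those collisions.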
While the hashing technique is still
very popular for fast storage and retrieval of individual records,
it is not suitable for the type of full scans that we see in a data
warehouse. As we would expect from a random key generator, records
are not stored contiguously on a disk. Rather, they are randomly
distributed across the disk device. While an index can help speed
retrieval of hashed records, we still do not see the high I/O
throughput that we see when records are stored contiguously on data
blocks. With contiguous record storage (such as a relational
database), a single 8 KB file I/O can read hundreds of records
into a buffer. We do not get this luxury with hashed file
storage techniques.
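The difference in full-scan cost can be sketched with simple
arithmetic. The figures below are assumptions chosen for illustration
(a 40-byte record, an 8 KB block, one million records), and the hashed
case is a worst-case estimate in which every record sits in a different
block and nothing is cached.

```python
import math


def records_per_block(block_size: int, record_size: int) -> int:
    """Number of whole records that fit in one block."""
    return block_size // record_size


def contiguous_scan_ios(num_records: int, block_size: int, record_size: int) -> int:
    """I/Os needed to scan records packed contiguously into blocks."""
    return math.ceil(num_records / records_per_block(block_size, record_size))


def hashed_scan_ios_worst_case(num_records: int) -> int:
    """Worst case for hashed storage: one I/O per record, no caching."""
    return num_records


if __name__ == "__main__":
    n, block, rec = 1_000_000, 8192, 40  # hypothetical workload
    print("records per block:", records_per_block(block, rec))
    print("contiguous scan I/Os:", contiguous_scan_ios(n, block, rec))
    print("hashed scan I/Os (worst case):", hashed_scan_ios_worst_case(n))
```

Under these assumptions each block holds about two hundred records, so
the contiguous scan needs a few thousand I/Os while the hashed layout
could need up to one per record. Real systems fall somewhere in between,
depending on bucket sizes and buffer caching.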
It is interesting that in the early
1990s, more data was stored on magnetic tapes than on all
other storage media combined. In fact, even now, companies with
terabyte-sized data warehouses continue to use magnetic tapes for
systems that contain large amounts of unchanging, infrequently used
data. Magnetic tapes, which remain more than 10,000 times cheaper
than disk storage, are still the most economical way to store large
volumes of data.
Overall, data warehouse applications
that access data stored in ISAM and VSAM data structures remain
popular. Commercial engines such as the Informix-SE database are
basically ISAM files that are accessed by the data warehouse.
However, the lack of robust commercial databases made sophisticated
data analysis very cumbersome. The problems inherent in early
disk-based systems were serious, and an effort was undertaken to
rethink the entire concept of data storage. These problems included
the following: