Data Warehousing Technology: Focus on Clinical Warehousing
This is the summary of a talk I gave as part of a workshop on Clinical
Data Warehousing at the American Medical Informatics Association (AMIA) Fall
Symposium in Orlando, Florida, November 8th, 1998.
Introduction
- A Data Warehouse is a collection of data that reflects all aspects of an
organization's operations
- A Data Mart deals with a single aspect of the data.
- Data Marts are simpler to build than Warehouses. Pre-constructed Marts can
be purchased for special applications (inventory, marketing).
- One way to build a warehouse is to build individual data marts and then
integrate them later. This is not simple if the marts have been supplied by
different vendors.
The Problem to be Solved
- The volume of the data is large: gigabytes to terabytes.
- A single query accesses a significant portion of the data.
- Response to a query should be reasonably fast: on-line analytical
processing (OLAP).
- The analytical tools should be usable by data analysts who are not
necessarily database experts. Some systems should be usable by non-technical
people (e.g., salespersons or executives).
Problem Constraints
- The data is mostly read-only.
- Updates to the data are made in bulk (batch mode) only: there are no
interactive updates.
This means that certain index structures that are suited for rapid read
access (though mediocre or even terrible for interactive data updates) can be
used. These structures are bitmap indices and join ("star") indices.
Technological Approaches
- Use a Relational Database Engine (ROLAP). Scalable to very large data
volumes, but speed is limited.
- Use a Multidimensional Database (MDDB) Engine (MOLAP). Very fast at modest
volumes (<100 GB), but less scalable by itself. Can be used on its own for
data marts.
- Use Hybrid Technology (HOLAP): multidimensional data structures stored
within a relational engine.
All RDBMS vendors today provide HOLAP: they have acquired smaller MDDB vendors
and integrated their offerings. Here, the MDDB stores precomputed aggregates,
and often much of the data as well; if a query requires going to the raw data,
the MDDB "passes on" the query to the underlying RDBMS, which uses ROLAP to
answer it (somewhat more slowly). For small data volumes, the MDDB itself can
store the raw data.
Microsoft SQL Server's offering is very easy (almost a no-brainer) to set up:
a wizard automates the task of importing data from a "star
schema" in the RDBMS.
Relational Approaches
Both ROLAP and MOLAP use the Star schema, so called because "Dimension"
tables surround a central "Fact" table like the spokes of a star (see
below).
The Star Schema
- Departs from strict normalization: every line item in a
Pharmacy Order becomes a separate row in the Pharmacy Order Fact table, even
though, in a normalized relational database, the physician and the date/time
of the order would be placed in an Order Header table. The fact table is thus
derived by combining fields from the original Header and Line Items tables
for the order; a minimal schema is sketched below.
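To make this concrete, here is a minimal sketch of such a star schema in
generic SQL. All table and column names are hypothetical, not taken from any
actual system.

```sql
-- Dimension tables: one row per patient, drug, or physician.
CREATE TABLE Patient (
    patient_key  INTEGER PRIMARY KEY,  -- surrogate key used by the fact table
    patient_id   VARCHAR(20),          -- original medical record number
    sex          CHAR(1),
    birth_date   DATE
);

CREATE TABLE Drug (
    drug_key     INTEGER PRIMARY KEY,
    drug_name    VARCHAR(80)
);

CREATE TABLE Physician (
    physician_key  INTEGER PRIMARY KEY,
    physician_name VARCHAR(80)
);

-- Fact table: one row per line item. The physician and order date
-- (Header fields in the normalized source system) are repeated on
-- every row rather than factored into an Order Header table.
CREATE TABLE Pharmacy_Order_Fact (
    patient_key   INTEGER REFERENCES Patient (patient_key),
    drug_key      INTEGER REFERENCES Drug (drug_key),
    physician_key INTEGER REFERENCES Physician (physician_key),
    order_date    DATE,
    quantity      INTEGER,        -- measures
    cost          DECIMAL(10, 2)
);
```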
A Snowflake Schema
More "normalized" than a Star schema: one or more Dimension tables
are in turn related to other (non-Fact) tables. This tends to be a bit of a
performance killer, but is sometimes unavoidable. An example is shown below:
this is an expansion of the top-left hand of the previous diagram.
- Here, a drug may belong to multiple therapeutic or pharmacologic
categories, and we would like to compute the total orders by category. So we
need a "bridge" table between drug and category.
- Similarly, we may want orders by diagnosis. The only problem is that
diagnosis is linked to a patient, who may have multiple diagnoses. So, we
have another bridge table, Patient_Diagnosis, between Diagnosis and Patient.
- There is another wrinkle here: Categories may be hierarchical, and the
depth of the hierarchy is variable and unpredictable. To cite an example,
within the category "oral antidiabetic medication", we have the
categories "sulfonylurea", "biguanide" and "thiazolidinedione".
"Oral antidiabetic" itself is a child of "antidiabetic",
which would include insulin. To handle this situation, we have the table
Category_Hierarchy, which is doubly linked to Category
("Category_1" is an alias for Category). These tables are sketched below.
These represent extreme cases. (In his book, data-warehousing guru Ralph
Kimball describes analogous situations under the heading "A
Graduate Course in Dimensional Modeling".)
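Continuing the hypothetical schema sketched earlier, the bridge and hierarchy
tables described above might look like the following. The ancestor/descendant
design for Category_Hierarchy is one common way to handle variable-depth
trees; the original diagram may have differed in detail.

```sql
CREATE TABLE Category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(80)
);

CREATE TABLE Diagnosis (
    diagnosis_key  INTEGER PRIMARY KEY,
    diagnosis_name VARCHAR(80)
);

-- Bridge: a drug may belong to multiple categories, and vice versa.
CREATE TABLE Drug_Category (
    drug_key     INTEGER REFERENCES Drug (drug_key),
    category_key INTEGER REFERENCES Category (category_key),
    PRIMARY KEY (drug_key, category_key)
);

-- Bridge: a patient may have multiple diagnoses.
CREATE TABLE Patient_Diagnosis (
    patient_key   INTEGER REFERENCES Patient (patient_key),
    diagnosis_key INTEGER REFERENCES Diagnosis (diagnosis_key),
    PRIMARY KEY (patient_key, diagnosis_key)
);

-- Doubly linked to Category: each row says that one category is an
-- ancestor of another, at any depth, so the variable-depth hierarchy
-- can be traversed without recursive queries.
CREATE TABLE Category_Hierarchy (
    ancestor_key   INTEGER REFERENCES Category (category_key),
    descendant_key INTEGER REFERENCES Category (category_key),
    PRIMARY KEY (ancestor_key, descendant_key)
);

-- Total orders by category, reached through the bridge table.
SELECT c.category_name, SUM(f.quantity) AS total_quantity
FROM   Pharmacy_Order_Fact f, Drug_Category dc, Category c
WHERE  f.drug_key = dc.drug_key
AND    dc.category_key = c.category_key
GROUP  BY c.category_name;
```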
Join (Star) Indices
Consider the regular B-tree indices used by a database table.
These consist of records arranged in a tree structure. The lowest levels of the
tree (the "leaves") store a value of the key for which
the index has been created (e.g., a patient ID) and the record number in
the table where the value can be found. For fields that are not unique, the same
key value may be found in multiple records.
Now suppose there is a B-tree index on Patient ID in the Fact
table (Pharmacy Orders), and another B-tree index on Patient ID in
the Patients dimension table of the star schema diagram.
To locate the details of a particular row, we would get the Patient ID from that
row, then consult the B-tree of the Patient table to find the record number of
the corresponding patient record. B-trees are reasonably fast, but when scanning
large numbers of rows, consulting the B-tree for each record adds up.
Suppose, however, instead of storing the Patient ID itself in
the Fact table, we stored, instead, the Record Number of the Patient ID in the
Patient table. Then, to look up the details of a particular row, we do not need
to consult the B-tree of the Patient table, but can go to the corresponding
record directly. This is exactly what a "join index" does. When you
build a join index, you are indexing the records in one table based on the
record numbers in another table. This greatly speeds up a join operation between
the two tables.
You can create a join index for every link between a Fact
table and a Dimension table.
Because the Fact table is heavily
encoded, a common operation is to translate the Fact table into
human-understandable form, by pulling descriptive fields from the surrounding
Dimension tables ("Star Join"). Obviously, no one can view billions of
rows, so the operation is typically restricted by multiple criteria based on the
dimension tables (e.g., patients matching a particular demographic profile, or
a particular drug category).
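As an illustration, a star join against the hypothetical schema sketched
earlier might look like this; the joins between tables would be served by the
join indices, and the restrictions are all on dimension tables.

```sql
-- Star join: decode the heavily encoded fact table by pulling
-- descriptive fields from each dimension, restricted by criteria
-- on the dimension tables rather than on the fact table itself.
SELECT p.patient_id, p.sex, d.drug_name, ph.physician_name,
       f.order_date, f.quantity
FROM   Pharmacy_Order_Fact f, Patient p, Drug d, Physician ph
WHERE  f.patient_key   = p.patient_key
AND    f.drug_key      = d.drug_key
AND    f.physician_key = ph.physician_key
AND    p.sex = 'F'                        -- demographic criterion
AND    d.drug_name LIKE 'METFORMIN%';     -- drug criterion
```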
Bitmapped Indices
A bitmap index works as follows: suppose the severity of an adverse reaction
in a fact table is recorded as a number between 1 and 5. (Such a coding
system is, in fact, used by the National Cancer Institute's Common Toxicity
Criteria.) We can then index this data by storing an array of bits (1 or 0)
for all grade 1 adverse reactions. The number of bits is equal to the number
of records in the table: 1 means that record represents a grade 1 reaction, 0
means it does not. We similarly store arrays for every other value of
severity (2, 3, 4 and 5). The set of arrays constitutes the bitmap index.
Bitmap indices are much more economical in terms of storage when a particular
column has a limited set of values (i.e., "low cardinality"). Other obvious
candidates for a bitmap are the sex of a patient (only two possible values),
or whether the drug had to be discontinued (Yes or No). One can rapidly
perform set intersections and unions with bitmaps, to answer questions like:
show me all female patients with grade 2 or greater reactions where the drug
had to be discontinued. Obviously, bitmap indices do not make sense when a
particular field can have hundreds of possible values.
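In Oracle, for example, such indices are declared with CREATE BITMAP INDEX;
the optimizer can then answer multi-criterion queries by ANDing the bitmaps.
Table and column names here are hypothetical.

```sql
-- Bitmap indices on low-cardinality columns of a fact table.
CREATE BITMAP INDEX ae_severity_bmx ON Adverse_Event_Fact (severity);
CREATE BITMAP INDEX ae_sex_bmx      ON Adverse_Event_Fact (sex);
CREATE BITMAP INDEX ae_discont_bmx  ON Adverse_Event_Fact (discontinued);

-- "All female patients with grade 2 or greater reactions where the
-- drug had to be discontinued": answered by intersecting (ANDing)
-- the three bitmaps instead of scanning the table.
SELECT COUNT(*)
FROM   Adverse_Event_Fact
WHERE  sex = 'F'
AND    severity >= 2
AND    discontinued = 'Y';
```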
Multidimensional Databases
- The fact table is conceptually stored as a multidimensional hypercube, with
each descriptive attribute (patient, physician, drug, etc.) forming one axis.
- Each cell in the hypercube contains one or more numeric fields (measures).
- Data compression is used to reduce the number of cells, because not all
combinations of the dimensions exist in the data (e.g., only some
physicians order particular drugs).
- Relies heavily on pre-computation of aggregates for providing real-time
responses to queries (PAP: precomputed analytical processing). A relational
analogue is sketched after this list.
- There are significant space savings over ROLAP because arrays are more
compact than tables (all the array index subscripts are implicit), and cell
lookups become arithmetic operations rather than indexed searches.
- Ever since the RDBMS vendors got into MDDBs, the "pure" MOLAP
market has been consolidated. Hyperion is one of the few independents still
around.
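In a relational engine, the spirit of this pre-computation can be
approximated with summary tables rebuilt at each bulk load. A hedged sketch
in Oracle syntax, reusing the hypothetical pharmacy schema:

```sql
-- Precomputed aggregate: total quantity and cost per drug per month.
-- Queries at this grain hit the small summary table instead of the
-- raw fact table.
CREATE TABLE Drug_Month_Agg AS
SELECT drug_key,
       TRUNC(order_date, 'MM') AS order_month,  -- first day of the month
       SUM(quantity)           AS total_quantity,
       SUM(cost)               AS total_cost
FROM   Pharmacy_Order_Fact
GROUP  BY drug_key, TRUNC(order_date, 'MM');
```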
Language Standards for OLAP and Warehousing
- Microsoft has proposed MDX, an open OLAP query-language standard. MDX,
used by Microsoft SQL Server OLAP Services, is very much like SQL, but also
subtly different in ways that can trip up the unwary user.
- There is a draft proposal for OLAP operations in SQL-99 (percentiles,
regression coefficients, correlation, ranking of values, etc.). The draft ISO
document discussing this is almost completely opaque, and the writers seem
to have a total phobia with respect to providing examples. It is not
recommended reading for anybody with a life, or with a modicum of sanity to
protect. There is also a concern that this proposal simply does not go far
enough: to cite an obvious example, 2-variable regression (Y vs. X) is
handled, but not N-variable multiple regression. Data mining packages from
SAS or SPSS go far, far beyond this proposal already, using their own
home-grown syntax.
- What RDBMS vendors already provide (examples are sketched after this list):
- top N and bottom N queries (e.g., give me the 10 top-ordered drugs)
- a wider variety of statistical operators (standard
deviation, variance). The ISO proposal belatedly introduces these
operators, which have been in Oracle for more than six years.
- Computation of all applicable subtotals when a query asks
for grouping on multiple criteria (the CUBE operation is available in
MS SQL Server).
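Hedged examples of these vendor extensions, again using the hypothetical
pharmacy schema (the first two statements use MS SQL Server syntax, the third
uses Oracle's statistical aggregates):

```sql
-- Top N (MS SQL Server): the 10 most frequently ordered drugs.
SELECT TOP 10 drug_key, COUNT(*) AS n_orders
FROM   Pharmacy_Order_Fact
GROUP  BY drug_key
ORDER  BY n_orders DESC;

-- CUBE (MS SQL Server): all applicable subtotals when grouping on
-- two criteria (per drug, per physician, per drug-and-physician,
-- and the grand total).
SELECT drug_key, physician_key, SUM(quantity) AS total_quantity
FROM   Pharmacy_Order_Fact
GROUP  BY drug_key, physician_key WITH CUBE;

-- Statistical operators (Oracle): standard deviation and variance.
SELECT drug_key, AVG(cost), STDDEV(cost), VARIANCE(cost)
FROM   Pharmacy_Order_Fact
GROUP  BY drug_key;
```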
Special Problems of Clinical Warehouses
- Much clinical data is in Entity-Attribute-Value (EAV) form. An example of an
EAV schema is shown in the figure above. Here, the parameters that are
recorded are not hard-coded as columns in a table. Instead, the parameter
definitions are stored as metadata (parameter_info in the above diagram). A
single data table (Clinical_Events in the figure) stores information about the
Entity (Patient and timestamp info), Attribute (i.e., Parameter) and Value for
each instance of a parameter recorded at a particular time. (There may be two
timestamps for parameters that have duration, e.g., start and end times of an
intravenous infusion.)
- Generally, one must write code to bulk-transform EAV data subsets into star
schemas that are focused on a particular topic (e.g., adverse drug reactions,
lab tests); a sketch follows this list. The tedium of writing such code can
be minimized by appropriately designed metadata.
- Free-text data (e.g., discharge summaries) needs information-retrieval (IR)
technology, which is only semi-integrated with RDBMS technology. The offerings
of MS SQL Server, Oracle and IBM are not as powerful as the feature sets of
packages built by IR vendors (who must live or die by their single products).
- Eventually requires use of large controlled medical vocabularies.
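As a sketch of such a bulk transformation: EAV rows can be pivoted into the
one-row-per-event layout of a star schema with grouped CASE expressions. All
column and parameter names here are hypothetical, and metadata-driven code
would generate a statement like this rather than hard-coding it.

```sql
-- Pivot EAV rows (one row per parameter instance) into conventional
-- columns, one row per patient per timestamp.
SELECT patient_id,
       event_time,
       MAX(CASE WHEN parameter_id = 'SBP'  THEN param_value END) AS systolic_bp,
       MAX(CASE WHEN parameter_id = 'DBP'  THEN param_value END) AS diastolic_bp,
       MAX(CASE WHEN parameter_id = 'TEMP' THEN param_value END) AS temperature
FROM   Clinical_Events
GROUP  BY patient_id, event_time;
```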
Resources
- The Data Warehouse Information Center (http://dwinfocenter.org). This site,
maintained by Larry Greenfield, is the single most useful resource I've
encountered.
- Ralph Kimball’s articles in DBMS Magazine (now Intelligent Enterprise):
http://www.intelligententerprise.com
Books:
Warehousing "Gotchas" (from Larry Greenfield)
- 80% of your time may be spent extracting, cleaning & loading data.
- You may find hidden problems in the systems feeding the data warehouse.
- Warehouse projects often turn up the need for data that is not being
captured by existing systems.
- After end users receive query and reporting tools, requests for IS-written
reports may increase rather than decrease, contrary to your ROI forecast.
- Many query, reporting, and OLAP tools allow users to perform calculations,
and users may perform the same calculation differently.
- Large-scale data warehousing can become an exercise in data homogenizing
that lessens the value of data.
- Overhead can eat up great amounts of disk space.
- You can’t assign security when using a transaction-processing system
mindset.
- Data warehouses are high-maintenance systems.
- You will fail if you concentrate on resource optimization to the neglect
of project, data, and user management issues and an understanding of what
adds value for the user.
Top 10 Complaints of Data-Warehousing Managers (Larry Greenfield)
- Demand soared. There were far more users than expected; our hardware and
software costs doubled in the first year.
- That vendor demonstration of push-button access raised management
expectations so high that the system will be seen as unresponsive no matter
what I do.
- The tools didn’t work together, even though all our vendors claimed
their products were compatible.
- As our needs grew more complex, the data-extraction tools ran out of
steam, and we had to write custom programs.
- Users continually want more data, but the so-called intuitive tools we
bought are too difficult for users to use. The IT group ends up becoming
their data-retrieval assistants.
- We spent half the time in the first year of development trying to locate
and correct inconsistencies between the reports from the data warehouse and
the financial statements.
- It’s really hard to find/train/retain skilled people.
- Once the data warehouse is up and running, the work has just begun. Users
constantly ask for more capabilities. It’s not clear that the people who
were good at building the system can maintain and manage it.
- Now that we have a bunch of data marts, it’s a pain to pull them all
together into a coordinated enterprise data warehouse. We should have
started with an enterprise data-warehousing architecture before we built
data marts.
- We should have paid more attention to performance.
When not to use Data Warehousing (Larry Greenfield)
- DWs, for the most part, store historical data generated by internal
transaction-processing systems. This is a small part of the universe of data
available to manage a business, and sometimes this part has limited value.
- DW can complicate business processes significantly.
- Data warehousing can have a learning curve that may be too long for
impatient firms.
- In certain organizations ad hoc end user query/reporting tools do not
"take".
- Many "strategic applications" of data warehousing have a short
life span and require the developers to put together a technically inelegant
system quickly. Some developers are reluctant to work this way.
- There is a limited number of people available who have worked with the
full data warehousing system project "life cycle".
- Data warehousing systems can require a great deal of
"maintenance" which many organizations cannot or will not support.
- Sometimes the cost of capturing data, cleaning it up, and delivering it in
a format and time frame useful to the end users is too much to bear.