Data Warehousing Technology: Focus on Clinical Warehousing

This is the summary of a talk I gave as part of a workshop on Clinical Data Warehousing, at the American Medical Informatics Association (AMIA) Fall Symposium at Orlando, Florida, Nov 8th, 1998.

Introduction

The Problem to be Solved

Problem Constraints

This means that certain index structures that are suited for rapid read access (though mediocre or even terrible for interactive data updates) can be used. These structures are bitmap indices and join ("star") indices.

Technological Approaches

Relational Approaches

Both ROLAP and MOLAP use the Star schema, so called because "Dimension" table surround a central "Fact" table like the spokes of a star (see below).

The Star Schema

star_schema.jpg (37762 bytes)

A Snowflake Schema

More "normalized" than a Star schema: one or more Dimension tables are in turn related to other (non-Fact) tables. This tends to be a bit of a performance killer, but is sometimes unavoidable. An example is shown below: this is an expansion of the top-left hand of the previous diagram.

 

These represent extreme cases. (In his book, Data warehousing guru Ralph Kimball describes situations analogous to these under the category "A Graduate Course in Dimensional Modeling".)

Join (Star) Indices

Consider the regular B-tree indices used by a database table. These consist of records arranged in a tree structure. The lowest levels of the tree (the "leaves") store  a value of the key for which the index has been created (e.g., a patient ID) and the record number in the table where the value can be found. For fields that are not unique, the same key value may be found in multiple records.

Now suppose there was a B-tree index on a column in the Fact table (Pharmacy Orders) on Patient ID, and another B-tree index on Patient ID in the Patients dimension Table of the star schema diagram. To locate the details of a particular row, we would get the Patient ID from that row, then consult the B-tree of the Patient table to get to the record number of the corresponding Patient Record. B-trees are reasonably fast, but when scanning large numbers of rows, the consulting of the B-tree for each record adds up.

Suppose, however, instead of storing the Patient ID itself in the Fact table, we stored, instead, the Record Number of the Patient ID in the Patient table. Then, to look up the details of a particular row, we do not need to consult the B-tree of the Patient table, but can go to the corresponding record directly. This is exactly what a "join index" does. When you build a join index, you are indexing the records in one table based on the record numbers in another table. This greatly speeds up a join operation between the two tables.

You can create a join index for every link between a Fact table and a Dimension table.

Because the Fact table is heavily encoded, a common operation is to translate the Fact table into human-understandable form, by pulling descriptive fields from the surrounding Dimension tables ("Star Join"). Obviously, no one can view billions of rows, so the operation is typically restricted by multiple criteria based on the dimension table (e.g, patients matching a particular demographic profile, drug category).

Bitmapped Indices

A bitmap index is explained as follows: suppose severity of an adverse reaction in a fact table is recorded as a number between 1 to 5. (Such a coding system is, in fact, used by the National Cancer Institute's Common Toxicity Criteria). We can then index this data as follows: we store an array of bits (1 or 0) for all grade 1 adverse reactions. The number of bits is equal to the number of records in the table. 1 means that record represents a grade 1 reaction, 0 means it does not. We similarly store arrays for every other value of severity (2, 3, 4 and 5). The set of arrays constitutes the bitmap index.

Bitmap indices are much more economical in terms of storage when a particular column has a limited set (e.g., "low cardinality") of values. Another obvious candidate for a bitmap is the sex of a patient (only two possible values), or whether the drug had to be discontinued (Yes or No). One can rapidly perform set intersections and unions with bitmaps, to answer questions like: show me all female patients with grade 2 or greater where the drug had to be discontinued. Obviously, bitmap indices do not make sense when a particular field can have hundreds of possible values.

Multidimensional Databases

datacube.jpg (12131 bytes)

Language Standards for OLAP and Warehousing

Special Problems of Clinical Warehouses

wpe250.jpg (13318 bytes)

Resources

       Books:

Warehousing "Gotchas" (from Larry Greenfield)

Top 10 Complaints Of Data-Warehousing Managers (Larry Greenfield)

When not to use Data Warehousing (Larry Greenfield)