Introduction

Design of a general-purpose medical record in something better than a simple narrative form, using a structure that is easily modifiable and expandable, is one of the tougher problems in database design. Most medical institutions have much more electronic information on a patient’s financial and insurance history than on the patient’s medical record. Financial information, like orthodox accounting information, is far easier to computerize and maintain, and a lot of patient data—X-Rays, Electrocardiograms, CT scans—requires large amounts of storage space. Further, off-the-shelf database engines have developed the ability to handle image data less than a decade ago, and the mainframe-style engines that run most medical database systems are slow to catch up with technology.

The Case Workup

During a routine check-up, a health provider goes through a standard check list in terms of history, physical examination and laboratory investigations. (Such a check-list is quite easy to record electronically.) When a patient has a symptom suggesting major illness, however, a whole series of questions are asked, and investigations performed, which would not be asked/performed if the patient did not have this symptom.

Superficially, when considering all the specialties in medicine, the number of questions that can be asked to a sick patient might seem almost infinite. In his non-fiction work "Five Patients", Michael Crichton gives an example of a fictitious question, "Have you ever danced in Tibet?" which may be marginally relevant because a certain very rare (fictitious) disease is seen only in individuals with this history.

In such a situation, there would never be any end to the questioning process. Clinicians have only a finite amount of time and stamina (and patients have only so much forbearance). Therefore, the strategy in history taking and physical examination is to maximize the chance of detecting something abnormal in the limited time available (in lay terms, to cover a reasonable number of bases).

Aside: Some early rule-based expert systems for medical diagnosis infuriated testers (real patients, or doctors simulating patients) with an endless series of mostly marginal or irrelevant questions. A good clinician typically begins by asking the patient "What’s your problem", and proceeding from there, based on the responses obtained. The programs, however did not accept volunteered information, and would go through about half their questions database before they elicited a relevant finding.

The Proforma or Protocol

Most of the time, the workup of a patient is proforma-based. The questions and investigations necessary in an individual case have been worked out by experience over years or decades, and the additional workup is very specific to the symptom or suspected illness.

• For example, history of foreign travel during the last year (or earlier) may be important if the clinician is suspecting certain parasitic infestations which are known to be extremely rare in the USA but quite common in tropical countries.

• A history of alcoholism is important in liver disease, and, of course, sexual inclination and history of drug abuse may be important if AIDS is suspected. (Asking such questions for a routine physical examination would be inappropriate and very rude.)

• When eliciting and recording a case history from a patient, it is important to record not only what is abnormal or suggestive (the positive findings), but also what are called the significant negative findings. For example, absence of a history or findings suggestive of chronic alcoholism in a patient with liver disease would change the focus of further workup. (In this case, it would suggest looking for exposure to chemical or infectious agents known to damage the liver.).

• For many illnesses, hospitals use printed protocols to minimize the chance of an inexperienced health provider missing something (a situation common with new interns), and to provide a record in a standardized form that may be suitable for eventual storage into a computerized database.

• As knowledge is systematized, the number of protocol expands. Less than a decade ago, protocol for AIDS did not exist, but now AIDS is complex enough to have a whole bunch of sub-protocols hanging off it.

A protocol is not the solution to all problems. The list of significant positive and negative findings for an illness evolves with time and depends on unanticipated circumstances. If clinicians stuck to a protocol and never asked for, or recorded, information beyond it, new illnesses would never be diagnosed.

To cite an extreme case, in the seventies, the illness described as "Toxic Shock Syndrome" was found to be present predominantly in young white women. After several false leads, it was found that all these women happened to be menstruating while they came down with the illness, and were eventually found to be using a particular brand of hyper-absorbent tampon whose prolonged use was found to cause increased vaginal growth of a particular bacterium. (This bacterium manufactured a toxin that was responsible for the symptoms of the illness.)

The situation is made still more complicated because the presence of one disease does not preclude the presence of another. There may be multiple, simultaneous illnesses in the same patient, so that multiple protocol may be used on the same patient at the same time.

Vocabulary Issues

Recording patient information requires a combination of narrative free text (e.g., discharge or operative summaries) as well as structured information (i.e., the rows and columns of conventional tables). Free text is very often necessary to capture nuances of meaning that conventionally structured information could not. An exclusively free-text approach, however, makes complex search and statistical analysis difficult or impossible.

Medicine, like most other domains, has acquired substantial baggage in the form of synonyms. For example, many terms have both English or their Graeco-Latin equivalents (for example, Liver failure can also be called "Hepatic failure", vomiting can also be called emesis, baldness can be called alopecia). And even a word like "failure" can sometimes be called "insufficiency" to indicate partial, but not complete, functioning of the organ. Synonyms make retrieval more difficult because they require the use of special controlled vocabulary tables that store synonym information. These tables must be consulted to improve the sensitivity of a computer-based search.

The National Library of Medicine’s Unified Medical Language System (ULMS) project is aimed at creating a controlled vocabulary that is a superset of the numerous vocabularies used in medicine (such as ICD-9 for disease categories, SNOMED for pathology-based disease classifications, DSM-IV for psychiatric diagnoses, etc.). The UMLS database is distributed on CD-ROM in the form of several relational tables plus code or applications that will operate on them. By the NLM's own admission, the vocabulary is incomplete in certain rapidly evolving domains (such as molecular genetics), and they are eagerly soliciting enhancements and additions to terms already present.

While the concept expressed through a controlled vocabulary are supposed to be atomic, the concepts themselves are inter-related to each other. For example, the concept of "Ischemic Heart Disease" is a subset of the concept "Heart Disease". To complicate things further, in many cases, you don’t have a hierarchy of concepts but a network, because you can reach the same concept via more than one path. For example, the concept "alcoholic liver damage" can be reached from the root concept "alcohol" or the root concept "liver".

Missing, Unknown or Irrelevant Data

Satisfactory recording of significant positives and negatives often requires multi-valued Boolean logic to deal with the problem of missing data. That is, in addition to True or False, you must have a third value, Unknown, with possibly a fourth, Not Relevant, and maybe a fifth "Not Elicitable".

Therefore, you have to spend some time thinking about the nature of missing values for every field and designing codes accordingly. (The advantage of a code is conciseness.). These codes are typically alphanumeric. The use of alphanumeric fields to do the job of Booleans (in order to handle missing data) is so common that such fields are referred to as multi-valued Booleans.

Entity-Attribute-Value tables

Clinical data has features that make it necessary to depart radically from conventional design paradigms. Specifically, a robust clinical patient record system needs to store at least several thousand possible types of facts for patients across all specialties. However, for any given patient, the number of kinds of actual findings may not exceed a few dozen. A conventional database table design (one column per fact) is unsuitable for such data, because of database vendor limitations on the number of columns per table and the need to continually add new tables/columns whenever new facts need incorporation. Further, a single patient's data is highly sparse, with most columns being null or not applicable.

Most electronic patient record systems (EPRS) deal with this problem through entity-attribute-value (EAV) design. In this methodology, the fact descriptors (attributes) are treated as data, so that addition of new types of facts does not require database restructuring. An EAV table of patient information records an entity (a combination of the patient ID, clinical event, and one or more date/time stamps recording when the events recorded actually occurred ), the attribute, and the associated value of that attribute. An EAV design represents a column-to-row transformation, because each row of such a table stores one fact about a patient at a particular instant in time. For example, a patient's laboratory value may be stored as: (<PatientY, 12/2/96>, serum_potassium, 4.1). Only positive or significant negative findings are recorded. Findings that are not relevant for a given patient are not stored. This means that, in spite of the extra space taken up by the attribute column, and the need to repeat the entity for every row, the space is taken up is actually less than with a "conventional" design.

The attribute-value representation of data has a long history. It had its roots in artificial intelligence research, and is used today in areas where the data to be represented is extremely heterogeneous. Examples of deployment include Web "cookies" (which are used to maintain Web client state), and the Microsoft Windows 95 and NT registries.

Physical vs. Logical Schemas

EAV is primarily a means of simplifying the physical schema of a database, to be used in circumstances where simplification is beneficial. However, irrespective of the database’s physical storage, its users naturally regard the data as conventionally structured–that is, segregated into tables and columns. Further, external programs used for graphical presentation or data analysis always expect to receive data as one column per attribute. The logical schema of a database reflects the users' perception of the data.

Because it implicitly captures a significant part of the semantics of the domain being modeled, the logical schema is domain-specific. In an EPRS, for example, one aspect of the logical schema is the grouping of individual parameters into a form, such as a battery of lab tests that are performed for a given disease protocol. In an EAV database, the logical schema differs greatly from the physical schema. (In a conventional database, on the other hand, the two do not differ appreciably.) The user interface of a good EAV system confirms to the logical schema as much as possible. That is, it creates the illusion of conventional data organization.

An EAV system must record the logical schema through metadata – "dictionary" tables whose contents describe the rest of the system. Well-designed metadata are critical to the proper functioning of an EAV system. If sufficiently rich, metadata can also be used actively (i.e., during actual system operation), instead of only describing the system passively.

Complications of EAV Storage

The simplicity and compactness of EAV representation is offset by a potential performance penalty compared to the equivalent conventional design and the need to build special-purpose tools for ad hoc query. For example, the simple AND, OR and NOT operations on conventional data must be translated into the less efficient set operations of Intersection, Union and Difference respectively.

Some extra programming work is also involved in creating data entry and data browsing forms for EAV data. These forms must reflect the logical schema, and perform a sleight-of-hand that makes the user feel as though s/he is working with conventionally structured data.

In any production EPRS, patient data is stored in both EAV and conventional forms (e.g., demographics are typically stored conventionally). Some tables are also like EAV tables but they are homogeneous. For example, in a table for drug administration, the names of the drugs that are administered are treated as data (and stored in a drug column). This complicates the process of query even further.

The Use of Strong Typing

In general, when one uses only a single EAV table to represent patient facts, one must coerce all data into least-common-denominator form – that is, as short text. The problems with this approach are twofold. Certain data (e.g., discharge summaries, EKGs, photographs) cannot be stored this way. Second, by coercing numbers and dates into text form, the values cannot be usefully indexed. (When numbers are stored as text, the text 12.5 is "greater" than 11000, because it comes later in alphabetical order.) This prevents us from being able to search rapidly for values of an attribute within or beyond a certain range.

Therefore, some databases segregate data by datatype. That is, there are separate EAV tables for short text, long text, numbers, dates, images and so on. Of course, the "Attribute Descriptions" dictionary table must also record the data type of each attribute so that the program can access the correct table when information on a particular attribute is asked for.

Concept Hierarchies

Medical concepts are related to each other in a hierarchy or network. If you want to query this database for all patients with "ischemic heart disease", you may find that the actual data is recorded in terms of the subclasses of this concept: angina pectoris, myocardial infarction (the "heart attack") and so on. Therefore, to be usable, the system would have to incorporate on-line access to a medical concepts vocabulary such as the UMLS. The user interface should let the user navigate the concept tree, or let the user specify that the identifiers for some or all child concepts be gathered, so that the query can be reformulated in terms of these identifiers.

Temporal (Time-based) Considerations

When querying the data looking for patients matching a particular profile, the user must also be allowed to specify the temporal relationship between query parameters. For example, suppose one is looking for women under 50 with a history of deep vein thrombosis and pulmonary embolism. To select only those patients where there was a probable cause-and-effect, the user must be allowed to specify that a patient can only be selected if the diagnosis of pulmonary embolism follows the diagnosis of deep vein thrombosis within a particular time window (e.g., three months or less). Here, one must make use of the timestamp field that is part of the entity in the EAV table.

Things get a little more complex when an entity has two time-stamps rather than one. For example, certain attributes, such as parameters related to radiotherapy, have a start date/time and an end date/time. In that case, the temporal operation must consider the duration of the event as well.

Conclusions

The requirements of a useful system for query of clinical data, as described here, represent a wish list. No system with such capabilities has been built, though its creation is not beyond the reach of today's programming technology.