Metadata is broadly defined as “data that describes other data”.
This definition, by itself, conveys little information. Individual authors often define metadata, in both singular and plural form, in terms of specific themes, e.g., data warehousing or repositories of organizational information. To quote author David Marco, however, “when we mention metadata, we are really talking about knowledge”.
There are as many classifications of metadata as there are authorities, but most distinguish two broad categories. Process-related or technical metadata supports software development and operation. Descriptive metadata supports users concerned with the software’s application domain(s) (e.g., medicine, business).
Examples of technical metadata are table and column definitions in a database schema, and mappings of data elements and their values between different physical databases. Examples of descriptive metadata are end-user system documentation and semantic descriptions of a table’s columns. Controlled vocabularies, which help to describe domain-specific data in standard ways, are an important category of descriptive metadata.
A “data dictionary” comprises both descriptive elements (e.g., the meaning and purpose of a field in a table) and technical elements (e.g., data type and “cardinality”, i.e., how many distinct values are permissible for that field).
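As an illustration, a single data-dictionary entry might combine both kinds of elements. The field names and structure below are hypothetical, a minimal sketch rather than any particular product’s format:

```python
# One hypothetical data dictionary entry, combining descriptive
# elements (meaning) with technical elements (data type, cardinality).
data_dictionary = {
    "marital_status": {
        # Descriptive element: what the field means to a domain user
        "meaning": "Patient's marital status at time of registration",
        # Technical elements: how software must treat the field
        "data_type": "string",
        "cardinality": 5,  # number of distinct permissible values
        "permissible_values": [
            "single", "married", "divorced", "widowed", "unknown",
        ],
    }
}

entry = data_dictionary["marital_status"]
# The technical elements are internally consistent:
assert entry["cardinality"] == len(entry["permissible_values"])
```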
One can also think of passive metadata that is consulted only by humans, versus active metadata, consulted by generic program code for various tasks. The distinction is important: active metadata must be correct and complete for programs to function correctly, and the effects of bad active metadata are detected with relative rapidity, while bad passive metadata has more insidious effects, resulting in wasted human effort (if incomplete) or human error (if wrong). Active metadata is always technical, but the reverse is not always true.
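The distinction can be sketched in code. Below, a hypothetical validation routine consults metadata at run time, making that metadata active: if the metadata is wrong, the program misbehaves at once. All names and the metadata structure are illustrative, not drawn from any real system:

```python
# Active metadata: generic code consults it at run time. An error
# here (e.g., a wrong "max") would be exposed as soon as validation
# starts rejecting or accepting the wrong values.
column_metadata = {
    "age": {"data_type": int, "min": 0, "max": 130},
}

def validate(column, value):
    """Generic check driven entirely by the metadata entry."""
    meta = column_metadata[column]
    return (isinstance(value, meta["data_type"])
            and meta["min"] <= value <= meta["max"])

validate("age", 45)   # True
validate("age", -3)   # False
```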
The boundary between data and metadata can be blurred; one application’s metadata is often another application’s data. Metadata can describe other metadata (an example of this is semantic relationships between vocabulary concepts). Instead of terms like “meta-metadata”, the Object Management Group (OMG, www.omg.org) recommends the use of M0, M1, M2, etc., where M0 = data, M1 = first-level metadata, and so on, with each succeeding number indicating a higher level of abstraction.
Production clinical data repositories store much of their clinical data using a generic, or entity-attribute-value (EAV), data model. (To read more about EAV, see the section on Clinical Patient Records.) In EAV, one row stores a single fact. A conventional, column-modeled table with one column per attribute, by contrast, stores a set of facts per row. EAV design, which is a form of generalized row modeling, is appropriate when the number of clinical parameters that potentially apply to a patient (e.g., the thousands of parameters across all of clinical medicine) is vastly greater than the number that actually apply to a given patient.
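The two storage styles can be contrasted as follows; the attribute names and values are invented for illustration:

```python
# The same facts about one patient, stored two ways.

# Conventional (column-modeled): one row holds a SET of facts, with a
# column for every potentially applicable attribute. For any given
# patient, the vast majority of columns would be empty.
conventional_row = {
    "patient_id": 101,
    "serum_sodium": 140,
    "serum_potassium": 4.1,
    "hemoglobin": None,          # not measured for this patient
    "chest_xray_finding": None,  # ... and thousands more like this
}

# EAV (generalized row modeling): one row holds a SINGLE fact as an
# (entity, attribute, value) triplet. Absent facts simply have no row,
# so sparseness costs no storage.
eav_rows = [
    (101, "serum_sodium", 140),
    (101, "serum_potassium", 4.1),
]
```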
In an EAV system the conceptual database schema differs radically from the physical schema. End-users naturally tend to think of their data as organized conventionally, one attribute per column, and expect their data to be presented this way in forms. Analytical programs (such as statistical or graphing packages) similarly require conventionally structured data. A usable EAV system therefore creates the illusion of conventional structure through a set of metadata tables, whose contents describe the conceptual schema, and generic metadata-driven code that transforms rows in the physical schema to columns in the equivalent conceptual schema, and vice versa. Such code can, in fact, generate user interfaces automatically.
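A minimal sketch of such metadata-driven transformation follows; the attribute names, the shape of the metadata, and the function are all hypothetical:

```python
# Metadata describing the conceptual schema: which attributes belong
# to the conventional presentation of this form.
attribute_metadata = ["serum_sodium", "serum_potassium", "hemoglobin"]

# Physical storage: EAV (entity, attribute, value) rows.
eav_rows = [
    (101, "serum_sodium", 140),
    (101, "serum_potassium", 4.1),
]

def pivot(entity_id, rows, attributes):
    """Present EAV rows for one entity as a conventional
    one-attribute-per-column record. Generic: nothing here is
    specific to any clinical domain -- the metadata drives it."""
    values = {attr: val for ent, attr, val in rows if ent == entity_id}
    # Every attribute in the conceptual schema gets a column,
    # defaulting to None when no fact was recorded (sparseness).
    return {attr: values.get(attr) for attr in attributes}

record = pivot(101, eav_rows, attribute_metadata)
# record == {"serum_sodium": 140, "serum_potassium": 4.1,
#            "hemoglobin": None}
```

The reverse transformation, conventional record back to EAV rows, is equally mechanical, which is what lets a single body of generic code serve every form in the system.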
Production EAV databases typically store some data, such as demographic variables (name, date of birth, sex, etc.), in conventional tables; an EAV design only makes sense for attributes that are sparse, i.e., that apply only to some patients, whereas standard demographic variables apply to all patients. Technical metadata must therefore record which attributes are stored conventionally and which in EAV form. In fact, much of an EAV system’s metadata is active.
Large-scale initiatives for descriptive metadata (such as controlled vocabularies) have pre-dated, and have generally been more extensive than, initiatives for technical metadata. The latter have received a major boost in the last few years with the advent of Extensible Markup Language (XML). While XML is a rather blunt tool for defining metadata (see below), it is ideal as a vehicle for the interchange of both data and metadata.
The Metadata Coalition (MDC), originally a consortium of Microsoft-led data warehousing tool vendors, and now a part of OMG, was formed to create standards for describing and exchanging technical metadata models. The modeling standard now adopted is Unified Modeling Language (UML), an M2-level definition originally designed by Rational Corporation. The standard for metadata interchange is XMI (XML Metadata Interchange), an XML-based representation of a UML model that allows model exchange between UML tools.
Both UML and XMI specifications are freely downloadable from OMG’s Web site. While UML is well established and supported by numerous commercial and even shareware tools, XMI (originally developed by IBM) is less mature: OMG documents still define it with a Document Type Definition (DTD) rather than with an XML Schema. The latter, more modern approach, approved in May 2001 by the World Wide Web Consortium (www.w3.org), is more robust than DTDs because, among other things, one can apply strong data typing to individual atomic elements, whereas in DTDs all such elements are strings.
UML provides a standard means for defining M1-level technical metadata, which are, by definition, subject/domain specific. Some M1-level standards based on XML, such as Chemical Markup Language (CML) for chemical structures and Mathematical Markup Language (MathML) for mathematical equations, appear to be successful, but many others have not moved beyond the proposal stage. To succeed, such standards must have a circumscribed purpose, provide broad coverage within that purpose, and be supported by numerous groups or organizations.
Resource Description Framework (RDF) is a World Wide Web Consortium (W3C) standard specification for “a lightweight ontology system to support the exchange of knowledge on the Web”. RDF’s unit of information, the Property, is identical to an EAV triplet: the “entity” is a Resource (anything that can have a URL), and the “attribute” is a PropertyType. (The “value” is the value of the PropertyType.) RDF-based initiatives include the Dublin Core Metadata proposal, a standard set of XML tags that Web page authors add to their works to provide explicit information to Web indexing engines about authorship, subject, source language, copyright, and so on. (The engines must currently infer these indirectly.)
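The RDF/EAV correspondence can be sketched directly. The resource URL below is fictitious, and the property names merely echo Dublin Core style; no RDF library is used, only plain triplets:

```python
# Each RDF Property is an (entity, attribute, value) triplet:
# the entity is a Resource (anything with a URL), the attribute
# is a PropertyType, and the value is the PropertyType's value.
triples = [
    # (Resource,                     PropertyType,  value)
    ("http://example.org/page.html", "dc:creator",  "A. Author"),
    ("http://example.org/page.html", "dc:language", "en"),
]

def describe(resource, triples):
    """Collect everything asserted about one resource -- generic
    code needing no schema specific to the domain described."""
    return {prop: val for res, prop, val in triples if res == resource}
```

An indexing engine given such explicit triplets can read authorship or source language directly, instead of inferring them from page content.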
The National Library of Medicine’s Unified Medical Language System (UMLS) continues to be the world’s largest controlled vocabulary effort. Meanwhile, SNOMED CT, the amalgamation of SNOMED with the UK National Health Service’s Clinical Terms version, will be released imminently.
The International Standards Organization (ISO) has proposed a descriptive metadata standard, ISO 11179, which is still in draft form. A coalition of several US government agencies is working towards exchanging metadata based on ISO 11179. A team at the Environmental Protection Agency has developed MetaPro, a metadata registry based on this standard, as an Oracle database. The ISO 11179 draft, however, is “necessary but not sufficient”: many difficult issues in metadata representation are not addressed.
References:
Building the Metadata Repository: David Marco (Wiley)
Unified Modeling Language: http://www.omg.org/technology/uml/index.htm