A consensus framework map of a chromosome is the single most useful map of the chromosome, because of the amount of information it holds as well as the quality of the supporting data backing the putative order of its objects. We describe data structures and algorithms to assist in framework map maintenance and to answer queries about order and distance on genomic objects. We show how these algorithms are efficiently implemented in a client-server relational database. We believe that our data structures are particularly suitable for databases to support collaborative mapping efforts that use heterogeneous methodologies. We summarize two applications that use these algorithms: CHROMINFO, a database specifically designed for framework map maintenance; and the shared client-server database for the chromosome 12 genome center.
SQLGEN is a framework for rapid client-server relational database application development. It relies on an active data dictionary on the client machine that stores metadata on one or more database servers to which the client may be connected. The dictionary generates dynamic Structured Query Language (SQL) to perform common database operations; it also stores information about the access rights of the user at log-in time, which is used to partially self-configure the behavior of the client to disable inappropriate user actions. SQLGEN uses a microcomputer database as the client to store metadata in relational form, to transiently capture server data in tables, and to allow rapid application prototyping followed by porting to client-server mode with modest effort. SQLGEN is currently used in several production biomedical databases.
In STS-content mapping of a region, multiple optimal or near-optimal putative orders of markers exist. Determining which of the markers in this region can be placed reliably on the physical map of the chromosome, and which markers lack sufficient evidence to be placed, requires software that facilitates exploratory sensitivity analysis and interactive reassembly with different subsets of the input data, and that also assists the evaluation of any arbitrary (user-specified) marker order. We describe CONTIG EXPLORER, a package for interactive assembly of STS-content maps that provides the user with various ways of performing such analyses, thereby facilitating the design of laboratory experiments aimed at reducing ambiguity in STS order. We then compare the output of CONTIG EXPLORER with two other assembly programs, SEGMAP and CONTIGMAKER, for a region of chromosome 12p between 21 and 38 cM on the sex-averaged CEPH/Genethon linkage map.
DNA Workbench (DW) is a client-server database for managing physical mapping data that will form the basis for sequencing and other efforts in biologically interesting regions of a chromosome. DW draws maps at different levels of resolution in either of two modes: proportional, when the sizes of objects and the physical distances between them are known accurately or approximately, and nonproportional, when most physical distance information in a region is not available but order information is. DW interacts with the user primarily through the map graphic. Selection of individual objects on the graphic lets the user inspect and modify the underlying data. DW also manages dependency tracking between map objects and has a rudimentary form of version control. It is currently used to manage information on the DRD2 region of chromosome 11 and on the HOX region of chromosome 17.
MOTIVATION: When inspecting two maps of a chromosome or chromosomal region derived by similar or different methodologies, a computer-generated description of their differences is valuable, both to guide laboratory research and to assist version control. Our program, Mapdiff, can be used to compare two independently derived maps, or two revisions of the same map. Its usefulness increases in proportion to the percentage of objects shared between the two maps. Mapdiff uses a greedy algorithm to determine differences between shared objects. RESULTS: We illustrate Mapdiff's use in comparing the publicly available STS-content and radiation hybrid maps of human chromosome 12. AVAILABILITY: Freely available (source, executables for Windows/DOS and Sun Solaris, and documentation) on request from the author.
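The abstract does not detail Mapdiff's greedy algorithm, but the core idea of comparing two maps restricted to their shared objects can be sketched as follows. The marker names and the pairwise-inversion report below are invented for illustration and are not Mapdiff's actual output:

```python
def order_differences(map_a, map_b):
    """Restrict two marker orders to their shared objects and report
    pairs whose relative order is inverted between the maps.
    Illustrative only: Mapdiff's greedy algorithm is not reproduced here."""
    shared = [m for m in map_a if m in set(map_b)]
    pos_b = {m: i for i, m in enumerate(map_b)}
    inversions = []
    for i in range(len(shared)):
        for j in range(i + 1, len(shared)):
            # Same pair, opposite relative order in the second map.
            if pos_b[shared[i]] > pos_b[shared[j]]:
                inversions.append((shared[i], shared[j]))
    return inversions

# Two hypothetical maps sharing four markers; "D" is unique to the first.
print(order_differences(["A", "B", "D", "C", "E"], ["A", "C", "B", "E"]))
# → [('B', 'C')]
```

Markers unique to one map (such as "D") are ignored, which is why the program's usefulness grows with the percentage of shared objects.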
In this paper we describe PhenoDB, an Internet-accessible client/server database application for population and linkage genetics. PhenoDB stores genetic marker data on pedigrees and populations. A database for population and linkage genetics requires two core functions: data management tasks, such as interactive validation during data entry and editing, and data analysis tasks, such as generating summary population statistics and performing linkage analyses. In PhenoDB we attempt to make these tasks as easy as possible. The client/server architecture allows efficient management and manipulation of large datasets via an easy-to-use graphical interface. PhenoDB data (73 populations, 34 pedigrees, approximately 4200 individuals, and close to 80,000 typings) are stored in a generic format that can be readily exported to (or imported from) the file formats required by various existing analysis programs such as LIPED and Lathrop and Lalouel's Multipoint Linkage. PhenoDB allows performance of complex ad-hoc queries and can generate reports for use in project management. Finally, PhenoDB can produce statistical summaries such as allele frequencies, phenotype frequencies, and Chi-square tests of Hardy-Weinberg ratios of population/pedigree data.
Concept Locator (CL) is a client-server application that accesses a Sybase relational database server containing a subset of the UMLS Metathesaurus, for the purpose of retrieving concepts that correspond to one or more query expressions. CL's query grammar permits complex Boolean expressions, wildcard patterns, and parenthesized (nested) subexpressions. CL translates the query expressions supplied to it into one or more SQL statements that actually perform the retrieval. The generated SQL is optimized by the client to take advantage of the strengths of the server's query optimizer and to sidestep its weaknesses, so that execution is reasonably efficient.
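As an illustration of the kind of translation CL performs, here is a minimal sketch covering only conjunction. The `concept_words` table, its contents, and the GROUP BY/HAVING formulation are assumptions for this example; CL's actual schema, grammar (OR, NOT, wildcards, nesting), and optimizations are far richer:

```python
import sqlite3

# Hypothetical word index: one row per (concept ID, constituent word).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concept_words (cui TEXT, word TEXT)")
conn.executemany("INSERT INTO concept_words VALUES (?, ?)", [
    ("C01", "lung"), ("C01", "cancer"),
    ("C02", "lung"), ("C03", "cancer"),
])

def and_query(words):
    """Translate 'w1 AND w2 AND ...' into a single SQL statement that
    returns the concepts matching every listed word."""
    placeholders = ", ".join("?" for _ in words)
    sql = (f"SELECT cui FROM concept_words WHERE word IN ({placeholders}) "
           f"GROUP BY cui HAVING COUNT(DISTINCT word) = ?")
    return conn.execute(sql, [*words, len(words)]).fetchall()

print(and_query(["lung", "cancer"]))  # → [('C01',)]
```

Only C01 contains both words, so it alone satisfies the conjunction.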
Entity-attribute-value (EAV) data organization is increasingly used for knowledge representation in complex, heterogeneous biomedical databases. When deployed in relational form for production applications, the simplicity of EAV storage is offset by the difficulty of set-based data retrieval. We describe a client-server application, QAV, designed to perform set-based query on the Columbia MED dataset, a large medical metadata repository that has been the focus of much research. QAV interacts with the user through a graphical front end and generates a series of SQL statements that are sent to the server for the actual data retrieval.
MOTIVATION: When two or more genomic maps of a chromosomal region are available, it is useful to be able to synthesize them to create a merged map. RESULTS: We show that map merging is an exploratory process because there are multiple ways to combine data based upon what the user wishes to focus on, and upon which particular data subset emphasis is desired. We describe Mapmerge, a program for merging two genomic maps, discuss its limitations, and illustrate an example of its use. AVAILABILITY: Freely available (ANSI C source code, a Make file, test data files, documentation) on request from the author.
ACT/DB is a client-server database application for storing clinical trials and outcomes data, which is currently undergoing initial pilot use. It stores most of its data in entity-attribute-value form. Such data are segregated according to data type to allow indexing by value when possible, and binary large object data are managed in the same way as other data. ACT/DB lets an investigator design a study rapidly by defining the parameters (or attributes) that are to be gathered, as well as their logical grouping for purposes of display and data entry. ACT/DB generates customizable data-entry forms. The data can be viewed through several standard reports as well as exported as text to external analysis programs. ACT/DB is designed to encourage reuse of parameters across multiple studies and has facilities for dictionary search and maintenance. It uses a Microsoft Access client running on Windows 95 machines, which communicates with an Oracle server running on a UNIX platform. ACT/DB is being used to manage the data for seven studies in its initial deployment.
MOTIVATION: Molecular biology databases have been proliferating rapidly. Their heterogeneity and complexity pose a great challenge to efforts in database interoperation. To minimize the efforts of interoperating heterogeneous databases, it is useful to develop a system that lets a user of a particular genomic database access another related database as if the latter is structurally similar to the former. RESULTS: We extend a structurally simple model-the entity-attribute-value (EAV) model-to describe uniformly metadata relating to individual databases. Such metadata, which are necessary for performing database comparisons, include descriptions of primitive database objects (including entities, attributes, domain values and entity relationships) and specification of correspondences among the database objects. We show how to decompose SQL queries and map them from one database to another based on the EAV representation of the basic database objects. A prototype system is implemented to demonstrate query interoperation between two chromosome map databases. AVAILABILITY: Freely available (Cold Fusion source code and an Access database containing the mapping knowledge) upon request from the author.
CHRONOMERGE is a database application that facilitates merging and display of multiple time-stamped data streams. Each stream is a table containing time-stamped values of one or more parameters (such as a panel of laboratory tests) for multiple patients, and is typically created by querying a clinical data repository. The data within a single stream therefore represents a pool of multiple time series. The merge operation is complex because of the numerous options to be considered, such as the granularity of the time-interval for merge, and the choice of statistical aggregates. CHRONOMERGE combines multiple streams into a single stream based on patient and time, or time alone (if aggregates are to be computed across patients). It allows specification of various options through a graphical user interface, and generates appropriate SQL code (or invokes procedural routines) to perform the merge. The resultant stream, or subsets of it, can then be displayed graphically. CHRONOMERGE is intended to facilitate the analysis of time-stamped data that has been extracted from repositories when standard tools (such as the time-series modules of statistics packages) are inadequate.
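A minimal sketch of the bucketing logic described above, with invented stream data and the mean as the chosen statistical aggregate. CHRONOMERGE itself generates SQL or invokes procedural routines and offers many more options; this only illustrates merging on patient and a coarsened time interval:

```python
from collections import defaultdict
from statistics import mean

# Two hypothetical streams: (patient_id, timestamp_in_hours, value).
glucose = [(1, 0.2, 110), (1, 0.7, 130), (1, 6.5, 95), (2, 1.1, 150)]
insulin = [(1, 0.4, 4.0), (2, 1.9, 6.0)]

def bucket(stream, granularity):
    """Pool values per (patient, time interval), then aggregate by mean."""
    pooled = defaultdict(list)
    for pid, t, v in stream:
        pooled[(pid, int(t // granularity))].append(v)
    return {k: mean(vs) for k, vs in pooled.items()}

def merge(a, b, granularity=1.0):
    """Combine two streams into one row per (patient, interval); missing
    observations in one stream appear as None."""
    ba, bb = bucket(a, granularity), bucket(b, granularity)
    return {k: (ba.get(k), bb.get(k)) for k in sorted(set(ba) | set(bb))}

print(merge(glucose, insulin))
```

With one-hour granularity, patient 1's two glucose readings in hour 0 are averaged to 120 and paired with the insulin reading from the same hour.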
Entity-Attribute-Value (EAV) tables form the major component of several mainstream electronic patient record systems (EPRS). Such systems have been optimized for real-time retrieval of individual patient data. Data warehousing, on the other hand, involves cross-patient data retrieval based on values of patient attributes, with a focus on ad hoc query. Attribute-centric query is inherently more difficult when data is stored in EAV form than when it is stored conventionally. We illustrate our approach to the attribute-centric query problem with ACT/DB, a database for managing clinical trials data. This approach is based on metadata supporting a query front end that essentially hides the EAV/non-EAV nature of individual attributes from the user. Our work does not close the query problem, and we identify several complex sub-problems that are still to be solved.
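Why attribute-centric query is harder over EAV data can be shown with a toy example (the table, attribute names, and data below are invented, not ACT/DB's actual schema): every attribute mentioned in the query needs its own join against the EAV table, whereas a conventional one-column-per-attribute design needs only a single WHERE clause.

```python
import sqlite3

# Hypothetical EAV table: one row per (patient, attribute, value).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eav (patient_id INTEGER, attribute TEXT, value TEXT)")
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", [
    (1, "diagnosis", "diabetes"), (1, "age", "67"),
    (2, "diagnosis", "diabetes"), (2, "age", "45"),
    (3, "diagnosis", "asthma"),   (3, "age", "70"),
])

# Attribute-centric query: patients with diagnosis = diabetes AND age > 60.
# Each attribute requires a separate join (a self-join here); values also
# need casting because EAV stores everything in one typeless column.
hits = conn.execute("""
    SELECT d.patient_id
      FROM eav d JOIN eav a ON a.patient_id = d.patient_id
     WHERE d.attribute = 'diagnosis' AND d.value = 'diabetes'
       AND a.attribute = 'age' AND CAST(a.value AS INTEGER) > 60
""").fetchall()
print([p for (p,) in hits])  # → [1]
```

A metadata-driven front end of the kind described above would generate such joins automatically, hiding the EAV storage from the user.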
Entity-attribute-value (EAV) representation is a means of organizing highly heterogeneous data using a relatively simple physical database schema. EAV representation is widely used in the medical domain, most notably in the storage of data related to clinical patient records. Its potential strengths suggest its use in other biomedical areas, in particular research databases whose schemas are complex as well as constantly changing to reflect evolving knowledge in rapidly advancing scientific domains. When deployed for such purposes, the basic EAV representation needs to be augmented significantly to handle the modeling of complex objects (classes) as well as to manage interobject relationships. The authors refer to their modification of the basic EAV paradigm as EAV/CR (EAV with classes and relationships). They describe EAV/CR representation with examples from two biomedical databases that use it.
OBJECTIVE: To query a clinical data repository (CDR) for answers to clinical questions to determine whether different types of fields (coded and free text) would yield confirmatory, complementary, or conflicting information and to discuss the issues involved in producing the discrepancies between the fields. METHODS: The appropriate data fields in a subset of a CDR (5,135 patient records) were searched for the answers to three questions related to surgical procedures. Each search included at least one coded data field and at least one free-text field. The identified free-text records were then searched manually to ensure correct interpretation. The fields were then compared to determine whether they agreed with each other, were supportive of each other, contained no entry (absence of data), or were contradictory. RESULTS: The degree of concordance varied greatly according to the field and the question asked. Some fields were not granular enough to answer the question. The free-text fields often gave an answer that was not definitive. Absence of data was most logically interpreted in some cases as lack of completion of data and in others as a negative answer. Even with a question as specific as which side a hernia was on, contradictory data were found in 5 to 8 percent of the records. CONCLUSIONS: Using the data in the CDR to answer clinical questions can yield significantly disparate results depending on the question and which data fields are searched. A database cannot just be queried in automated fashion and the results reported. Both coded and textual fields must be searched to obtain the fullest assessment. This can be expected to result in information that may be confirmatory, complementary, or conflicting. To yield the most accurate information possible, final answers to questions require human judgment and may require the gathering of additional information.
Creating front ends for a large Entity-Attribute-Value database used by multiple groups or departments in an institution involves considerable maintenance overhead when traditional client-server technology is used for development. Switching to Web technology as a delivery vehicle solves some of these problems but introduces others. In particular, Web development environments are primitive, and many features that client-server developers take for granted are missing. WebEAV is a generic framework intended to streamline the process of Web application development in this circumstance. It also addresses some challenging user interface issues that arise when any complex system is created. We describe WebEAV's architecture and provide an overview of its features with suitable examples.
Background: The Entity-Attribute-Value representation with Classes and Relationships (EAV/CR) provides a flexible, simple database schema for storing heterogeneous biomedical data. In certain circumstances, however, the EAV/CR model is known to retrieve data less efficiently than conventional database schemas. Objective: To perform a pilot study that systematically quantifies performance differences for database queries directed at real-world microbiology data modeled with EAV/CR and conventional representations, and to explore the relative merits of different EAV/CR query implementation strategies. Methods: Clinical microbiology data gathered over a 10-year period were stored using both database models. Query execution times were compared for four clinically oriented attribute-centered and entity-centered queries operating under varying conditions of database size and system memory. The performance characteristics of three different EAV/CR query strategies were also examined. Results: Performance was similar for entity-centered queries in the two database models. For attribute-centered queries, the EAV/CR model was approximately three to five times less efficient than its conventional counterpart. The differences in query efficiency widened slightly as database size increased, although they were reduced by the addition of system memory. We found that EAV/CR queries formulated as multiple simple SQL statements executed in batch were more efficient than single large SQL statements. Conclusion: This paper describes a pilot project to explore issues in, and compare query performance for, EAV/CR and conventional database representations. Although attribute-centered queries were less efficient in the EAV/CR model, these inefficiencies may be at least partly addressable with more powerful hardware or more memory.
OBJECTIVES: To explore the feasibility of using the National Library of Medicine's Unified Medical Language System (UMLS) Metathesaurus as the basis for a computational strategy to identify concepts in medical narrative text preparatory to indexing, and to quantitatively evaluate this strategy in terms of true positives, false positives (spuriously identified concepts), and false negatives (concepts missed by the identification process). METHODS: Using the 1999 UMLS Metathesaurus, we processed a training set of 100 documents (50 discharge summaries, 50 operative notes) with a concept-identification program whose output was manually analyzed. We flagged concepts that were erroneously identified, and added new concepts that were not identified by the program, recording the reason for failure in such cases. After several refinements to both our algorithm and the UMLS subset on which it operated, we deployed the program on a test set of 24 documents (12 of each kind). RESULTS: 7,227 of 8,745 matches (82.6%) were true positives in the training set, and 1,298 of 1,701 matches (76.3%) were true positives in the test set. Matches other than true positives indicated potential problems for production-mode concept indexing. Causes of problems included redundant concepts in the UMLS; homonyms, acronyms, abbreviations, and elisions; concepts missing from the UMLS; proper names; and spelling errors. CONCLUSIONS: The error rate was too high for concept indexing to be the only production-mode means of preprocessing medical narrative. Considerable curation must be performed to define a UMLS subset suitable for concept matching.
OBJECTIVES: To test the hypothesis that most instances of negated concepts in dictated medical documents can be detected by a strategy that relies on tools developed for the parsing of formal (computer) languages: specifically, a lexical scanner (lexer) that uses regular expressions to generate a finite state machine, and a parser that relies on a restricted subset of context-free grammars known as LALR(1) grammars.
METHODS: A diverse training set of 40 medical documents from a variety of specialties was manually inspected and used to develop a program (Negfinder) containing rules to recognize a large set of negated patterns occurring in the text. Negfinder's lexer and parser were developed using tools normally used to generate programming-language compilers. The input to Negfinder consisted of medical narrative that had been pre-processed to recognize UMLS concepts: the text of each recognized concept had been replaced with a coded representation that included its UMLS concept ID. The program generated an index with one entry per instance of a concept in the document, recording the presence or absence of negation of that concept. This information was used to mark up the text of each document with color-coding to make it easier to inspect. The parser was then evaluated in two ways: (1) a test set of 60 documents (30 discharge summaries, 30 surgical notes) marked up by Negfinder was inspected visually to quantify false positives and false negatives; and (2) a different test set of 10 documents was independently examined for negations by a human observer and by Negfinder, and the results were compared.
RESULTS: In the first evaluation, using marked-up documents, 8,358 instances of UMLS concepts were detected in the 60 documents, of which 544 were negations detected by the program and verified by human observation (true positives, TPs). Thirteen instances were wrongly flagged as negated (false positives, FPs), and the program missed 27 instances of negation (false negatives, FNs), yielding a sensitivity of 95.3% and a specificity of 97.7%. In the second evaluation, using independent negation detection, there were 1,869 concepts in 10 documents, with 135 TPs, 12 FPs, and 6 FNs, yielding a sensitivity of 95.7% and a specificity of 91.8%. The words "no," "denies/denied," "not," and "without" were present in 92.5% of all negations.
CONCLUSIONS: Negation of most concepts in medical narrative can be reliably detected by a simple strategy. The reliability of detection depends on several factors, the most important being the accuracy of concept matching.
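The flavor of the strategy can be conveyed with a crude sketch. Negfinder itself uses a lexer plus an LALR(1) parser, not a single regular expression; the cue words below are taken from the abstract (they covered 92.5% of negations), but the one-cue-per-sentence scoping rule and the example sentences are invented:

```python
import re

# Four cue words from the abstract; the scoping rule here is a toy one.
NEG_CUE = re.compile(r"\b(no|not|denies|denied|without)\b", re.IGNORECASE)

def negated_concepts(sentence, concepts):
    """Return the subset of `concepts` that appear after a negation cue
    in the sentence; empty if no cue is present."""
    m = NEG_CUE.search(sentence)
    if not m:
        return set()
    tail = sentence[m.end():]  # crude scope: everything after the cue
    return {c for c in concepts
            if re.search(r"\b" + re.escape(c) + r"\b", tail, re.IGNORECASE)}

print(sorted(negated_concepts("Patient denies chest pain or dyspnea.",
                              ["chest pain", "dyspnea", "fever"])))
# → ['chest pain', 'dyspnea']
print(negated_concepts("Reports chest pain.", ["chest pain"]))  # → set()
```

A real implementation must also terminate the negation scope at conjunctions and clause boundaries, which is where the grammar-based parser earns its keep.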
Information retrieval (IR) is the field of computer science that deals with the processing of documents containing free text, so that they can be rapidly retrieved based on keywords specified in a user's query. IR technology is the basis of Web-based search engines, and plays a vital role in biomedical research, because it is the foundation of software that supports literature search. Documents can be indexed by both the words they contain, as well as the concepts that can be matched to domain-specific thesauri; concept matching, however, poses several practical difficulties that make it unsuitable for use by itself. This article provides an introduction to IR and summarizes various applications of IR and related technologies to genomics.
Clinical study data management systems (CSDMSs) have many similarities to clinical patient record systems (CPRSs) in their focus on recording clinical parameters. Requirements for ad hoc query interfaces for both systems would therefore appear to be highly similar. However, a clinical study is concerned primarily with collective responses of groups of subjects to standardized therapeutic interventions for the same underlying clinical condition. The parameters that are recorded in CSDMSs tend to be more diverse than those required for patient management in non-research settings, because of the greater emphasis on questionnaires for which responses to each question are recorded separately. The differences between CSDMSs and CPRSs are reflected in the metadata that support the respective systems' operation, and need to be reflected in the query interfaces. The authors describe major revisions of their previously described CSDMS ad hoc query interface to meet CSDMS needs more fully, as well as its porting to a Web-based platform.
The Pharmacogenetics Research Network, which has the long-term goal of genotype-phenotype correlation related to pharmacotherapy, mandates timely electronic publication of results by participating research groups through submission to PharmGKB, the consortium's repository database. Because informatics expertise across groups varies, many groups need help in managing their own data and in generating electronic submissions. To assist these operations, we perform a needs assessment to determine an optimum database implementation strategy, which varies from standalone microcomputer database application to Web-based solutions, depending on the group and problem scope. Solution implementation is coupled with transfer of expertise through hands-on training, so as to reduce the groups' long-term dependence on us. Where multiple groups face common problems, such as managing genotyping data or clinical study support, we have devised generic software that can be reused in its entirety by individual groups, or customized with modest effort.
Generic clinical study data management systems can record data on an arbitrary number of parameters in an arbitrary number of clinical studies without requiring modification of the database schema. They achieve this by using an Entity-Attribute-Value (EAV) model for clinical data. While very flexible for creating transaction-oriented systems for data entry and browsing of individual forms, EAV-modeled data is unsuitable for direct analytical processing, which is the focus of data marts. For this purpose, such data must be extracted and restructured appropriately. This paper describes how such a process, which is non-trivial and highly error-prone if performed non-systematically, can be automated by judicious use of the study metadata: the descriptions of the measured parameters and their higher-level grouping. The metadata, in addition to driving the process, is exported along with the data in order to facilitate its human interpretation.
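A sketch of metadata-driven restructuring, assuming a hypothetical `metadata` table listing one study's parameters (the process described in the paper is considerably more involved). Because the pivot SQL is generated from the metadata rather than written by hand, the same code serves any study:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  -- Hypothetical metadata: the parameters measured in one study.
  CREATE TABLE metadata (attribute TEXT PRIMARY KEY, label TEXT);
  INSERT INTO metadata VALUES ('sbp', 'Systolic BP'), ('hr', 'Heart rate');
  -- EAV clinical data: one row per (subject, attribute, value).
  CREATE TABLE eav (subject_id INTEGER, attribute TEXT, value REAL);
  INSERT INTO eav VALUES (1, 'sbp', 142), (1, 'hr', 88), (2, 'sbp', 118);
""")

# Generate one MAX(CASE ...) column per parameter listed in the metadata,
# restructuring EAV rows into one conventional column per parameter.
attrs = [a for (a,) in conn.execute(
    "SELECT attribute FROM metadata ORDER BY attribute")]
cols = ", ".join(
    f"MAX(CASE WHEN attribute = '{a}' THEN value END) AS {a}" for a in attrs)
pivot = conn.execute(
    f"SELECT subject_id, {cols} FROM eav "
    f"GROUP BY subject_id ORDER BY subject_id").fetchall()
print(pivot)  # → [(1, 88.0, 142.0), (2, None, 118.0)]
```

Subject 2 has no heart-rate row, so the restructured table makes the missing observation explicit as a NULL, something the EAV form leaves implicit.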
OBJECTIVES: The authors designed and implemented a clinical data mart composed of an integrated information retrieval (IR) and relational database management system (RDBMS). DESIGN: Using commodity software, which supports interactive, attribute-centric text and relational searches, the mart houses 2.8 million documents that span a five-year period and supports basic IR features such as Boolean searches, stemming, and proximity and fuzzy searching. MEASUREMENTS: Results are relevance-ranked using either "total documents per patient" or "report type weighting." RESULTS: Non-curated medical text has a significant degree of malformation with respect to spelling and punctuation, which creates difficulties for text indexing and searching. Presently, the IR facilities of RDBMS packages lack the features necessary to handle such malformed text adequately. CONCLUSION: A robust IR+RDBMS system can be developed, but it requires integrating RDBMSs with third-party IR software. RDBMS vendors need to make their IR offerings more accessible to non-programmers.
We describe an interface and architecture for ad hoc temporal query of TrialDB, a clinical study data management system (CSDMS). A clinical study focuses primarily on the effect of therapy on a group of patients, who have individually enrolled in a study at different times. Relative times (chronological offsets from the time of enrollment) are therefore more useful than absolute times when collectively describing therapeutic or adverse events. For logistic reasons, study parameter values are typically recorded at fixed relative times ('study events'), which serve as time-stamps and can be used by CSDMS temporal query algorithms to simplify temporal computations. The entity-attribute-value model of clinical data storage, used by both CSDMSs and clinical patient record systems, complicates temporal query. To apply temporal operators, data for parameters of interest must first be transiently converted into conventional relational form, with one column per parameter.
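The relative-time idea can be illustrated in a few lines (the enrollment dates and events below are hypothetical): two subjects who enrolled months apart both report the same adverse event 14 days after enrollment, which a relative-time query aligns but an absolute-time query would not:

```python
from datetime import date

# Hypothetical per-subject enrollment dates and absolute-dated events.
enrolled = {1: date(2003, 1, 10), 2: date(2003, 3, 1)}
events = [(1, date(2003, 1, 24), "rash"), (2, date(2003, 3, 15), "rash")]

# Convert each event's absolute date to a chronological offset (in days)
# from that subject's enrollment.
relative = [(pid, (d - enrolled[pid]).days, what) for pid, d, what in events]
print(relative)  # → [(1, 14, 'rash'), (2, 14, 'rash')]
```

Both events land on study day 14, so a query such as "adverse events in the first two weeks of therapy" retrieves them together despite their different calendar dates.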
The EAV/CR framework, designed for database support of rapidly evolving scientific domains, utilizes metadata to facilitate schema maintenance and automatic generation of Web-enabled browsing interfaces to the data. EAV/CR is used in SenseLab, a neuroscience database that is part of the national Human Brain Project. This report describes various enhancements to the framework. These include (1) the ability to create "portals" that present different subsets of the schema to users with a particular research focus, (2) a generic XML-based protocol to assist data extraction and population of the database by external agents, (3) a limited form of ad hoc data query, and (4) semantic descriptors for interclass relationships and links to controlled vocabularies such as the UMLS.
In highly functional metadata-driven software, the interrelationships within the metadata become complex, and maintenance becomes challenging. We describe an approach to metadata management that uses a knowledge-base subschema to store centralized information about metadata dependencies and use cases involving specific types of metadata modification. Our system borrows ideas from production-rule systems in that some of this information is a high-level specification that is interpreted and executed dynamically by a middleware engine. Our approach is implemented in TrialDB, a generic clinical study data management system. We review approaches that have been used for metadata management in other contexts and describe the features, capabilities, and limitations of our system.
MOTIVATION: Packages that support the creation of pathway diagrams are limited by their inability to be readily extended to new classes of pathway-related data. RESULTS: VitaPad is a cross-platform application that enables users to create and modify biological pathway diagrams and incorporate microarray data into them. It improves on existing software in the following areas: (i) it can create diagrams dynamically through graph layout algorithms; (ii) it is open-source and uses an open XML format to store data, allowing easy extension or integration with other tools; (iii) it features a modern user interface with intuitive controls, high-resolution graphics, and a fully customizable appearance. AVAILABILITY: http://bioinformatics.med.yale.edu
Query Integrator System (QIS) is a database mediator framework intended to address robust data integration from continuously changing heterogeneous data sources in the biosciences. Currently in the advanced prototype stage, it is being used on a production basis to integrate data from neuroscience databases developed for the SenseLab project at Yale University with external neuroscience and genomics databases. The QIS framework uses standard technologies and is intended to be deployable by administrators with a moderate level of technological expertise: it comes with various tools, such as interfaces for the design of distributed queries. The QIS architecture is based on a set of distributed network-based servers (data source servers, integration servers, and ontology servers) that exchange metadata as well as mappings of both metadata and data elements to elements in an ontology. Determination of metadata version differences, coupled with decomposition of stored queries, is used as the basis for partial query recovery when the schema of a data source changes.
Objectives: The National Cancer Institute (NCI) has developed the Common Data Elements (CDE) to serve as a controlled vocabulary of data descriptors for cancer research, to facilitate data interchange and interoperability between cancer research centers. We evaluated the CDE's structure to determine whether it could represent the elements necessary to support its intended purpose, and whether it could prevent errors and inconsistencies from being accidentally introduced. We also performed automated checks for certain types of content errors, which provided a rough measure of curation quality. Methods: Evaluation was performed on CDE content downloaded via the NCI's CDE Browser and transformed into relational database form. Evaluation covered three categories: (1) compatibility with the ISO/IEC 11179 metadata model, on which the CDE structure is based; (2) features necessary for controlled vocabulary support; and (3) support for a stated NCI goal, the setup of data collection forms for cancer research. Results: Various limitations were identified, both with respect to content (inconsistency, insufficient definition of elements, redundancy) and to structure, particularly the need for term and relationship support, and for metadata supporting the explicit representation of electronic forms that utilize sets of common data elements. Conclusions: While there are numerous positive aspects to the CDE effort, there is considerable opportunity for improvement. Our recommendations include review of existing content by diverse experts in the cancer community; integration with the NCI Thesaurus to take advantage of the latter's links to nationally used controlled vocabularies; and various schema enhancements required for electronic form support.