Development of a High-Speed, Low-Cost Terminology Server
There is an evident need for
efficient terminological tools in today's clinical information systems. These
tools must store, convey, and cross-reference the content of multiple emerging
standardized terminologies while at the same time supporting local variability.
Terminology servers can be
used to integrate heterogeneous and fragmented clinical information systems and
to uniformly store data in clinical repositories or to extract and translate
data from such repositories. As such, terminology servers are a critical
component of modern clinical information systems and must handle large-volume
traffic between ancillary systems and repositories, and between repositories and
clinical data interfaces. Existing commercial clinical applications lack such
capable servers and rely on alternative solutions that, as source and local data
dictionaries grow in size and complexity, become less than optimal.
Commercially available terminology servers are derivatives of
terminology editing tools and are not designed for performance. Therefore,
high-performance terminology servers are needed for large-scale clinical
production systems.
This proposal aims to design,
build, benchmark and prove the feasibility of a robust,
compact, generic terminology server that can simultaneously accommodate
multiple source terminologies, support a flexible semantic network and offer
high-end performance suitable for large-scale clinical information systems. The
terminology server will operate on mid-level Intel-based computers running a
Linux-based operating system, providing a capable, low-cost
solution for small- and medium-sized healthcare enterprises in today’s
cash-strapped healthcare industry, while still offering the option to use
high-end Unix boxes. The
terminology server will be designed as an integral module of a future,
larger-scale terminology maintenance and development environment that will
let healthcare-related institutions apply local changes
while still maintaining the integrity of the source standardized terminologies
and the ability to cross-reference terminologies.
A. Specific aims:
The goal of this Phase I
proposal is to design, build and benchmark a software application that will be
able to simultaneously serve terminology content from various sources, such as
SNOMED-RT, ICD-9, CPT and LOINC. The Terminology Server (TS) will offer an
Application Programming Interface (API) library that will enable external
applications to query the terminological content, concept attributes and
concept relations.
The TS will be optimized for
performance, utilizing a shared-memory design rather than a standard
database approach. The TS will support a Directed Acyclic Graph (DAG) design
for the hierarchic tree, based on a single semantic type (IS-A), while concept
representation will be frame-based, utilizing literal attributes and
bi-directional semantic pointers of any type. The TS will support multiple
parents per concept and semantic inheritance. This generic design will assure maximum
flexibility for content incorporation.
The TS is planned to run
on mid-level Intel-based computers running the Linux operating system. The
ultimate goal of this Phase I project is to achieve performance equivalent to
that of the New York Presbyterian Hospital (NYPH) controlled vocabulary server
containing the Medical Entities Dictionary (MED), which runs on high-end Unix
machines, while serving the same content and executing the same tasks.
The following project outline
is proposed:
1. Design a shared-memory structure optimized for performance
2. Implement the TS program using the C programming language
3. Build a basic API library
4. Provide an easy installation and running environment with better than 99%
uptime
5. Demonstrate feasibility by uploading SNOMED-RT to the TS
6. Benchmark the TS by uploading content similar to NYPH’s MED and comparing
execution of identical tasks in both environments.
Upon successful completion of
Phase I, a Phase II is planned. The goal of Phase II will be to augment the TS
with a user-friendly, terminology editing environment and supporting tools.
This environment will support creation of protected “read-only”
segments and allow addition and modification of concepts based on various
permission levels. This design will allow for institutional variability at the
leaf-node level without sacrificing core source terminology content integrity.
Widespread use of this design will enable future automated translation of
individual coded data from different institutions utilizing the core source
standardized terminology.
B. Background and Significance:
Today's healthcare
information systems rely more than ever on standardized terminologies to store,
convey and mine information. Natural development, practical needs and
regulatory measures are likely to increase pressure for use of standardized
terminologies and, at the same time, increase the size and coverage of such
terminologies [1]. Emerging comprehensive terminologies are complex
data structures based on hierarchic and semantic networks that are difficult to
comprehend and maintain. On the other hand, local Clinical Information Systems
(CIS) are heterogeneous and fragmented environments in need of integration.
Institutional Data Dictionaries (IDDs) are an important
component of the required infrastructure for proposed solutions to support CIS
component integration and promote use of standardized terminologies. IDDs can be
used by interface and upload engines to
translate and bridge between vocabularies of ancillary systems and to transform
multiple external vocabularies, such as ICD9-CM or LOINC, into a single coding
scheme for storage and retrieval of data from centralized clinical repositories
(see Figure 1). Although IDDs might not be expected
to be as large as the sum of all vocabularies used in any specific CIS, they do
have to maintain the characteristics and structure of the source terminologies,
and they eventually become large and complex enough to pose significant problems
for daily management and utilization. Moreover, IDDs must
also allow for local institutional variability [2], a requirement that
should not be overlooked and that cannot be met by the use of emerging
standardized terminologies alone. Therefore, a generic,
environment-independent terminology tool that can simultaneously support
multiple controlled terminologies, complete or partial, cross-reference
concepts and support local editorial work while maintaining overall component
integrity is advantageous.
CISs deal with very high
volume traffic, and performance is critical for usability and acceptance of
computer-based clinical applications. Every coded data element
in the clinical repository interacts with the IDD multiple times during its life
cycle. Therefore, a very fast and responsive TS is a critical
component of any modern, efficient CIS.
Moreover, the
incorporation of controlled IDDs in the
infrastructure of a CIS offers additional advantages. Logic engines can be
created to execute sophisticated queries to the clinical repositories [3].
For example, if an institution has a clinical laboratory system that performs
several different instances of Total Creatine Kinase tests and Creatine Kinase
Isoenzyme tests,
organizing classes can be created in the IDD and all the individual tests can
be mapped to them respectively (see Figure 2.1). When the need exists to
retrieve all of a specific patient's Creatine Kinase tests, the clinician no
longer needs to search for
each and every individual test but can simply search by the class; the IDD will
enable the automatic expansion from the class. Such possibilities free users
and builders of clinical systems from the need to have intrinsic
knowledge of each and every data element (see Figures 2.2 and 2.3).
Pivotal to successful
implementation of a TS is its ability to incorporate a variety of controlled
terminologies in response to local clinical needs. Medical terminologies, in
general, do not share a unified structural model; therefore, a flexible and
generic TS design is required to accommodate the rigid and
disparate models of existing medical terminologies.
The design proposed
below assumes such generic requirements. The single assumption at the base of
the TS model is that the hierarchical tree of all the components contained
within a single IDD that the TS has to serve is based on a single semantic
type. This limitation does not, however, dictate which semantic type is used.
Any classification system
that is based on a single semantic type can be represented by a Directed
Acyclic Graph (DAG) structure. By extension, controlled vocabularies can be
represented by a DAG coupled with a semantic network. The IS-A semantic type is
usually used for the DAG basic hierarchic structure but can be replaced by any
other type. The DAG structure is intuitively and most commonly used to view the
content of the vocabulary since it offers a two-dimensional tree-like hierarchy
that is relatively easy to comprehend. The added value of the controlled
terminology, the inter-relationship between concepts and their attributes,
cannot be conveyed by the DAG alone. For this purpose, semantic relationships
of various types are created: the
characteristics of the relationship are represented by the semantic type, while
the explicit inter-vocabulary relationship is represented by direct
pointers between the concepts involved. Naturally, semantic relationships come
as bi-directional pairs; if concept {A} uses a specific semantic type to point
at concept {B}, then concept {B} will have a reciprocal semantic type
pointing at concept {A}.
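The bi-directional pairing just described can be illustrated with a small Python sketch (Python is used here for brevity; the proposal plans a C implementation). The slot names "measures" and "measured-by", the `Concept` class and the `link`/`unlink` helpers are assumptions made for the example and do not come from the source terminologies.

```python
# Hypothetical reciprocal pairs; slot names are illustrative only.
RECIPROCAL = {"measures": "measured-by", "measured-by": "measures"}


class Concept:
    """A concept holding semantic pointers as uIDs of other concepts."""

    def __init__(self, uid, name):
        self.uid = uid
        self.name = name
        self.semantic = {}  # slot name -> set of target uIDs


def link(concepts, source_uid, slot, target_uid):
    """Set the slot on the source and the reciprocal slot on the target."""
    reciprocal = RECIPROCAL[slot]
    concepts[source_uid].semantic.setdefault(slot, set()).add(target_uid)
    concepts[target_uid].semantic.setdefault(reciprocal, set()).add(source_uid)


def unlink(concepts, source_uid, slot, target_uid):
    """Remove both directions so the pair never becomes one-sided."""
    reciprocal = RECIPROCAL[slot]
    concepts[source_uid].semantic.get(slot, set()).discard(target_uid)
    concepts[target_uid].semantic.get(reciprocal, set()).discard(source_uid)


concepts = {1: Concept(1, "Total CK Test"), 2: Concept(2, "Creatine Kinase")}
link(concepts, 1, "measures", 2)
```

Because both directions are written in one call, it does not matter which side of the pair the caller starts from; linking from concept 2 through "measured-by" yields the same state.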
To model
multi-dimensional networks with two-dimensional DAGs,
each concept must be represented by a two-dimensional matrix, resulting in a
frame-based controlled vocabulary. Each frame consists of multiple slots
of two types: semantic slots contain semantic pointers that maintain the
explicit relationship between two concepts in different sub-hierarchies, while
literal slots contain descriptive information about a concept. The semantic
network reshapes the two-dimensional hierarchic tree into an (n+2)-dimensional
structure, where n stands for the number of semantic
pairs. Each semantic pair inter-relates two non-overlapping sub-hierarchies of
the DAG, thus creating a space that serves as the basis for the introduction of
inheritance within the hierarchic tree and the rules for its implementation.
The Medical Entities
Dictionary (MED) [4] is an example of such an
IDD-Terminology Server that serves as a core component of its CIS. This
frame-based controlled vocabulary follows Cimino's
controlled medical dictionary desiderata [5]. Any editing effort must
maintain and support the semantic integrity of the affected hierarchies as well
as the multi-parent structure [6,7]. The
degree of complexity of the vocabulary is not simply a function of the number
of concepts but is also exponentially related to the number of semantic types
maintained within. Editing a complex, frame-based DAG can easily introduce
errors, even when done by experienced human editors. Paradoxically, the MED
experience has
shown that the same complexity, as long as integrity and logic are maintained,
can guide automated algorithms to support maintenance tasks and facilitate
non-domain expert interaction with the terminology [3,8].
This proposal offers to
create a TS that will serve in CISs just as the MED
server supports the NYPH CIS, but in a more compact, cost-effective manner, so
that even small and mid-sized healthcare institutions will be able to implement
and utilize the TS.
OPERATIONAL RULES
Any controlled
vocabulary is a formal knowledge representation language for a particular
domain and a set of self-consistent operational rules must be enforced on the
data structure that represents the knowledge. These rules are enumerated below:
1. Acyclicity: A concept may not be a descendant of any of its own descendants,
and each concept must have at least one parent (except, by definition, the top
concept).
2. Each concept must have a unique normalized name and a unique system
identifier (uID).
3. Slots may contain more than one value but may be limited, by definition, to
single values.
4. Literal slot values are not inherited. These slot values must be explicitly
instantiated by the user.
5. Literal slots may contain any type of information.
6. Semantic slot values are inherited by default (Figure 3) but may also be
explicitly instantiated by the user, subject to Rule 12, below.
7. Semantic slots contain only the unique ID (uID) values of other concepts.
8. Semantic slots consist of pairs that point to one another (Figure 3). Thus,
if concept {A} has semantic slot [x] containing the uID of concept {B}, then
concept {B} will have a slot [y] pointing to concept {A}; [x] and [y] are a
semantic pair. When explicitly instantiating a particular semantic slot, the
user can arbitrarily choose either slot of the pair. The reciprocal slot will
automatically be instantiated. Inheritance of the value will proceed for both
slots in accordance with Rule 12.
9. All slots, literal or semantic, have a specified domain, i.e. a specific
sub-hierarchy within the terminology where the slot is defined.
10. Semantic slots must have a defined range, i.e. a sub-hierarchy within the
terminology from which concepts may be applied to a specific slot. This range
is simultaneously the domain of the reciprocal semantic slot.
11. Any sub-hierarchy in the terminology is identified by its root concept.
Thus every slot becomes defined at a unique concept in the terminology called
its insertion point. The "defined" status of a specific slot is automatically
activated for all descendants of the insertion point.
12. Refinement inheritance of semantic slots (Figure 3): Semantic slot values
may be refined either by inheritance or by explicit instantiation. The
following sub-rules apply: (a) A more refined value (a descendant), inherited
or explicit, will replace a less refined value (its ancestor). (b) A semantic
slot will only hold the most specific value (i.e. the most refined) if several
ancestor-descendant values coexist by inheritance or explicit instantiation.
(c) Multiple values in a semantic slot must not have an ancestor-descendant
relationship with each other.
13. Only explicit instantiations can be removed from semantic slots. An
inherited slot value cannot be removed on its own. Any removal of an explicit
instantiation from a semantic slot also removes its inherited values and
affects its reciprocal slot symmetrically.
14. Domain and range may not overlap.
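Two of the rules above lend themselves to a short illustration: the acyclicity check of Rule 1 and the domain and range checks of Rules 9 to 11. The following Python sketch is only one possible way to express them; the hierarchy fragment, the `SLOTS` table and all function names are hypothetical, and Python is used for brevity rather than the C planned for the TS.

```python
# Hypothetical hierarchy fragment: child -> parents.
PARENTS = {
    "Top": [],
    "Test": ["Top"],
    "Substance": ["Top"],
    "CK Test": ["Test"],
    "Enzyme": ["Substance"],
    "Creatine Kinase": ["Enzyme"],
}

# Illustrative slot table: slot -> (insertion point of the domain, root of the range).
SLOTS = {"measures": ("Test", "Substance")}


def ancestors(concept, parents):
    """All strict ancestors of concept, found by walking parent links upward."""
    found, stack = set(), list(parents.get(concept, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parents.get(node, []))
    return found


def would_create_cycle(child, new_parent, parents):
    """Rule 1: the link child -> new_parent is illegal if child already sits at or above new_parent."""
    return child == new_parent or child in ancestors(new_parent, parents)


def in_subhierarchy(concept, root, parents):
    """True when concept is the root itself or one of its descendants."""
    return concept == root or root in ancestors(concept, parents)


def may_instantiate(slot, source, target, parents, slots):
    """Rules 9-11: source must lie in the slot's domain and target in its range."""
    domain_root, range_root = slots[slot]
    return (in_subhierarchy(source, domain_root, parents)
            and in_subhierarchy(target, range_root, parents))
```

An editing tool could call `would_create_cycle` before accepting a new parent link and `may_instantiate` before accepting a new semantic value, rejecting the change with immediate feedback when either check fails.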
REPRESENTATIONAL MODELS
In general, there are
two contrasting models for data structures used to represent concepts within a
frame-based
DAG. Once a modeling type is chosen, one is committed to maintaining the data
structures accordingly, since dependent applications expect consistency. The
static model is intuitively simple: each concept holds all of its slot
attributes, semantic and non-semantic (Figure 4). It can
further be enhanced by pre-calculating and storing within the concept its
ancestry and descendants in addition to parents and children. The static model
is most suitable for production systems for which response time is of utmost
importance. In such an environment applications use, but do not change, the
vocabulary. While this model offers speed, it is not optimal for editorial
tasks, which change the vocabulary. Changes in parent-child
relationships or semantic values, as well as changes to slot definitions,
require extensive calculations for propagation down the descendant tree. The
number of concepts that may be affected by the semantic propagation process
grows exponentially with the remaining depth of the sub-hierarchy below the
changed concept.
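The pre-calculation of ancestry and descendants mentioned for the static model can be sketched as a one-time closure computation. This Python example is a minimal illustration under assumed data; the hierarchy, the function names and the use of in-process dictionaries (rather than the proposed shared-memory layout) are all assumptions.

```python
# Hypothetical hierarchy with one multi-parent concept: child -> parents.
PARENTS = {
    "Top": [],
    "Test": ["Top"],
    "Chemistry Test": ["Test"],
    "Stat Test": ["Test"],
    "Stat CK": ["Chemistry Test", "Stat Test"],
}


def precompute_closure(parents):
    """Build ancestor and descendant sets for every concept once, at load time."""
    ancestors = {}

    def collect(concept):
        # Memoized upward walk; assumes the hierarchy is acyclic.
        if concept not in ancestors:
            result = set()
            for parent in parents[concept]:
                result.add(parent)
                result |= collect(parent)
            ancestors[concept] = result
        return ancestors[concept]

    for concept in parents:
        collect(concept)

    descendants = {concept: set() for concept in parents}
    for concept, ups in ancestors.items():
        for up in ups:
            descendants[up].add(concept)
    return ancestors, descendants


ANCESTORS, DESCENDANTS = precompute_closure(PARENTS)


def is_a(concept, candidate_ancestor):
    """Subsumption test answered directly from the stored closure."""
    return candidate_ancestor in ANCESTORS[concept]
```

Lookups become simple set membership tests, which suits read-only production use. The cost is that any change to a parent-child link invalidates the stored sets for the affected sub-hierarchy, which is the propagation burden described above.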
The alternative to a
static model is a dynamic model (Figure 4) that utilizes the operational rules
of the terminology and performs propagation calculations only when needed. In
the dynamic model only the explicit slot values are stored. Concepts viewed
using the dynamic model are partially virtual, since
all of the non-explicit semantic values are calculated on-the-fly based on the
operational rules stated above. Traversing up the ancestry tree and additional
calculations are required to display the actual semantic content of a concept.
All editorial changes are applied only to explicit values of the affected
concept. The most time-consuming operation, semantic propagation, is
limited to enforcing the operational rules of acyclicity
and refinement inheritance and can be performed in the background. For undo
operations the dynamic model is advantageous as well, since the only affected
concepts are the main concept being edited and its explicit slot-reciprocals.
Therefore,
there is no need to undo changes for each and every descendant of the main
concept and those of the reciprocals. Typically, the dynamic model size is
approximately two-thirds the size of the static model.
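The on-the-fly calculation in the dynamic model can be illustrated by resolving a semantic slot from explicit values only, applying the refinement inheritance of Rule 12. The Python sketch below is a simplified reading of that rule under assumed data; the hierarchy, the "measures" slot, and the `EXPLICIT` store are hypothetical, and reciprocal slots are left out to keep the example short.

```python
# Hypothetical hierarchy: child -> parents.
PARENTS = {
    "Top": [],
    "Test": ["Top"], "CK Test": ["Test"], "CK-MB Test": ["CK Test"],
    "Substance": ["Top"], "Enzyme": ["Substance"],
    "Creatine Kinase": ["Enzyme"], "CK-MB": ["Creatine Kinase"],
}

# Only explicit instantiations are stored: concept -> slot -> set of values.
EXPLICIT = {
    "Test": {"measures": {"Substance"}},
    "CK Test": {"measures": {"Creatine Kinase"}},
    "CK-MB Test": {"measures": {"CK-MB"}},
}


def ancestors(concept):
    """All strict ancestors of concept, found by walking parent links upward."""
    found, stack = set(), list(PARENTS[concept])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(PARENTS[node])
    return found


def effective_values(concept, slot):
    """Gather the concept's own and inherited explicit values, keep only the most refined."""
    candidates = set()
    for holder in {concept} | ancestors(concept):
        candidates |= EXPLICIT.get(holder, {}).get(slot, set())
    # A value is dropped when another candidate is one of its descendants.
    return {v for v in candidates
            if not any(v in ancestors(other) for other in candidates)}
```

Removing the explicit value stored at "CK Test" changes what that concept and its descendants resolve to on the next query, yet nothing stored on the descendants has to be rewritten, which reflects the undo advantage described above.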
IMPLEMENTATION OUTLINE AND COMPARISON
Currently there is no
standard for, and no dominant application that functions
as, an IDD TS. Besides the single instance of the MED, only two commercial
products exist, and to this day neither has established any significant
market share or actual production implementations. Apelon’s [9] products use
relatively rigid description logic to support editing and provide a run-time
environment that can be utilized as a query mechanism for the content. A
relatively recent newcomer is Health Language
Cyber+LE [10]. This product comes as an integrated package that
relies on a single content source, SNOMED-RT, allows user modifications
and supports content queries. Both products rely on conventional databases for
content storage (Oracle 8, MS SQL Server), which add to the initial cost,
the complexity of installation and management, and maintenance costs. Both
products are expensive and lack functionality, significant clinical
installations and performance evaluations, and initial reviews do not indicate
performance robust enough to serve as online terminology servers for
large-scale, heavy-load clinical systems.
A prototype design of
such a system has existed, in a protracted beta phase, for more than 10 years at
Columbia Presbyterian Medical Center (CPMC) of the New York Presbyterian
Hospital (NYPH).
The MED was ahead of its
time. The MED is used at CPMC for the standardized coding and translation of
most data items stored in, and extracted from, the clinical repository for
clinical applications, administrative, and research projects. The lack of
additional development and modernization of the MED over the years makes it a
poor candidate for a potential solution for today’s clinical applications
but, nevertheless, exemplifies the potential role of IDDs
and TSs in modern CISs.
Based on the MED’s example and the deficiencies in other existing
terminological tools, this Phase I proposal aims to create a highly
efficient TS that can operate either on Intel-based computers or high-end Unix
boxes without the need to incorporate traditional databases, thus significantly
reducing acquisition and implementation costs. The lack of a standard database
should not be viewed as a shortfall of the proposed system but rather as an
advantage. Maintaining all content in optimized shared memory offers a much
more efficient execution environment and simpler implementation and
maintenance. Updates can be executed without the need for database re-indexing,
and no knowledge of an underlying schema will be required. Content volume and
size of source terminologies are a negligible issue, since in the Linux/Unix
environment the only limiting factor is the actual size of the existing RAM
which, today, poses no implementation barrier. The API library that will be
offered with the TS will give access to all basic content of the source
terminology (i.e. literal slots including name, semantic relations) by type.
Additionally, the API library functions will be able to interact with one
another through external programs to create additional algorithms that
manipulate content.
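How basic API functions might be combined by an external program can be sketched as follows. None of these names belong to the proposed API, which the proposal does not specify; `TerminologyStore`, `get_literal`, `get_semantic`, `get_children` and `names_related_to` are hypothetical and serve only to show the composition idea, in Python rather than the planned C.

```python
class TerminologyStore:
    """In-memory stand-in for TS content; not the proposed shared-memory layout."""

    def __init__(self):
        self.literal = {}   # uID -> {slot: value}
        self.semantic = {}  # uID -> {slot: set of uIDs}
        self.children = {}  # uID -> list of child uIDs

    def add(self, uid, name, parent=None):
        self.literal[uid] = {"name": name}
        self.semantic[uid] = {}
        self.children[uid] = []
        if parent is not None:
            self.children[parent].append(uid)


# Basic, hypothetical API calls: one per kind of content.
def get_literal(store, uid, slot):
    return store.literal[uid].get(slot)


def get_semantic(store, uid, slot):
    return sorted(store.semantic[uid].get(slot, set()))


def get_children(store, uid):
    return list(store.children[uid])


# A composed algorithm built only from the basic calls above.
def names_related_to(store, root_uid, slot, target_uid):
    """Names of concepts at or under root_uid whose slot points at target_uid."""
    hits, stack = set(), [root_uid]
    while stack:
        uid = stack.pop()
        if target_uid in get_semantic(store, uid, slot):
            hits.add(get_literal(store, uid, "name"))
        stack.extend(get_children(store, uid))
    return sorted(hits)


store = TerminologyStore()
store.add(1, "Laboratory Test")
store.add(2, "Total CK Test", parent=1)
store.add(3, "Sodium Test", parent=1)
store.add(10, "Creatine Kinase")
store.semantic[2]["measures"] = {10}
```

The composed function never touches the store's internals directly, so the same external program would keep working if the storage behind the basic calls were replaced.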
BEYOND THE MEDICAL DOMAIN
The concept of
controlled terminologies has, of course, broader implications than just for the
medical domain. A relatively new and intriguing area that has significant
impact on the future need for terminology servers, beyond the medical scope, is
the new paradigm for the World Wide Web (Web): the semantic Web (sWeb). Tim
Berners-Lee and his colleagues [15,16]
outline an environment where software agents execute complicated tasks made
possible by giving information well-defined meaning via the creation of
semantic ontologies and utilizing these to analyze content relationships.
Tools to create, describe and manipulate the content of such
ontologies are in design [17,18] but, in order to assure prompt and timely
response, efficient ontology servers will be required.
One can only
anticipate that, in the future sWeb, a multitude of
ontologies will exist, all needing to converse with one another in an efficient
way. A TS with a small installation footprint and modest hardware requirements
will have an immense advantage over larger, expensive, DB-based tools. The TS
proposed here is such a tool.
DISCUSSION & FUTURE DEVELOPMENT GOALS
As stated by Chute et
al. [11], the differences between IDDs and
clinical terminologies must be recognized and reckoned with. Clinical
terminologies are created by domain experts to set reference standards; their
evolution rate is slow, and they are not designed, nor do they directly offer
tools, to support production CISs. On the other hand,
IDDs are specifically designed to support the everyday
production requirements of large-scale CISs [12]. IDDs
must offer efficient, responsive performance while at the same time supporting
frequent modifications to their structure and content, as dictated by numerous
ancillary systems and required for maintenance of multiple applications by
clinical users, with minimal interruption of daily activities. Therefore, the
need for localized editing tools for maintenance use by personnel who are not
terminology mavens must be emphasized. All this must be achieved in an
environment that supports the coding schemes of the source standardized
terminologies and their intrinsic structure.
The complex semantic
network supported within the IDD can be used as a tool to allow for
semi-automatic classification of new concepts, given that the relevant semantic
slots are instantiated: the more complex the better. Since the semantic network
describes the explicit relationships of concepts and since that information
uses uIDs of other concepts, pre-defined new concepts
can be pushed down the hierarchy to match siblings with similar relationships
or refine parents in a process described by Cimino et
al. [8]. For online creation of new concepts by non-expert users, the
interaction between the IS-A hierarchic location of the new concept and the
semantic values of parents and siblings will delimit the sub-domain within the
range sub-hierarchy and will simplify the selection process for a user. All
changes must then be tested against the operational rules to verify the
semantic integrity of the IDD, with immediate feedback to the user.
Of the operational rules
listed above, close attention should be paid to rules #9 and #10. The design
that allows slots to have an insertion point defined at any level of the
hierarchy, rather than only at the top concept, offers extreme versatility.
These requirements
allow for the peaceful co-existence of subsets from various standardized
terminologies. Coupled with the inheritance rule (#12), powerful inference
algorithms can be generated, taking full advantage of the metadata within
[7,13,14].
This will allow the building of clinical applications that not only reference
concepts from terminologies, but are driven by the logic contained within the
structure of the reference terminologies, in a manner similar to the guided
editing process.
Complementary to the
operational rules, the IDD environment should:
a. Follow the desiderata enumerated by Cimino [5] and Chute et al. [11]
b. Offer editing capabilities for both individual concepts and slots
c. Support multi-user functionality
d. Provide batch process capabilities with error detection
e. Provide validation capabilities for complete hierarchies, sub-hierarchies
and individual concepts
f. Provide logging, audit trails and versioning support
g. Support roll-back/undo capabilities
h. Support variable permission levels for editing of different parts of the
hierarchy and different aspects of the terminology
Following the above
stated functionality, a structured, yet flexible, environment can be created to
support the creation, maintenance and production use of large IDDs compliant
with emerging standardized controlled
clinical terminologies. This environment minimizes the potential for errors and
directs inexperienced users towards the "minimally-best" solution. By
doing so, the potential spectrum of would-be users is expanded and, as a
consequence, the usability of the IDD is increased.
Such a single-concept
editor must be implemented in an environment that supports multi-user
capability and different levels of editing capabilities for pre-defined
sections of the hierarchical tree, and must be accompanied by additional tools
such as a content viewer, a batch editor with support for XML-based syntax for
import/export actions, verification tools and enhanced logic manipulation tools
that will serve as a middleware component between clinical applications and the
TS.
G. References:
1. Lumpkin JR. E-health, HIPAA, and beyond. Health Aff (Millwood). 2000
Nov-Dec;19(6):149-51.
2.
3. Cimino JJ. Terminology tools: state of the art and practical lessons.
Methods Inf Med. 2001;40(4):298-306.
4. http://informatics.cpmc.columbia.edu/homepages/wajngur. The "PreEdit
MEDviewer" button.
5. Cimino JJ. Desiderata for controlled medical vocabularies in the
twenty-first century. Methods Inf Med. 1998 Nov;37(4-5):394-403.
6. Cimino JJ, Clayton PD, Hripcsak G, Johnson SB. Knowledge-based approaches to
the maintenance of a large controlled medical terminology. J Am Med Inform
Assoc. 1994 Jan-Feb;1(1):35-50.
7. Cimino JJ, Johnson SB, Hripcsak G, Hill CL, Clayton PD. Managing vocabulary
for a centralized clinical system. Medinfo. 1995;8 Pt 1:117-20.
8. Elhanan G, Cimino JJ. Controlled vocabulary and design of laboratory results
displays. Proc AMIA Annu Fall Symp. 1997;:719-23.
10. http://www.healthlanguage.com/products/cyber_le.html
11. Chute CG,
12. Stausberg J, Wormek A, Kraut U. Terminological reference of a
knowledge-based system: the data dictionary. Medinfo. 1995;8 Pt 1:157-61.
13. Cimino JJ, Elhanan G, Zeng Q. Supporting infobuttons with terminological
knowledge. Proc AMIA Annu Fall Symp. 1997;:528-32.
14. Elhanan G, Socratous SA, Cimino JJ. Integrating DXplain into a clinical
information system using the World Wide Web. Proc AMIA Annu Fall Symp.
1996;:348-52.
15. Berners-Lee T, Hendler J, Lassila O. The Semantic Web.
http://www.sciam.com/article.cfm?id=the-semantic-web.
16. Berners-Lee T. The semantic road map.
http://www.w3.org/DesignIssues/Semantic.html.
17. Heflin J, Volz R, Dale J. Requirements for a Web ontology language.
http://www.w3.org/TR/webont-req/.
18. Hayes P. RDF model theory. http://www.w3.org/TR/rdf-mt/.