Development of a High-Speed, Low-Cost Terminology Server


 


Abstract:

There is an evident need for efficient terminological tools in today's clinical information systems. These tools must store, convey, and cross-reference the content of multiple emerging standardized terminologies while at the same time supporting local variability.

 

Terminology servers can be used to integrate heterogeneous and fragmented clinical information systems, to store data uniformly in clinical repositories, and to extract and translate data from such repositories. As such, terminology servers are a critical component of modern clinical information systems and must handle large-volume traffic both between ancillary systems and repositories and between repositories and clinical data interfaces. Existing commercial clinical applications lack such capable servers and rely on alternative solutions that become less than optimal as the size and complexity of source and local data dictionaries increase. Commercially available terminology servers are derivatives of terminology editing tools and are not designed for performance. Therefore, high-performance terminology servers are needed for large-scale clinical production systems.

 

This proposal is to design, build and benchmark a robust, compact, generic terminology server, and to demonstrate its feasibility. The server will simultaneously accommodate multiple source terminologies, support a flexible semantic network and offer high-end performance suitable for large-scale clinical information systems. It will operate on mid-level Intel-based computers running a Linux-based operating system, thus providing an affordable, capable solution for small- and medium-sized healthcare enterprises in today's cash-strapped healthcare industry, while still offering the option to use high-end Unix boxes. The terminology server will be designed as an integral module of a future, larger-scale terminology maintenance and development environment. That environment will give healthcare-related institutions the benefit of applying local changes while still maintaining the integrity of the source standardized terminologies and the ability to cross-reference terminologies.


A. Specific aims:

The goal of this Phase I proposal is to design, build and benchmark a software application that can simultaneously serve terminology content of various kinds, such as SNOMED-RT, ICD-9, CPT and LOINC. The Terminology Server (TS) will offer an Application Programming Interface (API) library that will enable external applications to query the terminological content, concept attributes and concept relations.

 

The TS will be optimized for performance, utilizing a shared-memory design rather than a standard database approach. The TS will support a Directed Acyclic Graph (DAG) design for the hierarchic tree, based on a single semantic type (IS-A), while concept representation will be frame-based, utilizing literal attributes and bi-directional semantic pointers of any type. The TS will support multiple parenthood and semantic inheritance. This generic design will assure maximum flexibility for content incorporation.

 

The TS is planned to perform on mid-level Intel-based computers running the Linux operating system. The ultimate goal of this Phase I project is to achieve equivalent performance to the New York Presbyterian Hospital (NYPH) controlled vocabulary server containing the Medical Entities Dictionary (MED), which runs on high-end Unix machines, while serving the same content and executing the same tasks.

 

The following project outline is proposed:

1.      Design shared-memory structure optimized for performance

2.      Implement TS program using the C programming language

3.      Build basic API library

4.      Provide easy installation and running environment with better than 99% uptime

5.      Demonstrate feasibility by uploading SNOMED-RT to the TS

6.      Benchmark the TS by uploading content similar to NYPH’s MED and compare execution of identical tasks in both environments.

 

Upon successful completion of Phase I, a Phase II is planned. The goal of Phase II will be to augment the TS with a user-friendly, terminology editing environment and supporting tools. This environment will support creation of protected “read-only” segments and allow addition and modification of concepts based on various permission levels. This design will allow for institutional variability at the leaf-node level without sacrificing core source terminology content integrity. Widespread use of this design will enable future automated translation of individual coded data from different institutions utilizing the core source standardized terminology.


B. Background and Significance:

Today's healthcare information systems rely more than ever on standardized terminologies to store, convey and mine information. Natural development, practical needs and regulatory measures are likely to increase pressure for use of standardized terminologies and, at the same time, increase the size and coverage of such terminologies [1]. Emerging comprehensive terminologies are complex data structures based on hierarchic and semantic networks that are difficult to comprehend and maintain. On the other hand, local Clinical Information Systems (CIS) are heterogeneous and fragmented environments in need of integration. Institutional Data Dictionaries (IDD) are an important component of the required infrastructure for proposed solutions to support CIS component integration and promote use of standardized terminologies. IDDs can be used by interface and upload engines to translate and bridge between vocabularies of ancillary systems and to transform multiple external vocabularies, such as ICD9-CM or LOINC, into a single coding scheme for storage and retrieval of data from centralized clinical repositories (see Figure 1). Although IDDs might not be expected to be as large as the sum of all vocabularies used in any specific CIS, they do have to maintain the characteristics and structure of the source terminologies, and eventually become large and complex enough to pose significant problems for daily management and utilization. Moreover, IDDs must also allow for local institutional variability [2], a requirement that should not be overlooked and that cannot be replaced by the use of emerging standardized terminologies. Therefore, a generic, environment-independent terminology tool that can simultaneously support multiple controlled terminologies, complete or partial, cross-reference concepts and support local editorial work while maintaining overall component integrity is advantageous.

 

CISs deal with very high-volume traffic, and performance is critical for usability and acceptance of computer-based clinical applications. Every coded data element stored in the clinical repository interacts with the IDD multiple times during its life cycle. Therefore, a very fast and responsive TS is a critical component of any modern, efficient CIS.

 

Moreover, the incorporation of controlled IDDs in the infrastructure of CIS offers additional advantages. Logic engines can be created to execute sophisticated queries to the clinical repositories [3]. For example, if an institution has a clinical laboratory system that performs several different instances of Total Creatine Kinase tests and Creatine Kinase Isoenzyme tests, organizing classes can be created in the IDD and all the individual tests can be mapped to them respectively (see Figure 2.1). When the need exists to retrieve all of a specific patient's Creatine Kinase tests, the clinician no longer needs to search for each and every individual test but can simply search by the class; the IDD will enable the automatic expansion from the class to its member tests. Such possibilities free users and builders of clinical systems from the need to have intrinsic knowledge of each and every data element (see Figures 2.2 and 2.3).
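The class expansion described above can be illustrated with a minimal C sketch. It is not the proposed TS implementation: the `Concept` layout, the fixed array limits, the `expand_class` function and the sample concept names in `ck_table` are all assumptions made for illustration, and a uID is taken to be simply an index into the table.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CONCEPTS 16
#define MAX_CHILDREN 4

/* Hypothetical in-memory concept; a uID is an index into the table. */
typedef struct {
    const char *name;
    int children[MAX_CHILDREN];
    int n_children;
} Concept;

/* Illustrative content modeled on the Creatine Kinase example. */
static const Concept ck_table[] = {
    /* 0 */ {"Laboratory Test",       {1},    1},
    /* 1 */ {"Creatine Kinase Test",  {2, 3}, 2},
    /* 2 */ {"Total CK Test",         {4, 5}, 2},
    /* 3 */ {"CK Isoenzyme Test",     {6},    1},
    /* 4 */ {"Total CK, analyzer A",  {0},    0},
    /* 5 */ {"Total CK, analyzer B",  {0},    0},
    /* 6 */ {"CK-MB Isoenzyme",       {0},    0},
};

/* Collect every descendant of `root` (excluding root itself) into `out`.
 * `seen` prevents a concept with several parents from being reported twice.
 * Returns the number of uIDs written (at most max_out). */
static int expand_class(const Concept *table, int root, int *out, int max_out)
{
    unsigned char seen[MAX_CONCEPTS] = {0};
    int stack[MAX_CONCEPTS];   /* each concept is pushed at most once */
    int top = 0, count = 0;

    assert(root >= 0 && root < MAX_CONCEPTS);
    stack[top++] = root;
    seen[root] = 1;
    while (top > 0) {
        int cur = stack[--top];
        for (int i = 0; i < table[cur].n_children; i++) {
            int child = table[cur].children[i];
            if (seen[child])
                continue;
            seen[child] = 1;
            if (count < max_out)
                out[count++] = child;
            stack[top++] = child;
        }
    }
    return count;
}
```

Under these assumptions, a query for the class "Creatine Kinase Test" (uID 1) returns the five uIDs beneath it, which an application could then use to select the matching results in the clinical repository.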

 

 

 

Pivotal to successful implementation of TS is its ability to incorporate a variety of controlled terminologies, in response to local clinical needs. Medical terminologies, in general, do not share a unified structural model and therefore a flexible and generic design of the TS is required in order to accommodate the rigid and disparate models of existing medical terminologies.

 

The design proposed below assumes such generic requirements. The single assumption at the base of the TS model is that the hierarchical tree of all the components contained within a single IDD that the TS has to serve is based on a single semantic type. This limitation does not, however, dictate which semantic type is used.

 

Any classification system that is based on a single semantic type can be represented by a Directed Acyclic Graph (DAG) structure. By extension, controlled vocabularies can be represented by a DAG coupled with a semantic network. The IS-A semantic type is usually used for the basic hierarchic structure of the DAG but can be replaced by any other type. The DAG structure is the most common and intuitive way to view the content of the vocabulary, since it offers a two-dimensional, tree-like hierarchy that is relatively easy to comprehend. The added value of the controlled terminology, namely the inter-relationships between concepts and their attributes, cannot be conveyed by the DAG alone. For this purpose, semantic relationships of various types are created. The characteristics of a relationship are represented by its semantic type, while the explicit inter-vocabulary relationship is represented by direct pointers between the concepts involved. Naturally, semantic relationships come as bi-directional pairs: if concept {A} uses a specific semantic type to point at concept {B}, then concept {B} will have a reciprocal semantic type pointing at concept {A}.
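The bi-directional pairing of semantic pointers can be sketched in C as follows. This is an illustrative sketch only: the slot types `SLOT_MEASURES` and `SLOT_MEASURED_BY`, the `Frame` layout, the fixed capacities and all function names are hypothetical and are not part of any existing API.

```c
#include <assert.h>

#define MAX_VALUES   8
#define N_SLOT_TYPES 2

/* Hypothetical semantic types forming one reciprocal pair. */
enum { SLOT_MEASURES = 0, SLOT_MEASURED_BY = 1 };

/* reciprocal_of[t] is the semantic type that points back for type t. */
static const int reciprocal_of[N_SLOT_TYPES] = {
    SLOT_MEASURED_BY, SLOT_MEASURES
};

/* Semantic part of a concept frame: per slot type, the uIDs pointed at. */
typedef struct {
    int values[N_SLOT_TYPES][MAX_VALUES];
    int n_values[N_SLOT_TYPES];
} Frame;

static int slot_contains(const Frame *f, int type, int uid)
{
    for (int i = 0; i < f->n_values[type]; i++)
        if (f->values[type][i] == uid)
            return 1;
    return 0;
}

/* True if `uid` is already present or can still be appended. */
static int slot_has_room(const Frame *f, int type, int uid)
{
    return slot_contains(f, type, uid) || f->n_values[type] < MAX_VALUES;
}

static void slot_add(Frame *f, int type, int uid)
{
    if (!slot_contains(f, type, uid))
        f->values[type][f->n_values[type]++] = uid;
}

/* Point concept `a` at concept `b` through `type` and instantiate the
 * reciprocal pointer from `b` back to `a`. Both sides are checked first,
 * so the pair is either fully created or not created at all.
 * Returns 1 on success, 0 if either slot is full. */
static int link_concepts(Frame *frames, int a, int type, int b)
{
    assert(type >= 0 && type < N_SLOT_TYPES);
    if (!slot_has_room(&frames[a], type, b) ||
        !slot_has_room(&frames[b], reciprocal_of[type], a))
        return 0;
    slot_add(&frames[a], type, b);
    slot_add(&frames[b], reciprocal_of[type], a);
    return 1;
}
```

In this sketch the caller names only one side of the pair; the reciprocal side is derived from the `reciprocal_of` table, so the two pointers cannot drift apart.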

 

To model multi-dimensional networks by two-dimensional DAGs, each concept must be represented by a two-dimensional matrix, resulting in a frame-based controlled vocabulary. Each frame consists of multiple slots that may be of two types: semantic slots contain semantic pointers that maintain the explicit relationship between two concepts in different sub-hierarchies, while literal slots contain descriptive information about a concept. The semantic network reshapes the two-dimensional hierarchic tree into an {n+2}-dimensional structure, where n stands for the number of semantic pairs. Each semantic pair inter-relates two non-overlapping sub-hierarchies of the DAG, thus creating a space that serves as the basis for the introduction of inheritance within the hierarchic tree and the rules for its implementation.

 

The Medical Entities Dictionary (MED) [4] is an example of such an IDD-Terminology Server that serves as a core component of its CIS. This frame-based controlled vocabulary follows Cimino's controlled medical dictionary desiderata [5]. Any editing effort must maintain and support the semantic integrity of the affected hierarchies as well as the multi-parental structure [6,7]. The degree of complexity of the vocabulary is not simply a function of the number of concepts but is also exponentially related to the number of semantic types maintained within. Even experienced human editors can easily introduce errors when editing a complex, frame-based DAG. Paradoxically, the MED experience has shown that the same complexity, as long as integrity and logic are maintained, can guide automated algorithms to support maintenance tasks and facilitate non-domain expert interaction with the terminology [3,8].

 

This proposal offers to create a TS that will serve CISs just as the MED server supports the NYPH CIS, but in a more compact, cost-effective manner, so that even small and mid-sized healthcare institutions will be able to implement and utilize the TS.

 

OPERATIONAL RULES

Any controlled vocabulary is a formal knowledge representation language for a particular domain and a set of self-consistent operational rules must be enforced on the data structure that represents the knowledge. These rules are enumerated below:

1. Acyclicity: a concept may not be a descendant of any of its own descendants, and each concept must have at least one parent (except, by definition, the top concept).

2. Each concept must have a unique normalized name and a unique system identifier (uID).

3. Slots may contain more than one value but may be limited, by definition, to single values.

4. Literal slot values are not inherited. These slot values must be explicitly instantiated by the user.

5. Literal slots may contain any type of information.

6. Semantic slot values are inherited by default (Figure 3) but may also be explicitly instantiated by the user, subject to Rule 12, below.

7. Semantic slots contain only the unique ID (uID) values of other concepts.

8. Semantic slots consist of pairs that point to one another (Figure 3). Thus, if concept {A} has semantic slot [x] containing the uID of concept {B}, then concept {B} will have a slot [y] pointing to concept {A}; [x][y] are a semantic pair. When explicitly instantiating a particular semantic slot, the user can arbitrarily choose either slot of the pair. The reciprocal slot will automatically be instantiated. Inheritance of the value will proceed for both slots in accordance with Rule 12.

9. All slots, literal or semantic, have a specified domain, i.e. a specific sub-hierarchy within the terminology where the slot is defined.

10. Semantic slots must have a defined range, i.e. a sub-hierarchy within the terminology from which concepts may be applied to a specific slot. This range is simultaneously the domain of the reciprocal semantic slot.

11. Any sub-hierarchy in the terminology is identified by its root concept. Thus every slot becomes defined at a unique concept in the terminology called its insertion point. The "defined" status of a specific slot is automatically activated for all descendants of the insertion point.

12. Refinement inheritance of semantic slots (Figure 3): semantic slot values may be refined either by inheritance or by explicit instantiation. The following sub-rules apply: (a) A more refined value (a descendant), inherited or explicit, will replace a less refined value (its ancestor). (b) A semantic slot will only hold the most specific value (i.e. the most refined) if several ancestor-descendant values coexist by inheritance or explicit instantiation. (c) Multiple values in a semantic slot must not have an ancestor-descendant relationship with each other.

13. Only explicit instantiations can be removed from semantic slots. An inherited slot value cannot be removed on its own. Any removal of an explicit instantiation from a semantic slot also removes its inherited values and affects its reciprocal slot symmetrically.

14. Domain and range may not overlap.
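As an illustration of how Rule 1 might be enforced at edit time, the following C sketch refuses to add an IS-A link that would close a cycle. The `Node` layout, the fixed limits and the function names are assumptions made for this example; the sketch covers only the acyclicity half of the rule and does not check that every concept retains at least one parent.

```c
#include <assert.h>

#define MAX_CONCEPTS 16
#define MAX_PARENTS  4

/* Hypothetical hierarchy node: only the IS-A parents are kept here. */
typedef struct {
    int parents[MAX_PARENTS];
    int n_parents;
} Node;

/* Returns 1 if `candidate` is `node` itself or one of its ancestors. */
static int is_self_or_ancestor(const Node *dag, int node, int candidate)
{
    unsigned char seen[MAX_CONCEPTS] = {0};
    int stack[MAX_CONCEPTS];   /* each node is pushed at most once */
    int top = 0;

    stack[top++] = node;
    seen[node] = 1;
    while (top > 0) {
        int cur = stack[--top];
        if (cur == candidate)
            return 1;
        for (int i = 0; i < dag[cur].n_parents; i++) {
            int p = dag[cur].parents[i];
            if (!seen[p]) {
                seen[p] = 1;
                stack[top++] = p;
            }
        }
    }
    return 0;
}

/* Make `parent` a parent of `child` unless `child` already lies at or
 * above `parent`, in which case the new link would create a cycle.
 * Multiple parenthood is allowed. Returns 1 on success, 0 on refusal. */
static int add_parent(Node *dag, int child, int parent)
{
    assert(child >= 0 && child < MAX_CONCEPTS);
    assert(parent >= 0 && parent < MAX_CONCEPTS);

    if (is_self_or_ancestor(dag, parent, child))
        return 0;                               /* would violate Rule 1 */
    for (int i = 0; i < dag[child].n_parents; i++)
        if (dag[child].parents[i] == parent)
            return 1;                           /* link already present */
    if (dag[child].n_parents >= MAX_PARENTS)
        return 0;
    dag[child].parents[dag[child].n_parents++] = parent;
    return 1;
}
```

The check walks upward from the prospective parent, so its cost depends on the size of that concept's ancestry rather than on the size of the whole vocabulary.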

 

REPRESENTATIONAL MODELS

In general, there are two contrasting models for data structures used to represent concepts within a frame-based DAG. Once a modeling type is chosen, one is committed to maintaining the data structures accordingly, since dependent applications expect consistency. The static model is intuitively simple: each concept holds all of its slot attributes, semantic and non-semantic (Figure 4). It can further be enhanced by pre-calculating and storing within the concept its ancestry and descendants in addition to parents and children. The static model is most suitable for production systems for which response time is of utmost importance. In such an environment applications use, but do not change, the vocabulary. While this model offers speed, it is not optimal for editorial tasks, which do change the vocabulary. Changes in parent-child relationships or semantic values, as well as changes to slot definitions, require extensive calculations for propagation down the descendant tree. The number of concepts that may be affected by the semantic propagation process grows exponentially with the remaining depth of the sub-hierarchy of the changed concept.

 

 

The alternative to a static model is a dynamic model (Figure 4) that utilizes the operational rules of the terminology and performs propagation calculations only when needed. In the dynamic model only the explicit slot values are stored. Concepts viewed using the dynamic model are partially virtual, since all of the non-explicit semantic values are calculated on-the-fly based on the operational rules stated above. Traversing up the ancestry tree and additional calculations are required to display the actual semantic content of a concept. All editorial changes are applied only to explicit values of the affected concept. The most time-consuming operation, semantic propagation, can be limited to enforcing the operational rules of acyclicity and refinement inheritance and can be performed in the background. The dynamic model is advantageous in undo situations as well, since the only affected concepts are the main concept being edited and its explicit slot-reciprocals. Therefore, there is no need to undo changes for each and every descendant of the main concept and those of the reciprocals. Typically, the dynamic model is approximately two-thirds the size of the static model.
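The on-the-fly calculation used by the dynamic model can be sketched in C for a single semantic slot. Only explicit values are stored; the effective values of a concept are gathered from its ancestry and then filtered by refinement, so that a value is dropped whenever one of its descendants is also present (Rule 12). The data layout, the concept names in `demo` and all function names are hypothetical, and the recursive traversal is chosen for brevity rather than speed.

```c
#include <assert.h>

#define MAX_CONCEPTS 16
#define MAX_PARENTS  4
#define MAX_VALUES   8

/* Hypothetical concept with one semantic slot; only explicit values
 * are stored, as in the dynamic model. */
typedef struct {
    int parents[MAX_PARENTS];
    int n_parents;
    int explicit_vals[MAX_VALUES];
    int n_explicit;
} Concept;

/* Illustrative content: concepts 1, 2, 6 form the slot's domain and
 * concepts 3, 4, 5 form its range. */
static const Concept demo[] = {
    /* 0 top             */ {{0}, 0, {0}, 0},
    /* 1 Lab Test        */ {{0}, 1, {3}, 1},
    /* 2 CK Test         */ {{1}, 1, {4}, 1},
    /* 3 Specimen        */ {{0}, 1, {0}, 0},
    /* 4 Blood Specimen  */ {{3}, 1, {0}, 0},
    /* 5 Serum Specimen  */ {{4}, 1, {0}, 0},
    /* 6 Total CK Test   */ {{2}, 1, {0}, 0},
};

/* Returns 1 if `anc` is a proper ancestor of `node`. */
static int is_ancestor(const Concept *t, int anc, int node)
{
    for (int i = 0; i < t[node].n_parents; i++) {
        int p = t[node].parents[i];
        if (p == anc || is_ancestor(t, anc, p))
            return 1;
    }
    return 0;
}

/* Collect explicit values of `node` and all its ancestors, no duplicates. */
static void gather(const Concept *t, int node, int *vals, int *n)
{
    for (int i = 0; i < t[node].n_explicit; i++) {
        int v = t[node].explicit_vals[i], dup = 0;
        for (int j = 0; j < *n; j++)
            if (vals[j] == v)
                dup = 1;
        if (!dup && *n < MAX_VALUES)
            vals[(*n)++] = v;
    }
    for (int i = 0; i < t[node].n_parents; i++)
        gather(t, t[node].parents[i], vals, n);
}

/* Effective slot values of `node`: gathered values minus every value
 * that has a more refined (descendant) value in the same set. */
static int effective_values(const Concept *t, int node, int *out)
{
    int vals[MAX_VALUES], n = 0, kept = 0;

    assert(node >= 0 && node < MAX_CONCEPTS);
    gather(t, node, vals, &n);
    for (int i = 0; i < n; i++) {
        int refined = 0;
        for (int j = 0; j < n && !refined; j++)
            if (j != i && is_ancestor(t, vals[i], vals[j]))
                refined = 1;
        if (!refined)
            out[kept++] = vals[i];
    }
    return kept;
}
```

In this example "Total CK Test" stores no value of its own, yet its effective value is "Blood Specimen": the value inherited from "CK Test" refines and replaces the more general "Specimen" inherited from "Lab Test".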

 

IMPLEMENTATION OUTLINE AND COMPARISON

Currently there are no standards for, and no dominant applications that function as, an IDD TS. Besides the single instance of the MED, only two commercial products exist, and to this day both have failed to establish any significant market share or actual production implementations. Apelon's products [9] use relatively rigid description logic to support editing and provide a run-time environment that can be utilized as a query mechanism for the content. A relatively recent newcomer is Health Language Cyber+LE [10]. This product comes as an integrated package that relies on a single content source, SNOMED-RT, allows user modifications and supports content queries. Both products rely on conventional databases for content storage (Oracle 8, MS SQL Server), which add to the initial cost, the complexity of installation and management, and the maintenance costs. Both products are expensive and limited in functionality, and both lack significant clinical installations and performance evaluations. Initial reviews do not indicate that they perform robustly enough to serve as online terminology servers for large-scale, heavy-load clinical systems.

 

A prototype design of such a system, in a protracted beta phase, has existed for more than 10 years at Columbia Presbyterian Medical Center (CPMC) of the New York Presbyterian Hospital (NYPH), New York: the Medical Entities Dictionary (MED) [4]. The MED has been the subject of many publications in the Medical Informatics literature. However, despite its pivotal operational role as an integral part of a large-scale, sophisticated CIS at NYPH, its success has remained an isolated incident. Among the reasons for the lack of widespread adoption of the MED concept were that, at the time, no standardized content was available, that the CPMC content was regarded as highly proprietary, and that the MED required a highly unique clinical and IS environment.

 

The MED was ahead of its time. At CPMC, the MED is used for the standardized coding and translation of most data items stored in, and extracted from, the clinical repository for clinical applications, administrative tasks and research projects. The lack of additional development and modernization of the MED over the years makes it a poor candidate as a solution for today's clinical applications. Nevertheless, it exemplifies the potential role of IDDs and TSs in modern CISs.

 

Based on the MED's example and the deficiencies in other existing terminological tools, this Phase I proposal aims to create a highly efficient TS that can operate either on Intel-based computers or on high-end Unix boxes without the need to incorporate traditional databases, thus significantly reducing acquisition and implementation costs. The lack of a standard database should not be viewed as a shortfall of the proposed system but rather as an advantage. Maintaining all content in optimized shared memory offers a much more efficient execution environment and simpler implementation and maintenance. Updates can be executed without the need for database re-indexing, and no knowledge of an underlying schema will be required. The volume and size of source terminologies are expected to be a minor issue, since in the Linux/Unix environment the main limiting factor is the amount of installed RAM, which today poses no implementation barrier. The API library offered with the TS will provide access to all basic content of the source terminology (i.e. literal slots, including name, and semantic relations) by type. Additionally, the API library functions will be composable by external programs to create additional algorithms that manipulate content.
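One way to lay out terminology content so that it can live in shared memory is to refer to records by byte offset and array index rather than by raw pointer, so that the content remains valid at whatever address each process maps it. The C sketch below illustrates this idea under stated assumptions: the `SegHeader` and `ConceptRec` layouts, the `SEG_MAGIC` value and both functions are hypothetical, the uID is taken to be the record index, and `malloc` stands in for the shared-memory mapping that a real server would obtain from the operating system.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SEG_MAGIC 0x54533031u  /* hypothetical segment signature */
#define NAME_LEN  32

/* Segment header: locations are stored as offsets from the segment
 * start, never as pointers, so the segment is position independent. */
typedef struct {
    uint32_t magic;
    uint32_t n_concepts;
    uint32_t concepts_off;     /* byte offset of the ConceptRec array */
} SegHeader;

typedef struct {
    uint32_t uid;              /* equals the record's array index */
    uint32_t parent_idx;       /* index of one parent; top points to itself */
    char     name[NAME_LEN];   /* literal slot: normalized name */
} ConceptRec;

/* Build a segment holding `n` concepts. malloc is only a stand-in for a
 * shared-memory mapping. Returns NULL on allocation failure. */
static void *seg_create(const char *const *names, const uint32_t *parents,
                        uint32_t n, size_t *size_out)
{
    size_t size = sizeof(SegHeader) + (size_t)n * sizeof(ConceptRec);
    unsigned char *base = malloc(size);
    if (base == NULL)
        return NULL;

    SegHeader *hdr = (SegHeader *)base;
    hdr->magic = SEG_MAGIC;
    hdr->n_concepts = n;
    hdr->concepts_off = (uint32_t)sizeof(SegHeader);

    ConceptRec *recs = (ConceptRec *)(base + hdr->concepts_off);
    for (uint32_t i = 0; i < n; i++) {
        assert(parents[i] < n);
        recs[i].uid = i;
        recs[i].parent_idx = parents[i];
        strncpy(recs[i].name, names[i], NAME_LEN - 1);
        recs[i].name[NAME_LEN - 1] = '\0';
    }
    *size_out = size;
    return base;
}

/* Constant-time lookup by uID; works on any copy or mapping of the
 * segment. Returns NULL for a bad segment or an unknown uID. */
static const ConceptRec *seg_lookup(const void *seg, uint32_t uid)
{
    const unsigned char *base = seg;
    const SegHeader *hdr = (const SegHeader *)base;

    if (hdr->magic != SEG_MAGIC || uid >= hdr->n_concepts)
        return NULL;
    const ConceptRec *recs = (const ConceptRec *)(base + hdr->concepts_off);
    return &recs[uid];
}
```

Because nothing in the segment depends on its address, a byte-for-byte copy behaves exactly like the original, and a lookup is a bounds check followed by an array access with no index structure to rebuild after an update.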

 

BEYOND THE MEDICAL DOMAIN

The concept of controlled terminologies has, of course, broader implications than just for the medical domain. A relatively new and intriguing area that has significant impact on the future need for terminology servers, beyond the medical scope, is the new paradigm for the World Wide Web (Web): the semantic Web (sWeb). Tim Berners-Lee and his colleagues [15,16] outline an environment where software agents execute complicated tasks, made possible by giving information well-defined meaning through the creation of semantic ontologies and by using these ontologies to analyze content relationships. Tools to create, describe and manipulate the content of such ontologies are being designed [17,18], but efficient ontology servers will be required in order to assure prompt and timely responses.

 

One can anticipate that, in the future sWeb, a multitude of ontologies will exist, all needing to converse with one another efficiently. A TS with a small installation footprint and modest hardware requirements will have an immense advantage over larger, expensive, database-based tools. The TS proposed here is such a tool.

 

DISCUSSION & FUTURE DEVELOPMENT GOALS

As stated by Chute et al. [11], the differences between IDDs and clinical terminologies must be recognized and reckoned with. Clinical terminologies are created by domain experts to set reference standards; their evolution rate is slow and they are not designed to support production CISs, nor do they directly offer tools to do so. On the other hand, IDDs are specifically designed to support the everyday production requirements of large-scale CISs [12]. IDDs must offer efficient, responsive performance while at the same time supporting frequent modifications to their structure and content, as dictated by numerous ancillary systems and by the maintenance of multiple applications, made by clinical users with minimal interruption of daily activities. Therefore, the need for localized editing tools for maintenance use by personnel who are not terminology mavens must be emphasized. All this must be achieved in an environment that supports the coding schemes of the source standardized terminologies and their intrinsic structure.

 

The complex semantic network supported within the IDD can be used as a tool to allow for semi-automatic classification of new concepts, given that the relevant semantic slots are instantiated: the more complex the better. Since the semantic network describes the explicit relationships of concepts and since that information uses uIDs of other concepts, pre-defined new concepts can be pushed down the hierarchy to match siblings with similar relationships or refine parents in a process described by Cimino et al. [8]. For online creation of new concepts by non-expert users, the interaction between the IS-A hierarchic location of the new concept and the semantic values of parents and siblings will delimit the sub-domain within the range sub-hierarchy and will simplify the selection process for a user. All changes must then be tested against the operational rules to verify the semantic integrity of the IDD with immediate feedback to the user.

 

Of the operational rules listed above, close attention should be paid to rules #9 and #10. The design that allows slots to have an insertion point defined at any level of the hierarchy, rather than the top concept, offers extreme versatility. These requirements allow for the peaceful co-existence of subsets from various standardized terminologies. Coupled with the inheritance rule (#12), powerful inference algorithms can be generated, taking full advantage of the metadata within [7,13,14]. This will allow the building of clinical applications that not only reference concepts from terminologies, but are driven by the logic contained within the structure of the reference terminologies, in a manner similar to the guided editing process.

 

Complementary to the operational rules, the IDD environment should:

a. Follow the desiderata enumerated by Cimino [5] and Chute et al. [11]

b. Offer editing capabilities for both individual concepts and slots

c. Support multi-user functionality

d. Provide batch process capabilities with error detection

e. Provide validation capabilities for complete hierarchies, sub-hierarchies and individual concepts

f. Provide logging, audit trails and versioning support

g. Support roll-back/undo capabilities

h. Support variable permission levels for editing of different parts of the hierarchy and different aspects of the terminology

 

Following the functionality stated above, a structured, yet flexible, environment can be created to support the creation, maintenance and production use of large IDDs compliant with emerging standardized controlled clinical terminologies. This environment minimizes the potential for errors and directs inexperienced users towards the "minimally-best" solution. By doing so, the potential spectrum of would-be users is expanded and, as a consequence, the usability of the IDD is increased.


 

Such a single-concept editor must be implemented in an environment that supports multi-user capability and different levels of editing capabilities for pre-defined sections of the hierarchical tree. It must also be accompanied by additional tools such as a content viewer, a batch editor with support for XML-based syntax for import/export actions, verification tools, and enhanced logic manipulation tools that will serve as a middleware component between clinical applications and the TS.

 

 



G. References:


1. Lumpkin JR. E-health, HIPAA, and beyond. Health Aff (Millwood). 2000 Nov-Dec;19(6):149-51.

2. Bowie J. Moving to Fact-Based Care. Healthc Inform. 2002 Jan;19(1):96.

3. Cimino JJ. Terminology tools: state of the art and practical lessons. Methods Inf Med. 2001;40(4):298-306.

4. http://informatics.cpmc.columbia.edu/homepages/wajngur. The "PreEdit MEDviewer" button.

5. Cimino JJ. Desiderata for controlled medical vocabularies in the twenty-first century. Methods Inf Med. 1998 Nov;37(4-5):394-403.

6. Cimino JJ, Clayton PD, Hripcsak G, Johnson SB. Knowledge-based approaches to the maintenance of a large controlled medical terminology. J Am Med Inform Assoc. 1994 Jan-Feb;1(1):35-50.

7. Cimino JJ, Johnson SB, Hripcsak G, Hill CL, Clayton PD. Managing vocabulary for a centralized clinical system. Medinfo. 1995;8 Pt 1:117-20.

8. Elhanan G, Cimino JJ. Controlled vocabulary and design of laboratory results displays. Proc AMIA Annu Fall Symp. 1997;:719-23.

9. http://www.apelon.com/

10. http://www.healthlanguage.com/products/cyber_le.html

11. Chute CG, Elkin PL, Sherertz DD, Tuttle MS. Desiderata for a clinical terminology server. Proc AMIA Symp. 1999;:42-6.

12. Stausberg J, Wormek A, Kraut U. Terminological reference of a knowledge-based system: the data dictionary. Medinfo. 1995;8 Pt 1:157-61.

13. Cimino JJ, Elhanan G, Zeng Q. Supporting infobuttons with terminological knowledge. Proc AMIA Annu Fall Symp. 1997;:528-32.

14. Elhanan G, Socratous SA, Cimino JJ. Integrating DXplain into a clinical information system using the World Wide Web. Proc AMIA Annu Fall Symp. 1996;:348-52.

15. Berners-Lee T, Hendler J, Lassila O. The Semantic Web. http://www.sciam.com/article.cfm?id=the-semantic-web.

16. Berners-Lee T. The semantic road map. http://www.w3.org/DesignIssues/Semantic.html.

17. Heflin J, Volz R, Dale J. Requirements for a Web ontology language. http://www.w3.org/TR/webont-req/.

18. Hayes P. RDF model theory. http://www.w3.org/TR/rdf-mt/.