SIIM
 
Scientific Abstracts

Overlap of Selected Ontologies in the Context of the Breast Cancer Domain

 
Authors:
Zharko Aleksovski, PhD, Philips Research Europe; Richard Vdovjak, MD
 
Hypothesis:

There is only a minor breast cancer domain coverage agreement among the key UMLS-covered ontologies.

 
Introduction:
Ontologies, such as those included in the UMLS Metathesaurus, represent explicit medical knowledge.[1] The use of these ontologies is essential in many computer decision support systems (CDSS). Many of these ontologies are created for a specific purpose, which allows them to provide more precise advice tailored to a particular clinical context. However, it is unclear which subset of the available ontologies is relevant in a particular clinical context. Moreover, little is known about the extent to which such ontologies overlap in the context of a specific medical sub-domain. In this paper we analyze the coverage and overlap of the breast cancer sub-domain in four key UMLS-covered ontologies: SNOMED-CT, MeSH, NCI, and ICD10. In particular, we test our hypothesis that there is only a minor breast cancer domain coverage agreement among these ontologies. The hypothesis is based on the intuition that the different purposes of the ontologies will result in different modelling of the same domain.
 
Methods:

Breast cancer domain.

 

An ontology is a formalization of the knowledge in a domain, and can be understood as a set of concepts with relations among them.[2] In this study we analyze the concepts from the ontologies. By a sub-domain we mean a subset of the concepts that share a common characteristic. Breast cancer is one such sub-domain. However, it is not straightforward to determine which concepts represent this sub-domain. On the one hand, for concepts like “Malignant neoplasm of breast” and “Broken leg” it is easy to agree whether they are relevant to breast cancer or not. On the other hand, “Benign neoplasm of breast” cannot be clearly classified. Benign neoplasms can possibly develop into malignant ones (breast cancer), but they are not breast cancer as such. These examples show that concept membership in this sub-domain is fuzzy. For the purpose of this study we identified the breast cancer sub-domain based on the concepts' names, the structure of the ontologies, and UMLS. We define the breast cancer sub-domain as the set of concepts obtained in a three-step extraction process: query matching, subconcept expansion, and UMLS expansion. This procedure is fully automatic, involves no manual work and, as such, guarantees that the obtained results are repeatable.

 

For the first phase, query matching, we developed a list of queries, shown in Table 1. Each query consists of words which, when found in a concept's label, indicate that the concept belongs to the breast cancer sub-domain. All concepts that matched at least one of the queries in the list were selected. A concept matches a query when its name, or one of its synonyms, contains all the words in the query as substrings. The comparison ignored capitalization. In the second phase, subconcept expansion, we used the isa hierarchy of the ontologies. For each concept retrieved in the first phase, we added all of its subconcepts. Then, we iteratively added all the subconcepts of the newly added ones, until no further concepts could be added. The third phase, UMLS expansion, proceeded after the first two phases were finished for all four candidate ontologies. In this phase, we used UMLS to identify additional concepts. UMLS integrates these ontologies in such a way that it indicates which pairs of concepts from different ontologies are equivalent. For each ontology in this phase, we looked at the breast cancer concepts identified in the other three ontologies and used UMLS to check whether the ontology being analyzed contained equivalent concepts that were missed in the first two phases.
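The three phases described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the data layout (dictionaries of names, child lists, and cross-ontology equivalence pairs), the function names, the toy concept identifiers, and the example query are all assumptions made for this sketch, and the actual query list is the one in Table 1.

```python
from collections import deque

def matches(query, names):
    """True if any name contains every query word as a case-insensitive substring."""
    words = [w.lower() for w in query.split()]
    return any(all(w in name.lower() for w in words) for name in names)

def query_matching(concepts, queries):
    """Phase 1. concepts: {concept_id: [name, synonym, ...]} (assumed layout)."""
    return {cid for cid, names in concepts.items()
            if any(matches(q, names) for q in queries)}

def subconcept_expansion(seeds, children):
    """Phase 2. children: {concept_id: [direct isa subconcepts]} (assumed layout).
    Adds subconcepts repeatedly until no new concept appears."""
    selected = set(seeds)
    frontier = deque(seeds)
    while frontier:
        cid = frontier.popleft()
        for child in children.get(cid, []):
            if child not in selected:
                selected.add(child)
                frontier.append(child)
    return selected

def umls_expansion(selected, equivalent_pairs):
    """Phase 3. selected: {ontology: set of ids} after phases 1 and 2.
    equivalent_pairs: [((onto_a, id_a), (onto_b, id_b)), ...], a stand-in
    for the equivalences that UMLS provides (assumed layout)."""
    expanded = {onto: set(ids) for onto, ids in selected.items()}
    for (onto_a, id_a), (onto_b, id_b) in equivalent_pairs:
        if onto_a == onto_b:
            continue
        # Membership is tested against the phase-2 result only,
        # so additions made in this phase do not cascade.
        if id_a in selected.get(onto_a, set()):
            expanded.setdefault(onto_b, set()).add(id_b)
        if id_b in selected.get(onto_b, set()):
            expanded.setdefault(onto_a, set()).add(id_a)
    return expanded

# Toy data (hypothetical identifiers, not taken from any real ontology).
concepts_a = {
    "a1": ["Malignant neoplasm of breast", "Breast cancer"],
    "a2": ["Broken leg"],
    "a3": ["Ductal carcinoma in situ"],
}
children_a = {"a1": ["a3"]}
queries = ["breast cancer"]  # hypothetical query; see Table 1 for the real list

seeds = query_matching(concepts_a, queries)
after_phase2 = subconcept_expansion(seeds, children_a)
final = umls_expansion({"A": after_phase2, "B": set()},
                       [(("A", "a1"), ("B", "b7"))])
```

In this toy run, "a1" is matched through its synonym, "a3" is added as its subconcept even though its name matches no query, and "b7" is added to the second ontology only because it is marked equivalent to "a1".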

 

From each of the four candidate ontologies, this procedure produced a subset of breast-cancer-related concepts that, in the scope of this study, represents the breast cancer sub-domain. The results are displayed in Table 2.

 

Breast cancer domain coverage agreement.

 

We define coverage agreement between two sets of concepts as the ratio between the cardinality of their intersection and the cardinality of their union. This ratio indicates to what extent the two sets of concepts are the same: a ratio of 0 means they are disjoint, and a ratio of 1 means they are identical. The coverage agreement among the breast cancer sub-domains of the four analyzed ontologies is shown in Table 3, in the form of percentages. The highest agreement is 7.15%, which means that, as expected by our hypothesis, the agreement between the different ontologies in the context of the breast cancer sub-domain is relatively close to 0 for any pair of analyzed ontologies.
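The coverage agreement defined above is the ratio of intersection size to union size, which can be computed directly on sets. The sketch below assumes that concepts from different ontologies have already been mapped to shared identifiers (for example through the UMLS equivalences), so that set intersection is meaningful. The function name, the example identifiers, and the choice to return 0 for two empty sets are assumptions of this sketch.

```python
def coverage_agreement(a, b):
    """Ratio of |A intersect B| to |A union B| for two collections of concept ids."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: agreement of two empty sets is defined as 0
    return len(a & b) / len(union)

# Hypothetical concept identifiers: one shared concept out of four in total.
agreement = coverage_agreement({"c1", "c2", "c3"}, {"c3", "c4"})
print(f"{agreement:.2%}")
```

Because intersection and union are both symmetric, swapping the two arguments gives the same value.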

 

Breast cancer domain ontology containment.

 

In addition to the agreement, another useful measure is the containment between the breast cancer sub-domains of the ontologies. We define the containment of set A in set B as the ratio between the cardinality of their intersection and the cardinality of A. Table 4 shows the computed containment percentages between each pair of ontologies.
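Containment divides by the size of A rather than the size of the union, so unlike coverage agreement it depends on the order of its arguments. The following sketch illustrates this with hypothetical identifiers. The function name and the handling of an empty A are assumptions, and concepts are again assumed to share identifiers across ontologies.

```python
def containment(a, b):
    """Ratio of |A intersect B| to |A|: the share of A that is also in B."""
    a, b = set(a), set(b)
    if not a:
        return 0.0  # assumption: containment of an empty set is defined as 0
    return len(a & b) / len(a)

# Hypothetical concept identifiers: two concepts are shared.
set_a = {"c1", "c2", "c3", "c4"}
set_b = {"c3", "c4", "c5"}

a_in_b = containment(set_a, set_b)  # 2 of the 4 concepts in A are in B
b_in_a = containment(set_b, set_a)  # 2 of the 3 concepts in B are in A
```

A small ontology that is largely covered by a much larger one shows high containment in one direction and low containment in the other, while its coverage agreement with the larger ontology stays low.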

 

Table 1

Table 2

Table 3

Table 4

Results:

Table 2 depicts the sub-domain extraction process. The second column shows the size of each candidate ontology. The third to fifth columns show the number of concepts obtained in each phase of the extraction.

 

Table 3 shows the coverage agreement between each pair of ontologies. Agreement is a symmetric relation; hence, only the half of the table below the diagonal is filled.

 

Table 4 shows the containment between each pair of candidate ontologies. Each cell denotes how much of the row ontology is contained in the column ontology. For example, in the context of breast cancer, 8.62% of NCI is contained in MeSH. In both Table 3 and Table 4, the (BC) next to the ontology names indicates that the numbers refer to the breast cancer sub-domain.

 
Discussion:

Identifying a sub-domain-specific subset of a given ontology is not a straightforward task. Without ground truth, it often leaves room for discussion. Our seed-query approach to identifying a sub-domain has the advantage of repeatability, while still allowing gradual improvement by adjusting the initial set of seed queries. Comparing the fourth and fifth columns of Table 2, one can observe that only a small number of overlapping concepts were identified in one of the ontologies but missed in the others.

 
Conclusion:

Our hypothesis that there is only a minor breast cancer domain coverage agreement among the key UMLS-covered ontologies was confirmed. The small overlap among these ontologies is likely attributable to the fact that each of them was built for a different purpose, while all contribute to the available knowledge describing the breast cancer domain. Hence, an application tailored to a specific domain is likely to benefit from using multiple ontologies instead of a single one. A notable exception might be the inclusion of ICD10, which is contained, to a large extent, in SNOMED-CT. This effect is not surprising, because ICD10 was used in the construction of SNOMED-CT.

 
References:

[1] Bodenreider O. The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Research. 2004;267-70.
[2] Gruber TR. Toward principles for the design of ontologies used for knowledge sharing. Int. J. Hum.-Comput. Stud., Academic Press, Inc. 1993;43:907-928.