Breast cancer domain.
An ontology is a formalization of the knowledge in a domain and can be understood as a set of concepts with relations among them.[2] In this study we analyze the concepts from the ontologies. By a sub-domain we mean a subset of the concepts that share a common characteristic. Breast cancer is one such sub-domain. However, it is not straightforward to determine which concepts belong to it. On the one hand, for concepts like “Malignant neoplasm of breast” and “Broken leg” it is easy to agree whether they are relevant to breast cancer. On the other hand, “Benign neoplasm of breast” cannot be clearly classified: benign neoplasms can develop into malignant ones (breast cancer), but they are not breast cancer as such. These examples show that concept membership in this sub-domain is fuzzy. For the purpose of this study, we identified the breast cancer sub-domain based on the concepts' names, the structure of the ontologies, and UMLS. We define the breast cancer sub-domain as the set of concepts obtained by a three-step extraction process: query matching, subconcept expansion, and UMLS expansion. This procedure is automatic, involves no manual work, and therefore guarantees that the obtained results are repeatable.
For the first phase, we developed a list of queries, shown in Table 1. Each query consists of words that, when found in a concept's label, indicate that the concept belongs to the breast cancer sub-domain. All concepts that matched at least one query in the list were selected. A concept matches a query when its name, or one of its synonyms, contains all the words of the query as substrings. The comparison was case-insensitive. In the second phase, subconcept expansion, we used the is-a hierarchy of the ontologies. For each concept retrieved in the first phase, we added all of its subconcepts. We then iteratively added the subconcepts of the newly added concepts until no further concepts could be added. The third phase, UMLS expansion, was carried out after the first two phases had finished for all four candidate ontologies. In this phase, we used UMLS to identify additional concepts. UMLS integrates these ontologies in such a way that it indicates which pairs of concepts from different ontologies are equivalent. For each ontology, we took the breast cancer concepts identified in the other three ontologies and used UMLS to check whether the ontology being analyzed contained equivalent concepts that had been missed in the first two phases.
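The three phases described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the study: the function names, the dictionary-based data layout, and the toy queries are hypothetical, and the sketch assumes that all words of a query must occur in the same label (the name or a single synonym).

```python
from collections import deque


def matches(query_words, labels):
    """True if some label contains every query word as a
    case-insensitive substring (assumption: same label for all words)."""
    words = [w.lower() for w in query_words]
    return any(all(w in label.lower() for w in words) for label in labels)


def query_matching(concepts, queries):
    """Phase 1. `concepts` maps a concept id to its labels
    (name plus synonyms); `queries` is a list of word lists."""
    return {cid for cid, labels in concepts.items()
            if any(matches(q, labels) for q in queries)}


def subconcept_expansion(seeds, children):
    """Phase 2. Add all direct and indirect subconcepts, following a
    hypothetical `children` map (concept id -> direct subconcept ids)."""
    selected = set(seeds)
    queue = deque(seeds)
    while queue:
        cid = queue.popleft()
        for child in children.get(cid, ()):
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


def umls_expansion(selected, equivalents):
    """Phase 3. `selected` maps ontology name -> concept ids found in
    phases 1 and 2. `equivalents` is a hypothetical stand-in for the
    UMLS equivalence information: (ontology, id) -> {(ontology, id)}."""
    expanded = {name: set(ids) for name, ids in selected.items()}
    for source, ids in selected.items():
        for cid in ids:
            for target, tid in equivalents.get((source, cid), ()):
                if target != source and target in expanded:
                    expanded[target].add(tid)
    return expanded


# Toy data, invented for illustration only.
concepts = {
    "c1": ["Malignant neoplasm of breast"],
    "c2": ["Broken leg"],
    "c3": ["Breast Carcinoma", "Mammary cancer"],
}
queries = [["breast", "neoplasm"], ["mammary", "cancer"]]
seeds = query_matching(concepts, queries)
```

Phase 3 iterates over the results of the first two phases only, so a concept added through an equivalence does not trigger further additions; this mirrors the description above, in which the third phase starts once the first two are finished for all ontologies.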
From each of the four candidate ontologies, this procedure produced a subset of breast-cancer-related concepts that, within the scope of this study, represents the breast cancer sub-domain of that ontology. The results are displayed in Table 2.
Breast cancer domain coverage agreement.
We define the coverage agreement between two sets of concepts as the ratio between the cardinality of their intersection and the cardinality of their union. This ratio indicates to what extent the two sets of concepts are the same: a ratio of 0 means they are disjoint, and a ratio of 1 means they are identical. The coverage agreement values among the breast cancer sub-domains of the four analyzed ontologies are shown in Table 3 as percentages. The highest agreement is 7.15%, which means that, as predicted by our hypothesis, the agreement in the context of the breast cancer sub-domain is close to 0 for every pair of analyzed ontologies.
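The ratio defined above is the Jaccard index of the two sets. A minimal sketch follows; the function name and the handling of two empty sets (returning 0) are assumptions, since the text does not cover that case.

```python
def coverage_agreement(a, b):
    """Ratio of |A ∩ B| to |A ∪ B| for two collections of concept ids."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: the ratio is undefined for two empty sets; return 0.
        return 0.0
    return len(a & b) / len(union)


# Toy example: one shared concept out of four distinct ones.
example = coverage_agreement({"c1", "c2", "c3"}, {"c3", "c4"})
```

Multiplying the result by 100 gives a percentage comparable to the values reported in Table 3.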
Breast cancer domain ontology containment.
In addition to agreement, another useful measure is the containment between the breast cancer sub-domains of the ontologies. We define the containment of set A in set B as the ratio between the cardinality of their intersection and the cardinality of A. Unlike agreement, containment is not symmetric: the containment of A in B generally differs from the containment of B in A. Table 4 shows the computed containment percentages for each pair of ontologies.
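The containment measure can be sketched in the same way. The function name and the return value for an empty set A are assumptions made for illustration.

```python
def containment(a, b):
    """Ratio of |A ∩ B| to |A|: the share of A that also occurs in B."""
    a, b = set(a), set(b)
    if not a:
        # Assumption: the ratio is undefined for an empty A; return 0.
        return 0.0
    return len(a & b) / len(a)


# Toy example showing that the measure depends on the direction.
small = {"c1", "c2"}
large = {"c1", "c2", "c3", "c4"}
small_in_large = containment(small, large)
large_in_small = containment(large, small)
```

Here the small set is fully contained in the large one, while only half of the large set is contained in the small one, which is why Table 4 needs both halves of the matrix.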
Table 1

Table 2

Table 2 depicts the sub-domain extraction process. The second column shows the size of each candidate ontology. The third to fifth columns show the number of concepts obtained in each phase of the extraction.
Table 3 shows the coverage agreement between each pair of ontologies. Agreement is a symmetric relation; hence, only the half of the table below the diagonal is filled.
Table 4 shows the containment between each pair of candidate ontologies. Each cell denotes how much of the row ontology is contained in the column ontology. For example, in the context of breast cancer, 8.62% of NCI is contained in MeSH. In both Table 3 and Table 4, the (BC) next to the ontology names indicates that the numbers refer to the breast cancer sub-domain.
Our hypothesis that there is only minor breast cancer domain coverage agreement among the key UMLS-covered ontologies was confirmed. The small overlap among these ontologies can likely be attributed to the fact that each of them was built for a different purpose, yet all of them contribute to the available knowledge describing the breast cancer domain. Hence, an application tailored to a specific domain is likely to benefit from using multiple ontologies instead of a single one. A notable exception might be ICD10, which is contained, to a large extent, in SNOMED-CT. This effect is not surprising, because ICD10 was used in the construction of SNOMED-CT.