## Graph-based Semantic Measures: Technical section

This section is dedicated to technical aspects of semantic measures. Measures used to estimate the similarity or relatedness of concepts or groups of concepts are detailed. This section also aims at referencing some of the metrics used by semantic measures e.g. metrics which evaluate the specificity of concepts.

Quick Summary:

• Specificity of concepts: this section is dedicated to the various metrics used to evaluate the specificity of concepts e.g. their information content.
• Pairwise measures: Technical discussion related to measures used to compare a pair of concepts. A listing of existing measures is also provided.
• Groupwise measures: Technical discussion related to measures used to compare a group of concepts. A listing of existing measures is also provided.

### Estimation of concept specificity

Two broad categories of measures have been defined to estimate specificity of concepts. The intrinsic approach only takes into account the topological properties of the taxonomic backbone of the semantic graph. The extrinsic approach, introduced by Resnik [1], extends the intrinsic approach also considering the usage of a concept to assess its specificity.

Since taxonomic relationships are transitive, an occurrence of the concept $y$ is also a occurrence of the concept $x$ if $y \prec x$, i.e. $x$ subsumes $y$. In other word, considering that Mammal subsumes Human, every time you read about Human you also read about Mammal. The concept Mammal is therefore more specific than all its subsumers. Otherwise stated, we can say that Mammal is also more informative. Think about a detective story; if I say 'A Mammal killed him', it would be less informative (discriminating) for the detective than saying 'A Human killed him'. The intrinsic approach rely on topological information to assess concept specificity e.g. depth of concepts in the semantic graph.

The extrinsic approach takes into account of the occurrences of concepts. Indeed, the intrinsic approach, in some way, only consider the schema associated to the ontology. However, adopting Shannon's theory of information, Resnik proposed to define the specificity of a concept as a function of its information content, i.e. the amount of information a concept carries. Therefore, more a concept is used or represented in a collection, less informative (specific) it is.

In the following documentation, we consider $\theta$ as a function estimating concept specificity, with:

• $\theta : \mathcal{C} \rightarrow \mathbb{R}^+$

In all cases requirement on measures evaluating the specificity of concepts.
Due to the partial order induced by the taxonomic subgraph, a concept must always be more specific than its ancestors (subsumers). A constraint of $\theta$ functions is thus:

• $y \prec x \rightarrow \theta(y) > \theta(x)$

Both, intrinsic and extrinsic approaches are detailed below.

#### Intrinsic estimation

##### Depth of concepts

The depth of a concept can be used to estimate its specificity, i.e. deeper is a concept in the ontology, more specific it is expected to be. Numerous $\theta$ function can be defined based on the depth of a concept.

$\theta(c) = f(depth(c))$

Since a concept can have multiple depth in a taxonomic graph, i.e. depending on the path linking the concept to the root which is considered, a min or max (among others) strategy can be used.

$\theta(c) = min(depth(c))$

In some cases, a logarithm function can be used to obtain non-linear decreasing function from the leaves to the root of the ontology:

$\theta(c) = \frac{log(max(depth(c)))}{log(depth\_max)}$

Note the equation presented above is a special case of Zhou function presented below (i.e. Zhou function with k set to nil). Talk about approximations proposed by Blanchard et al.

##### Number of ancestors

Number of ancestors can also be used as an estimator of specificity. This estimator follows the same strategy as the depth but is theoretically better suited for graph as aggregations of multiple depth are not required anymore.

$\theta(c) = f(\mathcal{A}(c))$, e.g. $\theta(c) = |\mathcal{A}(c)|$
##### Seco et al. - 2004

The specificity of concept can also be evaluated according to the number of concepts it subsumes.

$\theta(c) = 1 - \frac{log(\mathcal{D}(c))}{log(|C|)}$

Reference: Seco N, Veale T, Hayes J: An Intrinsic Information Content Metric for Semantic Similarity in WordNet. In 16th European Conference on Artificial Intelligence. IOS Press; 2004, 16:1-5.

##### Zhou et al. - 2008

Multiple estimators can also be mixed.

$\theta(c) = k (1 - \frac{log(\mathcal{D}(c))}{log(|C|)}) + (1 - k) \frac{log(max(depth(c)))}{log(depth\_max)}$

With k a parameter enabling to tune the contribution of each estimators.

Reference: Zhou Z, Wang Y, Gu J: A New Model of Information Content for Semantic Similarity in WordNet. In FGCNS ’08 Proceedings of the 2008 Second International Conference on Future Generation Communication and Networking Symposia Volume 03. IEEE Computer Society; 2008:85-89.

##### Sanchez et al. - 2011
$\theta(c) = -log(\frac{x}{nb\_leaves + 1})$ with $x = \frac{RL(c)}{\mathcal{A}(c)}+1$

Sanchez D, Batet M, Isern D: Ontology-based information content computation. Knowledge-Based Systems 2011, 24:297-303.

##### Harispe et al. 2012

Modification of Sanchez et al. IC in order to distinguish specificity of leaves according to their number of ancestors.

##### Mazandu et al. - 2012

Gaston K. Mazandu and Nicola J. Mulder, “A Topology-Based Metric for Measuring Term Similarity in the Gene Ontology,” Advances in Bioinformatics, vol. 2012, Article ID 975783, 17 pages, 2012. doi:10.1155/2012/975783 http://www.hindawi.com/journals/abi/2012/975783/

#### Extrinsic estimation

##### Resnik Information Content - 1995
 Authors Resnik Year 1995 Type Extrinsic Context Semantic similarity on WordNet SML Flag Dependencies Requires an annotation repository or number of occurences per concepts Summary The information content is defined as the inverse logarithm of the probability to encounter an instance annotated by the concept. Reference [1]

Original definition of information content initially defined by Resnik as $-log(p(u))$ for a concept $u$, which can be reformulated by:

$IC(u) = -log( \frac{|\mathcal{I}(\mathcal{D}(u))|}{|\mathcal{I}(C)|})$
With $\mathcal{D}(u)$ the set of inclusive descendants of the concept $u$, $\mathcal{I}(\mathcal{D}(u))$ the set of entities annotated by a concept in $\mathcal{D}(u)$ and $\mathcal{I}(C)$ the set of entities annotated by a concept defined in $C$.

### Pairwise Semantic measures

#### Rada et al. - 1989

Rada defined the similarity of two concepts as a function of the shortest path linking the two concepts. Note that Rada originally defined a semantic distance between two concepts.

$dist(u,v) = sp(u,v)$

With sp(u,v) the shortest path linking the two concepts. The similarity is therefore defined by:

$sim(u,v) = \frac{1}{sp(u,v)+1}$

The measure proposed by Rada et al. was defined in the context of trees, i.e. a data structures in which there is no multi inheritances, a concept only have a unique parent. A common adaptation of the measure to DAG is to force the shortest path to contain a common ancestor of the compared concept.

Reference: Rada R, Mili H, Bicknell E, Blettner M: Development and application of a metric on semantic nets. Ieee Transactions On Systems Man And Cybernetics 1989, 19:17–30.

#### Wu & Palmer - 1994

The measure proposed by Wu & Palmer was defined in the context of semantic trees. The similarity is defined as a function of the depth of the compared concepts and the depth of their most specific common ancestor (MSCA), their common ancestor which has the maximal depth.

$sim(u,v) = \frac{depth(MSCA_{u,v})}{depth(u) + depth(v)}$

Reference: Wu Z, Palmer M: Verb semantics and lexical selection. In 32nd. Annual Meeting of the Association for Computational Linguistics. 1994:133–138.

#### Resnik - 1995

 Authors Resnik Year 1995 Type Pairwise Context Semantic similarity on WordNet SML Flag Dependencies Requires an IC function to be defined Summary The similarity is defined as the information content of the most informative common ancestors of the compared concepts Reference [1]

The similarity of two concepts $u$, $v$ is defined as the information content of their Most Informative Common Ancestor (MICA):

$sim(u,v) = IC(MICA_{u,v})$
Where $MICA_{u,v}$ is the concept in $\mathcal{A}(u) \cap \mathcal{A}(v)$ which maximize the selected IC function. Note that the function originally relies on the definition of information content initially defined by Resnik as $-log(p(u))$ which can be reformulated by:
$IC(u) = -log( \frac{|\mathcal{I}(\mathcal{D}(u))|}{|\mathcal{I}(C)|})$
With $\mathcal{D}(u)$ the set of inclusive descendants of the concept $u$, $\mathcal{I}(\mathcal{D}(u))$ the set of entities annotated by a concept in $\mathcal{D}(u)$ and $\mathcal{I}(C)$ the set of entities annotated by a concept defined in $C$.
Note that the measure can be abstracted in order to tune it with other function estimating the specificity of concepts:
$sim(u,v) = \theta(MICA_{u,v})$

#### Leacock and Chodorow - 1998

Adaptation of Rada et al. measure in order to take into account of the depth of the ontology.

$sim(u,v) = -log(\frac{sp(u,v)+1}{2 . max\_depth})$

Reference: Leacock C, Chodorow M: Combining Local Context and WordNet Similarity for Word Sense Identification. In WordNet: An electronic lexical database. edited by Fellbaum C MIT Press; 1998:265 – 283.

#### Lin - 1998

Adaptation of Lin measure in order to take into account of the specificity of compared concepts.

$sim(u,v) = \frac{IC(MICA_{u,v})}{IC(u) + IC(v)}$

#### Li et al. - 2010

Effectively integrating information content and structural relationship to improve the GO-based similarity measure between proteins. Bo Li, James Z. Wang, F. Alex Feltus, Jizhong Zhou, Feng Luo

### Groupwise Semantic measures

Existing measures

### RDF Compatibility

The graph data model in use is not fully RDF-compatible in the sense that graph vertices can only represent URI (the SML takes advantage of the Sesame API [2] to process RDF data). Therefore, no literals and blank nodes can be loaded in the graph which are currently processed by the library. In addition, the graph representation is not triple-oriented and therefore does not fully embrace rigorous definition of RDF-graphs [3].

### References

 [1] Resnik, P. Using Information Content to Evaluate Semantic Similarity in a Taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence IJCAI, 1:448-453, 1995. [2] Broekstra, J., Kampman, A. & Harmelen, F.V. Sesame: A generic architecture for storing and querying rdf and rdf schema. In The Semantic Web — ISWC 2002, pages 54-68, Springer Berlin Heidelberg, 2002. [3]