Many companies have their own classification systems to label and structure content. The most advanced ones even have their own Knowledge Graphs, a representation of the company’s knowledge that can be understood by both humans and machines. But how does one use this knowledge to process unstructured data such as text documents automatically? How can we recognize that a document mentions one or another resource from the Knowledge Graph?
Looking for concepts behind words
Simple examples like the one below demonstrate that string matching with linguistic extensions is not enough to tell whether a word represents a resource from the Knowledge Graph. We need to disambiguate words, that is, to discover which concepts stand behind them.
This is what the typical formulation of the disambiguation-against-a-Knowledge-Graph task, also called entity linking, looks like: given 1. a text, 2. a word of interest (or target word), and 3. a Knowledge Graph, decide which resource from the Knowledge Graph the word of interest represents. Here is an example:
BMW has designed a car that is going to drive Jaguar X1 out of the car market.
However, this formulation is only suitable for large Knowledge Graphs like DBpedia or Wikidata, and only when one is interested in disambiguating between the senses already represented in the Knowledge Graph. For enterprise Knowledge Graphs, the task should be posed differently. Enterprise Knowledge Graphs are smaller than DBpedia and usually highly specific to their domains. They also often do not contain resources that can be found in public Knowledge Graphs, or they contain general-purpose words in domain-specific meanings, for example as the name of a new game. Therefore, a more suitable formulation of the problem statement is the following: given 1. a corpus of texts mentioning the target word, 2. the target word, and 3. a Knowledge Graph, decide in which of the texts the target word represents a resource from the Knowledge Graph and in which it is used in a different sense. The running example below illustrates this setting.
Running Example — “Jaguars”
[{1: “The jaguar’s present range extends from Southwestern United States and Mexico in North America, across much of Central America, and south to Paraguay and northern Argentina in South America.”},
{2: “Overall, the jaguar is the largest native cat species of the New World and the third largest in the world.”},
{3: “Given its historical distribution, the jaguar has featured prominently in the mythology of numerous indigenous American cultures, including those of the Maya and Aztec.”},
{4: “The jaguar is a compact and well-muscled animal.”},
{5: “Melanistic jaguars are informally known as black panthers, but as with all forms of polymorphism they do not form a separate species.”},
{6: “The jaguar uses scrape marks, urine, and feces to mark its territory.”},
{7: “The word ‘jaguar’ is thought to derive from the Tupian word yaguara, meaning ‘beast of prey’.”},
{8: “Jaguar’s business was founded as the Swallow Sidecar Company in 1922, originally making motorcycle sidecars before developing bodies for passenger cars.”},
{9: “In 1990 Ford acquired Jaguar Cars and it remained in their ownership, joined in 2000 by Land Rover, till 2008.”},
{10: “Two of the proudest moments in Jaguar’s long history in motor sport involved winning the Le Mans 24 hours race, firstly in 1951 and again in 1953.”},
{11: “He therefore accepted BMC’s offer to merge with Jaguar to form British Motor (Holdings) Limited.”},
{12: “The Jaguar E-Pace is a compact SUV, officially revealed on 13 July 2017.”}]
The example contains twelve contexts featuring the target word “jaguar” in different senses. The first six contexts speak about the jaguar as an animal, the last five mention Jaguar as a car manufacturer. The seventh context refers to both senses, as it describes the etymology of the word. Consequently, our desired output would be two senses: the first sense expressed in the first six contexts and the second sense expressed in the last five. How the senses are represented depends on the method and on the word sense disambiguation procedure; for example, a sense could be represented as a (weighted) list of words describing it, such as synonyms. A sketch of such an output is shown below.
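For the running example, the desired output might look roughly like this (a sketch only; the word lists are illustrative, and context 7 belongs to both senses):

```python
# Sketch of the desired output for the "jaguar" example: two senses, each
# described by a list of words (synonyms) and the contexts it covers.
desired_output = {
    "animal": {
        "contexts": [1, 2, 3, 4, 5, 6, 7],
        "words": ["cat", "feline", "predator"],       # illustrative
    },
    "car manufacturer": {
        "contexts": [7, 8, 9, 10, 11, 12],
        "words": ["car", "company", "manufacturer"],  # illustrative
    },
}
```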
Word Sense Induction and Word Sense Disambiguation
In order to solve this task, we use a two-step approach:
- Word Sense Induction (WSI)
- Word Sense Disambiguation (WSD)
The goal of the WSI step is to induce the senses of the target word from the given corpus. The outcome is a set of senses of the target word, a so-called sense inventory. These senses are then used in the WSD step. The most advanced WSD methods are able to mix the induced senses with senses taken from the Knowledge Graph; in other words, they include external senses in the sense inventory.
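As a rough illustration of how the two steps fit together, the minimal sketch below assumes that a sense is represented as a weighted word list (as described above) and implements the WSD step as simple overlap scoring; the names and the scoring rule are illustrative, not the exact ones used in our implementation.

```python
from typing import Dict, List

Sense = Dict[str, float]            # e.g. {"cat": 0.150, "native": 0.150, ...}
SenseInventory = Dict[str, Sense]   # sense id -> weighted word list


def disambiguate(context: List[str], senses: SenseInventory) -> str:
    """WSD step: pick the sense whose word list overlaps most with the
    (lemmatised) context tokens."""
    scores = {
        sense_id: sum(weight for word, weight in sense.items() if word in context)
        for sense_id, sense in senses.items()
    }
    return max(scores, key=scores.get)
```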
Co-occurrence Graph Clustering
In this blog post, we use co-occurrences of the target word to induce the senses. We take the n words before and the n words after each word as its co-occurrences and construct a co-occurrence graph. We use a variation of the Dice score [1] to compute and weight the co-occurrences.
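A minimal sketch of this construction could look as follows; it assumes pre-tokenised (and lemmatised) contexts, uses the plain Dice score rather than our exact variation of it, and relies on networkx purely for illustration.

```python
from collections import Counter

import networkx as nx


def cooccurrence_graph(contexts, n=3):
    """Build a co-occurrence graph from tokenised contexts: two words are
    connected if they appear within n positions of each other, and each
    edge is weighted with the Dice score [1]."""
    word_freq = Counter()
    pair_freq = Counter()
    for tokens in contexts:
        word_freq.update(tokens)
        for i, word in enumerate(tokens):
            # Only look ahead, so each unordered pair is counted exactly once.
            for other in tokens[i + 1: i + 1 + n]:
                if other != word:
                    pair_freq[tuple(sorted((word, other)))] += 1

    graph = nx.Graph()
    for (a, b), joint in pair_freq.items():
        dice = 2 * joint / (word_freq[a] + word_freq[b])
        graph.add_edge(a, b, weight=dice)
    return graph
```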
Then, we focus on the target word (“jaguar”) in the graph and extract a sub-graph containing it. We then cluster this sub-graph; each cluster represents an induced sense (see the figure). The clustering technique we use is inspired by the HyperLex [2] clustering method, with the difference that we use PageRank [3] to estimate the “importance” of nodes.
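A strongly simplified sketch of such a clustering step is shown below: PageRank replaces node degree as the importance measure, the highest-ranked mutually non-adjacent nodes become sense hubs, and every other word joins the hub it is most strongly connected to. The number of hubs, the thresholds, and the function names are illustrative and differ from the actual algorithm.

```python
import networkx as nx


def induce_senses(graph, target, num_hubs=5):
    """Cluster the neighbourhood of the target word into senses, using
    PageRank [3] instead of node degree as the importance measure."""
    # Sub-graph induced by the direct co-occurrence neighbours of the target.
    neighbours = set(graph.neighbors(target))
    ego = graph.subgraph(neighbours).copy()

    rank = nx.pagerank(ego, weight="weight")

    def edge_weight(u, v):
        return ego[u][v]["weight"] if ego.has_edge(u, v) else 0.0

    # Greedily pick hubs: important nodes not directly connected to earlier hubs.
    hubs = []
    for node in sorted(rank, key=rank.get, reverse=True):
        if len(hubs) == num_hubs:
            break
        if all(not ego.has_edge(node, hub) for hub in hubs):
            hubs.append(node)

    # Each hub seeds one sense; every other word joins the hub it is most
    # strongly connected to, giving one weighted word list per induced sense.
    senses = {hub: {hub: rank[hub]} for hub in hubs}
    for node in ego.nodes():
        if node in hubs:
            continue
        best = max(hubs, key=lambda hub: edge_weight(node, hub))
        if ego.has_edge(node, best):
            senses[best][node] = edge_weight(node, best)
    return senses
```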
As we deal with an enterprise Knowledge Graph and expect the target word to have some domain-specific sense in it, we are specifically interested in distinguishing this KG sense from the other senses. In order to be more precise about the KG sense and to guarantee its careful induction, we use broaders (in the case of SKOS) and/or class information (in the case of OWL) to influence the clustering process. As we make use of hypernyms, we add another “Hyper” to the described method’s name, hence the name of the new method: “HyperHyperLex”.
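The exact way the broaders and classes enter the clustering is not spelled out here, but one plausible way to sketch such a bias is via personalised PageRank: hypernym labels from the KG that occur in the co-occurrence graph receive extra teleport probability, so the KG sense forms its own well-separated hub. The hypernym labels and the boost parameter below are hypothetical.

```python
import networkx as nx


def kg_biased_rank(ego, kg_hypernyms, boost=10.0):
    """Personalised PageRank over the target's neighbourhood: labels of KG
    broaders/classes that occur in the graph get extra teleport probability."""
    present = [w for w in kg_hypernyms if w in ego]
    if not present:
        return nx.pagerank(ego, weight="weight")
    personalization = {node: 1.0 for node in ego.nodes()}
    for word in present:
        personalization[word] += boost
    return nx.pagerank(ego, weight="weight", personalization=personalization)


# Hypothetical hypernym labels for the KG resource "jaguar" (the animal sense):
kg_hypernyms = {"cat", "felidae", "animal"}
```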
The co-occurrence graph for “jaguar” is presented below. Since we only have a small number of very short contexts, the algorithm gets somewhat confused and cannot induce the right number of senses, i.e. it induces more senses than there actually are. However, the induced KG sense looks like this:
```python
{'native': 0.150,
 'cat': 0.150,
 'specie': 0.109,
 'largest': 0.138,
 'world': 0.109,
 'overall': 0.074,
 'derive': 0.004,
 'history': 0.003,
 'thought': 0.004,
 'word': 0.004,
 'prominently': 0.003,
 'featured': 0.003,
 'american': 0.004,
 'indigenous': 0.003}
```
Even though it is confused by the small size of the corpus, the method is still able to induce the KG sense effectively, thanks to the additional information (hypernyms) taken from the KG.
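As a quick sanity check (a toy calculation, not part of the actual experiments), the weighted list above can be plugged into the overlap-based scoring sketched earlier; the context tokens below are lemmatised by hand in the same way as the induced sense (e.g. “species” becomes “specie”).

```python
kg_sense = {
    "native": 0.150, "cat": 0.150, "specie": 0.109, "largest": 0.138,
    "world": 0.109, "overall": 0.074, "derive": 0.004, "history": 0.003,
    "thought": 0.004, "word": 0.004, "prominently": 0.003, "featured": 0.003,
    "american": 0.004, "indigenous": 0.003,
}

# Context 2 (animal sense) and context 12 (car sense), lemmatised by hand:
animal_ctx = ["overall", "jaguar", "largest", "native", "cat", "specie",
              "new", "world", "third", "largest", "world"]
car_ctx = ["jaguar", "e-pace", "compact", "suv", "officially", "revealed",
           "13", "july", "2017"]

def overlap(context, sense):
    return sum(weight for word, weight in sense.items() if word in context)

print(overlap(animal_ctx, kg_sense))  # ~0.73 -> clearly the KG (animal) sense
print(overlap(car_ctx, kg_sense))     # 0.0   -> no overlap with the KG sense
```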
The results of a larger-scale experiment are presented in Tables 1 and 2. The experiments show that using hypernyms and class assertions from the KG helps to identify the Knowledge Graph sense and pull it apart from the other senses. As a result, the WSI step induces more precise senses, which in turn leads to better WSD.
Code
The code uses a prepared GitHub repository with a complete dataset containing several large corpora, i.e. not the short local contexts of the “jaguar” example. If you run the code below, you will also obtain that dataset and can experiment further on your own. Larger text pieces allow us to better estimate the co-occurrence scores between non-target words. In our next blog posts, we will demonstrate a technique that overcomes this limitation and is able to induce senses efficiently from local contexts alone.
References
- Dice, L. R. (1945). “Measures of the amount of ecologic association between species.” Ecology, 26(3):297–302.
- Véronis, J. (2004). “HyperLex: Lexical cartography for information retrieval.” Computer Speech & Language, 18(3):223–252.
- Di Marco, A., and Navigli, R. (2013). “Clustering and diversifying web search results with graph-based word sense induction.” Computational Linguistics, 39(3):709–754.