Goodbye Data Silos!
Those who stay on their islands will fall behind. This statement holds at nearly every level of our increasingly complex society, from whole cultural spheres down to single individuals. The drawbacks of isolated departmental thinking become even more obvious when looking at the competencies and skill sets that are increasingly relevant in our data- and knowledge-driven working environment: they should be interdisciplinary and multilingual, and should rest on a systemic, integrative attitude and an agile mindset.
The same holds for our data: it is kept in silos, so it is a highly time-consuming task for every knowledge worker to identify the right dots and pieces of information, to connect them, to make sense of them, and finally to communicate and interpret them the right way. Accordingly, a shift from document-based communication to data-centric execution can be seen in many industries.
Graphs are on the Way
Some vendors of machine learning and AI technologies have raised great hopes that the problem of disconnected and low-quality data can be fixed automatically. But it is simply not true that machines can learn from any kind of data, especially unstructured information, to a level that could substitute for subject matter experts. The truth: algorithms like deep learning only work well when a lot of data of the same kind is available (more than even large corporations usually have), and even then, only rather simple cognitive processes like classification can be automated.
AI technologies are currently focused on solutions that automate processes. As a result, other types of AI applications are largely forgotten: decision and process augmentation, i.e., systems that support knowledge workers by connecting some, but not all, of the dots automatically. Applications like these are increasingly based on graph technologies, as graphs can map and support complex knowledge domains and their heterogeneous data structures in a more agile manner. In mid-2018, Gartner identified Knowledge Graphs as a new key technology in its Hype Cycle for Artificial Intelligence and its Hype Cycle for Emerging Technologies.
Knowledge Graphs (KGs) have become increasingly important for supporting decision and process augmentation based on linked data. In this article we explore how enterprises can develop their own KGs along the whole Linked Data Life Cycle.
What is a Knowledge Graph?
It’s all about things, not strings: a Knowledge Graph represents a knowledge domain. It connects things of different types in a systematic way, encoding knowledge as a network of nodes and links rather than tables of rows and columns. In this way, people and machines can benefit from a dynamically growing semantic network of facts about things and can use it for data integration, knowledge discovery, and in-depth analyses.
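To make "nodes and links" concrete, here is a minimal sketch in Python using the rdflib library. The namespace, the two entities, and the worked_on relation are purely illustrative assumptions, not taken from any real knowledge graph:

```python
# Minimal sketch of a knowledge graph: typed things connected by named
# relations. All URIs below are illustrative, not from a real graph.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/kg/")

g = Graph()
# Nodes are "things" with a type, not just strings...
g.add((EX.AlanTuring, RDF.type, EX.Person))
g.add((EX.AlanTuring, RDFS.label, Literal("Alan Turing")))
g.add((EX.Enigma, RDF.type, EX.Machine))
# ...and links are named relations connecting them.
g.add((EX.AlanTuring, EX.worked_on, EX.Enigma))

# The network can be traversed: what did Alan Turing work on?
for thing in g.objects(EX.AlanTuring, EX.worked_on):
    print(thing)  # -> http://example.org/kg/Enigma
```

Unlike a row in a table, each statement here is a self-describing edge in a network, so new types of things and relations can be added at any time without changing a schema.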
Graphs are all around: Facebook, Microsoft, and Google all operate their own graphs as part of their infrastructure. In May 2012, Google introduced its own version and interpretation of a Knowledge Graph, and since then the notion of ‘knowledge graph’ has become more and more popular, albeit closely linked to the Silicon Valley company. On the surface, information from the Knowledge Graph is used to augment search results and to enhance Google’s AI when answering spoken questions in Google Assistant and Google Home voice queries. Behind the scenes and in return, Google uses its KG to improve its machine learning.
But Google’s KG (GKG) has a big disadvantage: it is quite limited how users and software agents can interact with it and its API returns only individual matching entities, rather than graphs of interconnected entities. Even Google itself recommends that, if you need the latter, data dumps from Wikidata should be used instead. But Wikidata is only one out of currently over 1,200 sources, which are highly structured and interconnected and most frequently available for download and reuse as standards-based knowledge bases. This ‘graph of graphs’ is also known as the ‘Semantic Web’. One might argue that this data still cannot easily be integrated in enterprise information systems due to a lack of quality control or missing license information etc.
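For illustration, the Wikidata graph can also be queried live via its public SPARQL endpoint. The following is a hedged sketch using plain requests; the property P61 (‘discoverer or inventor’) and the entity Q7186 (Marie Curie) are just an example query of our choosing, not a recommendation from Google or Wikidata:

```python
# Sketch: fetch interconnected entities from Wikidata's public
# SPARQL endpoint. The concrete query is illustrative only.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P61 wd:Q7186 .   # discovered/invented by Marie Curie
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "kg-demo/0.1 (example)"},  # WDQS asks for a UA
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"])  # e.g. polonium, radium
```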
Nevertheless, there are several reasons why organizations should explore this data management approach, to find out whether a ‘corporate semantic web’ reflecting their own specific knowledge domains would make sense or not.
Build your own Knowledge Graph
As we all know, many roads lead to Rome. Some are more exhausting but more solid and sustainable; some are less explored but, in the end, more efficient; and in many cases the best way only reveals itself once you are already on the move.
In any case, the most frequently used approaches to developing knowledge graphs are the following. A knowledge graph can be:
- curated by experts, like Cyc
- edited by the crowd, like Wikidata
- extracted from large-scale, semi-structured knowledge bases such as Wikipedia, which yields graphs like DBpedia and YAGO
- created by information extraction methods for unstructured or semi-structured information, which leads to knowledge graphs like Knowledge Vault
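Graphs of the extracted kind can be reused directly. As a small sketch (assuming the public DBpedia SPARQL endpoint, which is not affiliated with any product mentioned here), this query pulls the English abstract of one DBpedia resource:

```python
# Sketch: query DBpedia, a knowledge graph extracted from Wikipedia.
import requests

ENDPOINT = "https://dbpedia.org/sparql"
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Knowledge_graph> dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}
"""

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "application/sparql-results+json"},
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["abstract"]["value"][:200])  # first 200 characters
```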
The latter approach sounds most promising, since it scales better through a fully automated methodology, not only during the initial creation phase but also for continuous extension and improvement. A fundamental problem with automatic knowledge extraction, however, is that it cannot distinguish an unreliable source from an unreliable extractor. Learning from Google’s Knowledge Vault, we can assume the following:
- Approaches primarily focused on text-based extraction can be very noisy
- Better results can be achieved when extractors combine information obtained via analysis of text, tabular data, page structure, and human annotations with prior knowledge derived from existing knowledge repositories
- Extracted entity types and predicates should come from a fixed ontology
- Knowledge graphs should separate facts about the world from their lexical representations, making the graph a structured repository of knowledge that is language-independent
- Supervised machine learning methods for fusing distinct information sources are most promising
- SKOS as a W3C standard serves as a solid starting point to create an Enterprise KG (see the sketch below)
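To illustrate the last two points, here is a minimal sketch of an enterprise vocabulary expressed in SKOS using rdflib; the concept scheme, concepts, and URIs are purely illustrative assumptions, not an excerpt of any real enterprise KG. Note how the language-independent concepts are separated from their multilingual labels:

```python
# Minimal SKOS sketch: language-independent concepts with multilingual
# labels and a broader/narrower hierarchy. All URIs are illustrative.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/vocab/")

g = Graph()
g.bind("skos", SKOS)

# A concept scheme groups the concepts of one knowledge domain.
g.add((EX.products, RDF.type, SKOS.ConceptScheme))

# Concepts are "things"; prefLabels carry the lexical, per-language layer.
g.add((EX.database, RDF.type, SKOS.Concept))
g.add((EX.database, SKOS.prefLabel, Literal("Database", lang="en")))
g.add((EX.database, SKOS.prefLabel, Literal("Datenbank", lang="de")))
g.add((EX.database, SKOS.inScheme, EX.products))

g.add((EX.graph_database, RDF.type, SKOS.Concept))
g.add((EX.graph_database, SKOS.prefLabel, Literal("Graph database", lang="en")))
g.add((EX.graph_database, SKOS.broader, EX.database))  # hierarchy
g.add((EX.graph_database, SKOS.inScheme, EX.products))

print(g.serialize(format="turtle"))
```

Starting from such a SKOS vocabulary, an enterprise KG can grow step by step toward richer ontologies without breaking what is already there.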
A systematic view of how Knowledge Graphs can be created, maintained, extended, and used along the whole Linked Data Life Cycle has recently been provided by Semantic Web Company with the latest release of its PoolParty Semantic Suite.
Gartner states in its Hype Cycle for Artificial Intelligence, 2018: “The rising role of content and context for delivering insights with AI technologies, as well as recent knowledge graph offerings for AI applications have pulled knowledge graphs to the surface.”
Read more about what Gartner says about Semantic Web Company and the PoolParty Semantic Suite in its newest Hype Cycle for Artificial Intelligence and Hype Cycle for Emerging Technologies.