Companies must be extremely careful about how they deal with customer data due to current international data protection regulation. When creating marketing resources, presenting to stakeholders or demonstrating to prospective clients, they must have data that looks realistic but respects data privacy. However, using data that clearly does not conform with the expected format can mislead audiences and result in unfavourable decisions.
One way to tackle this problem is to anonymize the data samples used. However, implementing a good anonymization algorithm that makes it impossible to trace the data back to its source is not always straightforward. This article presents an alternative approach: instead of going through the tedious process of anonymizing data, we solve this problem by using the PoolParty Semantic Suite to generate synthetic data.
Five Steps to Create Synthetic Data With A Knowledge Graph
Our goal is to create synthetic data that conforms with the expected structure and that has realistic content, such as properly formed sentences with domain content, instead of the lorem ipsum type of sentences. As the data generated is completely synthetic, the risk of it being traceable to source materials is greatly mitigated.
We have applied this principle to create synthetic curricula vitae (CV) by following these steps:
1. Decide on the Structure
Define the structure of the CV by describing what data fields are needed, the expected content and other formatting aspects of the resulting document. This acts as a template to be automatically filled in using domain knowledge. For illustrative purposes we have defined a simplified CV template. In our case, relevant fields are the personal details of the person, their current position and education as well as the hard and soft skills they possess.
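As an illustration, the simplified template described above could be written down as a small data structure. This is only a sketch: the class name, the field names and the concept scheme names are hypothetical and are not taken from the actual CV generator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateField:
    name: str            # label of the field in the CV
    concept_scheme: str  # hypothetical name of the concept scheme that supplies values
    min_values: int = 1  # fewest values the field may hold
    max_values: int = 1  # most values the field may hold


# Simplified CV template: personal details, position, education and skills.
CV_TEMPLATE = [
    TemplateField("name", "person_names"),
    TemplateField("current_position", "positions"),
    TemplateField("education", "degrees"),
    TemplateField("hard_skills", "hard_skills", min_values=3, max_values=6),
    TemplateField("soft_skills", "soft_skills", min_values=2, max_values=4),
]


def validate_template(template):
    """Check that field names are unique and value ranges are sensible."""
    names = [f.name for f in template]
    if len(names) != len(set(names)):
        raise ValueError("duplicate field names in template")
    for f in template:
        if not 1 <= f.min_values <= f.max_values:
            raise ValueError(f"invalid value range for field {f.name!r}")
    return True
```

Keeping the template as data rather than code makes it easy to define variations of the same document type later on.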
2. Analyze and Extract
Use PoolParty to analyze sample documents and extract instances of key elements and their frequencies. We ran the PoolParty Extractor with the ESCO taxonomy to identify relevant skills in the source CVs and to build a model of how frequently different types of CVs (e.g. developer, HR, manager, …) occur in the data set. This gives us a good idea of what the target documents should look like with regard to content.
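A frequency model of this kind can be sketched in a few lines. The example below does not call the PoolParty Extractor; it assumes the extraction results are already available as simple records, and the record layout, the sample values and the function name are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical extraction results: one record per source CV, holding the
# CV type and the skills that an extractor identified in it.
extracted = [
    {"cv_type": "developer", "skills": ["Python", "SQL", "Git"]},
    {"cv_type": "developer", "skills": ["Java", "SQL"]},
    {"cv_type": "HR", "skills": ["recruiting", "labour law"]},
    {"cv_type": "manager", "skills": ["budgeting", "recruiting"]},
]


def build_frequency_model(records):
    """Return the share of each CV type and the skill counts per CV type."""
    type_counts = Counter(r["cv_type"] for r in records)
    skill_counts = defaultdict(Counter)
    for r in records:
        skill_counts[r["cv_type"]].update(r["skills"])
    total = sum(type_counts.values())
    type_share = {cv_type: n / total for cv_type, n in type_counts.items()}
    return type_share, skill_counts


type_share, skill_counts = build_frequency_model(extracted)
```

The shares per CV type can later serve as sampling weights, so that the synthetic data set has roughly the same mix of profiles as the analyzed samples.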
3. Build Your Knowledge Graph
Build a knowledge graph that has concept schemes that contain the domain-specific knowledge matching the template fields that need to be filled in. Knowledge can be reused from existing taxonomies or other sources (for instance, online image databases or random name generators). For our CV generator, we created concept schemes for companies, universities, degrees, hard skills (based on the output of the PoolParty Extractor) and soft skills, among others.
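To make the idea of concept schemes more concrete, here is a minimal in-memory stand-in for such a knowledge graph. It is not how PoolParty stores its data; the classes, the scheme names and the sample concepts are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    label: str
    # Labels of related concepts in other schemes (hypothetical link type).
    related: set = field(default_factory=set)


@dataclass
class KnowledgeGraph:
    # Maps a scheme name to its concepts, keyed by label.
    schemes: dict = field(default_factory=dict)

    def add(self, scheme, label, related=()):
        concept = Concept(label, set(related))
        self.schemes.setdefault(scheme, {})[label] = concept
        return concept

    def labels(self, scheme):
        """All concept labels in a scheme, sorted alphabetically."""
        return sorted(self.schemes.get(scheme, {}))

    def related_in(self, scheme, anchor_label):
        """Labels in `scheme` whose concepts are linked to `anchor_label`."""
        concepts = self.schemes.get(scheme, {})
        return sorted(
            label for label, c in concepts.items() if anchor_label in c.related
        )


kg = KnowledgeGraph()
kg.add("positions", "Software Developer")
kg.add("positions", "HR Specialist")
kg.add("hard_skills", "Python", related={"Software Developer"})
kg.add("hard_skills", "SQL", related={"Software Developer"})
kg.add("hard_skills", "Payroll administration", related={"HR Specialist"})
```

The links between schemes are what later allow a generator to pick skills that fit a given position instead of choosing them at random.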
4. Build an Algorithm
Build a generation algorithm that looks at each field in the template and instantiates it with values from the corresponding concept scheme in the knowledge graph. In its simplest version, the instantiation can be done randomly. More sophisticated versions can take relations across concept schemes into account. For instance, if the person holds a position in software development, then their hard skills should be related to that professional area.
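The difference between random and relation-aware instantiation can be sketched as follows. This is a simplified illustration, not the actual generation algorithm: the concept schemes are plain Python collections, and all names and sample values are hypothetical.

```python
import random

# Hypothetical concept schemes, kept as plain collections for brevity.
POSITIONS = ["Software Developer", "HR Specialist"]
# Each hard skill lists the positions it is related to.
HARD_SKILLS = {
    "Python": {"Software Developer"},
    "SQL": {"Software Developer"},
    "Git": {"Software Developer"},
    "Payroll administration": {"HR Specialist"},
    "Recruiting": {"HR Specialist"},
}
SOFT_SKILLS = ["communication", "teamwork", "time management"]


def generate_profile(rng, relation_aware=True, n_hard=2, n_soft=2):
    """Fill the position and skill fields of one synthetic CV."""
    position = rng.choice(POSITIONS)
    if relation_aware:
        # Only consider hard skills linked to the chosen position.
        candidates = sorted(
            skill for skill, positions in HARD_SKILLS.items() if position in positions
        )
    else:
        # Simplest version: any hard skill may be chosen.
        candidates = sorted(HARD_SKILLS)
    return {
        "current_position": position,
        "hard_skills": rng.sample(candidates, min(n_hard, len(candidates))),
        "soft_skills": rng.sample(SOFT_SKILLS, min(n_soft, len(SOFT_SKILLS))),
    }


# A seeded generator makes the output reproducible.
profile = generate_profile(random.Random(42))
```

Passing in a seeded `random.Random` instance keeps runs reproducible, which is convenient when the same demo data set has to be regenerated.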
5. Generate the Synthetic Data
Use the defined structure and the selected values to generate a meaningful CV from the template.
A rich knowledge graph is key to generating documents that feel realistic and varied. A good agile approach to developing such a knowledge graph is to tackle one category at a time and enrich the taxonomy step by step. For instance, we can start by building the taxonomy across all concept schemes with knowledge about the software development discipline only.
So far, we have discussed the idea of filling in fields in a template. However, we have not yet described the nature of those fields. Often these fields are blocks of coherent text, for instance a summary of a person's biography. Given that the use cases we have tackled delimit the structure and content of such blocks of text, we have so far opted for a simplified approach based on the knowledge graph instead of using natural language processing (NLP) to generate sentences. We have defined the types of sentences that appear in each relevant section of a CV. We have also established how to choose those sentences (e.g. by introducing variations in how they are combined).
Each of those sentences has parameters that can be instantiated using concepts from the knowledge graph. When we combine the different types of sentences, and different values within each of them, we end up with a varied range of outputs. The sheer number of possible outputs obscures the fact that the generation follows a fixed pattern.
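A minimal sketch of such parameterized sentences is shown below. The sentence patterns, the parameter names and the sample values are invented for this example and are not the ones used in the CV generator.

```python
import random

# Hypothetical sentence patterns for two sections of a biography.
SUMMARY_TEMPLATES = [
    "{name} works as a {position} at {company}.",
    "{name} is a {position} currently employed by {company}.",
]
EDUCATION_TEMPLATES = [
    "{name} holds a {degree} from {university}.",
    "{name} graduated from {university} with a {degree}.",
]


def generate_biography(values, rng):
    """Pick one pattern per section and fill in the parameters."""
    parts = [rng.choice(SUMMARY_TEMPLATES), rng.choice(EDUCATION_TEMPLATES)]
    # str.format ignores parameters that a pattern does not use.
    return " ".join(part.format(**values) for part in parts)


# In a real setup these values would be drawn from the knowledge graph.
values = {
    "name": "Alex Example",
    "position": "Software Developer",
    "company": "Example Corp",
    "degree": "BSc in Computer Science",
    "university": "Example University",
}

bio = generate_biography(values, random.Random(7))

# Number of distinct sentence combinations for one set of values.
n_variants = len(SUMMARY_TEMPLATES) * len(EDUCATION_TEMPLATES)
```

Even with only two patterns per section there are four sentence combinations for each set of values, and the number of distinct outputs multiplies further with every concept that can fill a parameter.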
The Benefits Of This Approach
This approach is sufficiently generic that it can be applied to any use case where a relatively standard template for documents can be defined (contracts, for example). It also accommodates variations of such a template. It scales well, as we can start with a subset of the knowledge in the taxonomy and extend it over time. In addition, the knowledge graph can be reused for other purposes. Once this infrastructure has been set up, the generation of documents is practically instantaneous.
Feel free to check out our HR Recommender demo, where we used synthetic data generated using this approach.
The Road Ahead
There are complex approaches to generating text using NLP techniques. While we chose to work only with a knowledge graph in the first version of the CV Generator, any number of these NLP techniques could be used to improve the results with regard to the generation of realistic natural language sentences. Our generator was built so that such techniques, as well as advanced statistical models, can easily be incorporated into the basic algorithm.