Tagging 101: What is Auto Classification?
Providing rich metadata that eases the content management process.
Auto classification is a methodology for scanning the contents of a document and automatically assigning tags, or descriptive labels and keywords, to the document. With the support of taxonomies and knowledge graphs, it indexes content into appropriate categories and classes to provide rich metadata that eases the content management process.
Using auto classification, organizations can create high-quality active metadata that improves search capabilities and builds recommender systems that deliver meaningful content.
As the internet has grown exponentially in recent years, some companies have had to upgrade their content management system (CMS) to what Gartner refers to as a digital experience platform (DXP). A DXP takes the CMS one step further by integrating a set of core technologies (websites, customer portals, HR portals, mobile apps, etc.) in one framework to engage various audiences across various channels.
While CMSs and DXPs have reliably helped the organizations that use them, they are not always easy to maintain, simply because they cannot keep up with the volume of different material that an organization produces. This is where auto classification (or semantic tagging) comes in: it allows users to automatically categorize and tag content with rich descriptors that improve the findability of content in search engines and recommender systems.
The common challenges users experience in their CMS.
Too much unstructured content to organize
Difficult to analyze because it lacks a defined or organized data model, unstructured data makes up roughly 80 to 90 percent of organizations’ data. The usual text documents, emails, social media activity, etc. that organizations primarily work with are incredibly difficult for common search engines to sift through.
Language obstacles
Content is often made up of natural language, which can be tricky to interpret because the same word can mean different things, e.g. apple the fruit and Apple the tech company. This ambiguity leads machines to misinterpret meaning, a common problem with virtual assistants, for example, which often take information at face value and cannot read between the lines.
The same can be said for multilingual companies, where documents are in different languages and even numerical spreadsheets rely on linguistic labels. In many cases, organizations that have offices across the world may need to pour resources into translating the material needed to develop company-wide reports or projects.
Content in CMS is hard to find
Many organizations collect and maintain their content in disparate data silos that prevent other users in the organization from obtaining information. The obvious implication is that content is left hidden or inaccessible when it could be very useful for creating new content or reusing old content. As the amount of data and content within an organization increases, data silos only grow more outdated and inconsistent, ultimately hindering the organization’s ability to publish content to its customer-facing websites. Coupled with the fact that content often suffers from language ambiguity, searching for specific material is incredibly difficult and often yields unhelpful results. Aside from being frustrating, this often results in users recreating work that has already been done, or starting a project from scratch.
What’s the solution? Auto Classification
Auto classification serves as an umbrella term for tagging – the practice of extracting and assigning descriptors to your data to create enriched “metadata” – and classification – the practice of categorizing and clustering content into an organized taxonomy. The metadata and clusters help an organization clean up their CMS to rid them of any hassle they may have when trying to locate documents or create new ones.
Along with extracting tags from documents and data, a successful auto classification strategy requires a strong taxonomy to facilitate the categorization of these tags. Through the use of an entity extractor tool, keywords and labels are extracted from documents that are synced to the taxonomy. These tags can be automatically sorted into their corresponding classes and concept schemes in the taxonomy through predefined rules that have been set up in the thesaurus structure, or refined after manual review. The benefit to maintaining semantic tagging in a taxonomy is the consistency it provides through its hierarchical structure and controlled vocabularies.
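The idea of extracting tags against a controlled vocabulary can be illustrated with a minimal Python sketch. It is not the PoolParty Entity Extractor or its API; the `Concept` class, the `TAXONOMY` list, and the `extract_tags` function are hypothetical names invented for illustration, and real extractors use far more than whole-word matching.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One taxonomy concept with a preferred label and alternative labels."""
    pref_label: str
    scheme: str  # concept scheme the concept belongs to
    alt_labels: list[str] = field(default_factory=list)


# Hypothetical mini-taxonomy; a real one is maintained in a taxonomy tool.
TAXONOMY = [
    Concept("renewable energy", "Energy", ["green energy", "renewables"]),
    Concept("solar power", "Energy", ["photovoltaics", "solar energy"]),
    Concept("contract law", "Legal", ["contracts"]),
]


def extract_tags(text: str, taxonomy: list[Concept]) -> dict[str, list[str]]:
    """Return the preferred labels found in the text, grouped by concept scheme."""
    lowered = text.lower()
    tags: dict[str, list[str]] = {}
    for concept in taxonomy:
        for label in [concept.pref_label, *concept.alt_labels]:
            # Whole-word match so a label does not match inside another word.
            if re.search(rf"\b{re.escape(label)}\b", lowered):
                # Always store the preferred label, whichever synonym matched.
                tags.setdefault(concept.scheme, []).append(concept.pref_label)
                break  # one hit per concept is enough
    return tags


doc = ("The report compares photovoltaics with other renewables "
       "and reviews supplier contracts.")
tags = extract_tags(doc, TAXONOMY)
print(tags)
```

Because every synonym resolves to the same preferred label, two documents that use different wording end up with the same tag, which is the consistency benefit of a controlled vocabulary.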
Metadata is data that describes other data. For example, a document will have metadata describing its file type and size, date of document creation, author(s), the dates of any changes, etc. By defining these specific elements of a document, it is easier to find, use, and manage the document. For more information about metadata, check out our complete guide here >
From this visual, we can see that documents are uploaded to the Semantic Classifier, which sorts them into the correct domain; e.g. legal documents get sent to the legal domain, news articles get sent to the communications domain, etc. From there, the PoolParty Entity Extractor that is responsible for that domain extracts the relevant tags from the document based on concepts/parameters defined within the taxonomy and knowledge graph. Based on the extracted tags, the documents are sorted into their respective categories in the taxonomy. The folders in this graphic represent the different nodes, concepts, and concept schemes that are identified in the taxonomy’s hierarchy, where the tags are organized. This process is automated and cyclical.
More information about the auto classification workflow can be found by jumping ahead to the PoolParty infrastructure section >
Altogether, auto classification serves as a way to annotate unstructured text and data so that it is organized and machine readable. The annotated documents are synced back to the CMS with the semantic metadata, and once it is machine readable with this semantic metadata, users can build robust knowledge hubs, search engines, recommender systems, etc. on top.
Semantic tagging explained.
Tags, or semantic metadata, are information building blocks that help classify information assets, making them easier to find, use, and link to each other. In a simpler analogy, consider a person who has packed up stacks of boxes for moving.
The moving company you have hired does not know what is in your boxes or how to care for them, so you mark the boxes with their contents and classify them according to additional features. The labels (plates, silverware, tablecloths) serve as the categories of the boxes, which are further classified as fragile, flammable, must be kept dry, etc. Without these labels, the boxes all look the same, and searching for a specific item would require the unnecessary work of pulling items out of the boxes to see what is inside.
Auto Classification with Knowledge Graphs.
The added benefit of combining these capabilities with knowledge graphs is that you can map the logic between tags. Visually represented as a web of sorts, knowledge graphs link various business assets, entities, concepts, etc. together to show how these things are related. They help to bring all these information elements into a consistent context, because knowledge graphs can also process logical and semantic relationships and even generate them by themselves.
If we again consider our boxes story, the tag “plates” represents a node of the knowledge graph that is linked to the nodes “kitchen” and “fragile.” Within this knowledge graph, the linked tags support the logic in the statements marked A and B and, in the final statement (labelled with a check mark), the next steps for the moving company. In other words, the metadata on these boxes gives helpful instructions and guidance to all the people involved.
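As a rough sketch of how linked tags can yield instructions, the boxes example can be written as a tiny set of subject-predicate-object triples in Python. The predicate names (`belongs_in`, `has_property`, `requires`) and the helper functions are assumptions made up for this illustration; a production knowledge graph would typically live in a graph database and be queried with a graph query language.

```python
# Hypothetical triples (subject, predicate, object) for the moving-box example.
TRIPLES = [
    ("plates", "belongs_in", "kitchen"),
    ("plates", "has_property", "fragile"),
    ("tablecloths", "belongs_in", "dining room"),
    ("fragile", "requires", "handle with care"),
]


def objects(subject: str, predicate: str) -> list[str]:
    """All objects linked to a subject by the given predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def instructions_for(tag: str) -> list[str]:
    """Derive handling instructions by following links from a box's tag."""
    steps = [f"carry to {room}" for room in objects(tag, "belongs_in")]
    for prop in objects(tag, "has_property"):
        # Second hop: a property may itself imply a handling rule.
        steps.extend(objects(prop, "requires"))
    return steps


plate_steps = instructions_for("plates")
print(plate_steps)
```

The box itself only carries the tag "plates"; the room and the handling rule come from following links in the graph, which is what separates a knowledge graph from a flat list of labels.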
Much like the boxes which identify the relation between item to category to room in the house, semantic tags that are mapped in a knowledge graph identify relationships between concepts, terms, documents, etc. and the contents within those documents. With semantic tags, you can bundle these relationships together by adding labels of synonymous terms that make search platforms function smarter. When the semantic metadata is stored in a knowledge graph, documents can be indexed and queried better, allowing for precise user search.
In a CMS, documents can be tagged with authors, topics, authoring dates, etc. If a user is looking for a document by one particular author, all those documents tagged with the same author will be retrieved so that the user does not have to sift through the whole database. In addition, users can more easily locate documents based on the extracted topics and classification, e.g., when searching for news about ‘renewable energy’ vs. event articles about it.
All this metadata helps the user become better oriented to their CMS so they can use it more efficiently.
In our Auto Classification 101 webinar, you can see how our auto classification and semantic tagging solution works in ready-made integrations for SharePoint and Adobe Experience Manager.
Watch the demos for free>
The benefits of implementing auto classification.
Auto classification breeds accuracy and efficiency
While tagging is an important practice for businesses wishing to improve their search and recommender engines, manual tagging is quite limited. Businesses may often start small by manually extracting important terms from their documents (which is akin to someone painstakingly going through an article with a yellow highlighter), but the obvious drawback here is that it is not scalable. Companies that rely on manually adding metadata to their content experience burnout rather quickly, especially as their business continues to grow and their CMS amasses even more documents.
Furthermore, manually tagging entire databases, file by file, often involves a lot of people, which may lead to many errors and inaccuracies. In many cases, two individuals categorize the same information in two different ways or miss data simply because they could not cover it all by hand. A leading company in consulting and management experienced this exact problem and sought out PoolParty’s semantic tagging solutions to fix it.
Auto classification not only improves knowledge management workflows by reducing the amount of time spent curating metadata by hand, it also improves the quality of the metadata that is produced with it. The tags are made against the taxonomy to define clear rules for future classification so that the same rules can be used consistently for every piece of content. Since the behavior is pre-configured and approved ahead of time, there is less room for error or inaccuracy.
Rich metadata and data governance
As we have learned, metadata is data that explains other data. It helps orient users to the contents of a document, but also helps users (and more importantly, machines) understand the context and meaning of a document.
Active semantic metadata highlights missing and incorrect data, but also helps improve the quality of analytics by automatically correcting and enriching the data. In an automated tagging workflow, the pre-configured rules defined for the tagging behavior ensure that data is compliant and any errors are obvious. Besides helping comply with regulatory and business requirements, metadata management helps assess the impact of a change within a data source. It also supports accountability for the terms and definitions of a business glossary to lead organizations towards the development of a standardized data model.
These are all components of data governance, which Gartner defines as “the specification of decision rights and an accountability framework to ensure the appropriate behavior in the valuation, creation, consumption and control of data and analytics.” Organizations should strive to achieve data governance in order to comply with legal regulations as well as internal policies and procedures. Metadata management (and thus data governance) helps security and risk professionals be ahead of problematic scenarios by classifying data according to risk and security needs. It facilitates data lineage, impact analysis, and data management to reinforce privacy requirements.
Intelligent search and personalized content
The most obvious and advantageous benefit of auto classification is the vast improvement it makes to search engines. The metadata attached to documents and multimedia assets is crucial for search platforms because they retrieve results based on keywords and other annotations.
On an ecommerce website, for example, a user who searches for “sweater” could also be recommended products that contain the keywords “hoodie” or “cardigan” because they have been bundled together with tags.
The categorization aspect of auto classification takes it a step further by putting these terms relating to sweater in appropriate categories or classes. If we consider a user’s filtering path on a clothing website, they might first toggle through the different product categories: Shirts, Coats, Dresses, Pants, Shoes > from Shirts they can select Graphic Tees, Sweaters, and Blouses > from Sweaters, they can toggle different options such as size or color. These products have all been tagged with appropriate keywords so that the correct products show up with each narrower category. This is precisely the behavior of a taxonomy that is organized on the backend which couples the bundled tags and classes together so that the user has a seamless experience using keyword search in text fields, or filtered searches from the product menus.
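The two behaviors described here, keyword search over bundled tags and filtering along a category hierarchy, can be sketched in a few lines of Python. The product data, the `BROADER` and `RELATED_TERMS` tables, and the function names are hypothetical and only mirror the clothing example; they do not represent any particular ecommerce platform or the PoolParty API.

```python
# Hypothetical taxonomy: child category -> parent category.
BROADER = {
    "Sweaters": "Shirts",
    "Graphic Tees": "Shirts",
    "Blouses": "Shirts",
}

# Hypothetical tag bundles: terms treated as related within a category.
RELATED_TERMS = {
    "Sweaters": {"sweater", "hoodie", "cardigan"},
}

PRODUCTS = [
    {"name": "Wool cardigan", "category": "Sweaters", "tags": {"cardigan"}},
    {"name": "Zip hoodie", "category": "Sweaters", "tags": {"hoodie"}},
    {"name": "Band tee", "category": "Graphic Tees", "tags": {"t-shirt"}},
    {"name": "Rain coat", "category": "Coats", "tags": {"coat"}},
]


def in_category(product: dict, category: str) -> bool:
    """True if the product's category is the given one or narrower than it."""
    current = product["category"]
    while current is not None:
        if current == category:
            return True
        current = BROADER.get(current)  # walk up the hierarchy
    return False


def filter_by_category(category: str) -> list[str]:
    """Names of all products in a category, including its subcategories."""
    return [p["name"] for p in PRODUCTS if in_category(p, category)]


def keyword_search(term: str) -> list[str]:
    """Names of products tagged with the term or any term bundled with it."""
    expanded = {term}
    for bundle in RELATED_TERMS.values():
        if term in bundle:
            expanded |= bundle
    return [p["name"] for p in PRODUCTS if p["tags"] & expanded]


print(keyword_search("sweater"))
print(filter_by_category("Shirts"))
```

Neither product is tagged "sweater" directly, yet both sweaters are returned because the search term is expanded to its bundle first. The category filter reaches them through the hierarchy in the same way.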
In another scenario, picture a company that is responsible for the research and development of a new drug. This project requires extensive reading and writing about the Delta Variant using material from their CMS, which contains thousands of research documents. A simple search for “Covid 19 mutation” returns all the literature in the database containing those keywords, including news articles, internal HR updates for home office rules, etc.
The obvious problem here is that most of this information is of no value to the research, either because it has nothing to do with scientific information or because it is outdated, as when the search retrieves information about the Beta Variant from the early stages of the virus. A CMS search using semantic tools could make assumptions about the user’s intent because the content has been tagged and classified. Recommender systems or semantic search platforms (which are built on auto classification) are powerful in this case, because they can retrieve relevant and precise information based on filtering and confidence scores. In other words, documents that have been classified under the Delta Variant and tagged with recent publishing dates will get higher confidence scores and be shown at the top of the list of results.
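A simplified way to picture this ranking is shown in the Python sketch below. The scoring formula (a recency decay with an arbitrary 180-day half-life, applied only to documents in the wanted class) is an assumption chosen for illustration and is not how PoolParty or any specific search engine computes confidence scores. The documents and dates are invented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    title: str
    classes: set[str]   # classes assigned by auto classification
    published: date


def confidence(doc: Doc, wanted_class: str, today: date,
               half_life_days: int = 180) -> float:
    """Toy score: 0 outside the wanted class, otherwise decaying with age."""
    if wanted_class not in doc.classes:
        return 0.0
    age_days = max((today - doc.published).days, 0)
    return 0.5 ** (age_days / half_life_days)


def rank(docs: list[Doc], wanted_class: str, today: date) -> list[str]:
    """Titles of matching documents, highest score first."""
    scored = [(confidence(d, wanted_class, today), d.title) for d in docs]
    return [title for score, title in sorted(scored, reverse=True) if score > 0]


DOCS = [
    Doc("Early Beta variant findings", {"Beta Variant"}, date(2021, 1, 10)),
    Doc("Delta variant preprint", {"Delta Variant"}, date(2021, 6, 1)),
    Doc("Home office rules update", {"HR"}, date(2021, 9, 1)),
    Doc("Delta variant genome study", {"Delta Variant"}, date(2021, 8, 1)),
]

ranked = rank(DOCS, "Delta Variant", today=date(2021, 9, 15))
print(ranked)
```

The HR update and the Beta Variant paper drop out because of their classification, and the newer Delta Variant study ranks above the older preprint because of its publishing date.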
Auto classification helps the user find precise information in their CMS more easily so that tedious workflow is greatly reduced.
Setup and infrastructure:
How to integrate PoolParty into your CMS.
Aside from data governance and a well-organized system, the end goal, and perhaps the main reason companies seek out auto classification tools, is the ability to transform their search engines. Our hypothetical drug development company wants the PoolParty Semantic Suite to return relevant results in half the time, with far more accuracy.
PoolParty PowerTagging, the comprehensive solution to this wish list, comprises important semantic features:
Classification of documents into domains
Semantic Classifier, free classifier demo >
The Semantic Classifier uses knowledge graphs and machine learning to precisely classify documents into their respective domains. This product only requires a small training dataset to learn how to classify your documents automatically. The classified documents are sent to their correct domains so that the tags can be extracted correctly.
Text mining and natural language processing
Entity Extractor, free extractor demo >
Extract terms, phrases, and entities from unstructured documents. Use algorithms and a rule engine to resolve ambiguities in everyday spoken language or unstructured text. For example, the machine can recognize that Apple refers to the tech company and not the fruit. The extractors can be set up according to specific knowledge domains.
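To give a feel for context-based disambiguation, here is a deliberately simple Python sketch that picks a sense of "apple" by counting overlapping context words. The cue-word lists and the `disambiguate` function are assumptions invented for this example. Actual extractors, including the Entity Extractor described above, rely on much richer algorithms and rules than a word-overlap count.

```python
import re
from typing import Optional

# Hypothetical context cues for each sense of the word "apple".
SENSES = {
    "Apple (company)": {"iphone", "mac", "software", "shares", "ceo"},
    "apple (fruit)": {"orchard", "pie", "juice", "tree", "harvest"},
}


def disambiguate(sentence: str) -> Optional[str]:
    """Pick the sense whose cue words overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    # Without any supporting context, refuse to guess.
    return best if scores[best] > 0 else None


print(disambiguate("Apple shares rose after the CEO presented the new iPhone."))
print(disambiguate("She baked an apple pie with fruit from the orchard."))
```

Returning `None` when no cue word is present leaves room for the expert-assisted review step rather than forcing a wrong tag.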
Read more about NLP in our white paper >
Categorization of tagged documents into the taxonomy
Thesaurus Server & GraphSearch, free graph demo >
Tagged documents can be sorted into the different concept schemes of the taxonomy. With the support of knowledge graphs, smart search applications and recommender systems can be built by recalling the concepts defined from the annotated documents.
Altogether, these are the main outcomes of PoolParty’s auto classification capabilities. Since PoolParty functions as middleware, the PoolParty API can be integrated into the CMS to show the tags and enable smart semantic search. The integration, coupled with the PoolParty GraphSearch Server and Thesaurus APIs, means that users can perform faceted searches that look for occurrence, concepts, and meaning instead of relying only on specific keyword input.
A simple breakdown of the PoolParty setup is as follows:
Users can decide if they want an automatic or expert-assisted (semi-automatic) workflow. A semi-automatic workflow enables the user to review and refine the tags that are automatically returned in the PoolParty taxonomy.
PoolParty for SharePoint.
SharePoint is one of the more popular CMSs. While its native search is helpful when you are trying to find a small number of files using broad query parameters, many users run into problems when dealing with a lot of data and more refined search requirements.
The following are critical drawbacks of SharePoint’s native search functionalities:
- User search experience depends heavily on how the feature was set up by administrators.
- Search is limited to the database the user is working with.
- Without additional customization, search results cannot be filtered by any category other than the age of the document.
Unfortunately, SharePoint’s out-of-the-box search functionality is not capable of returning results that are timely, comprehensive and relevant.
PoolParty for SharePoint’s automated tagging vastly improves SharePoint’s search performance. Read more about PoolParty’s ready-made integration with SharePoint >
PoolParty integration with Tridion.
Tridion connects people, processes, and information with the most complete portfolio of collaborative Content Management, Knowledge Management and Headless delivery technologies. Organizations can expand their global reach by extending with Accelerators for fast time-to-value and RWS Translation Management solutions.
PoolParty has partnered with RWS, the creators of Tridion, to enhance Tridion’s tagging capabilities and deliver a smart tagging solution.
PoolParty integration with Adobe Experience Manager.
Adobe Experience Manager (AEM) is a CMS for building websites and applications by managing all digital assets and publishing them to front-end sites. Currently, AEM does support tagging of these assets; however, the native tagging functionality (AEM Tags) is quite limited.
AEM Tags have the following limitations:
- Synonyms and alternative labels and terms are not supported by default
- Polyhierarchy is not available
- Tags do not have their own unique ID, meaning if you move tags around or change their details, the system could not identify their relations or redirect them correctly
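The Python sketch below shows what these three missing capabilities look like in a concept model that has them, loosely in the style of SKOS (preferred label, alternative labels, broader concepts). The `Concept` class, the `ex:` identifiers, and the asset name are hypothetical; this does not reproduce how AEM or PoolParty store tags internally.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    uri: str                                           # stable unique ID
    pref_label: str
    alt_labels: set[str] = field(default_factory=set)  # synonyms
    broader: set[str] = field(default_factory=set)     # URIs of parent concepts


# Hypothetical concepts; "ex:" stands in for a real namespace.
concepts = {
    "ex:c1": Concept("ex:c1", "Outdoor"),
    "ex:c2": Concept("ex:c2", "Footwear"),
    # Polyhierarchy: one concept with two parents.
    "ex:c3": Concept("ex:c3", "Hiking boots", {"trekking boots"},
                     {"ex:c1", "ex:c2"}),
}

# Assets reference concepts by URI, not by label or position in the tree.
asset_tags = {"boots-guide.pdf": {"ex:c3"}}


def labels_for(asset: str) -> list[str]:
    """Current preferred labels of the concepts an asset is tagged with."""
    return sorted(concepts[uri].pref_label for uri in asset_tags[asset])


def parents_of(uri: str) -> list[str]:
    """Preferred labels of all broader concepts."""
    return sorted(concepts[p].pref_label for p in concepts[uri].broader)


before = labels_for("boots-guide.pdf")
concepts["ex:c3"].pref_label = "Hiking footwear"  # rename the concept
after = labels_for("boots-guide.pdf")
print(before, after, parents_of("ex:c3"))
```

Since the asset points at `ex:c3` rather than at a label or a path, renaming or moving the concept does not break the link between the asset and its tag.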
AEM’s tags can be enhanced using the integration set up with PoolParty and the Semantic Booster. Through our strong partnership with Mekon, you can:
- In PoolParty Thesaurus Server, build and maintain a taxonomy which allows you to easily tag and control your content vocabularies
- Using Mekon’s Semantic Booster, sync the taxonomies you built to AEM Tags
- Tag content more easily and intelligently using PoolParty’s sophisticated classification algorithm