In the time it takes a human life sciences researcher to read one research article, machines can process hundreds of thousands of articles. An unco-ordinated army of bots, crawlers, and other software agents is active day and night on the Internet, discovering, ingesting, and analyzing research content. Many of these agents are designed to help researchers rapidly filter the ever-expanding research record and surface the articles and findings most relevant to their work. For these software agents to be most effective, they need to understand the content they are reading in a manner similar to an expert human reader. (What are the main concepts being discussed and what are the main findings asserted? What is this research article telling us that is new and what is supporting or contradicting past findings?) This is where semantic enrichment comes into play: it adds structured machine-readable metadata to life science articles to assist software agents in ‘reading’ the content in a manner similar to a human researcher. In the present study, I will define the mechanism of semantic enrichment of life sciences content, examine the benefits it is bringing to researchers today, and preview promising avenues for future benefits.
Origin of the vision for a ‘Semantic Web’
In 2001, World Wide Web creator Tim Berners-Lee and his colleagues previewed a vision for the Semantic Web, an internet evolution that ‘will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users’. While this all-encompassing vision was subsequently challenged as too utopian to work effectively across the entire internet [2–4], it has now established a solid foothold in life sciences research publishing.
What is semantic enrichment?
Most research article tagging in XML, HTML, ePub, and PDF is designed to describe the structure and display of the article — this is the title, these are the authors, this is a figure legend, etc. Semantic enrichment tagging is different — it describes the concepts in the content.
Semantic enrichment in the life sciences is a direct application of Berners-Lee's Semantic Web vision. (Berners-Lee himself noted that the ‘e-science community’ was leading the way to the Semantic Web in his 2006 follow-up article, The Semantic Web Revisited.) Semantic enrichment is the process in which important entities and assertions (‘meaningful content’) are isolated from the narrative text of research articles and placed into normalized structures, so that publisher and researcher applications (‘software agents’) can read and process them in new and helpful ways (‘carry out sophisticated tasks for users’).
How does semantic enrichment work?
Identifying entities and assertions
The main body of life science research articles is written as narrative text in a form designed to be read and understood by human readers. Research articles contain complete sentences, complex dependent clauses, a large array of words (nouns, verbs, adjectives, adverbs, and conjunctions), and images, tables, and data files that together convey the ‘story’ of the research performed and the findings produced.
The first step in semantic enrichment is to distill human-readable narrative text into a structured list of entities and assertions:
Entities are the discipline-relevant concepts directly mentioned or inferred by the text. Examples include genes, chemical compounds, catalysts, mechanisms of action, diseases, places, and people. Some semantic enrichment systems use pre-determined lists of discipline-specific entities to identify a text's entities — in the form of controlled vocabularies, taxonomies, and thesauri — while others use text-analysis, natural language processing (NLP), and training sets to determine relevant entities.
Assertions are multi-part statements that follow a subject–predicate–object pattern. For example, ‘DDX3 (subject) facilitates (predicate) IKKα activation (object)’. In the W3C standard for RDF, these three-part statements are known as ‘triples’.
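The subject–predicate–object pattern can be illustrated with a minimal Python sketch. It is not how any production enrichment system works: the `Triple` class, the `PREDICATES` vocabulary, and the `extract_triple` helper are hypothetical names introduced here, and the extractor only handles simple sentences that contain a predicate from a small assumed list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triple:
    """A three-part assertion: subject, predicate, object."""
    subject: str
    predicate: str
    object: str


# Hypothetical controlled vocabulary of predicates (an assumption for this sketch).
PREDICATES = ("facilitates", "inhibits", "activates")


def extract_triple(sentence: str) -> Optional[Triple]:
    """Split a simple sentence around the first vocabulary predicate it contains.

    Returns None when no predicate from PREDICATES appears in the sentence.
    """
    text = sentence.strip().rstrip(".")
    for predicate in PREDICATES:
        marker = f" {predicate} "
        if marker in text:
            subject, obj = text.split(marker, 1)
            return Triple(subject.strip(), predicate, obj.strip())
    return None


triple = extract_triple("DDX3 facilitates IKKα activation.")
print(triple)
```

Real systems would rely on NLP parsing or trained models rather than string matching; the sketch only shows the shape of the structured output.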
One of the most vital calculations made by a semantic enrichment system is that of significance. A single life sciences research article will contain hundreds (or thousands) of entities and assertions within its text. However, they should not be treated as equal — some entities and assertions comprise the core findings of the research article while others appear in less significant context (or even in counter-example). To deliver the benefits described in the next section — such as search targeting and filtering, auto-summarization, and highly relevant article recommendations — it is crucial that the entities and assertions of each article are examined and ranked for significance.
Semantic enrichment systems contain algorithms that assign a significance score to each entity and assertion. These scores are used to discard those that fall below a significance threshold and rank the remaining concepts. These scoring algorithms can take into consideration the location of the entity or assertion in the research article (giving more weight to those found in titles and abstracts, for example) and the uniqueness of the entity or assertion in the overall corpus. Recently developed systems also include machine learning algorithms that can evolve their scoring over time with training and feedback.
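One way to picture such a scoring step is the sketch below, which combines a location weight with a corpus-uniqueness factor in the style of inverse document frequency. The weights, the threshold, the formula, and all names (`LOCATION_WEIGHTS`, `significance`, `rank_entities`) are assumptions chosen for illustration, not the method of any particular system.

```python
import math

# Assumed weights: mentions in titles and abstracts count for more than body text.
LOCATION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "body": 1.0}


def significance(mention_sections, docs_with_entity, total_docs):
    """Score = summed location weights x log-scaled corpus uniqueness."""
    location_score = sum(LOCATION_WEIGHTS.get(s, 1.0) for s in mention_sections)
    uniqueness = math.log(total_docs / docs_with_entity)
    return location_score * uniqueness


def rank_entities(article_mentions, doc_frequency, total_docs, threshold):
    """Discard entities below the threshold and rank the rest, highest first."""
    scored = {
        entity: significance(sections, doc_frequency[entity], total_docs)
        for entity, sections in article_mentions.items()
    }
    kept = [(entity, score) for entity, score in scored.items() if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Illustrative input: where each entity is mentioned in one article,
# and in how many articles of a 1000-article corpus it appears.
mentions = {
    "DDX3": ["title", "abstract", "body"],
    "IKKα": ["abstract", "body"],
    "protein": ["body", "body"],
}
doc_frequency = {"DDX3": 10, "IKKα": 50, "protein": 900}

ranked = rank_entities(mentions, doc_frequency, total_docs=1000, threshold=1.0)
for entity, score in ranked:
    print(f"{entity}: {score:.2f}")
```

In this toy input, the very common term 'protein' falls below the threshold even though it is mentioned twice, while the rarer entities mentioned in the title or abstract are retained.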
Creation and sharing of the semantic layer
Once identified, the significant structured entities and assertions are then added to the content delivery package — either in a separate metadata package or as tags directly embedded in the content.
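As a rough sketch of the 'separate metadata package' option, the significant entities and assertions could be serialized as JSON alongside the article. The field names, the article identifier, and the significance values below are hypothetical; actual publishers may use RDF, JSON-LD, or embedded tags instead.

```python
import json


def build_semantic_layer(article_id, entities, assertions):
    """Assemble a metadata package from (label, type, score) and
    (subject, predicate, object, score) tuples."""
    return {
        "article_id": article_id,
        "entities": [
            {"label": label, "type": kind, "significance": score}
            for label, kind, score in entities
        ],
        "assertions": [
            {"subject": s, "predicate": p, "object": o, "significance": score}
            for s, p, o, score in assertions
        ],
    }


# Illustrative values only; the identifier and scores are made up for this sketch.
package = build_semantic_layer(
    "example-article-001",
    entities=[("DDX3", "gene", 0.92)],
    assertions=[("DDX3", "facilitates", "IKKα activation", 0.88)],
)

serialized = json.dumps(package, ensure_ascii=False, indent=2)
print(serialized)
```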
What benefits can semantic enrichment deliver for researchers?
With semantically enriched content, software systems that interact with research content now have additional topical metadata with which to work. Research publishing systems already use semantic metadata to produce the following benefits:
A common benefit comes when semantic metadata is present during search:
Discipline-specific search autocomplete: when a researcher begins to type a query, suggested concepts from the semantic layer are presented to allow for faster searching, reduction of typographical errors/misspellings, and surfacing of phrases that may better match the searcher's intent. For example, on McGraw-Hill's AccessMedicine website (accessmedicine.mhmedical.com), typing ‘brach’ produces suggestions ‘brachial plexus’, ‘brachial artery’, ‘brachial plexus neuritis’, ‘brachiocephalic artery’ and more.
Better search ordering and filtering: because semantic enrichment includes a significance calculation for each concept, it can be combined with full-text search to push significant discussions of topics higher in search results, thus reducing researcher time to find the most relevant content for their query. Once a query has been run, semantic topics can be presented to a researcher to enable them to filter their results down to only significant discussions of a concept or a combination of concepts.
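Both search benefits can be sketched in a few lines of Python. The vocabulary below reuses the 'brach' suggestions quoted above plus two filler terms; the function names and the boosting formula (text score multiplied by one plus significance) are assumptions for illustration, not a description of how any named site implements search.

```python
from bisect import bisect_left


def autocomplete(prefix, sorted_vocabulary, limit=5):
    """Return up to `limit` vocabulary terms that start with `prefix`.

    Assumes `sorted_vocabulary` is a sorted list of lowercase terms.
    """
    prefix = prefix.lower()
    start = bisect_left(sorted_vocabulary, prefix)
    suggestions = []
    for term in sorted_vocabulary[start:]:
        if not term.startswith(prefix):
            break  # sorted order means no later term can match
        suggestions.append(term)
        if len(suggestions) == limit:
            break
    return suggestions


def rerank(hits):
    """Order (article_id, text_score, significance) hits, boosting the
    full-text score by the concept's significance in that article."""
    return sorted(hits, key=lambda h: h[1] * (1.0 + h[2]), reverse=True)


VOCABULARY = sorted([
    "aorta",
    "brachial artery",
    "brachial plexus",
    "brachial plexus neuritis",
    "brachiocephalic artery",
    "bronchus",
])

suggestions = autocomplete("brach", VOCABULARY)
print(suggestions)

# Article B matches the text slightly less well but discusses the concept
# far more significantly, so it moves ahead of article A.
ordered = rerank([("article-A", 2.0, 0.1), ("article-B", 1.5, 0.9)])
print(ordered)
```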
Discipline navigation and orientation
For researchers new to a discipline or concept, semantic metadata can be compiled into alternative views that allow researchers to quickly understand the basic assertions and concept relationships in that discipline.
Automated content grouping: research systems display semantic concepts on article pages that, when selected, navigate the user to a filtered list of all of the significant content on that concept in one location.
Knowledge graph: made popular by Google starting in 2012, knowledge graphs are compilations of assertions about a concept, including definitions, facts, descriptive images, and related concepts (Figure 1).
Visual concept navigation: new research applications such as Yewno Discover (https://www.yewno.com/discover/) show connections between concepts in a visual format, allowing the user to examine links between concepts (Figure 2).
Figure 1. Google knowledge graph.
Figure 2. Example of visual concept connections in Yewno Discover.
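The idea of compiling assertions into a per-concept view can be sketched as follows. This is a simplified illustration, not the data model of Google's knowledge graph or Yewno Discover; the function names and the sample triples are assumptions.

```python
from collections import defaultdict


def build_graph(triples):
    """Group (subject, predicate, object) assertions by their subject."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def related_concepts(graph, concept):
    """List the distinct objects linked to a concept, in sorted order."""
    return sorted({obj for _, obj in graph.get(concept, [])})


# Illustrative assertions gathered from a hypothetical enriched corpus.
triples = [
    ("DDX3", "facilitates", "IKKα activation"),
    ("DDX3", "is a", "gene"),
]

graph = build_graph(triples)
print(related_concepts(graph, "DDX3"))
```

A navigation or visualization layer could then draw each concept as a node and each assertion as a labeled edge.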
More accurate recommendations and personalization
Semantic enrichment is also used to create more accurate recommendations. These recommendations take place on research platforms, as well as via email alerting.
On-site recommendations (read this next): because the semantic enrichment layer includes a significance judgment, recommendations to articles with similar concept/assertion profiles can be better calculated and surfaced for the researcher.
Personalized recommendation (content finds you): in addition to content obtaining a semantic profile, researchers themselves can be assigned a semantic profile from their reading history. This profile is used to surface new content that matches a researcher's interests. An example of this in action is Springer Nature's Recommended email service (http://recommended.springernature.com/recommended/), which creates a personalized reading list for each of its subscribers.
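A simple way to illustrate both kinds of recommendation is to treat each article's semantic profile as a mapping from concept to significance score and compare profiles with cosine similarity. This is a sketch under that assumption; it does not describe Springer Nature's service, and all names and values are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two concept -> weight mappings."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def reader_profile(reading_history):
    """Sum the concept weights of every article a researcher has read."""
    profile = Counter()
    for article_profile in reading_history:
        profile.update(article_profile)
    return dict(profile)


def recommend(profile, candidates, top_n=3):
    """Rank candidate articles by similarity to the given profile."""
    scored = [(article_id, cosine(profile, p)) for article_id, p in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Illustrative data: one article already read and two unread candidates.
history = [{"DDX3": 0.9, "IKKα": 0.5}]
candidates = {
    "article-x": {"DDX3": 0.8, "innate immunity": 0.4},
    "article-y": {"brachial plexus": 0.9},
}

recommendations = recommend(reader_profile(history), candidates)
print(recommendations)
```

Passing a single article's profile instead of a reader profile gives the on-site 'read this next' case.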
Emerging benefits for researchers
More effective text mining
Researchers are increasingly using software (NLP text mining as well as machine learning algorithms) to interrogate an entire corpus of research at one time. Having a normalized, structured set of significant concepts and assertions along with the content reduces ambiguity and the time spent normalizing content for text mining projects. In The Handbook of Data Mining, Pyle argues that ‘data preparation consumes 60 to 90% of the time needed to mine data — and contributes 75 to 90% of the mining project's success’.
As the number of research articles increases per year, the ability to ‘stay current’ with a field by reading all of its articles, or even all of its abstracts, becomes less possible. With a semantic layer of significant concepts and assertions, it may become possible to automatically produce even shorter summaries of articles or summaries of groups of articles (for example, a single ‘key findings’ summary for 20 articles on a concept).
However, generating abstractive summaries directly from the article text remains a difficult computer science problem due to the large variation in text and phrase structures. Semantic enrichment provides a shortcut by already identifying the significant concepts and assertions. As Allahyari writes, ‘…abstractive summarizers often rely on an extractive preprocessing component to produce the abstract of the text’ (highlights mine).
Direct answer and task agents
Direct-answer and task agent applications are currently gaining popularity in the mass market, with services such as Alexa, Siri, Cortana, and others promising to directly answer questions (‘Where is the nearest sushi restaurant?’) as well as perform tasks (‘Book a table for two at Sushi-Yoko tonight at 7:30 pm’). These systems perform best when accessing structured data sets in which concepts and assertions are clearly stated. It does not seem far-fetched that these agents could soon be used to examine the research literature and return answers to questions or perform research tasks. A machine-readable semantic layer could provide a huge boost to the accuracy and effectiveness of these systems in a research setting.
As more aspects of knowledge discovery are assisted or directly completed by software, it becomes even more vital that these software agents understand the concepts and assertions being made in research. It is easy to imagine a near-future scenario in which a software agent scans thousands of research articles and misinterprets their findings, thus negatively affecting funding decisions, governmental policy, or even patient care.
Semantic enrichment of life sciences content reduces text ambiguity for software agents, thus facilitating more accurate search and retrieval, concept navigation, and recommendations for researchers today, and it will play a supporting role in the future of auto-summarization, text mining, and direct-answer systems. The most innovative research software agents of the future are likely to rely on semantically enriched content to bring accuracy and precision to their outputs and to reduce errors.
The Author declares that there are no competing interests associated with this manuscript.