Europe PMC releases annotated full-text corpus for biomedical entities and associations

Europe PMC, an open access repository of life science research, has developed a new machine-learning approach to extracting relevant information from research papers. Text-mining algorithms can automatically extract key concepts, relationships, and findings from scientific papers, allowing researchers to quickly identify relevant information and stay up to date on the latest developments in their field.

Europe PMC's SciLite Annotations tool uses text mining to highlight terms in research articles and preprints, allowing users to quickly scan an article for relevant concepts, such as diseases, chemicals, or protein interactions. Europe PMC contains ∼1.3 billion annotations sourced in-house and from 10 external providers. The annotations platform covers multiple annotation types, including bioentities ranging from accession numbers to Open Targets gene–disease relationships. Users can access the annotations programmatically through the Annotations API, which reduces the time needed to extract facts and evidence and helps advance the discovery process.
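As a rough illustration of programmatic access, the sketch below only builds a request URL. The base URL, the `annotationsByArticleIds` endpoint, the parameter names (`articleIds`, `type`, `format`) and the `Diseases` type value are assumptions recalled from the public API and are not stated in this article, so they should be checked against the official Annotations API documentation. The helper name `build_annotations_url` and the article ID are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the official Annotations API documentation.
BASE_URL = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"


def build_annotations_url(article_id, annotation_type=None, fmt="JSON"):
    """Build a query URL for one article (hypothetical helper).

    article_id is assumed to look like "SOURCE:ID", for example "MED:<PubMed ID>".
    """
    params = {"articleIds": article_id, "format": fmt}
    if annotation_type is not None:
        # Assumed filter parameter that restricts results to one annotation type.
        params["type"] = annotation_type
    return BASE_URL + "?" + urlencode(params)


# "MED:12345678" is a placeholder ID, not a real reference to a paper.
url = build_annotations_url("MED:12345678", annotation_type="Diseases")
print(url)
```

The resulting URL could then be fetched with `urllib.request.urlopen` and the response parsed with `json.load`, assuming the service returns JSON for `format=JSON`.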

Europe PMC has developed annotations for Gene/Protein, Disease, Organism and Chemical bioentities. To do so, it uses established ontologies as dictionaries and pattern-matches entity terms in the text. Although the dictionary-based approach is easy to understand and implement, it needs an exhaustive list of patterns to achieve high recall, and regular updates to remain current. Moreover, because it ignores context, it produces ambiguous matches, especially for the acronyms and abbreviations that are common in scientific writing.
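The dictionary-based approach and its ambiguity problem can be illustrated with a minimal sketch. This is not Europe PMC's pipeline: the four-term dictionary, the case-insensitive matching, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical mini-dictionary mapping surface terms to entity types.
# A real system would derive these terms from full ontologies.
DICTIONARY = {
    "breast cancer": "Disease",
    "BRCA1": "Gene/Protein",
    "CAT": "Gene/Protein",  # catalase gene symbol
    "Homo sapiens": "Organism",
}


def build_pattern(terms):
    # Try longer terms first so multi-word terms win over shorter overlaps.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def annotate(text, dictionary=DICTIONARY):
    """Return (start, end, matched_text, entity_type) for every dictionary hit."""
    lookup = {term.lower(): label for term, label in dictionary.items()}
    pattern = build_pattern(dictionary)
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


text = "BRCA1 mutations raise breast cancer risk. The cat sat on the CAT scan report."
hits = annotate(text)
for hit in hits:
    print(hit)
```

Without context, the matcher labels both the animal "cat" and the "CAT" in "CAT scan" as Gene/Protein, which is the kind of acronym ambiguity described above.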

To address the challenges of the dictionary-based approach to annotations, the Data Science team at Europe PMC has explored machine learning and deep learning techniques. Training such models requires a gold-standard dataset of annotations. While corpora without annotations are good for learning semantics, text-mining tools trained on human-annotated corpora outperform those trained on non-annotated ones. Therefore, open-source gold-standard datasets are crucial for improving biomedical text-mining systems.

Building on this, the team methodically developed an annotated full-text corpus following a set of annotation guidelines. The corpus is a collection of 300 research articles from the Europe PMC open access subset. Human annotators marked mentions of three biomedical concepts in the selected articles: Gene/Protein, Disease, and Organism. The annotators used the guidelines to select the correct text span and annotation type.
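To make "text span and type" concrete, the sketch below represents each annotation as a character span with a label and scores a tool's output against human annotations by exact match. The article does not describe the corpus file format or any evaluation metric, so the `Span` structure, the offsets, and the exact-match precision, recall and F1 scoring are assumptions chosen for illustration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # "Gene/Protein", "Disease" or "Organism"


def exact_match_scores(gold, predicted):
    """Precision, recall and F1, counting a prediction as correct only if
    its start, end and label all equal those of a gold annotation."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented spans for the sentence "BRCA1 mutations raise breast cancer risk."
gold = [Span(0, 5, "Gene/Protein"), Span(22, 35, "Disease")]
predicted = [
    Span(0, 5, "Gene/Protein"),    # correct
    Span(22, 28, "Disease"),       # wrong boundary: "breast" only
    Span(46, 49, "Gene/Protein"),  # spurious hit
]

precision, recall, f1 = exact_match_scores(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is strict: the prediction that covers only "breast" counts as an error even though it overlaps the gold span, which is why guidelines that pin down span boundaries matter.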

The Europe PMC team faced several challenges in creating a gold standard training set to support machine learning approaches for entity extraction. One of the first hurdles was to select a small number of representative articles from the several million available in the Europe PMC database. For this purpose, a strategy was designed and several techniques were employed to stratify articles and select the representative set.
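The article does not say which stratification variables or techniques the team used. As a generic sketch of the idea, the example below groups articles by a hypothetical `type` field and draws a sample whose strata are roughly proportional to their share of the collection. All names and data are invented for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(articles, key, total, seed=0):
    """Draw about `total` articles, giving each stratum a quota proportional
    to its size (at least one article per stratum)."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    strata = defaultdict(list)
    for article in articles:
        strata[key(article)].append(article)

    sample = []
    for label in sorted(strata):
        members = strata[label]
        quota = max(1, round(total * len(members) / len(articles)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Invented collection: 60 research articles, 30 reviews, 10 case reports.
articles = (
    [{"id": i, "type": "research"} for i in range(60)]
    + [{"id": 60 + i, "type": "review"} for i in range(30)]
    + [{"id": 90 + i, "type": "case-report"} for i in range(10)]
)

sample = stratified_sample(articles, key=lambda a: a["type"], total=10)
counts = Counter(a["type"] for a in sample)
print(counts)
```

Because quotas are rounded per stratum, the sample size can differ slightly from `total` when the proportions do not divide evenly.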

The gold-standard dataset is a significant step forward in improving biomedical text-mining systems, helping researchers extract the information they need to advance the discovery process. Europe PMC hopes that this approach will enable researchers to keep up to date on the latest developments in their field, facilitate information discovery, and foster literature-data integration.
