
Extracting Additional Metadata from Structured and Semi-Structured Fields in Cohorts’ Data


Scientific publications describe cohorts in addition to providing well-structured metadata. Harmonizing attributes across different cohorts is critical to leveraging the collective potential of the data. However, the individual needs and requirements of each domain make it difficult to employ a common set of tools or to apply standard machine learning techniques directly.

In addition, the supplementary sections of research papers often contain large amounts of semi-structured data. This supplementary data could reaffirm and enhance the original data, which would be very helpful for search and data discovery. Because of its nature and scale, supplementary data is not amenable to manual processing; extracting useful information from it instead requires machine learning and Natural Language Processing (NLP) techniques.

The text-mining group established by the Common Infrastructure for National Cohorts in Europe, Canada, and Africa (CINECA) aims to provide common tools and methods to extract additional metadata from structured and semi-structured fields in cohorts’ data. CINECA brings together a diverse collection of human cohorts consisting of 1.4M individuals in Canada, Europe, and Africa. The cohorts provide a representation of the scales, types, variable consents, and Ethical, Legal, and Social Implications (ELSI) challenges related to global cohorts, thus ensuring a representative set for CINECA’s activities.

For instance, the University of Applied Sciences and Arts of Western Switzerland (HES-SO)/Swiss Institute of Bioinformatics (SIB) has focused on the CoLaus/PsyCoLaus cohort data and has developed a pipeline using MetaMap, a machine learning and rule-based framework for assigning unambiguous metadata descriptors to free text. Similarly, Simon Fraser University (SFU) has developed LexMapr, a rule-based text-mining tool that cleans up and parses short-form unstructured text to extract biomedical entities and map them to standard ontology terms. The European Bioinformatics Institute (EMBL-EBI) has developed the Zooma and Curami pipelines to annotate and curate semi-structured data.
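The general idea of rule-based mapping from short free text to ontology terms can be illustrated with a minimal Python sketch. This is not the implementation or API of LexMapr, MetaMap, Zooma, or Curami; the lexicon, the abbreviation table, the function names, and the `EX:` term identifiers are all hypothetical placeholders chosen for illustration. The sketch cleans the text, expands a few abbreviations, and then looks up the longest matching phrase in a small lexicon.

```python
import re

# Hypothetical lexicon: normalized phrase -> (label, placeholder term ID).
# The "EX:" IDs are invented placeholders, not identifiers from a real ontology.
LEXICON = {
    "blood pressure": ("blood pressure", "EX:0001"),
    "body mass index": ("body mass index", "EX:0002"),
    "type 2 diabetes": ("type 2 diabetes", "EX:0003"),
}

# Hypothetical abbreviation table applied during clean-up.
ABBREVIATIONS = {
    "bp": "blood pressure",
    "bmi": "body mass index",
    "t2d": "type 2 diabetes",
}


def normalize(text):
    """Lowercase, turn punctuation into spaces, and expand abbreviations."""
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    expanded = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    # Re-split so that multi-word expansions become separate tokens.
    return " ".join(expanded).split()


def map_terms(text, max_len=4):
    """Greedy longest-match lookup of token n-grams against the lexicon."""
    tokens = normalize(text)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in LEXICON:
                matches.append(LEXICON[phrase])
                i += n
                break
        else:
            # No phrase starting at this token is in the lexicon; skip it.
            i += 1
    return matches


print(map_terms("BMI; systolic BP (mmHg)"))
```

In this toy setup, unmatched tokens such as "systolic" are simply skipped. A production tool would typically add further rules (spelling variants, stop words, synonyms) and draw its lexicon from curated ontologies rather than a hand-written dictionary.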

Each team in the CINECA text-mining group is drawing on its prior experience and knowledge to build tools that target specific cohorts and to expose each tool as a (micro)service. Subsequently, the tools will be made available through a collective interface for general cohort text-analysis tasks.

The original article was published by the Common Infrastructure for National Cohorts in Europe, Canada, and Africa.
