OCLC harnesses machine learning to elevate WorldCat data quality measures

August 16, 2023

OCLC, a global library services organization, is introducing machine learning into its ongoing WorldCat quality improvement work. OCLC's Metadata Quality teams continually look for new ways to improve the accuracy and usefulness of WorldCat data, which underpins the global library community's resource sharing and discovery initiatives.

The WorldCat database, a cornerstone of library services worldwide, is maintained through a combination of manual processes and automated tools. Together, these ensure that the data meets the needs of libraries and patrons across a wide range of services. As technology evolves, OCLC continues to explore new ways to enhance, correct, and de-duplicate WorldCat records in order to improve the overall quality of the database.

A key step in the quality improvement process is removing duplicate records. Manual curation by expert metadata professionals, combined with duplicate detection software, has markedly reduced the number of duplicate entries. To accelerate this progress, OCLC is now applying machine learning.

In December 2022, OCLC invited the cataloging community to take part in a data labeling exercise. The exercise aimed to validate a machine learning model's understanding of duplicate records in WorldCat. Over four and a half months, 336 participants labeled more than 34,000 "possible duplicates" using an online interface. This collective effort has been invaluable in advancing the library profession and the global mission of libraries.

With the dataset gathered from this exercise, OCLC is ready to integrate machine learning into its efforts to identify and resolve duplicate records in WorldCat. On August 19, 2023, an initial run of one million records (equivalent to 500,000 pairs) will be processed by the newly developed algorithm. The result will be the merger of 500,000 duplicate records within WorldCat. This step is expected to improve cataloging, resource discovery, and interlibrary loan experiences for both library staff and end users.

The introduction of machine learning into WorldCat quality work reflects OCLC's commitment to innovation and continuous improvement. By adopting new technology, OCLC aims to maintain its role as a driving force in the global library landscape.
