The CHORUS Forum, titled "Linking from Datasets to Content," gathered experts in data management, oceanography, and public access to explore how datasets can be connected with their related content. The speakers offered insights into the strategies, technologies, and best practices needed for effective data-content linking.
The forum opened with moderator Shelley Stall, who presented findings from a pre-forum poll. Registrants identified several challenges in linking datasets to content, including simplifying the linking process, maintaining robust links, and addressing the social and cultural hurdles to widespread adoption. Cost, logistical effort, and lack of awareness also emerged as significant barriers. Even so, participants recognized the potential for enhanced discoverability and reproducibility, both of which are central to the value proposition of linking datasets to their corresponding content.
Danie Kinkade, Director of the Biological and Chemical Oceanography Data Management Office at the Woods Hole Oceanographic Institution, provided a deep dive into the complexities of linking data within the context of ocean research. Kinkade highlighted the diverse and intricate nature of oceanographic data, which often spans multiple formats and repositories. The challenge, she noted, lies in ensuring these disparate datasets remain connected, allowing researchers to access a comprehensive view of their projects.
Kinkade emphasized the importance of standardizing data formats, assigning persistent identifiers (PIDs), and fostering partnerships that facilitate access to related content. By leveraging PIDs, repositories can link to additional data housed in specialized repositories, thereby offering researchers a more complete picture of their work.
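As a rough illustration of the PID-based linking described above, the following minimal Python sketch models a dataset record that carries typed links to related resources held elsewhere. It is not taken from the forum or from any repository's actual system: the class names, the `link` and `related_by` helpers, and all identifiers are hypothetical, and the relation names are only loosely modeled on the relation types defined in the DataCite Metadata Schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RelatedIdentifier:
    identifier: str        # PID of the related resource (here, a made-up DOI)
    identifier_type: str   # kind of PID, e.g. "DOI"
    relation_type: str     # how the resource relates to this dataset


@dataclass
class DatasetRecord:
    pid: str
    title: str
    related: List[RelatedIdentifier] = field(default_factory=list)

    def link(self, identifier, relation_type, identifier_type="DOI"):
        """Attach a related PID, skipping exact duplicates."""
        candidate = RelatedIdentifier(identifier, identifier_type, relation_type)
        if candidate not in self.related:
            self.related.append(candidate)

    def related_by(self, relation_type):
        """Return the PIDs linked with the given relation type."""
        return [r.identifier for r in self.related
                if r.relation_type == relation_type]


# All identifiers below are placeholders invented for this example.
record = DatasetRecord(pid="10.5555/example-ctd-casts",
                       title="CTD casts, example cruise")
record.link("10.5555/example-sequence-data", "IsSupplementedBy")
record.link("10.5555/example-article", "IsCitedBy")
record.link("10.5555/example-article", "IsCitedBy")  # duplicate, ignored
```

Because each link stores only an identifier and a relation type, the related data can stay in its specialized repository while the record still points to it.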
Martin Halbert, Science Advisor for Public Access at the National Science Foundation (NSF), discussed the broader implications of linking content with data. He underscored the global nature of these challenges and the NSF's ongoing efforts to address them through the implementation of the 2022 OSTP Nelson Memo. Halbert highlighted the potential of AI-driven tools to automate data-content linkages, though he cautioned about the risk of errors and stressed the need for a global consensus on best practices.
Halbert also pointed to the forthcoming 2025 and 2026 deadlines outlined in the Nelson Memo, which mandate the development and implementation of comprehensive policies for public access to data and content arising from federally funded research. He acknowledged the significant burden these requirements may place on researchers and stressed the NSF's commitment to finding fair and effective solutions.
Matt Buys, Executive Director of DataCite, focused on the importance of making research data more discoverable and reusable. Buys argued that linking datasets to research outputs can enhance the scholarly record and advance knowledge. He highlighted initiatives like Make Data Count and the Data Citation Corpus, which aim to build open infrastructure and community-based standards for data citation.
Buys urged the research community to shift its focus from altering publisher behavior to enabling repositories to add citation links to dataset metadata. He also advocated for the use of text mining and Natural Language Processing (NLP) to accelerate the identification of data citations.
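To make the text-mining idea concrete, here is a deliberately simple Python sketch that scans article text for DOIs and proposes citation links for datasets a repository already knows about. It is an assumption-laden illustration, not the method used by DataCite, Make Data Count, or the Data Citation Corpus: the regular expression is a commonly used approximation of modern DOI syntax, the function names are hypothetical, and all identifiers are placeholders. Production systems would likely rely on NLP models to catch mentions that do not include a DOI at all.

```python
import re

# Approximate pattern for modern DOIs; it will miss some valid DOIs
# and is an assumption of this sketch, not an official definition.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_dois(text):
    """Return unique DOIs in order of first appearance, lowercased."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Trim punctuation that usually belongs to the sentence, not the DOI.
        # This would also clip a DOI that legitimately ends in ")".
        doi = match.group(0).rstrip(".,;:)").lower()
        if doi not in found:
            found.append(doi)
    return found


def citing_links(article_doi, article_text, known_dataset_dois):
    """Propose (dataset, relation, article) links for known datasets in the text."""
    known = {d.lower() for d in known_dataset_dois}
    return [(doi, "IsCitedBy", article_doi)
            for doi in extract_dois(article_text)
            if doi in known]


# Placeholder text and identifiers invented for this example.
text = ("Data are available at https://doi.org/10.5555/example-ctd-casts. "
        "See also the companion study (doi:10.5555/other-paper).")
links = citing_links("10.5555/example-article", text,
                     ["10.5555/example-ctd-casts"])
```

Restricting the output to DOIs the repository already holds keeps the sketch conservative: unknown identifiers are found by `extract_dois` but are not turned into links.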
The CHORUS 2024 Forum illuminated the critical challenges and opportunities in linking datasets to content. While the road ahead is fraught with technical, social, and logistical hurdles, the potential benefits in terms of discoverability, reproducibility, and the advancement of knowledge are too significant to ignore. As the research community continues to navigate these complexities, forums like this will play a vital role in fostering collaboration and driving progress.