Many historical newspaper collections have been digitized to provide digital access. However, extracting specific articles from these collections, finding related images, and linking related articles remains time-consuming and labor-intensive. Could automatic article extraction and semantic enrichment make these historical newspaper collections, the richest sources of information for humanities research, more accessible?
The DATA-KBR-BE project aims to facilitate data-level access to the digitized and born-digital collections of KBR (the Royal Library of Belgium) for digital humanities research. The project uses AI-based methods to enable article-level search in historical collections. Its annotation pipeline aims to enhance searchability through semantic data enrichment and cross-collection linking of information. Starting from the existing OCR results for the collection, the pipeline performs article segmentation, named entity recognition, and semantic linking.
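The three stages named above can be sketched as a simple chain of functions. This is only an illustration of how the data flows, not the project's implementation: the `Article` class, the function names, the blank-line segmentation rule, the capitalized-word entity matcher, and the `GAZETTEER` identifiers are all hypothetical stand-ins for the AI-based methods the project actually uses.

```python
import re
from dataclasses import dataclass, field

# Hypothetical lookup table standing in for a real knowledge base.
GAZETTEER = {
    "Bruxelles": "example:brussels",
    "Antwerp": "example:antwerp",
}


@dataclass
class Article:
    text: str
    entities: list[str] = field(default_factory=list)
    links: dict[str, str] = field(default_factory=dict)


def segment_articles(ocr_text: str) -> list[Article]:
    """Stand-in segmentation: treat blank-line-separated blocks as articles."""
    blocks = re.split(r"\n\s*\n", ocr_text.strip())
    return [Article(text=block.strip()) for block in blocks if block.strip()]


def recognize_entities(article: Article) -> Article:
    """Stand-in NER: collect capitalized words, deduplicated in order."""
    candidates = re.findall(r"\b[A-Z][a-z]+\b", article.text)
    article.entities = list(dict.fromkeys(candidates))
    return article


def link_entities(article: Article) -> Article:
    """Stand-in linking: keep only entities found in the gazetteer."""
    article.links = {
        name: GAZETTEER[name] for name in article.entities if name in GAZETTEER
    }
    return article


def run_pipeline(ocr_text: str) -> list[Article]:
    """Run segmentation, entity recognition, and linking in sequence."""
    return [
        link_entities(recognize_entities(article))
        for article in segment_articles(ocr_text)
    ]


if __name__ == "__main__":
    sample = "Le roi visite Bruxelles.\n\nThe fair opens in Antwerp."
    for article in run_pipeline(sample):
        print(article.text, article.links)
```

The naive matcher also picks up sentence-initial words such as "Le" and "The"; the linking step then discards them because they have no gazetteer entry, which mirrors why recognition and linking are kept as separate stages.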
The NewspAIper platform, built on this pipeline, lets users query the collection interactively, so humanities researchers can more easily extract the information relevant to their research interests. The platform offers interactive filters on article date, language, and detected entities, and lets users browse similar articles and illustrations.
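As a rough illustration of what filtering on date, language, and entities involves, the sketch below applies those three criteria to in-memory records. The `ArticleRecord` fields and the `filter_articles` function are assumptions made for this example and do not describe the NewspAIper platform's actual data model or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class ArticleRecord:
    # Hypothetical record layout; the real platform's schema is not given here.
    title: str
    published: date
    language: str
    entities: frozenset


def filter_articles(
    records: Iterable[ArticleRecord],
    start: Optional[date] = None,
    end: Optional[date] = None,
    language: Optional[str] = None,
    entity: Optional[str] = None,
) -> list[ArticleRecord]:
    """Return the records matching every criterion that is not None."""
    selected = []
    for record in records:
        if start is not None and record.published < start:
            continue
        if end is not None and record.published > end:
            continue
        if language is not None and record.language != language:
            continue
        if entity is not None and entity not in record.entities:
            continue
        selected.append(record)
    return selected


if __name__ == "__main__":
    records = [
        ArticleRecord("Visite royale", date(1890, 5, 1), "fr", frozenset({"Bruxelles"})),
        ArticleRecord("Havennieuws", date(1895, 3, 12), "nl", frozenset({"Antwerpen"})),
    ]
    for hit in filter_articles(records, language="fr", entity="Bruxelles"):
        print(hit.title)
```

Leaving a criterion as `None` means "do not filter on this", so the same function covers any combination of the three filters.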
In the future, the project aims to improve semantic text enrichment by adding topic detection and trend analysis. Furthermore, toponyms found in the articles could be used to geolocate the historical images. These improvements are expected to enable even finer-grained filtering.
The original article was published by the Time Machine.