EU collaborates with IBM for digitisation of historic European texts on massive scale

IBM, the US-based multinational computer, technology and IT consulting firm, and the EU have unveiled a new initiative called IMPACT (IMProving ACcess to Text). The project seeks to provide technology that will enable highly accurate digitisation of rare and culturally significant historical texts on a massive scale. The move is seen to expand the research collaboration between the EU and IBM, which now includes more than two dozen national libraries, research institutes, universities and companies across Europe.

IMPACT is stated to differ from past digitisation projects, which resulted in static online libraries of texts. It will seek to offer European institutions new tools and best practices so that they can continue to produce quality digital replicas of historically significant texts efficiently and accurately, and make them widely available, editable and searchable online.

Funded by the EU, IMPACT's research combines new web-enabled adaptive optical character recognition (OCR) software with 'crowd computing' technology, seen to be a fast-growing concept in which individuals, or 'crowds', enhance a process or product by sharing their knowledge and expertise to dramatically improve its quality and efficiency. Combined, these technologies will for the first time allow institutions to adapt digitisation to the idiosyncrasies of old fonts, anomalies and even vocabularies. They are also seen to reduce error rates by 35 percent and substitution rates by 75 percent.

While today's OCR engines perform well with modern printed texts, the faded ink, age and unusual shapes of older typefaces can lower recognition rates by up to 50 percent and require massive manual post-production review. Consequently, for large-scale projects such as this, the efficiency of post-production review of digitised text is said to be crucial.

At the core of the digitisation project lies a collaborative correction system designed by IBM researchers. It is projected to make it simple and convenient for large groups of volunteers spread over the continent to verify the accuracy of processed texts and correct recognition mistakes through a web-based system. Moreover, the system is able to learn from its recognition errors and adapt automatically to the characters of a specific font.

IMPACT technology seeks to streamline, simplify and accelerate the process of winnowing out questionable text scans, enabling reviewers to key in corrections to the text. Instead of displaying an entire scanned page, reviewers only see the actual letters or words in question. In cases where an entire word is suspect, it is added to a collection of other questionable terms, which are then arranged in alphabetical order. Volunteer reviewers need only accept or reject suggested substitutes with one keystroke. In addition, the system uses adaptive dictionary enrichment, a method in which new words are added to a central dictionary based on cross-identification and correction by other users.
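
The review workflow described above can be illustrated with a minimal Python sketch. It is not IBM's implementation: the class name `CorrectionQueue`, its methods, and the `min_confirmations` threshold are all hypothetical, and the rule that a word joins the central dictionary once two different reviewers have accepted it is an assumption standing in for 'cross-identification'.

```python
from collections import defaultdict


class CorrectionQueue:
    """Hypothetical sketch of a collaborative OCR correction queue."""

    def __init__(self, dictionary, min_confirmations=2):
        # Central dictionary of known words (assumed to be a plain set).
        self.dictionary = set(dictionary)
        # Assumed threshold: distinct reviewers needed before enrichment.
        self.min_confirmations = min_confirmations
        # Suspect OCR token -> substitute suggested by the system.
        self.suspects = {}
        # Suggested word -> set of reviewers who accepted it.
        self.confirmations = defaultdict(set)

    def add_suspect(self, ocr_token, suggestion):
        """Add a questionable word to the shared collection."""
        self.suspects[ocr_token] = suggestion

    def review_order(self):
        """Return the pending suspect words in alphabetical order."""
        return sorted(self.suspects, key=str.lower)

    def review(self, ocr_token, reviewer, accept):
        """Record one accept/reject decision for a suggested substitute."""
        suggestion = self.suspects[ocr_token]
        if not accept:
            return None
        self.confirmations[suggestion].add(reviewer)
        # Enrich the dictionary once enough distinct reviewers agree.
        if len(self.confirmations[suggestion]) >= self.min_confirmations:
            self.dictionary.add(suggestion)
            del self.suspects[ocr_token]
        return suggestion
```

In this sketch a reviewer never sees a full page, only the queue of suspect words, and each decision is a single boolean, which mirrors the one-keystroke accept or reject step. A real system would also need to handle rejected suggestions and free-text corrections, which are left out here.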
