HathiTrust Research Center announces alpha release of extracted features dataset - Knowledgespeak

HathiTrust Research Center announces alpha release of extracted features dataset - June 5, 2014

The HathiTrust Research Center (HTRC) has announced the alpha release of a new dataset, consisting of page-level features extracted from a quarter-million text volumes.

Features are data attributes defined in such a way that they can be identified by a computer and analysed at scale. For the HTRC Feature Extraction alpha dataset, the underlying text has already been processed: headers and footers have been identified and hyphenated words rejoined. The dataset offers page-level details such as term-frequency counts per page and per section (head/body/footer); occurrences of terms as different parts of speech; line counts and sentence counts; and counts of the characters that appear at the start or end of lines.
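To make the idea of page-level features concrete, the following is a minimal Python sketch of how such a record might be built for a single page. It is an illustration only: the function name `page_features`, the field names, and the whitespace tokenisation are assumptions made for this example, not the HTRC's actual schema or method, and part-of-speech and sentence counts are left out.

```python
from collections import Counter


def page_features(header, body, footer):
    """Build a toy page-level feature record from three lists of text lines.

    Hypothetical layout for illustration; not the HTRC dataset format.
    """
    sections = {"header": header, "body": body, "footer": footer}
    record = {}
    for name, lines in sections.items():
        # Ignore blank lines so they do not affect any of the counts.
        non_empty = [ln.strip() for ln in lines if ln.strip()]
        # Naive whitespace tokenisation, lower-cased; punctuation is kept.
        tokens = [tok.lower() for ln in non_empty for tok in ln.split()]
        record[name] = {
            "line_count": len(non_empty),
            "term_counts": Counter(tokens),
            # Counts of the first and last character of each line.
            "begin_line_chars": Counter(ln[0] for ln in non_empty),
            "end_line_chars": Counter(ln[-1] for ln in non_empty),
        }
    return record


page = page_features(
    header=["CHAPTER ONE"],
    body=["The whale surfaced.", "The crew watched the whale."],
    footer=["12"],
)
print(page["body"]["term_counts"]["the"])  # 3
print(page["body"]["line_count"])          # 2
```

Because each section keeps its own counters, a researcher could, for instance, drop running headers from a word-frequency analysis by reading only the `body` entry.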

Because the dataset is still in alpha, the HTRC is seeking feedback on how data like this can help researchers and on how it can better serve the scholarly community.

The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois. In conjunction with the HathiTrust Digital Library, the HTRC team strives to meet the technical challenges that researchers face when dealing with massive amounts of digital text, by developing cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge.

