OCR4all – An Open Source Tool Providing a Full OCR Workflow For Creating Digital Corpus From Printed Sources

Why OCR?

A growing number of humanities scholars need digital versions of originally printed or written text corpora that match the original with an accuracy of ≥ 99.95 %, or even 100 % for historical-critical digital editions. Until very recently, the standard procedure for reaching this level was ‘double-keying’, in which two people independently transcribe the same text by hand and the two versions are subsequently merged. However, a new generation of neural networks now makes optical character recognition (OCR) possible for early modern prints (such as incunabula) as well as for Arabic texts and South Indian scripts. With an accuracy of ≥ 99.97 %, font-face-specific models deliver results that are suitable for historical-critical digital editions and require only a reasonable amount of post-correction. OCR therefore represents a cheaper alternative to the ‘double-keying’ used so far.

OCR4all

The workshop introduces students from all humanities disciplines to OCR4all, an open source OCR tool developed at the University of Würzburg. With it, individuals and small research groups with little technical experience can perform text recognition with very good to excellent recognition rates for a specific print and create digital full texts on their own. The workflow implemented in OCR4all is easy to understand and can be applied independently. Like the workshop itself, it specifically addresses users with little to no IT background, and it combines different tools within a uniform user interface.

Results

The primary material in high-resolution TIFF format (300+ dpi according to the DFG Practical Guidelines on Digitisation [12/16]) will remain unchanged. The final results will consist of

  1. correct full text,
  2. platform- and software-independent PAGE-XML files describing the positions of text and line regions on all images, and
  3. OCR models, each specific to a print and font face, that can be used for text recognition on other printed texts.

After a week, participants are able to create digital text versions of digital images on their own and to evaluate trained OCR models with regard to their recognition rate. To this end, the basic work steps of OCR and their implementation in OCR4all are presented alternately. All contents comply with the requirements of the DFG Practical Guidelines on Digitisation. This conformity makes it easier for participants to apply later for funding of OCR projects in Germany.
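To give an idea of what a recognition rate means in practice, the following minimal Python sketch computes a character accuracy by comparing an OCR result with a manual transcription. It is an illustration only and not the evaluation built into OCR4all; the function names and the sample strings are assumptions made for this example. Accuracy is taken here as one minus the edit distance divided by the length of the manual transcription, so ≥ 99.95 % corresponds to at most 5 character errors per 10,000 characters.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Share of ground-truth characters recognised correctly (hypothetical helper)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = levenshtein(ground_truth, ocr_output)
    return 1.0 - errors / len(ground_truth)


if __name__ == "__main__":
    # Invented sample line: one wrong character in 22.
    truth = "Im Anfang war das Wort"
    recognised = "Jm Anfang war das Wort"
    print(f"{character_accuracy(truth, recognised):.4f}")
```

With this definition the value can drop below zero if the OCR output is much longer than the transcription; the sketch leaves that case unhandled.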

Syllabus

Participants carry out the following work steps, independently and under lecturer guidance:

  1. image pre-processing,
  2. segmentation of regions,
  3. automatic segmentation of lines,
  4. manual creation of ground truth,
  5. text recognition, and
  6. text output.

Best preparation—highest possible benefit

This one-week workshop will explicitly support you in the practical implementation of your own digitization project. This works best if:

  1. you already have your own images at hand, preferably uncompressed,
  2. the text to be digitized comprises at least 200 lines of 20 characters each, and
  3. when working with a text set in neither Roman nor Gothic typeface, you have already manually transcribed 50 lines of 20 characters each into a text file.

Alternatively, digital images and ground truth will be made available.

Requirements

No previous knowledge is required. You will need:

  • your own laptop with at least 8 GB RAM,
  • 30 GB free hard disk space,
  • an i5 or similar CPU, and
  • a current browser (e.g. Safari, Chrome).

OCR4all runs on every current operating system. Participants will be informed about the details of software installation before the event starts.

This one-week workshop will be repeated in the second week with identical content.