Pilot Project on the Use of OCR Software for the Digitisation of the Funeral Sermons of the Staatsbibliothek zu Berlin

Between 2010 and 2012 the Department of Early Printed Books carried out a pilot project, funded by the Deutsche Forschungsgemeinschaft (DFG), to assess both the capacity and the potential of optical character recognition (OCR) software for the digitisation of books printed in Germany in the early modern period. Two OCR software products were tested in comprehensive testbeds and evaluated with regard to recognition quality and to their options for setup and optimisation. At the same time, genre-specific approaches to indexing early modern funeral sermons and related occasional personal writings were investigated.

OCR software from the following providers was examined:

  • B.I.T. Bureau Ingénieur Tomasi SARL Toulouse – Software: B.I.T. Alpha
  • Hermann & Kraemer GmbH und Co. KG Garmisch-Partenkirchen – Software: HKOCR on the basis of ABBYY FineReader Engine 9

In two work packages, their components for the recognition of Gothic print were tested in order to learn more about their specific advantages and drawbacks and to gain information on particular practical implementation scenarios. Both programs were trained in close cooperation with the providers on the basis of selected works from the holdings of the library. At the same time, the metadata model that the Staatsbibliothek had developed underwent practical tests based on selected samples, with a view to its interaction with the OCR software products.

The following aspects received special consideration and evaluation: quality of recognition and of the resulting text; quality of binarisation; segmentation of images, words and characters; libraries of characters or alphabets; use of word libraries or dictionaries; documentation of the parameters and set-ups selected; and availability of the training results for further use by the Library.
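How recognition quality was actually measured in the project is not stated here. As an illustration only, a common way to quantify the quality of OCR text is the character error rate: the number of character insertions, deletions and substitutions needed to turn the OCR output into a manually transcribed reference, divided by the length of that reference. The following minimal Python sketch assumes such a reference transcription exists; the function names and the sample strings are hypothetical and are not taken from the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # The distance is symmetric, so keep the shorter string in the inner loop.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance between OCR output and reference, per reference character."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # Hypothetical sample: the long s (ſ) misread as f, a typical Gothic-print confusion.
    reference = "Chriſtliche"
    ocr_output = "Chriftliche"
    print(f"{character_error_rate(reference, ocr_output):.3f}")
```

A figure of this kind would cover only the text-quality aspect; binarisation and segmentation quality would need separate checks.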

In order to find the configuration best suited to each product, the test material was, as far as possible, arranged in groups. In the first phase, emphasis was placed on the funeral sermons of a single author issued by one and the same publisher. In this way the time span of publication remained manageable and the types used could be expected to be fairly consistent. As a matter of principle we tried to work with comparable print and layout patterns. Apart from page layout, the fonts, types and font sizes played an important role here, as did the print quality of the original work.

Special challenges for the OCR software tested arose from the constant variation of fonts and types, which could occur within the same text and even within the same word; from the abbreviations used for quotations from the Bible or other sources; and from marginal notes, which are often set in rather small type and are not always clearly separated from the main body of the text. Moreover, funeral sermons often contain quotations in Greek and Hebrew script.

