Skip to content

A selection of test lines of several early printed books as well as the corresponding individual OCRopus models and mixed models.

Notifications You must be signed in to change notification settings

chreul/OCR_Testdata_EarlyPrintedBooks

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 

Repository files navigation

OCR_Testdata_EarlyPrintedBooks

A selection of test lines of several early printed books as well as the corresponding individual OCRopus models and mixed models.

Data

LatinHist-98000.pyrnn.gz is a mixed OCRopus model trained on twelve Latin books printed with Antiqua types between 1471 and 1686 with a focus (ten out of twelve) on early works produced before 1600. For details about the books please see [1] . The training was performed on 8,684 lines and the best model was chosen by evaluating all resulting models on 2,432 previously unseen test lines. The lowest achieved CER was 2.92% after 98,000 training steps.

The Books folder contains seven early printed books used for evaluation in [2][3]. For each book we made 150 lines of GT available as well as a strong individual model (coming soon) trained on an extensive amount of lines (at least 1,500). For details about the books please see [2][3].

For additional GT for Latin, Greek and German Fraktur please see the CIS OCR Testset.

Licence

Creative Commons License
All data available in this repository is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Citation

Please cite (one of) the following publications when using the data.

Latin Hist

[1] Springmann et al.. (2016). Automatic quality evaluation and (semi-) automatic improvement of OCR models for historical printings. ArXiv e-prints.

@article{springmann2016automatic,
  title = {Automatic quality evaluation and (semi-) automatic
		  improvement of {OCR} models for historical printings},
  author = {Springmann, Uwe and Fink, Florian and Schulz, Klaus U},
  journal = {ArXiv e-prints},
  url = {https://arxiv.org/abs/1606.05157},
  year = {2016}
}

Seven Mixed Early Printed Books

[2] Reul et al.. (2017). Improving OCR Accuracy on Early Printed Books by Utilizing Cross Fold Training and Voting. ArXiv e-prints. Submitted to the 13th IAPR International Workshop on Document Analysis Systems.

@article{reul2017voting,
  title = {Improving {OCR} Accuracy on Early Printed Books by Utilizing Cross Fold Training and Voting},
  author = {Reul, Christian and Springmann, Uwe and Wick, Christoph and Puppe, Frank},
  journal = {ArXiv e-prints},
  url = {https://arxiv.org/abs/1711.09670},
  year = {2017}
}

[3] Reul et al.. (2017). Transfer Learning for OCRopus Model Training on Early Printed Books. ArXiv e-prints. Submitted to 027.7 - Journal for Library Culture.

@article{reul2017transfer,
  title = {Transfer Learning for OCRopus Model Training on Early Printed Books},
  author = {Reul, Christian and Wick, Christoph and Springmann, Uwe and Puppe, Frank},
  journal = {ArXiv e-prints},
  url = {https://arxiv.org/abs/1712.05586},
  year = {2017}
}

About

A selection of test lines of several early printed books as well as the corresponding individual OCRopus models and mixed models.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published