Ensuring an Error-Free Transcription on a Full Engineering Tags Dataset Through Unsupervised Post-OCR Methods

Francois, Mathieu; Eglin, Véronique

doi:10.1007/978-3-031-41734-4_6

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14191)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

The digital transformation of engineering documents is an ambitious research topic in the industrial world. The representation of component identifiers (tags), which are textual entities without a language model, is one of the major challenges. Most OCR systems use dictionary-based correction methods, so they fail to recognize hybrid entities composed of numerical and textual characters. This study aims to adapt OCR results to language-free strings that carry specific semantics and require efficient post-OCR correction with unsupervised approaches. We propose a two-step methodology to address post-OCR correction in engineering documents. The first step focuses on aligning OCR transcriptions to produce a single prediction refined from all OCR predictions. The second step presents a combined incremental clustering and correction approach that continuously corrects tag transcriptions relative to their assigned cluster. For both steps, the dataset was produced from a set of 1,600 real technical documents and made available to the research community. Compared with the best state-of-the-art OCR, the post-OCR approach produced a gain of 9% in WER.
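
As a rough illustration of the first step (merging several OCR readings of one tag into a single prediction), the sketch below aligns each reading to a pivot reading and takes a per-character majority vote. It is not the authors' algorithm: the function name `vote_transcription`, the use of Python's `difflib.SequenceMatcher` for alignment, and the example tag strings are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def vote_transcription(hypotheses):
    """Merge several OCR readings of one tag by per-character majority vote.

    Hypothetical helper: readings are aligned to a pivot with difflib,
    which stands in for whatever alignment the paper actually uses.
    """
    if not hypotheses:
        raise ValueError("need at least one OCR hypothesis")

    # Pivot: the reading that is, in total, most similar to all readings.
    def total_similarity(h):
        return sum(
            SequenceMatcher(None, h, other, autojunk=False).ratio()
            for other in hypotheses
        )

    pivot = max(hypotheses, key=total_similarity)

    # One vote counter per pivot position.
    votes = [Counter() for _ in pivot]
    for hyp in hypotheses:
        matcher = SequenceMatcher(None, pivot, hyp, autojunk=False)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal" or (op == "replace" and i2 - i1 == j2 - j1):
                # Position-wise vote for the character this reading proposes.
                for k in range(i2 - i1):
                    votes[i1 + k][hyp[j1 + k]] += 1
            elif op == "delete":
                # This reading has no character here: vote for dropping it.
                for i in range(i1, i2):
                    votes[i][""] += 1
            # Insertions and uneven replacements are ignored in this sketch.

    return "".join(counter.most_common(1)[0][0] for counter in votes)


# Three made-up readings of the same tag, each with a different error.
print(vote_transcription(["P-1O1A", "F-101A", "P-101B"]))
```

Because the vote is taken position by position, the merged string can be correct even when every individual reading contains an error, provided the engines do not make the same mistake at the same position. Characters that appear only as insertions relative to the pivot are dropped here, which a fuller alignment would need to handle.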

Notes

1. https://github.com/mathieuF789/dataset_tags

Author information

Authors and Affiliations

Univ. Lyon, INSA Lyon, CNRS, UCBL, LIRIS, UMR5205, 69621, Villeurbanne, France
Mathieu Francois & Véronique Eglin
Orinox, Vaulx-en-Velin, France
Mathieu Francois

Corresponding authors

Correspondence to Mathieu Francois or Véronique Eglin.

Editor information

Editors and Affiliations

TU Dortmund University, Dortmund, Germany
Gernot A. Fink
Adobe, College Park, MD, USA
Rajiv Jain
Osaka Metropolitan University, Osaka, Japan
Koichi Kise
Rochester Institute of Technology, Rochester, NY, USA
Richard Zanibbi

About this paper

Cite this paper

Francois, M., Eglin, V. (2023). Ensuring an Error-Free Transcription on a Full Engineering Tags Dataset Through Unsupervised Post-OCR Methods. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14191. Springer, Cham. https://doi.org/10.1007/978-3-031-41734-4_6

DOI: https://doi.org/10.1007/978-3-031-41734-4_6
Published: 19 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41733-7
Online ISBN: 978-3-031-41734-4
eBook Packages: Computer Science, Computer Science (R0)
