Abstract
The digital transformation of engineering documents is an ambitious research topic in the industrial world. One of the major challenges is the representation of component identifiers (tags), which are textual entities without a language model. Most OCR engines rely on dictionary-based correction methods, so they fail to recognize hybrid entities composed of numerical and textual characters. This study aims to improve OCR results on language-free strings that carry specific semantics and therefore require efficient, unsupervised post-OCR correction. We propose a two-step methodology to address post-OCR correction in engineering documents. The first step aligns the transcriptions of several OCR engines and produces a single prediction refined from all OCR predictions. The second step combines incremental clustering and correction, continuously correcting each tag's transcription relative to its assigned cluster. For both steps, the dataset was produced from a set of 1,600 real technical documents and has been made available to the research community. Compared with the best state-of-the-art OCR, the post-OCR approach yields a 9% gain in WER.
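To make the two-step idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes several things the abstract does not specify: the OCR outputs are merged by character voting against a medoid pivot, tags are clustered by a letter/digit "shape" signature under a plain (unweighted) Levenshtein distance, and correction only swaps a small, hypothetical table of confusable characters (`CONFUSIONS`). All function and class names are illustrative.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical confusion table: look-alike letter/digit pairs.
CONFUSIONS = {"O": "0", "0": "O", "I": "1", "1": "I",
              "S": "5", "5": "S", "B": "8", "8": "B"}


def levenshtein(a: str, b: str) -> int:
    """Plain (unweighted) edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def merge_predictions(preds: list[str]) -> str:
    """Step 1 (assumed variant): vote per character against a medoid pivot."""
    pivot = min(preds, key=lambda p: sum(levenshtein(p, q) for q in preds))
    votes = [Counter() for _ in pivot]
    for pred in preds:
        matcher = SequenceMatcher(None, pivot, pred, autojunk=False)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only aligned spans of equal length cast votes.
            if op == "equal" or (op == "replace" and i2 - i1 == j2 - j1):
                for k in range(i2 - i1):
                    votes[i1 + k][pred[j1 + k]] += 1
    return "".join(v.most_common(1)[0][0] for v in votes)


def shape(text: str) -> str:
    """Map letters to 'A' and digits to '9'; keep other characters."""
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c
                   for c in text)


def correct(tag: str, rep_shape: str) -> str:
    """Swap confusable characters whose class disagrees with the cluster."""
    if len(tag) != len(rep_shape):
        return tag  # this sketch leaves length mismatches untouched
    out = []
    for ch, expected in zip(tag, rep_shape):
        swap = CONFUSIONS.get(ch)
        if swap and shape(ch) != expected and shape(swap) == expected:
            out.append(swap)
        else:
            out.append(ch)
    return "".join(out)


class IncrementalTagClusters:
    """Step 2 (assumed variant): assign each new tag, then correct it."""

    def __init__(self, max_dist: int = 1):
        self.max_dist = max_dist
        self.clusters: list[Counter] = []  # one Counter of shapes per cluster

    def add(self, tag: str) -> str:
        s = shape(tag)
        best = None
        for cluster in self.clusters:
            rep = cluster.most_common(1)[0][0]  # dominant shape so far
            d = levenshtein(s, rep)
            if d <= self.max_dist and (best is None or d < best[0]):
                best = (d, cluster, rep)
        if best is None:  # no cluster is close enough: open a new one
            self.clusters.append(Counter({s: 1}))
            return tag
        _, cluster, rep = best
        corrected = correct(tag, rep)
        cluster[shape(corrected)] += 1
        return corrected
```

With three hypothetical OCR outputs, `merge_predictions(["P-1O1A", "P-101A", "P-101A"])` returns `"P-101A"`. Feeding `"P-101A"`, `"P-102B"`, and then `"P-1O3A"` to one `IncrementalTagClusters` instance returns `"P-103A"` for the last tag. The first tag of a cluster is accepted as-is in this sketch, so an error in a seed tag would propagate; a real system would need a more robust representative.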
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Francois, M., Eglin, V. (2023). Ensuring an Error-Free Transcription on a Full Engineering Tags Dataset Through Unsupervised Post-OCR Methods. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14191. Springer, Cham. https://doi.org/10.1007/978-3-031-41734-4_6
DOI: https://doi.org/10.1007/978-3-031-41734-4_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41733-7
Online ISBN: 978-3-031-41734-4
eBook Packages: Computer Science, Computer Science (R0)