Ensuring an Error-Free Transcription on a Full Engineering Tags Dataset Through Unsupervised Post-OCR Methods

  • Conference paper
Document Analysis and Recognition - ICDAR 2023 (ICDAR 2023)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14191))

Abstract

The digital transformation of engineering documents is an ambitious research topic in the industrial world. The representation of component identifiers (tags), which are textual entities without a language model, is one of the major challenges. Most OCR engines use dictionary-based correction methods, so they fail to recognize hybrid entities composed of numerical and textual characters. This study aims to adapt OCR results to language-free strings that carry specific semantics and require efficient post-OCR correction with unsupervised approaches. We propose a two-step methodology to address post-OCR correction in engineering documents. The first step focuses on the alignment of OCR transcriptions, producing a single prediction refined from all OCR predictions. The second step presents a combined incremental clustering and correction approach that continuously corrects tag transcriptions relative to their assigned cluster. For both steps, the dataset was produced from a set of 1,600 real technical documents and made available to the research community. Compared to the best state-of-the-art OCR, the post-OCR approach produced a gain of 9% in word error rate (WER).
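
The two steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the alignment or clustering algorithms, so the sketch swaps in simple stand-ins. Step 1 is approximated by a per-character majority vote over `difflib` alignments against a pivot reading, and step 2 by an incremental grouping of tags by their letter/digit "shape". The names `vote_transcription`, `shape`, `IncrementalTagCorrector`, the `CONFUSIONS` table, and the example tags are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def vote_transcription(candidates):
    """Step 1 sketch: merge several OCR readings of one tag by character vote."""
    def sim(a, b):
        return SequenceMatcher(None, a, b, autojunk=False).ratio()

    # Pivot = the reading most similar to all readings (a medoid).
    pivot = max(candidates, key=lambda c: sum(sim(c, o) for o in candidates))
    votes = [Counter() for _ in pivot]
    for cand in candidates:
        matcher = SequenceMatcher(None, pivot, cand, autojunk=False)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal" or (op == "replace" and i2 - i1 == j2 - j1):
                for k in range(i2 - i1):
                    votes[i1 + k][cand[j1 + k]] += 1
            elif op == "delete":
                for i in range(i1, i2):
                    votes[i][""] += 1  # this reading has no character here
            # Insertions and unequal-length replacements cast no vote
            # (a simplification of full multiple-sequence alignment).
    return "".join(v.most_common(1)[0][0] for v in votes)


# Hypothetical confusion pairs between look-alike letters and digits.
CONFUSIONS = {"O": "0", "I": "1", "S": "5", "B": "8",
              "0": "O", "1": "I", "5": "S", "8": "B"}


def shape(tag):
    """Map letters to 'A' and digits to '9'; keep separators unchanged."""
    return "".join("A" if ch.isalpha() else "9" if ch.isdigit() else ch
                   for ch in tag)


class IncrementalTagCorrector:
    """Step 2 sketch: clusters are tag shapes, updated one tag at a time."""

    def __init__(self):
        self.shape_counts = Counter()  # cluster shape -> number of members

    def add(self, tag):
        """Assign the tag to a cluster and return its (possibly corrected) form."""
        s = shape(tag)
        own = self.shape_counts[s]
        best = None  # (cluster size, corrected tag)
        for known, count in self.shape_counts.items():
            # Only consider larger clusters whose shape has the same length.
            if len(known) != len(s) or count <= own:
                continue
            diff = [i for i, (a, b) in enumerate(zip(known, s)) if a != b]
            if len(diff) == 1 and tag[diff[0]] in CONFUSIONS:
                i = diff[0]
                fixed = tag[:i] + CONFUSIONS[tag[i]] + tag[i + 1:]
                if shape(fixed) == known and (best is None or count > best[0]):
                    best = (count, fixed)
        result = best[1] if best else tag
        self.shape_counts[shape(result)] += 1
        return result


# Three hypothetical OCR readings of the same tag, each with a different error.
print(vote_transcription(["PV-1O1A", "PV-101A", "FV-101A"]))  # PV-101A

corrector = IncrementalTagCorrector()
for seen in ["PV-101A", "PV-102A", "FV-203B"]:
    corrector.add(seen)
print(corrector.add("PV-1O3A"))  # PV-103A
```

Both stand-ins avoid any dictionary or language model, which matches the constraint stated in the abstract. The vote needs at least one reading, and the shape-based correction only fixes a single look-alike character per tag, so it should be read as an illustration of the idea rather than a complete corrector.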

Notes

  1. https://github.com/mathieuF789/dataset_tags

Author information

Corresponding authors

Correspondence to Mathieu Francois or Véronique Eglin.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Francois, M., Eglin, V. (2023). Ensuring an Error-Free Transcription on a Full Engineering Tags Dataset Through Unsupervised Post-OCR Methods. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14191. Springer, Cham. https://doi.org/10.1007/978-3-031-41734-4_6

  • DOI: https://doi.org/10.1007/978-3-031-41734-4_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41733-7

  • Online ISBN: 978-3-031-41734-4

  • eBook Packages: Computer Science, Computer Science (R0)
