Unraveling Confidence: Examining Confidence Scores as Proxy for OCR Quality

  • Conference paper
Document Analysis and Recognition - ICDAR 2023 (ICDAR 2023)

Abstract

While performing Optical Character Recognition (OCR), most engines provide confidence scores. These scores indicate how certain an engine is that a word or character has been correctly determined. The practical application of this score is not yet clear, and various studies have discussed the (un)usability of these confidence scores as an estimate of OCR quality. Using a dataset of 2000 historical Dutch newspapers, we investigated different aspects of the confidence score as provided by ABBYY FineReader, while also looking for a way to use the confidence score as an indication of quality. Such an indication could be used by institutions to determine which parts of their collection would benefit from re-OCRing or post-processing. We found that the reliability of the confidence score as a measure of quality depends largely on how the engine has been configured. In addition, we show that when the correlation between the word confidence and the Word Character Error (order independent) is high enough, the word confidence can be used to calculate a proxy measure for categorizing digitized texts. However, such a measure must be recalculated for individual OCR engine setups and producers. For our dataset, this proxy measure performs well at separating digitized texts into those of very good and those of very bad quality, with a total accuracy of 83%.
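The idea in the abstract, checking how strongly word confidence tracks a measured error rate and then turning confidence into quality categories, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the sample values, the function names `pearson` and `categorize`, the use of a Pearson coefficient, and the thresholds 0.90 and 0.70 are all hypothetical. As the abstract notes, any real cut-off would have to be recalculated for each OCR engine setup.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = (sum((x - mx) ** 2 for x in xs)
              * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / spread


def categorize(mean_confidence, good_at=0.90, bad_below=0.70):
    """Map a text's mean word confidence to a coarse quality label.

    The thresholds are hypothetical placeholders, not values from the paper.
    """
    if mean_confidence >= good_at:
        return "good"
    if mean_confidence < bad_below:
        return "bad"
    return "uncertain"


# Hypothetical (mean word confidence, measured error rate) pairs, one per text.
samples = [(0.95, 0.02), (0.90, 0.05), (0.80, 0.12), (0.65, 0.30), (0.50, 0.45)]
confidences = [c for c, _ in samples]
error_rates = [e for _, e in samples]

# A strongly negative r means higher confidence goes with fewer errors,
# which is the precondition for using confidence as a proxy at all.
r = pearson(confidences, error_rates)
labels = [categorize(c) for c in confidences]

print(f"correlation: {r:.3f}")
print(labels)
```

In this sketch the correlation is checked first because a threshold on confidence is only meaningful when confidence and error rate move together. The middle "uncertain" band reflects that the abstract reports good performance only for separating very good from very bad texts.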

Author information

Corresponding author

Correspondence to Mirjam Cuper.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Cuper, M., van Dongen, C., Koster, T. (2023). Unraveling Confidence: Examining Confidence Scores as Proxy for OCR Quality. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14191. Springer, Cham. https://doi.org/10.1007/978-3-031-41734-4_7

  • DOI: https://doi.org/10.1007/978-3-031-41734-4_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41733-7

  • Online ISBN: 978-3-031-41734-4

  • eBook Packages: Computer Science, Computer Science (R0)
