Abstract
While performing Optical Character Recognition (OCR), most engines provide confidence scores. These scores indicate how certain an engine is that a word or character has been recognized correctly. The practical application of these scores is not yet clear, and various studies have discussed their (un)usability as an estimate of OCR quality. Using a dataset of 2,000 historical Dutch newspapers, we investigated different aspects of the confidence score as provided by ABBYY FineReader, while also looking for a way to use the confidence score as an indication of quality. Such an indication could be used by institutions to determine which parts of their collections would benefit from re-OCRing or post-processing. We found that the reliability of the confidence score as a measure of quality depends largely on how the engine has been configured. In addition, we show that when there is a sufficiently high correlation between the word confidence and the Word Error Rate (order independent), the word confidence can be used to calculate a proxy measure for categorizing digitized texts. However, such a measure must be recalculated for individual OCR engine configurations and producers. For our dataset, this proxy measure performs well in separating digitized texts into categories of very good and very bad quality, with a total accuracy of 83%.
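The abstract describes the general idea rather than an implementation. The minimal Python sketch below illustrates how such a proxy measure could work, assuming per-word confidences (0-100) reported by the OCR engine and reference Word Error Rate values from a ground-truth sample. All function names, thresholds, and example numbers are illustrative assumptions, not the authors' actual procedure or data.

    # Minimal sketch (not the authors' implementation): using mean word confidence
    # as a proxy for OCR quality. Assumes per-word confidences from the engine and,
    # for a calibration set, reference WER values per document (ground truth).
    from statistics import mean

    def mean_word_confidence(word_confidences):
        """Average the engine-reported word confidences for one document."""
        return mean(word_confidences)

    def pearson(xs, ys):
        """Pearson correlation between two equal-length lists."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    def categorize(doc_confidence, good_threshold=90.0, bad_threshold=70.0):
        """Map a document-level confidence to a coarse quality category.
        The cut-off values are placeholders; as the paper stresses, they must be
        recalibrated per OCR engine configuration and producer."""
        if doc_confidence >= good_threshold:
            return "very good"
        if doc_confidence <= bad_threshold:
            return "very bad"
        return "intermediate"

    if __name__ == "__main__":
        # Illustrative numbers only: mean word confidence per document and the
        # corresponding WER measured against ground truth (lower is better).
        doc_confidences = [95.2, 88.1, 64.3, 91.7]
        reference_wer = [0.02, 0.08, 0.35, 0.04]
        # Only trust the proxy if confidence correlates strongly enough with the
        # reference error rate on the ground-truth sample (assumed cut-off of 0.7).
        if abs(pearson(doc_confidences, reference_wer)) > 0.7:
            print([categorize(c) for c in doc_confidences])

In this sketch the correlation check acts as a gate: if the engine's confidences do not track the measured error rate on a ground-truth sample, the categorization step is skipped, mirroring the paper's caveat that the proxy only holds for suitably configured engines.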
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Cuper, M., van Dongen, C., Koster, T. (2023). Unraveling Confidence: Examining Confidence Scores as Proxy for OCR Quality. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14191. Springer, Cham. https://doi.org/10.1007/978-3-031-41734-4_7