Abstract
While performing Optical Character Recognition (OCR), most engines provide confidence scores. These scores indicate how certain an engine is that a word or character has been recognized correctly. The practical application of these scores is not yet clear, and various studies have discussed their (un)usability as an estimate of OCR quality. Using a dataset of 2,000 historical Dutch newspapers, we investigated different aspects of the confidence score as provided by ABBYY FineReader, while also looking for a way to use the confidence score as an indication of quality. Such an indication could be used by institutions to determine which parts of their collections would benefit from re-OCRing or post-processing. We found that the reliability of the confidence score as a measure of quality depends largely on how the engine has been configured. In addition, we show that when there is a sufficiently high correlation between the word confidence and the Word Error Rate (order independent), the word confidence can be used to calculate a proxy measure for categorizing digitized texts. However, such a measure must be recalculated for individual OCR engine configurations and producers. For our dataset, this proxy measure performs well in separating digitized texts into categories of very good and very bad quality, with a total accuracy of 83%.
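The abstract describes the general idea rather than an implementation. The minimal Python sketch below illustrates how such a proxy measure could work, assuming per-word confidences (0-100) reported by the OCR engine and reference Word Error Rate values from a ground-truth sample. All function names, thresholds, and example numbers are illustrative assumptions, not the authors' actual procedure or data.

    # Minimal sketch (not the authors' implementation): using mean word confidence
    # as a proxy for OCR quality. Assumes per-word confidences from the engine and,
    # for a calibration set, reference WER values per document (ground truth).
    from statistics import mean

    def mean_word_confidence(word_confidences):
        """Average the engine-reported word confidences for one document."""
        return mean(word_confidences)

    def pearson(xs, ys):
        """Pearson correlation between two equal-length lists."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    def categorize(doc_confidence, good_threshold=90.0, bad_threshold=70.0):
        """Map a document-level confidence to a coarse quality category.
        The cut-off values are placeholders; as the paper stresses, they must be
        recalibrated per OCR engine configuration and producer."""
        if doc_confidence >= good_threshold:
            return "very good"
        if doc_confidence <= bad_threshold:
            return "very bad"
        return "intermediate"

    if __name__ == "__main__":
        # Illustrative numbers only: mean word confidence per document and the
        # corresponding WER measured against ground truth (lower is better).
        doc_confidences = [95.2, 88.1, 64.3, 91.7]
        reference_wer = [0.02, 0.08, 0.35, 0.04]
        # Only trust the proxy if confidence correlates strongly enough with the
        # reference error rate on the ground-truth sample (assumed cut-off of 0.7).
        if abs(pearson(doc_confidences, reference_wer)) > 0.7:
            print([categorize(c) for c in doc_confidences])

In this sketch the correlation check acts as a gate: if the engine's confidences do not track the measured error rate on a ground-truth sample, the categorization step is skipped, mirroring the paper's caveat that the proxy only holds for suitably configured engines.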
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Cuper, M., van Dongen, C., Koster, T. (2023). Unraveling Confidence: Examining Confidence Scores as Proxy for OCR Quality. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14191. Springer, Cham. https://doi.org/10.1007/978-3-031-41734-4_7