Abstract
We address the problem of estimating the tradeoff between the size of the training set and the performance of a keyword spotting system (KWS) used to assist the transcription of small collections of historical handwritten documents. This application domain is characterized by a lack of data, and techniques such as transfer learning and data augmentation require more resources than are commonly available in the organizations holding the collections, so we focus on getting the best out of the available data. To this end, we reformulate the problem as that of finding the training set size that yields a KWS whose performance, when used to support the transcription, gives the largest reduction in the human effort needed to transcribe the whole collection. The results of a large set of experiments on three publicly available datasets widely adopted as benchmarks for performance evaluation show that a training set of 5 to 8 pages is enough to achieve the largest reduction, regardless of which pages are included in the training set and of the corresponding keyword lists. They also show that the actual time reduction depends much more on the keyword list than on the KWS performance.
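The reformulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the cost structure (training pages transcribed fully by hand, remaining pages transcribed with KWS support), the function names, and the toy curve `toy_assisted_time` are all assumptions made for illustration, and the numbers are not taken from the paper.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical constant: minutes needed to transcribe one page by hand.
MANUAL_MINUTES_PER_PAGE = 10.0


def total_effort(
    n_train: int,
    n_pages: int,
    manual_time: float,
    assisted_time: Callable[[int], float],
) -> float:
    """Human time to transcribe the whole collection (assumed cost model).

    The n_train training pages are transcribed manually; the remaining
    pages are transcribed with the support of a KWS trained on them.
    """
    return n_train * manual_time + (n_pages - n_train) * assisted_time(n_train)


def best_training_size(
    n_pages: int,
    manual_time: float,
    assisted_time: Callable[[int], float],
    candidates: Iterable[int],
) -> Tuple[int, float]:
    """Return the candidate size with the least total effort and the
    relative reduction with respect to fully manual transcription."""
    baseline = n_pages * manual_time
    best = min(
        candidates,
        key=lambda n: total_effort(n, n_pages, manual_time, assisted_time),
    )
    effort = total_effort(best, n_pages, manual_time, assisted_time)
    return best, 1.0 - effort / baseline


def toy_assisted_time(n_train: int) -> float:
    """Hypothetical per-page time with KWS support.

    Assumption: more training pages make the KWS more helpful, with
    diminishing returns. This curve is invented, not measured.
    """
    return MANUAL_MINUTES_PER_PAGE * (0.4 + 0.6 / (1 + n_train))


if __name__ == "__main__":
    size, reduction = best_training_size(
        n_pages=20,
        manual_time=MANUAL_MINUTES_PER_PAGE,
        assisted_time=toy_assisted_time,
        candidates=range(0, 21),
    )
    print(f"best training size: {size} pages, reduction: {reduction:.1%}")
```

In practice the assisted time per page would have to be estimated from the measured KWS performance and the keyword list rather than from a closed-form curve; the sketch only shows why an intermediate training set size can minimize the total effort.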
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
De Gregorio, G., Marcelli, A. (2023). Estimating the Optimal Training Set Size of Keyword Spotting for Historical Handwritten Document Transcription. In: Parziale, A., Diaz, M., Melo, F. (eds) Graphonomics in Human Body Movement. Bridging Research and Practice from Motor Control to Handwriting Analysis and Recognition. IGS 2023. Lecture Notes in Computer Science, vol 14285. Springer, Cham. https://doi.org/10.1007/978-3-031-45461-5_12
DOI: https://doi.org/10.1007/978-3-031-45461-5_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45460-8
Online ISBN: 978-3-031-45461-5
eBook Packages: Computer Science, Computer Science (R0)