
Estimating the Optimal Training Set Size of Keyword Spotting for Historical Handwritten Document Transcription

  • Conference paper
Graphonomics in Human Body Movement. Bridging Research and Practice from Motor Control to Handwriting Analysis and Recognition (IGS 2023)

Abstract

We address the problem of estimating the tradeoff between the size of the training set and the performance of a keyword spotting (KWS) system used to assist the transcription of small collections of historical handwritten documents. This application domain is characterized by a lack of data, and techniques such as transfer learning and data augmentation require more resources than are commonly available in the organizations holding the collections, so we focus on getting the best out of the available data. To this end, we reformulate the problem as that of finding the training set size that leads to a KWS system whose performance, when used to support transcription, yields the largest reduction in the human effort needed to transcribe the collection completely. The results of a large set of experiments on three publicly available datasets widely adopted as benchmarks for performance evaluation show that a training set of 5 to 8 pages is enough to achieve the largest reduction, independently of the actual pages included in the training set and of the corresponding keyword lists. They also show that the actual time reduction depends much more on the keyword list than on the KWS performance.
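The reformulation in the abstract can be illustrated with a minimal Python sketch. It is not the authors' model: the cost structure (training pages transcribed fully by hand, remaining pages transcribed with KWS assistance), the function names, and the toy time curve `toy_assisted_time` are all assumptions introduced here for illustration, and the numbers are made up rather than taken from the paper's experiments.

```python
from typing import Callable, Iterable, Tuple


def total_effort(n_train: int,
                 n_pages: int,
                 manual_time_per_page: float,
                 assisted_time_per_page: Callable[[int], float]) -> float:
    """Hypothetical total human time to transcribe the whole collection.

    Assumption: the n_train training pages are transcribed by hand, and the
    remaining pages are transcribed with help from a KWS system trained on
    those n_train pages.
    """
    manual = n_train * manual_time_per_page
    assisted = (n_pages - n_train) * assisted_time_per_page(n_train)
    return manual + assisted


def best_training_size(n_pages: int,
                       manual_time_per_page: float,
                       assisted_time_per_page: Callable[[int], float],
                       candidates: Iterable[int]) -> Tuple[int, float]:
    """Return the candidate size with the largest time saving, and that saving."""
    # Baseline: every page is transcribed by hand, without any KWS support.
    baseline = n_pages * manual_time_per_page

    def effort(n: int) -> float:
        return total_effort(n, n_pages, manual_time_per_page,
                            assisted_time_per_page)

    best = min(candidates, key=effort)
    return best, baseline - effort(best)


def toy_assisted_time(n_train: int) -> float:
    """Made-up curve: assisted time per page falls as the training set grows
    and flattens out at half of the manual time (10 time units per page)."""
    return 10.0 * (0.5 + 0.5 / (1 + n_train))


if __name__ == "__main__":
    size, saving = best_training_size(
        n_pages=20,
        manual_time_per_page=10.0,
        assisted_time_per_page=toy_assisted_time,
        candidates=range(0, 11),
    )
    print(f"best training set size: {size} pages, time saved: {saving:.1f}")
```

The sketch only shows why an intermediate training set size can be optimal: each extra training page costs a full manual transcription, while the benefit it brings to the remaining pages shrinks. In practice the assisted-time curve would have to be measured, and according to the abstract it depends strongly on the keyword list.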

Author information

Corresponding author

Correspondence to Giuseppe De Gregorio.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

De Gregorio, G., Marcelli, A. (2023). Estimating the Optimal Training Set Size of Keyword Spotting for Historical Handwritten Document Transcription. In: Parziale, A., Diaz, M., Melo, F. (eds) Graphonomics in Human Body Movement. Bridging Research and Practice from Motor Control to Handwriting Analysis and Recognition. IGS 2023. Lecture Notes in Computer Science, vol 14285. Springer, Cham. https://doi.org/10.1007/978-3-031-45461-5_12

  • DOI: https://doi.org/10.1007/978-3-031-45461-5_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45460-8

  • Online ISBN: 978-3-031-45461-5

  • eBook Packages: Computer Science, Computer Science (R0)
