Abstract
Document Layout Analysis, the task of identifying the different semantic regions within a document page, is of great interest to both computer scientists and humanities scholars: for the former it is a fundamental step towards further analysis tasks, and for the latter it is a powerful tool to improve and facilitate the study of documents. However, many works in the current literature, and the available datasets in particular, fail to meet the needs of both communities. They tend to follow the needs and common practices of computer science, leading to resources that are not representative of the real needs of the humanities. For this reason, this paper introduces U-DIADS-Bib, a novel, pixel-precise, non-overlapping and noiseless document layout analysis dataset developed in close collaboration between specialists in computer vision and the humanities. Furthermore, we propose a novel computer-aided segmentation pipeline to alleviate the burden of the time-consuming manual annotation needed to generate the ground truth segmentation maps. Finally, we present a standardized few-shot version of the dataset (U-DIADS-BibFS), with the aim of encouraging the development of models and solutions able to address this task with as few samples as possible, which would allow for more effective use in real-world scenarios, where collecting a large number of segmentations is not always feasible.
Data availability
The datasets generated and analysed during the current study are available in the U-DIADS-Bib repository (see footnote 6).
Acknowledgements
The authors would like to acknowledge the Bibliothèque nationale de France for providing access to the digital library Gallica.
Funding
Partial financial support was received from the Piano Nazionale di Ripresa e Resilienza (PNRR), DD 3277 of 30 December 2021 (PNRR Missione 4, Componente 2, Investimento 1.5), Interconnected Nord-Est Innovation Ecosystem (iNEST).
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
About this article
Cite this article
Zottin, S., De Nardin, A., Colombi, E. et al. U-DIADS-Bib: a full and few-shot pixel-precise dataset for document layout analysis of ancient manuscripts. Neural Comput & Applic 36, 11777–11789 (2024). https://doi.org/10.1007/s00521-023-09356-5
DOI: https://doi.org/10.1007/s00521-023-09356-5