Abstract
Document semantic segmentation is a promising avenue that can facilitate document analysis tasks, including optical character recognition (OCR), form classification, and document editing. Although several synthetic datasets have been developed to distinguish handwriting from printed text, they fall short in class variety and document diversity. We demonstrate the limitations of training on existing datasets by evaluating on the National Archives Form Semantic Segmentation (NAFSS) dataset, which we introduce. To address these limitations, we propose the most comprehensive document semantic segmentation synthesis pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources to create the Document Element Layer INtegration Ensemble 8K, or DELINE8K, dataset (DELINE8K is available at https://github.com/Tahlor/DELINE8K). Models trained on our customized dataset achieve superior performance on the NAFSS benchmark, making it a promising tool for further research.
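The idea of synthesizing labeled documents from separate element layers can be illustrated with a minimal sketch. It is not the DELINE8K pipeline itself: the function name `composite_layers`, the darkest-pixel blend, the ink threshold of 128, and the one-mask-channel-per-class layout are all assumptions chosen for illustration.

```python
import numpy as np


def composite_layers(background, layers, ink_threshold=128):
    """Blend grayscale ink layers onto a background and build per-class masks.

    Hypothetical helper, not taken from the paper.
    background: (H, W) uint8 array, where 255 is blank paper.
    layers: dict of class name -> (H, W) uint8 array, where 255 means no ink.
    Returns (image, mask). mask has shape (H, W, C) with one boolean channel
    per class, in dict order, so a pixel may belong to several classes.
    """
    image = background.copy()
    masks = []
    for name, layer in layers.items():
        if layer.shape != background.shape:
            raise ValueError(f"layer {name!r} does not match the background shape")
        # Darkest-pixel blend: ink darkens the page, blank areas leave it unchanged.
        image = np.minimum(image, layer)
        # A pixel is labeled with this class wherever the layer is dark enough.
        masks.append(layer < ink_threshold)
    return image, np.stack(masks, axis=-1)


# Toy 4x4 page: one printed row crossing one handwritten column.
background = np.full((4, 4), 230, dtype=np.uint8)
printed = np.full((4, 4), 255, dtype=np.uint8)
printed[1, :] = 0
handwriting = np.full((4, 4), 255, dtype=np.uint8)
handwriting[:, 2] = 40

image, mask = composite_layers(
    background, {"printed": printed, "handwriting": handwriting}
)
```

Because each class has its own mask channel in this sketch, the pixel where the printed row and the handwritten column cross is labeled with both classes rather than being forced into one.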
Acknowledgment
The authors would like to thank the Handwriting Recognition Team at Ancestry.com for providing essential data and support that contributed to the findings of this study.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Archibald, T., Martinez, T. (2024). DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14806. Springer, Cham. https://doi.org/10.1007/978-3-031-70543-4_17
DOI: https://doi.org/10.1007/978-3-031-70543-4_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70542-7
Online ISBN: 978-3-031-70543-4
eBook Packages: Computer Science (R0)