
DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents

  • Conference paper
Document Analysis and Recognition - ICDAR 2024 (ICDAR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14806)


Abstract

Document semantic segmentation is a promising avenue that can facilitate document analysis tasks, including optical character recognition (OCR), form classification, and document editing. Although several synthetic datasets have been developed to distinguish handwriting from printed text, they fall short in class variety and document diversity. We introduce the National Archives Form Semantic Segmentation dataset (NAFSS) and use it to demonstrate the limitations of training on existing datasets. To address these limitations, we propose the most comprehensive document semantic segmentation synthesis pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources to create the Document Element Layer INtegration Ensemble 8K, or DELINE8K, dataset (available at https://github.com/Tahlor/DELINE8K). Models trained on our customized dataset achieve superior performance on the NAFSS benchmark, making it a promising tool for further research.
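The layer-integration idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `composite` helper, the darkest-pixel blending rule, the fixed ink threshold, and the class names are all assumptions chosen for illustration. The sketch overlays grayscale element layers on a background and derives one binary mask per class, so a pixel where printed text and handwriting overlap is labeled with both classes.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]  # grayscale rows: 0 = black ink, 255 = white paper


def composite(
    background: Image,
    layers: Dict[str, Image],
    ink_threshold: int = 128,
) -> Tuple[Image, Dict[str, Image]]:
    """Overlay element layers on a background and derive one mask per class.

    Hypothetical helper: darkest-pixel blending and a fixed ink threshold
    are illustrative assumptions, not the blending used to build DELINE8K.
    """
    height, width = len(background), len(background[0])
    image = [row[:] for row in background]  # copy so the input is untouched
    masks = {name: [[0] * width for _ in range(height)] for name in layers}
    for name, layer in layers.items():
        for y in range(height):
            for x in range(width):
                value = layer[y][x]
                # Keep the darkest value so overlapping ink stays visible.
                image[y][x] = min(image[y][x], value)
                # A pixel may be marked in several class masks at once.
                if value < ink_threshold:
                    masks[name][y][x] = 1
    return image, masks


# Toy 2x3 example with two element classes (names are assumptions).
background = [[230, 230, 230], [230, 230, 230]]
layers = {
    "printed_text": [[255, 40, 255], [255, 255, 255]],
    "handwriting": [[255, 60, 255], [20, 255, 255]],
}
image, masks = composite(background, layers)
```

In this toy input the middle pixel of the first row carries ink from both layers, so it is set in both masks. Keeping a separate mask per class, rather than a single label map, is what lets overlapping elements be represented.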



Acknowledgment

The authors would like to thank the Handwriting Recognition Team at Ancestry.com for providing essential data and support that contributed to the findings of this study.

Author information

Corresponding author

Correspondence to Taylor Archibald.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Archibald, T., Martinez, T. (2024). DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14806. Springer, Cham. https://doi.org/10.1007/978-3-031-70543-4_17


  • DOI: https://doi.org/10.1007/978-3-031-70543-4_17


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70542-7

  • Online ISBN: 978-3-031-70543-4

  • eBook Packages: Computer Science, Computer Science (R0)
