DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents

Archibald, Taylor; Martinez, Tony

doi:10.1007/978-3-031-70543-4_17

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14806)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

Document semantic segmentation is a promising avenue that can facilitate document analysis tasks, including optical character recognition (OCR), form classification, and document editing. Although several synthetic datasets have been developed to distinguish handwriting from printed text, they fall short in class variety and document diversity. We demonstrate the limitations of training on existing datasets when solving the National Archives Form Semantic Segmentation dataset (NAFSS), a dataset which we introduce. To address these limitations, we propose the most comprehensive document semantic segmentation synthesis pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources to create the Document Element Layer INtegration Ensemble 8K, or DELINE8K dataset (DELINE8K is available at https://github.com/Tahlor/DELINE8K). Our customized dataset exhibits superior performance on the NAFSS benchmark, demonstrating it as a promising tool in further research.
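The layering idea described in the abstract (combining preprinted text, handwriting, and backgrounds into one image with per-class labels) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the multiplicative blend, and the `ink_threshold` value are all assumptions chosen for illustration, and plain nested lists stand in for real images so the example has no dependencies.

```python
def compose_layers(background, layers, ink_threshold=0.5):
    """Composite grayscale ink layers onto a background (hypothetical helper).

    background: list of rows of floats in [0, 1], where 1.0 is white paper.
    layers: dict mapping a class name to an image of the same size,
            where 1.0 means "no ink" and lower values mean darker ink.
    Returns (image, masks). masks holds one boolean grid per class, so a
    pixel may belong to several classes when layers overlap.
    """
    height, width = len(background), len(background[0])
    image = [row[:] for row in background]
    masks = {}
    for name, layer in layers.items():
        if len(layer) != height or any(len(row) != width for row in layer):
            raise ValueError(f"layer {name!r} does not match the background size")
        # The label comes from the clean layer, before any blending.
        masks[name] = [[pixel < ink_threshold for pixel in row] for row in layer]
        # Multiplicative blend (an assumption): ink darkens whatever is beneath it.
        for y in range(height):
            for x in range(width):
                image[y][x] *= layer[y][x]
    return image, masks


def make_toy_example():
    """Build a 4x6 page with one printed stroke and one handwritten stroke."""
    height, width = 4, 6
    background = [[0.9] * width for _ in range(height)]
    printed = [[1.0] * width for _ in range(height)]
    handwriting = [[1.0] * width for _ in range(height)]
    for x in range(1, 5):          # horizontal printed stroke on row 1
        printed[1][x] = 0.0
    for y in range(0, 3):          # vertical handwritten stroke in column 2
        handwriting[y][2] = 0.1
    return compose_layers(background, {"printed": printed, "handwriting": handwriting})
```

Because each mask is derived from its own clean layer, the pixel where the two strokes cross is labelled as both printed text and handwriting, which is the kind of multi-class ground truth a segmentation model would train on.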

Acknowledgment

The authors would like to thank the Handwriting Recognition Team at Ancestry.com for providing essential data and support that contributed to the findings of this study.

Author information

Authors and Affiliations

Brigham Young University, Provo, UT, USA
Taylor Archibald & Tony Martinez

Authors

Taylor Archibald
Tony Martinez

Corresponding author

Correspondence to Taylor Archibald.

Editor information

Editors and Affiliations

Luleå Tekniska Universitet, Luleå, Sweden
Elisa H. Barney Smith
Luleå Tekniska Universitet, Luleå, Sweden
Marcus Liwicki
Tsinghua University, Beijing, China
Liangrui Peng

Cite this paper

Archibald, T., Martinez, T. (2024). DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14806. Springer, Cham. https://doi.org/10.1007/978-3-031-70543-4_17

DOI: https://doi.org/10.1007/978-3-031-70543-4_17
Published: 09 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70542-7
Online ISBN: 978-3-031-70543-4
eBook Packages: Computer Science, Computer Science (R0)
