Abstract
We introduce Dessurt, a relatively simple document understanding transformer capable of being fine-tuned on a greater variety of document tasks than prior methods. It receives a document image and task string as input and generates arbitrary text autoregressively as output. Because Dessurt is an end-to-end architecture that performs text recognition in addition to document understanding, it does not require an external recognition model as prior methods do. Dessurt is a more flexible model than prior methods and is able to handle a variety of document domains and tasks. We show that this model is effective at 9 different dataset-task combinations.
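The interface described above — a document image plus a task string in, text out token by token — can be sketched as a plain greedy decoding loop. This is a hypothetical illustration of the input/output contract only: `dummy_step`, `generate`, and the `</s>` end token are stand-ins invented for this sketch, not the actual Dessurt network or tokenizer.

```python
# Hypothetical sketch of Dessurt's interface: the model consumes a document
# image plus a task prompt and emits text autoregressively until an
# end-of-sequence token. dummy_step is a stand-in for the real network.

EOS = "</s>"

def dummy_step(image, task, prefix):
    """Stand-in for one decoder step: returns the next output token."""
    answer = ["Hello", "world", EOS]  # canned output, for illustration only
    return answer[len(prefix)]

def generate(image, task, step_fn=dummy_step, max_len=32):
    """Greedy autoregressive decoding conditioned on the image and task string."""
    tokens = []
    for _ in range(max_len):
        tok = step_fn(image, task, tokens)
        if tok == EOS:
            break
        tokens.append(tok)
    return " ".join(tokens)

print(generate(image=None, task="Read the document;"))  # → "Hello world"
```

Because the task is passed as a string rather than baked into the architecture, the same loop serves recognition, question answering, and parsing by changing only the prompt.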
B. Davis—Work completed prior to Brian joining AWS.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Davis, B., Morse, B., Price, B., Tensmeyer, C., Wigington, C., Morariu, V. (2023). End-to-End Document Recognition and Understanding with Dessurt. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13804. Springer, Cham. https://doi.org/10.1007/978-3-031-25069-9_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25068-2
Online ISBN: 978-3-031-25069-9
eBook Packages: Computer Science (R0)