Abstract
The Key Information Extraction (KIE) task is becoming increasingly important in natural language processing, yet only a few well-defined problems serve as benchmarks for solutions in this area. To bridge this gap, we introduce two new datasets, Kleister NDA and Kleister Charity. They comprise a mix of scanned and born-digital long formal English-language documents, in which an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister Charity dataset consists of 2,788 annual financial reports of charity organizations, with 61,643 unique pages and 21,612 entities to extract. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract. We provide several state-of-the-art baseline systems from the KIE domain (Flair, BERT, RoBERTa, LayoutLM, LAMBERT), which show that our datasets pose a strong challenge to existing models. The best model achieved an F1-score of 81.77% on Kleister NDA and 83.57% on Kleister Charity. We share the datasets to encourage progress on more in-depth and complex information extraction tasks.
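As an illustration of the evaluation setting, the sketch below (Python) shows how an entity-level F1-score can be computed from per-document sets of (entity type, normalized value) pairs. This is a minimal sketch under our own exact-match assumption; the function and variable names are illustrative and this is not the official Kleister evaluation script.

def entity_f1(gold_docs, pred_docs):
    # gold_docs, pred_docs: one set of (entity_type, normalized_value)
    # tuples per document; exact matching is an illustrative assumption.
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)   # correctly extracted entities
        fp += len(pred - gold)   # spurious predictions
        fn += len(gold - pred)   # missed gold entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: one document, one of two gold entities recovered correctly.
gold = [{("charity_name", "Example Trust"), ("income", "12500.00")}]
pred = [{("charity_name", "Example Trust"), ("income", "99999.00")}]
print(entity_f1(gold, pred))  # (0.5, 0.5, 0.5)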
Notes
- 8. The normalization standards are described in the public repository with the datasets.
- 10. Organizations with an income below 25,000 GBP a year are required to submit a condensed financial report instead.
- 11. Postal codes in the UK were aggregated from www.streetlist.co.uk.
- 14. Tesseract was run with the --oem 2 -l eng --dpi 300 flags (meaning both the legacy and the new LSTM OCR engines were used simultaneously, with the language set to English and the pixel density to 300 dpi); a sketch of an equivalent invocation follows these notes.
- 15. https://aws.amazon.com/textract/ (the version of the API from March 1, 2020 was used).
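As referenced in note 14, the following is a minimal sketch of an equivalent Tesseract invocation through the pytesseract wrapper. The wrapper, the image file name, and the surrounding code are illustrative assumptions; only the flags are those listed in the note.

import pytesseract
from PIL import Image

# Hypothetical page image; the actual Kleister OCR pipeline is not shown here.
page = Image.open("report_page_001.png")

# --oem 2 runs the legacy and LSTM engines together, -l eng sets the language,
# and --dpi 300 declares the assumed pixel density, matching note 14.
text = pytesseract.image_to_string(page, config="--oem 2 -l eng --dpi 300")
print(text)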
References
Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649. Association for Computational Linguistics, Santa Fe, New Mexico, USA (August 2018), https://www.aclweb.org/anthology/C18-1139
Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150 (2020)
Borchmann, L., et al.: Contract discovery: Dataset and a few-shot semantic retrieval challenge with competitive baselines. In: Cohn, T., He, Y., Liu, Y. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16–20 November 2020, pp. 4254–4268. Association for Computational Linguistics (2020)
Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., Salakhutdinov, R.: Transformer-XL: attentive language models beyond a fixed-length context. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019). https://www.aclweb.org/anthology/P19-1285
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Dwojak, T., Pietruszka, M., Borchmann, Ł., Chłędowski, J., Graliński, F.: From dataset recycling to multi-property extraction and beyond. In: Proceedings of the 24th Conference on Computational Natural Language Learning, pp. 641–651. Association for Computational Linguistics, Online (November 2020). https://doi.org/10.18653/v1/2020.conll-1.52, https://www.aclweb.org/anthology/2020.conll-1.52
Garncarek, Ł., et al.: LAMBERT: layout-aware (language) modeling using BERT for information extraction. arXiv preprint arXiv:2002.08087 (2020)
Hewlett, D., et al.: WikiReading: a novel large-scale language understanding task over Wikipedia. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1535–1545. Association for Computational Linguistics, Berlin, Germany (2016)
Holt, X., Chisholm, A.: Extracting structured data from invoices. In: Proceedings of the Australasian Language Technology Association Workshop 2018, pp. 53–59. Dunedin, New Zealand (December 2018). https://www.aclweb.org/anthology/U18-1006
Hugging Face: Transformers. https://github.com/huggingface/transformers (2020)
Jaume, G., Kemal Ekenel, H., Thiran, J.: FUNSD: A dataset for form understanding in noisy scanned documents. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), vol. 2, pp. 1–6 (2019)
Katti, A.R., Reisswig, C., Guder, C., Brarda, S., Bickel, S., Höhne, J., Faddoul, J.B.: Chargrid: towards understanding 2D documents. arXiv preprint arXiv:1809.08799 (2018)
Liu, X., Gao, F., Zhang, Q., Zhao, H.: Graph convolution for multimodal information extraction from visually rich documents. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (2019). http://dx.doi.org/10.18653/v1/N19-2005
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Mathew, M., Karatzas, D., Jawahar, C.V.: DocVQA: a dataset for VQA on document images. arXiv preprint arXiv:2007.00398 (2021)
Palm, R.B., Laws, F., Winther, O.: Attend, copy, parse end-to-end information extraction from documents. In: International Conference on Document Analysis and Recognition (ICDAR) (2019)
Palm, R.B., Winther, O., Laws, F.: Cloudscan - a configuration-free invoice analysis system using recurrent neural networks. In: 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) (2017). https://doi.org/10.1109/icdar.2017.74
Park, S., et al.: CORD: a consolidated receipt dataset for post-OCR parsing. In: Document Intelligence Workshop at Neural Information Processing Systems (2019)
Smith, R.: Tesseract Open Source OCR Engine (2020). https://github.com/tesseract-ocr/tesseract
Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147 (2003)
Wellmann, C., Stierle, M., Dunzer, S., Matzner, M.: A framework to evaluate the viability of robotic process automation for business process activities. In: Asatiani, A., et al. (eds.) BPM 2020. LNBIP, vol. 393, pp. 200–214. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58779-6_14
Wróblewska, A., Stanisławek, T., Prus-Zajączkowski, B., Garncarek, Ł.: Robotic process automation of unstructured data with machine learning. In: Position Papers of the 2018 Federated Conference on Computer Science and Information Systems, FedCSIS 2018, Poznań, Poland, 9–12 September 2018, pp. 9–16 (2018). https://doi.org/10.15439/2018F373
Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2020). https://doi.org/10.1145/3394486.3403172
Acknowledgments.
The Smart Growth Operational Programme supported this research under project no. POIR.01.01.01-00-0605/19 (Disruptive adoption of Neural Language Modelling for automation of text-intensive work).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Stanisławek, T. et al. (2021). Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. Lecture Notes in Computer Science, vol 12821. Springer, Cham. https://doi.org/10.1007/978-3-030-86549-8_36
DOI: https://doi.org/10.1007/978-3-030-86549-8_36
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86548-1
Online ISBN: 978-3-030-86549-8