Automatic Information Extraction from Electronic Documents Using Machine Learning

  • Conference paper
Artificial Intelligence XXXVIII (SGAI-AI 2021)

Abstract

The digital processing of electronic documents is widely exploited across many domains to improve the efficiency of information extraction. However, paper documents are still widely used in practice. Processing such documents requires a manual procedure in which a person inspects each one and extracts the values of interest. Because this task is monotonous and time-consuming, it is prone to human error. In this paper, we present an efficient and robust system that automates this task by combining several machine learning techniques: optical character recognition, object detection and image processing. This not only speeds up the process but also improves the accuracy of the extracted information compared with the manual procedure.
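The abstract gives no implementation details, so the following is only a minimal sketch of how a detect, crop, clean and recognise pipeline of this general kind could be wired together. It is not the authors' system. All names (`Box`, `crop`, `binarize`, `extract_fields`), the list-of-rows image representation and the fixed threshold are assumptions made for illustration. The object detector and the OCR engine are passed in as plain callables, so no real library API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a grayscale image: rows of pixel intensities (0-255).
Image = List[List[int]]


@dataclass(frozen=True)
class Box:
    """A detected region of interest (hypothetical structure)."""
    label: str  # name of the field the region is assumed to contain
    x: int      # left column of the region
    y: int      # top row of the region
    w: int      # width in pixels
    h: int      # height in pixels


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    return [row[box.x:box.x + box.w] for row in image[box.y:box.y + box.h]]


def binarize(image: Image, threshold: int = 128) -> Image:
    """Simple image-processing step: map each pixel to 0 or 255.

    The fixed threshold is an assumption, not a value from the paper.
    """
    return [[255 if p >= threshold else 0 for p in row] for row in image]


def extract_fields(
    image: Image,
    detect: Callable[[Image], List[Box]],
    recognize: Callable[[Image], str],
) -> Dict[str, str]:
    """Detect regions, clean each crop, and run OCR on it."""
    fields: Dict[str, str] = {}
    for box in detect(image):
        region = binarize(crop(image, box))
        fields[box.label] = recognize(region).strip()
    return fields


if __name__ == "__main__":
    # Stub detector and recogniser so the sketch runs without any model.
    page = [[200] * 6 for _ in range(4)]
    stub_detect = lambda img: [Box("total", 1, 1, 3, 2)]
    stub_recognize = lambda region: " 42.00 "
    print(extract_fields(page, stub_detect, stub_recognize))
```

Keeping the detector and recogniser behind callables means either stage can be swapped or stubbed in tests without touching the rest of the pipeline.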

Notes

  1. https://opensource.google/projects/tesseract

  2. https://github.com/AlexeyAB/darknet

  3. https://cloud.google.com/vision

  4. https://aws.amazon.com/textract/

  5. https://azure.microsoft.com/en-gb/services/cognitive-services/computer-vision/

  6. https://github.com/tesseract-ocr/tesseract

  7. http://jocr.sourceforge.net/

  8. https://en.wikipedia.org/wiki/CuneiForm_(software)

  9. https://github.com/tzutalin/labelImg

Author information

Corresponding author

Correspondence to Nishanthan Kamaleson.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kamaleson, N., Chu, D., Otero, F.E.B. (2021). Automatic Information Extraction from Electronic Documents Using Machine Learning. In: Bramer, M., Ellis, R. (eds) Artificial Intelligence XXXVIII. SGAI-AI 2021. Lecture Notes in Computer Science, vol 13101. Springer, Cham. https://doi.org/10.1007/978-3-030-91100-3_16

  • DOI: https://doi.org/10.1007/978-3-030-91100-3_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91099-0

  • Online ISBN: 978-3-030-91100-3

  • eBook Packages: Computer Science, Computer Science (R0)
