Abstract
Digital processing of electronic documents is widely used across many domains to make information extraction more efficient. However, paper documents are still widely used in practice. Processing them typically relies on a manual procedure in which a person inspects each document and extracts the values of interest. Because this task is monotonous and time-consuming, it is prone to human error. In this paper, we present an efficient and robust system that automates this task using a combination of machine learning techniques: optical character recognition, object detection and image processing. The system not only speeds up the process but also improves the accuracy of the extracted information compared with the manual procedure.
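As a rough illustration only, the three ingredients named in the abstract could be chained as follows: an object detector proposes labelled regions, an image-processing step crops each region, and an OCR step reads the crop. This is a minimal sketch under stated assumptions, not the authors' implementation; the names `Region`, `crop` and `extract_fields`, and the idea of passing the detector and recogniser in as plain functions, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a page image is a grid of pixel rows.
# A real system would more likely use an image array type.
Image = List[List[int]]


@dataclass
class Region:
    """A detected field: a label plus a bounding box in pixel coordinates."""
    label: str
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def crop(image: Image, region: Region) -> Image:
    """Image-processing step: cut the detected region out of the page."""
    rows = image[region.top:region.bottom]
    return [row[region.left:region.right] for row in rows]


def extract_fields(
    image: Image,
    detect: Callable[[Image], List[Region]],
    recognise: Callable[[Image], str],
) -> Dict[str, str]:
    """Detect regions, crop each one, then run OCR on the crop.

    `detect` and `recognise` are placeholders (hypothetical) for an
    object-detection model and an OCR engine respectively.
    """
    fields: Dict[str, str] = {}
    for region in detect(image):
        text = recognise(crop(image, region))
        fields[region.label] = text.strip()
    return fields
```

Passing the detector and the recogniser as arguments keeps the sketch independent of any particular model or OCR library; either component could be swapped without touching the pipeline logic.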
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Kamaleson, N., Chu, D., Otero, F.E.B. (2021). Automatic Information Extraction from Electronic Documents Using Machine Learning. In: Bramer, M., Ellis, R. (eds) Artificial Intelligence XXXVIII. SGAI-AI 2021. Lecture Notes in Computer Science, vol 13101. Springer, Cham. https://doi.org/10.1007/978-3-030-91100-3_16
DOI: https://doi.org/10.1007/978-3-030-91100-3_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91099-0
Online ISBN: 978-3-030-91100-3
eBook Packages: Computer Science (R0)