Abstract
Object detection in documents is a key step to automate the structural elements identification process in a digital or scanned document through understanding the hierarchical structure and relationships between different elements. Large and complex models, while achieving high accuracy, can be computationally expensive and memory-intensive, making them impractical for deployment on resource constrained devices. Knowledge distillation allows us to create small and more efficient models that retain much of the performance of their larger counterparts. Here we present a graph-based knowledge distillation framework to correctly identify and localize the document objects in a document image. Here, we design a structured graph with nodes containing proposal-level features and edges representing the relationship between the different proposal regions. Also, to reduce text bias an adaptive node sampling strategy is designed to prune the weight distribution and put more weightage on non-text nodes. We encode the complete graph as a knowledge representation and transfer it from the teacher to the student through the proposed distillation loss by effectively capturing both local and global information concurrently. Extensive experimentation on competitive benchmarks demonstrates that the proposed framework outperforms the current state-of-the-art approaches. The code will be available at: github.com/ayanban011/GraphKD.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Aditya, S., Saha, R., Yang, Y., Baral, C.: Spatial knowledge distillation to aid visual reasoning. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 227–235 (2019)
Ahn, S., Hu, S.X., Damianou, A., Lawrence, N.D., Dai, Z.: Variational information distillation for knowledge transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9163–9171 (2019)
Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M.: Optuna: A next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631 (2019)
Asi, A., Cohen, R., Kedem, K., El-Sana, J.: Simplifying the reading of historical manuscripts. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 826–830. IEEE (2015)
Banerjee, A., Biswas, S., Lladós, J., Pal, U.: SwinDocSegmenter: An end-to-end unified domain adaptive transformer for document instance segmentation. arXiv preprint arXiv:2305.04609 (2023)
Biswas, S., Banerjee, A., Lladós, J., Pal, U.: DocSegTr: an instance-level end-to-end document image segmentation transformer. arXiv preprint arXiv:2201.11438 (2022)
Chawla, A., Yin, H., Molchanov, P., Alvarez, J.: Data-free knowledge distillation for object detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3289–3298 (2021)
Chen, D., Mei, J.P., Zhang, H., Wang, C., Feng, Y., Chen, C.: Knowledge distillation with the reused teacher classifier. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)
Chen, D., et al.: Cross-layer distillation with semantic calibration. In: Proceedings of the AAAI Conference on Artificial Intelligence (2021)
Chen, J., Lopresti, D.: Table detection in noisy off-line handwritten documents. In: 2011 International Conference on Document Analysis and Recognition, pp. 399–403. IEEE (2011)
Chen, K., Seuret, M., Hennebert, J., Ingold, R.: Convolutional neural networks for page segmentation of historical document images. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 965–970. IEEE (2017)
Chen, K., Seuret, M., Liwicki, M., Hennebert, J., Ingold, R.: Page segmentation of historical document images with convolutional autoencoders. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1011–1015. IEEE (2015)
Chen, P., Liu, S., Zhao, H., Jia, J.: Distilling knowledge via knowledge review. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021)
Chen, Y., Chen, P., Liu, S., Wang, L., Jia, J.: Deep structured instance graph for distilling object detectors. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4359–4368 (2021)
Chi, Z., et al: NormKD: normalized logits for knowledge distillation. arXiv preprint arXiv:2308.00520 (2023)
Clausner, C., Antonacopoulos, A., Pletschacher, S.: ICDAR2019 competition on recognition of documents with complex layouts-rdcl2019. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 1521–1526 (2019)
Coquenet, D., Chatelain, C., Paquet, T.: DAN: a segmentation-free document attention network for handwritten document recognition. IEEE Trans. Pattern Anal. Mach. Intell. 45, 8227–8243 (2023)
Da, C., Luo, C., Zheng, Q., Yao, C.: Vision grid transformer for document layout analysis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 19462–19472 (2023)
Dai, X., et al.: General instance distillation for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7842–7851 (2021)
De Rijk, P., Schneider, L., Cordts, M., Gavrila, D.: Structural knowledge distillation for object detection. Adv. Neural. Inf. Process. Syst. 35, 3858–3870 (2022)
Deng, Q., Ibrayim, M., Hamdulla, A., Zhang, C.: The yolo model that still excels in document layout analysis, pp. 1–10. Signal, Image and Video Processing pp (2023)
Douzon, T., Duffner, S., Garcia, C., Espinas, J.: Long-range transformer architectures for document understanding. In: Coustaty, M., Fornés, A. (eds.) International Conference on Document Analysis and Recognition, vol. 14194, pp. 47–64. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-41501-2_4
Fang, J., Gao, L., Bai, K., Qiu, R., Tao, X., Tang, Z.: A table detection method for multipage pdf documents via visual seperators and tabular structures. In: 2011 International Conference on Document Analysis and Recognition, pp. 779–783. IEEE (2011)
Fateh, A., Fateh, M., Abolghasemi, V.: Enhancing optical character recognition: efficient techniques for document layout analysis and text line detection. Eng. Rep., e12832 (2023)
Gong, L., et al.: Adaptive hierarchy-branch fusion for online knowledge distillation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 7731–7739 (2023)
Gou, J., Xiong, X., Yu, B., Du, L., Zhan, Y., Tao, D.: Multi-target knowledge distillation via student self-reflection. Int. J. Comput. Vis. 131, 1857–1874 (2023). https://doi.org/10.1007/s11263-023-01792-z
Heo, B., Lee, M., Yun, S., Choi, J.Y.: Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3779–3787 (2019)
Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F.: LayoutLMv3: pre-training for document AI with unified text and image masking. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 4083–4091 (2022)
Journet, N., Eglin, V., Ramel, J.Y., Mullot, R.: Text/graphic labelling of ancient printed documents. In: Eighth International Conference on Document Analysis and Recognition (ICDAR 2005), pp. 1010–1014. IEEE (2005)
Kang, Z., Zhang, P., Zhang, X., Sun, J., Zheng, N.: Instance-conditional knowledge distillation for object detection. Adv. Neural. Inf. Process. Syst. 34, 16468–16480 (2021)
Kise, K., Sato, A., Iwata, M.: Segmentation of page images using the area voronoi diagram. Comput. Vis. Image Underst. 70(3), 370–382 (1998)
Li, J., Xu, Y., Lv, T., Cui, L., Zhang, C., Wei, F.: DiT: self-supervised pre-training for document image transformer. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 3530–3539 (2022)
Li, K., et al.: Cross-domain document object detection: Benchmark suite and method. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12915–12924 (2020)
Li, X.-H., Yin, F., Liu, C.-L.: Page segmentation using convolutional neural network and graphical model. In: Bai, X., Karatzas, D., Lopresti, D. (eds.) DAS 2020. LNCS, vol. 12116, pp. 231–245. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-57058-3_17
Li, Z., et al.: When object detection meets knowledge distillation: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 45, 10555–10579 (2023)
Liao, H., et al.: DocTr: document transformer for structured information extraction in documents. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 19584–19594 (2023)
Lin, G.S., Tu, J.C., Lin, J.Y.: Keyword detection based on RetinaNet and transfer learning for personal information protection in document images. Appl. Sci. 11(20), 9528 (2021)
Lin, H., Han, G., Ma, J., Huang, S., Lin, X., Chang, S.F.: Supervised masked knowledge distillation for few-shot transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19649–19659 (2023)
Markewich, L., et al.: Segmentation for document layout analysis: not dead yet. Int. J. Doc. Anal. Recogn. (IJDAR) 25, 67–77 (2021). https://doi.org/10.1007/s10032-021-00391-3
Mathur, P., et al et al.: LayerDoc: layer-wise extraction of spatial hierarchical structure in visually-rich documents. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3610–3620 (2023)
Mirzadeh, S.I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., Ghasemzadeh, H.: Improved knowledge distillation via teacher assistant. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 5191–5198 (2020)
Mooney, R.J., Bunescu, R.: Mining knowledge from text using information extraction. ACM SIGKDD Explorations Newsl 7(1), 3–10 (2005)
Negrinho, R., Gormley, M., Gordon, G.J.: Learning beam search policies via imitation learning. In: Advances in Neural Information Processing Systems, vol. 31 (2018)
Oliveira, S.A., Seguin, B., Kaplan, F.: dhSegment: a generic deep-learning approach for document segmentation. In: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 7–12. IEEE (2018)
Pfitzmann, B., Auer, C., Dolfi, M., Nassar, A.S., Staar, P.: DocLayNet: a large human-annotated dataset for document-layout segmentation. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3743–3751 (2022)
Powalski, R., Borchmann, Ł, Jurkiewicz, D., Dwojak, T., Pietruszka, M., Pałka, G.: Going full-TILT boogie on document understanding with text-image-layout transformer. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12822, pp. 732–747. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86331-9_47
Rahal, N., Vögtlin, L., Ingold, R.: Layout analysis of historical document images using a light fully convolutional network. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds.) International Conference on Document Analysis and Recognition, vol. 14191, pp. 325–341. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-41734-4_20
Saabni, R., El-Sana, J.: Language-independent text lines extraction using seam carving. In: 2011 International Conference on Document Analysis and Recognition, pp. 563–568. IEEE (2011)
Saha, R., Mondal, A., Jawahar, C.: Graphical object detection in document images. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 51–58. IEEE (2019)
Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: DeepDeSRT: deep learning for detection and structure recognition of tables in document images. In: 2017 14th IAPR international conference on document analysis and recognition (ICDAR), vol. 1, pp. 1162–1167. IEEE (2017)
Shen, Z., Zhang, K., Dell, M.: A large dataset of historical Japanese documents with complex layouts. In: Proceedings of the IEEE Conference on CVPRW, pp. 548–549 (2020)
Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769 (2016)
Stanisławek, T., et al.: Kleister: key information extraction datasets involving long documents with complex layouts. In: Lladós, J., Lopresti, D., Uchida, S. (eds.) ICDAR 2021. LNCS, vol. 12821, pp. 564–579. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86549-8_36
Sun, N., Zhu, Y., Hu, X.: Faster R-CNN based table detection combining corner locating. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1314–1319. IEEE (2019)
Tang, Z., et al.: Unifying vision, text, and layout for universal document processing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19254–19264 (2023)
Wang, Y., Weng, X., Kitani, K.: Joint detection and multi-object tracking with graph neural networks. arXiv preprint arXiv:2006.13164 (2020)
Wu, A., Deng, C.: Single-domain generalized object detection in urban scene via cyclic-disentangled self-distillation. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp. 847–856 (2022)
Wu, D., Chen, P., Yu, X., Li, G., Han, Z., Jiao, J.: Spatial self-distillation for object detection with inaccurate bounding boxes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6855–6865 (2023)
Wu, X., et al.: A region-based document VQA. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 4909–4920 (2022)
Yang, H., Hsu, W.: Transformer-based approach for document layout understanding. In: 2022 IEEE International Conference on Image Processing (ICIP), pp. 4043–4047. IEEE (2022)
Yang, H., Hsu, W.H.: Vision-based layout detection from scientific literature using recurrent convolutional neural networks. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 6455–6462. IEEE (2021)
Yang, X., et al.: Learning high-precision bounding box for rotated object detection via Kullback-Leibler divergence. Adv. Neural. Inf. Process. Syst. 34, 18381–18394 (2021)
Yang, Z., Zeng, A., Li, Z., Zhang, T., Yuan, C., Li, Y.: From knowledge distillation to self-knowledge distillation: a unified approach with normalized loss and customized soft labels. arXiv preprint arXiv:2303.13005 (2023)
Zhang, L., Ma, K.: Structured knowledge distillation for accurate and efficient object detection. IEEE Trans. Pattern Anal. Mach. Intell. (2023)
Zhao, B., Cui, Q., Song, R., Qiu, Y., Liang, J.: Decoupled knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
Zheng, Z., et al.: Localization distillation for dense object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9407–9416 (2022)
Zhong, X., Tang, J., Yepes, A.J.: PubLayNet: largest dataset ever for document layout analysis. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 1015–1022 (2019)
Zhong, Z., et al.: A hybrid approach to document layout analysis for heterogeneous document images. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds.) International Conference on Document Analysis and Recognition, vol. 14191, pp. 189–206. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-41734-4_12
Acknowledgment
This work has been partially supported by the Spanish project PID2021-126808OB-I00, the Catalan project 2021 SGR 01559 and the PhD Scholarship from AGAUR (2021FIB-10010). The Computer Vision Center is part of the CERCA Program/Generalitat de Catalunya.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Banerjee, A., Biswas, S., Lladós, J., Pal, U. (2024). GraphKD: Exploring Knowledge Distillation Towards Document Object Detection with Structured Graph Creation. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14806. Springer, Cham. https://doi.org/10.1007/978-3-031-70543-4_21
Download citation
DOI: https://doi.org/10.1007/978-3-031-70543-4_21
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70542-7
Online ISBN: 978-3-031-70543-4
eBook Packages: Computer ScienceComputer Science (R0)