Webpage Text Detection Based on Improved Faster-RCNN Model

  • Conference paper
Machine Learning for Cyber Security (ML4CS 2022)

Abstract

Under the current international situation, every country attaches great importance to Internet information security, and webpage anti-tampering is one of the top priorities. Webpage tampering detection presupposes obtaining copies of webpages at different timestamps. Because website structures are diverse, a traditional crawler needs a crawling strategy tailored to each site, which makes the approach inflexible. To address this, this paper adopts a deep learning approach that detects webpage text and thereby acquires webpage information. An improved Faster-RCNN model is used to detect text in webpages, with a ResNet network extracting the text features. Because text in images is long and narrow, the square convolution kernels of the traditional network are replaced by rectangular convolution kernels that better fit this shape; because text lines are dense, the traditional NMS algorithm is replaced by the Soft-NMS algorithm to reduce missed detections in dense regions. Experiments show that the algorithm achieves better detection results, which is important for network information security.
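
The effect of replacing NMS with Soft-NMS on dense text lines can be illustrated with a small sketch. It is not the authors' implementation: it assumes the Gaussian score-decay variant of Soft-NMS, and the function names, the boxes, and the values of `sigma`, `score_thresh`, and `iou_thresh` are all hypothetical choices made for illustration.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def hard_nms(boxes, scores, iou_thresh=0.3):
    """Traditional NMS: discard any box overlapping a kept box too much."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return kept


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay overlapping scores instead of discarding.

    Returns (box index, rescored confidence) pairs in selection order.
    sigma and score_thresh are assumed values, not taken from the paper.
    """
    remaining = [(i, float(s)) for i, s in enumerate(scores)]
    kept = []
    while remaining:
        # Select the box with the highest current score.
        pos = max(range(len(remaining)), key=lambda k: remaining[k][1])
        best_idx, best_score = remaining.pop(pos)
        kept.append((best_idx, best_score))
        # Decay the scores of the other boxes according to their overlap.
        decayed = []
        for idx, s in remaining:
            overlap = iou(boxes[best_idx], boxes[idx])
            s *= math.exp(-(overlap ** 2) / sigma)
            if s >= score_thresh:
                decayed.append((idx, s))
        remaining = decayed
    return kept


# Three hypothetical text-line boxes; the first two are adjacent lines
# whose boxes overlap vertically (IoU of about 0.43).
boxes = [(0, 0, 200, 20), (0, 8, 200, 28), (0, 40, 200, 60)]
scores = [0.9, 0.8, 0.7]

hard_kept = hard_nms(boxes, scores)
soft_kept = soft_nms(boxes, scores)
soft_indices = [i for i, _ in soft_kept]
```

With these made-up boxes, traditional NMS drops the second text line because it overlaps the first, while Soft-NMS keeps it with a reduced score, which is the behaviour the abstract relies on to reduce missed detections in dense regions.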

Acknowledgements

We acknowledge funding from the sub-project of the National Key R&D Plan "COVID-19 patient rehabilitation training posture monitoring bracelet based on 4G network" (Grant No. 2021YFC0863200–6), the Hebei College and Middle School Students Science and Technology Innovation Ability Cultivation Special Project (Grant No. 22E50075D), and Grant Nos. 2021H011404 and 2021H010203.

Author information

Corresponding author

Correspondence to Jianchao Wang.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gao, J., Zhao, M., Wang, J., Zhao, Y., Ma, L., Yu, P. (2023). Webpage Text Detection Based on Improved Faster-RCNN Model. In: Xu, Y., Yan, H., Teng, H., Cai, J., Li, J. (eds) Machine Learning for Cyber Security. ML4CS 2022. Lecture Notes in Computer Science, vol 13657. Springer, Cham. https://doi.org/10.1007/978-3-031-20102-8_5

  • DOI: https://doi.org/10.1007/978-3-031-20102-8_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20101-1

  • Online ISBN: 978-3-031-20102-8

  • eBook Packages: Computer Science, Computer Science (R0)
