Abstract
Web scraping is the extraction of data from a website. Every website consists of web pages, and each page has source code containing HTML tags that represent the data. The difficulty for any scraping method is that each page is structured differently, so extracting data requires knowledge of the content and structure of the page in question. In this paper, we propose a multi-language pattern mining technique to scrape news and article websites by recognising the title, body and thumbnail image based on a content structure pattern. This approach extends our previous work: using the same method, we have added the thumbnail image as a new parameter to be extracted. The approach can be applied to several datasets; in our case, we prepared a dataset of 550 web pages to test it in both Arabic and English.
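To make the idea of recognising title, body and thumbnail from page structure more concrete, the following is a minimal sketch using only Python's standard library. It is not the method of the paper: the heuristics (first `<h1>` as title, `<p>` elements with at least `min_words` words as body, `og:image` or the first `<img>` as thumbnail) and the names `ArticleSketchParser`, `extract_article`, `min_words` and `SAMPLE` are assumptions chosen purely for illustration.

```python
from html.parser import HTMLParser


class ArticleSketchParser(HTMLParser):
    """Collects a candidate title, paragraphs and thumbnail from one page.

    Illustrative heuristics only (assumed, not taken from the paper).
    """

    def __init__(self):
        super().__init__()
        self.title = None       # text of the first <h1>
        self.thumbnail = None   # og:image content, else first <img> src
        self.paragraphs = []    # normalised text of every non-empty <p>
        self._current = None    # "h1" or "p" while inside such a tag
        self._buffer = []       # text fragments of the current tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:image":
            # An explicit Open Graph image overrides any <img> fallback.
            if attrs.get("content"):
                self.thumbnail = attrs["content"]
        elif tag == "img" and self.thumbnail is None and attrs.get("src"):
            self.thumbnail = attrs["src"]
        elif tag in ("h1", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._buffer).split())
        if tag == "h1" and self.title is None and text:
            self.title = text
        elif tag == "p" and text:
            self.paragraphs.append(text)
        self._current = None


def extract_article(html, min_words=5):
    """Return a dict with 'title', 'body' and 'thumbnail' for one HTML page."""
    parser = ArticleSketchParser()
    parser.feed(html)
    parser.close()
    # Short paragraphs are treated as navigation or captions and dropped.
    # Whitespace splitting works for both English and Arabic text.
    body = [p for p in parser.paragraphs if len(p.split()) >= min_words]
    return {
        "title": parser.title,
        "body": "\n".join(body),
        "thumbnail": parser.thumbnail,
    }


# Made-up page used only to exercise the sketch.
SAMPLE = """
<html><head>
<meta property="og:image" content="https://example.com/thumb.jpg">
</head><body>
<h1>Example headline</h1>
<p>Home</p>
<p>This is the first paragraph of a made-up news story.</p>
<img src="/inline.png">
</body></html>
"""

if __name__ == "__main__":
    print(extract_article(SAMPLE))
```

Because the heuristics rely on tag structure and whitespace-separated word counts rather than on vocabulary, the same sketch can be run on Arabic and English pages alike; a real system would need more robust patterns than these.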
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Salem, H., Mazzara, M. (2023). Multi Languages Pattern Matching-Based Scraping of News and Articles Websites. In: Barolli, L. (eds) Advanced Information Networking and Applications. AINA 2023. Lecture Notes in Networks and Systems, vol 655. Springer, Cham. https://doi.org/10.1007/978-3-031-28694-0_60
DOI: https://doi.org/10.1007/978-3-031-28694-0_60
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28693-3
Online ISBN: 978-3-031-28694-0
eBook Packages: Intelligent Technologies and Robotics (R0)