{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,8,29]],"date-time":"2024-08-29T21:53:49Z","timestamp":1724968429922},"reference-count":55,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2017,6,5]],"date-time":"2017-06-05T00:00:00Z","timestamp":1496620800000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"University of Hail, Hail, Saudi Arabia"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Inf. Syst."],"published-print":{"date-parts":[[2018,1,31]]},"abstract":"It has long been suspected that web archives and search engines favor Western and English language webpages. In this article, we quantitatively explore how well indexed and archived Arabic language webpages are as compared to those from other languages. We began by sampling 15,092 unique URIs from three different website directories: DMOZ (multilingual), Raddadi, and Star28 (the last two primarily Arabic language). Using language identification tools, we eliminated pages not in the Arabic language (e.g., English-language versions of Aljazeera pages) and culled the collection to 7,976 Arabic language webpages. We then used these 7,976 pages and crawled the live web and web archives to produce a collection of 300,646 Arabic language pages. We compared the analysis of Arabic language pages with that of English, Danish, and Korean language pages. First, for each language, we sampled unique URIs from DMOZ; then, using language identification tools, we kept only pages in the desired language. Finally, we crawled the archived and live web to collect a larger sample of pages in English, Danish, or Korean. In total for the four languages, we analyzed over 500,000 webpages. 
We discovered: (1) English has a higher archiving rate than Arabic, with 72.04% of English URIs archived. However, Arabic has a higher archiving rate than Danish and Korean, with 53.36% of Arabic URIs archived, followed by Danish and Korean with 35.89% and 32.81% archived, respectively. (2) Most Arabic and English language pages are located in the United States; only 14.84% of the Arabic URIs had an Arabic country code top-level domain (e.g., sa) and only 10.53% had a GeoIP in an Arabic country. Most Danish-language pages were located in Denmark, and most Korean-language pages were located in South Korea. (3) The presence of a webpage in a directory positively impacts indexing, and presence in the DMOZ directory, specifically, positively impacts archiving in all four languages. In this work, we show that web archives and search engines favor English pages. However, this is not universally true for all Western-language webpages: we show that Arabic webpages have a higher archival rate than Danish language webpages.","DOI":"10.1145\/3041656","type":"journal-article","created":{"date-parts":[[2017,6,7]],"date-time":"2017-06-07T12:47:23Z","timestamp":1496839643000},"page":"1-34","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages"],"prefix":"10.1145","volume":"36","author":[{"given":"Lulwah M.","family":"Alkwai","sequence":"first","affiliation":[{"name":"Old Dominion University, Norfolk, USA"}]},{"given":"Michael L.","family":"Nelson","sequence":"additional","affiliation":[{"name":"Old Dominion University, Norfolk, USA"}]},{"given":"Michele C.","family":"Weigle","sequence":"additional","affiliation":[{"name":"Old Dominion University, Norfolk, USA"}]}],"member":"320","published-online":{"date-parts":[[2017,6,5]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Central Intelligence Agency (ed.). 2015. 
The World Factbook 2014-15. Government Printing Office Washington DC."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1998076.1998100"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2910896.2925452"},{"key":"e_1_2_1_4_1","doi-asserted-by":"crossref","first-page":"2472","DOI":"10.5897\/SRE11.1708","article-title":"Estimating the size of Arabic indexed web content","volume":"7","author":"Alarifi Abdulrahman","year":"2012","journal-title":"Scientific Research and Essays"},{"key":"e_1_2_1_5_1","volume-title":"Retrieved","author":"Alkwai Lulwah M.","year":"2016"},{"key":"e_1_2_1_6_1","volume-title":"Retrieved","author":"Alkwai Lulwah M.","year":"2016"},{"key":"e_1_2_1_7_1","unstructured":"Lulwah M. Alkwai. 2017a. Arabic language web pages dataset."},{"key":"e_1_2_1_8_1","unstructured":"Lulwah M. Alkwai. 2017b. Danish language web pages dataset."},{"key":"e_1_2_1_9_1","unstructured":"Lulwah M. Alkwai. 2017c. English language web pages dataset."},{"key":"e_1_2_1_10_1","unstructured":"Lulwah M. Alkwai. 2017d. Korean language web pages dataset."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2756406.2756912"},{"key":"e_1_2_1_12_1","unstructured":"Ahmed AlSum. 2014. Web Archive Services Framework for Tighter Integration Between the Past and Present Web. Ph.D. Dissertation. 
Old Dominion University Norfolk VA."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00799-014-0118-y"},{"key":"e_1_2_1_14_1","unstructured":"Mohamed Aturban. 2016. Pro-Gaddafi Digital Newspapers Disappeared from the Live Web! Retrieved February 13 2017 from http:\/\/ws-dl.blogspot.com\/2016\/11\/2016-11-05-pro-gaddafi-digital.html. (2016)."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1239971.1239973"},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the 29th Annual Conference of the American Translators Association","volume":"47","author":"Beesley Kenneth R.","year":"1988"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.841784"},{"key":"e_1_2_1_18_1","volume-title":"Retrieved","author":"Callan Jamie","year":"2009"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1016\/S1389-1286(99)00052-3"},{"key":"e_1_2_1_20_1","unstructured":"Junghoo Cho. 2001. Crawling the Web: Discovery and Maintenance of Large-scale Web Data. Ph.D. Dissertation. Stanford University Stanford CA."},{"key":"e_1_2_1_21_1","volume-title":"5th International Web Archiving Workshop (IWAW\u201905)","author":"Christensen Niels","year":"2005"},{"key":"e_1_2_1_22_1","unstructured":"Facebook. 2016. Company Info\/Facebook Newsroom. https:\/\/web.archive.org\/web\/20161110081856\/https:\/\/newsroom.fb.com\/company-info\/. 
(2016)."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1084772.1084775"},{"key":"e_1_2_1_24_1","volume-title":"Retrieved","author":"Graves Jessie","year":"2012"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/2740769.2740827"},{"key":"e_1_2_1_26_1","unstructured":"Internet World Stats. 2009. Arabic Speaking Internet Users Statistics. https:\/\/web.archive.org\/web\/20100515122707\/http:\/\/www.internetworldstats.com\/stats19.htm. (2009)."},{"key":"e_1_2_1_27_1","unstructured":"Internet World Stats. 2015a. Arabic Speaking Internet Users Statistics. https:\/\/web.archive.org\/web\/20160229163031\/http:\/\/www.internetworldstats.com\/stats19.htm. (2015)."},{"key":"e_1_2_1_28_1","unstructured":"Internet World Stats. 2015b. Internet Users in Asia November 2015. https:\/\/web.archive.org\/web\/20160422031013\/http:\/\/www.internetworldstats.com\/stats3.htm. (2015)."},{"key":"e_1_2_1_29_1","unstructured":"Internet World Stats. 2015c. Internet World Users By Language. 
https:\/\/web.archive.org\/web\/20160424042315\/http:\/\/www.internetworldstats.com\/stats7.htm. (2015)."},{"key":"e_1_2_1_30_1","volume-title":"Retrieved","author":"Johnson Kent","year":"2010"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1038\/scientificamerican0397-82"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.bushor.2009.09.003"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0115253"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/1772690.1772751"},{"key":"e_1_2_1_36_1","volume-title":"The New Yorker. Retrieved","author":"Lepore Jill","year":"2015"},{"key":"e_1_2_1_37_1","volume-title":"Retrieved","author":"Lui Marco","year":"2011"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/1255175.1255237"},{"key":"e_1_2_1_39_1","volume-title":"Retrieved","author":"Moody Glyn","year":"2016"},{"key":"e_1_2_1_40_1","volume-title":"Retrieved","author":"Nelson Michael L.","year":"2010"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.acalib.2007.03.008"},{"key":"e_1_2_1_42_1","unstructured":"Leonard Richardson. 2013. Beautiful soup. http:\/\/www.crummy.com\/software\/BeautifulSoup."},{"key":"e_1_2_1_43_1","volume-title":"Retrieved","author":"Roberts Daniel","year":"2015"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2487788.2488121"},{"key":"e_1_2_1_45_1","volume-title":"International Conference on Theory and Practice of Digital Libraries: Research and Advanced Technology for Digital Libraries. Springer, 333--345","author":"Hany"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1045\/february2006-smith"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1045\/march2008-smith"},{"key":"e_1_2_1_48_1","volume-title":"The Telegraph. 
Retrieved","author":"Sparkes Matthew","year":"2014"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2914677"},{"key":"e_1_2_1_50_1","doi-asserted-by":"crossref","unstructured":"Mike Thelwall and Liwen Vaughan. 2004. A fair history of the web? Examining country balance in the Internet archive. Library & Information Science Research 26 2 162--176.","DOI":"10.1016\/j.lisr.2003.12.009"},{"key":"e_1_2_1_51_1","volume-title":"Retrieved","year":"2016"},{"key":"e_1_2_1_52_1","volume-title":"Retrieved","author":"de Sompel Herbert Van","year":"2013"},{"key":"e_1_2_1_53_1","volume-title":"Memento: Time Travel for the Web. Technical Report arXiv:0911.1112. Los Alamos National Laboratory","author":"de Sompel Herbert Van","year":"2009"},{"key":"e_1_2_1_54_1","volume-title":"3rd Workshop on Web Archives","author":"\u017dabi\u010dka Petr","year":"2003"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1145\/1065385.1065407"},{"key":"e_1_2_1_56_1","volume-title":"12th International Conference on Preservation of Digital Objects. 
http:\/\/hdl.handle.net\/109","author":"Zierau Eld","year":"2015"}],"container-title":["ACM Transactions on Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3041656","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,31]],"date-time":"2022-12-31T08:51:09Z","timestamp":1672476669000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3041656"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,6,5]]},"references-count":55,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2018,1,31]]}},"alternative-id":["10.1145\/3041656"],"URL":"https:\/\/doi.org\/10.1145\/3041656","relation":{},"ISSN":["1046-8188","1558-2868"],"issn-type":[{"value":"1046-8188","type":"print"},{"value":"1558-2868","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,6,5]]},"assertion":[{"value":"2016-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-01-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-06-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}