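I spent a good while studying BeautifulSoup over the past couple of days and wanted a website to practice on. I had scraped Ctrip (携程网) before, so I decided to try scraping its hotel information. I hit snags almost immediately.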

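Snag 1: calling urlopen on the entry URL I used before now returns a page with no hotel content.

Snag 2: as a workaround I used selenium to fetch the page content, but webdriver reported an error.

Snag 3: BeautifulSoup is as hard to get the hang of as ever.

Snag 4: I was somewhat confused about how to catch exceptions.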

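On snag 1: my guess is that the request lacked simulated browser headers.

On snag 2: there are plenty of fixes online, and I found mine through a Baidu search, so I will not go over it here.

On snag 3: repeated trial and error got me there.

On snag 4: the problem is under control for now, and I did not want to dig any deeper.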
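On snag 4, an AttributeError in this kind of code usually comes from `find()` returning `None` when no tag matches, so the chained `.get_text()` call fails. Below is a minimal sketch of one way to avoid wrapping every field in its own try/except. The helper name `text_or_default` and the HTML fragment are made up for illustration and are not part of the original code.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, name, class_name, default="N/A"):
    """Return the stripped text of the first matching tag, or `default`.

    Hypothetical helper: find() returns None when nothing matches,
    so we test for None instead of catching AttributeError.
    """
    tag = parent.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else default


# Made-up fragment for illustration; the real Ctrip markup may differ.
sample_html = """
<div class="hotel_item">
  <span class="hotel_level">Luxury</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
hotel = soup.find(class_="hotel_item")

hotel_level = text_or_default(hotel, "span", "hotel_level")
hotel_value = text_or_default(hotel, "span", "hotel_value")  # this tag is absent
print(hotel_level, hotel_value)
```

Checking for `None` explicitly also avoids hiding unrelated AttributeErrors, which a broad try/except around the whole expression would swallow.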

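Overall, this note only scrapes the overview information for all hotels on the current page. The detailed hotel descriptions and customer reviews are left for a follow-up post.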

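The hotel overview markup on Ctrip is nested roughly 5 to 6 tags deep, and the page as a whole roughly 7 to 8. I ran the hotel markup through an XML formatter first, which made the page structure much easier to analyze.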
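BeautifulSoup can also do this kind of formatting itself through `prettify()`. The sketch below pretty-prints a fragment and measures how deeply its tags are nested. The fragment and the helper name `max_tag_depth` are assumptions for illustration, not the real Ctrip markup or anything from the original code.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def max_tag_depth(node):
    """Count nesting levels of tags below `node`, including `node` itself."""
    child_tags = [child for child in node.children if isinstance(child, Tag)]
    return 1 + max((max_tag_depth(child) for child in child_tags), default=0)


# Made-up fragment; the real hotel entries are nested more deeply.
sample_html = (
    '<div class="hotel_item"><ul><li><span>Example Hotel</span></li></ul></div>'
)

soup = BeautifulSoup(sample_html, "html.parser")
hotel = soup.find(class_="hotel_item")

pretty = hotel.prettify()  # one tag per line, indented by nesting level
depth = max_tag_depth(hotel)
print(pretty)
print("tag depth:", depth)
```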

This is the approach my old crawler used (urllib plus BeautifulSoup). It suddenly stopped working, sob.

Code example

    import urllib.request

    from bs4 import BeautifulSoup


    def processhotelentry(url):
        # Fetch the raw HTML with a plain urlopen call (no custom headers).
        htmlscenerylist = urllib.request.urlopen(url).read()
        print(htmlscenerylist)
        # Parse the response; the 'lxml' parser must be installed.
        xmlscenerylist = BeautifulSoup(htmlscenerylist, 'lxml')
        print(xmlscenerylist)
        # Each hotel on the listing page is expected in a "hotel_item" element.
        for curhotel in xmlscenerylist.find_all(class_="hotel_item"):
            print(curhotel)


    url = 'http://hotels.ctrip.com/hotel/haikou42/p1'
    processhotelentry(url)


Output

    <html>
    <head id="ctl00_Head1">
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
    <meta content="all" name="robots"/>
    <meta content="index,follow" name="robots"/>
    <title/>
    </head>
    </html>
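The near-empty page above is what I attribute to missing request headers. Here is a minimal sketch of sending a browser-like User-Agent with urllib. The header value and the helper names `build_request` and `fetch_html` are assumptions for illustration. I have not confirmed that this alone makes Ctrip return the hotel list, since the listing may also be filled in by JavaScript.

```python
import urllib.request

# Assumed browser-like header value; any realistic User-Agent string could be used.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}


def build_request(url, headers=None):
    """Build a Request that carries explicit headers instead of urllib's defaults."""
    return urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)


def fetch_html(url):
    """Fetch the page and return the decoded body (performs a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


request = build_request("http://hotels.ctrip.com/hotel/haikou42/p1")
# urllib stores header names capitalized, hence "User-agent" here.
print(request.get_header("User-agent"))
```

Calling `fetch_html(url)` in place of the bare `urlopen(url).read()` would then send the header with each request.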


An example that combines BeautifulSoup with selenium

Code example

    from bs4 import BeautifulSoup
    from selenium import webdriver

    urllists = ['http://hotels.ctrip.com/hotel/haikou42/p1',
                'http://hotels.ctrip.com/hotel/haikou42/p2']
    # Path to the local chromedriver binary, passed positionally (Selenium 3 style).
    driver = webdriver.Chrome(r'D:\Python36\Coding\PycharmProjects\ttt\chromedriver_win32\chromedriver.exe')

    for url in urllists:
        driver.get(url)
        # page_source holds the DOM after the browser has rendered the page.
        htmlscenerylist = driver.page_source
        xmlscenerylist = BeautifulSoup(htmlscenerylist, 'lxml')
        for curhotel in xmlscenerylist.find_all(class_="hotel_item"):
            hotelicolabels = []
            speciallabels = []
            iconlistlabels = []
            # Required fields: id, list position, and name.
            hotelid = curhotel.find(attrs={"data-hotel": True})['data-hotel']
            hotelnum = curhotel.find(class_='hotel_num').get_text()
            hotelname = curhotel.find(class_='hotel_item_pic haspic', attrs={"title": True})['title']
            # Optional label groups: find() returns None when the container is
            # missing, so the chained call raises AttributeError.
            try:
                hotelicostag = curhotel.find("span", class_="hotel_ico").find_all("span", attrs={"title": True})
                for hotelico in hotelicostag:
                    hotelicolabels.append(hotelico.get('title'))
            except AttributeError:
                hotelicolabels = []
            try:
                speciallabeltag = curhotel.find("span", class_="special_label").find_all("i")
                for speciallabel in speciallabeltag:
                    speciallabels.append(speciallabel.get_text())
            except AttributeError:
                speciallabels = []
            try:
                iconlisttags = curhotel.find("div", class_="icon_list").find_all("i", attrs={"title": True})
                for iconlisttag in iconlisttags:
                    iconlistlabels.append(iconlisttag.get('title'))
            except AttributeError:
                iconlistlabels = []
            # Optional single-value fields fall back to 'N/A' when absent.
            try:
                hotelprice = curhotel.find("span", class_="J_price_lowList").get_text()
            except AttributeError:
                hotelprice = 'N/A'
            try:
                hotellevel = curhotel.find("span", class_='hotel_level').get_text()
            except AttributeError:
                hotellevel = 'N/A'
            try:
                hotelvalue = curhotel.find("span", class_='hotel_value').get_text()
            except AttributeError:
                hotelvalue = 'N/A'
            try:
                hoteltotaljudgementscore = curhotel.find("span", class_='total_judgement_score').get_text()
            except AttributeError:
                hoteltotaljudgementscore = 'N/A'
            try:
                hoteljudgement = curhotel.find("span", class_='hotel_judgement').get_text()
            except AttributeError:
                hoteljudgement = 'N/A'
            try:
                hotelrecommend = curhotel.find("span", class_='recommend').get_text()
            except AttributeError:
                hotelrecommend = 'N/A'
            print(hotelid, hotelnum, hotelname, hotelicolabels, speciallabels, iconlistlabels,
                  hotelprice, hotellevel, hotelvalue, hoteltotaljudgementscore,
                  hoteljudgement, hotelrecommend)

    # Close the browser once all pages have been processed.
    driver.quit()