Python Crawler: Ctrip Notes, Part 1

Original

baoqiangwang 2022-04-12 16:52:50 ©Copyright


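©Copyright belongs to the author: this is an original work by 51CTO blog author baoqiangwang. Contact the author for reprint authorization; unauthorized reprints may be subject to legal action.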

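I spent a good while reading about BeautifulSoup over the past couple of days and wanted a website to test myself on. I had scraped Ctrip before, so I decided to try scraping its hotel information. I ran into snags on the very first attempt.

Snag 1: calling urlopen on the entry URL I used before, the hotel content was no longer in the response.

Snag 2: I found a workaround, fetching the page content through selenium, but webdriver raised an error.

Snag 3: BeautifulSoup is as hard to get the hang of as ever.

Snag 4: I am a bit confused about how to catch exceptions.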

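On snag 1: my guess is that the request lacks simulated browser headers.

On snag 2: there are plenty of fixes online, and I found mine through Baidu as well, so I will not go over them here.

On snag 3: just keep experimenting until it works.

On snag 4: the problem is under control for now, and I do not feel like digging deeper.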
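One way to test the guess about snag 1 is to send browser-like headers with the request. The following is only a minimal sketch: the names `BROWSER_HEADERS`, `build_request`, and `fetch` and the header values are illustrative assumptions, not part of the original code. If the hotel list is filled in by JavaScript after the page loads, headers alone may still return an empty page.

```python
import urllib.request

# Hypothetical header values: any realistic browser User-Agent string would do.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "zh-CN,zh;q=0.9",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Return a urllib Request carrying browser-like headers (nothing is sent yet)."""
    return urllib.request.Request(url, headers=headers)


def fetch(url, timeout=10):
    """Send the request and return the raw response body as bytes."""
    # urlopen accepts a Request object in place of a plain URL string.
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

Calling `fetch('http://hotels.ctrip.com/hotel/haikou42/p1')` and comparing the result with the plain urlopen response would show whether headers make any difference.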

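Overall, this note only scrapes the overview information for every hotel on the current page. Hotel details and customer reviews are left for a later post.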

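The overview markup for a Ctrip hotel is about 5 or 6 tags deep, and the page as a whole is about 7 or 8 levels deep. I ran the hotel markup through an XML converter to format it, which made the page much easier to analyze.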
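The formatting step can also be done inside BeautifulSoup rather than with an external converter. A minimal sketch, assuming an invented fragment: the class names `hotel_item` and `hotel_num` come from the code in this note, but the markup itself and the helper name `pretty` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Invented fragment for illustration; the real page is far deeper.
SAMPLE = '<div class="hotel_item"><span class="hotel_num">1</span></div>'


def pretty(html):
    """Return the markup re-indented so the nesting is easy to read."""
    return BeautifulSoup(html, "html.parser").prettify()


print(pretty(SAMPLE))
```

Printing a single `hotel_item` element this way shows its nesting at a glance, which helps when working out which tags and classes to pass to `find`.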

This is the crawler code I used before (urllib and BeautifulSoup). It suddenly stopped working, sob.

Code example

import urllib.request
from bs4 import BeautifulSoup


def processhotelentry(url):
    # Fetch the raw HTML with a plain urlopen (no custom request headers).
    htmlscenerylist = urllib.request.urlopen(url).read()
    print(htmlscenerylist)
    xmlscenerylist = BeautifulSoup(htmlscenerylist, 'lxml')
    print(xmlscenerylist)
    # Each hotel on the listing page is expected in an element with class "hotel_item".
    for curhotel in xmlscenerylist.find_all(class_="hotel_item"):
        print(curhotel)


url = 'http://hotels.ctrip.com/hotel/haikou42/p1'
processhotelentry(url)

Output

<html>
<head id="ctl00_Head1">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="all" name="robots"/>
<meta content="index,follow" name="robots"/>
<title/>
</head>
</html>

An example that combines BeautifulSoup and selenium

Code example

from bs4 import BeautifulSoup
from selenium import webdriver

urllists = ['http://hotels.ctrip.com/hotel/haikou42/p1', 'http://hotels.ctrip.com/hotel/haikou42/p2']
# Path to the local chromedriver binary. Selenium 4 expects it wrapped in a Service
# object (webdriver.Chrome(service=Service(path))); the positional form works on older releases.
driver = webdriver.Chrome(r'D:\Python36\Coding\PycharmProjects\ttt\chromedriver_win32\chromedriver.exe')
for url in urllists:
    # Let the browser render the page, then hand the resulting HTML to BeautifulSoup.
    driver.get(url)
    htmlscenerylist = driver.page_source
    xmlscenerylist = BeautifulSoup(htmlscenerylist, 'lxml')
    for curhotel in xmlscenerylist.find_all(class_="hotel_item"):
        hotelicolabels = []
        speciallabels = []
        iconlistlabels = []
        # These three fields are read without a guard, so a hotel missing any of them raises.
        hotelid = curhotel.find(attrs={"data-hotel": True})['data-hotel']
        hotelnum = curhotel.find(class_='hotel_num').get_text()
        hotelname = curhotel.find(class_='hotel_item_pic haspic', attrs={"title": True})['title']
        # Optional fields: find() returns None when the tag is absent, and the chained
        # call then raises AttributeError, which is caught to fall back to a default.
        try:
            hotelicostag = curhotel.find("span", class_="hotel_ico").find_all("span", attrs={"title": True})
            for hotelico in hotelicostag:
                hotelicolabels.append(hotelico.get('title'))
        except AttributeError:
            hotelicolabels = []
        try:
            speciallabeltag = curhotel.find("span", class_="special_label").find_all("i")
            for speciallabel in speciallabeltag:
                speciallabels.append(speciallabel.get_text())
        except AttributeError:
            # The original assigned to "speciallabel" here, which left the list untouched.
            speciallabels = []
        try:
            iconlisttags = curhotel.find("div", class_="icon_list").find_all("i", attrs={"title": True})
            for iconlisttag in iconlisttags:
                iconlistlabels.append(iconlisttag.get('title'))
        except AttributeError:
            iconlistlabels = []
        try:
            hotelprice = curhotel.find("span", class_="J_price_lowList").get_text()
        except AttributeError:
            hotelprice = 'N/A'
        try:
            hotellevel = curhotel.find("span", class_='hotel_level').get_text()
        except AttributeError:
            hotellevel = 'N/A'
        try:
            hotelvalue = curhotel.find("span", class_='hotel_value').get_text()
        except AttributeError:
            hotelvalue = 'N/A'
        try:
            hoteltotaljudgementscore = curhotel.find("span", class_='total_judgement_score').get_text()
        except AttributeError:
            hoteltotaljudgementscore = 'N/A'
        try:
            hoteljudgement = curhotel.find("span", class_='hotel_judgement').get_text()
        except AttributeError:
            hoteljudgement = 'N/A'
        try:
            hotelrecommend = curhotel.find("span", class_='recommend').get_text()
        except AttributeError:
            hotelrecommend = 'N/A'
        # hotelprice is collected above but is not part of the printed summary.
        print(hotelid, hotelnum, hotelname, hotelicolabels, speciallabels, iconlistlabels, hotellevel,
              hotelvalue, hoteltotaljudgementscore, hoteljudgement, hotelrecommend)
# Close the browser once every page has been processed.
driver.quit()
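
The repeated try/except blocks above all guard the same case: `find()` returns None when the tag is absent, and calling `get_text()` on None raises AttributeError. One possible way to fold that into a single helper is sketched below. The helper name `text_or_default` and the sample fragment are assumptions made for illustration, not part of the original code.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, name, class_, default="N/A"):
    """Return the text of the first matching child tag, or default if absent."""
    tag = parent.find(name, class_=class_)
    # find() returns None when nothing matches, so test before calling get_text().
    return tag.get_text() if tag is not None else default


# Invented fragment: it has a hotel_level span but no hotel_value span.
SAMPLE = '<div class="hotel_item"><span class="hotel_level">4.5</span></div>'
curhotel = BeautifulSoup(SAMPLE, "html.parser").find(class_="hotel_item")

print(text_or_default(curhotel, "span", "hotel_level"))
print(text_or_default(curhotel, "span", "hotel_value"))
```

With a helper like this, each optional field becomes a single line, and the explicit None check makes it clearer why the AttributeError appeared in the first place.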