[Python Web Scraping] Using a Thread Pool to Scrape Data
Introduction
- Scraping inevitably blocks on network I/O, which drags efficiency down considerably, so scrapers generally use a thread pool to run requests concurrently and speed things up, as sketched below.
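A minimal sketch of the pattern before the real scraper: multiprocessing.dummy.Pool is a thread pool that exposes the same API as multiprocessing.Pool, and fetch here is just a placeholder for a blocking download, not part of the scraper that follows.

import time
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing.Pool API

def fetch(url):
    # Placeholder for a blocking call such as requests.get(url)
    time.sleep(1)
    return url

urls = ['url_0', 'url_1', 'url_2', 'url_3']
pool = Pool(4)                   # 4 worker threads
results = pool.map(fetch, urls)  # the four 1-second "downloads" overlap: ~1s total instead of ~4s
pool.close()
pool.join()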
Implementation
The full script: it walks the listing page for each picture's download link, then hands the downloads to a 5-thread pool.
import os

import requests
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API
from lxml import etree
# Spoof the User-Agent so requests look like they come from a normal browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
# Listing page to scrape
url = 'https://sc.chinaz.com/tupian/meinvtupian.html'
# Fetch the listing page and parse it so each picture's detail-page URL can be extracted
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
# Collect the container's child nodes, one per picture
li_list = tree.xpath('//*[@id="container"]/div')
# Each entry will be a dict holding one picture's name and download URL
urls = []
for li in li_list:
    img_url = 'http:' + li.xpath('./div/a/@href')[0]
    img_name = li.xpath('./div/a/@alt')[0] + '.rar'
    # requests decodes the page as iso-8859-1, so re-encode and decode as UTF-8 to recover the Chinese name
    img_name = img_name.encode('iso-8859-1').decode('utf-8')
    # Visit the detail page to find the actual download link
    detail_page_text = requests.get(url=img_url, headers=headers).text
    tree = etree.HTML(detail_page_text)
    detail_url = tree.xpath('//*[@class="downbody"]/div[3]/a[4]/@href')[0]
    # Store the picture's name and its download URL together
    dic = {
        'name': img_name,
        'url': detail_url
    }
    urls.append(dic)
def get_vi_data(dic):
    # Each worker receives one dict from urls and downloads the file it points at
    url = dic['url']
    print(dic['name'], 'downloading.....')
    data = requests.get(url=url, headers=headers).content
    # Persist the archive to disk
    with open('./resource/' + dic['name'], 'wb') as fp:
        fp.write(data)
    print(dic['name'], 'downloaded')
# Make sure the output directory exists before the workers start writing into it
os.makedirs('./resource', exist_ok=True)
# Create a thread pool with 5 worker threads
pool = Pool(5)
# get_vi_data blocks on network I/O, so mapping it over the pool runs up to 5 downloads concurrently
pool.map(get_vi_data, urls)
# Stop accepting tasks and wait for all workers to finish
pool.close()
pool.join()
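The same concurrency is also available from the standard library's concurrent.futures module; as a sketch (reusing the urls list and get_vi_data from above), the pool section could equally be written as:

from concurrent.futures import ThreadPoolExecutor

# Equivalent to Pool(5) plus map/close/join; the with-block waits for all workers on exit
with ThreadPoolExecutor(max_workers=5) as executor:
    executor.map(get_vi_data, urls)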