I have been doing a lot of scraping lately, and much of the code you find online uses the Scrapy framework. Below is a simple Scrapy scraping example (environment: Python 3.8 + PyCharm):
(1) Right-click the project directory -> Open in Terminal and run the command below to create and initialize a Scrapy project:
scrapy startproject qsbk
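For orientation, startproject generates a project skeleton roughly like the following (the exact files can vary slightly between Scrapy versions):

qsbk/
    scrapy.cfg            # deploy configuration
    qsbk/
        __init__.py
        items.py          # item definitions (step 4 below)
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines (step 6 below)
        settings.py       # project settings (step 3 below)
        spiders/
            __init__.py   # generated spiders live here (step 2)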
(2) Generate a spider named qsbk_spider, limited to the site www.lovehhy.net (genspider expects a bare domain rather than a full URL):
scrapy genspider qsbk_spider www.lovehhy.net
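This creates qsbk/spiders/qsbk_spider.py containing a stub roughly like the one below (the exact template text varies a little across Scrapy versions); step 5 replaces it with the real logic:

import scrapy

class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['www.lovehhy.net']
    start_urls = ['http://www.lovehhy.net/']

    def parse(self, response):
        pass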
(3) Configure the settings.py file:

BOT_NAME = 'qsbk'

SPIDER_MODULES = ['qsbk.spiders']
NEWSPIDER_MODULE = 'qsbk.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
}

# Uncomment this block whenever the project needs item pipelines:
ITEM_PIPELINES = {
    'qsbk.pipelines.QsbkPipeline': 300,
}
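The value 300 is the pipeline's run order: Scrapy executes enabled pipelines in ascending order of this number (conventionally 0-1000). If you later added, say, a hypothetical CleanPipeline to normalize fields before storage, you would register it with a smaller number so it runs first:

ITEM_PIPELINES = {
    'qsbk.pipelines.CleanPipeline': 200,  # hypothetical cleaning step, runs first
    'qsbk.pipelines.QsbkPipeline': 300,   # the storage pipeline from step 6
}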
(4) Configure items.py. An item works much like a JavaBean in Java web development: it is a class on which you declare a field for each piece of data you want to scrape.
import scrapy

class QsbkItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    time = scrapy.Field()
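An item instance supports dict-style access, which is standard scrapy.Item behaviour; a quick illustration:

item = QsbkItem(title='some title', time='2020-01-01 12:00')
print(item['title'])   # field access like a dict
print(dict(item))      # convert to a plain dict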
(5) Write the spider code. The parse method's job is to produce items and yield them to the pipelines, which handle storage:
import scrapy
from qsbk.items import QsbkItem

class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['www.lovehhy.net']
    start_urls = ['http://www.lovehhy.net/Joke/Detail/QSBK/1']
    baseUrl = "http://www.lovehhy.net"

    def parse(self, response):
        # Grab the title of each post and its matching timestamp
        node_title_list = response.xpath("//div[@class='post_recommend_new']/h3/a/text()").extract()
        node_time_list = response.xpath("//div[@class='post_recommend_new']/div[@class='post_recommend_time']/text()").extract()
        for title, time in zip(node_title_list, node_time_list):
            yield QsbkItem(title=title, time=time)
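Note that baseUrl is declared but never used above; presumably it was meant for pagination. If you wanted to follow the next page, you could append something like the lines below to parse. The XPath is an assumption about the site's markup ('下一页' means "next page") and would need to be verified against the actual HTML:

        # Assumed selector for the site's "next page" link
        next_href = response.xpath("//a[text()='下一页']/@href").extract_first()
        if next_href:
            yield scrapy.Request(self.baseUrl + next_href, callback=self.parse)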
(6) Write the pipeline code to store the data; here the items are written to a CSV file. The class name must match the 'qsbk.pipelines.QsbkPipeline' entry registered in ITEM_PIPELINES in step 3:
import csv

class QsbkPipeline(object):
    def __init__(self):
        # newline="" avoids blank lines between rows on Windows
        self.file = open("mmm.csv", 'w+', newline="", encoding='utf-8')
        self.writer = csv.writer(self.file)

    def open_spider(self, spider):
        print("Spider started...")

    def process_item(self, item, spider):
        self.writer.writerow([item['title'], item['time']])
        return item

    def close_spider(self, spider):
        self.file.close()
        print("Spider finished...")
Finally, let's look at what we scraped. Start the spider from the same terminal:
scrapy crawl qsbk_spider
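As an aside, for a simple dump like this you could skip the custom pipeline entirely and let Scrapy's built-in feed export write the file:

scrapy crawl qsbk_spider -o items.csv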
After the run completes, the scraped titles and timestamps are in mmm.csv in the project directory.