Two very practical Scrapy middlewares and common settings

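  • Middlewares (proxy, UA)
  • Custom proxy middleware
  • Settings for the custom proxy middleware in settings.py
  • Custom UA middleware
  • Enabling the proxy and UA middlewares
  • Commonly used settings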

Middlewares (proxy, UA)

Custom proxy middleware

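I built my own IP pool and store the proxy IPs in Redis. For each request, the middleware picks one proxy IP at random from Redis.
(I have two proxy pool environments, one production and one test. If you only have one, see "Settings for the custom proxy middleware in settings.py".)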

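import random

import redis
from scrapy import signals


class ProxyMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        # CONNECT_TYPE selects which Redis proxy pool to use: 'localhost' or 'server'.
        cls.connect_type = crawler.settings.get('CONNECT_TYPE')
        print('\033[3;31mIP pool connection mode: {}\033[0m\n\n'.format(cls.connect_type))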
        if cls.connect_type == 'localhost':
            cls.REDIS_URL = crawler.settings.get('REDIS_HOST')
            cls.REDIS_PORT = crawler.settings.get('REDIS_PORT')
            cls.REDIS_DB = crawler.settings.get('REDIS_DATABASE')
            cls.REDIS_PASSWORD = crawler.settings.get('REDIS_PASSWORD')
            cls.REDIS_QUEUE_NAME = crawler.settings.get('REDIS_QUEUE_NAME')
        elif cls.connect_type == 'server':
            cls.REDIS_URL = crawler.settings.get('SERVER_REDIS_HOST')
            cls.REDIS_PORT = crawler.settings.get('SERVER_REDIS_PORT')
            cls.REDIS_DB = crawler.settings.get('SERVER_REDIS_DATABASE')
            cls.REDIS_PASSWORD = crawler.settings.get('SERVER_REDIS_PASSWORD')
            cls.REDIS_QUEUE_NAME = crawler.settings.get('SERVER_REDIS_QUEUE_NAME')
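        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        # close_spider() is not part of the documented downloader middleware
        # interface, so the cleanup is hooked to the spider_closed signal instead.
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def spider_opened(self, spider):
        # Open the Redis connection once, when the spider starts.
        self.pika = redis.Redis(host=self.REDIS_URL, port=self.REDIS_PORT, password=self.REDIS_PASSWORD,
                                db=self.REDIS_DB,
                                decode_responses=True)
        print('============ IP pool connected ({}) ============'.format(self.connect_type))

    def spider_closed(self, spider):
        # Release the Redis connection when the spider finishes.
        self.pika.close()
        print("################################################")
        print("##############    IPpool_spider    #############")
        print("################################################")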

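    def getIP(self):
        # The proxy pool is stored as a Redis hash; hvals() returns all of its values.
        proxies_list = self.pika.hvals(self.REDIS_QUEUE_NAME)
        if proxies_list:
            ip = random.choice(proxies_list)
            proxy = 'http://{}'.format(ip)
            return proxy
        else:
            print('\033[3;31m<<< proxy pool is empty >>>\033[0m\n')
            return None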

    def process_request(self, request, spider):
        # getlist() returns [] when SPIDERNAMES is not defined, so the membership test is safe.
        spiderNames = spider.settings.getlist('SPIDERNAMES')
        if spider.name in spiderNames:
            proxies = self.getIP()
            print('Proxy obtained: {}'.format(proxies))
            request.meta['proxy'] = proxies

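CONNECT_TYPE: the connection mode of the IP pool. localhost selects the test-environment proxy pool; server selects the production proxy pool.

REDIS_URL, REDIS_PORT, REDIS_DB, REDIS_PASSWORD and REDIS_QUEUE_NAME are the Redis connection parameters. They are class attributes that from_crawler fills in from settings.py (REDIS_HOST, REDIS_PORT, REDIS_DATABASE, REDIS_PASSWORD and REDIS_QUEUE_NAME, or their SERVER_ counterparts).

SPIDERNAMES: the names of the spiders that should use a proxy (defined in settings.py). Some pages can be fetched without a proxy, so there is no need to use one for them.
(spider.settings gives access to the values defined in settings.py.)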
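The random selection step in getIP can be tried without a Redis server. The sketch below assumes the hash values are plain host:port strings, which is what the 'http://{}' formatting in the middleware suggests; build_proxy_url and the sample addresses are hypothetical names and data used only for illustration.

```python
import random


def build_proxy_url(proxy_values, rng=random):
    """Return 'http://host:port' for a random entry, or None for an empty pool.

    proxy_values stands in for the list returned by redis hvals();
    the 'host:port' value format is an assumption.
    """
    if not proxy_values:
        return None
    return 'http://{}'.format(rng.choice(proxy_values))


# Made-up addresses, standing in for the contents of the Redis hash.
pool = ['10.0.0.1:8080', '10.0.0.2:3128']
print(build_proxy_url(pool))  # one of the two entries, prefixed with http://
print(build_proxy_url([]))    # None, because the pool is empty
```

Returning None for an empty pool lets the caller decide whether to send the request without a proxy or to retry later.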

Settings for the custom proxy middleware in settings.py

To use the proxy middleware above, define the following custom parameters in settings.py:

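SPIDERNAMES = []  # names of the spiders that should use a proxy

# Redis configuration: fill in the connection parameters of your own Redis proxy pool
# Production (online) environment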
SERVER_REDIS_HOST = 'xxx.xxx.xxx.xx'
SERVER_REDIS_PORT = xxx
SERVER_REDIS_DATABASE = x
SERVER_REDIS_PASSWORD = 'xxxx'
SERVER_REDIS_QUEUE_NAME = 'xxxx'

# Local / test environment
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
REDIS_DATABASE = 0
REDIS_PASSWORD = ''
REDIS_QUEUE_NAME = 'ippool'


CONNECT_TYPE = 'server'  # which environment to use: 'server' or 'localhost'

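If you only have one environment, fill in either group of parameters and set CONNECT_TYPE to match it. For example, if your only environment is written into REDIS_HOST, REDIS_PORT, REDIS_DATABASE, REDIS_PASSWORD and REDIS_QUEUE_NAME, set CONNECT_TYPE to 'localhost'. Alternatively, change the logic that reads these parameters in ProxyMiddleware (in from_crawler).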
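One way to simplify the reading logic in from_crawler is to derive the setting names from a prefix. This is only a sketch: redis_params is a hypothetical helper, and a plain dict stands in for the Scrapy settings object (both expose a get method).

```python
def redis_params(settings, connect_type):
    """Collect the Redis connection parameters for the chosen environment."""
    # 'localhost' uses the unprefixed names, 'server' the SERVER_ ones.
    prefix = {'localhost': '', 'server': 'SERVER_'}.get(connect_type)
    if prefix is None:
        raise ValueError('unknown CONNECT_TYPE: {!r}'.format(connect_type))
    return {
        'host': settings.get(prefix + 'REDIS_HOST'),
        'port': settings.get(prefix + 'REDIS_PORT'),
        'db': settings.get(prefix + 'REDIS_DATABASE'),
        'password': settings.get(prefix + 'REDIS_PASSWORD'),
        'queue_name': settings.get(prefix + 'REDIS_QUEUE_NAME'),
    }


# A dict standing in for settings.py with a single (local) environment.
demo_settings = {
    'REDIS_HOST': '127.0.0.1',
    'REDIS_PORT': 6379,
    'REDIS_DATABASE': 0,
    'REDIS_PASSWORD': '',
    'REDIS_QUEUE_NAME': 'ippool',
}
params = redis_params(demo_settings, 'localhost')
print(params['host'], params['port'], params['queue_name'])
```

Raising on an unknown CONNECT_TYPE makes a typo in settings.py fail at startup instead of surfacing later as a missing attribute.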

Custom UA middleware

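This middleware checks whether a User-Agent has already been set manually on the request. If it has, the request is left unchanged; if not, a random User-Agent is added. The UA middleware needs no additional settings.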

import random


class UAMiddleware(object):
    user_agent_list = [
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
        "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Opera/9.80 (Windows NT 5.1; U; zh-cn) Presto/2.9.168 Version/11.50",
        "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
        "Mozilla/5.0 (Windows NT 5.2) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.122 Safari/534.30",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
        "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12"
    ]

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        # Scrapy header lookups are case-insensitive, so one check covers every spelling.
        if request.headers.get('User-Agent'):
            print('User-Agent already set, leaving it unchanged')
        else:
            request.headers['User-Agent'] = ua
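The keep-or-pick decision can be sketched outside Scrapy as follows. A plain dict stands in for request.headers here, so the header name is lowercased by hand; pick_user_agent and the sample strings are hypothetical and exist only for this illustration.

```python
import random


def pick_user_agent(headers, user_agents, rng=random):
    """Return the User-Agent already present in headers, else a random one."""
    for name, value in headers.items():
        # A plain dict is case-sensitive, so compare header names in lowercase.
        if name.lower() == 'user-agent' and value:
            return value
    return rng.choice(user_agents)


agents = ['agent-a', 'agent-b']
print(pick_user_agent({'user-agent': 'my-bot/1.0'}, agents))  # keeps my-bot/1.0
print(pick_user_agent({}, agents))                            # agent-a or agent-b
```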

Enabling the proxy and UA middlewares

In settings.py, find DOWNLOADER_MIDDLEWARES and add the two middlewares to it:

DOWNLOADER_MIDDLEWARES = {
    'zmnProject.middlewares.ZmnprojectDownloaderMiddleware': 543,
    'your_project.middlewares.UAMiddleware': 350,
    'your_project.middlewares.ProxyMiddleware': 200,
}
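The numbers are priorities: Scrapy calls process_request on downloader middlewares in ascending order of these values, and a value of None disables a middleware. The sketch below only sorts the dict to show the resulting order; request_order is a hypothetical helper, the module paths are placeholders, and Scrapy's built-in middlewares (which are merged into the same ordering) are left out.

```python
DOWNLOADER_MIDDLEWARES = {
    'your_project.middlewares.YourProjectDownloaderMiddleware': 543,
    'your_project.middlewares.UAMiddleware': 350,
    'your_project.middlewares.ProxyMiddleware': 200,
}


def request_order(middlewares):
    """List middleware paths in the order their process_request would run."""
    # Entries set to None are disabled and therefore skipped.
    enabled = {path: prio for path, prio in middlewares.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


for path in request_order(DOWNLOADER_MIDDLEWARES):
    print(path)
```

With these values the proxy middleware sees each request first, then the UA middleware, then the project's default downloader middleware.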

Commonly used settings

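DOWNLOAD_DELAY: delay, in seconds, between consecutive requests to the same site
ROBOTSTXT_OBEY: whether to obey the site's robots.txt rules
CONCURRENT_REQUESTS: maximum number of concurrent requests
CONCURRENT_ITEMS: maximum number of items processed concurrently in the item pipelines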
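As a rough illustration, these four settings might look like this in settings.py. The values are examples chosen for this sketch, not recommendations from the original post; tune them for the target site.

```python
# settings.py fragment (illustrative values only)
DOWNLOAD_DELAY = 1         # wait about one second between requests to the same site
ROBOTSTXT_OBEY = True      # respect robots.txt
CONCURRENT_REQUESTS = 16   # upper bound on requests in flight at once
CONCURRENT_ITEMS = 100     # upper bound on items processed in parallel in pipelines
```

A larger DOWNLOAD_DELAY together with a smaller CONCURRENT_REQUESTS makes the crawl gentler on the target site at the cost of speed.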