Scraping Biquge (笔趣阁) Novels with Python Multiprocessing


Python爬虫案例 2022-12-28 17:11:46 Category: Python爬虫案例


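© Copyright belongs to the author: this is an original work by the 51CTO blog author Python爬虫案例. Contact the author for permission to repost; otherwise legal liability will be pursued.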

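Over the past few days I've seen a lot of posts about fetching novels with Python crawlers, and it made me want to try it myself, so I wrote a small crawler. It parses a novel's pages and saves each chapter locally as a .txt file. This post is mainly meant to give anyone who wants to scrape novels a starting point.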

Biquge (笔趣阁) is a free online novel-reading site that carries many popular web novels and their latest chapters, so today we'll use it for practice.

As usual, start by opening the site:

http://www.biquge.info/

You can see all the books there. Open any book's link and analyze the page as follows:


chapter_names = html.xpath('//dd/a/text()')
chapter_links = html.xpath('//dd/a/@href')
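To see what these two expressions select, here is a minimal sketch on a made-up fragment. It is not the article's code: to stay dependency-free it uses the standard library's `xml.etree.ElementTree`, which only handles well-formed markup, so real pages still need a tolerant parser such as `etree.HTML` from lxml. The `SAMPLE` markup and the chapter file names are assumptions for illustration. It also resolves each href against the contents-page URL with `urljoin`.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified contents-page markup (not copied from the site).
SAMPLE = """
<div id="list">
  <dl>
    <dd><a href="5001.html">Chapter 1</a></dd>
    <dd><a href="5002.html">Chapter 2</a></dd>
  </dl>
</div>
"""

CONTENTS_URL = 'http://www.biquge.info/10_10582/'

root = ET.fromstring(SAMPLE)
# './/dd/a' selects the same <a> elements as the XPath '//dd/a'.
anchors = root.findall('.//dd/a')
# The element text corresponds to '//dd/a/text()'.
chapter_names = [a.text for a in anchors]
# The href attribute corresponds to '//dd/a/@href'; urljoin makes it absolute.
chapter_links = [urljoin(CONTENTS_URL, a.get('href')) for a in anchors]

print(list(zip(chapter_names, chapter_links)))
```

Using `urljoin` instead of plain string concatenation keeps the links correct even if a page uses absolute or root-relative hrefs.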

Next, open any chapter to look at its content:


From this, the chapter text can be extracted with:

content_list = html.xpath('//div[@id="content"]//text()')
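The `//text()` step returns every text node under the content div, so the result is a list of strings that usually still carries indentation characters. The sketch below illustrates this on a made-up fragment, again with the standard library's `ElementTree` standing in for lxml. The `SAMPLE` markup is an assumption, not the site's real HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical chapter-page fragment; &#160; is a non-breaking space (\xa0).
SAMPLE = (
    '<html><body><div id="content">'
    '&#160;&#160;&#160;&#160;First paragraph.<br/><br/>'
    '&#160;&#160;&#160;&#160;Second paragraph.<br/>'
    '</div></body></html>'
)

root = ET.fromstring(SAMPLE)
content_div = root.find(".//div[@id='content']")
# itertext() walks every text node under the div, much like '//text()' does.
raw_nodes = list(content_div.itertext())
# Remove the non-breaking-space indent and skip nodes that end up empty.
paragraphs = [t.replace('\xa0', '').strip() for t in raw_nodes]
paragraphs = [p for p in paragraphs if p]

print(paragraphs)
```

Each paragraph arrives as its own list item with a four-character `\xa0` indent, which is why the text needs a cleanup pass before it is written to disk.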

Now for the code:

import requests
from lxml import etree
from multiprocessing import Pool
import os


def Chapterspider(url):
    """Chapter spider: takes the contents-page URL and returns two lists, the chapter names and the chapter links."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    }
    content = requests.get(url, headers=headers).content
    html = etree.HTML(content)
    chapter_names = html.xpath('//dd/a/text()')
    chapter_links = html.xpath('//dd/a/@href')
    return chapter_names, chapter_links




def Chapterdownload(chapter):
    """Download one chapter to a .txt file; chapter is a (chapter_link, chapter_name) tuple."""
    url = chapter[0]
    name = chapter[1]
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
    }
    rsp = requests.get(url, headers=headers)
    content = rsp.content
    html = etree.HTML(content)
    content_list = html.xpath('//div[@id="content"]//text()')
    content_list = Remove_r(content_list, "\r")
    # Drop the last two text nodes (presumably trailing non-story text on this site).
    content_list = Formatlist(content_list)[:-2]
    dir_name = "小说2"
    # Create the output directory if needed; exist_ok avoids a race between worker processes.
    os.makedirs(dir_name, exist_ok=True)
    with open(os.path.join(dir_name, name + '.txt'), 'w', encoding='utf-8') as f:
        f.writelines(content_list)
    print(url)
    print(name, 'downloaded')




def Remove_r(items, a):
    """Return the list without the items that are exactly equal to the string a."""
    return [item for item in items if item != a]




def Formatlist(items):
    """Clean each item: turn remaining carriage returns into newlines and strip the four-nbsp paragraph indent."""
    return [
        item.replace('\r', '\n').replace('\xa0\xa0\xa0\xa0', '')
        for item in items
    ]




if __name__ == '__main__':
    # URL of the book's contents page; this is the only value you need to change.
    # Chapters are saved under the "小说2" directory, which is created automatically.
    url = 'http://www.biquge.info/10_10582/'
    chapter_names, chapter_links_before = Chapterspider(url)
    # This assumes the hrefs are relative to the contents page, so prefix them with its URL.
    chapter_links = [url + link for link in chapter_links_before]
    # Map each link to its chapter name; the dict also drops duplicate links.
    link_find_name = dict(zip(chapter_links, chapter_names))
    # Build the (link, name) tuples that Chapterdownload expects.
    canshu = list(link_find_name.items())
    # Download the chapters with a pool of 4 worker processes.
    with Pool(processes=4) as pool:
        pool.map(Chapterdownload, canshu)
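
One thing to keep in mind with `pool.map`: if any single call raises, the exception propagates to the parent and the results of the whole map are lost. A possible variation, sketched below under stated assumptions, wraps the worker so that each task reports success or failure instead. `fake_download`, `safe_download`, and the example.com links are hypothetical stand-ins; no network access happens here.

```python
from multiprocessing import Pool


def fake_download(task):
    """Stand-in for Chapterdownload: takes (link, name), does no network I/O."""
    link, name = task
    if not link.endswith('.html'):
        raise ValueError('unexpected link: ' + link)
    return len(name)


def safe_download(task):
    """Run one download and return (name, succeeded) instead of raising."""
    link, name = task
    try:
        fake_download(task)
        return name, True
    except Exception:
        # Report the failure and let the other chapters continue.
        return name, False


if __name__ == '__main__':
    # Made-up tasks; the second link is deliberately malformed.
    tasks = [
        ('http://example.com/book/1.html', 'Chapter 1'),
        ('http://example.com/book/broken', 'Chapter 2'),
        ('http://example.com/book/3.html', 'Chapter 3'),
    ]
    with Pool(processes=2) as pool:
        # map passes each tuple as a single argument and keeps input order.
        results = pool.map(safe_download, tasks)
    failed = [name for name, ok in results if not ok]
    print('failed chapters:', failed)
```

The list of failed names can then be retried or logged, rather than restarting the whole book.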