A Small Python 3 Web Scraper: A Mini Scraping Project

Reposted

技术极客 2023-07-09 13:09:15


Scraping target: the profile information listed on 美女网 (www.meinv.hk)

Packages needed for the implementation:

requests
BeautifulSoup
time
json

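Note that BeautifulSoup lives in the bs4 package, so remember to install it with pip install bs4 (the canonical package name on PyPI is beautifulsoup4). requests and the lxml parser used in the code below are third-party packages as well and need to be installed with pip.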

Analyzing the target:

Locate where the required information sits in the page

[Screenshot: the page source in the browser developer tools]

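The elements with class="content-box " are easy to spot, and they hold the information we need. Note the arrow in the lower-right corner of the screenshot: this class is used only for these entries, and the first page holds exactly 10 images. With that established, we can go on to analyze the specific data we need.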
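
To make the selection step concrete, here is a minimal sketch that picks out the `.content-box` elements from a small HTML fragment. The markup in `SAMPLE_HTML` is made up for illustration and is not copied from the real page, and the helper name `select_entries` is hypothetical. The built-in `html.parser` is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a simplified stand-in for the real page structure.
SAMPLE_HTML = """
<div class="content-box ">
  <img alt="Name A" src="http://example.com/a.jpg">
</div>
<div class="content-box ">
  <img alt="Name B" src="http://example.com/b.jpg">
</div>
<div class="sidebar">not an entry</div>
"""


def select_entries(html):
    """Return every element carrying the content-box class."""
    soup = BeautifulSoup(html, "html.parser")
    # The trailing space in class="content-box " does not matter:
    # the class attribute is split into whitespace-separated tokens.
    return soup.select(".content-box")


boxes = select_entries(SAMPLE_HTML)
print(len(boxes))
```

The CSS selector matches on the class token, so the same call should work whether or not the attribute value carries extra whitespace.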

Extract the specific fields from the located elements

[Screenshot: a single entry expanded in the developer tools, with arrows marking the relevant tags]

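Looking closer at the result of the previous step, the name and the image URL are in the alt and src attributes of the img tag.
As the arrows in the screenshot show, the name, birthday, and hometown can be found in the element with class="posts-text". That covers every field we need; what remains is working out how to get the next page.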
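
The text-splitting part of this step can be sketched in isolation. The sample string below is an assumed format, inferred from the slicing used in the code later in this article (slash-separated parts, each labelled part starting with a three-character prefix); the actual text on the site may differ. The function name `split_info` is hypothetical.

```python
def split_info(text):
    """Split a 'name / label:birthday / label:city' string into fields.

    Each labelled part is assumed to start with a three-character
    prefix (two characters plus a colon), which is sliced off.
    """
    parts = [p.strip() for p in text.split("/")]
    if len(parts) < 3:
        # Entry without the expected layout: fall back to empty fields.
        return {"birthday": "", "city": ""}
    return {"birthday": parts[1][3:], "city": parts[2][3:]}


# Hypothetical sample text, not taken from the real page.
sample = "Name A / 生日：1990-01-01 / 故乡：Seoul"
print(split_info(sample))
print(split_info("Name B"))
```

Checking the number of parts up front plays the same role as the try/except in the full script: entries that lack the expected layout end up with empty fields instead of raising an error.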

[Screenshot: the POST request for the next page and its form data in the developer tools]

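Because the next page is loaded by a button rather than a link, we have to inspect the request it triggers. The screenshot shows a POST request carrying form data, and the paged field marked by the arrow is the page number. So all we need to do is send the same POST request and read the response.
Note: there is a pitfall in this response. The HTML cannot be read from it directly and needs further processing. The details are explained in the code comments.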
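
The pitfall can be illustrated without touching the network. The sketch below assumes the response body is a JSON object preceded by some stray characters, with the HTML stored under the `postlist` key; the sample body is fabricated, and a byte order mark is used as the junk character purely as an example. Instead of slicing off a fixed number of characters, it searches for the first `{`, which is an alternative approach and not the one used in the article's code. The function name `extract_postlist` is hypothetical.

```python
import json


def extract_postlist(body):
    """Return the HTML stored under 'postlist' in a JSON body that
    may be preceded by stray characters."""
    start = body.find("{")
    if start == -1:
        raise ValueError("no JSON object found in response body")
    return json.loads(body[start:])["postlist"]


# Hypothetical response body: one junk character, then the JSON object.
sample_body = '\ufeff{"postlist": "<div class=\\"content-box \\">...</div>"}'
print(extract_postlist(sample_body))
```

Searching for the opening brace keeps working if the number of leading junk characters changes, whereas a fixed slice would then break.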

Implementation:

import json
import time

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3823.400 QQBrowser/10.7.4307.400',
    'Cookie': 'aQQ_ajkguid=B4D4C2CC-2F46-D252-59D7-83356256A4DC; id58=e87rkGBclxRq9+GOJC4CAg==; _ga=GA1.2.2103255298.1616680725; 58tj_uuid=4b56b6bf-99a3-4dd5-83cf-4db8f2093fcd; wmda_uuid=0f89f6f294d0f974a4e7400c1095354c; wmda_new_uuid=1; wmda_visited_projects=%3B6289197098934; als=0; cmctid=102; ctid=15; sessid=E454865C-BA2D-040D-1158-5E1357DA84BA; twe=2; isp=true; _gid=GA1.2.1192525458.1617078804; new_uv=4; obtain_by=2; xxzl_cid=184e09dc30c74089a533faf230f39099; xzuid=7763438f-82bc-4565-9fe8-c7a4e036c3ee'
}


def get(url):  # Fetch the first page
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        parse(resp.text)
    else:
        print('Page request failed!')


def parse(html):  # Parse the HTML and extract the fields
    root = BeautifulSoup(html, 'lxml')
    content_boxs = root.select('.content-box')
    for content_box in content_boxs:  # Walk each content-box element to collect the fields
        items = {}  # A fresh dict per entry, so earlier records are never overwritten
        img = content_box.find('img')
        items['name'] = img.attrs.get('alt')
        infos = content_box.select('.posts-text')[0].get_text()  # select() returns a list; take its first element and read the text
        try:
            items['birthday'] = infos.split('/')[1].strip()[3:]  # Split the string and slice it to get the birthday
            items['city'] = infos.split('/')[2].strip()[3:]
        except IndexError:  # The text has fewer parts than expected
            items['birthday'] = ''
            items['city'] = ''
        items['cover'] = img.attrs.get('src')
        itempipeline(items)  # Hand the record to itempipeline() for processing
    # Load the next page
    get_next_page('http://www.meinv.hk/wp-admin/admin-ajax.php')


def itempipeline(item):  # No processing here; just print the record
    print(item)


page = 2


def get_next_page(url):
    global page
    if page > 4:  # 34 entries in total (shown on the site) at 10 per page, so page 4 is the last
        print('All pages fetched!')
        return
    time.sleep(1)  # Pause for one second so frequent requests do not trigger rate limiting
    resp = requests.post(url, data={
        "total": 4,
        "action": "fa_load_postlist",
        "paged": page,  # The page number, as found in the analysis above
        "tag": "%e9%9f%a9%e5%9b%bd%e7%be%8e%e5%a5%b3",
        "wowDelay": "0.3s"
    }, headers=headers)
    page += 1
    resp.encoding = 'utf-8'  # Decode the body as UTF-8
    if resp.status_code == 200:
        # This is the pitfall!
        # The body is a JSON object, but stray leading content precedes it and must be sliced off.
        # After slicing, the HTML is stored under the 'postlist' key.
        rec = json.loads(resp.text[1:])['postlist']
        parse(rec)  # Pass the HTML to parse()
    else:
        print('Failed to fetch the next page!')


if __name__ == '__main__':
    get('http://www.meinv.hk/?tag=%e9%9f%a9%e5%9b%bd%e7%be%8e%e5%a5%b3')
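
Since itempipeline() only prints each record, a natural next step is to persist the records. The following is one possible sketch and not part of the original code: it appends each item to a JSON Lines file and reads the file back. The function names `save_item` and `load_items` and the file name `items.jsonl` are assumptions chosen for illustration, and the record is made up.

```python
import json
import os
import tempfile


def save_item(item, path):
    """Append one scraped record to a JSON Lines file."""
    with open(path, "a", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII names readable in the file.
        f.write(json.dumps(item, ensure_ascii=False) + "\n")


def load_items(path):
    """Read all records back, one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Usage with a throwaway file and a made-up record.
demo_path = os.path.join(tempfile.mkdtemp(), "items.jsonl")
save_item({"name": "Name A", "birthday": "1990-01-01",
           "city": "Seoul", "cover": "http://example.com/a.jpg"}, demo_path)
print(load_items(demo_path))
```

Writing one JSON object per line means each record can be appended as soon as it is scraped, without holding the whole result set in memory.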