Collecting résumés with Python: scraping personal résumés

Reposted

mob64ca1400bfa8 2023-12-20 09:11:29


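This was my first time scraping data and I hit plenty of snags. After working through the experience posts of many more seasoned people, I finally got the task done. I am writing it down here before my forgetful brain loses it!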

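(I) The task

Scrape the first page of results for "Shanghai" + "web front end" + "fresh graduate" on the BOSS Zhipin website.
Technical approach: selenium to obtain dynamic cookies + BeautifulSoup for information extraction + csv file reading and writing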
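
The search described in the task corresponds to the URL that the script opens later in this article. As a rough sketch, the query part can be built with the standard library instead of being pasted from the address bar. Reading `c101020100` as the code for Shanghai and `e_102` as the fresh-graduate filter is an assumption based on the task description, and `build_search_url` is a hypothetical helper name.

```python
from urllib.parse import quote

# Assumption: these path segments are the site's codes for "Shanghai" and
# "fresh graduate"; they are copied from the URL used later in this article.
CITY_CODE = "c101020100"
EXPERIENCE_CODE = "e_102"

def build_search_url(keyword, page=1):
    """Hypothetical helper: build the listing URL for one result page."""
    # quote() percent-encodes the non-ASCII characters in the keyword.
    encoded = quote(keyword)
    return (
        f"https://www.zhipin.com/{CITY_CODE}/{EXPERIENCE_CODE}/"
        f"?query={encoded}&page={page}&ka=page-{page}"
    )

url = build_search_url("web前端")
print(url)
```

Changing the `page` argument would give the URL for another result page, although this article only scrapes the first one.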

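(II) My bumpy journey

As a beginner at Python scraping, all I remembered when I started was the requests library I had just learned (used to fetch HTML pages and submit network requests automatically), so I wrote: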

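import requests

def getHTMLText(url):
    # Send browser-like headers; the site may reject the default python-requests User-Agent.
    kv = {"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
          "cookie":"my cookie omitted here"}
    try:
        r = requests.get(url,headers = kv,timeout = 30)
        r.raise_for_status()  # raise an exception for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the response body
        return r.text
    except requests.RequestException:
        # Network error or bad status code: return an empty page
        return ""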

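Here the user-agent and cookie key-value pairs are very important.

(1) Why use user-agent?

Many websites restrict web crawlers, and they do so in essentially two ways:

Through the robots protocol (robots.txt), which tells crawlers what they may and may not access
By inspecting the HTTP request headers to judge whether a request comes from a crawler (sites normally expect HTTP requests generated by browsers and are free to reject requests from crawlers)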
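
The first mechanism can be checked from Python before scraping. Below is a minimal sketch using the standard library's `urllib.robotparser`; the robots.txt rules, the `my-crawler` agent name, and the example.com URLs are made up for illustration, and `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules for illustration; a real site publishes its own
# file at https://<host>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

allowed_public = is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/jobs/")
allowed_private = is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/private/data")
print(allowed_public, allowed_private)
```

For a live site, `RobotFileParser` can also load the file itself through `set_url()` and `read()`; parsing a string as done here keeps the example offline.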

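By default requests sends a User-Agent that begins with python-requests, which plainly identifies the request as coming from a crawler, so the site we request rejects it. The fix is to modify the headers field so that our program presents itself as a browser.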

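(2) Why use a cookie?

Later I ran into this problem: the page code fetched through requests was different from the source I saw by right-clicking and viewing the page source. After going through a lot of material, I found that the usual advice was to add a cookie.

How do you copy the cookie and user-agent?

On the page you want to scrape, right-click and choose "Inspect". In the panel that opens, select Network and refresh the page; the panel fills with requests as they are made. Select the first entry in the Name column on the left, open Headers, and the cookie and user-agent values are listed further down, as shown in the figure: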

[Figure: the Headers tab in the browser's Network panel, showing the cookie and user-agent values]

Everything covered in pink is cookie content that needs to be copied. (Be careful not to leak your own cookie.)
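
Besides pasting the whole string into the headers, requests also accepts cookies as a dict through its `cookies=` parameter. The following is a small sketch of splitting a copied Cookie header into such a dict; `cookie_header_to_dict` is a hypothetical helper and the cookie names and values are made up.

```python
def cookie_header_to_dict(raw_cookie):
    """Split a copied 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for pair in raw_cookie.split(";"):
        # partition() splits on the first "=" only, so values that contain
        # "=" themselves stay intact.
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies

# Made-up cookie string; paste your own value here and keep it private.
raw = "sessionid=abc123; token=x=y; theme=dark"
cookies = cookie_header_to_dict(raw)
print(cookies)
```

The resulting dict could then be passed along as `requests.get(url, headers=kv, cookies=cookies, timeout=30)`.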

Next comes parsing the page and extracting the information:

from bs4 import BeautifulSoup

def fillPostList(postlist,html):
    # Parse the listing page and append one row per job card to postlist.
    soup = BeautifulSoup(html,"html.parser")
    job_all = soup.find_all('div', {"class": "job-primary"})
    for job in job_all:
        try:
            position = job.find('span', {"class": "job-name"}).get_text()
            address = job.find('span', {'class': "job-area"}).get_text()
            company = job.find('div', {'class': 'company-text'}).find('h3', {'class': "name"}).get_text()
            salary = job.find('span', {'class': 'red'}).get_text()
            # This <p> holds the experience followed by a two-character diploma.
            limit = job.find('div', {'class': 'job-limit clearfix'}).find('p').get_text()
            diploma = limit[-2:]
            experience = limit[:-2]
            labels = job.find('a', {'class': 'false-link'}).get_text()
        except AttributeError:
            # find() returned None because an expected tag is missing; skip this card.
            continue
        postlist.append([position,address,company,salary,diploma,experience,labels])
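
The `[-2:]` and `[:-2]` slices in the code above assume that the diploma is always the last two characters of the text. As a hedged alternative, the split could match known diploma labels as suffixes and fall back to the two-character rule otherwise. The label list and the sample strings below are illustrative assumptions, not values taken from the site, and `split_experience_diploma` is a hypothetical helper.

```python
# Assumption: a few diploma labels that might appear at the end of the text;
# this list is illustrative, not taken from the site.
KNOWN_DIPLOMAS = ("学历不限", "本科", "大专", "硕士", "博士")

def split_experience_diploma(text):
    """Split e.g. '1-3年本科' into ('1-3年', '本科')."""
    text = text.strip()
    for diploma in KNOWN_DIPLOMAS:
        if text.endswith(diploma):
            return text[: -len(diploma)], diploma
    # Fall back to the article's rule: the last two characters are the diploma.
    return text[:-2], text[-2:]

short_label = split_experience_diploma("1-3年本科")
long_label = split_experience_diploma("应届生学历不限")
print(short_label, long_label)
```

With plain two-character slicing, the second sample would be cut into "应届生学历" and "不限", which is the case the suffix list is meant to catch.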

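I stored all the scraped data in the postlist list, but when I printed it at the end I got an error: IndexError: list index out of range. With help from more experienced people, I learned that the error occurred because postlist was empty! After all that scraping, where had my data gone? I was baffled.

With more help, I found that the cookie content changes every time the page is refreshed. To keep fetching page data I would have to copy and paste a new cookie again and again. Solving this calls for selenium, which obtains the cookie dynamically.

Now the cookie can be obtained dynamically.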

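from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import csv

def main():
    jobinfo = []
    # Selenium 4 takes the chromedriver path through a Service object;
    # Selenium 3 accepted the path as the first argument.
    driver = webdriver.Chrome(service=Service(r'C:\Users\1\PycharmProjects\pachong01\chromedriver.exe'))
    url = "https://www.zhipin.com/c101020100/e_102/?query=web%E5%89%8D%E7%AB%AF&page=1&ka=page-1"
    try:
        driver.get(url)
        # A real browser loads the page, so it carries its own fresh cookie
        html = driver.page_source
    finally:
        driver.quit()  # close the browser even if loading fails
    fillPostList(jobinfo,html)
    # Write the rows in jobinfo to a csv file
    headers = ["Position","Location","Company","Salary","Education","Experience","Industry tag"]
    # utf-8-sig keeps the Chinese text readable when the file is opened in Excel
    with open('job.csv','w',newline = '',encoding = 'utf-8-sig') as f:
        f_csv = csv.writer(f)
        f_csv.writerow(headers)
        f_csv.writerows(jobinfo)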
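
The script above never reads the cookie explicitly: the browser handles it and the script only takes `page_source`. If the requests-based `getHTMLText` is preferred, one possible approach is to copy the cookies out of the browser session, since selenium's `driver.get_cookies()` returns a list of dicts that each carry at least `name` and `value`. The sketch below joins such a list into a Cookie header value. `cookies_to_header` is a hypothetical helper, the sample data is made up, and whether the site accepts cookies reused this way is not verified here.

```python
def cookies_to_header(cookie_list):
    """Join selenium-style cookie dicts into one Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookie_list)

# Made-up sample in the shape returned by driver.get_cookies():
# a list of dicts that each carry at least 'name' and 'value'.
sample_cookies = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com"},
    {"name": "theme", "value": "dark", "domain": ".example.com"},
]
cookie_header = cookies_to_header(sample_cookies)
print(cookie_header)
```

In a real run the call would be `cookies_to_header(driver.get_cookies())`, made after `driver.get(url)` and before `driver.quit()`.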

Finally, the scraped data can be found in the project directory, saved as a csv file, as shown:

[Figure: job.csv in the project directory]

Happily open job.csv: the moment of truth!

[Figure: contents of job.csv]
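
The output can also be checked without a spreadsheet program. The sketch below writes a header row and two made-up data rows with `csv.writer`, then reads them back with `csv.reader`. It uses an in-memory `io.StringIO` buffer in place of job.csv so that it runs anywhere; the column names and rows are illustrative only.

```python
import csv
import io

headers = ["position", "salary"]
rows = [["web前端", "8-10K"], ["前端开发", "6-8K"]]  # made-up rows

# newline="" leaves line endings to the csv module, the same reason the
# article passes newline='' to open() when writing job.csv.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerows(rows)

# Read the text back to check that the rows survive the round trip.
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back)
```

For the real file, the same `csv.reader` loop would run over `open('job.csv', newline='')` with the encoding that was used for writing.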

If you spot any problems in this article, please point them out so we can improve together. Let's go!

Finally, here is the complete code:

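from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import csv

def fillPostList(postlist,html):
    # Parse the listing page and append one row per job card to postlist.
    soup = BeautifulSoup(html,"html.parser")
    job_all = soup.find_all('div', {"class": "job-primary"})
    for job in job_all:
        try:
            position = job.find('span', {"class": "job-name"}).get_text()
            address = job.find('span', {'class': "job-area"}).get_text()
            company = job.find('div', {'class': 'company-text'}).find('h3', {'class': "name"}).get_text()
            salary = job.find('span', {'class': 'red'}).get_text()
            # This <p> holds the experience followed by a two-character diploma.
            limit = job.find('div', {'class': 'job-limit clearfix'}).find('p').get_text()
            diploma = limit[-2:]
            experience = limit[:-2]
            labels = job.find('a', {'class': 'false-link'}).get_text()
        except AttributeError:
            # find() returned None because an expected tag is missing; skip this card.
            continue
        postlist.append([position,address,company,salary,diploma,experience,labels])

def main():
    jobinfo = []
    # Selenium 4 takes the chromedriver path through a Service object;
    # Selenium 3 accepted the path as the first argument.
    driver = webdriver.Chrome(service=Service(r'C:\Users\1\PycharmProjects\pachong01\chromedriver.exe'))
    url = "https://www.zhipin.com/c101020100/e_102/?query=web%E5%89%8D%E7%AB%AF&page=1&ka=page-1"
    try:
        driver.get(url)
        # A real browser loads the page, so it carries its own fresh cookie
        html = driver.page_source
    finally:
        driver.quit()  # close the browser even if loading fails
    fillPostList(jobinfo,html)
    # Write the rows in jobinfo to a csv file
    headers = ["Position","Location","Company","Salary","Education","Experience","Industry tag"]
    # utf-8-sig keeps the Chinese text readable when the file is opened in Excel
    with open('job.csv','w',newline = '',encoding = 'utf-8-sig') as f:
        f_csv = csv.writer(f)
        f_csv.writerow(headers)
        f_csv.writerows(jobinfo)

if __name__ == "__main__":
    main()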
