The code I shared in Part 1 could scrape only a single Zhihu user. It is not complicated, but it covers the most essential Python concepts.


haili: Teaching Yourself Web Scraping from Scratch (1): Fetching Basic Data for a Single Zhihu User, with Python Source Code (zhuanlan.zhihu.com)



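The code I shared in Part 2 could scrape N Zhihu users in one run. It was a simple upgrade: wrapping the logic in a function, storing results in a nested dictionary, and skipping exceptions.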


haili: Teaching Yourself Web Scraping from Scratch (2): Fetching Basic Data for the Top 50 Zhihu Users by Follower Count, with Python Source Code (zhuanlan.zhihu.com)



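However, the output is a nested dictionary, which is not convenient to read. We usually prefer to store and browse data in a spreadsheet file.

For cleaning, organizing, and analyzing data, I recommend pandas. It is simple: the three short lines of code below save rlts, the result produced in Part 2, to a CSV file.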



import pandas as pd
df = pd.DataFrame(rlts).T  # transpose so that each user becomes one row
df.to_csv("./zhihu_top50.csv")



Opened, the file looks like this:




[Image: the CSV file opened as a table]
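One practical detail when the column names are Chinese: some spreadsheet programs assume a legacy encoding for CSV files and show garbled headers. A minimal sketch, assuming that is the problem you run into, is to write the file with a UTF-8 byte order mark via `encoding="utf-8-sig"`. The sample row and the temporary file path below are placeholders, not data from the scraper.

```python
import os
import tempfile

import pandas as pd

# Placeholder data with Chinese column names, shaped like the scraper's output.
demo = pd.DataFrame({"用户": ["https://example.com/people/a"], "关注者": ["100"]})

path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# "utf-8-sig" prepends a byte order mark, which Excel uses to detect UTF-8.
demo.to_csv(path, encoding="utf-8-sig")

# Reading back with the same encoding strips the mark again.
restored = pd.read_csv(path, encoding="utf-8-sig", index_col=0)
print(restored)
```

If the file is only ever read back with pandas, the default encoding works just as well.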


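Some cells are empty because the Part 2 code hit exceptions while it was running. When a scraper collects a large amount of data, it will always run into some exceptions. Sometimes an exception can be ignored and the missing data simply dropped; sometimes it cannot, and we have to find another way to keep scraping.

The pandas DataFrame makes this kind of filtering and tidying very convenient. For example, for which users did the follower count (the 关注者 column) fail to scrape? The code is: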
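To see why a failed scrape shows up as an empty cell, here is a minimal sketch with made-up records (the URLs and numbers are placeholders, not real data). An inner dictionary that lacks a key simply yields NaN in that column once the nested dictionary is turned into a DataFrame.

```python
import pandas as pd

# Hypothetical sample in the same shape as rlts: {row number: {column: value}}.
# All values are invented for illustration only.
sample_rlts = {
    1: {"用户": "https://example.com/people/a", "回答": "10", "关注者": "100"},
    2: {"用户": "https://example.com/people/b", "关注者": "200"},  # "回答" missing
    3: {"用户": "https://example.com/people/c", "回答": "30"},  # "关注者" missing
}

# Outer keys become columns, so transpose to get one row per user.
sample_df = pd.DataFrame(sample_rlts).T

# Keys that were never set come out as NaN; isnull() marks them True.
missing_mask = sample_df.isnull()
print(missing_mask)
```

`pd.DataFrame.from_dict(sample_rlts, orient="index")` should give the same one-row-per-user layout without the transpose.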


error_data = df[df["关注者"].isnull()]["用户"]


The result is:


20    https://www.zhihu.com/people/excited-vczh
29           https://www.zhihu.com/people/magie
33        https://www.zhihu.com/people/cai-tong
41       https://www.zhihu.com/people/nordenbox
Name: 用户, dtype: object


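And for which users did the content statistics (for example the 回答 column, the number of answers) fail to scrape? The code is: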


error_data2 = df[df["回答"].isnull()]["用户"]


The result is:


2      https://www.zhihu.com/people/zhi-hu-ri-bao-51-41
4      https://www.zhihu.com/people/ding-xiang-yi-sheng
6             https://www.zhihu.com/people/zhi-ke-ji-13
7           https://www.zhihu.com/people/knowyourself-1
23            https://www.zhihu.com/people/xia-chu-fang
35      https://www.zhihu.com/people/qiong-you-jin-nang
39                 https://www.zhihu.com/people/lens-27
43    https://www.zhihu.com/people/zhen-shi-gu-shi-j...
48    https://www.zhihu.com/people/zhong-guo-ke-pu-b...
Name: 用户, dtype: object
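When the missing data cannot be ignored, the two checks above can be combined into a single retry list. The sketch below runs on a small invented DataFrame; `df_demo`, `required`, and `retry_urls` are names introduced here for illustration only.

```python
import pandas as pd

# Invented sample rows; None stands in for a field that was not scraped.
df_demo = pd.DataFrame(
    {
        "用户": [
            "https://example.com/people/a",
            "https://example.com/people/b",
            "https://example.com/people/c",
        ],
        "回答": ["10", None, "30"],
        "关注者": ["100", "200", None],
    }
)

# A row needs a retry if any of the required columns is empty.
required = ["回答", "关注者"]
needs_retry = df_demo[required].isnull().any(axis=1)
retry_urls = df_demo.loc[needs_retry, "用户"].tolist()
print(retry_urls)
```

The resulting list can be fed back into the scraping loop in place of the full URL list.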


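Before scraping at scale, test on a small sample. First, check whether the code is robust enough: fragile code is easily interrupted by an exception, and the results already collected may not have been saved, so you have to start again from zero. Second, check whether the code covers every case accurately, so that it captures as much data as possible.

After testing, the Part 2 code can be adjusted as follows: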

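
"""Use Python Selenium with a headless browser to fetch basic profile data for the Zhihu top 50 users by follower count."""

from time import sleep
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=chrome_options)

def get_one_info(driver, url):
    """Return a dict of basic profile numbers for one Zhihu user page."""
    driver.get(url)
    sleep(1)  # give the page a moment to render
    # find_elements_by_class_name was removed in Selenium 4.3; use By.CLASS_NAME.
    rlts = driver.find_elements(By.CLASS_NAME, "Tabs-meta")
    nums1 = [rlt.text for rlt in rlts]
    rlts = driver.find_elements(By.CLASS_NAME, "NumberBoard-itemValue")
    nums2 = [rlt.text for rlt in rlts]

    rlt = {}
    rlt["用户"] = url  # user
    if len(nums1) >= 7:
        # Take the last seven tab counters, indexed from the end.
        rlt["回答"] = nums1[-7]  # answers
        rlt["视频"] = nums1[-6]  # videos
        rlt["提问"] = nums1[-5]  # questions
        rlt["文章"] = nums1[-4]  # articles
        rlt["专栏"] = nums1[-3]  # columns
        rlt["想法"] = nums1[-2]  # ideas
        rlt["收藏"] = nums1[-1]  # collections
    else:
        print(url, "nums1 has fewer than 7 items", nums1)
    if len(nums2) == 2:
        rlt["关注了"] = nums2[0]  # following
        rlt["关注者"] = nums2[1]  # followers
    else:
        print(url, "unexpected nums2", nums2)
    rlt["日期"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # date
    return rlt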

urls = [
    'https://www.zhihu.com/people/haili-9-70/',
    "https://www.zhihu.com/people/zhi-hu-ri-bao-51-41",
    "https://www.zhihu.com/people/liu-kan-shan-78",
    "https://www.zhihu.com/people/ding-xiang-yi-sheng",
    "https://www.zhihu.com/people/zhang-jia-wei",
    "https://www.zhihu.com/people/zhi-ke-ji-13",
    "https://www.zhihu.com/people/knowyourself-1",
    "https://www.zhihu.com/people/kaifulee",
    "https://www.zhihu.com/people/zhouyuan",
    "https://www.zhihu.com/people/zhang-xiao-bei",
    "https://www.zhihu.com/people/warfalcon",
    "https://www.zhihu.com/people/lisanshui1230",
    "https://www.zhihu.com/people/tian-ji-shun",
    "https://www.zhihu.com/people/jixin",
    "https://www.zhihu.com/people/ma-bo-yong",
    "https://www.zhihu.com/people/sizhuren",
    "https://www.zhihu.com/people/imike",
    "https://www.zhihu.com/people/raymond-wang",
    "https://www.zhihu.com/people/ChenZhangyu",
    "https://www.zhihu.com/people/excited-vczh",
    "https://www.zhihu.com/people/zhu-xuan-86",
    "https://www.zhihu.com/people/lisongwei",
    "https://www.zhihu.com/people/xia-chu-fang",
    "https://www.zhihu.com/people/dong-ji-zai-hang-zhou",
    "https://www.zhihu.com/people/gejinyuban",
    "https://www.zhihu.com/people/guo-zi-501",
    "https://www.zhihu.com/people/gao-ke-69",
    "https://www.zhihu.com/people/chenqin",
    "https://www.zhihu.com/people/magie",
    "https://www.zhihu.com/people/chenbailing",
    "https://www.zhihu.com/people/wang-ni-ma-94",
    "https://www.zhihu.com/people/thejennyyy",
    "https://www.zhihu.com/people/cai-tong",
    "https://www.zhihu.com/people/zhou-xiao-nong",
    "https://www.zhihu.com/people/qiong-you-jin-nang",
    "https://www.zhihu.com/people/mali",
    "https://www.zhihu.com/people/bo-cai-28-7",
    "https://www.zhihu.com/people/cheng-yi-nan",
    "https://www.zhihu.com/people/lens-27",
    "https://www.zhihu.com/people/commando",
    "https://www.zhihu.com/people/nordenbox",
    "https://www.zhihu.com/people/binka",
    "https://www.zhihu.com/people/zhen-shi-gu-shi-ji-hua",
    "https://www.zhihu.com/people/he-ming-ke",
    "https://www.zhihu.com/people/ccat",
    "https://www.zhihu.com/people/talich",
    "https://www.zhihu.com/people/feifeimao",
    "https://www.zhihu.com/people/zhong-guo-ke-pu-bo-lan",
    "https://www.zhihu.com/people/pan-fan-65",
    "https://www.zhihu.com/people/gong-qing-tuan-zhong-yang-67",
    "https://www.zhihu.com/people/divinites"
]

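# Scrape every URL in turn; results are keyed by a running number starting at 1.
rlts = {}
number = 0
for url in urls:
    rlt = get_one_info(driver,url)
    number += 1
    rlts[number] = rlt

driver.quit()  # close the headless browser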
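The loop above still stops at the first unhandled exception, for example a page load error, and everything collected so far lives only in memory. A minimal sketch of a more defensive loop follows. It assumes a hypothetical helper `crawl_all` that takes the fetch function as a parameter, records failures instead of raising, and writes a checkpoint CSV every few URLs. The names `crawl_all`, `save_every`, `fake_fetch`, and the checkpoint path are illustrative choices, not part of the original code.

```python
import os
import tempfile

import pandas as pd


def crawl_all(urls, fetch, checkpoint_path, save_every=10):
    """Call fetch(url) for each URL, tolerating errors and saving as we go."""
    results = {}
    for number, url in enumerate(urls, start=1):
        try:
            results[number] = fetch(url)
        except Exception as exc:  # keep going, but record what failed
            print(url, "failed:", exc)
            results[number] = {"用户": url, "error": repr(exc)}
        # Write a checkpoint periodically and after the last URL.
        if number % save_every == 0 or number == len(urls):
            pd.DataFrame(results).T.to_csv(checkpoint_path, encoding="utf-8-sig")
    return results


# Stand-in for get_one_info so that the sketch runs without a browser.
def fake_fetch(url):
    if url.endswith("/bad"):
        raise RuntimeError("simulated page error")
    return {"用户": url, "关注者": "1"}


demo_urls = ["https://example.com/people/a", "https://example.com/people/bad"]
demo_path = os.path.join(tempfile.mkdtemp(), "checkpoint.csv")
demo_results = crawl_all(demo_urls, fake_fetch, demo_path, save_every=1)
```

With the real scraper, the call could look like `crawl_all(urls, lambda u: get_one_info(driver, u), "./zhihu_top50.csv")`, with `driver.quit()` moved into a `finally` block so that the browser is closed even after an interruption.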