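The code I shared in part 1 can scrape only a single Zhihu user. It isn't complicated, but it covers the most essential Python basics.
haili: Teach Yourself Web Scraping from Scratch (1): Getting Basic Data for a Single Zhihu User, with Python Source Code (zhuanlan.zhihu.com)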
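The code I shared in part 2 can scrape N Zhihu users in one run. It was a simple upgrade: wrapping the logic in a function, storing results in a nested dictionary, and skipping over exceptions.
haili: Teach Yourself Web Scraping from Scratch (2): Getting Basic Data for the Top 50 Zhihu Users by Follower Count, with Python Source Code (zhuanlan.zhihu.com)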
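But the output is a nested dictionary, which is not convenient to read. Usually we would rather save and browse the data in a spreadsheet-style file.
For cleaning, organizing, and analyzing data, I recommend pandas. It is simple: the three short lines below save rlts, the result from part 2, to a CSV file.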
import pandas as pd
df = pd.DataFrame(rlts).T
df.to_csv("./zhihu_top50.csv")
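To see why the `.T` is there: `pd.DataFrame` treats the outer keys of a nested dict as columns, so without the transpose each user would end up as a column instead of a row. A minimal sketch, using a made-up two-user `sample_rlts` (the URLs and numbers are placeholders, not scraped data):

```python
import pandas as pd

# Hypothetical stand-in for the nested dict from part 2:
# outer key = running number, inner dict = one user's fields.
sample_rlts = {
    1: {"用户": "https://www.zhihu.com/people/user-a", "回答": "10", "关注者": "100"},
    2: {"用户": "https://www.zhihu.com/people/user-b", "关注者": "200"},  # "回答" is missing
}

wide = pd.DataFrame(sample_rlts)  # outer keys become columns: one column per user
df_sample = wide.T                # transpose: one row per user, one column per field

print(df_sample.shape)
print(df_sample["回答"].isnull().tolist())  # the missing field shows up as a null
```

If the Chinese column names look garbled when the CSV is opened in Excel, passing `encoding="utf-8-sig"` to `to_csv` is a common workaround.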
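When opened, the file looks like this:
[Screenshot of zhihu_top50.csv]
Some of the data is empty because the part 2 code ran into exceptions at run time. When a crawler collects a large amount of data, some exceptions are always going to occur. Sometimes they can be ignored and the missing data simply dropped; sometimes they cannot, and we have to find another way to keep collecting the data.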
The pandas DataFrame structure makes this kind of filtering and tidying very convenient. For example, which users did we fail to get a follower count for? The code:
error_data = df[df["关注者"].isnull()]["用户"]
The result:
20 https://www.zhihu.com/people/excited-vczh
29 https://www.zhihu.com/people/magie
33 https://www.zhihu.com/people/cai-tong
41 https://www.zhihu.com/people/nordenbox
Name: 用户, dtype: object
Which users did we fail to get content counts for? The code:
error_data2 = df[df["回答"].isnull()]["用户"]
The result:
2 https://www.zhihu.com/people/zhi-hu-ri-bao-51-41
4 https://www.zhihu.com/people/ding-xiang-yi-sheng
6 https://www.zhihu.com/people/zhi-ke-ji-13
7 https://www.zhihu.com/people/knowyourself-1
23 https://www.zhihu.com/people/xia-chu-fang
35 https://www.zhihu.com/people/qiong-you-jin-nang
39 https://www.zhihu.com/people/lens-27
43 https://www.zhihu.com/people/zhen-shi-gu-shi-j...
48 https://www.zhihu.com/people/zhong-guo-ke-pu-b...
Name: 用户, dtype: object
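The two checks can also be combined into a single list of URLs that need another attempt. A minimal sketch, assuming the same column names as above; `demo` is a made-up stand-in for `df`, and `retry_urls` is a name chosen here for illustration:

```python
import pandas as pd

# Made-up stand-in for df; None marks a field that was not scraped.
demo = pd.DataFrame(
    {
        "用户": ["url-a", "url-b", "url-c"],
        "回答": ["10", None, "30"],
        "关注者": ["100", "200", None],
    },
    index=[1, 2, 3],
)

# A row counts as incomplete if either of the two columns is null.
incomplete = demo[["回答", "关注者"]].isnull().any(axis=1)
retry_urls = demo.loc[incomplete, "用户"].tolist()
print(retry_urls)  # ['url-b', 'url-c']
```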
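Before crawling at scale, test on a small sample. First, check that the code is robust enough: if it is not, an exception can easily interrupt it, and results already collected may not even get saved, so you have to start again from zero. Second, check that the code covers every case accurately and captures as much data as possible.
After testing, the part 2 code can be adjusted as follows: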
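One way to guard against losing finished work is to write each row to disk as soon as it is scraped, and to log failures instead of stopping. A minimal sketch using only the standard library; `fake_fetch`, `crawl_and_save`, and the trimmed `FIELDS` list are hypothetical names for illustration, and `fake_fetch` merely stands in for a real scraping function:

```python
import csv
import os
import tempfile

FIELDS = ["用户", "回答", "关注者"]  # trimmed column set, for illustration only


def fake_fetch(url):
    """Hypothetical stand-in for a real scraper; fails on one URL."""
    if "bad" in url:
        raise RuntimeError("simulated page error")
    return {"用户": url, "回答": "1", "关注者": "2"}


def crawl_and_save(urls, fetch, path):
    """Fetch each URL, write its row immediately, and return the failed URLs."""
    failed = []
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for url in urls:
            try:
                row = fetch(url)
            except Exception as exc:  # keep going, but remember what failed
                print(url, "failed:", exc)
                failed.append(url)
                continue
            writer.writerow(row)
            f.flush()  # write the row out before moving on to the next URL
    return failed


out_path = os.path.join(tempfile.mkdtemp(), "partial.csv")
failed_urls = crawl_and_save(["u1", "bad-url", "u2"], fake_fetch, out_path)
```

With this shape, an interruption partway through still leaves the rows collected so far in the file, and `failed_urls` says what to try again.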
"""Use Python Selenium with a headless browser to collect basic profile-page data for the Zhihu top 50 users by follower count."""
from time import sleep
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=chrome_options)


def get_one_info(driver, url):
    """Scrape one profile page and return its counts as a dict."""
    driver.get(url)
    sleep(1)  # give the page a moment to render
    # find_elements(By.CLASS_NAME, ...) is used in place of
    # find_elements_by_class_name, which newer Selenium releases no longer provide.
    rlts = driver.find_elements(By.CLASS_NAME, "Tabs-meta")
    nums1 = [rlt.text for rlt in rlts]
    rlts = driver.find_elements(By.CLASS_NAME, "NumberBoard-itemValue")
    nums2 = [rlt.text for rlt in rlts]
    rlt = {}
    rlt["用户"] = url
    if len(nums1) >= 7:
        # Index from the end, so any extra leading entries are ignored.
        rlt["回答"] = nums1[-7]
        rlt["视频"] = nums1[-6]
        rlt["提问"] = nums1[-5]
        rlt["文章"] = nums1[-4]
        rlt["专栏"] = nums1[-3]
        rlt["想法"] = nums1[-2]
        rlt["收藏"] = nums1[-1]
    else:
        print(url, "nums1 has fewer than 7 items", nums1)
    if len(nums2) == 2:
        rlt["关注了"] = nums2[0]
        rlt["关注者"] = nums2[1]
    else:
        print(url, "unexpected nums2", nums2)
    rlt["日期"] = str(datetime.now())[:-7]  # timestamp without the microseconds
    return rlt

urls = [
'https://www.zhihu.com/people/haili-9-70/',
"https://www.zhihu.com/people/zhi-hu-ri-bao-51-41",
"https://www.zhihu.com/people/liu-kan-shan-78",
"https://www.zhihu.com/people/ding-xiang-yi-sheng",
"https://www.zhihu.com/people/zhang-jia-wei",
"https://www.zhihu.com/people/zhi-ke-ji-13",
"https://www.zhihu.com/people/knowyourself-1",
"https://www.zhihu.com/people/kaifulee",
"https://www.zhihu.com/people/zhouyuan",
"https://www.zhihu.com/people/zhang-xiao-bei",
"https://www.zhihu.com/people/warfalcon",
"https://www.zhihu.com/people/lisanshui1230",
"https://www.zhihu.com/people/tian-ji-shun",
"https://www.zhihu.com/people/jixin",
"https://www.zhihu.com/people/ma-bo-yong",
"https://www.zhihu.com/people/sizhuren",
"https://www.zhihu.com/people/imike",
"https://www.zhihu.com/people/raymond-wang",
"https://www.zhihu.com/people/ChenZhangyu",
"https://www.zhihu.com/people/excited-vczh",
"https://www.zhihu.com/people/zhu-xuan-86",
"https://www.zhihu.com/people/lisongwei",
"https://www.zhihu.com/people/xia-chu-fang",
"https://www.zhihu.com/people/dong-ji-zai-hang-zhou",
"https://www.zhihu.com/people/gejinyuban",
"https://www.zhihu.com/people/guo-zi-501",
"https://www.zhihu.com/people/gao-ke-69",
"https://www.zhihu.com/people/chenqin",
"https://www.zhihu.com/people/magie",
"https://www.zhihu.com/people/chenbailing",
"https://www.zhihu.com/people/wang-ni-ma-94",
"https://www.zhihu.com/people/thejennyyy",
"https://www.zhihu.com/people/cai-tong",
"https://www.zhihu.com/people/zhou-xiao-nong",
"https://www.zhihu.com/people/qiong-you-jin-nang",
"https://www.zhihu.com/people/mali",
"https://www.zhihu.com/people/bo-cai-28-7",
"https://www.zhihu.com/people/cheng-yi-nan",
"https://www.zhihu.com/people/lens-27",
"https://www.zhihu.com/people/commando",
"https://www.zhihu.com/people/nordenbox",
"https://www.zhihu.com/people/binka",
"https://www.zhihu.com/people/zhen-shi-gu-shi-ji-hua",
"https://www.zhihu.com/people/he-ming-ke",
"https://www.zhihu.com/people/ccat",
"https://www.zhihu.com/people/talich",
"https://www.zhihu.com/people/feifeimao",
"https://www.zhihu.com/people/zhong-guo-ke-pu-bo-lan",
"https://www.zhihu.com/people/pan-fan-65",
"https://www.zhihu.com/people/gong-qing-tuan-zhong-yang-67",
"https://www.zhihu.com/people/divinites"
]

rlts = {}
number = 0
for url in urls:
    rlt = get_one_info(driver, url)
    number += 1
    rlts[number] = rlt  # keyed 1..N, in the order of the urls list
driver.quit()
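The loop above visits each URL once, so a row with missing fields stays incomplete. One possible follow-up, which is not part of the original code, is a second pass over just those rows. A minimal sketch, where `retry_incomplete`, `REQUIRED`, and `stub_fetch` are hypothetical names; in real use `fetch` could be something like `lambda url: get_one_info(driver, url)`, called before `driver.quit()`:

```python
REQUIRED = ("回答", "关注者")  # fields a row needs in order to count as complete


def retry_incomplete(rlts, fetch, max_rounds=2):
    """Re-fetch rows that lack a required field, for at most max_rounds passes."""
    for _ in range(max_rounds):
        pending = [key for key, row in rlts.items()
                   if any(field not in row for field in REQUIRED)]
        if not pending:
            break
        for key in pending:
            rlts[key] = fetch(rlts[key]["用户"])
    return rlts


# Stub that returns a complete row only on the second call for a given URL.
calls = {}


def stub_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    row = {"用户": url, "关注者": "5"}
    if calls[url] >= 2:
        row["回答"] = "3"
    return row


demo_rlts = {1: {"用户": "u1", "回答": "1", "关注者": "2"}, 2: {"用户": "u2"}}
fixed = retry_incomplete(demo_rlts, stub_fetch)
print(fixed[2])
```

Capping the number of passes keeps a page that never loads correctly from holding up the whole run.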