Python Web Scraping and Page Analysis Tools: A Python Web Crawler Tutorial

Reposted

数据小筑 2023-07-27 21:40:34

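Python version: Python 3.6

Tool: PyCharm

I. A first crawler program

Fetch the source code of a web page. The example here fetches the source of the Baidu home page.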
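As a minimal sketch of this first program, the snippet below uses `urllib.request.urlopen` from the standard library. The original code is not reproduced in the text, so the function names (`fetch_source`, `save_source`), the URL, and the output filename are assumptions made for illustration.

```python
from urllib.request import urlopen


def fetch_source(url, encoding="utf-8"):
    """Download a page and return its source as text (performs a network request)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode(encoding)


def save_source(html, path):
    """Write the page source to a local file so it can be opened in a browser."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)


# Example usage (requires network access; URL and filename are assumptions):
# html = fetch_source("http://www.baidu.com")
# save_source(html, "baidu.html")
```

Opening the saved file in a browser is a quick way to confirm that what was downloaded is the same HTML the browser receives.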

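II. The web request process

1. Server-side rendering: the server merges the data into the HTML and returns the complete page to the browser. (The data is visible in the page source.)

2. Client-side rendering: the first request returns only an HTML skeleton; a second request fetches the data, which the browser then renders. (The data is not visible in the page source.)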
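One way to tell the two cases apart is to check whether text you can see in the browser also appears in the raw page source. The sketch below illustrates the idea on two made-up HTML strings; the helper name `appears_in_source` and both snippets are hypothetical.

```python
def appears_in_source(page_source, visible_text):
    """Return True if text seen in the browser is present in the raw HTML."""
    return visible_text in page_source


# Made-up page sources for illustration only.
server_rendered = "<html><body><li>Movie A</li></body></html>"
client_rendered = (
    "<html><body><div id='app'></div>"
    "<script src='app.js'></script></body></html>"
)

print(appears_in_source(server_rendered, "Movie A"))  # True: data is in the HTML
print(appears_in_source(client_rendered, "Movie A"))  # False: data arrives in a later request
```

When the check fails, the data usually comes from a separate request that can be found in the browser's network panel.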

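III. Getting started with requests

1. Fetch the page source of a Sogou search for 周杰伦 (Jay Chou).

2. The request fails because the site applies anti-scraping checks. A common workaround is to send a browser-like User-Agent header with the request.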
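A minimal sketch of steps 1 and 2, assuming the search URL takes the form shown and that a browser-like `User-Agent` is enough to get past the check. The URL, the header value, and the name `fetch_source` are assumptions, not taken from the original screenshots.

```python
import requests

# Assumed search URL and a generic browser User-Agent string.
URL = "https://www.sogou.com/web?query=周杰伦"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
    )
}


def fetch_source(url, headers=None):
    """Send a GET request and return the page source (performs a network request)."""
    resp = requests.get(url, headers=headers, timeout=10)
    try:
        return resp.text
    finally:
        resp.close()  # release the connection


# Build the request without sending it, to inspect what would go out.
prepared = requests.Request("GET", URL, headers=HEADERS).prepare()
print(prepared.headers["User-Agent"])
```

Calling `fetch_source(URL)` sends the request with the default requests User-Agent, while `fetch_source(URL, HEADERS)` presents itself as a browser.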

3. Modify the code so that it is more flexible: different search terms return the source of the corresponding results page.
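One possible way to parameterize the search term is sketched below. The URL format, the shortened header, and the names `build_search_url` and `search` are assumptions; `urllib.parse.quote` is used so that spaces and non-ASCII terms are percent-encoded safely.

```python
from urllib.parse import quote

import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # shortened browser-like header (assumption)


def build_search_url(query):
    """Insert a percent-encoded search term into the assumed Sogou search URL."""
    return f"https://www.sogou.com/web?query={quote(query)}"


def search(query):
    """Fetch the results page for a search term (performs a network request)."""
    resp = requests.get(build_search_url(query), headers=HEADERS, timeout=10)
    try:
        return resp.text
    finally:
        resp.close()


print(build_search_url("python"))
# Interactive use: html = search(input("Enter a search term: "))
```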

4. When the request method is POST: fetching the response from Baidu Translate as an example.
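A hedged sketch of a POST request follows. The endpoint `https://fanyi.baidu.com/sug` and the form field name `kw` are assumptions about what the browser's network panel shows for this site; verify both before relying on them. With `requests.post`, form data is passed through the `data` argument rather than in the URL.

```python
import requests

URL = "https://fanyi.baidu.com/sug"  # assumed endpoint


def build_translate_request(word):
    """Prepare (but do not send) the POST request, to show what is transmitted."""
    return requests.Request("POST", URL, data={"kw": word}).prepare()


def translate(word):
    """Send the POST request and return the decoded JSON (performs a network request)."""
    resp = requests.post(URL, data={"kw": word}, timeout=10)
    try:
        return resp.json()  # this kind of endpoint typically returns JSON, not HTML
    finally:
        resp.close()


prepared = build_translate_request("dog")
print(prepared.method, prepared.body)
```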

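5. When the data to scrape is not delivered together with the page skeleton, taking the Douban movie chart as an example: first locate the request that actually returns the data.

Then wrap the query parameters in a dictionary and let requests append them to the base URL.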
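The sketch below shows the general pattern of passing query parameters through `params`. The endpoint and every parameter name and value are assumptions standing in for whatever the browser's network panel shows; the function name `fetch_chart` is hypothetical.

```python
import requests

# Assumed data endpoint and query parameters (copy the real ones from the network panel).
URL = "https://movie.douban.com/j/chart/top_list"
PARAMS = {
    "type": "24",
    "interval_id": "100:90",
    "action": "",
    "start": 0,
    "limit": 20,
}
HEADERS = {"User-Agent": "Mozilla/5.0"}  # shortened browser-like header (assumption)


def fetch_chart(params):
    """Request the data endpoint and return the decoded JSON (performs a network request)."""
    resp = requests.get(URL, params=params, headers=HEADERS, timeout=10)
    try:
        return resp.json()
    finally:
        resp.close()


# requests builds the final URL from the base URL plus the encoded parameters.
prepared = requests.Request("GET", URL, params=PARAMS).prepare()
print(prepared.url)
```

Keeping the parameters in a dictionary makes it easy to change a single value, such as `start`, to page through the results.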

Note: close the response once you are done with it.

In the examples above, call resp.close() to release the connection.

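IV. Parsing data

This article introduces three parsing approaches: re, bs4, and xpath.

1. re parsing: Regular Expression, a syntax for matching strings against patterns.

Advantages: fast, efficient, precise

Disadvantage: relatively hard to learn

Syntax: metacharacters (special symbols with a fixed meaning) are combined to match strings. Common metacharacters:

.       matches any character except a newline
a|b     matches a or b
\w      matches a letter, digit, or underscore
\W      matches anything other than a letter, digit, or underscore
\s      matches any whitespace character
\S      matches any non-whitespace character
\d      matches a digit
\D      matches a non-digit
[…]     matches any character in the set
[^…]    matches any character not in the set
^       matches the start of the string
$       matches the end of the string

Quantifiers: control how many times the preceding element repeats

*       zero or more times
+       one or more times
?       zero or one time
{n}     exactly n times
{n,}    n or more times
{n,m}   between n and m times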
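The metacharacters and quantifiers above can be tried out directly. This is a small sketch using `re.fullmatch`, which succeeds only when the whole string fits the pattern; the helper name `matches` and the sample strings are made up for illustration.

```python
import re


def matches(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches(r"\d{3}", "123"))        # True: exactly three digits
print(matches(r"\d{3}", "12a"))        # False: 'a' is not a digit
print(matches(r"\w+", "user_01"))      # True: letters, digits, underscore
print(matches(r"[abc]+", "cab"))       # True: only characters from the set
print(matches(r"[^abc]+", "xyz"))      # True: no characters from the set
print(matches(r"colou?r", "color"))    # True: 'u' appears zero or one time
print(matches(r"cat|dog", "dog"))      # True: either alternative
print(matches(r"a.c", "a\nc"))         # False: '.' does not match a newline
print(matches(r"\d{2,4}", "12345"))    # False: more than four digits
```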

(1) Using the re module: findall() returns every substring that matches the pattern.
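A short illustration of `re.findall`, using a made-up sample sentence; the result is a plain list of the matched strings.

```python
import re

# Made-up sample text for illustration.
text = "My number is 10086 and the other number is 10010"

numbers = re.findall(r"\d+", text)
print(numbers)  # ['10086', '10010']
```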

(2) finditer() also finds every match in the string, but returns an iterator of match objects.

To get the matched text from each item, call .group().
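The same made-up sample text, this time with `re.finditer`. Each item produced by the iterator is a match object rather than a string, which is why `.group()` is needed.

```python
import re

text = "My number is 10086 and the other number is 10010"

# Iterate lazily over match objects and pull the text out of each one.
for match in re.finditer(r"\d+", text):
    print(match.group())

found = [match.group() for match in re.finditer(r"\d+", text)]
print(found)  # ['10086', '10010']
```

An iterator is useful on large inputs because matches are produced one at a time instead of being collected into a list up front.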

(3) search() returns as soon as it finds one result. The result is a match object; call .group() to get the text.
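A brief sketch of `re.search` on the same made-up text. Note that it returns `None` when nothing matches, so it is worth checking the result before calling `.group()`.

```python
import re

text = "My number is 10086 and the other number is 10010"

first = re.search(r"\d+", text)
print(first.group())  # 10086 (only the first match is returned)

missing = re.search(r"\d+", "no digits here")
print(missing)  # None
```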

(4) match() matches only from the beginning of the string.

It yields a result only when the string starts with the text the pattern describes.
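The difference from `search()` shows up when the wanted text is not at the start of the string. The two sample strings below are made up for illustration.

```python
import re

# The digits are not at the start, so match() finds nothing.
not_at_start = re.match(r"\d+", "My number is 10086")
print(not_at_start)  # None

# The string begins with digits, so match() succeeds.
at_start = re.match(r"\d+", "10086 is my number")
print(at_start.group())  # 10086
```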

(5) Precompile a regular expression with re.compile() so that it can be reused.
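A compiled pattern exposes the same methods as the module-level functions, minus the pattern argument. The variable name `number_pattern` and the sample strings are made up.

```python
import re

# Compile once, then reuse the pattern object for several operations.
number_pattern = re.compile(r"\d+")

all_numbers = number_pattern.findall("10086 and 10010")
first_number = number_pattern.search("call 10086").group()
each_number = [m.group() for m in number_pattern.finditer("1, 22, 333")]

print(all_numbers)   # ['10086', '10010']
print(first_number)  # 10086
print(each_number)   # ['1', '22', '333']
```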

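(6) Extracting individual parts of a match. Wrap each part to extract in a named group, (?P<group_name>pattern), then pass the group name to .group() to retrieve that part. (re.S lets . match newlines as well.)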
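The sketch below applies named groups to a made-up HTML fragment whose elements span several lines, which is where `re.S` matters: without it, `.*?` cannot cross the line breaks and nothing matches. The group names `cls`, `id`, and `name` are arbitrary choices.

```python
import re

# Made-up HTML fragment for illustration.
html = """
<div class='jay'>
    <span id='1'>Jay Chou</span>
</div>
<div class='jj'>
    <span id='2'>JJ Lin</span>
</div>
"""

PATTERN = r"<div class='(?P<cls>.*?)'>.*?<span id='(?P<id>\d+)'>(?P<name>.*?)</span>"

with_flag = re.compile(PATTERN, re.S)
without_flag = re.compile(PATTERN)

rows = [(m.group("cls"), m.group("id"), m.group("name")) for m in with_flag.finditer(html)]
print(rows)                        # [('jay', '1', 'Jay Chou'), ('jj', '2', 'JJ Lin')]
print(without_flag.findall(html))  # []: '.' stops at the newline
```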

2. Hands-on: scraping movie information from the Douban Top 250.

(1) Use requests to fetch the page source.

(2) Use re to parse the data.

The target is four fields: title, year, rating, and number of ratings. Find where each one appears in the page source, along with the surrounding markup that can serve as an anchor for the pattern.

Parsing the data: .strip() removes the whitespace in front of the year.
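To show the parsing step without a network call, the sketch below runs a pattern over a made-up, simplified fragment that only imitates the general shape of a list item; the real page markup, and therefore the real pattern, may differ. The sample values, the pattern, and the name `parse_movies` are all assumptions.

```python
import re

# Made-up, simplified markup with placeholder values (not real page data).
SAMPLE_HTML = """
<li>
    <div class="item">
        <span class="title">Example Movie</span>
        <p class="">
            Director: Someone<br>
            1994&nbsp;/&nbsp;USA&nbsp;/&nbsp;Drama
        </p>
        <span class="rating_num" property="v:average">9.0</span>
        <span>12345人评价</span>
    </div>
</li>
"""

MOVIE_PATTERN = re.compile(
    r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>'
    r'.*?<p class="">.*?<br>(?P<year>.*?)&nbsp'
    r'.*?<span class="rating_num" property="v:average">(?P<score>.*?)</span>'
    r'.*?<span>(?P<num>.*?)人评价</span>',
    re.S,  # let '.' cross line breaks inside each list item
)


def parse_movies(html):
    """Return one dict per movie with name, year, score, and num."""
    movies = []
    for match in MOVIE_PATTERN.finditer(html):
        movie = match.groupdict()
        movie["year"] = movie["year"].strip()  # drop the newline and indentation
        movies.append(movie)
    return movies


print(parse_movies(SAMPLE_HTML))
```

In a real run, the string passed to `parse_movies` would be the `resp.text` obtained in step (1).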

Save the extracted data to a file: import csv and collect each match into a dictionary; as before, year needs to be stripped separately.
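A minimal sketch of the saving step with `csv.writer`. The row below is placeholder data, and the function name `write_movies` is hypothetical; an in-memory buffer stands in for the file so the output can be inspected directly.

```python
import csv
import io

# Placeholder row in the same shape as the parsed dictionaries.
movies = [{"name": "Example Movie", "year": "1994", "score": "9.0", "num": "12345"}]


def write_movies(rows, fileobj):
    """Write each dictionary's values as one CSV row."""
    writer = csv.writer(fileobj)
    for row in rows:
        writer.writerow(row.values())


buffer = io.StringIO()
write_movies(movies, buffer)
print(buffer.getvalue())

# Writing to disk instead:
# with open("data.csv", mode="w", encoding="utf-8", newline="") as f:
#     write_movies(movies, f)
```

Passing `newline=""` to `open` follows the csv module's recommendation and avoids blank lines between rows on Windows.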

The results are written to data.csv.

3. Hands-on: scraping download links from 电影天堂 (Dianying Tiantang). The target is the "2021必看热片" (2021 must-see hits) list.

(1) Fetch the page source.

The output is garbled: the text is being decoded as utf-8, whereas the page source declares its charset as gb2312. Specify the character set explicitly to fix the garbling.
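The effect can be reproduced offline with plain bytes, as in the sketch below. With requests, the equivalent fix is to set `resp.encoding = "gb2312"` before reading `resp.text`; the sample string here is just the section title used as test data.

```python
# Bytes as a gb2312-encoded site would send them.
raw = "2021必看热片".encode("gb2312")

# Decoding with the wrong charset produces replacement characters (garbled text).
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the charset the page declares restores the text.
right = raw.decode("gb2312")

print(wrong)
print(right)  # 2021必看热片
```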

(2) Locate the 2021必看热片 block in the page source.

(3) Extract the child-page links from that block.

The extracted links are incomplete: they lack the domain, so each one has to be joined with the site's base URL.
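Steps (2) and (3) can be sketched on a made-up fragment as below. The markup, the placeholder domain `https://example.com/`, the patterns, and the name `extract_child_links` are all assumptions; the sketch also uses `urllib.parse.urljoin` for the joining step rather than plain string concatenation, since it handles leading slashes consistently.

```python
import re
from urllib.parse import urljoin

DOMAIN = "https://example.com/"  # placeholder domain (assumption)

# Made-up fragment that imitates a titled block followed by a list of links.
SAMPLE_HTML = """
<div>2021必看热片</div>
<ul>
<li><a href='/i/100001.html'>Movie A</a></li>
<li><a href='/i/100002.html'>Movie B</a></li>
</ul>
<div>Other section</div>
<ul><li><a href='/i/999999.html'>Unrelated</a></li></ul>
"""

# Step (2): capture only the <ul> that follows the section title.
SECTION_PATTERN = re.compile(r"2021必看热片.*?<ul>(?P<ul>.*?)</ul>", re.S)
# Step (3): capture each href inside that block.
LINK_PATTERN = re.compile(r"<a href='(?P<href>.*?)'", re.S)


def extract_child_links(html, domain):
    """Return absolute child-page URLs found in the target section."""
    section = SECTION_PATTERN.search(html)
    if section is None:
        return []
    block = section.group("ul")
    return [urljoin(domain, m.group("href")) for m in LINK_PATTERN.finditer(block)]


child_links = extract_child_links(SAMPLE_HTML, DOMAIN)
print(child_links)
```

Restricting the second pattern to the captured block is what keeps links from other sections out of the result.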

With the complete child-page links in hand, store them for the next step.

Then fetch each child page and extract the movie title and download link.
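The final step follows the same pattern as before: one regular expression per field, applied to each child page. The fragment below is made up, so the markup, the patterns, the placeholder link, and the name `parse_child_page` are assumptions; the real child pages will need patterns written against their actual source.

```python
import re

# Made-up child-page fragment with placeholder values.
SAMPLE_CHILD_HTML = """
<div id="content">
<p>Title: Example Movie<br />
Year: 2021<br /></p>
<table><tr><td><a href="magnet:?xt=urn:btih:EXAMPLEHASH">Download</a></td></tr></table>
</div>
"""

TITLE_PATTERN = re.compile(r"Title:(?P<title>.*?)<br />", re.S)
DOWNLOAD_PATTERN = re.compile(r'<td><a href="(?P<download>.*?)">', re.S)


def parse_child_page(html):
    """Return (title, download_link), or None if either field is missing."""
    title = TITLE_PATTERN.search(html)
    download = DOWNLOAD_PATTERN.search(html)
    if title is None or download is None:
        return None
    return title.group("title").strip(), download.group("download")


print(parse_child_page(SAMPLE_CHILD_HTML))
```

In a full run, this function would be called on the text of each page fetched from the stored child-page links, with the response encoding set as described in step (1).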
