Parsing the robots protocol with a Python crawler: the re library for Python crawlers

Reposted

IT独行侠 2023-08-10 19:14:23


In the previous post we covered regular expressions. Python ships with the re library, which is dedicated to regular expression matching.

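I. A Brief Look at the re Library

Importing the re library:
re is part of Python's standard library (nothing extra needs to be installed) and is mainly used for string matching.
How to use it: import re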

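Writing regular expressions:
raw string: the raw string type
Notation: r'text'
For example: r'[1-9]\d{5}'

raw string: backslashes are not processed as escape characters, so there is no need to work out how many \ to write
Ordinary string: more cumbersome to use, because a single \ has to be written as \\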
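The difference between the two notations can be checked directly. The following is a minimal sketch; the variable names are chosen only for illustration and do not come from the original post.

```python
import re

# The same pattern written as a raw string and as an ordinary string.
raw_pattern = r'[1-9]\d{5}'
plain_pattern = '[1-9]\\d{5}'   # each backslash must be doubled

# Both literals produce exactly the same characters.
same = raw_pattern == plain_pattern
print(same)                      # True
print(len(r'\d'), len('\\d'))    # 2 2

# Either form can be passed to the re functions.
found = re.search(raw_pattern, 'BIT100081')
if found:
    print(found.group(0))        # 100081
```

The raw form is usually easier to read, which is why the rest of the examples use it.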

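II. Main Functions of the re Library

Function names and what they do:

Function	Description
re.search()	Searches a string for the first location where the regular expression matches; returns a match object
re.match()	Matches the regular expression starting at the beginning of a string; returns a match object
re.findall()	Searches a string and returns all matching substrings as a list
re.split()	Splits a string at the places where the regular expression matches; returns a list
re.finditer()	Searches a string and returns an iterator over the matches; each element is a match object
re.sub()	Replaces every substring that matches the regular expression; returns the resulting string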

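Parameters:

Parameter	Description
pattern	The regular expression, given as a string or a raw string
string	The string to match against
flags	Control flags that modify how the regular expression is applied
maxsplit	split: the maximum number of splits; the remainder is returned as the last element
repl	sub: the string that replaces each matched substring
count	sub: the maximum number of replacements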

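flags	Description
re.I re.IGNORECASE	Ignores case, so [A-Z] also matches lowercase letters
re.M re.MULTILINE	The ^ operator matches at the start of every line of the given string
re.S re.DOTALL	The . operator matches every character; by default it matches every character except a newline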
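The effect of each flag is easiest to see by running the same pattern with and without it. This is a small sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "first line\nSecond line"

# re.I: [a-z] also matches uppercase letters.
ignore_case = re.findall(r'[a-z]+', "Hnu", flags=re.I)
print(ignore_case)               # ['Hnu']

# re.M: ^ matches at the start of every line, not only the start of the string.
default_starts = re.findall(r'^\w+', text)
multiline_starts = re.findall(r'^\w+', text, flags=re.M)
print(default_starts)            # ['first']
print(multiline_starts)          # ['first', 'Second']

# re.S: . also matches the newline character.
dot_default = re.search(r'line.Second', text)
dot_all = re.search(r'line.Second', text, flags=re.S)
print(dot_default)               # None
print(dot_all is not None)       # True
```

Flags can be combined with the | operator, for example flags=re.I | re.M.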

Function syntax:

1. re.search(pattern, string, flags=0)
Note: the return value is None when nothing matches, so check it before calling group

import re

match = re.search(r'[1-9]\d{5}', '342100')
if match:
	print(match.group(0))

2. re.match(pattern, string, flags=0)

match = re.match(r'[1-9]\d{5}', "342300 HNU")
if match:
	print(match.group(0))
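Because re.match() only tries the pattern at the very start of the string, it behaves differently from re.search() as soon as the digits are not at the beginning. A minimal sketch, with the input string reordered for illustration:

```python
import re

text = "HNU 342300"

# match() anchors at position 0, where "H" does not fit [1-9].
at_start = re.match(r'[1-9]\d{5}', text)
print(at_start)                  # None

# search() scans forward until it finds the first match.
anywhere = re.search(r'[1-9]\d{5}', text)
if anywhere:
    print(anywhere.group(0))     # 342300
```

This is also why the result should be checked before group is called.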

3. re.findall(pattern, string, flags=0)

mylist = re.findall(r'[1-9]\d{5}', "HNU496300 342300")
print(mylist)

4. re.split(pattern, string, maxsplit=0, flags=0)
Compare the results with and without the maxsplit parameter:

mylist = re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(mylist)
mylist = re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1)
print(mylist)

5. re.finditer(pattern, string, flags=0)

# finditer only yields match objects, so no None check is needed
for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
	print(m.group(0))

6. re.sub(pattern, repl, string, count=0, flags=0)

s = re.sub(r'[1-9]\d{5}', ":zipcode", "BIT100081 TSU100084")
print(s)
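The count parameter from the table above limits how many matches are replaced. A short sketch comparing the default (replace everything) with count=1:

```python
import re

text = "BIT100081 TSU100084"

# Default count=0 replaces every match.
replace_all = re.sub(r'[1-9]\d{5}', ':zipcode', text)
print(replace_all)               # BIT:zipcode TSU:zipcode

# count=1 replaces only the first match and leaves the rest untouched.
replace_first = re.sub(r'[1-9]\d{5}', ':zipcode', text, count=1)
print(replace_first)             # BIT:zipcode TSU100084
```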

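III. Using the re Library in an Object-Oriented Way

There are two ways to use the re library:
1. Functional usage:
rst = re.search(r'[1-9]\d{5}', 'BIT 100084')
2. Object-oriented usage:
pat = re.compile(r'[1-9]\d{5}')
rst = pat.search('BIT 100084')
3. What the functional usage really does:
If we open the re module and look at how its functions are defined, we find that the return value has the form _compile().match(), which means the functional usage indirectly relies on the object-oriented usage.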

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)


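The compile function:
Syntax: regex = re.compile(pattern, flags=0)
Purpose: compiles the string form of a regular expression into a regular expression object
Any functional usage can be replaced by the object-oriented form, for example: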

match = re.match(r'[1-9]\d{5}', "342300 HNU")
if match:
	print(match.group(0))

regex = re.compile(r'[1-9]\d{5}')
match = regex.match("342300 HNU")
if match:
	print(match.group(0))
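A compiled object is convenient when the same pattern is applied to many strings, since the pattern is written once and reused. The sketch below assumes a small hypothetical list of input lines; the names zip_re and lines are illustrative only.

```python
import re

# Compile the pattern once, then reuse the resulting object.
zip_re = re.compile(r'[1-9]\d{5}')

# Hypothetical input lines, used only for this example.
lines = ["BIT100081", "no code here", "TSU100084 HNU342300"]

results = [zip_re.findall(line) for line in lines]
for line, codes in zip(lines, results):
    print(line, codes)
```

The compiled object offers the same operations as the module-level functions (search, match, findall, split, finditer, sub), with the pattern argument omitted.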

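IV. The match Object

Attributes of the match object:

Attribute	Description
.string	The string that was searched
.re	The pattern object (regular expression) used for the match
.pos	The position in the text where the search started
.endpos	The position in the text where the search ended

Methods of the match object:

Method	Description
.group(0)	Returns the matched string
.start()	Returns the start position of the match in the original string
.end()	Returns the end position of the match in the original string
.span()	Returns (.start(), .end())

Using the match object: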

import re
m = re.search(r'[1-9]\d{5}', "BIT100084 TSU100081")
print(m.string)
print(m.re)
print(m.pos)
print(m.endpos)
print(m.group(0))
print(m.span())

Output:

BIT100084 TSU100081
re.compile('[1-9]\\d{5}')
0
19
100084
(3, 9)
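The positions reported by .start(), .end() and .span() are ordinary string indices, so they can be used to slice the original text. The sketch below also shows where a non-zero .pos comes from: a compiled pattern's search method accepts an optional start position. The variable names are illustrative.

```python
import re

text = "BIT100084 TSU100081"
pat = re.compile(r'[1-9]\d{5}')

# Collect (matched text, span) for every match in the string.
spans = [(m.group(0), m.span()) for m in pat.finditer(text)]
print(spans)                     # [('100084', (3, 9)), ('100081', (13, 19))]

# Slicing with start()/end() gives back the matched substring.
first = pat.search(text)
if first:
    print(text[first.start():first.end()])   # 100084

# Starting the search at index 10 skips the first zip code.
later = pat.search(text, 10)
if later:
    print(later.pos, later.group(0))         # 10 100081
```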
