Python Study Notes (5): Regular Expressions

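Original

缥缈孤鸿一pmhong 2013-12-26 17:04:22 Category: Python

© Copyright belongs to the author. This is an original work by 51CTO blog author 缥缈孤鸿一pmhong; contact the author for permission to repost, otherwise legal action may be taken.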

Metacharacters

. ^ $ * + ? {} [] () \ |

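Python's regular expression support is provided by the re module.

Define a pattern s as the raw string r'abc' (the "r" prefix marks a raw string literal), then use findall to find its matches in a given string.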

>>> import re
>>> s = 'abc'
>>> s = r'abc'
>>> re.findall(s,'abcdfdsajk')
['abc']
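To make the role of the "r" prefix concrete, here is a small sketch (written for Python 3; the sample strings are made up for illustration). It shows that a raw string keeps backslashes as typed, which is why patterns are normally written as raw strings.

```python
import re

# In a normal string '\n' is one newline character; in a raw string it is
# the two characters backslash and 'n'.
plain_len = len('\n')
raw_len = len(r'\n')

# Matching a literal backslash: the raw pattern needs two backslashes,
# the non-raw pattern needs four.
path = 'C:\\temp'                       # the text C:\temp
raw_hits = re.findall(r'\\', path)
plain_hits = re.findall('\\\\', path)

# findall returns every non-overlapping match, not just the first one.
all_abc = re.findall(r'abc', 'abcdfdsajk abc')

print(plain_len, raw_len)
print(raw_hits == plain_hits)
print(all_abc)
```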

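[ ]

Commonly used to specify a character set: [abc], [a-z]

Most metacharacters lose their special meaning inside a character set: [abc$]

For example, [akm$] matches any one of the characters "a", "k", "m", or "$"

[^string]

Matches any character that is not in the set. For example, [^a] matches any character other than "a"

Matching with the "[string]" metacharacter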

>>> st = 'top tip tap tsp tep'
>>> res = r'top'
>>> re.findall(res,st)
['top']
>>> res = r't[io]p'
>>> re.findall(res,st)
['top', 'tip']

[^io] matches any single character other than "i" or "o", so t[^io]p skips "top" and "tip"

>>> res = r't[^io]p'
>>> re.findall(res,st)
['tap', 'tsp', 'tep']
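The following sketch (Python 3, with a made-up word list that extends the one above) illustrates two points about character sets: "$" is an ordinary character inside the brackets, and "^" negates the set only when it comes first.

```python
import re

words = 'top tip tap tsp tep t$p t^p'

# Inside a set, '$' is just a literal dollar sign.
dollar_hits = re.findall(r't[io$]p', words)

# '^' as the first character negates the set.
negated_hits = re.findall(r't[^io]p', words)

# Anywhere else in the set, '^' is a literal caret.
caret_hits = re.findall(r't[o^]p', words)

print(dollar_hits)    # ['top', 'tip', 't$p']
print(negated_hits)   # ['tap', 'tsp', 'tep', 't$p', 't^p']
print(caret_hits)     # ['top', 't^p']
```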

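^ matches the start of a line (by default, only the start of the whole string)

$ matches the end of a line (by default, only the end of the whole string, or just before a trailing newline)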

>>> s = "hello world,hello boy"
>>> r = r"hello"
>>> re.findall(r,s)
['hello', 'hello']
>>> r = r"^hello"
>>> re.findall(r,s)
['hello']
>>> r = r"boy$"
>>> re.findall(r,s)
['boy']
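A short sketch of the anchors in Python 3, reusing the sample string above. The trailing-newline case is added here for illustration and is not part of the original notes.

```python
import re

s = "hello world,hello boy"

# '^' pins the match to the start of the string, so only the first
# 'hello' is found.
at_start = re.findall(r'^hello', s)

# '$' pins the match to the end of the string.
at_end = re.findall(r'boy$', s)

# '$' also matches just before a single trailing newline.
before_newline = re.findall(r'boy$', s + '\n')

# 'world' occurs in the string, but not at the start.
not_at_start = re.search(r'^world', s)

print(at_start, at_end, before_newline, not_at_start)
```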

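. matches any character except a newline

\ escape character

\d matches any decimal digit; equivalent to [0-9]

\D matches any non-digit character; equivalent to [^0-9]

\s matches any whitespace character; equivalent to [ \t\n\r\f\v]

\S matches any non-whitespace character; equivalent to [^ \t\n\r\f\v]

\w matches any alphanumeric character or underscore; equivalent to [a-zA-Z0-9_]

\W matches any character that is not alphanumeric or an underscore; equivalent to [^a-zA-Z0-9_]

\\ matches a literal "\"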
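As a quick illustration of the shorthand classes, the sketch below runs them over a made-up ASCII sample string. It assumes Python 3, where these classes also match non-ASCII digits, letters, and whitespace in str patterns unless re.ASCII is given; the bracket equivalents listed above describe the ASCII case.

```python
import re

text = 'user_01 paid 42 USD\ton 2013-12-26'

# \d+ collects runs of digits.
digits = re.findall(r'\d+', text)

# \w+ collects runs of letters, digits and underscores.
words = re.findall(r'\w+', text)

# \s matches each whitespace character, including the tab.
spaces = re.findall(r'\s', text)

# \D+ collects the runs between the digits.
non_digits = re.findall(r'\D+', 'a1b22c')

print(digits)
print(words)
print(spaces)
print(non_digits)
```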

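* matches the preceding item 0 or more times; equivalent to {0,}

+ matches the preceding item 1 or more times; equivalent to {1,}

? matches the preceding item 0 or 1 times; equivalent to {0,1}

{n,m} matches the preceding item at least n and at most m times

{m,} matches the preceding item m or more times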

Example: matching a phone number

>>> import re
>>> r1 = r"\d{3,4}-?\d{8}"
>>> re.findall(r1,'020-88776655')
['020-88776655']
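To see how the quantifiers in this pattern behave on other inputs, here is a sketch for Python 3 (re.fullmatch requires 3.4 or later). The helper name is_phone and the extra test numbers are assumptions made for illustration. It uses fullmatch so that the whole string has to fit the pattern, not just part of it.

```python
import re

phone = re.compile(r'\d{3,4}-?\d{8}')

def is_phone(text):
    """Return True if the whole string fits the phone pattern (hypothetical helper)."""
    return phone.fullmatch(text) is not None

results = {
    '020-88776655': is_phone('020-88776655'),     # 3-digit code, hyphen
    '0755-88776655': is_phone('0755-88776655'),   # 4-digit code, hyphen
    '02088776655': is_phone('02088776655'),       # hyphen is optional
    '20-88776655': is_phone('20-88776655'),       # code too short
    '020-887766551': is_phone('020-887766551'),   # one digit too many
}

for number, ok in results.items():
    print(number, ok)
```

Note that findall or search would still report a match inside '020-887766551', because they accept a match on part of the string.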

( ) grouping

Example: matching an email address

>>> email = r'\w{3}@\w+(\.com|\.net)'
>>> re.match(email,'abc@qq.com')
<_sre.SRE_Match object at 0x7f81fea30828>
>>> re.match(email,'bbb@163.net')
<_sre.SRE_Match object at 0x7f81fea470a8>
>>> re.match(email,'ccc@redhat.org')
>>>
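One consequence of grouping that is easy to trip over: when a pattern contains a capturing group, findall returns the group's text and not the whole match. The sketch below (Python 3; the non-capturing variant (?:...) is an addition not covered in the notes above) shows both behaviours.

```python
import re

text = 'abc@qq.com bbb@163.net ccc@redhat.org'

# With a capturing group, findall returns only what the group matched.
capturing = re.findall(r'\w{3}@\w+(\.com|\.net)', text)

# (?:...) groups without capturing, so findall returns the full matches.
non_capturing = re.findall(r'\w{3}@\w+(?:\.com|\.net)', text)

# On a match object, group(0) is the whole match and group(1) the first group.
m = re.match(r'\w{3}@\w+(\.com|\.net)', 'abc@qq.com')
whole, suffix = m.group(0), m.group(1)

print(capturing)       # ['.com', '.net']
print(non_capturing)   # ['abc@qq.com', 'bbb@163.net']
print(whole, suffix)
```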

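Compiling regular expressions

A regular expression is compiled into a `RegexObject` instance, which provides methods for different operations such as pattern matching, searching, or string substitution.

The re module provides an interface to the regular expression engine, letting you compile an RE string into an object and then use that object for matching. For example: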

>>> import re
>>> r1 = r"\d{3,4}-?\d{8}"
>>> p_tel = re.compile(r1)
>>> p_tel
<_sre.SRE_Pattern object at 0x7f81fead6ab0>
>>> p_tel.findall('020-88776655')
['020-88776655']
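A compiled pattern is most useful when the same expression is applied many times. The following sketch (Python 3; the input lines are invented for illustration) compiles the phone pattern once and reuses it across several strings.

```python
import re

p_tel = re.compile(r'\d{3,4}-?\d{8}')

lines = [
    'office: 020-88776655',
    'no number here',
    'home: 0755-12345678',
]

# The same compiled object is reused for every line.
found = [p_tel.findall(line) for line in lines]

# The original pattern text stays available on the compiled object.
source = p_tel.pattern

print(found)
print(source)
```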

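Greedy and non-greedy quantifiers

Regular expressions are typically used to find matching strings in text. In Python, quantifiers are greedy by default (in a few languages the default may be non-greedy): they try to match as many characters as possible. Non-greedy quantifiers do the opposite and try to match as few characters as possible. For example, the regular expression "ab*" applied to "abbbc" finds "abbb", whereas the non-greedy "ab*?" finds "a".

Repetitions such as * are "greedy": when repeating an RE, the matching engine tries to repeat it as many times as possible. If the rest of the pattern then fails to match, the engine backs up and tries again with fewer repetitions.

Non-greedy quantifiers: *?, +?, ?? or {m,n}?

Greedy quantifiers: *, +, ? and {m,n}, as in .*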
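The difference is easiest to see side by side. This sketch (Python 3) repeats the "ab*" example from the text and adds a tag-matching case, which is an illustrative assumption and not from the original notes.

```python
import re

# The example from the text: greedy takes every 'b', non-greedy takes none.
greedy_b = re.findall(r'ab*', 'abbbc')
lazy_b = re.findall(r'ab*?', 'abbbc')

html = '<b>one</b><b>two</b>'

# Greedy '.*' runs to the last '</b>' and swallows both elements.
greedy_tags = re.findall(r'<b>.*</b>', html)

# Non-greedy '.*?' stops at the first '</b>' it can.
lazy_tags = re.findall(r'<b>.*?</b>', html)

print(greedy_b, lazy_b)
print(greedy_tags)
print(lazy_tags)
```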

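Functions

match() determines whether the RE matches at the beginning of the string

search() scans the string for the first location where the RE matches

findall() finds all substrings the RE matches and returns them as a list

finditer() finds all substrings the RE matches and returns them as an iterator

If no match is found, match() and search() return None. On success they return a `MatchObject` instance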

>>> string_re = re.compile(r'pmghong')
>>> string_re.match('pmghong hello')
<_sre.SRE_Match object at 0x7f81fea28578>
>>> string_re.match('hello pmghong ')
>>>
>>> string_re.search('pmghong hello')
<_sre.SRE_Match object at 0x7f81fea285e0>
>>> string_re.search('hello pmghong')
<_sre.SRE_Match object at 0x7f81fea28578>

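As you can see, match only succeeds when the pattern occurs at the beginning of the string, whereas search finds it wherever it appears, whether at the beginning or the end.

In real programs, the most common approach is to store the `MatchObject` in a variable and then check whether it is None, typically like this: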

>>> string_re.match('pmghong hello')
<_sre.SRE_Match object at 0x7f81fea28648>
>>> x = string_re.match('pmghong hello')
>>> if x:
...     print 'OK'
...
OK
>>> string_re.match('hello pmghong')
>>> x = string_re.match('hello pmghong')
>>> if x:
...     print 'OK'
... else:
...     print 'Not OK'
...
Not OK
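finditer() is listed above but not demonstrated, so here is a minimal sketch in Python 3 that compares it with match() and search(). The sample text is made up, and the pattern is compiled without flags for simplicity.

```python
import re

string_re = re.compile(r'pmghong')
text = 'hello pmghong, bye pmghong'

# match() only looks at position 0, so it fails here.
at_start = string_re.match(text)

# search() returns the first match anywhere in the string.
first = string_re.search(text)

# finditer() yields one match object per occurrence, lazily.
positions = [m.span() for m in string_re.finditer(text)]

if first:
    print('first match at', first.start())
print(at_start)
print(positions)   # [(6, 13), (19, 26)]
```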

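Methods of the match object

group() returns the string matched by the RE

start() returns the starting position of the match

end() returns the ending position of the match

span() returns a tuple containing the (start, end) positions of the match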

>>> s = "hello python"
>>> r1 = r'hello'
>>> re.match(r1,s)
<_sre.SRE_Match object at 0x7f81fea285e0>
>>>
>>> x = re.match(r1,s)
>>> x.group()
'hello'
>>> x.start()
0
>>> x.end()
5
>>> x.span()
(0, 5)

group() returns the substring the RE matched. start() and end() return the indices where the match starts and ends. span() returns both indices together in a single tuple.
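These methods also accept a group number when the pattern contains parentheses. The sketch below (Python 3) adds two capturing groups to the phone pattern used earlier; the grouping and the sample text are assumptions made for illustration.

```python
import re

m = re.search(r'(\d{3,4})-?(\d{8})', 'tel: 020-88776655')

whole = m.group()        # same as m.group(0): the entire match
area = m.group(1)        # text of the first group
number = m.group(2)      # text of the second group
both = m.groups()        # all groups as a tuple
whole_span = m.span()    # (start, end) of the entire match
number_span = m.span(2)  # (start, end) of the second group

print(whole, area, number)
print(both)
print(whole_span, number_span)
```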

re.sub() replaces substrings

>>> s = "hello world"
>>> s.replace('world','boy')
'hello boy'
>>> s.replace('w...d','boy')
'hello world'
>>>
>>> rs = r'w...d'
>>> re.sub(rs,'boy','world would woked hello')
'boy boy boy hello'

replace() can substitute strings, but it does not support regular expressions; to replace text that matches a regular expression, use sub().
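Beyond a fixed replacement string, sub() also accepts backreferences, a replacement function, and a count limit. None of these appear in the notes above; the following Python 3 sketch is offered as an illustration, with made-up input strings.

```python
import re

# \1, \2, \3 in the replacement refer to the captured groups.
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', 'posted 2013-12-26')

# The replacement can be a function that receives each match object.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1 b22')

# count limits how many matches are replaced.
only_first = re.sub(r'w...d', 'boy', 'world would woked hello', count=1)

print(reordered)    # posted 26/12/2013
print(doubled)      # a2 b44
print(only_first)   # boy would woked hello
```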

re.subn()

>>> re.subn(rs,'boy','world would woked hello')
('boy boy boy hello', 3)

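This function also performs substitution; compared with sub(), its result carries one extra item at the end: the number of substitutions made

re.split() splits a string; unlike str.split(), it can use a regular expression as the separator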

>>> ip = '192.168.10.1'
>>> ip.split('.')
['192', '168', '10', '1']
>>> s = '111+222-333*444/555'
>>> re.split(r'[\+\-\*\/]',s)
['111', '222', '333', '444', '555']
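Two further behaviours of re.split() are worth a sketch (Python 3): a capturing group keeps the separators in the result, and maxsplit limits the number of splits. The comma-separated sample is an assumption added for illustration.

```python
import re

s = '111+222-333*444/555'

# Inside a character set only '-' needs escaping here.
parts = re.split(r'[+\-*/]', s)

# Wrapping the separator in a capturing group keeps it in the output.
with_ops = re.split(r'([+\-*/])', s)

# maxsplit stops after the given number of splits.
first_only = re.split(r'[+\-*/]', s, maxsplit=1)

# Separators may have variable width, which str.split cannot express.
fields = re.split(r'\s*,\s*', 'a , b,c ,d')

print(parts)
print(with_ops)
print(first_only)
print(fields)
```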

RE flags

re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations

>>> p = re.compile('ab*',re.IGNORECASE)

IGNORECASE, I: ignore case when matching

>>> string_re = re.compile(r'pmghong',re.I)
>>> string_re.findall('PMGHONG')
['PMGHONG']
>>> string_re.findall('pmghong')
['pmghong']
>>> string_re.findall('Pmghong')
['Pmghong']

DOTALL, S: makes "." match any character, including a newline

>>> r1 = r"baidu.com"
>>> re.findall(r1,'baidu.com')
['baidu.com']
>>> re.findall(r1,'baidu_com')
['baidu_com']
>>> re.findall(r1,'baidu com')
['baidu com']
>>> re.findall(r1,'baidu\ncom')
[]
>>> re.findall(r1,'baidu\ncom',re.S)
['baidu\ncom']
>>> re.findall(r1,'baidu\tcom',re.S)
['baidu\tcom']

As you can see, by default the "." metacharacter does not match a newline such as \n; to make it match, add the S flag
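A typical place where this matters is text that spans several lines. The sketch below (Python 3, with a made-up HTML fragment) shows the pattern failing without the flag, succeeding with re.S, and a flag-free alternative using [\s\S], which is an addition not mentioned in the notes.

```python
import re

html = '<p>line one\nline two</p>'

# Without DOTALL, '.' stops at the newline and nothing matches.
without_flag = re.findall(r'<p>.*</p>', html)

# With DOTALL, '.' crosses the newline.
with_flag = re.findall(r'<p>.*</p>', html, re.S)

# [\s\S] matches any character at all, so it works without the flag.
no_flag_needed = re.findall(r'<p>[\s\S]*</p>', html)

print(without_flag)
print(with_flag)
print(no_flag_needed)
```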

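MULTILINE, M: multi-line matching; affects $ and ^

For example, if I want to match the lines that begin with "hello" in a triple-quoted multi-line string, the regular expression on its own finds nothing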

>>> s = '''
... hello boy
... boys and girls
... hello girl
... what a nice day
... '''
>>> r1 = r'^hello'
>>> re.findall(r1,s)
[]

The reason is that the triple-quoted string is stored like this:

>>> s
'\nhello boy\nboys and girls\nhello girl\nwhat a nice day\n'

So the M flag is needed to match across multiple lines

>>> re.findall(r1,s,re.M)
['hello', 'hello']
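Flags can be combined with the | operator. As a sketch (Python 3; the capitalised "Hello" in the sample is changed from the original string to show the effect of re.I), the following captures whole lines and ignores case at the same time.

```python
import re

s = '''
Hello boy
boys and girls
hello girl
what a nice day
'''

# With re.M, '^' and '$' match at the start and end of every line.
whole_lines = re.findall(r'^hello.*$', s, re.M)

# Combining flags: multi-line anchors plus case-insensitive matching.
any_case = re.findall(r'^hello \w+$', s, re.I | re.M)

print(whole_lines)   # ['hello girl']
print(any_case)      # ['Hello boy', 'hello girl']
```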

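VERBOSE, X: enables verbose REs, which can be laid out more clearly and readably

Similarly, when a regular expression is too long, we can split it across several lines to make its structure clearer. But using such a pattern directly also fails, for the same reason as in the previous example: the triple-quoted string stores the \n characters as well.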

>>> tel = r'''
... \d{3,4}
... -?
... \d{8}
... '''
>>> re.findall(tel,'020-88776655')
[]
>>> tel
'\n\\d{3,4}\n-?\n\\d{8}\n'

The solution is to add the re.X flag

>>> re.findall(tel,'020-88776655',re.X)
['020-88776655']
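Verbose mode also allows comments inside the pattern, which is where it pays off most. The sketch below (Python 3) annotates the phone pattern and shows one caveat: because whitespace is ignored, a literal space has to be escaped. The comments and the "hello world" case are illustrative additions.

```python
import re

tel = re.compile(r'''
    \d{3,4}    # area code: 3 or 4 digits
    -?         # optional hyphen
    \d{8}      # subscriber number: 8 digits
''', re.X)

numbers = tel.findall('020-88776655')

# In verbose mode an unescaped space in the pattern is ignored...
ignored_space = re.search(r'hello world', 'hello world', re.X)

# ...so a literal space must be escaped (or written as [ ]).
escaped_space = re.search(r'hello\ world', 'hello world', re.X)

print(numbers)
print(ignored_space)
print(escaped_space.group())
```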

Attached is a reference table found online
