I. First, a passage from the online documentation on special characters

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB. This holds unless A or B contain low precedence operations; boundary conditions between A and B; or have numbered group references. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here. For details of the theory and implementation of regular expressions, consult the Friedl book referenced above, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further information and a gentler presentation, consult the Regular Expression HOWTO.

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so last matches the string 'last'. (In the rest of this section, we’ll write RE’s in this special style, usually without quotes, and strings to be matched 'in single quotes'.)

Some characters, like '|' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted. Regular expression pattern strings may not contain null bytes, but can specify the null byte using the \number notation, e.g., '\x00'.


The special characters are:

'.'

(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

'^'

(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

'$'

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.

'*'

Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

'+'

Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.

'?'

Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.

*?, +?, ??

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.

{m}

Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.

{m,n}

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match aaaab or a thousand 'a' characters followed by a b, but not aaab. The comma may not be omitted or the modifier would be confused with the previously described form.

{m,n}?

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.

'\'

Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.

If you’re not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn’t recognized by Python’s parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it’s highly recommended that you use raw strings for all but the simplest expressions.
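The doubling rule is easier to see in a small experiment. The sketch below is illustrative only (the variable names and the sample path are made up for this note): it compares a plain literal with a raw literal for a pattern that matches one literal backslash.

```python
import re

# Without a raw string, four backslashes are needed to express a pattern
# that matches ONE literal backslash; a raw string needs only two.
plain = '\\\\'
raw = r'\\'
same_pattern = (plain == raw)      # both are the two-character string \\

# The subject is the 7-character string C:\temp (one real backslash).
subject = 'C:\\temp'
found = re.search(raw, subject)
backslash_pos = found.start()      # index of the backslash in the subject

print(same_pattern, len(subject), backslash_pos)
```

Writing every pattern as a raw string avoids having to work out which of the two escaping layers (Python literal or RE parser) consumes each backslash.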

[]

Used to indicate a set of characters. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. Special characters are not active inside sets. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; [a-z] will match any lowercase letter, and [a-zA-Z0-9] matches any letter or digit. Character classes such as \w or \S (defined below) are also acceptable inside a range, although the characters they match depend on whether ASCII or LOCALE mode is in force. If you want to include a ']' or a '-' inside a set, precede it with a backslash, or place it as the first character. The pattern []] will match ']', for example.

You can match the characters not within a range by complementing the set. This is indicated by including a '^' as the first character of the set; '^' elsewhere will simply match the '^' character. For example, [^5] will match any character except '5', and [^^] will match any character except '^'.

Note that inside [] the special forms and special characters lose their meanings and only the syntaxes described here are valid. For example, +, *, (, ), and so on are treated as literals inside [], and backreferences cannot be used inside [].
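The set rules above can be checked directly. This is a small sketch with made-up sample strings, not part of the quoted documentation:

```python
import re

# '$' is an ordinary character inside a set.
dollar_hits = re.findall(r'[akm$]', 'make $5')

# A ']' placed first in the set is taken literally.
bracket = re.match(r'[]]', ']').group()

# '^' complements the set only when it comes first; elsewhere it is literal.
not_five = re.findall(r'[^5]', '1525')
caret_literal = re.findall(r'[5^]', '5^6')

# Operators such as + * ( ) lose their special meaning inside [].
operators = re.findall(r'[+*()]', '(1+2)*3')

print(dollar_hits, bracket, not_five, caret_literal, operators)
```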

'|'

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].

(?...)

This is an extension notation (a '?' following a '(' is not meaningful otherwise). The first character after the '?' determines what the meaning and further syntax of the construct is. Extensions usually do not create a new group; (?P<name>...) is the only exception to this rule. Following are the currently supported extensions.

(?aiLmsux)

(One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.

Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.

(?:...)

A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

(?P<name>...)

Similar to regular parentheses, but the substring matched by the group is accessible within the rest of the regular expression via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named. So the group named id in the example below can also be referenced as the numbered group 1.

For example, if the pattern is (?P<id>[a-zA-Z_]\w*), the group can be referenced by its name in arguments to methods of match objects, such as m.group('id') or m.end('id'), and also by name in the regular expression itself (using (?P=id)) and replacement text given to .sub() (using \g<id>).
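As a sketch of the three ways of referring to a named group mentioned above (the sample strings are invented for illustration):

```python
import re

ident = re.compile(r'(?P<id>[a-zA-Z_]\w*)')

# 1. By name (or by number) on the match object.
m = ident.search('  total_1 = 3')
by_name = m.group('id')
by_number = m.group(1)          # a named group is also a numbered group
end_pos = m.end('id')

# 2. By name inside the pattern itself, using (?P=id).
doubled = re.search(r'(?P<id>[a-zA-Z_]\w*) (?P=id)', 'say hello hello world')
doubled_text = doubled.group()

# 3. By name in replacement text passed to sub(), using \g<id>.
wrapped = ident.sub(r'<\g<id>>', 'a b')

print(by_name, by_number, end_pos, doubled_text, wrapped)
```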

(?P=name)

Matches whatever text was matched by the earlier group named name.

(?#...)

A comment; the contents of the parentheses are simply ignored.

(?=...)

Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.

(?!...)

Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.

(?<=...)

Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in abcdef, since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Note that patterns which start with positive lookbehind assertions will never match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:


>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'


This example looks for a word following a hyphen:


>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'

(?<!...)

Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

(?(id/name)yes-pattern|no-pattern)

Will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn’t. no-pattern is optional and can be omitted. For example, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$) is a poor email matching pattern, which will match with '<user@host.com>' as well as 'user@host.com', but not with '<user@host.com' nor 'user@host.com>'.
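The conditional pattern from the last entry can be tried against the four strings the documentation mentions. A minimal sketch (the list and variable names are illustrative):

```python
import re

# (?(1)>|$): if group 1 (the optional '<') took part in the match,
# require a closing '>'; otherwise require the end of the string.
email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

candidates = ['<user@host.com>', 'user@host.com',
              '<user@host.com', 'user@host.com>']
results = [bool(email.match(s)) for s in candidates]

print(results)
```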

The special sequences consist of '\' and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character. For example, \$ matches the character '$'.

\number

Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'the end' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.

\A

Matches only at the start of the string.

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa). By default Unicode alphanumerics are the ones used, but this can be changed by using the ASCII flag. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.

\B

Matches the empty string, but only when it is not at the beginning or end of a word. This is just the opposite of \b, so word characters are Unicode alphanumerics or the underscore, although this can be changed by using the ASCII flag.

\d

For Unicode (str) patterns: Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also many other digit characters. If the ASCII flag is used only [0-9] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [0-9] may be a better choice).

For 8-bit (bytes) patterns: Matches any decimal digit; this is equivalent to [0-9].

\D

Matches any character which is not a Unicode decimal digit. This is the opposite of \d. If the ASCII flag is used this becomes the equivalent of [^0-9] (but the flag affects the entire regular expression, so in such cases using an explicit [^0-9] may be a better choice).

\s

For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [ \t\n\r\f\v] may be a better choice).

For 8-bit (bytes) patterns: Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].

\S

Matches any character which is not a Unicode whitespace character. This is the opposite of \s. If the ASCII flag is used this becomes the equivalent of [^ \t\n\r\f\v] (but the flag affects the entire regular expression, so in such cases using an explicit [^ \t\n\r\f\v] may be a better choice).

\w

For Unicode (str) patterns: Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).

For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].

\W

Matches any character which is not a Unicode word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_] (but the flag affects the entire regular expression, so in such cases using an explicit [^a-zA-Z0-9_] may be a better choice).

\Z

Matches only at the end of the string.
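A few of the special sequences above are easier to judge from a quick experiment. The following is only a sketch; the sample strings (including the Arabic-Indic digit) were chosen for this note.

```python
import re

# \number: the second word must repeat whatever group 1 matched.
repeated = [bool(re.match(r'(.+) \1', s))
            for s in ('the the', '55 55', 'the end')]

# \d on a str pattern is Unicode-aware unless re.ASCII is given.
arabic_indic_three = '\u0663'   # ARABIC-INDIC DIGIT THREE, category Nd
unicode_digit = bool(re.fullmatch(r'\d', arabic_indic_three))
ascii_digit = bool(re.fullmatch(r'\d', arabic_indic_three, re.ASCII))

# \b only matches where a word character meets a non-word character.
whole_words = re.findall(r'\bfoo\b', 'foo foobar (foo) foo3 foo.')

print(repeated, unicode_digit, ascii_digit, whole_words)
```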

Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:


\a      \b      \f      \n
\r      \t      \v      \x
\\


Octal escapes are included in a limited form: If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.


II. Testing



#encoding=utf-8

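############# Regular Expressions ######################

#Regular expressions (REs) provide the foundation for advanced text pattern matching and for
#search-and-replace operations. An RE is a string made up of ordinary characters and special
#symbols that describes those characters and how they may repeat, so a single pattern can
#match a whole set of strings that share similar features.

#Python supports regular expressions through the re module of the standard library.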

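##*************** Part 1: Special Symbols and Characters Used in Regular Expressions **************

## 1.1 Matching more than one RE pattern with the pipe symbol ( | )
#The pipe symbol ( | ), the vertical bar on your keyboard, expresses alternation: it selects one of the regular expressions separated by the pipe.

#David|Dai --> matches David or Dai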


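## 1.2 Matching any single character ( . )
#The dot, or period ( . ), matches any single character except the newline (NEWLINE). (Python REs have
#a compile flag, S or DOTALL, that lifts this restriction so that ( . ) also matches NEWLINEs.)
#Whether it is a letter, a digit, whitespace other than "\n", a printable or non-printable character,
#or a symbol, the dot ( . ) can match it.
#
#RE pattern     Strings matched
#f.o            Any character between "f" and "o", e.g. fao, f9o, f#o
#..             Any two characters
#.end           Any single character followed by the string end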
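
#A short sketch of the dot and the DOTALL flag described in 1.2 (the sample strings are
#illustrative and not part of the original notes):

```python
import re

# By default the dot refuses to match a newline ...
dot_default = re.match(r'f.o', 'f\no')
# ... but with re.DOTALL (alias re.S) it matches any character at all.
dot_all = re.match(r'f.o', 'f\no', re.DOTALL)

# The dot happily matches a space: '.end' finds ' end' inside 'The end.'
before_end = re.search(r'.end', 'The end.').group()

# '..' needs two characters, so a one-character string cannot match.
two_chars = [bool(re.fullmatch(r'..', s)) for s in ('a', 'ab')]

print(dot_default, dot_all.group() == 'f\no', repr(before_end), two_chars)
```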


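## 1.3 Matching at the start or end of a string, or at word boundaries ( ^ / $ / \b / \B )
#Some symbols and special characters anchor a pattern to the start or the end of a string. To match
#a pattern at the beginning of a string, use the caret ( ^ ) or the special sequence \A (a backslash
#followed by a capital A). The latter is mainly for keyboards that lack a caret key, such as some
#international keyboards. Likewise, the dollar sign ( $ ) or \Z matches (with zero width) at the end
#of a string; $ also matches just before a trailing newline, whereas \Z matches only at the very end.

#RE pattern       Strings matched
#^From            Any string that starts with From
#/bin/tcsh$       Any string that ends with /bin/tcsh
#^Subject: hi$    Any string consisting solely of Subject: hi

#The special sequences \b and \B match word boundaries. \b matches at a word boundary: in a pattern
#such as \bthe, the text that follows must begin a word, whether the word is preceded by other
#characters (it sits in the middle of a string) or by nothing (it sits at the start of a line).
#\B, by contrast, matches only where there is no word boundary, so \Bthe matches only inside a word.
#Some examples:

#RE pattern       Strings matched
#the              Any string containing "the"
#\bthe            Any word that starts with "the"
#\bthe\b          Only the word "the"
#\Bthe            Any string that contains "the" where "the" does not begin a word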


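## 1.4 Creating character classes ( [ ] )
#A regular expression in square brackets matches any one of the characters inside the brackets. Some examples:
#RE pattern          Strings matched
#b[aeiu]t            bat, bet, bit, but
#[cr][23][dp][o2]    A four-character string: first "c" or "r", then "2" or "3",
#                    then "d" or "p", and finally "o" or "2", e.g. c2do, r3p2, r2d2, c3po.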

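## 1.5 Specifying ranges ( - ) and negation ( ^ )
#Besides single characters, square brackets also support character ranges. A hyphen ( - ) between a
#pair of characters inside the brackets denotes a range, e.g. A-Z, a-z, or 0-9 for uppercase letters,
#lowercase letters, and decimal digits. The range follows character-code (ordinal) order, so it is not
#limited to letters and digits. In addition, if the first character after the left bracket is a caret
#( ^ ), the class matches any character that is NOT in the given set.

#RE pattern          Characters matched
#z.[0-9]             "z", followed by any one character, followed by one decimal digit
#[r-u][env-y][us]    One of "r", "s", "t", "u", followed by one of "e", "n", "v", "w", "x", "y", followed by "u" or "s"
#[^aeiou]            One non-vowel character (exercise: why do we say "non-vowel" rather than "consonant"?)
#[^\t\n]             Any one character other than TAB and NEWLINE
#["-a]               On ASCII systems, any one character whose ordinal lies between '"' and "a", i.e. between 34 and 97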
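
#The rows of the table in 1.5 can be checked with a few calls. This is only a sketch; the
#test strings are made up for this note:

```python
import re

# [^aeiou] matches ANY non-vowel, including spaces and digits,
# which is why it is called "non-vowel" rather than "consonant".
non_vowels = re.findall(r'[^aeiou]', 'regex 101')

# ["-a] covers ordinals 34 ('"') through 97 ('a'), inclusive.
in_range = [ch for ch in '!"Aab' if re.fullmatch(r'["-a]', ch)]
ordinals = [ord(ch) for ch in in_range]

# Three classes in a row: one char from each, in order.
combos = re.findall(r'[r-u][env-y][us]', 'rnu, sxs, abc')

print(non_vowels, in_range, ordinals, combos)
```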

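## 1.6 Multiple occurrence / repetition with closure operators ( *, +, ?, {} )
#The special symbols "*", "+", and "?" match a pattern that occurs once, several times, or not at all.
#The asterisk (star operator) matches zero or more occurrences of the RE immediately to its left.
#The plus operator ( + ) matches one or more occurrences of the RE to its left (it is also known as the
#positive closure operator), and the question mark operator ( ? ) matches zero or one occurrence.

#There is also the brace operator ( { } ), which takes either a single value or a comma-separated pair.
#A single value, {N}, matches exactly N occurrences; a pair, {M,N} (no space after the comma), matches from M to N occurrences.
#Any of these symbols can be escaped with a backslash to strip its special meaning, e.g. "\*" matches a literal asterisk.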
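
#A minimal sketch of the closure operators from 1.6, using fullmatch() so that the whole
#string has to fit the pattern (the sample strings are illustrative):

```python
import re

subjects = ['a', 'ab', 'abbb']

# ab*  : 'a' followed by zero or more 'b'
star = [bool(re.fullmatch(r'ab*', s)) for s in subjects]
# ab+  : 'a' followed by at least one 'b'
plus = [bool(re.fullmatch(r'ab+', s)) for s in subjects]
# ab?  : 'a' followed by at most one 'b'
optional = [bool(re.fullmatch(r'ab?', s)) for s in subjects]

# {N} and {M,N}: exact and bounded repetition (greedy by default).
exactly_three = bool(re.fullmatch(r'\d{3}', '123'))
two_to_four = re.match(r'a{2,4}', 'aaaaaa').group()

# A backslash turns the operator back into an ordinary character.
literal_star = re.search(r'\*', '2*3').group()

print(star, plus, optional, exactly_three, two_to_four, literal_star)
```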



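## 1.7 Special characters that denote character sets
#Some special sequences stand for whole character sets. For example, instead of the range "0-9" you
#can write the shorthand "\d" for decimal digits. Another sequence, "\w", denotes the alphanumeric
#set: with the ASCII flag (or in bytes patterns) it is shorthand for "A-Za-z0-9_", while for str
#patterns in Python 3 it also covers Unicode letters and digits. "\s" stands for whitespace. The
#uppercase form of each sequence negates it, e.g. "\D" matches any non-digit character (equivalent
#to "[^0-9]" in ASCII mode), and so on.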
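
#A small sketch of the shorthand classes from 1.7, including the effect of re.ASCII on \w
#(the sample strings are assumptions made for this note):

```python
import re

line = 'id_42\tok'

digits = re.findall(r'\d+', line)        # runs of decimal digits
words = re.findall(r'\w+', line)         # runs of word characters
blanks = re.findall(r'\s', line)         # single whitespace characters
non_digits = re.findall(r'\D+', line)    # uppercase form negates the class

# On str patterns \w is Unicode-aware by default; re.ASCII narrows it
# to [a-zA-Z0-9_], so the accented letter is no longer a word character.
cafe = 'caf\u00e9'
unicode_words = re.findall(r'\w+', cafe)
ascii_words = re.findall(r'\w+', cafe, re.ASCII)

print(digits, words, blanks, non_digits, unicode_words, ascii_words)
```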


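## 1.8 Building groups with parentheses ( () )
#A pair of parentheses ( () ) used with a regular expression can do either (or both) of the following:
# group regular expressions
# match subgroups
#
# Sometimes you need to group regular expressions. One good example is when you want to match a
#string against two alternative REs. Another reason is to apply a repetition operator to a whole RE
#(rather than to a single character or a single character class).
# A side benefit of parentheses is that the substring they match is saved as a subgroup for later
#use. These subgroups can be referred to again within the same match or search, or extracted for further processing.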
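
#The two uses of parentheses described in 1.8 (grouping and saving subgroups) might look like
#this in practice. A sketch only; the patterns and strings are invented for illustration:

```python
import re

# Grouping: the '+' applies to the whole group 'ab', not just to 'b'.
repeated = re.fullmatch(r'(ab)+', 'ababab')
repeated_ok = repeated is not None

# Saving: each pair of parentheses stores what it matched.
greeting = re.match(r'Mr(s)?\. (\w+)', 'Mrs. Smith')
saved = greeting.groups()

# Reuse within the same match: \1 must equal what group 1 captured.
tag = re.match(r'<(\w+)>.*</\1>', '<b>bold</b>')
tag_name = tag.group(1)
mismatched = re.match(r'<(\w+)>.*</\1>', '<b>bold</i>')

print(repeated_ok, saved, tag_name, mismatched)
```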



##*************** Part 2: Regular Expressions and Python **************
#Python's default regular expression module is the re module.

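## 2.1 The re module: core functions and methods

#Common regular expression functions and methods

#Function/Method                       Description
#####re module function only
#compile(pattern, flags=0)             Compile the RE pattern with the optional flags and return a regex object
#
#####re module functions and regex object methods
#match(pattern, string, flags=0)       Try to match pattern at the start of string, with optional flags; return a match object on success, otherwise None
#search(pattern, string, flags=0)      Find the first occurrence of pattern in string, with optional flags; return a match object on success, otherwise None
#findall(pattern, string, flags=0)     Find all non-overlapping occurrences of pattern in string; return a list of the matched strings (tuples if the pattern has several groups)
#finditer(pattern, string, flags=0)    Like findall(), but return an iterator instead of a list; it yields one match object per match
#split(pattern, string, maxsplit=0)    Split string into a list at each match of pattern, performing at most maxsplit splits (by default, split at every match)
#sub(pattern, repl, string, count=0)   Replace every match of pattern in string with repl; if count is not given, replace all matches
#
#####Match object methods
#group(num=0)                          Return the entire match (or the subgroup numbered num)
#groups()                              Return a tuple of all matched subgroups (an empty tuple if the pattern has no groups)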
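
#finditer() is the one function in the table that the later sections do not exercise. A short
#sketch contrasting it with findall(); the sentence is borrowed from the findall() example
#further down, everything else is illustrative:

```python
import re

text = 'carry the barcardi to the car'

# findall() returns plain strings ...
found = re.findall(r'car', text)

# ... while finditer() yields match objects, so positions are available.
spans = [(m.group(), m.start()) for m in re.finditer(r'car', text)]

# With more than one group, findall() returns tuples of the groups.
pairs = re.findall(r'(\w)(\d)', 'a1 b2')

print(found, spans, pairs)
```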


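#Core note: compiling REs (when should you use compile()?)
#Python code is ultimately compiled to bytecode before the interpreter executes it. Using a precompiled
#code object is faster than using a string, because the interpreter must compile the string into a
#code object before running it.
#The same idea applies to regular expressions: before any matching can happen, the pattern must be compiled into a regex object.
#Because an RE is typically compared many times during execution, we strongly recommend precompiling
#it; and since compilation is required anyway, doing it ahead of time to improve performance is the
#sensible choice. re.compile() provides this capability.
#In fact the module functions cache compiled objects, so not every search() or match() call with the
#same pattern has to compile it again. Even so, precompiling still saves the cache lookup and the
#overhead of calling the function repeatedly with the same string.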


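## 2.2 Compiling regular expressions with compile()

#Most re module functions are also available as methods of regex objects. Note that although we recommend precompiling, it is not required. If you compile, use the methods; if you do not, use the functions.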
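
#A minimal sketch of the "compile once, call methods many times" style described in 2.2. The
#pattern and the list of lines are assumptions made for this note:

```python
import re

# Compile once ...
word_re = re.compile(r'\bfoo\w*')

lines = ['food on the table', 'seafood', 'a foo bar']

# ... then use the regex object's methods instead of the module functions.
matched = [bool(word_re.match(s)) for s in lines]

searched = []
for s in lines:
    m = word_re.search(s)
    searched.append(m.group() if m else None)

# The module-level function gives the same answer without compiling first.
same_result = re.search(r'\bfoo\w*', 'a foo bar').group()

print(matched, searched, same_result)
```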


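## 2.3 Match objects and the group() and groups() methods
#Besides regex objects there is another object type when working with REs: the match object. It is
#what a successful call to match() or search() returns. Match objects have two main methods: group() and groups().
#
#group() returns either the entire match or a specific subgroup on request. groups() simply returns
#a tuple containing the one subgroup or all of them. If the RE has no subgroups, groups() returns an
#empty tuple while group() still returns the entire match.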
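
#The difference between group() and groups() from 2.3, sketched on a two-character string
#(the patterns are illustrative only):

```python
import re

# No subgroups: group() is the whole match, groups() is empty.
plain = re.match(r'ab', 'ab')
plain_group, plain_groups = plain.group(), plain.groups()

# Two subgroups: group(n) picks one, groups() returns them all.
two = re.match(r'(a)(b)', 'ab')
whole, first, both = two.group(), two.group(1), two.groups()

# Nested subgroups are numbered by their opening parenthesis.
nested = re.match(r'(a(b))', 'ab').groups()

# A failed match returns None, which has neither method.
failed = re.match(r'x', 'ab')

print(plain_group, plain_groups, whole, first, both, nested, failed)
```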


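## 2.4 Matching strings with match()
#match() tries to match the pattern at the beginning of the string. On success it returns a match
#object; on failure it returns None. The match object's group() method shows what was matched.

#import re
#m = re.match('foo', 'food on the table') # match succeeds
#print(m.group())
#-->
#foo
#More concisely (this raises AttributeError if the match fails, because None has no group()):
#print(re.match('foo', 'food on the table').group())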


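## 2.5 Finding a pattern within a string with search() (searching vs. matching)
#The pattern you are looking for is more likely to appear somewhere in the middle of a string than at
#its very beginning. This is where search() comes in. It works like match(), except that it looks for
#the pattern at any position in the string. It returns a match object if the search succeeds and None otherwise.

#import re
#m = re.match('foo', 'seafood') # no match
#if m is not None:
#    print(m.group())
##Nothing is printed: the match fails. match() tries to match at the start of the string, so the "f" of the pattern is tried against the leading "s", which cannot succeed.
#
##Use search() instead. search() looks for the first occurrence of the pattern rather than matching only at the start. Strictly speaking, search() scans from left to right.
#m = re.search('foo', 'seafood') # use search() instead
#if m is not None:
#    print(m.group())
#-->foo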

## 2.6 Matching more than one string ( | )

#import re
#bt = 'bat|bet|bit' # RE pattern: bat, bet, bit
#m = re.match(bt, 'bat') # 'bat' is a match
#if m is not None:
#    print(m.group())
#-->bat

## 2.7 Matching any single character ( . )
#The dot cannot match a newline or a non-character (i.e., the empty string).

#import re
#anyend = '.end'
#m = re.match(anyend, 'bend') # dot matches 'b'
#if m is not None: print(m.group())
#
#m = re.match(anyend, 'end') # no char to match
#if m is not None: print(m.group())
#
#m = re.match(anyend, '\nend') # any char except \n
#if m is not None: print(m.group())
#
#m = re.search('.end', 'The end.') # matches ' ' in search
#if m is not None: print(m.group())


#To search for a real dot (decimal point), escape it with a backslash in the RE so that it loses its special meaning:

#import re
#patt314 = '3.14' # RE dot
#pi_patt = r'3\.14' # literal dot (dec. point)
#m = re.match(pi_patt, '3.14') # exact match
#if m is not None: print(m.group())
#
#m = re.match(patt314, '3014') # dot matches '0'
#if m is not None: print(m.group())


## 2.8 Creating character classes ( [ ] )
#import re
#m = re.match('[cr][23][dp][o2]', 'c3po') # matches 'c3po'
#if m is not None: print(m.group())
#
#m = re.match('r2d2|c3po', 'c2do') # does not match 'c2do'
#if m is not None: print(m.group())
#
#m = re.match('r2d2|c3po', 'r2d2') # matches 'r2d2'
#if m is not None: print(m.group())


## 2.9 Repetition, special characters, and subgroups
#The most common aspects of REs involve special characters, repeated occurrences of patterns, and parentheses that group and extract parts of the match.
#import re
#patt = r'\w+@(\w+\.)*\w+\.com'
#print(re.match(patt, 'nobody@www.xxx.yyy.zzz.com').group())
#nobody@www.xxx.yyy.zzz.com
#m = re.match(r'(\w\w\w)-(\d\d\d)', 'abc-123')
#print(m.group()) # entire match
#abc-123
#print(m.group(1)) # subgroup 1
#abc
#print(m.group(2)) # subgroup 2
#123
#print(m.groups()) # all subgroups
#('abc', '123')


## 2.10 Matching at the start and end of strings and on word boundaries
#match() always matches from the start of the string, so the examples below use search().

#import re
#m = re.search('^The', 'The end.') # match
#if m is not None: print(m.group())
#
#m = re.search('^The', 'end. The') # not at beginning
#if m is not None: print(m.group())
#
#m = re.search(r'\bthe', 'bite the dog') # at a boundary
#if m is not None: print(m.group())
#
#m = re.search(r'\bthe', 'bitethe dog') # no boundary
#if m is not None: print(m.group())
#
#m = re.search(r'\Bthe', 'bitethe dog') # no boundary
#if m is not None: print(m.group())


## 2.11 Finding every occurrence with findall()
#findall() looks for all non-overlapping occurrences of an RE pattern in a string. It is like search()
#in that it performs a string search, but unlike match() and search() it always returns a list: an
#empty list if nothing matches, otherwise a list of all matches (in left-to-right order).

#import re
#print(re.findall('car', 'car'))
#['car']
#print(re.findall('car', 'scary'))
#['car']
#print(re.findall('car', 'carry the barcardi to the car'))
#['car', 'car', 'car']


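## 2.12 Searching and replacing with sub() [and subn()]
#Two functions/methods perform search-and-replace: sub() and subn(). They are almost identical: both
#replace every part of a string that matches the RE pattern. The replacement is usually a string, but
#it can also be a function that returns the replacement string. subn() is the same as sub() but also
#returns the number of substitutions made; the new string and the count are returned together as a tuple.

#import re
#print(re.sub('X', 'Dave', 'attn: X\n\nDear X,\n'))
#-->
#attn: Dave
#
#Dear Dave,

#print(re.subn('X', 'Dave', 'attn: X\n\nDear X,\n'))
#-->
#('attn: Dave\n\nDear Dave,\n', 2)


#print(re.sub('[ae]', 'X', 'abcdef'))
#-->
#XbcdXf

#print(re.subn('[ae]', 'X', 'abcdef'))
#-->('XbcdXf', 2)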
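
#Section 2.12 mentions that the replacement may be a function, but only shows string
#replacements. A hedged sketch of the function form follows; the helper name double() and the
#sample sentence are made up for this note:

```python
import re

def double(match):
    """Return the matched number, doubled, as a string."""
    return str(int(match.group()) * 2)

# sub() calls the function once per match and inserts its return value.
doubled = re.sub(r'\d+', double, '3 apples and 10 pears')

# count limits how many matches are replaced.
first_only = re.sub(r'[ae]', 'X', 'abcdef', count=1)

# A string replacement can refer to saved subgroups with \1, \2, ...
swapped = re.sub(r'(\w+)@(\w+)', r'\2@\1', 'user@host')

print(doubled, first_only, swapped)
```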


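## 2.13 Splitting (on a delimiting pattern) with split()
#The split() function of the re module and method of regex objects works like the string split()
#method, except that it splits on an RE pattern rather than on a fixed string, which makes it
#considerably more powerful. If you do not want to split at every match, pass a non-zero maxsplit
#argument to limit the number of splits.
#If the delimiter uses no special symbols to match multiple patterns, re.split() behaves the same as str.split().

#import re
#print(re.split(':', 'str1:str2:str3'))
#-->
#['str1', 'str2', 'str3']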
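
#A sketch of what re.split() adds over str.split(): a pattern as the delimiter and a limit on
#the number of splits (the input strings are illustrative):

```python
import re

# One pattern covers commas and semicolons with optional spaces around them.
fields = re.split(r'\s*[,;]\s*', 'a , b;c ;  d')

# maxsplit=1 splits only at the first match.
limited = re.split(r':', 'str1:str2:str3', maxsplit=1)

# A capturing group in the pattern keeps the delimiters in the result.
with_seps = re.split(r'(:)', 'a:b')

# With a plain delimiter the result equals str.split().
same_as_str = re.split(':', 'str1:str2:str3') == 'str1:str2:str3'.split(':')

print(fields, limited, with_seps, same_as_str)
```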


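#Core note: using Python raw strings
#Raw strings exist largely because of regular expressions: ASCII characters and RE special characters
#conflict. For example, "\b" is the backspace character in ASCII, but "\b" is also an RE special
#sequence meaning "match a word boundary".
#For the RE compiler to see the two characters "\b" rather than a backspace, you must escape the
#backslash with another backslash, i.e. write "\\b", or simply use a raw string and write r'\b'.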
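
#The "\b" conflict from the note above, as a runnable sketch (the word 'blow' is just a sample):

```python
import re

# '\b' in a normal literal is ONE character: the ASCII backspace.
plain_len = len('\b')
# r'\b' is TWO characters: a backslash and the letter b.
raw_len = len(r'\b')

# The pattern below starts with a backspace, which 'blow' does not contain.
backspace = re.search('\bblow', 'blow')
# Doubling the backslash or using a raw string yields the word boundary.
escaped = re.search('\\bblow', 'blow')
raw = re.search(r'\bblow', 'blow')

print(plain_len, raw_len, backspace, escaped.group(), raw.group())
```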




##*************** Part 3: Regular Expression Examples **************

## Matching a string
import re
data = 'Thu Feb 15 17:46:04 2012::uzifzf@dpyivihw.gov::1171590364-6-8'

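#We want to extract the day of the week from the data above.
#Method 1:

#patt = '^(Mon|Tue|Wed|Thu|Fri|Sat|Sun)'
#m = re.match(patt, data)
#print(m.group()) # entire match
#print(m.group(1)) # subgroup 1
#-->
#Thu
#Thu

#Method 2:
#patt = r'^(\w){3}'
#m = re.match(patt, data)
#if m is not None:
#    print(m.group())
#    print(m.group(1))
#-->
#Thu
#u

#Subgroup 1 shows only "u" because its contents are overwritten by each successive character: that
#is, m.group(1) starts out as "T", then becomes "h", and is finally replaced by "u". These are three
#separate (and repeated) groups, each holding a single alphanumeric character, rather than one group
#holding three consecutive alphanumeric characters.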
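
#One possible way to capture all three characters is to move the repetition inside the
#parentheses. This is a sketch added for comparison, not part of the original example:

```python
import re

data = 'Thu Feb 15 17:46:04 2012::uzifzf@dpyivihw.gov::1171590364-6-8'

# (\w){3}: the group is repeated, so it keeps only its last iteration.
repeated_group = re.match(r'^(\w){3}', data)
last_char = repeated_group.group(1)

# (\w{3}): the repetition is inside the group, so it keeps all three.
grouped_repeat = re.match(r'^(\w{3})', data)
weekday = grouped_repeat.group(1)

print(repeated_group.group(), last_char, weekday)
```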