2. Regular Expressions

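A regular expression is a powerful text-processing tool for matching, searching, replacing, or extracting specific patterns in strings. It is built from ordinary characters and special characters (called “metacharacters”); the metacharacters carry special meanings and define the matching rules.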

The practice file sample.txt contains the following (a ^M at the end of a line marks a Windows-style carriage return):

[root@RHEL7-1 ~]# pwd
/root
[root@RHEL7-1 ~]# cat  /root/sample.txt
"Open Source" is a good mechanism to develop programs.
apple is my favorite food.
Football game is not use feet only.
this dress doesn't fit me.
However, this dress is about $ 3183 dollars.^M
GNU is free air not free beer.^M
Her hair is very beauty.^M
I can't finish the test.^M
Oh! The soup taste good.^M
motorcycle is cheap than car.
This window is clear.
the symbol '*' is represented as start.
Oh!     My god!
The gd software is a library for drafting programs.^M
You are the best is mean you are the no. 1.
The world <Happy> is the same with "glad".
I like dog.
google is the best tools for search keyword.
goooooogle yes!
go! go! Let's go.
# I am Bobby

(1) Finding a specific string.

Suppose we want to pick out the specific string “the” from sample.txt. The simplest way is:

[root@RHEL7-1 ~]# grep  -n  'the'  /root/sample.txt
8:I can't finish the test.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
18:google is the best tools for search keyword.

What about inverting the selection, so that a line is displayed only when it does not contain “the”?

[root@RHEL7-1 ~]# grep -vn 'the' /root/sample.txt

To find “the” regardless of case, run:

[root@RHEL7-1 ~]# grep  -in  'the'  /root/sample.txt
8:I can't finish the test.
9:Oh! The soup taste good.
12:the symbol '*' is represented as start.
14:The gd software is a library for drafting programs.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
18:google is the best tools for search keyword.
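
The three options used so far (-n, -v, -i) can be tried on a throwaway file. The following is only a sketch: the file contents are made up for illustration, and -c (print a count of matching lines instead of the lines themselves) is a standard grep option that the examples above do not use.

```shell
# Build a small temporary file so the example does not depend on /root/sample.txt.
demo=$(mktemp)
printf '%s\n' 'I can finish the test.' 'The soup tastes good.' 'I like dog.' > "$demo"

match_count=$(grep -c 'the' "$demo")     # lines containing lowercase "the"
nocase_count=$(grep -ci 'the' "$demo")   # lines containing "the" in any case
other_lines=$(grep -vn 'the' "$demo")    # numbered lines WITHOUT lowercase "the"

echo "$match_count $nocase_count"        # 1 2
echo "$other_lines"                      # lines 2 and 3
rm -f "$demo"
```

Note that -vn still numbers lines by their position in the file, not by their position in the output.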

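(2) Using square brackets [] to search for a set of characters.

Comparing the words “test” and “taste” shows that they share the form “t?st”. They can be searched for like this: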

[root@RHEL7-1 ~]# grep  -n  't[ae]st'  /root/sample.txt
8:I can't finish the test.
9:Oh! The soup taste good.

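No matter how many characters appear inside [], the brackets stand for exactly one character, so the example above asks for the string tast or test. To search for lines containing “oo”, use: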

[root@RHEL7-1 ~]# grep  -n  'oo'  /root/sample.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!

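Suppose lines where “oo” is preceded by “g” should not be shown. The negated set [^] inside the brackets does this. Note that it only rejects a match in which g comes immediately before oo; a line is still listed if it contains some other oo preceded by a character that is not g, as in “tools” on line 18 and the run of o's in “goooooogle” on line 19: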

[root@RHEL7-1 ~]# grep  -n  '[^g]oo'  /root/sample.txt
2:apple is my favorite food.
3:Football game is not use feet only.
18:google is the best tools for search keyword.
19:goooooogle yes!
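
To see which text on lines 18 and 19 actually matched, it helps to print only the matched part. The sketch below assumes a grep that supports -o (GNU and BSD grep do; it is not part of the examples above). The two input lines are copied from sample.txt.

```shell
# -o prints only the text matched by the pattern, one match per line.
matched_tools=$(printf '%s\n' 'google is the best tools for search keyword.' | grep -o '[^g]oo')

# "goooooogle" produces more than one match, so keep only the first.
matched_long=$(printf '%s\n' 'goooooogle yes!' | grep -o '[^g]oo' | head -n 1)

echo "$matched_tools"   # too  (from "tools", not from "google")
echo "$matched_long"    # ooo  (an o, not the g, precedes "oo")
```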

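Suppose oo must not be preceded by a lowercase letter. This could be written as [^abcd....z]oo, but that is inconvenient. Because lowercase letters are encoded consecutively in ASCII, the set can be shortened to a range: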

[root@RHEL7-1 ~]# grep  -n  '[^a-z]oo'  sample.txt
3:Football game is not use feet only.

What if the set should cover both digits and English letters? Write them all together: [a-zA-Z0-9]. For example, to get the lines that contain a digit:

[root@RHEL7-1 ~]# grep  -n  '[0-9]'  /root/sample.txt
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.

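Because the locale can affect the encoding order, ranges written with the hyphen “-” are not the only option. The following forms give the same results as the previous two tests: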

[root@RHEL7-1 ~]# grep  -n  '[^[:lower:]]oo'  /root/sample.txt
#  [:lower:] stands for the lowercase letters, i.e. a-z
[root@RHEL7-1 ~]# grep  -n  '[[:digit:]]'  /root/sample.txt
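
As a small check that the range form and the class form agree, the sketch below runs both on a made-up word list. LC_ALL=C is set for the range form as an assumption, so that a-z means exactly the ASCII lowercase letters regardless of the system locale.

```shell
words=$(printf '%s\n' 'Football' 'food' 'good' '2oo')

# Range form: pin the locale so a-z is the plain ASCII range.
by_range=$(printf '%s\n' "$words" | LC_ALL=C grep '[^a-z]oo')

# Class form: [:lower:] names the lowercase letters directly.
by_class=$(printf '%s\n' "$words" | grep '[^[:lower:]]oo')

echo "$by_range"   # Football and 2oo
echo "$by_class"   # the same two lines
```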

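(3) Line-start and line-end anchors ^ and $.

Earlier we found lines containing “the” anywhere. What if a line should be listed only when “the” appears at the start of the line?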

[root@RHEL7-1 ~]# grep   -n   '^the'   /root/sample.txt
12:the symbol '*' is represented as start.

To list the lines that begin with a lowercase letter:

[root@RHEL7-1 ~]# grep   -n  '^[a-z]'   /root/sample.txt
2:apple is my favorite food.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
12:the symbol '*' is represented as start.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.

To list the lines that do not begin with an English letter:

[root@RHEL7-1 ~]# grep  -n  '^[^a-zA-Z]'  /root/sample.txt
1:"Open Source" is a good mechanism to develop programs.
21:# I am Bobby
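
The two patterns '^[a-z]' and '^[^a-zA-Z]' are easy to confuse, so here is a minimal sketch on four made-up lines. LC_ALL=C is an assumption added here so that the ranges cover only ASCII letters.

```shell
lines=$(printf '%s\n' 'the start' 'The start' '# note' '9 lives')

# ^ outside the brackets anchors to line start; [a-z] is one lowercase letter.
starts_lower=$(printf '%s\n' "$lines" | LC_ALL=C grep -c '^[a-z]')

# The second ^ is inside the brackets, so it negates the set:
# "line start, then one character that is not a letter".
starts_nonalpha=$(printf '%s\n' "$lines" | LC_ALL=C grep '^[^a-zA-Z]')

echo "$starts_lower"      # 1
echo "$starts_nonalpha"   # "# note" and "9 lives"
```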

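Note: the “^” symbol means different things inside and outside the character-set brackets []. Inside [] it negates the set; outside [] it anchors the match to the start of the line.

Conversely, how would you find the lines that end with a period (.)?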

[root@RHEL7-1 ~]# grep  -n  '\.$'  /root/sample.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
11:This window is clear.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
17:I like dog.
18:google is the best tools for search keyword.
20:go! go! Let's go.

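Note: the period has another meaning in regular expressions (introduced below), so the escape character (\) is needed to remove its special meaning. You may wonder why lines 5 through 9 are not printed even though they also end with “.”. This comes down to how Windows software writes line breaks. Displaying lines 5 through 10 with cat -A makes the difference visible (the -A option of cat shows non-printing characters and marks the end of each line with “$”):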

[root@RHEL7-1 ~]# cat  -An  /root/sample.txt | head -n  10  | tail -n  6
     5  However, this dress is about $ 3183 dollars.^M$
     6  GNU is free air not free beer.^M$
     7  Her hair is very beauty.^M$
     8  I can't finish the test.^M$
     9  Oh! The soup taste good.^M$
    10  motorcycle is cheap than car.$

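This shows that lines 5 through 9 end with the Windows line break (^M$), whereas a normal Linux line ends as line 10 does ($). That is why lines 5 through 9 were not found: the character just before the end of the line is the carriage return, not the period. This should make the meaning of “^” and “$” clear.

How can you find blank lines, that is, lines with no input at all?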
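
One way to make '\.$' find Windows-style lines as well is to strip the carriage returns before grep sees them. This is only a sketch under assumptions: the three-line file is invented, and tr -d '\r' is one of several possible ways to remove the carriage returns.

```shell
demo=$(mktemp)
# Line 1 ends Unix-style, line 2 Windows-style (\r\n), line 3 has no period.
printf 'unix line.\nwindows line.\r\nno dot\n' > "$demo"

plain=$(grep -c '\.$' "$demo")                      # the CR hides the period on line 2
stripped=$(tr -d '\r' < "$demo" | grep -c '\.$')    # CR removed first

echo "$plain $stripped"   # 1 2
rm -f "$demo"
```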

[root@RHEL7-1 ~]# grep  -n  '^$'  /root/sample.txt
22:

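Tip: in a shell script or a configuration file, blank lines and lines beginning with # carry no settings. When you print such a file for reference, you can leave them out to save paper. How? Take /etc/rsyslog.conf as an example and compare the results below (the -v option prints every line except those that match the pattern):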

[root@RHEL7-1 ~]# cat  -n  /etc/rsyslog.conf
# The output has 91 lines, many of them blank lines or comment lines starting with #
 
[root@RHEL7-1 ~]# grep  -v  '^$'  /etc/rsyslog.conf | grep  -v  '^#'
# Only 10 lines remain. The first “-v '^$'” drops the blank lines;
# the second “-v '^#'” drops the lines that start with #
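
The two-stage pipeline can also be written as a single grep call by giving both patterns with -e. The sketch below compares the two forms on a made-up configuration file; the option names in it are hypothetical and do not come from rsyslog.conf.

```shell
conf=$(mktemp)
printf '%s\n' '# a comment' '' 'option = 1' '# another comment' 'option2 = 2' > "$conf"

# Two greps chained with a pipe, as in the text.
piped=$(grep -v '^$' "$conf" | grep -v '^#')

# One grep: with -v, a line is kept only if it matches none of the -e patterns.
single=$(grep -v -e '^$' -e '^#' "$conf")

echo "$single"   # option = 1 and option2 = 2
rm -f "$conf"
```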

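(4) Any single character “.” and the repetition character “*”.

In the shell, the wildcard “*” stands for any string of zero or more characters, but regular expressions are not wildcards, and the two behave differently. In a regular expression, “.” means “exactly one character, whatever it is”. The two symbols have the following meanings in regular expressions.

• . (period): matches any one character.

• * (asterisk): repeats the preceding character zero or more times; it always works in combination with the character before it.

Suppose you need strings of the form “g??d”: four characters, starting with “g” and ending with “d”. You can do this: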

[root@RHEL7-1 ~]# grep  -n  'g..d'  /root/sample.txt
1:"Open Source" is a good mechanism to develop programs.
9:Oh! The soup taste good.
16:The world <Happy> is the same with "glad".

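Because exactly two characters must sit between g and d, god on line 13 and gd on line 14 are not listed. What if you want oo, ooo, oooo, and so on, that is, at least two o's in a row? Should the pattern be o*, oo*, or ooo*?

Because * means “repeat the preceding RE character zero or more times”, “o*” means “an empty string, or one or more o's”.

Note: because the empty string is allowed (the o may or may not be there), “grep -n 'o*' sample.txt” lists every line.

What about “oo*”? The first o must be present, and the second o, being followed by *, stands for zero or more further o's. Any line containing o, oo, ooo, oooo, and so on is therefore listed.

By the same reasoning, “a string of at least two o's” needs ooo*: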

[root@RHEL7-1 ~]# grep  -n  'ooo*'  /root/sample.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!
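
The difference between o*, oo*, and ooo* is easiest to see by counting matching lines on a tiny word list. The list below is made up for this sketch; -c prints the number of matching lines.

```shell
words=$(printf '%s\n' 'gd' 'god' 'good' 'goood')

any=$(printf '%s\n' "$words" | grep -c 'o*')     # zero or more o: every line matches
one=$(printf '%s\n' "$words" | grep -c 'oo*')    # at least one o
two=$(printf '%s\n' "$words" | grep -c 'ooo*')   # at least two o's in a row

echo "$any $one $two"   # 4 3 2
```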

What if the string must start and end with g, with nothing but o's, at least one of them, between the two g's (gog, goog, gooog, and so on)?

[root@RHEL7-1 ~]# grep  -n  'goo*g'  sample.txt
18:google is the best tools for search keyword.
19:goooooogle yes!

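To find strings that start with g and end with g with anything in between, use the any-character “.” and write “g.*g”. Since “*” repeats the preceding character zero or more times and “.” is any character, “.*” stands for zero or more arbitrary characters.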

[root@RHEL7-1 ~]# grep  -n  'g.*g'  /root/sample.txt
1:"Open Source" is a good mechanism to develop programs.
14:The gd software is a library for drafting programs.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.

What about finding the lines that contain a number? A number is one digit followed by zero or more further digits, so do this:

[root@RHEL7-1 ~]# grep  -n  '[0-9][0-9]*'  /root/sample.txt
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.
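
To pull out the numbers themselves rather than the lines that contain them, the same pattern can be combined with -o. This is a sketch that assumes a grep supporting -o (GNU and BSD grep do); the first input line is copied from sample.txt and the second is a shortened version of line 15.

```shell
# -o prints each match on its own line instead of the whole input line.
numbers=$(printf '%s\n' 'However, this dress is about $ 3183 dollars.' \
                        'You are the no. 1.' | grep -o '[0-9][0-9]*')

echo "$numbers"   # 3183 then 1
```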

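(5) Limiting the repeat count of an RE character with {}.

What if you want to limit how many times a character repeats? For example, how do you find a run of 2 to 5 consecutive o's? This calls for the interval braces {}. In the basic regular expressions that grep uses by default, an unescaped “{” or “}” is an ordinary character, so the braces must be written as “\{” and “\}” to act as an interval. With grep -E (extended regular expressions) they are written without the backslash.

As a first exercise, to find the lines that contain two o's in a row: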

[root@RHEL7-1 ~]# grep  -n  'o\{2\}'  /root/sample.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!

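This looks no different from ooo*, because line 19, with its long run of o's, still appears: any run of more than two o's also contains two in a row. So try another search string. To find g followed by 2 to 5 o's and then another g: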

[root@RHEL7-1 ~]# grep  -n  'go\{2,5\}g'  /root/sample.txt
18:google is the best tools for search keyword.

Line 19 is not selected, because it has 6 o's. What if you want goooo....g with 2 or more o's? Besides gooo*g, you can also write:

[root@RHEL7-1 ~]# grep  -n  'go\{2,\}g'  /root/sample.txt
18:google is the best tools for search keyword.
19:goooooogle yes!
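
The same interval can be written in either regular-expression flavor. The sketch below runs the basic form (escaped braces) and the extended form (grep -E, plain braces) on a made-up word list and shows that they select the same lines.

```shell
words=$(printf '%s\n' 'gog' 'goog' 'gooooog' 'goooooog')   # 1, 2, 5 and 6 o's

# Basic regular expression: the interval is written \{min,max\}.
bre=$(printf '%s\n' "$words" | grep 'go\{2,5\}g')

# Extended regular expression: the interval is written {min,max}.
ere=$(printf '%s\n' "$words" | grep -E 'go{2,5}g')

echo "$bre"   # goog and gooooog
echo "$ere"   # the same two lines
```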

A typical use of the null device /dev/null is to discard the error messages produced by commands such as find or grep:

[root@RHEL7-1 ~]# grep delegate  /etc/* 2>/dev/null
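
The effect of 2>/dev/null can be reproduced without touching /etc. In this sketch the file names are temporary and hypothetical: one file exists and contains the search word, the other does not exist, so grep would normally print an error for it on standard error.

```shell
demo=$(mktemp)
echo 'delegate = yes' > "$demo"
missing="${demo}.does-not-exist"   # hypothetical path that is never created

# The hit goes to standard output; the "No such file" message goes to
# /dev/null. grep still exits non-zero because one file was unreadable,
# so "|| true" keeps a script running under "set -e" from stopping here.
out=$(grep delegate "$demo" "$missing" 2>/dev/null) || true

echo "$out"   # <temporary file name>:delegate = yes
rm -f "$demo"
```

Only standard error is redirected, so genuine matches are still shown.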