Table of Contents
- Advanced Python Series (4): Text Processing, Word Frequency Counting
- I. Requirements
- II. Code Analysis
- III. Improvements
- IV. Notes
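Advanced Python Series (4): Text Processing, Word Frequency Counting
This article works through a concrete text-processing example to explain some common text operations.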
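I. Requirements
First, the requirements:
Given a text file dream.txt:
1. Read the file.
2. Remove all punctuation and newline characters, and convert uppercase to lowercase.
3. Merge identical words, count how often each occurs, and sort by frequency in descending order.
4. Write the result to output.txt.
The text is as follows: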
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.
I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.
I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight, and the glory of the Lord shall be revealed, and all flesh shall see it together.
This is our hope. . . With this faith we will be able to hew out of the mountain of despair a stone of hope. With this faith we will be able to transform the jangling discords of our nation into a beautiful symphony of brotherhood. With this faith we will be able to work together, to pray together, to struggle together, to go to jail together, to stand up for freedom together, knowing that we will be free one day. . . .
And when this happens, and when we allow freedom ring, when we let it ring from every village and every hamlet, from every state and every city, we will be able to speed up that day when all of God's children, black men and white men, Jews and Gentiles, Protestants and Catholics, will be able to join hands and sing in the words of the old Negro spiritual: "Free at last! Free at last! Thank God Almighty, we are free at last!"
The code:
import re

def parse(text):
    # Use a regular expression to replace punctuation and newlines with spaces
    text = re.sub(r'[^\w ]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Build the list of all words
    word_list = text.split(' ')
    # Remove empty strings
    word_list = filter(None, word_list)
    # Build a dictionary mapping each word to its frequency
    word_cnt = {}
    for word in word_list:
        if word not in word_cnt:
            word_cnt[word] = 0
        word_cnt[word] += 1
    # Sort by frequency, highest first
    sorted_word_cnt = sorted(word_cnt.items(), key=lambda kv: kv[1], reverse=True)
    return sorted_word_cnt

# File names follow the requirements above: dream.txt in, output.txt out
with open('dream.txt', 'r') as fin:
    text = fin.read()

word_and_freq = parse(text)

with open('output.txt', 'w') as fout:
    for word, freq in word_and_freq:
        fout.write('{} {}\n'.format(word, freq))
########## Output (long intermediate results omitted) ##########
and 15
be 13
will 11
to 11
the 10
of 10
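II. Code Analysis
Two points are worth noting here:
1. Removing punctuation and newlines with a regular expression.
re.sub(r'[^\w ]', ' ', text)
re.sub() replaces every match of a pattern in a string. Here it takes three arguments:
- The pattern to match, that is, the content to be replaced.
- The replacement string.
- The text to process.
Regular expression: r'[^\w ]'
It matches punctuation and newline characters so that they can be replaced.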
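[ABC]
: matches any single character listed inside [ ].
[^ABC]
: matches any single character not listed inside [ ].
\w
: matches letters, digits, and the underscore. For ASCII text this is equivalent to [A-Za-z0-9_]; with Python 3 str patterns it also matches Unicode letters and digits.
Note: there is a space after \w inside the brackets. Without it, the spaces between words would be matched as well. That is harmless here because the replacement is itself a space, but it would join words together if the replacement were an empty string.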
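As a small illustration of the pattern described above, the following sketch applies it to a short string. The `sample` text is made up for demonstration and is not part of the original exercise.

```python
import re

# Hypothetical sample text, invented for this illustration.
sample = "Free at last!\nThank God Almighty, we are free."

# Every character that is neither a word character nor a space
# (here: punctuation and the newline) is replaced by a single space.
cleaned = re.sub(r'[^\w ]', ' ', sample)
print(repr(cleaned))

# Splitting on a single space leaves an empty string wherever two
# spaces are adjacent; these are the items that must be removed later.
pieces = cleaned.lower().split(' ')
print(pieces)
```

The doubled spaces in `cleaned` come from a punctuation mark followed by a newline or a space, which is why the word list needs a cleanup step after splitting.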
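2. Removing empty strings with filter()
word_list = filter(None, word_list)
filter() takes a function as its first argument and an iterable as its second. When the first argument is None, every element that evaluates to false is dropped. Here those elements are the empty strings '' that split(' ') produces wherever two spaces are adjacent, for example where a punctuation mark or a newline was replaced by a space. In Python 3, filter() returns an iterator.
Example with a custom predicate function: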
def is_odd(n):
    return n % 2 == 1

tmplist = filter(is_odd, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
newlist = list(tmplist)
print(newlist)  # [1, 3, 5, 7, 9]
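The example above passes a predicate function. For comparison, here is a minimal sketch of the `filter(None, ...)` form used in the word-count code, applied to a made-up string (the `words` list is an assumption for illustration).

```python
# Hypothetical word list containing empty strings, as produced by
# splitting on single spaces when two spaces are adjacent.
words = "i have  a dream ".split(' ')
print(words)

# With None as the first argument, filter() keeps only truthy items,
# so the empty strings are dropped.
non_empty = list(filter(None, words))
print(non_empty)

# In Python 3, filter() returns a lazy iterator that can be consumed once.
lazy = filter(None, words)
first_pass = list(lazy)
second_pass = list(lazy)  # the iterator is already exhausted
print(first_pass, second_pass)
```

Because the iterator is single-use, wrap it in `list()` if the filtered words are needed more than once.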
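III. Improvements
from collections import defaultdict
import re

d = defaultdict(int)
# The with statement closes the file automatically
with open("dream.txt", mode="r", encoding="utf-8") as f:
    # Iterating over the file reads one line at a time
    for line in f:
        # Note: this version splits on whitespace only and does not strip punctuation
        for word in filter(lambda x: x, re.split(r"\s", line)):
            # str.lower() returns a new string, so the result must be assigned
            word = word.lower()
            d[word] += 1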
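1. defaultdict from the collections module supplies a default value for missing keys, which avoids the KeyError that Python raises when a dictionary is accessed with a key that does not exist. With int as the factory, the default value is 0.
2. A lambda expression replaces None as the first argument of filter(). lambda x: x returns each element unchanged, so falsy elements such as empty strings are still dropped, exactly as with None.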
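The two points above can be combined with the punctuation removal and sorting from the first version. The following is only a sketch under stated assumptions: the function name `count_words` and the `sample_lines` list are hypothetical, and the pattern uses `\s` instead of a literal space so that `str.split()` can handle all whitespace.

```python
import re
from collections import defaultdict

def count_words(lines):
    """Return (word, count) pairs sorted by descending count."""
    counts = defaultdict(int)  # missing keys start at 0
    for line in lines:
        # Replace anything that is not a word character or whitespace.
        line = re.sub(r'[^\w\s]', ' ', line).lower()
        # split() with no argument splits on runs of whitespace and
        # never yields empty strings, so no filter() call is needed.
        for word in line.split():
            counts[word] += 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical sample lines standing in for the lines of a file.
sample_lines = ["I have a dream today.\n", "I have a dream that one day...\n"]
result = count_words(sample_lines)
for word, freq in result:
    print(word, freq)
```

Since the function accepts any iterable of lines, an open file object can be passed in directly, which keeps the line-by-line reading of the improved version.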
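IV. Notes
1. Every open() call should be paired with a close() call. With the with statement there is no need to call close() explicitly; it is called automatically when the block exits.
2. If the text is very large, avoid reading it all at once, since that may exhaust memory. Reading the file line by line keeps memory use small.
3. Wrap file I/O and type conversions (for example str to int) in error handling such as try/except where possible.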
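The three notes above can be sketched together as follows. The helper names `count_words_in_file` and `to_int`, the default value, and the temporary sample file are all assumptions made for this illustration; they are not part of the original code.

```python
import os
import tempfile
from collections import defaultdict

def count_words_in_file(path):
    """Return a word -> count dict, or None if the file cannot be read."""
    counts = defaultdict(int)
    try:
        # Note 1: the with statement closes the file automatically.
        with open(path, 'r', encoding='utf-8') as fin:
            # Note 2: iterating over the file reads one line at a time,
            # so the whole file never has to fit in memory.
            for line in fin:
                for word in line.lower().split():
                    counts[word] += 1
    except OSError as err:
        # Note 3: handle I/O failures such as a missing file.
        print('could not read {}: {}'.format(path, err))
        return None
    return dict(counts)

def to_int(value, default=0):
    """Convert a string to int, falling back to a default on bad input."""
    try:
        return int(value)
    except ValueError:
        return default

# Usage with a temporary file created only for demonstration.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.txt')
    with open(sample_path, 'w', encoding='utf-8') as fout:
        fout.write('free at last\nfree at last\n')
    found = count_words_in_file(sample_path)
    missing = count_words_in_file(os.path.join(tmp_dir, 'missing.txt'))

print(found, missing)
print(to_int('42'), to_int('forty-two'))
```

Returning None on failure is just one possible design; re-raising the exception or logging it may suit a larger program better.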