Today's Learning Content
1. Learn Python's composite data types: the set type, sequence types (tuple and list), and the dictionary type
2. Using these three kinds of types, write code that computes basic statistical values (a sketch follows this list)
3. Install the jieba library and get familiar with its functions
4. Combining the jieba library with the composite data types above, implement word-frequency statistics for text: using the English text of Hamlet and the Chinese text of Romance of the Three Kingdoms, count the most frequent English words and the most frequent character names, respectively
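As a minimal sketch of item 2, here is one way to compute basic statistical values (mean, variance, median) over a list of numbers; the function names and sample data are illustrative choices of my own, since the notes do not include this code:

# basic_stats.py -- illustrative sketch for item 2
def mean(numbers):
    return sum(numbers) / len(numbers)

def variance(numbers):
    m = mean(numbers)                 # average of the squared deviations from the mean
    return sum((x - m) ** 2 for x in numbers) / len(numbers)

def median(numbers):
    s = sorted(numbers)               # sort a copy, then take the middle element(s)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

numbers = [2, 4, 4, 4, 5, 5, 7, 9]    # sample data
print("mean: {:.2f}".format(mean(numbers)))       # 5.00
print("variance: {:.2f}".format(variance(numbers)))  # 4.00
print("median: {:.2f}".format(median(numbers)))   # 4.50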
Composite Data Types
Set: an unordered collection of unique, immutable elements, written with curly braces, e.g. {1, 2, 3}
Sequence: an ordered collection of elements accessed by integer index; tuples (immutable) and lists (mutable) are the two main sequence types
Dictionary: a mapping from keys to values, e.g. {"name": "value"}; keys must be unique and immutable
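A quick demonstration of all three categories (the variable names and sample values are my own, added for illustration):

# composite_types.py -- demo of set, sequence, and dictionary types
s = {1, 2, 3, 3}             # set: the duplicate 3 is dropped automatically
print(s)                     # {1, 2, 3}

t = ("a", "b", "c")          # tuple: an immutable sequence
lst = list(t)                # list: a mutable sequence built from the tuple
lst.append("d")
print(t, lst)                # ('a', 'b', 'c') ['a', 'b', 'c', 'd']

d = {"apple": 3, "pear": 5}  # dictionary: key-value mapping
d["peach"] = 1               # insert a new key
print(d.get("apple", 0))     # 3; .get() returns a default for missing keys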
Installing the jieba Library
jieba is an excellent third-party library for Chinese word segmentation.
Installation
Press Win+R to open the cmd console, then run the command pip install jieba to install the library.
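If the default PyPI index is slow, pip's -i option can point at a mirror; the Tsinghua mirror below is one common choice (an optional alternative, not part of the steps above):
pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple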
Usage
jieba offers three segmentation modes: precise mode, which splits text into an exact, non-overlapping word sequence (jieba.lcut(s)); full mode, which lists every possible word, with overlaps (jieba.lcut(s, cut_all=True)); and search-engine mode, which runs precise mode and then re-splits long words (jieba.lcut_for_search(s)). jieba.add_word(w) adds a new word to the dictionary.
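A quick demo of the three modes (the sample sentence is my own choice):

# jieba_demo.py -- compare the three segmentation modes
import jieba

s = "中华人民共和国是一个伟大的国家"
print(jieba.lcut(s))                  # precise mode: exact, non-overlapping split
print(jieba.lcut(s, cut_all=True))    # full mode: every possible word, with overlaps
print(jieba.lcut_for_search(s))       # search-engine mode: long words re-split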
Text Word Frequency Statistics
The download address for the texts is shown in the figure (figure omitted):
After opening the URL, right-click the page and choose "Save As" to store it as a local .txt file.
Word Frequency Statistics for Hamlet
Based on the .txt text of Hamlet, count the most frequently occurring English words.
Output
Source code
#CalHamlet.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()                          # normalize case so "The" and "the" count together
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")             # replace every special symbol in the text with a space
    return txt
hamletTxt = getText()
words = hamletTxt.split()                      # split on whitespace into a list of words
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1     # .get() returns 0 for words not seen yet
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)   # sort descending by the second element of each (word, count) pair
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Character Frequency Statistics for Romance of the Three Kingdoms
Based on the .txt text of Romance of the Three Kingdoms, count the ten characters who appear most frequently.
Output
Source code
#CalThreeKingdoms.py
import jieba
txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
# excludes is a stop-word set: words that jieba produces frequently
# but that could look like names without actually being character names
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "商议", "如何", "主公", "军士", \
            "左右", "军马", "引兵", "次日", "大喜", "天下", "东吴", "于是", "今日", "不敢", "魏兵", \
            "陛下", "一人", "都督", "人马", "不知", "汉中"}
words = jieba.lcut(txt)                        # precise-mode segmentation into a word list
counts = {}
for word in words:
    if len(word) == 1:                         # skip single characters; names are at least two
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"                          # merge aliases of the same character
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)                     # pop with a default avoids a KeyError if a word never appeared
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)   # sort descending by count
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Note: this post is a record of my own day-to-day study. It is not for any commercial use, so please understand that reposting is not supported! If you also have an interest in and some understanding of Python, feel free to reach out to me anytime to chat~