How to import the text file for Hamlet word-frequency counting in Python: a Python method for counting word frequency in Hamlet

Reposted

mob64ca140088a9 2023-11-28 08:48:46


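What I studied today

1. Learn Python's composite data types: set types, sequence types (tuples and lists), and dictionary types
2. Using these three kinds of types, write code that computes basic statistics
3. Install the jieba library and get familiar with its functions
4. Using jieba and the composite data types, implement word-frequency counting on text: for the English text of Hamlet and the Chinese text of Romance of the Three Kingdoms, find the most frequent English words and the most frequently mentioned characters, respectively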
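
Item 2 in the list above (basic statistics) can be sketched as follows. This is a minimal illustration, not the blogger's own code: the function names `mean`, `sample_std`, and `median` and the sample list `data` are assumptions made for the example.

```python
def mean(numbers):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(numbers) / len(numbers)


def sample_std(numbers):
    """Sample standard deviation (divides by n - 1); needs at least two values."""
    m = mean(numbers)
    squared_deviations = sum((x - m) ** 2 for x in numbers)
    return (squared_deviations / (len(numbers) - 1)) ** 0.5


def median(numbers):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(numbers)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample data
print("mean:", mean(data))
print("sample std:", sample_std(data))
print("median:", median(data))
```

A list is used here because the data must be ordered and may contain duplicates, which rules out a set.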

Composite data types

Sets

(Five images on set types appeared here; they are not preserved.)
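
Since the illustrations for sets are missing, here is a short sketch of the set operations that matter most for this kind of work. The variable names and values are made up for the example.

```python
# A set holds unique, unordered elements.
a = {1, 2, 3, 4}
b = {3, 4, 5}

union = a | b          # elements in a or b
intersection = a & b   # elements in both a and b
difference = a - b     # elements in a but not in b
symmetric = a ^ b      # elements in exactly one of a and b

# A common use: removing duplicates from a sequence.
letters = set("pypy123")  # each distinct character appears once

a.add(10)      # add one element
a.discard(99)  # remove if present; no error when the element is missing

print(union, intersection, difference, symmetric)
print(sorted(letters))
```

The word-frequency code later in this post uses a set for its exclusion list, because only membership matters there, not order.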

Sequences

(Four images on sequence types, that is tuples and lists, appeared here; they are not preserved.)
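
As a stand-in for the missing illustrations, the sketch below shows the operations shared by all sequences (indexing, slicing, `len`) and the difference between an immutable tuple and a mutable list. The sample values are invented for the example.

```python
# Tuples are immutable sequences.
creatures = ("cat", "dog", "tiger", "human")

first = creatures[0]              # indexing
last_two = creatures[-2:]         # slicing
reversed_creatures = creatures[::-1]

try:
    creatures[0] = "bird"         # tuples cannot be modified in place
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True

# Lists are mutable sequences.
animals = list(creatures)
animals.append("bird")            # add to the end
animals.sort()                    # sort in place, alphabetically

print(first, last_two, reversed_creatures)
print(animals, len(animals))
```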

Dictionaries

(Two images on dictionary types appeared here; they are not preserved.)
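
In place of the missing illustrations, here is a small sketch of the dictionary operations that the word-frequency programs rely on, in particular `dict.get` with a default value. The sample data is invented for the example.

```python
# A dictionary maps keys to values.
capitals = {"China": "Beijing", "France": "Paris"}

capitals["Japan"] = "Tokyo"                     # add or overwrite an entry
has_france = "France" in capitals               # membership test on keys
missing = capitals.get("Brazil", "unknown")     # default instead of KeyError

# The counting idiom used throughout this post:
counts = {}
for ch in "hello":
    counts[ch] = counts.get(ch, 0) + 1          # start from 0 on first sight

# items() yields (key, value) pairs, which can be turned into a list and sorted.
pairs = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

print(capitals, has_france, missing)
print(pairs)
```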

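Installing the jieba library

jieba is an excellent third-party library for Chinese word segmentation.

Installation

Press Win+R and run cmd to open a console, then enter the command pip install jieba to install the library.

Usage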

(Two images on jieba usage appeared here; they are not preserved.)
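
The usage images are missing, so the sketch below shows jieba's list-returning segmentation functions: `jieba.lcut` (precise mode), `jieba.lcut` with `cut_all=True` (full mode), `jieba.lcut_for_search` (search-engine mode), and `jieba.add_word`. The wrapper `segment_demo` is a hypothetical helper written for this example, and whether the original images covered exactly these functions is an assumption.

```python
def segment_demo(sentence, new_word=None):
    """Segment `sentence` with jieba's three list-returning modes.

    The import sits inside the function so that the module can be loaded
    even when jieba (a third-party package: pip install jieba) is absent.
    """
    import jieba

    if new_word:
        jieba.add_word(new_word)  # register a word the default dictionary may lack

    return {
        "precise": jieba.lcut(sentence),                # no overlapping words
        "full": jieba.lcut(sentence, cut_all=True),     # every word jieba can find
        "search": jieba.lcut_for_search(sentence),      # precise mode, long words split further
    }
```

Calling `segment_demo("中国是一个伟大的国家")` returns a dictionary with one token list per mode. Precise mode is the one used for the word-frequency count later in this post.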

Text word-frequency counting

The download address for the texts was given in a screenshot, which is not preserved here.

After opening that page, right-click and save it as a local txt file.

Word-frequency counting for Hamlet

Using the txt file of Hamlet, find the English words that occur most often.

Result

(Screenshot of the program output; not preserved.)

Source code

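# CalHamlet.py
def getText():
    # hamlet.txt must be in the working directory; UTF-8 is an assumption about how it was saved
    with open("hamlet.txt", "r", encoding="utf-8") as f:
        txt = f.read()
    txt = txt.lower()  # fold case so that "The" and "the" count as the same word
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")  # replace every special character with a space
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1  # get() returns 0 the first time a word is seen

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort (word, count) pairs by count, largest first
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))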
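
The same count can also be written with the standard library's `collections.Counter`. The sketch below is an alternative, not the blogger's method: the function `top_words` and the one-line `sample` are made up for the example, and it works on a string rather than a file so that it runs without hamlet.txt. Note that `string.punctuation` also contains the apostrophe, which the program above leaves in place.

```python
import string
from collections import Counter


def top_words(text, n=10):
    """Return the n most common words in `text` as (word, count) pairs."""
    # Map every ASCII punctuation character to a space in a single pass.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return Counter(words).most_common(n)


sample = "To be, or not to be: that is the question."  # stand-in for the full play
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

To run it on the whole play, pass the contents of hamlet.txt to `top_words` instead of `sample`.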

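Character frequency in Romance of the Three Kingdoms

Using the txt file of Romance of the Three Kingdoms, find the 10 characters mentioned most often.

Result

(Screenshot of the program output; not preserved.)

Source code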

# CalThreeKingdoms.py
import jieba

with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# excludes holds frequent words that jieba treats as words but that are not character names
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "商议", "如何", "主公", "军士",
            "左右", "军马", "引兵", "次日", "大喜", "天下", "东吴", "于是", "今日", "不敢", "魏兵",
            "陛下", "一人", "都督", "人马", "不知", "汉中"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:  # skip single characters such as punctuation and particles
        continue
    # merge the different names used for the same character
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # pop with a default avoids a KeyError if the word never occurred
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, largest first
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
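
The chain of elif branches above can also be expressed as a lookup table, which makes it easier to add more aliases. The sketch below is one possible refactoring, not the blogger's code: `ALIASES`, `count_names`, and the short `tokens` list are hypothetical, and the tokens are written by hand so that the example runs without jieba or the novel's text.

```python
# Map each alias (as produced by the segmenter) to one canonical name.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}


def count_names(tokens, excludes=()):
    """Count tokens after merging aliases, skipping single characters and excluded words."""
    counts = {}
    for token in tokens:
        if len(token) == 1:
            continue
        name = ALIASES.get(token, token)  # fall back to the token itself
        counts[name] = counts.get(name, 0) + 1
    for word in excludes:
        counts.pop(word, None)  # safe even if the word never occurred
    return counts


# Hand-written stand-in for the output of jieba.lcut(txt).
tokens = ["玄德", "曰", "孔明", "诸葛亮", "将军", "玄德曰", "关公"]
result = count_names(tokens, excludes={"将军", "荆州"})
for name, count in sorted(result.items(), key=lambda kv: kv[1], reverse=True):
    print("{0:<10}{1:>5}".format(name, count))
```

With the real text, `tokens` would be the list returned by `jieba.lcut(txt)` and `excludes` the set defined in the program above.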

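Note: this article is the blogger's own day-to-day study record and is not for any commercial use, so reposting is not supported; thank you for understanding! If you are also interested in Python and have some grasp of it, feel free to get in touch with the blogger at any time.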

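This article is reposted content. We respect the original author's copyright in the article. If there are errors in the content or infringement concerns, the original author is welcome to contact us to correct the content or delete the article.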