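Software Engineering Assignment: Word Frequency Statistics
Stage 1
Requirements
Output the frequency of each of the 26 letters and each Chinese character in a text file, sorted from highest to lowest, showing each frequency as a percentage rounded to two decimal places.
The command-line arguments are:
wf.exe -c <file name>
Letter frequency = number of occurrences of the letter / (total number of occurrences of all A-Z and a-z letters and Chinese characters)
If two tokens have the same frequency, sort them in dictionary order. For example, if S and T both have a frequency of 10.21%, S must come before T.
If the program has to process a very long novel (for example, Gone With The Wind), how efficient is it? Is there anything that could be optimized?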
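As a rough illustration of the frequency formula and the tie-breaking rule, the sketch below counts the letters and Chinese characters of a short string. The helper name `char_frequencies` and the sample text are assumptions made for this example and are not part of the assignment.

```python
from collections import Counter


def char_frequencies(text):
    """Return (char, share) pairs, highest share first, ties in dictionary order."""
    # Keep ASCII letters and characters in the range U+4E00..U+9FA5 only.
    kept = [c for c in text
            if "a" <= c <= "z" or "A" <= c <= "Z" or "\u4e00" <= c <= "\u9fa5"]
    counts = Counter(kept)
    total = sum(counts.values())
    if total == 0:
        return []
    # Negating the count sorts high to low; the character itself breaks ties.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(ch, n / total) for ch, n in ordered]


if __name__ == "__main__":
    for ch, share in char_frequencies("SSTT, 123 我"):
        print("{}:{:.2%}".format(ch, share))
```

For the sample string, digits, spaces, and the comma are ignored, so the denominator is 5: S and T each get 2/5 and 我 gets 1/5, with S listed before T.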
PSP
PSP2.1 | Personal Software Process Stage | Estimated time (minutes) | Actual time (minutes) |
Planning | Planning | 15 | |
Estimate | Estimate how long the task will take | 10 | |
Development | Development | 180 | |
Analysis | Requirements analysis (including learning new technologies) | 90 | |
Design Spec | Write the design document | 30 | |
Design Review | Design review | 15 | |
Coding standard | Coding standard | 10 | |
Design | Detailed design | 50 | |
Coding | Coding | 60 | |
Code Review | Code review | 30 | |
Test | Testing (self-testing, fixing code, submitting for testing) | 120 | |
Reporting | Reporting | 120 | |
Test Report | Test report | 30 | |
Size Measurement | Size measurement | 5 | |
Postmortem & Process Improvement Plan | Postmortem and process improvement plan | 60 | |
Total | 825 |
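Approach
1) Initial analysis of the requirements: to count character frequencies, the input text must first be cleaned by removing extra symbols, digits, and so on, keeping only Chinese characters and letters. Because the statistics are over single letters and single Chinese characters, counting element by element is enough and no word segmentation is needed. Finally, the results are sorted and printed in the required format.
2) Use an object-oriented approach, implemented in Python.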
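Requirements analysis
I. New technologies to learn:
1) Packaging a .py program into an .exe in PyCharm
2) Using pylint for code quality analysis
3) Using profile for performance analysis
4) Markdown syntax
5) Managing a repository with git
II. Functional requirement: character frequency statistics
III. Modeling
1) Static model
Class: WordFrequency
Class name | WordFrequency |
Attributes | file name, character table, count table, total character count (letters and Chinese characters) |
Methods | character frequency statistics |
2) Functional model
Use case diagram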
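Design
High-level design
Function module design
1) Initialization: init
2) Character frequency statistics: ele_frequency
3) Formatted output: output
4) Main function: main
Function | Purpose | Parameters | Return value |
init | Initialize and store the file name | filename (file name) | - |
ele_frequency | Open the file, then read and process it line by line | none | - |
output | Print in the required format (percentages with two decimal places) | none | - |
cut_count | Split a line, remove extra symbols, and count characters | line | - |
word_sort | Sort by frequency from high to low | none | - |
main | Instantiate the object, call each module, and control the flow | argv | - |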
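Detailed design
1) init
Receives and stores the filename argument passed in from main, so that the file can later be located and opened by name.
Declares the character table word_list, the count table ele_num, and the total character count sum.
class WordFrequency():
    # Initialization
    def __init__(self, filename):
        # Character table: maps each character to its count
        self.word_list = {}
        # Sorted list of (character, count) pairs
        self.ele_num = []
        # File name
        self.filename = filename
        # Total number of characters
        self.sum = 0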
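2) ele_frequency
Opens the file, reads its content, and processes it line by line, keeping only letters and Chinese characters. (The stage 1 requirements specify neither the text encoding nor a parameter for it, so UTF-8 is assumed for now.)
Counts the occurrences of each letter and Chinese character and stores them in word_list; uppercase and lowercase letters are treated as different letters.
Sorts by count from high to low, breaking ties in dictionary order, and stores the result in ele_num.
    # Character frequency statistics
    def ele_frequency(self):
        # Open the file, read it all at once, and process it line by line
        with open(self.filename, 'r', encoding='utf8') as txt:
            for line in txt.readlines():
                # Split the line and count its characters
                self.cut_count(line)
        # Sort
        self.word_sort()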
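`readlines()` builds a list of every line before the loop starts. For a very large file, one possible variation is to iterate over the file object itself, so that only one line is held at a time. The sketch below shows this under the same UTF-8 assumption; `count_chars_streaming` is a hypothetical standalone helper, not a method of the original class, and whether it is noticeably faster has not been measured here.

```python
import os
import re
import tempfile


def count_chars_streaming(path, encoding="utf8"):
    """Count letters and Chinese characters without loading the whole file."""
    counts = {}
    with open(path, "r", encoding=encoding) as txt:
        for line in txt:  # a file object yields its lines lazily
            line = re.sub("[^a-zA-Z\u4e00-\u9fa5]", "", line)
            for ch in line:
                counts[ch] = counts.get(ch, 0) + 1
    return counts


if __name__ == "__main__":
    # Small demonstration on a temporary file.
    fd, demo_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    with open(demo_path, "w", encoding="utf8") as demo:
        demo.write("ab\n我a1\n")
    print(count_chars_streaming(demo_path))
    os.remove(demo_path)
```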
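3) output
Computes the total character count sum.
Computes the frequency of each character: frequency = count / sum.
Prints each frequency as a percentage with two decimal places (.2%).
    # Formatted output
    def output(self):
        # Total number of characters (letters + Chinese characters)
        self.sum = sum(self.word_list.values())
        # Print as percentages with two decimal places
        for i in self.ele_num:
            ch, num = i
            # Convert the count to a frequency
            num = num / self.sum
            print("{:<3}:{:.2%}".format(ch, num))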
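4) cut_count
Removes extra symbols from the line passed in, keeping only Chinese characters and letters, then goes through the result character by character and counts each one.
    # Split and count characters
    def cut_count(self, line):
        # Keep only letters and Chinese characters (needs "import re" at module level)
        line = re.sub("[^a-zA-Z\u4e00-\u9fa5]", '', line)
        # To treat uppercase and lowercase as the same letter, convert to lowercase first
        # line = line.lower()
        for ch in line:
            self.word_list[ch] = self.word_list.get(ch, 0) + 1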
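The pattern in cut_count is a negated character class: it matches every character that is neither an ASCII letter nor in the range U+4E00 to U+9FA5, and `re.sub` replaces each match with an empty string. The sketch below isolates that step; the names `NOT_KEPT` and `keep_letters_and_hanzi` are made up for the example. The range covers the commonly used Chinese characters but not every CJK character in Unicode, which is a limitation of this choice of range.

```python
import re

# Anything that is NOT an ASCII letter and NOT in U+4E00..U+9FA5.
NOT_KEPT = re.compile("[^a-zA-Z\u4e00-\u9fa5]")


def keep_letters_and_hanzi(line):
    """Strip digits, punctuation and whitespace; keep letters and Chinese characters."""
    return NOT_KEPT.sub("", line)


if __name__ == "__main__":
    print(keep_letters_and_hanzi("我的圣诞,oirte5节\n24日。"))
```

Compiling the pattern once is optional, since `re` caches recently used patterns, but it makes the intent explicit.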
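5) word_sort
Two-level sort: by count from high to low, with equal counts ordered in dictionary order.
    # Sort
    def word_sort(self):
        # First sort by character (dictionary order)
        self.ele_num = sorted(self.word_list.items(), key=lambda x: x[0])
        # Then sort by count, high to low; sorted() is stable, so ties keep dictionary order
        self.ele_num = sorted(self.ele_num, key=lambda y: y[1], reverse=True)
        self.output()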
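The two-pass sort relies on the stability of Python's `sorted`: the second pass reorders by count only, so entries with equal counts keep the dictionary order established by the first pass. The sketch below compares it with an equivalent single pass that uses a tuple key. The function names and the sample dictionary are assumptions for illustration.

```python
def two_pass(counts):
    """Sort by character first, then stably by count from high to low."""
    by_char = sorted(counts.items(), key=lambda x: x[0])
    return sorted(by_char, key=lambda y: y[1], reverse=True)


def one_pass(counts):
    """Same ordering in one call: negative count first, character as tie-breaker."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = {"T": 3, "a": 1, "S": 3, "我": 2}
    print(two_pass(sample))
    print(one_pass(sample))
```

Both calls give S before T for the tied count of 3, as the requirement asks.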
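6) Main function main
Receives the file name argument from the command line.
Instantiates a WordFrequency object wf.
Calls the character frequency method.
def main():
    fn = "test.txt"
    # If the command line is well-formed (3 items: script name, -c, file name)
    if len(sys.argv) == 3:
        fn = sys.argv[2]
    else:
        # Otherwise report the problem and fall back to the default test.txt
        print("Please enter the name of the file to process correctly")
    wf = WordFrequency(fn)
    # Call the character frequency method
    wf.ele_frequency()
    # input("Enter any character to exit:")
    # Test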
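The main function above only checks the number of arguments and never looks at the `-c` flag itself. One alternative, sketched here with the standard `argparse` module, is to declare `-c` explicitly so that a missing or misspelled flag produces a usage message. The function name `parse_args` and the help text are assumptions for this example.

```python
import argparse


def parse_args(argv=None):
    """Parse 'wf -c <file name>'; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(prog="wf")
    parser.add_argument(
        "-c",
        dest="filename",
        metavar="<file name>",
        required=True,
        help="text file whose letters and Chinese characters are counted",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["-c", "test.txt"])
    print(args.filename)
```

With this approach the parsed `args.filename` would be passed to `WordFrequency` in place of `sys.argv[2]`.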
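Testing
1) English text: the complete Harry Potter books 1-7, HarryPotter.txt, 78451 words, 448 KB
1. The output is as follows (partial):
2. Write the results as counts to test_result.txt and compare them with the results from an online tool:
Online tool results:
Test results:
The comparison shows that the output is correct.
2) Chinese text: the People's Daily corpus rmrb.txt, 1822596 characters, 7548 KB
1. The output is as follows (partial):
2. Write the results as counts to test_result.txt and compare them with the results from an online tool:
Online tool results:
Test results: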
3) Unit tests
The following are unit tests for the cut_count module, 11 test cases in total.
import unittest
# current_app is used only by test_app_exist below, so Flask must be installed
from flask import current_app
from wf import WordFrequency
class MyTestCase(unittest.TestCase):
    # Runs before every test; the method name is fixed
    def setUp(self):
        # cut_count never opens the file, so the default file name is enough
        self.testwf = WordFrequency("test.txt")
    def test_something0(self):
        self.testwf.cut_count("")
        for i in self.testwf.word_list:
            print(i, self.testwf.word_list[i])
    def test_something1(self):
        self.testwf.cut_count("我")
        for i in self.testwf.word_list:
            print(i, self.testwf.word_list[i])
    # Renamed from a second test_something0, which silently replaced the first one
    def test_something10(self):
        self.testwf.cut_count("s")
        for i in self.testwf.word_list:
            print(i, self.testwf.word_list[i])
    def test_something2(self):
        self.testwf.cut_count("S")
        for i in self.testwf.word_list:
            print(i, self.testwf.word_list[i])
    def test_something3(self):
        self.testwf.cut_count(",。、 \n -=")
        for i in self.testwf.word_list:
            print(i, self.testwf.word_list[i])
    def test_something4(self):
        self.testwf.cut_count("我的圣诞,oirte5节妇\n女24324日。、。、121easfsefs")
        for i in self.testwf.word_list:
            print(i, self.testwf.word_list[i])
    def test_something5(self):
        self.testwf.cut_count("123243534132423")
        for i in self.testwf.word_list:
            print(i, self.testwf.word_list[i])
    def test_something6(self):
        self.testwf.cut_count("AWSFDssfdg")
        for i in self.testwf.word_list:
            print(i, self.testwf.word_list[i])
    def test_something7(self):
        self.testwf.cut_count("ss我喜欢哈哈哈哈哈哈哈\n")
        for i in self.testwf.word_list:
            print(i, self.testwf.word_list[i])
    def test_something8(self):
        self.testwf.cut_count("我今天吃了三碗饭——早上吃了一碗,中午吃了一碗,晚上又吃了一碗。")
        for i in self.testwf.word_list:
            print(i, self.testwf.word_list[i])
    def test_something9(self):
        self.testwf.cut_count("She is very cute.")
        for i in self.testwf.word_list:
            print(i, self.testwf.word_list[i])
    # Checks that the application instance exists
    def test_app_exist(self):
        self.assertFalse(current_app is None)
if __name__ == '__main__':
    unittest.main()
All test results were checked and found correct (the output is omitted here for brevity; see unit_test_result.txt).
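The tests above print the counts and rely on reading the output by eye. A possible refinement is to state the expected counts with `assertEqual`, so that the test runner itself reports a failure. The sketch below assumes the cut_count behaviour described in the design section. To stay self-contained it defines a minimal stand-in `WordFrequency` holding only what cut_count needs; a real test file would import the class from wf instead. The test names are illustrative.

```python
import re
import unittest


class WordFrequency:
    """Minimal stand-in for wf.WordFrequency (assumption for this sketch)."""

    def __init__(self, filename):
        self.filename = filename
        self.word_list = {}

    def cut_count(self, line):
        line = re.sub("[^a-zA-Z\u4e00-\u9fa5]", "", line)
        for ch in line:
            self.word_list[ch] = self.word_list.get(ch, 0) + 1


class CutCountTest(unittest.TestCase):
    def setUp(self):
        # A fresh object per test keeps the counts independent.
        self.testwf = WordFrequency("test.txt")

    def test_empty_line(self):
        self.testwf.cut_count("")
        self.assertEqual(self.testwf.word_list, {})

    def test_symbols_and_digits_are_dropped(self):
        self.testwf.cut_count(",。、 \n -=123")
        self.assertEqual(self.testwf.word_list, {})

    def test_case_is_preserved(self):
        self.testwf.cut_count("sS")
        self.assertEqual(self.testwf.word_list, {"s": 1, "S": 1})

    def test_mixed_text(self):
        self.testwf.cut_count("ss我喜欢哈哈哈哈哈哈哈\n")
        self.assertEqual(
            self.testwf.word_list,
            {"s": 2, "我": 1, "喜": 1, "欢": 1, "哈": 7},
        )
```

Saved as a test module, this can be run with `python -m unittest`.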
Code quality analysis
1) Analyzing wf.py with pylint
The source code of wf.py is as follows:
# This is a sample Python script.

# Press Shift+F10 to execute it or replace it with your code.
# Press Double Shift to search everywhere for classes, files, tool windows, actions, and settings.

import time
import sys
import re
def main():
    filename = sys.argv[2]
    wf = WordFrequency(filename)
    wf.ele_frequency()
class WordFrequency():
    def __init__(self,filename):
        # Character table
        self.word_list = {}
        # Frequency table
        self.ele_num = []
        # File name
        self.filename=filename
        self.sum=0

    # Formatted output
    def output(self):
        self.sum = sum(self.word_list.values())
        print("Character frequency results:\n")
        for i in self.ele_num:
            ch, num = i
            num=num/self.sum
            print("{:}:{:.2%}".format(ch, num))



    # Character frequency statistics
    def ele_frequency(self):
        # Open the file, read it all at once, and process it line by line
        with open(self.filename,'r',encoding='utf8') as txt:
         for line in txt.readlines():
             # Keep only letters and Chinese characters
             line = re.sub("[^a-zA-Z\u4e00-\u9fa5]", '', line)
             # To treat upper and lower case as the same letter, convert to lowercase first
             # line=line.lower()
             for ch in line:
                  self.word_list[ch] = self.word_list.get(ch, 0) + 1
        # Sort
        self.ele_num = sorted(sorted(self.word_list.items(), key=lambda x: x[0]), key=lambda y: y[1], reverse=True)
        #print("time:",etime-stime)
        self.output()

    # Write the test results to a text file
    def test_result(self):
        with open("test_result.txt", "w") as f:

                for i in self.ele_num:
                    ch, num = i
                    f.write("{:<3}:{:}\n".format(ch, num)) # with closes the file; no f.close()


if __name__ == '__main__':
    # Read the command-line argument
    filename = sys.argv[2]
    # Instantiate
    wf=WordFrequency(filename)
    # Call the character frequency method
    wf.ele_frequency()
    input("Enter any character to exit:")
    # Test
    #wf.test_result()
The analysis results are as follows:
*********** Module wf
E:\anaconda\envs\python39\week1\word-frequency\wf.py:38: [W0311(bad-indentation), ] Bad indentation. Found 9 spaces, expected 12
E:\anaconda\envs\python39\week1\word-frequency\wf.py:40: [W0311(bad-indentation), ] Bad indentation. Found 13 spaces, expected 16
E:\anaconda\envs\python39\week1\word-frequency\wf.py:43: [W0311(bad-indentation), ] Bad indentation. Found 13 spaces, expected 16
E:\anaconda\envs\python39\week1\word-frequency\wf.py:44: [W0311(bad-indentation), ] Bad indentation. Found 18 spaces, expected 20
E:\anaconda\envs\python39\week1\word-frequency\wf.py:46: [C0301(line-too-long), ] Line too long (115/100)
E:\anaconda\envs\python39\week1\word-frequency\wf.py:54: [W0311(bad-indentation), ] Bad indentation. Found 16 spaces, expected 12
E:\anaconda\envs\python39\week1\word-frequency\wf.py:55: [W0311(bad-indentation), ] Bad indentation. Found 20 spaces, expected 16
E:\anaconda\envs\python39\week1\word-frequency\wf.py:56: [W0311(bad-indentation), ] Bad indentation. Found 20 spaces, expected 16
E:\anaconda\envs\python39\week1\word-frequency\wf.py:71: [C0305(trailing-newlines), ] Trailing newlines
E:\anaconda\envs\python39\week1\word-frequency\wf.py:1: [C0114(missing-module-docstring), ] Missing module docstring
E:\anaconda\envs\python39\week1\word-frequency\wf.py:9: [C0116(missing-function-docstring), main] Missing function or method docstring
E:\anaconda\envs\python39\week1\word-frequency\wf.py:10: [W0621(redefined-outer-name), main] Redefining name 'filename' from outer scope (line 61)
E:\anaconda\envs\python39\week1\word-frequency\wf.py:11: [W0621(redefined-outer-name), main] Redefining name 'wf' from outer scope (line 63)
E:\anaconda\envs\python39\week1\word-frequency\wf.py:11: [C0103(invalid-name), main] Variable name "wf" doesn't conform to snake_case naming style
E:\anaconda\envs\python39\week1\word-frequency\wf.py:13: [C0115(missing-class-docstring), WordFrequency] Missing class docstring
E:\anaconda\envs\python39\week1\word-frequency\wf.py:14: [W0621(redefined-outer-name), WordFrequency.__init__] Redefining name 'filename' from outer scope (line 61)
E:\anaconda\envs\python39\week1\word-frequency\wf.py:24: [C0116(missing-function-docstring), WordFrequency.output] Missing function or method docstring
E:\anaconda\envs\python39\week1\word-frequency\wf.py:28: [C0103(invalid-name), WordFrequency.output] Variable name "ch" doesn't conform to snake_case naming style
E:\anaconda\envs\python39\week1\word-frequency\wf.py:35: [C0116(missing-function-docstring), WordFrequency.ele_frequency] Missing function or method docstring
E:\anaconda\envs\python39\week1\word-frequency\wf.py:43: [C0103(invalid-name), WordFrequency.ele_frequency] Variable name "ch" doesn't conform to snake_case naming style
E:\anaconda\envs\python39\week1\word-frequency\wf.py:51: [C0116(missing-function-docstring), WordFrequency.test_result] Missing function or method docstring
E:\anaconda\envs\python39\week1\word-frequency\wf.py:52: [C0103(invalid-name), WordFrequency.test_result] Variable name "f" doesn't conform to snake_case naming style
E:\anaconda\envs\python39\week1\word-frequency\wf.py:55: [C0103(invalid-name), WordFrequency.test_result] Variable name "ch" doesn't conform to snake_case naming style
E:\anaconda\envs\python39\week1\word-frequency\wf.py:6: [W0611(unused-import), ] Unused import time
------------------------------------------------------------------
Your code has been rated at 3.68/10 (previous run: 3.68/10, +0.00)
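According to the analysis, the code scores only 3.68. The main problems are:
1. Convention (C): the code does not follow the style standard, mainly missing module, class, and function docstrings, an over-long line, and variable names that do not match the naming style
2. Warning (W): inconsistent indentation, names inside functions that duplicate names in the outer scope, and an unused import (time)
2) Revision
1. Move the sorting into a separate function, word_sort()
2. Remove the def main() function, which is redundant
3. Rename the duplicated variable names
The revised code is as follows: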
# Software engineering project, stage 1: wf.py
import sys
import re
# Class definition
class WordFrequency():
    def __init__(self, filename):
        # Character table
        self.word_list = {}
        # Frequency table
        self.ele_num = []
        # File name
        self.filename = filename
        self.sum = 0
    # Formatted output
    def output(self):
        self.sum = sum(self.word_list.values())
        print("Character frequency results:\n")
        for i in self.ele_num:
            ch, num = i
            num = num / self.sum
            print("{:}:{:.2%}".format(ch, num))
    # Character frequency statistics
    def ele_frequency(self):
        # Open the file, read it all at once, and process it line by line
        with open(self.filename, 'r', encoding='utf8') as txt:
            for line in txt.readlines():
                # Keep only letters and Chinese characters
                line = re.sub("[^a-zA-Z\u4e00-\u9fa5]", '', line)
                # To treat uppercase and lowercase as the same letter, convert to lowercase first
                # line = line.lower()
                for ch in line:
                    self.word_list[ch] = self.word_list.get(ch, 0) + 1
        self.word_sort()
    # Sort
    def word_sort(self):
        self.ele_num = sorted(self.word_list.items(), key=lambda x: x[0])
        self.ele_num = sorted(self.ele_num, key=lambda y: y[1], reverse=True)
        self.output()
    # Write the test results to a text file
    def test_result(self):
        with open("test_result.txt", "w") as f:
            for i in self.ele_num:
                ch, num = i
                f.write("{:<3}:{:}\n".format(ch, num))  # with closes the file; no f.close()
if __name__ == '__main__':
    # Read the command-line argument and instantiate
    wf = WordFrequency(sys.argv[2])
    # Call the character frequency method
    wf.ele_frequency()
    input("Enter any character to exit:")
    # Test
    # wf.test_result()
Running pylint again:
************* Module wf
E:\anaconda\envs\python39\week1\word-frequency\wf.py:1: [C0114(missing-module-docstring), ] Missing module docstring
E:\anaconda\envs\python39\week1\word-frequency\wf.py:5: [C0115(missing-class-docstring), WordFrequency] Missing class docstring
E:\anaconda\envs\python39\week1\word-frequency\wf.py:16: [C0116(missing-function-docstring), WordFrequency.output] Missing function or method docstring
E:\anaconda\envs\python39\week1\word-frequency\wf.py:20: [C0103(invalid-name), WordFrequency.output] Variable name "ch" doesn't conform to snake_case naming style
E:\anaconda\envs\python39\week1\word-frequency\wf.py:27: [C0116(missing-function-docstring), WordFrequency.ele_frequency] Missing function or method docstring
E:\anaconda\envs\python39\week1\word-frequency\wf.py:35: [C0103(invalid-name), WordFrequency.ele_frequency] Variable name "ch" doesn't conform to snake_case naming style
E:\anaconda\envs\python39\week1\word-frequency\wf.py:38: [C0116(missing-function-docstring), WordFrequency.word_sort] Missing function or method docstring
E:\anaconda\envs\python39\week1\word-frequency\wf.py:44: [C0116(missing-function-docstring), WordFrequency.test_result] Missing function or method docstring
E:\anaconda\envs\python39\week1\word-frequency\wf.py:45: [C0103(invalid-name), WordFrequency.test_result] Variable name "f" doesn't conform to snake_case naming style
E:\anaconda\envs\python39\week1\word-frequency\wf.py:47: [C0103(invalid-name), WordFrequency.test_result] Variable name "ch" doesn't conform to snake_case naming style
------------------------------------------------------------------
Your code has been rated at 7.14/10 (previous run: 7.14/10, +0.00)
The analysis shows that the revision eliminated all warnings, but convention issues remain; the overall score rose to 7.14.
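The messages that remain in the second report are all of the convention kind: missing docstrings (C0114, C0115, C0116) and short variable names (C0103). A minimal sketch of how such messages are usually addressed is shown below on a reduced version of the class. The docstring wording, the names `char` and `out_file`, and the `path` parameter are illustrative choices, and this sketch has not been run through pylint here.

```python
"""Character frequency statistics for letters and Chinese characters."""
import re


class WordFrequency:
    """Count how often each letter or Chinese character occurs."""

    def __init__(self, filename):
        self.filename = filename
        self.word_list = {}

    def cut_count(self, line):
        """Drop everything except letters and Chinese characters, then count."""
        line = re.sub("[^a-zA-Z\u4e00-\u9fa5]", "", line)
        for char in line:
            self.word_list[char] = self.word_list.get(char, 0) + 1

    def test_result(self, path="test_result.txt"):
        """Write the raw counts to a text file, one character per line."""
        with open(path, "w", encoding="utf8") as out_file:
            for char, num in self.word_list.items():
                out_file.write("{:<3}:{:}\n".format(char, num))
```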
A visual presentation of the performance analysis is still being worked out and will be added in a later update.
Code performance analysis
1. Processing HarryPotter.txt
Analysis results (partial):
Wed Jan 19 10:55:22 2022 tem_result.txt
347659 function calls (347658 primitive calls) in 0.906 seconds
Random listing order was used
ncalls tottime percall cumtime percall filename:lineno(function)
0 0.000 0.000 profile:0(profiler)
1 0.000 0.000 0.906 0.906 profile:0(wf.ele_frequency())
1 0.016 0.016 0.906 0.906 :0(exec)
1 0.000 0.000 0.891 0.891 <string>:1(<module>)
1 0.453 0.453 0.891 0.891 E:\anaconda\envs\python39\week1\word-frequency\wf.py:27(ele_frequency)
E:\anaconda\lib\encodings\__init__.py:70(search_function)
335009 0.375 0.000 0.375 0.000 :0(get)
1 0.000 0.000 0.000 0.000 E:\anaconda\lib\encodings\__init__.py:43(normalize_encoding)
3042 0.000 0.000 0.000 0.000 :0(isinstance)
4 0.000 0.000 0.000 0.000 :0(isalnum)
31 0.000 0.000 0.000 0.000 :0(append)
1 0.000 0.000 0.000 0.000 :0(readlines)
57 0.000 0.000 0.000 0.000 E:\anaconda\lib\codecs.py:319(decode)
57 0.000 0.000 0.000 0.000 :0(utf_8_decode)
3033 0.031 0.000 0.062 0.000 E:\anaconda\lib\re.py:203(sub)
3033 0.016 0.000 0.031 0.000 E:\anaconda\lib\re.py:289(_compile)
2 0.000 0.000 0.000 0.000 E:\anaconda\lib\sre_parse.py:435(_parse_sub)
2 0.000 0.000 0.000 0.000 E:\anaconda\lib\sre_parse.py:286(tell)
29/28 0.000 0.000 0.000 0.000 :0(len)
1 0.000 0.000 0.000 0.000 E:\anaconda\lib\sre_compile.py:492(_get_charset_prefix)
2 0.016 0.008 0.016 0.008 E:\anaconda\lib\sre_compile.py:276(_optimize_charset)
10 0.000 0.000 0.000 0.000 :0(find)
2 0.000 0.000 0.000 0.000 E:\anaconda\lib\sre_compile.py:411(_mk_bitmap)
2 0.000 0.000 0.000 0.000 :0(translate)
2 0.000 0.000 0.000 0.000 E:\anaconda\lib\sre_compile.py:413(<listcomp>)
2 0.000 0.000 0.000 0.000
3033 0.000 0.000 0.000 0.000 :0(sub)
1 0.000 0.000 0.000 0.000 E:\anaconda\envs\python39\week1\word-frequency\wf.py:39(word_sort)
2 0.000 0.000 0.000 0.000 :0(sorted)
52 0.000 0.000 0.000 0.000 E:\anaconda\envs\python39\week1\word-frequency\wf.py:40(<lambda>)
52 0.000 0.000 0.000 0.000 E:\anaconda\envs\python39\week1\word-frequency\wf.py:41(<lambda>)
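This run made 347659 function calls in 0.906 s. For each function the listing gives the number of calls (ncalls), the time spent in the function itself (tottime) and per call (percall), the cumulative time including the functions it calls (cumtime), and the cumulative time per primitive call (the second percall). Nearly all of the time is spent in ele_frequency: 0.453 s in its own body, that is, the per-character counting loop, plus 0.375 s in the 335009 calls to get (self.word_list.get). Removing unwanted characters with re.sub takes only 0.062 s cumulatively, and both reading the file (readlines) and sorting (sorted) show as 0.000 s.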
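The listing above is printed in random order, which makes the expensive entries hard to spot. One way to get a sorted report, sketched below, is to collect the statistics with the standard `cProfile` module and sort them by cumulative time with `pstats`. The function `work` is a hypothetical stand-in workload and `profile_report` is a helper name made up for this example; with the real program, the profiled call would be `wf.ele_frequency`.

```python
import cProfile
import io
import pstats


def work():
    """Stand-in workload: count characters the same way cut_count does."""
    counts = {}
    for ch in "abc" * 1000:
        counts[ch] = counts.get(ch, 0) + 1
    return counts


def profile_report(func, limit=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report(work))
```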
2. Processing the People's Daily corpus rmrb.txt
Analysis results (partial):
Wed Jan 19 11:07:25 2022 tem_result.txt
1686343 function calls (1686342 primitive calls) in 4.219 seconds
Random listing order was used
ncalls tottime percall cumtime percall filename:lineno(function)
0 0.000 0.000 profile:0(profiler)
1 0.000 0.000 4.219 4.219 profile:0(wf.ele_frequency())
1 0.000 0.000 4.219 4.219 :0(exec)
1 0.000 0.000 4.219 4.219 <string>:1(<module>)
1 1.984 1.984 4.219 4.219 E:\anaconda\envs\python39\week1\word-frequency\wf.py:27(ele_frequency)
E:\anaconda\lib\encodings\__init__.py:70(search_function)
1589737 1.594 0.000 1.594 0.000 :0(get)
1 0.000 0.000 0.000 0.000 E:\anaconda\lib\encodings\__init__.py:43(normalize_encoding)
19065 0.016 0.000 0.016 0.000 :0(isinstance)
4 0.000 0.000 0.000 0.000 :0(isalnum)
31 0.000 0.000 0.000 0.000 :0(append)
E:\anaconda\lib\encodings\utf_8.py:33(getregentry)
945 0.000 0.000 0.016 0.000 E:\anaconda\lib\codecs.py:319(decode)
945 0.016 0.000 0.016 0.000 :0(utf_8_decode)
19056 0.109 0.000 0.531 0.000 E:\anaconda\lib\re.py:203(sub)
19056 0.078 0.000 0.094 0.000 E:\anaconda\lib\re.py:289(_compile)
2 0.000 0.000 0.000 0.000 E:\anaconda\lib\sre_parse.py:435(_parse_sub)
29/28 0.000 0.000 0.000 0.000 :0(len)
E:\anaconda\lib\sre_compile.py:249(_compile_charset)
19056 0.328 0.000 0.328 0.000 :0(sub)
4574 0.000 0.000 0.000 0.000 E:\anaconda\envs\python39\week1\word-frequency\wf.py:40(<lambda>)
4574 0.000 0.000 0.000 0.000 E:\anaconda\envs\python39\week1\word-frequency\wf.py:41(<lambda>)
4575 0.016 0.000 0.016 0.000 :0(print)
4574 0.000 0.000 0.000 0.000 :0(format)
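This run made 1686343 function calls in 4.219 s. As with the English text, the time is dominated by ele_frequency: 1.984 s in its own per-character loop plus 1.594 s in the 1589737 calls to get. re.sub is the next largest item at 0.531 s cumulatively, while print takes 0.016 s and the two sort-key lambdas show as 0.000 s.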
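Since both profiles attribute most of the time to the per-character loop and its calls to get, one option worth trying is `collections.Counter`, whose `update` method counts all the characters of a line in a single call. The sketch below shows the idea; `count_lines` is a hypothetical helper, and no timing comparison has been made here, so any gain would need to be confirmed by profiling again.

```python
import re
from collections import Counter


def count_lines(lines):
    """Count letters and Chinese characters across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # One update() call per line replaces one get() call per character.
        counts.update(re.sub("[^a-zA-Z\u4e00-\u9fa5]", "", line))
    return counts


if __name__ == "__main__":
    print(count_lines(["ab\n", "a我1"]))
```

A `Counter` is a dict subclass, so the existing sorting and output code could consume it through `items()` and `values()` unchanged.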
PSP
PSP2.1 | Personal Software Process Stage | Estimated time (minutes) | Actual time (minutes) |
Planning | Planning | 15 | 20 |
Estimate | Estimate how long the task will take | 10 | 10 |
Development | Development | 180 | 180 |
Analysis | Requirements analysis (including learning new technologies) | 90 | 100 |
Design Spec | Write the design document | 30 | 20 |
Design Review | Design review | 15 | 30 |
Coding standard | Coding standard | 10 | 20 |
Design | Detailed design | 50 | 30 |
Coding | Coding | 60 | 60 |
Code Review | Code review | 30 | 45 |
Test | Testing (self-testing, fixing code, submitting for testing) | 120 | 70 |
Reporting | Reporting | 120 | 100 |
Test Report | Test report | 30 | 50 |
Size Measurement | Size measurement | 5 | 10 |
Postmortem & Process Improvement Plan | Postmortem and process improvement plan | 60 | 30 |
Total | 825 | 775 |