- In under 4 days, scraped a total of 11,049 comments on 17 temples from 3 information sources
- Used wordcloud, jieba, PIL, matplotlib, and numpy to segment the text, count word frequencies, and draw word clouds
```python
# coding=utf-8
from wordcloud import WordCloud
import jieba
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np


def wordcloudplot(txt):
    # A font that covers Chinese glyphs is required, or the cloud renders blank
    path = r'ancient_style.ttf'
    # Use the black.jpg silhouette as the mask that shapes the cloud
    alice_mask = np.array(Image.open('black.jpg'))
    wordcloud = WordCloud(font_path=path,
                          background_color="black",
                          margin=2, width=900, height=400, mask=alice_mask,
                          max_words=2000, max_font_size=300,
                          random_state=42)
    wordcloud = wordcloud.generate(txt)
    wordcloud.to_file('black7.jpg')
    plt.imshow(wordcloud)
    plt.axis("off")
    # plt.show()


def main():
    with open(r'../Day-4-comment_txt/红螺寺_comment.txt', 'r',
              encoding='utf-8', errors='ignore') as f:
        text = f.read()
    # Segment the raw comments with jieba
    words = jieba.lcut(text)
    # Keep only tokens longer than one character (drops particles, punctuation)
    a = [word for word in words if len(word) > 1]
    # WordCloud.generate() expects one space-separated string
    txt = ' '.join(a)
    wordcloudplot(txt)
    print(txt)


if __name__ == '__main__':
    main()
```
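The second bullet above mentions counting word frequencies; in the script itself, `WordCloud.generate()` handles that internally. If you want the counts directly, a minimal sketch with `collections.Counter` could look like this (it reads the same comment file as the script; the top-20 cutoff is just for illustration):

```python
# Minimal sketch: explicit word-frequency counting with collections.Counter.
# Reads the same comment file as the script above; 20 is an arbitrary cutoff.
from collections import Counter

import jieba

with open(r'../Day-4-comment_txt/红螺寺_comment.txt', 'r',
          encoding='utf-8', errors='ignore') as f:
    text = f.read()

words = [w for w in jieba.lcut(text) if len(w) > 1]
freq = Counter(words)
for word, count in freq.most_common(20):
    print(word, count)
```

A frequency dict like this can also be passed straight to WordCloud via `generate_from_frequencies()`, which skips the space-joining step.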
- The resulting test word cloud:
The data shown in these word clouds looks like a jumbled mess. After a bit of reading, it turns out this would also need sentiment analysis and quite a few other things, plus filtering out the noise and so on. My girlfriend says this part of the project isn't her responsibility anyway, so, that's that.
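For the record, here is a minimal sketch of what those two steps might look like. The `stopwords.txt` file is hypothetical (one stopword per line), and SnowNLP is just one common option for Chinese sentiment scoring, not necessarily what this project would end up using:

```python
# Minimal sketch of noise filtering plus sentiment scoring.
# stopwords.txt is a hypothetical file with one stopword per line.
import jieba
from snownlp import SnowNLP

with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f)

comment = '人很多,但是红螺寺的风景还是很不错的'
# Drop single characters and stopwords before any frequency counting
words = [w for w in jieba.lcut(comment) if len(w) > 1 and w not in stopwords]
print(words)

# SnowNLP returns a score in [0, 1]; closer to 1 means more positive
print(SnowNLP(comment).sentiments)
```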
Anyway, back to grinding on my Django and MySQL, plus computer networking.