Computer Science ›› 2021, Vol. 48 ›› Issue (11A): 170-175. doi: 10.11896/jsjkx.210100232
景丽, 何婷婷
JING Li, HE Ting-ting
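Abstract: Text classification is an important task in natural language processing and is widely used in areas such as information retrieval and sentiment analysis. To address the incomplete feature extraction and weak semantic representation of traditional text classification models, this paper proposes a text classification model that combines an improved TF-IDF algorithm with an attention-based long short-term memory convolutional network (Attention based on Bi-LSTM and CNN, ABLCNN). The model first improves the TF-IDF algorithm using the intra-class and inter-class distribution of feature terms together with their position information, which highlights the importance of feature terms, and represents the text by combining these weights with word vectors trained with the Word2vec tool. It then uses ABLCNN to extract text features. ABLCNN combines the strengths of the attention mechanism, the long short-term memory network, and the convolutional neural network, so it can extract contextual semantic features selectively while also capturing local semantic features. Finally, the feature vector is passed through a softmax function to classify the text. Experiments on the THUCNews and online_shopping_10_cats datasets show that the proposed model reaches accuracies of 97.38% and 91.33% respectively, higher than the other text classification models compared.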
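The text-representation step can be illustrated with a minimal sketch. It uses standard TF-IDF with a smoothed IDF term, not the improved class-aware and position-aware weighting proposed in the paper, whose formula the abstract does not give. The function names `tf_idf_weights` and `weighted_sequence`, the toy documents, and the 2-dimensional vectors standing in for Word2vec embeddings are all hypothetical. Scaling each word vector by its term weight is one plausible way to combine the two, assumed here for illustration only.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: weight} dict per tokenized document (standard TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return weights


def weighted_sequence(doc, term_weights, embeddings):
    """Scale each token's vector by its weight; tokens without a vector are skipped."""
    return [
        [term_weights[term] * x for x in embeddings[term]]
        for term in doc
        if term in embeddings
    ]


# Toy corpus and toy 2-D embeddings (assumptions, not data from the paper).
docs = [
    ["stock", "market", "rises"],
    ["team", "wins", "match"],
    ["market", "team"],
]
embeddings = {
    "stock": [1.0, 0.0],
    "market": [0.5, 0.5],
    "team": [0.0, 1.0],
}

weights = tf_idf_weights(docs)
sequence = weighted_sequence(docs[0], weights[0], embeddings)
print(sequence)
```

In this toy corpus "stock" appears in one document and "market" in two, so "stock" receives the larger weight in the first document. The resulting sequence of weighted vectors is the kind of input a Bi-LSTM or CNN feature extractor would consume.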