HanLP or jieba: which is better?

Reposted

小题大作 2023-07-14 21:25:36


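After using the jieba module for word segmentation, my project needed a segmentation module written in Java. Browsing jieba's GitHub page, I found that the Java port of jieba has not been updated for a long time and is quite incomplete (it does not even support segmentation with part-of-speech tags). That is understandable, since jieba's stated goal is to be the best Python Chinese word segmentation module. Searching further, I found another very highly rated segmenter: HanLP.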

1. Introduction to HanLP (excerpted from the official website)

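HanLP is a toolkit made up of a series of models and algorithms, with the goal of bringing natural language processing into production environments. It is feature-complete, efficient, clearly architected, trained on up-to-date corpora, and customizable. It provides lexical analysis (Chinese word segmentation, part-of-speech tagging, named entity recognition), syntactic parsing, text classification, and sentiment analysis.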

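HanLP's segmentation is very capable. I compared several commonly used Java segmenters on three passages taken from a math textbook, and HanLP produced the best segmentation of the group; it can even assign part-of-speech tags according to context. So I decided to write the Java version of the code with HanLP.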

2. Issues encountered

2.1 Data structures

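In the earlier Python version, the counts relating each keyword produced by segmentation to its third-level knowledge point (a two-dimensional table) were stored in a nested dictionary. Java has no built-in structure of exactly that form (and my Java is not that strong). After some searching I found that a nested HashMap, that is, HashMap<String,HashMap<String,Integer>>, solves the problem: its get() and put() methods are enough to initialize the two-dimensional map and to update its values.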
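As a minimal sketch of the nested-HashMap idea, the example below keeps a word-by-knowledge-point count table and increments cells on demand. It assumes Java 8 or later (for computeIfAbsent and merge); the class name NestedCountDemo, the helper methods increment and count, and the sample words are hypothetical and are not part of the project code. Unlike the code later in this post, which pre-fills every cell with 0, this variant creates rows and cells lazily.

```java
import java.util.HashMap;
import java.util.Map;

public class NestedCountDemo {
    // Add 1 to table[word][point], creating the inner map on first use.
    public static void increment(Map<String, Map<String, Integer>> table, String word, String point) {
        table.computeIfAbsent(word, k -> new HashMap<>()).merge(point, 1, Integer::sum);
    }

    // Read table[word][point]; a missing row or cell counts as 0.
    public static int count(Map<String, Map<String, Integer>> table, String word, String point) {
        Map<String, Integer> row = table.get(word);
        return row == null ? 0 : row.getOrDefault(point, 0);
    }

    public static void main(String[] args) {
        // Outer key: word; inner key: knowledge point; value: occurrence count.
        Map<String, Map<String, Integer>> wordToPoint = new HashMap<>();
        increment(wordToPoint, "triangle", "congruence");
        increment(wordToPoint, "triangle", "congruence");
        increment(wordToPoint, "triangle", "similarity");
        System.out.println(count(wordToPoint, "triangle", "congruence"));
        System.out.println(count(wordToPoint, "circle", "congruence"));
    }
}
```

Lazy creation avoids allocating a full words × knowledge-points table of zeros, at the cost of having to treat a missing cell as 0 when reading.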

2.2 Reading Excel files

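The corpus for the project is stored in an Excel workbook. I assumed Java's POI library would be enough to read it, and a small demo at the start worked fine. But when I loaded an Excel file of 27187 KB, the JVM ran out of memory. A search showed that if the Workbook is created as an XSSFWorkbook and the imported Excel file is larger than about 2 MB, an OOM error is very likely, so the file has to be read with POI's event-driven (streaming) approach instead.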

Wrong approach (not suitable for large Excel files; the try-catch was inserted automatically by Eclipse):

Workbook wb_Workbook = null; // initialize a Workbook reference
try {
    // XSSFWorkbook loads the entire file into memory
    wb_Workbook = new XSSFWorkbook(exceFile);
} catch (InvalidFormatException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
} // load the workbook from the Excel file; the constructor requires try-catch

Correct approach (suitable for large Excel files):

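FileInputStream in = new FileInputStream(new File(data_path));
// builder() is a static factory method, so no StreamingReader instance is needed
Workbook wb1 = StreamingReader.builder()
        .rowCacheSize(100)   // number of rows kept in memory at a time
        .bufferSize(4096)    // buffer size in bytes used when reading the InputStream
        .open(in);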

Reference for the above:

3. Code and overall approach

3.1 Maven dependencies

Dependencies to add:

<dependency>
  <groupId>com.hankcs</groupId>
  <artifactId>hanlp</artifactId>
  <version>portable-1.7.4</version>
</dependency>
<!-- The next two dependencies are for reading and writing Excel files -->
<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi</artifactId>
  <version>3.15</version>
</dependency>
<dependency>
  <groupId>org.apache.poi</groupId>
  <artifactId>poi-ooxml</artifactId>
  <version>3.15</version>
</dependency>
<dependency>
  <groupId>com.monitorjbl</groupId>
  <artifactId>xlsx-streamer</artifactId>
  <version>1.2.0</version>
</dependency>
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <version>1.7.2</version>
</dependency>

3.2 App.java


package Big_Create.Word_Cut;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;

import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import com.hankcs.hanlp.dictionary.CustomDictionary;
import com.hankcs.hanlp.seg.common.Term;
import com.hankcs.hanlp.tokenizer.StandardTokenizer;
import com.monitorjbl.xlsx.StreamingReader;

/**
 * Keyword and knowledge-point statistics built on HanLP word segmentation.
 */
public class App
{

    // 1. Add the words in one column of an .xlsx file to HanLP's custom dictionary.
    //    XSSFWorkbook loads the whole file into memory, so this is only suitable for small files.
    public static void add_words_from_excel(String excel_path, int cell_num) {
        File exceFile = new File(excel_path); // create the file from the path argument
        Workbook wb_Workbook = null;          // initialize a Workbook reference
        try {
            wb_Workbook = new XSSFWorkbook(exceFile);
        } catch (InvalidFormatException | IOException e) {
            e.printStackTrace();
            return; // nothing to read if the workbook could not be opened
        } // load the workbook from the Excel file; the constructor requires try-catch
        org.apache.poi.ss.usermodel.Sheet sheet_readSheet = wb_Workbook.getSheetAt(0);
        int firstRowIndex = sheet_readSheet.getFirstRowNum() + 1; // skip the header row
        int lastRowIndex = sheet_readSheet.getLastRowNum();       // 0-based index of the last row
        // getLastRowNum() is inclusive, so use <= to include the last row
        for (int i = firstRowIndex; i <= lastRowIndex; i++) {
            Row row = sheet_readSheet.getRow(i);
            if (row != null) {
                Cell cell = row.getCell(cell_num);
                if (cell != null) {
                    CustomDictionary.add(cell.getStringCellValue());
                    // for testing: System.out.println(cell.toString());
                }
            }
        }
    }

    // 2. Load the stop-word list from a txt file and return it as an ArrayList of strings.
    //    The list used here is the HIT (Harbin Institute of Technology) stop-word list downloaded from the web.
    public static ArrayList<String> get_stop_list(String txt_path) throws IOException {
        ArrayList<String> stop_list = new ArrayList<String>(); // holds the stop words read from the file
        // The stop-word file (txt) stores one word per line, with spaces at the end of each word.
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader buffer_BufferedReader = new BufferedReader(new FileReader(txt_path))) {
            String read_line;
            while ((read_line = buffer_BufferedReader.readLine()) != null) {
                // strip the spaces, then add the word to the output list
                stop_list.add(delete_space(read_line));
            }
        } catch (FileNotFoundException e) {
            // a missing file is reported and an empty list is returned
            e.printStackTrace();
        }
        return stop_list;
    }

    // 3. Remove the spaces from a string that was read in, and return the result.
    public static String delete_space(String str_with_space) {
        // String.replace removes every ' ' character without concatenating in a loop
        return str_with_space.replace(" ", "");
    }

    // 4. Main segmentation method: extract every question stem and segment it.
    public static void striped(String data_path, String stop_path, String result_path1, String result_path2) throws IOException {
        // The four parameters are: source data path, stop-word list path, and the two output file paths.
        // Column settings:
        int text_num = 1;           // column holding the question stem
        int class_one_num = 7;      // column holding the first-level knowledge point
        int knowledge_num = 9;      // column holding the third-level knowledge point
        int knowledge_code_num = 6; // column holding the code of the third-level knowledge point

        // Data structures:
        ArrayList<String> knowledge_points = new ArrayList<String>(); // all single third-level knowledge points
        ArrayList<String> point_code = new ArrayList<String>();       // codes of all third-level knowledge points
        ArrayList<String> word_list = new ArrayList<String>();        // index of all words produced by segmentation
        HashMap<String, String> point_to_code = new HashMap<String, String>(); // knowledge point -> its code
        HashMap<String, HashMap<String, Integer>> word_to_point = new HashMap<String, HashMap<String, Integer>>();
          // how often each word occurs under each knowledge point (a 2-D table: columns are words, rows are knowledge points)
        HashMap<String, HashMap<String, Integer>> point_to_word = new HashMap<String, HashMap<String, Integer>>();
          // how often each knowledge point occurs with each word (a 2-D table: columns are knowledge points, rows are words)
          // both maps are initialized in the loops further down

        // Load the stop-word list. Note that it is loaded here but not applied in the filtering below.
        ArrayList<String> stop_list = get_stop_list(stop_path);

        ArrayList<String> all_textStrings = new ArrayList<String>(); // all question stems

        // First pass: stream the workbook row by row instead of loading it all into memory
        try (FileInputStream in = new FileInputStream(new File(data_path));
             Workbook wb1 = StreamingReader.builder()
                     .rowCacheSize(100)
                     .bufferSize(4096)
                     .open(in)) {
            org.apache.poi.ss.usermodel.Sheet sheet1 = wb1.getSheetAt(0);
            for (Row row : sheet1) {
                if (row == null || row.getRowNum() == 0) {
                    continue; // skip the header row
                }
                Cell text_cell = row.getCell(text_num);                  // question stem
                Cell point_cell = row.getCell(knowledge_num);            // third-level knowledge point
                Cell point_code_cell = row.getCell(knowledge_code_num);  // knowledge point code
                if (text_cell == null || point_cell == null || point_code_cell == null) {
                    continue; // skip rows with missing cells
                }
                String temp_text = text_cell.getStringCellValue();
                all_textStrings.add(temp_text);
                // knowledge points; only rows with a single knowledge point are kept
                String[] tempStrings = point_cell.getStringCellValue().split("\\^\\.\\^");
                if (tempStrings.length != 1) {
                    continue;
                }
                // length 1 after splitting means a single knowledge point
                String temp_point = tempStrings[0];
                if (!knowledge_points.contains(temp_point)) {
                    // first occurrence: record the point and, since it is a single point, its code as well
                    knowledge_points.add(temp_point);
                    point_code.add(point_code_cell.getStringCellValue());
                    point_to_code.put(temp_point, point_code_cell.getStringCellValue());
                }
                // segment the stem and collect the distinct nouns
                java.util.List<Term> temp_List = StandardTokenizer.segment(temp_text);
                for (Term term : temp_List) {
                    // read the word and its tag from the Term fields rather than splitting
                    // toString() on "/", which breaks when the word itself contains "/"
                    if ("n".equals(String.valueOf(term.nature)) && !word_list.contains(term.word)) {
                        word_list.add(term.word);
                    }
                }
            } // first loop: builds the word, knowledge point, and knowledge point code indexes
        }
        System.out.println("Number of distinct third-level knowledge points: " + knowledge_points.size());
        System.out.println("Number of distinct words from segmentation: " + word_list.size());

        for (int i = 0; i < word_list.size(); i++) {
            HashMap<String, Integer> word_to_point_num = new HashMap<String, Integer>();
            // create the inner HashMap outside the inner loop (one per word);
            // otherwise the data gets overwritten and nulls appear
            for (int j = 0; j < knowledge_points.size(); j++) {
                word_to_point_num.put(knowledge_points.get(j), 0);
            }
            word_to_point.put(word_list.get(i), word_to_point_num);
        } // the 2-D map is now initialized: a two-dimensional table whose values are all 0

        for (int i = 0; i < knowledge_points.size(); i++) {
            HashMap<String, Integer> point_to_word_num = new HashMap<String, Integer>();
            for (int j = 0; j < word_list.size(); j++) {
                point_to_word_num.put(word_list.get(j), 0);
            }
            point_to_word.put(knowledge_points.get(i), point_to_word_num);
        } // initialize the second 2-D map: occurrences of each keyword per knowledge point

        // Second pass over the Excel file to collect the counts
        try (FileInputStream in_again = new FileInputStream(new File(data_path));
             Workbook data_Workbook = StreamingReader.builder()
                     .rowCacheSize(100)
                     .bufferSize(4096)
                     .open(in_again)) {
            org.apache.poi.ss.usermodel.Sheet data_Sheet = data_Workbook.getSheetAt(0);
            for (Row row : data_Sheet) {
                if (row == null || row.getRowNum() == 0) {
                    continue; // skip the header row
                }
                Cell read_point_cell = row.getCell(knowledge_num);
                Cell read_text_cell = row.getCell(text_num);
                if (read_point_cell == null || read_text_cell == null) {
                    continue; // skip rows with missing cells
                }
                String[] temp_Strings = read_point_cell.getStringCellValue().split("\\^\\.\\^");
                if (temp_Strings.length != 1) {
                    continue; // only rows with a single knowledge point are counted
                }
                String temp_point = temp_Strings[0];
                java.util.List<Term> temp_Terms = StandardTokenizer.segment(read_text_cell.getStringCellValue());
                for (Term term : temp_Terms) {
                    if (!"n".equals(String.valueOf(term.nature))) {
                        continue; // only nouns are counted
                    }
                    String str_to_put = term.word;
                    HashMap<String, Integer> temp_map = word_to_point.get(str_to_put);
                    HashMap<String, Integer> an_temp_map = point_to_word.get(temp_point);
                    if (temp_map == null || an_temp_map == null) {
                        continue; // word or knowledge point was not indexed in the first pass
                    }
                    // the inner maps are updated in place, so they need not be put back into the outer maps
                    if (temp_map.get(temp_point) != null) {
                        temp_map.put(temp_point, temp_map.get(temp_point) + 1);
                    }
                    if (an_temp_map.get(str_to_put) != null) {
                        an_temp_map.put(str_to_put, an_temp_map.get(str_to_put) + 1);
                    } // update the other table as well
                }
            }
        }
        // Write the results to txt files
        try (BufferedWriter out_to_file = new BufferedWriter(new FileWriter(new File(result_path1)))) {
            for (int i = 0; i < word_list.size(); i++) {
                HashMap<String, Integer> counts = word_to_point.get(word_list.get(i));
                out_to_file.write("word:" + word_list.get(i) + " occurrences:");
                int appear_num = 0; // total count of this word across all knowledge points
                for (int j = 0; j < knowledge_points.size(); j++) {
                    appear_num = appear_num + counts.get(knowledge_points.get(j));
                }
                out_to_file.write(" total:" + appear_num);
                if (appear_num != 0) {
                    for (int k = 0; k < knowledge_points.size(); k++) {
                        double f = (double) counts.get(knowledge_points.get(k)) / appear_num;
                        if (f != 0) {
                            out_to_file.write(" knowledge point:" + knowledge_points.get(k) + " frequency:" + f);
                        }
                    }
                }
                out_to_file.write("\r\n");
            }
        } // try-with-resources flushes and closes the writer

        try (BufferedWriter out_to_file2 = new BufferedWriter(new FileWriter(new File(result_path2)))) {
            for (int i = 0; i < knowledge_points.size(); i++) {
                HashMap<String, Integer> counts = point_to_word.get(knowledge_points.get(i));
                out_to_file2.write("knowledge point:" + knowledge_points.get(i) + " words contained:");
                int appear_num = 0; // total count of all words under this knowledge point
                for (int j = 0; j < word_list.size(); j++) {
                    appear_num = appear_num + counts.get(word_list.get(j));
                }
                out_to_file2.write(" total:" + appear_num);
                if (appear_num != 0) {
                    for (int k = 0; k < word_list.size(); k++) {
                        double f = (double) counts.get(word_list.get(k)) / appear_num;
                        if (f != 0) {
                            out_to_file2.write(" word:" + word_list.get(k) + " frequency:" + f);
                        }
                    }
                }
                out_to_file2.write("\r\n");
            }
        }
    }

    // Entry point
    public static void main(String[] args) throws IOException
    {
        String data_path = "src/"; // name of the Excel file to process; change it to your own path, or put the file under the Maven project's src directory
        String stop_path = "src/stop_words1.txt";
        String result_path = "src/final_result.txt";
        String point_to_word_path = "src/point_to_word.txt";
        striped(data_path, stop_path, result_path, point_to_word_path);
    }
}
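
The frequencies written to the two result files are each cell's count divided by its row total. As a standalone illustration of that step, the sketch below converts one row of counts into relative frequencies. It assumes Java 8 or later; the class name FrequencyDemo, the method toFrequencies, and the sample knowledge points are hypothetical and do not appear in App.java.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FrequencyDemo {
    // Convert one row of raw counts into relative frequencies (count / row total).
    // Zero counts are dropped, and a row whose total is 0 yields an empty map.
    public static Map<String, Double> toFrequencies(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        if (total == 0) {
            return result;
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() != 0) {
                result.put(e.getKey(), (double) e.getValue() / total);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical counts of one word under three knowledge points.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("congruent triangles", 3);
        counts.put("similar triangles", 1);
        counts.put("quadratic functions", 0);
        System.out.println(toFrequencies(counts));
    }
}
```

Checking the row total before dividing mirrors the appear_num != 0 guard in the output code and avoids a division by zero for words that were indexed but never counted.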
