HanLP Study Notes

wx61090d1892228 2021-08-04 09:55:30 Blog category: NLP

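© Copyright belongs to the author: this is an original work by 51CTO blog author wx61090d1892228. Please contact the author for permission before reposting; unauthorized reposting may be pursued legally.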

Toggling part-of-speech display in segmentation results

HanLP.Config.ShowTermNature = false; // do not show part-of-speech tags in segmentation results

Passing multiple dictionaries through the constructor

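String dict1 = "data/dictionary/CoreNatureDictionary.mini.txt";
// The " ns" suffix after the path sets the default part of speech (place name) for entries in this dictionary
String dict2 = "data/dictionary/custom/上海地名.txt ns";
segment = new DoubleArrayTrieSegment(dict1, dict2);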

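Once POS tagging is enabled, recognition of numerals and English words is activated as well

segment.enablePartOfSpeechTagging(true);    // activates numeral and English-word recognition

Part-of-speech tags are defined in com.hankcs.hanlp.corpus.tag.Nature

A custom part of speech can be introduced by editing the first line of a custom dictionary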

Sensitive-word replacement

public interface IHit<V>
{
    void hit(int begin, int end, V value);
}

Implement the interface with an anonymous class to pass in a callback

final StringBuilder sbOut = new StringBuilder(text.length());
final int[] offset = new int[]{0};

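// Anonymous implementation of the callback (in the full code it is passed as an argument to the trie's text-matching call, which is omitted here)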
new AhoCorasickDoubleArrayTrie.IHit<String>()
{
    @Override
    public void hit(int begin, int end, String value)
    {
        if (begin > offset[0])
            sbOut.append(text.substring(offset[0], begin));
        sbOut.append(replacement);
        offset[0] = end;
    }
}

if (offset[0] < text.length())
    sbOut.append(text.substring(offset[0]));
return sbOut.toString();

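Note: a local variable captured by the anonymous callback class must be final (or effectively final since Java 8). Because this scenario needs to modify the captured state, the offset is stored in a final int[]: the array reference stays fixed while its single element can still be updated.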
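The fragments above can be assembled into a small self-contained sketch. The `parseText` method below is a naive, hypothetical stand-in for the trie matcher (it is not HanLP's API); it exists only so that the callback and the `final int[]` trick can be run end to end.

```java
class ReplaceDemo {
    /** Same shape as the IHit callback shown above. */
    interface IHit<V> {
        void hit(int begin, int end, V value);
    }

    /**
     * Hypothetical stand-in for a trie matcher: scans left to right and reports
     * the longest word that starts at each position. Words must be non-empty.
     */
    static void parseText(String text, String[] words, IHit<String> processor) {
        int i = 0;
        while (i < text.length()) {
            String matched = null;
            for (String w : words) {
                if (text.startsWith(w, i) && (matched == null || w.length() > matched.length())) {
                    matched = w;
                }
            }
            if (matched != null) {
                processor.hit(i, i + matched.length(), matched);
                i += matched.length();
            } else {
                i++;
            }
        }
    }

    static String replace(final String text, final String replacement, String[] words) {
        final StringBuilder sbOut = new StringBuilder(text.length());
        // A final array: the reference is fixed, but the element can be mutated in the callback.
        final int[] offset = new int[]{0};
        parseText(text, words, new IHit<String>() {
            @Override
            public void hit(int begin, int end, String value) {
                if (begin > offset[0]) {
                    sbOut.append(text, offset[0], begin); // copy the untouched text before the hit
                }
                sbOut.append(replacement);
                offset[0] = end;
            }
        });
        if (offset[0] < text.length()) {
            sbOut.append(text.substring(offset[0])); // copy the tail after the last hit
        }
        return sbOut.toString();
    }

    public static void main(String[] args) {
        // prints "a ** and ** day"
        System.out.println(replace("a bad and ugly day", "**", new String[]{"bad", "ugly"}));
    }
}
```

Replacing `final int[] offset` with a plain `int offset` would not compile, because the assignment inside `hit` would modify a captured local variable.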

While building the part-of-speech dictionary, NatureDictionaryMaker automatically builds the unigram and bigram models as well

List<List<IWord>> sentenceList = CorpusLoader.convert2SentenceList(corpusPath);
for (List<IWord> sentence : sentenceList)
    for (IWord word : sentence)
        if (word.getLabel() == null) word.setLabel("n"); // give every unlabeled word a dummy noun tag
final NatureDictionaryMaker dictionaryMaker = new NatureDictionaryMaker();
dictionaryMaker.compute(sentenceList);
dictionaryMaker.saveTxtTo(modelPath);

The bigram dictionary looks sophisticated, but in essence it replaces a HashMap with arrays plus binary search
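A minimal sketch of the "arrays plus binary search" idea follows. The class name, the three-array layout, and the word ids are assumptions made for illustration; HanLP's actual bigram table may lay out its data differently.

```java
import java.util.Arrays;

class BigramTable {
    // start[a] (inclusive) to start[a + 1] (exclusive) delimit the followers of word a
    private final int[] start;
    // follower word ids, sorted in ascending order inside each slice
    private final int[] followerIds;
    // frequencies[k] is the count of the bigram whose follower is followerIds[k]
    private final int[] frequencies;

    BigramTable(int[] start, int[] followerIds, int[] frequencies) {
        this.start = start;
        this.followerIds = followerIds;
        this.frequencies = frequencies;
    }

    /** Frequency of the bigram (a, b), or 0 if it was never observed. */
    int frequency(int a, int b) {
        int idx = Arrays.binarySearch(followerIds, start[a], start[a + 1], b);
        return idx >= 0 ? frequencies[idx] : 0;
    }

    /** Hypothetical toy data: word 0 -> {1: 3}, word 1 -> {0: 1, 2: 5}, word 2 -> {}. */
    static BigramTable demo() {
        return new BigramTable(
                new int[]{0, 1, 3, 3},
                new int[]{1, 0, 2},
                new int[]{3, 1, 5});
    }

    public static void main(String[] args) {
        BigramTable table = demo();
        System.out.println(table.frequency(1, 2)); // prints 5
        System.out.println(table.frequency(2, 0)); // prints 0
    }
}
```

Compared with a `HashMap<Long, Integer>`, the arrays avoid boxing and per-entry object overhead, at the cost of an O(log n) lookup inside each slice.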

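I do not yet understand the word lattice (word net) or atomic segmentation

As for the word lattice, the Java demo does not define the lattice explicitly; it simply runs the Dijkstra-based segmenter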

Segment segment = new DijkstraSegment()
    .enableAllNamedEntityRecognize(false) // disable named entity recognition
    .enableCustomDictionary(false); // disable the user dictionary
System.out.println(segment.seg("商品和服务"));

com.hankcs.hanlp.seg.Dijkstra.DijkstraSegment

wordNetStorage.add(searcher.begin + 1, new Vertex(new String(charArray, searcher.begin, searcher.length), searcher.value, searcher.index));

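It now starts to make sense how this graph is built: each begin offset is treated as a node of the graph, and the code calls it a row number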
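The row idea can be sketched as follows. This is not HanLP's WordNet class: the `<begin>` and `<end>` markers, the plain `Set<String>` dictionary, and the brute-force substring scan are all assumptions chosen to keep the example short. The only part taken from the line above is that a word starting at offset `begin` is stored in row `begin + 1`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class WordLattice {
    /**
     * Row 0 holds the begin marker, row n + 1 holds the end marker, and
     * row begin + 1 holds every dictionary word that starts at offset begin.
     */
    static List<List<String>> build(String sentence, Set<String> dictionary) {
        int n = sentence.length();
        List<List<String>> rows = new ArrayList<>();
        for (int i = 0; i < n + 2; i++) {
            rows.add(new ArrayList<String>());
        }
        rows.get(0).add("<begin>");
        rows.get(n + 1).add("<end>");
        for (int begin = 0; begin < n; begin++) {
            for (int end = begin + 1; end <= n; end++) {
                String word = sentence.substring(begin, end);
                if (dictionary.contains(word)) {
                    rows.get(begin + 1).add(word);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList(
                "商", "品", "商品", "和", "和服", "服", "服务", "务"));
        List<List<String>> rows = build("商品和服务", dictionary);
        for (int i = 0; i < rows.size(); i++) {
            System.out.println(i + ": " + rows.get(i));
        }
    }
}
```

In this layout a word of length `len` stored in row `r` leads to row `r + len`, so a path from row 0 to the last row corresponds to one segmentation of the sentence.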

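The atomic segmentation step that keeps the graph connected is still not very clear to me

A is the transition matrix and B is the emission matrix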

com.hankcs.hanlp.model.hmm.FirstOrderHiddenMarkovModel#predict

public float predict(int[] observation, int[] state)

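state can be understood as an output (by-reference) parameter: given the observations and the model parameters $(\pi, A, B)$, the method predicts the hidden states and writes them into state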
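To make the roles of $(\pi, A, B)$ and the output parameter concrete, here is a small Viterbi sketch with the same parameter order as `predict(int[] observation, int[] state)`. It is not HanLP's implementation: the class name, the static signature, and the choice to work with log probabilities and return the log score are assumptions for illustration. The toy model is the textbook healthy/fever example.

```java
import java.util.Arrays;

class ViterbiSketch {
    /**
     * Fills state with the most likely hidden-state sequence and returns its log probability.
     * logPi[s]: initial log probability, logA[p][s]: transition p -> s, logB[s][o]: emission of o from s.
     * The observation array must contain at least one element.
     */
    static float predict(float[] logPi, float[][] logA, float[][] logB, int[] observation, int[] state) {
        int time = observation.length;
        int n = logPi.length;
        float[][] score = new float[time][n];
        int[][] back = new int[time][n];
        for (int s = 0; s < n; s++) {
            score[0][s] = logPi[s] + logB[s][observation[0]];
        }
        for (int t = 1; t < time; t++) {
            for (int s = 0; s < n; s++) {
                float best = Float.NEGATIVE_INFINITY;
                int arg = 0;
                for (int p = 0; p < n; p++) {
                    float v = score[t - 1][p] + logA[p][s];
                    if (v > best) {
                        best = v;
                        arg = p;
                    }
                }
                score[t][s] = best + logB[s][observation[t]];
                back[t][s] = arg;
            }
        }
        int last = 0;
        for (int s = 1; s < n; s++) {
            if (score[time - 1][s] > score[time - 1][last]) {
                last = s;
            }
        }
        float bestScore = score[time - 1][last];
        for (int t = time - 1; t >= 0; t--) { // follow the back pointers
            state[t] = last;
            last = back[t][last];
        }
        return bestScore;
    }

    /** Converts probabilities to log probabilities. */
    static float[] log(double... p) {
        float[] result = new float[p.length];
        for (int i = 0; i < p.length; i++) {
            result[i] = (float) Math.log(p[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // States: 0 = healthy, 1 = fever. Observations: 0 = normal, 1 = cold, 2 = dizzy.
        float[] pi = log(0.6, 0.4);
        float[][] a = {log(0.7, 0.3), log(0.4, 0.6)};
        float[][] b = {log(0.5, 0.4, 0.1), log(0.1, 0.3, 0.6)};
        int[] state = new int[3];
        float score = predict(pi, a, b, new int[]{0, 1, 2}, state);
        System.out.println(Arrays.toString(state)); // prints [0, 0, 1]
        System.out.println(Math.exp(score));        // about 0.01512
    }
}
```

The caller allocates `state` with the same length as `observation`; the method only fills it in, which is why the return value is free to carry the score.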

com.hankcs.hanlp.model.perceptron.feature.MutableFeatureMap#addTransitionFeatures

Transition features in MutableFeatureMap

private void addTransitionFeatures(TagSet tagSet)
{
    for (int i = 0; i < tagSet.size(); i++)
    {
        idOf("BL=" + tagSet.stringOf(i));
    }
    idOf("BL=_BL_");
}

Back to com.hankcs.hanlp.model.perceptron.instance.CWSInstance#extractFeature

The current character is ", followed by 人们

featureMap is used to build one-hot encodings: it maps each feature string to an integer id

com.hankcs.hanlp.model.perceptron.feature.MutableFeatureMap#idOf

Unigram features: the previous, current, and next characters (three characters)

sbFeature.delete(0, sbFeature.length());
sbFeature.append(preChar).append('1');
addFeature(sbFeature, featureVec, featureMap);

sbFeature.delete(0, sbFeature.length());
sbFeature.append(curChar).append('2');
addFeature(sbFeature, featureVec, featureMap);

sbFeature.delete(0, sbFeature.length());
sbFeature.append(nextChar).append('3');
addFeature(sbFeature, featureVec, featureMap);

Bigram features: (second previous, previous), (previous, current), (current, next), and (next, second next)

sbFeature.delete(0, sbFeature.length());
sbFeature.append(pre2Char).append("/").append(preChar).append('4');
addFeature(sbFeature, featureVec, featureMap);

sbFeature.delete(0, sbFeature.length());
sbFeature.append(preChar).append("/").append(curChar).append('5');
addFeature(sbFeature, featureVec, featureMap);

sbFeature.delete(0, sbFeature.length());
sbFeature.append(curChar).append("/").append(nextChar).append('6');
addFeature(sbFeature, featureVec, featureMap);

sbFeature.delete(0, sbFeature.length());
sbFeature.append(nextChar).append("/").append(next2Char).append('7');
addFeature(sbFeature, featureVec, featureMap);

According to page 188 of the book, the seven features extracted here are state features
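The seven templates and the string-to-id mapping can be tried out in isolation with the sketch below. The class, the padding characters used at the sentence boundaries, and the `HashMap`-based `idOf` are assumptions for illustration, not HanLP's code; only the template suffixes '1' to '7' follow the snippets above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CharFeatureSketch {
    // Assumed padding characters for positions outside the sentence
    static final char PAD_BEGIN = '\u0001';
    static final char PAD_END = '\u0002';

    static char charAt(String sentence, int i) {
        if (i < 0) return PAD_BEGIN;
        if (i >= sentence.length()) return PAD_END;
        return sentence.charAt(i);
    }

    /** Builds the seven state-feature strings for one position. */
    static List<String> extract(String sentence, int position) {
        char pre2 = charAt(sentence, position - 2);
        char pre = charAt(sentence, position - 1);
        char cur = charAt(sentence, position);
        char next = charAt(sentence, position + 1);
        char next2 = charAt(sentence, position + 2);
        List<String> features = new ArrayList<>();
        features.add(pre + "1");                // unigram templates
        features.add(cur + "2");
        features.add(next + "3");
        features.add(pre2 + "/" + pre + "4");   // bigram templates
        features.add(pre + "/" + cur + "5");
        features.add(cur + "/" + next + "6");
        features.add(next + "/" + next2 + "7");
        return features;
    }

    /** Returns the id of a feature string, assigning the next free id on first sight. */
    static int idOf(Map<String, Integer> featureMap, String feature) {
        Integer id = featureMap.get(feature);
        if (id == null) {
            id = featureMap.size();
            featureMap.put(feature, id);
        }
        return id;
    }

    public static void main(String[] args) {
        List<String> features = extract("商品和服务", 2);
        System.out.println(features); // prints [品1, 和2, 服3, 商/品4, 品/和5, 和/服6, 服/务7]
        Map<String, Integer> featureMap = new HashMap<>();
        for (String feature : features) {
            System.out.println(feature + " -> " + idOf(featureMap, feature));
        }
    }
}
```

The ids returned by `idOf` are the positions that would be set to 1 in a one-hot (more precisely, multi-hot) vector, which is why only the ids need to be stored.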

An instance holds all the characters of one sentence together with their labels

com.hankcs.hanlp.model.perceptron.model.StructuredPerceptron#update(com.hankcs.hanlp.model.perceptron.instance.Instance)

Set aside the update of the transition features for now and look at the state features.

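$i$ indexes the characters of the sentence instance and their corresponding labels. A feature vector is obtained for each position, and every feature has one parameter per class, which is what featureVector[j] * tagSet.size() expresses. Earlier, viterbiDecode(instance, guessLabel); predicted guessLabel; the update then checks which weights are wrong and penalizes them. $y = wx,\ y \in \{-1, 1\}$ is, in essence, also a similarity measure based on the vector inner product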
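The indexing `featureVector[j] * tagSet.size()` and the reward/penalty step can be sketched as below. This is a simplified illustration under stated assumptions: transition features and weight averaging are left out, the class and method names are hypothetical, and the step size is fixed at 1.

```java
import java.util.Arrays;

class PerceptronUpdateSketch {
    /**
     * parameter is a flat array with one weight per (feature, tag) pair:
     * the weight of feature f for tag y lives at parameter[f * numTags + y].
     * featureVectors[i] lists the ids of the state features active at position i.
     */
    static void update(float[] parameter, int numTags, int[][] featureVectors,
                       int[] goldLabel, int[] guessLabel) {
        for (int i = 0; i < goldLabel.length; i++) {
            if (goldLabel[i] == guessLabel[i]) {
                continue; // prediction was right at this position, nothing to correct
            }
            for (int featureId : featureVectors[i]) {
                parameter[featureId * numTags + goldLabel[i]] += 1;  // reward the correct tag
                parameter[featureId * numTags + guessLabel[i]] -= 1; // penalize the wrong guess
            }
        }
    }

    public static void main(String[] args) {
        int numTags = 2;
        float[] parameter = new float[3 * numTags]; // 3 features, 2 tags
        int[][] featureVectors = {{0, 2}, {1}};
        int[] goldLabel = {0, 1};
        int[] guessLabel = {1, 1};                  // position 0 was predicted wrongly
        update(parameter, numTags, featureVectors, goldLabel, guessLabel);
        System.out.println(Arrays.toString(parameter)); // prints [1.0, -1.0, 0.0, 0.0, 1.0, -1.0]
    }
}
```

The score of tag `y` at a position is the sum of `parameter[f * numTags + y]` over the active features `f`, which is the inner product between the weight vector and a sparse 0/1 feature vector.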

Running com.hankcs.book.ch06.CrfppTrainHanLPLoad prints (translated from the Chinese output):

Corpus converted to data/test/my_cws_corpus.txt.tsv ; feature template exported to data/test/cws-template.txt
Please install CRF++ and then run crf_learn -f 3 -c 4.0 data/test/cws-template.txt data/test/my_cws_corpus.txt.tsv data/test/crf-cws-model -t
Or run the ported version java -cp hanlp.jar com.hankcs.hanlp.model.crf.crfpp.crf_learn -f 3 -c 4.0 data/test/cws-template.txt data/test/my_cws_corpus.txt.tsv data/test/crf-cws-model -t

Set a breakpoint in com.hankcs.hanlp.model.crf.crfpp.CRFEncoderThread#call

for (int i = start_i; i < size; i = i + threadNum)

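Each CRFEncoderThread object has a start_i field that serves as its thread ID: the thread with ID start_i processes sentences start_i, start_i + threadNum, start_i + 2 * threadNum, and so on.

x is the training set, and x[i] is its i-th sentence

obj: the value of the objective function

wSize: the number of features

expected: the gradient vector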
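The strided loop can be illustrated with a toy version in which every "sentence" is just a number and each worker accumulates a partial `obj`. The class, the `total` and `indices` helpers, and the use of an `ExecutorService` are assumptions for this sketch; it does not reproduce how CRFEncoderThread computes the objective or the gradient.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class StridedWorkers {
    /** Indices handled by the worker whose ID is start_i. */
    static List<Integer> indices(int start_i, int threadNum, int size) {
        List<Integer> result = new ArrayList<>();
        for (int i = start_i; i < size; i = i + threadNum) {
            result.add(i);
        }
        return result;
    }

    /** Sums x with threadNum workers; worker t handles x[t], x[t + threadNum], and so on. */
    static long total(final int[] x, final int threadNum) {
        ExecutorService pool = Executors.newFixedThreadPool(threadNum);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int t = 0; t < threadNum; t++) {
                final int start_i = t;
                parts.add(pool.submit(new Callable<Long>() {
                    @Override
                    public Long call() {
                        long obj = 0; // partial objective of this worker
                        for (int i = start_i; i < x.length; i = i + threadNum) {
                            obj += x[i];
                        }
                        return obj;
                    }
                }));
            }
            long sum = 0;
            for (Future<Long> part : parts) {
                sum += part.get(); // combine the partial results
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(indices(1, 3, 7));                       // prints [1, 4]
        System.out.println(total(new int[]{1, 2, 3, 4, 5, 6, 7}, 3)); // prints 28
    }
}
```

Because the index sets of different workers never overlap, each worker can keep its own accumulator without locking, and the partial results are combined once all workers finish.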

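Start with com.hankcs.hanlp.model.crf.crfpp.TaggerImpl#gradient

node is a two-dimensional list, List<List<Node>>

I was completely lost reading this code

When I have time, I should read a C++ or Python implementation that comes with comments or a walkthrough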

In com.hankcs.hanlp.corpus.dictionary.NRDictionaryMaker#roleTag, by the time execution reaches com/hankcs/hanlp/corpus/dictionary/NRDictionaryMaker.java:99, the roles A and K have been tagged:

Original corpus: [这里/r, 有/v, 关天培/nr, 的/u, 有关/vn, 事迹/n, 。/w]
Tag non-name/preceding: [这里/A, 有/K, 关天培/nr, 的/A, 有关/A, 事迹/A, 。/A]

Rather than stepping through the code line by line, skip to the output printed at the end (labels translated):

Original corpus: [这里/r, 有/v, 关天培/nr, 的/u, 有关/vn, 事迹/n, 。/w]
Tag non-name/preceding: [这里/A, 有/K, 关天培/nr, 的/A, 有关/A, 事迹/A, 。/A]
Tag middle/following: [这里/A, 有/K, 关天培/nr, 的/L, 有关/A, 事迹/A, 。/A]
Split name: [这里/A, 有/K, 关/B, 天/C, 培/D, 的/L, 有关/A, 事迹/A, 。/A]
Preceding context forms word: [这里/A, 有关/U, 天/C, 培/D, 的/L, 有关/A, 事迹/A, 。/A]
Head forms word: [这里/A, 有关/U, 天/C, 培/D, 的/L, 有关/A, 事迹/A, 。/A]
Tail forms word: [这里/A, 有关/U, 天/C, 培/D, 的/L, 有关/A, 事迹/A, 。/A]
Head forms word: [这里/A, 有关/U, 天/C, 培/D, 的/L, 有关/A, 事迹/A, 。/A]
Add begin/end: [始##始/S, 这里/A, 有关/U, 天/C, 培/D, 的/L, 有关/A, 事迹/A, 。/A, 末##末/A]
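The first three transformations in this trace (everything that is not a name becomes A, the neighbors of a name become K and L, and a three-character name is split into B, C, D) can be reproduced with the heavily simplified sketch below. It is not NRDictionaryMaker's logic: the class and method are hypothetical, names of other lengths are left untouched, and the later word-merging steps (such as 有 + 关 becoming 有关/U) are not modeled.

```java
import java.util.ArrayList;
import java.util.List;

class RoleTagSketch {
    static List<String> roleTag(String[] words, String[] tags) {
        int n = words.length;
        String[] role = new String[n];
        // Step 1: every word that is not a person name gets the role A
        for (int i = 0; i < n; i++) {
            role[i] = tags[i].equals("nr") ? "nr" : "A";
        }
        // Step 2: the word before a name becomes K, the word after it becomes L
        for (int i = 0; i < n; i++) {
            if (!role[i].equals("nr")) {
                continue;
            }
            if (i > 0 && role[i - 1].equals("A")) {
                role[i - 1] = "K";
            }
            if (i + 1 < n && role[i + 1].equals("A")) {
                role[i + 1] = "L";
            }
        }
        // Step 3: split a three-character name into surname B and given-name characters C, D
        List<String> result = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (role[i].equals("nr") && words[i].length() == 3) {
                result.add(words[i].charAt(0) + "/B");
                result.add(words[i].charAt(1) + "/C");
                result.add(words[i].charAt(2) + "/D");
            } else {
                result.add(words[i] + "/" + role[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = {"这里", "有", "关天培", "的", "有关", "事迹", "。"};
        String[] tags = {"r", "v", "nr", "u", "vn", "n", "w"};
        // prints [这里/A, 有/K, 关/B, 天/C, 培/D, 的/L, 有关/A, 事迹/A, 。/A]
        System.out.println(roleTag(words, tags));
    }
}
```

The printed list matches the "Split name" line of the trace.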