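ES Chinese analyzer comparison: what ES analyzers do

Reposted

技术极客侠 2024-02-15 16:38:29

Tags: es Chinese analyzer comparison, elasticsearch, search engine, java, analyzer. Category: architecture, backend development

An Introduction to Elasticsearch Analyzers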

Analysis

Components of an Analyzer
Built-in analyzers in ES
Using an Analyzer
Analyzer walkthroughs
Standard Analyzer
Simple Analyzer

Stop Analyzer
Whitespace Analyzer
Keyword Analyzer
Pattern Analyzer
Language Analyzer

Analysis

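Analysis (text analysis, often called tokenization) is the process of converting full text into a series of terms.

An Analyzer is the component that carries out analysis. ES ships with many built-in analyzers, and you can also define custom analyzers as needed.

Analyzers are applied at two points. At index time, fields that need analysis are split into terms and normalized. At search time, the query text is normally analyzed with the same analyzer so that the query terms match the indexed terms.

For example:
With the standard analyzer, the text Elasticsearch is fun is split into three terms: elasticsearch, is, and fun.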

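Components of an Analyzer

An Analyzer usually consists of three parts, applied in this order:

Character Filters: process the raw text before tokenization, for example stripping HTML tags.
Tokenizer: splits the string into tokens according to a set of rules.
Token Filters: post-process the tokens, for example lowercasing, removing stopwords, or adding synonyms.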
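
To make the three stages concrete, here is a minimal Python sketch of the same pipeline. It is only an illustration of the data flow, not how ES implements it. The function names (strip_html, whitespace_tokenizer, analyze) and the tiny stopword set are assumptions made up for this example.

```python
import html
import re
from typing import Callable, List

# Hypothetical stopword set, for illustration only.
STOPWORDS = {"a", "is", "the"}


def strip_html(text: str) -> str:
    """Character filter: replace tags with spaces, then decode entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split the filtered text on runs of whitespace."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens: List[str]) -> List[str]:
    """Token filter: drop tokens found in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


result = analyze(
    "<p>Elasticsearch is <b>FUN</b></p>",
    [strip_html],
    whitespace_tokenizer,
    [lowercase, remove_stopwords],
)
print(result)  # ['elasticsearch', 'fun']
```

The order matters: lowercase runs before remove_stopwords here, so a capitalized "Is" would still be matched against the lowercase stopword set.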

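Built-in analyzers in ES

Standard Analyzer: the default analyzer; splits on word boundaries and lowercases
Simple Analyzer: splits on any non-letter character (non-letters are discarded) and lowercases
Stop Analyzer: like Simple, plus a stop word filter (the, a, is, and so on)
Whitespace Analyzer: splits on whitespace; does not lowercase
Keyword Analyzer: no tokenization; the whole input is emitted as a single term
Pattern Analyzer: splits with a regular expression; the default is \W+ (runs of non-word characters)
Language Analyzers: analyzers for more than 30 common languages
Custom Analyzer: a user-defined analyzer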

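Using an Analyzer

You can name an Analyzer directly in the _analyze API to test how it tokenizes text.

For example, to see how ES tokenizes a sentence with the standard analyzer: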

GET /_analyze
{
  "analyzer": "standard",
  "text":"Elasticsearch is fun"
}

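The response lists the resulting tokens. token is the term itself; start_offset and end_offset are the character offsets in the original text where the term starts and ends (the end offset is exclusive); type is the token type (for example <ALPHANUM> or <NUM>); position is the ordinal position of the term in the token stream.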

{
  "tokens" : [
    {
      "token" : "elasticsearch",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "fun",
      "start_offset" : 17,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
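
The relationship between offsets and positions can be reproduced with a short Python sketch. It assumes plain ASCII input and uses the regular expression \w+ as a rough stand-in for the standard tokenizer. The function name tokens_with_offsets is made up for this example, and the type field is left out.

```python
import re
from typing import Dict, List, Union


def tokens_with_offsets(text: str) -> List[Dict[str, Union[str, int]]]:
    """Return lowercase tokens with character offsets and ordinal positions.

    Rough approximation for ASCII text only; the real standard tokenizer
    follows Unicode text segmentation rules.
    """
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append(
            {
                "token": match.group().lower(),
                "start_offset": match.start(),  # index of the first character
                "end_offset": match.end(),      # index just past the last character
                "position": position,           # 0-based order in the token stream
            }
        )
    return tokens


for token in tokens_with_offsets("Elasticsearch is fun"):
    print(token)
```

For the input above this yields the same offsets and positions as the response: 0 to 13, 14 to 16, and 17 to 20, at positions 0, 1, and 2.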

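You can also run the test against a field of an existing index, which shows how that field's analyzer tokenizes the text.

For example, to test the comment field of the index named index, send the following request:

POST /index/_analyze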
{
  "field":"comment",
  "text":"ES真好玩"
}

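The response shows how the input text was tokenized. Each Chinese character becomes its own token, which is how the standard analyzer handles Chinese text.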

{
  "tokens" : [
    {
      "token" : "es",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "真",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "好",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "玩",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    }
  ]
}

You can also test an ad hoc combination of a tokenizer and token filters without defining a named analyzer:

POST /_analyze
{
  "tokenizer":"standard",
  "filter":["lowercase"],
  "text":"Elasticsearch is FUN"
}

{
  "tokens" : [
    {
      "token" : "elasticsearch",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "fun",
      "start_offset" : 17,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

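Analyzer walkthroughs

Standard Analyzer

The Standard Analyzer is the default analyzer in ES. It follows these rules:

Splits text on word boundaries
Lowercases the tokens
Its stop word filter (which would remove is, the, and so on) is disabled by default.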


GET /_analyze
{
  "analyzer": "standard",
  "text":" 1 Elasticsearch is FUN5."
}

{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "elasticsearch",
      "start_offset" : 3,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "fun5",
      "start_offset" : 20,
      "end_offset" : 24,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

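As the output shows, the standard analyzer splits on word boundaries rather than only on whitespace (the trailing period is dropped), lowercases the tokens, keeps stop words such as is, and keeps numbers.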

Simple Analyzer

Splits on non-letter characters; all non-letter characters are discarded
Lowercases the tokens


GET /_analyze
{
  "analyzer": "simple",
  "text":" 1 Elasticsearch is FUN511asd."
}

{
  "tokens" : [
    {
      "token" : "elasticsearch",
      "start_offset" : 3,
      "end_offset" : 16,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "fun",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "asd",
      "start_offset" : 26,
      "end_offset" : 29,
      "type" : "word",
      "position" : 3
    }
  ]
}

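As the output shows, the Simple analyzer splits the last word FUN511asd into fun and asd (not just digits: any non-letter character, including punctuation, acts as a separator) and lowercases everything. The leading 1 is dropped for the same reason.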
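
The same behavior can be approximated in a few lines of Python. This is a sketch for intuition, not the ES implementation. The name simple_analyze is made up here, and the character class [^\W\d_] is used as an approximation of "any letter".

```python
import re
from typing import List


def simple_analyze(text: str) -> List[str]:
    """Keep maximal runs of letters and lowercase them.

    [^\\W\\d_] matches word characters that are neither digits nor
    underscores, which approximates "letters".
    """
    return [m.group().lower() for m in re.finditer(r"[^\W\d_]+", text)]


print(simple_analyze(" 1 Elasticsearch is FUN511asd."))
# ['elasticsearch', 'is', 'fun', 'asd']
```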

Stop Analyzer

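Splits on non-letter characters; all non-letter characters are discarded
Lowercases the tokens
Adds a stop filter that removes stop words such as is, a, and the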

GET /_analyze
{
  "analyzer": "stop",
  "text":" 1 Elasticsearch is FUN511asd."
}

{
  "tokens" : [
    {
      "token" : "elasticsearch",
      "start_offset" : 3,
      "end_offset" : 16,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "fun",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "asd",
      "start_offset" : 26,
      "end_offset" : 29,
      "type" : "word",
      "position" : 3
    }
  ]
}

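As the output shows, stop does everything simple does and also removes stop words such as is. The positions of the remaining tokens are unchanged (0, 2, 3), so the gap left by the removed word is preserved.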
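
A small Python sketch shows why the positions skip from 0 to 2: positions are assigned before the stop filter runs. The name stop_analyze and the short stop word set are assumptions for this example; the set is only a subset of an English stop word list.

```python
import re
from typing import List, Tuple

# Illustrative subset of English stop words, not the full list ES uses.
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "to"}


def stop_analyze(text: str) -> List[Tuple[str, int]]:
    """Split on non-letters, lowercase, then drop stop words.

    Each token keeps the position it had before filtering, so a removed
    stop word leaves a gap in the position sequence.
    """
    result = []
    for position, match in enumerate(re.finditer(r"[^\W\d_]+", text)):
        token = match.group().lower()
        if token in STOP_WORDS:
            continue  # the token is dropped, but its position is not reused
        result.append((token, position))
    return result


print(stop_analyze(" 1 Elasticsearch is FUN511asd."))
# [('elasticsearch', 0), ('fun', 2), ('asd', 3)]
```

Keeping the gap matters for phrase queries, which compare token positions.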

Whitespace Analyzer

Splits on whitespace only; tokens are not lowercased and punctuation is kept


GET /_analyze
{
  "analyzer": "whitespace",
  "text":" 1 Elasticsearch is FUN511asd."
}

{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Elasticsearch",
      "start_offset" : 3,
      "end_offset" : 16,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "FUN511asd.",
      "start_offset" : 20,
      "end_offset" : 30,
      "type" : "word",
      "position" : 3
    }
  ]
}

Keyword Analyzer

No tokenization: the entire input is emitted as a single term


GET /_analyze
{
  "analyzer": "keyword",
  "text":" 1 Elasticsearch is FUN511asd."
}

{
  "tokens" : [
    {
      "token" : " 1 Elasticsearch is FUN511asd.",
      "start_offset" : 0,
      "end_offset" : 30,
      "type" : "word",
      "position" : 0
    }
  ]
}

Pattern Analyzer

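Tokenizes with a regular expression
The default pattern is \W+, which splits on runs of non-word characters; tokens are lowercased by default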


GET /_analyze
{
  "analyzer": "pattern",
  "text":" 1 Elasticsearch is FUN511asd-a."
}

{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "elasticsearch",
      "start_offset" : 3,
      "end_offset" : 16,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "fun511asd",
      "start_offset" : 20,
      "end_offset" : 29,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 30,
      "end_offset" : 31,
      "type" : "word",
      "position" : 4
    }
  ]
}

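As the output shows, FUN511asd-a is split into fun511asd and a: the hyphen is a non-word character, while digits count as word characters and are kept.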
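
The default behavior is easy to mimic with Python's re module. This is a sketch under assumptions: the name pattern_analyze and its parameters are made up for the example. The re.ASCII flag makes \W treat only [a-zA-Z0-9_] as word characters, which is how Java regular expressions behave by default.

```python
import re
from typing import List


def pattern_analyze(text: str, pattern: str = r"\W+", lowercase: bool = True) -> List[str]:
    """Split text on a regular expression and optionally lowercase the pieces.

    re.split leaves empty strings at the edges when the text starts or ends
    with a separator, so those are filtered out.
    """
    tokens = [t for t in re.split(pattern, text, flags=re.ASCII) if t]
    return [t.lower() for t in tokens] if lowercase else tokens


print(pattern_analyze(" 1 Elasticsearch is FUN511asd-a."))
# ['1', 'elasticsearch', 'is', 'fun511asd', 'a']
```

Passing a different pattern, for example r"," to split comma-separated values, changes the separator without touching the rest of the logic.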

Language Analyzer

You can choose an analyzer for a specific language, such as english.

GET /_analyze
{
  "analyzer": "english",
  "text":"ES真是太好玩了，Elasticsearch is FUN-fun"
}

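Chinese, however, poses some particular difficulties for an analyzer:

A sentence should be split into words, not into individual characters.
English words are separated by spaces; Chinese words are not.
The same Chinese sentence can mean different things in different contexts, and so can be segmented differently.
Several Chinese sentences can express the same meaning yet be segmented differently.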
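
To illustrate why segmentation depends on more than the characters themselves, here is a minimal sketch of forward maximum matching, one simple dictionary-based approach. It is not a description of how any particular ES plugin works. The function name, the max_len parameter, and both small dictionaries are assumptions for this example.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy segmentation: at each position take the longest dictionary word.

    Falls back to a single character when no dictionary word matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical dictionaries, for illustration only.
print(forward_max_match("真是太好玩了", {"真是", "好玩"}))
# ['真是', '太', '好玩', '了']
print(forward_max_match("真是太好玩了", {"真是", "好玩", "太好"}))
# ['真是', '太好', '玩', '了']
```

Adding a single entry (太好) to the dictionary changes the result for the same sentence, which is why the choice of dictionary matters so much for Chinese analysis.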

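For Chinese text you can install a dedicated Chinese analyzer, for example:
ICU Analyzer
IK: supports custom dictionaries and hot reloading of the dictionary
THULAC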
