Introduction: the concept of tokenization

  1. Environment: Kibana + Elasticsearch

Suppose we search Baidu for: Java学习路线 (Java learning roadmap)


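In the results, the highlighted characters are the ones matched by our search keywords. The text typed into the Baidu search box is first tokenized, the resulting tokens are matched against the index, and only then are the results displayed.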
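
The tokenize, match, highlight flow described above can be sketched in a few lines of Python. This is only an illustration: the query tokens are written by hand (an assumption, not the output of Baidu or of any real analyzer), and the sample documents and the `search` and `highlight` helpers are hypothetical names.

```python
import re

# Hypothetical tokens for the query "Java学习路线"; a real analyzer would produce these.
QUERY_TOKENS = ["java", "学习", "路线"]

# Hypothetical documents standing in for indexed pages.
DOCS = ["Java学习路线图", "Python入门教程", "零基础学习编程"]


def highlight(text, tokens, open_tag="<em>", close_tag="</em>"):
    """Wrap every occurrence of any query token in highlight tags."""
    if not tokens:
        return text
    # Longer tokens go first so a short token cannot split a longer match.
    ordered = sorted(tokens, key=len, reverse=True)
    pattern = "|".join(re.escape(token) for token in ordered)
    return re.sub(
        pattern,
        lambda m: open_tag + m.group(0) + close_tag,
        text,
        flags=re.IGNORECASE,
    )


def search(docs, tokens):
    """Return (document, highlighted document) for every document containing a token."""
    hits = []
    for doc in docs:
        if any(token.lower() in doc.lower() for token in tokens):
            hits.append((doc, highlight(doc, tokens)))
    return hits


for original, marked in search(DOCS, QUERY_TOKENS):
    print(marked)
```

The second document contains none of the tokens, so it is not returned; the other two come back with each matched token wrapped in `<em>` tags.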

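The difference between ik_smart and ik_max_word

We use Kibana to demonstrate the effect of tokenization:

using the analyzers installed in ES:

Type: ik_smart, known as the search analyzer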

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我爱你我的祖国母亲"
}

Result:

{
  "tokens" : [
    {
      "token" : "我爱你",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "我",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "祖国",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "母亲",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}

Type: ik_max_word, known as the index analyzer

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我爱你我的祖国母亲"
}

Result:

{
  "tokens" : [
    {
      "token" : "我爱你",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "爱你",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "你我",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "祖国",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "国母",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "母亲",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}

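ik_smart, the search analyzer, splits at the coarsest granularity; ik_max_word, the index analyzer, splits at the finest granularity. Either way, "祖国母亲" is broken apart into smaller words (祖国 and 母亲, plus 国母 under ik_max_word). What if we do not want it split? We can add it to a custom dictionary so that "祖国母亲" is treated as a single word.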

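Custom dictionary: mydoc.dic

The dictionary file holds one entry per line, here the single word 祖国母亲. It is registered under the ext_dict entry of the IK plugin's IKAnalyzer.cfg.xml, and Elasticsearch is restarted so that the plugin loads it.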

Tokenize again:

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我爱你我的祖国母亲"
}

Result:

{
  "tokens" : [
    {
      "token" : "我爱你",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "我",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "祖国母亲",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}

And with ik_max_word:

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我爱你我的祖国母亲"
}

Result:

{
  "tokens" : [
    {
      "token" : "我爱你",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "爱你",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "你我",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "祖国母亲",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "祖国",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "国母",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "母亲",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}

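The results show that ik_smart now follows our custom dictionary and emits "祖国母亲" as a single coarse-grained token, while ik_max_word emits "祖国母亲" and still keeps the finer-grained 祖国, 国母, and 母亲.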
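
To make the coarse versus fine distinction concrete, here is a toy Python sketch of dictionary-based segmentation. It is not IK's actual algorithm: `coarse_segment` is plain forward maximum matching and `fine_segment` simply lists every dictionary word found in the text, and the small word sets are assumptions chosen to mirror the tokens shown above.

```python
# Assumed built-in words, limited to what the example sentence needs.
BASE_WORDS = {"我爱你", "爱你", "你我", "祖国", "国母", "母亲"}
# The same set after adding the custom dictionary entry.
CUSTOM_WORDS = BASE_WORDS | {"祖国母亲"}

TEXT = "我爱你我的祖国母亲"


def coarse_segment(text, dictionary):
    """Forward maximum matching: take the longest dictionary word at each position."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def fine_segment(text, dictionary):
    """List every dictionary word found in the text, plus characters no word covers."""
    spans = []
    covered = [False] * len(text)
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                spans.append((start, end))
                for k in range(start, end):
                    covered[k] = True
    # Characters outside every dictionary word become single-character tokens.
    spans.extend((i, i + 1) for i, hit in enumerate(covered) if not hit)
    # Order by start offset; when offsets tie, the longer word comes first.
    spans.sort(key=lambda span: (span[0], span[0] - span[1]))
    return [text[start:end] for start, end in spans]


print(coarse_segment(TEXT, BASE_WORDS))
print(coarse_segment(TEXT, CUSTOM_WORDS))
print(fine_segment(TEXT, BASE_WORDS))
print(fine_segment(TEXT, CUSTOM_WORDS))
```

With these assumed word sets, the sketch yields the same token sequences as the ik_smart and ik_max_word responses above, before and after the custom entry is added. Real IK resolves ambiguities with more elaborate rules, so the agreement should not be expected for arbitrary text.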

The difference between text and keyword in ES

  1. Prepare the test data

POST /book/novel/_bulk
{"index": {"_id": 1}}
{"name": "Gone with the Wind", "author": "Margaret Mitchell", "date": "2018-01-01"}
{"index": {"_id": 2}}
{"name": "Robinson Crusoe", "author": "Daniel Defoe", "date": "2018-01-02"}
{"index": {"_id": 3}}
{"name": "Pride and Prejudice", "author": "Jane Austen", "date": "2018-01-01"}
{"index": {"_id": 4}}
{"name": "Jane Eyre", "author": "Charlotte Bronte", "date": "2018-01-02"}

Run the request in Kibana to index the four documents.

Inspect the mapping (GET book/_mapping):

{
  "book" : {
    "mappings" : {
      "properties" : {
        "author" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "date" : {
          "type" : "date"
        },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

author and name are both of type text, but each also has a keyword sub-field (author.keyword and name.keyword).

  2. Exact lookup of the string name = Gone with the Wind

GET book/novel/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name": "Gone with the Wind"
        }
      },
      "boost": 1.2
    }
  }
}

The result is empty!

#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Why is that?

The reason is that "Gone with the Wind" was run through the field's analyzer when it was indexed. We can reproduce that step:

GET book/_analyze
{
  "field": "name",
  "text": "Gone with the Wind"
}

What is stored in the ES index is therefore:

{
  "tokens" : [
    {
      "token" : "gone",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "with",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "the",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "wind",
      "start_offset" : 14,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

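So a term query for name = Gone with the Wind cannot match anything: a term query does not analyze its input, and the index for name contains only the lowercase tokens gone, with, the, and wind, never the whole original string.

But if we query the name.keyword sub-field instead: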

GET book/novel/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name.keyword": "Gone with the Wind"
        }
      },
      "boost": 1.2
    }
  }
}

we get a result:

#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.2,
    "hits" : [
      {
        "_index" : "book",
        "_type" : "novel",
        "_id" : "1",
        "_score" : 1.2,
        "_source" : {
          "name" : "Gone with the Wind",
          "author" : "Margaret Mitchell",
          "date" : "2018-01-01"
        }
      }
    ]
  }
}

One document matches.

The reason is that name.keyword is not tokenized; the whole value is stored in ES as a single term. We can confirm this with:

GET book/_analyze
{
  "field": "name.keyword",
  "text": "Gone with the Wind"
}

Result:

{
  "tokens" : [
    {
      "token" : "Gone with the Wind",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

So the term query matches the value stored in the index directly.
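
The behaviour of the two fields can be imitated with a small in-memory inverted index. This is a minimal Python sketch, not how Elasticsearch is implemented: `analyze_text` is a rough stand-in for the default analyzer on plain English (split on whitespace, lowercase), and every function name here is hypothetical.

```python
from collections import defaultdict

# The "name" values from the bulk request above, keyed by _id.
BOOKS = {
    1: "Gone with the Wind",
    2: "Robinson Crusoe",
    3: "Pride and Prejudice",
    4: "Jane Eyre",
}


def analyze_text(value):
    """Rough stand-in for a text field: split on whitespace and lowercase."""
    return [word.lower() for word in value.split()]


def analyze_keyword(value):
    """A keyword field keeps the whole value as one term."""
    return [value]


def build_index(docs, analyze):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for term in analyze(value):
            index[term].add(doc_id)
    return index


def term_query(index, term):
    """Look the term up exactly as given; the input is not analyzed."""
    return sorted(index.get(term, set()))


name_index = build_index(BOOKS, analyze_text)
name_keyword_index = build_index(BOOKS, analyze_keyword)

print(term_query(name_index, "Gone with the Wind"))          # []
print(term_query(name_keyword_index, "Gone with the Wind"))  # [1]
print(term_query(name_index, "gone"))                        # [1]
```

The first lookup fails because the text index only contains the lowercase single-word terms, while the keyword index holds the original string and matches it exactly. A single lowercase token such as gone does match the text index.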