Preface

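An earlier post covered doc_values, which exist to support sorting, aggregations, and scripting. They store field values in a column-oriented layout, which is more efficient for sorting and aggregating. However, doc_values are not supported for text fields.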
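
To make "column-oriented" concrete, here is a minimal sketch in Python. It is not how Lucene encodes doc_values on disk; it only illustrates the layout. Apart from "zhang san" and 18, the sample documents are made up.

```python
# Row-oriented view: one record per document.
# Only the first record comes from this article; the others are hypothetical.
rows = [
    {"name": "zhang san", "age": 18},
    {"name": "li si", "age": 25},
    {"name": "wang wu", "age": 18},
]

# Column-oriented view: one array per field, indexed by document position.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# Sorting on age now only needs to read a single array.
ages = columns["age"]
docs_sorted_by_age = sorted(range(len(ages)), key=lambda i: ages[i])

print(columns["age"])        # [18, 25, 18]
print(docs_sorted_by_age)    # [0, 2, 1]
```

Sorting or aggregating on one field touches only that field's array, which is why this layout suits those operations.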

fielddata

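One workaround is fielddata, which handles text fields in memory. Elasticsearch reads each segment's inverted index from disk, inverts the term-to-document relationship so that it becomes document-to-terms, and keeps the result in the JVM heap.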
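
The "uninverting" step can be sketched roughly as follows. This is a toy model in Python, not Elasticsearch's implementation; the `inverted_index` contents and the `uninvert` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Toy inverted index for one segment: term -> list of doc IDs.
# Terms and doc IDs are hypothetical.
inverted_index = {
    "zhang": [1, 2],
    "san": [1],
    "si": [2],
}

def uninvert(index):
    """Flip term -> docs into doc -> terms, the shape aggregations need."""
    field_data = defaultdict(list)
    for term in sorted(index):
        for doc_id in index[term]:
            field_data[doc_id].append(term)
    return dict(field_data)

field_data = uninvert(inverted_index)
for doc_id in sorted(field_data):
    print(doc_id, field_data[doc_id])
# 1 ['san', 'zhang']
# 2 ['si', 'zhang']
```

The whole doc-to-terms structure has to be held in memory, and it grows with the number of documents and distinct terms, which is where the heap cost comes from.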

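Note: because of how fielddata works, it can consume a large amount of heap space. That may trigger frequent full GCs, which users experience as latency and stalls. This is why fielddata is disabled by default.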

Demo

First, create an index named emp:

PUT /emp/
{
  "mappings": {
    "properties": {
      "name":{
        "type": "text"
      },
      "age":{
        "type": "integer"
      }
    }
  }
}

Insert a document:

PUT /emp/_doc/1
{
  "name":"zhang san",
  "age":18
}

Try an aggregation on age:

GET /emp/_search
{
  "aggs": {
    "age_bucket": {
      "terms": {
        "field": "age"
      }
    }
  }
}

This works without any problem.

Now try the same aggregation on name:

GET /emp/_search
{
  "aggs": {
    "name_bucket": {
      "terms": {
        "field": "name"
      }
    }
  }
}

This time the request fails with an error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [name] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "emp",
        "node" : "ev2pyH4yRBGAVpXTGrXUzg",
        "reason" : {
          "type" : "illegal_argument_exception",
          "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [name] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
        }
      }
    ],
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [name] in order to load field data by uninverting the inverted index. Note that this can use significant memory.",
      "caused_by" : {
        "type" : "illegal_argument_exception",
        "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [name] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
      }
    }
  },
  "status" : 400
}

Clearly, a text field cannot be aggregated by default. Let's enable fielddata and try again.

Create a new index, emp2:

PUT /emp2/
{
  "mappings": {
    "properties": {
      "name":{
        "type": "text",
        "fielddata": true
      },
      "age":{
        "type": "integer"
      }
    }
  }
}

Insert the data:

PUT /emp2/_doc/1
{
  "name":"zhang san",
  "age":18
}

Run the aggregation:

GET /emp2/_search
{
  "aggs": {
    "name_bucket": {
      "terms": {
        "field": "name"
      }
    }
  }
}

This time it works.

Alternatives to fielddata

We can now aggregate on name, but as discussed above, enabling fielddata can put heavy pressure on the JVM heap, so it is not a wise choice. What alternatives are there?

In fact, the earlier error message already gave the answer: "Please use a keyword field instead."

We can add a keyword sub-field: name is used for full-text search, while name.keyword is used for aggregations and similar operations.

For example:

PUT /emp3/
{
  "mappings": {
    "properties": {
      "name":{
        "type": "text",
        "fields": {
          "keyword":{
            "type":"keyword",
            "ignore_above":256
          }
        }
      },
      "age":{
        "type": "integer"
      }
    }
  }
}

PUT /emp3/_doc/1
{
  "name":"zhang san",
  "age":18
}

GET /emp3/_search
{
  "aggs": {
    "name_bucket": {
      "terms": {
        "field": "name.keyword"
      }
    }
  }
}

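Of course, a keyword field is not analyzed, so the whole value is kept as a single term. But ask yourself why you would want to aggregate on, sort by, or script against a text field after it has been tokenized. On closer inspection, doing so usually makes no sense.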
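
The difference is easy to see in a small simulation. The sketch below is plain Python, not an Elasticsearch client call: whitespace splitting stands in for the analyzer, `terms_agg` is a hypothetical helper that mimics the doc_count of a terms aggregation, and every name except "zhang san" is made up.

```python
from collections import Counter

# Hypothetical documents; only "zhang san" appears in this article.
names = ["zhang san", "zhang san", "li si"]

def terms_agg(values_per_doc):
    """Count how many documents contain each value, like doc_count."""
    counts = Counter()
    for values in values_per_doc:
        counts.update(set(values))
    return dict(counts)

# text + fielddata: each document contributes its individual tokens.
# Whitespace splitting is a stand-in for the real analyzer.
by_token = terms_agg(name.split() for name in names)

# keyword: each document contributes the whole, untouched string.
by_keyword = terms_agg([name] for name in names)

print(sorted(by_token.items()))
# [('li', 1), ('san', 2), ('si', 1), ('zhang', 2)]
print(sorted(by_keyword.items()))
# [('li si', 1), ('zhang san', 2)]
```

Buckets such as "zhang" and "san" rarely answer a real question, whereas the keyword buckets correspond to actual names.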