Elasticsearch script fields (script_fields)

Reposted

mob64ca1416b5a8 2024-04-02 15:21:37


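In an earlier article, “Elasticsearch: Retrieving selected fields from a search”, I touched on script fields. In this article I want to go into the topic in more detail. At search time, each hit of a _search request can be given custom attributes through script_fields, computed from the document's fields. These custom attributes (script fields) are typically: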

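A modification of an existing value (for example, a price conversion or a different sort order)
A brand-new computed attribute (for example, a sum, a weighting, an exponentiation, a distance measurement)
A combination of the values of several fields (for example, firstname + lastname)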

A single _search request can define more than one script field:

POST myindex/_search
{
  "script_fields": {
    "using_doc_values": {
      "script": "doc['price'].value * 42"
    },
    "using_source": {
      "script": "params['_source']['price'] * 42"
    }
  }
}

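Above, we compute the same value in two different ways. This is where script fields differ from a script query: in a script query you can only use doc values, whereas in a script field you can also access the original document through _source. Two points are worth noting: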

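Quoting the docs: the doc[...] notation causes the terms for that field to be loaded into memory (cached), which results in faster execution, but keep in mind that this notation only gives access to simple valued fields (you cannot get a JSON object from it; that is only available from _source). Doc values can, however, also target array fields.
Although accessing _source directly is slower than accessing doc values, the scripts in script fields are executed only for the top N documents. They are therefore not really a performance concern, especially when you consider that the request size is usually in the low double digits.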

Script fields can also be used when building tables and visualizations in Kibana, where they help clean up the data.

If you are not yet familiar with Painless scripting, see the “Painless programming” section of my earlier tutorial “Elastic: A Beginner's Guide”.

Use case

We start by creating the following index:

PUT products 
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "color": {
        "type": "keyword"
      },
      "last_modified_utc": {
        "type": "date",
        "format": "epoch_millis"
      },
      "price": {
        "properties": {
          "amount": {
            "type": "long"
          },
          "currency": {
            "type": "keyword"
          }
        }
      },
      "availability_per_gb": {
        "type": "nested",
        "properties": {
          "gigabytes": {
            "type": "integer"
          },
          "units": {
            "type": "integer"
          }
        }
      },
      "warehouse_location": {
        "type": "geo_point"
      }
    }
  }
}

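Above, we created an index with a fairly complex structure. It contains a date field last_modified_utc, a keyword field color, a full-text field name, an object field price, a nested field availability_per_gb, and a geo_point field warehouse_location. We index a document as follows (note that a geo_point given as an array is in [lon, lat] order):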

PUT products/_doc/1
{
  "name": "iPhone 12 Pro Max",
  "color": "gold",
  "last_modified_utc": 1609521634371,
  "price": {
    "amount": 1600000,
    "currency": "usd"
  },
  "availability_per_gb": [
    {
      "gigabytes": 128,
      "units": 58
    },
    {
      "gigabytes": 256,
      "units": 32
    },
    {
      "gigabytes": 512,
      "units": 0
    }
  ],
  "warehouse_location": [
    116.472737,
    40.004556
  ]
}

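Our requirements are now:

The color field must be returned in upper case
last_modified_utc must be returned in the format yyyy/MM/dd HH:mm:ss, displayed in Asia/Shanghai time
The geographic distance from warehouse_location to the customer (in meters)
The sum of all units in the availability_per_gb field

Let's address these requirements one by one.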

Returning color in upper case

Returning the value of color in upper case is simple and direct. We can operate on the doc value:

GET products/_search
{
  "script_fields": {
    "uppercase_color": {
      "script": """
        doc['color'].value.toUpperCase()
      """
    }
  }
}

The request above returns:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "uppercase_color" : [
            "GOLD"
          ]
        }
      }
    ]
  }
}

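As expected, the color value is returned in upper case.

You may wonder how I knew that a method called toUpperCase exists. When writing Painless, it is often not obvious which APIs are available. Following my earlier article “Elasticsearch: Debugging Painless scripts”, we can try the following: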

GET products/_search
{
  "script_fields": {
    "uppercase_color": {
      "script": """
        Debug.explain(doc['color'])
      """
    }
  }
}

This produces the following output:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "script_exception",
        "reason" : "runtime error",
        "painless_class" : "org.elasticsearch.index.fielddata.ScriptDocValues.Strings",
        "to_string" : "[gold]",
        "java_class" : "org.elasticsearch.index.fielddata.ScriptDocValues$Strings",
        "script_stack" : [
          "Debug.explain(doc['color'])\n      ",
          "                 ^---- HERE"
        ],
        "script" : " ...",
        "lang" : "painless",
        "position" : {
          "offset" : 26,
          "start" : 9,
          "end" : 43
        }
      }
    ],

The output shows a painless_class and a java_class. Searching online for org.elasticsearch.index.fielddata.ScriptDocValues$Strings, we find:

java.lang.String	getValue()

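In other words, doc['color'].value is a java.lang.String object. Looking up java.lang.String in turn, we find the definition of toUpperCase. In the rest of this article, the same approach can be used to inspect unfamiliar objects and find the relevant API.

Changing the time format to the one we want

As the document above shows, an integer timestamp is hard to read, and it is not in the time zone we are used to. A millisecond timestamp can be converted to a date string in Painless by first turning it into a java.time.Instant object and then formatting it with a DateTimeFormatter set to the desired time zone.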

POST products/_search
{
  "script_fields": {
    "parsed_last_modified": {
      "script": """
       DateTimeFormatter
                .ofPattern('yyyy/MM/dd HH:mm:ss')
                .withZone(ZoneId.of('Asia/Shanghai'))
                .format(Instant.ofEpochMilli(doc['last_modified_utc'].value.toInstant().toEpochMilli()));
      """
    }
  }
}

The query above returns:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "parsed_last_modified" : [
            "2021/01/02 01:20:34"
          ]
        }
      }
    ]
  }
}
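The conversion can be cross-checked outside Elasticsearch. The following is a minimal Python sketch, not part of the original article: the helper name format_epoch_millis is hypothetical, and Asia/Shanghai is assumed to be a fixed UTC+8 offset (it currently observes no daylight saving time), so that no time zone database is required.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Asia/Shanghai is modeled as a fixed UTC+8 offset.
SHANGHAI = timezone(timedelta(hours=8))


def format_epoch_millis(epoch_millis: int) -> str:
    """Format a millisecond epoch timestamp as yyyy/MM/dd HH:mm:ss in UTC+8."""
    # Split into whole seconds and leftover milliseconds to avoid float rounding.
    seconds, millis = divmod(epoch_millis, 1000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)
    return moment.astimezone(SHANGHAI).strftime("%Y/%m/%d %H:%M:%S")


# The last_modified_utc value of the sample document.
print(format_epoch_millis(1609521634371))  # 2021/01/02 01:20:34
```

The printed value agrees with the parsed_last_modified field returned by the script.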

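Calculating the distance to the customer

Geo-point doc values support the arcDistance method, which expects lat and lon as arguments (in that order). The customer's location is 39.979849,116.466108, so the geographic distance to the customer can be calculated as follows: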

POST products/_search
{
  "script_fields": {
    "distance_in_meters": {
      "script": {
        "source": "doc['warehouse_location'].arcDistance(params.lat, params.lon)",
        "params": {
          "lat": 39.979849,
          "lon": 116.466108
        }
      }
    }
  }
}

The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "distance_in_meters" : [
            2804.735713697863
          ]
        }
      }
    ]
  }
}

The result shows that the warehouse is about 2804.7 meters from the customer. arcDistance returns its result in meters.
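As a sanity check on that number, here is a small Python sketch of a great-circle distance using the haversine formula. This is an illustration only: the function name haversine_m and the Earth radius constant are assumptions, and Elasticsearch's arcDistance may be implemented differently, so only approximate agreement should be expected.

```python
import math

# Assumed mean Earth radius in meters for a spherical model.
EARTH_RADIUS_M = 6371008.8


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# The sample document stores warehouse_location as an array, i.e. [lon, lat].
warehouse_lon, warehouse_lat = 116.472737, 40.004556
customer_lat, customer_lon = 39.979849, 116.466108

distance = haversine_m(warehouse_lat, warehouse_lon, customer_lat, customer_lon)
print(round(distance, 1))  # roughly 2.8 km, in line with distance_in_meters
```

Note the argument order: the array form of a geo_point is [lon, lat], while arcDistance and this helper take lat first.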

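Counting all available phones

Because nested fields are represented internally as separate hidden documents, you cannot access them through doc values, only through _source (as described earlier). When you use _source, it is like working with ordinary Java objects, for example:

HashMaps (such as params['_source'])
ArrayLists (such as params['_source']['availability_per_gb']), and so on

An ArrayList is “streamable”, so you can iterate over its entries and sum up the counts: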

POST products/_search
{
  "script_fields": {
    "available_units_count": {
      "script": """
        params['_source']['availability_per_gb']
          .stream()
          .mapToInt(model -> model.units)
          .sum()
      """
    }
  }
}

The query above returns:

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "available_units_count" : [
            90
          ]
        }
      }
    ]
  }
}

The result shows that 90 phones are still available.
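The same aggregation is easy to reproduce on the raw _source, which helps when checking a script's result by hand. The Python sketch below mirrors the stream().mapToInt(...).sum() pipeline; the function name total_units is hypothetical, and the data is copied from the sample document.

```python
# The relevant part of the sample document's _source.
source = {
    "availability_per_gb": [
        {"gigabytes": 128, "units": 58},
        {"gigabytes": 256, "units": 32},
        {"gigabytes": 512, "units": 0},
    ]
}


def total_units(doc_source: dict) -> int:
    """Sum the units of every nested availability_per_gb entry."""
    # A missing field is treated as an empty list, giving a total of 0.
    return sum(entry["units"] for entry in doc_source.get("availability_per_gb", []))


print(total_units(source))  # 90
```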

Putting it all together

Combining all of the script fields above gives us our final answer:

GET products/_search
{
  "query": {
    "match": {
      "name": "iphone"
    }
  },
  "script_fields": {
    "uppercase_color": {
      "script": """
        doc['color'].value.toUpperCase()
      """
    },
    "parsed_last_modified": {
      "script": """
       DateTimeFormatter
                .ofPattern('yyyy/MM/dd HH:mm:ss')
                .withZone(ZoneId.of('Asia/Shanghai'))
                .format(Instant.ofEpochMilli(doc['last_modified_utc'].value.toInstant().toEpochMilli()));
      """
    },
   "distance_in_meters": {
      "script": {
        "source": "doc['warehouse_location'].arcDistance(params.lat, params.lon)",
        "params": {
          "lat": 39.979849,
          "lon": 116.466108
        }
      }
    },
    "available_units_count": {
      "script": """
        params['_source']['availability_per_gb']
          .stream()
          .mapToInt(model -> model.units)
          .sum()
      """
    }    
  }
}

Note that I added a query above, so that the script fields are computed only for the documents we are actually interested in. The query returns:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "fields" : {
          "distance_in_meters" : [
            2804.735713697863
          ],
          "uppercase_color" : [
            "GOLD"
          ],
          "available_units_count" : [
            90
          ],
          "parsed_last_modified" : [
            "2021/01/02 01:20:34"
          ]
        }
      }
    ]
  }
}
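As the responses show, every script field comes back as an array under fields, even when it holds a single value. When feeding such results into a table, it can be convenient to unwrap them on the client. The following Python sketch does that for a response shaped like the one above; the function name flatten_script_fields is hypothetical, and the response is assumed to be already parsed into a dict.

```python
def flatten_script_fields(response: dict) -> list:
    """Turn each hit into a flat row, unwrapping single-value field arrays."""
    rows = []
    for hit in response["hits"]["hits"]:
        row = {"_id": hit["_id"]}
        for name, values in hit.get("fields", {}).items():
            # Keep multi-valued fields as lists; unwrap single values.
            row[name] = values[0] if len(values) == 1 else values
        rows.append(row)
    return rows


# A trimmed copy of the response shown above.
response = {
    "hits": {
        "hits": [
            {
                "_index": "products",
                "_id": "1",
                "fields": {
                    "distance_in_meters": [2804.735713697863],
                    "uppercase_color": ["GOLD"],
                    "available_units_count": [90],
                    "parsed_last_modified": ["2021/01/02 01:20:34"],
                },
            }
        ]
    }
}

for row in flatten_script_fields(response):
    print(row)
```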

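Script fields are very useful for building tables and visualizations: they create new fields, derived from the values of existing fields, that we can then tabulate or chart.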
