Can an Empty Field Be Indexed? Indexing Null Values

Repost

mob64ca1418e88d 2024-04-06 14:07:33

Elasticsearch: Dealing with Null Values

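Consider the earlier example, in which documents have a field called tags. A document may have one tag, several tags, or none at all. If a field has no value, how is it stored in the inverted index?

That is a trick question, because the answer is that it is not stored at all. Here is the inverted index from the previous section: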

Token	DocIDs
`open_source`	2
`search`	1,2

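How could you store a field that does not exist in that data structure? You can't! An inverted index is simply a list of tokens and the documents that contain them. If a field does not exist, it holds no tokens, so it has no representation in the inverted index.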

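Ultimately, this means that null, [] (an empty array), and [null] are all equivalent: none of them exists in the inverted index.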
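The equivalence can be illustrated with a small Python sketch. This is a simplified model, not Elasticsearch code: the function name `index_terms` is hypothetical, and it assumes no analysis takes place, so each non-null value becomes exactly one token.

```python
def index_terms(value):
    """Return the tokens a field value would contribute to an inverted index.

    Simplified model: no analyzer, each non-null value is one token.
    """
    if value is None:
        return []
    values = value if isinstance(value, list) else [value]
    # Null entries inside an array contribute nothing.
    return [v for v in values if v is not None]


for value in (None, [], [None], ["search", None]):
    print(repr(value), "->", index_terms(value))
```

The first three inputs all produce an empty token list, which is why they are indistinguishable once the document has been indexed.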

exists Filter

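The exists filter returns documents that have at least one value in the specified field. Let's index some example documents: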

POST /my_index/posts/_bulk
{ "index": { "_id": "1" }}
{ "tags" : ["search"] }                     ...(1)
{ "index": { "_id": "2" }}
{ "tags" : ["search", "open_source"] }      ...(2)
{ "index": { "_id": "3" }}
{ "other_field" : "some data" }             ...(3)
{ "index": { "_id": "4" }}
{ "tags" : null }                           ...(4)
{ "index": { "_id": "5" }}
{ "tags" : ["search", null] }               ...(5)

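(1) The tags field has one value.
(2) The tags field has two values.
(3) The tags field is missing altogether.
(4) The tags field is set to null.
(5) The tags field has one value and a null.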

The resulting inverted index is:

Token	DocIDs
`open_source`	2
`search`	1,2,5
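To see how the five documents collapse into those two rows, here is a minimal Python sketch that rebuilds the table. The helper `build_inverted_index` is hypothetical and models only the behavior described in this section: missing fields and null values add no tokens.

```python
from collections import defaultdict

DOCS = {
    1: {"tags": ["search"]},
    2: {"tags": ["search", "open_source"]},
    3: {"other_field": "some data"},
    4: {"tags": None},
    5: {"tags": ["search", None]},
}


def build_inverted_index(docs, field):
    """Map each token of `field` to the sorted list of document IDs containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        value = docs[doc_id].get(field)  # None for both missing and explicit null
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        for token in values:
            if token is not None and doc_id not in index[token]:
                index[token].append(doc_id)
    return dict(index)


for token, doc_ids in sorted(build_inverted_index(DOCS, "tags").items()):
    print(token, doc_ids)
# open_source [2]
# search [1, 2, 5]
```

Documents 3 and 4 never appear in the index, and document 5 appears only under its real tag.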

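Our goal is to find every document in which a tag is set, whatever the tag is. In SQL terms, this is an IS NOT NULL query: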

SELECT tags
FROM posts
WHERE tags IS NOT NULL

In Elasticsearch, we use the exists filter:

GET /my_index/posts/_search
{ 
  "query" : { 
    "filtered" : { 
      "filter" : { 
        "exists" : { "field" : "tags" }}}}}

The query returns three documents:

"hits" : [
    {
      "_id" :     "1",
      "_score" :  1.0,
      "_source" : { "tags" : ["search"] }
    },
    {
      "_id" :     "5",
      "_score" :  1.0,
      "_source" : { "tags" : ["search", null] }   ...(1)
    },
    {
      "_id" :     "2",
      "_score" :  1.0,
      "_source" : { "tags" : ["search", "open_source"] }
    }
]

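(1) Document 5 is returned even though it contains a null value. The field exists because a real tag value was indexed, so the null has no effect on the filter. The only documents excluded are 3 and 4, which have no terms in the tags field.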
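The matching rule can be sketched in a few lines of Python. This is an illustrative model of the behavior described above, not how Elasticsearch implements it; `field_exists` is a hypothetical helper.

```python
DOCS = {
    1: {"tags": ["search"]},
    2: {"tags": ["search", "open_source"]},
    3: {"other_field": "some data"},
    4: {"tags": None},
    5: {"tags": ["search", None]},
}


def field_exists(doc, field):
    """True if the field holds at least one non-null value."""
    value = doc.get(field)
    if value is None:
        return False
    values = value if isinstance(value, list) else [value]
    return any(v is not None for v in values)


matching = [doc_id for doc_id, doc in sorted(DOCS.items()) if field_exists(doc, "tags")]
print(matching)  # [1, 2, 5]
```

Document 5 passes because one real value is enough; the null next to it is simply ignored.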

missing Filter

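The missing filter is essentially the inverse of exists: it returns documents that have no value in the specified field, much like this SQL: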

SELECT tags
FROM posts
WHERE tags IS NULL

Let's swap the exists filter in the previous example for a missing filter:

GET /my_index/posts/_search
{
    "query" : {
        "filtered" : {
            "filter": {
                "missing" : { "field" : "tags" }
            }
        }
    }
}

As expected, we get back the two documents that have no real value in the tags field:

"hits" : [
    {
      "_id" :     "3",
      "_score" :  1.0,
      "_source" : { "other_field" : "some data" }
    },
    {
      "_id" :     "4",
      "_score" :  1.0,
      "_source" : { "tags" : null }
    }
]
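Because missing is the complement of exists, it can be modeled by negating the same check. The sketch below uses hypothetical helper names and is only a model of the behavior. It also builds the request body for the same condition as it is written in Elasticsearch 5.0 and later, where the filtered and missing queries no longer exist and the condition is expressed as a bool query with must_not and exists.

```python
import json

DOCS = {
    1: {"tags": ["search"]},
    2: {"tags": ["search", "open_source"]},
    3: {"other_field": "some data"},
    4: {"tags": None},
    5: {"tags": ["search", None]},
}


def has_value(doc, field):
    """True if the field holds at least one non-null value."""
    value = doc.get(field)
    values = value if isinstance(value, list) else [value]
    return any(v is not None for v in values)


def missing(docs, field):
    """IDs of documents with no real value in `field` (the inverse of exists)."""
    return [doc_id for doc_id, doc in sorted(docs.items()) if not has_value(doc, field)]


def missing_as_bool_query(field):
    """Request body expressing 'field is missing' without the missing filter."""
    return {"query": {"bool": {"must_not": {"exists": {"field": field}}}}}


print(missing(DOCS, "tags"))  # [3, 4]
print(json.dumps(missing_as_bool_query("tags")))
```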

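When null Means null

Sometimes you need to distinguish between a field that has no value and a field that was explicitly set to null. With the default behavior we saw earlier, this is impossible; the data is lost. Fortunately, there is an option that replaces explicit null values with a placeholder of our choosing.
When specifying the mapping for a string, numeric, Boolean, or date field, you can also set a null_value, which is used whenever an explicit null is encountered. A field that has no value at all is still excluded from the inverted index.
When choosing a suitable null_value, make sure that it:
Matches the field's type. You can't use a string null_value on a field of type date.
Differs from the normal values the field may contain, so that real values are not confused with null values.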
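As a rough illustration, the sketch below shows what such a mapping fragment might look like and models its effect at index time. The placeholder `_null_`, the helper `indexed_tokens`, and the choice of a not_analyzed string field are assumptions made for this example. The model also assumes that a null inside an array is replaced in the same way as a top-level null.

```python
NULL_PLACEHOLDER = "_null_"  # hypothetical placeholder, must not clash with real tags

# Assumed mapping fragment for the posts type, using the older string field type.
MAPPING = {
    "posts": {
        "properties": {
            "tags": {
                "type": "string",
                "index": "not_analyzed",
                "null_value": NULL_PLACEHOLDER,
            }
        }
    }
}


def indexed_tokens(doc, field, null_value=None):
    """Tokens indexed for `field`, replacing explicit nulls with `null_value`."""
    if field not in doc:
        return []  # a missing field is never replaced
    value = doc[field]
    values = value if isinstance(value, list) else [value]
    tokens = []
    for v in values:
        if v is None:
            if null_value is not None:
                tokens.append(null_value)
        else:
            tokens.append(v)
    return tokens


null_value = MAPPING["posts"]["properties"]["tags"]["null_value"]
print(indexed_tokens({"tags": None}, "tags", null_value))                 # ['_null_']
print(indexed_tokens({"other_field": "some data"}, "tags", null_value))  # []
print(indexed_tokens({"tags": []}, "tags", null_value))                   # []
```

With the placeholder in place, a document with an explicit null can be told apart from one that lacks the field, because only the former carries the placeholder token.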

exists/missing on Objects

The exists and missing filters also work on inner objects, not just core types. Take the following document:

{
   "name" : {
      "first" : "John",
      "last" :  "Smith"
   }
}

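We can check for the existence not only of name.first and name.last but also of name itself. However, as mentioned in Types and Mappings, an object like this is flattened internally into a simple field-value structure, like this: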

{
   "name.first" : "John",
   "name.last"  : "Smith"
}
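The flattening step can be sketched as a short recursive function. `flatten` is a hypothetical helper that mirrors the dotted-path structure shown above; it does not reproduce how Elasticsearch handles arrays of objects.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into inner objects, extending the dotted prefix.
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


doc = {"name": {"first": "John", "last": "Smith"}}
print(flatten(doc))  # {'name.first': 'John', 'name.last': 'Smith'}
```

Note that the flattened result has no key called name on its own.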

So how can we use an exists or missing filter on the name field, which does not really exist in the inverted index?

The reason it works is that a filter like

{ "exists" : { "field" : "name" }}

is really executed as

{
    "bool": {
        "should": [
            { "exists": { "field": "name.first" }},
            { "exists": { "field": "name.last"  }}
        ]
    }
}

This also means that if first and last are both empty, the name namespace does not exist.
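A minimal Python sketch of that rule, operating on an already flattened document: the object counts as existing if any leaf field under its prefix has a value. `object_exists` is a hypothetical helper, and the model handles scalar leaf values only.

```python
def object_exists(flat_doc, field):
    """True if `field` or any dotted sub-field of it holds a non-null value."""
    prefix = field + "."
    return any(
        value is not None
        for key, value in flat_doc.items()
        if key == field or key.startswith(prefix)
    )


print(object_exists({"name.first": "John", "name.last": "Smith"}, "name"))  # True
print(object_exists({"name.first": "John", "name.last": None}, "name"))     # True
print(object_exists({"name.first": None, "name.last": None}, "name"))       # False
```

The last call shows the consequence stated above: once every leaf is empty, nothing is left to make name exist.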

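This article is reposted content. We respect the original author's copyright. If it contains errors or infringes any rights, the original author is welcome to contact us to have the content corrected or the article removed.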