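This post covers aggregations in Elasticsearch (ES). Aggregations run complex statistical analyses over your data. They play a role similar to GROUP BY in SQL, but are more flexible and more powerful.

Before we start, let's load some data. The article type in the posts index currently holds the following documents: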

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "posts",
      "_type" : "article",
      "_id" : "5",
      "_score" : 1.0,
      "_source" : {
        "id" : 5,
        "name" : "生活日志",
        "author" : "wthfeng",
        "date" : "2015-09-21",
        "contents" : "这是日常生活的记录",
        "readNum" : 100
      }
    }, {
      "_index" : "posts",
      "_type" : "article",
      "_id" : "8",
      "_score" : 1.0,
      "_source" : {
        "name" : "ES笔记2",
        "author" : "hefeng",
        "contents" : "ES 的 search ",
        "date" : "2016-10-23",
        "readNum" : 40
      }
    }, {
      "_index" : "posts",
      "_type" : "article",
      "_id" : "2",
      "_score" : 1.0,
      "_source" : {
        "id" : 2,
        "name" : "更新后的文档",
        "author" : "wthfeng",
        "date" : "2016-10-23",
        "contents" : "这是我的javascript学习笔记",
        "brief" : "简介,这是新加的字段",
        "readNum" : 200
      }
    }, {
      "_index" : "posts",
      "_type" : "article",
      "_id" : "4",
      "_score" : 1.0,
      "_source" : {
        "id" : 4,
        "name" : "javascript指南",
        "author" : "wthfeng",
        "date" : "2016-09-21",
        "contents" : "js的权威指南",
        "readNum" : 200
      }
    }, {
      "_index" : "posts",
      "_type" : "article",
      "_id" : "6",
      "_score" : 1.0,
      "_source" : {
        "id" : "6",
        "name" : "java笔记1",
        "author" : "hefeng",
        "contents" : "java String info",
        "date" : "2016-10-21",
        "readNum" : 12
      }
    }, {
      "_index" : "posts",
      "_type" : "article",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : {
        "id" : 1,
        "name" : "ES更新过的文档",
        "author" : "wthfeng",
        "date" : "2016-10-25",
        "contents" : "这是更新内容",
        "readNum" : 200
      }
    }, {
      "_index" : "posts",
      "_type" : "article",
      "_id" : "7",
      "_score" : 1.0,
      "_source" : {
        "id" : "7",
        "name" : "ES笔记1",
        "author" : "hefeng",
        "contents" : "ES search",
        "date" : "2016-09-21",
        "readNum" : 100
      }
    } ]
  }
}

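There are 7 documents in total. Every operation below runs against this data.

Aggregation structure

Aggregations are a top-level part of a search request, on the same level as query and sort. They are written under the aggs key, roughly like this:

{
    "query":{},
    "aggs":{}
}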
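
As a rough illustration of this layout, the Python sketch below assembles a request body with a query and an aggs section side by side. The helper name build_search_body and the match_all query are assumptions made for this example and are not part of the original post.

```python
import json


def build_search_body(aggs, query=None):
    """Assemble a search request body; "query" is optional, "aggs" sits beside it."""
    body = {}
    if query is not None:
        body["query"] = query
    body["aggs"] = aggs
    return body


# Hypothetical example: match every document and compute stats on readNum.
body = build_search_body(
    aggs={"readNum_stats": {"stats": {"field": "readNum"}}},
    query={"match_all": {}},
)
print(json.dumps(body, indent=4))
```

The printed JSON could be saved as the search.json file that the requests in this post send with -d.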

Let's start with an example:

GET /posts/article/_search?pretty&search_type=count -d @search.json

{
    "aggs":{
        "readNum_stats":{
            "stats":{
                "field":"readNum"
            }
        }
    }
}

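search_type=count tells ES to skip the matching documents themselves and return only the total hit count together with the aggregation results. (search_type=count was deprecated in Elasticsearch 2.0; on later versions, setting "size": 0 in the request body achieves the same thing.) In the request body, stats asks for summary statistics of a field: its count, minimum, maximum, average and sum. readNum_stats is a name we chose for this aggregation, and the results are returned under that name. The response looks like this: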

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "readNum_stats" : {
      "count" : 7,
      "min" : 12.0,
      "max" : 200.0,
      "avg" : 121.71428571428571,
      "sum" : 852.0
    }
  }
}

The aggregation results come back under aggregations: the minimum, maximum, average, sum and count of the readNum field have all been computed.
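
To make the five figures concrete, here is a minimal Python sketch that recomputes them from the readNum values of the seven sample documents. It only mirrors the arithmetic and is not how ES computes the result internally; the function name stats is chosen for this example.

```python
# readNum values of the seven sample documents listed at the top of the post.
read_nums = [100, 40, 200, 200, 12, 200, 100]


def stats(values):
    """Compute the same five figures that the stats aggregation reports."""
    total = float(sum(values))
    return {
        "count": len(values),
        "min": float(min(values)),
        "max": float(max(values)),
        "avg": total / len(values),
        "sum": total,
    }


result = stats(read_nums)
print(result)
```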

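Aggregation types

There are two main types of aggregation: metric aggregations and bucket aggregations. The example above is a metric aggregation, which computes statistics over a field (such as its minimum, maximum or average). A bucket aggregation groups documents by some criterion, much like GROUP BY in SQL. We cover each in turn below.

Metric aggregations

Metric aggregations play the role of sum, avg, min and max in SQL: each produces one or more statistics. Usage is as follows:

1. min, max, avg and sum aggregations

Each of these returns the corresponding statistic for the given field. Note that the field must be numeric.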
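
All four aggregations share the same request shape and differ only in the metric name. The Python sketch below builds such a body for any of them. The helper metric_agg and the generated aggregation names are hypothetical, chosen only for this illustration.

```python
import json

# The four single-value metric aggregations covered in this section.
SINGLE_VALUE_METRICS = ("min", "max", "avg", "sum")


def metric_agg(name, metric, field):
    """Build a request body holding one single-value metric aggregation."""
    if metric not in SINGLE_VALUE_METRICS:
        raise ValueError("unsupported metric: " + metric)
    return {"aggs": {name: {metric: {"field": field}}}}


# Print one request body per metric for the readNum field.
for metric in SINGLE_VALUE_METRICS:
    print(json.dumps(metric_agg(metric + "_readNum", metric, "readNum")))
```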

① Find the lowest read count among the documents

GET /posts/article/_search?pretty&search_type=count -d @search.json

{
    "aggs":{
        "minReadNum":{
            "min":{
                "field":"readNum"
            }
        }
    }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "minReadNum" : {
      "value" : 12.0
    }
  }
}

② Find the total read count

GET /posts/article/_search?pretty&search_type=count -d @search.json

{
    "aggs": {
        "sum_ReadNum": {
            "sum": {
                "field": "readNum"
            }
        }
    }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "sum_ReadNum" : {
      "value" : 852.0
    }
  }
}

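These are all straightforward, so we won't go through each one. There is also a metric aggregation that returns these values together in one response: the stats aggregation demonstrated in the previous section.

2. stats and extended_stats aggregations

The stats aggregation returns the count, minimum, maximum, average and sum of the given field. extended_stats extends stats and additionally returns the sum of squares, the variance, the standard deviation and related statistics.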

GET /posts/article/_search?pretty&search_type=count -d @search.json

{
    "aggs": {
        "stats_of_readNum": {
            "extended_stats": {
                "field": "readNum"
            }
        }
    }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "stats_of_readNum" : {
      "count" : 7,
      "min" : 12.0,
      "max" : 200.0,
      "avg" : 121.71428571428571,
      "sum" : 852.0,
      "sum_of_squares" : 141744.0, // sum of squares
      "variance" : 5434.775510204081, // variance
      "std_deviation" : 73.72092993312063, // standard deviation
      "std_deviation_bounds" : {
        "upper" : 269.156145580527,
        "lower" : -25.72757415195555
      }
    }
  }
}
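
The extra fields can be checked by hand. The sketch below recomputes them in Python, assuming a population variance (mean of the squares minus the square of the mean) and bounds of two standard deviations around the average. Both assumptions are inferred from the numbers in the response above rather than stated in the original post.

```python
import math

# readNum values of the seven sample documents.
read_nums = [100, 40, 200, 200, 12, 200, 100]


def extended_stats(values, sigma=2.0):
    """Recompute the extra figures reported by extended_stats."""
    count = len(values)
    avg = float(sum(values)) / count
    sum_of_squares = float(sum(v * v for v in values))
    # Population variance: mean of the squares minus the square of the mean.
    variance = sum_of_squares / count - avg * avg
    std_deviation = math.sqrt(variance)
    return {
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        # Assumed here: bounds are avg plus or minus sigma standard deviations.
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


result = extended_stats(read_nums)
print(result)
```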

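Bucket aggregations

1. terms aggregation

The terms aggregation is the closest equivalent to GROUP BY in SQL. Consider the following example:

Group the documents by author and count how many documents each author has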

GET /posts/article/_search?pretty&search_type=count -d @search.json

{
    "aggs": {
        "author_aggs": {
            "terms": {
                "field": "author"                       
            }
        }
    }
}

Response:

{
  "took" : 125,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "author_aggs" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "wthfeng",
        "doc_count" : 4
      }, {
        "key" : "hefeng",
        "doc_count" : 3
      } ]
    }
  }
}

The response shows that the author wthfeng has 4 documents and hefeng has 3. The equivalent SQL would be:

select author,count(*) from article group by author
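
The same grouping can be sketched in plain Python with collections.Counter, using the author values of the seven sample documents. This only illustrates the grouping logic. The function name terms_buckets is made up for this example, and the output mimics just the key and doc_count fields of each bucket.

```python
from collections import Counter

# author field of the seven sample documents, in the order they were listed.
authors = ["wthfeng", "hefeng", "wthfeng", "wthfeng", "hefeng", "wthfeng", "hefeng"]


def terms_buckets(values):
    """Group equal values and count them, largest group first."""
    counts = Counter(values)
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]


buckets = terms_buckets(authors)
print(buckets)
```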

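By default, buckets are sorted by document count (doc_count) in descending order. We can also sort them in ascending order, or sort by key. To sort by doc_count use _count; to sort by key use _term. For example, the following request sorts by key in ascending order:

{
    "aggs": {
        "author_aggs": {
            "terms": {
                "field": "author",
                "order": {
                    "_term": "asc"
                }
            }
        }
    }
}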
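
To show what the two criteria do to the bucket list, here is a small Python sketch that sorts the buckets from the author example according to an order spec. The function order_buckets and the SORT_KEYS table are hypothetical helpers. The sketch only imitates the ordering and does not reflect ES internals.

```python
# Buckets as returned by the terms aggregation on author earlier in the post.
buckets = [
    {"key": "wthfeng", "doc_count": 4},
    {"key": "hefeng", "doc_count": 3},
]

# Maps each order criterion to the bucket attribute it sorts on.
SORT_KEYS = {
    "_count": lambda bucket: bucket["doc_count"],
    "_term": lambda bucket: bucket["key"],
}


def order_buckets(buckets, order):
    """Sort buckets by a single-entry order spec such as {"_term": "asc"}."""
    criterion, direction = next(iter(order.items()))
    if criterion not in SORT_KEYS:
        raise ValueError("unknown order criterion: " + criterion)
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    return sorted(buckets, key=SORT_KEYS[criterion], reverse=(direction == "desc"))


print(order_buckets(buckets, {"_term": "asc"}))
print(order_buckets(buckets, {"_count": "desc"}))
```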

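2. range aggregation

The range aggregation groups numeric values into ranges that you define. The start of a range is given by from (inclusive) and the end by to (exclusive). Each range can also be given a memorable custom name via key. For example, to group the documents by read count:

GET /posts/article/_search?pretty&search_type=count -d @search.json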

{
    "aggs": {
        "read_docs": {
            "range": {
                "field":"readNum",
                "ranges":[
                    {"to":50,"key":"less 50"},
                    {"from":50,"to":100,"key":"50 - 100"},
                    {"from":100,"to":150,"key":"100 - 150"},
                    {"from":150,"key":"more than 150"}
                ]                                   
            }
        }
    }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "read_docs" : {
      "buckets" : [ {
        "key" : "less 50",
        "to" : 50.0,
        "to_as_string" : "50.0",
        "doc_count" : 2
      }, {
        "key" : "50 - 100",
        "from" : 50.0,
        "from_as_string" : "50.0",
        "to" : 100.0,
        "to_as_string" : "100.0",
        "doc_count" : 0
      }, {
        "key" : "100 - 150",
        "from" : 100.0,
        "from_as_string" : "100.0",
        "to" : 150.0,
        "to_as_string" : "150.0",
        "doc_count" : 2
      }, {
        "key" : "more than 150",
        "from" : 150.0,
        "from_as_string" : "150.0",
        "doc_count" : 3
      } ]
    }
  }
}
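
The boundary rule (from inclusive, to exclusive) explains why the two documents with readNum 100 land in "100 - 150" rather than "50 - 100". A minimal Python sketch of the same bucketing, with range_buckets as a made-up helper name:

```python
# readNum values of the seven sample documents.
read_nums = [100, 40, 200, 200, 12, 200, 100]

# Same ranges as in the request above.
ranges = [
    {"to": 50, "key": "less 50"},
    {"from": 50, "to": 100, "key": "50 - 100"},
    {"from": 100, "to": 150, "key": "100 - 150"},
    {"from": 150, "key": "more than 150"},
]


def range_buckets(values, ranges):
    """Count values per range; "from" is inclusive, "to" is exclusive."""
    buckets = []
    for r in ranges:
        lower = r.get("from", float("-inf"))  # a missing "from" means no lower bound
        upper = r.get("to", float("inf"))     # a missing "to" means no upper bound
        doc_count = sum(1 for v in values if lower <= v < upper)
        buckets.append({"key": r["key"], "doc_count": doc_count})
    return buckets


buckets = range_buckets(read_nums, ranges)
print(buckets)
```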

3. date_range aggregation

The date_range aggregation works the same way as range, except that it is dedicated to date fields. In addition, format can be used to specify the date format of the range boundaries.

GET /posts/article/_search?pretty&search_type=count -d @search.json

{
    "aggs":{
        "date_docs":{
            "date_range":{
                "field":"date",
                "format":"yyyy-MM",
                "ranges":[
                    {"key":"before 2016","to":"2016-01"},
                    {"key":"first half of 2016","from":"2016-01","to":"2016-06"},
                    {"key":"second half of 2016","from":"2016-06","to":"2016-12"}
                ]
            }
        }
    }
}

Response:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 7,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "date_docs" : {
      "buckets" : [ {
        "key" : "before 2016",
        "to" : 1.4516064E12,
        "to_as_string" : "2016-01",
        "doc_count" : 1
      }, {
        "key" : "first half of 2016",
        "from" : 1.4516064E12,
        "from_as_string" : "2016-01",
        "to" : 1.4647392E12,
        "to_as_string" : "2016-06",
        "doc_count" : 0
      }, {
        "key" : "second half of 2016",
        "from" : 1.4647392E12,
        "from_as_string" : "2016-06",
        "to" : 1.4805504E12,
        "to_as_string" : "2016-12",
        "doc_count" : 6
      } ]
    }
  }
}
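
The date buckets follow the same inclusive from, exclusive to rule as the numeric ranges. The Python sketch below reproduces the counts with datetime, assuming that the yyyy-MM format corresponds to Python's "%Y-%m" and that a month string such as "2016-06" resolves to the first instant of that month. The helper names are invented for this example.

```python
from datetime import datetime

# date field of the seven sample documents.
dates = [
    "2015-09-21", "2016-10-23", "2016-10-23", "2016-09-21",
    "2016-10-21", "2016-10-25", "2016-09-21",
]

# Same ranges as in the request above.
ranges = [
    {"key": "before 2016", "to": "2016-01"},
    {"key": "first half of 2016", "from": "2016-01", "to": "2016-06"},
    {"key": "second half of 2016", "from": "2016-06", "to": "2016-12"},
]


def parse_month(text):
    """Parse a yyyy-MM boundary; it resolves to the first day of that month."""
    return datetime.strptime(text, "%Y-%m")


def date_range_buckets(dates, ranges):
    """Count dates per range; "from" is inclusive, "to" is exclusive."""
    parsed = [datetime.strptime(d, "%Y-%m-%d") for d in dates]
    buckets = []
    for r in ranges:
        lower = parse_month(r["from"]) if "from" in r else datetime.min
        upper = parse_month(r["to"]) if "to" in r else datetime.max
        doc_count = sum(1 for d in parsed if lower <= d < upper)
        buckets.append({"key": r["key"], "doc_count": doc_count})
    return buckets


buckets = date_range_buckets(dates, ranges)
print(buckets)
```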