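In this post we look at aggregations in ES. Aggregations run statistical analysis over your data. They play a role similar to group by in SQL, but are more flexible and more powerful.
Before we start, let's load some data. The article type in the posts index currently contains the following data: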
{
"took" : 8,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 7,
"max_score" : 1.0,
"hits" : [ {
"_index" : "posts",
"_type" : "article",
"_id" : "5",
"_score" : 1.0,
"_source" : {
"id" : 5,
"name" : "生活日志",
"author" : "wthfeng",
"date" : "2015-09-21",
"contents" : "这是日常生活的记录",
"readNum" : 100
}
}, {
"_index" : "posts",
"_type" : "article",
"_id" : "8",
"_score" : 1.0,
"_source" : {
"name" : "ES笔记2",
"author" : "hefeng",
"contents" : "ES 的 search ",
"date" : "2016-10-23",
"readNum" : 40
}
}, {
"_index" : "posts",
"_type" : "article",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"id" : 2,
"name" : "更新后的文档",
"author" : "wthfeng",
"date" : "2016-10-23",
"contents" : "这是我的javascript学习笔记",
"brief" : "简介,这是新加的字段",
"readNum" : 200
}
}, {
"_index" : "posts",
"_type" : "article",
"_id" : "4",
"_score" : 1.0,
"_source" : {
"id" : 4,
"name" : "javascript指南",
"author" : "wthfeng",
"date" : "2016-09-21",
"contents" : "js的权威指南",
"readNum" : 200
}
}, {
"_index" : "posts",
"_type" : "article",
"_id" : "6",
"_score" : 1.0,
"_source" : {
"id" : "6",
"name" : "java笔记1",
"author" : "hefeng",
"contents" : "java String info",
"date" : "2016-10-21",
"readNum" : 12
}
}, {
"_index" : "posts",
"_type" : "article",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"id" : 1,
"name" : "ES更新过的文档",
"author" : "wthfeng",
"date" : "2016-10-25",
"contents" : "这是更新内容",
"readNum" : 200
}
}, {
"_index" : "posts",
"_type" : "article",
"_id" : "7",
"_score" : 1.0,
"_source" : {
"id" : "7",
"name" : "ES笔记1",
"author" : "hefeng",
"contents" : "ES search",
"date" : "2016-09-21",
"readNum" : 100
}
} ]
}
}
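We have 7 documents. All operations below run against this data.
Aggregation structure
Aggregations are a data operation on the same level as query and sort, and are expressed with aggs. The request body looks like this: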
{
"query":{},
"aggs":{}
}
Let's start with an example:
GET /posts/article/_search?pretty&search_type=count -d @search.json
{
"aggs":{
"readNum_stats":{
"stats":{
"field":"readNum"
}
}
}
}
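search_type=count tells ES to skip the matching documents themselves and return only the hit count together with the aggregation results. In the query, stats reports the count, minimum, maximum, average and sum of a field. readNum_stats is a name we choose ourselves; the result is returned under that name. The response is as follows: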
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 7,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"readNum_stats" : {
"count" : 7,
"min" : 12.0,
"max" : 200.0,
"avg" : 121.71428571428571,
"sum" : 852.0
}
}
}
The aggregation results are returned inside aggregations: the minimum, maximum, average, sum and count of the readNum field have all been computed.
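As a rough cross-check of those figures, here is a minimal Python sketch that recomputes them by hand. The readNum values are copied from the 7 sample documents; stats is a hypothetical helper written for illustration and is not part of any ES client.

```python
# readNum values of the 7 sample documents, in the order they are listed
read_nums = [100, 40, 200, 200, 12, 200, 100]


def stats(values):
    """Compute the five figures that the stats aggregation reports."""
    total = sum(values)
    return {
        "count": len(values),
        "min": float(min(values)),
        "max": float(max(values)),
        "avg": total / len(values),
        "sum": float(total),
    }


result = stats(read_nums)
print(result)
```

The printed values should line up with the readNum_stats object in the response.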
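Aggregation types
There are two main kinds of aggregation: metric aggregations and bucket aggregations. The example above is a metric aggregation, which computes statistics over a field (minimum, maximum, average and so on). A bucket aggregation instead groups documents by some criterion, much like group by in SQL. We cover each in turn.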
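Metric aggregations
Metric aggregations do the job of sum, avg, min, max and similar functions in SQL, producing one or more statistics. Usage is as follows:
1. min, max, avg, sum aggregations
Each returns the corresponding statistic for the given field. Note that the field must be numeric.
① Lowest read count among the documents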
GET /posts/article/_search?pretty&search_type=count -d @search.json
{
"aggs":{
"minReadNum":{
"min":{
"field":"readNum"
}
}
}
}
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 7,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"minReadNum" : {
"value" : 12.0
}
}
}
② Total read count
GET /posts/article/_search?pretty&search_type=count -d @search.json
{
"aggs": {
"sum_ReadNum": {
"sum": {
"field": "readNum"
}
}
}
}
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 7,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"sum_ReadNum" : {
"value" : 852.0
}
}
}
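These are all straightforward, so we won't list every one. There is also a metric aggregation that outputs all of these values together: the stats aggregation demonstrated earlier.
2. stats, extended_stats aggregations
The stats aggregation outputs the count, minimum, maximum, average and sum of the given field. extended_stats extends stats, adding the sum of squares, variance, standard deviation and related statistics.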
GET /posts/article/_search?pretty&search_type=count -d @search.json
{
"aggs": {
"stats_of_readNum": {
"extended_stats": {
"field": "readNum"
}
}
}
}
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 7,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"stats_of_readNum" : {
"count" : 7,
"min" : 12.0,
"max" : 200.0,
"avg" : 121.71428571428571,
"sum" : 852.0,
"sum_of_squares" : 141744.0, // sum of squares
"variance" : 5434.775510204081, // variance
"std_deviation" : 73.72092993312063, // standard deviation
"std_deviation_bounds" : {
"upper" : 269.156145580527,
"lower" : -25.72757415195555
}
}
}
}
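To see where the extra figures come from, the following Python sketch recomputes them from the 7 readNum values. It assumes the population variance (dividing by the count) and bounds at the average plus or minus 2 standard deviations. Both assumptions are inferred from the numbers in the output above, and extended_stats here is a hypothetical helper, not an ES API.

```python
import math

# readNum values of the 7 sample documents
read_nums = [100, 40, 200, 200, 12, 200, 100]


def extended_stats(values, sigma=2.0):
    """Recompute the extra figures; sigma=2.0 is an assumption that fits the output."""
    n = len(values)
    avg = sum(values) / n
    sum_of_squares = sum(v * v for v in values)
    # Population variance: mean of the squares minus the square of the mean
    variance = sum_of_squares / n - avg * avg
    std_deviation = math.sqrt(variance)
    return {
        "sum_of_squares": float(sum_of_squares),
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


ext = extended_stats(read_nums)
print(ext)
```

Up to floating-point rounding, the values agree with the stats_of_readNum object in the response.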
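Bucket aggregations
1. terms aggregation
The terms aggregation is much like group by in SQL. Consider the following example:
Group the documents by author and count the documents of each author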
GET /posts/article/_search?pretty&search_type=count -d @search.json
{
"aggs": {
"author_aggs": {
"terms": {
"field": "author"
}
}
}
}
Result:
{
"took" : 125,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 7,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"author_aggs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "wthfeng",
"doc_count" : 4
}, {
"key" : "hefeng",
"doc_count" : 3
} ]
}
}
}
The result shows that the author wthfeng has 4 documents and hefeng has 3. In SQL this would be:
select author,count(*) from article group by author
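The same grouping can be sketched in Python with a Counter, which may help if you think in code rather than SQL. The author list is copied from the sample documents in listing order; the bucket layout mimics the response above and is only an illustration.

```python
from collections import Counter

# author of each of the 7 sample documents, in the order they are listed
authors = ["wthfeng", "hefeng", "wthfeng", "wthfeng", "hefeng", "wthfeng", "hefeng"]

counts = Counter(authors)

# most_common() yields (value, count) pairs, highest count first,
# which mirrors the default doc_count-descending bucket order
buckets = [{"key": key, "doc_count": n} for key, n in counts.most_common()]
print(buckets)
```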
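By default the buckets are sorted by document count (doc_count) in descending order. We can also sort them in ascending order, or sort by key. To sort by doc_count use _count; to sort by key use _term. For example, sorting by key in ascending order uses the following query:
{
"aggs": {
"author_aggs": {
"terms": {
"field": "author",
"order":{
"_term":"asc"
}
}
}
}
}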
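The order setting only changes the sequence of the buckets, not their contents. A small Python sketch of the two sort criteria, assuming keys are compared as plain strings (sort_buckets is a hypothetical helper for illustration):

```python
# Buckets as returned by the terms aggregation on author
buckets = [
    {"key": "wthfeng", "doc_count": 4},
    {"key": "hefeng", "doc_count": 3},
]


def sort_buckets(buckets, order):
    """order is a one-entry dict such as {"_term": "asc"} or {"_count": "desc"}."""
    criterion, direction = next(iter(order.items()))
    # Map the order criterion onto the bucket field it sorts by
    field = {"_term": "key", "_count": "doc_count"}[criterion]
    return sorted(buckets, key=lambda b: b[field], reverse=(direction == "desc"))


by_term = sort_buckets(buckets, {"_term": "asc"})
by_count = sort_buckets(buckets, {"_count": "desc"})
print([b["key"] for b in by_term])
print([b["key"] for b in by_count])
```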
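2. range aggregation
The range aggregation groups numeric data into ranges that you define. The start of a range is given by from (inclusive) and the end by to (exclusive). Each range can be given a memorable custom name with key. For example, grouping by read count:
GET /posts/article/_search?pretty&search_type=count -d @search.json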
{
"aggs": {
"read_docs": {
"range": {
"field":"readNum",
"ranges":[
{"to":50,"key":"less 50"},
{"from":50,"to":100,"key":"50 - 100"},
{"from":100,"to":150,"key":"100 - 150"},
{"from":150,"key":"more than 150"}
]
}
}
}
}
Result:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 7,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"read_docs" : {
"buckets" : [ {
"key" : "less 50",
"to" : 50.0,
"to_as_string" : "50.0",
"doc_count" : 2
}, {
"key" : "50 - 100",
"from" : 50.0,
"from_as_string" : "50.0",
"to" : 100.0,
"to_as_string" : "100.0",
"doc_count" : 0
}, {
"key" : "100 - 150",
"from" : 100.0,
"from_as_string" : "100.0",
"to" : 150.0,
"to_as_string" : "150.0",
"doc_count" : 2
}, {
"key" : "more than 150",
"from" : 150.0,
"from_as_string" : "150.0",
"doc_count" : 3
} ]
}
}
}
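The inclusive from and exclusive to explain why the two documents with readNum 100 land in "100 - 150" and not in "50 - 100". A minimal Python sketch of that bucketing rule, with range_buckets as a hypothetical helper:

```python
# readNum values of the 7 sample documents
read_nums = [100, 40, 200, 200, 12, 200, 100]

ranges = [
    {"to": 50, "key": "less 50"},
    {"from": 50, "to": 100, "key": "50 - 100"},
    {"from": 100, "to": 150, "key": "100 - 150"},
    {"from": 150, "key": "more than 150"},
]


def range_buckets(values, ranges):
    """Count values per range; from is inclusive, to is exclusive."""
    buckets = []
    for r in ranges:
        low = r.get("from", float("-inf"))   # a missing from means unbounded below
        high = r.get("to", float("inf"))     # a missing to means unbounded above
        count = sum(1 for v in values if low <= v < high)
        buckets.append({"key": r["key"], "doc_count": count})
    return buckets


range_result = range_buckets(read_nums, ranges)
print(range_result)
```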
3. date_range aggregation
date_range works the same way as range, except that it is dedicated to date fields. In addition, format can be used to specify the date format of the range boundaries.
GET /posts/article/_search?pretty&search_type=count -d @search.json
{
"aggs":{
"date_docs":{
"date_range":{
"field":"date",
"format":"yyyy-MM",
"ranges":[
{"key":"before 2016","to":"2016-01"},
{"key":"first half of 2016","from":"2016-01","to":"2016-06"},
{"key":"second half of 2016","from":"2016-06","to":"2016-12"}
]
}
}
}
}
Result:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 7,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"date_docs" : {
"buckets" : [ {
"key" : "before 2016",
"to" : 1.4516064E12,
"to_as_string" : "2016-01",
"doc_count" : 1
}, {
"key" : "first half of 2016",
"from" : 1.4516064E12,
"from_as_string" : "2016-01",
"to" : 1.4647392E12,
"to_as_string" : "2016-06",
"doc_count" : 0
}, {
"key" : "second half of 2016",
"from" : 1.4647392E12,
"from_as_string" : "2016-06",
"to" : 1.4805504E12,
"to_as_string" : "2016-12",
"doc_count" : 6
} ]
}
}
}
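The same bucketing can be sketched in Python to check the counts. The dates are copied from the sample documents, the boundaries are parsed as year and month to mirror the yyyy-MM format, and date_range_buckets is a hypothetical helper. Time zones are ignored here, which is an assumption of the sketch.

```python
from datetime import datetime

# date of each of the 7 sample documents
dates = [
    "2015-09-21", "2016-10-23", "2016-10-23", "2016-09-21",
    "2016-10-21", "2016-10-25", "2016-09-21",
]

ranges = [
    {"key": "before 2016", "to": "2016-01"},
    {"key": "first half of 2016", "from": "2016-01", "to": "2016-06"},
    {"key": "second half of 2016", "from": "2016-06", "to": "2016-12"},
]


def date_range_buckets(dates, ranges):
    """Count dates per range; from is inclusive, to is exclusive."""
    parsed = [datetime.strptime(d, "%Y-%m-%d") for d in dates]
    buckets = []
    for r in ranges:
        # A yyyy-MM boundary resolves to the first instant of that month
        low = datetime.strptime(r["from"], "%Y-%m") if "from" in r else datetime.min
        high = datetime.strptime(r["to"], "%Y-%m") if "to" in r else datetime.max
        count = sum(1 for d in parsed if low <= d < high)
        buckets.append({"key": r["key"], "doc_count": count})
    return buckets


date_result = date_range_buckets(dates, ranges)
print(date_result)
```

Because to is exclusive, a document dated in December 2016 would fall outside all three ranges as written.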