ES range aggregation API and ES aggregation performance

Reposted

网络安全战士 2024-05-23 22:12:33

Blog: http://www.moonxy.com

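1. Introduction

Elasticsearch is a distributed full-text search engine, and indexing and searching are its basic functions. Its aggregations feature is also very powerful and allows complex analysis and statistics to be run over the data. ES provides four main kinds of aggregation: metric, bucket, pipeline, and matrix aggregations. The first two, metric and bucket aggregations, are the ones to focus on.

Official documentation for aggregations: Aggregations

2. Aggregation Analysis

2.1 Metric Aggregations

Official documentation for metric aggregations: Metric

Metric aggregations mainly include min, max, sum, avg, stats, extended_stats, and value_count, which correspond to the aggregate functions in SQL.

The official documentation describes them as follows:

Aggregations that keep track and compute metrics over a set of documents.

The max aggregation is used as the first example: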

Max Aggregation

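Official documentation for the max aggregation: Max Aggregation

The max aggregation returns the maximum value of a field, similar to the SQL aggregate function max(). Here "max_price" is a user-defined name for the aggregation.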

##Max Aggregation
GET books/_search
{
  "size": 0, 
  "aggs": {
    "max_price": {
      "max":  {
        "field": "price"
      }
    }
  }
}

The response is as follows:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "max_price": {
      "value": 81.4
    }
  }
}
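The aggregation result is returned under the user-defined name inside "aggregations". As a minimal sketch of reading it on the client side, the snippet below copies part of the response shown above into a Python dict; the helper name metric_value is hypothetical and not part of any Elasticsearch client.

```python
# Part of the response body shown above, copied as a Python dict literal.
response = {
    "took": 6,
    "timed_out": False,
    "hits": {"total": 5, "max_score": 0, "hits": []},
    "aggregations": {"max_price": {"value": 81.4}},
}


def metric_value(resp, agg_name):
    """Return the value of a single-value metric aggregation by its name."""
    return resp["aggregations"][agg_name]["value"]


print(metric_value(response, "max_price"))
```

The same lookup works for any single-value metric (min, sum, avg, cardinality, value_count), since they all return one "value" field.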

Cardinality Aggregation

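Official documentation for the cardinality aggregation: Cardinality Aggregation

Cardinality Aggregation counts distinct values. It works like a SQL distinct: duplicates are removed from the set of values, and the size of the deduplicated set is returned.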

##Cardinality Aggregation
GET books/_search
{
  "size": 0, 
  "aggs": {
    "all_language": {
      "cardinality":  {
        "field": "language"
      }
    }
  }
}

The response is as follows:

{
  "took": 41,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "all_language": {
      "value": 3
    }
  }
}
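The "deduplicate, then count" idea can be sketched in plain Python. The language values below are hypothetical: they are chosen to agree with the counts in this article's responses, since the source does not list the indexed documents. Note that this sketch is exact, whereas the real aggregation is based on the HyperLogLog++ algorithm and may return an approximate count on large data sets.

```python
# Hypothetical field values for the five books (not listed in the article).
languages = ["java", "java", "python", "python", "javascript"]


def exact_cardinality(values):
    """Exact distinct count: drop duplicates, then count what is left."""
    return len(set(values))


print(exact_cardinality(languages))
```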

Stats Aggregation

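Official documentation for the stats aggregation: Stats Aggregation

Stats Aggregation computes basic statistics and returns five metrics in a single request: count, max, min, avg, and sum. For example: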

##Stats Aggregation
GET books/_search
{
  "size": 0, 
  "aggs": {
    "stats_pirce": {
      "stats":  {
        "field": "price"
      }
    }
  }
}

The response is as follows:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "stats_pirce": {
      "count": 5,
      "min": 46.5,
      "max": 81.4,
      "avg": 63.8,
      "sum": 319
    }
  }
}
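To make the five metrics concrete, here is a small sketch that computes them locally. The price list is an assumption: the article does not list the documents, so the values are chosen to be consistent with the responses shown in it.

```python
# Hypothetical prices, chosen to agree with the responses in this article.
prices = [46.5, 70.2, 81.4, 54.5, 66.4]


def stats(values):
    """Compute the five metrics that a stats aggregation reports."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


print(stats(prices))
```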

Extended Stats Aggregation

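Official documentation for the extended stats aggregation: Extended Stats Aggregation

Extended Stats Aggregation is similar to the stats aggregation but returns four additional results: the sum of squares, the variance, the standard deviation, and the interval of the mean plus/minus two standard deviations.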

##Extended Stats Aggregation
GET books/_search
{
  "size": 0, 
  "aggs": {
    "extend_stats_pirce": {
      "extended_stats":  {
        "field": "price"
      }
    }
  }
}

The response is as follows:

{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "extend_stats_pirce": {
      "count": 5,
      "min": 46.5,
      "max": 81.4,
      "avg": 63.8,
      "sum": 319,
      "sum_of_squares": 21095.46,
      "variance": 148.65199999999967,
      "std_deviation": 12.19229264740638,
      "std_deviation_bounds": {
        "upper": 88.18458529481276,
        "lower": 39.41541470518724
      }
    }
  }
}
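The extra fields follow from the basic ones. The sketch below derives them with the population variance (mean of squares minus square of the mean), which is consistent with the numbers in the response above. The price list is hypothetical (chosen to agree with this article's responses), and the sigma parameter mirrors the "two standard deviations" used for the bounds.

```python
import math

# Hypothetical prices, chosen to agree with the responses in this article.
prices = [46.5, 70.2, 81.4, 54.5, 66.4]


def extended_stats(values, sigma=2.0):
    """Derive sum of squares, variance, std deviation and +/- sigma bounds."""
    n = len(values)
    avg = sum(values) / n
    sum_of_squares = sum(v * v for v in values)
    # Population variance; clamp at 0 to guard against rounding noise.
    variance = max(sum_of_squares / n - avg * avg, 0.0)
    std_deviation = math.sqrt(variance)
    return {
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


print(extended_stats(prices))
```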

Value Count Aggregation

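Official documentation for the value count aggregation: Value Count Aggregation

Value Count Aggregation counts how many values a given field has across the aggregated documents.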

##Value Count Aggregation
GET books/_search
{
  "size": 0, 
  "aggs": {
    "doc_count": {
      "value_count":  {
        "field": "author"
      }
    }
  }
}

The response is as follows:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "doc_count": {
      "value": 5
    }
  }
}
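Because value_count counts values rather than documents, a multi-valued field contributes more than once and a document without the field contributes nothing. A rough sketch of that behaviour, using made-up documents that are not from the article:

```python
# Made-up documents for illustration only.
docs = [
    {"author": "author-1"},                # one value
    {"author": ["author-2", "author-3"]},  # multi-valued field: two values
    {"title": "no author field"},          # field missing: zero values
]


def value_count(documents, field):
    """Count the values of `field` across all documents."""
    count = 0
    for doc in documents:
        value = doc.get(field)
        if value is None:
            continue
        count += len(value) if isinstance(value, list) else 1
    return count


print(value_count(docs, "author"))
```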

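Note:

By default, text fields cannot be used for sorting or aggregations, because fielddata is disabled on them (this applies to the terms aggregation as well). Use a keyword field instead, or explicitly enable fielddata. For example, the following runs an aggregation on the title field, which is mapped as text: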

GET books/_search
{
  "size": 0, 
  "aggs": {
    "doc_count": {
      "value_count":  {
        "field": "title"
      }
    }
  }
}

The response is as follows:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [title] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "books",
        "node": "6n3douACShiPmlA9j2soBw",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [title] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
        }
      }
    ]
  },
  "status": 400
}
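The error message suggests using a keyword field. As a sketch of what the corrected request body could look like, the helper below builds the same value_count request for an arbitrary field. The field name "title.keyword" is an assumption: it only exists if the mapping defines a keyword sub-field called "keyword" under title, which the article does not show.

```python
import json


def value_count_request(field):
    """Build the body of a value_count aggregation request for `field`."""
    return {
        "size": 0,
        "aggs": {"doc_count": {"value_count": {"field": field}}},
    }


# Hypothetical keyword sub-field; adjust to the actual index mapping.
body = value_count_request("title.keyword")
print(json.dumps(body, indent=2))
```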

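2.2 Bucket Aggregations

Official documentation for bucket aggregations: Bucket Aggregations

A bucket can be thought of as a container: the aggregation goes through the documents and places every document that meets a given criterion into the corresponding bucket. Bucketing is comparable to GROUP BY in SQL.

Terms Aggregation

Terms Aggregation groups documents by field value. The following counts the number of books for each programming language: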

GET books/_search
{
  "size": 0, 
  "aggs": {
    "terms_count": {
      "terms":  {
        "field": "language"
      }
    }
  }
}

The response is as follows:

{
  "took": 31,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "terms_count": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "java",
          "doc_count": 2
        },
        {
          "key": "python",
          "doc_count": 2
        },
        {
          "key": "javascript",
          "doc_count": 1
        }
      ]
    }
  }
}
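The grouping above can be approximated locally with a counter. This is only a sketch: the language values are hypothetical (chosen to agree with the response), buckets are ordered by descending doc_count, and breaking ties by key is an assumption of the sketch. The size parameter mirrors the default of 10 buckets.

```python
from collections import Counter

# Hypothetical field values for the five books (not listed in the article).
languages = ["java", "java", "python", "python", "javascript"]


def terms_buckets(values, size=10):
    """Group equal values and return the top `size` buckets by doc_count."""
    counts = Counter(values)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]


for bucket in terms_buckets(languages):
    print(bucket)
```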

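On top of terms bucketing, a metric aggregation can be run inside each bucket. For example, to compute the average price of each category of books, first run a Terms Aggregation on the language field and then an Avg Aggregation inside it: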

GET books/_search
{
  "size": 0, 
  "aggs": {
    "terms_count": {
      "terms":  {
        "field": "language"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

The response is as follows:

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "terms_count": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "java",
          "doc_count": 2,
          "avg_price": {
            "value": 58.35
          }
        },
        {
          "key": "python",
          "doc_count": 2,
          "avg_price": {
            "value": 67.95
          }
        },
        {
          "key": "javascript",
          "doc_count": 1,
          "avg_price": {
            "value": 66.4
          }
        }
      ]
    }
  }
}
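Nesting a metric inside a bucket is the equivalent of SELECT language, AVG(price) ... GROUP BY language. A minimal local sketch of that two-step process follows; the documents are hypothetical, with values chosen to agree with the response above.

```python
from collections import defaultdict

# Hypothetical documents, chosen to agree with the responses in this article.
books = [
    {"language": "java", "price": 46.5},
    {"language": "java", "price": 70.2},
    {"language": "python", "price": 81.4},
    {"language": "python", "price": 54.5},
    {"language": "javascript", "price": 66.4},
]


def avg_price_by_language(docs):
    """Bucket documents by language, then average the price in each bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["language"]].append(doc["price"])
    return {lang: sum(values) / len(values) for lang, values in groups.items()}


print(avg_price_by_language(books))
```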

Range Aggregation

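Range Aggregation buckets documents by ranges and shows how the data is distributed. Each range includes its from value and excludes its to value. For example, the following groups the books in the books index into three price ranges: below 50, from 50 up to (but not including) 80, and 80 or above: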

GET books/_search
{
  "size": 0, 
  "aggs": {
    "price_range": {
      "range": {
        "field": "price",
        "ranges": [
          {"to": 50},
          {"from": 50, "to": 80},
          {"from": 80}
        ]
      }
    }
  }
}

The response is as follows:

{
  "took": 16,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "price_range": {
      "buckets": [
        {
          "key": "*-50.0",
          "to": 50,
          "doc_count": 1
        },
        {
          "key": "50.0-80.0",
          "from": 50,
          "to": 80,
          "doc_count": 3
        },
        {
          "key": "80.0-*",
          "from": 80,
          "doc_count": 1
        }
      ]
    }
  }
}
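The bucket boundaries are half-open: a value equal to "from" falls into the range, while a value equal to "to" does not. The sketch below reproduces that rule locally with the same ranges as the request above; the price list is hypothetical and chosen to agree with the response.

```python
# Hypothetical prices, chosen to agree with the responses in this article.
prices = [46.5, 70.2, 81.4, 54.5, 66.4]

ranges = [{"to": 50}, {"from": 50, "to": 80}, {"from": 80}]


def range_buckets(values, range_specs):
    """Count values per range; `from` is inclusive and `to` is exclusive."""
    buckets = []
    for spec in range_specs:
        low, high = spec.get("from"), spec.get("to")
        count = sum(
            1
            for v in values
            if (low is None or v >= low) and (high is None or v < high)
        )
        buckets.append({"from": low, "to": high, "doc_count": count})
    return buckets


for bucket in range_buckets(prices, ranges):
    print(bucket)
```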

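Range aggregations are not limited to numeric fields; they can also be applied to date fields. See Date Range Aggregation.

2.3 Pipeline Aggregations

Official documentation for pipeline aggregations: Pipeline Aggregations

Pipeline Aggregations operate on the output of other aggregations rather than on documents.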
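As an illustration of "operating on the output of other aggregations", the sketch below takes the buckets returned by the terms plus avg example earlier in this article and averages the per-bucket values, which is the idea behind the avg_bucket pipeline aggregation. It is a local approximation, not a call to Elasticsearch.

```python
# Buckets copied from the terms + avg response earlier in this article.
buckets = [
    {"key": "java", "doc_count": 2, "avg_price": {"value": 58.35}},
    {"key": "python", "doc_count": 2, "avg_price": {"value": 67.95}},
    {"key": "javascript", "doc_count": 1, "avg_price": {"value": 66.4}},
]


def avg_bucket(bucket_list, metric):
    """Average one metric across buckets; the input is buckets, not documents."""
    values = [bucket[metric]["value"] for bucket in bucket_list]
    return sum(values) / len(values)


print(avg_bucket(buckets, "avg_price"))
```

Each bucket counts once regardless of its doc_count, so the result (about 64.23) differs from the overall average price of 63.8 reported by the stats aggregation.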

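2.4 Matrix Aggregations

Official documentation for matrix aggregations: Matrix Aggregations

Matrix Stats

Matrix Stats is a numeric aggregation that computes the following statistics over a set of document fields:

Count: the number of samples of each field included in the calculation;

Mean: the average value of each field;

Variance: how far the samples of each field spread out from the mean;

Skewness: a measure of how asymmetrically the samples of each field are distributed around the mean;

Kurtosis: a measure of the shape of each field's distribution;

Covariance: a matrix that quantifies how changes in one field are associated with changes in another;

Correlation: the covariance matrix scaled to the range [-1, 1]; it describes the relationship between the distributions of two fields.

It is mainly used to examine the relationship between two numeric fields, for example between the log record size and the HTTP status code.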

GET /_search
{
    "aggs": {
        "statistics": {
            "matrix_stats": {
                "fields": ["log_size", "status_code"]
            }
        }
    }
}
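To show how covariance and correlation relate, here is a small local sketch for two numeric fields. The sample values are invented for illustration, and the sketch uses the sample covariance (dividing by n - 1), which is an assumption and not a statement about how Elasticsearch normalizes its result. The correlation does not depend on that choice, because the normalization cancels out.

```python
import math

# Invented sample values for the two fields used in the request above.
log_size = [120.0, 340.0, 560.0, 980.0]
status_code = [200.0, 200.0, 404.0, 500.0]


def covariance(xs, ys):
    """Sample covariance of two equally long sequences (assumes n >= 2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance scaled by both standard deviations, so it lies in [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


print(covariance(log_size, status_code))
print(correlation(log_size, status_code))
```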
