一.Hive聚合运算 - GROUP BY
GROUP BY用于分组
- Hive基本内置聚合函数与GROUP BY一起使用
- 如果没有指定GROUP BY子句,则默认聚合整个表
- 除聚合函数这一列外,所选的其他列也必须包含在GROUP BY中,在前面查询的时候可以不加,不会报错,但是看不出来结果代表的意义
- GROUP BY支持使用CASE WHEN或表达式
select category, max(offervalue) from offers group by category;
-- group by使用表达式
select
if(category > 4000, 'GOOD', 'BAD') as newcat,
max(offervalue)
from offers
group by category if(category > 4000, 'GOOD', 'BAD');
Hive聚合运算 - HAVING
- HAVING:对GROUP BY聚合结果的条件过滤- 可以避免在GROUP BY之后使用子查询
- HAVING之后可以使用表达式,但不建议使用,会造成效率大大降低
-- having使用
select sex_age.age from employee group by sex_age.age having count(*) <= 1;
-- 使用子查询代替having
select a.age from ( select count(*) as cnt, sex_age.age
from employee group by sex_age.age ) a where a.cnt <= 1;
Hive聚合运算 - 基础聚合 基础聚合函数
- max, min, count, sum, avg- max(distinct col1)、avg(col2)等- collect_set, collect_list:返回每个组列中的对象集/列表
注意事项 - 一般与GROUP BY一起使用
- 可应用于列或表达式
- 对NULL的count聚合为0,即过滤了NULL
二.窗口函数 - 概述
窗口函数是一组特殊函数
- 扫描多个输入行来计算每个输出值,为每行数据生成一行结果
- 可以通过窗口函数来实现复杂的计算和聚合
语法
Function (arg1,..., arg n) OVER ([PARTITION BY <...>] [ORDER BY <....>] [<window_clause>])
- PARTITION BY类似于GROUP BY,未指定则按整个结果集
- 只有指定ORDER BY子句之后才能进行窗口定义,窗口定义的范围窗口,行窗口都需要根据整个数据的顺序决定的
- 可同时使用多个窗口函数
- 过滤窗口函数计算结果必须在外面一层
- 按功能可划分为:排序,聚合,分析
1. 窗口函数 - 排序
ROW_NUMBER()
- 对所有数值输出不同的序号,序号唯一连续
RANK()
- 对相同数值,输出相同的序号,下一个序号跳过(1,1,3)
DENSE_RANK()
- 对相同数值,输出相同的序号,下一个序号连续(1,1,2)
根据需要使用的不同功能,去选择使用的函数,使用方式一致,只不过功能不太一样。主要掌握上面三个。
NLITE(n)
- 将有序的数据集合平均分配到n个桶中(若不能均分一般第一个桶数据会多些), 将桶号分配给每一行,根据桶号,选取前或后 n分之几的数据
- 举例:若想查询订单记录的前1/3记录,可用NLITE(3)平均分成三份再套个查询语句使用where条件桶号=1即可实现
PERCENT_RANK()
- (目前排名- 1)/(总行数- 1),值相对于一组值的百分比排名
-- 窗口函数 排序类 给所有的员工薪资排序,从低到高,针对所有员工。
SELECT
name, dept_num, salary,
ROW_NUMBER() OVER (order by salary) AS row_num, //over() 括号内不写的话默认是按照整个行进行排序,这里还是需要按照薪资排序
RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS rank, //按照部门对部门内员工薪资排序
DENSE_RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS dense_rank,
PERCENT_RANK() OVER(PARTITION BY dept_num ORDER BY salary) AS percent_rank,
NLITE(2) OVER(PARTITION BY dept_num ORDER BY salary) AS ntile
FROM employee_contract
ORDER BY dept_num, salary;
2. 窗口函数 - 聚合
COUNT() 计数,可以和DISTINCT一起用
SELECT
COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS
BETWEEN 1 PRECEDING AND 1 FOLLOWING)
SUM():求和
AVG():平均值
MAX()/MIN(): 最大/小值
从Hive 2.1.0开始在OVER子句中支持聚合函数
SELECT rank() OVER (ORDER BY sum(b)) FROM T GROUP BY a;
-- 窗口函数 聚合类
SELECT
name, dept_num, salary,
COUNT(*) OVER (PARTITION BY dept_num) AS row_cnt,
--COUNT(DISTINCT *) OVER (PARTITION BY dept_num) AS row_cnt_dis,
SUM(salary) OVER(PARTITION BY dept_num ORDER BY dept_num) AS deptTotal,
SUM(salary) OVER(ORDER BY dept_num) AS runningTotal1,
SUM(salary) OVER(ORDER BY dept_num, name rows unbounded preceding) AS runningTotal2,
AVG(salary) OVER(PARTITION BY dept_num) AS avgDept,
MIN(salary) OVER(PARTITION BY dept_num) AS minDept,
MAX(salary) OVER(PARTITION BY dept_num) AS maxDept
FROM employee_contract
ORDER BY dept_num, name;
按部门分组,获取每个部门薪资最低的员工(考虑重复,如果有多个最低薪资,都取出来)
使用rank和dense_rank效果是一样的
# 经常遇到的分组求top n的场景
select name,dept_num,salary from
(select
name,dept_num,salary,
dense_rank() over(partition by dept_num order by salary) as rn
from employee_contract) t
where t.rn=1;
3.窗口函数 - 分析
CUME_DIST
- 小于等于当前值的行数/分组内总行数
LEAD/LAG(column,n)
- 某一列进行往前/后第n行值(n可选,默认为1)
- 这个函数很有用,可用于分析频率,比如lag(购买时间,1),就可知道每次购买时间的频率
FIRST_VALUE
- 对该列到目前为止的首个值
LAST_VALUE
- 到目前行为止的最后一个值
-- 窗口函数 分析类
SELECT
name, dept_num, salary,
LEAD(salary, 2) OVER(PARTITION BY dept_num ORDER BY salary) AS lead,
LAG(salary, 2, 0) OVER(PARTITION BY dept_num ORDER BY salary) AS lag,
FIRST_VALUE(salary) OVER (PARTITION BY dept_num ORDER BY salary) AS first_value,
LAST_VALUE(salary) OVER (PARTITION BY dept_num ORDER BY salary) AS last_value_default,
LAST_VALUE(salary) OVER (PARTITION BY dept_num ORDER BY salary RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_value
FROM employee_contract
ORDER BY dept_num, salary;
4.窗口函数 - 窗口定义(必须使用order by)
窗口定义由[<window_clause>]子句描述
- 用于进一步细分结果并应用分析函数
支持两类窗口定义
- 行类型窗口
- 范围类型窗口
RANK、NTILE、DENSE_RANK、CUME_DIST、PERCENT_RANK、LEAD、LAG和ROW_NUMBER函数不支持与窗口子句一起使用
行窗口:根据当前行之前或之后的行号确定的窗口 - ROWS BETWEEN <start_expr> AND <end_expr>
<start_expr>可以为下列值 - UNBOUNDED PRECEDING : 窗口起始位置(分组第一行)- CURRENT ROW:当前行- N PRECEDING/FOLLOWING:当前行之前/之后n行
<end_expr>可以为下列值 - UNBOUNDED FOLLOWING : 窗口结束位置(分组最后一行)
- CURRENT ROW:当前行- N
- PRECEDING/FOLLOWING:当前行之前/之后n行
SELECT
name, dept_num AS dept, salary AS sal,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) win4,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND CURRENT ROW) win6,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) win7,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win8,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win9,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win10,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) win11,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS 2 PRECEDING) win12
FROM employee_contract ORDER BY dept, name;
范围窗口是取分组内的值在指定范围区间内的行
- 该范围值/区间必须是数字或日期类型
- 目前只支持一个ORDER BY列
SUM(close) RANGE BETWEEN 500 PRECEDING AND 1000 FOLLOWING
-- 假设当前close值的行数为3000,语句将包含分区内范围从2500到4000的行
-- 示例
SELECT name, dept_num AS dept, salary AS sal,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
salary - 1000 as sal_r_start,salary as sal_r_end,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY name RANGE BETWEEN 1000 PRECEDING AND CURRENT ROW) win13
FROM employee_contract ORDER BY dept, name;
以下内容摘自官网:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Windowing and Analytics Functions
Windowing and Analytics Functions
Enhancements to Hive QL
Examples
PARTITION BY with one partitioning column, no ORDER BY or window specification
PARTITION BY with two partitioning columns, no ORDER BY or window specification
PARTITION BY with one partitioning column, one ORDER BY column, and no window specification
PARTITION BY with two partitioning columns, two ORDER BY columns, and no window specification
PARTITION BY with partitioning, ORDER BY, and window specification
WINDOW clause
LEAD using default 1 row lead and not specifying default value
LAG specifying a lag of 3 rows and default value of 0
Distinct counting for each partition
Enhancements to Hive QL
Version
Introduced in Hive version 0.11.
This section introduces the Hive QL enhancements for windowing and analytics functions. See “Windowing Specifications in HQL” (attached to HIVE-4197) for details. HIVE-896 has more information, including links to earlier documentation in the initial comments.
All of the windowing and analytics functions operate as per the SQL standard.
The current release supports the following functions for windowing and analytics:
1. Windowing functions
LEAD
The number of rows to lead can optionally be specified. If the number of rows to lead is not specified, the lead is one row.
Returns null when the lead for the current row extends beyond the end of the window.
LAG
The number of rows to lag can optionally be specified. If the number of rows to lag is not specified, the lag is one row.
Returns null when the lag for the current row extends before the beginning of the window.
FIRST_VALUE
column for which you want the first value, the second (optional) parameter must be a boolean which is false by default. If set to true it skips null values.
LAST_VALUE
This takes at most two parameters. The first parameter is the column for which you want the last value, the second (optional) parameter must be a boolean which is false by default. If set to true it skips null values.
2. The OVER clause
OVER with standard aggregates:
COUNT
SUM
MIN
MAX
AVG
OVER with a PARTITION BY statement with one or more partitioning columns of any primitive datatype.
OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns of any datatype.
OVER with a window specification. Windows can be defined separately in a WINDOW clause. Window specifications support the following formats:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
The OVER clause supports the following functions, but it does not support a window with them (see HIVE-4797):
Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank.
Lead and Lag functions.
3. Analytics functions
RANK
ROW_NUMBER
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE
4. Distinct support in Hive 2.1.0 and later (see HIVE-9534)
Distinct is supported for aggregation functions including SUM, COUNT and AVG, which aggregate over the distinct values within each partition. Current implementation has the limitation that no ORDER BY or window specification can be supported in the partitioning clause for performance reason. The supported syntax is as follows.
COUNT(DISTINCT a) OVER (PARTITION BY c)
ORDER BY and window specification is supported for distinct in Hive 2.2.0 (see HIVE-13453). An example is as follows.
COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
5. Aggregate functions in OVER clause support in Hive 2.1.0 and later (see HIVE-13475)
Support to reference aggregate functions within the OVER clause has been added. For instance, currently we can use the SUM aggregation function within the OVER clause as follows.
SELECT rank() OVER (ORDER BY sum(b))
FROM T
GROUP BY a;
Examples
This section provides examples of how to use the Hive QL windowing and analytics functions in SELECT statements. See HIVE-896 for additional examples.
PARTITION BY with one partitioning column, no ORDER BY or window specification
SELECT a, COUNT(b) OVER (PARTITION BY c)
FROM T;
PARTITION BY with two partitioning columns, no ORDER BY or window specification
SELECT a, COUNT(b) OVER (PARTITION BY c, d)
FROM T;
PARTITION BY with one partitioning column, one ORDER BY column, and no window specification
SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d)
FROM T;
PARTITION BY with two partitioning columns, two ORDER BY columns, and no window specification
SELECT a, SUM(b) OVER (PARTITION BY c, d ORDER BY e, f)
FROM T;
PARTITION BY with partitioning, ORDER BY, and window specification
SELECT a, SUM(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM T;
SELECT a, AVG(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
FROM T;
SELECT a, AVG(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
FROM T;
SELECT a, AVG(b) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM T;
There can be multiple OVER clauses in a single query. A single OVER clause only applies to the immediately preceding function call. In this example, the first OVER clause applies to COUNT(b) and the second OVER clause applies to SUM(b):
SELECT
a,
COUNT(b) OVER (PARTITION BY c),
SUM(b) OVER (PARTITION BY c)
FROM T;
Aliases can be used as well, with or without the keyword AS:
SELECT
a,
COUNT(b) OVER (PARTITION BY c) AS b_count,
SUM(b) OVER (PARTITION BY c) b_sum
FROM T;
WINDOW clause
SELECT a, SUM(b) OVER w
FROM T
WINDOW w AS (PARTITION BY c ORDER BY d ROWS UNBOUNDED PRECEDING);
LEAD using default 1 row lead and not specifying default value
SELECT a, LEAD(a) OVER (PARTITION BY b ORDER BY C)
FROM T;
LAG specifying a lag of 3 rows and default value of 0
SELECT a, LAG(a, 3, 0) OVER (PARTITION BY b ORDER BY C)
FROM T;
Distinct counting for each partition
SELECT a, COUNT(distinct a) OVER (PARTITION BY b)
FROM T;