- grouping sets: 根据不同的维度组合进行聚合,等价于将不同维度的group by结果集进行union all
- grouping__id(请注意函数名中的下划线是两个!):表示结果属于哪一个分组集合,属于虚字段
- cube: 根据group by的维度的所有组合进行聚合。
- rollup: 为cube的子集,以最左侧的维度为主,从该维度进行层级聚合。(从右向左依次递减)
这几个分析函数通常用于,根据不同维度上钻和下钻的指标统计,比如,分小时、天、月的uv数
在一个group by查询中,根据不同的维度组合进行聚合,等价于将不同维度的group by结果集进行union all。
通俗的说,grouping sets是一种将多个group by 逻辑写在一个语句中的便利写法
上卷(roll-up):上卷是沿着维的层次向上聚集汇总数据。例如,对产品销售数据,沿着时间维上卷,可以求出所有产品在所有地区每月(或季度或年或全部)的销售额。
下探(drill-down):下探是上卷的逆操作,它是沿着维的层次向下,查看更详细的数据。
希望能帮助你理解。
数据:
10001 2007-12-24 2 A 1200
10005 2007-12-24 1 B 2000
10006 2008-01-18 1 C 1400
20001 2008-02-12 2 B 1200
20002 2009-02-16 1 C 2000
30001 2007-08-02 3 A 1000
30003 2009-04-18 2 B 1500
30004 2009-04-18 3 C 2200
30007 2009-09-07 3 D 3000
40001 2008-01-09 2 A 4000
40005 2009-02-12 3 A 1000
准备工作
create table orders(
orderid int,
orderdate date,
empid int,
custid string,
qty int
)
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/hive/datas/Orders.txt' into table orders;
select*from orders;
– 1. 汇总每年收入
select
year(orderdate) orderyear,
sum(qty) sum_qty
from Orders
group by year(orderdate)
order by orderyear;
– 2. 公司每个月的收入
select
year(orderdate) orderyear,
month(orderdate) ordermonth,
sum(qty) sum_qty
from Orders
group by year(orderdate),month(orderdate)
order by orderyear,ordermonth;
– 3. 公司收入(每年|每月)。1是按照年,2是按照年和月,汇总1、结果
-答案1
select
year(orderdate) orderyear,
null ordermonth,
sum(qty) sum_qty
from Orders
group by year(orderdate)
union all
select
year(orderdate) orderyear,
month(orderdate) ordermonth,
sum(qty) sum_qty
from Orders
group by year(orderdate),month(orderdate)
order by orderyear,ordermonth;
–null出现在里面按照年的汇总数据,非null为按照年和月的分组汇总
-答案2–用 GROUPING SETS实现相同结果
select
year(orderdate) orderyear,
month(orderdate) ordermonth,
sum(qty) sum_qty,
grouping__id --表示结果属于哪一个分组集合,属于虚字段
from Orders
group by year(orderdate),month(orderdate)
grouping sets ( year(orderdate), --1st grouping set
(year(orderdate), month(orderdate)) --2nd grouping set
);
–等价于
select year(orderdate) orderyear,null,sum(qty) sum_qty,1 as grouping__id from Orders group by year(orderdate)
union all
select year(orderdate) orderyear,month(orderdate) ordermonth,sum(qty) sum_qty,0 as grouping__id from Orders group by year(orderdate),month(orderdate);
–一个分组包含两个列,假设列A和B,两个列都需要包含在括号内:(column A, column B)。如果没有括号,这个子句将会被定义为独立的分组,结果就不同了。
select
year(orderdate) orderyear,
month(orderdate) ordermonth,
sum(qty) sum_qty
from Orders
group by year(orderdate),month(orderdate)
grouping sets (year(orderdate), month(orderdate) );
– 4. 加入总体汇总结果
select
year(orderdate) orderyear,
month(orderdate) ordermonth,
sum(qty) sum_qty,
grouping__id
from Orders
group by year(orderdate),month(orderdate)
grouping sets ( year(orderdate), (year(orderdate), month(orderdate)), () );
–为啥没2??
select
year(orderdate) orderyear,
month(orderdate) ordermonth,
sum(qty) sum_qty,
(case grouping__id when 1 then '年度' when 0 then '年度|月度' else '总' end)
from Orders
group by year(orderdate),month(orderdate)
grouping sets ( year(orderdate), (year(orderdate), month(orderdate)), () );
1.在group by中使用rollup:
select
empid,
avg(qty)
from Orders
group by empid;
–rollup是cube的子集,以最左侧的维度为主,从该维度进行层级聚合
–在生成原有统计结果基础上,生成横向小计结果。
select
nvl(empid, '所有订单平均') empid,
avg(qty)
from Orders
group by rollup(empid); --汇总的列返回的不是每个人的平均,而是所有行的平均(所有订单)
select
nvl(empid, '所有订单平均') empid,
custid,
avg(qty)
from Orders
group by rollup(empid, custid);
这里如果是group by rollup(a,b,c);对(a,b,c)三列分组的话,就是先对(a,b,c)进行group by,再对(a,b)进行group by,再对(a)进行group by,再对全表group by。从右向左依次递减
即rollup(1,2…n)时,group by的所有可能的group by数是n+1个,比如rollup(a,b,c);时,总共有4个汇总。
2.在group by中使用cube:
–cube操作符时,在生成原有统计结果基础上,生成纵向小计结果。
–根据GROUP BY的维度的所有组合进行聚合
select
empid,
avg(qty)
from Orders
group by cube(empid);
select
nvl(empid, '订单平均') empid,
custid,
avg(qty)
from Orders
group by cube(empid, custid);
–另一种写法
select
month(orderdate) ordermonth,
day(orderdate) orderday,
sum(qty) sum_qty,
grouping__id
from Orders
group by orderdate
with cube
order by grouping__id;
如果是group by cube(a,b,c);首先会对(a、b、c)进行group by,然后依次是(a、b),(a、c),(a),(b、c),(b),©,再对全表进行group by。
即cube(1,2,n)时,group by的所有可能的group by数是2的n次方,比如cube(a,b,c);时,总共有8个。
总结:rollup非常高效,对一个查询增加的开销非常少;cube相对更耗费资源。
在group by子句有列(a,b)两列时,rollup统计(a,b),(a);而cube统计了(a,b),(a),(b)。
4.grouping函数,区分出小计,汇总数据,只能在使用rollup或cube的查询中使用。
–对输入列返回0或1,如果该行数据使用了数据的列中的信息,即此列数据参与rollup/cube函数分组汇总活动,则输出0;没有用到则输出1
select
nvl(empid, '订单平均') empid,
custid,
avg(qty),
grouping(empid) g_e,
grouping(custid) g_c
from orders
group by rollup(empid, custid);
–通过在 having 子句中使用 grouping 函数,您可以只显示 empid 和 custid 组合的小计。
select
nvl(empid, '订单平均') empid,
custid,
avg(qty),
grouping(empid) g_e,
grouping(custid) g_c
from orders
group by rollup(empid, custid)
having grouping(custid) = 0;
–注意:有些表有一项由于表达式所基于的列对于表中的一行或多行为 null 值而返回 null 值,而不是表示该列的小计(此null非分组null)
5.grouping__id函数
–grouping__id 函数返回一个整数值。该值对应于位向量的十进制解释,该向量由串联的 1 和 0 组成,将由一系列 grouping 函数按从左到右的顺序返回,这与 grouping__id 函数中指定参数的顺序相同。二进制
select
nvl(empid, '订单平均') empid,
custid,
avg(qty),
grouping__id
from orders
group by empid,custid
grouping sets(empid, custid);
select
nvl(empid, '订单平均') empid,
custid,
avg(qty),
if(grouping__id = 1, '员工平均', '顾客平均')
from orders
group by empid,custid
grouping sets(empid, custid);
–二进制
select
year(orderdate) orderyear,
month(orderdate) ordermonth,
sum(qty) sum_qty,
grouping(year(orderdate)) g_y,
grouping(year(orderdate), month(orderdate)) g_ym,
grouping__id gid
from Orders
group by year(orderdate),month(orderdate)
grouping sets ( year(orderdate), (year(orderdate), month(orderdate)), () );
–好理解
select
nvl(empid, '订单平均') empid,
custid,
avg(qty),
grouping(empid) g_e,
grouping(custid) g_c,
grouping__id gid
from orders
group by empid,custid
grouping sets(empid, custid);
–在hive里grouping__id不能放参数,oral可以
6.grouping__sets多维度组合去重统计避免使用distinct
–通过grouping sets和distinct进行统计
select
grouping__id g_id,
year(orderdate) orderyear,
month(orderdate) ordermonth,
sum(qty) sum_qty,
count(distinct empid) dis_empid
from Orders
group by year(orderdate),month(orderdate)
grouping sets ( year(orderdate),
(year(orderdate), month(orderdate))
);
–通过grouping sets将empid加入维度组合再进行group by统计
select g_id, orderyear, ordermonth, sum(sum_qty), count(1) dis_empid
from
(
select
cast(grouping__id as int)&7 g_id,--可以让grouping__id好看点
year(orderdate) orderyear,
month(orderdate) ordermonth,
sum(qty) sum_qty,
empid
from Orders
group by year(orderdate),month(orderdate),empid
grouping sets ( (year(orderdate),empid),
(year(orderdate), month(orderdate),empid)
)
) t
group by g_id, orderyear, ordermonth;
–在hive没有group by all----sql或olap有
select
empid,
sum(qty) sum_qty
from orders
where empid=2
group by all empid;
hive grouping sets多维度报错
如果你的 grouping sets大于等于5个维度,将会报如上的错误;解决办法:
An additional MR job is introduced since the cardinality of grouping sets is more than hive.new.job.grouping.set.cardinality. This functionality is not supported with distincts. Either set hive.new.job.grouping.set.cardinality to a high number (higher than the number of rows per input row due to grouping sets in the query), or rewrite the query to not use distincts. The number of rows per input row due to grouping sets is 32
- 在你的hql语句前面加上 set hive.new.job.grouping.set.cardinality=xx;(例如我这里是5个维度,一共32个grouping sets,xx我写的64 )
- 可以通过在子查询中用group by去重,避免在聚合中用到distinct
hive中grouping sets 数量较多时,可以使用如下设置来:set hive.new.job.grouping.set.cardinality = 30;这条设置的意义在于告知解释器,group by之前,每条数据复制量在30份以内。
当维度过高,且统计语句中使用了 count distinct语句,就有可能出现如下报错
- Error while compiling statement: FAILED: SemanticException [Error 10226]: An additional MR job is introduced since the cardinality of grouping sets is more than hive.new.job.grouping.set.cardinality. This functionality is not supported with distincts. Either set hive.new.job.grouping.set.cardinality to a high number (higher than the number of rows per input row due to grouping sets in the query), or rewrite the query to not use distincts. The number of rows per input row due to grouping sets is 256
此时可以通过以下语句突破这个限制set hive.new.job.grouping.set.cardinality = 256;
其中256 为可以调节的数字,一般这个数字要大于你的维度的最高值,但是切记要慎用,一般维度最好不要超过7维,否则对集群压力会比较大