文章目录
- 1. 去重技巧—用group by替换distinct
- 2. 聚合技巧–利用窗口函数grouping sets、cube、rollup
- 3. 换个思路解题
- 4. union all时可以开启并发执行
- 5. 利用lateral view进行行转列
- 6. 表连接优化
- 7. 如何解决数据倾斜
- 8. 如何计算按月累计去重
- 9. 综合练习
1. 去重技巧—用group by替换distinct
取出user_trade中的全部支付用户
## 原有写法
SELECT distinct user_name
FROM user_trade
WHERE dt>'0';
## 优化写法
SELECT user_name
FROM user_trade
WHERE dt>'0'
GROUP BY user_name;
在2019年购买又退款的用户
## 原有写法
SELECT a.user_name
FROM
(SELECT distinct user_name
FROM user_trade
WHERE year(dt)=2019) a
JOIN
(SELECT distinct user_name
FROM user_refund
WHERE year(dt)=2019) b ON a.user_name=b.user_name;
## 优化写法
SELECT a.user_name
FROM
(SELECT user_name
FROM user_trade
WHERE year(dt)=2019
GROUP BY user_name) a
JOIN
(SELECT user_name
FROM user_refund
WHERE year(dt)=2019
GROUP BY user_name) b ON a.user_name=b.user_name;
在极大的数据量(且很多重复值)时,可以先group by去重,再count()计数,效率高于直接count(distinct **)
2. 聚合技巧–利用窗口函数grouping sets、cube、rollup
grouping sets、cube、rollup详解
性别、城市和等级用户分布
SELECT sex,
city,
level,
count(user_id)
FROM user_info
GROUP BY sex,city,level
GROUPING SETS (sex,city,level);
GROUPING SETS() 在group by查询中,根据不同的维度进行聚合,等价于将不同维度的group by结果集进行union all。聚合规则在括号中进行指定。
性别、性别&城市的用户分布
SELECT sex,
city,
count(user_id)
FROM user_info
GROUP BY sex,
city
GROUPING SETS (sex,(sex,city));
cube:根据group by维度的所有组合进行聚合
性别、城市和等级的各种组合的用户分布
SELECT sex,
city,
level,
count(user_id)
FROM user_info
GROUP BY sex,city,level
with cube;
rollup 以最左侧的维度为主,进行层级聚合,是cube的子集
计算每月的支付金额和每年的支付金额汇总
## 原有写法
SELECT a.dt,
sum(a.year_amount),
sum(a.month_amount)
FROM
(SELECT substr(dt,1,4) as dt,
sum(pay_amount) as year_amount,
0 as month_amount
FROM user_trade
WHERE dt>'0'
GROUP BY substr(dt,1,4)
UNION ALL
SELECT substr(dt,1,7) as dt,
0 as year_amount,
sum(pay_amount) as month_amount
FROM user_trade
WHERE dt>'0'
GROUP BY substr(dt,1,7)
) a
GROUP BY a.dt;
## 优化写法
SELECT year(dt) as year,
month(dt) as month,
sum(pay_amount)
FROM user_trade
WHERE dt>'0'
GROUP BY year(dt),
month(dt)
with rollup;
3. 换个思路解题
在2017年和2018年都购买的用户
## 原有写法
SELECT a.user_name
FROM
(SELECT distinct user_name
FROM user_trade
WHERE year(dt)=2017)a
JOIN
(SELECT distinct user_name
FROM user_trade
WHERE year(dt)=2018)b on a.user_name=b.user_name;
## 优化写法
SELECT a.user_name
FROM
(SELECT user_name,
count(distinct year(dt)) as year_num
FROM user_trade
WHERE year(dt) in (2017,2018)
GROUP BY user_name) a
WHERE a.year_num=2;
4. union all时可以开启并发执行
参数设置:set hive.exec.parallel=true
可以并行的任务较多时,开启并发执行,可以提高执行效率。
每个用户的支付和退款金额汇总
SELECT a.user_name,
sum(a.pay_amount),
sum(a.refund_amount)
FROM
(SELECT user_name,
sum(pay_amount) as pay_amount,
0 as refund_amount
FROM user_trade
WHERE dt>'0'
GROUP BY user_name
UNION ALL
SELECT user_name,
0 as pay_amount,
sum(refund_amount) as refund_amount
FROM user_refund
WHERE dt>'0'
GROUP BY user_name) a
GROUP BY a.user_name;
5. 利用lateral view进行行转列
每个品类的购买用户数
SELECT b.category,
count(distinct a.user_name)
FROM user_goods_category a
lateral view explode(split(category_detail,',')) b as category
GROUP BY b.category;
split():字符串分隔函数
explode:行转列函数
列转行函数:concat_ws(',',collect_set(column))
6. 表连接优化
小表在前,大表在后
Hive假定查询中最后的一个表是大表,它会将其他表缓存起来,然后扫描最后一个表
使用相同的连接条件
当3个或者更多个表进行join连接时,如果每个on子句都使用相同的连接键的话,那么之后产生一个MapReduce job
尽早的过滤数据
减少每个阶段的数据量,对于分区表要加分区,同时只选择需要使用到的字段
逻辑过于复杂,引入中间表
7. 如何解决数据倾斜
数据倾斜的表现:任务进度长时间维持在99%(或100%),查看任务监控页面,发现只有少量(1个或几个)reduce子任务未完成。因为其处理的数据量和其他reduce差异过大
数据倾斜的原因与解决办法:
空值产生的数据倾斜
解决:如果两个表连接时,使用的连接条件有很多空值,建议在连接条件中增加过滤
例如:on a.user_id=b.user_id and a.user_id is not null
大小表连接(其中一张表很大,另一张表非常小)
解决:将小表放到内存里,在map端做Join
例如:
SELECT /*+mapjoin(a)*/,
b.*
FROM a join b on a.**=b.**
两个表连接的字段类型不一样
解决:将连接条件的字段数据类型转换成一致的
例如:on a.user_id=cast(b.user_id as string)
8. 如何计算按月累计去重
2017、2018年按月累计去重的购买用户数
SELECT b.year,
b.month,
sum(b.user_num) over(partition by b.year order by b.month)
FROM
(SELECT a.year,
a.month,
count(distinct a.user_name) user_num
FROM
(SELECT year(dt) as year,
user_name,
min(month(dt)) as month
FROM user_trade
WHERE year(dt) in (2017,2018)
GROUP BY year(dt),
user_name) a
GROUP BY a.year,
a.month) b
ORDER BY b.year,
b.month
limit 24;
## 其他写法
SELECT b.month,
count(distinct a.user_name)
FROM
SELECT substr(dt,1,7) as month,
user_name
FROM user_trade
WHERE year(dt) in (2017,2018)
GROUP BY substr(dt,1,7),
user_name) a
CROSS JOIN
(SELECT month
FROM dim_month) b
WHERE b.month>=a.month
and substr(a.month,1,4)=substr(b.month,1,4)
GROUP BY b.month;
9. 综合练习
每个性别、不同性别和手机品牌的退款金额分布
SELECT b.sex,
b.phonebrand,
sum(a.refund_amount)
FROM
(SELECT user_name,
sum(refund_amount) as refund_amount
FROM user_refund
WHERE dt>'0'
GROUP BY user_name)a
JOIN
(SELECT user_name,
sex,
extra2['phonebrand'] phonebrand
FROM user_info) b on a.user_name=b.user_name
GROUP BY b.sex,
b.phonebrand
GROUPING SETS (b.sex,(b.sex,b.phonebrand));
把每个用户购买的品类变成一行,品类间用逗号分隔
SELECT a.user_name,
concat_ws(',',collect_set(a.goods_category))
FROM
(SELECT user_name,
goods_category
FROM user_trade
WHERE dt>'0'
GROUP BY user_name,
goods_category)a
GROUP BY a.user_name;
2017、2018年按月累计去重的退款用户数
SELECT b.year,
b.month,
sum(b.user_num) over(partition by b.year order by b.month)
FROM
(SELECT a.year,
a.month,
count(distinct a.user_name) user_num
FROM
(SELECT year(dt) as year,
user_name,
min(month(dt)) as month
FROM user_refund
WHERE year(dt) in (2017,2018)
GROUP BY year(dt),
user_name)a
GROUP BY a.year,
a.month)b
ORDER BY b.year,
b.month
limit 24;