文章目录

  • 1. 去重技巧—用group by替换distinct
  • 2. 聚合技巧–利用窗口函数grouping sets、cube、rollup
  • 3. 换个思路解题
  • 4. union all时可以开启并发执行
  • 5. 利用lateral view进行行转列
  • 6. 表连接优化
  • 7. 如何解决数据倾斜
  • 8. 如何计算按月累计去重
  • 9. 综合练习


1. 去重技巧—用group by替换distinct

取出user_trade中的全部支付用户

## 原有写法
SELECT distinct user_name
FROM user_trade
WHERE dt>'0';

## 优化写法
SELECT user_name
FROM user_trade
WHERE dt>'0'
GROUP BY user_name;

在2019年购买又退款的用户
## 原有写法
SELECT a.user_name
FROM
	(SELECT distinct user_name
	FROM user_trade
	WHERE year(dt)=2019) a
  JOIN
	(SELECT distinct user_name
	FROM user_refund
	WHERE year(dt)=2019) b ON a.user_name=b.user_name;

## 优化写法
SELECT a.user_name
FROM
	(SELECT user_name
	FROM user_trade
	WHERE year(dt)=2019
	GROUP BY user_name) a
  JOIN
	(SELECT user_name
	FROM user_refund
	WHERE year(dt)=2019
	GROUP BY user_name) b ON a.user_name=b.user_name;

在极大的数据量(且很多重复值)时,可以先group by去重,再count()计数,效率高于直接count(distinct **)

2. 聚合技巧–利用窗口函数grouping sets、cube、rollup

grouping sets、cube、rollup详解

性别、城市和等级用户分布
SELECT sex,
	   city,
	   level,
	   count(user_id)
FROM user_info
GROUP BY sex,city,level
GROUPING SETS (sex,city,level);
GROUPING SETS() 在group by查询中,根据不同的维度进行聚合,等价于将不同维度的group by结果集进行union all。聚合规则在括号中进行指定。


性别、性别&城市的用户分布
SELECT sex,
	   city,
	   count(user_id)
FROM user_info
GROUP BY sex,
	     city
GROUPING SETS (sex,(sex,city));


cube:根据group by维度的所有组合进行聚合
性别、城市和等级的各种组合的用户分布
SELECT sex,
	   city,
	   level,
	   count(user_id)
FROM user_info
GROUP BY sex,city,level
with cube;


rollup 以最左侧的维度为主,进行层级聚合,是cube的子集
计算每月的支付金额和每年的支付金额汇总
## 原有写法
SELECT a.dt,
	   sum(a.year_amount),
	   sum(a.month_amount)
FROM
	(SELECT substr(dt,1,4) as dt,
		    sum(pay_amount) as year_amount,
		    0 as month_amount
	FROM user_trade
	WHERE dt>'0'
	GROUP BY substr(dt,1,4)
  UNION ALL 
	SELECT substr(dt,1,7) as dt,
		   0 as year_amount,
		   sum(pay_amount) as month_amount	   
	FROM user_trade
	WHERE dt>'0'
	GROUP BY substr(dt,1,7)
	) a
GROUP BY a.dt;

## 优化写法
SELECT year(dt) as year,
	   month(dt) as month,
	   sum(pay_amount)
FROM user_trade
WHERE dt>'0'
GROUP BY year(dt),
	     month(dt)
with rollup;

3. 换个思路解题

在2017年和2018年都购买的用户

## 原有写法
SELECT a.user_name 
FROM 
	(SELECT distinct user_name 
	FROM user_trade 
	WHERE year(dt)=2017)a 
  JOIN 
	(SELECT distinct user_name 
	FROM user_trade 
	WHERE year(dt)=2018)b on a.user_name=b.user_name;

## 优化写法
SELECT a.user_name
FROM 
	(SELECT user_name,
		   count(distinct year(dt)) as year_num
	FROM user_trade
	WHERE year(dt) in (2017,2018)
	GROUP BY user_name) a
WHERE a.year_num=2;

4. union all时可以开启并发执行

参数设置:set hive.exec.parallel=true

可以并行的任务较多时,开启并发执行,可以提高执行效率。

每个用户的支付和退款金额汇总


SELECT a.user_name,
	   sum(a.pay_amount),
	   sum(a.refund_amount)
FROM
	(SELECT user_name,
	       sum(pay_amount) as pay_amount,
	       0 as refund_amount
	FROM user_trade
	WHERE dt>'0'
	GROUP BY user_name
  UNION ALL
	SELECT user_name,
		   0 as pay_amount,
	       sum(refund_amount) as refund_amount
	FROM user_refund
	WHERE dt>'0'
	GROUP BY user_name) a
GROUP BY a.user_name;

5. 利用lateral view进行行转列

每个品类的购买用户数
SELECT b.category,
	   count(distinct a.user_name)
FROM user_goods_category a
lateral view explode(split(category_detail,',')) b as category
GROUP BY b.category;


split():字符串分隔函数
explode:行转列函数

列转行函数:concat_ws(',',collect_set(column))

6. 表连接优化

小表在前,大表在后
Hive假定查询中最后的一个表是大表,它会将其他表缓存起来,然后扫描最后一个表

使用相同的连接条件
当3个或者更多个表进行join连接时,如果每个on子句都使用相同的连接键的话,那么之后产生一个MapReduce job

尽早的过滤数据
减少每个阶段的数据量,对于分区表要加分区,同时只选择需要使用到的字段

逻辑过于复杂,引入中间表

7. 如何解决数据倾斜

数据倾斜的表现:任务进度长时间维持在99%(或100%),查看任务监控页面,发现只有少量(1个或几个)reduce子任务未完成。因为其处理的数据量和其他reduce差异过大

数据倾斜的原因与解决办法:
空值产生的数据倾斜
解决:如果两个表连接时,使用的连接条件有很多空值,建议在连接条件中增加过滤
例如:on a.user_id=b.user_id and a.user_id is not null

大小表连接(其中一张表很大,另一张表非常小)
解决:将小表放到内存里,在map端做Join
例如:
SELECT /*+mapjoin(a)*/, 
			b.* 
FROM a join b on a.**=b.**

两个表连接的字段类型不一样
解决:将连接条件的字段数据类型转换成一致的
例如:on a.user_id=cast(b.user_id as string)

8. 如何计算按月累计去重

2017、2018年按月累计去重的购买用户数

SELECT b.year,
	   b.month,
	   sum(b.user_num) over(partition by b.year order by b.month)
FROM
	(SELECT a.year,
		   a.month,
		   count(distinct a.user_name) user_num
	FROM
		(SELECT year(dt) as year,
			   user_name,
			   min(month(dt)) as month
		FROM user_trade
		WHERE year(dt) in (2017,2018)
		GROUP BY year(dt),
			     user_name) a
	GROUP BY a.year,
		     a.month) b
ORDER BY b.year,
	     b.month
limit 24;
## 其他写法
SELECT b.month,
	   count(distinct a.user_name)
FROM
		SELECT substr(dt,1,7) as month,
			   user_name
		FROM user_trade
		WHERE year(dt) in (2017,2018)
		GROUP BY substr(dt,1,7),
			     user_name) a
	CROSS JOIN
		(SELECT month
		FROM dim_month) b
		WHERE b.month>=a.month
			  and substr(a.month,1,4)=substr(b.month,1,4)
GROUP BY b.month;

9. 综合练习

每个性别、不同性别和手机品牌的退款金额分布

SELECT b.sex,
       b.phonebrand,
       sum(a.refund_amount)
FROM 
		(SELECT user_name,
		       sum(refund_amount) as refund_amount
		FROM user_refund
		WHERE dt>'0'
		GROUP BY user_name)a
	JOIN
		(SELECT user_name,
			    sex,
			    extra2['phonebrand'] phonebrand
		FROM user_info) b on a.user_name=b.user_name
GROUP BY b.sex,
         b.phonebrand
GROUPING SETS (b.sex,(b.sex,b.phonebrand));

把每个用户购买的品类变成一行,品类间用逗号分隔

SELECT a.user_name,
       concat_ws(',',collect_set(a.goods_category))
FROM 
	(SELECT user_name,
	       goods_category
	FROM user_trade
	WHERE dt>'0'
	GROUP BY user_name,
	         goods_category)a
GROUP BY a.user_name;

2017、2018年按月累计去重的退款用户数

SELECT b.year,
       b.month,
       sum(b.user_num) over(partition by b.year order by b.month)
FROM
      (SELECT a.year,
             a.month,
             count(distinct a.user_name) user_num
      FROM 
          (SELECT year(dt) as year,
                 user_name,
                 min(month(dt)) as month
          FROM user_refund
          WHERE year(dt) in (2017,2018)
          GROUP BY year(dt),
                   user_name)a
      GROUP BY a.year,
               a.month)b
ORDER BY b.year,
         b.month 
limit 24;