文章目录
- 1.累计计算窗口函数
- 1).sum(...)over(...)
- 2).avg(...) over(...)
- 3.语法总结:
- 2.分区排序窗口函数:
- row_number() over(partition by ...A...order by...B...) 1,2,3,4
- rank() over(partition by ...A...order by...B...) 1,2,2,4
- dense_rank() over(partition by ...A...order by...B...) 1,2,2,3
- 3.分组排序窗口函数
- ntile(n) over(......)
- 4.偏移分析窗口函数
- lag(...)over(...),lead(...)over(...)
1、掌握sum(),avg()用于累计计算的窗口函数;
2、掌握row_number(),rank()用于排序的窗口函数;
3、掌握ntile()用于分组查询的窗口函数;
4、掌握lag(),lead()偏移分析窗口函数
1.累计计算窗口函数
1).sum(…)over(…)
求截止某月的数值
user_trade列名 | 举例 |
user_name | Amy,Dennis |
piece | 购买数量 |
pay_amount | 支付金额 |
goods_category | food,clothes,book,computer,electronics,shoes |
pay_time | 132265412,时间戳 |
dt | partition,‘yyyy-mm-dd’ |
求2018年每月的支付总额和当年累计支付总额
select a.month,
a.pay_amount,
sum(a.pay_amount) over(order by a.month)
from
(select month(dt) month,
sum(pay_amount) pay_amount
from user_trade
where year(dt)=2018
group by month(dt))a;
2017-2018年每月的支付总额和当年累积支付总额(如果用上面的方式会把之前月份的都汇总,无法实现)
select a.year,
a.month,
a.pay_amount,
sum(a.pay_amount) over(partition by a.year order by a.month) -- over:指定累计计算的条件,这里须正确分组
from
(select year(dt) year,
month(dt) month,
sum(pay_amount) pay_amount
from user_trade
where year(dt) in (2017,2018)
group by year(dt),
month(dt))a;
注:
1.partition by 起到分组的作用;
2.order by按照什么顺序进行累加,升序ASC,降序DESC,默认升序
3.常见错误:分组没有限制正确(partition by 后多加了a.month)
2).avg(…) over(…)
2018年每个月的近三个月移动平均支付金额
select a.month,
a.pay_amount,
avg(a.pay_amount) over(order by a.month rows between 2 preceding and current row)
from
(select month(dt)month,
sum(pay_amount)pay_amount
from user_trade
where year(dt)=2018
group by month(dt))a;
说明:我们用rows between 2 preceding and current row来限制计算移动平均的平均范围,本语句含义是包含本行及前两行,这个就是我们题目中要求的近三月的写法
3.语法总结:
sum(…A…) over(partition by …B… order by…C…rows between…D1…and…D2…)
avg(…A…) over(partition by …B… order by…C…rows between…D1…and…D2…)
A:需要被加工的字段名称
B:分组的字段名称
C:排序的字段名称
D:计算的行数范围
rows between unbounded preceding and current row–包括本行和之前所有的行
rows between current row and unbounded following–包括本行和之后所有的行
rows between 3 preceding and current row --包括本行以内和前三行
rows between 3 preceding and 1 following–从前三行到下一行(5行)
拓展:
max(…) over(partition by …order by … rows between… and…)
min(…) over(partition by …order by … rows between… and…)
2.分区排序窗口函数:
这三个函数的作用都是返回相应规则的排序序号
row_number() over(partition by …A…order by…B…) 1,2,3,4
rank() over(partition by …A…order by…B…) 1,2,2,4
dense_rank() over(partition by …A…order by…B…) 1,2,2,3
2019年1月,用户购买商品品类数量的排名
select user_name,
count(distinct goods_category),
row_number() over(order by count(distinct goods_category)),
rank() over(order by count(distinct goods_category)),
dense_rank() over(order by count(distinct goods_category))
from user_trade
where substr(dt,1,7)='2019-01'
group by user_name;
选择2019年支付金额排名在第10,20,30名的用户:
select a.user_name,
a.pay_amount,
a.rank
from
(select user_name,
sum(pay_amount)pay_amount,
rank() over(order by sum(pay_amount)desc)rank
from user_trade
where year(dt)=2019
group by user_name)a
where a.rank in (10,20,30);
3.分组排序窗口函数
ntile(n) over(…)
ntile(n) over(partition by…A…order by…B…)
n:切分的片数
A:分组的字段名称
B:排序的字段名称
ntile(n),用于将分组数据按照顺序切分成n片,返回当前切片值
ntile不支持row between,比如ntile(2) over(partition by …order by…rows between 3 preceding and current row)
如果切片不均匀,默认增加第一个切片的分布
将2019年1月份支付用户,按照支付金额分成5组:
select user_name,
sum(pay_amount)pay_amount,
ntile(5) over(order by sum(pay_amount) desc)level
from user_trade
where substr(dt,1,7)='2019-01'
group by user_name;
选出2019年退款金额排名前10%的用户:
user_refund列名 | 举例 |
user_name | Amy,Dennis |
refund_piece | 退款件数 |
refund_amount | 退款金额 |
refund_time | 134854654,时间戳 |
dt | partition,‘yyyy-mm-dd’ |
select a.user_name,
a.refund_amount,
a.level
from
(select user_name,
sum(refund_amount)refund_amount,
ntile(10) over(order by sum(refund_amount)desc)level
from user_refund
where year(dt)=2019
group by user_name)a
where a.level=1;
4.偏移分析窗口函数
lag(…)over(…),lead(…)over(…)
lag和lead分析函数可以在通一次查询中取出同一字段的前N行的数据(lag)和后N行的数据(lead)作为独立的列。
在实际应用中,若要用到取今天和昨天的某字段差值时,lag和lead函数的应用就显得尤为重要。当然,
这种操作可以用表的自连接实现,但是lag和lead与left join、right join等自连接相比,效率更高,SQL更简洁。
lag(exp_str,offset,defval) over(partition by …order by…)
lead(exp_str,offset,defval) over(partition by …order by…)
– exp_str是字段名称。
– offset是偏移量,即是上1个或上N个的值,假设当前行在表中排在第5行,offset为3,则表示我们所要找的数据行就是表中的第2行(即5-3=2)。offset默认值为1。
– defval默认值,当两个函数取上N/下N个值,当在表中从当前行位置向前数N行已经超出了表的范围时,lag()函数
– 将defval这个参数值作为函数的返回值,若没有指定默认值,则返回null,那么在数学运算中,总要给一个默认值才不会出错。
Alice和Alexander的各种时间偏移
lag()实例:
select user_name,
dt,
lag(dt,1,dt) over(partition by user_name order by dt),
lag(dt) over(partition by user_name order by dt),
lag(dt,2,dt) over(partition by user_name order by dt),
lag(dt,2) over(partition by user_name order by dt)
from user_trade
where dt>'0'
and user_name in ('Alice','Alexander');
注意:已经对人进行分组了,所以不会由Alice偏移到Alexander,否则没有意义
lead()实例:
select user_name,
dt,
lead(dt,1,dt) over(partition by user_name order by dt),
lead(dt) over(partition by user_name order by dt),
lead(dt,2,dt) over(partition by user_name order by dt),
lead(dt,2) over(partition by user_name order by dt)
from user_trade
where dt>'0'
and user_name in ('Alice','Alexander');
实例:支付时间间隔超过100天的用户数
select count(distinct user_name)
from
(select user_name,
dt, --选出每个人的支付日期
lead(dt) over(partition by user_name order by dt) lead_dt
from user_trade
where dt>'0')a --用的是全量,所以用>0
where a.lead_dt is not null and datediff(a.lead_dt,a.dt)>100; -- 大的写左边,小的写右边
重点练习
1.每个城市,不同性别,2018年支付金额最高的top3用户(使用user_trade和user_info两个表)
user_info列名 | 举例 |
user_id | 10001,10002 |
user_name | Amy,Dennis |
sex | [male,female] |
age | [13,70] |
city | beijing,shanghai |
fistactivetime | 2019-04-19 15:40:00 |
level | [1,10] |
extra1 | string类型:{“systemtype”:“ios”,“education”:“master”,“marriage_status”:“1”,“phonebrand”:“iphoneX”} |
extra2 | map<string.string>类型: {“systemtype”:“ios”,“education”:“master”,“marriage_status”:“1”,“phonebrand”:“iphoneX”} |
user_trade列名 | 举例 |
user_name | Amy,Dennis |
piece | 购买数量 |
price | 价格 |
pay_amount | 支付金额 |
goods_category | food,clothes,book,computer,electronics,shoes |
pay_time | 2412521561,时间戳 |
dt | partition,‘yyyy-mm-dd’ |
select c.user_name,
c.city,
c.sex,
c.pay_amount,
c.rank
from
(select a.user_name,
b.city,
b.sex,
a.pay_amount,
row_number() over(partition by b.city,b.sex order by a.pay_amount desc) rank
from
(select user_name,
sum(pay_amount) pay_amount
from user_trade
where year(dt)=2018
group by user_name)a
left join user_info b on a.user_name=b.user_name)c
where c.rank<=3;
-- 使用row_number是因为它可以精确限制可以具体出来多少行,使用dense_rank和rank都不行
每个手机品牌退款金额前25%的用户(使用user_refund和user_info两个表)
user_refund列名 | 举例 |
user_name | Amy,Dennis |
refund_piece | 退款件数 |
refund_amount | 退款金额 |
refund_time | 123412521,时间戳 |
dt | partition,‘yyyy-mm-dd’ |
select *
from
(select a.user_name,
extra2['b.phonebrand'] as phonebrand,
a.refund_amount,
ntile(4) over(partition by extra2['b.phonebrand'] order by a.refund_amount desc)level
from
(select user_name,
sum(refund_amount) refund_amount
from user_refund
where dt>'0'
group by user_name) a
left join user_info b on a.user_name=b.user_name)c
where c.level=1;