大家:
好!今天看到了一个hive的分析函数的题,感觉很有意思。把答案整理了整理,分享出来,希望对大家有用。需求如下所示:
起初我看到这道题时,感觉应该要用到分析函数,但就不知道怎么用。好在最终,写出来了,思路如下所示:
----测试表的表结构以及数据如下所示:
hive> desc sales;
OK
id int
produce_name string
start_time date
end_time date
days int
Time taken: 0.354 seconds, Fetched: 5 row(s)
hive> select * from sales;
OK
1 nike 2011-09-01 2011-09-05 5
2 nike 2011-09-03 2011-09-06 4
3 nike 2011-09-09 2011-09-15 7
4 oppo 2011-08-04 2011-08-05 2
5 oppo 2011-08-04 2011-08-15 12
6 vivo 2011-08-15 2011-08-21 7
7 vivo 2011-09-02 2011-09-12 11
Time taken: 0.223 seconds, Fetched: 7 row(s)
---第一步:求出每个开始时间的前一次促销的开始时间和结束时间
select id,produce_name,start_time,end_time,
lag(start_time) over(partition by produce_name order by id) before_start_time,
lag(end_time) over(partition by produce_name order by id) before_end_time
from sales;
---第一步:执行结果(第一步)
1 nike 2011-09-01 2011-09-05 NULL NULL
2 nike 2011-09-03 2011-09-06 2011-09-01 2011-09-05
3 nike 2011-09-09 2011-09-15 2011-09-03 2011-09-06
4 oppo 2011-08-04 2011-08-05 NULL NULL
5 oppo 2011-08-04 2011-08-15 2011-08-04 2011-08-05
6 vivo 2011-08-15 2011-08-21 NULL NULL
7 vivo 2011-09-02 2011-09-12 2011-08-15 2011-08-21
---第二步 依据前一次促销的开始时间和结束时间,开始时间统一为最早的,为后面分组做准备
select produce_name,start_time,max(end_time) end_time from
(select id,produce_name,case when start_time>=before_start_time and start_time<=before_end_time then before_start_time else start_time end as start_time, end_time
from (select id,produce_name,start_time,end_time,
lag(start_time) over(partition by produce_name order by id) before_start_time,
lag(end_time) over(partition by produce_name order by id) before_end_time
from sales) t) d
group by produce_name,start_time;
--第二步 执行结果(第二步)
nike 2011-09-01 2011-09-06
nike 2011-09-09 2011-09-15
oppo 2011-08-04 2011-08-15
vivo 2011-08-15 2011-08-21
vivo 2011-09-02 2011-09-12
---第三步 依据合并后的开始时间,算出每个时间段内的促销天数之和
select produce_name,start_time,end_time,datediff(end_time,start_time)+1 days from
(select produce_name,start_time,max(end_time) end_time from
(select id,produce_name,case when start_time>=before_start_time and start_time<=before_end_time then before_start_time else start_time end as start_time, end_time
from (select id,produce_name,start_time,end_time,
lag(start_time) over(partition by produce_name order by id) before_start_time,
lag(end_time) over(partition by produce_name order by id) before_end_time
from sales) t) d
group by produce_name,start_time) e;
--执行结果(第三步)
nike 2011-09-01 2011-09-06 6
nike 2011-09-09 2011-09-15 7
oppo 2011-08-04 2011-08-15 12
vivo 2011-08-15 2011-08-21 7
vivo 2011-09-02 2011-09-12 11
--第四步:依据产品名称,求出最终的促销天数之和
select produce_name,sum(days) from
(select produce_name,start_time,end_time,datediff(end_time,start_time)+1 days from
(select produce_name,start_time,max(end_time) end_time from
(select id,produce_name,case when start_time>=before_start_time and start_time<=before_end_time then before_start_time else start_time end as start_time, end_time
from (select id,produce_name,start_time,end_time,
lag(start_time) over(partition by produce_name order by id) before_start_time,
lag(end_time) over(partition by produce_name order by id) before_end_time
from sales) t) d
group by produce_name,start_time) e) f
group by produce_name;
---执行结果(第四步)
nike 13
oppo 12
vivo 18
说明: 个人的见解,希望对大家有帮助!
后面经过分析,此种方法有问题,如果连续值涉及到了多行,按照偏移量取最小值时就会有问题。
第一步 获取每行的上一次的结束时间
select id,
produce_name,
start_time,
end_time,
lag(end_time,1,start_time) over(partition by produce_name order by id) before_end_time
from sales
效果如下所示:
第二步 依据本行的开始时间和上次的结束时间,计算中间差了多少天。因为datediff函数是差值概念,连续值需要增加1,间隔的天数需要减1
select
id,
produce_name,
end_time,
start_time,
before_end_time,
datediff(end_time,start_time)+1 as cnt_all,
case when start_time<=before_end_time then 0 else datediff(start_time,before_end_time)-1 end cnt
from (select id,
produce_name,
start_time,
end_time,
lag(end_time,1,start_time) over(partition by produce_name order by id) before_end_time
from sales ) t
效果如下所示:
结果符合猜想,nike的第三行开始时间是9号,上次结束是6号,中间差了2天
第三步 按照产品汇总取连续值和中间的间隔值求和
select
produce_name,
min(start_time) as start_time,
max(end_time) as end_time,
datediff(max(end_time),min(start_time))+1 cnt_all,
sum(case when start_time<=before_end_time then 0 else datediff(start_time,before_end_time)-1 end) cnt
from (select id,
produce_name,
start_time,
end_time,
lag(end_time,1,start_time) over(partition by produce_name order by id) before_end_time
from sales ) t
group by produce_name
结果如下所示:
第四步 最终的脚本 ,精简下
select
produce_name,
datediff(max(end_time),min(start_time))+1-
sum(case when start_time<=before_end_time then 0 else datediff(start_time,before_end_time)-1 end) cnt
from (select id,
produce_name,
start_time,
end_time,
lag(end_time,1,start_time) over(partition by produce_name order by id) before_end_time
from sales ) t
group by produce_name
最终结果如下所示:
个人见解,欢迎指正。
后来经过研究,发现此种方法也是有问题的。
如果是第一段时间的结束时间比较长,还会有重复计算的问题
经过在网上的搜素,此问题的本质是 "HiveSQL——打折日期交叉问题 "
参考地址 如下所示: