hive拉链表
拉链表优缺点
节省空间,尤其是数据量很大的时候;
对于订单事务性的数据,查看历史操作记录非常方便,比如说需要查看某一个时间点或者时间段的历史快照信息,查看某一个订单在历史某一个时间点的状态,查看某一个用户在过去某一段时间内,更新过几次等等
不过仅适合基于历史数据更新频率比较低的场景,如果每天1000w订单,每天更新1000次以上
Demo尝试与实现
做一个订单分析的拉链表
准备订单事务表
CREATE TABLE `orders`(
`orderid` int,
`createtime` string,
`modifiedtime` string,
`status` string)
PARTITIONED BY (
`p_event_date` string)
准备订单新增表(这个表可以保留久一些便于回算数据,数据回滚)
CREATE TABLE `default.t_ods_orders_inc`(
`orderid` int,
`createtime` string,
`modifiedtime` string,
`status` string)
PARTITIONED BY (
`p_event_date` string)
准备订单历史表(这个表保留一段时间即可)
CREATE TABLE `default.t_dw_orders_his`(
`orderid` int,
`createtime` string,
`modifiedtime` string,
`status` string,
`dw_start_date` string,
`dw_end_date` string)
PARTITIONED BY (
`p_event_date` string)
准备订单数据
>show partitions `default`.`orders`;
p_event_date=2015-08-21
p_event_date=2015-08-22
p_event_date=2015-08-23
>SELECT * from `default`.`orders`;
1 2015-08-18 2015-08-18 创建 2015-08-21
2 2015-08-18 2015-08-18 创建 2015-08-21
3 2015-08-19 2015-08-21 支付 2015-08-21
4 2015-08-19 2015-08-21 完成 2015-08-21
5 2015-08-19 2015-08-20 支付 2015-08-21
6 2015-08-20 2015-08-20 创建 2015-08-21
7 2015-08-20 2015-08-21 支付 2015-08-21
8 2015-08-21 2015-08-21 创建 2015-08-21
1 2015-08-18 2015-08-22 支付 2015-08-22
2 2015-08-18 2015-08-22 完成 2015-08-22
3 2015-08-19 2015-08-21 支付 2015-08-22
4 2015-08-19 2015-08-21 完成 2015-08-22
5 2015-08-19 2015-08-20 支付 2015-08-22
6 2015-08-20 2015-08-22 支付 2015-08-22
7 2015-08-20 2015-08-21 支付 2015-08-22
8 2015-08-21 2015-08-22 支付 2015-08-22
9 2015-08-22 2015-08-22 创建 2015-08-22
10 2015-08-22 2015-08-22 支付 2015-08-22
1 2015-08-18 2015-08-23 完成 2015-08-23
2 2015-08-18 2015-08-22 完成 2015-08-23
3 2015-08-19 2015-08-23 完成 2015-08-23
4 2015-08-19 2015-08-21 完成 2015-08-23
5 2015-08-19 2015-08-23 完成 2015-08-23
6 2015-08-20 2015-08-22 支付 2015-08-23
7 2015-08-20 2015-08-21 支付 2015-08-23
8 2015-08-21 2015-08-23 完成 2015-08-23
9 2015-08-22 2015-08-22 创建 2015-08-23
10 2015-08-22 2015-08-22 支付 2015-08-23
11 2015-08-23 2015-08-23 创建 2015-08-23
12 2015-08-23 2015-08-23 创建 2015-08-23
13 2015-08-23 2015-08-23 支付 2015-08-23
计算订单新增数据
>INSERT overwrite TABLE t_ods_orders_inc PARTITION (p_event_date="${date}")
SELECT orderid,createtime,modifiedtime,status
FROM orders
WHERE (createtime = "${date}" OR modifiedtime = "${date}") and p_event_date="${date}";
>show partitions `default`.`t_ods_orders_inc`;
p_event_date=2015-08-20
p_event_date=2015-08-21
p_event_date=2015-08-22
>SELECT * from `default`.`t_ods_orders_inc`;
status p_event_date orderid createtime modifiedtime status p_event_date
1 2015-08-18 2015-08-18 创建 2015-08-20
2 2015-08-18 2015-08-18 创建 2015-08-20
3 2015-08-19 2015-08-21 支付 2015-08-20
4 2015-08-19 2015-08-21 完成 2015-08-20
5 2015-08-19 2015-08-20 支付 2015-08-20
6 2015-08-20 2015-08-20 创建 2015-08-20
7 2015-08-20 2015-08-21 支付 2015-08-20
3 2015-08-19 2015-08-21 支付 2015-08-21
4 2015-08-19 2015-08-21 完成 2015-08-21
7 2015-08-20 2015-08-21 支付 2015-08-21
8 2015-08-21 2015-08-21 创建 2015-08-21
1 2015-08-18 2015-08-22 支付 2015-08-22
6 2015-08-20 2015-08-22 支付 2015-08-22
8 2015-08-21 2015-08-22 支付 2015-08-22
2 2015-08-18 2015-08-22 完成 2015-08-22
10 2015-08-22 2015-08-22 支付 2015-08-22
9 2015-08-22 2015-08-22 创建 2015-08-22
根据订单数据和每日新增数据计算最新一天拉链表
(订单数据以21日开始,21日以前的数据全部看作20日新增数据)
准备20日订单新增数据
INSERT overwrite TABLE t_ods_orders_inc PARTITION (p_event_date = "2015-08-20")
SELECT orderid,createtime,modifiedtime,status
FROM orders
WHERE createtime <= "2015-08-20";
根据20日订单新增数据计算20日拉链表结果
INSERT overwrite TABLE t_dw_orders_his partition(p_event_date="2020-08-20")
SELECT orderid,createtime,modifiedtime,status,
createtime AS dw_start_date,
"9999-12-31" AS dw_end_date
FROM t_ods_orders_inc
WHERE p_event_date = "2020-08-20";
根据21日新增数据计算21日拉链表结果,union all 上半部分判断昨日拉链表的数据是否需要更新有效期(9999-12-31为当前有效数据)union all下半部分即合并今日新增数据并全部设置为当前有效状态
INSERT overwrite TABLE t_dw_orders_his partition(p_event_date="2020-08-21")
SELECT
orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date
FROM (
select
a.orderid,
a.createtime,
a.modifiedtime,
a.status,
a.dw_start_date,
case
when b.orderid is not null and a.dw_end_date>"2015-08-21" then "2015-08-20"
else a.dw_end_date
end as dw_end_date
from ( select * from t_dw_orders_his where p_event_date="2015-08-20")a
left join (select * from t_ods_orders_inc where p_event_date="2015-08-21")b
on (a.orderid=b.orderid)
union all
select
orderid,
createtime,
modifiedtime,
status,
modifiedtime AS dw_start_date,
'9999-12-31' AS dw_end_date
from t_ods_orders_inc where p_event_date="2015-08-21"
)c
order by orderid,dw_start_date;
>SELECT * from t_dw_orders_his WHERE p_event_date="2015-08-21"
1 2015-08-18 2015-08-18 创建 2015-08-18 9999-12-31 2015-08-21
2 2015-08-18 2015-08-18 创建 2015-08-18 9999-12-31 2015-08-21
3 2015-08-19 2015-08-21 支付 2015-08-19 2015-08-20 2015-08-21
3 2015-08-19 2015-08-21 支付 2015-08-21 9999-12-31 2015-08-21
4 2015-08-19 2015-08-21 完成 2015-08-19 2015-08-20 2015-08-21
4 2015-08-19 2015-08-21 完成 2015-08-21 9999-12-31 2015-08-21
5 2015-08-19 2015-08-20 支付 2015-08-19 9999-12-31 2015-08-21
6 2015-08-20 2015-08-20 创建 2015-08-20 9999-12-31 2015-08-21
7 2015-08-20 2015-08-21 支付 2015-08-20 2015-08-20 2015-08-21
7 2015-08-20 2015-08-21 支付 2015-08-21 9999-12-31 2015-08-21
8 2015-08-21 2015-08-21 创建 2015-08-21 9999-12-31 2015-08-21
同上,根据22日订单新增数据计算22日拉链表结果
INSERT overwrite TABLE t_dw_orders_his partition(p_event_date="2020-08-22")
SELECT
orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date
FROM (
select
a.orderid,
a.createtime,
a.modifiedtime,
a.status,
a.dw_start_date,
case
when b.orderid is not null and a.dw_end_date>"2015-08-22" then "2015-08-21"
else a.dw_end_date
end as dw_end_date
from ( select * from t_dw_orders_his where p_event_date="2015-08-21")a
left join (select * from t_ods_orders_inc where p_event_date="2015-08-22")b
on (a.orderid=b.orderid)
union all
select
orderid,
createtime,
modifiedtime,
status,
modifiedtime AS dw_start_date,
'9999-12-31' AS dw_end_date
from t_ods_orders_inc where p_event_date="2015-08-22"
)c
order by orderid,dw_start_date;
>SELECT * from t_dw_orders_his WHERE p_event_date="2015-08-22"
1 2015-08-18 2015-08-18 创建 2015-08-18 2015-08-21 2015-08-22
1 2015-08-18 2015-08-22 支付 2015-08-22 9999-12-31 2015-08-22
2 2015-08-18 2015-08-18 创建 2015-08-18 2015-08-21 2015-08-22
2 2015-08-18 2015-08-22 完成 2015-08-22 9999-12-31 2015-08-22
3 2015-08-19 2015-08-21 支付 2015-08-19 2015-08-20 2015-08-22
3 2015-08-19 2015-08-21 支付 2015-08-21 9999-12-31 2015-08-22
4 2015-08-19 2015-08-21 完成 2015-08-19 2015-08-20 2015-08-22
4 2015-08-19 2015-08-21 完成 2015-08-21 9999-12-31 2015-08-22
5 2015-08-19 2015-08-20 支付 2015-08-19 9999-12-31 2015-08-22
6 2015-08-20 2015-08-20 创建 2015-08-20 2015-08-21 2015-08-22
6 2015-08-20 2015-08-22 支付 2015-08-22 9999-12-31 2015-08-22
7 2015-08-20 2015-08-21 支付 2015-08-20 2015-08-20 2015-08-22
7 2015-08-20 2015-08-21 支付 2015-08-21 9999-12-31 2015-08-22
8 2015-08-21 2015-08-21 创建 2015-08-21 2015-08-21 2015-08-22
8 2015-08-21 2015-08-22 支付 2015-08-22 9999-12-31 2015-08-22
9 2015-08-22 2015-08-22 创建 2015-08-22 9999-12-31 2015-08-22
10 2015-08-22 2015-08-22 支付 2015-08-22 9999-12-31 2015-08-22
同上,根据23日订单新增数据计算23日拉链表结果.....就这么个意思