hive拉链表

拉链表优缺点

节省空间,尤其是数据量很大的时候;

对于订单事务性的数据,查看历史操作记录非常方便,比如说需要查看某一个时间点或者时间段的历史快照信息,查看某一个订单在历史某一个时间点的状态,查看某一个用户在过去某一段时间内,更新过几次等等

不过仅适合基于历史数据更新频率比较低的场景,如果每天1000w订单,每天更新1000次以上

Demo尝试与实现

做一个订单分析的拉链表

准备订单事务表

CREATE TABLE `orders`(
  `orderid` int, 
  `createtime` string, 
  `modifiedtime` string, 
  `status` string)
PARTITIONED BY ( 
  `p_event_date` string)

准备订单新增表(这个表可以保留久一些便于回算数据,数据回滚)

CREATE TABLE `default.t_ods_orders_inc`(
  `orderid` int, 
  `createtime` string, 
  `modifiedtime` string, 
  `status` string)
PARTITIONED BY ( 
  `p_event_date` string)

准备订单历史表(这个表保留一段时间即可)

CREATE TABLE `default.t_dw_orders_his`(
  `orderid` int, 
  `createtime` string, 
  `modifiedtime` string, 
  `status` string, 
  `dw_start_date` string, 
  `dw_end_date` string)
PARTITIONED BY ( 
  `p_event_date` string)

准备订单数据

>show partitions `default`.`orders`;
p_event_date=2015-08-21
p_event_date=2015-08-22
p_event_date=2015-08-23
>SELECT * from  `default`.`orders`;
1	2015-08-18	2015-08-18	创建	2015-08-21
2	2015-08-18	2015-08-18	创建	2015-08-21
3	2015-08-19	2015-08-21	支付	2015-08-21
4	2015-08-19	2015-08-21	完成	2015-08-21
5	2015-08-19	2015-08-20	支付	2015-08-21
6	2015-08-20	2015-08-20	创建	2015-08-21
7	2015-08-20	2015-08-21	支付	2015-08-21
8	2015-08-21	2015-08-21	创建	2015-08-21
1	2015-08-18	2015-08-22	支付	2015-08-22
2	2015-08-18	2015-08-22	完成	2015-08-22
3	2015-08-19	2015-08-21	支付	2015-08-22
4	2015-08-19	2015-08-21	完成	2015-08-22
5	2015-08-19	2015-08-20	支付	2015-08-22
6	2015-08-20	2015-08-22	支付	2015-08-22
7	2015-08-20	2015-08-21	支付	2015-08-22
8	2015-08-21	2015-08-22	支付	2015-08-22
9	2015-08-22	2015-08-22	创建	2015-08-22
10	2015-08-22	2015-08-22	支付	2015-08-22
1	2015-08-18	2015-08-23	完成	2015-08-23
2	2015-08-18	2015-08-22	完成	2015-08-23
3	2015-08-19	2015-08-23	完成	2015-08-23
4	2015-08-19	2015-08-21	完成	2015-08-23
5	2015-08-19	2015-08-23	完成	2015-08-23
6	2015-08-20	2015-08-22	支付	2015-08-23
7	2015-08-20	2015-08-21	支付	2015-08-23
8	2015-08-21	2015-08-23	完成	2015-08-23
9	2015-08-22	2015-08-22	创建	2015-08-23
10	2015-08-22	2015-08-22	支付	2015-08-23
11	2015-08-23	2015-08-23	创建	2015-08-23
12	2015-08-23	2015-08-23	创建	2015-08-23
13	2015-08-23	2015-08-23	支付	2015-08-23

计算订单新增数据

>INSERT overwrite TABLE t_ods_orders_inc PARTITION (p_event_date="${date}")
SELECT orderid,createtime,modifiedtime,status
FROM orders
WHERE (createtime = "${date}" OR modifiedtime = "${date}") and p_event_date="${date}";

>show partitions  `default`.`t_ods_orders_inc`;
p_event_date=2015-08-20
p_event_date=2015-08-21
p_event_date=2015-08-22
>SELECT * from   `default`.`t_ods_orders_inc`;
status p_event_date orderid createtime modifiedtime status p_event_date
1	2015-08-18	2015-08-18	创建	2015-08-20
2	2015-08-18	2015-08-18	创建	2015-08-20
3	2015-08-19	2015-08-21	支付	2015-08-20
4	2015-08-19	2015-08-21	完成	2015-08-20
5	2015-08-19	2015-08-20	支付	2015-08-20
6	2015-08-20	2015-08-20	创建	2015-08-20
7	2015-08-20	2015-08-21	支付	2015-08-20
3	2015-08-19	2015-08-21	支付	2015-08-21
4	2015-08-19	2015-08-21	完成	2015-08-21
7	2015-08-20	2015-08-21	支付	2015-08-21
8	2015-08-21	2015-08-21	创建	2015-08-21
1	2015-08-18	2015-08-22	支付	2015-08-22
6	2015-08-20	2015-08-22	支付	2015-08-22
8	2015-08-21	2015-08-22	支付	2015-08-22
2	2015-08-18	2015-08-22	完成	2015-08-22
10	2015-08-22	2015-08-22	支付	2015-08-22
9	2015-08-22	2015-08-22	创建	2015-08-22

根据订单数据和每日新增数据计算最新一天拉链表

(订单数据以21日开始,21日以前的数据全部看作20日新增数据)

准备20日订单新增数据

INSERT overwrite TABLE t_ods_orders_inc PARTITION (p_event_date = "2015-08-20")
SELECT orderid,createtime,modifiedtime,status
FROM orders
WHERE createtime <= "2015-08-20";

根据20日订单新增数据计算20日拉链表结果

INSERT overwrite TABLE t_dw_orders_his partition(p_event_date="2020-08-20")
SELECT orderid,createtime,modifiedtime,status,
createtime AS dw_start_date,
"9999-12-31" AS dw_end_date
FROM t_ods_orders_inc
WHERE p_event_date = "2020-08-20";

根据21日新增数据计算21日拉链表结果,union all 上半部分判断昨日拉链表的数据是否需要更新有效期(9999-12-31为当前有效数据)union all下半部分即合并今日新增数据并全部设置为当前有效状态

INSERT overwrite TABLE t_dw_orders_his partition(p_event_date="2020-08-21")	
SELECT 
	orderid,
	createtime,
	modifiedtime,
	status,
	dw_start_date,
	dw_end_date
FROM (
	select 
		a.orderid,
		a.createtime,
		a.modifiedtime,
		a.status,
		a.dw_start_date,
		case 
			when b.orderid is not null and a.dw_end_date>"2015-08-21" then "2015-08-20"
			else a.dw_end_date
			end as  dw_end_date 
		from ( select * from t_dw_orders_his where p_event_date="2015-08-20")a
		left join (select * from t_ods_orders_inc where p_event_date="2015-08-21")b
		on (a.orderid=b.orderid)
		union all
		select 
		orderid,
		createtime,
		modifiedtime,
		status,
		 modifiedtime AS dw_start_date,
		'9999-12-31' AS dw_end_date 
		from t_ods_orders_inc where p_event_date="2015-08-21"
		)c
order by orderid,dw_start_date;
>SELECT * from  t_dw_orders_his WHERE p_event_date="2015-08-21"
1	2015-08-18	2015-08-18	创建	2015-08-18	9999-12-31	2015-08-21
2	2015-08-18	2015-08-18	创建	2015-08-18	9999-12-31	2015-08-21
3	2015-08-19	2015-08-21	支付	2015-08-19	2015-08-20	2015-08-21
3	2015-08-19	2015-08-21	支付	2015-08-21	9999-12-31	2015-08-21
4	2015-08-19	2015-08-21	完成	2015-08-19	2015-08-20	2015-08-21
4	2015-08-19	2015-08-21	完成	2015-08-21	9999-12-31	2015-08-21
5	2015-08-19	2015-08-20	支付	2015-08-19	9999-12-31	2015-08-21
6	2015-08-20	2015-08-20	创建	2015-08-20	9999-12-31	2015-08-21
7	2015-08-20	2015-08-21	支付	2015-08-20	2015-08-20	2015-08-21
7	2015-08-20	2015-08-21	支付	2015-08-21	9999-12-31	2015-08-21
8	2015-08-21	2015-08-21	创建	2015-08-21	9999-12-31	2015-08-21

同上,根据22日订单新增数据计算22日拉链表结果

INSERT overwrite TABLE t_dw_orders_his partition(p_event_date="2020-08-22")	
SELECT 
	orderid,
	createtime,
	modifiedtime,
	status,
	dw_start_date,
	dw_end_date
FROM (
	select 
		a.orderid,
		a.createtime,
		a.modifiedtime,
		a.status,
		a.dw_start_date,
		case 
			when b.orderid is not null and a.dw_end_date>"2015-08-22" then "2015-08-21"
			else a.dw_end_date
			end as  dw_end_date 
		from ( select * from t_dw_orders_his where p_event_date="2015-08-21")a
		left join (select * from t_ods_orders_inc where p_event_date="2015-08-22")b
		on (a.orderid=b.orderid)
		union all
		select 
		orderid,
		createtime,
		modifiedtime,
		status,
		 modifiedtime AS dw_start_date,
		'9999-12-31' AS dw_end_date 
		from t_ods_orders_inc where p_event_date="2015-08-22"
		)c
order by orderid,dw_start_date;
>SELECT * from  t_dw_orders_his WHERE p_event_date="2015-08-22"
1	2015-08-18	2015-08-18	创建	2015-08-18	2015-08-21	2015-08-22
1	2015-08-18	2015-08-22	支付	2015-08-22	9999-12-31	2015-08-22
2	2015-08-18	2015-08-18	创建	2015-08-18	2015-08-21	2015-08-22
2	2015-08-18	2015-08-22	完成	2015-08-22	9999-12-31	2015-08-22
3	2015-08-19	2015-08-21	支付	2015-08-19	2015-08-20	2015-08-22
3	2015-08-19	2015-08-21	支付	2015-08-21	9999-12-31	2015-08-22
4	2015-08-19	2015-08-21	完成	2015-08-19	2015-08-20	2015-08-22
4	2015-08-19	2015-08-21	完成	2015-08-21	9999-12-31	2015-08-22
5	2015-08-19	2015-08-20	支付	2015-08-19	9999-12-31	2015-08-22
6	2015-08-20	2015-08-20	创建	2015-08-20	2015-08-21	2015-08-22
6	2015-08-20	2015-08-22	支付	2015-08-22	9999-12-31	2015-08-22
7	2015-08-20	2015-08-21	支付	2015-08-20	2015-08-20	2015-08-22
7	2015-08-20	2015-08-21	支付	2015-08-21	9999-12-31	2015-08-22
8	2015-08-21	2015-08-21	创建	2015-08-21	2015-08-21	2015-08-22
8	2015-08-21	2015-08-22	支付	2015-08-22	9999-12-31	2015-08-22
9	2015-08-22	2015-08-22	创建	2015-08-22	9999-12-31	2015-08-22
10	2015-08-22	2015-08-22	支付	2015-08-22	9999-12-31	2015-08-22

同上,根据23日订单新增数据计算23日拉链表结果.....就这么个意思