Distribution Key Design in Greenplum
Background
While using GP, the 最高执行指挥 project team ran into two classes of problems:
- Slow data inserts
- Poor performance of multi-table join queries
Problem 1: slow data inserts
The symptom was that inserting 3,000 rows took roughly 7 minutes.
Investigation
- Check how the data is distributed
fb09_1=# select gp_segment_id, count(*) from db_xsys.t_xsys group by 1 order by 1;
gp_segment_id | count
---------------+-------
0 | 12499
1 | 12495
2 | 12497
3 | 12501
4 | 12509
5 | 12504
6 | 12496
7 | 12497
8 | 12493
9 | 12500
10 | 12506
11 | 12503
(12 rows)
The data is spread evenly across the segments and the total volume is small, so data volume is unlikely to be the cause of the slowdown.
- Check the execution plan of a single-row insert
fb09_1=# explain analyze insert into db_xsys.t_xsys (c_id, n_ajbs) values ('id123', 123);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Insert (slice0; segments: 12) (rows=1 width=0)
-> Redistribute Motion 12:12 (slice1; segments: 12) (cost=0.00..0.01 rows=1 width=0)
Rows out: 1 rows at destination (seg7) with 0.478 ms to first row, 0.488 ms to end, start offset by 126 ms.
-> Result (cost=0.00..0.01 rows=1 width=0)
Rows out: 1 rows (seg1) with 0.010 ms to first row, 0.011 ms to end, start offset by 126 ms.
Slice statistics:
(slice0) Executor memory: 280K bytes avg x 12 workers, 319K bytes max (seg7).
(slice1) Executor memory: 310K bytes avg x 12 workers, 310K bytes max (seg0).
Statement statistics:
Memory used: 524288K bytes
Optimizer status: legacy query optimizer
Total runtime: 128.952 ms
(12 rows)
The plan contains the keyword segments: 12, which means this insert involved every data node (the test environment has 12 segments in total), even though a single-row insert should only need one. This pointed to the table's distribution policy having been chosen incorrectly.
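As a side check, the number of primary data segments can be confirmed from the catalog. A minimal sketch (gp_segment_configuration is a standard Greenplum catalog table; content = -1 is the master and role = 'p' marks primaries):
-- Count the primary data segments, excluding the master and any mirrors
select count(*) from gp_segment_configuration where content >= 0 and role = 'p';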
- Check the current distribution policy
fb09_1=# \d db_xsys.t_xsys
Table "db_xsys.t_xsys"
Column | Type | Modifiers
-----------------+-----------------------------+-----------
c_id | character(32) | not null
n_ajbs | numeric(17,0) | not null
......
Indexes:
"i_xsys_ajbs" btree (n_ajbs)
Distributed randomly
Distributed randomly shows that the table currently uses the random distribution policy.
Random distribution: when a table is randomly distributed, GP first notifies all segments of the data to be inserted and then, taking the current data distribution across all segments into account, decides which segment each new row should be stored on.
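For reference, this is how a table ends up with this policy; a minimal sketch (t_demo is a hypothetical table, not part of the project schema):
-- Explicitly asking for random distribution at create time
create table t_demo (
    c_id   character(32) not null,
    n_ajbs numeric(17,0) not null
) distributed randomly;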
Solution
As the above shows, random distribution does not match the business requirements; the distribution policy should be changed to hash distribution.
- Change the table's distribution policy
fb09_1=# alter table db_xsys.t_xsys set distributed by (n_ajbs);
ALTER TABLE
- Re-check the execution plan
fb09_1=# explain analyze insert into db_xsys.t_xsys (c_id, n_ajbs) values ('id111', 111);
QUERY PLAN
---------------------------------------------------------------------------
Insert (slice0; segments: 1) (rows=1 width=0)
-> Result (cost=0.00..0.01 rows=1 width=0)
Rows out: 1 rows with 0.055 ms to end, start offset by 2.056 ms.
Slice statistics:
(slice0) Executor memory: 101K bytes (seg8).
Statement statistics:
Memory used: 524288K bytes
Optimizer status: legacy query optimizer
Total runtime: 2.427 ms
(9 rows)
A single-row insert dropped from 120+ ms to around 2 ms; problem solved.
Problem 2: poor performance of multi-table join queries
Investigation
- Check the execution plan
fb09_1=# explain analyze select t_xsys.n_ajbs from db_xsys.t_xsys left join db_xsys.t_xsysbgr on t_xsys.n_ajbs = t_xsysbgr.n_ajbs where t_xsys.n_ajbs = 100010000006;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Gather Motion 12:1 (slice3; segments: 12) (cost=200.31..389.50 rows=4 width=17)
Rows out: 1 rows at destination with 49 ms to end, start offset by 107 ms.
-> Hash Left Join (cost=200.31..389.50 rows=1 width=17)
Hash Cond: t_xsys.n_ajbs = t_xsysbgr.n_ajbs
Rows out: 1 rows (seg3) with 43 ms to first row, 46 ms to end, start offset by 110 ms.
Executor memory: 1K bytes.
Work_mem used: 1K bytes. Workfile: (0 spilling, 0 reused)
(seg3) Hash chain length 1.0 avg, 1 max, using 1 of 2097211 buckets.
-> Redistribute Motion 12:12 (slice1; segments: 12) (cost=0.00..189.15 rows=1 width=17)
Hash Key: 100010000006::numeric
Rows out: 1 rows at destination (seg3) with 0.004 ms to first row, 0.009 ms to end, start offset by 153 ms.
-> Seq Scan on t_xsys (cost=0.00..189.12 rows=1 width=17)
Filter: n_ajbs = 100010000006::numeric
Rows out: 1 rows (seg5) with 0.123 ms to first row, 5.987 ms to end, start offset by 111 ms.
-> Hash (cost=200.30..200.30 rows=1 width=10)
Rows in: 1 rows (seg3) with 0.061 ms to end, start offset by 153 ms.
-> Redistribute Motion 12:12 (slice2; segments: 12) (cost=0.00..200.30 rows=1 width=10)
Hash Key: t_xsysbgr.n_ajbs
Rows out: 1 rows at destination (seg3) with 0.031 ms to first row, 0.039 ms to end, start offset by 153 ms.
-> Index Scan using i_xsysbgr_ajbs on t_xsysbgr (cost=0.00..200.28 rows=1 width=10)
Index Cond: n_ajbs = 100010000006::numeric
Rows out: 1 rows (seg3) with 0.211 ms to first row, 0.214 ms to end, start offset by 112 ms.
Slice statistics:
(slice0) Executor memory: 249K bytes.
(slice1) Executor memory: 339K bytes avg x 12 workers, 339K bytes max (seg0).
(slice2) Executor memory: 344K bytes avg x 12 workers, 351K bytes max (seg3).
(slice3) Executor memory: 33025K bytes avg x 12 workers, 33055K bytes max (seg3). Work_mem: 1K bytes max.
Statement statistics:
Memory used: 524288K bytes
Optimizer status: legacy query optimizer
Total runtime: 156.683 ms
(31 rows)
The plan contains the keyword Redistribute Motion, which suggests the two tables in this query are distributed on different key columns; if both tables shared the same distribution key column, no data redistribution step would be needed.
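Besides \d, the distribution keys of several tables can be read from the catalog at once. A sketch assuming a Greenplum 4.x/5.x-style catalog, where gp_distribution_policy exposes localoid and attrnums (Greenplum 6 stores this information differently):
-- List the distribution key columns of the two tables involved in the join
select c.relname, a.attname
from gp_distribution_policy p
join pg_class c on c.oid = p.localoid
join pg_attribute a on a.attrelid = p.localoid
                   and a.attnum = any (p.attrnums)
where c.relname in ('t_xsys', 't_xsysbgr');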
- Check the current distribution keys
fb09_1=# \d db_xsys.t_xsys
Table "db_xsys.t_xsys"
Column | Type | Modifiers
-----------------+-----------------------------+-----------
c_id | character(32) | not null
n_ajbs | numeric(17,0) | not null
...
Distributed by: (c_id)
fb09_1=# \d db_xsys.t_xsysbgr
Table "db_xsys.t_xsysbgr"
Column | Type | Modifiers
-------------+-----------------------------+-----------
c_id | character(32) | not null
c_id_xsys | character(32) | not null
...
Distributed by: (c_id)
Redistribute Motion indicates that rows sharing the same join key are not stored on the same segment, so the data has to be redistributed at query time.
Hash distribution: Greenplum uses hash distribution by default. One or more columns are chosen as the distribution key (DK); the key is hashed to determine which segment each row is stored on, so rows with the same distribution key value always land on the same segment.
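A minimal sketch of the co-located layout this implies (t_a and t_b are hypothetical tables, not the project schema): when both sides of a join are hash-distributed on the join column, matching rows already live on the same segment and the join needs no Motion step.
-- Both tables hash-distributed on the join column n_ajbs
create table t_a (n_ajbs numeric(17,0) not null, payload text) distributed by (n_ajbs);
create table t_b (n_ajbs numeric(17,0) not null, payload text) distributed by (n_ajbs);
-- A join on n_ajbs can then be executed locally on each segment:
-- select ... from t_a join t_b on t_a.n_ajbs = t_b.n_ajbs;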
Solution
As the above shows, the problem is caused by the two tables having different distribution key columns; both tables' distribution keys should be changed.
- Change the distribution keys of the tables involved and redistribute the data
fb09_1=# alter table db_xsys.t_xsys set with (reorganize=true) distributed by (n_ajbs);
ALTER TABLE
fb09_1=# alter table db_xsys.t_xsysbgr set with (reorganize=true) distributed by (n_ajbs);
ALTER TABLE
- Re-check the execution plan
fb09_1=# explain analyze select t_xsys.n_ajbs from db_xsys.t_xsys left join db_xsys.t_xsysbgr on t_xsys.n_ajbs = t_xsysbgr.n_ajbs where t_xsys.n_ajbs = 100010000006;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Gather Motion 1:1 (slice1; segments: 1) (cost=200.29..385.21 rows=4 width=17)
Rows out: 1 rows at destination with 27 ms to end, start offset by 0.707 ms.
-> Hash Left Join (cost=200.29..385.21 rows=1 width=17)
Hash Cond: t_xsys.n_ajbs = t_xsysbgr.n_ajbs
Rows out: 1 rows with 17 ms to first row, 26 ms to end, start offset by 1.000 ms.
Executor memory: 1K bytes.
Work_mem used: 1K bytes. Workfile: (0 spilling, 0 reused)
Hash chain length 1.0 avg, 1 max, using 1 of 2097211 buckets.
-> Seq Scan on t_xsys (cost=0.00..184.88 rows=1 width=17)
Filter: n_ajbs = 100010000006::numeric
Rows out: 1 rows with 0.117 ms to first row, 5.352 ms to end, start offset by 1.001 ms.
-> Hash (cost=200.28..200.28 rows=1 width=10)
Rows in: 1 rows with 0.168 ms to end, start offset by 18 ms.
-> Index Scan using i_xsysbgr_ajbs on t_xsysbgr (cost=0.00..200.28 rows=1 width=10)
Index Cond: n_ajbs = 100010000006::numeric
Rows out: 1 rows with 0.153 ms to first row, 0.156 ms to end, start offset by 18 ms.
Slice statistics:
(slice0) Executor memory: 244K bytes.
(slice1) Executor memory: 33054K bytes (seg3). Work_mem: 1K bytes max.
Statement statistics:
Memory used: 524288K bytes
Optimizer status: legacy query optimizer
Total runtime: 27.832 ms
(23 rows)
The plan no longer contains any data-redistribution step, and the runtime dropped from 150+ ms to around 28 ms; problem solved.
Further notes: distribution key rules
- Every user table has a distribution key (catalog tables such as pg_class live only on the master, so they have none)
- The distribution key is tied to primary key and unique constraints: it must consist of the constraint's full column set or a left-prefix subset of it (verified in practice: any full set or subset of the constraint's columns works as the distribution key; a left-prefix subset is not strictly required)
- If a table is created without a primary key or unique constraint and without an explicit distribution key, the first column becomes the distribution key by default (see the sketch after this list)
- When a primary key or unique constraint is added to a table, the distribution key automatically switches to the constraint's columns
- When a primary key or unique constraint is dropped, the existing distribution key does not change automatically
- When two tables with inconsistent distribution keys are joined, the table with the smaller data volume is distributed in full to every node and then joined with the larger table's local data on each node
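A minimal sketch illustrating the default-first-column rule and how a primary key drives the distribution key at create time (t_rule_default and t_rule_pk are hypothetical tables; assumes default server settings):
-- No PK/unique constraint and no DISTRIBUTED clause:
-- the first column (a) becomes the distribution key
create table t_rule_default (a int, b int, c int);
-- \d t_rule_default is expected to show: Distributed by: (a)
-- With a primary key and no DISTRIBUTED clause,
-- the primary key column becomes the distribution key
create table t_rule_pk (a int, b int primary key, c int);
-- \d t_rule_pk is expected to show: Distributed by: (b)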
Summary
- Both problems share the same root cause: no distribution strategy was designed when the database schema was created
- In normal business scenarios, GP's random distribution policy should not be used
- All tables in a GP database must use consistent distribution key columns
- Because inconsistent distribution keys trigger data movement during multi-table joins, the schema can, where the business allows, be designed without physical primary key or unique constraints, since such constraints can lead to data being redistributed at join time and hurt performance
- For maximum performance, with identically configured nodes, the data being queried should be distributed evenly; after adding or removing nodes, redistributing the data is recommended
- It is recommended not to place any data segments on the master server and to increase the master's bandwidth, so that data processing never interferes with the master's intended role as coordinator