Greenplum的分布键设计

问题背景

最高执行指挥项目组在使用GP的过程中遇到两类问题

  • 数据插入缓慢
  • 多表join查询效率低下

问题1:数据插入缓慢

现象是3000条数据插入要7分钟左右。

排查过程

  • 查看数据分布情况
fb09_1=# select gp_segment_id, count(*) from db_xsys.t_xsys group by 1 order by 1;
gp_segment_id | count 
---------------+-------
           0 | 12499
           1 | 12495
           2 | 12497
           3 | 12501
           4 | 12509
           5 | 12504
           6 | 12496
           7 | 12497
           8 | 12493
           9 | 12500
          10 | 12506
          11 | 12503
(12 rows)

数据分布很均匀,并且数据量不大,没有发现数据量有可能带来的性能影响疑点

  • 查看单条插入语句的执行计划
fb09_1=# explain analyze insert into db_xsys.t_xsys (c_id, n_ajbs) values ('id123', 123);
                       QUERY PLAN  
\----------------------------------------------------------------------------------------------------------------------
 Insert (slice0; segments: 12)  (rows=1 width=0)
   ->  Redistribute Motion 12:12  (slice1; segments: 12)  (cost=0.00..0.01 rows=1 width=0)
         Rows out:  1 rows at destination (seg7) with 0.478 ms to first row, 0.488 ms to end, start offset by 126 ms.
         ->  Result  (cost=0.00..0.01 rows=1 width=0)
               Rows out:  1 rows (seg1) with 0.010 ms to first row, 0.011 ms to end, start offset by 126 ms.
 Slice statistics:
   (slice0)    Executor memory: 280K bytes avg x 12 workers, 319K bytes max (seg7).
   (slice1)    Executor memory: 310K bytes avg x 12 workers, 310K bytes max (seg0).
 Statement statistics:
   Memory used: 524288K bytes
 Optimizer status: legacy query optimizer
 Total runtime: 128.952 ms
(12 rows)

从执行计划看到关键字segments: 12意识到该插入语句计算了所有数据节点(测试环境数据节点总数为12个),而实际应该只用到一个数据节点即可,所以推测是因为该表分布策略选择错误导致的该问题。

  • 查看当前分布策略
fb09_1=# \d db_xsys.t_xsys
                  Table "db_xsys.t_xsys"
     Column      |            Type             | Modifiers 
-----------------+-----------------------------+-----------
 c_id            | character(32)               | not null
 n_ajbs          | numeric(17,0)               | not null
......
Indexes:
    "i_xsys_ajbs" btree (n_ajbs)
     Distributed randomly
  • Distributed randomly表明当前分布策略为随机分布。
    随机分布:当分布策略为随机时,GP会首先通知所有节点将要插入的数据,再结合当前所有节点整体的数据分布情况,决定新行应该分布到哪个节点

解决方案

综上所述随机分布与业务要求不符,应修改分布策略为hash分布

  • 修改表分布策略
fb09_1=# alter table db_xsys.t_xsys set distributed by (n_ajbs);
ALTER TABLE
  • 重新查看执行计划
fb09_1=# explain analyze insert into db_xsys.t_xsys (c_id, n_ajbs) values ('id111', 111);
                      QUERY PLAN                                 
\---------------------------------------------------------------------------
 Insert (slice0; segments: 1)  (rows=1 width=0)
   ->  Result  (cost=0.00..0.01 rows=1 width=0)
         Rows out:  1 rows with 0.055 ms to end, start offset by 2.056 ms.
 Slice statistics:
   (slice0)    Executor memory: 101K bytes (seg8).
 Statement statistics:
   Memory used: 524288K bytes
 Optimizer status: legacy query optimizer
 Total runtime: 2.427 ms
(9 rows)

单条插入由原来的120ms+变成了2ms+,问题解决

问题2:多表join查询效率低下

排查过程

  • 查看执行计划
fb09_1=# explain analyze select t_xsys.n_ajbs from db_xsys.t_xsys left join db_xsys.t_xsysbgr on t_xsys.n_ajbs = t_xsysbgr.n_ajbs where t_xsys.n_ajbs = 100010000006;
                    QUERY PLAN                                                            
\----------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 12:1  (slice3; segments: 12)  (cost=200.31..389.50 rows=4 width=17)
   Rows out:  1 rows at destination with 49 ms to end, start offset by 107 ms.
   ->  Hash Left Join  (cost=200.31..389.50 rows=1 width=17)
         Hash Cond: t_xsys.n_ajbs = t_xsysbgr.n_ajbs
         Rows out:  1 rows (seg3) with 43 ms to first row, 46 ms to end, start offset by 110 ms.
         Executor memory:  1K bytes.
         Work_mem used:  1K bytes. Workfile: (0 spilling, 0 reused)
         (seg3)   Hash chain length 1.0 avg, 1 max, using 1 of 2097211 buckets.
         ->  Redistribute Motion 12:12  (slice1; segments: 12)  (cost=0.00..189.15 rows=1 width=17)
               Hash Key: 100010000006::numeric
               Rows out:  1 rows at destination (seg3) with 0.004 ms to first row, 0.009 ms to end, start offset by 153 ms.
               ->  Seq Scan on t_xsys  (cost=0.00..189.12 rows=1 width=17)
                     Filter: n_ajbs = 100010000006::numeric
                     Rows out:  1 rows (seg5) with 0.123 ms to first row, 5.987 ms to end, start offset by 111 ms.
         ->  Hash  (cost=200.30..200.30 rows=1 width=10)
               Rows in:  1 rows (seg3) with 0.061 ms to end, start offset by 153 ms.
               ->  Redistribute Motion 12:12  (slice2; segments: 12)  (cost=0.00..200.30 rows=1 width=10)
                     Hash Key: t_xsysbgr.n_ajbs
                     Rows out:  1 rows at destination (seg3) with 0.031 ms to first row, 0.039 ms to end, start offset by 153 ms.
                     ->  Index Scan using i_xsysbgr_ajbs on t_xsysbgr  (cost=0.00..200.28 rows=1 width=10)
                           Index Cond: n_ajbs = 100010000006::numeric
                           Rows out:  1 rows (seg3) with 0.211 ms to first row, 0.214 ms to end, start offset by 112 ms.
 Slice statistics:
   (slice0)    Executor memory: 249K bytes.
   (slice1)    Executor memory: 339K bytes avg x 12 workers, 339K bytes max (seg0).
   (slice2)    Executor memory: 344K bytes avg x 12 workers, 351K bytes max (seg3).
   (slice3)    Executor memory: 33025K bytes avg x 12 workers, 33055K bytes max (seg3).  Work_mem: 1K bytes max.
 Statement statistics:
   Memory used: 524288K bytes
 Optimizer status: legacy query optimizer
 Total runtime: 156.683 ms
(31 rows)

从执行计划看到关键字Redistribute Motion,意识到该查询语句相关的两个表分布键列不一致,若分布键列一致,不会有数据重新分布的步骤

  • 查看当前分布键策略
fb09_1=# \d db_xsys.t_xsys
                  Table "db_xsys.t_xsys"
     Column      |            Type             | Modifiers 
-----------------+-----------------------------+-----------
 c_id            | character(32)               | not null
 n_ajbs          | numeric(17,0)               | not null
...
Distributed by: (c_id)

fb09_1=# \d db_xsys.t_xsysbgr
               Table "db_xsys.t_xsysbgr"
   Column    |            Type             | Modifiers 
-------------+-----------------------------+-----------
 c_id        | character(32)               | not null
 c_id_xsys   | character(32)               | not null
...
Distributed by: (c_id)
  • Redistribute Motion表明多表关联查询时,相同的数据并没有存放在同一个数据节点内,需要重新分布数据。
    hash分布:Greenplum默认使用hash分布策略。该策略可选一个或者多个列作为分布键(distribution key,简称DK)。分布键做hash算法来确认数据存放到对应的segment上。相同分布键值会hash到相同的segment上

解决方案

综上所述因为两个表的分布键列不同导致该问题,应修改两个表的分布键列

  • 修改相关表的分布键并重新分布数据
fb09_1=# alter table db_xsys.t_xsys set with (reorganize=true) distributed by (n_ajbs);
ALTER TABLE
fb09_1=# alter table db_xsys.t_xsysbgr set with (reorganize=true) distributed by (n_ajbs);
ALTER TABLE
  • 重新查看执行计划
fb09_1=# explain analyze select t_xsys.n_ajbs from db_xsys.t_xsys left join db_xsys.t_xsysbgr on t_xsys.n_ajbs = t_xsysbgr.n_ajbs where t_xsys.n_ajbs = 100010000006;
                    QUERY PLAN                                                 
\-----------------------------------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=200.29..385.21 rows=4 width=17)
   Rows out:  1 rows at destination with 27 ms to end, start offset by 0.707 ms.
   ->  Hash Left Join  (cost=200.29..385.21 rows=1 width=17)
         Hash Cond: t_xsys.n_ajbs = t_xsysbgr.n_ajbs
         Rows out:  1 rows with 17 ms to first row, 26 ms to end, start offset by 1.000 ms.
         Executor memory:  1K bytes.
         Work_mem used:  1K bytes. Workfile: (0 spilling, 0 reused)
         Hash chain length 1.0 avg, 1 max, using 1 of 2097211 buckets.
         ->  Seq Scan on t_xsys  (cost=0.00..184.88 rows=1 width=17)
               Filter: n_ajbs = 100010000006::numeric
               Rows out:  1 rows with 0.117 ms to first row, 5.352 ms to end, start offset by 1.001 ms.
         ->  Hash  (cost=200.28..200.28 rows=1 width=10)
               Rows in:  1 rows with 0.168 ms to end, start offset by 18 ms.
               ->  Index Scan using i_xsysbgr_ajbs on t_xsysbgr  (cost=0.00..200.28 rows=1 width=10)
                     Index Cond: n_ajbs = 100010000006::numeric
                     Rows out:  1 rows with 0.153 ms to first row, 0.156 ms to end, start offset by 18 ms.
 Slice statistics:
   (slice0)    Executor memory: 244K bytes.
   (slice1)    Executor memory: 33054K bytes (seg3).  Work_mem: 1K bytes max.
 Statement statistics:
   Memory used: 524288K bytes
 Optimizer status: legacy query optimizer
 Total runtime: 27.832 ms
(23 rows)

执行计划中未发现重新分布数据的关键字,并且执行时间由原来的150ms+变成了27ms+,问题解决

扩展知识—分布键规则

  • 每个用户表都有分布键(类似pg_class等元数据都保存在master中,所以没有分布键)
  • 分布键必须包含主键、唯一约束。且必须完全包含对应列,或对应约束的左侧子集(经过实际验证,分布键为对应约束的完整集或子集即可,不强制必须为左侧子集)
  • 若建表时未指定主键、唯一约束,并且未指定分布键,默认第一列为分布键
  • 为表添加主键、唯一约束时,分布键自动切换为指定列
  • 删除主键、唯一约束时,已存在的分布键不会自动跟随变动
  • 在两表关联查询时,若分布键规则不一致,则会分发两表中间数据量较小的全表数据到所有节点,再与本节点大表做join

总结

  • 这两个问题都是因为同一个根本原因导致的:数据库设计时未做分布策略设计
  • 在正常业务场景中,不允许使用GP的随机分布键策略
  • 各GP数据库中所有表的分布键列必须一致
  • 因分布键不一致会导致多表join时触发数据分发操作,所以结合实际业务,可以设计为数据库不包含物理主键、唯一键约束。因为这有可能导致join时分发数据而带来的性能下降
  • 为保证性能的最大化,推荐各节点配置相同的情况下被检索的数据均匀分布。新增或删除节点后推荐重新分布数据
  • 推荐master服务器中不添加任何数据节点,并加大master的带宽,以保证不会因为数据处理而影响master原本作为协调者身份工作的初衷