Spark SQL Internals - Logical Plan Optimization - The Project Collapsing Rule: CollapseProject

This rule merges adjacent select operations (which correspond to Project logical plans) into one.

Before describing its effect, one basic concept needs explaining: deterministic versus non-deterministic columns. A column in a select is deterministic when it is a plain column reference or a deterministic computation over columns, so the same input always yields the same value. A column is non-deterministic when the selected expression may return a different value each time it is evaluated, for example select rand(20).

Projects that contain non-deterministic columns can sometimes be collapsed, but not always, as the examples below show.
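Catalyst tags every expression with a deterministic flag, which this rule consults. As a quick illustration (a spark-shell sketch, not part of the rule itself), the flag can be inspected directly through Column.expr:

import org.apache.spark.sql.functions.rand

val df = spark.range(10)
df("id").expr.deterministic          // true:  plain column reference
(df("id") + 1).expr.deterministic    // true:  deterministic arithmetic
rand(20).expr.deterministic          // false: a new random value per row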

Effects of the CollapseProject rule

Collapsing two deterministic, independent projections

Two deterministic select operations whose columns are independent of each other can be merged. Suppose we run the following code:

val t = spark.range(100)
val ds = t.withColumnRenamed("id", "c1").withColumn("c2", 'c1 + 1)

ds.select(('c1 + 1).as('c1_plus_1), 'c2)
  .select('c1_plus_1, ('c2 + 1).as('c2_plus_1)).explain(true)
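
To watch the rule fire, raise the log level first. A blunt but workable way in spark-shell is shown below; it raises the level globally, so expect very verbose output (the exact logger configuration depends on your Spark and log4j setup):

// Enables TRACE for all loggers, including the optimizer's RuleExecutor,
// which prints an "=== Applying Rule ... ===" block for each rule that
// changes the plan.
sc.setLogLevel("TRACE")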

With TRACE logging enabled, the spark-shell terminal shows the following logical plan before the rule is applied:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
!Project [c1_plus_1#101L, (c2#78L + cast(1 as bigint)) AS c2_plus_1#104L]   
!+- Project [(c1#76L + cast(1 as bigint)) AS c1_plus_1#101L, c2#78L]        
!   +- Project [c1#76L, (c1#76L + cast(1 as bigint)) AS c2#78L]             
!      +- Project [id#74L AS c1#76L]                                        
!         +- Range (0, 100, step=1, splits=Some(1))

After optimization, the logical plan output is:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
Project [(id#74L + cast(1 as bigint)) AS c1_plus_1#101L, ((id#74L + cast(1 as bigint)) + cast(1 as bigint)) AS c2_plus_1#104L]
+- Range (0, 100, step=1, splits=Some(1))

As the optimized plan shows, the two select operations were collapsed into a single Project, which was then also merged with the Projects above the data source, leaving one Project directly over Range.
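This can also be confirmed programmatically (a spark-shell sketch using the Dataset's queryExecution) by counting the Project nodes left in the optimized plan:

import org.apache.spark.sql.catalyst.plans.logical.Project

val q = ds.select(('c1 + 1).as('c1_plus_1), 'c2)
  .select('c1_plus_1, ('c2 + 1).as('c2_plus_1))
// Exactly one Project should remain after optimization.
q.queryExecution.optimizedPlan.collect { case p: Project => p }.size  // 1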

Collapsing two deterministic, dependent projections

Two deterministic projections, where the second depends on the output of the first, can likewise be combined into one:

val t = spark.range(100)
val ds = t.withColumnRenamed("id", "c1").withColumn("c2", 'c1 + 1)

ds.select(('c1 + 1).as('c1_plus_1), 'c2)
  .select(('c1_plus_1 + 1).as('c1_plus_2), 'c2).explain(true)

The initial logical plan is:

!Project [(c1_plus_1#111L + cast(1 as bigint)) AS c1_plus_2#114L, c2#78L]   
!+- Project [(c1#76L + cast(1 as bigint)) AS c1_plus_1#111L, c2#78L]       
!   +- Project [c1#76L, (c1#76L + cast(1 as bigint)) AS c2#78L]             
!      +- Project [id#74L AS c1#76L]                                        
!         +- Range (0, 100, step=1, splits=Some(1))

The optimized logical plan is:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
Project [((id#74L + cast(1 as bigint)) + cast(1 as bigint)) AS c1_plus_2#114L, (id#74L + cast(1 as bigint)) AS c2#78L]
+- Range (0, 100, step=1, splits=Some(1))

The optimized plan shows that the expressions of the two dependent selects were fused: the reference to c1_plus_1 inside c1_plus_2 was replaced by its defining expression, (id + 1).
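The same collapse applies when the query is written as nested SQL subqueries. The snippet below is an equivalent formulation (the temp view name t is introduced here only for illustration):

t.createOrReplaceTempView("t")
spark.sql("""
  SELECT c1_plus_1 + 1 AS c1_plus_2, c2
  FROM (SELECT c1 + 1 AS c1_plus_1, c2
        FROM (SELECT id AS c1, id + 1 AS c2 FROM t))
""").explain(true)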

Collapsing non-deterministic columns

Even when a select produces a non-deterministic column, two consecutive selects can still be collapsed as long as the outer one does not reference that column. In the example below the outer select wins, and the inner rand1 column is simply dropped:

import org.apache.spark.sql.functions.rand

val q = ds.select(rand(30).as('rand1)).select(rand(20).as('rand2))
q.explain(true)

The logical plans print as follows:

== Analyzed Logical Plan ==
rand2: double
Project [rand(20) AS rand2#162]
+- Project [rand(30) AS rand1#160]
   +- Project [c1#76L, (c1#76L + cast(1 as bigint)) AS c2#78L]
      +- Project [id#74L AS c1#76L]
         +- Range (0, 100, step=1, splits=Some(1))

== Optimized Logical Plan ==
Project [rand(20) AS rand2#162]
+- Range (0, 100, step=1, splits=Some(1))

As the output shows, the inner Project's rand1 column is never referenced by the outer select, so the two Projects were collapsed and only rand(20) AS rand2 survives.

Not collapsing non-deterministic columns

When the outer select references a non-deterministic column produced by the inner select, the two Projects cannot be merged: inlining rand(10) into both rand1 and rand2 would evaluate it twice per row and change the query's semantics.

val q = ds.select(rand(10).as('rand))
  .select(('rand + 1).as('rand1), ('rand + 2).as('rand2))
q.explain(true)

The resulting logical plans are as follows:

== Analyzed Logical Plan ==
rand1: double, rand2: double
Project [(rand#148 + cast(1 as double)) AS rand1#150, (rand#148 + cast(2 as double)) AS rand2#151]
+- Project [rand(10) AS rand#148]
   +- Project [c1#76L, (c1#76L + cast(1 as bigint)) AS c2#78L]
      +- Project [id#74L AS c1#76L]
         +- Range (0, 100, step=1, splits=Some(1))

== Optimized Logical Plan ==
Project [(rand#148 + 1.0) AS rand1#150, (rand#148 + 2.0) AS rand2#151]
+- Project [rand(10) AS rand#148]
   +- Range (0, 100, step=1, splits=Some(1))
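
Keeping the Project above rand(10) matters for correctness: rand1 and rand2 are derived from the same draw, so rand2 - rand1 is always exactly 1.0. Had the optimizer inlined rand(10) into both expressions, each would get an independent draw and the invariant would break. A small check:

// Every row should show the same constant difference of 1.0.
q.selectExpr("rand2 - rand1 AS diff").distinct().show()
// +----+
// |diff|
// +----+
// | 1.0|
// +----+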

Collapsing columns after an aggregation

In the statement below, the select's expressions depend on the result columns of the aggregation; in this case they can be merged into the Aggregate's output expressions.

import org.apache.spark.sql.functions.{avg, max}

val q = ds.groupBy('c1, 'c2).agg(avg('c1).as('avg_c1), max('c2).as('max_c2))
  .select('avg_c1, ('max_c2 + 1).as('max_c2_plus_1))
q.explain(true)

The initial logical plan:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
!Project [avg_c1#207, (max_c2#209L + cast(1 as bigint)) AS max_c2_plus_1#214L]            
!+- Aggregate [c1#76L, c2#78L], [avg(c1#76L) AS avg_c1#207, max(c2#78L) AS max_c2#209L]   
!   +- Project [c1#76L, (c1#76L + cast(1 as bigint)) AS c2#78L]                              
!      +- Project [id#74L AS c1#76L]                                                      
!         +- Range (0, 100, step=1, splits=Some(1))

The optimized logical plan:

Aggregate [c1#76L, c2#78L], [avg(c1#76L) AS avg_c1#207, (max(c2#78L) + cast(1 as bigint)) AS max_c2_plus_1#214L]
+- Project [id#74L AS c1#76L, (id#74L + cast(1 as bigint)) AS c2#78L]
   +- Range (0, 100, step=1, splits=Some(1))

As shown, the select's column expressions were merged into the aggregation (the groupBy). Note, however, that per the implementation below, the merge is skipped when the select references a non-deterministic output column of the Aggregate.
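One way to verify the resulting plan shape (a spark-shell sketch) is to pattern-match on the root of the optimized plan; after the rule runs, the root is the Aggregate itself with no residual Project above it:

import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Project}

q.queryExecution.optimizedPlan match {
  case Project(_, _: Aggregate) => println("not collapsed")
  case _: Aggregate             => println("collapsed into Aggregate")
  case other                    => println(s"unexpected root: ${other.nodeName}")
}
// prints: collapsed into Aggregate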

Implementation of the CollapseProject rule

The implementation looks as follows (helper methods elided):

object CollapseProject extends Rule[LogicalPlan] {

  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // Case 1: a Project directly over another Project
    case p1 @ Project(_, p2: Project) =>
      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) {
        // The upper project list references a non-deterministic column of the
        // lower one; inlining could change semantics, so keep the plan as-is.
        p1
      } else {
        // Substitute the lower project's aliases into the upper project list
        // and keep a single Project node.
        p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList))
      }
    // Case 2: a Project directly over an Aggregate
    case p @ Project(_, agg: Aggregate) =>
      if (haveCommonNonDeterministicOutput(p.projectList, agg.aggregateExpressions)) {
        p
      } else {
        // Merge the project list into the aggregate's output expressions.
        agg.copy(aggregateExpressions = buildCleanedProjectList(
          p.projectList, agg.aggregateExpressions))
      }
  }
  ...
}
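The elided helpers do the actual work: buildCleanedProjectList substitutes each attribute that refers to a lower alias with that alias's child expression and cleans up redundant nested aliases. The determinism check looks roughly like the following (paraphrased from the Spark 2.x source; details vary across versions):

// Map each Alias in the lower project list from its output attribute to the
// alias itself, e.g. SELECT a + b AS c ...  gives  c -> Alias(a + b, "c").
private def collectAliases(projectList: Seq[NamedExpression]): AttributeMap[Alias] = {
  AttributeMap(projectList.collect {
    case a: Alias => a.toAttribute -> a
  })
}

// True when the upper project list references an attribute whose defining
// expression in the lower list is non-deterministic: exactly the case in
// which inlining would change how often the expression is evaluated.
private def haveCommonNonDeterministicOutput(
    upper: Seq[NamedExpression], lower: Seq[NamedExpression]): Boolean = {
  val aliases = collectAliases(lower)
  upper.exists(_.collect {
    case a: Attribute if aliases.contains(a) => aliases(a).child
  }.exists(!_.deterministic))
}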

Summary

This article described the effect and implementation of the CollapseProject logical plan optimization rule. The rule merges the column selections and expressions of consecutive select operations, or of an aggregation followed by a select, into a single node. Collapsing Projects in this way simplifies the logical plan and lets physical planning generate better code.