Spark SQL Implementation Internals - Logical Plan Optimization - Project Merging Rule: CollapseProject
This rule merges select operations (which correspond to Project nodes in the logical plan).
Before describing its effect, one basic concept needs explaining: deterministic and non-deterministic columns. A column in a select is deterministic when it names an existing column, or applies a deterministic computation to one; it is non-deterministic when its value cannot be determined from the input alone, for example: select rand(20).
Non-deterministic columns can sometimes be merged, but not always.
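Why this distinction matters can be shown with a small plain-Scala sketch (a toy illustration, not Spark code): collapsing two selects inlines expressions at each use site, and inlining a non-deterministic expression evaluates it more than once, which can change the result.

```scala
import scala.util.Random

val rng = new Random()
def rand(): Double = rng.nextDouble()   // a non-deterministic expression

// Before collapsing: the value is computed once and referenced twice.
val r = rand()
val beforeCollapse = (r, r)

// After a naive collapse: the expression is inlined at each use site
// and therefore evaluated twice, producing two different values.
val afterCollapse = (rand(), rand())

assert(beforeCollapse._1 == beforeCollapse._2)  // same value both times
assert(afterCollapse._1 != afterCollapse._2)    // almost surely differ
```

This is exactly why CollapseProject must check for non-deterministic expressions before merging two Projects.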
Effects of the CollapseProject optimization rule
Merging two deterministic, independent columns
Two select operations over deterministic, mutually independent columns can be merged. Suppose we run the following code:
val t = spark.range(100)
val ds = t.withColumnRenamed("id", "c1").withColumn("c2", 'c1+1)
ds.select(('c1 + 1).as('c1_plus_1), 'c2)
.select('c1_plus_1, ('c2 + 1).as('c2_plus_1)).explain(true)
With TRACE-level logging enabled, the spark-shell terminal prints the logical plan before optimization:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
!Project [c1_plus_1#101L, (c2#78L + cast(1 as bigint)) AS c2_plus_1#104L]
!+- Project [(c1#76L + cast(1 as bigint)) AS c1_plus_1#101L, c2#78L]
! +- Project [c1#76L, (c1#76L + cast(1 as bigint)) AS c2#78L]
! +- Project [id#74L AS c1#76L]
! +- Range (0, 100, step=1, splits=Some(1))
The optimized logical plan output is:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
Project [(id#74L + cast(1 as bigint)) AS c1_plus_1#101L, ((id#74L + cast(1 as bigint)) + cast(1 as bigint)) AS c2_plus_1#104L]
+- Range (0, 100, step=1, splits=Some(1))
The optimized plan shows that the two select operations have been merged into a single Project, which has also been collapsed with the Projects introduced over the data source.
Merging two deterministic, dependent columns
Two deterministic select operations, where the second depends on a column computed by the first, are also merged into one:
val t = spark.range(100)
val ds = t.withColumnRenamed("id", "c1").withColumn("c2", 'c1+1)
ds.select(('c1 + 1).as('c1_plus_1), 'c2)
.select(('c1_plus_1 + 1).as('c1_plus_2), 'c2).explain(true)
The initial logical plan is:
!Project [(c1_plus_1#111L + cast(1 as bigint)) AS c1_plus_2#114L, c2#78L]
!+- Project [(c1#76L + cast(1 as bigint)) AS c1_plus_1#111L, c2#78L]
! +- Project [c1#76L, (c1#76L + cast(1 as bigint)) AS c2#78L]
! +- Project [id#74L AS c1#76L]
! +- Range (0, 100, step=1, splits=Some(1))
The optimized logical plan is:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
Project [((id#74L + cast(1 as bigint)) + cast(1 as bigint)) AS c1_plus_2#114L, (id#74L + cast(1 as bigint)) AS c2#78L]
+- Range (0, 100, step=1, splits=Some(1))
The optimized plan shows that the computations of the two dependent select columns have been merged: the alias c1_plus_1 is inlined, so c1_plus_1 + 1 becomes (id + 1) + 1.
Merging non-deterministic columns
Two select operations with non-deterministic columns can still be merged, as long as the upper select does not reference any non-deterministic column produced by the lower one. In the example below, the second select does not use rand1 at all, so only the second select's expression survives:
val q = ds.select(rand(30).as('rand1)).select(rand(20).as('rand2))
Its logical plans are printed as follows:
== Analyzed Logical Plan ==
rand2: double
Project [rand(20) AS rand2#162]
+- Project [rand(30) AS rand1#160]
   +- Project [c1#76L, (c1#76L + cast(1 as bigint)) AS c2#78L]
      +- Project [id#74L AS c1#76L]
         +- Range (0, 100, step=1, splits=Some(1))
== Optimized Logical Plan ==
Project [rand(20) AS rand2#162]
+- Range (0, 100, step=1, splits=Some(1))
The output shows that, because the upper Project never references rand1, the two Projects are merged and only rand(20) remains in the optimized plan.
Non-deterministic columns that are not merged
When the upper select references a non-deterministic column produced by the lower one, the two operations cannot be merged: inlining rand(10) at both use sites would evaluate it twice, giving rand1 and rand2 unrelated values.
val q = ds.select(rand(10).as('rand))
.select(('rand + 1).as('rand1), ('rand+2).as('rand2))
The logical plans output are as follows:
== Analyzed Logical Plan ==
rand1: double, rand2: double
Project [(rand#148 + cast(1 as double)) AS rand1#150, (rand#148 + cast(2 as double)) AS rand2#151]
+- Project [rand(10) AS rand#148]
   +- Project [c1#76L, (c1#76L + cast(1 as bigint)) AS c2#78L]
      +- Project [id#74L AS c1#76L]
         +- Range (0, 100, step=1, splits=Some(1))
== Optimized Logical Plan ==
Project [(rand#148 + 1.0) AS rand1#150, (rand#148 + 2.0) AS rand2#151]
+- Project [rand(10) AS rand#148]
   +- Range (0, 100, step=1, splits=Some(1))
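The merge decision in the last two examples can be summarized with a small sketch (hypothetical names, a simplification of Spark's internal check): a lower-project alias blocks the merge only when it is non-deterministic and actually referenced by the upper project list.

```scala
// Toy model of the collapse check (simplified; not Spark's actual API).
case class AliasInfo(name: String, deterministic: Boolean)

// Collapsing is allowed unless some non-deterministic alias from the
// lower project list is referenced by the upper project list.
def canCollapse(upperReferences: Set[String], lowerAliases: Seq[AliasInfo]): Boolean =
  !lowerAliases.exists(a => !a.deterministic && upperReferences.contains(a.name))

// Example 3: the upper select references nothing from the lower one -> merge.
val mergeable = canCollapse(Set.empty, Seq(AliasInfo("rand1", deterministic = false)))

// Example 4: the upper select references the non-deterministic "rand" -> no merge.
val blocked = canCollapse(Set("rand"), Seq(AliasInfo("rand", deterministic = false)))

assert(mergeable)
assert(!blocked)
```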
Merging columns over an aggregate operation
In the statement below, the select's column computations depend on the result columns of the aggregation, so they can be merged into the aggregate:
val q = ds.groupBy('c1, 'c2).agg(avg('c1).as('avg_c1), max('c2).as('max_c2))
.select('avg_c1, ('max_c2+1).as('max_c2_plus_1))
The initial logical plan:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
!Project [avg_c1#207, (max_c2#209L + cast(1 as bigint)) AS max_c2_plus_1#214L]
!+- Aggregate [c1#76L, c2#78L], [avg(c1#76L) AS avg_c1#207, max(c2#78L) AS max_c2#209L]
! +- Project [c1#76L, (c1#76L + cast(1 as bigint)) AS c2#78L]
! +- Project [id#74L AS c1#76L]
! +- Range (0, 100, step=1, splits=Some(1))
The optimized logical plan:
Aggregate [c1#76L, c2#78L], [avg(c1#76L) AS avg_c1#207, (max(c2#78L) + cast(1 as bigint)) AS max_c2_plus_1#214L]
+- Project [id#74L AS c1#76L, (id#74L + cast(1 as bigint)) AS c2#78L]
   +- Range (0, 100, step=1, splits=Some(1))
As shown, the select's column computations have been merged into the aggregate operation (the groupBy/agg here): max_c2 + 1 is rewritten as max(c2#78L) + 1 inside the aggregate expressions. However, an Aggregate and a select over non-deterministic columns are not merged.
Implementation of the CollapseProject rule
The implementation code is as follows:
object CollapseProject extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // Case 1: a Project directly over another Project
    case p1 @ Project(_, p2: Project) =>
      if (haveCommonNonDeterministicOutput(p1.projectList, p2.projectList)) {
        p1
      } else {
        p2.copy(projectList = buildCleanedProjectList(p1.projectList, p2.projectList))
      }
    // Case 2: a Project directly over an Aggregate
    case p @ Project(_, agg: Aggregate) =>
      if (haveCommonNonDeterministicOutput(p.projectList, agg.aggregateExpressions)) {
        p
      } else {
        agg.copy(aggregateExpressions = buildCleanedProjectList(
          p.projectList, agg.aggregateExpressions))
      }
  }
  ...
}
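The core of the merge is the substitution performed by buildCleanedProjectList: attribute references in the upper project list are replaced by the expressions they are aliased to in the lower list. A minimal self-contained sketch of that substitution, using toy expression classes (not Spark's):

```scala
// Toy expression tree (illustration only; Spark uses Catalyst's Expression classes).
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(v: Long) extends Expr
case class Add(l: Expr, r: Expr) extends Expr
case class NamedExpr(name: String, expr: Expr)

// Replace references to lower-project aliases with the aliased expressions.
def substitute(e: Expr, lower: Map[String, Expr]): Expr = e match {
  case Attr(n)   => lower.getOrElse(n, Attr(n))
  case Add(l, r) => Add(substitute(l, lower), substitute(r, lower))
  case lit: Lit  => lit
}

def collapse(upper: Seq[NamedExpr], lower: Seq[NamedExpr]): Seq[NamedExpr] = {
  val aliases = lower.map(ne => ne.name -> ne.expr).toMap
  upper.map(ne => NamedExpr(ne.name, substitute(ne.expr, aliases)))
}

// Mirrors the first example: lower = [(c1 + 1) AS c1_plus_1, c2],
// upper = [c1_plus_1, (c2 + 1) AS c2_plus_1].
val lower = Seq(
  NamedExpr("c1_plus_1", Add(Attr("c1"), Lit(1))),
  NamedExpr("c2", Attr("c2")))
val upper = Seq(
  NamedExpr("c1_plus_1", Attr("c1_plus_1")),
  NamedExpr("c2_plus_1", Add(Attr("c2"), Lit(1))))

val merged = collapse(upper, lower)
// merged: [(c1 + 1) AS c1_plus_1, (c2 + 1) AS c2_plus_1]
```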
Summary
This article described the effect and implementation of the CollapseProject logical-plan optimization rule. The rule optimizes the column selections and computations of consecutive select operations, or of an aggregate followed by a select, merging them where safe. This simplifies the logical plan and helps generate more optimized code when the physical plan is produced.