介绍

SparkHint是在使用SparkSQL开发过程中,针对SQL进行优化的一点小技巧,我们可以通过Hint的方式实现BraodcastJoin优化、Reparttion分区等操作,提供了传统SQL中无法实现的一些功能。

语法介绍

SparkSQL的语法定义是通Antlr4实现的,Antlr4是一个提供语法定义、语法解析等第三方库,Antlr4语法的定义基本复合正则表达式,因此会正则表达式的同学可以尝试去理解一下下面这段代码。

# 来自于Spark源代码的SqlBase.g4文件
# 内容有部分删除,如查看源代码是请注意

# 定义特别查询的语法树节点
querySpecification
    : ((kind=SELECT (hints+=hint)* setQuantifier? namedExpressionSeq fromClause?
       | fromClause (kind=SELECT setQuantifier? namedExpressionSeq)?)
       lateralView*
       (WHERE where=booleanExpression)?
       aggregation?
       (HAVING having=booleanExpression)?
       windows?)
    ;

# Hint基本的格式,开始符号、结束符号,语句列表格式,格式为
# /*+ 多条语句 */
hint
    : '/*+' hintStatements+=hintStatement (','? hintStatements+=hintStatement)* '*/'
    ;
# 单个Hint的组成
# 名称
# 名称(参数, 参数2...)
hintStatement
    : hintName=identifier
    | hintName=identifier '(' parameters+=primaryExpression (',' parameters+=primaryExpression)* ')'
    ;

上述Hint的语法树中定义了Hint的使用方式为:

  • SparkHint只能在Select语句中使用
  • SparkHint的结果必须要SELECT关键字以后标识,且结构体为
    /*+ … */

目前SparkHint支持的语法很少,只有两种语法:

  • Broadcast: MAPJOIN/BROADCASTJOIN/BROADCAST
  • Coalesce: REPARTITION/COALESCE

下面的例子简单介绍一下SparkHint是如何使用的:

select /*+ BRAODCASTJOIN(B) */
        A.COL1
        , B.COL2
    FROM 
        A LEFT JOIN B USING(COL3);

源代码解析

SparkSQL的解析过程大致上可以分为一下几个过程:

  • G4文件定义语法
  • AstBuilder/SparkAstBuilder将语法树种的信息解析成相应的逻辑计划
  • Analyzer层将一些LogicalPlan翻译成另一个LogicalPlan语法
  • Optimizer层在Schema等级别进行优化,生成LogicalPlan
  • SparkPlan将LogicPlan翻译成影响的SparkPlan
  • 执行SparkPlan映射成底层的RDD操作

这里简单介绍一下SparkSQL的解析过程,后续我们再写一篇文章详细介绍SparkSQL解析的整体过程。

G4文件

当我们使用 sparkSession.sql(string) 执行SparkHint语句时,首先要经历语法解析,在这里会将SparkHint的语法解析成相应的hintStatement语法,并且将HintStatement中的参数进行提取。

# Hint基本的格式,开始符号、结束符号,语句列表格式
hint
    : '/*+' hintStatements+=hintStatement (','? hintStatements+=hintStatement)* '*/'
    ;

# 单个Hint的组成
# 名称
# 名称(参数, 参数2...)
hintStatement
    : hintName=identifier
    | hintName=identifier '(' parameters+=primaryExpression (',' parameters+=primaryExpression)* ')'

AstBuilder解析

SparkSQL采用的是Antlr4的Visitor模式,继承Visitor接口实现相应的代码(AstBuilder.java和 SparkAstBuilder.java),这两个类的主要作用是将一条SQL语句的每一部分翻译成相应的逻辑计划(LogicalPlan)代码。然后SparkSQL的三层将生成的LogicalPlan进行相应的解析、优化和转化成底层的RDD操作。

/**
   * Add [[UnresolvedHint]]s to a logical plan.
   */
  private def withHints(
      ctx: HintContext,
      query: LogicalPlan): LogicalPlan = withOrigin(ctx) {
    var plan = query
    // 将hintStatement列表转换成UnresolvedHint对象
    ctx.hintStatements.asScala.reverse.foreach { case stmt =>
      plan = UnresolvedHint(stmt.hintName.getText, stmt.parameters.asScala.map(expression), plan)
    }
    plan
  }

这段代码结合G4文件的语法就能得到一下几点:

  • 一个hintStatemenet转换成一个UnresolvedHint
  • 生成UnresolvedHint对象时,将定义的hintName以及params进行了存储

Analyzer

Analyzer对象可以理解成是一系列解析过程的集合,每一个解析过程会针对一种LogicalPlan进行解析,并生成后续可以被优化以及翻译的过程LogicalPlan。

Analyzer的解析过程列表,Hint的解析过程排列在第一个:

# Spark 2.4 版本代码
lazy val batches: Seq[Batch] = Seq(
    Batch("Hints", fixedPoint,
      new ResolveHints.ResolveBroadcastHints(conf),
      ResolveHints.ResolveCoalesceHints,
      ResolveHints.RemoveAllHints),
    Batch("Simple Sanity Check", Once,
      LookupFunctions),
    Batch("Substitution", fixedPoint,
      CTESubstitution,
      WindowsSubstitution,
      EliminateUnions,
      new SubstituteUnresolvedOrdinals(conf)),
    Batch("Resolution", fixedPoint,
      ResolveTableValuedFunctions ::
      ResolveRelations ::
      ResolveReferences ::
      ResolveCreateNamedStruct ::
      ResolveDeserializer ::
      ResolveNewInstance ::
      ResolveUpCast ::
      ResolveGroupingAnalytics ::
      ResolvePivot ::
      ResolveOrdinalInOrderByAndGroupBy ::
      ResolveAggAliasInGroupBy ::
      ResolveMissingReferences ::
      ExtractGenerator ::
      ResolveGenerate ::
      ResolveFunctions ::
      ResolveAliases ::
      ResolveSubquery ::
      ResolveSubqueryColumnAliases ::
      ResolveWindowOrder ::
      ResolveWindowFrame ::
      ResolveNaturalAndUsingJoin ::
      ResolveOutputRelation ::
      ExtractWindowExpressions ::
      GlobalAggregates ::
      ResolveAggregateFunctions ::
      TimeWindowing ::
      ResolveInlineTables(conf) ::
      ResolveHigherOrderFunctions(catalog) ::
      ResolveLambdaVariables(conf) ::
      ResolveTimeZone(conf) ::
      ResolveRandomSeed ::
      TypeCoercion.typeCoercionRules(conf) ++
      extendedResolutionRules : _*),
    Batch("Post-Hoc Resolution", Once, postHocResolutionRules: _*),
    Batch("Nondeterministic", Once,
      PullOutNondeterministic),
    Batch("UDF", Once,
      HandleNullInputsForUDF),
    Batch("FixNullability", Once,
      FixNullability),
    Batch("Subquery", Once,
      UpdateOuterReferences),
    Batch("Cleanup", fixedPoint,
      CleanupAliases)
  )

我们可以看到Hint的Batch里头有三个对象,一个是ResolveBroadcastHints,一个是ResolveCoalesceHints,最后一个是RemoveAllHints,我们来看一下源代码:

  • ResolveCoalesceHints
/**
   * COALESCE Hint accepts name "COALESCE" and "REPARTITION".
   * Its parameter includes a partition number.
   */
  object ResolveCoalesceHints extends Rule[LogicalPlan] {
    private val COALESCE_HINT_NAMES = Set("COALESCE", "REPARTITION")
    
    def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
      // 判断为UnresolvedHint对象并且hintName为REPARTITION/COALESCE
      case h: UnresolvedHint if COALESCE_HINT_NAMES.contains(h.name.toUpperCase(Locale.ROOT)) =>
        // 获取的HintName
        val hintName = h.name.toUpperCase(Locale.ROOT)
        // 是否shuffle
        val shuffle = hintName match {
          case "REPARTITION" => true
          case "COALESCE" => false
        }
        // 分区数量
        val numPartitions = h.parameters match {
          case Seq(IntegerLiteral(numPartitions)) =>
            numPartitions
          case Seq(numPartitions: Int) =>
            numPartitions
          case _ =>
            throw new AnalysisException(s"$hintName Hint expects a partition number as parameter")
        }
        
        // 生成Repartition LogicalPlan对象
        Repartition(numPartitions, shuffle, h.child)
    }
  }
/**
   * Removes all the hints, used to remove invalid hints provided by the user.
   * This must be executed after all the other hint rules are executed.
   */
  object RemoveAllHints extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp     {
      // 直接将Dataset链进行返回,不生成任何LogicalPlan,因此为空操作
      case h: UnresolvedHint => h.child
    }
  }
/**
   * For broadcast hint, we accept "BROADCAST", "BROADCASTJOIN", and "MAPJOIN", and a sequence of
   * relation aliases can be specified in the hint. A broadcast hint plan node will be inserted
   * on top of any relation (that is not aliased differently), subquery, or common table expression
   * that match the specified name.
   *
   * The hint resolution works by recursively traversing down the query plan to find a relation or
   * subquery that matches one of the specified broadcast aliases. The traversal does not go past
   * beyond any existing broadcast hints, subquery aliases.
   *
   * This rule must happen before common table expressions.
   */
  class ResolveBroadcastHints(conf: SQLConf) extends Rule[LogicalPlan] {
    private val BROADCAST_HINT_NAMES = Set("BROADCAST", "BROADCASTJOIN", "MAPJOIN")

    def resolver: Resolver = conf.resolver

    private def applyBroadcastHint(plan: LogicalPlan, toBroadcast: Set[String]): LogicalPlan = {
      // Whether to continue recursing down the tree
      var recurse = true

      val newNode = CurrentOrigin.withOrigin(plan.origin) {
        plan match {
          // 将子Dataset进行BroadCast操作
          case u: UnresolvedRelation if toBroadcast.exists(resolver(_, u.tableIdentifier.table)) =>
            ResolvedHint(plan, HintInfo(broadcast = true))
          case r: SubqueryAlias if toBroadcast.exists(resolver(_, r.alias)) =>
            ResolvedHint(plan, HintInfo(broadcast = true))

          case _: ResolvedHint | _: View | _: With | _: SubqueryAlias =>
            // Don't traverse down these nodes.
            // For an existing broadcast hint, there is no point going down (if we do, we either
            // won't change the structure, or will introduce another broadcast hint that is useless.
            // The rest (view, with, subquery) indicates different scopes that we shouldn't traverse
            // down. Note that technically when this rule is executed, we haven't completed view
            // resolution yet and as a result the view part should be deadcode. I'm leaving it here
            // to be more future proof in case we change the view we do view resolution.
            recurse = false
            plan

          case _ =>
            plan
        }
      }

      if ((plan fastEquals newNode) && recurse) {
        newNode.mapChildren(child => applyBroadcastHint(child, toBroadcast))
      } else {
        newNode
      }
    }

    def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperatorsUp {
      // 判断是否为BroadCastJoin操作
      case h: UnresolvedHint if BROADCAST_HINT_NAMES.contains(h.name.toUpperCase(Locale.ROOT)) =>
        // 如果参数为空,则将子Dataset进行broadcast
        if (h.parameters.isEmpty) {
          // If there is no table alias specified, turn the entire subtree into a BroadcastHint.
          ResolvedHint(h.child, HintInfo(broadcast = true))
        } else {
          // Otherwise, find within the subtree query plans that should be broadcasted.
          // 如果参数不为空时,则对表名的Dataset进行BroadCastJoin操作
          applyBroadcastHint(h.child, h.parameters.map {
            case tableName: String => tableName
            case tableId: UnresolvedAttribute => tableId.name
            case unsupported => throw new AnalysisException("Broadcast hint parameter should be " +
              s"an identifier or string but was $unsupported (${unsupported.getClass}")
          }.toSet)
        }
    }
  }

所以到这里SparkHint框架基本就结束了,因为SparkHint操作生成的是Unresolved级别的LogicalPlan对象,因此会在Analyzer层被相应的Batch捕捉并处理成别的LogicalPlan对象。

总结

SparkHint目前仅提供了BroadCastJoin和REPARTITION两种操作,其他的Hint语法都会被删除忽而略;我们可以在SQL中使用者两种语法来达到优化的目的。如果有什么写的不对的地方,请大家指点出来~