1. Overview of the DAG Scheduler

DAG stands for Directed Acyclic Graph. Spark records the dependencies between RDDs, and these dependencies are directed: they always point from a child RDD to its parent RDD (the arrows we usually draw show data flow rather than dependency direction, so the two point in opposite ways). The directed nature of RDD dependencies gives RDD computation a clearly staged character: the resulting computation chain can be split into multiple stages, where a later stage can only run once the stages before it have completed. Because the data inside an RDD is immutable, the directed graph formed by the dependencies between stages naturally contains no cycles.

The goal of DAG scheduling is to split a job into stages: build a DAG from the dependencies, then, inside each stage, break the stage into tasks that can run in parallel, and finally hand all the tasks of a stage over to the task scheduler for execution.
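The staging idea can be sketched with a toy model. All names below (`Dep`, `MiniRDD`, `countStages`) are hypothetical, for illustration only, not Spark's real classes: walking back from the final RDD, every shuffle (wide) dependency opens a new stage, while narrow dependencies stay inside the current stage.

```scala
// Toy model of an RDD lineage: each dependency edge is either Narrow or
// Shuffle (wide). Hypothetical types, not Spark's internals.
sealed trait Dep
case object Narrow extends Dep
case object Shuffle extends Dep

case class MiniRDD(id: Int, deps: List[(MiniRDD, Dep)])

// Count stages in a simple (non-shared) lineage: every shuffle edge
// adds one stage beyond the final stage itself.
def countStages(finalRdd: MiniRDD): Int = {
  def shuffles(r: MiniRDD): Int =
    r.deps.map { case (parent, d) =>
      shuffles(parent) + (if (d == Shuffle) 1 else 0)
    }.sum
  shuffles(finalRdd) + 1
}

// textFile -> map (narrow) -> reduceByKey (shuffle) -> final RDD
val source  = MiniRDD(0, Nil)
val mapped  = MiniRDD(1, List((source, Narrow)))
val reduced = MiniRDD(2, List((mapped, Shuffle)))
assert(countStages(reduced) == 2) // one shuffle boundary => two stages
```

In the real scheduler, stages can be shared between jobs and the traversal deduplicates already-computed stages; this sketch ignores both.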

2. Communication Mechanism of DAG Scheduling

The event loop eventProcessLoop (which replaced the earlier eventProcessActor) is used to post and handle the various scheduling events, such as job submission and monitoring the completion of jobs and tasks. The corresponding event types and their handling logic live in the doOnReceive method of the DAGSchedulerEventProcessLoop class:

private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  // some other code...
}
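The post/receive pattern behind DAGSchedulerEventProcessLoop can be sketched as a queue drained by a dedicated thread. `MiniEventLoop` below is a toy re-implementation written for this article, not the Spark class:

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable

// Minimal event loop: callers post events, a single background thread
// takes them off the queue and runs the handler sequentially.
class MiniEventLoop[E](handler: E => Unit) {
  private val queue = new LinkedBlockingQueue[Option[E]]()
  private val thread = new Thread(new Runnable {
    def run(): Unit = {
      var running = true
      while (running) queue.take() match {
        case Some(e) => handler(e)
        case None    => running = false // poison pill: stop the loop
      }
    }
  })
  def start(): Unit = thread.start()
  def post(event: E): Unit = queue.put(Some(event))
  def stop(): Unit = { queue.put(None); thread.join() }
}

val handled = mutable.Buffer[String]()
val loop = new MiniEventLoop[String](handled += _)
loop.start()
loop.post("JobSubmitted")
loop.post("CompletionEvent")
loop.stop() // joins the thread, so every posted event has been handled
assert(handled == Seq("JobSubmitted", "CompletionEvent"))
```

Handling all events on one thread means doOnReceive never needs locks: events are processed strictly in posting order.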

3. Processing Flow of a count Job

Below we trace the execution of count, a commonly used Spark action, to walk through the whole DAG scheduling process. The count method is implemented in the RDD abstract class as follows:

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

count calls the runJob method of SparkContext. runJob returns an array with one element per partition, each holding the number of records in that partition, so summing the array with sum yields the total number of records in the RDD.
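The division of labor can be illustrated locally. `getIteratorSize` below is a re-implementation of the iterator-counting idea for illustration; the real Utils.getIteratorSize lives in org.apache.spark.util:

```scala
// Count the elements of one partition's iterator, mirroring what
// Utils.getIteratorSize does (re-implemented here for illustration).
def getIteratorSize[T](iter: Iterator[T]): Long = {
  var count = 0L
  while (iter.hasNext) { iter.next(); count += 1 }
  count
}

// Pretend these are the iterators of a 3-partition RDD.
val partitions = Seq(Iterator(1, 2, 3), Iterator(4, 5), Iterator.empty)
val perPartition = partitions.map(getIteratorSize) // one Long per partition
assert(perPartition == Seq(3L, 2L, 0L))
assert(perPartition.sum == 5L) // what count() returns
```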

Below is the runJob method of the SparkContext class:

def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint()
  }

SparkContext.runJob first obtains the call site of the invocation, used later for logging and debugging, then cleans the closure of func so that the function can be serialized, and finally calls DAGScheduler.runJob to hand the job over to the DAG scheduler. (A closure can be understood as a function together with the variables it captures from its enclosing scope.)
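A quick illustration of a closure in plain Scala (unrelated to Spark internals): the function below captures `factor` from its enclosing scope. This captured state is exactly what clean(func) must make serializable before the function can be shipped to executors.

```scala
// `factor` lives outside the function body; the function closes over it,
// so serializing the function also means dealing with `factor`.
val factor = 3
val multiply: Int => Int = x => x * factor

assert(multiply(5) == 15) // uses the captured `factor`
```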

Execution now moves from the SparkContext class to the DAGScheduler class:

def runJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): Unit = {
    val start = System.nanoTime
    val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
    waiter.awaitResult() match {
      case JobSucceeded =>
        logInfo("Job %d finished: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      case JobFailed(exception: Exception) =>
        logInfo("Job %d failed: %s, took %f s".format
          (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
        // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
        val callerStackTrace = Thread.currentThread().getStackTrace.tail
        exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
        throw exception
    }
  }

runJob in turn calls submitJob to submit the job. submitJob returns an instance of the JobWaiter class, waiter, which serves two main purposes:

1. Cancelling the execution of a job by sending a JobCancelled message through eventProcessLoop.

2. Blocking the thread that runs DAGScheduler.runJob until the submitted job has finished.

Internally, JobWaiter listens for the completion event of every task and counts the finished tasks. After each task completes, it invokes the callback resultHandler to consume that task's result. Once the job succeeds or fails, the blocking ends and the job result is returned to runJob.

Below are several methods of the JobWaiter class:

override def taskSucceeded(index: Int, result: Any): Unit = synchronized {
    if (_jobFinished) {
      throw new UnsupportedOperationException("taskSucceeded() called on a finished JobWaiter")
    }
    resultHandler(index, result.asInstanceOf[T])
    finishedTasks += 1
    if (finishedTasks == totalTasks) {
      _jobFinished = true
      jobResult = JobSucceeded
      this.notifyAll()
    }
  }

  override def jobFailed(exception: Exception): Unit = synchronized {
    _jobFinished = true
    jobResult = JobFailed(exception)
    this.notifyAll()
  }

  def awaitResult(): JobResult = synchronized {
    while (!_jobFinished) {
      this.wait()
    }
    return jobResult
  }
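The wait/notify protocol above can be exercised with a toy re-implementation. `MiniWaiter` mirrors the structure of JobWaiter but is a hypothetical class written for this article:

```scala
// Caller blocks in awaitResult(); workers report completions through
// taskSucceeded(); the final completion flips the flag and notifies.
class MiniWaiter[T](totalTasks: Int, resultHandler: (Int, T) => Unit) {
  private var finishedTasks = 0
  private var finished = false

  def taskSucceeded(index: Int, result: T): Unit = synchronized {
    resultHandler(index, result)
    finishedTasks += 1
    if (finishedTasks == totalTasks) { finished = true; notifyAll() }
  }

  def awaitResult(): Unit = synchronized { while (!finished) wait() }
}

// Two "tasks" completing on worker threads while the caller waits.
val results = new Array[Int](2)
val waiter = new MiniWaiter[Int](2, (i, r) => results(i) = r)
val workers = (0 until 2).map { i =>
  new Thread(new Runnable { def run(): Unit = waiter.taskSucceeded(i, i * 10) })
}
workers.foreach(_.start())
waiter.awaitResult() // returns only after both tasks have reported
assert(results.toSeq == Seq(0, 10))
```

Note the `while (!finished) wait()` loop rather than a single `wait()`: monitor waits can wake spuriously, so the condition must be rechecked, which is the same pattern the real awaitResult uses.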

Next, let us look at the submitJob method, whose code follows. The method first checks that the RDD's partition indices are within the legal range, then assigns the job a new id. For an RDD with zero partitions, it directly returns a JobWaiter object whose blocking wait returns success immediately. For an RDD with a non-empty partition set, it creates a new JobWaiter object, posts a JobSubmitted event through the eventProcessLoop object, and likewise returns the JobWaiter object to runJob.

def submitJob[T, U](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: CallSite,
      resultHandler: (Int, U) => Unit,
      properties: Properties): JobWaiter[U] = {
    // Check to make sure we are not launching a task on a partition that does not exist.
    val maxPartitions = rdd.partitions.length
    partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        "Attempting to access a non-existent partition: " + p + ". " +
          "Total number of partitions: " + maxPartitions)
    }

    val jobId = nextJobId.getAndIncrement()
    if (partitions.size == 0) {
      // Return immediately if the job is running 0 tasks
      return new JobWaiter[U](this, jobId, 0, resultHandler)
    }

    assert(partitions.size > 0)
    val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
    val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
    eventProcessLoop.post(JobSubmitted(
      jobId, rdd, func2, partitions.toArray, callSite, waiter,
      SerializationUtils.clone(properties)))
    waiter
  }

As described earlier, the JobSubmitted event is received and dispatched by doOnReceive in the DAGSchedulerEventProcessLoop class, which calls DAGScheduler.handleJobSubmitted to handle it. The handleJobSubmitted method is shown below:

private[scheduler] def handleJobSubmitted(jobId: Int,
      finalRDD: RDD[_],
      func: (TaskContext, Iterator[_]) => _,
      partitions: Array[Int],
      callSite: CallSite,
      listener: JobListener,
      properties: Properties) {
    var finalStage: ResultStage = null
    try {
      // New stage creation may throw an exception if, for example, jobs are run on a
      // HadoopRDD whose underlying HDFS files have been deleted.
      finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
    } catch {
      //omission:some code deal with the exception..
    }
    val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
    clearCacheLocs()
    //omission:some code about loginfo
    val jobSubmissionTime = clock.getTimeMillis()
    jobIdToActiveJob(jobId) = job
    activeJobs += job
    finalStage.setActiveJob(job)
    val stageIds = jobIdToStageIds(jobId).toArray
    val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
    listenerBus.post(
      SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
    submitStage(finalStage)
    submitWaitingStages()
  }

The first thing handleJobSubmitted does is call newResultStage to split the job into stages, producing finalStage, the variable representing the final stage. finalStage stores not only the information of the final stage itself but may also reference its parent stages, each of which in turn references its own parents, so finalStage effectively already encodes the entire DAG we are after. (See the section on stage splitting for the details.)
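That parent-chain encoding can be sketched with a minimal stage type (`MiniStage` and `allStageIds` are hypothetical names): the full DAG is recovered simply by walking parent references from the final stage.

```scala
// Each stage references its parent stages; walking from the final stage
// visits the entire DAG, even when parents are shared (a diamond).
case class MiniStage(id: Int, parents: List[MiniStage])

def allStageIds(finalStage: MiniStage): Set[Int] = {
  def walk(s: MiniStage, acc: Set[Int]): Set[Int] =
    s.parents.foldLeft(acc + s.id)((a, p) => walk(p, a))
  walk(finalStage, Set.empty)
}

val s0 = MiniStage(0, Nil)
val s1 = MiniStage(1, List(s0))
val s2 = MiniStage(2, List(s1, s0)) // diamond: s2 depends on s1 and s0
assert(allStageIds(s2) == Set(0, 1, 2))
```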

Once the stages are split, the job is turned into an active job. The key difference between an active job and an ordinary job is that the former carries the stage-splitting information (finalStage). This differs slightly from Spark 1.4: there is no longer a runLocally path for actions with only a single stage; everything is handled uniformly by submitStage.

In submitStage, the program checks whether the given stage has parent stages that have not yet run. If so, it first submits those parents by calling submitStage(parent) and puts the current stage into the waitingStages queue, to be taken out and run later. If all preceding stages have already finished, it calls submitMissingTasks directly to run the tasks of the current stage.

/** Submits stage, but first recursively submits any missing parents. */
  private def submitStage(stage: Stage) {
    val jobId = activeJobForStage(stage)
    if (jobId.isDefined) {
      logDebug("submitStage(" + stage + ")")
      if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
        val missing = getMissingParentStages(stage).sortBy(_.id)
        logDebug("missing: " + missing)
        if (missing.isEmpty) {
          logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
          submitMissingTasks(stage, jobId.get)
        } else {
          for (parent <- missing) {
            submitStage(parent)
          }
          waitingStages += stage
        }
      }
    } else {
      abortStage(stage, "No active job for stage " + stage.id, None)
    }
  }

The submitMissingTasks method is responsible for splitting a stage into multiple tasks and handing them over to the cluster for execution.
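Its core idea, one task per partition that still needs computing, can be sketched as follows (`MiniTask` and `makeTasks` are hypothetical names, not the Spark API):

```scala
// A stage expands into one task per partition left to compute.
case class MiniTask(stageId: Int, partition: Int)

def makeTasks(stageId: Int, partitionsToCompute: Seq[Int]): Seq[MiniTask] =
  partitionsToCompute.map(p => MiniTask(stageId, p))

// Stage 3 with partitions 0 and 2 still missing => two tasks.
val tasks = makeTasks(3, Seq(0, 2))
assert(tasks == Seq(MiniTask(3, 0), MiniTask(3, 2)))
```

The resulting task set is what gets handed to the task scheduler, closing the loop described in section 1.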