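1. Introduction to the DAG Scheduler
DAG stands for Directed Acyclic Graph. Spark records the dependencies between RDDs, and these dependencies are directed: they always point from a child RDD to its parent RDD. (The arrows in the diagrams we usually see show the direction of data flow, which is exactly the opposite of the dependency direction.) Because the dependencies are directed, the computation of RDDs falls into distinct stages, so the resulting chain of computation can be split into multiple stages, where a later stage can run only after the stages it depends on have completed. And because the data inside an RDD is immutable, the directed graph formed by the dependencies between stages naturally contains no cycles.
The purpose of DAG scheduling is to split a job into stages, build a DAG of those stages from the dependencies, then go inside each stage and divide it into tasks that can run in parallel, and finally hand all the tasks of a stage to the task scheduler to finish the work.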
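To make the idea of cutting a dependency chain into stages concrete, here is a minimal sketch in plain Scala. It is a toy model and not Spark's API: `Node`, `Narrow`, `Shuffle` and `StageSplitSketch` are hypothetical names, the graph is assumed to be a simple tree (no shared parents), and a new stage is assumed to start at every shuffle dependency.

```scala
// Toy model, not Spark's API: a Node stands in for an RDD and records its parents.
sealed trait Dep { def parent: Node }
final case class Narrow(parent: Node) extends Dep   // e.g. map, filter
final case class Shuffle(parent: Node) extends Dep  // e.g. reduceByKey

final case class Node(name: String, deps: List[Dep] = Nil)

object StageSplitSketch {
  /** Collects every node reachable from `last` through narrow dependencies only. */
  def stageMembers(last: Node): List[Node] =
    last :: last.deps.collect { case Narrow(p) => p }.flatMap(stageMembers)

  /** Nodes just beyond a shuffle boundary; each one is the last node of a parent stage. */
  def parentStageRoots(last: Node): List[Node] =
    stageMembers(last).flatMap(_.deps.collect { case Shuffle(p) => p })

  /** Stages in execution order: parent stages first, the final stage last. */
  def stagesInOrder(last: Node): List[List[String]] =
    parentStageRoots(last).flatMap(stagesInOrder) :+ stageMembers(last).map(_.name)
}
```

For a word-count-like chain (textFile, flatMap, map, reduceByKey), the sketch puts the first three nodes into one stage and reduceByKey into the final stage, because only the last dependency is a shuffle.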
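2. Communication Mechanism of the DAG Scheduler
eventProcessLoop (eventProcessActor in older, actor-based versions of Spark) is used to send and process scheduling events of all kinds, such as job submission and the monitoring of job and task completion. The corresponding messages and their handlers are found in the doOnReceive method of the DAGSchedulerEventProcessLoop class.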
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  //some other code...
}
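The pattern behind this dispatcher is a single consumer thread that takes events off a queue and pattern-matches on them. The following is a minimal sketch of that pattern only; `ToyEvent`, `ToyJobSubmitted`, `ToyJobCancelled`, `ToyStop` and `ToyEventLoop` are hypothetical names, and Spark's real event hierarchy and loop are considerably richer.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable.ArrayBuffer

// Hypothetical event types; Spark's real DAGSchedulerEvent hierarchy is much larger.
sealed trait ToyEvent
final case class ToyJobSubmitted(jobId: Int) extends ToyEvent
final case class ToyJobCancelled(jobId: Int) extends ToyEvent
case object ToyStop extends ToyEvent

final class ToyEventLoop {
  private val queue = new LinkedBlockingQueue[ToyEvent]()
  // Written only by the worker thread; read it after stopAndJoin() has returned.
  val handled = ArrayBuffer.empty[String]

  private val worker = new Thread(new Runnable {
    def run(): Unit = {
      var running = true
      while (running) {
        // take() blocks until an event is available, so events are handled one at a time.
        queue.take() match {
          case ToyJobSubmitted(id) => handled += s"submitted $id"
          case ToyJobCancelled(id) => handled += s"cancelled $id"
          case ToyStop             => running = false
        }
      }
    }
  })
  worker.setDaemon(true)

  def start(): Unit = worker.start()
  def post(event: ToyEvent): Unit = queue.put(event)
  def stopAndJoin(): Unit = { post(ToyStop); worker.join() }
}
```

Because every event goes through one queue and one thread, the handlers never run concurrently with each other, which is what lets scheduler state be updated without extra locking in this style of design.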
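3. How a count Job Is Processed
Below we trace the execution of count, a commonly used Spark action, to get a clear picture of the whole DAG scheduling process. The count method is implemented in the RDD abstract class as follows: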
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
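The count method calls the runJob method of SparkContext. runJob returns an array holding the number of records in each partition (an Array[Long]), so calling sum on it gives the total number of records in the RDD.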
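The "count each partition, then sum" idea can be sketched without Spark. In the sketch below, partitions are plain in-memory sequences, and `CountSketch`, `iteratorSize` and this simplified `runJob` are stand-ins chosen for illustration, not Spark's signatures.

```scala
object CountSketch {
  // Stand-in for Utils.getIteratorSize: consume the iterator and count its elements.
  def iteratorSize[T](it: Iterator[T]): Long = {
    var n = 0L
    while (it.hasNext) { it.next(); n += 1L }
    n
  }

  // Stand-in for runJob: apply `func` to every partition and collect one result per partition.
  def runJob[T](partitions: Seq[Seq[T]], func: Iterator[T] => Long): Array[Long] =
    partitions.map(p => func(p.iterator)).toArray

  // count = sum of the per-partition record counts.
  def count[T](partitions: Seq[Seq[T]]): Long =
    runJob(partitions, (it: Iterator[T]) => iteratorSize(it)).sum
}
```

With three partitions holding 3, 0 and 2 records, `runJob` yields the per-partition sizes 3, 0 and 2, and `count` returns their sum.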
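Below is the runJob method of the SparkContext class. (count calls a shorter overload, which eventually delegates to this one.)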
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
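SparkContext.runJob first captures the call site of the function, for later log output and debugging. It then cleans the closure of func so that the function can be serialized, and calls DAGScheduler.runJob to hand the job over to the DAG scheduler. (A closure can be thought of as a function that is able to read variables defined in its enclosing scope.)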
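Note that this runJob returns Unit: results reach the caller only through resultHandler, which receives a partition index and that partition's result. The sketch below shows how an array of results can be built on top of such a callback. It is a simplified stand-in with in-memory partitions; `ResultHandlerSketch` and `runJobToArray` are hypothetical names, and delivery order is reversed on purpose to mimic tasks finishing out of order.

```scala
object ResultHandlerSketch {
  // Stand-in for the handler-based runJob: results are pushed to `resultHandler`
  // as (partition index, value) pairs and nothing is returned.
  def runJob[T, U](
      partitions: Seq[Seq[T]],
      func: Iterator[T] => U,
      resultHandler: (Int, U) => Unit): Unit =
    // Deliver in reverse order to mimic tasks that finish out of order.
    partitions.indices.reverse.foreach { i =>
      resultHandler(i, func(partitions(i).iterator))
    }

  // Array-returning variant: the handler writes each result into its partition's slot,
  // so the final array is ordered by partition index regardless of completion order.
  def runJobToArray[T](partitions: Seq[Seq[T]], func: Iterator[T] => Long): Array[Long] = {
    val results = new Array[Long](partitions.size)
    runJob[T, Long](partitions, func, (index, res) => results(index) = res)
    results
  }
}
```

Indexing by partition rather than appending is the design choice that matters here: it keeps the result order stable even though tasks may complete in any order.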
Next we move from the SparkContext class to the DAGScheduler class:
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  waiter.awaitResult() match {
    case JobSucceeded =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case JobFailed(exception: Exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}
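runJob then calls submitJob to submit the job. submitJob returns waiter, an instance of the JobWaiter class, which serves two main purposes:
1. Cancelling a job by sending a JobCancelled message through eventProcessLoop.
2. Blocking the thread that called DAGScheduler.runJob until the submitted job has finished.
Internally, JobWaiter listens for the completion event of every task and counts how many tasks have finished. After each task completes, it invokes the callback resultHandler to process that task's result. When the job completes or fails, the blocking ends and the job result is returned to runJob.
Below are several methods of the JobWaiter class: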
override def taskSucceeded(index: Int, result: Any): Unit = synchronized {
  if (_jobFinished) {
    throw new UnsupportedOperationException("taskSucceeded() called on a finished JobWaiter")
  }
  resultHandler(index, result.asInstanceOf[T])
  finishedTasks += 1
  if (finishedTasks == totalTasks) {
    _jobFinished = true
    jobResult = JobSucceeded
    this.notifyAll()
  }
}
override def jobFailed(exception: Exception): Unit = synchronized {
  _jobFinished = true
  jobResult = JobFailed(exception)
  this.notifyAll()
}
def awaitResult(): JobResult = synchronized {
  while (!_jobFinished) {
    this.wait()
  }
  return jobResult
}
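The wait/notifyAll handshake used by these methods can be tried out in isolation. The following sketch keeps only the success path; `ToyWaiter` and `ToyWaiterDemo` are hypothetical names, results are fixed to Long for brevity, and plain threads stand in for finishing tasks.

```scala
import java.util.concurrent.atomic.AtomicLong

final class ToyWaiter(totalTasks: Int, resultHandler: (Int, Long) => Unit) {
  private var finishedTasks = 0
  // A job with zero tasks is finished from the start, so awaitResult never blocks.
  private var jobFinished = totalTasks == 0

  def taskSucceeded(index: Int, result: Long): Unit = synchronized {
    resultHandler(index, result)
    finishedTasks += 1
    if (finishedTasks == totalTasks) {
      jobFinished = true
      notifyAll() // wake up the thread blocked in awaitResult
    }
  }

  def awaitResult(): Unit = synchronized {
    // Loop guards against spurious wakeups: re-check the flag after every wait().
    while (!jobFinished) wait()
  }
}

object ToyWaiterDemo {
  /** Starts three "tasks" on separate threads and blocks until all have reported. */
  def run(): Long = {
    val total = new AtomicLong(0L)
    val waiter = new ToyWaiter(3, (_, n) => { total.addAndGet(n); () })
    (0 until 3).foreach { i =>
      new Thread(new Runnable {
        def run(): Unit = waiter.taskSucceeded(i, (i + 1) * 10L)
      }).start()
    }
    waiter.awaitResult()
    total.get()
  }
}
```

The caller of `awaitResult` plays the role of the thread inside DAGScheduler.runJob: it sleeps on the waiter's monitor and is released only when the last task has reported its result.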
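Let's continue with the submitJob method, whose code is shown below. It first makes sure that the requested partition indices are within the valid range, then assigns an ID to the job. If the job requests zero partitions, it immediately returns a JobWaiter whose awaitResult reports success as soon as it is called. Otherwise it creates a new JobWaiter, posts a JobSubmitted event through eventProcessLoop, and returns that JobWaiter to runJob.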
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }
  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
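One detail of the listing that is easy to miss is the ordering: the partition check runs before the job ID is taken from the counter, so a rejected request does not consume an ID. The sketch below isolates just that part; `SubmitJobSketch` and `submit` are hypothetical names and everything after ID allocation is left out.

```scala
import java.util.concurrent.atomic.AtomicInteger

object SubmitJobSketch {
  private val nextJobId = new AtomicInteger(0)

  /** Returns the new job ID, or throws if any requested partition index is out of range. */
  def submit(requested: Seq[Int], numPartitions: Int): Int = {
    // Validate first: find the first index outside [0, numPartitions).
    requested.find(p => p >= numPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        s"Attempting to access a non-existent partition: $p. " +
          s"Total number of partitions: $numPartitions")
    }
    // Only a valid request reaches this point and receives an ID.
    nextJobId.getAndIncrement()
  }
}
```

An empty request passes validation and still receives an ID, which matches the listing above, where the zero-partition case returns only after `nextJobId.getAndIncrement()` has run.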
As described earlier, the JobSubmitted event is received by doOnReceive in the DAGSchedulerEventProcessLoop class, which calls DAGScheduler.handleJobSubmitted to handle it. The code of handleJobSubmitted is as follows:
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    finalStage = newResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    //omission:some code deal with the exception..
  }
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  clearCacheLocs()
  //omission:some code about loginfo
  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  submitStage(finalStage)
  submitWaitingStages()
}
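The first thing handleJobSubmitted does is call newResultStage to split the job into stages, producing finalStage, the variable that represents the final stage. finalStage holds not only the information of the final stage itself but may also hold its parent stages, which in turn hold their own parents, so finalStage in effect already contains the whole DAG we are after. (For details, see "Stage splitting".)
Once the stages have been split, the program turns the current job into an active job (ActiveJob). The main difference between an active job and a plain job is that the former holds the stage-splitting information (finalStage). This differs slightly from Spark 1.4: there is no longer a runLocally method for actions that consist of a single stage, and everything is handled uniformly by submitStage.
In submitStage, the program checks whether the stage passed in has parent stages that have not yet run. If it does, it calls submitStage(parent) to run the parents first and puts the stage itself into the waitingStages queue, to be taken out and executed later. If all preceding stages have completed, it calls submitMissingTasks directly to run the tasks of the current stage. A stage that is already waiting, running or failed is skipped, and a stage without an active job is aborted.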
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
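The recursion is easier to follow on a toy graph. The sketch below is a simplified model under stated assumptions: `ToyStage` and `ToyStageSubmitter` are hypothetical names, a stage's missing parents are simply its direct parents that have not finished, "launching" a stage only records its ID, and `stageFinished` resubmits all waiting stages as a rough stand-in for what happens when a stage completes.

```scala
import scala.collection.mutable

final class ToyStage(val id: Int, val parents: List[ToyStage]) {
  var finished = false
}

final class ToyStageSubmitter {
  val waiting = mutable.LinkedHashSet.empty[ToyStage]
  val running = mutable.LinkedHashSet.empty[ToyStage]
  val launched = mutable.ArrayBuffer.empty[Int] // order in which stages were started

  // Assumption: a parent is "missing" simply if it has not finished yet.
  private def missingParents(stage: ToyStage): List[ToyStage] =
    stage.parents.filterNot(_.finished).sortBy(_.id)

  def submitStage(stage: ToyStage): Unit =
    if (!waiting(stage) && !running(stage)) {
      val missing = missingParents(stage)
      if (missing.isEmpty) {
        // No unfinished parents: the stage can start right away.
        running += stage
        launched += stage.id
      } else {
        // Submit the parents first, then park this stage until they are done.
        missing.foreach(submitStage)
        waiting += stage
      }
    }

  /** Marks a stage as done and gives every waiting stage another chance to start. */
  def stageFinished(stage: ToyStage): Unit = {
    stage.finished = true
    running -= stage
    val retry = waiting.toList.sortBy(_.id)
    waiting.clear()
    retry.foreach(submitStage)
  }
}
```

Submitting the final stage of a four-stage graph (stages 0 and 1 feed stage 2, which feeds stage 3) starts only stages 0 and 1; stages 2 and 3 wait and are started one after the other as their parents finish.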
The submitMissingTasks method is responsible for dividing a stage into multiple tasks and handing them to the cluster for execution.