TaskSchedulerImpl's scheduling of Tasks relies on the scheduling pool (Pool): every TaskSet that needs to be scheduled is placed into the pool. The Pool orders the TaskSets according to a scheduling algorithm and hands them back to TaskSchedulerImpl for resource scheduling.
1 Scheduling Algorithms
How the pool schedules TaskSets depends on the scheduling algorithm. The trait SchedulingAlgorithm defines the contract that scheduling algorithms must follow:
//org.apache.spark.scheduler.SchedulingAlgorithm
private[spark] trait SchedulingAlgorithm {
  def comparator(s1: Schedulable, s2: Schedulable): Boolean
}
SchedulingAlgorithm defines only a single method, comparator, which compares two Schedulables.
SchedulingAlgorithm has two implementations: FIFOSchedulingAlgorithm, which implements the first-in-first-out (FIFO) algorithm, and FairSchedulingAlgorithm, which implements the fair scheduling algorithm.
1.1 FIFOSchedulingAlgorithm in Detail
This class implements the FIFO scheduling algorithm:
//org.apache.spark.scheduler.SchedulingAlgorithm
private[spark] class FIFOSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val priority1 = s1.priority
    val priority2 = s2.priority
    var res = math.signum(priority1 - priority2)
    if (res == 0) {
      val stageId1 = s1.stageId
      val stageId2 = s2.stageId
      res = math.signum(stageId1 - stageId2)
    }
    res < 0
  }
}
- 1) Compare the priorities of s1 and s2 (the smaller the value, the higher the priority).
- 2) If the two Schedulables have the same priority, compare the identifiers of the Stages that s1 and s2 belong to.
- 3) If the comparison result is less than 0, s1 is scheduled first; otherwise s2 is scheduled first.
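The following self-contained sketch (not Spark code; Entry and fifoComparator are hypothetical stand-ins for a Schedulable's priority and stageId) mirrors the comparator above and shows the resulting order:
case class Entry(priority: Int, stageId: Int)

// Same logic as FIFOSchedulingAlgorithm.comparator, applied to plain values.
def fifoComparator(s1: Entry, s2: Entry): Boolean = {
  var res = math.signum(s1.priority - s2.priority)   // smaller priority value wins
  if (res == 0) {
    res = math.signum(s1.stageId - s2.stageId)       // tie-break on the smaller stageId
  }
  res < 0
}

val entries = Seq(Entry(priority = 2, stageId = 5), Entry(priority = 1, stageId = 7),
  Entry(priority = 1, stageId = 3))
entries.sortWith(fifoComparator)
// Result: Entry(1,3), Entry(1,7), Entry(2,5) -- lower priority value first, then earlier stage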
1.2 FairSchedulingAlgorithm in Detail
This class implements the fair (FAIR) scheduling algorithm:
//org.apache.spark.scheduler.SchedulingAlgorithm
private[spark] class FairSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val minShare1 = s1.minShare
    val minShare2 = s2.minShare
    val runningTasks1 = s1.runningTasks
    val runningTasks2 = s2.runningTasks
    val s1Needy = runningTasks1 < minShare1
    val s2Needy = runningTasks2 < minShare2
    val minShareRatio1 = runningTasks1.toDouble / math.max(minShare1, 1.0)
    val minShareRatio2 = runningTasks2.toDouble / math.max(minShare2, 1.0)
    val taskToWeightRatio1 = runningTasks1.toDouble / s1.weight.toDouble
    val taskToWeightRatio2 = runningTasks2.toDouble / s2.weight.toDouble
    var compare = 0
    if (s1Needy && !s2Needy) {
      return true
    } else if (!s1Needy && s2Needy) {
      return false
    } else if (s1Needy && s2Needy) {
      compare = minShareRatio1.compareTo(minShareRatio2)
    } else {
      compare = taskToWeightRatio1.compareTo(taskToWeightRatio2)
    }
    if (compare < 0) {
      true
    } else if (compare > 0) {
      false
    } else {
      s1.name < s2.name
    }
  }
}
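The comparator above works as follows: a Schedulable whose number of running tasks is still below its minShare is scheduled before one that has already reached its minShare; if both are below their minShare, the one with the smaller runningTasks/minShare ratio wins; if neither is, the one with the smaller runningTasks/weight ratio wins; any remaining tie is broken by name. The self-contained sketch below (not Spark code; Stats and fairComparator are hypothetical stand-ins for a Schedulable's scheduling state) mirrors that logic with concrete numbers:
case class Stats(name: String, runningTasks: Int, minShare: Int, weight: Int)

// Same decision logic as FairSchedulingAlgorithm.comparator.
def fairComparator(s1: Stats, s2: Stats): Boolean = {
  val s1Needy = s1.runningTasks < s1.minShare
  val s2Needy = s2.runningTasks < s2.minShare
  if (s1Needy && !s2Needy) {
    true                                       // only s1 is below its minShare
  } else if (!s1Needy && s2Needy) {
    false                                      // only s2 is below its minShare
  } else {
    // both (or neither) are below minShare: compare the relevant usage ratios
    val ratio1 =
      if (s1Needy) s1.runningTasks.toDouble / math.max(s1.minShare, 1.0)
      else s1.runningTasks.toDouble / s1.weight
    val ratio2 =
      if (s2Needy) s2.runningTasks.toDouble / math.max(s2.minShare, 1.0)
      else s2.runningTasks.toDouble / s2.weight
    if (ratio1 != ratio2) ratio1 < ratio2      // smaller ratio is scheduled first
    else s1.name < s2.name                     // final tie-breaker: name
  }
}

// "A" has not reached its minShare while "B" has, so "A" is scheduled first.
fairComparator(Stats("A", runningTasks = 1, minShare = 2, weight = 1),
  Stats("B", runningTasks = 5, minShare = 2, weight = 1))              // true
// Neither is below its minShare: "C" wins because 4 / 2.0 < 6 / 1.0.
fairComparator(Stats("C", runningTasks = 4, minShare = 2, weight = 2),
  Stats("D", runningTasks = 6, minShare = 2, weight = 1))              // true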
2 The Implementation of Pool
TaskScheduler schedules tasks by means of a scheduling pool; Pool is the pool used to schedule collections of Tasks. The pool contains a root scheduling queue, the root scheduling queue contains multiple child pools, and each child pool's own scheduling queue may in turn contain further pools or TaskSetManagers, so the whole scheduling pool forms a multi-level scheduling queue. Pool implements the Schedulable trait and has the following properties:
- parent: the parent Pool of the current Pool
- poolName: one of Pool's constructor parameters, the name of the Pool
- schedulingMode: one of Pool's constructor parameters, the scheduling mode (SchedulingMode). The SchedulingMode enumeration has three values: FAIR, FIFO and NONE
- initMinShare: the initial value of minShare
- initWeight: the initial value of weight
- weight: the weight used by the fair scheduling algorithm
- minShare: the minimum share used as a reference by the fair scheduling algorithm
- schedulableQueue: of type ConcurrentLinkedQueue[Schedulable], used to store Schedulables. Since Schedulable has only two implementations, Pool and TaskSetManager, schedulableQueue forms a hierarchy that can be nested
- schedulableNameToSchedulable: the mapping from schedulable names to Schedulables
- runningTasks: the number of tasks currently running
- priority: the priority used for scheduling
- stageId: the identifier of the Stage that the pool or TaskSetManager belongs to
- name: the same as poolName
- taskSetSchedulingAlgorithm: the scheduling algorithm for task sets, FIFOSchedulingAlgorithm by default
The Pool class provides the following methods.
2.1 addSchedulable
Adds a Schedulable to schedulableQueue and schedulableNameToSchedulable, and sets the Schedulable's parent to the current Pool:
//org.apache.spark.scheduler.Pool
override def addSchedulable(schedulable: Schedulable) {
  require(schedulable != null)
  schedulableQueue.add(schedulable)
  schedulableNameToSchedulable.put(schedulable.name, schedulable)
  schedulable.parent = this
}
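As a usage sketch (assuming code that, like Spark's own scheduler classes, lives in the org.apache.spark.scheduler package and can therefore access these private[spark] classes; the pool names are hypothetical), a two-level hierarchy can be built as follows:
import org.apache.spark.scheduler.{Pool, SchedulingMode}

// Root pool plus one child pool: rootPool's schedulableQueue then contains
// childPool, and childPool.parent points back to rootPool.
val rootPool = new Pool("root", SchedulingMode.FIFO, initMinShare = 0, initWeight = 0)
val childPool = new Pool("production", SchedulingMode.FIFO, initMinShare = 2, initWeight = 1)
rootPool.addSchedulable(childPool)
// A TaskSetManager is added to a pool in exactly the same way.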
2.2 removeSchedulable
Removes the specified Schedulable from schedulableQueue and schedulableNameToSchedulable:
override def removeSchedulable(schedulable: Schedulable) {
  schedulableQueue.remove(schedulable)
  schedulableNameToSchedulable.remove(schedulable.name)
}
2.3 getSchedulableByName
Looks up a Schedulable by the specified name:
override def getSchedulableByName(schedulableName: String): Schedulable = {
  if (schedulableNameToSchedulable.containsKey(schedulableName)) {
    return schedulableNameToSchedulable.get(schedulableName) // found in the current Pool
  }
  for (schedulable <- schedulableQueue.asScala) {
    val sched = schedulable.getSchedulableByName(schedulableName) // search the child Schedulables
    if (sched != null) {
      return sched
    }
  }
  null
}
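Continuing the hypothetical rootPool/childPool hierarchy from the sketch in section 2.1, the lookup first consults the current pool's own map and then recurses into the child pools:
rootPool.getSchedulableByName("production")       // present in rootPool's own map: returns childPool
rootPool.getSchedulableByName("does-not-exist")   // found nowhere in the hierarchy: returns null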
2.4 executorLost
Called when an Executor is lost: it invokes the executorLost method of every Schedulable (which may be a child pool or a TaskSetManager) in the current Pool's schedulableQueue. TaskSetManager's executorLost in turn treats the tasks that were running on that Executor as failed and resubmits them:
override def executorLost(executorId: String, host: String, reason: ExecutorLossReason) {
  schedulableQueue.asScala.foreach(_.executorLost(executorId, host, reason))
}
2.5 checkSpeculatableTasks
The checkSpeculatableTasks method checks whether the current Pool contains tasks that should be speculatively executed. It works by iterating over the Schedulables in schedulableQueue and calling each one's checkSpeculatableTasks method. Together, Pool's checkSpeculatableTasks and TaskSetManager's checkSpeculatableTasks perform a depth-first traversal of the scheduling pool to find tasks eligible for speculative execution:
override def checkSpeculatableTasks(): Boolean = {
  var shouldRevive = false
  for (schedulable <- schedulableQueue.asScala) {
    shouldRevive |= schedulable.checkSpeculatableTasks()
  }
  shouldRevive
}
2.6 getSortedTaskSetQueue
Sorts all TaskSetManagers in the current Pool according to the scheduling algorithm and returns the sorted TaskSetManagers. getSortedTaskSetQueue works by recursively calling getSortedTaskSetQueue on each Schedulable in schedulableQueue:
override def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager] = {
  var sortedTaskSetQueue = new ArrayBuffer[TaskSetManager]
  val sortedSchedulableQueue =
    schedulableQueue.asScala.toSeq.sortWith(taskSetSchedulingAlgorithm.comparator)
  for (schedulable <- sortedSchedulableQueue) {
    sortedTaskSetQueue ++= schedulable.getSortedTaskSetQueue
  }
  sortedTaskSetQueue
}
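This is the entry point TaskSchedulerImpl uses to decide the order in which TaskSetManagers receive resource offers; a simplified sketch of that call site (rootPool is assumed to be the root pool held by TaskSchedulerImpl, and the offer loop is elided) looks roughly like this:
// The root pool yields all leaf TaskSetManagers, already sorted by the
// configured scheduling algorithm.
val sortedTaskSets = rootPool.getSortedTaskSetQueue   // ArrayBuffer[TaskSetManager]
for (taskSet <- sortedTaskSets) {
  // offer the available resources to each TaskSetManager in this order
}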
2.7 increaseRunningTasks
Increases the number of currently running tasks recorded by the current Pool and by its parent Pools:
def increaseRunningTasks(taskNum: Int) {
  runningTasks += taskNum
  if (parent != null) {
    parent.increaseRunningTasks(taskNum)
  }
}
2.8 decreaseRunningTasks
Decreases the number of currently running tasks recorded by the current Pool and by its parent Pools:
def decreaseRunningTasks(taskNum: Int) {
  runningTasks -= taskNum
  if (parent != null) {
    parent.decreaseRunningTasks(taskNum)
  }
}
3 Scheduling Pool Builders
The trait SchedulableBuilder defines the contract for scheduling pool builders:
//org.apache.spark.scheduler.SchedulableBuilder
private[spark] trait SchedulableBuilder {
  def rootPool: Pool
  def buildPools(): Unit
  def addTaskSetManager(manager: Schedulable, properties: Properties): Unit
}
The code above defines three methods:
- rootPool: returns the root scheduling pool
- buildPools: builds the scheduling pools
- addTaskSetManager: adds a TaskSetManager to the scheduling pool
Corresponding to the FIFO and FAIR scheduling algorithms, SchedulableBuilder has two implementations: FIFOSchedulableBuilder and FairSchedulableBuilder.
3.1 FIFOSchedulableBuilder in Detail
private[spark] class FIFOSchedulableBuilder(val rootPool: Pool)
  extends SchedulableBuilder with Logging {
  override def buildPools() {
    // nothing
  }
  override def addTaskSetManager(manager: Schedulable, properties: Properties) {
    rootPool.addSchedulable(manager)
  }
}
The buildPools method implemented by FIFOSchedulableBuilder is empty, while its addTaskSetManager method adds the TaskSetManager directly to the root scheduling pool. The scheduling pool built by FIFOSchedulableBuilder is therefore a single-level structure in which rootPool directly holds all TaskSetManagers.
3.2 FairSchedulableBuilder in Detail
FairSchedulableBuilder is considerably more involved. To make the analysis easier, let us first look at its properties:
- rootPool: the root scheduling pool; rootPool is a constructor parameter of FairSchedulableBuilder
- conf: the SparkConf
- schedulerAllocFile: a user-specified scheduling allocation file in the file system. It can be configured through the spark.scheduler.allocation.file property; FairSchedulableBuilder reads the fair scheduling configuration from this file
- DEFAULT_SCHEDULER_FILE: the default scheduler file name. The constant DEFAULT_SCHEDULER_FILE is fixed to "fairscheduler.xml"; FairSchedulableBuilder reads the fair scheduling configuration from a file with this name on the classpath
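As a configuration sketch (the application name and file path below are hypothetical examples), an application opts into FairSchedulableBuilder and points it at an allocation file through SparkConf:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("fair-scheduling-example")
  .set("spark.scheduler.mode", "FAIR")                                   // use FairSchedulableBuilder
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // schedulerAllocFile
// If spark.scheduler.allocation.file is not set, FairSchedulableBuilder falls back
// to the fairscheduler.xml found on the classpath (DEFAULT_SCHEDULER_FILE), if any.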
In the scheduling pool built by FairSchedulableBuilder, the root pool contains multiple child pools (those defined in the allocation file, plus a default pool), and each child pool in turn holds the TaskSetManagers submitted to it, forming a two-level hierarchy.