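TaskSchedulerImpl relies on the scheduling pool (Pool) to schedule Tasks: every TaskSet that needs to be scheduled is placed in the pool. The Pool orders the TaskSets with a scheduling algorithm and hands them, in that order, to TaskSchedulerImpl for resource scheduling.

1 Scheduling algorithms

The order in which the pool schedules TaskSets depends on the scheduling algorithm. The trait SchedulingAlgorithm defines the contract for scheduling algorithms: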

//org.apache.spark.scheduler.SchedulingAlgorithm
private[spark] trait SchedulingAlgorithm {
  def comparator(s1: Schedulable, s2: Schedulable): Boolean
}

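SchedulingAlgorithm defines a single method, comparator, which compares two Schedulables and returns true when s1 should be scheduled before s2.

SchedulingAlgorithm has two implementations: FIFOSchedulingAlgorithm, which implements the first-in, first-out (FIFO) algorithm, and FairSchedulingAlgorithm, which implements the fair scheduling algorithm.

1.1 FIFOSchedulingAlgorithm in detail

This class implements the FIFO scheduling algorithm: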

//org.apache.spark.scheduler.SchedulingAlgorithm
private[spark] class FIFOSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val priority1 = s1.priority
    val priority2 = s2.priority
    var res = math.signum(priority1 - priority2)
    if (res == 0) {
      val stageId1 = s1.stageId
      val stageId2 = s2.stageId
      res = math.signum(stageId1 - stageId2)
    }
    res < 0
  }
}
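  • 1) Compare the priorities of the two Schedulables s1 and s2 (the smaller the value, the higher the priority).
  • 2) If the two priorities are equal, compare the IDs of the Stages that s1 and s2 belong to.
  • 3) If the comparison result is less than 0, s1 is scheduled first; otherwise s2 is scheduled first.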
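
The three steps above can be tried out in isolation. The following is a minimal sketch, not Spark code: FifoEntry and FifoSketch are hypothetical names, and FifoEntry stands in for Schedulable with only the two fields the comparator reads.

```scala
// Hypothetical stand-in for Schedulable: only the fields the FIFO comparator reads.
final case class FifoEntry(name: String, priority: Int, stageId: Int)

object FifoSketch {
  // Same rule as the comparator shown above:
  // smaller priority value first, then smaller stageId.
  def comparator(s1: FifoEntry, s2: FifoEntry): Boolean = {
    var res = math.signum(s1.priority - s2.priority)
    if (res == 0) {
      res = math.signum(s1.stageId - s2.stageId)
    }
    res < 0
  }

  // Returns the entry names in the order they would be scheduled.
  def order(entries: Seq[FifoEntry]): Seq[String] =
    entries.sortWith(comparator).map(_.name)
}
```

For example, FifoSketch.order(Seq(FifoEntry("c", 1, 0), FifoEntry("b", 0, 3), FifoEntry("a", 0, 2))) places "a" and "b" (priority 0) ahead of "c", and "a" ahead of "b" because its stageId is smaller.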

1.2 FairSchedulingAlgorithm in detail

This class implements the fair scheduling algorithm:

//org.apache.spark.scheduler.SchedulingAlgorithm
private[spark] class FairSchedulingAlgorithm extends SchedulingAlgorithm {
  override def comparator(s1: Schedulable, s2: Schedulable): Boolean = {
    val minShare1 = s1.minShare
    val minShare2 = s2.minShare
    val runningTasks1 = s1.runningTasks
    val runningTasks2 = s2.runningTasks
    val s1Needy = runningTasks1 < minShare1
    val s2Needy = runningTasks2 < minShare2
    val minShareRatio1 = runningTasks1.toDouble / math.max(minShare1, 1.0)
    val minShareRatio2 = runningTasks2.toDouble / math.max(minShare2, 1.0)
    val taskToWeightRatio1 = runningTasks1.toDouble / s1.weight.toDouble
    val taskToWeightRatio2 = runningTasks2.toDouble / s2.weight.toDouble
    var compare = 0
    if (s1Needy && !s2Needy) {
      return true
    } else if (!s1Needy && s2Needy) {
      return false
    } else if (s1Needy && s2Needy) {
      compare = minShareRatio1.compareTo(minShareRatio2)
    } else {
      compare = taskToWeightRatio1.compareTo(taskToWeightRatio2)
    }
    if (compare < 0) {
      true
    } else if (compare > 0) {
      false
    } else {
      s1.name < s2.name
    }
  }
}
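
Reading the comparator above: a Schedulable is "needy" when its runningTasks is below its minShare. A needy Schedulable goes before one that is not needy; if both are needy, the one with the smaller runningTasks/minShare ratio goes first; if neither is needy, the one with the smaller runningTasks/weight ratio goes first; ties are broken by name. The sketch below reproduces these rules on a hypothetical FairEntry case class (FairEntry and FairSketch are not Spark names), and uses java.lang.Double.compare in place of compareTo.

```scala
// Hypothetical stand-in for Schedulable: only the fields the fair comparator reads.
final case class FairEntry(name: String, runningTasks: Int, minShare: Int, weight: Int)

object FairSketch {
  def comparator(s1: FairEntry, s2: FairEntry): Boolean = {
    val s1Needy = s1.runningTasks < s1.minShare
    val s2Needy = s2.runningTasks < s2.minShare
    // Guard against division by zero when minShare is 0.
    val minShareRatio1 = s1.runningTasks.toDouble / math.max(s1.minShare, 1.0)
    val minShareRatio2 = s2.runningTasks.toDouble / math.max(s2.minShare, 1.0)
    val taskToWeightRatio1 = s1.runningTasks.toDouble / s1.weight.toDouble
    val taskToWeightRatio2 = s2.runningTasks.toDouble / s2.weight.toDouble

    val compare =
      if (s1Needy && !s2Needy) -1       // only s1 is below its minimum share
      else if (!s1Needy && s2Needy) 1   // only s2 is below its minimum share
      else if (s1Needy && s2Needy) java.lang.Double.compare(minShareRatio1, minShareRatio2)
      else java.lang.Double.compare(taskToWeightRatio1, taskToWeightRatio2)

    if (compare < 0) true
    else if (compare > 0) false
    else s1.name < s2.name              // deterministic tie-break by name
  }
}
```

With these definitions, FairEntry("a", 1, 2, 1) (one running task, minimum share of two) is ordered before FairEntry("b", 3, 2, 1), which has already reached its minimum share.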

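2 The Pool implementation

TaskScheduler schedules tasks with the help of the scheduling pool; Pool is the pool that schedules collections of Tasks. A pool has a root scheduling queue that contains multiple child pools. A child pool's own queue can in turn contain other pools or TaskSetManagers, so the pool as a whole is a multi-level scheduling queue. Pool implements the Schedulable trait and has the following properties:

  • parent: the parent Pool of the current Pool
  • poolName: a constructor parameter of Pool; the name of the Pool
  • schedulingMode: a constructor parameter of Pool; the scheduling mode (SchedulingMode). The enumeration SchedulingMode has three values: FAIR, FIFO, and NONE
  • initMinShare: the initial value of minShare
  • initWeight: the initial value of weight
  • weight: the weight used by the fair scheduling algorithm
  • minShare: the minimum share used by the fair scheduling algorithm; it is compared against runningTasks
  • schedulableQueue: of type ConcurrentLinkedQueue[Schedulable]; stores the Schedulables. Because Schedulable has only two implementations, Pool and TaskSetManager, schedulableQueue forms a hierarchy that can be nested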

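  • schedulableNameToSchedulable: the mapping from a Schedulable's name to the Schedulable
  • runningTasks: the number of tasks currently running
  • priority: the scheduling priority
  • stageId: the ID of the Stage that the pool or TaskSetManager belongs to
  • name: the same as poolName
  • taskSetSchedulingAlgorithm: the scheduling algorithm applied to the task sets, determined by schedulingMode; it is FIFOSchedulingAlgorithm under the default FIFO mode

The Pool class provides the following methods.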

2.1 addSchedulable

Adds a Schedulable to schedulableQueue and schedulableNameToSchedulable, and sets the Schedulable's parent to the current Pool.

//org.apache.spark.scheduler.Pool
override def addSchedulable(schedulable: Schedulable) {
  require(schedulable != null)
  schedulableQueue.add(schedulable)
  schedulableNameToSchedulable.put(schedulable.name, schedulable)
  schedulable.parent = this
}

2.2 removeSchedulable

Removes the specified Schedulable from schedulableQueue and schedulableNameToSchedulable.

override def removeSchedulable(schedulable: Schedulable) {
  schedulableQueue.remove(schedulable)
  schedulableNameToSchedulable.remove(schedulable.name)
}

2.3 getSchedulableByName

Looks up a Schedulable by name, first in the current Pool and then recursively in its children.

override def getSchedulableByName(schedulableName: String): Schedulable = {
  if (schedulableNameToSchedulable.containsKey(schedulableName)) {
    return schedulableNameToSchedulable.get(schedulableName) // found by name in the current Pool
  }
  for (schedulable <- schedulableQueue.asScala) {
    val sched = schedulable.getSchedulableByName(schedulableName) // search the child Schedulables
    if (sched != null) {
      return sched
    }
  }
  null
}
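
The lookup order (own name map first, then each child recursively) can be sketched outside Spark as follows. NamedNode is a hypothetical class, not part of Spark, and the sketch returns an Option where the method above returns null.

```scala
import scala.collection.mutable

// Hypothetical tree node that mirrors the two containers used by Pool.
final class NamedNode(val name: String) {
  private val children = mutable.ArrayBuffer.empty[NamedNode]
  private val byName = mutable.Map.empty[String, NamedNode]

  def add(child: NamedNode): Unit = {
    children += child
    byName(child.name) = child
  }

  // Check the direct children by name first, then descend into each child in turn.
  def find(target: String): Option[NamedNode] =
    byName.get(target).orElse(
      children.iterator.map(_.find(target)).collectFirst { case Some(n) => n }
    )
}
```

A node nested two levels down is therefore found by asking the root, at the cost of a traversal when the name is not a direct child.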

2.4 executorLost

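Called when an Executor is lost. It invokes executorLost on every Schedulable in the current Pool's schedulableQueue (each one may be a child pool or a TaskSetManager). TaskSetManager's executorLost in turn treats the Tasks that were running on that Executor as failed and resubmits them.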

override def executorLost(executorId: String, host: String, reason: ExecutorLossReason) {
  schedulableQueue.asScala.foreach(_.executorLost(executorId, host, reason))
}

2.5 checkSpeculatableTasks

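checkSpeculatableTasks checks whether the current Pool contains tasks that need speculative execution. It works by calling checkSpeculatableTasks on each child Schedulable in schedulableQueue. Together, Pool's checkSpeculatableTasks and TaskSetManager's checkSpeculatableTasks perform a depth-first traversal of the pool to find tasks eligible for speculative execution.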

override def checkSpeculatableTasks(): Boolean = {
  var shouldRevive = false
  for (schedulable <- schedulableQueue.asScala) {
    shouldRevive |= schedulable.checkSpeculatableTasks()
  }
  shouldRevive
}

2.6 getSortedTaskSetQueue

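Sorts all TaskSetManagers in the current Pool according to the scheduling algorithm and returns them in sorted order. It first sorts the Schedulables in schedulableQueue with the algorithm's comparator, then calls getSortedTaskSetQueue on each of them in turn and concatenates the results.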

override def getSortedTaskSetQueue: ArrayBuffer[TaskSetManager] = {
  var sortedTaskSetQueue = new ArrayBuffer[TaskSetManager]
  val sortedSchedulableQueue =
    schedulableQueue.asScala.toSeq.sortWith(taskSetSchedulingAlgorithm.comparator)
  for (schedulable <- sortedSchedulableQueue) {
    sortedTaskSetQueue ++= schedulable.getSortedTaskSetQueue
  }
  sortedTaskSetQueue
}
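
The effect of this recursion is a sorted, depth-first flattening of the hierarchy. The sketch below illustrates it with hypothetical types (SchedNode, TaskSetLeaf, PoolGroup are not Spark names). It assumes that a leaf returns just itself and, for brevity, orders children by priority alone instead of a full SchedulingAlgorithm.

```scala
// Hypothetical miniature of the Schedulable hierarchy.
sealed trait SchedNode {
  def name: String
  def priority: Int
  def sortedLeaves: Vector[String]
}

// Stand-in for TaskSetManager: assumed to return only itself.
final case class TaskSetLeaf(name: String, priority: Int) extends SchedNode {
  def sortedLeaves: Vector[String] = Vector(name)
}

// Stand-in for Pool: sort the direct children, then flatten each child in order.
final case class PoolGroup(name: String, priority: Int, children: Vector[SchedNode])
    extends SchedNode {
  def sortedLeaves: Vector[String] =
    children.sortWith(_.priority < _.priority).flatMap(_.sortedLeaves)
}
```

Note that sorting happens level by level: a leaf inside a low-ranked child pool is never scheduled ahead of the leaves of a higher-ranked sibling, whatever its own priority.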

2.7 increaseRunningTasks

Increases the number of running tasks recorded in the current Pool and, recursively, in its parent Pools up to the root.

def increaseRunningTasks(taskNum: Int) {
  runningTasks += taskNum
  if (parent != null) {
    parent.increaseRunningTasks(taskNum)
  }
}

2.8 decreaseRunningTasks

Decreases the number of running tasks recorded in the current Pool and, recursively, in its parent Pools up to the root.

def decreaseRunningTasks(taskNum: Int) {
  runningTasks -= taskNum
  if (parent != null) {
    parent.decreaseRunningTasks(taskNum)
  }
}
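
Because both methods forward the change to the parent, the counter of any pool equals the sum over its whole subtree. A minimal sketch of that propagation follows; CounterPool is a hypothetical class, and it models the parent as an Option where the code above uses a nullable reference.

```scala
// Hypothetical pool that only tracks the running-task counter.
final class CounterPool(val name: String, val parent: Option[CounterPool]) {
  private var running = 0

  def runningTasks: Int = running

  // Update this pool, then propagate the same delta to every ancestor.
  def increaseRunningTasks(taskNum: Int): Unit = {
    running += taskNum
    parent.foreach(_.increaseRunningTasks(taskNum))
  }

  def decreaseRunningTasks(taskNum: Int): Unit = {
    running -= taskNum
    parent.foreach(_.decreaseRunningTasks(taskNum))
  }
}
```

This keeps runningTasks available at every level, which is what the fair comparator reads when it compares two sibling pools.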

3 Pool builders

The trait SchedulableBuilder defines the contract for pool builders:

//org.apache.spark.scheduler.SchedulableBuilder
private[spark] trait SchedulableBuilder {
  def rootPool: Pool
  def buildPools(): Unit
  def addTaskSetManager(manager: Schedulable, properties: Properties): Unit
}

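The code above defines three methods:

  • rootPool: returns the root pool
  • buildPools: builds the pools
  • addTaskSetManager: adds a TaskSetManager to the pool

Matching the two scheduling algorithms, FIFO and Fair, SchedulableBuilder has two implementations: FIFOSchedulableBuilder and FairSchedulableBuilder.

3.1 FIFOSchedulableBuilder in detail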

private[spark] class FIFOSchedulableBuilder(val rootPool: Pool)
  extends SchedulableBuilder with Logging {
  override def buildPools() {
    // nothing
  }
  override def addTaskSetManager(manager: Schedulable, properties: Properties) {
    rootPool.addSchedulable(manager)
  }
}

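FIFOSchedulableBuilder's buildPools method is empty, and its addTaskSetManager method adds the TaskSetManager directly to the root pool. The in-memory structure of the pool built by FIFOSchedulableBuilder is shown in the following figure:

[Figure: in-memory structure of the pool built by FIFOSchedulableBuilder]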

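3.2 FairSchedulableBuilder in detail

FairSchedulableBuilder is more complex. To make it easier to analyze, we start with its properties:

  • rootPool: the root pool; a constructor parameter of FairSchedulableBuilder
  • conf: the SparkConf
  • schedulerAllocFile: the scheduler allocation file that the user specifies in the file system. It is configured through the spark.scheduler.allocation.file property, and FairSchedulableBuilder reads the fair scheduling configuration from this file
  • DEFAULT_SCHEDULER_FILE: the default scheduler file name. The constant is fixed to "fairscheduler.xml", and FairSchedulableBuilder reads the fair scheduling configuration from this file on the classpath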

[Figure: in-memory structure of the pool built by FairSchedulableBuilder]