Table of Contents
- Container Resource Request and Allocation
  - 1. ApplicationMaster.createAllocator()
  - 2. YarnAllocator
    - 1) allocateResources()
    - 2) updateResourceRequests()
      - splitPendingAllocationsByLocality()
      - requestTotalExecutorsWithPreferredLocalities()
      - localityOfRequestedContainers()
    - 3) handleAllocatedContainers()
- Launching Executors
- References
The previous post, Spark on YARN: SparkSubmit initialization and ApplicationMaster startup and registration, already covered what happens right after a Spark on YARN application is submitted. This post analyzes what happens after the ApplicationMaster has started and registered: how container resources are requested from and allocated by YARN, and how the executors are launched.
As before, the source-code call-flow diagram from the previous article is included here so you can follow along.
Container Resource Request and Allocation
1. ApplicationMaster.createAllocator()
After the ApplicationMaster registers with the driver, it creates a YarnAllocator instance that requests containers from the YARN ResourceManager and decides what to do with the containers YARN hands back.
private def createAllocator(driverRef: RpcEndpointRef, _sparkConf: SparkConf): Unit = {
  // Get the YARN ApplicationId; mainly used for logging and metrics
  val appId = client.getAttemptId().getApplicationId().toString()
  // Build the address of the driver's "CoarseGrainedScheduler" RPC endpoint
  val driverUrl = RpcEndpointAddress(driverRef.address.host, driverRef.address.port,
    CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString
  ...
  // Create a YarnAllocator instance through YarnRMClient (Spark's client for the YARN ResourceManager)
  allocator = client.createAllocator(
    yarnConf,
    _sparkConf,
    driverUrl,
    driverRef,
    securityMgr,
    localResources)

  // credentialRenewer is an AMCredentialRenewer instance that periodically renews the tokens the
  // application needs. New tokens are created once the original tokens have passed 75% of their
  // renewal interval. Once the ApplicationMaster has registered with the driver, newly created
  // tokens are sent to the driver endpoint.
  credentialRenewer.foreach(_.setDriverRef(driverRef))

  // Initialize the AM endpoint *after* the YarnAllocator has been initialized. This ensures that
  // when the driver sends an initial executor request, the YarnAllocator is ready to service it.
  // AMEndpoint is an inner class of ApplicationMaster used to communicate with the driver.
  rpcEnv.setupEndpoint("YarnAM", new AMEndpoint(rpcEnv, driverRef))

  // This is the key step in this method: request resources from YARN
  allocator.allocateResources()

  // The rest is metrics-related setup
  val ms = MetricsSystem.createMetricsSystem("applicationMaster", sparkConf, securityMgr)
  val prefix = _sparkConf.get(YARN_METRICS_NAMESPACE).getOrElse(appId)
  ms.registerSource(new ApplicationMasterSource(prefix, allocator))
  // do not register static sources in this case as per SPARK-25277
  ms.start(false)
  metricsSystem = Some(ms)
  reporterThread = launchReporterThread()
}
2. YarnAllocator
YarnAllocator is responsible for requesting containers from the YARN ResourceManager and deciding what to do with the containers YARN grants. Most of its work goes through the AMRMClient APIs, and it interacts with AMRMClient in three ways (a minimal sketch of this protocol follows the list):
- Telling the AMRMClient which container resources are needed, and updating the local bookkeeping about those requests.
- Calling AMRMClient.allocate(), which syncs the locally stored container requests to the ResourceManager and returns the containers YARN has granted us. This call also doubles as a heartbeat.
- Processing the containers YARN granted and launching executors in them.
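To make these three interactions concrete, here is a minimal, self-contained sketch of the raw AMRMClient protocol that YarnAllocator wraps. This is not Spark code; the resource size, priority, container count, and loop structure are illustrative assumptions.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import scala.collection.JavaConverters._

object AmRmLoopSketch {
  def main(args: Array[String]): Unit = {
    val amClient = AMRMClient.createAMRMClient[ContainerRequest]()
    amClient.init(new Configuration())
    amClient.start()
    // (registerApplicationMaster would normally be called here.)

    // 1. Tell the AMRMClient what we need: 2 containers, 1 core / 1024 MB each (illustrative values)
    val capability = Resource.newInstance(1024, 1)
    val priority = Priority.newInstance(1)
    (1 to 2).foreach { _ =>
      amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority))
    }

    // 2. allocate() syncs pending requests to the ResourceManager and doubles as a heartbeat
    val response = amClient.allocate(0.1f)

    // 3. Process whatever YARN granted this round (Spark would launch executors here)
    response.getAllocatedContainers.asScala.foreach { container =>
      println(s"Granted container ${container.getId} on ${container.getNodeId.getHost}")
    }
  }
}
In YarnAllocator these three steps roughly correspond to updateResourceRequests(), amClient.allocate(progressIndicator), and handleAllocatedContainers(), respectively.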
The main functions in YarnAllocator:
- requestTotalExecutorsWithPreferredLocalities(…): request as many executors from the ResourceManager as needed to reach the desired total.
- allocateResources(): request resources; if YARN fulfills all of the requests, we get back containers for the requested number of executors and launch executors in them.
- updateResourceRequests(): update the number of containers requested from the ResourceManager, based on the number of running executors and the total number of executors we want.
- handleAllocatedContainers(allocatedContainers: Seq[Container]): handle the containers granted by the ResourceManager and launch executors in them.
- matchContainerToRequest(…): look for an outstanding request at the given location that matches the given container and, if one is found, remove it.
- runAllocatedContainers(containersToUse: ArrayBuffer[Container]): launch executors in the allocated containers.
- processCompletedContainers(completedContainers: Seq[ContainerStatus]): process containers that have completed (or failed).
- splitPendingAllocationsByLocality(…): split the pending container requests into three groups based on the locality of the currently queued tasks.
1) allocateResources()
Let's focus on the allocator.allocateResources() call from the code above, i.e. allocateResources() in the YarnAllocator class. It requests resources from the ResourceManager; if YARN fulfills all of the requests, we get containers for the requested number of executors and launch executors in them.
def allocateResources(): Unit = synchronized {
  // Update the number of containers to request from the ResourceManager, based on the number of
  // running executors and the total number of executors we want
  updateResourceRequests()

  val progressIndicator = 0.1f
  // Request containers from the ResourceManager and get back the allocation response
  val allocateResponse = amClient.allocate(progressIndicator)
  // The list of containers the ResourceManager granted us
  val allocatedContainers = allocateResponse.getAllocatedContainers()
  // Blacklisted-node tracking
  allocatorBlacklistTracker.setNumClusterNodes(allocateResponse.getNumClusterNodes)

  // If we were granted any containers, process them
  if (allocatedContainers.size > 0) {
    ...
    // Handle the containers granted by the ResourceManager and launch executors in them
    handleAllocatedContainers(allocatedContainers.asScala)
  }

  // The list of completed containers (which may also include failed ones)
  val completedContainers = allocateResponse.getCompletedContainersStatuses()
  if (completedContainers.size > 0) {
    logDebug("Completed %d containers".format(completedContainers.size))
    // Process the completed containers
    processCompletedContainers(completedContainers.asScala)
    logDebug("Finished processing %d completed containers. Current running executor count: %d."
      .format(completedContainers.size, runningExecutors.size))
  }
}
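Note that this is not the only call site: the reporter thread started by launchReporterThread() at the end of createAllocator() keeps invoking allocateResources() as the AM's heartbeat loop. Below is a heavily simplified, illustrative outline of that loop; failure counting, interval back-off and clean AM shutdown are all omitted, so treat it as an assumption-laden sketch rather than the actual Spark implementation.
object ReporterLoopSketch {
  // Illustrative outline of the ApplicationMaster reporter loop (not the Spark source)
  def launchReporterThread(allocate: () => Unit, heartbeatIntervalMs: Long): Thread = {
    val t = new Thread("Reporter") {
      override def run(): Unit = {
        try {
          while (!Thread.currentThread().isInterrupted) {
            // Each heartbeat re-syncs container requests and picks up newly granted containers
            allocate()
            Thread.sleep(heartbeatIntervalMs)
          }
        } catch {
          case _: InterruptedException => // stop on interrupt
        }
      }
    }
    t.setDaemon(true)
    t.start()
    t
  }
}
In the real code the loop body calls allocator.allocateResources(), and the heartbeat interval is controlled by spark.yarn.scheduler.heartbeat.interval-ms.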
2) updateResourceRequests()
The core of YARN container requesting lives in updateResourceRequests().
updateResourceRequests() synchronizes the number of containers requested from the ResourceManager with the number of executors currently running and the total number of executors we want.
The flow inside updateResourceRequests():
- First, based on the hosts where pending tasks want to run, the pending container requests are split into three groups:
  - locality-matched requests: requests whose preferred hosts include a node where pending tasks want to run;
  - locality-unmatched ("stale") requests: requests whose preferred hosts no longer match any node where pending tasks want to run;
  - locality-free requests: requests with no node preference at all.
- The requests in the latter two groups are cancelled and re-issued; locality is then recomputed according to the container placement strategy, in order to maximize the number of tasks that run locally.
def updateResourceRequests(): Unit = {
  // Pending (not yet fulfilled) container requests
  val pendingAllocate = getPendingAllocate
  val numPendingAllocate = pendingAllocate.size

  // missing: how many more executors we still need
  // targetNumExecutors: total number of executors to request; without dynamic allocation this is
  //   the executor count specified at submit time (--num-executors)
  // numExecutorsStarting: number of executors currently starting
  // runningExecutors.size: number of executors currently running
  val missing = targetNumExecutors - numPendingAllocate -
    numExecutorsStarting.get - runningExecutors.size

  logDebug(s"Updating resource requests, target: $targetNumExecutors, " +
    s"pending: $numPendingAllocate, running: ${runningExecutors.size}, " +
    s"executorsStarting: ${numExecutorsStarting.get}")

  // Split the pending container requests into three groups: locality matched, locality unmatched
  // and locality free.
  // Locality-matched requests are kept: containers are expected on the corresponding nodes and
  // are treated as already-placed containers.
  // The other two groups are cancelled and their placement recomputed, because their locality
  // preference can no longer be satisfied.
  val (localRequests, staleRequests, anyHostRequests) = splitPendingAllocationsByLocality(
    hostToLocalTaskCounts, pendingAllocate)

  if (missing > 0) {
    logInfo(s"Will request $missing executor container(s), each with " +
      s"${resource.getVirtualCores} core(s) and " +
      s"${resource.getMemory} MB memory (including $memoryOverhead MB of overhead)")

    // Cancel the locality-unmatched ("stale") requests. They have already been sent to the
    // ResourceManager, so they must be removed explicitly.
    staleRequests.foreach { stale =>
      amClient.removeContainerRequest(stale)
    }
    val cancelledContainers = staleRequests.size
    if (cancelledContainers > 0) {
      logInfo(s"Canceled $cancelledContainers container request(s) (locality no longer needed)")
    }

    // Number of containers we are free to (re)request
    val availableContainers = missing + cancelledContainers
    // To maximize task locality, also take the locality-free requests into account
    val potentialContainers = availableContainers + anyHostRequests.size

    // Recompute node locality and rack locality (other nodes in the same rack) for each container
    val containerLocalityPreferences = containerPlacementStrategy.localityOfRequestedContainers(
      potentialContainers, numLocalityAwareTasks, hostToLocalTaskCounts,
      allocatedHostToContainersMap, localRequests)

    // Build new container requests from the computed locality preferences
    val newLocalityRequests = new mutable.ArrayBuffer[ContainerRequest]
    containerLocalityPreferences.foreach {
      case ContainerLocalityPreferences(nodes, racks) if nodes != null =>
        newLocalityRequests += createContainerRequest(resource, nodes, racks)
      case _ =>
    }

    if (availableContainers >= newLocalityRequests.size) {
      // The available containers can satisfy all new locality-aware requests;
      // more containers are available than needed for locality, fill in requests for any host
      for (i <- 0 until (availableContainers - newLocalityRequests.size)) {
        newLocalityRequests += createContainerRequest(resource, null, null)
      }
    } else {
      // The available containers cannot satisfy all new locality-aware requests;
      // cancel some requests without locality preferences to schedule more local containers
      val numToCancel = newLocalityRequests.size - availableContainers
      anyHostRequests.slice(0, numToCancel).foreach { nonLocal =>
        amClient.removeContainerRequest(nonLocal)
      }
      if (numToCancel > 0) {
        logInfo(s"Canceled $numToCancel unlocalized container requests to resubmit with locality")
      }
    }

    // Re-add the container requests
    newLocalityRequests.foreach { request =>
      amClient.addContainerRequest(request)
    }
    ...
  } else if (numPendingAllocate > 0 && missing < 0) {
    val numToCancel = math.min(numPendingAllocate, -missing)
    logInfo(s"Canceling requests for $numToCancel executor container(s) to have a new desired " +
      s"total $targetNumExecutors executors.")
    // cancel pending allocate requests by taking locality preference into account
    val cancelRequests = (staleRequests ++ anyHostRequests ++ localRequests).take(numToCancel)
    cancelRequests.foreach(amClient.removeContainerRequest)
  }
}
updateResourceRequests() is only the coarse-grained skeleton. For the finer details we mainly need to look at these three functions:
- splitPendingAllocationsByLocality(…)
- requestTotalExecutorsWithPreferredLocalities(…)
- localityOfRequestedContainers()
splitPendingAllocationsByLocality()
splitPendingAllocationsByLocality() splits the pending container requests into three groups, based on the locality of the pending tasks:
- Requests that match a preferred host, i.e. a container can be allocated on a host where a task wants to run. In my understanding these correspond to the PROCESS_LOCAL and NODE_LOCAL data-locality levels.
- Requests that do not match any preferred host, i.e. no container can be allocated on a host the tasks prefer. These should correspond to RACK_LOCAL and ANY.
- Requests with no locality requirement at all. These should correspond to NO_PREF, meaning the data can be accessed equally well from any node.
The latter two kinds of requests are cancelled and their container placement is recomputed.
private def splitPendingAllocationsByLocality(
    hostToLocalTaskCount: Map[String, Int],   // preferred hosts and the number of tasks that want to run on them
    pendingAllocations: Seq[ContainerRequest] // pending container requests, fetched at the ANY_HOST location
  ): (Seq[ContainerRequest], Seq[ContainerRequest], Seq[ContainerRequest]) = {
  val localityMatched = ArrayBuffer[ContainerRequest]()
  val localityUnMatched = ArrayBuffer[ContainerRequest]()
  val localityFree = ArrayBuffer[ContainerRequest]()

  val preferredHosts = hostToLocalTaskCount.keySet
  pendingAllocations.foreach { cr =>
    val nodes = cr.getNodes
    if (nodes == null) {
      // request with no locality preference
      localityFree += cr
    } else if (nodes.asScala.toSet.intersect(preferredHosts).nonEmpty) {
      // non-empty intersection: the request matches a preferred host
      localityMatched += cr
    } else {
      // the request does not match any preferred host
      localityUnMatched += cr
    }
  }

  (localityMatched.toSeq, localityUnMatched.toSeq, localityFree.toSeq)
}
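As a quick sanity check on the grouping rule, here is a standalone sketch (not Spark code) that builds three ContainerRequests and splits them the same way: one that hits a preferred host, one whose hosts are all non-preferred, and one with no hosts at all. The host names and resource sizes are made up.
import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import scala.collection.JavaConverters._

object LocalitySplitSketch {
  def main(args: Array[String]): Unit = {
    val capability = Resource.newInstance(1024, 1)
    val priority = Priority.newInstance(1)
    // Hosts where pending tasks would like to run
    val preferredHosts = Set("host1", "host2")

    val requests = Seq(
      new ContainerRequest(capability, Array("host1", "host3"), null, priority), // matches host1
      new ContainerRequest(capability, Array("host7"), null, priority),          // no preferred host
      new ContainerRequest(capability, null, null, priority)                     // no locality at all
    )

    // Same grouping rule as splitPendingAllocationsByLocality()
    val (free, nonFree) = requests.partition(_.getNodes == null)
    val (matched, unmatched) = nonFree.partition { cr =>
      cr.getNodes.asScala.toSet.intersect(preferredHosts).nonEmpty
    }
    println(s"matched=${matched.size}, unmatched=${unmatched.size}, free=${free.size}") // 1, 1, 1
  }
}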
requestTotalExecutorsWithPreferredLocalities()
requestTotalExecutorsWithPreferredLocalities() requests as many executors from the ResourceManager as needed to reach the desired total. If the requested total is smaller than the number of executors currently running, no executors will be killed.
def requestTotalExecutorsWithPreferredLocalities(
    requestedTotal: Int,
    localityAwareTasks: Int,
    hostToLocalTaskCount: Map[String, Int],
    nodeBlacklist: Set[String]): Boolean = synchronized {
  this.numLocalityAwareTasks = localityAwareTasks
  this.hostToLocalTaskCounts = hostToLocalTaskCount

  if (requestedTotal != targetNumExecutors) {
    logInfo(s"Driver requested a total number of $requestedTotal executor(s).")
    // Only the target is updated here; the actual requests are (re)issued on the next
    // updateResourceRequests() / allocate() cycle
    targetNumExecutors = requestedTotal
    allocatorBlacklistTracker.setSchedulerBlacklistedNodes(nodeBlacklist)
    true
  } else {
    false
  }
}
localityOfRequestedContainers()
localityOfRequestedContainers() is a method of LocalityPreferredContainerPlacementStrategy that computes the node-locality and rack-locality preferences of each container to request.
- LocalityPreferredContainerPlacementStrategy: the locality-preferred container placement strategy. It computes the optimal locality of the YARN containers by considering the ratio of pending tasks per node, the required cores/containers, and the locality of the containers that already exist or are pending. The goal of the algorithm is to maximize the number of tasks that run locally.
Consider this scenario: 20 tasks prefer host1, host2 and host3, and 10 tasks prefer host1, host2 and host4; each container has 2 cores and each task uses 1 cpu, so 15 containers are needed in total. The per-host task count is (host1: 30, host2: 30, host3: 20, host4: 10), i.e. a ratio of 3 : 3 : 2 : 1.
- 1. If the number of containers requested (18) is larger than the number needed (15), the requests are split as follows:
  - 5 containers requested with nodes (host1, host2, host3, host4);
  - 5 containers requested with nodes (host1, host2, host3);
  - 5 containers requested with nodes (host1, host2);
  - the remaining 3 containers have no locality preference.
  The overall placement ratio is 3 : 3 : 2 : 1.
- 2. If the number of containers requested (10) is smaller than the number needed (15):
  - 4 containers requested with nodes (host1, host2, host3, host4);
  - 3 containers requested with nodes (host1, host2, host3);
  - 3 containers requested with nodes (host1, host2).
  The overall placement ratio is 10 : 10 : 7 : 4, which is close to 3 : 3 : 2 : 1.
- 3. If containers already exist but none of them matches the requested locality, allocation follows the two rules above.
- 4. If containers already exist and some of them match the requested locality.
  For example, if each node already holds 1 matching container (host1: 1, host2: 1, host3: 1, host4: 1) while the expected containers per node are (host1: 5, host2: 5, host3: 4, host4: 2), then the newly requested containers per node become (host1: 4, host2: 4, host3: 3, host4: 1).
  - If the number of containers requested (18) is larger than the newly required number (4+4+3+1=12), rule 1 above applies, with ratio 4 : 4 : 3 : 1.
  - If the number of containers requested (10) is smaller than the newly required number (4+4+3+1=12), rule 2 above applies.
- 5. If containers already exist and their locality fully covers the required locality, for example 5 containers on every node (host1: 5, host2: 5, host3: 5, host4: 5), the current request's locality is already satisfied and the new containers are requested without locality preferences.
def localityOfRequestedContainers(
    numContainer: Int,                      // number of containers to compute placement for
    numLocalityAwareTasks: Int,             // number of locality-aware (pending) tasks
    hostToLocalTaskCount: Map[String, Int], // preferred host -> number of tasks that want to run there
    // host -> containers already allocated on it; used to derive the expected locality
    allocatedHostToContainersMap: HashMap[String, Set[ContainerId]],
    // pending container requests whose locality matches the current tasks
    localityMatchedPendingAllocations: Seq[ContainerRequest]
  ): Array[ContainerLocalityPreferences] = {
  // Using the already-allocated containers, compute how many containers we expect per host
  val updatedHostToContainerCount = expectedHostToContainerCount(
    numLocalityAwareTasks, hostToLocalTaskCount, allocatedHostToContainersMap,
    localityMatchedPendingAllocations)
  val updatedLocalityAwareContainerNum = updatedHostToContainerCount.values.sum

  // The containers to request are split into two groups: locality-aware and locality-free
  val requiredLocalityFreeContainerNum =
    math.max(0, numContainer - updatedLocalityAwareContainerNum)
  val requiredLocalityAwareContainerNum = numContainer - requiredLocalityFreeContainerNum

  // Collect the locality preference of each container to request
  val containerLocalityPreferences = ArrayBuffer[ContainerLocalityPreferences]()
  if (requiredLocalityFreeContainerNum > 0) {
    for (i <- 0 until requiredLocalityFreeContainerNum) {
      containerLocalityPreferences += ContainerLocalityPreferences(
        null.asInstanceOf[Array[String]], null.asInstanceOf[Array[String]])
    }
  }

  if (requiredLocalityAwareContainerNum > 0) {
    val largestRatio = updatedHostToContainerCount.values.max
    var preferredLocalityRatio = updatedHostToContainerCount.map { case (k, ratio) =>
      val adjustedRatio = ratio.toDouble * requiredLocalityAwareContainerNum / largestRatio
      (k, adjustedRatio.ceil.toInt)
    }

    for (i <- 0 until requiredLocalityAwareContainerNum) {
      // Only keep hosts whose ratio is still greater than 0, which means that host can still
      // take another newly requested container
      val hosts = preferredLocalityRatio.filter(_._2 > 0).keys.toArray
      val racks = hosts.map { h =>
        resolver.resolve(yarnConf, h)
      }.toSet
      containerLocalityPreferences += ContainerLocalityPreferences(hosts, racks.toArray)

      // Decrement the ratio of every host that was used; once a host's ratio reaches 0, its
      // expected containers have all been requested
      preferredLocalityRatio = preferredLocalityRatio.map { case (k, v) => (k, v - 1) }
    }
  }

  containerLocalityPreferences.toArray
}
3) handleAllocatedContainers()
handleAllocatedContainers() launches executors in the containers granted by the ResourceManager.
def handleAllocatedContainers(allocatedContainers: Seq[Container]): Unit = {
  val containersToUse = new ArrayBuffer[Container](allocatedContainers.size)

  // Match incoming containers against requests on host locality
  val remainingAfterHostMatches = new ArrayBuffer[Container]
  for (allocatedContainer <- allocatedContainers) {
    // Look for an outstanding request at the container's host. If one exists, remove it (so it is
    // not submitted again) and put the container into the list of containers to use.
    matchContainerToRequest(allocatedContainer, allocatedContainer.getNodeId.getHost,
      containersToUse, remainingAfterHostMatches)
  }

  // Match the remaining containers on rack locality
  val remainingAfterRackMatches = new ArrayBuffer[Container]
  if (remainingAfterHostMatches.nonEmpty) {
    var exception: Option[Throwable] = None
    // Why a separate thread: when the SparkContext is shut down, the YarnAllocator thread is
    // interrupted. If the interrupt arrives at the wrong moment, errors like
    // "java.io.IOException: java.lang.InterruptedException..." appear, meaning the YARN code
    // being called (RackResolver) swallows the interrupt, so the Spark YarnAllocator thread
    // never exits. In that situation the allocator keeps allocating lots of executors and the
    // application appears to hang, with executors still coming up even though the SparkContext
    // has already been shut down.
    val thread = new Thread("spark-rack-resolver") {
      override def run(): Unit = {
        try {
          for (allocatedContainer <- remainingAfterHostMatches) {
            val rack = resolver.resolve(conf, allocatedContainer.getNodeId.getHost)
            matchContainerToRequest(allocatedContainer, rack, containersToUse,
              remainingAfterRackMatches)
          }
        } catch {
          case e: Throwable =>
            exception = Some(e)
        }
      }
    }
    thread.setDaemon(true)
    thread.start()

    try {
      thread.join()
    } catch {
      case e: InterruptedException =>
        thread.interrupt()
        throw e
    }

    if (exception.isDefined) {
      throw exception.get
    }
  }

  // Assign the remaining containers that matched neither a host nor a rack
  val remainingAfterOffRackMatches = new ArrayBuffer[Container]
  for (allocatedContainer <- remainingAfterRackMatches) {
    // ANY_HOST ("*") matches requests regardless of locality (off-rack placement)
    matchContainerToRequest(allocatedContainer, ANY_HOST, containersToUse,
      remainingAfterOffRackMatches)
  }

  // Release containers we do not need
  if (!remainingAfterOffRackMatches.isEmpty) {
    logDebug(s"Releasing ${remainingAfterOffRackMatches.size} unneeded containers that were " +
      s"allocated to us")
    for (container <- remainingAfterOffRackMatches) {
      internalReleaseContainer(container)
    }
  }

  // Launch executors in the containers we are going to use
  runAllocatedContainers(containersToUse)

  logInfo("Received %d containers from YARN, launching executors on %d of them."
    .format(allocatedContainers.size, containersToUse.size))
}
Launching Executors
Executors are also launched from the YarnAllocator class, by runAllocatedContainers(). This function iterates over all usable containers and, for each one, submits a Runnable to a cached thread pool.
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = {
  for (container <- containersToUse) {
    executorIdCounter += 1
    val executorHostname = container.getNodeId.getHost
    val containerId = container.getId
    val executorId = executorIdCounter.toString
    assert(container.getResource.getMemory >= resource.getMemory)
    logInfo(s"Launching container $containerId on host $executorHostname " +
      s"for executor with ID $executorId")

    def updateInternalState(): Unit = synchronized {
      runningExecutors.add(executorId)
      numExecutorsStarting.decrementAndGet()
      executorIdToContainer(executorId) = container
      containerIdToExecutorId(container.getId) = executorId
      val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname,
        new HashSet[ContainerId])
      containerSet += containerId
      allocatedContainerToHostMap.put(containerId, executorHostname)
    }

    // Only launch if the number of running executors is below the target
    if (runningExecutors.size() < targetNumExecutors) {
      // Increment the count of executors that are starting
      numExecutorsStarting.incrementAndGet()
      if (launchContainers) {
        // launcherPool is simply a ThreadPoolExecutor (a cached thread pool)
        launcherPool.execute(new Runnable {
          override def run(): Unit = {
            try {
              // This is where the executor is actually started.
              // ExecutorRunnable launches an executor process on the target node with the
              // specified resources, using a separate /bin/java command.
              new ExecutorRunnable(
                Some(container),
                conf,
                sparkConf,
                driverUrl,
                executorId,
                executorHostname,
                executorMemory,
                executorCores,
                appAttemptId.getApplicationId.toString,
                securityMgr,
                localResources
              ).run()
              updateInternalState()
            } catch {
              case e: Throwable =>
                numExecutorsStarting.decrementAndGet()
                if (NonFatal(e)) {
                  logError(s"Failed to launch executor $executorId on container $containerId", e)
                  // Assigned container should be released immediately
                  // to avoid unnecessary resource occupation.
                  amClient.releaseAssignedContainer(containerId)
                } else {
                  throw e
                }
            }
          }
        })
      } else {
        // For test only
        updateInternalState()
      }
    } else {
      logInfo(("Skip launching executorRunnable as running executors count: %d " +
        "reached target executors count: %d.").format(
        runningExecutors.size, targetNumExecutors))
    }
  }
}
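ExecutorRunnable itself is not reproduced here. Conceptually it builds a ContainerLaunchContext whose command starts org.apache.spark.executor.CoarseGrainedExecutorBackend inside the container, and hands it to the NodeManager through an NMClient. The following stripped-down sketch of that flow uses plain YARN APIs; the command line, environment, local resources and token handling are greatly simplified compared with the real ExecutorRunnable, and the JVM options are placeholders.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.api.records.{Container, ContainerLaunchContext}
import org.apache.hadoop.yarn.client.api.NMClient
import org.apache.hadoop.yarn.util.Records
import scala.collection.JavaConverters._

object ExecutorLaunchSketch {
  // Stripped-down sketch of what ExecutorRunnable.run() does (illustrative, not the Spark source)
  def launchExecutor(
      container: Container,
      yarnConf: Configuration,
      driverUrl: String,
      executorId: String,
      executorMemoryMb: Int,
      executorCores: Int,
      appId: String): Unit = {
    // Talk to the NodeManager that owns the container
    val nmClient = NMClient.createNMClient()
    nmClient.init(yarnConf)
    nmClient.start()

    // The command run inside the container: a plain JVM hosting the executor backend
    val command = Seq(
      s"{{JAVA_HOME}}/bin/java -server -Xmx${executorMemoryMb}m",
      "org.apache.spark.executor.CoarseGrainedExecutorBackend",
      s"--driver-url $driverUrl",
      s"--executor-id $executorId",
      s"--hostname ${container.getNodeId.getHost}",
      s"--cores $executorCores",
      s"--app-id $appId",
      "1><LOG_DIR>/stdout 2><LOG_DIR>/stderr"
    ).mkString(" ")

    val ctx = Records.newRecord(classOf[ContainerLaunchContext])
    ctx.setCommands(List(command).asJava)
    // (localResources, environment variables and security tokens are set here in the real code.)

    // Ask the NodeManager to start the container; the executor process then registers back with
    // the driver's CoarseGrainedScheduler endpoint via driverUrl
    nmClient.startContainer(container, ctx)
  }
}
Once the backend process comes up in the container, it connects to the driverUrl built back in createAllocator() and registers itself with the driver, at which point the executor can start receiving tasks.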