SparkContext is the sole entry point to Spark and the bridge between Spark's upper-layer applications and its underlying implementation, so its importance is self-evident. It is also the first step in my study of the Spark source code.
The sequence diagram borrowed from another blogger's post shows the SparkContext execution flow clearly.
During initialization, SparkContext mainly creates the following components:
- SparkEnv
- DAGScheduler
- TaskScheduler
- SchedulerBackend
- WebUI
The most important parameter of SparkContext is SparkConf. As the source shows, the conf inside SparkContext is obtained by calling clone() on the passed-in config and is then validated in various ways.
```scala
try {
  _conf = config.clone()
  _conf.validateSettings()

  if (!_conf.contains("spark.master")) {
    throw new SparkException("A master URL must be set in your configuration")
  }
  if (!_conf.contains("spark.app.name")) {
    throw new SparkException("An application name must be set in your configuration")
  }

  // System property spark.yarn.app.id must be set if user code ran by AM on a YARN cluster
  // yarn-standalone is deprecated, but still supported
  if ((master == "yarn-cluster" || master == "yarn-standalone") &&
      !_conf.contains("spark.yarn.app.id")) {
    throw new SparkException("Detected yarn-cluster mode, but isn't running on a cluster. " +
      "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
  }

  if (_conf.getBoolean("spark.logConf", false)) {
    logInfo("Spark configuration:\n" + _conf.toDebugString)
  }

  // Set Spark driver host and port system properties
  _conf.setIfMissing("spark.driver.host", Utils.localHostName())
  _conf.setIfMissing("spark.driver.port", "0")

  _conf.set("spark.executor.id", SparkContext.DRIVER_IDENTIFIER)
```
Step 1: Create SparkEnv
SparkEnv is Spark's execution environment object, holding the many objects that Executor execution depends on. In local mode the Driver creates the Executor itself; in local-cluster or Standalone mode, the Executor is created inside the CoarseGrainedExecutorBackend process spawned by the Worker. Consequently, a SparkEnv instance lives in either the Driver process or a CoarseGrainedExecutorBackend process.
```scala
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)
```

```scala
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
}
```
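Note that createSparkEnv passes SparkContext.numDriverCores(master), which derives the driver's core count from the master string. Roughly paraphrased (the real method handles a few more patterns):

```scala
// Paraphrased sketch of SparkContext.numDriverCores: in local mode the
// driver also runs tasks, so it needs real cores; on a cluster it gets 0.
def numDriverCores(master: String): Int = {
  val LocalN = """local\[([0-9]+|\*)\]""".r // mirrors SparkMasterRegex.LOCAL_N_REGEX
  master match {
    case "local"   => 1
    case LocalN(n) => if (n == "*") Runtime.getRuntime.availableProcessors() else n.toInt
    case _         => 0 // cluster modes: driver cores are not used to execute tasks
  }
}
```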
SparkEnv.createDriverEnv() ultimately calls SparkEnv.create():
```scala
private def create(
    conf: SparkConf,
    executorId: String,
    hostname: String,
    port: Int,
    isDriver: Boolean,
    isLocal: Boolean,
    numUsableCores: Int,
    listenerBus: LiveListenerBus = null,
    mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
  ...
  val envInstance = new SparkEnv(
    executorId,
    rpcEnv,
    actorSystem,
    serializer,
    closureSerializer,
    cacheManager,
    mapOutputTracker,
    shuffleManager,
    broadcastManager,
    blockTransferService,
    blockManager,
    securityManager,
    sparkFilesDir,
    metricsSystem,
    memoryManager,
    outputCommitCoordinator,
    conf)

  if (isDriver) {
    envInstance.driverTmpDirToDelete = Some(sparkFilesDir)
  }

  envInstance
}
```
Its sole purpose is to return a SparkEnv instance; the large amount of intermediate work merely prepares the constructor arguments. So let's first look at the SparkEnv constructor parameters.
```scala
class SparkEnv (
    val executorId: String,
    private[spark] val rpcEnv: RpcEnv,
    _actorSystem: ActorSystem, // TODO Remove actorSystem
    val serializer: Serializer,
    val closureSerializer: Serializer,
    val cacheManager: CacheManager,
    val mapOutputTracker: MapOutputTracker,
    val shuffleManager: ShuffleManager,
    val broadcastManager: BroadcastManager,
    val blockTransferService: BlockTransferService,
    val blockManager: BlockManager,
    val securityManager: SecurityManager,
    val sparkFilesDir: String,
    val metricsSystem: MetricsSystem,
    val memoryManager: MemoryManager,
    val outputCommitCoordinator: OutputCommitCoordinator,
    val conf: SparkConf) extends Logging {
```
Let's look at what these parameters are used for (a short access sketch follows the list):
- rpcEnv: the network communication layer, Netty by default; it is fairly involved and will be analyzed separately later
- actorSystem: a foundational facility in Spark, used both for sending distributed messages and for concurrent programming (note the TODO in the constructor: it is slated for removal in favor of rpcEnv)
- cacheManager: stores intermediate computation results
- mapOutputTracker: caches MapStatus information and provides the ability to fetch it from the MapOutputTrackerMaster
- shuffleManager: manages the shuffle, maintaining the routing of map outputs to reducers
- broadcastManager: manages broadcast variables
- blockManager: block (storage) management
- securityManager: security management
- sparkFilesDir: the directory where Spark stores files, e.g. those added via SparkContext.addFile()
- metricsSystem: metrics collection
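SparkEnv is process-local and can be fetched anywhere in that process via SparkEnv.get; a short sketch of how the components above are reached (illustrative use only):

```scala
import org.apache.spark.SparkEnv

val env = SparkEnv.get              // the driver's env once SparkContext is up
val blockManager = env.blockManager // per-process block storage
val serializer   = env.serializer   // configured data serializer (Java or Kryo)
println(env.executorId)             // equals SparkContext.DRIVER_IDENTIFIER on the driver
```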
Step 2: Create TaskScheduler
Based on Spark's run mode, the matching SchedulerBackend is chosen and the TaskScheduler is then started; this step is crucial.
```scala
val (sched, ts) = SparkContext.createTaskScheduler(this, master)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

_taskScheduler.start()
```
```scala
private def createTaskScheduler(
    sc: SparkContext,
    master: String): (SchedulerBackend, TaskScheduler) = {
  import SparkMasterRegex._

  // When running locally, don't try to re-execute tasks on failure.
  val MAX_LOCAL_TASK_FAILURES = 1

  master match {
    case "local" =>
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(sc.getConf, scheduler, 1)
      scheduler.initialize(backend)
      (backend, scheduler)

    case LOCAL_N_REGEX(threads) =>
      def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
      // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      if (threadCount <= 0) {
        throw new SparkException(s"Asked to run locally with $threadCount threads")
      }
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(sc.getConf, scheduler, threadCount)
      scheduler.initialize(backend)
      (backend, scheduler)

    ...
  }
}
```
The key point of createTaskScheduler is that it inspects the master string to determine Spark's current deployment mode, and accordingly instantiates the corresponding SchedulerBackend subclass, as the sketch below illustrates.
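To make the dispatch concrete, here is an illustrative reduction of the pattern match (deployMode is a made-up helper; the real SparkMasterRegex covers more patterns, such as local-cluster and mesos):

```scala
val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r // same shape as SparkMasterRegex

def deployMode(master: String): String = master match {
  case "local"          => "local, single thread, no task retries"
  case LOCAL_N_REGEX(n) => s"local, $n threads -> LocalBackend"
  case m if m.startsWith("spark://") => "standalone -> SparkDeploySchedulerBackend"
  case m if m.startsWith("yarn")     => "YARN -> YarnSchedulerBackend"
  case _                             => "other cluster manager"
}
```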
taskScheduler.start() starts the corresponding SchedulerBackend and also sets up a periodic timer for checks (speculative execution). Taking LocalBackend as an example:
```scala
override def start() {
  val rpcEnv = SparkEnv.get.rpcEnv
  val executorEndpoint = new LocalEndpoint(rpcEnv, userClassPath, scheduler, this, totalCores)
  localEndpoint = rpcEnv.setupEndpoint("LocalBackendEndpoint", executorEndpoint)
  listenerBus.post(SparkListenerExecutorAdded(
    System.currentTimeMillis,
    executorEndpoint.localExecutorId,
    new ExecutorInfo(executorEndpoint.localExecutorHostname, totalCores, Map.empty)))
  launcherBackend.setAppId(appId)
  launcherBackend.setState(SparkAppHandle.State.RUNNING)
}
```
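The timer mentioned above lives not in LocalBackend but in TaskSchedulerImpl.start(), which starts the backend first and then, on a real cluster with spark.speculation enabled, schedules periodic speculative-task checks; roughly paraphrased from the 1.x source:

```scala
override def start() {
  backend.start() // e.g. LocalBackend.start() shown above

  // Only on clusters, and only when speculation is enabled
  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    // Re-check for straggler tasks at a fixed interval
    speculationScheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
        checkSpeculatableTasks()
      }
    }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
  }
}
```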
Step 3: Create and start DAGScheduler
Going straight to the code:
```scala
_dagScheduler = new DAGScheduler(this)
```
Looking at DAGScheduler's constructor:
```scala
def this(sc: SparkContext) = this(sc, sc.taskScheduler)
```
It is clear that the DAGScheduler is created with the taskScheduler built in the previous step as an argument.
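To tie the two schedulers together, consider what happens on a simple action (local mode assumed; the comment traces the call chain):

```scala
val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("dag-demo"))

// count() -> SparkContext.runJob -> DAGScheduler.runJob, which splits the job
// into stages and submits each stage's TaskSet via taskScheduler.submitTasks
val n = sc.parallelize(1 to 100, numSlices = 4).count()
sc.stop()
```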
Step 4: Start the WebUI
```scala
_ui =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(SparkUI.createLiveUI(this, _conf, listenerBus, _jobProgressListener,
      _env.securityManager, appName, startTime = startTime))
  } else {
    // For tests, do not enable the UI
    None
  }

// Bind the UI before starting the task scheduler to communicate
// the bound port to the cluster manager properly
_ui.foreach(_.bind())
```
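Two related settings, for reference (the values shown are the defaults; if port 4040 is already taken, Spark retries 4041, 4042, and so on):

```scala
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("ui-demo")
  .set("spark.ui.enabled", "true") // set to "false" to skip creating the live UI
  .set("spark.ui.port", "4040")    // first port the UI attempts to bind
```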