Spark can be deployed in several modes: Standalone, Hadoop YARN, Apache Mesos, Kubernetes, and so on.
For everyday practice you will most likely use Standalone mode, but in real production environments the vast majority of clusters run on YARN (Mesos is rarely used in China). Of YARN's two deploy modes, cluster mode is, in my experience, the more common one, so that is where we will start.
Note: this article is based on Spark 2.1; newer versions differ in places.
I used to know what the Driver and the Executors are in a Spark program, and that in YARN cluster mode the Driver runs inside the ApplicationMaster, and so on. But after a few months on the job I had forgotten it all, because in day-to-day development you never need to know how it runs; you just pass "--deploy-mode cluster". Still, in the spirit of hacking (really, to prepare for interviews), I decided to dig into it.
When we submit an application, we use the spark-submit script under ${SPARK_HOME}/bin, followed by a pile of arguments, for example:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--queue thequeue \
examples/jars/spark-examples*.jar \
10
Looking inside the spark-submit script, you will find that it essentially just runs the SparkSubmit class (the last line is roughly exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"), where "$@" forwards every argument you typed on to that class.
Open the SparkSubmit class and go straight to its main() method:
def main(args: Array[String]): Unit = {
// Wrap the input arguments in a SparkSubmitArguments object
val appArgs = new SparkSubmitArguments(args)
if (appArgs.verbose) { // verbose enables debug output; it is false unless explicitly specified
// scalastyle:off println
printStream.println(appArgs)
// scalastyle:on println
}
appArgs.action match {
case SparkSubmitAction.SUBMIT => submit(appArgs)
case SparkSubmitAction.KILL => kill(appArgs)
case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
}
}
The action field starts out as null; drilling into SparkSubmitArguments and searching, you will find this line:
// Action should be SUBMIT unless otherwise specified
action = Option(action).getOrElse(SUBMIT)
Since action is null, Option(action) evaluates to None, so getOrElse assigns SUBMIT (one of the SparkSubmitAction enumeration values) to action, and the match in main() therefore takes the SUBMIT branch and calls submit().
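If the Option(null) behavior is not obvious, here is a minimal, self-contained illustration of the same pattern (a plain String stands in for the action field; the values are made up):

object OptionDefaultDemo {
  def main(args: Array[String]): Unit = {
    val action: String = null
    // Option(null) is None, so getOrElse falls back to the default value
    val resolved = Option(action).getOrElse("SUBMIT")
    println(resolved) // prints "SUBMIT"
  }
}

Now let's step into submit():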
private def submit(args: SparkSubmitArguments): Unit = {
val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)
def doRunMain(): Unit = {
if (args.proxyUser != null) {
val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
UserGroupInformation.getCurrentUser())
try {
proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
override def run(): Unit = {
runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
}
})
} catch {
...
}
} else {
runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
}
}
The first line of submit() calls prepareSubmitEnvironment(); as its name and comments suggest, it determines the deploy mode and prepares the submission environment. Further down, submit() invokes the local doRunMain() function. Since proxyUser defaults to null in SparkSubmitArguments, the if branch is skipped and the else branch calls runMain() directly; let's step into it.
private def runMain(
childArgs: Seq[String],
childClasspath: Seq[String],
sysProps: Map[String, String],
childMainClass: String,
verbose: Boolean): Unit = {
// scalastyle:off println
if (verbose) {
printStream.println(s"Main class:\n$childMainClass")
printStream.println(s"Arguments:\n${childArgs.mkString("\n")}")
printStream.println(s"System properties:\n${sysProps.mkString("\n")}")
printStream.println(s"Classpath elements:\n${childClasspath.mkString("\n")}")
printStream.println("\n")
}
// scalastyle:on println
val loader =
if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
new ChildFirstURLClassLoader(new Array[URL](0),
Thread.currentThread.getContextClassLoader)
} else {
new MutableURLClassLoader(new Array[URL](0),
Thread.currentThread.getContextClassLoader)
}
Thread.currentThread.setContextClassLoader(loader)
for (jar <- childClasspath) {
addJarToClasspath(jar, loader)
}
for ((key, value) <- sysProps) {
System.setProperty(key, value)
}
var mainClass: Class[_] = null
try { // ① Load childMainClass via reflection
mainClass = Utils.classForName(childMainClass)
} catch {...}
// SPARK-4170
if (classOf[scala.App].isAssignableFrom(mainClass)) {
printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
}
// ② Look up the main method of the loaded class (yarn.Client, as we will see below) via reflection
val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
if (!Modifier.isStatic(mainMethod.getModifiers)) {
throw new IllegalStateException("The main method in the given main class must be static")
}
@tailrec
def findCause(t: Throwable): Throwable = t match {
case e: UndeclaredThrowableException =>
if (e.getCause() != null) findCause(e.getCause()) else e
case e: InvocationTargetException =>
if (e.getCause() != null) findCause(e.getCause()) else e
case e: Throwable =>
e
}
try {
// ③ Invoke the main method
mainMethod.invoke(null, childArgs.toArray)
} catch {
...
}
}
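Steps ① to ③ are the standard "load a class by name, look up its static main, invoke it" pattern. If reflection is unfamiliar, here is a minimal, self-contained sketch of the same idea (the class and arguments are made up for illustration and have nothing to do with Spark):

object HelloMain {
  def main(args: Array[String]): Unit = println("Hello from " + args.mkString(", "))
}

object ReflectiveMainDemo {
  def main(args: Array[String]): Unit = {
    // Load the class by name, just like Utils.classForName(childMainClass)
    val mainClass = Class.forName("HelloMain")
    // Find the static main(String[]) method
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    // Invoke it; the receiver is null because the method is static
    mainMethod.invoke(null, Array("arg1", "arg2"))
  }
}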
So the most important thing runMain() does is step ①: load childMainClass via reflection and then invoke its main() method. To find out what childMainClass actually is, we have to go back to prepareSubmitEnvironment(). That method runs to several hundred lines, but most of it is validation; jump straight to around line 603:
if (isYarnCluster) {
// In yarn-cluster mode, use yarn.Client as a wrapper around the user class
childMainClass = "org.apache.spark.deploy.yarn.Client"
if (args.isPython) {
childArgs += ("--primary-py-file", args.primaryResource)
childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
} else if (args.isR) {
val mainFile = new Path(args.primaryResource).getName
childArgs += ("--primary-r-file", mainFile)
childArgs += ("--class", "org.apache.spark.deploy.RRunner")
} else {
if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
childArgs += ("--jar", args.primaryResource)
}
childArgs += ("--class", args.mainClass)
}
if (args.childArgs != null) {
args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
}
}
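For the SparkPi submission shown at the top of this post, the final else branch is taken, so childMainClass and childArgs end up roughly like the following (illustrative values, not captured from a real run):

object YarnClusterChildArgsSketch {
  def main(args: Array[String]): Unit = {
    val childMainClass = "org.apache.spark.deploy.yarn.Client"
    val childArgs = Seq(
      "--jar", "examples/jars/spark-examples*.jar",    // args.primaryResource
      "--class", "org.apache.spark.examples.SparkPi",  // args.mainClass
      "--arg", "10"                                    // the user argument from the command line
    )
    println(childMainClass + " " + childArgs.mkString(" "))
  }
}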
From the above it is clear that in yarn-cluster mode childMainClass is simply the "org.apache.spark.deploy.yarn.Client" class (in client mode it would be the user's own main class instead). Having found that class, let's look straight at its main() method.
Client's main() method is around line 1231; the most important part of it is the last couple of lines:
val args = new ClientArguments(argStrings) // wrap the arguments once more
new Client(args, sparkConf).run()
Note that this calls the run() method directly; it does not start a new thread! Opening run(), we find that it immediately calls submitApplication(), which does exactly what the name says: submit the application. Open that method:
def submitApplication(): ApplicationId = {
var appId: ApplicationId = null
try {
launcherBackend.connect() // connect back to the launcher server, if one is listening
// Setup the credentials before doing anything else,
// so we don't have issues at any point.
setupCredentials()
yarnClient.init(yarnConf) // initialize the YARN client
yarnClient.start() // start the YARN client
logInfo("Requesting a new application from cluster with %d NodeManagers"
.format(yarnClient.getYarnClusterMetrics.getNumNodeManagers))
// Get a new application from our RM (ResourceManager)
val newApp = yarnClient.createApplication()
val newAppResponse = newApp.getNewApplicationResponse()
// Grab the new application id
appId = newAppResponse.getApplicationId()
reportLauncherState(SparkAppHandle.State.SUBMITTED)
launcherBackend.setAppId(appId.toString)
new CallerContext("CLIENT", Option(appId.toString)).setCurrentContext()
// Verify whether the cluster has enough resources for our AM
verifyClusterResources(newAppResponse)
// Set up the appropriate contexts to launch our AM (ApplicationMaster)
val containerContext = createContainerLaunchContext(newAppResponse)
// Create the application submission context
val appContext = createApplicationSubmissionContext(newApp, containerContext)
// Finally, submit and monitor the application
logInfo(s"Submitting application $appId to ResourceManager")
// Submit the application to the ResourceManager
yarnClient.submitApplication(appContext)
appId
} catch {
...
}
}
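As an aside, everything submitApplication() does rides on the plain Hadoop YarnClient API. Here is a bare-bones sketch of those raw calls, with Spark's credential setup, resource checks and actual context construction stripped out, so it is illustrative rather than something you would run against a real cluster as-is:

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object RawYarnSubmitSketch {
  def main(args: Array[String]): Unit = {
    val yarnConf = new YarnConfiguration()
    val yarnClient = YarnClient.createYarnClient()
    yarnClient.init(yarnConf) // initialize with the cluster configuration
    yarnClient.start()        // connect to the ResourceManager

    val newApp = yarnClient.createApplication() // ask the RM for a new application
    val appId = newApp.getNewApplicationResponse.getApplicationId
    println("Got application id: " + appId)

    // In the real code, this context is filled in with the AM's ContainerLaunchContext,
    // queue, and resource requirements before being submitted.
    val appContext = newApp.getApplicationSubmissionContext
    yarnClient.submitApplication(appContext) // hand the application over to the RM
    yarnClient.stop()
  }
}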
In submitApplication() we can see a call to createContainerLaunchContext(...), which, as the name suggests, builds the launch context for a container. Step into it:
/**
* Set up a ContainerLaunchContext to launch our ApplicationMaster container.
* This sets up the launch environment, java options, and the command for launching the AM.
*/
logInfo("Setting up container launch context for our AM")
The scaladoc and the log message already tell us what this is for: preparing the container in which the ApplicationMaster will be launched. The method body is too long to paste in full; it mostly assembles the launch environment, java options and resources, until we reach the following code:
val amClass =
if (isClusterMode) {
Utils.classForName("org.apache.spark.deploy.yarn.ApplicationMaster").getName
} else {
Utils.classForName("org.apache.spark.deploy.yarn.ExecutorLauncher").getName
}
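In cluster mode, amClass is therefore org.apache.spark.deploy.yarn.ApplicationMaster, and what ultimately goes into the ContainerLaunchContext is a java command line built around that class, which the NodeManager will later execute. Below is a rough, hand-written approximation of that command; the memory setting, user class, jar and properties path are illustrative, and the real code assembles every piece from the environment and the parsed arguments:

object AmCommandSketch {
  def main(args: Array[String]): Unit = {
    val amClass = "org.apache.spark.deploy.yarn.ApplicationMaster"
    val command = Seq(
      "{{JAVA_HOME}}/bin/java", "-server",
      "-Xmx4096m",                                  // driver memory, since the Driver lives in the AM
      amClass,
      "--class", "org.apache.spark.examples.SparkPi",
      "--jar", "examples/jars/spark-examples*.jar",
      "--arg", "10",
      "--properties-file", "__spark_conf__/__spark_conf__.properties",
      "1>", "<LOG_DIR>/stdout",
      "2>", "<LOG_DIR>/stderr"
    )
    println(command.mkString(" "))
  }
}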
In client mode, the class launched is ExecutorLauncher instead, which is just a thin wrapper around the same ApplicationMaster logic. The Client then asks the ResourceManager to start this ApplicationMaster container on some NodeManager; from this point on nothing more happens on the Client side, so we head straight for the ApplicationMaster class.
This class has a main() method of its own; Ctrl+F12 in the IDE takes you straight to it:
def main(args: Array[String]): Unit = {
SignalUtils.registerLogger(log)
// Wrap the incoming arguments
val amArgs: ApplicationMasterArguments = new ApplicationMasterArguments(args)
// Load the properties file with the Spark configuration and set entries as system properties,
// so that user code run inside the AM also has access to them.
// Note: we must do this before SparkHadoopUtil instantiated
if (amArgs.propertiesFile != null) {
Utils.getPropertiesFromFile(amArgs.propertiesFile).foreach { case (k, v) =>
sys.props(k) = v
}
}
SparkHadoopUtil.get.runAsSparkUser { () =>
master = new ApplicationMaster(amArgs, new YarnRMClient)
System.exit(master.run())
}
}
The last two lines construct an ApplicationMaster instance, call its run() method, and finally exit the process with run()'s return value. Opening run(), the part we care about is:
if (isClusterMode) {
runDriver(securityMgr)
} else {
runExecutorLauncher(securityMgr)
}
In cluster mode it ends up calling runDriver(); keep drilling in:
private def runDriver(securityMgr: SecurityManager): Unit = {
addAmIpFilter()
userClassThread = startUserApplication()
// This a bit hacky, but we need to wait until the spark.driver.port property has
// been set by the Thread executing the user class.
logInfo("Waiting for spark context initialization...")
val totalWaitTime = sparkConf.get(AM_MAX_WAIT_TIME)
try {
val sc = ThreadUtils.awaitResult(sparkContextPromise.future,
Duration(totalWaitTime, TimeUnit.MILLISECONDS))
if (sc != null) {
rpcEnv = sc.env.rpcEnv
val driverRef = runAMEndpoint(
sc.getConf.get("spark.driver.host"),
sc.getConf.get("spark.driver.port"),
isClusterMode = true)
// Register the ApplicationMaster with the RM; this is effectively where resources are requested
registerAM(sc.getConf, rpcEnv, driverRef, sc.ui.map(_.appUIAddress).getOrElse(""),
securityMgr)
} else {
// Sanity check; should never happen in normal operation, since sc should only be null
// if the user app did not create a SparkContext.
if (!finished) {
throw new IllegalStateException("SparkContext is null but app is still running!")
}
}
// join: block the current thread until userClassThread has finished
userClassThread.join()
} catch {
...
}
}
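Before moving on, note the handshake in the middle of runDriver(): the user ("Driver") thread completes sparkContextPromise once the SparkContext exists, while the AM thread blocks on the corresponding future (with a timeout) before registering the AM. Here is a minimal, self-contained sketch of that pattern, where a String stands in for the SparkContext and the names and timings are made up:

import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

object ContextHandshakeDemo {
  def main(args: Array[String]): Unit = {
    val sparkContextPromise = Promise[String]() // stands in for Promise[SparkContext]

    val userThread = new Thread {
      override def run(): Unit = {
        Thread.sleep(500)                    // pretend the user's main() is building a SparkContext
        sparkContextPromise.trySuccess("sc") // signal that the context is ready
      }
    }
    userThread.setName("Driver")
    userThread.start()

    // The "AM side" blocks here, like ThreadUtils.awaitResult(sparkContextPromise.future, ...)
    val sc = Await.result(sparkContextPromise.future, 10.seconds)
    println("got " + sc + "; the AM can now register itself and request resources")

    // And, as at the end of runDriver(), wait for the user thread before exiting
    userThread.join()
  }
}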
The key line is the second one, which assigns the return value of startUserApplication() to userClassThread. Let's open startUserApplication() and see what it does:
private def startUserApplication(): Thread = {
logInfo("Starting the user application in a separate Thread")
val classpath = Client.getUserClasspath(sparkConf)
val urls = classpath.map { entry =>
new URL("file:" + new File(entry.getPath()).getAbsolutePath())
}
val userClassLoader =
if (Client.isUserClassPathFirst(sparkConf, isDriver = true)) {
new ChildFirstURLClassLoader(urls, Utils.getContextOrSparkClassLoader)
} else {
new MutableURLClassLoader(urls, Utils.getContextOrSparkClassLoader)
}
// Look up the main method of the user application's class
val mainMethod = userClassLoader.loadClass(args.userClass)
.getMethod("main", classOf[Array[String]])
val userThread = new Thread {
override def run() {
try {
// Invoke the user application's main method (we are already inside the new thread here)
mainMethod.invoke(null, userArgs.toArray)
finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS)
logDebug("Done running users class")
} catch {
...
} finally {
sparkContextPromise.trySuccess(null)
}
}
}
userThread.setContextClassLoader(userClassLoader)
userThread.setName("Driver")//将线程起名为Driver
userThread.start()
userThread
}
From the above we can see that the class specified with --class, i.e. the program you submitted, is first obtained via reflection; a new thread is then created whose run() invokes that program's main() method, the thread is named "Driver", started, and returned.
So now we finally know: in YARN cluster mode it is the ApplicationMaster that creates a thread named "Driver", and that thread is exactly the Driver we usually talk about.
Stepping back to runDriver(): after the resources have been requested there is the line userClassThread.join(). Before run() can finish, the ApplicationMaster's own thread must wait for the user (Driver) thread; only once the user thread has run to completion can the ApplicationMaster continue and eventually call System.exit().
From here on, the rest is left to YARN to schedule and execute.
Below is a rough diagram of the whole flow for reference; it is far from complete:
Next, we will look at what happens when the SparkContext is initialized; that will come in a follow-up post.