Spark history:
Hadoop: 2006, 1.x (had problems) --> 2.x (2.2.0, released 2013.10)
--> Spark (released 2013.4), a computing framework
YARN
RM: ResourceManager, main responsibilities:
- handle client requests
- monitor the NodeManagers
- start and monitor the ApplicationMaster
- allocate and schedule resources
AM: ApplicationMaster (MRAppMaster for MapReduce), responsibilities:
- split the input data
- request resources for the application and assign them to its internal tasks
- monitor tasks and handle fault tolerance
NM: NodeManager, main responsibilities:
- manage the resources of a single node
- handle commands from the ResourceManager
- handle commands from the ApplicationMaster
Container (virtual resource container): the resource abstraction in YARN. It encapsulates the multi-dimensional resources of a node, such as memory, CPU, disk and network. [pluggable] (A resource-request sketch follows.)
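As a concrete illustration (not part of these notes' Spark sources), the Hadoop YARN client API describes a Container's resources as memory plus virtual cores. The sketch below assumes the hadoop-yarn-client dependency is on the classpath and omits the launch context and error handling.
import org.apache.hadoop.yarn.api.records.Resource
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object ContainerResourceSketch {
  def main(args: Array[String]): Unit = {
    // create and start a client that talks to the ResourceManager
    val yarnClient = YarnClient.createYarnClient
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // ask the RM for a new application; its ApplicationMaster will run inside a Container
    val app = yarnClient.createApplication()
    val amContext = app.getApplicationSubmissionContext

    // a Container's resources are expressed as memory (MB) + virtual cores
    amContext.setResource(Resource.newInstance(1024, 1))
    amContext.setApplicationName("container-resource-demo")

    // a real submission would also set the ContainerLaunchContext (command, local
    // resources, environment) before calling yarnClient.submitApplication(amContext)
    yarnClient.stop()
  }
}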
Operator: a method/operation applied to an RDD (see the sketch below)
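A minimal word-count sketch of what "operator" means in practice: transformations such as flatMap, map and reduceByKey build new RDDs lazily, while an action such as collect triggers the actual job. The local[*] master here is only for illustration.
import org.apache.spark.{SparkConf, SparkContext}

object OperatorDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OperatorDemo").setMaster("local[*]"))

    val lines = sc.parallelize(Seq("hello spark", "hello yarn"))
    val counts = lines
      .flatMap(_.split(" "))     // transformation: split lines into words
      .map(word => (word, 1))    // transformation: pair each word with 1
      .reduceByKey(_ + _)        // transformation: sum the counts per word

    counts.collect().foreach(println)  // action: triggers the job on the Executors
    sc.stop()
  }
}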
Spark kernel overview
The Spark kernel broadly refers to Spark's core runtime machinery: how the core components run, how tasks are scheduled, how memory is managed, and how the core features work internally. Mastering the kernel helps us design Spark code better and pinpoint the root cause of problems that come up in projects.
Review of Spark's core components
Driver: the program that runs the SparkContext is the driver program.
The Spark driver node executes the main method of the Spark application and is responsible for running the actual user code. While a Spark job runs, the Driver is mainly responsible for (see the sketch after this list):
- converting the user program into jobs
- scheduling tasks onto the Executors
- tracking how the Executors are doing
- showing the state of the running query through the UI
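A minimal driver sketch illustrating those responsibilities: the process that runs this main() and holds the SparkContext is the Driver, and each action below becomes a job whose tasks are scheduled onto Executors. The master is assumed to be supplied by spark-submit.
import org.apache.spark.{SparkConf, SparkContext}

object DriverDemo {
  def main(args: Array[String]): Unit = {
    // creating the SparkContext makes this process the Driver
    val sc = new SparkContext(new SparkConf().setAppName("DriverDemo"))
    val nums = sc.parallelize(1 to 100, numSlices = 4)

    println(nums.count())                   // action #1 -> job 0
    println(nums.map(_ * 2).reduce(_ + _))  // action #2 -> job 1

    sc.stop()                               // stopping the Driver ends the application
  }
}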
Executor: the executor, which runs tasks
A Spark Executor is a JVM process responsible for running the concrete tasks of a Spark job; the tasks are independent of one another. Executors are launched when the Spark application starts and live for the entire lifetime of the application. If an Executor fails or crashes, the application can still continue: the tasks on the failed node are rescheduled onto other Executors.
An Executor has two core functions:
- run the tasks that make up the Spark application and return the results to the driver process
- provide in-memory storage, through its Block Manager, for RDDs the user program asks to cache. Because the RDDs are cached directly inside the Executor process, tasks can make full use of the cached data at runtime to speed up computation (see the caching sketch below).
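A short sketch of that caching behaviour (the HDFS path is just a placeholder): persist() asks each Executor's BlockManager to keep the computed partitions, so the second action reads the cached blocks instead of recomputing them.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheDemo"))

    val logs   = sc.textFile("hdfs:///data/logs")   // placeholder input path
    val errors = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_ONLY)

    errors.count()                                  // computes and caches the partitions on the Executors
    errors.filter(_.contains("yarn")).count()       // reuses the cached blocks
    sc.stop()
  }
}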
Client: the client, which submits the application
Master: roughly the counterpart of the ResourceManager
Worker: roughly the counterpart of the NodeManager
Parallelism & concurrency
Parallelism: multiple CPUs/cores executing at the same time
Concurrency: one process with multiple threads, scheduled onto a single CPU (illustrated below)
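A small illustration of the difference: the threads below run concurrently inside one JVM process and may share a single CPU, whereas true parallelism needs several cores executing at the same instant. In Spark terms, four tasks on local[1] run concurrently one at a time, while on local[4] they run in parallel.
object ConcurrencyDemo {
  def main(args: Array[String]): Unit = {
    // concurrency: several threads inside one process, scheduled by the OS;
    // whether they also run in parallel depends on how many cores are available
    val threads = (1 to 4).map { i =>
      new Thread(new Runnable {
        def run(): Unit = println(s"task $i on ${Thread.currentThread().getName}")
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}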
Source-code walkthrough, e.g.:
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--num-executors 2 \
--master yarn \
--deploy-mode cluster \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100
The spark-submit script ultimately runs:
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
Step into the org.apache.spark.deploy.SparkSubmit class:
def main(args: Array[String]): Unit = {
val appArgs = new SparkSubmitArguments(args)
if (appArgs.verbose) {
// scalastyle:off println
printStream.println(appArgs)
// scalastyle:on println
}
appArgs.action match {
case SparkSubmitAction.SUBMIT => submit(appArgs)
case SparkSubmitAction.KILL => kill(appArgs)
case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
}
}
Step into SparkSubmitArguments:
private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, String] = sys.env)
extends SparkSubmitArgumentsParser {
var master: String = null
var deployMode: String = null
var executorMemory: String = null
var executorCores: String = null
var totalExecutorCores: String = null
var propertiesFile: String = null
var driverMemory: String = null
var driverExtraClassPath: String = null
var driverExtraLibraryPath: String = null
var driverExtraJavaOptions: String = null
var queue: String = null
var numExecutors: String = null
var files: String = null
var archives: String = null
var mainClass: String = null
var primaryResource: String = null
var name: String = null
var childArgs: ArrayBuffer[String] = new ArrayBuffer[String]()
var jars: String = null
var packages: String = null
var repositories: String = null
var ivyRepoPath: String = null
var packagesExclusions: String = null
var verbose: Boolean = false
var isPython: Boolean = false
var pyFiles: String = null
var isR: Boolean = false
var action: SparkSubmitAction = null
val sparkProperties: HashMap[String, String] = new HashMap[String, String]()
var proxyUser: String = null
var principal: String = null
var keytab: String = null
// Standalone cluster mode only
var supervise: Boolean = false
var driverCores: String = null
var submissionToKill: String = null
var submissionToRequestStatusFor: String = null
var useRest: Boolean = true // used internally
/** Default properties present in the currently defined defaults file. */
lazy val defaultSparkProperties: HashMap[String, String] = {
val defaultProperties = new HashMap[String, String]()
// scalastyle:off println
if (verbose) SparkSubmit.printStream.println(s"Using properties file: $propertiesFile")
Option(propertiesFile).foreach { filename =>
Utils.getPropertiesFromFile(filename).foreach { case (k, v) =>
defaultProperties(k) = v
if (verbose) SparkSubmit.printStream.println(s"Adding default property: $k=$v")
}
}
// scalastyle:on println
defaultProperties
}
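For context, defaultSparkProperties reads a Java-properties style file (by default conf/spark-defaults.conf, with lines such as "spark.master  yarn"). The sketch below loads such a file with plain java.util.Properties, which is roughly what Utils.getPropertiesFromFile does (plus UTF-8 handling and value trimming); the file path here is a placeholder.
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._

object ReadDefaultsSketch {
  def main(args: Array[String]): Unit = {
    // e.g. conf/spark-defaults.conf containing lines like:
    //   spark.master            yarn
    //   spark.executor.memory   2g
    val props = new Properties()
    val in = new FileInputStream("conf/spark-defaults.conf")   // placeholder path
    try props.load(in) finally in.close()

    props.asScala.foreach { case (k, v) => println(s"$k = $v") }
  }
}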
Step into submit:
private def submit(args: SparkSubmitArguments): Unit = {
  val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)
SparkSubmit then executes the chosen main class via reflection:
// declare the main class; in yarn-cluster mode this is
childMainClass = "org.apache.spark.deploy.yarn.Client"
// call the main class
runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
// load the class via reflection
mainClass = Utils.classForName(childMainClass)
// get the class's main method via reflection
val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
// invoke main on the specified class
mainMethod.invoke(null, childArgs.toArray)
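A self-contained sketch of the reflection pattern runMain uses, with a local HelloMain standing in for childMainClass (which is org.apache.spark.deploy.yarn.Client in yarn-cluster mode): look the class up by name, fetch its static main(Array[String]) method, and invoke it with a null receiver.
object HelloMain {
  def main(args: Array[String]): Unit = println("invoked with: " + args.mkString(" "))
}

object ReflectMainSketch {
  def main(args: Array[String]): Unit = {
    val childMainClass = "HelloMain"                      // stand-in for the real childMainClass
    val childArgs      = Seq("--class", "org.apache.spark.examples.SparkPi")

    val mainClass  = Class.forName(childMainClass)        // Utils.classForName adds Spark's class-loader handling
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    mainMethod.invoke(null, childArgs.toArray)            // null receiver: main is static
  }
}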
Add the dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-yarn_2.11</artifactId>
<version>2.1.1</version>
</dependency>
Step into org.apache.spark.deploy.yarn.Client:
private object Client extends Logging {
def main(argStrings: Array[String]) {
if (!sys.props.contains("SPARK_SUBMIT")) {
logWarning("WARNING: This client is deprecated and will be removed in a " +
"future version of Spark. Use ./bin/spark-submit with \"--master yarn\"")
}
// Set an env variable indicating we are running in YARN mode.
// Note that any env variable with the SPARK_ prefix gets propagated to all (remote) processes
System.setProperty("SPARK_YARN_MODE", "true")
val sparkConf = new SparkConf
// SparkSubmit would use yarn cache to distribute files & jars in yarn mode,
// so remove them from sparkConf here for yarn mode.
sparkConf.remove("spark.jars")
sparkConf.remove("spark.files")
val args = new ClientArguments(argStrings)
new Client(args, sparkConf).run()
}
val amClass =
if (isClusterMode) {
Utils.classForName("org.apache.spark.deploy.yarn.ApplicationMaster").getName
} else {
Utils.classForName("org.apache.spark.deploy.yarn.ExecutorLauncher").getName
}
if (args.primaryRFile != null && args.primaryRFile.endsWith(".R")) {
args.userArgs = ArrayBuffer(args.primaryRFile) ++ args.userArgs
}
val userArgs = args.userArgs.flatMap { arg =>
Seq("--arg", YarnSparkHadoopUtil.escapeForShell(arg))
}
val amArgs =
Seq(amClass) ++ userClass ++ userJar ++ primaryPyFile ++ primaryRFile ++
userArgs ++ Seq(
"--properties-file", buildPath(YarnSparkHadoopUtil.expandEnvironment(Environment.PWD),
LOCALIZED_CONF_DIR, SPARK_CONF_FILE))
// Command for the ApplicationMaster
val commands = prefixEnv ++ Seq(
YarnSparkHadoopUtil.expandEnvironment(Environment.JAVA_HOME) + "/bin/java", "-server"
) ++
javaOpts ++ amArgs ++
Seq(
"1>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout",
"2>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr")
Client (summary):
// create a YARN client, used to talk to YARN
yarnClient = YarnClient.createYarnClient
// submit the application
this.appId = submitApplication()
// during submission, build the launch command string, either
//   bin/java org.apache.spark.deploy.yarn.ApplicationMaster   (cluster mode)
//   bin/java org.apache.spark.deploy.yarn.ExecutorLauncher    (client mode)
// and send the command to YARN through the YARN client
Utils.classForName("org.apache.spark.deploy.yarn.ApplicationMaster").getName
Utils.classForName("org.apache.spark.deploy.yarn.ExecutorLauncher").getName
Step into org.apache.spark.deploy.yarn.ApplicationMaster:
def main(args: Array[String]): Unit = {
SignalUtils.registerLogger(log)
val amArgs = new ApplicationMasterArguments(args)
// Load the properties file with the Spark configuration and set entries as system properties,
// so that user code run inside the AM also has access to them.
// Note: we must do this before SparkHadoopUtil instantiated
if (amArgs.propertiesFile != null) {
Utils.getPropertiesFromFile(amArgs.propertiesFile).foreach { case (k, v) =>
sys.props(k) = v
}
}
SparkHadoopUtil.get.runAsSparkUser { () =>
master = new ApplicationMaster(amArgs, new YarnRMClient)
System.exit(master.run())
}
}
while (!args.isEmpty) {
// --num-workers, --worker-memory, and --worker-cores are deprecated since 1.0,
// the properties with executor in their names are preferred.
args match {
case ("--jar") :: value :: tail =>
userJar = value
args = tail
case ("--class") :: value :: tail =>
userClass = value
args = tail
case ("--primary-py-file") :: value :: tail =>
primaryPyFile = value
args = tail
case ("--primary-r-file") :: value :: tail =>
primaryRFile = value
args = tail
case ("--arg") :: value :: tail =>
userArgsBuffer += value
args = tail
case ("--properties-file") :: value :: tail =>
propertiesFile = value
args = tail
case _ =>
printUsageAndExit(1, args)
}
}
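The same parsing style in a self-contained form: treat the arguments as a List and peel off each "--flag value" pair with pattern matching, e.g. ArgParseSketch.main(Array("--jar", "app.jar", "--class", "com.example.Main")).
object ArgParseSketch {
  def main(argStrings: Array[String]): Unit = {
    var userJar: String   = null
    var userClass: String = null

    var args = argStrings.toList
    while (args.nonEmpty) {
      args match {
        case "--jar" :: value :: tail =>
          userJar = value
          args = tail
        case "--class" :: value :: tail =>
          userClass = value
          args = tail
        case unknown :: _ =>
          // ApplicationMasterArguments calls printUsageAndExit(1, args) here
          sys.error(s"Unknown argument: $unknown")
      }
    }
    println(s"userJar=$userJar, userClass=$userClass")
  }
}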
ApplicationMaster (summary):
// read the --class argument, e.g. SparkCoreDemo
userClass = value
// run the Driver: the class that creates the SparkContext object is called the Driver
runDriver(securityMgr)
userClassThread = startUserApplication()
// get the main method of the Driver class via reflection
val mainMethod = userClassLoader.loadClass(args.userClass)
  .getMethod("main", classOf[Array[String]])
// create the user Driver thread
val userThread = new Thread
// start the Driver thread
userThread.start
// the Driver thread executes the user class's main method
mainMethod.invoke(null, userArgs.toArray)
// wait for userClassThread to finish
userClassThread.join()
// register the ApplicationMaster
registerAM(sc.getConf, rpcEnv, driverRef, sc.ui.map(_.appUIAddress).getOrElse(""),
  securityMgr)
// allocate resources
allocator.allocateResources()
// run the allocated containers
runAllocatedContainers(containersToUse)
// build the command to send to the NM:
//   bin/java org.apache.spark.executor.CoarseGrainedExecutorBackend
val commands = prepareCommand()
CoarseGrainedExecutorBackend: a message-communication endpoint
run(driverUrl, executorId, hostname, cores, appId, workerUrl, userClassPath)
SparkEnv.createExecutorEnv
env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend
env.rpcEnv.setupEndpoint("WorkerWatcher", new WorkerWatcher(env.rpcEnv, url))
env.rpcEnv.awaitTermination()
NettyRpcEnv internals (a simplified sketch follows below):
Dispatcher:
NettyRpcEndpointRef:
RpcEndpointAddress:
EndpointData:
Inbox:
// create the Executor compute object, which runs the scheduled Tasks
executor = new Executor
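These are Spark-internal classes; the sketch below is not Spark's real API, just a stripped-down version of the Dispatcher/Inbox idea: each endpoint owns an inbox (a queue of messages), the dispatcher routes incoming messages to the right inbox, and a dedicated thread drains the inbox and hands each message to the endpoint's receive method.
import java.util.concurrent.LinkedBlockingQueue

trait Endpoint { def receive(msg: Any): Unit }

class Inbox(endpoint: Endpoint) {
  private val messages = new LinkedBlockingQueue[Any]()
  def post(msg: Any): Unit = messages.put(msg)
  // loops forever, like the real message loop threads
  def process(): Unit = while (true) endpoint.receive(messages.take())
}

class Dispatcher {
  private var endpoints = Map.empty[String, Inbox]
  def setupEndpoint(name: String, endpoint: Endpoint): Unit = {
    val inbox = new Inbox(endpoint)
    endpoints += name -> inbox
    new Thread(new Runnable { def run(): Unit = inbox.process() }).start()
  }
  def postMessage(name: String, msg: Any): Unit = endpoints(name).post(msg)
}

// usage: register an "Executor"-like endpoint and send it a message
object RpcSketch {
  def main(args: Array[String]): Unit = {
    val dispatcher = new Dispatcher
    dispatcher.setupEndpoint("Executor", new Endpoint {
      def receive(msg: Any): Unit = println(s"Executor endpoint got: $msg")
    })
    dispatcher.postMessage("Executor", "LaunchTask(taskId = 0)")
  }
}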
Spark deployment modes
Spark supports three cluster managers (the master URL that selects each one is sketched at the end of this section):
- Standalone: Spark's own simple cluster manager. It ships with a complete set of services and can be deployed on a cluster by itself, without depending on any other resource-management system, which makes it easy to stand up a cluster.
- Apache Mesos: a powerful distributed resource-management framework that can host many different frameworks (including YARN). It is not widely used domestically and is somewhat more common abroad.
- Hadoop YARN: a unified resource-management layer on which multiple computing frameworks (MapReduce, Storm, etc.) can run. Depending on where the driver runs in the cluster, it is split into yarn-client and yarn-cluster modes.
How Standalone mode runs
A Standalone cluster has four important components:
- Driver: a process. The Spark application we write runs on the Driver and is executed by the Driver process.
- Master (the RM counterpart): a process mainly responsible for scheduling and allocating resources, and for monitoring the cluster.
- Worker (the NM counterpart): a process. One Worker runs on one server in the cluster and has two duties: use its own memory to store one or more partitions of an RDD, and start other processes and threads (Executors) that process and compute the RDD's partitions in parallel.
- Executor: a process. One Worker can run several Executors; an Executor runs multiple threads (tasks) to compute the RDD's partitions in parallel, i.e. to execute the operators we define on the RDD, such as map, flatMap and reduce.
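A short sketch of how an application selects one of these cluster managers: only the master URL changes while the application code stays the same (host names and ports below are placeholders); under YARN the master and deploy mode usually come from spark-submit rather than from code.
import org.apache.spark.{SparkConf, SparkContext}

object MasterUrlSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("deploy-mode-demo")
    conf.setMaster("local[*]")                     // single JVM, for development
    // conf.setMaster("spark://master-host:7077")  // Standalone Master/Worker cluster
    // conf.setMaster("mesos://mesos-host:5050")   // Apache Mesos
    // conf.setMaster("yarn")                      // Hadoop YARN (usually set via spark-submit instead)
    val sc = new SparkContext(conf)
    sc.stop()
  }
}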