Spark history:
Hadoop: 2006, 1.x (had problems) --> 2.x (2.2.0, released 2013.10)
--> Spark (released 2013.4), a computing framework
YARN
RM: ResourceManager, main responsibilities:
- handle client requests
- monitor the NodeManagers
- start and monitor the ApplicationMaster
- allocate and schedule resources
AM: ApplicationMaster (MRAppMaster for MapReduce), responsibilities:
- split the input data
- request resources for the application and assign them to its internal tasks
- monitor tasks and handle fault tolerance
NM: NodeManager, main responsibilities:
- manage the resources of a single node
- handle commands from the ResourceManager
- handle commands from the ApplicationMaster
Container (virtual resource container): the resource abstraction in YARN. It encapsulates the multi-dimensional resources of a node, such as memory, CPU, disk and network. [pluggable] (A resource-request sketch follows.)
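As a concrete illustration (not part of these notes' Spark sources), the Hadoop YARN client API describes a Container's resources as memory plus virtual cores. The sketch below assumes the hadoop-yarn-client dependency is on the classpath and omits the launch context and error handling.
import org.apache.hadoop.yarn.api.records.Resource
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

object ContainerResourceSketch {
  def main(args: Array[String]): Unit = {
    // create and start a client that talks to the ResourceManager
    val yarnClient = YarnClient.createYarnClient
    yarnClient.init(new YarnConfiguration())
    yarnClient.start()

    // ask the RM for a new application; its ApplicationMaster will run inside a Container
    val app = yarnClient.createApplication()
    val amContext = app.getApplicationSubmissionContext

    // a Container's resources are expressed as memory (MB) + virtual cores
    amContext.setResource(Resource.newInstance(1024, 1))
    amContext.setApplicationName("container-resource-demo")

    // a real submission would also set the ContainerLaunchContext (command, local
    // resources, environment) before calling yarnClient.submitApplication(amContext)
    yarnClient.stop()
  }
}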
Operator: a method/operation applied to an RDD (see the sketch below)
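A minimal word-count sketch of what "operator" means in practice: transformations such as flatMap, map and reduceByKey build new RDDs lazily, while an action such as collect triggers the actual job. The local[*] master here is only for illustration.
import org.apache.spark.{SparkConf, SparkContext}

object OperatorDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OperatorDemo").setMaster("local[*]"))

    val lines = sc.parallelize(Seq("hello spark", "hello yarn"))
    val counts = lines
      .flatMap(_.split(" "))     // transformation: split lines into words
      .map(word => (word, 1))    // transformation: pair each word with 1
      .reduceByKey(_ + _)        // transformation: sum the counts per word

    counts.collect().foreach(println)  // action: triggers the job on the Executors
    sc.stop()
  }
}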
Spark kernel overview
The Spark kernel broadly refers to Spark's core runtime machinery: how the core components run, how tasks are scheduled, how memory is managed, and how the core features work internally. Mastering the kernel helps us design Spark code better and pinpoint the root cause of problems that come up in projects.
Review of Spark's core components
Driver: the program that runs the SparkContext is the driver program.
The Spark driver node executes the main method of the Spark application and is responsible for running the actual user code. While a Spark job runs, the Driver is mainly responsible for (see the sketch after this list):
- converting the user program into jobs
- scheduling tasks onto the Executors
- tracking how the Executors are doing
- showing the state of the running query through the UI
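A minimal driver sketch illustrating those responsibilities: the process that runs this main() and holds the SparkContext is the Driver, and each action below becomes a job whose tasks are scheduled onto Executors. The master is assumed to be supplied by spark-submit.
import org.apache.spark.{SparkConf, SparkContext}

object DriverDemo {
  def main(args: Array[String]): Unit = {
    // creating the SparkContext makes this process the Driver
    val sc = new SparkContext(new SparkConf().setAppName("DriverDemo"))
    val nums = sc.parallelize(1 to 100, numSlices = 4)

    println(nums.count())                   // action #1 -> job 0
    println(nums.map(_ * 2).reduce(_ + _))  // action #2 -> job 1

    sc.stop()                               // stopping the Driver ends the application
  }
}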
Executor: the executor, which runs tasks
A Spark Executor is a JVM process responsible for running the concrete tasks of a Spark job; the tasks are independent of one another. Executors are launched when the Spark application starts and live for the entire lifetime of the application. If an Executor fails or crashes, the application can still continue: the tasks on the failed node are rescheduled onto other Executors.
An Executor has two core functions:
- run the tasks that make up the Spark application and return the results to the driver process
- provide in-memory storage, through its Block Manager, for RDDs the user program asks to cache. Because the RDDs are cached directly inside the Executor process, tasks can make full use of the cached data at runtime to speed up computation (see the caching sketch below).
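A short sketch of that caching behaviour (the HDFS path is just a placeholder): persist() asks each Executor's BlockManager to keep the computed partitions, so the second action reads the cached blocks instead of recomputing them.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheDemo"))

    val logs   = sc.textFile("hdfs:///data/logs")   // placeholder input path
    val errors = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_ONLY)

    errors.count()                                  // computes and caches the partitions on the Executors
    errors.filter(_.contains("yarn")).count()       // reuses the cached blocks
    sc.stop()
  }
}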
Client: the client, which submits the application
Master: roughly the counterpart of the ResourceManager
Worker: roughly the counterpart of the NodeManager
Parallelism & concurrency
Parallelism: multiple CPUs/cores executing at the same time
Concurrency: one process with multiple threads, scheduled onto a single CPU (illustrated below)
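A small illustration of the difference: the threads below run concurrently inside one JVM process and may share a single CPU, whereas true parallelism needs several cores executing at the same instant. In Spark terms, four tasks on local[1] run concurrently one at a time, while on local[4] they run in parallel.
object ConcurrencyDemo {
  def main(args: Array[String]): Unit = {
    // concurrency: several threads inside one process, scheduled by the OS;
    // whether they also run in parallel depends on how many cores are available
    val threads = (1 to 4).map { i =>
      new Thread(new Runnable {
        def run(): Unit = println(s"task $i on ${Thread.currentThread().getName}")
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}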
Source-code walkthrough, e.g.:
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--num-executors 2 \
--master yarn \
--deploy-mode cluster \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100
The spark-submit script ultimately runs:
exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
Step into the org.apache.spark.deploy.SparkSubmit class:
def main(args: Array[String]): Unit = {
val appArgs = new SparkSubmitArguments(args)
if (appArgs.verbose) {
// scalastyle:off println
printStream.println(appArgs)
// scalastyle:on println
}
appArgs.action match {
case SparkSubmitAction.SUBMIT => submit(appArgs)
case SparkSubmitAction.KILL => kill(appArgs)
case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
}
}
Step into SparkSubmitArguments:
private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, String] = sys.env)
extends SparkSubmitArgumentsParser {
var master: String = null
var deployMode: String = null
var executorMemory: String = null
var executorCores: String = null
var totalExecutorCores: String = null
var propertiesFile: String = null
var driverMemory: String = null
var driverExtraClassPath: String = null
var driverExtraLibraryPath: String = null
var driverExtraJavaOptions: String = null
var queue: String = null
var numExecutors: String = null
var files: String = null
var archives: String = null
var mainClass: String = null
var primaryResource: String = null
var name: String = null
var childArgs: ArrayBuffer[String] = new ArrayBuffer[String]()
var jars: String = null
var packages: String = null
var repositories: String = null
var ivyRepoPath: String = null
var packagesExclusions: String = null
var verbose: Boolean = false
var isPython: Boolean = false
var pyFiles: String = null
var isR: Boolean = false
var action: SparkSubmitAction = null
val sparkProperties: HashMap[String, String] = new HashMap[String, String]()
var proxyUser: String = null
var principal: String = null
var keytab: String = null
// Standalone cluster mode only
var supervise: Boolean = false
var driverCores: String = null
var submissionToKill: String = null
var submissionToRequestStatusFor: String = null
var useRest: Boolean = true // used internally
/** Default properties present in the currently defined defaults file. */
lazy val defaultSparkProperties: HashMap[String, String] = {
val defaultProperties = new HashMap[String, String]()
// scalastyle:off println
if (verbose) SparkSubmit.printStream.println(s"Using properties file: $propertiesFile")
Option(propertiesFile).foreach { filename =>
Utils.getPropertiesFromFile(filename).foreach { case (k, v) =>
defaultProperties(k) = v
if (verbose) SparkSubmit.printStream.println(s"Adding default property: $k=$v")
}
}
// scalastyle:on println
defaultProperties
}
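For context, defaultSparkProperties reads a Java-properties style file (by default conf/spark-defaults.conf, with lines such as "spark.master  yarn"). The sketch below loads such a file with plain java.util.Properties, which is roughly what Utils.getPropertiesFromFile does (plus UTF-8 handling and value trimming); the file path here is a placeholder.
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._

object ReadDefaultsSketch {
  def main(args: Array[String]): Unit = {
    // e.g. conf/spark-defaults.conf containing lines like:
    //   spark.master            yarn
    //   spark.executor.memory   2g
    val props = new Properties()
    val in = new FileInputStream("conf/spark-defaults.conf")   // placeholder path
    try props.load(in) finally in.close()

    props.asScala.foreach { case (k, v) => println(s"$k = $v") }
  }
}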
Step into submit:
private def submit(args: SparkSubmitArguments): Unit = {
  val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)
SparkSubmit then executes the chosen main class via reflection:
// declare the main class; in yarn-cluster mode this is
childMainClass = "org.apache.spark.deploy.yarn.Client"
// call the main class
runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
// load the class via reflection
mainClass = Utils.classForName(childMainClass)
// get the class's main method via reflection
val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
// invoke main on the specified class
mainMethod.invoke(null, childArgs.toArray)
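A self-contained sketch of the reflection pattern runMain uses, with a local HelloMain standing in for childMainClass (which is org.apache.spark.deploy.yarn.Client in yarn-cluster mode): look the class up by name, fetch its static main(Array[String]) method, and invoke it with a null receiver.
object HelloMain {
  def main(args: Array[String]): Unit = println("invoked with: " + args.mkString(" "))
}

object ReflectMainSketch {
  def main(args: Array[String]): Unit = {
    val childMainClass = "HelloMain"                      // stand-in for the real childMainClass
    val childArgs      = Seq("--class", "org.apache.spark.examples.SparkPi")

    val mainClass  = Class.forName(childMainClass)        // Utils.classForName adds Spark's class-loader handling
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    mainMethod.invoke(null, childArgs.toArray)            // null receiver: main is static
  }
}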
Add the dependency:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-yarn_2.11</artifactId>
<version>2.1.1</version>
</dependency>
Step into org.apache.spark.deploy.yarn.Client:
private object Client extends Logging {
def main(argStrings: Array[String]) {
if (!sys.props.contains("SPARK_SUBMIT")) {
logWarning("WARNING: This client is deprecated and will be removed in a " +
"future version of Spark. Use ./bin/spark-submit with \"--master yarn\"")
}
// Set an env variable indicating we are running in YARN mode.
// Note that any env variable with the SPARK_ prefix gets propagated to all (remote) processes
System.setProperty("SPARK_YARN_MODE", "true")
val sparkConf = new SparkConf
// SparkSubmit would use yarn cache to distribute files & jars in yarn mode,
// so remove them from sparkConf here for yarn mode.
sparkConf.remove("spark.jars")
sparkConf.remove("spark.files")
val args = new ClientArguments(argStrings)
new Client(args, sparkConf).run()
}
val amClass =
if (isClusterMode) {
Utils.classForName("org.apache.spark.deploy.yarn.ApplicationMaster").getName
} else {
Utils.classForName("org.apache.spark.deploy.yarn.ExecutorLauncher").getName
}
if (args.primaryRFile != null && args.primaryRFile.endsWith(".R")) {
args.userArgs = ArrayBuffer(args.primaryRFile) ++ args.userArgs
}
val userArgs = args.userArgs.flatMap { arg =>
Seq("--arg", YarnSparkHadoopUtil.escapeForShell(arg))
}
val amArgs =
Seq(amClass) ++ userClass ++ userJar ++ primaryPyFile ++ primaryRFile ++
userArgs ++ Seq(
"--properties-file", buildPath(YarnSparkHadoopUtil.expandEnvironment(Environment.PWD),
LOCALIZED_CONF_DIR, SPARK_CONF_FILE))
// Command for the ApplicationMaster
val commands = prefixEnv ++ Seq(
YarnSparkHadoopUtil.expandEnvironment(Environment.JAVA_HOME) + "/bin/java", "-server"
) ++
javaOpts ++ amArgs ++
Seq(
"1>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout",
"2>", ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr")
Client (summary):
// create a YARN client, used to talk to YARN
yarnClient = YarnClient.createYarnClient
// submit the application
this.appId = submitApplication()
// during submission, build the launch command string, either
//   bin/java org.apache.spark.deploy.yarn.ApplicationMaster   (cluster mode)
//   bin/java org.apache.spark.deploy.yarn.ExecutorLauncher    (client mode)
// and send the command to YARN through the YARN client
Utils.classForName("org.apache.spark.deploy.yarn.ApplicationMaster").getName
Utils.classForName("org.apache.spark.deploy.yarn.ExecutorLauncher").getName
Step into org.apache.spark.deploy.yarn.ApplicationMaster:
def main(args: Array[String]): Unit = {
SignalUtils.registerLogger(log)
val amArgs = new ApplicationMasterArguments(args)
// Load the properties file with the Spark configuration and set entries as system properties,
// so that user code run inside the AM also has access to them.
// Note: we must do this before SparkHadoopUtil instantiated
if (amArgs.propertiesFile != null) {
Utils.getPropertiesFromFile(amArgs.propertiesFile).foreach { case (k, v) =>
sys.props(k) = v
}
}
SparkHadoopUtil.get.runAsSparkUser { () =>
master = new ApplicationMaster(amArgs, new YarnRMClient)
System.exit(master.run())
}
}
while (!args.isEmpty) {
// --num-workers, --worker-memory, and --worker-cores are deprecated since 1.0,
// the properties with executor in their names are preferred.
args match {
case ("--jar") :: value :: tail =>
userJar = value
args = tail
case ("--class") :: value :: tail =>
userClass = value
args = tail
case ("--primary-py-file") :: value :: tail =>
primaryPyFile = value
args = tail
case ("--primary-r-file") :: value :: tail =>
primaryRFile = value
args = tail
case ("--arg") :: value :: tail =>
userArgsBuffer += value
args = tail
case ("--properties-file") :: value :: tail =>
propertiesFile = value
args = tail
case _ =>
printUsageAndExit(1, args)
}
}
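The same parsing style in a self-contained form: treat the arguments as a List and peel off each "--flag value" pair with pattern matching, e.g. ArgParseSketch.main(Array("--jar", "app.jar", "--class", "com.example.Main")).
object ArgParseSketch {
  def main(argStrings: Array[String]): Unit = {
    var userJar: String   = null
    var userClass: String = null

    var args = argStrings.toList
    while (args.nonEmpty) {
      args match {
        case "--jar" :: value :: tail =>
          userJar = value
          args = tail
        case "--class" :: value :: tail =>
          userClass = value
          args = tail
        case unknown :: _ =>
          // ApplicationMasterArguments calls printUsageAndExit(1, args) here
          sys.error(s"Unknown argument: $unknown")
      }
    }
    println(s"userJar=$userJar, userClass=$userClass")
  }
}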
ApplicationMaster (summary):
// read the --class argument, e.g. SparkCoreDemo
userClass = value
// run the Driver: the class that creates the SparkContext object is called the Driver
runDriver(securityMgr)
userClassThread = startUserApplication()
// get the main method of the Driver class via reflection
val mainMethod = userClassLoader.loadClass(args.userClass)
  .getMethod("main", classOf[Array[String]])
// create the user Driver thread
val userThread = new Thread
// start the Driver thread
userThread.start
// the Driver thread executes the user class's main method
mainMethod.invoke(null, userArgs.toArray)
// wait for userClassThread to finish
userClassThread.join()
// register the ApplicationMaster
registerAM(sc.getConf, rpcEnv, driverRef, sc.ui.map(_.appUIAddress).getOrElse(""),
  securityMgr)
// allocate resources
allocator.allocateResources()
// run the allocated containers
runAllocatedContainers(containersToUse)
// build the command to send to the NM:
//   bin/java org.apache.spark.executor.CoarseGrainedExecutorBackend
val commands = prepareCommand()
CoarseGrainedExecutorBackend: a message-communication endpoint
run(driverUrl, executorId, hostname, cores, appId, workerUrl, userClassPath)
SparkEnv.createExecutorEnv
env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend
env.rpcEnv.setupEndpoint("WorkerWatcher", new WorkerWatcher(env.rpcEnv, url))
env.rpcEnv.awaitTermination()
NettyRpcEnv internals (a simplified sketch follows below):
Dispatcher:
NettyRpcEndpointRef:
RpcEndpointAddress:
EndpointData:
Inbox:
// create the Executor compute object, which runs the scheduled Tasks
executor = new Executor
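These are Spark-internal classes; the sketch below is not Spark's real API, just a stripped-down version of the Dispatcher/Inbox idea: each endpoint owns an inbox (a queue of messages), the dispatcher routes incoming messages to the right inbox, and a dedicated thread drains the inbox and hands each message to the endpoint's receive method.
import java.util.concurrent.LinkedBlockingQueue

trait Endpoint { def receive(msg: Any): Unit }

class Inbox(endpoint: Endpoint) {
  private val messages = new LinkedBlockingQueue[Any]()
  def post(msg: Any): Unit = messages.put(msg)
  // loops forever, like the real message loop threads
  def process(): Unit = while (true) endpoint.receive(messages.take())
}

class Dispatcher {
  private var endpoints = Map.empty[String, Inbox]
  def setupEndpoint(name: String, endpoint: Endpoint): Unit = {
    val inbox = new Inbox(endpoint)
    endpoints += name -> inbox
    new Thread(new Runnable { def run(): Unit = inbox.process() }).start()
  }
  def postMessage(name: String, msg: Any): Unit = endpoints(name).post(msg)
}

// usage: register an "Executor"-like endpoint and send it a message
object RpcSketch {
  def main(args: Array[String]): Unit = {
    val dispatcher = new Dispatcher
    dispatcher.setupEndpoint("Executor", new Endpoint {
      def receive(msg: Any): Unit = println(s"Executor endpoint got: $msg")
    })
    dispatcher.postMessage("Executor", "LaunchTask(taskId = 0)")
  }
}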
Spark deployment modes
Spark supports three cluster managers (the master URL that selects each one is sketched at the end of this section):
- Standalone: Spark's own simple cluster manager. It ships with a complete set of services and can be deployed on a cluster by itself, without depending on any other resource-management system, which makes it easy to stand up a cluster.
- Apache Mesos: a powerful distributed resource-management framework that can host many different frameworks (including YARN). It is not widely used domestically and is somewhat more common abroad.
- Hadoop YARN: a unified resource-management layer on which multiple computing frameworks (MapReduce, Storm, etc.) can run. Depending on where the driver runs in the cluster, it is split into yarn-client and yarn-cluster modes.
How Standalone mode runs
A Standalone cluster has four important components:
- Driver: a process. The Spark application we write runs on the Driver and is executed by the Driver process.
- Master (the RM counterpart): a process mainly responsible for scheduling and allocating resources, and for monitoring the cluster.
- Worker (the NM counterpart): a process. One Worker runs on one server in the cluster and has two duties: use its own memory to store one or more partitions of an RDD, and start other processes and threads (Executors) that process and compute the RDD's partitions in parallel.
- Executor: a process. One Worker can run several Executors; an Executor runs multiple threads (tasks) to compute the RDD's partitions in parallel, i.e. to execute the operators we define on the RDD, such as map, flatMap and reduce.
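A short sketch of how an application selects one of these cluster managers: only the master URL changes while the application code stays the same (host names and ports below are placeholders); under YARN the master and deploy mode usually come from spark-submit rather than from code.
import org.apache.spark.{SparkConf, SparkContext}

object MasterUrlSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("deploy-mode-demo")
    conf.setMaster("local[*]")                     // single JVM, for development
    // conf.setMaster("spark://master-host:7077")  // Standalone Master/Worker cluster
    // conf.setMaster("mesos://mesos-host:5050")   // Apache Mesos
    // conf.setMaster("yarn")                      // Hadoop YARN (usually set via spark-submit instead)
    val sc = new SparkContext(conf)
    sc.stop()
  }
}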