1. Where to Start?
Under the Spark examples module we can find sample code for all kinds of Spark applications, including graphx, ml (machine learning), sql, streaming, and so on.
Let's start with the source code of the simplest one, SparkPi:
// scalastyle:off println
package org.apache.spark.examples

import scala.math.random

import org.apache.spark.sql.SparkSession

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("Spark Pi")
      .getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * count / (n - 1)}")
    spark.stop()
  }
}
// scalastyle:on println
The Spark codebase's scalastyle checks disallow println-style output by default; if you need println and friends, the code must be placed between // scalastyle:off println and // scalastyle:on println.
We can see that this code first creates a SparkSession. Ctrl+click into it to open the class's source and bring up the Structure view:
This Scala file contains the SparkSession class together with its companion object.
Let's first read the SparkSession class comment.
It says that SparkSession is the entry point for programming Spark with the Dataset and DataFrame API, and it also describes how to create one and what the constructor parameters are. The first constructor parameter is sparkContext, i.e. the Spark context the SparkSession is associated with.
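As a minimal sketch (the object name, master, app name and config key below are placeholders of ours, not code from the Spark repo), this is how a SparkSession is normally built through its companion object's builder, and how the SparkContext it is associated with can be reached:

import org.apache.spark.sql.SparkSession

object SessionEntryPoint {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("session-demo")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    // The SparkContext the session is associated with (its first constructor parameter)
    println(spark.sparkContext.appName)
    spark.stop()
  }
}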
2. Developing a Demo Application
The official RDD-based quick start demos all begin a Spark application by creating a SparkContext, yet every application the author found while browsing the examples module starts by creating a SparkSession. Combining the two with the official programming guide, let's write a simple local test application:
// scalastyle:off println
import org.apache.spark.{SparkConf, SparkContext}

object RddLocalTest {
  def main(args: Array[String]): Unit = {
    // Run locally with two worker threads
    val conf = new SparkConf().setAppName("myApp").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val data = Array(1, 2, 3, 4)
    // Distribute the local collection as an RDD, then transform it
    val distData = sc.parallelize(data)
    val mappedData = distData.map(x => x + 1)
    println(mappedData.count())
    println("numPartitions: " + mappedData.getNumPartitions)
    mappedData.foreach(x => println(x))
    sc.stop()
  }
}
// scalastyle:on println
The code above has a few main steps:
1. Create a SparkContext from a SparkConf
2. Perform operations on the data
3. Stop the SparkContext
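Since the examples module starts from SparkSession rather than SparkContext, the same demo can also be written session-first. The sketch below is our own adaptation (object name and values are arbitrary, not code from the Spark repo); it still reaches the RDD API through spark.sparkContext:

// scalastyle:off println
import org.apache.spark.sql.SparkSession

object SessionLocalTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("myApp")
      .master("local[2]")
      .getOrCreate()
    // The underlying SparkContext still drives the RDD operations
    val mappedData = spark.sparkContext.parallelize(Array(1, 2, 3, 4)).map(_ + 1)
    println(mappedData.count())
    println("numPartitions: " + mappedData.getNumPartitions)
    spark.stop()
  }
}
// scalastyle:on println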
3. Source Code Walkthrough
These are the classes and objects contained in SparkContext.scala:
As you can see, it is fairly complex, containing quite a few classes and objects.
Now let's look at the SparkContext class comment, which says:
SparkContext is the main entry point for Spark functionality. It represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
Only one SparkContext may be active per JVM; you must stop() the active SparkContext before creating a new one. This limitation may eventually be removed.
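A minimal sketch of that rule in practice (the app names and master below are our own choices): SparkContext.getOrCreate hands back the already-active context instead of building a second one, and only after stop() can a fresh context be constructed.

import org.apache.spark.{SparkConf, SparkContext}

object OneContextPerJvm {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("first").setMaster("local[2]")
    val sc1 = SparkContext.getOrCreate(conf)  // creates the active context
    val sc2 = SparkContext.getOrCreate(conf)  // returns the same active context
    assert(sc1 eq sc2)

    sc1.stop()                                // stop it before creating a new one
    val sc3 = new SparkContext(new SparkConf().setAppName("second").setMaster("local[2]"))
    sc3.stop()
  }
}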
3.1 Basic Structure
Expand the SparkContext class. Let's first look at its constructors:
A variety of constructors are provided for creating a SparkContext.
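For illustration only (the exact set of overloads depends on the Spark version, and the master and app name values here are arbitrary), caller code can reach these constructors in forms like the following: the primary constructor takes a SparkConf, while auxiliary ones accept the most common settings directly.

import org.apache.spark.{SparkConf, SparkContext}

object ConstructorForms {
  def main(args: Array[String]): Unit = {
    // Primary constructor: everything comes in through a SparkConf
    val sc1 = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("demo"))
    sc1.stop()

    // Auxiliary constructor: master and app name passed directly
    val sc2 = new SparkContext("local[2]", "demo")
    sc2.stop()

    // Auxiliary constructor: master, app name, plus a SparkConf for everything else
    val sc3 = new SparkContext("local[2]", "demo", new SparkConf())
    sc3.stop()
  }
}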
Further down there are also a great many internal fields.
The comment above these fields reads:
These private variables keep the internal state of the context and are not accessible from the outside. They are mutable because we want to initialize all of them to some neutral value ahead of time, so that calling stop() while the constructor is still running is safe.
These fields include core components such as taskScheduler and dagScheduler. Next, let's look at the getter and setter for taskScheduler:
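The accessors follow the standard Scala private-field-plus-`name_=` setter idiom. The self-contained sketch below only demonstrates that idiom with invented names (Context, scheduler); it is not the real SparkContext code.

// Illustrative only: the same getter/setter shape SparkContext uses for fields
// like _taskScheduler, with made-up names.
class Context {
  private var _scheduler: String = _          // starts as null, a "neutral" value

  def scheduler: String = _scheduler          // getter
  def scheduler_=(s: String): Unit = {        // setter, enables `ctx.scheduler = ...`
    _scheduler = s
  }
}

object AccessorPatternDemo {
  def main(args: Array[String]): Unit = {
    val ctx = new Context
    ctx.scheduler = "TaskSchedulerImpl"       // desugars to ctx.scheduler_=("TaskSchedulerImpl")
    println(ctx.scheduler)
  }
}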
Finally, let's look at how Spark initializes all of these fields. We find the following code block, which is quite long:
It is preceded by a comment explaining that this block carries out the context's initialization.
try {
  _conf = config.clone()
  _conf.validateSettings()
  if (!_conf.contains("spark.master")) {
    throw new SparkException("A master URL must be set in your configuration")
  }
  if (!_conf.contains("spark.app.name")) {
    throw new SparkException("An application name must be set in your configuration")
  }
  // log out spark.app.name in the Spark driver logs
  logInfo(s"Submitted application: $appName")
  // System property spark.yarn.app.id must be set if user code ran by AM on a YARN cluster
  if (master == "yarn" && deployMode == "cluster" && !_conf.contains("spark.yarn.app.id")) {
    throw new SparkException("Detected yarn cluster mode, but isn't running on a cluster. " +
      "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
  }
  if (_conf.getBoolean("spark.logConf", false)) {
    logInfo("Spark configuration:\n" + _conf.toDebugString)
  }
  // Set Spark driver host and port system properties. This explicitly sets the configuration
  // instead of relying on the default value of the config constant.
  _conf.set(DRIVER_HOST_ADDRESS, _conf.get(DRIVER_HOST_ADDRESS))
  _conf.setIfMissing("spark.driver.port", "0")
  _conf.set("spark.executor.id", SparkContext.DRIVER_IDENTIFIER)
  _jars = Utils.getUserJars(_conf)
  _files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.nonEmpty))
    .toSeq.flatten
  _eventLogDir =
    if (isEventLogEnabled) {
      val unresolvedDir = conf.get("spark.eventLog.dir", EventLoggingListener.DEFAULT_LOG_DIR)
        .stripSuffix("/")
      Some(Utils.resolveURI(unresolvedDir))
    } else {
      None
    }
  _eventLogCodec = {
    val compress = _conf.getBoolean("spark.eventLog.compress", false)
    if (compress && isEventLogEnabled) {
      Some(CompressionCodec.getCodecName(_conf)).map(CompressionCodec.getShortName)
    } else {
      None
    }
  }
  _listenerBus = new LiveListenerBus(_conf)
  // Initialize the app status store and listener before SparkEnv is created so that it gets
  // all events.
  _statusStore = AppStatusStore.createLiveStore(conf)
  listenerBus.addToStatusQueue(_statusStore.listener.get)
  // Create the Spark execution environment (cache, map output tracker, etc)
  _env = createSparkEnv(_conf, isLocal, listenerBus)
  SparkEnv.set(_env)
  // If running the REPL, register the repl's output dir with the file server.
  _conf.getOption("spark.repl.class.outputDir").foreach { path =>
    val replUri = _env.rpcEnv.fileServer.addDirectory("/classes", new File(path))
    _conf.set("spark.repl.class.uri", replUri)
  }
  _statusTracker = new SparkStatusTracker(this, _statusStore)
  _progressBar =
    if (_conf.get(UI_SHOW_CONSOLE_PROGRESS) && !log.isInfoEnabled) {
      Some(new ConsoleProgressBar(this))
    } else {
      None
    }
  _ui =
    if (conf.getBoolean("spark.ui.enabled", true)) {
      Some(SparkUI.create(Some(this), _statusStore, _conf, _env.securityManager, appName, "",
        startTime))
    } else {
      // For tests, do not enable the UI
      None
    }
  // Bind the UI before starting the task scheduler to communicate
  // the bound port to the cluster manager properly
  _ui.foreach(_.bind())
  _hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)
  // Add each JAR given through the constructor
  if (jars != null) {
    jars.foreach(addJar)
  }
  if (files != null) {
    files.foreach(addFile)
  }
  _executorMemory = _conf.getOption("spark.executor.memory")
    .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
    .orElse(Option(System.getenv("SPARK_MEM"))
    .map(warnSparkMem))
    .map(Utils.memoryStringToMb)
    .getOrElse(1024)
  // Convert java options to env vars as a work around
  // since we can't set env vars directly in sbt.
  for {(envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
    value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
    executorEnvs(envKey) = value
  }
  Option(System.getenv("SPARK_PREPEND_CLASSES")).foreach { v =>
    executorEnvs("SPARK_PREPEND_CLASSES") = v
  }
  // The Mesos scheduler backend relies on this environment variable to set executor memory.
  // TODO: Set this only in the Mesos scheduler.
  executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
  executorEnvs ++= _conf.getExecutorEnv
  executorEnvs("SPARK_USER") = sparkUser
  // We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
  // retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
  _heartbeatReceiver = env.rpcEnv.setupEndpoint(
    HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))
  // Create and start the scheduler
  val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
  _schedulerBackend = sched
  _taskScheduler = ts
  _dagScheduler = new DAGScheduler(this)
  _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
  // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
  // constructor
  _taskScheduler.start()
  _applicationId = _taskScheduler.applicationId()
  _applicationAttemptId = taskScheduler.applicationAttemptId()
  _conf.set("spark.app.id", _applicationId)
  if (_conf.getBoolean("spark.ui.reverseProxy", false)) {
    System.setProperty("spark.ui.proxyBase", "/proxy/" + _applicationId)
  }
  _ui.foreach(_.setAppId(_applicationId))
  _env.blockManager.initialize(_applicationId)
  // The metrics system for Driver need to be set spark.app.id to app ID.
  // So it should start after we get app ID from the task scheduler and set spark.app.id.
  _env.metricsSystem.start()
  // Attach the driver metrics servlet handler to the web ui after the metrics system is started.
  _env.metricsSystem.getServletHandlers.foreach(handler => ui.foreach(_.attachHandler(handler)))
  _eventLogger =
    if (isEventLogEnabled) {
      val logger =
        new EventLoggingListener(_applicationId, _applicationAttemptId, _eventLogDir.get,
          _conf, _hadoopConfiguration)
      logger.start()
      listenerBus.addToEventLogQueue(logger)
      Some(logger)
    } else {
      None
    }
  // Optionally scale number of executors dynamically based on workload. Exposed for testing.
  val dynamicAllocationEnabled = Utils.isDynamicAllocationEnabled(_conf)
  _executorAllocationManager =
    if (dynamicAllocationEnabled) {
      schedulerBackend match {
        case b: ExecutorAllocationClient =>
          Some(new ExecutorAllocationManager(
            schedulerBackend.asInstanceOf[ExecutorAllocationClient], listenerBus, _conf))
        case _ =>
          None
      }
    } else {
      None
    }
  _executorAllocationManager.foreach(_.start())
  _cleaner =
    if (_conf.getBoolean("spark.cleaner.referenceTracking", true)) {
      Some(new ContextCleaner(this))
    } else {
      None
    }
  _cleaner.foreach(_.start())
  setupAndStartListenerBus()
  postEnvironmentUpdate()
  postApplicationStart()
  // Post init
  _taskScheduler.postStartHook()
  _env.metricsSystem.registerSource(_dagScheduler.metricsSource)
  _env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
  _executorAllocationManager.foreach { e =>
    _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
  }
  // Make sure the context is stopped if the user forgets about it. This avoids leaving
  // unfinished event logs around after the JVM exits cleanly. It doesn't help if the JVM
  // is killed, though.
  logDebug("Adding shutdown hook") // force eager creation of logger
  _shutdownHookRef = ShutdownHookManager.addShutdownHook(
    ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY) { () =>
    logInfo("Invoking stop() from shutdown hook")
    stop()
  }
} catch {
  case NonFatal(e) =>
    logError("Error initializing SparkContext.", e)
    try {
      stop()
    } catch {
      case NonFatal(inner) =>
        logError("Error stopping SparkContext after init error.", inner)
    } finally {
      throw e
    }
}
First, the Spark config is cloned into the _conf variable, and a series of validation checks are run against _conf to catch illegal settings; the remaining fields are then initialized one by one. The lines around SparkContext.createTaskScheduler in the block above are where the TaskScheduler is created and started.
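Once the block finishes, several of the things it set up are visible through public accessors. As a small sketch (master and app name are arbitrary), after constructing a context you can read back the application id obtained from the TaskScheduler, the bound UI address, and the deploy mode:

import org.apache.spark.{SparkConf, SparkContext}

object InspectInitializedContext {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("inspect"))
    println("applicationId: " + sc.applicationId)                // set from _taskScheduler.applicationId()
    println("uiWebUrl: " + sc.uiWebUrl.getOrElse("UI disabled")) // bound during _ui.foreach(_.bind())
    println("deployMode: " + sc.deployMode)
    sc.stop()
  }
}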
4. Summary
SparkContext is the main entry point for developing a Spark application. It holds the various parameters and state of the running application, which for now cannot be modified dynamically; its main job is to set, validate, and initialize the many parameters that govern how Spark runs.
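One small consequence of the `_conf = config.clone()` line seen above can be observed from user code. This sketch (values are arbitrary) shows that modifying the original SparkConf after the context has been created has no effect on the running context:

import org.apache.spark.{SparkConf, SparkContext}

object ConfIsCloned {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("before")
    val sc = new SparkContext(conf)
    conf.setAppName("after")                  // mutates only the original conf object
    println(sc.getConf.get("spark.app.name")) // prints "before": the context cloned the conf
    sc.stop()
  }
}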