1 Running the First Spark Program
This example estimates Pi with a Monte Carlo algorithm.
/home/hadoop/software/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://harvey:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
/home/hadoop/software/spark/examples/jars/spark-examples_2.11-2.2.1.jar
Parameter notes:
--master spark://harvey:7077 specifies the address of the Master
--executor-memory 1G specifies 1 GB of usable memory per executor
--total-executor-cores 2 specifies a total of 2 CPU cores across all executors
Result:
Pi is roughly 3.140955704778524
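The bundled SparkPi example scatters random points over a square and counts how many fall inside the inscribed circle; that ratio approximates Pi/4. Below is a minimal Scala sketch of the same idea (not the exact bundled source; the object name and sample count are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.random

object MonteCarloPi {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MonteCarloPi"))
    val n = 200000                                   // total number of random points to sample
    val inside = sc.parallelize(1 to n, 2).map { _ =>
      val x = random * 2 - 1                         // random point in the square [-1, 1] x [-1, 1]
      val y = random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0               // 1 if the point lies inside the unit circle
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * inside / n}")    // circle/square area ratio is Pi/4
    sc.stop()
  }
}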
2 Submitting a Spark Application
Once the application is packaged, it can be launched with the bin/spark-submit script. This script sets up the classpath with Spark and its dependencies, and it supports the different cluster managers and deploy modes.
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Some commonly used options:
1) --class: the entry point of your application (e.g. org.apache.spark.examples.SparkPi)
2) --master: the master URL of the cluster (e.g. spark://harvey:7077)
3) --deploy-mode: whether to deploy your driver on a worker node (cluster) or locally as an external client (client) (default: client)
4) --conf: an arbitrary Spark configuration property in key=value format; wrap "key=value" in quotes if the value contains spaces. Properties not set here fall back to the default Spark configuration (conf/spark-defaults.conf).
5) application-jar: the packaged application jar, including its dependencies. The URL must be globally visible inside the cluster, e.g. an hdfs:// path on shared storage; for a file:// path, the same jar must be present at that path on every node.
6) application-arguments: arguments passed to the main() method
The master URL can take one of the following forms: local, local[K], local[*], spark://HOST:PORT, mesos://HOST:PORT, or yarn.
Full list of spark-submit options:
[hadoop@harvey bin]$ ./spark-submit
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
3. Starting the Spark Shell
3.1 Starting the Spark Shell
spark-shell is the interactive shell that ships with Spark. It is convenient for interactive programming: at this prompt users can write Spark programs in Scala.
Start the Spark shell with the following command:
/home/hadoop/software/spark/bin/spark-shell \
--master spark://harvey:7077 \
--executor-memory 1G \
--total-executor-cores 2
Note:
If no master address is specified when spark-shell starts, the shell still starts normally and programs entered in it still run; it simply does not connect to the standalone cluster and falls back to local mode. On a single-node Spark installation with no slaves file configured, spark-shell likewise defaults to local mode.
In local mode the master and worker run inside the same process.
In cluster mode the master and worker run in separate processes.
The Spark shell initializes a SparkContext as the object sc by default; user code that needs it can use sc directly.
3.2 Writing a WordCount Program in the Spark Shell
Upload the RELEASE file from the Spark installation directory to hdfs://harvey:9000/RELEASE
[hadoop@harvey spark]$ hadoop fs -put RELEASE /
View the file contents:
[hadoop@harvey spark]$ hadoop fs -text /RELEASE
Spark 2.2.1 built for Hadoop 2.7.3
Build flags: -Phadoop-2.7 -Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos -DzincPort=3036
Write the WordCount program in Scala in the Spark shell; the code is as follows:
sc.textFile("hdfs://harvey:9000/RELEASE").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_).saveAsTextFile("hdfs://harvey:9000/out")
View the result with an HDFS command:
$ hadoop fs -text /out/*
Explanation:
sc is the SparkContext object, the entry point for submitting Spark programs
textFile("hdfs://harvey:9000/RELEASE") reads the data from HDFS
flatMap(_.split(" ")) maps each line to words and flattens the result
map((_, 1)) turns each word into a (word, 1) tuple
reduceByKey(_+_) reduces by key, summing the values
saveAsTextFile("hdfs://harvey:9000/out") writes the result back to HDFS
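The same pipeline can also be written step by step in the Spark shell (where sc already exists), which makes the intermediate RDD types easier to see; this sketch is equivalent to the one-liner above:

val lines  = sc.textFile("hdfs://harvey:9000/RELEASE")   // RDD[String], one element per line
val words  = lines.flatMap(_.split(" "))                 // RDD[String], one element per word
val pairs  = words.map((_, 1))                           // RDD[(String, Int)]
val counts = pairs.reduceByKey(_ + _)                    // RDD[(String, Int)], counts summed per word
counts.saveAsTextFile("hdfs://harvey:9000/out")          // writes one part file per partition under /out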
4. Writing a WordCount Program in IDEA
Create a project named spark-wordcount
- pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.spark.harvey</groupId>
    <artifactId>spark-wordcount</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <spark.version>2.2.1</spark.version>
        <hadoop.version>3.0.1</hadoop.version>
        <scala.version>2.11.12</scala.version>
        <log4j.version>1.2.12</log4j.version>
        <slf4j.version>1.7.25</slf4j.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>${log4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <!--<testSourceDirectory>src/test/scala</testSourceDirectory>-->
        <resources>
            <!-- Make sure all .properties files under resources are filtered -->
            <resource>
                <directory>src/main/resources</directory>
                <includes>
                    <include>**/*.properties</include>
                </includes>
                <filtering>true</filtering>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.10</version>
                <configuration>
                    <useFile>false</useFile>
                    <disableXmlReport>true</disableXmlReport>
                    <includes>
                        <include>**/*Test.*</include>
                        <include>**/*Suite.*</include>
                    </includes>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.4</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.spark.harvey.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <reporting>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                </configuration>
            </plugin>
        </plugins>
    </reporting>
</project>
- Configure logging (create log4j.xml under resources)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration>
    <!-- Write log messages to the console -->
    <appender name="ConsoleAppender" class="org.apache.log4j.ConsoleAppender">
        <!-- Layout of the log output -->
        <layout class="org.apache.log4j.PatternLayout">
            <!-- Log message format -->
            <param name="ConversionPattern" value="[%d{yyyy-MM-dd HH:mm:ss:SSS}] [%-5p] [method:%l]%n%m%n%n" />
        </layout>
        <!-- Filter that restricts the output levels -->
        <filter class="org.apache.log4j.varia.LevelRangeFilter">
            <!-- Minimum level to output -->
            <param name="levelMin" value="debug" />
            <!-- Maximum level to output -->
            <param name="levelMax" value="debug" />
            <!-- Accept the event when its level falls inside the range (default: false) -->
            <param name="AcceptOnMatch" value="true" />
        </filter>
    </appender>
    <!-- Root logger configuration -->
    <root>
        <level value="INFO"/>
        <appender-ref ref="ConsoleAppender"/>
    </root>
</log4j:configuration>
- Program code
package com.spark.harvey

import org.apache.spark.{SparkConf, SparkContext}
import org.slf4j.LoggerFactory

object WordCount {
  val logger = LoggerFactory.getLogger(WordCount.getClass)

  def main(args: Array[String]): Unit = {
    // Create a SparkConf and set the application name
    val conf: SparkConf = new SparkConf().setAppName("WordCount")
    // Create the Spark context
    val sc: SparkContext = new SparkContext(conf)
    // Count the words with RDD operations and save the result, sorted by count in descending order
    sc.textFile(args(0)).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_, 1).sortBy(_._2, false).saveAsTextFile(args(1))
    logger.info("wordcount complete!")
    sc.stop()
  }
}
- Upload to the cluster and run the job
/home/hadoop/software/spark/bin/spark-submit \
--master spark://harvey:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
/home/hadoop/jar/spark-wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar \
hdfs://harvey:9000/RELEASE \
hdfs://harvey:9000/out
View the result:
$ hadoop fs -text /out/*
5. Debugging the WordCount Program Locally in IDEA
Debugging a Spark program locally uses the local submit mode, i.e., the local machine is the runtime environment and acts as both Master and Worker.
Set the program arguments: the input directory and the output directory.
Then simply set breakpoints and run the program in debug mode.
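A minimal sketch of what the local-mode variant might look like (the object name is illustrative; setMaster("local[*]") keeps everything in the local JVM):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountLocal")
      .setMaster("local[*]")            // Master and Worker both run inside the local JVM, using all local cores
    val sc = new SparkContext(conf)
    sc.textFile(args(0))                // args(0): input path (a local file or an hdfs:// URL)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))          // args(1): output directory, which must not already exist
    sc.stop()
  }
}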
Problems encountered
- 1) Running a Hadoop-based program on Windows without a Hadoop environment configured
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Reference: the blog post "MapReduce 工作机制 错误记录及解决".
- 2) No write permission when IDEA on Windows connects to HDFS remotely
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=admin, access=WRITE, inode="/":hadoop:supergroup:drwxr-xr-x
The error means writing to HDFS is not permitted. One workaround:
Add the following configuration to Hadoop's hdfs-site.xml:
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
Then restart Hadoop and Spark and rerun the program; the problem disappears.
6. Remote Debugging the WordCount Program from IDEA
Remote debugging through IDEA essentially uses IDEA as the Driver to submit the application. The configuration is as follows:
Modify the SparkConf: add the jar that will actually be shipped and run, the address of the Driver program, and the Master URL to submit to.
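A sketch of what such a configuration might look like (the jar path and the driver host IP are placeholders for this environment):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountRemote {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountRemote")
      .setMaster("spark://harvey:7077")                                                 // submit to the standalone Master
      .setJars(List("target/spark-wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar"))   // jar shipped to the executors
      .setIfMissing("spark.driver.host", "192.168.1.100")                               // IP of the machine running IDEA (placeholder)
    val sc = new SparkContext(conf)
    sc.textFile("hdfs://harvey:9000/RELEASE")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs://harvey:9000/out")
    sc.stop()
  }
}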
Run the program from IDEA and check the output directory on HDFS for the result.
7. Spark Core Concepts
Every Spark application consists of a driver program that launches the various parallel operations on the cluster. The driver program contains the application's main function, defines distributed datasets on the cluster, and applies operations to them.
The driver program accesses Spark through a SparkContext object, which represents a connection to the computing cluster. When the shell starts, it automatically creates a SparkContext, available as the variable sc.
The driver program typically manages a number of executor nodes.
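To make the division of labour concrete, here is a minimal sketch (the object name is hypothetical) annotating which parts run in the driver and which run on the executors:

import org.apache.spark.{SparkConf, SparkContext}

object DriverVsExecutors {
  def main(args: Array[String]): Unit = {
    // Everything in main() runs in the driver program.
    val sc = new SparkContext(new SparkConf().setAppName("DriverVsExecutors"))  // the connection to the cluster
    val data = sc.parallelize(1 to 1000)   // defines a distributed dataset; no computation happens yet
    val doubled = data.map(_ * 2)          // the function passed to map is sent to and run on the executors
    println(doubled.count())               // count() triggers a distributed job; the result comes back to the driver
    sc.stop()
  }
}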