1 Running the First Spark Program
This example estimates Pi with a Monte Carlo algorithm.
/home/hadoop/software/spark/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://harvey:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
/home/hadoop/software/spark/examples/jars/spark-examples_2.11-2.2.1.jar
Parameter notes:
--master spark://harvey:7077 specifies the address of the Master
--executor-memory 1G specifies 1 GB of usable memory per executor
--total-executor-cores 2 specifies a total of 2 CPU cores across all executors
Result:
Pi is roughly 3.140955704778524
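The bundled SparkPi example scatters random points over a square and counts how many fall inside the inscribed circle; that ratio approximates Pi/4. Below is a minimal Scala sketch of the same idea (not the exact bundled source; the object name and sample count are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import scala.math.random

object MonteCarloPi {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MonteCarloPi"))
    val n = 200000                                   // total number of random points to sample
    val inside = sc.parallelize(1 to n, 2).map { _ =>
      val x = random * 2 - 1                         // random point in the square [-1, 1] x [-1, 1]
      val y = random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0               // 1 if the point lies inside the unit circle
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * inside / n}")    // circle/square area ratio is Pi/4
    sc.stop()
  }
}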
2 Submitting a Spark Application
Once the application is packaged, it can be launched with the bin/spark-submit script. This script sets up the classpath with Spark and its dependencies, and it supports the different cluster managers and deploy modes.
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Some commonly used options:
1) --class: the entry point of your application (e.g. org.apache.spark.examples.SparkPi)
2) --master: the master URL of the cluster (e.g. spark://harvey:7077)
3) --deploy-mode: whether to deploy your driver on a worker node (cluster) or locally as an external client (client) (default: client)
4) --conf: an arbitrary Spark configuration property in key=value format; wrap "key=value" in quotes if the value contains spaces. Properties not set here fall back to the default Spark configuration (conf/spark-defaults.conf).
5) application-jar: the packaged application jar, including its dependencies. The URL must be globally visible inside the cluster, e.g. an hdfs:// path on shared storage; for a file:// path, the same jar must be present at that path on every node.
6) application-arguments: arguments passed to the main() method
The master URL can take one of the following forms: local, local[K], local[*], spark://HOST:PORT, mesos://HOST:PORT, or yarn.
Full list of spark-submit options:
[hadoop@harvey bin]$ ./spark-submit
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode,
or all available cores on the worker in standalone mode)
YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on
secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above. This keytab will be copied to
the node running the Application Master via the Secure
Distributed Cache, for renewing the login tickets and the
delegation tokens periodically.
3. Starting the Spark Shell
3.1 Starting the Spark Shell
spark-shell is the interactive shell that ships with Spark. It is convenient for interactive programming: at this prompt users can write Spark programs in Scala.
Start the Spark shell with the following command:
/home/hadoop/software/spark/bin/spark-shell \
--master spark://harvey:7077 \
--executor-memory 1G \
--total-executor-cores 2
Note:
If no master address is specified when spark-shell starts, the shell still starts normally and programs entered in it still run; it simply does not connect to the standalone cluster and falls back to local mode. On a single-node Spark installation with no slaves file configured, spark-shell likewise defaults to local mode.
In local mode the master and worker run inside the same process.
In cluster mode the master and worker run in separate processes.
The Spark shell initializes a SparkContext as the object sc by default; user code that needs it can use sc directly.
3.2 Writing a WordCount Program in the Spark Shell
Upload the RELEASE file from the Spark installation directory to hdfs://harvey:9000/RELEASE
[hadoop@harvey spark]$ hadoop fs -put RELEASE /
View the file contents:
[hadoop@harvey spark]$ hadoop fs -text /RELEASE
Spark 2.2.1 built for Hadoop 2.7.3
Build flags: -Phadoop-2.7 -Psparkr -Phive -Phive-thriftserver -Pyarn -Pmesos -DzincPort=3036
Write the WordCount program in Scala in the Spark shell; the code is as follows:
sc.textFile("hdfs://harvey:9000/RELEASE").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_).saveAsTextFile("hdfs://harvey:9000/out")
View the result with an HDFS command:
$ hadoop fs -text /out/*
Explanation:
sc is the SparkContext object, the entry point for submitting Spark programs
textFile("hdfs://harvey:9000/RELEASE") reads the data from HDFS
flatMap(_.split(" ")) maps each line to words and flattens the result
map((_, 1)) turns each word into a (word, 1) tuple
reduceByKey(_+_) reduces by key, summing the values
saveAsTextFile("hdfs://harvey:9000/out") writes the result back to HDFS
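The same pipeline can also be written step by step in the Spark shell (where sc already exists), which makes the intermediate RDD types easier to see; this sketch is equivalent to the one-liner above:

val lines  = sc.textFile("hdfs://harvey:9000/RELEASE")   // RDD[String], one element per line
val words  = lines.flatMap(_.split(" "))                 // RDD[String], one element per word
val pairs  = words.map((_, 1))                           // RDD[(String, Int)]
val counts = pairs.reduceByKey(_ + _)                    // RDD[(String, Int)], counts summed per word
counts.saveAsTextFile("hdfs://harvey:9000/out")          // writes one part file per partition under /out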
4. Writing a WordCount Program in IDEA
Create a project named spark-wordcount
- pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.spark.harvey</groupId>
    <artifactId>spark-wordcount</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <spark.version>2.2.1</spark.version>
        <hadoop.version>3.0.1</hadoop.version>
        <scala.version>2.11.12</scala.version>
        <log4j.version>1.2.12</log4j.version>
        <slf4j.version>1.7.25</slf4j.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>${log4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <!--<testSourceDirectory>src/test/scala</testSourceDirectory>-->
        <resources>
            <!-- Make sure all .properties files under resources are filtered -->
            <resource>
                <directory>src/main/resources</directory>
                <includes>
                    <include>**/*.properties</include>
                </includes>
                <filtering>true</filtering>
            </resource>
        </resources>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.10</version>
                <configuration>
                    <useFile>false</useFile>
                    <disableXmlReport>true</disableXmlReport>
                    <includes>
                        <include>**/*Test.*</include>
                        <include>**/*Suite.*</include>
                    </includes>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.4</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.spark.harvey.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <reporting>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                </configuration>
            </plugin>
        </plugins>
    </reporting>
</project>
- Configure logging (create log4j.xml under resources)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration>
    <!-- Write log messages to the console -->
    <appender name="ConsoleAppender" class="org.apache.log4j.ConsoleAppender">
        <!-- Layout of the log output -->
        <layout class="org.apache.log4j.PatternLayout">
            <!-- Log message format -->
            <param name="ConversionPattern" value="[%d{yyyy-MM-dd HH:mm:ss:SSS}] [%-5p] [method:%l]%n%m%n%n" />
        </layout>
        <!-- Filter that restricts the output levels -->
        <filter class="org.apache.log4j.varia.LevelRangeFilter">
            <!-- Minimum level to output -->
            <param name="levelMin" value="debug" />
            <!-- Maximum level to output -->
            <param name="levelMax" value="debug" />
            <!-- Accept the event when its level falls inside the range (default: false) -->
            <param name="AcceptOnMatch" value="true" />
        </filter>
    </appender>
    <!-- Root logger configuration -->
    <root>
        <level value="INFO"/>
        <appender-ref ref="ConsoleAppender"/>
    </root>
</log4j:configuration>
- Program code
package com.spark.harvey

import org.apache.spark.{SparkConf, SparkContext}
import org.slf4j.LoggerFactory

object WordCount {
  val logger = LoggerFactory.getLogger(WordCount.getClass)

  def main(args: Array[String]): Unit = {
    // Create a SparkConf and set the application name
    val conf: SparkConf = new SparkConf().setAppName("WordCount")
    // Create the Spark context
    val sc: SparkContext = new SparkContext(conf)
    // Count the words with RDD operations and save the result, sorted by count in descending order
    sc.textFile(args(0)).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_, 1).sortBy(_._2, false).saveAsTextFile(args(1))
    logger.info("wordcount complete!")
    sc.stop()
  }
}
- Upload to the cluster and run the job
/home/hadoop/software/spark/bin/spark-submit \
--master spark://harvey:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
/home/hadoop/jar/spark-wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar \
hdfs://harvey:9000/RELEASE \
hdfs://harvey:9000/out
View the result:
$ hadoop fs -text /out/*
5. Debugging the WordCount Program Locally in IDEA
Debugging a Spark program locally uses the local submit mode, i.e., the local machine is the runtime environment and acts as both Master and Worker.
Set the program arguments: the input directory and the output directory.
Then simply set breakpoints and run the program in debug mode.
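A minimal sketch of what the local-mode variant might look like (the object name is illustrative; setMaster("local[*]") keeps everything in the local JVM):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountLocal")
      .setMaster("local[*]")            // Master and Worker both run inside the local JVM, using all local cores
    val sc = new SparkContext(conf)
    sc.textFile(args(0))                // args(0): input path (a local file or an hdfs:// URL)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))          // args(1): output directory, which must not already exist
    sc.stop()
  }
}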
Problems encountered
- 1) Running a Hadoop-based program on Windows without a Hadoop environment configured
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Reference: the blog post "MapReduce 工作机制 错误记录及解决".
- 2) No write permission when IDEA on Windows connects to HDFS remotely
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=admin, access=WRITE, inode="/":hadoop:supergroup:drwxr-xr-x
The error means writing to HDFS is not permitted. One workaround:
Add the following configuration to Hadoop's hdfs-site.xml:
<property>
    <name>dfs.permissions</name>
    <value>false</value>
</property>
Then restart Hadoop and Spark and rerun the program; the problem disappears.
6. Remote Debugging the WordCount Program from IDEA
Remote debugging through IDEA essentially uses IDEA as the Driver to submit the application. The configuration is as follows:
Modify the SparkConf: add the jar that will actually be shipped and run, the address of the Driver program, and the Master URL to submit to.
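A sketch of what such a configuration might look like (the jar path and the driver host IP are placeholders for this environment):

import org.apache.spark.{SparkConf, SparkContext}

object WordCountRemote {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("WordCountRemote")
      .setMaster("spark://harvey:7077")                                                 // submit to the standalone Master
      .setJars(List("target/spark-wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar"))   // jar shipped to the executors
      .setIfMissing("spark.driver.host", "192.168.1.100")                               // IP of the machine running IDEA (placeholder)
    val sc = new SparkContext(conf)
    sc.textFile("hdfs://harvey:9000/RELEASE")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs://harvey:9000/out")
    sc.stop()
  }
}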
Run the program from IDEA and check the output directory on HDFS for the result.
7. Spark Core Concepts
Every Spark application consists of a driver program that launches the various parallel operations on the cluster. The driver program contains the application's main function, defines distributed datasets on the cluster, and applies operations to them.
The driver program accesses Spark through a SparkContext object, which represents a connection to the computing cluster. When the shell starts, it automatically creates a SparkContext, available as the variable sc.
The driver program typically manages a number of executor nodes.
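To make the division of labour concrete, here is a minimal sketch (the object name is hypothetical) annotating which parts run in the driver and which run on the executors:

import org.apache.spark.{SparkConf, SparkContext}

object DriverVsExecutors {
  def main(args: Array[String]): Unit = {
    // Everything in main() runs in the driver program.
    val sc = new SparkContext(new SparkConf().setAppName("DriverVsExecutors"))  // the connection to the cluster
    val data = sc.parallelize(1 to 1000)   // defines a distributed dataset; no computation happens yet
    val doubled = data.map(_ * 2)          // the function passed to map is sent to and run on the executors
    println(doubled.count())               // count() triggers a distributed job; the result comes back to the driver
    sc.stop()
  }
}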