1. Prepare the required development tools and environment:

  • Install JDK 1.8 (reference link: see the JDK 1.8 installation tutorial mentioned in step 3 below)
  • Scala 2.11.8

   Download: https://www.scala-lang.org/download/2.11.8.html  (the file I downloaded is scala-2.11.8.tgz)

## Scala: Spark is written in Scala, so a local build needs Scala installed
## Extract the archive
sudo tar zxvf scala-2.11.8.tgz -C /usr/lib
## Rename the directory
sudo mv /usr/lib/scala-2.11.8 /usr/lib/scala
## Configure environment variables
sudo vim /etc/profile
## Append at the end of the file
export SCALA_HOME=/usr/lib/scala
export PATH=${SCALA_HOME}/bin:$PATH
## Apply the changes immediately
source /etc/profile
## Verify the installation
zmx@ubuntu:~/nju$ scala -version
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL

 

  • Maven 3.3.9

    Download: https://mirrors.tuna.tsinghua.edu.cn/apache/maven/maven-3/3.3.9/binaries/ (the file I downloaded is apache-maven-3.3.9-bin.tar.gz)

## Extract the archive
sudo tar zxvf apache-maven-3.3.9-bin.tar.gz 
## Rename the directory and move it to /usr/lib
sudo mv apache-maven-3.3.9 maven
sudo mv maven/ /usr/lib/maven
## Add environment variables
gedit ~/.bashrc
## Append the following
export MAVEN_HOME=/usr/lib/maven
export CLASSPATH=$CLASSPATH:$MAVEN_HOME/lib
export PATH=$PATH:$MAVEN_HOME/bin
## Apply the changes
source ~/.bashrc
## Verify the installation
zmx@ubuntu:~/nju$ mvn -v
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T08:41:47-08:00)
Maven home: /usr/lib/maven
Java version: 1.8.0_201, vendor: Oracle Corporation
Java home: /usr/lib/jdk/jdk1.8.0_201/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.15.0-46-generic", arch: "amd64", family: "unix"
## Edit settings.xml so that jar downloads go through a faster mirror
sudo gedit /usr/lib/maven/conf/settings.xml 

<!-- Aliyun mirror of Maven Central; add this entry inside the existing <mirrors> element -->
    <mirror>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
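
  To confirm Maven actually picks the mirror up, you can optionally print the merged configuration with the standard maven-help-plugin; the alimaven entry should appear under mirrors in the output. A minimal check:

## Optional: print the effective settings and look for the alimaven mirror
mvn help:effective-settings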
  • Install IntelliJ IDEA (with the Scala plugin)

  Download the installer from the official site: http://www.jetbrains.com/education/download/#section=idea-Scala 
  The file name is ideaIE-2018.3.1.tar.gz

## Extract the archive
tar xzvf ideaIE-2018.3.1.tar.gz
## Launch the IDE
cd ideaIE-2018.3.1/bin
./idea.sh

   Install the Scala plugin.

    


    In IDEA's Maven settings, set Maven to the locally installed Maven and change the User settings file (so the mirror configured above is used).

     


  • If you use Maven for the build, you do not need to install sbt! In theory this setup requires no sbt at all (I installed it anyway and ran into no problems). To install sbt 0.13.x, Scala's build tool, see the reference link: installing sbt on Ubuntu 16.04. Be careful to download the 0.13.x version and not a different one.

 

2. Download the Spark 2.1.0 source code

  Download the Spark source from: https://archive.apache.org/dist/spark/spark-2.1.0/

## Extract the archive
tar zxvf spark-2.1.0.tgz

 

3. Build the Spark project

./build/mvn -DskipTests clean package
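
  A side note on this command: the build/mvn wrapper normally takes care of Maven's JVM memory settings, but if you use your own mvn, or the build dies with memory errors, the Spark build documentation recommends raising the limits via MAVEN_OPTS; you can also enable Hadoop/YARN profiles if you need them. A sketch based on the Spark 2.1.0 build docs (adjust to your machine):

## Only needed when not building through build/mvn, or when you hit OutOfMemoryError
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
## Optional profiles for Hadoop 2.7 and YARN support
./build/mvn -Pyarn -Phadoop-2.7 -DskipTests clean package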

  The build failed. Since I did not record the error at the time, the details can be found at

     https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven

  You can run ./build/mvn -e for more detail. Note that invoking it without any goal, as below, only produces Maven's own "No goals have been specified" error:

[ERROR] No goals have been specified for this build. You must specify a valid lifecycle phase or a goal in the format <plugin-prefix>:<goal> or <plugin-group-id>:<plugin-artifact-id>[:<plugin-version>]:<goal>. Available lifecycle phases are: validate, initialize, generate-sources, process-sources, generate-resources, process-resources, compile, process-classes, generate-test-sources, process-test-sources, generate-test-resources, process-test-resources, test-compile, process-test-classes, test, prepare-package, package, pre-integration-test, integration-test, post-integration-test, verify, install, deploy, pre-clean, clean, post-clean, pre-site, site, post-site, site-deploy. -> [Help 1]
org.apache.maven.lifecycle.NoGoalSpecifiedException: No goals have been specified for this build. You must specify a valid lifecycle phase or a goal in the format <plugin-prefix>:<goal> or <plugin-group-id>:<plugin-artifact-id>[:<plugin-version>]:<goal>. Available lifecycle phases are: validate, initialize, generate-sources, process-sources, generate-resources, process-resources, compile, process-classes, generate-test-sources, process-test-sources, generate-test-resources, process-test-resources, test-compile, process-test-classes, test, prepare-package, package, pre-integration-test, integration-test, post-integration-test, verify, install, deploy, pre-clean, clean, post-clean, pre-site, site, post-site, site-deploy.
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:97)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:954)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:288)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:192)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:356)
[ERROR] 
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/NoGoalSpecifiedException

 Cause: https://stackoverflow.com/questions/28004552/problems-while-compiling-spark-with-maven

 When I first installed a JDK I had installed OpenJDK 8 rather than Oracle JDK 1.8. Remove OpenJDK with sudo apt-get remove openjdk*, then install Oracle JDK 1.8; see the tutorial: setting up a JDK 1.8 environment on Ubuntu 16.04.
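
 After switching JDKs it is worth confirming which Java the build will actually pick up; a quick check (update-alternatives is Ubuntu's standard mechanism for managing multiple JDKs):

## Confirm the active JDK is the Oracle JDK 1.8 installed above
java -version
echo $JAVA_HOME
## Optional: list every java installation registered on the system
update-alternatives --list java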

The problem still persisted; reference: https://github.com/davidB/scala-maven-plugin/issues/185 

                                           

In the end the fix turned out to be simply re-running the build command after the error; it succeeded on roughly the 2nd or 3rd attempt. How long the build takes depends mostly on your network.


After the build completes, give it a quick test:

zmx@ubuntu:~/nju/spark-2.1.0$ ./bin/spark-shell 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/03/11 23:26:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/03/11 23:26:07 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.127.163 instead (on interface ens33)
19/03/11 23:26:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark context Web UI available at http://192.168.127.163:4040
Spark context available as 'sc' (master = local[*], app id = local-1552371968105).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
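
You can also run one of the bundled examples from the command line as an extra check (a small sketch; the run-example script and the SparkPi example both ship with the Spark source tree):

## Run the SparkPi example locally; it should print a rough approximation of Pi
./bin/run-example SparkPi 10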

4. Import the built Spark source into IDEA


Click Next through the import wizard.

A problem you may run into: insufficient user permissions, so that all files are read-only. Go to the Spark root directory and change the permissions with sudo chmod -R 777 ./
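
If you would rather not make the whole tree world-writable, taking ownership of it works just as well (a sketch; the path is the one used earlier in this post):

## Safer alternative: change the owner of the source tree instead of chmod -R 777
cd ~/nju/spark-2.1.0
sudo chown -R "$USER":"$USER" .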

                                             


5. Once the import succeeds, run and debug one of the examples under Spark's examples directory, using LogQuery as the example.

                                               


  • Configure the run configuration: set VM options to -Dspark.master=local, which tells Spark to run in local mode.


The first run fails with an error, because part of the source code needed by the flume dependency cannot be found.


The solution is as follows (reference: "Setting up a development environment for reading and debugging the Spark source code"):

File -> Project Structure -> Modules -> spark-streaming-flume-sink_2.11 -> Sources

Add the target directory and its sink subdirectory to Sources (they show up in blue under Source Folders on the right side of the dialog).


  • Add the jars the run depends on

    Run it again. This time it takes noticeably longer, because LogQuery has to be compiled successfully first, but the following error still appears: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"


The cause is that IDEA's Maven configuration had not been changed. Following step 1, set Maven to the locally installed Maven and change the User settings file, then run LogQuery again. Reference: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". in a Maven Project [duplicate]
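
If the warning still shows up after that, one simple check is whether the log4j binding jar that actually provides StaticLoggerBinder ended up among the built jars (a sketch; the directory is the same jars folder referenced for the run dependencies below):

## The slf4j-log4j12 binding supplies org.slf4j.impl.StaticLoggerBinder
ls assembly/target/scala-2.11/jars/ | grep -i slf4j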

  • Errors java.lang.NoClassDefFoundError: scala/collection/immutable/List and java.lang.ClassNotFoundException: scala.collection.immutable.List appear

The cause: a Spark app is normally launched with spark-submit, which runs your jar inside an installed Spark environment that already contains all of Spark's dependencies; the IDE environment is missing those dependencies.

   Solution: File -> Project Structure -> Modules -> spark-examples_2.11 -> Dependencies, then add the jars under {spark dir}/spark/assembly/target/scala-2.11/jars/ as dependencies.


Note the following:

    1. The jars under jars/*.jar are populated while Spark is built. If the directory is empty, or you have modified the source code and want to refresh the jars, rebuild Spark.

    2. Essentially all of the dependency jars are marked provided, meaning they are expected to be supplied by the runtime environment, because a Spark app is normally run via spark-submit (see the sketch below).
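
For contrast, this is roughly what running the same example through spark-submit looks like; nothing has to be added in the IDE for this route, because the installed Spark environment supplies all the provided dependencies (a sketch; the exact jar name and location depend on your build output, typically somewhere under examples/target):

## Run LogQuery via spark-submit instead of the IDE (adjust the jar path to your build)
./bin/spark-submit \
  --class org.apache.spark.examples.LogQuery \
  --master "local[*]" \
  examples/target/spark-examples_2.11-2.1.0.jar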

  • Run LogQuery again and check the output


  • Step through the source code in the debugger
