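Preface

Big data and containers have been two of the hottest technologies in recent years, so containerized deployment of big data platforms naturally draws a lot of attention. For Apache Spark, the most widely used container image at present is sequenceiq/spark, which has about 330K pulls on Docker Hub.

The sequenceiq/spark image bundles Hadoop/YARN. Although it deploys only a single-node "pseudo-cluster", it is very convenient for development and testing. Unfortunately, sequenceiq does not update the image very often: the latest version is currently sequenceiq/spark:1.6.0, so to try Spark 2.1.0 you have to build the image yourself.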

Runtime environment

Operating system    Kernel version    Docker version
Ubuntu 15.10        4.2.0-42          1.12.5
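
To compare your own host with the environment listed above, a small script along the following lines may help. It is only a sketch: the function name env_report is hypothetical, and it assumes a Linux host that provides /etc/os-release (it falls back to `uname -s` otherwise).

```shell
#!/usr/bin/env bash
# Sketch: print OS, kernel and Docker version, one per line.
# ASSUMPTION: env_report is an illustrative helper, not part of docker-spark.

env_report() {
  local os kernel docker_ver
  if [ -r /etc/os-release ]; then
    # Source the file in a subshell so its variables do not leak out.
    os=$(. /etc/os-release && printf '%s' "$PRETTY_NAME")
  else
    os=$(uname -s)
  fi
  kernel=$(uname -r)
  if command -v docker >/dev/null 2>&1; then
    docker_ver=$(docker --version 2>/dev/null)
  else
    docker_ver="not installed"
  fi
  printf 'OS: %s\nKernel: %s\nDocker: %s\n' \
    "$os" "$kernel" "${docker_ver:-unknown}"
}

# Example:
#   env_report
```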


Environment preparation

1) Download the build source for the sequenceiq/spark image

$ git clone https://github.com/sequenceiq/docker-spark
Cloning into 'docker-spark'...
remote: Counting objects: 211, done.
remote: Total 211 (delta 0), reused 0 (delta 0), pack-reused 211
Receiving objects: 100% (211/211), 55.20 KiB | 71.00 KiB/s, done.
Resolving deltas: 100% (108/108), done.
Checking connectivity... done.


2) Download the Spark 2.1.0 package from the Spark website


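Because we build the Spark 2.1.0 image with sequenceiq's Dockerfile, which uses sequenceiq/hadoop-docker:2.6.0 as its base image, choose "Pre-built for Hadoop 2.6" under "2. Choose a package type" in the download options.

Note: the downloaded file must be placed in the docker-spark directory.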
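
If you prefer to script the download rather than click through the website, something like the following could work. This is a sketch under assumptions: the helper names are hypothetical, and the archive.apache.org URL layout is an assumption on my part, since the article downloads the package by hand.

```shell
#!/usr/bin/env bash
# Sketch: derive the Spark tarball name and an archive URL, then fetch it.
# ASSUMPTION: the archive.apache.org layout used below; verify it before use.

spark_tarball_name() {
  # $1 = Spark version, $2 = Hadoop profile, e.g. 2.1.0 and 2.6
  printf 'spark-%s-bin-hadoop%s.tgz\n' "$1" "$2"
}

spark_tarball_url() {
  printf 'https://archive.apache.org/dist/spark/spark-%s/%s\n' \
    "$1" "$(spark_tarball_name "$1" "$2")"
}

fetch_spark() {
  # $3 = target directory (the docker-spark checkout)
  local url
  url=$(spark_tarball_url "$1" "$2")
  # -f: fail on HTTP errors, -L: follow redirects, -O: keep the remote name
  (cd "$3" && curl -fLO "$url")
}

# Example (needs network access):
#   fetch_spark 2.1.0 2.6 docker-spark
```

Only fetch_spark touches the network; the two naming helpers are pure and can be checked offline.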


3) Cache the sequenceiq/hadoop-docker:2.6.0 image

  • Run the following command to pull the hadoop-docker 2.6.0 image:
$ docker pull sequenceiq/hadoop-docker:2.6.0
2.6.0: Pulling from sequenceiq/hadoop-docker
b253335dcf03: Already exists 
a3ed95caeb02: Pull complete 
3452351686f4: Pull complete 
dfb6df69b64d: Pull complete 
...
bae586fb2d97: Pull complete 
Digest: sha256:2b95f51b7f0ddf0d7bb2c2cfa793bae3298fcda5523783155a2db9430cba494a
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.6.0
  • List the locally cached Docker images to confirm that sequenceiq/hadoop-docker:2.6.0 was pulled successfully:
$ docker images
REPOSITORY                 TAG                 IMAGE ID            CREATED             SIZE
hello-world                latest              c54a2cc56cbb        6 months ago        1.848 kB
sequenceiq/hadoop-docker   2.6.0               140b265bd62a        24 months ago       1.624 GB
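
The visual check above can also be automated. The sketch below reads `docker images` output on standard input and succeeds only if the given repository and tag are listed; image_listed is a hypothetical helper name, and it assumes the column layout shown above (REPOSITORY first, TAG second).

```shell
#!/usr/bin/env bash
# Sketch: succeed if a REPOSITORY/TAG pair appears in `docker images`
# output read from standard input.
# ASSUMPTION: image_listed is an illustrative helper name.

image_listed() {
  # $1 = repository, $2 = tag
  awk -v repo="$1" -v tag="$2" '
    NR > 1 && $1 == repo && $2 == tag { found = 1 }  # skip the header row
    END { exit !found }                              # 0 if found, 1 otherwise
  '
}

# Example:
#   docker images | image_listed sequenceiq/hadoop-docker 2.6.0 && echo cached
```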


Building the image

1) Change into the docker-spark directory and confirm that all files needed for the build are in place.

$ cd docker-spark/
$ ls -alF
total 188856
drwxrwxr-x 4 farawayzheng farawayzheng      4096 Jan 10 23:10 ./
drwxrwxr-x 9 farawayzheng farawayzheng      4096 Jan 10 23:06 ../
-rwxrwxr-x 1 farawayzheng farawayzheng       901 Jan 10 23:06 bootstrap.sh*
-rw-rw-r-- 1 farawayzheng farawayzheng       970 Jan 10 23:16 Dockerfile
drwxrwxr-x 8 farawayzheng farawayzheng      4096 Jan 10 23:06 .git/
-rw-rw-r-- 1 farawayzheng farawayzheng        18 Jan 10 23:06 .gitignore
-rw-rw-r-- 1 farawayzheng farawayzheng     71624 Jan 10 23:06 LICENSE
-rw-rw-r-- 1 farawayzheng farawayzheng      3454 Jan 10 23:06 README.md
-rwxrwx--- 1 farawayzheng farawayzheng 193281941 Jan 10 23:04 spark-2.1.0-bin-hadoop2.6.tgz*
drwxrwxr-x 2 farawayzheng farawayzheng      4096 Jan 10 23:06 yarn-remote-client/
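
A pre-flight check can catch a missing file before a long build starts. The sketch below tests for the items the build relies on (Dockerfile, bootstrap.sh, the Spark tarball and the yarn-remote-client directory); check_build_context is a hypothetical helper name, not something shipped with docker-spark.

```shell
#!/usr/bin/env bash
# Sketch: verify that a directory holds everything the build references.
# ASSUMPTION: check_build_context is an illustrative helper name.

check_build_context() {
  # $1 = build context directory, $2 = Spark tarball file name
  local dir=$1 tarball=$2 missing=0 f
  for f in Dockerfile bootstrap.sh "$tarball"; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing file: $f" >&2
      missing=1
    fi
  done
  if [ ! -d "$dir/yarn-remote-client" ]; then
    echo "missing directory: yarn-remote-client" >&2
    missing=1
  fi
  return "$missing"   # 0 when complete, 1 when anything is missing
}

# Example:
#   check_build_context . spark-2.1.0-bin-hadoop2.6.tgz && echo "ready to build"
```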


2) Modify the Dockerfile and the bootstrap.sh startup script

  • Change the Dockerfile to the following:
FROM sequenceiq/hadoop-docker:2.6.0
#MAINTAINER SequenceIQ
MAINTAINER farawayzheng

#support for Hadoop 2.6.0
#RUN curl -s http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz | tar -xz -C /usr/local/
ADD spark-2.1.0-bin-hadoop2.6.tgz /usr/local/
RUN cd /usr/local && ln -s spark-2.1.0-bin-hadoop2.6 spark
ENV SPARK_HOME /usr/local/spark
RUN mkdir $SPARK_HOME/yarn-remote-client
ADD yarn-remote-client $SPARK_HOME/yarn-remote-client

RUN $BOOTSTRAP && $HADOOP_PREFIX/bin/hadoop dfsadmin -safemode leave && $HADOOP_PREFIX/bin/hdfs dfs -put $SPARK_HOME-2.1.0-bin-hadoop2.6/jars /spark && $HADOOP_PREFIX/bin/hdfs dfs -put $SPARK_HOME-2.1.0-bin-hadoop2.6/examples/jars /spark 

ENV YARN_CONF_DIR $HADOOP_PREFIX/etc/hadoop
ENV PATH $PATH:$SPARK_HOME/bin:$HADOOP_PREFIX/bin
# update boot script
COPY bootstrap.sh /etc/bootstrap.sh
RUN chown root.root /etc/bootstrap.sh
RUN chmod 700 /etc/bootstrap.sh

#install R
RUN rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
RUN yum -y install R

ENTRYPOINT ["/etc/bootstrap.sh"]
  • Change bootstrap.sh to the following:
#!/bin/bash

: ${HADOOP_PREFIX:=/usr/local/hadoop}

$HADOOP_PREFIX/etc/hadoop/hadoop-env.sh

# remove stale pid files; -f avoids an error when none exist
rm -f /tmp/*.pid

# installing libraries if any - (resource urls added comma separated to the ACP system variable)
cd $HADOOP_PREFIX/share/hadoop/common ; for cp in ${ACP//,/ }; do  echo == $cp; curl -LO $cp ; done; cd -

# altering the core-site configuration
sed s/HOSTNAME/$HOSTNAME/ /usr/local/hadoop/etc/hadoop/core-site.xml.template > /usr/local/hadoop/etc/hadoop/core-site.xml

# setting spark defaults
# quoted so the shell never tries to glob-expand the * itself
echo "spark.yarn.jars hdfs:///spark/*" > $SPARK_HOME/conf/spark-defaults.conf
cp $SPARK_HOME/conf/metrics.properties.template $SPARK_HOME/conf/metrics.properties

service sshd start
$HADOOP_PREFIX/sbin/start-dfs.sh
$HADOOP_PREFIX/sbin/start-yarn.sh



CMD=${1:-"exit 0"}
if [[ "$CMD" == "-d" ]];
then
    service sshd stop
    /usr/sbin/sshd -D -d
else
    /bin/bash -c "$*"
fi
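
Both files above lean on a few shell expansions that are easy to misread. The sketch below isolates each one in a tiny function so its behaviour can be checked on its own; the function names are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: the shell expansions the Dockerfile and bootstrap.sh rely on.
# ASSUMPTION: all function names here are illustrative.

# 1) "$SPARK_HOME-2.1.0-...": "-" cannot be part of a variable name, so the
#    shell expands $SPARK_HOME and appends the rest literally.
versioned_dir() {
  local SPARK_HOME=$1
  printf '%s\n' "$SPARK_HOME-2.1.0-bin-hadoop2.6"
}

# 2) ": ${VAR:=default}" assigns the default only if VAR is unset or empty.
default_prefix() {
  local HADOOP_PREFIX=$1
  : "${HADOOP_PREFIX:=/usr/local/hadoop}"
  printf '%s\n' "$HADOOP_PREFIX"
}

# 3) "${ACP//,/ }" replaces every comma with a space, so the unquoted
#    expansion splits a comma-separated list into separate words.
split_acp() {
  local ACP=$1 cp
  for cp in ${ACP//,/ }; do
    printf '%s\n' "$cp"
  done
}

# 4) ${1:-"exit 0"} falls back to "exit 0" when no argument is given.
first_arg_or_exit() {
  local cmd=${1:-"exit 0"}
  printf '%s\n' "$cmd"
}
```

With SPARK_HOME set to /usr/local/spark, the first expansion yields /usr/local/spark-2.1.0-bin-hadoop2.6, which is why the `hdfs dfs -put` commands in the Dockerfile read from the unpacked directory rather than from the symlink.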


3) Build the Spark 2.1.0 image

$ docker build --rm -t farawayzheng/spark:2.1.0 .
Sending build context to Docker daemon 193.5 MB
Step 1 : FROM sequenceiq/hadoop-docker:2.6.0
 ---> 140b265bd62a
Step 2 : MAINTAINER farawayzheng
 ---> Running in 12a47858c223
 ---> 04b98762d0b7
Removing intermediate container 12a47858c223
Step 3 : ADD spark-2.1.0-bin-hadoop2.6.tgz /usr/local/
 ---> 07eab98fe3f9
Removing intermediate container 3533899c0e8e
Step 4 : RUN cd /usr/local && ln -s spark-2.1.0-bin-hadoop2.6 spark
 ---> Running in 8dbcac623198
 ---> f7d68c7d52f4
Removing intermediate container 8dbcac623198
Step 5 : ENV SPARK_HOME /usr/local/spark
 ---> Running in 55a56f466fcb
 ---> 7f891e362f29
Removing intermediate container 55a56f466fcb
Step 6 : RUN mkdir $SPARK_HOME/yarn-remote-client
 ---> Running in e989ef3d7d67
 ---> 85485e987afd
Removing intermediate container e989ef3d7d67
Step 7 : ADD yarn-remote-client $SPARK_HOME/yarn-remote-client
 ---> f14d86c9f5c0
Removing intermediate container bae32c1ae32a
Step 8 : RUN $BOOTSTRAP && $HADOOP_PREFIX/bin/hadoop dfsadmin -safemode leave && $HADOOP_PREFIX/bin/hdfs dfs -put $SPARK_HOME-2.1.0-bin-hadoop2.6/jars /spark && $HADOOP_PREFIX/bin/hdfs dfs -put $SPARK_HOME-2.1.0-bin-hadoop2.6/examples/jars /spark
 ---> Running in ed073536dd11
/
Starting sshd: [  OK  ]
Starting namenodes on [70b4a57bb473]
70b4a57bb473: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-70b4a57bb473.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-70b4a57bb473.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-70b4a57bb473.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-70b4a57bb473.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-70b4a57bb473.out
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Safe mode is OFF
 ---> e4bdf8a1628d
Removing intermediate container ed073536dd11
Step 9 : ENV YARN_CONF_DIR $HADOOP_PREFIX/etc/hadoop
 ---> Running in ea435f5b6141
 ---> 07083ac117c2
Removing intermediate container ea435f5b6141
Step 10 : ENV PATH $PATH:$SPARK_HOME/bin:$HADOOP_PREFIX/bin
 ---> Running in 0dabc7dd5211
 ---> c55b79a1e670
Removing intermediate container 0dabc7dd5211
Step 11 : COPY bootstrap.sh /etc/bootstrap.sh
 ---> e942148ae10f
Removing intermediate container 996819338fda
Step 12 : RUN chown root.root /etc/bootstrap.sh
 ---> Running in 7a8c32d8ddaa
 ---> 12cbdc408ed4
Removing intermediate container 7a8c32d8ddaa
Step 13 : RUN chmod 700 /etc/bootstrap.sh
 ---> Running in f833cda3afb5
 ---> d8d17a1babbf
Removing intermediate container f833cda3afb5
Step 14 : RUN rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
 ---> Running in 5c35660f3e9b
warning: /var/tmp/rpm-tmp.ByIPjn: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
Retrieving http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
Preparing...                ##################################################
epel-release                ##################################################
 ---> 666a2cbda116
Removing intermediate container 5c35660f3e9b
Step 15 : RUN yum -y install R
 ---> Running in 7192828863d7
Loaded plugins: fastestmirror, keys, protect-packages, protectbase
Determining fastest mirrors
 * base: mirrors.btte.net
 * epel: mirrors.tuna.tsinghua.edu.cn
 * extras: mirrors.btte.net
 * updates: mirrors.yun-idc.com
0 packages excluded due to repository protections
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package R.x86_64 0:3.3.2-3.el6 will be installed
......
......
......
---> Package xz-lzma-compat.x86_64 0:4.999.9-0.3.beta.20091007git.el6 will be updated
---> Package xz-lzma-compat.x86_64 0:4.999.9-0.5.beta.20091007git.el6 will be an update
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package               Arch    Version                           Repository
                                                                           Size
================================================================================
Installing:
 R                     x86_64  3.3.2-3.el6                       epel      26 k
Installing for dependencies:
 R-core                x86_64  3.3.2-3.el6                       epel      53 M
......
Updating for dependencies:
 cpp                   x86_64  4.4.7-17.el6                      base     3.7 M
......

Transaction Summary
================================================================================
Install      69 Package(s)
Upgrade      19 Package(s)

Total download size: 145 M
Downloading Packages:
--------------------------------------------------------------------------------
Total                                           494 kB/s | 145 MB     05:01     
warning: rpmts_HdrFromFdno: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
Retrieving key from file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-6
Importing GPG key 0x0608B895:
 Userid : EPEL (6) <epel@fedoraproject.org>
 Package: epel-release-6-8.noarch (installed)
 From   : /etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-6
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Warning: RPMDB altered outside of yum.
  Updating   : libgcc-4.4.7-17.el6.x86_64                                 1/107 
  ......  
  Cleanup    : glib2-2.26.1-3.el6.x86_64                                107/107 
  Verifying  : acl-2.2.49-6.el6.x86_64                                    1/107 
  ......  
Installed:
  R.x86_64 0:3.3.2-3.el6                                                        

Dependency Installed:
  R-core.x86_64 0:3.3.2-3.el6                                                   
  ......                                                 
  zlib-devel.x86_64 0:1.2.3-29.el6                                              

Dependency Updated:
  cpp.x86_64 0:4.4.7-17.el6                                                     
  ......                          
  xz-lzma-compat.x86_64 0:4.999.9-0.5.beta.20091007git.el6                      

Complete!
 ---> ed7e19858dc9
Removing intermediate container 7192828863d7
Step 16 : ENTRYPOINT /etc/bootstrap.sh
 ---> Running in 31d75ee50b7d
 ---> 4eb30ebd34a2
Removing intermediate container 31d75ee50b7d
Successfully built 4eb30ebd34a2
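
When the build runs from a script, the image ID can be pulled out of a saved log. The following is a sketch that assumes the classic builder's log format shown above, ending in a "Successfully built" line followed by the ID; newer Docker builders print differently. built_image_id is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: extract the image ID from a saved `docker build` log.
# ASSUMPTION: the log ends with "Successfully built <id>" as shown above.

built_image_id() {
  # Reads a build log on stdin; prints the ID from the last
  # "Successfully built <id>" line, or returns non-zero if there is none.
  local id
  id=$(sed -n 's/^Successfully built \([0-9a-f]\{1,\}\)$/\1/p' | tail -n 1)
  [ -n "$id" ] && printf '%s\n' "$id"
}

# Example:
#   docker build --rm -t farawayzheng/spark:2.1.0 . | tee build.log
#   built_image_id < build.log
```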


4) Inspect the newly built Spark 2.1.0 image

$ docker images
REPOSITORY                 TAG                 IMAGE ID            CREATED             SIZE
farawayzheng/spark         2.1.0               4eb30ebd34a2        4 hours ago         2.649 GB
hello-world                latest              c54a2cc56cbb        6 months ago        1.848 kB
sequenceiq/hadoop-docker   2.6.0               140b265bd62a        24 months ago       1.624 GB


Testing the image

1) Start a Spark 2.1.0 container

$ docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox farawayzheng/spark:2.1.0 bash
/
Starting sshd:                                             [  OK  ]
Starting namenodes on [sandbox]
sandbox: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-sandbox.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-sandbox.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-sandbox.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-sandbox.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-sandbox.out
bash-4.1#

The "bash-4.1#" prompt indicates that the Spark 2.1.0 container started successfully.
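
The three ports published in the `docker run` command are the default web UI ports of the YARN ResourceManager (8088), the YARN NodeManager (8042) and the Spark application UI (4040). If the list of ports changes often, a small helper can build the -p flags; the sketch below uses a hypothetical function name, port_flags, and maps each container port to the same host port.

```shell
#!/usr/bin/env bash
# Sketch: build "-p host:container" flags for a list of ports.
# ASSUMPTION: port_flags is an illustrative helper; host and container
# ports are kept identical, as in the docker run command above.

port_flags() {
  local p flags=()
  for p in "$@"; do
    flags+=(-p "$p:$p")
  done
  # Join the collected flags with single spaces.
  printf '%s\n' "${flags[*]}"
}

# Example (the unquoted substitution is intentional so the flags split):
#   docker run -it $(port_flags 8088 8042 4040) -h sandbox \
#     farawayzheng/spark:2.1.0 bash
```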

2) Verify that the Spark cluster works in YARN client mode

bash-4.1# spark-shell --master yarn --deploy-mode client --driver-memory 1g --executor-memory 1g --executor-cores 1
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/01/11 05:00:45 WARN spark.SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
17/01/11 05:00:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/01/11 05:01:10 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/usr/local/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/usr/local/spark/jars/datanucleus-api-jdo-3.2.6.jar."
17/01/11 05:01:10 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/usr/local/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/usr/local/spark/jars/datanucleus-rdbms-3.2.9.jar."
17/01/11 05:01:10 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/usr/local/spark/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/usr/local/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-core-3.2.10.jar."
17/01/11 05:01:18 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/01/11 05:01:18 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
17/01/11 05:01:19 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://172.17.0.3:4040
Spark context available as 'sc' (master = yarn, app id = application_1484126893491_0005).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
scala>

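The "scala>" prompt shows that spark-shell is running. It printed several WARN-level log messages; they do not get in the way of simple tests, but the configuration should still be adjusted and corrected later.

For an explanation of "WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException", see: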
https://issues.apache.org/jira/browse/SPARK-14067

This is expected behavior. We need the default database for a number of operations. We try to look it up in the metastore, if it does not exist the metastore gives a warning and we will create a default database.


Run a Scala expression to check that Spark works:

scala> sc.parallelize(1 to 1000).count()
res0: Long = 1000                                                               

scala>

Verification passed. All done!
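
The same check can be run non-interactively by piping the expression into spark-shell. The sketch below reuses the options from the session above; spark_smoke_test and the COUNT= marker are hypothetical, and it assumes spark-shell reads commands from standard input and exits at end of input.

```shell
#!/usr/bin/env bash
# Sketch: non-interactive version of the parallelize/count check.
# ASSUMPTION: spark_smoke_test and the "COUNT=" marker are illustrative.

spark_smoke_test() {
  # $1 = spark-shell executable (defaults to the one on PATH)
  local spark_shell=${1:-spark-shell}
  printf '%s\n' 'println("COUNT=" + sc.parallelize(1 to 1000).count())' |
    "$spark_shell" --master yarn --deploy-mode client \
      --driver-memory 1g --executor-memory 1g --executor-cores 1 2>/dev/null |
    grep -q 'COUNT=1000'   # exit status 0 only if the expected count appears
}

# Example, inside the container:
#   spark_smoke_test && echo "Spark on YARN is working"
```

Taking the executable as a parameter keeps the function easy to exercise with a stand-in command when no cluster is available.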