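Preface
Big data and containers have been two of the hottest technologies in recent years, so containerized deployment of big data platforms naturally attracts a lot of attention. For Apache Spark, the most widely used container image at present is sequenceiq/spark, which has about 330K pulls on Docker Hub.
The sequenceiq/spark image bundles Hadoop/YARN. It is deployed as a "pseudo-cluster" (all services run in a single container), but it is still very convenient for development and testing. Unfortunately, sequenceiq does not update the image often: the latest version is sequenceiq/spark:1.6.0, so anyone who wants to try Spark 2.1.0 has to build the image themselves.
Environment
OS | Kernel version | Docker version |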
Ubuntu 15.10 | 4.2.0-42 | 1.12.5 |
Preparation
1) Download the build sources for the sequenceiq/spark image
$ git clone https://github.com/sequenceiq/docker-spark
Cloning into 'docker-spark'...
remote: Counting objects: 211, done.
remote: Total 211 (delta 0), reused 0 (delta 0), pack-reused 211
Receiving objects: 100% (211/211), 55.20 KiB | 71.00 KiB/s, done.
Resolving deltas: 100% (108/108), done.
Checking connectivity... done.
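2) Download the Spark 2.1.0 package from the official Spark website
Because we build the Spark 2.1.0 image with sequenceiq's Dockerfile, which uses sequenceiq/hadoop-docker:2.6.0 as its base image, select "Pre-built for Hadoop 2.6" under "2. Choose a package type" in the download options.
Note: the downloaded file must be placed in the docker-spark directory.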
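The package can also be fetched from the command line instead of the browser. The following is a minimal sketch, not part of the original walkthrough: the helper names are hypothetical, and the URL assumes the layout of the Apache release archive (archive.apache.org), which the post itself does not mention.

```shell
# Hypothetical helpers (not part of docker-spark).

# Print the tarball name for a given Spark version and Hadoop profile.
spark_tarball_name() {
  printf 'spark-%s-bin-hadoop%s.tgz\n' "$1" "$2"
}

# Print the download URL, assuming the Apache release archive layout.
spark_archive_url() {
  printf 'https://archive.apache.org/dist/spark/spark-%s/%s\n' \
    "$1" "$(spark_tarball_name "$1" "$2")"
}

# Usage (run from the parent directory of docker-spark):
#   curl -fL -o "docker-spark/$(spark_tarball_name 2.1.0 2.6)" \
#        "$(spark_archive_url 2.1.0 2.6)"
```

Keeping the version and Hadoop profile as parameters makes it harder to download a package that does not match the Hadoop 2.6 base image.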
3) Cache the sequenceiq/hadoop-docker:2.6.0 image locally
- Run the following command to pull the hadoop-docker 2.6.0 image:
$ docker pull sequenceiq/hadoop-docker:2.6.0
2.6.0: Pulling from sequenceiq/hadoop-docker
b253335dcf03: Already exists
a3ed95caeb02: Pull complete
3452351686f4: Pull complete
dfb6df69b64d: Pull complete
...
bae586fb2d97: Pull complete
Digest: sha256:2b95f51b7f0ddf0d7bb2c2cfa793bae3298fcda5523783155a2db9430cba494a
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.6.0
- List the locally cached Docker images to confirm that sequenceiq/hadoop-docker:2.6.0 was pulled successfully:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
hello-world latest c54a2cc56cbb 6 months ago 1.848 kB
sequenceiq/hadoop-docker 2.6.0 140b265bd62a 24 months ago 1.624 GB
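The visual check of the `docker images` listing can also be scripted. The sketch below is only an illustration: `has_image` is a hypothetical helper that reads the listing on standard input and matches on the first two columns (repository and tag).

```shell
# Hypothetical helper: reads `docker images` output on stdin and succeeds
# only if a row with the given repository and tag is present.
has_image() {
  awk -v repo="$1" -v tag="$2" \
    '$1 == repo && $2 == tag { found = 1 } END { exit !found }'
}

# Usage:
#   docker images | has_image sequenceiq/hadoop-docker 2.6.0 && echo "base image cached"
```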
Building the image
1) Change into the docker-spark directory and confirm that all files needed for the build are in place.
$ cd docker-spark/
$ ls -alF
total 188856
drwxrwxr-x 4 farawayzheng farawayzheng 4096 Jan 10 23:10 ./
drwxrwxr-x 9 farawayzheng farawayzheng 4096 Jan 10 23:06 ../
-rwxrwxr-x 1 farawayzheng farawayzheng 901 Jan 10 23:06 bootstrap.sh*
-rw-rw-r-- 1 farawayzheng farawayzheng 970 Jan 10 23:16 Dockerfile
drwxrwxr-x 8 farawayzheng farawayzheng 4096 Jan 10 23:06 .git/
-rw-rw-r-- 1 farawayzheng farawayzheng 18 Jan 10 23:06 .gitignore
-rw-rw-r-- 1 farawayzheng farawayzheng 71624 Jan 10 23:06 LICENSE
-rw-rw-r-- 1 farawayzheng farawayzheng 3454 Jan 10 23:06 README.md
-rwxrwx--- 1 farawayzheng farawayzheng 193281941 Jan 10 23:04 spark-2.1.0-bin-hadoop2.6.tgz*
drwxrwxr-x 2 farawayzheng farawayzheng 4096 Jan 10 23:06 yarn-remote-client/
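The same readiness check can be written as a small script. This is a sketch under stated assumptions: `check_build_context` is a hypothetical helper, and its file list is taken from the directory listing above (the files that the Dockerfile adds or copies).

```shell
# Hypothetical helper: verify that a directory holds the files the build needs.
# Prints one line per missing entry and returns non-zero if anything is missing.
check_build_context() {
  dir=$1
  missing=0
  for f in Dockerfile bootstrap.sh spark-2.1.0-bin-hadoop2.6.tgz; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  if [ ! -d "$dir/yarn-remote-client" ]; then
    echo "missing: yarn-remote-client/"
    missing=1
  fi
  return "$missing"
}

# Usage:
#   check_build_context docker-spark && echo "ready to build"
```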
2) Modify the Dockerfile and the bootstrap.sh startup script
- Change the Dockerfile to the following:
FROM sequenceiq/hadoop-docker:2.6.0
#MAINTAINER SequenceIQ
MAINTAINER farawayzheng
#support for Hadoop 2.6.0
#RUN curl -s http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz | tar -xz -C /usr/local/
ADD spark-2.1.0-bin-hadoop2.6.tgz /usr/local/
RUN cd /usr/local && ln -s spark-2.1.0-bin-hadoop2.6 spark
ENV SPARK_HOME /usr/local/spark
RUN mkdir $SPARK_HOME/yarn-remote-client
ADD yarn-remote-client $SPARK_HOME/yarn-remote-client
RUN $BOOTSTRAP && $HADOOP_PREFIX/bin/hadoop dfsadmin -safemode leave && $HADOOP_PREFIX/bin/hdfs dfs -put $SPARK_HOME-2.1.0-bin-hadoop2.6/jars /spark && $HADOOP_PREFIX/bin/hdfs dfs -put $SPARK_HOME-2.1.0-bin-hadoop2.6/examples/jars /spark
ENV YARN_CONF_DIR $HADOOP_PREFIX/etc/hadoop
ENV PATH $PATH:$SPARK_HOME/bin:$HADOOP_PREFIX/bin
# update boot script
COPY bootstrap.sh /etc/bootstrap.sh
RUN chown root.root /etc/bootstrap.sh
RUN chmod 700 /etc/bootstrap.sh
#install R
RUN rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
RUN yum -y install R
ENTRYPOINT ["/etc/bootstrap.sh"]
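One detail of the Dockerfile's `hdfs dfs -put` step is easy to misread: `$SPARK_HOME-2.1.0-bin-hadoop2.6` is not a separate variable. A `-` cannot be part of a shell variable name, so the shell expands `$SPARK_HOME` and appends the rest literally, which yields the real directory behind the `spark` symlink. A minimal demonstration:

```shell
SPARK_HOME=/usr/local/spark

# '-' ends the variable name, so the suffix is appended literally.
real_dir="$SPARK_HOME-2.1.0-bin-hadoop2.6"

echo "$real_dir/jars"           # /usr/local/spark-2.1.0-bin-hadoop2.6/jars
echo "$real_dir/examples/jars"  # /usr/local/spark-2.1.0-bin-hadoop2.6/examples/jars
```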
- Change bootstrap.sh to the following:
#!/bin/bash
: ${HADOOP_PREFIX:=/usr/local/hadoop}
$HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
rm /tmp/*.pid
# installing libraries if any - (resource urls added comma separated to the ACP system variable)
cd $HADOOP_PREFIX/share/hadoop/common ; for cp in ${ACP//,/ }; do echo == $cp; curl -LO $cp ; done; cd -
# altering the core-site configuration
sed s/HOSTNAME/$HOSTNAME/ /usr/local/hadoop/etc/hadoop/core-site.xml.template > /usr/local/hadoop/etc/hadoop/core-site.xml
# setting spark defaults
echo spark.yarn.jars hdfs:///spark/* > $SPARK_HOME/conf/spark-defaults.conf
cp $SPARK_HOME/conf/metrics.properties.template $SPARK_HOME/conf/metrics.properties
service sshd start
$HADOOP_PREFIX/sbin/start-dfs.sh
$HADOOP_PREFIX/sbin/start-yarn.sh
CMD=${1:-"exit 0"}
if [[ "$CMD" == "-d" ]];
then
service sshd stop
/usr/sbin/sshd -D -d
else
/bin/bash -c "$*"
fi
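The last lines of bootstrap.sh decide what the container does once Hadoop is up. Whatever follows the image name in `docker run` is passed to the script as arguments: `-d` restarts sshd in the foreground, and anything else is executed through `/bin/bash -c`. The sketch below is a simplified, side-effect-free model of that branch (the function name `bootstrap_action` is hypothetical); it prints the action instead of performing it.

```shell
# Simplified model of the final branch in bootstrap.sh.
# It only prints what the real script would do.
bootstrap_action() {
  cmd=${1:-"exit 0"}   # default used only for the comparison below
  if [ "$cmd" = "-d" ]; then
    echo "sshd-foreground"
  else
    echo "run: $*"     # the real script calls: /bin/bash -c "$*"
  fi
}

bootstrap_action -d     # sshd-foreground
bootstrap_action bash   # run: bash
```

Passing `bash` as the argument is what produces an interactive shell prompt when the container is started with `-it`.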
3) Build the Spark 2.1.0 image
$ docker build --rm -t farawayzheng/spark:2.1.0 .
Sending build context to Docker daemon 193.5 MB
Step 1 : FROM sequenceiq/hadoop-docker:2.6.0
---> 140b265bd62a
Step 2 : MAINTAINER farawayzheng
---> Running in 12a47858c223
---> 04b98762d0b7
Removing intermediate container 12a47858c223
Step 3 : ADD spark-2.1.0-bin-hadoop2.6.tgz /usr/local/
---> 07eab98fe3f9
Removing intermediate container 3533899c0e8e
Step 4 : RUN cd /usr/local && ln -s spark-2.1.0-bin-hadoop2.6 spark
---> Running in 8dbcac623198
---> f7d68c7d52f4
Removing intermediate container 8dbcac623198
Step 5 : ENV SPARK_HOME /usr/local/spark
---> Running in 55a56f466fcb
---> 7f891e362f29
Removing intermediate container 55a56f466fcb
Step 6 : RUN mkdir $SPARK_HOME/yarn-remote-client
---> Running in e989ef3d7d67
---> 85485e987afd
Removing intermediate container e989ef3d7d67
Step 7 : ADD yarn-remote-client $SPARK_HOME/yarn-remote-client
---> f14d86c9f5c0
Removing intermediate container bae32c1ae32a
Step 8 : RUN $BOOTSTRAP && $HADOOP_PREFIX/bin/hadoop dfsadmin -safemode leave && $HADOOP_PREFIX/bin/hdfs dfs -put $SPARK_HOME-2.1.0-bin-hadoop2.6/jars /spark && $HADOOP_PREFIX/bin/hdfs dfs -put $SPARK_HOME-2.1.0-bin-hadoop2.6/examples/jars /spark
---> Running in ed073536dd11
/
Starting sshd: [ OK ]
Starting namenodes on [70b4a57bb473]
70b4a57bb473: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-70b4a57bb473.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-70b4a57bb473.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-70b4a57bb473.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-70b4a57bb473.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-70b4a57bb473.out
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Safe mode is OFF
---> e4bdf8a1628d
Removing intermediate container ed073536dd11
Step 9 : ENV YARN_CONF_DIR $HADOOP_PREFIX/etc/hadoop
---> Running in ea435f5b6141
---> 07083ac117c2
Removing intermediate container ea435f5b6141
Step 10 : ENV PATH $PATH:$SPARK_HOME/bin:$HADOOP_PREFIX/bin
---> Running in 0dabc7dd5211
---> c55b79a1e670
Removing intermediate container 0dabc7dd5211
Step 11 : COPY bootstrap.sh /etc/bootstrap.sh
---> e942148ae10f
Removing intermediate container 996819338fda
Step 12 : RUN chown root.root /etc/bootstrap.sh
---> Running in 7a8c32d8ddaa
---> 12cbdc408ed4
Removing intermediate container 7a8c32d8ddaa
Step 13 : RUN chmod 700 /etc/bootstrap.sh
---> Running in f833cda3afb5
---> d8d17a1babbf
Removing intermediate container f833cda3afb5
Step 14 : RUN rpm -ivh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
---> Running in 5c35660f3e9b
warning: /var/tmp/rpm-tmp.ByIPjn: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
Retrieving http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
Preparing... ##################################################
epel-release ##################################################
---> 666a2cbda116
Removing intermediate container 5c35660f3e9b
Step 15 : RUN yum -y install R
---> Running in 7192828863d7
Loaded plugins: fastestmirror, keys, protect-packages, protectbase
Determining fastest mirrors
* base: mirrors.btte.net
* epel: mirrors.tuna.tsinghua.edu.cn
* extras: mirrors.btte.net
* updates: mirrors.yun-idc.com
0 packages excluded due to repository protections
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package R.x86_64 0:3.3.2-3.el6 will be installed
......
......
......
---> Package xz-lzma-compat.x86_64 0:4.999.9-0.3.beta.20091007git.el6 will be updated
---> Package xz-lzma-compat.x86_64 0:4.999.9-0.5.beta.20091007git.el6 will be an update
--> Finished Dependency Resolution
Dependencies Resolved
================================================================================
Package Arch Version Repository
Size
================================================================================
Installing:
R x86_64 3.3.2-3.el6 epel 26 k
Installing for dependencies:
R-core x86_64 3.3.2-3.el6 epel 53 M
......
Updating for dependencies:
cpp x86_64 4.4.7-17.el6 base 3.7 M
......
Transaction Summary
================================================================================
Install 69 Package(s)
Upgrade 19 Package(s)
Total download size: 145 M
Downloading Packages:
--------------------------------------------------------------------------------
Total 494 kB/s | 145 MB 05:01
warning: rpmts_HdrFromFdno: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
Retrieving key from file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-6
Importing GPG key 0x0608B895:
Userid : EPEL (6) <epel@fedoraproject.org>
Package: epel-release-6-8.noarch (installed)
From : /etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-6
Running rpm_check_debug
Running Transaction Test
Transaction Test Succeeded
Running Transaction
Warning: RPMDB altered outside of yum.
Updating : libgcc-4.4.7-17.el6.x86_64 1/107
......
Cleanup : glib2-2.26.1-3.el6.x86_64 107/107
Verifying : acl-2.2.49-6.el6.x86_64 1/107
......
Installed:
R.x86_64 0:3.3.2-3.el6
Dependency Installed:
R-core.x86_64 0:3.3.2-3.el6
......
zlib-devel.x86_64 0:1.2.3-29.el6
Dependency Updated:
cpp.x86_64 0:4.4.7-17.el6
......
xz-lzma-compat.x86_64 0:4.999.9-0.5.beta.20091007git.el6
Complete!
---> ed7e19858dc9
Removing intermediate container 7192828863d7
Step 16 : ENTRYPOINT /etc/bootstrap.sh
---> Running in 31d75ee50b7d
---> 4eb30ebd34a2
Removing intermediate container 31d75ee50b7d
Successfully built 4eb30ebd34a2
4) Check the newly built Spark 2.1.0 image
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
farawayzheng/spark 2.1.0 4eb30ebd34a2 4 hours ago 2.649 GB
hello-world latest c54a2cc56cbb 6 months ago 1.848 kB
sequenceiq/hadoop-docker 2.6.0 140b265bd62a 24 months ago 1.624 GB
Testing the image
1) Start a Spark 2.1.0 container
$ docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox farawayzheng/spark:2.1.0 bash
/
Starting sshd: [ OK ]
Starting namenodes on [sandbox]
sandbox: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-sandbox.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-sandbox.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-sandbox.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-sandbox.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-sandbox.out
bash-4.1#
The "bash-4.1#" prompt indicates that the Spark 2.1.0 container started successfully.
2) Verify that the Spark cluster works in YARN client mode
bash-4.1# spark-shell --master yarn --deploy-mode client --driver-memory 1g --executor-memory 1g --executor-cores 1
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/01/11 05:00:45 WARN spark.SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
17/01/11 05:00:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/01/11 05:01:10 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/usr/local/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/usr/local/spark/jars/datanucleus-api-jdo-3.2.6.jar."
17/01/11 05:01:10 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/usr/local/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/usr/local/spark/jars/datanucleus-rdbms-3.2.9.jar."
17/01/11 05:01:10 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/usr/local/spark/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/usr/local/spark-2.1.0-bin-hadoop2.6/jars/datanucleus-core-3.2.10.jar."
17/01/11 05:01:18 WARN metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
17/01/11 05:01:18 WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException
17/01/11 05:01:19 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://172.17.0.3:4040
Spark context available as 'sc' (master = yarn, app id = application_1484126893491_0005).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
The "scala>" prompt indicates that the Spark shell is running. It does print several WARN-level log messages. They do not get in the way of simple tests, but the configuration should be reviewed and corrected later.
An explanation of "WARN metastore.ObjectStore: Failed to get database default, returning NoSuchObjectException" can be found at:
https://issues.apache.org/jira/browse/SPARK-14067
"This is expected behavior. We need the default database for a number of operations. We try to look it up in the metastore, if it does not exist the metastore gives a warning and we will create a default database."
Enter a Scala expression to test whether Spark works:
scala> sc.parallelize(1 to 1000).count()
res0: Long = 1000
scala>
The check passes, and the image is ready to use.
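The interactive check above can also be run non-interactively, for example as a smoke test after each rebuild. The following is a sketch, not the author's method: `count_ok` is a hypothetical helper, and the usage line assumes that spark-shell evaluates expressions piped to it on standard input and prints the result in the same form as in the transcript.

```shell
# Hypothetical helper: succeeds when the spark-shell output on stdin
# contains the expected result of the count.
count_ok() {
  grep -q 'Long = 1000'
}

# Usage inside the container (spark-shell is on PATH via the Dockerfile):
#   echo 'sc.parallelize(1 to 1000).count()' \
#     | spark-shell --master yarn --deploy-mode client \
#         --driver-memory 1g --executor-memory 1g --executor-cores 1 \
#     | count_ok && echo "Spark on YARN works"
```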