Prerequisites
1. Start the containers
1. Launch Docker Desktop and start the containers h01 and h02.
2. Open a cmd window (WIN + R, type cmd, press Enter).
3. Run the following command in cmd to get the container IDs (it is worth copying them down):
docker ps
The output lists the running containers together with their IDs.
4. Run the following command in cmd (replace containerID with the ID obtained in step 3) to get an interactive terminal.
From the previous step we know:
Hostname | Container ID
-------- | ------------
h01      | 1e4f869cb0f5
h02      | 5c2ee32cfbfe
docker exec -it containerID /bin/bash
In my case the exact commands were:
In the current cmd window, run the following to get a terminal on h01:
docker exec -it 1e4f869cb0f5 /bin/bash
Open another cmd window and run the following there to get a terminal on h02:
docker exec -it 5c2ee32cfbfe /bin/bash
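If the IDs are hard to pick out of the full docker ps listing, docker can also print just the ID for a given container name; this is a convenience not used in the original steps, using standard docker ps filter/format flags:
docker ps --filter "name=h01" --format "{{.ID}}"
docker ps --filter "name=h02" --format "{{.ID}}"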
2. Download and install Spark
1. In the h01 terminal, run the following commands to download Spark:
# switch to the home directory and download the Spark tarball into the current user's home directory
cd ~
wget https://mirrors.aliyun.com/apache/spark/spark-3.2.1/spark-3.2.1-bin-without-hadoop.tgz
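If the Aliyun mirror has meanwhile dropped the 3.2.1 release (mirrors tend to keep only current versions), the Apache archive is the usual fallback; the URL below is an assumption based on the standard archive layout:
wget https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-without-hadoop.tgz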
2. Extract the archive
tar -zxvf spark-3.2.1-bin-without-hadoop.tgz -C /usr/local/
3. Rename the directory
mv /usr/local/spark-3.2.1-bin-without-hadoop /usr/local/spark
4. Edit the configuration files
2.4.1 /etc/profile
vim /etc/profile
Append the following at the end of the file:
# spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
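The guide does not show it explicitly, but changes to /etc/profile only reach the current shell after the file is re-read; a quick way to apply and check them:
source /etc/profile
echo $SPARK_HOME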
2.4.2 /usr/local/spark/conf/spark-env.sh
vim /usr/local/spark/conf/spark-env.sh
Add the following to the file:
#java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SCALA_HOME=/usr/share/scala
export SPARK_MASTER_HOST=h01
export SPARK_MASTER_IP=h01
export SPARK_WORKER_MEMORY=4g
export PYSPARK_PYTHON=/usr/local/python3/bin/python3
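Because this is the "without-hadoop" build, Spark's documentation describes pulling the Hadoop jars in through SPARK_DIST_CLASSPATH in this same file; that is an alternative to copying jars by hand in step 2.7.1, not part of the original setup, and the hadoop path is the one configured above:
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)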
2.4.3 /usr/local/spark/conf/log4j.properties (optional)
This only changes the console log level from INFO to ERROR.
vim /usr/local/spark/conf/log4j.properties
Put the following content into the file:
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Set everything to be logged to the console
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell/spark-sql log level to WARN. When running the
# spark-shell/spark-sql, the log level for these classes is used to overwrite
# the root logger's log level, so that the user can have different defaults
# for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
log4j.logger.org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver=WARN
# Settings to quiet third party logs that are too verbose
log4j.logger.org.sparkproject.jetty=WARN
log4j.logger.org.sparkproject.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
# For deploying Spark ThriftServer
# SPARK-34128:Suppress undesirable TTransportException warnings involved in THRIFT-4805
log4j.appender.console.filter.1=org.apache.log4j.varia.StringMatchFilter
log4j.appender.console.filter.1.StringToMatch=Thrift error occurred during processing of message
log4j.appender.console.filter.1.AcceptOnMatch=false
2.4.4 /usr/local/spark/conf/workers
vim /usr/local/spark/conf/workers
Replace the file contents with:
h01
h02
2.4.5 Copy the configured Spark to h02
scp -r /usr/local/spark/ root@h02:/usr/local/
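The /etc/profile edit from 2.4.1 exists only on h01; if the spark commands should also resolve on h02, the same two lines can be appended remotely. This is not in the original steps and assumes passwordless ssh between the containers, as set up for the Hadoop cluster:
ssh root@h02 "echo 'export SPARK_HOME=/usr/local/spark' >> /etc/profile"
ssh root@h02 "echo 'export PATH=\$PATH:\$SPARK_HOME/bin' >> /etc/profile"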
5. Install dependencies
2.5.1 Install gcc and make
apt install gcc
apt install make
2.5.2 Install zlib1g-dev and libffi-dev
apt install zlib1g-dev
apt install libffi-dev
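On a fresh container the package index may be empty, and the Python build below silently skips the ssl module if the OpenSSL headers are missing, which breaks pip over HTTPS; neither package is in the original list, so treat this as an optional precaution:
apt update
apt install libssl-dev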
2.5.3 Install Python
# switch to the home directory and download the Python source tarball into the current user's home directory
cd ~
wget https://www.python.org/ftp/python/3.9.12/Python-3.9.12.tgz
tar -zxvf Python-3.9.12.tgz -C /usr/local/
mkdir /usr/local/python3
cd /usr/local/Python-3.9.12/
./configure --prefix=/usr/local/python3 --enable-loadable-sqlite-extensions && make && make install
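A quick check that the interpreter landed where PYSPARK_PYTHON in spark-env.sh expects it; and since that variable applies on every node, mirroring the build to h02 (an assumption on my part, the original only installs Python on h01) keeps the worker consistent:
# verify the freshly built interpreter
/usr/local/python3/bin/python3 --version
# optionally copy it to h02 so PYSPARK_PYTHON resolves there as well
scp -r /usr/local/python3 root@h02:/usr/local/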
6. Edit /usr/local/hadoop/etc/hadoop/hadoop-env.sh
vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Add the following to the file:
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
After adding the lines, re-read the file so they take effect in the current shell:
source /usr/local/hadoop/etc/hadoop/hadoop-env.sh
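Before starting Spark it is worth confirming that the Hadoop daemons from the earlier chapters are actually running; if the JDK's jps tool is present in the image, it lists the running Java processes (the exact process names depend on how the Hadoop cluster was set up):
jps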
7. Start Spark
2.7.1 Download the missing jars
- Because the without-hadoop build was installed, many jars are missing from /usr/local/spark/jars/, so they have to be added.
- From the same site used to download Spark earlier, download the build that bundles Hadoop (e.g. spark-3.2.1-bin-hadoop3.2.tgz).
- Extract the downloaded file into a directory a on the Windows host (e.g. E:\COURSE\Spark); it needs to be extracted twice (once for the gz layer, once for the tar layer).
- Put the jars directory from the extracted files directly under directory a (e.g. E:\COURSE\Spark), so that the path E:\COURSE\Spark\jars exists.
- Open another cmd window (the third one) (WIN + R, type cmd, Enter).
- In the new cmd window, run the following command (directoryA stands for directory a from above):
docker run -v directoryA:/home -it --network hadoop -h "h03" --name "h03" hdp /bin/bash
In my case the command was:
docker run -v E:\COURSE\Spark:/home -it --network hadoop -h "h03" --name "h03" hdp /bin/bash
- Send all the files under /home/jars/ to h01 and h02:
scp -r /home/jars/ root@h01:/usr/local/spark/
scp -r /home/jars/ root@h02:/usr/local/spark/
- Start Spark proper (back in the h01 terminal):
/usr/local/spark/bin/pyspark
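Launched this way, pyspark runs in local mode; since conf/workers and SPARK_MASTER_HOST were configured for a standalone cluster, a possible follow-up, not part of the original steps and assuming ssh between h01 and h02 works, is to start the master and workers and attach pyspark to them:
# on h01: start the standalone master plus the workers listed in conf/workers
/usr/local/spark/sbin/start-all.sh
# attach the shell to the standalone master (default port 7077)
/usr/local/spark/bin/pyspark --master spark://h01:7077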
3. Commit the image
Open a new cmd window:
docker commit -m "spark" -a "spark" containerID-of-h01 spark
In my case (the container ID can be looked up with docker ps):
docker commit -m "spark" -a "spark" 1e4f869cb0f5 spark
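A quick way to confirm that the new image exists, using a standard docker command not shown in the original write-up:
docker images spark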