Prerequisites


1. Start the containers

1. Launch Docker Desktop and start the containers h01 and h02.


2. Open a cmd command-prompt window (press WIN + R, type cmd, then press Enter).

3. Run the following command in cmd to get the container IDs (it is worth copying them down):

docker ps

The output looks like this:

(screenshot of the docker ps output)
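
If the full listing is noisy, docker ps can also be narrowed down to just the IDs and names; the --format template below uses standard Docker fields (optional):

docker ps --format "table {{.ID}}\t{{.Names}}"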

4. Run the following command in cmd (replacing containerID with one of the container IDs obtained in step 3) to get an interactive terminal.

From the previous step we know:

Hostname    container ID
h01         1e4f869cb0f5
h02         5c2ee32cfbfe

docker exec -it containerID /bin/bash

My concrete steps here were as follows.
In the current cmd window, enter the following command (to get a terminal on h01):

docker exec -it 1e4f869cb0f5 /bin/bash

Open another cmd window and enter the following command there (to get a terminal on h02):

docker exec -it 5c2ee32cfbfe /bin/bash
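
To confirm which container each window is attached to, check the hostname (a small sanity check, not required):

hostname
# should print h01 in the first window and h02 in the second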

2. Download and install Spark

1. In the terminal of h01, enter the following commands to download Spark:

Aliyun Apache Spark mirror

# Switch to the home directory and download the Spark tarball into the current user's home directory
cd ~
wget https://mirrors.aliyun.com/apache/spark/spark-3.2.1/spark-3.2.1-bin-without-hadoop.tgz

2. Extract the archive

tar -zxvf spark-3.2.1-bin-without-hadoop.tgz -C /usr/local/

3. Rename the directory

mv /usr/local/spark-3.2.1-bin-without-hadoop /usr/local/spark
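
A quick look at the directory confirms that extraction and renaming worked (optional; the exact listing may vary):

ls /usr/local/spark
# expected entries include bin, conf, jars, python and sbin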

4. Edit the configuration files

2.4.1 /etc/profile

vim /etc/profile

Append the following at the end of the file:

# spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
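
For the new variables to take effect in the current shell, /etc/profile has to be re-read. A minimal check, assuming the two lines above were appended as shown:

source /etc/profile
echo $SPARK_HOME
# should print /usr/local/spark
which spark-submit
# should resolve to /usr/local/spark/bin/spark-submit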

2.4.2 /usr/local/spark/conf/spark-env.sh

vim /usr/local/spark/conf/spark-env.sh

Add the following to the file:

# Java, Hadoop and Scala locations
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SCALA_HOME=/usr/share/scala

export SPARK_MASTER_HOST=h01
export SPARK_MASTER_IP=h01
export SPARK_WORKER_MEMORY=4g

export PYSPARK_PYTHON=/usr/local/python3/bin/python3
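
The paths above come from the Java, Hadoop and Scala installed in the base image of the earlier guide; adjust them if your containers differ. Note that /usr/local/python3 is only created in step 5 below, so PYSPARK_PYTHON will not resolve until then. An optional existence check for the other paths:

ls -d /usr/lib/jvm/java-8-openjdk-amd64 /usr/local/hadoop /usr/share/scala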

2.4.3 /usr/local/spark/conf/log4j.properties (optional)

This step only changes the console log level from INFO to ERROR.

vim /usr/local/spark/conf/log4j.properties

Put the following content into the file:

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Set everything to be logged to the console
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell/spark-sql log level to WARN. When running the
# spark-shell/spark-sql, the log level for these classes is used to overwrite
# the root logger's log level, so that the user can have different defaults
# for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
log4j.logger.org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver=WARN

# Settings to quiet third party logs that are too verbose
log4j.logger.org.sparkproject.jetty=WARN
log4j.logger.org.sparkproject.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

# For deploying Spark ThriftServer
# SPARK-34128:Suppress undesirable TTransportException warnings involved in THRIFT-4805
log4j.appender.console.filter.1=org.apache.log4j.varia.StringMatchFilter
log4j.appender.console.filter.1.StringToMatch=Thrift error occurred during processing of message
log4j.appender.console.filter.1.AcceptOnMatch=false
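
Instead of typing the whole file by hand, the template shipped in the conf directory can be copied and only the root log level changed (assuming conf/log4j.properties.template is present, as in the 3.2.1 distribution):

cd /usr/local/spark/conf
cp log4j.properties.template log4j.properties
sed -i 's/^log4j.rootCategory=INFO, console/log4j.rootCategory=ERROR, console/' log4j.properties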

2.4.4 /usr/local/spark/conf/workers

vim /usr/local/spark/conf/workers

Replace the file's contents with:

h01
h02

2.4.5 Copy the configured Spark to h02

scp -r /usr/local/spark/ root@h02:/usr/local/
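
An optional check that the copy arrived on h02, assuming the passwordless ssh configured in the earlier Hadoop guide:

ssh root@h02 "ls /usr/local/spark/bin"

Note that the /etc/profile change from 2.4.1 only exists on h01; repeat it on h02 if the spark commands should be on the PATH there as well.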

5. Install dependencies

2.5.1 Install gcc and make

apt install gcc
apt install make

2.5.2 Install zlib1g-dev and libffi-dev

apt install zlib1g-dev
apt install libffi-dev

2.5.3 Install Python

Python official website

# Switch to the home directory and download the Python source tarball into the current user's home directory
cd ~
wget https://www.python.org/ftp/python/3.9.12/Python-3.9.12.tgz
tar -zxvf Python-3.9.12.tgz -C /usr/local/
mkdir /usr/local/python3
cd /usr/local/Python-3.9.12/
./configure --prefix=/usr/local/python3 --enable-loadable-sqlite-extensions && make && make install

(screenshot of the Python build output)
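
A quick check that the freshly built interpreter matches what spark-env.sh points to (optional):

/usr/local/python3/bin/python3 --version
# should report Python 3.9.12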

6. Edit /usr/local/hadoop/etc/hadoop/hadoop-env.sh

vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Add the following to the file:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

After adding these lines, run the following so that the settings take effect in the current shell (re-reads /usr/local/hadoop/etc/hadoop/hadoop-env.sh):

source /usr/local/hadoop/etc/hadoop/hadoop-env.sh
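
An optional check that the variables are now visible in the current shell (assuming Hadoop lives under /usr/local/hadoop, as in the earlier guide):

echo $HADOOP_CONF_DIR
# should print /usr/local/hadoop/etc/hadoop
hadoop version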

7. Start Spark

2.7.1 Download the missing jars

  1. Because the without-hadoop build was installed, many jars are missing under /usr/local/spark/jars/, so they have to be added. On the Windows host, download the build that bundles Hadoop (e.g. spark-3.2.1-bin-hadoop3.2.tgz) from the same mirror used earlier.
  2. Extract the downloaded file into a directory a (e.g. E:\COURSE\Spark); it needs to be extracted twice (once for the gz layer, once for the inner tar).
  3. Move the jars directory from the extracted package into directory a (e.g. E:\COURSE\Spark).
  4. Open yet another cmd window, the third one (WIN + R, type cmd, press Enter).
  5. In the newly opened cmd window, enter the following command (replace "directory a" with the actual path):
docker run -v <directory a>:/home -it --network hadoop -h "h03" --name "h03" hdp /bin/bash

My actual command here was:

docker run -v E:\COURSE\Spark:/home -it --network hadoop -h "h03" --name "h03" hdp /bin/bash
  6. Send everything under /home/jars/ to h01 and h02:
scp -r /home/jars/ root@h01:/usr/local/spark/
scp -r /home/jars/ root@h02:/usr/local/spark/
  7. Now start Spark properly (a quick smoke test follows after the screenshot below):
/usr/local/spark/bin/pyspark

(screenshot of the PySpark startup banner)
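
As a smoke test, the Pi example that ships with the Spark distribution can be submitted on h01; local mode is used here because the standalone master and workers have not been started with the sbin scripts (paths assume the layout set up above):

/usr/local/spark/bin/spark-submit --master "local[2]" /usr/local/spark/examples/src/main/python/pi.py 10
# a line like "Pi is roughly 3.14..." indicates a working installation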

3. Commit the image

Open a new cmd window.

docker commit -m "spark" -a "spark" <container ID of h01> spark

My actual command here was (the container ID can be looked up with docker ps):

docker commit -m "spark" -a "spark" 1e4f869cb0f5 spark
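
An optional check that the new image exists:

docker images spark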