Fully Distributed Deployment of Hadoop and Spark

1. Configure the servers

1.1 Set the hostname

hostname master
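
The hostname command above only changes the name for the current session. On a systemd-based distribution (assumed here), the change can be made permanent with hostnamectl, run on each node with its own name:

# Persist the hostname across reboots (use worker01 / worker02 on the workers)
hostnamectl set-hostname master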

1.2 Edit the /etc/hosts file and add the following entries so that the servers can be reached by hostname

127.0.0.1 localhost
master_ip master
worker1_ip worker01
worker2_ip worker02
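
As a quick check, every hostname should now resolve from every node, for example:

# Each name should resolve to the address configured above
ping -c 1 master
ping -c 1 worker01
ping -c 1 worker02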

1.3 Configure passwordless SSH login

cd ~/.ssh
ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub username@hostname1
ssh-copy-id -i ~/.ssh/id_rsa.pub username@hostname2
ssh-copy-id -i ~/.ssh/id_rsa.pub username@hostname3
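
If the keys were copied correctly, logging in to each node should no longer ask for a password, for example:

# Should print the remote hostname without prompting for a password
ssh username@hostname1 hostname
ssh username@hostname2 hostname
ssh username@hostname3 hostname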

2. Install the JDK

2.1 Download the Java 8 JDK and extract it

cd ~/hadoop/tars
tar -zxvf jdk-8u341-linux-x64.tar.gz -C ../

2.2 Open ~/.bashrc or /etc/profile and add the following lines to configure the Java environment variables

export JAVA_HOME=/home/bd/hadoop/jdk1.8.0_341
export CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH

Run the following command to make the configuration take effect

source ~/.bashrc
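
To confirm that the Java environment variables took effect:

# Both should point to JDK 1.8.0_341
java -version
echo $JAVA_HOME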

2.3 Install the JDK on the worker nodes in the same way, or copy it directly from the master node; the environment variables need to be configured there as well

scp -r ~/hadoop/jdk1.8.0_341/ bigdata@worker1_ip:~/hadoop/jdk1.8.0_341
scp -r ~/hadoop/jdk1.8.0_341/ root@worker2_ip:~/hadoop/jdk1.8.0_341
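
A quick way to check the copied JDK on the worker nodes (same users as in the scp commands above):

# Each command should print java version "1.8.0_341"
ssh bigdata@worker1_ip '~/hadoop/jdk1.8.0_341/bin/java -version'
ssh root@worker2_ip '~/hadoop/jdk1.8.0_341/bin/java -version'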

3. Set up the Hadoop cluster

3.1 Download Hadoop and extract it

cd ~/hadoop
tar -zxvf ./tars/hadoop-3.3.4.tar.gz

3.2 Open ~/.bashrc or /etc/profile and add the following lines to configure the Hadoop environment variables

export HADOOP_HOME=/home/bd/hadoop/hadoop-3.3.4/
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Run the following command to make the configuration take effect

source ~/.bashrc
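
To confirm that the Hadoop environment variables took effect:

# Should print Hadoop 3.3.4
hadoop version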

3.3 Edit the following files under the hadoop-3.3.4/etc/hadoop/ directory

3.3.1 workers
bigdata@worker1_ip
root@worker2_ip
3.3.2 hadoop-env.sh
export JAVA_HOME=~/hadoop/jdk1.8.0_341
3.3.3 core-site.xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/bd/hadoop/hadoop-3.3.4/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master_ip:9310</value>
    </property>
    <property>
        <name>hadoop.proxyuser.bd.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.bd.groups</name>
        <value>*</value>
    </property>
</configuration>
3.3.4 hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>master_ip:9311</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>worker1_ip:9312</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/bd/hadoop/hadoop-3.3.4/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/bd/hadoop/hadoop-3.3.4/hdfs/data</value>
    </property>
    <property>
        <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
        <value>false</value>
    </property>
</configuration>
3.3.5 mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
3.3.6 yarn-site.xml
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master_ip</value>
    </property>
</configuration>

3.4 Configure Hadoop on the worker nodes

scp -r ~/hadoop/hadoop-3.3.4/ root@worker2_ip:~/hadoop/hadoop-3.3.4/
scp -r ~/hadoop/hadoop-3.3.4/ bigdata@worker1_ip:~/hadoop/hadoop-3.3.4/

3.5 Configure the corresponding environment variables on the worker nodes and adjust the paths in the files from section 3.3

3.6 Start the Hadoop cluster
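
Before starting HDFS for the first time, format the NameNode on the master node. Run this only once; reformatting erases the existing HDFS metadata:

# One-time initialization of the directory configured in dfs.namenode.name.dir
hdfs namenode -format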

3.6.1.1 Start commands on the master node (the file storage paths differ between nodes, so the daemons are started on each node separately)
hdfs --daemon start namenode
hdfs --daemon start datanode
hdfs --daemon start journalnode

yarn --daemon start resourcemanager
yarn --daemon start nodemanager
3.6.1.2 Stop commands on the master node
hdfs --daemon stop namenode
hdfs --daemon stop datanode
hdfs --daemon stop journalnode

yarn --daemon stop resourcemanager
yarn --daemon stop nodemanager
3.6.2.1 Start commands on the Worker1 node
hdfs --daemon start secondarynamenode
hdfs --daemon start datanode
hdfs --daemon start journalnode

yarn --daemon start nodemanager
3.6.2.2 Stop commands on the Worker1 node
hdfs --daemon stop secondarynamenode
hdfs --daemon stop datanode
hdfs --daemon stop journalnode

yarn --daemon stop nodemanager
3.6.3.1 Start commands on the Worker2 node
hdfs --daemon start datanode
hdfs --daemon start journalnode

yarn --daemon start nodemanager
3.6.3.2 Stop commands on the Worker2 node
hdfs --daemon stop datanode
hdfs --daemon stop journalnode

yarn --daemon stop nodemanager
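
After the daemons are started, running jps on each node lists the Java processes, which should match the start commands above (for example NameNode, DataNode, JournalNode, ResourceManager and NodeManager on the master node):

# List the running Hadoop/YARN daemons on the current node
jps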

3.7 Check the Hadoop web UIs

3.7.1 Hadoop NameNode: http://master_ip:9311 should show 3 Live Nodes
3.7.2 Hadoop YARN: http://master_ip:8088 should show 3 Active Nodes
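
Besides the web UIs, the cluster can also be checked from the command line; the HDFS paths below are only examples:

# Should report 3 live datanodes
hdfs dfsadmin -report

# Simple write/read round trip through HDFS
hdfs dfs -mkdir -p /tmp/smoke-test
hdfs dfs -put $HADOOP_HOME/etc/hadoop/core-site.xml /tmp/smoke-test/
hdfs dfs -ls /tmp/smoke-test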

4. Set up the Spark cluster

4.1 Download Spark and extract it

cd ~/hadoop
tar -zxvf ./tars/spark-3.3.1-bin-without-hadoop.tgz
mv spark-3.3.1-bin-without-hadoop spark-3.3.1

4.2 Open ~/.bashrc or /etc/profile and add the following lines to configure the Spark environment variables

export SPARK_HOME=/home/bd/hadoop/spark-3.3.1
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Run the following command to make the configuration take effect

source ~/.bashrc

4.3 Edit the following files under the spark-3.3.1/conf/ directory

4.3.1 workers
bigdata@worker1_ip
root@worker2_ip
4.3.2 spark-env.sh
export SPARK_DIST_CLASSPATH=$(/home/bd/hadoop/hadoop-3.3.4/bin/hadoop classpath)
export HADOOP_CONF_DIR=/home/bd/hadoop/hadoop-3.3.4/etc/hadoop
export SPARK_MASTER_HOST=master_ip
export JAVA_HOME=/home/bd/hadoop/jdk1.8.0_341
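
SPARK_DIST_CLASSPATH lets this "Hadoop free" Spark build pick up the Hadoop jars from the local Hadoop installation. Once spark-env.sh is in place, a quick local check is:

# Should print Spark 3.3.1 without class-not-found errors
spark-submit --version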

4.4 Configure Spark on the worker nodes

scp -r ~/hadoop/spark-3.3.1 root@worker2_ip:~/hadoop/spark-3.3.1
scp -r ~/hadoop/spark-3.3.1 bigdata@worker1_ip:~/hadoop/spark-3.3.1

4.5 Configure the corresponding environment variables on the worker nodes and adjust the paths in the files from section 4.3

4.6 Start the Spark cluster

4.6.1.1 Start command on the master node (the file storage paths differ between nodes, so the services are started on each node separately)
/home/bd/hadoop/spark-3.3.1/sbin/start-master.sh -h master_ip
4.6.1.2 Stop command on the master node
/home/bd/hadoop/spark-3.3.1/sbin/stop-master.sh
4.6.2.1 Start command on the worker nodes
/data2/bigData/hadoop/spark-3.3.1/sbin/start-worker.sh spark://master_ip:7077
4.6.2.2 Stop command on the worker nodes
/data2/bigData/hadoop/spark-3.3.1/sbin/stop-worker.sh

4.7 Check the Spark web UI

4.7.1 Spark Master Web UI: http://master_ip:8080/ should show 2 Workers
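
To confirm that the cluster actually runs jobs, the bundled SparkPi example can be submitted to the master; the jar name below assumes the default Scala 2.12 build of Spark 3.3.1, so adjust it if your distribution differs:

# Submit SparkPi to the standalone master; the driver output should contain
# a line like "Pi is roughly 3.14..."
spark-submit \
  --master spark://master_ip:7077 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.3.1.jar 100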