Building a Hadoop distributed cluster with Docker in a Linux VM, and operating HDFS from Java (Part 1)
Installing Docker
Windows is restrictive here: Docker's Linux mode conflicts with the VM software's virtualization service, so every use means toggling services and rebooting. Instead, everything below is done inside a Linux virtual machine (VM) running on Windows. This tutorial is aimed at readers who have used Linux before.
Docker installation guide: https://www.runoob.com/docker/centos-docker-install.html
Whatever the host distribution, once Docker is installed the commands are the same.
Basic Docker usage
Start the docker service:
service docker start
A fresh Docker install has no images. I use ubuntu as the base for the hadoop cluster, but any other Linux distribution would work too.
docker pull ubuntu    # pull (download) the ubuntu image into docker; defaults to the latest tag
Wait for the download to finish.
docker images    # list images; the freshly pulled ubuntu image should appear
root@linux:~# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
ubuntu latest 72300a873c2c 8 weeks ago 64.2MB
OK. The hadoop cluster nodes need to communicate over a LAN, so create a dedicated network for hadoop in bridge mode:
docker network create --driver=bridge hadoop
Once created, list the networks:
docker network ls
root@linux:~# docker network ls
NETWORK ID NAME DRIVER SCOPE
a520acd0f5eb bridge bridge local
62cb2d841382 hadoop bridge local
b7fef15ea068 host host local
ba108fc8779a none null local
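To double-check the subnet docker assigned to the new network (the exact output varies by host):
docker network inspect hadoop    # shows the bridge's subnet and, later on, which containers have joined it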
Now run a container from the ubuntu image:
docker run -it ubuntu /bin/bash    # -i runs in interactive mode, -t allocates a terminal
root@linux:~# docker run -it ubuntu /bin/bash
root@b434aef5bc5f:/#
exit quits the container; ctrl+p+q detaches the shell back to the host without stopping the container.
docker ps    # list running containers
docker start <container name/ID>    # start an existing container; stop halts a running one
docker attach <container name/ID>   # enter a running container
These commands will come up constantly from here on.
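For example, a typical round trip after detaching with ctrl+p+q looks like this:
docker ps                      # find the container's ID or name
docker attach b434aef5bc5f     # drop back into its shell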
Configuring the ubuntu image
With those two parts done, it's time to configure Ubuntu. The freshly pulled ubuntu image has only the most basic kernel interface and filesystem: no network tools, no JDK, no ssh, no vim, and all of these are indispensable for setting up the cluster.
At the container prompt root@b434aef5bc5f:/#, do the following.
1. Switch apt sources: replace the contents of /etc/apt/sources.list with the following. The image pulled above is Ubuntu 18.04 (bionic), as the ssh login banner later confirms, so the entries use bionic:
deb http://mirrors.aliyun.com/ubuntu/ bionic main
deb-src http://mirrors.aliyun.com/ubuntu/ bionic main
deb http://mirrors.aliyun.com/ubuntu/ bionic-updates main
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-updates main
deb http://mirrors.aliyun.com/ubuntu/ bionic universe
deb-src http://mirrors.aliyun.com/ubuntu/ bionic universe
deb http://mirrors.aliyun.com/ubuntu/ bionic-updates universe
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-updates universe
deb http://mirrors.aliyun.com/ubuntu/ bionic-security main
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-security main
deb http://mirrors.aliyun.com/ubuntu/ bionic-security universe
deb-src http://mirrors.aliyun.com/ubuntu/ bionic-security universe
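Tip: since vim isn't installed yet, one way to avoid editing by hand is to rewrite the default mirrors in place with sed. This is a sketch and assumes the stock bionic sources.list, which points at archive.ubuntu.com and security.ubuntu.com:
cp /etc/apt/sources.list /etc/apt/sources.list.bak    # keep a backup
sed -i 's|http://archive.ubuntu.com|http://mirrors.aliyun.com|g' /etc/apt/sources.list
sed -i 's|http://security.ubuntu.com|http://mirrors.aliyun.com|g' /etc/apt/sources.list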
2. Refresh the package index:
apt update
Then install the tools:
apt install net-tools
apt install inetutils-ping    # installs ping; you can't "apt install ping" directly, because ping lives in the inetutils package
apt install openjdk-8-jdk
apt install vim
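Once the installs finish, it's worth confirming the JDK landed where the environment variables below will expect it:
java -version     # should report an openjdk 1.8.x build
ls /usr/lib/jvm/  # note the directory name; it becomes JAVA_HOME later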
3. Install ssh
ssh is an application-layer secure remote login protocol. It works with a generated key pair, a private key and a public key; see the official ssh documentation for the details.
apt-get install openssh-server
apt-get install openssh-client
Create a key with an empty passphrase for passwordless login:
ssh-keygen -t rsa -P ''
root@b434aef5bc5f:/# ssh-keygen -t rsa -P ''
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
/root/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:p0UBShQiq/AxMBv+HxEku3a6WuwPkVu01E1HNck4kA8 root@cdbc600b933b
The key's randomart image is:
+---[RSA 2048]----+
|+ ..o++..+=o+o. |
|.= oooo.oE.+ o. |
|o.+. +.. .+ . |
|.o.o= o . . |
|. .* = S o |
| o B . + |
| * . . |
| o o |
| ..o.. |
+----[SHA256]-----+
Read id_rsa.pub with cat and append it as a stream to authorized_keys:
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
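sshd is strict about key file permissions; if the passwordless login below is refused, tightening them usually helps (a precaution, not a step from the original write-up):
chmod 700 /root/.ssh
chmod 600 /root/.ssh/authorized_keys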
Start the ssh service:
root@b434aef5bc5f:~# service ssh start
Starting OpenBSD Secure Shell server sshd [ OK ]
172.17.0.2 is the IP address of container b434aef5bc5f; try logging in:
root@b434aef5bc5f:~# ssh 172.17.0.2
Welcome to Ubuntu 18.04.4 LTS (GNU/Linux 5.3.0-kali2-amd64 x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
This system has been minimized by removing packages and content that are
not required on a system that users do not log into.
To restore this content, you can run the 'unminimize' command.
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
If ssh instead stops at a host-key confirmation prompt, run ssh -o StrictHostKeyChecking=no root@172.17.0.2
and execute the login again.
Success!
To save typing later, append service ssh start to the /root/.bashrc file so sshd comes up with every new shell:
vim /root/.bashrc
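Or, without opening vim at all:
echo 'service ssh start' >> /root/.bashrc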
Installing Hadoop
Download:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
Extract it to /usr/local, and rename the directory so it matches the HADOOP_HOME set below:
tar -zxvf hadoop-3.2.1.tar.gz -C /usr/local
mv /usr/local/hadoop-3.2.1 /usr/local/hadoop
Edit the /etc/profile file to add the environment variables:
vim /etc/profile
Append the following:
#java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
#hadoop
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_DATANODE_USER=root
export HDFS_DATANODE_SECURE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export HDFS_NAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
Note that HADOOP_CONF_DIR must point at $HADOOP_HOME/etc/hadoop as written above. The $HADOOP_PREFIX/etc/hadoop variant seen in some guides fails here because HADOOP_PREFIX is never defined, and hadoop commands then cannot locate their configuration without hunting for paths and prefixing ./ by hand. The last few HDFS_*/YARN_* lines declare which user each Hadoop daemon runs as; I use root.
Run source /etc/profile to make it take effect.
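A quick sanity check that the variables took effect:
hadoop version           # should print Hadoop 3.2.1
echo $HADOOP_CONF_DIR    # should be /usr/local/hadoop/etc/hadoop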
In the directory /usr/local/hadoop/etc/hadoop, append the following to the hadoop-env.sh file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
The core-site.xml file:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://h01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop3/hadoop/tmp</value>
</property>
</configuration>
The hdfs-site.xml file:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop3/hadoop/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop3/hadoop/hdfs/data</value>
</property>
</configuration>
The mapred-site.xml file:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>
/usr/local/hadoop/etc/hadoop,
/usr/local/hadoop/share/hadoop/common/*,
/usr/local/hadoop/share/hadoop/common/lib/*,
/usr/local/hadoop/share/hadoop/hdfs/*,
/usr/local/hadoop/share/hadoop/hdfs/lib/*,
/usr/local/hadoop/share/hadoop/mapreduce/*,
/usr/local/hadoop/share/hadoop/mapreduce/lib/*,
/usr/local/hadoop/share/hadoop/yarn/*,
/usr/local/hadoop/share/hadoop/yarn/lib/*
</value>
</property>
</configuration>
The yarn-site.xml file:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>h01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
The workers file (h01 is the master but doubles as a worker):
h01
h02
h03
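For reference, the file lives in the same config directory as the XML files above; one way to write it in one go:
cat > /usr/local/hadoop/etc/hadoop/workers <<'EOF'
h01
h02
h03
EOF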
That finally wraps up the Hadoop configuration.
Building the cluster
The steps above amount to setting up a single distributed-ready machine, but what we need is a hadoop cluster; one machine won't do.
So we scale out horizontally. That is exactly hadoop's strength: the cluster is easy to extend, i.e. you just keep piling on machines.
In docker this is easy. The container we just finished configuring is, in effect, a running machine in the real world, and we can clone it into as many identical slave nodes as we like. Two slaves are enough here, matching the workers file configured in step four: h01 is the master node, h02 and h03 are the slaves. Enough talk, let's get to it.
First, package the container into an image; scaling out requires an image to clone from:
docker commit -m "hadoop" -a "hadoop" b434aef5bc5f myhadoop
docker images
and myhadoop now appears in the list.
Scaling out
First the master node. The earlier container would work as-is, but we need port mappings, or the hadoop cluster can't be operated from outside docker:
docker run -it --network hadoop -h h01 --name "h00" -p 9870:9870 -p 8088:8088 -p 9000:9000 myhadoop /bin/bash //前两个端口是web端,9000是hdfs文件系统的端口映射
The master is configured; the cluster now has its brain. Next, add the two slave nodes:
docker run -it --network hadoop -h h01 --name "h02" myhadoop /bin/bash
docker run -it --network hadoop -h h02 --name "h02" myhadoop /bin/bash
On the docker host, create a docker.sh startup script with the content below, and make it executable with chmod +x:
service docker start
docker start h01
docker start h02
docker start h03
docker attach h01
root@linux:~# ./docker.sh
h01
h02
h03
root@h01:/#
Starting the cluster
On the h01 node, initialize hdfs first. This is done only once; formatting again on later startups will cause errors:
root@h01:/usr/local/hadoop/bin# ./hadoop namenode -format
Then, in the / directory, create a hadoop cluster startup script the same way as the docker startup script; it should invoke the startup scripts under ./usr/local/hadoop/sbin/.
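A minimal sketch of what that script might contain, assuming you want HDFS and YARN brought up together via start-all.sh:
#!/bin/bash
source /etc/profile                    # load the hadoop environment first
/usr/local/hadoop/sbin/start-all.sh    # starts the HDFS and YARN daemons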
Each time you start h01, run source /etc/profile first; then commands work without the ./ prefix.
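Once everything is up, jps on h01 should show something like the following (the process IDs are illustrative; since h01 is also listed in the workers file, it runs a DataNode and NodeManager too):
root@h01:/# jps
1234 NameNode
1370 DataNode
1602 SecondaryNameNode
1855 ResourceManager
2001 NodeManager
2310 Jps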
And that's it: the fully distributed hadoop cluster is up. A follow-up post will cover using eclipse to write Java programs that operate on hdfs.