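1 Introduction
Hadoop 2.x is the Apache Software Foundation's next-generation Hadoop framework, released to address problems in Hadoop 1.x such as the single NameNode being a single point of failure. In this series several HDFS mechanisms were improved, the MapReduce framework was replaced by the YARN framework (MapReduce 2), and integration with popular big data analysis frameworks such as Spark became possible. The Hadoop 2.x series is covered in detail later.
2 Installing Hadoop 2.6
Hadoop 2.6 needs the same environment as Hadoop 1.2.1, so the preparation steps are only summarized here.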
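(1) Install the sshd service and set up passwordless login between nodes.
In Hadoop 2.x the two jobs of the JobTracker, resource management and job scheduling, are split apart, and the resulting services can run on different nodes. Passwordless login to all nodes therefore has to be set up both on the node that runs the NameNode service and on the node that runs the ResourceManager service.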
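The passwordless login described above can be sketched as follows. This is a minimal sketch, assuming OpenSSH's ssh-keygen and ssh-copy-id are installed and that the hadoop user exists on every node. The helper names run and push_key_to_nodes and the DRY_RUN switch are hypothetical, introduced only for illustration. With DRY_RUN left at its default of 1 the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: set up passwordless SSH from this node to every cluster node.
# Assumption: run as the hadoop user on the NameNode node and again on the
# ResourceManager node. DRY_RUN and the function names are illustrative.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

push_key_to_nodes() {
    local key="$HOME/.ssh/id_rsa"
    # Generate a key pair without a passphrase only if none exists yet.
    if [ ! -f "$key" ]; then
        run ssh-keygen -t rsa -N "" -f "$key"
    fi
    # Install the public key on each listed node (including this one).
    local node
    for node in "$@"; do
        run ssh-copy-id -i "$key.pub" "hadoop@$node"
    done
}

push_key_to_nodes hadoop1 hadoop2 hadoop3 hadoop4
```

Running the script with DRY_RUN=0 performs the real key distribution and prompts once for each node's password.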
(2) Configure the hosts file
The cluster has four nodes, with the following IP addresses and host names:
192.168.149.129 hadoop1
192.168.149.130 hadoop2
192.168.149.131 hadoop3
192.168.149.132 hadoop4
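The entries above have to be present in the hosts file of every node. The following is a minimal sketch of adding them without creating duplicates. The HOSTS_FILE variable and the add_host_entry function are hypothetical names. By default the sketch writes to a scratch file, and on a real node HOSTS_FILE would point at /etc/hosts and the script would run as root.

```shell
#!/usr/bin/env bash
# Sketch: add the four cluster entries to a hosts file without duplicating them.
# Assumption: HOSTS_FILE defaults to a scratch file so the sketch is harmless.

HOSTS_FILE="${HOSTS_FILE:-$(mktemp)}"

add_host_entry() {
    local ip="$1" name="$2"
    # Skip the entry if a line ending in this host name already exists.
    if ! grep -qE "[[:space:]]${name}\$" "$HOSTS_FILE" 2>/dev/null; then
        printf '%s %s\n' "$ip" "$name" >> "$HOSTS_FILE"
    fi
}

add_host_entry 192.168.149.129 hadoop1
add_host_entry 192.168.149.130 hadoop2
add_host_entry 192.168.149.131 hadoop3
add_host_entry 192.168.149.132 hadoop4

cat "$HOSTS_FILE"
```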
(3) Install Java 1.7
Installing Java 1.7 was explained in detail in the Hadoop 1.2.1 installation, so only the resulting Java configuration is shown here.
[hadoop@hadoop1 etc]$ ls /opt/
apache-ant-1.9.5 apache-maven-3.3.3 jdk1.7.0_75 protobuf protobuf-2.5.0 rh
[hadoop@hadoop1 etc]$ cat /etc/profile
# /etc/profile
# System wide environment and startup programs, for login setup
# Functions and aliases go in /etc/bashrc
# It's NOT a good idea to change this file unless you know what you
# are doing. It's much better to create a custom.sh shell script in
# /etc/profile.d/ to make custom changes to your environment, as this
# will prevent the need for merging in future updates.
pathmunge () {
case ":${PATH}:" in
*:"$1":*)
;;
*)
if [ "$2" = "after" ] ; then
PATH=$PATH:$1
else
PATH=$1:$PATH
fi
esac
}
if [ -x /usr/bin/id ]; then
if [ -z "$EUID" ]; then
# ksh workaround
EUID=`id -u`
UID=`id -ru`
fi
USER="`id -un`"
LOGNAME=$USER
MAIL="/var/spool/mail/$USER"
fi
# Path manipulation
if [ "$EUID" = "0" ]; then
pathmunge /sbin
pathmunge /usr/sbin
pathmunge /usr/local/sbin
else
pathmunge /usr/local/sbin after
pathmunge /usr/sbin after
pathmunge /sbin after
fi
HOSTNAME=`/bin/hostname 2>/dev/null`
HISTSIZE=1000
if [ "$HISTCONTROL" = "ignorespace" ] ; then
export HISTCONTROL=ignoreboth
else
export HISTCONTROL=ignoredups
fi
export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL
# By default, we want umask to get set. This sets it for login shell
# Current threshold for system reserved uid/gids is 200
# You could check uidgid reservation validity in
# /usr/share/doc/setup-*/uidgid file
if [ $UID -gt 199 ] && [ "`id -gn`" = "`id -un`" ]; then
umask 002
else
umask 022
fi
for i in /etc/profile.d/*.sh ; do
if [ -r "$i" ]; then
if [ "${-#*i}" != "$-" ]; then
. "$i"
else
. "$i" >/dev/null 2>&1
fi
fi
done
#Java Install
export JAVA_HOME=/opt/jdk1.7.0_75
export CLASSPATH=/opt/jdk1.7.0_75/lib/tools.jar:.:/opt/jdk1.7.0_75/lib/dt.jar
export PATH=$PATH:/opt/jdk1.7.0_75/jre/bin:/opt/jdk1.7.0_75/bin
#hadoop-2.6.0 install
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export PATH=$PATH:/home/hadoop/hadoop-2.6.0/bin:/home/hadoop/hadoop-2.6.0/sbin
#maven install
export MAVEN_HOME=/opt/apache-maven-3.3.3
export PATH=$PATH:/opt/apache-maven-3.3.3/bin
#ant install
export ANT_HOME=/opt/apache-ant-1.9.5
export PATH=$PATH:/opt/apache-ant-1.9.5/bin
#protobuf install
export PATH=$PATH:/opt/protobuf/bin
unset i
unset -f pathmunge
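The comment at the top of /etc/profile itself recommends putting custom changes in a script under /etc/profile.d/ instead of editing the file. A minimal sketch of that alternative follows. The PROFILE_D variable and the file name hadoop.sh are assumptions made for illustration. The sketch writes to a scratch directory by default, and on a real node PROFILE_D would be /etc/profile.d.

```shell
#!/usr/bin/env bash
# Sketch: keep the Java and Hadoop exports in their own profile.d script.
# Assumption: PROFILE_D defaults to a scratch directory; the file name
# hadoop.sh is illustrative.

PROFILE_D="${PROFILE_D:-$(mktemp -d)}"

cat > "$PROFILE_D/hadoop.sh" <<'EOF'
# Java and Hadoop paths as used in this walkthrough
export JAVA_HOME=/opt/jdk1.7.0_75
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF

# Load the file into the current shell and show the result.
. "$PROFILE_D/hadoop.sh"
echo "JAVA_HOME=$JAVA_HOME"
echo "HADOOP_HOME=$HADOOP_HOME"
```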
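(4) Install Hadoop 2.6
1) Download Hadoop 2.6
Hadoop 2.6 can be downloaded from http://www.apache.org/dyn/closer.cgi/hadoop/common; pick a mirror on that page and download the Hadoop 2.6 release from it.
2) Extract the archive as the hadoop user and move it to the hadoop user's home directory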
[hadoop@hadoop1 sources]$ ls
apache-ant-1.9.5-bin.tar.gz hadoop-2.6.0-src.tar.gz protobuf-2.5.0.tar.gz
apache-maven-3.3.3-bin.tar.gz hadoop-2.6.0.tar.gz
hadoop-2.6.0-src jdk-7u75-linux-x64.tar.gz
[hadoop@hadoop1 sources]$ tar -zxf hadoop-2.6.0.tar.gz
[hadoop@hadoop1 sources]$ ls
apache-ant-1.9.5-bin.tar.gz hadoop-2.6.0-src jdk-7u75-linux-x64.tar.gz
apache-maven-3.3.3-bin.tar.gz hadoop-2.6.0-src.tar.gz protobuf-2.5.0.tar.gz
hadoop-2.6.0 hadoop-2.6.0.tar.gz
[hadoop@hadoop1 sources]$ pwd
/home/hadoop/sources
[hadoop@hadoop1 sources]$ mv hadoop-2.6.0 ../
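3) Configure the Hadoop 2.6 environment
Configuration is the core of a Hadoop installation. All configuration files are in the /home/hadoop/hadoop-2.6.0/etc/hadoop directory.
(A) Set the Java environment variable in hadoop-env.sh and yarn-env.sh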
hadoop-env.sh
# The java implementation to use.
export JAVA_HOME=/opt/jdk1.7.0_75
yarn-env.sh
# some Java parameters
export JAVA_HOME=/opt/jdk1.7.0_75
(B) core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.168.149.129:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>4096</value>
</property>
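fs.defaultFS: equivalent to the fs.default.name property in Hadoop 1.2.1; it specifies the entry point of HDFS (the NameNode URI).
io.file.buffer.size: the buffer size used for file reads and writes. A larger buffer can speed up I/O at the cost of more memory. It is usually set to a multiple of the file system page size (4 KB).
For the full list of core-site.xml settings, see http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/core-default.xml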
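The two properties above are shown without the enclosing configuration element that the other configuration files in this walkthrough use. As a minimal sketch, the complete file could be written like this. The CONF_DIR variable is an assumption made for illustration. It defaults to a scratch directory, and on the cluster it would be /home/hadoop/hadoop-2.6.0/etc/hadoop.

```shell
#!/usr/bin/env bash
# Sketch: write a complete core-site.xml containing the two properties above.
# Assumption: CONF_DIR defaults to a scratch directory so nothing is overwritten.

CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.149.129:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>4096</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/core-site.xml"
```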
(C) hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoop-2.6.0/data/hdfs/namenode</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>192.168.149.129:50090</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoop-2.6.0/data/hdfs/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
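dfs.namenode.name.dir: where the NameNode stores file metadata, the file system tree image and the edits files.
dfs.namenode.secondary.http-address: the HTTP address of the SecondaryNameNode.
dfs.datanode.data.dir: where each DataNode stores its data blocks.
dfs.replication: the number of copies kept of each file block in the cluster. It is set to 1 here, so no redundant copies are kept.
For the full list of hdfs-site.xml settings, see http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml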
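The two storage directories named in hdfs-site.xml live under the Hadoop installation directory. Hadoop normally creates them when the NameNode is formatted and the DataNodes first start, so the following is only a sketch of creating them up front to make ownership and permissions explicit. The HADOOP_BASE variable and the prepare_hdfs_dirs function are hypothetical names. The default is a scratch directory, and on the cluster HADOOP_BASE would be /home/hadoop/hadoop-2.6.0.

```shell
#!/usr/bin/env bash
# Sketch: create the local directories named in hdfs-site.xml before first start.
# Assumption: HADOOP_BASE defaults to a scratch directory.

HADOOP_BASE="${HADOOP_BASE:-$(mktemp -d)}"

prepare_hdfs_dirs() {
    local base="$1"
    # dfs.namenode.name.dir (used on the NameNode node)
    mkdir -p "$base/data/hdfs/namenode"
    # dfs.datanode.data.dir (used on each DataNode node)
    mkdir -p "$base/data/hdfs/datanode"
    # Restrict access to the owning user.
    chmod 700 "$base/data/hdfs/namenode" "$base/data/hdfs/datanode"
}

prepare_hdfs_dirs "$HADOOP_BASE"
ls -ld "$HADOOP_BASE"/data/hdfs/*
```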
(D) mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
mapreduce.framework.name: tells MapReduce to run on the YARN framework; the default value is local.
For the full list of mapred-site.xml settings, see http://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
(E) yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
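yarn.nodemanager.aux-services: earlier releases used the value mapreduce.shuffle for this property. From Hadoop 2.2 onward the cluster fails to start with that spelling; it only starts normally when the value is mapreduce_shuffle.
For the full list of yarn-site.xml settings, see http://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml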
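The yarn-site.xml above does not say on which host the ResourceManager runs. The sketch below adds the standard yarn.resourcemanager.hostname property, assuming the ResourceManager is meant to live on hadoop2, the node where start-yarn.sh is run later in this walkthrough. Adding this property is an assumption that goes beyond the original configuration. Its default is 0.0.0.0, so NodeManagers on other hosts may not be able to find the ResourceManager without it. CONF_DIR is a hypothetical variable that defaults to a scratch directory.

```shell
#!/usr/bin/env bash
# Sketch: yarn-site.xml with an explicit ResourceManager host.
# Assumption: hadoop2 hosts the ResourceManager; CONF_DIR is a scratch
# directory by default (on the cluster: /home/hadoop/hadoop-2.6.0/etc/hadoop).

CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/yarn-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Assumption: not part of the original configuration -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop2</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/yarn-site.xml"
```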
(F) slaves
[hadoop@hadoop1 hadoop]$ cat slaves
192.168.149.131
192.168.149.132
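This file lists the IP addresses of the DataNode nodes.
Once the configuration is complete, copy the hadoop-2.6.0 directory to each of the other nodes (hadoop2 is shown here; repeat for hadoop3 and hadoop4):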
scp -r hadoop-2.6.0/ hadoop@hadoop2:/home/hadoop/
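Since the directory has to reach every node, the copy can be wrapped in a loop. This is a minimal sketch, assuming it is run from /home/hadoop on hadoop1 with passwordless SSH already in place. The run and sync_to_nodes helpers and the DRY_RUN switch are hypothetical names. With the default DRY_RUN=1 the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch: copy the configured hadoop-2.6.0 directory to the other three nodes.
# Assumption: run from /home/hadoop on hadoop1; DRY_RUN=1 (default) only prints.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

sync_to_nodes() {
    local node
    for node in "$@"; do
        run scp -r hadoop-2.6.0/ "hadoop@$node:/home/hadoop/"
    done
}

sync_to_nodes hadoop2 hadoop3 hadoop4
```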
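(5) Format the Hadoop cluster
Before formatting, the firewall and SELinux must be turned off on all four nodes. They may already be off; to be safe, switch to the root user and run chkconfig iptables off, which keeps the firewall from starting at boot (service iptables stop stops a firewall that is already running). Then go to the hadoop-2.6.0 directory on the NameNode node (hadoop1) and run the following command: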
./bin/hdfs namenode -format
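Formatting is a one-time step. Formatting a NameNode directory that already holds metadata discards that metadata, and DataNodes that still carry blocks from the earlier format may then refuse to register. The sketch below shows one way to guard against an accidental second format, assuming a formatted directory contains a current/VERSION file. The NAME_DIR variable and the function names are hypothetical, and the sketch only prints what it would do.

```shell
#!/usr/bin/env bash
# Sketch: refuse to format a NameNode directory that already holds metadata.
# Assumption: a formatted directory contains current/VERSION; NAME_DIR matches
# dfs.namenode.name.dir from hdfs-site.xml. Function names are illustrative.

NAME_DIR="${NAME_DIR:-/home/hadoop/hadoop-2.6.0/data/hdfs/namenode}"

is_formatted() {
    [ -f "$1/current/VERSION" ]
}

format_if_needed() {
    local dir="$1"
    if is_formatted "$dir"; then
        echo "skip: $dir is already formatted"
    else
        echo "format: run the namenode format command on the NameNode node"
    fi
}

format_if_needed "$NAME_DIR"
```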
(6) Start the Hadoop 2.6 cluster
To make the cluster easier to use, add the Hadoop commands to the PATH environment variable:
[hadoop@hadoop1 ~]$ vim /etc/profile
#hadoop-2.6.0 install
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export PATH=$PATH:/home/hadoop/hadoop-2.6.0/bin:/home/hadoop/hadoop-2.6.0/sbin
Now start the Hadoop cluster.
First log in to the ResourceManager node and start the resource management daemons:
[hadoop@hadoop2 ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.6.0/logs/yarn-hadoop-resourcemanager-hadoop2.out
192.168.149.132: starting nodemanager, logging to /home/hadoop/hadoop-2.6.0/logs/yarn-hadoop-nodemanager-hadoop4.out
192.168.149.131: starting nodemanager, logging to /home/hadoop/hadoop-2.6.0/logs/yarn-hadoop-nodemanager-hadoop3.out
[hadoop@hadoop2 ~]$ jps
27413 Jps
Then log in to the NameNode node and start all daemons:
[hadoop@hadoop1 ~]$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
15/06/17 08:30:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop1]
hadoop1: starting namenode, logging to /home/hadoop/hadoop-2.6.0/logs/hadoop-hadoop-namenode-hadoop1.out
192.168.149.132: starting datanode, logging to /home/hadoop/hadoop-2.6.0/logs/hadoop-hadoop-datanode-hadoop4.out
192.168.149.131: starting datanode, logging to /home/hadoop/hadoop-2.6.0/logs/hadoop-hadoop-datanode-hadoop3.out
Starting secondary namenodes [hadoop1]
hadoop1: starting secondarynamenode, logging to /home/hadoop/hadoop-2.6.0/logs/hadoop-hadoop-secondarynamenode-hadoop1.out
15/06/17 08:31:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.6.0/logs/yarn-hadoop-resourcemanager-hadoop1.out
192.168.149.132: nodemanager running as process 24754. Stop it first.
192.168.149.131: nodemanager running as process 27169. Stop it first.
[hadoop@hadoop1 ~]$ jps
8922 NameNode
9242 ResourceManager
9498 Jps
9080 SecondaryNameNode
Processes on a DataNode node:
[hadoop@hadoop3 ~]$ jps
27460 Jps
27169 NodeManager
27329 DataNode
[hadoop@hadoop3 ~]$
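Reading jps listings by eye on four nodes is error-prone, so the check can be scripted. This is a minimal sketch. The check_daemons function is a hypothetical helper that takes jps output as text and reports any expected daemon that is missing. It compares the whole process name, so NameNode is not confused with SecondaryNameNode.

```shell
#!/usr/bin/env bash
# Sketch: check jps output for the daemons a node is expected to run.
# The function name check_daemons is illustrative.

check_daemons() {
    local jps_output="$1"; shift
    local daemon missing=0
    for daemon in "$@"; do
        # jps prints "<pid> <name>"; compare the name as a whole second field.
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "missing: $daemon"
            missing=1
        fi
    done
    return "$missing"
}

# Sample taken from the hadoop3 listing above; on a live node use "$(jps)".
sample='27460 Jps
27169 NodeManager
27329 DataNode'

check_daemons "$sample" NodeManager DataNode && echo "hadoop3 looks complete"
```

On hadoop1 the expected list would be NameNode and SecondaryNameNode, and on the ResourceManager node it would be ResourceManager.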
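(7) Summary
Hadoop 2.x improves considerably on the shortcomings of Hadoop 1.x, with major changes to both HDFS and the MapReduce framework, and it integrates with mainstream big data frameworks such as Spark.
(8) Correction
In the layout used here the ResourceManager process is meant to run on a node of its own. After start-yarn.sh has been run on the ResourceManager node, the command to run on the NameNode node is therefore not start-all.sh, because start-all.sh also starts a ResourceManager process on the NameNode node (it appears in the jps output on hadoop1 above). Use start-dfs.sh instead: it starts the NameNode and the DataNodes without starting a ResourceManager on the NameNode node.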
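The corrected startup can be sketched as a single script. It assumes passwordless SSH from the machine running it, hadoop1 as the NameNode node and hadoop2 as the ResourceManager node, as in the transcripts above. The run and start_cluster helpers and the DRY_RUN switch are hypothetical names, and starting HDFS before YARN is a choice made for this sketch. With the default DRY_RUN=1 the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch: start HDFS on the NameNode node and YARN on the ResourceManager node.
# Assumption: hadoop1 = NameNode, hadoop2 = ResourceManager; DRY_RUN=1 only prints.

DRY_RUN="${DRY_RUN:-1}"
HADOOP_SBIN=/home/hadoop/hadoop-2.6.0/sbin

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

start_cluster() {
    # HDFS: NameNode and SecondaryNameNode on hadoop1, DataNodes from slaves.
    run ssh hadoop@hadoop1 "$HADOOP_SBIN/start-dfs.sh"
    # YARN: ResourceManager on hadoop2, NodeManagers from slaves.
    run ssh hadoop@hadoop2 "$HADOOP_SBIN/start-yarn.sh"
}

start_cluster
```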