1 Introduction

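Hadoop 2.x, the new generation of the Hadoop framework from the Apache Software Foundation, was released to address problems in Hadoop 1.x such as the single NameNode. In this series several HDFS mechanisms were improved, the MapReduce framework was upgraded to YARN (MapReduce 2), and integration with currently popular big-data analysis frameworks such as Spark became possible. The Hadoop 2.x series will be covered in detail later.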

2 Installing Hadoop 2.6

Hadoop 2.6 needs the same environment as Hadoop 1.2.1, so the preparation steps are only summarized here.

(1)Install the sshd service and set up passwordless login between the nodes.

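In Hadoop 2.x the two responsibilities of the JobTracker, job scheduling and resource management, have been separated and can run on different nodes. Passwordless login to all nodes therefore has to be set up from both the node that runs the NameNode and the node that runs the ResourceManager.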
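The passwordless login described above could be set up roughly as follows. This is a minimal sketch rather than the author's exact procedure: the node names come from the hosts file in step (2), the hadoop user name from the shell prompts in this article, and `DRY_RUN`, `run` and `setup_passwordless_ssh` are hypothetical names introduced for illustration. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: passwordless SSH from this node to every cluster node.
# NODES and the hadoop user are assumptions based on this article's cluster.
NODES="hadoop1 hadoop2 hadoop3 hadoop4"

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_passwordless_ssh() {
    # Create an RSA key pair without a passphrase unless one already exists.
    if [ ! -f "$HOME/.ssh/id_rsa" ]; then
        run ssh-keygen -t rsa -N "" -f "$HOME/.ssh/id_rsa"
    fi
    # Append the public key to authorized_keys on each node.
    for node in $NODES; do
        run ssh-copy-id "hadoop@$node"
    done
}
```

Running `DRY_RUN=0 setup_passwordless_ssh` once on the NameNode node and once on the ResourceManager node would cover both cases mentioned above.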

(2)Configure the hosts file

This cluster has only four nodes. Their IP addresses and host names are:


192.168.149.129	hadoop1
192.168.149.130	hadoop2
192.168.149.131	hadoop3
192.168.149.132	hadoop4
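To confirm that a node's hosts file really contains these four entries, a check along the following lines may help. It is only a sketch: `expected_hosts` and `missing_hosts` are hypothetical helper names, and the file defaults to /etc/hosts.

```shell
#!/usr/bin/env bash
# The IP/host name pairs listed in this article.
expected_hosts() {
    cat <<'EOF'
192.168.149.129 hadoop1
192.168.149.130 hadoop2
192.168.149.131 hadoop3
192.168.149.132 hadoop4
EOF
}

# Print every expected "IP name" pair that is absent from the hosts file.
missing_hosts() {
    hosts_file="${1:-/etc/hosts}"
    expected_hosts | while read -r ip name; do
        # A pair is present if some line starts with the IP and lists the name.
        if ! awk -v ip="$ip" -v name="$name" '
                $1 == ip { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
                END { exit !found }
            ' "$hosts_file"; then
            echo "$ip $name"
        fi
    done
}
```

Calling `missing_hosts` on each node prints nothing when all four entries are present.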

(3)Install Java 1.7

Installing Java 1.7 was explained in detail in the Hadoop 1.2.1 installation, so only the resulting configuration is shown here.


[hadoop@hadoop1 etc]$ ls /opt/
apache-ant-1.9.5  apache-maven-3.3.3  jdk1.7.0_75  protobuf  protobuf-2.5.0  rh
[hadoop@hadoop1 etc]$ cat /etc/profile
# /etc/profile

# System wide environment and startup programs, for login setup
# Functions and aliases go in /etc/bashrc

# It's NOT a good idea to change this file unless you know what you
# are doing. It's much better to create a custom.sh shell script in
# /etc/profile.d/ to make custom changes to your environment, as this
# will prevent the need for merging in future updates.

pathmunge () {
    case ":${PATH}:" in
        *:"$1":*)
            ;;
        *)
            if [ "$2" = "after" ] ; then
                PATH=$PATH:$1
            else
                PATH=$1:$PATH
            fi
    esac
}


if [ -x /usr/bin/id ]; then
    if [ -z "$EUID" ]; then
        # ksh workaround
        EUID=`id -u`
        UID=`id -ru`
    fi
    USER="`id -un`"
    LOGNAME=$USER
    MAIL="/var/spool/mail/$USER"
fi

# Path manipulation
if [ "$EUID" = "0" ]; then
    pathmunge /sbin
    pathmunge /usr/sbin
    pathmunge /usr/local/sbin
else
    pathmunge /usr/local/sbin after
    pathmunge /usr/sbin after
    pathmunge /sbin after
fi

HOSTNAME=`/bin/hostname 2>/dev/null`
HISTSIZE=1000
if [ "$HISTCONTROL" = "ignorespace" ] ; then
    export HISTCONTROL=ignoreboth
else
    export HISTCONTROL=ignoredups
fi

export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL

# By default, we want umask to get set. This sets it for login shell
# Current threshold for system reserved uid/gids is 200
# You could check uidgid reservation validity in
# /usr/share/doc/setup-*/uidgid file
if [ $UID -gt 199 ] && [ "`id -gn`" = "`id -un`" ]; then
    umask 002
else
    umask 022
fi

for i in /etc/profile.d/*.sh ; do
    if [ -r "$i" ]; then
        if [ "${-#*i}" != "$-" ]; then
            . "$i"
        else
            . "$i" >/dev/null 2>&1
        fi
    fi
done

#Java Install
export JAVA_HOME=/opt/jdk1.7.0_75
export CLASSPATH=/opt/jdk1.7.0_75/lib/tools.jar:.:/opt/jdk1.7.0_75/lib/dt.jar
export PATH=$PATH:/opt/jdk1.7.0_75/jre/bin:/opt/jdk1.7.0_75/bin
#hadoop-2.6.0 install
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export PATH=$PATH:/home/hadoop/hadoop-2.6.0/bin:/home/hadoop/hadoop-2.6.0/sbin

#maven install
export MAVEN_HOME=/opt/apache-maven-3.3.3
export PATH=$PATH:/opt/apache-maven-3.3.3/bin

#ant install
export ANT_HOME=/opt/apache-ant-1.9.5
export PATH=$PATH:/opt/apache-ant-1.9.5/bin

#protobuf install
export PATH=$PATH:/opt/protobuf/bin

unset i
unset -f pathmunge

(4)Install Hadoop 2.6

1)Download Hadoop 2.6

Hadoop 2.6 can be downloaded from http://www.apache.org/dyn/closer.cgi/hadoop/common; pick a mirror on that page and download the Hadoop 2.6 release from it.

2)As the hadoop user, unpack the archive and move it to the hadoop user's home directory


[hadoop@hadoop1 sources]$ ls
apache-ant-1.9.5-bin.tar.gz    hadoop-2.6.0-src.tar.gz    protobuf-2.5.0.tar.gz
apache-maven-3.3.3-bin.tar.gz  hadoop-2.6.0.tar.gz
hadoop-2.6.0-src               jdk-7u75-linux-x64.tar.gz
[hadoop@hadoop1 sources]$ tar -zxf hadoop-2.6.0.tar.gz 
[hadoop@hadoop1 sources]$ ls
apache-ant-1.9.5-bin.tar.gz    hadoop-2.6.0-src         jdk-7u75-linux-x64.tar.gz
apache-maven-3.3.3-bin.tar.gz  hadoop-2.6.0-src.tar.gz  protobuf-2.5.0.tar.gz
hadoop-2.6.0                   hadoop-2.6.0.tar.gz
[hadoop@hadoop1 sources]$ pwd
/home/hadoop/sources
[hadoop@hadoop1 sources]$ mv hadoop-2.6.0 ../

3)Configure Hadoop 2.6

Configuration is the core of a Hadoop installation. All configuration files are in the /home/hadoop/hadoop-2.6.0/etc/hadoop directory.

(A)Set the Java environment variable in hadoop-env.sh and yarn-env.sh

hadoop-env.sh


# The java implementation to use.
export JAVA_HOME=/opt/jdk1.7.0_75

yarn-env.sh


# some Java parameters
export JAVA_HOME=/opt/jdk1.7.0_75



(B)core-site.xml


<property>
<name>fs.defaultFS</name>
<value>hdfs://192.168.149.129:9000</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>4096</value>
</property>

fs.defaultFS: equivalent to fs.default.name in Hadoop 1.2.1; it specifies the entry point of HDFS, that is, the NameNode URI.

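io.file.buffer.size: the size of the buffer used when reading files. A larger value can speed up file reads, but it needs correspondingly more memory. It is usually set to a multiple of the file system page size (4 KB).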
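The "multiple of the page size" rule is easy to check before editing core-site.xml. The helper below is a small sketch; `is_page_multiple` is a hypothetical name, and when no page size is passed it falls back to what `getconf PAGESIZE` reports on the current machine.

```shell
#!/usr/bin/env bash
# Succeeds if SIZE is a positive multiple of PAGE.
# PAGE defaults to the system page size reported by getconf.
is_page_multiple() {
    size="$1"
    page="${2:-$(getconf PAGESIZE)}"
    [ "$size" -gt 0 ] && [ $((size % page)) -eq 0 ]
}
```

For example, `is_page_multiple 131072 4096` succeeds because 131072 = 32 × 4096, while `is_page_multiple 5000 4096` fails.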

For all core-site.xml settings, see: http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/core-default.xml

(C)hdfs-site.xml


<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoop-2.6.0/data/hdfs/namenode</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>192.168.149.129:50090</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoop-2.6.0/data/hdfs/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

dfs.namenode.name.dir: where the NameNode stores the file metadata, the file system image (fsimage) and the edits files.

dfs.namenode.secondary.http-address: the HTTP address of the SecondaryNameNode.

dfs.datanode.data.dir: where each DataNode stores its data blocks.

dfs.replication: the number of replicas kept for each file in the cluster.

For all hdfs-site.xml settings, see: http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

(D)mapred-site.xml


<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

mapreduce.framework.name: runs MapReduce jobs on the YARN framework; the default value is local.

For all mapred-site.xml settings, see: http://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

(E)yarn-site.xml


<configuration>

<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

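yarn.nodemanager.aux-services: earlier Hadoop 2.x setups used the value mapreduce.shuffle here. From Hadoop 2.2 onward that spelling prevents the cluster from starting, because a service name may contain only letters, digits and underscores; it has to be written as mapreduce_shuffle.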

For all yarn-site.xml settings, see: http://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

(F)slaves


[hadoop@hadoop1 hadoop]$ cat slaves 
192.168.149.131
192.168.149.132

This file lists the IP addresses of the DataNode nodes, one per line.


Once the configuration is complete, copy the hadoop-2.6.0 directory to every other node, for example to hadoop2:


scp -r hadoop-2.6.0/ hadoop@hadoop2:/home/hadoop/
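The command above copies to hadoop2 only. Repeating it for the remaining nodes could look like the sketch below. The node list is an assumption based on the hosts file in this article, and `distribute_hadoop` and `DRY_RUN` are hypothetical names; by default the function only prints the scp commands.

```shell
#!/usr/bin/env bash
# Every node except hadoop1, where the configuration was edited (assumed).
WORKER_NODES="hadoop2 hadoop3 hadoop4"

# Copy the configured Hadoop directory to each node in WORKER_NODES.
# With DRY_RUN=1 (the default) the commands are printed instead of run.
distribute_hadoop() {
    src="${1:-$HOME/hadoop-2.6.0}"
    for node in $WORKER_NODES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "scp -r $src hadoop@$node:/home/hadoop/"
        else
            scp -r "$src" "hadoop@$node:/home/hadoop/"
        fi
    done
}
```

`DRY_RUN=0 distribute_hadoop` performs the actual copy and relies on the passwordless login from step (1).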



(5)Format the Hadoop cluster

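Before formatting, the firewall and SELinux must be disabled on all four nodes. In this setup they are already off by default; to be safe, switch to root and run chkconfig iptables off, which keeps the firewall from starting at boot (service iptables stop stops a firewall that is currently running). Then, on the NameNode node (hadoop1), go into the hadoop-2.6.0 directory and run the command below. The older form ./bin/hadoop namenode -format still works in Hadoop 2.6 but is deprecated in favor of the hdfs command.


./bin/hdfs namenode -format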

(6)Start the Hadoop 2.6 cluster

To make the cluster easier to use, add the Hadoop commands to the PATH environment variable:


[hadoop@hadoop1 ~]$ vim /etc/profile
#hadoop-2.6.0 install
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export PATH=$PATH:/home/hadoop/hadoop-2.6.0/bin:/home/hadoop/hadoop-2.6.0/sbin

Now start the Hadoop cluster.

First log in to the ResourceManager node and start the resource management daemons:


[hadoop@hadoop2 ~]$ start-yarn.sh 
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.6.0/logs/yarn-hadoop-resourcemanager-hadoop2.out
192.168.149.132: starting nodemanager, logging to /home/hadoop/hadoop-2.6.0/logs/yarn-hadoop-nodemanager-hadoop4.out
192.168.149.131: starting nodemanager, logging to /home/hadoop/hadoop-2.6.0/logs/yarn-hadoop-nodemanager-hadoop3.out
[hadoop@hadoop2 ~]$ jps
27413 Jps

Then log in to the NameNode node and start all daemons:


[hadoop@hadoop1 ~]$ start-all.sh 
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
15/06/17 08:30:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop1]
hadoop1: starting namenode, logging to /home/hadoop/hadoop-2.6.0/logs/hadoop-hadoop-namenode-hadoop1.out
192.168.149.132: starting datanode, logging to /home/hadoop/hadoop-2.6.0/logs/hadoop-hadoop-datanode-hadoop4.out
192.168.149.131: starting datanode, logging to /home/hadoop/hadoop-2.6.0/logs/hadoop-hadoop-datanode-hadoop3.out
Starting secondary namenodes [hadoop1]
hadoop1: starting secondarynamenode, logging to /home/hadoop/hadoop-2.6.0/logs/hadoop-hadoop-secondarynamenode-hadoop1.out
15/06/17 08:31:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.6.0/logs/yarn-hadoop-resourcemanager-hadoop1.out
192.168.149.132: nodemanager running as process 24754. Stop it first.
192.168.149.131: nodemanager running as process 27169. Stop it first.
[hadoop@hadoop1 ~]$ jps
8922 NameNode
9242 ResourceManager
9498 Jps
9080 SecondaryNameNode

Processes on a DataNode node:


[hadoop@hadoop3 ~]$ jps
27460 Jps
27169 NodeManager
27329 DataNode
[hadoop@hadoop3 ~]$
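Reading jps output by eye on every node gets tedious, so the check can be scripted. The sketch below reads jps output from standard input and verifies that each daemon named on the command line is listed; `has_daemons` is a hypothetical helper name. It compares the second column exactly, so NameNode does not match SecondaryNameNode.

```shell
#!/usr/bin/env bash
# Succeeds if every daemon named in the arguments appears in the jps
# output supplied on stdin; otherwise prints the first missing daemon.
has_daemons() {
    jps_output=$(cat)
    for daemon in "$@"; do
        # jps prints "<pid> <name>"; match the name column exactly.
        if ! printf '%s\n' "$jps_output" |
                awk -v d="$daemon" '$2 == d { ok = 1 } END { exit !ok }'; then
            echo "missing: $daemon"
            return 1
        fi
    done
}
```

On a DataNode node this would be used as `jps | has_daemons DataNode NodeManager`, and on the NameNode node as `jps | has_daemons NameNode SecondaryNameNode`.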

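(7)Summary

Hadoop 2.x makes major improvements over the shortcomings of Hadoop 1.x, with substantial changes to both HDFS and the MapReduce framework, and it integrates with current mainstream big-data frameworks such as Spark.


(8)Correction

In this cluster the ResourceManager is meant to run on a node of its own. After start-yarn.sh has been run on the ResourceManager node, the command to run on the NameNode node is therefore start-dfs.sh, not start-all.sh. start-all.sh also starts a ResourceManager on the NameNode node, which is visible in the jps output above, whereas start-dfs.sh starts only the NameNode, the SecondaryNameNode and the DataNodes. Note also that for the NodeManagers to register with a ResourceManager on another host, yarn-site.xml normally needs yarn.resourcemanager.hostname set to that host; the configuration shown above does not set it.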
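The corrected start-up, with start-dfs.sh on the NameNode node and start-yarn.sh on the ResourceManager node, can be sketched as a single script. The host roles (hadoop1 as NameNode, hadoop2 as ResourceManager) follow the transcripts in this article; `run_on`, `start_cluster` and `DRY_RUN` are hypothetical names, and starting HDFS first is a common choice rather than something the article requires. Full script paths are used because a non-interactive ssh session may not load the PATH set in /etc/profile.

```shell
#!/usr/bin/env bash
# Host roles and install path as used in this article (assumptions).
HADOOP_HOME="${HADOOP_HOME:-/home/hadoop/hadoop-2.6.0}"
NAMENODE_HOST="hadoop1"
RESOURCEMANAGER_HOST="hadoop2"

# Run a command on a host over ssh; with DRY_RUN=1 (the default) print it.
run_on() {
    host="$1"
    shift
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "ssh hadoop@$host $*"
    else
        ssh "hadoop@$host" "$@"
    fi
}

start_cluster() {
    # HDFS daemons: NameNode, SecondaryNameNode and the DataNodes.
    run_on "$NAMENODE_HOST" "$HADOOP_HOME/sbin/start-dfs.sh"
    # YARN daemons: ResourceManager and the NodeManagers.
    run_on "$RESOURCEMANAGER_HOST" "$HADOOP_HOME/sbin/start-yarn.sh"
}
```

With the default dry run, `start_cluster` just prints the two ssh commands, which makes it easy to review the order before running it with `DRY_RUN=0`.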