Hadoop Distributed File System

http://hadoop.apache.org/docs/r1.1.2/single_node_setup.html

Lab environment:

Master: desk11, 192.168.122.11

Datanodes: server90, 192.168.122.190

server233, 192.168.122.233

server73, 192.168.122.173 (reserved for the online node addition later)

Set up local name resolution on every host:

192.168.122.11 desk11.example.com desk11

192.168.122.190 server90.example.com server90

192.168.122.233 server233.example.com server233

192.168.122.173 server73.example.com server73
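A quick way to put these entries in place, as a sketch to run as root on every node (assuming /etc/hosts has no conflicting lines):

cat >> /etc/hosts << EOF
192.168.122.11 desk11.example.com desk11
192.168.122.190 server90.example.com server90
192.168.122.233 server233.example.com server233
192.168.122.173 server73.example.com server73
EOF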

1. Environment setup

Hadoop is a Java program, so it must run on a Java virtual machine (JDK).

Prepare the JDK: jdk-6u26-linux-x64.bin

sh jdk-6u26-linux-x64.bin

mv jdk1.6.0_26/ /usr/local/jdk

vim /etc/profile

Add:

export JAVA_HOME=/usr/local/jdk

export CLASSPATH=.:$JAVA_HOME/lib

export PATH=$PATH:$JAVA_HOME/bin

source /etc/profile

If the system already has a default Java (OpenJDK) installed, you need to update the alternatives:

# alternatives --install /usr/bin/java java /usr/local/jdk/bin/java 2

# alternatives --set java /usr/local/jdk/bin/java

# java -version

java version "1.6.0_32"

Java(TM) SE Runtime Environment (build 1.6.0_32-b05)

Java HotSpot(TM) 64-Bit Server VM (build 20.7-b02, mixed mode)

Check the location of the java command:

which java

If it returns /usr/local/jdk/bin/java, the location is correct.

The Java environment is ready!

2. Pseudo-distributed mode (Master, Datanode and all other roles on a single host)

Choose the Master host: desk11

yum -y install openssh rsync

useradd -u 600 hadoop  # Hadoop always runs as the hadoop user

echo hadoop | passwd --stdin hadoop

chown hadoop.hadoop /home/hadoop -R

The following operations are all performed as the hadoop user:

su - hadoop

Set up passwordless SSH authentication (SSH equivalence):

ssh-keygen  # press Enter at every prompt

ssh-copy-id localhost

The result:

[hadoop@desk11 ~]$ ssh localhost

Last login: Sat Aug 3 13:59:58 2013 from localhost
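If you prefer to script the key setup instead of pressing Enter through the prompts, a non-interactive sketch using standard OpenSSH options:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa    # empty passphrase, no prompts
ssh-copy-id -i ~/.ssh/id_rsa.pub localhost
ssh localhost true                          # must return without asking for a password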

Configure Hadoop:

cd hadoop-1.0.4/conf

vim core-site.xml

<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://desk11:9000</value>

</property>

</configuration>

vim hdfs-site.xml

<configuration>

<property>

<name>dfs.replication</name>

<value>1</value>  # a single replica (only one node)

</property>

</configuration>

vim mapred-site.xml

<configuration>

<property>

<name>mapred.job.tracker</name>

<value>desk11:9001</value>

</property>

</configuration>

[hadoop@desk11 conf]$ vim hadoop-env.sh

export JAVA_HOME=/usr/local/jdk

If the Java environment's location is not set here, Hadoop reports an error at startup complaining that JAVA_HOME is not set.

Format the namenode:

bin/hadoop namenode -format

On success, the output ends by reporting that the storage directory has been successfully formatted.

Start Hadoop:

bin/start-all.sh

Check the started processes with jps; on a pseudo-distributed node you should see NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker.

After startup we can browse the web interfaces (the NameNode UI on port 50070 and the JobTracker UI on port 50030).

Test:

bin/hadoop fs -put conf westos  # upload the contents of the local conf directory into HDFS under westos

bin/hadoop fs -ls  # list the contents of the file system

Copy files from the distributed file system to the local machine:

bin/hadoop fs -get westos test  # download the contents of westos from HDFS into a local test directory

The current directory now contains a test directory whose contents are exactly the conf configuration files that were uploaded.

Create a directory in the file system:

bin/hadoop fs -mkdir wangzi

bin/hadoop fs -ls

bin/hadoop fs -rmr wangzi  # remove a directory from HDFS

bin/hadoop fs -ls

bin/hadoop jar hadoop-examples-1.0.4.jar grep westos output 'dfs[a-z.]+'

# MapReduce test: search westos for strings matching dfs[a-z.]+ and store the results in output

Job progress:

13/08/03 13:38:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library

13/08/03 13:38:27 WARN snappy.LoadSnappy: Snappy native library not loaded

13/08/03 13:38:27 INFO mapred.FileInputFormat: Total input paths to process : 16

13/08/03 13:38:27 INFO mapred.JobClient: Running job: job_201308031321_0001

13/08/03 13:38:28 INFO mapred.JobClient: map 0% reduce 0%

13/08/03 13:39:28 INFO mapred.JobClient: map 6% reduce 0%

13/08/03 13:39:33 INFO mapred.JobClient: map 12% reduce 0%

Monitoring from the web UI:

You can see that a job has been submitted and is running.

When the computation finishes, you can see that Hadoop submitted the work as two jobs (the grep example runs a search job followed by a sort job).

[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop fs -ls

View the contents of output:
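The screenshot is omitted here; a standard way to print the result files with the fs shell (the output path matches the grep run above):

bin/hadoop fs -cat 'output/*'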

Shut down Hadoop:

[hadoop@desk11 hadoop-1.0.4]$ bin/stop-all.sh

[hadoop@desk11 hadoop-1.0.4]$ jps

28027 Jps

3. Fully distributed mode:

1) JDK environment:

On the two data nodes, server90 and server233:

sh jdk-6u26-linux-x64.bin

mv jdk1.6.0_26/ /usr/local/jdk

vim /etc/profile

Add:

export JAVA_HOME=/usr/local/jdk

export CLASSPATH=.:$JAVA_HOME/lib

export PATH=$PATH:$JAVA_HOME/bin

source /etc/profile

Check the location of the java command:

which java

If it returns /usr/local/jdk/bin/java, the location is correct.

The Java environment is ready!

2) sersync setup (configured as root):

The configuration on the Master node must match that on the other nodes, and SSH equivalence must be established between all nodes, i.e. an SSH connection between any two nodes must not prompt for a password.

For ease of deployment, sersync is used.

Required package: sersync2.5.4_64bit_binary_stable_final.tar.gz

On every node:

yum -y install rsync xinetd

On the Master:

tar zxf sersync2.5.4_64bit_binary_stable_final.tar.gz

[root@desk11 home]# ls

GNU-Linux-x86  hadoop

[root@desk11 home]# cd GNU-Linux-x86/

[root@desk11 GNU-Linux-x86]# ls

confxml.xml  sersync2

[root@desk11 GNU-Linux-x86]# vim confxml.xml

<sersync>

<localpath watch="/home/hadoop">  # the directory the sync server watches

<remote ip="192.168.122.190" name="rsync"/>  # name is the sync module name configured on the target server

<remote ip="192.168.122.233" name="rsync"/>

<!--<remote ip="192.168.8.40" name="tongbu"/>-->

</localpath>

On the target servers, i.e. the two data nodes:

useradd -u 600 hadoop  # Hadoop always runs as the hadoop user

echo hadoop | passwd --stdin hadoop

[root@server90 ~]# vim /etc/rsyncd.conf

uid = hadoop  # everything synced in will be owned by the hadoop user and group

gid = hadoop

max connections = 36000

use chroot = no

log file = /var/log/rsyncd.log

pid file = /var/run/rsyncd.pid

lock file = /var/run/rsyncd.lock

[rsync]  # must stay consistent with name="rsync" on the Master

path = /home/hadoop  # local directory to sync into

comment = test files

ignore errors = yes

read only = no

hosts allow = 192.168.122.11/24

hosts deny = *

[root@server90 ~]# /etc/init.d/xinetd restart

Stopping xinetd: [ OK ]

Starting xinetd: [ OK ]

rsync --daemon

The operations on the two data nodes are exactly the same.
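Before wiring up sersync, the rsync daemons can be checked from the Master; a sketch using plain rsync daemon syntax (the module name rsync matches the config above):

rsync 192.168.122.190::         # should list the exported module "rsync"
rsync 192.168.122.190::rsync/   # should list the target's /home/hadoop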

On the Master node:

[root@desk11 GNU-Linux-x86]# /etc/init.d/xinetd restart

Stopping xinetd: [FAILED]

Starting xinetd: [ OK ]

[root@desk11 GNU-Linux-x86]# ./sersync2 -r -d  # -r: do a full initial sync; -d: run as a daemon

sersync now runs in the background watching the watched directory for changes; whenever the data changes, the changes are synced to the other two nodes.
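A quick sanity check of the sync, as a sketch (the file name sersync-test is arbitrary):

[root@desk11 ~]# touch /home/hadoop/sersync-test
[root@server90 ~]# ls -l /home/hadoop/sersync-test   # should appear within seconds, owned by hadoop:hadoop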

3) Configuring the distributed file system:

Performed as the hadoop user.

On the Master node:

[hadoop@desk11 conf]$ vim masters  # specifies the Master

desk11  # make sure the name resolves

[hadoop@desk11 conf]$ vim slaves

server90

server233

[hadoop@desk11 conf]$ vim hdfs-site.xml

<configuration>

<property>

<name>dfs.replication</name>

<value>2</value>  # two replicas (two Datanodes)

</property>

</configuration>

Start it up:

[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop namenode -format

bin/start-all.sh

Note that the DataNode process is not started on the Master node; it runs on the two data nodes.

On the data nodes server90 and server233:

[hadoop@server90 conf]$ jps

22978 DataNode

23145 Jps

23071 TaskTracker

[hadoop@server233 ~]$ jps

23225 Jps

23150 TaskTracker

23055 DataNode

Test:

The web interface now shows two live nodes.

Run the test program:

[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop jar hadoop-examples-1.0.4.jar grep westos output 'dfs[a-z.]+'

13/08/05 01:41:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library

13/08/05 01:41:10 WARN snappy.LoadSnappy: Snappy native library not loaded

13/08/05 01:41:10 INFO mapred.FileInputFormat: Total input paths to process : 16

13/08/05 01:41:11 INFO mapred.JobClient: Running job: job_201308050135_0001

13/08/05 01:41:12 INFO mapred.JobClient: map 0% reduce 0%

13/08/05 01:41:39 INFO mapred.JobClient: map 12% reduce 0%

13/08/05 01:41:43 INFO mapred.JobClient: map 25% reduce 0%

13/08/05 01:42:05 INFO mapred.JobClient: map 31% reduce 0%

13/08/05 01:42:08 INFO mapred.JobClient: map 37% reduce 0%

13/08/05 01:42:17 INFO mapred.JobClient: map 43% reduce 0%

13/08/05 01:42:20 INFO mapred.JobClient: map 50% reduce 0%

Monitor the cluster state:

[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop dfsadmin -report

Configured Capacity: 6209044480 (5.78 GB)

Present Capacity: 3567787548 (3.32 GB)

DFS Remaining: 3567493120 (3.32 GB)

DFS Used: 294428 (287.53 KB)

DFS Used%: 0.01%

Under replicated blocks: 0

Blocks with corrupt replicas: 0

Missing blocks: 0

-------------------------------------------------

Datanodes available: 2 (2 total, 0 dead)

Name: 192.168.122.233:50010

Decommission Status : Normal

Configured Capacity: 3104522240 (2.89 GB)

DFS Used: 147214 (143.76 KB)

Non DFS Used: 1320521970 (1.23 GB)

DFS Remaining: 1783853056 (1.66 GB)

DFS Used%: 0%

DFS Remaining%: 57.46%

Last contact: Mon Aug 05 01:45:38 CST 2013

Name: 192.168.122.190:50010

Decommission Status : Normal

Configured Capacity: 3104522240 (2.89 GB)

DFS Used: 147214 (143.76 KB)

Non DFS Used: 1320734962 (1.23 GB)

DFS Remaining: 1783640064 (1.66 GB)

DFS Used%: 0%

DFS Remaining%: 57.45%

Last contact: Mon Aug 05 01:45:36 CST 2013

You can see that both Datanodes took part in the computation; the load is balanced.
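To see where the blocks actually landed, Hadoop 1.x also ships an fsck tool (not part of the original walkthrough):

bin/hadoop fsck / -files -blocks -locations   # lists each block and the datanodes holding its replicas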

4) Adding a Hadoop node online:

1. Install the JDK on the new node and create the same hadoop user, keeping the uid and other settings identical.

2. Add the new node's IP or hostname to the conf/slaves file.

3. Establish SSH equivalence between server73 and all the other nodes.

4. Sync all of the master's Hadoop data to the new node, keeping the paths identical (see the rsync sketch below).

5. Start the services on the new node:

[hadoop@server73 hadoop-1.0.4]$ bin/hadoop-daemon.sh start datanode

[hadoop@server73 hadoop-1.0.4]$ bin/hadoop-daemon.sh start tasktracker

[hadoop@server73 hadoop-1.0.4]$ jps

1926 DataNode

2024 TaskTracker

2092 Jps

The node count has gone up by one.
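For step 4 above, a minimal sketch of syncing the installation from the master over SSH (paths and host names as in this lab; adjust excludes as needed):

[hadoop@server73 ~]$ rsync -a desk11:/home/hadoop/hadoop-1.0.4/ /home/hadoop/hadoop-1.0.4/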

6. Rebalance the data:

On the Master node:

[hadoop@desk11 hadoop-1.0.4]$ bin/start-balancer.sh

starting balancer, logging to /home/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-balancer-desk11.example.com.out

1) If you do not rebalance, the cluster stores all new data on the new datanode, which lowers MapReduce efficiency.

2) You can set the balance threshold. The default is 10%; the lower the value, the more balanced the nodes, but the longer it takes.

[hadoop@desk11 hadoop-1.0.4]$ bin/start-balancer.sh -threshold 5

[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop jar hadoop-examples-1.0.4.jar grep westos test 'dfs[a-z.]+'

13/08/05 03:03:49 INFO util.NativeCodeLoader: Loaded the native-hadoop library

13/08/05 03:03:49 WARN snappy.LoadSnappy: Snappy native library not loaded

13/08/05 03:03:49 INFO mapred.FileInputFormat: Total input paths to process : 16

13/08/05 03:03:49 INFO mapred.JobClient: Running job: job_201308050135_0003

13/08/05 03:03:50 INFO mapred.JobClient: map 0% reduce 0%

[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop dfsadmin -report

Configured Capacity: 9313566720 (8.67 GB)

Present Capacity: 5379882933 (5.01 GB)

DFS Remaining: 5378859008 (5.01 GB)

DFS Used: 1023925 (999.93 KB)

DFS Used%: 0.02%

Under replicated blocks: 2

Blocks with corrupt replicas: 0

Missing blocks: 0

-------------------------------------------------

Datanodes available: 3 (3 total, 0 dead)

Name: 192.168.122.233:50010

Decommission Status : Normal

Configured Capacity: 3104522240 (2.89 GB)

DFS Used: 424147 (414.21 KB)

Non DFS Used: 1321731885 (1.23 GB)

DFS Remaining: 1782366208 (1.66 GB)

DFS Used%: 0.01%

DFS Remaining%: 57.41%

Last contact: Mon Aug 05 03:04:25 CST 2013

Name: 192.168.122.73:50010  # the newly added node

Decommission Status : Normal

Configured Capacity: 3104522240 (2.89 GB)

DFS Used: 195467 (190.89 KB)

Non DFS Used: 1290097781 (1.2 GB)

DFS Remaining: 1814228992 (1.69 GB)

DFS Used%: 0.01%

DFS Remaining%: 58.44%

Last contact: Mon Aug 05 03:04:24 CST 2013

Name: 192.168.122.190:50010

Decommission Status : Normal

Configured Capacity: 3104522240 (2.89 GB)

DFS Used: 404311 (394.83 KB)

Non DFS Used: 1321854121 (1.23 GB)

DFS Remaining: 1782263808 (1.66 GB)

DFS Used%: 0.01%

DFS Remaining%: 57.41%

Last contact: Mon Aug 05 03:04:23 CST 2013


5) Removing a datanode from Hadoop online:

[hadoop@desk11 conf]$ vim hdfs-site.xml  # dfs.hosts.exclude is an HDFS property read by the NameNode from hdfs-site.xml

Add:

<property>

<name>dfs.hosts.exclude</name>

<value>/home/hadoop/hadoop-1.0.4/conf/datanode-exclude</value>

</property>

Create the file /home/hadoop/hadoop-1.0.4/conf/datanode-exclude and list the hosts to remove in it, one per line:

[hadoop@desk11 conf]$ echo "server73" > \

/home/hadoop/hadoop-1.0.4/conf/datanode-exclude

Refresh the nodes online from the master:

[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop dfsadmin -refreshNodes

This operation migrates the node's data away in the background.
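The migration can be followed with the report command used earlier; in Hadoop 1.x the excluded node's status field reads "Decommission in progress" and then "Decommissioned" once it is safe to take the node offline:

[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop dfsadmin -report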

You can then see that one of the Datanodes has gone down (it has been decommissioned).

6) Removing a tasktracker node online:

On the master, edit conf/mapred-site.xml:

<property>

<name>mapred.hosts.exclude</name>

<value>/home/hadoop/hadoop-1.0.4/conf/tasktracker-exclude</value>

</property>

Create the file /home/hadoop/hadoop-1.0.4/conf/tasktracker-exclude:

touch /home/hadoop/hadoop-1.0.4/conf/tasktracker-exclude

vim /home/hadoop/hadoop-1.0.4/conf/tasktracker-exclude

server73 192.168.122.173

Refresh the nodes:

[hadoop@desk11 bin]$ ./hadoop mradmin -refreshNodes
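Once the refresh has taken effect, the daemons on the removed node can be stopped with the same helper script used to start them; a sketch (the original walkthrough ends at the refresh step):

[hadoop@server73 hadoop-1.0.4]$ bin/hadoop-daemon.sh stop tasktracker
[hadoop@server73 hadoop-1.0.4]$ bin/hadoop-daemon.sh stop datanode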