Hadoop Distributed File System
http://hadoop.apache.org/docs/r1.1.2/single_node_setup.html
Lab environment:
Master:   desk11    192.168.122.11
Datanode: server90  192.168.122.190
          server233 192.168.122.233
          server73  192.168.122.173 (used later for online node addition)
Set up local name resolution:
192.168.122.11   desk11.example.com    desk11
192.168.122.190  server90.example.com  server90
192.168.122.233  server233.example.com server233
192.168.122.173  server73.example.com  server73
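The same resolution entries must exist on every node. They can be generated with a short helper and appended to /etc/hosts as root (a sketch; print_hosts is a hypothetical helper, and the IP/hostname pairs are the ones used in this lab):

```shell
# print_hosts: read "IP shortname" pairs on stdin and print /etc/hosts
# entries in the form "IP  shortname.example.com  shortname".
DOMAIN=example.com
print_hosts() {
    while read -r ip host; do
        [ -z "$ip" ] && continue
        printf '%s\t%s.%s\t%s\n' "$ip" "$host" "$DOMAIN" "$host"
    done
}
print_hosts <<'EOF'
192.168.122.11 desk11
192.168.122.190 server90
192.168.122.233 server233
192.168.122.173 server73
EOF
```

Appending the output to /etc/hosts on each node keeps resolution identical everywhere.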
1. Environment setup
Hadoop is a Java program, so it must run on a Java virtual machine (JDK).
Prepare the JDK: jdk-6u26-linux-x64.bin
sh jdk-6u26-linux-x64.bin
mv jdk1.6.0_26/ /usr/local/jdk
vim /etc/profile
Add:
export JAVA_HOME=/usr/local/jdk
export CLASSPATH=.:$JAVA_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
If the system has the default Java (OpenJDK) installed, switch the alternatives over:
# alternatives --install /usr/bin/java java /usr/local/jdk/bin/java 2
# alternatives --set java /usr/local/jdk/bin/java
# java -version
java version "1.6.0_32"
Java(TM) SE Runtime Environment (build 1.6.0_32-b05)
Java HotSpot(TM) 64-Bit Server VM (build 20.7-b02, mixed mode)
Check where the java command resolves:
which java
If it prints /usr/local/jdk/bin/java, the path is correct
and the Java environment is ready.
1. Pseudo-distributed mode (Master, Datanode, and all other roles on a single host)
Use the Master host: desk11
yum -y install openssh rsync
useradd -u 600 hadoop    # Hadoop runs as the hadoop user
echo hadoop | passwd --stdin hadoop
chown hadoop.hadoop /home/hadoop -R
All of the following is done as the hadoop user:
su - hadoop
Set up passwordless ssh (ssh equivalence):
ssh-keygen    # press Enter through every prompt
ssh-copy-id localhost
The result should look like this:
[hadoop@desk11 ~]$ ssh localhost
Last login: Sat Aug  3 13:59:58 2013 from localhost
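The key setup above can also be made non-interactive, which helps when repeating it on several nodes (a sketch; gen_key is a hypothetical helper, and -N "" gives the empty passphrase that pressing Enter through the prompts would):

```shell
# gen_key DIR -- create an RSA key pair with an empty passphrase in DIR,
#                unless one already exists there.
gen_key() {
    mkdir -p "$1" && chmod 700 "$1"
    [ -f "$1/id_rsa" ] || ssh-keygen -q -t rsa -N "" -f "$1/id_rsa"
}
# On a real node, as the hadoop user:
#   gen_key "$HOME/.ssh"
#   ssh-copy-id localhost    # later, repeat for every other node
```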
Configure Hadoop:
cd hadoop-1.0.4/conf
vim core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://desk11:9000</value>
  </property>
</configuration>
vim hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>    <!-- a single node, so one replica -->
  </property>
</configuration>
vim mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>desk11:9001</value>
  </property>
</configuration>
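All three files share the same <property> shape, so the entries can also be generated from the shell (a sketch; hadoop_property is a hypothetical helper, not part of Hadoop):

```shell
# hadoop_property NAME VALUE -- print one Hadoop <property> block.
hadoop_property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}
# Example: emit the body of core-site.xml from the values above.
echo '<configuration>'
hadoop_property fs.default.name hdfs://desk11:9000
echo '</configuration>'
```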
[hadoop@desk11 conf]$ vim hadoop-env.sh
export JAVA_HOME=/usr/local/jdk
If the Java location is not set here, Hadoop reports an error at startup.
Format the namenode:
bin/hadoop namenode -format
Start Hadoop:
bin/start-all.sh
Check the running processes: jps
Once started, the web interface can also be accessed.
Testing:
bin/hadoop fs -put conf westos    # upload the local conf directory into HDFS as the westos directory
bin/hadoop fs -ls                 # list the contents of the filesystem
Copy files from the distributed filesystem to the local machine:
bin/hadoop fs -get westos test    # download westos from HDFS into a local test directory
A test directory now exists in the current directory, containing exactly the configuration files that were uploaded from conf.
bin/hadoop fs -mkdir wangzi
bin/hadoop fs -ls
bin/hadoop fs -rmr wangzi         # remove a directory on HDFS
bin/hadoop fs -ls
bin/hadoop jar hadoop-examples-1.0.4.jar grep westos output 'dfs[a-z.]+'
# computation test: search westos for strings matching dfs[a-z.]+ and store the result in output
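For intuition, the example job is roughly equivalent to this local pipeline: extract every match of the pattern, then count occurrences, most frequent first (a sketch over local files with a hypothetical count_matches helper; it mimics, but does not replace, the MapReduce job):

```shell
# count_matches FILE... -- extract every match of dfs[a-z.]+ from the
# given files and count occurrences, most frequent first.
count_matches() {
    grep -ohE 'dfs[a-z.]+' "$@" | sort | uniq -c | sort -rn
}
# e.g. count_matches conf/*.xml
```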
Job progress:
13/08/03 13:38:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/08/03 13:38:27 WARN snappy.LoadSnappy: Snappy native library not loaded
13/08/03 13:38:27 INFO mapred.FileInputFormat: Total input paths to process : 16
13/08/03 13:38:27 INFO mapred.JobClient: Running job: job_201308031321_0001
13/08/03 13:38:28 INFO mapred.JobClient:  map 0% reduce 0%
13/08/03 13:39:28 INFO mapred.JobClient:  map 6% reduce 0%
13/08/03 13:39:33 INFO mapred.JobClient:  map 12% reduce 0%
Web monitoring:
The web UI shows that a job has been submitted and is running.
It also shows that Hadoop submits this task as two jobs.
[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop fs -ls
View the contents of output.
Shut down Hadoop:
[hadoop@desk11 hadoop-1.0.4]$ bin/stop-all.sh
[hadoop@desk11 hadoop-1.0.4]$ jps
28027 Jps
2. Fully distributed mode:
1) JDK environment:
On both data nodes, server90 and server233:
sh jdk-6u26-linux-x64.bin
mv jdk1.6.0_26/ /usr/local/jdk
vim /etc/profile
Add:
export JAVA_HOME=/usr/local/jdk
export CLASSPATH=.:$JAVA_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
Check where the java command resolves:
which java
If it prints /usr/local/jdk/bin/java, the path is correct
and the Java environment is ready.
2) sersync setup (done as root):
The Master node's configuration must be identical to that of the other nodes, and ssh equivalence must exist between the nodes, i.e. an ssh connection between any two nodes must not ask for a password.
For convenient deployment, sersync is used.
Required package: sersync2.5_64bit_binary_stable_final.tar.gz
On every node:
yum -y install rsync xinetd
On the Master:
tar zxf sersync2.5.4_64bit_binary_stable_final.tar.gz
[root@desk11 home]# ls
GNU-Linux-x86  hadoop
[root@desk11 home]# cd GNU-Linux-x86/
[root@desk11 GNU-Linux-x86]# ls
confxml.xml  sersync2
[root@desk11 GNU-Linux-x86]# vim confxml.xml
<sersync>
  <localpath watch="/home/hadoop">    <!-- directory to synchronize from -->
    <remote ip="192.168.122.190" name="rsync"/>    <!-- name must match the sync module defined on the target server -->
    <remote ip="192.168.122.233" name="rsync"/>
    <!--<remote ip="192.168.8.40" name="tongbu"/>-->
  </localpath>
On the target servers, i.e. on both data nodes:
useradd -u 600 hadoop    # Hadoop runs as the hadoop user
echo hadoop | passwd --stdin hadoop
[root@server90 ~]# vim /etc/rsyncd.conf
uid = hadoop    # everything synchronized over is owned by user and group hadoop
gid = hadoop
max connections = 36000
use chroot = no
log file = /var/log/rsyncd.log
pid file = /var/run/rsyncd.pid
lock file = /var/run/rsyncd.lock
[rsync]    # must match name="rsync" on the Master
path = /home/hadoop    # local directory to synchronize into
comment = test files
ignore errors = yes
read only = no
hosts allow = 192.168.122.11/24
hosts deny = *
[root@server90 ~]# /etc/init.d/xinetd restart
Stopping xinetd: [  OK  ]
Starting xinetd: [  OK  ]
rsync --daemon
The steps on the two data nodes are exactly the same.
On the Master node:
[root@desk11 GNU-Linux-x86]# /etc/init.d/xinetd restart
Stopping xinetd: [FAILED]
Starting xinetd: [  OK  ]
[root@desk11 GNU-Linux-x86]# ./sersync2 -r -d    # -r: full initial sync; -d: run as a daemon
sersync then runs in the background and watches the source directory; whenever the data changes, the changes are synchronized to the other two nodes.
3) Configuring the distributed filesystem:
Done as the hadoop user.
On the Master node:
[hadoop@desk11 conf]$ vim masters    # designate the Master
desk11    # make sure the name resolves
[hadoop@desk11 conf]$ vim slaves
server90
server233
[hadoop@desk11 conf]$ vim hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>    <!-- two Datanodes, so two replicas -->
  </property>
</configuration>
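The conf directory must be identical on every node. Besides sersync, it can be pushed explicitly to each host listed in slaves; a sketch (push_conf is a hypothetical helper that prints the commands so they can be reviewed first; running them relies on the ssh equivalence set up earlier):

```shell
# push_conf SLAVES_FILE SRC DEST -- print one rsync command per slave;
# pipe the output to sh to actually run the transfers.
push_conf() {
    while read -r host; do
        [ -z "$host" ] && continue
        echo rsync -a "$2" "$host:$3"
    done < "$1"
}
# Real run:
#   push_conf conf/slaves /home/hadoop/hadoop-1.0.4/conf/ \
#             /home/hadoop/hadoop-1.0.4/conf/ | sh
```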
Start:
[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop namenode -format
bin/start-all.sh
Note that the DataNode process does not start on the Master node; it starts on the two data nodes instead.
On the data nodes server90 and server233:
[hadoop@server90 conf]$ jps
22978 DataNode
23145 Jps
23071 TaskTracker
[hadoop@server233 ~]$ jps
23225 Jps
23150 TaskTracker
23055 DataNode
Testing:
The node count has become two.
Run the test program:
[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop jar hadoop-examples-1.0.4.jar grep westos output 'dfs[a-z.]+'
13/08/05 01:41:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/08/05 01:41:10 WARN snappy.LoadSnappy: Snappy native library not loaded
13/08/05 01:41:10 INFO mapred.FileInputFormat: Total input paths to process : 16
13/08/05 01:41:11 INFO mapred.JobClient: Running job: job_201308050135_0001
13/08/05 01:41:12 INFO mapred.JobClient:  map 0% reduce 0%
13/08/05 01:41:39 INFO mapred.JobClient:  map 12% reduce 0%
13/08/05 01:41:43 INFO mapred.JobClient:  map 25% reduce 0%
13/08/05 01:42:05 INFO mapred.JobClient:  map 31% reduce 0%
13/08/05 01:42:08 INFO mapred.JobClient:  map 37% reduce 0%
13/08/05 01:42:17 INFO mapred.JobClient:  map 43% reduce 0%
13/08/05 01:42:20 INFO mapred.JobClient:  map 50% reduce 0%
Monitor the cluster status:
[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop dfsadmin -report
Configured Capacity: 6209044480 (5.78 GB)
Present Capacity: 3567787548 (3.32 GB)
DFS Remaining: 3567493120 (3.32 GB)
DFS Used: 294428 (287.53 KB)
DFS Used%: 0.01%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)

Name: 192.168.122.233:50010
Decommission Status : Normal
Configured Capacity: 3104522240 (2.89 GB)
DFS Used: 147214 (143.76 KB)
Non DFS Used: 1320521970 (1.23 GB)
DFS Remaining: 1783853056 (1.66 GB)
DFS Used%: 0%
DFS Remaining%: 57.46%
Last contact: Mon Aug 05 01:45:38 CST 2013

Name: 192.168.122.190:50010
Decommission Status : Normal
Configured Capacity: 3104522240 (2.89 GB)
DFS Used: 147214 (143.76 KB)
Non DFS Used: 1320734962 (1.23 GB)
DFS Remaining: 1783640064 (1.66 GB)
DFS Used%: 0%
DFS Remaining%: 57.45%
Last contact: Mon Aug 05 01:45:36 CST 2013

Both Datanodes take part in the computation, so the load is balanced across them.
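The report is easy to consume from a script. A sketch that extracts the live-node count from the report format shown above (live_datanodes is a hypothetical helper, not a Hadoop command):

```shell
# live_datanodes -- read `hadoop dfsadmin -report` on stdin and print
# the number of available datanodes.
live_datanodes() {
    sed -n 's/^Datanodes available: \([0-9]*\).*/\1/p'
}
# Usage: bin/hadoop dfsadmin -report | live_datanodes
```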
4) Adding a node to Hadoop online:
1. Install the JDK on the new node and create the same hadoop user, keeping the uid etc. consistent
2. Add the new node's IP or domain name to the conf/slaves file
3. Establish ssh equivalence between server73 and every existing node
4. Synchronize all of the Master's Hadoop data to the new node, keeping the paths identical
5. Start the services on the new node:
[hadoop@server73 hadoop-1.0.4]$ bin/hadoop-daemon.sh start datanode
[hadoop@server73 hadoop-1.0.4]$ bin/hadoop-daemon.sh start tasktracker
[hadoop@server73 hadoop-1.0.4]$ jps
1926 DataNode
2024 TaskTracker
2092 Jps
The node count has increased by one.
6. Rebalance the data:
On the Master node:
[hadoop@desk11 hadoop-1.0.4]$ bin/start-balancer.sh
starting balancer, logging to /home/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-balancer-desk11.example.com.out
1) Without rebalancing, the cluster puts all new data on the new datanode, which lowers MapReduce efficiency.
2) The balancing threshold can be set; the default is 10%. A lower value makes the nodes more evenly balanced, but balancing takes longer.
[hadoop@desk11 hadoop-1.0.4]$ bin/start-balancer.sh -threshold 5
[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop jar hadoop-examples-1.0.4.jar grep westos test 'dfs[a-z.]+'
13/08/05 03:03:49 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/08/05 03:03:49 WARN snappy.LoadSnappy: Snappy native library not loaded
13/08/05 03:03:49 INFO mapred.FileInputFormat: Total input paths to process : 16
13/08/05 03:03:49 INFO mapred.JobClient: Running job: job_201308050135_0003
13/08/05 03:03:50 INFO mapred.JobClient:  map 0% reduce 0%
[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop dfsadmin -report
Configured Capacity: 9313566720 (8.67 GB)
Present Capacity: 5379882933 (5.01 GB)
DFS Remaining: 5378859008 (5.01 GB)
DFS Used: 1023925 (999.93 KB)
DFS Used%: 0.02%
Under replicated blocks: 2
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 3 (3 total, 0 dead)

Name: 192.168.122.233:50010
Decommission Status : Normal
Configured Capacity: 3104522240 (2.89 GB)
DFS Used: 424147 (414.21 KB)
Non DFS Used: 1321731885 (1.23 GB)
DFS Remaining: 1782366208 (1.66 GB)
DFS Used%: 0.01%
DFS Remaining%: 57.41%
Last contact: Mon Aug 05 03:04:25 CST 2013

Name: 192.168.122.73:50010    # the newly added node
Decommission Status : Normal
Configured Capacity: 3104522240 (2.89 GB)
DFS Used: 195467 (190.89 KB)
Non DFS Used: 1290097781 (1.2 GB)
DFS Remaining: 1814228992 (1.69 GB)
DFS Used%: 0.01%
DFS Remaining%: 58.44%
Last contact: Mon Aug 05 03:04:24 CST 2013

Name: 192.168.122.190:50010
Decommission Status : Normal
Configured Capacity: 3104522240 (2.89 GB)
DFS Used: 404311 (394.83 KB)
Non DFS Used: 1321854121 (1.23 GB)
DFS Remaining: 1782263808 (1.66 GB)
DFS Used%: 0.01%
DFS Remaining%: 57.41%
Last contact: Mon Aug 05 03:04:23 CST 2013
5) Removing a datanode from Hadoop online
[hadoop@desk11 conf]$ vim hdfs-site.xml    # dfs.hosts.exclude is an HDFS property read by the namenode, so it goes in hdfs-site.xml
Add:
<property>
  <name>dfs.hosts.exclude</name>
  <value>/home/hadoop/hadoop-1.0.4/conf/datanode-exclude</value>
</property>
Create the file /home/hadoop/hadoop-1.0.4/conf/datanode-exclude and list the hosts to remove, one per line:
[hadoop@desk11 conf]$ echo "server73" > \
/home/hadoop/hadoop-1.0.4/conf/datanode-exclude
Refresh the nodes online, on the master:
[hadoop@desk11 hadoop-1.0.4]$ bin/hadoop dfsadmin -refreshNodes
This operation migrates the data away in the background.
The web UI then shows that one of the Datanodes has gone down.
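Whether the background migration has finished can be checked by watching the node's Decommission Status in the dfsadmin report; a sketch (node_status is a hypothetical helper parsing the report format shown earlier):

```shell
# node_status IP -- read `dfsadmin -report` on stdin and print the
# Decommission Status line of the node whose Name matches IP.
node_status() {
    awk -v ip="$1" '$0 ~ "^Name: "ip {found=1}
                    found && /^Decommission Status/ {print; exit}'
}
# Usage: bin/hadoop dfsadmin -report | node_status 192.168.122.173
# Repeat until the status reads Decommissioned; the node can then be stopped.
```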
6) Removing a tasktracker from Hadoop online:
On the master, edit conf/mapred-site.xml:
<property>
  <name>mapred.hosts.exclude</name>
  <value>/home/hadoop/hadoop-1.0.4/conf/tasktracker-exclude</value>
</property>
Create the file /home/hadoop/hadoop-1.0.4/conf/tasktracker-exclude:
touch /home/hadoop/hadoop-1.0.4/conf/tasktracker-exclude
vim /home/hadoop/hadoop-1.0.4/conf/tasktracker-exclude
server73    (or 192.168.122.173)
Refresh the nodes:
[hadoop@desk11 bin]$ ./hadoop mradmin -refreshNodes