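Preparation:
Set up a Hadoop environment on 192.168.129.35. This was already done this morning, so it is not covered here.
For details, see the attached email "<technical>canton hadoop environment in 192.168.129.35".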
Step 1:
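Download Hadoop and extract it on the local machine (Eclipse needs some of the JARs in this Hadoop distribution as its runtime).
Hadoop can be downloaded from the official site at http://hadoop.apache.org/ . I downloaded version 0.20.2 and extracted it to D:\hadoop-0.20.2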
Step 2:
Download the Hadoop Eclipse plugin and put it in Eclipse's dropins directory (or the plugins directory).
The plugin can be found under \\192.168.0.238\Canton\Software\Eclipse_Plugins\Hadoop_Eclipse
Step 3:
Restart Eclipse (or STS; I use Spring Tool Suite, and the steps are the same).
Under Window->Preferences, set the local Hadoop runtime so that it points to the directory extracted in Step 1.
Step 4:
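Open the MapReduce Tools view (Window->Show View->MapReduce Tools->MapReduce Locations)
and edit the location. This step is very tricky; I got it wrong N times before every setting was right. The examples online all run Hadoop on the local machine, where the local account and the Hadoop account are the same.
In our case we are connecting to a remote Hadoop server on 192.168.129.35, so the accounts differ (for example, the account on my development machine is charles.wang, while the one on the remote 192.168.129.35 is root).
The settings in the General panel are as follows:
In the Advanced parameters panel, keep the defaults except for the following, which need to be changed: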
· dfs.data.dir: set to /home/dcui/hadoop-0.20.2/tmp/dfs/data
· dfs.name.dir: set to /home/dcui/hadoop-0.20.2/tmp/dfs/name
· dfs.name.edits.dir: set to /home/dcui/hadoop-0.20.2/tmp/dfs/name
· dfs.replication: set to 1
· hadoop.tmp.dir: set to /home/dcui/hadoop-0.20.2/tmp
· hadoop.job.ugi: set to root,Domain,Users,Remote,Desktop,Users,Users
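A note on hadoop.job.ugi, since it is the setting that makes the remote account work: as far as I know, Hadoop 0.20.x reads this value as a comma-separated list in which the first entry is the user name and the remaining entries are group names. The Hadoop-free Java sketch below just splits the value used above to show why it has to start with root. The class and method names (UgiSketch, userOf, groupsOf) are made up for illustration and are not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: mimics how a "user,group1,group2,..." value is split.
// UgiSketch is a hypothetical helper, not a Hadoop class.
final class UgiSketch {

    // The first comma-separated entry is treated as the user name.
    static String userOf(String ugi) {
        return ugi.split(",")[0].trim();
    }

    // Every entry after the first is treated as a group name.
    static List<String> groupsOf(String ugi) {
        String[] parts = ugi.split(",");
        List<String> groups = new ArrayList<>();
        for (int i = 1; i < parts.length; i++) {
            groups.add(parts[i].trim());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Value taken from the Advanced parameters panel above.
        String ugi = "root,Domain,Users,Remote,Desktop,Users,Users";
        System.out.println("user   = " + userOf(ugi));
        System.out.println("groups = " + groupsOf(ugi));
    }
}
```

For the value above this gives the user root followed by six group entries, so requests to the remote HDFS are made as root rather than as the local Windows account.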
Step 5:
In Project Explorer you can see the following content. The second entry, /user/root/inputDir, is a directory in the Hadoop distributed file system that we created this morning.
Step 6:
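Next, develop the HelloWorld program: create a project with the MapReduce Project wizard.
The complete project source is in the attachment HadoopWordCountDemo.zip
I don't really know much about this yet; I just followed the API samples and put something together to play with. It counts the occurrences of each keyword in our canton_codetemplate.xml.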
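The real source is in the attachment. As a rough illustration of what a word-count job does, here is a minimal sketch in plain Java with no Hadoop dependency. It assumes whitespace tokenization, as in the stock WordCount sample, which may not match the attached project exactly; WordCountSketch and its methods are hypothetical names, not classes from the project.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustration only: the map and reduce steps of a word count, without Hadoop.
final class WordCountSketch {

    // "Map" step: split one input line on whitespace and emit (token, 1) per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // "Reduce" step: sum the counts for each key.
    // A TreeMap keeps the keys sorted, similar to the sorted keys a reducer sees.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Runs the map step over every line, then reduces all emitted pairs.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        // Made-up sample input, not the contents of canton_codetemplate.xml.
        List<String> lines = new ArrayList<>();
        lines.add("a b a");
        lines.add("b c");
        System.out.println(count(lines));
    }
}
```

In the real job Hadoop does the grouping and sorting between the two steps; here a single in-memory map stands in for that.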
Step 7:
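Configure the run options as shown in the figure. The program takes two arguments: argument 1 is the location, in the Hadoop file system, of the file to be counted; argument 2 is the target output directory that will hold the results. To make it run smoothly I also raised the memory settings in the VM arguments a little.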
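Because a wrong or missing argument is an easy way to lose time here, the following sketch shows one way the two program arguments could be checked before the job is submitted. It is an assumption about how such a check might look, not code from the attached project: RunArgsSketch and parse are hypothetical names, and the input file path in main is a guess built from the directory shown in Step 5. Only the output URI comes from the console log.

```java
import java.net.URI;

// Illustration only: validates the two program arguments described above.
final class RunArgsSketch {

    // Returns {input, output} as URIs, or throws if the arguments are unusable.
    static URI[] parse(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException(
                "Usage: <input file in HDFS> <output directory in HDFS>");
        }
        URI input = URI.create(args[0]);
        URI output = URI.create(args[1]);
        if (input.equals(output)) {
            throw new IllegalArgumentException("Input and output must differ");
        }
        return new URI[] { input, output };
    }

    public static void main(String[] args) {
        // Hypothetical input file name; the output URI matches the console log.
        String[] sample = {
            "hdfs://192.168.129.35:9000/user/root/inputDir/canton_codetemplate.xml",
            "hdfs://192.168.129.35:9000/user/root/outputToThisFolder"
        };
        URI[] parsed = parse(sample);
        System.out.println("input  path: " + parsed[0].getPath());
        System.out.println("output host: " + parsed[1].getHost() + ":" + parsed[1].getPort());
    }
}
```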
Step 8:
Watch the console output, shown below (finally; it took me an hour of debugging):
12/03/12 14:52:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/03/12 14:52:08 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/12 14:52:08 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
12/03/12 14:52:13 INFO input.FileInputFormat: Total input paths to process : 1
12/03/12 14:52:13 INFO mapred.JobClient: Running job: job_local_0001
12/03/12 14:52:13 INFO input.FileInputFormat: Total input paths to process : 1
12/03/12 14:52:13 INFO mapred.MapTask: io.sort.mb = 100
12/03/12 14:52:13 INFO mapred.MapTask: data buffer = 79691776/99614720
12/03/12 14:52:13 INFO mapred.MapTask: record buffer = 262144/327680
12/03/12 14:52:13 INFO mapred.MapTask: Starting flush of map output
12/03/12 14:52:14 INFO mapred.JobClient: map 0% reduce 0%
12/03/12 14:52:14 INFO mapred.MapTask: Finished spill 0
12/03/12 14:52:14 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/03/12 14:52:14 INFO mapred.LocalJobRunner:
12/03/12 14:52:14 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
12/03/12 14:52:14 INFO mapred.LocalJobRunner:
12/03/12 14:52:14 INFO mapred.Merger: Merging 1 sorted segments
12/03/12 14:52:14 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 3278 bytes
12/03/12 14:52:14 INFO mapred.LocalJobRunner:
12/03/12 14:52:14 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/03/12 14:52:15 INFO mapred.LocalJobRunner:
12/03/12 14:52:15 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
12/03/12 14:52:15 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://192.168.129.35:9000/user/root/outputToThisFolder
12/03/12 14:52:15 INFO mapred.LocalJobRunner: reduce > reduce
12/03/12 14:52:15 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
12/03/12 14:52:15 INFO mapred.JobClient: map 100% reduce 100%
12/03/12 14:52:15 INFO mapred.JobClient: Job complete: job_local_0001
12/03/12 14:52:15 INFO mapred.JobClient: Counters: 14
12/03/12 14:52:15 INFO mapred.JobClient: FileSystemCounters
12/03/12 14:52:15 INFO mapred.JobClient: FILE_BYTES_READ=37570
12/03/12 14:52:15 INFO mapred.JobClient: HDFS_BYTES_READ=7374
12/03/12 14:52:15 INFO mapred.JobClient: FILE_BYTES_WRITTEN=75880
12/03/12 14:52:15 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=2803
12/03/12 14:52:15 INFO mapred.JobClient: Map-Reduce Framework
12/03/12 14:52:15 INFO mapred.JobClient: Reduce input groups=119
12/03/12 14:52:15 INFO mapred.JobClient: Combine output records=119
12/03/12 14:52:15 INFO mapred.JobClient: Map input records=28
12/03/12 14:52:15 INFO mapred.JobClient: Reduce shuffle bytes=0
12/03/12 14:52:15 INFO mapred.JobClient: Reduce output records=119
12/03/12 14:52:15 INFO mapred.JobClient: Spilled Records=238
12/03/12 14:52:15 INFO mapred.JobClient: Map output bytes=4501
12/03/12 14:52:15 INFO mapred.JobClient: Combine input records=209
12/03/12 14:52:15 INFO mapred.JobClient: Map output records=209
12/03/12 14:52:15 INFO mapred.JobClient: Reduce input records=119
Step 9:
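Go to the Hadoop distributed file system to verify the output.
Command: hadoop fs -ls /user/root/outputToThisFolder
You can see that the directory /user/root/outputToThisFolder in the Hadoop file system does contain a file named part-r-00000
Let's look at its contents.
Command: hadoop fs -cat /user/root/outputToThisFolder/part-r-00000
So the occurrences in this file have indeed been counted per keyword.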
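If the result needs to be read by another program, note that the default text output format writes one record per line as key, a tab character, then the value. The sketch below parses lines of that shape. It assumes the job uses the default text output format; PartFileSketch is a hypothetical helper, and the sample lines in main are made up rather than taken from the actual part-r-00000.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: parses "key<TAB>count" lines such as those in part-r-00000.
final class PartFileSketch {

    // Keeps the order of the input lines; blank lines are skipped.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            // Split on the last tab so the count is always the final field.
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("No tab separator in line: " + line);
            }
            String key = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(key, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not the real job output.
        List<String> lines = new ArrayList<>();
        lines.add("alpha\t3");
        lines.add("beta\t1");
        System.out.println(parse(lines));
    }
}
```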