Environment: VMware 8.0 and Ubuntu 11.04

Hadoop in Action: Running DataJoin

Step 1: Create a project named HadoopTest. The directory structure is shown in the figure below:

(Figure: directory structure of the HadoopTest project)





Step 2: Create a start.sh script under /home/tanglg1987. Because everything under /tmp is wiped each time the virtual machine starts, the script deletes the old data and logs and reformats the NameNode before starting the cluster. The code is as follows:

 

#!/bin/bash
# Clear the previous HDFS data (kept under /tmp) and the old logs.
sudo rm -rf /tmp/*
rm -rf /home/tanglg1987/hadoop-0.20.2/logs
# Reformat the NameNode. Note that the DataNode has no -format option,
# so the second command only prints a usage message (visible in the log below).
hadoop namenode -format
hadoop datanode -format
# Start the pseudo-distributed cluster, recreate the input directory,
# and force the NameNode out of safe mode so files can be written right away.
start-all.sh
hadoop fs -mkdir input
hadoop dfsadmin -safemode leave



Step 3: Make start.sh executable and start the Hadoop pseudo-distributed cluster:


chmod 777 /home/tanglg1987/start.sh
./start.sh



The execution log is as follows:

12/10/15 23:05:38 INFO namenode.NameNode: STARTUP_MSG:
 /************************************************************
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG: host = tanglg1987/127.0.1.1
 STARTUP_MSG: args = [-format]
 STARTUP_MSG: version = 0.20.2
 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
 ************************************************************/
 12/10/15 23:05:39 INFO namenode.FSNamesystem: fsOwner=tanglg1987,tanglg1987,adm,dialout,cdrom,plugdev,lpadmin,admin,sambashare
 12/10/15 23:05:39 INFO namenode.FSNamesystem: supergroup=supergroup
 12/10/15 23:05:39 INFO namenode.FSNamesystem: isPermissionEnabled=true
 12/10/15 23:05:39 INFO common.Storage: Image file of size 100 saved in 0 seconds.
 12/10/15 23:05:39 INFO common.Storage: Storage directory /tmp/hadoop-tanglg1987/dfs/name has been successfully formatted.
 12/10/15 23:05:39 INFO namenode.NameNode: SHUTDOWN_MSG: 
 /************************************************************
 SHUTDOWN_MSG: Shutting down NameNode at tanglg1987/127.0.1.1
 ************************************************************/
 12/10/15 23:05:40 INFO datanode.DataNode: STARTUP_MSG: 
 /************************************************************
 STARTUP_MSG: Starting DataNode
 STARTUP_MSG: host = tanglg1987/127.0.1.1
 STARTUP_MSG: args = [-format]
 STARTUP_MSG: version = 0.20.2
 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
 ************************************************************/
 Usage: java DataNode
 [-rollback]
 12/10/15 23:05:40 INFO datanode.DataNode: SHUTDOWN_MSG: 
 /************************************************************
 SHUTDOWN_MSG: Shutting down DataNode at tanglg1987/127.0.1.1
 ************************************************************/
 starting namenode, logging to /home/tanglg1987/hadoop-0.20.2/bin/../logs/hadoop-tanglg1987-namenode-tanglg1987.out
 localhost: starting datanode, logging to /home/tanglg1987/hadoop-0.20.2/bin/../logs/hadoop-tanglg1987-datanode-tanglg1987.out
 localhost: starting secondarynamenode, logging to /home/tanglg1987/hadoop-0.20.2/bin/../logs/hadoop-tanglg1987-secondarynamenode-tanglg1987.out
 starting jobtracker, logging to /home/tanglg1987/hadoop-0.20.2/bin/../logs/hadoop-tanglg1987-jobtracker-tanglg1987.out
 localhost: starting tasktracker, logging to /home/tanglg1987/hadoop-0.20.2/bin/../logs/hadoop-tanglg1987-tasktracker-tanglg1987.out
 Safe mode is OFF
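Before uploading any data, it is worth checking that all five daemons actually started. A minimal check (output shown as a sketch; process IDs will differ):

jps
# should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker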

Step 4: Upload the local files to HDFS

Create Order.txt under /home/tanglg1987 with the following content:

3,A,12.95,02-Jun-2008
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,D,25.00,22-Jan-2009

Create Customer.txt under /home/tanglg1987 with the following content:

1,tom,555-555-5555
2,white,123-456-7890
3,jerry,281-330-4563
4,tanglg,408-555-0000

Upload the local files to HDFS:

hadoop fs -put /home/tanglg1987/Order.txt input
hadoop fs -put /home/tanglg1987/Customer.txt input
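To confirm the upload, you can list and print the files back from HDFS (a quick sanity check):

hadoop fs -ls input
hadoop fs -cat input/Order.txt
hadoop fs -cat input/Customer.txt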

Step 5: Create DataJoin.java. The mapper tags every record with the name of the file it came from, the group key is the first comma-separated field (the customer ID), and the reducer joins the order and customer records that share a key. The code is as follows:

package com.baison.action;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DataJoin extends Configured implements Tool {

    public static class MapClass extends DataJoinMapperBase {
        // Tag each record with its data source, derived from the input file name.
        protected Text generateInputTag(String inputFile) {
            String datasource = inputFile.split("-")[0];
            return new Text(datasource);
        }

        // Group records by the first comma-separated field (the customer ID).
        protected Text generateGroupKey(TaggedMapOutput aRecord) {
            String line = ((Text) aRecord.getData()).toString();
            String[] tokens = line.split(",");
            String groupKey = tokens[0];
            return new Text(groupKey);
        }

        protected TaggedMapOutput generateTaggedMapOutput(Object value) {
            TaggedWritable retv = new TaggedWritable((Text) value);
            retv.setTag(this.inputTag);
            return retv;
        }
    }

    public static class Reduce extends DataJoinReducerBase {
        // Called once per combination of records sharing a group key; returning null
        // drops keys that appear in only one data source (an inner join).
        protected TaggedMapOutput combine(Object[] tags, Object[] values) {
            if (tags.length < 2)
                return null;
            String joinedStr = "";
            for (int i = 0; i < values.length; i++) {
                if (i > 0)
                    joinedStr += ",";
                TaggedWritable tw = (TaggedWritable) values[i];
                String line = ((Text) tw.getData()).toString();
                // Strip the join key from each record so it is not repeated in the output.
                String[] tokens = line.split(",", 2);
                joinedStr += tokens[1];
            }
            TaggedWritable retv = new TaggedWritable(new Text(joinedStr));
            retv.setTag((Text) tags[0]);
            return retv;
        }
    }

    // Wraps a record together with its source tag so it can be shuffled as one value.
    public static class TaggedWritable extends TaggedMapOutput {
        private Writable data;

        public TaggedWritable() {
            this.tag = new Text();
        }

        public TaggedWritable(Writable data) {
            this.tag = new Text("");
            this.data = data;
        }

        public Writable getData() {
            return data;
        }

        public void setData(Writable data) {
            this.data = data;
        }

        public void write(DataOutput out) throws IOException {
            this.tag.write(out);
            // Record the concrete class so readFields() can recreate it.
            out.writeUTF(this.data.getClass().getName());
            this.data.write(out);
        }

        public void readFields(DataInput in) throws IOException {
            this.tag.readFields(in);
            String dataClz = in.readUTF();
            if (this.data == null
                    || !this.data.getClass().getName().equals(dataClz)) {
                try {
                    this.data = (Writable) ReflectionUtils.newInstance(
                            Class.forName(dataClz), null);
                } catch (ClassNotFoundException e) {
                    e.printStackTrace();
                }
            }
            this.data.readFields(in);
        }
    }

    public int run(String[] args) throws Exception {
        for (String string : args) {
            System.out.println(string);
        }
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, DataJoin.class);
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setJobName("DataJoin");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TaggedWritable.class);
        // Emit the key and the joined value separated by a comma instead of a tab.
        job.set("mapred.textoutputformat.separator", ",");
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        String[] arg = { "hdfs://localhost:9100/user/tanglg1987/input",
                "hdfs://localhost:9100/user/tanglg1987/output" };
        int res = ToolRunner.run(new Configuration(), new DataJoin(), arg);
        System.exit(res);
    }
}
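If you prefer to build and submit the job outside of Eclipse, a command-line sketch follows. The jar names and paths are assumptions based on a stock hadoop-0.20.2 distribution (core jar in $HADOOP_HOME, datajoin contrib jar under contrib/datajoin) and may need adjusting:

export HADOOP_HOME=/home/tanglg1987/hadoop-0.20.2
mkdir classes
# compile against the core jar and the datajoin contrib jar
javac -classpath $HADOOP_HOME/hadoop-0.20.2-core.jar:$HADOOP_HOME/contrib/datajoin/hadoop-0.20.2-datajoin.jar -d classes DataJoin.java
jar -cvf datajoin.jar -C classes .
# make the contrib classes visible to the client JVM (enough for the local job runner);
# on a real cluster the contrib jar must also be shipped with the job, e.g. in a lib/ folder inside datajoin.jar
export HADOOP_CLASSPATH=$HADOOP_HOME/contrib/datajoin/hadoop-0.20.2-datajoin.jar
hadoop jar datajoin.jar com.baison.action.DataJoin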

Step 6: Run On Hadoop; the run log is as follows:

hdfs://localhost:9100/user/tanglg1987/input
 hdfs://localhost:9100/user/tanglg1987/output
 12/10/16 22:05:36 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
 12/10/16 22:05:36 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
 12/10/16 22:05:36 INFO mapred.FileInputFormat: Total input paths to process : 2
 12/10/16 22:05:36 INFO mapred.JobClient: Running job: job_local_0001
 12/10/16 22:05:36 INFO mapred.FileInputFormat: Total input paths to process : 2
 12/10/16 22:05:36 INFO mapred.MapTask: numReduceTasks: 1
 12/10/16 22:05:36 INFO mapred.MapTask: io.sort.mb = 100
 12/10/16 22:05:37 INFO mapred.MapTask: data buffer = 79691776/99614720
 12/10/16 22:05:37 INFO mapred.MapTask: record buffer = 262144/327680
 12/10/16 22:05:37 INFO mapred.MapTask: Starting flush of map output
 12/10/16 22:05:37 INFO mapred.MapTask: Finished spill 0
 12/10/16 22:05:37 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
 12/10/16 22:05:37 INFO mapred.LocalJobRunner: collectedCount 4
 totalCount 4
 12/10/16 22:05:37 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
 12/10/16 22:05:37 INFO mapred.MapTask: numReduceTasks: 1
 12/10/16 22:05:37 INFO mapred.MapTask: io.sort.mb = 100
 12/10/16 22:05:37 INFO mapred.MapTask: data buffer = 79691776/99614720
 12/10/16 22:05:37 INFO mapred.MapTask: record buffer = 262144/327680
 12/10/16 22:05:37 INFO mapred.MapTask: Starting flush of map output
 12/10/16 22:05:37 INFO mapred.MapTask: Finished spill 0
 12/10/16 22:05:37 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
 12/10/16 22:05:37 INFO mapred.LocalJobRunner: collectedCount 4
 totalCount 4
 12/10/16 22:05:37 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000001_0' done.
 12/10/16 22:05:37 INFO mapred.LocalJobRunner: 
 12/10/16 22:05:37 INFO mapred.Merger: Merging 2 sorted segments
 12/10/16 22:05:37 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 875 bytes
 12/10/16 22:05:37 INFO mapred.LocalJobRunner: 
 12/10/16 22:05:37 INFO datajoin.job: key: 1 this.largestNumOfValues: 2
 12/10/16 22:05:37 INFO datajoin.job: key: 3 this.largestNumOfValues: 3
 12/10/16 22:05:37 INFO mapred.JobClient: map 100% reduce 0%
 12/10/16 22:05:37 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
 12/10/16 22:05:37 INFO mapred.LocalJobRunner: 
 12/10/16 22:05:37 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
 12/10/16 22:05:37 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9100/user/tanglg1987/output
 12/10/16 22:05:37 INFO mapred.LocalJobRunner: actuallyCollectedCount 4
 collectedCount 5
 groupCount 4
 > reduce
 12/10/16 22:05:37 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
 12/10/16 22:05:38 INFO mapred.JobClient: map 100% reduce 100%
 12/10/16 22:05:38 INFO mapred.JobClient: Job complete: job_local_0001
 12/10/16 22:05:38 INFO mapred.JobClient: Counters: 15
 12/10/16 22:05:38 INFO mapred.JobClient: FileSystemCounters
 12/10/16 22:05:38 INFO mapred.JobClient: FILE_BYTES_READ=51466
 12/10/16 22:05:38 INFO mapred.JobClient: HDFS_BYTES_READ=435
 12/10/16 22:05:38 INFO mapred.JobClient: FILE_BYTES_WRITTEN=105007
 12/10/16 22:05:38 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=162
 12/10/16 22:05:38 INFO mapred.JobClient: Map-Reduce Framework
 12/10/16 22:05:38 INFO mapred.JobClient: Reduce input groups=4
 12/10/16 22:05:38 INFO mapred.JobClient: Combine output records=0
 12/10/16 22:05:38 INFO mapred.JobClient: Map input records=8
 12/10/16 22:05:38 INFO mapred.JobClient: Reduce shuffle bytes=0
 12/10/16 22:05:38 INFO mapred.JobClient: Reduce output records=4
 12/10/16 22:05:38 INFO mapred.JobClient: Spilled Records=16
 12/10/16 22:05:38 INFO mapred.JobClient: Map output bytes=855
 12/10/16 22:05:38 INFO mapred.JobClient: Map input bytes=175
 12/10/16 22:05:38 INFO mapred.JobClient: Combine input records=0
 12/10/16 22:05:38 INFO mapred.JobClient: Map output records=8
 12/10/16 22:05:38 INFO mapred.JobClient: Reduce input records=8


Step 7: View the result set. The output is shown below:

(Figure: join result in the HDFS output directory)
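The result can also be read straight from HDFS; with a single reducer the output file is named part-00000:

hadoop fs -cat output/part-00000

Given the sample data, the job emits four joined records: one each for customers 1 and 2, two for customer 3 (who placed two orders), and none for customer 4, because combine() returns null when a key appears in only one of the two files. This agrees with the Reduce output records=4 counter in the log above.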

