1. What is MapReduce:
MapReduce is a programming model for parallel processing of large data sets (larger than 1 TB). Its core ideas, "Map" and "Reduce", are borrowed from functional programming languages, along with features taken from vector programming languages. It makes it much easier for programmers with no background in distributed parallel programming to run their programs on a distributed system. In current implementations, you specify a Map function that turns a set of key/value pairs into a new set of intermediate key/value pairs, and a Reduce function that merges all the intermediate values that share the same key.
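As a concrete (made-up) illustration, take the word-count job developed below and feed it two input lines, "hello world" and "hello hadoop". The Map function emits one (word, 1) pair per word, the framework groups the pairs by key, and the Reduce function sums each group:

map input:       "hello world", "hello hadoop"
map output:      (hello, 1) (world, 1) (hello, 1) (hadoop, 1)
after grouping:  (hello, [1, 1]) (world, [1]) (hadoop, [1])
reduce output:   (hello, 2) (world, 1) (hadoop, 1)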
Simple small cases:
Case 1:
Case description: we have a file containing a number of strings; write a MapReduce job that counts how many times each string appears in the file.
Code:
package xja.com;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordConunt {
public static void main(String[] args)throws Exception {
Path inpath = new Path(args[0]);
Path outpath = new Path(args[1]);
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(WordConunt.class);
job.setJobName("WordConunt");
job.setMapperClass(Map.class);
job.setReducerClass(Red.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, inpath);
FileOutputFormat.setOutputPath(job, outpath);
job.waitForCompletion(true);
}
public static class Map extends Mapper<LongWritable,Text,Text,IntWritable>{
public void map(LongWritable key,Text value,Context context) throws IOException,InterruptedException {
String[] line = value.toString().split(" ");
Text keyy;
IntWritable valuee = new IntWritable(1);
for(int i=0; i<line.length; i++){
keyy = new Text(line[i]);
context.write(keyy, valuee);
}
}
}
public static class Red extends Reducer<Text,IntWritable,Text,IntWritable>{
public void reduce(Text key,Iterable<IntWritable> value,Context context) throws IOException,InterruptedException {
int count = 0;
for(IntWritable val : value){
count = count+val.get();
}
context.write(key, new IntWritable(count));
}
}
}
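The run command in step 4 below assumes the class has already been compiled and packaged as WordConunt.jar. One possible way to build it (a sketch only, assuming the Hadoop client libraries are installed on the machine and that the main class is recorded in the jar manifest):

[root@quickstart cloudera]# mkdir classes
[root@quickstart cloudera]# javac -classpath $(hadoop classpath) -d classes WordConunt.java
[root@quickstart cloudera]# jar cfe WordConunt.jar xja.com.WordConunt -C classes .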
1. Create the input directory. Command:
[root@quickstart /]# hadoop fs -mkdir -p /data/wordcount/input
(Do not create /data/wordcount/output ahead of time: MapReduce requires that the output directory not already exist, and the job will create it itself.)
2. Create a file locally and write some random strings into it:
Command: [root@quickstart cloudera]# gedit wang.txt
Type in some random strings to use as test data.
3. Upload the file to the /data/wordcount/input directory in the HDFS file system:
Command: [root@quickstart cloudera]# hadoop fs -put wang.txt /data/wordcount/input
4. Run the MapReduce job:
Command: [root@quickstart cloudera]# hadoop jar WordConunt.jar /data/wordcount/input /data/wordcount/output
5. Check the results:
After the job finishes, an output directory has appeared under /data/wordcount. List the files inside it:
Command: [root@quickstart /]# hadoop fs -ls /data/wordcount/output
View the result file:
Command: [root@quickstart /]# hadoop fs -cat /data/wordcount/output/*
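For example, if wang.txt contained the single line "hello world hello hadoop" (made-up content; the actual file was filled with random strings), the result file would look like this, one word per line with its count, sorted by key:

hadoop	1
hello	2
world	1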
Case 2:
Case description: we are given a file holding records for several phone numbers; each record contains the phone number, its upload traffic and its download traffic:
Phone number | Upload traffic | Download traffic |
13726230501 | 200 | 1100 |
13396230502 | 300 | 1200 |
13898205030 | 400 | 1300 |
13897230503 | 100 | 300 |
13597230543 | 500 | 1400 |
13597230534 | 300 | 1200 |
Write a MapReduce job that, for each phone number, stores its upload traffic, its download traffic, and the sum of the two (for example, 13726230501 should yield 200, 1100 and a total of 1300).
Case steps:
1. First create the phone.txt file:
Command: [root@quickstart cloudera]# gedit phone.txt
Enter the data with '|' between the fields (any separator works, as long as the code splits on the same one).
File contents (phone.txt):
13726230501|200|1100
13396230502|300|1200
13898205030|400|1300
13897230503|100|300
13597230543|500|1400
13597230534|300|1200
2. Write the Java class:
Code:
package xja.com;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class PhoneTest {
public static void main(String[] args) throws Exception{
Path inputPath = new Path(args[0]);
Path outputPath = new Path(args[1]);
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(PhoneTest.class);
job.setJobName("PhoneTest");
job.setMapperClass(Map.class);
job.setReducerClass(Red.class);
FileInputFormat.setInputPaths(job, inputPath);
FileOutputFormat.setOutputPath(job, outputPath);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.waitForCompletion(true);
}
public static class Map extends Mapper<LongWritable,Text,Text,Text>{
public void map(LongWritable key,Text value,Context context) throws IOException,InterruptedException{
String[] line = value.toString().split("\\|",2);
context.write(new Text(line[0]), new Text(line[1]));
}
}
public static class Red extends Reducer<Text,Text,Text,Text>{
public void reduce(Text key,Iterable<Text> value,Context context) throws IOException,InterruptedException{
String[] str;
int upSum = 0;
int downSum = 0;
int totSum = 0;
for(Text val : value){
str = val.toString().split("\\|");
upSum = upSum+Integer.parseInt(str[0]);
downSum = downSum+Integer.parseInt(str[1]);
}
totSum = totSum + upSum + downSum;
context.write(key, new Text(upSum+","+downSum+","+totSum));
}
}
}
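One detail worth noting in the mapper: value.toString().split("\\|", 2) splits each record only at the first '|', so line[0] is the phone number (the map output key) and line[1] is the remaining text "up|down" (the map output value), which the reducer then splits a second time. A tiny stand-alone check of that behavior (a hypothetical helper class, not part of the job):

public class SplitDemo {
    public static void main(String[] args) {
        String[] parts = "13726230501|200|1100".split("\\|", 2);
        System.out.println(parts[0]); // 13726230501 -> used as the map output key
        System.out.println(parts[1]); // 200|1100    -> used as the map output value
    }
}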
Package the Java class into PhoneTest.jar.
Note: (for a jar-packaging tutorial, see the blog post: )
3. Create the storage directory on HDFS:
Here I created a /data/phone/input directory under the HDFS root directory to hold the phone.txt file.
Command: [root@quickstart cloudera]# hadoop fs -mkdir -p /data/phone/input
4. Upload phone.txt, which holds the records, to the /data/phone/input directory:
Command: [root@quickstart cloudera]# hadoop fs -put phone.txt /data/phone/input
After uploading, check the file: [root@quickstart cloudera]# hadoop fs -ls /data/phone/input
5. Run the jar:
Command: [root@quickstart cloudera]# hadoop jar PhoneTest.jar /data/phone/input /data/phone/output
(After the jar finishes, the /data/phone/output directory is created in the HDFS file system.)
Once the jar has run, check that the directory was created.
6. View the contents of the files in that directory:
Command: [root@quickstart /]# hadoop fs -cat /data/phone/output/*
The result file:
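Given the phone.txt contents above, every number appears exactly once, so the result file (key, a tab, then "upload,download,total") should contain:

13396230502	300,1200,1500
13597230534	300,1200,1500
13597230543	500,1400,1900
13726230501	200,1100,1300
13897230503	100,300,400
13898205030	400,1300,1700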
Extensions to case 2:
Part 1: adding serialization:
Code:
1. The class that implements serialization (a custom Writable), PhoneWritable.java:
package xja.com;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
public class PhoneWritable implements Writable {
int upFlow;
int downFlow;
int totFlow;
public PhoneWritable(){
}
public PhoneWritable(int upFlow,int downFlow){
this.upFlow = upFlow;
this.downFlow = downFlow;
this.totFlow = upFlow + downFlow;
}
public int getUpFlow() {
return upFlow;
}
public void setUpFlow(int upFlow) {
this.upFlow = upFlow;
}
public int getDownFlow() {
return downFlow;
}
public void setDownFlow(int downFlow) {
this.downFlow = downFlow;
}
public int getTotFlow() {
return totFlow;
}
public void setTotFlow(int totFlow) {
this.totFlow = totFlow;
}
public void write(DataOutput out) throws IOException{
out.writeInt(upFlow);
out.writeInt(downFlow);
out.writeInt(totFlow);
}
public void readFields(DataInput in) throws IOException{
upFlow = in.readInt();
downFlow = in.readInt();
totFlow = in.readInt();
}
@Override
public String toString() {
return "PhoneWritable [upFlow=" + upFlow + ", downFlow=" + downFlow
+ ", totFlow=" + totFlow + "]";
}
}
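Hadoop serializes a PhoneWritable on the map side by calling write() and rebuilds it on the reduce side through the no-argument constructor followed by readFields(), which is why both are required. A quick stand-alone round-trip test (a hypothetical demo class, not part of the job) illustrates that contract:

package xja.com;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class PhoneWritableRoundTrip {
    public static void main(String[] args) throws IOException {
        // Serialize with write(), just as the framework does with map output.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        new PhoneWritable(200, 1100).write(new DataOutputStream(buffer));

        // Deserialize into a fresh instance: no-arg constructor, then readFields().
        PhoneWritable copy = new PhoneWritable();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));

        // Prints: PhoneWritable [upFlow=200, downFlow=1100, totFlow=1300]
        System.out.println(copy);
    }
}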
2. The driver class, PhoneTestFlow.java:
package xja.com;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class PhoneTestFlow {
public static void main(String[] args) throws Exception{
Path inputPath = new Path(args[0]);
Path outputPath = new Path(args[1]);
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(PhoneTestFlow.class);
job.setJobName("PhoneTest");
job.setMapperClass(Map.class);
job.setReducerClass(Red.class);
FileInputFormat.setInputPaths(job, inputPath);
FileOutputFormat.setOutputPath(job, outputPath);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(PhoneWritable.class);
job.waitForCompletion(true);
}
public static class Map extends Mapper<LongWritable,Text,Text,PhoneWritable>{
public void map(LongWritable key,Text value,Context context) throws IOException,InterruptedException{
// line[0] = phone number, line[1] = upload traffic, line[2] = download traffic
String[] line = value.toString().split("\\|");
PhoneWritable fwValue = new PhoneWritable(Integer.parseInt(line[1]),Integer.parseInt(line[2]));
context.write(new Text(line[0]), fwValue);
}
}
public static class Red extends Reducer<Text,PhoneWritable,Text,Text>{
public void reduce(Text key,Iterable<PhoneWritable> value,Context context) throws IOException,InterruptedException{
String[] str;
int upSum = 0;
int downSum = 0;
int totSum = 0;
for(PhoneWritable val : value){
//str = val.toString().split("\\|");
upSum = upSum+val.getUpFlow();
downSum = downSum+val.getDownFlow();
//upSum = upSum+Integer.parseInt(str[0]);
//downSum = downSum+Integer.parseInt(str[1]);
}
totSum = totSum + upSum + downSum;
context.write(key, new Text(upSum+","+downSum+","+totSum));
}
}
}
Package the two classes together into one jar: PhoneTestFlow.jar.
3. Run the jar, again using the /data/phone/input/phone.txt file:
Command: [root@quickstart cloudera]# hadoop jar PhoneTestFlow.jar /data/phone/input /data/phone/output1
4. Check the generated results:
Command: [root@quickstart cloudera]# hadoop fs -cat /data/phone/output1/*
The result file: each line is again the phone number followed by "upload,download,total", the same as in case 2, since only the intermediate value type changed from Text to PhoneWritable.
Part 2: implementing partitioning on top of the serialization operation:
Code:
1. The serialization class PhoneWritable.java is reused unchanged from the serialization example above (it must stay in the same package, xja.com, as the two classes below).
2. The driver class, PhoneTestFlow.java:
package xja.com;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class PhoneTestFlow {
public static void main(String[] args) throws Exception{
Path inputPath = new Path(args[0]);
Path outputPath = new Path(args[1]);
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(PhoneTestFlow.class);
job.setJobName("PhoneTest");
job.setMapperClass(Map.class);
job.setReducerClass(Red.class);
FileInputFormat.setInputPaths(job, inputPath);
FileOutputFormat.setOutputPath(job, outputPath);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(PhoneWritable.class);
job.setPartitionerClass(MyPartitioner.class);
job.setNumReduceTasks(5);
job.waitForCompletion(true);
}
public static class Map extends Mapper<LongWritable,Text,Text,PhoneWritable>{
public void map(LongWritable key,Text value,Context context) throws IOException,InterruptedException{
// line[0] = phone number, line[1] = upload traffic, line[2] = download traffic
String[] line = value.toString().split("\\|");
PhoneWritable fwValue = new PhoneWritable(Integer.parseInt(line[1]),Integer.parseInt(line[2]));
context.write(new Text(line[0]), fwValue);
}
}
public static class Red extends Reducer<Text,PhoneWritable,Text,Text>{
public void reduce(Text key,Iterable<PhoneWritable> value,Context context) throws IOException,InterruptedException{
String[] str;
int upSum = 0;
int downSum = 0;
int totSum = 0;
for(PhoneWritable val : value){
//str = val.toString().split("\\|");
upSum = upSum+val.getUpFlow();
downSum = downSum+val.getDownFlow();
//upSum = upSum+Integer.parseInt(str[0]);
//downSum = downSum+Integer.parseInt(str[1]);
}
totSum = totSum + upSum + downSum;
context.write(key, new Text(upSum+","+downSum+","+totSum));
}
}
}
3. The class that defines the partitioning rule, MyPartitioner.java. The value returned by getPartition() decides which reduce task (and therefore which output file) each record is sent to; that is why the driver calls setNumReduceTasks(5), matching the five possible return values 0 through 4:
package xja.com;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class MyPartitioner extends Partitioner<Text,PhoneWritable> {
public int getPartition(Text key,PhoneWritable value,int partitionNum){
String phoneAre = key.toString().substring(0,3);
if("137".equals(phoneAre)){
return 0;
} else if("133".equals(phoneAre)) {
return 1;
} else if("138".equals(phoneAre)) {
return 2;
} else if("135".equals(phoneAre)) {
return 3;
} else{
return 4;
}
}
}
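With the phone.txt data used here, the partitioner distributes the records across the five reduce tasks as follows:

prefix 137 -> partition 0: 13726230501
prefix 133 -> partition 1: 13396230502
prefix 138 -> partition 2: 13898205030, 13897230503
prefix 135 -> partition 3: 13597230543, 13597230534
other      -> partition 4: (no records in this data set)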
Package the three classes (PhoneWritable, PhoneTestFlow and MyPartitioner) together into one jar: PartitionerPhoneTestFlow.jar.
4. Run the jar, again using the /data/phone/input/phone.txt file:
Command: [root@quickstart cloudera]# hadoop jar PartitionerPhoneTestFlow.jar /data/phone/input /data/phone/output2
5. Check the generated results:
Command: [root@quickstart cloudera]# hadoop fs -ls /data/phone/output2
[root@quickstart cloudera]# hadoop fs -cat /data/phone/output2/*
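Because the job uses five reduce tasks, /data/phone/output2 should contain five result files, part-r-00000 through part-r-00004 (plus a _SUCCESS marker), one per partition. For example, part-r-00002 (the 138 prefix) should hold:

13897230503	100,300,400
13898205030	400,1300,1700

and part-r-00004 should be empty, since no number in phone.txt falls outside the four listed prefixes.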