MapReduce Java Framework Development Workflow
Overall flow
- The user submits a job to the cluster
- The cluster first splits the input data source
- The master schedules workers to run map tasks
- Each map worker reads its input split
- The worker runs the map task and writes its output to local disk
- The master schedules workers to run reduce tasks; the reduce workers read the map tasks' output files
- The reduce tasks run and write their output to HDFS
The details are roughly as follows:
- Job submission: the user submits a job to the cluster
- Splitting: the cluster locates the source data and does basic preprocessing on it
- Tokenizing (the map function runs once per input line): the cluster (the YARN application) assigns map tasks to worker nodes
- Mapping: the intermediate data is stored locally
- Partitioning (partition)
- Sorting (or secondary sorting)
- Combining (combine; optional, and after combining the keys are unordered) and grouping (group): in my understanding these happen after sorting and before the shuffle
- Shuffling (after the shuffle the keys are ordered): the shuffle spans the mapper and the reducer, taking place between the mapper's output stage and the reducer's input stage
- Reducing (reduce): the cluster (the YARN application) assigns reduce tasks to worker nodes
Data format
Specify the output format via a shell command:
yarn jar jar_path main_class_path -Dmapred.output.compress=true -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec <in> <out>
GenericOptionsParser op = new GenericOptionsParser(conf, args) creates the argument parser
op.getRemainingArgs() returns the remaining (positional) arguments
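For reference, a minimal sketch (variable names are illustrative; assumed to run inside main, so args are the command-line arguments) of how the -D options above are absorbed into the Configuration by the parser, leaving only <in> and <out> as remaining arguments:
Configuration conf = new Configuration();
GenericOptionsParser parser = new GenericOptionsParser(conf, args);
String[] remaining = parser.getRemainingArgs(); // only <in> and <out> are left here
// options passed with -D are now readable from the Configuration
boolean compress = conf.getBoolean("mapred.output.compress", false);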
Partitioning
Determining the number of map tasks: determined jointly by the input file format, the file size, and the block size
A non-splittable input generally gets only one mapper
A splittable input is divided by 128 MB (Hadoop's default block size); however many blocks result, that is how many mappers there are
Determining the number of reduce tasks: either estimated automatically by the system from the size of the input data, or specified manually
Automatic estimation: generally sized so that each reducer's input is around the reducer's default of 64 MB
Manual specification: control the number of reducers by defining the number of partitions, then set the reducer count before running the job
The actual number of partitions is less than or equal to the number of reducers; a partition is a local concept, while a reducer is a global concept
1) Default partition rule: (key.hashCode() & Integer.MAX_VALUE) % reduceNums (this obviously depends on the number of reducers; once there is more than one reducer, specify a partition rule as needed so that records with the same key, per your requirements, go to the same partition)
2) Custom partition rule
(1) Write a custom partitioner class that extends the Partitioner class and overrides the getPartition method
(2) job.setPartitionerClass(cls): the driver registers the partitioner class
(3) -Dmapred.reduce.tasks=2: a shell-command parameter that specifies the number of reducers
job.setNumReduceTasks(2): the Java API way to specify the reducer count; use either this or the shell parameter
Sample code:
class MyPartitoner extends Partitioner<NewStudenInfo, Text> {
@Override
public int getPartition(NewStudenInfo arg0, Text arg1, int arg2) {
return (arg0.getId().hashCode() & Integer.MAX_VALUE) % arg2;
}
}
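As a usage note, a minimal sketch of how this partitioner would be wired into the driver (the complete driver appears in the secondary-sort example at the end):
job.setNumReduceTasks(2);                    // two reducers, hence at most two partitions with data
job.setPartitionerClass(MyPartitoner.class); // records with the same id go to the same partition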
Shuffle
After the shuffle the keys are already sorted, but not yet merged (values with the same key are not yet grouped together)
Secondary sort (non-standard approach: sort inside the reducer; standard approach: the MapReduce framework sorts everything during the shuffle)
- Non-standard: store each key's values in a list and sort them with a Comparable or Comparator
- Standard: combine the primary key and the secondary key into a new key; the new key's class must be serializable and comparable:
implement WritableComparable<NewKey> and override the compareTo(), write(), and readFields() methods
the overridden compareTo method must compare by the composite key: compare the primary key first, and compare the secondary key only when the primary keys are equal
Once the MapReduce framework is used for secondary sorting, a grouping rule must be specified (a grouping class that extends WritableComparator and overrides its compare method) to override the composite key's default grouping, so that grouping is done by the composite key's primary key; otherwise records with the same primary key will not end up in the same group
Sample code:
class MyGroupSortComparator extends WritableComparator {
public MyGroupSortComparator() {
super(NewStudenInfo.class, true); // register the new-key class as the type being compared
}
@Override
	public int compare(WritableComparable a, WritableComparable b) { // override this overload; the framework's raw comparator delegates to it
		NewStudenInfo a1 = (NewStudenInfo) a;
		NewStudenInfo b1 = (NewStudenInfo) b;
		return b1.getId().compareTo(a1.getId()); // group by the primary key (id) only
}
}
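For context, the comparator above is registered in the driver alongside the composite key's map output types (see the full driver in the secondary-sort example at the end):
job.setMapOutputKeyClass(NewStudenInfo.class);               // the composite key emitted by the mapper
job.setGroupingComparatorClass(MyGroupSortComparator.class); // group reducer input by the primary key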
Mapper class: K1/V1 are the input types, K2/V2 are the output types
Process the input v1 as needed, iterate over the processed results to build each k2 with its corresponding v2, and write the (k2, v2) pairs to the framework's Context object
extends Mapper<K1, V1, K2, V2>
@Override map(K1 key, V1 value, Context context)
The actual flow is shown in the code examples below.
Reducer class
Loop over the input values (all values grouped under one key), process them into the output value, attach the corresponding output key, and write the (key, value) pair to the framework's Context object
extends Reducer<K2, V2, K3, V3> (the input types must match the Mapper's output types)
@Override reduce(K2 key, Iterable<V2> values, Context context)
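A minimal sketch (class names and types are illustrative, not tied to the examples below; the imports are the same Hadoop classes used in the WordCount example) showing how the Reducer's input types line up with the Mapper's output types:
class SketchMapper extends Mapper<Object, Text, Text, IntWritable> {
	private final Text outKey = new Text();
	private final IntWritable one = new IntWritable(1);
	@Override
	protected void map(Object key, Text value, Context context)
			throws IOException, InterruptedException {
		outKey.set(value.toString());
		context.write(outKey, one); // emits (Text, IntWritable)
	}
}
class SketchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
	private final IntWritable total = new IntWritable();
	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		int sum = 0;
		for (IntWritable v : values) { // all values grouped under this key
			sum += v.get();
		}
		total.set(sum);
		context.write(key, total); // one (key, value) pair per key
	}
}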
Driver class
When creating the project, modify the pom.xml file
Obtain a logger (log4j): Logger logger = Logger.getLogger(cls);
Create the HDFS configuration instance: Configuration conf = new Configuration();
Pass parameters through conf: conf.set(name, value); conf.get(name)
Get the argument parser, which filters out the generic system options:
GenericOptionsParser gop = new GenericOptionsParser(conf, args);
String[] remainingArgs = gop.getRemainingArgs(); // the remaining arguments, used for validation
if (remainingArgs.length != 2) {
	System.err.println("Usage: yarn jar jar_path main_class_path -D <options> <in> <out>");
	System.exit(2);
}
Get the Job instance: Job job = Job.getInstance(conf, "jobName"); load the jar containing the main class: job.setJarByClass(cls);
Set the Mapper: job.setMapperClass(cls);
Set the Reducer: job.setReducerClass(cls);
Set the number of reducers: job.setNumReduceTasks(2);
// Alternatively, pass -Dmapred.reduce.tasks=2 on the shell command line to set the reducer count; use either this or the line above
Set the Partitioner: job.setPartitionerClass(cls); // specify the partition rule
Set the grouping comparator: job.setGroupingComparatorClass(cls); // specify the grouping rule
Set the map output key class: job.setMapOutputKeyClass(cls); // required when the map and reduce output types differ
Set the map output value class: job.setMapOutputValueClass(cls); // required when the map and reduce output types differ
Set the output key class: job.setOutputKeyClass(cls);
Set the output value class: job.setOutputValueClass(cls);
Input path (HDFS): FileInputFormat.addInputPath(job, new Path(args[0]));
Output path (HDFS): FileOutputFormat.setOutputPath(job, new Path(args[1]));
Have the client wait for job completion and exit with its status:
System.exit(job.waitForCompletion(true) ? 0 : 1);
When finished, package the jar and upload it
Run it with the yarn jar shell command to test and go live
The full code examples follow
Word count
package cn.tl.test_6;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
/**
 * WordCount. The input path is /tmp/tianliangedu/input_words,
 * the output path is /tmp/tianliangedu/output_2018, and the fields on each line are separated by spaces.
 */
public class MRwordCounts {
public static void main(String[] args) throws IOException,
ClassNotFoundException, InterruptedException {
Configuration hdpConf = new Configuration();
GenericOptionsParser gop = new GenericOptionsParser(hdpConf, args);
String[] remainingArgs = gop.getRemainingArgs();
if (remainingArgs.length != 2) {
System.err
		.println("Usage: yarn jar jar_path main_class_path -D <options> <in> <out>");
System.exit(2);
}
Job job = Job.getInstance(hdpConf, "wordCounts");
job.setJarByClass(MRwordCounts.class);
job.setMapperClass(MyTokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
class MyTokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
private Text word = new Text();
private IntWritable one = new IntWritable(1);
@Override
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] s = value.toString().split(" ");
for (String string : s) {
word.set(string);
context.write(word, one);
}
}
}
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
IntWritable value = new IntWritable();
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable intWritable : values) {
sum += intWritable.get();
}
value.set(sum);
context.write(key, value);
}
}
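For reference, an example invocation of this job (the jar file name is illustrative; the input and output paths are the ones given in the class comment above):
yarn jar wordcount.jar cn.tl.test_6.MRwordCounts /tmp/tianliangedu/input_words /tmp/tianliangedu/output_2018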
Secondary sort
package cn.tl.test_9;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
/**
 * Idea: in the Mapper, rebuild the key so that the primary sort key and the secondary key
 * form a composite key (the composite key class implements the WritableComparable interface
 * and overrides the compareTo method). The subsequent shuffle then sorts by the composite key,
 * at which point the data is fully ordered as required.
 * Once the MR framework sorts by the composite key, a grouping rule must be specified
 * (a grouping class that extends WritableComparator and overrides its compare method) to override
 * the composite key's default grouping, so that grouping uses the composite key's primary key;
 * otherwise records with the same primary key will not end up in the same group.
 */
public class MRTwiceSrotStu {
public static void main(String[] args) throws IOException,
ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
GenericOptionsParser gop = new GenericOptionsParser(conf, args);
String[] remainingArgs = gop.getRemainingArgs();
if (remainingArgs.length != 2) {
System.err
		.println("Usage: yarn jar jar_path main_class_path -D <options> <in> <out>");
System.exit(2);
}
Job job = Job.getInstance(conf, "counts");
job.setJarByClass(MRTwiceSrotStu.class);
job.setMapperClass(MyMapper.class);
		// job.setCombinerClass(MyReducer.class); // usable only when the reducer's input and output types match
job.setReducerClass(MyReducer.class);
		job.setNumReduceTasks(2); // set the number of reducers
		job.setPartitionerClass(MyPartitoner.class); // specify the partition rule
		job.setGroupingComparatorClass(MyGroupSortComparator.class); // specify the grouping rule
		job.setMapOutputKeyClass(NewStudenInfo.class); // required when the mapper and reducer output types differ
		job.setMapOutputValueClass(Text.class); // required when the mapper and reducer output types differ
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
class MyMapper extends Mapper<Object, Text, NewStudenInfo, Text> {
private NewStudenInfo nk = null;
private Text nv = new Text();
@Override
protected void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
String[] strArr = value.toString().split("\\t");
if (strArr.length == 4) {
nk = new NewStudenInfo(strArr[0], Integer.parseInt(strArr[3]));
nv.set(strArr[1] + "\t" + strArr[2] + "\t" + strArr[3]);
context.write(nk, nv);
} else {
try {
				throw new Exception("irregular data format");
} catch (Exception e) {
e.printStackTrace();
}
}
}
}
class MyReducer extends Reducer<NewStudenInfo, Text, Text, Text> {
private Text nv = new Text();
@Override
protected void reduce(NewStudenInfo arg0, Iterable<Text> arg1, Context arg2)
throws IOException, InterruptedException {
for (Text st : arg1) {
nv.set(arg0.getId());
arg2.write(nv, st);
			// break; // simple de-duplication: keep only one record per key
}
}
}
class NewStudenInfo implements WritableComparable<NewStudenInfo> {
private String id;
private Integer score;
	public NewStudenInfo() {
		// no-arg constructor; required so Hadoop can instantiate the key during deserialization
	}
	public NewStudenInfo(String id, Integer score) {
super();
this.id = id;
this.score = score;
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public Integer getScore() {
return score;
}
public void setScore(Integer score) {
this.score = score;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(id);
out.writeInt(score);
}
@Override
public void readFields(DataInput in) throws IOException {
id = in.readUTF();
score = in.readInt();
}
@Override
public int compareTo(NewStudenInfo o) {
int val = this.getId().compareTo(o.getId());
if (val == 0) {
val = this.getScore().compareTo(o.getScore());
}
return val;
}
}
class MyPartitoner extends Partitioner<NewStudenInfo, Text> {
@Override
public int getPartition(NewStudenInfo arg0, Text arg1, int arg2) {
return (arg0.getId().hashCode() & Integer.MAX_VALUE) % arg2;
}
}
class MyGroupSortComparator extends WritableComparator {
public MyGroupSortComparator() {
super(NewStudenInfo.class, true);
}
@Override
	public int compare(WritableComparable a, WritableComparable b) { // override this overload; the framework's raw comparator delegates to it
		NewStudenInfo a1 = (NewStudenInfo) a;
		NewStudenInfo b1 = (NewStudenInfo) b;
		return b1.getId().compareTo(a1.getId()); // group by the primary key (id) only
}
}