MapReduce Java Framework Development Workflow

Overall Workflow

  1. The user submits a job to the cluster
  2. The cluster first splits the input data source
  3. The master schedules workers to run map tasks
  4. Each worker reads its input split
  5. The worker runs the map task and stores its output locally
  6. The master schedules workers to run reduce tasks; reduce workers read the map tasks' output files
  7. The reduce tasks run and write their output to HDFS

In more detail:

The user submits a job to the cluster.

Split: the cluster locates the source data and performs basic preprocessing on it.

Tokenize (the map function runs once per input line): the cluster (a YARN application) assigns worker nodes for the map tasks.

Map: the intermediate output is stored on the workers' local disks.

Partition.

Sort (or secondary sort).

Combine: optional; values are aggregated per key, and after combining the keys are not ordered. Group: in my understanding, grouping happens after sorting and before the shuffle.

Shuffle: after the shuffle the keys are ordered. The shuffle spans the mapper and the reducer; it happens between the mapper's output and the reducer's input.

Reduce: the cluster (a YARN application) assigns worker nodes for the reduce tasks.

Data Format

Specifying the output format from the shell:

yarn jar jar_path main_class_path -Dmapred.output.compress=true -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec <in> <out>

GenericOptionsParser op = new GenericOptionsParser(conf, args)    option parser

op.getRemainingArgs()             remaining (positional) arguments
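
The same gzip output compression can also be configured from the driver code instead of the -D options above; a minimal sketch using the FileOutputFormat helpers (the -D names shown are the older mapred.* property aliases):

// equivalent driver-side configuration for gzip-compressed job output
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, org.apache.hadoop.io.compress.GzipCodec.class);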

 

Partitioning

Determining the number of map tasks: decided jointly by the input file format, the file sizes, and the block size.

Input that cannot be split is processed by a single mapper.

Splittable input is divided by the 128 MB default Hadoop block size; there is one mapper per block.

Determining the number of reduce tasks: either estimated automatically from the input size, or set manually.

     Automatic estimation: roughly the reducer input size divided by the default amount of data one reducer handles (64 MB).

     Manual setting: design the number of partitions to control the reducer count, then set the reducer count before the job runs.

The number of (non-empty) partitions is at most the number of reducers; a partition is a local (per-mapper) concept, whereas a reducer is a global one.

1) Default partitioning rule: (key.hashCode() & Integer.MAX_VALUE) % reduceNums. This obviously depends on the reducer count; once there is more than one reducer you may need to specify a partitioning rule so that records sharing the same key (as defined by your requirements) land in the same partition. (For example, with 3 reducers the key "hello", whose hashCode() is 99162322, goes to partition 99162322 % 3 = 1.)

2) Custom partitioning rule

(1) Write a partitioner class that extends Partitioner and overrides getPartition.

(2) job.setPartitionerClass(cls)             register the partitioning rule in the driver

(3) -Dmapred.reduce.tasks=2           set the reducer count from the shell, or

job.setNumReduceTasks(2)        set it through the Java API; use either one

Sample code:

class MyPartitoner extends Partitioner<NewStudenInfo, Text> {

	@Override
	public int getPartition(NewStudenInfo arg0, Text arg1, int arg2) {
		// partition by the natural key (id) so that records with the same id reach the same reducer
		return (arg0.getId().hashCode() & Integer.MAX_VALUE) % arg2;
	}
}
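
In the driver, this partitioner only takes effect together with a reducer count greater than one; a minimal wiring sketch (the same calls appear in the full program later in this document):

job.setNumReduceTasks(2);                    // number of reducers, i.e. number of partitions
job.setPartitionerClass(MyPartitoner.class); // route records with the same id to the same partition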

Shuffle
After the shuffle the keys are already sorted, but not yet merged (values sharing the same key have not been aggregated).
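
That per-key aggregation can be pulled forward to the map side with a combiner; a minimal sketch, valid only when the reducer's input and output types match (the WordCount example below does exactly this):

job.setCombinerClass(IntSumReducer.class); // run the reducer logic locally on map output before it is shuffled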

 

Secondary Sort (non-standard: done inside the reducer; standard: the MapReduce framework sorts everything during the shuffle)

  1. Non-standard: collect the values of each key into a list and sort it with Comparable or a Comparator.
  2. Standard: combine the primary key and the secondary key into a new composite key; the composite-key class must be serializable and comparable:

It implements WritableComparable<NewKey> and overrides compareTo(), write() and readFields().

The overridden compareTo() compares by the composite key: compare the primary key first, and only compare the secondary key when the primary keys are equal.
       Once the MapReduce framework is used for the secondary sort, a grouping rule must also be specified (a grouping class that extends WritableComparator and overrides its compare method) to override the composite key's default grouping, so that grouping is done on the composite key's primary key; otherwise records with the same primary key will not end up in the same group.

 

Sample code:

class MyGroupSortComparator extends WritableComparator {

	public MyGroupSortComparator() {
		super(NewStudenInfo.class, true); // register the composite-key class this comparator compares
	}

	@Override
	public int compare(Object a, Object b) {
		// group only by the primary key (id); the secondary key is ignored here
		NewStudenInfo a1 = (NewStudenInfo) a;
		NewStudenInfo b1 = (NewStudenInfo) b;
		return b1.getId().compareTo(a1.getId());
	}
}
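
Like the partitioner, the grouping comparator is registered in the driver, along with the map output key/value classes, since they now differ from the reduce output types; a minimal wiring sketch (same calls as in the full program below):

job.setGroupingComparatorClass(MyGroupSortComparator.class); // group reducer input by the primary key only
job.setMapOutputKeyClass(NewStudenInfo.class);               // composite key produced by the mapper
job.setMapOutputValueClass(Text.class);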

Mapper class: k1/v1 are the input types, k2/v2 the output types.

              Process the input v1 as needed, iterate over the processed results to obtain k2, attach the corresponding v2, and write the k2/v2 pair to the framework's Context object.

extends Mapper<Key1, Value1, Key2, Value2>

 @Override  map(Key1 key, Value1 value, Context context);
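
A minimal sketch in the WordCount style (the class name here is just for illustration; the full version with imports appears later in this document):

class SketchMapper extends Mapper<Object, Text, Text, IntWritable> {
	private final Text word = new Text();
	private final IntWritable one = new IntWritable(1);

	@Override
	protected void map(Object key, Text value, Context context)
			throws IOException, InterruptedException {
		// v1 (one input line) is turned into several k2/v2 pairs and written to the framework context
		for (String token : value.toString().split(" ")) {
			word.set(token);
			context.write(word, one);
		}
	}
}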

      

Actual development flow

       Reducer class

              Loop over the input values v1, process them into v2, attach the corresponding key, and write the k/v2 pair to the framework's Context object.

              extends Reducer<Key1, Value1, Key2, Value2>

@Override  reduce(Key1 key, Iterable<Value1> values, Context context)
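
A matching minimal sketch that sums the counts per key (class name for illustration only; the full version appears in the WordCount code below):

class SketchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
	private final IntWritable total = new IntWritable();

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		// loop over all values grouped under this key and aggregate them into a single v2
		int sum = 0;
		for (IntWritable v : values) {
			sum += v.get();
		}
		total.set(sum);
		context.write(key, total);
	}
}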

 

       Driver class

When creating the project, modify the pom.xml file.

Get a logger                              Logger logger = Logger.getLogger(cls);

Create the HDFS Configuration singleton   Configuration conf = new Configuration();

Pass parameters through conf              conf.set(name, value);         conf.get(name)

Get the option parser                     it also filters out the generic/system options

GenericOptionsParser gop = new GenericOptionsParser(conf, args);
              String[] remainingArgs = gop.getRemainingArgs(); // remaining positional arguments, used for validation
              if (remainingArgs.length != 2) {
                     System.err.println("Usage: yarn jar jar_path main_class_path -D <options> <in> <out>");
                     System.exit(2);
              }

Get the Job instance                      Job job = Job.getInstance(conf, "jobName");
Set the jar via the main class            job.setJarByClass(cls);
Set the Mapper                            job.setMapperClass(cls);
Set the Reducer                           job.setReducerClass(cls);
Set the number of reduce tasks            job.setNumReduceTasks(2);
// or -Dmapred.reduce.tasks=2 from the shell; use either one
Set the Partitioner                       job.setPartitionerClass(cls);        // partitioning rule
Set the grouping comparator               job.setGroupingComparatorClass(cls); // grouping rule
Set the map output key class              job.setMapOutputKeyClass(cls);       // needed when map and reduce output types differ
Set the map output value class            job.setMapOutputValueClass(cls);     // needed when map and reduce output types differ
Set the output key class                  job.setOutputKeyClass(cls);
Set the output value class                job.setOutputValueClass(cls);
Input path (HDFS)                         FileInputFormat.addInputPath(job, new Path(args[0]));
Output path (HDFS)                        FileOutputFormat.setOutputPath(job, new Path(args[1]));
Have the client wait for job completion and exit accordingly
System.exit(job.waitForCompletion(true) ? 0 : 1);

When finished, package the jar and upload it.

Test and deploy with the yarn jar shell command.
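
For example (the jar file name is an assumed Maven artifact name; the main class and HDFS paths are the ones used in the WordCount code below):

mvn clean package
yarn jar target/mr-demo-1.0.jar cn.tl.test_6.MRwordCounts /tmp/tianliangedu/input_words /tmp/tianliangedu/output_2018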

The full working code follows.

Word Count

package cn.tl.test_6;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * WordCount. The input path is /tmp/tianliangedu/input_words, the output path is
 * /tmp/tianliangedu/output_2018, and fields in each line are separated by spaces.
 */
public class MRwordCounts {

	public static void main(String[] args) throws IOException,
			ClassNotFoundException, InterruptedException {
		Configuration hdpConf = new Configuration();
		GenericOptionsParser gop = new GenericOptionsParser(hdpConf, args);
		String[] remainingArgs = gop.getRemainingArgs();
		if (remainingArgs.length != 2) {
			System.err
					.println("Usage: yarn jar jar_path main_class_path -D <options> <in> <out>");
			System.exit(2);
		}
		Job job = Job.getInstance(hdpConf, "wordCounts");
		job.setJarByClass(MRwordCounts.class);
		job.setMapperClass(MyTokenizerMapper.class);
		job.setCombinerClass(IntSumReducer.class);
		job.setReducerClass(IntSumReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

}

class MyTokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
	private Text word = new Text();
	private IntWritable one = new IntWritable(1);

	@Override
	public void map(Object key, Text value, Context context)
			throws IOException, InterruptedException {
		String[] s = value.toString().split(" ");
		for (String string : s) {
			word.set(string);
			context.write(word, one);
		}
	}
}

class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
	IntWritable value = new IntWritable();

	@Override
	public void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		int sum = 0;
		for (IntWritable intWritable : values) {
			sum += intWritable.get();
		}
		value.set(sum);
		context.write(key, value);
	}
}

Secondary Sort

package cn.tl.test_9;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Idea: in the Mapper, rebuild the key so that the primary sort key and the secondary key
 * form a composite key (the composite-key class implements the WritableComparable interface
 * and overrides compareTo). The subsequent shuffle then sorts by this composite key, giving
 * the fully ordered result required.
 * Once the MR framework sorts by the composite key, a grouping rule must also be specified
 * (a grouping class that extends WritableComparator and overrides its compare method) to
 * override the composite key's default grouping, so that grouping uses the composite key's
 * primary key; otherwise records with the same primary key will not land in the same group.
 */
public class MRTwiceSrotStu {

	public static void main(String[] args) throws IOException,
			ClassNotFoundException, InterruptedException {
		Configuration conf = new Configuration();
		GenericOptionsParser gop = new GenericOptionsParser(conf, args);
		String[] remainingArgs = gop.getRemainingArgs();
		if (remainingArgs.length != 2) {
			System.err
					.println("Usage: yarn jar jar_path main_class_path -D <options> <in> <out>");
			System.exit(2);
		}
		Job job = Job.getInstance(conf, "counts");
		job.setJarByClass(MRTwiceSrotStu.class);
		job.setMapperClass(MyMapper.class);
		// job.setCombinerClass(MyReducer.class); // only usable when the reducer's input and output types match
		job.setReducerClass(MyReducer.class);
		job.setNumReduceTasks(2); // number of reduce tasks
		job.setPartitionerClass(MyPartitoner.class); // partitioning rule
		job.setGroupingComparatorClass(MyGroupSortComparator.class); // grouping rule
		job.setMapOutputKeyClass(NewStudenInfo.class); // required because map and reduce output types differ
		job.setMapOutputValueClass(Text.class); // required because map and reduce output types differ
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

}

class MyMapper extends Mapper<Object, Text, NewStudenInfo, Text> {
	private NewStudenInfo nk = null;
	private Text nv = new Text();

	@Override
	protected void map(Object key, Text value, Context context)
			throws IOException, InterruptedException {
		String[] strArr = value.toString().split("\\t");
		if (strArr.length == 4) {
			nk = new NewStudenInfo(strArr[0], Integer.parseInt(strArr[3]));
			nv.set(strArr[1] + "\t" + strArr[2] + "\t" + strArr[3]);
			context.write(nk, nv);
		} else {
			// malformed record: expected 4 tab-separated fields, skip it
			System.err.println("Malformed record skipped: " + value.toString());
		}
	}
}

class MyReducer extends Reducer<NewStudenInfo, Text, Text, Text> {
	private Text nv = new Text();

	@Override
	protected void reduce(NewStudenInfo arg0, Iterable<Text> arg1, Context arg2)
			throws IOException, InterruptedException {
		for (Text st : arg1) {
			nv.set(arg0.getId());
			arg2.write(nv, st);
			// break; // uncomment to keep only the first record per group (simple de-duplication)
		}

	}
}

class NewStudenInfo implements WritableComparable<NewStudenInfo> {
	private String id;
	private Integer score;

	public NewStudenInfo() {
		// no-arg constructor is required: Hadoop instantiates the key class via reflection during deserialization
	}

	public NewStudenInfo(String id, Integer score) {
		this.id = id;
		this.score = score;
	}

	public String getId() {
		return id;
	}

	public void setId(String id) {
		this.id = id;
	}

	public Integer getScore() {
		return score;
	}

	public void setScore(Integer score) {
		this.score = score;
	}

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeUTF(id);
		out.writeInt(score);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		id = in.readUTF();
		score = in.readInt();
	}

	@Override
	public int compareTo(NewStudenInfo o) {
		// compare by the primary key (id) first; use the secondary key (score) only when the ids are equal
		int val = this.getId().compareTo(o.getId());
		if (val == 0) {
			val = this.getScore().compareTo(o.getScore());
		}
		return val;
	}

}

class MyPartitoner extends Partitioner<NewStudenInfo, Text> {

	@Override
	public int getPartition(NewStudenInfo arg0, Text arg1, int arg2) {

		// partition by the natural key (id) so that all records with the same id go to the same reducer
		return (arg0.getId().hashCode() & Integer.MAX_VALUE) % arg2;
	}

}

class MyGroupSortComparator extends WritableComparator {

	public MyGroupSortComparator() {
		super(NewStudenInfo.class, true); // register the composite-key class; 'true' lets the framework create key instances
	}

	@Override
	public int compare(Object a, Object b) {
		// group only by the primary key (id); ordering among equal ids was already handled by compareTo
		NewStudenInfo a1 = (NewStudenInfo) a;
		NewStudenInfo b1 = (NewStudenInfo) b;
		return b1.getId().compareTo(a1.getId());
	}
}
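
Submitting the secondary-sort job follows the same pattern (placeholders as in the usage string; the driver already fixes the reducer count at 2):

yarn jar jar_path cn.tl.test_9.MRTwiceSrotStu <in> <out>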