In the Hadoop ecosystem, MapReduce can split large data sets or large files into partitions during the map phase so that the reduce phase can process them in parallel; the number of partitions normally equals the number of reduce tasks. A bean that implements Hadoop's WritableComparable interface (serialization plus comparison) can be sorted by the MapReduce framework during the shuffle. Grouping lets the reduce phase gather records by a custom grouping property and handle each group in a single reduce() call.
This article ties these points together with a small demo: finding the largest amount in each order.
Order bean: serialization and sorting
Implement the methods of the WritableComparable interface: Writable is Hadoop's serialization interface, and Comparable supplies the comparison used when the framework sorts the keys. Also override toString(), because MapReduce calls toString() when writing the key to the output.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

/**
 * Order bean: implements WritableComparable, i.e. Hadoop serialization (Writable)
 * plus comparison (Comparable) for sorting.
 */
public class OrderBean implements WritableComparable<OrderBean> {
    private Text orderId;
    private DoubleWritable amount;

    public Text getOrderId() {
        return orderId;
    }
    public void setOrderId(Text orderId) {
        this.orderId = orderId;
    }
    public DoubleWritable getAmount() {
        return amount;
    }
    public void setAmount(DoubleWritable amount) {
        this.amount = amount;
    }

    // Serialize the fields in a fixed order
    public void write(DataOutput out) throws IOException {
        out.writeUTF(orderId.toString());
        out.writeDouble(amount.get());
    }

    // Deserialize the fields in the same order they were written
    public void readFields(DataInput in) throws IOException {
        String readUTF = in.readUTF();
        double readDouble = in.readDouble();
        this.orderId = new Text(readUTF);
        this.amount = new DoubleWritable(readDouble);
    }

    // Sort by orderId first; within the same order, sort by amount descending (largest first)
    public int compareTo(OrderBean o) {
        int res = this.getOrderId().compareTo(o.getOrderId());
        if (res == 0) {
            res = -this.getAmount().compareTo(o.getAmount());
        }
        return res;
    }

    // Override toString(); MapReduce calls toString() when writing the key to the output file
    @Override
    public String toString() {
        return orderId.toString() + "\t" + amount.get();
    }
}
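As a quick sanity check of the ordering contract, a throwaway class like the following (just an illustration, assuming hadoop-common is on the classpath; it is not part of the job) shows that within one orderId the larger amount sorts first:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;

public class OrderBeanSortCheck {
    public static void main(String[] args) {
        OrderBean small = new OrderBean();
        small.setOrderId(new Text("Order_0000001"));
        small.setAmount(new DoubleWritable(25.8));

        OrderBean large = new OrderBean();
        large.setOrderId(new Text("Order_0000001"));
        large.setAmount(new DoubleWritable(222.8));

        List<OrderBean> keys = new ArrayList<OrderBean>();
        keys.add(small);
        keys.add(large);
        Collections.sort(keys);

        // Within the same orderId the larger amount sorts first, so after the shuffle
        // the first key of each order group carries the maximum amount.
        System.out.println(keys.get(0)); // Order_0000001    222.8
    }
}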
Partitioning
Hadoop's default partitioning is hash-mod partitioning: the hash code of the partition key is taken modulo the number of reduce tasks, so records with the same key always land in the same partition, and each partition is handled by one ReduceTask in the reduce phase. Here we partition by orderId:
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Partitions map output by orderId
 */
public class OrderIdPartitioner extends Partitioner<OrderBean, NullWritable> {
    /**
     * The number of partitions is determined by the number of ReduceTasks,
     * i.e. numPartitions == number of ReduceTasks.
     */
    @Override
    public int getPartition(OrderBean key, NullWritable value, int numPartitions) {
        // Mask off the sign bit so a negative hashCode never yields a negative partition
        return (key.getOrderId().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
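For intuition, the partition number is stable for a given orderId, so every record of an order reaches the same reducer. A quick local check might look like this (illustration only, not part of the job; assumes hadoop-common on the classpath):
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class PartitionerCheck {
    public static void main(String[] args) {
        OrderIdPartitioner partitioner = new OrderIdPartitioner();
        OrderBean key = new OrderBean();
        key.setOrderId(new Text("Order_0000001"));
        key.setAmount(new DoubleWritable(222.8));

        // With 2 reduce tasks, every record of Order_0000001 maps to the same partition (0 or 1)
        int partition = partitioner.getPartition(key, NullWritable.get(), 2);
        System.out.println(partition);
    }
}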
Grouping
Grouping in Hadoop is implemented with WritableComparator. In practice you define a grouping class that extends WritableComparator, pass the key type to the parent constructor, override compare() so that only the grouping property is compared, and register this class on the job when submitting it to YARN.
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * Grouping comparator: groups keys on the reduce side by orderId
 */
public class OrderGroupingComparator extends WritableComparator {
    // Tell the parent class which bean this comparator works on
    public OrderGroupingComparator() {
        super(OrderBean.class, true);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        OrderBean orderA = (OrderBean) a;
        OrderBean orderB = (OrderBean) b;
        // Keys with the same orderId compare as equal, so they fall into one reduce() call
        return orderA.getOrderId().compareTo(orderB.getOrderId());
    }
}
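In other words, two keys with the same orderId but different amounts are treated as one group. A throwaway check (illustration only; assumes hadoop-common on the classpath) could confirm this:
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;

public class GroupingCheck {
    public static void main(String[] args) {
        OrderBean a = new OrderBean();
        a.setOrderId(new Text("Order_0000001"));
        a.setAmount(new DoubleWritable(222.8));

        OrderBean b = new OrderBean();
        b.setOrderId(new Text("Order_0000001"));
        b.setAmount(new DoubleWritable(25.8));

        // 0 means "same group": both records are delivered to a single reduce() call
        System.out.println(new OrderGroupingComparator().compare(a, b));
    }
}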
The MapReduce job
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Job submission class: uses the grouping comparator to aggregate by order on the reduce side
 */
public class RunMain {

    static class OrderGroupMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {
        OrderBean order = new OrderBean();
        NullWritable NullCons = NullWritable.get();

        // Parse each tab-separated line into the bean and emit it as the key
        @Override
        protected void map(
                LongWritable key,
                Text value,
                Mapper<LongWritable, Text, OrderBean, NullWritable>.Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            order.setOrderId(new Text(fields[0]));
            order.setAmount(new DoubleWritable(Double.parseDouble(fields[2])));
            context.write(order, NullCons);
        }
    }

    static class OrderGroupReduce extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
        // No extra processing: keys arrive sorted with the largest amount first and grouped by
        // orderId, so writing the key once per group outputs the maximum amount of each order
        @Override
        protected void reduce(
                OrderBean arg0,
                Iterable<NullWritable> arg1,
                Reducer<OrderBean, NullWritable, OrderBean, NullWritable>.Context context)
                throws IOException, InterruptedException {
            context.write(arg0, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(RunMain.class);
        job.setMapperClass(OrderGroupMapper.class);
        job.setReducerClass(OrderGroupReduce.class);
        // Output key/value types (the map output types are the same, so no separate setting is needed)
        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);
        // Input and output paths on HDFS
        FileInputFormat.setInputPaths(job, new Path("hdfs://server1:9000/grouptest/order.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://server1:9000/grouptest/result"));
        // Number of ReduceTasks, i.e. the number of partitions
        job.setNumReduceTasks(2);
        // Grouping comparator used during the shuffle
        job.setGroupingComparatorClass(OrderGroupingComparator.class);
        // Partitioner used during the shuffle
        job.setPartitionerClass(OrderIdPartitioner.class);
        // Submit the job and print progress information
        boolean completion = job.waitForCompletion(true);
        System.exit(completion ? 0 : 1);
    }
}
Running the test
- Start the Hadoop cluster and upload the input data to HDFS. Each line of the order data holds an order id, a product id and an amount (tab-separated, matching the mapper's split("\t")):
Order_0000001 Pdt_01 222.8
Order_0000001 Pdt_05 25.8
Order_0000002 Pdt_03 522.8
Order_0000002 Pdt_04 122.4
Order_0000002 Pdt_05 722.4
Order_0000003 Pdt_01 223.8
Order_0000003 Pdt_01 23.8
Order_0000003 Pdt_01 322.8
Order_0000004 Pdt_01 701.9
Order_0000004 Pdt_01 120.8
- Export the above Java files as testgroup.jar, upload it to the server, and run:
hadoop jar testgroup.jar com.spark.mapreduce.group.RunMain
- Check the results
The job's progress and status can be monitored in the YARN web UI at http://192.168.10.121:8088/.
Since the number of ReduceTasks was set to 2, two result files are produced.
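Given the sample input above, the combined content of the two part files should contain one line per order with that order's largest amount (which file each order lands in depends on the hash of its orderId), roughly:
Order_0000001	222.8
Order_0000002	722.4
Order_0000003	322.8
Order_0000004	701.9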