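In the Hadoop ecosystem, MapReduce can partition large data sets or large files during the map phase so that the reduce phase can process them in parallel; the number of partitions generally equals the number of reduce tasks. A bean that implements Hadoop's WritableComparable interface (serialization plus ordering) can be sorted and grouped in MapReduce, which lets the reduce phase process records grouped by a custom attribute.
This article ties these points together with a demo that finds the largest amount in each order.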

Serializing and sorting the order bean

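Implement the methods of the WritableComparable interface: Writable is Hadoop's serialization interface, and Comparable is the comparison interface that the sort calls. Also override toString, which MapReduce calls when it writes the bean to the text output.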

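import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

/**
 * Order bean that implements Hadoop's Writable (serialization) and Comparable (comparison) interfaces.
 */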
public class OrderBean implements WritableComparable<OrderBean>{
    private Text orderId;
    private DoubleWritable amount;

    public Text getOrderId() {
        return orderId;
    }
    public void setOrderId(Text orderId) {
        this.orderId = orderId;
    }
    public DoubleWritable getAmount() {
        return amount;
    }
    public void setAmount(DoubleWritable amount) {
        this.amount = amount;
    }

    // Serialize the bean's fields in declaration order
    public void write(DataOutput out) throws IOException {
        out.writeUTF(orderId.toString());
        out.writeDouble(amount.get());

    }
    // Read the fields back in the same order they were written
    public void readFields(DataInput in) throws IOException {
        String readUTF = in.readUTF();
        double readDouble = in.readDouble();

        this.orderId = new Text(readUTF);
        this.amount = new DoubleWritable(readDouble);
    }

    public int compareTo(OrderBean o) {
        int res = this.getOrderId().compareTo(o.getOrderId());
        if (res == 0) {
            res = -this.getAmount().compareTo(o.getAmount()); // sort amounts from largest to smallest
        }
        return res;
    }

    // Override toString, which MapReduce calls when writing the bean to the output
    @Override
    public String toString() {
        return orderId.toString() + "\t" + amount.get();
    }
}
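
The key requirement of write and readFields is that the fields are read back in the same order they were written. The following is a minimal sketch of that round trip using only the JDK's DataOutputStream and DataInputStream, so it runs without a Hadoop installation. The class OrderRoundTrip and its method names are hypothetical and not part of the original demo.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for OrderBean's write/readFields, using plain JDK streams.
public class OrderRoundTrip {

    // Writes the fields in the same order as OrderBean.write: order ID, then amount.
    public static byte[] serialize(String orderId, double amount) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(orderId);
            out.writeDouble(amount);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields back in the order they were written and formats them like toString.
    public static String deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String orderId = in.readUTF();
            double amount = in.readDouble();
            return orderId + "\t" + amount;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(deserialize(serialize("Order_0000001", 222.8)));
    }
}
```

If readFields read the double before the string, the bytes would be misinterpreted, which is why both methods must agree on the field order.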

Partitioning

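Hadoop partitions by hash code modulo: the hash code of the partitioning attribute is taken modulo the number of reduce tasks, so records with the same attribute value always land in the same partition. In the reduce phase, each partition is handled by one reduce task.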

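import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Partitions the map output by order ID.
 */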
public class OrderIdPartitioner extends Partitioner<OrderBean, NullWritable>{

    /**
     * The number of partitions is determined by the number of reduce tasks, i.e. numPartitions = number of reduce tasks.
     */
    @Override
    public int getPartition(OrderBean key, NullWritable value, int numPartitions) {
        return (key.getOrderId().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

}
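
To illustrate the hash-modulo rule, here is a small sketch in plain Java. It uses String.hashCode as a stand-in for Text.hashCode, an assumption made so the example runs without Hadoop; the two hash functions differ, so the concrete partition numbers on a real cluster may not match. PartitionSketch and partitionFor are hypothetical names.

```java
// Hypothetical illustration of hash-modulo partitioning; not part of the original demo.
public class PartitionSketch {

    // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
    // hash code cannot produce a negative partition index.
    public static int partitionFor(String orderId, int numPartitions) {
        return (orderId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] ids = {"Order_0000001", "Order_0000002", "Order_0000003", "Order_0000004"};
        for (String id : ids) {
            // Every record with the same order ID maps to the same partition.
            System.out.println(id + " -> partition " + partitionFor(id, 2));
        }
    }
}
```

Because the result depends only on the order ID and the partition count, all records of one order are guaranteed to reach the same reduce task.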

Grouping

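Grouping in Hadoop is implemented with WritableComparator. In practice you define a custom grouping class that extends WritableComparator, specifies the bean type, and implements compare on the grouping attribute, and you register that class when submitting the job to YARN.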

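import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * Groups orders by orderId on the reduce side.
 */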
public class OrderGroupingComparator extends WritableComparator{

    // Specify the bean class used for grouping
    public OrderGroupingComparator() {
        super(OrderBean.class, true);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        OrderBean orderA = (OrderBean) a;
        OrderBean orderB = (OrderBean) b;
        return orderA.getOrderId().compareTo(orderB.getOrderId());
    }
}
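
The interplay of sorting and grouping can be sketched in plain Java without Hadoop. The example below is a simulation under stated assumptions, not the framework's shuffle: it sorts records the way OrderBean.compareTo does (order ID ascending, amount descending), then treats consecutive records with the same order ID as one group and keeps only the first record of each group, which carries the largest amount. GroupingSketch, Order, maxPerOrder, and sampleOrders are hypothetical names, and the input is the sample order data from the test section of this article.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of "sort, then group by order ID"; not the Hadoop shuffle itself.
public class GroupingSketch {

    // Plain-Java stand-in for OrderBean: an order ID and an amount.
    public static final class Order {
        final String orderId;
        final double amount;

        public Order(String orderId, double amount) {
            this.orderId = orderId;
            this.amount = amount;
        }
    }

    // Sorts like OrderBean.compareTo and keeps the first record of each order ID group.
    public static List<String> maxPerOrder(List<Order> orders) {
        List<Order> sorted = new ArrayList<>(orders);
        sorted.sort((a, b) -> {
            int res = a.orderId.compareTo(b.orderId);
            if (res == 0) {
                res = -Double.compare(a.amount, b.amount); // largest amount first
            }
            return res;
        });

        List<String> result = new ArrayList<>();
        String currentId = null;
        for (Order o : sorted) {
            // A change of order ID marks the start of a new group.
            if (!o.orderId.equals(currentId)) {
                result.add(o.orderId + "\t" + o.amount);
                currentId = o.orderId;
            }
        }
        return result;
    }

    // The sample orders from this article (product IDs omitted).
    public static List<Order> sampleOrders() {
        List<Order> orders = new ArrayList<>();
        orders.add(new Order("Order_0000001", 222.8));
        orders.add(new Order("Order_0000001", 25.8));
        orders.add(new Order("Order_0000002", 522.8));
        orders.add(new Order("Order_0000002", 122.4));
        orders.add(new Order("Order_0000002", 722.4));
        orders.add(new Order("Order_0000003", 223.8));
        orders.add(new Order("Order_0000003", 23.8));
        orders.add(new Order("Order_0000003", 322.8));
        orders.add(new Order("Order_0000004", 701.9));
        orders.add(new Order("Order_0000004", 120.8));
        return orders;
    }

    public static void main(String[] args) {
        for (String line : maxPerOrder(sampleOrders())) {
            System.out.println(line);
        }
    }
}
```

For the sample data this prints one line per order: Order_0000001 with 222.8, Order_0000002 with 722.4, Order_0000003 with 322.8, and Order_0000004 with 701.9.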

The MapReduce job

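import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Job submission class that uses a grouping comparator to aggregate groups on the reduce side.
 */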
public class RunMain {

    static class OrderGroupMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {

        OrderBean order = new OrderBean();
        NullWritable NullCons = NullWritable.get();

        // Parse each input line into the bean and emit it as the key
        @Override
        protected void map(
                LongWritable key,
                Text value,
                Mapper<LongWritable, Text, OrderBean, NullWritable>.Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            order.setOrderId(new Text(fields[0]));
            order.setAmount(new DoubleWritable(Double.parseDouble(fields[2])));
            context.write(order, NullCons);
        }
    }

    static class OrderGroupReduce extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
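        // No processing needed: write the key straight out. Hadoop has already applied the custom
        // grouping class, so the key passed in is the first (largest-amount) record of each order.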
        @Override
        protected void reduce(
                OrderBean arg0,
                Iterable<NullWritable> arg1,
                Reducer<OrderBean, NullWritable, OrderBean, NullWritable>.Context context)
                throws IOException, InterruptedException {
            context.write(arg0, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);
        job.setJarByClass(RunMain.class);

        job.setMapperClass(OrderGroupMapper.class);
        job.setReducerClass(OrderGroupReduce.class);

        // Set the output key/value types (the same for map and reduce here)
        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);

        // Specify the input and output paths
        FileInputFormat.setInputPaths(job, new Path("hdfs://server1:9000/grouptest/order.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://server1:9000/grouptest/result"));
        // Set the number of reduce tasks, i.e. the number of partitions
        job.setNumReduceTasks(2);

        // Set the grouping comparator used in the shuffle
        job.setGroupingComparatorClass(OrderGroupingComparator.class);
        // Set the partitioner used in the shuffle
        job.setPartitionerClass(OrderIdPartitioner.class);
        // Submit the job and print progress information
        boolean completion = job.waitForCompletion(true);
        System.exit(completion ? 0 : 1);
    }
}

Running the test

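  • Start the Hadoop cluster and upload the input data to HDFS, the Hadoop distributed file system. The order data (tab-separated fields: order ID, product ID, amount) is as follows: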

Order_0000001 Pdt_01 222.8
Order_0000001 Pdt_05 25.8
Order_0000002 Pdt_03 522.8
Order_0000002 Pdt_04 122.4
Order_0000002 Pdt_05 722.4
Order_0000003 Pdt_01 223.8
Order_0000003 Pdt_01 23.8
Order_0000003 Pdt_01 322.8
Order_0000004 Pdt_01 701.9
Order_0000004 Pdt_01 120.8

  • Export the Java files above as testgroup.jar, upload it to the server, and run:
    hadoop jar testgroup.jar com.spark.mapreduce.group.RunMain
  • Check the results
    The job's progress can be viewed in the web UI at http://192.168.10.121:8088/

The results show two output files, because the number of reduce tasks was set to two.
