1. Join algorithms

  • Problem statement:

Order table t_order:

id      date        pid     amount
1001    20150710    P0001   2
1002    20150710    P0001   3
1002    20150710    P0002   3

Product table t_product:

id      pname     category_id   price
P0001   小米5     1000          2
P0002   锤子T1    1000          3

Suppose the data volume is huge and both tables are stored as files in HDFS. We need a MapReduce program to implement the following SQL query:

select a.id, a.date, b.pname, b.category_id, b.price from t_order a join t_product b on a.pid = b.id

  • Reduce-side join: prone to data skew, not recommended (a minimal sketch follows below)

1. On the map side, read every record together with its source file name, and send records with the same product id to the same partition;

2. On the reduce side, join each group and emit the result.

If one product appears in a very large number of orders, the reducer handling it receives far more data than the others, i.e. data skew.
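For reference, a minimal sketch of this reduce-side join. It assumes comma-separated input where order lines are (id,date,pid,amount), product lines are (id,pname,category_id,price), and the two files' names contain "order" and "product" respectively; all class names here are illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
   String fileName;

   @Override
   protected void setup(Context context) throws IOException, InterruptedException {
      // which table does this split belong to? decide once, by file name
      fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
   }

   @Override
   protected void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {
      String[] f = value.toString().split(",");
      if (fileName.contains("order")) {
         // key by pid, tag the record "o" for order
         context.write(new Text(f[2]), new Text("o," + f[0] + "," + f[1] + "," + f[3]));
      } else {
         // key by product id, tag the record "p" for product
         context.write(new Text(f[0]), new Text("p," + f[1] + "," + f[2] + "," + f[3]));
      }
   }
}

public class ReduceJoinReducer extends Reducer<Text, Text, Text, NullWritable> {
   @Override
   protected void reduce(Text pid, Iterable<Text> values, Context context)
         throws IOException, InterruptedException {
      String product = null;
      List<String> orders = new ArrayList<String>();
      // one group = one product id; buffering all of its orders in memory
      // is exactly what hurts when a single product dominates (data skew)
      for (Text v : values) {
         String s = v.toString();
         if (s.startsWith("p,")) {
            product = s.substring(2);
         } else {
            orders.add(s.substring(2));
         }
      }
      for (String order : orders) {
         context.write(new Text(order + "\t" + product), NullWritable.get());
      }
   }
}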

  • Map-side join: joining a small table to a large table

1. Use the distributed cache to ship the small table (the product table) to every map task node:

job.addCacheFile(new URI("hdfs://mini1:9000/cachefile/pdts.txt"));

2. In the Mapper's setup method, read the cached file into a HashMap<productId, productInfo>.

    The Mapper's setup() method is called once at initialization; the cached file is exposed in the task's working directory under its file name, so a relative path is enough to open it.

3. The map method then streams through the large table (the order table):

    for each line read, look up the product info in the HashMap by product id, concatenate it with the order fields,

    and emit <joined record, null>.

4. No Reducer is needed; output directly from the map side: job.setNumReduceTasks(0)

  • Code --- the Mapper class
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable>{

   FileReader in = null;
   BufferedReader reader = null;
   // product id -> [id, pname]
   HashMap<String,String[]> b_tab = new HashMap<String, String[]>();

   // Called once when the map task initializes: load the small (product) table
   // from the distributed cache into the HashMap. The cached file sits in the
   // task's working directory, so a relative path is enough.
   @Override
   protected void setup(Context context)throws IOException, InterruptedException {
      in = new FileReader("pdts.txt");
      reader = new BufferedReader(in);
      String line =null;
      while(StringUtils.isNotBlank((line=reader.readLine()))){
         String[] split = line.split(",");
         String[] products = {split[0],split[1]};
         b_tab.put(split[0], products);
      }
      IOUtils.closeStream(reader);
      IOUtils.closeStream(in);
   }

   // map() streams the large (order) table: for each order line (id,date,pid,amount),
   // look up the product info by pid and emit the joined record.
   @Override
   protected void map(LongWritable key, Text value,Context context)
         throws IOException, InterruptedException {
      String line = value.toString();
      String[] orderFields = line.split(",");
      String pdt_id = orderFields[2];          // pid is the third order field
      String[] pdtFields = b_tab.get(pdt_id);
      if (pdtFields == null) {                 // skip orders with no matching product
         return;
      }
      // id \t date \t pname \t amount
      String ll = orderFields[0] + "\t" + orderFields[1] + "\t" + pdtFields[1] + "\t" + orderFields[3];
      context.write(new Text(ll), NullWritable.get());
   }
}
  • Code --- the driver class
public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf);

      // any class in the job jar works here; use the Mapper for consistency
      job.setJarByClass(MapJoinMapper.class);
      job.setMapperClass(MapJoinMapper.class);

      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(NullWritable.class);

      FileInputFormat.setInputPaths(job, new Path("D:/mapjoin/input"));
      FileOutputFormat.setOutputPath(job, new Path("D:/mapjoin/output"));

      // map-only job: no reduce phase
      job.setNumReduceTasks(0);

      // local file for testing; on a cluster, use the HDFS URI instead
      job.addCacheFile(new URI("file:/D:/pdts.txt"));
//    job.addCacheFile(new URI("hdfs://mini1:9000/cachefile/pdts.txt"));

      job.waitForCompletion(true);
   }

2. Global counters

  • Enum-based counters (a usage sketch follows this list)

1. Define the enum:

enum MyCounter{M,N};

2. Obtain the counter (in a Mapper or Reducer, from the context):

Counter counterEnum=context.getCounter(MyCounter.M);

3. Update it:

counterEnum.setValue(0);// set an initial value
counterEnum.increment(1);// add 1

4. Read it back (in the driver, after the job completes):

Counters counters=job.getCounters();
Counter counter=counters.findCounter(WordCountMapper.MyCounter.M);
long value=counter.getValue();
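Putting the enum form together, a minimal sketch inside a word-count style Mapper; the counter meanings (M = lines seen, N = empty lines) are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
   enum MyCounter { M, N }   // M: lines processed, N: empty lines (illustrative)

   @Override
   protected void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {
      context.getCounter(MyCounter.M).increment(1);    // one more line seen
      String line = value.toString();
      if (line.trim().isEmpty()) {
         context.getCounter(MyCounter.N).increment(1); // one more empty line
         return;
      }
      for (String w : line.trim().split("\\s+")) {
         context.write(new Text(w), new IntWritable(1));
      }
   }
}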
  • Dynamically named counters (a usage sketch follows this list)

1. Obtain a counter by group and name; no enum declaration is needed:

Counter counter=context.getCounter("log","error");

2. Update it:

counter.setValue(0);// set an initial value
counter.increment(1);// add 1

3. Read it back:

Counters counters=job.getCounters();
Counter counter=counters.findCounter("log","error");
long value=counter.getValue();
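And a minimal sketch of the dynamic form, counting unparseable lines under group "log", name "error"; the record layout and class name are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogParseMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
   @Override
   protected void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length < 3) {
         // group "log", name "error": created on first use, no enum required
         context.getCounter("log", "error").increment(1);
         return;
      }
      context.write(new Text(fields[0]), NullWritable.get());
   }
}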

3. Chaining multiple jobs

Complex processing logic often requires several MapReduce jobs run in sequence; the chaining can be implemented with the MapReduce framework's JobControl.

Code:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// configure job
job.set...
...

Job job2 = Job.getInstance(conf);
// configure: job.set... 
...

ControlledJob cjob=new ControlledJob(job.getConfiguration());
ControlledJob cjob2=new ControlledJob(job2.getConfiguration());
cjob.setJob(job);
cjob2.setJob(job2);

// declare the dependency: cjob2 runs only after cjob completes
cjob2.addDependingJob(cjob);

JobControl jobControl=new JobControl("wordcount");
jobControl.addJob(cjob);
jobControl.addJob(cjob2);

// JobControl implements Runnable: run it in its own thread, poll until all jobs finish, then stop it
Thread jobThread=new Thread(jobControl);
jobThread.start();
while(!jobControl.allFinished()){
   Thread.sleep(500);
}
jobControl.stop();

4. TopN

  • Problem

Example order data:

Order_0000001,Pdt_01,222.8
Order_0000001,Pdt_05,25.8
Order_0000002,Pdt_05,325.8
Order_0000002,Pdt_03,522.8
Order_0000002,Pdt_04,122.4
Order_0000003,Pdt_01,222.8
Order_0000003,Pdt_01,322.8

Find:
for each order, the record(s) with the largest transaction amount ----- top 1 (top N)

  • Override the order bean's compareTo method: beans with the same order id sort by amount in descending order

           Note: the bean must be serializable (a minimal OrderBean sketch follows the snippet below)

public int compareTo(OrderBean o) {
   int cmp = this.itemid.compareTo(o.getItemid());  // group by order id first
   if (cmp == 0) {
      cmp = -this.amount.compareTo(o.getAmount());  // then by amount, descending
   }
   return cmp;
}
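A minimal sketch of the OrderBean itself, assuming just the two fields used above (itemid as Text, amount as DoubleWritable); implementing WritableComparable provides both the serialization noted above and the compareTo hook:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class OrderBean implements WritableComparable<OrderBean> {
   private Text itemid = new Text();
   private DoubleWritable amount = new DoubleWritable();

   public void set(Text itemid, DoubleWritable amount) {
      this.itemid = itemid;
      this.amount = amount;
   }
   public Text getItemid() { return itemid; }
   public DoubleWritable getAmount() { return amount; }

   // serialization: write and read the fields in the same order
   public void write(DataOutput out) throws IOException {
      out.writeUTF(itemid.toString());
      out.writeDouble(amount.get());
   }
   public void readFields(DataInput in) throws IOException {
      itemid = new Text(in.readUTF());
      amount = new DoubleWritable(in.readDouble());
   }

   // same order id -> sort by amount, descending (see above)
   public int compareTo(OrderBean o) {
      int cmp = this.itemid.compareTo(o.getItemid());
      if (cmp == 0) {
         cmp = -this.amount.compareTo(o.getAmount());
      }
      return cmp;
   }

   @Override
   public String toString() {
      return itemid.toString() + "\t" + amount.get();
   }
}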
  • Map phase: read the file and parse each line into the order bean
static class TopNMapper extends Mapper<LongWritable, Text, OrderBean, OrderBean> {

   OrderBean v = new OrderBean();

   @Override
   protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      String[] fields = StringUtils.split(line, ",");
      // fields: order id, product id, amount
      v.set(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[2])));
      // the bean is both key and value: the key drives sort/partition/group,
      // the value carries the full record to the reducer
      context.write(v, v);
   }
}
  • Custom partitioner: records with the same order id (by hashing the id) go to the same partition. Register with job.setPartitionerClass (full class sketch below)
public int getPartition(OrderBean key, OrderBean value, int numPartitions) {
   // mask off the sign bit so the modulo result is non-negative
   return (key.getItemid().hashCode() & Integer.MAX_VALUE) % numPartitions;
}
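Wrapped as a complete class (the class name is illustrative; the map output value type is OrderBean, matching the Mapper above):

import org.apache.hadoop.mapreduce.Partitioner;

public class OrderIdPartitioner extends Partitioner<OrderBean, OrderBean> {
   @Override
   public int getPartition(OrderBean key, OrderBean value, int numPartitions) {
      // same order id -> same partition -> same reduce task
      return (key.getItemid().hashCode() & Integer.MAX_VALUE) % numPartitions;
   }
}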
  • Custom grouping: records with the same order id form one reduce group. Register with job.setGroupingComparatorClass (full class sketch below)
public int compare(WritableComparable a, WritableComparable b) {
   OrderBean abean = (OrderBean) a;
   OrderBean bbean = (OrderBean) b;
   // only the order id matters for grouping
   return abean.getItemid().compareTo(bbean.getItemid());
}
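As a complete class this must extend WritableComparator and register OrderBean in its constructor (class name illustrative):

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class OrderIdGroupingComparator extends WritableComparator {
   protected OrderIdGroupingComparator() {
      super(OrderBean.class, true); // true: instantiate OrderBeans for comparison
   }

   @Override
   public int compare(WritableComparable a, WritableComparable b) {
      OrderBean abean = (OrderBean) a;
      OrderBean bbean = (OrderBean) b;
      // compare only the order id, so one order = one reduce group
      return abean.getItemid().compareTo(bbean.getItemid());
   }
}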
  • Reduce phase: within each group, emit the first n records
int count = 0;                       // reset for every group (every order)
for (OrderBean bean : values) {
   if ((count++) == topn) {
      return;                        // n records emitted for this group
   }
   context.write(NullWritable.get(), bean);
}
  • Configuring N: set n in the job's Configuration and read it in the Reducer's setup method (a driver sketch wiring everything together closes this section)
Configuration conf = new Configuration();
conf.set("topn", "2");
Job job = Job.getInstance(conf);

The Reducer class:

static class TopNReducer extends Reducer<OrderBean, OrderBean, NullWritable, OrderBean> {
   int topn = 1;

   @Override
   protected void setup(Context context) throws IOException, InterruptedException {
      Configuration conf = context.getConfiguration();
      topn = Integer.parseInt(conf.get("topn", "1"));  // default: top 1
   }

   @Override
   protected void reduce(OrderBean key, Iterable<OrderBean> values, Context context) throws IOException, InterruptedException {
      int count = 0;   // local, so counting restarts for every group
      for (OrderBean bean : values) {
         if ((count++) == topn) {
            return;
         }
         context.write(NullWritable.get(), bean);
      }
   }
}
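Finally, a hedged sketch of the TopN driver wiring everything together, assuming the Mapper and Reducer above are nested in this class and using the illustrative partitioner/comparator names from the sketches above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopN {
   // static class TopNMapper ...  (as above)
   // static class TopNReducer ... (as above)

   public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      conf.set("topn", "2");                  // N = 2
      Job job = Job.getInstance(conf);

      job.setJarByClass(TopN.class);
      job.setMapperClass(TopNMapper.class);
      job.setReducerClass(TopNReducer.class);

      job.setMapOutputKeyClass(OrderBean.class);
      job.setMapOutputValueClass(OrderBean.class);
      job.setOutputKeyClass(NullWritable.class);
      job.setOutputValueClass(OrderBean.class);

      // same order id: same partition, and one reduce group per order
      job.setPartitionerClass(OrderIdPartitioner.class);
      job.setGroupingComparatorClass(OrderIdGroupingComparator.class);

      FileInputFormat.setInputPaths(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      job.waitForCompletion(true);
   }
}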