1. Join algorithms

  • Problem statement:

Order table t_order:

id      date        pid     amount
1001    20150710    P0001   2
1002    20150710    P0001   3
1002    20150710    P0002   3

Product table t_product:

id      pname     category_id   price
P0001   小米5     1000          2
P0002   锤子T1    1000          3

Suppose the data volume is huge and both tables are stored as files in HDFS. We need a MapReduce program to implement the following SQL query:

select a.id, a.date, b.pname, b.category_id, b.price from t_order a join t_product b on a.pid = b.id

  • Reduce-side join: prone to data skew, not recommended (a minimal sketch follows below)

1. On the map side, read every record together with its source file name, and send records with the same product id to the same partition;

2. On the reduce side, join each group and emit the result.

If one product appears in a very large number of orders, the reducer handling it receives far more data than the others, i.e. data skew.
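For reference, a minimal sketch of this reduce-side join. It assumes comma-separated input where order lines are (id,date,pid,amount), product lines are (id,pname,category_id,price), and the two files' names contain "order" and "product" respectively; all class names here are illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
   String fileName;

   @Override
   protected void setup(Context context) throws IOException, InterruptedException {
      // which table does this split belong to? decide once, by file name
      fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
   }

   @Override
   protected void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {
      String[] f = value.toString().split(",");
      if (fileName.contains("order")) {
         // key by pid, tag the record "o" for order
         context.write(new Text(f[2]), new Text("o," + f[0] + "," + f[1] + "," + f[3]));
      } else {
         // key by product id, tag the record "p" for product
         context.write(new Text(f[0]), new Text("p," + f[1] + "," + f[2] + "," + f[3]));
      }
   }
}

public class ReduceJoinReducer extends Reducer<Text, Text, Text, NullWritable> {
   @Override
   protected void reduce(Text pid, Iterable<Text> values, Context context)
         throws IOException, InterruptedException {
      String product = null;
      List<String> orders = new ArrayList<String>();
      // one group = one product id; buffering all of its orders in memory
      // is exactly what hurts when a single product dominates (data skew)
      for (Text v : values) {
         String s = v.toString();
         if (s.startsWith("p,")) {
            product = s.substring(2);
         } else {
            orders.add(s.substring(2));
         }
      }
      for (String order : orders) {
         context.write(new Text(order + "\t" + product), NullWritable.get());
      }
   }
}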

  • Map-side join: joining a small table to a large table

1. Use the distributed cache to ship the small table (the product table) to every map task node:

job.addCacheFile(new URI("hdfs://mini1:9000/cachefile/pdts.txt"));

2. In the Mapper's setup method, read the cached file into a HashMap<productId, productInfo>.

    The Mapper's setup() method is called once at initialization; the cached file is exposed in the task's working directory under its file name, so a relative path is enough to open it.

3. The map method then streams through the large table (the order table):

    for each line read, look up the product info in the HashMap by product id, concatenate it with the order fields,

    and emit <joined record, null>.

4. No Reducer is needed; output directly from the map side: job.setNumReduceTasks(0)

  • Code --- the Mapper class
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable>{

   FileReader in = null;
   BufferedReader reader = null;
   // product id -> [id, pname]
   HashMap<String,String[]> b_tab = new HashMap<String, String[]>();

   // Called once when the map task initializes: load the small (product) table
   // from the distributed cache into the HashMap. The cached file sits in the
   // task's working directory, so a relative path is enough.
   @Override
   protected void setup(Context context)throws IOException, InterruptedException {
      in = new FileReader("pdts.txt");
      reader = new BufferedReader(in);
      String line =null;
      while(StringUtils.isNotBlank((line=reader.readLine()))){
         String[] split = line.split(",");
         String[] products = {split[0],split[1]};
         b_tab.put(split[0], products);
      }
      IOUtils.closeStream(reader);
      IOUtils.closeStream(in);
   }

   // map() streams the large (order) table: for each order line (id,date,pid,amount),
   // look up the product info by pid and emit the joined record.
   @Override
   protected void map(LongWritable key, Text value,Context context)
         throws IOException, InterruptedException {
      String line = value.toString();
      String[] orderFields = line.split(",");
      String pdt_id = orderFields[2];          // pid is the third order field
      String[] pdtFields = b_tab.get(pdt_id);
      if (pdtFields == null) {                 // skip orders with no matching product
         return;
      }
      // id \t date \t pname \t amount
      String ll = orderFields[0] + "\t" + orderFields[1] + "\t" + pdtFields[1] + "\t" + orderFields[3];
      context.write(new Text(ll), NullWritable.get());
   }
}
  • Code --- the driver class
public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf);

      // any class in the job jar works here; use the Mapper for consistency
      job.setJarByClass(MapJoinMapper.class);
      job.setMapperClass(MapJoinMapper.class);

      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(NullWritable.class);

      FileInputFormat.setInputPaths(job, new Path("D:/mapjoin/input"));
      FileOutputFormat.setOutputPath(job, new Path("D:/mapjoin/output"));

      // map-only job: no reduce phase
      job.setNumReduceTasks(0);

      // local file for testing; on a cluster, use the HDFS URI instead
      job.addCacheFile(new URI("file:/D:/pdts.txt"));
//    job.addCacheFile(new URI("hdfs://mini1:9000/cachefile/pdts.txt"));

      job.waitForCompletion(true);
   }

2. Global counters

  • Enum-based counters (a usage sketch follows this list)

1. Define the enum:

enum MyCounter{M,N};

2. Obtain the counter (in a Mapper or Reducer, from the context):

Counter counterEnum=context.getCounter(MyCounter.M);

3. Update it:

counterEnum.setValue(0);// set an initial value
counterEnum.increment(1);// add 1

4. Read it back (in the driver, after the job completes):

Counters counters=job.getCounters();
Counter counter=counters.findCounter(WordCountMapper.MyCounter.M);
long value=counter.getValue();
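Putting the enum form together, a minimal sketch inside a word-count style Mapper; the counter meanings (M = lines seen, N = empty lines) are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
   enum MyCounter { M, N }   // M: lines processed, N: empty lines (illustrative)

   @Override
   protected void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {
      context.getCounter(MyCounter.M).increment(1);    // one more line seen
      String line = value.toString();
      if (line.trim().isEmpty()) {
         context.getCounter(MyCounter.N).increment(1); // one more empty line
         return;
      }
      for (String w : line.trim().split("\\s+")) {
         context.write(new Text(w), new IntWritable(1));
      }
   }
}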
  • Dynamically named counters (a usage sketch follows this list)

1. Obtain a counter by group and name; no enum declaration is needed:

Counter counter=context.getCounter("log","error");

2. Update it:

counter.setValue(0);// set an initial value
counter.increment(1);// add 1

3. Read it back:

Counters counters=job.getCounters();
Counter counter=counters.findCounter("log","error");
long value=counter.getValue();
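And a minimal sketch of the dynamic form, counting unparseable lines under group "log", name "error"; the record layout and class name are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogParseMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
   @Override
   protected void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length < 3) {
         // group "log", name "error": created on first use, no enum required
         context.getCounter("log", "error").increment(1);
         return;
      }
      context.write(new Text(fields[0]), NullWritable.get());
   }
}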

3. Chaining multiple jobs

Complex processing logic often requires several MapReduce jobs run in sequence; the chaining can be implemented with the MapReduce framework's JobControl.

Code:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// configure job
job.set...
...

Job job2 = Job.getInstance(conf);
// configure: job.set... 
...

ControlledJob cjob=new ControlledJob(job.getConfiguration());
ControlledJob cjob2=new ControlledJob(job2.getConfiguration());
cjob.setJob(job);
cjob2.setJob(job2);

// declare the dependency: cjob2 runs only after cjob completes
cjob2.addDependingJob(cjob);

JobControl jobControl=new JobControl("wordcount");
jobControl.addJob(cjob);
jobControl.addJob(cjob2);

// JobControl implements Runnable: run it in its own thread, poll until all jobs finish, then stop it
Thread jobThread=new Thread(jobControl);
jobThread.start();
while(!jobControl.allFinished()){
   Thread.sleep(500);
}
jobControl.stop();

4. TopN

  • Problem

Example order data:

Order_0000001,Pdt_01,222.8
Order_0000001,Pdt_05,25.8
Order_0000002,Pdt_05,325.8
Order_0000002,Pdt_03,522.8
Order_0000002,Pdt_04,122.4
Order_0000003,Pdt_01,222.8
Order_0000003,Pdt_01,322.8

Find:
for each order, the record(s) with the largest transaction amount ----- top 1 (top N)

  • Override the order bean's compareTo method: beans with the same order id sort by amount in descending order

           Note: the bean must be serializable (a minimal OrderBean sketch follows the snippet below)

public int compareTo(OrderBean o) {
   int cmp = this.itemid.compareTo(o.getItemid());  // group by order id first
   if (cmp == 0) {
      cmp = -this.amount.compareTo(o.getAmount());  // then by amount, descending
   }
   return cmp;
}
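A minimal sketch of the OrderBean itself, assuming just the two fields used above (itemid as Text, amount as DoubleWritable); implementing WritableComparable provides both the serialization noted above and the compareTo hook:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class OrderBean implements WritableComparable<OrderBean> {
   private Text itemid = new Text();
   private DoubleWritable amount = new DoubleWritable();

   public void set(Text itemid, DoubleWritable amount) {
      this.itemid = itemid;
      this.amount = amount;
   }
   public Text getItemid() { return itemid; }
   public DoubleWritable getAmount() { return amount; }

   // serialization: write and read the fields in the same order
   public void write(DataOutput out) throws IOException {
      out.writeUTF(itemid.toString());
      out.writeDouble(amount.get());
   }
   public void readFields(DataInput in) throws IOException {
      itemid = new Text(in.readUTF());
      amount = new DoubleWritable(in.readDouble());
   }

   // same order id -> sort by amount, descending (see above)
   public int compareTo(OrderBean o) {
      int cmp = this.itemid.compareTo(o.getItemid());
      if (cmp == 0) {
         cmp = -this.amount.compareTo(o.getAmount());
      }
      return cmp;
   }

   @Override
   public String toString() {
      return itemid.toString() + "\t" + amount.get();
   }
}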
  • Map phase: read the file and parse each line into the order bean
static class TopNMapper extends Mapper<LongWritable, Text, OrderBean, OrderBean> {

   OrderBean v = new OrderBean();

   @Override
   protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      String[] fields = StringUtils.split(line, ",");
      // fields: order id, product id, amount
      v.set(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[2])));
      // the bean is both key and value: the key drives sort/partition/group,
      // the value carries the full record to the reducer
      context.write(v, v);
   }
}
  • Custom partitioner: records with the same order id (by hashing the id) go to the same partition. Register with job.setPartitionerClass (full class sketch below)
public int getPartition(OrderBean key, OrderBean value, int numPartitions) {
   // mask off the sign bit so the modulo result is non-negative
   return (key.getItemid().hashCode() & Integer.MAX_VALUE) % numPartitions;
}
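Wrapped as a complete class (the class name is illustrative; the map output value type is OrderBean, matching the Mapper above):

import org.apache.hadoop.mapreduce.Partitioner;

public class OrderIdPartitioner extends Partitioner<OrderBean, OrderBean> {
   @Override
   public int getPartition(OrderBean key, OrderBean value, int numPartitions) {
      // same order id -> same partition -> same reduce task
      return (key.getItemid().hashCode() & Integer.MAX_VALUE) % numPartitions;
   }
}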
  • Custom grouping: records with the same order id form one reduce group. Register with job.setGroupingComparatorClass (full class sketch below)
public int compare(WritableComparable a, WritableComparable b) {
   OrderBean abean = (OrderBean) a;
   OrderBean bbean = (OrderBean) b;
   // only the order id matters for grouping
   return abean.getItemid().compareTo(bbean.getItemid());
}
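As a complete class this must extend WritableComparator and register OrderBean in its constructor (class name illustrative):

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class OrderIdGroupingComparator extends WritableComparator {
   protected OrderIdGroupingComparator() {
      super(OrderBean.class, true); // true: instantiate OrderBeans for comparison
   }

   @Override
   public int compare(WritableComparable a, WritableComparable b) {
      OrderBean abean = (OrderBean) a;
      OrderBean bbean = (OrderBean) b;
      // compare only the order id, so one order = one reduce group
      return abean.getItemid().compareTo(bbean.getItemid());
   }
}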
  • Reduce phase: within each group, emit the first n records
int count = 0;                       // reset for every group (every order)
for (OrderBean bean : values) {
   if ((count++) == topn) {
      return;                        // n records emitted for this group
   }
   context.write(NullWritable.get(), bean);
}
  • Configuring N: set n in the job's Configuration and read it in the Reducer's setup method (a driver sketch wiring everything together closes this section)
Configuration conf = new Configuration();
conf.set("topn", "2");
Job job = Job.getInstance(conf);

The Reducer class:

static class TopNReducer extends Reducer<OrderBean, OrderBean, NullWritable, OrderBean> {
   int topn = 1;

   @Override
   protected void setup(Context context) throws IOException, InterruptedException {
      Configuration conf = context.getConfiguration();
      topn = Integer.parseInt(conf.get("topn", "1"));  // default: top 1
   }

   @Override
   protected void reduce(OrderBean key, Iterable<OrderBean> values, Context context) throws IOException, InterruptedException {
      int count = 0;   // local, so counting restarts for every group
      for (OrderBean bean : values) {
         if ((count++) == topn) {
            return;
         }
         context.write(NullWritable.get(), bean);
      }
   }
}
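Finally, a hedged sketch of the TopN driver wiring everything together, assuming the Mapper and Reducer above are nested in this class and using the illustrative partitioner/comparator names from the sketches above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopN {
   // static class TopNMapper ...  (as above)
   // static class TopNReducer ... (as above)

   public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      conf.set("topn", "2");                  // N = 2
      Job job = Job.getInstance(conf);

      job.setJarByClass(TopN.class);
      job.setMapperClass(TopNMapper.class);
      job.setReducerClass(TopNReducer.class);

      job.setMapOutputKeyClass(OrderBean.class);
      job.setMapOutputValueClass(OrderBean.class);
      job.setOutputKeyClass(NullWritable.class);
      job.setOutputValueClass(OrderBean.class);

      // same order id: same partition, and one reduce group per order
      job.setPartitionerClass(OrderIdPartitioner.class);
      job.setGroupingComparatorClass(OrderIdGroupingComparator.class);

      FileInputFormat.setInputPaths(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      job.waitForCompletion(true);
   }
}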