1. Join algorithms
- Problem:
Order table t_order:
id   | date     | pid   | amount
1001 | 20150710 | P0001 | 2
1002 | 20150710 | P0001 | 3
1002 | 20150710 | P0002 | 3
Product table t_product:
id    | pname  | category_id | price
P0001 | 小米5  | 1000        | 2
P0002 | 锤子T1 | 1000        | 3
Suppose the data volume is huge and both tables are stored as files in HDFS.
Implement the following SQL query with a MapReduce program:
select a.id, a.date, b.pname, b.category_id, b.price from t_order a join t_product b on a.pid = b.id
- Reduce-side join: prone to data skew, not recommended
1. The map side reads every record together with the name of the file it came from, so each record can be tagged with its source table; records with the same product id are sent to the same partition.
2. The reduce side joins the records within each group and writes the result.
If one product has a very large number of orders, the reduce task handling that product becomes a hotspot, i.e. the data is skewed. A minimal sketch of this approach follows below.
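A minimal sketch of the reduce-side join, assuming the two source files can be told apart by their file names (class names, record tags, and field layouts here are illustrative assumptions, not the document's original code):
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: key = product id, value = the raw record prefixed with "o," (order) or "p," (product)
public class ReduceJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    Text k = new Text();
    Text v = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // the input file name tells us which table the record belongs to
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        String[] fields = value.toString().split(",");
        if (fileName.contains("order")) {           // order record: id,date,pid,amount
            k.set(fields[2]);
            v.set("o," + value.toString());
        } else {                                    // product record: id,pname,category_id,price
            k.set(fields[0]);
            v.set("p," + value.toString());
        }
        context.write(k, v);
    }
}

// Reducer: cache the single product record of the group and join it with every order record.
// If one product id has a huge number of orders, this list (and this reduce task) is the skew hotspot.
public class ReduceJoinReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        String product = null;
        List<String> orders = new ArrayList<String>();
        for (Text val : values) {
            String s = val.toString();
            if (s.startsWith("p,")) {
                product = s.substring(2);
            } else {
                orders.add(s.substring(2));
            }
        }
        for (String order : orders) {
            context.write(new Text(order + "\t" + product), NullWritable.get());
        }
    }
}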
- Map-side join: join a small table with a large table on the map side
1. Use the distributed cache to ship the small table (the product table) to every map task node:
job.addCacheFile(new URI("hdfs://mini1:9000/cachefile/pdts.txt"));
2. In the Mapper's setup method, read the cached file into a HashMap<productId, productInfo>.
The Mapper's setup() method is called once when the task is initialized.
3. The map method reads the large table (the order table) line by line.
For each order record, look up the product info in the HashMap by product id and concatenate it with the order fields.
Emit <joined record, null>.
4. No reduce phase; the map output is written directly: job.setNumReduceTasks(0)
- Code: Mapper class
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // small (product) table, keyed by product id
    HashMap<String, String[]> b_tab = new HashMap<String, String[]>();

    // setup() runs once per map task: load the product table from the distributed cache.
    // The cached file is available in the task's working directory under its own name.
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        BufferedReader reader = new BufferedReader(new FileReader("pdts.txt"));
        String line = null;
        while (StringUtils.isNotBlank((line = reader.readLine()))) {
            String[] split = line.split(",");
            String[] products = {split[0], split[1]};   // keep product id and pname (extend for category_id/price)
            b_tab.put(split[0], products);
        }
        IOUtils.closeStream(reader);
    }

    // map() reads the large (order) table: id,date,pid,amount.
    // For each order record, look up the product info by pid and emit the joined line as the key.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] orderFields = value.toString().split(",");
        String pdt_id = orderFields[2];
        String[] pdtFields = b_tab.get(pdt_id);
        if (pdtFields == null) {
            return;   // no matching product, skip the record
        }
        String joined = orderFields[0] + "\t" + orderFields[1] + "\t" + pdtFields[1] + "\t" + pdt_id + "\t" + orderFields[3];
        context.write(new Text(joined), NullWritable.get());
    }
}
- Code: driver class
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    job.setJarByClass(MapJoinDistributedCacheFile.class);
    job.setMapperClass(MapJoinMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.setInputPaths(job, new Path("D:/mapjoin/input"));
    FileOutputFormat.setOutputPath(job, new Path("D:/mapjoin/output"));
    // no reducers: the map output is the final result
    job.setNumReduceTasks(0);
    // distribute the small table; use the HDFS URI when running on a cluster
    job.addCacheFile(new URI("file:/D:/pdts.txt"));
    // job.addCacheFile(new URI("hdfs://mini1:9000/cachefile/pdts.txt"));
    job.waitForCompletion(true);
}
2. Global counters
- Enum-based counters
1. Define the enum
enum MyCounter{M,N};
2. Get the counter (inside the mapper/reducer, from the context)
Counter counterEnum = context.getCounter(MyCounter.M);
3. Update it
counterEnum.setValue(0);     // set an initial value
counterEnum.increment(1);    // add 1
4. Read it (in the driver, after the job finishes)
Counters counters = job.getCounters();
Counter counter = counters.findCounter(WordCountMapper.MyCounter.M);
long value = counter.getValue();
- Dynamically named counters (group name + counter name)
1. Get the counter
Counter counter = context.getCounter("log", "error");
2. Update it
counter.setValue(0);     // set an initial value
counter.increment(1);    // add 1
3. Read it
Counters counters = job.getCounters();
Counter counter = counters.findCounter("log", "error");
long value = counter.getValue();
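Putting the enum form together, a minimal sketch of a mapper that maintains both counters (what M and N count here is an illustrative assumption):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // M counts non-empty lines, N counts empty lines (the meaning is arbitrary)
    enum MyCounter { M, N }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.trim().isEmpty()) {
            context.getCounter(MyCounter.N).increment(1);
            return;
        }
        context.getCounter(MyCounter.M).increment(1);
        for (String word : line.split(" ")) {
            context.write(new Text(word), new LongWritable(1));
        }
    }
}

// In the driver, after job.waitForCompletion(true):
// long m = job.getCounters().findCounter(WordCountMapper.MyCounter.M).getValue();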
3. Chaining multiple jobs
Complex processing logic often requires several MapReduce jobs run in sequence. Such chaining can be implemented with the framework's JobControl.
Code:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
// configure the first job
job.set...
...
Job job2 = Job.getInstance(conf);
// configure the second job: job2.set...
...
ControlledJob cjob = new ControlledJob(job.getConfiguration());
ControlledJob cjob2 = new ControlledJob(job2.getConfiguration());
cjob.setJob(job);
cjob2.setJob(job2);
// declare the dependency: cjob2 only starts after cjob succeeds
cjob2.addDependingJob(cjob);
JobControl jobControl = new JobControl("wordcount");
jobControl.addJob(cjob);
jobControl.addJob(cjob2);
// run the jobs registered in JobControl on a separate thread, then wait until all of them finish
Thread jobThread = new Thread(jobControl);
jobThread.start();
while (!jobControl.allFinished()) {
    Thread.sleep(500);
}
jobControl.stop();
4. TopN
- Problem
Sample order data:
Order_0000001,Pdt_01,222.8
Order_0000001,Pdt_05,25.8
Order_0000002,Pdt_05,325.8
Order_0000002,Pdt_03,522.8
Order_0000002,Pdt_04,122.4
Order_0000003,Pdt_01,222.8
Order_0000003,Pdt_01,322.8
Find:
for each order, the record(s) with the largest transaction amount ----- top 1 (top n)
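For example, with n = 1 the expected output is the single largest record of each order:
Order_0000001,Pdt_01,222.8
Order_0000002,Pdt_03,522.8
Order_0000003,Pdt_01,322.8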
- 1. Override the order bean's compareTo method: equal order ids compare by amount in descending order
Note: the bean must be serializable (implement WritableComparable); a sketch of the full bean follows the snippet below.
public int compareTo(OrderBean o) {
    // group by order id first; within the same order, larger amounts sort first
    int cmp = this.itemid.compareTo(o.getItemid());
    if (cmp == 0) {
        cmp = -this.amount.compareTo(o.getAmount());
    }
    return cmp;
}
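A minimal sketch of the bean itself, assuming the field names used in the snippets above (itemid holds the order id, amount the transaction amount); everything beyond those names is illustrative:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class OrderBean implements WritableComparable<OrderBean> {

    private Text itemid = new Text();                      // order id
    private DoubleWritable amount = new DoubleWritable();  // transaction amount

    public void set(Text itemid, DoubleWritable amount) {
        this.itemid.set(itemid.toString());
        this.amount.set(amount.get());
    }

    public Text getItemid() { return itemid; }
    public DoubleWritable getAmount() { return amount; }

    @Override
    public int compareTo(OrderBean o) {
        int cmp = this.itemid.compareTo(o.getItemid());
        if (cmp == 0) {
            cmp = -this.amount.compareTo(o.getAmount());
        }
        return cmp;
    }

    // serialization: write and read the fields in the same order
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(itemid.toString());
        out.writeDouble(amount.get());
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.itemid = new Text(in.readUTF());
        this.amount = new DoubleWritable(in.readDouble());
    }

    @Override
    public String toString() {
        return itemid.toString() + "\t" + amount.get();
    }
}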
- Map phase: read the input file and parse each line into an order bean
static class TopNMapper extends Mapper<LongWritable, Text, OrderBean, OrderBean> {
    OrderBean v = new OrderBean();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // line format: orderId,productId,amount
        String line = value.toString();
        String[] fields = StringUtils.split(line, ",");
        v.set(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[2])));
        // the bean serves as both key and value: the key drives sorting and grouping,
        // the value carries the full record to the reducer
        context.write(v, v);
    }
}
- Custom partitioner: order ids with the same hash value go to the same partition (the method below lives in a class extending Partitioner<OrderBean, OrderBean>). Register it with job.setPartitionerClass.
public int getPartition(OrderBean key, OrderBean value, int numPartitions) {
    // mask the sign bit so the partition number is never negative
    return (key.getItemid().hashCode() & Integer.MAX_VALUE) % numPartitions;
}
- Custom grouping comparator: records with the same order id form one reduce group (the comparator extends WritableComparator and its constructor must call super(OrderBean.class, true)). Register it with job.setGroupingComparatorClass.
public int compare(WritableComparable a, WritableComparable b) {
    // compare only the order id, so all records of one order reach the same reduce() call
    OrderBean abean = (OrderBean) a;
    OrderBean bbean = (OrderBean) b;
    return abean.getItemid().compareTo(bbean.getItemid());
}
- Reduce phase: iterate over each group and output the first n records
int count = 0;                        // reset for every group
for (OrderBean bean : values) {
    if (count++ == topn) {
        break;                        // the first n records of this group have been written
    }
    context.write(NullWritable.get(), bean);
}
- Setting n: put n into the job configuration in the driver and read it in the reducer's setup method
Configuration conf = new Configuration();
conf.set("topn", "2");
Job job = Job.getInstance(conf);
Reducer class:
static class TopNReducer extends Reducer<OrderBean, OrderBean, NullWritable, OrderBean> {
    int topn = 1;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // read n from the job configuration (set in the driver)
        Configuration conf = context.getConfiguration();
        topn = Integer.parseInt(conf.get("topn"));
    }

    @Override
    protected void reduce(OrderBean key, Iterable<OrderBean> values, Context context) throws IOException, InterruptedException {
        // thanks to the grouping comparator, values holds all records of one order,
        // already sorted by amount in descending order
        int count = 0;
        for (OrderBean bean : values) {
            if (count++ == topn) {
                break;
            }
            context.write(NullWritable.get(), bean);
        }
    }
}
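For completeness, a driver sketch wiring everything together (OrderIdPartitioner and OrderIdGroupingComparator are assumed names for the partitioner and grouping comparator shown above; paths come from the command line):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopNDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("topn", "2");                           // n = 2
        Job job = Job.getInstance(conf);
        job.setJarByClass(TopNDriver.class);

        job.setMapperClass(TopNMapper.class);
        job.setReducerClass(TopNReducer.class);

        job.setMapOutputKeyClass(OrderBean.class);       // the bean is both map output key and value
        job.setMapOutputValueClass(OrderBean.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(OrderBean.class);

        job.setPartitionerClass(OrderIdPartitioner.class);                   // same order id -> same partition
        job.setGroupingComparatorClass(OrderIdGroupingComparator.class);     // same order id -> same reduce group

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}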