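The earlier WordCount example was simple enough that I only posted the code. This project is a little more involved, so I will record the reasoning behind each step as I write it.
Project: for each month of each year, find the two hottest days and their temperatures.
Data: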
1949-10-01 14:21:02 34c
1949-10-02 14:01:02 36c
1950-01-01 11:21:02 32c
1950-10-01 12:21:02 37c
1951-12-01 12:21:02 23c
1950-10-02 12:21:02 41c
1950-10-03 12:21:02 27c
1951-07-01 12:21:02 45c
1951-07-02 12:21:02 46c
1951-07-03 12:21:03 47c
Expected output:
1949-10-02:36 36
1949-10-01:34 34
1950-10-02:41 41
1950-10-01:37 37
1951-07-03:47 47
1951-07-02:46 46
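Approach:
1. We want the highest temperatures for each month of each year, so the fields we need from the source data are the year, the month, and the temperature.
2. Next we change the sort order: year and month ascending (descending would work too) and temperature descending. That way the two hottest readings are simply the first two records of each month.
3. We need to override grouping, changing how MapReduce groups keys, so that records with the same year and month are handed to the same reduce() call.
4. Our data is no longer a plain Text as in WordCount, so we write our own Java bean that implements WritableComparable to hold it.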
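The four steps above can be tried out without a cluster. The following is only a rough sketch in plain Java, not the MapReduce job itself: the class name `TopTwoSketch` and the `{year, month, day, temperature}` row layout are assumptions made for illustration. It sorts the rows the way step 2 describes, treats equal year and month as one group as in step 3, and keeps the first two rows of each group.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopTwoSketch {
    // Each row is {year, month, day, temperature} (a hypothetical layout).
    // Order: year ascending, month ascending, temperature descending.
    static final Comparator<int[]> SORT =
            Comparator.comparingInt((int[] r) -> r[0])
                    .thenComparingInt(r -> r[1])
                    .thenComparing(Comparator.comparingInt((int[] r) -> r[3]).reversed());

    static List<String> topTwoPerMonth(int[][] rows) {
        List<int[]> sorted = new ArrayList<>();
        for (int[] r : rows) {
            sorted.add(r);
        }
        sorted.sort(SORT);

        List<String> out = new ArrayList<>();
        int[] prev = null;
        int seen = 0;
        for (int[] r : sorted) {
            // Same year and month as the previous row means same group.
            boolean sameGroup = prev != null && prev[0] == r[0] && prev[1] == r[1];
            seen = sameGroup ? seen + 1 : 1;
            if (seen <= 2) {
                out.add(String.format("%d-%02d-%02d\t%d", r[0], r[1], r[2], r[3]));
            }
            prev = r;
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] rows = {
            {1950, 10, 1, 37}, {1950, 10, 2, 41}, {1950, 10, 3, 27}, {1949, 10, 1, 34}
        };
        for (String line : topTwoPerMonth(rows)) {
            System.out.println(line);
        }
    }
}
```

In the real job the sorting and grouping happen during the shuffle, so the reducer only has to stop after two values.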
Code:
- First, write a Java bean named Weather
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class Weather implements WritableComparable<Weather> {
    // Year
    private Integer year;
    // Month
    private Integer month;
    // Day
    private Integer day;
    // Temperature
    private Integer temperature;

    public Integer getYear() {
        return year;
    }
    public void setYear(int year) {
        this.year = year;
    }
    public Integer getMonth() {
        return month;
    }
    public void setMonth(int month) {
        this.month = month;
    }
    public Integer getDay() {
        return day;
    }
    public void setDay(int day) {
        this.day = day;
    }
    public Integer getTemperature() {
        return temperature;
    }
    public void setTemperature(int temperature) {
        this.temperature = temperature;
    }

    // Year ascending, then month ascending, then temperature descending
    @Override
    public int compareTo(Weather w) {
        int res1 = Integer.compare(year, w.getYear());
        if (res1 == 0) {
            int res2 = Integer.compare(month, w.getMonth());
            if (res2 == 0) {
                return Integer.compare(w.getTemperature(), temperature);
            }
            return res2;
        }
        return res1;
    }

    // Serialize the fields; readFields must read them back in the same order
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(year);
        dataOutput.writeInt(month);
        dataOutput.writeInt(day);
        dataOutput.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        year = dataInput.readInt();
        month = dataInput.readInt();
        day = dataInput.readInt();
        temperature = dataInput.readInt();
    }
}
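The write and readFields pair only works if both sides agree on field order. As a small illustration, here is a sketch using nothing but java.io (the class `RoundTripSketch` and its two helper methods are hypothetical and not part of the job): four ints are written in the same order as in Weather and read back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class RoundTripSketch {
    // Write year, month, day, temperature in that order (4 bytes each).
    static byte[] write(int year, int month, int day, int temperature) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(year);
            out.writeInt(month);
            out.writeInt(day);
            out.writeInt(temperature);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read the four ints back in the same order they were written.
    static int[] read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int year = in.readInt();
            int month = in.readInt();
            int day = in.readInt();
            int temperature = in.readInt();
            return new int[] {year, month, day, temperature};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] back = read(write(1949, 10, 1, 34));
        System.out.println(back[0] + "-" + back[1] + "-" + back[2] + " " + back[3]);
    }
}
```

If readFields read the month before the year, the values would silently end up in the wrong fields, since the stream carries no field names.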
- Next is our Mapper class
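import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Weather, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // HH is the 24-hour field, which matches times such as 14:21:02
        SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Calendar c = Calendar.getInstance();
        /*
          A line looks like: 1949-10-01 14:21:02 34c
          The date and the temperature are separated by a tab, so we split on \t
         */
        String line = value.toString();
        String[] list = line.split("\t");
        if (list.length == 2) {
            // 1949-10-01 14:21:02
            String arg = list[0];
            // 34c
            String temp = list[1];
            Weather w = new Weather();
            try {
                Date date = dateFormat.parse(arg);
                c.setTime(date);
                // 1949
                w.setYear(c.get(Calendar.YEAR));
                // 10 (Calendar.MONTH is zero-based, hence the + 1)
                w.setMonth(c.get(Calendar.MONTH) + 1);
                // 01 -> 1
                w.setDay(c.get(Calendar.DATE));
                // 34c -> 34. Cold days such as -5c have a different length, so we
                // cut at the position of 'c' instead of hard-coding 2
                int t = Integer.parseInt(temp.substring(0, temp.lastIndexOf("c")));
                w.setTemperature(t);
                context.write(w, new IntWritable(t));
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
    }
}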
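The parsing step can be checked on its own before running the job. The following is a minimal sketch under a few assumptions: the class `ParseLineSketch` and its `parse` helper are made up for illustration, the fields are returned as an int array, and malformed lines yield null rather than an exception.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

public class ParseLineSketch {
    // Returns {year, month, day, temperature}, or null if the line is malformed.
    static int[] parse(String line) {
        String[] parts = line.split("\t");
        if (parts.length != 2) {
            return null;
        }
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date date;
        try {
            date = format.parse(parts[0]);
        } catch (ParseException e) {
            return null;
        }
        Calendar c = Calendar.getInstance();
        c.setTime(date);

        // Cut at the trailing 'c' so that values like -5c or 104c also work.
        String temp = parts[1].trim();
        int cut = temp.lastIndexOf('c');
        if (cut <= 0) {
            return null;
        }
        int temperature;
        try {
            temperature = Integer.parseInt(temp.substring(0, cut));
        } catch (NumberFormatException e) {
            return null;
        }
        return new int[] {
            c.get(Calendar.YEAR),
            c.get(Calendar.MONTH) + 1, // Calendar months start at 0
            c.get(Calendar.DAY_OF_MONTH),
            temperature
        };
    }

    public static void main(String[] args) {
        int[] r = parse("1949-10-01 14:21:02\t34c");
        System.out.println(r[0] + " " + r[1] + " " + r[2] + " " + r[3]);
    }
}
```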
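- After map, the records are partitioned. The default is HashPartitioner, which picks a reducer from the key's hashCode(). Weather does not override hashCode(), so we write our own partitioner to make sure every record of a given year and month reaches the same reducer.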
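import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Weather, IntWritable> {
    @Override
    public int getPartition(Weather key, IntWritable value, int numReduceTasks) {
        // Partition by year modulo numReduceTasks (the number of reduce tasks).
        // floorMod keeps the result non-negative even for years before 1949.
        return Math.floorMod(key.getYear() - 1949, numReduceTasks);
    }
}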
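To see where each year lands, the partition arithmetic can be run by itself. This is a small sketch, with `PartitionSketch` and `partitionFor` as hypothetical names; it also shows that a plain `%` on `year - 1949` goes negative for years before 1949, which is not a valid partition index, while `Math.floorMod` stays in range.

```java
public class PartitionSketch {
    // Partition index for a year, always in the range [0, numReduceTasks).
    static int partitionFor(int year, int numReduceTasks) {
        return Math.floorMod(year - 1949, numReduceTasks);
    }

    public static void main(String[] args) {
        int tasks = 3;
        for (int year : new int[] {1948, 1949, 1950, 1951}) {
            System.out.println(year + " -> partition " + partitionFor(year, tasks)
                    + " (plain % gives " + ((year - 1949) % tasks) + ")");
        }
    }
}
```

With three reduce tasks and the sample data, 1949, 1950 and 1951 each get their own partition, which is why the job later produces three part-r files.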
- Next, override the sort comparator
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class MySort extends WritableComparator {
    public MySort() {
        // true: let the comparator create Weather instances to deserialize into
        super(Weather.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Cast to Weather
        Weather w1 = (Weather) a;
        Weather w2 = (Weather) b;
        int res1 = w1.getYear().compareTo(w2.getYear());
        if (res1 == 0) {
            int res2 = w1.getMonth().compareTo(w2.getMonth());
            if (res2 == 0) {
                // -w1.getTemperature().compareTo(w2.getTemperature()) would also work,
                // but it costs one extra operation
                return w2.getTemperature().compareTo(w1.getTemperature());
            }
            return res2;
        }
        return res1;
    }
}
- Then the grouping comparator
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// The comparators are all very similar; for grouping, two keys are equal
// as soon as their year and month match
public class MyGroup extends WritableComparator {
    public MyGroup() {
        super(Weather.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        Weather w1 = (Weather) a;
        Weather w2 = (Weather) b;
        int res1 = w1.getYear().compareTo(w2.getYear());
        if (res1 == 0) {
            return w1.getMonth().compareTo(w2.getMonth());
        }
        return res1;
    }
}
- Next is our reducer
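import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Weather, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Weather key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int i = 0;
        for (IntWritable t : values) {
            if (i++ == 2) {
                // The first two values are all we need
                break;
            }
            // Hadoop refills the key object while the values are iterated,
            // so getDay() belongs to the current value
            String val = key.getYear() + "-" + key.getMonth() + "-" + key.getDay();
            context.write(new Text(val), t);
        }
    }
}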
- Finally, write the job configuration and run it
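import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RunJob {
    static Configuration conf;

    public static void main(String[] args) {
        try {
            // Load the configuration
            conf = new Configuration();
            // Settings for the local test cluster
            conf.set("fs.defaultFS", "hdfs://localhost:9000");
            conf.set("yarn.resourcemanager.hostname", "localhost");
            // Get an HDFS handle
            FileSystem fs = FileSystem.get(conf);
            // Get a job instance
            Job job = Job.getInstance(conf);
            job.setJarByClass(RunJob.class);
            job.setMapperClass(MyMapper.class);
            job.setPartitionerClass(MyPartitioner.class);
            job.setSortComparatorClass(MySort.class);
            job.setGroupingComparatorClass(MyGroup.class);
            job.setReducerClass(MyReducer.class);
            job.setMapOutputKeyClass(Weather.class);
            job.setMapOutputValueClass(IntWritable.class);
            // Declare the reducer's output types as well
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(3);
            // Upload the data first, then add the input path:
            // hdfs dfs -put data /weather/input/data
            Path input = new Path("/weather/input/data");
            if (!fs.exists(input)) {
                System.out.println("Input file does not exist!");
                System.exit(1);
            }
            FileInputFormat.addInputPath(job, input);
            Path output = new Path("/weather/output");
            // The output path must not exist when the job starts
            if (fs.exists(output)) {
                fs.delete(output, true);
            }
            // Set the output path
            FileOutputFormat.setOutputPath(job, output);
            boolean res = job.waitForCompletion(true);
            if (res) {
                System.out.println("Job completed successfully");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}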
To test, run main and watch the job at http://localhost:8088/cluster/apps/RUNNING
List the output files:
hdfs dfs -ls /weather/output
View the output data:
hdfs dfs -cat /weather/output/part-r-00000
hdfs dfs -cat /weather/output/part-r-00001
hdfs dfs -cat /weather/output/part-r-00002
Compare the result with the expected output, and we are done.