WordCount Case Study
1.1 Requirement 1: Count the number of occurrences of each word in a set of files
0) Requirement: Given a set of text files, count and output the total number of times each word appears.
1) Data preparation:
hello world
atguigu atguigu
hadoop
spark
hello world
atguigu atguigu
hadoop
spark
hello world
atguigu atguigu
hadoop
spark
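For reference, the expected final output for this input (the framework sorts keys before the reduce phase, and the default key/value separator is a tab) should be:

```
atguigu	6
hadoop	3
hello	3
spark	3
world	3
```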
2) Analysis
Following the MapReduce programming conventions, write the Mapper, Reducer, and Driver classes separately.
![MapReduce实战之WordCount案例_hadoop](https://s2.51cto.com/images/blog/202211/11101052_636daf2c7481393916.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=/resize,m_fixed,w_1184)
![MapReduce实战之WordCount案例_hadoop_02](https://s2.51cto.com/images/blog/202211/11101052_636daf2c9e28c14235.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=/resize,m_fixed,w_1184)
3) Write the program
(1) Write the Mapper class
```java
package com.atguigu.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // 1 Get one line
        String line = value.toString();

        // 2 Split it into words
        String[] words = line.split(" ");

        // 3 Emit each word with a count of 1
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
```
(2) Write the Reducer class
```java
package com.atguigu.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> value, Context context)
            throws IOException, InterruptedException {

        // 1 Sum the counts
        int sum = 0;
        for (IntWritable count : value) {
            sum += count.get();
        }

        // 2 Write out the total
        context.write(key, new IntWritable(sum));
    }
}
```
(3) Write the Driver class
```java
package com.atguigu.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordcountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1 Get configuration information and create the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // 2 Set the jar load path
        job.setJarByClass(WordcountDriver.class);

        // 3 Set the Mapper and Reducer classes
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);

        // 4 Set the map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5 Set the final (reduce) output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Submit the job and wait for completion
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
```
4) Testing on the cluster
(1) Package the program into a jar and copy it to the Hadoop cluster.
(2) Start the Hadoop cluster.
(3) Run the wordcount program:
```
[atguigu@hadoop102 software]$ hadoop jar wc.jar com.atguigu.mapreduce.wordcount.WordcountDriver /user/atguigu/input /user/atguigu/output1
```
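Once the job finishes, you can check the result with an HDFS command such as the one below (the output path matches the command above; with a single reduce task the result lands in part-r-00000):

```
[atguigu@hadoop102 software]$ hadoop fs -cat /user/atguigu/output1/part-r-00000
```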
5) Local testing
(1) Configure the HADOOP_HOME environment variable on Windows.
(2) Run the program in Eclipse.
(3) Note: if Eclipse prints no log output and the console shows only
```
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
```
then create a new file named "log4j.properties" under the project's src directory and put the following into it:
```properties
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
```
1.2 Requirement 2: Partition words by the parity of their ASCII codes (Partitioner)
0) Analysis
![MapReduce实战之WordCount案例_mapreduce_03](https://s2.51cto.com/images/blog/202211/11101052_636daf2ceab5268438.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=/resize,m_fixed,w_1184)
1) Custom partitioner
```java
package com.atguigu.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {

        // 1 Get the first character of the word
        String firWord = key.toString().substring(0, 1);
        char[] charArray = firWord.toCharArray();
        int result = charArray[0];
        // int result = key.toString().charAt(0);

        // 2 Partition by whether the character code is even or odd
        if (result % 2 == 0) {
            return 0;
        } else {
            return 1;
        }
    }
}
```
2) Configure the custom partitioner in the driver and set the number of reduce tasks:
```java
job.setPartitionerClass(WordCountPartitioner.class);
job.setNumReduceTasks(2);
```
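As a quick sanity check (assuming the same input data as in Requirement 1): 'h' has the even ASCII code 104, while 'a' (97), 's' (115) and 'w' (119) are odd, so with two reduce tasks the output should be split roughly as follows:

```
part-r-00000:
hadoop	3
hello	3

part-r-00001:
atguigu	6
spark	3
world	3
```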
1.3 Requirement 3: Locally aggregate the output of each map task (Combiner)
0) Requirement: During the counting process, locally aggregate the output of each map task to reduce the amount of data transferred over the network, i.e., use the Combiner feature.
![MapReduce实战之WordCount案例_hadoop_04](https://s2.51cto.com/images/blog/202211/11101053_636daf2d48f0160370.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=/resize,m_fixed,w_1184)
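The idea can be seen on a single line of the input below: a map task that reads "atguigu atguigu" emits (atguigu, 1) twice, and a combiner merges them into one (atguigu, 2) record before the shuffle, which is what cuts down the data sent to the reducer. Note that Hadoop treats the combiner as an optimization and may run it zero, one, or several times per map task. A rough before/after picture:

```
map output without combiner: (atguigu, 1), (atguigu, 1), (hello, 1), (world, 1), ...
map output with combiner:    (atguigu, 2), (hello, 1), (world, 1), ...
```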
1) Data preparation:
hello world
atguigu atguigu
hadoop
spark
hello world
atguigu atguigu
hadoop
spark
hello world
atguigu atguigu
hadoop
spark
Option 1
1) Add a WordcountCombiner class that extends Reducer
```java
package com.atguigu.mr.combiner;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        // 1 Sum the counts for this key
        int count = 0;
        for (IntWritable v : values) {
            count += v.get();
        }

        // 2 Write out the partial total
        context.write(key, new IntWritable(count));
    }
}
```
2) Specify the combiner in the WordcountDriver driver class:
```java
// 9 Specify that a combiner is used, and which class implements the combiner logic
job.setCombinerClass(WordcountCombiner.class);
```
Option 2
1) Use WordcountReducer as the combiner by specifying it in the WordcountDriver driver class:
```java
// Specify that a combiner is used, and which class implements the combiner logic
job.setCombinerClass(WordcountReducer.class);
```
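Reusing WordcountReducer as the combiner works here because its input and output key/value types are identical (Text, IntWritable) and summing counts is commutative and associative, so applying the same logic once per map task and again on the reduce side produces the same final totals.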
Run the program
![MapReduce实战之WordCount案例_mapreduce_05](https://s2.51cto.com/images/blog/202211/11101053_636daf2d983e994669.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=/resize,m_fixed,w_1184)
![MapReduce实战之WordCount案例_mapreduce_06](https://s2.51cto.com/images/blog/202211/11101053_636daf2de99cf61862.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=/resize,m_fixed,w_1184)
1.4 Requirement 4: Split optimization for large numbers of small files (CombineTextInputFormat)
0) Requirement: Merge the large number of small input files into a single split so they are processed together.
1) Input data: prepare 5 small files.
2) Implementation
(1) Without any changes, run the wordcount program from Requirement 1 and observe that the number of splits is 5.
(2) Add the following code to WordcountDriver, run the program, and observe that the number of splits is now 1.
```java
// If no InputFormat is set, TextInputFormat.class is used by default
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB
CombineTextInputFormat.setMinInputSplitSize(job, 2097152); // 2 MB
```
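A brief note on why the split count drops: assuming the five small test files together are smaller than the 4 MB maximum split size configured above, CombineTextInputFormat packs them all into one split, so a single map task processes all five files instead of one map task per file.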
![MapReduce实战之WordCount案例_hadoop_07](https://s2.51cto.com/images/blog/202211/11101054_636daf2e29f499600.png?x-oss-process=image/watermark,size_16,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=/resize,m_fixed,w_1184)