Hadoop Pseudo-Distributed Mode on Windows
1. Introduction
In the big-data field, Hadoop is a widely used distributed computing framework. It processes large-scale data sets and offers high reliability and fault tolerance. Hadoop runs on several operating systems, including Windows. This article explains how to set up a pseudo-distributed Hadoop environment on Windows and provides the corresponding code examples.
2. Setting Up Pseudo-Distributed Hadoop
2.1 Prerequisites
Before starting, make sure the following are in place:
- A Java Development Kit (JDK) is installed
- A stable Hadoop release has been downloaded and unpacked to a suitable directory, preferably one whose path contains no spaces (paths such as C:\Program Files break the Hadoop scripts)
- The Windows native binaries (winutils.exe and hadoop.dll) are in %HADOOP_HOME%\bin; the stock Apache tarball does not ship them, so they must be obtained separately for your Hadoop version
- The environment variables are set (see the sketch after this list)
- An SSH service is configured if you plan to use the Unix-style *.sh scripts; the *.cmd scripts used below start the daemons directly and do not need it
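A minimal sketch of the environment variables for the current cmd session; the two paths are placeholders for wherever your JDK and Hadoop actually live (use setx instead of set to make them permanent):
:: Hadoop's scripts read JAVA_HOME and HADOOP_HOME; bin and sbin are
:: added to PATH so the hadoop/hdfs and start-*.cmd commands used
:: below can be run from anywhere.
set JAVA_HOME=C:\Java\jdk1.8.0
set HADOOP_HOME=C:\hadoop
set PATH=%PATH%;%JAVA_HOME%\bin;%HADOOP_HOME%\bin;%HADOOP_HOME%\sbin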
2.2 Configuring Hadoop
The etc/hadoop folder under the Hadoop installation directory contains Hadoop's configuration files. We need to edit the following four files:
core-site.xml
<configuration>
  <property>
    <!-- fs.default.name is the deprecated spelling of this key -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <!-- A single-node setup can only keep one replica of each block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
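Optionally, you can also pin the HDFS storage locations to explicit folders inside the <configuration> element of hdfs-site.xml instead of relying on the default temporary directory. The property names are standard; the paths below are placeholders to adjust to your own disk layout:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///C:/hadoop/data/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///C:/hadoop/data/datanode</value>
</property>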
mapred-site.xml
<configuration>
  <property>
    <!-- mapred.job.tracker is a Hadoop 1.x (JobTracker) setting; with
         YARN, which this guide configures and starts below, jobs are
         submitted to the yarn framework instead -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
yarn-site.xml
<configuration>
  <property>
    <!-- The auxiliary shuffle service lets NodeManagers serve map
         output to reduce tasks -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
2.3 Starting Hadoop
Before the very first start, format the NameNode; this is a one-time step that initializes the HDFS metadata directory:
hdfs namenode -format
Then open a command prompt and start the HDFS and YARN daemons:
start-dfs.cmd
start-yarn.cmd
The MapReduce job history server is optional. Note that mr-jobhistory-daemon.sh is a Unix shell script and will not run under cmd; on Windows it can usually be started in the foreground with:
mapred historyserver
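To confirm that everything came up, list the running Java processes with jps (shipped with the JDK). On a healthy pseudo-distributed node you should see NameNode, DataNode, ResourceManager, and NodeManager, plus, depending on the version, a SecondaryNameNode:
jps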
2.4 Verifying Hadoop
Open http://localhost:50070 in a browser (the NameNode web UI in Hadoop 2.x; Hadoop 3.x moved it to port 9870) to see Hadoop's web interface. Then run the following commands from %HADOOP_HOME% to check that Hadoop works end to end (README.txt ships in the root of the Hadoop distribution):
hdfs dfs -mkdir /input
hdfs dfs -put README.txt /input
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
hdfs dfs -cat /output/part-r-00000
If everything is working, you should see the word counts for the README.txt file.
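With the default TextOutputFormat, each line of part-r-00000 is a word followed by a tab and its count; the values below are purely illustrative:
Apache	3
Hadoop	12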
3. Code Example
Below is a simple example, the classic WordCount program, showing how a word count is implemented as a Hadoop MapReduce job:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: splits each input line into tokens and emits a (word, 1)
  // pair for every token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word; also reused as a combiner
  // to pre-aggregate on the map side.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] is the HDFS input directory, args[1] the output
    // directory, which must not exist yet.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
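To run the example, compile it against the Hadoop libraries and package it as a jar. A minimal sketch, assuming a JDK 8 layout where the compiler classes live in tools.jar (the jar name wc.jar is arbitrary):
set HADOOP_CLASSPATH=%JAVA_HOME%\lib\tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
hadoop jar wc.jar WordCount /input /output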