Hadoop Pseudo-Distributed Mode on Windows

1. Introduction

In the big-data space, Hadoop is a widely used distributed computing framework. It can process large-scale datasets and provides high reliability and fault tolerance. Hadoop runs on a variety of operating systems, including Windows. This article walks through setting up a pseudo-distributed Hadoop environment on Windows and provides code examples along the way.

2. Setting Up Hadoop in Pseudo-Distributed Mode

2.1 Prerequisites

Before you begin, make sure the following are in place:

  • Install a Java development environment (JDK)
  • Download a stable Hadoop release and unpack it to a suitable directory
  • Put the Windows native binaries (winutils.exe and hadoop.dll) into %HADOOP_HOME%\bin; the Apache binary release typically does not ship them, and Hadoop will not start on Windows without them
  • Configure the environment variables (see the sketch after this list)
  • Set up an SSH service only if you intend to use the Linux start-*.sh scripts; the Windows .cmd scripts used below launch the daemons directly and do not require SSH
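
For example, the environment variables can be set from a command prompt roughly like this. The install paths are placeholders; substitute your own. Note that set only affects the current window; use the System Properties dialog (or setx) to make the change permanent:

:: paths below are examples; adjust to your JDK and Hadoop locations
set JAVA_HOME=C:\Java\jdk1.8.0_202
set HADOOP_HOME=C:\hadoop-2.10.2
set PATH=%PATH%;%JAVA_HOME%\bin;%HADOOP_HOME%\bin;%HADOOP_HOME%\sbin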

2.2 Configuring Hadoop

The etc/hadoop folder under the Hadoop installation directory contains Hadoop's configuration files. The following files need to be edited:

core-site.xml
<configuration>
    <property>
        <!-- fs.default.name is deprecated; the current name is fs.defaultFS -->
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
mapred-site.xml
<configuration>
    <property>
        <!-- mapred.job.tracker is a Hadoop 1.x setting; with YARN configured
             below, point MapReduce at the YARN framework instead -->
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
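
Optionally, you can also pin the HDFS storage locations in hdfs-site.xml, which avoids surprises with the default temporary directories on Windows. The paths below are examples; pick any local directories you like:

<property>
    <name>dfs.namenode.name.dir</name>
    <!-- example path; choose your own local directory -->
    <value>file:///C:/hadoop/data/namenode</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <!-- example path; choose your own local directory -->
    <value>file:///C:/hadoop/data/datanode</value>
</property>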

2.3 Starting Hadoop

The first time you bring HDFS up, format the NameNode:

hdfs namenode -format

Then open a command prompt and start the HDFS and YARN daemons:

start-dfs.cmd
start-yarn.cmd

The MapReduce Job History Server is optional; its mr-jobhistory-daemon.sh start script is a Linux shell script, so on Windows start it in its own window with:

mapred historyserver
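
Once the daemon windows have spawned, you can confirm everything is running with jps, which ships with the JDK:

jps
:: expected process names (PIDs will vary): NameNode, DataNode, ResourceManager, NodeManager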

2.4 Verifying Hadoop

Open http://localhost:50070 in a browser to view the NameNode web UI (this is the Hadoop 2.x port; Hadoop 3.x moved it to 9870); the YARN ResourceManager UI is at http://localhost:8088. Then run the following commands from %HADOOP_HOME% to check that Hadoop works end to end:

hdfs dfs -mkdir /input
hdfs dfs -put README.txt /input
:: cmd does not expand the * wildcard, so spell out your actual Hadoop version in the jar name
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
hdfs dfs -cat /output/part-r-00000

If everything is working, you should see the word counts for the contents of README.txt.
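
If you would rather check the result programmatically, here is a minimal sketch using the HDFS Java API. The class name ReadOutput is ours, and the fs.defaultFS value assumes the core-site.xml shown above:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: reads the wordcount output file through the HDFS Java API.
public class ReadOutput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // assumes the fs.defaultFS configured in core-site.xml above
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/output/part-r-00000");
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(out)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);  // each line: word<TAB>count
      }
    }
    fs.close();
  }
}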

3. Code Example

Below is a simple example showing how the word count job above is implemented with Hadoop's MapReduce API:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: splits each input line into tokens and emits (word, 1) per token
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums all counts emitted for a given word
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer also serves as a combiner, since summing is associative
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output paths are taken from the command line
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
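
To try it, compile the class against the Hadoop libraries, package it into a jar, and submit the job, following the approach in the official MapReduce tutorial. The jar name wc.jar is arbitrary, and the tools.jar path assumes a JDK 8 layout:

:: wc.jar is an arbitrary name; tools.jar assumes JDK 8
set HADOOP_CLASSPATH=%JAVA_HOME%\lib\tools.jar
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
hadoop jar wc.jar WordCount /input /output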