环境:服务器:CentOS6.6  Hadoop-2.7.2

client端:windows10:开发工具:intellij IDEA

前期准备:需要在windows平台下载hadoop-2.7.2的bin包,并且解压到本地目录,我的是在E:\hadoop-2.7.2\hadoop-2.7.2,具体如下:

idea2022如何连接hadoop虚拟机 idea链接hadoop_apache

1、在intellij中创建一个maven project flie-》new project

idea2022如何连接hadoop虚拟机 idea链接hadoop_java_02

/

2、pom.xml文件里编写如下:

idea2022如何连接hadoop虚拟机 idea链接hadoop_apache_03

3、新建java  class WordCount

4、在工程目录下的resources文件夹下新建log4j.properties文件和core-site.xml文件

我的log4j.properties是从服务器的hadoop的配置目录中拷过来的。

core-site.xml文件的内容如下:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://mj2:9000</value>
</property>
</configuration>

其中的hdfs://mj2:9000为服务器上的hdfs的位置(在服务器的hadoop的core-site.xml中的配置项)


5、添加hadoop-2.7.2依赖包

idea2022如何连接hadoop虚拟机 idea链接hadoop_hadoop_04

idea2022如何连接hadoop虚拟机 idea链接hadoop_hadoop_05

idea2022如何连接hadoop虚拟机 idea链接hadoop_Text_06

给jar包文件夹取名为hadoop-library,最后点apply即可。

idea2022如何连接hadoop虚拟机 idea链接hadoop_Text_07

我的项目中可能由于网络问题,maven依赖的包总是下载不下来,所以我把需要的包都下载或者从服务器拷贝过来放在了lib文件夹中。具体如下:

idea2022如何连接hadoop虚拟机 idea链接hadoop_hadoop_08

6、还需要的就是在hadoop的解压目录下的bin目录下,需要的两个文件原来没有,一个是winutils.exe,另一个是hadoop.dll文件。所以需要上网下载winutils.exe和hadoop.dll文件。为了防止冲突,可以把下载的hadoop.dll文件放在C:\Windows\System32中。

(我在网上没有找到hadoop版本-2.7.2的hadoop.dll,所以用的2.7.2版本的同样可以)

7、WordCount的代码:

package wordCounts;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable>{
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context
        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
    public static class IntSumReducer
            extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
        ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main(String[] args) throws Exception {
        System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.2\\hadoop-2.7.2");
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}



8、配置WordCount运行参数

Run-》Edit Configurations...

idea2022如何连接hadoop虚拟机 idea链接hadoop_远程登录_09

其中,Program arguments:为传入的hdfs中的文件,以及reduce之后的结果文件目录(中间用空格隔开)

Environment variables:为程序运行时设定的环境变量:具体如下:

idea2022如何连接hadoop虚拟机 idea链接hadoop_java_10


需要注意的事项有:

(1)你的windows下必须有一个和线上hadoop集群一样的用户名

(2)确保你的hadoop程序中的所有jar都是compile

intellij下唯一不爽的,由于没有类似eclipse的hadoop插件,每次运行完wordcount,下次再要运行时,只能手动命令行删除output目录,再行调试。为了解决这个问题,可以将WordCount代码改进一下,在运行前先删除output目录,见下面的代码:


/**
      * 删除指定目录
      *
      * @param conf
      * @param dirPath
      * @throws IOException
      */
     private static void deleteDir(Configuration conf, String dirPath) throws IOException {
         FileSystem fs = FileSystem.get(conf);
         Path targetPath = new Path(dirPath);
         if (fs.exists(targetPath)) {
             boolean delResult = fs.delete(targetPath, true);
             if (delResult) {
                 System.out.println(targetPath + " has been deleted sucessfullly.");
             } else {
                 System.out.println(targetPath + " deletion failed.");
             }
         }
 
    }

这样,在main函数中的new Job()之前先delete,即deleteDir(conf, otherArgs[ 1

]);