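I have recently been learning data analysis with Hadoop and Spark. I had already set up a Hadoop cluster on virtual machines, so today I wanted to try submitting a job to that cluster remotely from IntelliJ IDEA on Windows 10, using WordCount as the example.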

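I: Environment and preparation

  1. Windows 10 + IntelliJ IDEA 2017.1.6 + Hadoop 2.8.0
    Note: Hadoop must be installed both on the virtual machines and locally. The installation steps are almost identical, so they are not repeated here; search online if you need a guide. After installing Hadoop on Windows 10 you also need to configure its environment variables.
  2. The Hadoop cluster on the virtual machines. Add your cluster's host mappings to C:\Windows\System32\drivers\etc\hosts, adjusted to your own setup:
# These are the entries from my own setup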
192.168.253.100 centos
192.168.253.101 server1
192.168.253.102 server2
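
A quick way to confirm the mappings were saved is to read the hosts file back and look up the three names. The following is only a minimal sketch: the class name HostsEntries is made up for illustration, and the path and hostnames are the ones from my setup above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: reads hosts-file lines into a name -> IP map.
class HostsEntries {
    /** Maps each hostname found in the given hosts-file lines to its IP address. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> byName = new LinkedHashMap<>();
        for (String line : lines) {
            // Drop trailing comments and surrounding whitespace.
            int hash = line.indexOf('#');
            String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (body.isEmpty()) {
                continue;
            }
            // Line format: <ip> <hostname> [alias ...]
            String[] fields = body.split("\\s+");
            for (int i = 1; i < fields.length; i++) {
                byName.put(fields[i], fields[0]);
            }
        }
        return byName;
    }

    public static void main(String[] args) throws Exception {
        // Path and hostnames are the ones used in this post; adjust to your own setup.
        Path hosts = Paths.get("C:\\Windows\\System32\\drivers\\etc\\hosts");
        Map<String, String> entries = parse(Files.readAllLines(hosts));
        for (String name : new String[] {"centos", "server1", "server2"}) {
            System.out.println(name + " -> " + entries.getOrDefault(name, "MISSING"));
        }
    }
}
```

If any name prints MISSING, the job client on Windows will not be able to resolve that node.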

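II: Create the project in IDEA

  1. Create a new Maven project.
  2. Select your Java version, then click Next.
  3. Fill in the project coordinates (mine are com.msw and test1, as in the pom.xml below); any values will do.
  4. Click Next and choose your own project directory.
  5. Click Finish.
  6. Add the dependencies.
    Open pom.xml and note the Hadoop-related entries. Use your own Hadoop version; mine is 2.8.0.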
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.msw</groupId>
    <artifactId>test1</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.8.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.8.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.8.0</version>
        </dependency>
    </dependencies>
</project>
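  7. Add the configuration files.
    This mirrors what you did when configuring the cluster on the virtual machines: copy core-site.xml, mapred-site.xml, and yarn-site.xml from the Hadoop installation on your VM into the resources folder of the IDEA project. Mine look like this:
    core-site.xml: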
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://centos:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/root/hadoop/tmp</value>
    </property>

</configuration>
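
Files in the resources folder end up on the classpath, which is where the job client looks for them. As a sanity check, the name/value pairs can be read back with the JDK's own XML parser. This is a minimal sketch, not how Hadoop itself loads configuration, and the class name SiteXmlReader is made up for illustration.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads a Hadoop-style *-site.xml into a plain map.
class SiteXmlReader {
    /** Collects the <name>/<value> pair of every <property> element in the document. */
    static Map<String, String> read(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            NodeList names = property.getElementsByTagName("name");
            NodeList values = property.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0) {
                props.put(names.item(0).getTextContent().trim(),
                          values.item(0).getTextContent().trim());
            }
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        // Looks the file up on the classpath, where src/main/resources is copied at build time.
        try (InputStream in = SiteXmlReader.class.getResourceAsStream("/core-site.xml")) {
            if (in == null) {
                System.out.println("core-site.xml is not on the classpath");
                return;
            }
            System.out.println("fs.defaultFS = " + read(in).get("fs.defaultFS"));
        }
    }
}
```

If the file is reported as missing, check that it sits directly under the resources folder and that the folder is marked as a resources root in IDEA.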

mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>centos:49001</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/root/hadoop/var</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.app-submission.cross-platform</name>
        <value>true</value>
    </property>
</configuration>

yarn-site.xml:

<?xml version="1.0"?>
<configuration>
    <!-- Site specific YARN configuration properties -->
       <property>
           <name>yarn.resourcemanager.hostname</name>
           <value>centos</value>
       </property>
       <property>
            <description>The address of the applications manager interface in the RM.</description>
            <name>yarn.resourcemanager.address</name>
            <value>${yarn.resourcemanager.hostname}:8032</value>
       </property>
       <property>
            <description>The address of the scheduler interface.</description>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>${yarn.resourcemanager.hostname}:8030</value>
       </property>
       <property>
            <description>The http address of the RM web application.</description>
            <name>yarn.resourcemanager.webapp.address</name>
            <value>${yarn.resourcemanager.hostname}:8088</value>
       </property>
       <property>
            <description>The https address of the RM web application.</description>
            <name>yarn.resourcemanager.webapp.https.address</name>
            <value>${yarn.resourcemanager.hostname}:8090</value>
       </property>
       <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>${yarn.resourcemanager.hostname}:8031</value>
       </property>
       <property>
            <description>The address of the RM admin interface.</description>
            <name>yarn.resourcemanager.admin.address</name>
            <value>${yarn.resourcemanager.hostname}:8033</value>
       </property>
       <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
       </property>
       <property>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>2048</value>
            <description>Maximum memory that a single container request can be allocated, in MB. The default is 8192 MB.</description>
       </property>
       <property>
            <name>yarn.nodemanager.vmem-pmem-ratio</name>
            <value>2.1</value>
       </property>
       <property>
            <name>yarn.nodemanager.resource.memory-mb</name>
            <value>2048</value>
       </property>
       <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
       </property>
</configuration>
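
Most of the addresses in yarn-site.xml are written as ${yarn.resourcemanager.hostname}:port. Hadoop substitutes such ${name} references with the value of the named property when it reads the configuration, so only the hostname has to be changed in one place. The sketch below imitates that substitution in plain Java to show the effect. It is a simplified stand-in (a single pass, no nested references, no system properties), and the class name VarExpansion is made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified illustration of ${name} substitution; not Hadoop's actual implementation.
class VarExpansion {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} in value with props.get(name); unknown names are left untouched. */
    static String expand(String value, Map<String, String> props) {
        Matcher m = VAR.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = props.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("yarn.resourcemanager.hostname", "centos");
        props.put("yarn.resourcemanager.address", "${yarn.resourcemanager.hostname}:8032");
        // prints "centos:8032"
        System.out.println(expand(props.get("yarn.resourcemanager.address"), props));
    }
}
```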

Then log4j.properties:

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ABSOLUTE} | %-5.5p | %-16.16t | %-32.32c{1} | %-32.32C %4L | %m%n
  8. Write WordCount.java
package com.msw;
/*
 * File: WordCount.java
 * Date: 2019/10/13-20:40
 * Author: msw.
 * PS ...
*/

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper extends
            Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }

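        // Job.getInstance replaces the deprecated Job(Configuration, String) constructor.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        // Note: this is the jar you will build later; the relative path is resolved against the working directory.
        job.setJar("WordCount.jar");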
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);

        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);

        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        job.setNumReduceTasks(2);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
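
To see what the job computes without a cluster, the same logic can be run in memory. The sketch below is an illustration only: count mirrors what TokenizerMapper and IntSumReducer do together, and partition shows the hash-modulo formula used by Hadoop's default HashPartitioner to spread keys over the two reducers. It uses String.hashCode as a stand-in for Text.hashCode, so the reducer numbers it prints will not necessarily match the real job. The class name LocalWordCount is made up.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory illustration of the WordCount logic; does not use Hadoop.
class LocalWordCount {
    /** Splits on whitespace like TokenizerMapper and sums the ones like IntSumReducer. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    /** Hash-modulo partitioning; String.hashCode is an assumption standing in for Text.hashCode. */
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("hello hadoop\nhello spark");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println("reducer " + partition(e.getKey(), 2)
                    + ": " + e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Because the tokenizer splits on whitespace only, "Hello," and "hello" count as different words. With setNumReduceTasks(2) the real job writes one output file per reducer into the output directory.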

III: Package and test

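  1. Build WordCount.jar
    Press Ctrl+Alt+Shift+S to open the Project Structure dialog.
    Click Artifacts => + => JAR => Empty.
    Enter WordCount as the name.

    Then click the green button in the middle, choose the Module Output option, and confirm.

    Click OK to apply the change and close the dialog.
    In the top menu bar choose Build => Build Artifacts; a small popup appears.

    Choose Build. An out folder then appears in your project; inside it is the WordCount.jar you just built.
    Copy this WordCount.jar into the project's root directory.
    The project root should now contain WordCount.jar alongside pom.xml, src, and out.

    OK, almost done.
  2. Run => Edit Configurations

    Add an Application configuration and set it up as follows.
    The fields you need to change are Main class and Program arguments.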
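
Before running, it can help to confirm that the jar copied to the project root really contains the job classes, since an empty or stale jar leads to class-not-found errors on the cluster. A minimal sketch using the JDK's JarFile follows. The class name JarCheck is made up, and the entry names assume the com.msw package used in this post.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper: checks a jar for the compiled job classes.
class JarCheck {
    /** Returns true if the jar at jarPath has an entry with exactly this name. */
    static boolean hasEntry(String jarPath, String entryName) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Nested classes are compiled to Outer$Inner.class files.
        String[] expected = {
            "com/msw/WordCount.class",
            "com/msw/WordCount$TokenizerMapper.class",
            "com/msw/WordCount$IntSumReducer.class"
        };
        for (String name : expected) {
            // Relative path: run this from the project root, next to WordCount.jar.
            System.out.println(name + ": " + (hasEntry("WordCount.jar", name) ? "ok" : "MISSING"));
        }
    }
}
```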

Pay attention to the program arguments:

hdfs://centos:9000/input/word.txt
hdfs://centos:9000/output

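centos is my hostname, that is, the hostname of the Hadoop cluster's master node.

/input/word.txt is a file in HDFS that you must upload beforehand. It is only test input, so any txt file will do.

/output is also a directory in HDFS, and it must not exist yet. If the job fails with an error saying the output directory already exists, delete it with hdfs dfs -rm -r /output and run again.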
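
Both arguments are full URIs, and their scheme, host, and port should match the fs.defaultFS value in core-site.xml. The sketch below only takes the strings apart with java.net.URI to make that comparison explicit. It does not contact HDFS, and the class name HdfsArgs is made up for illustration.

```java
import java.net.URI;

// Hypothetical helper: compares a program argument against fs.defaultFS.
class HdfsArgs {
    /** True if arg points at the same scheme, host, and port as defaultFs. */
    static boolean sameFileSystem(String arg, String defaultFs) {
        URI a = URI.create(arg);
        URI d = URI.create(defaultFs);
        return a.getScheme() != null && a.getScheme().equalsIgnoreCase(d.getScheme())
                && a.getHost() != null && a.getHost().equalsIgnoreCase(d.getHost())
                && a.getPort() == d.getPort();
    }

    public static void main(String[] args) {
        String defaultFs = "hdfs://centos:9000"; // value from my core-site.xml
        String[] programArgs = {
            "hdfs://centos:9000/input/word.txt",
            "hdfs://centos:9000/output"
        };
        for (String arg : programArgs) {
            URI uri = URI.create(arg);
            System.out.println(uri.getPath() + " on " + uri.getHost() + ":" + uri.getPort()
                    + " matches fs.defaultFS: " + sameFileSystem(arg, defaultFs));
        }
    }
}
```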

OK, all done! Click Run to test it, and remember to start the Hadoop cluster first.
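
If you are not sure whether the cluster is up, a plain TCP connection attempt from Windows to the two ports used above gives a quick answer. This is a rough sketch with a made-up class name, ClusterPing. The ports 9000 and 8032 are the ones from my core-site.xml and yarn-site.xml, and an open port only shows that something is listening, not that the service is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: checks whether a TCP port accepts connections.
class ClusterPing {
    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolvable hostnames.
            return false;
        }
    }

    public static void main(String[] args) {
        // Host and ports are the ones from my configuration files; adjust to yours.
        System.out.println("HDFS (centos:9000): " + canConnect("centos", 9000, 2000));
        System.out.println("YARN ResourceManager (centos:8032): " + canConnect("centos", 8032, 2000));
    }
}
```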