I've been learning data analysis recently, which involves Hadoop and Spark. I already set up a Hadoop cluster in virtual machines a while back, so today I want to try submitting a job (WordCount as the example) from IDEA on Windows 10 to the remote Hadoop cluster on the VMs.
1: Environment and preparation
- Win10 + IntelliJ IDEA 2017.1.6 + Hadoop 2.8.0
Note: Hadoop must be installed both on the VM and on the local machine; the steps are nearly identical in both places, so I won't repeat them here (there are plenty of guides online). After installing Hadoop on Win10, you also need to configure the environment variables.
- Hadoop cluster on the VM: adapt the following to your own setup, and paste these three lines into C:\Windows\System32\drivers\etc\hosts
# These are my own entries
192.168.253.100 centos
192.168.253.101 server1
192.168.253.102 server2
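To check that the entries took effect, you can ping one of the hostnames (e.g. ping centos) from a Windows command prompt before going any further.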
2: Creating the project in IDEA
- Create a new Maven project
- Pick your Java version, then Next
- The GroupId/ArtifactId can be anything you like; then Next, choose your own project directory, and Finish
- Add the dependencies
Open pom.xml and pay attention to the Hadoop-related entries; their versions must match your own installation (mine is 2.8.0):
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.msw</groupId>
    <artifactId>test1</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.8.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.8.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.8.0</version>
        </dependency>
    </dependencies>
</project>
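As an aside, hadoop-client should already pull in hadoop-common and hadoop-hdfs transitively, so the last two entries are likely redundant; they do no harm, though.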
- Add the configuration files
These mirror what you set up when configuring the cluster on the VM: simply copy the core-site.xml, mapred-site.xml, and yarn-site.xml files from the Hadoop directory on your VM into the project's resources folder. Mine look like this:
core-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://centos:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/root/hadoop/tmp</value>
    </property>
</configuration>
mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>centos:49001</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/root/hadoop/var</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.app-submission.cross-platform</name>
        <value>true</value>
    </property>
</configuration>
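Of these, mapreduce.app-submission.cross-platform is the one that allows a Windows client to submit jobs to a Linux cluster. As an alternative to shipping the XML files, here is a minimal, untested sketch of setting the key values in code (values taken from my XML files above):

// Sketch only: in WordCount.main() further below, before the Job is created.
// Configuration is org.apache.hadoop.conf.Configuration.
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://centos:9000");              // from core-site.xml
conf.set("mapreduce.framework.name", "yarn");                // from mapred-site.xml
conf.set("yarn.resourcemanager.hostname", "centos");         // from yarn-site.xml
conf.set("mapreduce.app-submission.cross-platform", "true"); // required when submitting from Windows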
yarn-site.xml:
<?xml version="1.0"?>
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>centos</value>
    </property>
    <property>
        <description>The address of the applications manager interface in the RM.</description>
        <name>yarn.resourcemanager.address</name>
        <value>${yarn.resourcemanager.hostname}:8032</value>
    </property>
    <property>
        <description>The address of the scheduler interface.</description>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>${yarn.resourcemanager.hostname}:8030</value>
    </property>
    <property>
        <description>The http address of the RM web application.</description>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>${yarn.resourcemanager.hostname}:8088</value>
    </property>
    <property>
        <description>The https address of the RM web application.</description>
        <name>yarn.resourcemanager.webapp.https.address</name>
        <value>${yarn.resourcemanager.hostname}:8090</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>${yarn.resourcemanager.hostname}:8031</value>
    </property>
    <property>
        <description>The address of the RM admin interface.</description>
        <name>yarn.resourcemanager.admin.address</name>
        <value>${yarn.resourcemanager.hostname}:8033</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>2048</value>
        <description>Maximum memory a single container may be allocated, in MB (default 8192).</description>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>2.1</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>
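A quick word on those memory settings: yarn.nodemanager.resource.memory-mb caps the physical memory YARN may hand out on each node (2 GB here), yarn.scheduler.maximum-allocation-mb caps what a single container can request, and vmem-pmem-ratio=2.1 means a container holding 1 GB of physical memory may use up to 2.1 GB of virtual memory before the virtual-memory check kills it. Setting yarn.nodemanager.vmem-check-enabled to false disables that check entirely, a common workaround for "running beyond virtual memory limits" errors on small VMs.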
Then log4j.properties, so the client's log output shows up in IDEA's console instead of a "no appenders" warning:
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ABSOLUTE} | %-5.5p | %-16.16t | %-32.32c{1} | %-32.32C %4L | %m%n
- Write WordCount.java
package com.msw;

/*
 * File: WordCount.java
 * Date: 2019/10/13-20:40
 * Author: msw.
 * PS ...
 */
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: split each input line into tokens and emit (word, 1) per token.
    public static class TokenizerMapper extends
            Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sum the counts for each word.
    public static class IntSumReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count"); // new Job(conf, ...) is deprecated
        job.setJarByClass(WordCount.class);
        job.setJar("WordCount.jar"); // the jar you will build shortly; path is relative to the working directory
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        job.setNumReduceTasks(2); // two reducers => two part-r-* output files
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
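To make the data flow concrete: suppose word.txt contains the single line hello world hello. The mapper emits (hello,1), (world,1), (hello,1); the combiner/reducer then sums per key, so the job's output is:

hello	2
world	1

Because of setNumReduceTasks(2), this output is split across two files, part-r-00000 and part-r-00001; which word lands in which file depends on the default hash partitioner.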
3: Packaging and testing
- Build WordCount.jar:
Press Ctrl+Alt+Shift+S to open the Project Structure dialog.
Click Artifacts => the + sign => JAR => Empty.
Enter WordCount as the name.
Then click the green button in the middle pane, choose the module output option, and confirm.
OK, Apply, and close the dialog.
Find Build => Build Artifacts in the top toolbar; a small chooser pops up.
Click Build, and an out folder appears under your project; open it and you'll find the freshly built WordCount.jar.
Copy this WordCount.jar into the project's root directory (that is where job.setJar("WordCount.jar") looks for it, since the path is relative to the working directory).
Once done, the whole project directory structure looks like this:
OK, almost done now.
- Run => Edit Configurations
Add an Application and configure it as follows.
The fields you need to change are Main class and Program arguments.
Note these program arguments in particular:
hdfs://centos:9000/input/word.txt
hdfs://centos:9000/output
centos is my hostname, i.e. the hostname of the Hadoop cluster's master node;
/input/word.txt is a file on the distributed file system HDFS that you must upload in advance; any txt file is fine for testing (see the sketch after this list);
/output is also an HDFS directory (not yet created). When you run the program, it may report that this output folder already exists; just delete it with hdfs dfs -rm -r /output.
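If you haven't uploaded a test file yet, you can run hdfs dfs -mkdir -p /input and hdfs dfs -put word.txt /input/ on the master, or do it from Windows with the HDFS Java API. A minimal, untested sketch (the local path C:/data/word.txt is just a placeholder for any txt file on your machine):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadTestFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml from resources
        FileSystem fs = FileSystem.get(conf);     // connects to hdfs://centos:9000
        fs.mkdirs(new Path("/input"));
        // Placeholder path: point this at any txt file on your machine.
        fs.copyFromLocalFile(new Path("C:/data/word.txt"), new Path("/input/word.txt"));
        fs.close();
    }
}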
OK! All done. You can click Run to test now. Remember to start the Hadoop cluster first~