I've been learning data analysis recently, which involves Hadoop and Spark. I already set up a Hadoop cluster in virtual machines a while back, so today I want to try submitting a job (WordCount as the example) from IDEA on Windows 10 to the remote Hadoop cluster on the VMs.
1: Environment and preparation
- Win10 + IntelliJ IDEA 2017.1.6 + Hadoop 2.8.0
Note: Hadoop must be installed both on the VM and on the local machine; the steps are nearly identical in both places, so I won't repeat them here (there are plenty of guides online). After installing Hadoop on Win10, you also need to configure the environment variables.
- Hadoop cluster on the VM: adapt the following to your own setup, and paste these three lines into C:\Windows\System32\drivers\etc\hosts
# These are my own entries
192.168.253.100 centos
192.168.253.101 server1
192.168.253.102 server2
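To check that the entries took effect, you can ping one of the hostnames (e.g. ping centos) from a Windows command prompt before going any further.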
2: Creating the project in IDEA
- Create a new Maven project
- Pick your Java version, then Next
- The GroupId/ArtifactId can be anything you like; then Next, choose your own project directory, and Finish
- Add the dependencies
Open pom.xml and pay attention to the Hadoop-related entries; their versions must match your own installation (mine is 2.8.0):
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.msw</groupId>
    <artifactId>test1</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.8.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.8.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.8.0</version>
        </dependency>
    </dependencies>
</project>
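As an aside, hadoop-client should already pull in hadoop-common and hadoop-hdfs transitively, so the last two entries are likely redundant; they do no harm, though.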
- Add the configuration files
These mirror what you set up when configuring the cluster on the VM: simply copy the core-site.xml, mapred-site.xml, and yarn-site.xml files from the Hadoop directory on your VM into the project's resources folder. Mine look like this:
core-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://centos:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/root/hadoop/tmp</value>
    </property>
</configuration>
mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>centos:49001</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/root/hadoop/var</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.app-submission.cross-platform</name>
        <value>true</value>
    </property>
</configuration>
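Of these, mapreduce.app-submission.cross-platform is the one that allows a Windows client to submit jobs to a Linux cluster. As an alternative to shipping the XML files, here is a minimal, untested sketch of setting the key values in code (values taken from my XML files above):

// Sketch only: in WordCount.main() further below, before the Job is created.
// Configuration is org.apache.hadoop.conf.Configuration.
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://centos:9000");              // from core-site.xml
conf.set("mapreduce.framework.name", "yarn");                // from mapred-site.xml
conf.set("yarn.resourcemanager.hostname", "centos");         // from yarn-site.xml
conf.set("mapreduce.app-submission.cross-platform", "true"); // required when submitting from Windows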
yarn-site.xml:
<?xml version="1.0"?>
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>centos</value>
    </property>
    <property>
        <description>The address of the applications manager interface in the RM.</description>
        <name>yarn.resourcemanager.address</name>
        <value>${yarn.resourcemanager.hostname}:8032</value>
    </property>
    <property>
        <description>The address of the scheduler interface.</description>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>${yarn.resourcemanager.hostname}:8030</value>
    </property>
    <property>
        <description>The http address of the RM web application.</description>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>${yarn.resourcemanager.hostname}:8088</value>
    </property>
    <property>
        <description>The https address of the RM web application.</description>
        <name>yarn.resourcemanager.webapp.https.address</name>
        <value>${yarn.resourcemanager.hostname}:8090</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>${yarn.resourcemanager.hostname}:8031</value>
    </property>
    <property>
        <description>The address of the RM admin interface.</description>
        <name>yarn.resourcemanager.admin.address</name>
        <value>${yarn.resourcemanager.hostname}:8033</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>2048</value>
        <description>Maximum memory a single container may be allocated, in MB (default 8192).</description>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>2.1</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>
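A quick word on those memory settings: yarn.nodemanager.resource.memory-mb caps the physical memory YARN may hand out on each node (2 GB here), yarn.scheduler.maximum-allocation-mb caps what a single container can request, and vmem-pmem-ratio=2.1 means a container holding 1 GB of physical memory may use up to 2.1 GB of virtual memory before the virtual-memory check kills it. Setting yarn.nodemanager.vmem-check-enabled to false disables that check entirely, a common workaround for "running beyond virtual memory limits" errors on small VMs.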
Then log4j.properties, so the client's log output shows up in IDEA's console instead of a "no appenders" warning:
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ABSOLUTE} | %-5.5p | %-16.16t | %-32.32c{1} | %-32.32C %4L | %m%n
- Write WordCount.java
package com.msw;

/*
 * File: WordCount.java
 * Date: 2019/10/13-20:40
 * Author: msw.
 * PS ...
 */
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Mapper: split each input line into tokens and emit (word, 1) per token.
    public static class TokenizerMapper extends
            Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sum the counts for each word.
    public static class IntSumReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count"); // new Job(conf, ...) is deprecated
        job.setJarByClass(WordCount.class);
        job.setJar("WordCount.jar"); // the jar you will build shortly; path is relative to the working directory
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        job.setNumReduceTasks(2); // two reducers => two part-r-* output files
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
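To make the data flow concrete: suppose word.txt contains the single line hello world hello. The mapper emits (hello,1), (world,1), (hello,1); the combiner/reducer then sums per key, so the job's output is:

hello	2
world	1

Because of setNumReduceTasks(2), this output is split across two files, part-r-00000 and part-r-00001; which word lands in which file depends on the default hash partitioner.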
3: Packaging and testing
- Build WordCount.jar:
Press Ctrl+Alt+Shift+S to open the Project Structure dialog.
Click Artifacts => the + sign => JAR => Empty.
Enter WordCount as the name.
Then click the green button in the middle pane, choose the module output option, and confirm.
OK, Apply, and close the dialog.
Find Build => Build Artifacts in the top toolbar; a small chooser pops up.
Click Build, and an out folder appears under your project; open it and you'll find the freshly built WordCount.jar.
Copy this WordCount.jar into the project's root directory (that is where job.setJar("WordCount.jar") looks for it, since the path is relative to the working directory).
Once done, the whole project directory structure looks like this:
OK, almost done now.
- Run => Edit Configurations
Add an Application and configure it as follows.
The fields you need to change are Main class and Program arguments.
Note these program arguments in particular:
hdfs://centos:9000/input/word.txt
hdfs://centos:9000/output
centos is my hostname, i.e. the hostname of the Hadoop cluster's master node;
/input/word.txt is a file on the distributed file system HDFS that you must upload in advance; any txt file is fine for testing (see the sketch after this list);
/output is also an HDFS directory (not yet created). When you run the program, it may report that this output folder already exists; just delete it with hdfs dfs -rm -r /output.
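If you haven't uploaded a test file yet, you can run hdfs dfs -mkdir -p /input and hdfs dfs -put word.txt /input/ on the master, or do it from Windows with the HDFS Java API. A minimal, untested sketch (the local path C:/data/word.txt is just a placeholder for any txt file on your machine):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadTestFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml from resources
        FileSystem fs = FileSystem.get(conf);     // connects to hdfs://centos:9000
        fs.mkdirs(new Path("/input"));
        // Placeholder path: point this at any txt file on your machine.
        fs.copyFromLocalFile(new Path("C:/data/word.txt"), new Path("/input/word.txt"));
        fs.close();
    }
}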
OK! All done. You can click Run to test now. Remember to start the Hadoop cluster first~