HDFS Lab
- HDFS Lab
- 1. Installing Hadoop 2.7.2 standalone on Windows
- 2. Installing Hadoop on Linux
- 3. Accessing HDFS on Windows from Java
- 4. Accessing HDFS on Linux from Java
- 5. MapReduce WordCount project
- 6. MapReduce PhoneFlow project
HDFS Lab
This article walks through a basic HDFS lab end to end. Where a step is not spelled out in detail, or where your environment causes you to hit a snag, please search (Baidu/Google) for a fix yourself.
1. Installing Hadoop 2.7.2 standalone on Windows
(There is no Windows build of Hadoop 3.2 yet.)
Install JDK 1.7 on drive D and configure the environment variables.
Unpack hadoop272 to drive D (the path must not contain special characters, including spaces).
Overwrite hadoop272's bin directory with the one from hadooponwindows-master.
Configure hadoop272 on drive D by editing the files under
etc/hadoop/
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
slaves
hadoop-env.cmd
Turn off the Windows firewall (via the Control Panel, or from an administrator prompt):
netsh advfirewall set allprofiles state off
Format HDFS:
hadoop namenode -format
Start Hadoop:
d:/hadoop272/sbin/start-all.cmd
Use jps to check that the daemons (NameNode, DataNode, ResourceManager, NodeManager) have started.
2. Installing Hadoop on Linux
2.1 Install CentOS 7 in VMware 12; before installing, enable networking and assign a static IP to each machine.
1/ /etc/hosts
192.168.198.128 lining01
192.168.198.129 lining02
2.2 Install Java
1/ Upload jdk-7u79-linux-x64.tar.gz to /opt
2/ tar -zxvf jdk-7u79-linux-x64.tar.gz
3/ Add to /etc/profile:
export JAVA_HOME=/opt/jdk1.7.0_79
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
4/ source /etc/profile
2.3 Install and configure Hadoop
1/ Upload hadoop-2.5.0-cdh5.3.6.tar.gz to /opt
2/ cd /opt
3/ tar -zxvf hadoop-2.5.0-cdh5.3.6.tar.gz
4/ Under the Hadoop directory, create tmp, var, hdfs, hdfs/data and hdfs/name
5/ Edit core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, slaves and hadoop-env.sh
core-site.xml
<configuration>
  <!-- Address of the NameNode -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://lining01:9000</value>
  </property>
  <!-- Directory for files Hadoop generates at runtime -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:///opt/hadoop-2.5.0-cdh5.3.6/tmp</value>
  </property>
  <!-- Maximum interval between checkpoints of the edit log -->
  <property>
    <name>fs.checkpoint.period</name>
    <value>3600</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <!-- Number of replicas HDFS keeps for each block -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- Where the NameNode stores its metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop-2.5.0-cdh5.3.6/hdfs/name</value>
  </property>
  <!-- Where the DataNode stores its blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/hadoop-2.5.0-cdh5.3.6/hdfs/data</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <!-- Run MapReduce (MR) jobs on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
yarn-site.xml
<configuration>
  <!-- NodeManager auxiliary service: shuffle for MapReduce -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Hostname of the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>lining01</value>
  </property>
  <!-- Enable YARN log aggregation -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>
2.4 Generate SSH keys on the master and slaves; append the master's public key to both slaves, and each slave's public key to the master.
1/ Edit /etc/ssh/sshd_config:
RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
2/ ssh-keygen -t rsa -P ''
Copy /root/.ssh/id_rsa.pub to authorized_keys.
Append lining01's id_rsa.pub to /root/.ssh/authorized_keys on lining02 and lining03.
Append lining02's id_rsa.pub to /root/.ssh/authorized_keys on lining01.
Append lining03's id_rsa.pub to /root/.ssh/authorized_keys on lining01.
3/ Test: ssh lining02
2.5 Format HDFS and start the cluster
1/ Turn off the firewall:
systemctl disable firewalld.service
systemctl stop firewalld
2/ Format the Hadoop filesystem:
hadoop namenode -format
3/ Start HDFS: /opt/hadoop-2.5.0-cdh5.3.6/sbin/start-dfs.sh
Start YARN: /opt/hadoop-2.5.0-cdh5.3.6/sbin/start-yarn.sh
4/ jps (if permission is denied: chmod +x /opt/jdk1.7.0_79/bin/jps)
5/ Basic HDFS commands:
hadoop fs -ls /
hadoop fs -mkdir /wordcount
hadoop fs -put <local file> /wordcount
3. Accessing HDFS on Windows from Java
3.1 Install Hadoop on Windows (we program on Windows, so both Hadoop and Eclipse are needed there) -- already done in section 1.
3.2 Put the eclipse-hadoop-2.7.2 plugin into eclipse\dropins, start Eclipse, and configure the HDFS location and the DFS view.
3.3 New --> Java project: testwinhdfs
Create a lib folder, copy all the jars shipped with hadoop272 into it, and add them to the Build Path.
Copy the Windows hadoop272 configuration files core-site.xml and hdfs-site.xml into the src directory.
3.4 Create a Java class, WinHdfs:
import java.io.*;
import org.apache.commons.io.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;

public class WinHdfs {
    static String hdfsroot = "hdfs://localhost:9000";

    public static void main(String[] args) throws IOException {
        put(args);
    }

    public static void get(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path src = new Path(hdfsroot + args[0]);
        FSDataInputStream in = fs.open(src);
        FileOutputStream os = new FileOutputStream(args[1]);
        IOUtils.copy(in, os);
        System.out.println("get success");
    }

    public static void put(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        //String src = "G:/hello.txt";
        FileInputStream in = new FileInputStream(args[0]);
        Path outpath = new Path(hdfsroot + args[1]);
        FSDataOutputStream out = fs.create(outpath);
        IOUtils.copy(in, out);
        System.out.println("put success");
    }
}
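WinHdfs covers only get and put; other everyday operations (mkdir, ls, delete) follow the same FileSystem pattern. The sketch below is illustrative only and not part of the original project (the class name WinHdfsExtra is an assumption; it reuses the /wordcount directory used later in the lab):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not part of the original lab code.
public class WinHdfsExtra {
    static String hdfsroot = "hdfs://localhost:9000";

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from src
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path(hdfsroot + "/wordcount");
        fs.mkdirs(dir);                             // like: hadoop fs -mkdir /wordcount

        for (FileStatus status : fs.listStatus(dir)) {   // like: hadoop fs -ls /wordcount
            System.out.println(status.getPath() + "\t" + status.getLen());
        }

        // fs.delete(new Path(hdfsroot + "/wordcount/old.txt"), false);  // like: hadoop fs -rm
        fs.close();
    }
}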
4. Accessing HDFS on Linux from Java
4.1 The steps are the same as for the Windows HDFS, but the user must be set to root:
Properties properties = System.getProperties();
properties.setProperty("HADOOP_USER_NAME", "root");
import java.io.*;
import java.util.*;
import org.apache.commons.io.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;

public class LinuxHdfs {
    public static void main(String[] args) throws IOException {
        Properties properties = System.getProperties();
        properties.setProperty("HADOOP_USER_NAME", "root");
        put(args);
    }

    public static void get(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path src = new Path("hdfs://192.168.92.129:9000/wordcount/phoneflow.csv");
        // Obtain the FileSystem from the fully qualified path, so it matches the
        // cluster named in the URI even if a different core-site.xml is on the classpath.
        FileSystem fs = src.getFileSystem(conf);
        FSDataInputStream in = fs.open(src);
        FileOutputStream os = new FileOutputStream("G:/phoneflow.csv");
        IOUtils.copy(in, os);
        System.out.println("get success");
    }

    public static void put(String[] args) throws IOException {
        Configuration conf = new Configuration();
        String src = "G:/hello.txt";
        Path outpath = new Path("hdfs://192.168.92.129:9000/wordcount/wordcount32d21.txt");
        FileSystem fs = outpath.getFileSystem(conf);
        FileInputStream in = new FileInputStream(src);
        FSDataOutputStream out = fs.create(outpath);
        IOUtils.copy(in, out);
        System.out.println("put success");
    }
}
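Setting HADOOP_USER_NAME as a JVM system property is one way to act as root; the FileSystem API can also take the user name directly. A minimal sketch under that assumption (the class name is a placeholder, not part of the original project):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical variant, not part of the original lab code.
public class LinuxHdfsAsRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Pass the remote user explicitly instead of setting HADOOP_USER_NAME.
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.92.129:9000"), conf, "root");
        try (InputStream in = fs.open(new Path("/wordcount/phoneflow.csv"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);   // print the file to stdout
        }
        fs.close();
    }
}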
5. MapReduce WordCount project
Aim: count the occurrences of each word in word.txt.
5.1 Map phase: each input line is split into words and the mapper emits a (word, 1) pair per word.
5.2 Reduce phase: for each word, the reducer sums the 1s and emits (word, total count).
5.3 Shuffle phase: between map and reduce, the framework sorts the map output by key and delivers all values for the same word to one reducer.
5.4 Overall MapReduce flow: input split -> map -> shuffle/sort -> reduce -> output (see the single-process sketch below).
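Before looking at the Hadoop code, a single-process sketch using plain Java collections may help make the three phases concrete (illustration only; it is not part of the Hadoop job):

import java.util.*;

// Illustration only: simulates WordCount's map/shuffle/reduce inside one JVM.
public class WordCountSimulation {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello hadoop", "hello hdfs");

        // Map: emit (word, 1) for every token of every line.
        List<Map.Entry<String, Long>> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1L));
            }
        }

        // Shuffle: group the values by key (Hadoop also sorts the keys).
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> kv : mapped) {
            List<Long> values = grouped.get(kv.getKey());
            if (values == null) {
                values = new ArrayList<>();
                grouped.put(kv.getKey(), values);
            }
            values.add(kv.getValue());
        }

        // Reduce: sum the grouped values for each key.
        for (Map.Entry<String, List<Long>> entry : grouped.entrySet()) {
            long sum = 0;
            for (long one : entry.getValue()) sum += one;
            System.out.println(entry.getKey() + "\t" + sum);   // hadoop 1, hdfs 1, hello 2
        }
    }
}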
5.5 Code
package wordcount;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class WordMapper extends Mapper<Object, Text, Text, LongWritable> {
        private final static LongWritable one = new LongWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class WordReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private LongWritable result = new LongWritable();

        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        //Properties properties = System.getProperties();
        //properties.setProperty("HADOOP_USER_NAME", "root");
        Configuration conf = new Configuration();
        // The jar path must be set before the Job copies the Configuration.
        conf.set("mapreduce.job.jar", "G:/wordcount.jar");
        Job job = Job.getInstance(conf, WordCount.class.getSimpleName());
        job.setJarByClass(wordcount.WordCount.class);
        String src = "hdfs://localhost:9000/wordcount/wordcount1.txt";
        // Where is the input?
        //FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileInputFormat.setInputPaths(job, new Path(src));
        // Which mapper processes the input?
        job.setMapperClass(WordMapper.class);
        // What are the map output types?
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Which reducer processes the map output?
        job.setReducerClass(WordReducer.class);
        // What are the reduce output types?
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        String out = "hdfs://localhost:9000/wordcount/wordcountout12.txt";
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(new Path(out))) {
            fs.delete(new Path(out), true);
        }
        // Where does the output go?
        //FileOutputFormat.setOutputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(out));
        // Submit to YARN and wait until the job finishes.
        job.waitForCompletion(true);
    }
}
5.6 Run the jar
hadoop jar g:/wordcount.jar wordcount.WordCount
(If the commented-out args-based FileInputFormat/FileOutputFormat lines are used instead of the hard-coded paths, pass the HDFS input and output paths as the two command-line arguments.)
6. MapReduce PhoneFlow project
Aim 1: from phoneflow.csv, compute each phone number's uplink traffic, downlink traffic and total, and partition the output by phone-number prefix.
File format (fields separated by \t):
13443245065 231 4243
13523421234 4321 142
13443245065 421 3243
…
Aim 2: sort the resulting phoneflowsum.csv by total traffic.
File format (fields separated by \t):
13443245065 652 7486 8138
13523421234 4321 142 4463
…
6.1 Create FlowBean, implementing the WritableComparable interface
package FlowSum;

import java.io.*;
import org.apache.hadoop.io.*;

public class FlowBean implements WritableComparable<FlowBean> {
    private String phonenum;
    private long upflow;
    private long downflow;
    private long sumflow;

    public FlowBean(String phonenum, long upflow, long downflow) {
        super();
        this.phonenum = phonenum;
        this.upflow = upflow;
        this.downflow = downflow;
        this.sumflow = upflow + downflow;
    }

    public void set(long upLink, long downLink) {
        this.upflow = upLink;
        this.downflow = downLink;
        this.sumflow = upLink + downLink;
    }

    // A no-argument constructor is required so Hadoop can instantiate the bean during deserialization.
    public FlowBean() {
        super();
    }

    public String getPhonenum() {
        return phonenum;
    }

    public void setPhonenum(String phonenum) {
        this.phonenum = phonenum;
    }

    public long getUpflow() {
        return upflow;
    }

    public void setUpflow(long upflow) {
        this.upflow = upflow;
    }

    public long getDownflow() {
        return downflow;
    }

    public void setDownflow(long downflow) {
        this.downflow = downflow;
    }

    public long getSumflow() {
        return upflow + downflow;
    }

    // Deserialization: fields must be read in the same order they are written.
    @Override
    public void readFields(DataInput in) throws IOException {
        phonenum = in.readUTF();
        upflow = in.readLong();
        downflow = in.readLong();
        sumflow = in.readLong();
    }

    // Serialization.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(phonenum);
        out.writeLong(upflow);
        out.writeLong(downflow);
        out.writeLong(sumflow);
    }

    @Override
    public String toString() {
        return upflow + "\t" + downflow + "\t" + sumflow;
    }

    // Sort by total traffic in descending order (used by the second job, where FlowBean is the key).
    @Override
    public int compareTo(FlowBean o) {
        return this.getSumflow() > o.getSumflow() ? -1 : 1;
    }
}
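Because write and readFields must agree field for field, a quick round-trip through a byte stream is a handy local sanity check. A minimal sketch (illustrative only, not part of the original project):

import java.io.*;

// Hypothetical check, not part of the original lab code.
public class FlowBeanRoundTrip {
    public static void main(String[] args) throws IOException {
        FlowSum.FlowBean original = new FlowSum.FlowBean("13443245065", 231, 4243);

        // Serialize with write(), exactly as Hadoop does between map and reduce.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize into a fresh bean with readFields().
        FlowSum.FlowBean copy = new FlowSum.FlowBean();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        // Should print two identical lines: 231  4243  4474
        System.out.println(original);
        System.out.println(copy);
    }
}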
6.2 Mapper
package FlowSum;

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.commons.lang.StringUtils;

public class FlowSumMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = StringUtils.split(line, '\t');
        String phonenum = fields[0];
        long upflow = Long.parseLong(fields[1]);
        long downflow = Long.parseLong(fields[2]);
        context.write(new Text(phonenum), new FlowBean(phonenum, upflow, downflow));
    }
}
6.3 Reducer
package FlowSum;

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowSumReduce extends Reducer<Text, FlowBean, Text, FlowBean> {
    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context)
            throws IOException, InterruptedException {
        long upflow_count = 0;
        long downflow_count = 0;
        for (FlowBean flowBean : values) {
            upflow_count += flowBean.getUpflow();
            downflow_count += flowBean.getDownflow();
        }
        context.write(key, new FlowBean(key.toString(), upflow_count, downflow_count));
    }
}
6.4 Partitioner
package FlowSum;

import java.util.HashMap;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;

public class FlowPartioner extends Partitioner<Text, FlowBean> {
    // Map phone-number prefixes to partitions 0-4; anything else goes to partition 5.
    private static HashMap<String, Integer> map = new HashMap<>();
    static {
        map.put("135", 0);
        map.put("136", 1);
        map.put("137", 2);
        map.put("138", 3);
        map.put("139", 4);
    }

    @Override
    public int getPartition(Text key, FlowBean flowBean, int numPartitions) {
        Integer partition = map.get(key.toString().substring(0, 3));
        return partition == null ? 5 : partition;
    }
}
6.5 Runner
package FlowSum;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.io.Text;

public class FlowSumRunner extends Configured implements Tool {
    static String hdfsroot = "hdfs://localhost:9000";

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = getConf();
        // The jar path must be set before the Job copies the Configuration.
        conf.set("mapreduce.job.jar", "G:/phoneflow.jar");
        Job job = Job.getInstance(conf, FlowSumRunner.class.getSimpleName());
        job.setJarByClass(FlowSumRunner.class);
        job.setMapperClass(FlowSumMapper.class);
        job.setReducerClass(FlowSumReduce.class);
        job.setPartitionerClass(FlowPartioner.class);
        job.setNumReduceTasks(6);  // one reduce task per partition (0-5)
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        FileInputFormat.setInputPaths(job, new Path(hdfsroot + arg0[0]));
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(new Path(hdfsroot + arg0[1]))) {
            fs.delete(new Path(hdfsroot + arg0[1]), true);
        }
        FileOutputFormat.setOutputPath(job, new Path(hdfsroot + arg0[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new FlowSumRunner(), args);
        System.exit(res);
    }
}
6.6 Package and run:
hadoop jar g:/phoneflow.jar FlowSum.FlowSumRunner /phoneflow.csv /phonesum
With six reduce tasks, the /phonesum output directory contains part-r-00000 through part-r-00005, one file per partition.
6.7 Second MapReduce job: sort the output of the first job by total traffic
Mapper
package floworder;

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.commons.lang.StringUtils;
import FlowSum.FlowBean;  // reuse the bean from section 6.1

public class FlowSumMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = StringUtils.split(line, '\t');
        String phonenum = fields[0];
        long upflow = Long.parseLong(fields[1]);
        long downflow = Long.parseLong(fields[2]);
        // The shuffle sorts by key, so FlowBean is emitted as the key and its
        // compareTo method (descending by total traffic) determines the order.
        context.write(new FlowBean(phonenum, upflow, downflow), NullWritable.get());
    }
}
Reducer
package floworder;

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;
import FlowSum.FlowBean;  // reuse the bean from section 6.1

public class FlowSumReduce extends Reducer<FlowBean, NullWritable, Text, FlowBean> {
    @Override
    protected void reduce(FlowBean bean, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Keys arrive already sorted by total traffic; write them back out as
        // (phone number, flow bean) pairs.
        context.write(new Text(bean.getPhonenum()), bean);
    }
}
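The notes do not include a driver for this second job. For reference, a driver along the lines of the one in 6.5 might look like the sketch below; the class name FlowOrderRunner, the jar path, and the input/output paths are assumptions, not from the original:

package floworder;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import FlowSum.FlowBean;

// Hypothetical driver for the sort job; names and paths are illustrative.
public class FlowOrderRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.job.jar", "G:/phoneflow.jar");   // assumed jar path, as in 6.5
        Job job = Job.getInstance(conf, "floworder");
        job.setJarByClass(FlowOrderRunner.class);

        job.setMapperClass(FlowSumMapper.class);    // the mapper from 6.7
        job.setReducerClass(FlowSumReduce.class);   // the reducer from 6.7

        // FlowBean is the map output key, so the shuffle sorts by total traffic.
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // Input is the first job's output; a single reducer yields one globally sorted file.
        Path in = new Path("hdfs://localhost:9000" + args[0]);
        Path out = new Path("hdfs://localhost:9000" + args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(out)) {
            fs.delete(out, true);
        }
        FileOutputFormat.setOutputPath(job, out);
        job.setNumReduceTasks(1);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}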
Package and run it the same way, this time with the sort job's driver and the first job's output directory as input, e.g. (using the hypothetical driver sketched above):
hadoop jar g:/phoneflow.jar floworder.FlowOrderRunner /phonesum /phoneorder
This concludes the MapReduce job chain.