HDFS Lab

  • HDFS Lab
  • 1. Installing Hadoop 2.7.2 on a single Windows machine
  • 2. Installing Hadoop on Linux
  • 3. Accessing HDFS on Windows from Java
  • 4. Accessing HDFS on Linux from Java
  • 5. MapReduce WordCount project
  • 6. MapReduce PhoneFlow project


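HDFS Lab

This article walks through the complete workflow of a set of basic HDFS experiments. Where a step is not described in detail, or where your environment causes problems, please search for a solution yourself.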

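1. Installing Hadoop 2.7.2 on a single Windows machine

(Hadoop 3.2 had no Windows environment available at the time of writing.)
Install JDK 1.7 on drive D and configure the environment variables.
Extract hadoop272 on drive D (the path must not contain symbols or spaces).
Overwrite the bin directory of hadoop272 with the bin directory from hadooponwindows-master.
Edit the following files under D:\hadoop272\etc\hadoop:
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
slaves
hadoop-env.cmd

Turn off the firewall. The systemctl commands below apply to CentOS 7 (see section 2); on Windows, turn off Windows Firewall in the system settings instead.
systemctl disable firewalld.service
systemctl stop firewalld

Format HDFS:
hadoop namenode -format
Start Hadoop:
d:/hadoop272/sbin/start-all.cmd
Run jps to check which daemons have started.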

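2. Installing Hadoop on Linux

2.1 Install CentOS 7 with VMware 12. Enable networking and set a static IP before installing.
1/ Add the hosts to /etc/hosts:
192.168.198.128 lining01
192.168.198.129 lining02
2.2 Install Java
1/ Upload jdk-7u79-linux-x64.tar.gz to /opt
2/ tar -zxvf jdk-7u79-linux-x64.tar.gz
3/ Append the following lines to /etc/profile: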
export JAVA_HOME=/opt/jdk1.7.0_79
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
4/ source /etc/profile

2.3 Install and configure Hadoop

1/ Upload hadoop-2.5.0-cdh5.3.6.tar.gz to /opt

2/ cd /opt

3/ tar -zxvf hadoop-2.5.0-cdh5.3.6.tar.gz

4/ In the extracted Hadoop directory, create the directories tmp, var, hdfs, hdfs/data and hdfs/name

5/ Edit core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, slaves and hadoop-env.sh


core-site.xml

<configuration>
    <!-- Address of the NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://lining01:9000</value>
    </property>
    <!-- Directory for the files Hadoop generates at runtime -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:///opt/hadoop-2.5.0-cdh5.3.6/tmp</value>
    </property>
    <!-- Maximum interval, in seconds, between two checkpoints -->
    <property>
        <name>fs.checkpoint.period</name>
        <value>3600</value>
    </property>
</configuration>
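
All of these site files share the same shape: a list of property elements, each with a name and a value. To make that structure concrete, here is a minimal sketch that reads such a file with the JDK's built-in DOM parser. The class `SiteXmlReader` is a hypothetical helper written for this illustration; it is not how Hadoop itself loads its configuration, which is done by `org.apache.hadoop.conf.Configuration`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads Hadoop-style <property> entries into a map.
public class SiteXmlReader {

    public static Map<String, String> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> props = new LinkedHashMap<String, String>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            // Each property carries exactly one <name> and one <value> child
            String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
            props.put(name, value);
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<configuration>"
                + "<property><name>fs.defaultFS</name><value>hdfs://lining01:9000</value></property>"
                + "<property><name>fs.checkpoint.period</name><value>3600</value></property>"
                + "</configuration>";
        Map<String, String> props = parse(xml);
        System.out.println(props.get("fs.defaultFS"));
    }
}
```

A quick check like this can help spot a misspelled property name before starting the cluster.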

hdfs-site.xml

<configuration>
    <!-- Number of replicas HDFS keeps of each block -->
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <!-- Where the NameNode stores its metadata -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///opt/hadoop-2.5.0-cdh5.3.6/hdfs/name</value>
    </property>
    <!-- Where the DataNode stores its blocks -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///opt/hadoop-2.5.0-cdh5.3.6/hdfs/data</value>
    </property>
</configuration>
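
To get a feel for what dfs.replication = 2 means for disk usage, the following sketch works through the arithmetic. It assumes the Hadoop 2.x default block size of 128 MB (dfs.blocksize), which the configuration above does not override; the class `ReplicationMath` and the 300 MB file are made up for the example.

```java
// Hypothetical helper: back-of-the-envelope HDFS storage arithmetic.
public class ReplicationMath {

    static final long MB = 1024L * 1024L;

    // Number of blocks a file is split into (ceiling division).
    public static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw bytes stored across the cluster: every byte is kept `replication` times.
    public static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long file = 300 * MB;    // example file size (assumption)
        long block = 128 * MB;   // Hadoop 2.x default dfs.blocksize
        System.out.println(blockCount(file, block) + " blocks");        // 3 blocks
        System.out.println(rawBytes(file, 2) / MB + " MB on disk");     // 600 MB on disk
    }
}
```

The last block of a file only occupies as much disk space as it actually holds, so the raw usage is the file size times the replication factor, not the block count times the block size.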

mapred-site.xml

<configuration>
    <!-- Run MapReduce (MR) jobs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <!-- The NodeManager hands data to reducers through the shuffle service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Host of the YARN master (ResourceManager) -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>lining01</value>
    </property>
    <!-- Enable YARN log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
</configuration>

2.4 Generate SSH keys on the master and the slaves, append the master's public key to both slaves, and append each slave's public key to the master
1/ Edit /etc/ssh/sshd_config:
RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
2/ ssh-keygen -t rsa -P ''
Copy /root/.ssh/id_rsa.pub to a file named authorized_keys.
Append lining01's public key to /root/.ssh/authorized_keys on lining02 and lining03.
Append lining02's public key to /root/.ssh/authorized_keys on lining01.
Append lining03's public key to /root/.ssh/authorized_keys on lining01.

3/ Test with: ssh lining02

2.5 Format HDFS
1/ Turn off the firewall:
systemctl disable firewalld.service
systemctl stop firewalld
2/ Format the Hadoop file system:
hadoop namenode -format
3/ Start HDFS with /opt/hadoop-2.5.0-cdh5.3.6/sbin/start-dfs.sh
Start YARN with /opt/hadoop-2.5.0-cdh5.3.6/sbin/start-yarn.sh
4/ jps (if permission is denied: chmod +x /opt/jdk1.7.0_79/bin/jps)
5/ Try the basic file system commands:
hadoop fs -ls
hadoop fs -mkdir
hadoop fs -put

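3. Accessing HDFS on Windows from Java

3.1 Install Hadoop on Windows (development happens on Windows, so both Hadoop and Eclipse are needed there). This was already done in section 1.
3.2 Put the eclipse-hadoop-2.7.2 plugin into eclipse\dropins, start Eclipse, and configure the location and the DFS view.
3.3 Create a new Java project named testwinhdfs.
Create a lib folder, copy all the jars from hadoop272 into it, and add them to the Build Path.
Copy the Windows hadoop272 configuration files core-site.xml and hdfs-site.xml into the src directory.
3.4 Create the Java class WinHdfs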

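import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WinHdfs {

    // Must match fs.defaultFS in the core-site.xml copied into src
    static String hdfsroot = "hdfs://localhost:9000";

    public static void main(String[] args) throws IOException {
        put(args);
    }

    // Download: args[0] is the HDFS path below hdfsroot, args[1] the local target file
    public static void get(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path(hdfsroot + args[0]);
        // try-with-resources closes both streams even if the copy fails
        try (FSDataInputStream in = fs.open(src);
             FileOutputStream os = new FileOutputStream(args[1])) {
            IOUtils.copy(in, os);
        }
        System.out.println("get success");
    }

    // Upload: args[0] is the local source file (e.g. G:/hello.txt), args[1] the HDFS path below hdfsroot
    public static void put(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path outpath = new Path(hdfsroot + args[1]);
        // Closing the HDFS output stream is what flushes the last data to the cluster
        try (FileInputStream in = new FileInputStream(args[0]);
             FSDataOutputStream out = fs.create(outpath)) {
            IOUtils.copy(in, out);
        }
        System.out.println("put success");
    }
}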
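
Both get and put come down to copying bytes from an input stream to an output stream, which is what IOUtils.copy does for the program above. As a rough illustration of that step, here is a plain-JDK sketch of such a copy loop. The class `StreamCopy` is hypothetical and the 4096-byte buffer is an arbitrary choice; it is not the Commons IO implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a buffered stream-to-stream copy.
public class StreamCopy {

    // Copies everything from in to out and returns the number of bytes copied.
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        int n;
        // read() returns -1 once the end of the stream is reached
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello hdfs".getBytes(StandardCharsets.UTF_8);
        // try-with-resources closes both streams when the block exits
        try (InputStream in = new ByteArrayInputStream(data);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            long copied = copy(in, out);
            System.out.println(copied + " bytes copied");
        }
    }
}
```

The same loop works whether the streams are backed by local files or by HDFS, because FSDataInputStream and FSDataOutputStream are ordinary java.io streams.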

4. Accessing HDFS on Linux from Java

4.1 The steps are the same as for HDFS on Windows, except that the user must be set to root:
Properties properties = System.getProperties();
properties.setProperty("HADOOP_USER_NAME", "root");

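import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LinuxHdfs {

    public static void main(String[] args) throws IOException {
        // Act as root on the cluster; must be set before the first FileSystem call
        Properties properties = System.getProperties();
        properties.setProperty("HADOOP_USER_NAME", "root");

        put(args);
    }

    // Download a fixed HDFS file to the local disk
    public static void get(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Uses fs.defaultFS from the core-site.xml on the classpath
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path("hdfs://192.168.92.129:9000/wordcount/phoneflow.csv");
        // try-with-resources closes both streams even if the copy fails
        try (FSDataInputStream in = fs.open(src);
             FileOutputStream os = new FileOutputStream("G:/phoneflow.csv")) {
            IOUtils.copy(in, os);
        }
        System.out.println("get success");
    }

    // Upload a fixed local file to HDFS
    public static void put(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        String src = "G:/hello.txt";
        Path outpath = new Path("hdfs://192.168.92.129:9000/wordcount/wordcount32d21.txt");
        // Closing the HDFS output stream is what flushes the last data to the cluster
        try (FileInputStream in = new FileInputStream(src);
             FSDataOutputStream out = fs.create(outpath)) {
            IOUtils.copy(in, out);
        }
        System.out.println("put success");
    }
}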
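
Why does setting a system property change the HDFS user? As far as I understand Hadoop's simple (non-Kerberos) authentication, the client looks for HADOOP_USER_NAME first in the environment and then in the Java system properties, and only then falls back to the operating-system login name. The sketch below merely imitates that lookup order in plain Java, under that assumption; the class `HadoopUserSketch` and its `resolve` method are hypothetical and not part of Hadoop.

```java
// Hypothetical helper: imitates the assumed HADOOP_USER_NAME lookup order.
public class HadoopUserSketch {

    static final String KEY = "HADOOP_USER_NAME";

    // envValue and osUser are passed in so the order can be tried out
    // without touching the real environment.
    public static String resolve(String envValue, String osUser) {
        if (envValue != null && !envValue.isEmpty()) {
            return envValue;                 // 1. environment variable
        }
        String prop = System.getProperty(KEY);
        if (prop != null && !prop.isEmpty()) {
            return prop;                     // 2. Java system property
        }
        return osUser;                       // 3. OS login name
    }

    public static void main(String[] args) {
        System.setProperty(KEY, "root");
        String user = resolve(System.getenv(KEY), System.getProperty("user.name"));
        System.out.println("effective user: " + user);
    }
}
```

If that assumption holds, exporting HADOOP_USER_NAME=root in the shell before starting the JVM would have the same effect as the setProperty call in LinuxHdfs.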

5. MapReduce WordCount project

Goal: count how often each word occurs in word.txt

5.1 The Map phase


5.2 The Reduce phase


5.3 The Shuffle phase


5.4 The overall MapReduce flow
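
The three phases can be tried out without a cluster. The following is a minimal in-memory sketch of the flow for word counting: map emits (word, 1) pairs, shuffle groups the pairs by key in sorted order, and reduce sums each group. The class `WordCountSim` and the two input lines are invented for this illustration; real Hadoop distributes these steps across machines and spills to disk, which the sketch does not attempt to model.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process imitation of map -> shuffle -> reduce.
public class WordCountSim {

    // Map: emit (word, 1) for every token of one input line.
    public static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<Map.Entry<String, Long>>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<String, Long>(itr.nextToken(), 1L));
        }
        return out;
    }

    // Shuffle: collect all values of the same key, with keys in sorted order.
    public static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<String, List<Long>>();
        for (Map.Entry<String, Long> p : pairs) {
            List<Long> values = grouped.get(p.getKey());
            if (values == null) {
                values = new ArrayList<Long>();
                grouped.put(p.getKey(), values);
            }
            values.add(p.getValue());
        }
        return grouped;
    }

    // Reduce: sum the values of one key.
    public static long reduce(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    public static Map<String, Long> run(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<Map.Entry<String, Long>>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Long> result = new TreeMap<String, Long>();
        for (Map.Entry<String, List<Long>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(Arrays.asList("hello world", "hello hadoop")));
    }
}
```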



5.5 Code

package wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class WordMapper extends Mapper<Object, Text, Text, LongWritable> {

        private final static LongWritable one = new LongWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Split the line on whitespace and emit (word, 1) for each token
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class WordReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private LongWritable result = new LongWritable();

        @Override
        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            // long, not int: the values are LongWritable
            long sum = 0;
            for (LongWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {

        //Properties properties = System.getProperties();
        //properties.setProperty("HADOOP_USER_NAME", "root");

        Configuration conf = new Configuration();
        // Set before Job.getInstance, because the job takes a copy of conf when it is created
        conf.set("mapreduce.job.jar", "G:/wordcount.jar");
        Job job = Job.getInstance(conf, WordCount.class.getSimpleName());

        job.setJarByClass(WordCount.class);

        String src = "hdfs://localhost:9000/wordcount/wordcount1.txt";
        // Where is the input data?
        //FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileInputFormat.setInputPaths(job, new Path(src));
        // Which mapper processes the input?
        job.setMapperClass(WordMapper.class);
        // What types does the map phase output?
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Which reducer processes the map output?
        job.setReducerClass(WordReducer.class);
        // What types does the reduce phase output?
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        String out = "hdfs://localhost:9000/wordcount/wordcountout12.txt";

        // The job fails if the output path already exists, so remove it first
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(new Path(out))) {
            fs.delete(new Path(out), true);
        }

        // Where does the output go?
        //FileOutputFormat.setOutputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(out));
        // Submit to YARN and block until the job has finished
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}

5.6 Run the JAR

hadoop jar g:/wordcount.jar wordcount.WordCount

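6. MapReduce PhoneFlow project

Goal 1: for each phone number in phoneflow.csv, compute the upstream traffic, the downstream traffic and their total, and partition the output by phone number.
File format (fields separated by \t):
13443245065 231 4243
13523421234 4321 142
13443245065 421 3243

Goal 2: sort the result, phoneflowsum.csv, by total traffic.
File format (fields separated by \t):
13443245065 652 7486 8138
13523421234 4321 142 4463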
6.1 Create FlowBean, implementing the WritableComparable interface

package FlowSum;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class FlowBean implements WritableComparable<FlowBean> {

    private String phonenum;

    private long upflow;
    private long downflow;
    private long sumflow;

    // Hadoop needs the no-argument constructor to create instances when deserializing
    public FlowBean() {
        super();
    }

    public FlowBean(String phonenum, long upflow, long downflow) {
        super();
        this.phonenum = phonenum;
        this.upflow = upflow;
        this.downflow = downflow;
        this.sumflow = upflow + downflow;
    }

    public void set(long upLink, long downLink) {
        this.upflow = upLink;
        this.downflow = downLink;
        this.sumflow = upLink + downLink;
    }

    public String getPhonenum() {
        return phonenum;
    }

    public void setPhonenum(String phonenum) {
        this.phonenum = phonenum;
    }

    public long getUpflow() {
        return upflow;
    }

    public void setUpflow(long upflow) {
        this.upflow = upflow;
        this.sumflow = this.upflow + this.downflow;
    }

    public long getDownflow() {
        return downflow;
    }

    public void setDownflow(long downflow) {
        this.downflow = downflow;
        this.sumflow = this.upflow + this.downflow;
    }

    public long getSumflow() {
        return upflow + downflow;
    }

    // Deserialize: fields must be read in exactly the order write() emits them
    @Override
    public void readFields(DataInput arg0) throws IOException {
        phonenum = arg0.readUTF();
        upflow = arg0.readLong();
        downflow = arg0.readLong();
        sumflow = arg0.readLong();
    }

    // Serialize the bean for transfer between map and reduce tasks
    @Override
    public void write(DataOutput arg0) throws IOException {
        arg0.writeUTF(phonenum);
        arg0.writeLong(upflow);
        arg0.writeLong(downflow);
        arg0.writeLong(sumflow);
    }

    @Override
    public String toString() {
        return upflow + "\t" + downflow + "\t" + sumflow;
    }

    // Descending by total traffic. Ties are broken by phone number so that
    // two different phones with the same total are not treated as one key.
    @Override
    public int compareTo(FlowBean o) {
        int bySum = Long.compare(o.getSumflow(), this.getSumflow());
        return bySum != 0 ? bySum : this.phonenum.compareTo(o.phonenum);
    }
}
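
The write and readFields methods of a Writable are built on java.io.DataOutput and java.io.DataInput, so the serialization round trip can be checked without a cluster. The sketch below does that for a simplified stand-in class; `FlowRecord` and its `roundTrip` method are hypothetical and only mirror the field layout of the bean, without depending on Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for a Writable bean, using only the JDK.
public class FlowRecord {

    public String phonenum;
    public long upflow;
    public long downflow;

    public FlowRecord() {
    }

    public FlowRecord(String phonenum, long upflow, long downflow) {
        this.phonenum = phonenum;
        this.upflow = upflow;
        this.downflow = downflow;
    }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(phonenum);
        out.writeLong(upflow);
        out.writeLong(downflow);
    }

    // Must read the fields in the same order write() wrote them
    public void readFields(DataInput in) throws IOException {
        phonenum = in.readUTF();
        upflow = in.readLong();
        downflow = in.readLong();
    }

    // Serializes a record to bytes and reads it back into a fresh instance.
    public static FlowRecord roundTrip(FlowRecord r) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        r.write(new DataOutputStream(bytes));
        FlowRecord copy = new FlowRecord();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        FlowRecord copy = roundTrip(new FlowRecord("13443245065", 231, 4243));
        System.out.println(copy.phonenum + "\t" + copy.upflow + "\t" + copy.downflow);
    }
}
```

If the read order and the write order ever disagree, the fields come back scrambled or the read fails, which is why the two methods are best edited together.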

6.2 Mapper

package FlowSum;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowSumMapper extends Mapper<LongWritable, Text, Text, FlowBean> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line: phone number, upstream traffic, downstream traffic (tab-separated)
        String line = value.toString();
        String[] fields = StringUtils.split(line, '\t');
        String phonenum = fields[0];
        long upflow = Long.parseLong(fields[1]);
        long downflow = Long.parseLong(fields[2]);

        // Key by phone number so that all records of one phone reach the same reduce call
        context.write(new Text(phonenum), new FlowBean(phonenum, upflow, downflow));
    }
}

6.3 Reducer

package FlowSum;

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowSumReduce extends Reducer<Text,FlowBean,Text,FlowBean>{

	@Override
	protected void reduce(Text key, Iterable<FlowBean> values, Context context)
			throws IOException, InterruptedException {
		
		long upflow_count = 0;
		long downflow_count = 0;
		for(FlowBean flowBean :values)
		{
			upflow_count += flowBean.getUpflow();
			downflow_count += flowBean.getDownflow(); 
		}
		context.write(key,new FlowBean(key.toString(),upflow_count,downflow_count));
		
	}

}

6.4 Partitioner

package FlowSum;

import java.util.HashMap;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FlowPartioner extends Partitioner<Text, FlowBean> {
    // Phone-number prefix -> partition index
    private static HashMap<String, Integer> map = new HashMap<>();
    static {
        map.put("135", 0);
        map.put("136", 1);
        map.put("137", 2);
        map.put("138", 3);
        map.put("139", 4);
    }

    @Override
    public int getPartition(Text key, FlowBean flowBean, int partitionnum) {
        // map.get returns null for a prefix that is not listed (134, for example);
        // such numbers all go to partition 5
        Integer partition = map.get(key.toString().substring(0, 3));
        return partition == null ? 5 : partition;
    }
}
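
The partitioning rule itself is easy to test outside Hadoop. The sketch below isolates it as a plain function: the first three digits select a partition from 0 to 4, and any other prefix falls through to partition 5. The class `PrefixPartitionSketch` is a hypothetical test harness, and the length check is an extra guard added here for malformed keys.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: the prefix-to-partition rule as a plain function.
public class PrefixPartitionSketch {

    private static final int DEFAULT_PARTITION = 5;
    private static final Map<String, Integer> PREFIXES = new HashMap<String, Integer>();
    static {
        PREFIXES.put("135", 0);
        PREFIXES.put("136", 1);
        PREFIXES.put("137", 2);
        PREFIXES.put("138", 3);
        PREFIXES.put("139", 4);
    }

    public static int partitionFor(String phone) {
        if (phone.length() < 3) {
            return DEFAULT_PARTITION;   // too short to carry a prefix
        }
        Integer p = PREFIXES.get(phone.substring(0, 3));
        // A missing prefix yields null; map it to the default instead of unboxing null
        return p == null ? DEFAULT_PARTITION : p;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("13523421234")); // 0
        System.out.println(partitionFor("13443245065")); // 5
    }
}
```

With six possible partition indexes (0 to 5), the job needs at least six reduce tasks, one per partition.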

6.5 Runner

package FlowSum;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class FlowSumRunner extends Configured implements Tool {

    static String hdfsroot = "hdfs://localhost:9000";

    // arg0[0] is the input path, arg0[1] the output path, both below hdfsroot
    @Override
    public int run(String[] arg0) throws Exception {

        // Use the configuration handed over by ToolRunner
        Configuration conf = getConf();
        // Set before Job.getInstance, because the job takes a copy of conf when it is created
        conf.set("mapreduce.job.jar", "G:/phoneflow.jar");

        Job job = Job.getInstance(conf, FlowSumRunner.class.getSimpleName());

        job.setJarByClass(FlowSumRunner.class);
        job.setMapperClass(FlowSumMapper.class);
        job.setReducerClass(FlowSumReduce.class);

        job.setPartitionerClass(FlowPartioner.class);
        // Number of reduce tasks: one per partition (0 to 5)
        job.setNumReduceTasks(6);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        FileInputFormat.setInputPaths(job, new Path(hdfsroot + arg0[0]));

        // The job fails if the output path already exists, so remove it first
        Path outpath = new Path(hdfsroot + arg0[1]);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outpath)) {
            fs.delete(outpath, true);
        }

        FileOutputFormat.setOutputPath(job, outpath);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new FlowSumRunner(), args);
        System.exit(res);
    }

}

6.6 Package and run:

hadoop jar g:/phoneflow.jar FlowSum.FlowSumRunner /phoneflow.csv /phonesum

6.7 Second MapReduce job: sort the output of the first job

Mapper

package floworder;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import FlowSum.FlowBean;

public class FlowSumMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line comes from the first job: phone, upstream, downstream, total
        String line = value.toString();
        String[] fields = StringUtils.split(line, '\t');
        String phonenum = fields[0];
        long upflow = Long.parseLong(fields[1]);
        long downflow = Long.parseLong(fields[2]);

        // The shuffle sorts by key, so FlowBean is used as the key and
        // its compareTo method defines the sort order
        context.write(new FlowBean(phonenum, upflow, downflow), NullWritable.get());
    }
}

Reducer

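package floworder;

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import FlowSum.FlowBean;

public class FlowSumReduce extends Reducer<FlowBean, NullWritable, Text, FlowBean> {
    @Override
    protected void reduce(FlowBean bean, Iterable<NullWritable> values,
            Reducer<FlowBean, NullWritable, Text, FlowBean>.Context context) throws IOException, InterruptedException {
        // Keys arrive already sorted. Hadoop refreshes the key object for every value,
        // so writing once per value emits every record of the group
        for (NullWritable ignored : values) {
            context.write(new Text(bean.getPhonenum()), bean);
        }
    }
}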
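
Sorting through the shuffle has a side effect worth knowing: keys that compare as equal land in the same reduce call. If the key's compareTo looked only at the total, two different phone numbers with the same total would be merged into one group. The sketch below demonstrates the effect with a TreeSet, which also treats "compares as equal" as "same key". The class `SortKeyTieBreak`, the record type `Rec` and the phone numbers are made up for this illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical demo: why a sort key needs a tie-breaker.
public class SortKeyTieBreak {

    public static class Rec {
        public final String phone;
        public final long total;

        public Rec(String phone, long total) {
            this.phone = phone;
            this.total = total;
        }
    }

    // Descending by total only: records with equal totals compare as equal.
    public static final Comparator<Rec> BY_TOTAL = new Comparator<Rec>() {
        @Override
        public int compare(Rec a, Rec b) {
            return Long.compare(b.total, a.total);
        }
    };

    // Descending by total, then by phone, so distinct phones never compare as equal.
    public static final Comparator<Rec> BY_TOTAL_THEN_PHONE = new Comparator<Rec>() {
        @Override
        public int compare(Rec a, Rec b) {
            int c = Long.compare(b.total, a.total);
            return c != 0 ? c : a.phone.compareTo(b.phone);
        }
    };

    // Number of distinct keys a comparator-based grouping would see.
    public static int distinctKeys(List<Rec> recs, Comparator<Rec> cmp) {
        TreeSet<Rec> keys = new TreeSet<Rec>(cmp);
        keys.addAll(recs);
        return keys.size();
    }

    public static void main(String[] args) {
        List<Rec> recs = Arrays.asList(
                new Rec("13500000001", 100),
                new Rec("13600000002", 100),
                new Rec("13700000003", 50));
        System.out.println(distinctKeys(recs, BY_TOTAL));            // 2
        System.out.println(distinctKeys(recs, BY_TOTAL_THEN_PHONE)); // 3
    }
}
```

Either a tie-breaking comparison or a reducer that writes one output line per value keeps such records from being lost.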

Package and run:

hadoop jar g:/phoneflow.jar FlowSum.FlowSumRunner /phoneflow.csv /phonesum

This completes the MapReduce job chain.