HDFS Lab
- HDFS Lab
- 1. Installing Hadoop 2.7.2 standalone on Windows
- 2. Installing Hadoop on Linux
- 3. Accessing HDFS on Windows from Java
- 4. Accessing HDFS on Linux from Java
- 5. MapReduce WordCount project
- 6. MapReduce PhoneFlow project
HDFS Lab
This article walks through a basic HDFS lab end to end. Where a step is not spelled out in detail, or where your environment causes you to hit a snag, please search (Baidu/Google) for a fix yourself.
1. Installing Hadoop 2.7.2 standalone on Windows
(There is no Windows build of Hadoop 3.2 yet.)
Install JDK 1.7 on drive D and configure the environment variables.
Unpack hadoop272 to drive D (the path must not contain special characters, including spaces).
Overwrite hadoop272's bin directory with the one from hadooponwindows-master.
Configure hadoop272 on drive D by editing the files under
etc/hadoop/
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
slaves
hadoop-env.cmd
Turn off the Windows firewall (via the Control Panel, or from an administrator prompt):
netsh advfirewall set allprofiles state off
Format HDFS:
hadoop namenode -format
Start Hadoop:
d:/hadoop272/sbin/start-all.cmd
Use jps to check that the daemons (NameNode, DataNode, ResourceManager, NodeManager) have started.
2. Installing Hadoop on Linux
2.1 Install CentOS 7 in VMware 12; before installing, enable networking and assign a static IP to each machine.
1/ /etc/hosts
192.168.198.128 lining01
192.168.198.129 lining02
2.2 Install Java
1/ Upload jdk-7u79-linux-x64.tar.gz to /opt
2/ tar -zxvf jdk-7u79-linux-x64.tar.gz
3/ Add to /etc/profile:
export JAVA_HOME=/opt/jdk1.7.0_79
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
4/ source /etc/profile
2.3 Install and configure Hadoop
1/ Upload hadoop-2.5.0-cdh5.3.6.tar.gz to /opt
2/ cd /opt
3/ tar -zxvf hadoop-2.5.0-cdh5.3.6.tar.gz
4/ Under the Hadoop directory, create tmp, var, hdfs, hdfs/data and hdfs/name
5/ Edit core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, slaves and hadoop-env.sh
core-site.xml
<configuration>
  <!-- Address of the NameNode -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://lining01:9000</value>
  </property>
  <!-- Directory for files Hadoop generates at runtime -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:///opt/hadoop-2.5.0-cdh5.3.6/tmp</value>
  </property>
  <!-- Maximum interval between checkpoints of the edit log -->
  <property>
    <name>fs.checkpoint.period</name>
    <value>3600</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <!-- Number of replicas HDFS keeps for each block -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- Where the NameNode stores its metadata -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop-2.5.0-cdh5.3.6/hdfs/name</value>
  </property>
  <!-- Where the DataNode stores its blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/hadoop-2.5.0-cdh5.3.6/hdfs/data</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <!-- Run MapReduce (MR) jobs on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
yarn-site.xml
<configuration>
  <!-- NodeManager auxiliary service: shuffle for MapReduce -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Hostname of the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>lining01</value>
  </property>
  <!-- Enable YARN log aggregation -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>
2.4 Generate SSH keys on the master and slaves; append the master's public key to both slaves, and each slave's public key to the master.
1/ Edit /etc/ssh/sshd_config:
RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
2/ ssh-keygen -t rsa -P ''
Copy /root/.ssh/id_rsa.pub to authorized_keys.
Append lining01's id_rsa.pub to /root/.ssh/authorized_keys on lining02 and lining03.
Append lining02's id_rsa.pub to /root/.ssh/authorized_keys on lining01.
Append lining03's id_rsa.pub to /root/.ssh/authorized_keys on lining01.
3/ Test: ssh lining02
2.5 Format HDFS and start the cluster
1/ Turn off the firewall:
systemctl disable firewalld.service
systemctl stop firewalld
2/ Format the Hadoop filesystem:
hadoop namenode -format
3/ Start HDFS: /opt/hadoop-2.5.0-cdh5.3.6/sbin/start-dfs.sh
Start YARN: /opt/hadoop-2.5.0-cdh5.3.6/sbin/start-yarn.sh
4/ jps (if permission is denied: chmod +x /opt/jdk1.7.0_79/bin/jps)
5/ Basic HDFS commands:
hadoop fs -ls /
hadoop fs -mkdir /wordcount
hadoop fs -put <local file> /wordcount
3. Accessing HDFS on Windows from Java
3.1 Install Hadoop on Windows (we program on Windows, so both Hadoop and Eclipse are needed there) -- already done in section 1.
3.2 Put the eclipse-hadoop-2.7.2 plugin into eclipse\dropins, start Eclipse, and configure the HDFS location and the DFS view.
3.3 New --> Java project: testwinhdfs
Create a lib folder, copy all the jars shipped with hadoop272 into it, and add them to the Build Path.
Copy the Windows hadoop272 configuration files core-site.xml and hdfs-site.xml into the src directory.
3.4 Create a Java class, WinHdfs:
import java.io.*;
import org.apache.commons.io.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;

public class WinHdfs {
    static String hdfsroot = "hdfs://localhost:9000";

    public static void main(String[] args) throws IOException {
        put(args);
    }

    public static void get(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path src = new Path(hdfsroot + args[0]);
        FSDataInputStream in = fs.open(src);
        FileOutputStream os = new FileOutputStream(args[1]);
        IOUtils.copy(in, os);
        System.out.println("get success");
    }

    public static void put(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        //String src = "G:/hello.txt";
        FileInputStream in = new FileInputStream(args[0]);
        Path outpath = new Path(hdfsroot + args[1]);
        FSDataOutputStream out = fs.create(outpath);
        IOUtils.copy(in, out);
        System.out.println("put success");
    }
}
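WinHdfs covers only get and put; other everyday operations (mkdir, ls, delete) follow the same FileSystem pattern. The sketch below is illustrative only and not part of the original project (the class name WinHdfsExtra is an assumption; it reuses the /wordcount directory used later in the lab):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper, not part of the original lab code.
public class WinHdfsExtra {
    static String hdfsroot = "hdfs://localhost:9000";

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from src
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path(hdfsroot + "/wordcount");
        fs.mkdirs(dir);                             // like: hadoop fs -mkdir /wordcount

        for (FileStatus status : fs.listStatus(dir)) {   // like: hadoop fs -ls /wordcount
            System.out.println(status.getPath() + "\t" + status.getLen());
        }

        // fs.delete(new Path(hdfsroot + "/wordcount/old.txt"), false);  // like: hadoop fs -rm
        fs.close();
    }
}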
4. Accessing HDFS on Linux from Java
4.1 The steps are the same as for the Windows HDFS, but the user must be set to root:
Properties properties = System.getProperties();
properties.setProperty("HADOOP_USER_NAME", "root");
import java.io.*;
import java.util.*;
import org.apache.commons.io.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;

public class LinuxHdfs {
    public static void main(String[] args) throws IOException {
        Properties properties = System.getProperties();
        properties.setProperty("HADOOP_USER_NAME", "root");
        put(args);
    }

    public static void get(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path src = new Path("hdfs://192.168.92.129:9000/wordcount/phoneflow.csv");
        // Obtain the FileSystem from the fully qualified path, so it matches the
        // cluster named in the URI even if a different core-site.xml is on the classpath.
        FileSystem fs = src.getFileSystem(conf);
        FSDataInputStream in = fs.open(src);
        FileOutputStream os = new FileOutputStream("G:/phoneflow.csv");
        IOUtils.copy(in, os);
        System.out.println("get success");
    }

    public static void put(String[] args) throws IOException {
        Configuration conf = new Configuration();
        String src = "G:/hello.txt";
        Path outpath = new Path("hdfs://192.168.92.129:9000/wordcount/wordcount32d21.txt");
        FileSystem fs = outpath.getFileSystem(conf);
        FileInputStream in = new FileInputStream(src);
        FSDataOutputStream out = fs.create(outpath);
        IOUtils.copy(in, out);
        System.out.println("put success");
    }
}
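Setting HADOOP_USER_NAME as a JVM system property is one way to act as root; the FileSystem API can also take the user name directly. A minimal sketch under that assumption (the class name is a placeholder, not part of the original project):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hypothetical variant, not part of the original lab code.
public class LinuxHdfsAsRoot {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Pass the remote user explicitly instead of setting HADOOP_USER_NAME.
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.92.129:9000"), conf, "root");
        try (InputStream in = fs.open(new Path("/wordcount/phoneflow.csv"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);   // print the file to stdout
        }
        fs.close();
    }
}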
5. MapReduce WordCount project
Aim: count the occurrences of each word in word.txt.
5.1 Map phase: each input line is split into words and the mapper emits a (word, 1) pair per word.
5.2 Reduce phase: for each word, the reducer sums the 1s and emits (word, total count).
5.3 Shuffle phase: between map and reduce, the framework sorts the map output by key and delivers all values for the same word to one reducer.
5.4 Overall MapReduce flow: input split -> map -> shuffle/sort -> reduce -> output (see the single-process sketch below).
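Before looking at the Hadoop code, a single-process sketch using plain Java collections may help make the three phases concrete (illustration only; it is not part of the Hadoop job):

import java.util.*;

// Illustration only: simulates WordCount's map/shuffle/reduce inside one JVM.
public class WordCountSimulation {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello hadoop", "hello hdfs");

        // Map: emit (word, 1) for every token of every line.
        List<Map.Entry<String, Long>> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1L));
            }
        }

        // Shuffle: group the values by key (Hadoop also sorts the keys).
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> kv : mapped) {
            List<Long> values = grouped.get(kv.getKey());
            if (values == null) {
                values = new ArrayList<>();
                grouped.put(kv.getKey(), values);
            }
            values.add(kv.getValue());
        }

        // Reduce: sum the grouped values for each key.
        for (Map.Entry<String, List<Long>> entry : grouped.entrySet()) {
            long sum = 0;
            for (long one : entry.getValue()) sum += one;
            System.out.println(entry.getKey() + "\t" + sum);   // hadoop 1, hdfs 1, hello 2
        }
    }
}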
5.5 Code
package wordcount;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class WordMapper extends Mapper<Object, Text, Text, LongWritable> {
        private final static LongWritable one = new LongWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class WordReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private LongWritable result = new LongWritable();

        public void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        //Properties properties = System.getProperties();
        //properties.setProperty("HADOOP_USER_NAME", "root");
        Configuration conf = new Configuration();
        // The jar path must be set before the Job copies the Configuration.
        conf.set("mapreduce.job.jar", "G:/wordcount.jar");
        Job job = Job.getInstance(conf, WordCount.class.getSimpleName());
        job.setJarByClass(wordcount.WordCount.class);
        String src = "hdfs://localhost:9000/wordcount/wordcount1.txt";
        // Where is the input?
        //FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileInputFormat.setInputPaths(job, new Path(src));
        // Which mapper processes the input?
        job.setMapperClass(WordMapper.class);
        // What are the map output types?
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // Which reducer processes the map output?
        job.setReducerClass(WordReducer.class);
        // What are the reduce output types?
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        String out = "hdfs://localhost:9000/wordcount/wordcountout12.txt";
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(new Path(out))) {
            fs.delete(new Path(out), true);
        }
        // Where does the output go?
        //FileOutputFormat.setOutputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(out));
        // Submit to YARN and wait until the job finishes.
        job.waitForCompletion(true);
    }
}
5.6 Run the jar
hadoop jar g:/wordcount.jar wordcount.WordCount
(If the commented-out args-based FileInputFormat/FileOutputFormat lines are used instead of the hard-coded paths, pass the HDFS input and output paths as the two command-line arguments.)
6. MapReduce PhoneFlow project
Aim 1: from phoneflow.csv, compute each phone number's uplink traffic, downlink traffic and total, and partition the output by phone-number prefix.
File format (fields separated by \t):
13443245065 231 4243
13523421234 4321 142
13443245065 421 3243
…
Aim 2: sort the resulting phoneflowsum.csv by total traffic.
File format (fields separated by \t):
13443245065 652 7486 8138
13523421234 4321 142 4463
…
6.1 Create FlowBean, implementing the WritableComparable interface
package FlowSum;

import java.io.*;
import org.apache.hadoop.io.*;

public class FlowBean implements WritableComparable<FlowBean> {
    private String phonenum;
    private long upflow;
    private long downflow;
    private long sumflow;

    public FlowBean(String phonenum, long upflow, long downflow) {
        super();
        this.phonenum = phonenum;
        this.upflow = upflow;
        this.downflow = downflow;
        this.sumflow = upflow + downflow;
    }

    public void set(long upLink, long downLink) {
        this.upflow = upLink;
        this.downflow = downLink;
        this.sumflow = upLink + downLink;
    }

    // A no-argument constructor is required so Hadoop can instantiate the bean during deserialization.
    public FlowBean() {
        super();
    }

    public String getPhonenum() {
        return phonenum;
    }

    public void setPhonenum(String phonenum) {
        this.phonenum = phonenum;
    }

    public long getUpflow() {
        return upflow;
    }

    public void setUpflow(long upflow) {
        this.upflow = upflow;
    }

    public long getDownflow() {
        return downflow;
    }

    public void setDownflow(long downflow) {
        this.downflow = downflow;
    }

    public long getSumflow() {
        return upflow + downflow;
    }

    // Deserialization: fields must be read in the same order they are written.
    @Override
    public void readFields(DataInput in) throws IOException {
        phonenum = in.readUTF();
        upflow = in.readLong();
        downflow = in.readLong();
        sumflow = in.readLong();
    }

    // Serialization.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(phonenum);
        out.writeLong(upflow);
        out.writeLong(downflow);
        out.writeLong(sumflow);
    }

    @Override
    public String toString() {
        return upflow + "\t" + downflow + "\t" + sumflow;
    }

    // Sort by total traffic in descending order (used by the second job, where FlowBean is the key).
    @Override
    public int compareTo(FlowBean o) {
        return this.getSumflow() > o.getSumflow() ? -1 : 1;
    }
}
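Because write and readFields must agree field for field, a quick round-trip through a byte stream is a handy local sanity check. A minimal sketch (illustrative only, not part of the original project):

import java.io.*;

// Hypothetical check, not part of the original lab code.
public class FlowBeanRoundTrip {
    public static void main(String[] args) throws IOException {
        FlowSum.FlowBean original = new FlowSum.FlowBean("13443245065", 231, 4243);

        // Serialize with write(), exactly as Hadoop does between map and reduce.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize into a fresh bean with readFields().
        FlowSum.FlowBean copy = new FlowSum.FlowBean();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        // Should print two identical lines: 231  4243  4474
        System.out.println(original);
        System.out.println(copy);
    }
}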
6.2 Mapper
package FlowSum;

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.commons.lang.StringUtils;

public class FlowSumMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = StringUtils.split(line, '\t');
        String phonenum = fields[0];
        long upflow = Long.parseLong(fields[1]);
        long downflow = Long.parseLong(fields[2]);
        context.write(new Text(phonenum), new FlowBean(phonenum, upflow, downflow));
    }
}
6.3 Reducer
package FlowSum;

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowSumReduce extends Reducer<Text, FlowBean, Text, FlowBean> {
    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context)
            throws IOException, InterruptedException {
        long upflow_count = 0;
        long downflow_count = 0;
        for (FlowBean flowBean : values) {
            upflow_count += flowBean.getUpflow();
            downflow_count += flowBean.getDownflow();
        }
        context.write(key, new FlowBean(key.toString(), upflow_count, downflow_count));
    }
}
6.4 Partitioner
package FlowSum;

import java.util.HashMap;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;

public class FlowPartioner extends Partitioner<Text, FlowBean> {
    // Map phone-number prefixes to partitions 0-4; anything else goes to partition 5.
    private static HashMap<String, Integer> map = new HashMap<>();
    static {
        map.put("135", 0);
        map.put("136", 1);
        map.put("137", 2);
        map.put("138", 3);
        map.put("139", 4);
    }

    @Override
    public int getPartition(Text key, FlowBean flowBean, int numPartitions) {
        Integer partition = map.get(key.toString().substring(0, 3));
        return partition == null ? 5 : partition;
    }
}
6.5 Runner
package FlowSum;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.io.Text;

public class FlowSumRunner extends Configured implements Tool {
    static String hdfsroot = "hdfs://localhost:9000";

    @Override
    public int run(String[] arg0) throws Exception {
        Configuration conf = getConf();
        // The jar path must be set before the Job copies the Configuration.
        conf.set("mapreduce.job.jar", "G:/phoneflow.jar");
        Job job = Job.getInstance(conf, FlowSumRunner.class.getSimpleName());
        job.setJarByClass(FlowSumRunner.class);
        job.setMapperClass(FlowSumMapper.class);
        job.setReducerClass(FlowSumReduce.class);
        job.setPartitionerClass(FlowPartioner.class);
        job.setNumReduceTasks(6);  // one reduce task per partition (0-5)
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        FileInputFormat.setInputPaths(job, new Path(hdfsroot + arg0[0]));
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(new Path(hdfsroot + arg0[1]))) {
            fs.delete(new Path(hdfsroot + arg0[1]), true);
        }
        FileOutputFormat.setOutputPath(job, new Path(hdfsroot + arg0[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new FlowSumRunner(), args);
        System.exit(res);
    }
}
6.6 Package and run:
hadoop jar g:/phoneflow.jar FlowSum.FlowSumRunner /phoneflow.csv /phonesum
With six reduce tasks, the /phonesum output directory contains part-r-00000 through part-r-00005, one file per partition.
6.7 Second MapReduce job: sort the output of the first job by total traffic
Mapper
package floworder;

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.commons.lang.StringUtils;
import FlowSum.FlowBean;  // reuse the bean from section 6.1

public class FlowSumMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = StringUtils.split(line, '\t');
        String phonenum = fields[0];
        long upflow = Long.parseLong(fields[1]);
        long downflow = Long.parseLong(fields[2]);
        // The shuffle sorts by key, so FlowBean is emitted as the key and its
        // compareTo method (descending by total traffic) determines the order.
        context.write(new FlowBean(phonenum, upflow, downflow), NullWritable.get());
    }
}
Reducer
package floworder;

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;
import FlowSum.FlowBean;  // reuse the bean from section 6.1

public class FlowSumReduce extends Reducer<FlowBean, NullWritable, Text, FlowBean> {
    @Override
    protected void reduce(FlowBean bean, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Keys arrive already sorted by total traffic; write them back out as
        // (phone number, flow bean) pairs.
        context.write(new Text(bean.getPhonenum()), bean);
    }
}
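The notes do not include a driver for this second job. For reference, a driver along the lines of the one in 6.5 might look like the sketch below; the class name FlowOrderRunner, the jar path, and the input/output paths are assumptions, not from the original:

package floworder;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import FlowSum.FlowBean;

// Hypothetical driver for the sort job; names and paths are illustrative.
public class FlowOrderRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.job.jar", "G:/phoneflow.jar");   // assumed jar path, as in 6.5
        Job job = Job.getInstance(conf, "floworder");
        job.setJarByClass(FlowOrderRunner.class);

        job.setMapperClass(FlowSumMapper.class);    // the mapper from 6.7
        job.setReducerClass(FlowSumReduce.class);   // the reducer from 6.7

        // FlowBean is the map output key, so the shuffle sorts by total traffic.
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // Input is the first job's output; a single reducer yields one globally sorted file.
        Path in = new Path("hdfs://localhost:9000" + args[0]);
        Path out = new Path("hdfs://localhost:9000" + args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(out)) {
            fs.delete(out, true);
        }
        FileOutputFormat.setOutputPath(job, out);
        job.setNumReduceTasks(1);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}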
Package and run it the same way, this time with the sort job's driver and the first job's output directory as input, e.g. (using the hypothetical driver sketched above):
hadoop jar g:/phoneflow.jar floworder.FlowOrderRunner /phonesum /phoneorder
This concludes the MapReduce job chain.