Hive: Computing the Average and the Mode

Reposted

香奈儿 2023-08-23 10:49:35


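The second monthly exam is coming up. Topics marked with five stars are certain to be tested.
Topics with fewer than five stars may be tested; the more stars, the higher the likelihood.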

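Contents

Multiple-choice questions

Hive basic syntax ★★★★★
Modifying and dropping Hive tables ★★★
Exporting data from Hive tables ★★★
Loading data into Hive tables ★★★
Hive external tables ★★★
Hive partitioned tables ★★★
Hive bucketed tables ★★★
Basic Hive operations ★★
Three ways to use Hive ★★★
Hive parameter configuration ★★
Listing built-in functions ★
Showing the usage of a built-in function ★
Showing the detailed usage of a built-in function ★

Short-answer questions

Four questions will be chosen from the following three articles

Programming questions

(1) Basic HDFS Java API ★★★★★
(2) Hive user-defined functions: UDF version ★★★★★
(3) Custom Partitioner in MapReduce ★★★★★
(4) Custom MapJoin and ReduceJoin ★★★★★
(5) Using Snappy compression in MapReduce Java code ★★★
(6) The instructor says you must be able to do this one (a similar question is certain to appear) ★★★★★
(7) Hive user-defined functions: reflect version ★★★
(8) Writing MapReduce results to multiple files by category ★
(9) Must-know questions ★★★★★
(10) Mock exam ★★★★★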

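Multiple-choice questions

Hive basic syntax ★★★★★

Modifying and dropping Hive tables ★★★

Exporting data from Hive tables ★★★

Loading data into Hive tables ★★★

Hive external tables ★★★

Hive partitioned tables ★★★

Hive bucketed tables ★★★

Basic Hive operations ★★

Three ways to use Hive ★★★

Hive parameter configuration ★★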

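Precedence: parameter declaration (SET) > command-line parameters > configuration file parameters

Command-line parameters:
hive -hiveconf <parameter name>=<value>

hive -hiveconf hive.root.logger=INFO,console

log4j-related settings cannot be changed through parameter declaration, because they are read before the session is established.

Hive's configuration files are:

User-defined configuration file: $HIVE_CONF_DIR/hive-site.xml
Default configuration file: $HIVE_CONF_DIR/hive-default.xml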

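Listing built-in functions ★

show functions;

Showing the usage of a built-in function ★

desc function <function name>;

hive> desc function upper;

Showing the detailed usage of a built-in function ★

hive> desc function extended upper;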

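Short-answer questions

Four questions will be chosen from the following three articles

In theory, memorizing all of them is enough for full marks.
Big Data Assessment Questions (1), Big Data Assessment Questions (2), Big Data Assessment Questions (3)

Programming questions

(1) Basic HDFS Java API ★★★★★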

First, every operation on HDFS starts from a FileSystem object:

FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.100.201:8020"),new Configuration());

Next come the commonly used input and output streams:

FSDataInputStream fsDataInputStream = fileSystem.open(new Path("/home/test.txt"));
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path("/a.txt"));
FileOutputStream fileOutputStream = new FileOutputStream(new File("E:\\test.txt"));

Commonly used FileSystem methods:

Method	Purpose
open	Creates an FSDataInputStream
create	Creates an FSDataOutputStream
mkdirs	Creates a directory
copyFromLocalFile	Copies a file from the local file system to HDFS
listStatus	Returns an array of FileStatus objects for the entries in a directory
rename	Renames a file or directory
getFileStatus	Returns the FileStatus object of a file or directory
delete	Deletes a file or directory
exists	Checks whether a file or directory exists

org.apache.commons.io.IOUtils

IOUtils is used for two things:

Copying data from an input stream to an output stream

IOUtils.copy(inputStream, outputStream);

Closing streams

IOUtils.closeQuietly(stream);
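To make the copy step concrete, here is a minimal sketch of what a stream-to-stream copy does, written with the JDK only. It is not the commons-io implementation; the class name StreamCopySketch, the buffer size, and the in-memory streams are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class StreamCopySketch {
    // Copies every byte from in to out and returns how many bytes were copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096]; // buffer size chosen arbitrarily
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for an HDFS input stream and a local output stream.
        // try-with-resources closes both streams automatically.
        try (InputStream in = new ByteArrayInputStream("hello world".getBytes(StandardCharsets.UTF_8));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            long copied = copy(in, out);
            System.out.println(copied + " bytes: " + new String(out.toByteArray(), StandardCharsets.UTF_8));
        }
    }
}
```

With real HDFS code the two arguments would be the FSDataInputStream and the FileOutputStream shown earlier.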

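FileStatus:

getPath() returns the file path; append .toString() to get it as a string, like this:

fileStatus.getPath().toString()

Output:

hdfs://192.168.100.201:8020/a.txt

getModificationTime() returns the last modification time

fileStatus.getModificationTime()

The result is in milliseconds since the Unix epoch.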
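A raw millisecond value is hard to read, so it usually gets formatted. The sketch below shows one way to do that with java.time; the class name ModificationTimeSketch, the output pattern, and the choice of UTC are assumptions, and the value passed in main is a placeholder rather than a real file's timestamp.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class ModificationTimeSketch {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    // Formats epoch milliseconds (the unit getModificationTime() returns) as UTC text.
    static String format(long epochMillis) {
        return FORMAT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // Placeholder value; real code would pass fileStatus.getModificationTime().
        System.out.println(format(0L)); // prints 1970-01-01 00:00:00
    }
}
```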

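(2) Hive user-defined functions: UDF version ★★★★★

Write a Java class that extends UDF and define an evaluate method.
The method must be named evaluate; any other name is ignored.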

package com.czxy.demo01;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class TestUDF extends UDF {

    public Text evaluate(final Text s) {
        if (null == s) {
            return null;
        }

        // Return the string in upper case
        return new Text(s.toString().toUpperCase());
    }

}

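As with the reflect version, once the code is written, package it, upload it to Linux, and open the Hive shell:
add jar <path to jar>;

From here on it differs: the class has to be registered as a temporary or permanent function.

Create a temporary function

create temporary function touppercase as 'com.czxy.demo01.TestUDF';

Create a permanent function

create function touppercase1 as 'com.czxy.demo01.TestUDF';

Drop the temporary function

drop temporary function touppercase;

Drop the permanent function

drop function touppercase1;

Use the user-defined function

select touppercase('abc');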

(3) Custom Partitioner in MapReduce ★★★★★

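Custom partitioning in brief:
There are no special requirements on the map or reduce side, but you must write a class that extends Partitioner.
Override getPartition() and put the partitioning rule in it.
Note that partitioning happens after map and before reduce.
The types it receives are the map output types, so the generic parameters of Partitioner are also the map output key and value types.

(With the newer dependency versions in the pom file, partitioning does not work in local mode; the job has to be packaged and run on the cluster.)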

public class MyPartitioner extends Partitioner<Text,NullWritable> {
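    /**
     * The return value is the partition the record goes to.
     * It is only a partition label: all records that get the same value go to the same partition.
     */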
    @Override
    public int getPartition(Text text, NullWritable nullWritable, int i) {
        String result = text.toString().split("\t")[5];
        System.out.println(result);
        if (Integer.parseInt(result) > 15){
            return 1;
        }else{
            return 0;
        }
    }
}

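Required for running on the cluster:

job.setJarByClass(PartitionMain.class);

Required for partitioning:

/**
 * Set the partitioner class and the number of reduce tasks.
 * Note: the number of reduce tasks must match the number of partitions.
 */
job.setPartitionerClass(MyPartitioner.class);
job.setNumReduceTasks(2);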
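Because the partitioning rule itself is plain Java, it can be checked without a cluster. The sketch below repeats the rule from MyPartitioner on ordinary strings and adds a range check that illustrates why the reduce task count has to cover every index the rule can return. The class name PartitionRuleSketch, the isValid helper, and the sample records are assumptions made for illustration; none of this uses the Hadoop API.

```java
class PartitionRuleSketch {
    // Same rule as MyPartitioner: the sixth tab-separated field decides the partition.
    static int getPartition(String line) {
        int value = Integer.parseInt(line.split("\t")[5]);
        return value > 15 ? 1 : 0;
    }

    // A partition index is usable only if 0 <= index < numReduceTasks.
    static boolean isValid(int partition, int numReduceTasks) {
        return partition >= 0 && partition < numReduceTasks;
    }

    public static void main(String[] args) {
        // Made-up records with six tab-separated fields.
        String[] lines = {"a\tb\tc\td\te\t20", "a\tb\tc\td\te\t15"};
        for (String line : lines) {
            int partition = getPartition(line);
            System.out.println(partition + " valid with 2 reduce tasks: " + isValid(partition, 2));
        }
    }
}
```

The rule returns 0 or 1, which is why the driver sets two reduce tasks.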

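(4) Custom MapJoin and ReduceJoin ★★★★★

Custom ReduceJoin in brief:

In map, use the field from the join's ON condition as the key, wrap the record in a bean, and emit it to reduce.
In reduce, the records from the different inputs arrive grouped by key; check which fields each one carries and merge them into one bean, which is equivalent to a join.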
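The idea can be tried out in plain Java before writing any MapReduce code. The sketch below simulates the shuffle with a sorted map and the reduce step with a merge per key. The class name ReduceJoinSketch, the three-element record layout, and the sample data are assumptions made for illustration; a real job would use a Writable bean as described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceJoinSketch {
    // Each record is {joinKey, fieldName, fieldValue}, roughly what a mapper would emit.
    static List<String> join(List<String[]> mapOutput) {
        // "Shuffle": group the records by join key (TreeMap keeps the keys sorted).
        Map<String, Map<String, String>> grouped = new TreeMap<>();
        for (String[] record : mapOutput) {
            grouped.computeIfAbsent(record[0], k -> new LinkedHashMap<>()).put(record[1], record[2]);
        }
        // "Reduce": merge the fields that share a key into one output row.
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : grouped.entrySet()) {
            Map<String, String> fields = entry.getValue();
            rows.add(entry.getKey() + "," + fields.get("name") + "," + fields.get("age"));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = new ArrayList<>();
        mapOutput.add(new String[]{"2", "name", "lisi"}); // made-up sample data
        mapOutput.add(new String[]{"1", "age", "20"});
        mapOutput.add(new String[]{"1", "name", "zhangsan"});
        mapOutput.add(new String[]{"2", "age", "25"});
        join(mapOutput).forEach(System.out::println);
    }
}
```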

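Custom MapJoin in brief:
When one file is large and the other is small, the small file can be loaded into the program so that every instance of the (same) program holds this small data set, which is then matched against the large data set.
The small data set is loaded into a Map<K,V>:
the key is the unique identifier and the value is the corresponding record.
For each record of the large data set, take its join field and get the matching entry from the Map.
Finally, concatenate the two records.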
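The same steps can be sketched without Hadoop: build the lookup table from the small data set, then stream the large data set through it. The class name MapJoinSketch, the comma-separated layout, the keyIndex parameter, and the sample records are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapJoinSketch {
    // Builds the lookup table: the first comma-separated field of each small-table line is the key.
    static Map<String, String> loadSmallTable(List<String> smallLines) {
        Map<String, String> table = new HashMap<>();
        for (String line : smallLines) {
            table.put(line.split(",")[0], line);
        }
        return table;
    }

    // Joins each large-table line with its matching small-table line; lines without a match are skipped.
    static List<String> join(Map<String, String> table, List<String> bigLines, int keyIndex) {
        List<String> joined = new ArrayList<>();
        for (String line : bigLines) {
            String match = table.get(line.split(",")[keyIndex]);
            if (match != null) {
                joined.add(line + "\t" + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Made-up data: products are the small table, orders the large one.
        Map<String, String> products = loadSmallTable(Arrays.asList("p1,phone", "p2,laptop"));
        List<String> orders = Arrays.asList("o1,p1,2", "o2,p3,1", "o3,p2,5");
        join(products, orders, 1).forEach(System.out::println);
    }
}
```

In the real job the table is filled in setup and the lookup happens in map.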

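Driver class notes:
Add one line that registers the distributed cache file and sets its location.
A map-side join needs no reduce phase: there is no reducer code to write and nothing reduce-related to declare in the driver.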

Configuration conf = new Configuration();

DistributedCache.addCacheFile(new URI("hdfs://192.168.100.201:8020/aaaaa/pdts.txt"),conf);

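Mapper class notes:
Load the cache file into a map, with the unique identifier as the key and the rest of the record as the value.
Read the cache file in the setup method; in the map method, join the records that share a key and emit them.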

    HashMap<String,String> map = new HashMap<String,String>();
    String line = null;

    // setup runs only once, when the task is initialized
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Get the files registered in the distributed cache
        URI[] cacheFiles = DistributedCache.getCacheFiles(context.getConfiguration());

        URI cacheFile = cacheFiles[0];

        FileSystem fileSystem = FileSystem.get(cacheFile, context.getConfiguration());
        FSDataInputStream inputStream = fileSystem.open(new Path(cacheFile));

        // BufferedReader reads the file line by line; InputStreamReader bridges the byte stream to it
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));
        while ((line = bufferedReader.readLine()) != null) {
            String[] split = line.split(",");
            map.put(split[0], line);
        }
        // Close the reader (and the underlying stream) once the cache file is loaded
        bufferedReader.close();
    }

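The reasoning in setup is as follows:

Get the URI of the cache file with DistributedCache.getCacheFiles(context.getConfiguration())[0]

The URI, together with the Configuration obtained from the context, is enough to create a FileSystem object

Use the FileSystem object's open method to get an FSDataInputStream and read the data on HDFS

Use BufferedReader's readLine() method to read and process the file line by line

BufferedReader needs the intermediate class InputStreamReader, which in turn wraps the FSDataInputStream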

(5) Using Snappy compression in MapReduce Java code ★★★

Snappy compression in brief:
Required in the Driver class:

// Enable compression through the Configuration
Configuration conf = new Configuration();

// Compress the map output with Snappy
conf.set("mapreduce.map.output.compress","true");
conf.set("mapreduce.map.output.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");

// Compress the reduce output with Snappy
conf.set("mapreduce.output.fileoutputformat.compress","true");
conf.set("mapreduce.output.fileoutputformat.compress.type","RECORD");
conf.set("mapreduce.output.fileoutputformat.compress.codec","org.apache.hadoop.io.compress.SnappyCodec");

(6) The instructor says you must be able to do this one (a similar question is certain to appear) ★★★★★

First, the UDF

package com.czxy.demo02;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class TestUDF extends UDF {

    public Text evaluate(Text name, IntWritable number){
        name = new Text(name.toString()+"feng" +"-"+(number.get()+80));
        return name;
    }

}

Upload the jar, run add jar <path to jar>, create a temporary function, call it, and drop it

hive (default)> add jar /home/hadoop04.jar
              > ;
Added [/home/hadoop04.jar] to class path
Added resources: [/home/hadoop04.jar]


hive (default)> create temporary function fuck as 'com.czxy.demo02.TestUDF';
OK
Time taken: 3.778 seconds


hive (default)> select fuck('zhangsan',20);
OK
_c0
zhangsanfeng-100
Time taken: 3.274 seconds, Fetched: 1 row(s)


hive (default)> drop temporary function fuck;
OK
Time taken: 0.005 seconds
hive (default)>

Next, HDFS

package com.czxy.demo03;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class Test01 {
    public static void main(String[] args) throws Exception{
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.100.201:8020"),new Configuration());
        FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path("/abcba/abcba.txt"));
        byte[] bytes = "hello world".getBytes();
        fsDataOutputStream.write(bytes,0,bytes.length);
        fileSystem.copyFromLocalFile(new Path("E:\\test.txt"),new Path("/abcba/"));
        IOUtils.closeQuietly(fsDataOutputStream);
        fileSystem.close();
    }
}

Output:

[root@hadoop01 home]# hadoop fs -ls /abcba/
Found 2 items
-rw-r--r--   3 Administrator supergroup         11 2019-11-25 20:34 /abcba/abcba.txt
-rw-r--r--   3 Administrator supergroup          8 2019-11-25 20:34 /abcba/test.txt
[root@hadoop01 home]# hadoop fs -cat /abcba/abcba.txt
hello world[root@hadoop01 home]# hadoop fs -cat /abcba/test.txt 
gagaga
[root@hadoop01 home]#

(7) Hive user-defined functions: reflect version ★★★

Write any Java class

package com.czxy.demo01;
public class Test {
    public static String getStr(String str){
        return str+"123";
    }
}

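Upload it to Linux, open the Hive shell, and run add jar <path to jar> to add it to the Hive client

add jar /home/test02.jar;

Remember the syntax:

select reflect('<fully qualified class name>','<method name>','<argument>');

select reflect('com.czxy.demo01.Test','getStr','haha');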
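As an analogy for what such a call involves, the sketch below looks up a public static method by class name and method name and invokes it with Java reflection. It is not Hive's implementation; the class name ReflectSketch and the callStatic helper are assumptions, and java.lang.Integer.parseInt stands in for a user class such as com.czxy.demo01.Test so that the example runs without an extra jar.

```java
import java.lang.reflect.Method;

class ReflectSketch {
    // Looks up a public static method that takes one String and invokes it.
    static Object callStatic(String className, String methodName, String argument) {
        try {
            Method method = Class.forName(className).getMethod(methodName, String.class);
            return method.invoke(null, argument); // null receiver because the method is static
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot call " + className + "." + methodName, e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a user class: parse the text "42" through reflection.
        System.out.println(callStatic("java.lang.Integer", "parseInt", "42"));
    }
}
```

This is also why the method in the example class above is declared public static: it can be called without creating an instance.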

(8) Writing MapReduce results to multiple files by category ★

MultipleOutputs in brief:

public class MultipleReducer extends Reducer<Text,Text,Text, NullWritable> {
    // Declare mos as a field because it is used in more than one method; it is initialized in setup
    private MultipleOutputs<Text,NullWritable> mos;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Initialize MultipleOutputs when setup runs
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        System.out.println(key);
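        // Each key comes with a list of values
        // The goal is to write those values to different files by key, not by the framework's default rule
        for (Text value : values) {
            // mos.write takes three arguments:
            // 1. the output key, of the reducer's output key type: value
            // 2. the output value, of the reducer's output value type: NullWritable.get()
            // 3. the base name of the output file; key.toString() names the files after the key
            // Values that share a key are therefore written to the same file,
            // so the results are split into multiple files by key.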
            mos.write(value,NullWritable.get(),key.toString());
            System.out.println(value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Close MultipleOutputs once reduce has finished
        mos.close();
    }
}

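(9) Must-know questions ★★★★★

1. WordCount (counting words)
2. Compression (run on the cluster)
3. Custom partitioning (run on the cluster)
4. Wrapping records in a bean object
5. reducejoin
6. mapjoin

(10) Mock exam ★★★★★

Data:
Link: https://pan.baidu.com/s/12zcU8H5ddh83dDaWtBywpQ
Extraction code: 29hv
Code:
Question 1: custom partitioning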

package com.czxy.demo01;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

public class Driver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        job.setJarByClass(Driver.class);
        job.setNumReduceTasks(2);

        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("hdfs://192.168.100.201:8020/t20191126/test01Input"));

        job.setMapperClass(DataMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);


        job.setReducerClass(DataReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("hdfs://192.168.100.201:8020/t20191126/test01Output02"));


        job.setPartitionerClass(DataPartitioner.class);


        boolean b = job.waitForCompletion(true);


        return b ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Driver(), args);
    }

    public static class DataPartitioner extends Partitioner<Text, NullWritable> {

        @Override
        public int getPartition(Text text, NullWritable nullWritable, int i) {
            if (!"".equals(text.toString())) {
                String[] split = text.toString().split("\t");
                String name = split[2];
                Integer jindu = Integer.parseInt(split[0]);

                if (jindu > 50) {
                    return 0;
                } else {
                    return 1;
                }


            }

            return 0;

        }
    }

    public static class DataMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            if (value.toString().split("\t")[2].length() > 5) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static class DataReducer extends Reducer<Text, NullWritable, Text, NullWritable> {

        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            context.write(key, NullWritable.get());

        }
    }

}

Question 2: implementing a SQL join on the reduce side

package com.czxy.demo02;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class Driver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        job.setJarByClass(Driver.class);


        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("hdfs://192.168.100.201:8020/t20191126/test02Input"));

        job.setMapperClass(BeanMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DataBean.class);


        job.setReducerClass(BeanReducer.class);
        job.setOutputKeyClass(DataBean.class);
        job.setOutputValueClass(NullWritable.class);

        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("hdfs://192.168.100.201:8020/t20191126/test02Onput"));

        boolean b = job.waitForCompletion(true);

        return b ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {

        ToolRunner.run(new Driver(), args);

    }

    public static class BeanMapper extends Mapper<LongWritable, Text, Text, DataBean> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            FileSplit fileSplit = (FileSplit) context.getInputSplit();
            String name = fileSplit.getPath().getName();
            System.out.println(name);

            DataBean dataBean = new DataBean();
            String info = value.toString();

            if (name.contains("1")) {


                String[] split = info.split("\t");
                dataBean.setId(split[1]);
                dataBean.setName(split[0]);
                context.write(new Text(split[1]), dataBean);
            }

            if (name.contains("2")) {
                String[] split = info.split(",");
                dataBean.setId(split[0]);
                dataBean.setGender(split[1]);
                context.write(new Text(split[0]), dataBean);
            }

            if (name.contains("3")) {

                String[] split = info.split("\\|");
                dataBean.setId(split[0]);
                dataBean.setAge(split[1]);
                context.write(new Text(split[0]), dataBean);

            }


        }
    }

    public static class BeanReducer extends Reducer<Text, DataBean, DataBean, NullWritable> {


        @Override
        protected void reduce(Text key, Iterable<DataBean> values, Context context) throws IOException, InterruptedException {

            DataBean dataBean = new DataBean();

            dataBean.setId(key.toString());


            for (DataBean value : values) {

                System.out.println(value);
                if (!value.getGender().equals("null")) {
                    dataBean.setGender(value.getGender());
                }

                if (!value.getAge().equals("null")) {
                    dataBean.setAge(value.getAge());
                }

                if (!value.getName().equals("null")) {
                    dataBean.setName(value.getName());
                }


            }

            System.out.println(dataBean);
            System.out.println("---------------------");

            context.write(dataBean, NullWritable.get());


        }
    }


    public static class DataBean implements Writable {
        private String id;
        private String name;
        private String gender;
        private String age;

        public DataBean() {
        }

        public DataBean(String id, String name, String gender, String age) {
            this.id = id;
            this.name = name;
            this.gender = gender;
            this.age = age;

        }

        public String getId() {
            return id;
        }

        public void setId(String id) {
            this.id = id;
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public String getGender() {
            return gender;
        }

        public void setGender(String gender) {
            this.gender = gender;
        }

        public String getAge() {
            return age;
        }

        public void setAge(String age) {
            this.age = age;
        }

        @Override
        public String toString() {
            return "DataBean{" +
                    "id='" + id + '\'' +
                    ", name='" + name + '\'' +
                    ", gender='" + gender + '\'' +
                    ", age='" + age + '\'' +
                    '}';
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(id + "");
            out.writeUTF(name + "");
            out.writeUTF(gender + "");
            out.writeUTF(age + "");
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            this.id = in.readUTF();
            this.name = in.readUTF();
            this.gender = in.readUTF();
            this.age = in.readUTF();
        }
    }


}
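The reducer above compares each field with the string "null" rather than with a null reference. The sketch below shows why, using only java.io: writing field + "" turns an unset field into the four-character text "null", and that text is what readUTF returns on the other side. The class name NullFieldRoundTripSketch and the roundTrip helper are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class NullFieldRoundTripSketch {
    // Writes one field the way DataBean.write does (field + "") and reads it back.
    static String roundTrip(String field) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeUTF(field + ""); // a null reference becomes the text "null"
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readUTF();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(null));       // prints null (a four-character string)
        System.out.println(roundTrip("zhangsan")); // prints zhangsan
    }
}
```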
