HQL CREATE TABLE syntax:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY(col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY(col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
A simple Hive CREATE TABLE statement:
create table student(id string,name string ,sex string ,age int,address string)
row format delimited fields terminated by ','
stored as textfile;
students.txt:
95001,李勇,男,20,CS
95002,刘晨,女,19,IS
95003,王敏,女,22,MA
95004,张立,男,19,IS
95005,刘刚,男,18,MA
95006,孙庆,男,23,CS
95007,易思玲,女,19,MA
95008,李娜,女,18,CS
95009,梦圆圆,女,18,MA
95010,孔小涛,男,19,CS
95011,包小柏,男,18,MA
95012,孙花,女,20,CS
95013,冯伟,男,21,CS
95014,王小丽,女,19,CS
95015,王君,男,18,MA
95016,钱国,男,21,MA
95017,王风娟,女,18,IS
95018,王一,女,19,IS
95019,邢小丽,女,19,IS
95020,赵钱,男,21,IS
95021,周二,男,17,MA
95022,郑明,男,20,MA
Upload students.txt to /user/hive/warehouse/hive.db/student:
When Hive uses MySQL as its metastore, creating a database or table in Hive produces data both in MySQL and on HDFS. A database named hive is created in MySQL, where you can inspect DBS (database information), TBLS (table information), SDS, CDS, and COLUMNS_V2 (column information for each table).
It is recommended not to create primary keys in Hive!!!
Hive keeps its metadata (databases, tables, storage information, columns/fields) in a relational database (MySQL), while the actual data lives on HDFS.
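For reference, these metastore tables can be queried directly in MySQL. A minimal sketch, assuming the metastore database is named hive (column names can vary slightly between Hive versions):
use hive;
select DB_ID, NAME, DB_LOCATION_URI from DBS;
select TBL_ID, DB_ID, TBL_NAME, TBL_TYPE from TBLS;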
Hive characteristics
Advantages:
1. Scalability: Hive clusters can be scaled out freely, normally without restarting any service. Horizontal scaling means growing the cluster by adding nodes to share the load; vertical scaling means upgrading a single server, e.g. CPU from an i7-6700K (4 cores / 8 threads) to 8 cores / 16 threads, memory from 64 GB to 128 GB.
2. Extensibility: Hive supports user-defined functions, so users can implement whatever functions their business needs.
3. Good fault tolerance: a SQL statement can still complete even if some nodes fail.
Disadvantages:
1. Hive does not support record-level insert, update, or delete, although users can generate new tables from queries or export query results to files (the Hive 3.x version chosen here does support record-level inserts).
2. Query latency is high because starting a MapReduce job takes a long time, so Hive is not suitable for interactive query systems.
3. Hive does not support transactions (with no record-level updates or deletes, it is mainly used for OLAP (online analytical processing) rather than OLTP (online transaction processing), the two broad categories of data processing).
1.1 Basic data types
Hive supports most of the basic data types found in relational databases.
Type | Description | Example |
boolean | true/false | TRUE |
tinyint | 1-byte signed integer, -128~127 | 1Y |
smallint | 2-byte signed integer, -32768~32767 | 1S |
int | 4-byte signed integer | 1 |
bigint | 8-byte signed integer | 1L |
float | 4-byte single-precision floating point | 1.0 |
double | 8-byte double-precision floating point | 1.0 |
decimal | arbitrary-precision signed decimal | 1.0 |
string | variable-length string | 'a', "b" |
varchar | variable-length string (with a maximum length) | 'a', "b" |
char | fixed-length string | 'a', "b" |
binary | byte array | no literal form |
timestamp | timestamp, nanosecond precision | 122327493795 |
date | date | '2018-04-07' |
As in other SQL dialects, these are all reserved words. Note that all of these data types are implemented on top of the corresponding Java types, so their behavior matches the Java types exactly: string is backed by Java's String, float by Java's float, and so on.
1.2 Complex types
Type | Description | Example |
array | ordered collection of elements of the same type | array(1,2) |
map | key-value pairs; keys must be primitive types, values can be of any type | map('a',1,'b',2) |
struct | collection of fields whose types may differ | struct('1',1,1.0), named_struct('col1','1','col2',1,'col3',1.0) |
array:
create table movie(movie_name string ,actors array<string>,first_show string)
row format delimited
fields terminated by ','
collection items terminated by ':';
movie.dat:
战狼2,吴京:吴刚:龙母,2017-08-16
三生三世十里桃花,刘亦菲:痒痒,2017-08-20
普罗米修斯,苍老师:小泽老师:波多老师,2020-10-19
load data local inpath '/mine/movie.dat' into table movie;
select * from movie;
Access the elements of the array<string> column by index:
select movie_name,actors[0],actors[1],first_show from movie;
The array_contains function:
select * from movie where array_contains(actors,'吴刚');
The size function:
select * , size(actors) from movie;
map:
create table family (id string , name string, family_members map<string,string> , age int)
row format delimited
fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
family.dat:
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26
load data local inpath '/mine/family.dat' into table family;
select * from family;
select * , family_members['father'] as father,family_members['mother'] as mother from family;
map_keys:
select * , map_keys(family_members) as keys from family;
map_values:
select * , map_values(family_members) as keys from family;
size(map_keys(family_members)), size(map_values(family_members)):
select * , size(map_keys(family_members)) as keys from family;
select * , size(map_values(family_members)) as keys from family;
select * , family_members['brother'] from family where array_contains(map_keys(family_members),'brother');
struct:
create table t_user(id int , name string,info struct<age:int,sex:string,address:string>)
row format delimited fields terminated by ','
collection items terminated by ':';
user.dat:
1,zhangsan,18:male:深圳
2,lisi,28:female:北京
3,wangwu,38:male:广州
4,赵六,26:female:上海
5,钱七,35:male:杭州
load data local inpath '/mine/user.dat' into table t_user;
select * from t_user;
select * , info.address as address from t_user;
Important: Hive's default delimited row format cannot parse multi-character delimiters; it only handles a single character such as ':', not '::', so creating such a table the ordinary way will not work!!!
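If the raw data really does use a multi-character delimiter, one common workaround is to switch to a different SerDe. A minimal sketch, assuming the contrib MultiDelimitSerDe is available in your Hive build (the table t_multi is just an illustration):
create table t_multi(id int, name string)
row format serde 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
with serdeproperties ("field.delim"="::")
stored as textfile;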
# Run Hive SQL in local mode, which is a bit faster!!!
set hive.exec.mode.local.auto=true;
# Load data from HDFS into the directory backing the student table
load data inpath '/students.txt' into table student;
# Load data from the local filesystem into the directory backing the student table
load data local inpath '/mine/students.txt' into table student;
2.1 Creating an external table with the external keyword
create external table student(id string,name string ,sex string ,age int,address string)
row format delimited fields terminated by ','
stored as textfile;
2.2 Managed tables & external tables:
A table not declared with external is a managed (internal) table; a table declared with external is an external table.
2.3 Differences:
(1) Data of a managed table is managed by Hive itself; data of an external table is managed by HDFS;
(2) Managed-table data is stored under hive.metastore.warehouse.dir (default: /user/hive/warehouse); the storage location of external-table data is specified by the user;
(3) Dropping a managed table removes both the metadata and the stored data; dropping an external table removes only the metadata, and the files on HDFS are left untouched;
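To check which kind an existing table is, desc formatted prints a Table Type field (MANAGED_TABLE or EXTERNAL_TABLE):
desc formatted student;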
2.4 Specifying the data storage location at table creation
create table student(id string,name string ,sex string ,age int,address string)
row format delimited fields terminated by ','
stored as textfile
location '/student_data';
create external table external_student(id string,name string ,sex string ,age int,address string)
row format delimited fields terminated by ','
stored as textfile
location '/external_student_data';
At this point the specified location directories appear (still empty) on HDFS. With the location set, load the data:
load data local inpath '/mine/students.txt' into table external_student;
The data is placed inside the location directory. Whether the table is managed or external, if a location is specified at creation time that directory is created and any loaded data is stored there.
2.5 Behavior of drop table
Dropping a managed table deletes both the data and the directory at its location; dropping an external table does not!!!
2.6 Appending vs. overwriting data
Appending (into):
load data local inpath '/mine/students.txt' into table external_student;
load data local inpath '/mine/students.txt' into table student;
Overwriting (overwrite into):
load data local inpath '/mine/students2.txt' overwrite into table external_student;
load data local inpath '/mine/students2.txt' overwrite into table student;
3. Partitions
Dynamic partition setting:
set hive.exec.dynamic.partition.mode=nonstrict;
create external table partition_student(id string,name string ,sex string ,age int)
partitioned by (address string)
row format delimited fields terminated by ','
stored as textfile;
3.1 Inserting data into partitions
insert into table partition_student partition(address) select id,name,sex,age,address from student;
insert overwrite table partition_student partition(address) select id,name,sex,age,address from student;
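A partition value can also be fixed explicitly (static partitioning), and existing partitions can be listed. A short sketch, reusing the student and partition_student tables above:
insert into table partition_student partition(address='CS') select id,name,sex,age from student where address='CS';
show partitions partition_student;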
3.2 Characteristics of Hive partitions:
1. Avoid Chinese characters in partition columns.
2. Prefer columns with a moderate partition density, e.g. region, date, or age; unsuitable choices are gender, name, or an ID column.
3. Avoid composite partitions; try to partition on a single column.
4. Avoid overly scattered partition columns, i.e. the range of partition values should not be too wide.
3.3 Advantages of Hive partitions:
1. Faster queries: in Hive a partition acts as a kind of index and helps speed up data retrieval.
2. More even data distribution: many dense data files are spread across a matrix of partition directories, which avoids hotspot contention caused by data concentrating on a single datanode.
3. Partitions make directory-level data management easier.
4. Static partitions allow bulk modification of rows that share the same partition value.
3.4 Disadvantages of Hive partitions:
1. The partition key can be hard to choose; when several candidate columns exist it is easy to pick a poor one and hurt performance.
2. Because partitioning redistributes the data into many directories and files, it increases the load on the NameNode.
4. Buckets
Hive can also organize tables or partitions into buckets, for the following reasons:
(1) The first is to make sampling more efficient: with very large datasets it is often impractical to process everything during development and testing, so sampling becomes essential.
(2) The second is to obtain better query performance.
The point of bucketing in Hive: it makes joins easier and avoids Cartesian products, improving join efficiency.
Buckets give a table extra structure that Hive can exploit for certain queries to improve efficiency. Bucketing hashes the specified column and uses the hash value to split that column's data into a set of buckets, each stored in its own file.
Before using buckets, run set hive.enforce.bucketing=true; so that Hive enforces bucketing.
4.1 Table definition
create external table bucket_student(id string,name string ,sex string ,age int)
clustered by (age) sorted by (age) into 4 buckets
row format delimited fields terminated by ','
stored as textfile;
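A bucketed table is normally populated with insert ... select (a plain load data does not redistribute rows into buckets); once filled, the buckets can be sampled. A minimal sketch, assuming the bucket_student table above and the existing student table:
insert overwrite table bucket_student select id,name,sex,age from student;
select * from bucket_student tablesample(bucket 1 out of 4 on age);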
5. Hive sorting clauses
- order by: sorts the final reduce output, forcing a global sort (single reducer)
- sort by: sorts the data before the reduce output is written, i.e. within each reducer (no global order)
- distribute by: partitions the data in the map phase (decides which reducer each row goes to)
- cluster by: shorthand for distribute by + sort by on the same column
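A short illustration on the student table, assuming three reducers; order by produces one globally sorted result, while distribute by + sort by (or cluster by) only orders rows within each reducer:
set mapreduce.job.reduces=3;
select * from student order by age;
select * from student distribute by sex sort by age;
select * from student cluster by age;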
6. Executing Hive statements
hive -e executes a single SQL statement
hive -f executes a SQL file
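For example (the path /mine/query.sql is only a placeholder):
hive -e "select * from student limit 5;"
hive -f /mine/query.sql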
7. Differences among ROW_NUMBER, RANK, and DENSE_RANK
row_number(): numbers rows consecutively in order, with no ties and no gaps
rank(): equal values receive the same rank, and gaps are left after the ties
dense_rank(): equal values receive the same rank, and no gaps are left
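A small sketch on the student table showing the three side by side (ties on age within the same address make the difference visible):
select id, name, address, age,
row_number() over(partition by address order by age) as rn,
rank() over(partition by address order by age) as rk,
dense_rank() over(partition by address order by age) as drk
from student;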
8. Overview of Hive execution
(1) Hive converts HQL into a set of operators (Operator), e.g. GroupByOperator, JoinOperator
(2) An Operator is Hive's smallest processing unit
(3) Each operator represents an HDFS operation or a MapReduce job
(4) Hive runs the MapReduce program through ExecMapper and ExecReducer; execution can be in local mode or distributed mode
9. Table joins and subqueries
9.1 Table joins
(1)inner join(join)
(2)left outer join(left join)
(3)right outer join(right join)
(4)full outer join (full join)
(5) left semi join (keeps only the left side of the join)
9.2 Subqueries
Subqueries are supported, but exists/in subqueries are not!!! (Newer Hive versions do add limited support for in/exists in the WHERE clause.)
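A query that would use in can usually be rewritten as a left semi join. A minimal sketch, using the a and b tables defined in section 10 below:
-- intent: select * from a where id in (select id from b);
select a.* from a left semi join b on a.id = b.id;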
9.3 Ways to save the result of a select query
(1) Store the query result in a new Hive table
create table tmp_table
as
select * from student;
(2) Store the query result in an existing Hive table
insert into table tmp_table select * from student;
insert overwrite table tmp_table select * from student;
(3) Save the query result to a specified directory (either local or on HDFS)
insert overwrite directory '/tmp_table' select * from student;
insert overwrite local directory '/tmp_table' select * from student;
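When exporting to a directory, a row format can also be given so the output files are readable delimited text:
insert overwrite local directory '/tmp_table'
row format delimited fields terminated by ','
select * from student;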
10. join
10.1 Data:
a.txt:
1,a
2,b
3,c
4,d
7,y
8,u
b.txt:
2,bb
3,cc
7,yy
9,pp
10.2 Create the tables and load the data
create table a(id string,name string)
row format delimited fields terminated by ','
stored as textfile;
create table b(id string,name string)
row format delimited fields terminated by ','
stored as textfile;
load data local inpath '/mine/a.txt' into table a;
load data local inpath '/mine/b.txt' into table b;
(1)inner join(join)
select * from a inner join b on a.id = b.id;
(2)left outer join(left join)
select * from a left outer join b on a.id = b.id;
(3)right outer join(right join)
select * from a right outer join b on a.id = b.id;
(4)full outer join (full join)
select * from a full outer join b on a.id = b.id;
(5) left semi join (keeps only the left side of the join)
select * from a left semi join b on a.id = b.id;
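With the data above, only ids 2, 3, and 7 exist in both tables, so the inner join is expected to return:
2,b,2,bb
3,c,3,cc
7,y,7,yy
and the left semi join returns just the left-side columns for those ids: 2,b / 3,c / 7,y.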
11. Hive built-in functions, user-defined functions, and Transform
11.1 Built-in functions
1. substr(), concat(), split():
select substr('123456',1,3);
select concat ('a','b','c');
select concat ('a','b');
select split('a,b,c',',');
# split splits a string into an array by the given delimiter
select split('a,b,c',',')[1];
The second argument of split is a regular expression.
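Because the delimiter is treated as a regular expression, metacharacters such as '.' or '|' have to be escaped:
select split('a.b.c','\\.');
select split('a|b|c','\\|');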
2. Type conversion with cast():
select cast('5' as int);
create table t_tmp
as
select cast('5' as int);
select current_timestamp;
select cast(current_timestamp as date);
3. Math functions
round(NUMBER)、round(NUMBER,3)、ceil(NUMBER)、floor(NUMBER)、abs(NUMBER)
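For example (expected results roughly: 3.0, 3.142, 4, 3, 5):
select round(3.14159), round(3.14159,3), ceil(3.2), floor(3.8), abs(-5);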
4. The explode() function, a table-generating function that turns the elements of an array (or map) column into rows
create table student_subject(id int ,name string , subjects array<string>)
row format delimited fields terminated by ','
collection items terminated by ':';
subjects.dat:
1,zhangsan,化学:物理:数学:语文
2,lisi,化学:数学:生物:生理:卫生
3,wangwu,化学:语文:英语:体育:生物
load data local inpath '/mine/subjects.dat' into table student_subject;
select * from student_subject;
select explode(subjects) from student_subject;
The explode() function produces a table: one output row per element.
select distinct tmp.sub from (select explode(subjects) as sub from student_subject) tmp;
lateral view:
select id,name,tmp.sub from
student_subject
lateral view
explode(subjects) tmp as sub;
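lateral view combines well with aggregation, e.g. counting how many students take each subject:
select tmp.sub, count(*) as cnt
from student_subject
lateral view explode(subjects) tmp as sub
group by tmp.sub;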
5. Writing WordCount in HQL:
create table words(sequence string);
wordcount.txt:
hello world
hello kitty
hello jerry
tom
wtl
lover is love
like like liker
load data local inpath '/mine/wordcount.txt' into table words;
select * from words;
select tmp.word,count(tmp.word) from (select explode(split(sequence,' ')) as word from words) tmp group by tmp.word;
6. sort_array
select sort_array(array(2,4,1,5,3,8,0,9));
array() is a constructor function.
7. Conditional functions
case ... when ... then ... else ... end
select id,name,
case
when info.age <= 18 then '未成年人'
when info.age > 18 and info.age < 35 then '青年人'
else '中年人'
end
from t_user;
The if() function:
select movie_name,actors,first_show,
if(array_contains(actors,'吴刚'),'好电影','不好的电影')
from movie;
8. Window functions with over
create table rn(id int , age int,name string,sex string)
row format delimited fields terminated by ',';
rn.dat:
1,18,a,male
2,19,b,male
3,22,c,female
4,16,d,female
5,30,e,male
6,26,f,female
load data local inpath '/mine/rn.dat' into table rn;
select * from rn;
select * from
(select id,age,name,
row_number() over(partition by sex order by age) as num
from rn) tmp
where num < 3;
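over() also works with ordinary aggregates, not only the ranking functions; for example, attaching the average age of each sex to every row of rn:
select id, name, sex, age,
avg(age) over(partition by sex) as avg_age_of_sex
from rn;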
11.2 Categories of user-defined functions
When Hive's built-in functions cannot satisfy your business logic, you can turn to user-defined functions.
When writing a Hive UDF there are two choices: extend the UDF class, or extend the abstract class GenericUDF. The difference: GenericUDF can handle complex-typed arguments and is more efficient, because the plain UDF class requires Hive to use reflection.
Implementing a GenericUDF is more involved: you extend GenericUDF, work with ObjectInspectors, and must check the types and number of the arguments you receive. A GenericUDF has to implement the following three methods:
11.3 GenericUDF development example
// Called only once, before evaluate(). It receives an array of ObjectInspectors and checks that the argument types and count are correct.
abstract ObjectInspector initialize(ObjectInspector[] arguments);
// Similar to UDF's evaluate(): processes the actual arguments and returns the final result.
abstract Object evaluate(GenericUDF.DeferredObject[] arguments);
// Used to produce the message shown when the GenericUDF fails; the message is the string this method returns.
abstract String getDisplayString(String[] children);
package application;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import java.util.ArrayList;
import java.util.List;
/**
* @author: wtl
* @License: (C) Copyright 2020, wtl Corporation Limited.
* @Date: 2020-10-18 11:33
* @Version: 1.0
* @Description:
*/
public class ToLower extends GenericUDF {

    private ListObjectInspector listObjectInspector;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // Expect two arguments: an array plus a scalar of the same type as the array elements.
        if (arguments.length != 2) {
            throw new UDFArgumentException("expected exactly two arguments");
        }
        listObjectInspector = (ListObjectInspector) arguments[0];
        ObjectInspector listElementObjectInspector = listObjectInspector.getListElementObjectInspector();
        if (!ObjectInspectorUtils.compareTypes(listElementObjectInspector, arguments[1])) {
            throw new UDFArgumentException("argument types do not match!!!");
        }
        // The function returns array<string>.
        return ObjectInspectorFactory.getStandardListObjectInspector(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        List<Text> list = (List<Text>) arguments[0].get();
        Text value = (Text) arguments[1].get();
        // Append the second argument to every element of the array.
        List<String> resultList = new ArrayList<>();
        for (Text v : list) {
            resultList.add(v.toString() + ":" + value.toString());
        }
        return resultList;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "test";
    }
}
1. add jar /mine/udf-jar-with-dependencies.jar;
2. CREATE TEMPORARY FUNCTION hello AS "application.ToLower";
3. select hello(array('test','hello'),'hel');
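A temporary function disappears when the session ends; it can be dropped explicitly, or registered permanently from a jar on HDFS (the HDFS path below is only a placeholder):
drop temporary function hello;
create function hello as 'application.ToLower' using jar 'hdfs:///mine/udf-jar-with-dependencies.jar';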
The explode function (from the Hive source):
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.hive.ql.udf.generic;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.TaskExecutionException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
/**
* GenericUDTFExplode.
*
*/
@Description(name = "explode",
value = "_FUNC_(a) - separates the elements of array a into multiple rows,"
+ " or the elements of a map into multiple rows and columns ")
public class GenericUDTFExplode extends GenericUDTF {

  private transient ObjectInspector inputOI = null;

  @Override
  public void close() throws HiveException {
  }

  @Override
  public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    if (args.length != 1) {
      throw new UDFArgumentException("explode() takes only one argument");
    }

    ArrayList<String> fieldNames = new ArrayList<String>();
    ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();

    switch (args[0].getCategory()) {
    case LIST:
      inputOI = args[0];
      fieldNames.add("col");
      fieldOIs.add(((ListObjectInspector)inputOI).getListElementObjectInspector());
      break;
    case MAP:
      inputOI = args[0];
      fieldNames.add("key");
      fieldNames.add("value");
      fieldOIs.add(((MapObjectInspector)inputOI).getMapKeyObjectInspector());
      fieldOIs.add(((MapObjectInspector)inputOI).getMapValueObjectInspector());
      break;
    default:
      throw new UDFArgumentException("explode() takes an array or a map as a parameter");
    }

    return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,
        fieldOIs);
  }

  private transient final Object[] forwardListObj = new Object[1];
  private transient final Object[] forwardMapObj = new Object[2];

  @Override
  public void process(Object[] o) throws HiveException {
    switch (inputOI.getCategory()) {
    case LIST:
      ListObjectInspector listOI = (ListObjectInspector)inputOI;
      List<?> list = listOI.getList(o[0]);
      if (list == null) {
        return;
      }
      for (Object r : list) {
        forwardListObj[0] = r;
        forward(forwardListObj);
      }
      break;
    case MAP:
      MapObjectInspector mapOI = (MapObjectInspector)inputOI;
      Map<?,?> map = mapOI.getMap(o[0]);
      if (map == null) {
        return;
      }
      for (Entry<?,?> r : map.entrySet()) {
        forwardMapObj[0] = r.getKey();
        forwardMapObj[1] = r.getValue();
        forward(forwardMapObj);
      }
      break;
    default:
      throw new TaskExecutionException("explode() can only operate on an array or a map");
    }
  }

  @Override
  public String toString() {
    return "explode";
  }
}