Hive知识点总结(一)(基本概念、运行方式、创建库表、数据读取、文件格式)

Hive知识点总结(二)(数据类型及修改表定义)

Hive知识点总结(三)(Hive查询语法)

6、hive函数

hive的所有函数手册

下面我们总结一些常用内置函数及如何自定义函数

6.1、常用内置函数

6.1.1、类型转换函数(cast)

select cast("5" as int) from dual;
select cast("2017-08-03" as date) ;
select cast(current_timestamp as date);

6.1.2、数学运算函数

select round(5.4) from dual;   ## 5
select round(5.1345,3) from dual;  ##5.135
select ceil(5.4) from dual; // select ceiling(5.4) from dual;   ## 6
select floor(5.4) from dual;  ## 5
select abs(-5.4) from dual;  ## 5.4
select greatest(3,5,6) from dual;  ## 6 
select least(3,5,6) from dual;  ## 3
select max(age) from t_person;    聚合函数
select min(age) from t_person;    聚合函数

举例:

hive 数据库 数据类型转换 hive类型转换函数_unix

select greatest(cast(s1 as double),cast(s2 as double),cast(s3 as double)) from t_fun2;

结果:


+---------+--+
|   _c0   |
+---------+--+
| 2000.0  |
| 9800.0  |
+---------+--+

6.1.3、字符串函数

substr(string, int start)   ## 截取子串
substring(string, int start)
示例:select substr("abcdefg",2) from dual;

substr(string, int start, int len) 
substring(string, int start, int len)
示例:select substr("abcdefg",2,3) from dual;

concat(string A, string B...)  ## 拼接字符串
concat_ws(string SEP, string A, string B...)
示例:select concat("ab","xy") from dual;
select concat_ws(".","192","168","33","44") from dual;

length(string A)
示例:select length("192.168.33.44") from dual;

split(string str, string pat)
示例:select split("192.168.33.44",".") from dual; 错误的,因为.号是正则语法中的特定字符
select split("192.168.33.44","\\.") from dual;

upper(string str) ##转大写

6.1.4、时间函数

select current_timestamp; 
select current_date;

## 取当前时间的毫秒数时间戳 
select unix_timestamp();

## unix时间戳转字符串
from_unixtime(bigint unixtime[, string format])
示例:select from_unixtime(unix_timestamp());
select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");

## 字符串转unix时间戳
unix_timestamp(string date, string pattern)
示例: select unix_timestamp("2017-08-10 17:50:30");
select unix_timestamp("2017/08/10 17:50:30","yyyy/MM/dd HH:mm:ss");
## 将字符串转成日期date
select to_date("2017-09-17 16:58:32");

6.1.5、表生成函数

6.1.5.1、行转列函数:explode()

假如有以下数据:

1,zhangsan,化学:物理:数学:语文

2,lisi,化学:数学:生物:生理:卫生

3,wangwu,化学:语文:英语:体育:生物


映射成一张表:

create table t_stu_subject(id int,name string,subjects array<string>)

row format delimited fields terminated by ','

collection items terminated by ':';

使用explode()对数组字段“炸裂”

hive 数据库 数据类型转换 hive类型转换函数_json_02

然后,我们利用这个explode的结果,来求去重的课程:

select distinct tmp.sub
from 
(select explode(subjects) as sub from t_stu_subject) tmp;

6.1.5.2、表生成函数 :lateral view

select id,name,tmp.sub 
from t_stu_subject lateral view explode(subjects) tmp as sub;


hive 数据库 数据类型转换 hive类型转换函数_json_03

理解: lateral view 相当于两个表在join
左表:是原表
右表:是explode(某个集合字段)之后产生的表
而且:这个join只在同一行的数据间进行

那样,可以方便做更多的查询:
比如,查询选修了生物课的同学
select a.id,a.name,a.sub from
(select id,name,tmp.sub as sub from t_stu_subject lateral view explode(subjects) tmp as sub) a

where sub='生物';


6.1.6、集合函数

array_contains(Array<T>, value)  返回boolean值

示例:
select moive_name,array_contains(actors,'吴刚') from t_movie;
select array_contains(array('a','b','c'),'c') from dual;

sort_array(Array<T>) 返回排序后的数组

示例:
select sort_array(array('c','b','a')) from dual;
select 'haha',sort_array(array('c','b','a')) as xx from (select 0) tmp;

size(Array<T>)  返回一个int值

示例:
select moive_name,size(actors) as actor_number from t_movie;

size(Map<K.V>)  返回一个int值


map_keys(Map<K.V>)  返回一个数组


map_values(Map<K.V>) 返回一个数组


6.1.7、控制函数

6.1.7.1、case when

语法:
CASE   [ expression ]
       WHEN condition1 THEN result1
       WHEN condition2 THEN result2
       ...
       WHEN conditionn THEN resultn
       ELSE result
END

示例:
select id,name,
case
when age<28 then 'youngth'
when age>27 and age<40 then 'zhongnian'
else 'old'
end
from t_user;

6.1.7.2、if

select id,if(age>25,'working','worked') from t_user;
select moive_name,if(array_contains(actors,'吴刚'),'好电影','rom t_movie;


6.1.8、json解析函数

get_json_object('{\"key1\":3333,\"key2\":4444}' , '$.key1')  -->  3333

json_tuple('{\"key1\":3333,\"key2\":4444}','key1','key2') as(key1,key2)  --> 3333, 4444



json_tuple函数
示例:
select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_rating_json;

产生结果:

hive 数据库 数据类型转换 hive类型转换函数_hive 数据库 数据类型转换_04

利用json_tuple从原始json数据表中,etl出一个详细信息表:

create table t_rate 
as 
select 
uid,
movie,
rate,
year(from_unixtime(cast(ts as bigint))) as year,
month(from_unixtime(cast(ts as bigint))) as month,
day(from_unixtime(cast(ts as bigint))) as day,
hour(from_unixtime(cast(ts as bigint))) as hour,
minute(from_unixtime(cast(ts as bigint))) as minute,
from_unixtime(cast(ts as bigint)) as ts
from 
(select 
json_tuple(rateinfo,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid)
from t_json) tmp;



6.1.9、URL解析函数

网页URL数据解析函数:parse_url_tuple

select parse_url_tuple("http://www.edu360.cn/baoming/youhui?cookieid=20937219375",'HOST','PATH','QUERY','QUERY:cookieid')
from dual;
+----------------+------------------+-----------------------+--------------+--+
|       c0       |        c1        |          c2           |      c3      |
+----------------+------------------+-----------------------+--------------+--+
| www.edu360.cn  | /baoming/youhui  | cookieid=20937219375  | 20937219375  |
+----------------+------------------+-----------------------+--------------+--+


6.1.10、分析函数

(1)使用row_number() over()函数解决TOPN问题:

-- 造数据:

1,18,a,male
2,19,b,male
3,22,c,female
4,16,d,female
5,30,e,male
6,26,f,female

create table t_rn(id int,age int,name string,sex string)
row format delimited fields terminated by ',';

load data local inpath '/root/hivetest/rn.dat' into table t_rn;

-- 分组标记序号

select * 
from 
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rn 
from t_rn) tmp
where rn<3
;

(2)使用窗口分析函数 sum() over():来实现窗口中的逐行累加

0: jdbc:hive2://localhost:10000> select * from  t_access_amount;
+----------------------+------------------------+-------------------------+--+
| t_access_amount.uid  | t_access_amount.month  | t_access_amount.amount  |
+----------------------+------------------------+-------------------------+--+
| A                    | 2015-01                | 33                      |
| A                    | 2015-02                | 10                      |
| A                    | 2015-03                | 20                      |
| B                    | 2015-01                | 30                      |
| B                    | 2015-02                | 15                      |
| B                    | 2015-03                | 45                      |
| C                    | 2015-01                | 30                      |
| C                    | 2015-02                | 40                      |
| C                    | 2015-03                | 30                      |
+----------------------+------------------------+-------------------------+--+

-- 需求:求出每个人截止到每个月的总额

select uid,month,amount,
sum(amount) over(partition by uid order by month rows between unbounded preceding and current row) as accumulate
from t_access_amount;


6.2、自定义函数

在实际的开发过程中,很多时候需要我们处理复杂的数据逻辑,需要我们能够自定义函数来实现数据操作,下面举例说明:

/*

有如下json数据:rating.json

hive 数据库 数据类型转换 hive类型转换函数_json_05

*/

首先建表映射上述数据

create table t_ratingjson(json string);

导入数据

load data local inpath '/root/hivetest/rating.json' into table t_ratingjson;

我们的需求是想把上面的原始数据变成如下形式:
1193,5,978300760,1
661,3,978302109,1
914,3,978301968,1
3408,4,978300275,1

思路:如果能够自定义一个json解析函数,传入数,读出对应的值,就很方便了

create table t_rate
as
select myjson(json,1) as movie,cast(myjson(json,2) as int) as rate,myjson(json,3) as ts,myjson(json,4) as uid 
from t_ratingjson;

解决:
hive中如何定义自己的函数:

1、先写一个java类(extends UDF,重载方法public C evaluate(A a,B b)),实现你所想要的函数的功能(传入一个json字符串和一个脚标,返回一个值)

public class ParseJson extends UDF{
	
	// 重载 :返回值类型 和参数类型及个数,完全由用户自己决定
	// 本处需求是:给一个字符串,返回一个数组
	public String[] evaluate(String json) {	
		String[] split = json.split("\"");
		String[] res = new String[]{split[3],split[7],split[11],split[15]};
		return res;
	}
}

2、将java程序打成jar包,上传到hive所在的机器
3、在hive命令行中将jar包添加到classpath :    
        hive>add jar /root/hivetest/myjson.jar;
4、在hive命令中用命令创建一个临时函数叫做myjson,关联你所写的这个java类

        hive> create temporary function myjson as 'cn.rople.hive.udf.MyJsonParser';

5、开发hql语句,利用自定义函数,从原始表中抽取数据插入新表

insert into table t_rate
select
split(jsonp(json),',')[0],
cast(split(jsonp(json),',')[1] as int),
cast(split(jsonp(json),',')[2] as bigint),
cast(split(jsonp(json),',')[3] as int)
from
t_rating_json;

注:临时函数只在一次hive会话中有效,重启会话后就无效

如果需要经常使用该自定义函数,可以考虑创建永久函数:
拷贝jar包到hive的类路径中:
cp wc.jar apps/hive-1.2.1/lib/
创建了:
create function pfuncx as 'com.doit.hive.udf.UserInfoParser';
删除函数:
DROP  TEMPORARY  FUNCTION  [IF  EXISTS] function_name  
DROP FUNCTION[IF EXISTS] function_name