四、HQL(hive SQL)

https://www.docs4dev.com/docs/zh/apache-hive/3.1.1/reference/#

1)在hive-site.xml文件中添加如下配置信息,就可以实现显示当前数据库,以及查询表的头信息配置。

<property>
	<name>hive.cli.print.header</name>
	<value>true</value>
</property>
<property>
	<name>hive.cli.print.current.db</name>
	<value>true</value>
</property>

2)重新启动hive,对比配置前后差异。

(1)配置前,如图所示:

hive double 除不尽 hive 除法运算_hadoop

图12-1 配置前

(2)配置后,如图所示:

hive double 除不尽 hive 除法运算_hive_02

图12-2 配置后

(二)DML数据操纵语言

1、数据导入

(1)直接上传数据

--临时表
没有表文件目录,无法上传

--内部表
hadoop fs -put emp.txt /user/hive/warehouse/offcn.db/managed_emp01/aaa.txt
hadoop fs -put student.txt /user/hive/warehouse/
select * from managed_emp01;

--外部表
hadoop fs -put emp.txt /user/hive/warehouse/offcn.db/managed_emp/bbb.txt
select * from external_emp01;

--分区表
hadoop fs -put emp.txt /user/hive/warehouse/offcn.db/partition_emp/age=20/
select * from partition_emp;
hadoop fs -put emp.txt /user/hive/warehouse/offcn.db/partition_emp2/month=5/day=20
select * from partition_emp2;
注意:如果创建完分区表后,手动创建分区字段目录(mkdir),再上传数据后,需要修复表才可以用msck repair table tabName;

--分桶表
hadoop fs -put emp.txt /user/hive/warehouse/offcn.db/buck_emp
select * from buck_emp;
我们发现在HDFS上的文件目录中,并没有分成多个文件,也就是并没有按照指定字段进行分桶

(1)向表中装载数据(Load)

语法:

load data [local] inpath 'path/target.log' [overwrite] into table tab [partition (partcol1=val1,…)];

说明:

  • load data:表示加载数据
  • local:表示从(服务端启动的节点)本地加载(复制)数据到hive表;否则从HDFS加载(移动)数据到hive表
  • inpath:表示加载数据的路径
  • overwrite:表示覆盖表中已有数据,否则表示追加
  • into table:表示加载到哪张表
  • tab:表示具体的表名
  • partition:表示上传到指定分区

示例:

--为临时表加载数据
use offcn;
create temporary table if not exists temporary_emp(
id int, name string
)row format delimited fields terminated by '\t';

load data local inpath "/home/offcn/apps/hive-3.1.2/emp.txt" into table temporary_emp;
select * from temporary_emp;

--为内部表从本地加载数据
load data local inpath "/home/offcn/apps/hive-3.1.2/emp.txt" into table managed_emp01;
load data local inpath "/home/offcn/apps/hive-3.1.2/emp.txt" overwrite into table managed_emp01;
select * from managed_emp01;
--为内部表从hdfs加载数据
先把数据上传到hdfs
[offcn@bd-offcn-01 hive-3.1.2]$ hadoop fs -put emp.txt /user/offcn
load data inpath "/user/offcn/emp.txt" overwrite into table managed_emp01;
select * from managed_emp01;


--为外部表加载数据
load data local inpath "/home/offcn/apps/hive-3.1.2/emp.txt" into table external_emp01;
select * from external_emp01;

--为分区表加载数据(单分区)
load data local inpath "/home/offcn/apps/hive-3.1.2/emp.txt" into table partition_emp partition(age=18);
select * from partition_emp;

--为分区表加载数据(多级分区)
load data local inpath "/home/offcn/apps/hive-3.1.2/emp.txt" into table partition_emp2 partition(month="05",day="20");
select * from partition_emp2;

--为分桶表加载数据	
load data local inpath "/home/offcn/apps/hive-3.1.2/emp.txt" into table buck_emp; 错误加载方式,查询不到数据
select * from buck_emp;  
[offcn@bd-offcn-01 hive-3.1.2]$ hadoop fs -put emp.txt /user/offcn
load data inpath "/user/offcn/emp.txt" into table buck_emp; 正确加载方式,查询到数据 ,这种方式会保留/user/offcn/emp.txt文件
select * from buck_emp;
结论:
1、	分桶表通过load加载数据,只能加载hdfs上的数据,无法加载本地无数据,
2、	在hdfs web页面查看文件内容时,发现显示有问题,这是hadoop的bug,要想查看分桶文件,可以采用hadoop -cat 命令
3、	遇到该报错,不影响分桶

hive double 除不尽 hive 除法运算_hive double 除不尽_03

(2)通过查询语句向表中插入数据(Insert)

a.insert values

语法:

insert into table tab [partition (partcol1[=val1], partcol2[=val2] ...)] values (value [, value ...])

示例:

b.insert select

语法:

insert overwrite table tablename [partition (partcol1=val1, partcol2=val2 ...)] select_statement1 from from_statement;
insert into table tablename [partition (partcol1=val1, partcol2=val2 ...)] select_statement1 from from_statement;

示例:

--先准备数据
insert into table emp02(id,name) values(4,"scala"); 
select * from emp02;

--临时表
insert overwrite table temporary_emp select * from emp02;  覆盖插入
insert into table temporary_emp select * from emp02;   新增插入
--内部表
insert into table managed_emp01 select * from emp02;
select * from managed_emp01;

--外部表
insert into table external_emp01 select * from emp02;
select * from external_emp01;

--分区表
insert into table partition_emp partition(age=45) select * from emp02;
select * from partition_emp;
insert into table partition_emp2 partition(month="3",day="18") select * from emp02;[X 列对不上]
insert into table partition_emp2 partition(month="3",day="18") select * from partition_emp;
select * from partition_emp2;

--分桶表
insert into table buck_emp select * from partition_emp;
select * from  buck_emp;

c.多重插入

语法:

from from_statement

insert overwrite table tab [partition (partcol1=val1, partcol2=val2 ...)] select_statement1

[insert overwrite table tab2 [partition ...] select_statement2]

[insert into table tab2 [partition ...] select_statement2];

说明:

  • 当我们需要将一张表中的部分数据分别插入多张表时,就可以使用多重插入。

示例:

将partition_emp表中的部分数据分别插入、替换到managed_emp01、external_emp01表中:
-- 先看下原来的数据
select * from managed_emp01;  对应/user/hive/warehouse/offcn.db/managed_emp01
select * from external_emp01;  对应/user/hive/warehouse/offcn.db/managed_emp
select * from partition_emp;   对应/user/hive/warehouse/offcn.db/partition_emp

-- 执行插入
from partition_emp
insert into table managed_emp01 
select id,name
insert overwrite table external_emp01 
select id,name;

d.动态分区

语法:

insert overwrite table tablename partition (partcol1[=val1], partcol2[=val2] ...) select_statement from from_statement;

insert into table tablename partition (partcol1[=val1], partcol2[=val2] ...) select_statement from from_statement;

说明:

  • 往hive分区表中插入数据时,如果需要创建的分区很多,比如以表中某个字段进行分区存储,则需要复制粘贴修改很多sql去执行,效率低。因为hive是批处理系统,所以hive提供了一个动态分区功能,其可以基于查询参数的位置去推断分区的名称,从而建立分区。

示例:

--设置非严格模式
set hive.exec.dynamic.partition.mode=nonstrict;
--创建动态分区表
create table dynamic_emp(id int) partitioned
by (name string) row format delimited fields terminated by '\t';
--动态插入数据
insert overwrite table dynamic_emp partition(name)
select id,name from managed_emp01;
--查看全部分区
show partitions dynamic_emp;
--查看数据
select * from dynamic_emp;

(3)创建表时通过Location指定加载数据路径

create table if not exists emp4(
id int, name string
)
row format delimited fields terminated by '\t'
location '/test';
hadoop fs -put emp.txt /test
select * from emp4;

(4)Import数据到指定Hive表中

create table if not exists emp5(
id int, name string
)
row format delimited fields terminated by '\t';
	
import table emp5 from '/tmp/export/emp/';
select * from emp5;
注意:先将数据导出,再进行导入  与export配套使用才可

2、数据导出

(1)Insert导出

a、将查询的结果导出到本地 

insert overwrite local directory '/home/offcn/tmp/export/emp'
select * from managed_emp01;

b、将查询的结果格式化导出到本地

insert overwrite local directory '/home/offcn/tmp/export/emp2'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
select * from managed_emp01;

c、将查询的结果导出到HDFS上(没有local)

insert overwrite directory '/test/export/empout'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
select * from managed_emp01;

(2)Hadoop命令导出到本地

jdbc:hive2://bd-offcn-01:10000>  dfs -get /user/hive/warehouse/offcn.db/emp5/emp.txt /home/offcn/tmp/export/emp2/emp.txt;
或者直接用hadoop命令将相应文件下载到本地

(3)Hive Shell 命令导出      

hive -e 'select * from offcn.emp02;' > /home/offcn/tmp/export/emp.txt

(4)Export导出到HDFS上

export table offcn.emp02 to

 '/tmp/export/emp';

export和import主要用于两个Hadoop平台集群之间Hive表迁移。

(三)DQL数据查询语言

数据准备:

use default;

create table if not exists emp(
empno int,
ename string,
job string,
mgr int,
hiredate string, 
sal double, 
comm double,
deptno int)
row format delimited fields terminated by '\t';


create table if not exists dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\t';
	
load data local inpath '/home/offcn/tmp/emp.txt' into table emp;
load data local inpath '/home/offcn/tmp/dept.txt' into table dept;

select * from emp;
select * from dept;

1、基本查询(Select…From)

全表查询

select * from emp;

选择特定列查询           

select empno, ename from emp;

注意:

  • SQL 语言大小写不敏感。
  • SQL 可以写在一行或者多行
  • 关键字不能被缩写也不能分行
  • 各子句一般要分行写。
  • 使用缩进提高语句的可读性。

2、条件查询

查询出薪水大于1000的所有员工

select * from emp where sal >1000;

关系运算符

操作符

支持的数据类型

描述

A=B

基本数据类型

如果A等于B则返回TRUE,反之返回FALSE

A<=>B

基本数据类型

如果A和B都为NULL,则返回TRUE,如果一边为NULL,返回False

A<>B, A!=B

基本数据类型

A或者B为NULL则返回NULL;如果A不等于B,则返回TRUE,反之返回FALSE

A<B

基本数据类型

A或者B为NULL,则返回NULL;如果A小于B,则返回TRUE,反之返回FALSE

A<=B

基本数据类型

A或者B为NULL,则返回NULL;如果A小于等于B,则返回TRUE,反之返回FALSE

A>B

基本数据类型

A或者B为NULL,则返回NULL;如果A大于B,则返回TRUE,反之返回FALSE

A>=B

基本数据类型

A或者B为NULL,则返回NULL;如果A大于等于B,则返回TRUE,反之返回FALSE

A [NOT] BETWEEN B AND C

基本数据类型

如果A,B或者C任一为NULL,则结果为NULL。如果A的值大于等于B而且小于或等于C,则结果为TRUE,反之为FALSE。如果使用NOT关键字则可达到相反的效果。

A IS NULL

所有数据类型

如果A等于NULL,则返回TRUE,反之返回FALSE

A IS NOT NULL

所有数据类型

如果A不等于NULL,则返回TRUE,反之返回FALSE

IN(数值1, 数值2)

所有数据类型

使用 IN运算显示列表中的值

A [NOT] LIKE B

STRING 类型

B是一个SQL下的简单正则表达式,也叫通配符模式,如果A与其匹配的话,则返回TRUE;反之返回FALSE。B的表达式说明如下:‘x%’表示A必须以字母‘x’开头,‘%x’表示A必须以字母’x’结尾,而‘%x%’表示A包含有字母’x’,可以位于开头,结尾或者字符串中间。如果使用NOT关键字则可达到相反的效果。

A RLIKE B, A REGEXP B

STRING 类型

B是基于java的正则表达式,如果A与其匹配,则返回TRUE;反之返回FALSE。匹配使用的是JDK中的正则表达式接口实现的,因为正则也依据其中的规则。例如,正则表达式必须和整个字符串A相匹配,而不是只需与其字符串匹配。

(1)Between、IN、is null

  • 询出薪水等于5000的所有员工

 select * from emp where sal =5000;

  • 查询工资在500到1000的员工信息

 select * from emp where sal between 500 and 1000;

  • 查询comm为空的所有员工信息

 select * from emp where comm is null;

  • 查询工资是1500或5000的员工信息

 select * from emp where sal IN (1500, 5000);

(2)Like和RLike

选择条件可以包含字符或数字:

% 代表零个或多个字符(任意个字符)。

_ 代表一个字符。

RLIKE子句:

RLIKE子句是Hive中这个功能的一个扩展,其可以通过Java的正则表达式这个更强大的语言来指定匹配条件。

  • 查找以”S”开头的员工信息

 select * from emp where ename LIKE 'S%';

  • 查找第二个数值为”M”的薪水的员工信息

 select * from emp where ename LIKE '_M%';

  • 查找名字中含有“I”的员工信息

 select * from emp where ename RLIKE '[I]';

select * from emp where ename LIKE '%I%';

3)逻辑运算符(And/Or/Not)

操作符 含义

AND 逻辑并

OR  逻辑或

NOT 逻辑否

  • 查询薪水大于1000,部门是30

 select * from emp where sal>1000 and deptno=30;

  • 查询薪水大于1000,或者部门是30

 select * from emp where sal>1000 or deptno=30;

  • 查询除了20部门和30部门以外的员工信息

 select * from emp where deptno not IN(30, 20);

(4)limit语句

LIMIT子句用于限制返回的行数不同于mysql的是,limit后只能跟一个参数

select * from emp limit 5;

3、分组查询

(1)Group By语句

GROUP BY语句通常会和聚合函数一起使用,按照一个或者多个列队结果进行分组,然后对每个组执行聚合操作。

注意:分组语句中,select后的字段只能是分组字段或者聚合函数!

  • 计算emp表每个部门的平均工资

 select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;

  • 计算emp每个部门中每个岗位的最高薪水

 select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job;

(2)Having语句

having与where不同点

where后面不能写分组函数,而having后面可以使用分组函数。

where用于对数据进行过滤,having用于对结果进行过滤。

having只用于group by分组统计语句。

  • 求每个部门的平均薪水大于2000的部门

第一步:求每个部门的平均工资

 select deptno, avg(sal) from emp group by deptno;

第二部:求每个部门的平均薪水大于2000的部门

 select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;

不能如下这样写

select deptno, avg(sal) avg_sal from emp group by deptno where avg_sal > 2000;

4、连接查询                                      

join是发生在 多个表之间, 内连接(等值内连接和不等值内连接)和外连接 (左外连接、右外连接,全外连接)

内连接:只会显示满足条件的数据。

外连接:会全部显示出某表的数据,如果另外一张表 跟它无法匹配,则另外一张表的相关字段设置为空。

(1)内连接

select e.*,d.*
from emp e inner join dept d
on e.deptno=d.deptno;

intersect交集方式实现
select e.*,d.*
from emp e left outer join dept d
on e.deptno=d.deptno
intersect
select e.*,d.*
from emp e right outer join dept d
on e.deptno=d.deptno;

(2)外连接

a.左链接

--左连接
select e.*,d.*
from emp e left outer join dept d
on e.deptno=d.deptno;

b.右连接

--右连接
select e.*,d.*
from emp e right outer join dept d
on e.deptno=d.deptno;

c.左独有

--左独有 查出左表有,而右表没有匹配上的数据
select e.*,d.*
from emp e left outer join dept d
on e.deptno=d.deptno
where d.deptno is null;

d.右独有

--右独有 查出右表有,而左表没有的数据
select e.*,d.*
from emp e right outer join dept d
on e.deptno=d.deptno
where e.deptno is null;

e.全连接

--全连接
select e.*,d.*
from emp e full outer join dept d
on e.deptno=d.deptno;

--全连接union
select e.*,d.*
from emp e left outer join dept d
on e.deptno=d.deptno
union 
select e.*,d.*
from emp e right outer join dept d
on e.deptno=d.deptno;

f.左右独有
 

--左右独有
select e.*,d.*
from emp e full outer join dept d
on e.deptno=d.deptno
where e.deptno is null or d.deptno is null;

--左右独有union all
select e.*,d.dname,d.loc
from emp e left outer join dept d
on e.deptno=d.deptno
where d.deptno is null
union all
select e.empno ,e.ename ,e.job,e.mgr,e.hiredate,e.sal,e.comm,d.*
from emp e right outer join dept d
on e.deptno=d.deptno
where e.deptno is null;

--except/minus:差集实现
select e.*,d.*
from emp e full outer join dept d
on e.deptno=d.deptno
except
select e.*,d.*
from emp e inner join dept d
on e.deptno=d.deptno;

select e.*,d.*
from emp e full outer join dept d
on e.deptno=d.deptno
minus
select e.*,d.*
from emp e inner join dept d
on e.deptno=d.deptno;

g.交叉连接

交叉连接:得到笛卡尔积,有隐式、显式两种写法

加上on条件,相当于内连接

隐式写法
select e.*,d.*
from emp e,dept d;
显式写法
select e.*,d.*
from emp e cross join dept d
on e.deptno=d.deptno;

h.左半开连接(left semi-join)

当记录对于右边表满足on语句中的判定条件,只会返回左边表的记录,左半开连接时内连接的优化,当左边表的一条数据,在右边表中存在时,Hive就停止扫描。因此效率比join高,但是左半开连接的select和where关键字后面只能出现左边表的字段,不能出现右边表的字段。Hive不支持右半开连接。

select *
from dept d left semi join emp e
on d.deptno=e.deptno;

select d.*
from dept d left semi join emp e
on d.deptno=e.deptno;

--类似于内连接只返回左表内容,但是效率较高  等同于内连接只返回左表,并把左表去重的结果,但是效率比内连接要高
select d.*
from dept d inner join emp e
on d.deptno=e.deptno;
--错误示例
select d.*,e.*
from dept d left semi join emp e
on d.deptno=e.deptno;

(3)自连接

准备数据:

CREATE TABLE `area` (
  `id` string NOT NULL,
  `area_code` string,
  `area_name` string,
  `level` string,
  `parent_code` string,
  `target` string
)row format delimited fields terminated by ',';

load data local inpath "/home/offcn/tmp/area.csv" into table area;

select * from area;

需求:统计指定省市下所有地区
--子查询实现
SELECT * FROM AREA
WHERE parent_code=(SELECT area_code
FROM AREA
WHERE area_name="内蒙古自治区");
--自连接实现
SELECT a.*,b.area_name
FROM AREA a JOIN AREA b
ON a.parent_code=b.area_code
WHERE b.area_name="内蒙古自治区";

(4)多表连接

准备数据:

create table if not exists location(
loc int,
loc_name string
)
row format delimited fields terminated by '\t';

load data local inpath '/home/offcn/tmp/location.txt' into table location;

SELECT e.ename, d.dname, l.loc_name
FROM   emp e 
JOIN   dept d
ON     d.deptno = e.deptno 
JOIN   location l
ON     d.loc = l.loc;

大多数情况下,Hive会对每对JOIN连接对象启动一个MapReduce任务。本例中会首先启动一个MapReduce job对表e和表d进行连接操作,然后会再启动一个MapReduce job将第一个MapReduce job的输出和表l;进行连接操作。

注意:为什么不是表d和表l先进行连接操作呢?这是因为Hive总是按照从左到右的顺序执行的。

优化:当对3个或者更多表进行join连接时,如果每个on子句都使用相同的连接键的话,那么只会产生一个MapReduce job。

5、排序

(1)全局排序(Order By)

  • Order By:全局排序,只有一个Reducer
  • 使用 ORDER BY 子句排序

ASC(ascend): 升序(默认)

DESC(descend): 降序

  • ORDER BY 子句在SELECT语句的结尾

--查询员工信息按工资升序排列

select * from emp order by sal;

--查询员工信息按工资降序排列

select * from emp order by sal desc;

(2)每个MapReduce内部排序(Sort By)

  • Sort By:对于大规模的数据集order by的效率非常低。在很多情况下,并不需要全局排序,此时可以使用sort by。
  • Sort by为每个reducer产生一个排序文件。每个Reducer内部进行排序,对全局结果集来说不是排序。

--设置reduce个数

set mapreduce.job.reduces=3;

--查看设置reduce个数

set mapred.reduce.tasks;

--根据部门编号降序查看员工信息

select * from emp sort by deptno desc;

--将查询结果导入到文件中(按照部门编号降序排序)

insert overwrite local directory '/home/offcn/tmp/sortby-result'

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

select * from emp sort by deptno desc;

(3)分区排序(Distribute By)

  • Distribute By: 在有些情况下,我们需要控制某个特定行应该到哪个reducer,通常是为了进行后续的聚集操作。distribute by 子句可以做这件事。distribute by类似MR中partition(自定义分区),进行分区,结合sort by使用。
  • 对于distribute by进行测试,一定要分配多reduce进行处理,否则无法看到distribute by的效果。

--先按照部门编号分区,再按照员工编号降序排序。

set mapreduce.job.reduces=3;

insert overwrite local directory '/home/offcn/tmp/distribute-result'

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

select * from emp distribute by deptno sort by empno desc;

注意:

  • distribute by的分区规则是根据分区字段的hash码与reduce的个数进行模除后,余数相同的分到一个区。
  • Hive要求DISTRIBUTE BY语句要写在SORT BY语句之前。

(4)Cluster By   

  • 当distribute by和sorts by字段相同时,可以使用cluster by方式。
  • cluster by除了具有distribute by的功能外还兼具sort by的功能。但是排序只能是升序排序,不能指定排序规则为ASC或者DESC。

--以下两种写法等价

select * from emp distribute by deptno sort by deptno;

select * from emp cluster by deptno;

set mapreduce.job.reduces=3;

insert overwrite local directory '/home/offcn/tmp/cluster-result'

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

select * from emp cluster by deptno;

(四)函数hive的函数

hive double 除不尽 hive 除法运算_数据_04

1、系统内置函数

(1)关系运算(点我)

(2)逻辑运算(点我)

(3)数学运算

  • •加法操作: +
  • •减法操作: -
  • •乘法操作: *
  • •除法操作: /
  • •取余操作: %
  • •位与操作: &
  • •位或操作: |
  • •位异或操作: ^
  • •位取反操作: ~

create table dual(id string);

准备内容只有一个空格的文本put到该表

select 1 + 9 from dual;

(4)数值运算

  • 取整函数: round
  • 指定精度取整函数: round
  • 向下取整函数: floor
  • 向上取整函数: ceil
  • 向上取整函数: ceiling
  • 取随机数函数: rand
  • 自然指数函数: exp
  • 以10为底对数函数: log10
  • 以2为底对数函数: log2
  • 对数函数: log
  • 幂运算函数: pow
  • 幂运算函数: power
  • 开平方函数: sqrt
  • 二进制函数: bin
  • 十六进制函数: hex
  • 反转十六进制函数: unhex
  • 进制转换函数: conv
  • 绝对值函数: abs
  • 正取余函数: pmod
  • 正弦函数: sin
  • 反正弦函数: asin
  • 余弦函数: cos
  • 反余弦函数: acos
  • positive函数: positive
  • negative函数: negative


  • (5)日期函数
  • UNIX时间戳转日期函数: from_unixtime
  • 获取当前UNIX时间戳函数: unix_timestamp
  • 日期转UNIX时间戳函数: unix_timestamp
  • 指定格式日期转UNIX时间戳函数: unix_timestamp
  • 日期时间转日期函数: to_date
  • 日期转年函数: year
  • 日期转月函数: month
  • 日期转天函数: day
  • 日期转小时函数: hour
  • 日期转分钟函数: minute
  • 日期转秒函数: second
  • 日期转周函数: weekofyear
  • 日期比较函数: datediff
  • 日期增加函数: date_add
  • 日期减少函数: date_sub


  • (6)字符串函数
  • 字符串长度函数:length
  • 字符串反转函数:reverse
  • 字符串连接函数:concat
  • 带分隔符字符串连接函数:concat_ws
  • 字符串截取函数:substr,substring
  • 字符串截取函数:substr,substring
  • 字符串转大写函数:upper,ucase
  • 字符串转小写函数:lower,lcase
  • 去空格函数:trim
  • 左边去空格函数:ltrim
  • 右边去空格函数:rtrim
  • 正则表达式替换函数:regexp_replace
  • 正则表达式解析函数:regexp_extract
  • URL解析函数:parse_url
  • json解析函数:get_json_object
  • 空格字符串函数:space
  • 重复字符串函数:repeat
  • 首字符ascii函数:ascii
  • 左补足函数:lpad
  • 右补足函数:rpad
  • 分割字符串函数: split
  • 集合查找函数: find_in_set

get_json_object

语法:

返回值:

说明:解析json的字符串json_string,返回path指定的内容。如果输入的json字符串无效,那么返回NULL。

select  get_json_object('
	{"store":
		{"fruit":[
			{"weight":8,"type":"apple"},
			{"weight":9,"type":"pear"}
			], 
		"bicycle":{"price":19.95,"color":"red"} 
	},
	"email":"amy@only_for_json_udf_test.net",
	"owner":"amy"
	}'
,'$.store.fruit.type[0]');

2、常用内置函数

(1)空字段赋值

NVL:给值为NULL的数据赋值,它的格式是NVL( value,default_value)。它的功能是如果value为NULL,则NVL函数返回default_value的值,否则返回value的值,如果两个参数都为NULL ,则返回NULL。

--如果员工的comm为NULL,则用-1代替

select comm,nvl(comm, -1) from emp;

--如果员工的comm为NULL,则用领导id代替

select comm, nvl(comm,mgr) from emp;

(2)条件函数

a.If 函数 : if

语法:if(boolean testCondition, T valueTrue, T valueFalseOrNull)

返回值:

说明:

select if(1=2,100,200) from dual;

200

select if(1=1,100,200) from dual;

100

b.非空查找函数 : coalesce

语法: coalesce(T v1, T v2, …)

返回值:

说明:

select coalesce(null,'100','50') from dual;

100

c.条件判断函数: case

语法 :case a when b then c [when d then e] …… [else f] end

返回值 :

说明:如果 a 等于 b ,那么返回 c ;如果 a 等于 d ,那么返回 e ;

否则返回 f

select

case 100

when 50 then 'tom'

when 100 then 'mary'

else 'tim'

end ;

mary

(3)行转列(多行转单列)

a.concat(string a/col, string b/col…):

返回输入字符串连接后的结果,支持任意个输入字符串;

--拼接字符串“---” “>”

select concat("---",">");

select concat(ename,job) from emp;

b.concat_ws(separator, str1, str2,...):

它是一个特殊形式的 concat()。第一个参数剩余参数间的分隔符。分隔符可以是与剩余参数一样的字符串。如果分隔符是 NULL,返回值也将为 NULL。这个函数会跳过分隔符参数后的任何 NULL 和空字符串。分隔符将被加到被连接的字符串之间;

--拼接字符串~_^

select concat_ws("_","~","^");

d.collect_set(col):

函数只接受基本数据类型,它的主要作用是将某字段的值进行去重汇总,产生array类型字段。

select concat_ws('-',collect_set(job)) from emp;

--统计每个部门内薪资相同的各位员工

select concat(deptno,'-',sal)  base ,ename from emp;

+------------+--------------+
|   t.base   |     name     |
+------------+--------------+
| 10-1300.0  | MILLER       |
| 10-2450.0  | CLARK        |
| 10-5000.0  | KING         |
| 20-1100.0  | ADAMS        |
| 20-2975.0  | JONES        |
| 20-3000.0  | SCOTT|FORD   |
| 20-800.0   | SMITH        |
| 30-1250.0  | WARD|MARTIN  |
| 30-1500.0  | TURNER       |
| 30-1600.0  | ALLEN        |
| 30-2850.0  | BLAKE        |
| 30-950.0   | JAMES        |
| 50-1200.0  | TOM          |
+------------+--------------+

--先把部门编号和工资连接
select concat(deptno,'-',sal)  base ,ename from emp;
+------------+---------+
|    base    |  ename  |
+------------+---------+
| 20-800.0   | SMITH   |
| 30-1600.0  | ALLEN   |
| 30-1250.0  | WARD    |
| 20-2975.0  | JONES   |
| 30-1250.0  | MARTIN  |
| 30-2850.0  | BLAKE   |
| 10-2450.0  | CLARK   |
| 20-3000.0  | SCOTT   |
| 10-5000.0  | KING    |
| 30-1500.0  | TURNER  |
| 20-1100.0  | ADAMS   |
| 30-950.0   | JAMES   |
| 20-3000.0  | FORD    |
| 10-1300.0  | MILLER  |
| 50-1200.0  | TOM     |
+------------+---------+
-- 对上步的结果继续处理,按照 base分组聚合,

select t.base,collect_set(t.ename)
from (select concat(deptno,'-',sal)  base ,ename from emp)t 
group by t.base;

+------------+--------------------+
|   t.base   |        _c1         |
+------------+--------------------+
| 10-2450.0  | ["CLARK"]          |
| 10-5000.0  | ["KING"]           |
| 20-3000.0  | ["SCOTT","FORD"]   |
| 30-1250.0  | ["WARD","MARTIN"]  |
| 30-1500.0  | ["TURNER"]         |
| 30-2850.0  | ["BLAKE"]          |
| 50-1200.0  | ["TOM"]            |
| 10-1300.0  | ["MILLER"]         |
| 20-1100.0  | ["ADAMS"]          |
| 20-2975.0  | ["JONES"]          |
| 20-800.0   | ["SMITH"]          |
| 30-1600.0  | ["ALLEN"]          |
| 30-950.0   | ["JAMES"]          |
+------------+--------------------+
-- 对上步的结果再次处理
select t.base,concat_ws('|',collect_set(t.ename)) 
from (select concat(deptno,'-',sal)  base ,ename from emp)t 
group by t.base;

思路:

  • 把部门号、薪资拼接在一起

select concat(deptno,"-",sal) base,ename from emp;

  • 按照部门号+薪资分组

select t.base, collect_set(t.ename) name

from (select concat(deptno,"-",sal) base,ename from emp) t

group by t.base;

  • 多位员工需要拼接一起

select t.base, concat_ws('|',collect_set(t.ename)) name

from (select concat(deptno,"-",sal) base,ename from emp) t

group by t.base;

(4)列转行

a.explode

  • explode函数可以将一个array或者map类型的字段展开,

其中explode(array)使得结果中将array列表里的每个元素生成一行;explode(map)使得结果中将map里的每一对元素作为一行,key为一列,value为一列。

  • 一般情况下,直接使用explode即可,也可以根据需要结合lateral view

select explode(array("hadoop","spark","flink"));

b.lateral view

  • 用法:lateral view udtf(expression) tablealias as columnalias
  • 解释:用于和split, explode等udtf一起使用,它能够将一列数据拆成多行数据,在此基础上可以对拆分后的数据进行聚合。

准备数据:    

create  table p1(name string,children array<string>,address Map<string,string>) 
row format delimited fields terminated by '|' 
collection items terminated by ','
map keys terminated by ':';

vim maparray.txt
zhangsan|child1,child2,child3,child4|k1:v1,k2:v2
lisi|child5,child6,child7,child8|k3:v3,k4:v4

load data local inpath '/home/offcn/tmp/maparray.txt' into table p1;

select * from p1;
+-----------+----------------------------------------+------------------------+
|  p1.name  |              p1.children               |       p1.address       |
+-----------+----------------------------------------+------------------------+
| zhangsan  | ["child1","child2","child3","child4"]  | {"k1":"v1","k2":"v2"}  |
| lisi      | ["child5","child6","child7","child8"]  | {"k3":"v3","k4":"v4"}  |
+-----------+----------------------------------------+------------------------+

select explode(children) as myChild from p1;
select explode(address) as (myMapKey, myMapValue) from p1;

--如果想用原表和拆分的列进行拼接组合
select p1.*,childrenView.*,addressView.* from p1 
lateral view explode(children) childrenView as myChild
lateral view explode(address) addressView as myMapKey, myMapValue;

+-----------+----------------------------------------+------------------------+-----------------------+-----------------------+-------------------------+
|  p1.name  |              p1.children               |       p1.address       | childrenview.mychild  | addressview.mymapkey  | addressview.mymapvalue  |
+-----------+----------------------------------------+------------------------+-----------------------+-----------------------+-------------------------+
| zhangsan  | ["child1","child2","child3","child4"]  | {"k1":"v1","k2":"v2"}  | child1                | k1                    | v1                      |
| zhangsan  | ["child1","child2","child3","child4"]  | {"k1":"v1","k2":"v2"}  | child1                | k2                    | v2                      |
| zhangsan  | ["child1","child2","child3","child4"]  | {"k1":"v1","k2":"v2"}  | child2                | k1                    | v1                      |
| zhangsan  | ["child1","child2","child3","child4"]  | {"k1":"v1","k2":"v2"}  | child2                | k2                    | v2                      |
| zhangsan  | ["child1","child2","child3","child4"]  | {"k1":"v1","k2":"v2"}  | child3                | k1                    | v1                      |
| zhangsan  | ["child1","child2","child3","child4"]  | {"k1":"v1","k2":"v2"}  | child3                | k2                    | v2                      |
| zhangsan  | ["child1","child2","child3","child4"]  | {"k1":"v1","k2":"v2"}  | child4                | k1                    | v1                      |
| zhangsan  | ["child1","child2","child3","child4"]  | {"k1":"v1","k2":"v2"}  | child4                | k2                    | v2                      |
| lisi      | ["child5","child6","child7","child8"]  | {"k3":"v3","k4":"v4"}  | child5                | k3                    | v3                      |
| lisi      | ["child5","child6","child7","child8"]  | {"k3":"v3","k4":"v4"}  | child5                | k4                    | v4                      |
| lisi      | ["child5","child6","child7","child8"]  | {"k3":"v3","k4":"v4"}  | child6                | k3                    | v3                      |
| lisi      | ["child5","child6","child7","child8"]  | {"k3":"v3","k4":"v4"}  | child6                | k4                    | v4                      |
| lisi      | ["child5","child6","child7","child8"]  | {"k3":"v3","k4":"v4"}  | child7                | k3                    | v3                      |
| lisi      | ["child5","child6","child7","child8"]  | {"k3":"v3","k4":"v4"}  | child7                | k4                    | v4                      |
| lisi      | ["child5","child6","child7","child8"]  | {"k3":"v3","k4":"v4"}  | child8                | k3                    | v3                      |
| lisi      | ["child5","child6","child7","child8"]  | {"k3":"v3","k4":"v4"}  | child8                | k4                    | v4                      |
+-----------+----------------------------------------+------------------------+-----------------------+-----------------------+-------------------------+

(5)窗口函数(开窗函数)

hive中的窗口函数和sql中的窗口函数相类似,都是用来做一些数据分析类的工作,一般用于olap分析(在线分析处理)。大部分的窗口函数针对的是分区(窗口)内部处理的场景,hive、oracle提供开窗函数,mysql 8.0之前版本不提供,但MySQL 8.0版本支持窗口函数(over)和公用表表达式(with)这两个重要的功能!

语法结构:

窗口函数 over (partition by 列名, 列名……order by 列名 rows between 开始位置 and 结束位置)

说明:

  • over()函数:

over()函数中包括三个函数:包括分区partition by 列名、排序order by 列名、指定窗口范围rows between 开始位置 and 结束位置。它决定了聚合函数的聚合范围,默认对整个窗口中的数据进行聚合,聚合函数对每一条数据调用一次

  • partition by子句:

(partition by .. order by)可替换为(distribute by .. sort by ..)。

  • order by子句:

    order by是排序的意思,是在该窗口中根据指定字段进行的排序,只能跟一个字段。

  • rows between

rows between开始位置 and 结束位置,窗口大小限定:

  1. following往后
  2. current row当前行
  3. unbounded起点(一般结合preceding,following使用)
  4. unbounded preceding表示该窗口最前面的行(起点)
  5. unbounded following表示该窗口最后面的行(终点)

准备数据:

create table t2(ip string, createtime string, url string,pvs int) 
row format delimited fields terminated by ',';
	

vim t2.txt
192.168.233.11,2021-04-10,url2,5
192.168.233.11,2021-04-12,url1,4
192.168.233.11,2021-04-10,1url3,5
192.168.233.11,2021-04-11,url6,3
192.168.233.11,2021-04-13,url4,6
192.168.233.12,2021-04-10,url7,5
192.168.233.12,2021-04-11,url4,1
192.168.233.12,2021-04-10,url5,5
192.168.233.13,2021-04-11,url22,3
192.168.233.13,2021-04-10,url11,11
192.168.233.13,2021-04-12,1url33,7
192.168.233.14,2021-04-13,url66,9
192.168.233.14,2021-04-12,url77,1
192.168.233.14,2021-04-13,url44,12
192.168.233.14,2021-04-11,url55,33

load data local inpath "/home/offcn/tmp/t2.txt" into table t2;
select * from t2;

a、窗口聚合函数

解决分组内统计以及显示非分组字段问题

  • count开窗函数
--统计每个ip、每个时间段数据的数据条数以及明细
select ip,createtime,count(pvs)
from t2	
group by ip,createtime;

+-----------------+-------------+------+
|       ip        | createtime  | _c2  |
+-----------------+-------------+------+
| 192.168.233.11  | 2021-04-10  | 2    |
| 192.168.233.11  | 2021-04-11  | 1    |
| 192.168.233.11  | 2021-04-12  | 1    |
| 192.168.233.11  | 2021-04-13  | 1    |
| 192.168.233.12  | 2021-04-10  | 2    |
| 192.168.233.12  | 2021-04-11  | 1    |
| 192.168.233.13  | 2021-04-10  | 1    |
| 192.168.233.13  | 2021-04-11  | 1    |
| 192.168.233.13  | 2021-04-12  | 1    |
| 192.168.233.14  | 2021-04-11  | 1    |
| 192.168.233.14  | 2021-04-12  | 1    |
| 192.168.233.14  | 2021-04-13  | 2    |
+-----------------+-------------+------+

问题:发现group by后无法显示明细数据,此时需要用窗口函数
select ip,createtime,url,pvs,count(pvs) over(partition by ip,createtime)
from t2;

+-----------------+-------------+---------+------+-----------------+
|       ip        | createtime  |   url   | pvs  | count_window_0  |
+-----------------+-------------+---------+------+-----------------+
| 192.168.233.11  | 2021-04-10  | url2    | 5    | 2               |
| 192.168.233.11  | 2021-04-10  | 1url3   | 5    | 2               |
| 192.168.233.11  | 2021-04-11  | url6    | 3    | 1               |
| 192.168.233.11  | 2021-04-12  | url1    | 4    | 1               |
| 192.168.233.11  | 2021-04-13  | url4    | 6    | 1               |
| 192.168.233.12  | 2021-04-10  | url5    | 5    | 2               |
| 192.168.233.12  | 2021-04-10  | url7    | 5    | 2               |
| 192.168.233.12  | 2021-04-11  | url4    | 1    | 1               |
| 192.168.233.13  | 2021-04-10  | url11   | 11   | 1               |
| 192.168.233.13  | 2021-04-11  | url22   | 3    | 1               |
| 192.168.233.13  | 2021-04-12  | 1url33  | 7    | 1               |
| 192.168.233.14  | 2021-04-11  | url55   | 33   | 1               |
| 192.168.233.14  | 2021-04-12  | url77   | 1    | 1               |
| 192.168.233.14  | 2021-04-13  | url66   | 9    | 2               |
| 192.168.233.14  | 2021-04-13  | url44   | 12   | 2               |
+-----------------+-------------+---------+------+-----------------+
  • sum开窗函数

--统计每个ip、每个时间段的总pv数以及明细

select ip,createtime,url,pvs,sum(pvs) over(partition by ip,createtime)

from t2;                

  • avg开窗函数

--统计每个ip、每个时间段平局pv次数以及明细

select ip,createtime, url,pvs,avg(pvs) over(partition by ip,createtime)

from t2;      

  • min开窗函数

--统计每个ip、每个时间段最小的pv数以及明细

select ip,createtime, url,pvs,min(pvs) over(partition by ip,createtime)

from t2;

  • max开窗函数

--统计每个ip、每个时间段最大的pv数以及明细

select ip,createtime, url,pvs,max(pvs) over(partition by ip,createtime)

from t2;

  • 窗口控制语句
select  *,
--分组内所有行
--按照ip分组
sum(pvs) over(partition by ip) AS c_1 ,
--按照createtime分组并按照该字段升序排序展示
sum(pvs) over(order  by createtime) AS c_2 ,
--默认为从起点到当前行,
sum(pvs) over(partition by ip order by createtime asc) AS c_3,
--从起点到当前行,结果与sales_3不同。 根据排序先后不同,可能结果累加不同
sum(pvs) over(partition by ip order by createtime asc rows between unbounded preceding and current row) AS c_4,
--当前行+往前3行 
sum(pvs) over(partition by ip order by createtime asc rows between 3 preceding and current row) AS c_5,
--当前行+往前3行+往后1行
sum(pvs) over(partition by ip order by createtime asc rows between 3 preceding and 1 following) AS c_6,
--当前行+往后所有行  
sum(pvs) over(partition by ip order by createtime asc rows between current row and unbounded following) AS c_7 
from t2;

--按照ip分组,分组内所有行累加
select  *,
sum(pvs) over(partition by ip) AS c_1 
from t2;	

+-----------------+----------------+---------+---------+------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  | c_1  |
+-----------------+----------------+---------+---------+------+
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 23   |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 23   |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 23   |
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 23   |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 23   |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 11   |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 11   |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 11   |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 21   |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 21   |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 21   |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 55   |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 55   |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 55   |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 55   |
+-----------------+----------------+---------+---------+------+

-- 按照时间排序,按照时间分组,时间相同的累加,并逐个累加
select  *, 
sum(pvs) over(order  by createtime) AS c_2 
from t2;

+-----------------+----------------+---------+---------+------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  | c_2  |
+-----------------+----------------+---------+---------+------+
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 31   |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 31   |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 31   |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 31   |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 31   |
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 71   |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 71   |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 71   |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 71   |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 83   |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 83   |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 83   |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 110  |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 110  |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 110  |
+-----------------+----------------+---------+---------+------+

-- 按照ip分大组,组内按照时间排序再分小组,大组内逐个累加
select  *, 
sum(pvs) over(partition by ip order by createtime asc) AS c_3
from t2;

+-----------------+----------------+---------+---------+------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  | c_3  |
+-----------------+----------------+---------+---------+------+
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 10   |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 10   |	
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 13   |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 17   |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 23   |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 10   |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 10   |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 11   |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 11   |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 14   |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 21   |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 33   |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 34   |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 55   |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 55   |
+-----------------+----------------+---------+---------+------+


-- 按照ip分大组,组内按照时间、url排序再分小组,大组内逐个累加

select  *, 
sum(pvs) over(partition by ip order by createtime asc,url) AS c_3
from t2;

+-----------------+----------------+---------+---------+------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  | c_3  |
+-----------------+----------------+---------+---------+------+
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 5    |
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 10   |
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 13   |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 17   |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 23   |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 5    |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 10   |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 11   |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 11   |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 14   |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 21   |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 33   |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 34   |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 46   |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 55   |
+-----------------+----------------+---------+---------+------+


-- 按照ip分窗口,窗口内按照时间升序,累加当前窗口的当前行、前所有行(默认)
select  *, 
sum(pvs) over(partition by ip order by createtime asc rows between unbounded preceding and current row) AS c_4 
from t2;

+-----------------+----------------+---------+---------+------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  | c_4  |
+-----------------+----------------+---------+---------+------+
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 5    |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 10   |
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 13   |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 17   |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 23   |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 5    |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 10   |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 11   |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 11   |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 14   |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 21   |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 33   |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 34   |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 43   |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 55   |
+-----------------+----------------+---------+---------+------+

-- 按照ip分窗口,窗口内按照时间升序,累加当前窗口的当前行、前3行(共4行)
select  *, 
sum(pvs) over(partition by ip order by createtime asc rows between 3 preceding and current row) AS c_5 
from t2;
	
+-----------------+----------------+---------+---------+------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  | c_5  |
+-----------------+----------------+---------+---------+------+
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 5    |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 10   |
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 13   |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 17   |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 18   |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 5    |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 10   |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 11   |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 11   |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 14   |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 21   |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 33   |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 34   |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 43   |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 55   |
+-----------------+----------------+---------+---------+------+

-- 按照ip分窗口,窗口内按照时间升序,累加当前窗口的当前行、前3行、当前行的下一行
select  *, 
sum(pvs) over(partition by ip order by createtime asc rows between 3 preceding and 1 following) AS c_6
from t2;

+-----------------+----------------+---------+---------+-------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  | c_6  |
+-----------------+----------------+---------+---------+-------+
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 10    |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 13    |
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 17    |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 23    |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 18    |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 10    |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 11    |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 11    |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 14    |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 21    |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 21    |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 34    |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 43    |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 55    |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 55    |
+-----------------+----------------+---------+---------+-------+

-- 按照ip分窗口,窗口内按照时间升序,从当前窗口的当前行开始,累加到窗口最后一行
select  *, 	
sum(pvs) over(partition by ip order by createtime asc rows between current row and unbounded following) AS c_7 
from t2;
+-----------------+----------------+---------+---------+------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  | c_7  |
+-----------------+----------------+---------+---------+------+
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 23   |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 18   |
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 13   |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 10   |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 6    |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 11   |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 6    |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 1    |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 21   |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 10   |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 7    |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 55   |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 22   |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 21   |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 12   |
+-----------------+----------------+---------+---------+------+

b、窗口分析函数

解决级联问题

  • first_value开窗函数
统计每个ip和本ip第一次进入网站的pv差以及明细数据

按照ip分组,组内按照时间升序排序,取组内第一个
select *,first_value(pvs) over(partition by ip order by createtime) haha
from t2;
+-----------------+----------------+---------+---------+-------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  | haha  |
+-----------------+----------------+---------+---------+-------+
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 5     |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 5     |
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 5     |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 5     |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 5     |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 5     |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 5     |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 5     |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 11    |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 11    |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 11    |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 33    |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 33    |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 33    |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 33    |
+-----------------+----------------+---------+---------+-------+


with aaa as(
select *,first_value(pvs) over(partition by ip order by createtime) haha
from t2)
select *,(aaa.pvs-aaa.haha) from aaa;

+-----------------+-----------------+----------+----------+-----------+------+
|     aaa.ip      | aaa.createtime  | aaa.url  | aaa.pvs  | aaa.haha  | _c1  |
+-----------------+-----------------+----------+----------+-----------+------+
| 192.168.233.11  | 2021-04-10      | url2     | 5        | 5         | 0    |
| 192.168.233.11  | 2021-04-10      | 1url3    | 5        | 5         | 0    |
| 192.168.233.11  | 2021-04-11      | url6     | 3        | 5         | -2   |
| 192.168.233.11  | 2021-04-12      | url1     | 4        | 5         | -1   |
| 192.168.233.11  | 2021-04-13      | url4     | 6        | 5         | 1    |
| 192.168.233.12  | 2021-04-10      | url5     | 5        | 5         | 0    |
| 192.168.233.12  | 2021-04-10      | url7     | 5        | 5         | 0    |
| 192.168.233.12  | 2021-04-11      | url4     | 1        | 5         | -4   |
| 192.168.233.13  | 2021-04-10      | url11    | 11       | 11        | 0    |
| 192.168.233.13  | 2021-04-11      | url22    | 3        | 11        | -8   |
| 192.168.233.13  | 2021-04-12      | 1url33   | 7        | 11        | -4   |
| 192.168.233.14  | 2021-04-11      | url55    | 33       | 33        | 0    |
| 192.168.233.14  | 2021-04-12      | url77    | 1        | 33        | -32  |
| 192.168.233.14  | 2021-04-13      | url66    | 9        | 33        | -24  |
| 192.168.233.14  | 2021-04-13      | url44    | 12       | 33        | -21  |
+-----------------+-----------------+----------+----------+-----------+------+


with as语句:	
with...as...也叫做子查询部分,语句允许hive定义一个sql片段,供整个sql使用
  • last_value开窗函数
先根据ip分组,然后根据排序字段滑动窗口(换句话说就是ip和排序字段共同决定窗口)
统计每个ip,每个时间段和本ip、本时间段最后一次进入网站的pv差以及明细数据

select *,last_value(pvs) over(partition by ip order by createtime) haha
from t2;

+-----------------+----------------+---------+---------+-------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  | haha  |
+-----------------+----------------+---------+---------+-------+
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 5     |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 5     |
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 3     |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 4     |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 6     |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 5     |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 5     |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 1     |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 11    |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 3     |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 7     |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 33    |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 1     |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 12    |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 12    |
+-----------------+----------------+---------+---------+-------+

with aaa as(
select *,last_value(pvs) over(partition by ip order by createtime) haha
from t2)
select *,(aaa.pvs-aaa.haha) from aaa;

+-----------------+-----------------+----------+----------+-----------+------+
|     aaa.ip      | aaa.createtime  | aaa.url  | aaa.pvs  | aaa.haha  | _c1  |
+-----------------+-----------------+----------+----------+-----------+------+
| 192.168.233.11  | 2021-04-10      | url2     | 5        | 5         | 0    |
| 192.168.233.11  | 2021-04-10      | 1url3    | 5        | 5         | 0    |
| 192.168.233.11  | 2021-04-11      | url6     | 3        | 3         | 0    |
| 192.168.233.11  | 2021-04-12      | url1     | 4        | 4         | 0    |
| 192.168.233.11  | 2021-04-13      | url4     | 6        | 6         | 0    |
| 192.168.233.12  | 2021-04-10      | url5     | 5        | 5         | 0    |
| 192.168.233.12  | 2021-04-10      | url7     | 5        | 5         | 0    |
| 192.168.233.12  | 2021-04-11      | url4     | 1        | 1         | 0    |
| 192.168.233.13  | 2021-04-10      | url11    | 11       | 11        | 0    |
| 192.168.233.13  | 2021-04-11      | url22    | 3        | 3         | 0    |
| 192.168.233.13  | 2021-04-12      | 1url33   | 7        | 7         | 0    |
| 192.168.233.14  | 2021-04-11      | url55    | 33       | 33        | 0    |
| 192.168.233.14  | 2021-04-12      | url77    | 1        | 1         | 0    |
| 192.168.233.14  | 2021-04-13      | url66    | 9        | 12        | -3   |
| 192.168.233.14  | 2021-04-13      | url44    | 12       | 12        | 0    |
+-----------------+-----------------+----------+----------+-----------+------+

	lag开窗函数
LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值
参数1为列名,参数2为往上第n行(可选,默认为1),参数3为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL

窗口上移 按照createtime向上移2行,如果是空,用balabala代替

select *,lag(createtime,2,'balabala') over(partition by ip order by createtime) tmp
from t2;

+-----------------+----------------+---------+---------+-------------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  |     tmp     |
+-----------------+----------------+---------+---------+-------------+
| 192.168.233.11  | 2021-04-10     | url2    | 5       | balabala    |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | balabala    |
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 2021-04-10  |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 2021-04-10  |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 2021-04-11  |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | balabala    |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | balabala    |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 2021-04-10  |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | balabala    |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | balabala    |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 2021-04-10  |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | balabala    |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | balabala    |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 2021-04-11  |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 2021-04-12  |
+-----------------+----------------+---------+---------+-------------+

lead开窗函数

与LAG相反
LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
参数1为列名,参数2为往下第n行(可选,默认为1),参数3为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL

窗口下移 默认移一行,没有的话,返回空
select *,lead(createtime) over(partition by ip order by createtime) tmp
from t2;

+-----------------+----------------+---------+---------+-------------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  |     tmp     |
+-----------------+----------------+---------+---------+-------------+
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 2021-04-10  |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 2021-04-11  |
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 2021-04-12  |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 2021-04-13  |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | NULL        |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 2021-04-10  |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 2021-04-11  |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | NULL        |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 2021-04-11  |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 2021-04-12  |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | NULL        |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 2021-04-12  |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 2021-04-13  |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 2021-04-13  |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | NULL        |
+-----------------+----------------+---------+---------+-------------+


select *,lead(createtime,1,'haha') over(partition by ip order by createtime) tmp
from t2;

+-----------------+----------------+---------+---------+-------------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  |     tmp     |
+-----------------+----------------+---------+---------+-------------+
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 2021-04-10  |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 2021-04-11  |
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 2021-04-12  |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 2021-04-13  |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | haha        |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 2021-04-10  |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 2021-04-11  |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | haha        |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 2021-04-11  |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 2021-04-12  |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | haha        |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 2021-04-12  |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 2021-04-13  |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 2021-04-13  |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | haha        |
+-----------------+----------------+---------+---------+-------------+

c、窗口排序函数

解决分组求TopN问题

  • rank开窗函数

生成数据项在分组中的排名,排名相等会在名次中留下空位,相同数值的数据可并列排名,有两个第二名时接下来就是第四名

  • dense_rank开窗函数

生成数据项在分组中的排名,排名相等会在名次中不会留下空位,相同数值的数据可并列排名,有两个第二名时仍然跟着第三名

  • row_number开窗函数

从1开始,按照顺序,生成分组内记录的序列, 有两个并列第一,row_number()都会排序返回

  • ntile开窗函数

用于将分组数据按照窗口、顺序等分成n片,返回当前切片值

如果切片不均匀,默认增加第一个切片的分布

用法:

按照ip分组,组内按照pvs升序排序,
select *,
row_number() over(partition by ip order by pvs) row_number,
rank() over(partition by ip order by pvs) rank,
dense_rank() over(partition by ip order by pvs) dense_rank, 
ntile(3) over(partition by ip order by pvs) ntile
from t2;

+-----------------+----------------+---------+---------+-------------+-------+-------------+--------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  | row_number  | rank  | dense_rank  | ntile  |
+-----------------+----------------+---------+---------+-------------+-------+-------------+--------+
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 1           | 1     | 1           | 1      |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 2           | 2     | 2           | 1      |
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 3           | 3     | 3           | 2      |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 4           | 3     | 3           | 2      |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 5           | 5     | 4           | 3      |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 1           | 1     | 1           | 1      |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 2           | 2     | 2           | 2      |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 3           | 2     | 2           | 3      |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 1           | 1     | 1           | 1      |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 2           | 2     | 2           | 2      |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 3           | 3     | 3           | 3      |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 1           | 1     | 1           | 1      |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 2           | 2     | 2           | 1      |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 3           | 3     | 3           | 2      |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 4           | 4     | 4           | 3      |
+-----------------+----------------+---------+---------+-------------+-------+-------------+--------+

--统计每个ip产生pv最高的前三条数据(Top3)
with tmp as(
select *,
row_number() over(partition by ip order by pvs desc) row_number,
rank() over(partition by ip order by pvs desc) rank,
dense_rank() over(partition by ip order by pvs desc) dense_rank, 
ntile(3) over(partition by ip order by pvs desc) ntile
from t2)
select * from tmp where tmp.row_number <=3;

+-----------------+-----------------+----------+----------+-----------------+-----------+-----------------+------------+
|     tmp.ip      | tmp.createtime  | tmp.url  | tmp.pvs  | tmp.row_number  | tmp.rank  | tmp.dense_rank  | tmp.ntile  |
+-----------------+-----------------+----------+----------+-----------------+-----------+-----------------+------------+
| 192.168.233.11  | 2021-04-13      | url4     | 6        | 1               | 1         | 1               | 1          |
| 192.168.233.11  | 2021-04-10      | url2     | 5        | 2               | 2         | 2               | 1          |
| 192.168.233.11  | 2021-04-10      | 1url3    | 5        | 3               | 2         | 2               | 2          |
| 192.168.233.12  | 2021-04-10      | url5     | 5        | 1               | 1         | 1               | 1          |
| 192.168.233.12  | 2021-04-10      | url7     | 5        | 2               | 1         | 1               | 2          |
| 192.168.233.12  | 2021-04-11      | url4     | 1        | 3               | 3         | 2               | 3          |
| 192.168.233.13  | 2021-04-10      | url11    | 11       | 1               | 1         | 1               | 1          |
| 192.168.233.13  | 2021-04-12      | 1url33   | 7        | 2               | 2         | 2               | 2          |
| 192.168.233.13  | 2021-04-11      | url22    | 3        | 3               | 3         | 3               | 3          |
| 192.168.233.14  | 2021-04-11      | url55    | 33       | 1               | 1         | 1               | 1          |
| 192.168.233.14  | 2021-04-13      | url44    | 12       | 2               | 2         | 2               | 1          |
| 192.168.233.14  | 2021-04-13      | url66    | 9        | 3               | 3         | 3               | 2          |
+-----------------+-----------------+----------+----------+-----------------+-----------+-----------------+------------+

d、窗口序列函数(了解)

两个序列分析函数不是很常用,这里也介绍一下。

注意: 序列函数不支持WINDOW子句,所谓window子句就是指定窗口范围的rows between 这些。

  • cume_dist开窗函数

cume_dist:小于等于当前值的行数/分组内总行数

--统计小于等于当前pvs的记录数,所占总记录数的比例
select *,
cume_dist() over(order by pvs) c1,
cume_dist() over(partition by ip order by pvs) c2
from t2;

+-----------------+----------------+---------+---------+----------------------+---------------------+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  |          c1          |         c2          |
+-----------------+----------------+---------+---------+----------------------+---------------------+
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 0.26666666666666666  | 0.2                 |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 0.3333333333333333   | 0.4                 |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 0.6                  | 0.8                 |
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 0.6                  | 0.8                 |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 0.6666666666666666   | 1.0                 |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 0.13333333333333333  | 0.3333333333333333  |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 0.6                  | 1.0                 |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 0.6                  | 1.0                 |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 0.26666666666666666  | 0.3333333333333333  |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 0.7333333333333333   | 0.6666666666666666  |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 0.8666666666666667   | 1.0                 |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 0.13333333333333333  | 0.25                |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 0.8                  | 0.5                 |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 0.9333333333333333   | 0.75                |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 1.0                  | 1.0                 |
+-----------------+----------------+---------+---------+----------------------+---------------------+

结果说明:

没有指定分组即为没有限定窗口大小,窗口数据为全部数据!

c1: 没有partition,所有数据均为1组,总行数为15,

第一行:小于等于pvs 为3的行数为4行,因此,4/15= 0.26666666666666666

第二行:小于等于pvs为4的行数为5行,因此,5/15= 0.3333333333333333

c2: 按照ip分组,ip=192.168.233.11的行数为5,

第一行:小于等于pvs为3的行数为1,因此,1/5=0.2

第二行:小于等于pvs为4的行数为2,因此,2/5=0.4

  • percent_rank开窗函数

分组内当前行的RANK值-1/分组内总行数-1,适用于复杂算法

select *,
percent_rank() over(order by createtime) c1,
rank() over(order by createtime) c2,
count(*) over() c3
from t2;

结果说明:
c2是倒序排名,c3是总条数
Over中没有分组,即全部为一组C1=(C2-1)/(C3-1)
第一行:(13-1)/(15-1)= 0.8571428571428571


+-----------------+----------------+---------+---------+----------------------+-----+-----+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  |          c1          | c2  | c3  |
+-----------------+----------------+---------+---------+----------------------+-----+-----+
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 0.8571428571428571   | 13  | 15  |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 0.8571428571428571   | 13  | 15  |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 0.8571428571428571   | 13  | 15  |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 0.6428571428571429   | 10  | 15  |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 0.6428571428571429   | 10  | 15  |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 0.6428571428571429   | 10  | 15  |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 0.35714285714285715  | 6   | 15  |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 0.35714285714285715  | 6   | 15  |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 0.35714285714285715  | 6   | 15  |
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 0.35714285714285715  | 6   | 15  |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 0.0                  | 1   | 15  |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 0.0                  | 1   | 15  |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 0.0                  | 1   | 15  |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 0.0                  | 1   | 15  |
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 0.0                  | 1   | 15  |
+-----------------+----------------+---------+---------+----------------------+-----+-----+


select *, 
percent_rank() over(partition by ip order by createtime) c1,
rank() over(partition by ip order by createtime) c2,
count(*) over(partition by ip) c3
from t2;

结果说明:
Over中按照ip分组, C1=(C2-1)/(C3-1)
第三行:(3-1)/(5-1)=0.5


+-----------------+----------------+---------+---------+---------------------+-----+-----+
|      t2.ip      | t2.createtime  | t2.url  | t2.pvs  |         c1          | c2  | c3  |
+-----------------+----------------+---------+---------+---------------------+-----+-----+
| 192.168.233.11  | 2021-04-10     | url2    | 5       | 0.0                 | 1   | 5   |
| 192.168.233.11  | 2021-04-10     | 1url3   | 5       | 0.0                 | 1   | 5   |
| 192.168.233.11  | 2021-04-11     | url6    | 3       | 0.5                 | 3   | 5   |
| 192.168.233.11  | 2021-04-12     | url1    | 4       | 0.75                | 4   | 5   |
| 192.168.233.11  | 2021-04-13     | url4    | 6       | 1.0                 | 5   | 5   |
| 192.168.233.12  | 2021-04-11     | url4    | 1       | 1.0                 | 3   | 3   |
| 192.168.233.12  | 2021-04-10     | url7    | 5       | 0.0                 | 1   | 3   |
| 192.168.233.12  | 2021-04-10     | url5    | 5       | 0.0                 | 1   | 3   |
| 192.168.233.13  | 2021-04-12     | 1url33  | 7       | 1.0                 | 3   | 3   |
| 192.168.233.13  | 2021-04-10     | url11   | 11      | 0.0                 | 1   | 3   |
| 192.168.233.13  | 2021-04-11     | url22   | 3       | 0.5                 | 2   | 3   |
| 192.168.233.14  | 2021-04-13     | url66   | 9       | 0.6666666666666666  | 3   | 4   |
| 192.168.233.14  | 2021-04-13     | url44   | 12      | 0.6666666666666666  | 3   | 4   |
| 192.168.233.14  | 2021-04-11     | url55   | 33      | 0.0                 | 1   | 4   |
| 192.168.233.14  | 2021-04-12     | url77   | 1       | 0.3333333333333333  | 2   | 4   |
+-----------------+----------------+---------+---------+---------------------+-----+-----+

e、窗口分组函数

在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL,

其中的GROUPING__ID,表示结果属于哪一个分组集合。

  • Grouping

--需求:使用一条SQL分别求出,每个ip访问量条数, 每个时间访问量的条数

select ip,null as createtime,count(1) as num,1 as GROUPING__ID 
from t2
group by ip
union all
select null as ip,createtime,count(1) as num,2 as GROUPING__ID
from t2
group by createtime;

+-----------------+-------------+------+---------------+
|       ip        | createtime  | num  | grouping__id  |
+-----------------+-------------+------+---------------+
| 192.168.233.11  | NULL        | 5    | 1             |
| 192.168.233.12  | NULL        | 3    | 1             |
| 192.168.233.13  | NULL        | 3    | 1             |
| 192.168.233.14  | NULL        | 4    | 1             |
+-----------------+-------------+------+---------------+

+-------+-------------+------+---------------+
|  ip   | createtime  | num  | grouping__id  |
+-------+-------------+------+---------------+
| NULL  | 2021-04-10  | 5    | 2             |
| NULL  | 2021-04-11  | 4    | 2             |
| NULL  | 2021-04-12  | 3    | 2             |
| NULL  | 2021-04-13  | 3    | 2             |
+-------+-------------+------+---------------+

+-----------------+-----------------+----------+-------------------+
|     _u1.ip      | _u1.createtime  | _u1.num  | _u1.grouping__id  |
+-----------------+-----------------+----------+-------------------+
| 192.168.233.11  | NULL            | 5        | 1                 |
| 192.168.233.12  | NULL            | 3        | 1                 |
| 192.168.233.13  | NULL            | 3        | 1                 |
| 192.168.233.14  | NULL            | 4        | 1                 |
| NULL            | 2021-04-10      | 5        | 2                 |
| NULL            | 2021-04-11      | 4        | 2                 |
| NULL            | 2021-04-12      | 3        | 2                 |
| NULL            | 2021-04-13      | 3        | 2                 |
+-----------------+-----------------+----------+-------------------+

--使用窗口分组函数
select
    ip,createtime, count(1) as num,
    GROUPING__ID 
from 
    t2
group by 
    ip,createtime
GROUPING SETS(ip,createtime) 
ORDER BY 
GROUPING__ID;



--需求:使用一条SQL分别求出,每个ip访问量条数, 每个时间访问量的条数,每个ip、每个时间内访问量的条数
select ip,null as createtime,count(1) as num,1 as GROUPING__ID 
from t2
group by ip
union all
select null as ip,createtime,count(1) as num,2 as GROUPING__ID
from t2
group by createtime
union all
select ip,createtime,count(1) as num,3 as GROUPING__ID
from t2
group by ip,createtime;


+-----------------+-----------------+----------+-------------------+
|     _u1.ip      | _u1.createtime  | _u1.num  | _u1.grouping__id  |
+-----------------+-----------------+----------+-------------------+
| 192.168.233.11  | 2021-04-10      | 2        | 3                 |
| 192.168.233.11  | 2021-04-11      | 1        | 3                 |
| 192.168.233.11  | 2021-04-12      | 1        | 3                 |
| 192.168.233.11  | 2021-04-13      | 1        | 3                 |
| 192.168.233.12  | 2021-04-10      | 2        | 3                 |
| 192.168.233.12  | 2021-04-11      | 1        | 3                 |
| 192.168.233.13  | 2021-04-10      | 1        | 3                 |
| 192.168.233.13  | 2021-04-11      | 1        | 3                 |
| 192.168.233.13  | 2021-04-12      | 1        | 3                 |
| 192.168.233.14  | 2021-04-11      | 1        | 3                 |
| 192.168.233.14  | 2021-04-12      | 1        | 3                 |
| 192.168.233.14  | 2021-04-13      | 2        | 3                 |
| 192.168.233.11  | NULL            | 5        | 1                 |
| 192.168.233.12  | NULL            | 3        | 1                 |
| 192.168.233.13  | NULL            | 3        | 1                 |
| 192.168.233.14  | NULL            | 4        | 1                 |
| NULL            | 2021-04-10      | 5        | 2                 |
| NULL            | 2021-04-11      | 4        | 2                 |
| NULL            | 2021-04-12      | 3        | 2                 |
| NULL            | 2021-04-13      | 3        | 2                 |
+-----------------+-----------------+----------+-------------------+


--使用窗口分组函数
select
    ip,createtime, count(1) as num,
    GROUPING__ID 
from 
    t2
group by 
    ip,createtime
GROUPING SETS(ip,createtime,(ip,createtime)) 
ORDER BY 
GROUPING__ID;

3、自定义函数

当Hive提供的内置函数无法满足你的业务处理需要时,此时就可以考虑使用用户自定义函数(UDF:user-defined function)

根据用户自定义函数类别分为以下三种:

  • UDF(User-Defined-Function)普通函数 一进一出

https://cwiki.apache.org/confluence/display/Hive/HivePlugins

  • UDAF(User-Defined Aggregation Function)聚合函数,多进一出

https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy

  • UDTF(User-Defined Table-Generating Functions)表生成函数 一进多出

https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF

(1)创建工程导入依赖

<properties>
        <!--解决File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!-->
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <!--不再支持源选项 5。请使用 6 或更高版本。 不再支持目标选项 1.5。请使用 1.6 或更高版本-->
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>3.1.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.2</version>
    </dependency>
</dependencies>

(2)编写UDF函数

a.简单类型的UDF(继承UDF)

com.bigdata.myhive

@Description(name="myudf",value="_FUNC_(str) - Returns the case of the current string letter",
        extended = "Example:\n" + " > SELECT _FUNC_('admin') FROM src limit 1; \n")
public class MyUDF extends UDF {
    //转换字符串大小写
    public String evaluate(String args){
        if ("".equals(args)||args==null){
            return null;
        }
        return args.toUpperCase();
    }
}

(3)生成jar导入hive

--向hive添加jar
add jar /home/offcn/tmp/udf.jar;
--从hive删除jar
delete jar /home/offcn/tmp/udf.jar;
--查看添加的jar
list jars;

(4)创建临时函数测试调用

--创建临时函数
create temporary function myudf as 'com.bigdata.myhive.MyUDF';
create temporary function myudf2 as 'com.bigdata.myhive.MyUDF2';
--删除临时函数
drop temporary function myudf;
drop temporary function myudf2;
--查看创建的临时函数
show functions like 'my*';
--查看函数的描述
desc function myudf ;
desc function myudf2 ;

--查看函数的详细描述
desc function extended myudf ;
desc function extended myudf2 ;

--测试
select myudf("hello");
select myudf2("hello");
select myudf2("hello","world");
select myudf2(array("hello","world"));
explain extended select myudf2(url) from t2;
explain select myudf2(url) from t2;