四、HQL (Hive SQL)
https://www.docs4dev.com/docs/zh/apache-hive/3.1.1/reference/#
1) Add the following properties to hive-site.xml to display the current database name and the column headers of query results.
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
2) Restart Hive and compare the behavior before and after the change.
(1) Before the configuration, as shown below:
Figure 12-1 Before the configuration
(2) After the configuration, as shown below:
Figure 12-2 After the configuration
(二)DML (Data Manipulation Language)
1、Data import
(1)Uploading data files directly
--Temporary table
A temporary table has no directory under the warehouse, so files cannot be uploaded to it.
--Managed (internal) table
hadoop fs -put emp.txt /user/hive/warehouse/offcn.db/managed_emp01/aaa.txt
hadoop fs -put student.txt /user/hive/warehouse/
select * from managed_emp01;
--External table
hadoop fs -put emp.txt /user/hive/warehouse/offcn.db/managed_emp/bbb.txt
select * from external_emp01;
--Partitioned table
hadoop fs -put emp.txt /user/hive/warehouse/offcn.db/partition_emp/age=20/
select * from partition_emp;
hadoop fs -put emp.txt /user/hive/warehouse/offcn.db/partition_emp2/month=5/day=20
select * from partition_emp2;
Note: if you create a partitioned table, create a partition directory by hand (mkdir) and then upload data into it, the table must first be repaired with msck repair table tabName; before that partition can be queried.
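For example, a manually created partition only becomes visible after the repair; a minimal sketch against the partition_emp table above (age=30 is just an illustrative value):
hadoop fs -mkdir /user/hive/warehouse/offcn.db/partition_emp/age=30
hadoop fs -put emp.txt /user/hive/warehouse/offcn.db/partition_emp/age=30/
msck repair table partition_emp;
select * from partition_emp where age=30;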
--Bucketed table
hadoop fs -put emp.txt /user/hive/warehouse/offcn.db/buck_emp
select * from buck_emp;
Notice that the table directory on HDFS is not split into several files, i.e. the data has not actually been bucketed by the specified column.
(2)Loading data into a table (Load)
Syntax:
load data [local] inpath 'path/target.log' [overwrite] into table tab [partition (partcol1=val1,…)];
Notes:
- load data: load data into a table
- local: load (copy) data from the local file system of the node where the server was started; without it, the data is loaded (moved) from HDFS into the Hive table
- inpath: path of the data to load
- overwrite: overwrite the data already in the table; otherwise the data is appended
- into table: the table to load into
- tab: the concrete table name
- partition: load into the specified partition
Examples:
--Load data into a temporary table
use offcn;
create temporary table if not exists temporary_emp(
id int, name string
)row format delimited fields terminated by '\t';
load data local inpath "/home/offcn/apps/hive-3.1.2/emp.txt" into table temporary_emp;
select * from temporary_emp;
--Load data into a managed table from the local file system
load data local inpath "/home/offcn/apps/hive-3.1.2/emp.txt" into table managed_emp01;
load data local inpath "/home/offcn/apps/hive-3.1.2/emp.txt" overwrite into table managed_emp01;
select * from managed_emp01;
--Load data into a managed table from HDFS
First upload the data file to HDFS:
[offcn@bd-offcn-01 hive-3.1.2]$ hadoop fs -put emp.txt /user/offcn
load data inpath "/user/offcn/emp.txt" overwrite into table managed_emp01;
select * from managed_emp01;
--Load data into an external table
load data local inpath "/home/offcn/apps/hive-3.1.2/emp.txt" into table external_emp01;
select * from external_emp01;
--Load data into a partitioned table (single partition column)
load data local inpath "/home/offcn/apps/hive-3.1.2/emp.txt" into table partition_emp partition(age=18);
select * from partition_emp;
--Load data into a partitioned table (multi-level partitions)
load data local inpath "/home/offcn/apps/hive-3.1.2/emp.txt" into table partition_emp2 partition(month="05",day="20");
select * from partition_emp2;
--Load data into a bucketed table
load data local inpath "/home/offcn/apps/hive-3.1.2/emp.txt" into table buck_emp; -- wrong way to load: no data can be queried
select * from buck_emp;
[offcn@bd-offcn-01 hive-3.1.2]$ hadoop fs -put emp.txt /user/offcn
load data inpath "/user/offcn/emp.txt" into table buck_emp; -- correct way to load: the data can be queried; with this approach the file /user/offcn/emp.txt is kept
select * from buck_emp;
Conclusions:
1. When loading a bucketed table with load, only data that is already on HDFS can be loaded; loading from the local file system does not work.
2. When viewing the file contents in the HDFS web UI the display can look wrong; this is a Hadoop UI issue. To inspect the bucket files, use hadoop fs -cat instead.
3. If you run into that error, it does not affect the bucketing itself.
(3)Inserting data into a table from a query (Insert)
a.insert values
Syntax:
insert into table tab [partition (partcol1[=val1], partcol2[=val2] ...)] values (value [, value ...])
Example:
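The example slot above is left empty in the original notes; a minimal sketch, mirroring the emp02(id, name) statement used further below:
insert into table emp02(id,name) values(5,"flink");
select * from emp02;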
b.insert select
Syntax:
insert overwrite table tablename [partition (partcol1=val1, partcol2=val2 ...)] select_statement1 from from_statement;
insert into table tablename [partition (partcol1=val1, partcol2=val2 ...)] select_statement1 from from_statement;
Examples:
--Prepare some data first
insert into table emp02(id,name) values(4,"scala");
select * from emp02;
--Temporary table
insert overwrite table temporary_emp select * from emp02; -- overwrite insert
insert into table temporary_emp select * from emp02; -- append insert
--Managed (internal) table
insert into table managed_emp01 select * from emp02;
select * from managed_emp01;
--External table
insert into table external_emp01 select * from emp02;
select * from external_emp01;
--Partitioned table
insert into table partition_emp partition(age=45) select * from emp02;
select * from partition_emp;
insert into table partition_emp2 partition(month="3",day="18") select * from emp02; -- wrong: the selected columns do not line up with the target table
insert into table partition_emp2 partition(month="3",day="18") select * from partition_emp;
select * from partition_emp2;
--Bucketed table
insert into table buck_emp select * from partition_emp;
select * from buck_emp;
c.Multi-table insert
Syntax:
from from_statement
insert overwrite table tab [partition (partcol1=val1, partcol2=val2 ...)] select_statement1
[insert overwrite table tab2 [partition ...] select_statement2]
[insert into table tab2 [partition ...] select_statement2];
Notes:
- When part of the data of one table has to be inserted into several tables, a multi-table insert does it in a single statement.
Example:
Insert part of the partition_emp data into managed_emp01 and use it to overwrite external_emp01:
-- First look at the existing data
select * from managed_emp01; -- corresponds to /user/hive/warehouse/offcn.db/managed_emp01
select * from external_emp01; -- corresponds to /user/hive/warehouse/offcn.db/managed_emp
select * from partition_emp; -- corresponds to /user/hive/warehouse/offcn.db/partition_emp
-- Run the insert
from partition_emp
insert into table managed_emp01
select id,name
insert overwrite table external_emp01
select id,name;
d.Dynamic partitions
Syntax:
insert overwrite table tablename partition (partcol1[=val1], partcol2[=val2] ...) select_statement from from_statement;
insert into table tablename partition (partcol1[=val1], partcol2[=val2] ...) select_statement from from_statement;
Notes:
- When inserting into a partitioned table that needs many partitions (for example, one partition per distinct value of some column), copying and editing many SQL statements is inefficient. Since Hive is a batch system, it provides dynamic partitioning: the partition values are inferred from the position of the partition columns in the query, and the partitions are created automatically.
Example:
--Switch to non-strict mode
set hive.exec.dynamic.partition.mode=nonstrict;
--Create a dynamically partitioned table
create table dynamic_emp(id int) partitioned
by (name string) row format delimited fields terminated by '\t';
--Insert data dynamically
insert overwrite table dynamic_emp partition(name)
select id,name from managed_emp01;
--Show all partitions
show partitions dynamic_emp;
--Query the data
select * from dynamic_emp;
(4)Specifying the data path with Location when creating a table
create table if not exists emp4(
id int, name string
)
row format delimited fields terminated by '\t'
location '/test';
hadoop fs -put emp.txt /test
select * from emp4;
(5)Importing data into a Hive table (Import)
create table if not exists emp5(
id int, name string
)
row format delimited fields terminated by '\t';
import table emp5 from '/tmp/export/emp/';
select * from emp5;
Note: the data must be exported first and then imported; import only works together with export.
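For reference, the paired sequence (using the same paths as above; the export statement is the one shown in the export section further below) looks like:
export table offcn.emp02 to '/tmp/export/emp';
import table emp5 from '/tmp/export/emp/';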
2、Data export
(1)Export with insert
a、Export the query result to the local file system
insert overwrite local directory '/home/offcn/tmp/export/emp'
select * from managed_emp01;
b、Export the query result to the local file system with formatting
insert overwrite local directory '/home/offcn/tmp/export/emp2'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from managed_emp01;
c、Export the query result to HDFS (without the local keyword)
insert overwrite directory '/test/export/empout'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from managed_emp01;
(2)Export to the local file system with Hadoop commands
jdbc:hive2://bd-offcn-01:10000> dfs -get /user/hive/warehouse/offcn.db/emp5/emp.txt /home/offcn/tmp/export/emp2/emp.txt;
Or download the file directly with a hadoop command.
(3)Export with the Hive shell
hive -e 'select * from offcn.emp02;' > /home/offcn/tmp/export/emp.txt
(4)Export to HDFS with export
export table offcn.emp02 to
'/tmp/export/emp';
export and import are mainly used to migrate Hive tables between two Hadoop clusters.
(三)DQL (Data Query Language)
Prepare the data:
use default;
create table if not exists emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int)
row format delimited fields terminated by '\t';
create table if not exists dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\t';
load data local inpath '/home/offcn/tmp/emp.txt' into table emp;
load data local inpath '/home/offcn/tmp/dept.txt' into table dept;
select * from emp;
select * from dept;
1、Basic queries (Select…From)
Full-table query
select * from emp;
Query specific columns
select empno, ename from emp;
Notes:
- SQL is not case-sensitive.
- SQL can be written on one line or across several lines.
- Keywords cannot be abbreviated or split across lines.
- Each clause is usually written on its own line.
- Use indentation to make statements easier to read.
2、Conditional queries
Find all employees whose salary is greater than 1000
select * from emp where sal >1000;
Relational operators
Operator | Supported types | Description |
A=B | primitive types | TRUE if A equals B, otherwise FALSE |
A<=>B | primitive types | TRUE if both A and B are NULL; FALSE if only one side is NULL |
A<>B, A!=B | primitive types | NULL if A or B is NULL; TRUE if A is not equal to B, otherwise FALSE |
A<B | primitive types | NULL if A or B is NULL; TRUE if A is less than B, otherwise FALSE |
A<=B | primitive types | NULL if A or B is NULL; TRUE if A is less than or equal to B, otherwise FALSE |
A>B | primitive types | NULL if A or B is NULL; TRUE if A is greater than B, otherwise FALSE |
A>=B | primitive types | NULL if A or B is NULL; TRUE if A is greater than or equal to B, otherwise FALSE |
A [NOT] BETWEEN B AND C | primitive types | NULL if any of A, B or C is NULL; TRUE if A is greater than or equal to B and less than or equal to C, otherwise FALSE; NOT reverses the result |
A IS NULL | all types | TRUE if A is NULL, otherwise FALSE |
A IS NOT NULL | all types | TRUE if A is not NULL, otherwise FALSE |
IN(value1, value2) | all types | TRUE if the value appears in the list |
A [NOT] LIKE B | STRING | B is a simple SQL pattern (wildcards): TRUE if A matches it, otherwise FALSE. 'x%' means A must start with 'x', '%x' means A must end with 'x', and '%x%' means A contains 'x' anywhere (start, end or middle). NOT reverses the result |
A RLIKE B, A REGEXP B | STRING | B is a Java regular expression: TRUE if A matches it, otherwise FALSE. Matching uses the JDK regex implementation and follows its rules; note that the regular expression must match the whole string A, not just a substring |
(1)Between, In, Is Null
- Find all employees whose salary equals 5000
select * from emp where sal =5000;
- Find employees whose salary is between 500 and 1000
select * from emp where sal between 500 and 1000;
- Find all employees whose comm is NULL
select * from emp where comm is null;
- Find employees whose salary is 1500 or 5000
select * from emp where sal IN (1500, 5000);
(2)Like and RLike
The match pattern can contain characters and digits:
% stands for zero or more characters (any number of characters).
_ stands for exactly one character.
RLIKE clause:
RLIKE is a Hive extension that lets the match condition be specified with the much more powerful Java regular expression syntax.
- Find employees whose name starts with "S"
select * from emp where ename LIKE 'S%';
- Find employees whose name has "M" as its second character
select * from emp where ename LIKE '_M%';
- Find employees whose name contains "I"
select * from emp where ename RLIKE '[I]';
select * from emp where ename LIKE '%I%';
(3)Logical operators (And/Or/Not)
Operator  Meaning
AND       logical AND
OR        logical OR
NOT       logical NOT
- Find employees whose salary is greater than 1000 and whose department is 30
select * from emp where sal>1000 and deptno=30;
- Find employees whose salary is greater than 1000 or whose department is 30
select * from emp where sal>1000 or deptno=30;
- Find employees who are in neither department 20 nor department 30
select * from emp where deptno not IN(30, 20);
(4)The limit clause
The LIMIT clause restricts the number of rows returned. Unlike MySQL, limit here takes only a single argument.
select * from emp limit 5;
3、Grouped queries
(1)The Group By clause
GROUP BY is usually used together with aggregate functions: rows are grouped by one or more columns and an aggregate is computed for each group.
Note: in a grouped query, the columns after SELECT may only be grouping columns or aggregate functions!
- Compute the average salary of each department in emp
select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;
- Compute the highest salary of each job within each department in emp
select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno, t.job;
(2)The Having clause
Differences between having and where:
Aggregate functions cannot be used after where, but they can be used after having.
where filters the rows; having filters the aggregated results.
having may only be used in group by statements.
- Find the departments whose average salary is greater than 2000
Step 1: compute the average salary of each department
select deptno, avg(sal) from emp group by deptno;
Step 2: keep only the departments whose average salary is greater than 2000
select deptno, avg(sal) avg_sal from emp group by deptno having avg_sal > 2000;
The following is NOT allowed:
select deptno, avg(sal) avg_sal from emp group by deptno where avg_sal > 2000;
4、Join queries
A join involves multiple tables: inner joins (equi and non-equi) and outer joins (left outer, right outer, full outer).
Inner join: only rows that satisfy the join condition are returned.
Outer join: all rows of one table are returned; when the other table has no match, its columns are filled with NULL.
(1)Inner join
select e.*,d.*
from emp e inner join dept d
on e.deptno=d.deptno;
Implemented with intersect (the intersection of a left outer and a right outer join):
select e.*,d.*
from emp e left outer join dept d
on e.deptno=d.deptno
intersect
select e.*,d.*
from emp e right outer join dept d
on e.deptno=d.deptno;
(2)Outer joins
a.Left outer join
--left outer join
select e.*,d.*
from emp e left outer join dept d
on e.deptno=d.deptno;
b.Right outer join
--right outer join
select e.*,d.*
from emp e right outer join dept d
on e.deptno=d.deptno;
c.Left-only rows
--left-only: rows that exist in the left table but have no match in the right table
select e.*,d.*
from emp e left outer join dept d
on e.deptno=d.deptno
where d.deptno is null;
d.Right-only rows
--right-only: rows that exist in the right table but have no match in the left table
select e.*,d.*
from emp e right outer join dept d
on e.deptno=d.deptno
where e.deptno is null;
e.Full outer join
--full outer join
select e.*,d.*
from emp e full outer join dept d
on e.deptno=d.deptno;
--full outer join implemented with union
select e.*,d.*
from emp e left outer join dept d
on e.deptno=d.deptno
union
select e.*,d.*
from emp e right outer join dept d
on e.deptno=d.deptno;
f.Left-only and right-only rows
--left-only plus right-only rows
select e.*,d.*
from emp e full outer join dept d
on e.deptno=d.deptno
where e.deptno is null or d.deptno is null;
--left-only plus right-only rows with union all
select e.*,d.dname,d.loc
from emp e left outer join dept d
on e.deptno=d.deptno
where d.deptno is null
union all
select e.empno ,e.ename ,e.job,e.mgr,e.hiredate,e.sal,e.comm,d.*
from emp e right outer join dept d
on e.deptno=d.deptno
where e.deptno is null;
--except/minus: implemented as a set difference
select e.*,d.*
from emp e full outer join dept d
on e.deptno=d.deptno
except
select e.*,d.*
from emp e inner join dept d
on e.deptno=d.deptno;
select e.*,d.*
from emp e full outer join dept d
on e.deptno=d.deptno
minus
select e.*,d.*
from emp e inner join dept d
on e.deptno=d.deptno;
g.Cross join
Cross join: produces the Cartesian product; there is an implicit and an explicit form.
With an on condition added, it behaves like an inner join.
Implicit form
select e.*,d.*
from emp e,dept d;
Explicit form
select e.*,d.*
from emp e cross join dept d
on e.deptno=d.deptno;
h.Left semi join (left semi-join)
When a row of the left table satisfies the on condition against the right table, only the left table's columns are returned. A left semi join is an optimized form of inner join: as soon as a left-table row finds a match in the right table, Hive stops scanning the right table, so it is more efficient than a plain join. However, only columns of the left table may appear after select and where; columns of the right table are not allowed. Hive does not support a right semi join.
select *
from dept d left semi join emp e
on d.deptno=e.deptno;
select d.*
from dept d left semi join emp e
on d.deptno=e.deptno;
--Similar to an inner join that returns only the left table's columns; equivalent to an inner join returning the left table de-duplicated, but more efficient
select d.*
from dept d inner join emp e
on d.deptno=e.deptno;
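A left semi join is also equivalent to an IN/EXISTS subquery on the join key; a sketch of that form:
select d.*
from dept d
where d.deptno in (select deptno from emp);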
--Incorrect example (right-table columns are not allowed)
select d.*,e.*
from dept d left semi join emp e
on d.deptno=e.deptno;
(3)Self join
Prepare the data:
CREATE TABLE `area` (
`id` string NOT NULL,
`area_code` string,
`area_name` string,
`level` string,
`parent_code` string,
`target` string
)row format delimited fields terminated by ',';
load data local inpath "/home/offcn/tmp/area.csv" into table area;
select * from area;
Requirement: list all areas under a given province
--Subquery implementation
SELECT * FROM AREA
WHERE parent_code=(SELECT area_code
FROM AREA
WHERE area_name="内蒙古自治区");
--Self-join implementation
SELECT a.*,b.area_name
FROM AREA a JOIN AREA b
ON a.parent_code=b.area_code
WHERE b.area_name="内蒙古自治区";
(4)Multi-table joins
Prepare the data:
create table if not exists location(
loc int,
loc_name string
)
row format delimited fields terminated by '\t';
load data local inpath '/home/offcn/tmp/location.txt' into table location;
SELECT e.ename, d.dname, l.loc_name
FROM emp e
JOIN dept d
ON d.deptno = e.deptno
JOIN location l
ON d.loc = l.loc;
In most cases Hive launches one MapReduce job per pair of joined tables. In this example a first MapReduce job joins table e and table d, and a second MapReduce job then joins the output of the first job with table l.
Note: why are tables d and l not joined first? Because Hive always executes joins from left to right.
Optimization: when joining three or more tables, if every on clause uses the same join key, only a single MapReduce job is produced.
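A sketch of the single-job case: here emp is joined a second time under the alias e2, and both on clauses use deptno (unlike the example above, where the second join key is loc), so Hive can do all the work in one MapReduce job:
SELECT e.ename, d.dname, e2.ename
FROM emp e
JOIN dept d ON d.deptno = e.deptno
JOIN emp e2 ON e2.deptno = e.deptno;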
5、Sorting
(1)Global sort (Order By)
- Order By: global sort, carried out by a single Reducer
- Use the ORDER BY clause to sort
ASC (ascend): ascending (the default)
DESC (descend): descending
- The ORDER BY clause goes at the end of the SELECT statement
--List employees ordered by salary, ascending
select * from emp order by sal;
--List employees ordered by salary, descending
select * from emp order by sal desc;
(2)Sorting inside each reducer (Sort By)
- Sort By: order by is very inefficient on large data sets, and in many cases a global order is not needed; in those cases sort by can be used.
- Sort by produces one sorted file per reducer. Each reducer's output is sorted internally, but the overall result set is not globally sorted.
--Set the number of reducers
set mapreduce.job.reduces=3;
--Check the configured number of reducers
set mapred.reduce.tasks;
--List employees sorted by department number, descending
select * from emp sort by deptno desc;
--Write the query result to files (sorted by department number, descending)
insert overwrite local directory '/home/offcn/tmp/sortby-result'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from emp sort by deptno desc;
(3)Distribute By (partitioned output)
- Distribute By: sometimes we need to control which reducer a particular row goes to, usually for a later aggregation step; the distribute by clause does exactly that. distribute by works like a custom partitioner in MapReduce and is used together with sort by.
- To see the effect of distribute by you must allocate several reducers, otherwise the effect is not visible.
--Partition by department number first, then sort by employee number descending within each partition.
set mapreduce.job.reduces=3;
insert overwrite local directory '/home/offcn/tmp/distribute-result'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from emp distribute by deptno sort by empno desc;
Note:
- distribute by partitions rows by taking the hash code of the distribute-by column modulo the number of reducers; rows with the same remainder go to the same partition.
- Hive requires DISTRIBUTE BY to be written before SORT BY.
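A quick worked example (assuming the hash code of an int column equals its value): with set mapreduce.job.reduces=3, deptno 10 goes to reducer 10 % 3 = 1, deptno 20 to 20 % 3 = 2, and deptno 30 to 30 % 3 = 0, so each department ends up in its own output file.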
(4)Cluster By
- When the distribute by and sort by columns are the same, cluster by can be used instead.
- cluster by combines distribute by and sort by, but the sort is ascending only; ASC or DESC cannot be specified.
--The following two statements are equivalent
select * from emp distribute by deptno sort by deptno;
select * from emp cluster by deptno;
set mapreduce.job.reduces=3;
insert overwrite local directory '/home/offcn/tmp/cluster-result'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from emp cluster by deptno;
(四)Hive functions
1、Built-in functions
(1)Relational operators (covered above)
(2)Logical operators (covered above)
(3)Arithmetic and bitwise operators
- addition: +
- subtraction: -
- multiplication: *
- division: /
- modulo: %
- bitwise AND: &
- bitwise OR: |
- bitwise XOR: ^
- bitwise NOT: ~
create table dual(id string);
Prepare a text file whose only content is a single space and put it into the table's directory, for example:
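A minimal sketch of that preparation step (assuming dual was created in the default database, so its directory is /user/hive/warehouse/dual):
echo " " > dual.txt
hadoop fs -put dual.txt /user/hive/warehouse/dual/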
select 1 + 9 from dual;
(4)Numeric functions
- round: round to integer
- round: round to a given precision
- floor: round down
- ceil: round up
- ceiling: round up
- rand: random number
- exp: natural exponential
- log10: base-10 logarithm
- log2: base-2 logarithm
- log: logarithm
- pow: power
- power: power
- sqrt: square root
- bin: binary representation
- hex: hexadecimal representation
- unhex: reverse of hex
- conv: base conversion
- abs: absolute value
- pmod: positive modulo
- sin: sine
- asin: arcsine
- cos: cosine
- acos: arccosine
- positive: positive
- negative: negative
(5)Date functions
- from_unixtime: UNIX timestamp to date
- unix_timestamp: current UNIX timestamp
- unix_timestamp: date to UNIX timestamp
- unix_timestamp: date in a given format to UNIX timestamp
- to_date: date-time to date
- year: year of a date
- month: month of a date
- day: day of a date
- hour: hour of a date
- minute: minute of a date
- second: second of a date
- weekofyear: week of the year
- datediff: difference between two dates
- date_add: add days to a date
- date_sub: subtract days from a date
(6)String functions
- length: string length
- reverse: reverse a string
- concat: concatenate strings
- concat_ws: concatenate strings with a separator
- substr, substring: substring
- substr, substring: substring with length
- upper, ucase: convert to upper case
- lower, lcase: convert to lower case
- trim: trim spaces
- ltrim: trim spaces on the left
- rtrim: trim spaces on the right
- regexp_replace: regular-expression replace
- regexp_extract: regular-expression extract
- parse_url: parse a URL
- get_json_object: parse JSON
- space: string of spaces
- repeat: repeat a string
- ascii: ASCII code of the first character
- lpad: pad on the left
- rpad: pad on the right
- split: split a string
- find_in_set: find in a comma-separated set
get_json_object
Syntax: get_json_object(string json_string, string path)
Return type: string
Description: parses the JSON string json_string and returns the content pointed to by path. If the input JSON string is invalid, NULL is returned.
select get_json_object('
{"store":
{"fruit":[
{"weight":8,"type":"apple"},
{"weight":9,"type":"pear"}
],
"bicycle":{"price":19.95,"color":"red"}
},
"email":"amy@only_for_json_udf_test.net",
"owner":"amy"
}'
,'$.store.fruit.type[0]');
2、Commonly used built-in functions
(1)Replacing NULL values
NVL: assigns a value to NULL data. The format is NVL(value, default_value): if value is NULL it returns default_value, otherwise it returns value; if both arguments are NULL it returns NULL.
--If an employee's comm is NULL, use -1 instead
select comm,nvl(comm, -1) from emp;
--If an employee's comm is NULL, use the manager id instead
select comm, nvl(comm,mgr) from emp;
(2)Conditional functions
a.If function: if
Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
Return type: T
Description: returns valueTrue when testCondition is true, otherwise valueFalseOrNull.
select if(1=2,100,200) from dual;
200
select if(1=1,100,200) from dual;
100
b.First non-NULL value: coalesce
Syntax: coalesce(T v1, T v2, …)
Return type: T
Description: returns the first non-NULL argument; NULL if all arguments are NULL.
select coalesce(null,'100','50') from dual;
100
c.Conditional expression: case
Syntax: case a when b then c [when d then e] …… [else f] end
Return type: T
Description: returns c if a equals b; returns e if a equals d;
otherwise returns f
select
case 100
when 50 then 'tom'
when 100 then 'mary'
else 'tim'
end ;
mary
(3)Rows to columns (many rows into one column)
a.concat(string a/col, string b/col…):
Returns the concatenation of the input strings; any number of input strings is supported.
--Concatenate the strings "---" and ">"
select concat("---",">");
select concat(ename,job) from emp;
b.concat_ws(separator, str1, str2,...):
A special form of concat(). The first argument is the separator placed between the remaining arguments. The separator can be any string, just like the remaining arguments. If the separator is NULL, the result is NULL. The function skips any NULL or empty strings after the separator argument; the separator is added between the strings being concatenated.
--Concatenate ~ and ^ with _
select concat_ws("_","~","^");
c.collect_set(col):
Accepts only primitive types; it de-duplicates and aggregates the values of a column into an array-typed field.
select concat_ws('-',collect_set(job)) from emp;
--Requirement: for each department, group the employees who earn the same salary into one row; the target output:
+------------+--------------+
| t.base | name |
+------------+--------------+
| 10-1300.0 | MILLER |
| 10-2450.0 | CLARK |
| 10-5000.0 | KING |
| 20-1100.0 | ADAMS |
| 20-2975.0 | JONES |
| 20-3000.0 | SCOTT|FORD |
| 20-800.0 | SMITH |
| 30-1250.0 | WARD|MARTIN |
| 30-1500.0 | TURNER |
| 30-1600.0 | ALLEN |
| 30-2850.0 | BLAKE |
| 30-950.0 | JAMES |
| 50-1200.0 | TOM |
+------------+--------------+
--First concatenate the department number and the salary
select concat(deptno,'-',sal) base ,ename from emp;
+------------+---------+
| base | ename |
+------------+---------+
| 20-800.0 | SMITH |
| 30-1600.0 | ALLEN |
| 30-1250.0 | WARD |
| 20-2975.0 | JONES |
| 30-1250.0 | MARTIN |
| 30-2850.0 | BLAKE |
| 10-2450.0 | CLARK |
| 20-3000.0 | SCOTT |
| 10-5000.0 | KING |
| 30-1500.0 | TURNER |
| 20-1100.0 | ADAMS |
| 30-950.0 | JAMES |
| 20-3000.0 | FORD |
| 10-1300.0 | MILLER |
| 50-1200.0 | TOM |
+------------+---------+
-- Process the previous result: group by base and aggregate
select t.base,collect_set(t.ename)
from (select concat(deptno,'-',sal) base ,ename from emp)t
group by t.base;
+------------+--------------------+
| t.base | _c1 |
+------------+--------------------+
| 10-2450.0 | ["CLARK"] |
| 10-5000.0 | ["KING"] |
| 20-3000.0 | ["SCOTT","FORD"] |
| 30-1250.0 | ["WARD","MARTIN"] |
| 30-1500.0 | ["TURNER"] |
| 30-2850.0 | ["BLAKE"] |
| 50-1200.0 | ["TOM"] |
| 10-1300.0 | ["MILLER"] |
| 20-1100.0 | ["ADAMS"] |
| 20-2975.0 | ["JONES"] |
| 20-800.0 | ["SMITH"] |
| 30-1600.0 | ["ALLEN"] |
| 30-950.0 | ["JAMES"] |
+------------+--------------------+
-- Process the result once more
select t.base,concat_ws('|',collect_set(t.ename))
from (select concat(deptno,'-',sal) base ,ename from emp)t
group by t.base;
Approach:
- Concatenate the department number and the salary
select concat(deptno,"-",sal) base,ename from emp;
- Group by department number + salary
select t.base, collect_set(t.ename) name
from (select concat(deptno,"-",sal) base,ename from emp) t
group by t.base;
- Concatenate the employees of each group into one string
select t.base, concat_ws('|',collect_set(t.ename)) name
from (select concat(deptno,"-",sal) base,ename from emp) t
group by t.base;
(4)Columns to rows
a.explode
- The explode function expands a field of array or map type:
explode(array) produces one result row per element of the array; explode(map) produces one row per key/value pair, with the key in one column and the value in another.
- In most cases explode is used on its own; it can also be combined with lateral view when needed.
select explode(array("hadoop","spark","flink"));
b.lateral view
- Usage: lateral view udtf(expression) tablealias as columnalias
- Explanation: used together with UDTFs such as split and explode; it turns one column into multiple rows, and the expanded rows can then be aggregated.
Prepare the data:
create table p1(name string,children array<string>,address Map<string,string>)
row format delimited fields terminated by '|'
collection items terminated by ','
map keys terminated by ':';
vim maparray.txt
zhangsan|child1,child2,child3,child4|k1:v1,k2:v2
lisi|child5,child6,child7,child8|k3:v3,k4:v4
load data local inpath '/home/offcn/tmp/maparray.txt' into table p1;
select * from p1;
+-----------+----------------------------------------+------------------------+
| p1.name | p1.children | p1.address |
+-----------+----------------------------------------+------------------------+
| zhangsan | ["child1","child2","child3","child4"] | {"k1":"v1","k2":"v2"} |
| lisi | ["child5","child6","child7","child8"] | {"k3":"v3","k4":"v4"} |
+-----------+----------------------------------------+------------------------+
select explode(children) as myChild from p1;
select explode(address) as (myMapKey, myMapValue) from p1;
--To combine the original table with the exploded columns
select p1.*,childrenView.*,addressView.* from p1
lateral view explode(children) childrenView as myChild
lateral view explode(address) addressView as myMapKey, myMapValue;
+-----------+----------------------------------------+------------------------+-----------------------+-----------------------+-------------------------+
| p1.name | p1.children | p1.address | childrenview.mychild | addressview.mymapkey | addressview.mymapvalue |
+-----------+----------------------------------------+------------------------+-----------------------+-----------------------+-------------------------+
| zhangsan | ["child1","child2","child3","child4"] | {"k1":"v1","k2":"v2"} | child1 | k1 | v1 |
| zhangsan | ["child1","child2","child3","child4"] | {"k1":"v1","k2":"v2"} | child1 | k2 | v2 |
| zhangsan | ["child1","child2","child3","child4"] | {"k1":"v1","k2":"v2"} | child2 | k1 | v1 |
| zhangsan | ["child1","child2","child3","child4"] | {"k1":"v1","k2":"v2"} | child2 | k2 | v2 |
| zhangsan | ["child1","child2","child3","child4"] | {"k1":"v1","k2":"v2"} | child3 | k1 | v1 |
| zhangsan | ["child1","child2","child3","child4"] | {"k1":"v1","k2":"v2"} | child3 | k2 | v2 |
| zhangsan | ["child1","child2","child3","child4"] | {"k1":"v1","k2":"v2"} | child4 | k1 | v1 |
| zhangsan | ["child1","child2","child3","child4"] | {"k1":"v1","k2":"v2"} | child4 | k2 | v2 |
| lisi | ["child5","child6","child7","child8"] | {"k3":"v3","k4":"v4"} | child5 | k3 | v3 |
| lisi | ["child5","child6","child7","child8"] | {"k3":"v3","k4":"v4"} | child5 | k4 | v4 |
| lisi | ["child5","child6","child7","child8"] | {"k3":"v3","k4":"v4"} | child6 | k3 | v3 |
| lisi | ["child5","child6","child7","child8"] | {"k3":"v3","k4":"v4"} | child6 | k4 | v4 |
| lisi | ["child5","child6","child7","child8"] | {"k3":"v3","k4":"v4"} | child7 | k3 | v3 |
| lisi | ["child5","child6","child7","child8"] | {"k3":"v3","k4":"v4"} | child7 | k4 | v4 |
| lisi | ["child5","child6","child7","child8"] | {"k3":"v3","k4":"v4"} | child8 | k3 | v3 |
| lisi | ["child5","child6","child7","child8"] | {"k3":"v3","k4":"v4"} | child8 | k4 | v4 |
+-----------+----------------------------------------+------------------------+-----------------------+-----------------------+-------------------------+
(5)Window functions (analytic functions)
Window functions in Hive are similar to window functions in standard SQL; they are used for analytical work, typically OLAP (online analytical processing). Most window functions operate within a partition (window). Hive and Oracle provide window functions; MySQL did not before version 8.0, but MySQL 8.0 supports both window functions (over) and common table expressions (with).
Syntax:
window_function over (partition by col, col… order by col rows between start and end)
Notes:
- The over() clause:
over() can contain three parts: the partition clause (partition by col), the ordering clause (order by col), and the window frame (rows between start and end). It determines the range the aggregate is computed over; by default the aggregate covers the whole window, and the aggregate function is evaluated once per row.
- The partition by clause:
(partition by .. order by ..) can be replaced by (distribute by .. sort by ..).
- The order by clause:
order by sorts the rows within the window by the specified column(s).
- rows between:
rows between start and end limits the window frame:
- [n] preceding: the n rows before the current row
- [n] following: the n rows after the current row
- current row: the current row
- unbounded: the boundary (usually combined with preceding or following)
- unbounded preceding: the first row of the window (the starting point)
- unbounded following: the last row of the window (the end point)
Prepare the data:
create table t2(ip string, createtime string, url string,pvs int)
row format delimited fields terminated by ',';
vim t2.txt
192.168.233.11,2021-04-10,url2,5
192.168.233.11,2021-04-12,url1,4
192.168.233.11,2021-04-10,1url3,5
192.168.233.11,2021-04-11,url6,3
192.168.233.11,2021-04-13,url4,6
192.168.233.12,2021-04-10,url7,5
192.168.233.12,2021-04-11,url4,1
192.168.233.12,2021-04-10,url5,5
192.168.233.13,2021-04-11,url22,3
192.168.233.13,2021-04-10,url11,11
192.168.233.13,2021-04-12,1url33,7
192.168.233.14,2021-04-13,url66,9
192.168.233.14,2021-04-12,url77,1
192.168.233.14,2021-04-13,url44,12
192.168.233.14,2021-04-11,url55,33
load data local inpath "/home/offcn/tmp/t2.txt" into table t2;
select * from t2;
a、Windowed aggregate functions
They solve the problem of computing per-group statistics while still showing the non-grouped (detail) columns.
- count as a window function
--Count the number of rows per ip and per date, together with the detail rows
select ip,createtime,count(pvs)
from t2
group by ip,createtime;
+-----------------+-------------+------+
| ip | createtime | _c2 |
+-----------------+-------------+------+
| 192.168.233.11 | 2021-04-10 | 2 |
| 192.168.233.11 | 2021-04-11 | 1 |
| 192.168.233.11 | 2021-04-12 | 1 |
| 192.168.233.11 | 2021-04-13 | 1 |
| 192.168.233.12 | 2021-04-10 | 2 |
| 192.168.233.12 | 2021-04-11 | 1 |
| 192.168.233.13 | 2021-04-10 | 1 |
| 192.168.233.13 | 2021-04-11 | 1 |
| 192.168.233.13 | 2021-04-12 | 1 |
| 192.168.233.14 | 2021-04-11 | 1 |
| 192.168.233.14 | 2021-04-12 | 1 |
| 192.168.233.14 | 2021-04-13 | 2 |
+-----------------+-------------+------+
Problem: group by cannot show the detail rows; this is where a window function is needed.
select ip,createtime,url,pvs,count(pvs) over(partition by ip,createtime)
from t2;
+-----------------+-------------+---------+------+-----------------+
| ip | createtime | url | pvs | count_window_0 |
+-----------------+-------------+---------+------+-----------------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 2 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 2 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 1 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 1 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 1 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 2 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 2 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 1 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 1 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 1 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 1 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 1 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 1 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 2 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 2 |
+-----------------+-------------+---------+------+-----------------+
- sum as a window function
--Total pvs per ip and per date, with detail rows
select ip,createtime,url,pvs,sum(pvs) over(partition by ip,createtime)
from t2;
- avg as a window function
--Average pvs per ip and per date, with detail rows
select ip,createtime, url,pvs,avg(pvs) over(partition by ip,createtime)
from t2;
- min as a window function
--Minimum pvs per ip and per date, with detail rows
select ip,createtime, url,pvs,min(pvs) over(partition by ip,createtime)
from t2;
- max as a window function
--Maximum pvs per ip and per date, with detail rows
select ip,createtime, url,pvs,max(pvs) over(partition by ip,createtime)
from t2;
- Window frame clauses
select *,
--all rows of the partition
--partitioned by ip
sum(pvs) over(partition by ip) AS c_1 ,
--no partition; ordered by createtime, rows with the same createtime share the same running total
sum(pvs) over(order by createtime) AS c_2 ,
--default frame: from the start of the partition to the current row
sum(pvs) over(partition by ip order by createtime asc) AS c_3,
--from the start to the current row; the result can differ from c_3 because ties may be accumulated in a different order
sum(pvs) over(partition by ip order by createtime asc rows between unbounded preceding and current row) AS c_4,
--current row + the 3 preceding rows
sum(pvs) over(partition by ip order by createtime asc rows between 3 preceding and current row) AS c_5,
--current row + 3 preceding rows + 1 following row
sum(pvs) over(partition by ip order by createtime asc rows between 3 preceding and 1 following) AS c_6,
--from the current row to the end of the partition
sum(pvs) over(partition by ip order by createtime asc rows between current row and unbounded following) AS c_7
from t2;
--Partition by ip: every row gets the sum over all rows of its partition
select *,
sum(pvs) over(partition by ip) AS c_1
from t2;
+-----------------+----------------+---------+---------+------+
| t2.ip | t2.createtime | t2.url | t2.pvs | c_1 |
+-----------------+----------------+---------+---------+------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 23 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 23 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 23 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 23 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 23 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 11 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 11 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 11 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 21 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 21 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 21 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 55 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 55 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 55 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 55 |
+-----------------+----------------+---------+---------+------+
-- Ordered by createtime over all rows: the sum accumulates date by date, and rows with the same date share the same running total
select *,
sum(pvs) over(order by createtime) AS c_2
from t2;
+-----------------+----------------+---------+---------+------+
| t2.ip | t2.createtime | t2.url | t2.pvs | c_2 |
+-----------------+----------------+---------+---------+------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 31 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 31 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 31 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 31 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 31 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 71 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 71 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 71 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 71 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 83 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 83 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 83 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 110 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 110 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 110 |
+-----------------+----------------+---------+---------+------+
-- Partition by ip, order by createtime within the partition; the running total accumulates per date group inside each ip
select *,
sum(pvs) over(partition by ip order by createtime asc) AS c_3
from t2;
+-----------------+----------------+---------+---------+------+
| t2.ip | t2.createtime | t2.url | t2.pvs | c_3 |
+-----------------+----------------+---------+---------+------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 10 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 10 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 13 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 17 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 23 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 10 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 10 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 11 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 11 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 14 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 21 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 33 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 34 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 55 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 55 |
+-----------------+----------------+---------+---------+------+
-- Partition by ip, order by createtime and url; the running total now accumulates row by row inside each ip
select *,
sum(pvs) over(partition by ip order by createtime asc,url) AS c_3
from t2;
+-----------------+----------------+---------+---------+------+
| t2.ip | t2.createtime | t2.url | t2.pvs | c_3 |
+-----------------+----------------+---------+---------+------+
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 5 |
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 10 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 13 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 17 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 23 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 5 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 10 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 11 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 11 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 14 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 21 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 33 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 34 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 46 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 55 |
+-----------------+----------------+---------+---------+------+
-- Partition by ip, order by createtime ascending; sum from the first row of the window up to the current row (the default frame)
select *,
sum(pvs) over(partition by ip order by createtime asc rows between unbounded preceding and current row) AS c_4
from t2;
+-----------------+----------------+---------+---------+------+
| t2.ip | t2.createtime | t2.url | t2.pvs | c_4 |
+-----------------+----------------+---------+---------+------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 5 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 10 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 13 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 17 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 23 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 5 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 10 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 11 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 11 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 14 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 21 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 33 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 34 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 43 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 55 |
+-----------------+----------------+---------+---------+------+
-- Partition by ip, order by createtime ascending; sum over the current row plus the 3 preceding rows (at most 4 rows)
select *,
sum(pvs) over(partition by ip order by createtime asc rows between 3 preceding and current row) AS c_5
from t2;
+-----------------+----------------+---------+---------+------+
| t2.ip | t2.createtime | t2.url | t2.pvs | c_5 |
+-----------------+----------------+---------+---------+------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 5 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 10 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 13 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 17 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 18 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 5 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 10 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 11 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 11 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 14 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 21 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 33 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 34 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 43 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 55 |
+-----------------+----------------+---------+---------+------+
-- Partition by ip, order by createtime ascending; sum over the 3 preceding rows, the current row and the following row
select *,
sum(pvs) over(partition by ip order by createtime asc rows between 3 preceding and 1 following) AS c_6
from t2;
+-----------------+----------------+---------+---------+-------+
| t2.ip | t2.createtime | t2.url | t2.pvs | c_6 |
+-----------------+----------------+---------+---------+-------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 10 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 13 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 17 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 23 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 18 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 10 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 11 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 11 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 14 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 21 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 21 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 34 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 43 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 55 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 55 |
+-----------------+----------------+---------+---------+-------+
-- Partition by ip, order by createtime ascending; sum from the current row to the last row of the window
select *,
sum(pvs) over(partition by ip order by createtime asc rows between current row and unbounded following) AS c_7
from t2;
+-----------------+----------------+---------+---------+------+
| t2.ip | t2.createtime | t2.url | t2.pvs | c_7 |
+-----------------+----------------+---------+---------+------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 23 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 18 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 13 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 10 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 6 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 11 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 6 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 1 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 21 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 10 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 7 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 55 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 22 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 21 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 12 |
+-----------------+----------------+---------+---------+------+
b、Windowed analytic functions
Useful when a value has to be carried across rows (first/last/previous/next row comparisons).
- first_value window function
For each ip, compute the difference between each row's pvs and the pvs of that ip's first visit, together with the detail rows.
Partition by ip, order by createtime ascending, and take the first value of the partition.
select *,first_value(pvs) over(partition by ip order by createtime) haha
from t2;
+-----------------+----------------+---------+---------+-------+
| t2.ip | t2.createtime | t2.url | t2.pvs | haha |
+-----------------+----------------+---------+---------+-------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 5 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 5 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 5 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 5 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 5 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 5 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 5 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 5 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 11 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 11 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 11 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 33 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 33 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 33 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 33 |
+-----------------+----------------+---------+---------+-------+
with aaa as(
select *,first_value(pvs) over(partition by ip order by createtime) haha
from t2)
select *,(aaa.pvs-aaa.haha) from aaa;
+-----------------+-----------------+----------+----------+-----------+------+
| aaa.ip | aaa.createtime | aaa.url | aaa.pvs | aaa.haha | _c1 |
+-----------------+-----------------+----------+----------+-----------+------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 5 | 0 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 5 | 0 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 5 | -2 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 5 | -1 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 5 | 1 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 5 | 0 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 5 | 0 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 5 | -4 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 11 | 0 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 11 | -8 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 11 | -4 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 33 | 0 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 33 | -32 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 33 | -24 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 33 | -21 |
+-----------------+-----------------+----------+----------+-----------+------+
The with ... as clause:
with ... as defines a named subquery (a SQL fragment) that the rest of the statement can reference.
- last_value window function
The window is determined by the partition (ip) together with the ordering column; in other words, ip and the ordering column jointly determine the frame.
For each ip and date, compute the difference between each row's pvs and the pvs of the last visit of that ip and date, together with the detail rows.
select *,last_value(pvs) over(partition by ip order by createtime) haha
from t2;
+-----------------+----------------+---------+---------+-------+
| t2.ip | t2.createtime | t2.url | t2.pvs | haha |
+-----------------+----------------+---------+---------+-------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 5 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 5 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 3 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 4 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 6 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 5 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 5 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 1 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 11 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 3 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 7 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 33 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 1 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 12 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 12 |
+-----------------+----------------+---------+---------+-------+
with aaa as(
select *,last_value(pvs) over(partition by ip order by createtime) haha
from t2)
select *,(aaa.pvs-aaa.haha) from aaa;
+-----------------+-----------------+----------+----------+-----------+------+
| aaa.ip | aaa.createtime | aaa.url | aaa.pvs | aaa.haha | _c1 |
+-----------------+-----------------+----------+----------+-----------+------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 5 | 0 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 5 | 0 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 3 | 0 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 4 | 0 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 6 | 0 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 5 | 0 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 5 | 0 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 1 | 0 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 11 | 0 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 3 | 0 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 7 | 0 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 33 | 0 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 1 | 0 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 12 | -3 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 12 | 0 |
+-----------------+-----------------+----------+----------+-----------+------+
- lag window function
LAG(col, n, DEFAULT): the value of the n-th row above the current row within the window.
Argument 1 is the column name, argument 2 is how many rows to go back (optional, default 1), argument 3 is the default value used when the n-th previous row is NULL (if not given, NULL is returned).
Shift the window up: take createtime from 2 rows above; if there is none, use 'balabala' instead.
select *,lag(createtime,2,'balabala') over(partition by ip order by createtime) tmp
from t2;
+-----------------+----------------+---------+---------+-------------+
| t2.ip | t2.createtime | t2.url | t2.pvs | tmp |
+-----------------+----------------+---------+---------+-------------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | balabala |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | balabala |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 2021-04-10 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 2021-04-10 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 2021-04-11 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | balabala |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | balabala |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 2021-04-10 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | balabala |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | balabala |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 2021-04-10 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | balabala |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | balabala |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 2021-04-11 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 2021-04-12 |
+-----------------+----------------+---------+---------+-------------+
- lead window function
The opposite of LAG.
LEAD(col, n, DEFAULT): the value of the n-th row below the current row within the window.
Argument 1 is the column name, argument 2 is how many rows to look ahead (optional, default 1), argument 3 is the default value used when the n-th following row is NULL (if not given, NULL is returned).
Shift the window down: by default one row; if there is no next row, NULL is returned.
select *,lead(createtime) over(partition by ip order by createtime) tmp
from t2;
+-----------------+----------------+---------+---------+-------------+
| t2.ip | t2.createtime | t2.url | t2.pvs | tmp |
+-----------------+----------------+---------+---------+-------------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 2021-04-10 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 2021-04-11 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 2021-04-12 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 2021-04-13 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | NULL |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 2021-04-10 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 2021-04-11 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | NULL |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 2021-04-11 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 2021-04-12 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | NULL |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 2021-04-12 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 2021-04-13 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 2021-04-13 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | NULL |
+-----------------+----------------+---------+---------+-------------+
select *,lead(createtime,1,'haha') over(partition by ip order by createtime) tmp
from t2;
+-----------------+----------------+---------+---------+-------------+
| t2.ip | t2.createtime | t2.url | t2.pvs | tmp |
+-----------------+----------------+---------+---------+-------------+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 2021-04-10 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 2021-04-11 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 2021-04-12 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 2021-04-13 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | haha |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 2021-04-10 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 2021-04-11 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | haha |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 2021-04-11 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 2021-04-12 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | haha |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 2021-04-12 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 2021-04-13 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 2021-04-13 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | haha |
+-----------------+----------------+---------+---------+-------------+
c、Window ranking functions
They solve top-N-per-group problems.
- rank window function
Produces the rank of each row within its partition. Rows with equal values share the same rank and leave gaps: with two rows in second place, the next rank is 4.
- dense_rank window function
Produces the rank of each row within its partition. Rows with equal values share the same rank but leave no gaps: with two rows in second place, the next rank is 3.
- row_number window function
Numbers the rows of each partition sequentially starting from 1; even rows that tie for first place get distinct numbers.
- ntile window function
Splits the rows of a partition, in order, into n roughly equal slices and returns the slice number of the current row.
If the rows cannot be split evenly, the earlier slices get the extra rows.
Usage:
Partition by ip, order by pvs ascending within the partition:
select *,
row_number() over(partition by ip order by pvs) row_number,
rank() over(partition by ip order by pvs) rank,
dense_rank() over(partition by ip order by pvs) dense_rank,
ntile(3) over(partition by ip order by pvs) ntile
from t2;
+-----------------+----------------+---------+---------+-------------+-------+-------------+--------+
| t2.ip | t2.createtime | t2.url | t2.pvs | row_number | rank | dense_rank | ntile |
+-----------------+----------------+---------+---------+-------------+-------+-------------+--------+
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 1 | 1 | 1 | 1 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 2 | 2 | 2 | 1 |
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 3 | 3 | 3 | 2 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 4 | 3 | 3 | 2 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 5 | 5 | 4 | 3 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 1 | 1 | 1 | 1 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 2 | 2 | 2 | 2 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 3 | 2 | 2 | 3 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 1 | 1 | 1 | 1 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 2 | 2 | 2 | 2 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 3 | 3 | 3 | 3 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 1 | 1 | 1 | 1 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 2 | 2 | 2 | 1 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 3 | 3 | 3 | 2 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 4 | 4 | 4 | 3 |
+-----------------+----------------+---------+---------+-------------+-------+-------------+--------+
--For each ip, the three rows with the highest pvs (Top 3)
with tmp as(
select *,
row_number() over(partition by ip order by pvs desc) row_number,
rank() over(partition by ip order by pvs desc) rank,
dense_rank() over(partition by ip order by pvs desc) dense_rank,
ntile(3) over(partition by ip order by pvs desc) ntile
from t2)
select * from tmp where tmp.row_number <=3;
+-----------------+-----------------+----------+----------+-----------------+-----------+-----------------+------------+
| tmp.ip | tmp.createtime | tmp.url | tmp.pvs | tmp.row_number | tmp.rank | tmp.dense_rank | tmp.ntile |
+-----------------+-----------------+----------+----------+-----------------+-----------+-----------------+------------+
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 1 | 1 | 1 | 1 |
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 2 | 2 | 2 | 1 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 3 | 2 | 2 | 2 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 1 | 1 | 1 | 1 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 2 | 1 | 1 | 2 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 3 | 3 | 2 | 3 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 1 | 1 | 1 | 1 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 2 | 2 | 2 | 2 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 3 | 3 | 3 | 3 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 1 | 1 | 1 | 1 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 2 | 2 | 2 | 1 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 3 | 3 | 3 | 2 |
+-----------------+-----------------+----------+----------+-----------------+-----------+-----------------+------------+
d、Window sequence functions (for reference)
These two sequence functions are less commonly used, but are introduced here as well.
Note: sequence functions do not support the WINDOW clause, i.e. the rows between frame specification.
- cume_dist window function
cume_dist: (number of rows less than or equal to the current value) / (total number of rows in the partition)
--Fraction of rows whose pvs is less than or equal to the current row's pvs
select *,
cume_dist() over(order by pvs) c1,
cume_dist() over(partition by ip order by pvs) c2
from t2;
+-----------------+----------------+---------+---------+----------------------+---------------------+
| t2.ip | t2.createtime | t2.url | t2.pvs | c1 | c2 |
+-----------------+----------------+---------+---------+----------------------+---------------------+
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 0.26666666666666666 | 0.2 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 0.3333333333333333 | 0.4 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 0.6 | 0.8 |
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 0.6 | 0.8 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 0.6666666666666666 | 1.0 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 0.13333333333333333 | 0.3333333333333333 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 0.6 | 1.0 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 0.6 | 1.0 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 0.26666666666666666 | 0.3333333333333333 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 0.7333333333333333 | 0.6666666666666666 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 0.8666666666666667 | 1.0 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 0.13333333333333333 | 0.25 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 0.8 | 0.5 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 0.9333333333333333 | 0.75 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 1.0 | 1.0 |
+-----------------+----------------+---------+---------+----------------------+---------------------+
Explanation of the result:
Without partition by there is no window restriction: the window is the whole data set.
c1: no partition, so all 15 rows form one group.
Row 1: 4 rows have pvs <= 3, so 4/15 = 0.26666666666666666
Row 2: 5 rows have pvs <= 4, so 5/15 = 0.3333333333333333
c2: partitioned by ip; ip 192.168.233.11 has 5 rows.
Row 1: 1 row has pvs <= 3, so 1/5 = 0.2
Row 2: 2 rows have pvs <= 4, so 2/5 = 0.4
- percent_rank window function
percent_rank: (rank of the current row within the partition - 1) / (total rows in the partition - 1); useful in more involved calculations.
select *,
percent_rank() over(order by createtime) c1,
rank() over(order by createtime) c2,
count(*) over() c3
from t2;
Explanation of the result:
c2 is the rank ordered by createtime (the output rows just happen to be displayed latest date first), c3 is the total row count.
over() has no partition, so all rows form one group: C1 = (C2 - 1) / (C3 - 1)
Row 1: (13 - 1) / (15 - 1) = 0.8571428571428571
+-----------------+----------------+---------+---------+----------------------+-----+-----+
| t2.ip | t2.createtime | t2.url | t2.pvs | c1 | c2 | c3 |
+-----------------+----------------+---------+---------+----------------------+-----+-----+
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 0.8571428571428571 | 13 | 15 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 0.8571428571428571 | 13 | 15 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 0.8571428571428571 | 13 | 15 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 0.6428571428571429 | 10 | 15 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 0.6428571428571429 | 10 | 15 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 0.6428571428571429 | 10 | 15 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 0.35714285714285715 | 6 | 15 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 0.35714285714285715 | 6 | 15 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 0.35714285714285715 | 6 | 15 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 0.35714285714285715 | 6 | 15 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 0.0 | 1 | 15 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 0.0 | 1 | 15 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 0.0 | 1 | 15 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 0.0 | 1 | 15 |
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 0.0 | 1 | 15 |
+-----------------+----------------+---------+---------+----------------------+-----+-----+
select *,
percent_rank() over(partition by ip order by createtime) c1,
rank() over(partition by ip order by createtime) c2,
count(*) over(partition by ip) c3
from t2;
Explanation of the result:
over() partitions by ip: C1 = (C2 - 1) / (C3 - 1)
Row 3: (3 - 1) / (5 - 1) = 0.5
+-----------------+----------------+---------+---------+---------------------+-----+-----+
| t2.ip | t2.createtime | t2.url | t2.pvs | c1 | c2 | c3 |
+-----------------+----------------+---------+---------+---------------------+-----+-----+
| 192.168.233.11 | 2021-04-10 | url2 | 5 | 0.0 | 1 | 5 |
| 192.168.233.11 | 2021-04-10 | 1url3 | 5 | 0.0 | 1 | 5 |
| 192.168.233.11 | 2021-04-11 | url6 | 3 | 0.5 | 3 | 5 |
| 192.168.233.11 | 2021-04-12 | url1 | 4 | 0.75 | 4 | 5 |
| 192.168.233.11 | 2021-04-13 | url4 | 6 | 1.0 | 5 | 5 |
| 192.168.233.12 | 2021-04-11 | url4 | 1 | 1.0 | 3 | 3 |
| 192.168.233.12 | 2021-04-10 | url7 | 5 | 0.0 | 1 | 3 |
| 192.168.233.12 | 2021-04-10 | url5 | 5 | 0.0 | 1 | 3 |
| 192.168.233.13 | 2021-04-12 | 1url33 | 7 | 1.0 | 3 | 3 |
| 192.168.233.13 | 2021-04-10 | url11 | 11 | 0.0 | 1 | 3 |
| 192.168.233.13 | 2021-04-11 | url22 | 3 | 0.5 | 2 | 3 |
| 192.168.233.14 | 2021-04-13 | url66 | 9 | 0.6666666666666666 | 3 | 4 |
| 192.168.233.14 | 2021-04-13 | url44 | 12 | 0.6666666666666666 | 3 | 4 |
| 192.168.233.14 | 2021-04-11 | url55 | 33 | 0.0 | 1 | 4 |
| 192.168.233.14 | 2021-04-12 | url77 | 1 | 0.3333333333333333 | 2 | 4 |
+-----------------+----------------+---------+---------+---------------------+-----+-----+
e、Grouping-set functions
In a single GROUP BY query, aggregate along several different dimension combinations; this is equivalent to a UNION ALL of the GROUP BY result sets of the individual dimensions.
GROUPING__ID indicates which grouping set a result row belongs to.
- Grouping
--Requirement: with one SQL statement, compute the number of rows per ip and the number of rows per date
select ip,null as createtime,count(1) as num,1 as GROUPING__ID
from t2
group by ip
union all
select null as ip,createtime,count(1) as num,2 as GROUPING__ID
from t2
group by createtime;
+-----------------+-------------+------+---------------+
| ip | createtime | num | grouping__id |
+-----------------+-------------+------+---------------+
| 192.168.233.11 | NULL | 5 | 1 |
| 192.168.233.12 | NULL | 3 | 1 |
| 192.168.233.13 | NULL | 3 | 1 |
| 192.168.233.14 | NULL | 4 | 1 |
+-----------------+-------------+------+---------------+
+-------+-------------+------+---------------+
| ip | createtime | num | grouping__id |
+-------+-------------+------+---------------+
| NULL | 2021-04-10 | 5 | 2 |
| NULL | 2021-04-11 | 4 | 2 |
| NULL | 2021-04-12 | 3 | 2 |
| NULL | 2021-04-13 | 3 | 2 |
+-------+-------------+------+---------------+
+-----------------+-----------------+----------+-------------------+
| _u1.ip | _u1.createtime | _u1.num | _u1.grouping__id |
+-----------------+-----------------+----------+-------------------+
| 192.168.233.11 | NULL | 5 | 1 |
| 192.168.233.12 | NULL | 3 | 1 |
| 192.168.233.13 | NULL | 3 | 1 |
| 192.168.233.14 | NULL | 4 | 1 |
| NULL | 2021-04-10 | 5 | 2 |
| NULL | 2021-04-11 | 4 | 2 |
| NULL | 2021-04-12 | 3 | 2 |
| NULL | 2021-04-13 | 3 | 2 |
+-----------------+-----------------+----------+-------------------+
--Using grouping sets
select
ip,createtime, count(1) as num,
GROUPING__ID
from
t2
group by
ip,createtime
GROUPING SETS(ip,createtime)
ORDER BY
GROUPING__ID;
--Requirement: with one SQL statement, compute the number of rows per ip, per date, and per ip and date
select ip,null as createtime,count(1) as num,1 as GROUPING__ID
from t2
group by ip
union all
select null as ip,createtime,count(1) as num,2 as GROUPING__ID
from t2
group by createtime
union all
select ip,createtime,count(1) as num,3 as GROUPING__ID
from t2
group by ip,createtime;
+-----------------+-----------------+----------+-------------------+
| _u1.ip | _u1.createtime | _u1.num | _u1.grouping__id |
+-----------------+-----------------+----------+-------------------+
| 192.168.233.11 | 2021-04-10 | 2 | 3 |
| 192.168.233.11 | 2021-04-11 | 1 | 3 |
| 192.168.233.11 | 2021-04-12 | 1 | 3 |
| 192.168.233.11 | 2021-04-13 | 1 | 3 |
| 192.168.233.12 | 2021-04-10 | 2 | 3 |
| 192.168.233.12 | 2021-04-11 | 1 | 3 |
| 192.168.233.13 | 2021-04-10 | 1 | 3 |
| 192.168.233.13 | 2021-04-11 | 1 | 3 |
| 192.168.233.13 | 2021-04-12 | 1 | 3 |
| 192.168.233.14 | 2021-04-11 | 1 | 3 |
| 192.168.233.14 | 2021-04-12 | 1 | 3 |
| 192.168.233.14 | 2021-04-13 | 2 | 3 |
| 192.168.233.11 | NULL | 5 | 1 |
| 192.168.233.12 | NULL | 3 | 1 |
| 192.168.233.13 | NULL | 3 | 1 |
| 192.168.233.14 | NULL | 4 | 1 |
| NULL | 2021-04-10 | 5 | 2 |
| NULL | 2021-04-11 | 4 | 2 |
| NULL | 2021-04-12 | 3 | 2 |
| NULL | 2021-04-13 | 3 | 2 |
+-----------------+-----------------+----------+-------------------+
--Using grouping sets
select
ip,createtime, count(1) as num,
GROUPING__ID
from
t2
group by
ip,createtime
GROUPING SETS(ip,createtime,(ip,createtime))
ORDER BY
GROUPING__ID;
3、User-defined functions
When the built-in functions cannot cover your business logic, you can write user-defined functions (UDF: user-defined function).
User-defined functions fall into three categories:
- UDF (User-Defined Function): ordinary function, one row in, one row out
https://cwiki.apache.org/confluence/display/Hive/HivePlugins
- UDAF (User-Defined Aggregation Function): aggregate function, many rows in, one row out
https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy
- UDTF (User-Defined Table-Generating Function): table-generating function, one row in, many rows out
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
(1)Create a project and add the dependencies
<properties>
<!--Avoids "File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!"-->
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<!--Avoids "Source option 5 is no longer supported. Use 6 or later." / "Target option 1.5 is no longer supported. Use 1.6 or later."-->
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>3.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.1.2</version>
</dependency>
</dependencies>
(2)Write the UDF
a.A UDF for simple types (extends UDF)
package com.bigdata.myhive;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;

@Description(name="myudf",value="_FUNC_(str) - Returns the upper case of the input string",
extended = "Example:\n" + " > SELECT _FUNC_('admin') FROM src limit 1; \n")
public class MyUDF extends UDF {
//Convert the input string to upper case; return null for null or empty input
public String evaluate(String args){
if ("".equals(args)||args==null){
return null;
}
return args.toUpperCase();
}
}
(3)Build the jar and add it to Hive
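The packaging step itself is not shown in these notes; with the Maven project above it would typically be something like the following (the exact jar name under target/ depends on your artifactId and version, so both names below are placeholders):
mvn clean package
cp target/your-artifact-1.0.jar /home/offcn/tmp/udf.jar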
--Add the jar to Hive
add jar /home/offcn/tmp/udf.jar;
--Remove the jar from Hive
delete jar /home/offcn/tmp/udf.jar;
--List the added jars
list jars;
(4)Create temporary functions and test them
--Create the temporary functions
create temporary function myudf as 'com.bigdata.myhive.MyUDF';
create temporary function myudf2 as 'com.bigdata.myhive.MyUDF2';
--Drop the temporary functions
drop temporary function myudf;
drop temporary function myudf2;
--List the created temporary functions
show functions like 'my*';
--Show the function descriptions
desc function myudf ;
desc function myudf2 ;
--Show the extended function descriptions
desc function extended myudf ;
desc function extended myudf2 ;
--Test
select myudf("hello");
select myudf2("hello");
select myudf2("hello","world");
select myudf2(array("hello","world"));
explain extended select myudf2(url) from t2;
explain select myudf2(url) from t2;