hive中where条件优化 hive where条件顺序

转载

墨韵流香 2023-07-14 12:35:54

文章标签 hive中where条件优化 hive hql hive sql hive 字段 文章分类 Hive 大数据

HIVE 的 HQL 总结

1. group by 与 where 与 having

A. having只能用在group by之后，对分组后的结果进行筛选(即使用having的前提条件是分组)。使用having字句对分组后的结果进行筛选

B. where肯定在group by 之前,当结合在一起时，where在前，group by 在后。即先对select xx from xx的记录集合用where进行筛选，然后再使用group by 对筛选后的结果进行分组

C. where后的条件表达式里不允许使用聚合函数，而having可以

2. 排序 order by ，sort by , ,distribute by, cluster by三者区别

A. order by : order by 字段名1 asc/desc ,会对输入做全局排序，因此只有一个reducer，会导致当输入规模较大时，需要较长的计算时间。

B. sort by : 不是全局排序，其在数据进入reducer前完成排序。因此，如果用sort by进行排序，并且设置mapred.reduce.tasks>1，则sort by只保证每个reducer的输出有序，不保证全局有序.

C. distribute by根据distribute by指定的内容将数据分到同一个reducer。

D. Cluster by 除了具有Distribute by的功能外，还会对该字段进行排序。因此，常常认为cluster by = distribute by + sort by

3. JOIN 总结

join

A. left join：以左表为基准，即对于A left join B on A.x=B.x这句SQL来说，查询的结果是：表A中的结果会全部显示出来，但是如果表A中含有的x选项，但是表B却没有，这时会在结果集中B的x栏目显示null。

tb1、tb2两表共有 + tb1的独有
select * from tb1 a left join tb2 b on a.deptId = b.id;

B. right join：以右表为基准，即对于A right join B on A.x=B.x这句SQL来说，查询的结果是：表B中的结果会全部显示出来，但是如果表B中含有的x选项，表A中没有，这时会在结果集中A的x栏目显示null。

tb1、tb2两表共有+tb2的独有
select * from tb1 a right join tb2 b on a.deptId = b.id;

C. inner join：只有两者共有的字段才会在结果集中显示

ta1与ta2共同拥有的原生
 select * from tb1 a inner join tb2 b on a.deptId = b.id;

D. ta1的独有 Select * from tb1 a left join tb2 b on a.deptid = b.id and b.id is null;

E. ta2的独有 Select * from tb1 a right join tb2 b on a.deptid = b.id and a.id is null;

F. full outer join： tab1与tab2都有 select * from tb1 a full outer join tb2 b on a.deptid = b.id

G. A的独有+B的独有

select * from tbl_emp a left join tbl_dept b on a.deptid = b.id where b.id is null
 Union  
 select * from tbl_emp a right join tbl_dept b on a.deptid = b.id where a.deptid is null;

4. NULL 与多个group by

A. 一个字段是否为空， is null, is not null
B. 多个字段GROUP BY， group by 字段1，字段2，字段3
C. 聚合函数， sum（字段） as 别名

5.创建临时表----> 自连接视图-----> 进行比较统计实例

视图（View）
视图（view）是一种虚拟存在的表，是一个逻辑表，本身并不包含数据。作为一个select语句保存在数据字典中的。
1）简单：使用视图的用户完全不需要关心后面对应的表的结构、关联条件和筛选条件，对用户来说已经是过滤好的复合条件的结果集。
2）安全：使用视图的用户只能访问他们被允许查询的结果集，对表的权限管理并不能限制到某个行某个列，但是通过视图就可以简单的实现。
总而言之，使用视图的大部分情况是为了保障数据安全性，提高查询效率。
对业务逻辑复杂的，查询过程中可以创建一个临时表
eg: 求单月访问次数和总访问次数
用户名，月份，访问次数

A,2015-01,5
 A,2015-01,15
 B,2015-01,5
 A,2015-01,8
 B,2015-01,25
 A,2015-01,5
 A,2015-02,4
 A,2015-02,6
 B,2015-02,10
 B,2015-02,5
 A,2015-03,16
 A,2015-03,22
 B,2015-03,23
 B,2015-03,10
 B,2015-03,1

(1) 创建表导入数据

//创建表
 create external table if not exists t_access(
 uname string comment '用户名',
 umonth string comment '月份',
 ucount int comment '访问次数'
 ) comment '用户访问表' 
 row format delimited fields terminated by "," 
 //导入数据
 load data local inpath "/zhm/hiv/access.txt" into table t_access;

结果需求
每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数，结果数据格式如下
解决方案
（1）. 根据用户+月份进行分组, 创建临时表，先求出当月访问次数

--求当月访问次数
 create table tmp_access(
 name string,
 mon string,
 num int
 ); 
 
 insert into table tmp_access 
 select uname,umonth,sum(ucount)
 from t_access t group by t.uname,t.umonth;

（2）. 要统计每个月的和，需要tmp_access进行自连接视图，然后进行比较聚合

create view tmp_view as 
	select a.name anme,a.mon amon,a.num anum,b.name bname,b.mon bmon,b.num 
	bnum from tmp_access a join tmp_access b 
	on a.name=b.name;

（3）. 进行比较统计

select anme,amon,anum,max(bnum) as max_access,sum(bnum) as sum_access 
	from tmp_view 
	where amon>=bmon 
	group by anme,amon,anum;

5. hive中的concat，concat_ws，collect_set用法

A. concat 与 concat_ws

concat(string s1, string s2, string s3)把字符串类型的数据连接起来

concat( name, ‘:’, age) 就相当于把 name 列和 age 列用逗号连接起来了

concat_ws(’,’,数组) 相当于把数组中的元素，用逗号连接起来了

注意：concat需要的是字符串，有时候需要使用cast将数据类型转换成string

B. cast 强制转换， cast(value as type)
eg: cast(字段 as string)

C. collect_set函数与group by有两个作用:
第一个是去重，去除 group by 后的重复元素，
第二个是形成一个集合，将 group by 后属于同一组的第三列集合起来成为一个集合。

Hive不允许直接访问非group by字段；
对于非group by字段，可以用Hive的collect_set函数收集这些字段，返回一个数组；
使用数字下标，可以直接访问数组中的元素；

D. collect_set函数与group by 与 contact_ws结合使用
就是将这些元素以逗号分隔形成字符串

有表结构 a string , b string , c int
e f 1
d k 4
e f 1
d k 8
select a, b, concat_ws(‘,’ , collect_set(cast(c as string)))from table group by a,b;

E. 在Hive中求出一个数据表中在某天内首次登陆的人；可以借助collect_set来处理sql:

select count(a.id)   
from (select id,collect_set(time) as t from t_action_login where time<='20150906' group by id) 
as a    where size(a.t)=1 and a.t[0]='201809‘’

6. hive substr函数

substr（key, a,b)函数, 从第a个开始，取后面b个

eg: 实例：求每一年最大气温的那一天 + 温度
2010012325表示在2010年01月23日的气温为25度
2001010529
2013010619
2013010722
2013010812
2013010929
2013011023
select substr(datestr,1,4),max(substr(datestr,9,2)) from table group by substr(datestr,1,4);

7. Hive 中 case when 的用法实例

A. case x when y then z then a end

select id,case course when ‘english’ then ‘chinese’ else ‘其他’ end as id from test;

B. case when 条件 then result else result end

case when birthday<‘1968’ then ‘50up’
when birthday<‘1978’ then ‘40-50’
when birthday<‘1983’ then ‘35-40’
when birthday<‘1988’ then ‘30-35’
when birthday<‘1992’ then ‘26-30’
when birthday<‘1998’ then ‘20-25’
else ‘other’ end,

C.实例
一个成绩表： // 字段解释：id, 学号，课程，成绩
INSERT INTO course VALUES (3, 2, ‘yuwen’, 77);
INSERT INTO course VALUES (4, 2, ‘shuxue’, 88);
INSERT INTO course VALUES (5, 3, ‘yuwen’, 98);
INSERT INTO course VALUES (6, 3, ‘shuxue’, 65);
求：所有数学课程成绩大于语文课程成绩的学生的学号

解决问题思维，创建一张临时表，学号，数学成绩，语文成绩
create view tmp_course_view as
select sid, case course when “shuxue” then score else 0 end as shuxue,
case course when “yuwen” then score else 0 end as yuwen from course;

以sid分组合并取各成绩最大值
create view tmp_course_view1 as
select aa.sid, max(aa.shuxue) as shuxue, max(aa.yuwen) as yuwen from tmp_course_view aa group by sid;

比较结果
select * from tmp_course_view1 where shuxue > yuwen;

8. get_json_object 内置函数###

解析json的字符串json_string,返回path指定的内容。如果输入的json字符串无效，那么返回NULL。
get_json_object(’{“movie”:“594”,“rate”:“4”,“timeStamp”:“978302268”,“uid”:“1”}’,’$.movie’);

9. 日期内置函数

date_format ：按指定格式返回时间date 如：date_format(“2016-06-22”,“MM-dd”)=06-22
date_add(‘2008-12-31’, 1) = ‘2009-01-01’. 从开始时间startdate加上days
date_sub(‘2008-12-31’, 1) = ‘2008-12-30’.从开始时间startdate减去days
datediff(‘2009-03-01’, ‘2009-02-27’) = 2.计算开始时间startdate到结束时间enddate相差的天数
to_date(“1970-01-01 00:00:00”) = “1970-01-01”.返回时间字符串的日期部分
unix_timestamp(‘2009-03-20’, ‘yyyy-MM-dd’) = 1237532400， unix_timestamp(‘2009-03-20 11:30:01’) = 1237573801
from_unixtime(1250111000,“yyyy-MM-dd”) 得到2009-03-12
year(“1970-01-01”) = 1970. 返回时间字符串的年份部分
quarter(‘2015-04-08’) = 2. 返回当前时间属性哪个季度如quarter(‘2015-04-08’) = 2
month(“1970-11-01”) = 11.返回时间字符串的月份部分
day(“1970-11-01”) = 1.返回时间字符串的天
hour(‘12:58:59’) = 12.返回时间字符串的小时
minute(string date) 返回时间字符串的分钟
second(string date) 返回时间字符串的秒

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：centos7 mysql错误日志 mysql数据库错误日志

下一篇：conda python环境 python环境anaconda

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯