一、周二内容的复习
1、数据类型
原子类型 相当于 java 当中的 基本类型
byte short int long float double boolean char
tinyint smallint int bigint float double boolean string
date
date
2、复杂类型
array huangbo,xuzheng,wangbaoqiang === names
collection items terminated by ","
names[0], names[1],.....
map yuwen:77,shuxue:88,yinwu:99 ==== course
collection items terminated by ","
map keys terminated by ":"
course["yuwen"], course["shuxue"],.....
struct 101,huangbo,50000000000,huangbo@163.com
baseinfo struct<id:int,name:string,salary:bigint,email:string>
baseinfo.id, baseinfo.name
复杂类型和基本类型混用的时候要注意各个分隔符是要不一致
二、函数
1、内置函数
concat
substring
get_json_object
explode
2、自定义函数===UDF
UDF的编写步骤:
1、自定义一个java类MyUDF,继承UDF
2、在MyUDF类中,要编写evaluate方法
该方法要求是public的修饰方法,然后该方法可以重载
3、把MyUDF打成一个jar上传到服务器
4、通过hive的命令把该jar导入到资源路径
add jar /home/hadoop/udf.jar
list jar;
5、创建函数
create temporary function myfunc as "com.ghgj.hive.udf.MyUDF"
6、测试
select myfunc(3,4);
三、json格式的数据解析
get_json_objectd(jsonObject, path)
get_json_object(jsonline, '$.citys[0]')
rating.json
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
create table jsontable(jsonline string);
load data local inpath "/home/hadoop/rating.json" into table jsontable;
movie rate ts uid
1193 5 978300760 1
1193 5 978300760 1
create table json_table(movie int, rate int, ts bigint, uid int) row format delimited fields terminated by "\t"
insert into table json_table select get_json_object(jsonline, '$.movie') as movie, get_json_object(jsonline, '$.rate') as rate, get_json_object(jsonline, '$.timeStamp') as ts, get_json_object(jsonline, '$.uid') as uid from jsontable;
select movie, rate, ts, uid from json_table limit 3;
新建一张hive表去存储使用transform的方式解析出来的数据;
create table last_json_table (movie int, rate int, weekday string, userid int) row format delimited fields terminated by "\t";
// 怎么调用脚本去解析这个数据然后插入到last_json_table?
1、把脚本文件加入资源路径, DistributeCache MapJoin
add file /home/hadoop/weekday_mapper.py
2、编写正常的使用该脚本去解析数据的HQL语句
insert into table last_json_table select transform(movie,rate,ts,uid) using 'python weekday_mapper.py' as(movie,rate,weekday,userid) from json_table;
最终实现的目标:
把原表 json_table的数据当中的第三个字段做了转换
另存为一张新表last_json_table;
#!/bin/python
import sys
import datetime
for line in sys.stdin:
line = line.strip()
movie,rate,ts,userid = line.split('\t')
weekday = datetime.datetime.fromtimestamp(float(ts)).isoweekday()
print '\t'.join([movie, rate, str(weekday),userid])
脚本的作用:
输入hive表当中的一行数据:
经过脚本的语句转成:转换第三个字段 从时间戳 转换成 星期编号
输出另外一行数
验证数据:
select weekday from last_json_table limit 10;
四、视图
mysql当中的视图:视图创建之后会生成新的数据
hive当中的视图在创建之后只存在元数据,只有逻辑的概念,不会真正的执行创建视图当中的HQL语句
hive的视图在使用的时候跟 hive的子表的概念完全一样
select a.* from a where a.id in (select b.id from b);
create view bid_view as select b.id from b;
select a.* from a where a.id in bid_view;
select distinct id from bid_view;
select distinct id from (select b.id from b);
create view weekday_view as select distinct weekday from last_json_table;
五、特殊分隔符的处理
101,huangbo,22
102,xuzheng,33
103,wangbaoqiang,44
create table mingxing (id int, name string, age int) row format delimited fields terminated by ",";
load data local inpath "/home/hadoop/mingxing.txt" into table mingxing;
select * from mingxing;
101,,huangbo,,22
102,,xuzheng,,33
103,,wangbaoqiang,,44
create table mingxing1 (id int, name string, age int) row format delimited fields terminated by ",,";
load data local inpath "/home/hadoop/mingxing1.txt" into table mingxing1;
select * from mingxing1;
101^_^huangbo^_^22 ====> 101,huangbo,22
102^_^xuzheng^_^33
103^_^wangbaoqiang^_^44
特殊分隔符的两种使用方式:
通过正则的SerDe工具去抽取数据:
1、使用SerDe
create table regex_table(id int, name string, age int) row format serde ''
(.*)\|\|(.*) ===> 101||huangbo
output.format.string =====> '%1$s %2$s'
%s %d
.* 代表任意字符或者字符串
create table new_mingxing(id int, name string, age int) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties('input.regex'='(.*),,(.*),,(.*)','output.format.string'='%1$s %2$s %3$s') stored as textfile;
load data local inpath "/home/hadoop/mingxing1.txt" into table new_mingxing;
2、自定义InputFormat
create table inputformat_mingxing (id int, name string, age int) row format delimited fields terminated by "," stored as inputformat "com.ghgj.mapreduce.specialDelimiter.MySpecialDelimiterInputFormat" outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
load data local inpath "/home/hadoop/mingxing1.txt" into table inputformat_mingxing;
add jar /home/hadoop/apps/apache-hive-1.2.1-bin/lib/inputFormat.jar;
select * from inputformat_mingxing;
六、面试题
mysql或者hive当中常见的笔试题
1、自连接
2、行列转换
3、分组取前几
1、分桶和分区的理解
2、讲的速度太快
三个面试题 + 影评练习
都是特别思路题。
多提供一些hive案例
1、hive影评案例
2、7.6中关村举行盛大的招聘会。(百度,网易,文思,中软,百分点)
3、1702班。(java基础相关,mysql的操作,hadoop和spark为主)
今天
RCFile — HBase的数据存储结构
准备数据;
student
create table student(id int, name string, sex string ,age int, department string) row format delimited fields terminated by “,”;
load data local inpath “/home/hadoop/student.txt” into table student;
select * from student;
一、分区
hive表就是hdfs的上的一个目录
hive表中的数据,其实就是对应了HDFS上的一个目录下的数据
概念:对hive表的数据做分区管理
创建分区表
create table student_ptn(id int, name string) partitioned by (age int, department string) row format delimited fields terminated by ",";
添加分区
alter table student_ptn add partition(age=44,department="AA");
alter table student_ptn add partition(age=66,department="CC") partition(age=55,department="BB");
alter table student_ptn add partition(department="DD",age=77); // 添加成功。但是分区字段的顺序是在创建分区表的时候决定的
alter table student_ptn add partition(age=88,department="EE") location "/ptn_input1" partition(age=99,department="FF") location "/ptn_input2";
查询分区
show partitions student_ptn
修改分区
alter table student_ptn partition(age=44,department="AA") set location "hdfs://hadoop02:9000/ptn_input3";
// 往某个特定的分区导入数据
load data local inpath "/home/hadoop/student.txt" into table student_ptn partition(age=55,department="BB");
load data local inpath "/home/hadoop/student.txt" into table student_ptn; XXXXX
// 删除分区
alter table student_ptn drop partition(age=44,department="AA");
// 删除分区的时候,包括分区的元数据信息和分区的数据及目录
// 清空分区表的时候,分区还在,数据被清空
导入数据
hadoop fs
load
insert ....select....
// 使用insert....values... 的方法是导入数据
insert into table student_ptn values (101,"liutao");
insert into table student_ptn partition(age=66,department="CC") values (101,"liutao");
// 使用单重或者多重模式进去插入
from student.... insert ... ....insert
from student insert into table student_ptn partition(age=18,department="A") select id,name where age = 18 insert into table student_ptn partition(age=18,department="B") select id,name where age = 19;
// 验证多重分区的插入是否成功
select id, name from student_ptn where age = 18 and department = "A";
select id, name from student_ptn where age = 18 and department = "B";
// 动态分区
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
drop table student_ptn;
insert into table student_ptn partition(age=101, department) select id, department, name from student; XXXXXX
insert into table student_ptn partition(age=202, department) select id, name, department from student; √√√√√√
insert into table student_ptn(age) select id, name, age from student;
动态分区的效果:
按照动态分区的字段的值做划分,一个值就是一个分区
该studnet表当中的分区字段department有多少个值,就有多少个分区
每个分区就存一个分区字段值的所有数据
department = IS CS MA
二、分桶
概念:对分区的进一步的 更细粒度的划分。 分区类似
分桶的核心思想跟 MR程序的默认分区组件HashParititioner的原理一致
原理:根据key的hash值去模除以reduceTask的个数
分桶的原理:对分桶字段的值进行hash,然后模除以桶的个数,得到一个余数,该余数就是分桶的编号000000_0
MR 分区 mapper的outKey reduceTask的个数
hive 分桶 分桶字段的值 分桶的个数
分桶的效果:
分桶字段的值经过分桶逻辑计算之后得到的余数相同的都在同一个分桶文件中
一个分桶文件可能存在多个分桶的字段值
一个分桶文件也可能存在没有任何值
AA BB CC DD EE
分桶的个数:3
AA=0
BB=1
CC=0
DD=1
EE=0
上面的AA和CC和EE都会出现在第一个分桶文件当中
上面的BB和DD都会出现在第二个分桶文件中
第三个分桶文件当中就不会出现任何值。
表现形式上:
如果没有分区,那么分桶的目录就是表目录的下一级
如果有分区,那么分桶的目录在分区目录的下一级
分桶的操作:
创建分桶表:
create table student_bucket (id int, name string, sex string, age int, department string) clustered by (age) sorted by(age desc, id asc) into 2 buckets row format delimited fields terminated by ",";
往分桶表导入数据:
load data local inpath "/home/hadoop/student.txt" into table student_bucket;
注意:往分桶表导入数据,不能使用load方式 XXXXXXXXXXXX
insert .... values ....
insert into table student_bucket values (101,"liuyifei","female",22,"NVSHEN");
insert into table student_bucket values (102,"luoyufeng","female",33,"nvdiaosi");
// insert ....values... XXXXXXXXXXXX
// insert into .... select.....
// XXXXXXXXXXXX
insert into table student_bucket select id,name,sex,age,department from student;
√√√√√√√√√√√√√√√√√√√√√√√√
// 使用注意:打开分桶的开关,设置对应的reduceTask的个数要跟桶数匹配
set hive.enforce.bucketing = true;
set mapred.reduce.tasks = 2; / set mapreduce.job.reduces = 2;
insert into table student_bucket select id,name,sex,age,department from student distribute by age sort by age desc, id asc;
√√√√√√√√√√√√√√√√√√√√√√√√
分桶的特例:
create table student_bucket1 (id int, name string, sex string, age int, department string) clustered by (age) sorted by(age desc) into 2 buckets row format delimited fields terminated by “,”;
insert into table student_bucket1 select id,name,sex,age,department from student distribute by age; XXXXXXXXXXXXX
验证的问题:最后分桶的数据到底有没有排序、
注意:创建分桶时指定的分桶和排序等等的信息都不能在数据插入的时候起作用
数据到底有没有分桶是根据 插入数据的HQL语句决定
第四天正课的内容:
hive的底层实现原理
优化