Overview:
Hive is a data warehouse built on top of HDFS. It translates HQL (an SQL-like language) statements into MapReduce jobs for execution; if another execution engine such as Spark is configured (Hive on Spark), the statements are translated into jobs for that engine instead.
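For instance, on builds that ship Hive on Spark, the execution engine can be switched per session via the hive.execution.engine property (a minimal sketch, assuming Spark is actually installed and configured; mr is the MapReduce default):

set hive.execution.engine=spark;  -- run subsequent queries on Spark
set hive.execution.engine=mr;     -- switch back to MapReduce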
Hive data types
Primitive data types
- tinyint,smallint,int/integer,bigint,float,double
- string,varchar,char
- timestamp,date,interval
- boolean
Complex data types
- array---> List<String>
- map---> Map<key,value>
- structs--> corresponds to an object, e.g. a user: struct<id:int,name:string>
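As an illustration, a minimal sketch of a table that uses the complex types above (the table and column names are made up for the example):

create table user_profile (
  id      int,
  name    string,
  hobbies array<string>,                 -- stored as e.g. reading|hiking
  scores  map<string,int>,               -- stored as e.g. math:90|english:85
  address struct<city:string,zip:string> -- stored as e.g. beijing|100000
)
row format delimited
  fields terminated by ','
  collection items terminated by '|'
  map keys terminated by ':';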
Hive tables
Hive tables fall into five kinds: internal (managed) tables, external tables, partitioned tables, bucketed tables, and views.
A Hive table corresponds to a directory on HDFS, and its data corresponds to files inside that directory.
Connecting to Hive with beeline, and basic operations
[root@slave1 hive]# clear
[root@slave1 hive]# beeline
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/leap/3.2.1.0/zookeeper/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/phoenix/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Beeline version 1.1.0-cdh5.7.0 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
scan complete in 5ms
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: hive
Enter password for jdbc:hive2://localhost:10000: ****
Connected to: Apache Hive (version 1.1.0-cdh5.7.0)
Driver: Hive JDBC (version 1.1.0-cdh5.7.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name |
+-----------+--+
+-----------+--+
No rows selected (0.327 seconds)
0: jdbc:hive2://localhost:10000> show databases;
+----------------+--+
| database_name |
+----------------+--+
| default |
| demo |
+----------------+--+
2 rows selected (0.19 seconds)
0: jdbc:hive2://localhost:10000>
// The default database is default; show tables; lists all tables in the current database
// show databases; lists all databases
- Create a database
0: jdbc:hive2://localhost:10000> create database test;
No rows affected (1.035 seconds)
0: jdbc:hive2://localhost:10000> show databases;
+----------------+--+
| database_name |
+----------------+--+
| default |
| demo |
| test |
+----------------+--+
3 rows selected (0.533 seconds)
0: jdbc:hive2://localhost:10000>
// Creates a database named test
- Switch to the test database and list its tables
0: jdbc:hive2://localhost:10000> use test;
No rows affected (0.143 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name |
+-----------+--+
+-----------+--+
No rows selected (0.189 seconds)
0: jdbc:hive2://localhost:10000>
Internal tables
If an internal table is dropped ----> both its metadata and the data stored on HDFS are deleted
- A table created directly with a plain SQL CREATE TABLE statement is an internal table
0: jdbc:hive2://localhost:10000> create table t1 (id int,info string);
No rows affected (0.452 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name |
+-----------+--+
| t1 |
+-----------+--+
1 row selected (0.194 seconds)
0: jdbc:hive2://localhost:10000> desc
desc describe descriptor
0: jdbc:hive2://localhost:10000> describe t1;
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| id | int | |
| info | string | |
+-----------+------------+----------+--+
2 rows selected (0.234 seconds)
0: jdbc:hive2://localhost:10000>
// No delimiter, HDFS location, or partitioning was specified
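To see exactly which defaults Hive filled in (SerDe, delimiters, location), the generated DDL can be inspected; a quick check on the table just created:

-- prints the full CREATE TABLE statement, including the default
-- SerDe and the warehouse location Hive assigned
show create table t1;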
- Insert a row
0: jdbc:hive2://localhost:10000> insert into table t1 values(1,"hello");
No rows affected (16.149 seconds)
0: jdbc:hive2://localhost:10000> select * from t1;
+--------+----------+--+
| t1.id | t1.info |
+--------+----------+--+
| 1 | hello |
+--------+----------+--+
1 row selected (0.246 seconds)
0: jdbc:hive2://localhost:10000>
- Inspect the corresponding HDFS directories and files; the listings show, in turn, the database name, the table name, and the data file
[hive@slave3 root]$ hadoop fs -ls /apps/hive/warehouse/
Found 2 items
drwxrwx--x+ - hive hive 0 2017-03-31 16:01 /apps/hive/warehouse/demo.db
drwxrwx--x+ - hive hive 0 2017-03-31 16:12 /apps/hive/warehouse/test.db
[hive@slave3 root]$ hadoop fs -ls /apps/hive/warehouse/test.db
Found 1 items
drwxrwx--x+ - hive hive 0 2017-03-31 16:12 /apps/hive/warehouse/test.db/t1
[hive@slave3 root]$ hadoop fs -ls /apps/hive/warehouse/test.db/t1
Found 1 items
-rwxrwx--x+ 3 hive hive 8 2017-03-31 16:16 /apps/hive/warehouse/test.db/t1/000000_0
[hive@slave3 root]$ hadoop fs -cat /apps/hive/warehouse/test.db/t1/000000_0
1hello
[hive@slave3 root]$
// If no HDFS location is specified for the table, it defaults to the Hive warehouse directory set in the configuration file
// If no delimiter is specified, Hive falls back to its default field delimiter \001 (Ctrl-A), a non-printing character, which is why the row above prints as 1hello
- Create a table with an explicit delimiter and storage location
0: jdbc:hive2://localhost:10000> create table t2(foo int,bar string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile location '/mydir/test' ;
No rows affected (0.244 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+------------------------+--+
| tab_name |
+------------------------+--+
| t1 |
| t2 |
| values__tmp__table__1 |
+------------------------+--+
3 rows selected (0.195 seconds)
0: jdbc:hive2://localhost:10000> describe t2;
+-----------+------------+----------+--+
| col_name | data_type | comment |
+-----------+------------+----------+--+
| foo | int | |
| bar | string | |
+-----------+------------+----------+--+
2 rows selected (0.255 seconds)
0: jdbc:hive2://localhost:10000> insert into table t2 values(1.'tttt');
Error: Error while compiling statement: FAILED: ParseException line 1:30 cannot recognize input near ''tttt'' ')' '<EOF>' in statement (state=42000,code=40000)
0: jdbc:hive2://localhost:10000> insert into table t2 values(1,'tttt');
No rows affected (14.613 seconds)
0: jdbc:hive2://localhost:10000> select * from t2;
+---------+---------+--+
| t2.foo | t2.bar |
+---------+---------+--+
| 1 | tttt |
+---------+---------+--+
1 row selected (0.218 seconds)
0: jdbc:hive2://localhost:10000>
// The failed insert above was caused by a stray '.' instead of ',' in values(1.'tttt'); the values__tmp__table__1 entry shown by show tables is a temporary table that Hive 1.x creates for insert ... values statements
// Fields are delimited by ',', lines by '\n'; the data is stored as a text file in the HDFS directory '/mydir/test'
- The result can be verified with hadoop fs -ls /mydir/test, just as was done for t1 above
Partitioned tables
Avoiding full table scans is the first goal of almost any SQL optimization, because scanning an entire table is notoriously expensive.
A partitioned table (partition) splits the data by one or more columns and stores each partition separately; when a SELECT filters on those columns, Hive no longer has to scan the whole table.
In Hive, each partition of a table corresponds to a subdirectory under the table's directory, and all of a partition's data lives in that subdirectory.
### Creating a partitioned table
create table p_t ( id int, name string ) partitioned by (date string);
// Describe the table, then insert a row into each partition
0: jdbc:hive2://localhost:10000> describe p_t;
+--------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+--------------------------+-----------------------+-----------------------+--+
| id | int | |
| name | string | |
| date | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| date | string | |
+--------------------------+-----------------------+-----------------------+--+
8 rows selected (0.227 seconds)
0: jdbc:hive2://localhost:10000> insert into table p_t partition (date="2017-04-01") values(1,'testtttt');
No rows affected (16.745 seconds)
0: jdbc:hive2://localhost:10000> insert into table p_t partition (date="2017-04-02") values(2,'testtttt');
No rows affected (16.769 seconds)
0: jdbc:hive2://localhost:10000> select * from p_t;
+---------+-----------+-------------+--+
| p_t.id | p_t.name | p_t.date |
+---------+-----------+-------------+--+
| 1 | testtttt | 2017-04-01 |
| 2 | testtttt | 2017-04-02 |
+---------+-----------+-------------+--+
2 rows selected (0.398 seconds)
0: jdbc:hive2://localhost:10000>
// Inspect the HDFS files: each partition is a separate subdirectory
[hive@master root]$ hadoop fs -ls /apps/hive/warehouse/test.db/p_t
Found 2 items
drwxrwx--x+ - hive hive 0 2017-04-01 10:20 /apps/hive/warehouse/test.db/p_t/date=2017-04-01
drwxrwx--x+ - hive hive 0 2017-04-01 10:21 /apps/hive/warehouse/test.db/p_t/date=2017-04-02
[hive@master root]$ hadoop fs -ls /apps/hive/warehouse/test.db/p_t/date=2017-04-01
Found 1 items
-rwxrwx--x+ 3 hive hive 11 2017-04-01 10:20 /apps/hive/warehouse/test.db/p_t/date=2017-04-01/000000_0
[hive@master root]$
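The payoff of this layout is partition pruning: a query that filters on the partition column reads only the matching subdirectory. A sketch against the same table (the date column is backquoted here to be safe, since date is also a type name):

-- only the date=2017-04-01 subdirectory is scanned;
-- the 2017-04-02 partition is pruned away
select * from p_t where `date` = '2017-04-01';

-- list the partitions the metastore knows about
show partitions p_t;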
External tables
An external table points at data that already exists on HDFS, and it can be partitioned as well. The external table itself is only a definition:
loading the data and creating the table happen in a single step, and at the metadata level it behaves like an internal table.
The actual data, however, is not stored under the Hive warehouse directory; the table merely links to some other
HDFS directory. When an external table is dropped, only the link is removed and the real data is untouched.
### Creating the table
0: jdbc:hive2://localhost:10000> create external table ybb(id int,info string) location '/mydir/ybb';
No rows affected (0.295 seconds)
0: jdbc:hive2://localhost:10000>
The keyword for creating an external table is external; location is optional and points the table at an existing directory. If location is omitted, the data still ends up under the warehouse directory.
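A small sketch of the drop behavior described above, using the ybb table just created:

-- removes only the metastore entry; the files under /mydir/ybb
-- remain on HDFS and can be re-attached by recreating the table
drop table ybb;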
Bucketed tables
A bucketed table hashes the data and spreads it across several files, which helps avoid hot spots.
A table can be split into any number of buckets.
create table buckett(id int,name string, age int) clustered by (name) into 5 buckets;
A hash is computed over the name column, and rows with the same hash result land in the same bucket.
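One caveat worth noting: on Hive 1.x, bucketing is not enforced on insert by default. A minimal sketch of populating the table (src is a hypothetical source table with matching columns):

-- without this setting, Hive 1.x may write a single file
-- instead of 5 bucket files
set hive.enforce.bucketing = true;

-- rows are distributed across the 5 files by hash(name) % 5
insert into table buckett select id, name, age from src;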
Views
A view is a logical construct spanning one or more tables. It is a virtual table, and it can be used to simplify complex queries.
create view myview as select * from t1, t2;
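Since the point of a view is to hide a complicated query, a more typical sketch hides a join (users and orders are hypothetical tables):

-- callers can select from user_orders instead of
-- repeating the join everywhere
create view user_orders as
select u.id, u.name, o.amount
from users u
join orders o on u.id = o.user_id;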
Dropping and truncating tables
Dropping a table
Dropping a table removes the table's metadata together with its data; if the HDFS Trash is configured, the data is moved to the .Trash/Current directory instead of being deleted outright.
When an external table is dropped, the table's data is not deleted.
drop table t1;
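If the data should bypass the Trash entirely, Hive (0.14 and later) also accepts a PURGE option; a sketch:

-- delete the data immediately instead of moving it to .Trash/Current
drop table t1 purge;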
Truncating a table
0: jdbc:hive2://localhost:10000> select * from p_t;
+---------+-----------+-------------+--+
| p_t.id | p_t.name | p_t.date |
+---------+-----------+-------------+--+
| 1 | testtttt | 2017-04-01 |
| 2 | testtttt | 2017-04-02 |
+---------+-----------+-------------+--+
2 rows selected (0.48 seconds)
0: jdbc:hive2://localhost:10000> truncate table p_t;
No rows affected (1.003 seconds)
0: jdbc:hive2://localhost:10000> select * from p_t;
+---------+-----------+-----------+--+
| p_t.id | p_t.name | p_t.date |
+---------+-----------+-----------+--+
+---------+-----------+-----------+--+
No rows selected (0.807 seconds)
0: jdbc:hive2://localhost:10000>
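truncate can also be limited to a single partition; a sketch against the same table, as it was before being truncated:

-- clears one partition's data only; the table definition and the
-- other partitions are left untouched
truncate table p_t partition (`date` = '2017-04-01');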