概要:

     hive是一个基于 hdfs的数据仓库,会将 hsq(类似于sql)的语句转换成 mapreduce 执行,如果配置了 其他计算引擎比如spark(hive on spark),那么会转换成其他的 来执行,



hive数据类型



基本数据类型

  • tinyint,smallint,int/integer,bigint,float,double
  • string,varchar,char
  • timestamps,date,interval
  • boolean


对象数据类型

  • array---> List<String>
  • map---> Map<key,value>
  • structs--> 对应一个对象例如: user<id:int,name:string>


hive的表

       hive表分为五类,分别是 内部表,外部表,分区表,桶表,视图.

       hive的表对应的是hdfs上的一个文件夹(目录),数据对应文件,



使用beeline 连接hive和基本操作

[root@slave1 hive]# clear
[root@slave1 hive]# beeline 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/leap/3.2.1.0/zookeeper/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/phoenix/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Beeline version 1.1.0-cdh5.7.0 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
scan complete in 5ms
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: hive
Enter password for jdbc:hive2://localhost:10000: ****
Connected to: Apache Hive (version 1.1.0-cdh5.7.0)
Driver: Hive JDBC (version 1.1.0-cdh5.7.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name  |
+-----------+--+
+-----------+--+
No rows selected (0.327 seconds)
0: jdbc:hive2://localhost:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| demo           |
+----------------+--+
2 rows selected (0.19 seconds)
0: jdbc:hive2://localhost:10000>

// 默认是  default 数据库  show tables; 显示该库下面所有表

// show databases 显示所有库       

  • 创建数据库
0: jdbc:hive2://localhost:10000> create database test;
No rows affected (1.035 seconds)
0: jdbc:hive2://localhost:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| demo           |
| test           |
+----------------+--+
3 rows selected (0.533 seconds)
0: jdbc:hive2://localhost:10000>

// 创建一个 test库

  • 切换到 test库 查看表
0: jdbc:hive2://localhost:10000> use test;
No rows affected (0.143 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name  |
+-----------+--+
+-----------+--+
No rows selected (0.189 seconds)
0: jdbc:hive2://localhost:10000>


内部表

      内部表 如果删除表 ----> 元数据和hdfs上存储的数据都会被删除

  • 直接通过 sql 语句 创建一张表 就是内部表
0: jdbc:hive2://localhost:10000> create table t1 (id int,info string);
No rows affected (0.452 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name  |
+-----------+--+
| t1        |
+-----------+--+
1 row selected (0.194 seconds)
0: jdbc:hive2://localhost:10000> desc
desc         describe     descriptor   
0: jdbc:hive2://localhost:10000> describe t1;
+-----------+------------+----------+--+
| col_name  | data_type  | comment  |
+-----------+------------+----------+--+
| id        | int        |          |
| info      | string     |          |
+-----------+------------+----------+--+
2 rows selected (0.234 seconds)
0: jdbc:hive2://localhost:10000>

// 没有指定分隔符 和所在hdfs 文件夹 也没有分区

  • 插入一条数据
0: jdbc:hive2://localhost:10000> insert into table t1 values(1,"hello");
No rows affected (16.149 seconds)
0: jdbc:hive2://localhost:10000> select * from t1;
+--------+----------+--+
| t1.id  | t1.info  |
+--------+----------+--+
| 1      | hello    |
+--------+----------+--+
1 row selected (0.246 seconds)
0: jdbc:hive2://localhost:10000>
  • 查看 其对应的hdfs 文件目录和文件 依次看到 库名,表面,数据文件
[hive@slave3 root]$ hadoop fs -ls /apps/hive/warehouse/
Found 2 items
drwxrwx--x+  - hive hive          0 2017-03-31 16:01 /apps/hive/warehouse/demo.db
drwxrwx--x+  - hive hive          0 2017-03-31 16:12 /apps/hive/warehouse/test.db
[hive@slave3 root]$ hadoop fs -ls /apps/hive/warehouse/test.db
Found 1 items
drwxrwx--x+  - hive hive          0 2017-03-31 16:12 /apps/hive/warehouse/test.db/t1
[hive@slave3 root]$ hadoop fs -ls /apps/hive/warehouse/test.db/t1
[hive@slave3 root]$ hadoop fs -ls /apps/hive/warehouse/test.db/t1
Found 1 items
-rwxrwx--x+  3 hive hive          8 2017-03-31 16:16 /apps/hive/warehouse/test.db/t1/000000_0
[hive@slave3 root]$ hadoop fs -cat /apps/hive/warehouse/test.db/t1/000000_0
1hello
[hive@slave3 root]$

//不指定 hive表的所在hdfs目录-->它回默认在 配置文件中配置的hdfs的hive目录下,

         //不指定 hive表的分隔符---> 默认没有分隔符

  • 创建一个指定分隔符和存储位置的表
0: jdbc:hive2://localhost:10000> create table t2(foo int,bar string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile location '/mydir/test' ;
No rows affected (0.244 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+------------------------+--+
|        tab_name        |
+------------------------+--+
| t1                     |
| t2                     |
| values__tmp__table__1  |
+------------------------+--+
3 rows selected (0.195 seconds)
0: jdbc:hive2://localhost:10000> describe t2;
+-----------+------------+----------+--+
| col_name  | data_type  | comment  |
+-----------+------------+----------+--+
| foo       | int        |          |
| bar       | string     |          |
+-----------+------------+----------+--+
2 rows selected (0.255 seconds)
0: jdbc:hive2://localhost:10000> insert into table t2 values(1.'tttt');
Error: Error while compiling statement: FAILED: ParseException line 1:30 cannot recognize input near ''tttt'' ')' '<EOF>' in statement (state=42000,code=40000)
0: jdbc:hive2://localhost:10000> insert into table t2 values(1,'tttt');
No rows affected (14.613 seconds)
0: jdbc:hive2://localhost:10000> select * from t2;
+---------+---------+--+
| t2.foo  | t2.bar  |
+---------+---------+--+
| 1       | tttt    |
+---------+---------+--+
1 row selected (0.218 seconds)
0: jdbc:hive2://localhost:10000>

//字段分隔 ',' 行分隔 '/n' 换行 存储为txt ,指定hdfs目录'/mydir/test'

  • 查看效果

 



分区表

    在所有的sql 语句优化中,避免全表扫描,是首要目的,因为众所周知全表扫描非常消耗性能,

分区表(partition)是根据若干列进行分区,讲数据分区存储,当我们根据这些列进行 select的时候,则不需要再进行全表数据扫描了

在hive中  表的一个 partition 对应表下的一个目录 .所有的partition的数据都存储在对应的目录中

### 创建分区表,



create table p_t ( id int, name string ) partitioned by (date string);



 // 插入数据

0: jdbc:hive2://localhost:10000> describe p_t;
+--------------------------+-----------------------+-----------------------+--+
|         col_name         |       data_type       |        comment        |
+--------------------------+-----------------------+-----------------------+--+
| id                       | int                   |                       |
| name                     | string                |                       |
| date                     | string                |                       |
|                          | NULL                  | NULL                  |
| # Partition Information  | NULL                  | NULL                  |
| # col_name               | data_type             | comment               |
|                          | NULL                  | NULL                  |
| date                     | string                |                       |
+--------------------------+-----------------------+-----------------------+--+
8 rows selected (0.227 seconds)
0: jdbc:hive2://localhost:10000> insert into table p_t partition (date="2017-04-01") values(1,'testtttt');
No rows affected (16.745 seconds)
0: jdbc:hive2://localhost:10000> insert into table p_t partition (date="2017-04-02") values(2,'testtttt');
No rows affected (16.769 seconds)
0: jdbc:hive2://localhost:10000> select * from p_t;
+---------+-----------+-------------+--+
| p_t.id  | p_t.name  |  p_t.date   |
+---------+-----------+-------------+--+
| 1       | testtttt  | 2017-04-01  |
| 2       | testtttt  | 2017-04-02  |
+---------+-----------+-------------+--+
2 rows selected (0.398 seconds)
0: jdbc:hive2://localhost:10000>

 

// 查看 hdfs 文件

[hive@master root]$ hadoop fs -ls /apps/hive/warehouse/test.db/p_t
Found 2 items
drwxrwx--x+  - hive hive          0 2017-04-01 10:20 /apps/hive/warehouse/test.db/p_t/date=2017-04-01
drwxrwx--x+  - hive hive          0 2017-04-01 10:21 /apps/hive/warehouse/test.db/p_t/date=2017-04-02
[hive@master root]$ hadoop fs -ls /apps/hive/warehouse/test.db/p_t/date=2017-04-01
Found 1 items
-rwxrwx--x+  3 hive hive         11 2017-04-01 10:20 /apps/hive/warehouse/test.db/p_t/date=2017-04-01/000000_0
[hive@master root]$



外部表

指向 已经在 hdfs中存在的数据, 也可以创建分区,外部表 只是一个定义 

加载数据和创建表一起完成 ,和内部表在元数据上是一样的

而实际数据存储并不在 hive数据仓库的目录中,只是与外部其他的hdfs

目录建立了一个链接,当删除外部表的时候 只是删除了链接 真正的数据不会受影响

###建表

0: jdbc:hive2://localhost:10000> create external table ybb(id int,info string) location '/mydir/ybb';
No rows affected (0.295 seconds)
0: jdbc:hive2://localhost:10000>

创建外部表的关键字 是  external ,location是辅助的 可以指定外部表的位置 也可以不指定,不指定还是会在warehouse里面,



桶表

桶表是 把数据进行hash取值之后,存放到不同的文件中,避免热块

可以 在一个表中创建 若干个桶

create table buckett(id int,name string, age int) clustered by (name) into 5 buckets;

 根据 name 字段进行 hash运算  ,将结果相同的放到一个桶里面



视图

跨越多张表的 逻辑概念 ,是一个虚表 视图可以简化负责的查询

create view myview as select * from t1 ,t2;

 



删除表和清空表



删除表 

删除表会移除表的元数据和数据,而HDFS上的数据,如果配置了Trash,会移到.Trash/Current目录下。

删除外部表时,表中的数据不会被删除。

drop table t1;



清空表

0: jdbc:hive2://localhost:10000> select *  from p_t;
+---------+-----------+-------------+--+
| p_t.id  | p_t.name  |  p_t.date   |
+---------+-----------+-------------+--+
| 1       | testtttt  | 2017-04-01  |
| 2       | testtttt  | 2017-04-02  |
+---------+-----------+-------------+--+
2 rows selected (0.48 seconds)
0: jdbc:hive2://localhost:10000> truncate table p_t;
No rows affected (1.003 seconds)
0: jdbc:hive2://localhost:10000> select *  from p_t;
+---------+-----------+-----------+--+
| p_t.id  | p_t.name  | p_t.date  |
+---------+-----------+-----------+--+
+---------+-----------+-----------+--+
No rows selected (0.807 seconds)
0: jdbc:hive2://localhost:10000>