1. Introduction to Partitioned Tables
When working with data in Hive, once a table grows very large, every query has to scan the whole table, which is a heavy burden on the cluster. If we classify the data and only scan the relevant class at query time, the problem is effectively solved. That is exactly what partitions in Hive's table layout are for.
Each partition of a partitioned table corresponds to one directory on HDFS.
Partitioned tables come in two kinds, static and dynamic, distinguished by whether partitions are created automatically.
A multi-level partitioned table is created by specifying, for example, PARTITIONED BY (event_month string, loc string); following that order, Hive creates nested directories event_month=XXX/loc=XXX. Everything else works the same as a single-level partitioned table.
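For example, here is a minimal sketch of a two-level partitioned table (the table name logs and the partition values below are illustrative, not from the session that follows):
create table logs(
    msg string
)partitioned by (event_month string, loc string)
row format delimited fields terminated by '\t';
-- loading into partition (event_month='2019-07', loc='bj') creates the nested directories
-- .../warehouse/logs/event_month=2019-07/loc=bj/ under the table's location on HDFS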
2. Static Partitioned Tables
- Create a partitioned table:
hive (default)> create table order_partition(
> oder_no string,
> oder_time string
> )partitioned by (event_month string)
> row format delimited fields terminated by '\t'; // create a static partitioned table
OK
Time taken: 0.142 seconds
hive (default)> load data local inpath '/home/hadoop/data/order_created.txt' overwrite into table order_partition partition (event_month='2019-07'); // load local data into the partition
Loading data to table default.order_partition partition (event_month=2019-07)
Partition default.order_partition{event_month=2019-07} stats: [numFiles=1, numRows=0, totalSize=213, rawDataSize=0]
OK
Time taken: 0.652 seconds
hive (default)>
As shown in the figure below: an event_month=2019-07 directory has been created under the order_partition/ directory, and after the load you can see an order_created.txt file inside it.
Note: when I first practiced this, loading the data kept failing with the error below:
hive (default)>
> load data local inpath '/home/hadoop/data/order_created.txt' overwrite into table order_partition partition (event_month='2019-07');
Loading data to table default.order_partition partition (event_month=2019-07)
Failed with exception MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
It turned out to be a version-compatibility problem with the MySQL driver jar: the metastore database in MySQL never got its PARTITIONS table created, because the mysql-connector-java-8.0.16.jar I had been using was too new. After replacing it with a 5.x version and relaunching Hive, the PARTITIONS table appeared in the metastore database and the data load succeeded.
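A rough sketch of the fix (the 5.x jar file name and the $HIVE_HOME path are assumptions; adjust them to your environment):
# remove the 8.x connector from Hive's lib directory and drop in a 5.x one
mv $HIVE_HOME/lib/mysql-connector-java-8.0.16.jar /tmp/
cp mysql-connector-java-5.1.47.jar $HIVE_HOME/lib/
# then restart the Hive CLI (and the metastore service, if you run one) and retry the load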
View the directory created on HDFS when the table was created:
[hadoop@hadoop001 ~]$ hadoop fs -ls /user/hive/warehouse/order_partition/
19/07/23 14:27:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2019-07-23 13:37 /user/hive/warehouse/order_partition/event_month=2019-07
- Upload local data to the 2019-08 partition by placing the text file on HDFS directly
1. First create a directory for the 2019-08 partition on HDFS
[hadoop@hadoop001 ~]$ hadoop fs -mkdir -p /user/hive/warehouse/order_partition/event_month=2019-08
19/07/23 14:30:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 ~]$ hadoop fs -ls /user/hive/warehouse/order_partition/
19/07/23 14:30:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2019-07-23 13:37 /user/hive/warehouse/order_partition/event_month=2019-07
drwxr-xr-x - hadoop supergroup 0 2019-07-23 14:30 /user/hive/warehouse/order_partition/event_month=2019-08
[hadoop@hadoop001 ~]$
2. Put the data file into that partition directory
[hadoop@hadoop001 ~]$ hadoop fs -put /home/hadoop/data/order_created.txt /user/hive/warehouse/order_partition/event_month=2019-08
19/07/23 14:32:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 ~]$ hadoop fs -ls /user/hive/warehouse/order_partition/event_month=2019-08
19/07/23 14:35:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r-- 1 hadoop supergroup 213 2019-07-23 14:32 /user/hive/warehouse/order_partition/event_month=2019-08/order_created.txt
3. Querying the table shows no data for this partition
hive (default)> select * from order_partition where event_month='2019-08';
OK
order_partition.oder_no order_partition.oder_time order_partition.event_month
Time taken: 0.048 seconds
hive (default)>
4. Check in MySQL whether any partition metadata was written
mysql> select * from tbls;
+--------+-------------+-------+------------------+--------+-----------+-------+-----------------+---------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+--------+-----------+-------+-----------------+---------------+--------------------+--------------------+
| 6 | 1563860203 | 1 | 0 | hadoop | 0 | 6 | order_partition | MANAGED_TABLE | NULL | NULL |
+--------+-------------+-------+------------------+--------+-----------+-------+-----------------+---------------+--------------------+--------------------+
1 row in set (0.00 sec)
mysql> select * from partition_key_vals;
+---------+--------------+-------------+
| PART_ID | PART_KEY_VAL | INTEGER_IDX |
+---------+--------------+-------------+
| 1 | 2019-07 | 0 |
+---------+--------------+-------------+
1 row in set (0.00 sec)
mysql>
5. The partition metadata in MySQL has to be repaired before Hive can return the corresponding data
One option is the command: MSCK REPAIR TABLE table_name;
hive (default)> msck repair table order_partition;
OK
Partitions not in metastore: order_partition:event_month=2019-08
Repair: Added partition to metastore order_partition:event_month=2019-08
Time taken: 0.097 seconds, Fetched: 2 row(s)
hive (default)>
This command syncs the 2019-08 partition into the metastore (MySQL), so the data becomes visible when you query the table in Hive.
It has a serious drawback, though: it scans the table's entire location on HDFS and reconciles every partition missing from the metastore. For a table that has accumulated years of partitions it can grind away for ages, so it is far too heavy-handed and is avoided in production.
So the approach above is not recommended. Use the following method instead:
1. Create a new partition directory and put the local data under event_month=2019-09 on HDFS
[hadoop@hadoop001 ~]$ hadoop fs -mkdir -p /user/hive/warehouse/order_partition/event_month=2019-09
19/07/23 14:49:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 ~]$ hadoop fs -put /home/hadoop/data/order_created.txt /user/hive/warehouse/order_partition/event_month=2019-09
19/07/23 14:49:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 ~]$
2. Check in Hive whether the table returns the data
hive (default)> select * from order_partition where event_month='2019-09';
OK
order_partition.oder_no order_partition.oder_time order_partition.event_month
Time taken: 0.043 seconds
hive (default)>
3. Add the partition to refresh the metadata in MySQL (this is the approach used in production, because it refreshes only this one partition's metadata)
hive (default)> alter table order_partition add if not exists partition (event_month='2019-09');
OK
Time taken: 0.059 seconds
hive (default)> select * from order_partition where event_month='2019-09'; // query again and the data now shows up
OK
order_partition.oder_no order_partition.oder_time order_partition.event_month
10703007267488 2014-05-01 06:01:12.334+01 2019-09
10101043505096 2014-05-01 07:28:12.342+01 2019-09
10103043509747 2014-05-01 07:50:12.33+01 2019-09
10103043501575 2014-05-01 09:27:12.33+01 2019-09
10104043514061 2014-05-01 09:03:12.324+01 2019-09
Time taken: 0.044 seconds, Fetched: 5 row(s)
hive (default)>
4. List all partitions of the table:
hive (default)> show partitions order_partition;
OK
partition
event_month=2019-07
event_month=2019-08
event_month=2019-09
Time taken: 0.034 seconds, Fetched: 3 row(s)
hive (default)>
Summary: 1. Both of the SELECTs above return the event_month column.
2. hdfs dfs -cat on the data file does not show that column, which means the partition column is a pseudo-column.
3. It is materialized only as the partition directory on HDFS.
4. The real table columns do not include the partition column; it is just the name of a folder on HDFS.
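A quick way to verify this (a sketch; the comments describe the expected result rather than captured output):
hadoop fs -cat /user/hive/warehouse/order_partition/event_month=2019-09/order_created.txt
# the file holds only the oder_no and oder_time columns;
# the value 2019-09 appears nowhere inside the file, only in the directory name event_month=2019-09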
3. Multi-level Partitions
1. Create a multi-level partitioned table
hive (default)> create table order_mulit_partition(
> oder_no string,
> oder_time string
> )partitioned by (event_month string ,step string)
> row format delimited fields terminated by '\t';
OK
Time taken: 0.063 seconds
hive (default)>
2. Load data into the partitioned table
hive (default)> load data local inpath '/home/hadoop/data/order_created.txt' into table order_mulit_partition PARTITION (event_month='2014-05', step='1');
Loading data to table default.order_mulit_partition partition (event_month=2014-05, step=1)
Partition default.order_mulit_partition{event_month=2014-05, step=1} stats: [numFiles=1, numRows=0, totalSize=213, rawDataSize=0]
OK
Time taken: 0.24 seconds
hive (default)>
3. Query the data in the partition
hive (default)> select * from order_mulit_partition where event_month='2014-05' and step='1';
OK
order_mulit_partition.oder_no order_mulit_partition.oder_time order_mulit_partition.event_month order_mulit_partition.step
10703007267488 2014-05-01 06:01:12.334+01 2014-05 1
10101043505096 2014-05-01 07:28:12.342+01 2014-05 1
10103043509747 2014-05-01 07:50:12.33+01 2014-05 1
10103043501575 2014-05-01 09:27:12.33+01 2014-05 1
10104043514061 2014-05-01 09:03:12.324+01 2014-05 1
Time taken: 0.04 seconds, Fetched: 5 row(s)
hive (default)>
4. View the generated directories via the HDFS web UI
In other words, a multi-level partition simply adds one more level of directories.
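You can also confirm this from the command line (a sketch; the listing shows the expected layout rather than actual captured output):
hadoop fs -ls /user/hive/warehouse/order_mulit_partition/event_month=2014-05
# one sub-directory per second-level partition, e.g.
# /user/hive/warehouse/order_mulit_partition/event_month=2014-05/step=1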
Static Partition Example
Write the people of the same department into the same partition.
1. Create the department-partitioned table
hive (default)> CREATE TABLE emp_partition(
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> sal double,
> comm double
> )partitioned by (deptno int)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.042 seconds
hive (default)>
2. Load the employees of department 10 into the partition
hive (default)> insert into table emp_partition partition(deptno=10)
> select empno,ename,job,mgr,hiredate,sal,comm from ruozedata_emp where deptno=10;
Query ID = hadoop_20190723133535_7516f0b7-ac5e-4d97-832e-c5c6fe47e146
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1562553101223_0027, Tracking URL = http://hadoop001:8078/proxy/application_1562553101223_0027/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1562553101223_0027
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-07-23 15:30:37,119 Stage-1 map = 0%, reduce = 0%
2019-07-23 15:30:41,253 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.71 sec
MapReduce Total cumulative CPU time: 1 seconds 710 msec
Ended Job = job_1562553101223_0027
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop001:9000/user/hive/warehouse/emp_partition/deptno=10/.hive-staging_hive_2019-07-23_15-30-31_135_7477624047966631040-1/-ext-10000
Loading data to table default.emp_partition partition (deptno=10)
Partition default.emp_partition{deptno=10} stats: [numFiles=1, numRows=3, totalSize=160, rawDataSize=157]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.71 sec HDFS Read: 4916 HDFS Write: 248 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 710 msec
OK
empno ename job mgr hiredate sal comm
Time taken: 11.358 seconds
hive (default)> select * from emp_partition where deptno=10;
OK
emp_partition.empno emp_partition.ename emp_partition.job emp_partition.mgr emp_partition.hiredate emp_partition.sal emp_partition.comm emp_partition.deptno
7839 KING PRESIDENT NULL 1981-11-17 00:00:00 5000.0 NULL 10
7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450.0 NULL 10
7934 MILLER CLERK 7782 1982-01-23 00:00:00 1300.0 NULL 10
Time taken: 0.053 seconds, Fetched: 3 row(s)
hive (default)>
3. Load the employees of department 20 into the partition
hive (default)> insert into table emp_partition partition(deptno=20)
> select empno,ename,job,mgr,hiredate,sal,comm from ruozedata_emp where deptno=20;
Department 30 works the same way. But suppose there were 500 departments, would you really do this one partition at a time? That is why dynamic partitioning exists.
4. Finally, the resulting directories can be seen in the HDFS web UI
Dynamic Partitions
1. Create a dynamic partitioned table (the DDL is actually the same as for a static partitioned table)
hive (default)> create table emp_dynamic_partition(
> empno int,
> ename string,
> job string,
> mgr int,
> hiredate string,
> sal double,
> comm double
> )
> PARTITIONED BY (deptno int)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.034 seconds
hive (default)>
2. Dynamically load the data into a separate partition for each department
hive (default)> insert overwrite table emp_dynamic_partition partition(deptno)
> select empno,ename,job,mgr,hiredate,sal,comm,deptno from ruozedata_emp;
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
hive (default)>
The insert fails here because hive.exec.dynamic.partition.mode defaults to strict; it has to be changed to nonstrict.
hive (default)> set hive.exec.dynamic.partition.mode=nonstrict; // switch to nonstrict mode
hive (default)> insert overwrite table emp_dynamic_partition partition(deptno)
> select empno,ename,job,mgr,hiredate,sal,comm,deptno from ruozedata_emp;
Query ID = hadoop_20190723154545_1fb2f25a-c8e3-457b-89dc-2444f6f9a219
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1562553101223_0030, Tracking URL = http://hadoop001:8078/proxy/application_1562553101223_0030/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1562553101223_0030
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-07-23 15:45:07,421 Stage-1 map = 0%, reduce = 0%
2019-07-23 15:45:11,559 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.53 sec
MapReduce Total cumulative CPU time: 1 seconds 530 msec
Ended Job = job_1562553101223_0030
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop001:9000/user/hive/warehouse/emp_dynamic_partition/.hive-staging_hive_2019-07-23_15-45-01_093_9110160594754239270-1/-ext-10000
Loading data to table default.emp_dynamic_partition partition (deptno=null)
Time taken for load dynamic partitions : 340
Loading partition {deptno=20}
Loading partition {deptno=10}
Loading partition {deptno=30}
Time taken for adding to write entity : 3
Partition default.emp_dynamic_partition{deptno=10} stats: [numFiles=1, numRows=3, totalSize=160, rawDataSize=157]
Partition default.emp_dynamic_partition{deptno=20} stats: [numFiles=1, numRows=5, totalSize=264, rawDataSize=259]
Partition default.emp_dynamic_partition{deptno=30} stats: [numFiles=1, numRows=6, totalSize=336, rawDataSize=330]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.53 sec HDFS Read: 4873 HDFS Write: 974 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 530 msec
OK
empno ename job mgr hiredate sal comm deptno
Time taken: 13.272 seconds
hive (default)>
3. Look at the directories created on HDFS
[hadoop@hadoop001 ~]$ hadoop fs -ls /user/hive/warehouse/emp_dynamic_partition
19/07/23 15:47:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
drwxr-xr-x - hadoop supergroup 0 2019-07-23 15:45 /user/hive/warehouse/emp_dynamic_partition/deptno=10
drwxr-xr-x - hadoop supergroup 0 2019-07-23 15:45 /user/hive/warehouse/emp_dynamic_partition/deptno=20
drwxr-xr-x - hadoop supergroup 0 2019-07-23 15:45 /user/hive/warehouse/emp_dynamic_partition/deptno=30
[hadoop@hadoop001 ~]$
Note: for a multi-level dynamic partition, list the partition columns separated by commas in the partition() clause when loading data, and the SELECT must also end with those columns, in the same order. The single-level statement is repeated below for reference, with a multi-level sketch after it.
insert overwrite table emp_dynamic_partition partition(deptno)
select empno,ename,job,mgr,hiredate,sal,comm,deptno from ruozedata_emp;
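And a hedged sketch of what a two-level dynamic-partition insert would look like (the table emp_dynamic_mulit_partition, partitioned by (deptno int, job string), is hypothetical):
insert overwrite table emp_dynamic_mulit_partition partition(deptno, job)
select empno, ename, mgr, hiredate, sal, comm, deptno, job from ruozedata_emp;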
Final summary: 1. The insert statement for a dynamic partitioned table uses partition(deptno), not partition(deptno=XXX).
2. The number and order of columns in the SELECT clause must match the columns of the partitioned table.
3. The last column(s) of the SELECT clause must be the partition column(s).
4. No WHERE clause is needed.