1. Introduction to partition tables

When working with data in Hive, a very large table means every query has to scan the whole table, which is clearly a heavy burden on the machine. If we could instead classify the data and only scan the matching class at query time, the problem would be solved effectively. This is exactly why Hive introduces the concept of partitions beneath a table.
Each partition of a partitioned table corresponds to one directory on HDFS.
Partitioned tables are divided into static and dynamic partition tables, according to whether partitions are created automatically.
A multi-level partitioned table is one declared with, for example, PARTITIONED BY (event_month string, loc string); directories event_month=XXX/loc=XXX are created in cascade following that order, and everything else works the same as a single-level partitioned table.
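
As a minimal sketch (the table and column names here are purely illustrative and are not used in the examples below), a two-level partitioned table and the directory layout it produces would look roughly like this:

create table order_loc_partition(
order_no string,
order_time string
)partitioned by (event_month string, loc string)
row format delimited fields terminated by '\t';
-- loading a partition such as (event_month='2019-07', loc='bj') would create
-- the directory .../order_loc_partition/event_month=2019-07/loc=bj/ on HDFS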

2. Static partition tables

  • Create a partitioned table:
hive (default)> create table  order_partition(
           > oder_no  string,
           > oder_time string
           > )partitioned by (event_month string)
           > row format delimited fields terminated by '\t';  // create a static partitioned table

OK
Time taken: 0.142 seconds
hive (default)> load data local inpath '/home/hadoop/data/order_created.txt' overwrite into table order_partition partition (event_month='2019-07'); // load local data into the partition
Loading data to table default.order_partition partition (event_month=2019-07)
Partition default.order_partition{event_month=2019-07} stats: [numFiles=1, numRows=0, totalSize=213, rawDataSize=0]
OK
Time taken: 0.652 seconds
hive (default)>

As shown in the figure below, an event_month=2019-07 directory has been created under the order_partition/ directory, and after loading the data there is an order_created.txt file inside it.

[Figure: order_partition/event_month=2019-07 directory on HDFS containing order_created.txt]


Note: while practicing earlier, loading data kept failing with the error below:

hive (default)>
              > load data local inpath '/home/hadoop/data/order_created.txt' overwrite into table order_partition partition (event_month='2019-07');
Loading data to table default.order_partition partition (event_month=2019-07)
Failed with exception MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask

It turned out to be a compatibility problem with the MySQL driver: the MySQL database that stores the metastore never got the partitions table created. The mysql-connector-java-8.0.16.jar I had been using was too new; after switching to a 5.x version and rerunning the Hive command, the partitions table appeared in the metastore database and the data load succeeded.
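
A quick sanity check, assuming the metastore database is named metastore (as the error message later in this post suggests), is to list the metastore tables directly in MySQL; if the driver problem above has occurred, the partitions table will be missing from the output:

mysql> use metastore;
mysql> show tables;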

Check the directory that was created on HDFS when the table was created:

[hadoop@hadoop001 ~]$ hadoop fs -ls /user/hive/warehouse/order_partition/
19/07/23 14:27:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2019-07-23 13:37 /user/hive/warehouse/order_partition/event_month=2019-07
  • Upload local data to the 2019-08 partition by putting the file directly onto HDFS
    1. First create a directory on HDFS for the 2019-08 partition
[hadoop@hadoop001 ~]$ hadoop fs -mkdir -p  /user/hive/warehouse/order_partition/event_month=2019-08
 19/07/23 14:30:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 [hadoop@hadoop001 ~]$ hadoop fs -ls /user/hive/warehouse/order_partition/
 19/07/23 14:30:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 Found 2 items
 drwxr-xr-x   - hadoop supergroup          0 2019-07-23 13:37 /user/hive/warehouse/order_partition/event_month=2019-07
 drwxr-xr-x   - hadoop supergroup          0 2019-07-23 14:30 /user/hive/warehouse/order_partition/event_month=2019-08
 [hadoop@hadoop001 ~]$

2. Put the data file into that partition directory

[hadoop@hadoop001 ~]$ hadoop fs -put /home/hadoop/data/order_created.txt   /user/hive/warehouse/order_partition/event_month=2019-08 
19/07/23 14:32:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 ~]$ hadoop fs -ls    /user/hive/warehouse/order_partition/event_month=2019-08
19/07/23 14:35:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 hadoop supergroup        213 2019-07-23 14:32 /user/hive/warehouse/order_partition/event_month=2019-08/order_created.txt
3. Querying the table shows that no data appears
hive (default)> select * from order_partition where event_month='2019-08';
OK
order_partition.oder_no order_partition.oder_time       order_partition.event_month
Time taken: 0.048 seconds
hive (default)>

4. Check in MySQL whether any metadata was recorded

mysql> select * from  tbls;
+--------+-------------+-------+------------------+--------+-----------+-------+-----------------+---------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID | TBL_NAME        | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+--------+-----------+-------+-----------------+---------------+--------------------+--------------------+
|      6 |  1563860203 |     1 |                0 | hadoop |         0 |     6 | order_partition | MANAGED_TABLE | NULL               | NULL               |
+--------+-------------+-------+------------------+--------+-----------+-------+-----------------+---------------+--------------------+--------------------+
1 row in set (0.00 sec)
mysql> select * from partition_key_vals;
+---------+--------------+-------------+
| PART_ID | PART_KEY_VAL | INTEGER_IDX |
+---------+--------------+-------------+
|       1 | 2019-07      |           0 |
+---------+--------------+-------------+
1 row in set (0.00 sec)

mysql>

5. At this point the metadata in MySQL has to be repaired before Hive can return the corresponding data.
This can be done with the command: MSCK REPAIR TABLE table_name;

hive (default)> msck repair table order_partition;
OK
Partitions not in metastore:    order_partition:event_month=2019-08
Repair: Added partition to metastore order_partition:event_month=2019-08
Time taken: 0.097 seconds, Fetched: 2 row(s)
hive (default)>

This command syncs the 2019-08 partition from HDFS into the MySQL metastore, so the data can then be queried from Hive.
It has a fatal drawback, though: it rescans and syncs every partition directory of the table, not just the new one. If a table has been accumulating partitions for years, the command can take forever to respond, so it is far too heavy-handed and is forbidden in production.

So the method above is not advisable. The following alternative can be used instead:
1. Create a new partition directory event_month=2019-09 on HDFS and put the local data into it

[hadoop@hadoop001 ~]$ hadoop fs -mkdir -p  /user/hive/warehouse/order_partition/event_month=2019-09
19/07/23 14:49:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 ~]$ hadoop fs -put /home/hadoop/data/order_created.txt   /user/hive/warehouse/order_partition/event_month=2019-09 
19/07/23 14:49:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 ~]$

2. Check in Hive whether the table now has data

hive (default)> select * from order_partition where event_month='2019-09';
OK
order_partition.oder_no order_partition.oder_time       order_partition.event_month
Time taken: 0.043 seconds
hive (default)>

3. Add a partition to refresh the metadata in MySQL (this is the refresh method used in production, because it only refreshes the metadata of the one partition being added)

hive (default)> alter table order_partition add  if not exists partition (event_month='2019-09');
OK
Time taken: 0.059 seconds
hive (default)> select * from order_partition where event_month='2019-09'; // query again and the data is now visible
OK
order_partition.oder_no order_partition.oder_time       order_partition.event_month
10703007267488  2014-05-01 06:01:12.334+01      2019-09
10101043505096  2014-05-01 07:28:12.342+01      2019-09
10103043509747  2014-05-01 07:50:12.33+01       2019-09
10103043501575  2014-05-01 09:27:12.33+01       2019-09
10104043514061  2014-05-01 09:03:12.324+01      2019-09
Time taken: 0.044 seconds, Fetched: 5 row(s)
hive (default)>

4. List all partitions of the table:

hive (default)> show partitions order_partition;
OK
partition
event_month=2019-07
event_month=2019-08
event_month=2019-09
Time taken: 0.034 seconds, Fetched: 3 row(s)
hive (default)>

Summary:
1. Both of the SQL statements above can return the event_month column.
2. hdfs dfs -cat on the underlying file, however, does not show that column, which means the partition column of a partitioned table is a pseudo-column.
3. It is really just the partition directory on HDFS made visible.
4. The real table fields do not include the partition field; the partition field is only the name of a folder on HDFS.
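
As a concrete check (a sketch of the expected output, assuming the same order_created.txt used above), cat'ing the file under one of the partition directories shows only the two real columns and no event_month value:

[hadoop@hadoop001 ~]$ hadoop fs -cat /user/hive/warehouse/order_partition/event_month=2019-09/order_created.txt
10703007267488  2014-05-01 06:01:12.334+01
10101043505096  2014-05-01 07:28:12.342+01
...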

3. Multi-level partitions

1. Create a multi-level partitioned table

hive (default)> create table  order_mulit_partition(
          > oder_no  string,
          > oder_time string
          > )partitioned by (event_month string ,step string)
          > row format delimited fields terminated by '\t';
OK
Time taken: 0.063 seconds
hive (default)>

2. Load data into the partitioned table

hive (default)> load data local inpath '/home/hadoop/data/order_created.txt' into table order_mulit_partition PARTITION (event_month='2014-05', step='1');
Loading data to table default.order_mulit_partition partition (event_month=2014-05, step=1)
Partition default.order_mulit_partition{event_month=2014-05, step=1} stats: [numFiles=1, numRows=0, totalSize=213, rawDataSize=0]
OK
Time taken: 0.24 seconds
hive (default)>

3. Query the data in the partitioned table

hive (default)> select * from  order_mulit_partition  where event_month='2014-05' and step='1';
OK
order_mulit_partition.oder_no   order_mulit_partition.oder_time order_mulit_partition.event_month       order_mulit_partition.step
10703007267488  2014-05-01 06:01:12.334+01      2014-05 1
10101043505096  2014-05-01 07:28:12.342+01      2014-05 1
10103043509747  2014-05-01 07:50:12.33+01       2014-05 1
10103043501575  2014-05-01 09:27:12.33+01       2014-05 1
10104043514061  2014-05-01 09:03:12.324+01      2014-05 1
Time taken: 0.04 seconds, Fetched: 5 row(s)
hive (default)>

4. View the generated directories through the HDFS web UI

[Figure: HDFS web UI showing order_mulit_partition/event_month=2014-05/step=1]


In other words, multi-level partitioning simply creates one more level of directories.
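
The same nesting can also be checked from the command line; a sketch of what the listing should look like (timestamp omitted):

[hadoop@hadoop001 ~]$ hadoop fs -ls /user/hive/warehouse/order_mulit_partition/event_month=2014-05
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 ... /user/hive/warehouse/order_mulit_partition/event_month=2014-05/step=1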

Static partitioning example

Write employees of the same department into one partition.
1. Create a department-partitioned table

hive (default)> CREATE TABLE emp_partition(
              > empno int,
              > ename string,
              > job string,
              > mgr int,
              > hiredate string,
              > sal double,
              > comm double
              > )partitioned by (deptno int)
              > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.042 seconds
hive (default)>

2. Load the employees of department 10 into the partitioned table

hive (default)> insert into table emp_partition partition(deptno=10)
          > select empno,ename,job,mgr,hiredate,sal,comm from ruozedata_emp where deptno=10;
Query ID = hadoop_20190723133535_7516f0b7-ac5e-4d97-832e-c5c6fe47e146
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1562553101223_0027, Tracking URL = http://hadoop001:8078/proxy/application_1562553101223_0027/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job  -kill job_1562553101223_0027
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-07-23 15:30:37,119 Stage-1 map = 0%,  reduce = 0%
2019-07-23 15:30:41,253 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.71 sec
MapReduce Total cumulative CPU time: 1 seconds 710 msec
Ended Job = job_1562553101223_0027
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop001:9000/user/hive/warehouse/emp_partition/deptno=10/.hive-staging_hive_2019-07-23_15-30-31_135_7477624047966631040-1/-ext-10000
Loading data to table default.emp_partition partition (deptno=10)
Partition default.emp_partition{deptno=10} stats: [numFiles=1, numRows=3, totalSize=160, rawDataSize=157]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.71 sec   HDFS Read: 4916 HDFS Write: 248 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 710 msec
OK
empno   ename   job     mgr     hiredate        sal     comm
Time taken: 11.358 seconds
hive (default)> select * from emp_partition where deptno=10;
OK
emp_partition.empno     emp_partition.ename     emp_partition.job       emp_partition.mgr       emp_partition.hiredate  emp_partition.sal       emp_partition.comm      emp_partition.deptno
7839    KING    PRESIDENT       NULL    1981-11-17 00:00:00     5000.0  NULL    10
7782    CLARK   MANAGER 7839    1981-06-09 00:00:00     2450.0  NULL    10
7934    MILLER  CLERK   7782    1982-01-23 00:00:00     1300.0  NULL    10
Time taken: 0.053 seconds, Fetched: 3 row(s)
hive (default)>

3. Load the employees of department 20 into the partitioned table

hive (default)> insert into table emp_partition partition(deptno=20)
          > select empno,ename,job,mgr,hiredate,sal,comm from ruozedata_emp where deptno=20;

Department 30 works the same way (a sketch follows below). But suppose there were 500 departments; would we really do this one department at a time? That is why dynamic partitions exist.
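
A sketch of the corresponding statement for department 30, following exactly the pattern used for departments 10 and 20 (not taken from an actual run):

hive (default)> insert into table emp_partition partition(deptno=30)
              > select empno,ename,job,mgr,hiredate,sal,comm from ruozedata_emp where deptno=30;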

4. Finally, the resulting directories can be seen through the HDFS web UI

[Figure: HDFS web UI showing the emp_partition partition directories]

Dynamic partitions

1. Create a dynamic partition table (the DDL is actually the same as for a static partitioned table)

hive (default)> create table emp_dynamic_partition(
          > empno int,
          > ename string,
          > job string,
          > mgr int,
          > hiredate string,
          > sal double,
          > comm double
          > )
          > PARTITIONED BY (deptno int)
          > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.034 seconds
hive (default)>

2. Dynamically load data into the different department partitions

hive (default)> insert overwrite table emp_dynamic_partition partition(deptno)
              > select empno,ename,job,mgr,hiredate,sal,comm,deptno from ruozedata_emp;
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
hive (default)> 
The load fails here because hive.exec.dynamic.partition.mode defaults to strict; it must be changed to hive.exec.dynamic.partition.mode=nonstrict.

hive (default)> set hive.exec.dynamic.partition.mode=nonstrict;  // switch to non-strict mode
hive (default)> insert overwrite table emp_dynamic_partition partition(deptno)
              > select empno,ename,job,mgr,hiredate,sal,comm,deptno from ruozedata_emp;
Query ID = hadoop_20190723154545_1fb2f25a-c8e3-457b-89dc-2444f6f9a219
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1562553101223_0030, Tracking URL = http://hadoop001:8078/proxy/application_1562553101223_0030/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job  -kill job_1562553101223_0030
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-07-23 15:45:07,421 Stage-1 map = 0%,  reduce = 0%
2019-07-23 15:45:11,559 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.53 sec
MapReduce Total cumulative CPU time: 1 seconds 530 msec
Ended Job = job_1562553101223_0030
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop001:9000/user/hive/warehouse/emp_dynamic_partition/.hive-staging_hive_2019-07-23_15-45-01_093_9110160594754239270-1/-ext-10000
Loading data to table default.emp_dynamic_partition partition (deptno=null)
         Time taken for load dynamic partitions : 340
        Loading partition {deptno=20}
        Loading partition {deptno=10}
        Loading partition {deptno=30}
         Time taken for adding to write entity : 3
Partition default.emp_dynamic_partition{deptno=10} stats: [numFiles=1, numRows=3, totalSize=160, rawDataSize=157]
Partition default.emp_dynamic_partition{deptno=20} stats: [numFiles=1, numRows=5, totalSize=264, rawDataSize=259]
Partition default.emp_dynamic_partition{deptno=30} stats: [numFiles=1, numRows=6, totalSize=336, rawDataSize=330]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.53 sec   HDFS Read: 4873 HDFS Write: 974 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 530 msec
OK
empno   ename   job     mgr     hiredate        sal     comm    deptno
Time taken: 13.272 seconds
hive (default)>

3. Check the resulting directories on HDFS

[hadoop@hadoop001 ~]$ hadoop fs -ls /user/hive/warehouse/emp_dynamic_partition
19/07/23 15:47:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
drwxr-xr-x   - hadoop supergroup          0 2019-07-23 15:45 /user/hive/warehouse/emp_dynamic_partition/deptno=10
drwxr-xr-x   - hadoop supergroup          0 2019-07-23 15:45 /user/hive/warehouse/emp_dynamic_partition/deptno=20
drwxr-xr-x   - hadoop supergroup          0 2019-07-23 15:45 /user/hive/warehouse/emp_dynamic_partition/deptno=30
[hadoop@hadoop001 ~]$

Note: if the dynamic partition table is multi-level, the additional partition columns are simply listed after deptno in partition(...), separated by commas, and the columns at the end of the select clause must follow that same order (a multi-level sketch follows the single-level statement repeated below).

insert overwrite table emp_dynamic_partition partition(deptno)
select empno,ename,job,mgr,hiredate,sal,comm,deptno from ruozedata_emp;
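
For a multi-level dynamic partition table such as order_mulit_partition created earlier, a hypothetical sketch would look like this (orders_src is an assumed source table that carries event_month and step columns; the nonstrict setting above still applies):

insert overwrite table order_mulit_partition partition(event_month, step)
select oder_no, oder_time, event_month, step from orders_src;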

Final summary:
1. The insert statement for a dynamic partition table uses partition(deptno), not partition(deptno=XXX).
2. The number and order of the columns returned by the select clause must match the columns of the partition table.
3. The last column(s) of the select clause must be the partition column(s) of the partition table.
4. A where clause is no longer needed.