Hive: getting the first row of the latest partition; getting data from the largest partition

Repost

karen 2024-03-12 13:17:09


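4.6 Partitioned tables

A partitioned table corresponds to separate directories on HDFS: each partition is its own directory holding all of that partition's data files. In other words, partitioning in Hive means splitting data into directories, dividing a large dataset into smaller ones according to business needs. When a query's WHERE clause names the partitions it needs, Hive reads only those directories, which is usually much faster than scanning the whole table.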
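
The mapping from partition values to directories can be sketched in a few lines of Python. This is only an illustration of the idea, not Hive's own code: the helper names partition_dir and prune are hypothetical, and the warehouse path is assumed to be the default one used elsewhere in these notes.

```python
def partition_dir(table_dir, spec):
    """Return the directory that holds one partition.

    table_dir: the table's HDFS directory (assumed default warehouse path).
    spec: partition columns and values, in PARTITIONED BY order.
    """
    parts = [f"{col}={val}" for col, val in spec.items()]
    return "/".join([table_dir.rstrip("/")] + parts)


def prune(partitions, **wanted):
    """Keep only the partitions whose values match a WHERE-style filter."""
    return [p for p in partitions
            if all(p.get(col) == val for col, val in wanted.items())]


table_dir = "/user/hive/warehouse/dept_partition"
partitions = [{"month": "201907"}, {"month": "201908"}, {"month": "201909"}]

# where month='201907' only needs one of the three directories
selected = prune(partitions, month="201907")
dirs_to_scan = [partition_dir(table_dir, p) for p in selected]
print(dirs_to_scan)
```

With the filter month='201907', only one of the three partition directories is left to scan, which is where the speed-up comes from.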

4.6.1 Basic operations on partitioned tables

1: Why partition (logs need to be managed by date)
/user/hive/warehouse/log_partition/20190702/20190702.log
/user/hive/warehouse/log_partition/20190703/20190703.log
/user/hive/warehouse/log_partition/20190704/20190704.log

2: Syntax for creating a partitioned table
hive (default)> create table dept_partition(deptno int, dname string, loc string)
> partitioned by (month string)
> row format delimited fields terminated by '\t';

3: Load data into the partitioned table
hive (default)> load data local inpath '/opt/datas/dept.txt' into table default.dept_partition partition(month='201907');
hive (default)> load data local inpath '/opt/datas/dept.txt' into table default.dept_partition partition(month='201908');
hive (default)> load data local inpath '/opt/datas/dept.txt' into table default.dept_partition partition(month='201909');

4: Query data in the partitioned table
Single-partition query
hive (default)> select * from dept_partition where month='201907';


Multi-partition query with union
hive (default)> select * from dept_partition where month='201908'
union
select * from dept_partition where month='201907'
union
select * from dept_partition where month='201909';


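5: Add partitions
Add a single partition
hive (default)> alter table dept_partition add partition(month='201906');
Add several partitions at once
hive (default)> alter table dept_partition add partition(month='201905') partition(month='201904');
Note: when adding several partitions, separate the partition clauses with a space; when dropping several, separate them with a comma.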

6: Drop partitions
Drop a single partition
hive (default)> alter table dept_partition drop partition (month='201904');
Drop several partitions at once
hive (default)> alter table dept_partition drop partition (month='201905'), partition (month='201906');
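
The difference in separators between adding and dropping is easy to get wrong when statements are generated by a script. The following Python sketch builds both statements; the function names are hypothetical, and values are not escaped, so it is only suitable for trusted inputs such as fixed month strings.

```python
def _spec_clause(spec):
    """Render {'month': '201905'} as partition(month='201905')."""
    inner = ", ".join(f"{col}='{val}'" for col, val in spec.items())
    return f"partition({inner})"


def add_partitions_sql(table, specs):
    """ADD: partition clauses are separated by a space."""
    clauses = " ".join(_spec_clause(s) for s in specs)
    return f"alter table {table} add {clauses};"


def drop_partitions_sql(table, specs):
    """DROP: partition clauses are separated by a comma."""
    clauses = ", ".join(_spec_clause(s) for s in specs)
    return f"alter table {table} drop {clauses};"


add_sql = add_partitions_sql("dept_partition", [{"month": "201905"}, {"month": "201904"}])
drop_sql = drop_partitions_sql("dept_partition", [{"month": "201905"}, {"month": "201906"}])
print(add_sql)
print(drop_sql)
```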

7: List the partitions of a partitioned table
hive> show partitions dept_partition;
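
One possible way to find the latest partition is to take the output of show partitions and pick the largest value. The Python sketch below assumes the output has already been captured as text, one partition per line in the col=value/col=value form; the helper names are hypothetical. Values are compared as strings, so it also assumes fixed-width values such as '201907'.

```python
def parse_partition(line):
    """Turn 'month=201909/day=10' into {'month': '201909', 'day': '10'}."""
    return dict(item.split("=", 1) for item in line.strip().split("/"))


def latest_partition(lines, key_order):
    """Return the partition with the largest values, compared column by column.

    String comparison is used, so values must be fixed-width
    (an assumption about how the table is populated).
    """
    specs = [parse_partition(line) for line in lines if line.strip()]
    if not specs:
        return None
    return max(specs, key=lambda s: tuple(s[k] for k in key_order))


# Assumed output of: show partitions dept_partition;
show_partitions_output = """
month=201907
month=201908
month=201909
"""

newest = latest_partition(show_partitions_output.splitlines(), ["month"])
query = "select * from dept_partition where month='{}' limit 1;".format(newest["month"])
print(query)
```

The generated query reads only the newest partition. Note that limit 1 without an order by returns an arbitrary row of that partition, so add an ordering if a specific "first" row is needed.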


8: View the structure of a partitioned table

hive> desc formatted dept_partition;


4.6.2 Notes on partitioned tables

1: Create a two-level partitioned table, partitioned first by month and then by day
hive (default)> create table dept_partition2(deptno int, dname string, loc string)
partitioned by (month string, day string)
row format delimited fields terminated by '\t';

2: Load data the normal way
(1) Load data into the two-level partitioned table
hive (default)> load data local inpath '/opt/datas/dept.txt' into table default.dept_partition2 partition(month='201909', day='10');
(2) Query the partition's data
hive (default)> select * from dept_partition2 where month='201909' and day='10';


3: Three ways to upload data directly into a partition directory and associate it with the partitioned table
(1) Method 1: upload the data, then repair
Upload the data
hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201909/day=11;
hive (default)> dfs -put /opt/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201909/day=11;
Query the data (the newly uploaded data is not returned)
hive (default)> select * from dept_partition2 where month='201909' and day='11';
Run the repair command
hive> msck repair table dept_partition2;
Query the data again
hive (default)> select * from dept_partition2 where month='201909' and day='11';
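
The first query returns nothing because the directory exists on HDFS but the metastore has no record of the partition. The Python sketch below models that gap: it compares the directories found under the table with the partitions already registered and reports the ones a repair would need to add. It is a simplified model of the idea, not how msck repair is implemented, and the function names and sample data are assumptions.

```python
def spec_from_path(table_dir, path):
    """Parse '<table_dir>/month=201909/day=11' into a tuple of (col, value) pairs."""
    relative = path[len(table_dir):].strip("/")
    return tuple(tuple(part.split("=", 1)) for part in relative.split("/"))


def missing_from_metastore(table_dir, hdfs_dirs, registered):
    """Partitions that exist as directories but are unknown to the metastore."""
    found = {spec_from_path(table_dir, d) for d in hdfs_dirs}
    return sorted(found - set(registered))


table_dir = "/user/hive/warehouse/dept_partition2"
hdfs_dirs = [
    table_dir + "/month=201909/day=10",  # created by LOAD DATA, already registered
    table_dir + "/month=201909/day=11",  # created by hand with dfs -mkdir / -put
]
registered = {(("month", "201909"), ("day", "10"))}

to_register = missing_from_metastore(table_dir, hdfs_dirs, registered)
print(to_register)
```

Only the day=11 partition is reported, which matches the behavior seen above: it becomes queryable once it is registered.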


(2) Method 2: upload the data, then add the partition
Upload the data
hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201909/day=12;
hive (default)> dfs -put /opt/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201909/day=12;
Add the partition
hive (default)> alter table dept_partition2 add partition(month='201909', day='12');
Query the data
hive (default)> select * from dept_partition2 where month='201909' and day='12';


(3) Method 3: create the directory, then load data into the partition
Create the directory
hive (default)> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201909/day=13;
Load the data
hive (default)> load data local inpath '/opt/datas/dept.txt' into table dept_partition2 partition(month='201909',day='13');
Query the data
hive (default)> select * from dept_partition2 where month='201909' and day='13';


4.7 Altering tables

4.7.1 Renaming a table

(1) Syntax

ALTER TABLE table_name RENAME TO new_table_name

(2) Example
hive (default)> alter table dept_partition2 rename to dept_partition3;

4.7.2 Adding and dropping table partitions

See 4.6.1, Basic operations on partitioned tables.

4.7.3 Adding, changing, and replacing columns

1: Syntax
Change a column
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
Add or replace columns
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
Note: ADD appends new columns after all existing columns (but before the partition columns); REPLACE replaces all of the table's columns.
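
To make the difference between the three operations concrete, here is a small Python model of a table's column list. It only mimics the effect on the schema (these statements change table metadata, not the data files); the function names are hypothetical, and the sample calls follow the statements used in the worked example of this section.

```python
def add_columns(columns, new_columns):
    """ADD COLUMNS: append after the existing (non-partition) columns."""
    return columns + new_columns


def change_column(columns, old_name, new_name, new_type):
    """CHANGE COLUMN: rename and/or retype one column in place."""
    return [(new_name, new_type) if name == old_name else (name, col_type)
            for name, col_type in columns]


def replace_columns(columns, new_columns):
    """REPLACE COLUMNS: discard the old column list and use the new one."""
    return list(new_columns)


cols = [("deptno", "int"), ("dname", "string"), ("loc", "string")]
cols = add_columns(cols, [("deptdesc", "string")])    # add columns(deptdesc string)
cols = change_column(cols, "deptdesc", "pid", "int")  # change column deptdesc pid int
after_change = cols
cols = replace_columns(cols, [("deptno", "string"), ("dname", "string"), ("loc", "string")])
print(after_change)
print(cols)
```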

2: Examples
(1) View the table structure
hive> desc dept_partition;


(2) Add a column

hive (default)> alter table dept_partition add columns(deptdesc string);

(3) View the table structure

hive> desc dept_partition;


(4) Change a column

hive (default)> alter table dept_partition change column deptdesc pid int;

(5) View the table structure

hive> desc dept_partition;


(6) Replace columns

hive (default)> alter table dept_partition replace columns(deptno string, dname string, loc string);

(7) View the table structure

hive> desc dept_partition;


4.8 Dropping a table

hive (default)> drop table dept_partition;

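5 DML data operations

5.1 Importing data

5.1.1 Loading data into a table (Load)

1: Syntax

hive> load data [local] inpath '/opt/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,…)];

(1) load data: load data into a table

(2) local: load from the local file system into the Hive table; without it, load from HDFS

(3) inpath: the path of the data to load

(4) overwrite: overwrite the data already in the table; without it, append

(5) into table: which table to load into

(6) student: the name of the target table

(7) partition: load into the specified partition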
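
The optional parts of the statement compose in a fixed order, which the following Python sketch makes explicit. The builder function is hypothetical and does no escaping of paths or values; it simply assembles the keywords described above.

```python
def load_data_sql(path, table, local=False, overwrite=False, partition=None):
    """Assemble a LOAD DATA statement from the options described above."""
    parts = ["load data"]
    if local:
        parts.append("local")          # read from the local file system
    parts.append(f"inpath '{path}'")
    if overwrite:
        parts.append("overwrite")      # replace existing data instead of appending
    parts.append(f"into table {table}")
    if partition:
        spec = ", ".join(f"{col}='{val}'" for col, val in partition.items())
        parts.append(f"partition({spec})")
    return " ".join(parts) + ";"


local_load = load_data_sql("/opt/datas/student.txt", "default.student", local=True)
hdfs_overwrite = load_data_sql("/user/hive/student.txt", "default.student", overwrite=True)
print(local_load)
print(hdfs_overwrite)
```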

2: Examples
(0) Create a table
hive (default)> create table student(id string, name string) row format delimited fields terminated by '\t';
(1) Load a local file into Hive
hive (default)> load data local inpath '/opt/datas/student.txt' into table default.student;
(2) Load an HDFS file into Hive
Upload the file to HDFS
hive (default)> dfs -put /opt/datas/student.txt /user/hive;
Load the data from HDFS
hive (default)> load data inpath '/user/hive/student.txt' into table default.student;

(3) Load data, overwriting the table's existing data
Upload the file to HDFS
hive (default)> dfs -put /opt/datas/student.txt /user/hive;
Load the data, overwriting the existing data
hive (default)> load data inpath '/user/hive/student.txt' overwrite into table default.student;
Note: loading from HDFS is equivalent to moving (mv) the file into the table's directory; the file disappears from its original location.

5.1.2 Inserting data into a table from a query (Insert)
1: Create a partitioned table
hive (default)> drop table student;
hive (default)> create table student(id int, name string) partitioned by (month string) row format delimited fields terminated by '\t';
2: Basic insert
hive (default)> insert into table student partition(month='201907') values(1,'wangwu');
3: Basic mode insert (from the result of a query on a single table)
hive (default)> insert overwrite table student partition(month='201908') select id, name from student where month='201907';
4: Multi-insert mode (scan the source table once and write to several partitions)
hive (default)> from student
insert overwrite table student partition(month='201909') select id, name where month='201907'
insert overwrite table student partition(month='201910') select id, name where month='201908';


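5.1.3 Creating a table and loading data in one query (As Select)
See section 4.5.1, Creating tables.
Create a table from a query result (the result rows are written into the newly created table)
create table if not exists student3 as select id, name from student;
A table created this way cannot be an external table. Adding the external keyword fails with:
CREATE-TABLE-AS-SELECT cannot create external table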

5.1.4 Specifying the data path with Location when creating a table
1: Create the table and specify its location on HDFS
hive (default)>
create table if not exists student5(id int, name string)
row format delimited fields terminated by '\t'
location '/user/hive/warehouse/student5';

2: Upload data to HDFS
hive (default)> dfs -put /opt/datas/student.txt /user/hive/warehouse/student5;
3: Query the data
hive (default)> select * from student5;


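5.1.5 Importing data into a specified Hive table (Import)
Note: run export first (the exported directory carries the metadata along with the data); on HDFS this is a copy-level operation. Only the export step is shown here; the matching import takes the form IMPORT TABLE table_name FROM 'export directory'.
hive (default)> export table default.student to '/user/hive/warehouse/export/student1';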


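5.2 Exporting data
5.2.1 Exporting with Insert
1) Export the query result to a local directory. The fields appear to run together because the default field delimiter is the non-printing character \001 (Ctrl-A)
hive (default)> insert overwrite local directory '/opt/datas/export/student1' select * from student;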
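
Files exported without a ROW FORMAT clause look as if they have no separators because \001 is not displayed by most tools. A minimal Python sketch for splitting such a line is shown below; the sample row is made up for illustration (id, name, and the month partition column), and the function name is hypothetical.

```python
def parse_exported_line(line, delimiter="\x01"):
    """Split one exported text line into fields (default delimiter: Ctrl-A)."""
    return line.rstrip("\n").split(delimiter)


# Hypothetical line from an export of the student table
default_export = "1\x01wangwu\x01201907\n"
tab_export = "1\twangwu\t201907\n"

fields = parse_exported_line(default_export)
tab_fields = parse_exported_line(tab_export, delimiter="\t")
print(fields)
print(tab_fields)
```

Passing delimiter="\t" handles files exported with ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'.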


2) Export the query result to a local directory with formatting, fields separated by "\t"
hive (default)> insert overwrite local directory '/opt/datas/student2'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select * from student;


3) Export the query result to HDFS (no local keyword)
hive (default)> insert overwrite directory '/user/student2'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select * from student;


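Note: although the source and the target are both on HDFS, this is not a copy operation; the output files are produced by running the query.

5.2.2 Exporting to the local file system with a Hadoop command
hive (default)> dfs -get /user/hive/warehouse/student/month=201909/000000_0 /opt/datas/export/student3.txt;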


5.2.3 Exporting with the Hive shell command
Basic syntax: hive -f/-e followed by a script file or a statement, redirected with > to a file you name yourself
[itstar@bigdata111 hive]$ /opt/mod/hive/bin/hive -e 'select * from default.student;' > /opt/datas/export/student4.txt;
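
The same export can be driven from a script. The Python sketch below wraps the hive -e call with subprocess and writes standard output to a file, which is what the shell redirect does. The function names are hypothetical, and the hive binary path is the one from the example above; running export_query requires a working Hive installation.

```python
import subprocess


def hive_export_command(query, hive_bin="hive"):
    """Build the argument list for: hive -e '<query>'."""
    return [hive_bin, "-e", query]


def export_query(query, out_path, hive_bin="hive"):
    """Run the query and write its standard output to out_path (like '> file')."""
    with open(out_path, "w") as out:
        subprocess.run(hive_export_command(query, hive_bin), stdout=out, check=True)


cmd = hive_export_command("select * from default.student;", "/opt/mod/hive/bin/hive")
print(cmd)
# To actually run it (needs Hive installed):
# export_query("select * from default.student;", "/opt/datas/export/student4.txt",
#              "/opt/mod/hive/bin/hive")
```

Passing the query as a separate list element avoids shell quoting problems with the quotes inside the statement.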


5.2.4 Exporting to HDFS with Export
hive (default)> export table default.student to '/user/hive/warehouse/export/student2';


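5.2.5 Exporting with Sqoop
Covered in a later chapter.
5.3 Clearing the data in a table (Truncate)
Note: Truncate can only clear data in managed tables; it cannot delete data in external tables.
hive (default)> truncate table student;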
