An extra row appears after loading a file into Hive

Reposted

karen 2024-11-28 08:55:51


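Hive Basics

1. Introduction to Hive

1.1 A quick look at Hive
1.2 Hive compared with traditional databases
1.3 Hive architecture

2. The Hive metastore service

2.1 Metadata and the metastore
2.2 Metastore configuration modes

3. Starting Hive
4. Hive data concepts

4.1 Data types
4.2 Explicit type casting
4.3 How Hive reads and writes data
4.4 Hive default storage path

5. Categories of SQL statements

5.1 DDL: data definition language
5.2 DML: data manipulation language
5.3 DQL: data query language

6. Simple Hive tables

6.1 LazySimpleSerDe delimiters
6.2 Internal tables
6.3 External tables
6.4 Differences between internal and external tables

7. Hive partitioned tables

7.1 Creating a partitioned table
7.2 Loading partition data: static partitioning
7.3 Loading partition data: dynamic partitioning
7.4 Characteristics of partitioned tables
7.5 Multi-level partitioned tables

8. Bucketed tables

8.1 Creating a bucketed table
8.2 Characteristics of bucketed tables

9. Basic DDL operations

9.1 Database operations
9.2 Table operations
9.3 Partition operations

10. Basic DML operations

10.1 load
10.2 insert

11. Basic DQL operations

11.1 Basic queries
11.2 Sorting
11.3 union queries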

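1. Introduction to Hive

1.1 A quick look at Hive

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like querying. In essence, it translates SQL into MapReduce programs: Hive stores data in HDFS and uses MapReduce to query and analyze it.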

1.2 Hive compared with traditional databases

[Figure: comparison of Hive and traditional databases]

1.3 Hive architecture

[Figure: Hive architecture diagram]

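2. The Hive metastore service

2.1 Metadata and the metastore

Metadata is data about data. In Hive it covers the databases, tables, columns, and other objects created through Hive. It is stored in a relational database, such as Hive's embedded Derby or an external database like MySQL.

The metastore is the metadata service. Clients connect to the metastore service, and the metastore in turn connects to the MySQL database to read and write metadata. With a metastore service in place, multiple clients can connect at the same time, and they do not need to know the MySQL user name or password; they only need to connect to the metastore service.

2.2 Metastore configuration modes

The metastore service can be configured in three modes: embedded, local, and remote.

Remote mode is the recommended deployment for production.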


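Embedded mode
Embedded mode stores metadata in the embedded Derby database and does not need a separate metastore service. It is simple to configure, but only one client can connect at a time, so it suits experiments rather than production.
Drawback: starting Hive from different directories gives each Hive instance its own set of metadata, which cannot be shared.
Local mode
Local mode stores metadata in an external database; I use MySQL. No separate metastore service is needed, because the metastore runs in the same process as Hive. In other words, starting a Hive service also starts a metastore service inside it.
Drawback: every Hive service you start brings up its own embedded metastore.
Remote mode
In remote mode the metastore service is started separately, and each client is configured in its configuration file to connect to it. Other software that depends on Hive can then reach Hive through the metastore.
In remote mode the metastore service and Hive run in different processes.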

3. Starting Hive

# My Hive installation lives at /export/server/apache-hive-3.1.2-bin/bin/hive
# If the Hive environment variables are configured, hive can replace /export/server/apache-hive-3.1.2-bin/bin/hive
# The same applies to beeline

# Start in the foreground; stop with Ctrl+C
/export/server/apache-hive-3.1.2-bin/bin/hive --service metastore
hive --service metastore

# Start in the foreground with debug logging
/export/server/apache-hive-3.1.2-bin/bin/hive --service metastore --hiveconf hive.root.logger=DEBUG,console
hive --service metastore --hiveconf hive.root.logger=DEBUG,console

# Start in the background; stop with kill -9 <process id>
nohup /export/server/apache-hive-3.1.2-bin/bin/hive --service metastore &
nohup hive --service metastore &

# Start the hiveserver2 service
nohup /export/server/apache-hive-3.1.2-bin/bin/hive --service hiveserver2 &
nohup hive --service hiveserver2 &

# Start the beeline client
/export/server/apache-hive-3.1.2-bin/bin/beeline
beeline

# Connect remotely from inside beeline
!connect jdbc:hive2://node1:10000

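4. Hive data concepts

4.1 Data types

Hive data types fall into two groups: primitive data types and complex data types.


Primitive data types	numeric, date/time, string, and miscellaneous types
Complex data types	array, map, struct, and union

Notes on Hive data types

Type names are case-insensitive.
Besides SQL data types, Hive also supports Java-style data types such as string.
int and string are the most commonly used types, and most functions support them.
Complex data types usually need to be used together with the delimiter clauses.
If the declared type does not match the data in the file, Hive attempts an implicit conversion, but success is not guaranteed.
Hive does not allow implicit conversion from a wider type to a narrower type.

4.2 Explicit type casting

The cast function converts a value to the specified type explicitly.

select cast('1314' as decimal(6,2));	-- returns 1314.00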
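As a rough illustration of what the cast above does, here is a small Python model of cast(... as decimal(p,s)). It is a sketch, not Hive code: the function name cast_to_decimal is hypothetical, None stands in for NULL, and both the half-up rounding and the NULL result on overflow are assumptions about Hive's decimal handling that are worth checking against your Hive version.

```python
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP


def cast_to_decimal(text, precision, scale):
    """Model cast(text as decimal(precision, scale)); None plays the role of NULL."""
    try:
        value = Decimal(text)
        if not value.is_finite():
            return None
        # The quantum has `scale` digits after the decimal point, e.g. 0.01 for scale=2.
        quantum = Decimal(1).scaleb(-scale)
        rounded = value.quantize(quantum, rounding=ROUND_HALF_UP)
    except InvalidOperation:
        # Text that is not a number cannot be converted.
        return None
    # Only (precision - scale) digits are available before the decimal point.
    if abs(rounded) >= Decimal(10) ** (precision - scale):
        return None
    return rounded


print(cast_to_decimal("1314", 6, 2))
print(cast_to_decimal("abc", 6, 2))
print(cast_to_decimal("13140", 6, 2))
```

In this model, a string that is not numeric, or a value with too many integer digits for decimal(6,2), yields None instead of raising an error.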

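4.3 How Hive reads and writes data

SerDe is short for Serializer/Deserializer. Serialization turns an object into a sequence of bytes; deserialization turns a sequence of bytes back into an object. Hive uses a SerDe (together with a FileFormat) to read and write row objects.

How Hive reads a file: it first calls the InputFormat (TextInputFormat by default), which returns key-value records one at a time (by default, one line is one record). It then calls the Deserializer of the SerDe (LazySimpleSerDe by default), which splits the value of each record into fields according to the delimiters.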


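How Hive writes a file: when writing a row, it first calls the Serializer of the SerDe (LazySimpleSerDe by default) to turn the object into a byte sequence, then calls the OutputFormat to write the data to a file in HDFS.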
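The write path can be pictured with a few lines of Python. This is a simplified model of text serialization with Hive's default delimiters (\001 between fields, \002 between collection items, \003 between map keys and values, and \N for NULL), not the LazySimpleSerDe implementation. The function names are hypothetical, and the sketch handles only one level of nesting.

```python
FIELD_DELIM = "\x01"       # default field delimiter (\001, displayed as ^A)
COLLECTION_DELIM = "\x02"  # default delimiter between collection items (\002)
MAP_KEY_DELIM = "\x03"     # default delimiter between a map key and its value (\003)
NULL_TEXT = "\\N"          # default text written for NULL


def serialize_value(value):
    """Turn one column value into its delimited text form."""
    if value is None:
        return NULL_TEXT
    if isinstance(value, dict):
        return COLLECTION_DELIM.join(
            f"{key}{MAP_KEY_DELIM}{serialize_value(item)}" for key, item in value.items()
        )
    if isinstance(value, (list, tuple)):
        return COLLECTION_DELIM.join(serialize_value(item) for item in value)
    return str(value)


def serialize_row(row):
    """Join the serialized columns of one row into a single line of text."""
    return FIELD_DELIM.join(serialize_value(value) for value in row)


line = serialize_row([1, "Tom", None, {"math": 90, "art": 85}])
print(repr(line))
```

Reading is the mirror image: the line is split on the same delimiters to recover the columns.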


4.4 Hive default storage path

The default storage path is set by the hive.metastore.warehouse.dir property in ${HIVE_HOME}/conf/hive-site.xml. Its default value is /user/hive/warehouse.
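To make the layout under the warehouse directory concrete, the sketch below computes where a table's files would be placed when no explicit location is given. The helper table_location is hypothetical. It assumes the usual convention that tables of the default database sit directly under the warehouse directory, while tables of any other database sit under a <database>.db subdirectory.

```python
import posixpath

DEFAULT_WAREHOUSE_DIR = "/user/hive/warehouse"


def table_location(database, table, warehouse_dir=DEFAULT_WAREHOUSE_DIR):
    """Return the assumed HDFS directory of a table created without a location clause."""
    database, table = database.lower(), table.lower()
    if database == "default":
        # Tables of the default database live directly under the warehouse directory.
        return posixpath.join(warehouse_dir, table)
    # Other databases get their own <name>.db directory.
    return posixpath.join(warehouse_dir, f"{database}.db", table)


print(table_location("zimo", "student"))
print(table_location("default", "student"))
```

A table or database created with an explicit location clause does not follow this rule; its files live wherever the clause points.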


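5. Categories of SQL statements

5.1 DDL: data definition language

DDL (Data Definition Language) is the part of SQL that creates, drops, and alters the structure of objects inside a database. In some contexts it is also called a data description language, because it describes the fields and records of database tables. Its core statements are CREATE, ALTER, and DROP. DDL does not touch the data inside a table.

5.2 DML: data manipulation language

DML (Data Manipulation Language) is the set of statements used to work with the objects and data in a database. In standard SQL its core statements are insert, update, and delete.
Together with the SQL select statement, these make up "CRUD" (Create, Read, Update, Delete).

5.3 DQL: data query language

DQL (Data Query Language) is used to query data and centers on select. Combined with the DML statements insert, update, and delete, it makes up CRUD.
The standard clause order is select - from - where - group by - having - order by - limit.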

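6. Simple Hive tables

6.1 LazySimpleSerDe delimiters

LazySimpleSerDe is Hive's default SerDe class. Its row format syntax has four sub-clauses, which set the delimiters between fields, between collection items, between map keys and values, and between rows.

-- Sample row: 1,后裔,53,精灵王:288-阿尔法小队:588-辉光之辰:888-黄金射手座:1688-如梦令:1314
create table tb_hot_hero(
    id int comment 'ID',
    name string comment 'hero name',
    winrate int comment 'win rate',
    skin_and_price map<string,int> comment 'skins and prices'
)row format delimited fields terminated by ','   -- delimiter between fields
collection items terminated by '-'              -- delimiter between collection items
map keys terminated by ':'                      -- delimiter between map keys and values
lines terminated by '\n'                        -- delimiter between rows
location '/zimo/datahive';                      -- HDFS path of the table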
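To see how the three delimiters in this table definition carve up a line, here is a small Python sketch that parses the sample row the same way. It only mimics the splitting logic and is not how LazySimpleSerDe is implemented; the function name parse_hero_line is hypothetical.

```python
def parse_hero_line(line, field_delim=",", item_delim="-", key_delim=":"):
    """Split one text line into the four columns of tb_hot_hero."""
    hero_id, name, winrate, skins = line.rstrip("\n").split(field_delim)
    skin_and_price = {}
    # Collection items are separated by '-', and each item is a key:value pair.
    for item in skins.split(item_delim):
        skin, price = item.split(key_delim)
        skin_and_price[skin] = int(price)
    return {
        "id": int(hero_id),
        "name": name,
        "winrate": int(winrate),
        "skin_and_price": skin_and_price,
    }


row = parse_hero_line(
    "1,后裔,53,精灵王:288-阿尔法小队:588-辉光之辰:888-黄金射手座:1688-如梦令:1314\n"
)
print(row["name"], row["winrate"], len(row["skin_and_price"]))
```

If the delimiters declared in the table do not match the file, the split produces the wrong number of columns, which is a common cause of NULL columns after a load.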

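If a table is created without a row format clause, the default field delimiter is '\001', a non-printing control character (ASCII code 1) that cannot be typed directly on a keyboard.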


In the vim editor, press Ctrl+V followed by Ctrl+A to enter '\001'; it is displayed as ^A.

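6.2 Internal tables

An internal table is also called a managed table, because Hive owns and manages it. Tables are internal by default, and Hive owns both the table structure and its files. In other words, Hive fully manages the lifecycle of the table, metadata and data alike.

When an internal table is dropped, Hive deletes both the data and the table's metadata.

-- Show the table description, including the table type
describe formatted zimo.student;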

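6.3 External tables

The data in an external table is not owned or managed by Hive; Hive manages only the lifecycle of the table's metadata. To create an external table, use the external keyword.

Dropping an external table removes only the metadata, not the actual data, which stays accessible outside Hive.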

create external table student_ext(
    num tinyint,
    name string,
    age int )
row format delimited fields terminated by ',';

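6.4 Differences between internal and external tables

For both internal and external tables, Hive manages the table definition and its partition information in the Hive Metastore.

Dropping an internal table removes the table's metadata from the Metastore and also deletes all of its data files from HDFS.

Dropping an external table removes the table's metadata from the Metastore but leaves the actual data at its HDFS location untouched.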


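Use an internal table when Hive should fully manage the whole lifecycle of the table.

Use an external table when the files already exist or sit in a remote location, because the files are kept even if the table is dropped.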
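The difference in drop behaviour can be modelled in a few lines of Python. This is a toy sketch under stated assumptions: a dict plays the role of the metastore, a temporary local directory plays the role of HDFS, and the helper names drop_table and demo are hypothetical. The type strings MANAGED_TABLE and EXTERNAL_TABLE are the values shown as the table type by describe formatted.

```python
import shutil
import tempfile
from pathlib import Path


def drop_table(metastore, name):
    """Remove a table from the toy metastore; delete its files only if it is managed."""
    table = metastore.pop(name)
    if table["type"] == "MANAGED_TABLE":
        shutil.rmtree(table["location"])
    return table


def demo():
    root = Path(tempfile.mkdtemp())
    metastore = {}
    for name, kind in (("student", "MANAGED_TABLE"), ("student_ext", "EXTERNAL_TABLE")):
        location = root / name
        location.mkdir()
        (location / "data.txt").write_text("1,Tom,18\n")
        metastore[name] = {"type": kind, "location": location}

    drop_table(metastore, "student")
    drop_table(metastore, "student_ext")

    result = {
        "tables_left": sorted(metastore),
        "managed_dir_exists": (root / "student").exists(),
        "external_file_exists": (root / "student_ext" / "data.txt").exists(),
    }
    shutil.rmtree(root)  # clean up the temporary directory
    return result


print(demo())
```

Both tables disappear from the metastore, but only the managed table's directory is deleted; the external table's file is still there afterwards.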

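7. Hive partitioned tables

When a Hive table holds a large volume of data spread over many files, Hive can partition the data by user-specified columns to avoid full table scans. Good partition columns are identifying attributes such as date, region, or category. For example, a year of data can be split by month into 12 partitions, and later queries can read only the partition for a given month instead of scanning the whole table.
Note: avoid Chinese characters in partition columns, otherwise you may run into strange bugs!

7.1 Creating a partitioned table

create table stu_partition(
    id int,
    name string,
    age int,
    class int
)
-- partitioned by is the keyword that creates a partitioned table; Hive creates a directory for each partition value.
-- Note: a partition column cannot have the same name as a table column!
-- ntime is the partition column that the later load and select statements on stu_partition refer to.
partitioned by (ntime string)
row format delimited fields terminated by ",";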

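7.2 Loading partition data: static partitioning

With static partitioning, the user specifies the partition column value by hand when loading the data.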


load data local inpath '/home/stu_8.txt' into table stu_partition partition(ntime='2022-5-2');
load data local inpath '/home/stu_8.txt' into table stu_partition partition(ntime='2022-5-3');
select name,count(id) over() from stu_partition where ntime='2022-5-3';


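7.3 Loading partition data: dynamic partitioning

When loading data into a partitioned Hive table, creating many partitions by hand means copying, pasting, and editing a lot of SQL, which is inefficient. Because Hive is a batch processing system, it offers dynamic partitioning: it infers the partition values from the position of the columns in the query result and creates the partitions accordingly. By default the last column of the select list is used as the partition column.
Note: avoid Chinese characters in partition columns, otherwise you may run into strange bugs!

-- Settings required to enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

create table stu_partition_1(
    id int,
    name string,
    age int,
    cls int
)
-- partitioned by is the keyword that creates a partitioned table; Hive creates a directory for each partition value.
-- Note: a partition column cannot have the same name as a table column!
partitioned by (class int)
row format delimited fields terminated by ",";

-- The last selected column, class, supplies the partition value
insert into stu_partition_1 partition(class) select id,name,age,class,class from stu_partition;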
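The positional rule behind that insert can be sketched in Python: the trailing columns of each selected row are peeled off as partition values, and the remaining columns become the stored data. This is only a model of the routing idea, not Hive's execution plan. The function name route_to_partitions and the sample rows are hypothetical.

```python
from collections import defaultdict


def route_to_partitions(rows, partition_columns):
    """Group rows by their trailing partition values, keyed by directory name."""
    count = len(partition_columns)
    partitions = defaultdict(list)
    for row in rows:
        # The last `count` columns are partition values; the rest is table data.
        data, part_values = row[:-count], row[-count:]
        directory = "/".join(
            f"{column}={value}" for column, value in zip(partition_columns, part_values)
        )
        partitions[directory].append(data)
    return dict(partitions)


# Hypothetical result of: select id, name, age, class, class from stu_partition
rows = [(1, "Tom", 18, 1, 1), (2, "Ann", 19, 2, 2), (3, "Bob", 18, 1, 1)]
result = route_to_partitions(rows, ["class"])
print(result)
```

Because the match is by position and not by name, putting the partition column anywhere but last in the select list would silently route rows by the wrong value.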

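7.4 Characteristics of partitioned tables

Partitioning splits the data of a Hive table across multiple files and directories. Each partition maps to its own directory, and all data in a partition is stored under that directory. A query only needs to locate the directory for the requested partition value and scan the files there, which avoids scanning the whole table.
When querying, filter on the partition column in the where clause whenever possible so that only the needed partitions are read.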
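A minimal sketch of this idea, often called partition pruning: given the partition directories of a table and the partition filter of a query, only the matching directories need to be scanned. The function name prune_partitions and the directory list are hypothetical, and the matching here is a plain string comparison, which is simpler than what Hive does.

```python
def prune_partitions(partition_dirs, wanted):
    """Return only the directories whose key=value parts match every wanted pair."""
    selected = []
    for directory in partition_dirs:
        # "year=2022/month=6" becomes {"year": "2022", "month": "6"}.
        spec = dict(part.split("=", 1) for part in directory.split("/"))
        if all(spec.get(key) == str(value) for key, value in wanted.items()):
            selected.append(directory)
    return selected


dirs = ["year=2021/month=5", "year=2022/month=5", "year=2022/month=6"]
print(prune_partitions(dirs, {"year": 2022}))
print(prune_partitions(dirs, {"year": 2022, "month": 6}))
```

A query with no filter on a partition column corresponds to an empty wanted dict, which selects every directory, that is, a full table scan.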

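Notes on partitioned tables
Partitioning is not a required part of the create table syntax; it is an optimization technique.
A partition column cannot be one of the table's existing columns; names must not repeat.
A partition column is a virtual column; its values are not stored in the underlying data files.
Partition values are either specified by hand when loading data (static partitioning) or inferred from the position of columns in the query result (dynamic partitioning).
Hive supports multi-level partitioning, that is, partitioning further within a partition for finer granularity.

-- Add a partition manually
alter table table_name add if not exists partition(dt='2021-11-11');
-- Repair partitions
msck repair table table_name;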

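7.5 Multi-level partitioned tables

With multi-level partitioning the partitions are nested: each level partitions further within the previous one. From the HDFS point of view, directories are split into subdirectories.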

create table stu_partition_2(
    id int,
    name string,
    age int,
    cls int
)partitioned by (year int,month int)
row format delimited fields terminated by ',';

load data local inpath '/home/stu_8.txt' into table stu_partition_2 partition(year=2021,month=05);
load data local inpath '/home/stu_8.txt' into table stu_partition_2 partition(year=2022,month=05);
load data local inpath '/home/stu_8.txt' into table stu_partition_2 partition(year=2022,month=06);

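8. Bucketed tables

A bucketed table is a table type designed to optimize queries. Bucketing breaks the data into several parts that are easier to manage. When bucketing, we specify which column to bucket by and how many buckets (parts) to create. Rows that map to the same bucket number are placed in the same bucket. Data must be inserted into a bucketed table rather than loaded directly with load data, because the rows are assigned to buckets only during an insert.

8.1 Creating a bucketed table

-- Bucketed table
create table stu_bucket(
    id int,
    name string,
    age int,
    cls int
)clustered by (cls) into 4 buckets
row format delimited fields terminated by ',';
-- Rows are assigned to buckets only when written through an insert statement
insert into stu_bucket select id,name,age,class
from stu_partition where ntime='2022-5-3';


-- Bucketed table with sorted buckets
-- (named stu_bucket_sorted here so that it does not clash with stu_bucket above)
create table stu_bucket_sorted(
    id int,
    name string,
    age int,
    cls int
)clustered by (cls) sorted by(cls) into 4 buckets
row format delimited fields terminated by ',';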

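8.2 Characteristics of bucketed tables

Rows are assigned to buckets with a hash algorithm. For a bucketing column that is not an int, the value is first converted to a hash code and then assigned to a bucket. Note that the bucketing column must be an existing column of the table.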
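The bucket assignment can be sketched as "hash the value, then take it modulo the number of buckets". The Python model below assumes that an int hashes to itself and uses an illustrative 31-based polynomial hash for other values. Both are simplifications: the exact hash function depends on the Hive version and column type, so the bucket numbers computed here are not guaranteed to match what Hive produces. The function names are hypothetical.

```python
def simple_hash(text):
    """Illustrative 31-based polynomial hash over UTF-8 bytes (not Hive's exact hash)."""
    hash_code = 0
    for byte in text.encode("utf-8"):
        hash_code = (hash_code * 31 + byte) & 0xFFFFFFFF
    return hash_code


def bucket_for(value, num_buckets):
    """Map a column value to a bucket number in the range 0..num_buckets-1."""
    if isinstance(value, int):
        hash_code = value  # assumption: an int hashes to itself
    else:
        hash_code = simple_hash(str(value))
    # Clear the sign bit so the result of the modulo is never negative.
    return (hash_code & 0x7FFFFFFF) % num_buckets


for cls in (5, 6, 7, 8):
    print(cls, "->", bucket_for(cls, 4))
```

Whatever the hash function, the key property is the same: equal values always land in the same bucket, which is what the optimizations described next rely on.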

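Benefits of bucketed tables:

Queries that filter on the bucketing column scan less of the table, which improves query performance.

select * from stu_bucket where cls=8;

Join optimization: joins run more efficiently in MapReduce because the Cartesian product to evaluate is smaller.
When two tables are joined on a common column and both are bucketed on that column, only buckets holding the same column values need to be joined with each other, which greatly reduces the amount of data involved in the join.
Sampling: bucketed data can be sampled.
When the data set is very large and processing all of it is impractical, sampling becomes especially important. Sampling lets you estimate and infer properties of the whole from the sampled data, and is an economical and effective method widely used in scientific experiments, quality inspection, and social surveys.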
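The join benefit can be illustrated with a toy Python model: if both tables are bucketed on the join key with the same number of buckets, bucket i of one table only has to be compared with bucket i of the other. The sketch assumes non-negative integer keys in the first column and uses key % num_buckets as the bucket function. The function names and sample rows are hypothetical, and the nested loop stands in for whatever join algorithm the engine actually uses.

```python
from collections import defaultdict


def into_buckets(rows, num_buckets):
    """Group rows by bucket number, computed from the integer key in column 0."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[0] % num_buckets].append(row)
    return buckets


def bucketed_join(left, right, num_buckets):
    """Join on column 0, comparing only rows that sit in the same bucket."""
    left_buckets = into_buckets(left, num_buckets)
    right_buckets = into_buckets(right, num_buckets)
    joined, comparisons = [], 0
    for bucket in range(num_buckets):
        for left_row in left_buckets[bucket]:
            for right_row in right_buckets[bucket]:
                comparisons += 1
                if left_row[0] == right_row[0]:
                    joined.append(left_row + right_row[1:])
    return joined, comparisons


left = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
right = [(1, "x"), (2, "y"), (5, "z")]
joined, comparisons = bucketed_join(left, right, 4)
print(joined, comparisons, "comparisons instead of", len(left) * len(right))
```

With these sample rows the bucketed join needs 3 comparisons where a naive pairwise join would need 12.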

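9. Basic DDL operations

9.1 Database operations

-- List all databases
show databases;
show schemas;
-- Create a database
create database zimo;
-- Create a database at a specified location
create database zimodatabase location '/zimo';
-- Describe a database
describe database zimo;
-- Describe a database in detail
describe database extended zimo;
-- Show the statement that created the database
show create database zimo;
-- Switch to a database
use zimo;
-- Drop a database
drop database zimo;
-- Drop a database that still contains tables
drop database zimo cascade;
-- Alter a database [use with caution!]
alter database zimo set ???;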

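9.2 Table operations

-- List all tables
show tables;
-- Describe a table's metadata
describe extended zimotable;
-- Describe a table's metadata in detail (formatted output)
describe formatted zimotable;
-- Drop a table
drop table if exists zimotable;
-- Drop a table, bypassing the trash
drop table zimotable purge ;
-- Delete all rows from a table (does not work on external tables)
truncate table zimotable;
-- Rename a table
alter table zimotable rename to zimo_table;
-- Alter table properties [use with caution!]
alter table zimotable set ???;
-- Change a column (old name, new name, type) [use with caution!]
alter table zimotable change id uid int;
-- Add a column [use with caution!]
alter table zimotable add columns(name string);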

9.3 Partition operations

-- Add partitions
alter table zimopartition add partition(year=2022) partition (year=2021);
-- Rename a partition
alter table zimopartition partition (year=2021) rename to partition (year=2023);
-- Drop a partition
alter table zimopartition drop partition (year=2023);
-- Repair partitions (directories added to HDFS by hand are not registered in the table's metadata until the partitions are repaired)
msck repair table zimopartition;
-- List the partitions of a partitioned table
show partitions zimopartition;

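10. Basic DML operations

10.1 load

When data is loaded into a table with load, Hive performs no transformation; the load is a pure copy or move of the data files to the location that backs the Hive table. If LOCAL is specified, the load command looks for the file path on the local file system. When the command is run over a remote connection, "local file system" means the local Linux file system of the machine running the HiveServer2 service.

-- Without local, filepath is an HDFS path
load data inpath 'filepath' into table zimotable;
-- With local, filepath is a Linux path
load data local inpath 'filepath' into table zimotable;
-- With overwrite, the existing contents of the target table (or partition) are deleted,
-- and then the contents of the file or directory at filepath are added to the table/partition
load data local inpath 'filepath' overwrite into table zimotable;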
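The copy-versus-move behaviour of the three statements above can be modelled with ordinary files. In this sketch a temporary local directory stands in for both the local file system and HDFS, and the helper names load_data and demo are hypothetical. It models load local as a copy, load without local as a move, and overwrite as clearing the table directory first; no file content is parsed or converted.

```python
import shutil
import tempfile
from pathlib import Path


def load_data(src, table_dir, local=False, overwrite=False):
    """Toy model of load data: copy when local, move otherwise; never parse the file."""
    src, table_dir = Path(src), Path(table_dir)
    if overwrite:
        # overwrite removes whatever the table directory currently holds.
        for old_file in list(table_dir.iterdir()):
            old_file.unlink()
    target = table_dir / src.name
    if local:
        shutil.copy(src, target)            # the source file stays where it was
    else:
        shutil.move(str(src), str(target))  # the source file disappears
    return target


def demo():
    root = Path(tempfile.mkdtemp())
    table_dir = root / "zimotable"
    table_dir.mkdir()
    local_file = root / "local.txt"
    local_file.write_text("1,Tom\n")
    staged_file = root / "staged.txt"  # pretend this one already sits in HDFS
    staged_file.write_text("2,Ann\n")

    load_data(local_file, table_dir, local=True)
    load_data(staged_file, table_dir)
    after_two_loads = sorted(p.name for p in table_dir.iterdir())
    sources_exist = (local_file.exists(), staged_file.exists())

    load_data(local_file, table_dir, local=True, overwrite=True)
    after_overwrite = sorted(p.name for p in table_dir.iterdir())

    shutil.rmtree(root)  # clean up the temporary directory
    return after_two_loads, sources_exist, after_overwrite


print(demo())
```

The practical consequence of the move semantics is that a file loaded from an HDFS path is no longer at its original location afterwards.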

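10.2 insert

In a traditional RDBMS, insert + values is fast. In Hive, insert + values is very slow, because Hive uses MapReduce under the hood to write data to HDFS. So in Hive the usual approach is to clean the data into structured files and then load them into Hive tables.
In Hive, insert + select is normally used instead of insert + values.
insert overwrite means overwrite: the old data is cleared and the new data is then inserted.

-- If the type of a queried column differs from the type of the target column, Hive converts it,
-- but the conversion is not guaranteed to succeed; values that fail to convert become NULL.
-- Dynamic partition insert: by default the last selected column is the partition column
-- Requires nonstrict mode
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table stu_partition_1 partition(class)
select id, name, age, class, class
from stu_partition where stu_partition.ntime='2022-5-3';

-- Export data (overwrites the Linux directory '/home/hdfs_data')
-- The output is stored in ORC format
insert overwrite local directory '/home/hdfs_data'
row format delimited fields terminated by ','
stored as orc
select * from stu_partition where ntime='2022-5-3';

-- Multiple insert: one scan of the source table feeds several target tables
from stu_partition
insert overwrite table student_insert1
select id,name
insert overwrite table student_insert2
select name,age;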

11. Basic DQL operations

11.1 Basic queries

Logical execution order of a query: from > where > group by (including aggregation) > having > select > order by > limit.

Difference between where and having
where filters rows before grouping	having filters rows after grouping
where cannot be followed by aggregate functions	having can be followed by aggregate functions

-- Basic query: select - from - where - group by - having - order by - limit
select * from student;
select distinct age from student;
select name,age,money from student where money<5000;
-- Wildcard matching
select * from student where name like '%五' or name like '__十';
-- Regular expression matching
select * from student where name rlike '.*五|王.十';

select age,avg(money) avgmoney from student group by age
having avgmoney>5000 order by avgmoney desc limit 3;
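The last query can be traced step by step in plain Python, following the usual logical order of SQL clauses (from, group by with aggregation, having, select, order by, limit). This is a sketch for intuition only: the sample rows and the function name avg_money_by_age are hypothetical, and a real engine is free to execute the query differently as long as the result is the same.

```python
from collections import defaultdict

# Hypothetical contents of the student table.
students = [
    {"name": "s1", "age": 18, "money": 4000},
    {"name": "s2", "age": 18, "money": 5000},
    {"name": "s3", "age": 19, "money": 6000},
    {"name": "s4", "age": 19, "money": 8000},
    {"name": "s5", "age": 20, "money": 9000},
    {"name": "s6", "age": 21, "money": 5500},
    {"name": "s7", "age": 21, "money": 6500},
    {"name": "s8", "age": 22, "money": 5200},
]


def avg_money_by_age(rows, min_avg, limit):
    groups = defaultdict(list)
    for row in rows:                                  # from student
        groups[row["age"]].append(row["money"])       # group by age
    averages = {age: sum(values) / len(values)        # avg(money)
                for age, values in groups.items()}
    kept = {age: avg for age, avg in averages.items()
            if avg > min_avg}                         # having avgmoney > 5000
    selected = [{"age": age, "avgmoney": avg}
                for age, avg in kept.items()]         # select age, avg(money) avgmoney
    ordered = sorted(selected, key=lambda r: r["avgmoney"],
                     reverse=True)                    # order by avgmoney desc
    return ordered[:limit]                            # limit 3


top = avg_money_by_age(students, min_avg=5000, limit=3)
print(top)
```

The age-18 group is dropped by the having step (its average is 4500), and the limit then cuts the remaining four groups down to the three with the highest averages.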

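11.2 Sorting

order by
Sorts the output globally. When the MapReduce engine runs the query, only one reduce task is used, so it is slow.
cluster by
Groups rows by the given column and then sorts each group in ascending order by that same column (the sort direction cannot be specified). The number of groups depends on the number of reduce tasks, and rows are assigned to groups by hash by default. In short: distribute and sort on the same column.
distribute by + sort by
Splits the work of cluster by in two: distribute by does the grouping, and sort by sorts within each group, possibly on a different column. The number of groups again depends on the number of reduce tasks. In short: distribute on one column, then sort on another. sort by only guarantees that each reducer's output is sorted, not a global order.

-- order by
select * from student order by age desc,money;
-- cluster by
set mapreduce.job.reduces=2;
select * from student cluster by(id);
-- distribute by + sort by
set mapreduce.job.reduces=2;
select * from student distribute by sex sort by money desc;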
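A small Python model of the last statement shows why the result is sorted per reducer but not globally. Rows are sent to a reducer by hashing the distribute by column, and each reducer sorts only its own rows. The function name, the sample rows, and the use of zlib.crc32 as the hash are assumptions made for the sketch; Hive uses its own hash function.

```python
import zlib

# Hypothetical contents of the student table.
students = [
    {"name": "s1", "sex": "M", "money": 4000},
    {"name": "s2", "sex": "F", "money": 7000},
    {"name": "s3", "sex": "M", "money": 9000},
    {"name": "s4", "sex": "F", "money": 3000},
    {"name": "s5", "sex": "M", "money": 6000},
]


def distribute_and_sort(rows, distribute_key, sort_key, num_reducers, descending=False):
    """Send each row to a reducer by hash, then sort inside every reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # A deterministic stand-in for the engine's hash function.
        hash_code = zlib.crc32(str(row[distribute_key]).encode("utf-8"))
        reducers[hash_code % num_reducers].append(row)
    return [sorted(part, key=lambda r: r[sort_key], reverse=descending)
            for part in reducers]


parts = distribute_and_sort(students, "sex", "money", num_reducers=2, descending=True)
for index, part in enumerate(parts):
    print("reducer", index, [(r["sex"], r["money"]) for r in part])
```

All rows with the same sex value end up in the same reducer, and each reducer's rows are in descending order of money, but concatenating the reducer outputs does not give one globally sorted list.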

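11.3 union queries

union combines the rows of tables, or the results of select statements, that have the same columns into a single result set.
union all keeps duplicates; union [distinct] removes them.

-- union all (keeps duplicates) | union distinct (removes duplicates); plain union removes duplicates by default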
select * from student where money>=5000
union
select * from student where money<=5000
order by money desc ;
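The two branches of this query overlap on rows where money is exactly 5000, which makes it a handy case for seeing the difference between union all and union distinct. The Python sketch below models both on a list of tuples; the function names and sample rows are hypothetical.

```python
def union_all(left, right):
    """Concatenate both result sets and keep every row."""
    return list(left) + list(right)


def union_distinct(left, right):
    """Concatenate both result sets and drop rows that were already seen."""
    seen, result = set(), []
    for row in union_all(left, right):
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


# Hypothetical contents of the student table as (name, money) rows.
students = [("s1", 5000), ("s2", 7000), ("s3", 3000)]
high = [row for row in students if row[1] >= 5000]  # where money >= 5000
low = [row for row in students if row[1] <= 5000]   # where money <= 5000

merged = sorted(union_distinct(high, low), key=lambda r: r[1], reverse=True)
print(len(union_all(high, low)), merged)
```

The row with money 5000 satisfies both filters, so union all returns it twice, while the plain union used in the query returns it once.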
