Hive Table Types: How Hive Data Is Organized

Reposted

mob64ca13ff9303 2023-08-29 20:38:24


Hive Data Types

Hive's STRING type is comparable to the VARCHAR type in a relational database: it is a variable-length string.


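Hive has three complex data types: ARRAY, MAP, and STRUCT. ARRAY and MAP are similar to arrays and Map in Java, while STRUCT is similar to a struct in C: it encapsulates a set of named fields. Complex types can be nested to any depth.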
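As a minimal sketch of the three complex types, the table below declares one column of each kind and then reads individual elements. The table name person_info, its columns, and the delimiters are assumptions made for illustration; they do not come from the original text.

```sql
-- Hypothetical table: every name and delimiter here is an illustrative assumption.
create table if not exists person_info(
  name     string,
  friends  array<string>,                        -- ordered list of values
  children map<string, int>,                     -- key/value pairs
  address  struct<street:string, city:string>    -- named fields
)
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':';

-- Array elements are read by zero-based index, map values by key,
-- and struct fields with dot notation.
select friends[1]            as second_friend,
       children['xiao song'] as child_age,
       address.city          as city
from person_info;
```

With these delimiters, one input line would hold the array elements, the map entries, and the struct fields each separated by '_', with ':' between a map key and its value.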


DDL

Creating a database

Adding IF NOT EXISTS avoids the error raised when the database already exists (this is the standard form).

create database if not exists db_hive;

Listing databases

show databases;

Switching to a database

use db_hive;

Showing database information

desc database db_hive;

Dropping an empty database

drop database if exists db_hive2;

Dropping a non-empty database (forced)

drop database db_hive cascade;

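Creating tables

Managed (internal) tables

Tables created by default are managed tables, also called internal tables, because Hive controls (more or less) the lifecycle of their data. By default, Hive stores the data of these tables in a subdirectory of the directory defined by the configuration property hive.metastore.warehouse.dir (for example, /user/hive/warehouse). When a managed table is dropped, Hive also deletes the table's data. Managed tables are not well suited to sharing data with other tools.

Basic table creation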

create table if not exists student(
id int,
name string
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/warehouse/student';

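FIELDS TERMINATED BY sets the field delimiter; the quoted value can be ',', ' ' (a space), '\t', and so on.
row format delimited fields terminated by '\t'

STORED AS sets the file storage format; if omitted, it defaults to textfile.
stored as textfile

LOCATION sets the HDFS directory where the table's data is stored.
location '/user/hive/warehouse/student';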

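Creating a table from query results (CREATE TABLE ... AS SELECT)

-- The target table must differ from the source table; employee_copy is a placeholder name.
create table if not exists employee_copy as select * from employee;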

Creating a managed table from a WITH (common table expression) query

create table employees as
with 
r1 as (select name from emps where empname='zhangsan' and sex=1),
r2 as (select name from emps where sex=0)
select * from r1 union all select * from r2;

Creating a table from an existing table's structure

create table if not exists student1 like student;

Listing tables

show tables;

Showing a table's columns

desc student;

Showing detailed table information

desc formatted student;

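External tables

External tables are declared with the EXTERNAL keyword.

Because the table is external, Hive does not assume that it fully owns the data. Dropping the table does not delete the data, although the metadata describing the table is removed.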

create external table if not exists dept(
deptno int,
dname string,
loc int
)
row format delimited fields terminated by '\t';
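The statements below are a small sketch of how one might check whether a table is managed or external, and switch between the two, using the dept table defined above. Treat it as an illustration rather than a required step.

```sql
-- The "Table Type" row of the output reads EXTERNAL_TABLE for dept
-- and MANAGED_TABLE for a managed table such as student.
desc formatted dept;

-- Convert the external table into a managed table:
-- from now on, dropping it would also delete its data.
alter table dept set tblproperties('EXTERNAL'='FALSE');

-- Convert it back to an external table. Writing the property name
-- and value in upper case is the safe choice.
alter table dept set tblproperties('EXTERNAL'='TRUE');
```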

Loading data into an external table
When the data is in HDFS (the source file is moved into the table's directory):

load data inpath '/myinfo/4.txt' into table mytest;

When the data is on the local Linux file system (the file is copied into the table's directory):

load data local inpath '/myinfo/4.txt' into table mytest;
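Both load statements above append to whatever the table already holds. As a hedged sketch, reusing the same path and the mytest table from the text, the OVERWRITE keyword replaces the existing data instead:

```sql
-- INTO TABLE appends the file to the data already in the table.
load data local inpath '/myinfo/4.txt' into table mytest;

-- OVERWRITE INTO TABLE removes the table's existing data files first,
-- so only the newly loaded file remains.
load data local inpath '/myinfo/4.txt' overwrite into table mytest;
```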

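Tables with collection columns
COLLECTION ITEMS TERMINATED BY sets the delimiter between the elements of a collection:
collection items terminated by ','
For example, given the data: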

1 zhangsan football,basketball,drink 22
2 caixukun sing,tiao,rap 12

Fields are separated by spaces, and the elements of a collection are separated by commas.

create external table myuser(
uid int,
uname string,
ulike array<string>,
uage int
)
row format delimited fields terminated by ' '
collection items terminated by ','
location '/myinfo';
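Once the table exists, the array column can be queried element by element or flattened into rows. The following is a minimal sketch against the myuser table; the aliases first_like, like_count, t, and hobby are names chosen here for illustration.

```sql
-- Index into the array (zero-based) and count its elements.
-- For the sample row of zhangsan, ulike[0] is football.
select uname,
       ulike[0]    as first_like,
       size(ulike) as like_count
from myuser;

-- Flatten the array: one output row per (user, hobby) pair.
select uname, hobby
from myuser
lateral view explode(ulike) t as hobby;
```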

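Partitioned tables

A partition of a partitioned table corresponds to a separate directory in HDFS, and that directory holds all of the partition's data files. Partitioning in Hive simply means splitting into directories: a large data set is divided into smaller ones according to business needs. A query can then select only the partitions it needs through expressions in the WHERE clause, which makes it much more efficient.

Partitioning is either static or dynamic.

Static partitioning: the partition value is known in advance. The partition name is specified explicitly when a partition is added or when data is loaded into it.

Creating a statically partitioned table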

-- PARTITION is a reserved word in Hive, so the table name is quoted with backticks.
create table `partition`(
id int, 
name string, 
age int
)
partitioned by (sex int)
row format delimited fields terminated by '\t';

Loading data

load data local inpath '/opt/module/datas/data1.txt' into table `partition` partition(sex='1');
load data local inpath '/opt/module/datas/data2.txt' into table `partition` partition(sex='0');

Querying
select * from `partition` where sex='1';
select * from `partition` where sex='0';

Union query

select * from `partition` where sex='1'
union
select * from `partition` where sex='0';

Adding and dropping partitions (add, drop)

alter table `partition` add partition(sex='3');
alter table `partition` drop partition(sex='3');

Listing a table's partitions

show partitions `partition`;
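To make the "one partition, one directory" idea concrete, here is a small sketch. The directory layout in the comments is an assumption: it holds only if the table lives in the default database and hive.metastore.warehouse.dir is /user/hive/warehouse. The table name is written in backticks because PARTITION is a reserved word in Hive.

```sql
-- Assumed layout after the two load statements:
--   /user/hive/warehouse/partition/sex=1/data1.txt
--   /user/hive/warehouse/partition/sex=0/data2.txt

-- Shows the metadata of a single partition, including its Location.
desc formatted `partition` partition(sex='1');

-- Filtering on the partition column lets Hive read only the sex=1
-- directory instead of scanning the whole table.
select id, name, age from `partition` where sex = 1;
```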

Multi-level partitioned tables

create table partition2(
id int,
name string
)
partitioned by (age int,sex int)
row format delimited fields terminated by '\t';

Loading data into a multi-level partitioned table

load data local inpath '/opt/module/datas/data.txt' into table partition2 partition(age='20',sex='0');

Querying a multi-level partitioned table

select * from partition2 where age='19' and sex='1';
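With two partition columns, each level becomes one directory level. As an illustrative sketch on partition2 (the value age=21 is an arbitrary choice made here), several partitions can be added at once and then listed with a partial partition spec:

```sql
-- Add two second-level partitions under age=21 in a single statement.
alter table partition2 add
  partition(age='21', sex='0')
  partition(age='21', sex='1');

-- List only the partitions below age=21 by giving a partial spec.
show partitions partition2 partition(age='21');

-- The corresponding directories end in .../partition2/age=21/sex=0
-- and .../partition2/age=21/sex=1.
```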

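Dynamic partitioning

With dynamic partitioning, the partition values are not fixed in advance; they are determined by the input data.

Dynamic partitioning must be enabled in Hive's settings.

Enabling dynamic partitioning
In nonstrict mode, all partition columns may be dynamic.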

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Creating the source table and the dynamically partitioned table

create external table origninfos(
id string,
name string,
sex string,
age int
)
row format delimited fields terminated by ' '
location '/orgin';

create external table partinfos(
id string,
name string,
age int
)
partitioned by (sex string)
row format delimited fields terminated by ' ';

insert into partinfos partition(sex) select id,name,age,sex from origninfos;
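The sketch below repeats the dynamic insert with a few additions that may be useful in practice. It assumes dynamic partitioning has been enabled as described above; the two limit values are illustrative, not recommendations.

```sql
-- Optional limits on how many partitions one statement may create,
-- in total and per mapper/reducer node.
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;

-- The dynamic partition column (sex) must come last in the select list;
-- its value in each row decides which partition the row is written to.
-- OVERWRITE replaces the contents of the partitions that receive data.
insert overwrite table partinfos partition(sex)
select id, name, age, sex from origninfos;

-- One partition should now exist per distinct sex value in origninfos.
show partitions partinfos;
```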

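Bucketed tables

For each table or partition, Hive can further organize the data into buckets; that is, buckets divide the data at a finer granularity. Bucketing brings two benefits:

1. Convenient sampling
Buckets make sampling more efficient. When working with large data sets, it is very convenient during query development and tuning to be able to trial-run a query on a small portion of the data.

2. More efficient join queries
Buckets give the table extra structure that Hive can exploit for some queries. Specifically, a join of two tables that are bucketed on the same columns (which include the join columns) can be implemented efficiently as a map-side join. Because both tables are bucketed on the shared join column, only the buckets holding the same column values need to be joined with each other, which greatly reduces the amount of data the join has to process.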
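The text above does not show how a bucketed table is declared, so here is a minimal sketch. The table name stu_buck and the bucket count of 4 are assumptions for illustration; the rows are taken from the student table created earlier.

```sql
-- Rows are assigned to one of 4 buckets by hashing the id column.
create table if not exists stu_buck(
  id int,
  name string
)
clustered by (id) into 4 buckets
row format delimited fields terminated by '\t';

-- Needed on Hive versions before 2.0 so that inserts honor the bucket
-- definition; from Hive 2.0 on, bucketing is always enforced and this
-- property no longer exists.
set hive.enforce.bucketing=true;

-- Populate the table with an insert so that rows are hashed into buckets.
insert into table stu_buck select id, name from student;

-- Sampling: read only the first of the 4 buckets.
select * from stu_buck tablesample(bucket 1 out of 4 on id);
```

For the join benefit, both tables would need to be bucketed on the join column, typically with bucket counts that are equal or multiples of one another, before Hive can consider a bucketed map-side join.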