hive 元数据 cd

转载

kcoufee 2024-12-02 14:36:49

文章标签 hive 元数据 cd hive hadoop hql bigdata 文章分类 Hive 大数据

开始

主要内容：
1.数据库的创建、修改和显示。
2.数据库表的创建、修改和显示。
3.内部表。
4.分区表。

数据库定义

创建数据库
删除数据库
显示数据库

数据库表定义

创建表

创建内部表
创建外部表

查看表定义

tblproperties
仅使用describe
extended
formatted

分区表

数据示例
创建分区表
分区数据导入
查看分区

外部分区表
删除表
修改表

表重命名
分区操作

添加分区
修改分区路径
删除分区

修改列
添加列
替换列
修改表属性

数据库定义

创建数据库

以下示例创建了一个mydb数据库。用dfs查看/user/hive/warehouse目录，可发现多出了一个mydb.db的文件夹。

hive> create database if not exists mydb;
OK
Time taken: 1.178 seconds
hive> show databases;
OK
default
mydb
testdb
Time taken: 0.215 seconds, Fetched: 3 row(s)
hive> dfs -ls /user/hive/warehouse/;
Found 2 items
drwxrwxrwx   - stack supergroup          0 2018-05-16 09:43 /user/hive/warehouse/mydb.db

如果想把数据库保存在其他目录，则可以通过location /path/dbname进行自定义设置。

hive> create database loc
    > location '/my/hive/loc.db';

同时，还可以在创建数据库时，设置描述信息。
第一种方式是通过comment。

hive> create database cmnt
    > comment 'database with commnets';
OK
Time taken: 0.092 seconds

用describe 查看数据库描述。显示数据库描述和存放目录。

hive> describe database cmnt;
OK
cmnt    database with commnets  hdfs://master:9000/user/hive/warehouse/cmnt.db  stack   USER    
Time taken: 0.136 seconds, Fetched: 1 row(s)

第二种方式是通过with dbproperties (key1=value1, ...)。比如以记录项目id为“ABC0012”。

hive> create database prop
    > with dbproperties ('projectid'='ABC0012', 'creator'='wei');

dbproperties的属性是可修改和可添加的。如下修改projectid，添加time描述。

hive> alter database prop set dbproperties('projectid'='AAAAAA', time='10:35');
OK
Time taken: 0.064 seconds

查看数据库dbproperties时，必须在数据库名前面加上extended关键字。

hive (cmnt)> describe database extended prop;
OK
prop        hdfs://master:9000/user/hive/warehouse/prop.db  stack   USER    {creator=wei, time=10:35, projectid=AAAAAA}
Time taken: 0.04 seconds, Fetched: 1 row(s)

使用数据库。使用use dbname的语句切换到相应的数据库。为了便于观察，可以设置显示当前所在的数据库。

hive> use cmnt;
hive> set hive.cli.print.current.db=true; //设置显示当前所在的数据库
hive (cmnt)> //当前数据库为cmnt

删除数据库

使用drop database dbname的方式删除。但是当数据库存在表时，将会删除失败。如下，先在cmnt中创建一个表mytable，然后执行drop，提示FAILED。

hive (cmnt)> create table mytable(id string);
OK
Time taken: 0.157 seconds
hive (cmnt)> drop database cmnt;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database cmnt is not empty. One or more tables exist.)

如果加上关键字cascade，则会自行删除数据中表，然后删除数据库。如果加上关键字restrict，则和默认情况相同。

hive (cmnt)> drop database cmnt cascade;
OK

显示数据库

通过show databases来显示已有数据库。如果想要模糊匹配，则可以用like关键字实现。

hive (cmnt)> show databases like '*ro*';
OK
prop
Time taken: 0.031 seconds, Fetched: 1 row(s)

数据库表定义

创建表

Hive中的表分为内部表（管理表）和外部表。删除内部表时，内部表的元数据和目录上的数据均被删除；删除外部表时，只删除元数据，目录上数据不被删除。内部表不适合共享数据。因为表被删除时，数据也被删除。

创建内部表

以下这个创建表的示例来自《Hive编程指南》，但略有不同，原书上将tblproperties放在location之前，执行时会报“FAILED: ParseException line 9:0 missing EOF at ‘location’ near ‘)’”的错误。将该示例写到createdb.hql文件中。

create table if not exists mydb.employees(
  name string comment 'Employee name',
  salary float comment 'Employee salay',
  subordinates array<string> comment 'Names of subordiantes',
  deductions map<string, float> comment '<deductions name, percentages>',
  address struct<street:string, city:string, state:string, zip:int> comment 'Home address')
comment 'Description of the table'
location '/user/hive/warehouse/mydb.db/employees'
tblproperties ('creator'='me', 'created_at'='2018-05-16');

该示例在mydb数据库中新建一张employees表，包含五列。name为string类型，salary为float类型，这些为基本数据类型。之后的三列为hive的三种集合数据类型，subordinates为数组array类型，deductions为map类型，address为struct类型。类似于数据库操作，这里使用comment为列和表添加描述信息，location设置表的保存目录，tblproperties设置键值对类型的属性。
在hive shell中，通过source执行创建表的语句。

hive> source ./createdb.hql

如果mydb数据库中无employees表，则会创建；如果存在同名表，即使和createdb.hql表结构不同，也不会再创建。
另外，也可以通过拷贝表的方式创建表。

hive > create table if not exists mydb.cpyemployees
     > like mydb.employees;

创建外部表

外部表被删除时，其目录下的数据不会删除，而只是在hive中删除了表的元数据。
外部表也可以像内部表一样通过load实现数据导入，但这里使用另一种方式，无需load即可实现数据导入。
假设在HDFS上存在数据文件/user/hive/external/etlemployees/data.dat。我们想使用Hive表操纵这个数据，但是表删除时，原数据文件依然保存。那么，应该选择建立外部表。
外部表的创建方式如下所示。其与内部表类似，但略有不同。
1.在table前多了external关键字，表明该表类型为外部表。
2.location的路径为数据存在的路径。注意，此路径只能指向数据存在的文件夹目录，而不能指向文件。否则，执行时会报“路径不是文件夹或者路径无法被创建”的错误。

create external table if not exists mydb.etlemployees(
  name string comment 'Employee name',
  salary float comment 'Employee salay',
  subordinates array<string> comment 'Names of subordiantes',
  deductions map<string, float> comment '<deductions name, percentages>',
  address struct<street:string, city:string, state:string, zip:int> comment 'Home address')
comment 'Description of the table'
location '/user/hive/external/etlemployees'
tblproperties ('creator'='me', 'created_at'='2018-05-16');

创建后，可以直接显示表中数据。

hive> select * from etlemployees;

删除该表后，show tables的结果不在存在etlemployees表，但是/user/hive/external/etlemployees/data.dat依然存在。

像内部表一样，外部表在创建时也可以复制其他表结构。

hive> create external table if not exists mydb.cpyetlemployees
    > like mydb.employees
    >location '/user/hive/external/cpyetlemployees';

但是，执行完上述语句后，数据并不会被复制到cpyetlemployees目录下。

查看表定义

查看表定义信息的三种方法

tblproperties

1.查看tblproperties。发现多出了一个transient_lastDdlTime属性。

hive> show tblproperties mydb.employees;
OK
comment Description of the table
created_at  2018-05-16
creator me
transient_lastDdlTime   1526455784
Time taken: 0.094 seconds, Fetched: 4 row(s)

仅使用describe

2.查看table表结构信息。

hive> describe mydb.employees;
OK
name                    string                  Employee name       
salary                  float                   Employee salay      
subordinates            array<string>           Names of subordiantes
deductions              map<string,float>       <deductions name, percentages>
address                 struct<street:string,city:string,state:string,zip:int>  Home address        
Time taken: 0.091 seconds, Fetched: 5 row(s)

extended

3.通过过extended关键字查看表结构及描述信息。这条语句输出的信息有些混乱。

hive> describe extended mydb.employees;

formatted

4.通过formatted查看表结构及描述信息。

hive> describe formatted mydb.employees;
OK
# col_name              data_type               comment             

name                    string                  Employee name       
salary                  float                   Employee salay      
subordinates            array<string>           Names of subordiantes
deductions              map<string,float>       <deductions name, percentages>
address                 struct<street:string,city:string,state:string,zip:int>  Home address        

# Detailed Table Information         
Database:               mydb                     
Owner:                  stack                    
CreateTime:             Wed May 16 15:29:44 CST 2018     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://master:9000/user/hive/warehouse/mydb.db/employees     
Table Type:             MANAGED_TABLE            
Table Parameters:        
    comment                 Description of the table
    created_at              2018-05-16          
    creator                 me                  
    transient_lastDdlTime   1526455784          

# Storage Information        
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe   
InputFormat:            org.apache.hadoop.mapred.TextInputFormat     
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:         
    serialization.format    1                   
Time taken: 0.1 seconds, Fetched: 32 row(s)

分区表

表的分区实际上就是在表目录下创建多个子目录，将数据分成多份存在不同的子目录下。而这些数据按照设定的某些字段进行划分。

数据示例

例如，存在如下类型的学生数据文件。数据有6个字段，分别为id号、名字、性别、学校和年级。这些数据按学校和年级分为四个文件。

文件1: student-MingShengSchool-Grade1.dat

S0001,WangXiaoming,male,MingSheng School,Grade 1
S0003,ZhangSan,male,MingSheng School,Grade 1

文件2: student-MingShengSchool-Grade2.dat

S0002,ZhangBobo,male,MingSheng School,Grade 2
S0016,XiaoBaiBai,male,MingSheng School,Grade 2

文件3: student-DaHuaSchool-Grade1.dat

S0002,LiLeilei,male,DaHua School,Grade 1
S0007,LiDahua,male,DaHua School,Grade 1

文件4: student-DaHuaSchool-Grade2.dat

S0003,LuoDafu,male,DaHua School,Grade 2
S0012,ZhouXiaoJuan,male,DaHua School,Grade 2

如果经常需要查询某个学校某个年级的学生，则可以将这些文件按学校和年级进行逐级分区。如下：

1级目录             2级目录      
 school=MingSheng School/grade=Grade 1
 school=MingSheng School/grade=Grade 2

 school=DaHua School/grade=Grade 1
 school=DaHua School/grade=Grade 2

创建分区表

创建相应分区表时，通过partitioned by设定学校和年级为分区字段。具体语句如下所示，create()中定义的字段和partitioned by()的字段是分开的。两者组成了完整的字段。将下述写入create_part_tbl.hql文件中。

create table mydb.student (
id string,
name string,
sex string
)
partitioned by (school string, grade string)
row format delimited
fields terminated by ','

使用source执行create_part_tbl.hql。

hive> source ./create_part_tbl；

分区数据导入

创建表后，导入分区的数据。首先导入文件1：student-MingShengSchool-Grade1.dat。

hive> load data local inpath '/path/data/student.dat'
    > into table student
    > partition (school = 'MingSheng School', grade='Grade 1');

这一步完成后，在/user/hive/warehouse/mydb.db/student目录下会出现school=MingShengSchool/grade=Grade 1这个目录。
接着按如上方法依次导入剩下的3种文件。

查看分区

显示分区目录。

hive> show partitions student;
OK
school=DaHua School/grade=Grade 1
school=DaHua School/grade=Grade 2
school=MingSheng School/grade=Grade 1
school=MingSheng School/grade=Grade 2

也可以查看hdfs上相应的目录。注意：查看school=DaHua School的子目录时，因为路径存在空格，要在空格前加上\进行转义，否则会报错。

$ hdfs dfs -ls /user/hive/warehouse/mydb.db/student
Found 3 items
drwxrwxrwx   - stack supergroup          0 2018-05-17 16:43 /user/hive/warehouse/mydb.db/student/school=DaHua School
drwxrwxrwx   - stack supergroup          0 2018-05-17 16:40 /user/hive/warehouse/mydb.db/student/school=MingSheng School
$ hdfs dfs -ls /user/hive/warehouse/mydb.db/student/school=DaHua\ School
Found 3 items
drwxrwxrwx   - stack supergroup          0 2018-05-17 16:43 /user/hive/warehouse/mydb.db/student/school=DaHua School/grade=Grade 1
drwxrwxrwx   - stack supergroup          0 2018-05-17 16:43 /user/hive/warehouse/mydb.db/student/school=DaHua School/grade=Grade 2

外部分区表

假设有一些日志数据如下所示。各字段分别表示：时分秒，日志等级，类名，日志信息，年，月，日。现在要使用外部分区表进行处理。

155337  INFO    Utils   Successfully started service 'sparkDriver' on port 59995        2018    5       21
155338  INFO    SparkEn Registering MapOutputTracker    2018    5       21
155339  INFO    SparkEnv        Registering BlockManagerMaster  2018    5       21
155341  INFO    BlockManagerMasterEndpoint      BlockManagerMasterEndpoint up   2108    5       21
155342  INFO    MemoryStore      MemoryStore started with capacity 1445.7 MB    2018    5       21

创建外部分区表语句如下所示，各字段分隔符为'\t'，按年/月/日分区。将语句保存到createLogsTbl.hql。

create external table if not exists logs(
  hms int,
  level string,
  class string,
  message string)
partitioned by (year int, month int, day int)
row format delimited
fileds terminated by '\t';

通过source执行createLogTbl。

hive> source /path/createLogTbl.hql

注意：在前面的创建分区表etlemployees时，设定了location为文件位置。但是，在创建外部分区表时，可以不设定location。而可以通过alter语句添加分区和位置。

hive> alter table logs add partition(year=2018, month=5, day=21)
    > location '/user/hive/external/logs/2018/05/21';

通过load将之前的logs.dat文件加载到表中。

hive> load data local inpath '../testdata/logs.dat' 
    > into table logs
    > partition (year=2018, month=05, day=21);
Loading data to table mydb.logs partition (year=2018, month=5, day=21)

如果数据被转移到其他目录。则可以改变location的设置。

hive> dfs -cp  /user/hive/external/logs /user/hive/etl/ //将数据转移到其他目录
hive> alter table logs partition(year=2018, month=5,day=21)
    > set location '/user/hive/etl/logs/2018/05/21'; //设置新的目录

删除表

删除内部表employees表。元数据和表中数据都被删除。

hive> drop table if exists employees;

删除外部表etlemployees。仅元数据被删除，表目录下的数据保留。

hive> drop table if exists etlemployees;

修改表

表重命名

将表etlemlpyees修改为etl_employees。

hive> alter table etlemployees rename to etl_employees;

分区操作

添加分区

alter table logs add if not exists
partition (year = 2018, month = 5, day = 20) location '/user/hive/external/logs/2018/05/20'
partition (year = 2018, month = 5, day = 19) location '/user/hive/external/logs/2018/05/19';

修改分区路径

修改分区的location路径。

hive> alter table logs partition(year=2018, month=5,day=21)
    > set location '/user/hive/etl/logs/2018/05/21'; //设置新的目录

删除分区

删除指定分区。

hive> alter table logs drop if exists partition(year=2018, month=5, day=21);

修改列

将logs的列level修改成loggerlevel，添加comment描述，并将其调整到hms列后面。不过level本来就在hms后面，因此产生更改。

hive> alter table logs
    > change column level loggerlevel string
    > comment 'logger level: Error, WARN, INFO, DEBUG';
    > after hms;

如果要将其调整到第一列，则将after colname的语句用first替代即可。
但是，列的位置并非可以随意更改。如下语句将列loggerlevel改回level，然后将其位置调整到class之后。可以成功执行。

hive> alter table logs
    > change column loggerlevel level string
    > after class;
OK

但是当将其调整到第一列时，执行报错。这是因为class和loggerlevel的数据类型相同。但是原第一列hms和loggerlevel的数据类型不兼容。这样的方式修改表只能改变元数据中的表结构，并不会改变数据。

hive> alter table logs
    > change column level level string
    > first;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. The following columns have types incompatible with the existing columns in their respective positions :

添加列

通过add columns(colname datatype, ...)来添加列。

hive> alter table logs add columns(
    > operator string,
    > host string);
OK

如果列的位置不对，则可以再用change的方法改变列的位置。

替换列

通过replace columns(colname datatype, ...)来实现。()中定义的新列会替代原表结构中除了分区列之外的所有列。如下所示，原本的hms，level，class，message列都被删除，仅有一个新列一个level int。

hive> alter table logs replace columns (
    > level int);
OK
Time taken: 0.113 seconds
hive> desc logs;
OK
level                   int                                         
year                    int                                         
month                   int                                         
day                     int                                         

# Partition Information      
# col_name              data_type               comment             

year                    int                                         
month                   int                                         
day                     int                                         
Time taken: 0.065 seconds, Fetched: 11 row(s)

修改表属性

改变表的tblproperties。之前的employees表中，定义了部分属性，下面的语句为其添加了新的属性depart。注意，属性只能添加和修改，不能删除。

hive> use mydb
hive> alter table employees
    > set tblproperties (
    > 'depart'='Cloud');
OK
hive> show tblproperties employees; //显示tblproperties属性。
OK
comment Description of the table
created_at  2018-05-16
creator me
depart  Cloud
last_modified_by    stack
last_modified_time  1526905193
numFiles    1
totalSize   339
transient_lastDdlTime   1526905193
Time taken: 0.051 seconds, Fetched: 9 row(s)

显示的属性当中，有部分属性是自动生成的：last_modified_by，last_modified_time，numFiles ，totalSize和transient_lastDdlTime。
尝试修改其中的属性。发现执行并未报错。但是show出来的结果中，numFiles并不会发生改变,依然为1。

hive> alter table employees
    > set tblproperties (
    > 'numFiles'='2');
OK
Time taken: 0.108 seconds

(后续添加：存储属性；数据类型)

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：spring中的jpa有那些

下一篇：shell命令获取当前python版本

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯