This article draws on other blog posts, but aims to present the material more concisely.
It demonstrates the following:
1. A table created in Hive can be stored directly in HBase.
2. Data inserted into the Hive table is synchronized to the corresponding HBase table.
3. Changes to column-family values in HBase are reflected in the corresponding Hive table.
4. Multi-column, multi-family mapping (example: 3 Hive columns mapped to 2 HBase column families).
+++++++++++++++++++++++++++++++++++++++
Note: dropping a managed table in Hive also drops the associated HBase table.
1. When creating a managed (internal) table, there is no need to create the associated table in HBase first; once the table is created in Hive, the associated HBase table is generated automatically:
create table hbase_table_1(key int, value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping"=":key,cf1:val")
tblproperties ("hbase.table.name"='xyz');
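The mapping string pairs up positionally with the Hive column list: entry i in hbase.columns.mapping says where Hive column i is stored, and the special token :key marks the HBase row key. A minimal sketch of that pairing rule (plain Python for illustration; the real parser lives in Hive's HBaseSerDe):

```python
# Sketch of how the HBase storage handler pairs Hive columns with the
# "hbase.columns.mapping" property: entry i maps Hive column i, and
# ":key" denotes the HBase row key. Illustrative only, not Hive's parser.
def pair_columns(hive_columns, mapping):
    entries = [e.strip() for e in mapping.split(",")]
    assert len(entries) == len(hive_columns), "one mapping entry per Hive column"
    return dict(zip(hive_columns, entries))

print(pair_columns(["key", "value"], ":key,cf1:val"))
# {'key': ':key', 'value': 'cf1:val'}
```

So in the DDL above, Hive column key becomes the row key and value is stored as qualifier val in family cf1.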
Check in HBase that the associated table xyz was created:
hbase(main):035:0> desc 'xyz'
Table xyz is ENABLED
xyz
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE',
MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true
'}
1 row(s) in 0.0130 seconds
2. Then load data from the Hive source table data_in into the associated table above (this runs as a MapReduce job):
insert overwrite table test01.hbase_table_1 select * from test01.data_in;
Check in HBase that the data was loaded:
hbase(main):036:0> scan 'xyz'
ROW COLUMN+CELL
1 column=cf1:val, timestamp=1495595889276, value=val_1
2 column=cf1:val, timestamp=1495595889276, value=val_2
3 column=cf1:val, timestamp=1495595889276, value=val_3
3 row(s) in 0.0150 seconds
3. Conversely, put data in HBase and check whether it shows up in Hive:
hbase(main):057:0> put 'xyz','4','cf1:val','val_4'
hbase(main):058:0> put 'xyz','55','cf1:age','24'
hbase(main):059:0> put 'xyz','66','cf1:name','zhangsan'
hbase(main):060:0> scan 'xyz'
ROW COLUMN+CELL
1 column=cf1:val, timestamp=1495596813061, value=val_1
2 column=cf1:val, timestamp=1495596813061, value=val_2
3 column=cf1:val, timestamp=1495596813061, value=val_3
4 column=cf1:val, timestamp=1495596878735, value=val_4
55 column=cf1:age, timestamp=1495596890741, value=24
66 column=cf1:name, timestamp=1495596904166, value=zhangsan
Note: the puts above do land in HBase, but Hive only exposes the columns listed in hbase.columns.mapping (here just cf1:val), and values must match the declared Hive column types. Cells written under unmapped qualifiers such as cf1:age and cf1:name therefore do not show up in the Hive query:
hive (test01)> select * from hbase_table_1;
OK
1 val_1
2 val_2
3 val_3
4 val_4
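Comparing the HBase scan with this Hive query: rows 55 and 66 exist in HBase, but only cf1:val is mapped, so their cf1:age and cf1:name cells never surface through the Hive table. A small sketch of that visibility rule (simplified cell tuples, not the real HBase API):

```python
# Sketch: of all cells in the HBase table, Hive can only see those whose
# column (family:qualifier) appears in hbase.columns.mapping. The puts to
# cf1:age and cf1:name are therefore invisible through hbase_table_1.
# Cell tuples are (row_key, "family:qualifier", value); illustrative only.
MAPPED = {"cf1:val"}  # non-key entries from ":key,cf1:val"

def visible_cells(cells):
    return [c for c in cells if c[1] in MAPPED]

scan = [
    ("1", "cf1:val", "val_1"),
    ("4", "cf1:val", "val_4"),
    ("55", "cf1:age", "24"),
    ("66", "cf1:name", "zhangsan"),
]
print(visible_cells(scan))  # only the cf1:val cells remain
```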
4. Drop the associated table hbase_table_1 in Hive; the HBase table xyz is dropped along with it:
hive (test01)> drop table hbase_table_1;
OK
Time taken: 3.803 seconds
Scan the associated table xyz in HBase and confirm it has been deleted:
hbase(main):055:0> scan 'xyz'
ROW COLUMN+CELL
ERROR: Unknown table xyz!
=========================================================
Note: dropping an external table in Hive does NOT drop the associated HBase table!
1. First create the associated table in HBase:
hbase(main):001:0> create 'hive_to_hbase','up'
2. Create the HBase-backed table in Hive:
CREATE EXTERNAL TABLE IF NOT EXISTS test01.hbase_v1_hive(key string, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,up:val")
TBLPROPERTIES ("hbase.table.name" = "hive_to_hbase");
3. Create a source table in Hive (any name works):
create table test01.data_in(
id int,
bar string
) row format delimited fields terminated by '\t'
lines terminated by '\n' stored as textfile;
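The source table declares tab-delimited fields and newline-terminated rows, so the input file must follow that layout exactly. A throwaway Python sketch that writes such a file (file name and values are just the ones used in the next step):

```python
# Write a file in the layout test01.data_in expects:
# tab-separated fields, one row per line.
rows = [(1, "val_1"), (2, "val_2"), (3, "val_3")]
with open("11.txt", "w") as f:
    for rid, bar in rows:
        f.write(f"{rid}\t{bar}\n")
```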
4. Load the data into the source table:
[root@10-122-129-28 johnmay]# more 11.txt
1 val_1
2 val_2
3 val_3
load data local inpath '/root/johnmay/11.txt' overwrite into table test01.data_in;
5. Load data from data_in into hbase_v1_hive:
insert overwrite table test01.hbase_v1_hive select * from data_in;
6. Scan the associated table in HBase to check whether the data was synchronized:
hbase(main):020:0> scan 'hive_to_hbase'
ROW COLUMN+CELL
1 column=up:val, timestamp=1495453854071, value=val_1
1 row(s) in 0.0090 seconds
As you can see, the data has been synchronized.
7. Put data from the HBase side:
put 'hive_to_hbase','key','up:val','1234'
put 'hive_to_hbase','name','up:val','zhangsan'
Check that the puts succeeded:
hbase(main):004:0> scan 'hive_to_hbase'
ROW COLUMN+CELL
1 column=up:val, timestamp=1495453854071, value=val_1
key column=up:val, timestamp=1495454199731, value=1234
name column=up:val, timestamp=1495454201497, value=zhangsan
3 row(s) in 0.0640 seconds
Then check in Hive that the data is synchronized:
hive (test01)> select * from hbase_v1_hive;
OK
1 val_1
key 1234
name zhangsan
Time taken: 0.096 seconds, Fetched: 3 row(s)
The rows just inserted in HBase are already visible in Hive.
================================
Recommendation: when creating Hive/HBase associated tables, prefer external tables!
Multiple columns and multiple column families (Multiple Columns and Families)
1. First create the associated table in HBase:
hbase(main):014:0> create 'user','info', 'info1'
2. Create the associated table in Hive:
create external table hbase_user(id int, name string, age int)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping"=":key,info:name,info1:age")
tblproperties ("hbase.table.name"="user");
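The HBase table must already contain every column family that the mapping references, which is why step 1 created 'user' with both info and info1. A small sketch that extracts the required families from a mapping string (illustrative only, not Hive's parser):

```python
# Derive the set of HBase column families a mapping string requires.
# Entries of the form "family:qualifier" need that family to exist;
# ":key" is the row key and needs no family. Illustrative only.
def required_families(mapping):
    families = set()
    for entry in mapping.split(","):
        entry = entry.strip()
        if entry != ":key":
            families.add(entry.split(":")[0])
    return families

print(sorted(required_families(":key,info:name,info1:age")))
# ['info', 'info1']
```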
3. Load data from the prepared source table into the associated table hbase_user.
Note the difference between the two statements: in Hive, INSERT OVERWRITE normally replaces a table's contents while INSERT INTO appends; for an HBase-backed table, however, both translate into puts keyed by the row key, so OVERWRITE does not delete existing rows that the query does not produce:
insert overwrite table test01.hbase_user select * from data_in;
insert into table test01.hbase_user select * from data_in;
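For an HBase-backed table, both statements end up as puts keyed by the row key (per the Hive HBase integration docs, INSERT OVERWRITE does not delete pre-existing rows). A sketch of that upsert behaviour, with a plain dict standing in for the HBase table (an illustrative assumption, not the real client API):

```python
# Both INSERT OVERWRITE and INSERT INTO against an HBase-backed table
# translate to puts keyed by row key: existing keys are overwritten,
# keys not produced by the query are left alone (no truncation).
def hive_insert_into_hbase(table, new_rows):
    table.update(new_rows)  # upsert by row key
    return table

user = {"1": ("val_1", 23)}
hive_insert_into_hbase(user, {"1": ("val_1_new", 23), "2": ("val_2", 36)})
print(user)
# {'1': ('val_1_new', 23), '2': ('val_2', 36)}
```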
4. Query the data in Hive:
hive> select * from hbase_user;
OK
1 val_1 23
2 val_2 36
3 val_3 22
4 val_4 26
Time taken: 0.102 seconds, Fetched: 4 row(s)
Query the data in HBase:
hbase(main):025:0> scan 'user'
ROW COLUMN+CELL
1 column=info:name, timestamp=1495610825243, value=val_1
1 column=info1:age, timestamp=1495610825243, value=23
2 column=info:name, timestamp=1495610825243, value=val_2
2 column=info1:age, timestamp=1495610825243, value=36
3 column=info:name, timestamp=1495610825243, value=val_3
3 column=info1:age, timestamp=1495610825243, value=22
4 column=info:name, timestamp=1495610825243, value=val_4
4 column=info1:age, timestamp=1495610825243, value=26
4 row(s) in 0.0190 seconds
5. Inspect the table structure in Hive and in HBase:
hive> desc hbase_user;
OK
id int from deserializer
name string from deserializer
age int from deserializer
Time taken: 0.331 seconds, Fetched: 3 row(s)
hbase(main):026:0> desc 'user'
Table user is ENABLED
user
COLUMN FAMILIES DESCRIPTION
{NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE'
, MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'tru
e'}
{NAME => 'info1', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE
', MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'tr
ue'}
2 row(s) in 0.0160 seconds
This demonstrates mapping 3 Hive columns (id, name, age) to two HBase column families (info, info1).