1. HBase Architecture
The functional components are described as follows:
(1) Client
- The access entry point for the whole HBase cluster;
- Communicates with HMaster and HRegionServer via the HBase RPC mechanism;
- Talks to HMaster for administrative operations;
- Talks to HRegionServer for data reads and writes;
- Provides the interfaces for accessing HBase and maintains a cache to speed up access to HBase.
(2) ZooKeeper
- HBase depends on ZooKeeper; by default HBase manages the ZooKeeper instance itself, for example starting and stopping it;
- Guarantees that there is only one active HMaster in the cluster at any time, so the HMaster is no longer a single point of failure;
- Stores the addressing entry point for all HRegions;
- Monitors HRegionServers coming online and going offline in real time and notifies the HMaster immediately;
- Stores HBase schema and table metadata; early versions had a -ROOT- system table, which later versions removed;
- The ZooKeeper quorum stores the address of the meta table and the address of the HMaster.
(3) HMaster
- HMaster has no single-point-of-failure problem: several HMasters can be started in one HBase cluster, and ZooKeeper's master election mechanism guarantees that exactly one of them is running at any time. HMaster is mainly responsible for managing tables and regions.
- Manages users' create, delete, alter and query operations on tables;
- Balances the load across HRegionServers and adjusts the distribution of regions;
- After a region split, assigns the new regions, i.e. HMaster decides which machine each region is placed on;
- When an HRegionServer goes down, migrates the regions that were hosted on the failed server;
- Actual reads and writes do not depend on HMaster, so if the HMaster node goes down, the cluster can keep working for a short time without major problems.
(4) HRegionServer
- Maintains HRegions, serves I/O requests against those regions, and reads and writes data to the HDFS file system;
- Splits regions that grow too large while the cluster is running;
- A client does not need the master to access data in HBase: region addressing goes through ZooKeeper and the HRegionServers, and data reads and writes go to the HRegionServers. HMaster only maintains the metadata of tables and regions, so its load is very low. For this reason the HMaster machine is usually given relatively little memory, while HRegionServer machines are given much more.
- Both HMaster and HRegionServer register with ZooKeeper when they first start.
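To make the read path concrete, here is a minimal sketch using the HBase 0.98 client API that the later examples in these notes use (the class name is made up; the ZooKeeper host and the row/column come from the examples below). The client is configured only with the ZooKeeper quorum: region addressing is resolved through ZooKeeper and hbase:meta, and the data is read from an HRegionServer, so no HMaster address appears anywhere.
package com.beifeng.senior.hadoop.hbase;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
public class ClientReadExample {
    public static void main(String[] args) throws Exception {
        // The client only needs the ZooKeeper quorum; it never contacts HMaster for reads or writes.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hadoop-senior.ibeifeng.com");
        // Open the 'user' table and fetch one row directly from the serving HRegionServer.
        HTable table = new HTable(conf, "user");
        Result result = table.get(new Get(Bytes.toBytes("10001")));
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        table.close();
    }
}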
2. Integrating HBase with MapReduce
(1) Add the following environment variables in /opt/modules/hadoop-2.5.0/etc/hadoop/hadoop-env.sh, then submit the job with the HBase dependencies on the classpath:
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
export HADOOP_HOME=/opt/modules/hadoop-2.5.0
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp` $HADOOP_HOME/bin/yarn jar $HADOOP_HOME/jars/hbase-mr-user2basic.jar
(2) Example: integrate HBase with MapReduce to export part of the data in the user table into the basic table.
User2BasicMapReduce.java
package com.beifeng.senior.hadoop.hbase;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class User2BasicMapReduce extends Configured implements Tool {
// Mapper Class
public static class ReadUserMapper extends TableMapper<Text, Put> {
private Text mapOutputKey = new Text();
@Override
public void map(ImmutableBytesWritable key, Result value,
Mapper<ImmutableBytesWritable, Result, Text, Put>.Context context)
throws IOException, InterruptedException {
// get rowkey
String rowkey = Bytes.toString(key.get());
// set
mapOutputKey.set(rowkey);
// --------------------------------------------------------
Put put = new Put(key.get());
// iterator
for (Cell cell : value.rawCells()) {
// add family : info
if ("info".equals(Bytes.toString(CellUtil.cloneFamily(cell)))) {
// add column: name
if ("name".equals(Bytes.toString(CellUtil.cloneQualifier(cell)))) {
put.add(cell);
}
// add column : age
if ("age".equals(Bytes.toString(CellUtil.cloneQualifier(cell)))) {
put.add(cell);
}
}
}
// context write
context.write(mapOutputKey, put);
}
}
// Reducer Class
public static class WriteBasicReducer extends TableReducer<Text, Put, //
ImmutableBytesWritable> {
@Override
public void reduce(Text key, Iterable<Put> values,
Reducer<Text, Put, ImmutableBytesWritable, Mutation>.Context context)
throws IOException, InterruptedException {
for(Put put: values){
context.write(null, put);
}
}
}
// Driver
public int run(String[] args) throws Exception {
// create job
Job job = Job.getInstance(this.getConf(), this.getClass().getSimpleName());
// set run job class
job.setJarByClass(this.getClass());
// set job
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
// set input and set mapper
TableMapReduceUtil.initTableMapperJob(
"user", // input table
scan, // Scan instance to control CF and attribute selection
ReadUserMapper.class, // mapper class
Text.class, // mapper output key
Put.class, // mapper output value
job //
);
// set reducer and output
TableMapReduceUtil.initTableReducerJob(
"basic", // output table
WriteBasicReducer.class, // reducer class
job//
);
job.setNumReduceTasks(1); // at least one, adjust as required
// submit job
boolean isSuccess = job.waitForCompletion(true) ;
return isSuccess ? 0 : 1;
}
public static void main(String[] args) throws Exception {
// get configuration
Configuration configuration = HBaseConfiguration.create();
// submit job
int status = ToolRunner.run(configuration,new User2BasicMapReduce(),args) ;
// exit program
System.exit(status);
}
}
3. Integration Modes for HBase and MapReduce
There are three modes of integrating HBase with MapReduce:
(1) Reading from HBase: the HBase table is used as the input of the map phase;
(2) Writing to HBase: the HBase table is used as the output of the reduce phase;
In these two modes the non-HBase side is ordinary MapReduce: an InputFormat supplies the input (by default the key type is LongWritable and the value type is Text, i.e. lines of a file), and an OutputFormat writes the output as files to HDFS. A minimal sketch of mode (2), reading text files from HDFS and writing into HBase, follows the list below.
(3) The combination of the two: read from HBase and write back into HBase, which is mostly suited to data migration scenarios. The options are as follows:
Methods for migrating data into HBase:
- Using the HBase Put API
- Using the HBase bulk load tool
- Using a custom MapReduce job
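As referenced in mode (2) above, the following is a minimal sketch of a job that reads ordinary text files from HDFS (default key LongWritable, value Text) and writes Puts into an HBase table through TableOutputFormat. It assumes the HBase 0.98 API used elsewhere in these notes; the class name, the tab-separated input layout and the info columns are assumptions for illustration, and the target table 'basic' is the one from the earlier example.
package com.beifeng.senior.hadoop.hbase;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
// Hypothetical example: load tab-separated lines "rowkey \t name \t age" into the 'basic' table.
public class Hdfs2HBaseMapReduce {
    public static class TsvToPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 3) {
                return; // skip malformed lines
            }
            byte[] rowkey = Bytes.toBytes(fields[0]);
            Put put = new Put(rowkey);
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(fields[1]));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes(fields[2]));
            context.write(new ImmutableBytesWritable(rowkey), put);
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hdfs2hbase");
        job.setJarByClass(Hdfs2HBaseMapReduce.class);
        // Default TextInputFormat: key = LongWritable (byte offset), value = Text (one line)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setMapperClass(TsvToPutMapper.class);
        // TableOutputFormat writes the Puts into the 'basic' table; no reducer is needed
        TableMapReduceUtil.initTableReducerJob("basic", null, job);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Because initTableReducerJob is called with a null reducer and the number of reduce tasks is 0, the mapper's (ImmutableBytesWritable, Put) output goes straight to TableOutputFormat.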
The HBase Bulk Load tool
When MapReduce writes to HBase it normally uses TableOutputFormat: Put objects are generated in the reducer and written to HBase directly. With large volumes of data this is inefficient (HBase blocks writes and performs frequent flush, split and compact operations, causing heavy I/O) and it can hurt the stability of HBase nodes (long GC pauses and slow responses can make a node time out and leave the cluster, triggering a chain reaction).
HBase also supports the bulk load approach. It exploits the fact that HBase data is stored on HDFS in a specific format: persistent HFile-format files are generated directly on HDFS and then moved to the proper location, completing a fast import of very large volumes of data. Done with MapReduce, it is efficient and convenient, does not consume region resources or add extra load, greatly improves write throughput for large data volumes, and reduces the write pressure on HBase nodes.
Generating HFiles first and then bulk loading them into HBase, instead of writing through TableOutputFormat directly as before, has the following benefits:
- It removes the insert pressure on the HBase cluster;
- It speeds up the job and shortens its execution time.
The Bulk Load workflow:
- A MapReduce job converts the *.csv (or other source) files into HFiles;
- The bulk load step loads the HFiles into the HBase table.
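As a rough sketch of how the two workflow steps can be wired together in a custom MapReduce job (HBase 0.98 API; the class name, the column layout and the target table 'student' are assumptions for illustration): the job is configured with HFileOutputFormat2.configureIncrementalLoad so that it writes HFiles sorted and partitioned by region boundaries, and LoadIncrementalHFiles then moves those files into the table.
package com.beifeng.senior.hadoop.hbase;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Hypothetical example: csv lines "rowkey,name,age" -> HFiles -> bulk load into 'student'.
public class CsvBulkLoadDriver {
    public static class CsvToPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length < 3) {
                return; // skip malformed lines
            }
            byte[] rowkey = Bytes.toBytes(fields[0]);
            Put put = new Put(rowkey);
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes(fields[1]));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes(fields[2]));
            context.write(new ImmutableBytesWritable(rowkey), put);
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "csv-bulkload");
        job.setJarByClass(CsvBulkLoadDriver.class);
        job.setMapperClass(CsvToPutMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input csv directory on HDFS
        Path hfileOut = new Path(args[1]);                      // HFile output directory
        FileOutputFormat.setOutputPath(job, hfileOut);
        // Step 1: write HFiles sorted and partitioned according to the table's region boundaries
        HTable table = new HTable(conf, "student");
        HFileOutputFormat2.configureIncrementalLoad(job, table);
        if (!job.waitForCompletion(true)) {
            System.exit(1);
        }
        // Step 2: move the generated HFiles into the regions of the target table
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        loader.doBulkLoad(hfileOut, table);
        table.close();
    }
}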
The importtsv tool that ships with HBase can also be used; run the command as follows:
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
export HADOOP_HOME=/opt/modules/hadoop-2.5.0
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase mapredcp`:${HBASE_HOME}/conf \
${HADOOP_HOME}/bin/yarn jar \
${HBASE_HOME}/lib/hbase-server-0.98.6-hadoop2.jar importtsv \
-Dimporttsv.columns=HBASE_ROW_KEY,\
info:name,info:age,info:sex,info:address,info:phone \
student \
hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/hbase/importtsv
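For reference, each line of the input files under /user/beifeng/hbase/importtsv must contain tab-separated fields in exactly the order declared by -Dimporttsv.columns, with the row key first. A hypothetical row (not actual course data, fields separated by tab characters) would look like:
20170222_10001	tom	18	male	shanghai	13500000000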
Alternatively:
In /opt/modules/hadoop-2.5.0/etc/hadoop/hadoop-env.sh, add the required HBase jars to Hadoop's runtime classpath:
export HBASE_HOME=/opt/modules/hbase-0.98.6-hadoop2
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:`$HBASE_HOME/bin/hbase mapredcp`
Test with rowcounter:
$ /opt/modules/hadoop-2.5.0/bin/yarn jar lib/hbase-server-0.98.6-hadoop2.jar rowcounter nstest:tb1
Bulk Load tool example:
【Scenario】
Create the table: create 'stu_info','info','degree','work'
Insert data: 6 rowkeys across 3 column families
put 'stu_info','20170222_10001','degree:xueli','benke'
put 'stu_info','20170222_10001','info:age','18'
put 'stu_info','20170222_10001','info:sex','male'
put 'stu_info','20170222_10001','info:name','tom'
put 'stu_info','20170222_10001','work:job','bigdata'
put 'stu_info','20170222_10002','degree:xueli','gaozhong'
put 'stu_info','20170222_10002','info:age','22'
put 'stu_info','20170222_10002','info:sex','female'
put 'stu_info','20170222_10002','info:name','jack'
put 'stu_info','20170222_10003','info:age','22'
put 'stu_info','20170222_10003','info:name','leo'
put 'stu_info','20170222_10004','info:age','18'
put 'stu_info','20170222_10004','info:name','peter'
put 'stu_info','20170222_10005','info:age','19'
put 'stu_info','20170222_10005','info:name','jim'
put 'stu_info','20170222_10006','info:age','20'
put 'stu_info','20170222_10006','info:name','zhangsan'
【Requirement】Import the name column of the info column family of the stu_info table into another table, t5.
【Implementation】
(1) Create the t5 table:
create 't5', {NAME => 'info'}
(2) In Hadoop's hadoop-env.sh file, add the required HBase jars to set up the integration dependency:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/modules/hbase-0.98.6-hadoop2/lib/*
(3) Upload the data file to HDFS, then import it into the table with importtsv:
/opt/modules/hadoop-2.5.0/bin/yarn jar lib/hbase-server-0.98.6-hadoop2.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:age,info:sex stu_info /test.tsv
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
Note: the -D option specifies a single parameter as key=value.
(4) If the separator is not the default \t, specify the input separator in the command:
/opt/modules/hadoop-2.5.0/bin/yarn jar lib/hbase-server-0.98.6-hadoop2.jar importtsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:age,info:sex stu_info /test2.csv
(5) Generate HFiles (which are in fact StoreFiles) instead of writing directly, using -Dimporttsv.bulk.output:
/opt/modules/hadoop-2.5.0/bin/yarn jar lib/hbase-server-0.98.6-hadoop2.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:age,info:sex -Dimporttsv.bulk.output=/testHfile stu_info /test3.tsv
(6) Load the generated HFiles into HBase with completebulkload (running it without arguments prints its usage):
/opt/modules/hadoop-2.5.0/bin/yarn jar lib/hbase-server-0.98.6-hadoop2.jar completebulkload
usage: completebulkload /path/to/hfileoutputformat-output tablename
/opt/modules/hadoop-2.5.0/bin/yarn jar lib/hbase-server-0.98.6-hadoop2.jar completebulkload /testHfile stu_info
(7) Check the data imported into HBase:
20170222_10001 column=info:name, timestamp=1497059738675, value=tom
20170222_10002 column=info:name, timestamp=1497059738956, value=jack
20170222_10003 column=info:name, timestamp=1497059739013, value=leo
20170222_10004 column=info:name, timestamp=1497059739121, value=peter
20170222_10005 column=info:name, timestamp=1497059739254, value=jim
20170222_10006 column=info:name, timestamp=1497059740585, value=zhangsan
4. Using the HBase Java API for CRUD Operations on HBase Tables
(1) Add the dependencies to pom.xml:
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>0.98.6-hadoop2</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>0.98.6-hadoop2</version>
</dependency>
(2) Add the hbase-site.xml file to the project's classpath (for example under src/main/resources).
(3) Write HBaseOperation.java:
package com.beifeng.senior.hadoop.hbase;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IOUtils;
/**
* CRUD Operations
*
*/
public class HBaseOperation {
public static HTable getHTableByTableName(String tableName) throws Exception {
// Get instance of Default Configuration
Configuration configuration = HBaseConfiguration.create();
// Get table instance
HTable table = new HTable(configuration, tableName);
return table;
}
public void getData() throws Exception {
String tableName = "user"; // default.user / hbase:meta
HTable table = getHTableByTableName(tableName);
// Create Get with rowkey
Get get = new Get(Bytes.toBytes("10002")); // "10002".toBytes()
// ==========================================================================
// add column
get.addColumn(//
Bytes.toBytes("info"), //
Bytes.toBytes("name"));
get.addColumn(//
Bytes.toBytes("info"), //
Bytes.toBytes("age"));
// Get Data
Result result = table.get(get);
// Key : rowkey + cf + c + version
// Value: value
for (Cell cell : result.rawCells()) {
System.out.println(//
Bytes.toString(CellUtil.cloneFamily(cell)) + ":" //
+ Bytes.toString(CellUtil.cloneQualifier(cell)) + " ->" //
+ Bytes.toString(CellUtil.cloneValue(cell)));
}
// Table Close
table.close();
}
/**
 * Suggestion: keep the table name & column family names as constants, e.g. in an HBaseTableContent class,
 * or pass them in a Map<String, Object>
 * @throws Exception
 */
public void putData() throws Exception {
String tableName = "user"; // default.user / hbase:meta
HTable table = getHTableByTableName(tableName);
Put put = new Put(Bytes.toBytes("10004"));
// Add a column with value
put.add(//
Bytes.toBytes("info"), //
Bytes.toBytes("name"), //
Bytes.toBytes("zhaoliu")//
);
put.add(//
Bytes.toBytes("info"), //
Bytes.toBytes("age"), //
Bytes.toBytes(25)//
);
put.add(//
Bytes.toBytes("info"), //
Bytes.toBytes("address"), //
Bytes.toBytes("shanghai")//
);
table.put(put);
table.close();
}
public void delete() throws Exception {
String tableName = "user"; // default.user / hbase:meta
HTable table = getHTableByTableName(tableName);
Delete delete = new Delete(Bytes.toBytes("10004"));
/*
* delete.deleteColumn(Bytes.toBytes("info"),//
* Bytes.toBytes("address"));
*/
delete.deleteFamily(Bytes.toBytes("info"));
table.delete(delete);
table.close();
}
public static void main(String[] args) throws Exception {
String tableName = "user"; // default.user / hbase:meta
HTable table = null;
ResultScanner resultScanner = null;
try {
table = getHTableByTableName(tableName);
Scan scan = new Scan();
// Range
scan.setStartRow(Bytes.toBytes("10001"));
scan.setStopRow(Bytes.toBytes("10003")) ;
// Scan scan2 = new Scan(Bytes.toBytes("10001"),Bytes.toBytes("10003"));
// PrefixFilter
// PageFilter
// scan.setFilter(filter) ;
// scan.setCacheBlocks(cacheBlocks);
// scan.setCaching(caching);
// scan.addColumn(family, qualifier)
// scan.addFamily(family)
resultScanner = table.getScanner(scan);
for (Result result : resultScanner) {
System.out.println(Bytes.toString(result.getRow()));
// System.out.println(result);
for (Cell cell : result.rawCells()) {
System.out.println(//
Bytes.toString(CellUtil.cloneFamily(cell)) + ":" //
+ Bytes.toString(CellUtil.cloneQualifier(cell)) + " ->" //
+ Bytes.toString(CellUtil.cloneValue(cell)));
}
System.out.println("---------------------------------------");
}
} catch (Exception e) {
e.printStackTrace();
} finally {
IOUtils.closeStream(resultScanner);
IOUtils.closeStream(table);
}
}
}