最近有一个需求就是在建模的时候,有少部分数据是postgres的,只能读取postgres里面的数据到hadoop里面进行建模测试,而不能导出数据到hdfs上去。
读取postgres里面的数据库有两种方法,一种就是用hadoop的DBInputFormat(DBInputFormat在hadoop2.4.1的jar里面有两个包,import
org.apache.hadoop.mapreduce.lib.db包和org.apache.hadoop.mapred包,前者是较新的),另外一种就是postgres的CopyManager类。
先说一说用DBInputFormat这个方法吧。
首先在数据库里面创建一个表,插入几条数据测试用
由于表里面的数据要用来做为map的输入Value,所以要自定义数据类型。
hadoop要自定义数据类型要实现Writable接口,如果是Key要自定义数据类型那么就要实现WritableComparable接口,还要实现里面的比较方法。实现WritableComparable接 口在比较时要反序列话,比较麻烦,那么可以用继承WritableComparator类来实现字节流的比较。
在配置DBInputFormat的输入参数时,必须要有一个数据类型实现DBWritable,所有在这里为Value自定义数据类型要实现DBWritable和Writable两个接口。
package com.qldhlbs.hadoop.demo0420;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
public class PgDbWritable implements DBWritable, Writable{
private Integer call_type_id;
private String call_type;
private String remark;
public PgDbWritable() {
}
public PgDbWritable(Integer call_type_id, String call_type, String remark){
set(call_type_id, call_type, remark);
}
public void set(Integer call_type_id, String call_type, String remark) {
this.call_type_id = call_type_id;
this.call_type = call_type;
this.remark = remark;
}
//结果集读取
@Override
public void readFields(ResultSet set) throws SQLException {
this.call_type_id = set.getInt(1);
this.call_type = set.getString(2);
this.remark = set.getString(3);
}
//设置参数
@Override
public void write(PreparedStatement ps) throws SQLException {
ps.setInt(1, this.call_type_id);
ps.setString(2, this.call_type);
ps.setString(3, this.remark);
}
//反序列化
@Override
public void readFields(DataInput in) throws IOException {
this.call_type_id = in.readInt();
this.call_type = in.readUTF();
this.remark = in.readUTF();
}
//序列化
@Override
public void write(DataOutput out) throws IOException {
out.writeInt(this.call_type_id);
out.writeUTF(this.call_type);
out.writeUTF(this.remark);
}
public Integer getCall_type_id() {
return call_type_id;
}
public String getCall_type() {
return call_type;
}
public String getRemark() {
return remark;
}
@Override
public String toString() {
return call_type_id + "\t" + call_type + "\t" + remark;
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ((call_type == null) ? 0 : call_type.hashCode());
result = prime * result + ((call_type_id == null) ? 0 : call_type_id.hashCode());
result = prime * result + ((remark == null) ? 0 : remark.hashCode());
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
PgDbWritable other = (PgDbWritable) obj;
if (call_type == null) {
if (other.call_type != null)
return false;
} else if (!call_type.equals(other.call_type))
return false;
if (call_type_id == null) {
if (other.call_type_id != null)
return false;
} else if (!call_type_id.equals(other.call_type_id))
return false;
if (remark == null) {
if (other.remark != null)
return false;
} else if (!remark.equals(other.remark))
return false;
return true;
}
}
首先在PgDbWritable 里面维护对应数据库表的3个字段,并覆写关键的四个方法。每个方法的作用在代码里面有介绍。重写toString,hashCode和equals方法。
自定义数据类型后就是读取数据库的数据了。
package com.qldhlbs.hadoop.demo0420;
import java.io.IOException;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MapreducePackageDbApp {
static class DbReadMapper extends Mapper<LongWritable, PgDbWritable, LongWritable, PgDbWritable>{
@Override
protected void map(LongWritable key, PgDbWritable value,
Mapper<LongWritable, PgDbWritable, LongWritable, PgDbWritable>.Context context)
throws IOException, InterruptedException {
context.write(key, value);
}
}
static class DbReadReduce extends Reducer<LongWritable, PgDbWritable, LongWritable, PgDbWritable>{
@Override
protected void reduce(LongWritable key, Iterable<PgDbWritable> values,
Reducer<LongWritable, PgDbWritable, LongWritable, PgDbWritable>.Context context) throws IOException, InterruptedException {
for (PgDbWritable value : values) {
context.write(key, value);
}
}
}
@SuppressWarnings("deprecation")
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, SQLException {
Configuration conf = new Configuration();
DBConfiguration.configureDB(conf, "org.postgresql.Driver", "jdbc:postgresql://192.168.0.203/test", "hb", "xxx");
Job job = Job.getInstance(conf);
job.setJarByClass(MapreducePackageDbApp.class);
job.setJobName(MapreducePackageDbApp.class.getSimpleName());
DistributedCache.addFileToClassPath(new Path("hdfs://192.168.0.201:49000/user/qldhlbs/lib/postgresql-9.3-1101.jdbc3.jar"), conf);
String[] fields = {"call_type_id", "call_type", "remark"};
DBInputFormat<PgDbWritable> in = new DBInputFormat<PgDbWritable>();
in.setConf(conf);
//配置DBInputFormat的信息,job, 输入DBWritable, 表名, 查询条件, order by条件, 表的字段数组
DBInputFormat.setInput(job, PgDbWritable.class, "dim_160_168_call_type", null, null, fields);
job.setMapperClass(DbReadMapper.class);
//可以不设置reducer,hadoop会自动配置最简的reducer,看源码可以知道是输出map的输出
job.setReducerClass(DbReadReduce.class);
job.setOutputKeyClass(LongWritable.class);
//job.setOutputValueClass(Text.class);
job.setOutputValueClass(PgDbWritable.class);
job.setInputFormatClass(DBInputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.0.201:49000/user/qldhlbs/db5"));
boolean isSuccess = job.waitForCompletion(true);
System.exit(isSuccess ? 0 : 1);
}
}
这里只是一个demo,所以map函数就直接输出读取到的内容就行了,由于reduce函数不写,就是直接写出读取到的map函数数据,所有这里reduce函数也可以不写。
在这里有几点是要注意的,首先这里面的包都是导入的mapreduce的而不是mapred包,混淆会报错;第二点是在hadoop的hdfs上上传一份postgres的驱动包,
先在hdfs上创建一个目录:hadoop fs -mkdir /user/qldhlbs/lib,然后把文件上传上去:hadoop fs -copyFromLocal postgresql-9.3-1101.jdbc3.jar /user/qldhlbs/lib。
在代码里面就是用DistributedCache.addFileToClassPath(new Path("hdfs://192.168.0.201:49000/user/qldhlbs/lib/postgresql-9.3-1101.jdbc3.jar"), conf)这个方法
把jar加载到类路径上去;第三点就是配置DBConfiguration信息,参数依次是Configuration ,数据库驱动,数据库url,用户名,密码。在配置完DBConfiguration信息后,
DBInputFormat<PgDbWritable> in = new DBInputFormat<PgDbWritable>();
in.setConf(conf);
setConf()这个方法不能忘记,一开始就是没调用这个方法把conf给DBInputFormat,一直报空指针异常,后来经过调试查看得知是connection没得到,但是DBConfiguration得到了connection。再进一步调试是DBInputFormat没得到DBConfiguration对象,所以根本就获取不到connection。查看hadoop-mapreduce-client-core-2.4.1源码才解决问题。
public void setConf(Configuration conf)
{
this.dbConf = new DBConfiguration(conf);
try
{
getConnection();
DatabaseMetaData dbMeta = this.connection.getMetaData();
this.dbProductName = dbMeta.getDatabaseProductName().toUpperCase();
}
catch (Exception ex) {
throw new RuntimeException(ex);
}
this.tableName = this.dbConf.getInputTableName();
this.fieldNames = this.dbConf.getInputFieldNames();
this.conditions = this.dbConf.getInputConditions();
}
public Connection getConnection() {
try {
if (null == this.connection)
{
this.connection = this.dbConf.getConnection();
this.connection.setAutoCommit(false);
this.connection.setTransactionIsolation(8);
}
}
catch (Exception e) {
throw new RuntimeException(e);
}
return this.connection;
}
这是反编译的部分源码,可以看到connection是可以从DBConfiguration对象拿的;第四点就是配置DBInputFormat的信息,参数是job, 输入DBWritable, 表名, 查询条件, order by条件, 表的字段字符串数组。
所有的做完了接下来就可以跑hadoop了。
这是在hdfs里面生成的文件,可以看到数据读取到hdfs上了。
如果不用mapreduce包,用mapred包也是可以的,代码就不上了,差不多,只是不要掉用setConf()方法把conf绑定上去也行。
这是第一种方法,第二种方法就是直接用org.postgresql.copy.CopyManager这个类
public ByteArrayOutputStream copyToStream(String tableOrQuery,String delimiter){
try {
ByteArrayOutputStream out = new ByteArrayOutputStream();
CopyManager copyManager = new CopyManager(
(BaseConnection) getConnection());
String copySql = "COPY " + tableOrQuery + " TO STDOUT";
if (delimiter != null){
copySql = copySql + " WITH DELIMITER AS '"+delimiter+"'";
}
copyManager.copyOut(copySql,
out);
return out;
}catch(Exception e){
e.printStackTrace();
}
return null;
}
ByteArrayOutputStream out = copyToStream(sql.toString(), ",");
ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray());
public void uploadFile(String hdfsPath,InputStream in){
try {
FileSystem hdfs = FileSystem.get(conf);
FSDataOutputStream out = hdfs.create(new Path(hdfsPath));
org.apache.hadoop.io.IOUtils.copyBytes(in, out,4096,false);
out.sync();
out.close();
} catch (Exception e) {
// TODO: handle exception
}
}
把流读取出来,用hadoop自带的IOUtils.copyBytes()方法写到hdfs上就可以了就可以了。