flume篇1:flume把json数据写入kudu(flume-kudu-sink)
对应非json数据同样适用,可以把非json数据通过拦截器拼接成一个json send出去,这样也是ok的
废话不多说,直接上干货

一、 自定义拦截器:
1 拦截器要求:新建一个新的工程,单独打包,保证每个flume的的拦截器都是单独的一个工程打的包,这样保证每次对拦截器修改的时候不影响其他flume业务,当然你也可以把string或者csv格式的数据split成数组,然后再拼成一个json send出去,这样也是ok的
(个人习惯,喜欢用拦截器,其实可以可下面自定义JsonKuduOperationsProducer两个是冗余的,可以去掉)
拦截器pom如下(pom包多了,可以自己去掉):

<properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <scala.version>2.10.4</scala.version>
        <flume.version>1.8.0</flume.version>
    </properties>

     <dependencies>
            <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>${flume.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>commons-net</groupId>
            <artifactId>commons-net</artifactId>
            <version>3.3</version>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>6.1.1</version>
            <scope>test</scope>
        </dependency>
       <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbondata-store-sdk</artifactId>
            <version>1.5.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbondata-core</artifactId>
            <version>1.5.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbondata-common</artifactId>
            <version>1.5.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbondata-format</artifactId>
            <version>1.5.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbondata-hadoop</artifactId>
            <version>1.5.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbondata-processing</artifactId>
            <version>1.5.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.carbondata</groupId>
            <artifactId>carbonata</artifactId>
            <version>1.5.3</version>
            <scope>system</scope>
            <systemPath>${project.basedir}/lib/apache-carbondata-1.5.3-bin-spark2.3.2-hadoop2.6.0-cdh5.16.1.jar</systemPath>
        </dependency>
     <dependency>
            <groupId>org.apache.mina</groupId>
            <artifactId>mina-core</artifactId>
            <version>2.0.9</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.mockito</groupId>
            <artifactId>mockito-all</artifactId>
            <version>1.9.5</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.sshd</groupId>
            <artifactId>sshd-core</artifactId>
            <version>0.14.0</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.jcraft</groupId>
            <artifactId>jsch</artifactId>
            <version>0.1.54</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.12</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.47</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.5</version>
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>16.0.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>0.11.0.0</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.46</version>
            <scope>compile</scope>
        </dependency>
    </dependencies>

2 拦截器代码如下:
(以下拦截器主要目的是:把一个嵌套2层的body Json中的各个字段取出来,并对字段进行重命名)

package com.iflytek.extracting.flume.interceptor;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.text.SimpleDateFormat;
import java.util.List;
import java.util.Map;

public class XyJsonInterceptorTC implements Interceptor {
    private static final Logger logger = LoggerFactory.getLogger(XyJsonInterceptorTC .class);
    private SimpleDateFormat dataFormat;
    @Override
    public void initialize() {
        dataFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    }
    @Override
    public Event intercept(Event event) {
        String body = new String(event.getBody());
        try {
            JSONObject jsonObject = JSON.parseObject(body);  //把string转成json
            JSONObject bodyObject1 = jsonObject.getJSONObject("body"); //获取第一层body
            JSONObject bodyObject2 = bodyObject1.getJSONObject("body");//获取第二层body
            JSONObject resObject = new JSONObject();//new 一个新的json,用于存放前面json body中取出来的各个字段
            getPut(resObject, bodyObject2, "id", "newId");//取出json 中的id,放到新的json中,并重命名为newid
            getPut(resObject, bodyObject2, "name", "newName");
            getPut(resObject, bodyObject2, "age", "newAge");
            getPutDate(resObject, bodyObject2, "time", "newTime", dataFormat);//把Long类型时间戳转成date类型
            logger.info("拦截器最hou输出结果为resObject:" + resObject.toString());
            event.setBody(resObject.toString().getBytes());
            return event;
        } catch (Exception e) {
            logger.info("ERROR格式数据" + body.toString());//对于不符合约定的数据用日志抛出来,有需要的时候可以从日志中回滚,这里加了try catch十分必要,不然flume处理的错误数据多了,容易挂
            return null;
        }
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> resultList = Lists.newArrayList();
        for (Event event : events) {
            Event result = intercept(event);
            if (result != null) {
                resultList.add(result);
            }
        }
        return resultList;
    }

    @Override
    public void close() {

    }
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new XyJsonInterceptorTC ();
        }
        @Override
        public void configure(Context context) {
        }
    }

    public static void getPut(JSONObject resObject, JSONObject jsonObject, String oldName, String newName) {
        Object value = jsonObject.get(oldName);
        if (value !=null){
            resObject.put(newName, value.toString());
        }
    }
    public static void getPutDate(JSONObject resObject,JSONObject jsonObject,String oldName,String newName,SimpleDateFormat dataFormat) {
        Object value = jsonObject.get(oldName);
        if (value !=null){
            Long valuelong=  Long.parseLong(value.toString());
            resObject.put(newName, dataFormat.format(valuelong).toString());
        }
    }

    public static void getPutLong(JSONObject resObject,JSONObject jsonObject,String oldName,String newName) {
        Object value = jsonObject.get(oldName);
        if (value !=null){
            Long valuelong=  Long.parseLong(value.toString());
            resObject.put(newName, valuelong);
        }
    }
}

3 打包上传,到flume的lib位置,cdh位置如下:/opt/cloudera/parcels/CDH/lib/flume-ng/lib/

二、上传kudu sink包相关jar包

/opt/cloudera/parcels/CDH/lib/flume-ng/lib/ 目录
kudu-client-1.7.0-cdh5.16.1.jar
kudu-flume-sink-1.7.0-cdh5.16.1.jar
如果你的是cdh版本,并且flume已经和kudu关联了,那么这2个包都是自动生成的,就在kudu的lib目录下,不需要重新编译,直接分发到flume 的lib下就好了

三、自定义kudu_json解析器,这是一个通用包,其实打过一次包上传后每个kudu sink都可以使用
1 json解析器要求如下:需要单独新建一个工程,然后单独打包上传即可

2 pom 如下:

<properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <scala.version>2.10.4</scala.version>
        <flume.version>1.8.0</flume.version>
        <kudu.version>1.7.1</kudu.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.kudu</groupId>
            <artifactId>kudu-client</artifactId>
            <version>${kudu.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-core</artifactId>
            <version>${flume.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flume</groupId>
            <artifactId>flume-ng-configuration</artifactId>
            <version>${flume.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.8.1</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.12</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.47</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kudu</groupId>
            <artifactId>kudu-flume-sink</artifactId>
            <version>1.7.0</version>
        </dependency>
    </dependencies>

3 代码如下:不用考虑其原理,这个一个通用包,可以直接打包上传就完事了

package com.iflytek;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.google.common.base.Preconditions;
import com.google.common.collect.Lists;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.FlumeException;
import org.apache.flume.annotations.InterfaceAudience;
import org.apache.flume.annotations.InterfaceStability;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.Operation;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.flume.sink.KuduOperationsProducer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

@InterfaceAudience.Public
@InterfaceStability.Evolving
public class XyJsonKuduOperationsProducer implements KuduOperationsProducer {
    private static final Logger logger = LoggerFactory.getLogger(XyJsonKuduOperationsProducer.class);
    private static final String INSERT = "insert";
    private static final String UPSERT = "upsert";
    private static final List<String> validOperations = Lists.newArrayList(UPSERT, INSERT);
    public static final String ENCODING_PROP = "encoding";
    public static final String DEFAULT_ENCODING = "utf-8";
    public static final String OPERATION_PROP = "operation";
    public static final String DEFAULT_OPERATION = UPSERT;
    public static final String SKIP_MISSING_COLUMN_PROP = "skipMissingColumn";
    public static final boolean DEFAULT_SKIP_MISSING_COLUMN = false;
    public static final String SKIP_BAD_COLUMN_VALUE_PROP = "skipBadColumnValue";
    public static final boolean DEFAULT_SKIP_BAD_COLUMN_VALUE = false;
    public static final String WARN_UNMATCHED_ROWS_PROP = "skipUnmatchedRows";
    public static final boolean DEFAULT_WARN_UNMATCHED_ROWS = true;
    private KuduTable table;
    private Charset charset;
    private String operation;
    private boolean skipMissingColumn;
    private boolean skipBadColumnValue;
    private boolean warnUnmatchedRows;
    public XyJsonKuduOperationsProducer() {
    }
    @Override
    public void configure(Context context) {
        String charsetName = context.getString(ENCODING_PROP, DEFAULT_ENCODING);
        try {
            charset = Charset.forName(charsetName);
        } catch (IllegalArgumentException e) {
            throw new FlumeException(
                    String.format("Invalid or unsupported charset %s", charsetName), e);
        }
        operation = context.getString(OPERATION_PROP, DEFAULT_OPERATION).toLowerCase();
        Preconditions.checkArgument(
                validOperations.contains(operation),
                "Unrecognized operation '%s'",
                operation);
        skipMissingColumn = context.getBoolean(SKIP_MISSING_COLUMN_PROP,
                DEFAULT_SKIP_MISSING_COLUMN);
        skipBadColumnValue = context.getBoolean(SKIP_BAD_COLUMN_VALUE_PROP,
                DEFAULT_SKIP_BAD_COLUMN_VALUE);
        warnUnmatchedRows = context.getBoolean(WARN_UNMATCHED_ROWS_PROP,
                DEFAULT_WARN_UNMATCHED_ROWS);
    }
    @Override
    public void initialize(KuduTable table) {
        this.table = table;
    }
    @Override
    public List<Operation> getOperations(Event event) throws FlumeException {
        String s = new String(event.getBody(), charset);
        Map<String, Object> rawMap = jsonStr2Map(s);
        logger.info("sink数据为"+rawMap.toString());
        Schema schema = table.getSchema();
        List<Operation> ops = Lists.newArrayList();
        if(s != null && !s.isEmpty()) {
            Operation op;
            switch (operation) {
                case UPSERT:
                    op = table.newUpsert();
                    break;
                case INSERT:
                    op = table.newInsert();
                    break;
                default:
                    throw new FlumeException(
                            String.format("Unrecognized operation type '%s' in getOperations(): " +
                                    "this should never happen!", operation));
            }
            PartialRow row = op.getRow();
            for (ColumnSchema col : schema.getColumns()) {
//                logger.info("Column:" + col.getName() + "----" + rawMap.get(col.getName()));
                try {
                    Object mpaValue = rawMap.get(col.getName());
                    if (mpaValue!=null){
                        coerceAndSet(mpaValue.toString(), col.getName(), col.getType(), row);
                    }

                } catch (NumberFormatException e) {
                    String msg = String.format(
                            "Raw value '%s' couldn't be parsed to type %s for column '%s'",
                            s, col.getType(), col.getName());
                    logOrThrow(skipBadColumnValue, msg, e);
                } catch (IllegalArgumentException e) {
                    String msg = String.format(
                            "Column '%s' has no matching group in '%s'",
                            col.getName(), s);
                    logOrThrow(skipMissingColumn, msg, e);
                } catch (Exception e) {
                    throw new FlumeException("Failed to create Kudu operation", e);
                }
            }
            ops.add(op);
        }
        return ops;
    }
    /**
     * Coerces the string `rawVal` to the type `type` and sets the resulting
     * value for column `colName` in `row`.
     *
     * @param rawVal the raw string column value
     * @param colName the name of the column
     * @param type the Kudu type to convert `rawVal` to
     * @param row the row to set the value in
     * @throws NumberFormatException if `rawVal` cannot be cast as `type`.
     */
    private void coerceAndSet(String rawVal, String colName, Type type, PartialRow row)
            throws NumberFormatException {
        switch (type) {
            case INT8:
                row.addByte(colName, Byte.parseByte(rawVal));
                break;
            case INT16:
                row.addShort(colName, Short.parseShort(rawVal));
                break;
            case INT32:
                row.addInt(colName, Integer.parseInt(rawVal));
                break;
            case INT64:
                row.addLong(colName, Long.parseLong(rawVal));
                break;
            case BINARY:
                row.addBinary(colName, rawVal.getBytes(charset));
                break;
            case STRING:
                row.addString(colName, rawVal==null?"":rawVal);
                break;
            case BOOL:
                row.addBoolean(colName, Boolean.parseBoolean(rawVal));
                break;
            case FLOAT:
                row.addFloat(colName, Float.parseFloat(rawVal));
                break;
            case DOUBLE:
                row.addDouble(colName, Double.parseDouble(rawVal));
                break;
            case UNIXTIME_MICROS:
                row.addLong(colName, Long.parseLong(rawVal));
                break;
            default:
                logger.warn("got unknown type {} for column '{}'-- ignoring this column", type, colName);
        }
    }
    private void logOrThrow(boolean log, String msg, Exception e)
            throws FlumeException {
        if (log) {
            logger.warn(msg, e);
        } else {
            throw new FlumeException(msg, e);
        }
    }
    @Override
    public void close() {
    }

    public static Map<String, Object> jsonStr2Map(String jsonStr) {
        Map<String, Object> resultMap = new HashMap<>();
        JSONObject jsonObject = JSON.parseObject(jsonStr);
        resultMap=JSON.parseObject(jsonObject.toString(), Map.class);
        return resultMap;
    }
}

4 打包上传至 /opt/cloudera/parcels/CDH/lib/flume-ng/lib/ 目录

四、配置flume的conf
1 在 /opt/cloudera/parcels/CDH/lib/flume-ng/conf目录下,
vi kudu.conf
输入以下内容:

ng.sources = kafkaSource
ng.channels = memorychannel
ng.sinks =  kudusink
ng.sources.kafkaSource.type= org.apache.flume.source.kafka.KafkaSource
ng.sources.kafkaSource.kafka.bootstrap.servers=cdh01:9092,cdh02:9092,cdh03:9092
ng.sources.kafkaSource.kafka.consumer.group.id=xytest1
ng.sources.kafkaSource.kafka.topics=pd_ry_txjl 
ng.sources.kafkaSource.batchSize=1000
ng.sources.kafkaSource.channels= memorychannel
ng.sources.kafkaSource.kafka.consumer.auto.offset.reset=latest
ng.sources.kafkaSource.interceptors= i1
ng.sources.kafkaSource.interceptors.i1.type=com.iflytek.extracting.flume.interceptor.XyJsonInterceptorTC$Builder
#自定义的拦截器类,对于kudu其他表的接入,其实后面kudu 的配置都已经固定了,不需要更改,你只需要重新写一个拦截器,并把下面的kudu表名更改即可

ng.channels.memorychannel.type = memory
ng.channels.memorychannel.keep-alive = 3
ng.channels.memorychannel.byteCapacityBufferPercentage = 20
ng.channels.memorychannel.transactionCapacity = 10000
ng.channels.memorychannel.capacity = 100000

ng.sinks.kudusink.type = org.apache.kudu.flume.sink.KuduSink
ng.sinks.kudusink.masterAddresses = cdh01:7051  # kudu 的ip
ng.sinks.kudusink.tableName = rytx_kudu   # kudu 表名,这里没有写impapa表名,直接写kudu表名就行了
ng.sinks.kudusink.operation = insert
ng.sinks.kudusink.batchSize = 50
ng.sinks.kudusink.producer = com.iflytek.XyJsonKuduOperationsProducer  #自定义的json解析类

ng.sinks.kudusink.channel = memorychannel

2 启动flume:

bin/flume-ng agent -n ng -c conf -f conf/kudu.conf
cdh版本的flume默认的日志打在 /var/log/flume/flume.log里面
查看数据已经接入kudu,并确定没问题可以使用后台提交:
nohup bin/flume-ng agent -n ng -c conf -f conf/kudu.conf &
任务停止:
jcmd | grep kudu.conf   # 找到含有 kudu.conf的任务
然后kill 任务id即可