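I run the receiving Flume tier on two nodes:
1. When both nodes are up, they share the traffic (load balancing).
2. When one node goes down, the surviving node takes over the traffic previously handled by both (high availability).
3. A Kafka channel delivers the data to Kafka, which covers both performance and data safety.
4. Resumable transfer (reading picks up where it left off).
The channel connects directly to Kafka (no separate Kafka sink is configured), which saves resources.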
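Points 1 and 2 can be illustrated with a toy model. The sketch below is not Flume's implementation; it only assumes a round-robin choice between the two receiving nodes and a manual "down" flag, whereas Flume's load_balance sink processor with backoff handles failed sinks on its own. The class name and methods are hypothetical.

```python
class RoundRobinSelector:
    """Toy model: rotate over sinks and skip the ones marked as down."""

    def __init__(self, sinks):
        self.sinks = list(sinks)
        self.down = set()
        self._next = 0  # index of the sink to try next

    def mark_down(self, sink):
        self.down.add(sink)

    def mark_up(self, sink):
        self.down.discard(sink)

    def pick(self):
        # Try each sink at most once per call, starting at the rotation point.
        for _ in range(len(self.sinks)):
            sink = self.sinks[self._next]
            self._next = (self._next + 1) % len(self.sinks)
            if sink not in self.down:
                return sink
        raise RuntimeError("no sink available")


if __name__ == "__main__":
    selector = RoundRobinSelector(["node2", "node3"])
    print([selector.pick() for _ in range(4)])  # traffic alternates
    selector.mark_down("node3")
    print([selector.pick() for _ in range(3)])  # node2 takes everything
```

With both nodes up the picks alternate; once one node is marked down, every pick lands on the survivor, which is the behavior the two-node design relies on.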
The configuration (two copies, one per node) is:
# name of the source
tier1.sources = source1
# name of the channel
tier1.channels = kafka-mobile-channel

tier1.sources.source1.type = avro
tier1.sources.source1.bind = 0.0.0.0
tier1.sources.source1.port = 44444
tier1.sources.source1.channels = kafka-mobile-channel
tier1.sources.source1.selector.type = multiplexing
tier1.sources.source1.selector.header = topic
tier1.sources.source1.selector.mapping.mobile = kafka-mobile-channel

tier1.channels.kafka-mobile-channel.type = org.apache.flume.channel.kafka.KafkaChannel
# controls whether Flume header information has to be parsed downstream
tier1.channels.kafka-mobile-channel.parseAsFlumeEvent = false
tier1.channels.kafka-mobile-channel.kafka.topic = tomcat-mobile
tier1.channels.kafka-mobile-channel.kafka.consumer.group.id = flume-tomcat-mobile
tier1.channels.kafka-mobile-channel.kafka.consumer.auto.offset.reset = earliest
tier1.channels.kafka-mobile-channel.kafka.bootstrap.servers = ZW0804-hadoop-89:9092,ZW0804-hadoop-90:9092,ZW0804-hadoop-91:9092

The upstream (collecting) agent is configured as follows:
# agent
collector.sources = taildir-source
collector.channels = file-channel
collector.sinks = avro-forward-sink-node2 avro-forward-sink-node3

# source
collector.sources.taildir-source.type = TAILDIR
collector.sources.taildir-source.channels = file-channel
collector.sources.taildir-source.positionFile = /var/log/flume-ng/taildir_position.json
collector.sources.taildir-source.filegroups = f1
collector.sources.taildir-source.filegroups.f1 = /tmp/nginx/.+.log
collector.sources.taildir-source.fileHeader = true
collector.sources.taildir-source.interceptors = topic UUID
collector.sources.taildir-source.interceptors.topic.type = static
collector.sources.taildir-source.interceptors.topic.key = topic
collector.sources.taildir-source.interceptors.topic.value = we-user
collector.sources.taildir-source.interceptors.topic.preserveExisting = false
collector.sources.taildir-source.interceptors.UUID.type=org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
collector.sources.taildir-source.interceptors.UUID.headerName=key
collector.sources.taildir-source.interceptors.UUID.prefix=NODE_
collector.sources.taildir-source.interceptors.UUID.preserveExisting=false
collector.sources.taildir-source.skipToEnd = true

# channel
collector.channels.file-channel.type=file
# checkpoint (backup) directory of the channel
collector.channels.file-channel.checkpointDir = /var/log/flume-ng/file-channel/checkpoint
# data storage directory
collector.channels.file-channel.dataDirs = /var/log/flume-ng/file-channel/data

# sinks: events are distributed across the receiving nodes
collector.sinks.avro-forward-sink-node2.type = avro
collector.sinks.avro-forward-sink-node2.channel = file-channel
# host/IP of a load-balanced node
collector.sinks.avro-forward-sink-node2.hostname = node2
collector.sinks.avro-forward-sink-node2.port = 44444

collector.sinks.avro-forward-sink-node3.type = avro
collector.sinks.avro-forward-sink-node3.channel = file-channel
# host/IP of a load-balanced node
collector.sinks.avro-forward-sink-node3.hostname = node3
collector.sinks.avro-forward-sink-node3.port = 44444

# load balance
collector.sinkgroups = g1
collector.sinkgroups.g1.sinks = avro-forward-sink-node2 avro-forward-sink-node3
collector.sinkgroups.g1.processor.type = load_balance
collector.sinkgroups.g1.processor.backoff = true

Resumable transfer
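The collecting Flume agent uses the TAILDIR source.
Offsets are stored in: /var/log/flume-ng/taildir_position.json
(Note: [{"inode":52299335,"pos":13,"file":"/tmp/nginx/aa.log"},{"inode":52299428,"pos":81,"file":"/tmp/nginx/test.log"}])
Here inode identifies the file and does not change when the file is renamed, pos records the read offset (in bytes), and file is the absolute path.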
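To make the role of the three fields concrete, here is a small sketch that reads the two entries quoted above (written with plain ASCII quotes) and looks up the resume offset by inode. The function names are hypothetical, and this only mimics the idea of "resume from pos for a known inode"; it is not how Flume itself is implemented.

```python
import json

# The two entries from the position file quoted above.
SAMPLE = (
    '[{"inode":52299335,"pos":13,"file":"/tmp/nginx/aa.log"},'
    '{"inode":52299428,"pos":81,"file":"/tmp/nginx/test.log"}]'
)


def index_by_inode(position_json):
    """Return {inode: (pos, file)} built from the position file's JSON text."""
    return {e["inode"]: (e["pos"], e["file"]) for e in json.loads(position_json)}


def resume_offset(index, inode):
    """Offset to resume reading from, or 0 when the inode has never been seen."""
    pos, _file = index.get(inode, (0, None))
    return pos


if __name__ == "__main__":
    index = index_by_inode(SAMPLE)
    # The lookup key is the inode, so a renamed file keeps its offset.
    print(resume_offset(index, 52299335))
    print(resume_offset(index, 1))  # unknown inode, start from the beginning
```

Keying on the inode rather than the path is what lets a renamed file keep its recorded offset.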
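Test method: shut down Kafka, then produce data under the monitored path (/tmp/nginx/.+.log).
The pos offsets in the position file were updated, even though Kafka was down at that point.
The files under the collecting agent's channel data and checkpoint directories stayed roughly the same size, which shows that the collecting agent read the data and forwarded it successfully to the downstream Flume cluster.
After 15 minutes Kafka was started again, and the data produced during the outage came through, so the transfer resumes and the pipeline tolerates Kafka going down.
Second test: the collecting agent fails to write to the Flume cluster. Watching the data path (tail -f /var/log/flume-ng/file-channel/data) shows writes to the channel's data files.
After the collecting agent's configuration was corrected, the data that had not been delivered was sent on to the Kafka message queue.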
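One way to check the "pos was updated" observation during such a test is to compare each recorded offset with the current file size. This is a minimal sketch, assuming pos is a byte offset and that the position file has the JSON layout shown earlier; the function name and the injectable size_of parameter are hypothetical.

```python
import json
import os


def unread_bytes(position_json, size_of=os.path.getsize):
    """Map each tracked file to (current size - recorded pos).

    0 means the source has read everything written so far.
    """
    backlog = {}
    for entry in json.loads(position_json):
        backlog[entry["file"]] = size_of(entry["file"]) - entry["pos"]
    return backlog


if __name__ == "__main__":
    # Path taken from the configuration above; it must exist on this machine.
    path = "/var/log/flume-ng/taildir_position.json"
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            for name, pending in unread_bytes(fh.read()).items():
                print(name, pending)
```

Running it before and after producing test data shows whether the source keeps reading while Kafka is stopped.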
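Sometimes we need to parse information from the Flume header (todo)
1. A common requirement: when parsing business logs, a log line may have no field that uniquely identifies it, so we usually add one to tell lines apart and to make later deduplication possible.
Example:
On the log collection side we add a UUID to every log event:
collector.sources.taildir-source.interceptors.UUID.type=org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
collector.sources.taildir-source.interceptors.UUID.headerName=key
Then the UUID is parsed out in code,
and finally a MySQL unique primary key is used for deduplication.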
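The deduplication step can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for MySQL so the example runs on its own, the table and column names are hypothetical, and the consumer is assumed to have already extracted the UUID for each record (how it reaches the consumer depends on the channel settings and is not shown here).

```python
import sqlite3


def store_events(conn, events):
    """Insert (uuid, body) pairs and return the number of rows in the table.

    The uuid column is the primary key, so a repeated uuid is skipped
    instead of creating a duplicate row.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS log_event ("
        "uuid TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    # SQLite syntax; the MySQL counterpart is INSERT IGNORE INTO ...
    conn.executemany(
        "INSERT OR IGNORE INTO log_event (uuid, body) VALUES (?, ?)", events
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM log_event").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical events; the third one is a redelivery of the first.
    events = [
        ("NODE_a1", "GET /index"),
        ("NODE_b2", "GET /login"),
        ("NODE_a1", "GET /index"),
    ]
    print(store_events(conn, events))  # 2
```

Because the database enforces uniqueness, redelivered events (for example after a resumed transfer) do not need to be filtered in application code.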