1. Upload flume-ng-1.5.0-cdh5.3.6.tar.gz to /opt/modules/cdh/ and extract it
2. Edit conf/flume-env.sh under the Flume installation directory
export JAVA_HOME=/usr/java/jdk1.7.0_79
3. Edit /etc/profile
export FLUME_HOME=/opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin
export PATH=$PATH:$FLUME_HOME/bin
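The steps above can be scripted roughly as follows, a minimal sketch using the paths from this document (the copy from flume-env.sh.template is only needed if conf/flume-env.sh does not exist yet):
cd /opt/modules/cdh/
tar -zxvf flume-ng-1.5.0-cdh5.3.6.tar.gz
cd apache-flume-1.5.0-cdh5.3.6-bin/conf
cp flume-env.sh.template flume-env.sh                      # only if flume-env.sh is missing
echo 'export JAVA_HOME=/usr/java/jdk1.7.0_79' >> flume-env.sh
source /etc/profile                                        # after adding FLUME_HOME and PATH
flume-ng version                                           # prints the Flume version if PATH is set correctly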
4. Flume usage scenario
Flume is the data entry point of a log-analysis project.
5. The figure below shows how an event is processed. An event is Flume's basic unit of data: it carries the log data (as a byte array) together with header information. Events are generated by a Source from data outside the Agent; after capturing the data, the Source formats it and pushes the events into one or more Channels. A Channel can be thought of as a buffer that holds events until a Sink has processed them. The Sink persists the log data or forwards the events to another Source.
When a node fails, logs can be delivered to other nodes without being lost.
6. Concepts and principles (a minimal configuration sketch follows this list)
Client: produces the data; runs in an independent thread.
Event: a unit of data made up of headers and a body (a log record, an Avro object, etc.).
Flow: an abstraction of the movement of events from a source point to a destination.
Agent: an independent Flume process containing the Source, Channel and Sink components.
Source: the data-collection component; it receives data from the Client and hands it to the Channel.
Channel: temporary storage holding the queue of events passed in by the Source.
Sink: reads and removes events from the Channel and passes them on to the next Agent (if there is one).
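The sketch below shows how these concepts map onto a Flume properties file; the agent, source, channel and sink names (a1, r1, c1, k1) and the port are arbitrary examples:
# one agent named a1 with one source, one channel and one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# the source collects data and writes it into the channel
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
# the channel buffers events in memory until the sink takes them
a1.channels.c1.type = memory
# the sink reads events from the channel and logs them to the console
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1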
7. Avro, a binary data-serialization system
Avro uses schemas to define data structures. An Avro data file consists of two parts: a header and data blocks. The header itself has three parts: a four-byte prefix (similar to a magic number), the file metadata, and a randomly generated 16-byte sync marker.
8. Avro supports two serialization encodings: binary and JSON. Binary encoding is efficient and produces compact output, while JSON encoding is typically used for debugging or for web-based applications.
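As an illustration of these two encodings, the avro-tools jar (the jar and file names below are only examples) can read a binary Avro data file and print its schema, header metadata, and a JSON rendering of the records:
java -jar avro-tools.jar getschema part-00000.avro     # schema stored in the file header
java -jar avro-tools.jar getmeta part-00000.avro       # file metadata from the header
java -jar avro-tools.jar tojson part-00000.avro        # decode the binary data blocks to JSON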
9. A message sent from the client to the server passes through the transport layer, which sends the message and receives the server's response. The data reaching the transport layer is binary. HTTP is commonly used as the transport, with the data sent to the other side as a POST request.
10. Single-flow test example
Use Flume to monitor a directory and print the contents of any new file in it to the console.
Edit $FLUME_HOME/conf/a1.properties
# Configure an agent; the agent name can be customized (here a1)
# Specify the agent's sources (s1), sinks (k1) and channels (c1)
# Name the agent's sources, sinks and channels; the names can be customized
a1.sources = s1
a1.sinks = k1
a1.channels = c1
# Describe the source
# Configure a spooling-directory source
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs
a1.sources.s1.fileHeader = true
a1.sources.s1.channels = c1
# Configure the sink
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
# Configure the channel (memory buffer)
a1.channels.c1.type = memory
Start the agent
flume-ng agent --conf conf --conf-file /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/conf/a1.properties --name a1 -Dflume.root.logger=INFO,console
Upload a file lining07.log to /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs
The console will display the file's contents in real time.
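A quick way to verify this (the /tmp path below is just an example) is to copy a file into the monitored directory and watch the agent's console; by default the spooldir source renames fully ingested files with a .COMPLETED suffix:
cp /tmp/lining07.log /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs/
ls /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs/        # the file should now appear as lining07.log.COMPLETED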
11. Summary of source configurations
a. Install netcat and use data sent via nc as the source data
# source type
a1.sources.r1.type = netcat
# bind address
a1.sources.r1.bind = localhost
# port
a1.sources.r1.port = 5566
Data can then be sent with: nc localhost 5566
b. Monitor a spooling directory as the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/logs
a1.sources.r1.fileHeader = true
c. Tail a file as the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/test-flume.log
a1.sources.r1.shell = /bin/bash -c
d. Consume data from Kafka producers as the source (a quick producer test follows the configuration). Note: the property names below follow the Kafka source of Flume 1.7 and later; the Kafka source shipped with older CDH releases used zookeeperConnect, topic and groupId instead.
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.channels = mem_channel
a1.sources.r1.batchSize = 5000
# Kafka broker list
a1.sources.r1.kafka.bootstrap.servers = node02:9092,node03:9092,node04:9092
# Kafka topic(s) to consume
a1.sources.r1.kafka.topics = hellotopic
# consumer group id
a1.sources.r1.kafka.consumer.group.id = flume_test_id
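To test this source, a few messages can be produced to the topic from any broker node; the node name follows the document's example cluster, and newer Kafka versions use --bootstrap-server instead of --broker-list:
kafka-console-producer.sh --broker-list node02:9092 --topic hellotopic
Each line typed into the producer becomes one Flume event delivered through the channel.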
12. Single-file, single-flow agent writing to HDFS
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type= memory
agent1.channels.ch1.capacity = 100000
agent1.channels.ch1.transactionCapacity = 100000
agent1.channels.ch1.keep-alive = 30
#define source monitor a file
agent1.sources.avro-source1.type = exec
agent1.sources.avro-source1.shell = /bin/bash -c
agent1.sources.avro-source1.command = tail -n +0 -F /opt/test/testlog.txt
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.threads = 5
# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type= hdfs
agent1.sinks.log-sink1.hdfs.path = hdfs://192.168.198.131:8020/flumeTest
agent1.sinks.log-sink1.hdfs.writeFormat = Text
agent1.sinks.log-sink1.hdfs.fileType = DataStream
agent1.sinks.log-sink1.hdfs.rollInterval = 0
agent1.sinks.log-sink1.hdfs.rollSize = 1000000
agent1.sinks.log-sink1.hdfs.rollCount = 0
agent1.sinks.log-sink1.hdfs.batchSize = 1000
agent1.sinks.log-sink1.hdfs.txnEventMax = 1000
agent1.sinks.log-sink1.hdfs.callTimeout = 60000
agent1.sinks.log-sink1.hdfs.appendTimeout = 60000
# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1
Run the agent (--conf expects the configuration directory and --conf-file the properties file)
flume-ng agent --conf /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/conf --conf-file /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/conf/a1.properties --name agent1 -Dflume.root.logger=INFO,console
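Once some lines have been appended to /opt/test/testlog.txt, the result can be checked on HDFS; FlumeData is the HDFS sink's default file prefix, and files still being written carry a .tmp suffix until they are rolled:
hdfs dfs -ls /flumeTest
hdfs dfs -cat /flumeTest/FlumeData.*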
13. Aggregating multiple agents into HDFS
This experiment was not performed due to environment constraints; a configuration sketch is given below.
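A minimal sketch assuming two hosts (the host name, port, log path and HDFS directory are illustrative): a front agent tails a local file and forwards events through an Avro sink, and a collector agent receives them with an Avro source and writes them to HDFS.
# front agent (runs on the log-producing server)
front.sources = r1
front.channels = c1
front.sinks = k1
front.sources.r1.type = exec
front.sources.r1.command = tail -F /opt/test/access.log
front.sources.r1.channels = c1
front.channels.c1.type = memory
front.sinks.k1.type = avro
front.sinks.k1.hostname = collector-host
front.sinks.k1.port = 4545
front.sinks.k1.channel = c1
# collector agent (runs on the aggregation node)
collector.sources = r1
collector.channels = c1
collector.sinks = k1
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1
collector.channels.c1.type = memory
collector.sinks.k1.type = hdfs
collector.sinks.k1.hdfs.path = hdfs://192.168.198.131:8020/flumeAgg
collector.sinks.k1.hdfs.fileType = DataStream
collector.sinks.k1.channel = c1
Multiple front agents can point their Avro sinks at the same collector port, which is how several nodes' logs converge into one HDFS directory.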
14. Importing data into HBase with Flume
Create the table
hbase(main):001:0> create 't2','f2'
Flume agent configuration (a single agent a2 is used; comments are kept on their own lines because an inline # is not treated as a comment in a properties file)
a2.sources = r2
a2.sinks = k2
a2.channels = c2
# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/test/hbase.txt
# Describe the sink
a2.sinks.k2.type = hbase
# must match the table created in HBase
a2.sinks.k2.table = t2
# must match the column family of that table
a2.sinks.k2.columnFamily = f2
a2.sinks.k2.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
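With its default settings, RegexHbaseEventSerializer writes the entire event body into a single column named payload under the given column family, which is why the scan output further below shows f2:payload. To split each line into several columns instead, the serializer's regex and column names could be set, roughly as follows (the pattern and column names here are purely illustrative):
a2.sinks.k2.serializer.regex = ^([^,]+),([^,]+)$
a2.sinks.k2.serializer.colNames = name,value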
Copy the required jars into Flume's lib directory
cp /opt/modules/cdh/hbase-0.98.6-cdh5.3.6/lib/* /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/lib
Start Flume agent a2
flume-ng agent --conf conf --conf-file /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/conf/filetohbase.conf --name a2 -Dflume.root.logger=INFO,console
Write some data to the monitored file
echo 'Iamzhangli' >> /opt/test/hbase.txt
Check the result in HBase
scan "t2"
hbase(main):054:0> scan "t2"
ROW COLUMN+CELL
1595656365661-eFw9ub1SBT-0 column=f2:payload, timestamp=1595656368932, value=lkjglsjfdsflkjglsjfdsf
1595656393887-eFw9ub1SBT-1 column=f2:payload, timestamp=1595656396890, value=gsdfasdfasglkjglsjfdsf
1595656399523-eFw9ub1SBT-2 column=f2:payload, timestamp=1595656402523, value=gsdfasdfasg
1595656448582-eFw9ub1SBT-3 column=f2:payload, timestamp=1595656451582, value=gdsfasdgsddlkjglsjfdsf
1595656453566-eFw9ub1SBT-4 column=f2:payload, timestamp=1595656456559, value=gsdfasdfasg
1595656453566-eFw9ub1SBT-5 column=f2:payload, timestamp=1595656456559, value=gdsfasdgsdd
1595656520483-eFw9ub1SBT-6 column=f2:payload, timestamp=1595656523485, value=dgdfsafdfdlkjglsjfdsf
1595656525633-eFw9ub1SBT-7 column=f2:payload, timestamp=1595656528630, value=gsdfasdfasg
1595656525634-eFw9ub1SBT-8 column=f2:payload, timestamp=1595656528630, value=gdsfasdgsdd
1595656525635-eFw9ub1SBT-9 column=f2:payload, timestamp=1595656528630, value=dgdfsafdfd
1595656662762-eFw9ub1SBT-10 column=f2:payload, timestamp=1595656665761, value=oiuyiuijiiussdgsadfsadfssdgsadfsadfsaIamzhangli
1595656695816-eFw9ub1SBT-11 column=f2:payload, timestamp=1595656698814, value=mmmmmjngli
1595656746553-eFw9ub1SBT-12 column=f2:payload, timestamp=1595656749549, value=gdgdgdgdgdgd>>/opt/test/hbase.txt
1595656750865-eFw9ub1SBT-13 column=f2:payload, timestamp=1595656753855, value=echo gdgdgdgdgdgd
1595656762901-eFw9ub1SBT-14 column=f2:payload, timestamp=1595656765894, value=gdgdgdgdgdgd
1595656795947-eFw9ub1SBT-15 column=f2:payload, timestamp=1595656798939, value=ooooooooooooooo
16 row(s) in 0.0340 seconds
hbase(main):055:0>
The data from the log file was successfully imported into HBase.
Stop the agent with Ctrl+C.