1. Upload flume-ng-1.5.0-cdh5.3.6.tar.gz to /opt/modules/cdh/ and extract it.
2. Edit conf/flume-env.sh in the Flume installation directory

export JAVA_HOME=/usr/java/jdk1.7.0_79

3. Edit /etc/profile

export FLUME_HOME=/opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin
export PATH=$PATH:$FLUME_HOME/bin
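
After reloading the profile (for example with `source /etc/profile`), it can help to confirm that the variables took effect before going further. The following is a minimal sketch; `check_flume_env` is a hypothetical helper name and is not part of Flume.

```shell
# check_flume_env
# Verifies that FLUME_HOME is set, that it contains bin/flume-ng,
# and that flume-ng can be found through PATH.
check_flume_env() {
  if [ -z "${FLUME_HOME:-}" ]; then
    echo "FLUME_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$FLUME_HOME/bin/flume-ng" ]; then
    echo "flume-ng not found or not executable under $FLUME_HOME/bin" >&2
    return 1
  fi
  if ! command -v flume-ng >/dev/null 2>&1; then
    echo "flume-ng is not on PATH" >&2
    return 1
  fi
  echo "Flume environment looks OK: $FLUME_HOME"
}
```

A typical use would be `check_flume_env && flume-ng version`, which prints the installed Flume version only if the checks pass.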

4. Flume use case


Flume is the data entry point of a log-analysis project.

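5. The figure below shows how an event is processed. An event is Flume's basic unit of data: it carries the log data as a byte array, together with header information. Events originate from a data source outside the Agent. When a Source captures an event, it applies source-specific formatting and then pushes the event into one or more Channels. A Channel can be thought of as a buffer that holds the event until a Sink has finished processing it. The Sink is responsible for persisting the log or for forwarding the event to another Source.

[Figure: event processing flow through Source, Channel, and Sink]


When a node fails, logs can be routed to other nodes so that they are not lost.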

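6. Concepts and principles

Client: produces the data and runs in its own thread.

Event: a unit of data made up of headers and a body (for example a log record or an Avro object).

Flow: an abstraction of the movement of Events from their origin to their destination.

Agent: an independent Flume process that contains the Source, Channel, and Sink components.

Source: the data collection component. It collects data from the Client and passes it to a Channel.

Channel: temporary storage that holds the queue of Events handed over by the Source.

Sink: reads and removes Events from the Channel and passes them on to the next Agent, if there is one.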


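7. Avro, a binary data serialization system
Avro uses schemas to define data structures. A stored file has two parts: a header and data blocks. The header in turn has three parts: a four-byte prefix (similar to a magic number), the file metadata, and a randomly generated 16-byte sync marker.
8. Avro supports two serialization encodings: binary and JSON. Binary encoding serializes efficiently and produces smaller output, while JSON encoding is generally used for debugging or for web-based applications.
9. A message sent from the client to the server passes through the transport layer, which sends the message and receives the server's response. The data that reaches the transport layer is binary. HTTP is commonly used as the transport model, with the data sent to the other side as a POST request.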
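
The four-byte prefix mentioned in item 7 is, in the Avro object container format, the ASCII characters `Obj` followed by the byte 0x01. As a small illustration, the sketch below checks whether a file starts with that prefix. `is_avro_container` is a hypothetical helper name, and the check only looks at the prefix; it does not validate the metadata or the sync marker.

```shell
# is_avro_container FILE
# Succeeds if FILE starts with the Avro container prefix: 'O' 'b' 'j' 0x01.
is_avro_container() {
  # Dump the first four bytes as hex and strip all whitespace from od's output.
  magic=$(head -c 4 "$1" | od -An -tx1 | tr -d '[:space:]')
  [ "$magic" = "4f626a01" ]
}
```

For example, `is_avro_container events.avro && echo "looks like Avro"`, where `events.avro` is a placeholder file name.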

10. Single-flow test example
Use Flume to watch a directory; when a new file appears in the directory, print its contents to the console.
Edit conf/a1.properties

# Configure an agent; the agent name is arbitrary (here a1)
# Name the agent's sources (s1), sinks (k1) and channels (c1);
# these names are arbitrary as well
a1.sources = s1
a1.sinks = k1
a1.channels = c1

# Describe the source
# Spooling-directory source
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs
a1.sources.s1.fileHeader = true
a1.sources.s1.channels = c1

# Configure the sink
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

# Configure the channel (buffer events in memory)
a1.channels.c1.type = memory

Start command

flume-ng agent --conf conf --conf-file /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/conf/a1.properties --name a1 -Dflume.root.logger=INFO,console
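
In the command above, `--conf` points at the configuration directory, `--conf-file` at the properties file, and `--name` must match the agent prefix used inside that file (here `a1`). If the name does not match, the agent tends to start without any active components. The sketch below checks this before launching; `agent_defined` is a hypothetical helper, not a Flume command.

```shell
# agent_defined FILE NAME
# Succeeds if FILE declares sources, channels and sinks for agent NAME.
agent_defined() {
  conf_file=$1
  agent=$2
  for key in sources channels sinks; do
    # Match lines such as "a1.sources = s1" (whitespace around '=' is optional).
    grep -Eq "^[[:space:]]*${agent}\.${key}[[:space:]]*=" "$conf_file" || return 1
  done
}
```

It could be combined with the start command, for example `agent_defined "$FLUME_HOME/conf/a1.properties" a1 && flume-ng agent --conf "$FLUME_HOME/conf" --conf-file "$FLUME_HOME/conf/a1.properties" --name a1 -Dflume.root.logger=INFO,console`.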

Upload a file named lining07.log to /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs

The console then displays the contents of this log in real time.
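
The spooling-directory source expects a file to be complete and unchanged once it appears in the watched directory, and it expects file names not to be reused (processed files are renamed with a `.COMPLETED` suffix by default). One way to respect this is to copy the file into a staging directory first and then rename it into place. The following is a sketch under those assumptions; `spool_file` and the staging directory are hypothetical, and the staging directory should be on the same filesystem as the spooling directory so that `mv` is a plain rename.

```shell
# spool_file SRC STAGING_DIR SPOOL_DIR
# Copies SRC into STAGING_DIR, then renames it into SPOOL_DIR.
spool_file() {
  src=$1
  staging_dir=$2
  spool_dir=$3
  name=$(basename "$src")
  # Refuse to reuse a name that is pending or already processed.
  if [ -e "$spool_dir/$name" ] || [ -e "$spool_dir/$name.COMPLETED" ]; then
    echo "refusing to reuse file name: $name" >&2
    return 1
  fi
  cp "$src" "$staging_dir/$name" || return 1
  # Within one filesystem mv is a rename, so the source never sees a partial file.
  mv "$staging_dir/$name" "$spool_dir/$name"
}
```

For the test above this might look like `spool_file lining07.log /opt/modules/cdh/staging /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs`, where `/opt/modules/cdh/staging` is a placeholder path.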



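11. Summary of source configurations

a. Install netcat and use the data sent through nc as the source data

# Source type
a1.sources.r1.type = netcat
# Host name or IP address to bind to
a1.sources.r1.bind = localhost
# Port
a1.sources.r1.port = 5566

Data can then be sent with nc localhost 5566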

b. Watch a directory as the source data

a1.sources.r1.type=spooldir
a1.sources.r1.spoolDir=/root/logs
a1.sources.r1.fileHeader = true

c. Tail a file as the source data

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/test-flume.log
a1.sources.r1.shell = /bin/bash -c

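d. Consume data written by Kafka producers as the source data

a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.channels = mem_channel
a1.sources.r1.batchSize = 5000
# The kafka.* property names below are those of the Kafka source in Flume 1.7 and later;
# the Flume 1.6 Kafka source uses zookeeperConnect, topic and groupId instead
# Kafka broker list (host:port)
a1.sources.r1.kafka.bootstrap.servers = node02:9092,node03:9092,node04:9092
# Kafka topic(s) to consume
a1.sources.r1.kafka.topics = hellotopic
# Consumer group id
a1.sources.r1.kafka.consumer.group.id = flume_test_id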

12. Single-file, single-flow agent writing to HDFS

# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type= memory
agent1.channels.ch1.capacity = 100000
agent1.channels.ch1.transactionCapacity = 100000
agent1.channels.ch1.keep-alive = 30
  

# Define an exec source that tails a file
agent1.sources.avro-source1.type = exec
agent1.sources.avro-source1.shell = /bin/bash -c
agent1.sources.avro-source1.command = tail -n +0 -F /opt/test/testlog.txt
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.threads = 5
  
# Define an HDFS sink that writes all events it receives to HDFS
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type= hdfs
agent1.sinks.log-sink1.hdfs.path = hdfs://192.168.198.131:8020/flumeTest
agent1.sinks.log-sink1.hdfs.writeFormat = Text
agent1.sinks.log-sink1.hdfs.fileType = DataStream
agent1.sinks.log-sink1.hdfs.rollInterval = 0
agent1.sinks.log-sink1.hdfs.rollSize = 1000000
agent1.sinks.log-sink1.hdfs.rollCount = 0
agent1.sinks.log-sink1.hdfs.batchSize = 1000
agent1.sinks.log-sink1.hdfs.txnEventMax = 1000
agent1.sinks.log-sink1.hdfs.callTimeout = 60000
agent1.sinks.log-sink1.hdfs.appendTimeout = 60000
  
# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1

Run command

flume-ng agent --conf /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/conf --conf-file /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/conf/a1.properties -n agent1 -Dflume.root.logger=INFO,console

13. Multiple agents aggregating into HDFS

This experiment was not carried out because of limitations of the environment.
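
For reference, an aggregation topology is usually built by giving each edge agent an Avro sink that points at an Avro source on a collector agent, which then writes to HDFS. The sketch below generates two such configuration files. It has not been run against a cluster; the agent names `edge` and `collector`, the host `collector-host`, and port 4545 are all placeholders, and `write_fanin_confs` is a hypothetical helper. The HDFS path and the tailed file are reused from step 12.

```shell
# write_fanin_confs OUT_DIR
# Writes edge.conf and collector.conf into OUT_DIR.
write_fanin_confs() {
  out_dir=$1
  cat > "$out_dir/edge.conf" <<'EOF'
# Edge agent: tails a local file and forwards events over Avro RPC.
edge.sources = r1
edge.channels = c1
edge.sinks = k1

edge.sources.r1.type = exec
edge.sources.r1.command = tail -F /opt/test/testlog.txt
edge.sources.r1.channels = c1

edge.channels.c1.type = memory

edge.sinks.k1.type = avro
# Placeholder host and port of the collector's Avro source
edge.sinks.k1.hostname = collector-host
edge.sinks.k1.port = 4545
edge.sinks.k1.channel = c1
EOF
  cat > "$out_dir/collector.conf" <<'EOF'
# Collector agent: receives events from edge agents and writes them to HDFS.
collector.sources = r1
collector.channels = c1
collector.sinks = k1

collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1

collector.channels.c1.type = memory

collector.sinks.k1.type = hdfs
collector.sinks.k1.hdfs.path = hdfs://192.168.198.131:8020/flumeTest
collector.sinks.k1.hdfs.fileType = DataStream
collector.sinks.k1.channel = c1
EOF
}
```

With this layout the collector would be started first (`--name collector`), followed by one edge agent per machine (`--name edge`), so that the Avro sinks have something to connect to.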

14. Importing data into HBase with Flume
Create the table

hbase(main):001:0> create 't2','f2'

Flume agent configuration

# Name the components of agent a2
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/test/hbase.txt

# Describe the sink
a2.sinks.k2.type = hbase
# Must match the name of the table created in HBase
a2.sinks.k2.table = t2
# Must match the column family of that table
a2.sinks.k2.columnFamily = f2
a2.sinks.k2.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

Copy the HBase JAR files into Flume's lib directory

cp /opt/modules/cdh/hbase-0.98.6-cdh5.3.6/lib/* /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/lib

Start Flume agent a2

flume-ng agent --conf conf --conf-file /opt/modules/cdh/apache-flume-1.5.0-cdh5.3.6-bin/conf/filetohbase.conf --name a2 -Dflume.root.logger=INFO,console

Write data to the file

echo 'Iamzhangli'>>/opt/test/hbase.txt

Check the result in HBase

scan "t2"
hbase(main):054:0> scan "t2"
ROW                                          COLUMN+CELL
 1595656365661-eFw9ub1SBT-0                  column=f2:payload, timestamp=1595656368932, value=lkjglsjfdsflkjglsjfdsf
 1595656393887-eFw9ub1SBT-1                  column=f2:payload, timestamp=1595656396890, value=gsdfasdfasglkjglsjfdsf
 1595656399523-eFw9ub1SBT-2                  column=f2:payload, timestamp=1595656402523, value=gsdfasdfasg
 1595656448582-eFw9ub1SBT-3                  column=f2:payload, timestamp=1595656451582, value=gdsfasdgsddlkjglsjfdsf
 1595656453566-eFw9ub1SBT-4                  column=f2:payload, timestamp=1595656456559, value=gsdfasdfasg
 1595656453566-eFw9ub1SBT-5                  column=f2:payload, timestamp=1595656456559, value=gdsfasdgsdd
 1595656520483-eFw9ub1SBT-6                  column=f2:payload, timestamp=1595656523485, value=dgdfsafdfdlkjglsjfdsf
 1595656525633-eFw9ub1SBT-7                  column=f2:payload, timestamp=1595656528630, value=gsdfasdfasg
 1595656525634-eFw9ub1SBT-8                  column=f2:payload, timestamp=1595656528630, value=gdsfasdgsdd
 1595656525635-eFw9ub1SBT-9                  column=f2:payload, timestamp=1595656528630, value=dgdfsafdfd
 1595656662762-eFw9ub1SBT-10                 column=f2:payload, timestamp=1595656665761, value=oiuyiuijiiussdgsadfsadfssdgsadfsadfsaIamzhangli
 1595656695816-eFw9ub1SBT-11                 column=f2:payload, timestamp=1595656698814, value=mmmmmjngli
 1595656746553-eFw9ub1SBT-12                 column=f2:payload, timestamp=1595656749549, value=gdgdgdgdgdgd>>/opt/test/hbase.txt
 1595656750865-eFw9ub1SBT-13                 column=f2:payload, timestamp=1595656753855, value=echo gdgdgdgdgdgd
 1595656762901-eFw9ub1SBT-14                 column=f2:payload, timestamp=1595656765894, value=gdgdgdgdgdgd
 1595656795947-eFw9ub1SBT-15                 column=f2:payload, timestamp=1595656798939, value=ooooooooooooooo
16 row(s) in 0.0340 seconds

hbase(main):055:0>

The data from the log file was imported into HBase successfully.
Press Ctrl+C to stop the listening agent.