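Introduction
Flume is a distributed, reliable, and highly available system for aggregating large volumes of log data. It supports customizable data senders for collecting data, can apply simple processing to that data, and can write it to a variety of customizable receivers.
Setting up the Flume environment
Installation (CDH release: flume-ng-1.5.0-cdh5.3.6.tar.gz)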
tar -zxvf flume-ng-1.5.0-cdh5.3.6.tar.gz -C /opt/cdh-5.3.6
Configuration
In flume-env.sh (in the conf directory), set:
export JAVA_HOME=/opt/modules/jdk1.7.0_67
Writing the first agent: reading data in real time (see the official Flume documentation for details)
In the conf directory:
cp flume-conf.properties.template a1.conf
vi a1.conf
with the following contents:
### define agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
### define sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop-senior.ibeifeng.com
a1.sources.r1.port = 44444
### define channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
### define sink
a1.sinks.k1.type = logger
#### bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
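Every key in this file must start with the agent name that is later passed to --name. Flume lets one properties file hold several agents, so a key under a different prefix is treated as belonging to another agent rather than reported as a typo. The check below is a minimal sketch of how such mistakes could be caught before starting the agent. The function names and the simplified parser (no line continuations, only = as separator) are assumptions for illustration and are not part of Flume.

```python
def parse_properties(text):
    """Parse 'key = value' lines, skipping blanks and # comments (simplified)."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_agent(props, agent):
    """Return a list of human-readable problems found for the given agent name."""
    problems = []
    # Keys under any other prefix are not part of this agent's configuration.
    for prefix in sorted({k.split(".", 1)[0] for k in props} - {agent}):
        problems.append(f"prefix '{prefix}' does not match agent '{agent}'")
    channels = set(props.get(f"{agent}.channels", "").split())
    for source in props.get(f"{agent}.sources", "").split():
        bound = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not bound or any(c not in channels for c in bound):
            problems.append(f"source '{source}' is not bound to a declared channel")
    for sink in props.get(f"{agent}.sinks", "").split():
        if props.get(f"{agent}.sinks.{sink}.channel") not in channels:
            problems.append(f"sink '{sink}' is not bound to a declared channel")
    return problems


# A shortened copy of the a1 configuration, used only as sample input.
SAMPLE = """
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
"""

print(check_agent(parse_properties(SAMPLE), "a1"))  # prints []
```

An empty list means the prefixes and channel bindings are consistent. Writing al (lowercase L) in place of a1 on any line would show up as a prefix mismatch.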
Start the agent (from the Flume home directory):
$ bin/flume-ng agent \
--conf conf \
--conf-file conf/a1.conf \
--name a1 \
-Dflume.root.logger=INFO,console
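Open another terminal and run the following command. If the telnet command is not available, install it first. The source above binds to hadoop-senior.ibeifeng.com, so connect to that host name instead if the connection to localhost is refused.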
$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK
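If telnet is not installed, a few lines of Python can stand in for it. This is a minimal sketch, assuming the netcat source is running as configured above and answers each line with OK, as in the telnet session. The function name send_line is made up for illustration.

```python
import socket


def send_line(host: str, port: int, line: str, timeout: float = 5.0) -> str:
    """Send one newline-terminated line and return the server's reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(line.encode("utf-8") + b"\n")
        # Read a single short reply; the netcat source is assumed to send "OK".
        return conn.recv(1024).decode("utf-8").strip()
```

For example, send_line("hadoop-senior.ibeifeng.com", 44444, "Hello world!") should return OK while the agent is running.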
The agent's terminal then shows output like the following:
12/06/19 15:32:19 INFO source.NetcatSource: Source starting
12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D Hello world!. }
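The logger sink prints the event body as hexadecimal bytes followed by a printable rendering. The short sketch below decodes the bytes shown in the log line above. The last byte, 0D, is the carriage return that telnet sends before the newline, which is why the log shows a dot after the exclamation mark.

```python
# Body bytes copied from the LoggerSink line above
hex_dump = "48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D"

body = bytes.fromhex(hex_dump)  # fromhex skips the spaces between byte pairs
text = body.decode("ascii")

print(repr(text))  # prints 'Hello world!\r'
```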
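Second agent: monitoring a log file in real time and storing it in HDFS
* Log collection (source)
Hive's runtime log: /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
read with the command: tail -f
* memory
In-memory channel
* hdfs
Storage location (sink):
/user/beifeng/flume/hive-logs/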
vi flume-tail.conf
### define agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2
### define sources
a2.sources.r2.type = exec
a2.sources.r2.command = tail -f /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
a2.sources.r2.shell = /bin/bash -c
### define channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 1000
### define sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/flume/hive-logs/
### bin/hdfs dfs -mkdir -p /user/beifeng/flume/hive-logs/ (create this directory in HDFS first)
a2.sinks.k2.hdfs.fileType = DataStream
a2.sinks.k2.hdfs.writeFormat = Text
a2.sinks.k2.hdfs.batchSize = 10
#### bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
Start the agent:
$ bin/flume-ng agent \
--conf conf \
--conf-file flume-tail.conf \
--name a2 \
-Dflume.root.logger=INFO,console
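Running this as-is fails because the required Hadoop jars are not on Flume's classpath. Copy the following jars from the Hadoop installation into Flume's lib directory, after which the agent starts:
commons-configuration-1.6.jar
hadoop-hdfs-2.5.0-cdh5.3.6.jar
hadoop-common-2.5.0-cdh5.3.6.jar
hadoop-auth-2.5.0-cdh5.3.6.jar
Result:
Next, run some Hive statements so that Hive writes to its log. Flume then ships the new log entries in real time.
Open http://hadoop-senior.ibeifeng.com:50070 and browse to /user/beifeng/flume/hive-logs, where new files should appear.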
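Flume in practice: monitoring a log directory and extracting its data to HDFS in real time
Spooling Directory Source
1. Using an exec source to watch a data source gives good real-time behavior but poor reliability: if the source process fails or the Linux command is interrupted, data is lost, and data integrity cannot be guaranteed until normal operation resumes.
2. The Spooling Directory Source watches a directory for newly added files and reads their contents to collect log data. In production it is typically combined with log4j. Once a file has been fully transmitted it is renamed with a suffix, .COMPLETED by default (configurable).
Example:
Monitor a directory of log files
/app/logs/2014-12-20
....
/app/logs/2016-11-12
zz.log -> still being written, so not collected
xx.log.comp -> 20M
yy.log.comp -> 20M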
vi flume-app.conf
### define agent
a3.sources = r3
a3.sinks = k3
a3.channels = c3
### define sources
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/spoollogs
a3.sources.r3.ignorePattern = ^(.)*\\.log$
a3.sources.r3.fileSuffix = .delete
### define channels
a3.channels.c3.type = file
a3.channels.c3.checkpointDir = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/filechannel/checkpoint
a3.channels.c3.dataDirs = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/filechannel/data
### define sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/flume/splogs/
### bin/hdfs dfs -mkdir -p /user/beifeng/flume/splogs/ (create this directory in HDFS first)
a3.sinks.k3.hdfs.fileType = DataStream
a3.sinks.k3.hdfs.writeFormat = Text
a3.sinks.k3.hdfs.batchSize = 10
#### bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
Start the agent:
$ bin/flume-ng agent \
--conf conf \
--conf-file flume-app.conf \
--name a3 \
-Dflume.root.logger=INFO,console
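Result:
Copy some data files into the spoollogs directory. Files ending in .log are not extracted to HDFS.
All other files are extracted, and the .delete suffix is appended to their names.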
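The filtering comes from ignorePattern. In the properties file the backslash is doubled, so the regular expression Flume ends up with is ^(.)*\.log$. The sketch below applies the same expression with Python's re module, which treats this particular pattern the same way as Java's regex engine. The file names are hypothetical, modeled on the example directory listing earlier.

```python
import re

# The properties file writes the pattern as ^(.)*\\.log$; after properties
# unescaping, the regular expression itself is ^(.)*\.log$
IGNORE_PATTERN = re.compile(r"^(.)*\.log$")


def is_ignored(file_name: str) -> bool:
    """Return True if the whole file name matches the ignore pattern."""
    return IGNORE_PATTERN.fullmatch(file_name) is not None


# Hypothetical file names used only for illustration
names = ["zz.log", "xx.log.comp", "yy.log.comp"]
picked_up = [name for name in names if not is_ignored(name)]

print(picked_up)  # prints ['xx.log.comp', 'yy.log.comp']
```

Only names that end in .log are skipped. A name such as xx.log.comp contains .log but does not end with it, so it is picked up.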
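Configuration for using the local timestamp automatically (needed when hdfs.path contains time escapes such as %Y%m%d and the events carry no timestamp header)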
a3.sinks.k3.hdfs.useLocalTimeStamp = true
vi flume-app.conf
### define agent
a3.sources = r3
a3.sinks = k3
a3.channels = c3
### define sources
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/spoollogs
a3.sources.r3.ignorePattern = ^(.)*\\.log$
a3.sources.r3.fileSuffix = .delete
### define channels
a3.channels.c3.type = file
a3.channels.c3.checkpointDir = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/filechannel/checkpoint
a3.channels.c3.dataDirs = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/filechannel/data
### define sink
a3.sinks.k3.type = hdfs
### a3.sinks.k3.hdfs.path = hdfs://ns1/user/beifeng/flume/splogs/%Y%m%d
a3.sinks.k3.hdfs.path = hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/flume/splogs/
### bin/hdfs dfs -mkdir -p /user/beifeng/flume/splogs/ (create this directory in HDFS first)
a3.sinks.k3.hdfs.fileType = DataStream
a3.sinks.k3.hdfs.writeFormat = Text
a3.sinks.k3.hdfs.batchSize = 10
a3.sinks.k3.hdfs.useLocalTimeStamp = true
#### bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
Start the agent:
$ bin/flume-ng agent \
--conf conf \
--conf-file flume-app.conf \
--name a3 \
-Dflume.root.logger=INFO,console
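The commented-out hdfs.path in this configuration ends in %Y%m%d. With hdfs.useLocalTimeStamp = true, the HDFS sink fills in such escapes from the agent's local clock, so each day's data lands in its own directory. The sketch below only illustrates the resulting path. It relies on Python's strftime, which happens to share the %Y, %m and %d codes with Flume. Flume's other escapes are not all identical, and the date used here is an arbitrary example.

```python
from datetime import datetime


def expand_path(template: str, when: datetime) -> str:
    """Expand %Y, %m and %d in the template using the given time."""
    return when.strftime(template)


template = "hdfs://ns1/user/beifeng/flume/splogs/%Y%m%d"
print(expand_path(template, datetime(2016, 11, 12)))
# prints hdfs://ns1/user/beifeng/flume/splogs/20161112
```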