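Introduction

Flume is a distributed, reliable, and highly available system for aggregating large volumes of log data. It supports customizable data senders for collecting data, and it can apply simple processing to that data before writing it to a variety of (customizable) data receivers.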

Setting up the Flume environment

Installation (CDH release): flume-ng-1.5.0-cdh5.3.6.tar.gz

tar -zxvf flume-ng-1.5.0-cdh5.3.6.tar.gz -C /opt/cdh-5.3.6

Configuration

In conf/flume-env.sh:
export JAVA_HOME=/opt/modules/jdk1.7.0_67

Writing the first agent: reading data in real time (see the official documentation for details)

In the conf directory:

cp flume-conf.properties.template a1.conf
vi a1.conf

Contents:


### define agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

### define sources 
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop-senior.ibeifeng.com
a1.sources.r1.port = 44444

### define channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000

### define sink
a1.sinks.k1.type = logger

#### bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
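
Every property key in this file starts with the agent name, and only keys under the name passed to --name belong to the agent that is started, so a mistyped prefix (for instance the letter l in place of the digit 1) leaves that setting out of the agent. The following is a minimal sketch of a check that counts keys per prefix; the helper name agent_prefixes and the sample text are illustrative assumptions, not part of Flume.

```python
from collections import Counter


def agent_prefixes(conf_text):
    """Count property keys per agent-name prefix in a Flume properties file."""
    counts = Counter()
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key = value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        counts[key.split(".", 1)[0]] += 1
    return counts


# Hypothetical sample containing one mistyped prefix ("al" instead of "a1").
sample = """
a1.sources = r1
al.sinks.k1.type = logger
a1.channels = c1
"""
counts = agent_prefixes(sample)
for prefix, n in sorted(counts.items()):
    print(prefix, n)
```

More than one prefix in a file meant for a single agent is a hint that one of them is a typo.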

Start

$ bin/flume-ng agent \
--conf conf \
--conf-file conf/a1.conf \
--name a1 \
-Dflume.root.logger=INFO,console

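Open another terminal and run the following command (install telnet first if the command is not available). If the source is bound to a hostname that does not resolve to the loopback address, connect to that hostname instead of localhost.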


$ telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Hello world! <ENTER>
OK
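
If telnet is not installed, the same check can be made from a short script. The sketch below assumes the netcat source is reachable on the given host and port and acknowledges each line with OK, as in the session above; send_line is a hypothetical helper name.

```python
import socket


def send_line(host, port, text, timeout=5.0):
    """Send one newline-terminated line and return the server's reply line."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(text.encode("utf-8") + b"\n")
        # Read a single reply line; the source in this article answers "OK".
        with conn.makefile("rb") as reader:
            reply = reader.readline()
    return reply.decode("utf-8").strip()


# Example call (requires the agent to be running):
# print(send_line("localhost", 44444, "Hello world!"))
```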

The agent's terminal then shows output like the following:


12/06/19 15:32:19 INFO source.NetcatSource: Source starting
12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D          Hello world!. }
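
The logger sink prints the event body as hexadecimal bytes followed by a printable rendering. As a small illustration, the bytes from the line above can be decoded as follows; decode_body is a hypothetical helper, and UTF-8 is assumed as the encoding.

```python
def decode_body(hex_dump):
    """Turn a space-separated hex dump into text."""
    return bytes.fromhex(hex_dump).decode("utf-8", errors="replace")


text = decode_body("48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D")
print(repr(text))  # 'Hello world!\r'
```

The trailing 0D is the carriage return sent by telnet, which the log output renders as a dot.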

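The second Flume agent (monitoring and reading log data in real time, storing it in HDFS)

* Collecting the log
     Hive's runtime log: /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
     Read with the command: tail -f
 * memory
     In-memory channel
 * hdfs
     Storage location
     /user/beifeng/flume/hive-logs/

vi flume-tail.conf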
### define agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

### define sources 
a2.sources.r2.type = exec
a2.sources.r2.command = tail -f /opt/cdh-5.3.6/hive-0.13.1-cdh5.3.6/logs/hive.log
a2.sources.r2.shell = /bin/bash -c

### define channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 1000

### define sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/flume/hive-logs/
### bin/hdfs dfs -mkdir -p /user/beifeng/flume/hive-logs/ (create this directory in HDFS first)
a2.sinks.k2.hdfs.fileType = DataStream
a2.sinks.k2.hdfs.writeFormat = Text
a2.sinks.k2.hdfs.batchSize = 10

#### bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

Start

$ bin/flume-ng agent \
--conf conf \
--conf-file flume-tail.conf \
--name a2 \
-Dflume.root.logger=INFO,console

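Running this directly fails because the Hadoop libraries that the HDFS sink needs are not on Flume's classpath. Copy the following jars into Flume's lib directory, after which the agent runs:

    commons-configuration-1.6.jar
    hadoop-hdfs-2.5.0-cdh5.3.6.jar
    hadoop-common-2.5.0-cdh5.3.6.jar
    hadoop-auth-2.5.0-cdh5.3.6.jar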
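
Before restarting the agent, it can help to confirm that the jars are in place. The following is a minimal sketch: the jar names follow the list above, written with the cdh5.3.6 suffix used by the Flume tarball, while the helper name missing_jars and the lib directory path are assumptions based on the paths used elsewhere in this article.

```python
from pathlib import Path

# Jar names as listed in this article.
REQUIRED_JARS = (
    "commons-configuration-1.6.jar",
    "hadoop-hdfs-2.5.0-cdh5.3.6.jar",
    "hadoop-common-2.5.0-cdh5.3.6.jar",
    "hadoop-auth-2.5.0-cdh5.3.6.jar",
)


def missing_jars(lib_dir, required=REQUIRED_JARS):
    """Return the required jar names that are not present in lib_dir."""
    present = {p.name for p in Path(lib_dir).glob("*.jar")}
    return [name for name in required if name not in present]


# Assumed install location; adjust to the actual Flume home.
# print(missing_jars("/opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/lib"))
```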


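Result:

Next, run some Hive statements so that Hive writes to its log; Flume transfers the new log entries in real time.
Open http://hadoop-senior.ibeifeng.com:50070 and browse to /user/beifeng/flume/hive-logs, where new files appear.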

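A practical Flume case (monitoring a log directory and extracting its data to HDFS in real time)

Spooling Directory Source
1. Watching a data source with exec gives good timeliness but poor reliability: if the source process fails or the Linux command is interrupted, data is lost, and the integrity of the data cannot be guaranteed until normal operation resumes.
2. The Spooling Directory Source watches a directory for newly added files and reads their contents to collect the log data.
   In production it is typically combined with log4j. Once a file has been fully transferred it is renamed with a suffix, .COMPLETED by default (configurable).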
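
Because the source reads whole files, a file should be complete before it appears in the watched directory. One common way to arrange that, sketched below under the assumption that the staging and spooling directories are on the same filesystem, is to write the file elsewhere and then rename it into place. The helper name publish and both directory arguments are hypothetical.

```python
import os


def publish(staging_dir, spool_dir, name, data):
    """Write data to staging_dir, then move the finished file into spool_dir."""
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(spool_dir, exist_ok=True)
    staged = os.path.join(staging_dir, name)
    with open(staged, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before the move
    final = os.path.join(spool_dir, name)
    # A rename within one filesystem is atomic, so the watcher never sees
    # a partially written file.
    os.replace(staged, final)
    return final
```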

Example:

Monitor a directory of log files:
/app/logs/2014-12-20
....
/app/logs/2016-11-12
    zz.log    ->    still being written, so it is not collected
    xx.log.comp    ->    20M
    yy.log.comp    ->    20M

vi flume-app.conf
### define agent
a3.sources = r3
a3.sinks = k3
a3.channels = c3

### define sources 
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/spoollogs
a3.sources.r3.ignorePattern = ^(.)*\\.log$
a3.sources.r3.fileSuffix = .delete

### define channels
a3.channels.c3.type = file
a3.channels.c3.checkpointDir = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/filechannel/checkpoint
a3.channels.c3.dataDirs = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/filechannel/data

### define sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/flume/splogs/
### bin/hdfs dfs -mkdir -p /user/beifeng/flume/splogs/ (create this directory in HDFS first)
a3.sinks.k3.hdfs.fileType = DataStream
a3.sinks.k3.hdfs.writeFormat = Text
a3.sinks.k3.hdfs.batchSize = 10

#### bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

Start:

$ bin/flume-ng agent \
--conf conf \
--conf-file flume-app.conf \
--name a3 \
-Dflume.root.logger=INFO,console


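Result:

Copy some data files into the spoollogs directory. Files ending in .log are not extracted to HDFS.
All other files are extracted, and the .delete suffix is appended to their names.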
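
The ignorePattern value is a regular expression; in the properties file the backslash is doubled, so the pattern that is actually applied is ^(.)*\.log$. The sketch below tries that pattern against the file names from the example, using Python's re module as a stand-in for Java's regex engine (the two agree for a pattern this simple). The helper name is_ignored is hypothetical.

```python
import re

# The pattern after properties-file unescaping of the doubled backslash.
IGNORE_PATTERN = re.compile(r"^(.)*\.log$")


def is_ignored(filename):
    """Return True when the whole file name matches the ignore pattern."""
    return IGNORE_PATTERN.fullmatch(filename) is not None


for name in ("zz.log", "xx.log.comp", "yy.log.comp"):
    print(name, is_ignored(name))
```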

Configuration for adding the creation time automatically

a3.sinks.k3.hdfs.useLocalTimeStamp = true
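
With hdfs.useLocalTimeStamp enabled, time escape sequences in hdfs.path, such as %Y%m%d, are resolved from the agent's local clock rather than from a timestamp header on the event. The sketch below approximates that expansion with Python's strftime, which shares the %Y, %m and %d codes. It is not Flume's own implementation, expand_path is a hypothetical helper, and the date is an arbitrary example.

```python
from datetime import datetime


def expand_path(template, now):
    """Expand %Y, %m and %d in a path template using the given time."""
    return now.strftime(template)


example = expand_path(
    "hdfs://ns1/user/beifeng/flume/splogs/%Y%m%d",
    datetime(2016, 11, 12, 9, 30),
)
print(example)  # hdfs://ns1/user/beifeng/flume/splogs/20161112
```

A dated path of this kind gives one HDFS directory per day, which keeps each day's files together.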

vi flume-app.conf
### define agent
a3.sources = r3
a3.sinks = k3
a3.channels = c3

### define sources 
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/spoollogs
a3.sources.r3.ignorePattern = ^(.)*\\.log$
a3.sources.r3.fileSuffix = .delete

### define channels
a3.channels.c3.type = file
a3.channels.c3.checkpointDir = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/filechannel/checkpoint
a3.channels.c3.dataDirs = /opt/cdh-5.3.6/flume-1.5.0-cdh5.3.6/filechannel/data

### define sink
a3.sinks.k3.type = hdfs
### a3.sinks.k3.hdfs.path = hdfs://ns1/user/beifeng/flume/splogs/%Y%m%d
a3.sinks.k3.hdfs.path = hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/flume/splogs/
### bin/hdfs dfs -mkdir -p /user/beifeng/flume/splogs/ (create this directory in HDFS first)
a3.sinks.k3.hdfs.fileType = DataStream
a3.sinks.k3.hdfs.writeFormat = Text
a3.sinks.k3.hdfs.batchSize = 10
a3.sinks.k3.hdfs.useLocalTimeStamp = true

#### bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

Start:

$ bin/flume-ng agent \
--conf conf \
--conf-file flume-app.conf \
--name a3 \
-Dflume.root.logger=INFO,console