Introduction
- This post gives a brief introduction to the Flume framework, covering:
- How to install Flume on Linux
- How to tail a log file dynamically
- How to use Flume to store files on HDFS
- How to use Flume to store files in a specific HDFS directory
- How to use Flume to store files on HDFS in partitioned directories
- How to monitor the contents of a directory dynamically
- How to filter out files that should not be loaded into Flume
- How to dynamically monitor multiple files and directories
1: Flume Overview and Installation
1.1: Flume Overview
(1) Distributed:
multiple Flume agents can run on multiple machines, since log files are usually spread across different machines
(2) Collecting, aggregating, and moving:
Flume collects, aggregates, and moves log data
(3) The agent and its components:
source: reads data from the data source, turns it into an event stream, and hands it to the channel
channel: works like a queue, temporarily buffering the data sent by the source
sink: reads data from the channel and delivers it to the destination
(4) Using Flume is simple: an agent is described entirely by one configuration file, as the sketch below shows.
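A minimal agent definition illustrates how the three components are wired together (a sketch only; the netcat source and port 44444 are an illustrative example, not part of this post's setup):
# one source, one channel, one sink, all belonging to agent a1
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# hypothetical netcat source that listens on a local TCP port
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
# events are buffered in memory between source and sink
a1.channels.c1.type = memory
# the logger sink simply prints events
a1.sinks.k1.type = logger
# wire the source and the sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1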
1.2: Flume Versions
flume-ng (next generation): the version used today
flume-og (original generation): the legacy version, now obsolete
1.3: Flume Installation
Requirements: Linux, a working Hadoop installation, and a JDK
Installation and configuration:
(1) Rename the environment template and configure the JDK:
mv flume-env.sh.template flume-env.sh
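In flume-env.sh, point JAVA_HOME at the local JDK (a sketch; the JDK path below is an assumption and should match your installation):
# flume-env.sh: tell Flume which JDK to use (path is an example)
export JAVA_HOME=/opt/jdk1.8.0_171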
(2) Tell Flume where HDFS is. There are three options:
Method 1: declare HADOOP_HOME as a global environment variable (see the sketch after this list)
Method 2: copy core-site.xml and hdfs-site.xml into Flume's conf directory (recommended)
cp /opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6/etc/hadoop/core-site.xml /opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6/etc/hadoop/hdfs-site.xml ./
Method 3: give an absolute HDFS URI directly when configuring the sink path, e.g.
hdfs://hostname:8020/aa/bb
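A sketch of method 1, assuming the CDH 5.7.6 paths used elsewhere in this post (add to /etc/profile or the user's ~/.bashrc):
# declare HADOOP_HOME globally so Flume can locate the HDFS configuration
export HADOOP_HOME=/opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6
export PATH=$PATH:$HADOOP_HOME/bin
# reload the profile in the current shell
source /etc/profile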
(3) Copy the HDFS jars into Flume's lib directory: the HDFS API is needed when the agent runs (see the sketch below).
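The exact jar list depends on the Hadoop version; a sketch assuming the CDH 5.7.6 tarball layout (run from the Flume home directory; the paths and jar names are assumptions to adapt to your installation):
# HDFS client jars and common dependencies needed by the HDFS sink
HADOOP_HOME=/opt/cdh5.7.6/hadoop-2.6.0-cdh5.7.6
cp $HADOOP_HOME/share/hadoop/common/hadoop-common-*.jar lib/
cp $HADOOP_HOME/share/hadoop/hdfs/hadoop-hdfs-*.jar lib/
cp $HADOOP_HOME/share/hadoop/common/lib/hadoop-auth-*.jar lib/
cp $HADOOP_HOME/share/hadoop/common/lib/commons-configuration-*.jar lib/
cp $HADOOP_HOME/share/hadoop/common/lib/htrace-core*.jar lib/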
Test case 1: read the Hive log and print it to the console
flume-conf.properties configuration file:
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# defined sources
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh5.7.6/hive-1.1.0-cdh5.7.6/logs/hive.log
a1.sources.s1.shell=/bin/sh -c
# defined channel
a1.channels.c1.type = memory
# channel capacity: max number of events held in the channel
a1.channels.c1.capacity=1000
# max number of events per transaction
a1.channels.c1.transactionCapacity=100
# defined sink
a1.sinks.k1.type = logger
# bind the source and the sink to the channel
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1
Run the agent from the Flume home directory:
bin/flume-ng agent -n a1 -c conf -f conf/flume-conf.properties -Dflume.root.logger=INFO,console
Result: the logger sink prints each event body as a byte array, so the output on the console is not directly readable.
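To generate fresh lines in hive.log while the agent is running, run any Hive command in a second terminal (assuming the Hive installation path used in the source configuration above):
# each Hive command appends to hive.log, which the exec source is tailing
/opt/cdh5.7.6/hive-1.1.0-cdh5.7.6/bin/hive -e "show databases;"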
Test case 2: write the Hive log to HDFS
flume-conf.properties configuration file:
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# For each one of the sources, the type is defined
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh5.7.6/hive-1.1.0-cdh5.7.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c
# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data
# Each sink's type must be defined
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=/flume/hdfs2/
# set the file type and write format; plain text avoids garbled Chinese characters
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1
Command: the same as in test case 1.
Result on HDFS:
File contents:
2019-07-15 05:34:11,930 INFO [main]: ql.Driver (Driver.java:compile(500)) - Semantic Analysis Completed
2019-07-15 05:34:12,060 INFO [main]: ql.Driver (Driver.java:getSchema(266)) - Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
2019-07-15 05:34:12,646 INFO [main]: ql.Driver (Driver.java:compile(607)) - Completed compiling command(queryId=huadian_20190715053434_d650e334-f5f3-4e7e-b030-0622a759f812); Time taken: 1.595 seconds
2019-07-15 05:34:12,647 INFO [main]: ql.Driver (Driver.java:checkConcurrency(186)) - Concurrency mode is disabled, not creating a lock manager
2019-07-15 05:34:12,647 INFO [main]: ql.Driver (Driver.java:execute(1598)) - Executing command(queryId=huadian_20190715053434_d650e334-f5f3-4e7e-b030-0622a759f812): show tables
2019-07-15 05:34:12,665 INFO [main]: ql.Driver (Driver.java:launchTask(1968)) - Starting task [Stage-0:DDL] in serial mode
2019-07-15 05:34:12,830 INFO [main]: ql.Driver (Driver.java:execute(1877)) - Completed executing command(queryId=huadian_20190715053434_d650e334-f5f3-4e7e-b030-0622a759f812); Time taken: 0.183 seconds
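The output can be inspected with the HDFS shell (a sketch using the sink path configured above; the file names use the HDFS sink's default FlumeData prefix):
# list the files Flume wrote, then print their contents
hdfs dfs -ls /flume/hdfs2/
hdfs dfs -cat /flume/hdfs2/FlumeData.*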
Test case 3: control the size of files written to HDFS (avoiding lots of small files)
flume-conf.properties configuration file:
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# For each one of the sources, the type is defined
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh5.7.6/hive-1.1.0-cdh5.7.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c
# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data
# Each sink's type must be defined
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=/flume/hdfs1/
# set the file type and write format; plain text avoids garbled Chinese characters
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
# roll interval in seconds (0 = never roll based on time)
a1.sinks.k1.hdfs.rollInterval=0
# roll the current file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# number of events written to a file before rolling (0 = never roll based on event count)
a1.sinks.k1.hdfs.rollCount=0
#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1
Result: the file sizes change noticeably; files are now rolled by size (about 10 KB) instead of by the default settings.
Test case 4: write data into date-based directories on HDFS, for loading into Hive partitions
Partition by year, month, day, and minute.
flume-conf.properties configuration file:
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# For each one of the sources, the type is defined
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh5.7.6/hive-1.1.0-cdh5.7.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c
# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data
# Each sink's type must be defined
a1.sinks.k1.type = hdfs
# HDFS path for the output files, built from time-based escape sequences
a1.sinks.k1.hdfs.path=/flume/part/yearst=%Y/monthstr=%m/daystr=%d/minutestr=%M
# required because the path uses time escape sequences and the events carry no timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp=true
# set the file type and write format; plain text avoids garbled Chinese characters
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
# roll interval in seconds (0 = never roll based on time)
a1.sinks.k1.hdfs.rollInterval=0
# roll the current file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# number of events written to a file before rolling (0 = never roll based on event count)
a1.sinks.k1.hdfs.rollCount=0
#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1
Result:
/flume/part/yearst=2019/monthstr=07/daystr=15/minutestr=50
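Because the directory names follow Hive's key=value convention, one way to query them is an external, partitioned text table whose partition columns match the directory names (a HiveQL sketch; the table name log_part is hypothetical, and the column name yearst mirrors the path above):
-- hypothetical external table over the Flume output
CREATE EXTERNAL TABLE log_part (line STRING)
PARTITIONED BY (yearst STRING, monthstr STRING, daystr STRING, minutestr STRING)
LOCATION '/flume/part';
-- register the directory written by the sink as a partition
ALTER TABLE log_part ADD PARTITION (yearst='2019', monthstr='07', daystr='15', minutestr='50');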
Notes on importing into Hive:
Loading Flume's output files directly into Hive is fairly cumbersome, because direct ingestion places two constraints on the target table:
Reason 1: the Hive table must store its data in ORC (columnar) format.
Reason 2: the Hive table must be bucketed, with records distributed into buckets.
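For reference, a table meeting those constraints would look roughly like this (a HiveQL sketch with hypothetical names; the transactional property is an additional assumption, typically needed for Hive ACID ingestion and not stated above):
-- hypothetical ORC, bucketed table for direct ingestion into Hive
CREATE TABLE flume_hive_logs (line STRING)
CLUSTERED BY (line) INTO 3 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');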
Test case 5: monitor a directory dynamically with the Spooling Directory Source
flume-conf.properties configuration file:
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# defined sources
# the spooling directory source scans a local directory
a1.sources.s1.type = spooldir
# the directory to scan
a1.sources.s1.spoolDir = /opt/datas/flume/spool
# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data
# Each sink's type must be defined
a1.sinks.k1.type = hdfs
# HDFS path for the output files, built from time-based escape sequences
a1.sinks.k1.hdfs.path=/flume/part/yearst=%Y/monthstr=%m/daystr=%d/minutestr=%M
# required because the path uses time escape sequences and the events carry no timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp=true
# set the file type and write format; plain text avoids garbled Chinese characters
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
# roll interval in seconds (0 = never roll based on time)
a1.sinks.k1.hdfs.rollInterval=0
# roll the current file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# number of events written to a file before rolling (0 = never roll based on event count)
a1.sinks.k1.hdfs.rollCount=0
#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1
Result: the files placed in the Linux directory are loaded successfully.
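To test, drop a finished file into the spool directory; after ingestion the Spooling Directory Source renames it with its default .COMPLETED suffix (the sample file name below is hypothetical):
# copy a finished file into the spooling directory; it is ingested exactly once
cp /opt/datas/sample.log /opt/datas/flume/spool/
# after ingestion the file is renamed with the default .COMPLETED suffix
ls /opt/datas/flume/spool/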
Test case 6: filter out files that should not be loaded into Flume
In test case 5, each file in the spool directory is ingested only once,
so data appended to a file afterwards is never read.
To work around this, add a filter:
write to a file name that matches the ignore pattern, and rename the file once it is complete so that Flume then loads it (see the sketch below).
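With the ignore pattern configured below, a writer can keep appending to a .tmp file and rename it when it is complete, at which point Flume picks it up (a sketch with hypothetical file names):
# the .tmp file matches the ignore pattern, so Flume skips it while it is being written
echo "first batch" >> /opt/datas/flume/spool/app.log.tmp
echo "second batch" >> /opt/datas/flume/spool/app.log.tmp
# once the file is complete, rename it; it no longer matches the pattern and is loaded
mv /opt/datas/flume/spool/app.log.tmp /opt/datas/flume/spool/app.log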
flume-conf.properties configuration file:
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# defined sources
# the spooling directory source scans a local directory
a1.sources.s1.type = spooldir
# the directory to scan
a1.sources.s1.spoolDir = /opt/datas/flume/spool
# regular expression matching files to ignore
a1.sources.s1.ignorePattern=([^ ]*\.tmp)
# Each channel's type is defined.
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/datas/flume/channel/checkpoint
a1.channels.c1.dataDirs=/opt/datas/flume/channel/data
# Each sink's type must be defined
a1.sinks.k1.type = hdfs
# HDFS path for the output files, built from time-based escape sequences
a1.sinks.k1.hdfs.path=/flume/part/yearst=%Y/monthstr=%m/daystr=%d/minutestr=%M
# required because the path uses time escape sequences and the events carry no timestamp header
a1.sinks.k1.hdfs.useLocalTimeStamp=true
# set the file type and write format; plain text avoids garbled Chinese characters
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
# roll interval in seconds (0 = never roll based on time)
a1.sinks.k1.hdfs.rollInterval=0
# roll the current file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# number of events written to a file before rolling (0 = never roll based on event count)
a1.sinks.k1.hdfs.rollCount=0
#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1
Result: files with the .tmp suffix are not loaded.
Test case 7: tail multiple files dynamically, buffering events in a memory channel (Taildir source)
flume-conf.properties configuration file:
# in this case called 'a1'
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# defined sources
# built-in Taildir source; for a custom source class, use its fully qualified class name here instead
a1.sources.s1.type = TAILDIR
a1.sources.s1.positionFile =/opt/cdh5.7.6/flume-1.6.0-cdh5.7.6-bin/position/taildir_position.json
# absolute paths of the file groups; regular expressions (not filesystem globs) may be used only in the file name part
a1.sources.s1.filegroups = f1 f2
a1.sources.s1.filegroups.f1 = /opt/datas/flume/taildir/test.txt
# header values, set per header key; multiple headers can be specified for one file group
a1.sources.s1.headers.f1.age = 17
a1.sources.s1.headers.f1.type = bb
a1.sources.s1.filegroups.f2 = /opt/datas/flume/taildir/huadian/.*
# header values, set per header key; multiple headers can be specified for one file group
a1.sources.s1.headers.f2.age = 18
a1.sources.s1.headers.f2.type = aa
# Each channel's type is defined.
a1.channels.c1.type = memory
# channel capacity: max number of events held in the channel
a1.channels.c1.capacity=1000
# max number of events per transaction
a1.channels.c1.transactionCapacity=100
# Each sink's type must be defined
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path=/flume/taildir
# set the file type and write format; plain text avoids garbled Chinese characters
a1.sinks.k1.hdfs.fileType=DataStream
a1.sinks.k1.hdfs.writeFormat=Text
# roll interval in seconds (0 = never roll based on time)
a1.sinks.k1.hdfs.rollInterval=0
# roll the current file once it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize=10240
# number of events written to a file before rolling (0 = never roll based on event count)
a1.sinks.k1.hdfs.rollCount=0
#Specify the channel the sink should use
a1.sinks.k1.channel = c1
a1.sources.s1.channels = c1
Result
Initial file contents:
i
am
a
chinese
i
love
my
country
Contents collected by Flume after appending to the file:
i
am
a
chinese
i
love
my
country
test1
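The extra line appears because the Taildir source records per-file offsets in the position file and keeps reading appended data; the append itself can be reproduced with something like (paths as configured above):
# append a line to one of the tailed files; the Taildir source picks it up from the recorded offset
echo "test1" >> /opt/datas/flume/taildir/test.txt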