Hive and Flume table creation

Reposted

架构设计师之光 2024-07-14 13:52:27


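Introduction to Flume:

Apache Flume is a distributed, reliable, and elastic system for efficiently collecting, aggregating, and moving large volumes of log data from many different sources into a centralized data store (HDFS, HBase).

Functions:

-- Supports customizing data senders of various kinds within the logging system in order to collect data;

-- Provides simple processing of the data and writes it to a variety of data receivers.

Supported data sources:

-- Console, RPC, Text, Tail, Syslog, Exec, and others

Features:

Log data collected from many web servers can be stored efficiently in HDFS/HBase

Data gathered from multiple servers can be handed over to Hadoop quickly

Supports many types of input data and many types of output data

Supports multi-hop flows, fan-in flows, fan-out flows, contextual routing, and so on

Scales horizontally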


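Data produced by a data generator is collected by a single agent running on the same server as the generator. A data collector then gathers the data from all the agents and stores it in HDFS or HBase.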

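Flume Event

Flume uses the Event object as the format for passing data; it is the basic unit of data transfer inside Flume.

A Flume Event consists of two parts: a byte array carrying the data (Byte Payload) and optional headers (Header).

Headers are key/value pairs. They can be used to make routing decisions or to carry other structured information, such as the event's timestamp or the hostname of the server it came from. They play the same role as HTTP headers: a way to send extra information alongside the body. Different Flume sources add different headers to the events they generate.

The body is a byte array that holds the actual content.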
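To make the two-part layout concrete, here is a minimal Python sketch of an event as a byte body plus an optional header map. The Event class and its field names are illustrative only; they are not Flume's Java API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Illustrative stand-in for a Flume event: a byte body plus optional headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


# Headers carry metadata used for routing; the body carries the raw log line.
event = Event(
    body="badou flume".encode("utf-8"),
    headers={"timestamp": "123456", "host": "master"},
)

print(event.headers["host"])       # routing decisions look only at headers
print(event.body.decode("utf-8"))  # the payload itself is opaque bytes
```

Keeping routing metadata in the headers means the body never has to be parsed to decide where an event goes.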

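Flume Agent

A Flume deployment contains one or more agents. Each agent is an independent daemon process (a JVM) that receives data from a client or from another agent and then quickly passes it on to the next destination, which may be another agent.

An agent consists of three main components: source, channel, and sink.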

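Agent Source

The source receives events delivered by an external source (the data generator), such as a web server. The external source sends its events to Flume in a format Flume recognizes, and when a Flume source receives an event it stores the event in one or more channels.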

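Agent Channel

The channel is a passive store: it buffers each event until a sink has processed it. A channel is therefore a short-lived container that holds the events received from a source until sinks consume them, acting as a bridge between source and sink. Puts and takes on a channel are transactional, which keeps data consistent while it is being sent and received. A channel can be connected to any number of sources and sinks.

The maximum number of events a channel can hold is set through a parameter (capacity in the configurations below).

File Channel is usually preferred over Memory Channel when events must not be lost:

-- Memory Channel: transactions are held in memory; throughput is very high, but data can be lost

-- File Channel: transactions are persisted to local disk, so data is not lost (implemented with a write-ahead log, WAL)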

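Agent Sink

The sink removes events from the channel and places them on an external storage medium

-- For example, the Flume HDFS Sink writes data to HDFS, or a sink forwards data to the source of the next Flume agent for further processing.

-- Source and sink handle the events buffered in the channel asynchronously

An event is removed from the channel only after the sink has taken it successfully

A sink must be bound to exactly one channel

Types of sink:

-- Terminal sinks that store events at their final destination: HDFS, HBase

-- Sinks that simply discard events: Null Sink

-- Sinks for communication between agents: Avro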

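Agent Interceptor

Interceptors are a group of handlers attached to a source. They run in a preset order and filter events or apply custom processing logic where needed.

They sit between the source and the channel: after the source has received the application's log data and before the events are written to the channel, interceptors can wrap, clean, or filter them.

The interceptors provided out of the box include:

-- Timestamp Interceptor: adds a header with the key timestamp whose value is the current timestamp

-- Host Interceptor: adds a header with the key host whose value is the hostname or IP of the current machine

-- Static Interceptor: adds a user-defined key and value to the event headers

-- Regex Filtering Interceptor: uses a regular expression to exclude or include matching events

-- Regex Extractor Interceptor: uses a regular expression to add the specified keys to the headers, with the matched groups as values

Interceptors form a chain: several interceptors can be configured on one source, and they are applied in order.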
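The chain behaviour can be sketched in a few lines of Python. This is only a model of the idea, assuming events are plain dicts with "headers" and "body" keys. Every function name here is hypothetical, and the real interceptors are Java classes configured on the source.

```python
import re
import socket
import time


def timestamp_interceptor(event):
    # Adds the current time in milliseconds under the key "timestamp".
    event["headers"]["timestamp"] = str(int(time.time() * 1000))
    return event


def host_interceptor(event):
    # Adds the local hostname under the key "host".
    event["headers"]["host"] = socket.gethostname()
    return event


def static_interceptor(key, value):
    # Returns an interceptor that adds one fixed key/value pair.
    def intercept(event):
        event["headers"][key] = value
        return event
    return intercept


def regex_extractor(pattern, header_name):
    # Returns an interceptor that copies the first regex group into a header.
    compiled = re.compile(pattern)

    def intercept(event):
        match = compiled.search(event["body"].decode("utf-8"))
        if match:
            event["headers"][header_name] = match.group(1)
        return event
    return intercept


def drop_blank(event):
    # A filtering interceptor: returning None drops the event.
    return event if event["body"].strip() else None


def run_chain(event, chain):
    """Apply interceptors in order; stop as soon as one drops the event."""
    for interceptor in chain:
        event = interceptor(event)
        if event is None:
            return None
    return event


chain = [
    drop_blank,
    timestamp_interceptor,
    host_interceptor,
    static_interceptor("source", "demo"),
    regex_extractor(r"order_id=(\d+)", "order_id"),
]
result = run_chain({"headers": {}, "body": b"order_id=42 status=ok"}, chain)
print(sorted(result["headers"]))
```

Order matters in such a chain: a filter placed first spares the later interceptors from doing work on events that will be dropped anyway.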

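There are two types of channel selector:

-- Replicating Channel Selector (default): sends every event from the source to all channels

-- Multiplexing Channel Selector: chooses which channels each event is sent to

When events must be routed selectively by origin, Multiplexing is clearly the distribution mode to use.

Problem: Multiplexing decides which channel receives an event by checking the value of a specified header key. If demo and demo2 ran on different servers, a host interceptor on source1 would be enough, because the host header would tell the selector where each event belongs. Here both run on the same server, so host cannot distinguish where a log came from, and another header key is needed to identify the origin. Configuring a separate upstream source for each application solves this.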
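The difference between the two selectors can be modelled as follows. This is a sketch under assumptions: the header name source, the channel names c1 and c2, and the mapping are made up for illustration, and in Flume itself the mapping lives in the agent configuration rather than in code.

```python
def replicating_select(event, channels):
    """Replicating: every configured channel receives the event."""
    return list(channels)


def multiplexing_select(event, header, mapping, default):
    """Multiplexing: the value of one header decides the target channels."""
    value = event["headers"].get(header)
    return mapping.get(value, default)


channels = ["c1", "c2"]
mapping = {"demo": ["c1"], "demo2": ["c2"]}  # header value -> channels

event = {"headers": {"source": "demo2"}, "body": b"log line"}
print(replicating_select(event, channels))
print(multiplexing_select(event, "source", mapping, default=["c1"]))
```

The default list covers events whose header is missing or has an unmapped value, so such events are still delivered somewhere.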

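Reliability:

-- Flume guarantees reliability for a single hop: an event is removed from the channel only after it has been delivered

-- Flume uses a transactional approach to guarantee reliable event handoff

-- If any step is interrupted, by a network outage or for any other reason, the data is retransmitted on the next attempt.

-- Data can be buffered: when the destination becomes unreachable, events stay in the channel and are sent once the destination is reachable again

-- Sources and sinks wrap the storing and retrieval of events in transactions provided by the channel. This ensures that a set of events is passed reliably end to end through the flow.

Sink begins a transaction -> Sink takes data from the channel -> Sink sends the data to the source of another Flume agent -> Source begins a transaction -> Source puts the data into the channel -> Source closes its transaction -> Sink closes its transaction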
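The take, deliver, then commit-or-rollback pattern described above can be sketched in Python. This is a simplified single-threaded model, not Flume's implementation: the class MemoryChannelSketch, its methods, and drain_once are hypothetical names.

```python
from collections import deque


class ChannelFullError(Exception):
    pass


class MemoryChannelSketch:
    """Bounded in-memory buffer with a minimal take/rollback protocol."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def put(self, event):
        if len(self.queue) >= self.capacity:
            raise ChannelFullError("channel is full")
        self.queue.append(event)

    def take_batch(self, batch_size):
        # Hand out up to batch_size events; the caller must commit or roll back.
        batch = []
        while self.queue and len(batch) < batch_size:
            batch.append(self.queue.popleft())
        return batch

    def rollback(self, batch):
        # Delivery failed: put the events back at the head, in original order.
        self.queue.extendleft(reversed(batch))


def drain_once(channel, deliver, batch_size=2):
    """One sink cycle: take a batch, try to deliver it, roll back on failure."""
    batch = channel.take_batch(batch_size)
    try:
        deliver(batch)  # e.g. write to HDFS or send to the next agent
    except Exception:
        channel.rollback(batch)  # events stay in the channel for a retry
        return False
    return True  # commit: the events are gone from the channel


channel = MemoryChannelSketch(capacity=1000)
for line in ("a", "b", "c"):
    channel.put(line)


def failing_sink(batch):
    raise IOError("destination unreachable")


delivered = []
first = drain_once(channel, failing_sink)
after_failure = list(channel.queue)
second = drain_once(channel, delivered.extend)
print(first, after_failure, second, list(channel.queue), delivered)
```

A failed delivery leaves the channel unchanged, which is what lets the next attempt retransmit the same events.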

Installing Flume

Download the binary package:

# wget "http:///apache/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz"

Unpack it:

# tar xvzf apache-flume-1.6.0-bin.tar.gz

Edit the configuration files; here the configuration directory is /usr/local/src/apache-flume-1.6.0-bin/conf:

#cd /usr/local/src/apache-flume-1.6.0-bin/conf

#vim header_test.conf

--------------------------------------------------------------------

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = org.apache.flume.source.http.HTTPSource

a1.sources.r1.bind = localhost

a1.sources.r1.port = 9000

#a1.sources.r1.fileHeader = true

# Describe the sink

a1.sinks.k1.type = logger

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

---------------------------------------------------------------------------------------------------------------

Run flume-ng:

#bin/flume-ng agent -c conf -f conf/header_test.conf -n a1 -Dflume.root.logger=INFO,console

Verify:

#curl -X POST -d '[{"headers":{"timestampe":"123456","host":"master"},"body":"badou flume"}]' localhost:9000
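The same kind of request can be built from Python with only the standard library. A small sketch, assuming the agent from header_test.conf is listening on localhost:9000. The helper name build_request is hypothetical, and the call that actually sends the request is left commented out so the snippet runs without a live agent.

```python
import json
import urllib.request


def build_request(url, events):
    """Build a POST request whose body is a JSON array of events."""
    payload = json.dumps(events).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Each event has a "headers" map and a "body" string, as in the curl call.
events = [
    {"headers": {"timestamp": "123456", "host": "master"}, "body": "badou flume"}
]
request = build_request("http://localhost:9000", events)
print(request.get_method(), request.full_url)

# Uncomment once the agent is running:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(response.status)
```

Because the payload is a JSON array, several events can be sent in one request by adding more entries to the list.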

Writing to HDFS:

#vim flume_hdfs.conf

------------------------------------------------------------------------------

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = netcat

a1.sources.r1.bind = localhost

a1.sources.r1.port = 44444

a1.sources.r1.interceptors = i1

a1.sources.r1.interceptors.i1.type =regex_filter

a1.sources.r1.interceptors.i1.regex =^[0-9]*$

a1.sources.r1.interceptors.i1.excludeEvents =true

# Describe the sink

#a1.sinks.k1.type = logger

a1.sinks.k1.type = hdfs

a1.sinks.k1.hdfs.path = hdfs:/flume/events

a1.sinks.k1.hdfs.filePrefix = events-

a1.sinks.k1.hdfs.round = true

a1.sinks.k1.hdfs.roundValue = 10

a1.sinks.k1.hdfs.roundUnit = minute

a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

----------------------------------------------------------------------------------------------

Run flume-ng:

#bin/flume-ng agent -c conf -f conf/flume_hdfs.conf -n a1 -Dflume.root.logger=INFO,console

Verify:

#telnet localhost 44444
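When typing test lines into telnet, it helps to know which ones the regex_filter interceptor in flume_hdfs.conf will drop. The sketch below approximates that decision in Python using the same pattern and excludeEvents = true. It works on clean strings and ignores any line-ending characters telnet may add, so treat it as a guide rather than an exact reproduction of Flume's Java regex matching.

```python
import re

PATTERN = re.compile(r"^[0-9]*$")  # same regex as in flume_hdfs.conf
EXCLUDE_EVENTS = True              # mirrors excludeEvents = true


def keep_event(body):
    """Return True if the event would be passed on to the channel."""
    matched = PATTERN.search(body) is not None
    # With excludeEvents = true, matching events are dropped.
    return not matched if EXCLUDE_EVENTS else matched


for line in ["12345", "order 42", "", "abc"]:
    print(repr(line), "kept" if keep_event(line) else "dropped")
```

Under these assumptions, purely numeric lines and empty lines are filtered out, and anything containing a non-digit character reaches HDFS.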

From one agent to another:

flume agent a2[push.conf] -> flume2 agent a1[pull.conf]

push.conf configuration (on master):

--------------------------------------------------------------------------------------------------

#Name the components on this agent

a2.sources= r1

a2.sinks= k1

a2.channels= c1

#Describe/configure the source

a2.sources.r1.type= netcat

a2.sources.r1.bind= localhost

a2.sources.r1.port = 44444

a2.sources.r1.channels= c1

#Use a channel which buffers events in memory

a2.channels.c1.type= memory

a2.channels.c1.keep-alive= 10

a2.channels.c1.capacity= 100000

a2.channels.c1.transactionCapacity= 100000

#Describe the sink

a2.sinks.k1.type= avro

a2.sinks.k1.channel= c1

a2.sinks.k1.hostname= slave1

a2.sinks.k1.port= 44444

-----------------------------------------------------------------------------------------------------

pull.conf configuration (on slave1):

----------------------------------------------------------------------------------------------------

#Name the components on this agent

a1.sources= r1

a1.sinks= k1

a1.channels= c1

#Describe/configure the source

a1.sources.r1.type= avro

a1.sources.r1.channels= c1

a1.sources.r1.bind= slave1

a1.sources.r1.port= 44444

#Describe the sink

a1.sinks.k1.type= logger

a1.sinks.k1.channel = c1

#Use a channel which buffers events in memory

a1.channels.c1.type= memory

a1.channels.c1.keep-alive= 10

a1.channels.c1.capacity= 100000

a1.channels.c1.transactionCapacity= 100000

-------------------------------------------------------------------------------------------------------------

Execution order: start slave1 first, then master

slave1:

#bin/flume-ng agent -c conf -f conf/pull.conf -n a1 -Dflume.root.logger=INFO,console

master:

#bin/flume-ng agent -c conf -f conf/push.conf -n a2 -Dflume.root.logger=INFO,console

To check whether the connection succeeded:

slave1: look in the log for a line like the following (the IP addresses will differ)

[id: 0x39afcef8, /192.168.235.10:59800 => /192.168.235.11:44444] CONNECTED: /192.168.235.10:59800

Verify (on master):

#telnet localhost 44444
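For repeatable tests, a short Python client can stand in for telnet. This is a sketch, assuming the netcat source from push.conf is listening on localhost:44444. The function name send_lines is hypothetical, and the example call is commented out so the snippet does not need a running agent.

```python
import socket


def send_lines(host, port, lines, timeout=5.0):
    """Send each line, newline-terminated, over a single TCP connection."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        for line in lines:
            conn.sendall(line.encode("utf-8") + b"\n")
    return len(lines)


# Uncomment once the push.conf agent is running on this machine:
# sent = send_lines("localhost", 44444, ["hello flume", "order 42"])
# print("sent", sent, "lines")
```

Each newline-terminated line becomes one event at the netcat source, which the avro sink then forwards to slave1.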

Writing to Kafka:

#vim flume_kafka.conf

--------------------------------------------------------------------------------------------------------------------

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = exec

a1.sources.r1.command = tail -f /home/badou/Documents/code/python/flume_exec_test.txt

# Configure the Kafka sink

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink

# Set the Kafka broker address and port

a1.sinks.k1.brokerList=master:9092

# Set the Kafka topic

a1.sinks.k1.topic=badou

# Set the serializer

a1.sinks.k1.serializer.class=kafka.serializer.StringEncoder

# use a channel which buffers events in memory

a1.channels.c1.type=memory

a1.channels.c1.capacity = 100000

a1.channels.c1.transactionCapacity = 1000

# Bind the source and sink to the channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

------------------------------------------------------------------------------------------------------------------

1. Kafka requires ZooKeeper, so start ZooKeeper first

#./bin/ start (run on each of the three nodes)

Connect via ./bin/ -server master:2181,slave1:2181,slave2:2181

<=> ./bin/ -server master:2181

<=> ./bin/

2. Start Kafka: cd to the Kafka home directory

#./bin/ config/server.properties

3. If the topic does not exist, create it

List the topics:

#bin/kafka-topics.sh --list --zookeeper master:2181,slave1:2181,slave2:2181

If there is no topic named badou, create it:

#bin/kafka-topics.sh --create --zookeeper master:2181,slave1:2181,slave2:2181 --replication-factor 1 --partitions 2 --topic badou

Consume:

#./bin/kafka-console-consumer.sh --zookeeper master:2181,slave1:2181,slave2:2181 --topic badou

4. Start Flume: [flume_kafka.conf]

#bin/flume-ng agent -c conf -f conf/flume_kafka.conf -n a1 -Dflume.root.logger=INFO,console

5. Write data into Flume:

The source in flume_kafka.conf is configured as

a1.sources.r1.type = exec

a1.sources.r1.command = tail -f /usr/local/src/code/flume/flume_exec_test.txt

so Flume watches this file.

Any data appended to the file is picked up by this source and written into Flume.

The script read_write.py is used to copy a file into the watched file flume_exec_test.txt:

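# -*- coding: utf-8 -*-
# read_write.py: replay orders.csv line by line into the file tailed by Flume
import random
import time

readFileName = "/home/badou/Documents/data/order_data/orders.csv"
writeFileName = "flume_exec_test.txt"

with open(writeFileName, 'a+') as wf:
    with open(readFileName, 'r') as f:
        for line in f:
            ss = line.strip()
            # skip empty lines
            if len(ss) < 1:
                continue
            wf.write(ss + '\n')
            # flush so that tail -f sees the new line right away
            wf.flush()
            # pause for a random interval (under one second) to imitate a live log
            rand_num = random.random()
            time.sleep(rand_num)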

The data written this way is the data that ends up in Kafka.

Writing to Hive

Step 1: Create the Hive table (database badou):

create table order_flume(

order_id string,

user_id string,

eval_set string,

order_number string,

order_dow string,

order_hour_of_day string,

days_since_prior_order string)

clustered by (order_id) into 5 buckets

stored as orc;

Step 2: Configure the Flume Hive sink:

#vim flume_hive.conf

-----------------------------------------------------------------

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = exec

a1.sources.r1.command = tail -f /usr/local/src/code/flume/flume_exec_test.txt

# Configure the Hive sink

a1.sinks.k1.type=hive

a1.sinks.k1.hive.metastore=thrift://master:9083

a1.sinks.k1.hive.database=badou

a1.sinks.k1.hive.table=order_flume

a1.sinks.k1.serializer=DELIMITED

a1.sinks.k1.serializer.delimiter=","

a1.sinks.k1.serializer.serdeSeparator=','

a1.sinks.k1.serializer.fieldnames=order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order

# use a channel which buffers events in memory

a1.channels.c1.type=memory

a1.channels.c1.capacity = 1000000

a1.channels.c1.transactionCapacity = 100000

# Bind the source and sink to the channel

a1.sources.r1.channels=c1

a1.sinks.k1.channel=c1

----------------------------------------------------------------------------------------
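The serializer settings above pair each delimited value with a name from serializer.fieldnames by position. The Python sketch below models that mapping for one line. The sample values are made up, and the strict field-count check is an illustrative choice rather than a description of what Flume itself does on a mismatch.

```python
FIELD_NAMES = (
    "order_id,user_id,eval_set,order_number,order_dow,"
    "order_hour_of_day,days_since_prior_order"
).split(",")
DELIMITER = ","  # same delimiter as serializer.delimiter


def map_line(line):
    """Split one delimited line and pair the values with the field names."""
    values = line.rstrip("\n").split(DELIMITER)
    if len(values) != len(FIELD_NAMES):
        raise ValueError(
            "expected %d fields, got %d" % (len(FIELD_NAMES), len(values))
        )
    return dict(zip(FIELD_NAMES, values))


# Hypothetical sample line in the column order of order_flume.
row = map_line("1001,7,prior,3,2,08,15.0")
print(row["order_id"], row["eval_set"], row["days_since_prior_order"])
```

Because the mapping is positional, the order of serializer.fieldnames has to match the order of the values in each line, not just the set of column names in the table.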

Step 3: Add the Hive dependency jars to flume_home/lib:

1. /usr/local/src/apache-hive-1.2.2-bin/hcatalog/share/hcatalog/*

2. /usr/local/src/apache-hive-1.2.2-bin/lib/*

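Step 4: Set five properties in hive-site.xml (the source omits the values of three of them; true and 1 below are the settings Hive's transaction documentation calls for, with 1 standing for any positive thread count):

<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>

<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>

<property>
<name>hive.txn.manager</name>
<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>

<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>

<property>
<name>hive.compactor.worker.threads</name>
<value>1</value>
</property>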

Step 5: Apply the configuration

1. Reboot the three nodes

2. Start MySQL: service mysqld start

3. Start Hadoop: ./sbin/

Step 6: Open another terminal and start the metastore:

hive --service metastore

Step 7: Start Flume:

bin/flume-ng agent --conf conf --conf-file conf/flume_hive.conf --name a1 -Dflume.root.logger=INFO,console

Run the Python script:

python read_write.py

-----------------------------------------------------------

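# -*- coding: utf-8 -*-
# Writes one JSON object per order row into the file tailed by Flume.
# Note: JSON lines match a Hive sink configured with serializer = JSON;
# with the DELIMITED serializer shown above, write comma-separated values instead.
import json

import pandas as pd

writeFileName = "./flume_exec_test.txt"
cols = ["order_id","user_id","eval_set","order_number","order_dow","order_hour_of_day","days_since_prior_order"]

df1 = pd.read_csv('../../data/orders.csv')
df1.columns = cols
# replace missing values with 0
df = df1.fillna(0)

with open(writeFileName, 'a+') as wf:
    for idx, row in df.iterrows():
        # build a dict keyed by column name for this row
        d = {}
        for col in cols:
            d[col] = row[col]
        js = json.dumps(d)
        wf.write(js + '\n')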

-------------------------------------------------------------
