On the current state of Hudi support.


I have tested Hudi's Flink SQL support and it still has issues. Hudi is open source, of course, so you may need to build it from source to track problems down yourself.

I used the official Hudi bundle for Scala 2.11 with the SQL client, tested on Flink 1.12.4:

./bin/sql-client.sh embedded -j ./hudi-flink-bundle_2.11-0.8.0.jar shell


Hadoop Configuration

Download Flink 1.12.4: https://www.apache.org/dyn/closer.lua/flink/flink-1.12.4/flink-1.12.4-bin-scala_2.11.tgz

Download Hadoop 2.9.2: https://archive.apache.org/dist/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz

Preparation before configuring Hadoop

1. Set the hostnames. I created three virtual machines, named node-1, node-2, and node-3. Open the network file, delete its contents, and write just the hostname:

vi /etc/sysconfig/network

2. Map the IPs to hostnames, then reboot each machine:

[root@node-1 redis-cluster]# vim /etc/hosts
192.168.0.1 node-1
192.168.0.2 node-2
192.168.0.3 node-3

3. Check the firewall (it must be disabled). Different systems disable the firewall differently; the following is only a reference:

service iptables stop       # stop the firewall now
chkconfig iptables off      # disable on boot; it stays off unless re-enabled manually
chkconfig --list iptables   # check the boot-time setting
service iptables status     # check the current firewall status

4. Passwordless SSH login

Run ssh-keygen -t rsa and press Enter four times to accept the defaults.


(Required) Then run ssh-copy-id against each host's name or IP; afterwards "ssh <hostname/IP>" logs in to that host without a password:

ssh-copy-id node-1

ssh-copy-id node-2

ssh-copy-id node-3
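With three nodes these commands can also be driven by a loop. The sketch below is a dry run that only echoes each command (replace echo with the real call on your cluster; the root@ user is an assumption):

```shell
# Dry run: print the ssh-copy-id command for each node instead of executing it.
for host in node-1 node-2 node-3; do
  echo "ssh-copy-id root@${host}"
done
```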


Configuring the Hadoop files

Upload the Hadoop package, unpack it, and go into the hadoop-2.9.2/etc/hadoop directory.
All of the <property> </property> snippets below go inside the <configuration> tag at the end of the corresponding config file.

File 1: hadoop-env.sh

vi hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_171   # set JAVA_HOME to your own JDK install path

File 2: core-site.xml

vi core-site.xml
<!-- The filesystem schema (URI) Hadoop uses; the address of the HDFS NameNode -->
<property>
	<name>fs.defaultFS</name>
	<value>hdfs://node-1:9000</value>
</property>
<!-- The directory where Hadoop stores the files it produces at runtime -->
<property>
	<name>hadoop.tmp.dir</name>
	<value>/export/data/hddata</value>
</property>
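A malformed XML edit is a common cause of startup failures, so it can pay to check the file parses before moving on. This optional sketch (the /tmp path is illustrative) recreates the fragment above and reads back the NameNode URI with Python's stdlib parser:

```shell
# Recreate the fragment wrapped in <configuration> and parse it with Python's XML parser.
cat <<'EOF' > /tmp/core-site-check.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node-1:9000</value>
  </property>
</configuration>
EOF
python3 -c "import xml.etree.ElementTree as ET; \
print(ET.parse('/tmp/core-site-check.xml').getroot().find('./property/value').text)"
```

If the file is well-formed, this prints `hdfs://node-1:9000`; a parse error means the edit broke the XML.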

File 3: hdfs-site.xml

vi hdfs-site.xml
<!-- HDFS replication factor; defaults to 3 if not set -->
<property>
	<name>dfs.replication</name>
	<value>2</value>
</property>
<!-- Which host runs the SecondaryNameNode -->
<property>
	<name>dfs.namenode.secondary.http-address</name>
	<value>node-2:50090</value>
</property>
<!-- HTTP address of the NameNode web UI -->
<property>
    <name>dfs.namenode.http-address</name>
    <value>node-1:50070</value>
</property>

File 4: mapred-site.xml

mv mapred-site.xml.template mapred-site.xml
vi mapred-site.xml

<!-- The framework MapReduce runs on; here yarn (the default is local) -->
<property>
	<name>mapreduce.framework.name</name>
	<value>yarn</value>
</property>

File 5: yarn-site.xml

vi yarn-site.xml
<!-- Address of the YARN ResourceManager -->
<property>
	<name>yarn.resourcemanager.hostname</name>
	<value>node-1</value>
</property>


<!-- Auxiliary service run on each NodeManager; must be mapreduce_shuffle for MapReduce programs to run -->
<property>
	<name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
</property>

File 6: the slaves file, listing the hostnames of the worker nodes

vi slaves
node-1
node-2
node-3
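Since the slaves file is just plain hostnames, one per line, it can also be generated from a list. A small sketch (writing to /tmp for illustration; in practice the file lives in hadoop's etc/hadoop directory):

```shell
# Generate a slaves file from the node list and confirm it has one line per node.
printf '%s\n' node-1 node-2 node-3 > /tmp/slaves
wc -l < /tmp/slaves
```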

File 7: add Hadoop to the environment variables

vi /etc/profile
export HADOOP_HOME=/export/server/hadoop-2.9.2
export PATH=$HADOOP_HOME/bin:$PATH
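To see that the PATH change will pick up the hadoop launcher, here is a self-contained sketch using a stub hadoop script in a temp directory (the stub, its output, and the paths are all illustrative stand-ins for a real install):

```shell
# Create a stub $HADOOP_HOME/bin/hadoop, put it on PATH, and resolve it.
mkdir -p /tmp/hadoop-demo/bin
printf '#!/bin/sh\necho "hadoop stub"\n' > /tmp/hadoop-demo/bin/hadoop
chmod +x /tmp/hadoop-demo/bin/hadoop
export HADOOP_HOME=/tmp/hadoop-demo
export PATH=$HADOOP_HOME/bin:$PATH
command -v hadoop   # should resolve inside $HADOOP_HOME/bin
```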

Then send the Hadoop installation to the other machines:

scp -r /export/server/hadoop-2.9.2/ root@node-2:/export/server/
scp -r /export/server/hadoop-2.9.2/ root@node-3:/export/server/

Send the configured environment variables to the other machines as well:

scp -r /etc/profile root@node-2:/etc/
scp -r /etc/profile root@node-3:/etc/

Then make the environment variables take effect on every machine:

source /etc/profile

Starting in YARN mode

1. Format the NameNode: hdfs namenode -format

2. Start HDFS: sbin/start-dfs.sh (if port 9000 cannot be found, HDFS has not been started)

3. Start YARN: sbin/start-yarn.sh

Check whether ports 8088 and 9000 are listening:

netstat -tpnl

netstat -an| grep 9000
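What you are looking for is a LISTEN entry whose local address ends in :9000 (the NameNode RPC port from fs.defaultFS). The sketch below runs the same kind of filter over sample netstat output; the sample lines are fabricated for illustration so it runs anywhere:

```shell
# Filter sample netstat output for a listener on port 9000.
cat <<'EOF' > /tmp/netstat.sample
tcp  0  0 192.168.0.1:9000  0.0.0.0:*  LISTEN  4242/java
tcp  0  0 0.0.0.0:50070     0.0.0.0:*  LISTEN  4242/java
EOF
awk '$4 ~ /:9000$/ && $6 == "LISTEN" {print "port 9000 is listening"}' /tmp/netstat.sample
```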

Check the web UI:

http://192.168.1.10:8088/

For HDFS HA, see:


HDFS needs to be reformatted:

1. Stop HDFS.

2. Delete the HDFS data directories (the directories configured in hdfs-site.xml).

3. Start the JournalNodes: sbin/hadoop-daemon.sh start journalnode

4. Format the NameNode: hdfs namenode -format

5. hdfs zkfc -formatZK  (initializes the HA state in ZooKeeper)

6. Start HDFS: sbin/start-dfs.sh

7. Start YARN: sbin/start-yarn.sh

Starting Flink

Local Installation


Follow these few steps to download the latest stable versions and get started.

Step 1: Download

To be able to run Flink, the only requirement is to have a working Java 8 or 11 installation. You can check the correct installation of Java by issuing the following command:

java -version

Download the 1.12.4 release and un-tar it.

$ tar -xzf flink-1.12.4-bin-scala_2.11.tgz
$ cd flink-1.12.4-bin-scala_2.11

Step 2: Start a Cluster

Flink ships with a single bash script to start a local cluster.

$ ./bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host.
Starting taskexecutor daemon on host.

Step 3: Submit a Job

Releases of Flink come with a number of example Jobs. You can quickly deploy one of these applications to the running cluster.

Pitfall 1: writing to HDFS fails. Flink 1.12.x does not support writing to Hadoop out of the box; download https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.8.3-10.0/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar with wget and put it into the lib folder.
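The download step from the pitfall note can be sketched as a dry run; the command is only echoed so this runs anywhere, and the flink-1.12.4 directory name is an assumption about where Flink was unpacked:

```shell
# Dry run: print the command that fetches the shaded Hadoop jar into Flink's lib/ directory.
JAR_URL=https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.8.3-10.0/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar
echo "wget -P flink-1.12.4/lib/ $JAR_URL"
```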

$ ./bin/flink run examples/streaming/WordCount.jar
$ ./bin/flink run examples/streaming/WordCount.jar --output hdfs://suibin.online:9000/usr/root/Gate.txt
$ tail log/flink-*-taskexecutor-*.out
  (to,1)
  (be,1)
  (or,1)
  (not,1)
  (to,2)
  (be,2)

Additionally, you can check Flink’s Web UI to monitor the status of the Cluster and running Job.

Step 4: Stop the Cluster

When you are finished you can quickly stop the cluster and all running components.

$ ./bin/stop-cluster.sh

Starting Hudi: https://hudi.apache.org/docs/flink-quick-start-guide.html

This guide provides a quick peek at Hudi's capabilities using the Flink SQL client. Using Flink SQL, we will walk through code snippets that allow you to insert and update a Hudi table of the default table types: Copy on Write and Merge on Read. After each write operation we will also show how to read the data snapshot (incremental read is already on the roadmap).

SetupPermalink

We use the Flink Sql Client because it’s a good quick start tool for SQL users.

Step.1 download flink jarPermalink

Hudi works with Flink-1.12.x version. You can follow instructions here for setting up flink. The hudi-flink-bundle jar is archived with scala 2.11, so it’s recommended to use flink 1.12.x bundled with scala 2.11.

Step.2 start flink clusterPermalink

Start a standalone flink cluster within hadoop environment.

Now start the cluster:

# HADOOP_HOME is your hadoop root directory after unpack the binary package.
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`

# Start the flink standalone cluster
./bin/start-cluster.sh
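The HADOOP_CLASSPATH line above captures whatever `hadoop classpath` prints into an environment variable via command substitution. The self-contained sketch below shows the mechanics with a stub hadoop script; the stub path and its output are illustrative, not a real Hadoop classpath:

```shell
# Stub `hadoop classpath` so the command-substitution pattern can be demonstrated anywhere.
mkdir -p /tmp/hd/bin
printf '#!/bin/sh\necho "/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/*"\n' > /tmp/hd/bin/hadoop
chmod +x /tmp/hd/bin/hadoop
export HADOOP_HOME=/tmp/hd
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
echo "$HADOOP_CLASSPATH"
```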

Step.3 start flink SQL clientPermalink

Hudi has a prepared bundle jar for flink, which should be loaded in the flink SQL Client when it starts up. You can build the jar manually under path hudi-source-dir/packaging/hudi-flink-bundle, or download it from the Apache Official Repository.

Now start the SQL CLI:

# HADOOP_HOME is your hadoop root directory after unpack the binary package.
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`

./bin/sql-client.sh embedded -j .../hudi-flink-bundle_2.1?-*.*.*.jar shell

Please note the following:

  • We suggest a hadoop 2.9.x+ version, because some object stores only have a filesystem implementation from that version on
  • The flink-parquet and flink-avro formats are already packaged into the hudi-flink-bundle jar

Set up the table name and base path, and operate using SQL for this guide. The SQL CLI only executes SQL line by line.

Insert dataPermalink

Create a Flink Hudi table first, then insert data into it using SQL VALUES, as below.

-- sets up the result mode to tableau to show the results directly in the CLI
set execution.result-mode=tableau;

CREATE TABLE t1(
  uuid VARCHAR(20), -- you can use 'PRIMARY KEY NOT ENFORCED' syntax to mark the field as record key
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://suibin.online:9000/usr/root/t1',
  'write.tasks' = '1', -- default is 4; a higher value needs more resources
  'compaction.tasks' = '1', -- default is 10; a higher value needs more resources
  'table.type' = 'MERGE_ON_READ' -- creates a MERGE_ON_READ table; the default is COPY_ON_WRITE
);

-- insert data using values
INSERT INTO t1 VALUES
  ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
  ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
  ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
  ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
  ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
  ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
  ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
  ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');

Query dataPermalink

-- query from the hudi table
select * from t1;

This query provides snapshot querying of the ingested data. Refer to Table types and queries for more info on all table types and query types supported.

Update dataPermalink

This is similar to inserting new data.

-- this would update the record with key 'id1'
insert into t1 values
  ('id1','Danny',27,TIMESTAMP '1970-01-01 00:00:01','par1');

Notice that the save mode is now Append. In general, always use append mode unless you are trying to create the table for the first time. Querying the data again will now show the updated records. Each write operation generates a new commit denoted by a timestamp. Look for changes in the _hoodie_commit_time field for the same _hoodie_record_keys from the previous commit.

Streaming queryPermalink

Hudi flink also provides capability to obtain a stream of records that changed since given commit timestamp. This can be achieved using Hudi’s streaming querying and providing a start time from which changes need to be streamed. We do not need to specify endTime, if we want all changes after the given commit (as is the common case).

Note: in 'path' = 'hdfs://suibin.online:9000/usr/root/t1', substitute the address of your own HDFS.

CREATE TABLE t1(
  uuid VARCHAR(20), -- you can use 'PRIMARY KEY NOT ENFORCED' syntax to mark the field as record key
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://suibin.online:9000/usr/root/t1',
  'table.type' = 'MERGE_ON_READ',
  'read.tasks' = '1', -- default is 4; a higher value needs more resources
  'read.streaming.enabled' = 'true',  -- this option enable the streaming read
  'read.streaming.start-commit' = '20210316134557', -- specifies the start commit instant time
  'read.streaming.check-interval' = '4' -- specifies the check interval for finding new source commits, default 60s.
);

-- Then query the table in stream mode
select * from t1;

This will give all changes that happened after the read.streaming.start-commit commit. The unique thing about this feature is that it now lets you author streaming pipelines on streaming or batch data source.

Delete dataPermalink

When consuming data in a streaming query, the Hudi Flink source can also accept change logs from the underlying data source and apply UPDATEs and DELETEs at the per-row level. You can then sync a near-real-time snapshot on Hudi for all kinds of RDBMSs.

Where to go from here?Permalink

We used Flink here to showcase the capabilities of Hudi. However, Hudi can support multiple table types/query types, and Hudi tables can be queried from engines like Hive, Spark, Flink, Presto and more. We have put together a demo video that showcases all of this on a Docker-based setup with all dependent systems running locally. We recommend you replicate the same setup and run the demo yourself by following the steps here to get a taste for it. Also, if you are looking for ways to migrate your existing data to Hudi, refer to the migration guide.