I recently had to compute several metrics with Flink SQL at work, which involved dimension table joins, windowed aggregation, and early emission of results. The material available online (including the official documentation) does not cover these topics in much detail and can easily leave newcomers confused, so this article records the problems I ran into and how I solved them, in as much detail as possible, hoping it helps people who are new to this finish similar work smoothly.
I chose PV and UV as the example metrics because they are simple enough to follow, still involve the relevant techniques, and are among the most common metrics in data work.
The article shows code for both notions of time (event time and processing time). A temporal join against a dimension table also behaves differently at runtime depending on which one is used (the following is quoted from the official documentation):
- For an event-time temporal join, the join operator keeps the state of both sides of the join and cleans it up via watermarks.
- For a processing-time temporal join, the join operator only keeps the state of the right (build) side, and that state only contains the latest version of each record, so it is lightweight; for temporal tables that can be queried from an external system at runtime, the join can even be optimized to keep no state at all, making the operator very lightweight.
Preparation
Flink version: 1.12.1 (1.11 or later is recommended)
Required Maven dependencies (the ${flink.version}, ${scala.binary.version} and ${log4j.version} placeholders are Maven properties defined elsewhere in the POM):
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
<exclusions>
<exclusion>
<artifactId>slf4j-api</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-scala-bridge_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
<exclusions>
<exclusion>
<artifactId>slf4j-api</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.58</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
<!-- <scope>provided</scope>-->
<exclusions>
<exclusion>
<artifactId>slf4j-api</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<!-- or.. (for the new Blink planner) -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner-blink_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
<!-- <scope>provided</scope>-->
<exclusions>
<exclusion>
<artifactId>slf4j-api</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
<!-- <exclusions>-->
<!-- <exclusion>-->
<!-- <artifactId>slf4j-api</artifactId>-->
<!-- <groupId>org.slf4j</groupId>-->
<!-- </exclusion>-->
<!-- </exclusions>-->
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
<exclusions>
<exclusion>
<artifactId>slf4j-api</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
<exclusion>
<artifactId>snappy-java</artifactId>
<groupId>org.xerial.snappy</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-sql-connector-kafka_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-json</artifactId>
<version>${flink.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.21</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.15</version>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-slf4j-impl</artifactId>
<version>${log4j.version}</version>
<exclusions>
<exclusion>
<artifactId>slf4j-api</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>${log4j.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-jdbc_${scala.binary.version}</artifactId>
<version>${flink.version}</version>
</dependency>
</dependencies>
Requirements
Compute the site's PV and UV for each city.
The data looks like this.
Stream data from Kafka:
- user_id: id of the user who clicked on the site
- city_id: id of the city the user belongs to
- ts: timestamp
MySQL dimension table:
- city_id: city id
- city_name: city name
- update_time: update time
Implementation
Environment setup
Required imports:
import com.alibaba.fastjson.JSON
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.apache.flink.table.api._
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import java.text.SimpleDateFormat
import java.time.Duration
import java.util.Properties
Configure the Flink environment and create env and tableEnv:
val env = StreamExecutionEnvironment.createLocalEnvironment()
val settings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv = StreamTableEnvironment.create(env, settings)
val tableConfig = tableEnv.getConfig.getConfiguration
Read data from Kafka and convert it into a table
This code reads messages from Kafka, parses each one into an object with fastjson's parseObject to produce a stream, and registers a view on that stream for the SQL that follows.
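The code below deserializes each message into a WebData object, but that class is not shown in the article. A minimal sketch of what it could look like, matching the Kafka fields described above (the class body and the sample message are my own assumptions, not taken from the original):

import scala.beans.BeanProperty

// Sketch of the Kafka payload class (assumed; not shown in the original code).
// fastjson's parseObject needs a no-argument constructor plus setters or public
// fields, so a mutable POJO-style class is used here rather than a case class.
class WebData {
  @BeanProperty var user_id: String = _ // id of the user who clicked
  @BeanProperty var city_id: String = _ // id of the city the user belongs to
  @BeanProperty var ts: Long = 0L       // event timestamp in milliseconds
}
// Purely hypothetical sample message, only to illustrate the expected JSON shape:
// {"user_id":"u_0001","city_id":"110000","ts":1612345678000}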
Processing time:
val properties = new Properties()
properties.put("bootstrap.servers", "172.0.0.1:9092")
val consumer = new FlinkKafkaConsumer[String]("web-data", new SimpleStringSchema(), properties)
consumer.setStartFromEarliest()
val testStream = env.addSource(consumer)
  // drop records that fail to parse
  .map(x =>
    try { JSON.parseObject(x, classOf[WebData]) }
    catch { case ex: Exception => null })
  .filter(_ != null)
tableEnv.createTemporaryView("base_web_data", testStream, 'user_id, 'city_id, 'proctime.proctime())
Event time:
val properties = new Properties()
properties.put("bootstrap.servers", "172.0.0.1:9092")
val consumer = new FlinkKafkaConsumer[String]("web-data", new SimpleStringSchema(), properties)
consumer.setStartFromEarliest()
val testStream = env.addSource(consumer)
  // drop records that fail to parse
  .map(x =>
    try { JSON.parseObject(x, classOf[WebData]) }
    catch { case ex: Exception => null })
  .filter(_ != null)
  // assign event-time timestamps and watermarks
  .assignTimestampsAndWatermarks(WatermarkStrategy
    .forBoundedOutOfOrderness[WebData](Duration.ofSeconds(5)) // max out-of-orderness
    .withTimestampAssigner(new SerializableTimestampAssigner[WebData] {
      override def extractTimestamp(element: WebData, recordTimestamp: Long): Long =
        // shift event time by +8 hours (see the note below)
        element.ts + 8 * 60 * 60 * 1000
    }))
tableEnv.createTemporaryView("base_web_data", testStream, 'user_id, 'city_id, 'rowtime.rowtime())
Note: when Flink derives the event-time attribute it converts the assigned timestamp into a timestamp value using UTC+0, so it appears 8 hours behind local (UTC+8) time; the +8 hour adjustment above compensates for that.
Note: compared with the processing-time version, the event-time version differs in two places: a watermark has to be assigned, which the job uses at runtime to expire state, and when creating the view the processing-time version calls proctime() while the event-time version calls rowtime().
Note: the 'user_id symbol syntax requires import org.apache.flink.streaming.api.scala._
Create the dimension table via JDBC
Processing time:
tableEnv.executeSql(
"""CREATE TABLE all_city_list(
|city_id STRING,
|city_name STRING,
|update_time TIMESTAMP,
| PRIMARY KEY (city_id) NOT ENFORCED
|) WITH (
| 'connector' = 'jdbc',
| 'table-name' = 'all_city_list',
| 'url' = 'jdbc:mysql://127.0.0.1:3306/db1',
| 'username' = 'username',
| 'password' = 'passwd',
| 'driver' = 'com.mysql.cj.jdbc.Driver'
|)
|""".stripMargin)
Event time:
tableEnv.executeSql(
"""CREATE TABLE all_city_list_mysql(
|city_id STRING,
|city_name STRING,
|update_time TIMESTAMP,
| PRIMARY KEY (city_id) NOT ENFORCED,
| WATERMARK FOR update_time AS update_time
|) WITH (
| 'connector' = 'jdbc',
| 'table-name' = 'all_city_list',
| 'url' = 'jdbc:mysql://127.0.0.1:3306/db1',
| 'username' = 'username',
| 'password' = 'passwd',
| 'driver' = 'com.mysql.cj.jdbc.Driver'
|)
|""".stripMargin)
Note: an event-time dimension table must designate one of its columns as the watermark when it is created.
Dimension table join
The key phrase for a temporal join is FOR SYSTEM_TIME AS OF tb_a.proctime (tb_a.rowtime when using event time).
Compared with a regular join, this approach has two advantages. First, it is planned as a lookup join (StreamExecLookupJoin), which can see changes to the dimension table; with a regular join against a JDBC table, the table data is only loaded when the job starts and later changes never reach Flink, whereas the temporal join picks them up.
Second, a regular join keeps the state of both sides forever, so state grows until it exhausts memory and the job fails; a temporal join cleans up expired state regularly, so with a reasonable memory configuration it avoids running out of memory.
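One practical note on the processing-time lookup join: how quickly dimension table changes become visible also depends on the JDBC connector's lookup cache. The cache is disabled by default, so every probe queries MySQL; enabling it improves throughput, but an updated city_name may then keep being served from the cache until the TTL expires. Below is a sketch of the same dimension table with the cache options added (the table name all_city_list_cached and the option values are only examples):

// Optional: dimension table with the JDBC lookup cache enabled (sketch).
tableEnv.executeSql(
  """CREATE TABLE all_city_list_cached(
    |  city_id STRING,
    |  city_name STRING,
    |  update_time TIMESTAMP,
    |  PRIMARY KEY (city_id) NOT ENFORCED
    |) WITH (
    |  'connector' = 'jdbc',
    |  'table-name' = 'all_city_list',
    |  'url' = 'jdbc:mysql://127.0.0.1:3306/db1',
    |  'username' = 'username',
    |  'password' = 'passwd',
    |  'driver' = 'com.mysql.cj.jdbc.Driver',
    |  'lookup.cache.max-rows' = '5000',
    |  'lookup.cache.ttl' = '10min'
    |)
    |""".stripMargin)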
Processing time:
tableEnv.executeSql(
"""CREATE VIEW inout_web_table AS
|SELECT tb_a.*,tb_b.city_name
|FROM base_web_data tb_a
| LEFT JOIN all_city_list_mysql FOR SYSTEM_TIME AS OF tb_a.proctime AS tb_b
| ON tb_a.city_id = tb_b.city_id
|""".stripMargin)
Event time:
tableEnv.executeSql(
"""CREATE VIEW inout_web_table AS
|SELECT tb_a.*,tb_b.city_name
|FROM base_web_data tb_a
| LEFT JOIN all_city_list_mysql FOR SYSTEM_TIME AS OF tb_a.rowtime AS tb_b
| ON tb_a.city_id = tb_b.city_id
|""".stripMargin)
Note: for an event-time temporal join, the join key must be the primary key of the dimension table; at runtime Flink uses the primary key to locate the correct version of each row. See the temporal table documentation for details.
Note: the official documentation does not say that the probe-side table must contain a rowtime field in order to write FOR SYSTEM_TIME AS OF tb_a.rowtime, i.e. it suggests the syntax works even if the source table has no such field. In my tests, however, the query fails if the field does not exist, so the field has to be declared on the source table before it can be used in the join.
Computing PV and UV
To compute daily PV and UV we could open a 24-hour window and aggregate within it. The problem is that Flink SQL offers no direct way to set a trigger, so results would only be emitted when the window closes, which is usually not what we want. There are two ways around this: one is to set configuration options that emit intermediate results early; the other is to open a one-minute window (size it to match the output frequency you need) for a first-level aggregation and then aggregate again by date.
The two approaches differ in one respect: with early firing the window has not closed yet, so the emitted rows are updates, whereas with the two-level aggregation the one-minute windows have already closed, so a new row is written every minute. If you need the data to draw a curve, the two-level aggregation is recommended; if you only need to display a single figure, early firing is recommended.
Configuration required for early firing:
tableConfig.setString("table.exec.emit.early-fire.enabled", "true")
tableConfig.setString("table.exec.emit.early-fire.delay", "60 s")
Two-level aggregation:
// compute PV
tableEnv.executeSql(
"""CREATE VIEW pv_temp AS
|SELECT city_name,
|CAST('PV' AS STRING) AS statistic_type,
|DATE_FORMAT(MAX(rowtime),'yyyy-MM-dd') AS statistic_date,
|DATE_FORMAT(CEIL(max(rowtime) TO MINUTE),'HH:mm:ss') AS statistic_time,
|COUNT(user_id) AS statistic_value
|FROM inout_web_table
|GROUP BY city_name,TUMBLE(rowtime, INTERVAL '1' MINUTE)
|""".stripMargin)
tableEnv.executeSql(
"""insert into web_data_index_sink
|SELECT
|city_name,
|MAX(statistic_type) AS statistic_type,
|MAX(statistic_date) AS statistic_date,
|MAX(statistic_time) AS statistic_time,
|SUM(statistic_value) AS statistic_value
|FROM pv_temp
|GROUP BY city_name,statistic_date
|""".stripMargin)
// compute UV
tableEnv.executeSql(
"""CREATE VIEW uv_temp AS
|SELECT city_name,
|CAST('UV' AS STRING) AS statistic_type,
|DATE_FORMAT(MAX(rowtime),'yyyy-MM-dd') AS statistic_date,
|DATE_FORMAT(CEIL(max(rowtime) TO MINUTE),'HH:mm:ss') AS statistic_time,
|user_id AS user_id
|FROM inout_web_table
|GROUP BY city_name,user_id,TUMBLE(rowtime, INTERVAL '1' MINUTE)
|""".stripMargin)
tableEnv.executeSql(
"""insert into web_data_index_sink
|SELECT
|city_name,
|MAX(statistic_type) AS statistic_type,
|MAX(statistic_date) AS statistic_date,
|MAX(statistic_time) AS statistic_time,
|COUNT(distinct user_id) AS statistic_value
|FROM uv_temp
|GROUP BY city_name,statistic_date
|""".stripMargin)