I recently had to compute a few metrics with Flink SQL at work, which involved dimension table joins, windowed aggregation, early result emission, and similar topics. The material currently available online (the official docs included) does not cover these areas in much detail and can easily confuse newcomers. This article therefore records the problems I ran into and how I solved them, in as much detail as possible, in the hope of helping people new to this area get similar work done smoothly.

I picked PV and UV as the two example metrics because they are simple enough, still touch on a few real technical points, and are among the most common metrics in day-to-day data work, which makes them a good case study.

This article shows code for both notions of time (event time and processing time). A dimension table join also behaves differently at runtime depending on which one is used; the following is excerpted from the official documentation:

  • For an event-time temporal join, the join operator keeps the state of both input streams and cleans it up via watermarks.
  • For a processing-time temporal join, the join operator only keeps the state of the right (build) side, and that state only contains the latest version of each record, so it is lightweight; for temporal tables that can be queried in an external system at runtime, the join operator can even be optimized to keep no state at all, which makes it extremely lightweight.

Prerequisites

Flink version: 1.12.1 (1.11 or later is recommended)

Maven dependencies to include:

<dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>slf4j-api</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-api-scala-bridge_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>slf4j-api</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.58</version>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <!--            <scope>provided</scope>-->
            <exclusions>
                <exclusion>
                    <artifactId>slf4j-api</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- Blink planner (the one used in this article) -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner-blink_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <!--            <scope>provided</scope>-->
            <exclusions>
                <exclusion>
                    <artifactId>slf4j-api</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <!--            <exclusions>-->
            <!--                <exclusion>-->
            <!--                    <artifactId>slf4j-api</artifactId>-->
            <!--                    <groupId>org.slf4j</groupId>-->
            <!--                </exclusion>-->
            <!--            </exclusions>-->
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>slf4j-api</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>snappy-java</artifactId>
                    <groupId>org.xerial.snappy</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-sql-connector-kafka_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-json</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.21</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.15</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-slf4j-impl</artifactId>
            <version>${log4j.version}</version>
            <exclusions>
                <exclusion>
                    <artifactId>slf4j-api</artifactId>
                    <groupId>org.slf4j</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>${log4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-jdbc_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
    </dependencies>
Case requirements

Compute the site's PV and UV per city.

The data looks like this:

The stream comes from Kafka (a minimal sketch of a matching WebData class follows these field lists):

  • user_id: id of the user who clicked on the site
  • city_id: id of the city the user belongs to
  • ts: event timestamp

MySQL dimension table:

  • city_id: city id
  • city_name: city name
  • update_time: last update time
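
The Kafka records are JSON (they are parsed with fastjson below). What follows is only a sketch of a WebData class that the parsing code could deserialize into; the field names mirror the list above, while the types, sample values, and the BeanProperty style are assumptions of mine, not the original code:

// A sketch only: the class fastjson parses each Kafka JSON record into,
// e.g. {"user_id":"u001","city_id":"3301","ts":1613700000000} (ts assumed to be
// epoch milliseconds). fastjson follows JavaBean conventions, so a no-arg
// constructor plus getters/setters (via @BeanProperty) keeps both fastjson and
// Flink's POJO type extraction happy.
import scala.beans.BeanProperty

class WebData {
  @BeanProperty var user_id: String = _
  @BeanProperty var city_id: String = _
  @BeanProperty var ts: Long = 0L
}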

Implementation

Environment setup

Required imports:

import com.alibaba.fastjson.JSON
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.EnvironmentSettings
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment
import org.apache.flink.table.api._
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import java.text.SimpleDateFormat
import java.time.Duration
import java.util.Properties

Configure the Flink environment and create env and tableEnv:

val env = StreamExecutionEnvironment.createLocalEnvironment()
val settings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv = StreamTableEnvironment.create(env, settings)
val tableConfig = tableEnv.getConfig.getConfiguration
Read data from Kafka and turn it into a table

This code reads records from Kafka, converts each one into an object with fastjson's parseObject to produce a stream, and registers a view on that stream for the SQL that follows.

Processing time:

val properties = new Properties()
properties.put("bootstrap.servers", "172.0.0.1:9092")

val consumer = new FlinkKafkaConsumer[String]("web-data", new SimpleStringSchema(), properties)
consumer.setStartFromEarliest()

val testStream = env.addSource(consumer)
  // filter out null records (failed JSON parses)
  .map(x =>
    try{JSON.parseObject(x, classOf[WebData])}
    catch {
      case ex:Exception => null
    }).filter(_ != null)
tableEnv.createTemporaryView("base_web_data", testStream, 'user_id, 'city_id, 'proctime.proctime())

Event time:

val properties = new Properties()
properties.put("bootstrap.servers", "172.0.0.1:9092")

val consumer = new FlinkKafkaConsumer[String]("web-data", new SimpleStringSchema(), properties)
consumer.setStartFromEarliest()

val testStream = env.addSource(consumer)
  // filter out null records (failed JSON parses)
  .map(x =>
    try{JSON.parseObject(x, classOf[WebData])}
    catch {
      case ex:Exception => null
    }).filter(_ != null)
      // assign event-time timestamps and watermarks
      .assignTimestampsAndWatermarks(WatermarkStrategy
        .forBoundedOutOfOrderness(Duration.ofSeconds(5)) // maximum out-of-orderness
        .withTimestampAssigner(new SerializableTimestampAssigner[WebData] {
          override def extractTimestamp(element: WebData, recordTimestamp: Long): Long =
            // adjust the event time (see the note below)
            element.ts + 8 * 60 * 60 * 1000
        }))
tableEnv.createTemporaryView("base_web_data", testStream, 'user_id, 'city_id, 'rowtime.rowtime())

Note: when Flink assigns the event time, it converts the timestamp to a point in time in UTC+0, which is 8 hours behind the local (UTC+8) wall-clock time, so 8 hours are added here to compensate.

Note: compared with the processing-time version, the event-time code differs in two places. First, a watermark has to be assigned; at runtime the watermark is what drives state expiry. Second, when creating the view, processing time uses the proctime() method, while event time uses the rowtime() method.

Note: the 'user_id syntax requires import org.apache.flink.streaming.api.scala._

Create the dimension table via JDBC

Processing time:

tableEnv.executeSql(
  """CREATE TABLE all_city_list(
    |city_id	STRING,
    |city_name	STRING,
    |update_time	TIMESTAMP,
    | PRIMARY KEY (city_id) NOT ENFORCED
    |) WITH (
    |   'connector' = 'jdbc',
    |   'table-name' = 'all_city_list',
    |   'url' = 'jdbc:mysql://127.0.0.1:3306/db1',
    |   'username' = 'username',
    |   'password' = 'passwd',
    |   'driver' = 'com.mysql.cj.jdbc.Driver'
    |)
    |""".stripMargin)

Event time:

tableEnv.executeSql(
  """CREATE TABLE all_city_list_mysql(
    |city_id	STRING,
    |city_name	STRING,
    |update_time	TIMESTAMP,
    | PRIMARY KEY (city_id) NOT ENFORCED,
    | WATERMARK FOR update_time AS update_time
    |) WITH (
    |   'connector' = 'jdbc',
    |   'table-name' = 'all_city_list',
    |   'url' = 'jdbc:mysql://127.0.0.1:3306/db1',
    |   'username' = 'username',
    |   'password' = 'passwd',
    |   'driver' = 'com.mysql.cj.jdbc.Driver'
    |)
    |""".stripMargin)

Note: an event-time dimension table must declare a watermark on one of its columns when it is created.

Dimension table join

The keyword for a temporal join is FOR SYSTEM_TIME AS OF tb_a.proctime (tb_a.rowtime for event time).

This kind of join has two advantages over a regular join. First, it is executed as a lookup join (StreamExecLookupJoin), which can see changes to the dimension table; with a regular join against a JDBC table, the table is loaded once at startup and subsequent changes never reach Flink, whereas a temporal join does pick up dimension table changes.

Second, a regular join keeps the state of both sides indefinitely, so the state keeps growing until it exhausts memory and the job fails; a temporal join can periodically clean up expired state, so with a reasonable memory configuration it avoids running out of memory.
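
As a side note, a processing-time lookup join queries MySQL per record unless a cache is enabled. The following is only a sketch of how the dimension table declaration could be tuned, assuming the Flink 1.12 JDBC connector's 'lookup.cache.max-rows' and 'lookup.cache.ttl' options (the cache trades dimension-table freshness against MySQL load):

// A sketch only: same dimension table as above, with the JDBC lookup cache enabled.
// The cache options are part of the Flink 1.12 JDBC connector; values are placeholders.
tableEnv.executeSql(
  """CREATE TABLE all_city_list_cached(
    |city_id	STRING,
    |city_name	STRING,
    |update_time	TIMESTAMP,
    | PRIMARY KEY (city_id) NOT ENFORCED
    |) WITH (
    |   'connector' = 'jdbc',
    |   'table-name' = 'all_city_list',
    |   'url' = 'jdbc:mysql://127.0.0.1:3306/db1',
    |   'username' = 'username',
    |   'password' = 'passwd',
    |   'driver' = 'com.mysql.cj.jdbc.Driver',
    |   'lookup.cache.max-rows' = '5000',
    |   'lookup.cache.ttl' = '10min'
    |)
    |""".stripMargin)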

Processing time:

tableEnv.executeSql(
  """CREATE VIEW inout_web_table AS
    |SELECT tb_a.*,tb_b.city_name
    |FROM base_web_data tb_a
    |    LEFT JOIN all_city_list FOR SYSTEM_TIME AS OF tb_a.proctime AS tb_b
    |    ON tb_a.city_id = tb_b.city_id
    |""".stripMargin)

Event time:

tableEnv.executeSql(
  """CREATE VIEW inout_web_table AS
    |SELECT tb_a.*,tb_b.city_name
    |FROM base_web_data tb_a
    |    LEFT JOIN all_city_list_mysql FOR SYSTEM_TIME AS OF tb_a.rowtime AS tb_b
    |    ON tb_a.city_id = tb_b.city_id
    |""".stripMargin)

Note: in an event-time temporal join, the join key must be the primary key of the dimension table; at runtime Flink uses the primary key to locate the matching version of each row. See the temporal table documentation for details.

Note: the official documentation's examples of FOR SYSTEM_TIME AS OF tb_a.rowtime do not mention that the probe-side table has to contain a rowtime field, which suggests the join would work even without one. In my tests, however, the job fails if the table does not have that field, so the field has to be declared on the source table before it can be used in the join.

Computing PV and UV

To compute daily PV and UV you could open a 24-hour window and aggregate the data inside it. The problem is that Flink SQL offers no direct way to set a trigger, so results would only be emitted once the window closes, which is usually not what we want. There are two ways around this: one is to set configuration parameters so that results are emitted early; the other is to open a one-minute window (size it according to how often you need output) for a first-level aggregation and then aggregate again by date.

The two approaches differ in one respect. With early firing the window has not closed yet, so the output rows are updates; with two-level aggregation the small window has already closed, so a new row is appended every minute. If you need the data to draw a curve, the two-level aggregation approach is recommended; if you only need to display a single number, early firing is recommended.

Parameters required for early firing:

tableConfig.setString("table.exec.emit.early-fire.enabled", "true")
tableConfig.setString("table.exec.emit.early-fire.delay", "60 s")

Two-level aggregation:

// compute PV
tableEnv.executeSql(
      """CREATE VIEW pv_temp AS
        |SELECT city_name,
        |CAST('PV' AS STRING) AS statistic_type,
        |DATE_FORMAT(MAX(rowtime),'yyyy-MM-dd') AS statistic_date,
        |DATE_FORMAT(CEIL(MAX(rowtime) TO MINUTE),'HH:mm:ss') AS statistic_time,
        |COUNT(user_id) AS statistic_value
        |FROM inout_web_table
        |GROUP BY city_name,TUMBLE(rowtime, INTERVAL '1' MINUTE)
        |""".stripMargin)

tableEnv.executeSql(
      """insert into web_data_index_sink
        |SELECT 
        |city_name,
        |MAX(statistic_type) AS statistic_type,
        |MAX(statistic_date) AS statistic_date,
        |MAX(statistic_time) AS statistic_time,
        |SUM(statistic_value) AS statistic_value
        |FROM pv_temp
        |GROUP BY city_name,statistic_date
        |""".stripMargin)
// compute UV
tableEnv.executeSql(
      """CREATE VIEW uv_temp AS
        |SELECT city_name,
        |CAST('UV' AS STRING) AS statistic_type,
        |DATE_FORMAT(MAX(rowtime),'yyyy-MM-dd') AS statistic_date,
        |DATE_FORMAT(CEIL(MAX(rowtime) TO MINUTE),'HH:mm:ss') AS statistic_time,
        |user_id AS user_id
        |FROM inout_web_table
        |GROUP BY city_name,user_id,TUMBLE(rowtime, INTERVAL '1' MINUTE)
        |""".stripMargin)

tableEnv.executeSql(
      """insert into web_data_index_sink
        |SELECT 
        |city_name,
        |MAX(statistic_type) AS statistic_type,
        |MAX(statistic_date) AS statistic_date,
        |MAX(statistic_time) AS statistic_time,
        |COUNT(distinct user_id) AS statistic_value
        |FROM uv_temp
        |GROUP BY city_name,statistic_date
        |""".stripMargin)