前言

早上对Structured Streaming 的window函数, Output Mode 以及Watermark有些疑惑的地方。Structured Streaming 的文档偏少,而且网上的文章同质化太严重,基础的不能再基础了,但是我也不想再开个测试的工程项目,所以直接就给予MLSQL来调试。

本地启动一个

根据streamingpro的文档,在本地启动一个local模式的实例,然后打开 127.0.0.1:9003页面,大概是这个样子的。

image.png


测试过程

首先设置一个应用名称。通过

set streamName="streamExample";

来完成.

接着造一些数据:

-- mock some data.
set data='''
{"key":"1","value":"no","topic":"test","partition":0,"offset":0,"timestamp":"2008-01-24 18:01:01.001","timestampType":0}
{"key":"2","value":"no","topic":"test","partition":0,"offset":1,"timestamp":"2008-01-24 18:01:18.002","timestampType":0}
{"key":"3","value":"no","topic":"test","partition":0,"offset":2,"timestamp":"2008-01-24 18:01:20.003","timestampType":0}
{"key":"4","value":"no","topic":"test","partition":0,"offset":3,"timestamp":"2008-01-24 18:01:50.003","timestampType":0}
{"key":"5","value":"no","topic":"test","partition":0,"offset":4,"timestamp":"2008-01-24 18:02:01.003","timestampType":0}
{"key":"6","value":"no","topic":"test","partition":0,"offset":5,"timestamp":"2008-01-24 18:02:01.003","timestampType":0}
''';

这里精心调整下timestamp的实验,因为后面那我们测试都是根据这个时间来完成的。

把这些数据模拟成数据源表,我们取名叫newkafkatable1。

-- load data as table
load jsonStr.`data` as datasource;

-- convert table as stream source
load mockStream.`datasource` options 
stepSizeRange="0-3"
as newkafkatable1;

stepSizeRange 表示每个批次随机会给0-3条数据。你也可用fixSize参数,这样可以控制每个批次每次给多条。
接着对数据做个简单的处理。

select cast(key as string) as k,timestamp  from newkafkatable1 
as table21;

对table2 设置一下WaterMark

register WaterMarkInPlace.`table21` as tmp1
options eventTimeCol="timestamp"
and delayThreshold="60 seconds";

按窗口进行聚合,聚合的窗口大小是20秒。

select collect_list(k),
window(timestamp,"20 seconds").start as start,
window(timestamp,"20 seconds").end as end
from table21 
group by window(timestamp,"20 seconds")
as table22;

最后启动该流程序:

save append table22  
as console.`` 
options mode="Complete"
and duration="10"
and checkpointLocation="/tmp/cpl4";

这里采用Complete模式,然后输出打印在console.

我分别尝试了Complete,Append,Update模式,然后调整WarterMark,以及测试数据的timestamp,然后观察情况。

观察完毕,你可以关掉这个流式程序,按住command键点击任务列表,会新开一个窗口:




image.png


点击关闭任务按钮即可。

因为Console 输出不支持从checkpoint recover ,所以你可以手动删除/tmp/cpl4目录。

接着你修改mlsql脚本,然后点击提交即可。

总结

通过完全校本化,界面操作,以及mock数据的支持,可以很方便你进行structured streaming的探索