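Window (Window Computation)
Windows are at the heart of stream processing: they split an unbounded stream into finite buckets, and computation is then applied to the elements that fall into the same bucket (window). Flink distinguishes two kinds of windowed computation.
The first is windowing on a keyed stream.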
stream
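.keyBy(...) <- group by key
.window(...) <- required: "assigner" (the window assigner)
[.trigger(...)] <- optional: "trigger" (every window type has a default trigger)
[.evictor(...)] <- optional: "evictor" (can remove elements from the window)
[.allowedLateness(...)] <- optional: "lateness" (allows late data to be processed)
[.sideOutputLateData(...)] <- optional: "output tag" (sends late elements to a side output)
.reduce/aggregate/fold/apply() <- required: "function"
[.getSideOutput(...)] <- optional: reads side-output data, for example late elements
The second is windowing directly on a non-keyed stream.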
stream
.windowAll(...) <- required: "assigner"
[.trigger(...)] <- optional: "trigger" (else default trigger)
[.evictor(...)] <- optional: "evictor" (else no evictor)
[.allowedLateness(...)] <- optional: "lateness" (else zero)
[.sideOutputLateData(...)] <- optional: "output tag" (else no side output for late data)
.reduce/aggregate/fold/apply() <- required: "function"
[.getSideOutput(...)] <- optional: "output tag"
Window Lifecycle
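In short, a window is created as soon as the first element that belongs to it arrives, and it is removed completely once the time (the watermark for event time, or the system clock for processing time) passes the window's end time plus the user-specified allowed lateness. A window can fire only after the time has passed its end time. At that point the window is ready, and Flink runs the actual computation and emits the result.
Trigger: monitors the window. The window fires only when the trigger's condition is met (for example, the watermark passing the end of the window).
Evictor: removes elements from the window after the trigger fires, either before or after the aggregation function is applied.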
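The lifecycle described above can be sketched in plain Java, without any Flink dependency. The class `WindowLifecycleSketch`, its `Phase` enum, and the `phase` method are hypothetical names introduced only for illustration, and the exact boundary handling (whether a watermark equal to the end time already counts as "passed") is a simplifying assumption rather than Flink's precise rule.

```java
/** Hypothetical helper: classifies an event-time window against the current watermark. */
final class WindowLifecycleSketch {
    enum Phase { COLLECTING, FIRED_ACCEPTING_LATE_DATA, REMOVED }

    // Assumption: the window counts as ready once the watermark reaches its end time.
    static Phase phase(long windowEndMs, long allowedLatenessMs, long watermarkMs) {
        if (watermarkMs < windowEndMs) {
            return Phase.COLLECTING;                // still waiting for the watermark
        }
        if (watermarkMs < windowEndMs + allowedLatenessMs) {
            return Phase.FIRED_ACCEPTING_LATE_DATA; // result emitted, late elements still accepted
        }
        return Phase.REMOVED;                       // the window and its contents are dropped
    }

    public static void main(String[] args) {
        long end = 10_000L;
        long lateness = 2_000L;
        for (long watermark : new long[] {9_000L, 10_500L, 12_000L}) {
            System.out.println("watermark=" + watermark + " -> " + phase(end, lateness, watermark));
        }
    }
}
```

With an allowed lateness of zero the middle phase is empty in this sketch: the window fires and is removed at the same moment.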
Window Assigners
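Window assigners define how elements are assigned to windows. Once the window is defined, you can aggregate its contents with operators such as reduce/aggregate/fold/apply.
- Tumbling Windows: the window length equals the slide interval, so windows do not overlap. (time-based)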
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.reduce((v1,v2)=>(v1._1,v1._2+v2._2))
.print()
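To make the "no overlap" property concrete, here is a minimal plain-Java sketch of how a timestamp maps to exactly one tumbling window. It assumes windows aligned to the epoch with no offset. `TumblingWindowSketch`, `windowStart`, and `windowEnd` are hypothetical names and not part of the Flink API.

```java
/** Hypothetical helper: computes the tumbling window that contains a timestamp. */
final class TumblingWindowSketch {
    /** Inclusive start of the window of length sizeMs that contains timestampMs. */
    static long windowStart(long timestampMs, long sizeMs) {
        // floorMod keeps the result correct for negative timestamps as well.
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /** Exclusive end of that window. */
    static long windowEnd(long timestampMs, long sizeMs) {
        return windowStart(timestampMs, sizeMs) + sizeMs;
    }

    public static void main(String[] args) {
        long ts = 7_300L;
        long size = 5_000L; // 5-second windows, as in the example above
        System.out.println("[" + windowStart(ts, size) + ", " + windowEnd(ts, size) + ")");
    }
}
```

Because every timestamp has exactly one window start, each element is counted once.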
- Sliding Windows: the window length is greater than the slide interval, so consecutive windows share data. (time-based)
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.window(SlidingProcessingTimeWindows.of(Time.seconds(4),Time.seconds(2)))
.fold(("",0))((z,v)=>(v._1,z._2+v._2))
.print()
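The overlap can be illustrated with a small plain-Java sketch that lists every sliding window containing a given timestamp. It assumes epoch-aligned windows with no offset. `SlidingWindowSketch` and `windowStarts` are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the sliding windows that contain a timestamp. */
final class SlidingWindowSketch {
    /** Start times (newest first) of all windows of length sizeMs, sliding by slideMs, that contain timestampMs. */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // The most recent window start at or before the timestamp.
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // Walk backwards one slide at a time while the window still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // 4-second windows sliding every 2 seconds, as in the example above.
        System.out.println(windowStarts(5_000L, 4_000L, 2_000L));
    }
}
```

With a length of 4 seconds and a slide of 2 seconds, every element lands in two windows, which is why the same word is counted in two consecutive results.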
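- Session Windows: windows have no fixed size. Each element initially forms its own window, and windows whose gap is smaller than the configured session gap are merged. (time-based)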
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
.aggregate(new AggregateFunction[(String,Int),(String,Int),(String,Int)] {
override def createAccumulator(): (String, Int) = {
("",0)
}
override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) = {
(value._1,value._2+accumulator._2)
}
override def getResult(accumulator: (String, Int)): (String, Int) = {
accumulator
}
override def merge(a: (String, Int), b: (String, Int)): (String, Int) = {
(a._1,a._2+b._2)
}
})
.print()
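The merging behaviour of session windows can be sketched in plain Java as follows. Each timestamp opens a window `[ts, ts + gap)`, and windows that touch or overlap are merged. Treating windows that exactly touch as one session is an assumption of this sketch. `SessionWindowSketch` and `sessions` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: groups timestamps into merged session windows. */
final class SessionWindowSketch {
    /** Returns {start, end} pairs, where end is the last timestamp of the session plus gapMs. */
    static List<long[]> sessions(long[] timestampsMs, long gapMs) {
        long[] sorted = timestampsMs.clone();
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long ts : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && ts <= last[1]) {
                last[1] = Math.max(last[1], ts + gapMs); // the element extends the open session
            } else {
                result.add(new long[] {ts, ts + gapMs});  // the element starts a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (long[] s : sessions(new long[] {1_000L, 3_000L, 12_000L}, 5_000L)) {
            System.out.println(Arrays.toString(s));
        }
    }
}
```

This is also why the aggregate function used with session windows needs a working `merge` method: accumulators of merged windows have to be combined.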
- Global Windows: windows are not divided by time, so there is no notion of window length or time. The window fires only if the user supplies a custom trigger.
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(GlobalWindows.create())
.trigger(CountTrigger.of(4))
.apply(new WindowFunction[(String,Int),(String,Int),String, GlobalWindow] {
override def apply(key: String, window: GlobalWindow, inputs: Iterable[(String, Int)],
out: Collector[(String, Int)]): Unit = {
println("key:"+key+" w:"+window)
inputs.foreach(t=>println(t))
out.collect((key,inputs.map(_._2).sum))
}
})
.print()
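The effect of a count-based trigger on a global window can be approximated in plain Java. The sketch below assumes the trigger fires every `maxCount` elements without purging the window, which is my understanding of `CountTrigger`, so each firing sees everything collected so far for the key. `CountTriggerSketch` and `onElement` are hypothetical names, and in a real job this state would exist once per key.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one key's global window with a count-based trigger. */
final class CountTriggerSketch {
    private final int maxCount;
    private final List<Integer> windowContents = new ArrayList<>(); // never purged in this sketch
    private int countSinceLastFire = 0;

    CountTriggerSketch(int maxCount) {
        this.maxCount = maxCount;
    }

    /** Adds one element; returns the sum over the whole window when the trigger fires, otherwise null. */
    Integer onElement(int value) {
        windowContents.add(value);
        if (++countSinceLastFire < maxCount) {
            return null; // not enough new elements yet
        }
        countSinceLastFire = 0; // the counter resets, the window contents stay
        int sum = 0;
        for (int v : windowContents) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        CountTriggerSketch window = new CountTriggerSketch(4);
        for (int i = 1; i <= 8; i++) {
            Integer result = window.onElement(1);
            if (result != null) {
                System.out.println("fired after element " + i + ", sum=" + result);
            }
        }
    }
}
```

If each firing should see only the new elements, the window has to be purged, for example by a trigger that returns FIRE_AND_PURGE.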
Window Function
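After defining the window assigner, we need to specify the computation to run on each window. This is the job of the window function: once the system decides that a window is ready for processing, the window function is applied to the elements of that window. Flink provides the following window functions: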
- ReduceFunction
new ReduceFunction[(String, Int)] {
override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = {
(v1._1,v1._2+v2._2)
}
}
- AggregateFunction
new AggregateFunction[(String,Int),(String,Int),(String,Int)] {
override def createAccumulator(): (String, Int) = {
("",0)
}
override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) = {
(value._1,value._2+accumulator._2)
}
override def getResult(accumulator: (String, Int)): (String, Int) = {
accumulator
}
override def merge(a: (String, Int), b: (String, Int)): (String, Int) = {
(a._1,a._2+b._2)
}
}
- FoldFunction (deprecated)
new FoldFunction[(String,Int),(String,Int)] {
override def fold(accumulator: (String, Int), value: (String, Int)): (String, Int) = {
(value._1,accumulator._2+value._2)
}
}
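FoldFunction cannot be used with merging windows, so it is not available for session windows.
- apply/WindowFunction (legacy, generally not recommended)
Gives access to all elements of the window and to some metadata, but cannot operate on window state.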
new WindowFunction[(String,Int),(String,Int),String, GlobalWindow] {
override def apply(key: String, window: GlobalWindow, inputs: Iterable[(String, Int)],
out: Collector[(String, Int)]): Unit = {
println("key:"+key+" w:"+window)
inputs.foreach(t=>println(t))
out.collect((key,inputs.map(_._2).sum))
}
}
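When using WindowFunction, keyBy cannot take a field index, because keyBy(0) produces a key of type Tuple rather than String. Use a key selector instead:
keyBy(_._1)
- ProcessWindowFunction (the one to master)
Gives access to all elements of the window and to some metadata. It is the replacement for WindowFunction, because its context can also operate directly on per-window state and global state.
Reading window metadata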
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.process(new ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow] {
override def process(key: String,
context: Context,
elements: Iterable[(String, Int)],
out: Collector[(String, Int)]): Unit = {
val w = context.window
val sdf = new SimpleDateFormat("HH:mm:ss")
println(sdf.format(w.getStart)+" ~ "+ sdf.format(w.getEnd))
val total = elements.map(_._2).sum
out.collect((key,total))
}
})
.print()
fsEnv.execute("FlinkWordCountsQuickStart")
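Combined with Reduce|Aggregate|FoldFunction (the window is aggregated incrementally, so the ProcessWindowFunction receives only the single pre-aggregated element)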
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.reduce((v1:(String,Int),v2:(String,Int))=>(v1._1,v1._2+v2._2),
new ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow] {
override def process(key: String,
context: Context,
elements: Iterable[(String, Int)],
out: Collector[(String, Int)]): Unit = {
val w = context.window
val sdf = new SimpleDateFormat("HH:mm:ss")
println(sdf.format(w.getStart)+" ~ "+ sdf.format(w.getEnd))
val total = elements.map(_._2).sum
out.collect((key,total))
}
})
.print()
fsEnv.execute("FlinkWordCountsQuickStart")
Working with WindowState|GlobalState
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
//3. Transform the data
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.reduce((v1:(String,Int),v2:(String,Int))=>(v1._1,v1._2+v2._2),
new ProcessWindowFunction[(String,Int),String,String,TimeWindow] {
var windowStateDescriptor:ReducingStateDescriptor[Int]=_
var globalStateDescriptor:ReducingStateDescriptor[Int]=_
override def open(parameters: Configuration): Unit = {
windowStateDescriptor = new ReducingStateDescriptor[Int]("wcs",new ReduceFunction[Int] {
override def reduce(value1: Int, value2: Int): Int = value1+value2
},createTypeInformation[Int])
globalStateDescriptor = new ReducingStateDescriptor[Int]("gcs",new ReduceFunction[Int] {
override def reduce(value1: Int, value2: Int): Int = value1+value2
},createTypeInformation[Int])
}
override def process(key: String,
context: Context,
elements: Iterable[(String, Int)],
out: Collector[String]): Unit = {
val w = context.window
val sdf = new SimpleDateFormat("HH:mm:ss")
val windowState = context.windowState.getReducingState(windowStateDescriptor)
val globalState = context.globalState.getReducingState(globalStateDescriptor)
elements.foreach(t=>{
windowState.add(t._2)
globalState.add(t._2)
})
out.collect(key+"\t"+windowState.get()+"\t"+globalState.get())
}
})
.print()
fsEnv.execute("FlinkWordCountsQuickStart")
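The difference between the two kinds of state in the example above can be sketched in plain Java. Window state is scoped to one key and one window. Global state is scoped to the key only and survives across windows. `WindowVsGlobalStateSketch` and `onWindowFired` are hypothetical names, and the two maps merely stand in for Flink's managed state.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for per-window and global keyed state. */
final class WindowVsGlobalStateSketch {
    private final Map<String, Integer> perWindow = new HashMap<>(); // key + window start -> count
    private final Map<String, Integer> global = new HashMap<>();    // key -> count across all windows

    /** Records one firing of a window and returns "key<TAB>windowCount<TAB>globalCount". */
    String onWindowFired(String key, long windowStartMs, int countInWindow) {
        int windowCount = perWindow.merge(key + "@" + windowStartMs, countInWindow, Integer::sum);
        int globalCount = global.merge(key, countInWindow, Integer::sum);
        return key + "\t" + windowCount + "\t" + globalCount;
    }

    public static void main(String[] args) {
        WindowVsGlobalStateSketch state = new WindowVsGlobalStateSketch();
        System.out.println(state.onWindowFired("a", 0L, 3));     // first window for key "a"
        System.out.println(state.onWindowFired("a", 5_000L, 2)); // next window: window count restarts
    }
}
```

The window count starts over for each new window, while the global count keeps growing for the key.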
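Trigger
A trigger determines when a window (as formed by the window assigner) is ready to be processed by the window function. Every window assigner comes with a default trigger. If the default trigger does not fit your needs, you can specify a custom one with trigger(...).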
Window type | Default trigger | When it fires |
event-time windows (Tumbling/Sliding/Session) | EventTimeTrigger | Fires once the watermark passes the end of the window |
processing-time windows (Tumbling/Sliding/Session) | ProcessingTimeTrigger | Fires once the system time passes the end of the window |
GlobalWindow (not time-based) | NeverTrigger | Never fires |
public class UserDefineDeltaTrigger<T, W extends Window> extends Trigger<T, W> {
private final DeltaFunction<T> deltaFunction;
private final double threshold;
private final ValueStateDescriptor<T> stateDesc;
private UserDefineDeltaTrigger(double threshold, DeltaFunction<T> deltaFunction, TypeSerializer<T> stateSerializer) {
this.deltaFunction = deltaFunction;
this.threshold = threshold;
this.stateDesc = new ValueStateDescriptor<>("last-element", stateSerializer);
}
public TriggerResult onElement(T element, long timestamp, W window, TriggerContext ctx) throws Exception {
ValueState<T> lastElementState = ctx.getPartitionedState(this.stateDesc);
if (lastElementState.value() == null) {
lastElementState.update(element);
return TriggerResult.CONTINUE;
} else if (this.deltaFunction.getDelta(lastElementState.value(), element) > this.threshold) {
lastElementState.update(element);
return TriggerResult.FIRE_AND_PURGE;
} else {
return TriggerResult.CONTINUE;
}
}
public TriggerResult onEventTime(long time, W window, TriggerContext ctx) {
return TriggerResult.CONTINUE;
}
public TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception {
return TriggerResult.CONTINUE;
}
public void clear(W window, TriggerContext ctx) throws Exception {
ctx.getPartitionedState(this.stateDesc).clear();
}
public String toString() {
return "DeltaTrigger(" + this.deltaFunction + ", " + this.threshold + ")";
}
public static <T, W extends Window> UserDefineDeltaTrigger<T, W> of(double threshold, DeltaFunction<T> deltaFunction, TypeSerializer<T> stateSerializer) {
return new UserDefineDeltaTrigger<>(threshold, deltaFunction, stateSerializer);
}
}
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)
val deltaTrigger=UserDefineDeltaTrigger.of[(String,Double),GlobalWindow](10.0,new DeltaFunction[(String, Double)] {
override def getDelta(lastData: (String, Double), newData: (String, Double)): Double = {
newData._2-lastData._2
}
},createTypeInformation[(String,Double)].createSerializer(fsEnv.getConfig))
//3. Transform the data (the trigger threshold is 10.0)
// sample input line: a 100.0
dataStream.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1).toDouble))
.keyBy(_._1)
.window(GlobalWindows.create())
.trigger(deltaTrigger)
.apply(new WindowFunction[(String,Double),(String,Int),String, GlobalWindow] {
override def apply(key: String, window: GlobalWindow, inputs: Iterable[(String, Double)],
out: Collector[(String, Int)]): Unit = {
println("key:"+key+" w:"+window)
inputs.foreach(t=>println(t))
}
})
.print()
fsEnv.execute("FlinkWordCountsQuickStart")
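The firing rule of the custom trigger above can be traced with a small plain-Java sketch that mirrors the logic of `onElement`, using `newValue - lastValue` as the delta, as in the Scala job. `DeltaTriggerSketch` is a hypothetical name, and the sketch tracks a single key only.

```java
/** Hypothetical stand-in that mirrors the onElement logic of the delta trigger for one key. */
final class DeltaTriggerSketch {
    private final double threshold;
    private Double lastValue; // null until the first element arrives

    DeltaTriggerSketch(double threshold) {
        this.threshold = threshold;
    }

    /** Returns true when the element would make the trigger fire and purge the window. */
    boolean onElement(double value) {
        if (lastValue == null) {
            lastValue = value; // the first element only sets the reference value
            return false;
        }
        if (value - lastValue > threshold) {
            lastValue = value; // the reference moves only when the trigger fires
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        DeltaTriggerSketch trigger = new DeltaTriggerSketch(10.0);
        for (double v : new double[] {100.0, 105.0, 111.0, 120.0, 121.5}) {
            System.out.println(v + " -> fire=" + trigger.onElement(v));
        }
    }
}
```

Note that the reference value is not updated on elements that do not fire, so the delta is always measured against the first element or the last element that fired.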
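Evictors
An evictor can remove elements from a window after the trigger fires, before and/or after the window function is applied. For this purpose, the Evictor interface has two methods: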
public interface Evictor<T, W extends Window> extends Serializable {
/**
* Optionally evicts elements. Called before windowing function.
*
* @param elements The elements currently in the pane.
* @param size The current number of elements in the pane.
* @param window The {@link Window}
* @param evictorContext The context for the Evictor
*/
void evictBefore(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);
/**
* Optionally evicts elements. Called after windowing function.
*
* @param elements The elements currently in the pane.
* @param size The current number of elements in the pane.
* @param window The {@link Window}
* @param evictorContext The context for the Evictor
*/
void evictAfter(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);
}
public class UserDefineErrorEvictor<W extends Window> implements Evictor<String, W> {
private boolean isEvictorBefore;
private String content;
public UserDefineErrorEvictor(boolean isEvictorBefore, String content) {
this.isEvictorBefore = isEvictorBefore;
this.content=content;
}
public void evictBefore(Iterable<TimestampedValue<String>> elements, int size, W window, EvictorContext evictorContext) {
if(isEvictorBefore){
evict(elements, size, window, evictorContext);
}
}
public void evictAfter(Iterable<TimestampedValue<String>> elements, int size, W window, EvictorContext evictorContext) {
if(!isEvictorBefore){
evict(elements, size, window, evictorContext);
}
}
private void evict(Iterable<TimestampedValue<String>> elements, int size, W window, EvictorContext evictorContext) {
Iterator<TimestampedValue<String>> iterator = elements.iterator();
while(iterator.hasNext()){
TimestampedValue<String> next = iterator.next();
String value = next.getValue();
if(value.contains(content)){
iterator.remove();
}
}
}
}
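The core of the evictor above is removal through an `Iterator`. The following plain-Java sketch isolates that step so it can be tried without a Flink job. `ContentEvictionSketch` and its `evict` method are hypothetical names, and plain strings stand in for `TimestampedValue<String>`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the content-based eviction step. */
final class ContentEvictionSketch {
    /** Returns a copy of the elements without those that contain the given content. */
    static List<String> evict(List<String> elements, String content) {
        List<String> kept = new ArrayList<>(elements); // work on a copy, leave the input untouched
        for (Iterator<String> it = kept.iterator(); it.hasNext(); ) {
            if (it.next().contains(content)) {
                it.remove(); // safe removal while iterating
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(evict(Arrays.asList("ok 1", "error 2", "ok 3"), "error"));
    }
}
```

Removing through the iterator avoids the `ConcurrentModificationException` that removing from the collection directly during iteration would cause.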