Window(窗口计算)

窗口计算是流计算的核心,通过使用窗口对无限的流数据划分成固定大小的 buckets,然后基于落入同一个bucket(窗口)中的元素执行计算。Flink将窗口计算分为两大类。

一类基于keyed-stream窗口计算。

stream
       .keyBy(...)               <-  分组
       .window(...)              <-  必须: "assigner" 窗口分配器
      [.trigger(...)]            <-  可选: "trigger" 每一种类型的窗口系统都有默认触发器
      [.evictor(...)]            <-  可选: "evictor" 可以剔除窗口中元素
      [.allowedLateness(...)]    <-  可选: "lateness" 可以处理迟到数据
      [.sideOutputLateData(...)] <-  可选: "output tag" 可以Side Out获取迟到的元素
       .reduce/aggregate/fold/apply()      <-  必须: "function"
      [.getSideOutput(...)]      <-  可选: 获取Sideout数据 例如迟到数据

直接对non-keyed Stream窗口计算

stream
       .windowAll(...)           <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

Window Lifecycle

简而言之,一旦应属于该窗口的第一个元素到达,就会创建一个窗口,并且当时间|WaterMarker(Event Tme或Process Time)超过其Window End 时间加上用户指定的允许延迟时,该窗口将被完全删除。窗口触发计算前提 水位线 没过窗口的End Time这个时候窗口处于Ready状态,这个时候Flink才会对窗口做真正的输出计算。

Trigger:负责监控窗口,只有满足触发器的条件,窗口才会触发。(例如 水位线计算)

evictor: 在窗口触发之后在应用聚合函数之前或之后剔除窗口中的元素。

Window Assigners

Window Assigners定义了如何将元素分配给窗口。在定义完窗口之后,用户可以使用reduce/aggregate/folder/apply等算子实现对窗口的聚合计算。

  • Tumbling Windows :滚动,窗口长度和滑动间隔相等,窗口之间没有重叠。(时间)
dataStream.flatMap(_.split("\\s+"))
    .map((_,1))
    .keyBy(0)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .reduce((v1,v2)=>(v1._1,v1._2+v2._2))
    .print()
  • Sliding Windows:滑动,窗口长度 大于 滑动间隔,窗口之间存在数据重叠。(时间)
dataStream.flatMap(_.split("\\s+"))
    .map((_,1))
    .keyBy(0)
    .window(SlidingProcessingTimeWindows.of(Time.seconds(4),Time.seconds(2)))
    .fold(("",0))((z,v)=>(v._1,z._2+v._2))
    .print()
  • Session Windows: 会话窗口,窗口没有固定大小,每个元素都会形成一个新窗口,如果窗口的间隔小于指定时间,这些窗口会进行合并。(时间)
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
.aggregate(new AggregateFunction[(String,Int),(String,Int),(String,Int)] {
    override def createAccumulator(): (String, Int) = {
        ("",0)
    }

    override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) = {
        (value._1,value._2+accumulator._2)
    }

    override def getResult(accumulator: (String, Int)): (String, Int) = {
        accumulator
    }

    override def merge(a: (String, Int), b: (String, Int)): (String, Int) = {
        (a._1,a._2+b._2)
    }
})
.print()
  • Global Windows:全局窗口,窗口并不是基于时间划分窗口,因此不存在窗口长度和时间概念。需要用户定制触发策略,窗口才会触发。
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(GlobalWindows.create())
.trigger(CountTrigger.of(4))
.apply(new WindowFunction[(String,Int),(String,Int),String, GlobalWindow] {
    override def apply(key: String, window: GlobalWindow, inputs: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
        println("key:"+key+" w:"+window)
        inputs.foreach(t=>println(t))
        out.collect((key,inputs.map(_._2).sum))
    }
})
.print()

Window Function

定义Window Assigners后,我们需要指定要在每个窗口上执行的计算。 这是Window Function的职责,一旦系统确定某个窗口已准备好进行处理,该Window Function将用于处理每个窗口的元素。Flink提供了以下Window Function处理函数:

  • ReduceFunction
new ReduceFunction[(String, Int)] {
    override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = {
        (v1._1,v1._2+v2._2)
    }
}
  • AggregateFunction
new AggregateFunction[(String,Int),(String,Int),(String,Int)] {
    override def createAccumulator(): (String, Int) = {
        ("",0)
    }
    override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) = {
        (value._1,value._2+accumulator._2)
    }
    override def getResult(accumulator: (String, Int)): (String, Int) = {
        accumulator
    }
    override def merge(a: (String, Int), b: (String, Int)): (String, Int) = {
        (a._1,a._2+b._2)
    }
}
  • FoldFunction(废弃)
new FoldFunction[(String,Int),(String,Int)] {
    override def fold(accumulator: (String, Int), value: (String, Int)): (String, Int) = {
        (value._1,accumulator._2+value._2)
    }
}

不能用在Merger window中,不可用在SessionWindows中。

  • apply/WindowFunction(旧版-一般不推荐)

可以获取窗口的中的所有元素,并且可以拿到一些元数据信息,无法操作窗口状态。

new WindowFunction[(String,Int),(String,Int),String, GlobalWindow] {
    override def apply(key: String, window: GlobalWindow, inputs: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
        println("key:"+key+" w:"+window)
        inputs.foreach(t=>println(t))
        out.collect((key,inputs.map(_._2).sum))
    }
}

在keyBy的时候,不能使用下标,只能使用keyBy(_._1)

  • ProcessWindowFunction(重点掌握)

可以获取窗口的中的所有元素,并且拿到一些元数据信息。是WindowFunction的替代方案,因为该接口可以直接操作窗口的State|全局State

获取窗口状态

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2.创建DataStream -细化
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)

//3.对数据做转换
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.process(new ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow] {

    override def process(key: String,
                         context: Context,
                         elements: Iterable[(String, Int)],
                         out: Collector[(String, Int)]): Unit = {

        val w = context.window
        val sdf = new SimpleDateFormat("HH:mm:ss")

        println(sdf.format(w.getStart)+" ~ "+ sdf.format(w.getEnd))

        val total = elements.map(_._2).sum
        out.collect((key,total))
    }
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

配合Reduce|Aggregate|FoldFunction

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2.创建DataStream -细化
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)

//3.对数据做转换
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.reduce((v1:(String,Int),v2:(String,Int))=>(v1._1,v1._2+v2._2),
        new ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow] {

    override def process(key: String,
                         context: Context,
                         elements: Iterable[(String, Int)],
                         out: Collector[(String, Int)]): Unit = {

        val w = context.window
        val sdf = new SimpleDateFormat("HH:mm:ss")

        println(sdf.format(w.getStart)+" ~ "+ sdf.format(w.getEnd))

        val total = elements.map(_._2).sum
        out.collect((key,total))
    }
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

操作WindowState|GlobalState

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2.创建DataStream -细化
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)

//3.对数据做转换
dataStream.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.reduce((v1:(String,Int),v2:(String,Int))=>(v1._1,v1._2+v2._2),
        new ProcessWindowFunction[(String,Int),String,String,TimeWindow] {
            var windowStateDescriptor:ReducingStateDescriptor[Int]=_
            var globalStateDescriptor:ReducingStateDescriptor[Int]=_

            override def open(parameters: Configuration): Unit = {
                windowStateDescriptor = new ReducingStateDescriptor[Int]("wcs",new ReduceFunction[Int] {
                    override def reduce(value1: Int, value2: Int): Int = value1+value2
                },createTypeInformation[Int])
                globalStateDescriptor = new ReducingStateDescriptor[Int]("gcs",new ReduceFunction[Int] {
                    override def reduce(value1: Int, value2: Int): Int = value1+value2
                },createTypeInformation[Int])
            }

            override def process(key: String,
                                 context: Context,
                                 elements: Iterable[(String, Int)],
                                 out: Collector[String]): Unit = {

                val w = context.window
                val sdf = new SimpleDateFormat("HH:mm:ss")

                val windowState = context.windowState.getReducingState(windowStateDescriptor)
                val globalState = context.globalState.getReducingState(globalStateDescriptor)

                elements.foreach(t=>{
                    windowState.add(t._2)
                    globalState.add(t._2)
                })
                out.collect(key+"\t"+windowState.get()+"\t"+globalState.get())
            }
        })
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

Trigger (触发器)

Trigger确定窗口(由Window Assigner形成)何时准备好由Window Function处理。 每个Window Assigner都带有一个默认Trigger。 如果默认Trigger不适合您的需求,则可以使用trigger(…)指定自定义触发器。

窗口类型

触发器

触发时机

event-time window(Tumbling/Sliding/Session)

EventTimeTrigger

一旦watermarker没过窗口的末端,该触发器便会触发

processing-time window(Tumbling/Sliding/Session)

ProcessingTimeTrigger

一旦系统时间没过窗口末端,该触发器便会触发

GlobalWindow 并不是基于时间的窗口

NeverTrigger

永远不会触发。

public class UserDefineDeltaTrigger<T, W extends Window> extends Trigger<T, W> {

    private final DeltaFunction<T> deltaFunction;
    private final double threshold;
    private final ValueStateDescriptor<T> stateDesc;

    private UserDefineDeltaTrigger(double threshold, DeltaFunction<T> deltaFunction, TypeSerializer<T> stateSerializer) {
        this.deltaFunction = deltaFunction;
        this.threshold = threshold;
        this.stateDesc = new ValueStateDescriptor("last-element", stateSerializer);
    }

    public TriggerResult onElement(T element, long timestamp, W window, TriggerContext ctx) throws Exception {
        ValueState<T> lastElementState = (ValueState)ctx.getPartitionedState(this.stateDesc);
        if (lastElementState.value() == null) {
            lastElementState.update(element);
            return TriggerResult.CONTINUE;
        } else if (this.deltaFunction.getDelta(lastElementState.value(), element) > this.threshold) {
            lastElementState.update(element);
            return TriggerResult.FIRE_AND_PURGE;
        } else {
            return TriggerResult.CONTINUE;
        }
    }

    public TriggerResult onEventTime(long time, W window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    public TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    public void clear(W window, TriggerContext ctx) throws Exception {
        ((ValueState)ctx.getPartitionedState(this.stateDesc)).clear();
    }

    public String toString() {
        return "DeltaTrigger(" + this.deltaFunction + ", " + this.threshold + ")";
    }

    public static <T, W extends Window> UserDefineDeltaTrigger<T, W> of(double threshold, DeltaFunction<T> deltaFunction, TypeSerializer<T> stateSerializer) {
        return new UserDefineDeltaTrigger(threshold, deltaFunction, stateSerializer);
    }
}
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
//2.创建DataStream -细化
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark",9999)

var deltaTrigger=UserDefineDeltaTrigger.of[(String,Double),GlobalWindow](10.0,new DeltaFunction[(String, Double)] {
    override def getDelta(lastData: (String, Double), newData: (String, Double)): Double = {
        newData._2-lastData._2
    }
},createTypeInformation[(String,Double)].createSerializer(fsEnv.getConfig))

//3.对数据做转换  10
// a  100.0
dataStream.map(_.split("\\s+"))
.map(ts=>(ts(0),ts(1).toDouble))
.keyBy(_._1)
.window(GlobalWindows.create())
.trigger(deltaTrigger)
.apply(new WindowFunction[(String,Double),(String,Int),String, GlobalWindow] {
    override def apply(key: String, window: GlobalWindow, inputs: Iterable[(String, Double)],
                       out: Collector[(String, Int)]): Unit = {
        println("key:"+key+" w:"+window)
        inputs.foreach(t=>println(t))
    }
})
.print()

fsEnv.execute("FlinkWordCountsQuickStart")

Evictors(剔除器)

Evictors可以在触发器触发后,应用Window Function之前 和/或 之后从窗口中删除元素。 为此,Evictor界面有两种方法:

public interface Evictor<T, W extends Window> extends Serializable {

	/**
	 * Optionally evicts elements. Called before windowing function.
	 *
	 * @param elements The elements currently in the pane.
	 * @param size The current number of elements in the pane.
	 * @param window The {@link Window}
	 * @param evictorContext The context for the Evictor
     */
	void evictBefore(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);

	/**
	 * Optionally evicts elements. Called after windowing function.
	 *
	 * @param elements The elements currently in the pane.
	 * @param size The current number of elements in the pane.
	 * @param window The {@link Window}
	 * @param evictorContext The context for the Evictor
	 */
	void evictAfter(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);
	}
}
public class UserDefineErrorEvictor<W extends  Window> implements Evictor<String, W> {
    private  boolean isEvictorBefore;
    private  String  content;

    public UserDefineErrorEvictor(boolean isEvictorBefore, String content) {
        this.isEvictorBefore = isEvictorBefore;
        this.content=content;
    }

    public void evictBefore(Iterable<TimestampedValue<String>> elements, int size, W window, EvictorContext evictorContext) {
        if(isEvictorBefore){
            evict(elements,  size,  window,  evictorContext);
        }
    }

    public void evictAfter(Iterable<TimestampedValue<String>> elements, int size, W window, EvictorContext evictorContext) {
        if(!isEvictorBefore){
            evict(elements,  size,  window,  evictorContext);
        }
    }
    private  void evict(Iterable<TimestampedValue<String>> elements, int size, W window, EvictorContext evictorContext) {
        Iterator<TimestampedValue<String>> iterator = elements.iterator();
        while(iterator.hasNext()){
            TimestampedValue<String> next = iterator.next();
            String value = next.getValue();
            if(value.contains(content)){
                iterator.remove();
            }
        }
    }
}