Preface

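Since version 0.10 (0.10.1.0, to be precise), Kafka has supported consuming from a given starting timestamp: KafkaConsumer.offsetsForTimes looks up the offset that corresponds to a timestamp, so under the hood consumption still starts from an offset.
Correspondingly, starting with FlinkKafkaConsumer010, the Flink source also exposes a method for specifying the start time of consumption from Kafka: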

FlinkKafkaConsumerBase<T> setStartFromTimestamp(long startupOffsetsTimestamp)
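
To make the timestamp-to-offset lookup concrete, here is a minimal sketch that mimics it on an in-memory partition. It only models the documented behaviour of offsetsForTimes (the earliest offset whose timestamp is greater than or equal to the target, or nothing if no such record exists). The class TimestampLookupSketch and the method offsetForTime are hypothetical names chosen for illustration, and the code does not talk to Kafka.

```java
import java.util.OptionalLong;

public class TimestampLookupSketch {

    // timestamps[i] is the timestamp (epoch millis) of the record at offset i.
    // Returns the earliest offset whose timestamp is >= target, or empty if there is none.
    public static OptionalLong offsetForTime(long[] timestamps, long target) {
        for (int offset = 0; offset < timestamps.length; offset++) {
            if (timestamps[offset] >= target) {
                return OptionalLong.of(offset);
            }
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        // hypothetical partition contents, made up for this example
        long[] partition = {1000L, 1500L, 1500L, 2200L, 3000L};
        System.out.println(offsetForTime(partition, 1500L)); // OptionalLong[1]
        System.out.println(offsetForTime(partition, 2000L)); // OptionalLong[3]
        System.out.println(offsetForTime(partition, 9000L)); // OptionalLong.empty
    }
}
```

Once the offset is known, consumption simply starts there, which is why a start timestamp is easy to support while an end timestamp needs extra logic in the fetch loop.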

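A business requirement called for consuming Kafka data from a specific time range in Flink. The API Flink currently provides only supports a start time: once started, the consumer keeps reading indefinitely, so it has a beginning but no end. To meet the requirement, consumption has to stop at a given point in time.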

1. Analysis

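Reading the FlinkKafkaConsumer source shows that the consumer is started in FlinkKafkaConsumerBase.run, which creates a KafkaFetcher. The KafkaFetcher starts a thread (KafkaConsumerThread) that consumes the Kafka topic and passes the fetched data to the fetcher thread through a synchronized Handover object shared by the two threads. The logic lives in KafkaFetcher.runFetchLoop(); for the details, see the source of KafkaConsumerThread and Handover.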
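
The hand-over between the two threads can be pictured with the following simplified sketch. It is not Flink's Handover class: MiniHandover, HandoverSketch and runDemo are hypothetical names, and the sketch only keeps the core idea of a single shared slot guarded by a lock, where the producer blocks while the slot is full and the consumer blocks while it is empty.

```java
import java.util.ArrayList;
import java.util.List;

public class HandoverSketch {

    /** One-slot exchange point between a producer thread and a consumer thread. */
    static final class MiniHandover<E> {
        private final Object lock = new Object();
        private E next;          // the single slot; null means "empty"
        private Throwable error; // failure reported by the producer thread, if any

        /** Called by the producer thread; blocks while the previous element is still unconsumed. */
        void produce(E element) throws InterruptedException {
            synchronized (lock) {
                while (next != null) {
                    lock.wait();
                }
                next = element;
                lock.notifyAll();
            }
        }

        /** Called by the producer thread to pass a failure on to the consumer. */
        void reportError(Throwable t) {
            synchronized (lock) {
                if (error == null) {
                    error = t;
                }
                lock.notifyAll();
            }
        }

        /** Called by the consumer thread; blocks until an element or an error is available. */
        E pollNext() throws InterruptedException {
            synchronized (lock) {
                while (next == null && error == null) {
                    lock.wait();
                }
                if (next == null) {
                    throw new IllegalStateException("producer thread failed", error);
                }
                E element = next;
                next = null;
                lock.notifyAll();
                return element;
            }
        }
    }

    public static List<String> runDemo() {
        MiniHandover<String> handover = new MiniHandover<>();
        // stands in for the Kafka consumer thread
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= 3; i++) {
                    handover.produce("batch-" + i);
                }
            } catch (InterruptedException e) {
                handover.reportError(e);
            }
        });
        producer.start();

        // stands in for the fetch loop on the task thread
        List<String> received = new ArrayList<>();
        try {
            for (int i = 0; i < 3; i++) {
                received.add(handover.pollNext());
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints [batch-1, batch-2, batch-3]
    }
}
```

Because the slot holds at most one batch, the consuming side naturally applies back-pressure to the thread that polls Kafka.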

@Override
	public void runFetchLoop() throws Exception {
		try {
			final Handover handover = this.handover;

			// kick off the actual Kafka consumer
			consumerThread.start();

			while (running) {
				// this blocks until we get the next records
				// it automatically re-throws exceptions encountered in the consumer thread
				final ConsumerRecords<byte[], byte[]> records = handover.pollNext();

				// get the records for each topic partition
				for (KafkaTopicPartitionState<TopicPartition> partition : subscribedPartitionStates()) {

					List<ConsumerRecord<byte[], byte[]>> partitionRecords =
						records.records(partition.getKafkaPartitionHandle());

					for (ConsumerRecord<byte[], byte[]> record : partitionRecords) {
						final T value = deserializer.deserialize(record);

						if (deserializer.isEndOfStream(value)) {
							// end of stream signaled
							running = false;
							break;
						}

						// emit the actual record. this also updates offset state atomically
						// and deals with timestamps and watermark generation
						emitRecord(value, partition, record.offset(), record);
					}
				}
			}
		}
		finally {
			// this signals the consumer thread that no more work is to be done
			consumerThread.shutdown();
		}

		// the original method then waits for the consumer thread to exit (omitted in this excerpt)
	}

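The source shows that KafkaFetcher pulls data in a loop inside runFetchLoop(), and the loop ends only when running becomes false, that is, when the deserializer signals end of stream (or the fetcher is cancelled). So to stop the KafkaConsumer manually, it is enough to add one more condition that ends the loop.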

2. Implementation

a. To support stopping, my approach is to start with KafkaFetcher and add the stop timestamp as a member variable.

private long stopConsumingTimestamp;

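b. Next, modify the core of the consumption logic, the runFetchLoop() method, by adding a loop exit condition: once a fetched record is past the specified timestamp, stop fetching and shut down the consumer thread.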

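// stop fetching and emitting once a record is past the stop timestamp;
// 0 means "no end time", and a record exactly at the stop timestamp is still emitted
if (stopConsumingTimestamp != 0 && record.timestamp() > stopConsumingTimestamp) {
  this.running = false;
  break;
}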
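
One thing to keep in mind is that this condition ends the whole fetch loop as soon as any single partition delivers a record past the stop timestamp, so partitions that are read later may still hold in-range records. The sketch below shows one possible variation, not the approach used in this post: it marks partitions as finished individually and stops only when all of them are done. It is a plain-Java simulation; Rec, BoundedLoopSketch and consume are hypothetical names, and the batches are made-up data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BoundedLoopSketch {

    /** Hypothetical stand-in for a Kafka record: partition id, offset and timestamp only. */
    static final class Rec {
        final int partition;
        final long offset;
        final long timestamp;

        Rec(int partition, long offset, long timestamp) {
            this.partition = partition;
            this.offset = offset;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return "p" + partition + "@" + offset;
        }
    }

    /**
     * Emits records whose timestamp is <= stopTimestamp and stops once every assigned
     * partition has delivered a record beyond it. A stopTimestamp of 0 means "unbounded".
     */
    public static List<String> consume(List<List<Rec>> batches, Set<Integer> assigned, long stopTimestamp) {
        List<String> emitted = new ArrayList<>();
        Set<Integer> finished = new HashSet<>();
        for (List<Rec> batch : batches) {
            for (Rec rec : batch) {
                if (finished.contains(rec.partition)) {
                    continue; // this partition is already past the end of the range
                }
                if (stopTimestamp != 0 && rec.timestamp > stopTimestamp) {
                    finished.add(rec.partition);
                    continue;
                }
                emitted.add(rec.toString());
            }
            if (finished.containsAll(assigned)) {
                break; // every partition is done, so the loop can end
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<List<Rec>> batches = Arrays.asList(
            Arrays.asList(new Rec(0, 0, 100), new Rec(0, 1, 250), new Rec(1, 0, 90)),
            Arrays.asList(new Rec(1, 1, 150), new Rec(1, 2, 300), new Rec(0, 2, 260)),
            Arrays.asList(new Rec(0, 3, 120)));
        Set<Integer> assigned = new HashSet<>(Arrays.asList(0, 1));
        System.out.println(consume(batches, assigned, 200)); // prints [p0@0, p1@0, p1@1]
    }
}
```

With a single global flag, the loop would end after the first batch and p1@1 would never be emitted. The trade-off of the per-partition variant is that a partition which never receives a record beyond the stop timestamp keeps the loop waiting, so a real implementation would need some additional way to decide that such a partition is finished.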

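c. FlinkKafkaConsumer likewise gets a stop-timestamp field, which is passed along when the Fetcher is created. A new method, setIntervalFromTimestamp(), specifies the time range: it sets stopConsumingTimestamp and reuses setStartFromTimestamp() to set the timestamp at which the consumer starts.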

public FlinkKafkaConsumerBase<T> setIntervalFromTimestamp(long startupOffsetsTimestamp, long stopConsumingTimestamp) {
  // validate the range before changing any state
  if (startupOffsetsTimestamp > stopConsumingTimestamp) {
    throw new IllegalArgumentException("The start consuming time " + startupOffsetsTimestamp
        + " exceeds the end time " + stopConsumingTimestamp);
  }
  setStopConsumingTimestamp(stopConsumingTimestamp);
  return super.setStartFromTimestamp(startupOffsetsTimestamp);
}

The implementation is not complicated: in short, two classes are adapted, FlinkKafkaConsumer and KafkaFetcher.
The complete code follows:

SpecificFlinkKafkaConsumer.java

package org.apache.flink.streaming.connectors.kafka;

import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.operators.StreamingRuntimeContext;
import org.apache.flink.streaming.connectors.kafka.config.OffsetCommitMode;
import org.apache.flink.streaming.connectors.kafka.internal.SpecificKafkaFetcher;
import org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;
import org.apache.flink.util.SerializedValue;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Pattern;

/**
 * @author lee
 */
public class SpecificFlinkKafkaConsumer<T> extends FlinkKafkaConsumer<T> {
  private long stopConsumingTimestamp;

  public SpecificFlinkKafkaConsumer(String topic, DeserializationSchema<T> valueDeserializer, Properties props) {
    super(topic, valueDeserializer, props);
  }

  public SpecificFlinkKafkaConsumer(String topic, KafkaDeserializationSchema<T> deserializer, Properties props) {
    super(topic, deserializer, props);
  }

  public SpecificFlinkKafkaConsumer(List<String> topics, DeserializationSchema<T> deserializer, Properties props) {
    super(topics, deserializer, props);
  }

  public SpecificFlinkKafkaConsumer(List<String> topics, KafkaDeserializationSchema<T> deserializer, Properties props) {
    super(topics, deserializer, props);
  }

  public SpecificFlinkKafkaConsumer(Pattern subscriptionPattern, DeserializationSchema<T> valueDeserializer, Properties props) {
    super(subscriptionPattern, valueDeserializer, props);
  }

  public SpecificFlinkKafkaConsumer(Pattern subscriptionPattern, KafkaDeserializationSchema<T> deserializer, Properties props) {
    super(subscriptionPattern, deserializer, props);
  }

  private void setStopConsumingTimestamp(long stopConsumingTimestamp) {
    this.stopConsumingTimestamp = stopConsumingTimestamp;
  }

  // set the time range to consume: [startupOffsetsTimestamp, stopConsumingTimestamp]
  public FlinkKafkaConsumerBase<T> setIntervalFromTimestamp(long startupOffsetsTimestamp, long stopConsumingTimestamp) {
    // validate the range before changing any state
    if (startupOffsetsTimestamp > stopConsumingTimestamp) {
      throw new IllegalArgumentException("The start consuming time " + startupOffsetsTimestamp
          + " exceeds the end time " + stopConsumingTimestamp);
    }
    setStopConsumingTimestamp(stopConsumingTimestamp);
    return super.setStartFromTimestamp(startupOffsetsTimestamp);
  }

  @Override
  protected AbstractFetcher<T, ?> createFetcher(SourceContext<T> sourceContext, Map<KafkaTopicPartition, Long> assignedPartitionsWithInitialOffsets, SerializedValue<AssignerWithPeriodicWatermarks<T>> watermarksPeriodic, SerializedValue<AssignerWithPunctuatedWatermarks<T>> watermarksPunctuated, StreamingRuntimeContext runtimeContext, OffsetCommitMode offsetCommitMode, MetricGroup consumerMetricGroup, boolean useMetrics) throws Exception {
    adjustAutoCommitConfig(this.properties, offsetCommitMode);
    return new SpecificKafkaFetcher<>(sourceContext, assignedPartitionsWithInitialOffsets, watermarksPeriodic, watermarksPunctuated, runtimeContext.getProcessingTimeService(), runtimeContext.getExecutionConfig().getAutoWatermarkInterval(), runtimeContext.getUserCodeClassLoader(), runtimeContext.getTaskNameWithSubtasks(), this.deserializer, this.properties, this.pollTimeout, runtimeContext.getMetricGroup(), consumerMetricGroup, useMetrics, stopConsumingTimestamp);
  }
}

SpecificKafkaFetcher.java

package org.apache.flink.streaming.connectors.kafka.internal;

import org.apache.flink.annotation.Internal;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaCommitCallback;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartitionState;
import org.apache.flink.streaming.runtime.tasks.ProcessingTimeService;
import org.apache.flink.util.Preconditions;
import org.apache.flink.util.SerializedValue;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import javax.annotation.Nonnull;

/**
 * @author lee
 */
@Internal
public class SpecificKafkaFetcher<T> extends AbstractFetcher<T, TopicPartition> {
  private static final Logger LOG = LoggerFactory.getLogger(SpecificKafkaFetcher.class);
  private final KafkaDeserializationSchema<T> deserializer;
  private final Handover handover;
  private final KafkaConsumerThread consumerThread;
  private volatile boolean running = true;

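  // records with a timestamp greater than this value end the fetch loop; 0 disables the check
  private final long stopConsumingTimestamp;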

  public SpecificKafkaFetcher(SourceFunction.SourceContext<T> sourceContext, Map<KafkaTopicPartition, Long> assignedPartitionsWithInitialOffsets, SerializedValue<AssignerWithPeriodicWatermarks<T>> watermarksPeriodic, SerializedValue<AssignerWithPunctuatedWatermarks<T>> watermarksPunctuated, ProcessingTimeService processingTimeProvider, long autoWatermarkInterval, ClassLoader userCodeClassLoader, String taskNameWithSubtasks, KafkaDeserializationSchema<T> deserializer, Properties kafkaProperties, long pollTimeout, MetricGroup subtaskMetricGroup, MetricGroup consumerMetricGroup, boolean useMetrics, long stopConsumingTimestamp) throws Exception {
    super(sourceContext, assignedPartitionsWithInitialOffsets, watermarksPeriodic, watermarksPunctuated, processingTimeProvider, autoWatermarkInterval, userCodeClassLoader, consumerMetricGroup, useMetrics);
    this.deserializer = deserializer;
    this.handover = new Handover();
    this.consumerThread = new KafkaConsumerThread(LOG, this.handover, kafkaProperties, this.unassignedPartitionsQueue, this.getFetcherName() + " for " + taskNameWithSubtasks, pollTimeout, useMetrics, consumerMetricGroup, subtaskMetricGroup);
    this.stopConsumingTimestamp = stopConsumingTimestamp;
  }

  @Override
  public void runFetchLoop() throws Exception {
    try {
      final Handover handover = this.handover;

      // kick off the actual Kafka consumer
      consumerThread.start();

      while (running) {
        // this blocks until we get the next records
        // it automatically re-throws exceptions encountered in the consumer thread
        final ConsumerRecords<byte[], byte[]> records = handover.pollNext();

        // get the records for each topic partition
        for (KafkaTopicPartitionState<TopicPartition> partition : subscribedPartitionStates()) {

          List<ConsumerRecord<byte[], byte[]>> partitionRecords =
              records.records(partition.getKafkaPartitionHandle());

          for (ConsumerRecord<byte[], byte[]> record : partitionRecords) {
            final T value = deserializer.deserialize(record);

            if (deserializer.isEndOfStream(value)) {
              // end of stream signaled
              running = false;
              break;
            }

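            // stop fetching and emitting once a record is past the stop timestamp;
            // 0 means "no end time", and a record exactly at the stop timestamp is still emitted
            if (stopConsumingTimestamp != 0 && record.timestamp() > stopConsumingTimestamp) {
              this.running = false;
              break;
            }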

            // emit the actual record. this also updates offset state atomically
            // and deals with timestamps and watermark generation
            emitRecord(value, partition, record.offset(), record);
          }
        }
      }
    } finally {
      // this signals the consumer thread that no more work is to be done
      consumerThread.shutdown();
    }

    // on a clean exit, wait for the runner thread
    try {
      consumerThread.join();
    } catch (InterruptedException e) {
      // may be the result of a wake-up interruption after an exception.
      // we ignore this here and only restore the interruption state
      Thread.currentThread().interrupt();
    }
  }

  protected void emitRecord(T record, KafkaTopicPartitionState<TopicPartition> partition, long offset, ConsumerRecord<?, ?> consumerRecord) throws Exception {
    this.emitRecordWithTimestamp(record, partition, offset, consumerRecord.timestamp());
  }

  public void cancel() {
    this.running = false;
    this.handover.close();
    this.consumerThread.shutdown();
  }


  protected String getFetcherName() {
    return "Kafka Fetcher";
  }

  public TopicPartition createKafkaPartitionHandle(KafkaTopicPartition partition) {
    return new TopicPartition(partition.getTopic(), partition.getPartition());
  }

  protected void doCommitInternalOffsetsToKafka(Map<KafkaTopicPartition, Long> offsets, @Nonnull KafkaCommitCallback commitCallback) throws Exception {
    List<KafkaTopicPartitionState<TopicPartition>> partitions = this.subscribedPartitionStates();
    Map<TopicPartition, OffsetAndMetadata> offsetsToCommit = new HashMap<>(partitions.size());

    for (KafkaTopicPartitionState<TopicPartition> partition : partitions) {
      Long lastProcessedOffset = offsets.get(partition.getKafkaTopicPartition());
      if (lastProcessedOffset != null) {
        Preconditions.checkState(lastProcessedOffset >= 0L, "Illegal offset value to commit");
        long offsetToCommit = lastProcessedOffset + 1L;
        offsetsToCommit.put(partition.getKafkaPartitionHandle(), new OffsetAndMetadata(offsetToCommit));
        partition.setCommittedOffset(offsetToCommit);
      }
    }

    this.consumerThread.setOffsetsToCommit(offsetsToCommit, commitCallback);
  }


}

Test class

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.SpecificFlinkKafkaConsumer;

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Properties;

/**
 * @author lee
 */
public class FlinkKafkaConsumerWithTimestampTest {
  private static ThreadLocal<DateFormat> pattern = ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"));

  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);

    Properties prop = new Properties();
    prop.put("bootstrap.servers", "kafka:23092");
    prop.put("group.id", "flink-streaming-job");

    long startTimestamp = pattern.get().parse("2020-10-20 11:14:00").getTime();
    long stopTimestamp = pattern.get().parse("2020-10-20 11:14:10").getTime();

    SpecificFlinkKafkaConsumer<String> consumer = new SpecificFlinkKafkaConsumer<String>("http_log", new SimpleStringSchema(), prop);
    consumer.setIntervalFromTimestamp(startTimestamp, stopTimestamp);

    DataStreamSource<String> dataStreamSource = env.addSource(consumer);
    dataStreamSource.print();
    env.execute();

  }
}
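
The test class parses the start and stop times with SimpleDateFormat, which interprets them in the JVM's default time zone. As a hedged alternative, the sketch below uses java.time with an explicit zone so the resulting epoch milliseconds do not depend on where the job runs. EpochMillisSketch and toEpochMillis are hypothetical names, and Asia/Shanghai is only an assumed zone for the example.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class EpochMillisSketch {

    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Interprets the text as local time in the given zone and returns epoch milliseconds.
    public static long toEpochMillis(String text, ZoneId zone) {
        return LocalDateTime.parse(text, FORMAT).atZone(zone).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("Asia/Shanghai"); // assumption: the zone the log times refer to
        long start = toEpochMillis("2020-10-20 11:14:00", zone);
        long stop = toEpochMillis("2020-10-20 11:14:10", zone);
        System.out.println(stop - start); // prints 10000
    }
}
```

The two values can then be passed to setIntervalFromTimestamp(start, stop) in the same way as in the test class.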

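Note in particular that, because the members involved are protected (or otherwise not public), the new Java classes must be placed in the same packages as Flink's own FlinkKafkaConsumer classes when adapting the source above.
That is, SpecificFlinkKafkaConsumer goes in org.apache.flink.streaming.connectors.kafka and SpecificKafkaFetcher in org.apache.flink.streaming.connectors.kafka.internal, matching the package declarations in the listings.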