Preface
Since version 0.10, Kafka has supported starting consumption from a given timestamp: KafkaConsumer.offsetsForTimes resolves the timestamp to the corresponding offset, so under the hood consumption is still driven by offsets.
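As a quick reference (not part of the Flink change described below), here is a minimal sketch of that mechanism using the plain Kafka client; the broker address, topic name, group id, and timestamp are placeholders:
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class OffsetsForTimesSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "demo-group");              // placeholder group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("demo-topic", 0); // placeholder topic
            consumer.assign(Collections.singletonList(tp));
            // resolve the timestamp to the earliest offset whose record timestamp is >= it
            Map<TopicPartition, OffsetAndTimestamp> result =
                    consumer.offsetsForTimes(Collections.singletonMap(tp, 1603163640000L));
            OffsetAndTimestamp offsetAndTimestamp = result.get(tp);
            if (offsetAndTimestamp != null) {
                consumer.seek(tp, offsetAndTimestamp.offset()); // subsequent poll() calls read from here
            }
        }
    }
}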
Correspondingly, starting with FlinkKafkaConsumer010, the Flink Kafka source interface also supports specifying a start time for consumption:
FlinkKafkaConsumerBase<T> setStartFromTimestamp(long startupOffsetsTimestamp)
A business requirement called for consuming Kafka data from a specific time range in Flink. With the API Flink currently provides, only the start time can be specified; once started, the consumer keeps reading to the end of the stream, "a head with no tail". To meet the requirement, consumption also has to stop at a given point in time.
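For contrast with the change described below, here is a minimal sketch of the start-only API as it stands; the broker address, topic name, group id, and timestamp are again placeholders:
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import java.util.Properties;

public class StartFromTimestampExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "demo-group");              // placeholder group id
        FlinkKafkaConsumer<String> consumer =
                new FlinkKafkaConsumer<>("demo-topic", new SimpleStringSchema(), props);
        // start from the offsets whose record timestamps are >= the given epoch millis;
        // after that the consumer keeps reading indefinitely; there is no built-in stop time
        consumer.setStartFromTimestamp(1603163640000L);
        env.addSource(consumer).print();
        env.execute("start-from-timestamp demo");
    }
}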
1. Analysis
Reading the FlinkKafkaConsumer source shows that consumption is started in FlinkKafkaConsumerBase.run, which creates a KafkaFetcher. The KafkaFetcher starts the thread that actually consumes the Kafka topic (KafkaConsumerThread) and receives its records through a synchronous Handover, a shared object the two threads use to exchange data. The fetch-and-emit logic lives in KafkaFetcher.runFetchLoop(); the KafkaConsumerThread and Handover sources are worth reading in detail, and a simplified sketch of the hand-off pattern follows.
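The following is not Flink's actual Handover class, only a simplified sketch of the same pattern with made-up names: a dedicated consumer thread polls batches and passes each one to the fetch loop through a one-slot blocking exchange.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class HandoverSketch {
    public static void main(String[] args) throws Exception {
        // one-slot exchange standing in for Flink's Handover
        BlockingQueue<String> handover = new ArrayBlockingQueue<>(1);

        Thread consumerThread = new Thread(() -> {
            try {
                for (int i = 0; i < 3; i++) {
                    // stands in for KafkaConsumer.poll(); blocks until the previous batch is taken
                    handover.put("batch-" + i);
                }
                handover.put("END");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumerThread.start();

        // stands in for runFetchLoop(): block for the next batch, then emit it downstream
        while (true) {
            String batch = handover.take();
            if ("END".equals(batch)) {
                break;
            }
            System.out.println("emit " + batch);
        }
        consumerThread.join();
    }
}
The original KafkaFetcher.runFetchLoop() then looks like this: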
@Override
public void runFetchLoop() throws Exception {
try {
final Handover handover = this.handover;
// kick off the actual Kafka consumer
consumerThread.start();
while (running) {
// this blocks until we get the next records
// it automatically re-throws exceptions encountered in the consumer thread
final ConsumerRecords<byte[], byte[]> records = handover.pollNext();
// get the records for each topic partition
for (KafkaTopicPartitionState<TopicPartition> partition : subscribedPartitionStates()) {
List<ConsumerRecord<byte[], byte[]>> partitionRecords =
records.records(partition.getKafkaPartitionHandle());
for (ConsumerRecord<byte[], byte[]> record : partitionRecords) {
final T value = deserializer.deserialize(record);
if (deserializer.isEndOfStream(value)) {
// end of stream signaled
running = false;
break;
}
// emit the actual record. this also updates offset state atomically
// and deals with timestamps and watermark generation
emitRecord(value, partition, record.offset(), record);
}
}
}
}
finally {
// this signals the consumer thread that no more work is to be done
consumerThread.shutdown();
    }
}
From this source it is clear that KafkaFetcher pulls records in a loop inside runFetchLoop(), and the loop only ends when the end of the stream is signaled. So to stop the KafkaConsumer manually, all that is needed is one more loop-exit condition.
2. Implementation
a. To support stopping consumption, my approach is to start with KafkaFetcher and add an end-timestamp parameter, kept as a member field:
private long stopConsumingTimestamp;
b. Then rework the core of the fetch logic, runFetchLoop(), adding one more loop-exit condition: once a fetched record's timestamp passes the given end timestamp, stop fetching and shut down the consumer thread:
//stop fetching and emitting
if (record.timestamp() > stopConsumingTimestamp && stopConsumingTimestamp != 0) {
this.running = false;
break;
}
c. In FlinkKafkaConsumer, add the same end-timestamp parameter and pass it through when the fetcher is created. Also add a setIntervalFromTimestamp() method that specifies the time range: it stores stopConsumingTimestamp and reuses setStartFromTimestamp() to set the timestamp consumption starts from.
public FlinkKafkaConsumerBase<T> setIntervalFromTimestamp(long startupOffsetsTimestamp, long stopConsumingTimestamp) {
setStopConsumingTimestamp(stopConsumingTimestamp);
if (startupOffsetsTimestamp > stopConsumingTimestamp) {
        throw new IllegalArgumentException("The start consuming time " + startupOffsetsTimestamp + " exceeds the end time " + stopConsumingTimestamp);
} else {
return super.setStartFromTimestamp(startupOffsetsTimestamp);
}
}
The change is not complicated; in short, it reworks two classes, FlinkKafkaConsumer and KafkaFetcher.
The complete code:
SpecificFlinkKafkaConsumer.java
package org.apache.flink.streaming.connectors.kafka;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.operators.StreamingRuntimeContext;
import org.apache.flink.streaming.connectors.kafka.config.OffsetCommitMode;
import org.apache.flink.streaming.connectors.kafka.internal.SpecificKafkaFetcher;
import org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;
import org.apache.flink.util.SerializedValue;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Pattern;
/**
* @author lee
*/
public class SpecificFlinkKafkaConsumer<T> extends FlinkKafkaConsumer<T> {
private long stopConsumingTimestamp;
public SpecificFlinkKafkaConsumer(String topic, DeserializationSchema<T> valueDeserializer, Properties props) {
super(topic, valueDeserializer, props);
}
public SpecificFlinkKafkaConsumer(String topic, KafkaDeserializationSchema<T> deserializer, Properties props) {
super(topic, deserializer, props);
}
public SpecificFlinkKafkaConsumer(List<String> topics, DeserializationSchema<T> deserializer, Properties props) {
super(topics, deserializer, props);
}
public SpecificFlinkKafkaConsumer(List<String> topics, KafkaDeserializationSchema<T> deserializer, Properties props) {
super(topics, deserializer, props);
}
public SpecificFlinkKafkaConsumer(Pattern subscriptionPattern, DeserializationSchema<T> valueDeserializer, Properties props) {
super(subscriptionPattern, valueDeserializer, props);
}
public SpecificFlinkKafkaConsumer(Pattern subscriptionPattern, KafkaDeserializationSchema<T> deserializer, Properties props) {
super(subscriptionPattern, deserializer, props);
}
private void setStopConsumingTimestamp(long stopConsumingTimestamp) {
this.stopConsumingTimestamp = stopConsumingTimestamp;
}
    // specify the time range to consume: from the start timestamp to the stop timestamp (epoch milliseconds)
public FlinkKafkaConsumerBase<T> setIntervalFromTimestamp(long startupOffsetsTimestamp, long stopConsumingTimestamp) {
setStopConsumingTimestamp(stopConsumingTimestamp);
if (startupOffsetsTimestamp > stopConsumingTimestamp) {
            throw new IllegalArgumentException("The start consuming time " + startupOffsetsTimestamp + " exceeds the end time " + stopConsumingTimestamp);
} else {
return super.setStartFromTimestamp(startupOffsetsTimestamp);
}
}
@Override
    protected AbstractFetcher<T, ?> createFetcher(
            SourceContext<T> sourceContext,
            Map<KafkaTopicPartition, Long> assignedPartitionsWithInitialOffsets,
            SerializedValue<AssignerWithPeriodicWatermarks<T>> watermarksPeriodic,
            SerializedValue<AssignerWithPunctuatedWatermarks<T>> watermarksPunctuated,
            StreamingRuntimeContext runtimeContext,
            OffsetCommitMode offsetCommitMode,
            MetricGroup consumerMetricGroup,
            boolean useMetrics) throws Exception {
        adjustAutoCommitConfig(this.properties, offsetCommitMode);
        return new SpecificKafkaFetcher<>(
                sourceContext,
                assignedPartitionsWithInitialOffsets,
                watermarksPeriodic,
                watermarksPunctuated,
                runtimeContext.getProcessingTimeService(),
                runtimeContext.getExecutionConfig().getAutoWatermarkInterval(),
                runtimeContext.getUserCodeClassLoader(),
                runtimeContext.getTaskNameWithSubtasks(),
                this.deserializer,
                this.properties,
                this.pollTimeout,
                runtimeContext.getMetricGroup(),
                consumerMetricGroup,
                useMetrics,
                stopConsumingTimestamp);
}
}
SpecificKafkaFetcher.java
package org.apache.flink.streaming.connectors.kafka.internal;
import org.apache.flink.annotation.Internal;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaCommitCallback;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartitionState;
import org.apache.flink.streaming.runtime.tasks.ProcessingTimeService;
import org.apache.flink.util.Preconditions;
import org.apache.flink.util.SerializedValue;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import javax.annotation.Nonnull;
/**
* @author lee
*/
@Internal
public class SpecificKafkaFetcher<T> extends AbstractFetcher<T, TopicPartition> {
    private static final Logger LOG = LoggerFactory.getLogger(SpecificKafkaFetcher.class);
private final KafkaDeserializationSchema<T> deserializer;
private final Handover handover;
private final KafkaConsumerThread consumerThread;
private volatile boolean running = true;
public long stopConsumingTimestamp;
    public SpecificKafkaFetcher(
            SourceFunction.SourceContext<T> sourceContext,
            Map<KafkaTopicPartition, Long> assignedPartitionsWithInitialOffsets,
            SerializedValue<AssignerWithPeriodicWatermarks<T>> watermarksPeriodic,
            SerializedValue<AssignerWithPunctuatedWatermarks<T>> watermarksPunctuated,
            ProcessingTimeService processingTimeProvider,
            long autoWatermarkInterval,
            ClassLoader userCodeClassLoader,
            String taskNameWithSubtasks,
            KafkaDeserializationSchema<T> deserializer,
            Properties kafkaProperties,
            long pollTimeout,
            MetricGroup subtaskMetricGroup,
            MetricGroup consumerMetricGroup,
            boolean useMetrics,
            long stopConsumingTimestamp) throws Exception {
        super(sourceContext, assignedPartitionsWithInitialOffsets, watermarksPeriodic, watermarksPunctuated,
                processingTimeProvider, autoWatermarkInterval, userCodeClassLoader, consumerMetricGroup, useMetrics);
        this.deserializer = deserializer;
        this.handover = new Handover();
        this.consumerThread = new KafkaConsumerThread(
                LOG,
                this.handover,
                kafkaProperties,
                this.unassignedPartitionsQueue,
                this.getFetcherName() + " for " + taskNameWithSubtasks,
                pollTimeout,
                useMetrics,
                consumerMetricGroup,
                subtaskMetricGroup);
        this.stopConsumingTimestamp = stopConsumingTimestamp;
    }
    @Override
    public void runFetchLoop() throws Exception {
try {
final Handover handover = this.handover;
// kick off the actual Kafka consumer
consumerThread.start();
while (running) {
// this blocks until we get the next records
// it automatically re-throws exceptions encountered in the consumer thread
final ConsumerRecords<byte[], byte[]> records = handover.pollNext();
// get the records for each topic partition
for (KafkaTopicPartitionState<TopicPartition> partition : subscribedPartitionStates()) {
List<ConsumerRecord<byte[], byte[]>> partitionRecords =
records.records(partition.getKafkaPartitionHandle());
for (ConsumerRecord<byte[], byte[]> record : partitionRecords) {
final T value = deserializer.deserialize(record);
if (deserializer.isEndOfStream(value)) {
// end of stream signaled
running = false;
break;
}
//stop fetching and emitting
if (record.timestamp() > stopConsumingTimestamp && stopConsumingTimestamp != 0) {
this.running = false;
break;
}
// emit the actual record. this also updates offset state atomically
// and deals with timestamps and watermark generation
emitRecord(value, partition, record.offset(), record);
}
}
}
} finally {
// this signals the consumer thread that no more work is to be done
consumerThread.shutdown();
}
// on a clean exit, wait for the runner thread
try {
consumerThread.join();
} catch (InterruptedException e) {
// may be the result of a wake-up interruption after an exception.
// we ignore this here and only restore the interruption state
Thread.currentThread().interrupt();
}
}
protected void emitRecord(T record, KafkaTopicPartitionState<TopicPartition> partition, long offset, ConsumerRecord<?, ?> consumerRecord) throws Exception {
this.emitRecordWithTimestamp(record, partition, offset, consumerRecord.timestamp());
}
public void cancel() {
this.running = false;
this.handover.close();
this.consumerThread.shutdown();
}
protected String getFetcherName() {
return "Kafka Fetcher";
}
public TopicPartition createKafkaPartitionHandle(KafkaTopicPartition partition) {
return new TopicPartition(partition.getTopic(), partition.getPartition());
}
    @Override
    protected void doCommitInternalOffsetsToKafka(Map<KafkaTopicPartition, Long> offsets, @Nonnull KafkaCommitCallback commitCallback) throws Exception {
        List<KafkaTopicPartitionState<TopicPartition>> partitions = this.subscribedPartitionStates();
        Map<TopicPartition, OffsetAndMetadata> offsetsToCommit = new HashMap<>(partitions.size());
        for (KafkaTopicPartitionState<TopicPartition> partition : partitions) {
            Long lastProcessedOffset = offsets.get(partition.getKafkaTopicPartition());
            if (lastProcessedOffset != null) {
                Preconditions.checkState(lastProcessedOffset >= 0L, "Illegal offset value to commit");
                // the offset committed to Kafka is the next record to read, i.e. last processed offset + 1
                long offsetToCommit = lastProcessedOffset + 1L;
                offsetsToCommit.put(partition.getKafkaPartitionHandle(), new OffsetAndMetadata(offsetToCommit));
                partition.setCommittedOffset(offsetToCommit);
            }
        }
        this.consumerThread.setOffsetsToCommit(offsetsToCommit, commitCallback);
    }
}
Test class
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.SpecificFlinkKafkaConsumer;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Properties;
/**
* @author lee
*/
public class FlinkKafkaConsumerWithTimestampTest {
private static ThreadLocal<DateFormat> pattern = ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"));
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
Properties prop = new Properties();
prop.put("bootstrap.servers", "kafka:23092");
prop.put("group.id", "flink-streaming-job");
long startTimestamp = pattern.get().parse("2020-10-20 11:14:00").getTime();
long stopTimestamp = pattern.get().parse("2020-10-20 11:14:10").getTime();
SpecificFlinkKafkaConsumer<String> consumer = new SpecificFlinkKafkaConsumer<String>("http_log", new SimpleStringSchema(), prop);
consumer.setIntervalFromTimestamp(startTimestamp, stopTimestamp);
DataStreamSource<String> dataStreamSource = env.addSource(consumer);
dataStreamSource.print();
env.execute();
}
}
One important caveat: because the relevant methods are protected, the new classes must live in the same packages as Flink's own FlinkKafkaConsumer classes, i.e. SpecificFlinkKafkaConsumer in org.apache.flink.streaming.connectors.kafka and SpecificKafkaFetcher in org.apache.flink.streaming.connectors.kafka.internal.
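For concreteness, assuming a standard Maven or Gradle source layout, the two files would therefore sit at:
src/main/java/org/apache/flink/streaming/connectors/kafka/SpecificFlinkKafkaConsumer.java
src/main/java/org/apache/flink/streaming/connectors/kafka/internal/SpecificKafkaFetcher.java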