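Abstract
This article introduces the Kafka High Level Consumer, Consumer Groups, Consumer Rebalance, and the principles behind the Low Level Consumer, along with their use cases and the Java API implementation.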
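How the High Level Consumer Works
The High Level Consumer API is built around the logical concept of the Consumer Group. It hides offset management for every partition of every topic (the last offset of the consumer group is read from ZooKeeper automatically).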
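Notes on configuring consumers:
- Having more consumers than partitions is wasteful: Kafka does not allow concurrent consumption of a single partition within a consumer group, so the number of consumers should not exceed the number of partitions.
- With fewer consumers than partitions, one consumer handles several partitions. Balance the two counts carefully, otherwise data is drained unevenly across consumers. Ideally the partition count is an integer multiple of the consumer count, which makes the partition count an important choice: 24, for example, has many divisors, so it is easy to pick a matching consumer count.
- When a consumer reads from multiple partitions, ordering across partitions is not guaranteed. Kafka only guarantees order within a single partition; across partitions, the order depends on the order in which you read them.
- Adding or removing consumers, brokers, or partitions triggers a rebalance, after which the partitions assigned to each consumer may change.
- The High Level interface blocks when no data is available.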
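The advice about matching consumer and partition counts can be checked with a little arithmetic. The sketch below assumes a plain even split (each consumer gets the quotient, and the first few consumers absorb the remainder); it is an illustration of the reasoning, not Kafka's actual assignor code, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: an even split of partitions across consumers, not Kafka's assignor. */
final class PartitionSplitSketch {

    /** Returns how many partitions each consumer would own under an even split. */
    static List<Integer> partitionsPerConsumer(int partitions, int consumers) {
        if (partitions < 0 || consumers <= 0) {
            throw new IllegalArgumentException("partitions must be >= 0 and consumers > 0");
        }
        int base = partitions / consumers;   // every consumer gets at least this many
        int extra = partitions % consumers;  // the first `extra` consumers get one more
        List<Integer> counts = new ArrayList<>();
        for (int i = 0; i < consumers; i++) {
            counts.add(base + (i < extra ? 1 : 0));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(partitionsPerConsumer(24, 8)); // [3, 3, 3, 3, 3, 3, 3, 3]
        System.out.println(partitionsPerConsumer(24, 5)); // [5, 5, 5, 5, 4]
        System.out.println(partitionsPerConsumer(3, 5));  // [1, 1, 1, 0, 0]
    }
}
```

With 24 partitions, any consumer count that divides 24 gives an even load; 5 consumers leave one consumer with less work, and 5 consumers on 3 partitions leave two consumers idle.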
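Detecting Consumer Failures
After a consumer subscribes to topics, it automatically joins the consumer group when poll(long) is called. The poll method is designed to ensure consumer liveness: as long as poll keeps being called, the consumer stays active in the group, and during this time it periodically sends heartbeats to the server. If the consumer fails, or sends no heartbeat for longer than the configured session.timeout.ms, it is considered dead and the partitions it was consuming are reassigned to other consumers.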
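As a rough sketch of how the liveness settings might be assembled, the helper below builds the two related properties and rejects a heartbeat interval that is not shorter than the session timeout. The class name, method name, and the values 10000 and 3000 are illustrative assumptions; session.timeout.ms and heartbeat.interval.ms are real consumer settings.

```java
import java.util.Properties;

/** Hypothetical helper: builds the heartbeat-related consumer settings. */
final class LivenessConfigSketch {

    static Properties livenessProps(int sessionTimeoutMs, int heartbeatIntervalMs) {
        // The heartbeat interval has to be shorter than the session timeout,
        // otherwise the consumer would be declared dead between two heartbeats.
        if (heartbeatIntervalMs <= 0 || heartbeatIntervalMs >= sessionTimeoutMs) {
            throw new IllegalArgumentException(
                "heartbeat.interval.ms must be positive and below session.timeout.ms");
        }
        Properties props = new Properties();
        props.put("session.timeout.ms", Integer.toString(sessionTimeoutMs));
        props.put("heartbeat.interval.ms", Integer.toString(heartbeatIntervalMs));
        return props;
    }

    public static void main(String[] args) {
        Properties props = livenessProps(10000, 3000); // illustrative values
        System.out.println(props.getProperty("session.timeout.ms"));
    }
}
```

The resulting entries can be merged into the same Properties object that is passed to the KafkaConsumer constructor.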
Two parameters can be configured to control how the consumer processes data:
- max.poll.interval.ms: By increasing the interval between expected polls, you can give the consumer more time to handle a batch of records returned from poll(long). The drawback is that increasing this value may delay a group rebalance since the consumer will only join the rebalance inside the call to poll. You can use this setting to bound the time to finish a rebalance, but you risk slower progress if the consumer cannot actually call poll often enough.
In other words, raising this setting lengthens the allowed gap between polls so the consumer has enough time to process the data it has already fetched; the downside is that it can delay group rebalancing and reduce throughput.
- max.poll.records: Use this setting to limit the total records returned from a single call to poll. This can make it easier to predict the maximum that must be handled within each poll interval. By tuning this value, you may be able to reduce the poll interval, which will reduce the impact of group rebalancing.
In short, max.poll.records sets the maximum number of records returned by each poll.
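The two settings interact: the worst-case time to process one batch has to stay below max.poll.interval.ms. A minimal sketch of that arithmetic follows, assuming a fixed worst-case cost per record; the class name, method names, and the sample numbers are hypothetical, not measured values.

```java
/** Illustrative arithmetic only: relates max.poll.records to max.poll.interval.ms. */
final class PollBudgetSketch {

    /** True if a full batch can be processed before the next poll is due. */
    static boolean fitsWithinInterval(int maxPollRecords, long worstCaseMsPerRecord,
                                      long maxPollIntervalMs) {
        return (long) maxPollRecords * worstCaseMsPerRecord < maxPollIntervalMs;
    }

    /** Largest batch size whose worst-case processing time stays within the interval. */
    static long largestSafeBatch(long maxPollIntervalMs, long worstCaseMsPerRecord) {
        if (worstCaseMsPerRecord <= 0) {
            throw new IllegalArgumentException("worstCaseMsPerRecord must be positive");
        }
        return maxPollIntervalMs / worstCaseMsPerRecord;
    }

    public static void main(String[] args) {
        // Assumed numbers: 300 s between polls, up to 2 s of work per record.
        System.out.println(fitsWithinInterval(500, 2000, 300000)); // false
        System.out.println(largestSafeBatch(300000, 2000));        // 150
    }
}
```

If the check fails, either lower max.poll.records or raise max.poll.interval.ms, keeping in mind the rebalance delay described above.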
Consumer API
1: Add the Maven dependency
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.10.1.0</version>
</dependency>
2.1: Consumer API with automatic offset committing (Kafka manages offsets automatically)
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
// Commit offsets automatically, once per second
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
while (true) {
    // Wait up to 100 ms for new records
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
    }
}
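The problem with automatic committing: after records are fetched, their offsets may be committed to the broker automatically, but if the subsequent processing of those records fails, the data is lost.
2.2: Manual Offset Control (committing offsets manually)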
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
// Disable automatic offset commits
props.put("enable.auto.commit", "false");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("foo", "bar"));
final int minBatchSize = 200;
List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        buffer.add(record);
    }
    if (buffer.size() >= minBatchSize) {
        // insertIntoDb stands for your own persistence logic
        insertIntoDb(buffer);
        // Commit synchronously, only after the batch has been persisted
        consumer.commitSync();
        buffer.clear();
    }
}
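The problem with manual committing: if an exception occurs after the data has been processed, for example after insertIntoDb(buffer), but before consumer.commitSync() completes, the data has been processed while the offset commit has failed. The next poll after a restart fetches the same records again, so data is duplicated.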
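The two failure modes (loss when the offset is committed before processing, duplication when it is committed after) can be reproduced with a small simulation. This is a simplified model under stated assumptions: offsets are committed per record rather than periodically or per batch, exactly one crash occurs, and no Kafka client is involved. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of commit ordering around a single crash; not real Kafka client code. */
final class CommitOrderSketch {

    /**
     * Processes offsets 0..logSize-1, crashing once at crashOffset and restarting
     * from the last committed offset. Returns the offsets whose processing completed.
     */
    static List<Long> processedOffsets(int logSize, long crashOffset, boolean commitBeforeProcessing) {
        List<Long> processed = new ArrayList<>();
        long committed = 0L;   // offset the consumer resumes from after a restart
        long position = 0L;    // offset currently being handled
        boolean crashed = false;
        while (position < logSize) {
            if (commitBeforeProcessing) {
                committed = position + 1;      // offset committed before the work is done
            }
            if (!crashed && position == crashOffset) {
                crashed = true;
                if (!commitBeforeProcessing) {
                    processed.add(position);   // work finished, but the commit never happens
                }
                position = committed;          // restart from the last committed offset
                continue;
            }
            processed.add(position);
            if (!commitBeforeProcessing) {
                committed = position + 1;      // commit only after the work is done
            }
            position++;
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(processedOffsets(5, 2, true));  // [0, 1, 3, 4]       offset 2 is lost
        System.out.println(processedOffsets(5, 2, false)); // [0, 1, 2, 2, 3, 4] offset 2 is repeated
    }
}
```

Committing first gives at-most-once behavior and committing last gives at-least-once behavior; which one is acceptable depends on whether the downstream system can tolerate gaps or repeats.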
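2.3: Fine-grained manual offset control (committing per partition)
import java.util.Collections;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Assumes `consumer` is configured as in 2.2 and `running` is a boolean flag set elsewhere
try {
    while (running) {
        ConsumerRecords<String, String> records = consumer.poll(Long.MAX_VALUE);
        // Iterate over every partition that returned records
        for (TopicPartition partition : records.partitions()) {
            // Fetch the records that belong to this partition
            List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
            // Process each record of the partition
            for (ConsumerRecord<String, String> record : partitionRecords) {
                System.out.println(record.offset() + ": " + record.value());
            }
            // Offset of the last record processed in this partition
            long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
            // Commit per partition; the committed offset is the next offset to read
            consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
        }
    }
} finally {
    consumer.close();
}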
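The `lastOffset + 1` in the commit call is easy to get wrong: the committed offset is the position of the next record to read, not of the last record processed. The sketch below isolates that rule using plain collections, with partition names as strings standing in for TopicPartition; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: derives the offsets to commit from the offsets already processed. */
final class NextOffsetSketch {

    /** For each partition, the offset to commit is the last processed offset plus one. */
    static Map<String, Long> offsetsToCommit(Map<String, List<Long>> processedByPartition) {
        Map<String, Long> toCommit = new LinkedHashMap<>();
        for (Map.Entry<String, List<Long>> entry : processedByPartition.entrySet()) {
            List<Long> offsets = entry.getValue();
            if (offsets.isEmpty()) {
                continue; // nothing processed, so nothing to commit for this partition
            }
            long lastOffset = offsets.get(offsets.size() - 1);
            toCommit.put(entry.getKey(), lastOffset + 1);
        }
        return toCommit;
    }
}
```

For example, after processing offsets 5, 6 and 7 of a partition, the value to commit is 8, so a restarted consumer resumes at the first unprocessed record.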
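2.4: Consuming data from specific partitions
// Manual assignment: the consumer reads exactly these partitions and takes no part in group rebalancing
String topic = "test";
TopicPartition partition0 = new TopicPartition(topic, 0);
TopicPartition partition1 = new TopicPartition(topic, 1);
consumer.assign(Arrays.asList(partition0, partition1));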