Table of Contents

  • Preface
  • 1. The Flow of Leader-Follower Message Data Synchronization
  • 2. Source Code Analysis of Leader-Follower Message Data Synchronization
  • 2.1 Publishing Metadata Changes
  • 2.2 Consuming and Applying the Changed Metadata
  • 2.3 Message Data Synchronization Between Leader and Follower Replicas


Preface

At the end of Kafka 3.0 Source Notes (9) - Leader-Follower Synchronization of Server-Side Metadata, I mentioned that once metadata leader-follower synchronization completes, a metadata change only takes effect on the cluster after the broker module has observed and processed it. Using the create-Topic feature as the starting point, this article analyzes that process through the scenario of leader-follower synchronization of partition replicas' message data. Seen against Kafka's overall design, the complete flow of message production and consumption after a Topic is created is shown in the figure below:

[Figure: end-to-end message production and consumption flow after a Topic is created]

1. The Flow of Leader-Follower Message Data Synchronization

[Figure: the four-stage process of a follower replica synchronizing message data and the HW from the leader via Fetch requests]

The figure above shows how, after a message is written, a follower replica synchronizes both the message data and the HW (high watermark) from the leader via Fetch requests. The process falls roughly into four stages:

  1. Initial state: when the leader replica and all follower replicas of a partition hold exactly the same messages, HW and LEO point to the same position. In the example diagram, both the leader and the follower have stored the record at Offset=0, and both HW and LEO point to the not-yet-written position Offset=1
  2. Message write: the node hosting the leader replica receives a Produce request and appends the message to its local partition replica, so the leader's LEO now points to Offset=2. At the same time the leader also tries to update the HW of its local log, although at this stage the HW is not actually advanced. The HW update algorithm is as follows, and every later mention of "trying to update the local log's HW" refers to this procedure (see the sketch after this list):
    1. Iterate over remoteReplicasMap, which holds the partition's remote replicas, and take the smallest LEO among the replicas that are in the ISR or that are lagging but still catching up with the leader (as determined by the replica.lag.time.max.ms config) as the new HW candidate, i.e. new HW = min(LEOs). Note that this algorithm effectively imposes a strong-consistency requirement on leader-follower message synchronization: a message only becomes externally visible and consumable once every active partition replica has stored it
    2. To keep the partition HW monotonically non-decreasing, compare the leader's local old HW with the new HW: if old HW < new HW, advance the partition HW to new HW, otherwise leave it unchanged
  3. First Fetch round trip: the Fetcher thread on the follower node periodically sends Fetch requests to the leader to synchronize messages, carrying the local replica's LEO=1. On receiving the request, the leader node updates that follower's state in the target partition's remoteReplicasMap and tries to update the local log's HW. As long as this follower is not the last one to come and fetch the new message, the leader does not advance its local HW and simply returns the message records. When the follower processes the Fetch response, it only appends the messages to its local log and moves its LEO to Offset=2
  4. Second Fetch round trip: similar to the first one, except that the Fetcher thread now carries the local replica's LEO=2. Assuming every follower replica has stored the new message by now, the leader successfully advances its local HW to Offset=2 when it tries to update the local log's HW, and returns the current HW in the Fetch response. The follower then updates its local HW to Offset=2 according to the leader's HW, finally completing HW synchronization
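
To make the algorithm concrete, below is a minimal self-contained Scala sketch of the leader-side HW computation described above (an illustration only, not the actual Partition.scala code; the replica state type, parameter names and the default lag value are simplified assumptions):

// Minimal sketch of the leader-side HW update (illustration, not the real Partition code)
final case class ReplicaState(brokerId: Int, leo: Long, lastCaughtUpTimeMs: Long)

def maybeNewHighWatermark(leaderLeo: Long,
                          oldHw: Long,
                          isr: Set[Int],
                          remoteReplicas: Iterable[ReplicaState],
                          nowMs: Long,
                          replicaLagTimeMaxMs: Long = 30000L): Option[Long] = {
  // The candidate starts from the leader's own LEO ...
  var newHw = leaderLeo
  // ... and is lowered to the smallest LEO of every replica that is in the ISR
  // or is lagging but still considered to be catching up with the leader
  remoteReplicas.foreach { r =>
    val catchingUp = nowMs - r.lastCaughtUpTimeMs <= replicaLagTimeMaxMs
    if (r.leo < newHw && (catchingUp || isr.contains(r.brokerId)))
      newHw = r.leo
  }
  // The partition HW never moves backwards: only advance when newHw > oldHw
  if (newHw > oldHw) Some(newHw) else None
}

// Second Fetch round trip of the example: leader LEO=2, both followers report LEO=2, old HW=1 => HW advances to 2
val followers = Seq(ReplicaState(1, leo = 2, lastCaughtUpTimeMs = 100L),
                    ReplicaState(2, leo = 2, lastCaughtUpTimeMs = 100L))
println(maybeNewHighWatermark(leaderLeo = 2, oldHw = 1, isr = Set(1, 2), remoteReplicas = followers, nowMs = 120L))
// Some(2)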

2. Source Code Analysis of Leader-Follower Message Data Synchronization

After a new Topic is created, reaching the point where message data is synchronized between the leader and follower replicas goes roughly through the following stages:

  1. Publishing the metadata change
  2. Consuming and applying the changed metadata
  3. Synchronizing message data between the leader and follower replicas

[Figure: the three stages from metadata change publication to leader-follower message data synchronization]

2.1 Publishing Metadata Changes

  1. During BrokerServer startup, KafkaRaftManager.scala#register() is called to register the BrokerMetadataListener with KafkaRaftClient; whenever the HW of the metadata partition advances, BrokerMetadataListener.scala#handleCommit() is called back to notify the listener. Its source is shown below, and the key processing is easy to spot:
    1. Create a HandleCommitsEvent object, an asynchronous event that wraps the metadata records; when the event is processed, HandleCommitsEvent#run() is executed
    2. Call KafkaEventQueue.java#append() to put the event onto the asynchronous queue. For how this event queue works, see Kafka 3.0 Source Notes (8) - How the Cluster Leader Handles CreateTopics Requests; it is not repeated here (a minimal sketch of the idea follows the code below)
override def handleCommit(reader: BatchReader[ApiMessageAndVersion]): Unit =
 eventQueue.append(new HandleCommitsEvent(reader))
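
For readers who do not want to jump to that article, the sketch below captures the essence of such a single-threaded event queue (simplified and with assumed names; it is not the real KafkaEventQueue):

import java.util.concurrent.LinkedBlockingQueue

// Simplified single-threaded event queue (assumed names, illustration only)
trait Event { def run(): Unit }

class SimpleEventQueue {
  private val queue = new LinkedBlockingQueue[Event]()
  private val worker = new Thread(() => {
    // A single worker drains the queue, so events are handled one at a time, in order
    while (true) queue.take().run()
  }, "event-queue-worker")
  worker.setDaemon(true)
  worker.start()

  // append() only enqueues; the caller (here the Raft callback thread) returns immediately
  def append(event: Event): Unit = queue.put(event)
}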
  2. HandleCommitsEvent#run() is fairly concise; its core logic is:
    1. Call BrokerMetadataListener.scala#loadBatches() to parse the metadata records and replay them into the MetadataDelta data structure
    2. Call BrokerMetadataListener.scala#publish() to publish the metadata through the metadata publisher
class HandleCommitsEvent(reader: BatchReader[ApiMessageAndVersion])
   extends EventQueue.FailureLoggingEvent(log) {
 override def run(): Unit = {
   val results = try {
     val loadResults = loadBatches(_delta, reader)
     if (isDebugEnabled) {
       debug(s"Loaded new commits: ${loadResults}")
     }
     loadResults
   } finally {
     reader.close()
   }
   _publisher.foreach(publish(_, results.highestMetadataOffset))

   snapshotter.foreach { snapshotter =>
     _bytesSinceLastSnapshot = _bytesSinceLastSnapshot + results.numBytes
     if (shouldSnapshot()) {
       if (snapshotter.maybeStartSnapshot(results.highestMetadataOffset,
         _highestEpoch,
         _highestTimestamp,
         _delta.apply())) {
         _bytesSinceLastSnapshot = 0L
       }
     }
   }
 }
}
  3. The internals of BrokerMetadataListener.scala#loadBatches() are straightforward: it iterates over the metadata record batches and calls MetadataDelta.java#replay() to load each record
private def loadBatches(delta: MetadataDelta,
                       iterator: util.Iterator[Batch[ApiMessageAndVersion]]): BatchLoadResults = {
 val startTimeNs = time.nanoseconds()
 var numBatches = 0
 var numRecords = 0
 var batch: Batch[ApiMessageAndVersion] = null
 var numBytes = 0L
 while (iterator.hasNext()) {
   batch = iterator.next()
   var index = 0
   batch.records().forEach { messageAndVersion =>
     if (isTraceEnabled) {
       trace("Metadata batch %d: processing [%d/%d]: %s.".format(batch.lastOffset, index + 1,
         batch.records().size(), messageAndVersion.message().toString()))
     }
     delta.replay(messageAndVersion.message())
     numRecords += 1
     index += 1
   }
   numBytes = numBytes + batch.sizeInBytes()
   metadataBatchSizeHist.update(batch.records().size())
   numBatches = numBatches + 1
 }
 val newHighestMetadataOffset = if (batch == null) {
   _highestMetadataOffset
 } else {
   _highestMetadataOffset = batch.lastOffset()
   _highestEpoch = batch.epoch()
   _highestTimestamp = batch.appendTimestamp()
   batch.lastOffset()
 }
 val endTimeNs = time.nanoseconds()
 val elapsedUs = TimeUnit.MICROSECONDS.convert(endTimeNs - startTimeNs, TimeUnit.NANOSECONDS)
 batchProcessingTimeHist.update(elapsedUs)
 BatchLoadResults(numBatches, numRecords, elapsedUs, numBytes, newHighestMetadataOffset)
}
  4. MetadataDelta.java#replay() dispatches on the type of the metadata record. Creating a Topic produces a record of type TOPIC_RECORD, which triggers the corresponding MetadataDelta.java#replay() overload
public void replay(ApiMessage record) {
     MetadataRecordType type = MetadataRecordType.fromId(record.apiKey());
     switch (type) {
         case REGISTER_BROKER_RECORD:
             replay((RegisterBrokerRecord) record);
             break;
         case UNREGISTER_BROKER_RECORD:
             replay((UnregisterBrokerRecord) record);
             break;
         case TOPIC_RECORD:
             replay((TopicRecord) record);
             break;
         case PARTITION_RECORD:
             replay((PartitionRecord) record);
             break;
         case CONFIG_RECORD:
             replay((ConfigRecord) record);
             break;
         case PARTITION_CHANGE_RECORD:
             replay((PartitionChangeRecord) record);
             break;
         case FENCE_BROKER_RECORD:
             replay((FenceBrokerRecord) record);
             break;
         case UNFENCE_BROKER_RECORD:
             replay((UnfenceBrokerRecord) record);
             break;
         case REMOVE_TOPIC_RECORD:
             replay((RemoveTopicRecord) record);
             break;
         case FEATURE_LEVEL_RECORD:
             replay((FeatureLevelRecord) record);
             break;
         case CLIENT_QUOTA_RECORD:
             replay((ClientQuotaRecord) record);
             break;
         case PRODUCER_IDS_RECORD:
             // Nothing to do.
             break;
         case REMOVE_FEATURE_LEVEL_RECORD:
             replay((RemoveFeatureLevelRecord) record);
             break;
         case BROKER_REGISTRATION_CHANGE_RECORD:
             replay((BrokerRegistrationChangeRecord) record);
             break;
         default:
             throw new RuntimeException("Unknown metadata record type " + type);
     }
 }
  5. The MetadataDelta.java#replay() overload that handles TOPIC_RECORD records is shown below; its core processing is a call to TopicsDelta.java#replay()
public void replay(TopicRecord record) {
     if (topicsDelta == null) topicsDelta = new TopicsDelta(image.topics());
     topicsDelta.replay(record);
 }
  6. At this point TopicsDelta.java#replay() merely stashes the Topic metadata carried by the record; it will actually be used later on
public void replay(TopicRecord record) {
     TopicDelta delta = new TopicDelta(
         new TopicImage(record.name(), record.topicId(), Collections.emptyMap()));
     changedTopics.put(record.topicId(), delta);
 }
  7. Now back to sub-step 2 of step 2 in this section: BrokerMetadataListener.scala#publish() publishes the metadata through the metadata publishers, which triggers BrokerMetadataPublisher.scala#publish(). With that, the publication of the metadata change is essentially finished (the code below also illustrates the image/delta pattern, sketched after it)
private def publish(publisher: MetadataPublisher,
                   newHighestMetadataOffset: Long): Unit = {
 val delta = _delta
 _image = _delta.apply()
 _delta = new MetadataDelta(_image)
 publisher.publish(newHighestMetadataOffset, delta, _image)
}
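
The swap of _delta and _image above reflects the image/delta pattern used throughout the KRaft metadata layer: an immutable image plus a delta of accumulated changes which, once applied, yields the next immutable image. A stripped-down sketch of the idea (simplified types, not the actual MetadataImage/MetadataDelta classes):

// Stripped-down image/delta sketch (simplified types, illustration only)
final case class TopicsImageSketch(partitionsByTopic: Map[String, Int])

final class TopicsDeltaSketch(base: TopicsImageSketch) {
  private var changed = Map.empty[String, Int]
  // replay() only accumulates changes; nothing is visible to readers yet
  def replay(topic: String, partitions: Int): Unit = changed += topic -> partitions
  // apply() produces the next immutable image; the old image stays valid for in-flight readers
  def apply(): TopicsImageSketch = TopicsImageSketch(base.partitionsByTopic ++ changed)
}

val image0 = TopicsImageSketch(Map.empty)
val delta  = new TopicsDeltaSketch(image0)
delta.replay("my-topic", 3)
val image1 = delta.apply()   // publish image1, then start a fresh delta based on it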

2.2 Consuming and Applying the Changed Metadata

  1. The implementation of BrokerMetadataPublisher.scala#publish() is shown below; the key processing is:
    1. If this is the first metadata change being published, BrokerMetadataPublisher.scala#initializeManagers() is called first to perform initialization. This step mostly starts scheduled tasks, including the periodic flushing and recovery checks for log files, the ISR shrink check in replica management, and the group coordinator's cleanup of expired consumer group information
    2. Then the metadata changes are computed and handled accordingly; this article follows the Topic change that triggers ReplicaManager.scala#applyDelta() as the example
override def publish(newHighestMetadataOffset: Long,
                    delta: MetadataDelta,
                    newImage: MetadataImage): Unit = {
 try {
   // Publish the new metadata image to the metadata cache.
   metadataCache.setImage(newImage)

   if (_firstPublish) {
     info(s"Publishing initial metadata at offset ${newHighestMetadataOffset}.")

     // If this is the first metadata update we are applying, initialize the managers
     // first (but after setting up the metadata cache).
     initializeManagers()
   } else if (isDebugEnabled) {
     debug(s"Publishing metadata at offset ${newHighestMetadataOffset}.")
   }

   // Apply feature deltas.
   Option(delta.featuresDelta()).foreach { featuresDelta =>
     featureCache.update(featuresDelta, newHighestMetadataOffset)
   }

   // Apply topic deltas.
   Option(delta.topicsDelta()).foreach { topicsDelta =>
     // Notify the replica manager about changes to topics.
     replicaManager.applyDelta(newImage, topicsDelta)

     // Handle the case where the old consumer offsets topic was deleted.
     if (topicsDelta.topicWasDeleted(Topic.GROUP_METADATA_TOPIC_NAME)) {
       topicsDelta.image().getTopic(Topic.GROUP_METADATA_TOPIC_NAME).partitions().entrySet().forEach {
         entry =>
           if (entry.getValue().leader == brokerId) {
             groupCoordinator.onResignation(entry.getKey(), Some(entry.getValue().leaderEpoch))
           }
       }
     }
     // Handle the case where we have new local leaders or followers for the consumer
     // offsets topic.
     getTopicDelta(Topic.GROUP_METADATA_TOPIC_NAME, newImage, delta).foreach { topicDelta =>
       val changes = topicDelta.localChanges(brokerId)

       changes.deletes.forEach { topicPartition =>
         groupCoordinator.onResignation(topicPartition.partition, None)
       }
       changes.leaders.forEach { (topicPartition, partitionInfo) =>
         groupCoordinator.onElection(topicPartition.partition, partitionInfo.partition.leaderEpoch)
       }
       changes.followers.forEach { (topicPartition, partitionInfo) =>
         groupCoordinator.onResignation(topicPartition.partition, Some(partitionInfo.partition.leaderEpoch))
       }
     }

     // Handle the case where the old transaction state topic was deleted.
     if (topicsDelta.topicWasDeleted(Topic.TRANSACTION_STATE_TOPIC_NAME)) {
       topicsDelta.image().getTopic(Topic.TRANSACTION_STATE_TOPIC_NAME).partitions().entrySet().forEach {
         entry =>
           if (entry.getValue().leader == brokerId) {
             txnCoordinator.onResignation(entry.getKey(), Some(entry.getValue().leaderEpoch))
           }
       }
     }
     // If the transaction state topic changed in a way that's relevant to this broker,
     // notify the transaction coordinator.
     getTopicDelta(Topic.TRANSACTION_STATE_TOPIC_NAME, newImage, delta).foreach { topicDelta =>
       val changes = topicDelta.localChanges(brokerId)

       changes.deletes.forEach { topicPartition =>
         txnCoordinator.onResignation(topicPartition.partition, None)
       }
       changes.leaders.forEach { (topicPartition, partitionInfo) =>
         txnCoordinator.onElection(topicPartition.partition, partitionInfo.partition.leaderEpoch)
       }
       changes.followers.forEach { (topicPartition, partitionInfo) =>
         txnCoordinator.onResignation(topicPartition.partition, Some(partitionInfo.partition.leaderEpoch))
       }
     }

     // Notify the group coordinator about deleted topics.
     val deletedTopicPartitions = new mutable.ArrayBuffer[TopicPartition]()
     topicsDelta.deletedTopicIds().forEach { id =>
       val topicImage = topicsDelta.image().getTopic(id)
       topicImage.partitions().keySet().forEach {
         id => deletedTopicPartitions += new TopicPartition(topicImage.name(), id)
       }
     }
     if (deletedTopicPartitions.nonEmpty) {
       groupCoordinator.handleDeletedPartitions(deletedTopicPartitions, RequestLocal.NoCaching)
     }
   }

   // Apply configuration deltas.
   Option(delta.configsDelta()).foreach { configsDelta =>
     configsDelta.changes().keySet().forEach { configResource =>
       val tag = configResource.`type`() match {
         case ConfigResource.Type.TOPIC => Some(ConfigType.Topic)
         case ConfigResource.Type.BROKER => Some(ConfigType.Broker)
         case _ => None
       }
       tag.foreach { t =>
         val newProperties = newImage.configs().configProperties(configResource)
         val maybeDefaultName = configResource.name() match {
           case "" => ConfigEntityName.Default
           case k => k
         }
         dynamicConfigHandlers(t).processConfigChanges(maybeDefaultName, newProperties)
       }
     }
   }

   // Apply client quotas delta.
   Option(delta.clientQuotasDelta()).foreach { clientQuotasDelta =>
     clientQuotaMetadataManager.update(clientQuotasDelta)
   }

   if (_firstPublish) {
     finishInitializingReplicaManager(newImage)
   }
 } catch {
   case t: Throwable => error(s"Error publishing broker metadata at ${newHighestMetadataOffset}", t)
     throw t
 } finally {
   _firstPublish = false
 }
}
  2. The source of ReplicaManager.scala#applyDelta() is shown below; the key processing breaks down into the following steps:
    1. First call TopicsDelta.java#localChanges() to compute the topic changes in the metadata that concern this node
    2. Once the changes are known, if the current node has been assigned as the leader replica of some partitions, ReplicaManager.scala#applyLocalLeadersDelta() handles them; if the node has also been assigned follower replicas of some partitions, ReplicaManager.scala#applyLocalFollowersDelta() handles those
def applyDelta(newImage: MetadataImage, delta: TopicsDelta): Unit = {
 // Before taking the lock, compute the local changes
 val localChanges = delta.localChanges(config.nodeId)

 replicaStateChangeLock.synchronized {
   // Handle deleted partitions. We need to do this first because we might subsequently
   // create new partitions with the same names as the ones we are deleting here.
   if (!localChanges.deletes.isEmpty) {
     val deletes = localChanges.deletes.asScala.map(tp => (tp, true)).toMap
     stateChangeLogger.info(s"Deleting ${deletes.size} partition(s).")
     stopPartitions(deletes).foreach { case (topicPartition, e) =>
       if (e.isInstanceOf[KafkaStorageException]) {
         stateChangeLogger.error(s"Unable to delete replica ${topicPartition} because " +
           "the local replica for the partition is in an offline log directory")
       } else {
         stateChangeLogger.error(s"Unable to delete replica ${topicPartition} because " +
           s"we got an unexpected ${e.getClass.getName} exception: ${e.getMessage}")
       }
     }
   }

   // Handle partitions which we are now the leader or follower for.
   if (!localChanges.leaders.isEmpty || !localChanges.followers.isEmpty) {
     val lazyOffsetCheckpoints = new LazyOffsetCheckpoints(this.highWatermarkCheckpoints)
     val changedPartitions = new mutable.HashSet[Partition]
     if (!localChanges.leaders.isEmpty) {
       applyLocalLeadersDelta(changedPartitions, delta, lazyOffsetCheckpoints, localChanges.leaders.asScala)
     }
     if (!localChanges.followers.isEmpty) {
       applyLocalFollowersDelta(changedPartitions, newImage, delta, lazyOffsetCheckpoints, localChanges.followers.asScala)
     }
     maybeAddLogDirFetchers(changedPartitions, lazyOffsetCheckpoints,
       name => Option(newImage.topics().getTopic(name)).map(_.id()))

     def markPartitionOfflineIfNeeded(tp: TopicPartition): Unit = {
       /*
        * If there is offline log directory, a Partition object may have been created by getOrCreatePartition()
        * before getOrCreateReplica() failed to create local replica due to KafkaStorageException.
        * In this case ReplicaManager.allPartitions will map this topic-partition to an empty Partition object.
        * we need to map this topic-partition to OfflinePartition instead.
        */
       if (localLog(tp).isEmpty)
         markPartitionOffline(tp)
     }
     localChanges.leaders.keySet.forEach(markPartitionOfflineIfNeeded)
     localChanges.followers.keySet.forEach(markPartitionOfflineIfNeeded)

     replicaFetcherManager.shutdownIdleFetcherThreads()
     replicaAlterLogDirsManager.shutdownIdleFetcherThreads()
   }
 }
}
  3. TopicsDelta.java#localChanges() is shown below. As its comment indicates, it computes, across all changed topics, the following three kinds of changes that need to be applied on this node. The per-topic computation is implemented by TopicDelta.java#localChanges(), which this article does not dig into (a rough sketch of the idea follows the code below):
    1. Local replicas that this node should delete
    2. Leader replicas newly assigned to this node that it now needs to maintain
    3. Follower replicas newly assigned to this node that it now needs to maintain
/**
  * Find the topic partitions that have change based on the replica given.
  *
  * The changes identified are:
  *   1. topic partitions for which the broker is not a replica anymore
  *   2. topic partitions for which the broker is now the leader
  *   3. topic partitions for which the broker is now a follower
  *
  * @param brokerId the broker id
  * @return the list of topic partitions which the broker should remove, become leader or become follower.
  */
 public LocalReplicaChanges localChanges(int brokerId) {
     Set<TopicPartition> deletes = new HashSet<>();
     Map<TopicPartition, LocalReplicaChanges.PartitionInfo> leaders = new HashMap<>();
     Map<TopicPartition, LocalReplicaChanges.PartitionInfo> followers = new HashMap<>();

     for (TopicDelta delta : changedTopics.values()) {
         LocalReplicaChanges changes = delta.localChanges(brokerId);

         deletes.addAll(changes.deletes());
         leaders.putAll(changes.leaders());
         followers.putAll(changes.followers());
     }

     // Add all of the removed topic partitions to the set of locally removed partitions
     deletedTopicIds().forEach(topicId -> {
         TopicImage topicImage = image().getTopic(topicId);
         topicImage.partitions().forEach((partitionId, prevPartition) -> {
             if (Replicas.contains(prevPartition.replicas, brokerId)) {
                 deletes.add(new TopicPartition(topicImage.name(), partitionId));
             }
         });
     });

     return new LocalReplicaChanges(deletes, leaders, followers);
 }
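
As a rough illustration, the per-topic TopicDelta.java#localChanges() computation essentially classifies each changed partition by this node's role in it, along the lines of the hedged sketch below (simplified types and names; the real implementation also covers removed replicas, partition changes and more):

// Hedged sketch of per-topic local-change classification (simplified types, not TopicDelta itself)
final case class PartitionSketch(leader: Int, replicas: Seq[Int])

def classifyLocalChanges(topic: String,
                         partitions: Map[Int, PartitionSketch],  // partitionId -> new registration
                         brokerId: Int): (Set[(String, Int)], Set[(String, Int)]) = {
  var leaders = Set.empty[(String, Int)]
  var followers = Set.empty[(String, Int)]
  partitions.foreach { case (partitionId, p) =>
    if (p.leader == brokerId) leaders += topic -> partitionId
    else if (p.replicas.contains(brokerId)) followers += topic -> partitionId
    // Partitions that no longer list brokerId among their replicas would go into the "deletes" set
  }
  (leaders, followers)
}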
  4. With the metadata changes computed, back to sub-step 2 of step 2 in this section: if the current node has newly assigned leader replicas to maintain, ReplicaManager.scala#applyLocalLeadersDelta() is triggered. Its implementation is shown below, and the core processing is quite simple:
    1. Iterate over the new leader list and call ReplicaManager.scala#getOrCreatePartition() to create a local Partition object for each
    2. Call Partition.scala#makeLeader() to turn the newly created Partition object into the partition's leader replica
private def applyLocalLeadersDelta(
 changedPartitions: mutable.Set[Partition],
 delta: TopicsDelta,
 offsetCheckpoints: OffsetCheckpoints,
 newLocalLeaders: mutable.Map[TopicPartition, LocalReplicaChanges.PartitionInfo]
): Unit = {
 stateChangeLogger.info(s"Transitioning ${newLocalLeaders.size} partition(s) to " +
   "local leaders.")
 replicaFetcherManager.removeFetcherForPartitions(newLocalLeaders.keySet)
 newLocalLeaders.forKeyValue { case (tp, info) =>
   getOrCreatePartition(tp, delta, info.topicId).foreach { case (partition, isNew) =>
     try {
       val state = info.partition.toLeaderAndIsrPartitionState(tp, isNew)
       if (!partition.makeLeader(state, offsetCheckpoints, Some(info.topicId))) {
         stateChangeLogger.info("Skipped the become-leader state change for " +
           s"${tp} with topic id ${info.topicId} because this partition is " +
           "already a local leader.")
       }
       changedPartitions.add(partition)
     } catch {
       case e: KafkaStorageException =>
         stateChangeLogger.info(s"Skipped the become-leader state change for ${tp} " +
           s"with topic id ${info.topicId} due to disk error ${e}")
         val dirOpt = getLogDir(tp)
         error(s"Error while making broker the leader for partition ${tp} in dir " +
           s"${dirOpt}", e)
     }
   }
 }
}
  5. The flow of Partition.scala#makeLeader() is reasonably clear; the key processing is:
    1. Execute Partition.scala#updateAssignmentAndIsr() to update the leader replica's ISR list and its internal remote replica map remoteReplicasMap
    2. Call Partition.scala#createLogIfNotExists() to create the local log for the leader replica; nothing is created if the log already exists
    3. Call Log.scala#maybeAssignEpochStartOffset() to record the partition's new leader epoch and its start offset, which are later used for failure recovery; interested readers may refer to Kafka 3.0 Source Notes (12) - Source Code Analysis of the Partition Failure Recovery Mechanism on the Kafka Server
    4. Call Partition.scala#maybeIncrementLeaderHW() to try to advance the partition's HW
def makeLeader(partitionState: LeaderAndIsrPartitionState,
              highWatermarkCheckpoints: OffsetCheckpoints,
              topicId: Option[Uuid]): Boolean = {
 val (leaderHWIncremented, isNewLeader) = inWriteLock(leaderIsrUpdateLock) {
   // record the epoch of the controller that made the leadership decision. This is useful while updating the isr
   // to maintain the decision maker controller's epoch in the zookeeper path
   controllerEpoch = partitionState.controllerEpoch

   val isr = partitionState.isr.asScala.map(_.toInt).toSet
   val addingReplicas = partitionState.addingReplicas.asScala.map(_.toInt)
   val removingReplicas = partitionState.removingReplicas.asScala.map(_.toInt)

   updateAssignmentAndIsr(
     assignment = partitionState.replicas.asScala.map(_.toInt),
     isr = isr,
     addingReplicas = addingReplicas,
     removingReplicas = removingReplicas
   )
   try {
     createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints, topicId)
   } catch {
     case e: ZooKeeperClientException =>
       stateChangeLogger.error(s"A ZooKeeper client exception has occurred and makeLeader will be skipping the " +
         s"state change for the partition $topicPartition with leader epoch: $leaderEpoch ", e)

       return false
   }

   val leaderLog = localLogOrException
   val leaderEpochStartOffset = leaderLog.logEndOffset
   stateChangeLogger.info(s"Leader $topicPartition starts at leader epoch ${partitionState.leaderEpoch} from " +
     s"offset $leaderEpochStartOffset with high watermark ${leaderLog.highWatermark} " +
     s"ISR ${isr.mkString("[", ",", "]")} addingReplicas ${addingReplicas.mkString("[", ",", "]")} " +
     s"removingReplicas ${removingReplicas.mkString("[", ",", "]")}. Previous leader epoch was $leaderEpoch.")

   //We cache the leader epoch here, persisting it only if it's local (hence having a log dir)
   leaderEpoch = partitionState.leaderEpoch
   leaderEpochStartOffsetOpt = Some(leaderEpochStartOffset)
   zkVersion = partitionState.zkVersion

   // In the case of successive leader elections in a short time period, a follower may have
   // entries in its log from a later epoch than any entry in the new leader's log. In order
   // to ensure that these followers can truncate to the right offset, we must cache the new
   // leader epoch and the start offset since it should be larger than any epoch that a follower
   // would try to query.
   leaderLog.maybeAssignEpochStartOffset(leaderEpoch, leaderEpochStartOffset)

   val isNewLeader = !isLeader
   val curTimeMs = time.milliseconds
   // initialize lastCaughtUpTime of replicas as well as their lastFetchTimeMs and lastFetchLeaderLogEndOffset.
   remoteReplicas.foreach { replica =>
     val lastCaughtUpTimeMs = if (isrState.isr.contains(replica.brokerId)) curTimeMs else 0L
     replica.resetLastCaughtUpTime(leaderEpochStartOffset, curTimeMs, lastCaughtUpTimeMs)
   }

   if (isNewLeader) {
     // mark local replica as the leader after converting hw
     leaderReplicaIdOpt = Some(localBrokerId)
     // reset log end offset for remote replicas
     remoteReplicas.foreach { replica =>
       replica.updateFetchState(
         followerFetchOffsetMetadata = LogOffsetMetadata.UnknownOffsetMetadata,
         followerStartOffset = Log.UnknownOffset,
         followerFetchTimeMs = 0L,
         leaderEndOffset = Log.UnknownOffset)
     }
   }
   // we may need to increment high watermark since ISR could be down to 1
   (maybeIncrementLeaderHW(leaderLog), isNewLeader)
 }
 // some delayed operations may be unblocked after HW changed
 if (leaderHWIncremented)
   tryCompleteDelayedRequests()
 isNewLeader
}
  6. The implementation of Partition.scala#maybeIncrementLeaderHW() is shown below; the algorithm was described in Section 1 - The Flow of Leader-Follower Message Data Synchronization and is not repeated here
private def maybeIncrementLeaderHW(leaderLog: Log, curTime: Long = time.milliseconds): Boolean = {
 // maybeIncrementLeaderHW is in the hot path, the following code is written to
 // avoid unnecessary collection generation
 var newHighWatermark = leaderLog.logEndOffsetMetadata
 remoteReplicasMap.values.foreach { replica =>
   // Note here we are using the "maximal", see explanation above
   if (replica.logEndOffsetMetadata.messageOffset < newHighWatermark.messageOffset &&
     (curTime - replica.lastCaughtUpTimeMs <= replicaLagTimeMaxMs || isrState.maximalIsr.contains(replica.brokerId))) {
     newHighWatermark = replica.logEndOffsetMetadata
   }
 }

 leaderLog.maybeIncrementHighWatermark(newHighWatermark) match {
   case Some(oldHighWatermark) =>
     debug(s"High watermark updated from $oldHighWatermark to $newHighWatermark")
     true

   case None =>
     def logEndOffsetString: ((Int, LogOffsetMetadata)) => String = {
       case (brokerId, logEndOffsetMetadata) => s"replica $brokerId: $logEndOffsetMetadata"
     }

     if (isTraceEnabled) {
       val replicaInfo = remoteReplicas.map(replica => (replica.brokerId, replica.logEndOffsetMetadata)).toSet
       val localLogInfo = (localBrokerId, localLogOrException.logEndOffsetMetadata)
       trace(s"Skipping update high watermark since new hw $newHighWatermark is not larger than old value. " +
         s"All current LEOs are ${(replicaInfo + localLogInfo).map(logEndOffsetString)}")
     }
     false
 }
}
  7. Back again to sub-step 2 of step 2 in this section: if the current node has newly assigned follower replicas to maintain, ReplicaManager.scala#applyLocalFollowersDelta() is triggered. Its implementation is shown below, with the following key steps:
    1. Call ReplicaManager.scala#getOrCreatePartition() to create the local Partition object
    2. Call Partition.scala#makeFollower() to turn the newly created Partition object into the partition's follower replica
    3. Call ReplicaFetcherManager.scala#addFetcherForPartitions() to assign a Fetcher thread to the follower replicas; this thread synchronizes message data from the partition's leader replica
private def applyLocalFollowersDelta(
 changedPartitions: mutable.Set[Partition],
 newImage: MetadataImage,
 delta: TopicsDelta,
 offsetCheckpoints: OffsetCheckpoints,
 newLocalFollowers: mutable.Map[TopicPartition, LocalReplicaChanges.PartitionInfo]
): Unit = {
 stateChangeLogger.info(s"Transitioning ${newLocalFollowers.size} partition(s) to " +
   "local followers.")
 val shuttingDown = isShuttingDown.get()
 val partitionsToMakeFollower = new mutable.HashMap[TopicPartition, Partition]
 val newFollowerTopicSet = new mutable.HashSet[String]
 newLocalFollowers.forKeyValue { case (tp, info) =>
   getOrCreatePartition(tp, delta, info.topicId).foreach { case (partition, isNew) =>
     try {
       newFollowerTopicSet.add(tp.topic)

       if (shuttingDown) {
         stateChangeLogger.trace(s"Unable to start fetching ${tp} with topic " +
           s"ID ${info.topicId} because the replica manager is shutting down.")
       } else {
         val leader = info.partition.leader
         if (newImage.cluster.broker(leader) == null) {
           stateChangeLogger.trace(s"Unable to start fetching $tp with topic ID ${info.topicId} " +
             s"from leader $leader because it is not alive.")

           // Create the local replica even if the leader is unavailable. This is required
           // to ensure that we include the partition's high watermark in the checkpoint
           // file (see KAFKA-1647).
           partition.createLogIfNotExists(isNew, false, offsetCheckpoints, Some(info.topicId))
         } else {
           val state = info.partition.toLeaderAndIsrPartitionState(tp, isNew)
           if (partition.makeFollower(state, offsetCheckpoints, Some(info.topicId))) {
             partitionsToMakeFollower.put(tp, partition)
           } else {
             stateChangeLogger.info("Skipped the become-follower state change after marking its " +
               s"partition as follower for partition $tp with id ${info.topicId} and partition state $state.")
           }
         }
       }
       changedPartitions.add(partition)
     } catch {
       case e: Throwable => stateChangeLogger.error(s"Unable to start fetching ${tp} " +
           s"with topic ID ${info.topicId} due to ${e.getClass.getSimpleName}", e)
         replicaFetcherManager.addFailedPartition(tp)
     }
   }
 }

 // Stopping the fetchers must be done first in order to initialize the fetch
 // position correctly.
 replicaFetcherManager.removeFetcherForPartitions(partitionsToMakeFollower.keySet)
 stateChangeLogger.info(s"Stopped fetchers as part of become-follower for ${partitionsToMakeFollower.size} partitions")

 val listenerName = config.interBrokerListenerName.value
 val partitionAndOffsets = new mutable.HashMap[TopicPartition, InitialFetchState]
 partitionsToMakeFollower.forKeyValue { (topicPartition, partition) =>
   val node = partition.leaderReplicaIdOpt
     .flatMap(leaderId => Option(newImage.cluster.broker(leaderId)))
     .flatMap(_.node(listenerName).asScala)
     .getOrElse(Node.noNode)
   val log = partition.localLogOrException
   partitionAndOffsets.put(topicPartition, InitialFetchState(
     new BrokerEndPoint(node.id, node.host, node.port),
     partition.getLeaderEpoch,
     initialFetchOffset(log)
   ))
 }

 replicaFetcherManager.addFetcherForPartitions(partitionAndOffsets)
 stateChangeLogger.info(s"Started fetchers as part of become-follower for ${partitionsToMakeFollower.size} partitions")

 partitionsToMakeFollower.keySet.foreach(completeDelayedFetchOrProduceRequests)

 updateLeaderAndFollowerMetrics(newFollowerTopicSet)
}
  8. Partition.scala#makeFollower() is quite simple, essentially a stripped-down version of Partition.scala#makeLeader(), so it is not discussed further
def makeFollower(partitionState: LeaderAndIsrPartitionState,
                highWatermarkCheckpoints: OffsetCheckpoints,
                topicId: Option[Uuid]): Boolean = {
 inWriteLock(leaderIsrUpdateLock) {
   val newLeaderBrokerId = partitionState.leader
   val oldLeaderEpoch = leaderEpoch
   // record the epoch of the controller that made the leadership decision. This is useful while updating the isr
   // to maintain the decision maker controller's epoch in the zookeeper path
   controllerEpoch = partitionState.controllerEpoch

   updateAssignmentAndIsr(
     assignment = partitionState.replicas.asScala.iterator.map(_.toInt).toSeq,
     isr = Set.empty[Int],
     addingReplicas = partitionState.addingReplicas.asScala.map(_.toInt),
     removingReplicas = partitionState.removingReplicas.asScala.map(_.toInt)
   )
   try {
     createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints, topicId)
   } catch {
     case e: ZooKeeperClientException =>
       stateChangeLogger.error(s"A ZooKeeper client exception has occurred. makeFollower will be skipping the " +
         s"state change for the partition $topicPartition with leader epoch: $leaderEpoch.", e)

       return false
   }

   val followerLog = localLogOrException
   val leaderEpochEndOffset = followerLog.logEndOffset
   stateChangeLogger.info(s"Follower $topicPartition starts at leader epoch ${partitionState.leaderEpoch} from " +
     s"offset $leaderEpochEndOffset with high watermark ${followerLog.highWatermark}. " +
     s"Previous leader epoch was $leaderEpoch.")

   leaderEpoch = partitionState.leaderEpoch
   leaderEpochStartOffsetOpt = None
   zkVersion = partitionState.zkVersion

   if (leaderReplicaIdOpt.contains(newLeaderBrokerId) && leaderEpoch == oldLeaderEpoch) {
     false
   } else {
     leaderReplicaIdOpt = Some(newLeaderBrokerId)
     true
   }
 }
}
  9. ReplicaFetcherManager.scala#addFetcherForPartitions() is actually implemented by its parent class AbstractFetcherManager.scala#addFetcherForPartitions(). The key processing here is the internal addAndStartFetcherThread() method, which invokes the subclass's ReplicaFetcherManager.scala#createFetcherThread() to create the Fetcher thread and then starts it
def addFetcherForPartitions(partitionAndOffsets: Map[TopicPartition, InitialFetchState]): Unit = {
 lock synchronized {
   val partitionsPerFetcher = partitionAndOffsets.groupBy { case (topicPartition, brokerAndInitialFetchOffset) =>
     BrokerAndFetcherId(brokerAndInitialFetchOffset.leader, getFetcherId(topicPartition))
   }

   def addAndStartFetcherThread(brokerAndFetcherId: BrokerAndFetcherId,
                                brokerIdAndFetcherId: BrokerIdAndFetcherId): T = {
     val fetcherThread = createFetcherThread(brokerAndFetcherId.fetcherId, brokerAndFetcherId.broker)
     fetcherThreadMap.put(brokerIdAndFetcherId, fetcherThread)
     fetcherThread.start()
     fetcherThread
   }

   for ((brokerAndFetcherId, initialFetchOffsets) <- partitionsPerFetcher) {
     val brokerIdAndFetcherId = BrokerIdAndFetcherId(brokerAndFetcherId.broker.id, brokerAndFetcherId.fetcherId)
     val fetcherThread = fetcherThreadMap.get(brokerIdAndFetcherId) match {
       case Some(currentFetcherThread) if currentFetcherThread.sourceBroker == brokerAndFetcherId.broker =>
         // reuse the fetcher thread
         currentFetcherThread
       case Some(f) =>
         f.shutdown()
         addAndStartFetcherThread(brokerAndFetcherId, brokerIdAndFetcherId)
       case None =>
         addAndStartFetcherThread(brokerAndFetcherId, brokerIdAndFetcherId)
     }

     addPartitionsToFetcherThread(fetcherThread, initialFetchOffsets)
   }
 }
}
  10. ReplicaFetcherManager.scala#createFetcherThread() creates a ReplicaFetcherThread object as the Fetcher thread instance. With that, the consumption and application of the Topic metadata change comes to an end
override def createFetcherThread(fetcherId: Int, sourceBroker: BrokerEndPoint): ReplicaFetcherThread = {
val prefix = threadNamePrefix.map(tp => s"$tp:").getOrElse("")
val threadName = s"${prefix}ReplicaFetcherThread-$fetcherId-${sourceBroker.id}"
new ReplicaFetcherThread(threadName, fetcherId, sourceBroker, brokerConfig, failedPartitions, replicaManager,
  metrics, time, quotaManager)
}

2.3 Message Data Synchronization Between Leader and Follower Replicas

  1. As covered in the previous section, the ReplicaFetcherThread object is started as soon as it is created, which triggers ReplicaFetcherThread.scala#run(). This method is actually provided by the parent class ShutdownableThread.scala#run(), whose core logic is simply a while loop that keeps invoking the subclass's AbstractFetcherThread.scala#doWork()
override def run(): Unit = {
 isStarted = true
 info("Starting")
 try {
   while (isRunning)
     doWork()
 } catch {
   case e: FatalExitError =>
     shutdownInitiated.countDown()
     shutdownComplete.countDown()
     info("Stopped")
     Exit.exit(e.statusCode())
   case e: Throwable =>
     if (isRunning)
       error("Error due to", e)
 } finally {
    shutdownComplete.countDown()
 }
 info("Stopped")
}
  2. AbstractFetcherThread.scala#doWork() contains only two method calls. AbstractFetcherThread.scala#maybeTruncate() handles log truncation during failure recovery and is not analyzed in this article. AbstractFetcherThread.scala#maybeFetch() is what actually synchronizes message data via Fetch requests, in the following steps:
    1. Call the subclass implementation ReplicaFetcherThread.scala#buildFetch() to build the Fetch request
    2. Call AbstractFetcherThread.scala#processFetchRequest() to send the Fetch request and process the response data
override def doWork(): Unit = {
 maybeTruncate()
 maybeFetch()
}

private def maybeFetch(): Unit = {
 val fetchRequestOpt = inLock(partitionMapLock) {
   val ResultWithPartitions(fetchRequestOpt, partitionsWithError) = buildFetch(partitionStates.partitionStateMap.asScala)

   handlePartitionsWithErrors(partitionsWithError, "maybeFetch")

   if (fetchRequestOpt.isEmpty) {
     trace(s"There are no active partitions. Back off for $fetchBackOffMs ms before sending a fetch request")
     partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
   }

   fetchRequestOpt
 }

 fetchRequestOpt.foreach { case ReplicaFetch(sessionPartitions, fetchRequest) =>
   processFetchRequest(sessionPartitions, fetchRequest)
 }
}
  3. ReplicaFetcherThread.scala#buildFetch() is actually quite simple. The point to note is that the local log's LEO, set into the fetch state when the thread was initialized, is filled into the fetchOffset field of the Fetch request
override def buildFetch(partitionMap: Map[TopicPartition, PartitionFetchState]): ResultWithPartitions[Option[ReplicaFetch]] = {
 val partitionsWithError = mutable.Set[TopicPartition]()

 val builder = fetchSessionHandler.newBuilder(partitionMap.size, false)
 partitionMap.forKeyValue { (topicPartition, fetchState) =>
   // We will not include a replica in the fetch request if it should be throttled.
   if (fetchState.isReadyForFetch && !shouldFollowerThrottle(quota, fetchState, topicPartition)) {
     try {
       val logStartOffset = this.logStartOffset(topicPartition)
       val lastFetchedEpoch = if (isTruncationOnFetchSupported)
         fetchState.lastFetchedEpoch.map(_.asInstanceOf[Integer]).asJava
       else
         Optional.empty[Integer]
       builder.add(topicPartition, new FetchRequest.PartitionData(
         fetchState.fetchOffset,
         logStartOffset,
         fetchSize,
         Optional.of(fetchState.currentLeaderEpoch),
         lastFetchedEpoch))
     } catch {
       case _: KafkaStorageException =>
         // The replica has already been marked offline due to log directory failure and the original failure should have already been logged.
         // This partition should be removed from ReplicaFetcherThread soon by ReplicaManager.handleLogDirFailure()
         partitionsWithError += topicPartition
     }
   }
 }

 val fetchData = builder.build()
 val fetchRequestOpt = if (fetchData.sessionPartitions.isEmpty && fetchData.toForget.isEmpty) {
   None
 } else {
   val requestBuilder = FetchRequest.Builder
     .forReplica(fetchRequestVersion, replicaId, maxWait, minBytes, fetchData.toSend)
     .setMaxBytes(maxBytes)
     .toForget(fetchData.toForget)
     .metadata(fetchData.metadata)
   Some(ReplicaFetch(fetchData.sessionPartitions(), requestBuilder))
 }

 ResultWithPartitions(fetchRequestOpt, partitionsWithError)
}
  4. The key processing of AbstractFetcherThread.scala#processFetchRequest() is:
    1. First call the subclass implementation ReplicaFetcherThread.scala#fetchFromLeader() to send the Fetch request to the partition's leader replica. Going further down involves the underlying network component NetworkClient; interested readers may refer to the consumer group coordinator lookup article for how it works, which is not repeated here
    2. If the Fetch response carries message data, call the subclass implementation ReplicaFetcherThread.scala#processPartitionData() to append the messages to the local log
    3. If the server version supports log truncation on Fetch, the diverging-epoch information carried in the response is also collected while processing it. An epoch divergence between the leader and a follower usually means a failure recovery has happened, so the follower replica may need to truncate its log to stay consistent with the leader; this is eventually handled by AbstractFetcherThread.scala#truncateOnFetchResponse() and is not analyzed in depth here
private def processFetchRequest(sessionPartitions: util.Map[TopicPartition, FetchRequest.PartitionData],
                               fetchRequest: FetchRequest.Builder): Unit = {
 val partitionsWithError = mutable.Set[TopicPartition]()
 val divergingEndOffsets = mutable.Map.empty[TopicPartition, EpochEndOffset]
 var responseData: Map[TopicPartition, FetchData] = Map.empty

 try {
   trace(s"Sending fetch request $fetchRequest")
   responseData = fetchFromLeader(fetchRequest)
 } catch {
   case t: Throwable =>
     if (isRunning) {
       warn(s"Error in response for fetch request $fetchRequest", t)
       inLock(partitionMapLock) {
         partitionsWithError ++= partitionStates.partitionSet.asScala
         // there is an error occurred while fetching partitions, sleep a while
         // note that `AbstractFetcherThread.handlePartitionsWithError` will also introduce the same delay for every
         // partition with error effectively doubling the delay. It would be good to improve this.
         partitionMapCond.await(fetchBackOffMs, TimeUnit.MILLISECONDS)
       }
     }
 }
 fetcherStats.requestRate.mark()

 if (responseData.nonEmpty) {
   // process fetched data
   inLock(partitionMapLock) {
     responseData.forKeyValue { (topicPartition, partitionData) =>
       Option(partitionStates.stateValue(topicPartition)).foreach { currentFetchState =>
         // It's possible that a partition is removed and re-added or truncated when there is a pending fetch request.
         // In this case, we only want to process the fetch response if the partition state is ready for fetch and
         // the current offset is the same as the offset requested.
         val fetchPartitionData = sessionPartitions.get(topicPartition)
         if (fetchPartitionData != null && fetchPartitionData.fetchOffset == currentFetchState.fetchOffset && currentFetchState.isReadyForFetch) {
           Errors.forCode(partitionData.errorCode) match {
             case Errors.NONE =>
               try {
                 // Once we hand off the partition data to the subclass, we can't mess with it any more in this thread
                 val logAppendInfoOpt = processPartitionData(topicPartition, currentFetchState.fetchOffset,
                   partitionData)

                 logAppendInfoOpt.foreach { logAppendInfo =>
                   val validBytes = logAppendInfo.validBytes
                   val nextOffset = if (validBytes > 0) logAppendInfo.lastOffset + 1 else currentFetchState.fetchOffset
                   val lag = Math.max(0L, partitionData.highWatermark - nextOffset)
                   fetcherLagStats.getAndMaybePut(topicPartition).lag = lag

                   // ReplicaDirAlterThread may have removed topicPartition from the partitionStates after processing the partition data
                   if (validBytes > 0 && partitionStates.contains(topicPartition)) {
                     // Update partitionStates only if there is no exception during processPartitionData
                     val newFetchState = PartitionFetchState(nextOffset, Some(lag),
                       currentFetchState.currentLeaderEpoch, state = Fetching,
                       logAppendInfo.lastLeaderEpoch)
                     partitionStates.updateAndMoveToEnd(topicPartition, newFetchState)
                     fetcherStats.byteRate.mark(validBytes)
                   }
                 }
                 if (isTruncationOnFetchSupported) {
                   FetchResponse.divergingEpoch(partitionData).ifPresent { divergingEpoch =>
                     divergingEndOffsets += topicPartition -> new EpochEndOffset()
                       .setPartition(topicPartition.partition)
                       .setErrorCode(Errors.NONE.code)
                       .setLeaderEpoch(divergingEpoch.epoch)
                       .setEndOffset(divergingEpoch.endOffset)
                   }
                 }
               } catch {
                 case ime@( _: CorruptRecordException | _: InvalidRecordException) =>
                   // we log the error and continue. This ensures two things
                   // 1. If there is a corrupt message in a topic partition, it does not bring the fetcher thread
                   //    down and cause other topic partition to also lag
                   // 2. If the message is corrupt due to a transient state in the log (truncation, partial writes
                   //    can cause this), we simply continue and should get fixed in the subsequent fetches
                   error(s"Found invalid messages during fetch for partition $topicPartition " +
                     s"offset ${currentFetchState.fetchOffset}", ime)
                   partitionsWithError += topicPartition
                 case e: KafkaStorageException =>
                   error(s"Error while processing data for partition $topicPartition " +
                     s"at offset ${currentFetchState.fetchOffset}", e)
                   markPartitionFailed(topicPartition)
                 case t: Throwable =>
                   // stop monitoring this partition and add it to the set of failed partitions
                   error(s"Unexpected error occurred while processing data for partition $topicPartition " +
                     s"at offset ${currentFetchState.fetchOffset}", t)
                   markPartitionFailed(topicPartition)
               }
             case Errors.OFFSET_OUT_OF_RANGE =>
               if (handleOutOfRangeError(topicPartition, currentFetchState, fetchPartitionData.currentLeaderEpoch))
                 partitionsWithError += topicPartition

             case Errors.UNKNOWN_LEADER_EPOCH =>
               debug(s"Remote broker has a smaller leader epoch for partition $topicPartition than " +
                 s"this replica's current leader epoch of ${currentFetchState.currentLeaderEpoch}.")
               partitionsWithError += topicPartition

             case Errors.FENCED_LEADER_EPOCH =>
               if (onPartitionFenced(topicPartition, fetchPartitionData.currentLeaderEpoch))
                 partitionsWithError += topicPartition

             case Errors.NOT_LEADER_OR_FOLLOWER =>
               debug(s"Remote broker is not the leader for partition $topicPartition, which could indicate " +
                 "that the partition is being moved")
               partitionsWithError += topicPartition

             case Errors.UNKNOWN_TOPIC_OR_PARTITION =>
               warn(s"Received ${Errors.UNKNOWN_TOPIC_OR_PARTITION} from the leader for partition $topicPartition. " +
                    "This error may be returned transiently when the partition is being created or deleted, but it is not " +
                    "expected to persist.")
               partitionsWithError += topicPartition

             case partitionError =>
               error(s"Error for partition $topicPartition at offset ${currentFetchState.fetchOffset}", partitionError.exception)
               partitionsWithError += topicPartition
           }
         }
       }
     }
   }
 }

 if (divergingEndOffsets.nonEmpty)
   truncateOnFetchResponse(divergingEndOffsets)
 if (partitionsWithError.nonEmpty) {
   handlePartitionsWithErrors(partitionsWithError, "processFetchRequest")
 }
}
  5. The main processing of ReplicaFetcherThread.scala#processPartitionData() is plain to see:
    1. Call Partition.scala#appendRecordsToFollowerOrFutureReplica() to append the message data locally. This is mostly log file writing; interested readers may refer to Kafka 3.0 Source Notes (7) - How the Kafka Server Handles Client Produce Requests. At this point the leader-follower synchronization of the message data itself is complete
    2. After the messages are written, the HW carried in the Fetch response is taken out and used to try to update the local log's HW, via a call to Log.scala#updateHighWatermark() (a small sketch of the follower-side HW semantics follows the code below)
override def processPartitionData(topicPartition: TopicPartition,
                                 fetchOffset: Long,
                                 partitionData: FetchData): Option[LogAppendInfo] = {
 val logTrace = isTraceEnabled
 val partition = replicaMgr.getPartitionOrException(topicPartition)
 val log = partition.localLogOrException
 val records = toMemoryRecords(FetchResponse.recordsOrFail(partitionData))

 maybeWarnIfOversizedRecords(records, topicPartition)

 if (fetchOffset != log.logEndOffset)
   throw new IllegalStateException("Offset mismatch for partition %s: fetched offset = %d, log end offset = %d.".format(
     topicPartition, fetchOffset, log.logEndOffset))

 if (logTrace)
   trace("Follower has replica log end offset %d for partition %s. Received %d messages and leader hw %d"
     .format(log.logEndOffset, topicPartition, records.sizeInBytes, partitionData.highWatermark))

 // Append the leader's messages to the log
 val logAppendInfo = partition.appendRecordsToFollowerOrFutureReplica(records, isFuture = false)

 if (logTrace)
   trace("Follower has replica log end offset %d after appending %d bytes of messages for partition %s"
     .format(log.logEndOffset, records.sizeInBytes, topicPartition))
 val leaderLogStartOffset = partitionData.logStartOffset

 // For the follower replica, we do not need to keep its segment base offset and physical position.
 // These values will be computed upon becoming leader or handling a preferred read replica fetch.
 val followerHighWatermark = log.updateHighWatermark(partitionData.highWatermark)
 log.maybeIncrementLogStartOffset(leaderLogStartOffset, LeaderOffsetIncremented)
 if (logTrace)
   trace(s"Follower set replica high watermark for partition $topicPartition to $followerHighWatermark")

 // Traffic from both in-sync and out of sync replicas are accounted for in replication quota to ensure total replication
 // traffic doesn't exceed quota.
 if (quota.isThrottled(topicPartition))
   quota.record(records.sizeInBytes)

 if (partition.isReassigning && partition.isAddingLocalReplica)
   brokerTopicStats.updateReassignmentBytesIn(records.sizeInBytes)

 brokerTopicStats.updateReplicationBytesIn(records.sizeInBytes)

 logAppendInfo
}
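
On the follower side, the HW taken from the Fetch response cannot run ahead of the follower's own log. The semantics of Log.scala#updateHighWatermark() boil down to roughly the sketch below (an illustration of the behavior, not the real Log code):

// Follower-side HW update sketch: the leader's HW is clamped to the follower's own log range
def updateFollowerHighWatermark(leaderHw: Long, logStartOffset: Long, localLeo: Long): Long =
  // A follower can never expose offsets it has not replicated locally yet
  math.min(math.max(leaderHw, logStartOffset), localLeo)

// First Fetch round trip of Section 1: leader HW is still 1, follower LEO is now 2 => local HW stays 1
println(updateFollowerHighWatermark(leaderHw = 1, logStartOffset = 0, localLeo = 2)) // 1
// Second round trip: leader HW=2, follower LEO=2 => local HW advances to 2
println(updateFollowerHighWatermark(leaderHw = 2, logStartOffset = 0, localLeo = 2)) // 2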
  6. Back on the leader side, when the leader handles a follower's Fetch request it executes ReplicaManager.scala#updateFollowerFetchState(); the core processing is a call to Partition.scala#updateFollowerFetchState() to update that follower's LEO in the remote replica list kept by the leader
private def updateFollowerFetchState(followerId: Int,
                                    readResults: Seq[(TopicPartition, LogReadResult)]): Seq[(TopicPartition, LogReadResult)] = {
 readResults.map { case (topicPartition, readResult) =>
   val updatedReadResult = if (readResult.error != Errors.NONE) {
     debug(s"Skipping update of fetch state for follower $followerId since the " +
       s"log read returned error ${readResult.error}")
     readResult
   } else if (readResult.divergingEpoch.nonEmpty) {
     debug(s"Skipping update of fetch state for follower $followerId since the " +
       s"log read returned diverging epoch ${readResult.divergingEpoch}")
     readResult
   } else {
     onlinePartition(topicPartition) match {
       case Some(partition) =>
         if (partition.updateFollowerFetchState(followerId,
           followerFetchOffsetMetadata = readResult.info.fetchOffsetMetadata,
           followerStartOffset = readResult.followerLogStartOffset,
           followerFetchTimeMs = readResult.fetchTimeMs,
           leaderEndOffset = readResult.leaderLogEndOffset)) {
           readResult
         } else {
           warn(s"Leader $localBrokerId failed to record follower $followerId's position " +
             s"${readResult.info.fetchOffsetMetadata.messageOffset}, and last sent HW since the replica " +
             s"is not recognized to be one of the assigned replicas ${partition.assignmentState.replicas.mkString(",")} " +
             s"for partition $topicPartition. Empty records will be returned for this partition.")
           readResult.withEmptyFetchInfo
         }
       case None =>
         warn(s"While recording the replica LEO, the partition $topicPartition hasn't been created.")
         readResult
     }
   }
   topicPartition -> updatedReadResult
 }
}
  7. Partition.scala#updateFollowerFetchState() has no complex logic; its core processing falls into two steps, and with it this article's analysis is essentially complete:
    1. Update the LEO and related fetch state of the follower replica that issued the Fetch request
    2. Call Partition.scala#maybeIncrementLeaderHW(), mentioned in step 6 of Section 2.2 above, to try to advance the partition HW while handling the follower's Fetch request; only after the HW advances is a new message considered committed and eligible for consumption by consumers
def updateFollowerFetchState(followerId: Int,
                            followerFetchOffsetMetadata: LogOffsetMetadata,
                            followerStartOffset: Long,
                            followerFetchTimeMs: Long,
                            leaderEndOffset: Long): Boolean = {
 getReplica(followerId) match {
   case Some(followerReplica) =>
     // No need to calculate low watermark if there is no delayed DeleteRecordsRequest
     val oldLeaderLW = if (delayedOperations.numDelayedDelete > 0) lowWatermarkIfLeader else -1L
     val prevFollowerEndOffset = followerReplica.logEndOffset
     followerReplica.updateFetchState(
       followerFetchOffsetMetadata,
       followerStartOffset,
       followerFetchTimeMs,
       leaderEndOffset)

     val newLeaderLW = if (delayedOperations.numDelayedDelete > 0) lowWatermarkIfLeader else -1L
     // check if the LW of the partition has incremented
     // since the replica's logStartOffset may have incremented
     val leaderLWIncremented = newLeaderLW > oldLeaderLW

     // Check if this in-sync replica needs to be added to the ISR.
     maybeExpandIsr(followerReplica, followerFetchTimeMs)

     // check if the HW of the partition can now be incremented
     // since the replica may already be in the ISR and its LEO has just incremented
     val leaderHWIncremented = if (prevFollowerEndOffset != followerReplica.logEndOffset) {
       // the leader log may be updated by ReplicaAlterLogDirsThread so the following method must be in lock of
       // leaderIsrUpdateLock to prevent adding new hw to invalid log.
       inReadLock(leaderIsrUpdateLock) {
         leaderLogIfLocal.exists(leaderLog => maybeIncrementLeaderHW(leaderLog, followerFetchTimeMs))
       }
     } else {
       false
     }

     // some delayed operations may be unblocked after HW or LW changed
     if (leaderLWIncremented || leaderHWIncremented)
       tryCompleteDelayedRequests()

     debug(s"Recorded replica $followerId log end offset (LEO) position " +
       s"${followerFetchOffsetMetadata.messageOffset} and log start offset $followerStartOffset.")
     true

   case None =>
     false
 }
}