Table of Contents

  • 1. Preface
  • 2. Source Code Analysis


1. Preface

In Kafka 3.0 Source Code Notes (6) - Source Code Analysis of the Kafka Producer, I analyzed the main actions the client-side Producer performs to produce messages. This article focuses on how the Kafka server handles the client's Produce request. This part is actually fairly simple: in Kafka 3.0 Source Code Notes (4) - How the Kafka Server Handles Client Fetch Requests, I already covered the server's file storage structure and the main path for reading message data, and the server-side write path does not differ much from it.

2. Source Code Analysis


  1. After a client request reaches the Kafka server, it first goes through protocol parsing in the lower-level network components and is then dispatched to the upper-level KafkaApis.scala#handle() method for business-logic routing. For a Produce request, the server-side handler is KafkaApis.scala#handleProduceRequest(), whose core logic is as follows:
    1. First, validate every record carried by the Produce request; this is done by ProduceRequest.java#validateRecords(). Since the message format differs across Kafka versions, this step is mostly a version-compatibility check of the record format and contains little other logic
    2. After validation, the core step is calling ReplicaManager.scala#appendRecords() to process the message data. Note that KafkaApis.scala#handleProduceRequest()#sendResponseCallback() is passed in here as the callback to run once the request has been processed
def handleProduceRequest(request: RequestChannel.Request, requestLocal: RequestLocal): Unit = {
 val produceRequest = request.body[ProduceRequest]
 val requestSize = request.sizeInBytes

 if (RequestUtils.hasTransactionalRecords(produceRequest)) {
   val isAuthorizedTransactional = produceRequest.transactionalId != null &&
     authHelper.authorize(request.context, WRITE, TRANSACTIONAL_ID, produceRequest.transactionalId)
   if (!isAuthorizedTransactional) {
     requestHelper.sendErrorResponseMaybeThrottle(request, Errors.TRANSACTIONAL_ID_AUTHORIZATION_FAILED.exception)
     return
   }
 }

 val unauthorizedTopicResponses = mutable.Map[TopicPartition, PartitionResponse]()
 val nonExistingTopicResponses = mutable.Map[TopicPartition, PartitionResponse]()
 val invalidRequestResponses = mutable.Map[TopicPartition, PartitionResponse]()
 val authorizedRequestInfo = mutable.Map[TopicPartition, MemoryRecords]()
 // cache the result to avoid redundant authorization calls
 val authorizedTopics = authHelper.filterByAuthorized(request.context, WRITE, TOPIC,
   produceRequest.data().topicData().asScala)(_.name())

 produceRequest.data.topicData.forEach(topic => topic.partitionData.forEach { partition =>
   val topicPartition = new TopicPartition(topic.name, partition.index)
   // This caller assumes the type is MemoryRecords and that is true on current serialization
   // We cast the type to avoid causing big change to code base.
   // https://issues.apache.org/jira/browse/KAFKA-10698
   val memoryRecords = partition.records.asInstanceOf[MemoryRecords]
   if (!authorizedTopics.contains(topicPartition.topic))
     unauthorizedTopicResponses += topicPartition -> new PartitionResponse(Errors.TOPIC_AUTHORIZATION_FAILED)
   else if (!metadataCache.contains(topicPartition))
     nonExistingTopicResponses += topicPartition -> new PartitionResponse(Errors.UNKNOWN_TOPIC_OR_PARTITION)
   else
     try {
       ProduceRequest.validateRecords(request.header.apiVersion, memoryRecords)
       authorizedRequestInfo += (topicPartition -> memoryRecords)
     } catch {
       case e: ApiException =>
         invalidRequestResponses += topicPartition -> new PartitionResponse(Errors.forException(e))
     }
 })

 // the callback for sending a produce response
 // The construction of ProduceResponse is able to accept auto-generated protocol data so
 // KafkaApis#handleProduceRequest should apply auto-generated protocol to avoid extra conversion.
 // https://issues.apache.org/jira/browse/KAFKA-10730
 @nowarn("cat=deprecation")
 def sendResponseCallback(responseStatus: Map[TopicPartition, PartitionResponse]): Unit = {
   val mergedResponseStatus = responseStatus ++ unauthorizedTopicResponses ++ nonExistingTopicResponses ++ invalidRequestResponses
   var errorInResponse = false

   mergedResponseStatus.forKeyValue { (topicPartition, status) =>
     if (status.error != Errors.NONE) {
       errorInResponse = true
       debug("Produce request with correlation id %d from client %s on partition %s failed due to %s".format(
         request.header.correlationId,
         request.header.clientId,
         topicPartition,
         status.error.exceptionName))
     }
   }

   // Record both bandwidth and request quota-specific values and throttle by muting the channel if any of the quotas
   // have been violated. If both quotas have been violated, use the max throttle time between the two quotas. Note
   // that the request quota is not enforced if acks == 0.
   val timeMs = time.milliseconds()
   val bandwidthThrottleTimeMs = quotas.produce.maybeRecordAndGetThrottleTimeMs(request, requestSize, timeMs)
   val requestThrottleTimeMs =
     if (produceRequest.acks == 0) 0
     else quotas.request.maybeRecordAndGetThrottleTimeMs(request, timeMs)
   val maxThrottleTimeMs = Math.max(bandwidthThrottleTimeMs, requestThrottleTimeMs)
   if (maxThrottleTimeMs > 0) {
     request.apiThrottleTimeMs = maxThrottleTimeMs
     if (bandwidthThrottleTimeMs > requestThrottleTimeMs) {
       requestHelper.throttle(quotas.produce, request, bandwidthThrottleTimeMs)
     } else {
       requestHelper.throttle(quotas.request, request, requestThrottleTimeMs)
     }
   }

   // Send the response immediately. In case of throttling, the channel has already been muted.
   if (produceRequest.acks == 0) {
     // no operation needed if producer request.required.acks = 0; however, if there is any error in handling
     // the request, since no response is expected by the producer, the server will close socket server so that
     // the producer client will know that some error has happened and will refresh its metadata
     if (errorInResponse) {
       val exceptionsSummary = mergedResponseStatus.map { case (topicPartition, status) =>
         topicPartition -> status.error.exceptionName
       }.mkString(", ")
       info(
         s"Closing connection due to error during produce request with correlation id ${request.header.correlationId} " +
           s"from client id ${request.header.clientId} with ack=0\n" +
           s"Topic and partition to exceptions: $exceptionsSummary"
       )
       requestChannel.closeConnection(request, new ProduceResponse(mergedResponseStatus.asJava).errorCounts)
     } else {
       // Note that although request throttling is exempt for acks == 0, the channel may be throttled due to
       // bandwidth quota violation.
       requestHelper.sendNoOpResponseExemptThrottle(request)
     }
   } else {
     requestChannel.sendResponse(request, new ProduceResponse(mergedResponseStatus.asJava, maxThrottleTimeMs), None)
   }
 }

 def processingStatsCallback(processingStats: FetchResponseStats): Unit = {
   processingStats.forKeyValue { (tp, info) =>
     updateRecordConversionStats(request, tp, info)
   }
 }

 if (authorizedRequestInfo.isEmpty)
   sendResponseCallback(Map.empty)
 else {
   val internalTopicsAllowed = request.header.clientId == AdminUtils.AdminClientId

   // call the replica manager to append messages to the replicas
   replicaManager.appendRecords(
     timeout = produceRequest.timeout.toLong,
     requiredAcks = produceRequest.acks,
     internalTopicsAllowed = internalTopicsAllowed,
     origin = AppendOrigin.Client,
     entriesPerPartition = authorizedRequestInfo,
     requestLocal = requestLocal,
     responseCallback = sendResponseCallback,
     recordConversionStatsCallback = processingStatsCallback)

   // if the request is put into the purgatory, it will have a held reference and hence cannot be garbage collected;
   // hence we clear its data here in order to let GC reclaim its memory since it is already appended to log
   produceRequest.clearPartitionRecords()
 }
}
  2. The implementation of ReplicaManager.scala#appendRecords() is shown below. In short, its core processing is:
    1. First call ReplicaManager.scala#isValidRequiredAcks() to check that the producer's acks (request.required.acks) setting is valid. This setting controls the message durability level and has three possible values (a client-side sketch follows the appendRecords() listing below):
    • 0: the producer does not wait for the server to finish writing the message; lowest latency
    • 1: the producer waits for the leader replica on the server to finish writing the message, so it can confirm the send succeeded
    • -1 (all): the producer waits until the leader replica and the follower replicas in its ISR (in-sync replicas) list have all finished writing the message
    2. Call ReplicaManager.scala#appendToLocalLog() to do the actual processing
    3. Once processing completes, decide based on the request parameters and the processing result whether to respond to the client immediately or to delay the response; either way, the core response handling runs in the sendResponseCallback callback mentioned in step 1
def appendRecords(timeout: Long,
                 requiredAcks: Short,
                 internalTopicsAllowed: Boolean,
                 origin: AppendOrigin,
                 entriesPerPartition: Map[TopicPartition, MemoryRecords],
                 responseCallback: Map[TopicPartition, PartitionResponse] => Unit,
                 delayedProduceLock: Option[Lock] = None,
                 recordConversionStatsCallback: Map[TopicPartition, RecordConversionStats] => Unit = _ => (),
                 requestLocal: RequestLocal = RequestLocal.NoCaching): Unit = {
 if (isValidRequiredAcks(requiredAcks)) {
   val sTime = time.milliseconds
   val localProduceResults = appendToLocalLog(internalTopicsAllowed = internalTopicsAllowed,
     origin, entriesPerPartition, requiredAcks, requestLocal)
   debug("Produce to local log in %d ms".format(time.milliseconds - sTime))

   val produceStatus = localProduceResults.map { case (topicPartition, result) =>
     topicPartition -> ProducePartitionStatus(
       result.info.lastOffset + 1, // required offset
       new PartitionResponse(
         result.error,
         result.info.firstOffset.map(_.messageOffset).getOrElse(-1),
         result.info.logAppendTime,
         result.info.logStartOffset,
         result.info.recordErrors.asJava,
         result.info.errorMessage
       )
     ) // response status
   }

   actionQueue.add {
     () =>
       localProduceResults.foreach {
         case (topicPartition, result) =>
           val requestKey = TopicPartitionOperationKey(topicPartition)
           result.info.leaderHwChange match {
             case LeaderHwChange.Increased =>
               // some delayed operations may be unblocked after HW changed
               delayedProducePurgatory.checkAndComplete(requestKey)
               delayedFetchPurgatory.checkAndComplete(requestKey)
               delayedDeleteRecordsPurgatory.checkAndComplete(requestKey)
             case LeaderHwChange.Same =>
               // probably unblock some follower fetch requests since log end offset has been updated
               delayedFetchPurgatory.checkAndComplete(requestKey)
             case LeaderHwChange.None =>
               // nothing
           }
       }
   }

   recordConversionStatsCallback(localProduceResults.map { case (k, v) => k -> v.info.recordConversionStats })

   if (delayedProduceRequestRequired(requiredAcks, entriesPerPartition, localProduceResults)) {
     // create delayed produce operation
     val produceMetadata = ProduceMetadata(requiredAcks, produceStatus)
     val delayedProduce = new DelayedProduce(timeout, produceMetadata, this, responseCallback, delayedProduceLock)

     // create a list of (topic, partition) pairs to use as keys for this delayed produce operation
     val producerRequestKeys = entriesPerPartition.keys.map(TopicPartitionOperationKey(_)).toSeq

     // try to complete the request immediately, otherwise put it into the purgatory
     // this is because while the delayed produce operation is being created, new
     // requests may arrive and hence make this operation completable.
     delayedProducePurgatory.tryCompleteElseWatch(delayedProduce, producerRequestKeys)

   } else {
     // we can respond immediately
     val produceResponseStatus = produceStatus.map { case (k, status) => k -> status.responseStatus }
     responseCallback(produceResponseStatus)
   }
 } else {
   // If required.acks is outside accepted range, something is wrong with the client
   // Just return an error and don't handle the request at all
   val responseStatus = entriesPerPartition.map { case (topicPartition, _) =>
     topicPartition -> new PartitionResponse(
       Errors.INVALID_REQUIRED_ACKS,
       LogAppendInfo.UnknownLogAppendInfo.firstOffset.map(_.messageOffset).getOrElse(-1),
       RecordBatch.NO_TIMESTAMP,
       LogAppendInfo.UnknownLogAppendInfo.logStartOffset
     )
   }
   responseCallback(responseStatus)
 }
}
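To make the three acks levels concrete from the client side, here is a minimal sketch using the standard Java producer client (the broker address localhost:9092 and the topic name demo-topic are placeholders of my own, not taken from the source above). It only illustrates the setting that isValidRequiredAcks() validates on the broker:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object AcksDemo extends App {
  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder broker address
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  // "0": fire-and-forget, "1": wait for the leader replica, "all"/"-1": wait for the full ISR
  props.put(ProducerConfig.ACKS_CONFIG, "all")

  val producer = new KafkaProducer[String, String](props)
  // send() returns a java.util.concurrent.Future; get() blocks until the broker has
  // acknowledged the write according to the acks level configured above
  producer.send(new ProducerRecord("demo-topic", "key", "value")).get()
  producer.close()
}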
  3. ReplicaManager.scala#appendToLocalLog() iterates over the Produce request's data partition by partition. For each topic partition, the processing has two main steps:
    1. Call ReplicaManager.scala#getPartitionOrException() to locate the Partition object of the target partition
    2. Call Partition.scala#appendRecordsToLeader() to append the message data
private def appendToLocalLog(internalTopicsAllowed: Boolean,
                            origin: AppendOrigin,
                            entriesPerPartition: Map[TopicPartition, MemoryRecords],
                            requiredAcks: Short,
                            requestLocal: RequestLocal): Map[TopicPartition, LogAppendResult] = {
 val traceEnabled = isTraceEnabled
 def processFailedRecord(topicPartition: TopicPartition, t: Throwable) = {
   val logStartOffset = onlinePartition(topicPartition).map(_.logStartOffset).getOrElse(-1L)
   brokerTopicStats.topicStats(topicPartition.topic).failedProduceRequestRate.mark()
   brokerTopicStats.allTopicsStats.failedProduceRequestRate.mark()
   error(s"Error processing append operation on partition $topicPartition", t)

   logStartOffset
 }

 if (traceEnabled)
   trace(s"Append [$entriesPerPartition] to local log")

 entriesPerPartition.map { case (topicPartition, records) =>
   brokerTopicStats.topicStats(topicPartition.topic).totalProduceRequestRate.mark()
   brokerTopicStats.allTopicsStats.totalProduceRequestRate.mark()

   // reject appending to internal topics if it is not allowed
   if (Topic.isInternal(topicPartition.topic) && !internalTopicsAllowed) {
     (topicPartition, LogAppendResult(
       LogAppendInfo.UnknownLogAppendInfo,
       Some(new InvalidTopicException(s"Cannot append to internal topic ${topicPartition.topic}"))))
   } else {
     try {
       val partition = getPartitionOrException(topicPartition)
       val info = partition.appendRecordsToLeader(records, origin, requiredAcks, requestLocal)
       val numAppendedMessages = info.numMessages

       // update stats for successfully appended bytes and messages as bytesInRate and messageInRate
       brokerTopicStats.topicStats(topicPartition.topic).bytesInRate.mark(records.sizeInBytes)
       brokerTopicStats.allTopicsStats.bytesInRate.mark(records.sizeInBytes)
       brokerTopicStats.topicStats(topicPartition.topic).messagesInRate.mark(numAppendedMessages)
       brokerTopicStats.allTopicsStats.messagesInRate.mark(numAppendedMessages)

       if (traceEnabled)
         trace(s"${records.sizeInBytes} written to log $topicPartition beginning at offset " +
           s"${info.firstOffset.getOrElse(-1)} and ending at offset ${info.lastOffset}")

       (topicPartition, LogAppendResult(info))
     } catch {
       // NOTE: Failed produce requests metric is not incremented for known exceptions
       // it is supposed to indicate un-expected failures of a broker in handling a produce request
       case e@ (_: UnknownTopicOrPartitionException |
                _: NotLeaderOrFollowerException |
                _: RecordTooLargeException |
                _: RecordBatchTooLargeException |
                _: CorruptRecordException |
                _: KafkaStorageException) =>
         (topicPartition, LogAppendResult(LogAppendInfo.UnknownLogAppendInfo, Some(e)))
       case rve: RecordValidationException =>
         val logStartOffset = processFailedRecord(topicPartition, rve.invalidException)
         val recordErrors = rve.recordErrors
         (topicPartition, LogAppendResult(LogAppendInfo.unknownLogAppendInfoWithAdditionalInfo(
           logStartOffset, recordErrors, rve.invalidException.getMessage), Some(rve.invalidException)))
       case t: Throwable =>
         val logStartOffset = processFailedRecord(topicPartition, t)
         (topicPartition, LogAppendResult(LogAppendInfo.unknownLogAppendInfoWithLogStartOffset(logStartOffset), Some(t)))
     }
   }
 }
}
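One detail worth calling out from the listing above: appends to Kafka's internal topics are rejected with InvalidTopicException unless internalTopicsAllowed is true, which handleProduceRequest() only sets when the request's clientId equals AdminUtils.AdminClientId. A minimal sketch of that guard (my own simplification, using __consumer_offsets and __transaction_state as the usual examples of internal topics):

object InternalTopicGuard extends App {
  // Kafka's internal topics must not be written to by ordinary producers;
  // only the broker's special admin client may append to them.
  val internalTopics = Set("__consumer_offsets", "__transaction_state")

  def mayAppend(topic: String, internalTopicsAllowed: Boolean): Boolean =
    !internalTopics.contains(topic) || internalTopicsAllowed

  assert(!mayAppend("__consumer_offsets", internalTopicsAllowed = false)) // InvalidTopicException path
  assert(mayAppend("__consumer_offsets", internalTopicsAllowed = true))
  assert(mayAppend("demo-topic", internalTopicsAllowed = false))          // normal topics are unaffected
}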
  4. Partition.scala#appendRecordsToLeader() performs the write while holding a read lock. The main processing is:
    1. First check the ISR state of the current leader replica; only when it satisfies the requirement is Log.scala#appendAsLeader() called to write the message data (a simplified sketch of this check follows the listing below)
    2. Call Partition.scala#maybeIncrementLeaderHW() to try to advance this partition's high watermark. This is mainly related to Kafka's leader-follower replication mechanism and is not covered in depth in this article
def appendRecordsToLeader(records: MemoryRecords, origin: AppendOrigin, requiredAcks: Int,
                         requestLocal: RequestLocal): LogAppendInfo = {
 val (info, leaderHWIncremented) = inReadLock(leaderIsrUpdateLock) {
   leaderLogIfLocal match {
     case Some(leaderLog) =>
       val minIsr = leaderLog.config.minInSyncReplicas
       val inSyncSize = isrState.isr.size

       // Avoid writing to leader if there are not enough insync replicas to make it safe
       if (inSyncSize < minIsr && requiredAcks == -1) {
         throw new NotEnoughReplicasException(s"The size of the current ISR ${isrState.isr} " +
           s"is insufficient to satisfy the min.isr requirement of $minIsr for partition $topicPartition")
       }

       val info = leaderLog.appendAsLeader(records, leaderEpoch = this.leaderEpoch, origin,
         interBrokerProtocolVersion, requestLocal)

       // we may need to increment high watermark since ISR could be down to 1
       (info, maybeIncrementLeaderHW(leaderLog))

     case None =>
       throw new NotLeaderOrFollowerException("Leader not local for partition %s on broker %d"
         .format(topicPartition, localBrokerId))
   }
 }

 info.copy(leaderHwChange = if (leaderHWIncremented) LeaderHwChange.Increased else LeaderHwChange.Same)
}
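The ISR check above boils down to a single condition. Below is a minimal sketch (my simplification, not Kafka code) of the decision appendRecordsToLeader() makes before writing: an acks = -1 request is refused with NotEnoughReplicasException when the current ISR has shrunk below the topic's min.insync.replicas, while acks = 0/1 writes are still accepted by the leader alone:

object MinIsrCheck extends App {
  // Simplified form of the guard in Partition.appendRecordsToLeader()
  def mayAppendAsLeader(isrSize: Int, minInSyncReplicas: Int, requiredAcks: Int): Boolean =
    !(requiredAcks == -1 && isrSize < minInSyncReplicas)

  assert(!mayAppendAsLeader(isrSize = 1, minInSyncReplicas = 2, requiredAcks = -1)) // NotEnoughReplicasException path
  assert(mayAppendAsLeader(isrSize = 1, minInSyncReplicas = 2, requiredAcks = 1))   // leader-only ack still allowed
}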
  5. Log.scala#appendAsLeader() is really just an entry point; the core work is done by Log.scala#append(). The method is fairly long, but its key steps are clear:
    1. First call Log.scala#analyzeAndValidateRecords() to analyze and validate the message data
    2. Call LogValidator.scala#validateMessagesAndAssignOffsets() for further validation and to assign an offset and timestamp to every record
    3. Call Log.scala#maybeRoll() to check whether the current active LogSegment can hold this batch of messages; if not, roll to a new LogSegment for storing the data. Rolling creates new message storage files; writing message data to a file actually goes through the file system into the page cache first and is flushed to disk later at an appropriate time (a simplified sketch of the roll decision follows the append() listing below)
    4. Then call Log.scala#analyzeAndValidateProducerState() to analyze and validate the producer's state on the server side, determining among other things whether the batch is a duplicate
    5. If the batch is not a duplicate, call LogSegment.scala#append() to write the message data; afterwards the server-side state kept for this producer is updated through the ProducerStateManager
def appendAsLeader(records: MemoryRecords,
                  leaderEpoch: Int,
                  origin: AppendOrigin = AppendOrigin.Client,
                  interBrokerProtocolVersion: ApiVersion = ApiVersion.latestVersion,
                  requestLocal: RequestLocal = RequestLocal.NoCaching): LogAppendInfo = {
 val validateAndAssignOffsets = origin != AppendOrigin.RaftLeader
 append(records, origin, interBrokerProtocolVersion, validateAndAssignOffsets, leaderEpoch, Some(requestLocal), ignoreRecordSize = false)
}

private def append(records: MemoryRecords,
                  origin: AppendOrigin,
                  interBrokerProtocolVersion: ApiVersion,
                  validateAndAssignOffsets: Boolean,
                  leaderEpoch: Int,
                  requestLocal: Option[RequestLocal],
                  ignoreRecordSize: Boolean): LogAppendInfo = {
 // We want to ensure the partition metadata file is written to the log dir before any log data is written to disk.
 // This will ensure that any log data can be recovered with the correct topic ID in the case of failure.
 maybeFlushMetadataFile()

 val appendInfo = analyzeAndValidateRecords(records, origin, ignoreRecordSize, leaderEpoch)

 // return if we have no valid messages or if this is a duplicate of the last appended entry
 if (appendInfo.shallowCount == 0) appendInfo
 else {

   // trim any invalid bytes or partial messages before appending it to the on-disk log
   var validRecords = trimInvalidBytes(records, appendInfo)

   // they are valid, insert them in the log
   lock synchronized {
     maybeHandleIOException(s"Error while appending records to $topicPartition in dir ${dir.getParent}") {
       checkIfMemoryMappedBufferClosed()
       if (validateAndAssignOffsets) {
         // assign offsets to the message set
         val offset = new LongRef(nextOffsetMetadata.messageOffset)
         appendInfo.firstOffset = Some(LogOffsetMetadata(offset.value))
         val now = time.milliseconds
         val validateAndOffsetAssignResult = try {
           LogValidator.validateMessagesAndAssignOffsets(validRecords,
             topicPartition,
             offset,
             time,
             now,
             appendInfo.sourceCodec,
             appendInfo.targetCodec,
             config.compact,
             config.recordVersion.value,
             config.messageTimestampType,
             config.messageTimestampDifferenceMaxMs,
             leaderEpoch,
             origin,
             interBrokerProtocolVersion,
             brokerTopicStats,
             requestLocal.getOrElse(throw new IllegalArgumentException(
               "requestLocal should be defined if assignOffsets is true")))
         } catch {
           case e: IOException =>
             throw new KafkaException(s"Error validating messages while appending to log $name", e)
         }
         validRecords = validateAndOffsetAssignResult.validatedRecords
         appendInfo.maxTimestamp = validateAndOffsetAssignResult.maxTimestamp
         appendInfo.offsetOfMaxTimestamp = validateAndOffsetAssignResult.shallowOffsetOfMaxTimestamp
         appendInfo.lastOffset = offset.value - 1
         appendInfo.recordConversionStats = validateAndOffsetAssignResult.recordConversionStats
         if (config.messageTimestampType == TimestampType.LOG_APPEND_TIME)
           appendInfo.logAppendTime = now

         // re-validate message sizes if there's a possibility that they have changed (due to re-compression or message
         // format conversion)
         if (!ignoreRecordSize && validateAndOffsetAssignResult.messageSizeMaybeChanged) {
           validRecords.batches.forEach { batch =>
             if (batch.sizeInBytes > config.maxMessageSize) {
               // we record the original message set size instead of the trimmed size
               // to be consistent with pre-compression bytesRejectedRate recording
               brokerTopicStats.topicStats(topicPartition.topic).bytesRejectedRate.mark(records.sizeInBytes)
               brokerTopicStats.allTopicsStats.bytesRejectedRate.mark(records.sizeInBytes)
               throw new RecordTooLargeException(s"Message batch size is ${batch.sizeInBytes} bytes in append to" +
                 s"partition $topicPartition which exceeds the maximum configured size of ${config.maxMessageSize}.")
             }
           }
         }
       } else {
         // we are taking the offsets we are given
         if (!appendInfo.offsetsMonotonic)
           throw new OffsetsOutOfOrderException(s"Out of order offsets found in append to $topicPartition: " +
             records.records.asScala.map(_.offset))

         if (appendInfo.firstOrLastOffsetOfFirstBatch < nextOffsetMetadata.messageOffset) {
           // we may still be able to recover if the log is empty
           // one example: fetching from log start offset on the leader which is not batch aligned,
           // which may happen as a result of AdminClient#deleteRecords()
           val firstOffset = appendInfo.firstOffset match {
             case Some(offsetMetadata) => offsetMetadata.messageOffset
             case None => records.batches.asScala.head.baseOffset()
           }

           val firstOrLast = if (appendInfo.firstOffset.isDefined) "First offset" else "Last offset of the first batch"
           throw new UnexpectedAppendOffsetException(
             s"Unexpected offset in append to $topicPartition. $firstOrLast " +
               s"${appendInfo.firstOrLastOffsetOfFirstBatch} is less than the next offset ${nextOffsetMetadata.messageOffset}. " +
               s"First 10 offsets in append: ${records.records.asScala.take(10).map(_.offset)}, last offset in" +
               s" append: ${appendInfo.lastOffset}. Log start offset = $logStartOffset",
             firstOffset, appendInfo.lastOffset)
         }
       }

       // update the epoch cache with the epoch stamped onto the message by the leader
       validRecords.batches.forEach { batch =>
         if (batch.magic >= RecordBatch.MAGIC_VALUE_V2) {
           maybeAssignEpochStartOffset(batch.partitionLeaderEpoch, batch.baseOffset)
         } else {
           // In partial upgrade scenarios, we may get a temporary regression to the message format. In
           // order to ensure the safety of leader election, we clear the epoch cache so that we revert
           // to truncation by high watermark after the next leader election.
           leaderEpochCache.filter(_.nonEmpty).foreach { cache =>
             warn(s"Clearing leader epoch cache after unexpected append with message format v${batch.magic}")
             cache.clearAndFlush()
           }
         }
       }

       // check messages set size may be exceed config.segmentSize
       if (validRecords.sizeInBytes > config.segmentSize) {
         throw new RecordBatchTooLargeException(s"Message batch size is ${validRecords.sizeInBytes} bytes in append " +
           s"to partition $topicPartition, which exceeds the maximum configured segment size of ${config.segmentSize}.")
       }

       // maybe roll the log if this segment is full
       val segment = maybeRoll(validRecords.sizeInBytes, appendInfo)

       val logOffsetMetadata = LogOffsetMetadata(
         messageOffset = appendInfo.firstOrLastOffsetOfFirstBatch,
         segmentBaseOffset = segment.baseOffset,
         relativePositionInSegment = segment.size)

       // now that we have valid records, offsets assigned, and timestamps updated, we need to
       // validate the idempotent/transactional state of the producers and collect some metadata
       val (updatedProducers, completedTxns, maybeDuplicate) = analyzeAndValidateProducerState(
         logOffsetMetadata, validRecords, origin)

       maybeDuplicate match {
         case Some(duplicate) =>
           appendInfo.firstOffset = Some(LogOffsetMetadata(duplicate.firstOffset))
           appendInfo.lastOffset = duplicate.lastOffset
           appendInfo.logAppendTime = duplicate.timestamp
           appendInfo.logStartOffset = logStartOffset
         case None =>
           // Before appending update the first offset metadata to include segment information
           appendInfo.firstOffset = appendInfo.firstOffset.map { offsetMetadata =>
             offsetMetadata.copy(segmentBaseOffset = segment.baseOffset, relativePositionInSegment = segment.size)
           }

           segment.append(largestOffset = appendInfo.lastOffset,
             largestTimestamp = appendInfo.maxTimestamp,
             shallowOffsetOfMaxTimestamp = appendInfo.offsetOfMaxTimestamp,
             records = validRecords)

           // Increment the log end offset. We do this immediately after the append because a
           // write to the transaction index below may fail and we want to ensure that the offsets
           // of future appends still grow monotonically. The resulting transaction index inconsistency
           // will be cleaned up after the log directory is recovered. Note that the end offset of the
           // ProducerStateManager will not be updated and the last stable offset will not advance
           // if the append to the transaction index fails.
           updateLogEndOffset(appendInfo.lastOffset + 1)

           // update the producer state
           updatedProducers.values.foreach(producerAppendInfo => producerStateManager.update(producerAppendInfo))

           // update the transaction index with the true last stable offset. The last offset visible
           // to consumers using READ_COMMITTED will be limited by this value and the high watermark.
           completedTxns.foreach { completedTxn =>
             val lastStableOffset = producerStateManager.lastStableOffset(completedTxn)
             segment.updateTxnIndex(completedTxn, lastStableOffset)
             producerStateManager.completeTxn(completedTxn)
           }

           // always update the last producer id map offset so that the snapshot reflects the current offset
           // even if there isn't any idempotent data being written
           producerStateManager.updateMapEndOffset(appendInfo.lastOffset + 1)

           // update the first unstable offset (which is used to compute LSO)
           maybeIncrementFirstUnstableOffset()

           trace(s"Appended message set with last offset: ${appendInfo.lastOffset}, " +
             s"first offset: ${appendInfo.firstOffset}, " +
             s"next offset: ${nextOffsetMetadata.messageOffset}, " +
             s"and messages: $validRecords")

           if (unflushedMessages >= config.flushInterval) flush()
       }
       appendInfo
     }
   }
 }
}
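As a footnote to step 3 above, the size-based part of the roll decision and the naming of a new segment file can be pictured with a small sketch (my simplification; the real LogSegment.shouldRoll() also considers segment age and whether the offset/time indexes are full). Kafka names each segment file after the base offset of its first message, zero-padded to 20 digits:

object SegmentRollSketch extends App {
  // Simplified size-based roll check: if the active segment cannot hold the incoming
  // batch, Log.maybeRoll() creates a new segment
  def needsRoll(activeSegmentBytes: Int, batchBytes: Int, segmentMaxBytes: Int): Boolean =
    activeSegmentBytes + batchBytes > segmentMaxBytes

  // Segment files are named after their base offset, zero-padded to 20 digits
  def segmentFileName(baseOffset: Long): String = f"$baseOffset%020d.log"

  assert(!needsRoll(activeSegmentBytes = 800, batchBytes = 200, segmentMaxBytes = 1024))
  assert(needsRoll(activeSegmentBytes = 1000, batchBytes = 200, segmentMaxBytes = 1024))
  assert(segmentFileName(368769L) == "00000000000000368769.log")
}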
  6. The write performed by LogSegment.scala#append() is quite concise and roughly breaks down into the following steps:
    1. First call LogSegment.scala#ensureOffsetInRange() to check, via the offset index OffsetIndex, that the offset of the messages to be appended is within the allowed range
    2. Then call FileRecords.java#append() to write the message data to the file
    3. Based on the index interval bytes setting (index.interval.bytes) and the bytesSinceLastIndexEntry accumulator, decide whether to add an index entry. This is exactly the sparse-index nature of Kafka's indexes that I mentioned in earlier articles (a lookup sketch follows the listing below)
@nonthreadsafe
def append(largestOffset: Long,
          largestTimestamp: Long,
          shallowOffsetOfMaxTimestamp: Long,
          records: MemoryRecords): Unit = {
 if (records.sizeInBytes > 0) {
   trace(s"Inserting ${records.sizeInBytes} bytes at end offset $largestOffset at position ${log.sizeInBytes} " +
         s"with largest timestamp $largestTimestamp at shallow offset $shallowOffsetOfMaxTimestamp")
   val physicalPosition = log.sizeInBytes()
   if (physicalPosition == 0)
     rollingBasedTimestamp = Some(largestTimestamp)

   ensureOffsetInRange(largestOffset)

   // append the messages
   val appendedBytes = log.append(records)
   trace(s"Appended $appendedBytes to ${log.file} at end offset $largestOffset")
   // Update the in memory max timestamp and corresponding offset.
   if (largestTimestamp > maxTimestampSoFar) {
     maxTimestampAndOffsetSoFar = TimestampOffset(largestTimestamp, shallowOffsetOfMaxTimestamp)
   }
   // append an entry to the index (if needed)
   if (bytesSinceLastIndexEntry > indexIntervalBytes) {
     offsetIndex.append(largestOffset, physicalPosition)
     timeIndex.maybeAppend(maxTimestampSoFar, offsetOfMaxTimestampSoFar)
     bytesSinceLastIndexEntry = 0
   }
   bytesSinceLastIndexEntry += records.sizeInBytes
 }
}
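To illustrate the sparse-index point in step 3 above, here is a minimal sketch (my own illustration, not Kafka's OffsetIndex, which is a binary-searched memory-mapped file): only roughly every index.interval.bytes of appended data gets an (offset, physical position) entry, so a lookup finds the greatest indexed offset that is not larger than the target and scans the log file forward from that position:

object SparseIndexSketch extends App {
  // offset -> physical byte position in the segment's .log file,
  // one entry per ~index.interval.bytes of appended data, kept in offset order
  val sparseIndex: Vector[(Long, Int)] = Vector(
    0L    -> 0,
    4096L -> 65536,
    8192L -> 131072
  )

  // Starting file position for a target offset: greatest indexed offset <= target
  def lookup(targetOffset: Long): Int =
    sparseIndex.takeWhile { case (offset, _) => offset <= targetOffset }
      .lastOption.map(_._2).getOrElse(0)

  // A read for offset 5000 starts scanning at byte 65536 instead of at the file start
  assert(lookup(5000L) == 65536)
}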
  7. The implementation of FileRecords.java#append(), which writes the message data to the file system, is shown below. It essentially calls MemoryRecords.java#writeFullyTo() to write the data into a FileChannel (a short NIO sketch follows the listing below). With that, the server-side handling of message storage is essentially complete
public int append(MemoryRecords records) throws IOException {
     if (records.sizeInBytes() > Integer.MAX_VALUE - size.get())
         throw new IllegalArgumentException("Append of size " + records.sizeInBytes() +
                 " bytes is too large for segment with current file position at " + size.get());

     int written = records.writeFullyTo(channel);
     size.getAndAdd(written);
     return written;
 }
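For completeness, the sketch below (my simplification based on standard NIO behavior, not a quote of MemoryRecords.java#writeFullyTo()) shows what writing a buffer "fully" means: FileChannel.write() may write fewer bytes than remain in the buffer, so the call is looped until the buffer is drained. The write only puts the bytes into the OS page cache; they reach disk when the channel is forced or when Kafka's own flush policy (see flush() / config.flushInterval in the append() code earlier) kicks in. The file name demo-segment.log is a placeholder:

import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

object WriteFullySketch extends App {
  // Loop until every remaining byte of the buffer has been handed to the channel
  def writeFully(channel: FileChannel, buffer: ByteBuffer): Int = {
    var written = 0
    while (buffer.hasRemaining)
      written += channel.write(buffer)
    written
  }

  val channel = FileChannel.open(Paths.get("demo-segment.log"), // placeholder file name
    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)
  val written = writeFully(channel, ByteBuffer.wrap("hello kafka".getBytes("UTF-8")))
  println(s"$written bytes appended")
  channel.force(true) // explicit flush to disk; Kafka defers this to its flush policy
  channel.close()
}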