Kafka uses a Replica object to represent one replica of a partition:
class Replica(val brokerId: Int, // ID of the broker hosting this replica
val partition: Partition, // the partition this replica belongs to
time: Time = SystemTime,
initialHighWatermarkValue: Long = 0L,
// the Log object backing a local replica (None for a remote replica)
val log: Option[Log] = None) extends Logging {
// the high watermark offset value, in non-leader replicas only its message offsets are kept
// Consumers can only read messages below the HW. The leader replica maintains this field; other replicas only keep a copy of the offset.
@volatile private[this] var highWatermarkMetadata: LogOffsetMetadata = new LogOffsetMetadata(initialHighWatermarkValue)
// the log end offset value, kept in all replicas;
// for local replica it is the log's end offset, for remote replicas its value is only updated by follower fetch
// LEO: a local replica reads it directly from Log.nextOffsetMetadata; for a remote replica it is updated when that follower fetches from the leader.
@volatile private[this] var logEndOffsetMetadata: LogOffsetMetadata = LogOffsetMetadata.UnknownOffsetMetadata
val topic = partition.topic
val partitionId = partition.partitionId
// records the last time this follower caught up with the leader replica
private[this] val lastCaughtUpTimeMsUnderlying = new AtomicLong(time.milliseconds)
}
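The relationship between the two offsets kept by a replica can be illustrated with a small model. This is only a sketch: ReplicaOffsets and its methods are hypothetical names that do not exist in Kafka, and the model assumes that the HW never exceeds the LEO.

```scala
object HighWatermarkSketch {
  // Hypothetical, simplified model of the two offsets tracked per replica.
  final case class ReplicaOffsets(highWatermark: Long, logEndOffset: Long) {
    require(highWatermark <= logEndOffset, "assumption: HW never exceeds LEO")

    // Consumers may only read offsets strictly below the HW.
    def isVisibleToConsumer(offset: Long): Boolean =
      offset >= 0 && offset < highWatermark

    // Messages already appended to the log but not yet exposed to consumers.
    def uncommittedCount: Long = logEndOffset - highWatermark
  }
}
```

With a HW of 5 and a LEO of 8, offsets 0 to 4 are readable and three messages are still waiting for the HW to advance.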
The Partition class:
class Partition(val topic: String, // topic name and partition number
val partitionId: Int,
time: Time,
replicaManager: ReplicaManager) extends Logging with KafkaMetricsGroup {
private val localBrokerId = replicaManager.config.brokerId
// the LogManager of this broker
private val logManager = replicaManager.logManager
// helper class for ZooKeeper operations
private val zkUtils = replicaManager.zkUtils
// all replicas of this partition (the AR set), keyed by broker ID
private val assignedReplicaMap = new Pool[Int, Replica]
// The read lock is only required when multiple reads are executed and needs to be in a consistent manner
private val leaderIsrUpdateLock = new ReentrantReadWriteLock()
private var zkVersion: Int = LeaderAndIsr.initialZKVersion
// epoch of the leader replica
@volatile private var leaderEpoch: Int = LeaderAndIsr.initialLeaderEpoch - 1
// ID of the leader replica of this partition
@volatile var leaderReplicaIdOpt: Option[Int] = None
// the ISR (in-sync replicas) set of this partition
@volatile var inSyncReplicas: Set[Replica] = Set.empty[Replica]
/* Epoch of the controller that last changed the leader. This needs to be initialized correctly upon broker startup.
* One way of doing that is through the controller's start replica state change command. When a new broker starts up
* the controller sends it a start replica command containing the leader for each partition that the broker hosts.
* In addition to the leader, the controller can also send the epoch of the controller that elected the leader for
* each partition. */
private var controllerEpoch: Int = KafkaController.InitialControllerEpoch - 1
this.logIdent = "Partition [%s,%d] on broker %d: ".format(topic, partitionId, localBrokerId)
private def isReplicaLocal(replicaId: Int) : Boolean = (replicaId == localBrokerId)
val tags = Map("topic" -> topic, "partition" -> partitionId.toString)
}
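The methods of Partition fall into five groups:
1. Creating replica objects: getOrCreateReplica
2. Switching a replica between the leader and follower roles: makeLeader and makeFollower
3. Managing the ISR: maybeExpandIsr() and maybeShrinkIsr()
4. Appending messages: appendMessagesToLeader
5. Checking the HW position: checkEnoughReplicasReachOffset
Creating a replica:
// Look up the Replica object for the given replica in the AR set (assignedReplicaMap); if it is not found, create it and add it to the AR set.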
def getOrCreateReplica(replicaId: Int = localBrokerId): Replica = {
// look up the Replica object for the given replica in the AR set (assignedReplicaMap)
val replicaOpt = getReplica(replicaId)
replicaOpt match {
// the Replica object already exists, return it directly
case Some(replica) => replica
case None =>
// check whether this is the local replica
if (isReplicaLocal(replicaId)) {
// load the log configuration; values stored in ZooKeeper override the defaults
val config = LogConfig.fromProps(logManager.defaultConfig.originals,
AdminUtils.fetchEntityConfig(zkUtils, ConfigType.Topic, topic))
// create the Log for the local replica, including its directory; if the Log already exists it is simply returned
val log = logManager.createLog(TopicAndPartition(topic, partitionId), config)
// get the OffsetCheckpoint object for this log directory; it manages the replication-offset-checkpoint file in that directory
val checkpoint = replicaManager.highWatermarkCheckpoints(log.dir.getParentFile.getAbsolutePath)
// read the HW values from the replication-offset-checkpoint file
val offsetMap = checkpoint.read
if (!offsetMap.contains(TopicAndPartition(topic, partitionId)))
info("No checkpointed highwatermark is found for partition [%s,%d]".format(topic, partitionId))
// use the smaller of the checkpointed offset and the LEO as the initial HW of this replica
val offset = offsetMap.getOrElse(TopicAndPartition(topic, partitionId), 0L).min(log.logEndOffset)
// create the Replica object and add it to assignedReplicaMap
val localReplica = new Replica(replicaId, this, time, offset, Some(log))
addReplicaIfNotExists(localReplica)
} else {
// remote replica: create the Replica object directly and add it to the AR set
val remoteReplica = new Replica(replicaId, this, time)
addReplicaIfNotExists(remoteReplica)
}
// return the newly created Replica
getReplica(replicaId).get
}
}
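The way the initial HW of a local replica is derived can be isolated into a few lines. The sketch below is not Kafka code: the helper name and the use of a plain Map keyed by a partition string are assumptions made for illustration.

```scala
object InitialHighWatermarkSketch {
  // Hypothetical helper. `checkpointed` stands in for the content of the
  // replication-offset-checkpoint file (partition name -> checkpointed HW).
  def initialHighWatermark(checkpointed: Map[String, Long],
                           partition: String,
                           logEndOffset: Long): Long = {
    // No checkpoint entry means the HW starts at 0.
    val checkpointedHw = checkpointed.getOrElse(partition, 0L)
    // The HW may never point past the end of the local log.
    math.min(checkpointedHw, logEndOffset)
  }
}
```

Capping the value at the LEO covers the case where the checkpoint file records a HW that is larger than what the local log actually contains.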
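Switching replica roles
The broker switches a replica between the leader and follower roles according to the LeaderAndIsrRequest sent by the KafkaController. Partition.makeLeader() is one of the key steps in handling a LeaderAndIsrRequest: it turns the local replica into the leader replica.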
The per-partition state carried by the request is represented by the Java class PartitionState (only its fields are shown):
public class PartitionState {
public final int controllerEpoch;
// ID of the leader replica
public final int leader;
public final int leaderEpoch;
// the ISR, stored as broker IDs
public final List<Integer> isr;
public final int zkVersion;
// the AR (assigned replicas) set
public final Set<Integer> replicas;
}
/*
* Make the local replica the leader by resetting LogEndOffset for remote replicas (there could be old LogEndOffset
* from the time when this broker was the leader last time) and setting the new leader and ISR.
* If the leader replica id does not change, return false to indicate the replica manager.
*/
def makeLeader(controllerId: Int, partitionStateInfo: PartitionState, correlationId: Int): Boolean = {
val (leaderHWIncremented, isNewLeader) = inWriteLock(leaderIsrUpdateLock) {
// get the AR set assigned by the controller
val allReplicas = partitionStateInfo.replicas.asScala.map(_.toInt)
// record the epoch of the controller that made the leadership decision. This is useful while updating the isr
// to maintain the decision maker controller's epoch in the zookeeper path
controllerEpoch = partitionStateInfo.controllerEpoch
// add replicas that are new
// create a Replica object for every replica in the AR set
allReplicas.foreach(replica => getOrCreateReplica(replica))
// get the ISR set
val newInSyncReplicas = partitionStateInfo.isr.asScala.map(r => getOrCreateReplica(r)).toSet
// remove assigned replicas that have been removed by the controller
// update the assigned replicas according to allReplicas
(assignedReplicas().map(_.brokerId) -- allReplicas).foreach(removeReplica(_))
// update the Partition fields: ISR, leaderEpoch and zkVersion
inSyncReplicas = newInSyncReplicas
leaderEpoch = partitionStateInfo.leaderEpoch
zkVersion = partitionStateInfo.zkVersion
// check whether the leader has changed
val isNewLeader =
if (leaderReplicaIdOpt.isDefined && leaderReplicaIdOpt.get == localBrokerId) {
// the leader was already on this broker, nothing changed
false
} else {
// the leader was not on this broker before, so update leaderReplicaIdOpt
leaderReplicaIdOpt = Some(localBrokerId)
true
}
// get the local Replica
val leaderReplica = getReplica().get
// we may need to increment high watermark since ISR could be down to 1
if (isNewLeader) {
// construct the high watermark metadata for the new leader replica
// initialize the HW metadata of the leader
leaderReplica.convertHWToLocalOffsetMetadata()
// reset log end offset for remote replicas
// reset the LEO of every remote replica to -1
assignedReplicas.filter(_.brokerId != localBrokerId).foreach(_.updateLogReadResult(LogReadResult.UnknownLogReadResult))
}
// try to advance the HW
(maybeIncrementLeaderHW(leaderReplica), isNewLeader)
}
// some delayed operations may be unblocked after HW changed
// if the HW advanced, some DelayedFetch operations may now be satisfied, so check them here
if (leaderHWIncremented)
tryCompleteDelayedRequests()
isNewLeader
}
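maybeIncrementLeaderHW tries to advance the HW of the leader replica. The new HW is the smallest LEO in the ISR, so whenever the ISR grows or shrinks, or the LEO of a replica in the ISR changes, that minimum may increase, and maybeIncrementLeaderHW has to be called to check: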
private def maybeIncrementLeaderHW(leaderReplica: Replica): Boolean = {
// collect the LEO of every replica in the ISR
val allLogEndOffsets = inSyncReplicas.map(_.logEndOffset)
// the smallest LEO in the ISR is the candidate for the new HW
val newHighWatermark = allLogEndOffsets.min(new LogOffsetMetadata.OffsetOrdering)
// get the current HW
val oldHighWatermark = leaderReplica.highWatermark
// compare the two values and update the HW only if it moves forward
if (oldHighWatermark.messageOffset < newHighWatermark.messageOffset || oldHighWatermark.onOlderSegment(newHighWatermark)) {
leaderReplica.highWatermark = newHighWatermark
debug("High watermark for partition [%s,%d] updated to %s".format(topic, partitionId, newHighWatermark))
true
} else {
debug("Skipping update high watermark since Old hw %s is larger than new hw %s for partition [%s,%d]. All leo's are %s"
.format(oldHighWatermark, newHighWatermark, topic, partitionId, allLogEndOffsets.mkString(",")))
false
}
}
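The core rule of maybeIncrementLeaderHW can be sketched on plain offsets. The function below is a simplified illustration with a hypothetical name; it ignores the log segment information that the real method also compares through onOlderSegment.

```scala
object LeaderHwSketch {
  // Hypothetical helper working on plain Long offsets.
  // Returns Some(newHw) when the HW can advance, None otherwise.
  def maybeAdvance(currentHw: Long, isrLogEndOffsets: Iterable[Long]): Option[Long] = {
    require(isrLogEndOffsets.nonEmpty, "the ISR always contains at least the leader")
    // The candidate HW is the smallest LEO among the in-sync replicas.
    val candidate = isrLogEndOffsets.min
    // The HW only ever moves forward.
    if (candidate > currentHw) Some(candidate) else None
  }
}
```

For example, with a current HW of 5 and ISR LEOs of 10, 7 and 9, the HW advances to 7, the offset that every in-sync replica has reached.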
makeFollower turns the local replica into a follower replica:
/**
* Make the local replica the follower by setting the new leader and ISR to empty
* If the leader replica id does not change, return false to indicate the replica manager
*/
def makeFollower(controllerId: Int, partitionStateInfo: PartitionState, correlationId: Int): Boolean = {
// acquire the write lock
inWriteLock(leaderIsrUpdateLock) {
// get the AR set assigned by the controller
val allReplicas = partitionStateInfo.replicas.asScala.map(_.toInt)
// get the broker ID of the new leader
val newLeaderBrokerId: Int = partitionStateInfo.leader
// record the epoch of the controller that made the leadership decision. This is useful while updating the isr
// to maintain the decision maker controller's epoch in the zookeeper path
controllerEpoch = partitionStateInfo.controllerEpoch
// add replicas that are new
// create the corresponding Replica objects
allReplicas.foreach(r => getOrCreateReplica(r))
// remove assigned replicas that have been removed by the controller
// update the AR set according to partitionStateInfo
(assignedReplicas().map(_.brokerId) -- allReplicas).foreach(removeReplica(_))
// empty set: the ISR is maintained on the leader replica only, a follower keeps no ISR information
inSyncReplicas = Set.empty[Replica]
leaderEpoch = partitionStateInfo.leaderEpoch
zkVersion = partitionStateInfo.zkVersion
// check whether the leader has changed
if (leaderReplicaIdOpt.isDefined && leaderReplicaIdOpt.get == newLeaderBrokerId) {
false
}
else {
// update the leaderReplicaIdOpt field
leaderReplicaIdOpt = Some(newLeaderBrokerId)
true
}
}
}
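Both makeLeader and makeFollower report whether the leader actually changed. That decision can be sketched as a pure function; switchLeader is a hypothetical name, and the real methods mutate leaderReplicaIdOpt under the write lock instead of returning a new value.

```scala
object LeaderSwitchSketch {
  // Hypothetical pure version of the leader-change check.
  // Returns the leader id to store and a flag telling whether it changed.
  def switchLeader(currentLeader: Option[Int], newLeader: Int): (Option[Int], Boolean) =
    if (currentLeader.contains(newLeader)) (currentLeader, false) // same leader as before
    else (Some(newLeader), true)                                  // unknown or different leader
}
```

A partition that has no known leader yet (None) always reports a change, which matches the initial value of leaderReplicaIdOpt.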
Managing the ISR
Besides switching replica roles, Partition also manages the ISR, through maybeExpandIsr() and maybeShrinkIsr().
/**
* Check and maybe expand the ISR of the partition.
*
* This function can be triggered when a replica's LEO has incremented
*/
def maybeExpandIsr(replicaId: Int) {
// only the leader replica manages the ISR, so first get the Replica object of the local leader
val leaderHWIncremented = inWriteLock(leaderIsrUpdateLock) {
// check if this replica needs to be added to the ISR
leaderReplicaIfLocal() match {
case Some(leaderReplica) =>
val replica = getReplica(replicaId).get
// get the current HW
val leaderHW = leaderReplica.highWatermark
// three conditions: the follower is not yet in the ISR, it belongs to the AR set, and its LEO has caught up with the HW
if(!inSyncReplicas.contains(replica) &&
assignedReplicas.map(_.brokerId).contains(replicaId) &&
replica.logEndOffset.offsetDiff(leaderHW) >= 0) {
// add the follower to the ISR to form the new ISR set
val newInSyncReplicas = inSyncReplicas + replica
info("Expanding ISR for partition [%s,%d] from %s to %s"
.format(topic, partitionId, inSyncReplicas.map(_.brokerId).mkString(","),
newInSyncReplicas.map(_.brokerId).mkString(",")))
// update ISR in ZK and cache
// persist the new ISR to ZooKeeper and update the cached inSyncReplicas field
updateIsr(newInSyncReplicas)
// record the ISR expansion in the metrics
replicaManager.isrExpandRate.mark()
}
// check if the HW of the partition can now be incremented
// since the replica maybe now be in the ISR and its LEO has just incremented
// try to advance the HW
maybeIncrementLeaderHW(leaderReplica)
case None => false // nothing to do if no longer leader
}
}
// try to complete delayed operations
if (leaderHWIncremented)
tryCompleteDelayedRequests()
}
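The three conditions checked before a follower joins the ISR can be written as a single predicate. This is a sketch on broker IDs and plain offsets; shouldJoinIsr is a hypothetical name, not a Kafka method.

```scala
object IsrExpandSketch {
  // Hypothetical predicate mirroring the if-condition in maybeExpandIsr.
  def shouldJoinIsr(replicaId: Int,
                    isr: Set[Int],
                    assignedReplicas: Set[Int],
                    followerLeo: Long,
                    leaderHw: Long): Boolean =
    !isr.contains(replicaId) &&             // not in the ISR yet
      assignedReplicas.contains(replicaId) && // still assigned to this partition
      followerLeo >= leaderHw                 // has caught up with the HW
}
```

Note that the follower only has to reach the HW, not the LEO of the leader, to be added.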
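In a distributed system, network communication between nodes can block or be delayed, so some followers in the ISR may be unable to keep in sync with the leader. If a ProducerRequest uses acks = -1 in that situation, it has to wait for a long time.
To avoid this, Partition shrinks the ISR. The logic lives in maybeShrinkIsr, which ReplicaManager calls periodically from a scheduled task to check how far each follower in the ISR lags behind the leader and to remove the lagging followers from the ISR.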
def maybeShrinkIsr(replicaMaxLagTimeMs: Long) {
val leaderHWIncremented = inWriteLock(leaderIsrUpdateLock) {
// get the Replica object of the leader replica
leaderReplicaIfLocal() match {
case Some(leaderReplica) =>
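// check the lastCaughtUpTimeMsUnderlying field of each follower to find the lagging followers, which are then removed from the ISR
// a follower that has not fetched from the leader for a long time, or whose LEO trails too far behind, both show up in this field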
val outOfSyncReplicas = getOutOfSyncReplicas(leaderReplica, replicaMaxLagTimeMs)
if(outOfSyncReplicas.size > 0) {
val newInSyncReplicas = inSyncReplicas -- outOfSyncReplicas
assert(newInSyncReplicas.size > 0)
info("Shrinking ISR for partition [%s,%d] from %s to %s".format(topic, partitionId,
inSyncReplicas.map(_.brokerId).mkString(","), newInSyncReplicas.map(_.brokerId).mkString(",")))
// build the new ISR and store it in ZooKeeper
updateIsr(newInSyncReplicas)
// we may need to increment high watermark since ISR could be down to 1
replicaManager.isrShrinkRate.mark()
// try to advance the HW
maybeIncrementLeaderHW(leaderReplica)
} else {
false
}
case None => false // do nothing if no longer leader
}
}
// some delayed operations may be unblocked after HW changed
// try to complete delayed operations
if (leaderHWIncremented)
tryCompleteDelayedRequests()
}
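The body of getOutOfSyncReplicas is not shown above. Assuming it compares the time since each follower last caught up with replicaMaxLagTimeMs, as the comments in maybeShrinkIsr suggest, the selection could look like the following sketch. Follower and outOfSync are hypothetical names.

```scala
object IsrShrinkSketch {
  // Hypothetical stand-in for a follower Replica and its last caught-up timestamp.
  final case class Follower(brokerId: Int, lastCaughtUpTimeMs: Long)

  // Assumption: a follower is out of sync once it has not caught up
  // with the leader for more than maxLagMs milliseconds.
  def outOfSync(followers: Set[Follower], nowMs: Long, maxLagMs: Long): Set[Follower] =
    followers.filter(f => nowMs - f.lastCaughtUpTimeMs > maxLagMs)
}
```

The leader itself is never a candidate, which is why the new ISR in maybeShrinkIsr can be asserted to be non-empty.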
Appending messages
Within a partition, only the leader replica handles read and write requests. appendMessagesToLeader appends messages to the Log of the leader replica.
def appendMessagesToLeader(messages: ByteBufferMessageSet, requiredAcks: Int = 0) = {
val (info, leaderHWIncremented) = inReadLock(leaderIsrUpdateLock) {
// get the Replica object of the leader replica
val leaderReplicaOpt = leaderReplicaIfLocal()
leaderReplicaOpt match {
case Some(leaderReplica) =>
val log = leaderReplica.log.get
// get the configured minimum ISR size
val minIsr = log.config.minInSyncReplicas
val inSyncSize = inSyncReplicas.size
// Avoid writing to leader if there are not enough insync replicas to make it safe
// the ISR is smaller than required: throw NotEnoughReplicasException back to the producer
if (inSyncSize < minIsr && requiredAcks == -1) {
throw new NotEnoughReplicasException("Number of insync replicas for partition [%s,%d] is [%d], below required minimum [%d]"
.format(topic, partitionId, inSyncSize, minIsr))
}
// append to the Log of the leader replica
val info = log.append(messages, assignOffsets = true)
// probably unblock some follower fetch requests since log end offset has been updated
// try to complete DelayedFetch operations
replicaManager.tryCompleteDelayedFetch(new TopicPartitionOperationKey(this.topic, this.partitionId))
// we may need to increment high watermark since ISR could be down to 1
// try to advance the HW
(info, maybeIncrementLeaderHW(leaderReplica))
case None =>
throw new NotLeaderForPartitionException("Leader not local for partition [%s,%d] on broker %d"
.format(topic, partitionId, localBrokerId))
}
}
// some delayed operations may be unblocked after HW changed
// try to complete delayed operations
if (leaderHWIncremented)
tryCompleteDelayedRequests()
info
}
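The guard at the top of appendMessagesToLeader only rejects a write when both conditions hold: the producer asked for acks = -1 and the ISR is below the configured minimum. A minimal sketch of that rule follows; validateAppend is a hypothetical name, and it returns an Either instead of throwing.

```scala
object MinIsrSketch {
  // Hypothetical check mirroring the minIsr guard; Left carries the error message.
  def validateAppend(isrSize: Int, minIsr: Int, requiredAcks: Int): Either[String, Unit] =
    if (requiredAcks == -1 && isrSize < minIsr)
      Left(s"Number of insync replicas is [$isrSize], below required minimum [$minIsr]")
    else
      Right(()) // other acks settings are not affected by the ISR size
}
```

A producer using acks = 1 can therefore still write to a partition whose ISR has shrunk below the minimum.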
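Checking the HW position
The checkEnoughReplicasReachOffset method, mentioned earlier together with the completion conditions of DelayedProduce, checks whether a given message has already been replicated by all follower replicas in the ISR.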
/*
* Note that this method will only be called if requiredAcks = -1
* and we are waiting for all replicas in ISR to be fully caught up to
* the (local) leader's offset corresponding to this produce request
* before we acknowledge the produce request.
*/
def checkEnoughReplicasReachOffset(requiredOffset: Long): (Boolean, Short) = {
leaderReplicaIfLocal() match {
// get the Replica object of the leader replica
case Some(leaderReplica) =>
// keep the current immutable replica list reference
// get the current ISR set
val curInSyncReplicas = inSyncReplicas
val numAcks = curInSyncReplicas.count(r => {
if (!r.isLocal)
if (r.logEndOffset.messageOffset >= requiredOffset) {
trace("Replica %d of %s-%d received offset %d".format(r.brokerId, topic, partitionId, requiredOffset))
true
}
else
false
else
true /* also count the local (leader) replica */
})
trace("%d acks satisfied for %s-%d with acks = -1".format(numAcks, topic, partitionId))
val minIsr = leaderReplica.log.get.config.minInSyncReplicas
// compare the HW with the offset of the message
if (leaderReplica.highWatermark.messageOffset >= requiredOffset ) {
/*
* The topic may be configured not to accept messages if there are not enough replicas in ISR
* in this scenario the request was already appended locally and then added to the purgatory before the ISR was shrunk
*/
// check whether the ISR is large enough; if it is too small, return an error code
if (minIsr <= curInSyncReplicas.size) {
(true, Errors.NONE.code)
} else {
(true, Errors.NOT_ENOUGH_REPLICAS_AFTER_APPEND.code)
}
} else
(false, Errors.NONE.code)
case None =>
(false, Errors.NOT_LEADER_FOR_PARTITION.code)
}
}
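The decision made by checkEnoughReplicasReachOffset on a leader can be summarized as a three-way outcome. The sketch below uses hypothetical names and replaces the (Boolean, Short) tuple with a small ADT; the case where the local broker is no longer the leader is left out.

```scala
object AckCheckSketch {
  sealed trait Outcome
  case object KeepWaiting extends Outcome                  // (false, NONE)
  case object Acknowledged extends Outcome                 // (true, NONE)
  case object NotEnoughReplicasAfterAppend extends Outcome // (true, NOT_ENOUGH_REPLICAS_AFTER_APPEND)

  // Hypothetical pure version of the check, working on plain values.
  def check(leaderHw: Long, requiredOffset: Long, isrSize: Int, minIsr: Int): Outcome =
    if (leaderHw < requiredOffset) KeepWaiting // the HW has not reached the message yet
    else if (minIsr <= isrSize) Acknowledged   // replicated and the ISR is large enough
    else NotEnoughReplicasAfterAppend          // replicated, but the ISR shrank after the append
}
```

In the last case the request still completes, since the first element of the tuple is true, but the producer receives an error code.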