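Key terms:
AR (Assigned Replicas): all replicas of a partition
ISR (In-Sync Replicas): replicas that are in sync with the leader
OSR (Out-of-Sync Replicas): replicas that are out of sync or lagging too far behind
AR = ISR + OSR
LSO (Last Stable Offset): visibility offset for Kafka transactional messages (the upper bound of what a consumer with isolation.level=read_committed can read)
LogStartOffset: first offset of a partition replica
LEO (Log End Offset): end offset of a partition replica (differs per replica; the leader's is generally greater than or equal to a follower's)
HW (High Watermark): the leader's high watermark; it determines which data is visible to non-internal clients (consumers) and is generally bounded by how fast the followers replicate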
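Broker assignment strategy for the partitions of a new topic:
kafka.admin.AdminUtils.assignReplicasToBrokers(...arg)
Kafka tries to spread partitions and their replicas as evenly as possible across brokers. The layout is most even when the partition count is a multiple of the broker count and the replica count equals the broker count.
1. Pick a random starting position among the current brokers and assign the partitions round-robin from there.
2. Assign the remaining replicas of each partition by shifting from the position chosen in the previous step.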
broker-0  broker-1  broker-2  broker-3  broker-4
p0        p1        p2        p3        p4        (1st replica)
p5        p6        p7        p8        p9        (1st replica)
p4        p0        p1        p2        p3        (2nd replica)
p8        p9        p5        p6        p7        (2nd replica)
p3        p4        p0        p1        p2        (3rd replica)
p7        p8        p9        p5        p6        (3rd replica)
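The round-robin-plus-shift layout above can be sketched as follows. This is a simplified, rack-unaware illustration, not Kafka's actual code: the object name AssignmentSketch and the parameters startIndex and startShift are assumptions made for the example. In Kafka both starting values are chosen at random; here they are passed in so the result is reproducible.

```scala
object AssignmentSketch {
  // Broker index of the (j + 2)-th replica, given the broker index of the first replica.
  // The offset is always in 1 .. nBrokers - 1, so a replica never lands on the first replica's broker.
  private def replicaIndex(first: Int, shift: Int, j: Int, nBrokers: Int): Int = {
    val offset = 1 + (shift + j) % (nBrokers - 1)
    (first + offset) % nBrokers
  }

  // Returns partition id -> broker indices, first replica first.
  def assign(nBrokers: Int, nPartitions: Int, replicationFactor: Int,
             startIndex: Int, startShift: Int): Map[Int, Seq[Int]] = {
    require(replicationFactor >= 1 && replicationFactor <= nBrokers)
    (0 until nPartitions).map { p =>
      // The shift grows by one each time the partitions wrap around the broker list.
      val shift = startShift + p / nBrokers
      val first = (p + startIndex) % nBrokers
      val others = (0 until replicationFactor - 1).map(j => replicaIndex(first, shift, j, nBrokers))
      p -> (first +: others)
    }.toMap
  }
}
```

With startIndex = 0 and startShift = 0, AssignmentSketch.assign(5, 10, 3, 0, 0) gives the layout in the table: p0 goes to brokers 0, 1, 2 and p5 goes to brokers 0, 2, 3.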
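Topic auto-creation steps:
1. The producer requests metadata when it starts.
2. The server looks up the topic in its locally cached topic list and returns the metadata.
If the topic does not exist, the server either creates it or returns an error, depending on whether auto-creation is enabled (auto.create.topics.enable):
UNKNOWN_TOPIC_OR_PARTITION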
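The decision in step 2 can be modelled with a small sketch. Everything here (AutoCreateSketch, MetadataResult, lookup, the mutable set standing in for the metadata cache) is hypothetical and only illustrates the branching; it is not the broker's request-handling code.

```scala
import scala.collection.mutable

object AutoCreateSketch {
  sealed trait MetadataResult
  final case class TopicMetadata(topic: String, justCreated: Boolean) extends MetadataResult
  case object UnknownTopicOrPartition extends MetadataResult

  // cache stands in for the broker's cached topic list (an assumption of this sketch).
  def lookup(cache: mutable.Set[String], topic: String, autoCreateEnabled: Boolean): MetadataResult =
    if (cache.contains(topic)) {
      TopicMetadata(topic, justCreated = false)
    } else if (autoCreateEnabled) {
      cache += topic // pretend the topic was created
      TopicMetadata(topic, justCreated = true)
    } else {
      UnknownTopicOrPartition
    }
}
```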
When the replica manager starts, it schedules several periodic tasks:
replicaManager.startup()
..
def startup() {
// start ISR expiration thread
// A follower can lag behind leader for up to config.replicaLagTimeMaxMs x 1.5 before it is removed from ISR
// ISR shrink policy
scheduler.schedule("isr-expiration", maybeShrinkIsr _, period = config.replicaLagTimeMaxMs / 2, unit = TimeUnit.MILLISECONDS)
// propagate recorded ISR changes (written to /isr_change_notification/isr_change_<sequenceNumber>)
scheduler.schedule("isr-change-propagation", maybePropagateIsrChanges _, period = 2500L, unit = TimeUnit.MILLISECONDS)
// periodically shut down the idle replica alter-log-dirs thread
scheduler.schedule("shutdown-idle-replica-alter-log-dirs-thread", shutdownIdleReplicaAlterLogDirsThread _, period = 10000L, unit = TimeUnit.MILLISECONDS)
..
ISR removal policy:
maybeShrinkIsr
def getOutOfSyncReplicas(leaderReplica: Replica, maxLagMs: Long): Set[Replica] = {
/**
 * there are two cases that will be handled here -
 * 1. Stuck followers: If the leo of the replica hasn't been updated for maxLagMs ms,
 *                     the follower is stuck and should be removed from the ISR
 * 2. Slow followers: If the replica has not read up to the leo within the last maxLagMs ms,
 *                    then the follower is lagging and should be removed from the ISR
 * Both these cases are handled by checking the lastCaughtUpTimeMs which represents
 * the last time when the replica was fully caught up. If either of the above conditions
 * is violated, that replica is considered to be out of sync
 */
val candidateReplicas = inSyncReplicas - leaderReplica
// replicas that are lagging behind
val laggingReplicas = candidateReplicas.filter(r => (time.milliseconds - r.lastCaughtUpTimeMs) > maxLagMs)
laggingReplicas
}
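The two cases, in short:
1. Stuck follower: the replica's LEO has not been updated for more than maxLagMs milliseconds.
2. Slow follower: the replica has not read up to the leader's LEO within the last maxLagMs milliseconds.
maxLagMs is set by the broker configuration parameter:
replica.lag.time.max.ms
When lagging replicas are detected, they are removed from the current ISR (and the data in ZooKeeper is updated).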
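Both cases reduce to one comparison against lastCaughtUpTimeMs. A minimal sketch, assuming a simplified ReplicaState class and an outOfSync helper that are made up for this example and are not Kafka's Replica API:

```scala
object IsrShrinkSketch {
  // Simplified stand-in for a replica: broker id plus the last time it was fully caught up.
  final case class ReplicaState(brokerId: Int, lastCaughtUpTimeMs: Long)

  // Followers in the ISR whose last fully-caught-up time is older than maxLagMs.
  // The leader is excluded, mirroring `inSyncReplicas - leaderReplica`.
  def outOfSync(isr: Set[ReplicaState], leaderId: Int, nowMs: Long, maxLagMs: Long): Set[ReplicaState] =
    isr.filter(r => r.brokerId != leaderId && nowMs - r.lastCaughtUpTimeMs > maxLagMs)
}
```

For example, with nowMs = 100000 and maxLagMs = 10000, a follower that last caught up at 95000 stays in the ISR, while one that last caught up at 80000 is reported as out of sync.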
Updating the ISR:
private def updateIsr(newIsr: Set[Replica]) {
val newLeaderAndIsr = new LeaderAndIsr(localBrokerId, leaderEpoch, newIsr.map(_.brokerId).toList, zkVersion)
val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkClient, topicPartition, newLeaderAndIsr,
controllerEpoch)
if (updateSucceeded) {
// record the ISR change; recordIsrChange is defined in ReplicaManager as:
//def recordIsrChange(topicPartition: TopicPartition) {
//isrChangeSet synchronized {
//isrChangeSet += topicPartition
//lastIsrChangeMs.set(System.currentTimeMillis())
//}
//}
replicaManager.recordIsrChange(topicPartition)
inSyncReplicas = newIsr
zkVersion = newVersion
println("ISR updated to [%s] and zkVersion updated to [%d]".format(newIsr.mkString(","), zkVersion))
} else {
replicaManager.failedIsrUpdatesRate.mark()
println("Cached zkVersion [%d] not equal to that in zookeeper, skip updating ISR".format(zkVersion))
}
}
ISR changes are recorded in the following ReplicaManager fields:
private val isrChangeSet: mutable.Set[TopicPartition] = new mutable.HashSet[TopicPartition]()
private val lastIsrChangeMs = new AtomicLong(System.currentTimeMillis())
Leader HW update:
Once the leader has the state of all replicas (the Replica list):
private def maybeIncrementLeaderHW(leaderReplica: Replica, curTime: Long = time.milliseconds): Boolean = {
// keep replicas that were fully caught up with the leader within the last replicaLagTimeMaxMs
// or
// that are in the ISR
val allLogEndOffsets = assignedReplicas.filter { replica =>
curTime - replica.lastCaughtUpTimeMs <= replicaManager.config.replicaLagTimeMaxMs || inSyncReplicas.contains(replica)
}.map(_.logEndOffset)
// take the smallest LEO
val newHighWatermark = allLogEndOffsets.min(new LogOffsetMetadata.OffsetOrdering)
val oldHighWatermark = leaderReplica.highWatermark
// Ensure that the high watermark increases monotonically. We also update the high watermark when the new
// offset metadata is on a newer segment, which occurs whenever the log is rolled to a new segment.
// the old HW offset is smaller than the new HW offset
// or
// the offsets are equal and the old HW is on an older segment than the new HW
if (oldHighWatermark.messageOffset < newHighWatermark.messageOffset ||
(oldHighWatermark.messageOffset == newHighWatermark.messageOffset && oldHighWatermark.onOlderSegment(newHighWatermark))) {
// update the HW of the leader replica
leaderReplica.highWatermark = newHighWatermark
debug(s"High watermark updated to $newHighWatermark")
true
} else {
debug(s"Skipping update high watermark since new hw $newHighWatermark is not larger than old hw $oldHighWatermark." +
s"All LEOs are ${allLogEndOffsets.mkString(",")}")
false
}
}
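The core of the HW update is "the minimum LEO over the eligible replicas, never moving backwards". The sketch below shows that arithmetic only. ReplicaState, candidateHw and nextHw are names invented for this example, and the segment comparison (onOlderSegment) is left out on purpose.

```scala
object HighWatermarkSketch {
  // Simplified stand-in for a replica's state as seen by the leader.
  final case class ReplicaState(brokerId: Int, logEndOffset: Long, lastCaughtUpTimeMs: Long)

  // Smallest LEO among replicas that are in the ISR or caught up within maxLagMs.
  // Assumes the leader itself is in isrIds, so the filtered list is never empty.
  def candidateHw(assigned: Seq[ReplicaState], isrIds: Set[Int], nowMs: Long, maxLagMs: Long): Long =
    assigned
      .filter(r => nowMs - r.lastCaughtUpTimeMs <= maxLagMs || isrIds.contains(r.brokerId))
      .map(_.logEndOffset)
      .min

  // The HW only ever increases.
  def nextHw(oldHw: Long, candidate: Long): Long = math.max(oldHw, candidate)
}
```

With LEOs of 100 (leader), 90 (in-sync follower) and 40 (follower outside the ISR that last caught up a minute ago), the candidate HW is 90: the lagging follower does not hold the HW back.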
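Periodic propagation of ISR changes to ZooKeeper: maybePropagateIsrChanges
Each ISR change is recorded in isrChangeSet and lastIsrChangeMs is updated (see maybeShrinkIsr).
The recorded changes are written to ZooKeeper when both of the following conditions hold:
1. isrChangeSet is not empty.
2. The last ISR change is older than ReplicaManager.IsrChangePropagationBlackOut (5000L, 5 s),
or the last write to ZooKeeper is older than ReplicaManager.IsrChangePropagationInterval (60000L, 60 s).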
def maybePropagateIsrChanges() {
val now = System.currentTimeMillis()
isrChangeSet synchronized {
if (isrChangeSet.nonEmpty &&
(lastIsrChangeMs.get() + ReplicaManager.IsrChangePropagationBlackOut < now ||
lastIsrPropagationMs.get() + ReplicaManager.IsrChangePropagationInterval < now)) {
zkClient.propagateIsrChanges(isrChangeSet)
isrChangeSet.clear()
lastIsrPropagationMs.set(now)
}
}
}
......
object ReplicaManager {
val HighWatermarkFilename = "replication-offset-checkpoint"
val IsrChangePropagationBlackOut = 5000L
val IsrChangePropagationInterval = 60000L
val OfflinePartition = new Partition("", -1, null, null, isOffline = true)
}
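The trigger condition of maybePropagateIsrChanges can be isolated as a pure function, which makes the two thresholds easier to reason about. This is a sketch: IsrPropagationSketch and shouldPropagate are hypothetical names, and the constants are copied from the ReplicaManager object above.

```scala
object IsrPropagationSketch {
  val BlackOutMs = 5000L   // IsrChangePropagationBlackOut
  val IntervalMs = 60000L  // IsrChangePropagationInterval

  // True when pending ISR changes should be written to ZooKeeper at time nowMs.
  def shouldPropagate(hasPending: Boolean, lastIsrChangeMs: Long,
                      lastIsrPropagationMs: Long, nowMs: Long): Boolean =
    hasPending &&
      (lastIsrChangeMs + BlackOutMs < nowMs || lastIsrPropagationMs + IntervalMs < nowMs)
}
```

The blackout delays the write until the ISR has been quiet for 5 s, which batches bursts of changes; the 60 s interval ensures that a partition whose ISR keeps changing still gets propagated eventually.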