The previous post, kafka源码(一), corresponds to sections 3.2 and 3.3 of Kafka设计解析(二). I used Kafka 0.8.2.x for years, back when Redis was taking off and Hadoop was still on the rise. Four or five years flew by, and I'm finally old enough to sit down and read the source code.
I have to say Kafka's code style is much nicer than Spark's. Then again, Spark is huge; by comparison Kafka is small and elegant.
Probably for performance reasons, and because of how ZooKeeper works, most of Kafka is built on an asynchronous, callback-driven event mechanism, similar to how epoll handles IO.
In the source, almost every callback is documented with when the method will be invoked and what work it does once triggered. That is very friendly for development, maintenance, and reading; I only wish I had come across it sooner.
Parts 3 and 4 of this post correspond to 3.4 (broker failover) and 3.5 (partition leader election) in the earlier article.
Contents
- 1. How BrokerChangeListener comes about
- 2. How the Controller handles newly added brokers
- 3. How the Controller handles broker failures
- 4. Partition leader election
1. How BrokerChangeListener comes about
After a broker is elected controller, it registers listeners for the two state machines in the onBecomingLeader (i.e. onControllerFailover) callback:
partitionStateMachine.registerListeners()
replicaStateMachine.registerListeners()
When the replica state machine runs registerListeners(), it calls registerBrokerChangeListener(), which registers brokerChangeListener on the "/brokers/ids" path.
Changes to this node, as well as additions or deletions of its children, all trigger the listener.
// in ReplicaStateMachine.scala
// register ZK listeners of the replica state machine
def registerListeners() {
// register broker change listener
registerBrokerChangeListener()
}
private def registerBrokerChangeListener() = {
zkUtils.zkClient.subscribeChildChanges(ZkUtils.BrokerIdsPath, brokerChangeListener)
}
The listener is registered with zkClient's subscribeChildChanges method, and the callback it triggers (the second argument) is an IZkChildListener:
// see the zkClient docs: subscribeChildChanges(java.lang.String path, IZkChildListener listener)
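To make the mechanism concrete, here is a minimal standalone sketch (not Kafka code) that subscribes to child changes under /brokers/ids with the same zkclient API; it assumes a ZooKeeper reachable at localhost:2181. Note that the callback receives the full current child list rather than a delta, which is why Kafka's listener has to diff the old and new broker sets itself:
// standalone sketch: watch /brokers/ids with zkclient (assumes ZooKeeper at localhost:2181)
import org.I0Itec.zkclient.{IZkChildListener, ZkClient}
import scala.collection.JavaConverters._

object BrokerWatchSketch extends App {
  val zkClient = new ZkClient("localhost:2181")
  zkClient.subscribeChildChanges("/brokers/ids", new IZkChildListener {
    override def handleChildChange(parentPath: String, currentChildren: java.util.List[String]): Unit = {
      // currentChildren is the full child list after the change (null if the parent node was deleted)
      val ids = Option(currentChildren).map(_.asScala.toList).getOrElse(Nil)
      println(s"brokers now registered under $parentPath: ${ids.sorted.mkString(",")}")
    }
  })
  Thread.sleep(Long.MaxValue) // keep the process alive so callbacks can arrive
}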
ControllerZkChildListener implements the IZkChildListener interface:
// ControllerZkListener.scala
trait ControllerZkChildListener extends IZkChildListener with ControllerZkListener {
@throws[Exception]
final def handleChildChange(parentPath: String, currentChildren: java.util.List[String]): Unit = {
// Due to zkclient's callback order, it's possible for the callback to be triggered after the controller has moved
if (controller.isActive)
doHandleChildChange(parentPath, currentChildren.asScala)
}
@throws[Exception]
def doHandleChildChange(parentPath: String, currentChildren: Seq[String]): Unit
}
When the callback fires, if the controller is active, the listener's doHandleChildChange method is executed.
Since BrokerChangeListener extends ControllerZkChildListener, this means BrokerChangeListener's doHandleChildChange runs.
Kafka's handling of all newly added brokers and dead brokers lives in this callback, as follows:
/**
* In ReplicaStateMachine.scala
* This is the zookeeper listener that triggers all the state transitions for a replica
*/
class BrokerChangeListener(protected val controller: KafkaController) extends ControllerZkChildListener {
protected def logName = "BrokerChangeListener"
// Handles both newly added brokers and dead brokers.
// Via the controller, runs onBrokerStartup for new brokers and onBrokerFailure for dead ones.
def doHandleChildChange(parentPath: String, currentBrokerList: Seq[String]) {
info("Broker change listener fired for path %s with children %s"
.format(parentPath, currentBrokerList.sorted.mkString(",")))
inLock(controllerContext.controllerLock) {
// only executes once the ReplicaStateMachine has started (startup)
if (hasStarted.get) {
ControllerStats.leaderElectionTimer.time {
try {
// read the current broker info from the znode path, i.e. the state after the change
val curBrokers = currentBrokerList.map(_.toInt).toSet.flatMap(zkUtils.getBrokerInfo)
val curBrokerIds = curBrokers.map(_.id)
val liveOrShuttingDownBrokerIds = controllerContext.liveOrShuttingDownBrokerIds
// liveOrShuttingDownBrokerIds is a Set; -- is the set-difference operator.
// https://docs.scala-lang.org/zh-cn/overviews/collections/sets.html
// brokers present after the change, minus the brokers we already knew about (live plus shutting down), gives the newly added brokers;
val newBrokerIds = curBrokerIds -- liveOrShuttingDownBrokerIds
// the brokers known before the change, minus those present after it, gives the dead brokers
val deadBrokerIds = liveOrShuttingDownBrokerIds -- curBrokerIds
val newBrokers = curBrokers.filter(broker => newBrokerIds(broker.id))
controllerContext.liveBrokers = curBrokers
val newBrokerIdsSorted = newBrokerIds.toSeq.sorted
val deadBrokerIdsSorted = deadBrokerIds.toSeq.sorted
val liveBrokerIdsSorted = curBrokerIds.toSeq.sorted
info("Newly added brokers: %s, deleted brokers: %s, all live brokers: %s"
.format(newBrokerIdsSorted.mkString(","), deadBrokerIdsSorted.mkString(","), liveBrokerIdsSorted.mkString(",")))
// for each newly added broker, add it to controllerChannelManager, which performs a series of setup steps
newBrokers.foreach(controllerContext.controllerChannelManager.addBroker)
// for dead brokers, remove them from controllerChannelManager: shut down the requestSendThread and clear them from the channel manager's in-memory cache
deadBrokerIds.foreach(controllerContext.controllerChannelManager.removeBroker)
if(newBrokerIds.nonEmpty)
controller.onBrokerStartup(newBrokerIdsSorted)
if(deadBrokerIds.nonEmpty)
controller.onBrokerFailure(deadBrokerIdsSorted)
} catch {
case e: Throwable => error("Error while handling broker changes", e)
}
}
}
}
}
}
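To make the two set differences concrete, here is a tiny worked example with made-up broker ids:
// hypothetical broker ids, just to illustrate the -- (set difference) calls above
val liveOrShuttingDownBrokerIds = Set(1, 2, 3) // brokers the controller already knew about
val curBrokerIds = Set(2, 3, 4)                // brokers currently registered under /brokers/ids
val newBrokerIds = curBrokerIds -- liveOrShuttingDownBrokerIds  // Set(4): broker 4 just started
val deadBrokerIds = liveOrShuttingDownBrokerIds -- curBrokerIds // Set(1): broker 1 has died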
For each newly added broker, addBroker puts it into controllerChannelManager: the controller keeps a connection to the new broker and sends requests to it over a dedicated thread that is created and started for that broker.
Then controller.onBrokerStartup() is executed.
For each dead broker, it is removed from controllerChannelManager: the requestSendThread "channel" is shut down and the entry is cleared from the channel manager's in-memory cache.
Then controller.onBrokerFailure() is executed.
In addition, ReplicaState has 7 possible states:
sealed trait ReplicaState { def state: Byte }
case object NewReplica extends ReplicaState { val state: Byte = 1 }
case object OnlineReplica extends ReplicaState { val state: Byte = 2 }
case object OfflineReplica extends ReplicaState { val state: Byte = 3 }
case object ReplicaDeletionStarted extends ReplicaState { val state: Byte = 4 }
case object ReplicaDeletionSuccessful extends ReplicaState { val state: Byte = 5 }
case object ReplicaDeletionIneligible extends ReplicaState { val state: Byte = 6 }
case object NonExistentReplica extends ReplicaState { val state: Byte = 7 }
That was the overall picture; now let's look at the details.
2. How the Controller handles newly added brokers
The key lines are these two:
newBrokers.foreach(controllerContext.controllerChannelManager.addBroker)
controller.onBrokerStartup(newBrokerIdsSorted)
controllerChannelManager's addBroker looks like this:
def addBroker(broker: Broker) {
// be careful here. Maybe the startup() API has already started the request send thread
brokerLock synchronized {
if(!brokerStateInfo.contains(broker.id)) {
addNewBroker(broker)
startRequestSendThread(broker.id)
}
}
}
In addNewBroker(broker), a brokerNode, a Selector, and a NetworkClient are created for the new broker;
finally a RequestSendThread object is created, and the broker's ControllerBrokerStateInfo is stored in the brokerStateInfo map.
RequestSendThread is a ShutdownableThread.
startRequestSendThread then starts this thread; it takes requests from the broker's queue, sends each clientRequest (backing off 100 ms before retrying on failure), and handles the response:
// key code in the send loop started by startRequestSendThread:
clientResponse = networkClient.blockingSendAndReceive(clientRequest)(time)
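The overall design is one request queue plus one sender thread per broker. Below is a rough, hypothetical sketch of such a loop (not the actual RequestSendThread code), assuming a send function that either succeeds or throws:
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// hypothetical stand-in for the per-broker sender thread: drain a queue of requests,
// retrying a failed send with a 100 ms backoff
class SendLoopSketch[Req <: AnyRef](queue: LinkedBlockingQueue[Req], send: Req => Unit) extends Thread {
  @volatile private var running = true
  override def run(): Unit = {
    while (running) {
      val request = queue.poll(1, TimeUnit.SECONDS)  // wake up periodically so shutdown() is noticed
      if (request != null) {
        var sent = false
        while (running && !sent) {
          try { send(request); sent = true }               // a real implementation hands the response to a callback
          catch { case _: Exception => Thread.sleep(100) } // back off 100 ms, then retry the same request
        }
      }
    }
  }
  def shutdown(): Unit = running = false
}
Keeping one queue and one thread per broker means a slow or unreachable broker only stalls its own channel; requests for other brokers keep flowing.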
3. How the Controller handles broker failures
The controller runs onBrokerFailure; it is invoked by the replica state machine's BrokerChangeListener with the list of failed brokers as input. It does the following four things:
- marks partitions whose leaders are dead as offline
- triggers the OnlinePartition state change for all new/offline partitions
- invokes the OfflineReplica state change for the replicas on the failed brokers
- if no partitions are affected, sends an UpdateMetadataRequest to the live or shutting-down brokers
/**
* This callback is invoked by the replica state machine's broker change listener with the list of failed brokers
* as input. It does the following -
* 1. Mark partitions with dead leaders as offline
* 2. Triggers the OnlinePartition state change for all new/offline partitions
* 3. Invokes the OfflineReplica state change on the input list of newly started brokers (this line of the source comment looks wrong: it should refer to the failed brokers)
* 4. If no partitions are effected then send UpdateMetadataRequest to live or shutting down brokers
*
* Note that we don't need to refresh the leader/isr cache for all topic/partitions at this point. This is because
* the partition state machine will refresh our cache for us when performing leader election for all new/offline
* partitions coming online.
*/
def onBrokerFailure(deadBrokers: Seq[Int]) {
info("Broker failure callback for %s".format(deadBrokers.mkString(",")))
val deadBrokersThatWereShuttingDown =
deadBrokers.filter(id => controllerContext.shuttingDownBrokerIds.remove(id))
info("Removed %s from list of shutting down brokers.".format(deadBrokersThatWereShuttingDown))
val deadBrokersSet = deadBrokers.toSet
// trigger OfflinePartition state for all partitions whose current leader is one amongst the dead brokers
val partitionsWithoutLeader = controllerContext.partitionLeadershipInfo.filter(partitionAndLeader =>
deadBrokersSet.contains(partitionAndLeader._2.leaderAndIsr.leader) &&
!deleteTopicManager.isTopicQueuedUpForDeletion(partitionAndLeader._1.topic)).keySet
// 1. trigger the OfflinePartition target state for all partitions whose current leader is among the dead brokers
partitionStateMachine.handleStateChanges(partitionsWithoutLeader, OfflinePartition)
// trigger OnlinePartition state changes for offline or new partitions
// 2. immediately move them back towards OnlinePartition:
// every partition in NewPartition or OfflinePartition state (except those queued for deletion) gets the OnlinePartition state change triggered,
// using controller.offlinePartitionSelector to re-elect the partition leader.
partitionStateMachine.triggerOnlinePartitionStateChange()
// filter out the replicas that belong to topics that are being deleted
// 3. from the replicas on dead brokers, filter out (filterNot) those belonging to topics queued for deletion, leaving the active replicas on dead brokers
var allReplicasOnDeadBrokers = controllerContext.replicasOnBrokers(deadBrokersSet)
val activeReplicasOnDeadBrokers = allReplicasOnDeadBrokers.filterNot(p => deleteTopicManager.isTopicQueuedUpForDeletion(p.topic))
// 4. move those active replicas to the OfflineReplica state:
// send them stop-replica requests so they stop fetching from the leader.
replicaStateMachine.handleStateChanges(activeReplicasOnDeadBrokers, OfflineReplica)
// check if topic deletion state for the dead replicas needs to be updated
// 5. pick out the replicas that belong to topics queued for deletion, fail their deletion in deleteTopicManager, which moves them to ReplicaDeletionIneligible
val replicasForTopicsToBeDeleted = allReplicasOnDeadBrokers.filter(p => deleteTopicManager.isTopicQueuedUpForDeletion(p.topic))
if(replicasForTopicsToBeDeleted.nonEmpty) {
// it is required to mark the respective replicas in TopicDeletionFailed state since the replica cannot be
// deleted when the broker is down. This will prevent the replica from being in TopicDeletionStarted state indefinitely
// since topic deletion cannot be retried until at least one replica is in TopicDeletionStarted state
deleteTopicManager.failReplicaDeletion(replicasForTopicsToBeDeleted)
}
// If broker failure did not require leader re-election, inform brokers of failed broker
// Note that during leader re-election, brokers update their metadata
if (partitionsWithoutLeader.isEmpty) {
sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)
}
}
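As a small worked example of step 1, with made-up topics and broker ids and a plain Map of (topic, partition) -> leader id standing in for partitionLeadershipInfo, the filter keeps exactly the partitions whose leader sits on a dead broker:
// hypothetical data illustrating how partitions with a dead leader are selected
val deadBrokersSet = Set(3)
val partitionLeaders = Map(        // (topic, partition) -> current leader broker id
  ("logs", 0) -> 1,
  ("logs", 1) -> 3,
  ("metrics", 0) -> 3
)
val partitionsWithoutLeader = partitionLeaders.filter { case (_, leader) => deadBrokersSet.contains(leader) }.keySet
// Set(("logs",1), ("metrics",0)): these go to OfflinePartition, then back to OnlinePartition via leader re-election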
A partition has four possible states:
sealed trait PartitionState { def state: Byte }
case object NewPartition extends PartitionState { val state: Byte = 0 }
case object OnlinePartition extends PartitionState { val state: Byte = 1 }
case object OfflinePartition extends PartitionState { val state: Byte = 2 }
case object NonExistentPartition extends PartitionState { val state: Byte = 3 }
4. Partition leader election
KafkaController defines four leader selectors in total, all of which extend PartitionLeaderSelector:
val offlinePartitionSelector = new OfflinePartitionLeaderSelector(controllerContext, config)
private val reassignedPartitionLeaderSelector = new ReassignedPartitionLeaderSelector(controllerContext)
private val preferredReplicaPartitionLeaderSelector = new PreferredReplicaPartitionLeaderSelector(controllerContext)
private val controlledShutdownPartitionLeaderSelector = new ControlledShutdownLeaderSelector(controllerContext)
NoOpLeaderSelector was removed in 0.10.2. These four selectors implement different selectLeader strategies for choosing a partition's leader from the ISR. I'll fill in the details when I get the time.
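As a rough illustration of what such a selectLeader strategy boils down to (hypothetical types and names, not the actual PartitionLeaderSelector API): given the assigned replicas, the current ISR, and the live brokers, pick the first assigned replica that is both alive and in the ISR.
// hypothetical sketch of a "pick the leader from the ISR" strategy
case class LeaderChoice(leader: Int, isr: List[Int])

def pickLeaderSketch(assignedReplicas: Seq[Int], isr: Set[Int], liveBrokers: Set[Int]): Option[LeaderChoice] = {
  val liveIsr = isr.intersect(liveBrokers)    // only brokers that are both in the ISR and alive qualify
  assignedReplicas.find(liveIsr.contains).map(leader => LeaderChoice(leader, liveIsr.toList.sorted))
}

// e.g. pickLeaderSketch(Seq(1, 2, 3), isr = Set(2, 3), liveBrokers = Set(2, 3))
//   == Some(LeaderChoice(2, List(2, 3)))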