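The PartitionStateMachine class represents the state machine for partitions. It defines the states a partition can be in and the transitions that move a partition to another legal state. There are four states:
- NonExistentPartition: the partition was never created, or was created and then deleted. The only valid previous state, if there is one, is OfflinePartition.
- NewPartition: after creation, the partition is in the NewPartition state. In this state it should already have replicas assigned, but it has no leader/ISR yet. The valid previous state is NonExistentPartition.
- OnlinePartition: once a leader has been elected for the partition, it enters the OnlinePartition state. Valid previous states are NewPartition/OfflinePartition.
- OfflinePartition: if the partition's leader dies after a successful leader election, the partition moves to the OfflinePartition state. Valid previous states are NewPartition/OnlinePartition.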
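The four states and their valid previous states can be written down as a small lookup table. The following is only a sketch for illustration, not Kafka's actual code: the object name `PartitionTransitions` and the helper `isValid` are made up here, and the table encodes only the predecessors named in the list above.

```scala
// Illustrative model only; these are not the real Kafka classes.
sealed trait PartitionState
case object NonExistentPartition extends PartitionState
case object NewPartition extends PartitionState
case object OnlinePartition extends PartitionState
case object OfflinePartition extends PartitionState

object PartitionTransitions {
  // Target state -> the states it may be entered from, as listed above.
  val validPreviousStates: Map[PartitionState, Set[PartitionState]] =
    Map[PartitionState, Set[PartitionState]](
      NonExistentPartition -> Set[PartitionState](OfflinePartition),
      NewPartition         -> Set[PartitionState](NonExistentPartition),
      OnlinePartition      -> Set[PartitionState](NewPartition, OfflinePartition),
      OfflinePartition     -> Set[PartitionState](NewPartition, OnlinePartition)
    )

  // True when moving from `from` to `to` is one of the listed transitions.
  def isValid(from: PartitionState, to: PartitionState): Boolean =
    validPreviousStates(to).contains(from)
}
```

For example, `PartitionTransitions.isValid(NewPartition, OnlinePartition)` holds, while a direct jump from OnlinePartition to NonExistentPartition is rejected because a partition has to go through OfflinePartition first.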
Now look at onControllerFailover(). It first calls registerListeners() on partitionStateMachine and replicaStateMachine, and then starts the replica state machine and the partition state machine:
def onControllerFailover() {
  if(isRunning) {
    ......
    partitionStateMachine.registerListeners()
    replicaStateMachine.registerListeners()
    initializeControllerContext()
    sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)
    replicaStateMachine.startup()
    partitionStateMachine.startup()
    ......
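As the previous post explained, ReplicaStateMachine registers a single listener, brokerChangeListener, whereas PartitionStateMachine registers two listeners, both concerned with topics. So ReplicaStateMachine mainly watches for brokers joining and dying, while PartitionStateMachine mainly watches for topics being created and deleted.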
// PartitionStateMachine.scala
// register topic and partition change listeners
def registerListeners() {
  registerTopicChangeListener()
  registerDeleteTopicListener()
}
// register topicChangeListener on "/brokers/topics"
private def registerTopicChangeListener() = {
  zkUtils.zkClient.subscribeChildChanges(BrokerTopicsPath, topicChangeListener)
}
// register deleteTopicsListener on "/admin/delete_topics"
private def registerDeleteTopicListener() = {
  zkUtils.zkClient.subscribeChildChanges(DeleteTopicsPath, deleteTopicsListener)
}
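Leaving aside for now what these two listeners do: once the listeners are registered, the Controller starts partitionStateMachine by calling partitionStateMachine.startup().
1. Bringing partitions online
The state transitions that can occur while the partition state machine starts up are:
- NonExistentPartition -> NewPartition
- NewPartition -> OnlinePartition
- OnlinePartition -> OnlinePartition
- OfflinePartition -> OnlinePartition
Of these, "NonExistentPartition -> NewPartition" takes place in initializePartitionState, after the assigned replicas have been loaded from ZK into the Controller cache.
After loading from ZK, the initial state falls into one of three cases:
- the partition has leader/ISR information and the leader is alive: OnlinePartition
- the partition has leader/ISR information and the leader is not alive: OfflinePartition
- the partition has no leaderAndIsr yet: NewPartition
When the controller starts partitionStateMachine, it first sets an initial state for every topicPartition that already exists in ZooKeeper, and then tries to bring them all online.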
// PartitionStateMachine class
def startup() {
  // Initialize partition state. A partition with no leader/ISR information yet starts in NewPartition;
  // if its LeaderIsrAndControllerEpoch already exists in controllerContext,
  // it starts in OnlinePartition when the leader is alive, otherwise in OfflinePartition.
  initializePartitionState()
  // set started flag
  hasStarted.set(true)
  // try to move partitions in NewPartition or OfflinePartition state to OnlinePartition state
  triggerOnlinePartitionStateChange()
  info("Started partition state machine with initial state -> " + partitionState.toString())
}
/**
 * Invoked when the partition state machine sets the initial state for all partitions that already exist in zookeeper
 */
private def initializePartitionState() {
  for (topicPartition <- controllerContext.partitionReplicaAssignment.keys) {
    // check if leader and isr path exists for partition. If not, then it is in NEW state
    controllerContext.partitionLeadershipInfo.get(topicPartition) match {
      case Some(currentLeaderIsrAndEpoch) =>
        // else, check if the leader for partition is alive. If yes, it is in Online state, else it is in Offline state
        if (controllerContext.liveBrokerIds.contains(currentLeaderIsrAndEpoch.leaderAndIsr.leader))
          // leader is alive: the partition has leader/ISR information and its leader
          // is in liveBrokerIds, so mark it OnlinePartition
          partitionState.put(topicPartition, OnlinePartition)
        else
          // the partition has leader/ISR information but its leader is not alive, so mark it OfflinePartition
          partitionState.put(topicPartition, OfflinePartition)
      case None =>
        // the partition has no leaderAndIsr yet, so its initial state is NewPartition
        partitionState.put(topicPartition, NewPartition)
    }
  }
}
def triggerOnlinePartitionStateChange() {
  try {
    brokerRequestBatch.newBatch()
    // try to move all partitions in NewPartition or OfflinePartition state to OnlinePartition state
    // except partitions that belong to topics to be deleted
    for((topicAndPartition, partitionState) <- partitionState
        if !controller.deleteTopicManager.isTopicQueuedUpForDeletion(topicAndPartition.topic)) {
      if(partitionState.equals(OfflinePartition) || partitionState.equals(NewPartition))
        handleStateChange(topicAndPartition.topic, topicAndPartition.partition, OnlinePartition, controller.offlinePartitionSelector,
                          (new CallbackBuilder).build)
    }
    brokerRequestBatch.sendRequestsToBrokers(controller.epoch)
  } catch {
    case e: Throwable => error("Error while moving some partitions to the online state", e)
    // TODO: It is not enough to bail out and log an error, it is important to trigger leader election for those partitions
  }
}
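handleStateChange(target to OnlinePartition)
triggerOnlinePartitionStateChange() {
  // for every partition except those belonging to topics queued for deletion:
  -> handleStateChange(to OnlinePartition state)
    -> initializeLeaderAndIsrForPartition()   // when the partition is currently in NewPartition
    -> electLeaderForPartition()              // when the partition is currently in OfflinePartition or OnlinePartition
  // send the leader and ISR information to all live brokers through ControllerChannelManager
}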
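Here, initializeLeaderAndIsrForPartition initializes the leader and ISR path for a new partition.
A NewPartition has no leader and ISR path in ZooKeeper. Once it enters the OnlinePartition state, that path has been initialized and the partition can never return to the NewPartition state.
initializeLeaderAndIsrForPartition() does the following:
- takes the partition's replicaAssignment, as loaded from ZooKeeper, and keeps the replicas that are alive;
- creates a LeaderAndIsr object and wraps it, together with controller.epoch, in a LeaderIsrAndControllerEpoch object;
- creates a persistent path at "/brokers/topics/{topic}/partitions/{partitionId}/state" and writes the LeaderIsrAndControllerEpoch information to it;
- saves the leaderIsrAndControllerEpoch of the topicAndPartition in the controller context;
- reports what it knows (the leaderIsrAndControllerEpoch and the corresponding replicas) to the brokers.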
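The steps above can be sketched roughly as follows. This is a simplified model, not the Kafka implementation: the case classes carry fewer fields than the real ones, the mutable map `statePaths` is a hypothetical in-memory stand-in for the ZooKeeper state path, and choosing the first live replica as leader with all live replicas as the ISR is an assumption made for the example. The final step, sending the result to the brokers, is left out.

```scala
import scala.collection.mutable

// Simplified stand-ins for Kafka's classes; fields are reduced for illustration.
case class LeaderAndIsr(leader: Int, isr: List[Int])
case class LeaderIsrAndControllerEpoch(leaderAndIsr: LeaderAndIsr, controllerEpoch: Int)

object InitLeaderAndIsrSketch {
  // Hypothetical in-memory replacement for the ZooKeeper "state" paths.
  val statePaths = mutable.Map.empty[String, LeaderIsrAndControllerEpoch]

  def statePath(topic: String, partition: Int): String =
    s"/brokers/topics/$topic/partitions/$partition/state"

  // Returns None when no assigned replica is alive, so the partition cannot go online.
  def initialize(topic: String, partition: Int, assignedReplicas: Seq[Int],
                 liveBrokerIds: Set[Int], controllerEpoch: Int): Option[LeaderIsrAndControllerEpoch] = {
    // Step 1: keep only the assigned replicas that are alive.
    val liveReplicas = assignedReplicas.filter(liveBrokerIds.contains)
    liveReplicas.headOption.map { leader =>
      // Step 2 (assumption): first live replica leads, all live replicas form the ISR.
      val info = LeaderIsrAndControllerEpoch(LeaderAndIsr(leader, liveReplicas.toList), controllerEpoch)
      // Steps 3 and 4: write the state path and cache the result.
      statePaths.put(statePath(topic, partition), info)
      info
    }
  }
}
```

Calling `InitLeaderAndIsrSketch.initialize("t", 0, Seq(1, 2, 3), Set(2, 3), 5)` picks broker 2 as leader with ISR `List(2, 3)` and stores it under "/brokers/topics/t/partitions/0/state".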
2. Taking partitions offline
In the Controller's onBrokerFailure() or TopicDeletionManager's completeDeleteTopic(), the affected partitions enter the offline state regardless of the state they were in before. The possible transitions are:
- NewPartition -> OfflinePartition
- OnlinePartition -> OfflinePartition
- OfflinePartition -> OfflinePartition
Once a partition is in OfflinePartition, the transitions that can follow are:
- OfflinePartition -> OnlinePartition (on broker failure)
- OfflinePartition -> NonExistentPartition (when the topic is being deleted)
1) Note what happens in onBrokerFailure():
partitionStateMachine.handleStateChanges(partitionsWithoutLeader, OfflinePartition)
// trigger OnlinePartition state changes for offline or new partitions
partitionStateMachine.triggerOnlinePartitionStateChange()
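This takes the partitions offline and then immediately calls triggerOnlinePartitionStateChange() to bring the offline partitions back online.
In other words, to keep the service available, a partition does not simply stay offline once it goes down: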
/**
 * This API invokes the OnlinePartition state change on all partitions in either the NewPartition or OfflinePartition
 * state. This is called on a successful controller election and on broker changes
 */
def triggerOnlinePartitionStateChange() {
  try {
    brokerRequestBatch.newBatch()
    // try to move all partitions in NewPartition or OfflinePartition state to OnlinePartition state except partitions
    // that belong to topics to be deleted
    for((topicAndPartition, partitionState) <- partitionState
        if !controller.deleteTopicManager.isTopicQueuedUpForDeletion(topicAndPartition.topic)) {
      if(partitionState.equals(OfflinePartition) || partitionState.equals(NewPartition))
        handleStateChange(topicAndPartition.topic, topicAndPartition.partition, OnlinePartition, controller.offlinePartitionSelector,
                          (new CallbackBuilder).build)
    }
    brokerRequestBatch.sendRequestsToBrokers(controller.epoch)
  } catch {
    case e: Throwable => error("Error while moving some partitions to the online state", e)
    // TODO: It is not enough to bail out and log an error, it is important to trigger leader election for those partitions
  }
}
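The offline-then-online sequence on broker failure can be illustrated with a toy model. Everything in it is hypothetical: the `Partition` record, the string keys, and in particular the rule of promoting the first live replica, which merely stands in for whatever controller.offlinePartitionSelector actually decides.

```scala
import scala.collection.mutable

object BrokerFailureSketch {
  sealed trait State
  case object Online extends State
  case object Offline extends State

  // Hypothetical partition record: assigned replicas, current leader, current state.
  case class Partition(replicas: Seq[Int], leader: Int, state: State)

  def onBrokerFailure(partitions: mutable.Map[String, Partition], liveBrokers: Set[Int]): Unit = {
    // Step 1: partitions whose leader is no longer alive go offline.
    for ((name, p) <- partitions.toList if !liveBrokers.contains(p.leader))
      partitions(name) = p.copy(state = Offline)
    // Step 2: immediately try to bring every offline partition back online.
    // Assumption: the first live replica becomes the new leader.
    for ((name, p) <- partitions.toList if p.state == Offline)
      p.replicas.find(liveBrokers.contains).foreach { newLeader =>
        partitions(name) = p.copy(leader = newLeader, state = Online)
      }
  }
}
```

In this model a partition with replicas 1 and 2 whose leader 1 dies comes straight back online with leader 2, whereas a partition whose only replica was on broker 1 stays offline until a later attempt succeeds.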
2) In completeDeleteTopic():
The partition is first moved to OfflinePartition and then, because the topic is being deleted, immediately moved on to NonExistentPartition.
private def completeDeleteTopic(topic: String) {
  // Deregister the PartitionChangeListener on the deleted topic.
  // This keeps the PartitionChangeListener from firing before a deleted topic is automatically re-created.
  partitionStateMachine.deregisterPartitionChangeListener(topic)
  val replicasForDeletedTopic = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionSuccessful)
  // The controller removes these replicas from the replica state machine and from the partitionsForTopic cache.
  replicaStateMachine.handleStateChanges(replicasForDeletedTopic, NonExistentReplica)
  val partitionsForDeletedTopic = controllerContext.partitionsForTopic(topic)
  // Move each partition to OfflinePartition and then to NonExistentPartition
  partitionStateMachine.handleStateChanges(partitionsForDeletedTopic, OfflinePartition)
  partitionStateMachine.handleStateChanges(partitionsForDeletedTopic, NonExistentPartition)
  // Update the deletion bookkeeping: the topic and its partitions are no longer pending deletion
  topicsToBeDeleted -= topic
  partitionsToBeDeleted.retain(_.topic != topic)
  // Recursively delete the ZK paths that belong to this topic
  val zkUtils = controllerContext.zkUtils
  zkUtils.zkClient.deleteRecursive(getTopicPath(topic))
  zkUtils.zkClient.deleteRecursive(getEntityConfigPath(ConfigType.Topic, topic))
  zkUtils.zkClient.delete(getDeleteTopicPath(topic))
  controllerContext.removeTopic(topic)
}
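Why completeDeleteTopic() issues two state changes in a row can be shown with a small sketch. It is not Kafka code: the object `DeleteTopicSketch`, the string-valued states and the `moveTo` helper are all hypothetical, and only the two target states used during deletion are modelled.

```scala
import scala.collection.mutable

object DeleteTopicSketch {
  // Hypothetical in-memory state table keyed by "topic-partition".
  val states = mutable.Map.empty[String, String]

  // Applies the transition only when the current state is a valid predecessor.
  def moveTo(partition: String, target: String): Boolean = {
    val current = states.getOrElse(partition, "NonExistentPartition")
    val allowed: Set[String] = target match {
      case "OfflinePartition"     => Set("NewPartition", "OnlinePartition", "OfflinePartition")
      case "NonExistentPartition" => Set("OfflinePartition")
      case _                      => Set.empty[String]
    }
    val ok = allowed.contains(current)
    if (ok) states(partition) = target
    ok
  }
}
```

Starting from OnlinePartition, `moveTo("t-0", "NonExistentPartition")` is refused, but `moveTo("t-0", "OfflinePartition")` followed by `moveTo("t-0", "NonExistentPartition")` succeeds, which mirrors the two consecutive handleStateChanges calls.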
handleStateChange(target to OfflinePartition)
This does nothing except change the partitionState recorded in memory to OfflinePartition.
3. Partition state machine diagram