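The PartitionStateMachine class represents the state machine for partitions. It defines the states a partition can be in and the transitions that move a partition to another legal state. There are four states: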

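  1. NonExistentPartition: the partition was never created, or was created and then deleted. The valid previous state, if there is one, is OfflinePartition.
  2. NewPartition: the state a partition is in right after creation. It should already have replicas assigned, but it has no leader/ISR yet. The valid previous state is NonExistentPartition.
  3. OnlinePartition: once a leader has been elected for the partition, it moves to OnlinePartition. Valid previous states are NewPartition/OfflinePartition.
  4. OfflinePartition: if the partition's leader dies after a successful leader election, the partition moves to OfflinePartition. Valid previous states are NewPartition/OnlinePartition.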
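
The four descriptions above amount to a small lookup table from a target state to its valid previous states. The following is only a sketch of that table: the object name `PartitionTransitionSketch` and the helper `isValidTransition` are made up for illustration and are not part of Kafka, and the table mirrors only the four descriptions above (the real state machine may accept more, for instance the self-transitions listed later in this article).

```scala
object PartitionTransitionSketch {
  // Illustrative stand-ins for the four states; not the Kafka classes themselves.
  sealed trait PartitionState
  case object NonExistentPartition extends PartitionState
  case object NewPartition extends PartitionState
  case object OnlinePartition extends PartitionState
  case object OfflinePartition extends PartitionState

  // Target state -> valid previous states, copied from the list above.
  val validPreviousStates: Map[PartitionState, Set[PartitionState]] =
    Map[PartitionState, Set[PartitionState]](
      NonExistentPartition -> Set[PartitionState](OfflinePartition),
      NewPartition -> Set[PartitionState](NonExistentPartition),
      OnlinePartition -> Set[PartitionState](NewPartition, OfflinePartition),
      OfflinePartition -> Set[PartitionState](NewPartition, OnlinePartition)
    )

  // Hypothetical helper: true if `from` is listed as a valid previous state of `to`.
  def isValidTransition(from: PartitionState, to: PartitionState): Boolean =
    validPreviousStates(to).contains(from)
}
```

For example, `isValidTransition(NewPartition, OnlinePartition)` holds, while a jump from NonExistentPartition straight to OnlinePartition does not.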

Now back to onControllerFailover(). It first calls registerListeners() on partitionStateMachine and replicaStateMachine, and then starts the replica state machine followed by the partition state machine:

def onControllerFailover() {
    if(isRunning) {
      ......
      partitionStateMachine.registerListeners()
      replicaStateMachine.registerListeners()

      initializeControllerContext()
      sendUpdateMetadataRequest(controllerContext.liveOrShuttingDownBrokerIds.toSeq)

      replicaStateMachine.startup()
      partitionStateMachine.startup()
      ......

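As covered in the previous post, ReplicaStateMachine registers a single listener, brokerChangeListener, whereas PartitionStateMachine registers two listeners, both concerned with topics. In short, ReplicaStateMachine mainly watches for brokers joining and brokers dying, while PartitionStateMachine mainly watches for topics being created and deleted.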

// PartitionStateMachine.scala
  // register topic and partition change listeners
  def registerListeners() {
    registerTopicChangeListener()
    registerDeleteTopicListener()
  }
  // register topicChangeListener on "/brokers/topics"
  private def registerTopicChangeListener() = {
    zkUtils.zkClient.subscribeChildChanges(BrokerTopicsPath, topicChangeListener)
  }
  // register deleteTopicsListener on "/admin/delete_topics"
  private def registerDeleteTopicListener() = {
    zkUtils.zkClient.subscribeChildChanges(DeleteTopicsPath, deleteTopicsListener)
  }

Leaving aside for now what these two listeners actually do: once the controller has registered them, it starts the partition state machine by calling partitionStateMachine.startup().

1. Bringing partitions online

While the partition state machine starts up, the following state transitions can occur:

  • NonExistentPartition -> NewPartition
  • NewPartition -> OnlinePartition
  • OnlinePartition -> OnlinePartition
  • OfflinePartition -> OnlinePartition

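Of these, "NonExistentPartition -> NewPartition" happens in initializePartitionState, once the assigned replicas have been loaded from ZK into the controller cache.
The initial state derived from the data loaded from ZK is one of three cases:

  • the partition has a leader/ISR record and that leader is alive: OnlinePartition
  • the partition has a leader/ISR record but that leader is not alive: OfflinePartition
  • the partition has no leaderAndIsr yet: NewPartition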

When the controller starts partitionStateMachine, it first sets an initial state for every topicPartition that already exists in ZooKeeper, and then tries to bring them all online.

// class PartitionStateMachine
  def startup() {
    // Initialize the partition states: a brand-new partition starts in NewPartition;
    // if a LeaderIsrAndControllerEpoch already exists in controllerContext,
    //     the partition goes to OnlinePartition when its leader is alive, otherwise to OfflinePartition.
    initializePartitionState()
    // set started flag
    hasStarted.set(true)
    // try to move partitions in NewPartition or OfflinePartition to OnlinePartition
    triggerOnlinePartitionStateChange()

    info("Started partition state machine with initial state -> " + partitionState.toString())
  }
  
  /**
   * Called when the partition state machine sets the initial state for all partitions that already exist in ZooKeeper
   */
  private def initializePartitionState() {
    for (topicPartition <- controllerContext.partitionReplicaAssignment.keys) {
      // check if leader and isr path exists for partition. If not, then it is in NEW state
      controllerContext.partitionLeadershipInfo.get(topicPartition) match {
        case Some(currentLeaderIsrAndEpoch) =>
          // else, check if the leader for partition is alive. If yes, it is in Online state, else it is in Offline state
          if (controllerContext.liveBrokerIds.contains(currentLeaderIsrAndEpoch.leaderAndIsr.leader))
            // leader is alive
            // the partition has a leader/ISR record and its leader is in liveBrokerIds: set it to OnlinePartition
            partitionState.put(topicPartition, OnlinePartition)
          else
            // the partition has a leader/ISR record but its leader is not alive: set it to OfflinePartition
            partitionState.put(topicPartition, OfflinePartition)
        case None =>
          // the partition has no leaderAndIsr yet: its initial state is NewPartition
          partitionState.put(topicPartition, NewPartition)
      }
    }
  }
  
  def triggerOnlinePartitionStateChange() {
    try {
      brokerRequestBatch.newBatch()
      // try to move all partitions in NewPartition or OfflinePartition state to OnlinePartition state
      // except partitions that belong to topics to be deleted
      for((topicAndPartition, partitionState) <- partitionState
          if !controller.deleteTopicManager.isTopicQueuedUpForDeletion(topicAndPartition.topic)) {
        if(partitionState.equals(OfflinePartition) || partitionState.equals(NewPartition))
          handleStateChange(topicAndPartition.topic, topicAndPartition.partition, OnlinePartition, controller.offlinePartitionSelector,
                            (new CallbackBuilder).build)
      }
      brokerRequestBatch.sendRequestsToBrokers(controller.epoch)
    } catch {
      case e: Throwable => error("Error while moving some partitions to the online state", e)
      // TODO: It is not enough to bail out and log an error, it is important to trigger leader election for those partitions
    }
  }

handleStateChange(target to OnlinePartition)

triggerOnlinePartitionStateChange() {
 // for every partition except those of topics queued for deletion:
 ->handleStateChange(to OnlinePartition state)
   ->initializeLeaderAndIsrForPartition()
 // send the leader and ISR information to all live brokers through ControllerChannelManager
}

Here, initializeLeaderAndIsrForPartition initializes the leader and ISR path for a new partition.
A NewPartition has no leader/ISR path in ZooKeeper. Once it enters OnlinePartition that path has been created, so the partition can never return to NewPartition.

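initializeLeaderAndIsrForPartition() does the following:

  1. Read the partition's replicaAssignment (originally loaded from ZooKeeper) and keep only the replicas that are alive.
  2. Create a LeaderAndIsr object and, together with controller.epoch, wrap it in a LeaderIsrAndControllerEpoch object.
  3. Create a persistent path at "/brokers/topics/{topic}/partitions/{partitionId}/state" and write the LeaderIsrAndControllerEpoch data to it.
  4. Save the topicAndPartition's leaderIsrAndControllerEpoch in the controller context.
  5. Send what is now known (the leaderIsrAndControllerEpoch and the corresponding replicas) to the brokers.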
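
Steps 1 to 3 can be modelled as a pure function, leaving out the ZooKeeper write and the broker requests. This is a minimal sketch under stated assumptions: the case classes are simplified stand-ins for Kafka's `LeaderAndIsr` and `LeaderIsrAndControllerEpoch`, the function names are hypothetical, and picking the first live assigned replica as leader (with all live replicas as the initial ISR) reflects my reading of the Kafka code of that era rather than anything stated above.

```scala
object LeaderInitSketch {
  // Simplified stand-ins for the Kafka classes of the same name.
  final case class LeaderAndIsr(leader: Int, isr: List[Int])
  final case class LeaderIsrAndControllerEpoch(leaderAndIsr: LeaderAndIsr, controllerEpoch: Int)

  // Steps 1 and 2: filter the assignment down to live replicas, then build the record.
  // Assumption: the first live assigned replica becomes the leader.
  def initialLeaderAndIsr(
      assignedReplicas: Seq[Int],
      liveBrokerIds: Set[Int],
      controllerEpoch: Int): Either[String, LeaderIsrAndControllerEpoch] = {
    val liveAssignedReplicas = assignedReplicas.filter(liveBrokerIds.contains)
    liveAssignedReplicas.headOption match {
      case None =>
        // No live replica: the partition cannot be brought online.
        Left("no assigned replica is alive")
      case Some(leader) =>
        Right(LeaderIsrAndControllerEpoch(
          LeaderAndIsr(leader, liveAssignedReplicas.toList), controllerEpoch))
    }
  }

  // Step 3: the persistent path the record is written to.
  def leaderAndIsrPath(topic: String, partition: Int): String =
    s"/brokers/topics/$topic/partitions/$partition/state"
}
```

With assigned replicas 3, 1, 2 and live brokers 1 and 2, this sketch yields leader 1 and ISR List(1, 2).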

2. Taking partitions offline

On the controller's onBrokerFailure(), or on TopicDeletionManager's completeDeleteTopic(), a partition moves to the offline state whatever state it was in before. The possible transitions are:

  • NewPartition -> OfflinePartition
  • OnlinePartition -> OfflinePartition
  • OfflinePartition -> OfflinePartition

Once a partition is in OfflinePartition, the transitions that can follow are:

  • OfflinePartition -> OnlinePartition (on broker failure)
  • OfflinePartition -> NonExistentPartition (when the topic is being deleted)

1) Note what happens in onBrokerFailure():

    partitionStateMachine.handleStateChanges(partitionsWithoutLeader, OfflinePartition)
    // trigger OnlinePartition state changes for offline or new partitions
    partitionStateMachine.triggerOnlinePartitionStateChange()

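The partitions are taken offline, and triggerOnlinePartitionStateChange() is called immediately afterwards to bring them back online.
In other words, to keep the service available, a partition is not simply left offline: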

/**
   * This API invokes the OnlinePartition state change on all partitions in either the NewPartition or OfflinePartition
   * state. This is called on a successful controller election and on broker changes
   */
  def triggerOnlinePartitionStateChange() {
    try {
      brokerRequestBatch.newBatch()
      // try to move all partitions in NewPartition or OfflinePartition state to OnlinePartition state except partitions
      // that belong to topics to be deleted
      for((topicAndPartition, partitionState) <- partitionState
          if !controller.deleteTopicManager.isTopicQueuedUpForDeletion(topicAndPartition.topic)) {
        if(partitionState.equals(OfflinePartition) || partitionState.equals(NewPartition))
          handleStateChange(topicAndPartition.topic, topicAndPartition.partition, OnlinePartition, controller.offlinePartitionSelector,
                            (new CallbackBuilder).build)
      }
      brokerRequestBatch.sendRequestsToBrokers(controller.epoch)
    } catch {
      case e: Throwable => error("Error while moving some partitions to the online state", e)
      // TODO: It is not enough to bail out and log an error, it is important to trigger leader election for those partitions
    }
  }
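
The offline-then-online sequence on broker failure can be sketched on a plain in-memory map. This is an illustration, not Kafka's code: partitions are identified by strings, `LeaderSelector` is a hypothetical stand-in for whatever picks a new leader (the sketch uses it for NewPartition as well, which is a simplification), and a partition for which no leader can be found simply stays in its current state.

```scala
object BrokerFailureSketch {
  sealed trait PartitionState
  case object NewPartition extends PartitionState
  case object OnlinePartition extends PartitionState
  case object OfflinePartition extends PartitionState

  // Hypothetical leader selector: returns a new leader id, or None if none is available.
  type LeaderSelector = String => Option[Int]

  def onBrokerFailure(
      states: Map[String, PartitionState],
      partitionsWithoutLeader: Set[String],
      selectLeader: LeaderSelector): Map[String, PartitionState] = {
    // Step 1: every partition that lost its leader goes to OfflinePartition.
    val afterOffline: Map[String, PartitionState] = states.map { case (p, s) =>
      if (partitionsWithoutLeader.contains(p)) p -> (OfflinePartition: PartitionState)
      else p -> s
    }
    // Step 2: try to move New/Offline partitions to OnlinePartition right away.
    afterOffline.map {
      case (p, NewPartition | OfflinePartition) if selectLeader(p).isDefined =>
        p -> (OnlinePartition: PartitionState)
      case other => other
    }
  }
}
```

A partition whose leader died but for which the selector finds a replacement ends up back in OnlinePartition within the same call, which is the point the text above makes.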

2) In completeDeleteTopic():
The partition first moves to OfflinePartition and then, because the topic is being deleted, immediately moves on to NonExistentPartition.

private def completeDeleteTopic(topic: String) {
    // Deregister the PartitionChangeListener on the deleted topic.
    // This prevents the partition change listener from firing before the new topic listener when a deleted topic gets auto-created
    partitionStateMachine.deregisterPartitionChangeListener(topic)
    val replicasForDeletedTopic = controller.replicaStateMachine.replicasInState(topic, ReplicaDeletionSuccessful)
    
    // the controller removes these replicas from the replica state machine and from the cached partition assignment
    replicaStateMachine.handleStateChanges(replicasForDeletedTopic, NonExistentReplica)
    val partitionsForDeletedTopic = controllerContext.partitionsForTopic(topic)
    // move the respective partitions to the OfflinePartition and then the NonExistentPartition state
    partitionStateMachine.handleStateChanges(partitionsForDeletedTopic, OfflinePartition)
    partitionStateMachine.handleStateChanges(partitionsForDeletedTopic, NonExistentPartition)
    // stop tracking the deleted topic
    topicsToBeDeleted -= topic
    partitionsToBeDeleted.retain(_.topic != topic)
    // recursively delete the topic's paths from ZK
    val zkUtils = controllerContext.zkUtils
    zkUtils.zkClient.deleteRecursive(getTopicPath(topic))
    zkUtils.zkClient.deleteRecursive(getEntityConfigPath(ConfigType.Topic, topic))
    zkUtils.zkClient.delete(getDeleteTopicPath(topic))
    controllerContext.removeTopic(topic)
  }
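The two consecutive handleStateChanges calls follow from the valid-previous-state rules: NonExistentPartition can only be reached from OfflinePartition, so an online partition has to pass through OfflinePartition first. A minimal sketch of that ordering, assuming the transition rules listed in this article (including the OfflinePartition self-transition); `transition` and `deletePartition` are hypothetical helpers, not Kafka methods:

```scala
object DeletionOrderSketch {
  sealed trait PartitionState
  case object NonExistentPartition extends PartitionState
  case object NewPartition extends PartitionState
  case object OnlinePartition extends PartitionState
  case object OfflinePartition extends PartitionState

  // Only the two targets used during topic deletion are modelled here.
  private val validPrevious: Map[PartitionState, Set[PartitionState]] =
    Map[PartitionState, Set[PartitionState]](
      OfflinePartition -> Set[PartitionState](NewPartition, OnlinePartition, OfflinePartition),
      NonExistentPartition -> Set[PartitionState](OfflinePartition)
    )

  // Returns the new state, or an error message if the transition is not allowed.
  def transition(from: PartitionState, to: PartitionState): Either[String, PartitionState] =
    if (validPrevious.getOrElse(to, Set.empty[PartitionState]).contains(from)) Right(to)
    else Left(s"illegal transition $from -> $to")

  // Mirrors the order used in completeDeleteTopic(): offline first, then non-existent.
  def deletePartition(current: PartitionState): Either[String, PartitionState] =
    transition(current, OfflinePartition).flatMap(s => transition(s, NonExistentPartition))
}
```

In this model a direct OnlinePartition -> NonExistentPartition transition is rejected, while `deletePartition(OnlinePartition)` succeeds by going through OfflinePartition.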



handleStateChange(target to OfflinePartition)
Does nothing except change the partitionState recorded in memory to OfflinePartition.

3. Partition state machine diagram

[Figure: partition state machine diagram; image not included]