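When a Flink job restarts after a failure, or is started from a specified savepoint, the whole job must be restored to the state captured by the last successful checkpoint (or by that savepoint). This happens in two phases:
1. The CheckpointCoordinator loads the most recent successful CompletedCheckpoint and redistributes its state across the Executions (tasks).
2. Each task initializes its state when it starts.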

I. State Assignment

First, after creating the ExecutionGraph, the JobMaster tries to restore state from the most recent successful checkpoint, or loads a savepoint. Both paths end up calling CheckpointCoordinator.restoreLatestCheckpointedState():


class CheckpointCoordinator {
	public boolean restoreLatestCheckpointedState(
			Map<JobVertexID, ExecutionJobVertex> tasks,
			boolean errorIfNoCheckpoint,
			boolean allowNonRestoredState) throws Exception {
		synchronized (lock) {
			......
			// Restore from the latest checkpoint
			CompletedCheckpoint latest = completedCheckpointStore.getLatestCheckpoint();
			final Map<OperatorID, OperatorState> operatorStates = latest.getOperatorStates();
			StateAssignmentOperation stateAssignmentOperation =
					new StateAssignmentOperation(latest.getCheckpointID(), tasks, operatorStates, allowNonRestoredState);
			stateAssignmentOperation.assignStates();
			........
		}
	}
}
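
The two boolean parameters are easy to confuse, so here is a minimal, self-contained sketch of how they might behave. It does not use Flink classes: `RestoreFlagsSketch`, its string operator IDs and the exception messages are hypothetical stand-ins, and the behavior shown (fail or return false when no checkpoint exists, fail on unmapped state unless it is explicitly allowed) is an assumption about the elided parts of the method, not a copy of them.

```java
import java.util.Optional;
import java.util.Set;

/** Hypothetical model of the two flags; not Flink's actual implementation. */
final class RestoreFlagsSketch {

    /**
     * Returns true if state was restored, false if there was nothing to restore.
     *
     * @param latestCheckpointOperators operator IDs that have state in the latest
     *                                  checkpoint, or empty if no checkpoint exists
     * @param operatorsInJob            operator IDs present in the job being started
     */
    static boolean restore(
            Optional<Set<String>> latestCheckpointOperators,
            Set<String> operatorsInJob,
            boolean errorIfNoCheckpoint,
            boolean allowNonRestoredState) {

        if (!latestCheckpointOperators.isPresent()) {
            if (errorIfNoCheckpoint) {
                throw new IllegalStateException("No completed checkpoint available");
            }
            return false; // start with empty state
        }

        // Assumption: every piece of checkpointed state must map to an operator
        // of the job, unless the caller explicitly allows skipping it.
        for (String operatorId : latestCheckpointOperators.get()) {
            if (!operatorsInJob.contains(operatorId) && !allowNonRestoredState) {
                throw new IllegalStateException(
                        "State of operator " + operatorId + " cannot be mapped to the job");
            }
        }
        return true;
    }
}
```

In this model a checkpoint that contains state for a removed operator only restores when `allowNonRestoredState` is true, which matches the intent suggested by the parameter name.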

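The assignment logic is encapsulated in StateAssignmentOperation. If the job's parallelism has changed during recovery, the state of each subtask necessarily differs from what it held before, so the state has to be redistributed evenly across the new subtasks. For the details, see the Flink team's blog post "A Deep Dive into Rescalable State in Apache Flink":
https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html#reassigning-operator-state-when-rescaling
It describes in detail how operator state and keyed state are reassigned.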
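
As a rough illustration of the two redistribution schemes, the sketch below computes the contiguous key-group range owned by each subtask and spreads list-style operator state round-robin over the new subtasks. `RescaleSketch` and its method names are hypothetical. The range formula follows my understanding of how Flink splits key groups, and the round-robin loop is a deliberate simplification, so treat both as assumptions rather than the actual StateAssignmentOperation code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of state redistribution when rescaling. */
final class RescaleSketch {

    /**
     * Keyed state: returns the inclusive key-group range {start, end} owned by
     * one subtask. Key groups are numbered 0 .. maxParallelism - 1.
     */
    static int[] keyGroupRange(int maxParallelism, int parallelism, int subtaskIndex) {
        int start = (subtaskIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((subtaskIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] {start, end};
    }

    /**
     * Operator (list) state: collects the elements of all old partitions and
     * deals them out round-robin to the new subtasks.
     */
    static <T> List<List<T>> redistributeRoundRobin(
            List<List<T>> oldPartitions, int newParallelism) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            result.add(new ArrayList<>());
        }
        int next = 0;
        for (List<T> partition : oldPartitions) {
            for (T element : partition) {
                result.get(next % newParallelism).add(element);
                next++;
            }
        }
        return result;
    }
}
```

With 128 key groups and parallelism 3, the three subtasks own the ranges 0 to 42, 43 to 85 and 86 to 127. Because the ranges are contiguous, a rescaled subtask only has to read the key groups in its own range instead of scanning all of the old state.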

Finally, the state assigned to each task is wrapped in a JobManagerTaskRestore and attached to the Execution through Execution.setInitialState(). The JobManagerTaskRestore is then shipped to the TaskExecutor as a field of the TaskDeploymentDescriptor.

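II. Task State Initialization

Once the TaskDeploymentDescriptor has been submitted to the TaskExecutor, the TaskExecutor creates a TaskStateManager to manage the state of that task. The TaskStateManager is built from the assigned JobManagerTaskRestore and the local state store, TaskLocalStateStore: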

class TaskExecutor {
	@Override
	public CompletableFuture<Acknowledge> submitTask(
			TaskDeploymentDescriptor tdd,
			JobMasterId jobMasterId,
			Time timeout) {

		.......

		// local state store
		final TaskLocalStateStore localStateStore = localStateStoresManager.localStateStoreForSubtask(
				jobId,
				tdd.getAllocationId(),
				taskInformation.getJobVertexId(),
				tdd.getSubtaskIndex());
		// state assigned by the JobManager for recovery
		final JobManagerTaskRestore taskRestore = tdd.getTaskRestore();
		// create the TaskStateManager
		final TaskStateManager taskStateManager = new TaskStateManagerImpl(
				jobId,
				tdd.getExecutionAttemptId(),
				localStateStore,
				taskRestore,
				checkpointResponder);

		// create and start the Task
		......
	}
}

After the task starts, StreamTask calls initializeState, which in turn makes every operator call StreamOperator.initializeState() to initialize its state:

public abstract class AbstractStreamOperator<OUT>
		implements StreamOperator<OUT>, Serializable {
	@Override
	public final void initializeState() throws Exception {
		final TypeSerializer<?> keySerializer = config.getStateKeySerializer(getUserCodeClassloader());

		final StreamTask<?, ?> containingTask =
			Preconditions.checkNotNull(getContainingTask());
		final CloseableRegistry streamTaskCloseableRegistry =
			Preconditions.checkNotNull(containingTask.getCancelables());
		final StreamTaskStateInitializer streamTaskStateManager =
			Preconditions.checkNotNull(containingTask.createStreamTaskStateInitializer());

		// Create the StreamOperatorStateContext. This step performs the state restore,
		// so operatorStateBackend and keyedStateBackend are restored to the last checkpoint;
		// timeServiceManager is restored as well.
		final StreamOperatorStateContext context =
			streamTaskStateManager.streamOperatorStateContext(
				getOperatorID(),
				getClass().getSimpleName(),
				this,
				keySerializer,
				streamTaskCloseableRegistry,
				metrics);

		this.operatorStateBackend = context.operatorStateBackend();
		this.keyedStateBackend = context.keyedStateBackend();

		if (keyedStateBackend != null) {
			this.keyedStateStore = new DefaultKeyedStateStore(keyedStateBackend, getExecutionConfig());
		}

		timeServiceManager = context.internalTimerServiceManager();

		CloseableIterable<KeyGroupStatePartitionStreamProvider> keyedStateInputs = context.rawKeyedStateInputs();
		CloseableIterable<StatePartitionStreamProvider> operatorStateInputs = context.rawOperatorStateInputs();

		try {
			// StateInitializationContext exposes the state backend, timer service manager, etc.; an operator uses it to initialize its state
			StateInitializationContext initializationContext = new StateInitializationContextImpl(
				context.isRestored(), // information whether we restore or start for the first time
				operatorStateBackend, // access to operator state backend
				keyedStateStore, // access to keyed state backend
				keyedStateInputs, // access to keyed state stream
				operatorStateInputs); // access to operator state stream

			// perform state initialization; implemented in subclasses, e.g. by calling the UDF's state initialization method
			initializeState(initializationContext);
		} finally {
			closeFromRegistry(operatorStateInputs, streamTaskCloseableRegistry);
			closeFromRegistry(keyedStateInputs, streamTaskCloseableRegistry);
		}
	}

	@Override
	public void initializeState(StateInitializationContext context) throws Exception {
	}
}

public abstract class AbstractUdfStreamOperator<OUT, F extends Function>
		extends AbstractStreamOperator<OUT>
		implements OutputTypeConfigurable<OUT> {
	@Override
	public void initializeState(StateInitializationContext context) throws Exception {
		super.initializeState(context);
		// the user function runs its state initialization here
		StreamingFunctionUtils.restoreFunctionState(context, userFunction);
	}
}
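
The call chain above is a template method: the final, no-argument entry point builds the context, and an overridable hook lets subclasses forward it to the user function. The sketch below reproduces that shape without any Flink dependency. Every type in it (`InitContextSketch`, `StatefulFunctionSketch`, `OperatorSketch` and so on) is a hypothetical stand-in, and passing the restored values in as a plain list is an assumption made only to keep the example runnable.

```java
import java.util.List;

/** Hypothetical stand-in for StateInitializationContext. */
interface InitContextSketch {
    boolean isRestored();
    List<Long> restoredValues();
}

/** Hypothetical stand-in for a user function that keeps checkpointed state. */
interface StatefulFunctionSketch {
    void initializeState(InitContextSketch context);
}

/** User function: rebuilds its counter from the restored partial counts. */
class CountingFunctionSketch implements StatefulFunctionSketch {
    long count;

    @Override
    public void initializeState(InitContextSketch context) {
        count = 0;
        if (context.isRestored()) {
            // After rescaling, one subtask may receive several partial counts.
            for (long partial : context.restoredValues()) {
                count += partial;
            }
        }
    }
}

/** Base operator: fixed entry point plus an overridable hook. */
abstract class OperatorSketch {
    final void initializeState(final boolean restored, final List<Long> values) {
        InitContextSketch context = new InitContextSketch() {
            @Override public boolean isRestored() { return restored; }
            @Override public List<Long> restoredValues() { return values; }
        };
        initializeState(context); // hook implemented by subclasses
    }

    protected void initializeState(InitContextSketch context) {
        // no state by default
    }
}

/** Operator wrapping a user function: forwards the context to it. */
class UdfOperatorSketch extends OperatorSketch {
    private final StatefulFunctionSketch userFunction;

    UdfOperatorSketch(StatefulFunctionSketch userFunction) {
        this.userFunction = userFunction;
    }

    @Override
    protected void initializeState(InitContextSketch context) {
        super.initializeState(context);
        userFunction.initializeState(context);
    }
}
```

Checking `isRestored()` before reading is the important habit here: on a fresh start there is nothing to read, and the function should fall back to its initial value.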

The key step in state recovery is StreamTaskStateInitializer.streamOperatorStateContext(), which produces a StreamOperatorStateContext. The state backends, the timer service manager and so on are obtained from this context:

public interface StreamOperatorStateContext {
	// Returns true if the states provided by this context are restored from a checkpoint/savepoint.
	boolean isRestored();

	// Returns the operator state backend for the stream operator.
	OperatorStateBackend operatorStateBackend();

	// Returns the keyed state backend for the stream operator. This method returns null for non-keyed operators.
	AbstractKeyedStateBackend<?> keyedStateBackend();

	// Returns the internal timer service manager for the stream operator. This method returns null for non-keyed operators.
	InternalTimeServiceManager<?> internalTimerServiceManager();

	// Returns an iterable to obtain input streams for previously stored operator state partitions that are assigned to this stream operator.
	CloseableIterable<StatePartitionStreamProvider> rawOperatorStateInputs();

	// Returns an iterable to obtain input streams for previously stored keyed state partitions that are assigned to this operator. This method returns null for non-keyed operators.
	CloseableIterable<KeyGroupStatePartitionStreamProvider> rawKeyedStateInputs();
}

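To build the StreamOperatorStateContext, the initializer first calls TaskStateManager.prioritizedOperatorState() to get the state handles each operator needs to restore, and then uses those handles to create and restore the state backends and timers.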

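This introduces PrioritizedOperatorSubtaskState, which wraps several alternative OperatorSubtaskState snapshots. The alternatives are interchangeable and ordered by priority. The last entry in the list contains all of the subtask's state but has the lowest priority. During recovery, state is read from the higher-priority handles first.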
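
A minimal sketch of that ordering, assuming the higher-priority alternatives are task-local copies and the complete fallback is the snapshot assigned by the JobManager. `PrioritizedStateSketch` and its generic snapshot type are hypothetical; the real class also validates the alternatives, which is omitted here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of a prioritized list of restore alternatives. */
final class PrioritizedStateSketch {

    /**
     * Returns the alternatives in restore order: the given higher-priority
     * alternatives first, then the complete fallback snapshot as the last entry.
     */
    static <S> List<S> prioritize(List<S> higherPriorityAlternatives, S completeSnapshot) {
        List<S> ordered = new ArrayList<>(higherPriorityAlternatives);
        ordered.add(completeSnapshot); // always present, lowest priority
        return Collections.unmodifiableList(ordered);
    }
}
```

Keeping the complete snapshot as the last element means recovery always has something to fall back on, even when no higher-priority alternative exists or all of them turn out to be unusable.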


Once the PrioritizedOperatorSubtaskState is available, recovery can proceed:

public class StreamTaskStateInitializerImpl implements StreamTaskStateInitializer {
	@Override
	public StreamOperatorStateContext streamOperatorStateContext(
		@Nonnull OperatorID operatorID,
		@Nonnull String operatorClassName,
		@Nonnull KeyContext keyContext,
		@Nullable TypeSerializer<?> keySerializer,
		@Nonnull CloseableRegistry streamTaskCloseableRegistry,
		@Nonnull MetricGroup metricGroup) throws Exception {

		TaskInfo taskInfo = environment.getTaskInfo();
		OperatorSubtaskDescriptionText operatorSubtaskDescription =
			new OperatorSubtaskDescriptionText(
				operatorID,
				operatorClassName,
				taskInfo.getIndexOfThisSubtask(),
				taskInfo.getNumberOfParallelSubtasks());

		final String operatorIdentifierText = operatorSubtaskDescription.toString();

		// first obtain the PrioritizedOperatorSubtaskState used to restore state
		final PrioritizedOperatorSubtaskState prioritizedOperatorSubtaskStates =
			taskStateManager.prioritizedOperatorState(operatorID);

		AbstractKeyedStateBackend<?> keyedStatedBackend = null;
		OperatorStateBackend operatorStateBackend = null;
		CloseableIterable<KeyGroupStatePartitionStreamProvider> rawKeyedStateInputs = null;
		CloseableIterable<StatePartitionStreamProvider> rawOperatorStateInputs = null;
		InternalTimeServiceManager<?> timeServiceManager;

		try {
			// -------------- Keyed State Backend --------------
			keyedStatedBackend = keyedStatedBackend(
				keySerializer,
				operatorIdentifierText,
				prioritizedOperatorSubtaskStates,
				streamTaskCloseableRegistry,
				metricGroup);

			// -------------- Operator State Backend --------------
			operatorStateBackend = operatorStateBackend(
				operatorIdentifierText,
				prioritizedOperatorSubtaskStates,
				streamTaskCloseableRegistry);

			// -------------- Raw State Streams --------------
			rawKeyedStateInputs = rawKeyedStateInputs(
				prioritizedOperatorSubtaskStates.getPrioritizedRawKeyedState().iterator());
			streamTaskCloseableRegistry.registerCloseable(rawKeyedStateInputs);

			rawOperatorStateInputs = rawOperatorStateInputs(
				prioritizedOperatorSubtaskStates.getPrioritizedRawOperatorState().iterator());
			streamTaskCloseableRegistry.registerCloseable(rawOperatorStateInputs);

			// -------------- Internal Timer Service Manager --------------
			timeServiceManager = internalTimeServiceManager(keyedStatedBackend, keyContext, rawKeyedStateInputs);

			// -------------- Preparing return value --------------

			return new StreamOperatorStateContextImpl(
				prioritizedOperatorSubtaskStates.isRestored(),
				operatorStateBackend,
				keyedStatedBackend,
				timeServiceManager,
				rawOperatorStateInputs,
				rawKeyedStateInputs);
		} catch (Exception ex) {
			//.......
		}
	}
}

State restoration is coupled with the creation of the state backends and is carried out by BackendRestorerProcedure. The actual logic lives in the
BackendRestorerProcedure.createAndRestore method.
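
The general idea of creating a backend from prioritized alternatives can be sketched as a try-next-on-failure loop. The code below is not Flink's implementation: `RestoreWithFallbackSketch`, its generic parameters and the rule that at least one alternative must be supplied are assumptions chosen to keep the example self-contained.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical model of "create and restore" with fallback to lower priorities. */
final class RestoreWithFallbackSketch {

    /**
     * Tries each alternative in priority order. The factory both creates the
     * backend and restores it from the given snapshot; if it throws, the next
     * alternative is tried.
     */
    static <S, B> B createAndRestore(List<S> alternatives, Function<S, B> backendFactory) {
        if (alternatives.isEmpty()) {
            throw new IllegalArgumentException("At least one alternative is required");
        }
        RuntimeException failures = null;
        for (S alternative : alternatives) {
            try {
                return backendFactory.apply(alternative);
            } catch (RuntimeException e) {
                // Remember the failure and fall through to the next alternative.
                if (failures == null) {
                    failures = e;
                } else {
                    failures.addSuppressed(e);
                }
            }
        }
        throw new IllegalStateException(
                "Could not restore from any of the " + alternatives.size() + " alternatives",
                failures);
    }
}
```

Coupling creation and restore in one factory call means a half-restored backend is never handed to the operator: either an alternative yields a fully restored backend, or it is discarded and the next one is tried.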

Reference: https://blog.jrwang.me/2019/flink-source-code-checkpoint/