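When a Flink job restarts after a failure, or is started from a specified savepoint, the whole job has to be restored to the state of the last successful checkpoint. This happens in two phases:
1. The CheckpointCoordinator loads the most recent successful CompletedCheckpoint and redistributes its state across the Executions (Tasks).
2. Each Task initializes its state when it starts.
I. State assignment
First, after creating the ExecutionGraph, the JobMaster tries to restore state from the most recent successful checkpoint, or loads a savepoint. Both paths end up calling CheckpointCoordinator.restoreLatestCheckpointedState():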
class CheckpointCoordinator {
    public boolean restoreLatestCheckpointedState(
            Map<JobVertexID, ExecutionJobVertex> tasks,
            boolean errorIfNoCheckpoint,
            boolean allowNonRestoredState) throws Exception {
        synchronized (lock) {
            // ...
            // Restore from the latest checkpoint
            CompletedCheckpoint latest = completedCheckpointStore.getLatestCheckpoint();
            final Map<OperatorID, OperatorState> operatorStates = latest.getOperatorStates();
            StateAssignmentOperation stateAssignmentOperation =
                    new StateAssignmentOperation(latest.getCheckpointID(), tasks, operatorStates, allowNonRestoredState);
            stateAssignmentOperation.assignStates();
            // ...
        }
    }
}
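The assignment logic is encapsulated in StateAssignmentOperation. If the job's parallelism changed between the checkpoint and the restore, the state held by each subtask no longer matches what it held before, so the state has to be redistributed evenly across the new subtasks. For the details, see the Flink team's blog post "A Deep Dive into Rescalable State in Apache Flink":
https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html#reassigning-operator-state-when-rescaling It explains in detail how operator state and keyed state are reassigned.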
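As a rough illustration of keyed state redistribution, the sketch below computes which contiguous range of key groups each subtask would own for a given parallelism. The names KeyGroupRangeSketch and rangeFor are hypothetical, and the arithmetic is only a plausible way to split key groups evenly; it is not taken from StateAssignmentOperation. It assumes the number of key groups (the maximum parallelism) stays fixed while the parallelism changes.

```java
// Hypothetical helper, not part of Flink: splits the key groups
// [0, maxParallelism) into one contiguous range per subtask.
class KeyGroupRangeSketch {

    // Returns {firstKeyGroup, lastKeyGroup}, both inclusive, for the given subtask.
    static int[] rangeFor(int maxParallelism, int parallelism, int subtaskIndex) {
        if (parallelism <= 0 || parallelism > maxParallelism) {
            throw new IllegalArgumentException("parallelism must be in [1, maxParallelism]");
        }
        if (subtaskIndex < 0 || subtaskIndex >= parallelism) {
            throw new IllegalArgumentException("subtaskIndex must be in [0, parallelism)");
        }
        // Ceiling division for the start, floor division for the end, so that
        // neighbouring ranges neither overlap nor leave gaps.
        int start = (subtaskIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((subtaskIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // assumed number of key groups
        for (int parallelism : new int[] {2, 3}) {
            for (int i = 0; i < parallelism; i++) {
                int[] r = rangeFor(maxParallelism, parallelism, i);
                System.out.println("parallelism=" + parallelism
                        + " subtask=" + i + " keyGroups=[" + r[0] + ", " + r[1] + "]");
            }
        }
    }
}
```

With 128 key groups, going from parallelism 2 to 3 changes the ranges from 0-63 and 64-127 to 0-42, 43-85 and 86-127. Each new subtask then only needs the state of the key groups in its own range.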
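Finally, the state assigned to each Task is wrapped in a JobManagerTaskRestore and attached to the Execution via Execution.setInitialState(). The JobManagerTaskRestore is then shipped to the TaskExecutor as a field of the TaskDeploymentDescriptor.
II. Task state initialization
Once the TaskDeploymentDescriptor has been submitted to the TaskExecutor, the TaskExecutor creates a TaskStateManager to manage the state of that Task. The TaskStateManager is built from the assigned JobManagerTaskRestore and the local state store, TaskLocalStateStore: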
class TaskExecutor {
    @Override
    public CompletableFuture<Acknowledge> submitTask(
            TaskDeploymentDescriptor tdd,
            JobMasterId jobMasterId,
            Time timeout) {
        // ...
        // Local state store
        final TaskLocalStateStore localStateStore = localStateStoresManager.localStateStoreForSubtask(
                jobId,
                tdd.getAllocationId(),
                taskInformation.getJobVertexId(),
                tdd.getSubtaskIndex());
        // State assigned by the JobManager for the restore
        final JobManagerTaskRestore taskRestore = tdd.getTaskRestore();
        // Create the TaskStateManager
        final TaskStateManager taskStateManager = new TaskStateManagerImpl(
                jobId,
                tdd.getExecutionAttemptId(),
                localStateStore,
                taskRestore,
                checkpointResponder);
        // Create and start the Task
        // ...
    }
}
After the Task starts, StreamTask calls initializeState, which in turn makes every operator call StreamOperator.initializeState() to initialize its state:
public abstract class AbstractStreamOperator<OUT>
        implements StreamOperator<OUT>, Serializable {
    @Override
    public final void initializeState() throws Exception {
        final TypeSerializer<?> keySerializer = config.getStateKeySerializer(getUserCodeClassloader());
        final StreamTask<?, ?> containingTask =
                Preconditions.checkNotNull(getContainingTask());
        final CloseableRegistry streamTaskCloseableRegistry =
                Preconditions.checkNotNull(containingTask.getCancelables());
        final StreamTaskStateInitializer streamTaskStateManager =
                Preconditions.checkNotNull(containingTask.createStreamTaskStateInitializer());
        // Create the StreamOperatorStateContext. This step performs the restore, so that
        // operatorStateBackend and keyedStateBackend come back in the state of the last checkpoint.
        // timeServiceManager is restored as well.
        final StreamOperatorStateContext context =
                streamTaskStateManager.streamOperatorStateContext(
                        getOperatorID(),
                        getClass().getSimpleName(),
                        this,
                        keySerializer,
                        streamTaskCloseableRegistry,
                        metrics);
        this.operatorStateBackend = context.operatorStateBackend();
        this.keyedStateBackend = context.keyedStateBackend();
        if (keyedStateBackend != null) {
            this.keyedStateStore = new DefaultKeyedStateStore(keyedStateBackend, getExecutionConfig());
        }
        timeServiceManager = context.internalTimerServiceManager();
        CloseableIterable<KeyGroupStatePartitionStreamProvider> keyedStateInputs = context.rawKeyedStateInputs();
        CloseableIterable<StatePartitionStreamProvider> operatorStateInputs = context.rawOperatorStateInputs();
        try {
            // StateInitializationContext exposes the state backends, the raw state streams, etc.;
            // the operator uses it to initialize its state
            StateInitializationContext initializationContext = new StateInitializationContextImpl(
                    context.isRestored(), // information whether we restore or start for the first time
                    operatorStateBackend, // access to operator state backend
                    keyedStateStore, // access to keyed state backend
                    keyedStateInputs, // access to keyed state stream
                    operatorStateInputs); // access to operator state stream
            // Run the state initialization, implemented in subclasses,
            // for example by calling the UDF's state initialization method
            initializeState(initializationContext);
        } finally {
            closeFromRegistry(operatorStateInputs, streamTaskCloseableRegistry);
            closeFromRegistry(keyedStateInputs, streamTaskCloseableRegistry);
        }
    }

    @Override
    public void initializeState(StateInitializationContext context) throws Exception {
    }
}

public abstract class AbstractUdfStreamOperator<OUT, F extends Function>
        extends AbstractStreamOperator<OUT>
        implements OutputTypeConfigurable<OUT> {
    @Override
    public void initializeState(StateInitializationContext context) throws Exception {
        super.initializeState(context);
        // Call the user function's state initialization method
        StreamingFunctionUtils.restoreFunctionState(context, userFunction);
    }
}
The key step of the restore is StreamTaskStateInitializer.streamOperatorStateContext(), which produces a StreamOperatorStateContext. Through the StreamOperatorStateContext, the operator obtains the state backends, the timer service manager, and so on:
public interface StreamOperatorStateContext {
    // Returns true if the states provided by this context are restored from a checkpoint/savepoint.
    boolean isRestored();
    // Returns the operator state backend for the stream operator.
    OperatorStateBackend operatorStateBackend();
    // Returns the keyed state backend for the stream operator. This method returns null for non-keyed operators.
    AbstractKeyedStateBackend<?> keyedStateBackend();
    // Returns the internal timer service manager for the stream operator. This method returns null for non-keyed operators.
    InternalTimeServiceManager<?> internalTimerServiceManager();
    // Returns an iterable to obtain input streams for previously stored operator state partitions that are assigned to this stream operator.
    CloseableIterable<StatePartitionStreamProvider> rawOperatorStateInputs();
    // Returns an iterable to obtain input streams for previously stored keyed state partitions that are assigned to this operator. This method returns null for non-keyed operators.
    CloseableIterable<KeyGroupStatePartitionStreamProvider> rawKeyedStateInputs();
}
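To build the StreamOperatorStateContext, the initializer first calls TaskStateManager.prioritizedOperatorState() to obtain the state handles that each operator needs to restore, and then uses those handles to create and restore the state backends and the timer service.
This is where PrioritizedOperatorSubtaskState comes in. It wraps several alternative OperatorSubtaskState snapshots that can substitute for one another and are ordered by priority. The last entry in the list contains the complete state of the subtask but has the lowest priority. During the restore, state is read from the higher-priority handles first.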
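The ordering described above can be sketched as a small wrapper that puts any alternative snapshots first and the complete snapshot last. This is only an illustration under stated assumptions: PrioritizedSnapshotsSketch is a hypothetical stand-in rather than Flink's class, snapshots are represented by plain strings, and the higher-priority alternatives are assumed to come from the local state store while the complete snapshot is the one assigned by the JobManager.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for PrioritizedOperatorSubtaskState; snapshots are plain strings here.
class PrioritizedSnapshotsSketch {
    private final List<String> prioritized;
    private final boolean restored;

    // localAlternatives: assumed to come from the local state store, best first (may be empty).
    // jobManagerSnapshot: the complete snapshot, or null when the task starts without state.
    PrioritizedSnapshotsSketch(List<String> localAlternatives, String jobManagerSnapshot) {
        List<String> ordered = new ArrayList<>();
        if (jobManagerSnapshot != null) {
            ordered.addAll(localAlternatives); // higher priority, tried first
            ordered.add(jobManagerSnapshot);   // complete state, lowest priority, always last
        }
        this.prioritized = Collections.unmodifiableList(ordered);
        this.restored = jobManagerSnapshot != null;
    }

    // Alternatives in the order in which a restore should try them.
    List<String> alternatives() {
        return prioritized;
    }

    // True when there is state to restore from.
    boolean isRestored() {
        return restored;
    }

    public static void main(String[] args) {
        List<String> local = new ArrayList<>();
        local.add("local-copy");
        PrioritizedSnapshotsSketch s = new PrioritizedSnapshotsSketch(local, "jobmanager-snapshot");
        System.out.println(s.alternatives() + " restored=" + s.isRestored());
    }
}
```

Keeping the complete snapshot as the final entry means a restore can always fall back to it when a faster alternative is missing or unusable.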
Once the PrioritizedOperatorSubtaskState is available, the state can be restored:
public class StreamTaskStateInitializerImpl implements StreamTaskStateInitializer {
    @Override
    public StreamOperatorStateContext streamOperatorStateContext(
            @Nonnull OperatorID operatorID,
            @Nonnull String operatorClassName,
            @Nonnull KeyContext keyContext,
            @Nullable TypeSerializer<?> keySerializer,
            @Nonnull CloseableRegistry streamTaskCloseableRegistry,
            @Nonnull MetricGroup metricGroup) throws Exception {
        TaskInfo taskInfo = environment.getTaskInfo();
        OperatorSubtaskDescriptionText operatorSubtaskDescription =
                new OperatorSubtaskDescriptionText(
                        operatorID,
                        operatorClassName,
                        taskInfo.getIndexOfThisSubtask(),
                        taskInfo.getNumberOfParallelSubtasks());
        final String operatorIdentifierText = operatorSubtaskDescription.toString();
        // First obtain the PrioritizedOperatorSubtaskState used for the restore
        final PrioritizedOperatorSubtaskState prioritizedOperatorSubtaskStates =
                taskStateManager.prioritizedOperatorState(operatorID);
        AbstractKeyedStateBackend<?> keyedStatedBackend = null;
        OperatorStateBackend operatorStateBackend = null;
        CloseableIterable<KeyGroupStatePartitionStreamProvider> rawKeyedStateInputs = null;
        CloseableIterable<StatePartitionStreamProvider> rawOperatorStateInputs = null;
        InternalTimeServiceManager<?> timeServiceManager;
        try {
            // -------------- Keyed State Backend --------------
            keyedStatedBackend = keyedStatedBackend(
                    keySerializer,
                    operatorIdentifierText,
                    prioritizedOperatorSubtaskStates,
                    streamTaskCloseableRegistry,
                    metricGroup);
            // -------------- Operator State Backend --------------
            operatorStateBackend = operatorStateBackend(
                    operatorIdentifierText,
                    prioritizedOperatorSubtaskStates,
                    streamTaskCloseableRegistry);
            // -------------- Raw State Streams --------------
            rawKeyedStateInputs = rawKeyedStateInputs(
                    prioritizedOperatorSubtaskStates.getPrioritizedRawKeyedState().iterator());
            streamTaskCloseableRegistry.registerCloseable(rawKeyedStateInputs);
            rawOperatorStateInputs = rawOperatorStateInputs(
                    prioritizedOperatorSubtaskStates.getPrioritizedRawOperatorState().iterator());
            streamTaskCloseableRegistry.registerCloseable(rawOperatorStateInputs);
            // -------------- Internal Timer Service Manager --------------
            timeServiceManager = internalTimeServiceManager(keyedStatedBackend, keyContext, rawKeyedStateInputs);
            // -------------- Preparing return value --------------
            return new StreamOperatorStateContextImpl(
                    prioritizedOperatorSubtaskStates.isRestored(),
                    operatorStateBackend,
                    keyedStatedBackend,
                    timeServiceManager,
                    rawOperatorStateInputs,
                    rawKeyedStateInputs);
        } catch (Exception ex) {
            // ...
        }
    }
}
Restoring the state is coupled with creating the state backend. Both are carried out by BackendRestorerProcedure; the actual logic is in
the BackendRestorerProcedure.createAndRestore method.
Reference: https://blog.jrwang.me/2019/flink-source-code-checkpoint/
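A minimal sketch of such a combined create-and-restore step, assuming it walks the prioritized alternatives in order and falls back to the next one whenever a restore attempt fails. The class RestoreWithFallbackSketch, its generic parameters, and the restoreFrom callback are hypothetical and only mirror the idea; they are not the signature of BackendRestorerProcedure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of "create the backend and restore it" with fallback.
class RestoreWithFallbackSketch {

    // Tries each alternative in priority order and returns the first backend that restores.
    // S is the snapshot type, B the backend type; restoreFrom builds a backend from a snapshot.
    static <S, B> B createAndRestore(List<S> prioritizedAlternatives, Function<S, B> restoreFrom) {
        List<RuntimeException> failures = new ArrayList<>();
        for (S alternative : prioritizedAlternatives) {
            try {
                return restoreFrom.apply(alternative);
            } catch (RuntimeException e) {
                failures.add(e); // remember the failure and try the next alternative
            }
        }
        // This sketch treats "nothing worked" (including an empty list) as an error.
        IllegalStateException all = new IllegalStateException(
                "Could not restore from any of the " + prioritizedAlternatives.size() + " alternatives");
        failures.forEach(all::addSuppressed);
        throw all;
    }

    public static void main(String[] args) {
        List<String> alternatives = Arrays.asList("local-corrupt", "jobmanager");
        String backend = createAndRestore(alternatives, snapshot -> {
            if (snapshot.contains("corrupt")) {
                throw new IllegalStateException("cannot read " + snapshot);
            }
            return "backend<" + snapshot + ">";
        });
        System.out.println(backend);
    }
}
```

Collecting the individual failures as suppressed exceptions keeps the cause of every failed attempt visible when no alternative succeeds.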