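The source-code entry point for Flink checkpointing is DefaultExecutionGraphBuilder.buildGraph(), so this article starts its analysis there. A second path that triggers a checkpoint (abbreviated CK below) is set up when the Dispatcher is created; its entry point is DefaultDispatcherGatewayServiceFactory.create(), and the call chain is as follows: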
DefaultDispatcherGatewayServiceFactory.create()
dispatcherFactory.createDispatcher(...)
JobDispatcherFactory.createDispatcher(...)
new MiniDispatcher(...)
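// While the MiniDispatcher instance is created, the parent class Dispatcher initializes its field dispatcherCachedOperationsHandler and passes it a reference to triggerSavepointAndGetLocation; when a savepoint is requested, that method leads step by step to JobMaster.triggerSavepoint()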
this.dispatcherCachedOperationsHandler = new DispatcherCachedOperationsHandler(
dispatcherServices.getOperationCaches(),
this::triggerSavepointAndGetLocation,
this::stopWithSavepointAndGetLocation
);
gateway.triggerSavepoint()
schedulerNG.triggerSavepoint(targetDirectory, cancelJob, formatType)
checkpointCoordinator.triggerSavepoint(targetDirectory, formatType)
triggerSavepointInternal(properties, targetLocation)
triggerCheckpointFromCheckpointThread(checkpointProperties, targetLocation, false)
triggerCheckpoint(checkpointProperties, targetLocation, isPeriodic)
// From the next line on, this call stack joins the call chain that starts at DefaultExecutionGraphBuilder.buildGraph(); everything after it is identical
chooseRequestToExecute(request).ifPresent(this::startTriggeringCheckpoint)
The line chooseRequestToExecute(request).ifPresent(this::startTriggeringCheckpoint) can therefore be regarded as the common entry point of checkpointing.
- Starting from DefaultExecutionGraphBuilder.buildGraph():
executionGraph.enableCheckpointing(
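// The checkpoint configuration object; it holds the checkpoint-related options such as the checkpoint interval and the maximum number of concurrent checkpoints.
chkConfig,
// A set of hooks for customizing checkpoint behavior. They let you plug custom logic into checkpoint lifecycle events,
// for example when a checkpoint is triggered or restored.
hooks,
// A counter that generates unique checkpoint IDs; every checkpoint gets a new ID.
checkpointIdCounter,
// The store for completed checkpoints. After a checkpoint completes successfully, it is recorded here so that it can be used for recovery.
completedCheckpointStore,
// rootBackend and rootStorage: the state backend and the checkpoint storage.
// The checkpoint storage determines where checkpoint data is persisted, for example on a distributed file system.
rootBackend,
rootStorage,
// Supplies the checkpoint stats tracker, which monitors and records performance metrics and statistics of checkpoint operations.
checkpointStatsTrackerFactory.get(),
// The object that cleans up expired checkpoints. Once more than the retained number of checkpoints exist, older ones are deleted to free storage space.
checkpointsCleaner,
// Selects the state changelog storage. The state changelog records state changes so that state can be restored after a failure.
jobManagerConfig.getString(STATE_CHANGE_LOG_STORAGE)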
);
If checkpointing is enabled, the preceding code instantiates rootBackend, rootStorage and hooks and obtains chkConfig; the call to executionGraph.enableCheckpointing() then formally starts the checkpoint setup flow.
- Next, trace the source of enableCheckpointing:
new CheckpointCoordinator(
jobInformation.getJobId(),
chkConfig,
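operatorCoordinators, // The set of operator coordinators (OperatorCoordinator) whose state is included in each checkpoint.
checkpointIDCounter, // A counter that generates unique checkpoint IDs; every checkpoint gets a new ID.
checkpointStore, // The store that holds completed checkpoints (their metadata).
checkpointStorage, // The storage that actually persists checkpoint data: it writes checkpoint state, including operator state, to a distributed file system or another storage medium.
ioExecutor, // A thread pool for the I/O work related to checkpointing, such as persisting and restoring data.
checkpointsCleaner, // The object that cleans up expired checkpoints; it deletes old checkpoints that are no longer needed to free storage space.
// The timer used to trigger periodic checkpoints at the configured interval.
new ScheduledExecutorServiceAdapter(checkpointCoordinatorTimer),
failureManager, // The checkpoint failure manager; it decides how to react when checkpoints fail, for example failing the job once the tolerable number of failures is exceeded.
// The CheckpointPlanCalculator, which computes the plan for each checkpoint: which tasks to trigger and which tasks must acknowledge.
// The argument chkConfig.isEnableCheckpointsAfterTasksFinish() indicates whether checkpoints are still taken after some tasks have finished.
createCheckpointPlanCalculator(chkConfig.isEnableCheckpointsAfterTasksFinish()),
// Provides the mapping from execution attempts to execution vertices, so that a message referring to an attempt can be resolved to its vertex in the execution graph.
new ExecutionAttemptMappingProvider(getAllExecutionVertices()),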
checkpointStatsTracker
);
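This creates the CheckpointCoordinator, the core class responsible for checkpointing. Its main responsibilities are:
- Trigger checkpoints periodically by instructing the sources to emit checkpoint barriers.
- Receive the acknowledgement messages that operators send for a given checkpoint.
- Once all operators have acknowledged a checkpoint, notify them that the checkpoint is complete.
- Keep track of completed and in-progress checkpoints.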
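The acknowledgement bookkeeping described in the list can be illustrated with a minimal sketch. ToyPendingCheckpoint and its methods are hypothetical names made up for this illustration; this is not Flink's PendingCheckpoint, which tracks far more (state handles, timeouts, disposal).

```java
import java.util.HashSet;
import java.util.Set;

/** Toy stand-in for a pending checkpoint (hypothetical, not Flink's class). */
final class ToyPendingCheckpoint {
    private final long checkpointId;
    // Tasks that still have to acknowledge this checkpoint.
    private final Set<String> notYetAcknowledged;

    ToyPendingCheckpoint(long checkpointId, Set<String> tasksToAcknowledge) {
        this.checkpointId = checkpointId;
        this.notYetAcknowledged = new HashSet<>(tasksToAcknowledge);
    }

    /** Records one acknowledgement; returns true once every task has acknowledged. */
    boolean acknowledge(String taskName) {
        notYetAcknowledged.remove(taskName);
        return notYetAcknowledged.isEmpty();
    }

    boolean isFullyAcknowledged() {
        return notYetAcknowledged.isEmpty();
    }

    long getCheckpointId() {
        return checkpointId;
    }
}
```

In this sketch the coordinator would send the "checkpoint complete" notification at the moment acknowledge(...) first returns true.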
- Tracing the CheckpointCoordinator constructor, the key code is:
this.requestDecider = new CheckpointRequestDecider(
chkConfig.getMaxConcurrentCheckpoints(),
this::rescheduleTrigger,
this.clock,
this.minPauseBetweenCheckpoints,
this.pendingCheckpoints::size,
this.checkpointsCleaner::getNumberOfCheckpointsToClean
);
The key method here is rescheduleTrigger.
- Tracing it shows:
private void rescheduleTrigger(long tillNextMillis) {
cancelPeriodicTrigger();
currentPeriodicTrigger = scheduleTriggerWithDelay(tillNextMillis); // schedule the trigger
}
scheduleTriggerWithDelay calls the timer with timer.scheduleAtFixedRate(new ScheduledTrigger(), initDelay, baseInterval, TimeUnit.MILLISECONDS), which triggers checkpoints periodically.
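The cancel-then-reschedule pattern can be tried out with plain JDK classes. The following is a minimal sketch, assuming a single-threaded ScheduledExecutorService as the timer; ToyPeriodicTrigger and firesAtLeast are hypothetical names, and only the method names rescheduleTrigger and cancelPeriodicTrigger are borrowed from the excerpt above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Toy periodic trigger (hypothetical, not Flink code). */
final class ToyPeriodicTrigger {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final long baseIntervalMillis;
    private final Runnable trigger;
    private ScheduledFuture<?> currentPeriodicTrigger;

    ToyPeriodicTrigger(long baseIntervalMillis, Runnable trigger) {
        this.baseIntervalMillis = baseIntervalMillis;
        this.trigger = trigger;
    }

    /** Cancels the current schedule and starts a new one after the given delay. */
    synchronized void rescheduleTrigger(long tillNextMillis) {
        cancelPeriodicTrigger();
        currentPeriodicTrigger = timer.scheduleAtFixedRate(
                trigger, tillNextMillis, baseIntervalMillis, TimeUnit.MILLISECONDS);
    }

    synchronized void cancelPeriodicTrigger() {
        if (currentPeriodicTrigger != null) {
            currentPeriodicTrigger.cancel(false);
            currentPeriodicTrigger = null;
        }
    }

    void shutdown() {
        timer.shutdownNow();
    }

    /** Demo helper: true if the trigger fired at least 'times' times within 5 seconds. */
    static boolean firesAtLeast(int times, long intervalMillis) {
        CountDownLatch latch = new CountDownLatch(times);
        ToyPeriodicTrigger periodic = new ToyPeriodicTrigger(intervalMillis, latch::countDown);
        periodic.rescheduleTrigger(0);
        try {
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            periodic.shutdown();
        }
    }
}
```

Cancelling before rescheduling keeps at most one periodic schedule alive, which is the point of the two-line method in the excerpt.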
- Tracing further:
new ScheduledTrigger()
triggerCheckpoint(checkpointProperties, null, true);
So periodic checkpoints are triggered by CheckpointCoordinator.triggerCheckpoint.
- Inside CheckpointCoordinator.triggerCheckpoint:
CompletableFuture<CompletedCheckpoint> triggerCheckpoint(
CheckpointProperties props,
@Nullable String externalSavepointLocation,
boolean isPeriodic
) {
CheckpointTriggerRequest request =
new CheckpointTriggerRequest(props, externalSavepointLocation, isPeriodic);
// In the common case, chooseRequestToExecute(request) can be assumed to return the request created just above
chooseRequestToExecute(request).ifPresent(this::startTriggeringCheckpoint);
return request.onCompletionPromise;
}
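**triggerCheckpoint()** is an important junction: it continues by calling startTriggeringCheckpoint.
This is where the flow merges with the call chain that starts from building the Dispatcher, at the line chooseRequestToExecute(request).ifPresent(this::startTriggeringCheckpoint); all of the analysis below starts from this line.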
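Why chooseRequestToExecute returns an Optional becomes clearer with a small model. The sketch below is a deliberate simplification that only considers the maximum number of concurrent checkpoints; ToyRequestDecider is a hypothetical class, and the real CheckpointRequestDecider takes more inputs (as its constructor arguments above show), which are ignored here.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;
import java.util.function.IntSupplier;

/** Toy request decider (hypothetical, heavily simplified). */
final class ToyRequestDecider {
    private final int maxConcurrentCheckpoints;
    private final IntSupplier pendingCheckpointsCount;
    private final Queue<String> queuedRequests = new ArrayDeque<>();

    ToyRequestDecider(int maxConcurrentCheckpoints, IntSupplier pendingCheckpointsCount) {
        this.maxConcurrentCheckpoints = maxConcurrentCheckpoints;
        this.pendingCheckpointsCount = pendingCheckpointsCount;
    }

    /**
     * Queues the new request, then hands out the oldest queued request
     * only if there is capacity for another in-flight checkpoint.
     */
    Optional<String> chooseRequestToExecute(String newRequest) {
        queuedRequests.add(newRequest);
        if (pendingCheckpointsCount.getAsInt() >= maxConcurrentCheckpoints) {
            return Optional.empty(); // nothing may start right now
        }
        return Optional.ofNullable(queuedRequests.poll());
    }
}
```

With an empty Optional, the ifPresent(...) call simply does nothing, so a request that cannot run yet stays queued instead of starting a checkpoint.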
- Next, trace startTriggeringCheckpoint; from here on the code is shared by both paths:
private void startTriggeringCheckpoint(CheckpointTriggerRequest request) {
try {
synchronized (lock) {
preCheckGlobalState(request.isPeriodic);
}
// we will actually trigger this checkpoint!
Preconditions.checkState(!isTriggering);
isTriggering = true;
final long timestamp = System.currentTimeMillis();
// 1 compute the checkpoint plan
CompletableFuture<CheckpointPlan> checkpointPlanFuture = checkpointPlanCalculator.calculateCheckpointPlan();
boolean initializeBaseLocations = !baseLocationsForCheckpointInitialized;
baseLocationsForCheckpointInitialized = true;
CompletableFuture<Void> masterTriggerCompletionPromise = new CompletableFuture<>();
final CompletableFuture<PendingCheckpoint> pendingCheckpointCompletableFuture =
checkpointPlanFuture
.thenApplyAsync(
plan -> {
try {
// this must happen outside the coordinator-wide lock,
// because it communicates with external services
// (in HA mode) and may block for a while.
long checkpointID = checkpointIdCounter.getAndIncrement();
// assign the checkpoint ID for this checkpoint
return new Tuple2<>(plan, checkpointID);
} catch (Throwable e) {
throw new CompletionException(e);
}
},
executor
)
.thenApplyAsync(
// 2 create the pendingCheckpoint
(checkpointInfo) -> createPendingCheckpoint(
timestamp,
request.props,
checkpointInfo.f0,
request.isPeriodic,
checkpointInfo.f1,
request.getOnCompletionFuture(),
masterTriggerCompletionPromise
),
timer
);
final CompletableFuture<?> coordinatorCheckpointsComplete =
pendingCheckpointCompletableFuture
.thenApplyAsync(
pendingCheckpoint -> {
try {
// 3 initialize the checkpoint storage location for the pendingCheckpoint
CheckpointStorageLocation checkpointStorageLocation = initializeCheckpointLocation(
pendingCheckpoint.getCheckpointID(),
request.props,
request.externalSavepointLocation,
initializeBaseLocations
);
return Tuple2.of(pendingCheckpoint, checkpointStorageLocation);
} catch (Throwable e) {
throw new CompletionException(e);
}
},
executor
)
.thenComposeAsync(
(checkpointInfo) -> {
PendingCheckpoint pendingCheckpoint = checkpointInfo.f0;
if (pendingCheckpoint.isDisposed()) {
// The disposed checkpoint will be handled later,
// skip snapshotting the coordinator states.
return null;
}
synchronized (lock) {
// set the checkpoint target location on the pendingCheckpoint
pendingCheckpoint.setCheckpointTargetLocation(checkpointInfo.f1);
}
// 4 trigger all coordinator checkpoints and wait for their acknowledgement
return OperatorCoordinatorCheckpoints.triggerAndAcknowledgeAllCoordinatorCheckpointsWithCompletion(
coordinatorsToCheckpoint,
pendingCheckpoint,
timer
);
},
timer
);
// We have to take the snapshot of the master hooks after the coordinator checkpoints
// has completed.
// This is to ensure the tasks are checkpointed after the OperatorCoordinators in case
// ExternallyInducedSource is used.
final CompletableFuture<?> masterStatesComplete =
coordinatorCheckpointsComplete.thenComposeAsync(
ignored -> {
// If the code reaches here, the pending checkpoint is guaranteed to
// be not null.
// We use FutureUtils.getWithoutException() to make compiler happy
// with checked
// exceptions in the signature.
PendingCheckpoint checkpoint = FutureUtils.getWithoutException(pendingCheckpointCompletableFuture);
if (checkpoint == null || checkpoint.isDisposed()) {
// The disposed checkpoint will be handled later,
// skip snapshotting the master states.
return null;
}
// 5 snapshot the master hook states
return snapshotMasterState(checkpoint);
},
timer
);
FutureUtils.forward(
CompletableFuture.allOf(masterStatesComplete, coordinatorCheckpointsComplete),
masterTriggerCompletionPromise
);
FutureUtils.assertNoException(
masterTriggerCompletionPromise
.handleAsync(
(ignored, throwable) -> {
final PendingCheckpoint checkpoint = FutureUtils.getWithoutException(pendingCheckpointCompletableFuture);
Preconditions.checkState(
checkpoint != null || throwable != null,
"Either the pending checkpoint needs to be created or an error must have occurred."
);
if (throwable != null) {
// the initialization might not be finished yet
if (checkpoint == null) {
onTriggerFailure(request, throwable);
} else {
onTriggerFailure(checkpoint, throwable);
}
} else {
// 6 no exception occurred: send the checkpoint trigger request
triggerCheckpointRequest(request, timestamp, checkpoint);
}
return null;
},
timer
)
.exceptionally(
error -> {
if (!isShutdown()) {
throw new CompletionException(error);
} else if (findThrowable(error, RejectedExecutionException.class).isPresent()) {
LOG.debug("Execution rejected during shutdown");
} else {
LOG.warn("Error encountered during shutdown", error);
}
return null;
})
);
} catch (Throwable throwable) {
onTriggerFailure(request, throwable);
}
}
The flow continues in triggerCheckpointRequest(request, timestamp, checkpoint).
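The overall shape of startTriggeringCheckpoint is a chain of asynchronous stages that ends in a single handler deciding between the success and failure paths. The following is a minimal sketch of that shape using only java.util.concurrent; ToyTriggerPipeline and the strings it produces are hypothetical stand-ins, and the stages run on the common pool rather than on Flink's executor and timer.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.atomic.AtomicLong;

/** Toy version of the staged trigger pipeline (hypothetical, not Flink code). */
final class ToyTriggerPipeline {
    private final AtomicLong checkpointIdCounter = new AtomicLong(1);

    /** Runs plan -> ID -> pending checkpoint; any failure reaches the final handler. */
    CompletableFuture<String> trigger(boolean failWhileCreatingPending) {
        CompletableFuture<String> planFuture = CompletableFuture.completedFuture("plan");
        return planFuture
                // stage 1: attach a fresh checkpoint ID to the plan
                .thenApplyAsync(plan -> plan + "#" + checkpointIdCounter.getAndIncrement())
                // stage 2: create the pending checkpoint (may fail)
                .thenApplyAsync(planAndId -> {
                    if (failWhileCreatingPending) {
                        throw new CompletionException(
                                new IllegalStateException("cannot create pending checkpoint"));
                    }
                    return "pending(" + planAndId + ")";
                })
                // final stage: one handler sees either the result or the error
                .handleAsync((pending, throwable) ->
                        throwable != null
                                ? "onTriggerFailure"
                                : "triggerCheckpointRequest:" + pending);
    }
}
```

Because handleAsync receives both the value and the throwable, an error in any earlier stage skips the remaining stages and still ends up in the same place, which mirrors how the excerpt funnels failures into onTriggerFailure.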
- Following the call stack further:
triggerTasks(request, timestamp, checkpoint) // key method
execution.triggerCheckpoint(checkpointId, timestamp, checkpointOptions)
triggerCheckpointHelper(checkpointId, timestamp, checkpointOptions)
taskManagerGateway.triggerCheckpoint(attemptId, getVertex().getJobId(), checkpointId, timestamp, checkpointOptions)
RpcTaskManagerGateway.triggerCheckpoint()
TaskExecutor.triggerCheckpoint()
// trigger the checkpoint barrier
task.triggerCheckpointBarrier(checkpointId, checkpointTimestamp, checkpointOptions);
// triggerCheckpointAsync is implemented by both SourceStreamTask and the regular StreamTask; the main logic is in StreamTask
((CheckpointableTask) invokable).triggerCheckpointAsync(checkpointMetaData, checkpointOptions)
StreamTask.triggerCheckpointAsync(CheckpointMetaData checkpointMetaData, CheckpointOptions checkpointOptions)
- Next, trace StreamTask.triggerCheckpointAsync():
boolean noUnfinishedInputGates =
Arrays.stream(getEnvironment().getAllInputGates()).allMatch(InputGate::isFinished);
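if (noUnfinishedInputGates) { // no unfinished input: the task performs the checkpoint itself
result.complete(
// branch 1
triggerCheckpointAsyncInMailbox(checkpointMetaData, checkpointOptions)
);
} else { // some inputs are still unfinished
result.complete(
// branch 2
triggerUnfinishedChannelsCheckpoint(checkpointMetaData, checkpointOptions)
);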
triggerUnfinishedChannelsCheckpoint(checkpointMetaData, checkpointOptions)
);
}
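One detail of the noUnfinishedInputGates check is easy to miss: allMatch on an empty stream returns true, so a task without any input gates takes branch 1. The sketch below demonstrates this with plain Java; ToyInputGates is a hypothetical helper that uses Boolean flags in place of real InputGate objects.

```java
import java.util.Arrays;

/** Toy model of the input-gate check (hypothetical, not Flink code). */
final class ToyInputGates {
    /** Each flag says whether the corresponding (toy) input gate is finished. */
    static boolean noUnfinishedInputGates(Boolean[] finishedFlags) {
        // allMatch is vacuously true for an empty array.
        return Arrays.stream(finishedFlags).allMatch(Boolean::booleanValue);
    }

    static String chooseBranch(Boolean[] finishedFlags) {
        return noUnfinishedInputGates(finishedFlags)
                ? "triggerCheckpointAsyncInMailbox"
                : "triggerUnfinishedChannelsCheckpoint";
    }
}
```

For example, chooseBranch(new Boolean[0]) selects triggerCheckpointAsyncInMailbox, while an array containing at least one false selects triggerUnfinishedChannelsCheckpoint.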
- Tracing the source shows that both branches eventually call performCheckpoint():
subtaskCheckpointCoordinator.checkpointState(
checkpointMetaData,
checkpointOptions,
checkpointMetrics,
operatorChain,
finishedOperators,
this::isRunning
);
Inside performCheckpoint, subtaskCheckpointCoordinator.checkpointState(...) is called; let's continue from there.
- The core of checkpointState is:
// Step (1): prepare the checkpoint and let operators do some pre-barrier work. Normally the pre-barrier work should be zero or minimal.
operatorChain.prepareSnapshotPreBarrier(metadata.getCheckpointId());
// Step (2): send the checkpoint barrier downstream
CheckpointBarrier checkpointBarrier = new CheckpointBarrier(metadata.getCheckpointId(), metadata.getTimestamp(), options);
operatorChain.broadcastEvent(checkpointBarrier, options.isUnalignedCheckpoint());
// Step (3): register the alignment timer that times an aligned barrier out to an unaligned barrier
registerAlignmentTimer(metadata.getCheckpointId(), operatorChain, checkpointBarrier);
// Step (4): prepare to spill the in-flight buffers for input and output
if (options.needsChannelState()) {
// output data already written while broadcasting the event
channelStateWriter.finishOutput(metadata.getCheckpointId());
}
// Step (5): take the state snapshot. This should be largely asynchronous, so that it does not impact the progress of the streaming topology
Map<OperatorID, OperatorSnapshotFutures> snapshotFutures = new HashMap<>(operatorChain.getNumberOfOperators());
try {
// takeSnapshotSync(...) calls operatorChain.snapshotState()
if (takeSnapshotSync(snapshotFutures, metadata, metrics, options, operatorChain, isRunning)) {
finishAndReportAsync(
snapshotFutures,
metadata,
metrics,
operatorChain.isTaskDeployedAsFinished(),
isTaskFinished,
isRunning
);
} else {
cleanup(snapshotFutures, metadata, metrics, new Exception("Checkpoint declined"));
}
} catch (Exception ex) {} // exception handling omitted in this excerpt
Source location: SubtaskCheckpointCoordinatorImpl.takeSnapshotSync()
Inside takeSnapshotSync(snapshotFutures, metadata, metrics, options, operatorChain, isRunning), operatorChain.snapshotState() is invoked; this is where execution actually enters the operators to take the snapshot.
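The ordering inside checkpointState (pre-barrier work, barrier broadcast, synchronous snapshot, asynchronous completion) can be modelled in a few lines. This is a minimal sketch under simplifying assumptions: ToySubtaskCheckpoint is a hypothetical class, the steps are only recorded as strings, and the channel-state step is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Toy model of the step order in checkpointState (hypothetical, not Flink code). */
final class ToySubtaskCheckpoint {
    // Steps executed synchronously on the calling (task) thread, in order.
    final List<String> steps = new ArrayList<>();

    CompletableFuture<String> checkpointState(long checkpointId, boolean snapshotAccepted) {
        steps.add("prepareSnapshotPreBarrier");
        steps.add("broadcastBarrier");       // the barrier goes out before any state is snapshotted
        steps.add("registerAlignmentTimer");
        if (!snapshotAccepted) {
            steps.add("cleanup");            // corresponds to the "Checkpoint declined" branch
            return CompletableFuture.completedFuture("declined-" + checkpointId);
        }
        steps.add("takeSnapshotSync");
        // The expensive part finishes asynchronously, off the task thread.
        return CompletableFuture.supplyAsync(() -> "acknowledged-" + checkpointId);
    }
}
```

Sending the barrier before taking the snapshot lets downstream tasks start their own checkpoint work while this task is still writing its state.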
- operatorChain.snapshotState() has two implementations:
- FinishedOperatorChain
snapshotChannelStates(operator, channelStateWriteResult, snapshotInProgress);
- RegularOperatorChain
OperatorSnapshotFutures snapshotInProgress = checkpointStreamOperator(op, checkpointMetaData, checkpointOptions, storage, isRunning);
snapshotChannelStates(op, channelStateWriteResult, snapshotInProgress);
The key call here is checkpointStreamOperator(op, checkpointMetaData, checkpointOptions, storage, isRunning), which invokes the snapshotState method of AbstractStreamOperator. That method contains the actual snapshot logic:
public final OperatorSnapshotFutures snapshotState(
long checkpointId,
long timestamp,
CheckpointOptions checkpointOptions,
CheckpointStreamFactory factory
) throws Exception {
return stateHandler.snapshotState(
this,
Optional.ofNullable(timeServiceManager),
getOperatorName(),
checkpointId,
timestamp,
checkpointOptions,
factory,
isUsingCustomRawKeyedState()
);
}
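The entry point here is AbstractStreamOperator.snapshotState(); its call logic will be covered in a later update.
That concludes the call flow of Flink checkpointing in the source code. Understanding how checkpointing works helps considerably in understanding Flink applications. The above is my own reading of the source; corrections are welcome.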