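As covered in the article "PostgreSQL Database WAL: Resource Manager RMGR", XLog records are divided among several types of resource managers. Each resource manager handles only its own records: a common set of operation functions is abstracted, and each record type supplies its own implementations. The checkpoint WAL record belongs to the RM_XLOG_ID resource manager, which covers XLog-related records such as checkpoints and WAL segment switches. Resource managers are declared in src/include/access/rmgrlist.h; the RM_XLOG_ID entry and its operation functions are:
PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL)
Only xlog_redo, xlog_desc and xlog_identify are set for RM_XLOG_ID; the remaining callbacks are NULL. During XLOG replay, the startup process runs StartupXLOG, which uses rm_startup, rm_redo, rm_mask and rm_cleanup to initialize the resource managers, replay records, check page consistency, and clean up, respectively. The flow is as follows: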
void StartupXLOG(void){
...
/* Initialize resource managers */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++) {
if (RmgrTable[rmid].rm_startup != NULL)
RmgrTable[rmid].rm_startup();
}
...
/* main redo apply loop */
do {
...
/* Now apply the WAL record itself */ // dispatch to the owning resource manager's rm_redo
RmgrTable[record->xl_rmid].rm_redo(xlogreader);
/* After redo, check whether the backup pages associated with the WAL record are consistent with the existing pages. This check is done only if consistency check is enabled for this record. */
if ((record->xl_info & XLR_CHECK_CONSISTENCY) != 0)
checkXLogConsistency(xlogreader);
...
record = ReadRecord(xlogreader, LOG, false); /* Else, try to fetch the next WAL record */
} while (record != NULL);
/* Allow resource managers to do any required cleanup. */
for (rmid = 0; rmid <= RM_MAX_ID; rmid++) {
if (RmgrTable[rmid].rm_cleanup != NULL)
RmgrTable[rmid].rm_cleanup();
}
}
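To make the dispatch pattern concrete, here is a minimal, self-contained C sketch of a callback table driven the way the StartupXLOG excerpt drives RmgrTable: optional startup hooks, a per-record redo call selected by the record's resource-manager ID, then optional cleanup hooks. DemoRecord, DemoRmgr, demo_table, demo_replay and the counters are hypothetical names invented for illustration; they are not PostgreSQL definitions, and the real structures carry more callbacks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a WAL record and an RmgrTable entry. */
typedef struct DemoRecord {
    unsigned char rmid;    /* which resource manager owns this record */
    int           payload; /* pretend record body */
} DemoRecord;

typedef struct DemoRmgr {
    const char *name;
    void (*startup)(void);               /* optional, may be NULL */
    void (*redo)(const DemoRecord *rec); /* required */
    void (*cleanup)(void);               /* optional, may be NULL */
} DemoRmgr;

enum { DEMO_RM_XLOG = 0, DEMO_RM_HEAP = 1, DEMO_RM_MAX = 1 };

/* Observable state so the effect of each callback can be checked. */
int xlog_applied = 0;
int heap_applied = 0;
int heap_started = 0;

static void demo_xlog_redo(const DemoRecord *rec) { xlog_applied += rec->payload; }
static void demo_heap_startup(void)               { heap_started = 1; }
static void demo_heap_redo(const DemoRecord *rec) { heap_applied += rec->payload; }
static void demo_heap_cleanup(void)               { heap_started = 0; }

/* The XLOG entry has no startup/cleanup hooks, like RM_XLOG_ID above. */
const DemoRmgr demo_table[DEMO_RM_MAX + 1] = {
    [DEMO_RM_XLOG] = { "XLOG", NULL, demo_xlog_redo, NULL },
    [DEMO_RM_HEAP] = { "Heap", demo_heap_startup, demo_heap_redo, demo_heap_cleanup },
};

/* Replay n records through the table; returns the number of records applied. */
int demo_replay(const DemoRecord *recs, size_t n)
{
    int applied = 0;

    /* 1. Initialize every resource manager that has a startup hook. */
    for (int rmid = 0; rmid <= DEMO_RM_MAX; rmid++)
        if (demo_table[rmid].startup != NULL)
            demo_table[rmid].startup();

    /* 2. Main loop: dispatch each record to its owner's redo callback. */
    for (size_t i = 0; i < n; i++) {
        assert(recs[i].rmid <= DEMO_RM_MAX);
        demo_table[recs[i].rmid].redo(&recs[i]);
        applied++;
    }

    /* 3. Let every resource manager with a cleanup hook tidy up. */
    for (int rmid = 0; rmid <= DEMO_RM_MAX; rmid++)
        if (demo_table[rmid].cleanup != NULL)
            demo_table[rmid].cleanup();

    return applied;
}
```

Calling demo_replay on a mix of XLOG and Heap records routes each payload to its own manager, and managers whose startup or cleanup pointer is NULL are simply skipped.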
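While a PostgreSQL standby replays WAL from the primary, slow replay can leave the checkpoint timestamp in pg_control far behind the standby's own clock, because that timestamp is the one carried in the checkpoint record. Likewise, the checkpoint location recorded in the standby's pg_control is the position of a checkpoint record in the WAL shipped from the primary. In other words, the checkpoint-related fields in the standby's pg_control come from the primary's checkpoint WAL records. The XLOG_CHECKPOINT_ONLINE branch of xlog_redo shows how such a record is replayed: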
/* XLOG resource manager's routines
* Definitions of info values are in include/catalog/pg_control.h, though not all record types are related to control file updates. */
void xlog_redo(XLogReaderState *record) {
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
XLogRecPtr lsn = record->EndRecPtr;
...
else if (info == XLOG_CHECKPOINT_ONLINE) {
CheckPoint checkPoint; // defined in src/include/catalog/pg_control.h
memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint)); // copy the CheckPoint struct out of the checkpoint WAL record
LWLockAcquire(XidGenLock, LW_EXCLUSIVE); /* In an ONLINE checkpoint, treat the XID counter as a minimum */
if (FullTransactionIdPrecedes(ShmemVariableCache->nextXid, checkPoint.nextXid))
ShmemVariableCache->nextXid = checkPoint.nextXid; // advance the standby's cached nextXid only if the checkpoint's nextXid is ahead of it
LWLockRelease(XidGenLock);
/* We ignore the nextOid counter in an ONLINE checkpoint, preferring to track OID assignment through XLOG_NEXTOID records. The nextOid counter is from the start of the checkpoint and might well be stale compared to later XLOG_NEXTOID records. We could try to take the maximum of the nextOid counter and our latest value, but since there's no particular guarantee about the speed with which the OID counter wraps around, that's a risky thing to do. In any case, users of the nextOid counter are required to avoid assignment of duplicates, so that a somewhat out-of-date value should be safe. */
MultiXactAdvanceNextMXact(checkPoint.nextMulti, checkPoint.nextMultiOffset); /* Handle multixact */
// MultiXactAdvanceNextMXact: if checkPoint.nextMulti is ahead of MultiXactState->nextMXact, set MultiXactState->nextMXact to checkPoint.nextMulti;
// likewise, if checkPoint.nextMultiOffset is ahead of MultiXactState->nextOffset, set MultiXactState->nextOffset to checkPoint.nextMultiOffset
/* NB: This may perform multixact truncation when replaying WAL generated by an older primary. */
MultiXactAdvanceOldest(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
// MultiXactAdvanceOldest: if checkPoint.oldestMulti is ahead of MultiXactState->oldestMultiXactId, set MultiXactState->oldestMultiXactId to checkPoint.oldestMulti and MultiXactState->oldestMultiXactDB to checkPoint.oldestMultiDB
if (TransactionIdPrecedes(ShmemVariableCache->oldestXid, checkPoint.oldestXid))
SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
// if checkPoint.oldestXid is newer than ShmemVariableCache->oldestXid, set ShmemVariableCache->oldestXid to checkPoint.oldestXid and ShmemVariableCache->oldestXidDB to checkPoint.oldestXidDB
/* ControlFile->checkPointCopy always tracks the latest ckpt XID */
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); // update ControlFile->checkPointCopy.nextXid under the lock
ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
LWLockRelease(ControlFileLock);
/* Update shared-memory copy of checkpoint XID/epoch */
SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->ckptFullXid = checkPoint.nextXid;
SpinLockRelease(&XLogCtl->info_lck);
if (checkPoint.ThisTimeLineID != ThisTimeLineID) /* TLI should not change in an on-line checkpoint */
ereport(PANIC, (errmsg("unexpected timeline ID %u (should be %u) in checkpoint record", checkPoint.ThisTimeLineID, ThisTimeLineID)));
RecoveryRestartPoint(&checkPoint);
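// RecoveryRestartPoint saves a checkpoint for recovery restart if appropriate. It is called each time a checkpoint record is read from XLOG and must determine whether the checkpoint represents a safe restartpoint. If so, the checkpoint record is stashed in shared memory so that CreateRestartPoint can consult it. (CreateRestartPoint is executed by the checkpointer, while this function is executed by the startup process.) In essence it sets XLogCtl->lastCheckPointRecPtr to ReadRecPtr, XLogCtl->lastCheckPointEndPtr to EndRecPtr, and XLogCtl->lastCheckPoint to *checkPoint.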
}
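The branch above repeatedly applies one rule: a counter taken from an online checkpoint is treated as a minimum, so the standby's value moves only forward. The sketch below illustrates that rule for 32-bit transaction IDs, where "newer" has to be decided with modulo-2^32 arithmetic because the counter wraps around. DemoXid, demo_xid_precedes and demo_advance_xid are hypothetical names; the comparison is a simplified version of the idea behind TransactionIdPrecedes and ignores PostgreSQL's special (non-normal) XIDs. For the 64-bit nextXid, FullTransactionIdPrecedes can use a plain numeric comparison instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t DemoXid; /* hypothetical stand-in for a 32-bit transaction ID */

/*
 * True if a logically precedes b on the circular 32-bit XID space.
 * The unsigned difference is reinterpreted as signed, so IDs up to 2^31
 * "ahead" count as newer even across a wraparound. (The narrowing cast
 * relies on two's-complement conversion, as on all mainstream compilers.)
 */
bool demo_xid_precedes(DemoXid a, DemoXid b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Treat candidate as a minimum for *current: advance only when candidate
 * is logically newer. Returns true if *current was updated.
 */
bool demo_advance_xid(DemoXid *current, DemoXid candidate)
{
    assert(current != NULL);
    if (demo_xid_precedes(*current, candidate)) {
        *current = candidate;
        return true;
    }
    return false;
}
```

With this rule a stale value from the checkpoint never moves the standby's counter backwards, while a value that has wrapped past 2^32 (for example 5 following 4294967290) is still recognized as newer.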
ReadRecPtr and EndRecPtr are assigned in the ReadRecord function:
for (;;)
{
char *errormsg;
record = XLogReadRecord(xlogreader, RecPtr, &errormsg);
ReadRecPtr = xlogreader->ReadRecPtr;
EndRecPtr = xlogreader->EndRecPtr;
...
}
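The hand-off between the two processes can be sketched as follows: the startup process stashes the start and end positions of the checkpoint record together with its contents, and the checkpointer later copies them out and decides whether a new restartpoint is possible. This is a single-threaded illustration under stated assumptions: DemoShared, DemoCheckPoint, demo_stash_checkpoint and demo_can_restartpoint are hypothetical names, locking is omitted (the real code holds XLogCtl->info_lck around both accesses), and only the redo field of the checkpoint is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t DemoLsn;                 /* stand-in for XLogRecPtr */
#define DEMO_INVALID_LSN ((DemoLsn) 0)    /* an LSN of 0 means "not set" */

typedef struct DemoCheckPoint {
    DemoLsn redo;                         /* where replay must start from */
} DemoCheckPoint;

/* Stand-in for the XLogCtl fields that carry the last safe checkpoint. */
typedef struct DemoShared {
    DemoLsn        last_ckpt_rec_ptr;     /* start of the checkpoint record */
    DemoLsn        last_ckpt_end_ptr;     /* end of the checkpoint record */
    DemoCheckPoint last_ckpt;             /* copy of the record's contents */
} DemoShared;

/* Startup-process side: remember the checkpoint record just replayed. */
void demo_stash_checkpoint(DemoShared *sh, DemoLsn read_ptr, DemoLsn end_ptr,
                           const DemoCheckPoint *ckpt)
{
    assert(read_ptr < end_ptr);           /* a record has non-zero length */
    sh->last_ckpt_rec_ptr = read_ptr;
    sh->last_ckpt_end_ptr = end_ptr;
    sh->last_ckpt = *ckpt;
}

/*
 * Checkpointer side: a new restartpoint is possible only if a checkpoint
 * has been stashed and its redo pointer is beyond the redo pointer already
 * recorded in the control file.
 */
bool demo_can_restartpoint(const DemoShared *sh, DemoLsn control_file_redo)
{
    if (sh->last_ckpt_rec_ptr == DEMO_INVALID_LSN)
        return false;                     /* nothing replayed yet */
    if (sh->last_ckpt.redo <= control_file_redo)
        return false;                     /* already performed */
    return true;
}
```

The two early returns correspond to the "skipping restartpoint" path: either no checkpoint record has been replayed since startup, or the most recent one is the restartpoint that pg_control already reflects.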
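To shorten recovery time, PostgreSQL also supports checkpoints on a standby; these are called restartpoints and are implemented by CreateRestartPoint. A standby cannot write its own checkpoint WAL records: doing so would interleave locally generated records with the WAL received from the primary, scrambling the log and breaking replay. In CheckpointerMain (src/backend/postmaster/checkpointer.c), the boolean do_restartpoint is the variable to watch: when RecoveryInProgress returns true, CheckpointerMain calls CreateRestartPoint. CreateRestartPoint is similar to CreateCheckPoint, but it is used during WAL recovery to establish a point from which recovery can roll forward without replaying the entire recovery log. It returns true if a new restartpoint was established. A restartpoint can be established only if a safe checkpoint record has been replayed since the last restartpoint.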
bool CreateRestartPoint(int flags) {
XLogRecPtr lastCheckPointRecPtr,lastCheckPointEndPtr;
CheckPoint lastCheckPoint;
XLogRecPtr PriorRedoPtr,receivePtr,replayPtr,endptr;
TimeLineID replayTLI;
XLogSegNo _logSegNo;
TimestampTz xtime;
SpinLockAcquire(&XLogCtl->info_lck); /* Get a local copy of the last safe checkpoint record. */
lastCheckPointRecPtr = XLogCtl->lastCheckPointRecPtr; // the checkpoint location comes from XLogCtl->lastCheckPointRecPtr
lastCheckPointEndPtr = XLogCtl->lastCheckPointEndPtr;
lastCheckPoint = XLogCtl->lastCheckPoint;
SpinLockRelease(&XLogCtl->info_lck);
if (!RecoveryInProgress()){ /* Check that we're still in recovery mode. It's ok if we exit recovery mode after this check, the restart point is valid anyway. */
ereport(DEBUG2,(errmsg_internal("skipping restartpoint, recovery has already ended")));
return false;
}
/* If the last checkpoint record we've replayed is already our last restartpoint, we can't perform a new restart point. We still update minRecoveryPoint in that case, so that if this is a shutdown restart point, we won't start up earlier than before. That's not strictly necessary, but when hot standby is enabled, it would be rather weird if the database opened up for read-only connections at a point-in-time before the last shutdown. Such time travel is still possible in case of immediate shutdown, though. We don't explicitly advance minRecoveryPoint when we do create a restartpoint. It's assumed that flushing the buffers will do that as a side-effect. */
if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) || lastCheckPoint.redo <= ControlFile->checkPointCopy.redo) {
ereport(DEBUG2,(errmsg_internal("skipping restartpoint, already performed at %X/%X", LSN_FORMAT_ARGS(lastCheckPoint.redo))));
UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
if (flags & CHECKPOINT_IS_SHUTDOWN) {
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
ControlFile->time = (pg_time_t) time(NULL);
UpdateControlFile();
LWLockRelease(ControlFileLock);
}
return false;
}
/* Update the shared RedoRecPtr so that the startup process can calculate the number of segments replayed since last restartpoint, and request a restartpoint if it exceeds CheckPointSegments. Like in CreateCheckPoint(), hold off insertions to update it, although during recovery this is just pro forma, because no WAL insertions are happening. */
WALInsertLockAcquireExclusive();
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = lastCheckPoint.redo;
WALInsertLockRelease();
SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->RedoRecPtr = lastCheckPoint.redo; /* Also update the info_lck-protected copy */
SpinLockRelease(&XLogCtl->info_lck);
/* Prepare to accumulate statistics. Note: because it is possible for log_checkpoints to change while a checkpoint proceeds, we always accumulate stats, even if log_checkpoints is currently off. */
MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
if (log_checkpoints) LogCheckpointStart(flags, true);
update_checkpoint_display(flags, true, false); /* Update the process title */
CheckPointGuts(lastCheckPoint.redo, flags);
/* Remember the prior checkpoint's redo ptr for UpdateCheckPointDistanceEstimate() */
PriorRedoPtr = ControlFile->checkPointCopy.redo;
/*
* Update pg_control, using current time. Check that it still shows
* DB_IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
* this is a quick hack to make sure nothing really bad happens if somehow
* we get here after the end-of-recovery checkpoint.
*/
LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY && ControlFile->checkPointCopy.redo < lastCheckPoint.redo) {
ControlFile->checkPoint = lastCheckPointRecPtr;
ControlFile->checkPointCopy = lastCheckPoint;
ControlFile->time = (pg_time_t) time(NULL);
/*
* Ensure minRecoveryPoint is past the checkpoint record. Normally,
* this will have happened already while writing out dirty buffers,
* but not necessarily - e.g. because no buffers were dirtied. We do
* this because a non-exclusive base backup uses minRecoveryPoint to
* determine which WAL files must be included in the backup, and the
* file (or files) containing the checkpoint record must be included,
* at a minimum. Note that for an ordinary restart of recovery there's
* no value in having the minimum recovery point any earlier than this
* anyway, because redo will begin just after the checkpoint record.
*/
if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
{
ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
ControlFile->minRecoveryPointTLI = lastCheckPoint.ThisTimeLineID;
/* update local copy */
minRecoveryPoint = ControlFile->minRecoveryPoint;
minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
}
if (flags & CHECKPOINT_IS_SHUTDOWN) ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
UpdateControlFile();
}
LWLockRelease(ControlFileLock);
/*
* Update the average distance between checkpoints/restartpoints if the
* prior checkpoint exists.
*/
if (PriorRedoPtr != InvalidXLogRecPtr)
UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
/*
* Delete old log files, those no longer needed for last restartpoint to
* prevent the disk holding the xlog from growing full.
*/
XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
/*
* Retreat _logSegNo using the current end of xlog replayed or received,
* whichever is later.
*/
receivePtr = GetWalRcvFlushRecPtr(NULL, NULL);
replayPtr = GetXLogReplayRecPtr(&replayTLI);
endptr = (receivePtr < replayPtr) ? replayPtr : receivePtr;
KeepLogSeg(endptr, &_logSegNo);
if (InvalidateObsoleteReplicationSlots(_logSegNo))
{
/*
* Some slots have been invalidated; recalculate the old-segment
* horizon, starting again from RedoRecPtr.
*/
XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
KeepLogSeg(endptr, &_logSegNo);
}
_logSegNo--;
/*
* Try to recycle segments on a useful timeline. If we've been promoted
* since the beginning of this restartpoint, use the new timeline chosen
* at end of recovery (RecoveryInProgress() sets ThisTimeLineID in that
* case). If we're still in recovery, use the timeline we're currently
* replaying.
*
* There is no guarantee that the WAL segments will be useful on the
* current timeline; if recovery proceeds to a new timeline right after
* this, the pre-allocated WAL segments on this timeline will not be used,
* and will go wasted until recycled on the next restartpoint. We'll live
* with that.
*/
if (RecoveryInProgress()) ThisTimeLineID = replayTLI;
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, endptr);
PreallocXlogFiles(endptr); /* Make more log segments if needed. (Do this after recycling old log segments, since that may supply some of the needed files.) */
/* ThisTimeLineID is normally not set when we're still in recovery. However, recycling/preallocating segments above needed ThisTimeLineID to determine which timeline to install the segments on. Reset it now, to restore the normal state of affairs for debugging purposes. */
if (RecoveryInProgress()) ThisTimeLineID = 0;
/* Truncate pg_subtrans if possible. We can throw away all data before the oldest XMIN of any running transaction. No future transaction will attempt to reference any pg_subtrans entry older than that (see Asserts in subtrans.c). When hot standby is disabled, though, we mustn't do this because StartupSUBTRANS hasn't been called yet. */
if (EnableHotStandby) TruncateSUBTRANS(GetOldestTransactionIdConsideredRunning());
LogCheckpointEnd(true); /* Real work is done; log and update stats. */
update_checkpoint_display(flags, true, true); /* Reset the process title */
xtime = GetLatestXTime();
ereport((log_checkpoints ? LOG : DEBUG2),(errmsg("recovery restart point at %X/%X",LSN_FORMAT_ARGS(lastCheckPoint.redo)),xtime ? errdetail("Last completed transaction was at log time %s.",timestamptz_to_str(xtime)) : 0));
if (archiveCleanupCommand && strcmp(archiveCleanupCommand, "") != 0) /* Finally, execute archive_cleanup_command, if any. */ ExecuteRecoveryCommand(archiveCleanupCommand,"archive_cleanup_command",false);
return true;
}
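Two pieces of arithmetic in CreateRestartPoint are easy to miss: mapping the redo pointer to a WAL segment number before old segments are removed, and printing an LSN in the %X/%X form used by the log messages. The sketch below works through both. It is a simplification under stated assumptions: DemoLsn, DemoSegNo and the demo_* functions are hypothetical names, the segment size is passed in explicitly, and the retention adjustments made by KeepLogSeg and replication slots are ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t DemoLsn;   /* stand-in for XLogRecPtr: a 64-bit byte position */
typedef uint64_t DemoSegNo; /* stand-in for XLogSegNo */

/* Segment that contains the byte at lsn (the arithmetic behind XLByteToSeg). */
DemoSegNo demo_lsn_to_seg(DemoLsn lsn, uint64_t seg_size)
{
    assert(seg_size > 0);
    return lsn / seg_size;
}

/*
 * Newest segment that is no longer needed to restart from redo: the one
 * just before the segment holding redo (the effect of "_logSegNo--").
 * Returns 1 and stores the segment number in *out, or returns 0 and leaves
 * *out untouched when redo lies in segment 0 and nothing older exists.
 */
int demo_last_removable_seg(DemoLsn redo, uint64_t seg_size, DemoSegNo *out)
{
    DemoSegNo seg = demo_lsn_to_seg(redo, seg_size);

    if (seg == 0)
        return 0;
    *out = seg - 1;
    return 1;
}

/* Format an LSN as high-32-bits/low-32-bits in hex, e.g. "1/28". */
void demo_format_lsn(DemoLsn lsn, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%X/%X",
             (unsigned int) (lsn >> 32), (unsigned int) lsn);
}
```

With a 16 MB segment size, a redo pointer of 0x3000060 falls in segment 3, so segment 2 is the newest one eligible for removal or recycling; the LSN 0x100000028 is printed as 1/28.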