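After a 1.2 cluster was deployed, all components were healthy. After upgrading to version 1.3, the MCO (machine-config-operator) component became abnormal.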

Information

Node status:

[uswift@utccp-test-289lv-worker-0-gs922 ~]$ rpm-ostree status
State: idle
Deployments:
* uswift:uswift/x86_64
                   Version: 20.20230625.uccps.0 (2023-06-25T08:15:06Z)
                    Commit: 698a07617e25f715b19b047b27a03d18fd526da4a2cfd56f02f6661ba78be127

Errors in the machine-config-daemon log:

I0312 14:59:20.356836  403813 daemon.go:903] Current+desired config: rendered-master-bd1426e1c30cc288379bc4c7dd71c4aa
I0312 14:59:20.367576  403813 daemon.go:538] Detected a login session before the daemon took over on first boot
I0312 14:59:20.367606  403813 daemon.go:539] Applying annotation: machineconfiguration.openshift.io/ssh
I0312 14:59:20.390021  403813 daemon.go:1196] Validating against current config rendered-master-bd1426e1c30cc288379bc4c7dd71c4aa
E0312 14:59:20.390106  403813 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-master-bd1426e1c30cc288379bc4c7dd71c4aa: expected target osImageURL "registry.uniontech.com/utccp-components/machine-os-content:1.2.2", have ""

Troubleshooting

Analysis

On the v1.3 branch, find where the logged error is raised: "pkg/daemon/daemon.go", line 1579

// validateOnDiskState compares the on-disk state against what a configuration
// specifies.  If for example an admin ssh'd into a node, or another operator
// is stomping on our files, we want to highlight that and mark the system
// degraded.
func (dn *Daemon) validateOnDiskState(currentConfig *mcfgv1.MachineConfig) error {
        // Be sure we're booted into the OS we expect
        osMatch := dn.checkOS(currentConfig.Spec.OSImageURL)
        if !osMatch {
                return errors.Errorf("expected target osImageURL %q, have %q", currentConfig.Spec.OSImageURL, dn.bootedOSImageURL)
        }

        return validateOnDiskState(currentConfig, pathSystemd)
}

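The error is returned because the booted OS image URL does not match the target OS image URL. The odd part is that the booted value (the "have" side of the message) is an empty string, so the next step is to see how that value is obtained.

"pkg/daemon/rpm-ostree.go", line 337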

// GetBootedOSImageURL returns the image URL as well as the OSTree version (for logging)
// Returns the empty string if the host doesn't have a custom origin that matches pivot://
// (This could be the case for e.g. FCOS, or a future RHCOS which comes not-pivoted by default)
func (r *RpmOstreeClient) GetBootedOSImageURL() (string, string, error) {
        bootedDeployment, err := r.GetBootedDeployment()
        if err != nil {
                return "", "", err
        }

        // the canonical image URL is stored in the custom origin field.
        osImageURL := ""
        if len(bootedDeployment.CustomOrigin) > 0 {
                if strings.HasPrefix(bootedDeployment.CustomOrigin[0], "pivot://") {
                        osImageURL = bootedDeployment.CustomOrigin[0][len("pivot://"):]
                }
        }

        return osImageURL, bootedDeployment.Version, nil
}

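The value defaults to the empty string. The rpm-ostree status output shown at the top of this note has no CustomOrigin field on the booted deployment. The output below is from a healthy node: when the initial boot completes correctly, the first deployment's URL starts with pivot:// and a CustomOrigin is present.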

sh-5.1# chroot /host
sh-5.1# rpm-ostree status
State: idle
Deployments:
* pivot://registry.uniontech.com/uccps-components/machine-os-content:1.3.0
              CustomOrigin: Managed by machine-config-operator
                   Version: 20.20231221.uccps.0 (2023-12-21T07:35:25Z)

  uswift:uccps/x86_64
                   Version: 20.20231221.uccps.0 (2023-12-21T07:35:25Z)
                    Commit: 9261ed2ea2851f5caebd641318388406525fbb3e863e30356e84008e513a859a
sh-5.1#
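
The difference between the two outputs can also be checked programmatically. The following is a minimal sketch, not the MCO implementation: it assumes the JSON form of the status (`rpm-ostree status --json`) with the `booted`, `version` and `custom-origin` fields, and the two JSON samples are hand-written to mirror the text outputs above rather than captured from a node. The names `bootedOSImageURL`, `brokenNode` and `healthyNode` are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// deployment holds only the fields of the JSON status that this sketch needs.
type deployment struct {
	Booted       bool     `json:"booted"`
	Version      string   `json:"version"`
	CustomOrigin []string `json:"custom-origin"`
}

type status struct {
	Deployments []deployment `json:"deployments"`
}

// bootedOSImageURL returns the image URL of the booted deployment, or ""
// when the booted deployment has no pivot:// custom origin.
func bootedOSImageURL(raw []byte) (string, error) {
	var st status
	if err := json.Unmarshal(raw, &st); err != nil {
		return "", err
	}
	for _, d := range st.Deployments {
		if !d.Booted {
			continue
		}
		if len(d.CustomOrigin) > 0 && strings.HasPrefix(d.CustomOrigin[0], "pivot://") {
			return strings.TrimPrefix(d.CustomOrigin[0], "pivot://"), nil
		}
		return "", nil
	}
	return "", errors.New("no booted deployment found")
}

// Hand-written samples (assumption): a node that never pivoted, and a healthy node.
const brokenNode = `{"deployments":[{"booted":true,"version":"20.20230625.uccps.0"}]}`
const healthyNode = `{"deployments":[{"booted":true,"version":"20.20231221.uccps.0",
 "custom-origin":["pivot://registry.uniontech.com/uccps-components/machine-os-content:1.3.0",
 "Managed by machine-config-operator"]}]}`

func main() {
	for _, sample := range []string{brokenNode, healthyNode} {
		url, err := bootedOSImageURL([]byte(sample))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("booted osImageURL: %q\n", url)
	}
}
```

An empty result for a node corresponds to the `have ""` part of the Degraded message.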

In other words, the initial boot never completed on 1.2. Tracing from validateOnDiskState in daemon.go leads to checkOS, which is what reports the mismatch.

pkg/daemon/daemon.go, line 1521

// checkOS determines whether the booted system matches the target
// osImageURL and if not whether we need to take action.  This function
// returns `true` if no action is required, which is the case if we're
// not running RHCOS or FCOS, or if the target osImageURL is "" (unspecified),
// or if the digests match.
// Otherwise if `false` is returned, then we need to perform an update.
func (dn *Daemon) checkOS(osImageURL string) bool {
        // Nothing to do if we're not on RHCOS or FCOS
        if !dn.os.IsCoreOSVariant() {
                glog.Infof(`Not booted into a CoreOS variant, ignoring target OSImageURL %s`, osImageURL)
                return true
        }

        return compareOSImageURL(dn.bootedOSImageURL, osImageURL)
}

Whether the OS image URL is compared at all depends on IsCoreOSVariant. On the v1.2 branch:

pkg/daemon/osrelease.go, line 44

// IsCoreOSVariant is true if the OS is FCOS or a derivative (ostree+Ignition)
// which includes RHCOS.
func (os OperatingSystem) IsCoreOSVariant() bool {
        // We should probably add VARIANT_ID=coreos to RHCOS too and key off that
        return os.IsFCOS() || os.IsRHCOS() || os.IsUSwift()
}

Next, look at IsUSwift:

// IsUSwift is true if the OS is UOS USwift
func (os OperatingSystem) IsUSwift() bool {
        return os.ID == "Uswift" || os.ID == "USwift"
}

Back on the node, check /etc/os-release:

[uswift@utccp-test-289lv-worker-0-gs922 ~]$ cat /etc/os-release 
PRETTY_NAME="UOS Server 20"
NAME="UOS Server 20"
VERSION_ID="20"
VERSION="20"
ID="uos"
HOME_URL="https://www.chinauos.com/"
BUG_REPORT_URL="https://bbs.chinauos.com/"
VERSION_CODENAME=mercury
PLATFORM_ID="platform:uelc20"
OSTREE_VERSION='20.20230625.uccps.0'

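Here ID is uos, which is neither USwift nor Uswift, so the node was not treated as a CoreOS variant and the first boot was skipped. On the 1.3 branch, IsUSwift instead checks for /run/ostree-booted: https://gerrit-dev.uniontech.com/c/utccp-image-machine-config-operator/+/66967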
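
To make the mismatch concrete, here is a small sketch that applies the v1.2 ID comparison to the os-release content above and contrasts it with a marker-file check. It is an illustration only: `parseOSReleaseID`, `isUSwiftByID` and `isOSTreeBooted` are hypothetical helper names, and the actual 1.3 change is the one in the Gerrit review linked above.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parseOSReleaseID extracts the ID field from os-release content.
// It returns "" when no ID= line is present.
func parseOSReleaseID(content string) string {
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "ID=") {
			return strings.Trim(strings.TrimPrefix(line, "ID="), `"'`)
		}
	}
	return ""
}

// isUSwiftByID reproduces the v1.2 comparison quoted earlier.
func isUSwiftByID(id string) bool {
	return id == "Uswift" || id == "USwift"
}

// isOSTreeBooted reports whether the given marker file exists. The path is a
// parameter so the check can be exercised without touching /run.
func isOSTreeBooted(marker string) bool {
	_, err := os.Stat(marker)
	return err == nil
}

func main() {
	// Excerpt of the /etc/os-release shown above.
	sample := "NAME=\"UOS Server 20\"\nVERSION_ID=\"20\"\nID=\"uos\"\n"

	id := parseOSReleaseID(sample)
	fmt.Printf("ID=%q, matches v1.2 ID check: %t\n", id, isUSwiftByID(id))
	fmt.Printf("/run/ostree-booted present: %t\n", isOSTreeBooted("/run/ostree-booted"))
}
```

The ID comparison depends on how the image was branded, while the marker-file check depends only on whether the system was booted through ostree, which is why the latter still recognises a node whose ID is uos.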

Summary

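So this is not an upgrade problem. The first boot was never completed correctly when 1.2 was deployed, before the upgrade. This may also explain the cluster deployment errors Xuanbao saw after adding passwordless login to the release: the passwordless-login file triggers an MCO update, which then hits the same error.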
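
The chain of events can be summarised as a small decision sketch. This is a simplified stand-in for the checkOS function quoted earlier, not the real code: plain string equality replaces compareOSImageURL (whose implementation is not shown in this note), and the `outcome` type and its values are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// outcome describes what the OS check concludes.
type outcome string

const (
	skipped  outcome = "no action: not a CoreOS variant, OS check skipped"
	inSync   outcome = "no action: booted image matches target"
	mismatch outcome = "mismatch: update needed, or Degraded during validation"
)

// checkOS mirrors the structure of the quoted function. The comparison is
// simplified to string equality, which is an assumption of this sketch.
func checkOS(isCoreOSVariant bool, booted, target string) outcome {
	if !isCoreOSVariant {
		return skipped
	}
	if target == "" || booted == target {
		return inSync
	}
	return mismatch
}

func main() {
	target := "registry.uniontech.com/utccp-components/machine-os-content:1.2.2"

	// 1.2 daemon: ID "uos" is not recognised, so the check is skipped and
	// the node never pivots on first boot.
	fmt.Println("1.2 first boot:   ", checkOS(false, "", target))

	// 1.3 daemon: the node is recognised, but the booted URL is still empty.
	fmt.Println("1.3 after upgrade:", checkOS(true, "", target))

	// After the first boot is completed, booted and target agree.
	fmt.Println("after the fix:    ", checkOS(true, target, target))
}
```

The first case is silent, which is why the 1.2 cluster looked healthy; the problem only becomes visible once a daemon that recognises the node runs the comparison.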

Solution

Use the 1.3 machine-config-daemon to complete the first boot into the 1.2 machine-os-content.

Run the following commands on every node:

sudo podman pull registry.uniontech.com/uccps-components/machine-config-operator:1.3.0
sudo podman run --rm --quiet --net=host -v /run/bin:/host/run/bin:z --entrypoint=cp 'registry.uniontech.com/uccps-components/machine-config-operator:1.3.0' /usr/bin/machine-config-daemon /host/run/bin
sudo /bin/chcon system_u:object_r:bin_t:s0 /run/bin/machine-config-daemon
sudo cp /etc/ignition-machine-config-encapsulated.json.bak /etc/ignition-machine-config-encapsulated.json && sudo /run/bin/machine-config-daemon firstboot-complete-machineconfig

The last command ends with a "no such file or directory" error. This error can be ignored.

I0313 03:43:26.962177 1211001 update.go:1986] Rebooting node
I0313 03:43:26.964941 1211001 update.go:2016] Removing SIGTERM protection
error: failed to rename encapsulated MachineConfig after processing on firstboot: rename /etc/ignition-machine-config-encapsulated.json /etc/ignition-machine-config-encapsulated.json.bak: no such file or directory

After a short wait the machine reboots automatically. Log in again and the machine is now managed by the MCO:

[uswift@utccp-test-289lv-worker-0-lctt4 ~]$ Connection to 192.168.123.52 closed by remote host.
Connection to 192.168.123.52 closed.
[root@adsl-172-10-0-1 uccps-update]# ssh -i /root/.ssh/uccps uswift@192.168.123.52
Web console: https://utccp-test-289lv-worker-0-lctt4.utccp-test.example.com:9090/ or https://192.168.123.52:9090/

Last login: Wed Mar 13 03:38:58 2024 from 192.168.123.1
[uswift@utccp-test-289lv-worker-0-lctt4 ~]$ rpm-ostree status
State: idle
Deployments:
* pivot://registry.uniontech.com/utccp-components/machine-os-content:1.2.2
              CustomOrigin: Managed by machine-config-operator
                   Version: 20.20230625.uccps.0 (2023-06-25T03:40:11Z)

  uswift:uswift/x86_64
                   Version: 20.20230625.uccps.0 (2023-06-25T08:15:06Z)
                    Commit: 698a07617e25f715b19b047b27a03d18fd526da4a2cfd56f02f6661ba78be127

Once every machine in the corresponding machine pool has gone through these steps, the MCO automatically updates their machine-os-content to 1.3.0.

I0313 03:52:56.102221    2136 rpm-ostree.go:325] Running captured: rpm-ostree status
I0313 03:52:56.196015    2136 daemon.go:957] State: idle
Deployments:
* pivot://registry.uniontech.com/uccps-components/machine-os-content:1.3.0
              CustomOrigin: Managed by machine-config-operator
                   Version: 20.20231221.uccps.0 (2023-12-21T07:35:25Z)

  pivot://registry.uniontech.com/utccp-components/machine-os-content:1.2.2
              CustomOrigin: Managed by machine-config-operator
                   Version: 20.20230625.uccps.0 (2023-06-25T03:40:11Z)

The machine-config-daemon log is back to normal:

I0313 03:45:58.376862    2287 daemon.go:394] Node utccp-test-289lv-worker-0-lctt4 is not labeled node-role.kubernetes.io/master
I0313 03:45:58.386300    2287 daemon.go:903] Current+desired config: rendered-worker-3c364e7f394f757afd319cdb5b0e17c8
I0313 03:45:58.395551    2287 daemon.go:538] Detected a login session before the daemon took over on first boot
I0313 03:45:58.395675    2287 daemon.go:539] Applying annotation: machineconfiguration.openshift.io/ssh
I0313 03:45:58.410063    2287 daemon.go:1193] Validating against pending config rendered-worker-3c364e7f394f757afd319cdb5b0e17c8
I0313 03:45:58.941646    2287 daemon.go:1211] Validated on-disk state
I0313 03:45:58.969187    2287 daemon.go:1262] Completing pending config rendered-worker-3c364e7f394f757afd319cdb5b0e17c8
I0313 03:45:58.969216    2287 drain.go:44] Initiating uncordon on node (currently schedulable: false)
I0313 03:45:59.001942    2287 drain.go:62] RunCordonOrUncordon() succeeded but node is still not in uncordon state, retrying
I0313 03:46:09.004125    2287 drain.go:44] Initiating uncordon on node (currently schedulable: true)
I0313 03:46:09.004181    2287 drain.go:66] uncordon succeeded on node (currently schedulable: true)
I0313 03:46:09.004245    2287 update.go:1986] Update completed for config rendered-worker-3c364e7f394f757afd319cdb5b0e17c8 and node has been successfully uncordoned
I0313 03:46:09.020477    2287 daemon.go:1278] In desired config rendered-worker-3c364e7f394f757afd319cdb5b0e17c8
I0313 03:46:09.020931    2287 config_drift_monitor.go:240] Config Drift Monitor started

Other

Forcing the MCD to run its bootstrap again

Run the following commands on the node:

rm /etc/machine-config-daemon/currentconfig
touch /run/machine-config-daemon-force