After deploying 1.2, all components were healthy. After upgrading to 1.3, the mco component became abnormal.
Information
Node status:
[uswift@utccp-test-289lv-worker-0-gs922 ~]$ rpm-ostree status
State: idle
Deployments:
* uswift:uswift/x86_64
Version: 20.20230625.uccps.0 (2023-06-25T08:15:06Z)
Commit: 698a07617e25f715b19b047b27a03d18fd526da4a2cfd56f02f6661ba78be127
Error messages in the machine-config-daemon log:
I0312 14:59:20.356836 403813 daemon.go:903] Current+desired config: rendered-master-bd1426e1c30cc288379bc4c7dd71c4aa
I0312 14:59:20.367576 403813 daemon.go:538] Detected a login session before the daemon took over on first boot
I0312 14:59:20.367606 403813 daemon.go:539] Applying annotation: machineconfiguration.openshift.io/ssh
I0312 14:59:20.390021 403813 daemon.go:1196] Validating against current config rendered-master-bd1426e1c30cc288379bc4c7dd71c4aa
E0312 14:59:20.390106 403813 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-master-bd1426e1c30cc288379bc4c7dd71c4aa: expected target osImageURL "registry.uniontech.com/utccp-components/machine-os-content:1.2.2", have ""
Troubleshooting
Analysis
On the v1.3 branch, look at where the logged error is raised: "pkg/daemon/daemon.go" 1579 lines
// validateOnDiskState compares the on-disk state against what a configuration
// specifies. If for example an admin ssh'd into a node, or another operator
// is stomping on our files, we want to highlight that and mark the system
// degraded.
func (dn *Daemon) validateOnDiskState(currentConfig *mcfgv1.MachineConfig) error {
	// Be sure we're booted into the OS we expect
	osMatch := dn.checkOS(currentConfig.Spec.OSImageURL)
	if !osMatch {
		return errors.Errorf("expected target osImageURL %q, have %q", currentConfig.Spec.OSImageURL, dn.bootedOSImageURL)
	}
	return validateOnDiskState(currentConfig, pathSystemd)
}
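To see why the log line ends with have "", the formatting step can be reproduced in isolation. This is a minimal sketch, not the daemon's code: osMismatchError is a hypothetical helper, and it uses the standard library's fmt.Errorf in place of errors.Errorf, on the assumption that the %q verb is handled the same way by both.

```go
package main

import "fmt"

// osMismatchError is a hypothetical helper that reuses the format string from
// validateOnDiskState. %q quotes its argument, so an empty booted URL is
// rendered as "" in the message.
func osMismatchError(targetURL, bootedURL string) error {
	return fmt.Errorf("expected target osImageURL %q, have %q", targetURL, bootedURL)
}

func main() {
	target := "registry.uniontech.com/utccp-components/machine-os-content:1.2.2"
	// The booted URL is empty on a node whose booted deployment was never pivoted.
	fmt.Println(osMismatchError(target, ""))
}
```

The empty quoted value in the message is therefore the booted side (dn.bootedOSImageURL), not the target side.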
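The error is raised because the booted image URL and the target image URL differ. What is odd is that the booted image URL (the "have" value) is empty, so the next step is to see how that value is obtained.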
"pkg/daemon/rpm-ostree.go" 337 lines
// GetBootedOSImageURL returns the image URL as well as the OSTree version (for logging)
// Returns the empty string if the host doesn't have a custom origin that matches pivot://
// (This could be the case for e.g. FCOS, or a future RHCOS which comes not-pivoted by default)
func (r *RpmOstreeClient) GetBootedOSImageURL() (string, string, error) {
	bootedDeployment, err := r.GetBootedDeployment()
	if err != nil {
		return "", "", err
	}
	// the canonical image URL is stored in the custom origin field.
	osImageURL := ""
	if len(bootedDeployment.CustomOrigin) > 0 {
		if strings.HasPrefix(bootedDeployment.CustomOrigin[0], "pivot://") {
			osImageURL = bootedDeployment.CustomOrigin[0][len("pivot://"):]
		}
	}
	return osImageURL, bootedDeployment.Version, nil
}
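The value is empty by default. The rpm-ostree status output in the Information section above shows no CustomOrigin entry. Below is the output from a healthy node: when the first boot completes correctly, the URL of Deployments[0] starts with pivot:// and CustomOrigin is present.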
sh-5.1# chroot /host
sh-5.1# rpm-ostree status
State: idle
Deployments:
* pivot://registry.uniontech.com/uccps-components/machine-os-content:1.3.0
CustomOrigin: Managed by machine-config-operator
Version: 20.20231221.uccps.0 (2023-12-21T07:35:25Z)
uswift:uccps/x86_64
Version: 20.20231221.uccps.0 (2023-12-21T07:35:25Z)
Commit: 9261ed2ea2851f5caebd641318388406525fbb3e863e30356e84008e513a859a
sh-5.1#
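The same check can be scripted against machine-readable status output. The following is a minimal sketch under stated assumptions: it assumes the status is available as JSON (for example from rpm-ostree status --json) and that the keys are named deployments, booted, version and custom-origin; the deployment and status types and the bootedOSImageURL function are hypothetical names for illustration, and the sample inputs are hand-written rather than captured from a node.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// deployment holds the fields this sketch reads from one deployment entry.
// The JSON key names are assumptions, not confirmed against a real node.
type deployment struct {
	Booted       bool     `json:"booted"`
	Version      string   `json:"version"`
	CustomOrigin []string `json:"custom-origin"`
}

type status struct {
	Deployments []deployment `json:"deployments"`
}

// bootedOSImageURL returns the image URL of the booted deployment with the
// pivot:// prefix removed, or "" when the booted deployment has no pivot://
// custom origin (the state seen on the broken node).
func bootedOSImageURL(statusJSON []byte) (string, error) {
	var st status
	if err := json.Unmarshal(statusJSON, &st); err != nil {
		return "", err
	}
	for _, d := range st.Deployments {
		if !d.Booted {
			continue
		}
		if len(d.CustomOrigin) > 0 && strings.HasPrefix(d.CustomOrigin[0], "pivot://") {
			return strings.TrimPrefix(d.CustomOrigin[0], "pivot://"), nil
		}
		return "", nil
	}
	return "", fmt.Errorf("no booted deployment found")
}

func main() {
	// Hand-written sample inputs modelled on the two outputs in this note.
	pivoted := []byte(`{"deployments":[{"booted":true,"version":"20.20231221.uccps.0",` +
		`"custom-origin":["pivot://registry.uniontech.com/uccps-components/machine-os-content:1.3.0",` +
		`"Managed by machine-config-operator"]}]}`)
	unpivoted := []byte(`{"deployments":[{"booted":true,"version":"20.20230625.uccps.0"}]}`)

	for _, in := range [][]byte{pivoted, unpivoted} {
		url, err := bootedOSImageURL(in)
		fmt.Printf("url=%q err=%v\n", url, err)
	}
}
```

An empty URL with a nil error corresponds to the degraded node: a booted deployment exists, but it was never pivoted.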
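In other words, the first-boot bootstrap had already failed on 1.2. Tracing from the validateOnDiskState function in daemon.go shown earlier, the mismatch is reported by checkOS.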
pkg/daemon/daemon.go 1521 lines
// checkOS determines whether the booted system matches the target
// osImageURL and if not whether we need to take action. This function
// returns `true` if no action is required, which is the case if we're
// not running RHCOS or FCOS, or if the target osImageURL is "" (unspecified),
// or if the digests match.
// Otherwise if `false` is returned, then we need to perform an update.
func (dn *Daemon) checkOS(osImageURL string) bool {
	// Nothing to do if we're not on RHCOS or FCOS
	if !dn.os.IsCoreOSVariant() {
		glog.Infof(`Not booted into a CoreOS variant, ignoring target OSImageURL %s`, osImageURL)
		return true
	}
	return compareOSImageURL(dn.bootedOSImageURL, osImageURL)
}
checkOS consults IsCoreOSVariant before it compares the osImageURL. On the v1.2 branch:
pkg/daemon/osrelease.go 44 lines
// IsCoreOSVariant is true if the OS is FCOS or a derivative (ostree+Ignition)
// which includes RHCOS.
func (os OperatingSystem) IsCoreOSVariant() bool {
	// We should probably add VARIANT_ID=coreos to RHCOS too and key off that
	return os.IsFCOS() || os.IsRHCOS() || os.IsUSwift()
}
Next, look at IsUSwift:
// IsUSwift is true if the OS is UOS USwift
func (os OperatingSystem) IsUSwift() bool {
	return os.ID == "Uswift" || os.ID == "USwift"
}
Back on the node, check /etc/os-release:
[uswift@utccp-test-289lv-worker-0-gs922 ~]$ cat /etc/os-release
PRETTY_NAME="UOS Server 20"
NAME="UOS Server 20"
VERSION_ID="20"
VERSION="20"
ID="uos"
HOME_URL="https://www.chinauos.com/"
BUG_REPORT_URL="https://bbs.chinauos.com/"
VERSION_CODENAME=mercury
PLATFORM_ID="platform:uelc20"
OSTREE_VERSION='20.20230625.uccps.0'
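The ID here is uos, which is neither USwift nor Uswift, so IsUSwift returns false and the first-boot bootstrap was skipped. On the 1.3 branch the IsUSwift check was changed to test for /run/ostree-booted: https://gerrit-dev.uniontech.com/c/utccp-image-machine-config-operator/+/66967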
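The difference between the two detection methods can be sketched as follows. This is an illustration only: osReleaseID and isOSTreeBooted are hypothetical helpers, isUSwiftByID copies the 1.2 comparison quoted above, and the file-existence test is an assumption about what the 1.3 change does, based solely on the description in this note.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// osReleaseID extracts the ID field from os-release content and strips any
// surrounding quotes. Lines such as VERSION_ID= do not match the ID= prefix.
func osReleaseID(content string) string {
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if strings.HasPrefix(line, "ID=") {
			return strings.Trim(strings.TrimPrefix(line, "ID="), `"'`)
		}
	}
	return ""
}

// isUSwiftByID mirrors the 1.2 comparison quoted in this note.
func isUSwiftByID(id string) bool {
	return id == "Uswift" || id == "USwift"
}

// isOSTreeBooted reports whether the marker file exists. Using this as the
// 1.3-style detection is an assumption drawn from the note's description.
func isOSTreeBooted(markerPath string) bool {
	_, err := os.Stat(markerPath)
	return err == nil
}

func main() {
	// Abbreviated copy of the os-release content shown above.
	content := "NAME=\"UOS Server 20\"\nVERSION_ID=\"20\"\nID=\"uos\"\n"
	id := osReleaseID(content)
	fmt.Printf("ID=%q matchesUSwift=%v\n", id, isUSwiftByID(id))
	fmt.Printf("ostree-booted=%v\n", isOSTreeBooted("/run/ostree-booted"))
}
```

With ID="uos" the ID-based check fails regardless of how the node was booted, whereas the marker-file check does not depend on the contents of os-release.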
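Summary
The upgrade itself is not at fault. The node was not bootstrapped correctly on first boot when 1.2 was deployed, before the upgrade. This may also explain the cluster build errors seen after Xuanbao added passwordless login to the release: the passwordless-login files trigger an mco update, which then hits the same error.
Solution
Use the 1.3 machine-config-daemon to bootstrap the 1.2 machine-os-content.
Run the following commands on every node: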
sudo podman pull registry.uniontech.com/uccps-components/machine-config-operator:1.3.0
sudo podman run --rm --quiet --net=host -v /run/bin:/host/run/bin:z --entrypoint=cp 'registry.uniontech.com/uccps-components/machine-config-operator:1.3.0' /usr/bin/machine-config-daemon /host/run/bin
sudo /bin/chcon system_u:object_r:bin_t:s0 /run/bin/machine-config-daemon
sudo cp /etc/ignition-machine-config-encapsulated.json.bak /etc/ignition-machine-config-encapsulated.json && sudo /run/bin/machine-config-daemon firstboot-complete-machineconfig
The last command ends with a "no such file or directory" error; this error can be ignored.
I0313 03:43:26.962177 1211001 update.go:1986] Rebooting node
I0313 03:43:26.964941 1211001 update.go:2016] Removing SIGTERM protection
error: failed to rename encapsulated MachineConfig after processing on firstboot: rename /etc/ignition-machine-config-encapsulated.json /etc/ignition-machine-config-encapsulated.json.bak: no such file or directory
After a short wait the machine reboots automatically. Log in again and you can see that the machine is now managed by mco.
[uswift@utccp-test-289lv-worker-0-lctt4 ~]$ Connection to 192.168.123.52 closed by remote host.
Connection to 192.168.123.52 closed.
[root@adsl-172-10-0-1 uccps-update]# ssh -i /root/.ssh/uccps uswift@192.168.123.52
Web console: https://utccp-test-289lv-worker-0-lctt4.utccp-test.example.com:9090/ or https://192.168.123.52:9090/
Last login: Wed Mar 13 03:38:58 2024 from 192.168.123.1
[uswift@utccp-test-289lv-worker-0-lctt4 ~]$ rpm-ostree status
State: idle
Deployments:
* pivot://registry.uniontech.com/utccp-components/machine-os-content:1.2.2
CustomOrigin: Managed by machine-config-operator
Version: 20.20230625.uccps.0 (2023-06-25T03:40:11Z)
uswift:uswift/x86_64
Version: 20.20230625.uccps.0 (2023-06-25T08:15:06Z)
Commit: 698a07617e25f715b19b047b27a03d18fd526da4a2cfd56f02f6661ba78be127
Once all machines in the corresponding machine pool have completed these steps, mco automatically updates their machine-os-content to 1.3.0.
I0313 03:52:56.102221 2136 rpm-ostree.go:325] Running captured: rpm-ostree status
I0313 03:52:56.196015 2136 daemon.go:957] State: idle
Deployments:
* pivot://registry.uniontech.com/uccps-components/machine-os-content:1.3.0
CustomOrigin: Managed by machine-config-operator
Version: 20.20231221.uccps.0 (2023-12-21T07:35:25Z)
pivot://registry.uniontech.com/utccp-components/machine-os-content:1.2.2
CustomOrigin: Managed by machine-config-operator
Version: 20.20230625.uccps.0 (2023-06-25T03:40:11Z)
The machine-config-daemon log is back to normal:
I0313 03:45:58.376862 2287 daemon.go:394] Node utccp-test-289lv-worker-0-lctt4 is not labeled node-role.kubernetes.io/master
I0313 03:45:58.386300 2287 daemon.go:903] Current+desired config: rendered-worker-3c364e7f394f757afd319cdb5b0e17c8
I0313 03:45:58.395551 2287 daemon.go:538] Detected a login session before the daemon took over on first boot
I0313 03:45:58.395675 2287 daemon.go:539] Applying annotation: machineconfiguration.openshift.io/ssh
I0313 03:45:58.410063 2287 daemon.go:1193] Validating against pending config rendered-worker-3c364e7f394f757afd319cdb5b0e17c8
I0313 03:45:58.941646 2287 daemon.go:1211] Validated on-disk state
I0313 03:45:58.969187 2287 daemon.go:1262] Completing pending config rendered-worker-3c364e7f394f757afd319cdb5b0e17c8
I0313 03:45:58.969216 2287 drain.go:44] Initiating uncordon on node (currently schedulable: false)
I0313 03:45:59.001942 2287 drain.go:62] RunCordonOrUncordon() succeeded but node is still not in uncordon state, retrying
I0313 03:46:09.004125 2287 drain.go:44] Initiating uncordon on node (currently schedulable: true)
I0313 03:46:09.004181 2287 drain.go:66] uncordon succeeded on node (currently schedulable: true)
I0313 03:46:09.004245 2287 update.go:1986] Update completed for config rendered-worker-3c364e7f394f757afd319cdb5b0e17c8 and node has been successfully uncordoned
I0313 03:46:09.020477 2287 daemon.go:1278] In desired config rendered-worker-3c364e7f394f757afd319cdb5b0e17c8
I0313 03:46:09.020931 2287 config_drift_monitor.go:240] Config Drift Monitor started
^C
Other
Force mcd to bootstrap
Run the following commands on the node:
rm /etc/machine-config-daemon/currentconfig
touch /run/machine-config-daemon-force