Troubleshoot an operating system upgrade with host restart
The mandatory host restart during an operating system (OS) upgrade is designed to be safe and takes precautions to protect user data and cluster integrity. However, it can occasionally cause a host-level failure that blocks the cluster update. Use this section to troubleshoot such issues.
Warning
The OS upgrade cannot be rolled back at the host or cluster level. If the OS upgrade fails, you must recover or remove the faulty host before you can complete the cluster upgrade.
Caution
Depending on the cluster configuration, applying security updates and restarting the host can extend the update time for each node to as much as 1 hour.
Cluster nodes are updated one by one. Therefore, for large clusters, the update may take several days to complete.
Pre-upgrade workload lock issues
If the cluster upgrade does not start, verify whether the ceph-clusterworkloadlock object is present in the Container Cloud Management API:
kubectl get clusterworkloadlocks
Example of system response:
NAME AGE
ceph-clusterworkloadlock 7h37m
This object indicates that LCM operations that require a host restart cannot start on the cluster. The Ceph Controller verifies that Ceph services are prepared for the restart. Once the Ceph Controller completes the verification, it removes the ceph-clusterworkloadlock object and the cluster upgrade starts.
If this object is still present after the upgrade is initiated, inspect the logs of the ceph-controller pod to identify and fix errors:
kubectl -n ceph-lcm-mirantis logs deployments/ceph-controller
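The two checks above can be combined into one helper. The following is a minimal sketch, not an official tool: it assumes that the current kubeconfig context can reach both the lock object and the ceph-lcm-mirantis namespace, and the function name check_ceph_lock and the LOG_TAIL variable are hypothetical names introduced only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report whether ceph-clusterworkloadlock still blocks the upgrade.
# LOG_TAIL is a hypothetical knob that limits how many log lines are shown.
LOG_TAIL="${LOG_TAIL:-100}"

check_ceph_lock() {
  local lock
  # --ignore-not-found makes kubectl print nothing and exit 0
  # when the object has already been removed.
  lock="$(kubectl get clusterworkloadlocks ceph-clusterworkloadlock \
            --ignore-not-found -o name)" || return 2

  if [ -z "$lock" ]; then
    echo "ceph-clusterworkloadlock is absent: Ceph does not block host restarts"
    return 0
  fi

  echo "ceph-clusterworkloadlock is still present; recent ceph-controller logs:"
  kubectl -n ceph-lcm-mirantis logs deployments/ceph-controller --tail="$LOG_TAIL"
  return 1
}

# Example invocation: check_ceph_lock
```

The function returns 0 when the lock is gone, 1 when the lock is present (after printing the log tail), and 2 when the lock query itself fails, so it can be used in a simple polling loop.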
If a node upgrade does not start, verify whether the NodeWorkloadLock
object is present in the Container Cloud Management API:
kubectl get nodeworkloadlocks
If the object is present, assess the affected node logs to identify and fix errors.
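To see which nodes are affected before looking at their logs, it may help to print the full definition of every remaining lock. The sketch below assumes that NodeWorkloadLock objects are listed without a namespace, as in the command above; the function name list_node_locks is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list NodeWorkloadLock objects and dump each definition as YAML.
list_node_locks() {
  local locks name
  # -o name prints one "<resource>/<name>" entry per line, or nothing if none exist.
  locks="$(kubectl get nodeworkloadlocks -o name)" || return 2

  if [ -z "$locks" ]; then
    echo "no NodeWorkloadLock objects found"
    return 0
  fi

  while IFS= read -r name; do
    echo "--- $name"
    kubectl get "$name" -o yaml
  done <<< "$locks"
  return 1
}

# Example invocation: list_node_locks
```

A return code of 1 signals that at least one lock is still present, which makes the helper usable as a condition in a wrapper script.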
Host restart issues
If the host cannot boot after the upgrade, check for the following possible issues:
- Invalid boot order configuration in the host BIOS settings
Inspect the host settings using the IPMI console. If you see a message about an invalid boot device, verify and correct the boot order in the host BIOS settings. Set the first boot device to a network card and the second device to a local disk (legacy or UEFI).
- The host is stuck in the GRUB rescue mode
If you see the following message, you are likely affected by the known issue in the Ubuntu grub-installer:
Entering rescue mode...
grub rescue>
In this case, redeploy the host with a correctly defined BareMetalHostProfile. You will have to delete the corresponding Machine resource and create a new Machine with the corresponding BareMetalHostProfile. For details, see Create MOSK host profiles.
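The redeployment described for the GRUB rescue case could be scripted roughly as follows. This is a sketch under assumptions, not a documented procedure: the function name redeploy_machine, the project namespace, the Machine name, and the manifest file are all hypothetical placeholders, and the manifest is assumed to already reference the corrected BareMetalHostProfile. Deleting a Machine is destructive, so review the saved copy and the new manifest before running anything like this.

```shell
#!/usr/bin/env bash
# Sketch: replace a Machine so that the host is redeployed with a corrected
# BareMetalHostProfile. All three arguments are hypothetical placeholders:
#   $1 - project namespace, $2 - Machine name, $3 - path to the new Machine manifest
redeploy_machine() {
  local ns="$1" machine="$2" manifest="$3"

  [ -f "$manifest" ] || { echo "manifest not found: $manifest" >&2; return 2; }

  # Keep a copy of the current definition for reference before deleting it.
  kubectl -n "$ns" get machine "$machine" -o yaml > "${machine}.backup.yaml" || return 1

  # Delete the old Machine and wait until the object is gone.
  kubectl -n "$ns" delete machine "$machine" --wait=true || return 1

  # Create the new Machine from the prepared manifest.
  kubectl -n "$ns" apply -f "$manifest"
}

# Example invocation: redeploy_machine my-project my-machine new-machine.yaml
```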