Troubleshoot an operating system upgrade with host restart
Mandatory host restart for the operating system (OS) upgrade is designed to be safe and takes certain precautions to protect the user data and the cluster integrity. However, sometimes it may result in a host-level failure and block the cluster upgrade. Use this section to troubleshoot such issues.
The OS upgrade cannot be rolled back on a host or cluster level. If the OS upgrade fails, you must recover or remove the faulty host before you can complete the cluster upgrade.
Depending on the cluster configuration, applying security updates and restarting the host can increase the update time for each node to as much as 1 hour.
Cluster nodes are updated one by one. Therefore, for large clusters, the update may take several days to complete.
Pre-upgrade workload lock issues
If the cluster upgrade does not start, verify whether the
ceph-clusterworkloadlock object is present in the Container Cloud Management API:
kubectl get clusterworkloadlocks
Example of system response:
NAME                       AGE
ceph-clusterworkloadlock   7h37m
This object indicates that LCM operations that require a host restart cannot
start on the cluster. The Ceph Controller verifies that Ceph services are
prepared for restart. Once the Ceph Controller completes verification, it
removes the ceph-clusterworkloadlock object and the cluster upgrade starts.
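The check described above can be repeated until the lock goes away. The following is a minimal sketch, not an official tool: the function names, the attempt count, and the polling interval are illustrative assumptions, and it relies only on `kubectl get ... -o name`, which prints each object as `<resource>/<name>`.

```shell
# Sketch: poll until the ceph-clusterworkloadlock object is no longer listed.
# The lock name comes from the text above; everything else is an assumption.
LOCK_NAME="ceph-clusterworkloadlock"

# Returns 0 if the lock is listed, 1 if it is absent, 2 if kubectl itself failed.
lock_present() {
  local names
  names="$(kubectl get clusterworkloadlocks -o name)" || return 2
  printf '%s\n' "$names" | grep -q "/${LOCK_NAME}\$"
}

# Usage: wait_for_lock_release [attempts] [interval_seconds]
wait_for_lock_release() {
  local attempts interval i rc
  attempts="${1:-60}"
  interval="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    rc=0
    lock_present || rc=$?
    case "$rc" in
      1) echo "released"; return 0 ;;
      2) echo "kubectl failed" >&2; return 2 ;;
    esac
    sleep "$interval"
    i=$((i + 1))
  done
  echo "still locked"
  return 1
}
```

For example, `wait_for_lock_release 20 60` would check once a minute for up to 20 minutes. A non-zero exit status means that the lock is still present or that the query failed, so the result is never reported as released by mistake.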
If this object is still present after the upgrade is initiated, assess the
logs of the
ceph-controller pod to identify and fix errors:
kubectl -n ceph-lcm-mirantis logs deployments/ceph-controller
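When the log is long, it may help to narrow it to recent, error-like lines. This is a rough sketch: the function name and the `error|fail` pattern are assumptions about what the controller logs look like, while `--since` is a standard `kubectl logs` option.

```shell
# Sketch: show only recent error-like lines from the ceph-controller logs.
# Usage: ceph_controller_errors [duration], for example 30m or 2h (default: 1h).
ceph_controller_errors() {
  local since
  since="${1:-1h}"
  # The grep pattern is an assumption; adjust it to the actual log format.
  kubectl -n ceph-lcm-mirantis logs deployments/ceph-controller --since="$since" \
    | grep -iE 'error|fail' || true
}
```

An empty result only means that nothing matched the pattern within the time window, so fall back to the full log before you conclude that the controller is healthy.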
If a node upgrade does not start, verify whether a NodeWorkloadLock
object is present in the Container Cloud Management API:
kubectl get nodeworkloadlocks
If the object is present, assess the affected node logs to identify and fix errors.
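To see which node-level locks exist before you dig into logs, the listing can be reduced to bare object names. This is a minimal sketch: the function name is hypothetical, and how a lock name maps to a particular node is not stated here, so inspect each object to find out.

```shell
# Sketch: print the names of all node workload locks, one per line.
# Returns 1 if none exist and 2 if kubectl itself failed.
list_node_locks() {
  local locks
  locks="$(kubectl get nodeworkloadlocks -o name)" || return 2
  if [ -z "$locks" ]; then
    echo "no node workload locks found"
    return 1
  fi
  # Strip the "<resource>/" prefix that "-o name" adds.
  printf '%s\n' "$locks" | sed 's|.*/||'
}
```

Each printed name can then be passed to `kubectl get nodeworkloadlocks <name> -o yaml` to view the full object.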
Host restart issues
If the host cannot boot after upgrade, verify the following possible issues:
- Invalid boot order configuration in the host BIOS settings
Inspect the host settings using the IPMI console. If you see a message about an invalid boot device, verify and correct the boot order in the host BIOS settings. Set the first boot device to a network card and the second device to a local disk (legacy or UEFI).
- The host is stuck in the GRUB rescue mode
If you see the following message, you are likely affected by a known Ubuntu issue:
Entering rescue mode... grub rescue>
In this case, redeploy the host with a correctly defined
BareMetalHostProfile. You will have to delete the corresponding
Machine resource and create a new
Machine with the corresponding
BareMetalHostProfile. For details, see Create a custom host profile.
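For the boot order check in the first item above, the host can also be inspected from a shell through its BMC. This sketch assumes that `ipmitool` is installed and that the BMC accepts IPMI over LAN. `BMC_HOST` and `BMC_USER` are hypothetical placeholders, and `-E` makes `ipmitool` read the password from the `IPMI_PASSWORD` environment variable. Boot parameter 5 reports the IPMI boot flags, which are not the full BIOS boot order, so treat the output as a hint only.

```shell
# Sketch: inspect a host that fails to boot, using its BMC.
# BMC_HOST and BMC_USER are placeholders that must be set by the operator.

# Run any ipmitool subcommand against the BMC.
bmc() {
  ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -E "$@"
}

# Show the current IPMI boot flags (boot parameter 5).
show_boot_flags() {
  bmc chassis bootparam get 5
}

# Attach to the serial-over-LAN console to watch boot messages,
# such as an invalid boot device error or the grub rescue prompt.
open_console() {
  bmc sol activate
}
```

The serial-over-LAN console works only if it is enabled on the BMC. Otherwise, use the vendor console as described in the first item.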