Troubleshoot an operating system upgrade with host restart

The mandatory host restart during an operating system (OS) upgrade is designed to be safe and takes precautions to protect user data and cluster integrity. However, it can occasionally cause a host-level failure that blocks the cluster upgrade. Use this section to troubleshoot such issues.

Warning

The OS upgrade cannot be rolled back at the host or cluster level. If the OS upgrade fails, you must recover or remove the faulty host before you can complete the cluster upgrade.

Caution

  • Depending on the cluster configuration, applying security updates and restarting the host can extend the update time for each node to as much as 1 hour.

  • Cluster nodes are updated one by one. Therefore, for large clusters, the update may take several days to complete.

Pre-upgrade workload lock issues

If the cluster upgrade does not start, verify whether the ceph-clusterworkloadlock object is present in the Container Cloud Management API:

kubectl get clusterworkloadlocks

Example of system response:

NAME                       AGE
ceph-clusterworkloadlock   7h37m

This object indicates that LCM operations that require a host restart cannot start on the cluster. Before the upgrade can start, the Ceph Controller verifies that Ceph services are prepared for restart. Once verification completes, the Ceph Controller removes the ceph-clusterworkloadlock object and the cluster upgrade starts.

If this object is still present after the upgrade is initiated, inspect the logs of the ceph-controller pod to identify and fix errors:

kubectl -n ceph-lcm-mirantis logs deployments/ceph-controller
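The two checks above can be combined into a small helper. The following is a minimal sketch, not an official tool: the function names and the 50-line log tail are illustrative assumptions, and it assumes that kubectl is already configured for the API where each object lives, as in the commands above.

```shell
# Sketch only: function names and the --tail value are illustrative assumptions.
LOCK_NAME="ceph-clusterworkloadlock"

# Succeeds (exit 0) when the lock object exists, fails otherwise.
ceph_lock_present() {
  kubectl get clusterworkloadlocks "$LOCK_NAME" >/dev/null 2>&1
}

# Prints the lock status and, if the lock is present, recent ceph-controller logs.
check_ceph_lock() {
  if ceph_lock_present; then
    echo "lock present: upgrade is blocked"
    kubectl -n ceph-lcm-mirantis logs deployments/ceph-controller --tail=50
  else
    echo "lock absent: upgrade can proceed"
  fi
}
```

Run check_ceph_lock after initiating the upgrade. If it keeps reporting the lock as present, review the printed log lines for errors.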

If a node upgrade does not start, verify whether the NodeWorkloadLock object is present in the Container Cloud Management API:

kubectl get nodeworkloadlocks

If the object is present, inspect the logs of the affected node to identify and fix errors.
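To see at a glance which NodeWorkloadLock objects exist, the listing can be wrapped as follows. This is a hedged sketch: the function names are hypothetical, and the only kubectl features used are the standard --no-headers and custom-columns output options.

```shell
# Sketch only: function names are illustrative assumptions.

# Prints the names of all NodeWorkloadLock objects, one per line.
list_node_locks() {
  kubectl get nodeworkloadlocks --no-headers \
    -o custom-columns=NAME:.metadata.name 2>/dev/null
}

# Returns 0 when no lock objects are found, 1 when at least one is present.
report_node_locks() {
  node_locks="$(list_node_locks || true)"
  if [ -z "$node_locks" ]; then
    echo "no NodeWorkloadLock objects found"
    return 0
  fi
  echo "NodeWorkloadLock objects present:"
  echo "$node_locks" | sed 's/^/  /'
  return 1
}
```

A non-zero return code from report_node_locks means that at least one lock object is present and the related node logs are worth checking.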

Host restart issues

If the host cannot boot after the upgrade, check for the following possible issues:

  • Invalid boot order configuration in the host BIOS settings

    Inspect the host settings using the IPMI console. If you see a message about an invalid boot device, verify and correct the boot order in the host BIOS settings. Set the first boot device to a network card and the second device to a local disk (legacy or UEFI).

  • The host is stuck in the GRUB rescue mode

    If you see the following message, you are likely affected by a known issue in the Ubuntu grub-installer:

    Entering rescue mode...
    grub rescue>
    

    In this case, redeploy the host with a correctly defined BareMetalHostProfile: delete the corresponding Machine resource and create a new Machine that uses the corresponding BareMetalHostProfile. For details, see Create a custom host profile.