Update a MOSK cluster to a major release version

This section describes the workflow you as a cloud operator need to follow to correctly update your Mirantis OpenStack for Kubernetes (MOSK) cluster to a major release version.

The provided guidelines are generic and apply to any MOSK cluster regardless of its configuration specifics. However, every target major release may have its own update peculiarities. Therefore, to accurately plan and successfully perform a specific update, in addition to the procedure below, read the update-related section in the Release Notes of the target MOSK version.

Depending on the payload of a target release, the update mechanism can perform the changes on different levels of the stack, from the configuration of the host operating system to the code of OpenStack itself. The update mechanism is designed to avoid the impact on the workloads and cloud users as much as possible. The life-cycle management logic minimizes the downtime for the cloud API by means of smart management of the cluster components under the hood and only requests your involvement when a human decision is required to proceed.

Though the update mechanism may change the internal components of the cluster, it will always preserve the major versions of OpenStack, that is, the APIs that cloud users and workloads deal with. After the cluster is successfully updated, you can initiate a dedicated upgrade procedure to obtain the latest supported versions of OpenStack.

Before you begin

Before updating, carefully read the Release Compatibility Matrix document and the Release Notes of the target release, and thoroughly plan a maintenance window for each update phase depending on the configuration specifics of your cluster.

Read the release notes

Carefully read the Release Compatibility Matrix and Release Notes of the target MOSK version, paying particular attention to the following:

  • Current Mirantis Container Cloud software version and the need to first update to the latest cluster release version

  • Update notes provided in the Release notes for the target MOSK version

  • New product features that will get enabled in your cloud by default

  • New product features that may have already been configured in your cloud as customizations and now need to be properly re-enabled to be eligible for further support

  • Any changes in the behavior of the product features enabled in your cloud

  • List of the addressed and known issues in the target MOSK version

Warning

If your cloud is known to have any custom configuration that was not explicitly approved by Mirantis, make sure to bring this up with your dedicated Mirantis representative before proceeding with the update. Mirantis cannot guarantee the safe updating of a customized cloud.

Plan the cluster update

Depending on the payload brought by a particular target release, a generic cluster update includes from three to six major phases.

The first three phases are present in any update. They focus on the containerized components of the software stack and have minimal impact on the cloud users and workloads.

The remaining phases are only present if any changes need to be made to the foundation layers: the underlay Kubernetes cluster and host operating system. For the changes to take effect, you may need to reboot the cluster nodes. This procedure imposes a severe impact on cloud workloads and, therefore, needs to be thoroughly planned across several sequential maintenance windows.

Important

To effectively plan a cluster update, keep in mind the architecture of your specific cloud. Depending on the selected design, the components of a MOSK cluster may have different distribution across the nodes (physical servers) comprising the underlay bare metal Kubernetes cluster. The more components are collocated on a single node, the greater the impact on the functions of the cloud when the changes are applied.

The tables below will help you plan your cluster update. They include the following information for each mandatory and optional update phase:

  • What happens during the phase

    Includes the phase milestones. This content is important for understanding the impact.

  • Impact

    Describes any possible impact on cloud users and workloads.

    The impact estimate represents the worst-case scenario in the architectures that imply a combination of several cluster roles on the same physical nodes, such as hyper-converged compute nodes and clusters with the compact control plane. Also, the impact estimation presumes that your cluster uses one of the standard architectures provided by the product and follows Mirantis design guidelines.

    Some update phases may occur simultaneously, resulting in a greater cumulative impact but a shorter completion time.

  • Time to complete

    Provides a rough estimation of the time required to complete the phase.

    The estimates for a phase timeline presume that your cluster uses one of the standard architectures provided by the product and follows Mirantis design guidelines.

Warning

During the update, prevent users from performing write operations on the cloud resources. Any intensive manipulations may lead to workload corruption.

Phase 1: Life-cycle management modules update

Important

This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.

Life-cycle management modules update

What happens during the phase

New versions of OpenStack, Tungsten Fabric, and Ceph controllers downloaded and installed. OpenStack and Tungsten Fabric images precached.

Impact

None

Time to complete

Depending on the quality of the Internet connectivity, up to 45 minutes.

Phase 2: OpenStack and Tungsten Fabric components update

Important

This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.

OpenStack and Tungsten Fabric components update

What happens during the phase

New versions of OpenStack and Tungsten Fabric container images downloaded, services restarted sequentially.

Impact

  • Approximately 8% of the running cloud operations may fail over the course of the phase due to minor unavailability of the cloud API.

  • Minor loss of the East-West connectivity with the Open vSwitch networking back end that causes approximately 2 seconds of downtime per compute node.

  • Minor loss of the North-South connectivity with the Open vSwitch networking back end:

    • A non-distributed HA virtual router needs up to 1 minute to fail over

    • A non-distributed and non-HA virtual router failover time depends on many factors and may take up to 10 minutes

Time to complete

  • 20 minutes per network gateway node (Open vSwitch)

  • 5 minutes for a Tungsten Fabric cluster

  • 15 minutes per compute node

Phase 3: Ceph cluster update and upgrade

Important

This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.

Ceph cluster update and upgrade

What happens during the phase

New versions of Ceph components downloaded, services restarted. If applicable, Ceph switched to the latest major version.

Impact

None

Time to complete

The update of a Ceph cluster with 30 storage nodes can take up to 35 minutes. Additionally, 15 minutes are required for the major Ceph version upgrade, if any.

Phase 4a: Host operating system update on Kubernetes master nodes

Important

This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.

Host operating system update on Kubernetes master nodes

What happens during the phase

New system packages downloaded and installed on the host operating system, other major changes applied.

Impact

None

Time to complete

The nodes are updated sequentially. Up to 15 minutes per node.

Phase 4b: Kubernetes components update on Kubernetes master nodes

Important

This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.

Kubernetes components update on Kubernetes master nodes

What happens during the phase

New versions of Kubernetes control plane components downloaded and installed.

Impact

For the compact control plane, approximately 8% of the running cloud operations may fail over the course of the phase due to minor unavailability of the cloud API.

For the compact control plane with collocated gateway nodes (Open vSwitch), minor loss of the North-South connectivity:

  • A non-distributed HA virtual router needs up to 1 minute to fail over

  • A non-distributed and non-HA virtual router failover time depends on many factors and may take up to 10 minutes

Time to complete

Up to 40 minutes total

Phases 5a and 5b: Host operating system and Kubernetes cluster update on Kubernetes worker nodes

Important

This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.

Important

Both phases, 5a and 5b, are applied together, either node by node (default) or to several nodes in parallel. Parallel node update is available since MOSK 23.1.

Take this into consideration when estimating the impact and planning the maintenance window.

Host operating system and Kubernetes cluster

What happens during the phase

During the host operating system update:

  • New packages for host operating system downloaded and installed

  • Workloads optionally migrated

  • Any other major configuration changes applied

  • Node rebooted manually. The cloud operator can optionally postpone the node restart to another maintenance window.

During the Kubernetes cluster update:

  • New versions of Kubernetes components downloaded and installed

  • Kubernetes pods restarted

Impact

  • For the storage nodes:

    • No impact for the nodes hosting the Ceph cluster data.

    • Loss of connectivity to the volumes for the nodes hosting LVM with iSCSI volumes.

  • For the dedicated control plane nodes, approximately 8% of the running cloud operations may fail over the course of the phase due to minor unavailability of the cloud API.

  • For the dedicated gateway nodes (Open vSwitch), minor loss of the North-South connectivity that is up to 1 minute per non-distributed HA router.

  • For the compute nodes, no or controllable impact depending on the configured instance migration policy. See Configure instance migration policy for cluster nodes.

    If instances are not migrated from the host, the downtime depends on the restart time of the Docker and containerd services, because the host packages are updated during this phase and the host may be rebooted. Approximately, the process can take from 5 to 10 minutes per compute node.

Time to complete

By default, the nodes are updated sequentially as follows:

  • For the host operating system update, up to 15 minutes per node.

  • For the Kubernetes cluster update, up to 40 minutes per node.

For MOSK 23.1 to 23.2 and newer updates, you can reduce update time by enabling parallel node update. The procedure is described further in the Enable parallel update of Kubernetes worker nodes subsection.

Phase 6: Cluster nodes reboot

Important

This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.

Important

An update to a newer MOSK version may require a reboot of the cluster nodes for the changes to take effect. Although you can decide when to restart each particular node, the update process cannot be considered complete until all of the nodes have been rebooted.

To determine whether the reboot is required, consult the Step 4. Reboot the nodes with optional instance migration section.

What happens during the phase

  1. You put the cluster into the maintenance mode.

  2. For each node in the cluster:

    1. Optional. You configure an instance migration policy.

    2. You initiate the node reboot.

    3. The node is gracefully restarted with automatic or manual migration of cloud workloads running on it.

Impact

  • For the storage nodes:

    • No impact on the nodes hosting the Ceph cluster data

    • Loss of connectivity to the volumes for the nodes hosting LVM with iSCSI volumes

  • For the control plane nodes, approximately 8% of the running cloud operations may fail over the course of the phase due to minor unavailability of the cloud API.

  • For the network gateway nodes (Open vSwitch), minor loss of the North-South connectivity that is up to 1 minute per non-distributed HA router hosted on them.

  • For the compute nodes, no or controllable impact depending on the configured instance migration policy. See Configure instance migration policy for cluster nodes.

Time to complete

  • Optional. Time to migrate instances across compute nodes.

  • Up to 10 minutes per node to reboot. Depends on the hardware and BIOS configuration. Several nodes can be rebooted in parallel.

Step 1. Verify that the Container Cloud management cluster is up-to-date

MOSK relies on Mirantis Container Cloud to manage the underlying software stack for a cluster, as well as to deliver updates for all the components.

Since every MOSK release is tightly coupled with a Container Cloud release, a MOSK cluster update becomes possible once the management cluster is known to run the latest Container Cloud version. The management cluster periodically checks the public Mirantis repositories and updates itself automatically when a newer version becomes available. However, any managed cluster, including MOSK, that runs an outdated version prevents the management cluster from updating itself automatically.

To identify the current version of the Container Cloud software your management cluster is running, refer to the Container Cloud web UI.
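
If you prefer the command line, you can also inspect the release objects directly on the management cluster. The following is a minimal sketch, assuming kubectl access to the management cluster and that the KaaSRelease and ClusterRelease custom resources are exposed by your Container Cloud version:

# Container Cloud (KaaS) releases known to the management cluster
kubectl get kaasreleases

# Cluster releases available for managed clusters, including MOSK
kubectl get clusterreleases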

Step 2. Initiate MOSK cluster update

Silence alerts

During an update of a MOSK cluster, numerous alerts may be seen in StackLight. This is expected behavior. Therefore, ignore or temporarily mute the alerts as described in Container Cloud Operations Guide: Silence alerts.

Verify Ceph configuration

If you update MOSK to 23.1, verify that the KaaSCephCluster custom resource does not contain the following entries. If they exist, remove them. An example of how to inspect the resource is provided after the list.

  • In the spec.cephClusterSpec section, the external section.

  • In the spec.cephClusterSpec.rookConfig section, the ms_crc_data or ms crc data configuration key. After you remove the key, wait for rook-ceph-mon pods to restart on the MOSK cluster.
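
The following is a sketch of how to inspect and edit the resource, assuming kubectl access to the Container Cloud management cluster and that the KaaSCephCluster object resides in the project namespace of your MOSK cluster:

# Check the KaaSCephCluster specification for the entries listed above
kubectl -n <project-namespace> get kaascephcluster -o yaml | grep -E 'external:|ms_crc_data|ms crc data'

# Remove the entries, if present
kubectl -n <project-namespace> edit kaascephcluster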

Enable parallel update of Kubernetes worker nodes

Optional. Starting from the MOSK 23.1 to 23.2 update, you can enable and configure parallel node update to reduce the update time and minimize downtime, as shown in the example below.
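
The parallelism settings are defined in the Cluster object of the MOSK cluster on the management cluster. The following is a minimal sketch that assumes the maxWorkerUpgradeCount field documented by Container Cloud for parallel worker node updates; verify the exact field name and supported values in the Container Cloud documentation for your version:

kubectl -n <project-namespace> edit cluster <mosk-cluster-name>

  spec:
    providerSpec:
      value:
        # Assumed field: the number of worker nodes that can be updated in
        # parallel. The default value of 1 results in a sequential update.
        maxWorkerUpgradeCount: 3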

Trigger the update

  1. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, find the managed MOSK cluster.

  4. Click the More action icon to see whether a new release is available. If that is the case, click Update cluster.

  5. In the Release Update window, select the required Cluster release to update your managed cluster to.

    The Description section contains the list of component versions to be installed with a new Cluster release.

  6. Click Update.

    Before the cluster update starts, Container Cloud performs a backup of MKE and Docker Swarm. The backup directory is located under:

    • /srv/backup/swarm on every Container Cloud node for Docker Swarm

    • /srv/backup/ucp on one of the controller nodes for MKE
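
    If you need to confirm that the backup has been created, you can list the backup directories directly on the cluster nodes. A minimal sketch, assuming SSH access to the nodes:

      ls -la /srv/backup/swarm   # Docker Swarm backup, present on every Container Cloud node
      ls -la /srv/backup/ucp     # MKE backup, present on one of the controller nodes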

    To view the update status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the update is complete.

Step 3. Watch the cluster update

Watch the update process through the web UI

To view the update status through the Container Cloud web UI, navigate to the Clusters page. Once the orange blinking dot next to the cluster name disappears, the cluster update is complete.

Also, you can see the general status of each node during the update on the Container Cloud cluster view page.

Follow the update process through logs

The whole update process is controlled by lcm-controller, which runs in the kaas namespace of the Container Cloud management cluster. Follow its logs to watch the progress of the update and to discover and debug any issues.
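
For example, assuming the controller is deployed as a Deployment named lcm-controller (the exact object name may differ between Container Cloud versions):

kubectl -n kaas logs deployment/lcm-controller --tail=100 -f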

Watch the state of the cluster and nodes update through the CLI

The lcmclusterstate and lcmmachines objects in the namespace of the Container Cloud project that contains your MOSK cluster on the management cluster provide detailed information about the current phase of the update process, both for the managed cluster as a whole and for specific nodes.

An lcmmachine object in the Ready state indicates that the corresponding node has been successfully updated.

To display the detailed view of the cluster update state, run:

kubectl -n child-ns get lcmclusterstates -o wide

Example system response:

NAME                                            CLUSTERNAME   TYPE              ARG                                          VALUE   ACTUALVALUE   ATTEMPT   MESSAGE
cd-cz7506-child-cl-storage-worker-noefi-rgxhk   child-cl      cordon-drain      cz7506-child-cl-storage-worker-noefi-rgxhk   true                  0         Error: following    NodeWorkloadLocks are still active - ceph: UpdatingController,openstack: InProgress
sd-cz7506-child-cl-storage-worker-noefi-rgxhk   child-cl      swarm-drain       cz7506-child-cl-storage-worker-noefi-rgxhk   true                  0         Error: waiting for kubernetes node kaas-node-5222a92f-5523-457c-8c69-b7aa0ffc235c to be drained first

To display the detailed view of the nodes update state, run:

kubectl -n child-ns get lcmmachines

Example system response:

NAME                                                 CLUSTERNAME   TYPE      STATE
cz5018-child-cl-storage-worker-noefi-dzttw           child-cl      worker    Prepare
cz5019-child-cl-storage-worker-noefi-vxcm9           child-cl      worker    Prepare
cz7500-child-cl-control-storage-worker-noefi-nkk9t   child-cl      control   Ready
cz7501-child-cl-control-storage-worker-noefi-7pcft   child-cl      control   Ready
cz7502-child-cl-control-storage-worker-noefi-c7k6f   child-cl      control   Ready
cz7503-child-cl-storage-worker-noefi-5lvd7           child-cl      worker    Prepare
cz7505-child-cl-storage-worker-noefi-jh4mc           child-cl      worker    Prepare
cz7506-child-cl-storage-worker-noefi-rgxhk           child-cl      worker    Prepare

Step 4. Reboot the nodes with optional instance migration

Depending on the target release content, you may need to reboot the cluster nodes for the changes to take effect. Running a MOSK cluster in a semi-updated state for an extended period may result in unpredictable behavior of the cloud and impact users and workloads. Therefore, when a reboot is required, perform it as soon as possible to avoid potential risks.

Determine if the node needs to be rebooted

Verify the YAML definitions of the LCMMachine and Machine objects. The node must be rebooted if the rebootRequired flag is set to true. In addition, the objects explicitly specify the reason for rebooting (a cluster-wide query is provided at the end of this subsection). For example:

  • The LCMMachine object of the node that requires rebooting:

    ...
    status:
       hostInfo:
         rebootRequired: true
         rebootReason: "linux-image-5.13.0-51-generic"
    
  • The Machine object of the node that does not require rebooting:

    ...
    status:
      ...
      providerStatus:
        ...
        reboot:
          reason: ""
          required: false
        status: Ready
    

Since MOSK 23.1, you can also use the Mirantis Container Cloud web UI to identify the nodes requiring reboot:

  1. In the Clusters tab, click the required cluster name. The page with Machines opens.

  2. Hover over the status of every machine. A machine that requires a reboot displays the Reboot > The machine requires a reboot notification in the Status tooltip.
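
In addition to inspecting individual objects or machines, you can list the reboot status of all nodes of the cluster at once. The following is a minimal sketch based on the LCMMachine fields shown above, assuming the child-ns project namespace used in the examples of this section:

kubectl -n child-ns get lcmmachines \
  -o custom-columns=NAME:.metadata.name,REBOOT:.status.hostInfo.rebootRequired,REASON:.status.hostInfo.rebootReason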

Configure instance migration policy for cluster nodes

Restarting the cluster causes downtime of the cloud services running on the nodes. While the MOSK control plane is built for high availability and can tolerate the temporary loss of up to 1/3 of its services without a significant impact on user experience, rebooting nodes that host elements of the cloud data plane, such as network gateway nodes and compute nodes, has a detrimental effect on the cloud workloads if not performed gracefully.

To mitigate the potential impact on the cloud workloads, you can define the instance migration flow for the compute nodes running the most valuable instances.

The list of available options for the instance migration configuration includes:

  • The openstack.lcm.mirantis.com/instance_migration_mode annotation:

    • live

      Default. The OpenStack Controller live-migrates instances automatically. Before applying any changes to the node, the update mechanism tries to move the memory and local storage of all instances on the node to another node without interrupting the workloads. By default, the update mechanism makes three attempts to migrate each instance before falling back to the manual mode.

      Note

      Success of live migration depends on many factors including the selected vCPU type and model, the amount of data that needs to be transferred, the intensity of the disk IO and memory writes, the type of the local storage, and others. Instances using the following product features are known to have issues with live migration:

      • LVM-based ephemeral storage with and without encryption

      • Encrypted block storage volumes

      • CPU and NUMA node pinning

    • manual

      The OpenStack Controller waits for the Operator to migrate instances from the compute node. When it is time to update the compute node, the update mechanism asks you to manually migrate the instances and proceeds only once you confirm the node is safe to update.

    • skip

      The OpenStack Controller skips the instance check on the node and reboots it.

      Note

      For the clouds relying on the converged LVM with iSCSI block storage that offers persistent volumes in a remote edge sub-region, it is important to keep in mind that applying a major change to a compute node may impact not only the instances running on this node but also the instances attached to the LVM devices hosted there. We recommend that in such environments you perform the update procedure in the manual mode with mitigation measures taken by the Operator for each compute node. Otherwise, all the instances that have LVM with iSCSI volumes attached would need a reboot to restore connectivity.

  • The openstack.lcm.mirantis.com/instance_migration_attempts annotation

    Defines the number of times the OpenStack Controller attempts to migrate a single instance before giving up. Defaults to 3.

Note

You can also use annotations to control the update of non-compute nodes if they represent critical points of a specific cloud architecture. For example, setting the instance_migration_mode to manual on a controller node with a collocated gateway (Open vSwitch) will allow the Operator to gracefully shut down all the virtual routers hosted on this node.

To configure the instance migration policy:

  1. Edit the target compute node resource. For example:

    kubectl edit node kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
    
  2. Set the migration mode and the number of attempts the OpenStack Controller should make to migrate a single instance. For example:

    apiVersion: v1
    kind: Node
    metadata:
     name: kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
     selfLink: /api/v1/nodes/kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
     uid: 54be5139-aba7-47e7-92bf-5575773a12a6
     resourceVersion: '299734609'
     creationTimestamp: '2021-03-24T16:03:11Z'
     labels:
       ...
       openstack-compute-node: enabled
       openvswitch: enabled
     annotations:
       openstack.lcm.mirantis.com/instance_migration_mode: "live"
       openstack.lcm.mirantis.com/instance_migration_attempts: "5"
       ...
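
Alternatively, you can set the same annotations without opening an editor. For example:

kubectl annotate node kaas-node-03ab613d-cf79-4830-ac70-ed735453481a \
  openstack.lcm.mirantis.com/instance_migration_mode=manual \
  openstack.lcm.mirantis.com/instance_migration_attempts=5 \
  --overwrite

The same approach applies to non-compute nodes that you want to handle in the manual mode, for example, a controller node with a collocated gateway, as described in the note above.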
    

Reboot MOSK cluster

Since MOSK 23.1, you can reboot several cluster nodes in one go by using the Graceful reboot mechanism provided by Mirantis Container Cloud. The mechanism restarts the selected nodes one by one, honoring the instance migration policies.
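
The following is a hedged illustration of the mechanism, assuming the GracefulRebootRequest custom resource described in the Container Cloud documentation; verify the exact resource name, fields, and constraints for your Container Cloud version. The object is created in the project namespace on the management cluster and lists the machines to be restarted:

apiVersion: kaas.mirantis.com/v1alpha1
kind: GracefulRebootRequest
metadata:
  name: child-cl            # assumed to match the name of the MOSK cluster
  namespace: child-ns       # project namespace of the MOSK cluster
spec:
  # Machines to restart one by one, honoring the instance migration policies.
  # Leaving the list empty typically requests a reboot of all cluster machines.
  machines:
  - cz7506-child-cl-storage-worker-noefi-rgxhk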

For older versions of MOSK, you need to reboot each node manually as follows:

  1. Enable maintenance mode for the MOSK cluster.

  2. For each node in the cluster:

    1. Enable maintenance mode for the node.

    2. If the manual instance migration policy is configured for the node, migrate the instances manually once the node is ready to reboot (see Perform manual actions before node reboot below).

    3. Reboot the node using cluster life-cycle management.

    4. Disable maintenance mode for the node.

  3. Disable maintenance mode for the MOSK cluster.

Perform manual actions before node reboot

When a node that has a manual instance migration policy is ready to be restarted, the life-cycle management mechanism notifies you about that by creating a NodeMaintenanceRequest object for the node and setting the active status attribute for the corresponding NodeWorkloadLock object.

Note

Verify the status.errorMessage attribute before proceeding.

To view the NodeWorkloadLock objects details for a specific node, run:

kubectl get nodeworkloadlocks <NODE-NAME> -o yaml

Example system response:

apiVersion: lcm.mirantis.com/v1alpha1
kind: NodeWorkloadLock
metadata:
  annotations:
    inner_state: active
  creationTimestamp: "2022-02-04T13:24:48Z"
  generation: 1
  name: openstack-kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5
  resourceVersion: "173934"
  uid: 0cb4428f-dd0d-401d-9d5e-e9e97e077422
spec:
  controllerName: openstack
  nodeName: kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5
status:
  errorMessage: 2022-02-04 14:43:52.674125 Some servers ['0ab4dd8f-ef0d-401d-9d5e-e9e97e077422'] are still present on host kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5.
  Waiting unless all of them are migrated manually or instance_migration_mode is set to 'skip'
  release: 8.5.0-rc+22.1
  state: active

Note

For MOSK compute nodes, you need to manually shut down all instances running on them, or perform cold or live migration of the instances.
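
The following is a sketch of the manual migration, assuming access to the OpenStack CLI with administrative credentials; the host and instance identifiers are taken from the example above, and the exact migration options depend on the version of your OpenStack client:

# List the instances that are still running on the compute node
openstack server list --all-projects --host kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5

# Live-migrate an instance off the node; alternatively, perform a cold
# migration or shut the instance down, depending on what the workload tolerates
openstack server migrate --live-migration 0ab4dd8f-ef0d-401d-9d5e-e9e97e077422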

After the update

Once your MOSK cluster update is complete, proceed with the following:

  1. Perform the post-update steps recommended in the update notes of the target release if any.

  2. Use the standard configuration mechanisms to re-enable any new product features that previously existed in your cloud as custom configuration.

  3. To ensure the cluster operability, execute a set of smoke tests as described in Run Tempest tests.

  4. Optional. Proceed with the upgrade of OpenStack.

  5. If necessary, expire alert silences in StackLight as described in Container Cloud Operations Guide: Silence alerts.

What to do if the update hangs or fails

If an update phase takes significantly longer than expected according to the tables included in Plan the cluster update, you should consider the update process hung.

If you observe errors that are not described explicitly in the documentation, immediately contact Mirantis support.

Troubleshoot issues

To see any issues that might have occurred during the update, verify the logs of the lcm-controller pods in the kaas namespace of the Container Cloud management cluster.

To troubleshoot the update that involves the operating system upgrade with host reboot, refer to Container Cloud documentation.

Roll back the changes

The Container Cloud and MOSK life-cycle management mechanism does not provide a way to perform a cluster-wide rollback of an update.