Update a MOSK cluster

This section describes the workflow that you, as a Cloud Operator, need to follow to correctly update your Mirantis OpenStack for Kubernetes (MOSK) cluster to version 22.1 or later.

Note

If you are updating your cluster to an older version, read Update a MOSK cluster to 21.6 or below.

The provided guidelines are generic and apply to any MOSK cluster regardless of its configuration specifics. However, every target release may have its own update peculiarities. Therefore, to accurately plan and successfully perform a specific update, in addition to following the procedure below, read the update-related section in the Release Notes of the target MOSK version.

Depending on the payload of a target release, the update mechanism can apply changes at different levels of the stack, from the configuration of the host operating system to the code of OpenStack itself. The update mechanism is designed to avoid the impact on the workloads and cloud users as much as possible. The life-cycle management logic minimizes the downtime of the cloud API by managing the cluster components under the hood and requests the Cloud Operator involvement only when a human decision is required to proceed.

Although the update mechanism may change the internal components of the cluster, it always preserves the major versions of OpenStack and Tungsten Fabric, that is, the cloud APIs that cloud users and workloads deal with. After the cluster is successfully updated, you can initiate a dedicated upgrade procedure to obtain the latest supported versions of OpenStack and Tungsten Fabric.

Before you begin

Before updating, we recommend that you carefully read the Release Compatibility Matrix document and the Release Notes of the target release, and thoroughly plan a maintenance window for each update phase depending on the configuration specifics of your cluster.

Read the release notes

Carefully read the Release Compatibility Matrix and the Release Notes of the target MOSK version, paying particular attention to the following:

  • Current Mirantis Container Cloud software version and the need to first update to the latest cluster release version

  • Update notes provided in the Release Notes for the target MOSK version

  • New product features that will get enabled in your cloud by default

  • New product features that may have already been configured in your cloud as customizations and now need to be properly re-enabled to be eligible for further support

  • Any changes in the behavior of the product features enabled in your cloud

  • List of the addressed and known issues in the target MOSK version

Warning

If your cloud is known to have any custom configuration that was not explicitly approved by Mirantis, bring this up with your dedicated Mirantis representative before proceeding with the update. Mirantis cannot guarantee the safe update of a customized cloud.

Plan the cluster update

Depending on the payload brought by a particular target release, a generic cluster update includes from three to five major phases.

The first three phases are mandatory for any update, have minimal impact on the cloud users and workloads, and get executed within a single maintenance window. The remaining phases are only present if major changes need to be performed to a cluster underlay and include a Kubernetes cluster and host operating system update. These phases may introduce a severe impact on the workloads and, therefore, need to be thoroughly planned across several sequential maintenance windows.

Important

To effectively plan an update, keep in mind the specific setup of your cluster. Depending on the cluster architecture, various groups of OpenStack, Tungsten Fabric, and Ceph components, that include controller nodes, storage nodes, compute nodes, and so on, can be running as pods across different worker nodes comprising the underlay Kubernetes cluster. Moreover, some architectures allow running pods on Kubernetes master nodes.

The tables below will help you to plan your cluster update and include the following information for each mandatory and additional update phase:

  • What happens during the phase

    Includes the phase milestones. This content is important for understanding the impact.

  • Impact

    Describes any possible impact on cloud users and workloads.

    The impact estimate represents the worst-case scenario in the architectures that imply a combination of several cluster roles on the same physical nodes, such as hyper-converged compute nodes and clusters with the compact control plane. Also, the impact estimate presumes that your cluster uses one of the standard architectures provided by the product and follows Mirantis design guidelines.

    Several update phases will occur simultaneously, resulting in a greater cumulative impact, but a shorter completion time.

  • Time to complete

    Provides a rough estimation of the time required to complete the phase.

    The estimates for a phase timeline presume that your cluster uses one of the standard architectures provided by the product and follows Mirantis design guidelines.

  • Can be paused

    Specifies whether you can postpone the execution of the most impactful phases on a per-node basis using the workload handling policy mechanism described in Step 2. Define workload handling policy for compute nodes.

    Warning

    MOSK clusters are not designed to run in a semi-updated state for a long time. Once started, an update needs to complete as soon as possible. Postponing update phases for longer than one day may introduce a risk of cloud workload corruption.

Warning

During the update, prevent users from performing write operations on the cloud resources. Any intensive manipulations may lead to workload corruption.

Step 1. Verify that the Container Cloud management cluster is up-to-date

MOSK relies on Mirantis Container Cloud to manage the underlying software stack for a cluster, as well as to deliver updates for all the components.

Since every MOSK release is tightly coupled with a Container Cloud release, a MOSK cluster update becomes possible once the management cluster is known to run the latest Container Cloud version. The management cluster periodically verifies public Mirantis repositories and updates itself automatically when a newer version becomes available. If any of the managed clusters, including MOSK, runs an outdated Container Cloud version, the management cluster cannot update itself automatically.

To identify the current version of the Container Cloud software your management cluster is running, refer to the Container Cloud web UI.

Step 2. Define workload handling policy for compute nodes

During an update, changes are applied to the MOSK cluster nodes one by one, starting with the Kubernetes master nodes and then proceeding with the Kubernetes worker nodes in no particular order. Depending on the target MOSK release, the severity of the changes and, therefore, the impact on the OpenStack workloads running on a compute node vary from zero to an hour of downtime while the node reboots, for example, when switching to a new host operating system.

To mitigate the potential impact on the cloud workloads, you can define the instance migration flow for the compute nodes running the most valuable instances.

The list of available options for the instance migration configuration includes:

  • The openstack.lcm.mirantis.com/instance_migration_mode annotation:

    • live

      Default. The OpenStack Controller live migrates instances automatically. Before applying any changes to the node, the update mechanism tries to move the memory and local storage of all instances on the node to another node without interrupting them. By default, the update mechanism makes three attempts to migrate each instance before falling back to the manual mode.

      Note

      Success of live migration depends on many factors including the selected vCPU type and model, the amount of data that needs to be transferred, the intensity of the disk IO and memory writes, the type of the local storage, and others. Instances using the following product features are known to have issues with live migration:

      • LVM-based ephemeral storage with and without encryption

      • Encrypted block storage volumes

      • CPU and NUMA node pinning

    • manual

      The OpenStack Controller waits for the Operator to migrate instances from the compute node. When it is time to update the compute node, the update mechanism asks you to manually migrate the instances and proceeds only once you confirm the node is safe to update.

    • skip

      The OpenStack Controller skips the instance check on the node and reboots it.

      Note

      For the clouds relying on the converged LVM with iSCSI block storage that offer persistent volumes in a remote edge sub-region, keep in mind that applying a major change to a compute node may impact not only the instances running on this node but also the instances attached to the LVM devices hosted there. We recommend that in such environments you perform the update procedure in the manual mode with mitigation measures taken by the Operator for each compute node. Otherwise, all the instances that have LVM with iSCSI volumes attached will need a reboot to restore the connectivity.

  • The openstack.lcm.mirantis.com/instance_migration_attempts annotation

    Defines the number of times the OpenStack Controller attempts to migrate a single instance before giving up. Defaults to 3.

Note

You can also use annotations to control the update of non-compute nodes if they represent critical points of a specific cloud architecture. For example, setting the instance_migration_mode to manual on a controller node with a collocated gateway (Open vSwitch) will allow the Operator to gracefully shut down all the virtual routers hosted on this node.

To configure the migration flow:

  1. Edit the target compute node resource. For example:

    kubectl edit node kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
    
  2. Set the migration mode and the number of times the OpenStack Controller should attempt to migrate a single instance. For example:

    apiVersion: v1
    kind: Node
    metadata:
     name: kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
     selfLink: /api/v1/nodes/kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
     uid: 54be5139-aba7-47e7-92bf-5575773a12a6
     resourceVersion: '299734609'
     creationTimestamp: '2021-03-24T16:03:11Z'
     labels:
       ...
       openstack-compute-node: enabled
       openvswitch: enabled
     annotations:
       openstack.lcm.mirantis.com/instance_migration_mode: "live"
       openstack.lcm.mirantis.com/instance_migration_attempts: "5"
       ...
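
As a non-interactive alternative to editing the node resource, the same two annotations can be set with kubectl annotate. The following is a minimal sketch rather than a product tool: the set_migration_policy helper name is hypothetical, and the sketch assumes that the current kubeconfig context points to the MOSK cluster.

```shell
# Sketch: set the instance migration annotations without an interactive editor.
# set_migration_policy is a hypothetical helper name, not a MOSK command.
set_migration_policy() {
  local node="$1"
  local mode="$2"
  local attempts="${3:-3}"   # 3 is the default documented in this guide

  # Accept only the three modes listed in this guide.
  case "$mode" in
    live|manual|skip) ;;
    *) echo "unsupported mode: $mode" >&2; return 1 ;;
  esac

  # The number of attempts must be a positive integer.
  case "$attempts" in
    ''|*[!0-9]*|0) echo "invalid attempts: $attempts" >&2; return 1 ;;
  esac

  # --overwrite replaces the annotations if they already exist.
  kubectl annotate node "$node" --overwrite \
    "openstack.lcm.mirantis.com/instance_migration_mode=$mode" \
    "openstack.lcm.mirantis.com/instance_migration_attempts=$attempts"
}
```

For example, set_migration_policy kaas-node-03ab613d-cf79-4830-ac70-ed735453481a manual 5 switches the node from the example above to the manual mode with five migration attempts.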
    

Step 3. Initiate MOSK cluster update

Silence alerts

During an update of a MOSK cluster, numerous alerts may be seen in StackLight. This is expected behavior. Therefore, ignore or temporarily mute the alerts as described in Container Cloud Operations Guide: Silence alerts.

Trigger the update

  1. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, find the managed MOSK cluster.

  4. Click the More action icon to see whether a new release is available. If that is the case, click Update cluster.

  5. In the Release Update window, select the required Cluster release to update your managed cluster to.

    The Description section contains the list of component versions to be installed with a new Cluster release.

  6. Click Update.

    Before the cluster update starts, Container Cloud performs a backup of MKE and Docker Swarm. The backup directory is located under:

    • /srv/backup/swarm on every Container Cloud node for Docker Swarm

    • /srv/backup/ucp on one of the controller nodes for MKE

    To view the update status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the update is complete.

Step 4. Watch the cluster update

Watch the update process through the web UI

To view the update status through the Container Cloud web UI, navigate to the Clusters page. Once the orange blinking dot next to the cluster name disappears, the cluster update is complete.

Also, you can see the general status of each node during the update on the Container Cloud cluster view page.

Follow the update process through logs

The whole update process is controlled by lcm-controller, which runs in the kaas namespace of the Container Cloud management cluster. Follow its logs to watch the progress of the update and to discover and debug any issues.
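
One possible way to read these logs is sketched below. The helper names are hypothetical. The sketch assumes that the current kubeconfig context points to the management cluster and that the controller pod names contain the lcm-controller string. Verify both in your environment.

```shell
# Sketch: print recent log lines from every lcm-controller pod.
# Assumption: pod names in the kaas namespace contain "lcm-controller".
lcm_controller_pods() {
  kubectl -n kaas get pods --no-headers \
      -o custom-columns=NAME:.metadata.name |
    grep lcm-controller
}

lcm_controller_logs() {
  local tail_lines="${1:-100}"   # number of most recent lines per pod
  lcm_controller_pods | while read -r pod; do
    echo "=== $pod ==="
    # Redirect stdin so that kubectl cannot consume the pod list.
    kubectl -n kaas logs "$pod" --tail="$tail_lines" < /dev/null
  done
}
```

To stream the logs of a single pod continuously, run kubectl -n kaas logs -f with the pod name returned by lcm_controller_pods.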

Watch the state of the cluster and nodes update through the CLI

The lcmclusterstate and lcmmachines objects in the mos namespace of the Container Cloud management cluster provide detailed information about the current phase of the update process in the context of the managed cluster overall as well as specific nodes.

The lcmmachine object being in the Ready state indicates that a node has been successfully updated.

To display the detailed view of the cluster update state, run:

kubectl -n child-ns get lcmclusterstates -o wide

Example system response:

NAME                                            CLUSTERNAME   TYPE              ARG                                          VALUE   ACTUALVALUE   ATTEMPT   MESSAGE
cd-cz7506-child-cl-storage-worker-noefi-rgxhk   child-cl      cordon-drain      cz7506-child-cl-storage-worker-noefi-rgxhk   true                  0         Error: following    NodeWorkloadLocks are still active - ceph: UpdatingController,openstack: InProgress
sd-cz7506-child-cl-storage-worker-noefi-rgxhk   child-cl      swarm-drain       cz7506-child-cl-storage-worker-noefi-rgxhk   true                  0         Error: waiting for kubernetes node kaas-node-5222a92f-5523-457c-8c69-b7aa0ffc235c to be drained first

To display the detailed view of the nodes update state, run:

kubectl -n child-ns get lcmmachines

Example system response:

NAME                                                 CLUSTERNAME   TYPE      STATE
cz5018-child-cl-storage-worker-noefi-dzttw           child-cl      worker    Prepare
cz5019-child-cl-storage-worker-noefi-vxcm9           child-cl      worker    Prepare
cz7500-child-cl-control-storage-worker-noefi-nkk9t   child-cl      control   Ready
cz7501-child-cl-control-storage-worker-noefi-7pcft   child-cl      control   Ready
cz7502-child-cl-control-storage-worker-noefi-c7k6f   child-cl      control   Ready
cz7503-child-cl-storage-worker-noefi-5lvd7           child-cl      worker    Prepare
cz7505-child-cl-storage-worker-noefi-jh4mc           child-cl      worker    Prepare
cz7506-child-cl-storage-worker-noefi-rgxhk           child-cl      worker    Prepare
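
To spot the nodes that are still being updated, the output above can be filtered by the STATE column. The following is a minimal sketch with a hypothetical helper name. It assumes the column order shown in the example response (NAME, CLUSTERNAME, TYPE, STATE) and uses child-ns as a placeholder namespace.

```shell
# Sketch: list LCMMachine objects whose STATE column is not "Ready".
# Assumption: the fourth column of the default table output is STATE.
lcmmachines_not_ready() {
  local ns="${1:-child-ns}"
  kubectl -n "$ns" get lcmmachines --no-headers |
    awk '$4 != "Ready" { print $1, $4 }'
}

# Possible usage, polling once a minute until every machine is Ready:
#   while [ -n "$(lcmmachines_not_ready child-ns)" ]; do sleep 60; done
```

An empty output means that all machines report the Ready state.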

Watch the node maintenance status

A NodeMaintenanceRequest object gets created during the update process to inform the higher-level life-cycle management mechanisms that a node is going to be updated. This enables OpenStack, Ceph, and Tungsten Fabric controllers to get prepared for that.

The NodeMaintenanceRequest object that exists for the node with the manual policy configured indicates that you should proceed with the manual handling of the workloads. See Step 5. Optional. Perform the manual workload migration for details.

To output the NodeMaintenanceRequest object details, run:

kubectl get nodemaintenancerequests

Example system response:

NAME                                             AGE
kaas-node-50a51d95-1e4b-487e-a973-199de400b97d   17m
kaas-node-e41a610a-ceaf-4d80-90ee-4ea7b4dee161   85s
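
To find out which of these requests wait for manual workload handling, the request list can be matched against the node annotations. The sketch below uses a hypothetical helper name and assumes that each NodeMaintenanceRequest object is named after its node, as in the example response above.

```shell
# Sketch: print nodes that have a NodeMaintenanceRequest and the manual policy.
# Assumption: the request name equals the node name.
nodes_awaiting_manual_migration() {
  # Dots inside the annotation key must be escaped in a JSONPath expression.
  local key='openstack\.lcm\.mirantis\.com/instance_migration_mode'
  kubectl get nodemaintenancerequests --no-headers \
      -o custom-columns=NAME:.metadata.name |
    while read -r node; do
      mode="$(kubectl get node "$node" \
        -o jsonpath="{.metadata.annotations.$key}" < /dev/null)"
      if [ "$mode" = "manual" ]; then
        echo "$node"
      fi
    done
}
```

Nodes without the annotation are not printed because their mode falls back to the default, live.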

Step 5. Optional. Perform the manual workload migration

When it is time to apply changes to a node that has a manual workload handling policy configured, the life-cycle management mechanism notifies you about that by creating a NodeMaintenanceRequest object for the node and setting the active status attribute for the corresponding NodeWorkloadLock object.

Note

Verify the status:errorMessage attribute before proceeding.

To view the details of the NodeWorkloadLock object for a specific node, run:

kubectl get nodeworkloadlocks <NODE-NAME> -o yaml

Example system response:

apiVersion: lcm.mirantis.com/v1alpha1
kind: NodeWorkloadLock
metadata:
  annotations:
    inner_state: active
  creationTimestamp: "2022-02-04T13:24:48Z"
  generation: 1
  name: openstack-kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5
  resourceVersion: "173934"
  uid: 0cb4428f-dd0d-401d-9d5e-e9e97e077422
spec:
  controllerName: openstack
  nodeName: kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5
status:
  errorMessage: 2022-02-04 14:43:52.674125 Some servers ['0ab4dd8f-ef0d-401d-9d5e-e9e97e077422'] are still present on host kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5.
    Waiting unless all of them are migrated manually or instance_migration_mode is set to 'skip'
  release: 8.5.0-rc+22.1
  state: active
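
When only the state and the error message are of interest, they can be extracted directly instead of reading the full YAML. The following is a minimal sketch with hypothetical helper names. The field paths are taken from the example response above. Note that the object name in that response consists of the controller name followed by the node name.

```shell
# Sketch: list NodeWorkloadLock objects that are in the active state.
active_nodeworkloadlocks() {
  kubectl get nodeworkloadlocks --no-headers \
      -o custom-columns=NAME:.metadata.name,STATE:.status.state |
    awk '$2 == "active" { print $1 }'
}

# Print the state and the error message of a single lock, tab-separated.
nodeworkloadlock_status() {
  kubectl get nodeworkloadlocks "$1" \
    -o jsonpath='{.status.state}{"\t"}{.status.errorMessage}{"\n"}'
}
```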

Note

For a MOSK compute node, you need to manually shut down all instances running on the node, or perform cold or live migration of the instances.

Once you complete the workload migration, the update mechanism can safely proceed with the node update. Indicate the migration completion to the LCM by setting the node workload handling policy to skip. See Step 2. Define workload handling policy for compute nodes for annotation editing details.
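
The confirmation can be combined with a safety check, as in the sketch below. The helper names are hypothetical. The sketch assumes that the openstack client is installed with administrative credentials loaded, that kubectl points to the MOSK cluster, and that the compute host name matches the Kubernetes node name, as in the error message above. It only looks for instances in the ACTIVE status, so instances that you have shut down do not block the confirmation. Instances in other statuses are not checked.

```shell
# Sketch: list IDs of ACTIVE instances that are still hosted on a compute node.
active_instances_on_host() {
  openstack server list --all-projects --host "$1" --status ACTIVE \
    -f value -c ID
}

# Set the skip policy only when no ACTIVE instances are left on the node.
mark_node_migrated() {
  local node="$1"
  local left
  left="$(active_instances_on_host "$node")"
  if [ -n "$left" ]; then
    echo "ACTIVE instances are still present on $node:" >&2
    echo "$left" >&2
    return 1
  fi
  kubectl annotate node "$node" --overwrite \
    openstack.lcm.mirantis.com/instance_migration_mode=skip
}
```

The function returns a non-zero status and leaves the annotation untouched while running instances remain on the node.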

After the update

Once your MOSK cluster update is complete, proceed with the following:

  1. If your MOSK managed cluster uses Tungsten Fabric as a back end for networking, manually restart the vRouter pods:

    1. Upgrade the vRouter pods by removing them one by one manually.

      Note

      Manual removal is required because vRouter pods use the OnDelete update strategy. A vRouter pod restart causes networking downtime for the workloads on the affected node. If such downtime is not acceptable for some workloads, migrate them before restarting the vRouter pods.

      kubectl -n tf delete pod <VROUTER-POD-NAME>
      
    2. Verify that all tf-vrouter-* pods are upgraded:

      kubectl -n tf get ds | grep tf-vrouter
      

      The UP-TO-DATE and CURRENT fields must have the same values.

  2. Perform the post-update steps recommended in the update notes of the target release if any.

  3. Use the standard configuration mechanisms to re-enable the new product features that previously existed in your cloud as custom configurations.

  4. To ensure the cluster operability, execute a set of smoke tests as described in Run Tempest tests.

  5. Optional. Proceed with the upgrade of OpenStack and Tungsten Fabric.

  6. If necessary, expire alert silences in StackLight as described in Container Cloud Operations Guide: Silence alerts.
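
The vRouter verification from the first step above can be scripted so that only the outdated DaemonSets are reported. The following is a minimal sketch with a hypothetical helper name. It assumes the default column order of the kubectl get ds table output: NAME, DESIRED, CURRENT, READY, UP-TO-DATE, AVAILABLE.

```shell
# Sketch: print tf-vrouter DaemonSets whose CURRENT and UP-TO-DATE counts differ.
# Assumption: CURRENT is the third column and UP-TO-DATE is the fifth one.
vrouter_ds_outdated() {
  kubectl -n tf get ds --no-headers |
    awk '$1 ~ /^tf-vrouter/ && $3 != $5 {
      print $1, "current=" $3, "up-to-date=" $5
    }'
}
```

An empty output means that the UP-TO-DATE and CURRENT values match for every tf-vrouter DaemonSet.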

What to do if the update hangs or fails

If an update phase takes significantly longer than expected according to the tables included in Plan the cluster update, you should consider the update process hung.

If you observe errors that are not described explicitly in the documentation, immediately contact Mirantis support.

Troubleshoot issues

To see any issues that might have occurred during the update, verify the logs of the lcm-controller pods in the kaas namespace of the Container Cloud management cluster.

To troubleshoot the update that involves the operating system upgrade with host reboot, refer to Container Cloud documentation.

Roll back the changes

The Container Cloud and MOSK life-cycle management mechanism does not provide a way to perform a cluster-wide rollback of an update. Therefore, if the update process gets blocked on a specific node, remove the node from the cluster and try to complete the update.