Update to a major version¶
This section describes the workflow that you, as a cloud operator, need to follow to correctly update your Mirantis OpenStack for Kubernetes (MOSK) cluster to a major release version.
Note
This guide applies to clusters running MOSK version 23.1 and above. If you have an older version and are looking to update, contact Mirantis support to get instructions valid for your cluster.
The instructions below are generic and apply to any MOSK cluster regardless of its configuration specifics. However, every major release may have its own update peculiarities. Therefore, to accurately plan and successfully perform an update, in addition to this document, read the update-related section in the Release Notes of the target MOSK version.
Depending on the payload of a target release, the update mechanism can perform changes on different levels of the stack, from the configuration of the host operating system to the code of OpenStack itself. The update mechanism is designed to avoid impact on the workloads and cloud users as much as possible. The life-cycle management logic minimizes the downtime for the cloud API by means of smart management of the cluster components under the hood and only requests your involvement when a human decision is required to proceed.
Though the update mechanism may change the internal components of the cluster, it will always preserve the major versions of OpenStack, that is, the APIs that cloud users and workloads deal with. After the cluster is successfully updated, you can initiate a separate upgrade procedure to obtain the latest supported OpenStack version.
Before you begin¶
Before starting an update, we recommend that you closely read the Release Compatibility Matrix document and the Release Notes of the target release, as well as thoroughly plan maintenance windows for each update phase depending on the configuration of your cluster.
Read the release notes¶
Carefully read the Release Compatibility Matrix and Release Notes of the target MOSK version, paying particular attention to the following:
Current Mirantis Container Cloud software version and the need to first update to the latest cluster release version
Update notes provided in the Release notes for the target MOSK version
New product features that will get enabled in your cloud by default
New product features that may have already been configured in your cloud as customizations and now need to be properly re-enabled to be eligible for further support
Any changes in the behavior of the product features enabled in your cloud
List of the addressed and known issues in the target MOSK version
Warning
If your cloud is known to have any custom configuration that was not explicitly approved by Mirantis, make sure to bring this up with your dedicated Mirantis representative before proceeding with the update. Mirantis cannot guarantee the safe updating of a customized cloud.
Plan the cluster update¶
Depending on the payload brought by a particular target release, a generic cluster update includes from three to six major phases.
The first three phases are present in any update. They focus on the containerized components of the software stack and have minimal impact on the cloud users and workloads.
The remaining phases are only present if any changes need to be made to the foundation layers: the underlay Kubernetes cluster and host operating system. For the changes to take effect, you may need to reboot the cluster nodes. This procedure imposes a severe impact on cloud workloads and, therefore, needs to be thoroughly planned across several sequential maintenance windows.
Important
To effectively plan a cluster update, keep in mind the architecture of your specific cloud. Depending on the selected design, the components of a MOSK cluster may have different distribution across the nodes (physical servers) comprising the underlay bare metal Kubernetes cluster. The more components are collocated on a single node, the greater the impact on the functions of the cloud when the changes are applied.
The tables below will help you to plan your cluster update and include the following information for each mandatory and additional update phase:
- What happens during the phase
Includes the phase milestones. Understanding the nature of the changes to be applied is important for estimating the exact impact the update is going to have on your cluster.
Consult the Update notes section of the target MOSK release for the detailed information about the changes it brings and the impact these changes are going to imply when getting applied to your cluster.
- Impact
Describes possible impact on cloud users and workloads.
The provided information about the impact represents the worst-case scenario in the cluster architectures that imply a combination of several roles on the same physical servers, such as hyper-converged compute nodes and clusters with a compact control plane.
The impact estimation presumes that your cluster uses one of the standard architectures provided by the product and follows Mirantis design guidelines.
- Time to complete
Provides a rough estimation of the time required to complete the phase.
The estimates for a phase timeline presume that your cluster uses one of the standard architectures provided by the product and follows Mirantis design guidelines.
Warning
During the update, try to prevent users from performing write operations on the cloud resources. Any intensive manipulations may lead to workload corruption.
Phase 1: Life-cycle management modules update
Important
This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.
What happens during the phase
    New versions of the OpenStack, Tungsten Fabric, and Ceph controllers are downloaded and installed. OpenStack and Tungsten Fabric images are precached.
Impact
    None
Time to complete
    Depending on the quality of the Internet connectivity, up to 45 minutes.
Phase 2: OpenStack and Tungsten Fabric components update
Important
This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.
What happens during the phase
    New versions of OpenStack and Tungsten Fabric container images are downloaded, and the services are restarted sequentially.
Impact

Time to complete

Phase 3: Ceph cluster update and upgrade
Important
This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.
What happens during the phase
    New versions of Ceph components are downloaded and the services are restarted. If applicable, Ceph is switched to the latest major version.
Impact
    Workloads may experience IO performance degradation for the virtual storage devices backed by Ceph.
Time to complete
    The update of a Ceph cluster with 30 storage nodes can take up to 35 minutes. Additionally, 15 minutes are required for the major Ceph version upgrade, if any.
Phase 4a: Host operating system update on Kubernetes master nodes
Important
This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.
What happens during the phase
    New system packages are downloaded and installed on the host operating system, and other major changes are applied.
Impact
    None
Time to complete
    The nodes are updated sequentially. Up to 15 minutes per node.
Phase 4b: Kubernetes components update on Kubernetes master nodes
Important
This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.
What happens during the phase
    New versions of Kubernetes control plane components are downloaded and installed.
Impact
    For clusters with the compact control plane, some of the running cloud operations may fail over the course of the phase due to minor unavailability of the cloud API. For the compact control plane with gateway nodes collocated (Open vSwitch networking backend), workloads can experience temporary loss of the North-South connectivity. The downtime depends on the type of virtual routers in use.
Time to complete
    Up to 40 minutes total.
Phases 5a and 5b: Host operating system and Kubernetes cluster update on Kubernetes worker nodes
Important
Both phases, 5a and 5b, are applied together, either node by node (default) or to several nodes in parallel. Parallel updating is available since MOSK 23.1.
Take this into consideration when estimating the impact and planning the maintenance window.
What happens during the phase
    During the host operating system update:
    During the Kubernetes cluster update:
Impact

Time to complete
    By default, the nodes are updated sequentially.
    For MOSK 23.1 to 23.2 and newer releases, you can reduce update time by enabling parallel node update. The procedure is described further in the Enable parallel update of Kubernetes worker nodes subsection.
Phase 6: Cluster nodes reboot
Important
This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.
Important
An update to a newer MOSK version may require a reboot of the cluster nodes for the changes to take effect. Although you can decide when to restart each particular node, an update cannot be considered complete until all of the nodes get restarted.
To determine whether the reboot is required, consult the Step 4. Reboot the nodes with optional instance migration section.
What happens during the phase

Impact

Time to complete

Step 1. Verify that the Container Cloud management cluster is up-to-date¶
MOSK relies on Mirantis Container Cloud to manage the underlying software stack for a cluster, as well as to deliver updates for all the components.
Since every MOSK release is tightly coupled with a Container Cloud release, a MOSK cluster update becomes possible once the management cluster runs the latest Container Cloud version. The management cluster periodically verifies public Mirantis repositories and updates itself automatically when a newer version becomes available. Having any of the managed clusters, including MOSK, running an outdated Container Cloud version prevents the management cluster from automatically updating itself.
To identify the current version of the Container Cloud software your management cluster is running, refer to the Container Cloud web UI. You can also verify your management cluster status using CLI as described in Verify the management cluster status before MOSK update.
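If you prefer the CLI, you can also query the Cluster objects on the management cluster directly. The following is a minimal sketch; the field path spec.providerSpec.value.release is an assumption based on the Container Cloud API and may differ between versions:

# Current cluster release of the management cluster itself
# (run against the management cluster kubeconfig)
kubectl -n default get cluster <mgmt-cluster-name> \
  -o jsonpath='{.spec.providerSpec.value.release}'

# Cluster releases of all managed clusters in a project
kubectl -n <project-namespace> get clusters \
  -o custom-columns=NAME:.metadata.name,RELEASE:.spec.providerSpec.value.release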
Step 2. Initiate MOSK cluster update¶
Silence alerts¶
During an update of a MOSK cluster, numerous alerts may be seen in StackLight. This is expected behavior. Therefore, ignore or temporarily mute the alerts as described in Silence alerts.
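For example, if you manage silences with the Alertmanager amtool CLI, a blanket silence for the maintenance window may look as follows. This is a sketch; the Alertmanager URL and the cluster matcher label are assumptions that you need to adjust to your StackLight deployment:

# Create a 4-hour silence for all alerts matching the cluster label
amtool silence add cluster=<cluster-name> \
  --alertmanager.url=http://<alertmanager-host>:9093 \
  --duration=4h \
  --comment="MOSK cluster update maintenance window"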
Caution

During the update, the false positive CalicoDataplaneFailuresHigh alert may fire. Disregard this alert; it will disappear once the update succeeds.

The observed behavior is typical for calico-node during upgrades, as workload changes occur frequently. Consequently, there is a possibility of temporary desynchronization in the Calico dataplane, which can occasionally result in throttling when applying workload changes to the Calico dataplane.
Verify Ceph configuration¶
If you update MOSK to 23.1, verify that the KaaSCephCluster custom resource does not contain the following entries. If they exist, remove them (see the sketch after the list below).
- In the spec.cephClusterSpec section, the external section.
- In the spec.cephClusterSpec.rookConfig section, the ms_crc_data or ms crc data configuration key. After you remove the key, wait for the rook-ceph-mon pods to restart on the MOSK cluster.
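A minimal sketch of inspecting and editing the resource from the management cluster, assuming <project-namespace> is the project namespace of your managed cluster:

# Check whether the entries listed above are present
kubectl -n <project-namespace> get kaascephcluster -o yaml | \
  grep -E 'external:|ms_crc_data|ms crc data'

# Remove the entries manually if found
kubectl -n <project-namespace> edit kaascephcluster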
Enable parallel update of Kubernetes worker nodes¶
Optional. Starting from the MOSK 23.1 to 23.2 update, you can enable and configure parallel node update to reduce update time and minimize downtime:

- To enable parallel update of Kubernetes worker nodes, set the spec.providerSpec.value.maxWorkerUpgradeCount configuration parameter in the Mirantis Container Cloud management cluster as described in conf-upd-count. See also the sketch after this list.
- Consider the specifics of handling of parallel node updates by the OpenStack, Ceph, and Tungsten Fabric Controllers to properly plan the maintenance window. For handling details and possible configuration, refer to Parallelizing node update operations.
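For illustration, the parameter can be set with a merge patch against the managed cluster object on the management cluster. A sketch with hypothetical placeholder names:

# Allow up to 3 worker nodes to be updated in parallel
kubectl -n <project-namespace> patch cluster <cluster-name> --type=merge \
  -p '{"spec":{"providerSpec":{"value":{"maxWorkerUpgradeCount":3}}}}'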
Enable automatic node reboot in update groups¶
TechPreview
Optional. Starting from MOSK 24.3, you can enable automatic node reboot of an update group, which contains a set of controller or worker machines. This option applies when a Cluster release update requires node reboot, for example, when a kernel version update is available in the target Cluster release. The option reduces manual intervention and overall downtime during cluster update.
To enable automatic node reboot in an update group, set spec.rebootIfUpdateRequires in the required UpdateGroup object. For details, see Container Cloud documentation: API Reference - UpdateGroup resource and Create update groups for worker machines.
Caution

During a distribution upgrade, machines are always rebooted, overriding rebootIfUpdateRequires: false.
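A minimal sketch of such an object; the apiVersion and the index and concurrentUpdates fields are assumptions about the Container Cloud API and may differ in your version, so verify them against the API Reference:

apiVersion: kaas.mirantis.com/v1alpha1
kind: UpdateGroup
metadata:
  name: worker-update-group
  namespace: <project-namespace>
spec:
  index: 10                     # assumed field: order of the group during update
  concurrentUpdates: 1          # assumed field: machines updated in parallel
  rebootIfUpdateRequires: true  # reboot nodes automatically when the update requires it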
Trigger the update¶
1. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.
2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.
3. In the Clusters tab, find the managed MOSK cluster.
4. Click the More action icon to see whether a new release is available. If that is the case, click Update cluster.
5. In the Release Update window, select the required Cluster release to update your managed cluster to.
   The Description section contains the list of component versions to be installed with a new Cluster release.
6. Click Update.
Before the cluster update starts, Container Cloud performs a backup of MKE and Docker Swarm. The backup directory is located under:

- /srv/backup/swarm on every Container Cloud node for Docker Swarm
- /srv/backup/ucp on one of the controller nodes for MKE
Step 3. Watch the cluster update¶
Watch the update process through the web UI¶
To view the update status through the Container Cloud web UI, navigate to the Clusters page. Once the orange blinking dot next to the cluster name disappears, the cluster update is complete.
Also, you can see the general status of each node during the update on the Container Cloud cluster view page.
Follow the update process through logs¶
The whole update process is controlled by lcm-controller, which runs in the kaas namespace of the Container Cloud management cluster. Follow its logs to watch the progress of the update and to discover and debug any issues.
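For example, assuming lcm-controller is deployed as a Deployment of the same name, you can stream its logs as follows (a sketch; verify the exact resource name in your environment):

# Follow the lcm-controller logs on the management cluster
kubectl -n kaas logs deployment/lcm-controller -f --tail=100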
Watch the state of the cluster and nodes update through the CLI¶
The lcmclusterstate and lcmmachines objects in the mos namespace of the Container Cloud management cluster provide detailed information about the current phase of the update process, both for the managed cluster overall and for specific nodes.

An lcmmachine object being in the Ready state indicates that a node has been successfully updated.
To display the detailed view of the cluster update state, run:
kubectl -n child-ns get lcmclusterstates -o wide
Example system response:
NAME CLUSTERNAME TYPE ARG VALUE ACTUALVALUE ATTEMPT MESSAGE
cd-cz7506-child-cl-storage-worker-noefi-rgxhk child-cl cordon-drain cz7506-child-cl-storage-worker-noefi-rgxhk true 0 Error: following NodeWorkloadLocks are still active - ceph: UpdatingController,openstack: InProgress
sd-cz7506-child-cl-storage-worker-noefi-rgxhk child-cl swarm-drain cz7506-child-cl-storage-worker-noefi-rgxhk true 0 Error: waiting for kubernetes node kaas-node-5222a92f-5523-457c-8c69-b7aa0ffc235c to be drained first
To display the detailed view of the nodes update state, run:
kubectl -n child-ns get lcmmachines
Example system response:
NAME CLUSTERNAME TYPE STATE
cz5018-child-cl-storage-worker-noefi-dzttw child-cl worker Prepare
cz5019-child-cl-storage-worker-noefi-vxcm9 child-cl worker Prepare
cz7500-child-cl-control-storage-worker-noefi-nkk9t child-cl control Ready
cz7501-child-cl-control-storage-worker-noefi-7pcft child-cl control Ready
cz7502-child-cl-control-storage-worker-noefi-c7k6f child-cl control Ready
cz7503-child-cl-storage-worker-noefi-5lvd7 child-cl worker Prepare
cz7505-child-cl-storage-worker-noefi-jh4mc child-cl worker Prepare
cz7506-child-cl-storage-worker-noefi-rgxhk child-cl worker Prepare
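To follow the state transitions continuously instead of polling, you can add the standard watch flag to the same command:

kubectl -n child-ns get lcmmachines -w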
Step 4. Reboot the nodes with optional instance migration¶
Depending on the target release content, you may need to reboot the cluster nodes for the changes to take effect. Running a MOSK cluster in a semi-updated state for an extended period may result in unpredictable behavior of the cloud and impact users and workloads. Therefore, when it is required, you need to reboot the cluster nodes as soon as possible to avoid potential risks.
Note

If you enabled rebootIfUpdateRequires as described in Enable automatic node reboot in update groups, nodes are automatically rebooted in update groups during a Cluster release update that requires a reboot, for example, when a kernel version update is available in the target Cluster release. For a distribution upgrade, continue reading the following subsections.
Determine if the node needs to be rebooted¶
Verify the YAML definitions of the LCMMachine and Machine objects. The node must be rebooted if the rebootRequired flag is set to true. In addition, the objects explicitly specify the reason for rebooting. For example:
- The LCMMachine object of a node that requires rebooting:

  ...
  status:
    hostInfo:
      rebootRequired: true
      rebootReason: "linux-image-5.13.0-51-generic"

- The Machine object of a node that does not require rebooting:

  ...
  status:
    ...
    providerStatus:
      ...
      reboot:
        reason: ""
        required: false
      status: Ready
Since MOSK 23.1, you can also use the Mirantis Container Cloud web UI to identify the nodes requiring reboot:
In the Clusters tab, click the required cluster name. The page with Machines opens.
Hover over the status of every machine. A machine that requires a reboot contains the Reboot > The machine requires a reboot notification in the Status tooltip.
Configure instance migration policy for cluster nodes¶
Restarting the cluster causes downtime of the cloud services running on the nodes. While the MOSK control plane is built for high availability and can tolerate the temporary loss of up to one third of its services without a significant impact on user experience, rebooting nodes that host the elements of the cloud data plane, such as network gateway nodes and compute nodes, has a detrimental effect on the cloud workloads if not performed gracefully.
To mitigate the potential impact on the cloud workloads, you can define the instance migration flow for the compute nodes running the most valuable instances.
The list of available options for the instance migration configuration includes:

- The openstack.lcm.mirantis.com/instance_migration_mode annotation:

  live
    Default. The OpenStack Controller live migrates instances automatically. The update mechanism tries to move the memory and local storage of all instances on the node to another node without interrupting them before applying any changes to the node. By default, the update mechanism makes three attempts to migrate each instance before falling back to the manual mode.

    Note

    Success of live migration depends on many factors, including the selected vCPU type and model, the amount of data that needs to be transferred, the intensity of the disk IO and memory writes, the type of the local storage, and others. Instances using the following product features are known to have issues with live migration:

    - LVM-based ephemeral storage with and without encryption
    - Encrypted block storage volumes
    - CPU and NUMA node pinning

  manual
    The OpenStack Controller waits for the Operator to migrate instances from the compute node. When it is time to update the compute node, the update mechanism asks you to manually migrate the instances and proceeds only once you confirm the node is safe to update.

  skip
    The OpenStack Controller skips the instance check on the node and reboots it.

  Note

  For the clouds relying on the converged LVM with iSCSI block storage that offer persistent volumes in a remote edge sub-region, it is important to keep in mind that applying a major change to a compute node may impact not only the instances running on this node but also the instances attached to the LVM devices hosted there. We recommend that in such environments you perform the update procedure in the manual mode with mitigation measures taken by the Operator for each compute node. Otherwise, all the instances that have LVM with iSCSI volumes attached would need a reboot to restore the connectivity.

- The openstack.lcm.mirantis.com/instance_migration_attempts annotation

  Defines the number of times the OpenStack Controller attempts to migrate a single instance before giving up. Defaults to 3.
Note
You can also use annotations to control the update of
non-compute nodes if they represent critical points of a specific
cloud architecture. For example, setting the instance_migration_mode
to manual
on a controller node with a collocated gateway (Open vSwitch)
will allow the Operator to gracefully shut down all the virtual routers
hosted on this node.
To configure the instance migration policy:
1. Edit the target compute node resource. For example:

   kubectl edit node kaas-node-03ab613d-cf79-4830-ac70-ed735453481a

2. Set the migration mode and the number of attempts the OpenStack Controller should make to migrate a single instance. For example:

   apiVersion: v1
   kind: Node
   metadata:
     name: kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
     selfLink: /api/v1/nodes/kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
     uid: 54be5139-aba7-47e7-92bf-5575773a12a6
     resourceVersion: '299734609'
     creationTimestamp: '2021-03-24T16:03:11Z'
     labels:
       ...
       openstack-compute-node: enabled
       openvswitch: enabled
     annotations:
       openstack.lcm.mirantis.com/instance_migration_mode: "live"
       openstack.lcm.mirantis.com/instance_migration_attempts: "5"
   ...
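Alternatively, the same result can be achieved non-interactively with kubectl annotate, which may be easier to script. A sketch using the annotations documented above and a hypothetical node name:

# Set the migration policy without opening an editor
kubectl annotate node <node-name> --overwrite \
  openstack.lcm.mirantis.com/instance_migration_mode=manual \
  openstack.lcm.mirantis.com/instance_migration_attempts=5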
Reboot MOSK cluster¶
Since MOSK 23.1, you can reboot several cluster nodes in one go by using the Graceful reboot mechanism provided by Mirantis Container Cloud. The mechanism restarts the selected nodes one by one, honoring the instance migration policies.
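From the CLI, the graceful reboot is requested through a dedicated object on the management cluster. The following is a sketch; the GracefulRebootRequest schema is an assumption based on the Container Cloud API, so verify it against the documentation of your version:

apiVersion: kaas.mirantis.com/v1alpha1
kind: GracefulRebootRequest
metadata:
  name: <cluster-name>          # assumed to match the managed cluster name
  namespace: <project-namespace>
spec:
  machines: []                  # assumed: empty list reboots all machines; otherwise, list machine names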
For older versions of MOSK, you need to reboot each node manually as follows:
For each node in the cluster:

- If the manual instance migration policy is configured for the node, perform the manual migration once the node is ready to reboot (see Perform manual actions before node reboot below).
Perform manual actions before node reboot¶
When a node that has a manual instance migration policy is ready to be restarted, the life-cycle management mechanism notifies you about that by creating a NodeMaintenanceRequest object for the node and setting the active status attribute for the corresponding NodeWorkloadLock object.
Note

Verify the status:errorMessage attribute before proceeding.
To view the NodeWorkloadLock object details for a specific node, run:
kubectl get nodeworkloadlocks <NODE-NAME> -o yaml
Example system response:
apiVersion: lcm.mirantis.com/v1alpha1
kind: NodeWorkloadLock
metadata:
annotations:
inner_state: active
creationTimestamp: "2022-02-04T13:24:48Z"
generation: 1
name: openstack-kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5
resourceVersion: "173934"
uid: 0cb4428f-dd0d-401d-9d5e-e9e97e077422
spec:
controllerName: openstack
nodeName: kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5
status:
errorMessage: 2022-02-04 14:43:52.674125 Some servers ['0ab4dd8f-ef0d-401d-9d5e-e9e97e077422'] are still present on host kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5.
Waiting unless all of them are migrated manually or instance_migration_mode is set to 'skip'
release: 8.5.0-rc+22.1
state: active
Note

For MOSK compute nodes, you need to manually shut down all instances running on the node, or perform cold or live migration of the instances.
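For reference, a manual live migration can be performed with the OpenStack CLI as sketched below; check the instance properties and target hypervisor capacity first, and note that the exact flags may vary between OpenStack client versions:

# Live migrate a single instance; the scheduler selects the target host
openstack server migrate --live-migration <instance-id>

# Verify that no instances remain on the compute node
openstack server list --all-projects --host <compute-host>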
After the update¶
Once your MOSK cluster update is complete, proceed with the following:
Perform the post-update steps recommended in the update notes of the target release if any.
Use the standard configuration mechanisms to re-enable the new product features that could previously exist in your cloud as a custom configuration.
To ensure the cluster operability, execute a set of smoke tests as described in Run Tempest tests.
Optional. Proceed with the upgrade of OpenStack.
If necessary, expire alert silences in StackLight as described in Silence alerts.
What to do if the update hangs or fails¶
If an update phase takes significantly longer than expected according to the tables included in Plan the cluster update, you should consider the update process hung.
If you observe errors that are not described explicitly in the documentation, immediately contact Mirantis support.
Troubleshoot issues¶
To see any issues that might have occurred during the update, verify the logs of the lcm-controller pods in the kaas namespace of the Container Cloud management cluster.
To troubleshoot the update that involves the operating system upgrade with host reboot, refer to Troubleshoot an operating system upgrade with host restart.
Roll back the changes¶
The Container Cloud and MOSK life-cycle management mechanism does not provide a way to perform a cluster-wide rollback of an update.