Update to a major version¶
This section describes the workflow that you, as a cloud operator, need to follow to correctly update your Mirantis OpenStack for Kubernetes (MOSK) cluster to a major release version.
Note
This guide applies to clusters running MOSK version 23.1 and above. If you have an older version and are looking to update, contact Mirantis support to get instructions valid for your cluster.
The instructions below are generic and apply to any MOSK cluster regardless of its configuration specifics. However, every major release may have its own update peculiarities. Therefore, to accurately plan and successfully perform an update, in addition to this document, read the update-related section in the Release Notes of the target MOSK version.
Depending on the payload of a target release, the update mechanism can perform changes on different levels of the stack, from the configuration of the host operating system to the code of OpenStack itself. The update mechanism is designed to avoid impact on the workloads and cloud users as much as possible. The life-cycle management logic minimizes the downtime for the cloud API by means of smart management of the cluster components under the hood and only requests your involvement when a human decision is required to proceed.
Though the update mechanism may change the internal components of the cluster, it will always preserve the major versions of OpenStack, that is, the APIs that cloud users and workloads deal with. After the cluster is successfully updated, you can initiate a separate upgrade procedure to obtain the latest supported OpenStack version.
Before you begin¶
Before starting an update, we recommend that you closely read the Release Compatibility Matrix document and the Release Notes of the target release, as well as thoroughly plan maintenance windows for each update phase depending on the configuration of your cluster.
Read the release notes¶
Carefully read the Release Compatibility Matrix and Release Notes of the target MOSK version, paying particular attention to the following:
Current Mirantis Container Cloud software version and the need to first update to the latest cluster release version
Update notes provided in the Release notes for the target MOSK version
New product features that will get enabled in your cloud by default
New product features that may have already been configured in your cloud as customizations and now need to be properly re-enabled to be eligible for further support
Any changes in the behavior of the product features enabled in your cloud
List of the addressed and known issues in the target MOSK version
Warning
If your cloud is known to have any custom configuration that was not explicitly approved by Mirantis, make sure to bring this up with your dedicated Mirantis representative before proceeding with the update. Mirantis cannot guarantee the safe updating of a customized cloud.
Plan the cluster update¶
Depending on the payload brought by a particular target release, a generic cluster update includes from three to six major phases.
The first three phases are present in any update. They focus on the containerized components of the software stack and have minimal impact on the cloud users and workloads.
The remaining phases are only present if any changes need to be made to the foundation layers: the underlay Kubernetes cluster and host operating system. For the changes to take effect, you may need to reboot the cluster nodes. This procedure imposes a severe impact on cloud workloads and, therefore, needs to be thoroughly planned across several sequential maintenance windows.
Important
To effectively plan a cluster update, keep in mind the architecture of your specific cloud. Depending on the selected design, the components of a MOSK cluster may have different distribution across the nodes (physical servers) comprising the underlay bare metal Kubernetes cluster. The more components are collocated on a single node, the greater the impact on the functions of the cloud when the changes are applied.
The tables below will help you to plan your cluster update and include the following information for each mandatory and additional update phase:
- What happens during the phase
Includes the phase milestones. Understanding the nature of the changes to be applied is important for estimating the exact impact the update is going to have on your cluster.
Consult the Update notes section of the target MOSK release for the detailed information about the changes it brings and the impact these changes are going to imply when getting applied to your cluster.
- Impact
Describes possible impact on cloud users and workloads.
The provided information about the impact represents the worst-case scenario in the cluster architectures that imply a combination of several roles on the same physical servers, such as hyper-converged compute nodes and clusters with a compact control plane.
The impact estimation presumes that your cluster uses one of the standard architectures provided by the product and follows Mirantis design guidelines.
- Time to complete
Provides a rough estimation of the time required to complete the phase.
The estimates for a phase timeline presume that your cluster uses one of the standard architectures provided by the product and follows Mirantis design guidelines.
Warning
During the update, try to prevent users from performing write operations on the cloud resources. Any intensive manipulations may lead to workload corruption.
Phase 1: Life-cycle management modules update
Important
This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.
What happens during the phase
    New versions of the OpenStack, Tungsten Fabric, and Ceph controllers are downloaded and installed. OpenStack and Tungsten Fabric images are precached.
Impact
    None
Time to complete
    Depending on the quality of the Internet connectivity, up to 45 minutes.
Phase 2: OpenStack and Tungsten Fabric components update
Important
This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.
What happens during the phase
    New versions of OpenStack and Tungsten Fabric container images are downloaded, and the services are restarted sequentially.
Impact

Time to complete

Phase 3: Ceph cluster update and upgrade
Important
This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.
What happens during the phase
    New versions of Ceph components are downloaded and the services are restarted. If applicable, Ceph is switched to the latest major version.
Impact
    Workloads may experience IO performance degradation for the virtual storage devices backed by Ceph.
Time to complete
    The update of a Ceph cluster with 30 storage nodes can take up to 35 minutes. Additionally, 15 minutes are required for the major Ceph version upgrade, if any.
Phase 4a: Host operating system update on Kubernetes master nodes
Important
This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.
What happens during the phase
    New system packages are downloaded and installed on the host operating system, and other major changes are applied.
Impact
    None
Time to complete
    The nodes are updated sequentially. Up to 15 minutes per node.
Phase 4b: Kubernetes components update on Kubernetes master nodes
Important
This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.
What happens during the phase
    New versions of Kubernetes control plane components are downloaded and installed.
Impact
    For clusters with the compact control plane, some of the running cloud operations may fail over the course of the phase due to minor unavailability of the cloud API. For the compact control plane with gateway nodes collocated (Open vSwitch networking backend), workloads can experience temporary loss of the North-South connectivity. The downtime depends on the type of virtual routers in use.
Time to complete
    Up to 40 minutes total.
Phases 5a and 5b: Host operating system and Kubernetes cluster update on Kubernetes worker nodes
Important
Both phases, 5a and 5b, are applied together, either node by node (default) or to several nodes in parallel. Parallel updating is available since MOSK 23.1.
Take this into consideration when estimating the impact and planning the maintenance window.
What happens during the phase
    During the host operating system update:
    During the Kubernetes cluster update:
Impact

Time to complete
    By default, the nodes are updated sequentially.
    For MOSK 23.1 to 23.2 and newer releases, you can reduce update time by enabling parallel node update. The procedure is described further in the Enable parallel update of Kubernetes worker nodes subsection.
Phase 6: Cluster nodes reboot
Important
This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.
Important
An update to a newer MOSK version may require a reboot of the cluster nodes for the changes to take effect. Although you can decide when to restart each particular node, an update cannot be considered complete until all of the nodes get restarted.
To determine whether the reboot is required, consult the Step 4. Reboot the nodes with optional instance migration section.
What happens during the phase

Impact

Time to complete

Step 1. Verify that the Container Cloud management cluster is up-to-date¶
MOSK relies on Mirantis Container Cloud to manage the underlying software stack for a cluster, as well as to deliver updates for all the components.
Since every MOSK release is tightly coupled with a Container Cloud release, a MOSK cluster update becomes possible once the management cluster runs the latest Container Cloud version. The management cluster periodically verifies public Mirantis repositories and updates itself automatically when a newer version becomes available. Having any of the managed clusters, including MOSK, running an outdated Container Cloud version prevents the management cluster from automatically updating itself.
To identify the current version of the Container Cloud software your management cluster is running, refer to the Container Cloud web UI. You can also verify your management cluster status using CLI as described in Verify the management cluster status before MOSK update.
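If you prefer the CLI, you can also query the Cluster objects on the management cluster directly. The following is a minimal sketch; the field path spec.providerSpec.value.release is an assumption based on the Container Cloud API and may differ between versions:

# Current cluster release of the management cluster itself
# (run against the management cluster kubeconfig)
kubectl -n default get cluster <mgmt-cluster-name> \
  -o jsonpath='{.spec.providerSpec.value.release}'

# Cluster releases of all managed clusters in a project
kubectl -n <project-namespace> get clusters \
  -o custom-columns=NAME:.metadata.name,RELEASE:.spec.providerSpec.value.release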
Step 2. Initiate MOSK cluster update¶
Silence alerts¶
During an update of a MOSK cluster, numerous alerts may be seen in StackLight. This is expected behavior. Therefore, ignore or temporarily mute the alerts as described in Silence alerts.
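For example, if you manage silences with the Alertmanager amtool CLI, a blanket silence for the maintenance window may look as follows. This is a sketch; the Alertmanager URL and the cluster matcher label are assumptions that you need to adjust to your StackLight deployment:

# Create a 4-hour silence for all alerts matching the cluster label
amtool silence add cluster=<cluster-name> \
  --alertmanager.url=http://<alertmanager-host>:9093 \
  --duration=4h \
  --comment="MOSK cluster update maintenance window"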
Caution

During the update, the false positive CalicoDataplaneFailuresHigh alert may fire. Disregard this alert; it will disappear once the update succeeds.

The observed behavior is typical for calico-node during upgrades, as workload changes occur frequently. Consequently, there is a possibility of temporary desynchronization in the Calico dataplane, which can occasionally result in throttling when applying workload changes to the Calico dataplane.
Verify Ceph configuration¶
If you update MOSK to 23.1, verify that the KaaSCephCluster custom resource does not contain the following entries. If they exist, remove them (see the sketch after the list below).
- In the spec.cephClusterSpec section, the external section.
- In the spec.cephClusterSpec.rookConfig section, the ms_crc_data or ms crc data configuration key. After you remove the key, wait for the rook-ceph-mon pods to restart on the MOSK cluster.
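A minimal sketch of inspecting and editing the resource from the management cluster, assuming <project-namespace> is the project namespace of your managed cluster:

# Check whether the entries listed above are present
kubectl -n <project-namespace> get kaascephcluster -o yaml | \
  grep -E 'external:|ms_crc_data|ms crc data'

# Remove the entries manually if found
kubectl -n <project-namespace> edit kaascephcluster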
Enable parallel update of Kubernetes worker nodes¶
Optional. Starting from the MOSK 23.1 to 23.2 update, you can enable and configure parallel node update to reduce update time and minimize downtime:

- To enable parallel update of Kubernetes worker nodes, set the spec.providerSpec.value.maxWorkerUpgradeCount configuration parameter in the Mirantis Container Cloud management cluster as described in conf-upd-count. See also the sketch after this list.
- Consider the specifics of handling of parallel node updates by the OpenStack, Ceph, and Tungsten Fabric Controllers to properly plan the maintenance window. For handling details and possible configuration, refer to Parallelizing node update operations.
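For illustration, the parameter can be set with a merge patch against the managed cluster object on the management cluster. A sketch with hypothetical placeholder names:

# Allow up to 3 worker nodes to be updated in parallel
kubectl -n <project-namespace> patch cluster <cluster-name> --type=merge \
  -p '{"spec":{"providerSpec":{"value":{"maxWorkerUpgradeCount":3}}}}'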
Enable automatic node reboot in update groups¶
TechPreview
Optional. Starting from MOSK 24.3, you can enable automatic node reboot of an update group, which contains a set of controller or worker machines. This option applies when a Cluster release update requires node reboot, for example, when a kernel version update is available in the target Cluster release. The option reduces manual intervention and overall downtime during cluster update.
To enable automatic node reboot in an update group, set spec.rebootIfUpdateRequires in the required UpdateGroup object. For details, see Container Cloud documentation: API Reference - UpdateGroup resource and Create update groups for worker machines.
Caution

During a distribution upgrade, machines are always rebooted, overriding rebootIfUpdateRequires: false.
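A minimal sketch of such an object; the apiVersion and the index and concurrentUpdates fields are assumptions about the Container Cloud API and may differ in your version, so verify them against the API Reference:

apiVersion: kaas.mirantis.com/v1alpha1
kind: UpdateGroup
metadata:
  name: worker-update-group
  namespace: <project-namespace>
spec:
  index: 10                     # assumed field: order of the group during update
  concurrentUpdates: 1          # assumed field: machines updated in parallel
  rebootIfUpdateRequires: true  # reboot nodes automatically when the update requires it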
Trigger the update¶
1. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.
2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.
3. In the Clusters tab, find the managed MOSK cluster.
4. Click the More action icon to see whether a new release is available. If that is the case, click Update cluster.
5. In the Release Update window, select the required Cluster release to update your managed cluster to.
   The Description section contains the list of component versions to be installed with a new Cluster release.
6. Click Update.
Before the cluster update starts, Container Cloud performs a backup of MKE and Docker Swarm. The backup directory is located under:

- /srv/backup/swarm on every Container Cloud node for Docker Swarm
- /srv/backup/ucp on one of the controller nodes for MKE
Step 3. Watch the cluster update¶
Watch the update process through the web UI¶
To view the update status through the Container Cloud web UI, navigate to the Clusters page. Once the orange blinking dot next to the cluster name disappears, the cluster update is complete.
Also, you can see the general status of each node during the update on the Container Cloud cluster view page.
Follow the update process through logs¶
The whole update process is controlled by lcm-controller, which runs in the kaas namespace of the Container Cloud management cluster. Follow its logs to watch the progress of the update and to discover and debug any issues.
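For example, assuming lcm-controller is deployed as a Deployment of the same name, you can stream its logs as follows (a sketch; verify the exact resource name in your environment):

# Follow the lcm-controller logs on the management cluster
kubectl -n kaas logs deployment/lcm-controller -f --tail=100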
Watch the state of the cluster and nodes update through the CLI¶
The lcmclusterstate and lcmmachines objects in the mos namespace of the Container Cloud management cluster provide detailed information about the current phase of the update process, both for the managed cluster overall and for specific nodes.

An lcmmachine object being in the Ready state indicates that a node has been successfully updated.
To display the detailed view of the cluster update state, run:
kubectl -n child-ns get lcmclusterstates -o wide
Example system response:
NAME CLUSTERNAME TYPE ARG VALUE ACTUALVALUE ATTEMPT MESSAGE
cd-cz7506-child-cl-storage-worker-noefi-rgxhk child-cl cordon-drain cz7506-child-cl-storage-worker-noefi-rgxhk true 0 Error: following NodeWorkloadLocks are still active - ceph: UpdatingController,openstack: InProgress
sd-cz7506-child-cl-storage-worker-noefi-rgxhk child-cl swarm-drain cz7506-child-cl-storage-worker-noefi-rgxhk true 0 Error: waiting for kubernetes node kaas-node-5222a92f-5523-457c-8c69-b7aa0ffc235c to be drained first
To display the detailed view of the nodes update state, run:
kubectl -n child-ns get lcmmachines
Example system response:
NAME CLUSTERNAME TYPE STATE
cz5018-child-cl-storage-worker-noefi-dzttw child-cl worker Prepare
cz5019-child-cl-storage-worker-noefi-vxcm9 child-cl worker Prepare
cz7500-child-cl-control-storage-worker-noefi-nkk9t child-cl control Ready
cz7501-child-cl-control-storage-worker-noefi-7pcft child-cl control Ready
cz7502-child-cl-control-storage-worker-noefi-c7k6f child-cl control Ready
cz7503-child-cl-storage-worker-noefi-5lvd7 child-cl worker Prepare
cz7505-child-cl-storage-worker-noefi-jh4mc child-cl worker Prepare
cz7506-child-cl-storage-worker-noefi-rgxhk child-cl worker Prepare
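To follow the state transitions continuously instead of polling, you can add the standard watch flag to the same command:

kubectl -n child-ns get lcmmachines -w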
Step 4. Reboot the nodes with optional instance migration¶
Depending on the target release content, you may need to reboot the cluster nodes for the changes to take effect. Running a MOSK cluster in a semi-updated state for an extended period may result in unpredictable behavior of the cloud and impact users and workloads. Therefore, when it is required, you need to reboot the cluster nodes as soon as possible to avoid potential risks.
Note

If you enabled rebootIfUpdateRequires as described in Enable automatic node reboot in update groups, nodes are automatically rebooted in update groups during a Cluster release update that requires a reboot, for example, when a kernel version update is available in the target Cluster release. For a distribution upgrade, continue reading the following subsections.
Determine if the node needs to be rebooted¶
Verify the YAML definitions of the LCMMachine and Machine objects. The node must be rebooted if the rebootRequired flag is set to true. In addition, the objects explicitly specify the reason for rebooting. For example:
- The LCMMachine object of a node that requires rebooting:

  ...
  status:
    hostInfo:
      rebootRequired: true
      rebootReason: "linux-image-5.13.0-51-generic"

- The Machine object of a node that does not require rebooting:

  ...
  status:
    ...
    providerStatus:
      ...
      reboot:
        reason: ""
        required: false
      status: Ready
Since MOSK 23.1, you can also use the Mirantis Container Cloud web UI to identify the nodes requiring reboot:
In the Clusters tab, click the required cluster name. The page with Machines opens.
Hover over the status of every machine. A machine that requires a reboot contains the Reboot > The machine requires a reboot notification in the Status tooltip.
Configure instance migration policy for cluster nodes¶
Restarting the cluster causes downtime of the cloud services running on the nodes. While the MOSK control plane is built for high availability and can tolerate the temporary loss of up to one third of its services without a significant impact on user experience, rebooting nodes that host the elements of the cloud data plane, such as network gateway nodes and compute nodes, has a detrimental effect on the cloud workloads if not performed gracefully.
To mitigate the potential impact on the cloud workloads, you can define the instance migration flow for the compute nodes running the most valuable instances.
The list of available options for the instance migration configuration includes:

- The openstack.lcm.mirantis.com/instance_migration_mode annotation:

  live
    Default. The OpenStack Controller live migrates instances automatically. The update mechanism tries to move the memory and local storage of all instances on the node to another node without interrupting them before applying any changes to the node. By default, the update mechanism makes three attempts to migrate each instance before falling back to the manual mode.

    Note

    Success of live migration depends on many factors, including the selected vCPU type and model, the amount of data that needs to be transferred, the intensity of the disk IO and memory writes, the type of the local storage, and others. Instances using the following product features are known to have issues with live migration:

    - LVM-based ephemeral storage with and without encryption
    - Encrypted block storage volumes
    - CPU and NUMA node pinning

  manual
    The OpenStack Controller waits for the Operator to migrate instances from the compute node. When it is time to update the compute node, the update mechanism asks you to manually migrate the instances and proceeds only once you confirm the node is safe to update.

  skip
    The OpenStack Controller skips the instance check on the node and reboots it.

  Note

  For the clouds relying on the converged LVM with iSCSI block storage that offer persistent volumes in a remote edge sub-region, it is important to keep in mind that applying a major change to a compute node may impact not only the instances running on this node but also the instances attached to the LVM devices hosted there. We recommend that in such environments you perform the update procedure in the manual mode with mitigation measures taken by the Operator for each compute node. Otherwise, all the instances that have LVM with iSCSI volumes attached would need a reboot to restore the connectivity.

- The openstack.lcm.mirantis.com/instance_migration_attempts annotation

  Defines the number of times the OpenStack Controller attempts to migrate a single instance before giving up. Defaults to 3.
Note
You can also use annotations to control the update of
non-compute nodes if they represent critical points of a specific
cloud architecture. For example, setting the instance_migration_mode
to manual
on a controller node with a collocated gateway (Open vSwitch)
will allow the Operator to gracefully shut down all the virtual routers
hosted on this node.
To configure the instance migration policy:
1. Edit the target compute node resource. For example:

   kubectl edit node kaas-node-03ab613d-cf79-4830-ac70-ed735453481a

2. Set the migration mode and the number of attempts the OpenStack Controller should make to migrate a single instance. For example:

   apiVersion: v1
   kind: Node
   metadata:
     name: kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
     selfLink: /api/v1/nodes/kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
     uid: 54be5139-aba7-47e7-92bf-5575773a12a6
     resourceVersion: '299734609'
     creationTimestamp: '2021-03-24T16:03:11Z'
     labels:
       ...
       openstack-compute-node: enabled
       openvswitch: enabled
     annotations:
       openstack.lcm.mirantis.com/instance_migration_mode: "live"
       openstack.lcm.mirantis.com/instance_migration_attempts: "5"
   ...
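Alternatively, the same result can be achieved non-interactively with kubectl annotate, which may be easier to script. A sketch using the annotations documented above and a hypothetical node name:

# Set the migration policy without opening an editor
kubectl annotate node <node-name> --overwrite \
  openstack.lcm.mirantis.com/instance_migration_mode=manual \
  openstack.lcm.mirantis.com/instance_migration_attempts=5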
Reboot MOSK cluster¶
Since MOSK 23.1, you can reboot several cluster nodes in one go by using the Graceful reboot mechanism provided by Mirantis Container Cloud. The mechanism restarts the selected nodes one by one, honoring the instance migration policies.
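From the CLI, the graceful reboot is requested through a dedicated object on the management cluster. The following is a sketch; the GracefulRebootRequest schema is an assumption based on the Container Cloud API, so verify it against the documentation of your version:

apiVersion: kaas.mirantis.com/v1alpha1
kind: GracefulRebootRequest
metadata:
  name: <cluster-name>          # assumed to match the managed cluster name
  namespace: <project-namespace>
spec:
  machines: []                  # assumed: empty list reboots all machines; otherwise, list machine names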
For older versions of MOSK, you need to reboot each node manually as follows:
For each node in the cluster:

- If the manual instance migration policy is configured for the node, perform the manual migration once the node is ready to reboot (see Perform manual actions before node reboot below).
Perform manual actions before node reboot¶
When a node that has a manual instance migration policy is ready to be restarted, the life-cycle management mechanism notifies you about that by creating a NodeMaintenanceRequest object for the node and setting the active status attribute for the corresponding NodeWorkloadLock object.
Note

Verify the status:errorMessage attribute before proceeding.
To view the NodeWorkloadLock object details for a specific node, run:
kubectl get nodeworkloadlocks <NODE-NAME> -o yaml
Example system response:
apiVersion: lcm.mirantis.com/v1alpha1
kind: NodeWorkloadLock
metadata:
annotations:
inner_state: active
creationTimestamp: "2022-02-04T13:24:48Z"
generation: 1
name: openstack-kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5
resourceVersion: "173934"
uid: 0cb4428f-dd0d-401d-9d5e-e9e97e077422
spec:
controllerName: openstack
nodeName: kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5
status:
errorMessage: 2022-02-04 14:43:52.674125 Some servers ['0ab4dd8f-ef0d-401d-9d5e-e9e97e077422'] are still present on host kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5.
Waiting unless all of them are migrated manually or instance_migration_mode is set to 'skip'
release: 8.5.0-rc+22.1
state: active
Note

For MOSK compute nodes, you need to manually shut down all instances running on the node, or perform cold or live migration of the instances.
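For reference, a manual live migration can be performed with the OpenStack CLI as sketched below; check the instance properties and target hypervisor capacity first, and note that the exact flags may vary between OpenStack client versions:

# Live migrate a single instance; the scheduler selects the target host
openstack server migrate --live-migration <instance-id>

# Verify that no instances remain on the compute node
openstack server list --all-projects --host <compute-host>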
After the update¶
Once your MOSK cluster update is complete, proceed with the following:
Perform the post-update steps recommended in the update notes of the target release if any.
Use the standard configuration mechanisms to re-enable the new product features that could previously exist in your cloud as a custom configuration.
To ensure the cluster operability, execute a set of smoke tests as described in Run Tempest tests.
Optional. Proceed with the upgrade of OpenStack.
If necessary, expire alert silences in StackLight as described in Silence alerts.
What to do if the update hangs or fails¶
If an update phase takes significantly longer than expected according to the tables included in Plan the cluster update, you should consider the update process hung.
If you observe errors that are not described explicitly in the documentation, immediately contact Mirantis support.
Troubleshoot issues¶
To see any issues that might have occurred during the update, verify the logs of the lcm-controller pods in the kaas namespace of the Container Cloud management cluster.
To troubleshoot the update that involves the operating system upgrade with host reboot, refer to Troubleshoot an operating system upgrade with host restart.
Roll back the changes¶
The Container Cloud and MOSK life-cycle management mechanism does not provide a way to perform a cluster-wide rollback of an update.