Update a MOSK cluster to 22.2 or earlier¶
This section describes the workflow that you, as a cloud operator, need to follow to correctly update your Mirantis OpenStack for Kubernetes (MOSK) cluster from version 21.6 to 22.1 or from 22.1 to 22.2.
The provided guidelines are generic and apply to any MOSK cluster regardless of its configuration specifics. However, every target release may have its own update peculiarities. Therefore, to accurately plan and successfully perform a specific update, in addition to following the procedure below, read the update-related section in the Release Notes of the target MOSK version.
Depending on the payload of a target release, the update mechanism can perform changes on different levels of the stack, from the configuration of the host operating system to the code of OpenStack itself. The update mechanism is designed to minimize the impact on the workloads and cloud users. The life-cycle management logic minimizes the downtime for the cloud API by means of smart management of the cluster components under the hood and requests the cloud operator's involvement only when a human decision is required to proceed.
Though the update mechanism may change the internal components of the cluster, it always preserves the major versions of OpenStack and Tungsten Fabric, the cloud APIs that cloud users and workloads deal with. After the cluster is successfully updated, you can initiate a dedicated upgrade procedure to move to the latest supported versions of OpenStack and Tungsten Fabric.
Before you begin¶
Before updating, we recommend that you closely review the Release Compatibility Matrix document and the Release Notes of the target release, as well as thoroughly plan a maintenance window for each update phase depending on the configuration specifics of your cluster.
Read the release notes¶
Carefully read the Release Compatibility Matrix and the Release Notes of the target MOSK version, paying particular attention to the following:
Current Mirantis Container Cloud software version and the need to first update to the latest cluster release version
Update notes provided in the Release notes for the target MOSK version
New product features that will get enabled in your cloud by default
New product features that may have already been configured in your cloud as customizations and now need to be properly re-enabled to be eligible for further support
Any changes in the behavior of the product features enabled in your cloud
List of the addressed and known issues in the target MOSK version
Warning
If your cloud is known to have any custom configuration that was not explicitly approved by Mirantis, make sure to bring this up with your dedicated Mirantis representative before proceeding with the update. Mirantis cannot guarantee the safe updating of a customized cloud.
Plan the cluster update¶
Depending on the payload brought by a particular target release, a generic cluster update includes from three to five major phases.
The first three phases are mandatory for any update, have minimal impact on the cloud users and workloads, and get executed within a single maintenance window. The remaining phases are only present if major changes need to be performed to a cluster underlay and include a Kubernetes cluster and host operating system update. These phases may introduce a severe impact on the workloads and, therefore, need to be thoroughly planned across several sequential maintenance windows.
Important
To effectively plan an update, keep in mind the specific setup of your cluster. Depending on the cluster architecture, various groups of OpenStack, Tungsten Fabric, and Ceph components, that include controller nodes, storage nodes, compute nodes, and so on, can be running as pods across different worker nodes comprising the underlay Kubernetes cluster. Moreover, some architectures allow running pods on Kubernetes master nodes.
The tables below will help you to plan your cluster update and include the following information for each mandatory and additional update phase:
- What happens during the phase
Includes the phase milestones. This content is important for understanding the impact.
- Impact
Describes any possible impact on cloud users and workloads.
The impact estimate represents the worst-case scenario in the architectures that imply a combination of several cluster roles on the same physical nodes, such as hyper-converged compute nodes and clusters with the compact control plane. Also, the impact estimation presumes that your cluster uses one of the standard architectures provided by the product and follows Mirantis design guidelines.
Several update phases will occur simultaneously, resulting in a greater cumulative impact, but a shorter completion time.
- Time to complete
Provides a rough estimation of the time required to complete the phase.
The estimates for a phase timeline presume that your cluster uses one of the standard architectures provided by the product and follows Mirantis design guidelines.
- Can be paused
Specifies whether you can postpone the execution of the most impactful phases on a per-node basis using the workload handling policy mechanism described in Step 2. Define workload handling policy for compute nodes.
Warning
MOSK clusters are not designed to run in a semi-updated state for a long time. Once started, an update needs to complete as soon as possible. Postponing update phases for longer than one day may introduce a risk of cloud workload corruption.
Warning
During the update, prevent users from performing write operations on the cloud resources. Any intensive manipulations may lead to workload corruption.
Phase 1: Life-cycle management modules update
Important
This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.
- What happens during the phase
New versions of the OpenStack, Tungsten Fabric, and Ceph controllers are downloaded and installed. OpenStack and Tungsten Fabric images are precached.
- Impact
None
- Time to complete
Up to 45 minutes, depending on the quality of the Internet connectivity.
- Can be paused
No
Phase 2: OpenStack and Tungsten Fabric components update
Important
This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.
- What happens during the phase
New versions of the OpenStack and Tungsten Fabric container images are downloaded, and the services are restarted sequentially.
- Impact
- Time to complete
- Can be paused
No
Phase 3: Ceph cluster update and upgrade
Important
This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.
- What happens during the phase
New versions of the Ceph components are downloaded and the services are restarted. If applicable, Ceph is switched to the latest major version.
- Impact
None
- Time to complete
The update of a Ceph cluster with 30 storage nodes can take up to 35 minutes. Additionally, 15 minutes are required for a major Ceph version upgrade, if any.
- Can be paused
No
Phase 4a: Host operating system update on Kubernetes master nodes
Important
This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.
- What happens during the phase
New system packages are downloaded and installed on the host operating system, other major changes are applied, and the node is rebooted.
- Impact
For the compact control plane, approximately 8% of the running cloud operations may fail over the course of the phase due to minor unavailability of the cloud API. For the compact control plane with collocated gateway nodes (Open vSwitch), expect a minor loss of the North-South connectivity.
- Time to complete
Up to 20 minutes per node. The nodes are updated sequentially, not in parallel.
- Can be paused
Yes
Phase 4b: Kubernetes cluster update on Kubernetes master nodes
Important
This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.
- What happens during the phase
New versions of the Kubernetes control plane components are downloaded and installed.
- Impact
For the compact control plane, approximately 8% of the running cloud operations may fail over the course of the phase due to minor unavailability of the cloud API. For the compact control plane with collocated gateway nodes (Open vSwitch), expect a minor loss of the North-South connectivity.
- Time to complete
Up to 40 minutes in total.
- Can be paused
Yes
Phases 5a and 5b: Host operating system and Kubernetes cluster update on Kubernetes worker nodes
Important
This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.
Important
Phases 5a and 5b are applied together, node by node. Take this into account when estimating the impact and planning the maintenance window.
- What happens during the phase
During the host operating system update: new system packages are downloaded and installed, other major changes are applied, and the node is rebooted.
During the Kubernetes cluster update: new versions of the Kubernetes components are downloaded and installed on the node.
- Impact
- Time to complete
For the host operating system update:
For the Kubernetes cluster update:
- Can be paused
Yes
Step 1. Verify that the Container Cloud management cluster is up-to-date¶
MOSK relies on Mirantis Container Cloud to manage the underlying software stack for a cluster, as well as to deliver updates for all the components.
Since every MOSK release is tightly coupled with a Container Cloud release, a MOSK cluster update becomes possible once the management cluster is known to run the latest Container Cloud version. The management cluster periodically verifies public Mirantis repositories and updates itself automatically when a newer version becomes available. Having any of the managed clusters, including MOSK, run an outdated Container Cloud version prevents the management cluster from updating itself automatically.
To identify the current version of the Container Cloud software your management cluster is running, refer to the Container Cloud web UI.
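If you prefer the command line, you can also query the management cluster directly. The following is a minimal sketch that assumes you have a kubeconfig for the management cluster and that your Container Cloud version exposes the KaaSRelease and ClusterRelease custom resources; the exact resource names may vary between Container Cloud versions:
# List the Container Cloud (KaaS) releases known to the management cluster
kubectl get kaasreleases
# List the cluster releases available for managed clusters, including MOSK
kubectl get clusterreleases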
Step 2. Define workload handling policy for compute nodes¶
During an update, changes get applied to the MOSK cluster nodes one by one, starting with the Kubernetes master nodes and then proceeding with the Kubernetes worker nodes in no particular order. Depending on the target MOSK release, the severity of the changes and, therefore, the impact on the OpenStack workloads running on a compute node vary from zero to an hour of downtime while the node gets rebooted, for example, when switching to a new host operating system.
To mitigate the potential impact on the cloud workloads, you can define the instance migration flow for the compute nodes running the most valuable instances.
The list of available options for the instance migration configuration includes:
- The openstack.lcm.mirantis.com/instance_migration_mode annotation:

  live
  Default. The OpenStack Controller live migrates instances automatically. The update mechanism tries to move the memory and local storage of all instances on the node to another node without interrupting them before applying any changes to the node. By default, the update mechanism makes three attempts to migrate each instance before falling back to the manual mode.

  Note

  Success of live migration depends on many factors including the selected vCPU type and model, the amount of data that needs to be transferred, the intensity of the disk IO and memory writes, the type of the local storage, and others. Instances using the following product features are known to have issues with live migration:

  - LVM-based ephemeral storage with and without encryption
  - Encrypted block storage volumes
  - CPU and NUMA node pinning

  manual
  The OpenStack Controller waits for the Operator to migrate instances from the compute node. When it is time to update the compute node, the update mechanism asks you to manually migrate the instances and proceeds only once you confirm that the node is safe to update.

  skip
  The OpenStack Controller skips the instance check on the node and reboots it.
Note
For the clouds relying on the converged LVM with iSCSI block storage that offer persistent volumes in a remote edge sub-region, it is important to keep in mind that applying a major change to a compute node may impact not only the instances running on this node but also the instances attached to the LVM devices hosted there. We recommend that in such environments you perform the update procedure in the manual mode with mitigation measures taken by the Operator for each compute node. Otherwise, all the instances that have LVM with iSCSI volumes attached would require a reboot to restore their connectivity.
- The openstack.lcm.mirantis.com/instance_migration_attempts annotation
  Defines the number of times the OpenStack Controller attempts to migrate a single instance before giving up. Defaults to 3.
Note
You can also use annotations to control the update of
non-compute nodes if they represent critical points of a specific
cloud architecture. For example, setting the instance_migration_mode
to manual
on a controller node with a collocated gateway (Open vSwitch)
will allow the Operator to gracefully shut down all the virtual routers
hosted on this node.
To configure the migration flow:
Edit the target compute node resource. For example:
kubectl edit node kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
Set the migration mode and the number of times the OpenStack Controller should attempt to migrate a single instance. For example:
apiVersion: v1
kind: Node
metadata:
  name: kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
  selfLink: /api/v1/nodes/kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
  uid: 54be5139-aba7-47e7-92bf-5575773a12a6
  resourceVersion: '299734609'
  creationTimestamp: '2021-03-24T16:03:11Z'
  labels:
    ...
    openstack-compute-node: enabled
    openvswitch: enabled
  annotations:
    openstack.lcm.mirantis.com/instance_migration_mode: "live"
    openstack.lcm.mirantis.com/instance_migration_attempts: "5"
  ...
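Alternatively, you can set the same annotations without opening an editor by using kubectl annotate. A minimal sketch, where the node name is an example and --overwrite replaces any previously set values:
kubectl annotate node kaas-node-03ab613d-cf79-4830-ac70-ed735453481a \
  openstack.lcm.mirantis.com/instance_migration_mode=manual \
  openstack.lcm.mirantis.com/instance_migration_attempts=5 \
  --overwrite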
Step 3. Initiate MOSK cluster update¶
Silence alerts¶
During an update of a MOSK cluster, numerous alerts may be seen in StackLight. This is expected behavior. Therefore, ignore or temporarily mute the alerts as described in Container Cloud Operations Guide: Silence alerts.
Trigger the update¶
Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.
Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.
In the Clusters tab, find the managed MOSK cluster.
Click the More action icon to see whether a new release is available. If that is the case, click Update cluster.
In the Release Update window, select the required Cluster release to update your managed cluster to.
The Description section contains the list of components versions to be installed with a new Cluster release.
Click Update.
Before the cluster update starts, Container Cloud performs a backup of MKE and Docker Swarm. The backup directory is located under:
- /srv/backup/swarm on every Container Cloud node for Docker Swarm
- /srv/backup/ucp on one of the controller nodes for MKE
To view the update status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the update is complete.
Step 4. Watch the cluster update¶
Watch the update process through the web UI¶
To view the update status through the Container Cloud web UI, navigate to the Clusters page. Once the orange blinking dot next to the cluster name disappears, the cluster update is complete.
Also, you can see the general status of each node during the update on the Container Cloud cluster view page.
Follow the update process through logs¶
The whole update process is controlled by lcm-controller, which runs in the kaas namespace of the Container Cloud management cluster. Follow its logs to watch the progress of the update and to discover and debug any issues.
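For example, to stream the logs, you can use a command similar to the following. This is a sketch that assumes the controller runs as the lcm-controller deployment in the kaas namespace, as stated above:
kubectl -n kaas logs deployment/lcm-controller --tail=100 -f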
Watch the state of the cluster and nodes update through the CLI¶
The lcmclusterstate
and lcmmachines
objects in the mos
namespace
of the Container Cloud management cluster provide detailed information about
the current phase of the update process in the context of the managed cluster
overall as well as specific nodes.
The lcmmachine
object being in the Ready
state indicates that a node
has been successfully updated.
To display the detailed view of the cluster update state, run:
kubectl -n child-ns get lcmclusterstates -o wide
Example system response:
NAME CLUSTERNAME TYPE ARG VALUE ACTUALVALUE ATTEMPT MESSAGE
cd-cz7506-child-cl-storage-worker-noefi-rgxhk child-cl cordon-drain cz7506-child-cl-storage-worker-noefi-rgxhk true 0 Error: following NodeWorkloadLocks are still active - ceph: UpdatingController,openstack: InProgress
sd-cz7506-child-cl-storage-worker-noefi-rgxhk child-cl swarm-drain cz7506-child-cl-storage-worker-noefi-rgxhk true 0 Error: waiting for kubernetes node kaas-node-5222a92f-5523-457c-8c69-b7aa0ffc235c to be drained first
To display the detailed view of the nodes update state, run:
kubectl -n child-ns get lcmmachines
Example system response:
NAME CLUSTERNAME TYPE STATE
cz5018-child-cl-storage-worker-noefi-dzttw child-cl worker Prepare
cz5019-child-cl-storage-worker-noefi-vxcm9 child-cl worker Prepare
cz7500-child-cl-control-storage-worker-noefi-nkk9t child-cl control Ready
cz7501-child-cl-control-storage-worker-noefi-7pcft child-cl control Ready
cz7502-child-cl-control-storage-worker-noefi-c7k6f child-cl control Ready
cz7503-child-cl-storage-worker-noefi-5lvd7 child-cl worker Prepare
cz7505-child-cl-storage-worker-noefi-jh4mc child-cl worker Prepare
cz7506-child-cl-storage-worker-noefi-rgxhk child-cl worker Prepare
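Because kubectl supports the watch mode, you can leave a terminal open to track the state transitions as the nodes progress through the update. Here, child-ns stands for the namespace of your managed cluster, as in the examples above:
kubectl -n child-ns get lcmmachines -w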
Watch the node maintenance status¶
A NodeMaintenanceRequest
object gets created during the update process to
inform the higher-level life-cycle management mechanisms that a node is going
to be updated. This enables OpenStack, Ceph, and Tungsten Fabric controllers
to get prepared for that.
The NodeMaintenanceRequest
object that exists for the node with the
manual
policy configured indicates that you should proceed with the
manual handling of the workloads. See Step 5. Optional. Perform the manual workload migration for details.
To output the NodeMaintenanceRequest
object details, run:
kubectl get nodemaintenancerequests
Example system response:
NAME AGE
kaas-node-50a51d95-1e4b-487e-a973-199de400b97d 17m
kaas-node-e41a610a-ceaf-4d80-90ee-4ea7b4dee161 85s
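To quickly see which nodes have a non-default workload handling policy configured, and therefore wait for your manual intervention, you can list the corresponding annotation across all nodes. The following is a sketch that uses a JSONPath template, with the dots inside the annotation key escaped by backslashes:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.openstack\.lcm\.mirantis\.com/instance_migration_mode}{"\n"}{end}'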
Step 5. Optional. Perform the manual workload migration¶
When it is time to apply changes to a node that has a manual
workload
handling policy configured, the life-cycle management mechanism notifies
you about that by creating a NodeMaintenanceRequest
object for the node
and setting the active
status attribute for the corresponding
NodeWorkloadLock
object.
Note
Verify the status:errorMessage
attribute before proceeding.
To view the nodeworkloadlock
objects details for a specific node, run:
kubectl get nodeworkloadlocks <NODE-NAME> -o yaml
Example system response:
apiVersion: lcm.mirantis.com/v1alpha1
kind: NodeWorkloadLock
metadata:
annotations:
inner_state: active
creationTimestamp: "2022-02-04T13:24:48Z"
generation: 1
name: openstack-kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5
resourceVersion: "173934"
uid: 0cb4428f-dd0d-401d-9d5e-e9e97e077422
spec:
controllerName: openstack
nodeName: kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5
status:
errorMessage: 2022-02-04 14:43:52.674125 Some servers ['0ab4dd8f-ef0d-401d-9d5e-e9e97e077422'] are still present on host kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5.
Waiting unless all of them are migrated manually or instance_migration_mode is set to 'skip'
release: 8.5.0-rc+22.1
state: active
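To get a compact overview of all workload locks instead of the full YAML output, you can use a custom-columns query such as the one below. The column paths mirror the fields shown in the example above:
kubectl get nodeworkloadlocks -o custom-columns=NAME:.metadata.name,CONTROLLER:.spec.controllerName,STATE:.status.state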
Note
For a MOSK compute node, you need to manually shut down all instances running on it, or perform cold or live migration of the instances.
Once you complete the workload migration, the update mechanism can safely
proceed with the node update. Indicate the migration completion to the LCM
by setting the node workload handling policy to skip
. See
Step 2. Define workload handling policy for compute nodes for annotation editing details.
After the update¶
Once your MOSK cluster update is complete, proceed with the following:
If your MOSK cluster uses Tungsten Fabric as a back end for networking, manually restart the vRouter pods:
Important
Since MOSK 22.4, the post-update restart of the TF vRouter pods is performed automatically. Therefore, if the target version of your update is MOSK 22.4 or newer, skip this step.
Remove the vRouter pods one by one manually.
Note
Manual removal is required because the vRouter pods use the OnDelete update strategy. A vRouter pod restart causes networking downtime for workloads on the affected node. If this is not acceptable for some workloads, migrate them before restarting the vRouter pods.
kubectl -n tf delete pod <VROUTER-POD-NAME>
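If the cluster has many nodes, the one-by-one restart can be scripted. The following is a minimal sketch that assumes the vRouter DaemonSet is named tf-vrouter-agent and its pods carry the app=tf-vrouter-agent label; verify both assumptions with kubectl -n tf get ds and kubectl -n tf get pods --show-labels before running it:
# Restart the tf-vrouter pods one at a time, waiting for the DaemonSet
# to report all pods ready again before touching the next one
for pod in $(kubectl -n tf get pods -l app=tf-vrouter-agent -o name); do
  kubectl -n tf delete "$pod"
  sleep 10  # give the DaemonSet status time to reflect the deletion
  until [ "$(kubectl -n tf get ds tf-vrouter-agent -o jsonpath='{.status.numberReady}')" = \
          "$(kubectl -n tf get ds tf-vrouter-agent -o jsonpath='{.status.desiredNumberScheduled}')" ]; do
    sleep 10
  done
done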
Verify that all tf-vrouter-* pods have been updated:
kubectl -n tf get ds | grep tf-vrouter
The UP-TO-DATE and CURRENT fields must have the same values.
Perform the post-update steps recommended in the update notes of the target release if any.
Use the standard configuration mechanisms to re-enable the new product features that may have previously existed in your cloud as custom configurations.
To ensure the cluster operability, execute a set of smoke tests as described in Run Tempest tests.
Optional. Proceed with the upgrade of OpenStack and Tungsten Fabric.
If necessary, expire alert silences in StackLight as described in Container Cloud Operations Guide: Silence alerts.
What to do if the update hangs or fails¶
If an update phase takes significantly longer than expected according to the tables included in Plan the cluster update, you should consider the update process hung.
If you observe errors that are not described explicitly in the documentation, immediately contact Mirantis support.
Troubleshoot issues¶
To see any issues that might have occurred during the update, verify the logs
of the lcm-controller
pods in the kaas
namespace of the Container Cloud
management cluster.
To troubleshoot the update that involves the operating system upgrade with host reboot, refer to Container Cloud documentation.
Roll back the changes¶
The Container Cloud and MOSK life-cycle management mechanism does not provide a way to perform a cluster-wide rollback of an update. Therefore, if the update process gets blocked on a specific node, remove the node from the cluster and try to complete the update.