Granularly update a managed cluster using the ClusterUpdatePlan object

Available since MCC 2.27.0 (17.2.0) TechPreview

You can control the process of a managed cluster update by manually launching update stages using the ClusterUpdatePlan custom resource. Between the update stages, a cluster remains functional from the perspective of cloud users and workloads.

A ClusterUpdatePlan object provides the following functionality:

  • The object is automatically created by the bare metal provider when a new Cluster release becomes available for your cluster.

  • The object is created on the management cluster in the same namespace (project) as the corresponding managed cluster.

  • The object contains a list of self-descriptive update steps that are cluster-specific. These steps are defined in the spec section of the object with information about their impact on the cluster.

  • The object starts cluster update when the operator manually changes the commence field of the first update step to true. All steps have the commence flag initially set to false so that the operator can decide when to pause or resume the update process.

  • The object has the following naming convention: <managedClusterName>-<targetClusterReleaseVersion>.

  • Since Container Cloud 2.28.0 (Cluster release 17.3.0), several StackLight alerts are available to notify the operator about the update progress and potential update issues. For details, see StackLight alerts: Container Cloud.
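
For example, for the managed cluster mosk in the child project updating to the Cluster release version 17.4.0, the object is named mosk-17.4.0, as in the example later in this section. Assuming kubectl access to the management cluster (the <mgmtKubeconfig> placeholder and the clusterupdateplans resource name are illustrative), you can list the available plans as follows:

  kubectl --kubeconfig <mgmtKubeconfig> -n child get clusterupdateplans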

Granularly update a managed cluster using the CLI

  1. Verify that the management cluster is upgraded successfully as described in Verify the management cluster status before MOSK update.

  2. Optional. Available since Container Cloud 2.29.0 (Cluster release 17.4.0) as Technology Preview. Enable update auto-pause to be triggered by specific StackLight alerts. For details, see Configure update auto-pause.

  3. Open the ClusterUpdatePlan object for editing.
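
    For example, a minimal sketch using kubectl on the management cluster, with the child namespace and mosk-17.4.0 object name taken from the example in the next step:

      kubectl --kubeconfig <mgmtKubeconfig> -n child edit clusterupdateplan mosk-17.4.0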

  4. Start cluster update by changing the spec:steps:commence field of the first update step to true.

    Once done, the following actions are applied to the cluster:

    1. The Cluster release in the corresponding Cluster spec is changed to the target Cluster version defined in the ClusterUpdatePlan spec.

    2. The cluster update starts and pauses before the next update step with commence: false set in the ClusterUpdatePlan spec.
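
    For reference, after you commence the first step, the beginning of the spec section looks as follows. This is a trimmed sketch of the full example below, keeping only the fields relevant to this step:

      spec:
        cluster: mosk
        steps:
        - commence: true  # set manually by the operator to start the update
          id: openstack
          # the remaining steps stay at commence: false until launched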

    Caution

    Cancelling an already started update step is not supported.

    The following example illustrates the ClusterUpdatePlan object for a completed MOSK cluster update:

    Example of a completed ClusterUpdatePlan object:
      apiVersion: kaas.mirantis.com/v1alpha1
      kind: ClusterUpdatePlan
      metadata:
        creationTimestamp: "2025-02-06T16:53:51Z"
        generation: 11
        name: mosk-17.4.0
        namespace: child
        resourceVersion: "6072567"
        uid: 82c072be-1dc5-43dd-b8cf-bc643206d563
      spec:
        cluster: mosk
        releaseNotes: https://docs.mirantis.com/mosk/latest/25.1-series.html
        source: mosk-17-3-0-24-3
        steps:
        - commence: true
          description:
          - install new version of OpenStack and Tungsten Fabric life cycle management
            modules
          - OpenStack and Tungsten Fabric container images pre-cached
          - OpenStack and Tungsten Fabric control plane components restarted in parallel
          duration:
            estimated: 1h30m0s
            info:
            - 15 minutes to cache the images and update the life cycle management modules
            - 1h to restart the components
          granularity: cluster
          id: openstack
          impact:
            info:
            - some of the running cloud operations may fail due to restart of API services
              and schedulers
            - DNS might be affected
            users: minor
            workloads: minor
          name: Update OpenStack and Tungsten Fabric
        - commence: true
          description:
          - Ceph version update
          - restart Ceph monitor, manager, object gateway (radosgw), and metadata services
          - restart OSD services node-by-node, or rack-by-rack depending on the cluster
            configuration
          duration:
            estimated: 8m30s
            info:
            - 15 minutes for the Ceph version update
            - around 40 minutes to update Ceph cluster of 30 nodes
          granularity: cluster
          id: ceph
          impact:
            info:
            - 'minor unavailability of object storage APIs: S3/Swift'
            - workloads may experience IO performance degradation for the virtual storage
              devices backed by Ceph
            users: minor
            workloads: minor
          name: Update Ceph
        - commence: true
          description:
          - new host OS kernel and packages get installed
          - host OS configuration re-applied
          - container runtime version gets bumped
          - new versions of Kubernetes components installed
          duration:
            estimated: 1h40m0s
            info:
            - about 20 minutes to update host OS per a Kubernetes controller, nodes updated
              one-by-one
            - Kubernetes components update takes about 40 minutes, all nodes in parallel
          granularity: cluster
          id: k8s-controllers
          impact:
            users: none
            workloads: none
          name: Update host OS and Kubernetes components on master nodes
        - commence: true
          description:
          - new host OS kernel and packages get installed
          - host OS configuration re-applied
          - container runtime version gets bumped
          - new versions of Kubernetes components installed
          - data plane components (Open vSwitch and Neutron L3 agents, TF agents and vrouter)
            restarted on gateway and compute nodes
          - storage nodes put to “no-out” mode to prevent rebalancing
          - by default, nodes are updated one-by-one, a node group can be configured to
            update several nodes in parallel
          duration:
            estimated: 8h0m0s
            info:
            - host OS update - up to 15 minutes per node (not including host OS configuration
              modules)
            - Kubernetes components update - up to 15 minutes per node
            - OpenStack controllers and gateways updated one-by-one
            - nodes hosting Ceph OSD, monitor, manager, metadata, object gateway (radosgw)
              services updated one-by-one
          granularity: machine
          id: k8s-workers-vdrok-child-default
          impact:
            info:
            - 'OpenStack controller nodes: some running OpenStack operations might not
              complete due to restart of components'
            - 'OpenStack compute nodes: minor loss of the East-West connectivity with
              the Open vSwitch networking back end that causes approximately 5 min of
              downtime'
            - 'OpenStack gateway nodes: minor loss of the North-South connectivity with
              the Open vSwitch networking back end: a non-distributed HA virtual router
              needs up to 1 minute to fail over; a non-distributed and non-HA virtual
              router failover time depends on many factors and may take up to 10 minutes'
            users: major
            workloads: major
          name: Update host OS and Kubernetes components on worker nodes, group vdrok-child-default
        - commence: true
          description:
          - restart of StackLight, MetalLB services
          - restart of auxiliary controllers and charts
          duration:
            estimated: 1h30m0s
          granularity: cluster
          id: mcc-components
          impact:
            info:
            - minor cloud API downtime due restart of MetalLB components
            users: minor
            workloads: none
          name: Auxiliary components update
        target: mosk-17-4-0-25-1
      status:
        completedAt: "2025-02-07T19:24:51Z"
        startedAt: "2025-02-07T17:07:02Z"
        status: Completed
        steps:
        - duration: 26m36.355605528s
          id: openstack
          message: Ready
          name: Update OpenStack and Tungsten Fabric
          startedAt: "2025-02-07T17:07:02Z"
          status: Completed
        - duration: 6m1.124356485s
          id: ceph
          message: Ready
          name: Update Ceph
          startedAt: "2025-02-07T17:33:38Z"
          status: Completed
        - duration: 24m3.151554465s
          id: k8s-controllers
          message: Ready
          name: Update host OS and Kubernetes components on master nodes
          startedAt: "2025-02-07T17:39:39Z"
          status: Completed
        - duration: 1h19m9.359184228s
          id: k8s-workers-vdrok-child-default
          message: Ready
          name: Update host OS and Kubernetes components on worker nodes, group vdrok-child-default
          startedAt: "2025-02-07T18:03:42Z"
          status: Completed
        - duration: 2m0.772243006s
          id: mcc-components
          message: Ready
          name: Auxiliary components update
          startedAt: "2025-02-07T19:22:51Z"
          status: Completed
    
  5. Monitor the message and status fields of the first step. The message field contains information about the progress of the current step. The status field can have the following values:

    • NotStarted

    • Scheduled Since MCC 2.28.0 (17.3.0)

    • InProgress

    • AutoPaused TechPreview since MCC 2.29.0 (17.4.0)

    • Stuck

    • Completed

    The Scheduled status indicates that a step is already triggered but its execution has not started yet.

    The AutoPaused status indicates that the update process is paused by a firing StackLight alert defined in the UpdateAutoPause object. For details, see Configure update auto-pause.

    The Stuck status indicates an issue: the step cannot fit into the ETA defined in the duration field for this step. The ETA for each step is defined statically and does not change depending on the cluster.

    Caution

    The status section is not populated for ClusterUpdatePlan objects whose update has not been started by setting commence: true in the first step of the object. Therefore, always start updating the object from the first step.
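
    To monitor from the CLI, you can, for example, print the id, status, and message of every step using a jsonpath query, a sketch based on the example object above:

      kubectl --kubeconfig <mgmtKubeconfig> -n child get clusterupdateplan mosk-17.4.0 \
        -o jsonpath='{range .status.steps[*]}{.id}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'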

  6. Optional. Available since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0). Add or remove update groups of worker nodes on the fly. These changes are reflected in ClusterUpdatePlan. Two restrictions apply: you cannot remove a group whose update has already been scheduled, and you cannot add a group with an index lower than or equal to the index of a group that is already scheduled.

    You can also reassign a machine to a different update group while the cluster is being updated, but only if the new update group has an index higher than the index of the last scheduled worker update group. Disabled machines are considered updated immediately.

    Note

    The number of steps in spec depends on the number of update groups for worker nodes in the cluster. Each update group for worker nodes that has at least one machine is represented by a step with the k8s-workers-<UpdateGroupName> ID.
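
    For illustration, an update group named vdrok-child-default, as in the example above, maps to the step ID k8s-workers-vdrok-child-default. The following UpdateGroup sketch shows this mapping; the spec field names are assumptions here, so refer to the update group documentation for the exact schema:

      apiVersion: kaas.mirantis.com/v1alpha1
      kind: UpdateGroup
      metadata:
        name: vdrok-child-default
        namespace: child
      spec:
        index: 1              # assumed field: groups are updated in ascending index order
        concurrentUpdates: 1  # assumed field: machines in this group are updated one-by-one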

  7. Proceed with granularly setting the commence flag of the subsequent update steps to true, depending on the cluster update requirements, as illustrated in the example below.

    Caution

    Launch the update steps sequentially. A subsequent step does not start until the previous step is completed.
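
    For example, to commence the second step, index 1 in the spec.steps list, without opening an editor, you can apply a JSON patch. This is a sketch that assumes the step order from the example above:

      kubectl --kubeconfig <mgmtKubeconfig> -n child patch clusterupdateplan mosk-17.4.0 \
        --type=json -p '[{"op": "replace", "path": "/spec/steps/1/commence", "value": true}]'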

Granularly update a managed cluster using the Container Cloud web UI

Available since MCC 2.29.0 (17.4.0 and 16.4.0)

  1. Verify that the management cluster is upgraded successfully as described in Verify the management cluster status before MOSK update.

  2. Optional. Available since Container Cloud 2.29.0 (Cluster release 17.4.0) as Technology Preview. Enable update auto-pause to be triggered by specific StackLight alerts. For details, see Configure update auto-pause.

  3. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

  4. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  5. On the Clusters page, in the Updates column of the required cluster, click the Available link. The Updates tab opens.

    Note

    If the Updates column is absent, it indicates that the cluster is up-to-date.

    Note

    For your convenience, the Cluster updates menu is also available in the right-side kebab menu of the cluster on the Clusters page.

  6. On the Updates page, click the required version in the Target column to open update details, including the list of update steps, current and target cluster versions, and estimated update time.

  7. In the Target version section of the Cluster update window, click Release notes and carefully read the information about the target release, including the Update notes section that contains important pre-update and post-update steps.

  8. Expand each step to verify information about update impact and other useful details.

  9. Select one of the following options:

    • Enable Auto-commence all at the top-right of the first update step section and click Start Update to launch the update and start each step automatically.

    • Click Start Update to only launch the first update step.

      Note

      This option allows you to auto-commence subsequent steps while the current step is in progress. Enable the Auto-commence toggle for the required steps and click Save to launch the selected steps automatically. You will only be prompted to confirm the next step; all remaining steps will be launched without manual confirmation.

    Before launching the update, you will be prompted to manually type in the target Cluster release name and confirm that you have read the release notes for the target release.

    Caution

    Cancelling an already started update step is not supported.

  10. Monitor the status of each step by hovering over the In Progress icon at the top-right of the step window. While the step is in progress, its current status is updated every minute.

    Once the required step is completed, the Waiting for input status is displayed at the top of the update window, requiring you to confirm the next step.

The update history is retained in the Updates tab with the completion timestamp. The update plans that were not started and can no longer be used are cleaned up automatically.