Granularly update a managed cluster using the ClusterUpdatePlan object

Available since MCC 2.27.0 (Cluster releases 17.2.0 and 16.2.0) TechPreview

You can control the process of a managed cluster update by manually launching update stages using the ClusterUpdatePlan custom resource. Between the update stages, a cluster remains functional from the perspective of cloud users and workloads.

A ClusterUpdatePlan object provides the following functionality:

  • The object is automatically created by the bare metal provider when a new Cluster release becomes available for your cluster.

  • The object is created in the management cluster, in the same namespace to which the corresponding managed cluster belongs.

  • The object contains a list of self-descriptive update steps that are cluster-specific. These steps are defined in the spec section of the object with information about their impact on the cluster.

  • The cluster update starts when the operator manually sets the commence field of the first update step to true. All steps initially have the commence flag set to false so that the operator can decide when to pause or resume the update process.

  • The object has the following naming convention: <managedClusterName>-<targetClusterReleaseVersion>.

  • Since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0), the object contains several StackLight alerts to notify the operator about the update progress and potential update issues. For details, see StackLight alerts: Container Cloud.
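
For example, to find the ClusterUpdatePlan objects created for your managed clusters, query the corresponding project namespace in the management cluster. A minimal sketch, assuming the namespace from the example later in this section; the clusterupdateplan resource name is an assumption, verify it with kubectl api-resources:

  kubectl api-resources --api-group=kaas.mirantis.com
  kubectl -n managed-namespace get clusterupdateplan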

To update a managed cluster granularly:

  1. Verify that the management cluster is upgraded successfully as described in Verify the management cluster status before MOSK update.

  2. Open the ClusterUpdatePlan object for editing.
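
    For example, using kubectl with the management cluster kubeconfig. A minimal sketch, assuming the namespace and object name from the example in the next step; the clusterupdateplan resource name is an assumption:

      kubectl -n managed-namespace edit clusterupdateplan demo-managed-67835-17.3.0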

  3. Start cluster update by changing the spec:steps:commence field of the first update step to true.
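
    As an alternative to interactive editing, you can set the flag non-interactively with a JSON patch. A minimal sketch, assuming the object name from the example below:

      kubectl -n managed-namespace patch clusterupdateplan demo-managed-67835-17.3.0 \
        --type json \
        -p '[{"op": "replace", "path": "/spec/steps/0/commence", "value": true}]'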

    Once done, the following actions are applied to the cluster:

    1. The Cluster release in the corresponding Cluster spec is changed to the target Cluster version defined in the ClusterUpdatePlan spec.

    2. The cluster update starts and pauses before the next update step with commence: false set in the ClusterUpdatePlan spec.

    Caution

    Cancelling an already started update step is not supported.

    The following example illustrates the ClusterUpdatePlan object of a completed MOSK cluster update:

    Example of a completed ClusterUpdatePlan object
      apiVersion: kaas.mirantis.com/v1alpha1
      kind: ClusterUpdatePlan
      metadata:
        creationTimestamp: "2024-05-20T14:03:47Z"
        generation: 3
        name: demo-managed-67835-17.3.0
        namespace: managed-namespace
        resourceVersion: "534402"
        uid: 2eab536b-55aa-4870-b732-67ebf0a8a5bb
      spec:
        cluster: demo-managed-67835
        source: mosk-17-2-0-24-2
        steps:
        - commence: true
          constraints:
          - until the step is complete, it won't be possible to perform normal LCM operations
            on the cluster
          description:
          - install new version of life cycle management modules
          - restart OpenStack control plane components in parallel
          duration:
            eta: 2h0m0s
            info:
            - 15 minutes to update one OpenStack controller node
            - 5 minutes to update one compute node
          granularity: cluster
          impact:
            info:
            - 'up to 8% unavailability of APIs: OpenStack'
            users: minor
            workloads: none
          id: openstack
          name: Update OpenStack control plane on a MOSK cluster
        - commence: true
          description:
          - major Ceph version upgrade
          - update monitors, managers, RGW/MDS
          - OSDs are restarted sequentially, or by rack
          - takes into account the failure domain config in cluster (rack updated in parallel)
          duration:
            eta: 40m0s
            info:
            - up to 40 minutes to update Ceph cluster (30 nodes)
          granularity: cluster
          impact:
            info:
            - 'up to 8% unavailability of APIs: S3/Swift'
            users: none
            workloads: none
          id: ceph
          name: Update Ceph cluster on a MOSK cluster
        - commence: true
          description:
          - new host OS kernel and packages get installed
          - host OS configuration re-applied
          - new versions of Kubernetes components installed
          duration:
            eta: 45m0s
            info:
            - 15 minutes per Kubernetes master node, nodes updated sequentially
          granularity: cluster
          impact:
            users: none
            workloads: none
          id: k8s-controllers
          name: Update host OS and Kubernetes components on master nodes
        - commence: true
          description:
          - new host OS kernel and packages get installed
          - host OS configuration re-applied
          - new versions of Kubernetes components installed
          - containerd and MCR get bumped
          - Open vSwitch and Neutron L3 agents get restarted on gateway and compute nodes
          duration:
            eta: 12h0m0s
            info:
            - 'depends on the type of the nodes: controller, compute, OSD'
          granularity: machine
          impact:
            info:
            - some OpenStack running operations might not complete due to restart of docker/containerd
              on controller nodes (up to 30%, assuming seq. controller update)
            - OpenStack LCM will prevent OpenStack controllers and gateways from parallel
              cordon / drain, despite node-group config
            - Ceph LCM will prevent parallel restart of OSDs, monitors and managers, despite
              node-group config
            - minor loss of the East-West connectivity with the Open vSwitch networking
              back end that causes approximately 5 min of downtime per compute node
            - 'minor loss of the North-South connectivity with the Open vSwitch networking
              back end: a non-distributed HA virtual router needs up to 1 minute to fail
              over; a non-distributed and non-HA virtual router failover time depends
              on many factors and may take up to 10 minutes'
            users: minor
            workloads: major
          id: k8s-workers-demo-managed-67835-default
          name: Update host OS and Kubernetes components on worker nodes, group default
        - commence: true
          description:
          - restart of StackLight, MetalLB services
          - restart of auxiliary controllers and charts
          duration:
            eta: 30m0s
            info:
            - 30 minutes minimum
          granularity: cluster
          impact:
            info:
            - minor cloud API downtime due to the restart of MetalLB components
            users: minor
            workloads: none
          id: mcc-components
          name: Auxiliary components update
        target: mosk-17-3-0-24-3
      status:
        startedAt: "2024-05-20T14:05:23Z"
        status: Completed
        steps:
        - duration: 29m16.887573286s
          message: Ready
          id: openstack
          name: Update OpenStack control plane
          startedAt: "2024-05-20T14:05:23Z"
          status: Completed
        - duration: 8m1.808804491s
          message: Ready
          id: ceph
          name: Update Ceph cluster
          startedAt: "2024-05-20T14:34:39Z"
          status: Completed
        - duration: 33m5.100480887s
          message: Ready
          id: k8s-controllers
          name: Update host OS and Kubernetes components on master nodes
          startedAt: "2024-05-20T14:42:40Z"
          status: Completed
        - duration: 1h39m9.896875724s
          message: Ready
          id: k8s-workers-demo-managed-67835-default
          name: Update host OS and Kubernetes components on worker nodes, group default
          startedAt: "2024-05-20T15:34:46Z"
          status: Completed
        - duration: 2m1.426000849s
          message: Ready
          id: mcc-components
          name: Auxiliary components update
          startedAt: "2024-05-20T17:13:55Z"
          status: Completed
    
  4. Monitor the message and status fields of the first step, for example, using the command sketch at the end of this step. The message field contains information about the progress of the current step. The status field can have the following values:

    • NotStarted

    • Scheduled Since MCC 2.28.0 (Cluster releases 17.3.0 and 16.3.0)

    • InProgress

    • Stuck

    • Completed

    The Stuck status indicates an issue: the step does not fit into the ETA defined in the duration field for that step. The ETA for each step is defined statically and does not adapt to a specific cluster.

    The Scheduled status indicates that a step is already triggered but its execution has not started yet.

    Caution

    The status is not populated for ClusterUpdatePlan objects in which the update has not been started by setting the commence: true flag on the first step. Therefore, always start updating the object from the first step.
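
    To review the current status and message of every step without opening the whole object, you can print the corresponding status fields. A minimal sketch, assuming the object name from the example above; the clusterupdateplan resource name is an assumption:

      kubectl -n managed-namespace get clusterupdateplan demo-managed-67835-17.3.0 \
        -o jsonpath='{range .status.steps[*]}{.id}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'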

  5. Optional. Available since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0). Add or remove update groups of worker nodes on the fly, unless the group update has already been scheduled. These changes are reflected in ClusterUpdatePlan.

    You can also reassign a machine to a different update group while the cluster is being updated, but only if the new update group has not finished updating yet. Disabled machines are considered updated immediately.

    Note

    The number of steps in spec depends on the number of worker node update groups present in the cluster. Each update group that has at least one machine is represented by a step with the ID k8s-workers-<UpdateGroupName>.
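
    For example, to list the worker update steps that correspond to the update groups of your cluster, you can filter the plan steps by ID. A minimal sketch, assuming the object name from the example above:

      kubectl -n managed-namespace get clusterupdateplan demo-managed-67835-17.3.0 \
        -o jsonpath='{range .spec.steps[*]}{.id}{"\n"}{end}' | grep '^k8s-workers-'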

  6. Proceed with changing the commence flag of the subsequent update steps granularly, depending on the cluster update requirements.

    Caution

    Launch the update steps sequentially. A subsequent step does not start until the previous step completes.
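
    To launch the next step, apply the same patch pattern as for the first step, incrementing the index in the path. A minimal sketch for the second step (Ceph in the example above):

      kubectl -n managed-namespace patch clusterupdateplan demo-managed-67835-17.3.0 \
        --type json \
        -p '[{"op": "replace", "path": "/spec/steps/1/commence", "value": true}]'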