The documentation is intended to help operators understand the core concepts
of the product.
The information provided in this documentation set is continuously improved
and amended based on feedback and requests from our software consumers.
This documentation set describes the features supported within the three
latest Container Cloud minor releases and their supported Cluster releases,
with a corresponding note
Available since <release-version>.
The following table lists the guides included in the documentation set you
are reading:
GUI elements that include any part of the interactive user interface and
menu navigation.
Superscript
Some extra, brief information. For example, if a feature is
available from a specific release or if a feature is in the
Technology Preview development stage.
Note
The Note block
Messages of a generic meaning that may be useful to the user.
Caution
The Caution block
Information that prevents a user from mistakes and undesirable
consequences when following the procedures.
Warning
The Warning block
Messages that include details that can be easily missed, but should not
be ignored by the user and are valuable before proceeding.
See also
The See also block
List of references that may be helpful for understanding some related
tools, concepts, and so on.
Learn more
The Learn more block
Used in the Release Notes to wrap a list of internal references to
the reference architecture, deployment and operation procedures specific
to a newly implemented product feature.
A Technology Preview feature provides early access to upcoming product
innovations, allowing customers to experiment with the functionality and
provide feedback.
Technology Preview features may be available privately or publicly, but they
are not intended for production use. While Mirantis will provide
assistance with such features through official channels, normal Service
Level Agreements do not apply.
As Mirantis considers making future iterations of Technology Preview features
generally available, we will do our best to resolve any issues that customers
experience when using these features.
During the development of a Technology Preview feature, additional components
may become available to the public for evaluation. Mirantis cannot guarantee
the stability of such features. As a result, if you are using Technology
Preview features, you may not be able to seamlessly update to subsequent
product releases, or upgrade or migrate to functionality that
has not yet been announced as fully supported.
Mirantis makes no guarantees that Technology Preview features will graduate
to generally available features.
The documentation set refers to Mirantis Container Cloud GA as the latest
released GA version of the product. For details about the Container Cloud
GA minor release dates, refer to
Container Cloud releases.
Mirantis Container Cloud enables you to ship code faster by enabling speed
with choice, simplicity, and security. Through a single pane of glass you can
deploy, manage, and observe Kubernetes clusters on bare metal infrastructure.
The list of the most common use cases includes:
Kubernetes cluster lifecycle management
The consistent lifecycle management of a single Kubernetes cluster
is a complex task on its own that is made infinitely more difficult
when you have to manage multiple clusters across different platforms
spread across the globe. Mirantis Container Cloud provides a single,
centralized point from which you can perform full lifecycle management
of your container clusters, including automated updates and upgrades.
Highly regulated industries
Regulated industries need a fine level of access control granularity,
high security standards, and extensive reporting capabilities to ensure
that they can meet and exceed security requirements.
Mirantis Container Cloud provides a fine-grained Role-Based Access
Control (RBAC) mechanism and easy integration and federation with existing
identity management (IDM) systems.
Logging, monitoring, alerting
Complete operational visibility is required to identify and address issues
in the shortest amount of time – before the problem becomes serious.
Mirantis StackLight is the proactive monitoring, logging, and alerting
solution designed for large-scale container and cloud observability with
extensive collectors, dashboards, trend reporting and alerts.
Storage
Cloud environments require a unified pool of storage that can be scaled up by
simply adding storage server nodes. Ceph is a unified, distributed storage
system designed for excellent performance, reliability, and scalability.
Deploy Ceph using Rook to provide and manage robust persistent storage
that can be used by Kubernetes workloads on baremetal-based clusters.
Security
Security is a core concern for all enterprises, especially as more
systems are exposed to the Internet. Mirantis
Container Cloud provides a multi-layered security approach that
includes effective identity management and role-based authentication,
secure out-of-the-box defaults, and extensive security scanning and
monitoring during the development process.
5G and Edge
The introduction of 5G technologies and the support of Edge workloads
require an effective multi-tenant solution to manage the underlying
container infrastructure. Mirantis Container Cloud provides a full-stack,
secure, multi-cloud cluster management and Day-2 operations
solution.
Mirantis Container Cloud is a set of microservices
that are deployed using Helm charts and run in a Kubernetes cluster.
Container Cloud is based on the Kubernetes Cluster API community initiative.
The following diagram illustrates an overview of Container Cloud
and the clusters it manages:
All artifacts used by Kubernetes and workloads are stored
on the Container Cloud content delivery network (CDN):
mirror.mirantis.com (Debian packages including the Ubuntu mirrors)
binary.mirantis.com (Helm charts and binary artifacts)
mirantis.azurecr.io (Docker image registry)
All Container Cloud components are deployed in the Kubernetes clusters.
All Container Cloud APIs are implemented using Kubernetes
Custom Resource Definitions (CRDs) that represent custom objects
stored in Kubernetes and extend the Kubernetes API.
The Container Cloud logic is implemented using controllers.
A controller handles the changes in custom resources defined
in the controller CRD.
A custom resource consists of a spec that describes the desired state
of a resource provided by a user.
During every change, a controller reconciles the external state of a custom
resource with the user parameters and stores this external state in the
status subresource of its custom resource.
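To illustrate the spec and status split, below is a minimal, hypothetical custom resource; the kind and field names are invented for illustration and are not an actual Container Cloud API object.

# Hypothetical resource that only illustrates the spec/status contract
apiVersion: example.mirantis.com/v1alpha1
kind: ExampleResource
metadata:
  name: demo
  namespace: default
spec:
  # Desired state provided by the user; the controller reconciles toward it
  replicas: 3
status:
  # External state observed and written back by the controller
  readyReplicas: 3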
The types of the Container Cloud clusters include:
Bootstrap cluster
Runs the bootstrap process on a seed data center bare metal node that can be
reused after the management cluster deployment for other purposes.
Requires access to the bare metal provider backend.
Initially, the bootstrap cluster is created with the following minimal set
of components: Bootstrap Controller, public API charts, and the Bootstrap
API.
The user can interact with the bootstrap cluster through the Bootstrap API
to create the configuration for a management cluster and start its
deployment. More specifically, the user performs the following operations:
Create required deployment objects.
Optionally add proxy and SSH keys.
Configure the cluster and machines.
Deploy a management cluster.
The user can monitor the deployment progress of the cluster and machines.
After a successful deployment, the user can download the kubeconfig
artifact of the provisioned cluster.
Management cluster
Comprises Container Cloud as a product and provides the following functionality:
Runs all public APIs and services including the web UIs
of Container Cloud.
Does not require access to any provider backend.
Runs the provider-specific services and internal API including
LCMMachine and LCMCluster. Also, it runs an LCM controller for
orchestrating managed clusters and other controllers for handling
different resources.
Requires two-way access to a provider backend. The provider connects
to a backend to spawn managed cluster nodes,
and the agent running on the nodes accesses the management cluster
to obtain the deployment information.
For deployment details of a management cluster, see Deployment Guide.
Managed cluster
A Mirantis Kubernetes Engine (MKE) cluster that an end user
creates using the Container Cloud web UI.
Requires access to its management cluster. Each node of a managed
cluster runs an LCM Agent that connects to the LCM API of the
management cluster to obtain the deployment details.
Supports Mirantis OpenStack for Kubernetes (MOSK). For details, see
MOSK documentation.
All types of the Container Cloud clusters except the bootstrap cluster
are based on the MKE and Mirantis Container Runtime (MCR) architecture.
For details, see MKE and
MCR documentation.
The following diagram illustrates the distribution of services
between each type of the Container Cloud clusters:
The Mirantis Container Cloud provider is the central component of Container
Cloud that provisions a node of a management or managed cluster and runs the
LCM Agent on this node. It runs in a management cluster and requires connection
to a provider backend.
The Container Cloud provider interacts with the following types of public API
objects:
Public API object name
Description
Container Cloud release object
Contains the following information about clusters:
Version of the supported Cluster release for a management cluster
List of supported Cluster releases for the managed clusters
and supported upgrade path
Description of Helm charts that are installed on the management cluster
Cluster release object
Provides a specific version of a management or managed cluster.
Any Cluster release object, as well as a Container Cloud release
object, never changes; only new releases can be added.
Any change leads to a new release of a cluster.
Contains references to all components and their versions
that are used to deploy all cluster types:
LCM components:
LCM Agent
Ansible playbooks
Scripts
Description of steps to execute during a cluster deployment
and upgrade
Helm Controller image references
Supported Helm charts description:
Helm chart name and version
Helm release name
Helm values
Cluster object
References the Credentials, KaaSRelease and ClusterRelease
objects.
Represents all cluster-level resources, for example, networks, load
balancer for the Kubernetes API, and so on. It uses data from the
Credentials object to create these resources and data from the
KaaSRelease and ClusterRelease objects to ensure that all
lower-level cluster objects are created.
Machine object
References the Cluster object.
Represents one node of a managed cluster and contains all data to
provision it.
Credentials object
Contains all information necessary to connect to a provider backend.
PublicKey object
Is provided to every machine to obtain an SSH access.
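For illustration, a heavily abbreviated Machine object may look similar to the following sketch. The API group, label, and provider-specific fields are simplified assumptions and can differ between providers and product releases.

# Abbreviated, illustrative sketch of a Machine object; not a complete schema
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: managed-cluster-worker-0
  namespace: managed-ns
  labels:
    # Reference to the Cluster object this machine belongs to
    cluster.sigs.k8s.io/cluster-name: managed-cluster
spec:
  providerSpec:
    value:
      # Provider-specific data required to provision the node (placeholder)
      kind: BareMetalMachineProviderSpec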
The following diagram illustrates the Container Cloud provider data flow:
The Container Cloud provider performs the following operations
in Container Cloud:
Consumes the following types of data from a management cluster:
Credentials to connect to a provider backend
Deployment instructions from the KaaSRelease and ClusterRelease
objects
The cluster-level parameters from the Cluster objects
The machine-level parameters from the Machine objects
Prepares data for all Container Cloud components:
Creates the LCMCluster and LCMMachine custom resources
for LCM Controller and LCM Agent. The LCMMachine custom resources
are created empty to be later handled by the LCM Controller.
Creates the HelmBundle custom resources for the Helm Controller
using data from the KaaSRelease and ClusterRelease objects.
Creates service accounts for these custom resources.
Creates a scope in Identity and access management (IAM)
for user access to a managed cluster.
Provisions nodes for a managed cluster using the cloud-init script
that downloads and runs the LCM Agent.
The Mirantis Container Cloud Release Controller is responsible
for the following functionality:
Monitor and control the KaaSRelease and ClusterRelease objects
present in a management cluster. If any release object is used
in a cluster, the Release Controller prevents the deletion
of such an object.
Trigger the Container Cloud auto-update procedure if a new
KaaSRelease object is found:
Search for the managed clusters with old Cluster releases
that are not supported by a new Container Cloud release.
If any are detected, abort the auto-update and display
a corresponding note about an old Cluster release in the Container
Cloud web UI for the managed clusters. In this case, a user must update
all managed clusters using the Container Cloud web UI.
Once all managed clusters are updated to the Cluster releases
supported by a new Container Cloud release,
the Container Cloud auto-update is retriggered
by the Release Controller.
Trigger the Container Cloud release update of all Container Cloud
components in a management cluster.
The update itself is processed by the Container Cloud provider.
Trigger the Cluster release update of a management cluster
to the Cluster release version that is indicated
in the updated Container Cloud release version.
The LCMCluster components, such as MKE, are updated before
the HelmBundle components, such as StackLight or Ceph.
Once a management cluster is updated, an option to update
a managed cluster becomes available in the Container Cloud web UI.
During a managed cluster update, all cluster components including
Kubernetes are automatically updated to newer versions if available.
The LCMCluster components, such as MKE, are updated before
the HelmBundle components, such as StackLight or Ceph.
The Operator can delay the Container Cloud automatic upgrade procedure for a
limited amount of time or schedule the upgrade to run at specific hours or on specific weekdays.
For details, see Schedule Mirantis Container Cloud updates.
Container Cloud remains operational during the management cluster upgrade.
Managed clusters are not affected during this upgrade. For the list of
components that are updated during the Container Cloud upgrade, see the
Components versions section of the corresponding Container Cloud release in
Release Notes.
When Mirantis announces support of the newest versions of
Mirantis Container Runtime (MCR) and Mirantis Kubernetes Engine
(MKE), Container Cloud automatically upgrades these components as well.
For the maintenance window best practices before upgrade of these
components, see
MKE Documentation.
The Mirantis Container Cloud web UI is mainly designed
to create and update the managed clusters as well as add or remove machines
to or from an existing managed cluster.
You can use the Container Cloud web UI
to obtain the management cluster details including endpoints, release version,
and so on.
The management cluster update occurs automatically
with a new release change log available through the Container Cloud web UI.
The Container Cloud web UI is a JavaScript application that is based
on the React framework. The Container Cloud web UI is designed to work
on the client side only. Therefore, it does not require a special backend.
It interacts with the Kubernetes and Keycloak APIs directly.
The Container Cloud web UI uses a Keycloak token
to interact with Container Cloud API and download kubeconfig
for the management and managed clusters.
The Container Cloud web UI uses NGINX that runs on a management cluster
and handles the Container Cloud web UI static files.
NGINX proxies the Kubernetes and Keycloak APIs
for the Container Cloud web UI.
The bare metal service provides for the discovery, deployment, and management
of bare metal hosts.
The bare metal management in Mirantis Container Cloud
is implemented as a set of modular microservices.
Each microservice implements a certain requirement or function
within the bare metal management system.
OpenStack Ironic
The backend bare metal manager in a standalone mode with its auxiliary
services that include httpd, dnsmasq, and mariadb.
OpenStack Ironic Inspector
Introspects and discovers the bare metal hosts inventory.
Includes OpenStack Ironic Python Agent (IPA) that is used
as a provision-time agent for managing bare metal hosts.
Ironic Operator
Monitors changes in the external IP addresses of httpd, ironic,
and ironic-inspector and automatically reconciles the configuration
for dnsmasq, ironic, baremetal-provider,
and baremetal-operator.
Bare Metal Operator
Manages bare metal hosts through the Ironic API. The Container Cloud
bare-metal operator implementation is based on the Metal³ project.
Bare metal resources manager
Ensures that the bare metal provisioning artifacts, such as the
distribution image of the operating system, are available and up to date.
cluster-api-provider-baremetal
The plugin for the Kubernetes Cluster API integrated with Container Cloud.
Container Cloud uses the Metal³ implementation of
cluster-api-provider-baremetal for the Cluster API.
HAProxy
Load balancer for external access to the Kubernetes API endpoint.
LCM Agent
Used for managing physical and logical storage, physical and logical
networking, and the life cycle of bare metal machine resources.
Ceph
Distributed shared storage is required by the Container Cloud services
to create persistent volumes to store their data.
MetalLB
Load balancer for Kubernetes services on bare metal.
Keepalived
Monitoring service that ensures availability of the virtual IP for
the external load balancer endpoint (HAProxy).
IPAM
IP address management services provide consistent IP address space
to the machines in bare metal clusters. See details in
IP Address Management.
Mirantis Container Cloud on bare metal uses IP Address Management (IPAM)
to keep track of the network addresses allocated to bare metal hosts.
This is necessary to avoid IP address conflicts
and expiration of address leases to machines through DHCP.
Note
Only IPv4 address family is currently supported by Container Cloud
and IPAM. IPv6 is not supported and not used in Container Cloud.
IPAM is provided by the kaas-ipam controller. Its functions include:
Allocation of IP address ranges or subnets to newly created clusters using
the Subnet resource.
Allocation of IP addresses to machines and cluster services at the request
of baremetal-provider using the IpamHost and IPaddr resources.
Creation and maintenance of host networking configuration
on the bare metal hosts using the IpamHost resources.
The IPAM service can support different networking topologies and network
hardware configurations on the bare metal hosts.
In the most basic network configuration, IPAM uses a single L3 network
to assign addresses to all bare metal hosts, as defined in
Managed cluster networking.
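As an illustration, a minimal Subnet definition may look similar to the following sketch. The API group and field names are assumptions based on typical examples, and all addresses are placeholders; refer to the IPAM documentation for the exact schema.

apiVersion: ipam.mirantis.com/v1alpha1
kind: Subnet
metadata:
  name: lcm-subnet
  namespace: managed-ns
spec:
  cidr: 10.0.10.0/24          # address space for the cluster hosts (placeholder)
  gateway: 10.0.10.1
  includeRanges:
  - 10.0.10.100-10.0.10.200   # addresses that IPAM can assign to machines
  nameservers:
  - 172.18.176.6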
You can apply complex networking configurations to a bare metal host
using the L2 templates. The L2 templates imply multihomed host networking
and enable you to create a managed cluster where nodes use separate host
networks for different types of traffic. Multihoming is required
to ensure the security and performance of a managed cluster.
Caution
Modification of L2 templates in use is allowed with a mandatory
validation step from the Infrastructure Operator to prevent accidental
cluster failures due to unsafe changes. The list of risks posed by modifying
L2 templates includes:
Services running on hosts cannot reconfigure automatically to switch to
the new IP addresses and/or interfaces.
Connections between services are interrupted unexpectedly, which can cause
data loss.
Incorrect configurations on hosts can lead to irrevocable loss of
connectivity between services and unexpected cluster partition or
disassembly.
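For illustration only, the following heavily trimmed L2Template sketch shows the general idea of mapping subnets to host interfaces. The field names and template functions are assumptions based on typical examples, not a complete or authoritative schema.

apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  name: managed-cluster-l2template
  namespace: managed-ns
spec:
  l3Layout:
  - subnetName: lcm-subnet        # subnet referenced by the netplan template below
    scope: namespace
  npTemplate: |
    version: 2
    ethernets:
      {{ nic 0 }}:
        dhcp4: false
    bridges:
      k8s-lcm:                    # bridge that carries LCM traffic
        interfaces: [{{ nic 0 }}]
        addresses:
        - {{ ip "k8s-lcm:lcm-subnet" }}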
The main purpose of networking in a Container Cloud management cluster is to
provide access to the Container Cloud Management API that consists of:
Container Cloud Public API
Used by end users to provision and configure managed clusters and machines.
Includes the Container Cloud web UI.
Container Cloud LCM API
Used by LCM agents in managed clusters to obtain configuration and report
status. Contains provider-specific services and internal API including
LCMMachine and LCMCluster objects.
The following types of networks are supported for the management clusters in
Container Cloud:
PXE network
Enables PXE boot of all bare metal machines in the Container Cloud region.
PXE subnet
Provides IP addresses for DHCP and network boot of the bare metal hosts
for initial inspection and operating system provisioning.
This network may not have the default gateway or a router connected
to it. The PXE subnet is defined by the Container Cloud Operator
during bootstrap.
Provides IP addresses for the bare metal management services of
Container Cloud, such as bare metal provisioning service (Ironic).
These addresses are allocated and served by MetalLB.
Management network
Connects LCM Agents running on the hosts to the Container Cloud LCM API.
Serves the external connections to the Container Cloud Management API.
The network is also used for communication between kubelet
and the Kubernetes API server inside a Kubernetes cluster. The MKE
components use this network for communication inside a swarm cluster.
LCM subnet
Provides IP addresses for the Kubernetes nodes in the management cluster.
This network also provides a Virtual IP (VIP) address for the load
balancer that enables external access to the Kubernetes API
of a management cluster. This VIP is also the endpoint to access
the Container Cloud Management API in the management cluster.
Provides IP addresses for the externally accessible services of
Container Cloud, such as Keycloak, web UI, StackLight.
These addresses are allocated and served by MetalLB.
Kubernetes workloads network
Technology Preview
Serves the internal traffic between workloads on the management cluster.
Kubernetes workloads subnet
Provides IP addresses that are assigned to nodes and used by Calico.
Out-of-Band (OOB) network
Connects to Baseboard Management Controllers of the servers that host
the management cluster. The OOB subnet must be accessible from the
management network through IP routing. The OOB network
is not managed by Container Cloud and is not represented in the IPAM API.
Kubernetes cluster networking is typically focused on connecting pods on
different nodes. On bare metal, however, the cluster networking is more
complex as it needs to facilitate many different types of traffic.
Kubernetes clusters managed by Mirantis Container Cloud
have the following types of traffic:
PXE network
Enables the PXE boot of all bare metal machines in Container Cloud.
This network is not configured on the hosts in a managed cluster.
It is used by the bare metal provider to provision additional
hosts in managed clusters and is disabled on the hosts after
provisioning is done.
Life-cycle management (LCM) network
Connects LCM Agents running on the hosts to the Container Cloud LCM API.
The LCM API is provided by the management cluster.
The LCM network is also used for communication between kubelet
and the Kubernetes API server inside a Kubernetes cluster. The MKE
components use this network for communication inside a swarm cluster.
When using the BGP announcement of the IP address for the cluster API
load balancer, which is available as Technology Preview since
Container Cloud 2.24.4, no segment stretching is required
between Kubernetes master nodes. Also, in this scenario, the load
balancer IP address is not required to match the LCM subnet CIDR address.
LCM subnet(s)
Provides IP addresses that are statically allocated by the IPAM service
to bare metal hosts. This network must be connected to the Kubernetes API
endpoint of the management cluster through an IP router.
LCM Agents running on managed clusters will connect to the management
cluster API through this router. LCM subnets may be different
per managed cluster as long as this connection requirement is satisfied.
The Virtual IP (VIP) address for the load balancer that enables access to
the Kubernetes API of the managed cluster must be allocated from the LCM
subnet.
Cluster API subnet
Technology Preview
Provides a load balancer IP address for external access to the cluster
API. Mirantis recommends that this subnet stays unique per managed
cluster.
Kubernetes workloads network
Serves as an underlay network for traffic between pods in
the managed cluster. Do not share this network between clusters.
Kubernetes workloads subnet(s)
Provides IP addresses that are statically allocated by the IPAM service
to all nodes and that are used by Calico for cross-node communication
inside a cluster. By default, VXLAN overlay is used for Calico
cross-node communication.
Kubernetes external network
Serves ingress traffic to the managed cluster from the outside world.
You can share this network between clusters, but with dedicated subnets
per cluster. Several or all cluster nodes must be connected to
this network. Traffic from external users to the externally available
Kubernetes load-balanced services comes through the nodes that
are connected to this network.
Services subnet(s)
Provides IP addresses for externally available Kubernetes load-balanced
services. The address ranges for MetalLB are assigned from this subnet.
There can be several subnets per managed cluster that define
the address ranges or address pools for MetalLB.
External subnet(s)
Provides IP addresses that are statically allocated by the IPAM service
to nodes. The IP gateway in this network is used as the default route
on all nodes that are connected to this network. This network
allows external users to connect to the cluster services exposed as
Kubernetes load-balanced services. MetalLB speakers must run on the same
nodes. For details, see Configure node selector for MetalLB speaker.
Storage network
Serves storage access and replication traffic from and to Ceph OSD services.
The storage network does not need to be connected to any IP routers
and does not require external access, unless you want to use Ceph
from outside of a Kubernetes cluster.
To use a dedicated storage network, define and configure
both subnets listed below.
Storage access subnet(s)
Provides IP addresses that are statically allocated by the IPAM service
to Ceph nodes.
The Ceph OSD services bind to these addresses on their respective
nodes. Serves Ceph access traffic from and to storage clients.
This is a public network in Ceph terms.
Storage replication subnet(s)
Provides IP addresses that are statically allocated by the IPAM service
to Ceph nodes.
The Ceph OSD services bind to these addresses on their respective
nodes. Serves Ceph internal replication traffic. This is a
cluster network in Ceph terms.
Out-of-Band (OOB) network
Connects baseboard management controllers (BMCs) of the bare metal hosts.
This network must not be accessible from the managed clusters.
The following diagram illustrates the networking schema of the Container Cloud
deployment on bare metal with a managed cluster:
The following network roles are defined for all Mirantis Container Cloud
cluster nodes on bare metal, including the bootstrap, management, and managed
cluster nodes:
Out-of-band (OOB) network
Connects the Baseboard Management Controllers (BMCs) of the hosts
in the network to Ironic. This network is out of band for the
host operating system.
PXE network
Enables remote booting of servers through the PXE protocol. In management
clusters, the DHCP server listens on this network for host discovery and
inspection. In managed clusters, hosts use this network for the initial
PXE boot and provisioning.
LCM network
Connects LCM Agents running on the node to the LCM API of the management
cluster. It is also used for communication between kubelet and the
Kubernetes API server inside a Kubernetes cluster. The MKE components use
this network for communication inside a swarm cluster.
In management clusters, it is replaced by the management network.
Kubernetes workloads (pods) network
Technology Preview
Serves connections between Kubernetes pods.
Each host has an address on this network, and this address is used
by Calico as an endpoint to the underlay network.
Kubernetes external network
Technology Preview
Serves external connection to the Kubernetes API
and the user services exposed by the cluster. In management clusters,
it is replaced by the management network.
Management network
Serves external connections to the Container Cloud Management API and
services of the management cluster. Not available in a managed cluster.
Storage access network
Connects Ceph nodes to the storage clients. The Ceph OSD service is
bound to the address on this network. This is a public network in
Ceph terms.
Storage replication network
Connects Ceph nodes to each other. Serves internal replication traffic.
This is a cluster network in Ceph terms.
Each network is represented on the host by a virtual Linux bridge. Physical
interfaces may be connected to one of the bridges directly, or through a
logical VLAN subinterface, or combined into a bond interface that is in
turn connected to a bridge.
The following table summarizes the default names used for the bridges
connected to the networks listed above:
The baremetal-based Mirantis Container Cloud uses Ceph as a distributed
storage system for file, block, and object storage. This section provides an
overview of a Ceph cluster deployed by Container Cloud.
Mirantis Container Cloud deploys Ceph on baremetal-based managed clusters
using Helm charts with the following components:
Rook Ceph Operator
A storage orchestrator that deploys Ceph on top of a Kubernetes cluster. Also
known as Rook or RookOperator. Rook operations include:
Deploying and managing a Ceph cluster based on provided Rook CRs such as
CephCluster, CephBlockPool, CephObjectStore, and so on.
Orchestrating the state of the Ceph cluster and all its daemons.
KaaSCephCluster custom resource (CR)
Represents the customization of a Kubernetes installation and allows you to
define the required Ceph configuration through the Container Cloud web UI
before deployment. For example, you can define the failure domain, Ceph pools,
Ceph node roles, number of Ceph components such as Ceph OSDs, and so on.
The ceph-kcc-controller controller on the Container Cloud management
cluster manages the KaaSCephCluster CR.
Ceph Controller
A Kubernetes controller that obtains the parameters from Container Cloud
through a CR, creates CRs for Rook and updates its CR status based on the Ceph
cluster deployment progress. It creates users, pools, and keys for OpenStack
and Kubernetes and provides Ceph configurations and keys to access them. Also,
Ceph Controller eventually obtains the data from the OpenStack Controller for
the Keystone integration and updates the RADOS Gateway services configurations
to use Kubernetes for user authentication. Ceph Controller operations include:
Transforming user parameters from the Container Cloud Ceph CR into Rook CRs
and deploying a Ceph cluster using Rook.
Providing integration of the Ceph cluster with Kubernetes.
Providing data for OpenStack to integrate with the deployed Ceph cluster.
Ceph Status Controller
A Kubernetes controller that collects all valuable parameters from the current
Ceph cluster, its daemons, and entities and exposes them into the
KaaSCephCluster status. Ceph Status Controller operations include:
Collecting all statuses from a Ceph cluster and corresponding Rook CRs.
Collecting additional information on the health of Ceph daemons.
Providing information to the status section of the KaaSCephCluster
CR.
Ceph Request Controller
A Kubernetes controller that obtains the parameters from Container Cloud
through a CR and handles Ceph OSD lifecycle management (LCM) operations. It
allows for a safe Ceph OSD removal from the Ceph cluster. Ceph Request
Controller operations include:
Providing an ability to perform Ceph OSD LCM operations.
Obtaining specific CRs to remove Ceph OSDs and executing them.
Pausing the regular Ceph Controller reconcile until all requests are
completed.
A typical Ceph cluster consists of the following components:
Ceph Monitors - three or, in rare cases, five Ceph Monitors.
Ceph Managers:
Before Container Cloud 2.22.0, one Ceph Manager.
Since Container Cloud 2.22.0, two Ceph Managers.
RADOS Gateway services - Mirantis recommends having three or more RADOS
Gateway instances for HA.
Ceph OSDs - the number of Ceph OSDs may vary according to the deployment
needs.
Warning
A Ceph cluster with 3 Ceph nodes does not provide
hardware fault tolerance and is not eligible
for recovery operations,
such as a disk or an entire Ceph node replacement.
A Ceph cluster uses the replication factor that equals 3.
If the number of Ceph OSDs is less than 3, a Ceph cluster
moves to the degraded state with the write operations
restriction until the number of alive Ceph OSDs
equals the replication factor again.
The placement of Ceph Monitors and Ceph Managers is defined in the
KaaSCephCluster CR.
The following diagram illustrates the way a Ceph cluster is deployed in
Container Cloud:
The following diagram illustrates the processes within a deployed Ceph cluster:
A Ceph cluster configuration in Mirantis Container Cloud
includes but is not limited to the following limitations:
Only one Ceph Controller per managed cluster and only one Ceph cluster per
Ceph Controller are supported.
The replication size for any Ceph pool must be set to more than 1.
All CRUSH rules must have the same failure_domain.
Only one CRUSH tree per cluster. The separation of devices per Ceph pool is
supported through device classes
with only one pool of each type for a device class.
Only the following types of CRUSH buckets are supported:
topology.kubernetes.io/region
topology.kubernetes.io/zone
topology.rook.io/datacenter
topology.rook.io/room
topology.rook.io/pod
topology.rook.io/pdu
topology.rook.io/row
topology.rook.io/rack
topology.rook.io/chassis
Only IPv4 is supported.
If two or more Ceph OSDs are located on the same device, there must be no
dedicated WAL or DB for this class.
Only a full collocation or dedicated WAL and DB configurations are supported.
The minimum size of any defined Ceph OSD device is 5 GB.
Lifted since Container Cloud 2.24.2 (Cluster releases 14.0.1 and 15.0.1).
Ceph cluster does not support removable devices (with hotplug enabled) for
deploying Ceph OSDs.
Ceph OSDs support only raw disks as data devices meaning that no dm or
lvm devices are allowed.
When adding a Ceph node with the Ceph Monitor role, if any issues occur with
the Ceph Monitor, rook-ceph removes it and adds a new Ceph Monitor instead,
named using the next alphabetic character in order. Therefore, the Ceph Monitor
names may not follow the alphabetical order. For example, a, b, d,
instead of a, b, c.
Reducing the number of Ceph Monitors is not supported and causes removal
of the Ceph Monitor daemons from random nodes.
Removal of the mgr role in the nodes section of the
KaaSCephCluster CR does not remove Ceph Managers. To remove a Ceph
Manager from a node, remove it from the nodes spec and manually delete
the mgr pod in the Rook namespace.
Lifted since Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.10).
Ceph does not support allocation of Ceph RGW pods on nodes where the
Federal Information Processing Standard (FIPS) mode is enabled.
There are several formats to use when specifying and addressing storage devices
of a Ceph cluster. The default and recommended one is the /dev/disk/by-id
format. This format is reliable and unaffected by the disk controller actions,
such as device name shuffling or /dev/disk/by-path recalculating.
Difference between by-id, name, and by-path formats
The storage device /dev/disk/by-id format is, in most cases, based on
a disk serial number, which is unique for each disk. A by-id symlink
is created by the udev rules in the following format, where <BusID>
is an ID of the bus to which the disk is attached and <DiskSerialNumber>
stands for a unique disk serial number:
/dev/disk/by-id/<BusID>-<DiskSerialNumber>
Typical by-id symlinks for storage devices look as follows:
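/dev/disk/by-id/nvme-SAMSUNG_MZ1LB3T8HMLA-00007_S46FNY0R394543
/dev/disk/by-id/scsi-SATA_HGST_HUS724040AL_PN1334PEHN18ZS
/dev/disk/by-id/ata-WDC_WD4003FZEX-00Z4SA0_WD-WMC5D0D9DMEH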
In the example above, symlinks contain the following IDs:
Bus IDs: nvme, scsi-SATA and ata
Disk serial numbers: SAMSUNG_MZ1LB3T8HMLA-00007_S46FNY0R394543,
HGST_HUS724040AL_PN1334PEHN18ZS and
WDC_WD4003FZEX-00Z4SA0_WD-WMC5D0D9DMEH.
An exception to this rule is the wwn by-id symlinks, which are
programmatically generated at boot. They are not solely based on disk
serial numbers but also include other node information. This can lead
to the wwn being recalculated when the node reboots. As a result,
this symlink type cannot guarantee a persistent disk identifier and should
not be used as a stable storage device symlink in a Ceph cluster.
The storage device name and by-path formats cannot be considered
persistent because the sequence in which block devices are added during boot
is semi-arbitrary. This means that block device names, for example, nvme0n1
and sdc, are assigned to physical disks during discovery, which may vary
inconsistently from the previous node state. The same inconsistency applies
to by-path symlinks, as they rely on the shortest physical path
to the device at boot and may differ from the previous node state.
Therefore, Mirantis highly recommends using storage device by-id symlinks
that contain disk serial numbers. This approach enables you to use a persistent
device identifier addressed in the Ceph cluster specification.
Example KaaSCephCluster with device by-id identifiers
Below is an example KaaSCephCluster custom resource using the
/dev/disk/by-id format for storage devices specification:
Note
Container Cloud enables you to use fullPath for the by-id
symlinks since 2.25.0. For the earlier product versions, use the name
field instead.
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephCluster
metadata:
  name: ceph-cluster-managed-cluster
  namespace: managed-ns
spec:
  cephClusterSpec:
    nodes:
      # Add the exact ``nodes`` names.
      # Obtain the name from the "get machine" list.
      cz812-managed-cluster-storage-worker-noefi-58spl:
        roles:
        - mgr
        - mon
        # All disk configuration must be reflected in ``status.providerStatus.hardware.storage`` of the ``Machine`` object
        storageDevices:
        - config:
            deviceClass: ssd
          fullPath: /dev/disk/by-id/scsi-1ATA_WDC_WDS100T2B0A-00SM50_200231440912
      cz813-managed-cluster-storage-worker-noefi-lr4k4:
        roles:
        - mgr
        - mon
        storageDevices:
        - config:
            deviceClass: nvme
          fullPath: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB3T8HMLA-00007_S46FNY0R394543
      cz814-managed-cluster-storage-worker-noefi-z2m67:
        roles:
        - mgr
        - mon
        storageDevices:
        - config:
            deviceClass: nvme
          fullPath: /dev/disk/by-id/nvme-SAMSUNG_ML1EB3T8HMLA-00007_S46FNY1R130423
    pools:
    - default: true
      deviceClass: ssd
      name: kubernetes
      replicated:
        size: 3
      role: kubernetes
  k8sCluster:
    name: managed-cluster
    namespace: managed-ns
Migrating device names used in KaaSCephCluster to device by-id symlinks
The majority of existing clusters use device names as storage device
identifiers in the spec.cephClusterSpec.nodes section of
the KaaSCephCluster custom resource. Therefore, they are prone
to the issue of inconsistent storage device identifiers during cluster
update. Refer to Migrate Ceph cluster to address storage devices using by-id to mitigate possible
risks.
Mirantis Container Cloud provides APIs that enable you
to define hardware configurations that extend the reference architecture:
Bare Metal Host Profile API
Enables quick configuration of host boot and storage devices
and assignment of custom configuration profiles to individual machines.
See Create a custom bare metal host profile.
IP Address Management API
Enables quick configuration of host network interfaces and IP addresses
and setup of IP address ranges for automatic allocation.
See Create L2 templates.
Typically, operations with the extended hardware configurations are available
through the API and CLI, but not the web UI.
To keep the operating system on a bare metal host up to date with the latest
security updates, the operating system requires periodic software
package upgrades that may or may not require a host reboot.
Mirantis Container Cloud uses life cycle management tools to update
the operating system packages on the bare metal hosts. Container Cloud
may also trigger restart of bare metal hosts to apply the updates.
In the management cluster of Container Cloud, software package upgrades and
host restarts are applied automatically when a new Container Cloud version
with available kernel or software package upgrades is released.
In managed clusters, package upgrades and host restarts are applied
as part of the regular cluster update using the Update cluster option
in the Container Cloud web UI.
Operating system upgrade and host restart are applied to cluster
nodes one by one. If Ceph is installed in the cluster, the Container
Cloud orchestration securely pauses the Ceph OSDs on the node before
restart. This allows avoiding degradation of the storage service.
Caution
Depending on the cluster configuration, applying security
updates and host restart can increase the update time for each node to up to
1 hour.
Cluster nodes are updated one by one. Therefore, for large clusters,
the update may take several days to complete.
The Mirantis Container Cloud managed clusters use MetalLB for load balancing
of services and HAProxy with VIP managed by Virtual Router Redundancy Protocol
(VRRP) with Keepalived for the Kubernetes API load balancer.
Every control plane node of each Kubernetes cluster runs the kube-api
service in a container. This service provides a Kubernetes API endpoint.
Every control plane node also runs the haproxy server that provides
load balancing with backend health checking for all kube-api endpoints as
backends.
The default load balancing method is least_conn. With this method,
a request is sent to the server with the least number of active
connections. The default load balancing method cannot be changed
using the Container Cloud API.
Only one of the control plane nodes at any given time serves as a
front end for Kubernetes API. To ensure this, the Kubernetes clients
use a virtual IP (VIP) address for accessing Kubernetes API.
This VIP is assigned to one node at a time using VRRP. Keepalived running on
each control plane node provides health checking and failover of the VIP.
Keepalived is configured in multicast mode.
Note
The use of VIP address for load balancing of Kubernetes API requires
that all control plane nodes of a Kubernetes cluster are connected to a
shared L2 segment. This limitation prevents installing full L3
topologies where control plane nodes are split between different L2 segments
and L3 networks.
The services provided by the Kubernetes clusters, including Container Cloud and
user services, are balanced by MetalLB. The metallb-speaker service runs on
every worker node in the cluster and handles connections to the service IP
addresses.
MetalLB runs in the MAC-based (L2) mode. It means that all control plane nodes
must be connected to a shared L2 segment. This is a limitation that does not
allow installing full L3 cluster topologies.
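For reference, address pools for MetalLB in L2 mode are expressed with the standard MetalLB resources shown in the sketch below. In Container Cloud, these objects are normally derived from the product configuration rather than created by hand, and the address range is a placeholder.

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: services-pool
  namespace: metallb-system
spec:
  addresses:
  - 10.0.20.100-10.0.20.150   # range taken from the Services subnet (placeholder)
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: services-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - services-pool             # announce the pool in L2 (ARP) mode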
The Kubernetes lifecycle management (LCM) engine in Mirantis Container Cloud
consists of the following components:
LCM Controller
Responsible for all LCM operations. Consumes the LCMCluster object
and orchestrates actions through LCM Agent.
LCM Agent
Runs on the target host. Executes Ansible playbooks in headless mode.
Helm Controller
Responsible for the life cycle of Helm charts. It is installed by the provider
as a Helm v3 chart.
The Kubernetes LCM components handle the following custom resources:
LCMCluster
LCMMachine
HelmBundle
The following diagram illustrates handling of the LCM custom resources by the
Kubernetes LCM components. On a managed cluster, apiserver handles multiple
Kubernetes objects, for example, deployments, nodes, RBAC, and so on.
The Kubernetes LCM components handle the following custom resources (CRs):
LCMMachine
LCMCluster
HelmBundle
LCMMachine
Describes a machine that is located on a cluster.
It contains the machine type, control or worker,
StateItems that correspond to Ansible playbooks and miscellaneous actions,
for example, downloading a file or executing a shell command.
LCMMachine reflects the current state of the machine, for example,
a node IP address, and each StateItem through its status.
Multiple LCMMachine CRs can correspond to a single cluster.
LCMCluster
Describes a managed cluster. In its spec,
LCMCluster contains a set of StateItems for each type of LCMMachine,
which describe the actions that must be performed to deploy the cluster.
LCMCluster is created by the provider, using machineTypes
of the Release object. The status field of LCMCluster
reflects the status of the cluster,
for example, the number of ready or requested nodes.
HelmBundle
Wrapper for Helm charts that is handled by Helm Controller.
HelmBundle tracks what Helm charts must be installed
on a managed cluster.
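As an illustration, a trimmed-down HelmBundle may look similar to the following sketch. The chart reference, version, and values are placeholders, and the field set is abbreviated rather than authoritative.

apiVersion: lcm.mirantis.com/v1alpha1
kind: HelmBundle
metadata:
  name: managed-cluster
  namespace: managed-ns
spec:
  releases:
  - name: stacklight               # Helm release name
    chart: <repository>/stacklight # chart reference (placeholder)
    version: 0.1.0                 # placeholder version
    namespace: stacklight
    values: {}                     # release values applied by Helm Controller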
LCM Controller runs on the management cluster and orchestrates the
LCMMachine objects according to their type and their LCMCluster object.
Once the LCMCluster and LCMMachine objects are created, LCM Controller
starts monitoring them to modify the spec fields and update
the status fields of the LCMMachine objects when required.
The status field of LCMMachine is updated by LCM Agent
running on a node of a management or managed cluster.
Each LCMMachine has the following lifecycle states:
Uninitialized - the machine is not yet assigned to an LCMCluster.
Pending - the agent reports a node IP address and host name.
Prepare - the machine executes StateItems that correspond
to the prepare phase. This phase usually involves downloading
the necessary archives and packages.
Deploy - the machine executes StateItems that correspond
to the deploy phase that is becoming a Mirantis Kubernetes Engine (MKE)
node.
Ready - the machine is deployed.
Upgrade - the machine is being upgraded to the new MKE version.
Reconfigure - the machine executes StateItems that correspond
to the reconfigure phase. The machine configuration is being updated
without affecting workloads running on the machine.
The templates for StateItems are stored in the machineTypes
field of an LCMCluster object, with separate lists
for the MKE manager and worker nodes.
Each StateItem has the execution phase field for a management and
managed cluster:
The prepare phase is executed for all machines for which
it was not executed yet. This phase comprises downloading the files
necessary for the cluster deployment, installing the required packages,
and so on.
During the deploy phase, a node is added to the cluster.
LCM Controller applies the deploy phase to the nodes
in the following order:
First manager node is deployed.
The remaining manager nodes are deployed one by one
and the worker nodes are deployed in batches (by default,
up to 50 worker nodes at the same time).
LCM Controller deploys and upgrades a Mirantis Container Cloud cluster
by setting StateItems of LCMMachine objects following the corresponding
StateItems phases described above. The Container Cloud cluster upgrade
process follows the same logic that is used for a new deployment,
that is, applying a new set of StateItems to the LCMMachine objects after
updating the LCMCluster object. However, if an existing worker node is being
upgraded, LCM Controller performs draining and cordoning on this node, honoring
the Pod Disruption Budgets.
This operation prevents unexpected disruptions of the workloads.
LCM Agent handles a single machine that belongs to a management or managed
cluster. It runs on the machine operating system but communicates with
apiserver of the management cluster. LCM Agent is deployed as a systemd
unit using cloud-init. LCM Agent has a built-in self-upgrade mechanism.
LCM Agent monitors the spec of a particular LCMMachine object
to reconcile the machine state with the object StateItems and update
the LCMMachine status accordingly. The actions that LCM Agent performs
while handling the StateItems are as follows:
Download configuration files
Run shell commands
Run Ansible playbooks in headless mode
LCM Agent provides the IP address and host name of the machine for the
LCMMachine status parameter.
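The following abbreviated sketch illustrates how this data may appear in an LCMMachine object. The field names are indicative only and do not reproduce the exact schema.

apiVersion: lcm.mirantis.com/v1alpha1
kind: LCMMachine
metadata:
  name: managed-cluster-worker-0
  namespace: managed-ns
spec:
  type: worker            # control or worker
status:
  # Reported by LCM Agent (indicative field names)
  hostname: worker-0
  ipAddress: 10.0.10.101
  state: Ready            # current lifecycle state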
Helm Controller is used by Mirantis Container Cloud to handle the core addons
of management and managed clusters, such as StackLight, and the application
addons, such as the OpenStack components.
Helm Controller is installed as a separate Helm v3 chart by the Container
Cloud provider. Its Pods are created using Deployment.
The Helm release information is stored in the KaaSRelease object for
the management clusters and in the ClusterRelease object for all types of
the Container Cloud clusters.
These objects are used by the Container Cloud provider.
The Container Cloud provider uses the information from the
ClusterRelease object together with the Container Cloud API
Cluster spec. In the Cluster spec, the operator can specify
the Helm release name and charts to use.
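For illustration, Helm releases may be referenced in the Cluster object approximately as follows. The providerSpec structure is simplified and the values shown are placeholders.

apiVersion: cluster.k8s.io/v1alpha1
kind: Cluster
metadata:
  name: managed-cluster
  namespace: managed-ns
spec:
  providerSpec:
    value:
      helmReleases:
      - name: stacklight        # Helm release to install on the cluster
        values:
          logging:
            enabled: true       # placeholder values override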
By combining the information from the Cluster providerSpec parameter
and its ClusterRelease object, the cluster actuator generates
the LCMCluster objects. These objects are further handled by LCM Controller
and the HelmBundle object handled by Helm Controller.
HelmBundle must have the same name as the LCMCluster object
for the cluster that HelmBundle applies to.
Although a cluster actuator can only create a single HelmBundle
per cluster, Helm Controller can handle multiple HelmBundle objects
per cluster.
Helm Controller handles the HelmBundle objects and reconciles them with the
state of Helm in its cluster.
Helm Controller can also be used by the management cluster with corresponding
HelmBundle objects created as part of the initial management cluster setup.
Identity and access management (IAM) provides a central point
of users and permissions management of the Mirantis Container
Cloud cluster resources in a granular and unified manner.
Also, IAM provides infrastructure for single sign-on user experience
across all Container Cloud web portals.
IAM for Container Cloud consists of the following components:
Keycloak
Provides the OpenID Connect endpoint
Integrates with an external identity provider (IdP), for example,
existing LDAP or Google Open Authorization (OAuth)
Stores roles mapping for users
IAM Controller
Provides IAM API with data about Container Cloud projects
Handles all role-based access control (RBAC) components in Kubernetes API
IAM API
Provides an abstraction API for creating user scopes and roles
To be consistent and keep the integrity of a user database
and user permissions, in Mirantis Container Cloud,
IAM stores the user identity information internally.
However, in real deployments, an identity provider usually already exists.
Out of the box, in Container Cloud, IAM supports
integration with LDAP and Google Open Authorization (OAuth).
If LDAP is configured as an external identity provider,
IAM performs one-way synchronization by mapping attributes according
to configuration.
In the case of the Google Open Authorization (OAuth) integration,
the user is automatically registered and their credentials are stored
in the internal database according to the user template configuration.
The Google OAuth registration workflow is as follows:
The user requests a Container Cloud web UI resource.
The user is redirected to the IAM login page and logs in using
the Log in with Google account option.
IAM creates a new user with the default access rights that are defined
in the user template configuration.
The user can access the Container Cloud web UI resource.
The following diagram illustrates the external IdP integration to IAM:
You can configure simultaneous integration with both external IdPs
with the user identity matching feature enabled.
Mirantis IAM acts as an OpenID Connect (OIDC) provider:
it issues tokens and exposes discovery endpoints.
The credentials can be handled by IAM itself or delegated
to an external identity provider (IdP).
The issued JSON Web Token (JWT) is sufficient to perform operations across
Mirantis Container Cloud according to the scope and role defined
in it. Mirantis recommends using asymmetric cryptography for token signing
(RS256) to minimize the dependency between IAM and managed components.
When Container Cloud calls Mirantis Kubernetes Engine (MKE),
the user in Keycloak is created automatically with a JWT issued by Keycloak
on behalf of the end user.
MKE, in its turn, verifies whether the JWT is issued by Keycloak. If
the user retrieved from the token does not exist in the MKE database,
the user is automatically created in the MKE database based on the
information from the token.
The authorization implementation is out of the scope of IAM in Container
Cloud. This functionality is delegated to the component level.
IAM interacts with a Container Cloud component through the OIDC token
content, which the component processes itself to enforce the required
authorization. Such an approach enables you to have any underlying
authorization that does not depend on IAM and still provide a unified user experience
across all Container Cloud components.
The following diagram illustrates the Kubernetes CLI authentication flow.
The authentication flow for Helm and other Kubernetes-oriented CLI utilities
is identical to the Kubernetes CLI flow,
but JSON Web Tokens (JWT) must be pre-provisioned.
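As an example, a kubeconfig user entry that carries a pre-provisioned OIDC token can look similar to the following sketch; the issuer URL, client ID, and token values are placeholders.

users:
- name: operator@example.com
  user:
    auth-provider:
      name: oidc
      config:
        idp-issuer-url: https://<keycloak-address>/auth/realms/iam   # Keycloak realm URL (placeholder)
        client-id: kubernetes
        id-token: <JWT>              # pre-provisioned JSON Web Token
        refresh-token: <refresh-JWT>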
Mirantis Container Cloud uses StackLight, the logging, monitoring, and
alerting solution that provides a single pane of glass for cloud maintenance
and day-to-day operations as well as offers critical insights into cloud
health including operational information about the components deployed in
management and managed clusters.
StackLight is based on Prometheus, an open-source monitoring solution and a
time series database.
Mirantis Container Cloud deploys the StackLight stack
as a release of a Helm chart that contains the helm-controller
and helmbundles.lcm.mirantis.com (HelmBundle) custom resources.
The StackLight HelmBundle consists of a set of Helm charts
with the StackLight components that include:
Alerta
Receives, consolidates, and deduplicates the alerts sent by Alertmanager
and visually represents them through a simple web UI. Using the Alerta
web UI, you can view the most recent or watched alerts, group, and
filter alerts.
Alertmanager
Handles the alerts sent by client applications such as Prometheus,
deduplicates, groups, and routes alerts to receiver integrations.
Using the Alertmanager web UI, you can view the most recent fired
alerts, silence them, or view the Alertmanager configuration.
Elasticsearch Curator
Maintains the data (indexes) in OpenSearch by performing
such operations as creating, closing, or opening an index as well as
deleting a snapshot. Also, manages the data retention policy in
OpenSearch.
Elasticsearch Exporter Compatible with OpenSearch
The Prometheus exporter that gathers internal OpenSearch metrics.
Grafana
Builds and visually represents metric graphs based on time series
databases. Grafana supports querying of Prometheus using the PromQL
language.
Database backends
StackLight uses PostgreSQL for Alerta and Grafana. PostgreSQL reduces
the data storage fragmentation while enabling high availability.
High availability is achieved using Patroni, the PostgreSQL cluster
manager that monitors for node failures and manages failover
of the primary node. StackLight also uses Patroni to manage major
version upgrades of PostgreSQL clusters, which allows leveraging
the database engine functionality and improvements
as they are introduced upstream in new releases,
maintaining functional continuity without version lock-in.
Logging stack
Responsible for collecting, processing, and persisting logs and
Kubernetes events. By default, when deploying through the Container
Cloud web UI, only the metrics stack is enabled on managed clusters. To
enable StackLight to gather managed cluster logs, enable the logging
stack during deployment. On management clusters, the logging stack is
enabled by default. The logging stack components include:
OpenSearch, which stores logs and notifications.
Fluentd-logs, which collects logs, sends them to OpenSearch, generates
metrics based on analysis of incoming log entries, and exposes these
metrics to Prometheus.
OpenSearch Dashboards, which provides real-time visualization of
the data stored in OpenSearch and enables you to detect issues.
Metricbeat, which collects Kubernetes events and sends them to
OpenSearch for storage.
Prometheus-es-exporter, which presents the OpenSearch data
as Prometheus metrics by periodically sending configured queries to
the OpenSearch cluster and exposing the results to a scrapable HTTP
endpoint like other Prometheus targets.
Note
The logging mechanism performance depends on the cluster log load. In
case of a high load, you may need to increase the default resource requests
and limits for fluentdLogs. For details, see
StackLight configuration parameters: Resource limits.
Metric collector
Collects telemetry data (CPU or memory usage, number of active alerts,
and so on) from Prometheus and sends the data to centralized cloud
storage for further processing and analysis. Metric collector runs on
the management cluster.
Note
This component is designated for internal StackLight use only.
Prometheus
Gathers metrics. Automatically discovers and monitors the endpoints.
Using the Prometheus web UI, you can view simple visualizations and
debug. By default, the Prometheus database stores metrics of the past 15
days or up to 15 GB of data depending on the limit that is reached
first.
Prometheus Blackbox Exporter
Allows monitoring endpoints over HTTP, HTTPS, DNS, TCP, and ICMP.
Prometheus-es-exporter
Presents the OpenSearch data as Prometheus metrics by periodically
sending configured queries to the OpenSearch cluster and exposing the
results to a scrapable HTTP endpoint like other Prometheus targets.
Prometheus Node Exporter
Gathers hardware and operating system metrics exposed by the kernel.
Prometheus Relay
Adds a proxy layer to Prometheus to merge the results from underlay
Prometheus servers to prevent gaps in case some data is missing on
some servers. Is available only in the HA StackLight mode.
Salesforce notifier
Enables sending Alertmanager notifications to Salesforce to allow
creating Salesforce cases and closing them once the alerts are resolved.
Disabled by default.
Salesforce reporter
Queries Prometheus for the data about the amount of vCPU, vRAM, and
vStorage used and available, combines the data, and sends it to
Salesforce daily. Mirantis uses the collected data for further analysis
and reports to improve the quality of customer support. Disabled by
default.
Telegraf
Collects metrics from the system. Telegraf is plugin-driven and has
the concept of two distinct sets of plugins: input plugins collect
metrics from the system, services, or third-party APIs; output plugins
write and expose metrics to various destinations.
The Telegraf agents used in Container Cloud include:
telegraf-ds-smart monitors SMART disks, and runs on both
management and managed clusters.
telegraf-ironic monitors Ironic on the baremetal-based
management clusters. The ironic input plugin collects and
processes data from Ironic HTTP API, while the http_response
input plugin checks Ironic HTTP API availability. As an output plugin,
to expose collected data as Prometheus target, Telegraf uses
prometheus.
telegraf-docker-swarm gathers metrics from the Mirantis Container
Runtime API about the Docker nodes, networks, and Swarm services. This
is a Docker Telegraf input plugin with downstream additions.
Telemeter
Enables a multi-cluster view through a Grafana dashboard of the
management cluster. Telemeter includes a Prometheus federation push
server and clients to enable isolated Prometheus instances, which
cannot be scraped from a central Prometheus instance, to push metrics
to the central location.
The Telemeter services are distributed between the management cluster
that hosts the Telemeter server and managed clusters that host the
Telemeter client. The metrics from managed clusters are aggregated
on management clusters.
Note
This component is designated for internal StackLight use only.
Every Helm chart contains a default values.yaml file. These default values
are partially overridden by custom values defined in the StackLight Helm chart.
Before deploying a managed cluster, you can select the HA or non-HA StackLight
architecture type. The non-HA mode is set by default on managed clusters. On
management clusters, StackLight is deployed in the HA mode only.
The following table lists the differences between the HA and non-HA modes:
Non-HA StackLight mode:
One Prometheus instance
One Alertmanager instance (Since 2.24.0 and 2.24.2 for MOSK 23.2)
One OpenSearch instance
One PostgreSQL instance
One iam-proxy instance
One persistent volume is provided for storing data. In case of a service
or node failure, a new pod is redeployed and the volume is reattached to
provide the existing data. Such a setup has a reduced hardware footprint
but provides less performance.
HA StackLight mode:
Two Prometheus instances
Two Alertmanager instances
Three OpenSearch instances
Three PostgreSQL instances
Two iam-proxy instances (Since 2.23.0 and 2.23.1 for MOSK 23.1)
Local Volume Provisioner is used to provide local host storage. In case
of a service or node failure, the traffic is automatically redirected to
any other running Prometheus or OpenSearch server. For better
performance, Mirantis recommends that you deploy StackLight in the HA
mode. Two iam-proxy instances ensure access to HA components if one
iam-proxy node fails.
Note
Before Container Cloud 2.24.0, Alertmanager has 2 replicas in the
non-HA mode.
Caution
Non-HA StackLight requires a backend storage provider,
for example, a Ceph cluster. For details, see Storage.
Depending on the Container Cloud cluster type and selected StackLight database
mode, StackLight is deployed on the following number of nodes:
StackLight provides five web UIs including Prometheus, Alertmanager, Alerta,
OpenSearch Dashboards, and Grafana. Access to StackLight web UIs is protected
by Keycloak-based Identity and access management (IAM). All web UIs except
Alerta are exposed to IAM through the IAM proxy middleware. The Alerta
configuration provides direct integration with IAM.
The following diagram illustrates accessing the IAM-proxied StackLight web UIs,
for example, Prometheus web UI:
Authentication flow for the IAM-proxied StackLight web UIs:
A user enters the public IP of a StackLight web UI, for example, Prometheus
web UI.
The public IP leads to IAM proxy, deployed as a Kubernetes LoadBalancer,
which protects the Prometheus web UI.
LoadBalancer routes the HTTP request to Kubernetes internal IAM proxy
service endpoints, specified in the X-Forwarded-Proto or X-Forwarded-Host
headers.
The Keycloak login form opens (the login_url field in the IAM proxy
configuration, which points to Keycloak realm) and the user enters
the user name and password.
Keycloak validates the user name and password.
The user obtains access to the Prometheus web UI (the upstreams field
in the IAM proxy configuration).
Note
The discovery URL is the URL of the IAM service.
The upstream URL is the hidden endpoint of a web UI (Prometheus web UI in
the example above).
The following diagram illustrates accessing the Alerta web UI:
Authentication flow for the Alerta web UI:
A user enters the public IP of the Alerta web UI.
The public IP leads to Alerta deployed as a Kubernetes LoadBalancer type.
LoadBalancer routes the HTTP request to the Kubernetes internal Alerta
service endpoint.
The Keycloak login form opens (Alerta refers to the IAM realm) and
the user enters the user name and password.
Using the Mirantis Container Cloud web UI,
at the pre-deployment stage of a managed cluster,
you can view, enable or disable, or tune the following StackLight features:
StackLight HA mode.
Database retention size and time for Prometheus.
Tunable index retention period for OpenSearch.
Tunable PersistentVolumeClaim (PVC) size for Prometheus and OpenSearch
set to 16 GB for Prometheus and 30 GB for OpenSearch by
default. The PVC size must be logically aligned with the retention periods or
sizes for these components.
Email and Slack receivers for the Alertmanager notifications.
Predefined set of dashboards.
Predefined set of alerts and capability to add
new custom alerts for Prometheus in the following exemplary format:
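The following is a minimal sketch of such a custom alert as it is typically defined in the StackLight Helm chart values. The enclosing parameter name (prometheusServer.customAlerts) and the alert name and expression are assumptions for illustration; the alert body itself follows the standard Prometheus alerting rule format:
prometheusServer:
  customAlerts:
  # Hypothetical alert for illustration only: fires when a node-exporter target is unreachable
  - alert: ExampleNodeExporterDown
    annotations:
      summary: Node Exporter target is down
      description: The Node Exporter target {{ $labels.instance }} has been unreachable for 5 minutes.
    expr: up{job="node-exporter"} == 0
    for: 5m
    labels:
      severity: warning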
StackLight measures, analyzes, and reports in a timely manner about failures
that may occur in the following Mirantis Container Cloud
components and their sub-components, if any:
StackLight uses a storage-based log retention strategy that optimizes storage
utilization and ensures effective data retention.
A proportion of available disk space, defined as 80% of the disk space allocated
for the OpenSearch node, is distributed among the following data types:
80% for system logs
10% for audit logs
5% for OpenStack notifications (applies only to MOSK clusters)
5% for Kubernetes events
This approach ensures that storage resources are efficiently allocated based
on the importance and volume of different data types.
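As an illustrative calculation only, with these proportions an OpenSearch node with 100 GB of allocated disk space reserves about 80 GB for data retention: roughly 64 GB for system logs, 8 GB for audit logs, 4 GB for OpenStack notifications, and 4 GB for Kubernetes events.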
Logging index management provides the following advantages:
Storage-based rollover mechanism
The rollover mechanism for system and audit indices enforces shard size
based on available storage, ensuring optimal resource utilization.
Consistent shard allocation
The number of primary shards per index is dynamically set based on cluster
size, which boosts search and facilitates ingestion for large clusters.
Minimal size of cluster state
The logging-related part of the cluster state is kept minimal by using static
mappings, which are based on the Elastic Common Schema (ECS) with slight
deviations from the standard. Dynamic mapping in index templates is avoided
to reduce overhead.
Storage compression
The system and audit indices utilize the best_compression codec that
minimizes the size of stored indices, resulting in significant storage
savings of up to 50% on average.
No filter by logging level
Because severity levels are not used consistently across Container Cloud
components, logs of all severity levels are collected so that important
low-severity logs are not missed while debugging a cluster. Filtering by tags
is still available.
The data collected and transmitted through an encrypted channel back to
Mirantis provides our Customer Success Organization information to better
understand the operational usage patterns our customers are experiencing
as well as to provide feedback on product usage statistics to enable our
product teams to enhance our products and services for our customers.
Mirantis collects the following statistics using configuration-collector:
Since the Cluster releases 17.1.0 and 16.1.0
Mirantis collects hardware information using the following metrics:
mcc_hw_machine_chassis
mcc_hw_machine_cpu_model
mcc_hw_machine_cpu_number
mcc_hw_machine_nics
mcc_hw_machine_ram
mcc_hw_machine_storage (storage devices and disk layout)
mcc_hw_machine_vendor
Before the Cluster releases 17.0.0, 16.0.0, and 14.1.0
Mirantis collects the summary of all deployed Container Cloud configurations
using the following objects, if any:
Note
The data is anonymized from all sensitive information, such as IDs,
IP addresses, passwords, private keys, and so on.
Cluster
Machine
MCCUpgrade
BareMetalHost
BareMetalHostProfile
IPAMHost
IPAddr
KaaSCephCluster
L2Template
Subnet
Note
In the Cluster releases 17.0.0, 16.0.0, and 14.1.0, Mirantis does
not collect any configuration summary due to the
configuration-collector refactoring.
The node-level resource data are broken down into three broad categories:
Cluster, Node, and Namespace. The telemetry data tracks Allocatable,
Capacity, Limits, Requests, and actual Usage of node-level resources.
StackLight components, which require external access, automatically use the
same proxy that is configured for Mirantis Container Cloud clusters. Therefore,
you only need to configure proxy during deployment of your management or
managed clusters. No additional actions are required to set up proxy for
StackLight. For more details about implementation of proxy support in
Container Cloud, see Proxy and cache support.
Note
Proxy handles only the HTTP and HTTPS traffic. Therefore, for
clusters with limited or no Internet access, it is not possible to set up
Alertmanager email notifications, which use SMTP, when proxy is used.
Proxy is used for the following StackLight components:
Alertmanager (any cluster type)
As a default http_config for all HTTP-based receivers except the predefined
HTTP-alerta and HTTP-salesforce. For these receivers, http_config is
overridden on the receiver level.
Metric Collector (management cluster)
To send outbound cluster metrics to Mirantis.
Salesforce notifier (any cluster type)
To send notifications to the Salesforce instance.
Salesforce reporter (any cluster type)
To send metric reports to the Salesforce instance.
Using Mirantis Container Cloud, you can deploy a Mirantis Kubernetes Engine
(MKE) cluster on bare metal that requires corresponding resources.
If you use a firewall or proxy, make sure that the bootstrap and management
clusters have access to the following IP ranges and domain names
required for the Container Cloud content delivery network and alerting:
mirror.mirantis.com and repos.mirantis.com for packages
binary.mirantis.com for binaries and Helm charts
mirantis.azurecr.io and *.blob.core.windows.net for Docker images
mcc-metrics-prod-ns.servicebus.windows.net:9093 for Telemetry
(port 9093 if proxy is disabled, or port 443 if proxy is enabled)
mirantis.my.salesforce.com and login.salesforce.com
for Salesforce alerts
Note
Access to Salesforce is required from any Container Cloud
cluster type.
If any additional Alertmanager notification receiver is enabled,
for example, Slack, its endpoint must also be accessible
from the cluster.
Caution
Regional clusters are unsupported since Container Cloud 2.25.0.
Mirantis does not perform functional integration testing of the feature and
the related code is removed in Container Cloud 2.26.0. If you still
require this feature, contact Mirantis support for further information.
The following hardware configuration is used as a reference to deploy
Mirantis Container Cloud with bare metal Container Cloud clusters with
Mirantis Kubernetes Engine.
Reference hardware configuration for Container Cloud
management and managed clusters on bare metal
A management cluster requires 2 volumes for Container Cloud
(total 50 GB) and 5 volumes for StackLight (total 60 GB).
A managed cluster requires 5 volumes for StackLight.
The seed node is necessary only to deploy the management cluster.
When the bootstrap is complete, the bootstrap node can be
redeployed and its resources can be reused
for the managed cluster workloads.
The minimum reference system requirements for a baremetal-based bootstrap
seed node are as follows:
Basic server on Ubuntu 22.04 with the following configuration:
Kernel version 4.15.0-76.86 or later
8 GB of RAM
4 CPU
10 GB of free disk space for the bootstrap cluster cache
No DHCP or TFTP servers on any NIC networks
Routable access IPMI network for the hardware servers. For more details, see
Host networking.
Internet access for downloading of all required artifacts
The following diagram illustrates the physical and virtual L2 underlay
networking schema for the final state of the Mirantis Container Cloud
bare metal deployment.
The network fabric reference configuration is a spine/leaf with 2 leaf ToR
switches and one out-of-band (OOB) switch per rack.
The reference configuration uses the following switches for ToR and OOB:
Cisco WS-C3560E-24TD with 24 x 1 GbE ports, used in the OOB network segment.
Dell Force 10 S4810P with 48 x 1/10 GbE ports, used as ToR in the Common/PXE
network segment.
In the reference configuration, all odd interfaces from NIC0 are connected
to TORSwitch1, and all even interfaces from NIC0 are connected
to TORSwitch2. The Baseboard Management Controller (BMC) interfaces
of the servers are connected to OOBSwitch1.
The following recommendations apply to all types of nodes:
Use the Link Aggregation Control Protocol (LACP) bonding mode
with MC-LAG domains configured on leaf switches. This corresponds to
the 802.3ad bond mode on hosts.
Use ports from different multi-port NICs when creating bonds. This keeps
network connections redundant in case a single NIC fails.
Configure the ports that connect servers to the PXE network with the PXE VLAN
as native or untagged. On these ports, configure LACP fallback to ensure
that the servers can reach the DHCP server and boot over the network.
When setting up the network range for DHCP Preboot Execution Environment
(PXE), keep in mind several considerations to ensure smooth server
provisioning:
Determine the network size. For instance, if you target a concurrent
provision of 50+ servers, a /24 network is recommended. This specific size
is crucial as it provides sufficient scope for the DHCP server to provide
unique IP addresses to each new Media Access Control (MAC) address,
thereby minimizing the risk of collision.
The concept of collision refers to the likelihood of two or more devices
being assigned the same IP address. With a /24 network, the collision
probability using the SDBM hash function, which is used by the DHCP server,
is low. If a collision occurs, the DHCP server
provides a free address using a linear lookup strategy.
In the context of PXE provisioning, technically, the IP address does not
need to be consistent for every new DHCP request associated with the same
MAC address. However, maintaining the same IP address can enhance user
experience, making the /24 network size more of a recommendation
than an absolute requirement.
For a minimal network size, it is sufficient to cover the number of
concurrently provisioned servers plus one additional address (50 + 1).
This calculation applies after covering any exclusions that exist in the
range. You can define excludes in the corresponding field of the Subnet
object. For details, see API Reference: Subnet resource.
When the available address space is less than the minimum described above,
you will not be able to automatically provision all servers. However, you
can manually provision them by combining manual IP assignment for each
bare metal host with manual pauses. For these operations, use the
host.dnsmasqs.metal3.io/address and baremetalhost.metal3.io/detached
annotations in the BareMetalHostInventory object. For details, see
Operations Guide: Manually allocate IP addresses for bare metal hosts.
All addresses within the specified range must remain unused before
provisioning. If the DHCP server issues an IP address that is already in use
to a BOOTP client, that client cannot complete provisioning.
The management cluster requires a minimum of two storage devices per node.
Each device is used for a different type of storage.
The first device is always used for boot partitions and the root
file system. An SSD is recommended. RAID devices are not supported.
One storage device per server is reserved for local persistent
volumes. These volumes are served by the Local Storage Static Provisioner
(local-volume-provisioner) and used by many services of Container Cloud.
If you require all Internet access to go through a proxy server
for security and audit purposes, you can bootstrap management clusters using
proxy. The proxy server settings consist of three standard environment
variables that are set prior to the bootstrap process:
HTTP_PROXY
HTTPS_PROXY
NO_PROXY
These settings are not propagated to managed clusters. However, you can enable
a separate proxy access on a managed cluster using the Container Cloud web UI.
This proxy is intended for the end user needs and is not used for a managed
cluster deployment or for access to the Mirantis resources.
Caution
Since Container Cloud uses the OpenID Connect (OIDC) protocol
for IAM authentication, management clusters require
a direct non-proxy access from managed clusters.
StackLight components, which require external access, automatically use the
same proxy that is configured for Container Cloud clusters.
On the managed clusters with limited Internet access, a proxy is required for
StackLight components that use HTTP and HTTPS and are disabled by default but
need external access if enabled, for example, for the Salesforce integration
and Alertmanager notifications external rules.
For more details about proxy implementation in StackLight, see StackLight proxy.
For the list of Mirantis resources and IP addresses to be accessible
from the Container Cloud clusters, see Requirements.
After enabling proxy support on managed clusters, proxy is used for:
Docker traffic on managed clusters
StackLight
OpenStack on MOSK-based clusters
Warning
Any modification to the Proxy object used in any cluster, for
example, changing the proxy URL, NO_PROXY values, or
certificate, leads to cordon-drain and Docker
restart on the cluster machines.
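For illustration only, the following is a minimal sketch of what a Proxy object may look like. The exact set of supported fields depends on the Container Cloud release; the field names (httpProxy, httpsProxy) and all values below are assumptions for this example, not a definitive schema:
apiVersion: kaas.mirantis.com/v1alpha1
kind: Proxy
metadata:
  name: example-proxy          # hypothetical object name
  namespace: example-project   # hypothetical project of the managed cluster
spec:
  httpProxy: http://10.0.0.10:3128    # assumed field name and placeholder value
  httpsProxy: http://10.0.0.10:3128   # assumed field name and placeholder value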
The Container Cloud managed clusters are deployed without direct Internet
access in order to consume less Internet traffic in your cloud.
The Mirantis artifacts used during managed clusters deployment are downloaded
through a cache running on a management cluster.
The feature is enabled by default on new managed clusters
and will be automatically enabled on existing clusters during upgrade
to the latest version.
Caution
IAM operations require a direct non-proxy access
of a managed cluster to a management cluster.
To ensure the Mirantis Container Cloud stability in managing the Container
Cloud-based Mirantis Kubernetes Engine (MKE) clusters, the following MKE API
functionality is not available for the Container Cloud-based MKE clusters as
compared to the MKE clusters that are deployed not by Container Cloud.
Use the Container Cloud web UI or CLI for this functionality instead.
Public API limitations in a Container Cloud-based MKE cluster
API endpoint
Limitation
GET /swarm
Swarm Join Tokens are filtered out for all users, including admins.
PUT /api/ucp/config-toml
All requests are forbidden.
POST /nodes/{id}/update
Requests for the following changes are forbidden:
Change Role
Add or remove the com.docker.ucp.orchestrator.swarm and
com.docker.ucp.orchestrator.kubernetes labels.
Since 2.25.1 (Cluster releases 16.0.1 and 17.0.1), Container Cloud does not
override changes in MKE configuration except the following list of parameters
that are automatically managed by Container Cloud. These parameters are always
overridden by the Container Cloud default values if modified directly using
the MKE API. For details on configuration using the MKE API, see
MKE configuration managed directly by the MKE API.
However, you can manually configure a few options from this list using the
Cluster object of a Container Cloud cluster. They are labeled with the
superscript and contain references to the
respective configuration procedures in the Comments columns of the tables.
All possible values for parameters labeled with the
superscript, which you can manually
configure using the Cluster object are described in
MKE Operations Guide: Configuration options.
MKE configuration managed directly by the MKE API
Since 2.25.1, aside from MKE parameters described in MKE configuration managed by Container Cloud,
Container Cloud does not override changes in MKE configuration that are applied
directly through the MKE API. For the configuration options and procedure, see
MKE documentation:
Mirantis cannot guarantee the expected behavior of the
functionality configured using the MKE API because customer-specific
configuration does not undergo testing within Container Cloud. Therefore,
Mirantis recommends that you test custom MKE settings configured through
the MKE API on a staging environment before applying them to production.
This tutorial applies only to the Container Cloud web UI users
with the m:kaas:namespace@operator or m:kaas:namespace@writer
access role assigned by the Infrastructure Operator.
To add a bare metal host, the m:kaas@operator or
m:kaas:namespace@bm-pool-operator role is required.
After you deploy the Mirantis Container Cloud management cluster,
you can start creating managed clusters depending on your cloud needs.
The deployment procedure is performed using the Container Cloud web UI
and comprises the following steps:
Create a dedicated non-default project for managed clusters.
Create and configure bare metal hosts with corresponding labels for machines
such as worker, manager, or storage.
Create an initial cluster configuration.
Add the required amount of machines with the corresponding configuration
to the managed cluster.
Add a Ceph cluster.
Note
The Container Cloud web UI communicates with Keycloak
to authenticate users. Keycloak is exposed using HTTPS with
self-signed TLS certificates that are not trusted by web browsers.
This feature is available as Technology Preview. Use such
configuration for testing and evaluation purposes only.
For the Technology Preview feature definition, refer to Technology Preview features.
By default, MKE uses Keycloak as the OIDC provider. Using the
ClusterOIDCConfiguration custom resource, you can add your own OpenID
Connect (OIDC) provider for MKE on managed clusters to authenticate user
requests to Kubernetes. For OIDC provider requirements, see OIDC official
specification.
Note
For OpenStack and StackLight, Container Cloud supports only
Keycloak, which is configured on the management cluster,
as the OIDC provider.
To add a custom OIDC provider for MKE:
Configure the OIDC provider:
Log in to the OIDC provider dashboard.
Create an OIDC client. If you are going to use an existing one, skip
this step.
Add the MKE redirect URL of the managed cluster to the OIDC client.
By default, the URL format is https://<MKE IP>:6443/login.
Add the <Container Cloud web UI IP>/token URL to the OIDC client
for generation of kubeconfig files of the target managed cluster
through the Container Cloud web UI.
Ensure that the aud claim of the issued id_token for audience
will be equal to the created client ID.
Optional. Allow MKE to refresh authentication when id_token expires
by allowing the offline_access claim for the OIDC client.
The kubectl apply command automatically saves the
applied data as plain text into the
kubectl.kubernetes.io/last-applied-configuration annotation of the
corresponding object. This may result in revealing sensitive data in this
annotation when creating or modifying the object.
Therefore, do not use kubectl apply on this object.
Use kubectl create, kubectl patch, or
kubectl edit instead.
If you used kubectl apply on this object, you
can remove the kubectl.kubernetes.io/last-applied-configuration
annotation from the object using kubectl edit.
The ClusterOIDCConfiguration object is created in the management
cluster. Users with the m:kaas:ns@operator/writer/member roles have
access to this object.
Once done, the following dependent objects are created automatically in the
target managed cluster: the
rbac.authorization.k8s.io/v1/ClusterRoleBinding object that binds the
admin group defined in spec:adminRoleCriteria:value to the
cluster-admin rbac.authorization.k8s.io/v1/ClusterRole object.
In the Cluster object of the managed cluster, add the name of the
ClusterOIDCConfiguration object to the spec.providerSpec.value.oidc
field, as shown in the example after this procedure.
Wait until the cluster machines switch from the Reconfigure to
Ready state for the changes to apply.
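The step above adds the ClusterOIDCConfiguration object name to spec.providerSpec.value.oidc. A minimal sketch of that fragment of the Cluster object follows; the apiVersion and all names are placeholders assumed for illustration:
apiVersion: cluster.k8s.io/v1alpha1      # assumed API group of the Cluster object
kind: Cluster
metadata:
  name: example-managed-cluster          # placeholder cluster name
  namespace: example-project             # placeholder project name
spec:
  providerSpec:
    value:
      oidc: example-oidc-configuration   # name of the ClusterOIDCConfiguration object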
This section is intended only for advanced Infrastructure Operators
who are familiar with Kubernetes Cluster API.
Mirantis currently supports only those Mirantis
Container Cloud API features that are implemented in the
Container Cloud web UI.
Use other Container Cloud API features for testing
and evaluation purposes only.
The Container Cloud APIs are implemented using the Kubernetes
CustomResourceDefinitions (CRDs) that enable you to expand
the Kubernetes API. Different types of resources are grouped in the dedicated
files, such as cluster.yaml or machines.yaml.
For testing and evaluation purposes, you may also use the experimental public Container Cloud API that
allows for implementation of custom clients for creating and operating of
managed clusters. This repository contains branches that correspond to the
Container Cloud releases. For an example usage, refer to the
README
file of the repository.
This section describes the License custom resource (CR) used in Mirantis
Container Cloud API to maintain the Mirantis Container Cloud license data.
Warning
The kubectl apply command automatically saves the
applied data as plain text into the
kubectl.kubernetes.io/last-applied-configuration annotation of the
corresponding object. This may result in revealing sensitive data in this
annotation when creating or modifying the object.
Therefore, do not use kubectl apply on this object.
Use kubectl create, kubectl patch, or
kubectl edit instead.
If you used kubectl apply on this object, you
can remove the kubectl.kubernetes.io/last-applied-configuration
annotation from the object using kubectl edit.
The Container Cloud License CR contains the following fields:
apiVersion
The API version of the object that is kaas.mirantis.com/v1alpha1.
kind
The object type that is License.
metadata
The metadata object field of the License resource contains
the following fields:
name
The name of the License object, must be license.
spec
The spec object field of the License resource contains the
Secret reference where license data is stored.
license
secret
The Secret reference where the license data is stored.
key
The name of a key in the license Secret data field
under which the license data is stored.
name
The name of the Secret where the license data is stored.
value
The value of the updated license. If you need to update the license,
place it under this field. The new license data will be placed to the
Secret and value will be cleaned.
status
customerID
The unique ID of a customer generated during the license issuance.
instance
The unique ID of the current Mirantis Container Cloud instance.
dev
The license is for development.
openstack
The license limits for MOSK clusters:
clusters
The maximum number of MOSK clusters to be deployed.
If the field is absent, the number of deployments is unlimited.
workersPerCluster
The maximum number of workers per MOSK cluster to be
created. If the field is absent, the number of workers is unlimited.
expirationTime
The license expiration time in the ISO 8601 format.
expired
The license expiration state. If the value is true, the license has
expired. If the field is absent, the license is valid.
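A minimal sketch of a License object composed only from the fields described above; the Secret name and key are placeholders, and the status section is omitted because it is populated by the product:
apiVersion: kaas.mirantis.com/v1alpha1
kind: License
metadata:
  name: license                   # the object name must be license
spec:
  license:
    secret:
      key: value                  # placeholder key name in the Secret data field
      name: license-data-secret   # placeholder Secret name
    value: <new license data>     # set only when updating the license; cleaned automatically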
This section describes the Diagnostic custom resource (CR) used in Mirantis
Container Cloud API to trigger self-diagnostics for management or managed
clusters.
The Container Cloud Diagnostic CR contains the following fields:
apiVersion
API version of the object that is diagnostic.mirantis.com/v1alpha1.
kind
Object type that is Diagnostic.
metadata
Object metadata that contains the following fields:
name
Name of the Diagnostic object.
namespace
Namespace used to create the Diagnostic object. Must be equal to the
namespace of the target cluster.
spec
Resource specification that contains the following fields:
cluster
Name of the target cluster to run diagnostics on.
checks
Reserved for internal usage, any override will be discarded.
status
finishedAt
Completion timestamp of diagnostics. If the Diagnostic Controller version
is outdated, this field is not set and the corresponding error message
is displayed in the error field.
error
Error that occurs during diagnostics or if the Diagnostic Controller
version is outdated. Omitted if empty.
controllerVersion
Version of the controller that launched diagnostics.
result
Map of check statuses where the key is the check name and the value is
the result of the corresponding diagnostic check:
description
Description of the check in plain text.
result
Result of diagnostics. Possible values are PASS, ERROR,
FAIL, WARNING, INFO.
message
Optional. Explanation of the check results. It may optionally contain
a reference to the documentation describing a known issue related to
the check results, including the existing workaround for the issue.
success
Success status of the check. Boolean.
ticketInfo
Optional. Information about the ticket to track the resolution
progress of the known issue related to the check results. For example,
FIELD-12345.
The Diagnostic resource example:
apiVersion: diagnostic.mirantis.com/v1alpha1
kind: Diagnostic
metadata:
  name: test-diagnostic
  namespace: test-namespace
spec:
  cluster: test-cluster
status:
  finishedAt: 2024-07-01T11:27:14Z
  error: ""
  controllerVersion: v1.40.11
  result:
    bm_address_capacity:
      description: Baremetal addresses capacity
      message: LCM Subnet 'default/k8s-lcm-nics' has 8 allocatable addresses (threshold
        is 5) - OK; PXE-NIC Subnet 'default/k8s-pxe-nics' has 7 allocatable addresses
        (threshold is 5) - OK; Auto-assignable address pool 'default' from MetallbConfig
        'default/kaas-mgmt-metallb' has left 21 available IP addresses (threshold
        is 10) - OK
      result: INFO
      success: true
    bm_artifacts_overrides:
      description: Baremetal overrides check
      message: BM operator has no undesired overrides
      result: PASS
      success: true
IAMUser is the Cluster (non-namespaced) object. Its objects are synced
from Keycloak, that is, they are created upon user creation in Keycloak and
deleted upon user deletion in Keycloak. The IAMUser object is exposed as
read-only to all users. It contains the following fields:
apiVersion
API version of the object that is iam.mirantis.com/v1alpha1
kind
Object type that is IAMUser
metadata
Object metadata that contains the following field:
name
Sanitized user name without special characters, with the first 8 symbols of
the user UUID appended to the end
displayName
Name of the user as defined in the Keycloak database
externalID
ID of the user as defined in the Keycloak database
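A sketch of an IAMUser object built only from the fields listed above; all values are illustrative:
apiVersion: iam.mirantis.com/v1alpha1
kind: IAMUser
metadata:
  name: jdoe-12ab34cd        # sanitized user name with the first 8 symbols of the UUID appended
displayName: John Doe        # user name as defined in the Keycloak database
externalID: 12ab34cd-0000-0000-0000-000000000000   # user ID as defined in the Keycloak database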
The management-admin role is available since Container
Cloud 2.25.0 (Cluster releases 17.0.0, 16.0.0, 14.1.0).
description
Role description.
scope
Role scope.
Configuration example:
apiVersion: iam.mirantis.com/v1alpha1
kind: IAMRole
metadata:
  name: global-admin
description: Gives permission to manage IAM role bindings in the Container Cloud deployment.
scope: global
IAMGlobalRoleBinding is the Cluster (non-namespaced) object that
should be used for global role bindings in all namespaces. This object is
accessible to users with the global-admin IAMRole assigned through the
IAMGlobalRoleBinding object. The object contains the following fields:
apiVersion
API version of the object that is iam.mirantis.com/v1alpha1.
kind
Object type that is IAMGlobalRoleBinding.
metadata
Object metadata that contains the following field:
name
Role binding name. If the role binding is user-created, user can set
any unique name. If a name relates to a binding that is synced by
user-controller from Keycloak, the naming convention is
<username>-<rolename>.
role
Object role that contains the following field:
name
Role name.
user
Object name that contains the following field:
name
Name of the iamuser object that the defined role is provided to.
Not equal to the user name in Keycloak.
legacy
Defines whether the role binding is legacy. Possible values are true or
false.
legacyRole
Applicable when the legacy field value is true.
Defines the legacy role name in Keycloak.
external
Defines whether the role is assigned through Keycloak and is synced by
user-controller with the Container Cloud API as the
IAMGlobalRoleBinding object. Possible values are true or false.
Caution
If you create the IAM*RoleBinding, do not set or modify
the legacy, legacyRole, and external fields unless absolutely
necessary and you understand all implications.
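A minimal sketch of an IAMGlobalRoleBinding composed from the fields described above; the names are illustrative, and the legacy-related fields are omitted as recommended for user-created bindings:
apiVersion: iam.mirantis.com/v1alpha1
kind: IAMGlobalRoleBinding
metadata:
  name: jdoe-global-admin    # follows the <username>-<rolename> convention used for synced bindings
role:
  name: global-admin
user:
  name: jdoe-12ab34cd        # name of the IAMUser object, not the Keycloak user name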
IAMRoleBinding is the namespaced object that represents a grant of one
role to one user in all clusters of the namespace. It is accessible to users
that have either of the following bindings assigned to them:
IAMGlobalRoleBinding that binds them with the global-admin,
operator, or user IAMRole. For user, the bindings are
read-only.
IAMRoleBinding that binds them with the operator or user IAMRole
in a particular namespace. For user, the bindings are
read-only.
apiVersion
API version of the object that is iam.mirantis.com/v1alpha1.
kind
Object type that is IAMRoleBinding.
metadata
Object metadata that contains the following fields:
namespace
Namespace that the defined binding belongs to.
name
Role binding name. If the role is user-created, user can set any unique
name. If a name relates to a binding that is synced from Keycloak,
the naming convention is <userName>-<roleName>.
legacy
Defines whether the role binding is legacy. Possible values are true or
false.
legacyRole
Applicable when the legacy field value is true.
Defines the legacy role name in Keycloak.
external
Defines whether the role is assigned through Keycloak and is synced by
user-controller with the Container Cloud API as the
IAMGlobalRoleBinding object. Possible values are true or false.
Caution
If you create the IAM*RoleBinding, do not set or modify
the legacy, legacyRole, and external fields unless absolutely
necessary and you understand all implications.
role
Object role that contains the following field:
name
Role name.
user
Object user that contains the following field:
name
Name of the iamuser object that the defined role is granted to.
Not equal to the user name in Keycloak.
IAMClusterRoleBinding is the namespaced object that represents a grant
of one role to one user on one cluster in the namespace. This object is
accessible to users that have either of the following bindings
assigned to them:
IAMGlobalRoleBinding that binds them with the global-admin,
operator, or user IAMRole. For user, the bindings are
read-only.
IAMRoleBinding that binds them with the operator or user IAMRole
in a particular namespace. For user, the bindings are
read-only.
The IAMClusterRoleBinding object contains the following fields:
apiVersion
API version of the object that is iam.mirantis.com/v1alpha1.
kind
Object type that is IAMClusterRoleBinding.
metadata
Object metadata that contains the following fields:
namespace
Namespace of the cluster that the defined binding belongs to.
name
Role binding name. If the role is user-created, user can set any unique
name. If a name relates to a binding that is synced from Keycloak,
the naming convention is <userName>-<roleName>-<clusterName>.
role
Object role that contains the following field:
name
Role name.
user
Object user that contains the following field:
name
Name of the iamuser object that the defined role is granted to.
Not equal to the user name in Keycloak.
cluster
Object cluster that contains the following field:
name
Name of the cluster on which the defined role is granted.
legacy
Defines whether the role binding is legacy. Possible values are true or
false.
legacyRole
Applicable when the legacy field value is true.
Defines the legacy role name in Keycloak.
external
Defines whether the role is assigned through Keycloak and is synced by
user-controller with the Container Cloud API as the
IAMGlobalRoleBinding object. Possible values are true or false.
Caution
If you create the IAM*RoleBinding, do not set or modify
the legacy, legacyRole, and external fields unless absolutely
necessary and you understand all implications.
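A sketch of an IAMClusterRoleBinding composed from the fields described above; the role name, user name, and cluster name are hypothetical placeholders, and the legacy-related fields are omitted as recommended for user-created bindings:
apiVersion: iam.mirantis.com/v1alpha1
kind: IAMClusterRoleBinding
metadata:
  namespace: example-project                 # namespace of the target cluster
  name: jdoe-cluster-admin-example-cluster   # <userName>-<roleName>-<clusterName> convention for synced bindings
role:
  name: cluster-admin                        # hypothetical role name
user:
  name: jdoe-12ab34cd                        # name of the IAMUser object, not the Keycloak user name
cluster:
  name: example-cluster                      # cluster on which the role is granted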
This section contains description of the OpenID Connect (OIDC) custom resource
for Mirantis Container Cloud that you can use to customize OIDC for Mirantis
Kubernetes Engine (MKE) on managed clusters. Using this resource, add your own
OIDC provider to authenticate user requests to Kubernetes. For OIDC provider
requirements, see OIDC official specification.
The Container Cloud ClusterOIDCConfiguration custom resource contains
the following fields:
apiVersion
The API version of the object that is kaas.mirantis.com/v1alpha1.
kind
The object type that is ClusterOIDCConfiguration.
metadata
The metadata object field of the ClusterOIDCConfiguration resource
contains the following fields:
name
The object name.
namespace
The project name (Kubernetes namespace) of the related managed cluster.
spec
The spec object field of the ClusterOIDCConfiguration resource
contains the following fields:
adminRoleCriteria
Definition of the id_token claim with the admin role and the role
value.
matchType
Matching type of the claim with the requested role. Possible values
that MKE uses to match the claim with the requested value:
must
Requires a plain string in the id_token claim, for example,
"iam_role":"mke-admin".
contains
Requires an array of strings in the id_token claim,
for example, "iam_role":["mke-admin","pod-reader"].
name
Name of the admin id_token claim containing a role or array of
roles.
value
Role value that matches the "iam_role" value in the admin
id_token claim.
caBundle
Base64-encoded certificate authority bundle of the OIDC provider
endpoint.
clientID
ID of the OIDC client to be used by Kubernetes.
clientSecret
Secret value of the clientID parameter. After the
ClusterOIDCConfiguration object creation, this field is updated
automatically with a reference to the corresponding Secret. For example:
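A sketch of a ClusterOIDCConfiguration object assembled only from the fields described above; all values are placeholders, and the exact shape of the auto-generated Secret reference under clientSecret is an assumption that may differ between releases:
apiVersion: kaas.mirantis.com/v1alpha1
kind: ClusterOIDCConfiguration
metadata:
  name: example-oidc-configuration
  namespace: example-project                 # project of the related managed cluster
spec:
  adminRoleCriteria:
    matchType: contains                      # the admin claim is an array of strings
    name: iam_role
    value: mke-admin
  caBundle: <base64-encoded CA bundle of the OIDC provider endpoint>
  clientID: example-client-id
  clientSecret:
    secret:                                  # assumed shape of the auto-generated Secret reference
      key: value
      name: example-oidc-configuration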
This section describes the UpdateGroup custom resource (CR) used in the
Container Cloud API to configure update concurrency for specific sets of
machines or machine pools within a cluster. This resource enhances the update
process by allowing a more granular control over the concurrency of machine
updates. This resource also provides a way to control the reboot behavior of
machines during a Cluster release update.
The Container Cloud UpdateGroup CR contains the following fields:
apiVersion
API version of the object that is kaas.mirantis.com/v1alpha1.
kind
Object type that is UpdateGroup.
metadata
Metadata of the UpdateGroup CR that contains the following fields. All
of them are required.
name
Name of the UpdateGroup object.
namespace
Project where the UpdateGroup is created.
labels
Label to associate the UpdateGroup with a specific cluster in the
cluster.sigs.k8s.io/cluster-name:<cluster-name> format.
spec
Specification of the UpdateGroup CR that contains the following fields:
index
Index to determine the processing order of the UpdateGroup object.
Groups with the same index are processed concurrently.
Number of machines to update concurrently within UpdateGroup.
rebootIfUpdateRequires
Since 2.28.0 (Cluster releases 17.3.0 and 16.3.0).
Technology Preview. Automatic reboot of controller or worker machines
of an update group if a Cluster release update involves node reboot,
for example, when kernel version update is available in new Cluster
release. You can set this parameter for management or managed clusters.
Boolean. By default, true on management clusters and false on
managed clusters. On managed clusters:
If set to true, related machines are rebooted as part of a Cluster
release update that requires a reboot.
If set to false, machines are not rebooted even if a Cluster
release update requires a reboot.
Caution
During a distribution upgrade, machines are always rebooted,
overriding rebootIfUpdateRequires:false.
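A sketch of an UpdateGroup object based on the fields above; all values are illustrative, and the name of the concurrency field (concurrentUpdates) is an assumption because it is not spelled out in this section:
apiVersion: kaas.mirantis.com/v1alpha1
kind: UpdateGroup
metadata:
  name: example-update-group
  namespace: example-project
  labels:
    cluster.sigs.k8s.io/cluster-name: example-cluster
spec:
  index: 10                      # groups with the same index are processed concurrently
  concurrentUpdates: 2           # assumed field name: machines updated concurrently within the group
  rebootIfUpdateRequires: false  # do not reboot managed cluster machines even if the update requires it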
This section describes the MCCUpgrade resource used in Mirantis
Container Cloud API to configure a schedule for the Container Cloud update.
The Container Cloud MCCUpgrade CR contains the following fields:
apiVersion
API version of the object that is kaas.mirantis.com/v1alpha1.
kind
Object type that is MCCUpgrade.
metadata
The metadata object field of the MCCUpgrade resource contains
the following fields:
name
The name of MCCUpgrade object, must be mcc-upgrade.
spec
The spec object field of the MCCUpgrade resource contains the
schedule when Container Cloud update is allowed or blocked. This field
contains the following fields:
blockUntil
Deprecated since Container Cloud 2.28.0 (Cluster release 16.3.0). Use
autoDelay instead.
Time stamp in the ISO 8601 format, for example,
2021-12-31T12:30:00-05:00. Updates will be disabled until this time.
You cannot set this field to more than 7 days in the future and more
than 30 days after the latest Container Cloud release.
autoDelay
Available since Container Cloud 2.28.0 (Cluster release 16.3.0).
Flag that enables delay of the management cluster auto-update to a new
Container Cloud release and ensures that auto-update is not started
immediately on the release date. Boolean, false by default.
The delay period is minimum 20 days for each newly discovered release
and depends on specifics of each release cycle and on optional
configuration of week days and hours selected for update. You can verify
the exact date of a scheduled auto-update in the status section of
the MCCUpgrade object.
Note
Modifying the delay period is not supported.
timeZone
Name of a time zone in the IANA Time Zone Database. This time zone will
be used for all schedule calculations. For example: Europe/Samara,
CET, America/Los_Angeles.
schedule
List of schedule items that allow an update at specific hours or
weekdays. The update process can proceed if at least one of these items
allows it. Schedule items allow update when both hours and
weekdays conditions are met. When this list is empty or absent,
update is allowed at any hour of any day. Every schedule item contains
the following fields:
hours
Object with 2 fields: from and to. Both must be non-negative
integers not greater than 24. The to field must be greater than
the from one. Update is allowed if the current hour in the
time zone specified by timeZone is greater or equals to from
and is less than to. If hours is absent, update is allowed
at any hour.
weekdays
Object with boolean fields with these names:
monday
tuesday
wednesday
thursday
friday
saturday
sunday
Update is allowed only on weekdays that have the corresponding field
set to true. If all fields are false or absent, or
weekdays is empty or absent, update is allowed on all weekdays.
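The following is a sketch of an MCCUpgrade spec reconstructed from the field descriptions above; the schedule it encodes is summarized right after the snippet:
apiVersion: kaas.mirantis.com/v1alpha1
kind: MCCUpgrade
metadata:
  name: mcc-upgrade        # the object name must be mcc-upgrade
spec:
  timeZone: CET
  schedule:
  - hours:
      from: 7
      to: 17
    weekdays:
      monday: true
  - hours:
      from: 10
      to: 17
    weekdays:
      tuesday: true
  - hours:
      from: 7
      to: 10
    weekdays:
      friday: true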
In this example, all schedule calculations are done in the CET timezone and
upgrades are allowed only:
From 7:00 to 17:00 on Mondays
From 10:00 to 17:00 on Tuesdays
From 7:00 to 10:00 on Fridays
status
The status object field of the MCCUpgrade resource contains
information about the next planned Container Cloud update, if available.
This field contains the following fields:
nextAttempt
Deprecated since 2.28.0 (Cluster release 16.3.0).
Time stamp in the ISO 8601 format indicating the time when the Release
Controller will attempt to discover and install a new Container Cloud
release. Set to the next allowed time according to the schedule
configured in spec or one minute in the future if the schedule
currently allows update.
message
Deprecated since 2.28.0 (Cluster release 16.3.0).
Message from the last update step or attempt.
nextRelease
Object describing the next release that Container Cloud will be updated
to. Absent if no new releases have been discovered. Contains the
following fields:
version
Semver-compatible version of the next Container Cloud release, for
example, 2.22.0.
date
Time stamp in the ISO 8601 format of the Container Cloud release
defined in version:
Since 2.28.0 (Cluster release 16.3.0), the field indicates the
publish time stamp of a new release.
Before 2.28.0 (Cluster release 16.2.x or earlier), the field
indicates the discovery time stamp of a new release.
scheduled
Available since Container Cloud 2.28.0 (Cluster release 16.3.0).
Time window that the pending Container Cloud release update is
scheduled for:
startTime
Time stamp in the ISO 8601 format indicating the start time of
the update for the pending Container Cloud release.
endTime
Time stamp in the ISO 8601 format indicating the end time of
the update for the pending Container Cloud release.
lastUpgrade
Time stamps of the latest Container Cloud update:
startedAt
Time stamp in the ISO 8601 format indicating the time when the last
Container Cloud update started.
finishedAt
Time stamp in the ISO 8601 format indicating the time when the last
Container Cloud update finished.
conditions
Available since Container Cloud 2.28.0 (Cluster release 16.3.0). List of
status conditions describing the status of the MCCUpgrade resource.
Each condition has the following format:
type
Condition type representing a particular aspect of the MCCUpgrade
object. Currently, the only supported condition type is Ready that
defines readiness to process a new release.
If the status field of the Ready condition type is False,
the Release Controller blocks the start of update operations.
status
Condition status. Possible values: True, False,
Unknown.
reason
Machine-readable explanation of the condition.
lastTransitionTime
Time of the latest condition transition.
message
Human-readable description of the condition.
Example of MCCUpgrade status:
status:
  conditions:
  - lastTransitionTime: "2024-09-16T13:22:27Z"
    message: New release scheduled for upgrade
    reason: ReleaseScheduled
    status: "True"
    type: Ready
  lastUpgrade: {}
  message: ''
  nextAttempt: "2024-09-16T13:23:27Z"
  nextRelease:
    date: "2024-08-25T21:05:46Z"
    scheduled:
      endTime: "2024-09-17T00:00:00Z"
      startTime: "2024-09-16T00:00:00Z"
    version: 2.28.0
Available since 2.27.0 (Cluster releases 17.2.0 and 16.2.0). Technology Preview.
This section describes the ClusterUpdatePlan custom resource (CR) used in
the Container Cloud API to granularly control the update process of a managed
cluster by stopping the update after each step.
The ClusterUpdatePlan CR contains the following fields:
apiVersion
API version of the object that is kaas.mirantis.com/v1alpha1.
kind
Object type that is ClusterUpdatePlan.
metadata
Metadata of the ClusterUpdatePlan CR that contains the following fields:
name
Name of the ClusterUpdatePlan object.
namespace
Project name of the cluster that relates to ClusterUpdatePlan.
spec
Specification of the ClusterUpdatePlan CR that contains the following
fields:
source
Source name of the Cluster release from which the cluster is updated.
target
Target name of the Cluster release to which the cluster is updated.
cluster
Name of the cluster for which ClusterUpdatePlan is created.
releaseNotes
Available since Container Cloud 2.29.0 (Cluster releases 17.4.0 and
16.4.0). Link to MOSK release notes of the target
release.
steps
List of update steps, where each step contains the following fields:
id
Available since Container Cloud 2.28.0 (Cluster releases 17.3.0 and
16.3.0). Step ID.
name
Step name.
description
Step description.
constraints
Description of constraints applied during the step execution.
impact
Impact of the step on the cluster functionality and workloads.
Contains the following fields:
users
Impact on the Container Cloud user operations. Possible values:
none, major, or minor.
workloads
Impact on workloads. Possible values: none, major, or
minor.
info
Additional details on impact, if any.
duration
Details about duration of the step execution. Contains the following
fields:
estimated
Estimated time to complete the update step.
Note
Before Container Cloud 2.29.0 (Cluster releases 17.4.0
and 16.4.0), this field was named eta.
info
Additional details on update duration, if any.
granularity
Information on the current step granularity. Indicates whether the
current step is applied to each machine individually or to the entire
cluster at once. Possible values are cluster or machine.
commence
Flag that allows controlling the step execution. Boolean, false
by default. If set to true, the step starts execution after all
previous steps are completed.
Caution
Cancelling an already started update step is unsupported.
status
Status of the ClusterUpdatePlan CR that contains the following fields:
startedAt
Time when ClusterUpdatePlan has started.
completedAt
Available since Container Cloud 2.29.0 (Cluster releases 17.4.0 and
16.4.0). Time of update completion.
status
Overall object status.
steps
List of step statuses in the same order as defined in spec. Each step
status contains the following fields:
id
Available since Container Cloud 2.28.0 (Cluster releases 17.3.0 and
16.3.0). Step ID.
name
Step name.
status
Step status. Possible values are:
NotStarted
Step has not started yet.
Scheduled
Available since Container Cloud 2.28.0 (Cluster releases 17.3.0 and
16.3.0). Step is already triggered but its execution has not
started yet.
InProgress
Step is currently in progress.
AutoPaused
Available since Container Cloud 2.29.0 (Cluster release 17.4.0) as
Technology Preview. Update is automatically paused by the trigger
from a firing alert defined in the UpdateAutoPause
configuration. For details, see UpdateAutoPause resource.
Stuck
Step execution contains an issue, which also indicates that the
step does not fit into the estimate defined in the duration
field for this step in spec.
Completed
Step has been completed.
message
Message describing status details of the current update step.
duration
Current duration of the step execution.
startedAt
Start time of the step execution.
Example of a ClusterUpdatePlan object:
apiVersion: kaas.mirantis.com/v1alpha1
kind: ClusterUpdatePlan
metadata:
  creationTimestamp: "2025-02-06T16:53:51Z"
  generation: 11
  name: mosk-17.4.0
  namespace: child
  resourceVersion: "6072567"
  uid: 82c072be-1dc5-43dd-b8cf-bc643206d563
spec:
  cluster: mosk
  releaseNotes: https://docs.mirantis.com/mosk/latest/25.1-series.html
  source: mosk-17-3-0-24-3
  steps:
  - commence: true
    description:
    - install new version of OpenStack and Tungsten Fabric life cycle management modules
    - OpenStack and Tungsten Fabric container images pre-cached
    - OpenStack and Tungsten Fabric control plane components restarted in parallel
    duration:
      estimated: 1h30m0s
      info:
      - 15 minutes to cache the images and update the life cycle management modules
      - 1h to restart the components
    granularity: cluster
    id: openstack
    impact:
      info:
      - some of the running cloud operations may fail due to restart of API services and schedulers
      - DNS might be affected
      users: minor
      workloads: minor
    name: Update OpenStack and Tungsten Fabric
  - commence: true
    description:
    - Ceph version update
    - restart Ceph monitor, manager, object gateway (radosgw), and metadata services
    - restart OSD services node-by-node, or rack-by-rack depending on the cluster configuration
    duration:
      estimated: 8m30s
      info:
      - 15 minutes for the Ceph version update
      - around 40 minutes to update Ceph cluster of 30 nodes
    granularity: cluster
    id: ceph
    impact:
      info:
      - 'minor unavailability of object storage APIs: S3/Swift'
      - workloads may experience IO performance degradation for the virtual storage devices backed by Ceph
      users: minor
      workloads: minor
    name: Update Ceph
  - commence: true
    description:
    - new host OS kernel and packages get installed
    - host OS configuration re-applied
    - container runtime version gets bumped
    - new versions of Kubernetes components installed
    duration:
      estimated: 1h40m0s
      info:
      - about 20 minutes to update host OS per a Kubernetes controller, nodes updated one-by-one
      - Kubernetes components update takes about 40 minutes, all nodes in parallel
    granularity: cluster
    id: k8s-controllers
    impact:
      users: none
      workloads: none
    name: Update host OS and Kubernetes components on master nodes
  - commence: true
    description:
    - new host OS kernel and packages get installed
    - host OS configuration re-applied
    - container runtime version gets bumped
    - new versions of Kubernetes components installed
    - data plane components (Open vSwitch and Neutron L3 agents, TF agents and vrouter) restarted on gateway and compute nodes
    - storage nodes put to “no-out” mode to prevent rebalancing
    - by default, nodes are updated one-by-one, a node group can be configured to update several nodes in parallel
    duration:
      estimated: 8h0m0s
      info:
      - host OS update - up to 15 minutes per node (not including host OS configuration modules)
      - Kubernetes components update - up to 15 minutes per node
      - OpenStack controllers and gateways updated one-by-one
      - nodes hosting Ceph OSD, monitor, manager, metadata, object gateway (radosgw) services updated one-by-one
    granularity: machine
    id: k8s-workers-vdrok-child-default
    impact:
      info:
      - 'OpenStack controller nodes: some running OpenStack operations might not complete due to restart of components'
      - 'OpenStack compute nodes: minor loss of the East-West connectivity with the Open vSwitch networking backend that causes approximately 5 min of downtime'
      - 'OpenStack gateway nodes: minor loss of the North-South connectivity with the Open vSwitch networking backend: a non-distributed HA virtual router needs up to 1 minute to failover; a non-distributed and non-HA virtual router failover time depends on many factors and may take up to 10 minutes'
      users: major
      workloads: major
    name: Update host OS and Kubernetes components on worker nodes, group vdrok-child-default
  - commence: true
    description:
    - restart of StackLight, MetalLB services
    - restart of auxiliary controllers and charts
    duration:
      estimated: 1h30m0s
    granularity: cluster
    id: mcc-components
    impact:
      info:
      - minor cloud API downtime due restart of MetalLB components
      users: minor
      workloads: none
    name: Auxiliary components update
  target: mosk-17-4-0-25-1
status:
  completedAt: "2025-02-07T19:24:51Z"
  startedAt: "2025-02-07T17:07:02Z"
  status: Completed
  steps:
  - duration: 26m36.355605528s
    id: openstack
    message: Ready
    name: Update OpenStack and Tungsten Fabric
    startedAt: "2025-02-07T17:07:02Z"
    status: Completed
  - duration: 6m1.124356485s
    id: ceph
    message: Ready
    name: Update Ceph
    startedAt: "2025-02-07T17:33:38Z"
    status: Completed
  - duration: 24m3.151554465s
    id: k8s-controllers
    message: Ready
    name: Update host OS and Kubernetes components on master nodes
    startedAt: "2025-02-07T17:39:39Z"
    status: Completed
  - duration: 1h19m9.359184228s
    id: k8s-workers-vdrok-child-default
    message: Ready
    name: Update host OS and Kubernetes components on worker nodes, group vdrok-child-default
    startedAt: "2025-02-07T18:03:42Z"
    status: Completed
  - duration: 2m0.772243006s
    id: mcc-components
    message: Ready
    name: Auxiliary components update
    startedAt: "2025-02-07T19:22:51Z"
    status: Completed
This section describes the UpdateAutoPause custom resource (CR) used in the
Container Cloud API to configure automatic pausing of cluster release updates
in a managed cluster using StackLight alerts.
The Container Cloud UpdateAutoPause CR contains the following fields:
apiVersion
API version of the object that is kaas.mirantis.com/v1alpha1.
kind
Object type that is UpdateAutoPause.
metadata
Metadata of the UpdateAutoPause CR that contains the following fields:
name
Name of the UpdateAutoPause object. Must match the cluster name.
namespace
Project where the UpdateAutoPause is created. Must match the cluster
namespace.
spec
Specification of the UpdateAutoPause CR that contains the following
field:
alerts
List of alert names. The occurrence of any alert from this list triggers
auto-pause of the cluster release update.
status
Status of the UpdateAutoPause CR that contains the following fields:
firingAlerts
List of currently firing alerts from the specified set.
error
Error message, if any, encountered during object processing.
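Configuration sketch of an UpdateAutoPause object, assuming a managed cluster named mosk in the child project; the alert names are placeholders for the StackLight alerts that must trigger the auto-pause:
apiVersion: kaas.mirantis.com/v1alpha1
kind: UpdateAutoPause
metadata:
  name: mosk
  namespace: child
spec:
  alerts:
  - KubePodCrashLooping
  - TargetDown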
Technology Preview. Available since 2.24.0 and 23.2 for MOSK clusters
This section describes the CacheWarmupRequest custom resource (CR) used in
the Container Cloud API to predownload images and store them in the
mcc-cache service.
The Container Cloud CacheWarmupRequest CR contains the following fields:
apiVersion
API version of the object that is kaas.mirantis.com/v1alpha1.
kind
Object type that is CacheWarmupRequest.
metadata
The metadata object field of the CacheWarmupRequest
resource contains the following fields:
name
Name of the CacheWarmupRequest object that must match the existing
management cluster name to which the warm-up operation applies.
namespace
Container Cloud project in which the cluster is created.
Always set to default as the only available project for management
clusters creation.
spec
The spec object field of the CacheWarmupRequest resource
contains the settings for artifacts fetching and artifacts filtering
through Cluster releases. This field contains the following fields:
clusterReleases
Array of strings. Defines a set of Cluster release names to
warm up in the mcc-cache service.
openstackReleases
Optional. Array of strings. Defines a set of OpenStack
releases to warm up in mcc-cache. Applicable only
if ClusterReleases field contains mosk releases.
If you plan to upgrade an OpenStack version, define the current and the
target versions including the intermediate versions, if any.
For example, to upgrade OpenStack from Victoria to Yoga:
openstackReleases:
- victoria
- wallaby
- xena
- yoga
fetchRequestTimeout
Optional. String. Time for a single request to download
a single artifact. Defaults to 30m. For example, 1h2m3s.
clientsPerEndpoint
Optional. Integer. Number of clients to use for fetching artifacts
per each mcc-cache service endpoint. Defaults to 2.
openstackOnly
Optional. Boolean. Enables fetching of the OpenStack-related artifacts
for MOSK. Defaults to false. Applicable only if the
ClusterReleases field contains mosk releases. Useful when you
need to upgrade only an OpenStack version.
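An illustrative sketch of a CacheWarmupRequest object; the management cluster name kaas-mgmt and the Cluster release names are placeholders that must match the releases available in your deployment:
apiVersion: kaas.mirantis.com/v1alpha1
kind: CacheWarmupRequest
metadata:
  name: kaas-mgmt
  namespace: default
spec:
  clusterReleases:
  - mosk-17-4-0-25-1
  openstackReleases:
  - victoria
  - wallaby
  - xena
  - yoga
  fetchRequestTimeout: 30m
  clientsPerEndpoint: 2
  openstackOnly: false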
This section describes the GracefulRebootRequest custom resource (CR)
used in the Container Cloud API for a rolling reboot of several or all cluster
machines without workloads interruption. The resource is also useful for a
bulk reboot of machines, for example, on large clusters.
The Container Cloud GracefulRebootRequest CR contains the following fields:
apiVersion
API version of the object that is kaas.mirantis.com/v1alpha1.
kind
Object type that is GracefulRebootRequest.
metadata
Metadata of the GracefulRebootRequest CR that contains the following
fields:
name
Name of the GracefulRebootRequest object. The object name must match
the name of the cluster on which you want to reboot machines.
namespace
Project where the GracefulRebootRequest is created.
spec
Specification of the GracefulRebootRequest CR that contains the
following fields:
machines
List of machines for a rolling reboot. Each machine of the list is
cordoned, drained, rebooted, and uncordoned in the order of cluster
upgrade policy. For details about the upgrade order,
see Change the upgrade order of a machine or machine pool.
Leave this field empty to reboot all cluster machines.
Caution
The cluster and machines must have the Ready status to
perform a graceful reboot.
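An illustrative sketch that assumes a managed cluster named mosk in the child project; the machine names are placeholders. Leave spec:machines empty to reboot all machines of the cluster:
apiVersion: kaas.mirantis.com/v1alpha1
kind: GracefulRebootRequest
metadata:
  name: mosk
  namespace: child
spec:
  machines:
  - mosk-worker-0
  - mosk-worker-1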
This section describes the ContainerRegistry custom resource (CR) used in
Mirantis Container Cloud API to configure CA certificates on machines to access
private Docker registries.
The Container Cloud ContainerRegistry CR contains the following fields:
apiVersion
API version of the object that is kaas.mirantis.com/v1alpha1
kind
Object type that is ContainerRegistry
metadata
The metadata object field of the ContainerRegistry CR contains
the following fields:
name
Name of the container registry
namespace
Project where the container registry is created
spec
The spec object field of the ContainerRegistry CR contains the
following fields:
domain
Host name and optional port of the registry
CACert
CA certificate of the registry in the base64-encoded format
Caution
Only one ContainerRegistry resource can exist per domain.
To configure multiple CA certificates for the same domain, combine them into
one certificate.
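An illustrative sketch of a ContainerRegistry object; the object name, the domain, and the CA certificate value are placeholders:
apiVersion: kaas.mirantis.com/v1alpha1
kind: ContainerRegistry
metadata:
  name: demo-registry
  namespace: default
spec:
  domain: registry.example.com:5000
  CACert: <base64-encoded-CA-certificate>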
This section describes the TLSConfig resource used in Mirantis
Container Cloud API to configure TLS certificates for cluster applications.
Warning
The kubectl apply command automatically saves the
applied data as plain text into the
kubectl.kubernetes.io/last-applied-configuration annotation of the
corresponding object. This may result in revealing sensitive data in this
annotation when creating or modifying the object.
Therefore, do not use kubectl apply on this object.
Use kubectl create, kubectl patch, or
kubectl edit instead.
If you used kubectl apply on this object, you
can remove the kubectl.kubernetes.io/last-applied-configuration
annotation from the object using kubectl edit.
The Container Cloud TLSConfig CR contains the following fields:
apiVersion
API version of the object that is kaas.mirantis.com/v1alpha1.
kind
Object type that is TLSConfig.
metadata
The metadata object field of the TLSConfig resource contains
the following fields:
name
Name of the public key.
namespace
Project where the TLS certificate is created.
spec
The spec object field contains the configuration to apply for an
application. It contains the following fields:
serverName
Host name of a server.
serverCertificate
Certificate to authenticate server’s identity to a client.
A valid certificate bundle can be passed.
The server certificate must be on the top of the chain.
privateKey
Reference to the Secret object that contains a private key.
A private key is a key for the server. It must correspond to the
public key used in the server certificate.
key
Key name in the secret.
name
Secret name.
caCertificate
Certificate that issued the server certificate. The top-most
intermediate certificate should be used if a CA certificate is
unavailable.
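An illustrative sketch of a TLSConfig object; the object name, the server name, the Secret name, and the certificate bodies are placeholders:
apiVersion: kaas.mirantis.com/v1alpha1
kind: TLSConfig
metadata:
  name: keycloak
  namespace: default
spec:
  serverName: keycloak.example.com
  serverCertificate: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
  privateKey:
    key: tls.key
    name: keycloak-tls-secret
  caCertificate: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----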
Private API since Container Cloud 2.29.0 (Cluster release 16.4.0)
Warning
Since Container Cloud 2.29.0 (Cluster release 16.4.0), use the
BareMetalHostInventory resource instead of BareMetalHost for
adding and modifying configuration of a bare metal server. Any change in the
BareMetalHost object will be overwritten by BareMetalHostInventory.
For any existing BareMetalHost object, a BareMetalHostInventory
object is created automatically during management cluster update to the
Cluster release 16.4.0.
This section describes the BareMetalHost resource used in the
Mirantis Container Cloud API. A BareMetalHost object
is created for each Machine and contains all information about
the machine hardware configuration. BareMetalHost objects are used to monitor
and manage the state of a bare metal server. This includes inspecting the host
hardware, firmware, operating system provisioning, power control, and server
deprovisioning. When a machine is created, the bare metal provider assigns a
BareMetalHost to that machine using labels and the BareMetalHostProfile
configuration.
For demonstration purposes, the Container Cloud BareMetalHost
custom resource (CR) can be split into the following major sections:
The Container Cloud BareMetalHost CR contains the following fields:
apiVersion
API version of the object that is metal3.io/v1alpha1.
kind
Object type that is BareMetalHost.
metadata
The metadata field contains the following subfields:
name
Name of the BareMetalHost object.
namespace
Project in which the BareMetalHost object was created.
annotations
Available since Cluster releases 12.5.0, 11.5.0, and 7.11.0.
Key-value pairs to attach additional metadata to the object:
kaas.mirantis.com/baremetalhost-credentials-name
Key that connects the BareMetalHost object with a previously
created BareMetalHostCredential object. The value of this key
must match the BareMetalHostCredential object name.
host.dnsmasqs.metal3.io/address
Available since Cluster releases 17.0.0 and 16.0.0.
Key that assigns a particular IP address to a bare metal host during
PXE provisioning.
baremetalhost.metal3.io/detached
Available since Cluster releases 17.0.0 and 16.0.0.
Key that pauses host management by the bare metal Operator for a
manual IP address assignment.
Note
If the host provisioning has already started or completed, adding
this annotation deletes the information about the host from Ironic without
triggering deprovisioning. The bare metal Operator recreates the host
in Ironic once you remove the annotation. For details, see
Metal3 documentation.
inspect.metal3.io/hardwaredetails-storage-sort-term
Available since Cluster releases 17.0.0 and 16.0.0. Optional.
Key that defines sorting of the bmh:status:storage[] list during
inspection of a bare metal host. Accepts multiple tags separated by
a comma or semicolon with the ASC/DESC suffix for the sorting
direction. Example terms: sizeBytesDESC, hctlASC, typeASC, nameDESC.
Since Cluster releases 17.1.0 and 16.1.0, the following default
value applies: hctlASC,wwnASC,by_idASC,nameASC.
labels
Labels used by the bare metal provider to find a matching
BareMetalHost object to deploy a machine:
hostlabel.bm.kaas.mirantis.com/controlplane
hostlabel.bm.kaas.mirantis.com/worker
hostlabel.bm.kaas.mirantis.com/storage
Each BareMetalHost object added using the Container Cloud web UI
will be assigned one of these labels. If the BareMetalHost and
Machine objects are created using API, any label may be used
to match these objects for a bare metal host to deploy a machine.
Warning
Labels and annotations that are not documented in this API
Reference are generated automatically by Container Cloud. Do not modify them
using the Container Cloud API.
Configuration example:
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: master-0
  namespace: default
  labels:
    kaas.mirantis.com/baremetalhost-id: hw-master-0
    kaas.mirantis.com/baremetalhost-id: <bareMetalHostHardwareNodeUniqueId>
  annotations:
    # Since 2.21.0 (7.11.0, 12.5.0, 11.5.0)
    kaas.mirantis.com/baremetalhost-credentials-name: hw-master-0-credentials
The spec section for the BareMetalHost object defines the desired state
of BareMetalHost. It contains the following fields:
bmc
Details for communication with the Baseboard Management Controller (bmc)
module on a host. Contains the following subfields:
address
URL for communicating with the BMC. URLs vary depending on the
communication protocol and the BMC type, for example:
IPMI
Default BMC type in the ipmi://<host>:<port> format. You can also
use a plain <host>:<port> format. A port is optional if using the
default port 623.
You can change the IPMI privilege level from the default
ADMINISTRATOR to OPERATOR with an optional URL parameter
privilegelevel: ipmi://<host>:<port>?privilegelevel=OPERATOR.
Redfish
BMC type in the redfish:// format. To disable TLS, you
can use the redfish+http:// format. A host name or IP address and
a path to the system ID are required for both formats. For example,
redfish://myhost.example/redfish/v1/Systems/System.Embedded.1
or redfish://myhost.example/redfish/v1/Systems/1.
credentialsName
Name of the secret containing the BareMetalHost object credentials.
Since Container Cloud 2.21.0 and 2.21.1 for MOSK 22.5,
this field is updated automatically during cluster deployment. For
details, see BareMetalHostCredential.
Before Container Cloud 2.21.0 or MOSK 22.5,
the secret requires the username and password keys in the
Base64 encoding.
disableCertificateVerification
Boolean to skip certificate validation when true.
bootMACAddress
MAC address for booting.
bootMode
Boot mode: UEFI if UEFI is enabled and legacy if disabled.
online
Defines whether the server must be online after provisioning is done.
Warning
Setting online:false to more than one bare metal host
in a management cluster at a time can make the cluster non-operational.
Configuration example for Container Cloud 2.21.0 or later:
metadata:
  name: node-1-name
  annotations:
    kaas.mirantis.com/baremetalhost-credentials-name: node-1-credentials # Since Container Cloud 2.21.0
spec:
  bmc:
    address: 192.168.33.106:623
    credentialsName: ''
  bootMACAddress: 0c:c4:7a:a8:d3:44
  bootMode: legacy
  online: true
Configuration example for Container Cloud 2.20.1 or earlier:
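A minimal sketch based on the field descriptions above: before Container Cloud 2.21.0, credentialsName references a secret that contains the username and password keys in the Base64 encoding. The secret name is a placeholder:
metadata:
  name: node-1-name
spec:
  bmc:
    address: 192.168.33.106:623
    credentialsName: node-1-bmc-secret
  bootMACAddress: 0c:c4:7a:a8:d3:44
  bootMode: legacy
  online: true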
The status field of the BareMetalHost object defines the current
state of BareMetalHost. It contains the following fields:
errorMessage
Last error message reported by the provisioning subsystem.
goodCredentials
Last credentials that were validated.
hardware
Hardware discovered on the host. Contains information about the storage,
CPU, host name, firmware, and so on.
operationalStatus
Status of the host:
OK
Host is configured correctly and is manageable.
discovered
Host is only partially configured. For example, the bmc address
is discovered but not the login credentials.
error
Host has any sort of error.
poweredOn
Host availability status: powered on (true) or powered off (false).
provisioning
State information tracked by the provisioner:
state
Current action being done with the host by the provisioner.
id
UUID of a machine.
triedCredentials
Details of the last credentials sent to the provisioning backend.
Configuration example:
status:
  errorMessage: ""
  goodCredentials:
    credentials:
      name: master-0-bmc-secret
      namespace: default
    credentialsVersion: "13404"
  hardware:
    cpu:
      arch: x86_64
      clockMegahertz: 3000
      count: 32
      flags:
      - 3dnowprefetch
      - abm
      ...
      model: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
    firmware:
      bios:
        date: ""
        vendor: ""
        version: ""
    hostname: ipa-fcab7472-892f-473c-85a4-35d64e96c78f
    nics:
    - ip: ""
      mac: 0c:c4:7a:a8:d3:45
      model: 0x8086 0x1521
      name: enp8s0f1
      pxe: false
      speedGbps: 0
      vlanId: 0
    ...
    ramMebibytes: 262144
    storage:
    - by_path: /dev/disk/by-path/pci-0000:00:1f.2-ata-1
      hctl: "4:0:0:0"
      model: Micron_5200_MTFD
      name: /dev/sda
      rotational: false
      serialNumber: 18381E8DC148
      sizeBytes: 1920383410176
      vendor: ATA
      wwn: "0x500a07511e8dc148"
      wwnWithExtension: "0x500a07511e8dc148"
    ...
    systemVendor:
      manufacturer: Supermicro
      productName: SYS-6018R-TDW (To be filled by O.E.M.)
      serialNumber: E16865116300188
  operationalStatus: OK
  poweredOn: true
  provisioning:
    state: provisioned
  triedCredentials:
    credentials:
      name: master-0-bmc-secret
      namespace: default
    credentialsVersion: "13404"
This section describes the BareMetalHostCredential custom resource (CR)
used in the Mirantis Container Cloud API. The BareMetalHostCredential
object is created for each BareMetalHostInventory and contains all
information about the Baseboard Management Controller (bmc) credentials.
Note
Before update of the management cluster to Container Cloud 2.29.0
(Cluster release 16.4.0), instead of BareMetalHostInventory, use the
BareMetalHost object. For details, see BareMetalHost.
Caution
While the Cluster release of the management cluster is 16.4.0,
BareMetalHostInventory operations are allowed to
m:kaas@management-admin only. Once the management cluster is updated
to the Cluster release 16.4.1 (or later), this limitation will be lifted.
Warning
The kubectl apply command automatically saves the
applied data as plain text into the
kubectl.kubernetes.io/last-applied-configuration annotation of the
corresponding object. This may result in revealing sensitive data in this
annotation when creating or modifying the object.
Therefore, do not use kubectl apply on this object.
Use kubectl create, kubectl patch, or
kubectl edit instead.
If you used kubectl apply on this object, you
can remove the kubectl.kubernetes.io/last-applied-configuration
annotation from the object using kubectl edit.
For demonstration purposes, the BareMetalHostCredential CR can be split
into the following sections:
The BareMetalHostCredential metadata contains the following fields:
apiVersion
API version of the object that is kaas.mirantis.com/v1alpha1
kind
Object type that is BareMetalHostCredential
metadata
The metadata field contains the following subfields:
name
Name of the BareMetalHostCredential object
namespace
Container Cloud project in which the related BareMetalHostInventory
object was created
labels
Labels used by the bare metal provider:
kaas.mirantis.com/region
Region name
Note
The kaas.mirantis.com/region label is removed from all
Container Cloud objects in 2.26.0 (Cluster releases 17.1.0 and 16.1.0).
Therefore, do not add the label starting from these releases. On existing
clusters updated to these releases, or if manually added, this label will
be ignored by Container Cloud.
The spec section for the BareMetalHostCredential object contains
sensitive information that is moved to a separate
Secret object during cluster deployment:
username
User name of the bmc account with administrator privileges to control
the power state and boot source of the bare metal host
password
Details on the user password of the bmc account with administrator
privileges:
value
Password that will be automatically removed once saved in a separate
Secret object
name
Name of the Secret object where credentials are saved
The BareMetalHostCredential object creation triggers the following
automatic actions:
Create an underlying Secret object containing data about username
and password of the bmc account of the related
BareMetalHostCredential object.
Erase sensitive password data of the bmc account from the
BareMetalHostCredential object.
Add the created Secret object name to the spec.password.name
section of the related BareMetalHostCredential object.
Update BareMetalHostInventory.spec.bmc.bmhCredentialsName with the
BareMetalHostCredential object name.
Note
Before Container Cloud 2.29.0 (17.4.0 and 16.4.0),
BareMetalHost.spec.bmc.credentialsName was updated with the
BareMetalHostCredential object name.
Note
When you delete a BareMetalHostInventory object, the related
BareMetalHostCredential object is deleted automatically.
Note
On existing clusters, a BareMetalHostCredential object is
automatically created for each BareMetalHostInventory object during a
cluster update.
Example of BareMetalHostCredential before the cluster deployment starts:
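A minimal sketch based on the field descriptions above; the user name and password values are placeholders:
apiVersion: kaas.mirantis.com/v1alpha1
kind: BareMetalHostCredential
metadata:
  name: hw-master-0-credentials
  namespace: default
spec:
  username: admin
  password:
    value: <plain-text-password>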
Available since Container Cloud 2.29.0 (Cluster release 16.4.0)
Note
Before update of the management cluster to Container Cloud 2.29.0
(Cluster release 16.4.0), instead of BareMetalHostInventory, use the
BareMetalHost object. For details, see BareMetalHost.
Caution
While the Cluster release of the management cluster is 16.4.0,
BareMetalHostInventory operations are allowed to
m:kaas@management-admin only. Once the management cluster is updated
to the Cluster release 16.4.1 (or later), this limitation will be lifted.
This section describes the BareMetalHostInventory resource used in the
Mirantis Container Cloud API to monitor and manage the state of a bare metal
server. This includes inspecting the host hardware, firmware, operating system
provisioning, power control, and server deprovisioning.
The BareMetalHostInventory object is created for each Machine and
contains all information about machine hardware configuration.
Each BareMetalHostInventory object is synchronized with an automatically
created BareMetalHost object, which is used for internal purposes of
the Container Cloud private API.
Use the BareMetalHostInventory object instead of BareMetalHost for
adding and modifying configuration of a bare metal server.
Caution
Any change in the BareMetalHost object will be overwritten by
BareMetalHostInventory.
For any existing BareMetalHost object, a BareMetalHostInventory
object is created automatically during management cluster update to
Container Cloud 2.29.0 (Cluster release 16.4.0).
For demonstration purposes, the Container Cloud BareMetalHostInventory
custom resource (CR) can be split into the following major sections:
baremetalhost.metal3.io/detached
Key that pauses host management by the bare metal Operator for a
manual IP address assignment.
Note
If the host provisioning has already started or completed, adding
this annotation deletes the information about the host from Ironic without
triggering deprovisioning. The bare metal Operator recreates the host
in Ironic once you remove the annotation. For details, see
Metal3 documentation.
inspect.metal3.io/hardwaredetails-storage-sort-term
Optional. Key that defines sorting of the bmh:status:storage[] list
during inspection of a bare metal host. Accepts multiple tags separated
by a comma or semicolon with the ASC/DESC suffix for the sorting
direction. Example terms: sizeBytesDESC, hctlASC, typeASC, nameDESC.
The default value is hctlASC,wwnASC,by_idASC,nameASC.
labels
Labels used by the bare metal provider to find a matching
BareMetalHostInventory object for machine deployment. For example:
hostlabel.bm.kaas.mirantis.com/controlplane
hostlabel.bm.kaas.mirantis.com/worker
hostlabel.bm.kaas.mirantis.com/storage
Warning
Labels and annotations that are not documented in this API
Reference are generated automatically by Container Cloud. Do not modify them
using the Container Cloud API.
Configuration example:
apiVersion: kaas.mirantis.com/v1alpha1
kind: BareMetalHostInventory
metadata:
  name: master-0
  namespace: default
  labels:
    kaas.mirantis.com/baremetalhost-id: hw-master-0
  annotations:
    inspect.metal3.io/hardwaredetails-storage-sort-term: hctl ASC, wwn ASC, by_id ASC, name ASC
The spec section for the BareMetalHostInventory object defines the
required state of BareMetalHostInventory. It contains the following fields:
bmc
Details for communication with the Baseboard Management Controller (bmc)
module on a host. Contains the following subfields:
address
URL for communicating with the BMC. URLs vary depending on the
communication protocol and the BMC type. For example:
IPMI
Default BMC type in the ipmi://<host>:<port> format. You can also
use a plain <host>:<port> format. A port is optional if using the
default port 623.
You can change the IPMI privilege level from the default
ADMINISTRATOR to OPERATOR with an optional URL parameter
privilegelevel: ipmi://<host>:<port>?privilegelevel=OPERATOR.
Redfish
BMC type in the redfish:// format. To disable TLS, you can use
the redfish+http:// format. A host name or IP address and a path
to the system ID are required for both formats. For example,
redfish://myhost.example/redfish/v1/Systems/System.Embedded.1
or redfish://myhost.example/redfish/v1/Systems/1.
bmhCredentialsName
Name of the BareMetalHostCredentials object.
disableCertificateVerification
Key that disables certificate validation. Boolean, false by default.
When true, the validation is skipped.
bootMACAddress
MAC address for booting.
bootMode
Boot mode: UEFI if UEFI is enabled and legacy if disabled.
online
Defines whether the server must be online after provisioning is done.
Warning
Setting online:false to more than one bare metal host
in a management cluster at a time can make the cluster non-operational.
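A configuration sketch of the spec section that mirrors the BareMetalHost example above; the BMC address and the credentials object name are placeholders:
spec:
  bmc:
    address: 192.168.33.106:623
    bmhCredentialsName: node-1-credentials
  bootMACAddress: 0c:c4:7a:a8:d3:44
  bootMode: legacy
  online: true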
This section describes the BareMetalHostProfile resource used
in Mirantis Container Cloud API
to define how the storage devices and operating system
are provisioned and configured.
For demonstration purposes, the Container Cloud BareMetalHostProfile
custom resource (CR) is split into the following major sections:
The spec field of BareMetalHostProfile object contains
the fields to customize your hardware configuration:
Warning
Any data stored on any device defined in the fileSystems
list can be deleted or corrupted during cluster (re)deployment. It happens
because each device from the fileSystems list is a part of the
rootfs directory tree that is overwritten during (re)deployment.
Examples of affected devices include:
A raw device partition with a file system on it
A device partition in a volume group with a logical volume that has a
file system on it
An mdadm RAID device with a file system on it
An LVM RAID device with a file system on it
The wipe field (deprecated) or wipeDevice structure (recommended
since Container Cloud 2.26.0) have no effect in this case and cannot
protect data on these devices.
Therefore, to prevent data loss, move the necessary data from these file
systems to another server beforehand, if required.
devices
List of definitions of the physical storage devices. To configure more
than three storage devices per host, add additional devices to this list.
Each device in the list can have one or more
partitions defined by the list in the partitions field.
Each device in the list must have the following fields in the
properties section for device handling:
workBy (recommended, string)
Defines how the device should be identified. Accepts a comma-separated
string with the following recommended value (in order of priority):
by_id,by_path,by_wwn,by_name. Since 2.25.1, this value is set
by default.
wipeDevice (recommended, object)
Available since Container Cloud 2.26.0 (Cluster releases 17.1.0 and
16.1.0). Enables and configures cleanup of a device or its metadata
before cluster deployment. Contains the following fields:
eraseMetadata (dictionary)
Enables metadata cleanup of a device. Contains the following
field:
enabled (boolean)
Enables the eraseMetadata option. False by default.
eraseDevice (dictionary)
Configures a complete cleanup of a device. Contains the following
fields:
blkdiscard (object)
Executes the blkdiscard command on the target device
to discard all data blocks. Contains the following fields:
enabled (boolean)
Enables the blkdiscard option. False by default.
zeroout (string)
Configures writing of zeroes to each block during device
erasure. Contains the following options:
fallback - default, blkdiscard attempts to
write zeroes only if the device does not support the block
discard feature. In this case, the blkdiscard
command is re-executed with an additional --zeroout
flag.
always - always write zeroes.
never - never write zeroes.
userDefined (object)
Enables execution of a custom command or shell script to erase
the target device. Contains the following fields:
enabled (boolean)
Enables the userDefined option. False by default.
command (string)
Defines a command to erase the target device. Empty by
default. Mutually exclusive with script. For the command
execution, the ansible.builtin.command module is called.
script (string)
Defines a plain-text script allowing pipelines (|) to
erase the target device. Empty by default. Mutually exclusive
with command. For the script execution, the
ansible.builtin.shell module is called.
When executing a command or a script, you can use the following
environment variables:
DEVICE_KNAME (always defined by Ansible)
Device kernel path, for example, /dev/sda
DEVICE_BY_NAME (optional)
Link from /dev/disk/by-name/ if it was added by
udev
DEVICE_BY_ID (optional)
Link from /dev/disk/by-id/ if it was added by
udev
DEVICE_BY_PATH (optional)
Link from /dev/disk/by-path/ if it was added by
udev
DEVICE_BY_WWN (optional)
Link from /dev/disk/by-wwn/ if it was added by
udev
wipe (deprecated, boolean)
Defines whether the device must be wiped of data before being used.
Note
This field is deprecated since Container Cloud 2.26.0
(Cluster releases 17.1.0 and 16.1.0) for the sake of wipeDevice
and will be removed in one of the following releases.
For backward compatibility, any existing wipe:true option
is automatically converted to the following structure:
wipeDevice:
  eraseMetadata:
    enabled: True
Before Container Cloud 2.26.0, the wipe field is mandatory.
Each device in the list can have the following fields in its
properties section that affect the selection of the specific device
when the profile is applied to a host:
type (optional, string)
The device type. Possible values: hdd, ssd,
nvme. This property is used to filter selected devices by type.
partflags (optional, string)
Extra partition flags to be applied on a partition. For example,
bios_grub.
minSizeGiB and maxSizeGiB (optional, string)
The lower and upper limits of the selected device size. Only the
devices matching these criteria are considered for allocation.
An omitted parameter means no lower or upper limit.
The minSize and maxSize parameter names are also available
for the same purpose.
Caution
Mirantis recommends using only one parameter name and unit format
throughout the configuration files. If both sizeGiB and size are
used, sizeGiB is ignored during deployment and the suffix is adjusted
accordingly. For example, 1.5Gi will be serialized as 1536Mi.
The size without units is counted in bytes. For example, size:120 means
120 bytes.
Since Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0),
minSizeGiB and maxSizeGiB are deprecated.
Instead of floats that define sizes in GiB for *GiB fields, use
the <sizeNumber>Gi text notation (Ki, Mi, and so on).
All newly created profiles are automatically migrated to the Gi
syntax. In existing profiles, migrate the syntax manually.
byName (forbidden in new profiles since 2.27.0, optional, string)
The specific device name to be selected during provisioning, such as
/dev/sda.
Warning
With NVME devices and certain hardware disk controllers,
you cannot reliably select such device by the system name.
Therefore, use a more specific byPath, serialNumber, or
wwn selector.
Caution
Since Container Cloud 2.26.0 (Cluster releases 17.1.0 and
16.1.0), byName is deprecated. Since Container Cloud 2.27.0
(Cluster releases 17.2.0 and 16.2.0), byName is blocked by
admission-controller in new BareMetalHostProfile objects.
As a replacement, use a more specific selector, such as byPath,
serialNumber, or wwn.
byPath (optional, string) Since 2.26.0 (17.1.0, 16.1.0)
The specific device name with its path to be selected during
provisioning, such as /dev/disk/by-path/pci-0000:00:07.0.
serialNumber (optional, string) Since 2.26.0 (17.1.0, 16.1.0)
The specific serial number of a physical disk to be selected during
provisioning, such as S2RBNXAH116186E.
wwn (optional, string) Since 2.26.0 (17.1.0, 16.1.0)
The specific World Wide Name number of a physical disk to be selected
during provisioning, such as 0x5002538d409aeeb4.
Warning
When using strict filters, such as byPath,
serialNumber, or wwn, Mirantis strongly recommends not
combining them with a soft filter, such as minSize / maxSize.
Use only one approach.
softRaidDevices Technology Preview
List of definitions of a software-based Redundant Array of Independent
Disks (RAID) created by mdadm. Use the following fields to describe
an mdadm RAID device:
name (mandatory, string)
Name of a RAID device. Supports the following formats:
dev path, for example, /dev/md0.
simple name, for example, raid-name that will be created as
/dev/md/raid-name on the target OS.
devices (mandatory, list)
List of partitions from the devices list. Expand the resulting list
of devices into at least two partitions.
level (optional, string)
Level of a RAID device, defaults to raid1. Possible values:
raid1, raid0, raid10.
metadata (optional, string)
Metadata version of RAID, defaults to 1.0.
Possible values: 1.0, 1.1, 1.2. For details about the
differences in metadata, see
man 8 mdadm.
Warning
The EFI system partition partflags: ['esp'] must be
a physical partition in the main partition table of the disk, not under
LVM or mdadm software RAID.
fileSystems
List of file systems. Each file system can be created on top of either
device, partition, or logical volume. If more file systems are required
for additional devices, define them in this field. Each item in the
fileSystems list has the following fields:
fileSystem (mandatory, string)
Type of a file system to create on a partition. For example, ext4,
vfat.
mountOpts (optional, string)
Comma-separated string of mount options. For example,
rw,noatime,nodiratime,lazytime,nobarrier,commit=240,data=ordered.
mountPoint (optional, string)
Target mount point for a file system. For example,
/mnt/local-volumes/.
partition (optional, string)
Partition name to be selected for creation from the list in the
devices section. For example, uefi.
logicalVolume (optional, string)
LVM logical volume name if the file system is supposed to be created
on an LVM volume defined in the logicalVolumes section. For example,
lvp.
logicalVolumes
List of LVM logical volumes. Every logical volume belongs to a volume
group from the volumeGroups list and has the size attribute
for a size in the corresponding units.
You can also add a software-based RAID raid1 created by LVM
using the following fields:
name (mandatory, string)
Name of a logical volume.
vg (mandatory, string)
Name of a volume group that must be a name from the volumeGroups
list.
sizeGiB or size (mandatory, string)
Size of a logical volume in gigabytes. When set to 0, all available
space on the corresponding volume group will be used. The 0 value
equals -l100%FREE in the lvcreate command.
type (optional, string)
Type of a logical volume. If you require a usual logical volume,
you can omit this field.
Possible values:
linear
Default. A usual logical volume. This value is implied for bare metal
host profiles created using the Container Cloud release earlier than
2.12.0 where the type field is unavailable.
raid1 Technology Preview
Serves to build the raid1 type of LVM. Equals to the
lvcreate --type raid1... command. For details, see
man 8 lvcreate
and man 7 lvmraid.
Caution
Mirantis recommends using only one parameter name and unit format
throughout the configuration files. If both sizeGiB and size are
used, sizeGiB is ignored during deployment and the suffix is adjusted
accordingly. For example, 1.5Gi will be serialized as 1536Mi.
The size without units is counted in bytes. For example, size:120 means
120 bytes.
volumeGroups
List of definitions of LVM volume groups. Each volume group contains one
or more devices or partitions from the devices list. Contains the
following field:
devices (mandatory, list)
List of partitions to be used in a volume group, as shown in the
general configuration example below.
name (mandatory, string)
Name of a volume group to be created. For example: lvm_root.
preDeployScript (optional, string)
Shell script that executes on a host before provisioning the target
operating system inside the ramfs system.
postDeployScript (optional, string)
Shell script that executes on a host after deploying the operating
system inside the ramfs system that is chrooted to the target
operating system. To use a specific default gateway (for example,
to have Internet access) on this stage, refer to
MOSK Deployment Guide: Configure multiple DHCP address
ranges.
grubConfig (optional, object)
Set of options for the Linux GRUB bootloader on the target operating system.
Contains the following field:
defaultGrubOptions (optional, array)
Set of options passed to the Linux GRUB bootloader. Each string in the
list defines one parameter. For example:
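The options below mirror the general configuration example later in this section:
grubConfig:
  defaultGrubOptions:
  - GRUB_DISABLE_RECOVERY="true"
  - GRUB_PRELOAD_MODULES=lvm
  - GRUB_TIMEOUT=20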
If asymmetric traffic is expected on some of the managed cluster
nodes, enable the loose mode for the corresponding interfaces on those
nodes by setting the net.ipv4.conf.<interface-name>.rp_filter
parameter to "2" in the kernelParameters.sysctl section.
For example:
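A sketch that assumes k8s-ext as the name of the affected interface; replace it with the actual interface name:
kernelParameters:
  sysctl:
    net.ipv4.conf.k8s-ext.rp_filter: "2"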
General configuration example with the deprecated wipe
option for devices - applies before 2.26.0 (17.1.0 and 16.1.0)
spec:
  devices:
  - device:
      #byName: /dev/sda
      minSize: 61GiB
      wipe: true
      workBy: by_wwn,by_path,by_id,by_name
    partitions:
    - name: bios_grub
      partflags:
      - bios_grub
      size: 4Mi
      wipe: true
    - name: uefi
      partflags: ['esp']
      size: 200Mi
      wipe: true
    - name: config-2
      # limited to 64Mb
      size: 64Mi
      wipe: true
    - name: md_root_part1
      wipe: true
      partflags: ['raid']
      size: 60Gi
    - name: lvm_lvp_part1
      wipe: true
      partflags: ['raid']
      # 0 Means, all left space
      size: 0
  - device:
      #byName: /dev/sdb
      minSize: 61GiB
      wipe: true
      workBy: by_wwn,by_path,by_id,by_name
    partitions:
    - name: md_root_part2
      wipe: true
      partflags: ['raid']
      size: 60Gi
    - name: lvm_lvp_part2
      wipe: true
      # 0 Means, all left space
      size: 0
  - device:
      #byName: /dev/sdc
      minSize: 30Gib
      wipe: true
      workBy: by_wwn,by_path,by_id,by_name
  softRaidDevices:
  - name: md_root
    metadata: "1.2"
    devices:
    - partition: md_root_part1
    - partition: md_root_part2
  volumeGroups:
  - name: lvm_lvp
    devices:
    - partition: lvm_lvp_part1
    - partition: lvm_lvp_part2
  logicalVolumes:
  - name: lvp
    vg: lvm_lvp
    # Means, all left space
    sizeGiB: 0
  postDeployScript: |
    #!/bin/bash -ex
    echo $(date) 'post_deploy_script done' >> /root/post_deploy_done
  preDeployScript: |
    #!/bin/bash -ex
    echo 'ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"' > /etc/udev/rules.d/60-ssd-scheduler.rules
    echo $(date) 'pre_deploy_script done' >> /root/pre_deploy_done
  fileSystems:
  - fileSystem: vfat
    partition: config-2
  - fileSystem: vfat
    partition: uefi
    mountPoint: /boot/efi/
  - fileSystem: ext4
    softRaidDevice: md_root
    mountPoint: /
  - fileSystem: ext4
    logicalVolume: lvp
    mountPoint: /mnt/local-volumes/
  grubConfig:
    defaultGrubOptions:
    - GRUB_DISABLE_RECOVERY="true"
    - GRUB_PRELOAD_MODULES=lvm
    - GRUB_TIMEOUT=20
  kernelParameters:
    sysctl:
      # For the list of options prohibited to change, refer to
      # https://docs.mirantis.com/mke/3.7/install/predeployment/set-up-kernel-default-protections.html
      kernel.dmesg_restrict: "1"
      kernel.core_uses_pid: "1"
      fs.file-max: "9223372036854775807"
      fs.aio-max-nr: "1048576"
      fs.inotify.max_user_instances: "4096"
      vm.max_map_count: "262144"
    modules:
    - filename: kvm_intel.conf
      content: |
        options kvm_intel nested=1
During volume mounts, Mirantis strongly advises against mounting the entire
/var directory to a separate disk or partition. Otherwise, the
cloud-init service may fail to configure the target host system during
the first boot.
Following this recommendation prevents the cloud-init issue described
below, which is caused by asynchronous mounting in systemd that ignores
mount dependencies:
The system boots and mounts /.
The cloud-init service starts and processes data in
/var/lib/cloud-init, which currently references
[/]var/lib/cloud-init.
The systemd service mounts /var/lib/cloud-init and breaks the
cloud-init service logic.
Recommended configuration example for /var/lib/nova
spec:
  devices:
  ...
  - device:
      serialNumber: BTWA516305VE480FGN
      type: ssd
      wipeDevice:
        eraseMetadata:
          enabled: true
    partitions:
    - name: var_part
      size: 0
  fileSystems:
  ...
  - fileSystem: ext4
    partition: var_part
    mountPoint: '/var' # NOT RECOMMENDED
    mountOpts: 'rw,noatime,nodiratime,lazytime'
The fields of the Cluster resource that are located
under the status section including providerStatus
are available for viewing only.
They are automatically generated by the bare metal cloud provider
and must not be modified using Container Cloud API.
The Container Cloud Cluster CR contains the following fields:
apiVersion
API version of the object that is cluster.k8s.io/v1alpha1.
kind
Object type that is Cluster.
The metadata object field of the Cluster resource
contains the following fields:
name
Name of a cluster. A managed cluster name is specified under the
ClusterName field in the Create Cluster wizard of the
Container Cloud web UI. A management cluster name is configurable in the
bootstrap script.
namespace
Project in which the cluster object was created. The management cluster is
always created in the default project. The managed cluster project
equals to the selected project name.
labels
Key-value pairs attached to the object:
kaas.mirantis.com/provider
Provider type that is baremetal for the baremetal-based clusters.
kaas.mirantis.com/region
Region name. The default region name for the management cluster is
region-one.
Note
The kaas.mirantis.com/region label is removed from all
Container Cloud objects in 2.26.0 (Cluster releases 17.1.0 and 16.1.0).
Therefore, do not add the label starting from these releases. On existing
clusters updated to these releases, or if manually added, this label will
be ignored by Container Cloud.
Warning
Labels and annotations that are not documented in this API
Reference are generated automatically by Container Cloud. Do not modify them
using the Container Cloud API.
The spec object field of the Cluster object
represents the BaremetalClusterProviderSpec subresource that
contains a complete description of the desired bare metal cluster
state and all details to create the cluster-level
resources. It also contains the fields required for LCM deployment
and integration of the Container Cloud components.
The providerSpec object field is custom for each cloud provider and
contains the following generic fields for the bare metal provider:
apiVersion
API version of the object that is baremetal.k8s.io/v1alpha1
maintenance
Maintenance mode of a cluster. Prepares a cluster for maintenance
and enables the possibility to switch machines into maintenance mode.
containerRegistries
List of the ContainerRegistries resources names.
ntpEnabled
NTP server mode. Boolean, enabled by default.
Since Container Cloud 2.23.0, you can optionally disable NTP to disable
the management of chrony configuration by Container Cloud and use your
own system for chrony management. Otherwise, configure the regional NTP
server parameters to be applied to all machines of managed clusters.
Before Container Cloud 2.23.0, you can optionally configure NTP parameters
if servers from the Ubuntu NTP pool (*.ubuntu.pool.ntp.org) are
accessible from the node where a management cluster is being provisioned.
Otherwise, this configuration is mandatory.
NTP configuration
Configure the regional NTP server parameters to be applied to all machines
of managed clusters.
In the Cluster object, add the ntp:servers section
with the list of required server names:
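A minimal sketch that assumes the ntp section resides under spec:providerSpec:value of the Cluster object; the server names are placeholders:
spec:
  providerSpec:
    value:
      ntp:
        servers:
        - 0.pool.ntp.org
        - 1.pool.ntp.org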
audit
Optional. Auditing tools enabled on the cluster. Contains the auditd
field that enables the Linux Audit daemon auditd to monitor
activity of cluster processes and prevent potential malicious activity.
The auditd field contains the following subfields:
enabled
Boolean, default - false. Enables the auditd role to install the
auditd packages and configure rules. CIS rules: 4.1.1.1, 4.1.1.2.
enabledAtBoot
Boolean, default - false. Configures grub to audit processes that can
be audited even if they start up prior to auditd startup. CIS rule:
4.1.1.3.
backlogLimit
Integer, default - none. Configures the backlog to hold records. If during
boot audit=1 is configured, the backlog holds 64 records. If more than
64 records are created during boot, auditd records will be lost with a
potential malicious activity being undetected. CIS rule: 4.1.1.4.
maxLogFile
Integer, default - none. Configures the maximum size of the audit log file.
Once the log reaches the maximum size, it is rotated and a new log file is
created. CIS rule: 4.1.2.1.
maxLogFileAction
String, default - none. Defines handling of the audit log file reaching the
maximum file size. Allowed values:
keep_logs - rotate logs but never delete them
rotate - add a cron job to compress rotated log files and keep
maximum 5 compressed files.
compress - compress log files and keep them under the
/var/log/auditd/ directory. Requires
auditd_max_log_file_keep to be enabled.
CIS rule: 4.1.2.2.
maxLogFileKeep
Integer, default - 5. Defines the number of compressed log files to keep
under the /var/log/auditd/ directory. Requires
auditd_max_log_file_action=compress. CIS rules - none.
mayHaltSystem
Boolean, default - false. Halts the system when the audit logs are
full. Applies the following configuration:
space_left_action=email
action_mail_acct=root
admin_space_left_action=halt
CIS rule: 4.1.2.3.
customRules
String, default - none. Base64-encoded content of the 60-custom.rules
file for any architecture. CIS rules - none.
customRulesX32
String, default - none. Base64-encoded content of the 60-custom.rules
file for the i386 architecture. CIS rules - none.
customRulesX64
String, default - none. Base64-encoded content of the 60-custom.rules
file for the x86_64 architecture. CIS rules - none.
presetRules
String, default - none. Comma-separated list of the following built-in
preset rules:
access
actions
delete
docker
identity
immutable
logins
mac-policy
modules
mounts
perm-mod
privileged
scope
session
system-locale
time-change
Since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0) in the
Technology Preview scope, you can collect some of the preset rules indicated
above as groups and use them in presetRules:
ubuntu-cis-rules - this group contains rules to comply with the Ubuntu
CIS Benchmark recommendations, including the following CIS Ubuntu 20.04
v2.0.1 rules:
scope - 5.2.3.1
actions - same as 5.2.3.2
time-change - 5.2.3.4
system-locale - 5.2.3.5
privileged - 5.2.3.6
access - 5.2.3.7
identity - 5.2.3.8
perm-mod - 5.2.3.9
mounts - 5.2.3.10
session - 5.2.3.11
logins - 5.2.3.12
delete - 5.2.3.13
mac-policy - 5.2.3.14
modules - 5.2.3.19
docker-cis-rules - this group contains rules to comply with
Docker CIS Benchmark recommendations, including the docker Docker CIS
v1.6.0 rules 1.1.3 - 1.1.18.
You can also use two additional keywords inside presetRules:
none - select no built-in rules.
all - select all built-in rules. When using this keyword, you can add
the ! prefix to a rule name to exclude some rules. You can use the
! prefix for rules only if you add the all keyword as the
first rule. Place a rule with the ! prefix only after
the all keyword.
Example configurations:
presetRules:none - disable all preset rules
presetRules:docker - enable only the docker rules
presetRules:access,actions,logins - enable only the
access, actions, and logins rules
presetRules:ubuntu-cis-rules - enable all rules from the
ubuntu-cis-rules group
presetRules:docker-cis-rules,actions - enable all rules from
the docker-cis-rules group and the actions rule
presetRules:all - enable all preset rules
presetRules:all,!immutable,!sessions - enable all preset
rules except immutable and sessions
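An illustrative sketch of the auditd settings described above, assuming they reside under the audit section of spec:providerSpec:value in the Cluster object; the option values are examples only:
spec:
  providerSpec:
    value:
      audit:
        auditd:
          enabled: true
          enabledAtBoot: true
          maxLogFile: 30
          maxLogFileAction: rotate
          maxLogFileKeep: 5
          presetRules: docker-cis-rules,actions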
Optional. Technology Preview. Deprecated since Container Cloud 2.29.0
(Cluster releases 17.4.0 and 16.4.0). Available since Container Cloud 2.24.0
(Cluster release 14.0.0). Enables WireGuard for traffic encryption on the
Kubernetes workloads network. Boolean. Disabled by default.
Caution
Before enabling WireGuard, ensure that the Calico MTU size is
at least 60 bytes smaller than the interface MTU size of the workload
network. IPv4 WireGuard uses a 60-byte header. For details, see
Set the MTU size for Calico.
Caution
Changing this parameter on a running cluster causes a downtime
that can vary depending on the cluster size.
This section represents the Container Cloud components that are
enabled on a cluster. It contains the following fields:
management
Configuration for the management cluster components:
enabled
Management cluster enabled (true) or disabled (false).
helmReleases
List of the management cluster Helm releases that will be installed
on the cluster. A Helm release includes the name and values
fields. The specified values will be merged with relevant Helm release
values of the management cluster in the Release object.
regional
List of regional cluster components for the provider:
provider
Provider type that is baremetal.
helmReleases
List of the regional Helm releases that will be installed
on the cluster. A Helm release includes the name and values
fields. The specified values will be merged with relevant
regional Helm release values in the Release object.
The providerStatus object field of the Cluster resource that reflects
the cluster readiness contains the following fields:
persistentVolumesProviderProvisioned
Status of the persistent volumes provisioning.
Prevents the Helm releases that require persistent volumes from being
installed until some default StorageClass is added to the Cluster
object.
helm
Details about the deployed Helm releases:
ready
Status of the deployed Helm releases. The true value indicates that
all Helm releases are deployed successfully.
releases
List of the enabled Helm releases that run on the Container Cloud
cluster:
releaseStatuses
List of the deployed Helm releases. The success:true field
indicates that the release is deployed successfully.
stacklight
Status of the StackLight deployment. Contains URLs of all StackLight
components. The success:true field indicates that StackLight
is deployed successfully.
nodes
Details about the cluster nodes:
ready
Number of nodes that completed the deployment or update.
requested
Total number of nodes. If the number of ready nodes does not match
the number of requested nodes, it means that a cluster is being
currently deployed or updated.
notReadyObjects
The list of the services, deployments, and statefulsets
Kubernetes objects that are not in the Ready state yet.
A service is not ready if its external address has not been provisioned
yet. A deployment or statefulset is not ready if the number of
ready replicas is not equal to the number of desired replicas. Both objects
contain the name and namespace of the object and the number of ready and
desired replicas (for controllers). If all objects are ready, the
notReadyObjects list is empty.
The oidc section of the providerStatus object field
in the Cluster resource reflects the Open ID Connect configuration details.
It contains the required details to obtain a token for
a Container Cloud cluster and consists of the following fields:
certificate
Base64-encoded OIDC certificate.
clientId
Client ID for OIDC requests.
groupsClaim
Name of an OIDC groups claim.
issuerUrl
Issuer URL to obtain the representation of the realm.
ready
OIDC status relevance. If true, the status corresponds to the
LCMCluster OIDC configuration.
The releaseRefs section of the providerStatus object field
in the Cluster resource provides the current Cluster release version
as well as the one available for upgrade. It contains the following fields:
current
Details of the currently installed Cluster release:
lcmType
Type of the Cluster release (ucp).
name
Name of the Cluster release resource.
version
Version of the Cluster release.
unsupportedSinceKaaSVersion
Indicates that a Container Cloud release newer than
the current one exists and that it does not support the current
Cluster release.
available
List of the releases available for upgrade. Contains the name and
version fields.
For security reasons and to ensure safe and reliable cluster
operability, test this configuration on a staging environment before
applying it to production. For any questions, contact Mirantis support.
Caution
While the feature is still in the development stage,
Mirantis highly recommends deleting all HostOSConfiguration objects,
if any, before automatic upgrade of the management cluster to Container Cloud
2.27.0 (Cluster release 16.2.0). After the upgrade, you can recreate the
required objects using the updated parameters.
This precautionary step prevents re-processing and re-applying of existing
configuration, which is defined in HostOSConfiguration objects, during
management cluster upgrade to 2.27.0. Such behavior is caused by changes in
the HostOSConfiguration API introduced in 2.27.0.
This section describes the HostOSConfiguration custom resource (CR)
used in the Container Cloud API. It contains all necessary information to
introduce and load modules for further configuration of the host operating
system of the related Machine object.
Note
This object must be created and managed on the management cluster.
For demonstration purposes, we split the Container Cloud
HostOSConfiguration CR into the following sections:
The spec object field contains configuration for a
HostOSConfiguration object and has the following fields:
machineSelector
Required for production deployments. A set of Machine objects to apply
the HostOSConfiguration object to. Has the format of the Kubernetes
label selector.
configs
Required. List of configurations to apply to Machine objects defined in
machineSelector. Each entry has the following fields:
module
Required. Name of the module that refers to an existing module in one of
the HostOSConfigurationModules
objects.
moduleVersion
Required. Version of the module in use in the SemVer format.
description
Optional. Description and purpose of the configuration.
order
Optional. Positive integer between 1 and 1024 that indicates the
order of applying the module configuration. A configuration with the
lowest order value is applied first. If the order field is not set:
Since 2.27.0 (Cluster releases 17.2.0 and 16.2.0)
The configuration is applied in the order of appearance in the list,
after all configurations that have the order value set are applied.
In 2.26.0 (Cluster releases 17.1.0 and 16.1.0)
The following rules apply to the ordering when comparing each pair
of entries:
Ordering by alphabet based on the module values unless they are
equal.
Ordering by version based on the moduleVersion values, with
preference given to the lesser value.
values
Optional if secretValues is set. Module configuration in the format
of key-value pairs.
secretValues
Optional if values is set. Reference to a Secret object that
contains the configuration values for the module:
namespace
Project name of the Secret object.
name
Name of the Secret object.
Note
You can use both values and secretValues together.
But if the values are duplicated, the secretValues data rewrites
duplicated keys of the values data.
Warning
The referenced Secret object must contain only primitive
non-nested values. Otherwise, the values will not be applied correctly.
phase
Optional. LCM phase, in which a module configuration must be executed.
The only supported and default value is reconfigure. Hence, you may
omit this field.
order Removed in 2.27.0 (17.2.0 and 16.2.0)
Optional. Positive integer between 1 and 1024 that indicates the
order of applying HostOSConfiguration objects on newly added or newly
assigned machines. An object with the lowest order value is applied first.
If the value is not set, the object is applied last in the order.
If no order field is set for all HostOSConfiguration objects,
the objects are sorted by name.
Note
If a user changes the HostOSConfiguration object that was
already applied on some machines, then only the changed items from
the spec.configs section of the HostOSConfiguration object are
applied to those machines, and the execution order applies only to the
changed items.
The configuration changes are applied on corresponding LCMMachine
objects almost immediately after host-os-modules-controller
verifies the changes.
Configuration example:
spec:
  machineSelector:
    matchLabels:
      label-name: "label-value"
  configs:
  - description: Brief description of the configuration
    module: container-cloud-provided-module-name
    moduleVersion: 1.0.0
    order: 1
    # the 'phase' field is provided for illustration purposes. it is redundant
    # because the only supported value is "reconfigure".
    phase: "reconfigure"
    values:
      foo: 1
      bar: "baz"
    secretValues:
      name: values-from-secret
      namespace: default
The status field of the HostOSConfiguration object contains the
current state of the object:
controllerUpdate Since 2.27.0 (17.2.0 and 16.2.0)
Reserved. Indicates whether the status updates are initiated by
host-os-modules-controller.
isValid Since 2.27.0 (17.2.0 and 16.2.0)
Indicates whether all given configurations have been validated successfully
and are ready to be applied on machines. An invalid object is discarded
from processing.
specUpdatedAt Since 2.27.0 (17.2.0 and 16.2.0)
Defines the time of the last change in the object spec observed by
host-os-modules-controller.
containsDeprecatedModules Since 2.28.0 (17.3.0 and 16.3.0)
Indicates whether the object uses one or several deprecated modules.
Boolean.
machinesStates Since 2.27.0 (17.2.0 and 16.2.0)
Specifies the per-machine state observed by baremetal-provider.
The keys are machine names, and each entry has the following fields:
observedGeneration
Read-only. Specifies the sequence number representing the quantity of
changes in the object since its creation. For example, during object
creation, the value is 1.
selected
Indicates whether the machine satisfied the selector of the object.
Non-selected machines are not defined in machinesStates. Boolean.
secretValuesChanged
Indicates whether the secret values have been changed and the
corresponding stateItems have to be updated. Boolean.
The value is set to true by host-os-modules-controller if changes
in the secret data are detected. The value is set to false by
baremetal-provider after processing.
configStateItemsStatuses
Specifies key-value pairs with statuses of StateItems that are
applied to the machine. Each key contains the name and version
of the configuration module. Each key value has the following format:
Key: name of a configuration StateItem
Value: simplified status of the configuration StateItem that has
the following fields:
hash
Value of the hash sum from the status of the corresponding
StateItem in the LCMMachine object. Appears when the status
switches to Success.
state
Actual state of the corresponding StateItem from the
LCMMachine object. Possible values: NotStarted,
Running, Success, Failed.
configs
List of configurations statuses, indicating results of application of each
configuration. Every entry has the following fields:
moduleName
Existing module name from the list defined in the spec:modules
section of the related HostOSConfigurationModules object.
moduleVersion
Existing module version defined in the spec:modules section of the
related HostOSConfigurationModules object.
modulesReference
Name of the HostOSConfigurationModules object that contains
the related module configuration.
modulePlaybook
Name of the Ansible playbook of the module. The value is taken from
the related HostOSConfigurationModules object where this module
is defined.
moduleURL
URL to the module package in the FQDN format. The value is taken
from the related HostOSConfigurationModules object where this module
is defined.
moduleHashsum
Hash sum of the module. The value is taken from the related
HostOSConfigurationModules object where this module is defined.
lastDesignatedConfiguration
Removed in Container Cloud 2.27.0 (Cluster releases 17.2.0 and 16.2.0).
Key-value pairs representing the latest designated configuration data
for modules. Each key corresponds to a machine name, while the
associated value contains the configuration data encoded in the
gzip+base64 format.
lastValidatedSpec
Removed in Container Cloud 2.27.0 (Cluster releases 17.2.0 and 16.2.0).
Last validated module configuration encoded in the gzip+base64
format.
valuesValid
Removed in Container Cloud 2.27.0 (Cluster releases 17.2.0 and 16.2.0).
Validation state of the configuration and secret values defined in the
object spec against the module valuesValidationSchema.
Always true when valuesValidationSchema is empty.
error
Details of an error, if any, that occurs during the object processing
by host-os-modules-controller.
secretObjectVersion
Available since Container Cloud 2.27.0 (Cluster releases 17.2.0 and
16.2.0). Resource version of the corresponding Secret object observed
by host-os-modules-controller. Is present only if secretValues
is set.
moduleDeprecatedBy
Available since Container Cloud 2.28.0 (Cluster releases 17.3.0 and
16.3.0). List of modules that deprecate the currently configured module.
Contains the name and version fields specifying one or more
modules that deprecate the current module.
supportedDistributions
Available since Container Cloud 2.28.0 (Cluster releases 17.3.0
and 16.3.0). List of operating system distributions that are supported by
the current module. An empty list means support of any distribution by
the current module.
HostOSConfiguration status example:
status:
  configs:
  - moduleHashsum: bc5fafd15666cb73379d2e63571a0de96fff96ac28e5bce603498cc1f34de299
    moduleName: module-name
    modulePlaybook: main.yaml
    moduleURL: <url-to-module-archive.tgz>
    moduleVersion: 1.1.0
    modulesReference: mcc-modules
    moduleDeprecatedBy:
    - name: another-module-name
      version: 1.0.0
  - moduleHashsum: 53ec71760dd6c00c6ca668f961b94d4c162eef520a1f6cb7346a3289ac5d24cd
    moduleName: another-module-name
    modulePlaybook: main.yaml
    moduleURL: <url-to-another-module-archive.tgz>
    moduleVersion: 1.1.0
    modulesReference: mcc-modules
    secretObjectVersion: "14234794"
  containsDeprecatedModules: true
  isValid: true
  machinesStates:
    default/master-0:
      configStateItemsStatuses:
        # moduleName-moduleVersion
        module-name-1.1.0:
          # corresponding state item
          host-os-download-<object-name>-module-name-1.1.0-reconfigure:
            hash: 0e5c4a849153d3278846a8ed681f4822fb721f6d005021c4509e7126164f428d
            state: Success
          host-os-<object-name>-module-name-1.1.0-reconfigure:
            state: Not Started
        another-module-name-1.1.0:
          host-os-download-<object-name>-another-module-name-1.1.0-reconfigure:
            state: Not Started
          host-os-<object-name>-another-module-name-1.1.0-reconfigure:
            state: Not Started
      observedGeneration: 1
      selected: true
  updatedAt: "2024-04-23T14:10:28Z"
For security reasons and to ensure safe and reliable cluster
operability, test this configuration on a staging environment before
applying it to production. For any questions, contact Mirantis support.
This section describes the HostOSConfigurationModules custom resource (CR)
used in the Container Cloud API. It contains all necessary information to
introduce and load modules for further configuration of the host operating
system of the related Machine object. For description of module format,
schemas, and rules, see Format and structure of a module package.
Note
This object must be created and managed on the management cluster.
For demonstration purposes, we split the Container Cloud
HostOSConfigurationModules CR into the following sections:
url
Required for custom modules. URL to the archive containing the module
package in the FQDN format. If omitted, the module is considered to be
provided and validated by Container Cloud.
version
Required. Module version in SemVer format that must equal the
corresponding custom module version defined in the metadata section
of the corresponding module. For reference, see MOSK
documentation: Day-2 operations - Metadata file format.
sha256sum
Required. Hash sum computed using the SHA-256 algorithm.
The hash sum is automatically validated upon fetching the module
package; the module does not load if the hash sum is invalid.
deprecatesSince 2.28.0 (17.3.0 and 16.3.0)
Reserved. List of modules that will be deprecated by the module.
This field is overridden by the same field, if any, of the module
metadata section.
Contains the name and version fields specifying one or more
modules to be deprecated. If name is omitted, it inherits the name
of the current module.
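For illustration, the following is a minimal, hedged sketch of the spec
section that loads a custom module. The module name, URL, and hash sum are
placeholders, and the per-entry name field is assumed based on the
spec:modules references in the status fields below:
spec:
  modules:
  - name: custom-module-name   # placeholder; assumed per-entry field
    url: https://fully.qualified.domain.name/to/module/archive/custom-module-name-1.0.0.tgz   # placeholder URL
    version: 1.0.0
    sha256sum: <sha256-hash-of-the-module-archive>   # placeholder hash sum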
The status field of the HostOSConfigurationModules object contains the
current state of the object:
modules
List of module statuses, indicating the loading results of each module.
Each entry has the following fields:
name
Name of the loaded module.
version
Version of the loaded module.
url
URL to the archive containing the loaded module package in the FQDN
format.
docURL
URL to the loaded module documentation if it was initially present
in the module package.
description
Description of the loaded module if it was initially present in the
module package.
sha256sum
Actual SHA-256 hash sum of the loaded module.
valuesValidationSchema
JSON schema used against the module configuration values if it was
initially present in the module package. The value is encoded in the
gzip+base64 format.
state
Actual availability state of the module. Possible values are:
available or error.
error
Error, if any, that occurred during the module fetching and verification.
playbookName
Name of the module package playbook.
deprecatesSince 2.28.0 (17.3.0 and 16.3.0)
List of modules that are deprecated by the module. Contains the name
and version fields specifying one or more modules deprecated by the
current module.
deprecatedBySince 2.28.0 (17.3.0 and 16.3.0)
List of modules that deprecate the current module. Contains the name
and version fields specifying one or more modules that deprecate
the current module.
supportedDistributionsSince 2.28.0 (17.3.0 and 16.3.0)
List of operating system distributions that are supported by
the current module. An empty list means support of any distribution by
the current module.
HostOSConfigurationModules status example:
status:
  modules:
  - description: Brief description of the module
    docURL: https://docs.mirantis.com
    name: mirantis-provided-module-name
    playbookName: directory/main.yaml
    sha256sum: ff3c426d5a2663b544acea74e583d91cc2e292913fc8ac464c7d52a3182ec146
    state: available
    url: https://example.mirantis.com/path/to/module-name-1.0.0.tgz
    valuesValidationSchema: <gzip+base64 encoded data>
    version: 1.0.0
    deprecates:
    - name: custom-module-name
      version: 1.0.0
  - description: Brief description of the module
    docURL: https://example.documentation.page/module-name
    name: custom-module-name
    playbookName: directory/main.yaml
    sha256sum: 258ccafac1570de7b7829bde108fa9ee71b469358dbbdd0215a081f8acbb63ba
    state: available
    url: https://fully.qualified.domain.name/to/module/archive/module-name-1.0.0.tgz
    version: 1.0.0
    deprecatedBy:
    - name: mirantis-provided-module-name
      version: 1.0.0
    supportedDistributions:
    - ubuntu/jammy
This section describes the IPaddr resource used in Mirantis
Container Cloud API. The IPAddr object describes an IP address
and contains all information about the associated MAC address.
For demonstration purposes, the Container Cloud IPaddr
custom resource (CR) is split into the following major sections:
The Container Cloud IPaddr CR contains the following fields:
apiVersion
API version of the object that is ipam.mirantis.com/v1alpha1
kind
Object type that is IPaddr
metadata
The metadata field contains the following subfields:
name
Name of the IPaddr object in the auto-XX-XX-XX-XX-XX-XX format
where XX-XX-XX-XX-XX-XX is the associated MAC address
namespace
Project in which the IPaddr object was created
labels
Key-value pairs that are attached to the object:
ipam/IP
IPv4 address
ipam/IpamHostID
Unique ID of the associated IpamHost object
ipam/MAC
MAC address
ipam/SubnetID
Unique ID of the Subnet object
ipam/UID
Unique ID of the IPAddr object
Warning
Labels and annotations that are not documented in this API
Reference are generated automatically by Container Cloud. Do not modify them
using the Container Cloud API.
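A hedged metadata sketch that illustrates the naming convention and labels
listed above; the MAC address, IP address, and IDs are placeholders shown
only to indicate the expected formats:
apiVersion: ipam.mirantis.com/v1alpha1
kind: IPaddr
metadata:
  name: auto-0c-c4-7a-a8-b8-18   # derived from the associated MAC address
  namespace: default
  labels:
    ipam/IP: 172.16.48.201
    ipam/IpamHostID: <ipamhost-uid>   # placeholder ID
    ipam/MAC: 0C-C4-7A-A8-B8-18
    ipam/SubnetID: <subnet-uid>       # placeholder ID
    ipam/UID: <ipaddr-uid>            # placeholder ID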
The status object field of the IPAddr resource reflects the actual
state of the IPAddr object. It contains the following fields:
address
IP address.
cidr
IPv4 CIDR for the Subnet.
gateway
Gateway address for the Subnet.
mac
MAC address in the XX:XX:XX:XX:XX:XX format.
nameservers
List of the IP addresses of name servers of the Subnet.
Each element of the list is a single address, for example, 172.18.176.6.
stateSince 2.23.0
Message that reflects the current status of the resource.
The list of possible values includes the following:
OK - object is operational.
ERR - object is non-operational. This status has a detailed
description in the messages list.
TERM - object was deleted and is terminating.
messagesSince 2.23.0
List of error or warning messages if the object state is ERR.
objCreated
Date, time, and IPAM version of the resource creation.
objStatusUpdated
Date, time, and IPAM version of the last update of the status
field in the resource.
objUpdated
Date, time, and IPAM version of the last resource update.
phase
Deprecated since Container Cloud 2.23.0 and will be removed in one of the
following releases in favor of state. Possible values: Active,
Failed, or Terminating.
Configuration example:
status:
  address: 172.16.48.201
  cidr: 172.16.48.201/24
  gateway: 172.16.48.1
  objCreated: 2021-10-21T19:09:32Z by v5.1.0-20210930-121522-f5b2af8
  objStatusUpdated: 2021-10-21T19:14:18.748114886Z by v5.1.0-20210930-121522-f5b2af8
  objUpdated: 2021-10-21T19:09:32.606968024Z by v5.1.0-20210930-121522-f5b2af8
  mac: 0C:C4:7A:A8:B8:18
  nameservers:
  - 172.18.176.6
  state: OK
  phase: Active
This section describes the IpamHost resource used in Mirantis
Container Cloud API. The kaas-ipam controller monitors
the current state of the bare metal Machine and verifies whether BareMetalHost
is successfully created and inspection is completed.
Then the kaas-ipam controller fetches the information about the network
interface configuration, creates the IpamHost object, and requests the IP
addresses.
The IpamHost object is created for each Machine and contains
all configuration of the host network interfaces and IP addresses.
It also contains the information about associated BareMetalHost,
Machine, and MAC addresses.
Note
Before update of the management cluster to Container Cloud 2.29.0
(Cluster release 16.4.0), instead of BareMetalHostInventory, use the
BareMetalHost object. For details, see BareMetalHost.
Caution
While the Cluster release of the management cluster is 16.4.0,
BareMetalHostInventory operations are allowed to
m:kaas@management-admin only. Once the management cluster is updated
to the Cluster release 16.4.1 (or later), this limitation will be lifted.
For demonstration purposes, the Container Cloud IpamHost
custom resource (CR) is split into the following major sections:
The Container Cloud IpamHost CR contains the following fields:
apiVersion
API version of the object that is ipam.mirantis.com/v1alpha1
kind
Object type that is IpamHost
metadata
The metadata field contains the following subfields:
name
Name of the IpamHost object
namespace
Project in which the IpamHost object has been created
labels
Key-value pairs that are attached to the object:
cluster.sigs.k8s.io/cluster-name
References the Cluster object name that IpamHost is
assigned to
ipam/BMHostID
Unique ID of the associated BareMetalHost object
ipam/MAC-XX-XX-XX-XX-XX-XX:"1"
Number of NICs of the host that the corresponding MAC address is
assigned to
ipam/MachineID
Unique ID of the associated Machine object
ipam/UID
Unique ID of the IpamHost object
Warning
Labels and annotations that are not documented in this API
Reference are generated automatically by Container Cloud. Do not modify them
using the Container Cloud API.
The spec field of the IpamHost resource describes the desired
state of the object. It contains the following fields:
nicMACmap
Represents an unordered list of all NICs of the host obtained during the
bare metal host inspection.
Each NIC entry contains such fields as name, mac, ip,
and so on. The primary field defines which NIC was used for PXE booting.
Only one NIC can be primary. The IP address is not configurable
and is provided only for debug purposes.
l2TemplateSelector
If specified, contains the name (first priority) or label
of the L2 template that will be applied during a machine creation.
The l2TemplateSelector field is copied from the Machine providerSpec
object to the IpamHost object only once, during machine creation.
To modify l2TemplateSelector after creation of a Machine CR, edit the
IpamHost object.
netconfigUpdateModeTechPreview
Update mode of network configuration. Possible values:
MANUAL
Default, recommended. An operator manually applies new network
configuration.
AUTO-UNSAFE
Unsafe, not recommended. If new network configuration is rendered by
kaas-ipam successfully, it is applied automatically with no
manual approval.
MANUAL-GRACEPERIOD
Initial value set during the IpamHost object creation. If new network
configuration is rendered by kaas-ipam successfully, it is applied
automatically with no manual approval. This value is implemented for
automatic changes in the IpamHost object during the host provisioning
and deployment. The value is changed automatically to MANUAL in
three hours after the IpamHost object creation.
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
netconfigUpdateAllowTechPreview
Manual approval of network changes. Possible values: true or false.
Set to true to approve the Netplan configuration file candidate
(stored in netconfigCandidate) and copy its contents to the effective
Netplan configuration file list (stored in netconfigFiles). After that,
its value is automatically switched back to false.
Note
This value has effect only if netconfigUpdateMode is set to
MANUAL.
Set to true only if status.netconfigCandidateState of network
configuration candidate is OK.
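A hedged sketch of the spec fields described above; the MAC address, NIC
name, and template name are placeholders, and nicMACmap is normally
populated automatically from the bare metal host inspection data:
spec:
  nicMACmap:
  - mac: 0c:c4:7a:a8:b8:18    # placeholder MAC address
    name: eno1                # placeholder NIC name
    primary: true
  l2TemplateSelector:
    name: l2-template-name    # placeholder L2 template name
  netconfigUpdateMode: MANUAL
  netconfigUpdateAllow: false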
Caution
The following fields of the ipamHost status are renamed since
Container Cloud 2.22.0 in the scope of the L2Template and IpamHost
objects refactoring:
netconfigV2 to netconfigCandidate
netconfigV2state to netconfigCandidateState
netconfigFilesState to netconfigFilesStates (per file)
No user actions are required after renaming.
The format of netconfigFilesState changed after renaming. The
netconfigFilesStates field contains a dictionary of statuses of network
configuration files stored in netconfigFiles. The dictionary contains
the keys that are file paths and values that have the same meaning for each
file that netconfigFilesState had:
For a successfully rendered configuration file:
OK:<timestamp><sha256-hash-of-rendered-file>, where a timestamp
is in the RFC 3339 format.
For a failed rendering: ERR:<error-message>.
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
The status field of the IpamHost resource describes the observed
state of the object. It contains the following fields:
netconfigCandidate
Candidate of the Netplan configuration file in human readable format that
is rendered using the corresponding L2Template. This field contains
valid data if l2RenderResult and netconfigCandidateState retain the
OK result.
l2RenderResultDeprecated
Status of a rendered Netplan configuration candidate stored in
netconfigCandidate. Possible values:
For a successful L2 template rendering:
OK: timestamp sha256-hash-of-rendered-netplan, where
timestamp is in the RFC 3339 format
For a failed rendering: ERR: <error-message>
This field is deprecated and will be removed in one of the following
releases. Use netconfigCandidateState instead.
netconfigCandidateStateTechPreview
Status of a rendered Netplan configuration candidate stored in
netconfigCandidate. Possible values:
For a successful L2 template rendering:
OK: timestamp sha256-hash-of-rendered-netplan, where
timestamp is in the RFC 3339 format
For a failed rendering: ERR: <error-message>
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
netconfigFiles
List of effective Netplan configuration files rendered from the
corresponding L2Template. Its contents are changed only if rendering of the
Netplan configuration was successful, so it always retains the last
successfully rendered Netplan configuration. To apply changes in contents,
the Infrastructure Operator approval is required. For details, see Modify network configuration on an existing machine.
Every item in this list contains:
content
The base64-encoded Netplan configuration file that was rendered
using the corresponding L2Template.
path
The file path for the Netplan configuration file on the target host.
netconfigFilesStates
Status of Netplan configuration files stored in netconfigFiles.
Possible values are:
For a successful L2 template rendering:
OK: timestamp sha256-hash-of-rendered-netplan, where
timestamp is in the RFC 3339 format
For a failed rendering: ERR: <error-message>
serviceMap
Dictionary of services and their endpoints (IP address and optional
interface name) that have the ipam/SVC-<serviceName> label.
These addresses are added to the ServiceMap dictionary
during rendering of an L2 template for a given IpamHost.
For details, see Service labels and their life cycle.
stateSince 2.23.0
Message that reflects the current status of the resource.
The list of possible values includes the following:
OK - object is operational.
ERR - object is non-operational. This status has a detailed
description in the messages list.
TERM - object was deleted and is terminating.
messagesSince 2.23.0
List of error or warning messages if the object state is ERR.
objCreated
Date, time, and IPAM version of the resource creation.
objStatusUpdated
Date, time, and IPAM version of the last update of the status
field in the resource.
objUpdated
Date, time, and IPAM version of the last resource update.
Configuration example:
status:
  l2RenderResult: OK
  l2TemplateRef: namespace_name/l2-template-name/1/2589/88865f94-04f0-4226-886b-2640af95a8ab
  netconfigFiles:
  - content: ...<base64-encoded Netplan configuration file>...
    path: /etc/netplan/60-kaas-lcm-netplan.yaml
  netconfigFilesStates:
    /etc/netplan/60-kaas-lcm-netplan.yaml: 'OK:2023-01-23T09:27:22.71802Z ece7b73808999b540e32ca1720c6b7a6e54c544cc82fa40d7f6b2beadeca0f53'
  netconfigCandidate: ...<Netplan configuration file in plain text, rendered from L2Template>...
  netconfigCandidateState: OK:2022-06-08T03:18:08.49590Z a4a128bc6069638a37e604f05a5f8345cf6b40e62bce8a96350b5a29bc8bccde
  serviceMap:
    ipam/SVC-ceph-cluster:
    - ifName: ceph-br2
      ipAddress: 10.0.10.11
    - ifName: ceph-br1
      ipAddress: 10.0.12.22
    ipam/SVC-ceph-public:
    - ifName: ceph-public
      ipAddress: 10.1.1.15
    ipam/SVC-k8s-lcm:
    - ifName: k8s-lcm
      ipAddress: 10.0.1.52
  phase: Active
  state: OK
  objCreated: 2021-10-21T19:09:32Z by v5.1.0-20210930-121522-f5b2af8
  objStatusUpdated: 2021-10-21T19:14:18.748114886Z by v5.1.0-20210930-121522-f5b2af8
  objUpdated: 2021-10-21T19:09:32.606968024Z by v5.1.0-20210930-121522-f5b2af8
This section describes the L2Template resource used in Mirantis
Container Cloud API.
By default, Container Cloud configures a single interface on cluster nodes,
leaving all other physical interfaces intact.
With L2Template, you can create advanced host networking configurations
for your clusters. For example, you can create bond interfaces on top of
physical interfaces on the host.
For demonstration purposes, the Container Cloud L2Template
custom resource (CR) is split into the following major sections:
The Container Cloud L2Template CR contains the following fields:
apiVersion
API version of the object that is ipam.mirantis.com/v1alpha1.
kind
Object type that is L2Template.
metadata
The metadata field contains the following subfields:
name
Name of the L2Template object.
namespace
Project in which the L2Template object was created.
labels
Key-value pairs that are attached to the object:
Caution
All ipam/* labels, except ipam/DefaultForCluster,
are set automatically and must not be configured manually.
cluster.sigs.k8s.io/cluster-name
Mandatory for newly created L2Template since Container Cloud
2.25.0 (Cluster releases 17.0.0 and 16.0.0). References the
Cluster object name that this template is applied to.
The process of selecting the L2Template object for a specific
cluster is as follows:
The kaas-ipam controller monitors the L2Template objects
with the cluster.sigs.k8s.io/cluster-name:<clusterName> label.
The L2Template object with the
cluster.sigs.k8s.io/cluster-name:<clusterName>
label is assigned to a cluster with Name:<clusterName>,
if available.
ipam/PreInstalledL2Template:"1"
Is automatically added during a management cluster deployment.
Indicates that the current L2Template object was preinstalled.
Represents L2 templates that are automatically copied to a project
once it is created. Once the L2 templates are copied,
the ipam/PreInstalledL2Template label is removed.
Note
Preinstalled L2 templates are removed in Container Cloud
2.26.0 (Cluster releases 17.1.0 and 16.1.0) along with the
ipam/PreInstalledL2Template label. During cluster update to the
mentioned releases, existing preinstalled templates are
automatically removed.
ipam/DefaultForCluster
This label is unique per cluster. When you use several L2 templates
per cluster, only the first template is automatically labeled
as the default one. All subsequent templates must be referenced
in the machines configuration files using l2TemplateSelector.
You can manually configure this label if required.
ipam/UID
Unique ID of an object.
kaas.mirantis.com/provider
Provider type.
kaas.mirantis.com/region
Region name.
Note
The kaas.mirantis.com/region label is removed from all
Container Cloud objects in 2.26.0 (Cluster releases 17.1.0 and 16.1.0).
Therefore, do not add the label starting from these releases. On existing
clusters updated to these releases, or if manually added, this label will
be ignored by Container Cloud.
Warning
Labels and annotations that are not documented in this API
Reference are generated automatically by Container Cloud. Do not modify them
using the Container Cloud API.
An L2 template must have the same project (Kubernetes namespace) as the
referenced cluster.
A cluster can be associated with many L2 templates. Only one of them can
have the ipam/DefaultForCluster label. Every L2 template that does not
have the ipam/DefaultForCluster label can be later assigned to a
particular machine using l2TemplateSelector.
The following rules apply to the default L2 template of a namespace:
Since Container Cloud 2.25.0 (Cluster releases 17.0.0 and 16.0.0),
creation of the default L2 template for a namespace is disabled. On
existing clusters, the Spec.clusterRef:default parameter of such an
L2 template is automatically removed during the migration process.
Subsequently, this parameter is not substituted with the
cluster.sigs.k8s.io/cluster-name label, ensuring the application
of the L2 template across the entire Kubernetes namespace. Therefore,
you can continue using existing default namespaced L2 templates.
Before Container Cloud 2.25.0 (Cluster releases 15.x, 14.x, or earlier),
the default L2Template object of a namespace must have the
Spec.clusterRef:default parameter that is deprecated since 2.25.0.
The spec field of the L2Template resource describes the desired
state of the object. It contains the following fields:
ifMapping
List of interface names for the template. The interface mapping is defined
globally for all Machine objects linked to the template but can be
overridden at the host level, if required, by editing the IpamHost
object for a particular host. The ifMapping parameter is mutually
exclusive with autoIfMappingPrio.
autoIfMappingPrio
List of prefixes, such as eno, ens, and so on, to match the
interfaces to automatically create a list for the template. If you are not
aware of any specific ordering of interfaces on the nodes, use the default
ordering from Predictable Network Interfaces Names specification for
systemd.
You can also override the default NIC list per host using the
IfMappingOverride parameter of the corresponding IpamHost. The
provision value corresponds to the network interface that was used to
provision a node. Usually, it is the first NIC found on a particular node.
It is defined explicitly to ensure that this interface will not be
reconfigured accidentally.
The autoIfMappingPrio parameter is mutually exclusive with
ifMapping.
l3Layout
Subnets to be used in the npTemplate section. The field contains
a list of subnet definitions with parameters used by template macros.
subnetName
Defines the alias name of the subnet that can be used to reference this
subnet from the template macros. This parameter is mandatory for every
entry in the l3Layout list.
subnetPoolUnsupported since 2.28.0 (17.3.0 and 16.3.0)
Optional. Default: none. Defines a name of the parent SubnetPool
object that will be used to create a Subnet object with a given
subnetName and scope. For deprecation details, see
MOSK Deprecation Notes: SubnetPool resource management.
If a corresponding Subnet object already exists,
nothing will be created and the existing object will be used.
If no SubnetPool is provided, no new Subnet object will be
created.
scope
Logical scope of the Subnet object with a corresponding
subnetName. Possible values:
global - the Subnet object is accessible globally,
for any Container Cloud project and cluster, for example, the PXE
subnet.
namespace - the Subnet object is accessible within the same
project where the L2 template is defined.
cluster - Unsupported since Container Cloud 2.28.0 (Cluster
releases 17.3.0 and 16.3.0). The Subnet object uses the namespace
where the referenced cluster is located. A subnet is only accessible to
the cluster that
L2Template.metadata.labels:cluster.sigs.k8s.io/cluster-name
(mandatory since MOSK 23.3) or
L2Template.spec.clusterRef (deprecated in MOSK 23.3)
refers to. The Subnet objects with the cluster scope will be
created for every new cluster.
Note
Every subnet referenced in an L2 template can have either a
global or namespaced scope. In the latter case, the subnet must exist
in the same project where the corresponding cluster and L2 template
are located.
labelSelector
Contains a dictionary of labels and their respective values that will be
used to find the matching Subnet object. If the labelSelector
field is omitted, the Subnet object will be selected by name,
specified by the subnetName parameter.
Caution
The labels and their values in this section must match the
ones added for the corresponding Subnet object.
Caution
The l3Layout section is mandatory for each L2Template
custom resource.
npTemplate
A netplan-compatible configuration with special lookup functions that
defines the networking settings for the cluster hosts, where physical
NIC names and details are parameterized. This configuration will be
processed using Go templates. Instead of specifying IP and MAC addresses,
interface names, and other network details specific to a particular host,
the template supports use of special lookup functions. These lookup
functions, such as nic, mac, ip, and so on, return
host-specific network information when the template is rendered for
a particular host.
Caution
All rules and restrictions of the netplan configuration also
apply to L2 templates. For details, see
official netplan documentation.
Caution
Mirantis strongly recommends following the below conventions on
network interface naming:
A physical NIC name set by an L2 template must not exceed 15 symbols.
Otherwise, an L2 template creation fails. This limit is set by the
Linux kernel.
Names of virtual network interfaces such as VLANs, bridges, bonds,
veth, and so on must not exceed 15 symbols.
Mirantis recommends setting interface names that do not exceed 13
symbols for both physical and virtual interfaces to avoid corner cases
and issues in netplan rendering.
The following table describes the main lookup functions for an
L2 template. A combined usage sketch follows this list of spec fields.
Lookup function
Description
{{nic N}}
Name of a NIC number N. NIC numbers correspond to the interface
mapping list. This macro can be used as a key for the elements
of the ethernets map, or as the value of the name and
set-name parameters of a NIC. It is also used to reference the
physical NIC from definitions of virtual interfaces (vlan,
bridge).
{{mac N}}
MAC address of a NIC number N registered during a host hardware
inspection.
{{ip “N:subnet-a”}}
IP address and mask for a NIC number N. The address will be auto-allocated
from the given subnet if the address does not exist yet.
{{ip “br0:subnet-x”}}
IP address and mask for a virtual interface, “br0” in this example.
The address will be auto-allocated from the given subnet
if the address does not exist yet.
For virtual interface names, an IP address placeholder must contain
a human-readable ID that is unique within the L2 template and must
have the following format: {{ip “<shortUniqueHumanReadableID>:<subnet-name>”}}.
The <shortUniqueHumanReadableID> is made equal to the virtual
interface name throughout this document and the Container Cloud
bootstrap templates.
{{cidr_from_subnet “subnet-a”}}
IPv4 CIDR address from the given subnet.
{{gateway_from_subnet “subnet-a”}}
IPv4 default gateway address from the given subnet.
{{nameservers_from_subnet “subnet-a”}}
List of the IP addresses of name servers from the given subnet.
{{cluster_api_lb_ip}}
Technology Preview since Container Cloud 2.24.4 (Cluster releases
15.0.3 and 14.0.3). IP address for a cluster API load balancer.
clusterRef
Caution
Deprecated since Container Cloud 2.25.0 (Cluster releases
17.0.0 and 16.0.0) in favor of the mandatory
cluster.sigs.k8s.io/cluster-name label. Will be removed in one of
the following releases.
On existing clusters, this parameter is automatically migrated to the
cluster.sigs.k8s.io/cluster-name label since 2.25.0.
If an existing cluster has clusterRef:default set, the migration
process involves removing this parameter. Subsequently, it is not
substituted with the cluster.sigs.k8s.io/cluster-name label, ensuring
the application of the L2 template across the entire Kubernetes
namespace.
The Cluster object name that this template is applied to.
The default value is used to apply the given template to all clusters
within a particular project, unless an L2 template that references
a specific cluster name exists. The clusterRef field has priority over
the cluster.sigs.k8s.io/cluster-name label:
When clusterRef is set to a non-default value, the
cluster.sigs.k8s.io/cluster-name label will be added or updated with
that value.
When clusterRef is set to default, the
cluster.sigs.k8s.io/cluster-name label will be absent or removed.
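The following hedged sketch combines l3Layout and npTemplate with several
of the lookup functions described above; the subnet alias, interface
mapping prefixes, and the resulting addresses are placeholders rather than
values from a real deployment:
spec:
  autoIfMappingPrio:
  - provision
  - eno
  - ens
  l3Layout:
  - subnetName: subnet-a    # placeholder subnet alias
    scope: namespace
  npTemplate: |
    version: 2
    ethernets:
      {{nic 0}}:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 0}}
        set-name: {{nic 0}}
        addresses:
          - {{ip "0:subnet-a"}}
        gateway4: {{gateway_from_subnet "subnet-a"}}
        nameservers:
          addresses: {{nameservers_from_subnet "subnet-a"}}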
The status field of the L2Template resource reflects the actual state
of the L2Template object and contains the following fields:
stateSince 2.23.0
Message that reflects the current status of the resource.
The list of possible values includes the following:
OK - object is operational.
ERR - object is non-operational. This status has a detailed
description in the messages list.
TERM - object was deleted and is terminating.
messagesSince 2.23.0
List of error or warning messages if the object state is ERR.
objCreated
Date, time, and IPAM version of the resource creation.
objStatusUpdated
Date, time, and IPAM version of the last update of the status
field in the resource.
objUpdated
Date, time, and IPAM version of the last resource update.
phase
Deprecated since Container Cloud 2.23.0 (Cluster release 11.7.0) and will be
removed in one of the following releases in favor of state. Possible
values: Active, Failed, or Terminating.
reason
Deprecated since Container Cloud 2.23.0 (Cluster release 11.7.0) and will be
removed in one of the following releases in favor of messages. For the
field description, see messages.
Configuration example:
status:
  phase: Failed
  state: ERR
  messages:
  - "ERR: The kaas-mgmt subnet in the terminating state."
  objCreated: 2021-10-21T19:09:32Z by v5.1.0-20210930-121522-f5b2af8
  objStatusUpdated: 2021-10-21T19:14:18.748114886Z by v5.1.0-20210930-121522-f5b2af8
  objUpdated: 2021-10-21T19:09:32.606968024Z by v5.1.0-20210930-121522-f5b2af8
This section describes the Machine resource used in Mirantis
Container Cloud API for bare metal provider.
The Machine resource describes the machine-level parameters.
For demonstration purposes, the Container Cloud Machine
custom resource (CR) is split into the following major sections:
The Container Cloud Machine CR contains the following fields:
apiVersion
API version of the object that is cluster.k8s.io/v1alpha1.
kind
Object type that is Machine.
The metadata object field of the Machine resource contains
the following fields:
name
Name of the Machine object.
namespace
Project in which the Machine object is created.
annotations
Key-value pair to attach arbitrary metadata to the object:
metal3.io/BareMetalHost
Annotation attached to the Machine object to reference
the corresponding BareMetalHostInventory object in the
<BareMetalHostProjectName/BareMetalHostName> format.
Note
Before update of the management cluster to Container Cloud 2.29.0
(Cluster release 16.4.0), instead of BareMetalHostInventory, use the
BareMetalHost object. For details, see BareMetalHost.
Caution
While the Cluster release of the management cluster is 16.4.0,
BareMetalHostInventory operations are allowed to
m:kaas@management-admin only. Once the management cluster is updated
to the Cluster release 16.4.1 (or later), this limitation will be lifted.
labels
Key-value pairs that are attached to the object:
kaas.mirantis.com/provider
Provider type that matches the provider type in the Cluster object
and must be baremetal.
kaas.mirantis.com/region
Region name that matches the region name in the Cluster object.
Note
The kaas.mirantis.com/region label is removed from all
Container Cloud objects in 2.26.0 (Cluster releases 17.1.0 and 16.1.0).
Therefore, do not add the label starting from these releases. On existing
clusters updated to these releases, or if manually added, this label will
be ignored by Container Cloud.
cluster.sigs.k8s.io/cluster-name
Cluster name that the Machine object is linked to.
cluster.sigs.k8s.io/control-plane
For the control plane role of a machine, this label contains any value,
for example, "true".
For the worker role, this label is absent.
Warning
Labels and annotations that are not documented in this API
Reference are generated automatically by Container Cloud. Do not modify them
using the Container Cloud API.
Configuration example:
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: example-control-plane
  namespace: example-ns
  annotations:
    metal3.io/BareMetalHost: default/master-0
  labels:
    kaas.mirantis.com/provider: baremetal
    cluster.sigs.k8s.io/cluster-name: example-cluster
    cluster.sigs.k8s.io/control-plane: "true" # remove for worker
The spec object field of the Machine object represents
the BareMetalMachineProviderSpec subresource with all required
details to create a bare metal instance. It contains the following fields:
apiVersion
API version of the object that is baremetal.k8s.io/v1alpha1.
kind
Object type that is BareMetalMachineProviderSpec.
bareMetalHostProfile
Configuration profile of a bare metal host:
name
Name of a bare metal host profile
namespace
Project in which the bare metal host profile is created.
l2TemplateIfMappingOverride
If specified, overrides the interface mapping value for the corresponding
L2Template object.
l2TemplateSelector
If specified, contains the name (first priority) or label
of the L2 template that will be applied during a machine creation.
The l2TemplateSelector field is copied from the Machine providerSpec
object to the IpamHost object only once, during machine creation.
To modify l2TemplateSelector after creation
of a Machine CR, edit the IpamHost object.
hostSelector
Specifies the matching criteria for labels on the bare metal hosts.
Limits the set of the BareMetalHostInventory objects considered for
claiming for the Machine object. The following selector labels
can be added when creating a machine using the Container Cloud web UI:
hostlabel.bm.kaas.mirantis.com/controlplane
hostlabel.bm.kaas.mirantis.com/worker
hostlabel.bm.kaas.mirantis.com/storage
Any custom label that is assigned to one or more bare metal hosts using API
can be used as a host selector. If the BareMetalHostInventory objects
with the specified label are missing, the Machine object will not
be deployed until at least one bare metal host with the specified label
is available.
Note
Before update of the management cluster to Container Cloud 2.29.0
(Cluster release 16.4.0), instead of BareMetalHostInventory, use the
BareMetalHost object. For details, see BareMetalHost.
Caution
While the Cluster release of the management cluster is 16.4.0,
BareMetalHostInventory operations are allowed to
m:kaas@management-admin only. Once the management cluster is updated
to the Cluster release 16.4.1 (or later), this limitation will be lifted.
nodeLabels
List of node labels to be attached to a node for the user to run certain
components on separate cluster nodes. The list of allowed node labels
is located in the Cluster object status
providerStatus.releaseRef.current.allowedNodeLabels field.
If the value field is not defined in allowedNodeLabels, a label can
have any value.
Before or after a machine deployment, add the required label from the allowed
node labels list with the corresponding value to
spec.providerSpec.value.nodeLabels in machine.yaml. For example:
nodeLabels:
- key: stacklight
  value: enabled
The addition of a node label that is not available in the list of allowed node
labels is restricted.
distributionMandatory
Specifies an operating system (OS) distribution ID that is present in the
current ClusterRelease object under the AllowedDistributions list.
When specified, the BareMetalHostInventory object linked to this
Machine object will be provisioned using the selected OS distribution
instead of the default one.
By default, ubuntu/jammy is installed on greenfield managed clusters:
Since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0), for
MOSK clusters
Since Container Cloud 2.27.0 (Cluster releases 17.2.0 and 16.2.0), for
non-MOSK clusters
The default distribution is marked with the boolean flag default
inside one of the elements under the AllowedDistributions list.
The ubuntu/focal distribution was deprecated in Container Cloud 2.28.0
and is only supported for existing managed clusters. The Container Cloud 2.28.x
release series is the last one to support Ubuntu 20.04 as the host operating
system for managed clusters.
Caution
The outdated ubuntu/bionic distribution, which is removed
in Cluster releases 17.0.0 and 16.0.0, is only supported for existing
clusters based on Ubuntu 18.04. For greenfield deployments of managed
clusters, only ubuntu/jammy is supported.
Warning
During the course of the Container Cloud 2.28.x series, Mirantis
highly recommends upgrading an operating system on any nodes of all your
managed cluster machines to Ubuntu 22.04 before the next major Cluster
release becomes available.
It is not mandatory to upgrade all machines at once. You can upgrade them
one by one or in small batches, for example, if the maintenance window is
limited in time.
Otherwise, the Cluster release update of the Ubuntu 20.04-based managed
clusters will become impossible as of Container Cloud 2.29.0 with Ubuntu
22.04 as the only supported version.
Management cluster update to Container Cloud 2.29.1 will be blocked if
at least one node of any related managed cluster is running Ubuntu 20.04.
maintenance
Maintenance mode of a machine. If enabled, the node of the selected machine
is drained, cordoned, and prepared for maintenance operations.
upgradeIndex (optional)
Positive numeral value that determines the order of machines upgrade. The
first machine to upgrade is always one of the control plane machines
with the lowest upgradeIndex. Other control plane machines are upgraded
one by one according to their upgrade indexes.
If the Cluster spec dedicatedControlPlane field is false, worker
machines are upgraded only after the upgrade of all control plane machines
finishes. Otherwise, they are upgraded after the first control plane
machine, concurrently with other control plane machines.
If two or more machines have the same value of upgradeIndex, these
machines are equally prioritized during upgrade.
deletionPolicy
Generally available since Container Cloud 2.25.0 (Cluster releases 17.0.0
and 16.0.0). Technology Preview since 2.21.0 (Cluster releases 11.5.0 and
7.11.0) for non-MOSK clusters. Policy used to identify steps
required during a Machine object deletion. Supported policies are as
follows:
graceful
Prepares a machine for deletion by cordoning, draining, and removing
from Docker Swarm of the related node. Then deletes Kubernetes objects
and associated resources. Can be aborted only before a node is removed
from Docker Swarm.
unsafe
Default. Deletes Kubernetes objects and associated resources without any
preparations.
forced
Deletes Kubernetes objects and associated resources without any
preparations. Removes the Machine object even if the cloud provider
or LCM Controller gets stuck at some step. May require a manual cleanup
of machine resources in case of the controller failure.
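For illustration, a hedged sketch of the spec section that combines the
provider-specific fields described above; the profile name, host selector
label, and L2 template name are placeholders:
spec:
  providerSpec:
    value:
      apiVersion: baremetal.k8s.io/v1alpha1
      kind: BareMetalMachineProviderSpec
      bareMetalHostProfile:
        name: default            # placeholder profile name
        namespace: default
      hostSelector:
        matchLabels:
          kaas.mirantis.com/baremetalhost-id: hw-master-0   # placeholder host label
      l2TemplateSelector:
        name: l2-template-name   # placeholder L2 template name
      nodeLabels:
      - key: stacklight
        value: enabled
      distribution: ubuntu/jammy
      deletionPolicy: graceful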
The status object field of the Machine object represents the
BareMetalMachineProviderStatus subresource that describes the current
bare metal instance state and contains the following fields:
apiVersion
API version of the object that is cluster.k8s.io/v1alpha1.
kind
Object type that is BareMetalMachineProviderStatus.
hardware
Provides a machine hardware information:
cpu
Number of CPUs.
ram
RAM capacity in GB.
storage
List of hard drives mounted on the machine. Contains the disk name
and size in GB.
status
Represents the current status of a machine:
Provision
A machine is yet to obtain a status
Uninitialized
A machine is yet to obtain the node IP address and host name
Pending
A machine is yet to receive the deployment instructions and
it is either not booted yet or waits for the LCM controller to be
deployed
Prepare
A machine is running the Prepare phase during which Docker images
and packages are being predownloaded
Deploy
A machine is processing the LCM Controller instructions
Reconfigure
A machine is being updated with a configuration without affecting
workloads running on the machine
Ready
A machine is deployed and the supported Mirantis Kubernetes Engine (MKE)
version is set
Maintenance
A machine host is cordoned, drained, and prepared for maintenance
operations
currentDistributionSince 2.24.0 as TechPreview and 2.24.2 as GA
Distribution ID of the current operating system installed on the machine.
For example, ubuntu/jammy.
maintenance
Maintenance mode of a machine. If enabled, the node of the selected machine
is drained, cordoned, and prepared for maintenance operations.
rebootAvailable since 2.22.0
Indicator of a host reboot to complete the Ubuntu operating system updates,
if any.
required
Specifies whether a host reboot is required. Boolean. If true,
a manual host reboot is required.
reason
Specifies the package name(s) to apply during a host reboot.
upgradeIndex
Positive numeral value that determines the order of machines upgrade. If
upgradeIndex in the Machine object spec is set, this status value
equals the one in the spec. Otherwise, this value displays the automatically
generated order of upgrade.
delete
Generally available since Container Cloud 2.25.0 (Cluster releases 17.0.0
and 16.0.0). Technology Preview since 2.21.0 for non-MOSK
clusters. Start of a machine deletion or a successful abortion. Boolean.
prepareDeletionPhase
Generally available since Container Cloud 2.25.0 (Cluster releases 17.0.0
and 16.0.0). Technology Preview since 2.21.0 for non-MOSK
clusters. Preparation phase for a graceful machine deletion. Possible values
are as follows:
started
Cloud provider controller prepares a machine for deletion by cordoning,
draining the machine, and so on.
completed
LCM Controller starts removing the machine resources since
the preparation for deletion is complete.
aborting
Cloud provider controller attempts to uncordon the node. If the attempt
fails, the status changes to failed.
TechPreview since 2.21.0 and 2.21.1 for MOSK 22.5. GA since 2.24.0 for management and regional clusters. GA since 2.25.0 for managed clusters.
This section describes the MetalLBConfig custom resource used in the
Container Cloud API that contains the MetalLB configuration objects for a
particular cluster.
For demonstration purposes, the Container Cloud MetalLBConfig
custom resource description is split into the following major sections:
The Container Cloud MetalLBConfig CR contains the following fields:
apiVersion
API version of the object that is kaas.mirantis.com/v1alpha1.
kind
Object type that is MetalLBConfig.
The metadata object field of the MetalLBConfig resource
contains the following fields:
name
Name of the MetalLBConfig object.
namespace
Project in which the object was created. Must match the project name of
the target cluster.
labels
Key-value pairs attached to the object. Mandatory labels:
kaas.mirantis.com/provider
Provider type that is baremetal.
kaas.mirantis.com/region
Region name that matches the region name of the target cluster.
Note
The kaas.mirantis.com/region label is removed from all
Container Cloud objects in 2.26.0 (Cluster releases 17.1.0 and 16.1.0).
Therefore, do not add the label starting from these releases. On existing
clusters updated to these releases, or if manually added, this label will
be ignored by Container Cloud.
cluster.sigs.k8s.io/cluster-name
Name of the cluster that the MetalLB configuration must apply to.
Warning
Labels and annotations that are not documented in this API
Reference are generated automatically by Container Cloud. Do not modify them
using the Container Cloud API.
The spec field of the MetalLBConfig object represents the
MetalLBConfigSpec subresource that contains the description of MetalLB
configuration objects.
These objects are created in the target cluster during its deployment.
The spec field contains the following optional fields:
addressPools
Removed in Container Cloud 2.27.0 (Cluster releases 17.2.0 and 16.2.0),
deprecated in 2.26.0 (Cluster releases 17.1.0 and 16.1.0).
List of MetalLBAddressPool objects to create MetalLB AddressPool
objects.
bfdProfiles
List of MetalLBBFDProfile objects to create MetalLB BFDProfile
objects.
bgpAdvertisements
List of MetalLBBGPAdvertisement objects to create MetalLB
BGPAdvertisement objects.
bgpPeers
List of MetalLBBGPPeer objects to create MetalLB BGPPeer objects.
communities
List of MetalLBCommunity objects to create MetalLB Community
objects.
ipAddressPools
List of MetalLBIPAddressPool objects to create MetalLB
IPAddressPool objects.
l2Advertisements
List of MetalLBL2Advertisement objects to create MetalLB
L2Advertisement objects.
The l2Advertisements object allows defining interfaces to optimize
the announcement. When you use the interfaces selector, LB addresses
are announced only on selected host interfaces.
Mirantis recommends using the interfaces selector if nodes use separate
host networks for different types of traffic. The benefits of such a
configuration are less announcement traffic on other interfaces and networks
and a lower chance of reaching IP addresses of load-balanced services from
irrelevant interfaces and networks.
Caution
Interface names in the interfaces list must match those
on the corresponding nodes.
templateName
Name of the MetalLBConfigTemplate object used as a source of MetalLB
configuration objects. Mutually exclusive with the fields listed above
that otherwise become part of the MetalLBConfigTemplate object. For details,
see MetalLBConfigTemplate.
Before Cluster releases 17.2.0 and 16.2.0, MetalLBConfigTemplate is the
default configuration method for MetalLB on bare metal deployments. Since
Cluster releases 17.2.0 and 16.2.0, use the MetalLBConfig object
instead.
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
Caution
For managed clusters, this field is available as Technology
Preview since Container Cloud 2.24.0, is generally available since
2.25.0, and is deprecated since 2.27.0.
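A hedged spec sketch that defines one IP address pool and a matching L2
advertisement restricted to a single host interface; the pool name, address
range, and interface name are placeholders:
spec:
  ipAddressPools:
  - name: services
    spec:
      addresses:
      - 10.100.100.151-10.100.100.170
      autoAssign: true
      avoidBuggyIPs: false
  l2Advertisements:
  - name: services
    spec:
      ipAddressPools:
      - services
      interfaces:
      - k8s-lcm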
The objects listed in the spec field of the MetalLBConfig object,
such as MetalLBIPAddressPool, MetalLBL2Advertisement, and so on,
are used as templates for the MetalLB objects that will be created in the
target cluster. Each of these objects has the following structure:
labels
Optional. Key-value pairs attached to the metallb.io/<objectName>
object as metadata.labels.
name
Name of the metallb.io/<objectName> object.
spec
Contents of the spec section of the metallb.io/<objectName> object.
The spec field has the metallb.io/<objectName>Spec type.
For details, see MetalLB objects.
For example, MetalLBIPAddressPool is a template for the
metallb.io/IPAddressPool object and has the following structure:
labels
Optional. Key-value pairs attached to the metallb.io/IPAddressPool
object as metadata.labels.
name
Name of the metallb.io/IPAddressPool object.
spec
Contents of spec section of the metallb.io/IPAddressPool object.
The spec has the metallb.io/IPAddressPoolSpec type.
Container Cloud supports the following MetalLB object types of the
metallb.io API group:
IPAddressPool
Community
L2Advertisement
BFDProfile
BGPAdvertisement
BGPPeer
As of v1beta1 and v1beta2 API versions, metadata of MetalLB objects
has a standard format with no specific fields or labels defined for any
particular object:
apiVersion
API version of the object that can be metallb.io/v1beta1 or
metallb.io/v1beta2.
kind
Object type that is one of the metallb.io types listed above. For
example, IPAddressPool.
metadata
Object metadata that contains the following subfields:
name
Name of the object.
namespace
Namespace where the MetalLB components are located. It matches
metallb-system in Container Cloud.
labels
Optional. Key-value pairs that are attached to the object. It can be an
arbitrary set of labels. No special labels are defined as of v1beta1
and v1beta2 API versions.
The MetalLBConfig object contains spec sections of the
metallb.io/<objectName> objects that have the
metallb.io/<objectName>Spec type. For metallb.io/<objectName> and
metallb.io/<objectName>Spec types definitions, refer to the official
MetalLB documentation:
Before Container Cloud 2.27.0 (Cluster releases 17.2.0 and 16.2.0),
metallb.io/<objectName> objects v0.13.9 are supported.
The l2Advertisements object allows defining interfaces to optimize
the announcement. When you use the interfaces selector, LB addresses
are announced only on selected host interfaces. Mirantis recommends this
configuration if nodes use separate host networks for different types
of traffic. The benefits of such a configuration are less announcement
traffic on other interfaces and networks and a lower chance of reaching
load-balanced service addresses from irrelevant interfaces and networks.
Configuration example:
l2Advertisements: |
  - name: management-lcm
    spec:
      ipAddressPools:
        - default
      interfaces:
        # LB addresses from the "default" address pool will be announced
        # on the "k8s-lcm" interface
        - k8s-lcm
Caution
Interface names in the interfaces list must match those
on the corresponding nodes.
For managed clusters, this field is available as Technology
Preview and is generally available since Container Cloud 2.25.0.
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
The status field describes the actual state of the object.
It contains the following fields:
bootstrapModeOnly in 2.24.0
Field that appears only during a management cluster bootstrap as true
and is used internally for bootstrap. Once deployment completes, the value
changes to false and is excluded from the status output.
objects
Description of MetalLB objects that is used to create MetalLB native
objects in the target cluster.
The format of underlying objects is the same as for those in the spec
field, except templateName, which is obsolete since Container Cloud
2.28.0 (Cluster releases 17.3.0 and 16.3.0) and which is not present in this
field. The objects contents are rendered from the following locations,
with possible modifications for the bootstrap cluster:
Since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0),
MetalLBConfig.spec
Before Container Cloud 2.28.0 (Cluster releases 17.2.0, 16.2.0, or
earlier):
MetalLBConfigTemplate.status of the corresponding template if
MetalLBConfig.spec.templateName is defined
MetalLBConfig.spec if MetalLBConfig.spec.templateName is not
defined
propagateResult
Result of objects propagation. During objects propagation, native MetalLB
objects of the target cluster are created and updated according to the
description of the objects present in the status.objects field.
This field contains the following information:
message
Text message that describes the result of the last attempt of objects
propagation. Contains an error message if the last attempt was
unsuccessful.
success
Result of the last attempt of objects propagation. Boolean.
time
Timestamp of the last attempt of objects propagation. For example,
2023-07-04T00:30:36Z.
If the objects propagation was successful, the MetalLB objects of the
target cluster match the ones present in the status.objects field.
updateResult
Status of the MetalLB objects update. Has the same format of subfields
as in propagateResult described above.
During objects update, the status.objects contents are rendered as
described in the objects field definition above.
If the objects update was successful, the MetalLB objects description
present in status.objects is rendered successfully and up to date.
This description is used to update MetalLB objects in the target cluster.
If the objects update was not successful, MetalLB objects will not be
propagated to the target cluster.
After the object is created and processed by the MetalLB Controller, the
status field is added. For example:
status:
  objects:
    ipAddressPools:
    - name: services
      spec:
        addresses:
        - 10.100.100.151-10.100.100.170
        autoAssign: true
        avoidBuggyIPs: false
    l2Advertisements:
    - name: services
      spec:
        ipAddressPools:
        - services
  propagateResult:
    message: Objects were successfully updated
    success: true
    time: "2023-07-04T14:31:40Z"
  updateResult:
    message: Objects were successfully read from MetalLB configuration specification
    success: true
    time: "2023-07-04T14:31:39Z"
Example of native MetalLB objects to be created in the
managed-ns/managed-cluster cluster during deployment:
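The following is a hedged reconstruction based on the status example above;
the object names and address range are placeholders, and the namespace
follows the metallb-system convention described earlier in this section:
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: services
  namespace: metallb-system
spec:
  addresses:
  - 10.100.100.151-10.100.100.170
  autoAssign: true
  avoidBuggyIPs: false
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: services
  namespace: metallb-system
spec:
  ipAddressPools:
  - services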
The following list summarizes the support status of the
MetalLBConfigTemplate object across Container Cloud releases:
2.28.0 (Cluster releases 17.3.0 and 16.3.0) - Unsupported for any cluster
type. Admission Controller blocks creation of the object.
2.27.0 (Cluster releases 17.2.0 and 16.2.0) - Deprecated for any cluster type.
2.25.0 (Cluster releases 17.0.0 and 16.0.0) - Generally available for managed clusters.
2.24.2 (Cluster releases 15.0.1, 14.0.1, 14.0.0) - Technology Preview for managed clusters.
2.24.0 (Cluster release 14.0.0) - Generally available for management clusters.
This section describes the MetalLBConfigTemplate custom resource used in
the Container Cloud API that contains the template for MetalLB configuration
for a particular cluster.
Note
The MetalLBConfigTemplate object applies to bare metal
deployments only.
Before Cluster releases 17.2.0 and 16.2.0, MetalLBConfigTemplate was the
default configuration method for MetalLB on bare metal deployments. This method
allowed the use of Subnet objects to define MetalLB IP address pools
the same way as they were used before the introduction of the MetalLBConfig and
MetalLBConfigTemplate objects. Since Cluster releases 17.2.0 and 16.2.0,
use the MetalLBConfig object for this purpose instead.
For demonstration purposes, the Container Cloud MetalLBConfigTemplate
custom resource description is split into the following major sections:
The Container Cloud MetalLBConfigTemplate CR contains the following fields:
apiVersion
API version of the object that is ipam.mirantis.com/v1alpha1.
kind
Object type that is MetalLBConfigTemplate.
The metadata object field of the MetalLBConfigTemplate resource
contains the following fields:
name
Name of the MetalLBConfigTemplate object.
namespace
Project in which the object was created. Must match the project name of
the target cluster.
labels
Key-value pairs attached to the object. Mandatory labels:
kaas.mirantis.com/provider
Provider type that is baremetal.
kaas.mirantis.com/region
Region name that matches the region name of the target cluster.
Note
The kaas.mirantis.com/region label is removed from all
Container Cloud objects in 2.26.0 (Cluster releases 17.1.0 and 16.1.0).
Therefore, do not add the label starting from these releases. On existing
clusters updated to these releases, or if the label was added manually,
it is ignored by Container Cloud.
cluster.sigs.k8s.io/cluster-name
Name of the cluster that the MetalLB configuration applies to.
Warning
Labels and annotations that are not documented in this API
Reference are generated automatically by Container Cloud. Do not modify them
using the Container Cloud API.
The spec field of the MetalLBConfigTemplate object contains the
templates of MetalLB configuration objects and optional auxiliary variables.
Container Cloud uses these templates to create MetalLB configuration objects
during the cluster deployment.
The spec field contains the following optional fields:
machines
Key-value dictionary to select IpamHost objects corresponding to nodes
of the target cluster. Keys contain machine aliases used in
spec.templates. Values contain the NameLabelsSelector items that
select IpamHost by name or by labels. For example:
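A minimal sketch of such a definition is shown below. The machine aliases, the host name, and the label are illustrative assumptions, and the exact selector field names must be verified against your product version:

machines:
  master-0:
    name: test-cluster-master-0   # assumed selector: match the IpamHost by name
  master-1:
    labels:                       # assumed selector: match the IpamHost by labels
      rack-id: rack-master-1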
This field is required if some IP addresses of nodes are used in
spec.templates.
vars
Key-value dictionary of arbitrary user-defined variables that are used in
spec.templates. For example:
vars:
  localPort: 4561
templates
List of templates for MetalLB configuration objects that are used to
render MetalLB configuration definitions and create MetalLB objects in
the target cluster. Contains the following optional fields:
bfdProfiles
Template for the MetalLBBFDProfile object list to create MetalLB
BFDProfile objects.
bgpAdvertisements
Template for the MetalLBBGPAdvertisement object list to create
MetalLB BGPAdvertisement objects.
bgpPeers
Template for the MetalLBBGPPeer object list to create MetalLB
BGPPeer objects.
communities
Template for the MetalLBCommunity object list to create MetalLB
Community objects.
ipAddressPools
Template for the MetalLBIPAddressPool object list to create MetalLB
IPAddressPool objects.
l2Advertisements
Template for the MetalLBL2Advertisement object list to create
MetalLB L2Advertisement objects.
Each template is a string and has the same structure as the list of the
corresponding objects described in MetalLBConfig spec such as
MetalLBIPAddressPool and MetalLBL2Advertisement, but
you can use additional functions and variables inside these templates.
Note
When using the MetalLBConfigTemplate object, you can define
MetalLB IP address pools using both Subnet objects and
spec.ipAddressPools templates. IP address pools rendered from these
sources will be concatenated and then written to
status.renderedObjects.ipAddressPools.
You can use the following functions in templates:
ipAddressPoolNames
Selects all IP address pools of the given announcement type found for
the target cluster. Possible types: layer2, bgp, any.
The any type includes all IP address pools found for the target
cluster. The announcement types of IP address pools are verified using
the metallb/address-pool-protocol labels of the corresponding
Subnet object.
The ipAddressPools templates have no types as native MetalLB
IPAddressPool objects have no announcement type.
The l2Advertisements template can refer to IP address pools of the
layer2 or any type.
The bgpAdvertisements template can refer to IP address pools of the
bgp or any type.
IP address pools are searched in the templates.ipAddressPools field
and in the Subnet objects of the target cluster. For example:
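A hedged sketch of such a template follows. The Go-template invocation syntax for ipAddressPoolNames is an assumption for illustration only and must be verified against your product version:

l2Advertisements: |
  - name: services
    spec:
      # assumed call: select all IP address pools of the "layer2" type
      ipAddressPools: {{ ipAddressPoolNames "layer2" }}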
The l2Advertisements object allows defining interfaces to optimize
the announcement. When you use the interfaces selector, LB addresses
are announced only on the selected host interfaces. Mirantis recommends this
configuration if nodes use separate host networks for different types
of traffic. The benefits of such a configuration are less announcement
traffic on other interfaces and networks and a lower chance of reaching
service LB addresses from irrelevant interfaces and networks.
Configuration example:
l2Advertisements: |
  - name: management-lcm
    spec:
      ipAddressPools:
        - default
      interfaces:
        # LB addresses from the "default" address pool will be announced
        # on the "k8s-lcm" interface
        - k8s-lcm
Caution
Interface names in the interfaces list must match those
on the corresponding nodes.
The status field describes the actual state of the object.
It contains the following fields:
renderedObjects
MetalLB objects description rendered from spec.templates in the same
format as they are defined in the MetalLBConfig spec field.
All underlying objects are optional. The following objects can be present:
bfdProfiles, bgpAdvertisements, bgpPeers, communities,
ipAddressPools, l2Advertisements.
state (Since 2.23.0)
Message that reflects the current status of the resource.
The list of possible values includes the following:
OK - object is operational.
ERR - object is non-operational. This status has a detailed
description in the messages list.
TERM - object was deleted and is terminating.
messages (Since 2.23.0)
List of error or warning messages if the object state is ERR.
objCreated
Date, time, and IPAM version of the resource creation.
objStatusUpdated
Date, time, and IPAM version of the last update of the status
field in the resource.
objUpdated
Date, time, and IPAM version of the last resource update.
apiVersion: ipam.mirantis.com/v1alpha1
kind: MetalLBConfigTemplate
metadata:
  labels:
    cluster.sigs.k8s.io/cluster-name: kaas-mgmt
    kaas.mirantis.com/provider: baremetal
  name: mgmt-metallb-template
  namespace: default
spec:
  templates:
    l2Advertisements: |
      - name: management-lcm
        spec:
          ipAddressPools:
            - default
          interfaces:
            # IPs from the "default" address pool will be announced on the "k8s-lcm" interface
            - k8s-lcm
      - name: provision-pxe
        spec:
          ipAddressPools:
            - services-pxe
          interfaces:
            # IPs from the "services-pxe" address pool will be announced on the "k8s-pxe" interface
            - k8s-pxe
Configuration example for Subnet of the default pool
After the objects are created and processed by the kaas-ipam Controller,
the status field displays for MetalLBConfigTemplate:
Configuration example of the status field for
MetalLBConfigTemplate
status:
  checksums:
    annotations: sha256:38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
    labels: sha256:380337902278e8985e816978c349910a4f7ed98169c361eb8777411ac427e6ba
    spec: sha256:0860790fc94217598e0775ab2961a02acc4fba820ae17c737b94bb5d55390dbe
  messages:
    - Template for BFDProfiles is undefined
    - Template for BGPAdvertisements is undefined
    - Template for BGPPeers is undefined
    - Template for Communities is undefined
  objCreated: 2023-06-30T21:22:56.00000Z by v6.5.999-20230627-072014-ba8d918
  objStatusUpdated: 2023-07-04T00:30:35.82023Z by v6.5.999-20230627-072014-ba8d918
  objUpdated: 2023-06-30T22:10:51.73822Z by v6.5.999-20230627-072014-ba8d918
  renderedObjects:
    ipAddressPools:
      - name: default
        spec:
          addresses:
            - 10.0.34.101-10.0.34.120
          autoAssign: true
      - name: services-pxe
        spec:
          addresses:
            - 10.0.24.221-10.0.24.230
          autoAssign: false
    l2Advertisements:
      - name: management-lcm
        spec:
          interfaces:
            - k8s-lcm
          ipAddressPools:
            - default
      - name: provision-pxe
        spec:
          interfaces:
            - k8s-pxe
          ipAddressPools:
            - services-pxe
  state: OK
The following example illustrates contents of the status field that
displays for MetalLBConfig after the objects are processed
by the MetalLB Controller.
Configuration example of the status field for
MetalLBConfig
status:
  objects:
    ipAddressPools:
      - name: default
        spec:
          addresses:
            - 10.0.34.101-10.0.34.120
          autoAssign: true
          avoidBuggyIPs: false
      - name: services-pxe
        spec:
          addresses:
            - 10.0.24.221-10.0.24.230
          autoAssign: false
          avoidBuggyIPs: false
    l2Advertisements:
      - name: management-lcm
        spec:
          interfaces:
            - k8s-lcm
          ipAddressPools:
            - default
      - name: provision-pxe
        spec:
          interfaces:
            - k8s-pxe
          ipAddressPools:
            - services-pxe
  propagateResult:
    message: Objects were successfully updated
    success: true
    time: "2023-07-05T03:10:23Z"
  updateResult:
    message: Objects were successfully read from MetalLB configuration specification
    success: true
    time: "2023-07-05T03:10:23Z"
Using the objects described above, several native MetalLB objects are created
in the kaas-mgmt cluster during deployment.
Configuration example of MetalLB objects created during cluster
deployment
In the following configuration example, MetalLB is configured to use BGP for
announcement of external addresses of Kubernetes load-balanced services
for the managed cluster from master nodes. Each master node is located in
its own rack without the L2 layer extension between racks.
This section contains only examples of the objects required to illustrate
the MetalLB configuration. For Rack, MultiRackCluster, L2Template
and other objects required to configure BGP announcement of the cluster API
load balancer address for this scenario, refer to Multiple rack configuration example.
apiVersion: ipam.mirantis.com/v1alpha1
kind: MetalLBConfigTemplate
metadata:
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
  name: test-cluster-metallb-bgp-template
  namespace: managed-ns
spec:
  templates:
    bgpAdvertisements: |
      - name: services
        spec:
          ipAddressPools:
            - services
          peers:             # "peers" can be omitted if all defined peers
            - svc-peer-rack1 # are used in a particular "bgpAdvertisement"
            - svc-peer-rack2
            - svc-peer-rack3
    bgpPeers: |
      - name: svc-peer-rack1
        spec:
          peerAddress: 10.77.41.1 # peer address is in the external subnet #1
          peerASN: 65100
          myASN: 65101
          nodeSelectors:
            - matchLabels:
                rack-id: rack-master-1 # references the node corresponding
                                       # to the "test-cluster-master-1" Machine
      - name: svc-peer-rack2
        spec:
          peerAddress: 10.77.42.1 # peer address is in the external subnet #2
          peerASN: 65100
          myASN: 65101
          nodeSelectors:
            - matchLabels:
                rack-id: rack-master-2 # references the node corresponding
                                       # to the "test-cluster-master-2" Machine
      - name: svc-peer-rack3
        spec:
          peerAddress: 10.77.43.1 # peer address is in the external subnet #3
          peerASN: 65100
          myASN: 65101
          nodeSelectors:
            - matchLabels:
                rack-id: rack-master-3 # references the node corresponding
                                       # to the "test-cluster-master-3" Machine
The following objects illustrate configuration for three subnets that
are used to configure external network in three racks. Each master node uses
its own external L2/L3 network segment.
Configuration example for the Subnet ext-rack-control-1
Rack objects and the ipam/RackRef labels in Machine objects are not
required for the MetalLB configuration. However, in this example, Rack
objects are implied to be used for configuring BGP announcement of the
cluster API load balancer address, although they are not shown here.
Machine objects select different L2 templates because each master node uses
different L2/L3 network segments for LCM, external, and other networks.
Configuration example for the Machine test-cluster-master-1
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: test-cluster-master-1
  namespace: managed-ns
  annotations:
    metal3.io/BareMetalHost: managed-ns/test-cluster-master-1
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    cluster.sigs.k8s.io/control-plane: controlplane
    hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
    ipam/RackRef: rack-master-1
    kaas.mirantis.com/provider: baremetal
spec:
  providerSpec:
    value:
      kind: BareMetalMachineProviderSpec
      apiVersion: baremetal.k8s.io/v1alpha1
      hostSelector:
        matchLabels:
          kaas.mirantis.com/baremetalhost-id: test-cluster-master-1
      l2TemplateSelector:
        name: test-cluster-master-1
      nodeLabels:
        - key: rack-id           # it is used in "nodeSelectors"
          value: rack-master-1   # of "bgpPeer" MetalLB objects
Configuration example for the Machine test-cluster-master-2
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: test-cluster-master-2
  namespace: managed-ns
  annotations:
    metal3.io/BareMetalHost: managed-ns/test-cluster-master-2
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    cluster.sigs.k8s.io/control-plane: controlplane
    hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
    ipam/RackRef: rack-master-2
    kaas.mirantis.com/provider: baremetal
spec:
  providerSpec:
    value:
      kind: BareMetalMachineProviderSpec
      apiVersion: baremetal.k8s.io/v1alpha1
      hostSelector:
        matchLabels:
          kaas.mirantis.com/baremetalhost-id: test-cluster-master-2
      l2TemplateSelector:
        name: test-cluster-master-2
      nodeLabels:
        - key: rack-id           # it is used in "nodeSelectors"
          value: rack-master-2   # of "bgpPeer" MetalLB objects
Configuration example for the Machine test-cluster-master-3
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: test-cluster-master-3
  namespace: managed-ns
  annotations:
    metal3.io/BareMetalHost: managed-ns/test-cluster-master-3
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    cluster.sigs.k8s.io/control-plane: controlplane
    hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
    ipam/RackRef: rack-master-3
    kaas.mirantis.com/provider: baremetal
spec:
  providerSpec:
    value:
      kind: BareMetalMachineProviderSpec
      apiVersion: baremetal.k8s.io/v1alpha1
      hostSelector:
        matchLabels:
          kaas.mirantis.com/baremetalhost-id: test-cluster-master-3
      l2TemplateSelector:
        name: test-cluster-master-3
      nodeLabels:
        - key: rack-id           # it is used in "nodeSelectors"
          value: rack-master-3   # of "bgpPeer" MetalLB objects
This section describes the MultiRackCluster resource used in the
Container Cloud API.
When you create a bare metal managed cluster with a multi-rack topology,
where Kubernetes masters are distributed across multiple racks
without L2 layer extension between them, the MultiRackCluster resource
allows you to set cluster-wide parameters for configuration of the
BGP announcement of the cluster API load balancer address.
In this scenario, the MultiRackCluster object must be bound to the
Cluster object.
The MultiRackCluster object is generally used for a particular cluster
in conjunction with Rack objects described in Rack.
For demonstration purposes, the Container Cloud MultiRackCluster
custom resource (CR) description is split into the following major sections:
The Container Cloud MultiRackCluster CR metadata contains the following
fields:
apiVersion
API version of the object that is ipam.mirantis.com/v1alpha1.
kind
Object type that is MultiRackCluster.
metadata
The metadata field contains the following subfields:
name
Name of the MultiRackCluster object.
namespace
Container Cloud project (Kubernetes namespace) in which the object was
created.
labels
Key-value pairs that are attached to the object:
cluster.sigs.k8s.io/cluster-name
Cluster object name that this MultiRackCluster object is
applied to. To enable the use of BGP announcement for the cluster API
LB address, set the useBGPAnnouncement parameter in the
Cluster object to true:
spec:
  providerSpec:
    value:
      useBGPAnnouncement: true
kaas.mirantis.com/provider
Provider name that is baremetal.
kaas.mirantis.com/region
Region name.
Note
The kaas.mirantis.com/region label is removed from all
Container Cloud objects in 2.26.0 (Cluster releases 17.1.0 and 16.1.0).
Therefore, do not add the label starting from these releases. On existing
clusters updated to these releases, or if the label was added manually,
it is ignored by Container Cloud.
Warning
Labels and annotations that are not documented in this API
Reference are generated automatically by Container Cloud. Do not modify them
using the Container Cloud API.
The MultiRackCluster metadata configuration example:
The spec field of the MultiRackCluster resource describes the desired
state of the object. It contains the following fields:
bgpdConfigFileName
Name of the configuration file for the BGP daemon (bird). Recommended value
is bird.conf.
bgpdConfigFilePath
Path to the directory where the configuration file for the BGP daemon
(bird) is added. The recommended value is /etc/bird.
bgpdConfigTemplate
Optional. Configuration text file template for the BGP daemon (bird)
configuration file where you can use go template constructs and
the following variables:
RouterID, LocalIP
Local IP on the given network, which is a key in the
Rack.spec.peeringMap dictionary, for a given node. You can use
it, for example, in the router id {{$.RouterID}}; instruction.
LocalASN
Local AS number.
NeighborASN
Neighbor AS number.
NeighborIP
Neighbor IP address. Its values are taken from Rack.spec.peeringMap
and can be used only inside the range iteration through the
Neighbors list.
Neighbors
List of peers in the given network and node. It can be iterated
through the range statement in the go template.
Values for LocalASN and NeighborASN are taken from:
MultiRackCluster.defaultPeer - if not used as a field inside the
range iteration through the Neighbors list.
Corresponding values of Rack.spec.peeringMap - if used as a field
inside the range iteration through the Neighbors list.
This template can be overridden using the Rack objects. For details,
see Rack spec.
defaultPeer
Configuration parameters for the default BGP peer. These parameters will
be used in rendering of the configuration file for BGP daemon from
the template if they are not overridden for a particular rack or network
using Rack objects. For details, see Rack spec.
localASN
Mandatory. Local AS number.
neighborASN
Mandatory. Neighbor AS number.
neighborIP
Reserved. Neighbor IP address. Leave it as an empty string.
password
Optional. Neighbor password. If not set, you can hardcode it in
bgpdConfigTemplate. It is required for MD5 authentication between
BGP peers.
Configuration examples:
Since Cluster releases 17.1.0 and 16.1.0 for bird v2.x
spec:
  bgpdConfigFileName: bird.conf
  bgpdConfigFilePath: /etc/bird
  bgpdConfigTemplate: |
    protocol device {}
    #
    protocol direct {
      interface "lo";
      ipv4;
    }
    #
    protocol kernel {
      ipv4 { export all; };
    }
    #
    {{range $i, $peer := .Neighbors}}
    protocol bgp 'bgp_peer_{{$i}}' {
      local port 1179 as {{.LocalASN}};
      neighbor {{.NeighborIP}} as {{.NeighborASN}};
      ipv4 {
        import none;
        export filter { if dest = RTD_UNREACHABLE then { reject; } accept; };
      };
    }
    {{end}}
  defaultPeer:
    localASN: 65101
    neighborASN: 65100
    neighborIP: ""
Before Cluster releases 17.1.0 and 16.1.0 for bird v1.x
spec:
  bgpdConfigFileName: bird.conf
  bgpdConfigFilePath: /etc/bird
  bgpdConfigTemplate: |
    listen bgp port 1179;
    protocol device {}
    #
    protocol direct {
      interface "lo";
    }
    #
    protocol kernel {
      export all;
    }
    #
    {{range $i, $peer := .Neighbors}}
    protocol bgp 'bgp_peer_{{$i}}' {
      local as {{.LocalASN}};
      neighbor {{.NeighborIP}} as {{.NeighborASN}};
      import all;
      export filter { if dest = RTD_UNREACHABLE then { reject; } accept; };
    }
    {{end}}
  defaultPeer:
    localASN: 65101
    neighborASN: 65100
    neighborIP: ""
The status field of the MultiRackCluster resource reflects the actual
state of the MultiRackCluster object and contains the following fields:
state (Since 2.23.0)
Message that reflects the current status of the resource.
The list of possible values includes the following:
OK - object is operational.
ERR - object is non-operational. This status has a detailed
description in the messages list.
TERM - object was deleted and is terminating.
messages (Since 2.23.0)
List of error or warning messages if the object state is ERR.
objCreated
Date, time, and IPAM version of the resource creation.
objStatusUpdated
Date, time, and IPAM version of the last update of the status
field in the resource.
objUpdated
Date, time, and IPAM version of the last resource update.
Configuration example:
status:
  checksums:
    annotations: sha256:38e0b9de817f645c4bec37c0d4a3e58baecccb040f5718dc069a72c7385a0bed
    labels: sha256:d8f8eacf487d57c22ca0ace29bd156c66941a373b5e707d671dc151959a64ce7
    spec: sha256:66b5d28215bdd36723fe6230359977fbede828906c6ae96b5129a972f1fa51e9
  objCreated: 2023-08-11T12:25:21.00000Z by v6.5.999-20230810-155553-2497818
  objStatusUpdated: 2023-08-11T12:32:58.11966Z by v6.5.999-20230810-155553-2497818
  objUpdated: 2023-08-11T12:32:57.32036Z by v6.5.999-20230810-155553-2497818
  state: OK
The following configuration examples of several bare metal objects illustrate
how to configure BGP announcement of the load balancer address used to expose
the cluster API.
In the following example, all master nodes are in a single rack. One Rack
object is required in this case for master nodes. Some worker nodes can
coexist in the same rack with master nodes or occupy separate racks. It is
implied that the useBGPAnnouncement parameter is set to true in the
corresponding Cluster object.
Configuration example for MultiRackCluster
Since Cluster releases 17.1.0 and 16.1.0 for bird v2.x:
apiVersion: ipam.mirantis.com/v1alpha1
kind: MultiRackCluster
metadata:
  name: multirack-test-cluster
  namespace: managed-ns
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
spec:
  bgpdConfigFileName: bird.conf
  bgpdConfigFilePath: /etc/bird
  bgpdConfigTemplate: |
    protocol device {}
    #
    protocol direct {
      interface "lo";
      ipv4;
    }
    #
    protocol kernel {
      ipv4 { export all; };
    }
    #
    {{range $i, $peer := .Neighbors}}
    protocol bgp 'bgp_peer_{{$i}}' {
      local port 1179 as {{.LocalASN}};
      neighbor {{.NeighborIP}} as {{.NeighborASN}};
      ipv4 {
        import none;
        export filter { if dest = RTD_UNREACHABLE then { reject; } accept; };
      };
    }
    {{end}}
  defaultPeer:
    localASN: 65101
    neighborASN: 65100
    neighborIP: ""
Before Cluster releases 17.1.0 and 16.1.0 for bird v1.x:
apiVersion: ipam.mirantis.com/v1alpha1
kind: MultiRackCluster
metadata:
  name: multirack-test-cluster
  namespace: managed-ns
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
spec:
  bgpdConfigFileName: bird.conf
  bgpdConfigFilePath: /etc/bird
  bgpdConfigTemplate: |
    listen bgp port 1179;
    protocol device {}
    #
    protocol direct {
      interface "lo";
    }
    #
    protocol kernel {
      export all;
    }
    #
    {{range $i, $peer := .Neighbors}}
    protocol bgp 'bgp_peer_{{$i}}' {
      local as {{.LocalASN}};
      neighbor {{.NeighborIP}} as {{.NeighborASN}};
      import all;
      export filter { if dest = RTD_UNREACHABLE then { reject; } accept; };
    }
    {{end}}
  defaultPeer:
    localASN: 65101
    neighborASN: 65100
    neighborIP: ""
Configuration example for Rack
apiVersion: ipam.mirantis.com/v1alpha1
kind: Rack
metadata:
  name: rack-master
  namespace: managed-ns
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
spec:
  peeringMap:
    lcm-rack-control:
      peers:
        - neighborIP: 10.77.31.1  # "localASN" and "neighborASN" are taken from
        - neighborIP: 10.77.37.1  # "MultiRackCluster.spec.defaultPeer"
                                  # if not set here
Configuration example for Machine
# "Machine" templates for "test-cluster-master-2" and "test-cluster-master-3"# differ only in BMH selectors in this example.apiVersion:cluster.k8s.io/v1alpha1kind:Machinemetadata:name:test-cluster-master-1namespace:managed-nsannotations:metal3.io/BareMetalHost:managed-ns/test-cluster-master-1labels:cluster.sigs.k8s.io/cluster-name:test-clustercluster.sigs.k8s.io/control-plane:controlplanehostlabel.bm.kaas.mirantis.com/controlplane:controlplaneipam/RackRef:rack-master# used to connect "IpamHost" to "Rack" objects, so that# BGP parameters can be obtained from "Rack" to# render BGP configuration for the given "IpamHost" objectkaas.mirantis.com/provider:baremetalspec:providerSpec:value:kind:BareMetalMachineProviderSpecapiVersion:baremetal.k8s.io/v1alpha1hostSelector:matchLabels:kaas.mirantis.com/baremetalhost-id:test-cluster-master-1l2TemplateSelector:name:test-cluster-master
Note
Before update of the management cluster to Container Cloud 2.29.0
(Cluster release 16.4.0), instead of BareMetalHostInventory, use the
BareMetalHost object. For details, see BareMetalHost.
Caution
While the Cluster release of the management cluster is 16.4.0,
BareMetalHostInventory operations are allowed to
m:kaas@management-admin only. Once the management cluster is updated
to the Cluster release 16.4.1 (or later), this limitation will be lifted.
Configuration example for L2Template
apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
  name: test-cluster-master
  namespace: managed-ns
spec:
  ...
  l3Layout:
    - subnetName: lcm-rack-control  # this network is referenced in "rack-master" Rack
      scope: namespace
  ...
  npTemplate: |
    ...
    ethernets:
      lo:
        addresses:
          - {{ cluster_api_lb_ip }}  # function for cluster API LB IP
        dhcp4: false
        dhcp6: false
    ...
After the objects are created and nodes are provisioned, the IpamHost
objects will have BGP daemon configuration files in their status fields.
For example:
You can decode /etc/bird/bird.conf contents and verify the configuration:
echo"<<base64-string>>"|base64-d
The following system output applies to the above configuration examples:
Configuration example for the decoded bird.conf
Since Cluster releases 17.1.0 and 16.1.0 for bird v2.x:
protocol device {}
#
protocol direct {
  interface "lo";
  ipv4;
}
#
protocol kernel {
  ipv4 { export all; };
}
#
protocol bgp 'bgp_peer_0' {
  local port 1179 as 65101;
  neighbor 10.77.31.1 as 65100;
  ipv4 {
    import none;
    export filter { if dest = RTD_UNREACHABLE then { reject; } accept; };
  };
}
protocol bgp 'bgp_peer_1' {
  local port 1179 as 65101;
  neighbor 10.77.37.1 as 65100;
  ipv4 {
    import none;
    export filter { if dest = RTD_UNREACHABLE then { reject; } accept; };
  };
}
Before Cluster releases 17.1.0 and 16.1.0 for bird v1.x:
listen bgp port 1179;
protocol device {}
#
protocol direct {
  interface "lo";
}
#
protocol kernel {
  export all;
}
#
protocol bgp 'bgp_peer_0' {
  local as 65101;
  neighbor 10.77.31.1 as 65100;
  import all;
  export filter { if dest = RTD_UNREACHABLE then { reject; } accept; };
}
protocol bgp 'bgp_peer_1' {
  local as 65101;
  neighbor 10.77.37.1 as 65100;
  import all;
  export filter { if dest = RTD_UNREACHABLE then { reject; } accept; };
}
BGP daemon configuration files are copied from IpamHost.status
to the corresponding LCMMachine object the same way as it is done for
netplan configuration files. Then, the configuration files are written to the
corresponding node by the LCM-Agent.
In the following configuration example, each master node is located in its own
rack. Three Rack objects are required in this case for master nodes.
Some worker nodes can coexist in the same racks with master nodes or occupy
separate racks.
Only objects that are required to show configuration for BGP announcement
of the cluster API load balancer address are provided here.
It is implied that the useBGPAnnouncement parameter is set to true
in the corresponding Cluster object.
Configuration example for MultiRackCluster
Since Cluster releases 17.1.0 and 16.1.0 for bird v2.x:
# It is the same object as in the single rack example.
apiVersion: ipam.mirantis.com/v1alpha1
kind: MultiRackCluster
metadata:
  name: multirack-test-cluster
  namespace: managed-ns
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
spec:
  bgpdConfigFileName: bird.conf
  bgpdConfigFilePath: /etc/bird
  bgpdConfigTemplate: |
    protocol device {}
    #
    protocol direct {
      interface "lo";
      ipv4;
    }
    #
    protocol kernel {
      ipv4 { export all; };
    }
    #
    {{range $i, $peer := .Neighbors}}
    protocol bgp 'bgp_peer_{{$i}}' {
      local port 1179 as {{.LocalASN}};
      neighbor {{.NeighborIP}} as {{.NeighborASN}};
      ipv4 {
        import none;
        export filter { if dest = RTD_UNREACHABLE then { reject; } accept; };
      };
    }
    {{end}}
  defaultPeer:
    localASN: 65101
    neighborASN: 65100
    neighborIP: ""
Before Cluster releases 17.1.0 and 16.1.0 for bird v1.x:
# It is the same object as in the single rack example.
apiVersion: ipam.mirantis.com/v1alpha1
kind: MultiRackCluster
metadata:
  name: multirack-test-cluster
  namespace: managed-ns
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
spec:
  bgpdConfigFileName: bird.conf
  bgpdConfigFilePath: /etc/bird
  bgpdConfigTemplate: |
    listen bgp port 1179;
    protocol device {}
    #
    protocol direct {
      interface "lo";
    }
    #
    protocol kernel {
      export all;
    }
    #
    {{range $i, $peer := .Neighbors}}
    protocol bgp 'bgp_peer_{{$i}}' {
      local as {{.LocalASN}};
      neighbor {{.NeighborIP}} as {{.NeighborASN}};
      import all;
      export filter { if dest = RTD_UNREACHABLE then { reject; } accept; };
    }
    {{end}}
  defaultPeer:
    localASN: 65101
    neighborASN: 65100
    neighborIP: ""
The following Rack objects differ in neighbor IP addresses and in the
network (L3 subnet) used for BGP connection to announce the cluster API LB IP
and for cluster API traffic.
Configuration example for Rack 1
apiVersion: ipam.mirantis.com/v1alpha1
kind: Rack
metadata:
  name: rack-master-1
  namespace: managed-ns
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
spec:
  peeringMap:
    lcm-rack-control-1:
      peers:
        - neighborIP: 10.77.31.2  # "localASN" and "neighborASN" are taken from
        - neighborIP: 10.77.31.3  # "MultiRackCluster.spec.defaultPeer" if
                                  # not set here
Configuration example for Rack 2
apiVersion: ipam.mirantis.com/v1alpha1
kind: Rack
metadata:
  name: rack-master-2
  namespace: managed-ns
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
spec:
  peeringMap:
    lcm-rack-control-2:
      peers:
        - neighborIP: 10.77.32.2  # "localASN" and "neighborASN" are taken from
        - neighborIP: 10.77.32.3  # "MultiRackCluster.spec.defaultPeer" if
                                  # not set here
Configuration example for Rack 3
apiVersion: ipam.mirantis.com/v1alpha1
kind: Rack
metadata:
  name: rack-master-3
  namespace: managed-ns
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
spec:
  peeringMap:
    lcm-rack-control-3:
      peers:
        - neighborIP: 10.77.33.2  # "localASN" and "neighborASN" are taken from
        - neighborIP: 10.77.33.3  # "MultiRackCluster.spec.defaultPeer" if
                                  # not set here
As compared to single rack examples, the following Machine objects differ
in:
BMH selectors
L2Template selectors
Rack selectors (the ipam/RackRef label)
The rack-id node labels
The labels on master nodes are required for MetalLB node selectors if
MetalLB is used to announce LB IP addresses on master nodes. In this
scenario, the L2 (ARP) announcement mode cannot be used for MetalLB because
master nodes are in different L2 segments.
So, the BGP announcement mode must be used for MetalLB. Node selectors
are required to properly configure BGP connections from each master node.
Note
Before update of the management cluster to Container Cloud 2.29.0
(Cluster release 16.4.0), instead of BareMetalHostInventory, use the
BareMetalHost object. For details, see BareMetalHost.
Caution
While the Cluster release of the management cluster is 16.4.0,
BareMetalHostInventory operations are allowed to
m:kaas@management-admin only. Once the management cluster is updated
to the Cluster release 16.4.1 (or later), this limitation will be lifted.
Configuration example for Machine 1
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: test-cluster-master-1
  namespace: managed-ns
  annotations:
    metal3.io/BareMetalHost: managed-ns/test-cluster-master-1
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    cluster.sigs.k8s.io/control-plane: controlplane
    hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
    ipam/RackRef: rack-master-1
    kaas.mirantis.com/provider: baremetal
spec:
  providerSpec:
    value:
      kind: BareMetalMachineProviderSpec
      apiVersion: baremetal.k8s.io/v1alpha1
      hostSelector:
        matchLabels:
          kaas.mirantis.com/baremetalhost-id: test-cluster-master-1
      l2TemplateSelector:
        name: test-cluster-master-1
      nodeLabels:              # not used for BGP announcement of the
        - key: rack-id         # cluster API LB IP but can be used for
          value: rack-master-1 # MetalLB if "nodeSelectors" are required
Configuration example for Machine 2
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: test-cluster-master-2
  namespace: managed-ns
  annotations:
    metal3.io/BareMetalHost: managed-ns/test-cluster-master-2
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    cluster.sigs.k8s.io/control-plane: controlplane
    hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
    ipam/RackRef: rack-master-2
    kaas.mirantis.com/provider: baremetal
spec:
  providerSpec:
    value:
      kind: BareMetalMachineProviderSpec
      apiVersion: baremetal.k8s.io/v1alpha1
      hostSelector:
        matchLabels:
          kaas.mirantis.com/baremetalhost-id: test-cluster-master-2
      l2TemplateSelector:
        name: test-cluster-master-2
      nodeLabels:              # not used for BGP announcement of the
        - key: rack-id         # cluster API LB IP but can be used for
          value: rack-master-2 # MetalLB if "nodeSelectors" are required
Configuration example for Machine 3
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: test-cluster-master-3
  namespace: managed-ns
  annotations:
    metal3.io/BareMetalHost: managed-ns/test-cluster-master-3
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    cluster.sigs.k8s.io/control-plane: controlplane
    hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
    ipam/RackRef: rack-master-3
    kaas.mirantis.com/provider: baremetal
spec:
  providerSpec:
    value:
      kind: BareMetalMachineProviderSpec
      apiVersion: baremetal.k8s.io/v1alpha1
      hostSelector:
        matchLabels:
          kaas.mirantis.com/baremetalhost-id: test-cluster-master-3
      l2TemplateSelector:
        name: test-cluster-master-3
      nodeLabels:              # optional. not used for BGP announcement of
        - key: rack-id         # the cluster API LB IP but can be used for
          value: rack-master-3 # MetalLB if "nodeSelectors" are required
Configuration example for Subnet defining the cluster API
LB IP address
The following L2Template objects differ in LCM and external subnets that
each master node uses.
Configuration example for L2Template 1
apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
  name: test-cluster-master-1
  namespace: managed-ns
spec:
  ...
  l3Layout:
    - subnetName: lcm-rack-control-1  # this network is referenced
      scope: namespace                # in the "rack-master-1" Rack
    - subnetName: ext-rack-control-1  # this optional network is used for
      scope: namespace                # Kubernetes services traffic and
                                      # MetalLB BGP connections
  ...
  npTemplate: |
    ...
    ethernets:
      lo:
        addresses:
          - {{ cluster_api_lb_ip }}  # function for cluster API LB IP
        dhcp4: false
        dhcp6: false
    ...
Configuration example for L2Template 2
apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
  name: test-cluster-master-2
  namespace: managed-ns
spec:
  ...
  l3Layout:
    - subnetName: lcm-rack-control-2  # this network is referenced
      scope: namespace                # in "rack-master-2" Rack
    - subnetName: ext-rack-control-2  # this network is used for Kubernetes services
      scope: namespace                # traffic and MetalLB BGP connections
  ...
  npTemplate: |
    ...
    ethernets:
      lo:
        addresses:
          - {{ cluster_api_lb_ip }}  # function for cluster API LB IP
        dhcp4: false
        dhcp6: false
    ...
Configuration example for L2Template 3
apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
  name: test-cluster-master-3
  namespace: managed-ns
spec:
  ...
  l3Layout:
    - subnetName: lcm-rack-control-3  # this network is referenced
      scope: namespace                # in "rack-master-3" Rack
    - subnetName: ext-rack-control-3  # this network is used for Kubernetes services
      scope: namespace                # traffic and MetalLB BGP connections
  ...
  npTemplate: |
    ...
    ethernets:
      lo:
        addresses:
          - {{ cluster_api_lb_ip }}  # function for cluster API LB IP
        dhcp4: false
        dhcp6: false
    ...
The following MetalLBConfig example illustrates how node labels
are used in nodeSelectors of bgpPeers. Each of bgpPeers
corresponds to one of master nodes.
Configuration example for MetalLBConfig
apiVersion: ipam.mirantis.com/v1alpha1
kind: MetalLBConfig
metadata:
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
  name: test-cluster-metallb-config
  namespace: managed-ns
spec:
  ...
  bgpPeers:
    - name: svc-peer-rack1
      spec:
        holdTime: 0s
        keepaliveTime: 0s
        peerAddress: 10.77.41.1 # peer address is in the external subnet
                                # instead of the LCM subnet used for the BGP
                                # connection to announce the cluster API LB IP
        peerASN: 65100          # the same as for the BGP connection used to
                                # announce the cluster API LB IP
        myASN: 65101            # the same as for the BGP connection used to
                                # announce the cluster API LB IP
        nodeSelectors:
          - matchLabels:
              rack-id: rack-master-1 # references the node corresponding
                                     # to the "test-cluster-master-1" Machine
    - name: svc-peer-rack2
      spec:
        holdTime: 0s
        keepaliveTime: 0s
        peerAddress: 10.77.42.1
        peerASN: 65100
        myASN: 65101
        nodeSelectors:
          - matchLabels:
              rack-id: rack-master-2
    - name: svc-peer-rack3
      spec:
        holdTime: 0s
        keepaliveTime: 0s
        peerAddress: 10.77.43.1
        peerASN: 65100
        myASN: 65101
        nodeSelectors:
          - matchLabels:
              rack-id: rack-master-3
  ...
After the objects are created and nodes are provisioned, the IpamHost
objects will have BGP daemon configuration files in their status fields.
Refer to Single rack configuration example on how to verify the BGP configuration
files.
This section describes the Rack resource used in the Container Cloud API.
When you create a bare metal managed cluster with a multi-rack topology,
where Kubernetes masters are distributed across multiple racks
without L2 layer extension between them, the Rack resource allows you
to configure BGP announcement of the cluster API load balancer address from
each rack.
In this scenario, Rack objects must be bound to Machine objects
corresponding to master nodes of the cluster. Each Rack object describes
the configuration of the BGP daemon (bird) used to announce the cluster API LB
address from a particular master node (or from several nodes in the same rack).
Rack objects are used for a particular cluster only in conjunction with
the MultiRackCluster object described in MultiRackCluster.
For demonstration purposes, the Container Cloud Rack custom resource (CR)
description is split into the following major sections:
The Container Cloud Rack CR metadata contains the following fields:
apiVersion
API version of the object that is ipam.mirantis.com/v1alpha1.
kind
Object type that is Rack.
metadata
The metadata field contains the following subfields:
name
Name of the Rack object. Corresponding Machine objects must
have their ipam/RackRef label value set to the name of the Rack
object. This label is required only for Machine objects of the
master nodes that announce the cluster API LB address.
namespace
Container Cloud project (Kubernetes namespace) where the object was
created.
labels
Key-value pairs that are attached to the object:
cluster.sigs.k8s.io/cluster-name
Cluster object name that this Rack object is applied to.
kaas.mirantis.com/provider
Provider name that is baremetal.
kaas.mirantis.com/region
Region name.
Note
The kaas.mirantis.com/region label is removed from all
Container Cloud objects in 2.26.0 (Cluster releases 17.1.0 and 16.1.0).
Therefore, do not add the label starting from these releases. On existing
clusters updated to these releases, or if the label was added manually,
it is ignored by Container Cloud.
Warning
Labels and annotations that are not documented in this API
Reference are generated automatically by Container Cloud. Do not modify them
using the Container Cloud API.
The spec field of the Rack resource describes the desired
state of the object. It contains the following fields:
bgpdConfigTemplate
Optional. Configuration file template that will be used to create
configuration file for a BGP daemon on nodes in this rack. If not set, the
configuration file template from the corresponding MultiRackCluster
object is used.
peeringMap
Structure that describes general parameters of BGP peers to be used
in the configuration file for a BGP daemon for each network where BGP
announcement is used. Also, you can define a separate configuration file
template for the BGP daemon for each of those networks.
The peeringMap structure is as follows:
peeringMap:
  <network-name-a>:
    peers:
      - localASN: <localASN-1>
        neighborASN: <neighborASN-1>
        neighborIP: <neighborIP-1>
        password: <password-1>
      - localASN: <localASN-2>
        neighborASN: <neighborASN-2>
        neighborIP: <neighborIP-2>
        password: <password-2>
    bgpdConfigTemplate: |
      <configuration file template for a BGP daemon>
  ...
<network-name-a>
Name of the network where a BGP daemon should connect to the neighbor
BGP peers. By default, it is implied that the same network is used on the
node to make connection to the neighbor BGP peers as well as to receive
and respond to the traffic directed to the IP address being advertised.
In our scenario, the advertised IP address is the cluster API LB
IP address.
This network name must be the same as the subnet name used in the L2
template (l3Layout section) for the corresponding master node(s).
peers
Optional. List of dictionaries where each dictionary defines
configuration parameters for a particular BGP peer. Peer parameters are
as follows:
localASN
Optional. Local AS number. If not set, it can be taken from
MultiRackCluster.spec.defaultPeer or can be hardcoded in
bgpdConfigTemplate.
neighborASN
Optional. Neighbor AS number. If not set, it can be taken from
MultiRackCluster.spec.defaultPeer or can be hardcoded in
bgpdConfigTemplate.
neighborIP
Mandatory. Neighbor IP address.
password
Optional. Neighbor password. If not set, it can be taken from
MultiRackCluster.spec.defaultPeer or can be hardcoded in
bgpdConfigTemplate. It is required when MD5 authentication
between BGP peers is used.
bgpdConfigTemplate
Optional. Configuration file template that will be used to create the
configuration file for the BGP daemon of the network-name-a network
on a particular node. If not set, Rack.spec.bgpdConfigTemplate
is used.
Configuration example:
Since Cluster releases 17.1.0 and 16.1.0 for bird v2.x
spec:
  bgpdConfigTemplate: |
    protocol device {}
    #
    protocol direct {
      interface "lo";
      ipv4;
    }
    #
    protocol kernel {
      ipv4 { export all; };
    }
    #
    protocol bgp bgp_lcm {
      local port 1179 as {{.LocalASN}};
      neighbor {{.NeighborIP}} as {{.NeighborASN}};
      ipv4 {
        import none;
        export filter { if dest = RTD_UNREACHABLE then { reject; } accept; };
      };
    }
  peeringMap:
    lcm-rack1:
      peers:
        - localASN: 65050
          neighborASN: 65011
          neighborIP: 10.77.31.1
Before Cluster releases 17.1.0 and 16.1.0 for bird v1.x
spec:
  bgpdConfigTemplate: |
    listen bgp port 1179;
    protocol device {}
    #
    protocol direct {
      interface "lo";
    }
    #
    protocol kernel {
      export all;
    }
    #
    protocol bgp bgp_lcm {
      local as {{.LocalASN}};
      neighbor {{.NeighborIP}} as {{.NeighborASN}};
      import all;
      export filter { if dest = RTD_UNREACHABLE then { reject; } accept; };
    }
  peeringMap:
    lcm-rack1:
      peers:
        - localASN: 65050
          neighborASN: 65011
          neighborIP: 10.77.31.1
The status field of the Rack resource reflects the actual state
of the Rack object and contains the following fields:
state (Since 2.23.0)
Message that reflects the current status of the resource.
The list of possible values includes the following:
OK - object is operational.
ERR - object is non-operational. This status has a detailed
description in the messages list.
TERM - object was deleted and is terminating.
messages (Since 2.23.0)
List of error or warning messages if the object state is ERR.
objCreated
Date, time, and IPAM version of the resource creation.
objStatusUpdated
Date, time, and IPAM version of the last update of the status
field in the resource.
objUpdated
Date, time, and IPAM version of the last resource update.
Configuration example:
status:
  checksums:
    annotations: sha256:cd4b751d9773eacbfd5493712db0cbebd6df0762156aefa502d65a9d5e8af31d
    labels: sha256:fc2612d12253443955e1bf929f437245d304b483974ff02a165bc5c78363f739
    spec: sha256:8f0223b1eefb6a9cd583905a25822fd83ac544e62e1dfef26ee798834ef4c0c1
  objCreated: 2023-08-11T12:25:21.00000Z by v6.5.999-20230810-155553-2497818
  objStatusUpdated: 2023-08-11T12:33:00.92163Z by v6.5.999-20230810-155553-2497818
  objUpdated: 2023-08-11T12:32:59.11951Z by v6.5.999-20230810-155553-2497818
  state: OK
The Container Cloud Subnet CR contains the following fields:
apiVersion
API version of the object that is ipam.mirantis.com/v1alpha1.
kind
Object type that is Subnet.
metadata
This field contains the following subfields:
name
Name of the Subnet object.
namespace
Project in which the Subnet object was created.
labels
Key-value pairs that are attached to the object:
ipam/DefaultSubnet: "1" (Deprecated since 2.14.0)
Indicates that this subnet was automatically created
for the PXE network.
ipam/UID
Unique ID of a subnet.
kaas.mirantis.com/provider
Provider type.
kaas.mirantis.com/region
Region name.
Note
The kaas.mirantis.com/region label is removed from all
Container Cloud objects in 2.26.0 (Cluster releases 17.1.0 and 16.1.0).
Therefore, do not add the label starting from these releases. On existing
clusters updated to these releases, or if the label was added manually,
it is ignored by Container Cloud.
Warning
Labels and annotations that are not documented in this API
Reference are generated automatically by Container Cloud. Do not modify them
using the Container Cloud API.
The spec field of the Subnet resource describes the desired state of
a subnet. It contains the following fields:
cidr
A valid IPv4 CIDR, for example, 10.11.0.0/24.
gateway
A valid gateway address, for example, 10.11.0.9.
includeRanges
A comma-separated list of IP address ranges within the given CIDR that should
be used in the allocation of IPs for nodes. The gateway, network, broadcast,
and DNS addresses will be excluded (protected) automatically if they intersect
with one of the ranges. The IPs outside the given ranges will not be used in
the allocation. Each element of the list can be either an interval
10.11.0.5-10.11.0.70 or a single address 10.11.0.77.
Warning
Do not use values that are out of the given CIDR.
excludeRanges
A comma-separated list of IP address ranges within the given CIDR that should
not be used in the allocation of IPs for nodes. The IPs within the given CIDR
but outside the given ranges will be used in the allocation.
The gateway, network, broadcast, and DNS addresses will be excluded
(protected) automatically if they are included in the CIDR.
Each element of the list can be either an interval 10.11.0.5-10.11.0.70
or a single address 10.11.0.77.
Warning
Do not use values that are out of the given CIDR.
useWholeCidr
If set to false (by default), the subnet address and broadcast
address will be excluded from the address allocation.
If set to true, the subnet address and the broadcast address
are included into the address allocation for nodes.
nameservers
The list of IP addresses of name servers. Each element of the list
is a single address, for example, 172.18.176.6.
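The following illustrative Subnet definition ties these fields together. All names and addresses are placeholder assumptions, and whether includeRanges and excludeRanges are written as YAML lists or as comma-separated strings must be verified against your product version:

apiVersion: ipam.mirantis.com/v1alpha1
kind: Subnet
metadata:
  name: demo-subnet          # placeholder name
  namespace: managed-ns      # placeholder project
  labels:
    kaas.mirantis.com/provider: baremetal
spec:
  cidr: 10.11.0.0/24
  gateway: 10.11.0.9
  includeRanges:
    - 10.11.0.5-10.11.0.70
  excludeRanges:
    - 10.11.0.17
  useWholeCidr: false
  nameservers:
    - 172.18.176.6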
The status field of the Subnet resource describes the actual state of
a subnet. It contains the following fields:
allocatable
The number of IP addresses that are available for allocation.
allocatedIPs
The list of allocated IP addresses in the IP:<IPaddr object UID> format.
capacity
The total number of IP addresses to be allocated, including the sum of
allocatable and already allocated IP addresses.
cidr
The IPv4 CIDR for a subnet.
gateway
The gateway address for a subnet.
nameservers
The list of IP addresses of name servers.
ranges
The list of IP address ranges within the given CIDR that are used in
the allocation of IPs for nodes.
statusMessage
Deprecated since Container Cloud 2.23.0 and will be removed in one of the
following releases in favor of state and messages. Since Container
Cloud 2.24.0, this field is not set for the subnets of newly created
clusters. For the field description, see state.
state (Since 2.23.0)
Message that reflects the current status of the resource.
The list of possible values includes the following:
OK - object is operational.
ERR - object is non-operational. This status has a detailed
description in the messages list.
TERM - object was deleted and is terminating.
messages (Since 2.23.0)
List of error or warning messages if the object state is ERR.
objCreated
Date, time, and IPAM version of the resource creation.
objStatusUpdated
Date, time, and IPAM version of the last update of the status
field in the resource.
objUpdated
Date, time, and IPAM version of the last resource update.
Configuration example:
status:
  allocatable: 51
  allocatedIPs:
    - 172.16.48.200:24e94698-f726-11ea-a717-0242c0a85b02
    - 172.16.48.201:2bb62373-f726-11ea-a717-0242c0a85b02
    - 172.16.48.202:37806659-f726-11ea-a717-0242c0a85b02
  capacity: 54
  cidr: 172.16.48.0/24
  gateway: 172.16.48.1
  nameservers:
    - 172.18.176.6
  ranges:
    - 172.16.48.200-172.16.48.253
  objCreated: 2021-10-21T19:09:32Z by v5.1.0-20210930-121522-f5b2af8
  objStatusUpdated: 2021-10-21T19:14:18.748114886Z by v5.1.0-20210930-121522-f5b2af8
  objUpdated: 2021-10-21T19:09:32.606968024Z by v5.1.0-20210930-121522-f5b2af8
  state: OK
The Container Cloud SubnetPool CR contains the following fields:
apiVersion
API version of the object that is ipam.mirantis.com/v1alpha1.
kind
Object type that is SubnetPool.
metadata
The metadata field contains the following subfields:
name
Name of the SubnetPool object.
namespace
Project in which the SubnetPool object was created.
labels
Key-value pairs that are attached to the object:
kaas.mirantis.com/provider
Provider type that is baremetal.
kaas.mirantis.com/region
Region name.
Note
The kaas.mirantis.com/region label is removed from all
Container Cloud objects in 2.26.0 (Cluster releases 17.1.0 and 16.1.0).
Therefore, do not add the label starting from these releases. On existing
clusters updated to these releases, or if the label was added manually,
it is ignored by Container Cloud.
Warning
Labels and annotations that are not documented in this API
Reference are generated automatically by Container Cloud. Do not modify them
using the Container Cloud API.
The spec field of the SubnetPool resource describes the desired state
of a subnet pool. It contains the following fields:
cidr
Valid IPv4 CIDR. For example, 10.10.0.0/16.
blockSize
IP address block size to use when assigning an IP address block
to every new child Subnet object. For example, if you set /25,
every new child Subnet will have 128 IPs to allocate.
Possible values are from /29 to the cidr size. Immutable.
nameservers
Optional. List of IP addresses of name servers to use for every new child
Subnet object. Each element of the list is a single address,
for example, 172.18.176.6. Default: empty.
gatewayPolicy
Optional. Method of assigning a gateway address to new child Subnet
objects. Default: none. Possible values are:
first - first IP of the IP address block assigned to a child
Subnet, for example, 10.11.10.1.
last - last IP of the IP address block assigned to a child Subnet,
for example, 10.11.10.254.
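The following illustrative SubnetPool definition ties these fields together. The object name, project, and addresses are placeholder assumptions:

apiVersion: ipam.mirantis.com/v1alpha1
kind: SubnetPool
metadata:
  name: demo-subnet-pool     # placeholder name
  namespace: default         # placeholder project
  labels:
    kaas.mirantis.com/provider: baremetal
spec:
  cidr: 10.10.0.0/16
  blockSize: /25             # every child Subnet receives a /25 block (128 IPs)
  nameservers:
    - 172.18.176.6
  gatewayPolicy: first       # the first IP of each block becomes the gateway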
The status field of the SubnetPool resource describes the actual state
of a subnet pool. It contains the following fields:
allocatedSubnets
List of allocated subnets. Each subnet has the <CIDR>:<SUBNET_UID>
format.
blockSize
Block size to use for IP address assignments from the defined pool.
capacity
Total number of IP addresses to be allocated. Includes the number of
allocatable and already allocated IP addresses.
allocatable
Number of subnets with the blockSize size that are available for
allocation.
state (Since 2.23.0)
Message that reflects the current status of the resource.
The list of possible values includes the following:
OK - object is operational.
ERR - object is non-operational. This status has a detailed
description in the messages list.
TERM - object was deleted and is terminating.
messages (Since 2.23.0)
List of error or warning messages if the object state is ERR.
objCreated
Date, time, and IPAM version of the resource creation.
objStatusUpdated
Date, time, and IPAM version of the last update of the status
field in the resource.
objUpdated
Date, time, and IPAM version of the last resource update.
Example:
status:
  allocatedSubnets:
    - 10.10.0.0/24:0272bfa9-19de-11eb-b591-0242ac110002
  blockSize: /24
  capacity: 54
  allocatable: 51
  objCreated: 2021-10-21T19:09:32Z by v5.1.0-20210930-121522-f5b2af8
  objStatusUpdated: 2021-10-21T19:14:18.748114886Z by v5.1.0-20210930-121522-f5b2af8
  objUpdated: 2021-10-21T19:09:32.606968024Z by v5.1.0-20210930-121522-f5b2af8
  state: OK
The Mirantis Container Cloud Release Compatibility Matrix
outlines the specific operating environments that are validated and supported.
The document provides the deployment compatibility for each product release and
determines the upgrade paths between major component versions.
The document also covers the Container Cloud browser compatibility.
A Container Cloud management cluster upgrades automatically when a new
product release becomes available. Once the management cluster has been
updated, the user may trigger the managed clusters upgrade through the
Container Cloud web UI or API.
To view the full components list with their respective versions for each
Container Cloud release, refer to the Container Cloud Release Notes related
to the release version of your deployment or use the Releases
section in the web UI or API.
Caution
The document applies to the Container Cloud regular
deployments. For supported configurations of existing Mirantis Kubernetes
Engine (MKE) clusters that are not deployed by Container Cloud, refer to
MKE Compatibility Matrix.
The following tables outline the compatibility matrices of the most recent
major Container Cloud and Cluster releases along with patch releases and
their component versions. For details about unsupported releases, see
Releases summary.
Major and patch versions update path
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. Though, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
Legend
Symbol
Definition
Cluster release is not included in the Container Cloud release yet.
Latest supported Cluster release to use for cluster deployment or update.
Deprecated Cluster release that you must update to the latest supported
Cluster release. The deprecated Cluster release will become unsupported
in one of the following Container Cloud releases. Greenfield deployments
based on a deprecated Cluster release are not supported.
Use the latest supported Cluster release instead.
Unsupported Cluster release that blocks automatic upgrade of a
management cluster. Update the Cluster release to the latest supported
one to unblock management cluster upgrade and obtain newest product
features and enhancements.
Component is included in the Container Cloud release.
Component is available in the Technology Preview
scope. Use it only for testing purposes on staging environments.
Component is unsupported in the Container Cloud release.
The following table outlines the compatibility matrix for the Container Cloud
release series 2.29.x.
The major Cluster release 14.1.0 is dedicated for the vSphere provider
only. This is the last Cluster release for the vSphere provider based
on MCR 20.10 and MKE 3.6.6 with Kubernetes 1.24.
OpenStack Antelope is supported as TechPreview since
MOSK 23.3.
A Container Cloud cluster based on MOSK Yoga or
Antelope with Tungsten Fabric is supported as TechPreview since
Container Cloud 2.25.1. Since Container Cloud 2.26.0, support for this
configuration is suspended. If you still require this configuration,
contact Mirantis support for further information.
OpenStack Victoria is supported until September, 2023.
MOSK 23.2 is the last release version where OpenStack
Victoria packages are updated.
If you have not already upgraded your OpenStack version to Yoga,
Mirantis highly recommends doing this during the course of the
MOSK 23.2 series. For details, see
MOSK documentation: Upgrade OpenStack.
Since Container Cloud 2.27.3 (Cluster release 16.2.3), the VMware
vSphere configuration is unsupported. For details, see
Deprecation notes.
VMware vSphere is supported on RHEL 8.7 or Ubuntu 20.04.
RHEL 8.7 is generally available since Cluster releases 16.0.0 and
14.1.0. Before these Cluster releases, it is supported within the
Technology Preview features scope.
For Ubuntu deployments, Packer builds a vSphere virtual machine
template that is based on Ubuntu 20.04 with kernel
5.15.0-116-generic.
If you build a VM template manually, we recommend installing the
same kernel version 5.15.0-116-generic.
Attachment of non Container Cloud based MKE clusters is supported
only for vSphere-based management clusters on Ubuntu 20.04. Since Container
Cloud 2.27.3 (Cluster release 16.2.3), the vSphere-based configuration is
unsupported. For details, see Deprecation notes.
The kernel version of the host operating system is validated by Mirantis
and confirmed to be working for the supported use cases. If you use
custom kernel versions or third-party vendor-provided kernels, such
as FIPS-enabled ones, you assume full responsibility for validating the
compatibility of components in such environments.
On non-MOSK clusters, Ubuntu 22.04 is installed by
default on management and managed clusters. Ubuntu 20.04 is not
supported.
On MOSK clusters:
Since Container Cloud 2.28.0 (Cluster release 17.3.0), Ubuntu 22.04
is generally available for managed clusters. All existing
deployments based on Ubuntu 20.04 must be upgraded to 22.04 within
the course of 2.28.x. Otherwise, update of managed clusters to
2.29.0 will become impossible and management cluster update to
2.29.1 will be blocked.
Before Container Cloud 2.28.0 (Cluster releases 17.2.0, 16.2.0, or
earlier), Ubuntu 22.04 is installed by default on management
clusters only. And Ubuntu 20.04 is the only supported distribution
for managed clusters.
In Container Cloud 2.29.1 (Cluster releases 17.3.6, 16.4.1, and
16.3.6), docker-ee-cli is updated to 23.0.17 on MOSK
clusters (MCR 23.0.15) and 25.0.9m1 on management clusters (MCR 25.0.8)
to fix several CVEs.
The Container Cloud web UI runs in the browser, separate from any backend
software. As such, Mirantis aims to support browsers separately from
the backend software in use, although each Container Cloud release is tested
with specific browser versions.
Mirantis currently supports the following web browsers for the Container Cloud
web UI:
Browser | Supported version | Release date | Supported operating system
Firefox | 94.0 or newer | November 2, 2021 | Windows, macOS
Google Chrome | 96.0.4664 or newer | November 15, 2021 | Windows, macOS
Microsoft Edge | 95.0.1020 or newer | October 21, 2021 | Windows
Caution
This table does not apply to third-party web UIs such as the
StackLight or Keycloak endpoints that are available through the Container
Cloud web UI. Refer to the official documentation of the corresponding
third-party component for details about its supported browser versions.
To ensure the best user experience, Mirantis recommends that you use the
latest version of any of the supported browsers. Using other browsers
or older versions of the supported browsers can result in rendering issues,
and can even lead to glitches and crashes if the browser does not support
some JavaScript language features or web APIs that the Container Cloud
web UI relies on.
Important
Mirantis does not tie browser support to any particular Container Cloud
release.
Mirantis strives to leverage the latest in browser technology to build more
performant client software and to ensure that our customers benefit from
the latest browser security updates. To this end, our strategy is to regularly
move our supported browser versions forward, while also lagging behind the
latest releases by approximately one year to give our customers a
sufficient upgrade buffer.
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. However, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section outlines the release notes for the Mirantis
Container Cloud GA release. Within the scope of the Container Cloud GA
release, major releases are being published continuously with new features,
improvements, and critical issues resolutions to enhance the
Container Cloud GA version. Between major releases, patch releases that
incorporate fixes for CVEs of high and critical severity are being delivered.
For details, see Container Cloud releases, Cluster releases (managed), and
Patch releases.
Once a new Container Cloud release is available, a management cluster
automatically upgrades to a newer consecutive release unless this cluster
contains managed clusters with a Cluster release unsupported by the newer
Container Cloud release. For more details about the Container Cloud
release mechanism, see
Reference Architecture: Release Controller.
The Container Cloud patch release 2.29.2, which is based on the
2.29.0 major release, provides the following updates:
Support for the patch Cluster releases 16.3.7 and 16.4.2
Support for the patch Cluster release 17.3.7
that represents Mirantis OpenStack for Kubernetes (MOSK) patch release
24.3.4
Support for Mirantis Kubernetes Engine 3.7.20
Support for docker-ee-cli 23.0.17 on MOSK clusters
(MCR 23.0.15) and 25.0.9m1 on management clusters (MCR 25.0.8)
Mandatory migration of container runtime from Docker to containerd
Bare metal: update of Ubuntu mirror from ubuntu-2025-03-05-003900 to
ubuntu-2025-03-31-003900 along with update of minor kernel version from
5.15.0-134-generic to 5.15.0-135-generic
Ubuntu base image: support for utils that extend NVMe provisioning options
Security fixes for CVEs in images
This patch release also supports the latest major Cluster releases
17.4.0 and 16.4.0. And it does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available Cluster releases
instead.
For main deliverables of the parent Container Cloud release of 2.29.0, refer
to 2.29.0.
This section describes the specific actions you as a cloud operator need to
complete before or after your Container Cloud cluster update to the Cluster
releases 17.3.7, 16.3.7, or 16.4.2.
Post-update actions¶Mandatory migration of container runtime from Docker to containerd¶
Migration of container runtime from Docker to containerd, which is implemented
for existing management and managed clusters, becomes mandatory in the scope of
Container Cloud 2.29.x. Otherwise, the management cluster update to Container
Cloud 2.30.0 will be blocked.
In total, since Container Cloud 2.29.1, 103 Common Vulnerabilities and
Exposures (CVE) have been fixed in 2.29.2: 5 of critical and 98 of
high severity.
The table below includes the total numbers of addressed unique and common CVEs
in images by product component. The common CVEs are issues addressed across
several images.
This section lists known issues with workarounds for the Mirantis Container
Cloud release 2.29.2 including the Cluster releases 17.3.7,
16.3.7, and 16.4.2. For the known issues in the related
MOSK release, see MOSK release notes 24.3.4:
Known issues.
This section also outlines still valid known issues
from previous Container Cloud releases.
Bare metal¶[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains with no external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
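For example, assuming the affected service is named <serviceName> in the
<namespaceName> namespace (both are placeholders), one way to make such a dummy
change is to add or refresh a throwaway annotation so that the object is
re-processed:
kubectl annotate service <serviceName> -n <namespaceName> dummy-change="$(date +%s)" --overwrite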
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
Ceph¶[50637] Ceph creates second miracephnodedisable object during node disabling¶
During managed cluster update, if some node is being disabled and at the same
time ceph-maintenance-controller is restarted, a second
miracephnodedisable object is erroneously created for the node. As a
result, the second object fails in the Cleaning state, which blocks
managed cluster update.
Workaround
On the affected managed cluster, obtain the list of miracephnodedisable
objects:
kubectl get miracephnodedisable -n ceph-lcm-mirantis
The system response must contain one completed and one failed
miracephnodedisable object for the node being disabled. For example:
[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue
66717,
during CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slow and even fail the starting probe with the following describe output in
the rook-ceph-osd-X pod:
Complete the following steps during every patch or major cluster update of the
Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes
supported):
Plan extra time in the maintenance window for the patch cluster update.
Slow starts will still impact the update procedure, but after completing the
following step, the recovery process noticeably shortens without affecting
the overall cluster state and data responsiveness.
Select one of the following options:
Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSDs image upgrade is done, unset the flag:
ceph osd unset noout
Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear,
set the noout flag as soon as possible. Once the Ceph OSDs image
upgrade is done, unset the flag.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
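For example, assuming the <affectedProjectName> and <affectedPodName>
placeholders described below, a sketch of this check is:
kubectl -n <affectedProjectName> describe pod <affectedPodName>
Look for FailedMount entries in the Events section of the output.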
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
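For example, assuming the Ceph CSI plugin runs in the rook-ceph namespace with
the app=csi-rbdplugin label (verify both for your deployment), a sketch of this
lookup is:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin --field-selector spec.nodeName=<nodeName> -o name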
Scale up the affected StatefulSet or Deployment back to the
original number of replicas and wait until its state becomes Running.
LCM¶[50561] The local-volume-provisioner pod switches to CrashLoopBackOff¶
After machine disablement and consequent re-enablement, persistent volumes
(PVs) provisioned by local-volume-provisioner that are not used by any pod
may cause the local-volume-provisioner pod on such machine to switch to the
CrashLoopBackOff state.
Workaround:
Identify the ID of the affected local-volume-provisioner:
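For example, assuming the standard app=local-volume-provisioner label (verify
the label and namespace for your deployment), a sketch of this lookup for the
affected machine <nodeName> is:
kubectl get pod -A -l app=local-volume-provisioner -o wide | grep <nodeName>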
To work around the issue, manually adjust the affected dashboards to
restore their custom appearance.
Cluster update¶[51339] Cluster upgrade to 2.29.2 is blocked if nodes are not rebooted¶
Cluster upgrade from Container Cloud 2.29.1 to 2.29.2 is blocked if at least
one node of any cluster is not rebooted while applying upgrade. As a
workaround, reboot all cluster nodes.
Container Cloud web UI¶[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines along with
the inability to change dedicatedControlPlane: false using the web UI.
To work around the issue, manually add the required labels using CLI. Once
done, the cluster deployment resumes.
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and contains various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser in five minutes after the
project creation.
The following issues have been addressed in the Mirantis Container Cloud
release 2.29.2 along with the Cluster releases 17.3.7,
16.3.7, and 16.4.2, where applicable. For
the list of MOSK addressed issues, if any, see
MOSK documentation: Release notes 24.3.4.
[51145] [StackLight] Addressed the issue that caused the
PrometheusTargetScrapesDuplicate alert to permanently fire on a
management cluster that has sf-notifier enabled.
[50636] [LCM] Addressed the issue that caused the nfs-common package
to be deleted during MOSK cluster update. This package is no
longer automatically removed from cluster nodes if a MOSK
cluster is deployed with the MariaDB backup hosted on an external NFS
backend.
This section lists the artifacts of components included in the Container Cloud
patch release 2.29.2. For artifacts of the Cluster releases introduced in
2.29.2, see patch Cluster releases 17.3.7, 16.3.7, and
16.4.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Container Cloud patch release 2.29.1, which is based on the
2.29.0 major release, provides the following updates:
Support for the patch Cluster releases 16.3.6 and 16.4.1
Support for the patch Cluster release 17.3.6
that represents Mirantis OpenStack for Kubernetes (MOSK) patch release
24.3.3
Support for Mirantis Kubernetes Engine 3.7.20
Support for docker-ee-cli 23.0.17 on MOSK clusters
(MCR 23.0.15) and 25.0.9m1 on management clusters (MCR 25.0.8)
Mandatory migration of container runtime from Docker to containerd
Bare metal: update of Ubuntu mirror to ubuntu-2025-03-05-003900 along with
update of minor kernel version to 5.15.0-134-generic
Security fixes for CVEs in images
This patch release also supports the latest major Cluster releases
17.4.0 and 16.4.0. And it does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available Cluster releases
instead.
For main deliverables of the parent Container Cloud release of 2.29.0, refer
to 2.29.0.
This section describes the specific actions you as a cloud operator need to
complete before or after your Container Cloud cluster update to the Cluster
releases 17.3.6, 16.3.6, or 16.4.1.
Pre-update actions¶Update managed clusters to Ubuntu 22.04¶
Management cluster update to Container Cloud 2.29.1 will be blocked if at least
one node of any related managed cluster is running Ubuntu 20.04, which reaches
end-of-life in April 2025. Moreover, in Container Cloud 2.29.0, the Cluster
release update of the Ubuntu 20.04-based managed clusters became impossible,
and Ubuntu 22.04 became the only supported version of the operating system.
Therefore, ensure that every node of all your managed clusters is running
Ubuntu 22.04 to unblock management cluster update in Container Cloud 2.29.1 and
managed cluster update in Container Cloud 2.29.0.
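For example, a generic, non-product-specific way to verify the distribution on
every node of a managed cluster is to inspect the standard node status fields:
kubectl get nodes -o custom-columns=NAME:.metadata.name,OS:.status.nodeInfo.osImage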
Existing management clusters were automatically updated to Ubuntu
22.04 during cluster upgrade to the Cluster release 16.2.0 in Container
Cloud 2.27.0. Greenfield deployments of management clusters are also based
on Ubuntu 22.04.
Post-update actions¶Migration of container runtime from Docker to containerd¶
Since Container Cloud 2.28.4, Mirantis introduced an optional migration of
container runtime from Docker to containerd, which is implemented for existing
management and managed bare metal clusters. This migration becomes mandatory in
the scope of Container Cloud 2.29.x. Otherwise, the management cluster update
to Container Cloud 2.30.0 will be blocked.
In total, since Container Cloud 2.29.0, 203 Common Vulnerabilities and
Exposures (CVE) have been fixed in 2.29.1: 21 of critical and 182 of
high severity.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since Container Cloud 2.29.0.
The common CVEs are issues addressed across several images.
This section lists known issues with workarounds for the Mirantis Container
Cloud release 2.29.1 including the Cluster releases 17.3.6,
16.3.6, and 16.4.1. For the known issues in the related
MOSK release, see MOSK release notes 24.3.3:
Known issues.
This section also outlines still valid known issues
from previous Container Cloud releases.
Bare metal¶[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains with no external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
Ceph¶[50637] Ceph creates second miracephnodedisable object during node disabling¶
During managed cluster update, if some node is being disabled and at the same
time ceph-maintenance-controller is restarted, a second
miracephnodedisable object is erroneously created for the node. As a
result, the second object fails in the Cleaning state, which blocks
managed cluster update.
Workaround
On the affected managed cluster, obtain the list of miracephnodedisable
objects:
kubectl get miracephnodedisable -n ceph-lcm-mirantis
The system response must contain one completed and one failed
miracephnodedisable object for the node being disabled. For example:
[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue
66717,
during CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slow and even fail the starting probe with the following describe output in
the rook-ceph-osd-X pod:
Complete the following steps during every patch or major cluster update of the
Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes
supported):
Plan extra time in the maintenance window for the patch cluster update.
Slow starts will still impact the update procedure, but after completing the
following step, the recovery process noticeably shortens without affecting
the overall cluster state and data responsiveness.
Select one of the following options:
Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSDs image upgrade is done, unset the flag:
ceph osd unset noout
Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear,
set the noout flag as soon as possible. Once the Ceph OSDs image
upgrade is done, unset the flag.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
Scale up the affected StatefulSet or Deployment back to the
original number of replicas and wait until its state becomes Running.
LCM¶[50561] The local-volume-provisioner pod switches to CrashLoopBackOff¶
After machine disablement and consequent re-enablement, persistent volumes
(PVs) provisioned by local-volume-provisioner that are not used by any pod
may cause the local-volume-provisioner pod on such machine to switch to the
CrashLoopBackOff state.
Workaround:
Identify the ID of the affected local-volume-provisioner:
On management clusters with sf-notifier enabled, the
PrometheusTargetScrapesDuplicate alert is permanently firing while
sf-notifier runs with no errors.
You can safely disregard the issue because it does not affect cluster health.
To work around the issue, manually adjust the affected dashboards to
restore their custom appearance.
Container Cloud web UI¶[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines along with
the inability to change dedicatedControlPlane: false using the web UI.
To work around the issue, manually add the required labels using CLI. Once
done, the cluster deployment resumes.
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and contains various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser in five minutes after the
project creation.
The following issues have been addressed in the Mirantis Container Cloud
release 2.29.1 along with the Cluster releases 17.3.6,
16.3.6, and 16.4.1, where applicable. For
the list of MOSK addressed issues, if any, see
MOSK documentation: Release notes 24.3.3.
[50768] [LCM] Addressed the issue that prevented successful editing of
the MCCUpgrade object, which returned the Internal error: failed to
call webhook: the server could not find the requested resource error when
trying to save changes in the object.
[50622] [core] Addressed the issue that prevented any user except
m:kaas@management-admin from accessing or modifying BareMetalHostInventory
objects.
[50287] [bare metal] Addressed the issue that prevented a
BareMetalHost object with a Redfish Baseboard Management Controller
address from passing the registering phase.
[50140] [Container Cloud web UI] Addressed the issue that prevented the
Clusters page for the bare metal provider to display information
about the Ceph cluster in the Ceph Clusters tab.
This section lists the artifacts of components included in the Container Cloud
patch release 2.29.1. For artifacts of the Cluster releases introduced in
2.29.1, see patch Cluster releases 17.3.6, 16.3.6, and
16.4.1.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Does not support greenfield deployments on deprecated Cluster releases
of the 17.3.x and 16.3.x series. Use the latest available Cluster releases
of the series instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.29.0.
To allow the operator to use the GitOps approach, implemented the
BareMetalHostInventory resource that must be used instead of
BareMetalHost for adding and modifying configuration of bare metal servers.
The BareMetalHostInventory resource monitors and manages the state of a
bare metal server and is created for each Machine with all information
about machine hardware configuration.
Each BareMetalHostInventory object is synchronized with an automatically
created BareMetalHost object, which is now used for internal purposes of
the Container Cloud private API.
Caution
Any change in the BareMetalHost object will be overwritten by
BareMetalHostInventory.
For any existing BareMetalHost object, a BareMetalHostInventory object
is created automatically during cluster update.
Caution
While the Cluster release of the management cluster is 16.4.0,
BareMetalHostInventory operations are allowed to
m:kaas@management-admin only. Once the management cluster is updated
to the Cluster release 16.4.1 (or later), this limitation will be lifted.
Validation of the Subnet object changes against allocated IP addresses¶
Implemented a validation of the Subnet object changes against already
allocated IP addresses. This validation is performed by the Admission
Controller. The controller now blocks changes in the Subnet object
containing allocated IP addresses that are out of the allocatable IP address
space, which is formed by a CIDR address and include/exclude address ranges.
Improvements in calculation of update estimates using ClusterUpdatePlan¶
Improved calculation of update estimates for a managed cluster whose update is managed
by the ClusterUpdatePlan object. Each step of ClusterUpdatePlan now has
more precise estimates that are based on the following calculations:
The amount and type of components updated between releases during patch
updates
The amount of nodes with particular roles in the OpenStack cluster
The number of nodes and storage used in the Ceph cluster
Also, the ClusterUpdatePlan object now contains the releaseNotes field
that links to MOSK release notes of the target release.
Switch of the default container runtime from Docker to containerd¶
Switched the default container runtime from Docker to containerd on greenfield
management and managed clusters. The use of containerd allows for better
Kubernetes performance and component update without pod restart when applying
fixes for CVEs.
On existing clusters, perform the mandatory migration from Docker to containerd
in the scope of Container Cloud 2.29.x. Otherwise, the management cluster
update to Container Cloud 2.30.0 will be blocked.
Important
Container runtime migration involves machine cordoning and
draining.
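After the migration, a generic, non-product-specific way to confirm which
container runtime each node reports is to check the standard node status field:
kubectl get nodes -o custom-columns=NAME:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion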
The following issues have been addressed in the Mirantis Container Cloud
release 2.29.0 along with the Cluster releases 17.4.0 and
16.4.0. For the list of MOSK addressed
issues, see MOSK release notes 25.1: Addressed issues.
Note
This section provides descriptions of issues addressed since
the last Container Cloud patch release 2.28.5.
For details on addressed issues in earlier patch releases since 2.28.0,
which are also included into the major release 2.29.0, refer to
2.28.x patch releases.
[47263] [StackLight] Fixed the issue with configuration inconsistencies
for requests and limits between the deprecated
resourcesPerClusterSize and resources parameters.
[44193] [StackLight] Fixed the issue with OpenSearch reaching the 85%
disk usage watermark on High Availability clusters that use Local Volume
Provisioner, which caused the OpenSearch cluster state to switch to
Warning or Critical.
[46858] [Container Cloud web UI] Fixed the issue that prevented the
drop-down menu from displaying the full list of allowed node labels.
[39437] [LCM] Fixed the issue that caused failure to replace a master
node and the Kubelet's NodeReady condition is Unknown message in the
machine status on the remaining master nodes.
This section lists known issues with workarounds for the Mirantis Container
Cloud release 2.29.0 including the Cluster releases 17.4.0
and 16.4.0. For the known issues in the related
MOSK release, see MOSK release notes 25.1: Known
issues.
A bare metal host containing a Redfish Baseboard Management
Controller address with the following exemplary configuration may get stuck
during the registering phase:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains with no external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
Ceph¶[50637] Ceph creates second miracephnodedisable object during node disabling¶
During managed cluster update, if some node is being disabled and at the same
time ceph-maintenance-controller is restarted, a second
miracephnodedisable object is erroneously created for the node. As a
result, the second object fails in the Cleaning state, which blocks
managed cluster update.
Workaround
On the affected managed cluster, obtain the list of miracephnodedisable
objects:
kubectl get miracephnodedisable -n ceph-lcm-mirantis
The system response must contain one completed and one failed
miracephnodedisable object for the node being disabled. For example:
[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue
66717,
during CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slow and even fail the starting probe with the following describe output in
the rook-ceph-osd-X pod:
Complete the following steps during every patch or major cluster update of the
Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes
supported):
Plan extra time in the maintenance window for the patch cluster update.
Slow starts will still impact the update procedure, but after completing the
following step, the recovery process noticeably shortens without affecting
the overall cluster state and data responsiveness.
Select one of the following options:
Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSDs image upgrade is done, unset the flag:
ceph osd unset noout
Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear,
set the noout flag as soon as possible. Once the Ceph OSDs image
upgrade is done, unset the flag.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
While editing the MCCUpgrade object, the following error occurs when trying
to save changes:
HTTP response body:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",
"message":"Internal error occurred: failed calling webhook \"mccupgrades.kaas.mirantis.com\": failed to call webhook: the server could not find the requested resource",
"reason":"InternalError",
"details":{"causes":[{"message":"failed calling webhook \"mccupgrades.kaas.mirantis.com\": failed to call webhook: the server could not find the requested resource"}]},"code":500}
To work around the issue, remove the
name: mccupgrades.kaas.mirantis.com entry from
mutatingwebhookconfiguration:
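For example, a sketch of this cleanup; the name of the
MutatingWebhookConfiguration object that contains the entry is
deployment-specific, so locate it first and then remove the offending webhook
entry in the editor:
kubectl get mutatingwebhookconfigurations -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.webhooks[*].name}{"\n"}{end}' | grep mccupgrades
kubectl edit mutatingwebhookconfiguration <configName>
In the editor, delete the list item under webhooks whose name is
mccupgrades.kaas.mirantis.com.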
[50561] The local-volume-provisioner pod switches to CrashLoopBackOff¶
After machine disablement and consequent re-enablement, persistent volumes
(PVs) provisioned by local-volume-provisioner that are not used by any pod
may cause the local-volume-provisioner pod on such machine to switch to the
CrashLoopBackOff state.
Workaround:
Identify the ID of the affected local-volume-provisioner:
On management clusters with sf-notifier enabled, the
PrometheusTargetScrapesDuplicate alert is permanently firing while
sf-notifier runs with no errors.
You can safely disregard the issue because it does not affect cluster health.
To work around the issue, manually adjust the affected dashboards to
restore their custom appearance.
Container Cloud web UI¶[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines along with
the inability to change dedicatedControlPlane: false using the web UI.
To work around the issue, manually add the required labels using CLI. Once
done, the cluster deployment resumes.
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and contains various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser in five minutes after the
project creation.
[50140] The Ceph Clusters tab does not display Ceph cluster details¶
The Clusters page for the bare metal provider does not display
information about the Ceph cluster in the Ceph Clusters tab and
contains access denied errors.
The following table lists major components and their versions delivered in
Container Cloud 2.29.0. The components that are newly added, updated,
deprecated, or removed as compared to 2.28.0, are marked with a corresponding
superscript, for example, admission-controllerUpdated.
This section lists the artifacts of components included in the Container Cloud
release 2.29.0. The components that are newly added, updated,
deprecated, or removed as compared to 2.28.0, are marked with a corresponding
superscript, for example, admission-controllerUpdated.
In total, since Container Cloud 2.28.5, in 2.29.0, 736
Common Vulnerabilities and Exposures (CVE) have been fixed:
125 of critical and 611 of high severity.
The table below includes the total numbers of addressed unique and common
vulnerabilities and exposures (CVE) by product component since the 2.28.5
patch release. The common CVEs are issues addressed across several images.
This section describes the specific actions you as a cloud operator need to
complete before or after your Container Cloud cluster update to the Cluster
releases 17.4.0 or 16.4.0. For details on update impact and maintenance
window planning, see MOSK Update notes.
Pre-update actions¶Update managed clusters to Ubuntu 22.04¶
In Container Cloud 2.29.0, the Cluster release update of the Ubuntu 20.04-based
managed clusters becomes impossible, and Ubuntu 22.04 becomes the only
supported version of the operating system. Therefore, ensure that every node of
your managed clusters is running Ubuntu 22.04 to unblock managed cluster
update in Container Cloud 2.29.0.
Management cluster update to Container Cloud 2.29.1 will be blocked
if at least one node of any related managed cluster is running Ubuntu 20.04.
Note
Existing management clusters were automatically updated to Ubuntu
22.04 during cluster upgrade to the Cluster release 16.2.0 in Container
Cloud 2.27.0. Greenfield deployments of management clusters are also based
on Ubuntu 22.04.
Back up custom Grafana dashboards on managed clusters¶
In Container Cloud 2.29.0, Grafana is updated to version 11 where the following
deprecated Angular-based plugins are automatically migrated to the React-based
ones:
Graph (old) -> Time Series
Singlestat -> Stat
Stat (old) -> Stat
Table (old) -> Table
Worldmap -> Geomap
This migration may corrupt custom Grafana dashboards that have Angular-based
panels. Therefore, if you have such dashboards on managed clusters, back them
up and manually upgrade Angular-based panels before updating to the Cluster
release 17.4.0 to prevent custom appearance issues after plugin migration.
For management clusters that are updated automatically, it is
important to remove all Angular-based panels and prepare the backup of
custom Grafana dashboards before Container Cloud 2.29.0 is released. For
details, see Post update notes in 2.28.5 release notes. Otherwise, custom dashboards using Angular-based
plugins may be corrupted and must be manually restored without a backup.
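For example, one generic way to back up dashboards is through the Grafana HTTP
API. This is a sketch only, not a product-specific procedure; <grafanaHost>,
<apiToken>, and <dashboardUid> are placeholders:
# List dashboards to obtain their UIDs
curl -s -H "Authorization: Bearer <apiToken>" "https://<grafanaHost>/api/search?type=dash-db"
# Export a single dashboard as JSON
curl -s -H "Authorization: Bearer <apiToken>" "https://<grafanaHost>/api/dashboards/uid/<dashboardUid>" > dashboard-<dashboardUid>.json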
Post-update actions¶Start using BareMetalHostInventory instead of BareMetalHost¶
Container Cloud 2.29.0 introduces the BareMetalHostInventory resource that
must be used instead of BareMetalHost for adding and modifying
configuration of bare metal servers. Therefore, if you need to modify an
existing or create a new configuration of a bare metal host, use
BareMetalHostInventory.
Each BareMetalHostInventory object is synchronized with an automatically
created BareMetalHost object, which is now used for internal purposes of
the Container Cloud private API.
Caution
Any change in the BareMetalHost object will be overwritten by
BareMetalHostInventory.
For any existing BareMetalHost object, a BareMetalHostInventory object
is created automatically during cluster update.
The rules are applied automatically to all cluster nodes during cluster
update. Therefore, if you use custom Linux accounts protected by passwords, do
not plan any critical maintenance activities right after cluster upgrade as you
may need to update Linux user passwords.
Note
By default, during cluster creation, mcc-user is created without
a password with an option to add an SSH key.
Migrate container runtime from Docker to containerd¶
Container Cloud 2.29.0 introduces switching of the default container runtime
from Docker to containerd on greenfield management and managed clusters.
On existing clusters, perform the mandatory migration from Docker to containerd
in the scope of Container Cloud 2.29.x. Otherwise, the management cluster
update to Container Cloud 2.30.0 will be blocked.
Important
Container runtime migration involves machine cordoning and
draining.
Note
If you have not upgraded the operating system distribution on your
machines to Jammy yet, Mirantis recommends migrating machines from Docker
to containerd on managed clusters together with distribution upgrade to
minimize the maintenance window.
In this case, ensure that all cluster machines are updated at once during
the same maintenance window to prevent machines from running different
container runtimes.
Support for the patch Cluster release 17.3.5
that represents Mirantis OpenStack for Kubernetes (MOSK) patch release
24.3.2.
Support for Mirantis Kubernetes Engine 3.7.18 and Mirantis Container Runtime
23.0.15, which includes containerd 1.6.36.
Optional migration of container runtime from Docker to containerd.
Bare metal: update of Ubuntu mirror from ubuntu-2024-12-05-003900 to
ubuntu-2025-01-08-003900 along with update of minor kernel version from
5.15.0-126-generic to 5.15.0-130-generic.
Security fixes for CVEs in images.
This patch release also supports the latest major Cluster releases
17.3.0 and 16.3.0. And it does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available Cluster release
instead.
For main deliverables of the parent Container Cloud release of 2.28.5, refer
to 2.28.0.
This section describes the specific actions you as a cloud operator need to
complete before or after your Container Cloud cluster update to the Cluster
releases 17.3.5 or 16.3.5.
Post-update actions¶Optional migration of container runtime from Docker to containerd¶
Since Container Cloud 2.28.4, Mirantis introduced an optional migration of
container runtime from Docker to containerd, which is implemented for existing
management and managed bare metal clusters. The use of containerd allows for
better Kubernetes performance and component update without pod restart when
applying fixes for CVEs. For the migration procedure, refer to
MOSK Operations Guide: Migrate container runtime from Docker
to containerd.
Note
Container runtime migration becomes mandatory in the scope of
Container Cloud 2.29.x. Otherwise, the management cluster update to
Container Cloud 2.30.0 will be blocked.
Note
In the Container Cloud 2.28.x series, the default container runtime
remains Docker for greenfield deployments. Support for greenfield
deployments based on containerd will be announced in one of the following
releases.
Important
Container runtime migration involves machine cordoning and
draining.
Note
If you have not upgraded the operating system distribution on your
machines to Jammy yet, Mirantis recommends migrating machines from Docker
to containerd on managed clusters together with distribution upgrade to
minimize the maintenance window.
In this case, ensure that all cluster machines are updated at once during
the same maintenance window to prevent machines from running different
container runtimes.
In Container Cloud 2.29.0, Grafana will be updated to version 11 where
the following deprecated Angular-based plugins will be automatically migrated
to the React-based ones:
Graph (old) -> Time Series
Singlestat -> Stat
Stat (old) -> Stat
Table (old) -> Table
Worldmap -> Geomap
This migration may corrupt custom Grafana dashboards that have Angular-based
panels. Therefore, if you have such dashboards, back them up and manually
upgrade Angular-based panels during the course of Container Cloud 2.28.x
(Cluster releases 17.3.x and 16.3.x) to prevent custom appearance issues after
plugin migration in Container Cloud 2.29.0 (Cluster releases 17.4.0 and
16.4.0).
For management clusters that are updated automatically, it is
important to prepare the backup before Container Cloud 2.29.0 is released.
Otherwise, custom dashboards using Angular-based plugins may be corrupted.
For managed clusters, you can perform the backup after the Container Cloud
2.29.0 release date but before updating them to the Cluster release 17.4.0.
In total, since Container Cloud 2.28.4, 1 Common Vulnerability and Exposure
(CVE) of high severity has been fixed in 2.28.5.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since Container Cloud 2.28.4.
The common CVEs are issues addressed across several images.
This section lists known issues with workarounds for the Mirantis Container
Cloud release 2.28.5 including the Cluster releases 16.3.5 and
17.3.5. For the known issues in the related MOSK
release, see MOSK release notes 24.3.2: Known issues.
If the dnsmasq pod is restarted during the bootstrap of newly added
nodes, those nodes may fail to undergo inspection. That can result in
inspection error in the corresponding BareMetalHost objects.
The issue can occur when:
The dnsmasq pod was moved to another node.
DHCP subnets were changed, including addition or removal. In this case, the
dhcpd container of the dnsmasq pod is restarted.
Caution
If changing or adding DHCP subnets is required to bootstrap
new nodes, wait until the dnsmasq pod becomes ready after the change,
and only then create BareMetalHost objects.
To verify whether the nodes are affected:
Verify whether the BareMetalHost objects contain the
inspection error:
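For example, assuming the standard Metal3 BareMetalHost status fields and the
<managedClusterNamespace> placeholder, a sketch of this check is:
kubectl -n <managedClusterNamespace> get bmh -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.errorType}{" "}{.status.errorMessage}{"\n"}{end}'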
Verify whether the dnsmasq pod was in Ready state when the
inspection of the affected bare metal hosts (test-worker-3 in the example
above) was started:
In the system response above, inspection was started at
"2024-10-11T07:38:19Z", immediately before the period of the dhcpd
container downtime. Therefore, this node is most likely affected by the
issue.
Workaround
Reboot the node using the IPMI reset or cycle
command.
If the node fails to boot, remove the failed BareMetalHost object and
create it again:
Remove BareMetalHost object. For example:
kubectl delete bmh -n managed-ns test-worker-3
Verify that the BareMetalHost object is removed:
kubectl get bmh -n managed-ns test-worker-3
Create a BareMetalHost object from the template. For example:
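Use the exact template shipped with your deployment; the following is only a
generic Metal3-style sketch with placeholder values that illustrates the shape
of the object:
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: test-worker-3
  namespace: managed-ns
spec:
  online: true
  bootMACAddress: <macAddress>
  bmc:
    address: <bmcAddress>
    credentialsName: <bmcCredentialsSecretName>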
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains with no external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
Ceph¶[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue
66717,
during CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slow and even fail the starting probe with the following describe output in
the rook-ceph-osd-X pod:
Complete the following steps during every patch or major cluster update of the
Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes
supported):
Plan extra time in the maintenance window for the patch cluster update.
Slow starts will still impact the update procedure, but after completing the
following step, the recovery process noticeably shortens without affecting
the overall cluster state and data responsiveness.
Select one of the following options:
Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSDs image upgrade is done, unset the flag:
ceph osd unset noout
Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear,
set the noout flag as soon as possible. Once the Ceph OSDs image
upgrade is done, unset the flag.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
On High Availability (HA) clusters that use Local Volume Provisioner (LVP),
Prometheus and OpenSearch from StackLight may share the same pool of storage.
In such configuration, OpenSearch may approach the 85% disk usage watermark
due to the combined storage allocation and usage patterns set by the Persistent
Volume Claim (PVC) size parameters for Prometheus and OpenSearch, which consume
storage the most.
When the 85% threshold is reached, the affected node is transitioned to the
read-only state, preventing shard allocation and causing the OpenSearch cluster
state to transition to Warning (Yellow) or Critical (Red).
Caution
The issue and the provided workaround apply only for clusters on
which OpenSearch and Prometheus utilize the same storage pool.
Derived from .values.elasticsearch.persistentVolumeUsableStorageSizeGB,
defaulting to .values.elasticsearch.persistentVolumeClaimSize if
unspecified. To obtain the OpenSearch PVC size:
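For example, assuming StackLight runs in the stacklight namespace (verify for
your deployment), a sketch of this check is:
kubectl -n stacklight get pvc | grep opensearch-master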
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
If the formula result is positive, it is an early indication that the
cluster is affected.
Verify whether the OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical alert is firing. And if so,
verify the following:
Log in to the OpenSearch web UI.
In Management -> Dev Tools, run the following command:
GET _cluster/allocation/explain
The following system response indicates that the corresponding node is
affected:
"explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [xx.xxx%]"
Note
The system response may contain even higher watermark percent
than 85.0%, depending on the case.
Workaround:
Warning
The workaround implies adjustment of the retention threshold for
OpenSearch. Depending on the new threshold, some old logs will be
deleted.
Reserved_Percentage
A user-defined variable that specifies what percentage of the total storage
capacity should not be used by OpenSearch or Prometheus. This is used to
reserve space for other components. It should be expressed as a decimal.
For example, for 5% of reservation, Reserved_Percentage is 0.05.
Mirantis recommends using 0.05 as a starting point.
Filesystem_Reserve
Percentage to deduct for filesystems that may reserve some portion of the
available storage, which is marked as occupied. For example, for EXT4, it
is 5% by default, so the value must be 0.05.
Prometheus_PVC_Size_GB
Sourced from .values.prometheusServer.persistentVolumeClaimSize.
Total_Storage_Capacity_GB
Total capacity of the OpenSearch PVCs. For LVP, the capacity of the
storage pool. To obtain the total capacity:
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
Calculation of the above formula provides the maximum safe storage to allocate
for .values.elasticsearch.persistentVolumeUsableStorageSizeGB. Use this
formula as a reference for setting
.values.elasticsearch.persistentVolumeUsableStorageSizeGB on a cluster.
Wait up to 15-20 minutes for OpenSearch to perform the cleaning.
Verify that the cluster is not affected anymore using the procedure above.
Container Cloud web UI¶[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines along with
the inability to change dedicatedControlPlane: false using the web UI.
To work around the issue, manually add the required labels using CLI. Once
done, the cluster deployment resumes.
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and contains various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser in five minutes after the
project creation.
This section lists the artifacts of components included in the Container Cloud
patch release 2.28.5. For artifacts of the Cluster releases introduced in
2.28.5, see patch Cluster releases 17.3.5 and 16.3.5.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Support for the patch Cluster release 17.3.4
that represents Mirantis OpenStack for Kubernetes (MOSK) patch release
24.3.1.
Support for Mirantis Kubernetes Engine 3.7.17 and Mirantis Container
Runtime 23.0.15, which includes containerd 1.6.36.
Optional migration of container runtime from Docker to containerd.
Bare metal: update of Ubuntu mirror from ubuntu-2024-11-18-003900 to
ubuntu-2024-12-05-003900 along with update of minor kernel version from
5.15.0-125-generic to 5.15.0-126-generic.
Security fixes for CVEs in images.
OpenStack provider: suspension of support for cluster deployment and update.
For details, see Deprecation notes.
This patch release also supports the latest major Cluster releases
17.3.0 and 16.3.0. And it does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available Cluster release
instead.
For main deliverables of the parent Container Cloud release of 2.28.4, refer
to 2.28.0.
This section describes the specific actions you as a cloud operator need to
complete before or after your Container Cloud cluster update to the Cluster
releases 17.3.4 or 16.3.4.
Important
For MOSK deployments, although
MOSK 24.3.1 is classified as a patch release, as a
cloud operator, you will be performing a major update regardless of the
upgrade path: whether you are upgrading from patch 24.2.5 or major version
24.3. For details, see MOSK 24.3.1 release notes: Update
notes.
Post-update actions¶Optional migration of container runtime from Docker to containerd¶
Container Cloud 2.28.4 introduces an optional migration of container runtime
from Docker to containerd, which is implemented for existing management and
managed bare metal clusters. The use of containerd allows for better Kubernetes
performance and component update without pod restart when applying fixes for
CVEs. For the migration procedure, refer to MOSK Operations
Guide: Migrate container runtime from Docker to containerd.
Note
Container runtime migration becomes mandatory in the scope of
Container Cloud 2.29.x. Otherwise, the management cluster update to
Container Cloud 2.30.0 will be blocked.
Note
In the Container Cloud 2.28.x series, the default container runtime
remains Docker for greenfield deployments. Support for greenfield
deployments based on containerd will be announced in one of the following
releases.
Important
Container runtime migration involves machine cordoning and
draining.
Note
If you have not upgraded the operating system distribution on your
machines to Jammy yet, Mirantis recommends migrating machines from Docker
to containerd on managed clusters together with distribution upgrade to
minimize the maintenance window.
In this case, ensure that all cluster machines are updated at once during
the same maintenance window to prevent machines from running different
container runtimes.
In Container Cloud 2.29.0, Grafana will be updated to version 11 where
the following deprecated Angular-based plugins will be automatically migrated
to the React-based ones:
Graph (old) -> Time Series
Singlestat -> Stat
Stat (old) -> Stat
Table (old) -> Table
Worldmap -> Geomap
This migration may corrupt custom Grafana dashboards that have Angular-based
panels. Therefore, if you have such dashboards, back them up and manually
upgrade Angular-based panels during the course of Container Cloud 2.28.x
(Cluster releases 17.3.x and 16.3.x) to prevent custom appearance issues after
plugin migration in Container Cloud 2.29.0 (Cluster releases 17.4.0 and
16.4.0).
For management clusters that are updated automatically, it is
important to prepare the backup before Container Cloud 2.29.0 is released.
Otherwise, custom dashboards using Angular-based plugins may be corrupted.
For managed clusters, you can perform the backup after the Container Cloud
2.29.0 release date but before updating them to the Cluster release 17.4.0.
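For example, you can export a custom dashboard to a JSON file through the
Grafana HTTP API before the update. The host name, API token, and dashboard
UID below are placeholders to adjust for your deployment:
curl -s -H "Authorization: Bearer <api-token>" https://<grafana-host>/api/dashboards/uid/<dashboard-uid> > <dashboard-name>-backup.json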
In total, since Container Cloud 2.28.3, 158 Common Vulnerabilities and
Exposures (CVE) have been fixed in 2.28.4: 10 of critical and 148 of
high severity.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since Container Cloud 2.28.3.
The common CVEs are issues addressed across several images.
The following issues have been addressed in the Container Cloud patch release
2.28.4 along with the patch Cluster releases 16.3.4 and
17.3.4:
[30294] [LCM] Fixed the issue that prevented replacement of a manager
machine due to the calico-node Pod failing to start on a new node that has
the same IP address as the node being replaced.
[5782] [LCM] Fixed the issue that prevented deployment of a manager
machine during node replacement.
[5568] [LCM] Fixed the issue that prevented cleaning of resources by the
calico-kube-controllers Pod during unsafe or forced deletion of a
manager machine.
This section lists known issues with workarounds for the Mirantis Container
Cloud release 2.28.4 including the Cluster releases 16.3.4
and 17.3.4. For the known issues in the related MOSK
release, see MOSK release notes 24.3.1: Known issues.
If the dnsmasq pod is restarted during the bootstrap of newly added
nodes, those nodes may fail to undergo inspection. That can result in
an inspection error in the corresponding BareMetalHost objects.
The issue can occur when:
The dnsmasq pod was moved to another node.
DHCP subnets were changed, including addition or removal. In this case, the
dhcpd container of the dnsmasq pod is restarted.
Caution
If changing or adding DHCP subnets is required to bootstrap
new nodes, wait until the dnsmasq pod becomes ready after the change,
and only then create BareMetalHost objects.
To verify whether the nodes are affected:
Verify whether the BareMetalHost objects contain an
inspection error:
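For example, the following commands display the error type and message of
BareMetalHost objects; the managed-ns namespace and the test-worker-3 host
name are illustrative and may differ in your environment:
kubectl get bmh -n managed-ns -o custom-columns=NAME:.metadata.name,STATE:.status.provisioning.state,ERROR:.status.errorType
kubectl get bmh -n managed-ns test-worker-3 -o jsonpath='{.status.errorType}{" "}{.status.errorMessage}'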
Verify whether the dnsmasq pod was in the Ready state when the
inspection of the affected bare metal hosts (test-worker-3 in the example
above) was started:
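For example, you can compare the inspection start time from the
BareMetalHost status with the last restart of the dhcpd container; the kaas
namespace and the app=dnsmasq label are assumptions that may differ in your
environment:
kubectl get bmh -n managed-ns test-worker-3 -o jsonpath='{.status.operationHistory.inspect.start}'
kubectl get pod -n kaas -l app=dnsmasq -o jsonpath='{.items[0].status.containerStatuses[?(@.name=="dhcpd")].lastState.terminated}'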
In the system response above, inspection was started at
"2024-10-11T07:38:19Z", immediately before the period of the dhcpd
container downtime. Therefore, this node is most likely affected by the
issue.
Workaround
Reboot the node using the IPMI reset or cycle
command.
If the node fails to boot, remove the failed BareMetalHost object and
create it again:
Remove the BareMetalHost object. For example:
kubectl delete bmh -n managed-ns test-worker-3
Verify that the BareMetalHost object is removed:
kubectl get bmh -n managed-ns test-worker-3
Create a BareMetalHost object from the template. For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains with no external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
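For example, adding or updating an arbitrary annotation on the affected
Service object is enough to trigger reconciliation; the annotation key below
is illustrative:
kubectl annotate service <serviceName> -n <namespace> dummy-change="$(date +%s)" --overwrite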
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
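For example, assuming common drain options that may require adjustment for
your workloads:
kubectl cordon <nodeName>
kubectl drain <nodeName> --ignore-daemonsets --delete-emptydir-data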
Ceph¶[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue
66717,
during CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slow and even fail the starting probe with the following describe output in
the rook-ceph-osd-X pod:
Complete the following steps during every patch or major cluster update of the
Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes
supported):
Plan extra time in the maintenance window for the patch cluster update.
Slow starts will still impact the update procedure, but after completing the
following step, the recovery process noticeably shortens without affecting
the overall cluster state and data responsiveness.
Select one of the following options:
Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSDs image upgrade is done, unset the flag:
ceph osd unset noout
Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear,
set the noout flag as soon as possible. Once the Ceph OSDs image
upgrade is done, unset the flag.
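For example, you can watch the rook-ceph-osd pods being recreated with the
new image and check the overall Ceph health; the rook-ceph namespace and the
rook-ceph-tools deployment are typical Rook defaults that may differ in your
environment:
kubectl -n rook-ceph get pods -l app=rook-ceph-osd -w
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s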
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
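The exact command may differ in your environment; a typical form is:
kubectl describe pod <affectedPodName> -n <affectedProjectName>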
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
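For example, you can list the csi-rbdplugin pods scheduled on the affected
node; the rook-ceph namespace and the app=csi-rbdplugin label are typical
defaults that may differ in your environment:
kubectl -n rook-ceph get pods -l app=csi-rbdplugin -o wide --field-selector spec.nodeName=<nodeName>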
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
StackLight¶[44193] OpenSearch reaches 85% disk usage watermark affecting the cluster state¶
On High Availability (HA) clusters that use Local Volume Provisioner (LVP),
Prometheus and OpenSearch from StackLight may share the same pool of storage.
In such configuration, OpenSearch may approach the 85% disk usage watermark
due to the combined storage allocation and usage patterns set by the Persistent
Volume Claim (PVC) size parameters for Prometheus and OpenSearch, which consume
storage the most.
When the 85% threshold is reached, the affected node is transitioned to the
read-only state, preventing shard allocation and causing the OpenSearch cluster
state to transition to Warning (Yellow) or Critical (Red).
Caution
The issue and the provided workaround apply only for clusters on
which OpenSearch and Prometheus utilize the same storage pool.
Derived from .values.elasticsearch.persistentVolumeUsableStorageSizeGB,
defaulting to .values.elasticsearch.persistentVolumeClaimSize if
unspecified. To obtain the OpenSearch PVC size:
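For example, assuming the default stacklight namespace and the
opensearch-master naming:
kubectl -n stacklight get pvc -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.storage | grep opensearch-master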
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
If the formula result is positive, it is an early indication that the
cluster is affected.
Verify whether the OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical alert is firing. If so,
verify the following:
Log in to the OpenSearch web UI.
In Management -> Dev Tools, run the following command:
GET _cluster/allocation/explain
The following system response indicates that the corresponding node is
affected:
"explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [xx.xxx%]"
Note
The system response may contain an even higher watermark percentage
than 85.0%, depending on the case.
Workaround:
Warning
The workaround implies adjustment of the retention threshold for
OpenSearch. Depending on the new threshold, some old logs will be
deleted.
A user-defined variable that specifies what percentage of the total storage
capacity should not be used by OpenSearch or Prometheus. This is used to
reserve space for other components. It should be expressed as a decimal.
For example, for 5% of reservation, Reserved_Percentage is 0.05.
Mirantis recommends using 0.05 as a starting point.
Filesystem_Reserve
Percentage to deduct for filesystems that may reserve some portion of the
available storage, which is marked as occupied. For example, for EXT4, it
is 5% by default, so the value must be 0.05.
Prometheus_PVC_Size_GB
Sourced from .values.prometheusServer.persistentVolumeClaimSize.
Total_Storage_Capacity_GB
Total capacity of the OpenSearch PVCs. For LVP, the capacity of the
storage pool. To obtain the total capacity:
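For example, for LVP you can check the capacity of the PersistentVolumes
that back the OpenSearch PVCs; the opensearch-master naming is a typical
default that may differ in your environment:
kubectl get pv -o custom-columns=NAME:.metadata.name,CAPACITY:.spec.capacity.storage,CLAIM:.spec.claimRef.name | grep opensearch-master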
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
Calculation of the above formula provides the maximum safe storage to allocate
for .values.elasticsearch.persistentVolumeUsableStorageSizeGB. Use this
formula as a reference for setting
.values.elasticsearch.persistentVolumeUsableStorageSizeGB on a cluster.
Wait up to 15-20 minutes for OpenSearch to perform the cleaning.
Verify that the cluster is not affected anymore using the procedure above.
Container Cloud web UI¶[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines along with
the inability to change dedicatedControlPlane: false using the web UI.
To work around the issue, manually add the required labels using the CLI. Once
done, the cluster deployment resumes.
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and contains various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser five minutes after the
project creation.
This section lists the artifacts of components included in the Container Cloud
patch release 2.28.4. For artifacts of the Cluster releases introduced in
2.28.4, see patch Cluster releases 17.3.4 and 16.3.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section contains historical information on the unsupported Container
Cloud releases delivered in 2024. For the latest supported Container
Cloud release, see Container Cloud releases.
Container Cloud 2.27.2 is the second patch release of the 2.27.x
release series that introduces the following updates:
Support for the patch Cluster release 16.2.2.
Support for the patch Cluster releases 16.1.7 and 17.1.7 that
represents MOSK patch release
24.1.7.
Support for MKE 3.7.11.
Bare metal: update of Ubuntu mirror to ubuntu-2024-07-16-014744 along with
update of the minor kernel version to 5.15.0-116-generic (Cluster release 16.2.2).
Support for the patch Cluster releases 16.2.7 and 17.2.7
that represents Mirantis OpenStack for Kubernetes (MOSK) patch release
24.2.5.
Bare metal: update of Ubuntu mirror from ubuntu-2024-10-28-012906 to
ubuntu-2024-11-18-003900 along with update of minor kernel version from
5.15.0-124-generic to 5.15.0-125-generic.
Security fixes for CVEs in images.
This patch release also supports the latest major Cluster releases
17.3.0 and 16.3.0. It does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available
Cluster release instead.
For main deliverables of the parent Container Cloud release of 2.28.3, refer
to 2.28.0.
In total, since Container Cloud 2.28.2, 66 Common Vulnerabilities and
Exposures (CVE) have been fixed in 2.28.3: 4 of critical and 62 of
high severity.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since Container Cloud 2.28.2.
The common CVEs are issues addressed across several images.
The following issues have been addressed in the Container Cloud patch release
2.28.3 along with the patch Cluster releases 16.3.3,
16.2.7, and 17.2.7:
[47594] [StackLight] Fixed the issue with Patroni pods getting stuck in
the CrashLoopBackOff state due to the patroni container being
terminated with reason: OOMKilled.
[47929] [LCM] Fixed the issue with incorrect restrictive permissions set
for registry certificate files in /etc/docker/certs.d, which were set to
644 instead of 444.
This section lists known issues with workarounds for the Mirantis
Container Cloud release 2.28.3 including the Cluster releases 16.2.7,
16.3.3, and 17.2.7.
If the dnsmasq pod is restarted during the bootstrap of newly added
nodes, those nodes may fail to undergo inspection. That can result in
an inspection error in the corresponding BareMetalHost objects.
The issue can occur when:
The dnsmasq pod was moved to another node.
DHCP subnets were changed, including addition or removal. In this case, the
dhcpd container of the dnsmasq pod is restarted.
Caution
If changing or adding DHCP subnets is required to bootstrap
new nodes, wait until the dnsmasq pod becomes ready after the change,
and only then create BareMetalHost objects.
To verify whether the nodes are affected:
Verify whether the BareMetalHost objects contain an
inspection error:
Verify whether the dnsmasq pod was in the Ready state when the
inspection of the affected bare metal hosts (test-worker-3 in the example
above) was started:
In the system response above, inspection was started at
"2024-10-11T07:38:19Z", immediately before the period of the dhcpd
container downtime. Therefore, this node is most likely affected by the
issue.
Workaround
Reboot the node using the IPMI reset or cycle
command.
If the node fails to boot, remove the failed BareMetalHost object and
create it again:
Remove the BareMetalHost object. For example:
kubectl delete bmh -n managed-ns test-worker-3
Verify that the BareMetalHost object is removed:
kubectl get bmh -n managed-ns test-worker-3
Create a BareMetalHost object from the template. For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains with no external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
Ceph¶[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue
66717,
during CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slow and even fail the starting probe with the following describe output in
the rook-ceph-osd-X pod:
Complete the following steps during every patch or major cluster update of the
Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes
supported):
Plan extra time in the maintenance window for the patch cluster update.
Slow starts will still impact the update procedure, but after completing the
following step, the recovery process noticeably shortens without affecting
the overall cluster state and data responsiveness.
Select one of the following options:
Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSDs image upgrade is done, unset the flag:
ceph osd unset noout
Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear,
set the noout flag as soon as possible. Once the Ceph OSDs image
upgrade is done, unset the flag.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
StackLight¶[44193] OpenSearch reaches 85% disk usage watermark affecting the cluster state¶
On High Availability (HA) clusters that use Local Volume Provisioner (LVP),
Prometheus and OpenSearch from StackLight may share the same pool of storage.
In such configuration, OpenSearch may approach the 85% disk usage watermark
due to the combined storage allocation and usage patterns set by the Persistent
Volume Claim (PVC) size parameters for Prometheus and OpenSearch, which consume
storage the most.
When the 85% threshold is reached, the affected node is transitioned to the
read-only state, preventing shard allocation and causing the OpenSearch cluster
state to transition to Warning (Yellow) or Critical (Red).
Caution
The issue and the provided workaround apply only for clusters on
which OpenSearch and Prometheus utilize the same storage pool.
Derived from .values.elasticsearch.persistentVolumeUsableStorageSizeGB,
defaulting to .values.elasticsearch.persistentVolumeClaimSize if
unspecified. To obtain the OpenSearch PVC size:
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
If the formula result is positive, it is an early indication that the
cluster is affected.
Verify whether the OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical alert is firing. If so,
verify the following:
Log in to the OpenSearch web UI.
In Management -> Dev Tools, run the following command:
GET _cluster/allocation/explain
The following system response indicates that the corresponding node is
affected:
"explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [xx.xxx%]"
Note
The system response may contain an even higher watermark percentage
than 85.0%, depending on the case.
Workaround:
Warning
The workaround implies adjustment of the retention threshold for
OpenSearch. Depending on the new threshold, some old logs will be
deleted.
A user-defined variable that specifies what percentage of the total storage
capacity should not be used by OpenSearch or Prometheus. This is used to
reserve space for other components. It should be expressed as a decimal.
For example, for 5% of reservation, Reserved_Percentage is 0.05.
Mirantis recommends using 0.05 as a starting point.
Filesystem_Reserve
Percentage to deduct for filesystems that may reserve some portion of the
available storage, which is marked as occupied. For example, for EXT4, it
is 5% by default, so the value must be 0.05.
Prometheus_PVC_Size_GB
Sourced from .values.prometheusServer.persistentVolumeClaimSize.
Total_Storage_Capacity_GB
Total capacity of the OpenSearch PVCs. For LVP, the capacity of the
storage pool. To obtain the total capacity:
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
Calculation of the above formula provides the maximum safe storage to allocate
for .values.elasticsearch.persistentVolumeUsableStorageSizeGB. Use this
formula as a reference for setting
.values.elasticsearch.persistentVolumeUsableStorageSizeGB on a cluster.
Wait up to 15-20 minutes for OpenSearch to perform the cleaning.
Verify that the cluster is not affected anymore using the procedure above.
Container Cloud web UI¶[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines along with
the inability to change dedicatedControlPlane: false using the web UI.
To work around the issue, manually add the required labels using the CLI. Once
done, the cluster deployment resumes.
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and contains various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser five minutes after the
project creation.
This section lists the artifacts of components included in the Container Cloud
patch release 2.28.3. For artifacts of the Cluster releases introduced in
2.28.3, see patch Cluster releases 17.2.7, 16.3.3, and
16.2.7.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Support for the patch Cluster releases 16.2.6 and 17.2.6
that represents Mirantis OpenStack for Kubernetes (MOSK) patch release
24.2.4.
Support for MKE 3.7.16.
Bare metal: update of Ubuntu mirror from ubuntu-2024-10-14-013948 to
ubuntu-2024-10-28-012906 along with update of minor kernel version from
5.15.0-122-generic to 5.15.0-124-generic.
Security fixes for CVEs in images.
This patch release also supports the latest major Cluster releases
17.3.0 and 16.3.0. It does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available
Cluster release instead.
For main deliverables of the parent Container Cloud release of 2.28.2, refer
to 2.28.0.
In total, since Container Cloud 2.28.1, 15 Common Vulnerabilities and
Exposures (CVE) of high severity have been fixed in 2.28.2.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since Container Cloud 2.28.1.
The common CVEs are issues addressed across several images.
The following issues have been addressed in the Container Cloud patch release
2.28.2 along with the patch Cluster releases 16.3.2,
16.2.6, and 17.2.6.
[47741] [LCM] Fixed the issue with upgrade to MKE 3.7.15 getting stuck
due to the leftover ucp-upgrade-check-images service that is part of MKE
3.7.12.
[47304] [StackLight] Fixed the issue with OpenSearch not storing kubelet
logs due to the JSON-based format of ucp-kubelet logs.
This section lists known issues with workarounds for the Mirantis
Container Cloud release 2.28.2 including the Cluster releases 16.2.6,
16.3.2, and 17.2.6.
If the dnsmasq pod is restarted during the bootstrap of newly added
nodes, those nodes may fail to undergo inspection. That can result in
an inspection error in the corresponding BareMetalHost objects.
The issue can occur when:
The dnsmasq pod was moved to another node.
DHCP subnets were changed, including addition or removal. In this case, the
dhcpd container of the dnsmasq pod is restarted.
Caution
If changing or adding DHCP subnets is required to bootstrap
new nodes, wait until the dnsmasq pod becomes ready after the change,
and only then create BareMetalHost objects.
To verify whether the nodes are affected:
Verify whether the BareMetalHost objects contain an
inspection error:
Verify whether the dnsmasq pod was in the Ready state when the
inspection of the affected bare metal hosts (test-worker-3 in the example
above) was started:
In the system response above, inspection was started at
"2024-10-11T07:38:19Z", immediately before the period of the dhcpd
container downtime. Therefore, this node is most likely affected by the
issue.
Workaround
Reboot the node using the IPMI reset or cycle
command.
If the node fails to boot, remove the failed BareMetalHost object and
create it again:
Remove the BareMetalHost object. For example:
kubectl delete bmh -n managed-ns test-worker-3
Verify that the BareMetalHost object is removed:
kubectl get bmh -n managed-ns test-worker-3
Create a BareMetalHost object from the template. For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains with no external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
Ceph¶[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue
66717,
during CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slow and even fail the starting probe with the following describe output in
the rook-ceph-osd-X pod:
Complete the following steps during every patch or major cluster update of the
Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes
supported):
Plan extra time in the maintenance window for the patch cluster update.
Slow starts will still impact the update procedure, but after completing the
following step, the recovery process noticeably shortens without affecting
the overall cluster state and data responsiveness.
Select one of the following options:
Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSDs image upgrade is done, unset the flag:
ceph osd unset noout
Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear,
set the noout flag as soon as possible. Once the Ceph OSDs image
upgrade is done, unset the flag.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
StackLight¶[47594] Patroni pods may get stuck in the CrashLoopBackOff state¶
The Patroni pods may get stuck in the CrashLoopBackOff state due to the
patroni container being terminated with reason: OOMKilled that you can
see in the pod status. For example:
[44193] OpenSearch reaches 85% disk usage watermark affecting the cluster state¶
On High Availability (HA) clusters that use Local Volume Provisioner (LVP),
Prometheus and OpenSearch from StackLight may share the same pool of storage.
In such configuration, OpenSearch may approach the 85% disk usage watermark
due to the combined storage allocation and usage patterns set by the Persistent
Volume Claim (PVC) size parameters for Prometheus and OpenSearch, which consume
storage the most.
When the 85% threshold is reached, the affected node is transitioned to the
read-only state, preventing shard allocation and causing the OpenSearch cluster
state to transition to Warning (Yellow) or Critical (Red).
Caution
The issue and the provided workaround apply only for clusters on
which OpenSearch and Prometheus utilize the same storage pool.
Derived from .values.elasticsearch.persistentVolumeUsableStorageSizeGB,
defaulting to .values.elasticsearch.persistentVolumeClaimSize if
unspecified. To obtain the OpenSearch PVC size:
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
If the formula result is positive, it is an early indication that the
cluster is affected.
Verify whether the OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical alert is firing. If so,
verify the following:
Log in to the OpenSearch web UI.
In Management -> Dev Tools, run the following command:
GET _cluster/allocation/explain
The following system response indicates that the corresponding node is
affected:
"explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [xx.xxx%]"
Note
The system response may contain an even higher watermark percentage
than 85.0%, depending on the case.
Workaround:
Warning
The workaround implies adjustment of the retention threshold for
OpenSearch. Depending on the new threshold, some old logs will be
deleted.
A user-defined variable that specifies what percentage of the total storage
capacity should not be used by OpenSearch or Prometheus. This is used to
reserve space for other components. It should be expressed as a decimal.
For example, for 5% of reservation, Reserved_Percentage is 0.05.
Mirantis recommends using 0.05 as a starting point.
Filesystem_Reserve
Percentage to deduct for filesystems that may reserve some portion of the
available storage, which is marked as occupied. For example, for EXT4, it
is 5% by default, so the value must be 0.05.
Prometheus_PVC_Size_GB
Sourced from .values.prometheusServer.persistentVolumeClaimSize.
Total_Storage_Capacity_GB
Total capacity of the OpenSearch PVCs. For LVP, the capacity of the
storage pool. To obtain the total capacity:
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
Calculation of the above formula provides the maximum safe storage to allocate
for .values.elasticsearch.persistentVolumeUsableStorageSizeGB. Use this
formula as a reference for setting
.values.elasticsearch.persistentVolumeUsableStorageSizeGB on a cluster.
Wait up to 15-20 minutes for OpenSearch to perform the cleaning.
Verify that the cluster is not affected anymore using the procedure above.
Container Cloud web UI¶[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines along with
the inability to change dedicatedControlPlane: false using the web UI.
To work around the issue, manually add the required labels using the CLI. Once
done, the cluster deployment resumes.
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and contains various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser five minutes after the
project creation.
This section lists the artifacts of components included in the Container Cloud
patch release 2.28.2. For artifacts of the Cluster releases introduced in
2.28.2, see patch Cluster releases 17.2.6, 16.3.2, and
16.2.6.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Support for the patch Cluster releases 16.2.5 and 17.2.5
that represents Mirantis OpenStack for Kubernetes (MOSK) patch release
24.2.3.
Support for MKE 3.7.15.
Bare metal: update of Ubuntu mirror from 2024-09-11-014225 to
ubuntu-2024-10-14-013948 along with update of minor kernel version from
5.15.0-119-generic to 5.15.0-122-generic.
Security fixes for CVEs in images.
This patch release also supports the latest major Cluster releases
17.3.0 and 16.3.0. It does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available
Cluster release instead.
For main deliverables of the parent Container Cloud release of 2.28.1, refer
to 2.28.0.
In total, since Container Cloud 2.28.0, 400 Common Vulnerabilities and
Exposures (CVE) have been fixed in 2.28.1: 46 of critical and 354 of
high severity.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since Container Cloud 2.28.0.
The common CVEs are issues addressed across several images.
This section lists known issues with workarounds for the Mirantis
Container Cloud release 2.28.1 including the Cluster releases 16.2.5,
16.3.1, and 17.2.5.
If the dnsmasq pod is restarted during the bootstrap of newly added
nodes, those nodes may fail to undergo inspection. That can result in
an inspection error in the corresponding BareMetalHost objects.
The issue can occur when:
The dnsmasq pod was moved to another node.
DHCP subnets were changed, including addition or removal. In this case, the
dhcpd container of the dnsmasq pod is restarted.
Caution
If changing or adding DHCP subnets is required to bootstrap
new nodes, wait until the dnsmasq pod becomes ready after the change,
and only then create BareMetalHost objects.
To verify whether the nodes are affected:
Verify whether the BareMetalHost objects contain an
inspection error:
Verify whether the dnsmasq pod was in the Ready state when the
inspection of the affected bare metal hosts (test-worker-3 in the example
above) was started:
In the system response above, inspection was started at
"2024-10-11T07:38:19Z", immediately before the period of the dhcpd
container downtime. Therefore, this node is most likely affected by the
issue.
Workaround
Reboot the node using the IPMI reset or cycle
command.
If the node fails to boot, remove the failed BareMetalHost object and
create it again:
Remove the BareMetalHost object. For example:
kubectl delete bmh -n managed-ns test-worker-3
Verify that the BareMetalHost object is removed:
kubectl get bmh -n managed-ns test-worker-3
Create a BareMetalHost object from the template. For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains with no external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
Ceph¶[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue
66717,
during CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slow and even fail the starting probe with the following describe output in
the rook-ceph-osd-X pod:
Complete the following steps during every patch or major cluster update of the
Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes
supported):
Plan extra time in the maintenance window for the patch cluster update.
Slow starts will still impact the update procedure, but after completing the
following step, the recovery process noticeably shortens without affecting
the overall cluster state and data responsiveness.
Select one of the following options:
Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSDs image upgrade is done, unset the flag:
ceph osd unset noout
Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear,
set the noout flag as soon as possible. Once the Ceph OSDs image
upgrade is done, unset the flag.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
StackLight¶[47594] Patroni pods may get stuck in the CrashLoopBackOff state¶
The Patroni pods may get stuck in the CrashLoopBackOff state due to the
patroni container being terminated with reason: OOMKilled that you can
see in the pod status. For example:
Due to the JSON-based format of ucp-kubelet logs, OpenSearch does not store
kubelet logs. Mirantis is working on the issue and will deliver the resolution
in one of the nearest patch releases.
[44193] OpenSearch reaches 85% disk usage watermark affecting the cluster state¶
On High Availability (HA) clusters that use Local Volume Provisioner (LVP),
Prometheus and OpenSearch from StackLight may share the same pool of storage.
In such configuration, OpenSearch may approach the 85% disk usage watermark
due to the combined storage allocation and usage patterns set by the Persistent
Volume Claim (PVC) size parameters for Prometheus and OpenSearch, which consume
storage the most.
When the 85% threshold is reached, the affected node is transitioned to the
read-only state, preventing shard allocation and causing the OpenSearch cluster
state to transition to Warning (Yellow) or Critical (Red).
Caution
The issue and the provided workaround apply only for clusters on
which OpenSearch and Prometheus utilize the same storage pool.
Derived from .values.elasticsearch.persistentVolumeUsableStorageSizeGB,
defaulting to .values.elasticsearch.persistentVolumeClaimSize if
unspecified. To obtain the OpenSearch PVC size:
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
If the formula result is positive, it is an early indication that the
cluster is affected.
Verify whether the OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical alert is firing. If so,
verify the following:
Log in to the OpenSearch web UI.
In Management -> Dev Tools, run the following command:
GET _cluster/allocation/explain
The following system response indicates that the corresponding node is
affected:
"explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [xx.xxx%]"
Note
The system response may contain an even higher watermark percentage
than 85.0%, depending on the case.
Workaround:
Warning
The workaround implies adjustment of the retention threshold for
OpenSearch. Depending on the new threshold, some old logs will be
deleted.
A user-defined variable that specifies what percentage of the total storage
capacity should not be used by OpenSearch or Prometheus. This is used to
reserve space for other components. It should be expressed as a decimal.
For example, for 5% of reservation, Reserved_Percentage is 0.05.
Mirantis recommends using 0.05 as a starting point.
Filesystem_Reserve
Percentage to deduct for filesystems that may reserve some portion of the
available storage, which is marked as occupied. For example, for EXT4, it
is 5% by default, so the value must be 0.05.
Prometheus_PVC_Size_GB
Sourced from .values.prometheusServer.persistentVolumeClaimSize.
Total_Storage_Capacity_GB
Total capacity of the OpenSearch PVCs. For LVP, the capacity of the
storage pool. To obtain the total capacity:
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
Calculation of the above formula provides the maximum safe storage to allocate
for .values.elasticsearch.persistentVolumeUsableStorageSizeGB. Use this
formula as a reference for setting
.values.elasticsearch.persistentVolumeUsableStorageSizeGB on a cluster.
Wait up to 15-20 minutes for OpenSearch to perform the cleaning.
Verify that the cluster is not affected anymore using the procedure above.
Container Cloud web UI¶[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines along with
the inability to change dedicatedControlPlane: false using the web UI.
To work around the issue, manually add the required labels using the CLI. Once
done, the cluster deployment resumes.
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and contains various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser five minutes after the
project creation.
This section lists the artifacts of components included in the Container Cloud
patch release 2.28.1. For artifacts of the Cluster releases introduced in
2.28.1, see patch Cluster releases 17.2.5, 16.3.1, and
16.2.5.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Does not support greenfield deployments on deprecated Cluster releases
of the 17.2.x and 16.2.x series. Use the latest available Cluster releases
of the series instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.28.0.
This section outlines new features and enhancements introduced in the
Container Cloud release 2.28.0. For the list of enhancements delivered with
the Cluster releases introduced by Container Cloud 2.28.0, see
17.3.0 and 16.3.0.
General availability for Ubuntu 22.04 on MOSK clusters¶
Implemented full support for Ubuntu 22.04 LTS (Jammy Jellyfish) as the default
host operating system in MOSK clusters, including greenfield
deployments and update from Ubuntu 20.04 to 22.04 on existing clusters.
Ubuntu 20.04 is deprecated for greenfield deployments and supported during the
MOSK 24.3 release cycle only for existing clusters.
Warning
During the course of the Container Cloud 2.28.x series, Mirantis
highly recommends upgrading the operating system on all nodes of your
managed cluster machines to Ubuntu 22.04 before the next major Cluster
release becomes available.
It is not mandatory to upgrade all machines at once. You can upgrade them
one by one or in small batches, for example, if the maintenance window is
limited in time.
Otherwise, the Cluster release update of the Ubuntu 20.04-based managed
clusters will become impossible as of Container Cloud 2.29.0 with Ubuntu
22.04 as the only supported version.
Management cluster update to Container Cloud 2.29.1 will be blocked if
at least one node of any related managed cluster is running Ubuntu 20.04.
Note
Since Container Cloud 2.27.0 (Cluster release 16.2.0), existing
MOSK management clusters were automatically updated to
Ubuntu 22.04 during cluster upgrade. Greenfield deployments of management
clusters are also based on Ubuntu 22.04.
Day-2 operations for bare metal: updating modules¶
TechPreview
Implemented the capability to update custom modules using deprecation. Once
you create a new custom module, you can use it to deprecate another module
by adding the deprecates field to metadata.yaml of the new module.
The related HostOSConfiguration and HostOSConfigurationModules objects
reflect the deprecation status of new and old modules using the corresponding
fields in spec and status sections.
Also, added monitoring of deprecated modules by implementing the StackLight
metrics for the Host Operating System Modules Controller along with the
Day2ManagementControllerTargetDown and Day2ManagementDeprecatedConfigs
alerts to notify the cloud operator about detected deprecations and issues with
host-os-modules-controller.
Note
Deprecation is soft, meaning that no actual restrictions are applied
to the usage of a deprecated module.
Caution
Deprecating a version automatically deprecates all lower SemVer
versions of the specified module.
Day-2 operations for bare metal: configuration enhancements for modules¶
TechPreview
Introduced the following configuration enhancements for custom modules:
Module-specific Ansible configuration
Updated the Ansible execution mechanism for running any modules. The default
ansible.cfg file is now placed in /etc/ansible/mcc.cfg and used for
execution of lcm-ansible and day-2 modules. However, if a module has its
own ansible.cfg in the module root folder, such configuration is used
for the module execution instead of the default one.
Configuration of supported operating system distribution
Added the supportedDistributions field to the metadata section of a module
custom resource to define the list of supported operating system
distributions for the module. This field is informative and does not block
the module execution on machines running non-supported distributions, but
such execution will most probably complete with an error.
Separate flag for machines requiring reboot
Introduced a separate /run/day2/reboot-required file for day-2 modules
to add a notification about required reboot for a machine and a reason for
reboot that appear after the module execution. The feature allows for
separation of the reboot reason between LCM and day-2 operations.
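For example, a module script could signal that a reboot is required as
follows; the exact content expected in the file is not specified in these
notes and the reason text below is illustrative only:
echo "kernel parameters updated by the <module-name> module" > /run/day2/reboot-required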
Implemented the update group for controller nodes using the UpdateGroup
resource, which is automatically generated during initial cluster creation with
the following settings:
Name: <cluster-name>-control
Index: 1
Concurrent updates: 1
This feature decouples the concurrency settings from the global cluster level
and provides update flexibility.
All control plane nodes are automatically assigned to the control update
group with no possibility to change it.
Note
On existing clusters created before 2.28.0 (Cluster releases 17.2.0,
16.2.0, or earlier), the control update group is created after upgrade of
the Container Cloud release to 2.28.0 (Cluster release 16.3.0) on the
management cluster.
Implemented the rebootIfUpdateRequires parameter for the UpdateGroup
custom resource. The parameter allows for rebooting a set of controller or
worker machines added to an update group during a Cluster release update that
requires a reboot, for example, when kernel version update is available in the
target Cluster release. The feature reduces manual intervention and overall
downtime during cluster update.
Note
By default, rebootIfUpdateRequires is set to false on managed
clusters and to true on management clusters.
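As an illustration only, enabling the parameter on an update group could look
as follows; the resource short name, the project namespace, and the exact
spec path are assumptions based on the parameter name described above:
kubectl patch updategroup <update-group-name> -n <project-name> --type merge -p '{"spec":{"rebootIfUpdateRequires":true}}'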
Self-diagnostics for management and managed clusters¶
Implemented the Diagnostic Controller, a tool with a set of diagnostic
checks to perform self-diagnostics of any Container Cloud cluster and help the
operator to easily understand, troubleshoot, and resolve potential issues
against the following major subsystems: core, bare metal, Ceph, StackLight,
Tungsten Fabric, and OpenStack. The Diagnostic Controller analyzes the
configuration of the cluster subsystems and reports results of checks that
contain useful information about cluster health.
Running self-diagnostics on both management and managed clusters is essential
to ensure the overall health and optimal performance of your cluster. Mirantis
recommends running self-diagnostics before cluster update, node replacement, or
any other significant changes in the cluster to prevent potential issues and
optimize maintenance window.
Simplified the default auditd configuration by implementing the preset
groups that you can use in presetRules instead of exact names or the
virtual group all. The feature allows enabling a limited set of presets
using a single keyword (group name).
Also, optimized disk usage by removing the following Docker rule that was
removed from the Docker CIS Benchmark 1.3.0 due to producing excessive events:
# 1.2.4 Ensure auditing is configured for Docker files and directories - /var/lib/docker
-w /var/lib/docker -k docker
Enhanced the ClusterUpdatePlan object by adding a separate update step for
each UpdateGroup of worker nodes of a managed cluster. The feature allows
the operator to granularly control the update process and its impact on
workloads, with the option to pause the update after each step.
Also, added several StackLight alerts to notify the operator about the update
progress and potential update issues.
Refactoring of delayed auto-update of a management cluster¶
Refactored the MCCUpgrade object by implementing a new mechanism to delay
Container Cloud release updates. You now have the following options for
auto-update of a management cluster:
Automatically update a cluster on the publish day of a new release
(by default).
Set specific days and hours for an auto-update allowing delays of up to one
week. For example, if a release becomes available on Monday, you can delay it
until Sunday by setting Sunday as the only permitted day for auto-updates.
Delay auto-update for a minimum of 20 days for each newly discovered release.
The exact number of delay days is set in the release metadata and cannot be
changed by the user. It depends on the specifics of each release cycle and
on the optional configuration of weekdays and hours selected for update.
You can verify the exact date of a scheduled auto-update either in the
Status section of the Management Cluster Updates
page in the web UI or in the status section of the MCCUpgrade
object.
Combine auto-update delay with the specific days and hours setting
(two previous options).
Also, optimized monitoring of auto-update by implementing several StackLight
metrics for the kaas-exporter job along with the MCCUpdateBlocked and
MCCUpdateScheduled alerts to notify the cloud operator about new releases
as well as other important information about management cluster auto-update.
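For example, to verify the currently scheduled auto-update date and the
configured delay from the CLI, inspect the MCCUpgrade object on the management
cluster as described above. This is a sketch; the object may reside in a
specific namespace, which you may need to add to the command:
  kubectl --kubeconfig <management-cluster-kubeconfig> get mccupgrade -o yaml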
On top of continuous improvements delivered to the existing Container Cloud
guides, added documentation on how to run Ceph performance tests using
Kubernetes batch or cron jobs that run fio processes according to a predefined
KaaSCephOperationRequest CR.
The following issues have been addressed in the Mirantis Container Cloud
release 2.28.0 along with the Cluster releases 17.3.0 and
16.3.0.
Note
This section provides descriptions of issues addressed since
the last Container Cloud patch release 2.27.4.
For details on addressed issues in earlier patch releases since 2.27.0,
which are also included into the major release 2.28.0, refer to
2.27.x patch releases.
[41305] [Bare metal] Fixed the issue with newly added management cluster
nodes failing to undergo provisioning if the management cluster nodes were
configured with a single L2 segment used for all network traffic (PXE and
LCM/management networks).
[46245] [Bare metal] Fixed the issue with lack of permissions for
serviceuser and users with the global-admin and operator
roles to fetch
HostOSConfigurationModules and
HostOSConfiguration custom resources.
[43164] [StackLight] Fixed the issue with rollover policy not being added
to indices created without a policy.
If the dnsmasq pod is restarted during the bootstrap of newly added
nodes, those nodes may fail to undergo inspection. That can result in
inspection error in the corresponding BareMetalHost objects.
The issue can occur when:
The dnsmasq pod was moved to another node.
DHCP subnets were changed, including addition or removal. In this case, the
dhcpd container of the dnsmasq pod is restarted.
Caution
If changing or adding of DHCP subnets is required to bootstrap
new nodes, wait after changing or adding DHCP subnets until the
dnsmasq pod becomes ready, then create BareMetalHost objects.
To verify whether the nodes are affected:
Verify whether the BareMetalHost objects contain the
inspection error:
Verify whether the dnsmasq pod was in Ready state when the
inspection of the affected baremetal hosts (test-worker-3 in the example
above) was started:
In the system response above, inspection was started at
"2024-10-11T07:38:19Z", immediately before the period of the dhcpd
container downtime. Therefore, this node is most likely affected by the
issue.
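The two checks above can be performed with commands similar to the following.
The namespaces are assumptions for illustration (managed-ns for the
BareMetalHost objects and kaas for the dnsmasq pod on the management cluster):
  kubectl get bmh -n managed-ns                       # look for hosts with the inspection error
  kubectl get pod -n kaas -o wide | grep dnsmasq      # locate the dnsmasq pod
  kubectl get pod <dnsmasq-pod-name> -n kaas -o yaml  # compare container ready and restart times with the inspection start time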
Workaround
Reboot the node using the IPMI reset or cycle
command.
If the node fails to boot, remove the failed BareMetalHost object and
create it again:
Remove BareMetalHost object. For example:
kubectl delete bmh -n managed-ns test-worker-3
Verify that the BareMetalHost object is removed:
kubectl get bmh -n managed-ns test-worker-3
Create a BareMetalHost object from the template. For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains without an external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
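For example, a dummy change can be as simple as adding or modifying an
arbitrary label on the affected Service to trigger reconciliation. The label
name below is an arbitrary illustration:
  kubectl label service <serviceName> -n <namespace> dummy-update=1 --overwrite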
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
Ceph¶[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue
66717,
during CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slowly and even fail the startup probe with the following describe output in
the rook-ceph-osd-X pod:
Complete the following steps during every patch or major cluster update of the
Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes
supported):
Plan extra time in the maintenance window for the patch cluster update.
Slow starts will still impact the update procedure, but after completing the
following step, the recovery process noticeably shortens without affecting
the overall cluster state and data responsiveness.
Select one of the following options:
Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSDs image upgrade is done, unset the flag:
ceph osd unset noout
Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear,
set the noout flag as soon as possible. Once the Ceph OSDs image
upgrade is done, unset the flag.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
After upgrade of kernel to the latest supported version, old kernel
metapackages may remain on the cluster. The issue occurs if the system kernel
line is changed from LTS to HWE. This setting is controlled by the
upgrade_kernel_version parameter located in the ClusterRelease object
under the deploy StateItem. As a result, the operating system has both
LTS and HWE kernel packages installed and regularly updated, but only one
kernel image is used (loaded into memory). The unused kernel images consume
a minimal amount of disk space.
Therefore, you can safely disregard the issue because it does not affect
cluster operability. If you still require removing unused kernel metapackages,
contact Mirantis support for detailed
instructions.
[39437] Failure to replace a master node on a Container Cloud cluster¶
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
StackLight¶[47594] Patroni pods may get stuck in the CrashLoopBackOff state¶
The Patroni pods may get stuck in the CrashLoopBackOff state due to the
patroni container being terminated with reason: OOMKilled that you can
see in the pod status. For example:
Due to the JSON-based format of ucp-kubelet logs, OpenSearch does not store
kubelet logs. Mirantis is working on the issue and will deliver the resolution
in one of the upcoming patch releases.
[44193] OpenSearch reaches 85% disk usage watermark affecting the cluster state¶
On High Availability (HA) clusters that use Local Volume Provisioner (LVP),
Prometheus and OpenSearch from StackLight may share the same pool of storage.
In such a configuration, OpenSearch may approach the 85% disk usage watermark
due to the combined storage allocation and usage patterns set by the Persistent
Volume Claim (PVC) size parameters for Prometheus and OpenSearch, which consume
storage the most.
When the 85% threshold is reached, the affected node is transitioned to the
read-only state, preventing shard allocation and causing the OpenSearch cluster
state to transition to Warning (Yellow) or Critical (Red).
Caution
The issue and the provided workaround apply only for clusters on
which OpenSearch and Prometheus utilize the same storage pool.
Opensearch_PVC_Size_GB
Derived from .values.elasticsearch.persistentVolumeUsableStorageSizeGB,
defaulting to .values.elasticsearch.persistentVolumeClaimSize if
unspecified. To obtain the OpenSearch PVC size:
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
If the formula result is positive, it is an early indication that the
cluster is affected.
Verify whether the OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical alert is firing. And if so,
verify the following:
Log in to the OpenSearch web UI.
In Management -> Dev Tools, run the following command:
GET _cluster/allocation/explain
The following system response indicates that the corresponding node is
affected:
"explanation":"the node is above the low watermark cluster setting \[cluster.routing.allocation.disk.watermark.low=85%], using more disk space \than the maximum allowed [85.0%], actual free: [xx.xxx%]"
Note
The system response may contain an even higher watermark percentage
than 85.0%, depending on the case.
Workaround:
Warning
The workaround implies adjustment of the retention threshold for
OpenSearch. Depending on the new threshold, some old logs will be
deleted.
Reserved_Percentage
A user-defined variable that specifies what percentage of the total storage
capacity should not be used by OpenSearch or Prometheus. This is used to
reserve space for other components. It should be expressed as a decimal.
For example, for a 5% reservation, Reserved_Percentage is 0.05.
Mirantis recommends using 0.05 as a starting point.
Filesystem_Reserve
Percentage to deduct for filesystems that may reserve some portion of the
available storage, which is marked as occupied. For example, for EXT4, it
is 5% by default, so the value must be 0.05.
Prometheus_PVC_Size_GB
Sourced from .values.prometheusServer.persistentVolumeClaimSize.
Total_Storage_Capacity_GB
Total capacity of the OpenSearch PVCs. For LVP, the capacity of the
storage pool. To obtain the total capacity:
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
The calculation of the above formula provides the maximum safe storage to
allocate for .values.elasticsearch.persistentVolumeUsableStorageSizeGB. Use
this formula as a reference for setting
.values.elasticsearch.persistentVolumeUsableStorageSizeGB on a cluster.
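As an illustration only, the variable definitions above suggest a calculation
along the following lines; treat it as an assumption, not the official formula:
  persistentVolumeUsableStorageSizeGB =
      Total_Storage_Capacity_GB * (1 - Filesystem_Reserve - Reserved_Percentage)
      - Prometheus_PVC_Size_GB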
Wait up to 15-20 minutes for OpenSearch to perform the cleaning.
Verify that the cluster is not affected anymore using the procedure above.
Container Cloud web UI¶[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines along with
the inability to change dedicatedControlPlane: false using the web UI.
To work around the issue, manually add the required labels using the CLI. Once
done, the cluster deployment resumes.
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and contains various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser five minutes after the
project creation.
The following table lists the major components and their versions delivered in
Container Cloud 2.28.0. The components that are newly added, updated,
deprecated, or removed as compared to 2.27.0, are marked with a corresponding
superscript, for example, admission-controllerUpdated.
This section lists the artifacts of components included in the Container Cloud
release 2.28.0. The components that are newly added, updated,
deprecated, or removed as compared to 2.27.0, are marked with a corresponding
superscript, for example, admission-controllerUpdated.
In total, since Container Cloud 2.27.0, in 2.28.0, 2614
Common Vulnerabilities and Exposures (CVE) have been fixed:
299 of critical and 2315 of high severity.
The table below includes the total numbers of addressed unique and common
vulnerabilities and exposures (CVE) by product component since the 2.27.4
patch release. The common CVEs are issues addressed across several images.
This section describes the specific actions you as a cloud operator need to
complete before or after your Container Cloud cluster update to the Cluster
releases 17.3.0 or 16.3.0.
Pre-update actions¶Change label values in Ceph metrics used in customizations¶
Note
If you do not use Ceph metrics in any customizations, for example,
custom alerts, Grafana dashboards, or queries in custom workloads, skip
this section.
After deprecating the performance metric exporter that is integrated into the
Ceph Manager daemon in favor of the dedicated Ceph Exporter daemon in
Container Cloud 2.27.0, you may need to update values of several labels in Ceph
metrics if you use them in any customizations such as custom alerts, Grafana
dashboards, or queries in custom tools. These labels are changed in Container
Cloud 2.28.0 (Cluster releases 16.3.0 and 17.3.0).
Note
Names of metrics are not changed, no metrics are removed.
All Ceph metrics to be collected by the Ceph Exporter daemon changed their
labels job and instance due to scraping metrics from the new Ceph Exporter
daemon instead of the performance metric exporter of Ceph Manager:
Values of the job labels are changed from rook-ceph-mgr to
prometheus-rook-exporter for all Ceph metrics moved to Ceph
Exporter. The full list of moved metrics is presented below.
Values of the instance labels are changed from the metric endpoint
of Ceph Manager with port 9283 to the metric endpoint of Ceph Exporter
with port 9926 for all Ceph metrics moved to Ceph Exporter. The full
list of moved metrics is presented below.
Values of the instance_id labels of Ceph metrics from the RADOS
Gateway (RGW) daemons are changed from the daemon GID to the daemon
subname. For example, instead of instance_id="<RGW_PROCESS_GID>",
the instance_id="a" (ceph_rgw_qlen{instance_id="a"}) is now
used. The list of moved Ceph RGW metrics is presented below.
List of affected Ceph RGW metrics
ceph_rgw_cache_.*
ceph_rgw_failed_req
ceph_rgw_gc_retire_object
ceph_rgw_get.*
ceph_rgw_keystone_.*
ceph_rgw_lc_.*
ceph_rgw_lua_.*
ceph_rgw_pubsub_.*
ceph_rgw_put.*
ceph_rgw_qactive
ceph_rgw_qlen
ceph_rgw_req
List of all metrics to be collected by Ceph Exporter instead of
Ceph Manager
ceph_bluefs_.*
ceph_bluestore_.*
ceph_mds_cache_.*
ceph_mds_caps
ceph_mds_ceph_.*
ceph_mds_dir_.*
ceph_mds_exported_inodes
ceph_mds_forward
ceph_mds_handle_.*
ceph_mds_imported_inodes
ceph_mds_inodes.*
ceph_mds_load_cent
ceph_mds_log_.*
ceph_mds_mem_.*
ceph_mds_openino_dir_fetch
ceph_mds_process_request_cap_release
ceph_mds_reply_.*
ceph_mds_request
ceph_mds_root_.*
ceph_mds_server_.*
ceph_mds_sessions_.*
ceph_mds_slow_reply
ceph_mds_subtrees
ceph_mon_election_.*
ceph_mon_num_.*
ceph_mon_session_.*
ceph_objecter_.*
ceph_osd_numpg.*
ceph_osd_op.*
ceph_osd_recovery_.*
ceph_osd_stat_.*
ceph_paxos.*
ceph_prioritycache.*
ceph_purge.*
ceph_rgw_cache_.*
ceph_rgw_failed_req
ceph_rgw_gc_retire_object
ceph_rgw_get.*
ceph_rgw_keystone_.*
ceph_rgw_lc_.*
ceph_rgw_lua_.*
ceph_rgw_pubsub_.*
ceph_rgw_put.*
ceph_rgw_qactive
ceph_rgw_qlen
ceph_rgw_req
ceph_rocksdb_.*
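For example, a custom alert expression or Grafana query that selects one of the
moved metrics would change along the following lines. The values are taken from
the label changes described above; the exact selectors in your customizations
may differ:
  # Before: scraped from the Ceph Manager performance metric exporter
  ceph_rgw_qlen{job="rook-ceph-mgr", instance_id="<RGW_PROCESS_GID>"}
  # After: scraped from the dedicated Ceph Exporter daemon
  ceph_rgw_qlen{job="prometheus-rook-exporter", instance_id="a"}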
Post-update actions¶Manually disable collection of performance metrics by Ceph Manager (optional)¶
Since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0), Ceph cluster
metrics are collected by the dedicated Ceph Exporter daemon. At the same time,
the same metrics are still available to be collected by the Ceph Manager daemon.
To improve performance of the Ceph Manager daemon, you can manually disable
collection of performance metrics by Ceph Manager, which are collected by the
Ceph Exporter daemon.
To disable performance metrics for the Ceph Manager daemon, add the following
parameter to the KaaSCephCluster spec in the rookConfig section:
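A minimal sketch of such a change, assuming the exclude_perf_counters option of
the Ceph Manager Prometheus module is the parameter in question; verify the
exact key and location against the product documentation before applying:
  spec:
    cephClusterSpec:
      rookConfig:
        mgr/prometheus/exclude_perf_counters: "true"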
Once you add this option, Ceph performance metrics are collected by the Ceph
Exporter daemon only. For more details, see Official Ceph documentation.
Upgrade to Ubuntu 22.04 on baremetal-based clusters¶
In Container Cloud 2.29.0, the Cluster release update of the Ubuntu 20.04-based
managed clusters will become impossible, and Ubuntu 22.04 will become the only
supported version of the operating system. Therefore, ensure that every node of
your managed clusters is running Ubuntu 22.04 to unblock managed cluster
update in Container Cloud 2.29.0.
Warning
Management cluster update to Container Cloud 2.29.1 will be blocked
if at least one node of any related managed cluster is running Ubuntu 20.04.
It is not mandatory to upgrade all machines at once. You can upgrade them
one by one or in small batches, for example, if the maintenance window is
limited in time.
Note
Existing management clusters were automatically updated to Ubuntu
22.04 during cluster upgrade to the Cluster release 16.2.0 in Container
Cloud 2.27.0. Greenfield deployments of management clusters are also based
on Ubuntu 22.04.
Warning
Usage of third-party software, which is not part of
Mirantis-supported configurations, for example, the use of custom DPDK
modules, may block upgrade of an operating system distribution. Users are
fully responsible for ensuring the compatibility of such custom components
with the latest supported Ubuntu version.
For MOSK clusters, Container Cloud 2.27.4 is the
second patch release of MOSK 24.2.x series using the patch
Cluster release 17.2.4. For the update path of 24.1 and 24.2 series, see
MOSK documentation: Cluster update scheme.
The Container Cloud patch release 2.27.4, which is based on the
2.27.0 major release, provides the following updates:
Support for the patch Cluster releases 16.2.4 and 17.2.4
that represents Mirantis OpenStack for Kubernetes (MOSK) patch release
24.2.2.
Bare metal: update of Ubuntu mirror from ubuntu-2024-08-06-014502 to
ubuntu-2024-08-21-014714 along with update of the minor kernel version from
5.15.0-117-generic to 5.15.0-119-generic for Jammy and to 5.15.0-118-generic
for Focal.
Security fixes for CVEs in images.
This patch release also supports the latest major Cluster releases
17.2.0 and 16.2.0. And it does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available Cluster release
instead.
For main deliverables of the parent Container Cloud release of 2.27.4, refer
to 2.27.0.
In total, since Container Cloud 2.27.3, 131 Common Vulnerabilities and
Exposures (CVE) have been fixed in 2.27.4: 15 of critical and 116 of
high severity.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since Container Cloud 2.27.3.
The common CVEs are issues addressed across several images.
If the dnsmasq pod is restarted during the bootstrap of newly added
nodes, those nodes may fail to undergo inspection. That can result in
inspection error in the corresponding BareMetalHost objects.
The issue can occur when:
The dnsmasq pod was moved to another node.
DHCP subnets were changed, including addition or removal. In this case, the
dhcpd container of the dnsmasq pod is restarted.
Caution
If changing or adding of DHCP subnets is required to bootstrap
new nodes, wait after changing or adding DHCP subnets until the
dnsmasq pod becomes ready, then create BareMetalHost objects.
To verify whether the nodes are affected:
Verify whether the BareMetalHost objects contain the
inspection error:
Verify whether the dnsmasq pod was in Ready state when the
inspection of the affected baremetal hosts (test-worker-3 in the example
above) was started:
In the system response above, inspection was started at
"2024-10-11T07:38:19Z", immediately before the period of the dhcpd
container downtime. Therefore, this node is most likely affected by the
issue.
Workaround
Reboot the node using the IPMI reset or cycle
command.
If the node fails to boot, remove the failed BareMetalHost object and
create it again:
Remove BareMetalHost object. For example:
kubectl delete bmh -n managed-ns test-worker-3
Verify that the BareMetalHost object is removed:
kubectl get bmh -n managed-ns test-worker-3
Create a BareMetalHost object from the template. For example:
When trying to list the HostOSConfigurationModules and HostOSConfiguration custom resources, serviceuser or a user with
the global-admin or operator role obtains the access denied error.
For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains without an external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
After node maintenance of a management cluster, the newly added nodes may
fail to undergo provisioning successfully. The issue relates to new nodes
that are in the same L2 domain as the management cluster.
The issue was observed on environments having management cluster nodes
configured with a single L2 segment used for all network traffic
(PXE and LCM/management networks).
To verify whether the cluster is affected:
Verify whether the dnsmasq and dhcp-relay pods run on the same node
in the management cluster:
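A check similar to the following can be used; the kaas namespace is an
assumption for where these pods run on the management cluster:
  kubectl get pods -n kaas -o wide | grep -E 'dnsmasq|dhcp-relay'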
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
LCM¶[39437] Failure to replace a master node on a Container Cloud cluster¶
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
Ceph¶[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue
66717,
during CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slowly and even fail the startup probe with the following describe output in
the rook-ceph-osd-X pod:
Complete the following steps during every patch or major cluster update of the
Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes
supported):
Plan extra time in the maintenance window for the patch cluster update.
Slow starts will still impact the update procedure, but after completing the
following step, the recovery process noticeably shortens without affecting
the overall cluster state and data responsiveness.
Select one of the following options:
Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSDs image upgrade is done, unset the flag:
ceph osd unset noout
Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear,
set the noout flag as soon as possible. Once the Ceph OSDs image
upgrade is done, unset the flag.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
Scale up the affected StatefulSet or Deployment back to the
original number of replicas and wait until its state becomes Running.
Container Cloud web UI¶[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines along with
the inability to change dedicatedControlPlane: false using the web UI.
To work around the issue, manually add the required labels using the CLI. Once
done, the cluster deployment resumes.
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and contains various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser five minutes after the
project creation.
Patch cluster update¶[49713] Patch update is stuck with some nodes in Prepare state¶
Patch update from 2.27.3 to 2.27.4 may get stuck with one or more management
cluster nodes remaining in the Prepare state and with the following error
in the lcm-controller logs on the management cluster:
failed to create cluster updater for cluster default/kaas-mgmt: machine update group not found for machine default/master-0
To work around the issue, in the LCMMachine objects of the management
cluster, set the following annotation:
This section lists the artifacts of components included in the Container Cloud
patch release 2.27.4. For artifacts of the Cluster releases introduced in
2.27.4, see patch Cluster releases 16.2.4 and 17.2.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
For MOSK clusters, Container Cloud 2.27.3 is the
first patch release of MOSK 24.2.x series using the patch
Cluster release 17.2.3. For the update path of 24.1 and 24.2 series, see
MOSK documentation: Cluster update scheme.
The Container Cloud patch release 2.27.3, which is based on the
2.27.0 major release, provides the following updates:
Support for the patch Cluster releases 16.2.3 and 17.2.3
that represents Mirantis OpenStack for Kubernetes (MOSK) patch release
24.2.1.
MKE:
Support for MKE 3.7.12.
Improvements in the MKE benchmark compliance (control ID 5.1.5): analyzed
and fixed the majority of failed compliance checks for the following
components:
Container Cloud: iam-keycloak in the kaas namespace and
opensearch-dashboards in the stacklight namespace
MOSK: opensearch-dashboards in the stacklight
namespace
Bare metal: update of Ubuntu mirror from ubuntu-2024-07-16-014744 to
ubuntu-2024-08-06-014502 along with update of the minor kernel version from
5.15.0-116-generic to 5.15.0-117-generic.
VMware vSphere: suspension of support for cluster deployment, update, and
attachment. For details, see Deprecation notes.
Security fixes for CVEs in images.
This patch release also supports the latest major Cluster releases
17.2.0 and 16.2.0. And it does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available Cluster release
instead.
For main deliverables of the parent Container Cloud release of 2.27.3, refer
to 2.27.0.
In total, since Container Cloud 2.27.2, 1559 Common Vulnerabilities and
Exposures (CVE) have been fixed in 2.27.3: 253 of critical and 1306 of
high severity.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since Container Cloud 2.27.2.
The common CVEs are issues addressed across several images.
If the dnsmasq pod is restarted during the bootstrap of newly added
nodes, those nodes may fail to undergo inspection. That can result in
inspection error in the corresponding BareMetalHost objects.
The issue can occur when:
The dnsmasq pod was moved to another node.
DHCP subnets were changed, including addition or removal. In this case, the
dhcpd container of the dnsmasq pod is restarted.
Caution
If changing or adding of DHCP subnets is required to bootstrap
new nodes, wait after changing or adding DHCP subnets until the
dnsmasq pod becomes ready, then create BareMetalHost objects.
To verify whether the nodes are affected:
Verify whether the BareMetalHost objects contain the
inspection error:
Verify whether the dnsmasq pod was in Ready state when the
inspection of the affected baremetal hosts (test-worker-3 in the example
above) was started:
In the system response above, inspection was started at
"2024-10-11T07:38:19Z", immediately before the period of the dhcpd
container downtime. Therefore, this node is most likely affected by the
issue.
Workaround
Reboot the node using the IPMI reset or cycle
command.
If the node fails to boot, remove the failed BareMetalHost object and
create it again:
Remove BareMetalHost object. For example:
kubectl delete bmh -n managed-ns test-worker-3
Verify that the BareMetalHost object is removed:
kubectl get bmh -n managed-ns test-worker-3
Create a BareMetalHost object from the template. For example:
When trying to list the HostOSConfigurationModules and HostOSConfiguration custom resources, serviceuser or a user with
the global-admin or operator role obtains the access denied error.
For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains without an external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
After node maintenance of a management cluster, the newly added nodes may
fail to undergo provisioning successfully. The issue relates to new nodes
that are in the same L2 domain as the management cluster.
The issue was observed on environments having management cluster nodes
configured with a single L2 segment used for all network traffic
(PXE and LCM/management networks).
To verify whether the cluster is affected:
Verify whether the dnsmasq and dhcp-relay pods run on the same node
in the management cluster:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
LCM¶[39437] Failure to replace a master node on a Container Cloud cluster¶
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
Ceph¶[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue
66717,
during CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slowly and even fail the startup probe with the following describe output in
the rook-ceph-osd-X pod:
Complete the following steps during every patch or major cluster update of the
Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes
supported):
Plan extra time in the maintenance window for the patch cluster update.
Slow starts will still impact the update procedure, but after completing the
following step, the recovery process noticeably shortens without affecting
the overall cluster state and data responsiveness.
Select one of the following options:
Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSDs image upgrade is done, unset the flag:
ceph osd unset noout
Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear,
set the noout flag as soon as possible. Once the Ceph OSDs image
upgrade is done, unset the flag.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
Scale up the affected StatefulSet or Deployment back to the
original number of replicas and wait until its state becomes Running.
Container Cloud web UI¶[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines along with
the inability to change dedicatedControlPlane: false using the web UI.
To work around the issue, manually add the required labels using the CLI. Once
done, the cluster deployment resumes.
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and contains various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser five minutes after the
project creation.
This section lists the artifacts of components included in the Container Cloud
patch release 2.27.3. For artifacts of the Cluster releases introduced in
2.27.3, see patch Cluster releases 16.2.3 and 17.2.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
For MOSK clusters, Container Cloud 2.27.2 is the
continuation for MOSK 24.1.x series using the patch Cluster
release 17.1.7. For the update path of 24.1 and 24.2 series, see
MOSK documentation: Cluster update scheme.
The management cluster of a MOSK 24.1, 24.1.5, or 24.1.6
cluster is automatically updated to the latest patch Cluster release
16.2.2.
The Container Cloud patch release 2.27.2, which is based on the
2.27.0 major release, provides the following updates:
Support for the patch Cluster releases 16.1.7 and 17.1.7
that represents Mirantis OpenStack for Kubernetes (MOSK) patch release
24.1.7.
Support for MKE 3.7.11.
Bare metal: update of Ubuntu mirror from ubuntu-2024-06-27-095142 to
ubuntu-2024-07-16-014744 along with update of minor kernel version from
5.15.0-113-generic to 5.15.0-116-generic (Cluster release 16.2.2).
Security fixes for CVEs in images.
This patch release also supports the latest major Cluster releases
17.2.0 and 16.2.0. And it does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available Cluster release
instead.
For main deliverables of the parent Container Cloud release of 2.27.2, refer
to 2.27.0.
In total, since Container Cloud 2.27.1, 95 Common Vulnerabilities and
Exposures (CVE) have been fixed in 2.27.2: 6 of critical and 89 of
high severity.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since Container Cloud 2.27.1.
The common CVEs are issues addressed across several images.
This section lists known issues with workarounds for the Mirantis
Container Cloud release 2.27.2 including the Cluster releases 16.2.2,
16.1.7, and 17.1.7.
If the dnsmasq pod is restarted during the bootstrap of newly added
nodes, those nodes may fail to undergo inspection. That can result in
inspection error in the corresponding BareMetalHost objects.
The issue can occur when:
The dnsmasq pod was moved to another node.
DHCP subnets were changed, including addition or removal. In this case, the
dhcpd container of the dnsmasq pod is restarted.
Caution
If changing or adding of DHCP subnets is required to bootstrap
new nodes, wait after changing or adding DHCP subnets until the
dnsmasq pod becomes ready, then create BareMetalHost objects.
To verify whether the nodes are affected:
Verify whether the BareMetalHost objects contain the
inspection error:
Verify whether the dnsmasq pod was in Ready state when the
inspection of the affected baremetal hosts (test-worker-3 in the example
above) was started:
In the system response above, inspection was started at
"2024-10-11T07:38:19Z", immediately before the period of the dhcpd
container downtime. Therefore, this node is most likely affected by the
issue.
Workaround
Reboot the node using the IPMI reset or cycle
command.
If the node fails to boot, remove the failed BareMetalHost object and
create it again:
Remove BareMetalHost object. For example:
kubectl delete bmh -n managed-ns test-worker-3
Verify that the BareMetalHost object is removed:
kubectl get bmh -n managed-ns test-worker-3
Create a BareMetalHost object from the template. For example:
When trying to list the HostOSConfigurationModules and HostOSConfiguration custom resources, serviceuser or a user with
the global-admin or operator role obtains the access denied error.
For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains without an external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
After node maintenance of a management cluster, the newly added nodes may
fail to undergo provisioning successfully. The issue relates to new nodes
that are in the same L2 domain as the management cluster.
The issue was observed on environments having management cluster nodes
configured with a single L2 segment used for all network traffic
(PXE and LCM/management networks).
To verify whether the cluster is affected:
Verify whether the dnsmasq and dhcp-relay pods run on the same node
in the management cluster:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
LCM¶[39437] Failure to replace a master node on a Container Cloud cluster¶
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
Ceph¶[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue
66717,
during CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slowly and even fail the startup probe with the following describe output in
the rook-ceph-osd-X pod:
Complete the following steps during every patch or major cluster update of the
Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes
supported):
Plan extra time in the maintenance window for the patch cluster update.
Slow starts will still impact the update procedure, but after completing the
following step, the recovery process noticeably shortens without affecting
the overall cluster state and data responsiveness.
Select one of the following options:
Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSDs image upgrade is done, unset the flag:
ceph osd unset noout
Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear,
set the noout flag as soon as possible. Once the Ceph OSDs image
upgrade is done, unset the flag.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
Scale up the affected StatefulSet or Deployment back to the
original number of replicas and wait until its state becomes Running.
Container Cloud web UI¶[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines along with
the inability to change dedicatedControlPlane: false using the web UI.
To work around the issue, manually add the required labels using the CLI. Once
done, the cluster deployment resumes.
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and contains various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser five minutes after the
project creation.
This section lists the artifacts of components included in the Container Cloud
patch release 2.27.2. For artifacts of the Cluster releases introduced in
2.27.2, see patch Cluster releases 16.2.2, 16.1.7, and
17.1.7.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
For MOSK clusters, Container Cloud 2.27.1 is the
continuation for MOSK 24.1.x series using the patch Cluster
release 17.1.6. For the update path of 24.1 and 24.2 series, see
MOSK documentation: Cluster update scheme.
The management cluster of a MOSK 24.1 or 24.1.5 cluster is
automatically updated to the latest patch Cluster release 16.2.1.
The Container Cloud patch release 2.27.1, which is based on the
2.27.0 major release, provides the following updates:
Support for the patch Cluster releases 16.1.6 and 17.1.6
that represents Mirantis OpenStack for Kubernetes (MOSK) patch release
24.1.6.
Support for MKE 3.7.10.
Support for docker-ee-cli 23.0.13 in MCR 23.0.11 to fix several CVEs.
Bare metal: update of Ubuntu mirror from ubuntu-2024-05-17-013445 to
ubuntu-2024-06-27-095142 along with update of minor kernel version from
5.15.0-107-generic to 5.15.0-113-generic.
Security fixes for CVEs in images.
Bug fixes.
This patch release also supports the latest major Cluster releases
17.2.0 and 16.2.0. And it does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available Cluster release
instead.
For main deliverables of the parent Container Cloud release of 2.27.1, refer
to 2.27.0.
In total, since Container Cloud 2.27.0, 270 Common Vulnerabilities and
Exposures (CVE) of high severity have been fixed in 2.27.1.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since Container Cloud 2.27.0.
The common CVEs are issues addressed across several images.
The following issues have been addressed in the Container Cloud patch release
2.27.1 along with the patch Cluster releases 16.2.1,
16.1.6, and 17.1.6.
[42304] [StackLight] [Cluster releases 17.1.6, 16.1.6] Fixed the issue
with failure of shard relocation in the OpenSearch cluster on large
Container Cloud managed clusters.
[40020] [StackLight] [Cluster releases 17.1.6, 16.1.6] Fixed the issue
with rollover_policy not being applied to the current indices while
updating the policy for the current system* and audit* data streams.
This section lists known issues with workarounds for the Mirantis
Container Cloud release 2.27.1 including the Cluster releases 16.2.1,
16.1.6, and 17.1.6.
If the dnsmasq pod is restarted during the bootstrap of newly added
nodes, those nodes may fail to undergo inspection. That can result in
inspection error in the corresponding BareMetalHost objects.
The issue can occur when:
The dnsmasq pod was moved to another node.
DHCP subnets were changed, including addition or removal. In this case, the
dhcpd container of the dnsmasq pod is restarted.
Caution
If changing or adding of DHCP subnets is required to bootstrap
new nodes, wait after changing or adding DHCP subnets until the
dnsmasq pod becomes ready, then create BareMetalHost objects.
To verify whether the nodes are affected:
Verify whether the BareMetalHost objects contain the
inspection error:
Verify whether the dnsmasq pod was in Ready state when the
inspection of the affected baremetal hosts (test-worker-3 in the example
above) was started:
In the system response above, inspection was started at
"2024-10-11T07:38:19Z", immediately before the period of the dhcpd
container downtime. Therefore, this node is most likely affected by the
issue.
Workaround
Reboot the node using the IPMI reset or cycle
command.
If the node fails to boot, remove the failed BareMetalHost object and
create it again:
Remove BareMetalHost object. For example:
kubectl delete bmh -n managed-ns test-worker-3
Verify that the BareMetalHost object is removed:
kubectl get bmh -n managed-ns test-worker-3
Create a BareMetalHost object from the template. For example:
When trying to list the HostOSConfigurationModules and HostOSConfiguration custom resources, serviceuser or a user with
the global-admin or operator role obtains the access denied error.
For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains without an external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
After node maintenance of a management cluster, the newly added nodes may
fail to undergo provisioning successfully. The issue relates to new nodes
that are in the same L2 domain as the management cluster.
The issue was observed on environments having management cluster nodes
configured with a single L2 segment used for all network traffic
(PXE and LCM/management networks).
To verify whether the cluster is affected:
Verify whether the dnsmasq and dhcp-relay pods run on the same node
in the management cluster:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
LCM¶[39437] Failure to replace a master node on a Container Cloud cluster¶
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
Ceph¶[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue
66717,
during CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slowly and even fail the startup probe with the following describe output in
the rook-ceph-osd-X pod:
Complete the following steps during every patch or major cluster update of the
Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes
supported):
Plan extra time in the maintenance window for the patch cluster update.
Slow starts will still impact the update procedure, but after completing the
following step, the recovery process noticeably shortens without affecting
the overall cluster state and data responsiveness.
Select one of the following options:
Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSDs image upgrade is done, unset the flag:
ceph osd unset noout
Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear,
set the noout flag as soon as possible. Once the Ceph OSDs image
upgrade is done, unset the flag.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
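A hedged form of this check is a plain Pod description, for example:
kubectl describe pod <affectedPodName> -n <affectedProjectName>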
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
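One hedged way to do this, assuming the plugin pods carry the standard Rook
app=csi-rbdplugin label in the rook-ceph namespace:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin -o wide | grep <nodeName>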
Scale up the affected StatefulSet or Deployment back to the
original number of replicas and wait until its state becomes Running.
Container Cloud web UI¶[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines and to
change dedicatedControlPlane:false using the web UI.
To work around the issue, manually add the required labels using CLI. Once
done, the cluster deployment resumes.
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and shows various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser in five minutes after the
project creation.
This section describes the specific actions you as a cloud operator need to
complete before or after your Container Cloud cluster update to the Cluster
releases 17.1.6, 16.2.1, or 16.1.6.
Post-update actions¶Prepare for changing label values in Ceph metrics used in customizations¶
Note
If you do not use Ceph metrics in any customizations, for example,
custom alerts, Grafana dashboards, or queries in custom workloads, skip
this section.
After deprecating the performance metric exporter that is integrated into the
Ceph Manager daemon for the sake of the dedicated Ceph Exporter daemon in
Container Cloud 2.27.0, you may need to prepare for updating values of several
labels in Ceph metrics if you use them in any customizations such as custom
alerts, Grafana dashboards, or queries in custom tools. These labels will be
changed in Container Cloud 2.28.0 (Cluster releases 16.3.0 and 17.3.0).
Note
Names of metrics will not be changed, and no metrics will be removed.
All Ceph metrics to be collected by the Ceph Exporter daemon will change their
labels job and instance due to scraping metrics from the new Ceph Exporter
daemon instead of the performance metric exporter of Ceph Manager:
Values of the job labels will be changed from rook-ceph-mgr to
prometheus-rook-exporter for all Ceph metrics moved to Ceph
Exporter. The full list of moved metrics is presented below.
Values of the instance labels will be changed from the metric endpoint
of Ceph Manager with port 9283 to the metric endpoint of Ceph Exporter
with port 9926 for all Ceph metrics moved to Ceph Exporter. The full
list of moved metrics is presented below.
Values of the instance_id labels of Ceph metrics from the RADOS
Gateway (RGW) daemons will be changed from the daemon GID to the daemon
subname. For example, instead of instance_id="<RGW_PROCESS_GID>",
the instance_id="a" (ceph_rgw_qlen{instance_id="a"}) will be
used. The list of moved Ceph RGW metrics is presented below.
List of affected Ceph RGW metrics
ceph_rgw_cache_.*
ceph_rgw_failed_req
ceph_rgw_gc_retire_object
ceph_rgw_get.*
ceph_rgw_keystone_.*
ceph_rgw_lc_.*
ceph_rgw_lua_.*
ceph_rgw_pubsub_.*
ceph_rgw_put.*
ceph_rgw_qactive
ceph_rgw_qlen
ceph_rgw_req
List of all metrics to be collected by Ceph Exporter instead of
Ceph Manager
This section lists the artifacts of components included in the Container Cloud
patch release 2.27.1. For artifacts of the Cluster releases introduced in
2.27.1, see patch Cluster releases 16.2.1, 16.1.6, and
17.1.6.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Does not support greenfield deployments on deprecated Cluster releases
of the 17.1.x and 16.1.x series. Use the latest available Cluster releases
of the series instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.27.0.
This section outlines new features and enhancements introduced in the
Container Cloud release 2.27.0. For the list of enhancements delivered with
the Cluster releases introduced by Container Cloud 2.27.0, see
17.2.0 and 16.2.0.
General availability for Ubuntu 22.04 on bare metal clusters¶
Implemented full support for Ubuntu 22.04 LTS (Jammy Jellyfish) as the default
host operating system that now installs on non-MOSK bare metal
management and managed clusters.
For MOSK:
Existing management clusters are automatically updated to Ubuntu 22.04 during
cluster upgrade to Container Cloud 2.27.0 (Cluster release 16.2.0).
Greenfield deployments of management clusters are based on Ubuntu 22.04.
Existing and greenfield deployments of managed clusters are still based on
Ubuntu 20.04. The support for Ubuntu 22.04 on this cluster type will be
announced in one of the following releases.
Caution
Upgrading from Ubuntu 20.04 to 22.04 on existing deployments
of Container Cloud managed clusters is not supported.
Improvements in the day-2 management API for bare metal clusters¶
TechPreview
Enhanced the day-2 management API of the bare metal provider with several key
improvements:
Implemented the sysctl, package, and irqbalance configuration
modules, which become available for usage after your management cluster
upgrade to the Cluster release 16.2.0. These Container Cloud modules use the
designated HostOSConfiguration object named mcc-modules to distinguish
them from custom modules.
Configuration modules allow managing the operating system of a bare metal
host granularly without rebuilding the node from scratch. Such an approach
prevents workload evacuation and significantly reduces configuration time.
Optimized performance for faster, more efficient operations.
Enhanced user experience for easier and more intuitive interactions.
Resolved various internal issues to ensure smoother functionality.
Added comprehensive documentation, including concepts, guidelines,
and recommendations for effective use of day-2 operations.
Optimization of strict filtering for devices on bare metal clusters¶
Optimized the BareMetalHostProfile custom resource, which uses
the strict byID filtering to target system disks using the byPath,
serialNumber, and wwn reliable device options instead of the
unpredictable byName naming format.
The optimization includes changes in admission-controller that now blocks
the use of bmhp:spec:devices:by_name in new BareMetalHostProfile
objects.
Deprecation of SubnetPool and MetalLBConfigTemplate objects¶
As part of refactoring of the bare metal provider, deprecated the
SubnetPool and MetalLBConfigTemplate objects. The objects will be
completely removed from the product in one of the following releases.
Both objects are automatically migrated to the MetalLBConfig object during
cluster update to the Cluster release 17.2.0 or 16.2.0.
The ClusterUpdatePlan object for a granular cluster update¶
TechPreview
Implemented the ClusterUpdatePlan custom resource to enable a granular
step-by-step update of a managed cluster. The operator can control the update
process by manually launching update stages using the commence flag.
Between the update stages, a cluster remains functional from the perspective
of cloud users and workloads.
A ClusterUpdatePlan object is automatically created by the respective
Container Cloud provider when a new Cluster release becomes available for your
cluster. This object contains a list of predefined self-descriptive update
steps that are cluster-specific. These steps are defined in the spec
section of the object with information about their impact on the cluster.
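As an illustration only, an operator could inspect the plan and launch the
next stage roughly as follows; the exact location of the commence flag inside
the object is an assumption of this sketch:
kubectl -n <projectName> get clusterupdateplan
kubectl -n <projectName> edit clusterupdateplan <planName>   # set commence: true for the step to launch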
Implemented the UpdateGroup custom resource for creation of update groups
for worker machines on managed clusters. The use of update groups provides
enhanced control over update of worker machines. This feature decouples the
concurrency settings from the global cluster level, providing update
flexibility based on the workload characteristics of different worker machine
sets.
Implemented the same heartbeat model for the LCM Agent as Kubernetes uses
for Nodes. This model allows reflecting the actual status of the LCM Agent
when it fails. For visual representation, added the corresponding
LCM Agent status to the Container Cloud web UI for clusters and
machines, which reflects the health status of the LCM Agent along with the
status of its update to the version from the current Cluster release.
Handling secret leftovers using secret-controller¶
Implemented secret-controller that runs on a management cluster and cleans
up secret leftovers of credentials that are not cleaned up automatically after
creation of new secrets. This controller replaces rhellicense-controller,
proxy-controller, and byo-credentials-controller as well as partially
replaces the functionality of license-controller and other credential
controllers.
Note
You can change memory limits for secret-controller on a
management cluster using the resources:limits parameter in the
spec:providerSpec:value:kaas:management:helmReleases: section of the
Cluster object.
MariaDB backup for bare metal and vSphere providers¶
Implemented the capability to back up and restore MariaDB databases on
management clusters for bare metal and vSphere providers. Also, added
documentation on how to change the storage node for backups on clusters of
these provider types.
The following issues have been addressed in the Mirantis Container Cloud
release 2.27.0 along with the Cluster releases 17.2.0 and
16.2.0.
Note
This section provides descriptions of issues addressed since
the last Container Cloud patch release 2.26.5.
For details on addressed issues in earlier patch releases since 2.26.0,
which are also included into the major release 2.27.0, refer to
2.26.x patch releases.
[42304] [StackLight] Fixed the issue with failure of shard relocation in
the OpenSearch cluster on large Container Cloud managed clusters.
[41890] [StackLight] Fixed the issue with Patroni failing to start
because of the short default timeout.
[40020] [StackLight] Fixed the issue with rollover_policy not being
applied to the current indices while updating the policy for the current
system* and audit* data streams.
[41819] [Ceph] Fixed the issue with the graceful cluster reboot being
blocked by active Ceph ClusterWorkloadLock objects.
[28865] [LCM] Fixed the issue with validation of the NTP configuration
before cluster deployment. Now, deployment does not start until the NTP
configuration is validated.
If the dnsmasq pod is restarted during the bootstrap of newly added
nodes, those nodes may fail to undergo inspection. That can result in
an inspection error in the corresponding BareMetalHost objects.
The issue can occur when:
The dnsmasq pod was moved to another node.
DHCP subnets were changed, including addition or removal. In this case, the
dhcpd container of the dnsmasq pod is restarted.
Caution
If changing or adding of DHCP subnets is required to bootstrap
new nodes, wait after changing or adding DHCP subnets until the
dnsmasq pod becomes ready, then create BareMetalHost objects.
To verify whether the nodes are affected:
Verify whether the BareMetalHost objects contain the
inspection error:
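For example, a hedged check (the namespace of the BareMetalHost objects may
differ):
kubectl -n managed-ns get bmh
Look for the inspection error in the system response.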
Verify whether the dnsmasq pod was in Ready state when the
inspection of the affected baremetal hosts (test-worker-3 in the example
above) was started:
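A hedged way to compare the timestamps, assuming the dnsmasq pod runs in the
kaas namespace of the management cluster and the BareMetalHost operation
history is populated:
kubectl -n managed-ns get bmh test-worker-3 -o jsonpath='{.status.operationHistory.inspect}'
kubectl -n kaas get pod <dnsmasqPodName> -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{" "}{.lastTransitionTime}{"\n"}{end}'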
In the system response above, inspection was started at
"2024-10-11T07:38:19Z", immediately before the period of the dhcpd
container downtime. Therefore, this node is most likely affected by the
issue.
Workaround
Reboot the node using the IPMI reset or cycle
command.
If the node fails to boot, remove the failed BareMetalHost object and
create it again:
Remove BareMetalHost object. For example:
kubectl delete bmh -n managed-ns test-worker-3
Verify that the BareMetalHost object is removed:
kubectl get bmh -n managed-ns test-worker-3
Create a BareMetalHost object from the template. For example:
When trying to list the HostOSConfigurationModules and HostOSConfiguration custom resources, serviceuser or a user with
the global-admin or operator role obtains the access denied error.
For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that was changed is left without an external IP
address. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
After node maintenance of a management cluster, the newly added nodes may
fail to undergo provisioning successfully. The issue relates to new nodes
that are in the same L2 domain as the management cluster.
The issue was observed on environments having management cluster nodes
configured with a single L2 segment used for all network traffic
(PXE and LCM/management networks).
To verify whether the cluster is affected:
Verify whether the dnsmasq and dhcp-relay pods run on the same node
in the management cluster:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
LCM¶[39437] Failure to replace a master node on a Container Cloud cluster¶
During the replacement of a master node on a cluster of any type, the process
may get stuck with the Kubelet's NodeReady condition is Unknown message in the
machine status on the remaining master nodes.
As a workaround, log in to the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
Ceph¶[50566] Ceph upgrade is very slow during patch or major cluster update¶
Due to the upstream Ceph issue
66717,
during a CVE upgrade of the Ceph daemon image of Ceph Reef 18.2.4, OSDs may start
slowly and even fail the startup probe with the following describe output in
the rook-ceph-osd-X pod:
Complete the following steps during every patch or major cluster update of the
Cluster releases 17.2.x, 17.3.x, and 17.4.x (until Ceph 18.2.5 becomes
supported):
Plan extra time in the maintenance window for the patch cluster update.
Slow starts will still impact the update procedure, but after completing the
following step, the recovery process noticeably shortens without affecting
the overall cluster state and data responsiveness.
Select one of the following options:
Before the cluster update, set the noout flag:
ceph osd set noout
Once the Ceph OSDs image upgrade is done, unset the flag:
ceph osd unset noout
Monitor the Ceph OSDs image upgrade. If the symptoms of slow start appear,
set the noout flag as soon as possible. Once the Ceph OSDs image
upgrade is done, unset the flag.
[42908] The ceph-exporter pods are present in the Ceph crash list¶
After a managed cluster update, the ceph-exporter pods are present in the
ceph crash ls list while rook-ceph-exporter attempts to obtain
the port that is still in use. The issue does not block the managed cluster
update. Once the port becomes available, rook-ceph-exporter obtains the
port and the issue disappears.
As a workaround, run ceph crash archive-all to remove
ceph-exporter pods from the Ceph crash list.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
On High Availability (HA) clusters that use Local Volume Provisioner (LVP),
Prometheus and OpenSearch from StackLight may share the same pool of storage.
In such a configuration, OpenSearch may approach the 85% disk usage watermark
due to the combined storage allocation and usage patterns set by the Persistent
Volume Claim (PVC) size parameters for Prometheus and OpenSearch, which consume
storage the most.
When the 85% threshold is reached, the affected node is transitioned to the
read-only state, preventing shard allocation and causing the OpenSearch cluster
state to transition to Warning (Yellow) or Critical (Red).
Caution
The issue and the provided workaround apply only for clusters on
which OpenSearch and Prometheus utilize the same storage pool.
Derived from .values.elasticsearch.persistentVolumeUsableStorageSizeGB,
defaulting to .values.elasticsearch.persistentVolumeClaimSize if
unspecified. To obtain the OpenSearch PVC size:
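For example, a hedged way to list the PVC capacities in the stacklight
namespace (the PVC naming follows the opensearch-master StatefulSet and may
differ):
kubectl -n stacklight get pvc | grep opensearch-master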
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
If the formula result is positive, it is an early indication that the
cluster is affected.
Verify whether the OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical alert is firing. And if so,
verify the following:
Log in to the OpenSearch web UI.
In Management -> Dev Tools, run the following command:
GET _cluster/allocation/explain
The following system response indicates that the corresponding node is
affected:
"explanation":"the node is above the low watermark cluster setting \[cluster.routing.allocation.disk.watermark.low=85%], using more disk space \than the maximum allowed [85.0%], actual free: [xx.xxx%]"
Note
The system response may contain even higher watermark percent
than 85.0%, depending on the case.
Workaround:
Warning
The workaround implies adjustment of the retention threshold for
OpenSearch. And depending on the new threshold, some old logs will be
deleted.
A user-defined variable that specifies what percentage of the total storage
capacity should not be used by OpenSearch or Prometheus. This is used to
reserve space for other components. It should be expressed as a decimal.
For example, for 5% of reservation, Reserved_Percentage is 0.05.
Mirantis recommends using 0.05 as a starting point.
Filesystem_Reserve
Percentage to deduct for filesystems that may reserve some portion of the
available storage, which is marked as occupied. For example, for EXT4, it
is 5% by default, so the value must be 0.05.
Prometheus_PVC_Size_GB
Sourced from .values.prometheusServer.persistentVolumeClaimSize.
Total_Storage_Capacity_GB
Total capacity of the OpenSearch PVCs. For LVP, the capacity of the
storage pool. To obtain the total capacity:
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
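The exact formula is not reproduced in this excerpt. As a hedged
reconstruction based on the variables described above, the usable storage
size can be approximated as:
persistentVolumeUsableStorageSizeGB = (1 - Reserved_Percentage - Filesystem_Reserve) * Total_Storage_Capacity_GB - Prometheus_PVC_Size_GB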
Calculation of the above formula provides the maximum safe storage to allocate
for .values.elasticsearch.persistentVolumeUsableStorageSizeGB. Use this
formula as a reference for setting
.values.elasticsearch.persistentVolumeUsableStorageSizeGB on a cluster.
Wait up to 15-20 minutes for OpenSearch to perform the cleanup.
Verify that the cluster is not affected anymore using the procedure above.
[43164] Rollover policy is not added to indices created without a policy¶
The initial index for the system* and audit* data streams can be
created without any policy attached due to race condition.
One of the indicators that the cluster is most likely affected is the
KubeJobFailed alert firing for the elasticsearch-curator job and one or
both of the following errors being present in elasticsearch-curator pods
that remain in the Error status:
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. \
<class 'curator.exceptions.FailedExecution'>: Exception encountered. \
Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. \
Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-system-000001] is the write index for data stream [system] and cannot be deleted')
or
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. \
<class 'curator.exceptions.FailedExecution'>: Exception encountered. \
Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. \
Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-audit-000001] is the write index for data stream [audit] and cannot be deleted')
If the above-mentioned alert and errors are present, immediate action is
required, because it indicates that the corresponding index size has already
exceeded the space allocated for the index.
To verify that the cluster is affected:
Caution
Verify and apply the workaround to both index patterns, system and
audit, separately.
If one of the indices is affected, the second one is most likely affected
as well. Although in rare cases, only one index may be affected.
Perform again the last step of the cluster verification procedure provided
above and make sure that the policy is attached to the index.
Container Cloud web UI¶[50181] Failure to deploy a compact cluster¶
A compact MOSK cluster fails to be deployed through the Container Cloud web UI
due to the inability to add any label to the control plane machines and to
change dedicatedControlPlane:false using the web UI.
To work around the issue, manually add the required labels using CLI. Once
done, the cluster deployment resumes.
[50168] Inability to use a new project right after creation¶
A newly created project does not display all available tabs in the Container
Cloud web UI and shows various access denied errors during the first five
minutes after creation.
To work around the issue, refresh the browser in five minutes after the
project creation.
The following table lists the major components and their versions delivered in
Container Cloud 2.27.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section lists the artifacts of components included in the Container Cloud
release 2.27.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
In total, since Container Cloud 2.26.0, in 2.27.0, 408
Common Vulnerabilities and Exposures (CVE) have been fixed:
26 of critical and 382 of high severity.
The table below includes the total numbers of addressed unique and common
vulnerabilities and exposures (CVE) by product component since the 2.26.5
patch release. The common CVEs are issues addressed across several images.
This section describes the specific actions you as a cloud operator need to
complete before or after your Container Cloud cluster update to the Cluster
releases 17.2.0 or 16.2.0.
For those clusters that update only between major versions, the update
scheme remains unchanged.
Caution
In Container Cloud patch releases 2.27.1 and 2.27.2,
only the 16.2.x patch Cluster releases will be delivered with an
automatic update of management clusters and the possibility to update
non-MOSK managed clusters.
In parallel, 2.27.1 and 2.27.2 will include new 16.1.x and 17.1.x patches
for MOSK 24.1.x. And the first 17.2.x patch Cluster release
for MOSK 24.2.x will be delivered in 2.27.3. For details,
see MOSK documentation: Update path for 24.1 and 24.2 series.
Pre-update actions¶Update bird configuration on BGP-enabled bare metal clusters¶
Note
If you have already completed the below procedure after updating
your clusters to Container Cloud 2.26.0 (Cluster releases 17.1.0 or 16.1.0),
skip this subsection.
Container Cloud 2.26.0 introduced the bird daemon update from v1.6.8
to v2.0.7 on master nodes if BGP is used for BGP announcement of the cluster
API load balancer address.
Configuration files for bird v1.x are not fully compatible with those for
bird v2.x. Therefore, if you used BGP announcement of cluster API LB address
on a deployment based on Cluster releases 17.0.0 or 16.0.0, update bird
configuration files to fit bird v2.x using configuration examples provided in
the API Reference: MultiRackCluster section.
Review and adjust the storage parameters for OpenSearch¶
Note
If you have already completed the below procedure after updating
your clusters to Container Cloud 2.26.0 (Cluster releases 17.1.0 or 16.1.0),
skip this subsection.
To prevent underused or overused storage space, review your storage space
parameters for OpenSearch on the StackLight cluster:
Review the value of elasticsearch.persistentVolumeClaimSize and
the real storage available on volumes.
Decide whether you have to additionally set
elasticsearch.persistentVolumeUsableStorageSizeGB.
Post-update actions¶Prepare for changing label values in Ceph metrics used in customizations¶
Note
If you do not use Ceph metrics in any customizations, for example,
custom alerts, Grafana dashboards, or queries in custom workloads, skip
this section.
After deprecating the performance metric exporter that is integrated into the
Ceph Manager daemon for the sake of the dedicated Ceph Exporter daemon in
Container Cloud 2.27.0, you may need to prepare for updating values of several
labels in Ceph metrics if you use them in any customizations such as custom
alerts, Grafana dashboards, or queries in custom tools. These labels will be
changed in Container Cloud 2.28.0 (Cluster releases 16.3.0 and 17.3.0).
Note
Names of metrics will not be changed, and no metrics will be removed.
All Ceph metrics to be collected by the Ceph Exporter daemon will change their
labels job and instance due to scraping metrics from the new Ceph Exporter
daemon instead of the performance metric exporter of Ceph Manager:
Values of the job labels will be changed from rook-ceph-mgr to
prometheus-rook-exporter for all Ceph metrics moved to Ceph
Exporter. The full list of moved metrics is presented below.
Values of the instance labels will be changed from the metric endpoint
of Ceph Manager with port 9283 to the metric endpoint of Ceph Exporter
with port 9926 for all Ceph metrics moved to Ceph Exporter. The full
list of moved metrics is presented below.
Values of the instance_id labels of Ceph metrics from the RADOS
Gateway (RGW) daemons will be changed from the daemon GID to the daemon
subname. For example, instead of instance_id="<RGW_PROCESS_GID>",
the instance_id="a" (ceph_rgw_qlen{instance_id="a"}) will be
used. The list of moved Ceph RGW metrics is presented below.
List of affected Ceph RGW metrics
ceph_rgw_cache_.*
ceph_rgw_failed_req
ceph_rgw_gc_retire_object
ceph_rgw_get.*
ceph_rgw_keystone_.*
ceph_rgw_lc_.*
ceph_rgw_lua_.*
ceph_rgw_pubsub_.*
ceph_rgw_put.*
ceph_rgw_qactive
ceph_rgw_qlen
ceph_rgw_req
List of all metrics to be collected by Ceph Exporter instead of
Ceph Manager
The Container Cloud patch release 2.26.5, which is based on the
2.26.0 major release, provides the following updates:
Support for the patch Cluster releases 16.1.5
and 17.1.5 that represents Mirantis OpenStack for Kubernetes
(MOSK) patch release
24.1.5.
Bare metal: update of Ubuntu mirror from 20.04~20240502102020 to
20.04~20240517090228 along with update of minor kernel version from
5.15.0-105-generic to 5.15.0-107-generic.
Security fixes for CVEs in images.
Bug fixes.
This patch release also supports the latest major Cluster releases
17.1.0 and 16.1.0. And it does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available Cluster release
instead.
For main deliverables of the parent Container Cloud release of 2.26.5, refer
to 2.26.0.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since the Container Cloud 2.26.4 patch
release. The common CVEs are issues addressed across several images.
The following issues have been addressed in the Container Cloud patch release
2.26.5 along with the patch Cluster releases 17.1.5
and 16.1.5.
[42408] [bare metal] Fixed the issue with old versions of system
packages, including kernel, remaining on the manager nodes after cluster
update.
[41540] [LCM] Fixed the issue with lcm-agent failing to grab storage
information on a host and leaving lcmmachine.status.hostinfo.hardware
empty due to issues with managing physical NVME devices.
When trying to list the HostOSConfigurationModules and HostOSConfiguration custom resources, serviceuser or a user with
the global-admin or operator role obtains the access denied error.
For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that was changed is left without an external IP
address. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
After node maintenance of a management cluster, the newly added nodes may
fail to undergo provisioning successfully. The issue relates to new nodes
that are in the same L2 domain as the management cluster.
The issue was observed on environments having management cluster nodes
configured with a single L2 segment used for all network traffic
(PXE and LCM/management networks).
To verify whether the cluster is affected:
Verify whether the dnsmasq and dhcp-relay pods run on the same node
in the management cluster:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
LCM¶[39437] Failure to replace a master node on a Container Cloud cluster¶
During the replacement of a master node on a cluster of any type, the process
may get stuck with the Kubelet's NodeReady condition is Unknown message in the
machine status on the remaining master nodes.
As a workaround, log in to the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
Ceph¶[41819] Graceful cluster reboot is blocked by the Ceph ClusterWorkloadLocks¶
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
On large managed clusters, shard relocation may fail in the OpenSearch cluster
with the yellow or red status of the OpenSearch cluster.
The characteristic symptom of the issue is that in the stacklight
namespace, the statefulset.apps/opensearch-master containers are
experiencing throttling with the KubeContainersCPUThrottlingHigh alert
firing for the following set of labels:
The throttling that OpenSearch is experiencing may be a temporary
situation, which may be related, for example, to a peaky load and the
ongoing shards initialization as part of disaster recovery or after node
restart. In this case, Mirantis recommends waiting until initialization
of all shards is finished. After that, verify the cluster state and whether
throttling still exists. And only if throttling does not disappear, apply
the workaround below.
To verify that the initialization of shards is ongoing:
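For example, a hedged check through the OpenSearch API, run in
Management -> Dev Tools of the OpenSearch web UI; look for shards in the
INITIALIZING state:
GET _cat/shards?v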
The system response above indicates that shards from the
.ds-system-000072, .ds-system-000073, and .ds-audit-000001
indices are in the INITIALIZING state. In this case, Mirantis
recommends waiting until this process is finished, and only then consider
changing the limit.
You can additionally analyze the exact level of throttling and the current
CPU usage on the Kubernetes Containers dashboard in Grafana.
Workaround:
Verify the currently configured CPU requests and limits for the
opensearch containers:
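A hedged way to view the configured values directly from the StatefulSet,
assuming the container is named opensearch:
kubectl -n stacklight get statefulset opensearch-master -o jsonpath='{.spec.template.spec.containers[?(@.name=="opensearch")].resources}'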
In the example above, the CPU request is 500m and the CPU limit is
600m.
Increase the CPU limit to a reasonably high number.
For example, the default CPU limit for the clusters with the
clusterSize:large parameter set was increased from
8000m to 12000m for StackLight in Container Cloud 2.27.0
(Cluster releases 17.2.0 and 16.2.0).
If the CPU limit for the opensearch component is already set, increase
it in the Cluster object for the opensearch parameter. Otherwise,
the default StackLight limit is used. In this case, increase the CPU limit
for the opensearch component using the resources parameter.
Wait until all opensearch-master pods are recreated with the new CPU
limits and become running and ready.
To verify the current CPU limit for every opensearch container in every
opensearch-master pod separately:
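A hedged example of such a per-pod check, assuming the container is named
opensearch:
kubectl -n stacklight get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="opensearch")].resources.limits.cpu}{"\n"}{end}' | grep ^opensearch-master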
The waiting time may take up to 20 minutes depending on the cluster size.
If the issue is fixed, the KubeContainersCPUThrottlingHigh alert stops
firing immediately, while OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical can still be firing for some time during
shard relocation.
If the KubeContainersCPUThrottlingHigh alert is still firing, proceed with
another iteration of the CPU limit increase.
[40020] Rollover policy update is not applied to the current index¶
While updating rollover_policy for the current system* and audit*
data streams, the update is not applied to indices.
One of the indicators that the cluster is most likely affected is the
KubeJobFailed alert firing for the elasticsearch-curator job and one or
both of the following errors being present in elasticsearch-curator pods
that remain in the Error status:
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-audit-000001] is the write index for data stream [audit] and cannot be deleted')
or
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-system-000001] is the write index for data stream [system] and cannot be deleted')
Note
Instead of .ds-audit-000001 or .ds-system-000001 index names,
similar names can be present with the same prefix but different suffix
numbers.
If the above-mentioned alert and errors are present, immediate action is
required, because it indicates that the corresponding index size has already
exceeded the space allocated for the index.
To verify that the cluster is affected:
Caution
Verify and apply the workaround to both index patterns, system and
audit, separately.
If one of the indices is affected, the second one is most likely affected
as well. Although in rare cases, only one index may be affected.
The cluster is affected if the rollover policy is missing.
Otherwise, proceed to the following step.
Verify the system response from the previous step. For example:
{"_id":"system_rollover_policy","_version":7229,"_seq_no":42362,"_primary_term":28,"policy":{"policy_id":"system_rollover_policy","description":"system index rollover policy.","last_updated_time":1708505222430,"schema_version":19,"error_notification":null,"default_state":"rollover","states":[{"name":"rollover","actions":[{"retry":{"count":3,"backoff":"exponential","delay":"1m"},"rollover":{"min_size":"14746mb","copy_alias":false}}],"transitions":[]}],"ism_template":[{"index_patterns":["system*"],"priority":200,"last_updated_time":1708505222430}]}}
Verify and capture the following items separately for every policy:
The _seq_no and _primary_term values
The rollover policy threshold, which is defined in
policy.states[0].actions[0].rollover.min_size
If the rollover policy is not attached, the cluster is affected.
If the rollover policy is attached but _seq_no and _primary_term
numbers do not match the previously captured ones, the cluster is
affected.
If the index size drastically exceeds the defined threshold of the
rollover policy (which is the previously captured min_size),
the cluster is most probably affected.
Perform again the last step of the cluster verification procedure provided
above and make sure that the policy is attached to the index and has
the same _seq_no and _primary_term.
If the index size drastically exceeds the defined threshold of the
rollover policy (which is the previously captured min_size), wait
up to 15 minutes and verify that the additional index is created with
the consecutive number in the index name. For example:
system: if you applied changes to .ds-system-000001, wait until
.ds-system-000002 is created.
audit: if you applied changes to .ds-audit-000001, wait until
.ds-audit-000002 is created.
If such index is not created, escalate the issue to Mirantis support.
This section describes the specific actions you as a cloud operator need to
complete before or after your Container Cloud cluster update to the Cluster
releases 17.1.5 or 16.1.5.
To improve user update experience and make the update path more flexible,
Container Cloud is introducing a new scheme of updating between patch Cluster
releases. More specifically, Container Cloud intends to ultimately provide a
possibility to update to any newer patch version within a single series at any
point in time. The patch version downgrade is not supported.
However, in some cases, Mirantis may request that you update to a specific
patch version in the series to be able to update to the next major series.
This may be necessary due to the specifics of technical content already
released or planned for the release. For possible update paths in
MOSK in 24.1 and 24.2 series, see MOSK
documentation: Cluster update scheme.
The exact number of patch releases for the 16.1.x and 17.1.x series is yet to
be confirmed, but the current target is 7 releases.
Note
The management cluster update scheme remains the same.
A management cluster obtains the new product version automatically
after release.
Post-update actions¶Delete ‘HostOSConfiguration’ objects on baremetal-based clusters¶
If you use the HostOSConfiguration and HostOSConfigurationModules
custom resources for the bare metal provider, which are available in the
Technology Preview scope in Container Cloud 2.26.x, delete all
HostOSConfiguration objects right after update of your managed cluster to
the Cluster release 17.1.5 or 16.1.5, before automatic upgrade of the
management cluster to Container Cloud 2.27.0 (Cluster release 16.2.0).
After the upgrade, you can recreate the required objects using the updated
parameters.
This precautionary step prevents re-processing and re-applying of existing
configuration, which is defined in HostOSConfiguration objects, during
management cluster upgrade to 2.27.0. Such behavior is caused by changes in
the HostOSConfiguration API introduced in 2.27.0.
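A hedged example of the deletion; the lowercase plural resource name is an
assumption that follows the usual CRD naming convention:
kubectl get hostosconfigurations --all-namespaces
kubectl -n <projectNamespace> delete hostosconfigurations --all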
Configure Kubernetes auditing and profiling for log rotation¶
Note
Skip this procedure if you have already completed it after updating
your managed cluster to Container Cloud 2.26.4 (Cluster release 17.1.4 or
16.1.4).
After the MKE update to 3.7.8, if you are going to enable or already enabled
Kubernetes auditing and profiling on your managed or management cluster,
keep in mind that enabling audit log rotation requires an additional
step. Set the following options in the MKE configuration file after enabling
auditing and profiling:
This section lists the artifacts of components included in the Container Cloud
patch release 2.26.5. For artifacts of the Cluster releases introduced in
2.26.5, see patch Cluster releases 17.1.5 and 16.1.5.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Container Cloud patch release 2.26.4, which is based on the
2.26.0 major release, provides the following updates:
Support for the patch Cluster releases 16.1.4
and 17.1.4 that represents Mirantis OpenStack for Kubernetes
(MOSK) patch release
24.1.4.
Support for MKE 3.7.8.
Bare metal: update of Ubuntu mirror from 20.04~20240411171541 to
20.04~20240502102020 along with update of minor kernel version from
5.15.0-102-generic to 5.15.0-105-generic.
Security fixes for CVEs in images.
Bug fixes.
This patch release also supports the latest major Cluster releases
17.1.0 and 16.1.0. And it does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available Cluster release
instead.
For main deliverables of the parent Container Cloud release of 2.26.4, refer
to 2.26.0.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since the Container Cloud 2.26.3 patch
release. The common CVEs are issues addressed across several images.
The following issues have been addressed in the Container Cloud patch release
2.26.4 along with the patch Cluster releases 17.1.4
and 16.1.4.
[41806] [Container Cloud web UI] Fixed the issue with failure to
configure management cluster using the Configure cluster web UI
menu without updating the Keycloak Truststore settings.
When trying to list the HostOSConfigurationModules and HostOSConfiguration custom resources, serviceuser or a user with
the global-admin or operator role obtains the access denied error.
For example:
After managed cluster update, old versions of system packages, including
kernel, may remain on the manager nodes. This issue occurs because the task
responsible for updating packages fails to run after updating Ubuntu mirrors.
As a workaround, manually run apt-get upgrade on every manager
node after the cluster update but before rebooting the node.
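For example, a hedged sketch to run on each manager node during a maintenance
window:
sudo apt-get update
sudo apt-get upgrade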
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that was changed is left without an external IP
address. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
After node maintenance of a management cluster, the newly added nodes may
fail to undergo provisioning successfully. The issue relates to new nodes
that are in the same L2 domain as the management cluster.
The issue was observed on environments having management cluster nodes
configured with a single L2 segment used for all network traffic
(PXE and LCM/management networks).
To verify whether the cluster is affected:
Verify whether the dnsmasq and dhcp-relay pods run on the same node
in the management cluster:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
LCM¶[41540] LCM Agent cannot grab storage information on a host¶
Due to issues with managing physical NVME devices, lcm-agent cannot grab
storage information on a host. As a result,
lcmmachine.status.hostinfo.hardware is empty and the following example
error is present in logs:
{"level":"error","ts":"2024-05-02T12:26:10Z","logger":"agent",\"msg":"get hardware details",\"host":"kaas-node-548b2861-aed0-41c9-8ff2-10c5476b000b",\"error":"new storage info: get disk info \"nvme0c0n1\": \invoke command: exit status 1","errorVerbose":"exit status 1
As a workaround, on the affected node, create a symlink for any device
indicated in lcm-agent logs. For example:
ln -sfn /dev/nvme0n1 /dev/nvme0c0n1
[39437] Failure to replace a master node on a Container Cloud cluster¶
During the replacement of a master node on a cluster of any type, the process
may get stuck with the Kubelet's NodeReady condition is Unknown message in the
machine status on the remaining master nodes.
As a workaround, log in to the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
Ceph¶[41819] Graceful cluster reboot is blocked by the Ceph ClusterWorkloadLocks¶
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
On large managed clusters, shard relocation may fail in the OpenSearch cluster
with the yellow or red status of the OpenSearch cluster.
The characteristic symptom of the issue is that in the stacklight
namespace, the statefulset.apps/opensearch-master containers are
experiencing throttling with the KubeContainersCPUThrottlingHigh alert
firing for the following set of labels:
The throttling that OpenSearch is experiencing may be a temporary
situation, which may be related, for example, to a peaky load and the
ongoing shards initialization as part of disaster recovery or after node
restart. In this case, Mirantis recommends waiting until initialization
of all shards is finished. After that, verify the cluster state and whether
throttling still exists. And only if throttling does not disappear, apply
the workaround below.
To verify that the initialization of shards is ongoing:
The system response above indicates that shards from the
.ds-system-000072, .ds-system-000073, and .ds-audit-000001
indices are in the INITIALIZING state. In this case, Mirantis
recommends waiting until this process is finished, and only then consider
changing the limit.
You can additionally analyze the exact level of throttling and the current
CPU usage on the Kubernetes Containers dashboard in Grafana.
Workaround:
Verify the currently configured CPU requests and limits for the
opensearch containers:
In the example above, the CPU request is 500m and the CPU limit is
600m.
Increase the CPU limit to a reasonably high number.
For example, the default CPU limit for the clusters with the
clusterSize:large parameter set was increased from
8000m to 12000m for StackLight in Container Cloud 2.27.0
(Cluster releases 17.2.0 and 16.2.0).
If the CPU limit for the opensearch component is already set, increase
it in the Cluster object for the opensearch parameter. Otherwise,
the default StackLight limit is used. In this case, increase the CPU limit
for the opensearch component using the resources parameter.
Wait until all opensearch-master pods are recreated with the new CPU
limits and become running and ready.
To verify the current CPU limit for every opensearch container in every
opensearch-master pod separately:
The waiting time may take up to 20 minutes depending on the cluster size.
If the issue is fixed, the KubeContainersCPUThrottlingHigh alert stops
firing immediately, while OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical can still be firing for some time during
shard relocation.
If the KubeContainersCPUThrottlingHigh alert is still firing, proceed with
another iteration of the CPU limit increase.
[40020] Rollover policy update is not applied to the current index¶
While updating rollover_policy for the current system* and audit*
data streams, the update is not applied to indices.
One of the indicators that the cluster is most likely affected is the
KubeJobFailed alert firing for the elasticsearch-curator job and one or
both of the following errors being present in elasticsearch-curator pods
that remain in the Error status:
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-audit-000001] is the write index for data stream [audit] and cannot be deleted')
or
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-system-000001] is the write index for data stream [system] and cannot be deleted')
Note
Instead of .ds-audit-000001 or .ds-system-000001 index names,
similar names can be present with the same prefix but different suffix
numbers.
If the above-mentioned alert and errors are present, immediate action is
required, because it indicates that the corresponding index size has already
exceeded the space allocated for the index.
To verify that the cluster is affected:
Caution
Verify and apply the workaround to both index patterns, system and
audit, separately.
If one of the indices is affected, the other one is most likely affected
as well, although in rare cases only one index may be affected.
The cluster is affected if the rollover policy is missing.
Otherwise, proceed to the following step.
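As a sketch, assuming the OpenSearch Index State Management (ISM) API is
reachable from inside the opensearch-master-0 Pod without authentication
(the Pod name, container name, and index name are assumptions):
kubectl -n stacklight exec opensearch-master-0 -c opensearch -- \
  curl -s "localhost:9200/_plugins/_ism/policies/system_rollover_policy"
kubectl -n stacklight exec opensearch-master-0 -c opensearch -- \
  curl -s "localhost:9200/_plugins/_ism/explain/.ds-system-000001"
The first command returns the policy itself, the second one shows the policy
currently attached to the write index.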
Verify the system response from the previous step. For example:
{"_id":"system_rollover_policy","_version":7229,"_seq_no":42362,"_primary_term":28,"policy":{"policy_id":"system_rollover_policy","description":"system index rollover policy.","last_updated_time":1708505222430,"schema_version":19,"error_notification":null,"default_state":"rollover","states":[{"name":"rollover","actions":[{"retry":{"count":3,"backoff":"exponential","delay":"1m"},"rollover":{"min_size":"14746mb","copy_alias":false}}],"transitions":[]}],"ism_template":[{"index_patterns":["system*"],"priority":200,"last_updated_time":1708505222430}]}}
Verify and capture the following items separately for every policy:
The _seq_no and _primary_term values
The rollover policy threshold, which is defined in
policy.states[0].actions[0].rollover.min_size
If the rollover policy is not attached, the cluster is affected.
If the rollover policy is attached but _seq_no and _primary_term
numbers do not match the previously captured ones, the cluster is
affected.
If the index size drastically exceeds the defined threshold of the
rollover policy (which is the previously captured min_size),
the cluster is most probably affected.
Repeat the last step of the cluster verification procedure provided above
and make sure that the policy is attached to the index and has
the same _seq_no and _primary_term.
If the index size drastically exceeds the defined threshold of the
rollover policy (which is the previously captured min_size), wait
up to 15 minutes and verify that the additional index is created with
the consecutive number in the index name. For example:
system: if you applied changes to .ds-system-000001, wait until
.ds-system-000002 is created.
audit: if you applied changes to .ds-audit-000001, wait until
.ds-audit-000002 is created.
If such an index is not created, escalate the issue to Mirantis support.
This section describes the specific actions you as a cloud operator need to
complete before or after your Container Cloud cluster update to the Cluster
releases 17.1.4 or 16.1.4.
Post-update actions¶Configure Kubernetes auditing and profiling for log rotation¶
After the MKE update to 3.7.8, if you plan to enable or have already enabled
Kubernetes auditing and profiling on your managed or management cluster,
keep in mind that enabling audit log rotation requires an additional
step. Set the following options in the MKE configuration file after enabling
auditing and profiling:
This section lists the artifacts of components included in the Container Cloud
patch release 2.26.4. For artifacts of the Cluster releases introduced in
2.26.4, see patch Cluster releases 17.1.4 and 16.1.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Container Cloud patch release 2.26.3, which is based on the
2.26.0 major release, provides the following updates:
Support for the patch Cluster releases 16.1.3
and 17.1.3 that represent the Mirantis OpenStack for Kubernetes
(MOSK) patch release
24.1.3.
Support for MKE 3.7.7.
Bare metal: update of Ubuntu mirror from 20.04~20240324172903 to
20.04~20240411171541 along with update of minor kernel version from
5.15.0-101-generic to 5.15.0-102-generic.
Security fixes for CVEs in images.
Bug fixes.
This patch release also supports the latest major Cluster releases
17.1.0 and 16.1.0. It does not support greenfield
deployments based on deprecated Cluster releases; use the latest available
Cluster release instead.
For main deliverables of the parent Container Cloud release of 2.26.3, refer
to 2.26.0.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since the Container Cloud 2.26.2 patch
release. The common CVEs are issues addressed across several images.
When trying to list the HostOSConfigurationModules and HostOSConfiguration
custom resources, serviceuser or a user with
the global-admin or operator role obtains the access denied error.
For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that was changed remains without an external IP
address assigned. However, the second service, which was changed later, has
the external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
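For example, as a sketch, add or update a throwaway annotation on the
affected Service to trigger reconciliation (the annotation key is arbitrary
and serves only as the dummy change):
kubectl -n <namespace> annotate service <serviceName> \
  dummy-change="$(date +%s)" --overwrite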
After node maintenance of a management cluster, the newly added nodes may
fail to undergo provisioning successfully. The issue relates to new nodes
that are in the same L2 domain as the management cluster.
The issue was observed on environments having management cluster nodes
configured with a single L2 segment used for all network traffic
(PXE and LCM/management networks).
To verify whether the cluster is affected:
Verify whether the dnsmasq and dhcp-relay pods run on the same node
in the management cluster:
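A minimal sketch, assuming the dnsmasq and dhcp-relay Pods run in the kaas
namespace of the management cluster (the namespace and Pod name prefixes are
assumptions):
kubectl -n kaas get pods -o wide | grep -E 'dnsmasq|dhcp-relay'
If both Pods are scheduled to the same node in the NODE column, the cluster
may be affected.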
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
LCM¶[41540] LCM Agent cannot grab storage information on a host¶
Due to issues with managing physical NVME devices, lcm-agent cannot grab
storage information on a host. As a result,
lcmmachine.status.hostinfo.hardware is empty and the following example
error is present in logs:
{"level":"error","ts":"2024-05-02T12:26:10Z","logger":"agent",\"msg":"get hardware details",\"host":"kaas-node-548b2861-aed0-41c9-8ff2-10c5476b000b",\"error":"new storage info: get disk info \"nvme0c0n1\": \invoke command: exit status 1","errorVerbose":"exit status 1
As a workaround, on the affected node, create a symlink for any device
indicated in lcm-agent logs. For example:
ln -sfn /dev/nvme0n1 /dev/nvme0c0n1
[39437] Failure to replace a master node on a Container Cloud cluster¶
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
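A rough sketch of such an alias; the image tag, volume mounts, and datastore
environment variables depend on your MKE version and node configuration and
are assumptions here, not the verbatim product command:
alias calicoctl='docker run -i --rm \
  --pid host --net host \
  -v /var/run/calico:/var/run/calico \
  mirantis/ucp-dsinfo:<mkeVersion> \
  calicoctl'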
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
Ceph¶[41819] Graceful cluster reboot is blocked by the Ceph ClusterWorkloadLocks¶
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with
a PersistentVolumeClaim stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
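A minimal sketch of such a verification; the FailedMount events, if any,
appear in the Events section of the output:
kubectl -n <affectedProjectName> describe pod <affectedPodName>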
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
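A minimal sketch, assuming the Ceph CSI plugin runs as a DaemonSet in the
rook-ceph namespace with the app=csi-rbdplugin label (the namespace and label
are assumptions):
kubectl -n rook-ceph get pods -l app=csi-rbdplugin -o wide | grep <nodeName>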
On large managed clusters, shard relocation may fail in the OpenSearch
cluster, with the OpenSearch cluster status turning yellow or red.
The characteristic symptom of the issue is that in the stacklight
namespace, the statefulset.apps/opensearch-master containers are
experiencing throttling with the KubeContainersCPUThrottlingHigh alert
firing for the following set of labels:
The throttling that OpenSearch is experiencing may be a temporary
situation, which may be related, for example, to a peak load and the
ongoing shard initialization as part of disaster recovery or after a node
restart. In this case, Mirantis recommends waiting until initialization
of all shards is finished. After that, verify the cluster state and whether
throttling still exists. Apply the workaround below only if throttling
persists.
To verify that the initialization of shards is ongoing:
The system response above indicates that shards from the
.ds-system-000072, .ds-system-000073, and .ds-audit-000001
indices are in the INITIALIZING state. In this case, Mirantis
recommends waiting until this process finishes and only then considering
a change of the limit.
You can additionally analyze the exact level of throttling and the current
CPU usage on the Kubernetes Containers dashboard in Grafana.
Workaround:
Verify the currently configured CPU requests and limits for the
opensearch containers:
In the example above, the CPU request is 500m and the CPU limit is
600m.
Increase the CPU limit to a reasonably high number.
For example, the default CPU limit for clusters with the
clusterSize: large parameter set was increased from
8000m to 12000m for StackLight in Container Cloud 2.27.0
(Cluster releases 17.2.0 and 16.2.0).
If the CPU limit for the opensearch component is already set, increase
it in the Cluster object for the opensearch parameter. Otherwise,
the default StackLight limit is used. In this case, increase the CPU limit
for the opensearch component using the resources parameter.
Wait until all opensearch-master pods are recreated with the new CPU
limits and become running and ready.
To verify the current CPU limit for every opensearch container in every
opensearch-master pod separately:
The waiting time may take up to 20 minutes depending on the cluster size.
If the issue is fixed, the KubeContainersCPUThrottlingHigh alert stops
firing immediately, while OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical can still be firing for some time during
shard relocation.
If the KubeContainersCPUThrottlingHigh alert is still firing, proceed with
another iteration of the CPU limit increase.
[40020] Rollover policy update is not applied to the current index¶
While updating rollover_policy for the current system* and audit*
data streams, the update is not applied to indices.
One of the indicators that the cluster is most likely affected is the
KubeJobFailed alert firing for the elasticsearch-curator job and one or
both of the following errors being present in elasticsearch-curator pods
that remain in the Error status:
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-audit-000001] is the write index for data stream [audit] and cannot be deleted')
or
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-system-000001] is the write index for data stream [system] and cannot be deleted')
Note
Instead of .ds-audit-000001 or .ds-system-000001 index names,
similar names can be present with the same prefix but different suffix
numbers.
If the above-mentioned alert and errors are present, immediate action is
required because the corresponding index size has already exceeded the space
allocated for the index.
To verify that the cluster is affected:
Caution
Verify and apply the workaround to both index patterns, system and
audit, separately.
If one of the indices is affected, the other one is most likely affected
as well, although in rare cases only one index may be affected.
The cluster is affected if the rollover policy is missing.
Otherwise, proceed to the following step.
Verify the system response from the previous step. For example:
{"_id":"system_rollover_policy","_version":7229,"_seq_no":42362,"_primary_term":28,"policy":{"policy_id":"system_rollover_policy","description":"system index rollover policy.","last_updated_time":1708505222430,"schema_version":19,"error_notification":null,"default_state":"rollover","states":[{"name":"rollover","actions":[{"retry":{"count":3,"backoff":"exponential","delay":"1m"},"rollover":{"min_size":"14746mb","copy_alias":false}}],"transitions":[]}],"ism_template":[{"index_patterns":["system*"],"priority":200,"last_updated_time":1708505222430}]}}
Verify and capture the following items separately for every policy:
The _seq_no and _primary_term values
The rollover policy threshold, which is defined in
policy.states[0].actions[0].rollover.min_size
If the rollover policy is not attached, the cluster is affected.
If the rollover policy is attached but _seq_no and _primary_term
numbers do not match the previously captured ones, the cluster is
affected.
If the index size drastically exceeds the defined threshold of the
rollover policy (which is the previously captured min_size),
the cluster is most probably affected.
Repeat the last step of the cluster verification procedure provided above
and make sure that the policy is attached to the index and has
the same _seq_no and _primary_term.
If the index size drastically exceeds the defined threshold of the
rollover policy (which is the previously captured min_size), wait
up to 15 minutes and verify that the additional index is created with
the consecutive number in the index name. For example:
system: if you applied changes to .ds-system-000001, wait until
.ds-system-000002 is created.
audit: if you applied changes to .ds-audit-000001, wait until
.ds-audit-000002 is created.
If such an index is not created, escalate the issue to Mirantis support.
Container Cloud web UI¶[41806] Configuration of a management cluster fails without Keycloak settings¶
During configuration of management cluster settings using the
Configure cluster web UI menu, updating the Keycloak Truststore
settings is mandatory, although it should be optional.
As a workaround, update the management cluster using the API or CLI.
This section lists the artifacts of components included in the Container Cloud
patch release 2.26.3. For artifacts of the Cluster releases introduced in
2.26.3, see patch Cluster releases 17.1.3 and 16.1.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Container Cloud patch release 2.26.2, which is based on the
2.26.0 major release, provides the following updates:
Support for the patch Cluster releases 16.1.2
and 17.1.2 that represent the Mirantis OpenStack for Kubernetes
(MOSK) patch release
24.1.2.
Support for MKE 3.7.6.
Support for docker-ee-cli 23.0.10 in MCR 23.0.9 to fix several CVEs.
Bare metal: update of Ubuntu mirror from 20.04~20240302175618 to
20.04~20240324172903 along with update of minor kernel version from
5.15.0-97-generic to 5.15.0-101-generic.
Security fixes for CVEs in images.
This patch release also supports the latest major Cluster releases
17.1.0 and 16.1.0. It does not support greenfield
deployments based on deprecated Cluster releases; use the latest available
Cluster release instead.
For main deliverables of the parent Container Cloud release of 2.26.2, refer
to 2.26.0.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since the Container Cloud 2.26.1 patch
release. The common CVEs are issues addressed across several images.
When trying to list the HostOSConfigurationModules and HostOSConfiguration
custom resources, serviceuser or a user with
the global-admin or operator role obtains the access denied error.
For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that was changed remains without an external IP
address assigned. However, the second service, which was changed later, has
the external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
After node maintenance of a management cluster, the newly added nodes may
fail to undergo provisioning successfully. The issue relates to new nodes
that are in the same L2 domain as the management cluster.
The issue was observed on environments having management cluster nodes
configured with a single L2 segment used for all network traffic
(PXE and LCM/management networks).
To verify whether the cluster is affected:
Verify whether the dnsmasq and dhcp-relay pods run on the same node
in the management cluster:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
LCM¶[41540] LCM Agent cannot grab storage information on a host¶
Due to issues with managing physical NVME devices, lcm-agent cannot grab
storage information on a host. As a result,
lcmmachine.status.hostinfo.hardware is empty and the following example
error is present in logs:
{"level":"error","ts":"2024-05-02T12:26:10Z","logger":"agent",\"msg":"get hardware details",\"host":"kaas-node-548b2861-aed0-41c9-8ff2-10c5476b000b",\"error":"new storage info: get disk info \"nvme0c0n1\": \invoke command: exit status 1","errorVerbose":"exit status 1
As a workaround, on the affected node, create a symlink for any device
indicated in lcm-agent logs. For example:
ln -sfn /dev/nvme0n1 /dev/nvme0c0n1
[40811] Pod is stuck in the Terminating state on the deleted node¶
During deletion of a machine, the related DaemonSet Pod can remain on the
deleted node in the Terminating state. As a workaround, manually
delete the Pod:
kubectl delete pod -n <podNamespace> <podName>
[39437] Failure to replace a master node on a Container Cloud cluster¶
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
Ceph¶[41819] Graceful cluster reboot is blocked by the Ceph ClusterWorkloadLocks¶
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with
a PersistentVolumeClaim stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
On large managed clusters, shard relocation may fail in the OpenSearch
cluster, with the OpenSearch cluster status turning yellow or red.
The characteristic symptom of the issue is that in the stacklight
namespace, the statefulset.apps/opensearch-master containers are
experiencing throttling with the KubeContainersCPUThrottlingHigh alert
firing for the following set of labels:
The throttling that OpenSearch is experiencing may be a temporary
situation, which may be related, for example, to a peak load and the
ongoing shard initialization as part of disaster recovery or after a node
restart. In this case, Mirantis recommends waiting until initialization
of all shards is finished. After that, verify the cluster state and whether
throttling still exists. Apply the workaround below only if throttling
persists.
To verify that the initialization of shards is ongoing:
The system response above indicates that shards from the
.ds-system-000072, .ds-system-000073, and .ds-audit-000001
indices are in the INITIALIZING state. In this case, Mirantis
recommends waiting until this process finishes and only then considering
a change of the limit.
You can additionally analyze the exact level of throttling and the current
CPU usage on the Kubernetes Containers dashboard in Grafana.
Workaround:
Verify the currently configured CPU requests and limits for the
opensearch containers:
In the example above, the CPU request is 500m and the CPU limit is
600m.
Increase the CPU limit to a reasonably high number.
For example, the default CPU limit for clusters with the
clusterSize: large parameter set was increased from
8000m to 12000m for StackLight in Container Cloud 2.27.0
(Cluster releases 17.2.0 and 16.2.0).
If the CPU limit for the opensearch component is already set, increase
it in the Cluster object for the opensearch parameter. Otherwise,
the default StackLight limit is used. In this case, increase the CPU limit
for the opensearch component using the resources parameter.
Wait until all opensearch-master pods are recreated with the new CPU
limits and become running and ready.
To verify the current CPU limit for every opensearch container in every
opensearch-master pod separately:
The waiting time may take up to 20 minutes depending on the cluster size.
If the issue is fixed, the KubeContainersCPUThrottlingHigh alert stops
firing immediately, while OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical can still be firing for some time during
shard relocation.
If the KubeContainersCPUThrottlingHigh alert is still firing, proceed with
another iteration of the CPU limit increase.
[40020] Rollover policy update is not applied to the current index¶
While updating rollover_policy for the current system* and audit*
data streams, the update is not applied to indices.
One of the indicators that the cluster is most likely affected is the
KubeJobFailed alert firing for the elasticsearch-curator job and one or
both of the following errors being present in elasticsearch-curator pods
that remain in the Error status:
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-audit-000001] is the write index for data stream [audit] and cannot be deleted')
or
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-system-000001] is the write index for data stream [system] and cannot be deleted')
Note
Instead of .ds-audit-000001 or .ds-system-000001 index names,
similar names can be present with the same prefix but different suffix
numbers.
If the above-mentioned alert and errors are present, immediate action is
required because the corresponding index size has already exceeded the space
allocated for the index.
To verify that the cluster is affected:
Caution
Verify and apply the workaround to both index patterns, system and
audit, separately.
If one of the indices is affected, the other one is most likely affected
as well, although in rare cases only one index may be affected.
The cluster is affected if the rollover policy is missing.
Otherwise, proceed to the following step.
Verify the system response from the previous step. For example:
{"_id":"system_rollover_policy","_version":7229,"_seq_no":42362,"_primary_term":28,"policy":{"policy_id":"system_rollover_policy","description":"system index rollover policy.","last_updated_time":1708505222430,"schema_version":19,"error_notification":null,"default_state":"rollover","states":[{"name":"rollover","actions":[{"retry":{"count":3,"backoff":"exponential","delay":"1m"},"rollover":{"min_size":"14746mb","copy_alias":false}}],"transitions":[]}],"ism_template":[{"index_patterns":["system*"],"priority":200,"last_updated_time":1708505222430}]}}
Verify and capture the following items separately for every policy:
The _seq_no and _primary_term values
The rollover policy threshold, which is defined in
policy.states[0].actions[0].rollover.min_size
If the rollover policy is not attached, the cluster is affected.
If the rollover policy is attached but _seq_no and _primary_term
numbers do not match the previously captured ones, the cluster is
affected.
If the index size drastically exceeds the defined threshold of the
rollover policy (which is the previously captured min_size),
the cluster is most probably affected.
Repeat the last step of the cluster verification procedure provided above
and make sure that the policy is attached to the index and has
the same _seq_no and _primary_term.
If the index size drastically exceeds the defined threshold of the
rollover policy (which is the previously captured min_size), wait
up to 15 minutes and verify that the additional index is created with
the consecutive number in the index name. For example:
system: if you applied changes to .ds-system-000001, wait until
.ds-system-000002 is created.
audit: if you applied changes to .ds-audit-000001, wait until
.ds-audit-000002 is created.
If such an index is not created, escalate the issue to Mirantis support.
Container Cloud web UI¶[41806] Configuration of a management cluster fails without Keycloak settings¶
During configuration of management cluster settings using the
Configure cluster web UI menu, updating the Keycloak Truststore
settings is mandatory, although it should be optional.
As a workaround, update the management cluster using the API or CLI.
This section lists the artifacts of components included in the Container Cloud
patch release 2.26.2. For artifacts of the Cluster releases introduced in
2.26.2, see patch Cluster releases 17.1.2 and 16.1.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Container Cloud patch release 2.26.1, which is based on the
2.26.0 major release, provides the following updates:
Support for the patch Cluster releases 16.1.1
and 17.1.1 that represent the Mirantis OpenStack for Kubernetes
(MOSK) patch release
24.1.1.
Delivery mechanism for CVE fixes on Ubuntu in bare metal clusters that
includes update of Ubuntu kernel minor version.
For details, see Enhancements.
Security fixes for CVEs in images.
This patch release also supports the latest major Cluster releases
17.1.0 and 16.1.0. It does not support greenfield
deployments based on deprecated Cluster releases; use the latest available
Cluster release instead.
For main deliverables of the parent Container Cloud release of 2.26.1, refer
to 2.26.0.
This section outlines new features and enhancements introduced in the
Container Cloud patch release 2.26.1 along with Cluster releases 17.1.1 and
16.1.1.
Delivery mechanism for CVE fixes on Ubuntu in bare metal clusters¶
Introduced the ability to update Ubuntu packages including kernel minor
version update, when available in a Cluster release, for both management and
managed bare metal clusters to address CVE issues on a host operating system.
On management clusters, the update of Ubuntu mirror along with the update
of minor kernel version occurs automatically with cordon-drain and reboot
of machines.
On managed clusters, the update of Ubuntu mirror along with the update
of minor kernel version applies during a manual cluster update without
automatic cordon-drain and reboot of machines. After a managed cluster
update, all cluster machines have the reboot is required notification.
You can manually handle the reboot of machines during a convenient
maintenance window using
GracefulRebootRequest.
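A hypothetical minimal example of such an object, assuming the resource is
created in the cluster project with an explicit list of machines to reboot
(the field layout is an assumption based on the resource name, not a verbatim
product schema):
apiVersion: kaas.mirantis.com/v1alpha1
kind: GracefulRebootRequest
metadata:
  name: <clusterName>
  namespace: <projectName>
spec:
  machines:
    - <machineName-1>
    - <machineName-2>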
This section lists the artifacts of components included in the Container Cloud
patch release 2.26.1. For artifacts of the Cluster releases introduced in
2.26.1, see patch Cluster releases 17.1.1 and 16.1.1.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since the Container Cloud 2.26.0 major
release. The common CVEs are issues addressed across several images.
The following issues have been addressed in the Container Cloud patch release
2.26.1 along with the patch Cluster releases 17.1.1
and 16.1.1.
[39330] [StackLight] Fixed the issue with the OpenSearch cluster being
stuck due to initializing replica shards.
[39220] [StackLight] Fixed the issue with Patroni failure due to no limit
configuration for the max_timelines_history parameter.
[39080] [StackLight] Fixed the issue with the
OpenSearchClusterStatusWarning alert firing during cluster upgrade if
StackLight is deployed in the HA mode.
[38970] [StackLight] Fixed the issue with the Logs dashboard
in the OpenSearch Dashboards web UI not working for the system index.
[38937] [StackLight] Fixed the issue with the
View logs in OpenSearch Dashboards link not working in the
Grafana web UI.
[40747] [vSphere] Fixed the issue with the unsupported Cluster release
being available for greenfield vSphere-based managed cluster deployments
in the drop-down menu of the cluster creation window in the Container Cloud
web UI.
[40036] [LCM] Fixed the issue causing nodes to remain in the
Kubernetes cluster when the corresponding Machine object is disabled
during cluster update.
When trying to list the HostOSConfigurationModules and HostOSConfiguration
custom resources, serviceuser or a user with
the global-admin or operator role obtains the access denied error.
For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that was changed remains without an external IP
address assigned. However, the second service, which was changed later, has
the external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
After node maintenance of a management cluster, the newly added nodes may
fail to undergo provisioning successfully. The issue relates to new nodes
that are in the same L2 domain as the management cluster.
The issue was observed on environments having management cluster nodes
configured with a single L2 segment used for all network traffic
(PXE and LCM/management networks).
To verify whether the cluster is affected:
Verify whether the dnsmasq and dhcp-relay pods run on the same node
in the management cluster:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
LCM¶[41540] LCM Agent cannot grab storage information on a host¶
Due to issues with managing physical NVME devices, lcm-agent cannot grab
storage information on a host. As a result,
lcmmachine.status.hostinfo.hardware is empty and the following example
error is present in logs:
{"level":"error","ts":"2024-05-02T12:26:10Z","logger":"agent",\"msg":"get hardware details",\"host":"kaas-node-548b2861-aed0-41c9-8ff2-10c5476b000b",\"error":"new storage info: get disk info \"nvme0c0n1\": \invoke command: exit status 1","errorVerbose":"exit status 1
As a workaround, on the affected node, create a symlink for any device
indicated in lcm-agent logs. For example:
ln -sfn /dev/nvme0n1 /dev/nvme0c0n1
[40811] Pod is stuck in the Terminating state on the deleted node¶
During deletion of a machine, the related DaemonSet Pod can remain on the
deleted node in the Terminating state. As a workaround, manually
delete the Pod:
kubectl delete pod -n <podNamespace> <podName>
[39437] Failure to replace a master node on a Container Cloud cluster¶
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
Ceph¶[41819] Graceful cluster reboot is blocked by the Ceph ClusterWorkloadLocks¶
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with
a PersistentVolumeClaim stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contains the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
On large managed clusters, shard relocation may fail in the OpenSearch
cluster, with the OpenSearch cluster status turning yellow or red.
The characteristic symptom of the issue is that in the stacklight
namespace, the statefulset.apps/opensearch-master containers are
experiencing throttling with the KubeContainersCPUThrottlingHigh alert
firing for the following set of labels:
The throttling that OpenSearch is experiencing may be a temporary
situation, which may be related, for example, to a peak load and the
ongoing shard initialization as part of disaster recovery or after a node
restart. In this case, Mirantis recommends waiting until initialization
of all shards is finished. After that, verify the cluster state and whether
throttling still exists. Apply the workaround below only if throttling
persists.
To verify that the initialization of shards is ongoing:
The system response above indicates that shards from the
.ds-system-000072, .ds-system-000073, and .ds-audit-000001
indices are in the INITIALIZING state. In this case, Mirantis
recommends waiting until this process finishes and only then considering
a change of the limit.
You can additionally analyze the exact level of throttling and the current
CPU usage on the Kubernetes Containers dashboard in Grafana.
Workaround:
Verify the currently configured CPU requests and limits for the
opensearch containers:
In the example above, the CPU request is 500m and the CPU limit is
600m.
Increase the CPU limit to a reasonably high number.
For example, the default CPU limit for clusters with the
clusterSize: large parameter set was increased from
8000m to 12000m for StackLight in Container Cloud 2.27.0
(Cluster releases 17.2.0 and 16.2.0).
If the CPU limit for the opensearch component is already set, increase
it in the Cluster object for the opensearch parameter. Otherwise,
the default StackLight limit is used. In this case, increase the CPU limit
for the opensearch component using the resources parameter.
Wait until all opensearch-master pods are recreated with the new CPU
limits and become running and ready.
To verify the current CPU limit for every opensearch container in every
opensearch-master pod separately:
The waiting time may take up to 20 minutes depending on the cluster size.
If the issue is fixed, the KubeContainersCPUThrottlingHigh alert stops
firing immediately, while OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical can still be firing for some time during
shard relocation.
If the KubeContainersCPUThrottlingHigh alert is still firing, proceed with
another iteration of the CPU limit increase.
[40020] Rollover policy update is not applied to the current index¶
While updating rollover_policy for the current system* and audit*
data streams, the update is not applied to indices.
One of the indicators that the cluster is most likely affected is the
KubeJobFailed alert firing for the elasticsearch-curator job and one or
both of the following errors being present in elasticsearch-curator pods
that remain in the Error status:
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-audit-000001] is the write index for data stream [audit] and cannot be deleted')
or
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-system-000001] is the write index for data stream [system] and cannot be deleted')
Note
Instead of .ds-audit-000001 or .ds-system-000001 index names,
similar names can be present with the same prefix but different suffix
numbers.
If the above-mentioned alert and errors are present, immediate action is
required because the corresponding index size has already exceeded the space
allocated for the index.
To verify that the cluster is affected:
Caution
Verify and apply the workaround to both index patterns, system and
audit, separately.
If one of the indices is affected, the other one is most likely affected
as well, although in rare cases only one index may be affected.
The cluster is affected if the rollover policy is missing.
Otherwise, proceed to the following step.
Verify the system response from the previous step. For example:
{"_id":"system_rollover_policy","_version":7229,"_seq_no":42362,"_primary_term":28,"policy":{"policy_id":"system_rollover_policy","description":"system index rollover policy.","last_updated_time":1708505222430,"schema_version":19,"error_notification":null,"default_state":"rollover","states":[{"name":"rollover","actions":[{"retry":{"count":3,"backoff":"exponential","delay":"1m"},"rollover":{"min_size":"14746mb","copy_alias":false}}],"transitions":[]}],"ism_template":[{"index_patterns":["system*"],"priority":200,"last_updated_time":1708505222430}]}}
Verify and capture the following items separately for every policy:
The _seq_no and _primary_term values
The rollover policy threshold, which is defined in
policy.states[0].actions[0].rollover.min_size
If the rollover policy is not attached, the cluster is affected.
If the rollover policy is attached but _seq_no and _primary_term
numbers do not match the previously captured ones, the cluster is
affected.
If the index size drastically exceeds the defined threshold of the
rollover policy (which is the previously captured min_size),
the cluster is most probably affected.
Repeat the last step of the cluster verification procedure provided above
and make sure that the policy is attached to the index and has
the same _seq_no and _primary_term.
If the index size drastically exceeds the defined threshold of the
rollover policy (which is the previously captured min_size), wait
up to 15 minutes and verify that the additional index is created with
the consecutive number in the index name. For example:
system: if you applied changes to .ds-system-000001, wait until
.ds-system-000002 is created.
audit: if you applied changes to .ds-audit-000001, wait until
.ds-audit-000002 is created.
If such an index is not created, escalate the issue to Mirantis support.
Container Cloud web UI¶[41806] Configuration of a management cluster fails without Keycloak settings¶
During configuration of management cluster settings using the
Configure cluster web UI menu, updating the Keycloak Truststore
settings is mandatory, although it should be optional.
As a workaround, update the management cluster using the API or CLI.
Does not support greenfield deployments on deprecated Cluster releases
of the 17.0.x and 16.0.x series. Use the latest available Cluster releases
of the series instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.26.0.
This section outlines new features and enhancements introduced in the
Container Cloud release 2.26.0. For the list of enhancements delivered with
the Cluster releases introduced by Container Cloud 2.26.0, see
17.1.0 and 16.1.0.
Pre-update inspection of pinned product artifacts in a ‘Cluster’ object¶
To ensure that Container Cloud clusters remain consistently updated with the
latest security fixes and product improvements, the Admission Controller
has been enhanced. Now, it actively prevents the utilization of pinned
custom artifacts for Container Cloud components. Specifically, it blocks
a management or managed cluster release update, or any cluster configuration
update, for example, adding public keys or proxy, if a Cluster object
contains any custom Container Cloud artifacts with global or image-related
values overwritten in the helm-releases section, until these values are
removed.
Normally, the Container Cloud clusters do not contain pinned artifacts,
which eliminates the need for any pre-update actions in most deployments.
However, if the update of your cluster is blocked with the
invalid HelmReleases configuration error, refer to
Update notes: Pre-update actions for details.
Note
In rare cases, if the image-related or global values should be
changed, you can use the ClusterRelease or KaaSRelease objects
instead. But make sure to update these values manually after every major
and patch update.
Note
The pre-update inspection applies only to images delivered by
Container Cloud that are overwritten. Any custom images unrelated to the
product components are not verified and do not block cluster update.
Disablement of worker machines on managed clusters¶
TechPreview
Implemented the machine disabling API that allows you to seamlessly remove
a worker machine from the LCM control of a managed cluster. This action
isolates the affected node without impacting other machines in the cluster,
effectively eliminating it from the Kubernetes cluster. This functionality
proves invaluable in scenarios where a malfunctioning machine impedes cluster
updates.
Added initial Technology Preview support for the HostOSConfiguration and
HostOSConfigurationModules custom resources in the bare metal provider.
These resources introduce configuration modules that allow managing the
operating system of a bare metal host granularly without rebuilding the node
from scratch. Such approach prevents workload evacuation and significantly
reduces configuration time.
Configuration modules manage various settings of the operating system using
Ansible playbooks, adhering to specific schemas and metadata requirements.
For description of module format, schemas, and rules, contact Mirantis support.
Warning
For security reasons and to ensure safe and reliable cluster
operability, contact Mirantis support
to start using these custom resources.
Caution
As long as the feature is still on the development stage,
Mirantis highly recommends deleting all HostOSConfiguration objects,
if any, before automatic upgrade of the management cluster to Container Cloud
2.27.0 (Cluster release 16.2.0). After the upgrade, you can recreate the
required objects using the updated parameters.
This precautionary step prevents re-processing and re-applying of existing
configuration, which is defined in HostOSConfiguration objects, during
management cluster upgrade to 2.27.0. Such behavior is caused by changes in
the HostOSConfiguration API introduced in 2.27.0.
Strict filtering for devices on bare metal clusters¶
Implemented the strict byID filtering for targeting system disks using
specific device options: byPath, serialNumber, and wwn.
These options offer a more reliable alternative to the unpredictable
byName naming format.
Mirantis recommends adopting these new device naming options when adding new
nodes and redeploying existing ones to ensure a predictable and stable device
naming schema.
Dynamic IP allocation for faster host provisioning¶
Introduced a mechanism in the Container Cloud dnsmasq server to dynamically
allocate IP addresses for bare metal hosts during provisioning. This new
mechanism replaces sequential IP allocation, which included a ping check,
with dynamic IP allocation without the ping check. Such behavior significantly
increases the number of bare metal servers that you can provision in parallel,
which allows you to streamline the process of setting up a large managed
cluster.
Support for Kubernetes auditing and profiling on management clusters¶
Added support for enablement and configuration of Kubernetes auditing and
profiling on management clusters. The auditing option is enabled by
default. You can configure both options using the Cluster object of the
management cluster.
Note
For managed clusters, you can also configure Kubernetes auditing
along with profiling using the Cluster object of a managed cluster.
Cleanup of LVM thin pool volumes during cluster provisioning¶
Implemented automatic cleanup of LVM thin pool volumes during the provisioning
stage to prevent issues with logical volume detection before removal, which
could cause node cleanup failure during cluster redeployment.
Wiping a device or partition before a bare metal cluster deployment¶
Implemented the capability to erase existing data from hardware devices to be
used for a bare metal management or managed cluster deployment. Using the
new wipeDevice structure, you can either erase an existing partition or
remove all existing partitions from a physical device. For these purposes,
use the eraseMetadata or eraseDevice option that configures cleanup
behavior during configuration of a custom bare metal host profile.
Note
The wipeDevice option replaces the deprecated wipe option
that will be removed in one of the following releases. For backward
compatibility, any existing wipe: true option is automatically converted
to the following structure:
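A sketch of the resulting structure, assuming the conversion enables metadata
erasure only (the exact field names are an assumption):
wipeDevice:
  eraseMetadata:
    enabled: true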
Policy Controller for validating pod image signatures¶
Technology Preview
Introduced initial Technology Preview support for the Policy Controller that
validates signatures of pod images.
The Policy Controller verifies that images used by the Container Cloud and
Mirantis OpenStack for Kubernetes controllers are signed by a trusted authority.
The Policy Controller inspects defined image policies that list Docker
registries and authorities for signature validation.
Added support for configuring Keycloak truststore using the Container Cloud
web UI to allow for a proper validation of client self-signed certificates.
The truststore is used to ensure secured connection to identity brokers,
LDAP identity providers, and others.
Added the LCM Operation condition to monitor the health of all LCM
operations on a cluster and its machines, which is useful during cluster
update. You can monitor the status of LCM operations in the Container Cloud
web UI in the status hover menus of a cluster and machine.
On top of continuous improvements delivered to the existing Container Cloud
guides, added the documentation on how to export logs from OpenSearch
dashboards to CSV.
The following issues have been addressed in the Mirantis Container Cloud
release 2.26.0 along with the Cluster releases 17.1.0 and
16.1.0.
Note
This section provides descriptions of issues addressed since
the last Container Cloud patch release 2.25.4.
For details on addressed issues in earlier patch releases since 2.25.0,
which are also included into the major release 2.26.0, refer to
2.25.x patch releases.
[32761] [LCM] Fixed the issue with node cleanup failing on
MOSK clusters due to the Ansible provisioner hanging in a
loop while trying to remove LVM thin pool logical volumes, which
occurred due to issues with volume detection before removal during cluster
redeployment. The issue resolution comprises implementation of automatic
cleanup of LVM thin pool volumes during the provisioning stage.
[36924] [LCM] Fixed the issue with Ansible starting to run on nodes
of a managed cluster after the mcc-cache certificate is applied
on a management cluster.
[37268] [LCM] Fixed the issue with a Container Cloud cluster being
blocked by a node stuck in the Prepare or Deploy state with the
error processing package openssh-server error. The issue was caused by
customizations in /etc/ssh/sshd_config, such as additional Match
statements.
[34820] [Ceph] Fixed the issue with the Ceph rook-operator failing
to connect to Ceph RADOS Gateway pods on clusters with the
Federal Information Processing Standard mode enabled.
[38340] [StackLight] Fixed the issue with Telegraf Docker Swarm timing
out while collecting data by increasing its timeout from 10 to 25 seconds.
When trying to list the HostOSConfigurationModules and HostOSConfiguration
custom resources, serviceuser or a user with
the global-admin or operator role obtains the access denied error.
For example:
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that was changed remains without an external IP
address assigned. However, the second service, which was changed later, has
the external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
After node maintenance of a management cluster, the newly added nodes may
fail to undergo provisioning successfully. The issue relates to new nodes
that are in the same L2 domain as the management cluster.
The issue was observed on environments having management cluster nodes
configured with a single L2 segment used for all network traffic
(PXE and LCM/management networks).
To verify whether the cluster is affected:
Verify whether the dnsmasq and dhcp-relay pods run on the same node
in the management cluster:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
vSphere¶[40747] Unsupported Cluster release is available for managed cluster deployment¶
The Cluster release 16.0.0, which is not supported for greenfield vSphere-based
deployments, is still available in the drop-down menu of the cluster creation
window in the Container Cloud web UI.
Do not select this Cluster release to prevent deployment failures.
Use the latest supported version instead.
LCM¶[41540] LCM Agent cannot grab storage information on a host¶
Due to issues with managing physical NVME devices, lcm-agent cannot grab
storage information on a host. As a result,
lcmmachine.status.hostinfo.hardware is empty and the following example
error is present in logs:
{"level":"error","ts":"2024-05-02T12:26:10Z","logger":"agent",\"msg":"get hardware details",\"host":"kaas-node-548b2861-aed0-41c9-8ff2-10c5476b000b",\"error":"new storage info: get disk info \"nvme0c0n1\": \invoke command: exit status 1","errorVerbose":"exit status 1
As a workaround, on the affected node, create a symlink for any device
indicated in lcm-agent logs. For example:
ln -sfn /dev/nvme0n1 /dev/nvme0c0n1
[40036] Node is not removed from a cluster when its Machine is disabled¶
During the replacement of a master node on a cluster of any type, the process
may get stuck with Kubelet's NodeReady condition is Unknown in the
machine status on the remaining master nodes.
As a workaround, log in on the affected node and run the following
command:
docker restart ucp-kubelet
[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
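A minimal sketch of such an alias; the required environment variables,
volume mounts, and image tag depend on the MKE version and cluster
configuration:
alias calicoctl="docker run -i --rm --net host --pid host -v /var/run/calico:/var/run/calico mirantis/ucp-dsinfo:<mkeVersion> calicoctl"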
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
Ceph¶[41819] Graceful cluster reboot is blocked by the Ceph ClusterWorkloadLocks¶
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the descriptions of the Pods that failed to run contain the
FailedMount events:
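A describe command of the following form can be used (a sketch):
kubectl -n <affectedProjectName> describe pod <affectedPodName>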
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
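A sketch, assuming the Ceph CSI plugin runs in the rook-ceph namespace:
kubectl -n rook-ceph get pod -l app=csi-rbdplugin -o wide | grep <nodeName>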
On High Availability (HA) clusters that use Local Volume Provisioner (LVP),
Prometheus and OpenSearch from StackLight may share the same pool of storage.
In such configuration, OpenSearch may approach the 85% disk usage watermark
due to the combined storage allocation and usage patterns set by the Persistent
Volume Claim (PVC) size parameters for Prometheus and OpenSearch, which consume
storage the most.
When the 85% threshold is reached, the affected node is transitioned to the
read-only state, preventing shard allocation and causing the OpenSearch cluster
state to transition to Warning (Yellow) or Critical (Red).
Caution
The issue and the provided workaround apply only for clusters on
which OpenSearch and Prometheus utilize the same storage pool.
Derived from .values.elasticsearch.persistentVolumeUsableStorageSizeGB,
defaulting to .values.elasticsearch.persistentVolumeClaimSize if
unspecified. To obtain the OpenSearch PVC size:
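For example, the PVC capacity can be read from the stacklight namespace
(a sketch):
kubectl -n stacklight get pvc | grep opensearch-master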
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
If the formula result is positive, it is an early indication that the
cluster is affected.
Verify whether the OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical alert is firing. And if so,
verify the following:
Log in to the OpenSearch web UI.
In Management -> Dev Tools, run the following command:
GET _cluster/allocation/explain
The following system response indicates that the corresponding node is
affected:
"explanation":"the node is above the low watermark cluster setting \[cluster.routing.allocation.disk.watermark.low=85%], using more disk space \than the maximum allowed [85.0%], actual free: [xx.xxx%]"
Note
The system response may contain even higher watermark percent
than 85.0%, depending on the case.
Workaround:
Warning
The workaround implies adjustment of the retention threshold for
OpenSearch. Depending on the new threshold, some old logs will be deleted.
Reserved_Percentage
A user-defined variable that specifies what percentage of the total storage
capacity should not be used by OpenSearch or Prometheus. This is used to
reserve space for other components. It must be expressed as a decimal.
For example, for a 5% reservation, Reserved_Percentage is 0.05.
Mirantis recommends using 0.05 as a starting point.
Filesystem_Reserve
Percentage to deduct for filesystems that may reserve some portion of the
available storage, which is marked as occupied. For example, for EXT4, it
is 5% by default, so the value must be 0.05.
Prometheus_PVC_Size_GB
Sourced from .values.prometheusServer.persistentVolumeClaimSize.
Total_Storage_Capacity_GB
Total capacity of the OpenSearch PVCs. For LVP, the capacity of the
storage pool. To obtain the total capacity:
The system response contains multiple outputs, one per opensearch-master
node. Select the capacity for the affected node.
Note
Convert the values to GB if they are set in different units.
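A plausible form of the calculation, reconstructed from the variable
definitions above and provided for illustration only (the exact product
formula may differ):
Usable_Storage_GB = Total_Storage_Capacity_GB * (1 - Filesystem_Reserve - Reserved_Percentage) - Prometheus_PVC_Size_GB
For example, with a 500 GB storage pool, a 5% filesystem reserve, a 5%
reservation, and a 100 GB Prometheus PVC:
500 * (1 - 0.05 - 0.05) - 100 = 350 GB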
The result of the above formula is the maximum safe storage to allocate
for .values.elasticsearch.persistentVolumeUsableStorageSizeGB. Use this
formula as a reference for setting
.values.elasticsearch.persistentVolumeUsableStorageSizeGB on a cluster.
Wait up to 15-20 minutes for OpenSearch to perform the cleanup.
Verify that the cluster is not affected anymore using the procedure above.
[42304] Failure of shard relocation in the OpenSearch cluster¶
On large managed clusters, shard relocation may fail in the OpenSearch cluster
with the yellow or red status of the OpenSearch cluster.
The characteristic symptom of the issue is that in the stacklight
namespace, the statefulset.apps/opensearch-master containers are
experiencing throttling with the KubeContainersCPUThrottlingHigh alert
firing for the following set of labels:
The throttling that OpenSearch is experiencing may be a temporary
situation, which may be related, for example, to a peaky load and the
ongoing shards initialization as part of disaster recovery or after node
restart. In this case, Mirantis recommends waiting until initialization
of all shards is finished. After that, verify the cluster state and whether
throttling still exists. Apply the workaround below only if the throttling
persists.
To verify that the initialization of shards is ongoing:
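For example, shard states can be listed through the OpenSearch Dev Tools
(a sketch; INITIALIZING entries indicate the ongoing process):
GET _cat/shards?v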
The system response above indicates that shards from the
.ds-system-000072, .ds-system-000073, and .ds-audit-000001
indices are in the INITIALIZING state. In this case, Mirantis
recommends waiting until this process is finished and only then considering
a change to the limit.
You can additionally analyze the exact level of throttling and the current
CPU usage on the Kubernetes Containers dashboard in Grafana.
Workaround:
Verify the currently configured CPU requests and limits for the
opensearch containers:
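One way to inspect the configured resources, assuming the stacklight
namespace and the opensearch container name (a sketch):
kubectl -n stacklight get statefulset opensearch-master -o jsonpath='{.spec.template.spec.containers[?(@.name=="opensearch")].resources}'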
In the example above, the CPU request is 500m and the CPU limit is
600m.
Increase the CPU limit to a reasonably high number.
For example, the default CPU limit for the clusters with the
clusterSize:large parameter set was increased from
8000m to 12000m for StackLight in Container Cloud 2.27.0
(Cluster releases 17.2.0 and 16.2.0).
If the CPU limit for the opensearch component is already set, increase
it in the Cluster object for the opensearch parameter. Otherwise,
the default StackLight limit is used. In this case, increase the CPU limit
for the opensearch component using the resources parameter.
Wait until all opensearch-master pods are recreated with the new CPU
limits and become running and ready.
To verify the current CPU limit for every opensearch container in every
opensearch-master pod separately:
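A sketch of one way to list the limit per Pod, assuming the stacklight
namespace:
for pod in $(kubectl -n stacklight get pods -o name | grep opensearch-master); do
  echo "${pod}: $(kubectl -n stacklight get ${pod} -o jsonpath='{.spec.containers[?(@.name=="opensearch")].resources.limits.cpu}')"
done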
The waiting time may take up to 20 minutes depending on the cluster size.
If the issue is fixed, the KubeContainersCPUThrottlingHigh alert stops
firing immediately, while OpenSearchClusterStatusWarning or
OpenSearchClusterStatusCritical can still be firing for some time during
shard relocation.
If the KubeContainersCPUThrottlingHigh alert is still firing, proceed with
another iteration of the CPU limit increase.
[40020] Rollover policy update is not applied to the current index¶
While updating rollover_policy for the current system* and audit*
data streams, the update is not applied to indices.
One of the indicators that the cluster is most likely affected is the
KubeJobFailed alert firing for the elasticsearch-curator job and one or
both of the following errors being present in elasticsearch-curator pods
that remain in the Error status:
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-audit-000001] is the write index for data stream [audit] and cannot be deleted')
or
2024-05-31 13:16:04,459 ERROR Failed to complete action: delete_indices. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: RequestError(400, 'illegal_argument_exception', 'index [.ds-system-000001] is the write index for data stream [system] and cannot be deleted')
Note
Instead of .ds-audit-000001 or .ds-system-000001 index names,
similar names can be present with the same prefix but different suffix
numbers.
If the above-mentioned alert and errors are present, immediate action is
required because the corresponding index size has already exceeded the space
allocated for the index.
To verify that the cluster is affected:
Caution
Verify and apply the workaround to both index patterns, system and
audit, separately.
If one of the indices is affected, the second one is most likely affected
as well, although in rare cases only one index may be affected.
The cluster is affected if the rollover policy is missing.
Otherwise, proceed to the following step.
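The policy can be retrieved through the OpenSearch ISM API, for example
(a sketch for the system index pattern; use the matching policy name for
audit):
GET _plugins/_ism/policies/system_rollover_policy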
Verify the system response from the previous step. For example:
{"_id":"system_rollover_policy","_version":7229,"_seq_no":42362,"_primary_term":28,"policy":{"policy_id":"system_rollover_policy","description":"system index rollover policy.","last_updated_time":1708505222430,"schema_version":19,"error_notification":null,"default_state":"rollover","states":[{"name":"rollover","actions":[{"retry":{"count":3,"backoff":"exponential","delay":"1m"},"rollover":{"min_size":"14746mb","copy_alias":false}}],"transitions":[]}],"ism_template":[{"index_patterns":["system*"],"priority":200,"last_updated_time":1708505222430}]}}
Verify and capture the following items separately for every policy:
The _seq_no and _primary_term values
The rollover policy threshold, which is defined in
policy.states[0].actions[0].rollover.min_size
If the rollover policy is not attached, the cluster is affected.
If the rollover policy is attached but _seq_no and _primary_term
numbers do not match the previously captured ones, the cluster is
affected.
If the index size drastically exceeds the defined threshold of the
rollover policy (which is the previously captured min_size),
the cluster is most probably affected.
Repeat the last step of the cluster verification procedure provided
above and make sure that the policy is attached to the index and has
the same _seq_no and _primary_term.
If the index size drastically exceeds the defined threshold of the
rollover policy (which is the previously captured min_size), wait
up to 15 minutes and verify that the additional index is created with
the consecutive number in the index name. For example:
system: if you applied changes to .ds-system-000001, wait until
.ds-system-000002 is created.
audit: if you applied changes to .ds-audit-000001, wait until
.ds-audit-000002 is created.
If such an index is not created, escalate the issue to Mirantis support.
Container Cloud web UI¶[41806] Configuration of a management cluster fails without Keycloak settings¶
During configuration of management cluster settings using the
Configure cluster web UI menu, updating the Keycloak Truststore
settings is mandatory even though it should be optional.
As a workaround, update the management cluster using the API or CLI.
The following table lists the major components and their versions delivered in
the Container Cloud release 2.26.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section lists the artifacts of components included in the Container Cloud
release 2.26.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The table below includes the total numbers of addressed unique and common
vulnerabilities and exposures (CVE) by product component since the 2.25.4
patch release. The common CVEs are issues addressed across several images.
This section describes the specific actions you as a cloud operator need to
complete before or after your Container Cloud cluster update to the Cluster
releases 17.1.0 or 16.1.0.
Pre-update actions¶Unblock cluster update by removing any pinned product artifacts¶
If any pinned product artifacts are present in the Cluster object of a
management or managed cluster, the update will be blocked by the Admission
Controller with the invalid HelmReleases configuration error until such
artifacts are removed. The update process does not start and any changes in
the Cluster object are blocked by the Admission Controller except the
removal of fields with pinned product artifacts.
Therefore, verify that the following sections of the Cluster objects
do not contain any image-related (tag, name, pullPolicy,
repository) and global values inside Helm releases:
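For illustration only, a hypothetical example of a pinned image override
that must be removed from a Helm release values section; the field layout
is an assumption, not an excerpt of an actual Cluster object:
helmReleases:
- name: stacklight
  values:
    image:
      repository: <customRegistry>/<imageName>
      tag: <pinnedTag>
      pullPolicy: IfNotPresent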
The custom pinned product artifacts are inspected and blocked by the
Admission Controller to ensure that Container Cloud clusters remain
consistently updated with the latest security fixes and product improvements.
Note
The pre-update inspection applies only to images delivered by
Container Cloud that are overwritten. Any custom images unrelated to the
product components are not verified and do not block cluster update.
Update queries for custom log-based metrics in StackLight¶
Container Cloud 2.26.0 introduces a reorganized and significantly improved
StackLight logging pipeline. It involves changes in queries implemented
in the scope of the logging.metricQueries feature designed for creation
of custom log-based metrics. For the procedure, see StackLight
operations: Create logs-based metrics.
If you already have some custom log-based metrics:
Before the cluster update, save existing queries.
After the cluster update, update the queries according to the changes
implemented in the scope of the logging.metricQueries feature.
These steps prevent failures of queries containing fields that are renamed
or removed in Container Cloud 2.26.0.
Post-update actions¶Update bird configuration on BGP-enabled bare metal clusters¶
Container Cloud 2.26.0 introduces the bird daemon update from v1.6.8
to v2.0.7 on master nodes if BGP is used for announcement of the cluster
API load balancer address.
Configuration files for bird v1.x are not fully compatible with those for
bird v2.x. Therefore, if you used BGP announcement of cluster API LB address
on a deployment based on Cluster releases 17.0.0 or 16.0.0, update bird
configuration files to fit bird v2.x using configuration examples provided in
the API Reference: MultiRackCluster section.
Review and adjust the storage parameters for OpenSearch¶
To prevent underused or overused storage space, review your storage space
parameters for OpenSearch on the StackLight cluster:
Review the value of elasticsearch.persistentVolumeClaimSize and
the real storage available on volumes.
Decide whether you have to additionally set
elasticsearch.persistentVolumeUsableStorageSizeGB.
The Container Cloud patch release 2.25.4, which is based on the
2.25.0 major release, provides the following updates:
Support for the patch Cluster releases 16.0.4 and 17.0.4
that represents Mirantis OpenStack for Kubernetes (MOSK) patch release
23.3.4.
Security fixes for CVEs in images.
This patch release also supports the latest major Cluster releases
17.0.0 and 16.0.0. And it does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available Cluster release
instead.
For main deliverables of the parent Container Cloud release of 2.25.4, refer
to 2.25.0.
This section lists the artifacts of components included in the Container Cloud
patch release 2.25.4. For artifacts of the Cluster releases introduced in
2.25.4, see patch Cluster releases 17.0.4 and 16.0.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since the Container Cloud 2.25.3 patch
release. The common CVEs are issues addressed across several images.
The following issues have been addressed in the Container Cloud patch release
2.25.4 along with the patch Cluster releases 17.0.4
and 16.0.4.
[38259] Fixed the issue causing the failure to attach an existing
MKE cluster to a Container Cloud management cluster. The issue was related
to byo-provider and prevented the attachment of MKE clusters having
less than three manager nodes and two worker nodes.
[38399] Fixed the issue causing the failure to deploy a management
cluster in the offline mode due to the issue in the setup script.
This section contains historical information on the unsupported Container
Cloud releases delivered in 2023. For the latest supported Container
Cloud release, see Container Cloud releases.
Introduces the major Cluster release 15.0.1 that is based on 14.0.1
and supports Mirantis OpenStack for Kubernetes (MOSK)
23.2.
Supports the Cluster release 14.0.1.
The deprecated Cluster release 14.0.0 as well as the 12.7.x and
11.7.x series are not supported for new deployments.
Contains features and amendments of the parent releases
2.24.0 and 2.24.1.
Support for the patch Cluster releases 16.0.3 and 17.0.3
that represents Mirantis OpenStack for Kubernetes (MOSK) patch release
23.3.3.
Security fixes for CVEs in images.
This patch release also supports the latest major Cluster releases
17.0.0 and 16.0.0. And it does not support greenfield
deployments based on deprecated Cluster releases. Use the latest available Cluster release
instead.
For main deliverables of the parent Container Cloud release of 2.25.3, refer
to 2.25.0.
This section lists the artifacts of components included in the Container Cloud
patch release 2.25.3. For artifacts of the Cluster releases introduced in
2.25.3, see patch Cluster releases 17.0.3 and 16.0.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since the Container Cloud 2.25.2 patch
release. The common CVEs are issues addressed across several images.
The following issues have been addressed in the Container Cloud patch release
2.25.3 along with the patch Cluster releases 17.0.3
and 16.0.3.
[37634][OpenStack] Fixed the issue with a management or managed cluster
deployment or upgrade being blocked by all pods being stuck in the
Pending state due to incorrect secrets being used to initialize
the OpenStack external Cloud Provider Interface.
[37766][IAM] Fixed the issue with sign-in to the MKE web UI of the
management cluster using the Sign in with External Provider
option, which failed with the invalid parameter: redirect_uri
error.
The Container Cloud patch release 2.25.2, which is based on the
2.25.0 major release, provides the following updates:
Renewed support for attachment of MKE clusters that are not originally
deployed by Container Cloud for vSphere-based management clusters.
Support for the patch Cluster releases 16.0.2 and 17.0.2
that represents Mirantis OpenStack for Kubernetes (MOSK) patch release
23.3.2.
Security fixes for CVEs in images.
This patch release also supports the latest major Cluster releases
17.0.0 and 16.0.0. And it does not support greenfield
deployments based on deprecated Cluster releases 14.0.1,
15.0.1, 16.0.1, and 17.0.1. Use the latest
available Cluster releases instead.
For main deliverables of the parent Container Cloud release of 2.25.2, refer
to 2.25.0.
This section lists the artifacts of components included in the Container Cloud
patch release 2.25.2. For artifacts of the Cluster releases introduced in
2.25.2, see patch Cluster releases 17.0.2 and 16.0.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since the Container Cloud 2.25.1 patch
release. The common CVEs are issues addressed across several images.
The Container Cloud patch release 2.25.1, which is based on the
2.25.0 major release, provides the following updates:
Support for the patch Cluster releases 16.0.1
and 17.0.1 that represents Mirantis OpenStack for Kubernetes
(MOSK) patch release
23.3.1.
Several product improvements. For details, see Enhancements.
Security fixes for CVEs in images.
This patch release also supports the latest major Cluster releases
17.0.0 and 16.0.0.
And it does not support greenfield deployments based on deprecated Cluster
releases 14.1.0, 14.0.1, and
15.0.1. Use the latest available Cluster releases
instead.
For main deliverables of the parent Container Cloud release of 2.25.1, refer
to 2.25.0.
This section outlines new features and enhancements introduced in the
Container Cloud patch release 2.25.1 along with Cluster releases 17.0.1 and
16.0.1.
Introduced support for Mirantis Kubernetes Engine (MKE) 3.7.2 on Container
Cloud management and managed clusters. On existing managed clusters, MKE is
updated to the latest supported version when you update your cluster to the
patch Cluster release 17.0.1 or 16.0.1.
To simplify MKE configuration through API, moved management of MKE parameters
controlled by Container Cloud from lcm-ansible to lcm-controller.
Now, Container Cloud overrides only a set of MKE configuration parameters that
are automatically managed by Container Cloud.
Introduced Kubernetes network policies for all StackLight components. The
feature is implemented using the networkPolicies parameter that is enabled
by default.
The Kubernetes NetworkPolicy resource allows controlling network connections
to and from Pods within a cluster. This enhances security by restricting
communication from compromised Pod applications and provides transparency
into how applications communicate with each other.
External vSphere CCM with CSI supporting vSphere 6.7 on Kubernetes 1.27¶
Switched to the external vSphere cloud controller manager (CCM) that uses
vSphere Container Storage Plug-in 3.0 for volume attachment. The feature
implementation implies an automatic migration of PersistentVolume and
PersistentVolumeClaim.
The external vSphere CCM supports vSphere 6.7 on Kubernetes 1.27 as compared
to the in-tree vSphere CCM that does not support vSphere 6.7 since
Kubernetes 1.25.
Important
The major Cluster release 14.1.0 is the last Cluster release
for the vSphere provider based on MCR 20.10 and MKE 3.6.6 with
Kubernetes 1.24. Therefore, Mirantis highly recommends updating your
existing vSphere-based managed clusters to the Cluster release
16.0.1 that contains newer versions on MCR and MKE with
Kubernetes. Otherwise, your management cluster upgrade to Container Cloud
2.25.2 will be blocked.
Since Container Cloud 2.25.1, the major Cluster release 14.1.0 is deprecated.
Greenfield vSphere-based deployments on this Cluster release are not
supported. Use the patch Cluster release 16.0.1 for new deployments instead.
This section lists the artifacts of components included in the Container Cloud
patch release 2.25.1. For artifacts of the Cluster releases introduced in
2.25.1, see patch Cluster releases 17.0.1 and
16.0.1.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The table below includes the total numbers of addressed unique and common
CVEs in images by product component since the Container Cloud 2.25.0 major
release. The common CVEs are issues addressed across several images.
The following issues have been addressed in the Container Cloud patch release
2.25.1 along with the patch Cluster releases 17.0.1
and 16.0.1.
[35426] [StackLight] Fixed the issue with the prometheus-libvirt-exporter
Pod failing to reconnect to libvirt after the libvirt Pod recovery from
a failure.
[35339] [LCM] Fixed the issue with the LCM Ansible task of copying
kubectl from the ucp-hyperkube image failing if
kubectl exec is in use, for example, during a management cluster
upgrade.
[35089] [bare metal, Calico] Fixed the issue with arbitrary Kubernetes pods
getting stuck in an error loop due to a failed Calico networking setup for
that pod.
[33936] [bare metal, Calico] Fixed the issue with deletion failure of a
controller node during machine replacement due to the upstream
Calico issue.
The Mirantis Container Cloud major release 2.25.0:
Introduces support for the Cluster release 17.0.0
that is based on the Cluster release 16.0.0 and
represents Mirantis OpenStack for Kubernetes (MOSK)
23.3.
Introduces support for the Cluster release 16.0.0 that
is based on Mirantis Container Runtime (MCR) 23.0.7 and Mirantis Kubernetes
Engine (MKE) 3.7.1 with Kubernetes 1.27.
Introduces support for the Cluster release 14.1.0 that
is dedicated for the vSphere provider only. This is the last Cluster
release for the vSphere provider based on MKE 3.6.6 with Kubernetes 1.24.
Does not support greenfield deployments on deprecated Cluster releases
of the 15.x and 14.x series. Use the latest available Cluster releases
of the series instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.25.0.
This section outlines new features and enhancements introduced in the
Container Cloud release 2.25.0. For the list of enhancements delivered with
the Cluster releases introduced by Container Cloud 2.25.0, see
17.0.0, 16.0.0, and
14.1.0.
Implemented Container Cloud Bootstrap v2 that provides an exceptional user
experience to set up Container Cloud. With Bootstrap v2, you also gain access
to a comprehensive and user-friendly web UI for the OpenStack and vSphere
providers.
Bootstrap v2 empowers you to effortlessly provision management clusters before
deployment, while benefiting from a streamlined process that isolates
each step. This approach not only simplifies the bootstrap process but also
enhances troubleshooting capabilities for addressing any potential
intermediate failures.
Note
The Bootstrap web UI support for the bare metal provider will be
added in one of the following Container Cloud releases.
General availability for ‘MetalLBConfigTemplate’ and ‘MetalLBConfig’ objects¶
Completed development of the MetalLB configuration related to address
allocation and announcement for load-balanced services using the
MetalLBConfigTemplate object for bare metal and the MetalLBConfig
object for vSphere. Container Cloud uses these objects in default templates as
recommended during creation of a management or managed cluster.
At the same time, removed the possibility to use the deprecated options, such
as configInline value of the MetalLB chart and the use of Subnet
objects without new MetalLBConfigTemplate and MetalLBConfig objects.
Automated migration, which applied to these deprecated options during creation
of clusters of any type or cluster update to Container Cloud 2.24.x, is
removed automatically during your management cluster upgrade to Container
Cloud 2.25.0. After that, any changes in MetalLB configuration related to
address allocation and announcement for load-balanced services will be applied
using the MetalLBConfig, MetalLBConfigTemplate, and Subnet objects
only.
These annotations are helpful if you have a limited amount of free and unused
IP addresses for server provisioning. Using these annotations, you can
manually create bare metal hosts one by one and provision servers in small,
manually managed chunks.
Status of infrastructure health for bare metal and OpenStack providers¶
Implemented the Infrastructure Status condition to monitor
infrastructure readiness in the Container Cloud web UI during cluster
deployment for bare metal and OpenStack providers. Readiness of the following
components is monitored:
Bare metal: the MetalLBConfig object along with MetalLB and DHCP subnets
OpenStack: cluster network, routers, load balancers, and Bastion along with
their ports and floating IPs
For the bare metal provider, also implemented the
Infrastructure Status condition for machines to monitor readiness
of the IPAMHost, L2Template, BareMetalHost, and
BareMetalHostProfile objects associated with the machine.
General availability for RHEL 8.7 on vSphere-based clusters¶
Introduced general availability support for RHEL 8.7 on VMware vSphere-based
clusters. You can install this operating system on any type of a Container
Cloud cluster including the bootstrap node.
Note
RHEL 7.9 is not supported as the operating system for the bootstrap
node.
Caution
A Container Cloud cluster based on mixed RHEL versions, such as
RHEL 7.9 and 8.7, is not supported.
Implemented automatic cleanup of old Ubuntu kernel and other unnecessary
system packages. During cleanup, Container Cloud keeps the two most recent kernel
versions, which is the default behavior of the Ubuntu
apt autoremove command.
Mirantis recommends keeping two kernel versions, with the previous kernel
version serving as a fallback in case the current kernel becomes unstable.
However, if you absolutely require leaving only the
latest version of kernel packages, you can use the
cleanup-kernel-packages script after considering all possible risks.
Configuration of a custom OIDC provider for MKE on managed clusters¶
Implemented the ability to configure a custom OpenID Connect (OIDC) provider
for MKE on managed clusters using the ClusterOIDCConfiguration custom
resource. Using this resource, you can add your own OIDC provider
configuration to authenticate user requests to Kubernetes.
Note
For OpenStack and StackLight, Container Cloud supports only
Keycloak, which is configured on the management cluster,
as the OIDC provider.
Implemented the management-admin OIDC role to grant full admin access
specifically to a management cluster. This role enables the user to manage
Pods and all other resources of the cluster, for example, for debugging
purposes.
General availability for graceful machine deletion¶
Introduced general availability support for graceful machine deletion with
a safe cleanup of node resources:
Changed the default deletion policy from unsafe to graceful for
machine deletion using the Container Cloud API.
Using the deletionPolicy:graceful parameter in the
providerSpec.value section of the Machine object, the cloud provider
controller prepares a machine for deletion by cordoning, draining, and
removing the related node from Docker Swarm. If required, you can abort a
machine deletion when using deletionPolicy:graceful, but only before
the related node is removed from Docker Swarm. See the placement sketch
after this list.
Implemented the following machine deletion methods in the Container Cloud
web UI: Graceful, Unsafe, Forced.
Added support for deletion of manager machines, which is intended only for
replacement or recovery of failed nodes, for MOSK-based
clusters using either of deletion policies mentioned above.
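A minimal sketch of the relevant Machine fragment, assuming the
providerSpec.value placement described above:
spec:
  providerSpec:
    value:
      deletionPolicy: graceful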
General availability for parallel update of worker nodes¶
Completed development of the parallel update of worker nodes during cluster
update by implementing the ability to configure the required options using the
Container Cloud web UI. Parallelizing of node update operations significantly
optimizes the update efficiency of large clusters.
The following options are added to the Create Cluster window:
Parallel Upgrade Of Worker Machines that sets the maximum number
of worker nodes to update simultaneously
Parallel Preparation For Upgrade Of Worker Machines that sets
the maximum number of worker nodes for which new artifacts are downloaded
at a given moment of time
The following issues have been addressed in the Mirantis Container Cloud
release 2.25.0 along with the Cluster releases 17.0.0,
16.0.0, and 14.1.0.
Note
This section provides descriptions of issues addressed since
the last Container Cloud patch release 2.24.5.
For details on addressed issues in earlier patch releases since 2.24.0,
which are also included into the major release 2.25.0, refer to
2.24.x patch releases.
[34462] [BM] Fixed the issue with incorrect handling of the DHCP egress
traffic by reconfiguring the external traffic policy for the dhcp-lb
Kubernetes Service. For details about the issue, refer to the
Kubernetes upstream bug.
On existing clusters with multiple L2 segments using DHCP relays on the
border switches, in order to successfully provision new nodes or reprovision
existing ones, manually point the DHCP relays on your network infrastructure
to the new IP address of the dhcp-lb Service of the Container Cloud
cluster.
To obtain the new IP address:
kubectl -n kaas get service dhcp-lb
[35429] [BM] Fixed the issue with the WireGuard interface not having
the IPv4 address assigned. The fix implies automatic restart of the
calico-node Pod to allocate the IPv4 address on the WireGuard interface.
[36131] [BM] Fixed the issue with IpamHost object changes not being
propagated to LCMMachine during netplan configuration after cluster
deployment.
[34657] [LCM] Fixed the issue with iam-keycloak Pods not starting
after powering up master nodes and starting the Container Cloud upgrade
right after.
[34750] [LCM] Fixed the issue with journald generating a lot of log
messages that already exist in the auditd log due to enabled
systemd-journald-audit.socket.
[35738] [StackLight] Fixed the issue with ucp-node-exporter failing
to start because it was unable to bind port 9100, which was already in use
by the StackLight node-exporter.
The resolution of the issue involves an automatic change of the port for the
StackLight node-exporter from 9100 to 19100. No manual port update is
required.
If your cluster uses a firewall, add an additional firewall rule that
grants the same permissions to port 19100 as those currently assigned
to port 9100 on all cluster nodes.
[34296] [StackLight] Fixed the issue with the CPU over-consumption by
helm-controller leading to the KubeContainersCPUThrottlingHigh
alert firing.
This section lists known issues with workarounds for the Mirantis
Container Cloud release 2.25.0 including the Cluster releases
17.0.0, 16.0.0, and
14.1.0.
This section also outlines still valid known issues
from previous Container Cloud releases.
Bare metal¶[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains with no external IP
address assigned. However, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
An arbitrary Kubernetes pod may get stuck in an error loop due to a failed
Calico networking setup for that pod. The pod cannot access any network
resources. The issue occurs more often during cluster upgrade or node
replacement, but this can sometimes happen during the new deployment as well.
You may find the following log for the failed pod IP (for example,
10.233.121.132) in calico-node logs:
Due to the upstream Calico issue, a controller node
cannot be deleted if the calico-node Pod is stuck blocking node deletion.
One of the symptoms is the following warning in the baremetal-operator
logs:
Resolving dependency Service dhcp-lb in namespace kaas failed:\
the server was unable to return a response in the time allotted,\
but may still be processing the request (get endpoints dhcp-lb).
As a workaround, delete the Pod that is stuck to retrigger the node
deletion.
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
OpenStack¶[37634] Cluster deployment or upgrade is blocked by all pods in ‘Pending’ state¶
When using OpenStackCredential with a custom CACert, a management or
managed cluster deployment or upgrade is blocked by all pods being stuck in
the Pending state. The issue is caused by incorrect secrets being used to
initialize the OpenStack external Cloud Provider Interface.
As a workaround, copy CACert from the OpenStackCredential object
to openstack-ca-secret:
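A sketch of the approach using kubectl; the exact field path of CACert in
the OpenStackCredential object, the secret data key, and the namespace are
assumptions:
# Read the CA certificate from the credential
kubectl -n <projectName> get openstackcredential <credentialName> -o yaml
# Copy the CACert value into the secret (base64-encode it if the secret stores encoded data)
kubectl -n <projectName> edit secret openstack-ca-secret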
A sign-in to the MKE web UI of the management cluster using the
Sign in with External Provider option can fail with the
invalid parameter: redirect_uri error.
Workaround:
Log in to the Keycloak admin console.
In the sidebar menu, switch to the IAM realm.
Navigate to Clients > kaas.
On the page, navigate to
Settings > Access settings > Valid redirect URIs.
Add https://<mgmtmkeip>:6443/* to the list of valid redirect URIs
and click Save.
Refresh the browser window with the sign-in URI.
LCM¶[31186,34132] Pods get stuck during MariaDB operations¶
During MariaDB operations on a management cluster, Pods may get stuck
in continuous restarts with the following example error:
On MOSK clusters, the Ansible provisioner may hang in a loop while trying to
remove LVM thin pool logical volumes (LVs) due to issues with volume detection
before removal. The Ansible provisioner cannot remove LVM thin pool LVs
correctly, so it consistently detects the same volumes whenever it scans
disks, leading to a repetitive cleanup process.
The following symptoms mean that a cluster can be affected:
A node was configured to use thin pool LVs. For example, it had the
OpenStack Cinder role in the past.
A bare metal node deployment flaps between provisioning and
deprovisioning states.
In the Ansible provisioner logs, the following example warnings are growing:
88621.log:7389:2023-06-22 16:30:45.109 88621 ERROR ansible.plugins.callback.ironic_log[-] Ansible task clean:fail failed on node 14eb0dbc-c73a-4298-8912-4bb12340ff49:{'msg':'There are more devices to clean', '_ansible_no_log': None, 'changed': False}
Important
There are more devices to clean is a regular warning
indicating some in-progress tasks. But if the number of such warnings is
growing along with the node flapping between provisioning and
deprovisioning states, the cluster is highly likely affected by the
issue.
As a workaround, erase disks manually using any preferred tool.
[30294] Replacement of a master node is stuck on the calico-node Pod start¶
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
Ceph¶[34820] The Ceph ‘rook-operator’ fails to connect to RGW on FIPS nodes¶
Due to the upstream Ceph issue,
on clusters with the Federal Information Processing Standard (FIPS) mode
enabled, the Ceph rook-operator fails to connect to Ceph RADOS Gateway
(RGW) pods.
As a workaround, do not place Ceph RGW pods on nodes where FIPS mode is
enabled.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the descriptions of the Pods that failed to run contain the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
Container Cloud upgrade may be blocked by a node being stuck in the Prepare
or Deploy state with error processing package openssh-server.
The issue is caused by customizations in /etc/ssh/sshd_config, such as
additional Match statements. This file is managed by Container Cloud and
must not be altered manually.
As a workaround, move customizations from sshd_config to a new file
in the /etc/ssh/sshd_config.d/ directory.
[36928] The helm-controller Deployment is stuck during cluster update¶
During a cluster update, a Kubernetes helm-controller Deployment may
get stuck in a restarting Pod loop with Terminating and Running states
flapping. Other Deployment types may also be affected.
As a workaround, restart the Deployment that got stuck:
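A minimal sketch using kubectl scale; scaling the Deployment to zero and
back recreates its Pods:
kubectl -n <affectedProjectName> scale deployment <affectedDeployName> --replicas=0
kubectl -n <affectedProjectName> scale deployment <affectedDeployName> --replicas=<replicasNumber>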
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name containing
the cluster with stuck Pods
<affectedDeployName> is the Deployment name that failed to run Pods
in the specified project
<replicasNumber> is the original number of replicas for the
Deployment that you can obtain using the get deploy command
[33438] ‘CalicoDataplaneFailuresHigh’ alert is firing during cluster update¶
During cluster update of a managed bare metal cluster, the false positive
CalicoDataplaneFailuresHigh alert may be firing. Disregard this alert,
which will disappear once cluster update succeeds.
The observed behavior is typical for calico-node during upgrades,
as workload changes occur frequently. Consequently, there is a possibility
of temporary desynchronization in the Calico dataplane. This can occasionally
result in throttling when applying workload changes to the Calico dataplane.
The following table lists the major components and their versions delivered in
the Container Cloud release 2.25.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section lists the artifacts of components included in the Container Cloud
release 2.25.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The table below includes the total numbers of addressed unique and common
CVEs by product component since the 2.24.5 patch release. The common
CVEs are issues addressed across several images.
This section describes the specific actions you as a cloud operator need to
complete before or after your Container Cloud cluster update to the Cluster
releases 17.0.0, 16.0.0, or 14.1.0.
Pre-update actions¶Upgrade to Ubuntu 20.04 on baremetal-based clusters¶
The Cluster release series 14.x and 15.x are the last ones where Ubuntu 18.04
is supported on existing clusters. A Cluster release update to 17.0.0 or
16.0.0 is impossible for a cluster running on Ubuntu 18.04.
Configure managed clusters with the etcd storage quota set¶
If your cluster has custom etcd storage quota set as described in
Increase storage quota for etcd, before the management cluster upgrade to 2.25.0,
configure LCMMachine resources:
Manually set the ucp_etcd_storage_quota parameter in LCMMachine
resources of the cluster controller nodes:
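A sketch of the LCMMachine fragment; the exact location and value format of
the parameter are assumptions based on the stateItemsOverwrites.deploy
section mentioned below:
spec:
  stateItemsOverwrites:
    deploy:
      ucp_etcd_storage_quota: "<currentQuotaValue>"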
After the management cluster is upgraded to 2.25.0, update your managed
cluster to the Cluster release 17.0.0 or 16.0.0.
Manually remove the ucp_etcd_storage_quota parameter from the
stateItemsOverwrites.deploy section.
Allow the TCP port 12392 for management cluster nodes¶
The Cluster release 16.x and 17.x series are shipped with MKE 3.7.x.
To ensure cluster operability after the update, verify that the TCP
port 12392 is allowed in your network for the Container Cloud management
cluster nodes.
Post-update actions¶Migrate Ceph cluster to address storage devices using by-id¶
Container Cloud uses the device by-id identifier as the default method
of addressing the underlying devices of Ceph OSDs. This is the only persistent
device identifier for a Ceph cluster that remains stable after cluster
upgrade or any other cluster maintenance.
Point DHCP relays on routers to the new dhcp-lb IP address¶
If your managed cluster has multiple L2 segments using DHCP relays on the
border switches, after the related management cluster automatically upgrades
to Container Cloud 2.25.0, manually point the DHCP relays on your network
infrastructure to the new IP address of the dhcp-lb service of the
Container Cloud managed cluster in order to successfully provision new nodes
or reprovision existing ones.
To obtain the new IP address:
kubectl -n kaas get service dhcp-lb
This change is required after the product has included the resolution of
the issue related to the incorrect handling of DHCP egress traffic. The fix
involves reconfiguring the external traffic policy for the dhcp-lb
Kubernetes Service. For details about the issue, refer to the
Kubernetes upstream bug.
The Container Cloud patch release 2.24.5, which is based on the
2.24.2 major release, provides the following updates:
Support for the patch Cluster releases 14.0.4
and 15.0.4 that represents Mirantis OpenStack for Kubernetes
(MOSK) patch release
23.2.3.
Security fixes for CVEs of Critical and High severity
This patch release also supports the latest major Cluster releases
14.0.1 and 15.0.1.
And it does not support greenfield deployments based on deprecated Cluster
releases 15.0.3, 15.0.2,
14.0.3, 14.0.2
along with 12.7.x and 11.7.x series.
Use the latest available Cluster releases for new deployments instead.
For main deliverables of the parent Container Cloud releases of 2.24.5, refer
to 2.24.0 and 2.24.1.
This section lists the components artifacts of the Container Cloud patch
release 2.24.5. For artifacts of the Cluster releases introduced in 2.24.5,
see patch Cluster releases 15.0.4 and
14.0.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
In total, since Container Cloud 2.24.4, in 2.24.5, 21
Common Vulnerabilities and Exposures (CVE) have been fixed:
18 of critical and 3 of high severity.
The summary table contains the total number of unique CVEs along with the
total number of issues fixed across the images.
The full list of the CVEs present in the current Container Cloud release is
available at the Mirantis Security Portal.
The Container Cloud patch release 2.24.4, which is based on the
2.24.2 major release, provides the following updates:
Support for the patch Cluster releases 14.0.3
and 15.0.3 that represents Mirantis OpenStack for Kubernetes
(MOSK) patch release
23.2.2.
Support for the multi-rack topology on bare metal managed clusters
Support for configuration of the etcd storage quota
Security fixes for CVEs of Critical and High severity
This patch release also supports the latest major Cluster releases
14.0.1 and 15.0.1.
And it does not support greenfield deployments based on deprecated Cluster
releases 15.0.2, 14.0.2,
along with 12.7.x and 11.7.x series.
Use the latest available Cluster releases for new deployments instead.
For main deliverables of the parent Container Cloud releases of 2.24.4, refer
to 2.24.0 and 2.24.1.
Added the capability to configure storage quota, which is 2 GB by default.
You may need to increase the default etcd storage quota if etcd runs out of
space and there is no other way to clean up the storage on your management
or managed cluster.
Multi-rack topology for bare metal managed clusters¶
TechPreview
Added support for the multi-rack topology on bare metal managed clusters.
Implementation of the multi-rack topology implies the use of Rack and
MultiRackCluster objects that support configuration of BGP announcement
of the cluster API load balancer address.
You can now create a managed cluster where cluster nodes including Kubernetes
masters are distributed across multiple racks without L2 layer extension
between them, and use BGP for announcement of the cluster API load balancer
address and external addresses of Kubernetes load-balanced services.
This section lists the components artifacts of the Container Cloud patch
release 2.24.4. For artifacts of the Cluster releases introduced in 2.24.4,
see patch Cluster releases 15.0.3 and
14.0.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
In total, since Container Cloud 2.24.3, in 2.24.4, 18
Common Vulnerabilities and Exposures (CVE) have been fixed:
3 of critical and 15 of high severity.
The summary table contains the total number of unique CVEs along with the
total number of issues fixed across the images.
The full list of the CVEs present in the current Container Cloud release is
available at the Mirantis Security Portal.
Support for enablement of Kubernetes auditing and profiling options using
the Container Cloud Cluster object on managed clusters. For details,
see Configure Kubernetes auditing and profiling.
Support for the patch Cluster releases 14.0.2
and 15.0.2 that represents Mirantis OpenStack for Kubernetes
(MOSK) patch release
23.2.1.
This patch release also supports the latest major Cluster releases
14.0.1 and 15.0.1.
And it does not support greenfield deployments based on deprecated Cluster
release 14.0.0 along with 12.7.x and
11.7.x series. Use the latest available Cluster releases
instead.
For main deliverables of the parent Container Cloud releases of 2.24.3, refer
to 2.24.0 and 2.24.1.
This section lists the components artifacts of the Container Cloud patch
release 2.24.3. For artifacts of the Cluster releases introduced in 2.24.3,
see Cluster releases 15.0.2 and 14.0.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Container Cloud major release 2.24.2 based on 2.24.0
and 2.24.1 provides the following:
Introduces support for the major Cluster release 15.0.1
that is based on the Cluster release 14.0.1 and
represents Mirantis OpenStack for Kubernetes (MOSK)
23.2.
This Cluster release is based on the updated version of Mirantis Kubernetes
Engine 3.6.5 with Kubernetes 1.24 and Mirantis Container Runtime 20.10.17.
Does not support greenfield deployments based on deprecated Cluster release
14.0.0 along with 12.7.x and
11.7.x series. Use the latest available Cluster releases
of the series instead.
For main deliverables of the Container Cloud release 2.24.2, refer to its
parent release 2.24.0:
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
The Container Cloud patch release 2.24.1 based on 2.24.0
includes updated baremetal-operator, admission-controller, and iam
artifacts and provides hot fixes for the following issues:
[34218] Fixed the issue with the iam-keycloak Pod being stuck in the
Pending state during Keycloak upgrade to version 21.1.1.
[34247] Fixed the issue with MKE backup failing during cluster update
due to wrong permissions in the etcd backup directory. If the issue still
persists, which may occur on clusters that were originally deployed using
early Container Cloud releases delivered in 2020-2021, follow the
workaround steps described in Known issues: LCM.
Note
Container Cloud patch release 2.24.1 does not introduce new Cluster
releases.
For main deliverables of the Container Cloud release 2.24.1, refer to its
parent release 2.24.0:
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
Container Cloud 2.24.0 has been successfully applied to a
certain number of clusters. The 2.24.0 related documentation content
fully applies to these clusters.
If your cluster started to update but was reverted to the previous product
version or the update is stuck, you automatically receive the 2.24.1 patch
release with the bug fixes to unblock the update to the 2.24 series.
There is no impact on the cluster workloads. For details on the patch
release, see 2.24.1.
The Mirantis Container Cloud GA release 2.24.0:
Introduces support for the Cluster release 14.0.0
that is based on Mirantis Container Runtime 20.10.17 and
Mirantis Kubernetes Engine 3.6.5 with Kubernetes 1.24.
Supports the latest major and patch Cluster releases of the
12.7.x series that supports Mirantis OpenStack for Kubernetes
(MOSK) 23.1 series.
Does not support greenfield deployments on deprecated Cluster releases
12.7.3, 11.7.4, or earlier patch
releases, 12.5.0, or 11.7.0.
Use the latest available Cluster releases of the series instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.24.0.
This section outlines new features and enhancements introduced in the
Mirantis Container Cloud release 2.24.0. For the list of enhancements in the
Cluster release 14.0.0 that is introduced by the Container Cloud
release 2.24.0, see the Cluster release 14.0.0.
Automated upgrade of operating system on bare metal clusters¶
Support status of the feature
Since MOSK 23.2, the feature is generally available for
MOSK clusters.
Since Container Cloud 2.24.2, the feature is generally available for any
type of bare metal clusters.
Since Container Cloud 2.24.0, the feature is available as Technology
Preview for management and regional clusters only.
Implemented automatic in-place upgrade of an operating system (OS)
distribution on bare metal clusters. The OS upgrade occurs as part of a
cluster update that requires a machine reboot. The OS upgrade workflow is as follows:
The distribution ID value is taken from the id field of the
distribution from the allowedDistributions list in the spec of the
ClusterRelease object.
The distribution that has the default:true value is used during
update. This distribution ID is set in the
spec:providerSpec:value:distribution field of the Machine object
during cluster update.
On management and regional clusters, the operating system upgrades
automatically during cluster update. For managed clusters, an in-place OS
distribution upgrade should be performed between cluster updates.
This scenario implies a machine cordoning, draining, and reboot.
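For illustration, the following minimal sketch shows the related fields,
assuming a hypothetical distribution ID ubuntu/focal; the actual IDs come
from the allowedDistributions list of your specific ClusterRelease object:
# ClusterRelease object (fragment)
spec:
  allowedDistributions:
  - id: ubuntu/focal      # hypothetical ID used for illustration
    default: true         # the default distribution is applied during cluster update
# Machine object (fragment), set automatically during cluster update
spec:
  providerSpec:
    value:
      distribution: ubuntu/focal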
Warning
During the course of the Container Cloud 2.28.x series, Mirantis
highly recommends upgrading the operating system on all machines of your
managed clusters to Ubuntu 22.04 before the next major Cluster
release becomes available.
It is not mandatory to upgrade all machines at once. You can upgrade them
one by one or in small batches, for example, if the maintenance window is
limited in time.
Otherwise, the Cluster release update of the Ubuntu 20.04-based managed
clusters will become impossible as of Container Cloud 2.29.0 with Ubuntu
22.04 as the only supported version.
Management cluster update to Container Cloud 2.29.1 will be blocked if
at least one node of any related managed cluster is running Ubuntu 20.04.
Added initial Technology Preview support for WireGuard that enables traffic
encryption on the Kubernetes workloads network. Set secureOverlay:true
in the Cluster object during deployment of management, regional, or
managed bare metal clusters to enable WireGuard encryption.
Also, added the possibility to configure the maximum transmission unit (MTU)
size for Calico that is required for the WireGuard functionality and allows
maximizing network performance.
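A minimal sketch of the corresponding Cluster object fragment; the placement
under spec:providerSpec:value and the calico:mtu field name are assumptions
for illustration, so verify them against the bare metal configuration
reference:
spec:
  providerSpec:
    value:
      secureOverlay: true   # enables WireGuard encryption on the Kubernetes workloads network
      calico:
        mtu: 1440           # assumed field; leave headroom for the WireGuard encapsulation overhead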
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.2.
MetalLB configuration changes for bare metal and vSphere¶
For management and regional clusters
Caution
For managed clusters, this object is available as Technology
Preview and will become generally available in one of the following
Container Cloud releases.
Introduced the following MetalLB configuration changes and objects related to
address allocation and announcement of services LB for bare metal and vSphere
providers:
Introduced the MetalLBConfigTemplate object for bare metal and the
MetalLBConfig object for vSphere, which are now the default and recommended
configuration objects.
For vSphere, during creation of clusters of any type, now a separate
MetalLBConfig object is created instead of corresponding settings
in the Cluster object.
The use of either Subnet objects without the new MetalLB objects or the
configInline MetalLB value of the Cluster object is deprecated and
will be removed in one of the following releases.
If the MetalLBConfig object is not used for MetalLB configuration
related to address allocation and announcement of services LB, then
automated migration applies during creation of clusters of any type or
cluster update to Container Cloud 2.24.0.
During automated migration, the MetalLBConfig and
MetalLBConfigTemplate objects for bare metal or the MetalLBConfig
for vSphere are created, and the contents of the MetalLB chart configInline
value are converted to the parameters of the MetalLBConfigTemplate object
for bare metal or of the MetalLBConfig object for vSphere.
The following changes apply to the bare metal bootstrap procedure:
Moved the following environment variables from cluster.yaml.template to
the dedicated ipam-objects.yaml.template:
BOOTSTRAP_METALLB_ADDRESS_POOL
KAAS_BM_BM_DHCP_RANGE
SET_METALLB_ADDR_POOL
SET_LB_HOST
Modified the default network configuration. Now it includes a bond interface
and separated PXE and management networks. Mirantis recommends using
separate PXE and management networks for management and regional clusters.
Added support for RHEL 8.7 on the vSphere-based management, regional, and
managed clusters.
Custom flavors for Octavia on OpenStack-based clusters¶
Implemented the possibility to use custom Octavia Amphora flavors that you can
enable in the spec:providerSpec section of the Cluster object using
serviceAnnotations:loadbalancer.openstack.org/flavor-id during
management or regional cluster deployment.
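A hedged sketch of the Cluster object fragment; the exact nesting under
spec:providerSpec is an assumption, and the flavor ID is a placeholder:
spec:
  providerSpec:
    value:
      serviceAnnotations:
        loadbalancer.openstack.org/flavor-id: "<octaviaFlavorID>"  # ID of the custom Amphora flavor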
Note
For managed clusters, you can enable the feature through the
Container Cloud API. The web UI functionality will be added in one of the
following Container Cloud releases.
Deletion of persistent volumes during an OpenStack-based cluster deletion¶
Completed the development of persistent volumes deletion during an
OpenStack-based managed cluster deletion by implementing the
Delete all volumes in the cluster check box in the cluster
deletion menu of the Container Cloud web UI.
Upgraded the Keycloak major version from 18.0.0 to 21.1.1. For the list of new
features and enhancements, see
Keycloak Release Notes.
The upgrade path is fully automated. No data migration or custom LCM changes
are required.
Important
After the Keycloak upgrade, access the Keycloak Admin Console
using the new URL format: https://<keycloak.ip>/auth instead of
https://<keycloak.ip>. Otherwise, the Resource not found
error displays in a browser.
Added initial Technology Preview support for custom host names of machines on
any supported provider and any cluster type. When enabled, any machine host
name in a particular region matches the related Machine object name. For
example, instead of the default kaas-node-<UID>, a machine host name will
be master-0. The custom naming format is more convenient and easier to
operate with.
You can enable the feature before or after management or regional cluster
deployment. If enabled after deployment, custom host names will apply to all
newly deployed machines in the region. Existing host names will remain the
same.
Added initial Technology Preview support for parallelizing node update
operations, which significantly improves the efficiency of cluster updates. To
configure the parallel node update, use the following parameters located under
spec.providerSpec of the Cluster object:
maxWorkerUpgradeCount - maximum number of worker nodes for simultaneous
update to limit machine draining during update
maxWorkerPrepareCount - maximum number of workers for artifacts
downloading to limit network load during update
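For example, a minimal sketch of the Cluster object fragment with both
parameters set to placeholder values; whether they reside directly under
spec.providerSpec or under spec.providerSpec.value is an assumption here:
spec:
  providerSpec:
    value:
      maxWorkerUpgradeCount: 5    # at most 5 worker nodes are drained and updated at a time
      maxWorkerPrepareCount: 10   # at most 10 worker nodes download artifacts at a time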
Implemented the CacheWarmupRequest resource to predownload, also known as warm up,
a list of artifacts included in a given set of Cluster releases into the
mcc-cache service only once per release. The feature facilitates and
speeds up deployment and update of managed clusters.
After a successful cache warm-up, the CacheWarmupRequest object is
automatically deleted from the cluster, and the cache remains available for
managed cluster deployment or update until the next Container Cloud
auto-upgrade of the management or regional cluster.
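The snippet below is a purely hypothetical sketch of a CacheWarmupRequest
object; the API version and the spec fields are assumptions for illustration
only, not the documented schema:
apiVersion: kaas.mirantis.com/v1alpha1     # assumed API group and version
kind: CacheWarmupRequest
metadata:
  name: warmup-cluster-releases
  namespace: default
spec:
  clusterReleases:     # assumed field: Cluster releases whose artifacts to predownload into mcc-cache
  - <clusterReleaseName>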
Caution
If the disk space for cache runs out, the cache for the oldest
object is evicted. To avoid running out of space in the cache, verify and
adjust its size before each cache warm-up.
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.2.
Added initial Technology Preview support for the Linux Audit daemon
auditd to monitor activity of cluster processes on any type of
Container Cloud cluster. The feature satisfies an essential requirement of
many security guides by enabling auditing of any cluster process to detect
potential malicious activity.
You can enable and configure auditd either during or after cluster deployment
using the Cluster object.
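A hypothetical sketch of enabling auditd through the Cluster object; the
field names below are assumptions for illustration, not the documented
schema:
spec:
  providerSpec:
    value:
      audit:
        auditd:
          enabled: true   # assumed fields; turns on the Linux Audit daemon on cluster nodes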
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.2.
Enhanced TLS certificates configuration for cluster applications:
Added support for configuration of TLS certificates for MKE on management
or regional clusters to the existing support on managed clusters.
Implemented the ability to configure TLS certificates using the Container
Cloud web UI through the Security section located in the
More > Configure cluster menu.
Expanded the capability to perform a graceful reboot on a management,
regional, or managed cluster for all supported providers by adding the
Reboot machines option to the cluster menu in the Container
Cloud web UI. The feature allows for a rolling reboot of all cluster
machines without workloads interruption. The reboot occurs in the order of
cluster upgrade policy.
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.2.
Creation and deletion of bare metal host credentials using web UI¶
Improved management of bare metal host credentials using the Container Cloud
web UI:
Added the Add Credential menu to the Credentials
tab. The feature facilitates association of credentials with bare metal
hosts created using the BM Hosts tab.
Implemented automatic deletion of credentials during deletion of bare metal
hosts after deletion of a managed cluster.
Improved the Node Labels menu in the Container Cloud web UI by
making it more intuitive. Replaced the greyed out (disabled) label names with
the No labels have been assigned to this machine. message and
the Add a node label button link.
Also, added the possibility to configure node labels for machine pools
after deployment using the More > Configure Pool option.
On top of continuous improvements delivered to the existing Container Cloud
guides, added the documentation on managing Ceph OSDs with a separate metadata
device.
The following issues have been addressed in the Mirantis Container Cloud
release 2.24.0 along with the Cluster release 14.0.0. For
the list of hot fixes delivered in the 2.24.1 patch release, see
2.24.1.
[5981] Fixed the issue with upgrade of a cluster containing more than
120 nodes getting stuck on one node with errors about IP addresses
exhaustion in the docker logs. On existing clusters, after updating to
the Cluster release 14.0.0 or later, you can optionally remove the abandoned
mke-overlay network using docker network rm mke-overlay.
[29604] Fixed the issue with the false positive
failed to get kubeconfig error occurring on the
Waiting for TLS settings to be applied stage during TLS configuration.
[29762] Fixed the issue with a wrong IP address being assigned after the
MetalLB controller restart.
[30635] Fixed the issue with the pg_autoscaler module of Ceph
Manager failing with the pool <poolNumber> has overlapping roots error
if a Ceph cluster contains a mix of pools with deviceClass
either explicitly specified or not specified.
[30857] Fixed the issue with irrelevant error message displaying in the
osd-prepare Pod during the deployment of Ceph OSDs on removable devices
on AMD nodes. Now, the error message clearly states that removable devices
(with hotplug enabled) are not supported for deploying Ceph OSDs.
This issue has been addressed since the Cluster release 14.0.0.
[30781] Fixed the issue with cAdvisor failing to collect metrics on
CentOS-based deployments. Missing metrics affected the
KubeContainersCPUThrottlingHigh alert and the following Grafana
dashboards: Kubernetes Containers, Kubernetes Pods,
and Kubernetes Namespaces.
[31288] Fixed the issue with the Fluentd agent failing and the
fluentd-logs Pods reporting the maximum open shards limit error,
thus preventing OpenSearch from accepting new logs. The fix enables the
possibility to increase the limit for maximum open shards using
cluster.max_shards_per_node. For details, see Tune StackLight for long-term log retention.
[31485] Fixed the issue with Elasticsearch Curator not deleting indices
according to the configured retention period on any type of Container Cloud
clusters.
Bare metal¶
[42386] A load balancer service does not obtain the external IP address¶
Due to the MetalLB upstream issue,
a load balancer service may not obtain the external IP address.
The issue occurs when two services share the same external IP address and have
the same externalTrafficPolicy value. Initially, the services have the
external IP address assigned and are accessible. After modifying the
externalTrafficPolicy value for both services from Cluster to
Local, the first service that has been changed remains with no external IP
address assigned. Though, the second service, which was changed later, has the
external IP assigned as expected.
To work around the issue, make a dummy change to the service object where
external IP is <pending>:
During netplan configuration after cluster deployment, changes in the
IpamHost object are not propagated to LCMMachine.
The workaround is to manually add any new label to the labels section
of the Machine object for the target host, which triggers machine
reconciliation and propagates network changes.
[35429] The WireGuard interface does not have the IPv4 address assigned¶
Due to the upstream Calico
issue, on clusters
with WireGuard enabled, the WireGuard interface on a node may not have
the IPv4 address assigned. This leads to broken inter-Pod communication
between the affected node and other cluster nodes.
The node is affected if the IP address is missing on the WireGuard interface:
Due to the upstream Calico issue, a controller node
cannot be deleted if the calico-node Pod is stuck blocking node deletion.
One of the symptoms is the following warning in the baremetal-operator
logs:
Resolving dependency Service dhcp-lb in namespace kaas failed:\
the server was unable to return a response in the time allotted,\
but may still be processing the request (get endpoints dhcp-lb).
As a workaround, delete the Pod that is stuck to retrigger the node
deletion.
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
[20736] Region deletion failure after regional deployment failure¶
If a baremetal-based regional cluster deployment fails before pivoting is
done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed
regional cluster deployment, including but not limited to the following
objects that contain the kaas.mirantis.com/region label of the affected
region:
On MOSK clusters, the Ansible provisioner may hang in a loop while trying to
remove LVM thin pool logical volumes (LVs) due to issues with volume detection
before removal. The Ansible provisioner cannot remove LVM thin pool LVs
correctly, so it consistently detects the same volumes whenever it scans
disks, leading to a repetitive cleanup process.
The following symptoms mean that a cluster can be affected:
A node was configured to use thin pool LVs. For example, it had the
OpenStack Cinder role in the past.
A bare metal node deployment flaps between the provisioning and
deprovisioning states.
In the Ansible provisioner logs, the following example warnings are growing:
88621.log:7389:2023-06-22 16:30:45.109 88621 ERROR ansible.plugins.callback.ironic_log[-] Ansible task clean:fail failed on node 14eb0dbc-c73a-4298-8912-4bb12340ff49:{'msg':'There are more devices to clean', '_ansible_no_log': None, 'changed': False}
Important
There are more devices to clean is a regular warning
indicating some in-progress tasks. But if the number of such warnings is
growing along with the node flapping between the provisioning and
deprovisioning states, the cluster is highly likely affected by the
issue.
As a workaround, erase disks manually using any preferred tool.
MKE backup may fail during update of a management, regional, or managed
cluster due to wrong permissions in the etcd backup
/var/lib/docker/volumes/ucp-backup/_data directory.
The issue affects only clusters that were originally deployed using early
Container Cloud releases delivered in 2020-2021.
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
Ceph¶
[34820] The Ceph ‘rook-operator’ fails to connect to RGW on FIPS nodes¶
Due to the upstream Ceph issue,
on clusters with the Federal Information Processing Standard (FIPS) mode
enabled, the Ceph rook-operator fails to connect to Ceph RADOS Gateway
(RGW) pods.
As a workaround, do not place Ceph RGW pods on nodes where FIPS mode is
enabled.
[34599] Ceph ‘ClusterWorkloadLock’ blocks upgrade from 2.23.5 to 2.24.1¶
On management clusters based on Ubuntu 18.04, after the cluster starts
upgrading from 2.23.5 to 2.24.1, all controller machines are stuck in the
In Progress state with the Distribution update in
progress hover message displaying in the Container Cloud web UI.
The issue is caused by clusterworkloadlock containing the outdated
release name in the status.release field, which blocks the LCM Controller
from proceeding with the machine upgrade. This behavior is caused by a complete removal
of the ceph-controller chart from management clusters and a failed
ceph-clusterworkloadlock removal.
The workaround is to manually remove ceph-clusterworkloadlock from the
management cluster to unblock upgrade:
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contain the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
Scale up the affected StatefulSet or Deployment back to the
original number of replicas and wait until its state becomes Running.
Update¶
[33438] ‘CalicoDataplaneFailuresHigh’ alert is firing during cluster update¶
During cluster update of a managed bare metal cluster, the false positive
CalicoDataplaneFailuresHigh alert may be firing. Disregard this alert,
which will disappear once cluster update succeeds.
The observed behavior is typical for calico-node during upgrades,
as workload changes occur frequently. Consequently, there is a possibility
of temporary desynchronization in the Calico dataplane. This can occasionally
result in throttling when applying workload changes to the Calico dataplane.
The following table lists the major components and their versions delivered in
the Container Cloud releases 2.24.0 - 2.24.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
In total, since the Container Cloud 2.23.0 major release, 2130
Common Vulnerabilities and Exposures (CVE) have been fixed in 2.24.0: 98 of
critical and 2032 of high severity.
Among them, 984 CVEs that are listed in Addressed CVEs - detailed have been fixed since the 2.23.5 patch
release: 62 of critical and 922 of high severity.
The remaining CVEs were addressed since Container Cloud 2.23.0 and the fixes
released with the patch releases of the 2.23.x series.
The summary table contains the total number of unique CVEs along with the
total number of issues fixed across the images.
The full list of the CVEs present in the current Container Cloud release is
available at the Mirantis Security Portal.
This section describes the specific actions you as a cloud operator need to
complete before or after your Container Cloud cluster update to the Cluster
release 14.0.0.
Pre-update actions¶
Update L2 templates on existing bare metal clusters¶
Since Container Cloud 2.24.0, the use of the l3Layout section in L2
templates is mandatory. Therefore, if your L2 templates do not contain this
section, manually add it for all existing clusters by defining all
subnets that are used in the npTemplate section of the L2 template.
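A minimal sketch of an L2Template fragment, assuming a single subnet named
lcm-nw that is referenced in npTemplate; the scope value is an assumption,
and you must list every subnet that your npTemplate actually uses:
spec:
  l3Layout:
  - subnetName: lcm-nw    # example name; add an entry for each subnet used in npTemplate
    scope: namespace      # assumed scope value for illustration
  npTemplate: |
    # existing netplan template that references the lcm-nw subnet
    ...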
Container Cloud 2.23.5 is the fourth patch release of the 2.23.x release
series that incorporates security fixes for CVEs of Critical and High
severity. This patch release:
Supports the latest major Cluster releases 12.7.0,
11.7.0.
Does not support greenfield deployments based on deprecated Cluster releases
12.7.3, 11.7.3,
12.7.2, 11.7.2,
12.7.1, 11.7.1,
12.5.0, and 11.6.0. Use the latest
available Cluster releases of the series instead.
This section describes known issues and contains the lists of updated
artifacts and CVE fixes for the Container Cloud release 2.23.5. For CVE fixes
delivered with the previous patch release, see security notes for
2.23.4, 2.23.3,
and 2.23.2.
For enhancements, addressed and known issues of the parent Container Cloud
release 2.23.0, refer to 2.23.0.
This section lists the component artifacts of the Container Cloud patch
release 2.23.5. For artifacts of the Cluster releases introduced in 2.23.5,
see Cluster release 12.7.4 and
Cluster release 11.7.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
In the Container Cloud patch release 2.23.5, 70 vendor-specific Common
Vulnerabilities and Exposures (CVE) have been addressed: 7 of critical and
63 of high severity.
The full list of the CVEs present in the current Container Cloud release is
available at the Mirantis Security Portal.
[32761] Bare-metal nodes stuck in the cleaning state¶
During the initial deployment of Container Cloud, some nodes may get stuck
in the cleaning state. As a workaround, wipe disks manually
before initializing the Container Cloud bootstrap.
Container Cloud 2.23.4 is the third patch release of the 2.23.x release
series that includes several addressed issues and incorporates security
fixes for CVEs of Critical and High severity. This patch release:
Supports the latest major Cluster releases 12.7.0,
11.7.0.
Does not support greenfield deployments based on deprecated Cluster releases
12.7.2, 11.7.2,
12.7.1, 11.7.1,
12.5.0, and 11.6.0. Use the latest
available Cluster releases of the series instead.
This section describes addressed issues and contains the lists of updated
artifacts and CVE fixes for the Container Cloud release 2.23.4. For CVE fixes
delivered with the previous patch release, see security notes for
2.23.3 and
2.23.2.
For enhancements, addressed and known issues of the parent Container Cloud
release 2.23.0, refer to 2.23.0.
This section lists the component artifacts of the Container Cloud patch
release 2.23.4. For artifacts of the Cluster releases introduced in 2.23.4,
see Cluster release 12.7.3 and
Cluster release 11.7.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Container Cloud 2.23.3 is the second patch release of the 2.23.x release
series that incorporates security fixes for CVEs of Critical and High
severity. This patch release:
Supports the latest major Cluster releases 12.7.0,
11.7.0.
Does not support greenfield deployments based on deprecated Cluster releases
12.7.1, 11.7.1,
12.5.0, and 11.6.0. Use the latest
available Cluster releases of the series instead.
This section contains the lists of updated artifacts and CVE fixes for the
Container Cloud release 2.23.3. For CVE fixes delivered with the previous
patch release, see security notes for 2.23.2.
For enhancements, addressed and known issues of the parent Container Cloud
release 2.23.0, refer to 2.23.0.
This section lists the component artifacts of the Container Cloud patch
release 2.23.3. For artifacts of the Cluster releases introduced in 2.23.3,
see Cluster release 12.7.2 and
Cluster release 11.7.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Container Cloud 2.23.2 is the first patch release of the 2.23.x release
series that incorporates security updates for CVEs with Critical and High
severity. This patch release:
Introduces support for patch Cluster releases 12.7.1 and
11.7.1.
Supports the latest major Cluster releases 12.7.0 and
11.7.0.
Does not support greenfield deployments based on deprecated Cluster releases
12.5.0 and 11.6.0. Use the latest
available Cluster releases of the series instead.
This section contains the lists of updated artifacts and CVE fixes for the
Container Cloud release 2.23.2. For enhancements, addressed and known issues
of the parent Container Cloud release 2.23.0, refer to
2.23.0.
This section lists the component artifacts of the Mirantis Container Cloud
release 2.23.2. For artifacts of the Cluster releases introduced in 2.23.2,
see Cluster release 12.7.1 and
Cluster release 11.7.1.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Does not support greenfield deployments based on deprecated Cluster releases
12.5.0 and 11.6.0. Use the latest
available Cluster releases of the series instead.
For details about the Container Cloud release 2.23.1, refer to its parent
releases 2.23.0 and 2.22.0:
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
Introduces support for the Cluster release 11.7.0
that is based on Mirantis Container Runtime 20.10.13 and
Mirantis Kubernetes Engine 3.5.7 with Kubernetes 1.21.
Does not support greenfield deployments on deprecated Cluster releases
11.6.0, 8.10.0, and
7.11.0. Use the latest available Cluster releases of the
series instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.23.0.
This section outlines new features and enhancements introduced in the
Mirantis Container Cloud release 2.23.0. For the list of enhancements in the
Cluster release 11.7.0 that is introduced by the Container Cloud release
2.23.0, see the Cluster releases (managed).
Implemented the capability to perform a graceful reboot on a management,
regional, or managed cluster for all supported providers using the
GracefulRebootRequest custom resource. Use this resource for a rolling
reboot of several or all cluster machines without workloads interruption.
The reboot occurs in the order of cluster upgrade policy.
The resource is also useful for a bulk reboot of machines, for example,
on large clusters.
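A hedged sketch of the resource; apart from the kind, the field names and
the assumption that an empty machines list reboots all cluster machines are
illustrative only:
apiVersion: lcm.mirantis.com/v1alpha1    # assumed API group and version
kind: GracefulRebootRequest
metadata:
  name: <clusterName>        # assumed naming convention
  namespace: <projectName>
spec:
  machines: []    # assumed field: empty list for all machines, or list specific machine names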
Readiness fields for ‘Machine’ and ‘Cluster’ objects¶
Enhanced Machine and Cluster objects by adding the following output
columns to the kubectl get machines -o wide and
kubectl get cluster -o wide commands to simplify
monitoring of machine and cluster states. More specifically, you can now
obtain the following machine and cluster details:
Machine object:
READY
UPGRADEINDEX
REBOOTREQUIRED
WARNINGS
LCMPHASE (renamed from PHASE)
Cluster object:
READY
RELEASE
WARNINGS
Example system response of the
kubectl get machines <machineName> -o wide command:
Deletion of persistent volumes during an OpenStack-based cluster deletion¶
TechPreview
Implemented the initial Technology Preview API support for deletion of
persistent volumes during an OpenStack-based managed cluster deletion.
To enable the feature, set the boolean volumesCleanupEnabled option in the
spec.providerSpec.value section of the Cluster object before a managed
cluster deletion.
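For example, a minimal Cluster object fragment with the option enabled,
applied before triggering the managed cluster deletion:
spec:
  providerSpec:
    value:
      volumesCleanupEnabled: true   # delete OpenStack persistent volumes together with the cluster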
Implemented the capability to disable time sync management during a management
or regional cluster bootstrap using the ntpEnabled=false option.
The default setting remains ntpEnabled=true. The feature disables
the management of chrony configuration by Container Cloud and enables you
to use your own system for chrony management.
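A hedged sketch, assuming the option is set in the Cluster object template
used during bootstrap; the exact location may differ per provider template:
spec:
  providerSpec:
    value:
      ntpEnabled: false   # Container Cloud stops managing chrony; use your own NTP management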
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.1.
The ‘Upgrade’ button for easy cluster update through the web UI¶
Implemented a separate Upgrade button in the Container Cloud web
UI to simplify the start of a cluster update. This button provides easy access
to the cluster update dialog and has the same functionality as the
Upgrade cluster option available under the cluster menu.
The Upgrade button is located on the Clusters page
next to the More action icon located in the last column for each
cluster when a new Cluster release update becomes available.
The following issues have been addressed in the Mirantis Container Cloud
release 2.23.0 along with the Cluster release 11.7.0:
[29647] Fixed the issue with the Network prepared stage getting stuck
in the NotStarted status during deployment of a vSphere-based management
or regional cluster with IPAM disabled.
[26896] Fixed the issue with the MetalLB liveness and readiness timeouts
in a slow network.
[28313] Fixed the issue with the iam-keycloak Pod starting slowly
because of DB errors causing timeouts while waiting for the OIDC
configuration readiness.
[28675] Fixed the issue with the Ceph OSD-related parameters configured
using rookConfig in KaaSCephCluster not being applied until OSDs
are restarted. Now, parameters for Ceph OSD daemons apply during runtime
instead of being set directly in ceph.conf. Therefore, no restart is
required.
[30040] Fixed the issue with the HelmBundleReleaseNotDeployed alert
that has the release_name=opensearch label firing during the Container
Cloud or Cluster release update due to issues with the claim request size
in the elasticsearch.persistentVolumeClaimSize configuration.
[29329] Fixed the issue with recreation of the Patroni container replica
being stuck in the degraded state due to the liveness probe killing the
container that runs the pg_rewind procedure during cluster update.
[28822] Fixed the issue with Reference Application triggering false positive
alerts during its upgrade.
[28479] Fixed the issue with the restarts count of the metric-collector
Pod being increased in time with reason:OOMKilled in
containerStatuses of the metric-collector Pod on baremetal-based
management clusters with HTTP proxy enabled.
[28417] Fixed the issue with the Reports Dashboards plugin not being enabled
by default preventing the use of the reporting option. For details about
this plugin, see the GitHub OpenSearch documentation: OpenSearch Dashboards
Reports.
[28373] Fixed the issue with Alerta getting stuck after a failed
initialization during cluster creation with StackLight enabled.
Due to the upstream MetalLB issue,
a race condition occurs when assigning an IP address after the MetalLB
controller restart. If a new service of the LoadBalancer type is created
during the MetalLB Controller restart, then this service can be assigned an IP
address that was already assigned to another service before the MetalLB
Controller restart.
To verify that the cluster is affected:
Verify whether IP addresses of the LoadBalancer (LB) type are duplicated
where they are not supposed to:
kubectl get svc -A | grep LoadBalancer
Note
Some services use shared IP addresses on purpose. In the example
system response below, these are services using the IP address 10.0.1.141.
In the above example, the iam-keycloak-http and kaas-kaas-ui
services erroneously use the same IP address 10.100.91.101. They both use the
same port 443 producing a collision when an application tries to access the
10.100.91.101:443 endpoint.
Workaround:
Unassign the current LB IP address for the selected service, as no LB IP
address can be used for the NodePort service:
The second affected service will continue using its current LB IP address.
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
[20736] Region deletion failure after regional deployment failure¶
If a baremetal-based regional cluster deployment fails before pivoting is
done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed
regional cluster deployment, including but not limited to the following
objects that contain the kaas.mirantis.com/region label of the affected
region:
Upgrade of a cluster with more than 120 nodes gets stuck with errors about
IP addresses exhaustion in the docker logs.
Note
If you plan to scale your cluster to more than 120 nodes, the cluster
will be affected by the issue. Therefore, you will have to perform the
workaround below.
Workaround:
Caution
If you have not run the cluster upgrade yet, simply recreate the
mke-overlay network as described in step 6 and skip all other steps.
Note
If you successfully upgraded the cluster with fewer than 120 nodes
but plan to scale it to more than 120 nodes, proceed with steps 2-9.
Verify that MKE nodes are upgraded:
On any master node, run the following command to identify
ucp-worker-agent that has a newer version:
Recreate the mke-overlay network with a correct CIDR that must be
at least /20 and must not overlap with other subnets in the
cluster network. For example:
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
[30294] Replacement of a master node is stuck on the calico-node Pod start¶
During replacement of a master node on a cluster of any type, the
calico-node Pod fails to start on a new node that has the same IP address
as the node being replaced.
Workaround:
Log in to any master node.
From a CLI with an MKE client bundle, create a shell alias to start
calicoctl using the mirantis/ucp-dsinfo image:
In the above command, replace the following values with the corresponding
settings of the affected cluster:
<etcdEndpoint> is the etcd endpoint defined in the Calico
configuration file. For example, ETCD_ENDPOINTS=127.0.0.1:12378
<mkeVersion> is the MKE version installed on your cluster.
For example, mirantis/ucp-dsinfo:3.5.7.
Verify the node list on the cluster:
kubectl get node
Compare this list with the node list in Calico to identify the old node:
calicoctl get node -o wide
Remove the old node from Calico:
calicoctl delete node kaas-node-<nodeID>
[27797] A cluster ‘kubeconfig’ stops working during MKE minor version update¶
During update of a Container Cloud cluster of any type, if the MKE minor
version is updated from 3.4.x to 3.5.x, access to the cluster using the
existing kubeconfig fails with the You must be logged in to the server
(Unauthorized) error due to OIDC settings being reconfigured.
As a workaround, during the cluster update process, use the admin
kubeconfig instead of the existing one. Once the update completes, you can
use the existing cluster kubeconfig again.
When setting a new Transport Layer Security (TLS) certificate for a cluster,
the false positive failed to get kubeconfig error may occur on the
Waiting for TLS settings to be applied stage. No actions are required.
Therefore, disregard the error.
To verify the status of the TLS configuration being applied:
In this example, expirationTime equals the NotAfter field of the
server certificate. And the value of hostname contains the configured
application name.
Ceph¶
[30857] Irrelevant error during Ceph OSD deployment on removable devices¶
The deployment of Ceph OSDs fails with the following messages in the status
section of the KaaSCephCluster custom resource:
shortClusterInfo:
  messages:
  - Not all osds are deployed
  - Not all osds are in
  - Not all osds are up
To find out if your cluster is affected, verify if the devices on
the AMD hosts you use for the Ceph OSDs deployment are removable.
For example, if the sdb device name is specified in
spec.cephClusterSpec.nodes.storageDevices of the KaaSCephCluster
custom resource for the affected host, run:
# cat /sys/block/sdb/removable
1
The system output of 1 indicates that the reason for the above messages in
status is the enabled hotplug functionality on the AMD nodes, which marks
all drives as removable. The hotplug functionality is not supported by Ceph
in Container Cloud.
As a workaround, disable the hotplug functionality in the BIOS settings
for disks that are configured to be used as Ceph OSD data devices.
[30635] Ceph ‘pg_autoscaler’ is stuck with the ‘overlapping roots’ error¶
Due to the upstream Ceph issue
occurring since Ceph Pacific, the pg_autoscaler module of Ceph Manager
fails with the pool <poolNumber> has overlapping roots error if a Ceph
cluster contains a mix of pools with deviceClass either explicitly
specified or not specified.
The deviceClass parameter is required for a pool definition in the
spec section of the KaaSCephCluster object, but not required for Ceph
RADOS Gateway (RGW) and Ceph File System (CephFS).
Therefore, if sections for Ceph RGW or CephFS data or metadata pools are
defined without deviceClass, then autoscaling of placement groups is
disabled on a cluster due to overlapping roots. Overlapping roots imply that
Ceph RGW and/or CephFS pools obtained the default crush rule and have no
demarcation on a specific class to store data.
Note
If pools for Ceph RGW and CephFS already have deviceClass
specified, skip the corresponding steps of the below procedure.
Note
Perform the below procedure on the affected managed cluster using
its kubeconfig.
Workaround:
Obtain failureDomain and required replicas for Ceph RGW and/or CephFS
pools:
Note
If the KaaSCephCluster spec section does not contain
failureDomain, failureDomain equals host by default to store
one replica per node.
Note
The types of pools crush rules include:
An erasureCoded pool requires the codingChunks+dataChunks
number of available units of failureDomain.
A replicated pool requires the replicated.size number of
available units of failureDomain.
To obtain Ceph RGW pools, use the
spec.cephClusterSpec.objectStorage.rgw section of the
KaaSCephCluster object. For example:
The dataPool pool requires the sum of codingChunks and
dataChunks values representing the number of available units of
failureDomain. In the example above, for failureDomain:host,
dataPool requires 3 available nodes to store its objects.
The metadataPool pool requires the replicated.size number
of available units of failureDomain. For failureDomain:host,
metadataPool requires 3 available nodes to store its objects.
To obtain CephFS pools, use the
spec.cephClusterSpec.sharedFilesystem.cephFS section of the
KaaSCephCluster object. For example:
The default-pool and metadataPool pools require the
replicated.size number of available units of failureDomain.
For failureDomain:host, default-pool requires 3 available
nodes to store its objects.
The second-pool pool requires the sum of codingChunks and
dataChunks representing the number of available units of
failureDomain. For failureDomain:host, second-pool requires
3 available nodes to store its objects.
Obtain the device class that meets the desired number of required replicas
for the defined failureDomain.
Sum up the USED size of all <rgwName>.rgw.* pools and
compare it with the AVAIL size of each applicable device class
selected in the previous step.
Note
As Ceph RGW pools lack explicit specification of
deviceClass, they may store objects on all device classes.
The resulting size on the device class can be smaller than the calculated
USED size because part of the data can already be stored in the
desired class.
Therefore, limiting pools to a single device class may result in a
smaller occupied data size than the total USED size.
Nonetheless, calculating the USED size of all pools remains
valid because the pool data may not be stored on the selected device
class.
For CephFS data or metadata pools, use the previous step to calculate
the USED size of pools and compare it with the AVAIL size.
Decide which of the applicable device classes, based on the required
replicas and available size, is preferable for storing Ceph RGW and CephFS data.
In the example output above, hdd and ssd are both applicable.
Therefore, select any of them.
Note
You can select different device classes for Ceph RGW and
CephFS. For example, hdd for Ceph RGW and ssd for CephFS.
Select a device class based on performance expectations, if any.
Create the rule-helper script to switch Ceph RGW or CephFS pools to the
selected device class.
Create the /tmp/rule-helper.py file with the following content:
cat > /tmp/rule-helper.py << EOF
import argparse
import json
import subprocess
from sys import argv, exit


def get_cmd(cmd_args):
    output_args = ['--format', 'json']
    _cmd = subprocess.Popen(cmd_args + output_args,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    stdout, stderr = _cmd.communicate()
    if stderr:
        error = stderr
        print("[ERROR] Failed to get '{0}': {1}".format(' '.join(cmd_args), stderr))
        return
    return stdout


def format_step(action, cmd_args):
    return "{0}:\n\t{1}".format(action, ' '.join(cmd_args))


def process_rule(rule):
    # Print the manual steps required to recreate the crush rule with a device class
    steps = []
    new_rule_name = rule['rule_name'] + '_v2'
    if rule['type'] == "replicated":
        rule_create_args = ['ceph', 'osd', 'crush', 'create-replicated',
                            new_rule_name, rule['root'], rule['failure_domain'], rule['device_class']]
        steps.append(format_step("create a new replicated rule for pool", rule_create_args))
    else:
        new_profile_name = rule['profile_name'] + '_' + rule['device_class']
        profile_create_args = ['ceph', 'osd', 'erasure-code-profile', 'set', new_profile_name]
        for k, v in rule['profile'].items():
            profile_create_args.append("{0}={1}".format(k, v))
        rule_create_args = ['ceph', 'osd', 'crush', 'create-erasure', new_rule_name, new_profile_name]
        steps.append(format_step("create a new erasure-coded profile", profile_create_args))
        steps.append(format_step("create a new erasure-coded rule for pool", rule_create_args))
    set_rule_args = ['ceph', 'osd', 'pool', 'set', 'crush_rule', rule['pool_name'], new_rule_name]
    revert_rule_args = ['ceph', 'osd', 'pool', 'set', 'crush_rule', new_rule_name, rule['pool_name']]
    rm_old_rule_args = ['ceph', 'osd', 'crush', 'rule', 'rm', rule['rule_name']]
    rename_rule_args = ['ceph', 'osd', 'crush', 'rule', 'rename', new_rule_name, rule['rule_name']]
    steps.append(format_step("set pool crush rule to new one", set_rule_args))
    steps.append("check that replication is finished and status healthy: ceph -s")
    steps.append(format_step("in case of any problems revert step 2 and stop procedure", revert_rule_args))
    steps.append(format_step("remove standard (old) pool crush rule", rm_old_rule_args))
    steps.append(format_step("rename new pool crush rule to standard name", rename_rule_args))
    if rule['type'] != "replicated":
        rm_old_profile_args = ['ceph', 'osd', 'erasure-code-profile', 'rm', rule['profile_name']]
        steps.append(format_step("remove standard (old) erasure-coded profile", rm_old_profile_args))
    for idx, step in enumerate(steps):
        print(" {0}) {1}".format(idx+1, step))


def check_rules(args):
    # Find pools matching the prefix whose crush rules lack a device class
    extra_pools_lookup = []
    if args.type == "rgw":
        extra_pools_lookup.append(".rgw.root")
    pools_str = get_cmd(['ceph', 'osd', 'pool', 'ls', 'detail'])
    if pools_str == '':
        return
    rules_str = get_cmd(['ceph', 'osd', 'crush', 'rule', 'dump'])
    if rules_str == '':
        return
    try:
        pools_dump = json.loads(pools_str)
        rules_dump = json.loads(rules_str)
        if len(pools_dump) == 0:
            print("[ERROR] No pools found")
            return
        if len(rules_dump) == 0:
            print("[ERROR] No crush rules found")
            return
        crush_rules_recreate = []
        for pool in pools_dump:
            if pool['pool_name'].startswith(args.prefix) or pool['pool_name'] in extra_pools_lookup:
                rule_id = pool['crush_rule']
                for rule in rules_dump:
                    if rule['rule_id'] == rule_id:
                        recreate = False
                        new_rule = {'rule_name': rule['rule_name'], 'pool_name': pool['pool_name']}
                        for step in rule.get('steps', []):
                            root = step.get('item_name', '').split('~')
                            if root[0] != '' and len(root) == 1:
                                new_rule['root'] = root[0]
                                continue
                            failure_domain = step.get('type', '')
                            if failure_domain != '':
                                new_rule['failure_domain'] = failure_domain
                        if new_rule.get('root', '') == '':
                            continue
                        new_rule['device_class'] = args.device_class
                        if pool['erasure_code_profile'] == "":
                            new_rule['type'] = "replicated"
                        else:
                            new_rule['type'] = "erasure"
                            profile_str = get_cmd(['ceph', 'osd', 'erasure-code-profile', 'get', pool['erasure_code_profile']])
                            if profile_str == '':
                                return
                            profile_dump = json.loads(profile_str)
                            profile_dump['crush-device-class'] = args.device_class
                            new_rule['profile_name'] = pool['erasure_code_profile']
                            new_rule['profile'] = profile_dump
                        crush_rules_recreate.append(new_rule)
                        break
        print("Found {0} pools with crush rules require device class set".format(len(crush_rules_recreate)))
        for new_rule in crush_rules_recreate:
            print("- Pool {0} requires crush rule update, device class is not set".format(new_rule['pool_name']))
            process_rule(new_rule)
    except Exception as err:
        print("[ERROR] Failed to get info from Ceph: {0}".format(err))
        return


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Ceph crush rules checker. Specify device class and service name.',
        prog=argv[0], usage='%(prog)s [options]')
    parser.add_argument('--type', type=str, help='Type of pool: rgw, cephfs', default='', required=True)
    parser.add_argument('--prefix', type=str, help='Pool prefix. If objectstore - use objectstore name, if CephFS - CephFS name.', default='', required=True)
    parser.add_argument('--device-class', type=str, help='Device class to switch on.', required=True)
    args = parser.parse_args()
    if len(argv) < 3:
        parser.print_help()
        exit(0)
    check_rules(args)
EOF
Exit the ceph-tools Pod.
For Ceph RGW, execute the rule-helper script to output the step-by-step
instruction and run each step provided in the output manually.
Note
The following steps include creation of crush rules with the same
parameters as before but with the device class specification and switching
of pools to new crush rules.
When executing the rule-helper script steps for Ceph RGW, substitute the following values:
<rgwName> with the Ceph RGW name from
spec.cephClusterSpec.objectStorage.rgw.name in the
KaaSCephCluster object. In the example above, the name is
openstack-store.
<deviceClass> with the device class selected in the previous
steps.
Using the output of the command from the previous step, run manual
commands step-by-step.
For CephFS, execute the rule-helper script in the same way, substituting:
<cephfsName> with the CephFS name from
spec.cephClusterSpec.sharedFilesystem.cephFS[0].name in the
KaaSCephCluster object. In the example above, the name is
cephfs-store.
<deviceClass> with the device class selected in the previous
steps.
Using the output of the command from the previous step, run manual
commands step-by-step.
Verify that the Ceph cluster has rebalanced and has the HEALTH_OK
status:
ceph -s
Exit the ceph-tools Pod.
Verify the pg_autoscaler module after switching deviceClass for
all required pools:
ceph osd pool autoscale-status
The system response must contain all Ceph RGW and CephFS pools.
On the management cluster, edit the KaaSCephCluster object of the
corresponding managed cluster by adding the selected device class to the
deviceClass parameter of the updated Ceph RGW and CephFS pools:
Substitute <rgwDeviceClass> with the device class applied to Ceph RGW
pools and <cephfsDeviceClass> with the device class applied to CephFS
pools.
You can use this configuration step for further management of Ceph RGW
and/or CephFS. It does not impact the existing Ceph cluster configuration.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal with Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contain the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
This CronJob is removed automatically during upgrade to the
major Container Cloud release 2.24.0 or to the patch Container Cloud release
2.23.3 if you obtain patch releases.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.23.0. For major components and
versions of the Cluster release introduced in 2.23.0, see
Cluster release 11.7.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section lists the component artifacts of the Mirantis Container Cloud
release 2.23.0. For artifacts of the Cluster release introduced in 2.23.0,
see Cluster release 11.7.0.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Introduces support for the Cluster release 11.6.0
that is based on Mirantis Container Runtime 20.10.13 and
Mirantis Kubernetes Engine 3.5.5 with Kubernetes 1.21.
Does not support greenfield deployments on deprecated Cluster releases
11.5.0 and 8.10.0. Use the latest
available Cluster releases of the series instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.22.0.
This section outlines new features and enhancements introduced in the
Mirantis Container Cloud release 2.22.0. For the list of enhancements in the
Cluster release 11.6.0 that is introduced by the Container Cloud release
2.22.0, see the Cluster releases (managed).
The ‘rebootRequired’ notification in the baremetal-based machine status¶
Added the rebootRequired field to the status of a Machine object for
the bare metal provider. This field indicates whether a manual host reboot
is required to complete the Ubuntu operating system updates, if any.
You can view this notification either using the Container Cloud API or web UI:
API: the reboot.required field set to true in
status:providerStatus of a Machine object
Web UI: the One or more machines require a reboot
notification on the Clusters and Machines pages
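For illustration, a sketch of the corresponding Machine status fragment,
assuming that the field renders as a nested reboot section; treat the exact
layout as an assumption:
status:
  providerStatus:
    reboot:
      required: true    # a manual host reboot is required to complete the OS updates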
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.1.
Custom network configuration for managed clusters based on Equinix Metal with private networking¶
TechPreview
Implemented the ability to configure advanced network settings on managed
clusters that are based on Equinix Metal with private networking. Using the custom parameter in the
Cluster object, you can customize network configuration for the cluster
machines. The feature comprises usage of dedicated Subnet and
L2Template objects that contain necessary configuration for cluster
machines.
Custom TLS certificates for the StackLight ‘iam-proxy’ endpoints¶
Implemented the ability to set up custom TLS certificates for the following
StackLight iam-proxy endpoints on any type of Container Cloud clusters:
Implemented the following Container Cloud objects describing the history of a
cluster and machine deployment and update:
ClusterDeploymentStatus
ClusterUpgradeStatus
MachineDeploymentStatus
MachineUpgradeStatus
Using these objects, you can inspect cluster and machine deployment and
update stages, their time stamps, statuses, and failure messages, if any.
In the Container Cloud web UI, use the History option located
under the More action icon of a cluster and machine.
For existing clusters, these objects become available after the management
cluster upgrade to Container Cloud 2.22.0.
Extended logging format for essential management cluster components¶
Extended the logging format for the admission-controller,
storage-discovery, and all supported <providerName>-provider services
of a management cluster. Now, log records for these services contain
the following entries:
The following issues have been addressed in the Mirantis Container Cloud
release 2.22.0 along with the Cluster release 11.6.0:
[27192] Fixed the issue that prevented portforward-controller from
accepting new connections correctly.
[26659] Fixed the issue that caused the deployment of a regional cluster
based on bare metal or Equinix Metal with private networking to fail with
mcc-cache Pods being stuck in the CrashLoopBackOff state with continuous
restarts.
[28783] Fixed the issue with Ceph condition getting stuck in absence of the
Ceph cluster secrets information on the MOSK 22.3 clusters.
Caution
Starting from MOSK 22.4, the Ceph cluster
version updates to 15.2.17. Therefore, if you applied the workaround for
MOSK 22.3 described in Ceph known issue 28783, remove the version parameter definition from
KaaSCephCluster after the managed cluster update to MOSK
22.4.
[26820] Fixed the issue with the status section in the
KaaSCephCluster.status CR not reflecting issues during a Ceph cluster
deletion.
[25624] Fixed the issue with inability to specify the Ceph pool API
parameters by adding the parameters option that specifies the key-value
map for the parameters of the Ceph pool.
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
[28526] Fixed the issue with a low CPU limit 100m for kaas-exporter
blocking metric collection.
[28134] Fixed the issue with failure to update a cluster with nodes being
stuck in the Prepare state due to the error when evicting pods error for
Patroni.
[27732-1] Fixed the issue with the OpenSearch
elasticsearch.persistentVolumeClaimSize custom setting being overwritten
by logging.persistentVolumeClaimSize during deployment of a Container
Cloud cluster of any type and be set to the default 30Gi.
Depending on available resources on existing clusters that were affected by
the issue, additional actions may be required after an update to Container
Cloud 2.22.0. For details, see OpenSearchPVCMismatch alert raises due to the OpenSearch PVC size mismatch.
New clusters deployed on top of Container Cloud 2.22.0 are not affected.
[27732-2] Fixed the issue with custom settings for the deprecated
elasticsearch.logstashRetentionTime parameter being overwritten by the
default setting set to 1 day.
[20876] Fixed the issue with StackLight Pods getting stuck with the
Pod predicate NodeAffinity failed error due to the
StackLight node label added to one machine and then removed
from another one.
[28651] Updated Telemeter for StackLight to fix the discovered
vulnerabilities.
Bare metal¶
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
[20736] Region deletion failure after regional deployment failure¶
If a baremetal-based regional cluster deployment fails before pivoting is
done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed
regional cluster deployment, including but not limited to the following
objects that contain the kaas.mirantis.com/region label of the affected
region:
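A minimal sketch of such a cleanup command; the object type and region name
are placeholders that you must adapt to your deployment:
kubectl delete <objectType> --all-namespaces -l kaas.mirantis.com/region=<regionName>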
During deployment of a vSphere-based management or regional cluster with IPAM
disabled, the Network prepared stage gets stuck in the NotStarted
status. The issue does not affect cluster deployment. Therefore, disregard
the error message.
LCM¶
[5782] Manager machine fails to be deployed during node replacement¶
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
[27797] A cluster ‘kubeconfig’ stops working during MKE minor version update¶
During update of a Container Cloud cluster of any type, if the MKE minor
version is updated from 3.4.x to 3.5.x, access to the cluster using the
existing kubeconfig fails with the You must be logged in to the server
(Unauthorized) error due to OIDC settings being reconfigured.
As a workaround, during the cluster update process, use the admin kubeconfig instead of the existing one. Once the update completes, you can
use the existing cluster kubeconfig again.
When setting a new Transport Layer Security (TLS) certificate for a cluster,
the false-positive failed to get kubeconfig error may occur at the
Waiting for TLS settings to be applied stage. No actions are required.
Therefore, disregard the error.
To verify the status of the TLS configuration being applied:
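One possible way to inspect the TLS status, assuming it is exposed under
status.providerStatus of the Cluster object:
kubectl get cluster <clusterName> -n <projectName> -o jsonpath='{.status.providerStatus.tls}'
The output contains the expirationTime and hostname fields described below.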
In this example, expirationTime equals the NotAfter field of the
server certificate. And the value of hostname contains the configured
application name.
StackLight¶
[30040] OpenSearch is not in the ‘deployed’ status during cluster update¶
The issue may affect the Container Cloud or Cluster release update
to the following versions:
2.22.0 for management and regional clusters
11.6.0 for management, regional, and managed clusters
13.2.5, 13.3.5, 13.4.3, and 13.5.2 for attached MKE clusters
The issue does not affect clusters originally deployed since the following
Cluster releases: 11.0.0, 8.6.0, 7.6.0.
During cluster update to versions mentioned in the note above, the following
OpenSearch-related error may occur on clusters that were originally deployed
or attached using Container Cloud 2.15.0 or earlier, before the transition
from Elasticsearch to OpenSearch:
The stacklight/opensearch release of the stacklight/stacklight-bundle HelmBundle reconciled by the stacklight/stacklight-helm-controller Controller is not in the "deployed" status for the last 15 minutes.
The issue affects clusters with elasticsearch.persistentVolumeClaimSize
configured for values other than 30Gi.
To verify that the cluster is affected:
Verify whether the HelmBundleReleaseNotDeployed alert for the
opensearch release is firing. If so, the cluster is most probably
affected. Otherwise, the cluster is not affected.
Verify the reason of the HelmBundleReleaseNotDeployed alert for the
opensearch release:
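For example, assuming the release statuses are exposed in the HelmBundle
status, you can inspect the reason as follows:
kubectl get helmbundle stacklight-bundle -n stacklight -o jsonpath='{.status.releaseStatuses.opensearch}'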
During an update of a Container Cloud cluster of any type, recreation of the
Patroni container replica is stuck in the degraded state due to the liveness
probe killing the container that runs the pg_rewind procedure. The issue
affects clusters on which the pg_rewind procedure takes more time than the
full cycle of the liveness probe.
The sample logs of the affected cluster:
INFO: doing crash recovery in a single user mode
ERROR: Crash recovery finished with code=-6
INFO: stdout=
INFO: stderr=2023-01-11 10:20:34 GMT [64]: [1-1] 63be8d72.40 0 LOG: database system was interrupted; last known up at 2023-01-10 17:00:59 GMT
[64]: [2-1] 63be8d72.40 0 LOG: could not read from log segment 00000002000000000000000F, offset 0: read 0 of 8192
[64]: [3-1] 63be8d72.40 0 LOG: invalid primary checkpoint record
[64]: [4-1] 63be8d72.40 0 PANIC: could not locate a valid checkpoint record
On managed clusters with enabled Reference Application, the following alerts
are triggered during a managed cluster update from the Cluster release 11.5.0
to 11.6.0 or 7.11.0 to 11.5.0:
KubeDeploymentOutage for the refapp Deployment
RefAppDown
RefAppProbeTooLong
RefAppTargetDown
This behavior is expected, no actions are required. Therefore, disregard these
alerts.
[28479] Increase of the ‘metric-collector’ Pod restarts due to OOM¶
On the baremetal-based management clusters, the restarts count of the
metric-collector Pod increases over time with reason: OOMKilled in
the containerStatuses of the metric-collector Pod. Only clusters with
HTTP proxy enabled are affected.
Such behavior is expected. Therefore, disregard these restarts.
[28373] Alerta can get stuck after a failed initialization¶
During creation of a Container Cloud cluster of any type with StackLight
enabled, Alerta can get stuck after a failed initialization with only 1 Pod
in the READY state. For example:
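One way to check the Alerta Pods readiness (illustrative only):
kubectl get pods -n stacklight | grep alerta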
On a managed cluster, the StackLight Pods may get stuck with the
Pod predicate NodeAffinity failed error in the Pod status. The issue may
occur if the StackLight node label was added to one machine and
then removed from another one.
The issue does not affect the StackLight services: all required StackLight
Pods migrate successfully, except for extra Pods that are created and get
stuck during Pod migration.
Ceph¶
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contain the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
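A possible way to do so, assuming the Ceph CSI plugin runs in the rook-ceph
namespace with the app=csi-rbdplugin label:
kubectl get pods -n rook-ceph -l app=csi-rbdplugin -o wide | grep <nodeName>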
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.22.0. For major components and
versions of the Cluster release introduced in 2.22.0, see
Cluster release 11.6.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section lists the components artifacts of the Mirantis Container Cloud
release 2.22.0. For artifacts of the Cluster release introduced in 2.22.0,
see Cluster release 11.6.0.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section contains historical information on the unsupported Container
Cloud releases delivered in 2022. For the latest supported Container
Cloud release, see Container Cloud releases.
Based on 2.15.0, this release introduces the Cluster release 8.5.0
that is based on 5.22.0 and supports Mirantis OpenStack for Kubernetes
(MOSK)
22.1.
For the list of Cluster releases 7.x and 5.x that are supported by
2.15.1 as well as for its features with addressed and known issues,
refer to the parent release 2.15.0.
Introduces support for Mirantis Kubernetes Engine 3.5.5 with Kubernetes 1.21
and Mirantis Container Runtime 20.10.13 in the 12.x Cluster release series.
Supports the latest Cluster releases 7.11.0 and
11.5.0.
Does not support greenfield deployments based on deprecated Cluster releases
11.4.0, 8.10.0, and
7.10.0. Use the latest available Cluster releases of the
series instead.
For details about the Container Cloud release 2.21.1, refer to its parent
release 2.21.0:
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
Introduces support for the Cluster release 11.5.0
that is based on Mirantis Container Runtime 20.10.13 and
Mirantis Kubernetes Engine 3.5.5 with Kubernetes 1.21.
Introduces support for the Cluster release 7.11.0 that is
based on Mirantis Container Runtime 20.10.13 and Mirantis Kubernetes Engine
3.4.11 with Kubernetes 1.20.
Does not support greenfield deployments on deprecated Cluster releases
11.4.0, 8.8.0, and
7.10.0. Use the latest available Cluster releases of the
series instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.21.0.
Caution
Container Cloud 2.21.0 requires manual post-upgrade steps.
For details, see Post-upgrade actions.
This section outlines new features and enhancements introduced in the
Mirantis Container Cloud release 2.21.0. For the list of enhancements in the
Cluster releases 11.5.0 and 7.11.0 that are introduced by the Container Cloud
release 2.21.0, see the Cluster releases (managed).
‘BareMetalHostCredential’ custom resource for bare metal hosts¶
Implemented the BareMetalHostCredential custom resource to simplify
permissions and roles management on a bare metal management, regional, and
managed cluster.
Note
For MOSK-based deployments, the feature support is
available since MOSK 22.5.
The BareMetalHostCredential object creation triggers the following
automatic actions:
Create an underlying Secret object containing data about username
and password of the bmc account of the related
BareMetalHostCredential object.
Erase sensitive password data of the bmc account from the
BareMetalHostCredential object.
Add the created Secret object name to the spec.password.name
section of the related BareMetalHostCredential object.
Update BareMetalHost.spec.bmc.credentialsName with the
BareMetalHostCredential object name.
Note
When you delete a BareMetalHost object, the related
BareMetalHostCredential object is deleted automatically.
Note
On existing clusters, a BareMetalHostCredential object is
automatically created for each BareMetalHost object during a cluster
update.
Enhanced the logic of the dnsmasq server to listen on the PXE network of
the management cluster by using the dhcp-lb Kubernetes Service instead of
listening on the PXE interface of one management cluster node.
To configure the DHCP relay service, specify the external address of the
dhcp-lb Kubernetes Service as an upstream address for the relayed DHCP
requests, which is the IP helper address for DHCP. There is the dnsmasq
Deployment behind this service that can only accept relayed DHCP requests.
Container Cloud has its own DHCP relay running on one of the management
cluster nodes. That DHCP relay serves for proxying DHCP requests in the
same L2 domain where the management cluster nodes are located.
The enhancement comprises deprecation of the dnsmasq.dhcp_range parameter.
Use the Subnet object configuration for this purpose instead.
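As an illustration only, a Subnet object that defines a DHCP range may look
similar to the following sketch; the label name, label value, and field layout
are assumptions and must be verified against your deployment:
apiVersion: ipam.mirantis.com/v1alpha1
kind: Subnet
metadata:
  name: mgmt-dhcp-range
  namespace: default
  labels:
    ipam/SVC-dhcp-range: "1"
spec:
  cidr: 10.0.50.0/24
  includeRanges:
  - 10.0.50.100-10.0.50.200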
Note
If you configured multiple DHCP ranges before Container Cloud 2.21.0
during the management cluster bootstrap, the DHCP configuration will
automatically migrate to Subnet objects after cluster upgrade to 2.21.0.
Caution
Using custom DNS server addresses for servers that boot over
PXE is not supported.
Combining router and seed node settings on one Equinix Metal server¶
Implemented the ability to combine configuration of a router and seed node on
the same server when preparing infrastructure for an Equinix Metal based
Container Cloud with private networking using Terraform templates.
Set router_as_seed to true in the required Metro configuration while
preparing terraform.tfvars to combine both the router and seed node roles.
Implemented the possibility to safely clean up node resources using the
Container Cloud API before deleting the node from a cluster. Using the
deletionPolicy: graceful parameter in the providerSpec.value section
of the Machine object, the cloud provider controller now prepares a
machine for deletion by cordoning, draining, and removing the related node
from Docker Swarm. If required, you can abort a machine deletion when using
deletionPolicy: graceful, but only before the related node is removed
from Docker Swarm.
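For example, the relevant Machine object fragment (only the fields mentioned
above are shown):
spec:
  providerSpec:
    value:
      deletionPolicy: graceful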
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
Add custom Docker registries using the Container Cloud web UI¶
Enhanced support for custom Docker registries configuration in management,
regional, and managed clusters by adding the Container Registries
tab to the Container Cloud web UI. Using this tab, you can configure CA
certificates on machines to access private Docker registries.
Note
For MOSK-based deployments, the feature support is
available since MOSK 22.5.
The following issues have been addressed in the Mirantis Container Cloud
release 2.21.0 along with the Cluster releases 11.5.0 and
7.11.0:
[23002] Fixed the issue with inability to set a custom value for a
predefined node label using the Container Cloud web UI.
[26416] Fixed the issue with inability to automatically upload an MKE client
bundle during cluster attachment using the Container Cloud web UI.
[26740] Fixed the issue with failure to upgrade a management cluster with
a Keycloak or web UI TLS custom certificate.
[27193] Fixed the issue with missing permissions for the
m:kaas:<namespaceName>@member role that are required for the Container
Cloud web UI to work properly. The issue relates to reading permissions for
resources objects of all providers as well as clusterRelease,
unsupportedCluster objects, and so on.
[26379] Fixed the issue with missing logs for MOSK-related
namespaces when using the container-cloud collect logs command
without the --extended flag.
MKE¶
[20651] A cluster deployment or update fails with not ready compose deployments¶
A managed cluster deployment, attachment, or update to a Cluster release with
MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the
compose pods flapping (ready>terminating>pending) and with the
following error message appearing in logs:
Deployment of a regional cluster based on bare metal or Equinix Metal with
private networking fails with mcc-cache Pods being stuck in the
CrashLoopBackOff state with continuous restarts.
As a workaround, remove the failed mcc-cache Pods to restart them
automatically. For example:
kubectl -n kaas delete pod mcc-cache-0
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
[20736] Region deletion failure after regional deployment failure¶
If a baremetal-based regional cluster deployment fails before pivoting is
done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed
regional cluster deployment, including but not limited to the following
objects that contain the kaas.mirantis.com/region label of the affected
region:
Deployment of a regional cluster based on bare metal or Equinix Metal with
private networking fails with mcc-cache Pods being stuck in the
CrashLoopBackOff state with continuous restarts.
As a workaround, remove the failed mcc-cache Pods to restart them
automatically. For example:
kubectl -n kaas delete pod mcc-cache-0
vSphere¶
[26070] RHEL system cannot be registered in Red Hat portal over MITM proxy¶
Deployment of RHEL machines using the Red Hat portal registration, which
requires user and password credentials, over MITM proxy fails while building
the virtual machines template with the following error:
Unable to verify server's identity: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:618)
The Container Cloud deployment gets stuck while applying the RHEL license
to machines with the same error in the lcm-agent logs.
As a workaround, use the internal Red Hat Satellite server that a VM can
access directly without a MITM proxy.
LCM¶
[5782] Manager machine fails to be deployed during node replacement¶
During the unsafe or forced deletion of a manager machine running the
calico-kube-controllers Pod in the kube-system namespace,
the following issues occur:
The calico-kube-controllers Pod fails to clean up resources associated
with the deleted node
The calico-node Pod may fail to start up on a newly created node if the
machine is provisioned with the same IP address as the deleted machine had
As a workaround, before deletion of the node running the
calico-kube-controllers Pod, cordon and drain the node:
kubectl cordon <nodeName>
kubectl drain <nodeName>
[27797] A cluster ‘kubeconfig’ stops working during MKE minor version update¶
During update of a Container Cloud cluster of any type, if the MKE minor
version is updated from 3.4.x to 3.5.x, access to the cluster using the
existing kubeconfig fails with the You must be logged in to the server
(Unauthorized) error due to OIDC settings being reconfigured.
As a workaround, during the cluster update process, use the admin kubeconfig instead of the existing one. Once the update completes, you can
use the existing cluster kubeconfig again.
During bootstrap of a management or regional cluster of any type,
portforward-controller stops accepting new connections after receiving the
Accept error: “EOF” error. Hence, nothing is copied between clients.
The workaround below applies only if machines are stuck in the Provision
state. Otherwise, contact Mirantis support to further assess the issue.
Workaround:
Verify that machines are stuck in the Provision state for up to 20
minutes or more. For example:
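One way to list the machines and their current state (illustrative only):
kubectl get machines -n <projectName>
Review the status column of the output; the exact columns displayed depend on
the printer columns defined for the Machine resource.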
During an update of a Container Cloud cluster of any type, recreation of the
Patroni container replica is stuck in the degraded state due to the liveness
probe killing the container that runs the pg_rewind procedure. The issue
affects clusters on which the pg_rewind procedure takes more time than the
full cycle of the liveness probe.
The sample logs of the affected cluster:
INFO: doing crash recovery in a single user mode
ERROR: Crash recovery finished with code=-6
INFO: stdout=
INFO: stderr=2023-01-11 10:20:34 GMT [64]: [1-1] 63be8d72.40 0 LOG: database system was interrupted; last known up at 2023-01-10 17:00:59 GMT
[64]: [2-1] 63be8d72.40 0 LOG: could not read from log segment 00000002000000000000000F, offset 0: read 0 of 8192
[64]: [3-1] 63be8d72.40 0 LOG: invalid primary checkpoint record
[64]: [4-1] 63be8d72.40 0 PANIC: could not locate a valid checkpoint record
On the baremetal-based management clusters, the restarts count of the
metric-collector Pod increases over time with reason: OOMKilled in
the containerStatuses of the metric-collector Pod. Only clusters with
HTTP proxy enabled are affected.
Such behavior is expected. Therefore, disregard these restarts.
[28134] Failure to update a cluster with nodes in the ‘Prepare’ state¶
A Container Cloud cluster of any type fails to update with nodes being
stuck in the Prepare state and the following example error in
Conditions of the affected machine:
Error: error when evicting pods/"patroni-13-2" -n "stacklight": global timeout reached: 10m0s
Other symptoms of the issue are as follows:
One of the Patroni Pods has 2/3 of containers ready. For example:
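One way to check the Patroni Pods readiness; the label selector is an
assumption, so you can also simply filter the Pod list for patroni:
kubectl get pods -n stacklight -l app=patroni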
The OpenSearch elasticsearch.persistentVolumeClaimSize custom setting is
overwritten by logging.persistentVolumeClaimSize during deployment of a
Container Cloud cluster of any type and is set to the default 30Gi.
Note
This issue does not block the OpenSearch cluster operations if
the default retention time is set. The default setting is usually enough for
the capacity size of this cluster.
The issue may affect the following Cluster releases:
11.2.0 - 11.5.0
7.8.0 - 7.11.0
8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)
10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)
13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)
To verify that the cluster is affected:
Note
In the commands below, substitute parameters enclosed in angle
brackets to match the affected cluster values.
Continue the cluster deployment. The system will use the custom value
set in logging.persistentVolumeClaimSize.
Caution
If elasticsearch.persistentVolumeClaimSize is absent in
the .yaml file, the Admission Controller blocks the configuration
update.
Workaround for an existing cluster:
Caution
During the application of the below workarounds, a short outage
of OpenSearch and its dependent components may occur with the following
alerts firing on the cluster. This behavior is expected. Therefore,
disregard these alerts.
StackLight alerts list firing during cluster update
Any cluster with high probability:
  KubeStatefulSetOutage: statefulset=opensearch-master
  KubeDeploymentOutage: deployment=opensearch-dashboards, deployment=metricbeat
Large cluster with average probability:
  KubePodsNotReady (removed in 17.0.0, 16.0.0, and 14.1.0): created_by_name="opensearch-master*", created_by_name="opensearch-dashboards*", created_by_name="metricbeat-*"
  OpenSearchClusterStatusWarning: n/a
  OpenSearchNumberOfPendingTasks: n/a
  OpenSearchNumberOfInitializingShards: n/a
  OpenSearchNumberOfUnassignedShards (removed in 2.27.0 (17.2.0 and 16.2.0)): n/a
Any cluster with low probability:
  KubeStatefulSetReplicasMismatch: statefulset=opensearch-master
  KubeDeploymentReplicasMismatch: deployment=opensearch-dashboards, deployment=metricbeat
StackLight in HA mode with LVP provisioner for OpenSearch PVCs
Warning
After applying this workaround, the existing log data will be
lost. Therefore, if required, migrate log data to a new persistent volume
(PV).
Move the existing log data to a new PV, if required.
Increase the disk size for local volume provisioner (LVP).
Scale down the opensearch-master StatefulSet with dependent
resources to 0 and disable the elasticsearch-curator CronJob:
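A minimal sketch of these commands, assuming that the dependent resources are
the opensearch-dashboards and metricbeat Deployments:
kubectl -n stacklight scale statefulset opensearch-master --replicas 0
kubectl -n stacklight scale deployment opensearch-dashboards metricbeat --replicas 0
kubectl -n stacklight patch cronjob elasticsearch-curator -p '{"spec":{"suspend":true}}'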
StackLight in non-HA mode with a non-expandable StorageClass
and no LVP for OpenSearch PVCs
Warning
After applying this workaround, the existing log data will be
lost. Depending on your custom provisioner, you may find a third-party
tool, such as pv-migrate,
that provides a possibility to copy all data from one PV to another.
If data loss is acceptable, proceed with the workaround below.
This command removes all existing logs data from PVCs.
In the Cluster configuration, set
logging.persistentVolumeClaimSize to the same value as the size of
the elasticsearch.persistentVolumeClaimSize parameter. For example:
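The following fragment is illustrative only; the exact nesting of the
StackLight values inside the Cluster object may differ in your deployment:
spec:
  providerSpec:
    value:
      helmReleases:
      - name: stacklight
        values:
          logging:
            persistentVolumeClaimSize: 100Gi
          elasticsearch:
            persistentVolumeClaimSize: 100Gi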
Custom settings for the deprecated elasticsearch.logstashRetentionTime
parameter are overwritten by the default setting set to 1 day.
The issue may affect the following Cluster releases with enabled
elasticsearch.logstashRetentionTime:
11.2.0 - 11.5.0
7.8.0 - 7.11.0
8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)
10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)
13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)
As a workaround, in the Cluster object, replace
elasticsearch.logstashRetentionTime with elasticsearch.retentionTime
that was implemented to replace the deprecated parameter. For example:
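The following fragment of the StackLight values is illustrative only; the
per-index key under retentionTime is an assumption:
elasticsearch:
  retentionTime:
    logstash: 1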
On a managed cluster, the StackLight Pods may get stuck with the
Pod predicate NodeAffinity failed error in the Pod status. The issue may
occur if the StackLight node label was added to one machine and
then removed from another one.
The issue does not affect the StackLight services: all required StackLight
Pods migrate successfully, except for extra Pods that are created and get
stuck during Pod migration.
The Ceph condition gets stuck in the absence of the Ceph cluster secrets
information. This behavior is observed on MOSK 22.3 clusters running on top
of Container Cloud 2.21.
The list of the symptoms includes:
The Cluster object contains the following condition:
Substitute <managedClusterProject> with the corresponding
managed cluster namespace.
Define the version parameter in the KaaSCephCluster spec:
spec:
  cephClusterSpec:
    version: 15.2.13
Note
Starting from MOSK 22.4, the Ceph cluster version updates to
15.2.17. Therefore, remove the version parameter definition from
KaaSCephCluster after the managed cluster update.
Save the updated KaaSCephCluster spec.
Find the MiraCeph Custom Resource on a managed cluster and copy all
annotations starting with meta.helm.sh:
Substitute <managedClusterKubeconfig> with a corresponding managed
cluster kubeconfig.
Example of a system output:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.6.0
    # save all annotations with "meta.helm.sh" somewhere
    meta.helm.sh/release-name: ceph-controller
    meta.helm.sh/release-namespace: ceph
...
Create the miracephsecretscrd.yaml file and fill it with the following
template:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  annotations:
    controller-gen.kubebuilder.io/version: v0.6.0
    <insert all "meta.helm.sh" annotations here>
  labels:
    app.kubernetes.io/managed-by: Helm
  name: miracephsecrets.lcm.mirantis.com
spec:
  conversion:
    strategy: None
  group: lcm.mirantis.com
  names:
    kind: MiraCephSecret
    listKind: MiraCephSecretList
    plural: miracephsecrets
    singular: miracephsecret
  scope: Namespaced
  versions:
  - name: v1alpha1
    schema:
      openAPIV3Schema:
        description: MiraCephSecret aggregates secrets created by Ceph
        properties:
          apiVersion:
            type: string
          kind:
            type: string
          metadata:
            type: object
          status:
            properties:
              lastSecretCheck:
                type: string
              lastSecretUpdate:
                type: string
              messages:
                items:
                  type: string
                type: array
              state:
                type: string
            type: object
        type: object
    served: true
    storage: true
Insert the copied meta.helm.sh annotations to the
metadata.annotations section of the template.
Apply miracephsecretscrd.yaml on the managed cluster:
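For example, using the managed cluster kubeconfig:
kubectl --kubeconfig <managedClusterKubeconfig> apply -f miracephsecretscrd.yaml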
Substitute <managedClusterKubeconfig> with a corresponding managed
cluster kubeconfig.
After some delay, the cluster condition will be updated to the health
state.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contain the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.21.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Since Kubernetes policy does not allow updating images in existing IAM jobs,
after Container Cloud automatically upgrades to 2.21.0, update the MariaDB
image manually using the following steps:
Delete the existing job:
kubectl delete job -n kaas iam-cluster-wait
In the management Cluster object, add the following snippet:
This Cluster release is based on the updated version of Mirantis Kubernetes
Engine 3.4.10 with Kubernetes 1.20 and Mirantis Container Runtime 20.10.12.
Supports the latest Cluster releases 7.10.0 and
11.4.0.
Does not support greenfield deployments based on deprecated Cluster releases
11.3.0, 8.8.0, and
7.9.0. Use the latest available Cluster releases of the
series instead.
For details about the Container Cloud release 2.20.1, refer to its parent
release 2.20.0:
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
Introduces support for the Cluster release 11.4.0
that is based on Mirantis Container Runtime 20.10.12 and
Mirantis Kubernetes Engine 3.5.4 with Kubernetes 1.21.
Introduces support for the Cluster release 7.10.0 that is
based on Mirantis Container Runtime 20.10.12 and Mirantis Kubernetes Engine
3.4.10 with Kubernetes 1.20.
Does not support greenfield deployments on deprecated Cluster releases
11.3.0, 8.6.0, and
7.9.0. Use the latest available Cluster releases of the
series instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.20.0.
This section outlines new features and enhancements introduced in the
Mirantis Container Cloud release 2.20.0. For the list of enhancements in the
Cluster releases 11.4.0 and 7.10.0 that are introduced by the Container Cloud
release 2.20.0, see the Cluster releases (managed).
Added the IAM member role to the existing IAM roles list. The
Infrastructure Operator with the member role has read and write
access to the Container Cloud API, which allows cluster operations, but does
not have access to IAM objects.
Bastion node configuration for OpenStack and AWS managed clusters¶
Implemented the capability to configure the Bastion node on greenfield
deployments of the OpenStack-based and AWS-based managed clusters using the
Container Cloud web UI. Using the Create Cluster wizard, you
can now configure the following parameters for the Bastion node:
OpenStack-based: flavor, image, availability zone, server metadata, booting
from a volume
AWS-based: instance type, AMI ID
Note
Reconfiguration of the Bastion node on an existing cluster is not
supported.
Mandatory IPAM service label for bare metal LCM subnets¶
Made the ipam/SVC-k8s-lcm label mandatory for the LCM subnet on new
deployments of management and managed bare metal clusters. It allows the
LCM Agent to correctly identify IP addresses to use on multi-homed bare metal
hosts. Therefore, you must add this label explicitly on new clusters.
Each node of every cluster must now have only one IP address in the LCM
network that is allocated from one of the Subnet objects having the
ipam/SVC-k8s-lcm label defined. Therefore, all Subnet objects used
for LCM networks must have the ipam/SVC-k8s-lcm label defined.
Note
For MOSK-based deployments, the feature support is
available since MOSK 22.4.
Implemented the possibility to use flexible size units throughout bare
metal host profiles for management, regional, and managed clusters. For
example, you can now use either sizeGiB: 0.1 or size: 100Mi when
specifying a device size. The size without units is counted in bytes. For
example, size: 120 means 120 bytes.
Caution
Mirantis recommends using only one parameter name type and units
throughout the configuration files. If both sizeGiB and size are
used, sizeGiB is ignored during deployment and the suffix is adjusted
accordingly. For example, 1.5Gi will be serialized as 1536Mi.
The size without units is counted in bytes. For example, size: 120 means
120 bytes.
Note
For MOSK-based deployments, the feature support is
available since MOSK 22.4.
Completed integration of the man-in-the-middle (MITM) proxies support for
offline deployments by adding AWS, vSphere, and Equinix Metal with private
networking to the list of existing supported providers: OpenStack and bare
metal.
With trusted proxy CA certificates that you can now add using the
CA Certificate check box in the Add new Proxy
window during a managed cluster creation, the feature allows monitoring all
cluster traffic for security and audit purposes.
Note
For Azure and Equinix Metal with public networking, the feature is not
supported
For MOSK-based deployments, the feature support will become
available in one of the following Container Cloud releases.
Configuration of TLS certificates for ‘mcc-cache’ and MKE¶
Implemented the ability to configure TLS certificates for mcc-cache on
management or regional clusters and for MKE on managed clusters deployed or
updated by Container Cloud using the latest Cluster release.
Note
TLS certificates configuration for MKE is not supported:
For MOSK-based clusters
For attached MKE clusters that were not originally deployed by Container
Cloud
On top of continuous improvements delivered to the existing Container Cloud
guides, added a document on how to increase the overall storage size for all
Ceph pools of the same device class: hdd, ssd, or nvme. For
details, see Increase Ceph cluster storage size.
[24927] Fixed the issue wherein a failure to create lcmclusterstate did
not trigger a retry.
[24852] Fixed the issue wherein the Upgrade Schedule tab in the
Container Cloud web UI was displaying the NOT ALLOWED label
instead of ALLOWED if the upgrade was enabled.
[24837] Fixed the issue wherein some Keycloak iam-keycloak-* pods were in
the CrashLoopBackOff state during an update of a baremetal-based
management or managed cluster with enabled FIPs.
[24813] Fixed the issue wherein the IPaddr objects were not reconciled
after the ipam/SVC-* labels changed on the parent subnet. This prevented
the ipam/SVC-* labels from propagating to IPaddr objects and caused
the serviceMap update to fail in the corresponding IpamHost.
[23125] Fixed the issue wherein an OpenStack-based regional cluster creation
in an offline mode was failing. Adding the Kubernetes load balancer address
to the NO_PROXY environment variable is no longer required.
[22576] Fixed the issue wherein provisioning-ansible did not use the
wipe flags during the deployment phase.
[5238] Improved the Bastion readiness checks to avoid issues with some
clusters having several Bastion nodes.
MKE¶
[20651] A cluster deployment or update fails with not ready compose deployments¶
A managed cluster deployment, attachment, or update to a Cluster release with
MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the
compose pods flapping (ready>terminating>pending) and with the
following error message appearing in logs:
Deployment of a regional cluster based on bare metal or Equinix Metal with
private networking fails with mcc-cache Pods being stuck in the
CrashLoopBackOff state with continuous restarts.
As a workaround, remove the failed mcc-cache Pods to restart them
automatically. For example:
kubectl -n kaas delete pod mcc-cache-0
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
[20736] Region deletion failure after regional deployment failure¶
If a baremetal-based regional cluster deployment fails before pivoting is
done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed
regional cluster deployment, including but not limited to the following
objects that contain the kaas.mirantis.com/region label of the affected
region:
Deployment of a regional cluster based on bare metal or Equinix Metal with
private networking fails with mcc-cache Pods being stuck in the
CrashLoopBackOff state with continuous restarts.
As a workaround, remove the failed mcc-cache Pods to restart them
automatically. For example:
kubectl -n kaas delete pod mcc-cache-0
vSphere¶
[26070] RHEL system cannot be registered in Red Hat portal over MITM proxy¶
Deployment of RHEL machines using the Red Hat portal registration, which
requires user and password credentials, over MITM proxy fails while building
the virtual machines template with the following error:
Unable to verify server's identity: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:618)
The Container Cloud deployment gets stuck while applying the RHEL license
to machines with the same error in the lcm-agent logs.
As a workaround, use the internal Red Hat Satellite server that a VM can
access directly without a MITM proxy.
StackLight¶
[28526] CPU throttling for ‘kaas-exporter’ blocking metric collection¶
The OpenSearch elasticsearch.persistentVolumeClaimSize custom setting is
overwritten by logging.persistentVolumeClaimSize during deployment of a
Container Cloud cluster of any type and is set to the default 30Gi.
Note
This issue does not block the OpenSearch cluster operations if
the default retention time is set. The default setting is usually enough for
the capacity size of this cluster.
The issue may affect the following Cluster releases:
11.2.0 - 11.5.0
7.8.0 - 7.11.0
8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)
10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)
13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)
To verify that the cluster is affected:
Note
In the commands below, substitute parameters enclosed in angle
brackets to match the affected cluster values.
Continue the cluster deployment. The system will use the custom value
set in logging.persistentVolumeClaimSize.
Caution
If elasticsearch.persistentVolumeClaimSize is absent in
the .yaml file, the Admission Controller blocks the configuration
update.
Workaround for an existing cluster:
Caution
During the application of the below workarounds, a short outage
of OpenSearch and its dependent components may occur with the following
alerts firing on the cluster. This behavior is expected. Therefore,
disregard these alerts.
StackLight alerts list firing during cluster update
Any cluster with high probability:
  KubeStatefulSetOutage: statefulset=opensearch-master
  KubeDeploymentOutage: deployment=opensearch-dashboards, deployment=metricbeat
Large cluster with average probability:
  KubePodsNotReady (removed in 17.0.0, 16.0.0, and 14.1.0): created_by_name="opensearch-master*", created_by_name="opensearch-dashboards*", created_by_name="metricbeat-*"
  OpenSearchClusterStatusWarning: n/a
  OpenSearchNumberOfPendingTasks: n/a
  OpenSearchNumberOfInitializingShards: n/a
  OpenSearchNumberOfUnassignedShards (removed in 2.27.0 (17.2.0 and 16.2.0)): n/a
Any cluster with low probability:
  KubeStatefulSetReplicasMismatch: statefulset=opensearch-master
  KubeDeploymentReplicasMismatch: deployment=opensearch-dashboards, deployment=metricbeat
StackLight in HA mode with LVP provisioner for OpenSearch PVCs
Warning
After applying this workaround, the existing log data will be
lost. Therefore, if required, migrate log data to a new persistent volume
(PV).
Move the existing log data to a new PV, if required.
Increase the disk size for local volume provisioner (LVP).
Scale down the opensearch-master StatefulSet with dependent
resources to 0 and disable the elasticsearch-curator CronJob:
StackLight in non-HA mode with a non-expandable StorageClass
and no LVP for OpenSearch PVCs
Warning
After applying this workaround, the existing log data will be
lost. Depending on your custom provisioner, you may find a third-party
tool, such as pv-migrate,
that provides a possibility to copy all data from one PV to another.
If data loss is acceptable, proceed with the workaround below.
This command removes all existing logs data from PVCs.
In the Cluster configuration, set
logging.persistentVolumeClaimSize to the same value as the size of
the elasticsearch.persistentVolumeClaimSize parameter. For example:
Custom settings for the deprecated elasticsearch.logstashRetentionTime
parameter are overwritten by the default setting set to 1 day.
The issue may affect the following Cluster releases with enabled
elasticsearch.logstashRetentionTime:
11.2.0 - 11.5.0
7.8.0 - 7.11.0
8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)
10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)
13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)
As a workaround, in the Cluster object, replace
elasticsearch.logstashRetentionTime with elasticsearch.retentionTime
that was implemented to replace the deprecated parameter. For example:
On a managed cluster, the StackLight Pods may get stuck with the
Pod predicate NodeAffinity failed error in the Pod status. The issue may
occur if the StackLight node label was added to one machine and
then removed from another one.
The issue does not affect the StackLight services: all required StackLight
Pods migrate successfully, except for extra Pods that are created and get
stuck during Pod migration.
[26441] Cluster update fails with the MountDevice failed for volume warning¶
Update of a managed cluster based on bare metal and Ceph enabled fails with
PersistentVolumeClaim getting stuck in the Pending state for the
prometheus-server StatefulSet and the
MountVolume.MountDevice failed for volume warning in the StackLight event
logs.
Workaround:
Verify that the description of the Pods that failed to run contain the
FailedMount events:
In the command above, replace the following values:
<affectedProjectName> is the Container Cloud project name where
the Pods failed to run
<affectedPodName> is a Pod name that failed to run in the specified project
In the Pod description, identify the node name where the Pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
An upgrade of a Container Cloud management cluster with a custom Keycloak or
web UI TLS certificate fails with the following example error:
failed to update management cluster:\
admission webhook "validations.kaas.mirantis.com" denied the request: \
failed to validate TLS spec for Cluster 'default/kaas-mgmt': \
desired hostname is not set for 'ui'
Workaround:
Verify that the tls section of the management cluster contains the
hostname and certificate fields for configured applications:
Open the management Cluster object for editing:
kubectl edit cluster <mgmtClusterName>
Verify that the tls section contains the following fields:
tls:
  keycloak:
    certificate:
      name: keycloak
    hostname: <keycloakHostName>
    tlsConfigRef: "" or "keycloak"
  ui:
    certificate:
      name: ui
    hostname: <webUIHostName>
    tlsConfigRef: "" or "ui"
Container Cloud web UI¶
[26416] Failure to upload an MKE client bundle during cluster attachment¶
During attachment of an existing MKE cluster using the Container Cloud web UI,
uploading of an MKE client bundle fails with a false-positive message about
a successful uploading.
Workaround:
Select from the following options:
Fill in the required fields for the MKE client bundle manually.
In the Attach Existing MKE Cluster window, use
upload MKE client bundle twice to upload
ucp.bundle-admin.zip and ucp-docker-bundle.zip located in the first
archive.
[23002] Inability to set a custom value for a predefined node label¶
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.20.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Introduces support for the Cluster release 11.3.0
that is based on Mirantis Container Runtime 20.10.11 and
Mirantis Kubernetes Engine 3.5.3 with Kubernetes 1.21.
Introduces support for the Cluster release 7.9.0 that is
based on Mirantis Container Runtime 20.10.11 and Mirantis Kubernetes Engine
3.4.9 with Kubernetes 1.20.
Does not support greenfield deployments on deprecated Cluster releases
11.2.0, 8.6.0, and
7.8.0. Use the latest Cluster releases of the series
instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.19.0.
This section outlines new features and enhancements introduced in the
Mirantis Container Cloud release 2.19.0. For the list of enhancements in the
Cluster releases 11.3.0 and 7.9.0 that are introduced by the Container Cloud
release 2.19.0, see the Cluster releases (managed).
General availability support for machines upgrade order¶
Implemented full support for the upgrade sequence of machines that allows
prioritized machines to be upgraded first. You can now set the upgrade index
on an existing machine or machine pool using the Container Cloud web UI.
Consider the following upgrade index specifics:
The first machine to upgrade is always one of the control plane machines
with the lowest upgradeIndex. Other control plane machines are upgraded
one by one according to their upgrade indexes.
If the Cluster spec dedicatedControlPlane field is false, worker
machines are upgraded only after the upgrade of all control plane machines
finishes. Otherwise, they are upgraded after the first control plane
machine, concurrently with other control plane machines.
If several machines have the same upgrade index, they have the same priority
during upgrade.
If the value is not set, the machine is automatically assigned a value
of the upgrade index.
Web UI support for booting an OpenStack machine from a volume¶
TechPreview
Implemented the Boot From Volume option for the OpenStack machine
creation wizard in the Container Cloud web UI. The feature allows booting
OpenStack-based machines from a block storage volume.
The feature is beneficial for clouds that do not have enough space on
hypervisors. After enabling this option, the Cinder storage is used instead of
the Nova storage.
Modification of network configuration on machines¶
TechPreview
Enabled the ability to modify existing network configuration on running bare
metal clusters with a mandatory approval of new settings by an Infrastructure
Operator. This validation is required to prevent accidental cluster failures
due to misconfiguration.
After you make necessary network configuration changes in the required L2
template, you now need to approve the changes by setting the
spec.netconfigUpdateAllow: true flag in each affected IpamHost object.
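For example, a minimal way to set this flag; the lowercase resource name
ipamhost is an assumption:
kubectl -n <projectName> patch ipamhost <ipamHostName> --type merge -p '{"spec":{"netconfigUpdateAllow":true}}'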
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
Implemented a new format of log entries for cluster and machine logs of a
management cluster. Each log entry now contains a request ID that identifies
chronology of actions performed on a cluster or machine. The feature applies
to all supported cloud providers.
The new format is <providerType>.<objectName>.req:<requestID>.
For example, bm.machine.req:374, bm.cluster.req:172.
<objectName> - name of an object being processed by provider, possible
values: cluster, machine.
<requestID> - request ID number that increases when a provider receives
a request from Kubernetes about creating, updating, deleting an object. The
request ID allows combining all operations performed with an object within
one request. For example, the result of a machine creation, update of its
statuses, and so on.
Implemented the --extended flag for collecting the extended version of
logs that contains system and MKE logs, logs from LCM Ansible and LCM Agent
along with cluster events and Kubernetes resources description and logs.
You can use this flag to collect logs on any cluster type.
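For example, only the --extended flag is confirmed by this section; the
kubeconfig option name below is an assumption and may differ in your CLI
version:
container-cloud collect logs --extended --management-kubeconfig <pathToKubeconfig>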
Distribution selector for bare metal machines in web UI¶
Added the Distribution field to the bare metal machine creation
wizard in the Container Cloud web UI. The default operating system in the
distribution list is Ubuntu 20.04.
Caution
Do not use the outdated Ubuntu 18.04 distribution on greenfield
deployments but only on existing clusters based on Ubuntu 18.04.
After switching all remaining OpenStack Helm releases from v2 to v3,
dropped support for Helm v2 in Helm Controller and removed the Tiller image
for all related components.
The following issues have been addressed in the Mirantis Container Cloud
release 2.19.0 along with the Cluster releases 11.3.0 and
7.9.0:
[16379, 23865] Fixed the issue that caused an Equinix-based management or
managed cluster update to fail with the FailedAttachVolume and
FailedMount warnings.
[24286] Fixed the issue wherein creation of a new Equinix-based managed
cluster failed due to failure to release a new vRouter ID.
[24722] Fixed the issue that caused Ceph clusters to be broken on
Equinix-based managed clusters deployed on a Container Cloud instance
with a non-default (different from region-one) region configured.
[24806] Fixed the issue wherein the dhcp-option=tag parameters were not
applied to dnsmasq.conf during the bootstrap of a bare metal management
cluster with a multi-rack topology.
[17778] Fixed the issue wherein the Container Cloud web UI displayed the new
release version while update for some nodes was still in progress.
[24676] Fixed the issue wherein the deployment of an Equinix-based management
cluster failed with the following error message:
Failed waiting for OIDC configuration readiness: timed out waiting for the condition
[25050] For security reasons, disabled the deprecated TLS v1.0 and v1.1 for
the mcc-cache and kaas-ui Container Cloud services.
[25256] Optimized the number of simultaneous connections to etcd to be open
during configuration of Calico policies.
[24914] Fixed the issue wherein Helm Controller was getting stuck during
readiness checks due to the timeout for helmclient not being set.
[24317] Fixed a number of security vulnerabilities in the Container Cloud
Docker images:
MKE¶
[20651] A cluster deployment or update fails with not ready compose deployments¶
A managed cluster deployment, attachment, or update to a Cluster release with
MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the
compose pods flapping (ready>terminating>pending) and with the
following error message appearing in logs:
Once done, the cluster deployment or update resumes.
Re-enable DCT.
Bare metal¶
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state¶
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
[20736] Region deletion failure after regional deployment failure¶
If a baremetal-based regional cluster deployment fails before pivoting is
done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed
regional cluster deployment, including but not limited to the following
objects that contain the kaas.mirantis.com/region label of the affected
region:
The OpenSearch elasticsearch.persistentVolumeClaimSize custom setting is
overwritten by logging.persistentVolumeClaimSize during deployment of a
Container Cloud cluster of any type and is set to the default 30Gi.
Note
This issue does not block the OpenSearch cluster operations if
the default retention time is set. The default setting is usually enough for
the capacity size of this cluster.
The issue may affect the following Cluster releases:
11.2.0 - 11.5.0
7.8.0 - 7.11.0
8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)
10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)
13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)
To verify that the cluster is affected:
Note
In the commands below, substitute parameters enclosed in angle
brackets to match the affected cluster values.
Continue the cluster deployment. The system will use the custom value
set in logging.persistentVolumeClaimSize.
Caution
If elasticsearch.persistentVolumeClaimSize is absent in
the .yaml file, the Admission Controller blocks the configuration
update.
Workaround for an existing cluster:
Caution
During the application of the below workarounds, a short outage
of OpenSearch and its dependent components may occur with the following
alerts firing on the cluster. This behavior is expected. Therefore,
disregard these alerts.
StackLight alerts list firing during cluster update
Any cluster with high probability:
  KubeStatefulSetOutage: statefulset=opensearch-master
  KubeDeploymentOutage: deployment=opensearch-dashboards, deployment=metricbeat
Large cluster with average probability:
  KubePodsNotReady (removed in 17.0.0, 16.0.0, and 14.1.0): created_by_name="opensearch-master*", created_by_name="opensearch-dashboards*", created_by_name="metricbeat-*"
  OpenSearchClusterStatusWarning: n/a
  OpenSearchNumberOfPendingTasks: n/a
  OpenSearchNumberOfInitializingShards: n/a
  OpenSearchNumberOfUnassignedShards (removed in 2.27.0 (17.2.0 and 16.2.0)): n/a
Any cluster with low probability:
  KubeStatefulSetReplicasMismatch: statefulset=opensearch-master
  KubeDeploymentReplicasMismatch: deployment=opensearch-dashboards, deployment=metricbeat
StackLight in HA mode with LVP provisioner for OpenSearch PVCs
Warning
After applying this workaround, the existing log data will be
lost. Therefore, if required, migrate log data to a new persistent volume
(PV).
Move the existing log data to a new PV, if required.
Increase the disk size for local volume provisioner (LVP).
Scale down the opensearch-master StatefulSet with dependent
resources to 0 and disable the elasticsearch-curator CronJob:
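The exact commands are not shown here. A minimal sketch, assuming that StackLight runs in the stacklight namespace and that the dependent resources are the opensearch-dashboards and metricbeat Deployments, may look as follows:
# Scale down OpenSearch and the dependent components (names and namespace are assumptions)
kubectl -n stacklight scale statefulset opensearch-master --replicas 0
kubectl -n stacklight scale deployment opensearch-dashboards metricbeat --replicas 0
# Suspend the curator CronJob so that it does not run while OpenSearch is down
kubectl -n stacklight patch cronjob elasticsearch-curator -p '{"spec":{"suspend":true}}'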
StackLight in non-HA mode with a non-expandable StorageClass
and no LVP for OpenSearch PVCs
Warning
After applying this workaround, the existing log data will be
lost. Depending on your custom provisioner, you may find a third-party
tool, such as pv-migrate,
that allows copying all data from one PV to another.
If data loss is acceptable, proceed with the workaround below.
This command removes all existing log data from PVCs.
In the Cluster configuration, set
logging.persistentVolumeClaimSize to the same value as the size of
the elasticsearch.persistentVolumeClaimSize parameter. For example:
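The original example is not reproduced here. A minimal sketch, assuming a custom size of 100Gi and that StackLight is configured through the stacklight Helm release values in the Cluster object, may look as follows:
spec:
  providerSpec:
    value:
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            persistentVolumeClaimSize: 100Gi
          logging:
            persistentVolumeClaimSize: 100Gi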
Custom settings for the deprecated elasticsearch.logstashRetentionTime
parameter are overwritten by the default setting of 1 day.
The issue may affect the following Cluster releases with enabled
elasticsearch.logstashRetentionTime:
11.2.0 - 11.5.0
7.8.0 - 7.11.0
8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)
10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)
13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)
As a workaround, in the Cluster object, replace
elasticsearch.logstashRetentionTime with elasticsearch.retentionTime
that was implemented to replace the deprecated parameter. For example:
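The original example is not reproduced here. Assuming that retention is defined in days per index type and that the key names under retentionTime are logstash, events, and notifications (an assumption), the replacement may look similar to:
spec:
  providerSpec:
    value:
      helmReleases:
      - name: stacklight
        values:
          elasticsearch:
            retentionTime:
              logstash: 5
              events: 5
              notifications: 5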
On a managed cluster, the StackLight pods may get stuck with the
Pod predicate NodeAffinity failed error in the pod status. The issue may
occur if the StackLight node label was added to one machine and
then removed from another one.
The issue does not affect the StackLight services: all required StackLight
pods migrate successfully, except for extra pods that are created and get
stuck during pod migration.
During attachment of an existing MKE cluster using the Container Cloud web UI,
the upload of an MKE client bundle fails although a false-positive message
about successful uploading is displayed.
Workaround:
Select from the following options:
Fill in the required fields for the MKE client bundle manually.
In the Attach Existing MKE Cluster window, use
upload MKE client bundle twice to upload
ucp.bundle-admin.zip and ucp-docker-bundle.zip located in the first
archive.
[23002] Inability to set a custom value for a predefined node label
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.19.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Mirantis Container Cloud GA release 2.18.1 is based on
2.18.0 and:
Introduces support for the Cluster release 8.8.0
that is based on the Cluster release 7.8.0 and represents
Mirantis OpenStack for Kubernetes (MOSK) 22.3.
This Cluster release is based on the updated version of Mirantis Kubernetes
Engine 3.4.8 with Kubernetes 1.20 and Mirantis Container Runtime 20.10.11.
Supports the latest Cluster releases 7.8.0 and
11.2.0.
Does not support new deployments based on the deprecated Cluster releases
11.1.0, 8.6.0, and
7.7.0.
For details about the Container Cloud release 2.18.1, refer to its parent
release 2.18.0:
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
Introduces support for the Cluster release 11.2.0
that is based on Mirantis Container Runtime 20.10.8 and
Mirantis Kubernetes Engine 3.5.1 with Kubernetes 1.21.
Introduces support for the Cluster release 7.8.0 that is
based on Mirantis Container Runtime 20.10.8 and Mirantis Kubernetes Engine
3.4.7 with Kubernetes 1.20.
Does not support greenfield deployments on deprecated Cluster releases
11.1.0, 8.5.0, and
7.7.0. Use the latest Cluster releases of the series
instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.18.0.
This section outlines new features and enhancements introduced in the
Mirantis Container Cloud release 2.18.0. For the list of enhancements in the
Cluster releases 11.2.0 and 7.8.0 that are introduced by the Container Cloud
release 2.18.0, see the Cluster releases (managed).
Updated the Ubuntu kernel version to 5.4.0-109-generic for bare metal
non-MOSK-based management, regional, and managed clusters to
apply Ubuntu 18.04 or 20.04 security and system updates.
Caution
During a baremetal-based cluster update to Container Cloud 2.18
and to the latest Cluster releases 11.2.0 and 7.8.0, hosts will be restarted
to apply the latest supported Ubuntu 18.04 or 20.04 packages. Therefore:
Depending on the cluster configuration, applying security
updates and host restart can increase the update time for each node to up to
1 hour.
Cluster nodes are updated one by one. Therefore, for large clusters,
the update may take several days to complete.
Support for Ubuntu 20.04 on greenfield vSphere deployments
Implemented full support for Ubuntu 20.04 LTS (Focal Fossa) as the default
host operating system that now installs on management, regional, and managed
clusters for the vSphere cloud provider.
Caution
Upgrading from Ubuntu 18.04 to 20.04 on existing deployments
is not supported.
Booting a machine from a block storage volume for OpenStack provider
TechPreview
Implemented initial Technology Preview support for booting of the
OpenStack-based machines from a block storage volume. The feature is
beneficial for clouds that do not have enough space on hypervisors. After
enabling this option, the Cinder storage is used instead of the Nova storage.
Using the Container Cloud API, you can boot the Bastion node, or the required
management, regional, or managed cluster nodes from a volume.
Note
The ability to enable the boot from volume option using the
Container Cloud web UI for managed clusters will be implemented in one of the
following Container Cloud releases.
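As an illustration only, enabling this option through the machine provider spec may look similar to the sketch below. The bootFromVolume field name and its subfields are assumptions because the exact API schema is not shown in this document; verify it against the API reference before use.
spec:
  providerSpec:
    value:
      # Hypothetical field names illustrating boot from a Cinder volume
      bootFromVolume:
        enabled: true
        volumeSize: 80  # root volume size in GB; must fit the image and workload data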
IPSec encryption for the Kubernetes workloads network
TechPreview. Experimental since 2.19.0
Implemented initial Technology Preview support for enabling IPSec encryption
for the Kubernetes workloads network. The feature allows for secure
communication between servers.
You can enable encryption for the Kubernetes workloads network on greenfield
deployments during initial creation of a management, regional, and managed
cluster through the Cluster object using the secureOverlay parameter.
For the bare metal cloud provider and MOSK-based deployments, the feature
support will become available in one of the following Container Cloud
releases.
For existing deployments, the feature support will become available in
one of the following Container Cloud releases.
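A minimal sketch of enabling the parameter in the Cluster object during creation follows, assuming that secureOverlay is a boolean field under spec.providerSpec.value:
spec:
  providerSpec:
    value:
      # Enables IPSec encryption for the Kubernetes workloads network
      secureOverlay: true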
Implemented the initial Technology Preview support for
man-in-the-middle (MITM) proxies on offline OpenStack and
non-MOSK-based bare metal deployments. Using trusted proxy CA
certificates, the feature allows monitoring all cluster traffic for security
and audit purposes.
Implemented support for custom Docker registries configuration in the
Container Cloud management, regional, and managed clusters. Using the
ContainerRegistry custom resource, you can configure CA certificates on
machines to access private Docker registries.
Note
For MOSK-based deployments, the feature support is available
since Container Cloud 2.18.1.
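A minimal sketch of a ContainerRegistry object is shown below. The apiVersion, the field names (domain, CACert), and the base64 encoding of the certificate are assumptions; verify them against the API reference before use.
apiVersion: kaas.mirantis.com/v1alpha1
kind: ContainerRegistry
metadata:
  name: my-private-registry
  namespace: default
spec:
  domain: registry.example.com:5000   # address of the private Docker registry
  CACert: <base64-encoded CA certificate>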
Implemented initial Technology Preview support for machines upgrade index that
allows prioritized machines to be upgraded first. During a machine or a
machine pool creation, you can use the Container Cloud web UI
Upgrade Index option to set a positive numeral value that
defines the order of machine upgrade during cluster update.
To set the upgrade order on an existing cluster, use the Container Cloud API:
For a machine that is not assigned to a machine pool, add the
upgradeIndex field with the required value to the
spec:providerSpec:value section in the Machine object.
For a machine pool, add the upgradeIndex field with the required value
to the spec:machineSpec:providerSpec:value section of the
MachinePool object to apply the upgrade order to all machines in the
pool.
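For example, a sketch of both options based on the field paths given above (the index values are arbitrary):
# Machine object that is not assigned to a machine pool
spec:
  providerSpec:
    value:
      upgradeIndex: 1
# MachinePool object; applies to all machines in the pool
spec:
  machineSpec:
    providerSpec:
      value:
        upgradeIndex: 2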
Note
The first machine to upgrade is always one of the control plane machines
with the lowest upgradeIndex. Other control plane machines are upgraded
one by one according to their upgrade indexes. If the Cluster spec
dedicatedControlPlane field is false, worker machines are upgraded
only after the upgrade of all control plane machines finishes. Otherwise,
they are upgraded after the first control plane machine, concurrently with
other control plane machines.
If two or more machines have the same value of upgradeIndex, these
machines are equally prioritized during upgrade.
Changing of the machine upgrade index during an already running
cluster update or maintenance is not supported.
Enablement of Salesforce propagation to all clusters using web UI
Simplified the ability to enable automatic update and sync of the Salesforce
configuration on all your clusters by adding the corresponding check box to
the Salesforce settings in the Container Cloud web UI.
The following issues have been addressed in the Mirantis Container Cloud
release 2.18.0 along with the Cluster releases 11.2.0 and
7.8.0:
[24075] Fixed the issue with the Ubuntu 20.04 option not
displaying in the operating systems drop-down list during machine creation
for the AWS and Equinix Metal with public networking providers.
Warning
After Container Cloud is upgraded to 2.18.0, remove the values
added during the workaround application from the Cluster object.
[9339] Fixed the issue with incorrect health monitoring for Kubernetes and
MKE endpoints on OpenStack-based clusters.
[21710] Fixed the issue with a too high threshold being set for the
KubeContainersCPUThrottlingHigh StackLight alert.
[22872] Removed the inefficient ElasticNoNewDataCluster and
ElasticNoNewDataNode StackLight alerts.
[23853] Fixed the issue wherein the KaaSCephOperationRequest resource
created to remove the failed node from the Ceph cluster was stuck with the
Failed status and an error message in errorReason. The Failed
status blocked the replacement of the failed master node on regional clusters
of the bare metal and Equinix Metal providers.
[23841] Improved error logging for load balancer deletion:
The reason for the inability to delete an LB is now displayed in the
provider logs.
If the search for a FIP associated with the LB being deleted returns more
than one FIP, the provider returns an error instead of deleting all found FIPs.
[18331] Fixed the issue with the Keycloak admin console menu disappearing
on the Add identity provider page during configuration of a SAML
identity provider.
MKE
[20651] A cluster deployment or update fails with not ready compose deployments
A managed cluster deployment, attachment, or update to a Cluster release with
MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the
compose pods flapping (ready>terminating>pending) and with the
following error message appearing in logs:
[24005] Deletion of a node with ironic Pod is stuck in the Terminating state
During deletion of a manager machine running the ironic Pod from a bare
metal management cluster, the following problems occur:
All Pods are stuck in the Terminating state
A new ironic Pod fails to start
The related bare metal host is stuck in the deprovisioning state
As a workaround, before deletion of the node running the ironic Pod,
cordon and drain the node using the kubectl cordon <nodeName> and
kubectl drain <nodeName> commands.
[20736] Region deletion failure after regional deployment failure
If a baremetal-based regional cluster deployment fails before pivoting is
done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed
regional cluster deployment, including but not limited to the following
objects that contain the kaas.mirantis.com/region label of the affected
region:
<affectedProjectName> is the Container Cloud project name where
the pods failed to run
<affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
The OpenSearch elasticsearch.persistentVolumeClaimSize custom setting is
overwritten by logging.persistentVolumeClaimSize during deployment of a
Container Cloud cluster of any type and is set to the default 30Gi.
Note
This issue does not block the OpenSearch cluster operations if
the default retention time is set. The default retention setting is usually
sufficient for the capacity of such a cluster.
The issue may affect the following Cluster releases:
11.2.0 - 11.5.0
7.8.0 - 7.11.0
8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)
10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)
13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)
To verify that the cluster is affected:
Note
In the commands below, substitute parameters enclosed in angle
brackets to match the affected cluster values.
Continue the cluster deployment. The system will use the custom value
set in logging.persistentVolumeClaimSize.
Caution
If elasticsearch.persistentVolumeClaimSize is absent in
the .yaml file, the Admission Controller blocks the configuration
update.
Workaround for an existing cluster:
Caution
During the application of the below workarounds, a short outage
of OpenSearch and its dependent components may occur with the following
alerts firing on the cluster. This behavior is expected. Therefore,
disregard these alerts.
StackLight alerts list firing during cluster update
Alerts are grouped by cluster size and outage probability level; label names
and components are shown in parentheses:
Any cluster with high probability:
- KubeStatefulSetOutage (statefulset=opensearch-master)
- KubeDeploymentOutage (deployment=opensearch-dashboards, deployment=metricbeat)
Large cluster with average probability:
- KubePodsNotReady (Removed in 17.0.0, 16.0.0, and 14.1.0)
  (created_by_name="opensearch-master*", created_by_name="opensearch-dashboards*",
  created_by_name="metricbeat-*")
- OpenSearchClusterStatusWarning (n/a)
- OpenSearchNumberOfPendingTasks (n/a)
- OpenSearchNumberOfInitializingShards (n/a)
- OpenSearchNumberOfUnassignedShards (Removed in 2.27.0 (17.2.0 and 16.2.0)) (n/a)
Any cluster with low probability:
- KubeStatefulSetReplicasMismatch (statefulset=opensearch-master)
- KubeDeploymentReplicasMismatch (deployment=opensearch-dashboards, deployment=metricbeat)
StackLight in HA mode with LVP provisioner for OpenSearch PVCs
Warning
After applying this workaround, the existing log data will be
lost. Therefore, if required, migrate log data to a new persistent volume
(PV).
Move the existing log data to a new PV, if required.
Increase the disk size for local volume provisioner (LVP).
Scale down the opensearch-master StatefulSet with dependent
resources to 0 and disable the elasticsearch-curator CronJob:
StackLight in non-HA mode with a non-expandable StorageClass
and no LVP for OpenSearch PVCs
Warning
After applying this workaround, the existing log data will be
lost. Depending on your custom provisioner, you may find a third-party
tool, such as pv-migrate,
that allows copying all data from one PV to another.
If data loss is acceptable, proceed with the workaround below.
This command removes all existing log data from PVCs.
In the Cluster configuration, set
logging.persistentVolumeClaimSize to the same value as the size of
the elasticsearch.persistentVolumeClaimSize parameter. For example:
Custom settings for the deprecated elasticsearch.logstashRetentionTime
parameter are overwritten by the default setting of 1 day.
The issue may affect the following Cluster releases with enabled
elasticsearch.logstashRetentionTime:
11.2.0 - 11.5.0
7.8.0 - 7.11.0
8.8.0 - 8.10.0, 12.5.0 (MOSK clusters)
10.2.4 - 10.8.1 (attached MKE 3.4.x clusters)
13.0.2 - 13.5.1 (attached MKE 3.5.x clusters)
As a workaround, in the Cluster object, replace
elasticsearch.logstashRetentionTime with elasticsearch.retentionTime
that was implemented to replace the deprecated parameter. For example:
On a managed cluster, the StackLight pods may get stuck with the
Pod predicate NodeAffinity failed error in the pod status. The issue may
occur if the StackLight node label was added to one machine and
then removed from another one.
The issue does not affect the StackLight services: all required StackLight
pods migrate successfully, except for extra pods that are created and get
stuck during pod migration.
Upgrade
[24802] Container Cloud upgrade to 2.18.0 can trigger managed clusters update
Affects only Container Cloud 2.18.0
On clusters with enabled proxy and the NO_PROXY settings containing
localhost/127.0.0.1 or matching the automatically added Container Cloud
internal endpoints, the Container Cloud release upgrade from 2.17.0 to 2.18.0
triggers automatic update of managed clusters to the latest available Cluster
releases in their respective series.
For the issue workaround, contact Mirantis support.
[21810] Upgrade to Cluster releases 5.22.0 and 7.5.0 may get stuck
Affects Ubuntu-based clusters deployed after Feb 10, 2022
If you deploy an Ubuntu-based cluster using the deprecated Cluster release
7.4.0 (and earlier) or 5.21.0 (and earlier) starting from February 10, 2022,
the cluster update to the Cluster releases 7.5.0 and 5.22.0 may get stuck
while applying the Deploy state to the cluster machines. The issue
affects all cluster types: management, regional, and managed.
To verify that the cluster is affected:
Log in to the Container Cloud web UI.
In the Clusters tab, capture the RELEASE and
AGE values of the required Ubuntu-based cluster. If the values
match the ones from the issue description, the cluster may be affected.
Using SSH, log in to the manager or worker node that got stuck while
applying the Deploy state and identify the containerd package version:
containerd --version
If the version is 1.5.9, the cluster is affected.
In /var/log/lcm/runners/<nodeName>/deploy/, verify whether the Ansible
deployment logs contain the following errors that indicate that the
cluster is affected:
The following packages will be upgraded:
  docker-ee docker-ee-cli
The following packages will be DOWNGRADED:
  containerd.io
STDERR:
E: Packages were downgraded and -y was used without --allow-downgrades.
Workaround:
Warning
Apply the steps below to the affected nodes one-by-one and
only after each consecutive node gets stuck on the Deploy phase with the
Ansible log errors. Such sequence ensures that each node is cordon-drained
and Docker is properly stopped. Therefore, no workloads are affected.
Using SSH, log in to the first affected node and install containerd 1.5.8:
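The installation command itself is omitted here. As an assumption-based illustration for Ubuntu, the downgrade may look as follows, where the exact 1.5.8 package version string must be taken from the output of apt-cache madison for your repository:
# Identify the available 1.5.8 package version in your repository
apt-cache madison containerd.io
# Downgrade to the identified version; the version string below is a placeholder
apt-get install -y --allow-downgrades containerd.io=<1.5.8-version-from-the-output-above>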
Wait for Ansible to reconcile. The node should become Ready in several
minutes.
Wait for the next node of the cluster to get stuck on the Deploy phase
with the Ansible log errors. Only after that, apply the steps above on the
next node.
Patch the remaining nodes one-by-one using the steps above.
Container Cloud web UI
[23002] Inability to set a custom value for a predefined node label
During machine creation using the Container Cloud web UI, a custom value for
a node label cannot be set.
As a workaround, manually add the value to
spec.providerSpec.value.nodeLabels in machine.yaml.
[249] A newly created project does not display in the Container Cloud web UI
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.18.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Introduces support for the Cluster release 11.1.0
that is based on Mirantis Container Runtime 20.10.8 and
Mirantis Kubernetes Engine 3.5.1 with Kubernetes 1.21.
Introduces support for the Cluster release 7.7.0 that is
based on Mirantis Container Runtime 20.10.8 and Mirantis Kubernetes Engine
3.4.7 with Kubernetes 1.20.
Does not support greenfield deployments on deprecated Cluster releases
11.0.0, 8.5.0, and
7.6.0. Use the latest Cluster releases of the series
instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.17.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.17.0.
For the list of enhancements in the Cluster releases 11.1.0 and 7.7.0
that are introduced by the Container Cloud release 2.17.0,
see the Cluster releases (managed).
General availability for Ubuntu 20.04 on greenfield deployments
Implemented full support for Ubuntu 20.04 LTS (Focal Fossa) as the default
host operating system that now installs on management, regional, and managed
clusters for the following cloud providers: AWS, Azure, OpenStack,
Equinix Metal with public or private networking, and
non-MOSK-based bare metal.
For the vSphere and MOSK-based (managed) deployments, support
for Ubuntu 20.04 will be announced in one of the following Container Cloud
releases.
Note
The management or regional bare metal cluster dedicated for
managed clusters running MOSK is based on Ubuntu 20.04.
Caution
Upgrading from Ubuntu 18.04 to 20.04 on existing deployments
is not supported.
Container Cloud on top of MOSK Victoria with Tungsten Fabric
Implemented the capability to deploy Container Cloud management, regional,
and managed clusters based on OpenStack Victoria with Tungsten Fabric
networking on top of Mirantis OpenStack for Kubernetes (MOSK) Victoria with
Tungsten Fabric.
Note
On the MOSK Victoria with Tungsten Fabric clusters
of Container Cloud deployed before MOSK 22.3, Octavia
enables a default security group for newly created load balancers. To change
this configuration, refer to MOSK Operations Guide: Configure
load balancing.
To use the default security group, configure ingress rules.
EBS instead of NVMe as persistent storage for AWS-based nodes
Replaced the Non-Volatile Memory Express (NVMe) drive type with the Amazon
Elastic Block Store (EBS) one as the persistent storage requirement for
AWS-based nodes. This change prevents cluster nodes from becoming unusable
after instances are stopped and NVMe drives are erased.
Previously, the /var/lib/docker Docker data was located on local NVMe SSDs
by default. Now, this data is located on the same EBS volume drive as the
operating system.
Implemented the capability to delete manager nodes with the purpose of
replacement or recovery. Consider the following precautions:
Create a new manager machine to replace the deleted one as soon as
possible. This is necessary since after a machine removal, the cluster
has limited capabilities to tolerate faults. Deletion of manager machines
is intended only for replacement or recovery of failed nodes.
You can delete a manager machine only if your cluster has at least
two manager machines in the Ready state.
Do not delete more than one manager machine at once to prevent cluster
failure and data loss.
Ensure that the machine to delete is not a Ceph Monitor. If it is, migrate
the Ceph Monitor to keep the odd number quorum of Ceph Monitors after the
machine deletion. For details, see Migrate a Ceph Monitor before machine replacement.
If you delete a machine on the regional cluster, refer to the
known issue 23853 to complete the deletion.
To ensure HA, limited the managed cluster size to an odd number
of manager machines. In an even-sized cluster, the additional machine remains
in the Pending state until one more manager machine is added.
Extended the use of node labels for all supported cloud providers with the
ability to set custom values. Especially from the MOSK
standpoint, this feature makes it easy to schedule overrides for OpenStack
services using the API. For example, now you can set the node-type label to
define the node purpose such as hpc-compute, compute-lvm, or
storage-ssd in its value.
The list of allowed node labels is located in the Cluster object status
providerStatus.releaseRef.current.allowedNodeLabels field. Before or after
a machine deployment, add the required label from the allowed node labels list
with the corresponding value to spec.providerSpec.value.nodeLabels in
machine.yaml.
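For example, a sketch of the nodeLabels section in machine.yaml, assuming that the labels are defined as a list of key and value pairs:
spec:
  providerSpec:
    value:
      nodeLabels:
      - key: node-type
        value: storage-ssd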
Note
Due to the known issue 23002, it is not possible
to set a custom value for a predefined node label using the Container Cloud
web UI. For a workaround, refer to the issue description.
Introduced the MachinePool custom resource. A machine pool is a template
that allows managing a set of machines with the same provider spec as a
single unit. You can create different sets of machine pools with required
specs during machines creation on a new or existing cluster using the
Create machine wizard in the Container Cloud web UI. You can
assign or unassign machines from a pool, if required. You can also increase
or decrease the replicas count. If you increase the replicas count, new
machines are added automatically.
Automatic propagation of Salesforce configuration to all clusters
Implemented the capability to enable automatic propagation of the Salesforce
configuration of your management cluster to the related regional and managed
clusters using the autoSyncSalesForceConfig=true flag added to the
Cluster object of the management cluster. This option allows for
automatic update and sync of the Salesforce settings on all your clusters
after you update your management cluster configuration.
You can also set custom settings for regional and managed clusters that always
override automatically propagated Salesforce values.
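A minimal sketch of enabling the flag in the management cluster object follows; the exact location of the flag under the Cluster spec is an assumption, so verify it against the API reference before use:
spec:
  providerSpec:
    value:
      # Assumed placement of the flag within the Cluster spec
      autoSyncSalesForceConfig: true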
Note
The capability to enable this option using the Container Cloud web
UI will be announced in one of the following releases.
The following issues have been addressed in the Mirantis Container Cloud
release 2.17.0 along with the Cluster releases 11.1.0 and
7.7.0:
Bare metal:
[22563] Fixed the issue wherein a deployment of a bare metal node with
an LVM volume on top of an mdadm-based raid10 failed during
provisioning due to insufficient cleanup of RAID devices.
Equinix Metal:
[22264] Fixed the issue wherein the KubeContainersCPUThrottlingHigh
alerts for Equinix Metal and AWS deployments were raised due to low default
deployment limits set for the Equinix Metal and AWS controller containers.
StackLight:
[23006] Fixed the issue that caused StackLight endpoints to crash on start
with the private key does not match public key error message.
[22626] Fixed the issue that caused constant restarts of the
kaas-exporter pod. Increased the memory for kaas-exporter requests
and limits.
[22337] Improved the certificate expiration alerts by enhancing the alert
severities.
[20856] Fixed the issue wherein variables values in the
PostgreSQL Grafana dashboard were not calculated.
[20855] Fixed the issue wherein the Cluster > Health panel
showed N/A in the Elasticsearch Grafana dashboard.
Ceph:
[19014] Updated the Rook Docker image and fixed the following security
vulnerabilities:
MKE
[20651] A cluster deployment or update fails with not ready compose deployments
A managed cluster deployment, attachment, or update to a Cluster release with
MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the
compose pods flapping (ready>terminating>pending) and with the
following error message appearing in logs:
Once done, the cluster deployment or update resumes.
Re-enable DCT.
Bare metal
[20736] Region deletion failure after regional deployment failure
If a baremetal-based regional cluster deployment fails before pivoting is
done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed
regional cluster deployment, including but not limited to the following
objects that contain the kaas.mirantis.com/region label of the affected
region:
<affectedProjectName> is the Container Cloud project name where
the pods failed to run
<affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
During configuration of a SAML identity provider using the
Add identity provider menu of the Keycloak admin console, the page
style breaks and the Save and Cancel buttons
disappear.
Workaround:
Log in to the Keycloak admin console.
In the sidebar menu, switch to the Master realm.
Navigate to Realm Settings > Themes.
In the Admin Console Theme drop-down menu, select
keycloak.
Click Save and refresh the browser window to apply the changes.
LCM
[23853] Replacement of a regional master node fails on bare metal and Equinix Metal
During replacement of a failed master node on regional clusters of the
bare metal and Equinix Metal providers, the KaaSCephOperationRequest
resource created to remove the failed node from the Ceph cluster is stuck with
the Failed status and an error message in errorReason. For example:
status:
  removeStatus:
    osdRemoveStatus:
      errorReason: Timeout (30m0s) reached for waiting pg rebalance for osd 2
      status: Failed
The Failed status blocks the replacement of the failed master node.
Workaround:
On the management cluster, obtain metadata.name, metadata.namespace,
and the spec section of KaaSCephOperationRequest being stuck:
<kcorName>
Name of the new KaaSCephOperationRequest that differs from the failed
one. Usually, a failed KaaSCephOperationRequest resource is called
delete-request-for-<masterMachineName>. Therefore, you can name the
new resource delete-request-for-<masterMachineName>-new.
<kcorNamespace>
Namespace of the failed KaaSCephOperationRequest resource.
<kcorSpec>
Spec of the failed KaaSCephOperationRequest resource.
Apply the created template to the management cluster. For example:
kubectl apply -f kcor-stuck-regional.yaml
Remove the failed KaaSCephOperationRequest resource from
the management cluster:
kubectl delete kaascephoperationrequest <kcorName>
Replace <kcorName> with the name of KaaSCephOperationRequest that
has the Failed status.
StackLight
[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error
On a managed cluster, the StackLight pods may get stuck with the
Pod predicate NodeAffinity failed error in the pod status. The issue may
occur if the StackLight node label was added to one machine and
then removed from another one.
The issue does not affect the StackLight services: all required StackLight
pods migrate successfully, except for extra pods that are created and get
stuck during pod migration.
Upgrade
[21810] Upgrade to Cluster releases 5.22.0 and 7.5.0 may get stuck
Affects Ubuntu-based clusters deployed after Feb 10, 2022
If you deploy an Ubuntu-based cluster using the deprecated Cluster release
7.4.0 (and earlier) or 5.21.0 (and earlier) starting from February 10, 2022,
the cluster update to the Cluster releases 7.5.0 and 5.22.0 may get stuck
while applying the Deploy state to the cluster machines. The issue
affects all cluster types: management, regional, and managed.
To verify that the cluster is affected:
Log in to the Container Cloud web UI.
In the Clusters tab, capture the RELEASE and
AGE values of the required Ubuntu-based cluster. If the values
match the ones from the issue description, the cluster may be affected.
Using SSH, log in to the manager or worker node that got stuck while
applying the Deploy state and identify the containerd package version:
containerd --version
If the version is 1.5.9, the cluster is affected.
In /var/log/lcm/runners/<nodeName>/deploy/, verify whether the Ansible
deployment logs contain the following errors that indicate that the
cluster is affected:
The following packages will be upgraded:
  docker-ee docker-ee-cli
The following packages will be DOWNGRADED:
  containerd.io
STDERR:
E: Packages were downgraded and -y was used without --allow-downgrades.
Workaround:
Warning
Apply the steps below to the affected nodes one-by-one and
only after each consecutive node gets stuck on the Deploy phase with the
Ansible log errors. Such sequence ensures that each node is cordon-drained
and Docker is properly stopped. Therefore, no workloads are affected.
Using SSH, log in to the first affected node and install containerd 1.5.8:
Wait for Ansible to reconcile. The node should become Ready in several
minutes.
Wait for the next node of the cluster to get stuck on the Deploy phase
with the Ansible log errors. Only after that, apply the steps above on the
next node.
Patch the remaining nodes one-by-one using the steps above.
Container Cloud web UI
[24075] Ubuntu 20.04 does not display for AWS and Equinix Metal managed clusters
During creation of a machine for AWS or Equinix Metal provider with public
networking, the Ubuntu 20.04 option does not display in the
drop-down list of operating systems in the Container Cloud UI. Only
Ubuntu 18.04 displays in the list.
Workaround:
Identify the parent management or regional cluster of the affected managed
cluster located in the same region.
For example, if the affected managed cluster was deployed in region-one,
identify its parent cluster by running:
If the aws-credentials-controller or
equinixmetal-credentials-controller Helm releases are missing in the
spec.providerSpec.value.kaas.regional section or the helmReleases
array is missing for the corresponding provider, add the releases with the
overwritten values.
During machine creation using the Container Cloud web UI, a custom value for
a node label cannot be set.
As a workaround, manually add the value to
spec.providerSpec.value.nodeLabels in machine.yaml.
[249] A newly created project does not display in the Container Cloud web UI
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.17.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Mirantis Container Cloud GA release 2.16.1 is based on
2.16.0 and:
Introduces support for the Cluster release 8.6.0
that is based on the Cluster release 7.6.0 and represents
Mirantis OpenStack for Kubernetes (MOSK) 22.2.
This Cluster release is based on the updated version of Mirantis Kubernetes
Engine 3.4.7 with Kubernetes 1.20 and Mirantis Container Runtime 20.10.8.
Supports the latest Cluster releases 7.6.0 and
11.0.0.
Does not support new deployments based on the deprecated Cluster releases
8.5.0, 7.5.0, 6.20.0,
and 5.22.0 that were deprecated in 2.16.0.
For details about the Container Cloud release 2.16.1, refer to its parent
release 2.16.0:
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
Introduces support for the Cluster release 11.0.0 for
managed clusters that is based on Mirantis Container Runtime 20.10.8 and
the updated version of Mirantis Kubernetes Engine 3.5.1 with Kubernetes 1.21.
Introduces support for the Cluster release 7.6.0 for all
types of clusters that is based on Mirantis Container Runtime 20.10.8 and
the updated version of Mirantis Kubernetes Engine 3.4.7 with Kubernetes 1.20.
Does not support greenfield deployments on deprecated Cluster releases
7.5.0, 6.20.0, and
5.22.0. Use the latest Cluster releases of the series
instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.16.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.16.0.
For the list of enhancements in the Cluster releases 11.0.0 and 7.6.0
that are introduced by the Container Cloud release 2.16.0,
see the Cluster releases (managed).
License management using the Container Cloud web UI
Implemented a mechanism for the Container Cloud and MKE license update using
the Container Cloud web UI. During the automatic license update, machines are
not cordoned and drained and user workloads are not interrupted for all
clusters starting from Cluster releases 7.6.0, 8.6.0, and 11.0.0. Therefore,
after your management cluster upgrades to Container Cloud 2.16.0, make sure to
update your managed clusters to the latest available Cluster releases.
Caution
Only the Container Cloud web UI users with the
m:kaas@global-admin role can update the Container Cloud license.
Scheduling of a management cluster upgrade using web UI
TechPreview
Implemented initial Technology Preview support for management cluster upgrade
scheduling through the Container Cloud web UI. Also, added full support for
management cluster upgrade scheduling through CLI.
Implemented automatic renewal of self-signed TLS certificates for internal
Container Cloud services that are generated and managed by the Container Cloud
provider.
Note
Custom certificates still require manual renewal. If applicable,
the information about expiring custom certificates is available in the
Container Cloud web UI.
Ubuntu 20.04 for greenfield bare metal managed clusters
TechPreview
Implemented initial Technology Preview support for Ubuntu 20.04 (Focal Fossa)
on bare metal non-MOSK-based greenfield deployments of managed
clusters. Now, you can optionally deploy Kubernetes machines with Ubuntu 20.04
on bare metal hosts. By default, Ubuntu 18.04 is used.
Caution
Upgrading to Ubuntu 20.04 on existing deployments initially
created before Container Cloud 2.16.0 is not supported.
Note
Support for Ubuntu 20.04 on MOSK-based Cluster
releases will be added in one of the following Container Cloud releases.
Extended the regional clusters support by implementing the ability to
deploy an additional regional cluster on bare metal. This provides an ability
to create baremetal-based managed clusters in bare metal regions
in parallel with managed clusters of other private-based regional clusters
within a single Container Cloud deployment.
Implemented the initial Technology Preview support for Mirantis OpenStack for Kubernetes
(MOSK) deployment on local software-based Redundant Array of
Independent Disks (RAID) devices to withstand failure of one device at a time.
The feature is available in the Cluster release 8.5.0 after the Container
Cloud upgrade to 2.16.0.
Using a custom bare metal host profile, you can configure and create
an mdadm-based software RAID device of type raid10 if you have
an even number of devices available on your servers. At least four
storage devices are required for such RAID device.
Implemented the ability to use any interface name instead of the k8s-lcm
bridge for the LCM network traffic on a bare metal cluster. The Subnet
objects for the LCM network must have the ipam/SVC-k8s-lcm label.
For details, see Service labels and their life cycle.
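A sketch of a Subnet object carrying this label is shown below; the apiVersion, the label value, and the CIDR are illustrative assumptions:
apiVersion: ipam.mirantis.com/v1alpha1
kind: Subnet
metadata:
  name: k8s-lcm-subnet
  namespace: default
  labels:
    ipam/SVC-k8s-lcm: "1"
spec:
  cidr: 10.0.10.0/24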
Keepalived for built-in load balancing in standalone containers
For the Container Cloud managed clusters that are based on vSphere,
Equinix Metal, or bare metal, moved Keepalived for the built-in load balancer
to run in standalone Docker containers managed by systemd as a service. This
change ensures version consistency of crucial infrastructure services and
reduces dependency on a host operating system version and configuration.
Reworked the Reconfigure phase applicable to LCMMachine that now can
apply to all nodes. This phase runs after the Deploy phase to apply
stateItems that relate to this phase without affecting workloads
running on the machine.
The following issues have been addressed in the Mirantis Container Cloud
release 2.16.0 along with the Cluster releases 11.0.0 and 7.6.0:
Bare metal:
[15989] Fixed the issue wherein removal of a bare metal-based management
cluster failed with a timeout.
[20189] Fixed the issue with the Container Cloud web UI reporting a
successful upgrade of a baremetal-based management cluster while running
the previous release.
OpenStack:
[20992] Fixed the issue that caused inability to deploy an OpenStack-based
managed cluster if DVR was enabled.
[20549] Fixed the CVE-2021-3520 security vulnerability
in the cinder-csi-plugin Docker image.
Equinix Metal:
[20467] Fixed the issue that caused deployment of an Equinix Metal based
management cluster with private networking to fail with the following error
message during the Ironic deployment:
0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
[21324] Fixed the issue wherein the bare metal host was trying to configure
an Equinix node as UEFI even for nodes with UEFI disabled.
[21326] Fixed the issue wherein the Ironic agent could not properly
determine which disk will be the first disk on the node. As a result, some
Equinix servers failed to boot from the proper disk.
[21338] Fixed the issue wherein some Equinix servers were configured in
BIOS to always boot from PXE, which caused the operating system to fail to
start from disk after provisioning.
StackLight:
[21646] Adjusted the kaas-exporter resource requests and limits to
avoid issues with the kaas-exporter container being occasionally
throttled and OOMKilled, preventing the Container Cloud metrics gathering.
[20591] Adjusted the RAM usage limit and disabled indices monitoring for
prometheus-es-exporter to avoid prometheus-es-exporter pod crash
looping due to low memory issues.
[17493] Fixed the following security vulnerabilities in the fluentd and
spilo Docker images:
MKE
[20651] A cluster deployment or update fails with not ready compose deployments
A managed cluster deployment, attachment, or update to a Cluster release with
MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the
compose pods flapping (ready>terminating>pending) and with the
following error message appearing in logs:
The default deployment limits for the Equinix and AWS controller containers,
set to 400m, may be lower than the actual resource consumption, which leads
to KubeContainersCPUThrottlingHigh alerts in StackLight.
As a workaround, increase the default resource limits for the affected
equinix-controllers or aws-controllers to 700m. For example:
kubectl edit deployment -n kaas aws-controllers
spec:
  ...
  resources:
    limits:
      cpu: 700m
  ...
[16379,23865] Cluster update fails with the FailedMount warning
<affectedProjectName> is the Container Cloud project name where
the pods failed to run
<affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
Scale up the affected StatefulSet or Deployment back to the
original number of replicas and wait until its state is Running.
Bare metal
[20736] Region deletion failure after regional deployment failure
If a baremetal-based regional cluster deployment fails before pivoting is
done, the corresponding region deletion fails.
Workaround:
Using the command below, manually delete all possible traces of the failed
regional cluster deployment, including but not limited to the following
objects that contain the kaas.mirantis.com/region label of the affected
region:
During configuration of a SAML identity provider using the
Add identity provider menu of the Keycloak admin console, the page
style breaks and the Save and Cancel buttons
disappear.
Workaround:
Log in to the Keycloak admin console.
In the sidebar menu, switch to the Master realm.
Navigate to Realm Settings > Themes.
In the Admin Console Theme drop-down menu, select
keycloak.
Click Save and refresh the browser window to apply the changes.
StackLight
[20876] StackLight pods get stuck with the ‘NodeAffinity failed’ error
On a managed cluster, the StackLight pods may get stuck with the
Pod predicate NodeAffinity failed error in the pod status. The issue may
occur if the StackLight node label was added to one machine and
then removed from another one.
The issue does not affect the StackLight services: all required StackLight
pods migrate successfully, except for extra pods that are created and get
stuck during pod migration.
The cordon-drain states are not removed after the maintenance mode is unset
for a machine. This issue may occur due to the maintenance transition
being stuck on the NodeWorkloadLock object.
Upgrade
[21810] Upgrade to Cluster releases 5.22.0 and 7.5.0 may get stuck
Affects Ubuntu-based clusters deployed after Feb 10, 2022
If you deploy an Ubuntu-based cluster using the deprecated Cluster release
7.4.0 (and earlier) or 5.21.0 (and earlier) starting from February 10, 2022,
the cluster update to the Cluster releases 7.5.0 and 5.22.0 may get stuck
while applying the Deploy state to the cluster machines. The issue
affects all cluster types: management, regional, and managed.
To verify that the cluster is affected:
Log in to the Container Cloud web UI.
In the Clusters tab, capture the RELEASE and
AGE values of the required Ubuntu-based cluster. If the values
match the ones from the issue description, the cluster may be affected.
Using SSH, log in to the manager or worker node that got stuck while
applying the Deploy state and identify the containerd package version:
containerd --version
If the version is 1.5.9, the cluster is affected.
In /var/log/lcm/runners/<nodeName>/deploy/, verify whether the Ansible
deployment logs contain the following errors that indicate that the
cluster is affected:
The following packages will be upgraded:
  docker-ee docker-ee-cli
The following packages will be DOWNGRADED:
  containerd.io
STDERR:
E: Packages were downgraded and -y was used without --allow-downgrades.
Workaround:
Warning
Apply the steps below to the affected nodes one-by-one and
only after each consecutive node gets stuck on the Deploy phase with the
Ansible log errors. Such sequence ensures that each node is cordon-drained
and Docker is properly stopped. Therefore, no workloads are affected.
Using SSH, log in to the first affected node and install containerd 1.5.8:
Wait for Ansible to reconcile. The node should become Ready in several
minutes.
Wait for the next node of the cluster to get stuck on the Deploy phase
with the Ansible log errors. Only after that, apply the steps above on the
next node.
Patch the remaining nodes one-by-one using the steps above.
Container Cloud web UI
[249] A newly created project does not display in the Container Cloud web UI
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
Cluster health
[21494] Controller pods are OOMkilled after deployment
After a successful deployment of a management or regional cluster, controller
pods may be OOMkilled and get stuck in CrashLoopBackOff state due to
incorrect memory limits.
Workaround:
Increase memory resources limits on the affected Deployment:
Open the affected Deployment configuration for editing:
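The command itself is not shown here. Assuming that the affected controller Deployment runs in the kaas namespace, the editing step may look similar to:
kubectl -n kaas edit deployment <affectedDeploymentName>
# In the editor, increase spec.template.spec.containers[].resources.limits.memory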
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.16.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Mirantis Container Cloud GA release 2.15.1 is based on
2.15.0 and:
Introduces support for the Cluster release 8.5.0
that is based on the Cluster release 7.5.0 and represents
Mirantis OpenStack for Kubernetes (MOSK) 22.1.
This Cluster release is based on Mirantis Kubernetes Engine 3.4.6 with
Kubernetes 1.20 and Mirantis Container Runtime 20.10.8.
Supports the latest Cluster releases 7.5.0 and
5.22.0.
Does not support new deployments based on the Cluster releases
7.4.0 and 5.21.0 that were deprecated
in 2.15.0.
For details about the Container Cloud release 2.15.1, refer to its parent
release 2.15.0:
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
Introduces support for the Cluster release 7.5.0
that is based on Mirantis Container Runtime 20.10.8 and the updated version
of Mirantis Kubernetes Engine 3.4.6 with Kubernetes 1.20.
Introduces support for the Cluster release 5.22.0
that is based on the updated version of Mirantis Kubernetes Engine 3.3.13
with Kubernetes 1.18 and Mirantis Container Runtime 20.10.8.
Does not support greenfield deployments on deprecated Cluster releases
7.4.0, 6.19.0, and
5.21.0. Use the latest Cluster releases of the series
instead.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.15.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.15.0.
For the list of enhancements in the Cluster releases 7.5.0 and 5.22.0
that are supported by the Container Cloud release
2.15.0, see the Cluster releases (managed).
Automatic upgrade of bare metal host operating system during cluster update
Introduced automatic upgrade of Ubuntu 18.04 packages on the bare metal hosts
during a management or managed cluster update.
Mirantis Container Cloud uses life cycle management tools to update
the operating system packages on the bare metal hosts. Container Cloud
may also trigger restart of the bare metal hosts to apply the updates, when
applicable.
Warning
During managed cluster update to the latest Cluster releases
available in Container Cloud 2.15.0, hosts are restarted to apply the latest
supported Ubuntu 18.04 packages and update kernel to version 5.4.0-90.101.
If Ceph is installed in the cluster, the Container
Cloud orchestration securely pauses the Ceph OSDs on the node before
restart. This allows avoiding degradation of the storage service.
HAProxy instead of NGINX for vSphere, Equinix Metal, and bare metal providers
Implemented a health check mechanism to verify target server availability by
reworking the high availability setup for the Container Cloud manager nodes
of the vSphere, Equinix Metal, and bare metal providers to use HAProxy instead
of NGINX. This change affects only the Ansible part. HAProxy deploys
as a container managed directly by containerd.
Additional regional cluster on Equinix Metal with private networking
Extended the regional clusters support by implementing the capability to
deploy an additional regional cluster on Equinix Metal with private
networking. This provides the capability to create managed clusters in the
Equinix Metal regions with private networking in parallel with managed clusters
of other supported providers within a single Container Cloud deployment.
Introduced the initial Technology Preview support for a scheduled Container
Cloud auto-upgrade using the MCCUpgrade object named mcc-upgrade
in Kubernetes API.
An Operator can delay or reschedule Container Cloud auto-upgrade that allows:
Blocking Container Cloud upgrade process for up to 7 days from the current
date and up to 30 days from the latest Container Cloud release
Limiting hours and weekdays when Container Cloud upgrade can run
Caution
Only the management cluster admin has access to the MCCUpgrade object.
You must use kubeconfig generated during the management cluster
bootstrap to access this object.
Note
Scheduling of the Container Cloud auto-upgrade using the Container
Cloud web UI will be implemented in one of the following releases.
Implemented the maintenance mode for management and managed clusters and
machines to prepare workloads for maintenance operations.
To enable maintenance mode on a machine, first enable maintenance mode
on a related cluster.
To disable maintenance mode on a cluster, first disable maintenance mode
on all machines of the cluster.
Warning
Cluster upgrades and configuration changes (except of the SSH keys
setting) are unavailable while a cluster is under maintenance. Make sure you
disable maintenance mode on the cluster after maintenance is complete.
Deprecated the iam-api service and IAM CLI (the iamctl command).
The logic of the iam-api service required for Container Cloud is moved
to scope-controller.
The iam-api service is used by IAM CLI only to manage users and
permissions. Instead of IAM CLI, Mirantis recommends using the Keycloak web UI
to perform necessary IAM operations.
The iam-api service and IAM CLI will be removed in one of the following
Container Cloud releases.
Upgraded the Ceph Helm releases in the ClusterRelease object from v2 to v3.
Switching of the remaining OpenStack Helm releases for Mirantis OpenStack for
Kubernetes to v3 will be implemented in one of the following Container Cloud
releases.
The following issues have been addressed in the Mirantis Container Cloud
release 2.15.0 along with the Cluster releases 7.5.0 and 5.22.0:
vSphere:
[19737] Fixed the issue with the vSphere VM template build hanging
with an empty kickstart file on the vSphere deployments with the RHEL 8.4
seed node.
[19468] Fixed the issue with the ‘Failed to remove finalizer from
machine’ error during cluster deletion if a RHEL license was removed before
the related managed cluster was deleted.
IAM:
[5025] Updated the Keycloak version from 12.0.0 to 15.0.2 to fix the
CVE-2020-2757.
[21024][Custom certificates] Fixed the issue with the readiness check
failure during addition of a custom certificate for Keycloak that hung
with the failed to wait for OIDC certificate to be updated timeout
warning.
StackLight:
[20193] Updated the Grafana Docker image from 8.2.2 to 8.2.7
to fix the high-severity CVE-2021-43798.
[18933] Fixed the issue with the Alerta pods failing to pass the
readiness check even if Patroni, the Alerta backend, operated correctly.
[19682] Fixed the issue with the Prometheus web UI URLs in
notifications sent to Salesforce using the HTTP protocol instead of HTTPS
on deployments with TLS enabled for IAM.
Ceph:
[19645] Fixed the issue with the Ceph OSD removal request failure
during the Processing stage.
[19574] Fixed the issue with the Ceph OSD removal not cleaning up the
device used for multiple OSDs.
[20298] Fixed the issue with spec validation failing during creation of
KaaSCephOperationRequest.
[20355] Fixed the issue with KaaSCephOperationRequest being cached
after recreation with the same name, specified in metadata.name, as the
previous KaaSCephOperationRequest CR. The issue caused no removal
to be performed upon applying the new KaaSCephOperationRequest CR.
Bare metal:
[19786] Fixed the issue with managed cluster deployment failing
on long-running management clusters with BareMetalHost being stuck in
the Preparing state and the ironic-conductor and ironic-api
pods reporting the not enough disk space error due to the
dnsmasq-dhcpd logs overflow.
Upgrade:
[20459] Fixed the issue with failure to upgrade a management or
regional cluster originally deployed using the Container Cloud release
earlier than 2.8.0. The failure occurred during Ansible update if a machine
contained /usr/local/share/ca-certificates/mcc.crt, which was either
empty or invalid.
MKE¶[20651] A cluster deployment or update fails with not ready compose deployments¶
A managed cluster deployment, attachment, or update to a Cluster release with
MKE versions 3.3.13, 3.4.6, 3.5.1, or earlier may fail with the
compose pods flapping (ready>terminating>pending) and with the
following error message appearing in logs:
Once done, the cluster deployment or update resumes.
Re-enable DCT.
Hardware-related¶[18962] Machine provisioning issues during cluster deployment¶
A node with Intel NICs may randomly get stuck in the Provisioning
state during the Equinix Metal based management or managed cluster deployment
with a private network. In this case, the affected machine is non-pingable
using the internal IP, for example, 192.168.0.53.
The issue relates to particular hardware with Intel Boot Agent (IBA)
installed, which is configured to be the first boot option on the server.
An affected server will continue booting from iPXE instead of booting from a
hard drive even after a successful provisioning. As a result,
the machine becomes inaccessible and cluster deployment gets stuck.
As a workaround, disable dnsmasq during cluster deployment as described
below.
Workaround:
Verify that the BareMetalHosts object status of the affected machine
is Provisioned:
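For example, using the standard BareMetalHost status fields (the exact field
path may differ slightly between releases):
kubectl -n <managedClusterProjectName> get baremetalhosts <hostName> -o jsonpath='{.status.provisioning.state}'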
Deployment of an Equinix Metal based management cluster with private networking
may fail with the following error message during the Ironic deployment. The
issue is caused by csi-rbdplugin provisioner pods that got stuck.
0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
The workaround is to restart the csi-rbdplugin provisioner pods:
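One possible way to do this, assuming the default rook-ceph namespace and the
standard app=csi-rbdplugin-provisioner label:
kubectl -n rook-ceph delete pod -l app=csi-rbdplugin-provisioner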
After removal of a managed cluster, the namespace is not deleted due to
KaaSCephOperationRequest CRs blocking the deletion. The workaround is to
manually remove finalizers and delete the KaaSCephOperationRequest CRs.
Workaround:
Remove finalizers from all KaaSCephOperationRequest resources:
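For example, assuming that kaascephoperationrequest resolves as the resource
name in the cluster:
kubectl -n <managedClusterProjectName> get kaascephoperationrequest -o name \
  | xargs -I{} kubectl -n <managedClusterProjectName> patch {} --type merge -p '{"metadata":{"finalizers":null}}'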
During configuration of an identity provider SAML using the
Add identity provider menu of the Keycloak admin console, the page
style breaks as well as the Save and Cancel buttons
disappear.
Workaround:
Log in to the Keycloak admin console.
In the sidebar menu, switch to the Master realm.
Navigate to Realm Settings > Themes.
In the Admin Console Theme drop-down menu, select
keycloak.
Click Save and refresh the browser window to apply the changes.
LCM¶[22341] The cordon-drain states are not removed after maintenance mode is unset¶
The cordon-drain states are not removed after the maintenance mode is unset
for a machine. This issue may occur due to the maintenance transition
being stuck on the NodeWorkloadLock object.
On a managed cluster, the StackLight pods may get stuck with the
Pod predicate NodeAffinity failed error in the pod status. The issue may
occur if the StackLight node label was added to one machine and
then removed from another one.
The issue does not affect the StackLight services: all required StackLight
pods migrate successfully, except for extra pods that are created and get
stuck during pod migration.
On highly loaded clusters, the kaas-exporter resource limits for CPU
and RAM are lower than the amount of resources consumed. As a result, the
kaas-exporter container is periodically throttled and OOMKilled, which prevents
Container Cloud metrics gathering.
As a workaround, increase the default resource limits for kaas-exporter
in the Cluster object of the management cluster. For example:
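The snippet below is illustrative only: the management cluster name kaas-mgmt
in the default project is taken from examples elsewhere in this document, the
exact path to the kaas-exporter Helm release values inside the Cluster object
may differ in your release, and the limit values are placeholders rather than
recommendations.
kubectl -n default edit cluster kaas-mgmt
# spec:
#   providerSpec:
#     value:
#       kaas:
#         management:
#           helmReleases:
#           - name: kaas-exporter
#             values:
#               resources:
#                 limits:
#                   cpu: 100m
#                   memory: 200Mi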
Upgrade¶[21810] Upgrade to Cluster releases 5.22.0 and 7.5.0 may get stuck¶
Affects Ubuntu-based clusters deployed after Feb 10, 2022
If you deploy an Ubuntu-based cluster using the deprecated Cluster release
7.4.0 (and earlier) or 5.21.0 (and earlier) starting from February 10, 2022,
the cluster update to the Cluster releases 7.5.0 and 5.22.0 may get stuck
while applying the Deploy state to the cluster machines. The issue
affects all cluster types: management, regional, and managed.
To verify that the cluster is affected:
Log in to the Container Cloud web UI.
In the Clusters tab, capture the RELEASE and
AGE values of the required Ubuntu-based cluster. If the values
match the ones from the issue description, the cluster may be affected.
Using SSH, log in to the manager or worker node that got stuck while
applying the Deploy state and identify the containerd package version:
containerd --version
If the version is 1.5.9, the cluster is affected.
In /var/log/lcm/runners/<nodeName>/deploy/, verify whether the Ansible
deployment logs contain the following errors that indicate that the
cluster is affected:
The following packages will be upgraded:
  docker-ee docker-ee-cli
The following packages will be DOWNGRADED:
  containerd.io
STDERR:
E: Packages were downgraded and -y was used without --allow-downgrades.
Workaround:
Warning
Apply the steps below to the affected nodes one-by-one and
only after each consecutive node gets stuck on the Deploy phase with the
Ansible log errors. Such sequence ensures that each node is cordon-drained
and Docker is properly stopped. Therefore, no workloads are affected.
Using SSH, log in to the first affected node and install containerd 1.5.8:
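An illustrative command sequence, assuming the node uses the apt package
manager and the containerd.io package (check the exact 1.5.8 version string
available in your package repository first):
apt-cache madison containerd.io
apt-get install -y --allow-downgrades containerd.io=<1.5.8-version-string>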
Wait for Ansible to reconcile. The node should become Ready in several
minutes.
Wait for the next node of the cluster to get stuck on the Deploy phase
with the Ansible log errors. Only after that, apply the steps above on the
next node.
Patch the remaining nodes one-by-one using the steps above.
[20189] Container Cloud web UI reports upgrade while running previous release¶
Under certain conditions, the upgrade of the baremetal-based management
cluster may get stuck even though the Container Cloud web UI reports a
successful upgrade. The issue is caused by inconsistent metadata in IPAM that
prevents automatic allocation of the Ceph network. It happens when IPAddr
objects associated with the management cluster nodes refer to a non-existent
Subnet object by the resource UID.
To verify whether the cluster is affected:
Inspect the baremetal-provider logs:
kubectl -n kaas logs deployments/baremetal-provider
If the logs contain the following entries, the cluster may be affected:
Ceph public network address validation failed for cluster default/kaas-mgmt: invalid address '0.0.0.0/0' \
Ceph cluster network address validation failed for cluster default/kaas-mgmt: invalid address '0.0.0.0/0' \
'default/kaas-mgmt' cluster nodes internal (LCM) IP addresses: 10.64.96.171,10.64.96.172,10.64.96.173 \
failed to configure ceph network for cluster default/kaas-mgmt: \
Ceph network addresses auto-assignment error: validation failed for Ceph network addresses: \
error parsing address '': invalid CIDR address:
Empty values of the network parameters in the last entry indicate that
the provider cannot locate the Subnet object based on the data
from the IPAddr object.
Note
In the logs, capture the internal (LCM) IP addresses of the
cluster nodes to use them later in this procedure.
Validate the network address used for Ceph by inspecting the
MiraCeph object:
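For example, assuming the MiraCeph object resides in the ceph-lcm-mirantis
namespace:
kubectl -n ceph-lcm-mirantis get miraceph -o yaml | grep -A 3 network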
<affectedProjectName> is the Container Cloud project name where
the pods failed to run
<affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
Scale up the affected StatefulSet or Deployment back to the
original number of replicas and wait until its state is Running.
Container Cloud web UI¶[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.15.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section contains historical information on the unsupported Container
Cloud releases delivered in 2020-2021. For the latest supported Container
Cloud release, see Container Cloud releases.
Based on 2.13.0, this release introduces the Cluster release 6.20.0
that is based on 5.20.0 and supports Mirantis OpenStack for Kubernetes
(MOS) 21.6.
For the list of Cluster releases 7.x and 5.x that are supported by
2.13.1 as well as for its features with addressed and known issues,
refer to the parent release 2.13.0.
AWS resources discovery in the Container Cloud web UI
Credentials statuses for OpenStack and AWS in the Container Cloud web UI
StackLight improvements:
Grafana upgrade from version 6.6.2 to 7.1.5
Grafana Image Renderer pod to offload rendering of images from charts
Grafana home dashboard improvements
Splitting of the regional and management cluster function
in StackLight telemetry to obtain aggregated metrics on the management
cluster from regional and managed clusters
First GA release of Container Cloud with the following key features:
Container Cloud with Mirantis Kubernetes Engine (MKE) container
clusters for the management plane
Support for managed Container Cloud with MKE container clusters
on top of the AWS, OpenStack, and bare metal cloud providers
Support for attaching of the existing MKE standalone clusters
Ceph as a Kubernetes storage provider for the bare metal use case
Multi-region support for security and scalability
IAM integration with MKE container clusters to provide SSO
Logging, monitoring, and alerting tuned for MKE with data aggregation
to the management cluster and telemetry sent to Mirantis
** - the Cluster release supports only attachment of existing MKE 3.3.4
clusters. For the deployment of new or attachment of existing clusters
based on other supported MKE versions, the latest available Cluster releases
are used.
Introduces support for the Cluster release 7.4.0
that is based on Mirantis Container Runtime 20.10.6 and the updated version
of Mirantis Kubernetes Engine 3.4.6 with Kubernetes 1.20.
Introduces support for the Cluster release 5.21.0
that is based on the updated version of Mirantis Kubernetes Engine 3.3.13
with Kubernetes 1.18 and Mirantis Container Runtime 20.10.6.
Supports deprecated Cluster releases 5.20.0,
6.19.0, and 7.3.0 that will become
unsupported in the following Container Cloud releases.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.14.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.14.0.
For the list of enhancements in the Cluster releases 7.4.0 and 5.21.0
that are supported by the Container Cloud release
2.14.0, see the Cluster releases (managed).
Support of the Equinix Metal provider with private networking¶
TechPreview
Introduced the Technology Preview support of Container Cloud deployments that
are based on the Equinix Metal infrastructure with private networking.
Private networks are required for the following use cases:
Connect the Container Cloud to the on-premises corporate networks without
exposing it to the Internet. This can be required by corporate security
policies.
Reduce ingress and egress bandwidth costs and the number of public IP
addresses utilized by the deployment. Public IP addresses are a scarce
and valuable resource, and Container Cloud should only expose the necessary
services in that address space.
Testing and staging environments typically do not require accepting
connections from the outside of the cluster. Such Container Cloud clusters
should be isolated in private VLANs.
Caution
The feature is supported starting from the Cluster releases
7.4.0 and 5.21.0.
Note
Support of the regional clusters that are based on Equinix Metal
with private networking will be announced in one of the following
Container Cloud releases.
Support of the community CentOS 7.9 version for the OpenStack provider¶
Introduced support of the community version of the CentOS 7.9 operating
system for the management, regional, and managed cluster machines deployed
with the OpenStack provider. The following CentOS resources are used:
Configuration of server metadata for OpenStack machines in web UI¶
Implemented the possibility to specify the cloud-init metadata
during OpenStack machine creation through the Container Cloud web UI.
Server metadata is a set of string key-value pairs that you can configure
in the meta_data field of cloud-init.
Initial RHEL 8.4 support for the vSphere provider¶
TechPreview
Introduced the initial Technology Preview support of the RHEL 8.4
operating system for the vSphere-based management, regional, and managed
clusters.
Caution
Deployment of a Container Cloud cluster based on both RHEL
and CentOS operating systems or on mixed RHEL versions
is not supported.
Configuration of RAM and CPU for vSphere machines in web UI¶
Implemented the possibility to configure the following settings during a
vSphere machine creation using the Container Cloud web UI:
VM memory size, which defaults to 16 GB
Number of VM CPUs, which defaults to 8
Visualization of service mapping in the bare metal IpamHost object¶
Implemented the following amendments to the ipam/SVC-* labels to simplify
visualization of service mapping in the bare metal IpamHost object:
All IP addresses allocated from the Subnet object that has the
ipam/SVC-* service labels defined will inherit those labels
The new ServiceMap field in IpamHost.Status contains information
about which IPs and interfaces correspond to which Container Cloud services.
Separation of PXE and management networks for bare metal clusters¶
Added the capability to configure a dedicated PXE network that is separated
from the management network on management or regional bare metal clusters.
A separate PXE network allows isolating sensitive bare metal provisioning
process from the end users. The users still have access to Container Cloud
services, such as Keycloak, to authenticate workloads in managed clusters,
such as Horizon in a Mirantis OpenStack for Kubernetes cluster.
User access management through the Container Cloud API or web UI¶
Implemented the capability to manage user access through the Container Cloud
API or web UI by introducing the following objects to manage user role
bindings:
IAMUser
IAMRole
IAMGlobalRoleBinding
IAMRoleBinding
IAMClusterRoleBinding
Also, updated the role naming used in Keycloak by introducing the following
IAM roles with the possibility to upgrade the old-style role names with the
new-style ones:
global-admin
bm-pool-operator
operator
user
stacklight-admin
Caution
User management for the MOSK m:os roles
through API or web UI is on the final development stage and
will be announced in one of the following Container Cloud
releases. Meanwhile, continue managing these roles using
Keycloak.
The possibility to manage the IAM*RoleBinding objects
through the Container Cloud web UI is available for the
global-admin role only. The possibility to manage project
role bindings using the operator role will become
available in one of the following Container Cloud releases.
Support matrix of MKE versions for cluster attachment¶
Updated the matrix of supported MKE versions for cluster attachment to improve
the upgrade and testing procedures:
Implemented separate Cluster release series to support 2 series of MKE
versions for cluster attachment:
Cluster release series 9.x for the 3.3.x version series
Cluster release series 10.x for the 3.4.x version series
Added a requirement to update an existing MKE cluster to the latest available
supported MKE version in a series to trigger the Container Cloud upgrade
that allows updating its components, such as StackLight, to the latest
versions.
When a new MKE version for cluster attachment is released in a series,
the oldest supported version of the previous Container Cloud release is
dropped.
The ‘Interface Guided Tour’ button in the Container Cloud web UI¶
Added the Interface Guided Tour button to the Container Cloud
web UI Help section for handy access to the guided tour that
steps you through the key web UI features of the multi-cluster multi-cloud
Container Cloud platform.
Switch of bare metal and StackLight Helm releases from v2 to v3¶
Upgraded the bare metal and StackLight Helm releases in the ClusterRelease
and KaasRelease objects from v2 to v3. Switching of the remaining Ceph and
OpenStack Helm releases to v3 will be implemented in one of the following
Container Cloud releases.
The following issues have been addressed in the Mirantis Container Cloud
release 2.14.0 along with the Cluster releases 7.4.0 and 5.21.0.
[18429][StackLight] Increased the default resource requirements for
Prometheus Elasticsearch Exporter to prevent the
KubeContainersCPUThrottlingHigh alert from firing too often.
[18879][Ceph] Fixed the issue with the RADOS Gateway (RGW) pod overriding the
global CA bundle located at /etc/pki/tls/certs with an incorrect
self-signed CA bundle during deployment of a Ceph cluster.
[9899][Upgrade] Fixed the issue with Helm releases getting stuck in the
PENDING_UPGRADE state during a management or managed cluster upgrade.
[18708][LCM] Fixed the issue with the Pending state of machines
during deployment of any Container Cloud cluster or attachment of an existing
MKE cluster due to some project being stuck in the Terminating state.
This section lists known issues with workarounds for the Mirantis
Container Cloud release 2.14.0 including the Cluster releases 7.4.0, 6.20.0,
and 5.21.0.
After removal of a managed cluster, the namespace is not deleted due to
KaaSCephOperationRequest CRs blocking the deletion. The workaround is to
manually remove finalizers and delete the KaaSCephOperationRequest CRs.
Workaround:
Remove finalizers from all KaaSCephOperationRequest resources:
A managed cluster deployment fails on long-running management clusters with
BareMetalHost being stuck in the Preparing state and the
ironic-conductor and ironic-api pods reporting the
not enough disk space error due to the dnsmasq-dhcpd logs overflow.
Workaround:
Log in to the ironic-conductor pod.
Verify the free space in /volume/log/dnsmasq.
If the free space on a volume is less than 10%:
Manually delete log files in /volume/log/dnsmasq/.
Scale down the dnsmasq pod to 0 replicas:
kubectl -n kaas scale deployment dnsmasq --replicas=0
Scale up the dnsmasq pod to 1 replica:
kubectl -n kaas scale deployment dnsmasq --replicas=1
If the volume has enough space, assess the Ironic logs to identify
the root cause of the issue.
[17792] Full preflight fails with a timeout waiting for BareMetalHost¶
If you run bootstrap.sh preflight with
KAAS_BM_FULL_PREFLIGHT=true, the script fails with the following message:
Unset the KAAS_BM_FULL_PREFLIGHT environment variable using the
unset KAAS_BM_FULL_PREFLIGHT command.
Rerun bootstrap.sh preflight, which executes
fast preflight instead.
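For example:
unset KAAS_BM_FULL_PREFLIGHT
./bootstrap.sh preflight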
Hardware-related¶[18962] Machine provisioning issues during cluster deployment¶
A node with Intel NICs may randomly get stuck in the Provisioning
state during the Equinix Metal based management or managed cluster deployment
with a private network. In this case, the affected machine is non-pingable
using the internal IP, for example, 192.168.0.53.
The issue relates to particular hardware with Intel Boot Agent (IBA)
installed, which is configured to be the first boot option on the server.
An affected server will continue booting from iPXE instead of booting from a
hard drive even after a successful provisioning. As a result,
the machine becomes inaccessible and cluster deployment gets stuck.
As a workaround, disable dnsmasq during cluster deployment as described
below.
Workaround:
Verify that the BareMetalHosts object status of the affected machine
is Provisioned:
On the vSphere deployments with the RHEL 8.4 seed node, the VM template build
for deployment hangs because of an empty kickstart file provided to the VM.
In this case, the VMware web console displays the following error
for the affected VM:
Kickstart file /run/install/ks.cfg is missing
The fix for the issue is implemented in the latest version of the Packer image
for the VM template build.
Workaround:
Open bootstrap.sh in the kaas-bootstrap folder for editing.
Update the Docker image tag for the VSPHERE_PACKER_DOCKER_IMAGE
variable to v1.0-39.
Save edits and restart the VM template build:
./bootstrap.sh vsphere_template
[19468] ‘Failed to remove finalizer from machine’ error during cluster deletion¶
If a RHEL license is removed before the related managed cluster is deleted,
the cluster deletion hangs with the following Machine object error:
Failed to remove finalizer from machine ...failed to get RHELLicense object
As a workaround, recreate the removed RHEL license object
with the same name using the Container Cloud web UI or API.
Warning
The kubectl apply command automatically saves the
applied data as plain text into the
kubectl.kubernetes.io/last-applied-configuration annotation of the
corresponding object. This may result in revealing sensitive data in this
annotation when creating or modifying the object.
Therefore, do not use kubectl apply on this object.
Use kubectl create, kubectl patch, or
kubectl edit instead.
If you used kubectl apply on this object, you
can remove the kubectl.kubernetes.io/last-applied-configuration
annotation from the object using kubectl edit.
[14080] Node leaves the cluster after IP address change¶
A vSphere-based management cluster bootstrap fails due to a node leaving the
cluster after an accidental IP address change.
The issue may affect a vSphere-based cluster only when IPAM
is not enabled and IP addresses assignment to the vSphere virtual machines
is done by a DHCP server present in the vSphere network.
By default, a DHCP server keeps lease of the IP address for 30 minutes.
Usually, a VM dhclient prolongs such lease by frequent DHCP requests
to the server before the lease period ends.
The DHCP prolongation request period is always less than the default lease time
on the DHCP server, so prolongation usually works.
But in case of network issues, for example, when dhclient from the
VM cannot reach the DHCP server, or the VM is being slowly powered on
for more than the lease time, such VM may lose its assigned IP address.
As a result, it obtains a new IP address.
Container Cloud does not support network reconfiguration after
the IP of the VM has been changed. Therefore, such issue may lead
to a VM leaving the cluster.
Symptoms:
One of the nodes is in the NodeNotReady or down state:
kubectl get nodes -o wide
docker node ls
The UCP Swarm manager logs on the healthy manager node contain the
following example error:
The output of the docker info command contains the following
example error:
Error: rpc error: code = Unknown desc = The swarm does not have a leader. \
It's possible that too few managers are online. \
Make sure more than half of the managers are online.
The UCP controller logs contain the following example error:
docker logs -f ucp-controller
"warning","msg":"Node State Active check error: \
Swarm Mode Manager health check error: \
info: Cannot connect to the Docker daemon at tcp://<node IP>:12376. \
Is the docker daemon running?
On the affected node, the IP address on the first interface eth0
does not match the IP address configured in Docker. Verify the
NodeAddress field in the output of the docker info command.
The following lines are present in /var/log/messages:
If there are several lines where the IP is different, the node is affected.
Workaround:
Select from the following options:
Bind IP addresses for all machines to their MAC addresses on the DHCP server
for the dedicated vSphere network. In this case, VMs receive only specified
IP addresses that never change.
Remove the Container Cloud node IPs from the IP range on the DHCP server
for the dedicated vSphere network and configure the first interface eth0
on VMs with a static IP address.
If a managed cluster is affected, redeploy it with IPAM enabled
for new machines to be created and IPs to be assigned properly.
LCM¶[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3.
Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment,
Helm releases may get stuck in the FAILED or UNKNOWN state
although the corresponding machines statuses are Ready
in the Container Cloud web UI. For example, if the StackLight Helm release
fails, the links to its endpoints are grayed out in the web UI.
In the cluster status, providerStatus.helm.ready and
providerStatus.helm.releaseStatuses.<releaseName>.success are false.
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify the failed release has the UNKNOWN or FAILED status
in the HelmBundle object:
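For example, assuming the HelmBundle object is named after the cluster and
resides in the cluster project namespace:
kubectl -n <projectName> get helmbundle <clusterName> -o yaml | grep -A 3 stacklight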
Adding a custom certificate for Keycloak using the container-cloud
binary hangs with the failed to wait for OIDC certificate to be updated
timeout warning. The readiness check fails due to a wrong condition.
Ignore the timeout warning. If you can log in to the Container Cloud web UI,
the certificate has been applied successfully.
[18331] Keycloak admin console menu disappears on ‘Add identity provider’ page¶
During configuration of an identity provider SAML using the
Add identity provider menu of the Keycloak admin console, the page
style breaks as well as the Save and Cancel buttons
disappear.
Workaround:
Log in to the Keycloak admin console.
In the sidebar menu, switch to the Master realm.
Navigate to Realm Settings > Themes.
In the Admin Console Theme drop-down menu, select
keycloak.
Click Save and refresh the browser window to apply the changes.
StackLight¶[18933] Alerta pods fail to pass the readiness check¶
Occasionally, an Alerta pod may be not Ready even if Patroni, the Alerta
backend, operates correctly. In this case, some of the following errors may
appear in the Alerta logs:
2021-10-25 13:10:55,865 DEBG 'nginx' stdout output:
2021/10/25 13:10:55 [crit] 25#25: *17408 connect() to unix:/tmp/uwsgi.sock failed (2: No such file or directory) while connecting to upstream, client: 127.0.0.1, server: , request: "GET /api/config HTTP/1.1", upstream: "uwsgi://unix:/tmp/uwsgi.sock:", host: "127.0.0.1:8080"
ip=\- [\25/Oct/2021:13:10:55 +0000] "\GET /api/config HTTP/1.1" \502 \157 "\-" "\python-requests/2.24.0"
/web | /api/config | > GET /api/config HTTP/1.1
Prometheus web UI URLs in StackLight notifications sent to Salesforce use a
wrong protocol: HTTP instead of HTTPS. The issue affects deployments with
TLS enabled for IAM.
The workaround is to manually change the URL protocol in the web browser.
Storage¶[20312] Creation of ceph-based PVs gets stuck in Pending state¶
The csi-rbdplugin-provisioner pod (csi-provisioner container) may show
constant retries attempting to create a PV if the csi-rbdplugin-provisioner
pod was scheduled and started on a node with no connectivity to the Ceph
storage. As a result, creation of a Ceph-based persistent volume (PV) may get
stuck in the Pending state.
As a workaround, manually specify the affinity or toleration rules for the
csi-rbdplugin-provisioner pod.
Workaround:
On the managed cluster, open the rook-ceph-operator-config map for
editing:
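For example, using the standard Rook CSI placement keys (the values shown are
placeholders to be adjusted to your node labels and taints):
kubectl --kubeconfig <managedClusterKubeconfig> -n rook-ceph edit configmap rook-ceph-operator-config
# data:
#   CSI_PROVISIONER_NODE_AFFINITY: "role=ceph-osd-node"
#   CSI_PROVISIONER_TOLERATIONS: |
#     - key: node-role.kubernetes.io/controlplane
#       operator: Exists
#       effect: NoSchedule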
When creating a new KaaSCephOperationRequest CR with the same name
specified in metadata.name as in the previous KaaSCephOperationRequest
CR, even if the previous request was deleted manually, the new request includes
information about the previous actions and is in the Completed phase. In
this case, no removal is performed.
Workaround:
On the management cluster, manually delete the old
KaasCephOperationRequest CR with the same metadata.name:
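For example, assuming that kaascephoperationrequest resolves as the resource
name:
kubectl -n <managedClusterProjectName> delete kaascephoperationrequest <oldRequestName>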
Occasionally, when Processing a Ceph OSD removal request,
KaaSCephOperationRequest retries the osd stop command without an
interval, which leads to the removal request failure.
As a workaround, create a new request to proceed with the Ceph OSD removal.
[19574] Ceph OSD removal does not clean up device used for multiple OSDs¶
When executing a Ceph OSD removal request to remove Ceph OSDs placed on one
disk, the request completes without errors but the device itself still keeps
the old LVM partitions. As a result, Rook cannot use such device.
An upgrade of a management or regional cluster originally deployed using
the Container Cloud release earlier than 2.8.0 fails with
error setting certificate verify locations during Ansible update if a
machine contains /usr/local/share/ca-certificates/mcc.crt, which is
either empty or invalid. Managed clusters are not affected.
Workaround:
On every machine of the affected management or regional cluster:
Delete /usr/local/share/ca-certificates/mcc.crt.
In /etc/lcm/environment, remove the following line:
The Equinix Metal and MOS-based managed clusters may fail to update to the
latest Cluster release with kubelet being stuck and reporting authorization
errors.
The cluster is affected by the issue if you see the Failed to make webhook
authorizer request: context canceled error in the kubelet logs:
docker logs ucp-kubelet --since 5m 2>&1 | grep 'Failed to make webhook authorizer request: context canceled'
As a workaround, restart the ucp-kubelet container on the affected
node(s):
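For example, run the following command on each affected node:
docker restart ucp-kubelet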
<affectedProjectName> is the Container Cloud project name where
the pods failed to run
<affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
Scale up the affected StatefulSet or Deployment back to the
original number of replicas and wait until its state is Running.
Container Cloud web UI¶[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.14.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Mirantis Container Cloud GA release 2.13.1 is based on
2.13.0 and:
Introduces support for the Cluster release 6.20.0
that is based on the Cluster release 5.20.0 and represents
Mirantis OpenStack for Kubernetes (MOS) 21.6.
This Cluster release is based on Mirantis Kubernetes Engine 3.3.12 with
Kubernetes 1.18 and Mirantis Container Runtime 20.10.6.
Supports the latest Cluster releases 7.3.0 and
5.20.0.
Supports deprecated Cluster releases 7.2.0,
6.19.0, and 5.19.0 that will become
unsupported in the following Container Cloud releases.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
For details about the Container Cloud release 2.13.1, refer to its parent
release 2.13.0.
Introduces support for the Cluster release 7.3.0
that is based on Mirantis Container Runtime 20.10.6 and
Mirantis Kubernetes Engine 3.4.5 with Kubernetes 1.20.
Introduces support for the Cluster release 5.20.0
that is based on Mirantis Kubernetes Engine 3.3.12 with Kubernetes 1.18
and Mirantis Container Runtime 20.10.6.
Supports deprecated Cluster releases 5.19.0,
6.18.0, and 7.2.0 that will become
unsupported in the following Container Cloud releases.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.13.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.13.0.
For the list of enhancements in the Cluster releases 7.3.0 and 5.20.0
that are supported by the Container Cloud release
2.13.0, see the Cluster releases (managed).
Configuration of multiple DHCP ranges for bare metal clusters¶
Implemented the possibility to configure multiple DHCP ranges using the
bare metal Subnet resources to facilitate multi-rack and other
types of distributed bare metal datacenter topologies.
The dnsmasq DHCP server used for host provisioning in Container Cloud now
supports working with multiple L2 segments through DHCP relay capable
network routers.
To configure DHCP ranges for dnsmasq, create the Subnet objects
tagged with the ipam/SVC-dhcp-range label while setting up subnets
for a managed cluster using Container Cloud CLI.
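A minimal sketch of such a Subnet object, assuming the ipam.mirantis.com/v1alpha1
API version and placeholder addresses:
apiVersion: ipam.mirantis.com/v1alpha1
kind: Subnet
metadata:
  name: dhcp-range-rack1
  namespace: <managedClusterProjectName>
  labels:
    ipam/SVC-dhcp-range: "1"
spec:
  cidr: 10.20.30.0/24
  gateway: 10.20.30.1
  includeRanges:
  - 10.20.30.100-10.20.30.200
Create the object with kubectl create -f <fileName> in the corresponding
project.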
Updated RAM requirements for management and regional clusters¶
To improve the Container Cloud performance and stability, increased RAM
requirements for management and regional clusters from 16 to 24 GB for all
supported cloud providers except bare metal, with the corresponding
flavor changes for the AWS and Azure providers:
AWS: updated the instance type from c5d.2xlarge to c5d.4xlarge
Azure: updated the VM size from Standard_F8s_v2 to Standard_F16s_v2
For the Container Cloud managed clusters, requirements remain the same.
The following issues have been addressed in the Mirantis Container Cloud
release 2.13.0 along with the Cluster releases 7.3.0 and 5.20.0.
[17705][Azure] Fixed the issue with the failure to deploy more than
62 Azure worker nodes.
[17938][bare metal] Fixed the issue with the bare metal host profile being
stuck in the matchprofile state during bootstrap.
[17960][bare metal] Fixed the issue with overflow of the Ironic storage
volume causing a StackLight alert to be triggered for the ironic-aio-pvc
volume filling up.
[17981][bare metal] Fixed the issue with failure to redeploy a bare metal
node with an mdadm-based raid1 enabled due to insufficient cleanup
of RAID devices.
[17359][regional cluster] Fixed the issue with failure to delete an
AWS-based regional cluster due to the issue with the cluster credential
deletion.
[18193][upgrade] Fixed the issue with failure to upgrade an Equinix Metal
or baremetal-based management cluster with the Ceph cluster not being ready.
[18076][upgrade] Fixed the issue with StackLight update failure on managed
cluster with logging disabled after changing NodeSelector.
[17771][StackLight] Fixed the issue with the Watchdog alert
not routing to Salesforce by default.
If you have applied the workaround as described in
StackLight known issues: 17771, revert it after updating
the Cluster releases to 5.20.0, 6.20.0, or 7.3.0:
This section lists known issues with workarounds for the Mirantis
Container Cloud release 2.13.0 including the Cluster releases 7.3.0, 6.19.0,
and 5.20.0.
After update of a management or managed cluster created using the Container
Cloud release earlier than 2.6.0, a bare metal host state is
Provisioned in the Container Cloud web UI while having the error state
in logs with the following message:
The issue is caused by the image URL pointing to an unavailable resource
due to the URI IP change during update. As a workaround, update URLs
for the bare metal host status and spec with the correct values
that use a stable DNS record as a host.
Workaround:
Note
In the commands below, we update master-2 as an example.
Replace it with the corresponding value to fit your deployment.
Exit Lens.
In a new terminal, configure access to the affected cluster.
If a RHEL license is removed before the related managed cluster is deleted,
the cluster deletion hangs with the following Machine object error:
Failed to remove finalizer from machine ...failed to get RHELLicense object
As a workaround, recreate the removed RHEL license object
with the same name using the Container Cloud web UI or API.
Warning
The kubectl apply command automatically saves the
applied data as plain text into the
kubectl.kubernetes.io/last-applied-configuration annotation of the
corresponding object. This may result in revealing sensitive data in this
annotation when creating or modifying the object.
Therefore, do not use kubectl apply on this object.
Use kubectl create, kubectl patch, or
kubectl edit instead.
If you used kubectl apply on this object, you
can remove the kubectl.kubernetes.io/last-applied-configuration
annotation from the object using kubectl edit.
[14080] Node leaves the cluster after IP address change¶
A vSphere-based management cluster bootstrap fails due to a node leaving the
cluster after an accidental IP address change.
The issue may affect a vSphere-based cluster only when IPAM
is not enabled and IP addresses assignment to the vSphere virtual machines
is done by a DHCP server present in the vSphere network.
By default, a DHCP server keeps lease of the IP address for 30 minutes.
Usually, a VM dhclient prolongs such lease by frequent DHCP requests
to the server before the lease period ends.
The DHCP prolongation request period is always less than the default lease time
on the DHCP server, so prolongation usually works.
But in case of network issues, for example, when dhclient from the
VM cannot reach the DHCP server, or the VM is being slowly powered on
for more than the lease time, such VM may lose its assigned IP address.
As a result, it obtains a new IP address.
Container Cloud does not support network reconfiguration after
the IP of the VM has been changed. Therefore, such issue may lead
to a VM leaving the cluster.
Symptoms:
One of the nodes is in the NodeNotReady or down state:
kubectl get nodes -o wide
docker node ls
The UCP Swarm manager logs on the healthy manager node contain the
following example error:
The output of the docker info command contains the following
example error:
Error: rpc error: code = Unknown desc = The swarm does not have a leader. \
It's possible that too few managers are online. \
Make sure more than half of the managers are online.
The UCP controller logs contain the following example error:
docker logs -f ucp-controller
"warning","msg":"Node State Active check error: \
Swarm Mode Manager health check error: \
info: Cannot connect to the Docker daemon at tcp://<node IP>:12376. \
Is the docker daemon running?
On the affected node, the IP address on the first interface eth0
does not match the IP address configured in Docker. Verify the
NodeAddress field in the output of the docker info command.
The following lines are present in /var/log/messages:
If there are several lines where the IP is different, the node is affected.
Workaround:
Select from the following options:
Bind IP addresses for all machines to their MAC addresses on the DHCP server
for the dedicated vSphere network. In this case, VMs receive only specified
IP addresses that never change.
Remove the Container Cloud node IPs from the IP range on the DHCP server
for the dedicated vSphere network and configure the first interface eth0
on VMs with a static IP address.
If a managed cluster is affected, redeploy it with IPAM enabled
for new machines to be created and IPs to be assigned properly.
LCM¶[18708] ‘Pending’ state of machines during a cluster deployment or attachment¶
During deployment of any Container Cloud cluster or attachment of an existing
MKE cluster that is not deployed by Container Cloud, the machines are stuck
in the Pending state with no lcmcluster-controller entries from the
lcm-controller logs except the following ones:
If any project is in the Terminating state, proceed to the next step.
Otherwise, further assess the cluster logs to identify the root cause
of the issue.
Clean up the project that is stuck in the Terminating state:
Identify the objects that are stuck in the project:
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3.
Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment,
Helm releases may get stuck in the FAILED or UNKNOWN state
although the corresponding machines statuses are Ready
in the Container Cloud web UI. For example, if the StackLight Helm release
fails, the links to its endpoints are grayed out in the web UI.
In the cluster status, providerStatus.helm.ready and
providerStatus.helm.releaseStatuses.<releaseName>.success are false.
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify the failed release has the UNKNOWN or FAILED status
in the HelmBundle object:
During configuration of an identity provider SAML using the
Add identity provider menu of the Keycloak admin console, the page
style breaks as well as the Save and Cancel buttons
disappear.
Workaround:
Log in to the Keycloak admin console.
In the sidebar menu, switch to the Master realm.
Navigate to Realm Settings > Themes.
In the Admin Console Theme drop-down menu, select
keycloak.
Click Save and refresh the browser window to apply the changes.
StackLight¶[19682] URLs in Salesforce alerts use HTTP for IAM with enabled TLS¶
Prometheus web UI URLs in StackLight notifications sent to Salesforce use a
wrong protocol: HTTP instead of HTTPS. The issue affects deployments with
TLS enabled for IAM.
The workaround is to manually change the URL protocol in the web browser.
Storage¶[20312] Creation of ceph-based PVs gets stuck in Pending state¶
The csi-rbdplugin-provisioner pod (csi-provisioner container) may show
constant retries attempting to create a PV if the csi-rbdplugin-provisioner
pod was scheduled and started on a node with no connectivity to the Ceph
storage. As a result, creation of a Ceph-based persistent volume (PV) may get
stuck in the Pending state.
As a workaround, manually specify the affinity or toleration rules for the
csi-rbdplugin-provisioner pod.
Workaround:
On the managed cluster, open the rook-ceph-operator-config map for
editing:
During deployment of a Ceph cluster, the RADOS Gateway (RGW) pod overrides
the global CA bundle located at /etc/pki/tls/certs with an incorrect
self-signed CA bundle. The issue affects only clusters with public
certificates.
Workaround:
Open the KaasCephCluster CR of a managed cluster for editing:
Substitute <managedClusterProjectName> with a corresponding value.
Select from the following options:
If you are using the GoDaddy certificates, in the
cephClusterSpec.objectStorage.rgw section, replace the
cacert parameters with your public CA certificate that already
contains both the root CA certificate and intermediate CA certificate:
[16300] ManageOsds works unpredictably on Rook 1.6.8 and Ceph 15.2.13¶
Affects only Container Cloud 2.11.0, 2.12.0, 2.13.0, and 2.13.1
Ceph LCM automatic operations such as Ceph OSD or Ceph node removal are
unstable for the new Rook 1.6.8 and Ceph 15.2.13 (Ceph Octopus) versions and
may cause data corruption. Therefore, manageOsds is disabled until further
notice.
As a workaround, to safely remove a Ceph OSD or node from a Ceph cluster,
perform the steps described in Remove Ceph OSD manually.
Upgrade¶[4288] Equinix and MOS managed clusters update failure¶
The Equinix Metal and MOS-based managed clusters may fail to update to the
latest Cluster release with kubelet being stuck and reporting authorization
errors.
The cluster is affected by the issue if you see the Failed to make webhook
authorizer request: context canceled error in the kubelet logs:
docker logs ucp-kubelet --since 5m 2>&1 | grep 'Failed to make webhook authorizer request: context canceled'
As a workaround, restart the ucp-kubelet container on the affected
node(s):
<affectedProjectName> is the Container Cloud project name where
the pods failed to run
<affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
Helm releases may get stuck in the PENDING_UPGRADE status
during a management or managed cluster upgrade. The HelmBundle Controller
cannot recover from this state and requires manual actions. The workaround
below describes the recovery process for the openstack-operator release
that got stuck during a managed cluster update. Use it as an example for other
Helm releases as required.
Container Cloud web UI¶[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.13.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Introduces support for the Cluster release 7.2.0
that is based on Mirantis Container Runtime 20.10.6 and
Mirantis Kubernetes Engine 3.4.5 with Kubernetes 1.20.
Introduces support for the Cluster release 5.19.0
that is based on Mirantis Kubernetes Engine 3.3.12 with Kubernetes 1.18
and Mirantis Container Runtime 20.10.6.
Supports deprecated Cluster releases 5.18.0,
6.18.0, and 7.1.0 that will become
unsupported in the following Container Cloud releases.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.12.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.12.0.
For the list of enhancements in the Cluster releases 7.2.0, 6.19.0, and
5.19.0 that are supported by the Container Cloud release
2.12.0, see the Cluster releases (managed).
General availability of the Microsoft Azure cloud provider¶
Introduced official support for the Microsoft Azure cloud provider,
including support for creating and operating of management, regional,
and managed clusters.
Container Cloud deployment on top of MOS Victoria¶
Implemented the possibility to deploy Container Cloud management, regional,
and managed clusters on top of Mirantis OpenStack for Kubernetes (MOS)
Victoria that is based on the Open vSwitch networking.
LVM or mdadm RAID support for bare metal provisioning¶
TECHNOLOGY PREVIEW
Added the Technology Preview support for configuration of software-based
Redundant Array of Independent Disks (RAID) using BareMetalHostProfile
to set up an LVM or mdadm-based RAID level 1 (raid1).
If required, you can further configure RAID in the same profile,
for example, to install a cluster operating system onto a RAID device.
You can configure RAID during a baremetal-based management or managed cluster
creation. RAID configuration on already provisioned bare metal machines
or on an existing cluster is not supported.
Caution
This feature is available as Technology Preview. Use such
configuration for testing and evaluation purposes only.
For the Technology Preview feature definition, refer to Technology Preview features.
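An illustrative BareMetalHostProfile fragment for an mdadm-based raid1 device.
The softRaidDevices section name and its fields are assumptions; verify them
against the BareMetalHostProfile schema of your release:
spec:
  softRaidDevices:
  - name: /dev/md0
    level: raid1
    devices:
    - /dev/sda1
    - /dev/sdb1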
Added the Preparing state to the provisioning workflow of bare metal hosts.
Bare Metal Operator inspects a bare metal host and moves it to the
Preparing state. In this state, the host becomes ready to be linked
to a bare metal machine.
Added the Transport Layer Security (TLS) configuration to all Container Cloud
endpoints for all supported cloud providers. The Container Cloud web UI and
StackLight endpoints are now available through TLS with self-signed
certificates generated by the Container Cloud provider.
If required, you can also add your own TLS certificates to the Container Cloud
web UI and Keycloak.
Caution
After the Container Cloud upgrade from 2.11.0 to 2.12.0, all
Container Cloud endpoints are available only through HTTPS.
Migration of iam-proxy from Louketo Proxy to OAuth2 Proxy¶
Migrated iam-proxy from the deprecated Louketo Proxy, formerly known as
keycloak-proxy, to OAuth2 Proxy.
To apply the migration, all iam-proxy services in the StackLight namespace
are restarted during a management cluster upgrade or managed cluster update.
This causes a short downtime for the web UI access to StackLight services,
although all services themselves, such as Kibana or Grafana, continue working.
Backup configuration for a MariaDB database on a management cluster¶
Implemented the possibility to customize the default backup configuration
for a MariaDB database on a management cluster. You can customize the default
configuration either during a management cluster bootstrap or on an existing
management cluster. The Kubernetes cron job responsible for the MariaDB backup
is enabled by default for the OpenStack and AWS cloud providers and is disabled
for other supported providers.
On top of continuous improvements delivered to the existing Container Cloud
guides, added a procedure on how to back up and restore an OpenStack or
AWS-based management cluster. The procedure consists of the MariaDB and MKE
backup and restore steps.
The following issues have been addressed in the Mirantis Container Cloud
release 2.12.0 along with the Cluster releases 7.2.0, 6.19.0, and 5.19.0.
[16718][Equinix Metal] Fixed the issue with the Equinix Metal provider
failing to create machines with an SSH key error if an Equinix Metal based
cluster was being deployed in an Equinix Metal project with no SSH keys.
[17118][bare metal] Fixed the issue with failure to add a new machine
to a baremetal-based managed cluster after the management cluster upgrade.
[16959][OpenStack] Fixed the issue with failure to create a proxy-based
OpenStack regional cluster due to the issue with the proxy secret creation.
[13385][IAM] Fixed the issue with MariaDB pods failing to start after
MariaDB blocked itself during the State Snapshot Transfers sync.
[8367][LCM] Fixed the issue with joining etcd from a new node to an existing
etcd cluster. The issue caused the new managed node to hang in the Deploy
state when adding it to a managed cluster.
[16873][bootstrap] Fixed the issue with a management cluster bootstrap
failing with failed to establish connection with tiller error due to
kind 0.9.0 delivered with the bootstrap script being not compatible with
the latest Ubuntu 18.04 image that requires kind 0.11.1.
[16964][Ceph] Fixed the issue with a bare metal or Equinix Metal management
cluster upgrade getting stuck and then failing with some Ceph daemons being
stuck on upgrade to Octopus and with the insecure global_id reclaim
health warning in Ceph logs.
[16843][StackLight] Fixed the issue causing inability to override default
route matchers for Salesforce notifier.
If you have applied the workaround as described in
StackLight known issues: 16843 after updating the
cluster releases to 5.19.0, 7.2.0, or 6.19.0 and if you need to define custom
matchers, replace the deprecated match and match_re parameters with
matchers as required. For details, see Deprecation notes and
StackLight configuration parameters.
[17477][Update][StackLight] Fixed the issue with StackLight in HA mode
placed on controller nodes being not deployed or cluster update being
blocked. Once you update your Mirantis OpenStack for Kubernetes cluster from
the Cluster release 6.18.0 to 6.19.0, roll back the workaround applied as
described in Upgrade known issues: 17477:
Remove stacklight labels from worker nodes. Wait for the labels to be
removed.
Remove the custom nodeSelector section from the cluster spec.
[16777][Update][StackLight] Fixed the issue causing the Cluster release
update from 7.0.0 to 7.1.0 to fail due to failed Patroni pod. The issue
affected the Container Cloud management, regional, or managed cluster
of any cloud provider.
[17069][Update][Ceph] Fixed the issue with upgrade of a bare metal or
Equinix Metal based management or managed cluster failing with the
Failed to configure Ceph cluster error due to different versions of the
rook-ceph-osd deployments.
[17007][Update] Fixed the issue with the false-positive
release: “squid-proxy” not found error during a management cluster upgrade
of any supported cloud provider except vSphere.
This section lists known issues with workarounds for the Mirantis
Container Cloud release 2.12.0 including the Cluster releases 7.2.0, 6.19.0,
and 5.19.0.
AWS¶[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments.
Moving forward, the workaround for this issue will be moved from
Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state
with the pod has unbound immediate PersistentVolumeClaims and
node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data
can be rebuilt from replicas. If you have a non-HA deployment,
back up any existing data before proceeding,
since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts
of the affected pods:
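For example, a minimal sketch that lists the claims; the stacklight namespace and the elasticsearch filter are assumptions for illustration:
kubectl get pvc -n stacklight | grep elasticsearch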
Azure¶[17705] Failure to deploy more than 62 Azure worker nodes¶
Fixed in 2.13.0
The default value of the Ports per instance load balancer
outbound NAT setting, which is 1024, prevents deploying
more than 62 Azure worker nodes on a managed cluster. To work around the issue,
set the Ports per instance parameter to 256.
Workaround:
Log in to the Azure portal.
Navigate to Home > Load Balancing.
Find and click the load balancer called mcc-<uniqueClusterID>.
You can obtain <uniqueClusterID> in the Cluster info field
in the Container Cloud web UI.
In the load balancer Settings left-side menu, click
Outbound rules > OutboundNATAllProtocols.
In the Outbound ports > Choose by menu, select
Ports per instance.
In the Ports per instance field, replace the default
1024 value with 256.
Click Save to apply the new setting.
Bare metal¶[18752] Bare metal hosts in ‘provisioned registration error’ state after update¶
After update of a management or managed cluster created using the Container
Cloud release earlier than 2.6.0, a bare metal host state is
Provisioned in the Container Cloud web UI while having the error state
in logs with the following message:
The issue is caused by the image URL pointing to an unavailable resource
due to the URI IP change during update. As a workaround, update URLs
for the bare metal host status and spec with the correct values
that use a stable DNS record as a host.
Workaround:
Note
In the commands below, we update master-2 as an example.
Replace it with the corresponding value to fit your deployment.
Exit Lens.
In a new terminal, configure access to the affected cluster.
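For example, a minimal sketch; the kubeconfig path and the bmh short name for the BareMetalHost resource are assumptions:
export KUBECONFIG=<pathToManagementClusterKubeconfig>
kubectl get bmh -n <projectName>  # verify that the affected bare metal host is listed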
On the baremetal-based management clusters with the Container Cloud version
2.12.0 or earlier, the storage volume used by Ironic can run out of free space.
As a result, a StackLight alert is triggered for the ironic-aio-pvc volume
filling up.
Symptoms
One or more of the following symptoms are observed:
The StackLight KubePersistentVolumeUsageCritical alert is firing
for the volume ironic-aio-pvc.
The ironic and dnsmasq Deployments are not in the OK status:
kubectl -n kaas get deployments
One or multiple ironic and dnsmasq pods fail to start:
For dnsmasq:
kubectl get pods -n kaas -o wide | grep dnsmasq
If the number of ready containers for the pod is not 2/2,
the management cluster can be affected by the issue.
For ironic:
kubectl get pods -n kaas -o wide | grep ironic
If the number of ready containers for the pod is not 6/6,
the management cluster can be affected by the issue.
The free space on a volume is less than 10%.
To verify space usage on a volume:
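A minimal sketch, assuming the Ironic pod and container names below:
kubectl -n kaas exec -it <ironicPodName> -c ironic -- df -h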
A vSphere-based management cluster bootstrap fails due to a node leaving the
cluster after an accidental IP address change.
The issue may affect a vSphere-based cluster only when IPAM
is not enabled and IP addresses assignment to the vSphere virtual machines
is done by a DHCP server present in the vSphere network.
By default, a DHCP server keeps lease of the IP address for 30 minutes.
Usually, a VM dhclient prolongs such lease by frequent DHCP requests
to the server before the lease period ends.
The DHCP prolongation request period is always less than the default lease time
on the DHCP server, so prolongation usually works.
But in case of network issues, for example, when dhclient from the
VM cannot reach the DHCP server, or the VM is being slowly powered on
for more than the lease time, such VM may lose its assigned IP address.
As a result, it obtains a new IP address.
Container Cloud does not support network reconfiguration after
the IP of the VM has been changed. Therefore, such issue may lead
to a VM leaving the cluster.
Symptoms:
One of the nodes is in the NodeNotReady or down state:
kubectl get nodes -o wide
docker node ls
The UCP Swarm manager logs on the healthy manager node contain the
following example error:
The output of the docker info command contains the following
example error:
Error: rpc error: code = Unknown desc = The swarm does not have a leader. \
It's possible that too few managers are online. \
Make sure more than half of the managers are online.
The UCP controller logs contain the following example error:
docker logs -f ucp-controller
"warning","msg":"Node State Active check error: \
Swarm Mode Manager health check error: \
info: Cannot connect to the Docker daemon at tcp://<node IP>:12376. \
Is the docker daemon running?"
On the affected node, the IP address on the first interface eth0
does not match the IP address configured in Docker. Verify the
NodeAddress field in the output of the docker info command.
The following lines are present in /var/log/messages:
If there are several lines where the IP is different, the node is affected.
Workaround:
Select from the following options:
Bind IP addresses for all machines to their MAC addresses on the DHCP server
for the dedicated vSphere network. In this case, VMs receive only specified
IP addresses that never change.
Remove the Container Cloud node IPs from the IP range on the DHCP server
for the dedicated vSphere network and configure the first interface eth0
on VMs with a static IP address.
If a managed cluster is affected, redeploy it with IPAM enabled
for new machines to be created and IPs to be assigned properly.
LCM¶[16146] Stuck kubelet on the Cluster release 5.x.x series¶
Occasionally, kubelet may get stuck on the Cluster release 5.x.x series
with different errors in the ucp-kubelet containers leading to the nodes
failures. The following error occurs every time when accessing
the Kubernetes API server:
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3.
Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment,
Helm releases may get stuck in the FAILED or UNKNOWN state
although the corresponding machines statuses are Ready
in the Container Cloud web UI. For example, if the StackLight Helm release
fails, the links to its endpoints are grayed out in the web UI.
In the cluster status, providerStatus.helm.ready and
providerStatus.helm.releaseStatuses.<releaseName>.success are false.
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify the failed release has the UNKNOWN or FAILED status
in the HelmBundle object:
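For example, a sketch that inspects the HelmBundle object; the object name and project namespace are placeholders, and the grep filter is only for convenience:
kubectl get helmbundle <clusterName> -n <projectName> -o yaml | grep -A 4 'name: stacklight'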
During configuration of an identity provider SAML using the
Add identity provider menu of the Keycloak admin console, the page
style breaks as well as the Save and Cancel buttons
disappear.
Workaround:
Log in to the Keycloak admin console.
In the sidebar menu, switch to the Master realm.
Navigate to Realm Settings > Themes.
In the Admin Console Theme drop-down menu, select
keycloak.
Click Save and refresh the browser window to apply the changes.
StackLight¶[17771] Watchdog alert missing in Salesforce route¶
Fixed in 2.13.0
The Watchdog alert is not routed to Salesforce by default.
Note
After applying the workaround, you may notice the following warning
message. It is expected and does not affect configuration rendering:
Prometheus web UI URLs in StackLight notifications sent to Salesforce use a
wrong protocol: HTTP instead of HTTPS. The issue affects deployments with
TLS enabled for IAM.
The workaround is to manually change the URL protocol in the web browser.
Storage¶[20312] Creation of ceph-based PVs gets stuck in Pending state¶
The csi-rbdplugin-provisioner pod (csi-provisioner container) may show
constant retries attempting to create a PV if the csi-rbdplugin-provisioner
pod was scheduled and started on a node with no connectivity to the Ceph
storage. As a result, creation of a Ceph-based persistent volume (PV) may get
stuck in the Pending state.
As a workaround, manually specify the affinity or toleration rules for the
csi-rbdplugin-provisioner pod.
Workaround:
On the managed cluster, open the rook-ceph-operator-config map for
editing:
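For example, assuming the default rook-ceph namespace:
kubectl -n rook-ceph edit configmap rook-ceph-operator-config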
During deployment of a Ceph cluster, the RADOS Gateway (RGW) pod overrides
the global CA bundle located at /etc/pki/tls/certs with an incorrect
self-signed CA bundle. The issue affects only clusters with public
certificates.
Workaround:
Open the KaasCephCluster CR of a managed cluster for editing:
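For example, a sketch assuming the kaascephcluster resource name registered by the kaas.mirantis.com CRD:
kubectl edit kaascephcluster -n <managedClusterProjectName>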
Substitute <managedClusterProjectName> with a corresponding value.
Select from the following options:
If you are using the GoDaddy certificates, in the
cephClusterSpec.objectStorage.rgw section, replace the
cacert parameters with your public CA certificate that already
contains both the root CA certificate and intermediate CA certificate:
[16300] ManageOsds works unpredictably on Rook 1.6.8 and Ceph 15.2.13¶
Affects only Container Cloud 2.11.0, 2.12.0, 2.13.0, and 2.13.1
Ceph LCM automatic operations such as Ceph OSD or Ceph node removal are
unstable for the new Rook 1.6.8 and Ceph 15.2.13 (Ceph Octopus) versions and
may cause data corruption. Therefore, manageOsds is disabled until further
notice.
As a workaround, to safely remove a Ceph OSD or node from a Ceph cluster,
perform the steps described in Remove Ceph OSD manually.
Regional cluster¶[17359] Deletion of AWS-based regional cluster credential fails¶
Fixed in 2.13.0
During deletion of an AWS-based regional cluster, deletion of the cluster
credential fails with error deleting regional credential: error waiting for
credential deletion: timed out waiting for the condition.
Workaround:
Change the directory to kaas-bootstrap.
Scale up the aws-credentials-controller-aws-credentials-controller
deployment:
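A minimal sketch, assuming the deployment resides in the kaas namespace:
kubectl -n kaas scale deployment aws-credentials-controller-aws-credentials-controller --replicas=1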
The Equinix Metal and MOS-based managed clusters may fail to update to the
latest Cluster release with kubelet being stuck and reporting authorization
errors.
The cluster is affected by the issue if you see the Failed to make webhook
authorizer request: context canceled error in the kubelet logs:
docker logs ucp-kubelet --since 5m 2>&1 | grep 'Failed to make webhook authorizer request: context canceled'
As a workaround, restart the ucp-kubelet container on the affected
node(s):
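For example, over SSH on each affected node:
docker restart ucp-kubelet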
<affectedProjectName> is the Container Cloud project name where
the pods failed to run
<affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
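A sketch assuming the Rook default rook-ceph namespace and the app=csi-rbdplugin label:
kubectl -n rook-ceph get pods -l app=csi-rbdplugin -o wide | grep <nodeName>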
Helm releases may get stuck in the PENDING_UPGRADE status
during a management or managed cluster upgrade. The HelmBundle Controller
cannot recover from this state and requires manual actions. The workaround
below describes the recovery process for the openstack-operator release
that got stuck during a managed cluster update. Use it as an example for other
Helm releases as required.
On a managed cluster with logging disabled, changing NodeSelector can
cause StackLight update failure with the following message in the StackLight
Helm Controller logs:
Upgrade "stacklight" failed:Job.batch "stacklight-delete-logging-pvcs-*" is invalid: spec.template: Invalid value:...
As a workaround, disable the stacklight-delete-logging-pvcs-* job.
Container Cloud web UI¶[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.12.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Introduces support for the Cluster release 7.1.0
that is based on Mirantis Container Runtime 20.10.5 and
Mirantis Kubernetes Engine 3.4.0 with Kubernetes 1.20.
Introduces support for the Cluster release 5.18.0
that is based on Mirantis Kubernetes Engine 3.3.6 with Kubernetes 1.18
and Mirantis Container Runtime 20.10.5.
Supports deprecated Cluster releases 5.17.0,
6.16.0, and 7.0.0 that will become
unsupported in the following Container Cloud releases.
Supports the Cluster release 5.11.0 only for attachment
of existing MKE 3.3.4 clusters. For the deployment of new or attachment
of existing MKE 3.3.6 clusters, the latest available Cluster release is used.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
Caution
Before upgrading an existing managed cluster with StackLight
deployed in HA mode to the latest Cluster release, add the
StackLight node label to at least 3 worker machines
as described in Upgrade managed clusters with StackLight deployed in HA mode.
Otherwise, the cluster upgrade will fail.
This section outlines release notes for the Container Cloud release 2.11.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.11.0.
For the list of enhancements in the Cluster releases 7.1.0, 6.18.0, and
5.18.0 that are supported by the Container Cloud release
2.11.0, see the Cluster releases (managed).
Introduced the Technology Preview support for the Microsoft Azure
cloud provider, including support for creating and operating of
management, regional, and managed clusters.
RHEL 7.9 bootstrap node for the vSphere-based provider¶
Implemented the capability to bootstrap the vSphere provider clusters on the
bootstrap node that is based on RHEL 7.9.
Validation labels for the vSphere-based VM templates¶
Implemented validation labels for the vSphere-based VM templates in the
Container Cloud web UI. If a VM template was initially created using the
built-in Packer mechanism, the Container Cloud version has a green label on the
right side of the drop-down list with VM templates. Otherwise, a template
is marked with the Unknown label.
Mirantis recommends using only green-labeled templates
for production deployments.
Automatic migration of Docker data and LVP volumes to NVMe on AWS clusters¶
Implemented automatic migration of Docker data located at /var/lib/docker
and local provisioner volumes from existing EBS to local NVMe SSDs during
the AWS-based management and managed clusters upgrade. On new clusters,
the /var/lib/docker Docker data is now located on local NVMe SSDs
by default.
The migration allows moving heavy workloads such as etcd and MariaDB
to local NVMe SSDs that significantly improves cluster performance.
Upgraded all core Helm releases in the ClusterRelease and KaasRelease
objects from v2 to v3. Switching of the remaining Helm releases to v3 will
be implemented in one of the following Container Cloud releases.
Bond interfaces for baremetal-based management clusters¶
Added the possibility to configure L2 templates for the baremetal-based
management cluster to set up a bond network interface to the PXE/Management
network.
Apply this configuration to the bootstrap templates
before you run the bootstrap script to deploy the management cluster.
Caution
Using this configuration requires that every host in your
management cluster has at least two physical interfaces.
Connect at least two interfaces per host to an Ethernet switch
that supports Link Aggregation Control Protocol (LACP)
port groups and LACP fallback.
Configure an LACP group on the ports connected
to the NICs of a host.
Configure the LACP fallback on the port group to ensure that
the host can boot over the PXE network before the bond interface
is set up on the host operating system.
Configure server BIOS for both NICs of a bond to be PXE-enabled.
If the server does not support booting from multiple NICs,
configure the port of the LACP group that is connected to the
PXE-enabled NIC of a server to be the primary port.
With this setting, the port becomes active in the fallback mode.
Equinix Metal capacity labels for machines in web UI¶
Implemented the verification mechanism for the actual capacity of the
Equinix Metal facilities before machines deployment. Now, you can see
the following labels in the Equinix Metal Create a machine wizard
of the Container Cloud web UI:
Normal - the facility has a lot of available machines.
Prioritize this machine type over others.
Limited - the facility has a limited number of machines.
Do not request many machines of this type.
Unknown - Container Cloud cannot fetch information
about the capacity level since the feature is disabled.
On top of continuous improvements delivered to the existing Container Cloud
guides, added a procedure on how to update the Keycloak IP address
on bare metal clusters.
The following issues have been addressed in the Mirantis Container Cloud
release 2.11.0 along with the Cluster releases 7.1.0, 6.18.0, and 5.18.0.
For more issues addressed for the Cluster release 6.18.0, see also
addressed issues 2.10.0.
[15698][vSphere] Fixed the issue with a load balancer virtual IP address
(VIP) being assigned to each manager node on any type of the vSphere-based
cluster.
[7573][Ceph] To avoid the Rook community issue with updating Rook to version
1.6, added the rgw_data_log_backing configuration option set to omap
by default.
[10050][Ceph] Fixed the issue with Ceph OSD pod being stuck
in the CrashLoopBackOff state due to the Ceph OSD authorization key
failing to be created properly after disk replacement if a custom
BareMetalHostProfile was used.
[16233][Ceph][Upgrade] Fixed the issue with ironic and dnsmasq pods
failing during a baremetal-based management cluster upgrade due to Ceph
not unmounting RBD volumes.
[7655][BM] Fixed the issue with a bare metal cluster being deployed
successfully but with runtime errors in the IpamHost object
if an L2 template was configured incorrectly.
[15348][StackLight] Fixed the issue with some panels of the
Alertmanager and Prometheus Grafana dashboards not
displaying data due to an invalid query.
[15834][StackLight] Removed the CPU resource limit from the
elasticsearch-curator container to avoid issues with the
CPUThrottlingHigh alert false-positively firing for Elasticsearch
Curator.
[16141][StackLight] Fixed the issue with the Alertmanager pod getting stuck
in CrashLoopBackOff during upgrade of a management, regional, or managed
cluster and thus causing upgrade failure with the Loading configuration file
failed error message in logs.
[15766][StackLight][Upgrade] Fixed the issue with management or regional
cluster upgrade failure from version 2.9.0 to 2.10.0 and managed cluster
from 5.16.0 to 5.17.0 with the Cannot evict pod error for the
patroni-12-0, patroni-12-1, or patroni-12-2 pod.
[16398][StackLight] Fixed the issue with inability to set require_tls to
false for Alertmanager email notifications.
[13303] [LCM] Fixed the issue with managed clusters update from the Cluster
release 6.12.0 to 6.14.0 failing with worker nodes being stuck in the
Deploy state with the Network is unreachable error.
[13845] [LCM] Fixed the issue with the LCM Agent upgrade failing with x509
error during managed clusters update from the Cluster release 6.12.0 to
6.14.0.
This section lists known issues with workarounds for the Mirantis
Container Cloud release 2.11.0 including the Cluster releases 7.1.0, 6.18.0,
and 5.18.0.
Note
This section also outlines still valid known issues
from previous Container Cloud releases.
AWS¶[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments.
Moving forward, the workaround for this issue will be moved from
Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state
with the pod has unbound immediate PersistentVolumeClaims and
node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data
can be rebuilt from replicas. If you have a non-HA deployment,
back up any existing data before proceeding,
since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts
of the affected pods:
Equinix Metal¶[16718] Equinix Metal provider fails to create machines with SSH keys error¶
Fixed in 2.12.0
If an Equinix Metal based cluster is being deployed in an Equinix Metal
project with no SSH keys, the Equinix Metal provider fails
to create machines with the following error:
Click Add New Key and add details of the newly created SSH key.
Click Add.
Restart the cluster deployment.
Bare metal¶[17118] Failure to add a new machine to cluster¶
Fixed in 2.12.0
Adding a new machine to a baremetal-based
managed cluster may fail after the baremetal-based management cluster upgrade.
The issue occurs because the PXE boot is not working for the new node.
In this case, file /volume/tftpboot/ipxe.efi not found logs appear on
dnsmasq-tftp.
Workaround:
Log in to a local machine where your management cluster kubeconfig is
located and where kubectl is installed.
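The workaround then scales the Ironic deployment down and back up, as in the following sketch:
kubectl -n kaas scale deployments/ironic --replicas=0
kubectl -n kaas scale deployments/ironic --replicas=1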
An OpenStack-based regional cluster being deployed using proxy fails with the
Not ready objects: not ready: statefulSets: kaas/mcc-cache got 0/1 replicas
error message due to the issue with the proxy secret creation.
vSphere¶[14458] Failure to create a container for pod: cannot allocate memory¶
Fixed in 2.9.0 for new clusters
Newly created pods may fail to run and have the CrashLoopBackOff status
on long-living Container Cloud clusters deployed on RHEL 7.8 using the VMware
vSphere provider. The following is an example output of the
kubectl describe pod <pod-name> -n <projectName> command:
State:        Waiting
Reason:       CrashLoopBackOff
Last State:   Terminated
Reason:       ContainerCannotRun
Message:      OCI runtime create failed: container_linux.go:349:
starting container process caused "process_linux.go:297: applying cgroup configuration for process caused "mkdir /sys/fs/cgroup/memory/kubepods/burstable/<pod-id>/<container-id>":
cannot allocate memory": unknown
The issue occurs due to the
Kubernetes and
Docker community issues.
According to the RedHat solution,
the workaround is to disable the kernel memory accounting feature
by appending cgroup.memory=nokmem to the kernel command line.
Note
The workaround below applies to the existing clusters only.
The issue is resolved for new Container Cloud 2.9.0 deployments
since the workaround below automatically applies to the VM template
built during the vSphere-based management cluster bootstrap.
Apply the following workaround on each machine of the affected
cluster.
Workaround
SSH to any machine of the affected cluster using mcc-user and the SSH
key provided during the cluster creation to proceed as the root user.
In /etc/default/grub, set cgroup.memory=nokmem for
GRUB_CMDLINE_LINUX.
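A minimal sketch of the resulting configuration and the follow-up commands, assuming a BIOS-based RHEL system with the standard GRUB2 configuration path:
GRUB_CMDLINE_LINUX="<existing parameters> cgroup.memory=nokmem"
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot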
A vSphere-based management cluster bootstrap fails due to a node leaving the
cluster after an accidental IP address change.
The issue may affect a vSphere-based cluster only when IPAM
is not enabled and IP addresses assignment to the vSphere virtual machines
is done by a DHCP server present in the vSphere network.
By default, a DHCP server keeps lease of the IP address for 30 minutes.
Usually, a VM dhclient prolongs such lease by frequent DHCP requests
to the server before the lease period ends.
The DHCP prolongation request period is always less than the default lease time
on the DHCP server, so prolongation usually works.
But in case of network issues, for example, when dhclient from the
VM cannot reach the DHCP server, or the VM is being slowly powered on
for more than the lease time, such VM may lose its assigned IP address.
As a result, it obtains a new IP address.
Container Cloud does not support network reconfiguration after
the IP of the VM has been changed. Therefore, such issue may lead
to a VM leaving the cluster.
Symptoms:
One of the nodes is in the NodeNotReady or down state:
kubectl get nodes -o wide
docker node ls
The UCP Swarm manager logs on the healthy manager node contain the
following example error:
The output of the docker info command contains the following
example error:
Error: rpc error: code = Unknown desc = The swarm does not have a leader. \
It's possible that too few managers are online. \
Make sure more than half of the managers are online.
The UCP controller logs contain the following example error:
docker logs -f ucp-controller
"warning","msg":"Node State Active check error: \
Swarm Mode Manager health check error: \
info: Cannot connect to the Docker daemon at tcp://<node IP>:12376. \
Is the docker daemon running?"
On the affected node, the IP address on the first interface eth0
does not match the IP address configured in Docker. Verify the
NodeAddress field in the output of the docker info command.
The following lines are present in /var/log/messages:
If there are several lines where the IP is different, the node is affected.
Workaround:
Select from the following options:
Bind IP addresses for all machines to their MAC addresses on the DHCP server
for the dedicated vSphere network. In this case, VMs receive only specified
IP addresses that never change.
Remove the Container Cloud node IPs from the IP range on the DHCP server
for the dedicated vSphere network and configure the first interface eth0
on VMs with a static IP address.
If a managed cluster is affected, redeploy it with IPAM enabled
for new machines to be created and IPs to be assigned properly.
LCM¶[16146] Stuck kubelet on the Cluster release 5.x.x series¶
Occasionally, kubelet may get stuck on the Cluster release 5.x.x series
with different errors in the ucp-kubelet containers leading to the nodes
failures. The following error occurs every time when accessing
the Kubernetes API server:
[8367] Adding of a new manager node to a managed cluster hangs on Deploy stage¶
Fixed in 2.12.0
Adding of a new manager node to a managed cluster may hang due to
issues with joining etcd from a new node to the existing etcd cluster.
The new manager node hangs in the Deploy stage.
Symptoms:
The Ansible run tries executing the Wait for Docker UCP to be accessible
step and fails with the following error message:
To determine the etcd leader, run on any manager node:
docker exec -it ucp-kv sh
# From the inside of the container:
ETCDCTL_API=3 etcdctl -w table --endpoints=https://<1st manager IP>:12379,https://<2nd manager IP>:12379,https://<3rd manager IP>:12379 endpoint status
To verify logs on the leader node:
docker logs ucp-kv
Root cause:
In case of an unlucky network partition, the leader may lose quorum
and members are not able to perform the election. For more details, see
Official etcd documentation: Learning, figure 5.
Workaround:
Restart etcd on the leader node:
docker rm -f ucp-kv
Wait several minutes until the etcd cluster starts and reconciles.
The deployment of the new manager node will proceed and it will join
the etcd cluster. After that, other MKE components will be configured and
the node deployment will be finished successfully.
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3.
Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment,
Helm releases may get stuck in the FAILED or UNKNOWN state
although the corresponding machines statuses are Ready
in the Container Cloud web UI. For example, if the StackLight Helm release
fails, the links to its endpoints are grayed out in the web UI.
In the cluster status, providerStatus.helm.ready and
providerStatus.helm.releaseStatuses.<releaseName>.success are false.
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify the failed release has the UNKNOWN or FAILED status
in the HelmBundle object:
IAM¶[13385] MariaDB pods fail to start after SST sync¶
Fixed in 2.12.0
The MariaDB pods fail to start after MariaDB blocks itself during the
State Snapshot Transfers sync.
Workaround:
Verify the failed pod readiness:
kubectl describe pod -n kaas <failedMariadbPodName>
If the readiness probe failed with the WSREP not synced message,
proceed to the next step. Otherwise, assess the MariaDB pod logs
to identify the failure root cause.
During configuration of an identity provider SAML using the
Add identity provider menu of the Keycloak admin console, the page
style breaks as well as the Save and Cancel buttons
disappear.
Workaround:
Log in to the Keycloak admin console.
In the sidebar menu, switch to the Master realm.
Navigate to Realm Settings > Themes.
In the Admin Console Theme drop-down menu, select
keycloak.
Click Save and refresh the browser window to apply the changes.
StackLight¶[16843] Inability to override default route matchers for Salesforce notifier¶
Fixed in 2.12.0
It may be impossible to override the default route matchers for Salesforce
notifier.
Note
After applying the workaround, you may notice the following warning
message. It is expected and does not affect configuration rendering:
Storage¶[16300] ManageOsds works unpredictably on Rook 1.6.8 and Ceph 15.2.13¶
Affects only Container Cloud 2.11.0, 2.12.0, 2.13.0, and 2.13.1
Ceph LCM automatic operations such as Ceph OSD or Ceph node removal are
unstable for the new Rook 1.6.8 and Ceph 15.2.13 (Ceph Octopus) versions and
may cause data corruption. Therefore, manageOsds is disabled until further
notice.
As a workaround, to safely remove a Ceph OSD or node from a Ceph cluster,
perform the steps described in Remove Ceph OSD manually.
Bootstrap¶[16873] Bootstrap fails with ‘failed to establish connection with tiller’ error¶
Fixed in 2.12.0
If the latest Ubuntu 18.04 image, for example, with kernel 4.15.0-153-generic,
is installed on the bootstrap node, a management cluster bootstrap fails
during the setup of the Kubernetes cluster by kind.
The issue occurs since the kind version 0.9.0 delivered with the bootstrap
script is not compatible with the latest Ubuntu 18.04 image that requires
kind version 0.11.1.
To verify that the bootstrap node is affected by the issue:
In the bootstrap script stdout, verify the connection to Tiller.
Example of system response extract on an affected bootstrap node:
Upgrade¶[17477] StackLight in HA mode is not deployed or cluster update is blocked¶
Fixed in 2.12.0
The deployment of new managed clusters using the Cluster release 6.18.0
with StackLight enabled in the HA mode on control plane nodes does not have
StackLight deployed. The update of existing clusters with such StackLight
configuration that were created using the Cluster release 6.16.0 is blocked
with the following error message:
If you faced the issue during a managed cluster deployment, skip
this step.
If you faced the issue during a managed cluster update, wait until all
StackLight components resources are recreated on the target nodes
with updated node selectors.
In the Container Cloud web UI, add a fake StackLight label to any 3 worker
nodes to satisfy the deployment requirement as described in
Create a machine using web UI. Eventually, StackLight will be still placed on the
target nodes with the forcedRole:stacklight label.
Once done, the StackLight deployment or update proceeds.
[17412] Cluster upgrade fails on the KaaSCephCluster CRD update¶
An upgrade of a bare metal or Equinix Metal based management cluster
originally deployed using the Container Cloud release earlier than 2.8.0
fails with the following error message:
Upgrade"kaas-public-api"failed:\
cannotpatch"kaascephclusters.kaas.mirantis.com"withkind\
CustomResourceDefinition:CustomResourceDefinition.apiextensions.k8s.io\
kaascephclusters.kaas.mirantis.com" is invalid: \spec.preserveUnknownFields: Invalid value: true: \must be false in order to use defaults in the schema
Workaround:
Change the preserveUnknownFields value for the KaaSCephCluster CRD
to false:
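For example, a sketch using a merge patch on the CRD named in the error message:
kubectl patch crd kaascephclusters.kaas.mirantis.com --type merge -p '{"spec":{"preserveUnknownFields":false}}'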
[17007] False-positive ‘release: “squid-proxy” not found’ error¶
Fixed in 2.12.0
During a management cluster upgrade of any supported cloud provider except
vSphere, you may notice the following false-positive messages for the
squid-proxy Helm release that is disabled in Container Cloud 2.11.0:
Management cluster upgrade may get stuck and then fail with the following error
message: ClusterWorkloadLocks in cluster default/kaas-mgmt are still active -
ceph-clusterworkloadlock.
To verify that the cluster is affected:
Enter the ceph-tools pod.
Verify that some Ceph daemons were not upgraded to Octopus:
ceph versions
Run ceph -s and verify that the output contains the following
health warning:
[16777] Cluster update fails due to Patroni being not ready¶
Fixed in 2.12.0
An update of the Container Cloud management, regional, or managed cluster
of any cloud provider type from the Cluster release 7.0.0 to 7.1.0
fails due to the failed Patroni pod.
As a workaround, increase the default resource requests and limits
for PostgreSQL as follows:
<affectedProjectName> is the Container Cloud project name where
the pods failed to run
<affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
Helm releases may get stuck in the PENDING_UPGRADE status
during a management or managed cluster upgrade. The HelmBundle Controller
cannot recover from this state and requires manual actions. The workaround
below describes the recovery process for the openstack-operator release
that got stuck during a managed cluster update. Use it as an example for other
Helm releases as required.
On a managed cluster with logging disabled, changing NodeSelector can
cause StackLight update failure with the following message in the StackLight
Helm Controller logs:
Upgrade "stacklight" failed:Job.batch "stacklight-delete-logging-pvcs-*" is invalid: spec.template: Invalid value:...
As a workaround, disable the stacklight-delete-logging-pvcs-* job.
Container Cloud web UI¶[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.11.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Upgrade managed clusters with StackLight deployed in HA mode¶
Starting from Container Cloud 2.11.0, the StackLight node label
is required for managed clusters deployed in HA mode.
The StackLight node label allows running StackLight components on
specific worker nodes with corresponding resources.
Before upgrading an existing managed cluster with StackLight deployed
in HA mode to the latest Cluster release, add the StackLight node
label to at least 3 worker machines. Otherwise, the cluster upgrade will fail.
To add the StackLight node label to a worker machine:
Log in to the Container Cloud web UI.
On the Machines page, click the More action icon
in the last column of the required machine field and select
Configure machine.
In the window that opens, select the StackLight node label.
Caution
If your managed cluster contains more than 3 worker nodes,
select from the following options:
If you have a small cluster, add the StackLight
label to all worker nodes.
If you have a large cluster, identify the exact nodes that run
StackLight and add the label to these specific nodes only.
Otherwise, some of the StackLight components may become
inaccessible after the cluster update.
To identify the worker machines where StackLight is deployed:
Log in to the Container Cloud web UI.
Download the required cluster kubeconfig:
On the Clusters page, click the More
action icon in the last column of the required cluster
and select Download Kubeconfig.
Not recommended. Select Offline Token
to generate an offline IAM token. Otherwise, for security
reasons, the kubeconfig token expires every 30 minutes
of the Container Cloud API idle time and you have to
download kubeconfig again with a newly generated token.
Click Download.
Export the kubeconfig parameters to your local machine with
access to kubectl. For example:
Obtain the list of machines with the StackLight local volumes attached.
Note
In the command below, substitute <mgmtKubeconfig>
with the path to your management cluster kubeconfig
and projectName with the project name where your cluster
is located.
Introduces support for the Cluster release 7.0.0
that is based on the updated versions of Mirantis Container Runtime 20.10.5
and Mirantis Kubernetes Engine 3.4.0 with Kubernetes 1.20.
Introduces support for the Cluster release 5.17.0
that is based on Mirantis Kubernetes Engine 3.3.6 with Kubernetes 1.18
and the updated version of Mirantis Container Runtime 20.10.5.
Supports deprecated Cluster releases 5.16.0 and
6.14.0 that will become unsupported
in one of the following Container Cloud releases.
Supports the Cluster release 5.11.0 only for attachment
of existing MKE 3.3.4 clusters. For the deployment of new or attachment
of existing MKE 3.3.6 clusters, the latest available Cluster release is used.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.10.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.10.0.
For the list of enhancements in the Cluster releases 7.0.0, 5.17.0, and
6.16.0 that are supported by the Container Cloud release
2.10.0, see the Cluster releases (managed).
Support of MKE 3.3.x series and 3.4.0 for cluster attachment¶
Added support of several Mirantis Kubernetes Engine (MKE) versions
of the 3.3.x series and 3.4.0 for attaching or detaching of existing
MKE 3.3.3 - 3.3.6 and 3.4.0 clusters as well as updating them
to the latest supported version.
This feature allows for visualization of all your MKE cluster details
on one management cluster, including cluster health, capacity, and usage.
Initial CentOS support for the VMware vSphere provider¶
Technology Preview
Introduced the initial Technology Preview support of the CentOS 7.9
operating system for the vSphere-based management, regional, and managed
clusters.
Note
Deployment of a Container Cloud cluster that is based on both
RHEL and CentOS operating systems is not supported.
To deploy a vSphere-based managed cluster on CentOS
with custom or additional mirrors configured in the VM template,
the squid-proxy configuration on the management or regional
cluster is required. It is done automatically if you use the
Container Cloud script for the VM template creation.
Added support of RHEL 7.9 for the vSphere provider. This operating system
is now installed by default on any type of the vSphere-based Container Cloud
clusters.
RHEL 7.8 deployment is still possible if access to the
rhel-7-server-rpms repository provided by Red Hat Enterprise
Linux Server 7 x86_64 is allowed.
Verify that your RHEL license or activation key meets this requirement.
Implemented the guided tour in the Container Cloud web UI to help you get
oriented with the multi-cluster multi-cloud Container Cloud platform.
This brief guided tour will step you through the key features of Container
Cloud that can be performed using the Container Cloud web UI.
Removal of IAM and Keycloak IPs configuration for the vSphere provider¶
Removed the following Keycloak and IAM services variables that were used
during a vSphere-based management cluster bootstrap for the MetalLB
configuration:
KEYCLOAK_FLOATING_IP
IAM_FLOATING_IP
Now, these IPs are automatically generated in the MetalLB range
for certificates creation.
Implemented the container-cloud bootstrap user add command that
allows creating Keycloak users with specific permissions to access
the Container Cloud web UI and manage the Container Cloud clusters.
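For example, you can inspect the available options of the new command from the bootstrap directory; the exact flags are not listed here and should be taken from the command help:
./container-cloud bootstrap user add --help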
For security reasons, removed the default password password for Keycloak
that was generated during a management cluster bootstrap to access
the Container Cloud web UI.
On top of continuous improvements delivered to the existing Container Cloud
guides, added documentation about the Container Cloud user roles management
through the Keycloak Admin Console. The section outlines the IAM roles
and scopes structure in Container Cloud as well as role assignment to users
using the Keycloak Admin Console.
The following issues have been addressed in the Mirantis Container Cloud
release 2.10.0 along with the Cluster releases 7.0.0 and 5.17.0.
For more issues addressed for the Cluster release 6.16.0, see also
addressed issues 2.8.0 and
2.9.0.
[8013][AWS] Fixed the issue with managed clusters deployment, that requires
persistent volumes (PVs), failing with pods being stuck in the Pending
state and having the pod has unbound immediate PersistentVolumeClaims
and node(s) had volume node affinity conflict errors.
Note
The issue affects only the MKE deployments with Kubernetes 1.18
and is fixed for MKE 3.4.x with Kubernetes 1.20 that is available
since the Cluster release 7.0.0.
[14981] [Equinix Metal] Fixed the issue with a manager machine deployment
failing if the cluster contained at least one manager machine that was
stuck in the Provisioning state due to the capacity limits
in the selected Equinix Metal data center.
[13402] [LCM] Fixed the issue with the existing clusters failing with the
no space left on device error due to an excessive amount of core dumps
produced by applications that fail frequently.
[14125] [LCM] Fixed the issue with managed clusters deployed or updated
on a regional cluster of another provider type displaying inaccurate
Nodes readiness live status in the Container Cloud web UI.
[14040][StackLight] Fixed the issue with the Tiller container of the
stacklight-helm-controller pods switching to CrashLoopBackOff and
then being OOMKilled. Limited the releases number in history to 3 to
prevent RAM overconsumption by Tiller.
[14152] [Upgrade] Fixed the issue with managed cluster release upgrade
failing and the DNS names of the Kubernetes services on the affected pod
not being resolved due to DNS issues on pods with host networking.
This section lists known issues with workarounds for the Mirantis
Container Cloud release 2.10.0 including the Cluster releases 7.0.0, 6.16.0,
and 5.17.0.
Note
This section also outlines still valid known issues
from previous Container Cloud releases.
AWS¶[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments.
Moving forward, the workaround for this issue will be moved from
Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state
with the pod has unbound immediate PersistentVolumeClaims and
node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data
can be rebuilt from replicas. If you have a non-HA deployment,
back up any existing data before proceeding,
since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts
of the affected pods:
Equinix Metal¶[16718] Equinix Metal provider fails to create machines with SSH keys error¶
Fixed in 2.12.0
If an Equinix Metal based cluster is being deployed in an Equinix Metal
project with no SSH keys, the Equinix Metal provider fails
to create machines with the following error:
Click Add New Key and add details of the newly created SSH key.
Click Add.
Restart the cluster deployment.
Bare metal¶[17118] Failure to add a new machine to cluster¶
Fixed in 2.12.0
Adding a new machine to a baremetal-based
managed cluster may fail after the baremetal-based management cluster upgrade.
The issue occurs because the PXE boot is not working for the new node.
In this case, file /volume/tftpboot/ipxe.efi not found logs appear on
dnsmasq-tftp.
Workaround:
Log in to a local machine where your management cluster kubeconfig is
located and where kubectl is installed.
Scale the Ironic deployment down to 0 replicas:
kubectl -n kaas scale deployments/ironic --replicas=0
Scale the Ironic deployment up to 1 replica:
kubectl -n kaas scale deployments/ironic --replicas=1
[7655] Wrong status for an incorrectly configured L2 template¶
Fixed in 2.11.0
If an L2 template is configured incorrectly, a bare metal cluster is deployed
successfully but with the runtime errors in the IpamHost object.
Workaround:
If you suspect that the machine is not working properly because
of incorrect network configuration, verify the status of the corresponding
IpamHost object. Inspect the l2RenderResult and ipAllocationResult
object fields for error messages.
OpenStack¶[10424] Regional cluster cleanup fails by timeout¶
An OpenStack-based regional cluster cleanup fails with the timeout error.
Workaround:
Wait for the Cluster object to be deleted in the bootstrap cluster:
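A minimal sketch, assuming the kind-based bootstrap cluster kubeconfig generated during bootstrap:
kubectl --kubeconfig <pathToBootstrapClusterKubeconfig> get cluster -A -w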
vSphere¶[15698] VIP is assigned to each manager node instead of a single node¶
Fixed in 2.11.0
A load balancer virtual IP address (VIP) is assigned to each manager
node on any type of the vSphere-based cluster. The issue occurs because
the Keepalived instances cannot set up a cluster due to the blocked
vrrp protocol traffic in the firewall configuration on the Container Cloud
nodes.
Note
Before applying the workaround below, verify that the dedicated
vSphere network does not have any other virtual machines with the
keepalived instance running with the same vrouter_id.
You can verify the vrouter_id value of the cluster
in /etc/keepalived/keepalived.conf on the manager nodes.
Workaround
Update the firewalld configuration on each manager node of the affected
cluster to allow the vrrp protocol traffic between the nodes:
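A sketch using firewalld rich rules; run the commands as root on each manager node:
firewall-cmd --permanent --add-rich-rule='rule protocol value="vrrp" accept'
firewall-cmd --reload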
Apply the procedure to the remaining manager nodes of the cluster.
[14458] Failure to create a container for pod: cannot allocate memory¶
Fixed in 2.9.0 for new clusters
Newly created pods may fail to run and have the CrashLoopBackOff status
on long-living Container Cloud clusters deployed on RHEL 7.8 using the VMware
vSphere provider. The following is an example output of the
kubectl describe pod <pod-name> -n <projectName> command:
State:        Waiting
Reason:       CrashLoopBackOff
Last State:   Terminated
Reason:       ContainerCannotRun
Message:      OCI runtime create failed: container_linux.go:349:
starting container process caused "process_linux.go:297: applying cgroup configuration for process caused "mkdir /sys/fs/cgroup/memory/kubepods/burstable/<pod-id>/<container-id>":
cannot allocate memory": unknown
The issue occurs due to the
Kubernetes and
Docker community issues.
According to the RedHat solution,
the workaround is to disable the kernel memory accounting feature
by appending cgroup.memory=nokmem to the kernel command line.
Note
The workaround below applies to the existing clusters only.
The issue is resolved for new Container Cloud 2.9.0 deployments
since the workaround below automatically applies to the VM template
built during the vSphere-based management cluster bootstrap.
Apply the following workaround on each machine of the affected
cluster.
Workaround
SSH to any machine of the affected cluster using mcc-user and the SSH
key provided during the cluster creation to proceed as the root user.
In /etc/default/grub, set cgroup.memory=nokmem for
GRUB_CMDLINE_LINUX.
A vSphere-based management cluster bootstrap fails due to a node leaving the
cluster after an accidental IP address change.
The issue may affect a vSphere-based cluster only when IPAM
is not enabled and IP addresses assignment to the vSphere virtual machines
is done by a DHCP server present in the vSphere network.
By default, a DHCP server keeps lease of the IP address for 30 minutes.
Usually, a VM dhclient prolongs such lease by frequent DHCP requests
to the server before the lease period ends.
The DHCP prolongation request period is always less than the default lease time
on the DHCP server, so prolongation usually works.
But in case of network issues, for example, when dhclient from the
VM cannot reach the DHCP server, or the VM is being slowly powered on
for more than the lease time, such VM may lose its assigned IP address.
As a result, it obtains a new IP address.
Container Cloud does not support network reconfiguration after
the IP of the VM has been changed. Therefore, such issue may lead
to a VM leaving the cluster.
Symptoms:
One of the nodes is in the NodeNotReady or down state:
kubectl get nodes -o wide
docker node ls
The UCP Swarm manager logs on the healthy manager node contain the
following example error:
The output of the docker info command contains the following
example error:
Error: rpc error: code = Unknown desc = The swarm does not have a leader. \
It's possible that too few managers are online. \
Make sure more than half of the managers are online.
The UCP controller logs contain the following example error:
docker logs -f ucp-controller
"warning","msg":"Node State Active check error: \
Swarm Mode Manager health check error: \
info: Cannot connect to the Docker daemon at tcp://<node IP>:12376. \
Is the docker daemon running?"
On the affected node, the IP address on the first interface eth0
does not match the IP address configured in Docker. Verify the
NodeAddress field in the output of the docker info command.
The following lines are present in /var/log/messages:
If there are several lines where the IP is different, the node is affected.
Workaround:
Select from the following options:
Bind IP addresses for all machines to their MAC addresses on the DHCP server
for the dedicated vSphere network. In this case, VMs receive only specified
IP addresses that never change.
Remove the Container Cloud node IPs from the IP range on the DHCP server
for the dedicated vSphere network and configure the first interface eth0
on VMs with a static IP address.
If a managed cluster is affected, redeploy it with IPAM enabled
for new machines to be created and IPs to be assigned properly.
LCM¶[16146] Stuck kubelet on the Cluster release 5.x.x series¶
Occasionally, kubelet may get stuck on the Cluster release 5.x.x series
with different errors in the ucp-kubelet containers leading to the nodes
failures. The following error occurs every time when accessing
the Kubernetes API server:
[8367] Adding of a new manager node to a managed cluster hangs on Deploy stage¶
Fixed in 2.12.0
Adding of a new manager node to a managed cluster may hang due to
issues with joining etcd from a new node to the existing etcd cluster.
The new manager node hangs in the Deploy stage.
Symptoms:
The Ansible run tries executing the Wait for Docker UCP to be accessible
step and fails with the following error message:
To determine the etcd leader, run on any manager node:
docker exec -it ucp-kv sh
# From the inside of the container:
ETCDCTL_API=3 etcdctl -w table --endpoints=https://<1st manager IP>:12379,https://<2nd manager IP>:12379,https://<3rd manager IP>:12379 endpoint status
To verify logs on the leader node:
docker logs ucp-kv
Root cause:
In case of an unlucky network partition, the leader may lose quorum
and members are not able to perform the election. For more details, see
Official etcd documentation: Learning, figure 5.
Workaround:
Restart etcd on the leader node:
docker rm -f ucp-kv
Wait several minutes until the etcd cluster starts and reconciles.
The deployment of the new manager node will proceed and it will join
the etcd cluster. After that, other MKE components will be configured and
the node deployment will be finished successfully.
[13303] Managed cluster update fails with the Network is unreachable error¶
Fixed in 2.11.0
A managed cluster update from the Cluster release 6.12.0 to
6.14.0 fails with worker nodes being stuck in the Deploy state with
the Network is unreachable error.
Workaround:
Verify the state of the loopback network interface:
ip l show lo
If the interface is not in the UNKNOWN or UP state,
enable it manually:
ip l set lo up
If the interface is in the UNKNOWN or UP state,
assess the cluster logs to identify the failure root cause.
Repeat the cluster update procedure.
[13845] Cluster update fails during the LCM Agent upgrade with x509 error¶
Fixed in 2.11.0
During update of a managed cluster from the Cluster releases 6.12.0 to 6.14.0,
the LCM Agent upgrade fails with the following error in logs:
lcmAgentUpgradeStatus:
error: 'failed to download agent binary: Get https://<mcc-cache-address>/bin/lcm/bin/lcm-agent/v0.2.0-289-gd7e9fa9c/lcm-agent: x509: certificate signed by unknown authority'
Only clusters initially deployed using Container Cloud 2.4.0 or earlier
are affected.
As a workaround, restart lcm-agent using the
service lcm-agent-* restart command on the affected nodes.
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3.
Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment,
Helm releases may get stuck in the FAILED or UNKNOWN state
although the corresponding machines statuses are Ready
in the Container Cloud web UI. For example, if the StackLight Helm release
fails, the links to its endpoints are grayed out in the web UI.
In the cluster status, providerStatus.helm.ready and
providerStatus.helm.releaseStatuses.<releaseName>.success are false.
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify the failed release has the UNKNOWN or FAILED status
in the HelmBundle object:
IAM¶[13385] MariaDB pods fail to start after SST sync¶
Fixed in 2.12.0
The MariaDB pods fail to start after MariaDB blocks itself during the
State Snapshot Transfers sync.
Workaround:
Verify the failed pod readiness:
kubectl describe pod -n kaas <failedMariadbPodName>
If the readiness probe failed with the WSREP not synced message,
proceed to the next step. Otherwise, assess the MariaDB pod logs
to identify the failure root cause.
Storage¶[10050] Ceph OSD pod is in the CrashLoopBackOff state after disk replacement¶
Fixed in 2.11.0
If you use a custom BareMetalHostProfile, after disk replacement
on a Ceph OSD, the Ceph OSD pod switches to the CrashLoopBackOff state
due to the Ceph OSD authorization key failing to be created properly.
Workaround:
Export kubeconfig of your managed cluster. For example:
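A sketch with an assumed download location of the kubeconfig file:
export KUBECONFIG=~/Downloads/kubeconfig-<managedClusterName>.yml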
Bootstrap¶[16873] Bootstrap fails with ‘failed to establish connection with tiller’ error¶
Fixed in 2.12.0
If the latest Ubuntu 18.04 image, for example, with kernel 4.15.0-153-generic,
is installed on the bootstrap node, a management cluster bootstrap fails
during the setup of the Kubernetes cluster by kind.
The issue occurs because kind 0.9.0, which is delivered with the bootstrap
script, is not compatible with the latest Ubuntu 18.04 image that requires
kind 0.11.1.
To verify that the bootstrap node is affected by the issue:
In the bootstrap script stdout, verify the connection to Tiller.
Example of system response extract on an affected bootstrap node:
Upgrade¶[16233] Bare metal pods fail during upgrade due to Ceph not unmounting RBD¶
Fixed in 2.11.0
A baremetal-based management cluster upgrade can fail with stuck ironic
and dnsmasq pods. The issue may occur due to the Ceph pre-upgraded
persistent volumes being unmapped incorrectly. As a result, the RBD volume
mounts on the nodes are left without any real RBD volumes behind them.
<affectedProjectName> is the Container Cloud project name where
the pods failed to run
<affectedPodName> is a pod name that failed to run in this project
In the pod description, identify the node name where the pod failed to run.
Verify that the csi-rbdplugin logs of the affected node contain the
rbd volume mount failed: <csi-vol-uuid> is being used error.
The <csi-vol-uuid> is a unique RBD volume name.
Identify csiPodName of the corresponding csi-rbdplugin:
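For example, a hypothetical lookup, assuming the Ceph CSI plugin runs in the rook-ceph namespace with the app=csi-rbdplugin label (adjust both if your deployment differs):
kubectl -n rook-ceph get pod -l app=csi-rbdplugin -o wide | grep <nodeName>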
Helm releases may get stuck in the PENDING_UPGRADE status
during a management or managed cluster upgrade. The HelmBundle Controller
cannot recover from this state and requires manual actions. The workaround
below describes the recovery process for the openstack-operator release
that got stuck during a managed cluster update. Use it as an example for other
Helm releases as required.
Upgrade of a Container Cloud management or regional cluster from version 2.9.0
to 2.10.0 and managed cluster from 5.16.0 to 5.17.0 may fail with the following
error message for the patroni-12-0, patroni-12-1 or patroni-12-2
pod.
error when evicting pods/"patroni-12-2" -n "stacklight" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
As a workaround, reinitialize the Patroni pod that got stuck:
kubectl -n stacklight delete "pod/<POD_NAME>" "pvc/<POD_PVC>"
sleep 3 # wait for StatefulSet to reschedule the pod, but miss dependent PVC creation
kubectl -n stacklight delete "pod/<POD_NAME>"
[16141] Alertmanager pod gets stuck in CrashLoopBackOff during upgrade¶
Fixed in 2.11.0
An Alertmanager pod may get stuck in the CrashLoopBackOff state during
upgrade of a management, regional, or managed cluster and thus cause upgrade
failure with the Loading configuration file failed error message in logs.
Workaround:
Delete the Alertmanager pod that is stuck in the CrashLoopBackOff state.
For example:
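A possible command, assuming the default stacklight namespace and a placeholder pod name:
kubectl -n stacklight delete pod <alertmanagerPodName>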
Container Cloud web UI¶[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.10.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Introduces support for the Cluster release 5.16.0
that is based on Kubernetes 1.18, Mirantis Container Runtime 19.03.14,
and Mirantis Kubernetes Engine 3.3.6.
Supports deprecated Cluster releases 5.15.0 and
6.14.0 that will become unsupported
in one of the following Container Cloud releases.
Supports the Cluster release 5.11.0 only for attachment
of existing MKE 3.3.4 clusters. For the deployment of new or attachment
of existing MKE 3.3.6 clusters, the latest available Cluster release is used.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.9.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.9.0.
For the list of enhancements in the Cluster release 5.16.0 and
Cluster release 6.16.0 that are supported by the Container Cloud release
2.9.0, see the 5.16.0 and 6.16.0 sections.
Introduced support for the Equinix Metal cloud provider. Equinix Metal
integrates a fully automated bare metal infrastructure at software speed.
Now, you can deploy managed clusters that are based on the Equinix Metal
management or regional clusters or on top of the AWS-based management cluster.
Using the Equinix Metal management cluster, you can also deploy additional
regional clusters that are based on the OpenStack, AWS, vSphere, or Equinix Metal
cloud providers to deploy and operate managed clusters of different provider
types or configurations from a single Container Cloud management plane.
The Equinix Metal based managed clusters also include a Ceph cluster that can
be configured either automatically or manually before or after the cluster
deployment.
Implemented the Container Cloud integration to Lens. Using the Container Cloud
web UI and the Lens extension, you can now add any type of Container Cloud
clusters to Lens for further inspection and monitoring.
The following options are now available in the More action
icon menu of each deployed cluster:
New bootstrap node for additional regional clusters¶
Added the possibility to use a new bootstrap node for deployment of
additional regional clusters. You can now deploy regional clusters not only
on the bootstrap node where you originally deployed the related management
cluster, but also on a new node.
TLS certificates for management cluster applications¶
Implemented the possibility to configure TLS certificates for Keycloak
and Container Cloud web UI on new management clusters.
Caution
Adding of TLS certificates for Keycloak is not supported
on existing clusters deployed using the Container Cloud
release earlier than 2.9.0.
Default Keycloak authorization in Container Cloud web UI¶
For security reasons, updated the Keycloak authorization logic.
The Keycloak single sign-on (SSO) feature that was optional in previous
releases is now the default and the only possible login option for the
Container Cloud web UI.
While you are logged in using the Keycloak SSO, you can:
Download a cluster kubeconfig without a password
Log in to an MKE cluster without having to sign in again
Use the StackLight endpoints without having to sign in again
Note
Keycloak is exposed using HTTPS with self-signed TLS certificates
that are not trusted by web browsers.
Implemented management of SSH keys only for the universal mcc-user that is
now applicable to any Container Cloud provider and node type,
including Bastion. All existing SSH user names, such as ubuntu,
cloud-user for the vSphere-based clusters, are replaced with the universal
mcc-user user name.
Implemented the vsphereResources controller to represent the vSphere
resources as Kubernetes objects and manage them using the Container Cloud
web UI.
You can now use the drop-down list fields to filter results by a short
resource name during a cluster and machine creation.
The drop-down lists for the following vSphere resources paths are added to
the Container Cloud web UI:
Updated the L2 templates format for baremetal-based deployments.
In the new format, l2template:status:npTemplate is used
directly during provisioning. Therefore, a hardware node obtains and applies
a complete network configuration during the first system boot.
Before Container Cloud 2.9.0, you could configure any network
interface except the default provisioning NIC used for the PXE and LCM
managed-to-manager connection.
Since Container Cloud 2.9.0, you can configure any interface if required.
Caution
Deploy any new node using the L2 template of the new format.
Replace all deprecated L2 templates created before Container Cloud 2.9.0
with the L2 templates of new format.
The following issues have been addressed in the Mirantis Container Cloud
release 2.9.0 along with the Cluster releases 6.16.0 and 5.16.0.
For more issues addressed for the Cluster release 6.16.0, see also
2.8.0 addressed issues.
[14682][StackLight] Reduced the number of KubePodNotReady and
KubePodCrashLooping alerts. Reworked these alerts and renamed to
KubePodsNotReady and KubePodsCrashLooping.
[14663][StackLight] Removed the inefficient Kubernetes API and etcd latency
alerts.
[14458][vSphere] Fixed the issue with newly created pods failing to run
and having the CrashLoopBackOff status on long-living vSphere-based
clusters.
The issue is fixed for new clusters deployed using Container Cloud 2.9.0.
For existing clusters, apply the workaround described in vSphere
known issues.
[14051][Ceph] Fixed the issue with the CephCluster creation failure
if manageOsds was enabled before deploy.
AWS¶[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments.
Moving forward, the workaround for this issue will be moved from
Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state
with the pod has unbound immediate PersistentVolumeClaims and
node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data
can be rebuilt from replicas. If you have a non-HA deployment,
back up any existing data before proceeding,
since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts
of the affected pods:
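For example, a minimal sketch, assuming the affected pods run in the <affectedProjectName> namespace:
kubectl -n <affectedProjectName> get pvc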
vSphere¶[15698] VIP is assigned to each manager node instead of a single node¶
Fixed in 2.11.0
A load balancer virtual IP address (VIP) is assigned to each manager
node on any type of the vSphere-based cluster. The issue occurs because
the Keepalived instances cannot set up a cluster due to the blocked
vrrp protocol traffic in the firewall configuration on the Container Cloud
nodes.
Note
Before applying the workaround below, verify that the dedicated
vSphere network does not have any other virtual machines with the
keepalived instance running with the same vrouter_id.
You can verify the vrouter_id value of the cluster
in /etc/keepalived/keepalived.conf on the manager nodes.
Workaround
Update the firewalld configuration on each manager node of the affected
cluster to allow the vrrp protocol traffic between the nodes:
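A minimal sketch of such a change using firewalld rich rules (verify the rule syntax against your firewalld version before applying):
firewall-cmd --permanent --add-rich-rule='rule protocol value="vrrp" accept'
firewall-cmd --reload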
A vSphere-based management cluster bootstrap fails due to a node leaving the
cluster after an accidental IP address change.
The issue may affect a vSphere-based cluster only when IPAM
is not enabled and IP addresses assignment to the vSphere virtual machines
is done by a DHCP server present in the vSphere network.
By default, a DHCP server keeps lease of the IP address for 30 minutes.
Usually, a VM dhclient prolongs such lease by frequent DHCP requests
to the server before the lease period ends.
The DHCP prolongation request period is always less than the default lease time
on the DHCP server, so prolongation usually works.
But in case of network issues, for example, when dhclient from the
VM cannot reach the DHCP server, or the VM is being slowly powered on
for more than the lease time, such VM may lose its assigned IP address.
As a result, it obtains a new IP address.
Container Cloud does not support network reconfiguration after
the IP of the VM has been changed. Therefore, such issue may lead
to a VM leaving the cluster.
Symptoms:
One of the nodes is in the NodeNotReady or down state:
kubectl get nodes -o wide
docker node ls
The UCP Swarm manager logs on the healthy manager node contain the
following example error:
The output of the docker info command contains the following
example error:
Error: rpc error: code = Unknown desc = The swarm does not have a leader. \
It's possible that too few managers are online. \
Make sure more than half of the managers are online.
The UCP controller logs contain the following example error:
docker logs -f ucp-controller
"warning","msg":"Node State Active check error: \
Swarm Mode Manager health check error: \
info: Cannot connect to the Docker daemon at tcp://<node IP>:12376. \
Is the docker daemon running?
On the affected node, the IP address on the first interface eth0
does not match the IP address configured in Docker. Verify the
NodeAddress field in the output of the docker info command.
The following lines are present in /var/log/messages:
If there are several lines where the IP is different, the node is affected.
Workaround:
Select from the following options:
Bind IP addresses for all machines to their MAC addresses on the DHCP server
for the dedicated vSphere network. In this case, VMs receive only specified
IP addresses that never change.
Remove the Container Cloud node IPs from the IP range on the DHCP server
for the dedicated vSphere network and configure the first interface eth0
on VMs with a static IP address.
If a managed cluster is affected, redeploy it with IPAM enabled
for new machines to be created and IPs to be assigned properly.
[14458] Failure to create a container for pod: cannot allocate memory¶
Fixed in 2.9.0 for new clusters
Newly created pods may fail to run and have the CrashLoopBackOff status
on long-living Container Cloud clusters deployed on RHEL 7.8 using the VMware
vSphere provider. The following is an example output of the
kubectl describe pod <pod-name> -n <projectName> command:
State:       Waiting
Reason:      CrashLoopBackOff
Last State:  Terminated
Reason:      ContainerCannotRun
Message:     OCI runtime create failed: container_linux.go:349:
starting container process caused "process_linux.go:297: applying cgroup configuration for process caused "mkdir /sys/fs/cgroup/memory/kubepods/burstable/<pod-id>/<container-id>>:
cannot allocate memory": unknown
The issue occurs due to the
Kubernetes and
Docker community issues.
According to the RedHat solution,
the workaround is to disable the kernel memory accounting feature
by appending cgroup.memory=nokmem to the kernel command line.
Note
The workaround below applies to the existing clusters only.
The issue is resolved for new Container Cloud 2.9.0 deployments
since the workaround below automatically applies to the VM template
built during the vSphere-based management cluster bootstrap.
Apply the following workaround on each machine of the affected
cluster.
Workaround
SSH to any machine of the affected cluster using mcc-user and the SSH
key provided during the cluster creation to proceed as the root user.
In /etc/default/grub, set cgroup.memory=nokmem for
GRUB_CMDLINE_LINUX.
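A minimal sketch of the resulting configuration and its application on a BIOS-based RHEL machine (the existing kernel parameters and the GRUB configuration path are assumptions, verify them for your environment):
# /etc/default/grub
GRUB_CMDLINE_LINUX="<existing parameters> cgroup.memory=nokmem"
# Regenerate the GRUB configuration and reboot the machine:
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot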
Equinix Metal¶[14981] Equinix Metal machine is stuck in Deploy stage¶
Fixed in 2.10.0
An Equinix Metal manager machine deployment may fail if the cluster contains
at least one manager machine that is stuck in the Provisioning state
due to the capacity limits in the selected Equinix Metal data center.
In this case, other machines that were successfully created in Equinix Metal
may also fail to finalize the deployment and get stuck on the Deploy stage.
If this is the case, remove all manager machines that are stuck in the
Provisioning state.
Workaround:
Export the kubeconfig of the management cluster. For example:
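A minimal sketch with a placeholder path to the management cluster kubeconfig created during bootstrap:
export KUBECONFIG=<pathToManagementClusterKubeconfig>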
After all machines that are stuck in the Provisioning state are removed,
the deployment of the manager machine that is stuck on the Deploy stage
resumes.
Bare metal¶[14642] Ironic logs overflow the storage volume¶
On the baremetal-based management clusters with the Cluster version 2.9.0
or earlier, the storage volume used by Ironic can run out of free space.
As a result, an automatic upgrade of the management cluster fails with the
no space left on device error in the Ironic logs.
Symptoms:
The httpd Deployment and the ironic and dnsmasq StatefulSets are not in the OK status:
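For example, a hypothetical check, assuming the Ironic components run in the kaas namespace (adjust the namespace if your deployment differs):
kubectl -n kaas get deployment httpd
kubectl -n kaas get statefulset ironic dnsmasq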
[7655] Wrong status for an incorrectly configured L2 template¶
Fixed in 2.11.0
If an L2 template is configured incorrectly, a bare metal cluster is deployed
successfully but with the runtime errors in the IpamHost object.
Workaround:
If you suspect that the machine is not working properly because
of incorrect network configuration, verify the status of the corresponding
IpamHost object. Inspect the l2RenderResult and ipAllocationResult
object fields for error messages.
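For example, a minimal check with placeholder names:
kubectl -n <projectName> get ipamhost <ipamHostName> -o yaml
# Look for error messages in the l2RenderResult and ipAllocationResult fields of the output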
Storage¶[10050] Ceph OSD pod is in the CrashLoopBackOff state after disk replacement¶
Fixed in 2.11.0
If you use a custom BareMetalHostProfile, after disk replacement
on a Ceph OSD, the Ceph OSD pod switches to the CrashLoopBackOff state
due to the Ceph OSD authorization key failing to be created properly.
Workaround:
Export kubeconfig of your managed cluster. For example:
IAM¶[13385] MariaDB pods fail to start after SST sync¶
Fixed in 2.12.0
The MariaDB pods fail to start after MariaDB blocks itself during the
State Snapshot Transfers sync.
Workaround:
Verify the failed pod readiness:
kubectl describe pod -n kaas <failedMariadbPodName>
If the readiness probe failed with the WSREP not synced message,
proceed to the next step. Otherwise, assess the MariaDB pod logs
to identify the failure root cause.
Verify that wsrep_local_state_comment is Donor or Desynced:
kubectl exec -it -n kaas <failedMariadbPodName> -- mysql -u root -p<mariadbAdminPassword> -e "SHOW status LIKE \"wsrep_local_state_comment\";"
Restart the failed pod:
kubectl delete pod -n kaas <failedMariadbPodName>
LCM¶[13402] Cluster fails with error: no space left on device¶
Fixed in 2.8.0 for new clusters and in 2.10.0 for existing clusters
If an application running on a Container Cloud management or managed cluster
fails frequently, for example, PostgreSQL, it may produce an excessive amount
of core dumps.
This leads to the no space left on device error on the cluster nodes and,
as a result, to the broken Docker Swarm and the entire cluster.
Core dumps are disabled by default on the operating system of the
Container Cloud nodes. But since Docker does not inherit the operating system
settings, disable core dumps in Docker using the workaround below.
Warning
The workaround below does not apply to the baremetal-based
clusters, including MOS deployments,
since Docker restart may destroy the Ceph cluster.
Workaround:
SSH to any machine of the affected cluster using mcc-user
and the SSH key provided during the cluster creation.
In /etc/docker/daemon.json, add the following parameters:
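A minimal sketch of such parameters that disable core dumps through the default ulimits (an assumption, verify against your Docker version; note that the Docker service has to be restarted for the change to take effect, which is why the warning above excludes baremetal-based clusters):
{
  "default-ulimits": {
    "core": {
      "Name": "core",
      "Hard": 0,
      "Soft": 0
    }
  }
}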
Repeat the steps above on each machine of the affected cluster one by one.
[8367] Adding of a new manager node to a managed cluster hangs on Deploy stage¶
Fixed in 2.12.0
Adding of a new manager node to a managed cluster may hang due to
issues with joining etcd from a new node to the existing etcd cluster.
The new manager node hangs in the Deploy stage.
Symptoms:
The Ansible run tries executing the Wait for Docker UCP to be accessible
step and fails with the following error message:
To determine the etcd leader, run on any manager node:
docker exec -it ucp-kv sh
# From the inside of the container:
ETCDCTL_API=3 etcdctl -w table --endpoints=https://<1st manager IP>:12379,https://<2nd manager IP>:12379,https://<3rd manager IP>:12379 endpoint status
To verify logs on the leader node:
docker logs ucp-kv
Root cause:
In case of an unlucky network partition, the leader may lose quorum
and members are not able to perform the election. For more details, see
Official etcd documentation: Learning, figure 5.
Workaround:
Restart etcd on the leader node:
docker rm -f ucp-kv
Wait several minutes until the etcd cluster starts and reconciles.
The deployment of the new manager node will proceed and it will join
the etcd cluster. After that, other MKE components will be configured and
the node deployment will be finished successfully.
[13303] Managed cluster update fails with the Network is unreachable error¶
Fixed in 2.11.0
A managed cluster update from the Cluster release 6.12.0 to
6.14.0 fails with worker nodes being stuck in the Deploy state with
the Network is unreachable error.
Workaround:
Verify the state of the loopback network interface:
ip l show lo
If the interface is not in the UNKNOWN or UP state,
enable it manually:
ip l set lo up
If the interface is in the UNKNOWN or UP state,
assess the cluster logs to identify the failure root cause.
Repeat the cluster update procedure.
[13845] Cluster update fails during the LCM Agent upgrade with x509 error¶
Fixed in 2.11.0
During update of a managed cluster from the Cluster releases 6.12.0 to 6.14.0,
the LCM Agent upgrade fails with the following error in logs:
lcmAgentUpgradeStatus:
error:'failed to download agent binary: Get https://<mcc-cache-address>/bin/lcm/bin/lcm-agent/v0.2.0-289-gd7e9fa9c/lcm-agent: x509: certificate signed by unknown authority'
Only clusters initially deployed using Container Cloud 2.4.0 or earlier
are affected.
As a workaround, restart lcm-agent using the
service lcm-agent-* restart command on the affected nodes.
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3.
Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment,
Helm releases may get stuck in the FAILED or UNKNOWN state
although the corresponding machines statuses are Ready
in the Container Cloud web UI. For example, if the StackLight Helm release
fails, the links to its endpoints are grayed out in the web UI.
In the cluster status, providerStatus.helm.ready and
providerStatus.helm.releaseStatuses.<releaseName>.success are false.
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify the failed release has the UNKNOWN or FAILED status
in the HelmBundle object:
[14125] Inaccurate nodes readiness status on a managed cluster¶
Fixed in 2.10.0
A managed cluster deployed or updated on a regional cluster of another
provider type may display inaccurate Nodes readiness live status
in the Container Cloud web UI. While all nodes are ready, the Nodes
status indicates that some nodes are still not ready.
The issue occurs due to the cordon-drain desynchronization between
the LCMClusterState objects and the actual state of the cluster.
Note
The workaround below must be applied only by users with
the writer or cluster-admin access role
assigned by the Infrastructure Operator.
To verify that the cluster is affected:
Export the regional cluster kubeconfig created during the
regional cluster deployment:
Replace the parameters enclosed in angle brackets with the SSH key that
was used for the managed cluster deployment and the private IP address
of any control plane node of the cluster.
If the status of the Kubernetes and Swarm nodes is ready,
proceed with the next steps.
Otherwise, assess the cluster logs to identify the issue with not
ready nodes.
Obtain the LCMClusterState items related to the swarm-drain
and cordon-drain type:
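For example, a minimal sketch, assuming the managed cluster project namespace is used as a placeholder:
kubectl -n <projectName> get lcmclusterstates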
The command above outputs the list of all LCMClusterState items.
Verify only the LCMClusterState items names
that start with the swarm-drain- and cordon-drain- prefix.
Verify the status of each LCMClusterState item of the
swarm-drain and cordon-drain type:
For cordon-drain, spec.value and status.value are "false"
For swarm-drain, spec.value is "true" and the
status.message contains an error related to waiting for the Kubernetes
cordon-drain to finish
Workaround:
For each LCMClusterState item of the swarm-drain type with
spec.value=="true" and the status.message described above,
replace "true" with "false" in spec.value:
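For example, a minimal sketch using kubectl edit with a placeholder item name:
kubectl -n <projectName> edit lcmclusterstate <swarm-drain-itemName>
# In the editor, change spec.value from "true" to "false" and save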
Upgrade¶[15419] The iam-api pods are not ready after cluster upgrade¶
The iam-api pods are in the NotReady state on the management cluster
after the Container Cloud upgrade to 2.9.0 since they cannot reach Keycloak
due to the CA certificate issue.
The issue affects only the clusters originally deployed using the Container
Cloud release earlier than 2.6.0.
Workaround:
Replace the tls.crt and tls.key fields in the mcc-ca-cert
secret in the kaas namespace with the certificate and key generated
during the management cluster bootstrap.
These credentials are stored in the kaas-bootstrap/tls directory.
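A possible way to replace only these two fields, assuming the certificate and key file names under kaas-bootstrap/tls are placeholders that you verify beforehand:
kubectl -n kaas patch secret mcc-ca-cert --type merge -p \
  "{\"data\":{\"tls.crt\":\"$(base64 -w0 kaas-bootstrap/tls/<certFile>)\",\"tls.key\":\"$(base64 -w0 kaas-bootstrap/tls/<keyFile>)\"}}"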
Helm releases may get stuck in the PENDING_UPGRADE status
during a management or managed cluster upgrade. The HelmBundle Controller
cannot recover from this state and requires manual actions. The workaround
below describes the recovery process for the openstack-operator release
that got stuck during a managed cluster update. Use it as an example for other
Helm releases as required.
[14152] Managed cluster upgrade fails due to DNS issues¶
Fixed in 2.10.0
A managed cluster release upgrade may fail due to DNS issues on pods with
host networking. If this is the case, the DNS names of the Kubernetes services
on the affected pod cannot be resolved.
Workaround:
Export kubeconfig of the affected managed cluster. For example:
Container Cloud web UI¶[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.9.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Before Container Cloud 2.9.0, you could configure any network
interface except the default provisioning NIC used for the PXE and LCM
managed-to-manager connection.
Since Container Cloud 2.9.0, you can configure any interface if required.
Caution
Deploy any new node using the updated L2 template format.
All L2 templates created before Container Cloud 2.9.0 are now deprecated
and must not be used.
In the old L2 templates format, ipamhost spawns 2 structures after
processing l2template for machines:
l2template:status:osMetadataNetwork that renders automatically using
the default subnet from the management cluster and is used during
the cloud-init deployment phase after provisioning is done
l2template:status:npTemplate that is used during the lcm-agent
deployment phase and applied after lcmmachine starts deployment
In the new L2 templates format, l2template:status:npTemplate is used
directly during provisioning. Therefore, a hardware node obtains and applies
a complete network configuration during the first system boot.
To switch to the new L2 template format:
If you do not have a subnet for connection to the management LCM cluster
network (lcm-nw), manually create one. For details, see
Operations Guide: Create subnets.
In the previous L2 template format, {{nic0}}
for the PXE interface was not defined.
After switching to the new l2template format, the following info
message appears in the ipamhost status and indicates that bmh
successfully migrated to the new format of L2 templates:
KUBECONFIG=kubeconfig kubectl -n managed-ns get ipamhosts

NAME         STATUS                                                                        AGE   REGION
cz7700-bmh   L2Template + L3Layout used, osMetadataNetwork is unacceptable in this mode    49m   region-one
Introduces support for the Cluster release 5.15.0
that is based on Kubernetes 1.18, Mirantis Container Runtime 19.03.14,
and Mirantis Kubernetes Engine 3.3.6.
Supports deprecated Cluster releases 5.14.0 and
6.12.0 that will become unsupported
in one of the following Container Cloud releases.
Supports the Cluster release 5.11.0 only for attachment
of existing MKE 3.3.4 clusters. For the deployment of new or attachment
of existing MKE 3.3.6 clusters, the latest available Cluster release is used.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.8.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.8.0.
For the list of enhancements in the Cluster release 5.15.0 and
Cluster release 6.14.0 that are supported by the Container Cloud release
2.8.0, see the 5.15.0 and 6.14.0 sections.
Implemented the possibility to collect logs of the syslog container that runs
in the Ironic pod on the bare metal bootstrap, management, and managed
clusters.
You can collect Ironic pod logs using the standard Container Cloud
container-cloud collect logs command. The output is located in
/objects/namespaced/<namespaceName>/core/pods/<ironicPodId>/syslog.log.
To simplify operations with logs, the syslog container generates output
in the JSON format.
Note
Logs collected by the syslog container during the bootstrap phase
are not transferred to the management cluster during pivoting.
These logs are located in
/volume/log/ironic/ansible_conductor.log inside the Ironic pod.
LoadBalancer and ProviderInstance monitoring for cluster and machine statuses¶
Improved monitoring of the cluster and machine live statuses in the Container
Cloud web UI:
Added the LoadBalancer and ProviderInstance fields.
Added the providerInstanceState field for an AWS machine status
that includes the AWS VM ID, state, and readiness. The analogous fields
instanceState and instanceID are deprecated as of Container Cloud
2.8.0 and will be removed in one of the following releases. For details,
see Deprecation notes.
Updated notification about outdated cluster version in web UI¶
Updated the notification about outdated cluster version in the Container
Cloud web UI. Now, you will be notified about any outdated managed cluster
that must be updated to unblock the upgrade
of the management cluster and Container Cloud to the latest version.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
The following issues have been addressed in the Mirantis Container Cloud
release 2.8.0 along with the Cluster release 5.15.0:
[12723] [Ceph] Fixed the issue with the ceph_role_mon and
ceph_role_mgr labels remaining after deletion of a node from
KaaSCephCluster.
[13381] [LCM] Fixed the issue with requests to apiserver failing after
bootstrap on the management and regional clusters with enabled proxy.
[13402] [LCM] Fixed the issue with the cluster failing with the
no space left on device error due to an excessive amount of core dumps
produced by applications that fail frequently.
Note
The issue is addressed only for new clusters created using
Container Cloud 2.8.0. To work around the issue on existing
clusters created using the Container Cloud version below 2.8.0,
see LCM known issues: 13402.
AWS¶[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments.
Moving forward, the workaround for this issue will be moved from
Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state
with the pod has unbound immediate PersistentVolumeClaims and
node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data
can be rebuilt from replicas. If you have a non-HA deployment,
back up any existing data before proceeding,
since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts
of the affected pods:
vSphere¶[15698] VIP is assigned to each manager node instead of a single node¶
Fixed in 2.11.0
A load balancer virtual IP address (VIP) is assigned to each manager
node on any type of the vSphere-based cluster. The issue occurs because
the Keepalived instances cannot set up a cluster due to the blocked
vrrp protocol traffic in the firewall configuration on the Container Cloud
nodes.
Note
Before applying the workaround below, verify that the dedicated
vSphere network does not have any other virtual machines with the
keepalived instance running with the same vrouter_id.
You can verify the vrouter_id value of the cluster
in /etc/keepalived/keepalived.conf on the manager nodes.
Workaround
Update the firewalld configuration on each manager node of the affected
cluster to allow the vrrp protocol traffic between the nodes:
A vSphere-based management cluster bootstrap fails due to a node leaving the
cluster after an accidental IP address change.
The issue may affect a vSphere-based cluster only when IPAM
is not enabled and IP addresses assignment to the vSphere virtual machines
is done by a DHCP server present in the vSphere network.
By default, a DHCP server keeps lease of the IP address for 30 minutes.
Usually, a VM dhclient prolongs such lease by frequent DHCP requests
to the server before the lease period ends.
The DHCP prolongation request period is always less than the default lease time
on the DHCP server, so prolongation usually works.
But in case of network issues, for example, when dhclient from the
VM cannot reach the DHCP server, or the VM is being slowly powered on
for more than the lease time, such VM may lose its assigned IP address.
As a result, it obtains a new IP address.
Container Cloud does not support network reconfiguration after
the IP of the VM has been changed. Therefore, such issue may lead
to a VM leaving the cluster.
Symptoms:
One of the nodes is in the NodeNotReady or down state:
kubectl get nodes -o wide
docker node ls
The UCP Swarm manager logs on the healthy manager node contain the
following example error:
The output of the docker info command contains the following
example error:
Error: rpc error: code = Unknown desc = The swarm does not have a leader. \
It's possible that too few managers are online. \
Make sure more than half of the managers are online.
The UCP controller logs contain the following example error:
docker logs -f ucp-controller
"warning","msg":"Node State Active check error: \
Swarm Mode Manager health check error: \
info: Cannot connect to the Docker daemon at tcp://<node IP>:12376. \
Is the docker daemon running?
On the affected node, the IP address on the first interface eth0
does not match the IP address configured in Docker. Verify the
NodeAddress field in the output of the docker info command.
The following lines are present in /var/log/messages:
If there are several lines where the IP is different, the node is affected.
Workaround:
Select from the following options:
Bind IP addresses for all machines to their MAC addresses on the DHCP server
for the dedicated vSphere network. In this case, VMs receive only specified
IP addresses that never change.
Remove the Container Cloud node IPs from the IP range on the DHCP server
for the dedicated vSphere network and configure the first interface eth0
on VMs with a static IP address.
If a managed cluster is affected, redeploy it with IPAM enabled
for new machines to be created and IPs to be assigned properly.
[14458] Failure to create a container for pod: cannot allocate memory¶
Fixed in 2.9.0 for new clusters
Newly created pods may fail to run and have the CrashLoopBackOff status
on long-living Container Cloud clusters deployed on RHEL 7.8 using the VMware
vSphere provider. The following is an example output of the
kubectl describe pod <pod-name> -n <projectName> command:
State:       Waiting
Reason:      CrashLoopBackOff
Last State:  Terminated
Reason:      ContainerCannotRun
Message:     OCI runtime create failed: container_linux.go:349:
starting container process caused "process_linux.go:297: applying cgroup configuration for process caused "mkdir /sys/fs/cgroup/memory/kubepods/burstable/<pod-id>/<container-id>>:
cannot allocate memory": unknown
The issue occurs due to the
Kubernetes and
Docker community issues.
According to the RedHat solution,
the workaround is to disable the kernel memory accounting feature
by appending cgroup.memory=nokmem to the kernel command line.
Note
The workaround below applies to the existing clusters only.
The issue is resolved for new Container Cloud 2.9.0 deployments
since the workaround below automatically applies to the VM template
built during the vSphere-based management cluster bootstrap.
Apply the following workaround on each machine of the affected
cluster.
Workaround
SSH to any machine of the affected cluster using mcc-user and the SSH
key provided during the cluster creation to proceed as the root user.
In /etc/default/grub, set cgroup.memory=nokmem for
GRUB_CMDLINE_LINUX.
Bare metal¶[7655] Wrong status for an incorrectly configured L2 template¶
Fixed in 2.11.0
If an L2 template is configured incorrectly, a bare metal cluster is deployed
successfully but with the runtime errors in the IpamHost object.
Workaround:
If you suspect that the machine is not working properly because
of incorrect network configuration, verify the status of the corresponding
IpamHost object. Inspect the l2RenderResult and ipAllocationResult
object fields for error messages.
Storage¶[14051] CephCluster creation fails if manageOsds is enabled before deploy¶
Fixed in 2.9.0
If manageOsds is enabled in the pre-deployment KaaSCephCluster
template, the bare metal management or managed cluster fails to deploy
due to the CephCluster creation failure.
As a workaround, disable manageOsds in the KaaSCephCluster
template before the cluster deployment.
You can enable this parameter after deployment as described in
Ceph advanced configuration.
[10050] Ceph OSD pod is in the CrashLoopBackOff state after disk replacement¶
Fixed in 2.11.0
If you use a custom BareMetalHostProfile, after disk replacement
on a Ceph OSD, the Ceph OSD pod switches to the CrashLoopBackOff state
due to the Ceph OSD authorization key failing to be created properly.
Workaround:
Export kubeconfig of your managed cluster. For example:
IAM¶[13385] MariaDB pods fail to start after SST sync¶
Fixed in 2.12.0
The MariaDB pods fail to start after MariaDB blocks itself during the
State Snapshot Transfers sync.
Workaround:
Verify the failed pod readiness:
kubectl describe pod -n kaas <failedMariadbPodName>
If the readiness probe failed with the WSREP not synced message,
proceed to the next step. Otherwise, assess the MariaDB pod logs
to identify the failure root cause.
Verify that wsrep_local_state_comment is Donor or Desynced:
kubectl exec -it -n kaas <failedMariadbPodName> -- mysql -u root -p<mariadbAdminPassword> -e "SHOW status LIKE \"wsrep_local_state_comment\";"
Restart the failed pod:
kubectl delete pod -n kaas <failedMariadbPodName>
LCM¶[13402] Cluster fails with error: no space left on device¶
Fixed in 2.8.0 for new clusters and in 2.10.0 for existing clusters
If an application running on a Container Cloud management or managed cluster
fails frequently, for example, PostgreSQL, it may produce an excessive amount
of core dumps.
This leads to the no space left on device error on the cluster nodes and,
as a result, to the broken Docker Swarm and the entire cluster.
Core dumps are disabled by default on the operating system of the
Container Cloud nodes. But since Docker does not inherit the operating system
settings, disable core dumps in Docker using the workaround below.
Warning
The workaround below does not apply to the baremetal-based
clusters, including MOS deployments,
since Docker restart may destroy the Ceph cluster.
Workaround:
SSH to any machine of the affected cluster using mcc-user
and the SSH key provided during the cluster creation.
In /etc/docker/daemon.json, add the following parameters:
Repeat the steps above on each machine of the affected cluster one by one.
[13845] Cluster update fails during the LCM Agent upgrade with x509 error¶
Fixed in 2.11.0
During update of a managed cluster from the Cluster releases 6.12.0 to 6.14.0,
the LCM Agent upgrade fails with the following error in logs:
lcmAgentUpgradeStatus:
error:'failed to download agent binary: Get https://<mcc-cache-address>/bin/lcm/bin/lcm-agent/v0.2.0-289-gd7e9fa9c/lcm-agent: x509: certificate signed by unknown authority'
Only clusters initially deployed using Container Cloud 2.4.0 or earlier
are affected.
As a workaround, restart lcm-agent using the
service lcm-agent-* restart command on the affected nodes.
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3.
Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment,
Helm releases may get stuck in the FAILED or UNKNOWN state
although the corresponding machines statuses are Ready
in the Container Cloud web UI. For example, if the StackLight Helm release
fails, the links to its endpoints are grayed out in the web UI.
In the cluster status, providerStatus.helm.ready and
providerStatus.helm.releaseStatuses.<releaseName>.success are false.
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify the failed release has the UNKNOWN or FAILED status
in the HelmBundle object:
[14125] Inaccurate nodes readiness status on a managed cluster¶
Fixed in 2.10.0
A managed cluster deployed or updated on a regional cluster of another
provider type may display inaccurate Nodes readiness live status
in the Container Cloud web UI. While all nodes are ready, the Nodes
status indicates that some nodes are still not ready.
The issue occurs due to the cordon-drain desynchronization between
the LCMClusterState objects and the actual state of the cluster.
Note
The workaround below must be applied only by users with
the writer or cluster-admin access role
assigned by the Infrastructure Operator.
To verify that the cluster is affected:
Export the regional cluster kubeconfig created during the
regional cluster deployment:
Replace the parameters enclosed in angle brackets with the SSH key that
was used for the managed cluster deployment and the private IP address
of any control plane node of the cluster.
If the status of the Kubernetes and Swarm nodes is ready,
proceed with the next steps.
Otherwise, assess the cluster logs to identify the issue with not
ready nodes.
Obtain the LCMClusterState items related to the swarm-drain
and cordon-drain type:
The command above outputs the list of all LCMClusterState items.
Verify only the LCMClusterState items names
that start with the swarm-drain- and cordon-drain- prefix.
Verify the status of each LCMClusterState item of the
swarm-drain and cordon-drain type:
For cordon-drain, spec.value and status.value are "false"
For swarm-drain, spec.value is "true" and the
status.message contains an error related to waiting for the Kubernetes
cordon-drain to finish
Workaround:
For each LCMClusterState item of the swarm-drain type with
spec.value=="true" and the status.message described above,
replace "true" with "false" in spec.value:
Upgrade¶[13292] Local volume provisioner pod stuck in Terminating status after upgrade¶
After upgrade of Container Cloud from 2.6.0 to 2.7.0, the local
volume provisioner pod in the default project is stuck in the
Terminating status, even after upgrade to 2.8.0.
This issue does not affect functioning of the management, regional, or
managed clusters. The issue does not prevent the successful upgrade of
the cluster.
Helm releases may get stuck in the PENDING_UPGRADE status
during a management or managed cluster upgrade. The HelmBundle Controller
cannot recover from this state and requires manual actions. The workaround
below describes the recovery process for the openstack-operator release
that got stuck during a managed cluster update. Use it as an example for other
Helm releases as required.
[14152] Managed cluster upgrade fails due to DNS issues¶
Fixed in 2.10.0
A managed cluster release upgrade may fail due to DNS issues on pods with
host networking. If this is the case, the DNS names of the Kubernetes services
on the affected pod cannot be resolved.
Workaround:
Export kubeconfig of the affected managed cluster. For example:
Container Cloud web UI¶[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.8.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Introduces support for the Cluster release 5.14.0
that is based on Kubernetes 1.18, Mirantis Container Runtime 19.03.14,
and Mirantis Kubernetes Engine 3.3.6.
Supports deprecated Cluster releases 5.13.0 and
6.12.0 that will become unsupported
in one of the following Container Cloud releases.
Supports the Cluster release 5.11.0 only for attachment of existing
MKE 3.3.4 clusters. For the deployment of new or attachment of existing
MKE 3.3.6 clusters, the latest available Cluster release is used.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.7.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.7.0.
For the list of enhancements in the Cluster release 5.14.0 and
Cluster release 6.14.0 that are supported by the Container Cloud release
2.7.0, see the 5.14.0 and 6.14.0 sections.
Introduced general availability support for the VMware vSphere provider
after completing full integration of the vSphere provider on RHEL with
Container Cloud.
During the Container Cloud 2.6.0 - 2.7.0 release cycle, added the following
improvements:
Implemented a universal SSH user mcc-user to replace the existing
default SSH user names. The mcc-user user name is applicable
to any Container Cloud provider and node type, including Bastion.
The existing SSH user names are deprecated as of Container Cloud 2.7.0.
SSH keys will be managed only for mcc-user as of one of the following
Container Cloud releases.
Configuration of SSH keys on existing clusters using web UI¶
Implemented the possibility to configure SSH keys on existing clusters using
the Container Cloud web UI. You can now add or remove SSH keys on running
managed clusters using the Configure cluster web UI menu.
After the update of your Cluster release to the latest version
supported by 2.7.0 for the OpenStack and AWS-based managed clusters,
a one-time redeployment of the Bastion node is required to apply the first
configuration change of SSH keys. For this purpose, the
Allow Bastion Redeploy one-time check box is added to the
Configure Cluster wizard in the Container Cloud web UI.
Note
After the Bastion node redeploys on the AWS-based clusters,
its public IP address changes.
Implemented the possibility to monitor live status of a cluster and machine
deployment or update using the Container Cloud web UI. You can now follow
the deployment readiness and health of essential cluster components, such as
Helm, Kubernetes, kubelet, Swarm, OIDC, StackLight, and others. For machines,
you can monitor nodes readiness reported by kubelet and nodes health reported
by Swarm.
Enabling of proxy access using web UI for vSphere, AWS, and bare metal¶
Extended the Container Cloud web UI with the parameters that enable
proxy access on managed clusters for the remaining cloud providers:
vSphere, AWS, and bare metal.
The following issues have been addressed in the Mirantis Container Cloud
release 2.7.0 along with the Cluster releases 5.14.0 and 6.14.0:
[13176] [vSphere] Fixed the issue with the cluster network settings
related to IPAM disappearing from the cluster provider spec and
leading to invalid metadata provided to virtual machines.
[12683] [vSphere] Fixed the issue with the kaas-ipam pods being installed
and continuously restarted even if IPAM was disabled on the vSphere-based
regional cluster deployed on top of an AWS-based management cluster.
[12305] [Ceph] Fixed the issue with inability to define the CRUSH map rules
through the KaaSCephCluster custom resource. For details, see
Operations Guide: Ceph advanced configuration.
[10060] [Ceph] Fixed the issue with a Ceph OSD node removal
not being triggered properly and failing
after updating the KaasCephCluster custom resource (CR).
[13078] [StackLight] Fixed the issue with Elasticsearch not receiving
data from Fluentd due to the limit of open index shards per node.
AWS¶[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments.
Moving forward, the workaround for this issue will be moved from
Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state
with the pod has unbound immediate PersistentVolumeClaims and
node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data
can be rebuilt from replicas. If you have a non-HA deployment,
back up any existing data before proceeding,
since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts
of the affected pods:
vSphere¶[14458] Failure to create a container for pod: cannot allocate memory¶
Fixed in 2.9.0 for new clusters
Newly created pods may fail to run and have the CrashLoopBackOff status
on long-living Container Cloud clusters deployed on RHEL 7.8 using the VMware
vSphere provider. The following is an example output of the
kubectl describe pod <pod-name> -n <projectName> command:
State:       Waiting
Reason:      CrashLoopBackOff
Last State:  Terminated
Reason:      ContainerCannotRun
Message:     OCI runtime create failed: container_linux.go:349:
starting container process caused "process_linux.go:297: applying cgroup configuration for process caused "mkdir /sys/fs/cgroup/memory/kubepods/burstable/<pod-id>/<container-id>>:
cannot allocate memory": unknown
The issue occurs due to the
Kubernetes and
Docker community issues.
According to the RedHat solution,
the workaround is to disable the kernel memory accounting feature
by appending cgroup.memory=nokmem to the kernel command line.
Note
The workaround below applies to the existing clusters only.
The issue is resolved for new Container Cloud 2.9.0 deployments
since the workaround below automatically applies to the VM template
built during the vSphere-based management cluster bootstrap.
Apply the following workaround on each machine of the affected
cluster.
Workaround
SSH to any machine of the affected cluster using mcc-user and the SSH
key provided during the cluster creation to proceed as the root user.
In /etc/default/grub, set cgroup.memory=nokmem for
GRUB_CMDLINE_LINUX.
Bare metal¶[7655] Wrong status for an incorrectly configured L2 template¶
Fixed in 2.11.0
If an L2 template is configured incorrectly, a bare metal cluster is deployed
successfully but with the runtime errors in the IpamHost object.
Workaround:
If you suspect that the machine is not working properly because
of incorrect network configuration, verify the status of the corresponding
IpamHost object. Inspect the l2RenderResult and ipAllocationResult
object fields for error messages.
Storage¶[7073] Cannot automatically remove a Ceph node¶
When removing a worker node, it is not possible to automatically remove a Ceph
node. The workaround is to manually remove the Ceph node from the Ceph
cluster as described in Operations Guide: Add, remove, or reconfigure
Ceph nodes before removing the worker node
from your deployment.
[10050] Ceph OSD pod is in the CrashLoopBackOff state after disk replacement¶
Fixed in 2.11.0
If you use a custom BareMetalHostProfile, after disk replacement
on a Ceph OSD, the Ceph OSD pod switches to the CrashLoopBackOff state
due to the Ceph OSD authorization key failing to be created properly.
Workaround:
Export kubeconfig of your managed cluster. For example:
[12723] ceph_role_* labels remain after deleting a node from KaaSCephCluster¶
Fixed in 2.8.0
The ceph_role_mon and ceph_role_mgr labels that Ceph Controller
assigns to a node during a Ceph cluster creation are not automatically
removed after deleting a node from KaaSCephCluster.
As a workaround, manually remove the labels using the following commands:
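For example, with a placeholder node name (a trailing dash removes a label):
kubectl label node <nodeName> ceph_role_mon-
kubectl label node <nodeName> ceph_role_mgr-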
IAM¶[13385] MariaDB pods fail to start after SST sync¶
Fixed in 2.12.0
The MariaDB pods fail to start after MariaDB blocks itself during the
State Snapshot Transfers sync.
Workaround:
Verify the failed pod readiness:
kubectl describe pod -n kaas <failedMariadbPodName>
If the readiness probe failed with the WSREP not synced message,
proceed to the next step. Otherwise, assess the MariaDB pod logs
to identify the failure root cause.
Verify that wsrep_local_state_comment is Donor or Desynced:
kubectl exec -it -n kaas <failedMariadbPodName> -- mysql -u root -p<mariadbAdminPassword> -e "SHOW status LIKE \"wsrep_local_state_comment\";"
Restart the failed pod:
kubectl delete pod -n kaas <failedMariadbPodName>
LCM¶[13845] Cluster update fails during the LCM Agent upgrade with x509 error¶
Fixed in 2.11.0
During update of a managed cluster from the Cluster releases 6.12.0 to 6.14.0,
the LCM Agent upgrade fails with the following error in logs:
lcmAgentUpgradeStatus:
error:'failed to download agent binary: Get https://<mcc-cache-address>/bin/lcm/bin/lcm-agent/v0.2.0-289-gd7e9fa9c/lcm-agent: x509: certificate signed by unknown authority'
Only clusters initially deployed using Container Cloud 2.4.0 or earlier
are affected.
As a workaround, restart lcm-agent using the
service lcm-agent-* restart command on the affected nodes.
[13381] Management and regional clusters with enabled proxy are unreachable¶
Fixed in 2.8.0
After bootstrap, requests to apiserver fail
on the management and regional clusters with enabled proxy.
As a workaround, before running bootstrap.sh,
add the entire range of IP addresses that will be used for floating IPs
to the NO_PROXY environment variable.
[13402] Cluster fails with error: no space left on device¶
Fixed in 2.8.0 for new clusters and in 2.10.0 for existing clusters
If an application running on a Container Cloud management or managed cluster
fails frequently, for example, PostgreSQL, it may produce an excessive amount
of core dumps.
This leads to the no space left on device error on the cluster nodes and,
as a result, to the broken Docker Swarm and the entire cluster.
Core dumps are disabled by default on the operating system of the
Container Cloud nodes. But since Docker does not inherit the operating system
settings, disable core dumps in Docker using the workaround below.
Warning
The workaround below does not apply to the baremetal-based
clusters, including MOS deployments,
since Docker restart may destroy the Ceph cluster.
Workaround:
SSH to any machine of the affected cluster using mcc-user
and the SSH key provided during the cluster creation.
In /etc/docker/daemon.json, add the following parameters:
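For example, core dumps can be disabled through the Docker default ulimits; the snippet below is a sketch and may differ from the exact parameters recommended for your environment:
{
  "default-ulimits": {
    "core": {
      "Name": "core",
      "Hard": 0,
      "Soft": 0
    }
  }
}
Docker must be restarted for the change to take effect, which is why this workaround must not be applied to baremetal-based clusters, as stated in the warning above.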
Repeat the steps above on each machine of the affected cluster one by one.
[8112] Nodes occasionally become Not Ready on long-running clusters¶
On long-running Container Cloud clusters, one or more nodes may occasionally
become NotReady with different errors in the ucp-kubelet containers
of failed nodes.
As a workaround, restart ucp-kubelet on the failed node:
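For example, assuming kubelet runs as the ucp-kubelet Docker container referenced above:
docker restart ucp-kubelet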
[10029] Authentication fails with the 401 Unauthorized error¶
Authentication may not work on some controller nodes after a managed cluster
creation. As a result, the Kubernetes API operations with the managed cluster
kubeconfig fail with ResponseStatus: 401 Unauthorized.
As a workaround, manually restart the ucp-controller and ucp-auth
Docker services on the affected node.
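For example, on the affected node, list the MKE containers that match these services and restart them (the container name is a placeholder):
docker ps --format '{{.Names}}' | grep -E 'ucp-controller|ucp-auth'
docker restart <matchingContainerName>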
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3.
Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment,
Helm releases may get stuck in the FAILED or UNKNOWN state
although the corresponding machines statuses are Ready
in the Container Cloud web UI. For example, if the StackLight Helm release
fails, the links to its endpoints are grayed out in the web UI.
In the cluster status, providerStatus.helm.ready and
providerStatus.helm.releaseStatuses.<releaseName>.success are false.
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify the failed release has the UNKNOWN or FAILED status
in the HelmBundle object:
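For example, using the management cluster kubeconfig (the HelmBundle name and project namespace are placeholders):
kubectl get helmbundle <helmBundleName> -n <projectName> -o yaml
Inspect the status of the stacklight release in the output.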
Upgrade¶[13292] Local volume provisioner pod stuck in Terminating status after upgrade¶
After upgrade of Container Cloud from 2.6.0 to 2.7.0, the local
volume provisioner pod in the default project is stuck in the
Terminating status, even after upgrade to 2.8.0.
This issue does not affect functioning of the management, regional, or
managed clusters. The issue does not prevent the successful upgrade of
the cluster.
Helm releases may get stuck in the PENDING_UPGRADE status
during a management or managed cluster upgrade. The HelmBundle Controller
cannot recover from this state and requires manual actions. The workaround
below describes the recovery process for the openstack-operator release
that got stuck during a managed cluster update. Use it as an example for other
Helm releases as required.
Container Cloud web UI¶[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.7.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Introduces support for the Cluster release 5.13.0
that is based on Kubernetes 1.18, Mirantis Container Runtime 19.03.14,
and Mirantis Kubernetes Engine 3.3.6.
Still supports deprecated Cluster releases 5.12.0 and
6.10.0 that will become unsupported
in one of the following Container Cloud releases.
Supports the Cluster release 5.11.0 only for attachment of existing
MKE 3.3.4 clusters. For the deployment of new or attachment of existing
MKE 3.3.6 clusters, the latest available Cluster release is used.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.6.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.6.0.
For the list of enhancements in the Cluster release 5.13.0 and
Cluster release 6.12.0 that are supported by the Container Cloud release 2.6.0,
see the 5.13.0 and 6.12.0 sections.
In the scope of Technology Preview support for the VMware vSphere cloud
provider on RHEL, added an additional RHEL license activation method
that uses the activation key through RedHat Customer Portal or
RedHat Satellite server.
The Satellite configuration on the hosts is done by installing a specific
pre-generated RPM package from the Satellite package URL that the user
provides through the API. The activation key is also provided through the API.
Along with the new activation method, you can still use the existing method,
that is, adding your RHEL subscription with the user name and password
of your RedHat Customer Portal account associated with your RHEL license
for Virtual Datacenters.
In the scope of Technology Preview support for the VMware vSphere cloud
provider on RHEL, added support for VMware vSphere Distributed Switch (VDS)
to provide networking to the vSphere virtual machines.
This is an alternative to the vSphere Standard Switch with a network on top
of it. A VM is attached to a VDS port group. You can specify the path
to the port group using the NetworkPath parameter in
VsphereClusterProviderSpec.
VMware vSphere provider integration with IPAM controller¶
Technology Preview
In the scope of Technology Preview support for the VMware vSphere cloud
provider on RHEL, enabled the vSphere provider to use IPAM controller
to assign IP addresses to VMs automatically, without an external DHCP server.
If the IPAM controller is not enabled in the bootstrap template,
the vSphere provider must rely on external provisioning of the IP addresses
by a DHCP server of the user infrastructure.
Extended proxy support by enabling the feature for the remaining supported
AWS and bare metal cloud providers.
If you require all Internet access to go through a proxy server
for security and audit purposes, you can now bootstrap
management and regional clusters of any cloud provider type using proxy.
You can also enable a separate proxy access on the OpenStack-based managed
clusters using the Container Cloud web UI. This proxy is intended
for the end user needs and is not used for a managed cluster deployment
or for access to the Mirantis resources.
Caution
Enabling of proxy access using the Container Cloud web UI
for the vSphere, AWS, and baremetal-based managed clusters
is in the final development stage and will become available
in the next release.
Updated documentation on the bare metal networking¶
Expanded and restructured the bare metal networking documentation that now
contains the following subsections with a detailed description
of every bare metal network type:
The following issues have been addressed in the Mirantis Container Cloud
release 2.6.0 and the Cluster release 5.13.0:
[11302] [LCM] Fixed the issue with inability to delete a Container Cloud
project with attached MKE clusters that failed to be cleaned up
properly.
[11967] [LCM] Added vrrp_script chk_myscript to the Keepalived
configuration to prevent issues with VIP (Virtual IP) pointing to a node
with broken Kubernetes API.
[10491] [LCM] Fixed the issue with kubelet being randomly stuck,
for example, after a management cluster upgrade.
The fix enables automatic restart of kubelet in case of failures.
[7782] [bootstrap] Renamed the SSH key used during bootstrap for every cloud
provider from openstack_tmp to an accurate and clear ssh_key.
[11927] [StackLight] Fixed the issue with StackLight failing to integrate
with an external proxy with authentication handled by a proxy server
and ignoring the HTTP Authorization header for basic authentication
passed by Prometheus Alertmanager.
[11001] [StackLight] Fixed the issue with Patroni pod failing to start
and remaining in the CrashLoopBackOff status
after the management cluster update.
[10829] [IAM] Fixed the issue with the Keycloak pods failing to start
during a management cluster bootstrap with the Failed to update database
exception in logs.
[11468] [BM] Fixed the issue with the persistent volumes (PVs) that are
created using local volume provisioner (LVP) not being mounted on the
dedicated disk labeled as local-volume and using the root volume instead.
[9875] [BM] Fixed the issue with the bootstrap.sh preflight script
failing with a timeout waiting for BareMetalHost if
KAAS_BM_FULL_PREFLIGHT was enabled.
[11633] [vSphere] Fixed the issue with the vSphere-based managed cluster
projects failing to be cleaned up because of stale secret(s)
related to the RHEL license object(s).
AWS¶[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments.
Moving forward, the workaround for this issue will be moved from
Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state
with the pod has unbound immediate PersistentVolumeClaims and
node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data
can be rebuilt from replicas. If you have a non-HA deployment,
back up any existing data before proceeding,
since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts
of the affected pods:
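For example (the pod name and namespace are placeholders):
kubectl describe pod <affectedPodName> -n <namespace> | grep ClaimName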
vSphere¶[12683] The kaas-ipam pods restart on the vSphere region with IPAM disabled¶
Fixed in Container Cloud 2.7.0
Even though IPAM is disabled on the vSphere-based regional cluster deployed
on top of an AWS-based management cluster, the regional cluster still has
the kaas-ipam pods installed and continuously restarts them.
In this case, the pod logs contain the following example errors:
Waiting for CRDs. [baremetalhosts.metal3.io clusters.cluster.k8s.io machines.cluster.k8s.io
ipamhosts.ipam.mirantis.com ipaddrs.ipam.mirantis.com subnets.ipam.mirantis.com subnetpools.ipam.mirantis.com \
l2templates.ipam.mirantis.com] not found yet
E0318 11:58:21.067502 1 main.go:240] Fetch CRD list failed: \
Object 'Kind' is missing in 'unstructured object has no kind'
As a result, the KubePodCrashLooping StackLight alerts are firing
in Alertmanager for kaas-ipam. Disregard these alerts.
[13176] ClusterNetwork settings may disappear from the cluster provider spec¶
Fixed in Container Cloud 2.7.0
A vSphere-based cluster with IPAM enabled may lose cluster network settings
related to IPAM leading to invalid metadata provided to virtual machines.
As a result, virtual machines cannot obtain the assigned IP addresses.
The issue occurs during a management cluster bootstrap or a managed
cluster creation.
Workaround:
If the management cluster with IPAM enabled is not deployed yet,
follow the steps below before launching the bootstrap.sh script:
Open kaas-bootstrap/releases/kaas/2.6.0.yaml for editing.
Change the release-controller version from 1.18.1 to 1.18.3:
Now, you can deploy managed clusters with IPAM enabled.
Bare metal¶[7655] Wrong status for an incorrectly configured L2 template¶
Fixed in 2.11.0
If an L2 template is configured incorrectly, a bare metal cluster is deployed
successfully but with the runtime errors in the IpamHost object.
Workaround:
If you suspect that the machine is not working properly because
of incorrect network configuration, verify the status of the corresponding
IpamHost object. Inspect the l2RenderResult and ipAllocationResult
object fields for error messages.
StackLight¶[13078] Elasticsearch does not receive data from Fluentd¶
Fixed in Container Cloud 2.7.0
Elasticsearch may stop receiving new data from Fluentd. In such case, error
messages similar to the following will be present in
fluentd-elasticsearch logs:
ElasticsearchError error="400 - Rejected by Elasticsearch [error type]: illegal_argument_exception [reason]: 'Validation Failed: 1: this action would add [15] total shards, but this cluster currently has [2989]/[3000] maximum shards open;'" location=nil tag="ucp-kubelet"
The workaround is to manually increase the limit of open index shards per
node:
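For example, using the Elasticsearch cluster settings API (the host and the new limit are placeholders):
curl -X PUT "http://<elasticsearchHost>:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.max_shards_per_node": <newShardLimit>}}'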
A Ceph node removal is not being triggered properly after updating
the KaaSCephCluster custom resource (CR). Both management and managed
clusters are affected.
When removing a worker node, it is not possible to automatically remove a Ceph
node. The workaround is to manually remove the Ceph node from the Ceph
cluster as described in Operations Guide: Add, remove, or reconfigure
Ceph nodes before removing the worker node
from your deployment.
[10050] Ceph OSD pod is in the CrashLoopBackOff state after disk replacement¶
Fixed in 2.11.0
If you use a custom BareMetalHostProfile, after disk replacement
on a Ceph OSD, the Ceph OSD pod switches to the CrashLoopBackOff state
due to the Ceph OSD authorization key failing to be created properly.
Workaround:
Export kubeconfig of your managed cluster. For example:
[12723] ceph_role_* labels remain after deleting a node from KaaSCephCluster¶
Fixed in 2.8.0
The ceph_role_mon and ceph_role_mgr labels that Ceph Controller
assigns to a node during a Ceph cluster creation are not automatically
removed after deleting a node from KaaSCephCluster.
As a workaround, manually remove the labels using the following commands:
LCM¶[13402] Cluster fails with error: no space left on device¶
Fixed in 2.8.0 for new clusters and in 2.10.0 for existing clusters
If an application running on a Container Cloud management or managed cluster
fails frequently, for example, PostgreSQL, it may produce an excessive amount
of core dumps.
This leads to the no space left on device error on the cluster nodes and,
as a result, to the broken Docker Swarm and the entire cluster.
Core dumps are disabled by default on the operating system of the
Container Cloud nodes. But since Docker does not inherit the operating system
settings, disable core dumps in Docker using the workaround below.
Warning
The workaround below does not apply to the baremetal-based
clusters, including MOS deployments,
since Docker restart may destroy the Ceph cluster.
Workaround:
SSH to any machine of the affected cluster using mcc-user
and the SSH key provided during the cluster creation.
In /etc/docker/daemon.json, add the following parameters:
Repeat the steps above on each machine of the affected cluster one by one.
[10029] Authentication fails with the 401 Unauthorized error¶
Authentication may not work on some controller nodes after a managed cluster
creation. As a result, the Kubernetes API operations with the managed cluster
kubeconfig fail with ResponseStatus: 401 Unauthorized.
As a workaround, manually restart the ucp-controller and ucp-auth
Docker services on the affected node.
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3.
Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment,
Helm releases may get stuck in the FAILED or UNKNOWN state
although the corresponding machines statuses are Ready
in the Container Cloud web UI. For example, if the StackLight Helm release
fails, the links to its endpoints are grayed out in the web UI.
In the cluster status, providerStatus.helm.ready and
providerStatus.helm.releaseStatuses.<releaseName>.success are false.
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify the failed release has the UNKNOWN or FAILED status
in the HelmBundle object:
Helm releases may get stuck in the PENDING_UPGRADE status
during a management or managed cluster upgrade. The HelmBundle Controller
cannot recover from this state and requires manual actions. The workaround
below describes the recovery process for the openstack-operator release
that got stuck during a managed cluster update. Use it as an example for other
Helm releases as required.
Container Cloud web UI¶[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.6.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Introduces support for the Cluster release 5.12.0
that is based on Kubernetes 1.18, Mirantis Container Runtime 19.03.14,
and the updated version of Mirantis Kubernetes Engine 3.3.6.
Still supports previous Cluster releases 5.11.0 and
6.10.0 that are now deprecated and will become unsupported
in one of the following Container Cloud releases.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.5.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.5.0.
For the list of enhancements in the Cluster release 5.12.0 and
Cluster release 6.12.0 that are supported by the Container Cloud release 2.5.0,
see the 5.12.0 and 6.12.0 sections.
Proxy support for OpenStack and VMware vSphere providers¶
Implemented proxy support for OpenStack-based and
vSphere-based Technology Preview clusters.
If you require all Internet access to go through a proxy server
for security and audit purposes, you can now bootstrap
management and regional clusters using proxy.
You can also enable a separate proxy access on an OpenStack-based
managed cluster using the Container Cloud web UI. This proxy is intended
for the end user needs and is not used for a managed cluster deployment
or for access to the Mirantis resources.
Note
The proxy support for:
The OpenStack provider is generally available.
The VMware vSphere provider is available as Technology Preview.
For the Technology Preview feature definition, refer to
Technology Preview features.
The AWS and bare metal providers is in the development
stage and will become available in the future Container Cloud
releases.
Introduced artifacts caching support for all Container Cloud providers
to enable deployment of managed clusters without direct Internet access.
The Mirantis artifacts used during managed clusters deployment are downloaded
through a cache running on a regional cluster.
The feature is enabled by default on new managed clusters based on
the Cluster releases 5.12.0 and 6.12.0 and will be automatically enabled
on existing clusters during upgrade to the latest version.
Implemented the possibility to configure regional NTP server parameters
to be applied to all machines of regional and managed clusters in the
specified region.
The feature is applicable to all supported cloud providers.
The NTP server parameters can be added before or after management
and regional clusters deployment.
Optimized the ClusterRelease upgrade process by enabling
the Container Cloud provider to upgrade the LCMCluster components,
such as MKE, before the HelmBundle components, such as StackLight or Ceph.
Dedicated network for external connection to the Kubernetes services¶
Technology Preview
Implemented the k8s-ext bridge in L2 templates that allows you to use
a dedicated network for external connection to the Kubernetes services
exposed by the cluster. When using such a bridge, the MetalLB ranges and the
IP addresses provided by the subnet that is associated with the bridge
must fit in the same CIDR.
If enabled, MetalLB will listen and respond on the dedicated virtual bridge.
Also, you can create additional subnets to configure additional address
ranges for MetalLB.
Caution
Use of a dedicated network for Kubernetes pods traffic,
for external connection to the Kubernetes services exposed
by the cluster, and for the Ceph cluster access and replication
traffic is available as Technology Preview. Use such
configurations for testing and evaluation purposes only.
For the Technology Preview feature definition,
refer to Technology Preview features.
The following issues have been addressed in the Mirantis Container Cloud
release 2.5.0 and the Cluster releases 5.12.0 and 6.12.0:
[10453] [LCM] Fixed the issue with time synchronization on nodes
that could cause networking issues.
[9748] [LCM] Fixed the issue with the false-positive helmRelease success
status in HelmBundle during Helm upgrade operations.
[8464] Fixed the issue with Helm controller and OIDC integration
failing to be deleted during detach of an MKE cluster.
[9928] [Ceph] Fixed the issue with Ceph rebalance leading to data loss
during a managed cluster update by implementing the maintenance label
to be set before and unset after the cluster update.
[9892] [Ceph] Fixed the issue with Ceph being locked during a managed cluster
update by adding the PodDisruptionBudget object that enables minimum
2 Ceph OSD nodes running without rescheduling during update.
[6988] [BM] Fixed the issue with LVM failing to deploy on a new disk
if an old volume group with the same name already existed on the target
hardware node but on the different disk.
[8560] [BM] Fixed the issue with manual deletion of BareMetalHost
from a managed cluster leading to its silent removal without a power-off
and deprovision. The fix adds the admission controller webhook to validate
the old BareMetalHost when the deletion is requested.
[11102] [BM] Fixed the issue with Keepalived not detecting and restoring
a VIP of a managed cluster node after running
the netplan apply command.
[9905] [9906] [9909] [9914] [9921] [BM] Fixed the following
Ubuntu CVEs in the bare metal Docker images:
AWS¶[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments.
Moving forward, the workaround for this issue will be moved from
Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state
with the pod has unbound immediate PersistentVolumeClaims and
node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data
can be rebuilt from replicas. If you have a non-HA deployment,
back up any existing data before proceeding,
since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts
of the affected pods:
vSphere¶[11633] A vSphere-based project cannot be cleaned up¶
Fixed in Container Cloud 2.6.0
A vSphere-based managed cluster project can fail to be cleaned up
because of stale secret(s) related to the RHEL license object(s).
Before you can successfully clean up such project,
manually delete the secret using the steps below.
Workaround:
Log in to a local machine where your management cluster kubeconfig
is located and where kubectl is installed.
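For example, locate and delete the stale secret; the grep pattern and the names below are assumptions, adjust them to your environment:
kubectl get secrets -n <projectName> | grep rhel
kubectl delete secret <staleRhelLicenseSecretName> -n <projectName>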
Bare metal¶[7655] Wrong status for an incorrectly configured L2 template¶
Fixed in 2.11.0
If an L2 template is configured incorrectly, a bare metal cluster is deployed
successfully but with the runtime errors in the IpamHost object.
Workaround:
If you suspect that the machine is not working properly because
of incorrect network configuration, verify the status of the corresponding
IpamHost object. Inspect the l2RenderResult and ipAllocationResult
object fields for error messages.
[9875] Full preflight fails with a timeout waiting for BareMetalHost¶
Fixed in Container Cloud 2.6.0
If you run bootstrap.sh preflight with
KAAS_BM_FULL_PREFLIGHT=true, the script fails with the following message:
As a workaround, unset full preflight using unset KAAS_BM_FULL_PREFLIGHT
to run fast preflight instead.
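For example:
unset KAAS_BM_FULL_PREFLIGHT
./bootstrap.sh preflight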
[11468] Pods using LVP PV are not mounted to LVP disk¶
Fixed in Container Cloud 2.6.0
The persistent volumes (PVs) that are created using local volume provisioner
(LVP), are not mounted on the dedicated disk labeled as local-volume
and use the root volume instead. In the workaround below, we use StackLight
volumes as an example.
Workaround:
Identify whether your cluster is affected:
Log in to any control plane node on the management cluster.
In this entry, replace the old directory
/var/lib/local-volumes/stacklight/elasticsearch-data/vol00 with the
new one: /mnt/local-volumes/src/stacklight/elasticsearch-data/vol00.
A Ceph node removal is not being triggered properly after updating
the KaaSCephCluster custom resource (CR). Both management and managed
clusters are affected.
When removing a worker node, it is not possible to automatically remove a Ceph
node. The workaround is to manually remove the Ceph node from the Ceph
cluster as described in Operations Guide: Add, remove, or reconfigure
Ceph nodes before removing the worker node
from your deployment.
[10050] Ceph OSD pod is in the CrashLoopBackOff state after disk replacement¶
Fixed in 2.11.0
If you use a custom BareMetalHostProfile, after disk replacement
on a Ceph OSD, the Ceph OSD pod switches to the CrashLoopBackOff state
due to the Ceph OSD authorization key failing to be created properly.
Workaround:
Export kubeconfig of your managed cluster. For example:
IAM¶[10829] Keycloak pods fail to start during a management cluster bootstrap¶
Fixed in Container Cloud 2.6.0
The Keycloak pods may fail to start during a management cluster bootstrap
with the Failed to update database exception in logs.
Caution
The following workaround is applicable only to deployments
where mariadb-server has started successfully. Otherwise,
fix the issues with MariaDB first.
Workaround:
Verify that mariadb-server has started:
kubectl get po -n kaas | grep mariadb-server
Scale down the Keycloak instances:
kubectl scale sts iam-keycloak --replicas=0 -n kaas
Open the iam-keycloak-sh configmap for editing:
kubectl edit cm -n kaas iam-keycloak-sh
On the last line of the configmap, before the $MIGRATION_ARGS variable,
add the following parameter:
The recommended timeout value is at least 15 minutes, set in seconds.
In the Keycloak StatefulSet, adjust liveness probe timeouts:
kubectl edit sts -n kaas iam-keycloak
Scale up the Keycloak instances:
kubectl scale sts iam-keycloak --replicas=3 -n kaas
LCM¶[10029] Authentication fails with the 401 Unauthorized error¶
Authentication may not work on some controller nodes after a managed cluster
creation. As a result, the Kubernetes API operations with the managed cluster
kubeconfig fail with ResponseStatus: 401 Unauthorized.
As a workaround, manually restart the ucp-controller and ucp-auth
Docker services on the affected node.
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3.
Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment,
Helm releases may get stuck in the FAILED or UNKNOWN state
although the corresponding machines statuses are Ready
in the Container Cloud web UI. For example, if the StackLight Helm release
fails, the links to its endpoints are grayed out in the web UI.
In the cluster status, providerStatus.helm.ready and
providerStatus.helm.releaseStatuses.<releaseName>.success are false.
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify the failed release has the UNKNOWN or FAILED status
in the HelmBundle object:
After the management cluster update, a Patroni pod may fail to start and remain
in the CrashLoopBackOff status. Messages similar to the following ones may
be present in Patroni logs:
Local timeline=4 lsn=0/A000000
master_timeline=6
master: history=1 0/1ADEB48 no recovery target specified
2 0/8044500 no recovery target specified
3 0/A0000A0 no recovery target specified
4 0/A1B6CB0 no recovery target specified
5 0/A2C0C80 no recovery target specified
As a workaround, reinitialize the affected pod with a new volume by deleting
the pod itself and the associated PersistentVolumeClaim (PVC).
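The commands below assume that the POD_NAME and POD_PVC shell variables hold the affected pod name and the name of its PVC, for example:
POD_NAME=<affectedPatroniPodName>   # placeholder, the affected pod in the stacklight namespace
POD_PVC=<pvcBoundToTheAffectedPod>  # placeholder, as shown by kubectl -n stacklight get pvc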
kubectl -n stacklight delete "pod/${POD_NAME}" "pvc/${POD_PVC}"
sleep 3  # wait for StatefulSet to reschedule the pod, but miss dependent PVC creation
kubectl -n stacklight delete "pod/${POD_NAME}"
Management and regional clusters¶[9899] Helm releases get stuck in PENDING_UPGRADE during cluster update¶
Helm releases may get stuck in the PENDING_UPGRADE status
during a management or managed cluster upgrade. The HelmBundle Controller
cannot recover from this state and requires manual actions. The workaround
below describes the recovery process for the openstack-operator release
that got stuck during a managed cluster update. Use it as an example for other
Helm releases as required.
Container Cloud web UI¶[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.5.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Introduces support for the Cluster release 5.11.0
that is based on Kubernetes 1.18, Mirantis Kubernetes Engine 3.3.4,
and the updated version of Mirantis Container Runtime 19.03.14.
Still supports previous Cluster releases 5.10.0 and
6.8.1 that are now deprecated and will become unsupported
in one of the following Container Cloud releases.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.4.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.4.0.
For the list of enhancements in the Cluster release 5.11.0 and
Cluster release 6.10.0 that are supported by the Container Cloud release 2.4.0,
see the 5.11.0 and 6.10.0 sections.
Due to development limitations, the MCR upgrade to version
19.03.13 or 19.03.14 on existing Container Cloud clusters
is not supported.
Dedicated network for Kubernetes pods traffic on bare metal clusters¶
Technology Preview
Implemented the k8s-pods bridge in L2 templates that allows you to use
a dedicated network for Kubernetes pods traffic.
When the k8s-pods bridge is defined in an L2 template,
Calico CNI uses that network for routing the pods traffic between nodes.
Caution
Use of a dedicated network for Kubernetes pods traffic
described above is available as Technology Preview. Use such a
configuration for testing and evaluation purposes only.
For the Technology Preview feature definition,
refer to Technology Preview features.
The following features are still under development and will be
announced in one of the following Container Cloud releases:
Switching Kubernetes API to listen to the specified IP address
on the node
Enabling MetalLB to listen and respond on the dedicated
virtual bridge.
Feedback form improvement in Container Cloud web UI¶
Extended the functionality of the feedback form for the Container Cloud
web UI. Using the Feedback button, you can now provide
5-star product rating and feedback about Container Cloud.
If you have an idea or found a bug in Container Cloud,
you can create a ticket for the Mirantis support team
to help us improve the product.
AWS¶[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments.
Moving forward, the workaround for this issue will be moved from
Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state
with the pod has unbound immediate PersistentVolumeClaims and
node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data
can be rebuilt from replicas. If you have a non-HA deployment,
back up any existing data before proceeding,
since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts
of the affected pods:
As a workaround, unset full preflight using unset KAAS_BM_FULL_PREFLIGHT
to run fast preflight instead.
[11102] Keepalived does not detect the loss of VIP deleted by netplan¶
Fixed in Container Cloud 2.5.0
This issue may occur on the baremetal-based managed clusters
that are created using L2 templates when network configuration is changed
by the user or when Container Cloud is updated from version 2.3.0 to 2.4.0.
Due to the community issue,
Keepalived 1.3.9 does not detect and restore a VIP of a managed cluster node
after running the netplan apply command. The command is used to
apply network configuration changes.
As a result, the Kubernetes API on the affected managed clusters becomes
inaccessible.
As a workaround, log in to all nodes of the affected managed clusters and
restart Keepalived using systemctl restart keepalived.
[6988] LVM fails to deploy if the volume group name already exists¶
Fixed in Container Cloud 2.5.0
During a management or managed cluster deployment, LVM cannot be deployed
on a new disk if an old volume group with the same name already exists
on the target hardware node but on the different disk.
Workaround:
In the bare metal host profile specific to your hardware configuration,
add the wipe: true parameter to the device that fails to be deployed.
For the procedure details,
see Operations Guide: Create a custom host profile.
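A sketch of the relevant fragment of a custom bare metal host profile; the exact device selector fields depend on your hardware and are omitted here, so verify the schema against Operations Guide: Create a custom host profile:
spec:
  devices:
  - device:
      # selector fields for the affected device go here
      wipe: true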
[7655] Wrong status for an incorrectly configured L2 template¶
Fixed in 2.11.0
If an L2 template is configured incorrectly, a bare metal cluster is deployed
successfully but with the runtime errors in the IpamHost object.
Workaround:
If you suspect that the machine is not working properly because
of incorrect network configuration, verify the status of the corresponding
IpamHost object. Inspect the l2RenderResult and ipAllocationResult
object fields for error messages.
[8560] Manual deletion of BareMetalHost leads to its silent removal¶
Fixed in Container Cloud 2.5.0
If BareMetalHost is manually removed from a managed cluster, it is
silently removed without a power-off and deprovisioning, which leads to managed
cluster failures.
Workaround:
Do not manually delete a BareMetalHost that has the Provisioned status.
A Ceph node removal is not being triggered properly after updating
the KaaSCephCluster custom resource (CR). Both management and managed
clusters are affected.
When removing a worker node, it is not possible to automatically remove a Ceph
node. The workaround is to manually remove the Ceph node from the Ceph
cluster as described in Operations Guide: Add, remove, or reconfigure
Ceph nodes before removing the worker node
from your deployment.
[10050] Ceph OSD pod is in the CrashLoopBackOff state after disk replacement¶
Fixed in 2.11.0
If you use a custom BareMetalHostProfile, after disk replacement
on a Ceph OSD, the Ceph OSD pod switches to the CrashLoopBackOff state
due to the Ceph OSD authorization key failing to be created properly.
Workaround:
Export kubeconfig of your managed cluster. For example:
LCM¶[10029] Authentication fails with the 401 Unauthorized error¶
Authentication may not work on some controller nodes after a managed cluster
creation. As a result, the Kubernetes API operations with the managed cluster
kubeconfig fail with ResponseStatus: 401 Unauthorized.
As a workaround, manually restart the ucp-controller and ucp-auth
Docker services on the affected node.
[6066] Helm releases get stuck in FAILED or UNKNOWN state¶
Note
The issue affects only Helm v2 releases and is addressed for Helm v3.
Starting from Container Cloud 2.19.0, all Helm releases are switched to v3.
During a management, regional, or managed cluster deployment,
Helm releases may get stuck in the FAILED or UNKNOWN state
although the corresponding machines statuses are Ready
in the Container Cloud web UI. For example, if the StackLight Helm release
fails, the links to its endpoints are grayed out in the web UI.
In the cluster status, providerStatus.helm.ready and
providerStatus.helm.releaseStatuses.<releaseName>.success are false.
HelmBundle cannot recover from such states and requires manual actions.
The workaround below describes the recovery steps for the stacklight
release that got stuck during a cluster deployment.
Use this procedure as an example for other Helm releases as required.
Workaround:
Verify the failed release has the UNKNOWN or FAILED status
in the HelmBundle object:
After the management cluster update, a Patroni pod may fail to start and remain
in the CrashLoopBackOff status. Messages similar to the following ones may
be present in Patroni logs:
Local timeline=4 lsn=0/A000000
master_timeline=6
master: history=1 0/1ADEB48 no recovery target specified
2 0/8044500 no recovery target specified
3 0/A0000A0 no recovery target specified
4 0/A1B6CB0 no recovery target specified
5 0/A2C0C80 no recovery target specified
As a workaround, reinitialize the affected pod with a new volume by deleting
the pod itself and the associated PersistentVolumeClaim (PVC).
kubectl -n stacklight delete "pod/${POD_NAME}" "pvc/${POD_PVC}"
sleep 3  # wait for StatefulSet to reschedule the pod, but miss dependent PVC creation
kubectl -n stacklight delete "pod/${POD_NAME}"
Management cluster update¶[9899] Helm releases get stuck in PENDING_UPGRADE during cluster update¶
Helm releases may get stuck in the PENDING_UPGRADE status
during a management or managed cluster upgrade. The HelmBundle Controller
cannot recover from this state and requires manual actions. The workaround
below describes the recovery process for the openstack-operator release
that got stuck during a managed cluster update. Use it as an example for other
Helm releases as required.
Container Cloud web UI¶[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following issues have been addressed in the Mirantis Container Cloud
release 2.4.0 and the Cluster releases 5.11.0 and 6.10.0:
[10351] [BM] [IPAM] Fixed the issue with the automatically allocated subnet
having the ability to requeue allocation from a SubnetPool
in the error state.
[10104] [BM] [Ceph] Fixed the issue with OpenStack services failing to access
rook-ceph-mon-* pods due to the changed metadata for connection after
pods restart if Ceph was deployed without hostNetwork: true.
[2757] [IAM] Fixed the issue with IAM failing to start with the IAM pods
being in the CrashLoopBackOff state during a management cluster
deployment.
[7562] [IAM] Disabled the http port in Keycloak to prevent security
vulnerabilities.
[10108] [LCM] Fixed the issue with accidental upgrade of the docker-ee,
docker-ee-cli, and containerd.io packages
that must be pinned during the host OS upgrade.
[10094] [LCM] Fixed the issue with error handling in the manage-taints
Ansible script.
[9676] [LCM] Fixed the issue with Keepalived and NGINX being installed on
worker nodes instead of being installed on control plane nodes only.
[10323] [UI] Fixed the issue with offline tokens being expired over time
if fetched using the Container Cloud web UI. The issue occurred if the
Log in with Keycloak option was used.
[8966] [UI] Fixed the issue with the "invalid_grant","error_description":
"Session doesn't have required client" error occurring over time
after logging in to the Container Cloud web UI through
Log in with Keycloak.
[10180] [UI] Fixed the issue with the SSH Keys dialog becoming
blank after the token expiration.
[7781] [UI] Fixed the issue with the previously selected Ceph cluster
machines disappearing from the drop-down menu of the
Create New Ceph Cluster dialog.
[7843] [UI] Fixed the issue with Provider Credentials being
stuck in the Processing state if created using the
Add new credential option of the Create New Cluster
dialog.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.4.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Introduces support for the Cluster release 5.10.0
that is based on Kubernetes 1.18 and the updated versions of
Mirantis Kubernetes Engine 3.3.4 and Mirantis Container Runtime 19.03.13.
Still supports previous Cluster releases 5.9.0 and
6.8.1 that are now deprecated and will become unsupported
in one of the following Container Cloud releases.
Caution
Make sure to update the Cluster release version
of your managed cluster before the current Cluster release
version becomes unsupported by a new Container Cloud release
version.
Otherwise, Container Cloud stops auto-upgrade and eventually
Container Cloud itself becomes unsupported.
This section outlines release notes for the Container Cloud release 2.3.0.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.3.0.
For the list of enhancements in the Cluster release 5.10.0 and
Cluster release 6.10.0 introduced by the Container Cloud release 2.3.0,
see the 5.10.0 and 6.10.0 sections.
Updated versions of Mirantis Kubernetes Engine and Container Runtime¶
Updated the Mirantis Kubernetes Engine (MKE) version to 3.3.4 and
the Mirantis Container Runtime (MCR) version to 19.03.13 for the
Container Cloud management and managed clusters.
In scope of Technology Preview support for the VMware vSphere provider,
added the capability to deploy an additional regional vSphere-based
cluster on top of the vSphere management cluster
to create managed clusters with different configurations if required.
Automated setup of a VM template for the VMware vSphere provider¶
Technology Preview
Automated the process of a VM template setup for the vSphere-based
management and managed clusters deployments.
The VM template is now set up by Packer using the vsphere_template flag
that is integrated into bootstrap.sh.
Added the capability to deploy StackLight on management clusters. However, such
deployment has the following limitations:
The Kubernetes Nodes and Kubernetes Cluster Grafana
dashboards may have empty panels.
The DockerNetworkUnhealthy and etcdGRPCRequestsSlow alerts may fail
to be raised.
The CPUThrottlingHigh, CalicoDataplaneIfaceMsgBatchSizeHigh,
KubeCPUOvercommitPods, KubeMemOvercommitPods alerts, and the
TargetDown alert for the prometheus-node-exporter and
calico-node pods may be constantly firing.
Support of multiple host-specific L2 templates per a bare metal cluster¶
Added support of multiple host-specific L2 templates to be applied
to different nodes of the same bare metal cluster.
Now, you can use several independent host-specific L2 templates on a cluster
to support different hardware configurations. For example, you can create
L2 templates with a different number and layout of NICs to be applied
to the specific machines of a cluster.
Improvements in the Container Cloud logs collection¶
Improved user experience with the Container Cloud resources logs collection
by implementing collection of logs from the Mirantis Kubernetes Engine cluster
and from all Kubernetes pods, including the ones that were previously removed
or failed.
AWS¶[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments.
Moving forward, the workaround for this issue will be moved from
Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state
with the pod has unbound immediate PersistentVolumeClaims and
node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data
can be rebuilt from replicas. If you have a non-HA deployment,
back up any existing data before proceeding,
since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts
of the affected pods:
Bare metal¶[6988] LVM fails to deploy if the volume group name already exists¶
Fixed in Container Cloud 2.5.0
During a management or managed cluster deployment, LVM cannot be deployed
on a new disk if an old volume group with the same name already exists
on the target hardware node but on the different disk.
Workaround:
In the bare metal host profile specific to your hardware configuration,
add the wipe: true parameter to the device that fails to be deployed.
For the procedure details,
see Operations Guide: Create a custom host profile.
[7655] Wrong status for an incorrectly configured L2 template¶
Fixed in 2.11.0
If an L2 template is configured incorrectly, a bare metal cluster is deployed
successfully but with the runtime errors in the IpamHost object.
Workaround:
If you suspect that the machine is not working properly because
of incorrect network configuration, verify the status of the corresponding
IpamHost object. Inspect the l2RenderResult and ipAllocationResult
object fields for error messages.
[8560] Manual deletion of BareMetalHost leads to its silent removal¶
Fixed in Container Cloud 2.5.0
If BareMetalHost is manually removed from a managed cluster, it is
silently removed without a power-off and deprovisioning, which leads to managed
cluster failures.
Workaround:
Do not manually delete a BareMetalHost that has the Provisioned status.
[9875] Full preflight fails with a timeout waiting for BareMetalHost¶
Fixed in Container Cloud 2.6.0
If you run bootstrap.sh preflight with
KAAS_BM_FULL_PREFLIGHT=true, the script fails with the following message:
Substitute <mysqlDbadminPassword> with the corresponding value
obtained in the previous step.
Run the following command:
DROP DATABASE IF EXISTS keycloak;
Manually delete the Keycloak pods:
kubectl delete po -n kaas iam-keycloak-{0,1,2}
LCM¶[10029] Authentication fails with the 401 Unauthorized error¶
Authentication may not work on some controller nodes after a managed cluster
creation. As a result, the Kubernetes API operations with the managed cluster
kubeconfig fail with ResponseStatus: 401 Unauthorized.
As a workaround, manually restart the ucp-controller and ucp-auth
Docker services on the affected node.
Helm releases may get stuck in the PENDING_UPGRADE status
during a management or managed cluster upgrade. The HelmBundle Controller
cannot recover from this state and requires manual actions. The workaround
below describes the recovery process for the openstack-operator release
that got stuck during a managed cluster update. Use it as an example for other
Helm releases as required.
A Ceph node removal is not being triggered properly after updating
the KaaSCephCluster custom resource (CR). Both management and managed
clusters are affected.
When removing a worker node, it is not possible to automatically remove a Ceph
node. The workaround is to manually remove the Ceph node from the Ceph
cluster as described in Operations Guide: Add, remove, or reconfigure
Ceph nodes before removing the worker node
from your deployment.
[10050] Ceph OSD pod is in the CrashLoopBackOff state after disk replacement¶
Fixed in 2.11.0
If you use a custom BareMetalHostProfile, after disk replacement
on a Ceph OSD, the Ceph OSD pod switches to the CrashLoopBackOff state
due to the Ceph OSD authorization key failing to be created properly.
Workaround:
Export kubeconfig of your managed cluster. For example:
Container Cloud web UI¶[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, relogin to the Container Cloud web UI.
The following issues have been addressed in the Mirantis Container Cloud
release 2.3.0 and the Cluster releases 5.10.0 and 6.10.0:
[8869] Upgraded kind from version 0.3.0 to 0.9.0 and
the kindest/node image version from 1.14.2 to 1.18.8 to enhance
the Container Cloud performance and prevent compatibility issues.
[8220] Fixed the issue with failure to switch the default label from
one BareMetalHostProfile to another.
[7255] Fixed the issue with slow creation of the OpenStack clients and pools
by redesigning ceph-controller and increasing its efficiency and speed.
[8618] Fixed the issue with missing pools during a Ceph cluster deployment.
[8111] Fixed the issue with a Ceph cluster remaining available after deleting
it using the Container Cloud web UI or deleting the KaaSCephCluster
object from the Kubernetes namespace using CLI.
[8409, 3836] Refactored and stabilized the upgrade procedure to prevent
locks during the upgrade operations.
[8925] Fixed improper handling of errors in lcm-controller
that may lead to its panic.
[8361] Fixed the issue with admission-controller allowing addition
of duplicated node labels per machine.
[8402] Fixed the issue with the AWS provider failing during node labeling
with the Observed a panic: “invalid memory address or nil pointer
dereference” error if privateIP is not set for a machine.
[7673] Moved logs collection of the bootstrap cluster to the /bootstrap
subdirectory to prevent unintentional erasure
of the management and regional cluster logs.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.3.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Mirantis Container Cloud
GA release 2.2.0. This release introduces support
for the Cluster release 5.9.0 that is based on
Mirantis Kubernetes Engine 3.3.3, Mirantis Container Runtime 19.03.12,
and Kubernetes 1.18. This release also introduces support for the
Cluster release 6.8.1 that introduces the support of the
Mirantis OpenStack for Kubernetes (MOSK) product.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.2.0.
For the list of enhancements in the Cluster release 5.9.0 and
Cluster release 6.8.1 introduced by the Container Cloud release 2.2.0,
see 5.9.0 and 6.8.1.
Introduced the Technology Preview support for the VMware vSphere cloud
provider on RHEL, including support for creation and operating of managed
clusters using the Container Cloud web UI.
Deployment of an additional regional vSphere-based cluster
or attaching an existing Mirantis Kubernetes Engine (MKE) cluster to a
vSphere-based management cluster is in the development stage and
will be announced in one of the following Container Cloud releases.
Kernel parameters management through BareMetalHostProfile¶
Implemented the API for managing kernel parameters typically managed by
sysctl for bare metal hosts through the BareMetalHost and
BareMetalHostProfile object fields.
Implemented support of multiple subnets per Container Cloud cluster with
the ability to specify a different network type for each subnet.
Introduced the SubnetPool object that allows for automatic creation of the
Subnet objects. Also, added the L3Layout section to
L2Template.spec. The L3Layout configuration allows defining the
subnets scopes to be used and to enable auto-creation of subnets
from a subnet pool.
On top of continuous improvements delivered to the existing Container Cloud
guides, added the Mirantis Container Cloud API section to the Operations
Guide. This section is intended only for advanced Infrastructure Operators
who are familiar with Kubernetes Cluster API.
Currently, this section contains descriptions and examples
of the Container Cloud API resources for the bare metal cloud provider.
The API documentation for the OpenStack, AWS, and VMware vSphere
API resources will be added in the upcoming Container Cloud releases.
AWS¶[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments.
Moving forward, the workaround for this issue will be moved from
Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state
with the pod has unbound immediate PersistentVolumeClaims and
node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data
can be rebuilt from replicas. If you have a non-HA deployment,
back up any existing data before proceeding,
since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts
of the affected pods:
Bare metal¶[6988] LVM fails to deploy if the volume group name already exists¶
Fixed in Container Cloud 2.5.0
During a management or managed cluster deployment, LVM cannot be deployed
on a new disk if an old volume group with the same name already exists
on the target hardware node but on the different disk.
Workaround:
In the bare metal host profile specific to your hardware configuration,
add the wipe: true parameter to the device that fails to be deployed.
For the procedure details,
see Operations Guide: Create a custom host profile.
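As an illustrative sketch only, the parameter is set on the device entry in the
host profile; the surrounding device fields below are placeholders:
# Open the profile for editing; <project-name> and <profile-name> are placeholders.
kubectl -n <project-name> edit baremetalhostprofile <profile-name>
# Assumed device entry under spec.devices with wiping enabled:
#   - device:
#       minSize: 60Gi
#       wipe: true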
[7655] Wrong status for an incorrectly configured L2 template¶
Fixed in 2.11.0
If an L2 template is configured incorrectly, a bare metal cluster is deployed
successfully but with runtime errors in the IpamHost object.
Workaround:
If you suspect that the machine is not working properly because
of incorrect network configuration, verify the status of the corresponding
IpamHost object. Inspect the l2RenderResult and ipAllocationResult
object fields for error messages.
[8560] Manual deletion of BareMetalHost leads to its silent removal¶
Fixed in Container Cloud 2.5.0
If BareMetalHost is manually removed from a managed cluster, it is
silently removed without power-off and deprovisioning, which leads to
managed cluster failures.
Workaround:
Do not manually delete a BareMetalHost that has the Provisioned status.
IAM¶
[2757] IAM fails to start during management cluster deployment¶
Fixed in Container Cloud 2.4.0
During a management cluster deployment, IAM fails to start with the IAM
pods being in the CrashLoopBackOff status.
Workaround:
Log in to the bootstrap node.
Remove the iam-mariadb-state configmap:
kubectl delete cm -n kaas iam-mariadb-state
Manually delete the mariadb pods:
kubectl delete po -n kaas mariadb-server-{0,1,2}
Wait for the pods to start. If the mariadb pod does not start
with the connection to peer timed out exception, repeat step 2.
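As a generic verification step (not part of the documented procedure), you can
list the pods until all replicas are Running and inspect a restarting pod:
# List the MariaDB pods in the kaas namespace; re-run until all replicas are Running.
kubectl -n kaas get po | grep mariadb-server
# Inspect the previous container logs of a pod that keeps restarting:
kubectl -n kaas logs mariadb-server-0 --previous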
When removing a worker node, it is not possible to automatically remove a Ceph
node. The workaround is to manually remove the Ceph node from the Ceph
cluster as described in Operations Guide: Add, remove, or reconfigure
Ceph nodes before removing the worker node
from your deployment.
Container Cloud web UI¶
[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, log in to the Container Cloud web UI again.
The following issues have been addressed in the Mirantis Container Cloud
release 2.2.0 including the Cluster release 5.9.0:
[8012] Fixed the issue with helm-controller pod being stuck in the
CrashLoopBackOff state after reattaching of a Mirantis Kubernetes Engine
(MKE) cluster.
[7131] Fixed the issue with the deployment of a managed cluster failing
during the Ceph Monitor or Manager deployment.
[6164] Fixed the issue with the number of placement groups (PGs)
per Ceph OSD being too small and the Ceph cluster having the HEALTH_WARN
status.
[8302] Fixed the issue with deletion of a regional cluster leading to the
deletion of the related management cluster.
[7722] Fixed the issue with the Internal Server Error or similar errors
appearing in the HelmBundle controller logs after bootstrapping the
management cluster.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.2.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Mirantis Container Cloud
GA release 2.1.0. This release introduces support
for the Cluster release 5.8.0 that is based on
Mirantis Kubernetes Engine 3.3.3, Mirantis Container Runtime 19.03.12,
and Kubernetes 1.18.
This section outlines new features and enhancements
introduced in the Mirantis Container Cloud release 2.1.0.
For the list of enhancements in the Cluster release 5.8.0
introduced by the KaaS release 2.1.0, see 5.8.0.
Implemented the possibility to assign labels to specific machines with
dedicated system and hardware resources through the Container Cloud web UI.
For example, you can label the StackLight nodes that run Elasticsearch and
require more resources than a standard node to run the StackLight component
services on dedicated nodes.
You can label a machine before or after it is deployed.
The list of available labels is taken from the current Cluster release.
Node labeling greatly improves cluster performance and prevents pods from
being quickly exhausted.
AWS resources discovery in Container Cloud web UI¶
Improved the user experience during a managed cluster creation
using the Container Cloud web UI by implementing drop-down menus with
available supported values for the following AWS resources:
AWS region
AWS AMI ID
AWS instance type
To apply the feature to existing deployments, update the IAM policies for
AWS.
AWS¶
[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments.
Moving forward, the workaround for this issue will be moved from
Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state
with the pod has unbound immediate PersistentVolumeClaims and
node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data
can be rebuilt from replicas. If you have a non-HA deployment,
back up any existing data before proceeding,
since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts
of the affected pods:
Bare metal¶
[6988] LVM fails to deploy if the volume group name already exists¶
Fixed in Container Cloud 2.5.0
During a management or managed cluster deployment, LVM cannot be deployed
on a new disk if an old volume group with the same name already exists
on the target hardware node but on the different disk.
Workaround:
In the bare metal host profile specific to your hardware configuration,
add the wipe: true parameter to the device that fails to be deployed.
For the procedure details,
see Operations Guide: Create a custom host profile.
IAM¶
[2757] IAM fails to start during management cluster deployment¶
Fixed in Container Cloud 2.4.0
During a management cluster deployment, IAM fails to start with the IAM
pods being in the CrashLoopBackOff status.
Workaround:
Log in to the bootstrap node.
Remove the iam-mariadb-state configmap:
kubectl delete cm -n kaas iam-mariadb-state
Manually delete the mariadb pods:
kubectl delete po -n kaas mariadb-server-{0,1,2}
Wait for the pods to start. If the mariadb pod does not start
with the connection to peer timed out exception, repeat step 2.
After deploying a managed cluster with Ceph, the number of placement groups
(PGs) per Ceph OSD may be too small and the Ceph cluster may have the
HEALTH_WARN status:
health: HEALTH_WARN too few PGs per OSD (3 < min 30)
The workaround is to enable the PG balancer to properly manage the number
of PGs:
kubectl exec -it $(kubectl get pod -l "app=rook-ceph-tools" --all-namespaces -o jsonpath='{.items[0].metadata.name}') -n rook-ceph bash
ceph mgr module enable pg_autoscaler
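After enabling the module, you can optionally verify that the autoscaler picks
up the pools and that the cluster health recovers, for example:
# Run inside the rook-ceph-tools pod opened by the previous command:
# list pools with the PG numbers calculated by the autoscaler.
ceph osd pool autoscale-status
# Verify the overall cluster health once the PG numbers are adjusted.
ceph -s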
[7131] rook-ceph-mgr fails during managed cluster deployment¶
Fixed in 2.2.0
Occasionally, the deployment of a managed cluster may fail during the Ceph
Monitor or Manager deployment. In this case, the Ceph cluster may be down
and a stack trace similar to the following one may be present in Ceph Manager
logs:
When removing a worker node, it is not possible to automatically remove a Ceph
node. The workaround is to manually remove the Ceph node from the Ceph
cluster as described in Operations Guide: Add, remove, or reconfigure
Ceph nodes before removing the worker node
from your deployment.
Container Cloud web UI¶
[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, log in to the Container Cloud web UI again.
In the Mirantis Container Cloud release 2.1.0,
the following issues have been addressed:
[7281] Fixed the issue with a management cluster bootstrap script failing
if there was a space in the PATH environment variable.
[7205] Fixed the issue with some cluster objects being stuck
during deletion of an AWS-based managed cluster
due to unresolved VPC dependencies.
[7304] Fixed the issue with failure to reattach a Mirantis Kubernetes Engine
(MKE) cluster with the same name.
[7101] Fixed the issue with the monitoring of Ceph and Ironic being enabled
when Ceph and Ironic are disabled on the baremetal-based clusters.
[7324] Fixed the issue with the monitoring of Ceph being disabled
on the baremetal-based managed clusters
due to the missing provider:BareMetal parameter.
[7180] Fixed the issue with lcm-controller periodically failing with
the invalid memory address or nil pointer dereference runtime error.
[7251] Fixed the issue with setting up the OIDC integration on the MKE side.
[7326] Fixed the issue with the missing entry for the host itself
in etc/hosts causing failure of services that require node FQDN.
[6989] Fixed the issue with baremetal-operator ignoring the
clean failed provisioning state if a node fails to deploy on a
baremetal-based managed cluster.
[7231] Fixed the issue with the baremetal-provider pod not restarting
after the ConfigMap changes and causing the telemeter-client pod
to fail during deployment.
The following table lists the major components and their versions
of the Mirantis Container Cloud release 2.1.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Apply updates to the AWS-based management clusters¶
To complete the AWS-based management cluster upgrade to version 2.1.0,
manually update the IAM policies for AWS before updating your AWS-based
managed clusters.
To update the IAM policies for AWS:
Choose from the following options:
Update the IAM policies using get_container_cloud.sh:
On any local machine, download and run the latest version
of the Container Cloud bootstrap script:
Update the AWS CloudFormation template for IAM policy:
./container-cloud bootstrap aws policy
Update the IAM policies using the AWS Management Console:
Log in to your AWS Management Console.
Verify that the
controllers.cluster-api-provider-aws.kaas.mirantis.com role
or another AWS role that you use for Container Cloud users
contains the following permissions:
This section outlines release notes for the initial Mirantis Container Cloud
GA release 2.0.0. This release introduces support
for the Cluster release 5.7.0 that is based on Mirantis Kubernetes Engine
3.3.3, Mirantis Container Runtime 19.03.12, and Kubernetes 1.18.
AWS¶
[8013] Managed cluster deployment requiring PVs may fail¶
Fixed in the Cluster release 7.0.0
Note
The issue below affects only the Kubernetes 1.18 deployments.
Moving forward, the workaround for this issue will be moved from
Release Notes to Operations Guide: Troubleshooting.
On a management cluster with multiple AWS-based managed
clusters, some clusters fail to complete the deployments that require
persistent volumes (PVs), for example, Elasticsearch.
Some of the affected pods get stuck in the Pending state
with the pod has unbound immediate PersistentVolumeClaims and
node(s) had volume node affinity conflict errors.
Warning
The workaround below applies to HA deployments where data
can be rebuilt from replicas. If you have a non-HA deployment,
back up any existing data before proceeding,
since all data will be lost while applying the workaround.
Workaround:
Obtain the persistent volume claims related to the storage mounts
of the affected pods:
Bare metal¶
[6988] LVM fails to deploy if the volume group name already exists¶
Fixed in Container Cloud 2.5.0
During a management or managed cluster deployment, LVM cannot be deployed
on a new disk if an old volume group with the same name already exists
on the target hardware node but on the different disk.
Workaround:
In the bare metal host profile specific to your hardware configuration,
add the wipe: true parameter to the device that fails to be deployed.
For the procedure details,
see Operations Guide: Create a custom host profile.
IAM¶
[2757] IAM fails to start during management cluster deployment¶
Fixed in Container Cloud 2.4.0
During a management cluster deployment, IAM fails to start with the IAM
pods being in the CrashLoopBackOff status.
Workaround:
Log in to the bootstrap node.
Remove the iam-mariadb-state configmap:
kubectl delete cm -n kaas iam-mariadb-state
Manually delete the mariadb pods:
kubectl delete po -n kaas mariadb-server-{0,1,2}
Wait for the pods to start. If the mariadb pod does not start
with the connection to peer timed out exception, repeat step 2.
Substitute <mysqlDbadminPassword> with the corresponding value
obtained in the previous step.
Run the following command:
DROP DATABASE IF EXISTS keycloak;
Manually delete the Keycloak pods:
kubectl delete po -n kaas iam-keycloak-{0,1,2}
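As an illustrative sketch only, the DROP DATABASE statement above can be
executed from inside one of the MariaDB pods; the admin user name below is an
assumption, so use the credentials that match <mysqlDbadminPassword>:
# Placeholders: the "root" user name and the password value are assumptions.
kubectl -n kaas exec -it mariadb-server-0 -- \
  mysql -u root -p'<mysqlDbadminPassword>' -e 'DROP DATABASE IF EXISTS keycloak;'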
StackLight¶
[7101] Monitoring of disabled components¶
Fixed in 2.1.0
On the baremetal-based clusters, the monitoring of Ceph and Ironic is enabled
when Ceph and Ironic are disabled. The issue with Ceph relates to both
management and managed clusters; the issue with Ironic relates to managed
clusters only.
After deploying a managed cluster with Ceph, the number of placement groups
(PGs) per Ceph OSD may be too small and the Ceph cluster may have the
HEALTH_WARN status:
health: HEALTH_WARN too few PGs per OSD (3 < min 30)
The workaround is to enable the PG balancer to properly manage the number
of PGs:
kubectl exec -it $(kubectl get pod -l "app=rook-ceph-tools" --all-namespaces -o jsonpath='{.items[0].metadata.name}') -n rook-ceph bash
ceph mgr module enable pg_autoscaler
[7131] rook-ceph-mgr fails during managed cluster deployment¶
Fixed in 2.2.0
Occasionally, the deployment of a managed cluster may fail during the Ceph
Monitor or Manager deployment. In this case, the Ceph cluster may be down
and a stack trace similar to the following one may be present in Ceph Manager
logs:
When removing a worker node, it is not possible to automatically remove a Ceph
node. The workaround is to manually remove the Ceph node from the Ceph
cluster as described in Operations Guide: Add, remove, or reconfigure
Ceph nodes before removing the worker node
from your deployment.
Bootstrap¶
[7281] Space in PATH causes failure of bootstrap process¶
Fixed in 2.1.0
A management cluster bootstrap script fails if there is a space
in the PATH environment variable. As a workaround, before running the
bootstrap.sh script, verify that there are no spaces in the PATH
environment variable.
Container Cloud web UI¶
[249] A newly created project does not display in the Container Cloud web UI¶
Affects only Container Cloud 2.18.0 and earlier
A project that is newly created in the Container Cloud web UI does not display
in the Projects list even after refreshing the page.
The issue occurs due to the token missing the necessary role
for the new project.
As a workaround, log in to the Container Cloud web UI again.
This section outlines the release notes for major and patch Cluster releases
that are supported by specific Container Cloud releases. For details about the
Container Cloud releases, see: Container Cloud releases.
Major and patch versions update path
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. Though, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section outlines release notes for supported major and patch Cluster
releases of the 17.x series dedicated for Mirantis OpenStack for Kubernetes
(MOSK).
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. Though, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section outlines release notes for major and patch Cluster releases of the
17.4.x series dedicated for Mirantis OpenStack for Kubernetes (MOSK).
This section outlines release notes for the major Cluster release 17.4.0 that
is introduced in the Container Cloud release 2.29.0. This
Cluster release is based on the Cluster release 16.4.0.
The Cluster release 17.4.0 supports:
Mirantis OpenStack for Kubernetes (MOSK) 25.1. For details, see
MOSK Release Notes.
Mirantis Kubernetes Engine (MKE) 3.7.19. For details, see
MKE Release Notes.
Mirantis Container Runtime (MCR) 25.0.8. For details, see
MCR Release Notes.
Kubernetes 1.27.
For the list of known and addressed issues, refer to the Container Cloud
release 2.29.0 section.
This section outlines new features implemented in the Cluster release 17.4.0
that is introduced in the Container Cloud release 2.29.0.
For MOSK enhancements, see MOSK 25.1: New
features.
Improvements in the CIS Benchmark compliance for Ubuntu, MKE, and Docker¶
Added the following improvements in the CIS Benchmark compliance for Ubuntu,
MKE, and Docker:
Introduced new password policies for local (Linux) user accounts. These
policies match the rules described in CIS Benchmark compliance checks
(executed by the Nessus scanner) for Ubuntu Linux 22.04 LTS v2.0.0 L1 Server,
revision 1.1.
The rules are applied automatically to all cluster nodes during cluster
update. Therefore, if you use custom Linux accounts protected by passwords,
pay attention to the following rules, as you may be forced to update
a non-compliant password during login:
Password expiration interval: 365 days
Minimum password length: 14 symbols
Required symbols are capital letters, lower case letters, and digits
At least 2 characters of the new password must not be present in the old
password
Maximum identical consecutive characters: 3 (allowed: aaa123, not allowed:
aaaa123)
Maximum sequential characters: 3 (allowed: abc1xyz, not allowed:
abcd123)
Dictionary check is enabled
You must not reuse the old password
After 3 failed password input attempts, the account is disabled for 15
minutes
Analyzed and reached an 87% pass rate in the CIS Benchmark compliance checks
(executed by the Nessus scanner) for Ubuntu Linux 22.04 LTS v2.0.0 L1 Server,
revision 1.1.
Note
Compliance results can vary between clusters due to
configuration-dependent tests, such as server disk partitioning.
If you require a detailed report of analyzed and fixed compliance checks,
contact Mirantis support.
Analyzed and fixed the following checks (where possible, to reduce the
number of failed components) in the Docker and MKE CIS Benchmark
compliance:
Ensure that container health is checked at runtime: No containers
without health checks
Note
The control IDs may differ depending on the scanning tool.
Note
Some security scanners may produce false-negative results for some
resources because native Docker containers and Kubernetes pods have
different configuration mechanisms.
The following table lists the components versions of the Cluster release
17.4.0. The components that are newly added, updated, deprecated, or removed as
compared to 17.3.0, are marked with a corresponding superscript, for example,
lcm-ansibleUpdated.
This section lists the artifacts of components included in the Cluster release
17.4.0. The components that are newly added, updated, deprecated, or removed as
compared to 17.3.0, are marked with a corresponding superscript, for example,
lcm-ansibleUpdated.
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. Though, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section outlines release notes for supported major and patch Cluster
releases of the 17.3.x series dedicated for Mirantis OpenStack for Kubernetes
(MOSK).
This section includes release notes for the patch Cluster release 17.3.7 that
is introduced in the Container Cloud patch release 2.29.2
and is based on the previous Cluster releases of the 17.3.x
series and on 16.3.7.
This patch Cluster release introduces MOSK 24.3.4 that is
based on Mirantis Kubernetes Engine 3.7.20 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.15 with docker-ee-cli updated to 23.0.17.
For the list of CVE fixes delivered with this patch Cluster release, see
2.29.2.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
17.3.7.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. Though, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section outlines release notes for supported major and patch Cluster
releases of the 16.x series.
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. Though, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section outlines release notes for supported major and patch Cluster
releases of the 16.4.x series.
This section outlines release notes for the patch Cluster release 16.4.2 that
is introduced in the Container Cloud release 2.29.2 and is
based on 16.4.0 and 16.4.1.
The Cluster release 16.4.2 supports Mirantis Kubernetes Engine 3.7.20 with
Kubernetes 1.27 and Mirantis Container Runtime 25.0.7 with
docker-ee-cli updated to 25.0.9m1.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
16.4.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the major Cluster release 16.4.0 that
is introduced in the Container Cloud release 2.29.0.
The Cluster release 16.4.0 supports:
Mirantis Kubernetes Engine (MKE) 3.7.19. For details, see
MKE Release Notes.
Mirantis Container Runtime (MCR) 25.0.8. For details, see
MCR Release Notes.
Kubernetes 1.27.
For the list of known and addressed issues, refer to the Container Cloud
release 2.29.0 section.
Improvements in the CIS Benchmark compliance for Ubuntu, MKE, and Docker¶
Added the following improvements in the CIS Benchmark compliance for Ubuntu,
MKE, and Docker:
Introduced new password policies for local (Linux) user accounts. These
policies match the rules described in CIS Benchmark compliance checks
(executed by the Nessus scanner) for Ubuntu Linux 22.04 LTS v2.0.0 L1 Server,
revision 1.1.
The rules are applied automatically to all cluster nodes during cluster
update. Therefore, if you use custom Linux accounts protected by passwords,
pay attention to the following rules, as you may be forced to update
a non-compliant password during login:
Password expiration interval: 365 days
Minimum password length: 14 symbols
Required symbols are capital letters, lower case letters, and digits
At least 2 characters of the new password must not be present in the old
password
Maximum identical consecutive characters: 3 (allowed: aaa123, not allowed:
aaaa123)
Maximum sequential characters: 3 (allowed: abc1xyz, not allowed:
abcd123)
Dictionary check is enabled
You must not reuse the old password
After 3 failed password input attempts, the account is disabled for 15
minutes
Analyzed and reached an 87% pass rate in the CIS Benchmark compliance checks
(executed by the Nessus scanner) for Ubuntu Linux 22.04 LTS v2.0.0 L1 Server,
revision 1.1.
Note
Compliance results can vary between clusters due to
configuration-dependent tests, such as server disk partitioning.
If you require a detailed report of analyzed and fixed compliance checks,
contact Mirantis support.
Analyzed and fixed the following checks (where possible, to reduce the
number of failed components) in the Docker and MKE CIS Benchmark
compliance:
Ensure that container health is checked at runtime: No containers
without health checks
Note
The control IDs may differ depending on the scanning tool.
Note
Some security scanners may produce false-negative results for some
resources because native Docker containers and Kubernetes pods have
different configuration mechanisms.
The following table lists the components versions of the Cluster release
16.4.0. The components that are newly added, updated, deprecated, or removed as
compared to 16.3.0, are marked with a corresponding superscript, for example,
lcm-ansibleUpdated.
This section lists the artifacts of components included in the Cluster release
16.4.0. The components that are newly added, updated, deprecated, or removed as
compared to 16.3.0, are marked with a corresponding superscript, for example,
lcm-ansibleUpdated.
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. Though, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section outlines release notes for supported major and patch Cluster
releases of the 16.3.x series.
This section includes release notes for the patch Cluster release 16.3.7 that
is introduced in the Container Cloud patch release 2.29.2
and is based on the previous Cluster releases of the 16.3.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.20
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.15 with
docker-ee-cli updated to 23.0.17.
For the list of CVE fixes delivered with this patch Cluster release, see
2.29.2.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
16.3.7.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section describes the release notes for the deprecated major Cluster
releases that will become unsupported in one of the following Container Cloud
releases. Make sure to update your managed clusters to the latest supported
version as described in Update a managed cluster.
This section includes release notes for the patch Cluster release 17.3.6 that
is introduced in the Container Cloud patch release 2.29.1
and is based on the previous Cluster releases of the 17.3.x
series and on 16.3.6.
This patch Cluster release introduces MOSK 24.3.3 that is
based on Mirantis Kubernetes Engine 3.7.20 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.15 with docker-ee-cli updated to 23.0.17.
For the list of CVE fixes delivered with this patch Cluster release, see
2.29.1.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
17.3.6.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.3.5 that
is introduced in the Container Cloud patch release 2.28.5
and is based on the Cluster releases 17.3.0,
17.3.4, and 16.3.4.
This patch Cluster release introduces MOSK 24.3.2 that is
based on Mirantis Kubernetes Engine 3.7.18 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.15.
For the list of CVE fixes delivered with this patch Cluster release, see
2.28.5.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
17.3.5.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.3.4 that
is introduced in the Container Cloud patch release 2.28.4
and is based on the Cluster releases 17.3.0 and
16.3.4.
This patch Cluster release introduces MOSK 24.3.1 that is
based on Mirantis Kubernetes Engine 3.7.17 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.15, which includes containerd 1.6.36.
For the list of CVE fixes delivered with this patch Cluster release, see
2.28.4.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
17.3.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the major Cluster release 17.3.0 that
is introduced in the Container Cloud release 2.28.0.
This Cluster release is based on the Cluster release 16.3.0.
The Cluster release 17.3.0 supports:
Mirantis OpenStack for Kubernetes (MOSK) 24.3.
For details, see MOSK Release Notes.
Mirantis Kubernetes Engine (MKE) 3.7.12. For details, see
MKE Release Notes.
Mirantis Container Runtime (MCR) 23.0.14. For details, see
MCR Release Notes.
Kubernetes 1.27.
For the list of known and addressed issues, refer to the Container Cloud
release 2.28.0 section.
Introduced support for Mirantis Container Runtime (MCR) 23.0.14 and Mirantis
Kubernetes Engine (MKE) 3.7.12 that includes Kubernetes 1.27.14.
On existing clusters, MKE and MCR are updated to the latest supported version
when you update your managed cluster to the Cluster release 17.3.0.
Note
The 3.7.12 update applies to users who follow the update train using
major releases. Users who install patch releases have already obtained
MKE 3.7.12 in Container Cloud 2.27.3 (Cluster release 17.1.4).
Improvements in the CIS Benchmark compliance for Ubuntu Linux¶
Analyzed and reached an 80% pass rate in the CIS Benchmark compliance checks
(executed by the Nessus scanner) for Ubuntu Linux 22.04 LTS v2.0.0 L1 Server,
revision 1.1.
Note
Compliance results can vary between clusters due to
configuration-dependent tests, such as server disk partitioning.
If you require a detailed report of analyzed and fixed compliance checks,
contact Mirantis support.
Implemented proactive monitoring that allows the operator to quickly detect
and resolve LCM health issues in a cluster. The implementation includes the
dedicated MCCClusterLCMUnhealthy alert along with the
kaas_cluster_lcm_healthy and kaas_cluster_ready metrics that are
collected on the kaas-exporter side.
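As an illustrative spot check only (the Prometheus service name and port are
assumptions, so adjust them to your StackLight installation), the metrics can
be queried through the Prometheus HTTP API:
# Forward the Prometheus API port and query the LCM health metric.
kubectl -n stacklight port-forward svc/prometheus-server 9090:9090 &
sleep 5  # give the port-forward a moment to establish
curl -s -G 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=kaas_cluster_lcm_healthy'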
Refactored all certificate and license expiration alerts in StackLight that now
display the exact number of remaining days before expiration using
{{ $value | humanizeTimestamp }}. This optimization replaces vague wording
such as less than 10 days, which indicated a range from 0 to 9 days before
expiration.
The following table lists the components versions of the Cluster release
17.3.0. The components that are newly added, updated, deprecated, or removed as
compared to 17.2.0, are marked with a corresponding superscript, for example,
lcm-ansibleUpdated.
This section lists the artifacts of components included in the Cluster release
17.3.0. The components that are newly added, updated, deprecated, or removed as
compared to 17.2.0, are marked with a corresponding superscript, for example,
lcm-ansibleUpdated.
This section outlines release notes for the patch Cluster release 16.4.1 that
is introduced in the Container Cloud release 2.29.1 and is
based on 16.4.0.
The Cluster release 16.4.1 supports Mirantis Kubernetes Engine 3.7.20 with
Kubernetes 1.27 and Mirantis Container Runtime 25.0.7 with
docker-ee-cli updated to 25.0.9m1.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
16.4.1.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.3.6 that
is introduced in the Container Cloud patch release 2.29.1
and is based on the previous Cluster releases of the 16.3.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.20
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.15 with
docker-ee-cli updated to 23.0.17.
For the list of CVE fixes delivered with this patch Cluster release, see
2.29.1.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
16.3.6.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.3.5 that
is introduced in the Container Cloud patch release 2.28.5
and is based on the previous Cluster releases of the 16.3.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.18
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.15.
For the list of CVE fixes delivered with this patch Cluster release, see
2.28.5.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
16.3.5.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.3.4 that
is introduced in the Container Cloud patch release 2.28.4
and is based on the previous Cluster releases of the 16.3.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.17
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.15, which includes
containerd 1.6.36.
For the list of CVE fixes delivered with this patch Cluster release, see
2.28.4.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
16.3.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the major Cluster release 16.3.0 that
is introduced in the Container Cloud release 2.28.0.
The Cluster release 16.3.0 supports:
Mirantis Kubernetes Engine (MKE) 3.7.12. For details, see
MKE Release Notes.
Mirantis Container Runtime (MCR) 23.0.14. For details, see
MCR Release Notes.
Kubernetes 1.27.
For the list of known and addressed issues, refer to the Container Cloud
release 2.28.0 section.
Introduced support for Mirantis Container Runtime (MCR) 23.0.14 and Mirantis
Kubernetes Engine (MKE) 3.7.12 that includes Kubernetes 1.27.14 for
the Container Cloud management and managed clusters.
On existing managed clusters, MKE and MCR are updated to the latest supported
version when you update your managed cluster to the Cluster release 16.3.0.
Note
The 3.7.12 update applies to users who follow the update train using
major releases. Users who install patch releases have already obtained
MKE 3.7.12 in Container Cloud 2.27.3 (Cluster release 16.1.4).
Improvements in the CIS Benchmark compliance for Ubuntu Linux¶
Analyzed and reached an 80% pass rate in the CIS Benchmark compliance checks
(executed by the Nessus scanner) for Ubuntu Linux 22.04 LTS v2.0.0 L1 Server,
revision 1.1.
Note
Compliance results can vary between clusters due to
configuration-dependent tests, such as server disk partitioning.
If you require a detailed report of analyzed and fixed compliance checks,
contact Mirantis support.
Implemented proactive monitoring that allows the operator to quickly detect
and resolve LCM health issues in a cluster. The implementation includes the
dedicated MCCClusterLCMUnhealthy alert along with the
kaas_cluster_lcm_healthy and kaas_cluster_ready metrics that are
collected on the kaas-exporter side.
Refactored all certificate and license expiration alerts in StackLight that now
display the exact number of remaining days before expiration using
{{ $value | humanizeTimestamp }}. This optimization replaces vague wording
such as less than 10 days, which indicated a range from 0 to 9 days before
expiration.
The following table lists the components versions of the Cluster release
16.3.0. The components that are newly added, updated, deprecated, or removed as
compared to 16.2.0, are marked with a corresponding superscript, for example,
lcm-ansibleUpdated.
This section lists the artifacts of components included in the Cluster release
16.3.0. The components that are newly added, updated, deprecated, or removed as
compared to 16.2.0, are marked with a corresponding superscript, for example,
lcm-ansibleUpdated.
This section describes the release notes for the unsupported Cluster releases.
For details about supported Cluster releases, see Cluster releases (managed).
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. Though, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section outlines release notes for major and patch Cluster releases of the
17.2.x series dedicated for Mirantis OpenStack for Kubernetes (MOSK).
This section includes release notes for the patch Cluster release 17.2.7 that
is introduced in the Container Cloud patch release 2.28.3
and is based on the previous Cluster releases of the 17.2.x
series.
This patch Cluster release introduces MOSK 24.2.5 that is
based on Mirantis Kubernetes Engine 3.7.16 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.11.
For the list of CVE fixes delivered with this patch Cluster release, see
2.28.3.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
17.2.7.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.2.6 that
is introduced in the Container Cloud patch release 2.28.2
and is based on the previous Cluster releases of the 17.2.x
series.
This patch Cluster release introduces MOSK 24.2.4 that is
based on Mirantis Kubernetes Engine 3.7.16 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.11.
For the list of CVE fixes delivered with this patch Cluster release, see
2.28.2.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
17.2.6.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.2.5 that
is introduced in the Container Cloud patch release 2.28.1
and is based on the previous Cluster releases of the 17.2.x
series.
This patch Cluster release introduces MOSK 24.2.3 that is
based on Mirantis Kubernetes Engine 3.7.15 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.11.
For the list of CVE fixes delivered with this patch Cluster release, see
2.28.1.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
17.2.5.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.2.4 that
is introduced in the Container Cloud patch release 2.27.4
and is based on the previous Cluster releases of the 17.2.x
series.
This patch Cluster release introduces MOSK 24.2.2 that is
based on Mirantis Kubernetes Engine 3.7.12 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.11.
For the list of CVE fixes delivered with this patch Cluster release, see
2.27.4.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
17.2.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.2.3 that
is introduced in the Container Cloud patch release 2.27.3
and is based on the Cluster releases 17.2.0 and
16.2.3.
This patch Cluster release introduces MOSK 24.2.1 that is
based on Mirantis Kubernetes Engine 3.7.12 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.11.
For the list of CVE fixes delivered with this patch Cluster release, see
2.27.3.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
17.2.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the major Cluster release 17.2.0 that
is introduced in the Container Cloud release 2.27.0.
This Cluster release is based on the Cluster release 16.2.0.
The Cluster release 17.2.0 supports:
Mirantis OpenStack for Kubernetes (MOSK) 24.2.
For details, see MOSK Release Notes.
Mirantis Kubernetes Engine (MKE) 3.7.8. For details, see
MKE Release Notes.
Mirantis Container Runtime (MCR) 23.0.11. For details, see
MCR Release Notes.
Kubernetes 1.27.
For the list of known and addressed issues, refer to the Container Cloud
release 2.27.0 section.
Introduced support for Mirantis Kubernetes Engine (MKE) 3.7.8 that supports
Kubernetes 1.27. On existing clusters, MKE is updated to the latest supported
version when you update your managed cluster to the Cluster release 17.2.0.
Note
This enhancement applies to users who follow the update train using
major releases. Users who install patch releases have already obtained
MKE 3.7.8 in Container Cloud 2.26.4 (Cluster release 17.1.4).
Analyzed and fixed the majority of failed compliance checks in the MKE
benchmark compliance for Container Cloud core components and StackLight.
The following controls were analyzed:
Control ID: 5.1.2
Component: client-certificate-controller, helm-controller, local-volume-provisioner
Control description: Minimize access to secrets
Analyzed item: ClusterRoles with get, list, and watch access to Secret objects in a cluster
Control ID: 5.1.4
Component: local-volume-provisioner
Control description: Minimize access to create pods
Analyzed item: ClusterRoles with the create access to pod objects in a cluster
Control ID: 5.2.5
Component: client-certificate-controller, helm-controller, policy-controller, stacklight
Control description: Minimize the admission of containers with allowPrivilegeEscalation
Analyzed item: Containers with the allowPrivilegeEscalation capability enabled
Upgraded Ceph major version from Quincy 17.2.7 (17.2.7-12.cve in the patch
release train) to Reef 18.2.3 with an automatic upgrade of Ceph components
on existing managed clusters during the Cluster version update.
Ceph Reef delivers a new version of RocksDB that provides better I/O
performance. This version also supports RGW multisite re-sharding and contains
overall security improvements.
Added support for Rook v1.13 that contains the Ceph CSI plugin 3.10.x as the
default supported version. For a complete list of features and breaking
changes, refer to official Rook documentation.
Setting a configuration section for Rook parameters¶
Implemented the section option for the rookConfig parameter that
enables you to specify the section where a Rook parameter must be placed.
Using this option enables restarting only the specific daemons related to the
corresponding section instead of restarting all Ceph daemons except Ceph OSDs.
Implemented monitoring of disk and I/O errors in kernel logs to detect
hardware and software issues. The implementation includes the dedicated
KernelIOErrorsDetected alert, the kernel_io_errors_total metric that
is collected on the Fluentd side using the I/O error patterns, and general
refactoring of metrics created in Fluentd.
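As an illustrative sketch only, an alert-style expression over this metric can
be evaluated ad hoc; the 15-minute window and the non-zero threshold below are
arbitrary examples rather than the product alert definition, and the Prometheus
service name is an assumption:
# Forward the Prometheus API port and evaluate the expression.
kubectl -n stacklight port-forward svc/prometheus-server 9090:9090 &
sleep 5  # give the port-forward a moment to establish
curl -s -G 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=increase(kernel_io_errors_total[15m]) > 0'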
S.M.A.R.T. metrics for creating alert rules on bare metal clusters¶
Added documentation describing usage examples of alert rules based on
S.M.A.R.T. metrics to monitor disk information on bare metal clusters.
The StackLight telegraf-ds-smart exporter uses the
S.M.A.R.T. plugin to
obtain detailed disk information and export it as metrics. S.M.A.R.T. is a
commonly used system across vendors with performance data provided as
attributes.
Improvements for OpenSearch and OpenSearch Indices Grafana dashboards¶
Improved performance and UX visibility of the OpenSearch and
OpenSearch Indices Grafana dashboards as well as added the
capability to minimize the number of indices to be displayed on dashboards.
Removal of grafana-image-renderer from StackLight¶
As part of StackLight refactoring, removed grafana-image-renderer from the
Grafana installation in Container Cloud. StackLight uses this component only
for image generation in the Grafana web UI, which can be easily replaced with
standard screenshots.
The improvement optimizes resource usage and prevents potential CVEs that
frequently affect this component.
The following table lists the components versions of the Cluster release
17.2.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. Though, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section outlines release notes for deprecated major and patch Cluster
releases of the 17.1.x series dedicated for Mirantis OpenStack for Kubernetes
(MOSK).
This section includes release notes for the patch Cluster release 17.1.7 that
is introduced in the Container Cloud patch release 2.27.2
and is based on the previous Cluster releases of the 17.1.x series.
This patch Cluster release introduces MOSK 24.1.7 that is
based on Mirantis Kubernetes Engine 3.7.11 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.9.
For the list of CVE fixes delivered with this patch Cluster release, see
2.27.2.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
17.1.7.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.1.6 that
is introduced in the Container Cloud patch release 2.27.1
and is based on the previous Cluster releases of the 17.1.x series.
This patch Cluster release introduces MOSK 24.1.6 that is
based on Mirantis Kubernetes Engine 3.7.10 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.9, in which docker-ee-cli was updated to version
23.0.13 to fix several CVEs.
For the list of CVE fixes delivered with this patch Cluster release, see
2.27.1.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
17.1.6.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.1.5 that
is introduced in the Container Cloud patch release 2.26.5
and is based on the previous Cluster releases of the 17.1.x series.
This patch Cluster release introduces MOSK 24.1.5 that is
based on Mirantis Kubernetes Engine 3.7.8 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.9.
For the list of CVE fixes delivered with this patch Cluster release, see
2.26.5.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
17.1.5.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.1.4 that
is introduced in the Container Cloud patch release 2.26.4
and is based on the previous Cluster releases of the 17.1.x series.
This patch Cluster release introduces MOSK 24.1.4 that is
based on Mirantis Kubernetes Engine 3.7.8 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.9.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.26.4.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
17.1.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.1.3 that
is introduced in the Container Cloud patch release 2.26.3
and is based on the previous Cluster releases of the 17.1.x series.
This patch Cluster release introduces MOSK 24.1.3 that is
based on Mirantis Kubernetes Engine 3.7.7 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.9.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.26.3.
For details on patch release delivery, see Patch releases.
This section lists the artifacts of components included in the Cluster release
17.1.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.1.2 that
is introduced in the Container Cloud patch release 2.26.2
and is based on the Cluster releases 17.1.1 and
17.1.0.
This patch Cluster release introduces MOSK 24.1.2 that is
based on Mirantis Kubernetes Engine 3.7.6 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.9, in which docker-ee-cli was updated to version
23.0.10 to fix several CVEs.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.26.2
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
17.1.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.1.1 that
is introduced in the Container Cloud patch release 2.26.1
and is based on the Cluster release 17.1.0.
This patch Cluster release introduces MOSK 24.1.1 that is
based on Mirantis Kubernetes Engine 3.7.5 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.9.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.26.1
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
17.1.1.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the major Cluster release 17.1.0 that
is introduced in the Container Cloud release 2.26.0.
This Cluster release is based on the Cluster release 16.1.0.
The Cluster release 17.1.0 supports:
Mirantis OpenStack for Kubernetes (MOSK) 24.1.
For details, see MOSK Release Notes.
Mirantis Kubernetes Engine (MKE) 3.7.5. For details, see
MKE Release Notes.
Mirantis Container Runtime (MCR) 23.0.9. For details, see
MCR Release Notes.
Kubernetes 1.27.
For the list of known and addressed issues, refer to the Container Cloud
release 2.26.0 section.
Added support for Rook v1.12 that contains the Ceph CSI plugin 3.9.x and
introduces automated recovery of RBD (RWO) volumes from a failed node onto a
new one, ensuring uninterrupted operations.
For a complete list of features introduced in the new Rook version,
refer to official Rook documentation.
Support for custom device classes in a Ceph cluster¶
TechPreview
Implemented the customDeviceClasses parameter that enables you to specify
the custom names different from the default ones, which include ssd,
hdd, and nvme, and use them in nodes and pools definitions.
Using this parameter, you can, for example, separate storage of
large snapshots without touching the rest of Ceph cluster storage.
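For illustration, a minimal sketch of where such a custom class might be
declared and consumed follows. The exact placement of customDeviceClasses
within the KaaSCephCluster specification and the surrounding field names are
assumptions, not an excerpt from the product reference:

apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephCluster
metadata:
  name: ceph-cluster-managed          # hypothetical object name
spec:
  cephClusterSpec:
    customDeviceClasses:              # assumed location of the parameter
      - archive-hdd                   # hypothetical custom class name
    nodes:
      worker-0:
        storageDevices:
          - name: sdb
            config:
              deviceClass: archive-hdd    # custom class used in a node definition
    pools:
      - name: snapshots               # hypothetical pool for large snapshots
        deviceClass: archive-hdd      # custom class used in a pool definition
        replicated:
          size: 3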
To enhance network security, added NetworkPolicy objects for all types of
Ceph daemons. These policies allow only specified ports to be used by the
corresponding Ceph daemon pods.
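These NetworkPolicy objects are created by the product itself. As a rough
illustration only, a Kubernetes NetworkPolicy that restricts ingress to a
single port looks similar to the following; the name, selector labels, and
port are hypothetical and do not reproduce the generated objects:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ceph-mon-allow-6789           # hypothetical name
  namespace: rook-ceph
spec:
  podSelector:
    matchLabels:
      app: rook-ceph-mon              # hypothetical label of Ceph Monitor pods
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: TCP
          port: 6789                  # only the specified port is allowed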
Completely reorganized and significantly improved the StackLight logging
pipeline by implementing the following changes:
Switched to the storage-based log retention strategy that optimizes storage
utilization and ensures effective data retention. This approach ensures
that storage resources are efficiently allocated based on the importance
and volume of different data types. The logging index management implies
the following advantages:
Storage-based rollover mechanism
Consistent shard allocation
Minimal size of cluster state
Storage compression
No filter by logging level (filtering by tag is still available)
Control over disk space to be taken by indices types:
Logs
OpenStack notifications
Kubernetes events
Introduced new system and audit indices that are managed by OpenSearch
data streams. It is a convenient way to manage insert-only pipelines such as
log message collection.
Introduced the OpenSearchStorageUsageCritical and
OpenSearchStorageUsageMajor alerts to monitor OpenSearch used and free
space from the file system perspective.
Introduced the following parameters:
persistentVolumeUsableStorageSizeGB to define exclusive OpenSearch
node usage
output_kind to define the type of logs to be forwarded to external
outputs
Important
Changes in the StackLight logging pipeline require the
following actions before and after the managed cluster update:
Added the alertsCommonLabels parameter for Prometheus server that defines
the list of custom labels to be injected to firing alerts while they are sent
to Alertmanager.
Caution
When new labels are injected, Prometheus sends alert updates
with a new set of labels, which can potentially cause Alertmanager
to have duplicated alerts for a short period of time if the cluster
currently has firing alerts.
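As an illustrative sketch only, such labels might be defined in the StackLight
configuration as follows; the nesting under prometheusServer and the label
names are assumptions:

prometheusServer:
  alertsCommonLabels:
    cluster_type: managed             # hypothetical label injected into firing alerts
    environment: production           # hypothetical label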
This section lists the artifacts of components included in the Cluster release
17.1.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. However, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section outlines release notes for unsupported major and patch Cluster
releases of the 17.0.x series dedicated to Mirantis OpenStack for Kubernetes (MOSK).
This section includes release notes for the patch Cluster release 17.0.4 that
is introduced in the Container Cloud patch release 2.25.4
and is based on Cluster releases 17.0.0, 17.0.1,
17.0.2, and 17.0.3.
This patch Cluster release introduces MOSK 23.3.4 that is
based on Mirantis Kubernetes Engine 3.7.3 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.7.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.25.4
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
17.0.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.0.3 that
is introduced in the Container Cloud patch release 2.25.3
and is based on Cluster releases 17.0.0, 17.0.1, and
17.0.2.
This patch Cluster release introduces MOSK 23.3.3 that is
based on Mirantis Kubernetes Engine 3.7.3 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.7.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.25.3
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
17.0.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.0.2 that
is introduced in the Container Cloud patch release 2.25.2
and is based on Cluster releases 17.0.0 and 17.0.1.
This patch Cluster release introduces MOSK 23.3.2 that is
based on Mirantis Kubernetes Engine 3.7.2 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.7.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.25.2
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
17.0.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 17.0.1 that
is introduced in the Container Cloud patch release 2.25.1
and is based on the Cluster release 17.0.0.
This patch Cluster release introduces MOSK 23.3.1 that is
based on Mirantis Kubernetes Engine 3.7.2 with Kubernetes 1.27 and Mirantis
Container Runtime 23.0.7.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.25.1
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
17.0.1.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the major Cluster release 17.0.0 that
is introduced in the Container Cloud release 2.25.0.
This Cluster release is based on the Cluster release 16.0.0.
The Cluster release 17.0.0 supports:
Mirantis OpenStack for Kubernetes (MOSK) 23.3.
For details, see MOSK Release Notes.
Mirantis Kubernetes Engine (MKE) 3.7.1. For details, see
MKE Release Notes.
Mirantis Container Runtime (MCR) 23.0.7. For details, see
MCR Release Notes.
Kubernetes 1.27.
For the list of known and addressed issues, refer to the Container Cloud
release 2.25.0 section.
Introduced support for Mirantis Container Runtime (MCR) 23.0.7 and Mirantis
Kubernetes Engine (MKE) 3.7.1 that supports Kubernetes 1.27 for the Container
Cloud management and managed clusters. On existing clusters, MKE and MCR are
updated to the latest supported version when you update your managed cluster
to the Cluster release 17.0.0.
Caution
Support for MKE 3.6.x is dropped. Therefore, new deployments on
MKE 3.6.x are not supported.
Detailed view of a Ceph cluster summary in web UI¶
Implemented the Ceph Cluster details page in the Container Cloud
web UI containing the Machines and OSDs tabs with
detailed descriptions and statuses of Ceph machines and Ceph OSDs comprising
a Ceph cluster deployment.
Addressing storage devices using by-id identifiers¶
Implemented the capability to address Ceph storage devices using the by-id
identifiers.
The by-id identifier is the only persistent device identifier for a Ceph
cluster that remains stable after the cluster upgrade or any other maintenance.
Therefore, Mirantis recommends using device by-id symlinks rather than
device names or by-path symlinks.
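A hedged sketch of a node definition that addresses a device through its
by-id symlink follows; the fullPath field name and the device identifier are
assumptions used only to illustrate the idea:

spec:
  cephClusterSpec:
    nodes:
      worker-1:
        storageDevices:
          - fullPath: /dev/disk/by-id/wwn-0x5000c500a1b2c3d4   # hypothetical by-id symlink
            config:
              deviceClass: hdd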
Added the kaasCephState field in the KaaSCephCluster.status
specification to display the current state of KaasCephCluster and
any errors during object reconciliation, including specification
generation, object creation on a managed cluster, and status retrieval.
Added initial Technology Preview support for forwarding of Container Cloud
services logs, which are sent to OpenSearch by default, to Splunk using the
syslog external output configuration.
Implemented the following monitoring improvements for Ceph:
Optimized the following Ceph dashboards in Grafana: Ceph Cluster,
Ceph Pools, Ceph OSDs.
Removed the redundant Ceph Nodes Grafana dashboard. You can view
its content using the following dashboards:
Ceph stats through the Ceph Cluster dashboard.
Resource utilization through the System dashboard, which now
includes filtering by Ceph node labels, such as ceph_role_osd,
ceph_role_mon, and ceph_role_mgr.
Removed the rook_cluster alert label.
Removed the redundant CephOSDDown alert.
Renamed the CephNodeDown alert to CephOSDNodeDown.
Optimized StackLight NodeDown alerts for a better notification handling
after cluster recovery from an accident:
Reworked the NodeDown-related alert inhibition rules
Reworked the logic of all NodeDown-related alerts for all supported
groups of nodes, which includes renaming of the <alertName>TargetsOutage
alerts to <alertName>TargetDown
Added the TungstenFabricOperatorTargetDown alert for Tungsten Fabric
deployments of MOSK clusters
Removed redundant KubeDNSTargetsOutage and KubePodsNotReady alerts
Optimized the OpenSearch configuration and the StackLight data model to provide
better resource utilization and faster query response. Added the following
enhancements (a configuration sketch follows this list):
Limited the default namespaces for log collection with the ability to add
custom namespaces to the monitoring list using the following parameters:
logging.namespaceFiltering.logs - limits the number of namespaces
for Pods log collection. Enabled by default.
logging.namespaceFiltering.events - limits the number of namespaces
for Kubernetes events collection. Disabled by default.
logging.namespaceFiltering.events/logs.extraNamespaces - adds extra
namespaces, which are not in the default list, to collect specific
Kubernetes Pod logs or Kubernetes events. Empty by default.
Added the logging.enforceOopsCompression parameter that enforces 32 GB
of heap size, unless the defined memory limit allows using 50 GB of heap.
Enabled by default.
Added the NO_SEVERITY severity label that is automatically added to a
log with no severity label in the message. This allows having more control
over which logs are actually being processed by Fluentd and which
are skipped by mistake.
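The following sketch combines the parameters named in the list above into one
StackLight values fragment; whether logs and events are booleans or objects
with an enabled key is an assumption, as is the example namespace:

logging:
  namespaceFiltering:
    logs:
      enabled: true                   # limit namespaces for Pod log collection
      extraNamespaces:
        - my-application              # hypothetical extra namespace
    events:
      enabled: false                  # Kubernetes events filtering is off by default
  enforceOopsCompression: true        # keep the OpenSearch heap within 32 GB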
Added documentation on how to tune OpenSearch performance using hardware
and software settings for baremetal-based Container Cloud clusters.
On top of continuous improvements delivered to the existing Container Cloud
guides, added the documentation on how to export data from the
Table panels of Grafana dashboards to CSV.
This section includes release notes for the patch Cluster release 16.3.3 that
is introduced in the Container Cloud patch release 2.28.3
and is based on the previous Cluster releases of the 16.3.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.16
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.14.
For the list of CVE fixes delivered with this patch Cluster release, see
2.28.3
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.3.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.3.2 that
is introduced in the Container Cloud patch release 2.28.2
and is based on the previous Cluster releases of the 16.3.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.16
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.14.
For the list of CVE fixes delivered with this patch Cluster release, see
2.28.2
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.3.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.3.1 that
is introduced in the Container Cloud patch release 2.28.1
and is based on the Cluster release 16.3.0.
This Cluster release supports Mirantis Kubernetes Engine 3.7.15
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.14.
For the list of CVE fixes delivered with this patch Cluster release, see
2.28.1
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.3.1.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. Though, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section outlines release notes for major and patch Cluster releases of the
16.2.x series.
This section includes release notes for the patch Cluster release 16.2.7 that
is introduced in the Container Cloud patch release 2.28.3
and is based on the previous Cluster releases of the 16.2.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.16 with Kubernetes
1.27 and Mirantis Container Runtime 23.0.11.
For the list of CVE fixes delivered with this patch Cluster release, see
2.28.3
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.2.7.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.2.6 that
is introduced in the Container Cloud patch release 2.28.2
and is based on the previous Cluster releases of the 16.2.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.16 with Kubernetes
1.27 and Mirantis Container Runtime 23.0.11.
For the list of CVE fixes delivered with this patch Cluster release, see
2.28.2
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.2.6.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.2.5 that
is introduced in the Container Cloud patch release 2.28.1
and is based on the previous Cluster releases of the 16.2.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.15 with Kubernetes
1.27 and Mirantis Container Runtime 23.0.11.
For the list of CVE fixes delivered with this patch Cluster release, see
2.28.1
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.2.5.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.2.4 that
is introduced in the Container Cloud patch release 2.27.4
and is based on the previous Cluster releases of the 16.2.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.12 with Kubernetes
1.27 and Mirantis Container Runtime 23.0.11.
For the list of CVE fixes delivered with this patch Cluster release, see
2.27.4
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.2.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.2.3 that
is introduced in the Container Cloud patch release 2.27.3
and is based on the previous Cluster releases of the 16.2.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.12
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.11.
For the list of CVE fixes delivered with this patch Cluster release, see
2.27.3
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.2.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.2.2 that
is introduced in the Container Cloud patch release 2.27.2
and is based on the previous Cluster releases of the 16.2.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.11
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.11.
For the list of CVE fixes delivered with this patch Cluster release, see
2.27.2
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.2.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.2.1 that
is introduced in the Container Cloud patch release 2.27.1
and is based on the Cluster release 16.2.0.
This Cluster release supports Mirantis Kubernetes Engine 3.7.10
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.11, in which
docker-ee-cli was updated to version 23.0.13 to fix several CVEs.
For the list of CVE fixes delivered with this patch Cluster release, see
2.27.1
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.2.1.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the major Cluster release 16.2.0 that
is introduced in the Container Cloud release 2.27.0.
The Cluster release 16.2.0 supports:
Mirantis Kubernetes Engine (MKE) 3.7.8. For details, see
MKE Release Notes.
Mirantis Container Runtime (MCR) 23.0.11. For details, see
MCR Release Notes.
Kubernetes 1.27.
For the list of known and addressed issues, refer to the Container Cloud
release 2.27.0 section.
Introduced support for Mirantis Kubernetes Engine (MKE) 3.7.8 that supports
Kubernetes 1.27 for the Container Cloud management and managed clusters.
On existing managed clusters, MKE is updated to the latest supported
version when you update your managed cluster to the Cluster release 16.2.0.
Note
This enhancement applies to users who follow the update train using
major releases. Users who install patch releases have already obtained
MKE 3.7.8 in Container Cloud 2.26.4 (Cluster release 16.1.4).
Analyzed and fixed the majority of failed compliance checks in the MKE
benchmark compliance for Container Cloud core components and StackLight.
The following controls were analyzed:
Control ID: 5.1.2
Component: client-certificate-controller, helm-controller, local-volume-provisioner
Control description: Minimize access to secrets
Analyzed item: ClusterRoles with get, list, and watch access to Secret objects in a cluster

Control ID: 5.1.4
Component: local-volume-provisioner
Control description: Minimize access to create pods
Analyzed item: ClusterRoles with the create access to pod objects in a cluster

Control ID: 5.2.5
Component: client-certificate-controller, helm-controller, policy-controller, stacklight
Control description: Minimize the admission of containers with allowPrivilegeEscalation
Analyzed item: Containers with allowPrivilegeEscalation capability enabled
Upgraded Ceph major version from Quincy 17.2.7 (17.2.7-12.cve in the patch
release train) to Reef 18.2.3 with an automatic upgrade of Ceph components
on existing managed clusters during the Cluster version update.
Ceph Reef delivers a new version of RocksDB, which provides better I/O
performance. This version also supports RGW multisite resharding and contains
overall security improvements.
Added support for Rook v1.13 that contains the Ceph CSI plugin 3.10.x as the
default supported version. For a complete list of features and breaking
changes, refer to official Rook documentation.
Setting a configuration section for Rook parameters¶
Implemented the section option for the rookConfig parameter that
enables you to specify the section where a Rook parameter must be placed.
Using this option restarts only the daemons related to the corresponding
section instead of all Ceph daemons except Ceph OSDs.
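A hedged sketch of the idea follows; the "<section>|<parameter>" key format
inside rookConfig and the chosen parameter are assumptions, not the documented
syntax:

cephClusterSpec:
  rookConfig:
    "mon|mon_compact_on_start": "true"    # assumed section prefix; only Ceph Monitor
                                          # daemons would be restarted on change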
Implemented monitoring of disk along with I/O errors in kernel logs to detect
hardware and software issues. The implementation includes the dedicated
KernelIOErrorsDetected alert, the kernel_io_errors_total metric that
is collected on the Fluentd side using the I/O error patterns, and general
refactoring of metrics created in Fluentd.
S.M.A.R.T. metrics for creating alert rules on bare metal clusters¶
Added documentation describing usage examples of alert rules based on
S.M.A.R.T. metrics to monitor disk information on bare metal clusters.
The StackLight telegraf-ds-smart exporter uses the
S.M.A.R.T. plugin to
obtain detailed disk information and export it as metrics. S.M.A.R.T. is a
system commonly used across vendors that provides performance data as
attributes.
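As a rough sketch of such an alert rule, assuming StackLight accepts custom
alerts through a customAlerts-style parameter and that the exporter publishes
a health metric named smart_device_health_ok (both names are assumptions):

prometheusServer:
  customAlerts:
    - alert: DiskSmartHealthFailing       # hypothetical alert name
      expr: smart_device_health_ok == 0   # assumed metric from telegraf-ds-smart
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "S.M.A.R.T. reports a failing disk on {{ $labels.host }}"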
Improvements for OpenSearch and OpenSearch Indices Grafana dashboards¶
Improved performance and UX visibility of the OpenSearch and
OpenSearch Indices Grafana dashboards as well as added the
capability to minimize the number of indices to be displayed on dashboards.
Removal of grafana-image-renderer from StackLight¶
As part of StackLight refactoring, removed grafana-image-renderer from the
Grafana installation in Container Cloud. StackLight used this component only
for image generation in the Grafana web UI, which can be easily replaced with
standard screenshots.
This improvement optimizes resource usage and prevents potential CVEs that
frequently affect this component.
The following table lists the components versions of the Cluster release
16.2.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. However, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section outlines release notes for major and patch Cluster releases of the
16.1.x series.
This section includes release notes for the patch Cluster release 16.1.7 that
is introduced in the Container Cloud patch release 2.27.2
and is based on the previous Cluster releases of the 16.1.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.11
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.9.
For the list of CVE fixes delivered with this patch Cluster release, see
2.27.2
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.1.7.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.1.6 that
is introduced in the Container Cloud patch release 2.27.1
and is based on the previous Cluster releases of the 16.1.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.10
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.9, in which
docker-ee-cli was updated to version 23.0.13 to fix several CVEs.
For the list of CVE fixes delivered with this patch Cluster release, see
2.27.1
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.1.6.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.1.5 that
is introduced in the Container Cloud patch release 2.26.5
and is based on the previous Cluster releases of the 16.1.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.8
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.9.
For the list of CVE fixes delivered with this patch Cluster release, see
2.26.5
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.1.5.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.1.4 that
is introduced in the Container Cloud patch release 2.26.4
and is based on the previous Cluster releases of the 16.1.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.8
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.9.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.26.4
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.1.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.1.3 that
is introduced in the Container Cloud patch release 2.26.3
and is based on the previous Cluster releases of the 16.1.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.7.7
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.9.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.26.3
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.1.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.1.2 that
is introduced in the Container Cloud patch release 2.26.2
and is based on the Cluster releases 16.1.1 and
16.1.0.
This Cluster release supports Mirantis Kubernetes Engine 3.7.6
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.9, in which
docker-ee-cli was updated to version 23.0.10 to fix several CVEs.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.26.2
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.1.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 16.1.1 that
is introduced in the Container Cloud patch release 2.26.1
and is based on the Cluster release 16.1.0.
This Cluster release supports Mirantis Kubernetes Engine 3.7.5
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.9.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.26.1
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.1.1.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the major Cluster release 16.1.0 that
is introduced in the Container Cloud release 2.26.0.
The Cluster release 16.1.0 supports:
Mirantis Kubernetes Engine (MKE) 3.7.5. For details, see
MKE Release Notes.
Mirantis Container Runtime (MCR) 23.0.9. For details, see
MCR Release Notes.
Kubernetes 1.27.
For the list of known and addressed issues, refer to the Container Cloud
release 2.26.0 section.
Introduced support for Mirantis Container Runtime (MCR) 23.0.9 and Mirantis
Kubernetes Engine (MKE) 3.7.5 that supports Kubernetes 1.27 for the Container
Cloud management and managed clusters.
On existing managed clusters, MKE and MCR are updated to the latest supported
version when you update your managed cluster to the Cluster release 16.1.0.
Added support for Rook v1.12 that contains the Ceph CSI plugin 3.9.x and
introduces automated recovery of RBD (RWO) volumes from a failed node onto a
new one, ensuring uninterrupted operations.
For a complete list of features introduced in the new Rook version,
refer to official Rook documentation.
Support for custom device classes in a Ceph cluster¶
TechPreview
Implemented the customDeviceClasses parameter that enables you to specify
the custom names different from the default ones, which include ssd,
hdd, and nvme, and use them in nodes and pools definitions.
Using this parameter, you can, for example, separate storage of
large snapshots without touching the rest of Ceph cluster storage.
To enhance network security, added NetworkPolicy objects for all types of
Ceph daemons. These policies allow only specified ports to be used by the
corresponding Ceph daemon pods.
Completely reorganized and significantly improved the StackLight logging
pipeline by implementing the following changes:
Switched to the storage-based log retention strategy that optimizes storage
utilization and ensures effective data retention. This approach ensures
that storage resources are efficiently allocated based on the importance
and volume of different data types. The logging index management implies
the following advantages:
Storage-based rollover mechanism
Consistent shard allocation
Minimal size of cluster state
Storage compression
No filter by logging level (filtering by tag is still available)
Control over disk space to be taken by indices types:
Logs
OpenStack notifications
Kubernetes events
Introduced new system and audit indices that are managed by OpenSearch
data streams. It is a convenient way to manage insert-only pipelines such as
log message collection.
Introduced the OpenSearchStorageUsageCritical and
OpenSearchStorageUsageMajor alerts to monitor OpenSearch used and free
space from the file system perspective.
Introduced the following parameters:
persistentVolumeUsableStorageSizeGB to define exclusive OpenSearch
node usage
output_kind to define the type of logs to be forwarded to external
outputs
Important
Changes in the StackLight logging pipeline require the
following actions before and after the managed cluster update:
Added the alertsCommonLabels parameter for Prometheus server that defines
the list of custom labels to be injected to firing alerts while they are sent
to Alertmanager.
Caution
When new labels are injected, Prometheus sends alert updates
with a new set of labels, which can potentially cause Alertmanager
to have duplicated alerts for a short period of time if the cluster
currently has firing alerts.
This section lists the artifacts of components included in the Cluster release
16.1.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. However, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section outlines release notes for unsupported major and patch Cluster
releases of the 16.0.x series.
This section outlines release notes for the patch Cluster release 16.0.4 that
is introduced in the Container Cloud release 2.25.4
and is based on Cluster releases 16.0.0, 16.0.1,
16.0.2, and 16.0.3.
This Cluster release supports Mirantis Kubernetes Engine 3.7.3
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.7.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.25.4
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.0.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the patch Cluster release 16.0.3 that
is introduced in the Container Cloud release 2.25.3
and is based on Cluster releases 16.0.0, 16.0.1, and
16.0.2.
This Cluster release supports Mirantis Kubernetes Engine 3.7.3
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.7.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.25.3
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.0.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the patch Cluster release 16.0.2 that
is introduced in the Container Cloud release 2.25.2
and is based on Cluster releases 16.0.0 and 16.0.1.
This Cluster release supports Mirantis Kubernetes Engine 3.7.2
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.7.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.25.2
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.0.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the patch Cluster release 16.0.1 that
is introduced in the Container Cloud release 2.25.1
and is based on the Cluster release 16.0.0.
This Cluster release supports Mirantis Kubernetes Engine 3.7.2
with Kubernetes 1.27 and Mirantis Container Runtime 23.0.7.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.25.1
For details on patch release delivery, see Patch releases
This section lists the artifacts of components included in the Cluster release
16.0.1.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Introduced support for Mirantis Container Runtime (MCR) 23.0.7 and Mirantis
Kubernetes Engine (MKE) 3.7.1 that supports Kubernetes 1.27 for the Container
Cloud management and managed clusters. On existing clusters, MKE and MCR are
updated to the latest supported version when you update your managed cluster
to the Cluster release 16.0.0.
Caution
Support for MKE 3.6.x is dropped. Therefore, new deployments on
MKE 3.6.x are not supported.
Detailed view of a Ceph cluster summary in web UI¶
Implemented the Ceph Cluster details page in the Container Cloud
web UI containing the Machines and OSDs tabs with
detailed descriptions and statuses of Ceph machines and Ceph OSDs comprising
a Ceph cluster deployment.
Addressing storage devices using by-id identifiers¶
Implemented the capability to address Ceph storage devices using the by-id
identifiers.
The by-id identifier is the only persistent device identifier for a Ceph
cluster that remains stable after the cluster upgrade or any other maintenance.
Therefore, Mirantis recommends using device by-id symlinks rather than
device names or by-path symlinks.
Added the kaasCephState field in the KaaSCephCluster.status
specification to display the current state of KaasCephCluster and
any errors during object reconciliation, including specification
generation, object creation on a managed cluster, and status retrieval.
Added initial Technology Preview support for forwarding of Container Cloud
services logs, which are sent to OpenSearch by default, to Splunk using the
syslog external output configuration.
Implemented the following monitoring improvements for Ceph:
Optimized the following Ceph dashboards in Grafana: Ceph Cluster,
Ceph Pools, Ceph OSDs.
Removed the redundant Ceph Nodes Grafana dashboard. You can view
its content using the following dashboards:
Ceph stats through the Ceph Cluster dashboard.
Resource utilization through the System dashboard, which now
includes filtering by Ceph node labels, such as ceph_role_osd,
ceph_role_mon, and ceph_role_mgr.
Removed the rook_cluster alert label.
Removed the redundant CephOSDDown alert.
Renamed the CephNodeDown alert to CephOSDNodeDown.
Optimized StackLight NodeDown alerts for a better notification handling
after cluster recovery from an accident:
Reworked the NodeDown-related alert inhibition rules
Reworked the logic of all NodeDown-related alerts for all supported
groups of nodes, which includes renaming of the <alertName>TargetsOutage
alerts to <alertName>TargetDown
Added the TungstenFabricOperatorTargetDown alert for Tungsten Fabric
deployments of MOSK clusters
Removed redundant KubeDNSTargetsOutage and KubePodsNotReady alerts
Optimized the OpenSearch configuration and the StackLight data model to provide
better resource utilization and faster query response. Added the following
enhancements:
Limited the default namespaces for log collection with the ability to add
custom namespaces to the monitoring list using the following parameters:
logging.namespaceFiltering.logs - limits the number of namespaces
for Pods log collection. Enabled by default.
logging.namespaceFiltering.events - limits the number of namespaces
for Kubernetes events collection. Disabled by default.
logging.namespaceFiltering.events/logs.extraNamespaces - adds extra
namespaces, which are not in the default list, to collect specific
Kubernetes Pod logs or Kubernetes events. Empty by default.
Added the logging.enforceOopsCompression parameter that enforces 32 GB
of heap size, unless the defined memory limit allows using 50 GB of heap.
Enabled by default.
Added the NO_SEVERITY severity label that is automatically added to a
log with no severity label in the message. This allows having more control
over which logs are actually being processed by Fluentd and which
are skipped by mistake.
Added documentation on how to tune OpenSearch performance using hardware
and software settings for baremetal-based Container Cloud clusters.
On top of continuous improvements delivered to the existing Container Cloud
guides, added the documentation on how to export data from the
Table panels of Grafana dashboards to CSV.
This section outlines release notes for unsupported Cluster releases of the
15.x series.
Major and patch versions update path
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. However, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section includes release notes for the patch Cluster release 15.0.4 that
is introduced in the Container Cloud patch release 2.24.5
and is based on Cluster releases 15.0.1,
15.0.2, and 15.0.3.
This patch Cluster release introduces MOSK 23.2.3 that is
based on Mirantis Kubernetes Engine 3.6.6 with Kubernetes 1.24 and Mirantis
Container Runtime 20.10.17.
For the list of CVE fixes delivered with this patch Cluster release, see
Container Cloud 2.24.5
For details on patch release delivery, see Patch releases
This section lists the components artifacts of the Cluster release 15.0.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 15.0.3 that
is introduced in the Container Cloud patch release 2.24.4
and is based on Cluster releases 15.0.1 and
15.0.2.
This patch Cluster release introduces MOSK 23.2.2 that is
based on Mirantis Kubernetes Engine 3.6.6 with Kubernetes 1.24 and Mirantis
Container Runtime 20.10.17.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.24.4
For details on patch release delivery, see Patch releases
This section lists the components artifacts of the Cluster release 15.0.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 15.0.2 that
is introduced in the Container Cloud patch release 2.24.3
and is based on the major Cluster release 15.0.1.
This patch Cluster release introduces MOSK 23.2.1 that is
based on Mirantis Kubernetes Engine 3.6.6 with Kubernetes 1.24 and Mirantis
Container Runtime 20.10.17, in which docker-ee-cli was updated to version
20.10.18 to fix the following CVEs:
CVE-2023-28840,
CVE-2023-28642,
CVE-2022-41723.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.24.3
For details on patch release delivery, see Patch releases
This section lists the components artifacts of the Cluster release 15.0.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the major Cluster release 15.0.1 that
is introduced in the Container Cloud release 2.24.2.
This Cluster release is based on the Cluster release 14.0.1.
The Cluster release 15.0.1 supports:
Mirantis OpenStack for Kubernetes (MOSK) 23.2.
For details, see MOSK Release Notes.
Mirantis Kubernetes Engine (MKE) 3.6.5. For details, see
MKE Release Notes.
Mirantis Container Runtime (MCR) 20.10.17. For details, see
MCR Release Notes.
Kubernetes 1.24.
For the list of known and addressed issues, refer to the Container Cloud
release 2.24.0 section.
Added support for Mirantis Container Runtime (MCR) 20.10.17 and Mirantis
Kubernetes Engine (MKE) 3.6.5 that supports Kubernetes 1.24.
An update from the Cluster release 12.7.0 or 12.7.4 to 15.0.1 becomes available
through the Container Cloud web UI menu once the related management or
regional cluster automatically upgrades to Container Cloud 2.24.2.
Caution
Support for MKE 3.5.x is dropped. Therefore, new deployments on
MKE 3.5.x are not supported.
Upgraded Ceph major version from Pacific 16.2.11 to Quincy 17.2.6 with an
automatic upgrade of Ceph components on existing managed clusters during the
Cluster version update.
On top of addressing the known issue 30635, introduced
a requirement for the deviceClass field in each Ceph pool specification to
prevent the issue recurrence. This rule applies to all pools defined in
spec.cephClusterSpec.pools, spec.cephClusterSpec.objectStorage, and
spec.cephClusterSpec.sharedFilesystem of the KaaSCephCluster specification.
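For example, a pool definition that satisfies this requirement might look as
follows; the pool name and device class are illustrative:

spec:
  cephClusterSpec:
    pools:
      - name: kubernetes              # hypothetical pool name
        deviceClass: hdd              # now required in each pool specification
        replicated:
          size: 3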
Monitoring of network connectivity between Ceph nodes¶
Introduced healthcheck metrics and the following Ceph alerts to monitor
network connectivity between Ceph nodes:
Major version update of OpenSearch and OpenSearch Dashboards¶
Updated OpenSearch and OpenSearch Dashboards from major version 1.3.7 to 2.7.0.
The latest version includes a number of enhancements along with bug and
security fixes.
Caution
The version update process can take up to 20 minutes, during
which both OpenSearch and OpenSearch Dashboards may become temporarily
unavailable. Additionally, the KubeStatefulsetUpdateNotRolledOut alert
for the opensearch-master StatefulSet may fire for a short period of time.
Note
The end-of-life support of the major version 1.x ends on
December 31, 2023.
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures you receive security fixes as soon as
they become available. However, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
This section outlines release notes for unsupported Cluster releases of the
14.x series.
This section outlines release notes for the Cluster release 14.1.0 that is
introduced in the Container Cloud release 2.25.0. This
Cluster release is dedicated for the vSphere provider only. This is the last
Cluster release for the vSphere provider based on Mirantis Kubernetes Engine
3.6.6 with Kubernetes 1.24.
Important
The major Cluster release 14.1.0 is the last Cluster release
for the vSphere provider based on MCR 20.10 and MKE 3.6.6 with
Kubernetes 1.24. Therefore, Mirantis highly recommends updating your
existing vSphere-based managed clusters to the Cluster release
16.0.1 that contains newer versions of MCR and MKE with
Kubernetes. Otherwise, your management cluster upgrade to Container Cloud
2.25.2 will be blocked.
Since Container Cloud 2.25.1, the major Cluster release 14.1.0 is deprecated.
Greenfield vSphere-based deployments on this Cluster release are not
supported. Use the patch Cluster release 16.0.1 for new deployments instead.
For the list of known and addressed issues delivered in the Cluster release
14.1.0, refer to the Container Cloud release 2.25.0 section.
Introduced support for Mirantis Container Runtime (MCR) 23.0.7 for the
Container Cloud management and managed clusters. On existing clusters, MCR is
updated to the latest supported version when you update your
managed cluster to the Cluster release 14.1.0.
Addressing storage devices using by-id identifiers¶
Implemented the capability to address Ceph storage devices using the by-id
identifiers.
The by-id identifier is the only persistent device identifier for a Ceph
cluster that remains stable after the cluster upgrade or any other maintenance.
Therefore, Mirantis recommends using device by-id symlinks rather than
device names or by-path symlinks.
Added the kaasCephState field in the KaaSCephCluster.status
specification to display the current state of KaasCephCluster and
any errors during object reconciliation, including specification
generation, object creation on a managed cluster, and status retrieval.
Added initial Technology Preview support for forwarding of Container Cloud
services logs, which are sent to OpenSearch by default, to Splunk using the
syslog external output configuration.
Implemented the following monitoring improvements for Ceph:
Optimized the following Ceph dashboards in Grafana: Ceph Cluster,
Ceph Pools, Ceph OSDs.
Removed the redundant Ceph Nodes Grafana dashboard. You can view
its content using the following dashboards:
Ceph stats through the Ceph Cluster dashboard.
Resource utilization through the System dashboard, which now
includes filtering by Ceph node labels, such as ceph_role_osd,
ceph_role_mon, and ceph_role_mgr.
Removed the rook_cluster alert label.
Removed the redundant CephOSDDown alert.
Renamed the CephNodeDown alert to CephOSDNodeDown.
Optimized StackLight NodeDown alerts for a better notification handling
after cluster recovery from an accident:
Reworked the NodeDown-related alert inhibition rules
Reworked the logic of all NodeDown-related alerts for all supported
groups of nodes, which includes renaming of the <alertName>TargetsOutage
alerts to <alertName>TargetDown
Added the TungstenFabricOperatorTargetDown alert for Tungsten Fabric
deployments of MOSK clusters
Removed redundant KubeDNSTargetsOutage and KubePodsNotReady alerts
Optimized the OpenSearch configuration and the StackLight data model to provide
better resource utilization and faster query response. Added the following
enhancements:
Limited the default namespaces for log collection with the ability to add
custom namespaces to the monitoring list using the following parameters:
logging.namespaceFiltering.logs - limits the number of namespaces
for Pods log collection. Enabled by default.
logging.namespaceFiltering.events - limits the number of namespaces
for Kubernetes events collection. Disabled by default.
logging.namespaceFiltering.events/logs.extraNamespaces - adds extra
namespaces, which are not in the default list, to collect specific
Kubernetes Pod logs or Kubernetes events. Empty by default.
Added the logging.enforceOopsCompression parameter that enforces 32 GB
of heap size, unless the defined memory limit allows using 50 GB of heap.
Enabled by default.
Added the NO_SEVERITY severity label that is automatically added to a
log with no severity label in the message. This allows having more control
over which logs are actually being processed by Fluentd and which
are skipped by mistake.
Added documentation on how to tune OpenSearch performance using hardware
and software settings for baremetal-based Container Cloud clusters.
On top of continuous improvements delivered to the existing Container Cloud
guides, added the documentation on how to export data from the
Table panels of Grafana dashboards to CSV.
The following table lists the components versions
of the Cluster release 14.1.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous major Cluster release version, are marked with
a corresponding superscript, for example, lcm-ansibleUpdated.
This section lists the components artifacts of the Cluster release 14.1.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous major Cluster release version, are marked with
a corresponding superscript, for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 14.0.4 that
is introduced in the Container Cloud patch release 2.24.5
and is based on Cluster releases 14.0.1,
14.0.2, and 14.0.3.
This patch Cluster release is based on Mirantis Kubernetes Engine 3.6.6
with Kubernetes 1.24 and Mirantis Container Runtime 20.10.17.
For the list of CVE fixes delivered with this patch Cluster release, see
Container Cloud 2.24.5
For details on patch release delivery, see Patch releases
This section lists the components artifacts of the Cluster release 14.0.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 14.0.3 that
is introduced in the Container Cloud patch release 2.24.4
and is based on Cluster releases 14.0.1 and
14.0.2.
This patch Cluster release is based on Mirantis Kubernetes Engine 3.6.6
with Kubernetes 1.24 and Mirantis Container Runtime 20.10.17.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.24.4
For details on patch release delivery, see Patch releases
This section lists the components artifacts of the Cluster release 14.0.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 14.0.2 that
is introduced in the Container Cloud patch release 2.24.3
and is based on the Cluster release 14.0.1.
This patch Cluster release is based on Mirantis Kubernetes Engine 3.6.6
with Kubernetes 1.24 and Mirantis Container Runtime 20.10.17, in which
docker-ee-cli was updated to version 20.10.18 to fix the following CVEs:
CVE-2023-28840,
CVE-2023-28642,
CVE-2022-41723.
For the list of enhancements and CVE fixes delivered with this patch
Cluster release, see 2.24.3
For details on patch release delivery, see Patch releases
This section lists the components artifacts of the Cluster release 14.0.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the major Cluster release 14.0.1 that
is introduced in the Container Cloud release 2.24.2.
This Cluster release supports Mirantis Kubernetes Engine 3.6.5
with Kubernetes 1.24 and Mirantis Container Runtime 20.10.17.
The Cluster release 14.0.1 is based on 14.0.0 introduced in Container Cloud
2.24.0. The only difference between these two 14.x releases
is that 14.0.1 contains the following updated LCM and StackLight artifacts to
address critical CVEs:
For the list of enhancements, refer to the Cluster release
14.0.0. For the list of known and addressed issues, refer to the
Container Cloud release 2.24.0 section.
The following table lists the components versions
of the Cluster release 14.0.1.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Introduced support for Mirantis Container Runtime (MCR) 20.10.17 and Mirantis
Kubernetes Engine (MKE) 3.6.5 that supports Kubernetes 1.24 for the Container
Cloud management, regional, and managed clusters. On existing clusters, MKE
and MCR are updated to the latest supported version when you update your
managed cluster to the Cluster release 14.0.0.
Caution
Support for MKE 3.5.x is dropped. Therefore, new deployments on
MKE 3.5.x are not supported.
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.2.
Upgraded Ceph major version from Pacific 16.2.11 to Quincy 17.2.6 with an
automatic upgrade of Ceph components on existing managed clusters during the
Cluster version update.
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.2.
Implemented a Ceph non-admin client to share the producer cluster resources
with the consumer cluster in the shared Ceph cluster configuration. The use of
the non-admin client, as opposed to the admin client, eliminates the risk of
destructive actions from the consumer cluster.
Caution
For MKE clusters that are part of MOSK infrastructure, the feature
is not supported yet.
Dropping of redundant Ceph components from management and regional clusters
As the final part of Ceph removal from Container Cloud management clusters,
which reduces resource consumption, removed the following Ceph components that
were present on clusters for backward compatibility:
On top of addressing the known issue 30635, introduced
a requirement for the deviceClass field in each Ceph pool specification to
prevent the issue recurrence. This rule applies to all pools defined in
spec.cephClusterSpec.pools, spec.cephClusterSpec.objectStorage, and
spec.cephClusterSpec.sharedFilesystem.
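A hedged example of a pool entry that satisfies the new deviceClass requirement;
the pool name, role, and replication settings are illustrative only:

  spec:
    cephClusterSpec:
      pools:
      - name: kubernetes            # illustrative pool name
        role: kubernetes
        deviceClass: hdd            # now required in every pool specification
        replicated:
          size: 3
        default: true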
Monitoring of network connectivity between Ceph nodes
Introduced healthcheck metrics and the following Ceph alerts to monitor
network connectivity between Ceph nodes:
CephDaemonSlowOps
CephMonClockSkew
CephOSDFlapping
CephOSDSlowClusterNetwork
CephOSDSlowPublicNetwork
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.2.
Implemented the following improvements to StackLight alerting:
Changed severity for multiple alerts to increase visibility of potentially
workload-impacting alerts and decrease noise of non-workload-impacting
alerts
Renamed MCCLicenseExpirationCritical to MCCLicenseExpirationHigh,
MCCLicenseExpirationMajor to MCCLicenseExpirationMedium
For Ironic:
Removed IronicBmMetricsMissing in favor of IronicBmApiOutage
Removed inhibition rules for IronicBmTargetDown and
IronicBmApiOutage
Improved expression for IronicBmApiOutage
For Kubernetes applications:
Reworked troubleshooting steps for KubeStatefulSetUpdateNotRolledOut,
KubeDeploymentOutage, KubeDeploymentReplicasMismatch
Updated descriptions for KubeStatefulSetOutage and
KubeDeploymentOutage
Changed expressions for KubeDeploymentOutage,
KubeDeploymentReplicasMismatch, CephOSDDiskNotResponding, and
CephOSDDown
Major version update of OpenSearch and OpenSearch Dashboards
Updated OpenSearch and OpenSearch Dashboards from major version 1.3.7 to 2.7.0.
The latest version includes a number of enhancements along with bug and
security fixes.
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.2.
Caution
The version update process can take up to 20 minutes, during
which both OpenSearch and OpenSearch Dashboards may become temporarily
unavailable. Additionally, the KubeStatefulsetUpdateNotRolledOut alert
for the opensearch-master StatefulSet may fire for a short period of time.
Note
Support for the major version 1.x ends on
December 31, 2023.
Tuned the performance of Grafana dashboards for faster loading and a better
user experience by refactoring and optimizing them.
This enhancement includes extracting the OpenSearch Indices
dashboard from the OpenSearch dashboard to provide detailed
information about the state of indices, including their size and the size of
document values and segments.
To improve Prometheus performance and provide better resource utilization
with faster query response, dropped metrics that are unused by StackLight.
Also created the default white list of metrics that you can expand.
The feature is enabled by default using the
prometheusServer.metricsFiltering.enabled: true parameter. Therefore, if you
have created custom alerts, recording rules, or dashboards, or if you actively
use certain metrics for other purposes, some of those metrics may be dropped.
Verify the white list of Prometheus scrape jobs
to ensure that the required metrics are not dropped.
If a job name that relates to the required metric is not present in this list,
its target metrics are not dropped and are collected by Prometheus by default.
If the required metric is not present in this list, you can whitelist it
using the prometheusServer.metricsFiltering.extraMetricsInclude parameter.
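A minimal sketch of the metrics filtering parameters in the StackLight Helm chart
values; the metric name under extraMetricsInclude is a hypothetical example:

  prometheusServer:
    metricsFiltering:
      enabled: true                  # default; set to false to disable metrics filtering
      extraMetricsInclude:
      - my_custom_metric_total       # hypothetical metric to keep in addition to the white list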
This section outlines release notes for the unsupported Cluster releases
of the 12.x series. Cluster releases ending with a zero, for example, 12.x.0,
are major releases. Cluster releases ending with a non-zero, for example,
12.x.1, are patch releases of a major release 12.x.0.
This section includes release notes for the patch Cluster release 12.7.4 that
is introduced in the Container Cloud patch release 2.23.5
and is based on the Cluster release 12.7.0. This patch
Cluster release supports MOSK 23.1.4.
For CVE fixes delivered with this patch Cluster release, see security notes
for 2.23.5
For CVE fixes delivered with the previous patch Cluster releases, see
security notes for 2.23.4,
2.23.3, and
2.23.2
For details on patch release delivery, see Patch releases
This section lists the components artifacts of the Cluster release 12.7.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 12.7.3 that
is introduced in the Container Cloud patch release 2.23.4
and is based on the Cluster release 12.7.0. This patch
Cluster release supports MOSK 23.1.3.
For CVE fixes delivered with this patch Cluster release, see security notes
for 2.23.4
For CVE fixes delivered with the previous patch Cluster releases, see
security notes for 2.23.3 and
2.23.2
For details on patch release delivery, see Patch releases
This section lists the components artifacts of the Cluster release 12.7.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 12.7.2 that
is introduced in the Container Cloud patch release 2.23.3
and is based on the Cluster release 12.7.0. This patch
Cluster release supports MOSK 23.1.2.
For CVE fixes delivered with this patch Cluster release, see security notes
for 2.23.3
For CVE fixes delivered with the previous patch Cluster release, see
security notes for 2.23.2
For details on patch release delivery, see Patch releases
This section lists the components artifacts of the Cluster release 12.7.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the patch Cluster release 12.7.1 that
is introduced in the Container Cloud patch release 2.23.2
and is based on the Cluster release 12.7.0. This patch
Cluster release supports MOSK 23.1.1.
For details on patch release delivery, see Patch releases
This section lists the components artifacts of the Cluster release 12.7.1.
For artifacts of the Container Cloud release, see
Container Cloud release 2.23.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 12.7.0 that is
introduced in the Container Cloud release 2.23.1.
This Cluster release is based on the Cluster release 11.7.0.
The Cluster release 12.7.0 supports:
Mirantis OpenStack for Kubernetes (MOSK) 23.1.
For details, see MOSK Release Notes.
Mirantis Kubernetes Engine (MKE) 3.5.7. For details, see
MKE Release Notes.
Mirantis Container Runtime (MCR) 20.10.13. For details, see
MCR Release Notes.
Kubernetes 1.21.
For the list of known and resolved issues, refer to the Container Cloud
release 2.23.0 section.
Updated the Mirantis Kubernetes Engine (MKE) patch release from 3.5.5 to 3.5.7.
The MKE update occurs automatically when you update your managed cluster.
Automatic upgrade of Ceph from Octopus to Pacific
Upgraded Ceph major version from Octopus 15.2.17 to Pacific 16.2.11 with an
automatic upgrade of Ceph components on existing managed clusters during the
Cluster version update.
Caution
Since Ceph Pacific, while mounting an RBD or CephFS volume, CSI
drivers do not propagate the 777 permission on the mount path.
Increased the default number of Ceph Managers deployed on a Ceph cluster to
two, active and stand-by, to improve fault tolerance and HA.
On existing clusters, the second Ceph Manager deploys automatically after a
managed cluster update.
Note
Mirantis recommends labeling at least three Ceph nodes with the mgr
role, which equals the default number of Ceph nodes for the mon role.
In such a configuration, one backup Ceph node is available to redeploy
a failed Ceph Manager in case of a server outage.
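A hedged fragment of the KaaSCephCluster nodes section that labels a node with the
mon and mgr roles; the node name is hypothetical and the exact nesting may differ
in your cluster specification:

  spec:
    cephClusterSpec:
      nodes:
        storage-worker-2:            # hypothetical machine name
          roles:
          - mon
          - mgr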
Implemented monitoring of bond interfaces for clusters based on bare metal.
The number of active and configured slaves per bond is now monitored, with the
following alerts raised in case of issues:
Calculation of storage retention time using OpenSearch and Prometheus panels
Implemented the following panels in the Grafana dashboards for OpenSearch
and Prometheus that provide details on the storage usage and allow
calculating the possible retention time based on provisioned storage and
average usage:
Implemented deployment of two iam-proxy instances for the StackLight HA
setup that ensures access to HA components if one iam-proxy instance
fails. The second iam-proxy instance is automatically deployed during
cluster update on existing StackLight HA deployments.
Log forwarding to third-party systems using Fluentd plugins
Added the capability to forward logs to external Elasticsearch and OpenSearch
servers as the fluentd-logs output. This enhancement also expands existing
configuration options for log forwarding to syslog.
Introduced logging.externalOutputs that deprecates logging.syslog and
enables you to configure any number of outputs with more configuration
flexibility.
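A heavily hedged sketch of an external output definition under logging.externalOutputs;
the output name is arbitrary, and the keys inside it (plugin, host, port) are
assumptions, since the actual fields follow the options of the selected Fluentd
output plugin:

  logging:
    externalOutputs:
      remote-opensearch:             # arbitrary output name
        plugin: opensearch           # assumption: plugin selector key
        host: opensearch.example.com # assumption: plugin-specific option
        port: 9200                   # assumption: plugin-specific option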
‘MCC Applications Performance’ Grafana dashboard for StackLight
Implemented the MCC Applications Performance Grafana dashboard
that provides information on the work of Container Cloud internals based on
Golang, controller runtime, and custom metrics. You can use it to verify
application performance and for troubleshooting purposes.
The following table lists the components versions of the Cluster release
12.7.0. For major components and versions of the Container Cloud release, see
Container Cloud release 2.23.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section lists the components artifacts of the Cluster release 12.7.0.
For artifacts of the Container Cloud release, see
Container Cloud release 2.23.0.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 12.5.0 that is
introduced in the Container Cloud release 2.21.1.
This Cluster release is based on the Cluster release 11.5.0.
The Cluster release 12.5.0 supports:
Mirantis OpenStack for Kubernetes (MOSK) 22.5.
For details, see MOSK Release Notes.
Mirantis Kubernetes Engine (MKE) 3.5.5. For details, see
MKE Release Notes.
Mirantis Container Runtime (MCR) 20.10.13. For details, see
MCR Release Notes.
Kubernetes 1.21.
For the list of known and resolved issues, refer to the Container Cloud
release 2.21.0 section.
Added support for the Mirantis Kubernetes Engine (MKE) 3.5.5 with Kubernetes
1.21 and the Mirantis Container Runtime (MCR) version 20.10.13.
An update from the Cluster release 8.10.0 to 12.5.0 becomes available
through the Container Cloud web UI menu once the related management or
regional cluster automatically upgrades to Container Cloud 2.21.1.
Updated the MetalLB version from 0.12.1 to 0.13.4 to apply the latest
enhancements. The MetalLB configuration is now stored in dedicated MetalLB
objects instead of the ConfigMap object.
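For reference, the dedicated objects introduced in MetalLB 0.13.x look similar to
the following sketch. In Container Cloud, these objects are typically rendered from
the cluster configuration rather than created manually; the pool name and address
range are illustrative:

  apiVersion: metallb.io/v1beta1
  kind: IPAddressPool
  metadata:
    name: default
    namespace: metallb-system
  spec:
    addresses:
    - 10.0.0.100-10.0.0.120          # illustrative address range
  ---
  apiVersion: metallb.io/v1beta1
  kind: L2Advertisement
  metadata:
    name: default
    namespace: metallb-system
  spec:
    ipAddressPools:
    - default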
Improved etcd monitoring by implementing the Etcd dashboard and
etcdDbSizeCritical and etcdDbSizeMajor alerts that inform about the
size of the etcd database.
The following table lists the components versions of the Cluster release
12.5.0. For major components and versions of the Container Cloud release, see
Container Cloud release 2.21.0.
This section lists the components artifacts of the Cluster release 12.5.0.
For artifacts of the Container Cloud release, see
Container Cloud release 2.21.0.
This section outlines release notes for the unsupported Cluster releases
of the 11.x series. Cluster releases ending with a zero, for example, 11.x.0,
are major releases. Cluster releases ending with a non-zero, for example,
11.x.1, are patch releases of a major release 11.x.0.
This section includes release notes for the patch Cluster release 11.7.4 that
is introduced in the Container Cloud patch release 2.23.5
and is based on the Cluster release 11.7.0.
For CVE fixes delivered with this patch Cluster release, see security notes
for 2.23.5
For CVE fixes delivered with the previous patch Cluster releases, see
security notes for 2.23.4,
2.23.3, and
2.23.2
For details on patch release delivery, see Patch releases
This section lists the components artifacts of the Cluster release 11.7.4.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 11.7.3 that
is introduced in the Container Cloud patch release 2.23.4
and is based on the Cluster release 11.7.0.
For CVE fixes delivered with this patch Cluster release, see security notes
for 2.23.4
For CVE fixes delivered with the previous patch Cluster releases, see
security notes for 2.23.3 and
2.23.2
For details on patch release delivery, see Patch releases
This section lists the components artifacts of the Cluster release 11.7.3.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section includes release notes for the patch Cluster release 11.7.2 that
is introduced in the Container Cloud patch release 2.23.3
and is based on the Cluster release 11.7.0.
For CVE fixes delivered with this patch Cluster release, see security notes
for 2.23.3
For CVE fixes delivered with the previous patch Cluster release, see
security notes for 2.23.2
For details on patch release delivery, see Patch releases
This section lists the components artifacts of the Cluster release 11.7.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the patch Cluster release 11.7.1 that
is introduced in the Container Cloud patch release 2.23.2
and is based on the Cluster release 11.7.0.
For the list of CVE fixes delivered with this patch Cluster release, see
2.23.2. For details on patch release delivery, see
Patch releases.
This section lists the components artifacts of the Cluster release 11.7.1.
For artifacts of the Container Cloud release, see
Container Cloud release 2.23.2.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Updated the Mirantis Kubernetes Engine (MKE) version from 3.5.5 to 3.5.7
for the Container Cloud management, regional, and managed clusters on all
supported cloud providers, as well as for non Container Cloud based MKE
cluster attachment.
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.1.
Automatic upgrade of Ceph from Octopus to Pacific
Upgraded Ceph major version from Octopus 15.2.17 to Pacific 16.2.11 with an
automatic upgrade of Ceph components on existing managed clusters during the
Cluster version update.
Caution
Since Ceph Pacific, while mounting an RBD or CephFS volume, CSI
drivers do not propagate the 777 permission on the mount path.
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.1.
Implemented deployment of two iam-proxy instances for the StackLight HA
setup that ensures access to HA components if one iam-proxy instance
fails. The second iam-proxy instance is automatically deployed during
cluster update on existing StackLight HA deployments.
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.1.
Log forwarding to third-party systems using Fluentd plugins
Added the capability to forward logs to external Elasticsearch and OpenSearch
servers as the fluentd-logs output. This enhancement also expands existing
configuration options for log forwarding to syslog.
Introduced logging.externalOutputs that deprecates logging.syslog and
enables you to configure any number of outputs with more configuration
flexibility.
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.1.
‘MCC Applications Performance’ Grafana dashboard for StackLight
Implemented the MCC Applications Performance Grafana dashboard
that provides information on the work of Container Cloud internals based on
Golang, controller runtime, and custom metrics. You can use it to verify
application performance and for troubleshooting purposes.
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.1.
The following table lists the components versions of the Cluster release
11.7.0. For major components and versions of the Container Cloud release, see
Container Cloud release 2.23.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section lists the components artifacts of the Cluster release 11.7.0.
For artifacts of the Container Cloud release, see
Container Cloud release 2.23.0.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Implemented monitoring of bond interfaces for clusters based on bare metal
and Equinix Metal with public or private networking. The number of active and
configured slaves per bond is now monitored, with the following alerts raised
in case of issues:
BondInterfaceDown
BondInterfaceSlaveDown
BondInterfaceOneSlaveLeft
BondInterfaceOneSlaveConfigured
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.1.
Calculation of storage retention time using OpenSearch and Prometheus panels
Implemented the following panels in the Grafana dashboards for OpenSearch
and Prometheus that provide details on the storage usage and allow
calculating the possible retention time based on provisioned storage and
average usage:
OpenSearch dashboard:
Cluster > Estimated Retention
Resources > Disk
Resources > File System Used Space by Percentage
Resources > Stored Indices Disk Usage
Resources > Age of Logs
Prometheus dashboard:
Cluster > Estimated Retention
Resources > Storage
Resources > Storage by Percentage
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.1.
Container Cloud web UI support for Reference Application
Enhanced support for Reference Application, which is designed for workload
monitoring on managed clusters, by adding the
Enable Reference Application check box to the
StackLight tab of the Create new cluster wizard
in the Container Cloud web UI.
You can also enable this option after deployment using the
Configure cluster menu of the Container Cloud web UI or using CLI
by editing the StackLight parameters in the Cluster object.
The Reference Application enhancement also includes switching from MariaDB
to PostgreSQL to improve application stability and performance.
Note
Reference Application requires the following resources per cluster
on top of the main product requirements:
Completed the development of the Ceph Shared File System (CephFS) feature.
CephFS provides the capability to create read/write shared file system
Persistent Volumes (PVs).
Caution
For MKE clusters that are part of MOSK infrastructure, the feature
is not supported yet.
Implemented a mechanism connecting a consumer cluster to a producer cluster.
The consumer cluster uses the Ceph cluster deployed on the producer cluster to
store the necessary data.
Caution
For MKE clusters that are part of MOSK infrastructure, the feature
is not supported yet.
Sharing of a Ceph cluster with attached MKE clusters
Implemented the ability to share a Ceph cluster with MKE clusters that were not
originally deployed by Container Cloud and are attached to the management
cluster. Shared Ceph clusters allow providing the Ceph-based CSI driver to MKE
clusters. Both ReadWriteOnce (RWO) and ReadWriteMany (RWX) access modes
are supported with shared Ceph clusters.
Caution
For MKE clusters that are part of MOSK infrastructure, the feature
is not supported yet.
Increased the default number of Ceph Managers deployed on a Ceph cluster to
two, active and stand-by, to improve fault tolerance and HA.
On existing clusters, the second Ceph Manager deploys automatically after a
managed cluster update.
Note
Mirantis recommends labeling at least three Ceph nodes with the mgr
role, which equals the default number of Ceph nodes for the mon role.
In such a configuration, one backup Ceph node is available to redeploy
a failed Ceph Manager in case of a server outage.
Note
For MOSK-based deployments, the feature support is
available since MOSK 23.1.
The following table lists the components versions of the Cluster release
11.6.0. For major components and versions of the Container Cloud release, see
Container Cloud release 2.22.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section lists the components artifacts of the Cluster release 11.6.0.
For artifacts of the Container Cloud release, see
Container Cloud release 2.22.0.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Updated the Mirantis Kubernetes Engine (MKE) version from 3.5.4 to 3.5.5 and
the Mirantis Container Runtime (MCR) version from 20.10.12 to 20.10.13
for the Container Cloud management, regional, and managed clusters on all
supported cloud providers, as well as for non Container Cloud based MKE
cluster attachment.
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
Updated the MetalLB version from 0.12.1 to 0.13.4 for the Container Cloud
management, regional, and managed clusters of all cloud providers that use
MetalLB: bare metal, Equinix Metal with public and private networking, vSphere.
The MetalLB configuration is now stored in dedicated MetalLB objects
instead of the ConfigMap object.
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
Improved etcd monitoring by implementing the Etcd dashboard and
etcdDbSizeCritical and etcdDbSizeMajor alerts that inform about the
size of the etcd database.
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
Implemented Reference Application, a small microservice application
that enables workload monitoring on non-MOSK managed clusters.
It mimics a classical microservice application and provides metrics that
describe the likely behavior of user workloads.
Reference Application contains a set of alerts and a separate Grafana
dashboard that provide Reference Application check statuses and statistics
such as response time and content length.
The feature is disabled by default and can be enabled using the StackLight
configuration manifest.
Ceph secrets specification in the Ceph cluster status
Added the miraCephSecretsInfo specification to KaaSCephCluster.status.
This specification contains the current state and details of secrets that are
used in the Ceph cluster, such as keyrings, Ceph clients, RADOS Gateway user
credentials, and so on.
Using miraCephSecretsInfo, you can create, access, and remove Ceph RADOS
Block Device (RBD) or Ceph File System (CephFS) clients and RADOS Gateway
(RGW) users.
Caution
For MKE clusters that are part of MOSK infrastructure, the feature
is not supported yet.
The following table lists the components versions
of the Cluster release 11.5.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Updated the Mirantis Kubernetes Engine (MKE) version from 3.5.3 to 3.5.4 and
the Mirantis Container Runtime (MCR) version from 20.10.11 to 20.10.12
for the Container Cloud management, regional, and managed clusters on all
supported cloud providers, as well as for non Container Cloud based MKE
cluster attachment.
Ceph removal from management and regional clusters
To reduce resource consumption, removed Ceph cluster deployment from
management and regional clusters based on bare metal and Equinix Metal with
private networking. Ceph is automatically removed during the Cluster release
update to 11.4.0. Managed clusters continue using Ceph as a distributed
storage system.
Implemented the objectUsers RADOS Gateway parameter in the
KaaSCephCluster CR. The new parameter allows for easy creation of custom
Ceph RADOS Gateway users with permission rules. The users parameter is now
deprecated and, if specified, will be automatically transformed to
objectUsers.
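A hedged sketch of an objectUsers entry; the placement under objectStorage.rgw and
the capability fields are assumptions modeled on typical RADOS Gateway user
attributes:

  spec:
    cephClusterSpec:
      objectStorage:
        rgw:
          objectUsers:
          - name: custom-user          # user name
            displayName: Custom user   # assumption: display-name field
            capabilities:              # assumption: permission rules for the user
              users: read
              buckets: "*"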
Implemented the rbdDeviceMapOptions field in the Ceph pool parameters of
the KaaSCephCluster CR. The new field allows specifying custom RADOS Block
Device (RBD) map options to use with the StorageClass of a corresponding Ceph
pool.
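A hedged example of a pool entry with rbdDeviceMapOptions; the option string is
illustrative and depends on the RBD map options that you actually need:

  spec:
    cephClusterSpec:
      pools:
      - name: kubernetes
        deviceClass: hdd
        replicated:
          size: 3
        rbdDeviceMapOptions: krbd:rxbounce   # illustrative RBD map option string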
Implemented the mgr.mgrModules parameter that includes the name and
enabled keys to provide the capability to disable a particular Ceph Manager
module. The mgr.modules parameter is now deprecated and, if specified, will
be automatically transformed to mgr.mgrModules.
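A hedged sketch of the mgr.mgrModules parameter; the module names are common Ceph
Manager modules used here only as examples:

  spec:
    cephClusterSpec:
      mgr:
        mgrModules:
        - name: balancer
          enabled: true
        - name: pg_autoscaler
          enabled: false             # example of disabling a particular module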
The following table lists the components versions
of the Cluster release 11.4.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Enhanced the documentation by adding troubleshooting guidelines for the
Kubernetes system, Metric Collector, Helm Controller, Release Controller,
and MKE alerts.
Implemented the capability to remove or replace Ceph OSDs not only by the
device name or path but also by ID, using the by-id parameter in the
KaaSCephOperationRequest CR.
Implemented the capability to create multiple Ceph data pools for a single
CephFS installation using the dataPools parameter in the CephFS
specification. The dataPool parameter is now deprecated.
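A hedged sketch of a CephFS specification with multiple data pools; the file system
name, pool names, and replication settings are illustrative, and the exact nesting
under sharedFilesystem may differ:

  spec:
    cephClusterSpec:
      sharedFilesystem:
        cephFS:
        - name: cephfs-store           # illustrative file system name
          dataPools:
          - name: default-pool
            deviceClass: hdd
            replicated:
              size: 3
          - name: fast-pool
            deviceClass: ssd
            replicated:
              size: 3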
The following table lists the components versions
of the Cluster release 11.3.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Updated the Mirantis Kubernetes Engine (MKE) version from 3.5.1 to 3.5.3 and
the Mirantis Container Runtime (MCR) version from 20.10.8 to 20.10.11
for the Container Cloud management, regional, and managed clusters on all
supported cloud providers, as well as for non Container Cloud based MKE
cluster attachment.
As part of the Elasticsearch switching to OpenSearch, removed the Elasticsearch
and Kibana services, as well as introduced a set of new parameters that will
replace the current ones in future releases. The old parameters are supported
and take precedence over the new ones. For details, see
Deprecation notes and StackLight configuration parameters.
Note
In the Container Cloud web UI, the Elasticsearch and
Kibana naming is still present. However, the services behind
them have switched to OpenSearch and OpenSearch Dashboards.
Implemented the following improvements to StackLight alerting:
Added the MCCClusterUpdating informational alert that raises when the
Mirantis Container Cloud cluster starts updating.
Enhanced StackLight alerting by clarifying alert severity levels. Switched
all Minor alerts to Warning. Now, only alerts of the following
severities exist: informational, warning, major, and
critical.
Enhanced the documentation by adding troubleshooting guidelines for the
Kubernetes applications, resources, and storage alerts.
Defined the following parameters as mandatory in the StackLight configuration
of the Cluster object for all types of clusters. This applies only to clusters
with StackLight enabled. For existing clusters, the Cluster object is updated
automatically.
Important
When creating a new cluster, specify these parameters through
the Container Cloud web UI or as described in StackLight configuration parameters.
Update all cluster templates created before Container Cloud 2.18.0 that do
not have values for these parameters specified. Otherwise, the Admission
Controller will reject cluster creation.
The following table lists the components versions
of the Cluster release 11.2.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Expanded support for the Mirantis Kubernetes Engine (MKE) 3.5.1 that includes
Kubernetes 1.21 to be deployed on the Container Cloud management and regional
clusters. The MKE 3.5.1 support for managed clusters was introduced in
Container Cloud 2.16.0.
Implemented the capability to configure the Elasticsearch retention time for
the logs, events, and notifications indices when creating a managed cluster
through the Container Cloud web UI.
The Retention Time parameter in the Container Cloud web UI is now
replaced with the Logstash Retention Time,
Events Retention Time, and Notifications Retention Time
parameters.
Implemented configurable timeouts for Ceph requests processing. The default is
set to 30 minutes. You can configure the timeout using the
pgRebalanceTimeoutMin parameter in the Ceph Helm chart.
Implemented the capability to configure the replicas count for
cephController, cephStatus, and cephRequest controllers using the
replicas parameter in the Ceph Helm chart. The default is set to 3
replicas.
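A hedged sketch of the two Ceph Helm chart parameters described above; the values
shown are the documented defaults, and the top-level placement of the keys is an
assumption:

  pgRebalanceTimeoutMin: 30    # timeout for Ceph request processing, in minutes
  replicas: 3                  # replicas count for the cephController, cephStatus, and cephRequest controllers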
Implemented a separate ceph-kcc-controller that runs on a management
cluster and manages the KaaSCephCluster custom resource (CR). Previously,
the KaaSCephCluster CR was managed by bm-provider.
The following table lists the components versions
of the Cluster release 11.1.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 11.0.0
that is introduced in the Mirantis Container Cloud release 2.16.0 and is
designed for managed clusters.
This Cluster release supports Mirantis Kubernetes Engine 3.5.1
with Kubernetes 1.21 and Mirantis Container Runtime 20.10.8.
For the list of known and resolved issues, refer to the Container Cloud release
2.16.0 section.
Introduced support for the Mirantis Kubernetes Engine (MKE) 3.5.1 that includes
Kubernetes 1.21 to be deployed on the Container Cloud managed clusters.
Also, added support for attachment of existing MKE 3.5.1 clusters.
Implemented the capability to configure the Elasticsearch retention time per
index using the elasticsearch.retentionTime parameter in the StackLight
Helm chart. Now, you can configure different retention periods for different
indices: logs, events, and notifications.
The elasticsearch.logstashRetentionTime parameter is now deprecated.
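A minimal sketch of per-index retention configuration in the StackLight Helm chart
values; the retention values (in days) are illustrative:

  elasticsearch:
    retentionTime:
      logstash: 10         # retention for logs, illustrative value
      events: 10           # retention for Kubernetes events
      notifications: 10    # retention for notifications
    # logstashRetentionTime: 10   # deprecated single-value parameter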
Due to licensing changes for Elasticsearch, Mirantis Container Cloud has
switched from using Elasticsearch to OpenSearch and Kibana has switched to
OpenSearch Dashboards. OpenSearch is a fork of Elasticsearch under the
open-source Apache License with development led by Amazon Web Services.
For new deployments with the logging stack enabled, OpenSearch is now deployed
by default.
For existing deployments, migration to OpenSearch is performed automatically
during clusters update. However, the entire Elasticsearch cluster may go down
for up to 15 minutes.
Implemented the objectUsers RADOS Gateway parameter in the
KaaSCephCluster CR. The new parameter allows for easy creation of custom
Ceph RADOS Gateway users with permission rules. The users parameter is now
deprecated and, if specified, will be automatically transformed to
objectUsers.
Implemented the capability to remove or replace Ceph OSDs not only by the
device name or path but also by ID, using the by-id parameter in the
KaaSCephOperationRequest CR.
The following table lists the components versions
of the Cluster release 8.10.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
As part of the Elasticsearch switching to OpenSearch, removed the Elasticsearch
and Kibana services, as well as introduced a set of new parameters that will
replace the current ones in future releases. The old parameters are supported
and take precedence over the new ones. For details, see
Deprecation notes and StackLight configuration parameters.
Note
In the Container Cloud web UI, the Elasticsearch and
Kibana naming is still present. However, the services behind
them have switched to OpenSearch and OpenSearch Dashboards.
Implemented the following improvements to StackLight alerting:
Added the MCCClusterUpdating informational alert that raises when the
Mirantis Container Cloud cluster starts updating.
Enhanced StackLight alerting by clarifying alert severity levels. Switched
all Minor alerts to Warning. Now, only alerts of the following
severities exist: informational, warning, major, and
critical.
Enhanced the documentation by adding troubleshooting guidelines for the
Kubernetes applications, resources, and storage alerts.
Defined the following parameters as mandatory in the StackLight configuration
of the Cluster object for all types of clusters. This applies only to clusters
with StackLight enabled. For existing clusters, the Cluster object is updated
automatically.
Important
When creating a new cluster, specify these parameters through
the Container Cloud web UI or as described in StackLight configuration parameters.
Update all cluster templates created before Container Cloud 2.18.0 that do
not have values for these parameters specified. Otherwise, the Admission
Controller will reject cluster creation.
Implemented the capability to configure the Elasticsearch retention time for
the logs, events, and notifications indices when creating a managed cluster
through the Container Cloud web UI.
The Retention Time parameter in the Container Cloud web UI is now
replaced with the Logstash Retention Time,
Events Retention Time, and Notifications Retention Time
parameters.
Implemented configurable timeouts for Ceph requests processing. The default is
set to 30 minutes. You can configure the timeout using the
pgRebalanceTimeoutMin parameter in the Ceph Helm chart.
Implemented the capability to configure the replicas count for
cephController, cephStatus, and cephRequest controllers using the
replicas parameter in the Ceph Helm chart. The default is set to 3
replicas.
Implemented a separate ceph-kcc-controller that runs on a management
cluster and manages the KaaSCephCluster custom resource (CR). Previously,
the KaaSCephCluster CR was managed by bm-provider.
The following table lists the components versions
of the Cluster release 8.8.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Updated the Mirantis Kubernetes Engine (MKE) major version from 3.4.6 to 3.4.7
for the Container Cloud management, regional, and managed clusters.
Also, added support for attachment of existing MKE 3.4.7 clusters.
Implemented the capability to configure the Elasticsearch retention time per
index using the elasticsearch.retentionTime parameter in the StackLight
Helm chart. Now, you can configure different retention periods for different
indices: logs, events, and notifications.
The elasticsearch.logstashRetentionTime parameter is now deprecated.
Due to licensing changes for Elasticsearch, Mirantis Container Cloud has
switched from using Elasticsearch to OpenSearch and Kibana has switched to
OpenSearch Dashboards. OpenSearch is a fork of Elasticsearch under the
open-source Apache License with development led by Amazon Web Services.
For new deployments with the logging stack enabled, OpenSearch is now deployed
by default.
For existing deployments, migration to OpenSearch is performed automatically
during clusters update. However, the entire Elasticsearch cluster may go down
for up to 15 minutes.
The following table lists the components versions
of the Cluster release 8.6.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Implemented the initial Technology Preview support for Mirantis OpenStack for Kubernetes
(MOSK) deployment on local software-based Redundant Array of
Independent Disks (RAID) devices to withstand failure of one device at a time.
The feature becomes available once your Container Cloud cluster is
automatically upgraded to 2.16.0.
Using a custom bare metal host profile, you can configure and create
an mdadm-based software RAID device of type raid10 if you have
an even number of devices available on your servers. At least four
storage devices are required for such a RAID device.
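A hedged fragment of a custom bare metal host profile that defines such a RAID
device; the field names and device paths are assumptions provided only to
illustrate the raid10 layout with four devices:

  spec:
    softRaidDevices:              # assumption: software RAID section of the host profile
    - name: /dev/md0
      level: raid10               # requires an even number of devices, at least four
      devices:                    # assumption: flat list of member devices
      - /dev/sdb
      - /dev/sdc
      - /dev/sdd
      - /dev/sde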
Introduced support for the Mirantis Kubernetes Engine version 3.4.6 with
Kubernetes 1.20 for the Container Cloud management, regional, and managed
clusters. Also, added support for attachment of existing MKE 3.4.6 clusters.
Updated the Mirantis Container Runtime (MCR) version from 20.10.6 to 20.10.8
for the Container Cloud management, regional, and managed clusters on all
supported cloud providers.
Limited the number of monitored network interfaces to prevent excessive
Prometheus RAM consumption in large clusters. By default, Prometheus Node
Exporter now collects information only on a basic set of interfaces, both host
and container. If required, you can edit the list of excluded devices.
Implemented the capability to define custom Prometheus recording rules through
the prometheusServer.customRecordingRules parameter in the StackLight Helm
chart. Overriding of existing recording rules is not supported.
Implemented the capability to configure packet size for the syslog logging
output. If remote logging to syslog is enabled in StackLight, use the
logging.syslog.packetSize parameter in the StackLight Helm chart to
configure the packet size.
Implemented the capability to configure the Prometheus Relay client timeout and
response size limit through the prometheusRelay.clientTimeout and
prometheusRelay.responseLimitBytes parameters in the StackLight Helm chart.
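A combined, hedged sketch of the StackLight Helm chart parameters described in the
three paragraphs above; the recording rule, timeout, size limit, and packet size
values are illustrative, and the structured form of customRecordingRules is an
assumption:

  prometheusServer:
    customRecordingRules:                  # assumption: Prometheus recording-rule groups in YAML form
    - name: custom.rules
      rules:
      - record: instance:cpu_utilization:avg5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
  prometheusRelay:
    clientTimeout: 60                      # illustrative client timeout value
    responseLimitBytes: 1048576            # illustrative response size limit
  logging:
    syslog:
      packetSize: 2048                     # illustrative packet size for the syslog output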
Implemented the MCCLicenseExpirationCritical and
MCCLicenseExpirationMajor alerts that notify about Mirantis Container Cloud
license expiration in less than 10 and 30 days, respectively.
Implemented the following improvements to StackLight alerting:
Enhanced Kubernetes applications alerting:
Reworked the Kubernetes applications alerts to minimize flapping, avoid
firing during pod rescheduling, and to detect crash looping for pods
that restart less frequently.
Added the KubeDeploymentOutage, KubeStatefulSetOutage, and
KubeDaemonSetOutage alerts.
Removed the redundant KubeJobCompletion alert.
Enhanced the alert inhibition rules to reduce alert flooding.
Improved alert descriptions.
Split TelemeterClientFederationFailed into TelemeterClientFailed and
TelemeterClientHAFailed to separate alerts depending on whether the HA mode
is disabled or enabled.
Updated the description for DockerSwarmNodeFlapping.
Disabled unused Node Exporter collectors and implemented the capability to
manually enable needed collectors using the
nodeExporter.extraCollectorsEnabled parameter. Only the following
collectors are now enabled by default in StackLight:
To improve debugging and log reading, separated Ceph Controller, Ceph Status
Controller, and Ceph Request Controller, which used to run in one pod, into
three different deployments.
Implemented additional validation of networks specified in
spec.cephClusterSpec.network.publicNet and
spec.cephClusterSpec.network.clusterNet and prohibited the use of the
0.0.0.0/0 CIDR. Now, the bare metal provider automatically translates
the 0.0.0.0/0 network range to the default LCM IPAM subnet if it exists.
You can now also add corresponding labels for the bare metal IPAM subnets when
configuring the Ceph cluster during the management cluster deployment.
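An example of the validated network fields with explicit CIDRs instead of the
prohibited 0.0.0.0/0 range; the CIDR values are illustrative:

  spec:
    cephClusterSpec:
      network:
        publicNet: 10.10.0.0/24    # illustrative CIDR; 0.0.0.0/0 is prohibited
        clusterNet: 10.10.1.0/24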
Implemented full support for automated Ceph LCM operations using the
KaaSCephOperationRequest CR, such as addition or removal of Ceph OSDs and
nodes, as well as replacement of failed Ceph OSDs or nodes.
Ceph CSI provisioner tolerations and node affinity
Implemented the capability to specify Container Storage Interface (CSI)
provisioner tolerations and node affinity for different Rook resources.
Added support for the all and mds keys in toleration rules.
The following table lists the components versions
of the Cluster release 8.5.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 7.11.0 that is
introduced in the Mirantis Container Cloud release 2.21.0
and is the last release in the 7.x series.
This Cluster release supports Mirantis Kubernetes Engine 3.4.11
with Kubernetes 1.20 and Mirantis Container Runtime 20.10.13.
For the list of known and resolved issues, refer to the Container Cloud release
2.21.0 section.
Updated the Mirantis Kubernetes Engine (MKE) version from 3.4.10 to 3.4.11 and
the Mirantis Container Runtime (MCR) version from 20.10.12 to 20.10.13
for the Container Cloud management, regional, and managed clusters on all
supported cloud providers, as well as for non Container Cloud based MKE
cluster attachment.
Updated the MetalLB version from 0.12.1 to 0.13.4 for the Container Cloud
management, regional, and managed clusters of all cloud providers that use
MetalLB: bare metal, Equinix Metal with public and private networking, vSphere.
The MetalLB configuration is now stored in dedicated MetalLB objects
instead of the ConfigMap object.
Improved etcd monitoring by implementing the Etcd dashboard and
etcdDbSizeCritical and etcdDbSizeMajor alerts that inform about the
size of the etcd database.
Implemented Reference Application, a small microservice application
that enables workload monitoring on non-MOSK managed clusters.
It mimics a classical microservice application and provides metrics that
describe the likely behavior of user workloads.
Reference Application contains a set of alerts and a separate Grafana
dashboard that provide Reference Application check statuses and statistics
such as response time and content length.
The feature is disabled by default and can be enabled using the StackLight
configuration manifest.
Ceph secrets specification in the Ceph cluster status
Added the miraCephSecretsInfo specification to KaaSCephCluster.status.
This specification contains the current state and details of secrets that are
used in the Ceph cluster, such as keyrings, Ceph clients, RADOS Gateway user
credentials, and so on.
Using miraCephSecretsInfo, you can create, access, and remove Ceph RADOS
Block Device (RBD) or Ceph File System (CephFS) clients and RADOS Gateway
(RGW) users.
The following table lists the components versions
of the Cluster release 7.11.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
For MOSK-based deployments, MKE will be updated from
3.4.10 to 3.4.11 and MCR will be updated from 20.10.12 to 20.10.13 in one of
the following Container Cloud releases.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Updated the Mirantis Kubernetes Engine (MKE) version from 3.4.9 to 3.4.10 and
the Mirantis Container Runtime (MCR) version from 20.10.11 to 20.10.12
for the Container Cloud management, regional, and managed clusters on all
supported cloud providers except MOSK-based deployments,
as well as for non Container Cloud based MKE cluster attachment.
Ceph removal from management and regional clusters
To reduce resource consumption, removed Ceph cluster deployment from
management and regional clusters based on bare metal and Equinix Metal with
private networking. Ceph is automatically removed during the Cluster release
update to 7.10.0. Managed clusters continue using Ceph as a distributed
storage system.
Implemented the objectUsers RADOS Gateway parameter in the
KaaSCephCluster CR. The new parameter allows for easy creation of custom
Ceph RADOS Gateway users with permission rules. The users parameter is now
deprecated and, if specified, will be automatically transformed to
objectUsers.
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
Implemented the rbdDeviceMapOptions field in the Ceph pool parameters of
the KaaSCephCluster CR. The new field allows specifying custom RADOS Block
Device (RBD) map options to use with the StorageClass of a corresponding Ceph
pool.
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
Implemented the mgr.mgrModules parameter that includes the name and
enabled keys to provide the capability to disable a particular Ceph Manager
module. The mgr.modules parameter is now deprecated and, if specified, will
be automatically transformed to mgr.mgrModules.
The following table lists the components versions
of the Cluster release 7.10.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Updated the Mirantis Kubernetes Engine (MKE) version from 3.4.8 to 3.4.9
for the Container Cloud management, regional, and managed clusters on all
supported cloud providers except MOSK-based deployments,
as well as for non Container Cloud based MKE cluster attachment.
Enhanced the documentation by adding troubleshooting guidelines for the
Kubernetes system, Metric Collector, Helm Controller, Release Controller,
and MKE alerts.
Implemented the capability to easily view the summary and health status of all
Ceph clusters through the Container Cloud web UI. The feature is supported for
the bare metal provider only.
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
Implemented the capability to remove or replace Ceph OSDs not only by the
device name or path but also by ID, using the by-id parameter in the
KaaSCephOperationRequest CR.
Caution
For MKE clusters that are part of MOSK infrastructure, the
feature support will become available in one of the following
Container Cloud releases.
Implemented the capability to create multiple Ceph data pools for a single
CephFS installation using the dataPools parameter in the CephFS
specification. The dataPool parameter is now deprecated.
Caution
For MKE clusters that are part of MOSK infrastructure, the feature
is not supported yet.
The following table lists the components versions
of the Cluster release 7.9.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Updated the Mirantis Kubernetes Engine (MKE) version from 3.4.7 to 3.4.8 and
the Mirantis Container Runtime (MCR) version from 20.10.8 to 20.10.11
for the Container Cloud management, regional, and managed clusters on all
supported cloud providers, as well as for non Container Cloud based MKE
cluster attachment.
As part of the Elasticsearch switching to OpenSearch, removed the Elasticsearch
and Kibana services, as well as introduced a set of new parameters that will
replace the current ones in future releases. The old parameters are supported
and take precedence over the new ones. For details, see
Deprecation notes and StackLight configuration parameters.
Note
In the Container Cloud web UI, the Elasticsearch and
Kibana naming is still present. However, the services behind
them have switched to OpenSearch and OpenSearch Dashboards.
Implemented the following improvements to StackLight alerting:
Added the MCCClusterUpdating informational alert that raises when the
Mirantis Container Cloud cluster starts updating.
Enhanced StackLight alerting by clarifying alert severity levels. Switched
all Minor alerts to Warning. Now, only alerts of the following
severities exist: informational, warning, major, and
critical.
Enhanced the documentation by adding troubleshooting guidelines for the
Kubernetes applications, resources, and storage alerts.
Defined the following parameters as mandatory in the StackLight configuration
of the Cluster object for all types of clusters. This applies only to clusters
with StackLight enabled. For existing clusters, the Cluster object is updated
automatically.
Important
When creating a new cluster, specify these parameters through
the Container Cloud web UI or as described in StackLight configuration parameters.
Update all cluster templates created before Container Cloud 2.18.0 that do
not have values for these parameters specified. Otherwise, the Admission
Controller will reject cluster creation.
The following table lists the components versions
of the Cluster release 7.8.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Implemented the capability to configure the Elasticsearch retention time for
the logs, events, and notifications indices when creating a managed cluster
through the Container Cloud web UI.
The Retention Time parameter in the Container Cloud web UI is now
replaced with the Logstash Retention Time,
Events Retention Time, and Notifications Retention Time
parameters.
Implemented configurable timeouts for Ceph request processing. The default is
set to 30 minutes. You can configure the timeout using the
pgRebalanceTimeoutMin parameter in the Ceph Helm chart.
Implemented the capability to configure the replicas count for
cephController, cephStatus, and cephRequest controllers using the
replicas parameter in the Ceph Helm chart. The default is set to 3
replicas.
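A minimal sketch of the two Ceph Helm chart values mentioned above. The
pgRebalanceTimeoutMin and replicas parameter names are from the release notes;
the surrounding values layout, and how the overrides are delivered to the
chart, are assumptions to confirm against the Ceph controller chart reference.

    # Hypothetical values override for the Ceph controller Helm chart
    pgRebalanceTimeoutMin: 45   # timeout for Ceph request processing, in minutes (default is 30)
    replicas: 3                 # replicas count for the cephController, cephStatus, and cephRequest controllers (default is 3)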
Implemented a separate ceph-kcc-controller that runs on a management
cluster and manages the KaaSCephCluster custom resource (CR). Previously,
the KaaSCephCluster CR was managed by bm-provider.
The following table lists the components versions
of the Cluster release 7.7.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Updated the Mirantis Kubernetes Engine (MKE) version from 3.4.6 to 3.4.7
for the Container Cloud management, regional, and managed clusters.
Also, added support for attachment of existing MKE 3.4.7 clusters.
Implemented the capability to configure the Elasticsearch retention time per
index using the elasticsearch.retentionTime parameter in the StackLight
Helm chart. Now, you can configure different retention periods for different
indices: logs, events, and notifications.
The elasticsearch.logstashRetentionTime parameter is now deprecated.
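A hedged sketch of per-index retention in the StackLight Helm chart values.
The elasticsearch.retentionTime and elasticsearch.logstashRetentionTime names
are from the release note; the index keys and day-based values shown are
assumptions.

    elasticsearch:
      # logstashRetentionTime: 1      # deprecated single retention period
      retentionTime:                  # per-index retention replacing logstashRetentionTime
        logstash: 3                   # assumed key for the logs index
        events: 7
        notifications: 7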
Due to licensing changes for Elasticsearch, Mirantis Container Cloud has
switched from using Elasticsearch to OpenSearch and Kibana has switched to
OpenSearch Dashboards. OpenSearch is a fork of Elasticsearch under the
open-source Apache License with development led by Amazon Web Services.
For new deployments with the logging stack enabled, OpenSearch is now deployed
by default.
For existing deployments, migration to OpenSearch is performed automatically
during the cluster update. However, the entire Elasticsearch cluster may go
down for up to 15 minutes.
The following table lists the components versions
of the Cluster release 7.6.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Updated the Mirantis Container Runtime (MCR) version from 20.10.6 to 20.10.8
for the Container Cloud management, regional, and managed clusters on all
supported cloud providers.
Implemented the MCCLicenseExpirationCritical and
MCCLicenseExpirationMajor alerts that notify about Mirantis Container Cloud
license expiration in less than 10 and 30 days, respectively.
Implemented the following improvements to StackLight alerting:
Enhanced Kubernetes applications alerting:
Reworked the Kubernetes applications alerts to minimize flapping, avoid
firing during pod rescheduling, and to detect crash looping for pods
that restart less frequently.
Added the KubeDeploymentOutage, KubeStatefulSetOutage, and
KubeDaemonSetOutage alerts.
Removed the redundant KubeJobCompletion alert.
Enhanced the alert inhibition rules to reduce alert flooding.
Improved alert descriptions.
Split TelemeterClientFederationFailed into TelemeterClientFailed and
TelemeterClientHAFailed to separate alerts depending on whether the HA mode
is disabled or enabled.
Updated the description for DockerSwarmNodeFlapping.
Disabled unused Node Exporter collectors and implemented the capability to
manually enable needed collectors using the
nodeExporter.extraCollectorsEnabled parameter. Only the following
collectors are now enabled by default in StackLight:
Implemented full support for automated Ceph LCM operations using the
KaaSCephOperationRequest CR, such as addition or removal of Ceph OSDs and
nodes, as well as replacement of failed Ceph OSDs or nodes.
Ceph CSI provisioner tolerations and node affinity
Implemented the capability to specify Container Storage Interface (CSI)
provisioner tolerations and node affinity for different Rook resources.
Added support for the all and mds keys in toleration rules.
The following table lists the components versions
of the Cluster release 7.5.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Updated the Mirantis Kubernetes Engine version from 3.4.5 to 3.4.6
for the Container Cloud management, regional, and managed clusters.
Also, added support for attachment of existing MKE 3.4.6 clusters.
Limited the number of monitored network interfaces to prevent extended
Prometheus RAM consumption in big clusters. By default, Prometheus Node
Exporter now collects information only about a basic set of interfaces, both
host and container. If required, you can edit the list of excluded devices.
Implemented the capability to define custom Prometheus recording rules through
the prometheusServer.customRecordingRules parameter in the StackLight Helm
chart. Overriding of existing recording rules is not supported.
Implemented the capability to configure packet size for the syslog logging
output. If remote logging to syslog is enabled in StackLight, use the
logging.syslog.packetSize parameter in the StackLight Helm chart to
configure the packet size.
Implemented the capability to configure the Prometheus Relay client timeout and
response size limit through the prometheusRelay.clientTimeout and
prometheusRelay.responseLimitBytes parameters in the StackLight Helm chart.
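The sketch below combines the three StackLight Helm chart options described
above into one hypothetical values fragment. The parameter names are from the
release notes; the recording rule format, packet size, timeout units, and
response limit values are assumptions for illustration only.

    prometheusServer:
      customRecordingRules:                    # custom rules; overriding existing recording rules is not supported
      - record: node:cpu_utilization:avg5m     # hypothetical rule in standard Prometheus rule format
        expr: 1 - avg by (node) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
    logging:
      syslog:
        packetSize: 2048                       # syslog packet size, used when remote syslog logging is enabled
    prometheusRelay:
      clientTimeout: 30                        # Prometheus Relay client timeout (unit assumed to be seconds)
      responseLimitBytes: 1048576              # Prometheus Relay response size limit in bytes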
Implemented additional validation of networks specified in
spec.cephClusterSpec.network.publicNet and
spec.cephClusterSpec.network.clusterNet and prohibited the use of the
0.0.0.0/0 CIDR. Now, the bare metal provider automatically translates
the 0.0.0.0/0 network range to the default LCM IPAM subnet if it exists.
You can now also add corresponding labels for the bare metal IPAM subnets when
configuring the Ceph cluster during the management cluster deployment.
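A short sketch of the validated network fields in the KaaSCephCluster
specification. The spec.cephClusterSpec.network.publicNet and clusterNet paths
are from the release note; the CIDR values are placeholders, and 0.0.0.0/0 is
now prohibited as described above.

    spec:
      cephClusterSpec:
        network:
          publicNet: 10.10.0.0/24    # Ceph access traffic; 0.0.0.0/0 is rejected
          clusterNet: 10.11.0.0/24   # Ceph replication traffic; 0.0.0.0/0 is rejected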
To improve debugging and log reading, separated Ceph Controller, Ceph Status
Controller, and Ceph Request Controller, which used to run in one pod, into
three different deployments.
Implemented the KaaSCephOperationRequest CR that provides LCM operations
for Ceph OSDs and nodes by automatically creating separate
CephOsdRemoveRequest requests. It allows for automated removal of healthy
or non-healthy Ceph OSDs from a Ceph cluster.
The following table lists the components versions
of the Cluster release 7.4.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The following table lists the components versions
of the Cluster release 7.3.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Updated the Mirantis Container Runtime (MCR) version from 20.10.5 to 20.10.6
and Mirantis Kubernetes Engine (MKE) version from 3.4.0 to 3.4.5
for the Container Cloud management, regional, and managed clusters.
Also, added support for attachment of existing MKE clusters 3.3.7-3.3.12 and
3.4.1-3.4.5.
Integrated Ceph maintenance into the common upgrade procedure. Now, the
maintenance flag function is set up programmatically and the flag itself is
deprecated.
Implemented the capability to specify RADOS Gateway tolerations through the
KaaSCephCluster spec using the native Rook way for setting resource
requirements for Ceph daemons.
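An illustrative sketch of RADOS Gateway tolerations in the KaaSCephCluster
spec using Rook-style toleration fields. Only the KaaSCephCluster name and the
idea of Rook-native tolerations come from the release note; the hyperconverge
and tolerations nesting, the rgw key, and the taint shown are assumptions, so
check the product reference for the exact placement.

    spec:
      cephClusterSpec:
        hyperconverge:                     # assumed section name
          tolerations:
            rgw:                           # tolerations applied to RADOS Gateway daemons
            - key: node-role/storage       # hypothetical taint key
              operator: Exists
              effect: NoSchedule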
Short names for Kubernetes nodes in Grafana dashboards
Enhanced the Grafana dashboards to display user-friendly short names for
Kubernetes nodes, for example, master-0, instead of long name labels
such as kaas-node-f736fc1c-3baa-11eb-8262-0242ac110002.
This feature provides for consistency with Kubernetes nodes naming in the
Container Cloud web UI.
All Grafana dashboards that present node data now have an additional
Node identifier drop-down menu. By default, it is set to
machine to display short names for Kubernetes nodes. To display
Kubernetes node name labels as previously, change this option to
node.
The following table lists the components versions
of the Cluster release 7.2.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Implemented the capability to define Ceph tolerations and resources management
through the KaaSCephCluster spec using the native Rook way for setting
resource requirements for Ceph daemons.
Improved the MiraCephLog custom resource by adding more information about
all Ceph cluster entities and their statuses. The MiraCeph and MiraCephLog
statuses and MiraCephLog values are now integrated into
KaaSCephCluster.status and can be viewed using the miraCephInfo,
shortClusterInfo, and fullClusterInfo fields.
Implemented the following improvements for the StackLight node labeling
during a cluster creation or post-deployment configuration:
Added a verification that a cluster contains a minimum of 3 worker nodes
with the StackLight label for clusters with StackLight deployed
in HA mode. This verification applies to cluster deployment and update
processes. For details on how to add the StackLight label
before upgrade to the latest Cluster releases of Container Cloud 2.11.0,
refer to Upgrade managed clusters with StackLight deployed in HA mode.
Added a notification about the minimum number of worker nodes with the
StackLight label for HA StackLight deployments
to the cluster live status description in the Container Cloud web UI.
Caution
Removing the StackLight label from worker nodes, as well as removing
worker nodes that have this label, can make the StackLight components
inaccessible. It is important to keep the worker nodes where the StackLight
local volumes were provisioned.
Implemented the capability to set the default log level severity
for all StackLight components as well as set a custom log level severity
for specific StackLight components in the Container Cloud web UI. You can
update this setting either during a managed cluster creation or during a
post-deployment configuration.
Implemented the capability to enable feed updates in Salesforce using
the feed_enabled parameter. By default, this parameter is set to false
to save API calls.
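A minimal sketch of enabling the Salesforce feed update described above. Only
the feed_enabled parameter and its false default come from the release note;
the sfNotifier wrapper is an assumption about where the parameter lives in the
StackLight configuration.

    sfNotifier:               # assumed configuration section for the Salesforce notifier
      feed_enabled: true      # enable feed updates in Salesforce; defaults to false to save API calls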
On top of continuous improvements delivered to the existing Container Cloud
guides, added a procedure on how to manually remove a Ceph OSD from a Ceph
cluster.
The following table lists the components versions
of the Cluster release 7.1.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 7.0.0
that is introduced in the Mirantis Container Cloud release 2.10.0.
This Cluster release introduces support for the updated versions of
Mirantis Kubernetes Engine 3.4.0 with Kubernetes 1.20
and Mirantis Container Runtime 20.10.5.
For the list of known and resolved issues, refer to the Container Cloud release
2.10.0 section.
The 7.0.0 Cluster release introduces support for the updated versions of:
Mirantis Container Runtime (MCR) 20.10.5
Mirantis Kubernetes Engine (MKE) 3.4.0
Kubernetes 1.20.1
All existing management and regional clusters with the Cluster release 5.16.0
are automatically updated to the Cluster release 7.0.0 with the updated
versions of MCR, MKE, and Kubernetes.
Once you update your existing managed clusters from the Cluster release 5.16.0
to 5.17.0, an update to the Cluster release 7.0.0 becomes available
through the Container Cloud web UI menu.
Improved MKE log gathering by replacing the default DEBUG log level
with INFO. This change reduces the unnecessary load on the MKE cluster
caused by the excessive amount of logs generated with the DEBUG level
enabled.
Implemented the following improvements to StackLight alerting:
Added the following alerts:
PrometheusMsTeamsDown that raises if prometheus-msteams is down.
ServiceNowWebhookReceiverDown that raises if
alertmanager-webhook-servicenow is down.
SfNotifierDown that raises if the sf-notifier is down.
KubeAPICertExpirationMajor, KubeAPICertExpirationWarning,
MKEAPICertExpirationMajor, MKEAPICertExpirationWarning that inform
on SSL certificates expiration.
Removed the inefficient PostgresqlPrimaryDown alert.
Reworked a number of alerts to improve alerting efficiency and reduce alert
flooding.
Reworked the alert inhibition rules to match the receivers.
Updated Alertmanager to v0.22.2.
Changed the default behavior of the Salesforce alerts integration. Now, by
default, only Critical alerts are sent to Salesforce.
On top of continuous improvements delivered to the existing Container Cloud
guides, added a procedure on how to move a Ceph Monitor daemon to another node.
The following table lists the components versions
of the Cluster release 6.20.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
Updated the Mirantis Container Runtime (MCR) version from 20.10.5 to 20.10.6
and Mirantis Kubernetes Engine version from 3.3.6 to 3.3.12
for the Container Cloud management, regional, and managed clusters.
Also, added support for attachment of existing MKE clusters 3.3.7-3.3.12 and
3.4.1-3.4.5.
Integrated Ceph maintenance into the common upgrade procedure. Now, the
maintenance flag function is set up programmatically and the flag itself is
deprecated.
Implemented the capability to specify RADOS Gateway tolerations through the
KaaSCephCluster spec using the native Rook way for setting resource
requirements for Ceph daemons.
Short names for Kubernetes nodes in Grafana dashboards
Enhanced the Grafana dashboards to display user-friendly short names for
Kubernetes nodes, for example, master-0, instead of long name labels
such as kaas-node-f736fc1c-3baa-11eb-8262-0242ac110002.
This feature provides for consistency with Kubernetes nodes naming in the
Container Cloud web UI.
All Grafana dashboards that present node data now have an additional
Node identifier drop-down menu. By default, it is set to
machine to display short names for Kubernetes nodes. To display
Kubernetes node name labels as previously, change this option to
node.
The following table lists the components versions
of the Cluster release 6.19.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Cluster release 6.18.0 is introduced in the Mirantis Container Cloud
release 2.11.0. This Cluster release is based on the Cluster release 5.18.0.
The Cluster release 6.18.0 supports:
Mirantis OpenStack for Kubernetes (MOS) 21.4.
For details, see MOS Release Notes.
Mirantis Kubernetes Engine (MKE) 3.3.6 and the updated version of
Mirantis Container Runtime (MCR)
20.10.5. For details, see MKE Release Notes
and MCR Release Notes.
Kubernetes 1.18.
For the list of addressed issues, refer to the Container Cloud releases
2.10.0 and 2.11.0 sections.
For the list of known issues, refer to the Container Cloud release
2.11.0.
Improved MKE log gathering by replacing the default DEBUG log level
with INFO. This change reduces the unnecessary load on the MKE cluster
caused by the excessive amount of logs generated with the DEBUG level
enabled.
Implemented the capability to set the default log level severity
for all StackLight components as well as set a custom log level severity
for specific StackLight components in the Container Cloud web UI. You can
update this setting either during a managed cluster creation or during a
post-deployment configuration.
Implemented the following improvements to StackLight alerting:
Added the following alerts:
PrometheusMsTeamsDown that raises if prometheus-msteams is down.
ServiceNowWebhookReceiverDown that raises if
alertmanager-webhook-servicenow is down.
SfNotifierDown that raises if the sf-notifier is down.
KubeAPICertExpirationMajor, KubeAPICertExpirationWarning,
MKEAPICertExpirationMajor, MKEAPICertExpirationWarning that inform
on SSL certificates expiration.
KubeContainersCPUThrottlingHigh that raises in case of container CPU
throttling.
KubeletDown that raises if kubelet is down.
Removed the following inefficient alerts:
PostgresqlPrimaryDown
FileDescriptorUsageCritical
KubeCPUOvercommitNamespaces
KubeMemOvercommitNamespaces
KubeQuotaExceeded
ContainerScrapeError
Reworked a number of alerts to improve alerting efficiency and reduce alert
flooding.
Reworked the alert inhibition rules to match the receivers.
Updated Alertmanager to v0.22.2.
Changed the default behavior of the Salesforce alerts integration. Now, by
default, only Critical alerts are sent to Salesforce.
Implemented the following improvements for the StackLight node labeling
during a cluster creation or post-deployment configuration:
Added a verification that a cluster contains a minimum of 3 worker nodes
with the StackLight label for clusters with StackLight deployed
in HA mode. This verification applies to cluster deployment and update
processes. For details on how to add the StackLight label
before upgrade to the latest Cluster releases of Container Cloud 2.11.0,
refer to Upgrade managed clusters with StackLight deployed in HA mode.
Added a notification about the minimum number of worker nodes with the
StackLight label for HA StackLight deployments
to the cluster live status description in the Container Cloud web UI.
Caution
Removing the StackLight label from worker nodes, as well as removing
worker nodes that have this label, can make the StackLight components
inaccessible. It is important to keep the worker nodes where the StackLight
local volumes were provisioned.
Implemented the capability to enable feed updates in Salesforce using
the feed_enabled parameter. By default, this parameter is set to false
to save API calls.
Implemented the capability to define Ceph tolerations and resources management
through the KaaSCephCluster spec using the native Rook way for setting
resource requirements for Ceph daemons.
Improved the MiraCephLog custom resource by adding more information about
all Ceph cluster entities and their statuses. The MiraCeph and MiraCephLog
statuses and MiraCephLog values are now integrated into
KaaSCephCluster.status and can be viewed using the miraCephInfo,
shortClusterInfo, and fullClusterInfo fields.
The following table lists the components versions
of the Cluster release 6.18.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Cluster release 6.16.0 is introduced in the Mirantis Container Cloud
release 2.9.0. This Cluster release is based on the Cluster release 5.16.0.
The Cluster release 6.16.0 supports:
Mirantis OpenStack for Kubernetes (MOS) 21.3.
For details, see MOS Release Notes.
Mirantis Kubernetes Engine (MKE) 3.3.6 and Mirantis Container Runtime (MCR)
19.03.14. For details, see MKE Release Notes
and MCR Release Notes.
Kubernetes 1.18.
For the list of addressed issues, refer to the Container Cloud releases
2.8.0 and 2.9.0 sections.
For the list of known issues, refer to the Container Cloud release
2.9.0.
Implemented the capability to enable Alertmanager to send notifications to
ServiceNow. Also added the ServiceNowAuthFailure alert that raises in
case of failure to authenticate to ServiceNow.
The following table lists the components versions
of the Cluster release 6.16.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Cluster release 6.14.0 is introduced in the Mirantis Container Cloud
release 2.7.0.
This Cluster release is based on the Cluster release 5.14.0.
The Cluster release 6.14.0 supports:
Mirantis OpenStack for Kubernetes (MOS) 21.2.
For details, see MOS Release Notes.
Mirantis Kubernetes Engine (MKE) 3.3.6 and Mirantis Container Runtime (MCR)
19.03.14. For details, see MKE Release Notes
and MCR Release Notes.
Kubernetes 1.18.
For the list of resolved issues, refer to the Container Cloud releases
2.6.0 and 2.7.0 sections.
For the list of known issues, refer to the Container Cloud release
2.7.0.
Significantly enhanced the StackLight log collection mechanism to avoid
collecting and keeping an excessive amount of log messages when it is not
essential. Now, during or after deployment of StackLight, you can select one of
the 9 available logging levels depending on the required severity. The default
logging level is INFO.
Implemented the capability to configure StackLight to forward all logs to an
external syslog server. In this case, StackLight will send logs both to the
syslog server and to Elasticsearch, which is the default target.
Implemented the capability to configure Ceph Controller to start pods on
tainted nodes and manage the resources of Ceph nodes. Now, when bootstrapping a
new management or managed cluster, you can specify requests, limits, or
tolerations for Ceph resources.
You can also configure resource management for an existing Ceph cluster.
However, this approach may cause downtime.
Improved user experience by moving the rgw section of the
KaaSCephCluster CR to a common objectStorage section that now includes
all RADOS Gateway configurations of a Ceph cluster. The spec.rgw section is
deprecated. However, if you continue using spec.rgw, it will be
automatically translated into the new objectStorage.rgw section during the
Container Cloud update to 2.6.0.
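A hedged sketch of the new objectStorage.rgw layout compared to the deprecated
spec.rgw section. The rgw and objectStorage section names come from the
release note; the fields under rgw are placeholders only.

    spec:
      # Deprecated layout, still translated automatically during the update:
      # rgw:
      #   name: rgw-store
      objectStorage:               # assumed to sit at the same level as the former spec.rgw section
        rgw:
          name: rgw-store          # hypothetical RADOS Gateway settings moved from spec.rgw
          gateway:
            port: 8080
            instances: 2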
Implemented the capability to enable Ceph maintenance mode using the
maintenance flag not only during a managed cluster update but also when
required. However, Mirantis does not recommend enabling maintenance on
production deployments other than during update.
Dedicated network for the Ceph distributed storage traffic
TECHNOLOGY PREVIEW
Added the possibility to configure dedicated networks for the Ceph cluster
access and replication traffic using dedicated subnets.
Container Cloud automatically configures Ceph to use the addresses from the
dedicated subnets after you assign the corresponding addresses
to the storage nodes.
Implemented the capability to enable the Ceph Multisite configuration that
allows object storage to replicate its data over multiple Ceph clusters. With
Multisite, such object storage is independent of and isolated from other
object storages in the cluster.
On top of continuous improvements delivered to the existing Container Cloud
guides, added the Troubleshoot Ceph section to the Operations Guide. This
section now contains a detailed procedure to recover a failed or accidentally
removed Ceph cluster.
The following table lists the components versions
of the Cluster release 6.14.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Cluster release 6.12.0 is introduced in the Mirantis Container Cloud
release 2.5.0 and is supported by 2.6.0.
This Cluster release is based on the Cluster release 5.12.0.
The Cluster release 6.12.0 supports:
Mirantis OpenStack for Kubernetes (MOS) 21.1.
For details, see MOS Release Notes.
Updated versions of Mirantis Kubernetes Engine (MKE) 3.3.6
and Mirantis Container Runtime (MCR) 19.03.14. For details, see
MKE Release Notes
and MCR Release Notes.
Kubernetes 1.18.
For the list of resolved issues, refer to the Container Cloud releases
2.4.0 and 2.5.0 sections.
For the list of known issues, refer to the Container Cloud release
2.5.0 section.
Implemented alert inhibition rules to provide a clearer view of the cloud
status and simplify troubleshooting. Using alert inhibition rules, Alertmanager
decreases alert noise by suppressing dependent alert notifications. The
feature is enabled by default. For details, see
Alert dependencies.
Implemented integration between Grafana and Kibana by adding a
View logs in Kibana link to the majority of Grafana dashboards,
which allows you to immediately view contextually relevant logs through the
Kibana web UI.
Enhanced StackLight to automatically set clusterId that defines an ID of
a Container Cloud cluster. Now, you do not need to set or modify this parameter
manually when configuring the sf-notifier and sf-reporter services.
Enhanced StackLight by adding support for Cerebro, a web UI that visualizes
health of Elasticsearch clusters and allows for convenient debugging. Cerebro
is disabled by default.
Implemented the maintenance label that is set for Ceph during a managed
cluster update. This prevents Ceph rebalancing, which can lead to data loss,
during a managed cluster update.
Implemented the Enable Object Storage checkbox in the Container
Cloud web UI to allow enabling a single-instance RGW Object Storage when
creating a Ceph cluster as described in Add a Ceph cluster.
Added proxy support for Alertmanager, Metric collector, Salesforce notifier and
reporter, and Telemeter client. Now, these StackLight components automatically
use the same proxy that is configured for Container Cloud clusters.
Note
Proxy handles only the HTTP and HTTPS traffic. Therefore, for
clusters with limited or no Internet access, it is not possible to set up
Alertmanager email notifications, which use SMTP, when proxy is used.
Note
Due to a limitation, StackLight fails to integrate with an external
proxy with authentication handled by a proxy server. In such cases, the
proxy server ignores the HTTP Authorization header for basic
authentication passed by Prometheus Alertmanager. Therefore, use proxies
without authentication or with authentication handled by a reverse proxy.
The following table lists the components versions
of the Cluster release 6.12.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Cluster release 6.10.0 is introduced in the Mirantis Container Cloud
release 2.3.0 and supports:
Mirantis OpenStack for Kubernetes (MOS) Ussuri Update.
For details, see MOS Release Notes.
Updated versions of Mirantis Kubernetes Engine 3.3.4 and
Mirantis Container Runtime 19.03.13. For details, see
MKE Release Notes
and MCR Release Notes.
Kubernetes 1.18.
For the list of known and resolved issues, refer to the Container Cloud release
2.3.0 section.
The following table lists the components versions
of the Cluster release 6.10.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
The Cluster release 6.8.1 is introduced in the Mirantis Container Cloud
release 2.2.0.
This Cluster release is based on the Cluster release 5.8.0 and the main
difference is support of the Mirantis OpenStack for Kubernetes (MOS) product.
This section outlines release notes for the Cluster release 5.22.0
that is introduced in the Mirantis Container Cloud release 2.15.0.
This Cluster release supports Mirantis Container Runtime 20.10.8 and
Mirantis Kubernetes Engine 3.3.13 with Kubernetes 1.18.
For the list of known and resolved issues, refer to the Container Cloud release
2.15.0 section.
Updated the Mirantis Container Runtime (MCR) version from 20.10.6 to 20.10.8
for the Container Cloud management, regional, and managed clusters on all
supported cloud providers.
Implemented the MCCLicenseExpirationCritical and
MCCLicenseExpirationMajor alerts that notify about Mirantis Container Cloud
license expiration in less than 10 and 30 days, respectively.
Implemented the following improvements to StackLight alerting:
Enhanced Kubernetes applications alerting:
Reworked the Kubernetes applications alerts to minimize flapping, avoid
firing during pod rescheduling, and to detect crash looping for pods
that restart less frequently.
Added the KubeDeploymentOutage, KubeStatefulSetOutage, and
KubeDaemonSetOutage alerts.
Removed the redundant KubeJobCompletion alert.
Enhanced the alert inhibition rules to reduce alert flooding.
Improved alert descriptions.
Split TelemeterClientFederationFailed into TelemeterClientFailed and
TelemeterClientHAFailed to separate alerts depending on whether the HA mode
is disabled or enabled.
Updated the description for DockerSwarmNodeFlapping.
Disabled unused Node Exporter collectors and implemented the capability to
manually enable needed collectors using the
nodeExporter.extraCollectorsEnabled parameter. Only the following
collectors are now enabled by default in StackLight:
Implemented full support for automated Ceph LCM operations using the
KaaSCephOperationRequest CR, such as addition or removal of Ceph OSDs and
nodes, as well as replacement of failed Ceph OSDs or nodes.
Ceph CSI provisioner tolerations and node affinity
Implemented the capability to specify Container Storage Interface (CSI)
provisioner tolerations and node affinity for different Rook resources.
Added support for the all and mds keys in toleration rules.
The following table lists the components versions
of the Cluster release 5.22.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.21.0
that is introduced in the Mirantis Container Cloud release 2.14.0.
This Cluster release supports Mirantis Container Runtime 20.10.6 and
Mirantis Kubernetes Engine 3.3.12 with Kubernetes 1.18.
For the list of known and resolved issues, refer to the Container Cloud release
2.14.0 section.
Updated the Mirantis Kubernetes Engine version from 3.3.12 to 3.3.13
for the Container Cloud management, regional, and managed clusters.
Also, added support for attachment of existing MKE 3.3.13 clusters.
Limited the number of monitored network interfaces to prevent extended
Prometheus RAM consumption in big clusters. By default, Prometheus Node
Exporter now collects information only about a basic set of interfaces, both
host and container. If required, you can edit the list of excluded devices.
Implemented the capability to define custom Prometheus recording rules through
the prometheusServer.customRecordingRules parameter in the StackLight Helm
chart. Overriding of existing recording rules is not supported.
Implemented the capability to configure packet size for the syslog logging
output. If remote logging to syslog is enabled in StackLight, use the
logging.syslog.packetSize parameter in the StackLight Helm chart to
configure the packet size.
Implemented the capability to configure the Prometheus Relay client timeout and
response size limit through the prometheusRelay.clientTimeout and
prometheusRelay.responseLimitBytes parameters in the StackLight Helm chart.
Implemented additional validation of networks specified in
spec.cephClusterSpec.network.publicNet and
spec.cephClusterSpec.network.clusterNet and prohibited the use of the
0.0.0.0/0 CIDR. Now, the bare metal provider automatically translates
the 0.0.0.0/0 network range to the default LCM IPAM subnet if it exists.
You can now also add corresponding labels for the bare metal IPAM subnets when
configuring the Ceph cluster during the management cluster deployment.
To improve debugging and log reading, separated Ceph Controller, Ceph Status
Controller, and Ceph Request Controller, which used to run in one pod, into
three different deployments.
Implemented the KaaSCephOperationRequest CR that provides LCM operations
for Ceph OSDs and nodes by automatically creating separate
CephOsdRemoveRequest requests. It allows for automated removal of healthy
or non-healthy Ceph OSDs from a Ceph cluster.
The following table lists the components versions
of the Cluster release 5.21.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.20.0
that is introduced in the Mirantis Container Cloud release 2.13.0.
This Cluster release supports Mirantis Container Runtime 20.10.6 and
Mirantis Kubernetes Engine 3.3.12 with Kubernetes 1.18.
For the list of known and resolved issues, refer to the Container Cloud release
2.13.0 section.
The following table lists the components versions
of the Cluster release 5.20.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.19.0
that is introduced in the Mirantis Container Cloud release 2.12.0.
This Cluster release supports Mirantis Container Runtime 20.10.6 and
Mirantis Kubernetes Engine 3.3.12 with Kubernetes 1.18.
For the list of known and resolved issues, refer to the Container Cloud release
2.12.0 section.
Updated the Mirantis Container Runtime (MCR) version from 20.10.5 to 20.10.6
and Mirantis Kubernetes Engine version from 3.3.6 to 3.3.12
for the Container Cloud management, regional, and managed clusters.
Also, added support for attachment of existing MKE clusters 3.3.7-3.3.12 and
3.4.1-3.4.5.
Integrated Ceph maintenance into the common upgrade procedure. Now, the
maintenance flag function is set up programmatically and the flag itself is
deprecated.
Implemented the capability to specify RADOS Gateway tolerations through the
KaaSCephCluster spec using the native Rook way for setting resource
requirements for Ceph daemons.
Short names for Kubernetes nodes in Grafana dashboards
Enhanced the Grafana dashboards to display user-friendly short names for
Kubernetes nodes, for example, master-0, instead of long name labels
such as kaas-node-f736fc1c-3baa-11eb-8262-0242ac110002.
This feature provides for consistency with Kubernetes nodes naming in the
Container Cloud web UI.
All Grafana dashboards that present node data now have an additional
Node identifier drop-down menu. By default, it is set to
machine to display short names for Kubernetes nodes. To display
Kubernetes node name labels as previously, change this option to
node.
The following table lists the components versions
of the Cluster release 5.19.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.18.0
that is introduced in the Mirantis Container Cloud release 2.11.0.
This Cluster release supports Mirantis Container Runtime 20.10.5 and
Mirantis Kubernetes Engine 3.3.6 with Kubernetes 1.18.
For the list of known and resolved issues, refer to the Container Cloud release
2.11.0 section.
Implemented the capability to define Ceph tolerations and resources management
through the KaaSCephCluster spec using the native Rook way for setting
resource requirements for Ceph daemons.
Improved the MiraCephLog custom resource by adding more information about
all Ceph cluster entities and their statuses. The MiraCeph and MiraCephLog
statuses and MiraCephLog values are now integrated into
KaaSCephCluster.status and can be viewed using the miraCephInfo,
shortClusterInfo, and fullClusterInfo fields.
Implemented the following improvements for the StackLight node labeling
during a cluster creation or post-deployment configuration:
Added a verification that a cluster contains a minimum of 3 worker nodes
with the StackLight label for clusters with StackLight deployed
in HA mode. This verification applies to cluster deployment and update
processes. For details on how to add the StackLight label
before upgrade to the latest Cluster releases of Container Cloud 2.11.0,
refer to Upgrade managed clusters with StackLight deployed in HA mode.
Added a notification about the minimum number of worker nodes with the
StackLight label for HA StackLight deployments
to the cluster live status description in the Container Cloud web UI.
Caution
Removing the StackLight label from worker nodes, as well as removing
worker nodes that have this label, can make the StackLight components
inaccessible. It is important to keep the worker nodes where the StackLight
local volumes were provisioned.
Implemented the capability to set the default log level severity
for all StackLight components as well as set a custom log level severity
for specific StackLight components in the Container Cloud web UI. You can
update this setting either during a managed cluster creation or during a
post-deployment configuration.
Implemented the capability to enable feed updates in Salesforce using
the feed_enabled parameter. By default, this parameter is set to false
to save API calls.
On top of continuous improvements delivered to the existing Container Cloud
guides, added a procedure on how to manually remove a Ceph OSD from a Ceph
cluster.
The following table lists the components versions
of the Cluster release 5.18.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.17.0
that is introduced in the Mirantis Container Cloud release 2.10.0.
This Cluster release introduces support for the updated version of
Mirantis Container Runtime 20.10.5 and supports
Mirantis Kubernetes Engine 3.3.6 with Kubernetes 1.18.
For the list of known and resolved issues, refer to the Container Cloud release
2.10.0 section.
Improved MKE log gathering by replacing the default DEBUG log level
with INFO. This change reduces the unnecessary load on the MKE cluster
caused by the excessive amount of logs generated with the DEBUG level
enabled.
Implemented the following improvements to StackLight alerting:
Added the following alerts:
PrometheusMsTeamsDown that raises if prometheus-msteams is down.
ServiceNowWebhookReceiverDown that raises if
alertmanager-webhook-servicenow is down.
SfNotifierDown that raises if the sf-notifier is down.
KubeAPICertExpirationMajor, KubeAPICertExpirationWarning,
MKEAPICertExpirationMajor, MKEAPICertExpirationWarning that inform
on SSL certificates expiration.
Removed the inefficient PostgresqlPrimaryDown alert.
Reworked a number of alerts to improve alerting efficiency and reduce alert
flooding.
Reworked the alert inhibition rules to match the receivers.
Updated Alertmanager to v0.22.2.
Changed the default behavior of the Salesforce alerts integration. Now, by
default, only Critical alerts are sent to Salesforce.
On top of continuous improvements delivered to the existing Container Cloud
guides, added a procedure on how to move a Ceph Monitor daemon to another node.
The following table lists the components versions
of the Cluster release 5.17.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.16.0
that is introduced in the Mirantis Container Cloud release 2.9.0.
This Cluster release supports Mirantis Kubernetes Engine 3.3.6,
Mirantis Container Runtime 19.03.14, and Kubernetes 1.18.
For the list of known and resolved issues, refer to the Container Cloud release
2.9.0 section.
The following table lists the components versions
of the Cluster release 5.16.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.15.0
that is introduced in the Mirantis Container Cloud release 2.8.0.
This Cluster release supports Mirantis Kubernetes Engine 3.3.6,
Mirantis Container Runtime 19.03.14, and Kubernetes 1.18.
For the list of known and resolved issues, refer to the Container Cloud release
2.8.0 section.
Implemented the capability to enable Alertmanager to send notifications to
ServiceNow. Also added the ServiceNowAuthFailure alert that raises in
case of failure to authenticate to ServiceNow.
The following table lists the components versions
of the Cluster release 5.15.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.14.0
that is introduced in the Mirantis Container Cloud release 2.7.0.
This Cluster release supports Mirantis Kubernetes Engine 3.3.6,
Mirantis Container Runtime 19.03.14, and Kubernetes 1.18.
For the list of known and resolved issues, refer to the Container Cloud release
2.7.0 section.
Dedicated network for the Ceph distributed storage traffic
TECHNOLOGY PREVIEW
Added the possibility to configure dedicated networks for the Ceph cluster
access and replication traffic using dedicated subnets.
Container Cloud automatically configures Ceph to use the addresses from the
dedicated subnets after you assign the corresponding addresses
to the storage nodes.
Implemented the capability to enable the Ceph Multisite configuration that
allows object storage to replicate its data over multiple Ceph clusters. With
Multisite, such object storage is independent of and isolated from other
object storages in the cluster.
On top of continuous improvements delivered to the existing Container Cloud
guides, added the Troubleshoot Ceph section to the Operations Guide. This
section now contains a detailed procedure to recover a failed or accidentally
removed Ceph cluster.
The following table lists the components versions
of the Cluster release 5.14.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.13.0
that is introduced in the Mirantis Container Cloud release 2.6.0.
This Cluster release supports Mirantis Kubernetes Engine 3.3.6,
Mirantis Container Runtime 19.03.14, and Kubernetes 1.18.
For the list of known and resolved issues, refer to the Container Cloud release
2.6.0 section.
Significantly enhanced the StackLight log collection mechanism to avoid
collecting and keeping an excessive amount of log messages when it is not
essential. Now, during or after deployment of StackLight, you can select one of
the 9 available logging levels depending on the required severity. The default
logging level is INFO.
Implemented the capability to configure StackLight to forward all logs to an
external syslog server. In this case, StackLight will send logs both to the
syslog server and to Elasticsearch, which is the default target.
Implemented the capability to configure Ceph Controller to start pods on
tainted nodes and manage the resources of Ceph nodes. Now, when bootstrapping a
new management or managed cluster, you can specify requests, limits, or
tolerations for Ceph resources.
You can also configure resource management for an existing Ceph cluster.
However, this approach may cause downtime.
Improved user experience by moving the rgw section of the
KaaSCephCluster CR to a common objectStorage section that now includes
all RADOS Gateway configurations of a Ceph cluster. The spec.rgw section is
deprecated. However, if you continue using spec.rgw, it will be
automatically translated into the new objectStorage.rgw section during the
Container Cloud update to 2.6.0.
Implemented the capability to enable Ceph maintenance mode using the
maintenance flag not only during a managed cluster update but also when
required. However, Mirantis does not recommend enabling maintenance on
production deployments other than during update.
The following table lists the components versions
of the Cluster release 5.13.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.12.0
that is introduced in the Mirantis Container Cloud release 2.5.0.
This Cluster release supports Kubernetes 1.18 and Mirantis Container Runtime
19.03.14 as well as introduces support for the updated version of
Mirantis Kubernetes Engine 3.3.6.
For the list of known and resolved issues, refer to the Container Cloud release
2.5.0 section.
Implemented the maintenance label that is set for Ceph during a managed
cluster update. This prevents Ceph rebalancing, which can lead to data loss,
during a managed cluster update.
Implemented the Enable Object Storage checkbox in the Container
Cloud web UI to allow enabling a single-instance RGW Object Storage when
creating a Ceph cluster as described in Add a Ceph cluster.
Enhanced StackLight by adding support for Cerebro, a web UI that visualizes
health of Elasticsearch clusters and allows for convenient debugging. Cerebro
is disabled by default.
Added proxy support for Alertmanager, Metric collector, Salesforce notifier and
reporter, and Telemeter client. Now, these StackLight components automatically
use the same proxy that is configured for Container Cloud clusters.
Note
Proxy handles only the HTTP and HTTPS traffic. Therefore, for
clusters with limited or no Internet access, it is not possible to set up
Alertmanager email notifications, which use SMTP, when proxy is used.
Note
Due to a limitation, StackLight fails to integrate with an external
proxy with authentication handled by a proxy server. In such cases, the
proxy server ignores the HTTP Authorization header for basic
authentication passed by Prometheus Alertmanager. Therefore, use proxies
without authentication or with authentication handled by a reverse proxy.
The following table lists the components versions
of the Cluster release 5.12.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.11.0
that is introduced in the Mirantis Container Cloud release 2.4.0.
This Cluster release supports Kubernetes 1.18 and Mirantis Kubernetes Engine
3.3.4 as well as introduces support for the updated version of
Mirantis Container Runtime 19.03.14.
Note
The Cluster release 5.11.0 supports only attachment
of existing MKE 3.3.4 clusters.
For the deployment of new or attachment of existing clusters
based on other supported MKE versions,
the latest available Cluster releases are used.
For the list of known and resolved issues, refer to the Container Cloud release
2.4.0 section.
Implemented alert inhibition rules to provide a clearer view of the cloud
status and simplify troubleshooting. Using alert inhibition rules, Alertmanager
decreases alert noise by suppressing dependent alert notifications. The
feature is enabled by default. For details, see
Alert dependencies.
Implemented integration between Grafana and Kibana by adding a
View logs in Kibana link to the majority of Grafana dashboards,
which allows you to immediately view contextually relevant logs through the
Kibana web UI.
Enhanced StackLight to automatically set clusterId that defines an ID of
a Container Cloud cluster. Now, you do not need to set or modify this parameter
manually when configuring the sf-notifier and sf-reporter services.
The following table lists the components versions
of the Cluster release 5.11.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.10.0
that is introduced in the Mirantis Container Cloud release 2.3.0.
This Cluster release supports Kubernetes 1.18 and introduces support for
the latest versions of Mirantis Kubernetes Engine 3.3.4 and
Mirantis Container Runtime 19.03.13.
For the list of known and resolved issues, refer to the Container Cloud release
2.3.0 section.
The following table lists the components versions
of the Cluster release 5.10.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.9.0
that is introduced in the Mirantis Container Cloud release 2.2.0
and supports Mirantis Kubernetes Engine 3.3.3, Mirantis Container Runtime
19.03.12, and Kubernetes 1.18.
For the list of known and resolved issues, refer to the Container Cloud release
2.2.0 section.
Enhanced StackLight to monitor the number of file descriptors on nodes and
raise FileDescriptorUsage* alerts when a node uses 80%, 90%, or 95% of
file descriptors.
The following table lists the components versions
of the Cluster release 5.9.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.8.0
that is introduced in the Mirantis Container Cloud release 2.1.0
and supports Mirantis Kubernetes Engine 3.3.3, Mirantis Container Runtime
19.03.12, and Kubernetes 1.18.
For the list of known issues, refer to the Container Cloud release 2.1.0
Known issues.
Introduced Grafana Image Renderer, a separate Grafana container in a pod to
offload rendering of images from charts. Grafana Image Renderer is enabled
by default.
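In a typical Kubernetes deployment, the renderer runs as an additional
container in the Grafana pod, and Grafana is pointed at it through the
standard rendering environment variables. The following simplified pod
specification fragment is a sketch of such wiring with assumed image tags and
does not reproduce the exact StackLight pod specification:

    # Simplified fragment of a Grafana pod specification: chart rendering is
    # offloaded to a grafana-image-renderer container in the same pod.
    containers:
      - name: grafana
        image: grafana/grafana:7.1.5                  # assumed image tag
        env:
          - name: GF_RENDERING_SERVER_URL
            value: http://localhost:8081/render
          - name: GF_RENDERING_CALLBACK_URL
            value: http://localhost:3000/
      - name: grafana-image-renderer
        image: grafana/grafana-image-renderer:2.0.1   # assumed image tag
        ports:
          - containerPort: 8081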
Configured a home dashboard to replace the
Installation/configuration panel that opens when you access
Grafana. By default, Kubernetes Cluster is set as the home
dashboard. However, you can set any of the available Grafana dashboards
as the home dashboard.
Split the regional and management cluster functions in StackLight telemetry.
Now, the metrics from managed clusters are aggregated on regional clusters
first, and then both the regional and managed cluster metrics are sent from
the regional clusters to the management cluster.
Added the capability to filter panels by regions in the
Clusters Overview and Telemeter Server Grafana
dashboards.
The following table lists the components versions
of the Cluster release 5.8.0.
Note
The components that are newly added, updated, deprecated, or removed
as compared to the previous release version, are marked
with a corresponding superscript,
for example, lcm-ansibleUpdated.
This section outlines release notes for the Cluster release 5.7.0
that is introduced in the Mirantis Container Cloud release 2.0.0
and supports Mirantis Kubernetes Engine 3.3.3, Mirantis Container Runtime
19.03.12, and Kubernetes 1.18.
For the list of known issues, refer to the Container Cloud release 2.0.0
Known issues.
Since Container Cloud 2.23.2, the release train comprises several patch
releases that Mirantis delivers on top of a major release mainly to incorporate
security updates as soon as they become available without waiting for the next
major release. By significantly reducing the time to provide fixes for Common
Vulnerabilities and Exposures (CVE), patch releases protect your clusters
from cyber threats and potential data breaches.
Major and patch versions update path
The primary distinction between major and patch product versions lies in
the fact that major release versions introduce new functionalities,
whereas patch release versions predominantly offer minor product
enhancements, mostly CVE resolutions for your clusters.
Depending on your deployment needs, you can either update only between
major Cluster releases or apply patch updates between major releases.
Choosing the latter option ensures that you receive security fixes as soon as
they become available. However, be prepared to update your cluster
frequently, approximately once every three weeks.
Otherwise, you can update only between major Cluster releases as each
subsequent major Cluster release includes patch Cluster release updates
of the previous major Cluster release.
As compared to a major Cluster release update, a patch release update does not
involve any public API or LCM changes, major version bumps of MKE or other
major components, or workloads evacuation. A patch Cluster update may only
require a restart of the containers running the Container Cloud controllers,
MKE, Ceph, and StackLight services to update base images with the related
libraries and apply CVE fixes to images. The data plane is not affected.
The following table lists differences between content delivery in major
releases as compared to patch releases:
Management clusters obtain patch releases automatically the same way as major
releases. Managed clusters use the same update delivery method as for the major
Cluster release updates. New patch Cluster releases become available through
the Container Cloud web UI after automatic upgrade of a management cluster to
the latest patch Cluster release.
You may decide to use only major Cluster releases without updating to patch
Cluster releases. In this case, you will update from the N major release
directly to the N+1 major release.
Major Cluster releases include all patch updates of the previous major Cluster
release. However, Mirantis recommends applying security fixes using patch
releases as soon as they become available to avoid security threats and
potentially achieve legal compliance.
If you delay the Container Cloud upgrade and schedule it at a later time as
described in Schedule Mirantis Container Cloud updates, make sure to schedule a longer
maintenance window as the upgrade queue can include several patch releases
along with the major release upgrade.
Starting from Container Cloud 2.26.5 (Cluster releases 16.1.5 and 17.1.5),
Mirantis introduces a new update scheme for managed clusters that allows for
a flexible update path.
The user can update a managed cluster to any patch version in the series
even if a newer patch version has been released already.
Note
In Container Cloud patch releases 2.27.1 and 2.27.2,
only the 16.2.x patch Cluster releases will be delivered with an
automatic update of management clusters and the possibility to update
non-MOSK managed clusters.
In parallel, 2.27.1 and 2.27.2 will include new 16.1.x and 17.1.x patches
for MOSK 24.1.x. The first 17.2.x patch Cluster release
for MOSK 24.2.x will be delivered in 2.27.3. For details,
see MOSK documentation: Update path for 24.1 and 24.2 series.
The user cannot update a managed cluster to an intermediate patch version
in the series if a newer patch version has been released. For example,
when the patch Cluster release 17.0.4 becomes available, you can update
from 17.0.1 to 17.0.4 at once, but not from 17.0.1 to 17.0.2.
The user can always update to the newer major version from the latest
patch version of the previous series. Additionally, during the course of
a patch series, a major update is also possible from the patch version
that was released immediately before the target major version.
If the cluster starts receiving patch releases, the user must apply the
latest patch version in the series to be able to update to the following
major release. For example, to obtain the major Cluster release 17.1.0 while
using the patch Cluster release 17.0.2, you must update your cluster to the
latest patch Cluster release 17.0.4 first.
The following table lists the latest Container Cloud 2.29.x patch releases and
their supported Cluster releases that are being delivered on top of the
Container Cloud major release 2.29.0. Click the required patch release link to
learn more about its deliverables.
Container Cloud 2.29.x and supported patch Cluster releases
Cluster release is not included in the Container Cloud release yet.
Cluster release is deprecated, and you must update it to the latest
supported Cluster release. The deprecated Cluster release will become
unsupported in one of the following Container Cloud releases. Greenfield
deployments based on a deprecated Cluster release are not supported.
Use the latest supported Cluster release instead.
This section provides deprecation notes only about unsupported OpenStack cloud
provider. The information about deprecated and removed functionality of the
bare metal provider, Ceph, and StackLight was moved to
MOSK documentation: Deprecation Notes.
Deprecated and removed features of the OpenStack cloud provider

Component: OpenStack-based clusters
Deprecated in: 2.28.4
Finally available in: 2.28.5
Removed in: 2.29.0
Comments: Suspended support for OpenStack-based deployments for the sake of
the MOSK product. Simultaneously, ceased performing functional
integration testing of the OpenStack provider and removed the possibility
to update an OpenStack-based cluster to Container Cloud 2.29.0
(Cluster release 16.4.0).
Therefore, the final supported version for this cloud provider is
Container Cloud 2.28.5 (Cluster release 16.3.5). If you still require
the feature, contact Mirantis support for further information.

Component: Reference Application for workload monitoring
Deprecated in: 2.28.0
Finally available in: 2.28.2
Removed in: 2.28.3
Comments: Deprecated support for Reference Application on non-MOSK
managed clusters. Due to this deprecation, if the RefAppDown alert
is firing in the cluster, disable refapp.enabled to prevent
unnecessary alerts, as shown in the configuration sketch after these
deprecation notes.
Suspended support for regional clusters of the same or different cloud
provider type on a single management cluster.
Additionally, suspended support for several regions on a single
management cluster. Simultaneously, ceased performing functional
integration testing of the feature and removed the related code in
Container Cloud 2.26.0. If you still require this feature,
contact Mirantis support for further information.
Suspended support for attachment of existing Mirantis Kubernetes Engine
(MKE) clusters that were originally not deployed by Container Cloud.
Also suspended support for all related features, such as sharing a Ceph
cluster with an attached MKE cluster.
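As mentioned in the Reference Application deprecation note above, you can
disable Reference Application monitoring through the StackLight Helm chart
values. The following sketch assumes the usual Container Cloud pattern of
configuring StackLight through the stacklight Helm release values in the
Cluster object; verify the exact location of the values against the StackLight
configuration procedure for your release:

    # Illustrative Cluster object fragment: disable Reference Application
    # monitoring to prevent the RefAppDown alert from firing.
    spec:
      providerSpec:
        value:
          helmReleases:
            - name: stacklight
              values:
                refapp:
                  enabled: false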