Mirantis OpenStack for Kubernetes Documentation

This documentation explains how to deploy and operate a Mirantis OpenStack for Kubernetes (MOSK) environment and is intended to help operators understand the core concepts of the product.

The information provided in this documentation set is continuously improved and amended based on feedback and requests from the consumers of MOSK.

The following table lists the guides included in the documentation set you are reading:

Guide list

Guide

Purpose

Reference Architecture

Learn the fundamentals of MOSK reference architecture to appropriately plan your deployment

Deployment Guide

Deploy a MOSK environment of a preferred configuration using supported deployment profiles tailored to the demands of specific business cases

Operations Guide

Operate your MOSK environment

Release Notes

Learn about new features and bug fixes in the current MOSK version

Intended audience

This documentation is intended for engineers who have basic knowledge of Linux, virtualization and containerization technologies, the Kubernetes API and CLI, Helm and Helm charts, Mirantis Kubernetes Engine (MKE), and OpenStack.

Documentation history

The following table contains the released revision of the documentation set you are reading.

Release date

Release name

August, 2023

MOSK 23.2 series

Conventions

This documentation set uses the following conventions in the HTML format:

Documentation conventions

Convention

Description

boldface font

Inline CLI tools and commands, titles of the procedures and system response examples, table titles

monospaced font

File names and paths, Helm chart parameters and their values, names of packages, node names and labels, and so on

italic font

Information that distinguishes some concept or term

Links

External links and cross-references, footnotes

Main menu > menu item

GUI elements that include any part of interactive user interface and menu navigation

Superscript

Some extra, brief information

Note

The Note block

Messages of a generic meaning that may be useful for the user

Caution

The Caution block

Information that prevents a user from mistakes and undesirable consequences when following the procedures

Warning

The Warning block

Messages that include details that can be easily missed, but should not be ignored by the user and are valuable before proceeding

See also

The See also block

List of references that may be helpful for understanding of some related tools, concepts, and so on

Learn more

The Learn more block

Used in the Release Notes to wrap a list of internal references to the reference architecture, deployment and operation procedures specific to a newly implemented product feature

Product Overview

Mirantis OpenStack for Kubernetes (MOSK) combines the power of Mirantis Container Cloud for delivering and managing Kubernetes clusters, with the industry standard OpenStack APIs, enabling you to build your own cloud infrastructure.

The advantages of running all of the OpenStack components as a Kubernetes application are multi-fold and include the following:

  • Zero downtime, non-disruptive updates

  • Fully automated Day-2 operations

  • Full-stack management from bare metal through the operating system and all the necessary components

The list of the most common use cases includes:

Software-defined data center

The traditional data center requires multiple requests and interactions to deploy new services. By abstracting the data center functionality behind a standardized set of APIs, services can be deployed faster and more efficiently. MOSK enables you to define all your data center resources behind the industry-standard OpenStack APIs, allowing you to automate the deployment of applications or simply request resources through the UI to quickly and efficiently provision virtual machines, storage, networking, and other resources.

Virtual Network Functions (VNFs)

VNFs require high-performance systems that can be accessed on demand in a standardized way, with assurances that they will have access to the necessary resources and performance guarantees when needed. MOSK provides extensive support for VNF workloads, enabling easy access to functionality such as Intel EPA (NUMA, CPU pinning, Huge Pages) as well as the consumption of specialized network interface cards to support SR-IOV and DPDK. The centralized management model of MOSK and Mirantis Container Cloud also enables easy management of multiple MOSK deployments with full life cycle management.

Legacy workload migration

With the industry moving toward cloud-native technologies, many older or legacy applications cannot be moved easily, and often it does not make financial sense to transform them into cloud-native applications. MOSK provides a stable cloud platform that can cost-effectively host legacy applications while still providing the expected levels of control, customization, and uptime.

Reference Architecture

Mirantis OpenStack for Kubernetes (MOSK) is a virtualization platform that provides an infrastructure for cloud-ready applications, in combination with reliability and full control over the data.

MOSK combines OpenStack, an open-source cloud infrastructure software, with application management techniques used in the Kubernetes ecosystem that include container isolation, state enforcement, declarative definition of deployments, and others.

MOSK integrates with Mirantis Container Cloud to rely on its capabilities for bare-metal infrastructure provisioning, Kubernetes cluster management, and continuous delivery of the stack components.

MOSK simplifies the work of a cloud operator by automating all major cloud life cycle management routines including cluster updates and upgrades.

Deployment profiles

A Mirantis OpenStack for Kubernetes (MOSK) deployment profile is a thoroughly tested and officially supported reference architecture that is guaranteed to work at a specific scale and is tailored to the demands of a specific business case, such as generic IaaS cloud, Network Function Virtualisation infrastructure, Edge Computing, and others.

A deployment profile is defined as a combination of:

  • Services and features the cloud offers to its users.

  • Non-functional characteristics that users and operators should expect when running the profile on top of a reference hardware configuration, including but not limited to:

    • Performance characteristics, such as an average network throughput between VMs in the same virtual network.

    • Reliability characteristics, such as the cloud API error response rate when recovering a failed controller node.

    • Scalability characteristics, such as the total number of virtual routers that tenants can run simultaneously.

  • Hardware requirements: the specification of physical servers and networking equipment required to run the profile in production.

  • Deployment parameters that a cloud operator can tweak within a certain range without breaking the cloud or losing support.

In addition, the following items may be included in a definition:

  • Compliance-driven technical requirements, such as TLS encryption of all external API endpoints.

  • Foundation-level software components, such as Tungsten Fabric or Open vSwitch as a backend for the networking service.

Note

Mirantis reserves the right to revise the technical implementation of any profile at will while preserving its definition - the functional and non-functional characteristics that operators and users are known to rely on.

MOSK supports a wide range of deployment profiles to address a variety of business tasks. The table below includes the profiles for the most common use cases.

Note

Some components of a MOSK cluster are mandatory and are installed during the managed cluster deployment by Container Cloud regardless of the deployment profile in use. StackLight is one of the cluster components that are enabled by default. See Container Cloud Operations Guide for details.

Supported deployment profiles

Profile

OpenStackDeployment CR Preset

Description

Cloud Provider Infrastructure (CPI)

compute

Provides the core set of services an IaaS vendor would need, including some extra functionality. The profile is designed to support up to 50-70 compute nodes and a reasonable number of storage nodes. 0

The core set of services provided by the profile includes:

  • Compute (Nova)

  • Images (Glance)

  • Networking (Neutron with Open vSwitch as a backend)

  • Identity (Keystone)

  • Block Storage (Cinder)

  • Orchestration (Heat)

  • Load balancing (Octavia)

  • DNS (Designate)

  • Secret Management (Barbican)

  • Web front end (Horizon)

  • Bare metal provisioning (Ironic) 1 2

CPI with Tungsten Fabric

compute-tf

A variation of the CPI profile 1 with Tungsten Fabric as a backend for networking.

0

The supported node count is approximate and may vary depending on the hardware, cloud configuration, and planned workload.

1

Ironic is an optional component for the CPI profile. See Bare Metal service for details.

2

Ironic is not supported for the CPI with Tungsten Fabric profile. See Tungsten Fabric known limitations for details.

Components overview

Mirantis OpenStack for Kubernetes (MOSK) includes the following key design elements.

HelmBundle Operator

The HelmBundle Operator is the realization of the Kubernetes Operator pattern that provides a Kubernetes custom resource of the HelmBundle kind and code running inside a pod in Kubernetes. This code handles changes, such as creation, update, and deletion, in the Kubernetes resources of this kind by deploying, updating, and deleting groups of Helm releases from specified Helm charts with specified values.

OpenStack

The OpenStack platform manages virtual infrastructure resources, including virtual servers, storage devices, networks, and networking services, such as load balancers, as well as provides management functions to the tenant users.

Various OpenStack services are running as pods in Kubernetes and are represented as appropriate native Kubernetes resources, such as Deployments, StatefulSets, and DaemonSets.

For a simple, resilient, and flexible deployment of OpenStack and related services on top of a Kubernetes cluster, MOSK uses OpenStack-Helm that provides a required collection of the Helm charts.

Also, MOSK uses OpenStack Controller (Rockoon) as the realization of the Kubernetes Operator pattern. Rockoon provides a custom Kubernetes resource of the OpenStackDeployment kind and code running inside a pod in Kubernetes. This code handles changes such as creation, update, and deletion in the Kubernetes resources of this kind by deploying, updating, and deleting groups of the Helm releases.

Ceph

Ceph is a distributed storage platform that provides storage resources, such as objects and virtual block devices, to virtual and physical infrastructure.

MOSK uses Rook as the implementation of the Kubernetes Operator pattern that manages resources of the CephCluster kind to deploy and manage Ceph services as pods on top of Kubernetes to provide Ceph-based storage to the consumers, which include OpenStack services, such as Volume and Image services, and underlying Kubernetes through Ceph CSI (Container Storage Interface).

The Ceph Controller is the implementation of the Kubernetes Operator pattern that manages resources of the MiraCeph kind to simplify management of the Rook-based Ceph clusters.

StackLight Logging, Monitoring, and Alerting

The StackLight component is responsible for collection, analysis, and visualization of critical monitoring data from physical and virtual infrastructure, as well as alerting and error notifications through a configured communication system, such as email. StackLight includes the following key sub-components:

  • Prometheus

  • OpenSearch

  • OpenSearch Dashboards

  • Fluentd

Requirements

MOSK cluster hardware requirements

This section provides hardware requirements for the Mirantis Container Cloud management cluster with a managed Mirantis OpenStack for Kubernetes (MOSK) cluster.

To install MOSK, the Mirantis Container Cloud management cluster and managed cluster must be deployed with the bare metal provider.

Important

A MOSK cluster is to be used for a deployment of an OpenStack cluster and its components. Deployment of third-party workloads on a MOSK cluster is neither allowed nor supported.

Note

One of the industry best practices is to verify every new update or configuration change in a non-customer-facing environment before applying it to production. Therefore, Mirantis recommends having a staging cloud, deployed and maintained along with the production clouds. The recommendation is especially applicable to the environments that:

  • Receive updates often and use continuous delivery. For example, any non-isolated deployment of Mirantis Container Cloud.

  • Have significant deviations from the reference architecture or third party extensions installed.

  • Are managed under the Mirantis OpsCare program.

  • Run business-critical workloads where even the slightest application downtime is unacceptable.

A typical staging cloud is a complete copy of the production environment including the hardware and software configurations, but with a bare minimum of compute and storage capacity.

The table below describes the node types the MOSK reference architecture includes.

MOSK node types

Node type

Description

Mirantis Container Cloud management cluster nodes

The Container Cloud management cluster architecture on bare metal requires three physical servers for manager nodes. On these hosts, we deploy a Kubernetes cluster with services that provide Container Cloud control plane functions.

OpenStack control plane node and StackLight node

Hosts OpenStack control plane services such as database, messaging, API, schedulers, conductors, and L3 and L2 agents, as well as the StackLight components.

Note

MOSK enables the cloud operator to collocate the OpenStack control plane with the managed cluster master nodes in small OpenStack deployments. This capability is available as a technical preview. Use this configuration for testing and evaluation purposes only.

Tenant gateway node

Optional. Hosts OpenStack gateway services including L2, L3, and DHCP agents. The tenant gateway nodes are combined with OpenStack control plane nodes. The strict requirement is a dedicated physical network (bond) for tenant network traffic.

Tungsten Fabric control plane node

Required only if Tungsten Fabric is enabled as a backend for the OpenStack networking. These nodes host the TF control plane services such as Cassandra database, messaging, API, control, and configuration services.

Tungsten Fabric analytics node

Required only if Tungsten Fabric is enabled as a backend for the OpenStack networking. These nodes host the TF analytics services such as Cassandra, ZooKeeper, and collector.

Compute node

Hosts the OpenStack Compute services such as QEMU, L2 agents, and others.

Infrastructure nodes

Run underlying Kubernetes cluster management services. The MOSK reference configuration requires a minimum of three infrastructure nodes.

The table below specifies the hardware resources the MOSK reference architecture recommends for each node type.

Hardware requirements

Node type                                                   # of servers  CPU cores per server  RAM per server, GB  Disk space per server, GB      NICs per server
----------------------------------------------------------  ------------  --------------------  ------------------  -----------------------------  ---------------
Mirantis Container Cloud management cluster node            3 [0]         16                    128                 1 SSD x 960, 1 SSD x 1900 [1]  3 [2]
OpenStack control plane, gateway [3], and StackLight nodes  3 or more     32                    128                 1 SSD x 500, 2 SSD x 1000 [6]  5
Tenant gateway (optional)                                   0-3           32                    128                 1 SSD x 500                    5
Tungsten Fabric control plane nodes [4]                     3             16                    64                  1 SSD x 500                    1
Tungsten Fabric analytics nodes [4]                         3             32                    64                  1 SSD x 1000                   1
Compute node                                                3 (varies)    16                    64                  1 SSD x 500 [7]                5
Infrastructure node (Kubernetes cluster management)         3 [8]         16                    64                  1 SSD x 500                    5
Infrastructure node (Ceph) [5]                              3             16                    64                  1 SSD x 500, 2 HDDs x 2000     5

Bracketed numbers refer to the numbered footnotes that follow the table.

Note

The exact hardware specifications and number of the control plane and gateway nodes depend on a cloud configuration and scaling needs. For example, for the clouds with more than 12,000 Neutron ports, Mirantis recommends increasing the number of gateway nodes.

0

Adding more than 3 nodes to a management cluster is not supported.

1

In total, at least 2 disks are required:

  • disk0 - system storage, minimum 120 GB.

  • disk1 - Container Cloud services storage, not less than 110 GB. The exact capacity requirements depend on StackLight data retention period.

See Management cluster storage for details.

2

OOB management (IPMI) port is not included.

3

OpenStack gateway services can optionally be moved to separate nodes.

4

TF control plane and analytics nodes can be combined, with a respective addition of RAM, CPU, and disk space to the hardware hosts. However, Mirantis does not recommend such a configuration for production environments because it increases the risk of cluster downtime if one of the nodes fails unexpectedly.

5
  • A Ceph cluster with 3 Ceph nodes does not provide hardware fault tolerance and is not eligible for recovery operations, such as a disk or an entire node replacement. Therefore, a minimum of 5 Ceph nodes is recommended for production use.

  • A Ceph cluster uses the replication factor that equals 3. If the number of Ceph OSDs is less than 3, a Ceph cluster moves to the degraded state with the write operations restriction until the number of alive Ceph OSDs equals the replication factor again.

6
  • 1 SSD x 500 for operating system

  • 1 SSD x 1000 for OpenStack LVP

  • 1 SSD x 1000 for StackLight LVP

7

When Nova is used with local folders, additional capacity is required depending on the VM images size.

8

For node hardware requirements, refer to Container Cloud Reference Architecture: Managed cluster hardware configuration.

Note

If you would like to evaluate the MOSK capabilities and do not have much hardware at your disposal, you can deploy it in a virtual environment, for example, on top of another OpenStack cloud using the sample Heat templates.

Note that the tooling is provided for reference only and is not a part of the product itself. Mirantis does not guarantee its interoperability with any MOSK version.

Management cluster storage

The management cluster requires a minimum of two storage devices per node. Each device is used for a different type of storage:

  • One storage device for boot partitions and the root file system. SSD is recommended. A RAID device is not supported.

  • One storage device per server is reserved for local persistent volumes. These volumes are served by the Local Storage Static Provisioner (local-volume-provisioner) and used by many services of Mirantis Container Cloud.

You can configure host storage devices using BareMetalHostProfile resources. For details, see Create a custom bare metal host profile.

System requirements for the seed node

The seed node is only necessary to deploy the management cluster. When the bootstrap is complete, the bootstrap node can be discarded and added back to the MOSK cluster as a node of any type.

The minimum reference system requirements for a baremetal-based bootstrap seed node are as follows:

  • Basic Ubuntu 18.04 server with the following configuration:

    • Kernel of version 4.15.0-76.86 or later

    • 8 GB of RAM

    • 4 CPU

    • 10 GB of free disk space for the bootstrap cluster cache

  • No DHCP or TFTP servers on any NIC networks

  • Routable access to the IPMI network of the hardware servers

  • Internet access for downloading of all required artifacts

    If you use a firewall or proxy, make sure that the bootstrap and management clusters have access to the following IP ranges and domain names:

    • IP ranges:

    • Domain names:

      • mirror.mirantis.com and repos.mirantis.com for packages

      • binary.mirantis.com for binaries and Helm charts

      • mirantis.azurecr.io and *.blob.core.windows.net for Docker images

      • mcc-metrics-prod-ns.servicebus.windows.net:9093 for Telemetry (port 443 if proxy is enabled)

      • mirantis.my.salesforce.com and login.salesforce.com for Salesforce alerts

    Note

    • Access to Salesforce is required from any Container Cloud cluster type.

    • If any additional Alertmanager notification receiver is enabled, for example, Slack, its endpoint must also be accessible from the cluster.
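When reviewing firewall or proxy rules, it can help to check mechanically that every domain name from the list above is covered. The following is a minimal sketch, assuming the rules are available as shell-style wildcard patterns; the `allowlist` content and the helper name are hypothetical, and the sketch checks names only, not ports or IP ranges.

```python
from fnmatch import fnmatch

# Domain names listed above; the wildcard entry is kept as written.
REQUIRED_DOMAINS = [
    "mirror.mirantis.com",
    "repos.mirantis.com",
    "binary.mirantis.com",
    "mirantis.azurecr.io",
    "*.blob.core.windows.net",
    "mcc-metrics-prod-ns.servicebus.windows.net",
    "mirantis.my.salesforce.com",
    "login.salesforce.com",
]


def missing_from_allowlist(allowlist):
    """Return the required domains that no allowlist pattern covers."""
    def covered(domain):
        return any(domain == rule or fnmatch(domain, rule) for rule in allowlist)

    return [domain for domain in REQUIRED_DOMAINS if not covered(domain)]


# Hypothetical proxy allowlist that forgets the Salesforce endpoints.
allowlist = ["*.mirantis.com", "mirantis.azurecr.io", "*.windows.net"]
print(missing_from_allowlist(allowlist))
```

An empty result means every listed name matches at least one rule. Endpoints of any additional Alertmanager receivers would need to be appended to the list.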

Components collocation

MOSK uses Kubernetes labels to place components onto hosts. For the default locations of components, see MOSK cluster hardware requirements. Additionally, MOSK supports component collocation. This is mostly useful for OpenStack compute and Ceph nodes. For component collocation, consider the following recommendations:

  • When calculating hardware requirements for nodes, consider the requirements for all collocated components.

  • When performing maintenance on a node with collocated components, execute the maintenance plan for all of them.

  • When combining other services with the OpenStack compute host, verify that the reserved_host_* values are increased according to the needs of the collocated components by using node-specific overrides for the compute service.
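As a rough illustration of the first recommendation, the sketch below sums per-component requirements for a node that hosts several components. The component names and figures are hypothetical placeholders and are not MOSK sizing guidance.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Requirements:
    """Resources one component needs on a host."""
    cpu_cores: int
    ram_gb: int
    disk_gb: int

    def __add__(self, other: "Requirements") -> "Requirements":
        return Requirements(
            self.cpu_cores + other.cpu_cores,
            self.ram_gb + other.ram_gb,
            self.disk_gb + other.disk_gb,
        )


def collocated_total(components: Mapping[str, Requirements]) -> Requirements:
    """Sum the requirements of every component placed on the same node."""
    total = Requirements(0, 0, 0)
    for requirements in components.values():
        total = total + requirements
    return total


# Hypothetical figures for a node that runs both compute and Ceph services.
node = {
    "openstack-compute": Requirements(cpu_cores=16, ram_gb=64, disk_gb=500),
    "ceph-osd": Requirements(cpu_cores=4, ram_gb=16, disk_gb=4000),
}
print(collocated_total(node))
```

The same total is a reasonable starting point when deciding how much to reserve on the host for the non-compute components.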

Infrastructure requirements

This section lists the infrastructure requirements for the Mirantis OpenStack for Kubernetes (MOSK) reference architecture.

Infrastructure requirements

Service

Description

MetalLB

MetalLB exposes external IP addresses of cluster services to access applications in a Kubernetes cluster.

DNS

The Kubernetes Ingress NGINX controller is used to expose OpenStack services outside of a Kubernetes deployment. Access to the Ingress services is allowed only by its FQDN. Therefore, DNS is a mandatory infrastructure service for an OpenStack on Kubernetes deployment.

Automatic upgrade of a host operating system

To keep the operating system on a bare metal host up to date with the latest security updates, the operating system requires periodic upgrades of software packages, which may or may not require a host reboot.

Mirantis Container Cloud uses life cycle management tools to update the operating system packages on the bare metal hosts.

In a management cluster, software package upgrade and host restart are applied automatically when a new Container Cloud version with available kernel or software packages upgrade is released.

In a managed cluster, package upgrade and host restart are applied as part of usual cluster update, when applicable. To start planning the maintenance window and proceed with the managed cluster update, see Cluster update.

Operating system upgrade and host restart are applied to cluster nodes one by one. If Ceph is installed in the cluster, the Container Cloud orchestration securely pauses the Ceph OSDs on the node before restart. This prevents degradation of the storage service.

Cloud services

Each section below is dedicated to a particular service provided by MOSK and contains configuration details and usage samples of the supported capabilities provided through the custom resources.

Core cloud services:

Compute service

Mirantis OpenStack for Kubernetes (MOSK) provides instances management capability through the Compute service (OpenStack Nova). The Compute service interacts with other OpenStack components of an OpenStack environment to provide life-cycle management of the virtual machine instances.

Resource oversubscription

The Compute service (OpenStack Nova) enables you to spawn instances that can collectively consume more resources than what is physically available on a compute node through resource oversubscription, also known as overcommit or allocation ratio.

Resources available for oversubscription on a compute node include the number of CPUs, amount of RAM, and amount of available disk space. When making a scheduling decision, the scheduler of the Compute service takes into account the actual amount of resources multiplied by the allocation ratio. Thereby, the service allocates resources based on the assumption that not all instances will be using their full allocation of resources at the same time.
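The arithmetic described above can be sketched as follows. This is a simplified illustration that multiplies the physical amount by the allocation ratio and ignores any reserved resources; the node size is hypothetical.

```python
def schedulable_capacity(physical: float, allocation_ratio: float) -> float:
    """Capacity the scheduler considers: physical amount times the ratio."""
    if allocation_ratio <= 0:
        raise ValueError("allocation ratio must be positive")
    return physical * allocation_ratio


# Hypothetical compute node: 32 CPUs, 256 GB of RAM, 1000 GB of disk.
physical = {"cpu": 32, "ram": 256, "disk": 1000}
ratios = {"cpu": 8.0, "ram": 1.0, "disk": 1.6}

capacity = {
    name: schedulable_capacity(physical[name], ratios[name]) for name in ratios
}
print(capacity)
```

With these numbers, the scheduler would treat the node as offering 256 vCPUs while the RAM capacity stays equal to the physical amount.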

Oversubscription enables you to increase the density of workloads and compute resource utilization and, thus, achieve better Return on Investment (ROI) on compute hardware. In addition, oversubscription can also help avoid the need to create too many fine-grained flavors, which is commonly known as flavor explosion.

Configuring initial resource oversubscription

Available since MOSK 23.1

There are two ways to control the oversubscription values for compute nodes:

  • The legacy approach entails utilizing the {cpu,disk,ram}_allocation_ratio configuration options offered by the Compute service. A drawback of this method is that restarting the Compute service is mandatory to apply the new configuration. This introduces the risk of possible interruptions of cloud user operations, for example, instance build failures.

  • The modern and recommended approach, adopted in MOSK 23.1, involves using the initial_{cpu,disk,ram}_allocation_ratio configuration options, which are employed exclusively during the initial provisioning of a compute node. This may occur during the initial deployment of the cluster or when new compute nodes are added subsequently. Any further alterations can be performed dynamically using the OpenStack Placement service API without necessitating the restart of the service.
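To give an idea of what a dynamic change through the Placement service looks like, the sketch below only builds the request path and body for an inventory update and sends nothing. It assumes the standard Placement inventory update call (PUT on a resource provider inventory with the provider generation in the body); the UUID, generation, and inventory values are made up, and the helper name is hypothetical. Follow the documented MOSK procedure for the actual change.

```python
import json

# Mapping from the short names used in this document to the standard
# Placement resource classes.
RESOURCE_CLASSES = {"cpu": "VCPU", "ram": "MEMORY_MB", "disk": "DISK_GB"}


def build_inventory_update(provider_uuid, generation, resource, inventory, new_ratio):
    """Return (path, body) for a Placement inventory update.

    `inventory` is the current inventory record of the resource class;
    only allocation_ratio is changed, all other fields are passed through.
    """
    resource_class = RESOURCE_CLASSES[resource]
    body = dict(inventory)
    body["allocation_ratio"] = float(new_ratio)
    body["resource_provider_generation"] = generation
    path = f"/resource_providers/{provider_uuid}/inventories/{resource_class}"
    return path, body


# Hypothetical provider UUID, generation, and inventory values.
path, body = build_inventory_update(
    provider_uuid="11111111-2222-3333-4444-555555555555",
    generation=7,
    resource="cpu",
    inventory={"total": 32, "reserved": 0, "min_unit": 1, "max_unit": 32,
               "step_size": 1, "allocation_ratio": 8.0},
    new_ratio=4.0,
)
print("PUT", path)
print(json.dumps(body, indent=2))
```

Because the change is an API call against the resource provider, no restart of the Compute service is involved.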

There is no definitive method for selecting optimal oversubscription values. As a cloud operator, you should continuously monitor your workloads, ideally have a comprehensive understanding of their nature, and experimentally determine the maximum values that do not impact performance. This approach ensures maximum workload density and cloud resource utilization.

To configure the initial compute resource oversubscription in MOSK, specify the spec:features:nova:allocation_ratios parameter in the OpenStackDeployment custom resource as explained in the table below.

Resource oversubscription configuration

Parameter

spec:features:nova:allocation_ratios

Configuration

Configure initial oversubscription of CPU, disk space, and RAM resources on compute nodes. By default, the following values are applied:

  • cpu: 8.0

  • disk: 1.6

  • ram: 1.0

Note

In MOSK 22.5 and earlier, the effective default value of RAM allocation ratio is 1.1.

Warning

Mirantis strongly advises against oversubscribing RAM, by any amount. See Preventing resource overconsumption for details.

Changing the resource oversubscription configuration through the OpenStackDeployment resource after cloud deployment will only affect the newly added compute nodes and will not change oversubscription for already existing compute nodes. To change oversubscription for already existing compute nodes, use the placement service API as described in Change oversubscription settings for existing compute nodes.

Usage

Configuration example:

kind: OpenStackDeployment
spec:
  features:
    nova:
      allocation_ratios:
        cpu: 8
        disk: 1.6
        ram: 1.0

Configuration example of setting different oversubscription values for specific nodes:

spec:
  nodes:
    compute-type::hi-perf:
      features:
        nova:
          allocation_ratios:
            cpu: 2.0
            disk: 1.0

In the example configuration above, the compute nodes labeled with compute-type=hi-perf use less intense oversubscription of CPU and no oversubscription of disk.
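The effect of a node-specific override can be pictured as a dictionary merge. The sketch below is illustrative only: it assumes that values not listed in the override (RAM in the example above) fall back to the cluster-wide or default ratio, which the text does not state explicitly, and the function name is hypothetical.

```python
DEFAULT_RATIOS = {"cpu": 8.0, "disk": 1.6, "ram": 1.0}


def effective_ratios(cluster_ratios, node_overrides, node_labels):
    """Merge cluster-wide ratios with overrides whose label matches the node.

    `node_overrides` maps "key::value" selectors, written as in spec:nodes,
    to partial ratio dictionaries.
    """
    result = {**DEFAULT_RATIOS, **cluster_ratios}
    for selector, ratios in node_overrides.items():
        key, _, value = selector.partition("::")
        if node_labels.get(key) == value:
            result.update(ratios)
    return result


overrides = {"compute-type::hi-perf": {"cpu": 2.0, "disk": 1.0}}
hi_perf = effective_ratios({}, overrides, {"compute-type": "hi-perf"})
regular = effective_ratios({}, overrides, {"compute-type": "standard"})
print(hi_perf)
print(regular)
```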

Preventing resource overconsumption

When using oversubscription, it is important to conduct thorough cloud management and monitoring to avoid system overloading and performance degradation. If many or all instances on a compute node start using all allocated resources at once and, thereby, overconsume physical resources, failure scenarios depend on the resource being exhausted.

Symptoms of resource exhaustion

Affected resource

Symptoms

CPU

Workloads are getting slower as they actively compete for physical CPU usage. A useful indicator is the steal time as reported inside the workload, which is a percentage of time the operating system in the workload is waiting for actual physical CPU core availability to run instructions.

To verify the steal time in the Linux-based workload, use the top command:

top -bn1 | head | grep st$ | awk -F ',' '{print $NF}'

Generally, steal times above 10% that persist for 20-30 minutes are considered alarming.
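For longer observation periods, the same indicator can be derived from two samples of the aggregate cpu line in /proc/stat. The following is a minimal sketch, assuming a Linux workload; the sample lines are made up for illustration.

```python
def steal_percent(before: str, after: str) -> float:
    """Share of steal time between two aggregate "cpu" lines of /proc/stat.

    The fields after the "cpu" label are: user nice system idle iowait irq
    softirq steal guest guest_nice. Guest time is already accounted in user
    and nice, so only the first eight counters form the total.
    """
    def parse(line):
        fields = line.split()
        if fields[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        return [int(value) for value in fields[1:9]]

    first, second = parse(before), parse(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0


# Made-up samples taken some time apart inside a workload.
sample_1 = "cpu 1000 0 500 8000 100 0 50 350 0 0"
sample_2 = "cpu 1400 0 700 8500 120 0 60 470 0 0"
print(f"steal: {steal_percent(sample_1, sample_2):.1f}%")
```

In a real workload, each sample would be the first line read from /proc/stat, taken at the start and at the end of the observation interval.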

RAM

The operating system on the compute node starts to aggressively use physical swap space, which significantly slows the workloads down. Sometimes, when the swap is also exhausted, the operating system of a compute node can outright OOM-kill the most offending processes, which can cause major disruptions to workloads or the compute node itself.

Warning

While it may seem like a good idea to make the most of available resources, oversubscribing RAM can lead to various issues and is generally not recommended due to potential performance degradation, reduced stability, and security risks for the workloads.

Mirantis strongly advises against oversubscribing RAM, by any amount.

Disk space

Depends on the physical layout of storage. Virtual root and ephemeral storage devices that are hosted on the compute node itself are put in read-only mode, which negatively affects workloads. Additionally, the file system used by the operating system on a compute node may also become read-only, blocking the compute node operability.

There are workload types that are not suitable for running in an oversubscribed environment, especially those with high performance, latency-sensitive, or real-time requirements. Such workloads are better suited for compute nodes with dedicated CPUs, ensuring that only processes of a single instance run on each CPU core.

Virtual CPU

MOSK provides the capability to configure virtual CPU types for OpenStack instances through the OpenStackDeployment custom resource. This feature enables cloud users to tailor performance and resource allocation within their OpenStack environment to meet specific workload demands effectively.

Parameter

spec:features:nova:vcpu_type

Usage

Configures the type of virtual CPU that Nova will use when creating instances.

The list of supported CPU models includes host-model (default), host-passthrough, and custom models.

The host-model CPU model

The host-model CPU model (default) mimics the host CPU and provides for decent performance, good security, and moderate compatibility with live migrations.

With this mode, libvirt finds an available predefined CPU model that best matches the host CPU, and then explicitly adds the missing CPU feature flags to closely match the host CPU features. To mitigate known security flaws, libvirt automatically adds critical CPU flags, supported by installed libvirt, QEMU, kernel, and CPU microcode versions.

This is a safe choice if your OpenStack compute node CPUs are of the same generation. If your OpenStack compute node CPUs are sufficiently different, for example, span multiple CPU generations, Mirantis strongly recommends setting explicit CPU models supported by all of your OpenStack compute node CPUs or organizing your OpenStack compute nodes into host aggregates and availability zones that have largely identical CPUs.

Note

The host-model model does not guarantee two-way live migrations between nodes.

When migrating instances, the libvirt domain XML is first copied as is to the destination OpenStack compute node. Once the instance is hard rebooted or shut down and started again, the domain XML is regenerated. If the versions of libvirt, kernel, CPU microcode, or BIOS firmware differ from those on the source compute node where the instance was originally started, libvirt may pick up additional CPU feature flags, making it impossible to live-migrate the instance back to the original compute node.

The host-passthrough CPU model

The host-passthrough CPU model provides maximum performance, especially when nested virtualization is required or if live migration support is not a concern for workloads. Live migration requires exactly the same CPU on all OpenStack compute nodes, including the CPU microcode and kernel versions. Therefore, to support live migration, organize your compute nodes into host aggregates and availability zones. For workload migration between non-identical OpenStack compute nodes, contact Mirantis support.

For example, to set the host-passthrough CPU model for all OpenStack compute nodes:

spec:
  features:
    nova:
      vcpu_type: host-passthrough

Custom CPU model

MOSK enables you to specify a comma-separated list of exact QEMU CPU models to create and emulate. Specify the common and less advanced CPU models first. All explicit CPU models provided must be compatible with the OpenStack compute node CPUs.
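The following is a minimal sketch of such a comma-separated list. The model names are illustrative only and assume that every compute node supports both of them; verify the names against the CPU model definition files on your own nodes before use:

```yaml
spec:
  features:
    nova:
      # Illustrative model names: the more common, less advanced model comes first
      vcpu_type: "Westmere,Haswell-noTSX"
```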

To specify an exact CPU model, review the available CPU models and their features. List and inspect the /usr/share/libvirt/cpu_map/*.xml files in the libvirt containers of pods of the libvirt DaemonSet or multiple DaemonSets if you are using node-specific settings.

To review the available CPU models
  1. Identify the available libvirt DaemonSets:

    kubectl -n openstack get ds -l application=libvirt --show-labels
    

    Example of system response:

    NAME                     DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR                   AGE  LABELS
    libvirt-libvirt-default  2        2        2      2           2          openstack-compute-node=enabled  34d  app.kubernetes.io/managed-by=Helm,application=libvirt,component=libvirt,release_group=openstack-libvirt
    
  2. Identify the pods of libvirt DaemonSets:

    kubectl -n openstack get po -l application=libvirt,release_group=openstack-libvirt
    

    Example of system response:

    NAME                           READY  STATUS   RESTARTS  AGE
    libvirt-libvirt-default-5zs8m  2/2    Running  0         8d
    libvirt-libvirt-default-vt8wd  2/2    Running  0         3d14h
    
  3. List and review the available CPU model definition files. For example:

    kubectl -n openstack exec -ti libvirt-libvirt-default-5zs8m -c libvirt -- ls /usr/share/libvirt/cpu_map/*.xml
    
  4. List and review the content of all CPU model definition files. For example:

    kubectl -n openstack exec -ti libvirt-libvirt-default-5zs8m -c libvirt -- bash -c 'for f in /usr/share/libvirt/cpu_map/*.xml; do echo "$f"; cat "$f"; done'
    

For example, for nodes that are labeled with processor=amd-epyc, set a custom EPYC CPU model:

spec:
  nodes:
    processor::amd-epyc:
      features:
        nova:
          vcpu_type: EPYC
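
The global and node-specific settings can be combined in one OpenStackDeployment custom resource. The sketch below assumes that host-model is kept as the cloud-wide value and that only the nodes carrying the processor=amd-epyc label receive the custom model:

```yaml
spec:
  features:
    nova:
      # Applies to all compute nodes that have no node-specific override
      vcpu_type: host-model
  nodes:
    # Node label and value joined with a double colon, as in the example above
    processor::amd-epyc:
      features:
        nova:
          vcpu_type: EPYC
```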

Instance migration

OpenStack supports the following types of instance migrations:

  • Cold migration (also referred to simply as migration)

    The process involves shutting down the instance, copying its definition and disk, if necessary, to another host, and then starting the instance again on the new host.

    This method disrupts the workload running inside the instance but allows for more reliability and works for most types of instances and consumed resources.

  • Live migration

    The process involves copying the instance definition, memory, and disk, if necessary, to another host while the instance continues running, without shutting it down. The instance then momentarily switches to run on the new host.

    While generally less disruptive to workloads, this method is less reliable and imposes more restrictions on the instance and target host properties to succeed.

Configuring live migration

As a cloud operator, you can configure live migration through the OpenStackDeployment custom resource. The following table provides the details on available configuration.

Parameter

Usage

features:nova:live_migration_interface

Specifies the name of the NIC device on the actual host that will be used by Nova for the live migration of instances.

Mirantis recommends setting up your Kubernetes hosts in such a way that networking is configured identically on all of them, and names of the interfaces serving the same purpose or plugged into the same network are consistent across all physical nodes.

Also, set the option to vhost0 in the following cases:

  • The Neutron service uses Tungsten Fabric.

  • Nova migrates instances through the interface specified by the Neutron tunnel_interface parameter.

features:nova:libvirt:tls

Available since MOSK 23.2. If set to true, enables the live migration over TLS:

spec:
  features:
    nova:
      libvirt:
        tls:
          enabled: true

See also Encryption of live migration data.
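
As an illustration, both parameters from the table above can be set together. The interface name bond0 is a hypothetical placeholder for the NIC that carries migration traffic on your hosts; with Tungsten Fabric, the value would be vhost0 instead:

```yaml
spec:
  features:
    nova:
      # Hypothetical NIC name; must exist under the same name on every compute host
      live_migration_interface: bond0
      libvirt:
        tls:
          # Live migration over TLS, available since MOSK 23.2
          enabled: true
```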

Allowing non-administrative users to migrate instances

Available since MOSK 24.3

MOSK provides the following distinct sets of policies that govern access to cold and live migrations:

  • os_compute_api:os-migrate-server:migrate and os_compute_api:os-migrate-server:migrate_live define the ability to initiate migrations without specifying the target host. In this case, the OpenStack Compute scheduler selects the best suited target host automatically.

  • os_compute_api:os-migrate-server:migrate:host and os_compute_api:os-migrate-server:migrate_live:host define the ability to initiate migration together with specifying the target host. Depending on the API microversion used to start the migration, the host is either validated by the scheduler (recommended) or forced regardless of other considerations. The latter option is not recommended as it may lead to inconsistencies in the internal state of the Compute service.

Since MOSK 24.3, the default policies for migrations without the target host specification are set to rule:project_member_or_admin. This means that migration is available to both cloud administrators and project users with the member role.

The migration to a specific host requires administrative privileges.

Revert to admin-only migrations

If the default policy does not suit your deployment, you can require administrative access for all instance migrations by setting these policy values to rule:context_is_admin, or any other value appropriate for your use case.

If you use the default policies and want to revert to the old defaults, ensure that the following snippet is present in your OpenStackDeployment custom resource:

kind: OpenStackDeployment
spec:
  features:
    policies:
      nova:
        os_compute_api:os-migrate-server:migrate: rule:context_is_admin
        os_compute_api:os-migrate-server:migrate_live: rule:context_is_admin

Image storage backend

Parameter

features:nova:images:backend

Usage

Defines the type of storage for Nova to use on the compute hosts for the images that back the instances.

The list of supported options includes:

  • local Deprecated

    Option is deprecated and replaced by qcow2.

  • qcow2

    The local storage is used. The backend disk image format is qcow2. The pros include faster operation and a failure domain that is independent of external storage. The cons include local disk space consumption and less performant, less robust live migration, which relies on block migration.

  • raw Available since 24.2

    The local storage is used. The backend disk image format is raw. Raw images are simple binary dumps of disk data, including empty space, resulting in larger file sizes. They provide superior performance because they do not incur overhead from features such as compression or copy-on-write, which are present in the qcow2 disk images.

  • ceph

    Instance images are stored in a Ceph pool shared across all Nova hypervisors. The pros include faster image start and faster, more robust live migration. The cons include considerably slower I/O performance and a direct dependency of workload operations on Ceph cluster availability and performance.

  • lvm TechPreview

    Instance images and ephemeral images are stored on a local Logical Volume. If specified, features:nova:images:lvm:volume_group must be set to an available LVM Volume Group, by default, nova-vol. For details, see Enable LVM ephemeral storage.
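
For example, a sketch of selecting the lvm backend, built from the two parameter paths named above. The volume group is written out explicitly here even though nova-vol is the stated default:

```yaml
spec:
  features:
    nova:
      images:
        backend: lvm
        lvm:
          # Must match an LVM Volume Group that already exists on the compute hosts
          volume_group: nova-vol
```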

Remote console access to virtual machines

MOSK provides a number of different methods to interact with OpenStack virtual machines including VNC (default) and SPICE remote consoles. This section outlines how you can configure these different console services through the OpenStackDeployment custom resource.

noVNC-based VNC remote console

The noVNC client provides remote control or remote desktop access to guest virtual machines through the Virtual Network Computing (VNC) system. The MOSK Compute service users can access their instances using the noVNC clients through the noVNC proxy server.

The VNC remote console is enabled by default in MOSK.

To disable VNC remote console through the OpenStackDeployment custom resource, set spec:features:nova:console:novnc:enabled to false:

spec:
  features:
    nova:
      console:
        novnc:
          enabled: false

Encryption of data transfer for the noVNC client

Available since MOSK 23.1

MOSK uses TLS to secure public-facing VNC access on networks between a noVNC client and noVNC proxy server.

The features:nova:console:novnc:tls:enabled parameter ensures that the data transferred between the instance and the noVNC proxy server is encrypted. Both servers use the VeNCrypt authentication scheme for the data encryption.

To enable the encrypted data transfer for noVNC, use the following structure in the OpenStackDeployment custom resource:

kind: OpenStackDeployment
spec:
  features:
    nova:
      console:
        novnc:
          tls:
            enabled: true

SPICE remote console

Available since MOSK 24.1 TechPreview

The VNC protocol has its limitations, such as the lack of support for multiple monitors, bi-directional audio, reliable cut-and-paste, video streaming, and others. The SPICE protocol aims to overcome these limitations and deliver robust remote desktop support.

The SPICE remote console is disabled by default in MOSK.

To enable SPICE remote console through the OpenStackDeployment custom resource, set spec:features:nova:console:spice:enabled to true:

spec:
  features:
    nova:
      console:
        spice:
          enabled: true

GPU virtualization

Available since MOSK 24.1 TechPreview

MOSK provides GPU virtualization capabilities to its users through the NVIDIA vGPU and Multi-Instance GPU (MIG) technologies.

GPU virtualization is a capability offered by modern datacenter-grade GPUs, enabling the partitioning of a single physical GPU into smaller virtual devices that can then be attached to individual virtual machines.

In contrast to the Peripheral Component Interconnect (PCI) passthrough feature, leveraging the GPU virtualization enables concurrent utilization of the same physical GPU device by multiple virtual machines. This enhances hardware utilization and fosters a more elastic consumption of expensive hardware resources.

When using GPU virtualization, the physical device and its drivers manage computing resource partitioning and isolation.

The use case for GPU virtualization aligns with any application that requires or benefits from accelerated parallel floating-point calculations. Examples include graphics-intensive desktop workloads, such as 3D modeling and rendering, and computationally intensive tasks, such as artificial intelligence, specifically machine learning training and classification.

At its core, GPU virtualization is based on the single-root input/output virtualization (SR-IOV) framework, which is already widely used by datacenter-grade network adapters, and on the mediated devices framework of the Linux kernel.

Hardware drivers

Typically, using GPU virtualization requires the installation of specific physical GPU drivers on the host system. For detailed instructions on obtaining and installing the required drivers, refer to official documentation from the vendor of your GPU.

For the latest family of NVIDIA GPUs under NVIDIA AI Enterprise, start with NVIDIA AI Enterprise documentation.

You can automate the configuration of drivers by adding a custom post-install script to the BareMetalHostProfile object of your MOSK cluster. See Configure GPU virtualization for details.

NVIDIA GPU virtualization modes

Certain NVIDIA GPUs, for example, Ampere GPU architecture and later, support GPU virtualization in two modes: time-sliced (vGPU) or Multi-Instance GPU (MIG). Older architectures support only the time-sliced mode.

The distinction between these modes lies in resource isolation, dedicated performance levels, and partitioning flexibility.

Typically, there is no fixed rule dictating which mode should be used, as it depends on the intended workloads for the virtual GPUs and the level of experience and assurances the cloud operator aims to offer users. Below, there is a brief overview of the differences between these two modes.

Time-sliced vGPUs

In time-sliced vGPU mode, each virtual GPU is allocated dedicated slices of the physical GPU memory while sharing the physical GPU engines. Only one vGPU operates at a time, with full access to all physical GPU engines. The resource scheduler within the physical GPU regulates the timing of each vGPU execution, ensuring fair allocation of resources.

Therefore, this setup may encounter issues with noisy neighbors, where the performance of one vGPU is affected by resource contention from others. However, when not all available vGPU slots are occupied, the active ones can fully utilize the power of their physical GPU.

Advantages:

  • Potential ability to fully utilize the compute power of physical GPU, even if not all possible vGPUs have yet been created on that physical GPU.

  • Easier configuration.

Disadvantages:

  • Only a single vGPU type (size of the vGPU) can be created on any given physical GPU. The cloud operator must decide beforehand what type of vGPU each physical GPU will be providing.

  • Less strict resource isolation. Noisy neighbors and unpredictable level of performance for every single guest vGPU.

Multi-Instance GPUs

In Multi-Instance GPUs (MIG) mode, each virtual GPU is allocated dedicated physical GPU engines, exclusively utilized by that specific virtual GPU. Virtual GPUs run in parallel, each on its own engines according to their type.

Advantages:

  • Ability to partition a single physical GPU into various types of virtual GPUs. This approach provides cloud operators with enhanced flexibility in determining the available vGPU types for cloud users. However, the cloud operator has to decide beforehand what types of virtual GPU each physical GPU will be providing and partition each GPU accordingly.

  • Better resource isolation and guaranteed resource access with predictable performance levels for every virtual GPU.

Disadvantages:

  • Under-utilization of physical GPU when not all possible virtual GPU slots are occupied.

  • Comparatively complicated configuration, especially in heterogeneous hardware environments.

Known limitations

Note

Some of these restrictions may be lifted in future releases of MOSK.

Cloud users will face the following limitations when working with GPU virtualization in MOSK:

  • Inability to create several instances with virtual GPUs in one request if there is no physical GPU available that can fit all of them at once. For NVIDIA MIG, this effectively means that you cannot create several instances with virtual GPUs in one request.

  • Inability to create an instance with several virtual GPUs.

  • Inability to attach virtual GPU to or detach virtual GPU from a running instance.

  • Inability to live-migrate instances with virtual GPU attached.

Cloud operators will face the following limitations when configuring GPU virtualization in MOSK:

  • Partitioning of physical GPUs into virtual GPUs is static and not on-demand. You need to decide beforehand what types of virtual GPUs each physical GPU will be partitioned into. Changing the partitioning requires removing all instances using virtual GPUs from the compute node before initiating the repartitioning process.

  • Repartitioning may require additional manual steps to eliminate orphan resource providers in the placement service, and thus, avoid resource over-reporting and instance scheduling problems.

  • Configuration of multiple virtual GPU types per node may be very verbose since configuration depends on particular PCI addresses of physical GPUs on each node.

Graceful instance shutdown

Available since MOSK 24.3

Management of compute node reboots is an important Day 2 operation. Before shutting down a host, guest instances must either be migrated to other compute nodes or gracefully powered off. This ensures the integrity of disk filesystems and prevents damage to running applications.

MOSK provides the capability to automatically power off the instances during the compute node shutdown or reboot through the ACPI power event.

Graceful instance shutdown is managed using the systemd inhibit tool. When the nova-compute service starts, it creates locks. For example:

systemd-inhibit --list

Example system response:

WHO                   UID USER PID   COMM     WHAT     WHY                                    MODE
Nova Shutdown Handler 0   root 28927 python3  shutdown Handle events on shutdown notification delay

The process runs in the nova-compute-inhibit-lock container within the nova-compute pod. It intercepts systemd power event and starts graceful guest shutdown. When all guest instances are powered off, the inhibit lock is released.

To initiate a proper shutdown, use the systemctl poweroff or systemctl reboot command.

Networking service

Mirantis OpenStack for Kubernetes (MOSK) Networking service (OpenStack Neutron) provides cloud applications with Connectivity-as-a-Service enabling instances to communicate with each other and the outside world.

The API provided by the service abstracts all the nuances of implementing a virtual network infrastructure on top of your own physical network infrastructure. The service allows cloud users to create advanced virtual network topologies that may include load balancing, virtual private networking, traffic filtering, and other services.

MOSK Networking service supports Open vSwitch, Open Virtual Network, and Tungsten Fabric SDN technologies as backends.

Backends

MOSK offers various networking backends. Selecting the appropriate backend option for the Networking service is essential for building a robust and efficient cloud networking infrastructure. Whether you choose Open vSwitch (OVS), Open Virtual Network (OVN), or Tungsten Fabric, understanding their features, capabilities, and suitability for your specific use case is crucial for achieving optimal performance and scalability in your OpenStack environment.

Refer to Networking backend configuration for the configuration details.

Capability matrix

Capability

Tungsten Fabric

Open vSwitch (OVS)

Open Virtual Network (OVN)

Logical routers

Static routes

SNAT

Floating IPs

External IPs on VMs

Per-tenant floating networks and SNAT pools

IPv6

Bare Metal as a Service (Ironic)

DNS as a Service

Designate and Tungsten Fabric vDNS

Designate

Designate

Firewalling

Security groups and application policies

OVS firewall

OVS firewall

Load balancing

Tungsten Fabric built-in HAProxy, OpenStack Octavia/Amphora

OpenStack Octavia/Amphora

OpenStack Octavia/Amphora, Octavia/OVN native load balancer

BGP VPNs

TechPreview

VPN as a Service (IPsec)

TechPreview

TechPreview

Data plane acceleration

SR-IOV/DPDK

SR-IOV/DPDK

SR-IOV/DPDK

QoS

Network equipment management

Netconf/OVSDB

Neutron ML2 plugins/networking-generic-switch

Neutron ML2 plugins/networking-generic-switch

East-West traffic encryption

Open vSwitch

Open vSwitch is a production-quality, multilayer virtual switch licensed under the open source Apache 2.0 license. It is designed to enable massive network automation through programmatic extension, while supporting standard management interfaces and protocols.

Open vSwitch is suitable for general-purpose networking requirements in OpenStack deployments. It provides flexibility and scalability for various network topologies.

Key characteristics of Open vSwitch:

  • Depends on RabbitMQ and RPC communication

  • Uses keepalived to set up HA routers

  • Uses namespace and Veth routing to provide its capabilities

  • Locates metadata in router or DHCP namespaces

  • Centralizes the DHCP service, which is running in a separate namespace

Open Virtual Network

Available since MOSK 25.1 as GA (Caracal)

Available since MOSK 24.2 as TechPreview (Antelope)

Open Virtual Network is a solution for Open vSwitch that provides native virtual networking support for Open vSwitch environments. It provides enhanced scalability and performance compared to traditional Open vSwitch deployments.

Key characteristics of Open Virtual Network:

  • Uses the OVSDB protocol for communication

  • Is distributed by design

  • Handles all traffic with OpenFlow

  • Runs metadata on all nodes

  • Provides DHCP through local Open vSwitch instances

Caution

There are numerous limitations related to VLAN/Flat tenant networks in Open Virtual Network with distributed floating IPs for bare metal SR-IOV and Octavia VIP ports. For more information about Open Virtual Network limitations, see relevant upstream documentation.

Tungsten Fabric

Tungsten Fabric is an open-source SDN based on Juniper Contrail. Its design allows for simplified creation and management of virtual networks in cloud environments. Tungsten Fabric supports advanced networking scenarios, such as BGP integration and scalability.

Key characteristics of Tungsten Fabric:

  • Uses highly scalable protocols to set up tunnels, such as BGP/MPLS

  • Provides out-of-the-box BGPaaS/Service chaining capabilities

General configuration

MOSK offers the Networking service as a part of its core setup. You can configure the service through the spec:features:neutron section of the OpenStackDeployment custom resource.

Backend

Parameter

features:neutron:backend

Usage

Defines the networking backend. The list of supported options includes:

  • ml2 for Open vSwitch

  • tungstenfabric for Tungsten Fabric

  • ml2/ovn for Open Virtual Network (available since MOSK 25.1 as GA (Caracal) and since MOSK 24.2 as TechPreview (Antelope))

Refer to Backends to learn more about the networking backends supported by MOSK.

Tunnel interface

Parameter

features:neutron:tunnel_interface

Usage

Defines the name of the NIC device on the actual host that will be used for Neutron.

Mirantis recommends setting up your Kubernetes hosts in such a way that networking is configured identically on all of them, and names of the interfaces serving the same purpose or plugged into the same network are consistent across all physical nodes.

DNS servers

Parameter

features:neutron:dns_servers

Usage

Defines the list of IPs of DNS servers that are accessible from virtual networks. Used as default DNS servers for VMs.
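
The backend, tunnel interface, and DNS server parameters described above can be sketched together as follows. The interface name and the DNS server addresses are hypothetical placeholders, not recommended values:

```yaml
spec:
  features:
    neutron:
      # One of the backend options listed above
      backend: ml2/ovn
      # Hypothetical NIC name; keep it consistent across all physical nodes
      tunnel_interface: ens3
      # Example resolvers; they must be reachable from the virtual networks
      dns_servers:
        - 8.8.8.8
        - 8.8.4.4
```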

External networks

Parameter

features:neutron:external_networks

Usage

Contains the data structure that defines external (provider) networks on top of which the Neutron networking will be created.

Floating IP networks

Parameter

features:neutron:floating_network

Usage

If enabled, must contain the data structure defining the floating IP network that will be created for Neutron to provide external access to your Nova instances.

BGP dynamic routing

Available since MOSK 23.2 TechPreview

The BGP dynamic routing extension to the Networking service (OpenStack Neutron) is particularly useful for the MOSK clouds where private networks managed by cloud users need to be transparently integrated into the networking of the data center.

For example, the BGP dynamic routing is a common requirement for IPv6-enabled environments, where clients need to seamlessly access cloud workloads using dedicated IP addresses with no address translation involved in between the cloud and the external network.

BGP dynamic routing changes the way self-service (private) network prefixes are communicated to BGP-compatible physical network devices, such as routers, present in the data center. It eliminates the traditional reliance on static routes or ICMP-based advertising by enabling the direct passing of private network prefix information to router devices.

Note

To effectively use the BGP dynamic routing feature, Mirantis recommends acquiring good understanding of OpenStack address scopes and how they work.

The components of the OpenStack BGP dynamic routing are:

  • Service plugin

    An extension to the Networking service (OpenStack Neutron) that implements the logic for orchestration of BGP-related entities and provides the cloud user-facing API. A cloud administrator creates and configures a BGP speaker using the CLI or API and manually schedules it to one or more hosts running the agent.

  • Agent

    Manages BGP peering sessions. In MOSK, the BGP agent runs on nodes labeled with openstack-gateway=enabled.

Prefix advertisement depends on the binding of external networks to a BGP speaker and the address scope of external and internal IP address ranges or subnets.

Prefix advertisement

BGP dynamic routing advertises prefixes for self-service networks and host routes for floating IP addresses.

To successfully advertise a self-service network, you need to fulfill the following conditions:

  • External and self-service networks reside in the same address scope.

  • The router contains an interface on the self-service subnet and a gateway on the external network.

  • The BGP speaker associates with the external network that provides a gateway on the router.

  • The BGP speaker has the advertise_tenant_networks attribute set to True.

To successfully advertise a floating IP address, you need to fulfill the following conditions:

  • The router with the floating IP address binding contains a gateway on an external network with the BGP speaker association.

  • The BGP speaker has the advertise_floating_ip_host_routes attribute set to True.

The diagram below is an example of the BGP dynamic routing in the non-DVR mode with self-service networks and the following advertisements:

  • B>* 192.168.0.0/25 [200/0] through 10.11.12.1

  • B>* 192.168.0.128/25 [200/0] through 10.11.12.2

  • B>* 10.11.12.234/32 [200/0] through 10.11.12.1

[Diagram: BGP dynamic routing in the non-DVR mode with self-service networks]

Operation in the Distributed Virtual Router (DVR) mode

For both floating IP and IPv4 fixed IP addresses, the BGP speaker advertises the gateway of the floating IP agent on the corresponding compute node as the next-hop IP address. When using IPv6 fixed IP addresses, the BGP speaker advertises the DVR SNAT node as the next-hop IP address.

The diagram below is an example of the BGP dynamic routing in the DVR mode with self-service networks and the following advertisements:

  • B>* 192.168.0.0/25 [200/0] through 10.11.12.1

  • B>* 192.168.0.128/25 [200/0] through 10.11.12.2

  • B>* 10.11.12.234/32 [200/0] through 10.11.12.12

[Diagram: BGP dynamic routing in the DVR mode with self-service networks]

DVR incompatibility with ARP announcements and VRRP

Due to the known issue #1774459 in the upstream implementation, Mirantis does not recommend using Distributed Virtual Routing (DVR) routers in the same networks as load balancers or other applications that utilize the Virtual Router Redundancy Protocol (VRRP) such as Keepalived. The issue prevents the DVR functionality from working correctly with network protocols that rely on the Address Resolution Protocol (ARP) announcements such as VRRP.

The issue occurs when updating permanent ARP entries for allowed_address_pair IP addresses in DVR routers because DVR performs the ARP table update through the control plane and does not allow any ARP entry to leave the node to prevent the router IP/MAC from contaminating the network.

This results in various network failover mechanisms not functioning in virtual networks that have a distributed virtual router plugged in. For instance, the default backend for MOSK Load Balancing service, represented by OpenStack Octavia with the OpenStack Amphora backend when deployed in the HA mode in a DVR-connected network, is not able to redirect the traffic from a failed active service instance to a standby one without interruption.

Block Storage service

Mirantis OpenStack for Kubernetes (MOSK) provides volume management capability through the Block Storage service (OpenStack Cinder).

Backup configuration

MOSK supports the following backup backends for the Block Storage service (OpenStack Cinder):

Support status of storage backends for Cinder

Backend

Support status

Ceph

Full support, default

NFS

  • TechPreview for Yoga and newer OpenStack releases

  • Available since MOSK 23.2

S3

  • TechPreview for Yoga and newer OpenStack releases

  • Available since MOSK 23.2

In MOSK, Cinder backup is enabled and uses the Ceph backend for Cinder by default. The backup configuration is stored in the spec:features:cinder:backup structure in the OpenStackDeployment custom resource. If necessary, you can disable the backup feature in Cinder as follows:

kind: OpenStackDeployment
spec:
  features:
    cinder:
      backup:
        enabled: false

Using this structure, you can also configure another backup driver supported by MOSK for Cinder as described below. At any given time, only one backend can be enabled.

Configuring an NFS driver

Available since MOSK 23.2 TechPreview

MOSK supports NFS Unix authentication exclusively. To use an NFS driver with MOSK, ensure you have a preconfigured NFS server with an NFS share accessible to a Unix Cinder user. This user must be the owner of the exported NFS folder, and the folder must have the permission value set to 775.

All Cinder services run with the same user by default. To obtain the Unix user ID:

kubectl -n openstack get pod -l application=cinder,component=api -o jsonpath='{.items[0].spec.securityContext.runAsUser}'

Note

The NFS server must be accessible through the network from all OpenStack control plane nodes of the cluster.

To enable the NFS storage for Cinder backup, configure the following structure in the OpenStackDeployment object:

spec:
  features:
    cinder:
      backup:
        drivers:
          <BACKEND_NAME>:
            type: nfs
            enabled: true
            backup_share: <URL_TO_NFS_SHARE>

You can specify the backup_share parameter in the following formats: hostname:path, ipv4addr:path, or [ipv6addr]:path. For example: 1.2.3.4:/cinder_backup.
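
A sketch of the IPv6 form follows. The backend name nfs_backup and the address, taken from the IPv6 documentation range, are hypothetical. Because a YAML value that starts with an opening square bracket is otherwise parsed as a list, the value is quoted:

```yaml
spec:
  features:
    cinder:
      backup:
        drivers:
          # Hypothetical backend name
          nfs_backup:
            type: nfs
            enabled: true
            # Quotes keep the bracketed IPv6 address a plain string
            backup_share: "[2001:db8::10]:/cinder_backup"
```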

Configuring an S3 driver

Available since MOSK 23.2 TechPreview

To use an S3 driver with MOSK, ensure you have a preconfigured S3 storage with a user account created for access.

Note

The S3 storage must be accessible through the network from all OpenStack control plane nodes of the cluster.

To enable the S3 storage for Cinder backup:

  1. Create a dedicated secret in Kubernetes to securely store the credentials required for accessing the S3 storage:

    ---
    apiVersion: v1
    kind: Secret
    metadata:
      labels:
        openstack.lcm.mirantis.com/osdpl_secret: "true"
      name: cinder-backup-s3-hidden
      namespace: openstack
    type: Opaque
    data:
      access_key: <ACCESS_KEY_FOR_S3_ACCOUNT>
      secret_key: <SECRET_KEY_FOR_S3_ACCOUNT>
    
  2. Configure the following structure in the OpenStackDeployment object:

    spec:
      features:
        cinder:
          backup:
            drivers:
              <BACKEND_NAME>:
                type: s3
                enabled: true
                endpoint_url: <URL_TO_S3_STORAGE>
                store_bucket: <S3_BUCKET_NAME>
                store_access_key:
                  value_from:
                    secret_key_ref:
                      key: access_key
                      name: cinder-backup-s3-hidden
                store_secret_key:
                  value_from:
                    secret_key_ref:
                      key: secret_key
                      name: cinder-backup-s3-hidden
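
Kubernetes expects the values under the data field of a Secret, such as the one created in step 1, to be base64-encoded strings. The following minimal sketch shows one way to prepare such values. The encode_secret_data helper and the sample credentials are hypothetical and serve only as an illustration.

```python
import base64


def encode_secret_data(values):
    """Base64-encode each value so that it can be placed under Secret 'data'."""
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }


# Placeholder credentials, not real ones.
encoded = encode_secret_data(
    {"access_key": "EXAMPLEACCESSKEY", "secret_key": "example-secret-key"}
)
for key, value in encoded.items():
    print(f"{key}: {value}")
```

Alternatively, the stringData field of a Kubernetes Secret accepts plain-text values and leaves the encoding to Kubernetes.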
    
Volume encryption

TechPreview

The Block Storage service (OpenStack Cinder) supports volume encryption using a key stored in the Key Manager service (OpenStack Barbican). Such a configuration uses Linux Unified Key Setup (LUKS) to create an encrypted volume type and attach it to the Compute service (OpenStack Nova) instances. Nova retrieves the encryption key from Barbican and stores it on the OpenStack compute node as a libvirt key to encrypt the volume locally or on the backend, and only after that transfers it to Cinder.

Note

  • To create an encrypted volume as a non-admin user, the creator role must be assigned to the user.

  • When planning your cloud, consider that encryption may increase the CPU load.

Volume configuration

The MOSK Block Storage service (OpenStack Cinder) uses Ceph as the default backend for Cinder Volume. Also, MOSK enables its clients to define their own volume backends using the OpenStackDeployment custom resource. This section provides all the details required to properly configure a custom Cinder Volume backend as a StatefulSet or a DaemonSet.

Disabling the Ceph backend for Cinder Volume

MOSK stores the configuration for the default Ceph backend in the spec:features:cinder:volume structure in the OpenStackDeployment custom resource.

To disable the Ceph backend for Cinder Volume, modify the spec:features:cinder:volume structure as follows:

spec:
  features:
    cinder:
      volume:
        enabled: false
  services:
    block-storage:
      cinder:
        values:
          conf:
            DEFAULT:
              default_volume_type: <NEW-DEFAULT-VOLUME-TYPE-NAME>

When disabling the Ceph backend for Cinder Volume, you must explicitly specify the new default_volume_type parameter. Refer to the sections below to learn how you can configure it.

Considerations for configuring a custom Cinder Volume backend

Before you start deploying your custom Cinder Volume backend, decide on key backend parameters and understand how they affect other services:

Note

Make sure to navigate to the documentation for the specific OpenStack version used to deploy your environment when referring to the official OpenStack documentation.

In addition, you may need to build your own Cinder image as described in Customize OpenStack container images.

Next, review the following key considerations:

Considerations for configuring a custom Cinder Volume backend

Configuration option

Details

StatefulSet or DaemonSet

If the Cinder volume backend you prefer must run on all nodes with a specific label and scale automatically as nodes are added or removed, use a DaemonSet. This type of backend typically requires that its data remains on the same node where its pod is running. A common example of such a backend is the LVM backend.

Otherwise, Mirantis recommends using a StatefulSet, which offers more flexibility than a DaemonSet.

Support for Active/Active High Availability

If the driver does not support Active/Active High Availability, ensure that only a single copy of the backend runs and that the cluster parameter is left empty in the cinder.conf file for this backend.

When deploying the backend using a StatefulSet, set pod.replicas.volume to 1 for this backend configuration. Additionally, enable hostNetwork to ensure that the service endpoint’s IP address remains stable when the backend pod restarts.

Support for Multi-Attach

If the driver supports Multi-Attach, it allows multiple connections to the same volume. This capability is important for certain services, such as Glance. If the driver does not support Multi-Attach, the backend cannot be used for services that require this functionality.

Support for iSCSI and access to the /run directory

Some drivers require access to the /run directory on the host system for storing their PID or lock files. Additionally, they may need access to iSCSI and multipath services on the host. To enable this capability, set the conf:enable_iscsi parameter to true. In some cases, you might also need to run the backend container as privileged.

Privileged access for the container

For security reasons, Mirantis recommends running the Cinder Volume backend container with the minimum required privileges. However, if the drivers require privileged access, you can enable it for the StatefulSet by setting the parameter pod:security_context:cinder_volume:container:cinder_volume:privileged.

Access to the host network namespace

If the driver requires access to the host network namespace, or if you need to ensure that the Cinder Volume backend’s IP address remains unchanged after pod recreation or restart, set hostNetwork to true using the following parameters:

  • For a DaemonSet, use pod:useHostNetwork:volume_daemonset. This parameter is set to true by default.

  • For a StatefulSet, use pod:useHostNetwork:volume. Mirantis recommends against using StatefulSets with hostNetwork because it may cause issues: StatefulSet pods are not tied to a specific node, and multiple pods can run on the same node.

Access to the host IPC namespace

If the driver requires access to the host’s IPC namespace, set hostIPC to true using the following parameters:

  • For a DaemonSet, use pod:useHostIPC:volume_daemonset. This parameter is set to true by default.

  • For a StatefulSet, use pod:useHostIPC:volume.

Access to the host PID namespace

If the driver requires access to the host’s PID namespace, set hostPID to true using the following parameters:

  • For a DaemonSet, use pod:useHostPID:volume_daemonset.

  • For a StatefulSet, use pod:useHostPID:volume.

Configuring a custom StatefulSet backend

Available since MOSK 24.3 TechPreview

MOSK enables its clients to define volume backends as a StatefulSet.

To configure a custom StatefulSet backend for the MOSK Block Storage service (OpenStack Cinder), use the spec:features:cinder:volume:backends structure in the OpenStackDeployment custom resource:

spec:
  features:
    cinder:
      volume:
        backends:
          <UNIQUE_BACKEND_NAME>:
            enabled: true
            type: statefulset
            create_volume_type: true
            values:
              conf:
              images:
              labels:
              pod:

The enabled and create_volume_type parameters are optional. With create_volume_type set to true (default), the new backend will be added to the Cinder bootstrap job. Once this job is completed, the volume type for the custom backend will be created in OpenStack.

The supported value for type is statefulset.

The list of keys you can override in the values.yaml file of the Cinder chart includes conf, images, labels, and pod.

When you define the custom backend for the Block Storage service, MOSK deploys individual pods for it. These pods have separate Secrets for configuration files and ConfigMaps for scripts.

Example of configuration of a custom StatefulSet backend for Cinder:

The configuration example deploys a StatefulSet for the Cinder volume backend that uses the NFS driver and runs a single replica on the node labeled kubernetes.io/hostname:service-node. Privilege escalation for the Cinder volume pod is driver-specific.

spec:
  features:
    cinder:
      volume:
        enabled: false
        backends:
          nfs-volume:
            type: statefulset
            values:
              conf:
                cinder:
                  DEFAULT:
                    cluster: ""
                    enabled_backends: volumes-nfs
                  volumes-nfs:
                    nas_host: 1.2.3.4
                    nas_share_path: /cinder_volume
                    nas_secure_file_operations: false
                    nfs_mount_point_base: /tmp/mountpoints
                    nfs_snapshot_support: true
                    volume_backend_name: volumes-nfs
                    volume_driver: cinder.volume.drivers.nfs.NfsDriver
              pod:
                replicas:
                  volume: 1
                security_context:
                  cinder_volume:
                    container:
                      cinder_volume:
                        privileged: true
              labels:
                volume:
                  node_selector_key: kubernetes.io/hostname
                  node_selector_value: service-node
  services:
    block-storage:
      cinder:
        values:
          conf:
            DEFAULT:
              default_volume_type: volumes-nfs

Configuring a custom DaemonSet backend

TechPreview

MOSK enables its clients to define volume backends as a DaemonSet, LVM in particular.

To configure a custom DaemonSet backend for the MOSK Block Storage service (OpenStack Cinder), use the spec:nodes structure in the OpenStackDeployment custom resource:

spec:
  nodes:
    <node label>:
      features:
        cinder:
          volume:
            backends:
              <backend name>:
                lvm:
                  <CINDER-LVM-DRIVER-PARAMETERS>

Example of configuration of a custom DaemonSet backend for Cinder:

The configuration example deploys a DaemonSet for the Cinder volume backend that uses the LVM driver and runs on nodes with the openstack-compute-node=enabled label:

Caution

For data storage, this backend uses the LVM cinder-vol group that must be present on nodes before the new backend is applied. For the procedure on how to deploy an LVM backend, refer to Enable LVM block storage.

spec:
  features:
    cinder:
      volume:
        enabled: false
  nodes:
    openstack-compute-node::enabled:
      features:
        cinder:
          volume:
            backends:
              volumes-lvm:
                lvm:
                  volume_group: "cinder-vol"
  services:
    block-storage:
      cinder:
        values:
          conf:
            DEFAULT:
              default_volume_type: volumes-lvm

Disabling stale volume services cleaning

MOSK provides the cinder-service-cleaner CronJob by default. This CronJob periodically checks whether all Cinder services in OpenStack are up to date and removes any stale ones.

This CronJob is tested only with backends supported by MOSK. If cinder-service-cleaner does not work properly with your custom Cinder volume backend, you can disable it at the OpenStackDeployment service level in the OpenStackDeployment custom resource:

spec:
  services:
    block-storage:
      cinder:
        values:
          manifests:
            cron_service_cleaner: false

Note

Make sure to navigate to the documentation for the specific OpenStack version used to deploy your environment when referring to the official OpenStack documentation.

Identity service

Mirantis OpenStack for Kubernetes (MOSK) provides authentication, service discovery, and distributed multi-tenant authorization through the OpenStack Identity service, aka Keystone.

Federation

MOSK integrates with Mirantis Container Cloud Identity and Access Management (IAM) subsystem to allow centralized management of users and their permissions across multiple clouds.

The core component of Container Cloud IAM is Keycloak, the open-source identity and access management software. Its primary function is to perform secure authentication of cloud users against its built-in or various external identity databases, such as LDAP directories, OpenID Connect or SAML compatible identity providers.

By default, every MOSK cluster is integrated with the Keycloak running in the Container Cloud management cluster. The integration automatically provisions the necessary configuration on the MOSK and Container Cloud IAM sides, such as the os client object in Keycloak. However, for the federated users to get proper permissions after logging in, the cloud operator needs to define the role mapping rules specific to each MOSK environment.

Connecting to Keycloak

MOSK enables you to connect to the Keycloak identity provider through the following structure in the OpenStackDeployment custom resource:

spec:
  features:
    keystone:
      keycloak:
        enabled: true
        url: https://keycloak.it.just.works
        oidc:
          OIDCSSLValidateServer: false
          OIDCOAuthSSLValidateServer: false
          OIDCScope: "openid email profile groups"

Connecting to external identity provider

Available since MOSK 24.3 TechPreview

MOSK enables you to connect an external identity provider to Keystone directly through the following structure in the OpenStackDeployment custom resource:

spec:
  features:
    keystone:
     federations:
       openid:
         enabled: true
         oidc_auth_type: oauth2
         providers:
           keycloak:
             issuer: https://keycloak.it.just.works/auth/realms/iam
             mapping:
             - local:
               - user:
                   email: '{1}'
                   name: '{0}'
               - domain:
                   name: Default
                 groups: '{2}'
               remote:
               - type: OIDC-iam_username
               - type: OIDC-email
               - type: OIDC-iam_roles
             metadata:
               client:
                 client_id: os
               conf:
                 response_type: id_token
                 scope: openid email profile
                 ssl_validate_server: false
               provider:
                 value_from:
                   from_url:
                     url: https://keycloak.it.just.works/auth/realms/iam/.well-known/openid-configuration
           okta:
             description: OKTA provider
             enabled: true
             issuer: https://dev-68495932.okta.com/oauth2/default
             mapping:
             - local:
               - user:
                   email: '{1}'
                   name: '{0}'
               - domain:
                   name: Default
                 groups: m:os@admin
               remote:
               - type: OIDC-name
               - type: OIDC-email
             metadata:
               client:
                 client_id: 0oaixfwyqcAkCbC335d7
                 client_secret: aKOtnqHwu37ricQJfOD9ShECqj7DY7SVHgh8nm1NwlAhGbQjGqREHencsGagyfmQ
               conf: {}
               provider:
                 value_from:
                   from_url:
                     url: https://dev-68495932.okta.com/oauth2/default/.well-known/openid-configuration
             oauth2:
               OAuth2TokenVerify: jwks_uri https://dev-68495932.okta.com/oauth2/default/v1/keys
             token_endpoint: https://dev-68495932.okta.com/oauth2/default/v1/token

The oidc_auth_type parameter specifies the Apache module to use: oauth20 or oauth2. The oauth20 functionality is deprecated and superseded by the new oauth2 module. You can configure two or more identity providers only with the oauth2 module.
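
In the mapping sections of the example above, placeholders such as '{0}' and '{1}' in the local rules refer by position to the attributes listed under remote. The following sketch illustrates only this positional substitution. It is a simplified illustration, not the Keystone mapping engine: the render_local_rules helper and the sample attribute values are hypothetical, and real mappings also handle multi-valued attributes and additional rule keywords.

```python
def render_local_rules(local_rules, remote_values):
    """Substitute '{N}' placeholders with the N-th remote attribute value.

    Hypothetical, simplified illustration of positional mapping.
    """
    def fill(node):
        if isinstance(node, dict):
            return {key: fill(value) for key, value in node.items()}
        if isinstance(node, list):
            return [fill(item) for item in node]
        if isinstance(node, str):
            for index, value in enumerate(remote_values):
                node = node.replace("{%d}" % index, value)
            return node
        return node

    return fill(local_rules)


# Local rules shaped like the keycloak provider mapping in the example.
local = [
    {"user": {"email": "{1}", "name": "{0}"}},
    {"domain": {"name": "Default"}, "groups": "{2}"},
]
# Sample values for OIDC-iam_username, OIDC-email, OIDC-iam_roles (made up).
remote = ["alice", "alice@example.com", "m:os@admin"]

rendered = render_local_rules(local, remote)
print(rendered)
```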

Regions

A region in MOSK represents a complete OpenStack cluster that has a dedicated control plane and set of API endpoints. It is not uncommon for operators of large clouds to offer their users several OpenStack regions, which differ by their geographical location or purpose. In order to easily navigate in a multi-region environment, cloud users need a way to distinguish clusters by their names.

The region_name parameter of an OpenStackDeployment custom resource specifies the name of the region that will be configured in all the OpenStack services comprising the MOSK cluster upon the initial deployment.

Important

Once the cluster is up and running, the cloud operator cannot set or change the name of the region. Therefore, Mirantis recommends selecting a meaningful name for the new region before the deployment starts. For example, the region name can be based on the name of the data center the cluster is located in.

Usage sample:

apiVersion: lcm.mirantis.com/v1alpha1
kind: OpenStackDeployment
metadata:
  name: openstack-cluster
  namespace: openstack
spec:
  region_name: <your-region-name>

Application credentials

Application credentials is a mechanism in the MOSK Identity service that enables application automation tools, such as shell scripts, Terraform modules, Python programs, and others, to securely perform various actions in the cloud API in order to deploy and manage application components.

Application credentials is a modern alternative to the legacy approach where every application owner had to request several technical user accounts to ensure their tools could authenticate in the cloud.

For the details on how to create and authenticate with application credentials, refer to Manage application credentials.

Application credentials must be explicitly enabled for federated users

By default, cloud users logging in to the cloud through the Mirantis Container Cloud IAM or any external identity provider cannot use the application credentials mechanism.

An application credential is heavily tied to the account of the cloud user owning it. An application automation tool that is a consumer of the credential acts on behalf of the human user who created the credential. Each action that the application automation tool performs gets authorized against the permissions, including roles and groups, the user currently has.

The source of truth about a federated user’s permissions is the identity provider. This information gets temporarily transferred to the cloud’s Identity service inside a token once the user authenticates. By default, if such a user creates an application credential and passes it to the automation tool, there is no data to validate the tool’s action on the user’s behalf.

However, a cloud operator can configure the authorization_ttl parameter for an identity provider object to enable caching of its users’ authorization data. The parameter defines how long, in minutes, the information about user permissions is preserved in the database after the user successfully logs in to the cloud.

Warning

Authorization data caching has security implications. If a federated user account is revoked or the user’s permissions change in the identity provider, the cloud Identity service will still allow performing actions on the user’s behalf until the cached data expires or the user re-authenticates in the cloud.

To set authorization_ttl to, for example, 60 minutes for the keycloak identity provider in Keystone:

  1. Log in to the keystone-client Pod:

    kubectl -n openstack exec $(kubectl -n openstack get po -l application=keystone,component=client -oname) -ti -c keystone-client -- bash
    
  2. Inside the Pod, run the following command:

    openstack identity provider set keycloak --authorization-ttl 60
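
The effect of the TTL can be reasoned about with simple time arithmetic. The following minimal sketch assumes that the cached authorization data stays valid for authorization_ttl minutes counted from the last successful login. The cached_authorization_valid helper is hypothetical, and the exact behavior at the expiry boundary is an assumption.

```python
from datetime import datetime, timedelta, timezone


def cached_authorization_valid(last_login, ttl_minutes, now):
    """Return True while the cached authorization data has not expired.

    Hypothetical helper: assumes the TTL is counted from the last login.
    """
    return now < last_login + timedelta(minutes=ttl_minutes)


last_login = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)

# 30 minutes after login with a 60-minute TTL: still valid.
print(cached_authorization_valid(last_login, 60, last_login + timedelta(minutes=30)))
# 90 minutes after login: expired, the user must log in again.
print(cached_authorization_valid(last_login, 60, last_login + timedelta(minutes=90)))
```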
    
Domain-specific configuration

Parameter

features:keystone:domain_specific_configuration

Usage

Defines the domain-specific configuration and is useful for integration with LDAP. The following OsDpl example with LDAP integration creates a separate domain.with.ldap domain and configures it to use LDAP as an identity driver:

spec:
  features:
    keystone:
      domain_specific_configuration:
        enabled: true
        domains:
          domain.with.ldap:
            enabled: true
            config:
              assignment:
                driver: keystone.assignment.backends.sql.Assignment
              identity:
                driver: ldap
              ldap:
                chase_referrals: false
                group_desc_attribute: description
                group_id_attribute: cn
                group_member_attribute: member
                group_name_attribute: ou
                group_objectclass: groupOfNames
                page_size: 0
                password: XXXXXXXXX
                query_scope: sub
                suffix: dc=mydomain,dc=com
                url: ldap://ldap01.mydomain.com,ldap://ldap02.mydomain.com
                user: uid=openstack,ou=people,o=mydomain,dc=com
                user_enabled_attribute: enabled
                user_enabled_default: false
                user_enabled_invert: true
                user_enabled_mask: 0
                user_id_attribute: uid
                user_mail_attribute: mail
                user_name_attribute: uid
                user_objectclass: inetOrgPerson

Image service

Mirantis OpenStack for Kubernetes (MOSK) provides the image management capability through the OpenStack Image service, aka Glance.

The Image service enables you to discover, register, and retrieve virtual machine images. Using the Glance API, you can query virtual machine image metadata and retrieve actual images.

MOSK deployment profiles include the Image service in the core set of services. You can configure the Image service through the spec:features definition in the OpenStackDeployment custom resource.

Image signature verification

TechPreview

MOSK can automatically verify the cryptographic signatures associated with images to ensure the integrity of their data. A signed image has a few additional properties set in its metadata that include img_signature, img_signature_hash_method, img_signature_key_type, and img_signature_certificate_uuid. You can find more information about these properties and their values in the upstream OpenStack documentation.

MOSK performs image signature verification during the following operations:

  • A cloud user or a service creates an image in the store and starts to upload its data. If the signature metadata properties are set on the image, its content gets verified against the signature. The Image service accepts non-signed image uploads.

  • A cloud user spawns a new instance from an image. The Compute service ensures that the data it downloads from the image storage matches the image signature. If the signature is missing or does not match the data, the operation fails. Limitations apply, see Known limitations.

  • A cloud user boots an instance from a volume, or creates a new volume from an image. If the image is signed, the Block Storage service compares the downloaded image data against the signature. If there is a mismatch, the operation fails. The service will accept a non-signed image as a source for a volume. Limitations apply, see Known limitations.

Configuration example:

spec:
  features:
    glance:
      signature:
        enabled: true

Signing pre-built images

Every MOSK cloud is pre-provisioned with a baseline set of images containing the most popular operating systems, such as Ubuntu, Fedora, and CirrOS.

In addition, a few services in MOSK rely on the creation of service instances to provide their functions, namely the Load Balancer service and the Bare Metal service, and require corresponding images to exist in the image store.

When image signature verification is enabled during the cloud deployment, all these images get automatically signed with a pre-generated self-signed certificate. Enabling the feature in an already existing cloud requires manual signing of all of the images stored in it. Consult the OpenStack documentation for an example of the image signing procedure.

Supported storage backends

The image signature verification is supported for LVM and local backends for ephemeral storage.

The functionality is not compatible with Ceph-backed ephemeral storage combined with RAW formatted images. The Ceph copy-on-write mechanism enables the user to create instance virtual disks without downloading the image to a compute node because the data is handled completely on the side of the Ceph cluster. This enables you to spin up instances almost instantly but makes it impossible to verify the image data before creating an instance from it.

Known limitations

  • The Image service does not enforce the presence of a signature in the metadata when the user creates a new image. The service accepts non-signed image uploads.

  • The Image service does not verify the correctness of an image signature upon update of the image metadata.

  • MOSK does not validate whether the certificate used to sign an image is trusted; it only ensures the correctness of the signature itself. Cloud users are allowed to use self-signed certificates.

  • The Compute service does not verify image signature for Ceph backend when the RAW image format is used as described in Supported storage backends.

  • The Compute service does not verify image signature if the image is already cached on the target compute node.

  • The Instance HA service may experience issues when auto-evacuating instances created from signed images if it does not have access to the corresponding secrets in the Key Manager service.

  • The Block Storage service does not perform image signature verification when a Ceph backend is used and the images are in the RAW format.

  • The Block Storage service does not enforce the presence of a signature on the images.

Object Storage service

Ceph Object Gateway provides Object Storage (Swift) API for end users in MOSK deployments. For the API compatibility, refer to Ceph Documentation: Ceph Object Gateway Swift API.

Object storage enablement

Parameter

features:services:object-storage

Usage

Enables the object storage and provides a RADOS Gateway Swift API that is compatible with the OpenStack Swift API.

To enable the service, add object-storage to the service list:

spec:
  features:
    services:
    - object-storage

To create the RADOS Gateway pool in Ceph, see Operations Guide: Enable Ceph RGW Object Storage.

Object storage server-side encryption

TechPreview

Ceph Object Gateway also provides an Amazon S3 compatible API. For details, see Ceph Documentation: Ceph Object Gateway S3 API. Using integration with the OpenStack Key Manager service (Barbican), the objects uploaded through the S3 API can be encrypted by Ceph Object Gateway according to the AWS Documentation: Protecting data using server-side encryption with customer-provided encryption keys (SSE-C) specification.

Instead of Swift, such configuration uses an S3 client to upload server-side encrypted objects. Using server-side encryption, the data is sent over a secure HTTPS connection in an unencrypted form and the Ceph Object Gateway stores that data in the Ceph cluster in an encrypted form.

Dashboard

MOSK Dashboard (OpenStack Horizon) provides a web-based interface for users to access the functions of the cloud services.

Custom theme

Parameter

features:horizon:themes

Usage

Defines the list of custom OpenStack Dashboard themes. Content of the archive file with a theme depends on the level of customization and can include static files, Django templates, and other artifacts. For the details, refer to OpenStack official documentation: Customizing Horizon Themes.

spec:
  features:
    horizon:
      themes:
        - name: theme_name
          description: The brand new theme
          url: https://<path to .tgz file with the contents of custom theme>
          sha256summ: <SHA256 checksum of the archive above>
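
The checksum for the sha256summ field can be computed with any SHA256 tool. The following is a minimal sketch using the Python standard library. The sha256_of_file helper and the file name in the usage line are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex-encoded SHA256 checksum of the file at the given path."""
    digest = hashlib.sha256()
    with open(path, "rb") as archive:
        # Read in chunks so that large archives do not need to fit in memory.
        for chunk in iter(lambda: archive.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, sha256_of_file("custom-theme.tgz") returns the value to place in sha256summ for an archive with that hypothetical name.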

Message of the Day (MOTD)

Available since MOSK 25.1

MOSK enables a cloud operator to configure Message of the Day (MOTD) for the MOSK Dashboard (OpenStack Horizon). These short messages inform users about current infrastructure issues, upcoming maintenance, and other events, helping them plan their work with minimal service disruption.

Cloud operators can configure messages to appear before or after users log in to Horizon, or both. Messages can also be visually distinguished based on severity and support minimal HTML formatting, including links.

To define the MOTD, populate the following structure in the OpenStackDeployment custom resource:

spec:
  features:
    horizon:
      motd:
        <NAME>:
          level: <LEVEL>
          message: <MESSAGE>
          afterLogin: <true|false>
          beforeLogin: <true|false>

Parameters:

  • <NAME>: A unique symbolic name to distinguish messages.

  • level: The severity level of the message. Supported values: success, info, warning, and error.

  • message: The text of the message. Supports minimal HTML formatting, including links.

  • beforeLogin: Boolean. If true, the message appears on the login page for unauthorized users. Default: false.

  • afterLogin: Boolean. If true, the message appears after users log in. Default: true.
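
The parameter rules above can be expressed as a small validation routine. The following minimal sketch is not part of MOSK: the normalize_motd helper is hypothetical, and treating level and message as required keys is an assumption. It applies the documented defaults for beforeLogin and afterLogin.

```python
SUPPORTED_LEVELS = {"success", "info", "warning", "error"}


def normalize_motd(motd):
    """Validate MOTD entries and fill in the documented defaults.

    Hypothetical helper: assumes 'level' and 'message' are required.
    """
    normalized = {}
    for name, entry in motd.items():
        level = entry.get("level")
        if level not in SUPPORTED_LEVELS:
            raise ValueError(f"{name}: unsupported level {level!r}")
        normalized[name] = {
            "level": level,
            "message": entry["message"],
            # Documented defaults: hidden before login, shown after login.
            "beforeLogin": entry.get("beforeLogin", False),
            "afterLogin": entry.get("afterLogin", True),
        }
    return normalized


sample = {"maintenance": {"level": "warning", "message": "Planned maintenance tomorrow"}}
result = normalize_motd(sample)
print(result)
```

An entry that sets only level and message is therefore displayed after login only.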

Configuration example:

spec:
  features:
    horizon:
      motd:
        errorBefore:
          level: error
          message: "We are experiencing <b>issues</b> with the authentication provider<br>Check the status at the <a href='https://foo.bar'>status page</a>"
          afterLogin: false
          beforeLogin: true
        warnAfter:
          level: warning
          message: "Planned maintenance tomorrow"

The above configuration results in the following two messages displayed for all cloud users:

  • Login page:

    _images/os-horizon-motd-before-login.png
  • After logging in:

    _images/os-horizon-motd-after-login.png

Auxiliary cloud services:

Bare Metal service

The Bare Metal service (Ironic) is an extra OpenStack service that can be deployed by the OpenStack Controller (Rockoon). This section provides the baremetal-specific configuration options of the OpenStackDeployment resource.

Enabling the Bare Metal service

The Bare Metal service is not included in the core set of services and needs to be explicitly enabled in the OpenStackDeployment custom resource.

To install bare metal services, add the baremetal keyword to the spec:features:services list:

spec:
  features:
    services:
      - baremetal

Note

All bare metal services are scheduled to the nodes with the openstack-control-plane: enabled label.

Ironic agent deployment images

To provision a user image onto a bare metal server, Ironic boots a node with a ramdisk image. Depending on the node’s deploy interface and hardware, the ramdisk may require different drivers (agents). MOSK provides tinyIPA-based ramdisk images and uses the direct deploy interface with the ipmitool power interface.

Example of agent_images configuration:

spec:
  features:
    ironic:
       agent_images:
         base_url: https://binary.mirantis.com/openstack/bin/ironic/tinyipa
         initramfs: tinyipa-stable-ussuri-20200617101427.gz
         kernel: tinyipa-stable-ussuri-20200617101427.vmlinuz

Since the hardware of bare metal nodes may require additional drivers, you may need to build a deploy ramdisk for particular hardware. For more information, see Ironic Python Agent Builder. Be sure to create a ramdisk image with the version of Ironic Python Agent appropriate for your OpenStack release.

Bare metal networking

Ironic supports the flat and multitenancy networking modes.

The flat networking mode assumes that all bare metal nodes are pre-connected to a single network that cannot be changed during the virtual machine provisioning. This network, with bridged interfaces for Ironic, should be spread across all nodes, including compute nodes, so that regular virtual machines can be plugged into the Ironic network. In turn, the interface defined as provisioning_interface should be spread across gateway nodes. The cloud operator can perform all this underlying configuration through the L2 templates.

Example of the OsDpl resource illustrating the configuration for the flat network mode:

spec:
  features:
    services:
      - baremetal
    neutron:
      external_networks:
        - bridge: ironic-pxe
          interface: <baremetal-interface>
          network_types:
            - flat
          physnet: ironic
          vlan_ranges: null
    ironic:
      # The name of neutron network used for provisioning/cleaning.
      baremetal_network_name: ironic-provisioning
      networks:
        # Neutron baremetal network definition.
        baremetal:
          physnet: ironic
          name: ironic-provisioning
          network_type: flat
          external: true
          shared: true
          subnets:
            - name: baremetal-subnet
              range: 10.13.0.0/24
              pool_start: 10.13.0.100
              pool_end: 10.13.0.254
              gateway: 10.13.0.11
      # The name of interface where provision services like tftp and ironic-conductor
      # are bound.
      provisioning_interface: br-baremetal

The multitenancy network mode uses the neutron Ironic network interface to share physical connection information with Neutron. This information is handled by Neutron ML2 drivers when plugging a Neutron port to a specific network. MOSK supports the networking-generic-switch Neutron ML2 driver out of the box.

Example of the OsDpl resource illustrating the configuration for the multitenancy network mode:

spec:
  features:
    services:
      - baremetal
    neutron:
      tunnel_interface: ens3
      external_networks:
        - physnet: physnet1
          interface: <physnet1-interface>
          bridge: br-ex
          network_types:
            - flat
          vlan_ranges: null
          mtu: null
        - physnet: ironic
          interface: <physnet-ironic-interface>
          bridge: ironic-pxe
          network_types:
            - vlan
          vlan_ranges: 1000:1099
    ironic:
      # The name of interface where provision services like tftp and ironic-conductor
      # are bound.
      provisioning_interface: <baremetal-interface>
      baremetal_network_name: ironic-provisioning
      networks:
        baremetal:
          physnet: ironic
          name: ironic-provisioning
          network_type: vlan
          segmentation_id: 1000
          external: true
          shared: false
          subnets:
            - name: baremetal-subnet
              range: 10.13.0.0/24
              pool_start: 10.13.0.100
              pool_end: 10.13.0.254
              gateway: 10.13.0.11
DNS service

Mirantis OpenStack for Kubernetes (MOSK) provides DNS records managing capability through the DNS service (OpenStack Designate).

LoadBalancer type for PowerDNS

The supported backend for Designate is PowerDNS. If required, you can specify an external IP address and the protocol (UDP, TCP, or TCP + UDP) of the Kubernetes LoadBalancer service for PowerDNS.

To configure LoadBalancer for PowerDNS, use the spec:features:designate definition in the OpenStackDeployment custom resource.

The list of supported options includes:

  • external_ip - Optional. An IP address for the LoadBalancer service. If not defined, LoadBalancer allocates the IP address.

  • protocol - A protocol for the Designate backend in Kubernetes. Can only be udp, tcp, or tcp+udp.

  • type - The type of the backend for Designate. Can only be powerdns.

For example:

spec:
  features:
    designate:
      backend:
        external_ip: 10.172.1.101
        protocol: udp
        type: powerdns
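
The constraints on these options can be expressed as a small pre-flight check. The following Python sketch is illustrative only: validate_designate_backend is a hypothetical helper that is not part of MOSK, and treating protocol and type as required keys is an assumption. It mirrors only the rules stated in this section.

```python
import ipaddress

# Allowed values as listed in this section.
ALLOWED_PROTOCOLS = {"udp", "tcp", "tcp+udp"}
ALLOWED_TYPES = {"powerdns"}


def validate_designate_backend(backend):
    """Return a list of problems found in a designate backend mapping.

    Hypothetical helper for illustration; not part of MOSK.
    """
    problems = []
    if backend.get("type") not in ALLOWED_TYPES:
        problems.append("type must be powerdns")
    if backend.get("protocol") not in ALLOWED_PROTOCOLS:
        problems.append("protocol must be udp, tcp, or tcp+udp")
    external_ip = backend.get("external_ip")
    if external_ip is not None:  # external_ip is optional
        try:
            ipaddress.ip_address(external_ip)
        except ValueError:
            problems.append("external_ip is not a valid IP address")
    return problems
```

An empty list means that the mapping satisfies the rules above. The product performs its own validation, which this sketch does not replace.
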
DNS service known limitations
Inability to set up a secondary DNS zone

Lifted in 23.1

Due to an issue in the dnspython library, full zone transfer (AXFR) requests do not work, which makes it impossible to set up a secondary DNS zone.

The issue affects OpenStack Victoria and is fixed in the Yoga release.

Key Manager service

MOSK Key Manager service (OpenStack Barbican) provides secure storage, provisioning, and management of cloud application secret data, such as symmetric keys, asymmetric keys, certificates, and raw binary data.

Configuring the Vault backend

Parameter

features:barbican:backends:vault

Usage

Specifies the object containing the parameters that Barbican uses to connect to Vault.

The list of supported options includes:

  • enabled - boolean parameter indicating that the Vault backend is enabled

  • approle_role_id - Vault app role ID

  • approle_secret_id - secret ID created for the app role

  • vault_url - URL of the Vault server

  • use_ssl - enables the SSL encryption. Since MOSK does not currently support the Vault SSL encryption, the use_ssl parameter should be set to false

  • kv_mountpoint TechPreview - optional, specifies the mountpoint of a Key-Value store in Vault to use

  • namespace TechPreview - optional, specifies the Vault namespace to use with all requests to Vault

    Note

    The Vault namespaces feature is available only in Vault Enterprise.

    Note

    Vault namespaces are supported only starting from the OpenStack Victoria release.

If the Vault backend is used, configure it properly using the following parameters:

spec:
  features:
    barbican:
      backends:
        vault:
          enabled: true
          approle_role_id: <APPROLE_ROLE_ID>
          approle_secret_id: <APPROLE_SECRET_ID>
          vault_url: <VAULT_SERVER_URL>
          use_ssl: false

Mirantis recommends hiding the approle_role_id and approle_secret_id keys as described in Hiding sensitive information.

Note

Since MOSK does not currently support the Vault SSL encryption, set the use_ssl parameter to false.

Instance High Availability service

TechPreview

Instance High Availability service (OpenStack Masakari) enables cloud users to ensure that their instances get automatically evacuated from a failed hypervisor.

The service consists of the following components:

  • API receives requests from users and events from monitors, and sends them to the engine

  • Engine executes recovery workflow

  • Monitors detect failures and notify the API. MOSK uses monitors of the following types:

    • Instance monitor checks the liveness of instance processes

    • Introspective instance monitor enhances instance high availability within OpenStack environments by monitoring and identifying system-level failures through the QEMU Guest Agent

    • Host monitor checks the liveness of a compute host and runs as part of the Node controller from the OpenStack Controller (Rockoon)

    Note

    The Processes monitor is not present in MOSK because HA for the compute processes is handled by Kubernetes.

This section describes how to enable various components of the Instance High Availability service for your MOSK deployment.

Enabling the Instance HA service

The Instance HA service is not included in the core set of services and needs to be explicitly enabled in the OpenStackDeployment custom resource.

Parameter

features:services:instance-ha

Usage

Enables Masakari, the OpenStack service that ensures high availability of instances running on a host. To enable the service, add instance-ha to the service list:

spec:
  features:
    services:
    - instance-ha
Enabling introspective instance monitor

Available since MOSK 25.1 TechPreview

The introspective instance monitor in the Instance High Availability service enhances the reliability of the cloud environment by monitoring virtual machines for failure events, including operating system crashes, kernel panics, and unresponsive states. Upon detecting such events in real time, the monitor initiates automated recovery actions, such as rebooting the affected instance. This allows for reduced downtime and maintains high availability of an OpenStack environment.

As a cloud operator, you can enable and configure the instance introspection through the spec:features:masakari:monitors:introspective definition in the OpenStackDeployment custom resource. The list of supported options includes:

  • enabled (boolean)

    Enables or disables the introspection monitor. Default: false.

  • guest_monitoring_interval (integer)

    Defines the time interval (in seconds) for monitoring the status of the guest virtual machine. Default: 10.

  • guest_monitoring_timeout (integer)

    Sets the timeout (in seconds) for detecting a non-responsive guest VM before marking it as failed. Default: 2.

  • guest_monitoring_failure_threshold (integer)

    Defines the number of consecutive failures required before a notification is sent or recovery action is initiated. Default: 3.

Example configuration:

spec:
  features:
    masakari:
      monitors:
        introspective:
          enabled: true
          guest_monitoring_interval: 10
          guest_monitoring_timeout: 2
          guest_monitoring_failure_threshold: 3
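
To illustrate how guest_monitoring_failure_threshold could work, here is a minimal Python sketch. It is not Masakari code: the FailureCounter class, the checks_until_trigger helper, and the assumption that a successful check resets the counter are all hypothetical and serve only to show what "consecutive failures" means.

```python
class FailureCounter:
    """Counts consecutive failed guest checks (illustrative, not Masakari code)."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, guest_responded):
        """Record one check result; return True when the threshold is reached."""
        if guest_responded:
            # Assumption: any successful check resets the counter.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def checks_until_trigger(results, failure_threshold=3):
    """Return the 1-based index of the check that reaches the threshold, or None."""
    counter = FailureCounter(failure_threshold)
    for index, responded in enumerate(results, start=1):
        if counter.record(responded):
            return index
    return None
```

Under these assumptions, a guest that answers once and then stops responding reaches the default threshold on the fourth check, while isolated failures separated by a successful check never do.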

The introspective instance monitor relies on the QEMU Guest Agent being installed within the guest virtual machine. This agent enables communication between the host and guest operating systems, ensuring precise monitoring of the virtual machine health. Without the QEMU Guest Agent, the introspection monitor cannot accurately assess the state of the virtual machine, which may prevent the initiation of necessary recovery actions. To start monitoring, refer to Configure the introspective instance monitor.

Shared Filesystems service

Available since MOSK 24.3

MOSK Shared Filesystems service (OpenStack Manila) provides Shared Filesystems as a service. The Shared Filesystems service enables you to create and manage shared filesystems in your multi-project cloud environments.

Note

MOSK does not support the Shared Filesystems service for the clusters with Tungsten Fabric as a networking backend.

Service architecture

The Shared Filesystems service (OpenStack Manila) consists of the manila-api, manila-scheduler, and manila-share services. All these services communicate with each other through the AMQP protocol and store their data in the MySQL database:

  • manila-api

    Provides a stable RESTful API, authenticates and routes requests throughout the Shared Filesystems service

  • manila-scheduler

    Responsible for scheduling and routing requests to the appropriate manila-share service by determining which backend should serve as the destination for a share creation request

  • manila-share

    Responsible for managing Shared Filesystems service devices, specifically the backend ones

The diagram below illustrates how the Shared Filesystems service components communicate with each other.

Untitled Diagram
Shared Filesystems drivers

MOSK ensures support for different kinds of equipment and shared filesystems by means of special drivers that are part of the manila-share service. These drivers also determine the ability to restrict access to data stored on a shared filesystem, the list of operations with Manila volumes, and the types of connections to the client network.

Driver Handles Share Servers (DHSS) is one of the main parameters that define the Manila workflow, including the way the Manila driver makes clients access shared filesystems. Some drivers support only one DHSS mode, for example, the LVM share driver. Others support both modes, for example, the Generic driver. If DHSS is set to False in the driver configuration, the driver does not prepare the share server that provides access to the shared filesystems, and the administrator must perform the server and network setup. In this case, the Shared Filesystems service only manages the server in its own configuration.

Untitled Diagram

If the driver configuration includes DHSS=True, the driver creates a service virtual machine that provides access to shared filesystems. Also, when DHSS=True, the Shared Filesystems service performs a network setup to provide clients with access to the created service virtual machine. For working with the service virtual machine, the Shared Filesystems service requires a separate service network that must be included in the driver’s configuration as well.

The following are descriptions of drivers supported by the MOSK Shared Filesystems service.

Generic driver

The generic driver is an example of the DHSS=True case. There are two network topologies for connecting the client’s network to the service virtual machine, which depend on the connect_share_server_to_tenant_network parameter. If the connect_share_server_to_tenant_network parameter is set to False, which is the default, the client must create a shared network connected to a public router. IP addresses from this network will be granted access to the created shared filesystem. The Shared Filesystems service creates a subnet in its service network to which the network port of the new service virtual machine and the network port of the client’s router are connected. When a new shared filesystem is created, the client’s machine is granted access to it through the router.

Untitled Diagram

If the connect_share_server_to_tenant_network parameter is set to True, the Shared Filesystems service creates the service virtual machines with two network interfaces. One of them is connected to the service network while the other one is connected to the client’s network.

Untitled Diagram
CephFS driver

Available since MOSK 25.1 TechPreview

The CephFS driver is a DHSS=False driver. The CephFS driver can be configured to use the Ceph protocol to provide shares. However, MOSK does not support the NFS Ganesha protocol.

The main advantages of using a direct connection to CephFS through the Ceph protocol over using the NFS protocol include:

  • Simplified setup

    No third-party services are required between the client and CephFS, whereas an NFS layer can introduce an additional point of failure.

  • No additional load balancing

    Making NFS highly available requires setting up additional load balancers, which is unnecessary with direct CephFS access.

  • Enhanced access control

    CephFS shares can be restricted using cephx authentication, whereas NFS only allows access restrictions based on IP addresses.

For the CephFS driver to function, the manila-share service must have access to the Storage Access network. To mount created shares, the client must have access to the Storage Access network, the share URL, and credentials. The URLs and credentials for created shares are exposed to clients through the Manila API.

Note

Due to the existing limitation for Ceph clusters, Ceph Monitor services are only accessible on the MOSK LCM network. Therefore, both the manila-share service and clients require access to the MOSK LCM network. By default, manila-share already has access to this network. However, to enable access for external clients, for example, client VMs, routing must be configured between the client VM and the MOSK LCM network.

Untitled Diagram

The risks of direct connection of client VMs to the Storage Access Network include:

  • A malicious host on the same network may attempt to attack or scan other clients or the Ceph cluster

  • A malicious host may intercept and manipulate communication, acting on behalf of a valid client or Ceph cluster (a man-in-the-middle attack)

The following measures can help reduce these risks:

  • Ensure that port security is enabled on client VM ports connected to Ceph networks, which is enabled by default on OpenStack networks

  • Ensure that the Ceph cluster and client use the msgr2 protocol with CRC and secure modes enabled, which are enabled by default for MOSK deployments

  • Configure OpenStack security groups for client VM ports to allow traffic only from trusted hosts

Enabling Shared Filesystems service

The Shared Filesystems service is not included in the core set of services and needs to be explicitly enabled in the OpenStackDeployment custom resource.

To install the OpenStack Manila services, add the shared-file-system keyword to the spec:features:services list:

spec:
  features:
    services:
      - shared-file-system

The above configuration installs the Shared Filesystems service with the generic driver configured.

Enabling CephFS driver for Shared Filesystems service

Available since MOSK 25.1 TechPreview

Caution

MOSK does not support enabling both the generic driver and CephFS driver in the same environment. If the CephFS driver is enabled in an environment where the generic driver was previously enabled, the CephFS driver will replace the generic one.

The CephFS driver is not enabled by default in the Shared Filesystems service. To enable the CephFS driver:

  1. Verify that CephFS is enabled in the Ceph cluster as described in Configure Ceph Shared File System (CephFS).

  2. Add the following configuration to the OpenStackDeployment object:

    spec:
      features:
        manila:
          share:
            backends:
              cephfs:
                type: statefulset
                values:
                  conf:
                    manila:
                      DEFAULT:
                        enabled_share_backends: cephfs
                      cephfs:
                        share_backend_name: cephfs
                        share_driver: manila.share.drivers.cephfs.driver.CephFSDriver
    
Dynamic Resource Balancer service

Available since MOSK 24.2 TechPreview

In a cloud environment where resources are shared across all workloads, those resources often become a point of contention.

For example, it is not uncommon for an oversubscribed compute node to experience the noisy neighbor problem, when one of the instances starts consuming a lot more resources than usual, negatively affecting the performance of other instances running on the same node.

In such cases, an intervention is required from the cloud operators to manually re-distribute workloads in the cluster to achieve more equal utilization of resources.

The Dynamic Resource Balancer (DRB) service continuously measures resource usage on hypervisors and redistributes workloads to achieve some optimum target, thereby eliminating the need for manual interventions from cloud operators.

Architecture overview

The DRB service is implemented as a Kubernetes operator, controlled by the custom resource of kind: DRBConfig. Unless at least one resource of such kind is present, the service does not perform any operations. Cloud operators who want to enable the DRB service for their MOSK clouds need to create the resource with proper configuration.

The DRB controller consists of the following components interacting with each other:

  • collector

    Collects the statistics of resource consumption in the cluster

  • scheduler

    Based on the data from the collector, makes decisions whether cloud resources need to be relocated to achieve the optimum

  • actuator

    Executes the resource relocation decisions made by scheduler

Out of the box, these service components implement a very simple logic, which, however, can be individually enhanced according to the needs of a specific cloud environment by utilizing their pluggable architecture. The plugins need to be written in the Python programming language and injected as modules into the DRB service by building a custom drb-controller container image. Default plugins as well as custom plugins are configured through the corresponding sections of DRBConfig custom resources.

Also, it is possible to limit the scope of DRB decisions and actions to only a subset of hosts. This way, you can model the node grouping schema that is configured in OpenStack, for example, compute node aggregates and availability zones, to prevent the DRB service from attempting resource placement changes that cannot be fulfilled by the MOSK Compute service (OpenStack Nova).

Example configuration
apiVersion: lcm.mirantis.com/v1alpha1
kind: DRBConfig
metadata:
  name: drb-test
  namespace: openstack
spec:
  actuator:
    max_parallel_migrations: 10
    migration_polling_interval: 5
    migration_timeout: 180
    name: os-live-migration
  collector:
    name: stacklight
  hosts: []
  migrateAny: false
  reconcileInterval: 300
  scheduler:
    load_threshold: 80
    min_improvement: 0
    name: vm-optimize

The spec section of the configuration consists of the following main parts:

  • collector

    Specifies and configures the collector plugin to collect the metrics on which decisions are based. At a minimum, the name of the plugin must be provided.

  • scheduler

    Specifies and configures the scheduler plugin that will make decisions based on the collected metrics. At a minimum, the name of the plugin must be provided.

  • actuator

    Specifies and configures the actuator plugin that will move resources around. At a minimum, the name of the plugin must be provided.

  • reconcileInterval

    Defines time in seconds between reconciliation cycles. Should be large enough for the metrics to settle after resources are moved around.

    For the default stacklight collector plugin, this value must equal at least 300.

  • hosts

    Specifies the list of cluster hosts to which this given instance of DRBConfig applies. This means that only metrics from these hosts will be used for making decisions, only resources belonging to these hosts will be considered for re-distribution, and only these hosts will be considered as possible targets for re-distribution.

    You can create multiple DRBConfig resources that watch over non-overlapping sets of hosts.

    The default value of this setting is an empty list, which implies all hosts.

  • migrateAny

    A boolean flag that the scheduler plugin can consider when making decisions, allowing cloud operators and users to opt certain workloads in or out of redistribution.

    For the default vm-optimize scheduler plugin:

    • migrateAny: true (default) - any instance can be migrated, except for instances tagged with lcm.mirantis.com:no-drb, explicitly opting out of the DRB functionality

    • migrateAny: false - only instances tagged with lcm.mirantis.com:drb are migrated by the DRB service, explicitly opting in to the DRB functionality
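
The hosts and migrateAny rules above can be summarized in a few lines of Python. This is a sketch of the documented behavior, not the DRB implementation: host_in_scope and instance_eligible are hypothetical names, and instance tags are modeled as a plain set of strings.

```python
# Tag names as documented for the default vm-optimize scheduler plugin.
OPT_OUT_TAG = "lcm.mirantis.com:no-drb"
OPT_IN_TAG = "lcm.mirantis.com:drb"


def host_in_scope(host, hosts):
    """Return True if the host is covered; an empty hosts list implies all hosts."""
    return not hosts or host in hosts


def instance_eligible(tags, migrate_any=True):
    """Apply the opt-out (migrateAny: true) or opt-in (migrateAny: false) rule.

    The behavior for an instance carrying both tags is not documented;
    this sketch simply checks the tag relevant to the selected mode.
    """
    if migrate_any:
        return OPT_OUT_TAG not in tags
    return OPT_IN_TAG in tags
```

For example, with migrateAny: false an untagged instance is left alone, while an instance tagged with lcm.mirantis.com:drb becomes a migration candidate.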

Included default plugins
Collector plugins
stacklight

Collects node_load5, machine_cpu_cores, and libvirt_domain_info_cpu_time_seconds:rate5m metrics from the StackLight service running in the MOSK cluster.

Does not have options available.

Requires the reconcileInterval set to at least 300 (5 minutes), as both the collected node and instance CPU usage metrics are effectively averaged over a 5-minute sliding window.

Scheduler plugins
vm-optimize

Attempts to minimize the standard deviation of node load. The node load is normalized per CPU core, so heterogeneous compute hosts can be compared.

Available options:

  • load_threshold

    The value in percent of the compute host load after which the host will be considered overloaded and attempts will be made to migrate instances from it. Defaults to 80.

  • min_improvement

    Minimal improvement of the optimization metric in percent. While making decisions, the scheduler attempts to predict the resulting load distribution to determine if moving resources is beneficial. If the total improvement after all necessary decisions is calculated to be less than min_improvement, no decisions will be executed.

    Defaults to 0, which means that any potential improvement is acted upon. Setting this to a higher value should allow avoiding instance migrations that provide negligible improvements.
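
The optimization metric described above can be sketched in Python as follows. This is an approximation under stated assumptions, not the plugin source: the input format (a load value and a core count per host), the use of the population standard deviation, and the definition of improvement as the relative reduction of that deviation are all assumptions, and the function names are hypothetical.

```python
from statistics import pstdev


def normalized_loads(hosts):
    """Map host name to load per CPU core, in percent."""
    return {name: 100.0 * h["load"] / h["cores"] for name, h in hosts.items()}


def overloaded(hosts, load_threshold=80):
    """Return the sorted names of hosts whose normalized load exceeds the threshold."""
    loads = normalized_loads(hosts)
    return sorted(name for name, value in loads.items() if value > load_threshold)


def improvement_percent(before, after):
    """Relative reduction of the standard deviation of normalized load, in percent."""
    old = pstdev(list(normalized_loads(before).values()))
    new = pstdev(list(normalized_loads(after).values()))
    if old == 0:
        return 0.0  # already perfectly balanced, nothing to improve
    return 100.0 * (old - new) / old
```

In this model, a predicted placement would be acted upon only if improvement_percent(before, after) is not less than min_improvement.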

Warning

The current version of this plugin takes into account only basic resource classes when making scheduling decisions. These include only RAM, disk, and vCPU count from the instance flavor. It does not take into account any other information, including specific image or aggregate metadata, custom resource classes, PCI devices, NUMA, hugepages, and so on. Moving instances that consume such resources is more likely to fail because the current implementation of the scheduler plugin cannot reliably predict whether such instances fit onto the selected target host.

Actuator plugins
os-live-migration

Live migrates instances to specific hosts. Assumes any migration is possible. Refer to the hosts and migrateAny options above to learn how to control which instances are migrated to which locations.

Available options:

  • max_parallel_migrations

    Defines the number of instances to migrate in parallel.

    Defaults to 10.

    This value applies to all decisions being processed, so it may involve instances from different hosts. Meanwhile, the nova-compute service may have its own limits on how many live migrations a given host can handle in parallel.

  • migration_polling_interval

    Defines the interval in seconds for checking the instance status while the latter is being migrated.

    Defaults to 5.

  • migration_timeout

    Defines the interval in seconds after which an unfinished migration is considered failed.

    Defaults to 180.
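
The interplay of migration_polling_interval and migration_timeout can be pictured as a simple polling loop. The Python sketch below is not the actuator code: wait_for_migration, the get_status callable, and the "done" status string are hypothetical, and the clock and sleep parameters exist only so that the loop can be exercised without real waiting.

```python
import time


def wait_for_migration(get_status, polling_interval=5, timeout=180,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until it reports completion or the timeout expires.

    Returns True if the migration finished in time, False if it is
    considered failed. The "done" status value is a placeholder.
    """
    deadline = clock() + timeout
    while True:
        if get_status() == "done":
            return True
        if clock() >= deadline:
            return False
        sleep(polling_interval)
```

With the default values, such a loop would check the status every 5 seconds and give up after 180 seconds.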

noop

Only logs the decisions that were scheduled for execution. Useful for debugging and dry-runs.

Note

The list of the services and their supported features included in this section is not complete and is being constantly amended based on the complexity of the architecture and use of a particular service.

OpenStack

OpenStack cluster

OpenStack and auxiliary services are running as containers in the kind: Pod Kubernetes resources. All long-running services are governed by one of the ReplicationController-enabled Kubernetes resources, which include either kind: Deployment, kind: StatefulSet, or kind: DaemonSet.

The placement of the services is mostly governed by the Kubernetes node labels. The labels affecting the OpenStack services include:

  • openstack-control-plane=enabled - the node hosting most of the OpenStack control plane services.

  • openstack-compute-node=enabled - the node serving as a hypervisor for Nova. The virtual machines with tenant workloads are created there.

  • openvswitch=enabled - the node hosting Neutron L2 agents and Open vSwitch pods that manage L2 connection of the OpenStack networks.

  • openstack-gateway=enabled - the node hosting Neutron L3, Metadata and DHCP agents, Octavia Health Manager, Worker and Housekeeping components.

_images/os-k8s-pods-layout.png

Note

OpenStack is an infrastructure management platform. Mirantis OpenStack for Kubernetes (MOSK) uses Kubernetes mostly for orchestration and dependency isolation. As a result, multiple OpenStack services are running as privileged containers with host PIDs and Host Networking enabled. You must ensure that at least the user with the credentials used by Helm/Tiller (administrator) is capable of creating such Pods.

Infrastructure services

Service

Description

Storage

While the underlying Kubernetes cluster is configured to use Ceph CSI for providing persistent storage for container workloads, for some types of workloads such networked storage is suboptimal due to latency.

This is why the separate local-volume-provisioner CSI is deployed and configured as an additional storage class. Local Volume Provisioner is deployed as kind: DaemonSet.

Database

A single WSREP (Galera) cluster of MariaDB is deployed as the SQL database to be used by all OpenStack services. It uses the storage class provided by Local Volume Provisioner to store the actual database files. The service is deployed as kind: StatefulSet of a given size, which is no less than 3, on any openstack-control-plane node. For details, see OpenStack database architecture.

Messaging

RabbitMQ is used as a messaging bus between the components of the OpenStack services.

A separate instance of RabbitMQ is deployed for each OpenStack service that needs a messaging bus for intercommunication between its components.

An additional, separate RabbitMQ instance is deployed to serve as a notification messages bus for OpenStack services to post their own and listen to notifications from other services. StackLight also uses this message bus to collect notifications for monitoring purposes.

Each RabbitMQ instance is a single node and is deployed as kind: StatefulSet.

Caching

A single multi-instance of the Memcached service is deployed to be used by all OpenStack services that need caching, which are mostly HTTP API services.

Coordination

A separate instance of etcd is deployed to be used by Cinder, which requires Distributed Lock Management for coordination between its components.

Ingress

Is deployed as kind: DaemonSet.

Image pre-caching

A special kind: DaemonSet is deployed and updated each time the kind: OpenStackDeployment resource is created or updated. Its purpose is to pre-cache container images on Kubernetes nodes, and thus, to minimize possible downtime when updating container images.

This is especially useful for containers used in kind: DaemonSet resources, as during the image update Kubernetes starts to pull the new image only after the container with the old image is shut down.

OpenStack services

Service

Description

Identity (Keystone)

Uses MySQL backend by default.

keystoneclient - a separate kind: Deployment with a pod that has the OpenStack CLI client as well as relevant plugins installed, and OpenStack admin credentials mounted. Can be used by the administrator to manually interact with OpenStack APIs from within a cluster.

Image (Glance)

Supported backend is RBD (Ceph is required).

Volume (Cinder)

Supported backend is RBD (Ceph is required).

Network (Neutron)

Supported backends are Open vSwitch, Open Virtual Network, and Tungsten Fabric.

Placement

Compute (Nova)

Supported hypervisor is QEMU/KVM through the libvirt library.

Dashboard (Horizon)

DNS (Designate)

Supported backend is PowerDNS.

Load Balancer (Octavia)

Ceph Object Gateway (SWIFT)

Provides the object storage and a Ceph Object Gateway Swift API that is compatible with the OpenStack Swift API. You can manually enable the service in the OpenStackDeployment CR as described in Deploy an OpenStack cluster.

Instance HA (Masakari)

An OpenStack service that ensures high availability of instances running on a host. You can manually enable Masakari in the OpenStackDeployment CR as described in Deploy an OpenStack cluster.

Orchestration (Heat)

Key Manager (Barbican)

The supported backends include:

  • The built-in Simple Crypto, which is used by default

  • Vault

    Vault by HashiCorp is a third-party system and is not installed by MOSK. Hence, the Vault storage backend should be available elsewhere on the user environment and accessible from the MOSK deployment.

    If the Vault backend is used, you can configure Vault in the OpenStackDeployment CR as described in Deploy an OpenStack cluster.

Tempest

Runs tests against a deployed OpenStack cloud. You can manually enable Tempest in the OpenStackDeployment CR as described in Deploy an OpenStack cluster.

Shared Filesystems (OpenStack Manila)

Provides Shared Filesystems as a service that enables you to create and manage shared filesystems in multi-project cloud environments. For details, refer to Shared Filesystems service.

OpenStack database architecture

A complete setup of a MariaDB Galera cluster for OpenStack is illustrated in the following image:

_images/os-k8s-mariadb-galera.png

MariaDB server pods are running a Galera multi-master cluster. Client requests are forwarded by the Kubernetes mariadb service to the mariadb-server pod that has the primary label. Other pods from the mariadb-server StatefulSet have the backup label. Labels are managed by the mariadb-controller pod.

The MariaDB Controller periodically checks the readiness of the mariadb-server pods and sets the primary label on a pod if the following requirements are met:

  • The primary label has not already been set on the pod.

  • The pod is in the ready state.

  • The pod is not being terminated.

  • The pod name has the lowest integer suffix among other ready pods in the StatefulSet. For example, between mariadb-server-1 and mariadb-server-2, the pod with the mariadb-server-1 name is preferred.

Otherwise, the MariaDB Controller sets the backup label. This means that all SQL requests are passed only to one node while the other two nodes are in the backup state and replicate the state from the primary node. The MariaDB clients connect to the mariadb service.
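
The selection rules above can be sketched in Python. This is an illustration of the described behavior, not the mariadb-controller code: pods are modeled as dictionaries with hypothetical name, ready, and terminating keys, and the sketch recomputes all labels on each call instead of checking whether the primary label is already set.

```python
def pick_primary(pods):
    """Return the name of the pod that should carry the primary label, or None.

    A candidate must be ready and not terminating; among candidates, the
    pod with the lowest integer suffix in its name wins.
    """
    candidates = [p for p in pods if p["ready"] and not p["terminating"]]
    if not candidates:
        return None
    best = min(candidates, key=lambda p: int(p["name"].rsplit("-", 1)[1]))
    return best["name"]


def assign_labels(pods):
    """Map each pod name to the label it should carry: primary or backup."""
    primary = pick_primary(pods)
    return {
        p["name"]: "primary" if p["name"] == primary else "backup"
        for p in pods
    }
```

For example, if mariadb-server-0 is not ready while mariadb-server-1 and mariadb-server-2 are, the sketch labels mariadb-server-1 as primary and the other two pods as backup.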

OpenStack lifecycle management

The OpenStack Operator component is a combination of the following entities:

OpenStack Controller (Rockoon)

The OpenStack Controller (Rockoon) runs in a set of containers in a pod in Kubernetes. Rockoon is deployed as a Deployment with 1 replica only. The failover is provided by Kubernetes that automatically restarts the failed containers in a pod.

However, given the recommendation to use a separate Kubernetes cluster for each OpenStack deployment, the controller in the envisioned mode of operation and deployment will manage only a single OpenStackDeployment resource, making proper HA much less of an issue.

Rockoon is written in Python using Kopf, as a Python framework to build Kubernetes operators, and Pykube, as a Kubernetes API client.

Using Kubernetes API, the controller subscribes to changes to resources of kind: OpenStackDeployment, and then reacts to these changes by creating, updating, or deleting appropriate resources in Kubernetes.

The basic child resources managed by the controller are Helm releases. They are rendered from templates using the appropriate set of values from the main and features fields in the OpenStackDeployment resource.

Then, the common fields are merged into the resulting data structures. Lastly, the services fields are merged, providing the final and precise override for any value in any Helm release to be deployed or upgraded.

The constructed values are then used by Rockoon during a Helm release installation.
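The layering described above can be illustrated with a recursive dictionary merge. The following Python sketch is an assumption about the general technique only: the deep_merge and build_release_values names are hypothetical, and the actual merge semantics used by Rockoon, for example, for lists, may differ.

```python
import copy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values from override win over base, recursively."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Both sides are mappings: merge them key by key.
            result[key] = deep_merge(result[key], value)
        else:
            # Scalars, lists, and type mismatches: the override replaces the base.
            result[key] = copy.deepcopy(value)
    return result


def build_release_values(
    rendered: Dict[str, Any],
    common: Dict[str, Any],
    services: Dict[str, Any],
) -> Dict[str, Any]:
    """Apply the layers in the described order: templates, common, services."""
    return deep_merge(deep_merge(rendered, common), services)
```

In this sketch, a value set in the services layer always wins over the same key in the common layer, which in turn wins over the value rendered from the templates.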

Rockoon containers

Container

Description

osdpl

The core container that handles changes in the osdpl object.

helmbundle

The container that watches the helmbundle objects and reports their statuses to the osdpl object in status:children. See OpenStackDeploymentStatus custom resource for details.

health

The container that watches all Kubernetes native resources, such as Deployments, Daemonsets, Statefulsets, and reports their statuses to the osdpl object in status:health. See OpenStackDeploymentStatus custom resource for details.

secrets

The container that provides data exchange between different components such as Ceph.

node

The container that handles the node events.

_images/openstack_controller.png

OpenStackDeployment Admission Controller

The CustomResourceDefinition resource in Kubernetes uses an OpenAPI version 3 schema to specify the schema of the resource defined. The Kubernetes API outright rejects the resources that do not pass this schema validation.

The language of the schema, however, is not expressive enough to define a specific validation logic that may be needed for a given resource. For this purpose, Kubernetes enables the extension of its API with Dynamic Admission Control.

For the OpenStackDeployment (OsDpl) CR the ValidatingAdmissionWebhook is a natural choice. It is deployed as part of OpenStack Controller (Rockoon) by default and performs specific extended validations when an OsDpl CR is created or updated.

The non-exhaustive list of additional validations includes:

  • Deny the OpenStack version downgrade

  • Deny the OpenStack version skip-level upgrade

  • Deny the OpenStack master version deployment

  • Deny upgrade to the OpenStack master version

  • Deny upgrade if any part of an OsDpl CR specification changes along with the OpenStack version
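The checks listed above can be sketched as a pure validation function. The following Python example is a hypothetical illustration and not the Admission Controller implementation: the validate_update name, the RELEASES list, and the error messages are assumptions, and the function operates on plain dictionaries that stand for the spec section of the old and new OsDpl CR.

```python
from typing import List, Optional

# Assumption: an ordered subset of OpenStack release names, oldest first.
RELEASES = [
    "ussuri", "victoria", "wallaby", "xena", "yoga",
    "zed", "antelope", "bobcat", "caracal",
]


def validate_update(old_spec: Optional[dict], new_spec: dict) -> List[str]:
    """Return the reasons to deny the request; an empty list means allow."""
    new_version = new_spec.get("openstack_version")
    if new_version == "master":
        return ["deploying or upgrading to the master version is denied"]
    if new_version not in RELEASES:
        return [f"unknown OpenStack version: {new_version!r}"]
    if old_spec is None:
        return []  # creation of a new object with a regular version
    old_version = old_spec.get("openstack_version")
    if old_version == new_version:
        return []  # not a version change, nothing to check in this sketch
    if old_version not in RELEASES:
        return [f"unknown OpenStack version: {old_version!r}"]

    errors: List[str] = []
    step = RELEASES.index(new_version) - RELEASES.index(old_version)
    if step < 0:
        errors.append("OpenStack version downgrade is denied")
    elif step > 1:
        errors.append("skip-level OpenStack upgrade is denied")

    # The version must be the only part of the spec that changes.
    old_rest = {k: v for k, v in old_spec.items() if k != "openstack_version"}
    new_rest = {k: v for k, v in new_spec.items() if k != "openstack_version"}
    if old_rest != new_rest:
        errors.append("spec changes along with the OpenStack version are denied")
    return errors
```

A real validating webhook would wrap such a function in an HTTP handler that receives an AdmissionReview request and returns the allow or deny decision to the Kubernetes API server.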

Under specific circumstances, it may be viable to disable the Admission Controller, for example, when you attempt to deploy or upgrade to the master version of OpenStack.

Warning

Mirantis does not support MOSK deployments performed without the OpenStackDeployment Admission Controller enabled. Disabling the OpenStackDeployment Admission Controller is only allowed in staging, non-production environments.

To disable the Admission Controller, ensure that the following structures and values are present in the rockoon HelmBundle resource:

apiVersion: lcm.mirantis.com/v1alpha1
kind: HelmBundle
metadata:
  name: openstack-operator
  namespace: osh-system
spec:
  releases:
  - name: openstack-operator
    values:
      admission:
        enabled: false

At that point, all safeguards except for those expressed by the CR definition are disabled.

OpenStack Exporter

The OpenStack Exporter collects metrics from the OpenStack services and exposes them to Prometheus for integration with StackLight. The Exporter interacts with the REST APIs of various OpenStack services to gather data about the infrastructure state and performance for visualization, alerting, and analysis within the monitoring system.

To retrieve metrics from the OpenStack Exporter:

  1. Locate the Exporter pod. The OpenStack Exporter runs in the osh-system namespace:

    kubectl -n osh-system get pods | grep exporter
    
  2. Query the metrics by executing the curl request inside the exporter container:

    kubectl -n osh-system exec -t <EXPORTER-POD> -c exporter -- curl http://localhost:9102/
    
OpenStack configuration

MOSK provides configuration capabilities through a number of custom resources. This section provides a detailed overview of these custom resources and their possible configuration.

OpenStackDeployment custom resource

The detailed information about the schema of an OpenStackDeployment custom resource can be obtained by running:

kubectl get crd openstackdeployments.lcm.mirantis.com -o yaml

The definition of a particular OpenStack deployment can be obtained by running:

kubectl -n openstack get osdpl -o yaml

Example of an OpenStackDeployment CR of minimum configuration

apiVersion: lcm.mirantis.com/v1alpha1
kind: OpenStackDeployment
metadata:
  name: openstack-cluster
  namespace: openstack
spec:
  openstack_version: victoria
  preset: compute
  size: tiny
  internal_domain_name: cluster.local
  public_domain_name: it.just.works
  features:
    neutron:
      tunnel_interface: ens3
      external_networks:
        - physnet: physnet1
          interface: veth-phy
          bridge: br-ex
          network_types:
            - flat
          vlan_ranges: null
          mtu: null
      floating_network:
        enabled: false
    nova:
      live_migration_interface: ens3
      images:
        backend: local
Hiding sensitive information

Available since MOSK 23.1

The OpenStackDeployment custom resource enables you to securely store sensitive fields in Kubernetes secrets. To do that, verify that the referenced secret is present in the same namespace as the OpenStackDeployment object and that its openstack.lcm.mirantis.com/osdpl_secret label is set to true. The list of fields that can be hidden from OpenStackDeployment is limited and defined by the OpenStackDeployment schema.

For example, to hide spec:features:ssl:public_endpoints:api_cert, use the following structure:

spec:
  features:
    ssl:
      public_endpoints:
        api_cert:
          value_from:
            secret_key_ref:
              key: api_cert
              name: osh-dev-hidden
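To illustrate how such a reference can be turned back into a plain value, the following Python sketch walks a spec structure and substitutes every value_from:secret_key_ref node with the decoded secret data. This is an assumption about the general technique, not the Rockoon implementation: the resolve_hidden_fields name is hypothetical, the secrets are passed in as a dictionary instead of being read from the Kubernetes API, and the label verification is omitted.

```python
import base64
from typing import Any, Dict


def resolve_hidden_fields(node: Any, secrets: Dict[str, Dict[str, str]]) -> Any:
    """Replace each value_from.secret_key_ref node with the decoded secret value.

    secrets maps a secret name to its data, where values are
    base64-encoded in the same way as in a Kubernetes Secret object.
    """
    if isinstance(node, list):
        return [resolve_hidden_fields(item, secrets) for item in node]
    if not isinstance(node, dict):
        return node
    value_from = node.get("value_from")
    if isinstance(value_from, dict) and "secret_key_ref" in value_from:
        ref = value_from["secret_key_ref"]
        encoded = secrets[ref["name"]][ref["key"]]
        return base64.b64decode(encoded).decode("utf-8")
    return {key: resolve_hidden_fields(value, secrets) for key, value in node.items()}
```

With the structure above and a secret named osh-dev-hidden that contains the api_cert key, the sketch returns a spec where api_cert holds the certificate text instead of the reference.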
Main elements
Main elements of OpenStackDeployment custom resource

Element

Sub-element

Description

apiVersion

n/a

Specifies the version of the Kubernetes API that is used to create this object

kind

n/a

Specifies the kind of the object

metadata

name

Specifies the name of the object. Must be set in compliance with the Kubernetes resource naming limitations

namespace

Specifies the namespace of the object. While it is technically possible to deploy OpenStack on top of Kubernetes in a namespace other than openstack, such a configuration is not included in the MOSK system integration test plans. Therefore, Mirantis does not recommend such a scenario.

Warning

Both OpenStack and Kubernetes platforms provide resources to applications. When OpenStack is running on top of Kubernetes, Kubernetes is completely unaware of OpenStack-native workloads, such as virtual machines, for example.

For better results and stability, Mirantis recommends using a dedicated Kubernetes cluster for OpenStack, so that OpenStack and auxiliary services, Ceph, and StackLight are the only Kubernetes applications running in the cluster.

spec

openstack_version

Specifies the OpenStack release to deploy

preset

String that specifies the name of the preset, a predefined configuration for the OpenStack cluster. A preset includes:

  • A set of enabled services that includes virtualization, bare metal management, secret management, and others

  • Major features provided by the services, such as VXLAN encapsulation of the tenant traffic

  • Integration of services

Every supported deployment profile incorporates an OpenStack preset. Refer to Deployment profiles for the list of possible values.

size

String that specifies the size category for the OpenStack cluster. The size category defines the internal configuration of the cluster, such as the number of replicas for service workers and timeouts.

The list of supported sizes includes:

  • tiny - for approximately 10 OpenStack compute nodes

  • small - for approximately 50 OpenStack compute nodes

  • medium - for approximately 100 OpenStack compute nodes

public_domain_name

Specifies the public DNS name for OpenStack services. This is a base DNS name that must be accessible and resolvable by API clients of your OpenStack cloud. It will be present in the OpenStack endpoints as presented by the OpenStack Identity service catalog.

The TLS certificates used by the OpenStack services (see below) must also be issued to this DNS name.

persistent_volume_storage_class

Specifies the Kubernetes storage class name used for services to create persistent volumes. For example, backups of MariaDB. If not specified, the storage class marked as default will be used.

features

Contains the top-level collections of settings for the OpenStack deployment that potentially target several OpenStack services. This is the section where customizations should take place.

The features:services element contains a list of extra OpenStack services to deploy. Extra OpenStack services are services that are not included in the preset.

region_name

TechPreview

The name of the region used for deployment, defaults to RegionOne.

features:policies

Defines the list of custom policies for OpenStack services.

Configuration structure:

spec:
  features:
    policies:
      nova:
        custom_policy: custom_value

The list of services available for configuration includes: Cinder, Nova, Designate, Keystone, Glance, Neutron, Heat, Octavia, Barbican, Placement, Ironic, aodh, Gnocchi, and Masakari.

Learn more about OpenStack API access policies in MOSK in OpenStack API access policies.

Caution

Mirantis is not responsible for cloud operability if the default policies are modified but provides an API to pass the required configuration to the core OpenStack services.

features:policies:strict_admin

TechPreview

Enables a tested set of policies that limits the global admin role to only the user with the admin role in the admin project or the user with the service role. The latter should be used only for service users utilized for communication between OpenStack services.

Configuration structure:

spec:
  features:
    policies:
      strict_admin:
        enabled: true
  services:
    identity:
      keystone:
        values:
          conf:
            keystone:
              resource:
                admin_project_name: admin
                admin_project_domain_name: Default

Note

The spec.services part of the above section will become redundant in one of the upcoming releases.

artifacts

A low-level section that defines the base URI prefixes for images and binary artifacts.

common

A low-level section that defines values that will be passed to all OpenStack (spec:common:openstack) or auxiliary (spec:common:infra) services Helm charts.

Configuration structure:

spec:
  artifacts:
  common:
    openstack:
      values:
    infra:
      values:
services

The lowest-level section that enables you to define specific values to pass to specific Helm charts on a one-by-one basis.

Warning

Mirantis does not recommend changing the default settings for spec:artifacts, spec:common, and spec:services elements. Customizations can compromise the OpenStack deployment update and upgrade processes. However, you may need to edit the spec:services section to limit hardware resources in case of a hyperconverged architecture as described in Limit HW resources for hyperconverged OpenStack compute nodes.

Logging

Parameter

features:logging:<service>:level

Usage

Specifies the logging level for an OpenStack service. The standard logging levels, in order of increasing severity, are TRACE, DEBUG, INFO, AUDIT, WARNING, ERROR, and CRITICAL.

Configuration example:

spec:
  features:
    logging:
      nova:
        level: DEBUG
Node-specific configuration

Depending on the use case, you may need to configure the same application components differently on different hosts. MOSK enables you to easily perform the required configuration through node-specific overrides at the OpenStack Controller side.

The limitation of node-specific overrides is that they change only the configuration settings. Other components, such as startup scripts, are not covered by the overrides and have to be reconfigured separately if required.

Caution

The overrides have been implemented in a similar way to the OpenStack node and node label specific DaemonSet configurations. However, the OpenStack Controller node-specific settings conflict with the upstream OpenStack node and node label specific DaemonSet configurations. Therefore, Mirantis does not recommend configuring node and node label overrides.

The list of allowed node labels is located in the Cluster object status providerStatus.releaseRef.current.allowedNodeLabels field.

If the value field is not defined in allowedNodeLabels, a label can have any value.

Before or after a machine deployment, add the required label from the allowed node labels list with the corresponding value to spec.providerSpec.value.nodeLabels in machine.yaml. For example:

nodeLabels:
- key: <NODE-LABEL>
  value: <NODE-LABEL-VALUE>

The addition of a node label that is not available in the list of allowed node labels is restricted.

The node-specific settings are activated through the spec:nodes section of the OsDpl CR. The spec:nodes section contains the following subsections:

  • features - implements overrides for a limited subset of fields and is constructed similarly to spec:features

  • services - similarly to spec:services, enables you to override settings in general for the components running as DaemonSets

Example configuration:

spec:
  nodes:
    <NODE-LABEL>::<NODE-LABEL-VALUE>:
      features:
        # Detailed information about features can be found in
        # openstack_controller/admission/validators/nodes/schema.yaml
      services:
        <service>:
          <chart>:
            <chart_daemonset_name>:
              values:
                # Any value from specific helm chart
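The <NODE-LABEL>::<NODE-LABEL-VALUE> keys of the spec:nodes section can be matched against the labels of a node as in the following Python sketch. The overrides_for_node name and the label used in the usage note are hypothetical, and the sketch only shows the key format from the example above, not the way the OpenStack Controller applies the overrides.

```python
from typing import Any, Dict


def overrides_for_node(
    node_labels: Dict[str, str],
    nodes_spec: Dict[str, Dict[str, Any]],
) -> Dict[str, Dict[str, Any]]:
    """Return the spec:nodes entries whose <label>::<value> key matches the node."""
    matched: Dict[str, Dict[str, Any]] = {}
    for selector, override in nodes_spec.items():
        label, separator, value = selector.partition("::")
        if not separator:
            raise ValueError(f"expected <label>::<value>, got {selector!r}")
        if node_labels.get(label) == value:
            matched[selector] = override
    return matched
```

For example, for a node that carries the hypothetical label example-node-role with the value dpdk, only the entries under the example-node-role::dpdk key are returned.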
Tempest

Parameter

features:services:tempest

Usage

Enables tests against a deployed OpenStack cloud:

spec:
  features:
    services:
    - tempest
OpenStackDeploymentStatus custom resource

The resource of kind OpenStackDeploymentStatus is a custom resource that describes the status of an OpenStack deployment. To obtain detailed information about the schema of an OpenStackDeploymentStatus custom resource:

kubectl get crd openstackdeploymentstatus.lcm.mirantis.com -o yaml

To obtain the status definition for a particular OpenStack deployment:

kubectl -n openstack get osdplst

Example of system response:

NAME      OPENSTACK VERSION   CONTROLLER VERSION   STATE     LCM PROGRESS   HEALTH   MOSK RELEASE
osh-dev   antelope            0.16.1.dev104        APPLIED   20/20          21/22    MOSK 24.1.3

Where:

  • OPENSTACK VERSION displays the actual OpenStack version of the deployment

  • CONTROLLER VERSION indicates the version of the OpenStack Controller (Rockoon) responsible for the deployment

  • STATE reflects the current status of life-cycle management. The list of possible values includes:

    • APPLYING indicates that some Kubernetes objects for applications are in the process of being applied

    • APPLIED indicates that all Kubernetes objects for applications have been applied to the latest state

  • LCM PROGRESS reflects the current progress of STATE in the format X/Y, where X denotes the number of applications with Kubernetes objects applied and in the actual state, and Y represents the total number of applications managed by the OpenStack Controller (Rockoon)

  • HEALTH provides an overview of the current health status of the OpenStack deployment in the format X/Y, where X represents the number of applications with notReady pods, and Y is the total number of applications managed by the OpenStack Controller (Rockoon)

  • MOSK RELEASE displays the current product release of the OpenStack deployment

Depending on the product version, the system response may not include the LCM PROGRESS and HEALTH columns:

NAME      OPENSTACK VERSION   CONTROLLER VERSION   STATE     MOSK RELEASE
osh-dev   antelope            0.16.1.dev104        APPLIED   MOSK 24.1

The remaining columns have the same meaning as described above.

Example of an OpenStackDeploymentStatus custom resource configuration
kind: OpenStackDeploymentStatus
metadata:
  name: osh-dev
  namespace: openstack
spec: {}
status:
  handle:
    lastStatus: update
  health:
    barbican:
      api:
        generation: 2
        status: Ready
    cinder:
      api:
        generation: 2
        status: Ready
      backup:
        generation: 1
        status: Ready
      scheduler:
        generation: 1
        status: Ready
      volume:
        generation: 1
        status: Ready
  osdpl:
    cause: update
    changes: '((''add'', (''status'',), None, {''watched'': {''ceph'': {''secret'':
      {''hash'': ''0fc01c5e2593bc6569562b451b28e300517ec670809f72016ff29b8cbaf3e729''}}}}),)'
    controller_version: 0.5.3.dev12
    fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
    openstack_version: ussuri
    state: APPLIED
    timestamp: "2021-09-08 17:01:45.633143"
  services:
    baremetal:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:54.081353"
    block-storage:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:57.306669"
    compute:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:18.853068"
    coordination:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:00.593719"
    dashboard:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:57.652145"
    database:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:00.233777"
    dns:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:56.540886"
    identity:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:00.961175"
    image:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:58.976976"
    ingress:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:01.440757"
    key-manager:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:51.822997"
    load-balancer:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:02.462824"
    memcached:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:03.165045"
    messaging:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:58.637506"
    networking:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:35.553483"
    object-storage:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:01.828834"
    orchestration:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:02.846671"
    placement:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:58.039210"
    redis:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:36.562673"
Health structure

The health subsection provides a brief summary of the health of the services.

OsDpl structure

The osdpl subsection describes the overall status of the OpenStack deployment.

OsDpl structure elements

Element

Description

cause

The cause that triggered the LCM action: update when OsDpl is updated, resume when the OpenStack Controller (Rockoon) is restarted

changes

A string representation of changes in the OpenStackDeployment object

controller_version

The version of rockoon that handles the LCM action

fingerprint

The SHA sum of the OpenStackDeployment object spec section

openstack_version

The current OpenStack version specified in the osdpl object

state

The current state of the LCM action. Possible values include:

  • APPLYING - not all operations are completed

  • APPLIED - all operations are completed

timestamp

The timestamp of the status:osdpl section update

Services structure

The services subsection provides detailed information about the LCM actions performed on a specific service. This is a dictionary where keys are service names, for example, baremetal or compute, and values are dictionaries with the following items.

Services structure elements

Element

Description

controller_version

The version of the rockoon that handles the LCM action on a specific service

fingerprint

The SHA sum of the OpenStackDeployment object spec section used when performing the LCM on a specific service

openstack_version

The OpenStack version specified in the osdpl object used when performing the LCM action on a specific service

state

The current state of the LCM action performed on a service. Possible values include:

  • WAITING - waiting for dependencies.

  • APPLYING - not all operations are completed.

  • APPLIED - all operations are completed.

timestamp

The timestamp of the status:services:<SERVICE-NAME> section update.
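As an illustration of how the services structure can be consumed, the following Python sketch summarizes the per-service states of a status section that has already been loaded into a dictionary, for example, from the JSON output of kubectl. The summarize_services name is hypothetical, and the X/Y string only mimics the format of the LCM PROGRESS column. Whether Rockoon calculates that column in the same way is an assumption.

```python
from typing import List, Tuple


def summarize_services(status: dict) -> Tuple[str, List[str]]:
    """Return the "X/Y" progress and the sorted names of services not yet APPLIED."""
    services = status.get("services", {})
    pending = sorted(
        name for name, item in services.items()
        if item.get("state") != "APPLIED"
    )
    applied = len(services) - len(pending)
    return f"{applied}/{len(services)}", pending
```

A service in the WAITING or APPLYING state is reported as pending, which makes it easy to see what still blocks the overall APPLIED state.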

OpenStack Controller configuration

Available since MOSK 23.2

OpenStack Controller (Rockoon)

Since MOSK 25.1, the OpenStack Controller has been open-sourced under the name Rockoon and is maintained as an independent open-source project going forward.

As part of this transition, all openstack-controller pods are named rockoon pods across the MOSK documentation and deployments. This change does not affect functionality, but make sure to use the new naming for pods and other related artifacts.

The OpenStack Controller (Rockoon) enables you to modify its configuration at runtime without restarting. MOSK stores the controller configuration in the rockoon-config ConfigMap in the osh-system namespace of your cluster.

To retrieve the Rockoon configuration ConfigMap, run:

kubectl -n osh-system get configmaps rockoon-config -o yaml
Example of a Rockoon configuration ConfigMap
apiVersion: v1
data:
  extra_conf.ini: |
    [maintenance]
    respect_nova_az = false
kind: ConfigMap
metadata:
  annotations:
    openstackdeployments.lcm.mirantis.com/skip_update: "true"
  name: rockoon-config
  namespace: osh-system
Rockoon extra configuration parameters

Section

Parameter

Default value

Description

[osctl]

wait_application_ready_timeout

1200

The number of seconds to wait for all application components to become ready.

wait_application_ready_delay

10

The number of seconds to sleep between attempts to verify if the application is ready.

node_not_ready_flapping_timeout

120

The amount of time to wait for the flapping node.

[helmbundle]

manifest_enable_timeout

600

The number of seconds to wait until the values set in the manifest are propagated to the dependent objects.

manifest_enable_delay

10

The number of seconds between attempts to verify if the values were applied.

manifest_disable_timeout

600

The number of seconds to wait until the values are removed from the manifest and propagated to the child objects.

manifest_disable_delay

10

The number of seconds between attempts to verify if the values were removed from the release.

manifest_purge_timeout

600

The number of seconds to wait until the Kubernetes object is removed.

manifest_purge_delay

10

The number of seconds between attempts to verify if the Kubernetes object is removed.

manifest_apply_delay

10

The number of seconds to pause for the Helm bundle changes.

[maintenance]

instance_migrate_concurrency

1

The number of instances to migrate concurrently.

nwl_parallel_max_compute

30

The maximum number of compute nodes allowed for a parallel update.

nwl_parallel_max_gateway

1

The maximum number of gateway nodes allowed for a parallel update.

respect_nova_az

true

Respect Nova availability zone (AZ). The true value allows the parallel update only for the compute nodes in the same AZ.

ndr_skip_instance_check

false

The flag to skip the instance verification on a host before proceeding with the node removal. The false value blocks the node removal while at least one instance exists on the host.

ndr_skip_volume_check

false

The flag to skip the volume verification on a host before proceeding with the node removal. The false value blocks the node removal while at least one volume exists on the host. A volume is tied to a specific host only for the LVM backend.
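The relation between the defaults in the table above and the extra_conf.ini content can be sketched with the standard Python configparser module. This is a minimal illustration under the assumption that the extra configuration simply overrides the documented defaults. The load_config name is hypothetical, and the way Rockoon actually loads its configuration is not described here.

```python
import configparser

# Default values copied from the table above.
DEFAULTS = {
    "osctl": {
        "wait_application_ready_timeout": "1200",
        "wait_application_ready_delay": "10",
        "node_not_ready_flapping_timeout": "120",
    },
    "helmbundle": {
        "manifest_enable_timeout": "600",
        "manifest_enable_delay": "10",
        "manifest_disable_timeout": "600",
        "manifest_disable_delay": "10",
        "manifest_purge_timeout": "600",
        "manifest_purge_delay": "10",
        "manifest_apply_delay": "10",
    },
    "maintenance": {
        "instance_migrate_concurrency": "1",
        "nwl_parallel_max_compute": "30",
        "nwl_parallel_max_gateway": "1",
        "respect_nova_az": "true",
        "ndr_skip_instance_check": "false",
        "ndr_skip_volume_check": "false",
    },
}


def load_config(extra_conf: str) -> configparser.ConfigParser:
    """Load the defaults first, then overlay the extra_conf.ini content."""
    parser = configparser.ConfigParser()
    parser.read_dict(DEFAULTS)
    parser.read_string(extra_conf)
    return parser
```

With the extra_conf.ini content from the ConfigMap example, respect_nova_az evaluates to false while all other parameters keep their default values.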

Custom OpenStack images

Available since MOSK 25.1

The OpenStack Controller enables you to use customized images in your OpenStack deployments. To start using such images, create a ConfigMap in the openstack namespace with the following content, replacing <OPENSTACKDEPLOYMENT-NAME> with the name of your OpenStackDeployment custom resource:

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    openstack.lcm.mirantis.com/watch: "true"
  name: <OPENSTACKDEPLOYMENT-NAME>-artifacts
  namespace: openstack
data:
  caracal: |
    dep_check: <KUBERNETES-ENTRYPOINT-IMAGE-URL>
    openvswitch_db_server: <OPENVSWITCH-IMAGE-URL>
    openvswitch_vswitchd: <OPENVSWITCH-IMAGE-URL>
OpenStack database

MOSK relies on the MariaDB Galera cluster to provide its OpenStack components with a reliable storage of persistent data.

For successful long-term operations of a MOSK cloud, it is crucial to ensure the healthy state of the OpenStack database as well as the safety of the data stored in it. To help you with that, MOSK provides built-in automated procedures for OpenStack database maintenance, backup, and restoration. This chapter describes the internal mechanisms and configuration details for the provided tools.

Overview of the OpenStack database backup and restoration

Mirantis recommends backing up your OpenStack databases daily to ensure the safety of your cloud data. Also, you should always create an instant backup before updating your cloud or performing any kind of potentially disruptive experiment.

MOSK has a built-in automated backup routine that can be triggered manually or by schedule. For detailed information about the process of MariaDB Galera cluster backup, refer to Workflows of the OpenStack database backup and restoration.

Backup and restoration can only be performed against the OpenStack database as a whole. Granular per-service or per-table procedures are not supported by MOSK.

Important

Because database restoration reverts to a specific snapshot, the resulting database state may not accurately reflect the current state of dynamic resources such as running VMs, volumes, and similar components. For example:

  • A VM is removed after a snapshot for restoration is created. Such VM will be present as an orphan entry in the database and the OpenStack API after restoration.

  • A VM is created after a snapshot for restoration is created. Such VM will disappear from the OpenStack API after database restoration but will still be present as a process on the compute host.

Manually analyze and resolve such inconsistencies.

Restoring a database may also impact applications that rely on it for heartbeats. For instance, Octavia Amphorae may become unresponsive after restoration, potentially requiring LoadBalancer failover to maintain service availability.
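The two kinds of VM inconsistencies described above can be detected by comparing the set of instances known to the restored database with the set of instances actually running on the compute hosts. The following Python sketch shows only the comparison step. The find_inconsistencies name is hypothetical, and collecting both sets, for example, through the OpenStack API and the hypervisors, is out of scope of this illustration.

```python
from typing import Dict, Set


def find_inconsistencies(
    db_instances: Set[str],
    host_instances: Set[str],
) -> Dict[str, Set[str]]:
    """Compare instance IDs in the database with those running on the hosts."""
    return {
        # Present in the database but removed after the snapshot was taken.
        "orphan_db_entries": db_instances - host_instances,
        # Created after the snapshot: running on a host but unknown to the API.
        "unknown_on_hosts": host_instances - db_instances,
    }
```

Both resulting sets still require manual analysis because the sketch cannot decide which side reflects the desired state.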

Periodic backups

By default, periodic backups are turned off. However, a cloud operator can easily enable this capability by adding the following structure to the OpenStackDeployment custom resource:

spec:
  features:
    database:
      backup:
        enabled: true

For the configuration details, refer to Periodic OpenStack database backups.

Database restoration

Along with the automated backup routine, MOSK provides the Mariabackup tool for the OpenStack database restoration. For the database restoration procedure, refer to Restore OpenStack databases from a backup. For more information about the restoration process, consult Workflows of the OpenStack database backup and restoration.

Storage for backup data

By default, the MOSK backup routine stores the OpenStack database data in the Mirantis Ceph cluster, which is a part of the same cloud. This is sufficient for the vast majority of clouds. However, you may want to have the backup data stored off the cloud to comply with specific enterprise practices for infrastructure recovery and data safety.

To achieve that, MOSK enables you to point the backup routine to an external data volume. For details, refer to Remote storage for OpenStack database backups.

Size of a backup storage

The size of a backup storage volume depends directly on the size of the MOSK cluster, which can be determined through the size parameter in the OpenStackDeployment CR.

The list of the recommended sizes for a minimal backup volume includes:

  • 20 GB for the tiny cluster size

  • 40 GB for the small cluster size

  • 80 GB for the medium cluster size

If required, you can change the default size of a database backup volume. However, make sure that you configure the volume size before the OpenStack deployment is complete, because there is no automatic way to resize the backup volume once the cloud is deployed. Also, only the local backup storage (Ceph) supports the configuration of the volume size.

To change the default size of the backup volume, use the following structure in the OpenStackDeployment CR:

spec:
  services:
    database:
      mariadb:
        values:
          volume:
            phy_backup:
              size: "200Gi"
Local backup storage - default

To store the backup data in the local Mirantis Ceph cluster, the Kubernetes cluster underlying MOSK needs to have a preconfigured storage class for Kubernetes persistent volumes with the Ceph cluster as a storage backend.

When restoring the OpenStack database from a local Ceph storage, the cron job restores the state on each MariaDB node sequentially. It is not possible to perform parallel restoration because Ceph Kubernetes volumes do not support concurrent mounting from multiple places.

Additionally, MOSK allows for increased backup safety through synchronization of local MariaDB backups with a remote S3 storage. For details, see Synchronization of local MariaDB backups with a remote S3 storage.

Remote backup storage

MOSK provides you with a capability to store the OpenStack database data outside of the cloud, on an external storage device that supports common data access protocols, such as third-party NAS appliances.

Refer to Remote storage for OpenStack database backups for the configuration details.

Backup encryption

Available since MOSK 25.1 TechPreview

Security compliance may require storing backups of databases in an encrypted format. MOSK enables encryption of database backups, both local and remote, using the OpenSSL aes-256-cbc encryption.

To encrypt database backups, add the following configuration to the OpenStackDeployment custom resource:

spec:
  features:
    database:
      backup:
        encryption:
          enabled: true
Workflows of the OpenStack database backup and restoration

This section provides technical details about the internal implementation of automated backup and restoration routines built into MOSK. The information below helps you troubleshoot issues related to the process and understand the impact that these procedures have on a running cloud.

Backup workflow

The OpenStack database backup workflow consists of the following phases.

Backup phase 1

The mariadb-phy-backup job is responsible for:

  • Performing basic sanity checks and choosing the right node for backup

  • Verifying the wsrep status and changing the wsrep_desync parameter settings

  • Checking backup integrity (ensuring correct hash sums)

  • Managing the mariadb-phy-backup-runner pod

  • If enabled, synchronizing the local backup storage with the remote S3 storage

During the first backup phase, the following actions take place:

  1. The mariadb-phy-backup pod starts on the node where the mariadb-server replica with the highest number in its name runs. For example, if the MariaDB server pods are named mariadb-server-0, mariadb-server-1, and mariadb-server-2, the mariadb-phy-backup pod starts on the same node as mariadb-server-2.

  2. The backup process verifies the hash sums of existing backup files based on ConfigMap information:

    • If the verification fails and synchronization with the remote S3 storage is enabled, the process checks the hash sums of remote backups as well. If the remote backups are valid, they are downloaded.

    • If the hash sums are incorrect for both local and remote backups, the backup job fails.

    • If no ConfigMap exists, these hash sum checks are skipped.

  3. Sanity check: verification of the Kubernetes status and wsrep status of each MariaDB pod. If some pods have wrong statuses, the backup job fails unless the --allow-unsafe-backup parameter is passed to the main script in the Kubernetes backup job.

    Note

    Since MOSK 22.4, the --allow-unsafe-backup functionality is removed from the product for security and backup procedure simplification purposes.

    Mirantis does not recommend setting the --allow-unsafe-backup parameter unless it is absolutely required. To ensure the consistency of a backup, verify that the MariaDB Galera cluster is in a working state before you proceed with the backup.

  4. Desynchronize the replica from the Galera cluster. The script connects to the target replica and sets the wsrep_desync variable to ON. Then, the replica stops receiving write-sets and receives the wsrep status Donor/Desynced. The Kubernetes health check of that mariadb-server pod fails and the Kubernetes status of that pod becomes Not ready. If the pod has the primary label, the MariaDB Controller sets the backup label to it and the pod is removed from the endpoints list of the MariaDB service.

  5. Verify that there is enough space in the /var/backup folder to perform the backup. The amount of available space in the folder should exceed <DB-SIZE> * <MARIADB-BACKUP-REQUIRED-SPACE-RATIO> in KB.

Figure: OpenStack database backup workflow, phase 1
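
Two of the checks from the first phase are simple enough to sketch in Python: selecting the replica with the highest number in its name (step 1) and verifying the free space in /var/backup (step 5). This is an illustration under stated assumptions, not the code of the backup script. The function names are hypothetical, and the default ratio of 1.2 is taken from the MARIADB_BACKUP_REQUIRED_SPACE_RATIO default.

```python
import re


def pick_backup_target(pod_names):
    """Return the mariadb-server pod with the highest ordinal in its name."""
    def ordinal(name):
        match = re.search(r"-(\d+)$", name)
        if match is None:
            raise ValueError(f"no ordinal suffix in pod name: {name}")
        return int(match.group(1))

    # Compare numerically so that "-10" sorts after "-9".
    return max(pod_names, key=ordinal)


def has_enough_space(available_kb, db_size_kb, required_space_ratio=1.2):
    """Available space must exceed the database size times the ratio (in KB)."""
    return available_kb > db_size_kb * required_space_ratio


pods = ["mariadb-server-0", "mariadb-server-1", "mariadb-server-2"]
print(pick_backup_target(pods))
print(has_enough_space(available_kb=13_000_000, db_size_kb=10_000_000))
print(has_enough_space(available_kb=11_000_000, db_size_kb=10_000_000))
```

With a 10 GB database and the default ratio, the sketch requires more than 12 GB of free space, so 13 GB passes and 11 GB does not.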
Backup phase 2
  1. The mariadb-phy-backup pod performs the backup using the mariabackup tool.

  2. The script puts the backed up replica back to sync with the Galera cluster by setting wsrep_desync to OFF and waits for the replica to become Ready in Kubernetes.

Figure: OpenStack database backup workflow, phase 2
Backup phase 3
  1. The script calculates hash sums for backup files and stores them in a special ConfigMap.

  2. If the number of existing backups exceeds the value of the MARIADB_BACKUPS_TO_KEEP job parameter, the script removes the oldest backups to maintain the allowed limit.

  3. If enabled, the script synchronizes the local backup storage with the remote S3 storage.

Figure: OpenStack database backup workflow, phase 3
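
The retention rule from step 2 of the third phase can be sketched as follows. The function name and the representation of a backup as a (name, creation time) tuple are assumptions made for illustration; the actual script may track backups differently.

```python
def backups_to_remove(full_backups, backups_to_keep):
    """Return the names of the oldest backups that exceed the allowed limit.

    full_backups is a hypothetical list of (name, created_at_epoch) tuples.
    backups_to_keep mirrors the MARIADB_BACKUPS_TO_KEEP job parameter.
    """
    if backups_to_keep < 0:
        raise ValueError("backups_to_keep must not be negative")
    newest_first = sorted(full_backups, key=lambda item: item[1], reverse=True)
    # Everything after the first backups_to_keep entries is too old to keep.
    return [name for name, _created_at in newest_first[backups_to_keep:]]


existing = [
    ("backup-week-1", 1_700_000_000),
    ("backup-week-2", 1_700_604_800),
    ("backup-week-3", 1_701_209_600),
    ("backup-week-4", 1_701_814_400),
]
print(backups_to_remove(existing, backups_to_keep=3))
```

When the number of backups does not exceed the limit, the function returns an empty list and nothing is removed.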
Restoration workflow

The OpenStack database restoration workflow consists of the following phases.

Restoration phase 1

The mariadb-phy-restore job launches the mariadb-phy-restore pod. This pod starts with the mariadb-server PVC with the highest number in its name. This PVC is mounted to the /var/lib/mysql folder and the backup PVC (or local filesystem if the hostpath backend is configured) is mounted to /var/backup.

The mariadb-phy-restore pod contains the main restore script, which is responsible for:

  • Scaling the mariadb-server StatefulSet

  • Verifying the statuses of mariadb-server pods

  • Managing the openstack-mariadb-phy-restore-runner pods

  • Checking backup integrity (ensuring correct hash sums)

Caution

During the restoration, the database is not available for OpenStack services, which means a complete outage of all OpenStack services.

During the first phase, the following actions take place:

  1. The restoration process verifies the hash sums of existing backup files based on ConfigMap information:

    • If the verification fails and synchronization with the remote S3 storage is enabled, the process checks the hash sums of remote backups as well. If the remote backups are valid, they are downloaded.

    • If the hash sums are incorrect for both local and remote backups, the restoration job fails.

  2. Save the list of mariadb-server persistent volume claims (PVC).

  3. Scale the mariadb-server StatefulSet to 0 replicas. At this point, the database becomes unavailable for OpenStack services.

Figure: OpenStack database restoration workflow, phase 1
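
The hash sum verification from step 1, including the fallback to the remote copy, can be sketched as follows. Several points are assumptions: the documentation does not name the hash algorithm, so SHA-256 is used for illustration; backups are modeled as in-memory bytes instead of files; and the sketch fails when the local copy is invalid and synchronization is disabled, a case the description above does not cover explicitly. All function names are hypothetical.

```python
import hashlib


def sha256_hex(data):
    """Return the SHA-256 digest of a bytes object as a hex string."""
    return hashlib.sha256(data).hexdigest()


def hashes_match(files, expected_hashes):
    """Check that every expected file is present with the recorded digest."""
    return all(
        name in files and sha256_hex(files[name]) == digest
        for name, digest in expected_hashes.items()
    )


def select_backup_source(local, remote, expected_hashes, sync_remote_enabled):
    """Prefer the local copy and fall back to the remote one if allowed."""
    if hashes_match(local, expected_hashes):
        return "local"
    if sync_remote_enabled and hashes_match(remote, expected_hashes):
        return "remote"  # the valid remote backups would be downloaded here
    raise RuntimeError("no backup copy with valid hash sums is available")


good = {"backup.tar.gz": b"backup payload"}
corrupted = {"backup.tar.gz": b"damaged payload"}
# Stands in for the hash sums recorded in the ConfigMap.
recorded = {name: sha256_hex(data) for name, data in good.items()}

print(select_backup_source(good, {}, recorded, sync_remote_enabled=False))
print(select_backup_source(corrupted, good, recorded, sync_remote_enabled=True))
```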
Restoration phase 2
  1. The mariadb-phy-restore pod performs the following actions:

    1. Launches the openstack-mariadb-phy-restore-runner pod for each mariadb-server PVC. This pod cleans all MySQL data on each PVC.

    2. Collects logs from the openstack-mariadb-phy-restore-runner pod and then removes it.

    3. Unarchives the database backup files to a temporary directory within /var/backup.

    4. Executes mariabackup --prepare on the unarchived data.

    5. Restores the backup to /var/lib/mysql.

Figure: OpenStack database restoration workflow, phase 2
Restoration phase 3
  1. The mariadb-phy-restore pod scales the mariadb-server StatefulSet back to the configured number of replicas.

  2. The mariadb-phy-restore pod waits until all mariadb-server replicas are ready.

Figure: OpenStack database restoration workflow, phase 3
OpenStack database auto-cleanup

By design, when deleting a cloud resource, for example, an instance, volume, or router, an OpenStack service does not immediately delete its data but marks it as removed so that it can later be picked up by the garbage collector.

Given that an OpenStack resource is often represented by more than one record in the database, deletion of all of them right away could affect the overall responsiveness of the cloud API. On the other hand, an OpenStack database being severely clogged with stale data is one of the most typical reasons for the cloud slowness.

To keep the OpenStack database small and the cloud performance fast, MOSK is preconfigured to automatically clean up the removed database records older than 30 days. By default, the cleanup is performed for the following MOSK services every Monday according to the schedule:

The default database cleanup schedule by OpenStack service

Service                             Service identifier    Cleanup time
Block Storage (OpenStack Cinder)    cinder                12:01 a.m.
Compute (OpenStack Nova)            nova                  01:01 a.m.
Image (OpenStack Glance)            glance                02:01 a.m.
Instance HA (OpenStack Masakari)    masakari              03:01 a.m.
Key Manager (OpenStack Barbican)    barbican              04:01 a.m.
Orchestration (OpenStack Heat)      heat                  05:01 a.m.

If required, you can adjust the cleanup schedule for the OpenStack database by adding the features:database:cleanup setting to the OpenStackDeployment CR following the example below. The schedule parameter must contain a valid cron expression. The age parameter specifies the number of days after which a stale record gets cleaned up.

spec:
  features:
    database:
      cleanup:
        <os-service-identifier>:
          enabled: true
          schedule: "1 0 * * 1"
          age: 30
          batch: 1000
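
To make the schedule and age parameters more tangible, the following Python sketch splits the cron expression from the example into its five named fields and computes the cutoff moment that corresponds to an age of 30 days. It only illustrates how to read the values; the function names are hypothetical and the sketch is not part of MOSK.

```python
from datetime import datetime, timedelta, timezone

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_schedule(expression):
    """Split a five-field cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError("expected five space-separated cron fields")
    return dict(zip(CRON_FIELDS, parts))


def cleanup_cutoff(now, age_days):
    """Records marked as removed before this moment are old enough to clean up."""
    return now - timedelta(days=age_days)


# "1 0 * * 1" reads as minute 1, hour 0, every Monday: 12:01 a.m. on Mondays.
print(describe_schedule("1 0 * * 1"))

run_time = datetime(2024, 7, 1, 0, 1, tzinfo=timezone.utc)  # a Monday
print(cleanup_cutoff(run_time, age_days=30))
```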
Periodic OpenStack database backups

MOSK uses the Mariabackup utility to back up the MariaDB Galera cluster data where the OpenStack data is stored. Mariabackup is launched periodically as a part of the Kubernetes CronJob that is included in any MOSK deployment and is suspended by default.

Note

If you are using the default backend to store the backup data, which is Ceph, you can increase the default size of a backup volume. However, make sure to configure the volume size before you deploy OpenStack.

For the default sizes and configuration details, refer to Size of a backup storage.

Enabling the periodic backup

MOSK enables you to configure the periodic backup of the OpenStack database through the OpenStackDeployment object. To enable the backup, use the following structure:

spec:
  features:
    database:
      backup:
        enabled: true

TechPreview

To enhance cloud security, you can enable encryption of OpenStack database backups using the OpenSSL aes-256-cbc encryption through the OpenStackDeployment custom resource. Refer to Backup encryption for configuration details.

By default, the backup job:

  • Runs backup on a daily basis at 01:00 AM

  • Creates incremental backups daily and full backups weekly

  • Keeps 10 latest full backups

  • Stores backups in the mariadb-phy-backup-data PVC

  • Has the backup timeout of 3600 seconds

  • Has the incremental backup type

To verify the configuration of the mariadb-phy-backup CronJob object, run:

kubectl -n openstack get cronjob mariadb-phy-backup
Example of a mariadb-phy-backup CronJob object
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  annotations:
    openstackhelm.openstack.org/release_uuid: ""
  creationTimestamp: "2020-09-08T14:13:48Z"
  managedFields:
  <<<skipped>>>>
  name: mariadb-phy-backup
  namespace: openstack
  resourceVersion: "726449"
  selfLink: /apis/batch/v1beta1/namespaces/openstack/cronjobs/mariadb-phy-backup
  uid: 88c9be21-a160-4de1-afcf-0853697dd1a1
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
      labels:
        application: mariadb-phy-backup
        component: backup
        release_group: openstack-mariadb
    spec:
      activeDeadlineSeconds: 4200
      backoffLimit: 0
      completions: 1
      parallelism: 1
      template:
        metadata:
          creationTimestamp: null
          labels:
            application: mariadb-phy-backup
            component: backup
            release_group: openstack-mariadb
        spec:
          containers:
          - command:
            - /tmp/mariadb_resque.py
            - backup
            - --backup-timeout
            - "3600"
            - --backup-type
            - incremental
            env:
            - name: MARIADB_BACKUPS_TO_KEEP
              value: "10"
            - name: MARIADB_BACKUP_PVC_NAME
              value: mariadb-phy-backup-data
            - name: MARIADB_FULL_BACKUP_CYCLE
              value: "604800"
            - name: MARIADB_REPLICAS
              value: "3"
            - name: MARIADB_BACKUP_REQUIRED_SPACE_RATIO
              value: "1.2"
            - name: MARIADB_RESQUE_RUNNER_IMAGE
              value: docker-dev-kaas-local.docker.mirantis.net/general/mariadb:10.4.14-bionic-20200812025059
            - name: MARIADB_RESQUE_RUNNER_SERVICE_ACCOUNT
              value: mariadb-phy-backup-runner
            - name: MARIADB_RESQUE_RUNNER_POD_NAME_PREFIX
              value: openstack-mariadb
            - name: MARIADB_POD_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            image: docker-dev-kaas-local.docker.mirantis.net/general/mariadb:10.4.14-bionic-20200812025059
            imagePullPolicy: IfNotPresent
            name: phy-backup
            resources: {}
            securityContext:
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /tmp
              name: pod-tmp
            - mountPath: /tmp/mariadb_resque.py
              name: mariadb-bin
              readOnly: true
              subPath: mariadb_resque.py
            - mountPath: /tmp/resque_runner.yaml.j2
              name: mariadb-bin
              readOnly: true
              subPath: resque_runner.yaml.j2
            - mountPath: /etc/mysql/admin_user.cnf
              name: mariadb-secrets
              readOnly: true
              subPath: admin_user.cnf
          dnsPolicy: ClusterFirst
          initContainers:
          - command:
            - kubernetes-entrypoint
            env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: INTERFACE_NAME
              value: eth0
            - name: PATH
              value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/
            - name: DEPENDENCY_SERVICE
            - name: DEPENDENCY_DAEMONSET
            - name: DEPENDENCY_CONTAINER
            - name: DEPENDENCY_POD_JSON
            - name: DEPENDENCY_CUSTOM_RESOURCE
            image: docker-dev-kaas-local.docker.mirantis.net/openstack/extra/kubernetes-entrypoint:v1.0.0-20200311160233
            imagePullPolicy: IfNotPresent
            name: init
            resources: {}
            securityContext:
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
              runAsUser: 65534
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          nodeSelector:
            openstack-control-plane: enabled
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext:
            runAsUser: 999
          serviceAccount: mariadb-phy-backup
          serviceAccountName: mariadb-phy-backup
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir: {}
            name: pod-tmp
          - name: mariadb-secrets
            secret:
              defaultMode: 292
              secretName: mariadb-secrets
          - configMap:
              defaultMode: 365
              name: mariadb-bin
            name: mariadb-bin
  schedule: 0 1 * * *
  successfulJobsHistoryLimit: 3
  suspend: false
Overriding the default configuration

To override the default configuration, set the parameters and environment variables that are passed to the CronJob as described in the tables below.

MariaDB backup: Configuration parameters

Parameter

Type

Default

Description

--backup-type

String

incremental

Type of a backup. The list of possible values includes:

  • incremental

    If the newest full backup is older than the value of the full_backup_cycle parameter, the system performs a full backup. Otherwise, the system performs an incremental backup of the newest full backup.

  • full

    Always performs only a full backup.

Usage example:

spec:
  features:
    database:
      backup:
        backup_type: incremental

--backup-timeout

Integer

21600

Timeout in seconds for the system to wait for the backup operation to succeed.

Usage example:

spec:
  services:
    database:
      mariadb:
        values:
          conf:
            phy_backup:
              backup_timeout: 30000

--allow-unsafe-backup

Boolean

false

Not recommended, removed since MOSK 22.4.

If set to true, enables the MariaDB cluster backup in a not fully operational cluster, where:

  • The current number of ready pods is not equal to MARIADB_REPLICAS.

  • Some replicas do not have healthy wsrep statuses.

Usage example:

spec:
  services:
    database:
      mariadb:
        values:
          conf:
            phy_backup:
              allow_unsafe_backup: true
MariaDB backup: Environment variables

Variable

Type

Default

Description

MARIADB_BACKUPS_TO_KEEP

Integer

10

Number of full backups to keep.

Usage example:

spec:
  features:
    database:
      backup:
        backups_to_keep: 3

MARIADB_BACKUP_PVC_NAME

String

mariadb-phy-backup-data

Persistent volume claim used to store backups.

Usage example:

spec:
  services:
    database:
      mariadb:
        values:
          conf:
            phy_backup:
              backup_pvc_name: mariadb-phy-backup-data

MARIADB_FULL_BACKUP_CYCLE

Integer

604800

Number of seconds that defines a period between two full backups. During this period, incremental backups are performed. The parameter is taken into account only if backup_type is set to incremental. Otherwise, it is ignored. For example, with full_backup_cycle set to 604800 seconds, a full backup is taken weekly and, if cron is set to 0 0 * * *, an incremental backup is performed on a daily basis.

Usage example:

spec:
  features:
    database:
      backup:
        full_backup_cycle: 70000

MARIADB_BACKUP_REQUIRED_SPACE_RATIO

Floating

1.2

Multiplier for the database size to predict the space required to create a backup, either full or incremental, and perform a restoration keeping the uncompressed backup files on the same file system as the compressed ones.

To estimate the value of MARIADB_BACKUP_REQUIRED_SPACE_RATIO, use the following formula: size of (1 uncompressed full backup + all related incremental uncompressed backups + 1 full compressed backup) in KB <= (DB_SIZE * MARIADB_BACKUP_REQUIRED_SPACE_RATIO) in KB.

The DB_SIZE is the disk space allocated in the MySQL data directory, which is /var/lib/mysql, for database data excluding the galera.cache and ib_logfile* files. This parameter prevents the backup PVC from filling up in the middle of the restoration and backup procedures. If the current available space is lower than DB_SIZE * MARIADB_BACKUP_REQUIRED_SPACE_RATIO, the backup script fails before the system starts the actual backup and the overall status of the backup job is failed.

Usage example:

spec:
  services:
    database:
      mariadb:
        values:
          conf:
            phy_backup:
              backup_required_space_ratio: 1.4

For example, to perform full backups monthly and incremental backups daily at 02:30 AM and keep the backups for the last six months, configure the database backup in your OpenStackDeployment object as follows:

spec:
  features:
    database:
      backup:
        enabled: true
        backups_to_keep: 6
        schedule_time: '30 2 * * *'
        full_backup_cycle: 2628000
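
The interplay between backup_type and full_backup_cycle described in the tables above can be sketched in Python as follows. The function name is hypothetical, and the handling of the case where no full backup exists yet (a full backup is taken) is an assumption, not documented behavior.

```python
SECONDS_PER_DAY = 86_400


def choose_backup_type(backup_type, newest_full_age_seconds, full_backup_cycle):
    """Decide whether the next run takes a full or an incremental backup.

    newest_full_age_seconds is None when no full backup exists yet;
    the sketch assumes that a full backup is taken in that case.
    """
    if backup_type == "full":
        return "full"  # full_backup_cycle is ignored for this type
    if newest_full_age_seconds is None:
        return "full"
    if newest_full_age_seconds > full_backup_cycle:
        return "full"
    return "incremental"


# Default weekly cycle: the newest full backup is three days old.
print(choose_backup_type("incremental", 3 * SECONDS_PER_DAY, 604_800))
# Default weekly cycle: the newest full backup is eight days old.
print(choose_backup_type("incremental", 8 * SECONDS_PER_DAY, 604_800))
# Length in days of the monthly cycle used in the example above.
print(round(2_628_000 / SECONDS_PER_DAY, 2))
```

The value 2628000 corresponds to roughly one twelfth of a 365-day year, which is why the example treats it as a monthly cycle.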
Remote storage for OpenStack database backups

By default, MOSK stores the OpenStack database backups locally in the Mirantis Ceph cluster, which is a part of the same cloud.

Alternatively, MOSK provides you with a capability to create remote backups using an external storage. This section contains configuration details for a remote backend to be used for the OpenStack data backup.

In general, the built-in automated backup routine saves the data to the mariadb-phy-backup-data PersistentVolumeClaim (PVC), which is provisioned from the StorageClass specified in the spec.persistent_volume_storage_class parameter of the OpenStackDeployment custom resource (CR).

Remote NFS storage for OpenStack database backups

TechPreview

Requirements
  • A preconfigured NFS server with an NFS share that the Unix backup and restore user has access to. By default, it is the same user that runs the MySQL server in the MariaDB image.

    To get the Unix user ID, run:

    kubectl -n openstack get cronjob mariadb-phy-backup -o jsonpath='{.spec.jobTemplate.spec.template.spec.securityContext.runAsUser}'
    

    Note

    Verify that the NFS server is accessible through the network from all of the OpenStack control plane nodes of the cluster.

  • The nfs-common package installed on all OpenStack control plane nodes.

Limitations
  • Only NFS Unix authentication is supported.

  • Removal of the NFS persistent volume does not automatically remove the data.

  • No validation of mount options. If mount options are specified incorrectly in the OpenStackDeployment CR, the mount command fails upon the creation of a backup runner pod.

Enabling the NFS backend

To enable the NFS backend, configure the following structure in the OpenStackDeployment object:

spec:
  features:
    database:
      backup:
        enabled: true
        backend: pv_nfs
        pv_nfs:
          server: <ip-address/dns-name-of-the-server>
          path: <path-to-the-share-folder-on-the-server>

TechPreview

To enhance cloud security, you can enable encryption of OpenStack database backups using the OpenSSL aes-256-cbc encryption through the OpenStackDeployment custom resource. Refer to Backup encryption for configuration details.

Optionally, MOSK enables you to set the required mount options for the NFS mount command. You can set as many options of mount as you need. For example:

spec:
  services:
    database:
      mariadb:
        values:
          volume:
            phy_backup:
              nfs:
                mountOptions:
                  - "nfsvers=4"
                  - "hard"
Synchronization of local MariaDB backups with a remote S3 storage

Available since MOSK 25.1 TechPreview

MOSK provides the capability to synchronize local MariaDB backups with a remote S3 storage. Distributing backups across multiple locations increases their safety. Optionally, backup archives stored in S3 can be encrypted on the server side.

To enable synchronization, you need to have a preconfigured S3 storage and a user account for access.

Limitations
  • Only one remote S3 storage can be configured

  • Disabling the S3 synchronization does not automatically remove the data

Enable the synchronization with the S3 storage
  1. Verify that the S3 storage is accessible through the network from all OpenStack control plane nodes.

  2. Create the secret to store credentials for access to the S3 storage:

    ---
    apiVersion: v1
    kind: Secret
    metadata:
      labels:
        openstack.lcm.mirantis.com/osdpl_secret: "true"
      name: mariadb-backup-s3-hidden
      namespace: openstack
    type: Opaque
    data:
      access_key: <ACCESS-KEY-FOR-S3-ACCOUNT>
      secret_key: <SECRET-KEY-FOR-S3-ACCOUNT>
      sse_kms_key_id: <SECRET-KEY-FOR-SERVER-SIDE-ENCRYPTION>
    
  3. Enable synchronization by adding the following structure to the OpenStackDeployment custom resource. For example, to use Ceph RadosGW as the S3 storage provider and enable server-side encryption for stored archives:

    spec:
      features:
        database:
          backup:
            enabled: true
            sync_remote:
              enabled: true
              remotes:
                << remote name >>:
                  conf:
                    type: s3
                    provider: Ceph
                    endpoint: <URL-TO-S3-STORAGE>
                    path: <BUCKET-NAME-FOR-BACKUPS-ON-S3-STORAGE>
                    server_side_encryption: aws:kms
                    access_key_id:
                      value_from:
                        secret_key_ref:
                          key: access_key
                          name: mariadb-backup-s3-hidden
                    secret_access_key:
                      value_from:
                        secret_key_ref:
                          key: secret_key
                          name: mariadb-backup-s3-hidden
                    sse_kms_key_id:
                      value_from:
                        secret_key_ref:
                          key: sse_kms_key_id
                          name: mariadb-backup-s3-hidden
    

    Alternatively, you can set the provider parameter to AWS if you prefer using AWS as a provider for S3 storage and omit the server_side_encryption and sse_kms_key_id parameters if encryption is not required.
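
The Secret in step 2 places the credentials under the data field. Kubernetes expects values under data to be base64-encoded, whereas the stringData field accepts plain text. The following Python sketch shows one way to produce the encoded values; the function name and the sample input are hypothetical.

```python
import base64


def encode_secret_value(plain_text):
    """Encode a plain-text value for the data field of a Kubernetes Secret."""
    return base64.b64encode(plain_text.encode("utf-8")).decode("ascii")


# Replace the sample string with the real access key, secret key,
# or server-side encryption key before pasting the result into the Secret.
print(encode_secret_value("example-access-key"))
```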

OpenStack message bus

The internal components of Mirantis OpenStack for Kubernetes (MOSK) coordinate their operations and exchange status information using the cluster’s message bus (RabbitMQ).

Exposable OpenStack notifications

Available since MOSK 22.5

MOSK enables you to configure OpenStack services to emit notification messages to the MOSK cluster messaging bus (RabbitMQ) every time an OpenStack resource, for example, an instance or image, changes its state due to a cloud user action or through its lifecycle. For example, the MOSK Compute service (OpenStack Nova) can publish the instance.create.end notification once a newly created instance is up and running.

Note

In certain cases, RabbitMQ notifications may prove unreliable, such as when the RabbitMQ server undergoes a restart or when communication between the server and the client reading the notifications breaks down. To improve reliability, Mirantis suggests using multiple channels to store notification events, including:

  • StackLight notifications

  • Storing audit as part of the OpenStack logs

Sample of an instance.create.end notification
{
    "event_type": "instance.create.end",
    "payload": {
        "nova_object.data": {
            "action_initiator_project": "6f70656e737461636b20342065766572",
            "action_initiator_user": "fake",
            "architecture": "x86_64",
            "auto_disk_config": "MANUAL",
            "availability_zone": "nova",
            "block_devices": [],
            "created_at": "2012-10-29T13:42:11Z",
            "deleted_at": null,
            "display_description": "some-server",
            "display_name": "some-server",
            "fault": null,
            "flavor": {
             "nova_object.data": {
              "description": null,
              "disabled": false,
              "ephemeral_gb": 0,
              "extra_specs": {
                  "hw:watchdog_action": "disabled"
              },
              "flavorid": "a22d5517-147c-4147-a0d1-e698df5cd4e3",
              "is_public": true,
              "memory_mb": 512,
              "name": "test_flavor",
              "projects": null,
              "root_gb": 1,
              "rxtx_factor": 1.0,
              "swap": 0,
              "vcpu_weight": 0,
              "vcpus": 1
             },
             "nova_object.name": "FlavorPayload",
             "nova_object.namespace": "nova",
             "nova_object.version": "1.4"
            },
            "host": "compute",
            "host_name": "some-server",
            "image_uuid": "155d900f-4e14-4e4c-a73d-069cbf4541e6",
            "instance_name": "instance-00000001",
            "ip_addresses": [
             {
              "nova_object.data": {
                  "address": "192.168.1.3",
                  "device_name": "tapce531f90-19",
                  "label": "private",
                  "mac": "fa:16:3e:4c:2c:30",
                  "meta": {},
                  "port_uuid": "ce531f90-199f-48c0-816c-13e38010b442",
                  "version": 4
              },
              "nova_object.name": "IpPayload",
              "nova_object.namespace": "nova",
              "nova_object.version": "1.0"
             }
            ],
            "kernel_id": "",
            "key_name": "my-key",
            "keypairs": [
             {
              "nova_object.data": {
                  "fingerprint": "1e:2c:9b:56:79:4b:45:77:f9:ca:7a:98:2c:b0:d5:3c",
                  "name": "my-key",
                  "public_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgQDx8nkQv/zgGgB4rMYmIf+6A4l6Rr+o/6lHBQdW5aYd44bd8JttDCE/F/pNRr0lRE+PiqSPO8nDPHw0010JeMH9gYgnnFlyY3/OcJ02RhIPyyxYpv9FhY+2YiUkpwFOcLImyrxEsYXpD/0d3ac30bNH6Sw9JD9UZHYcpSxsIbECHw== Generated-by-Nova",
                  "type": "ssh",
                  "user_id": "fake"
              },
              "nova_object.name": "KeypairPayload",
              "nova_object.namespace": "nova",
              "nova_object.version": "1.0"
             }
            ],
            "launched_at": "2012-10-29T13:42:11Z",
            "locked": false,
            "locked_reason": null,
            "metadata": {},
            "node": "fake-mini",
            "os_type": null,
            "power_state": "running",
            "progress": 0,
            "ramdisk_id": "",
            "request_id": "req-5b6c791d-5709-4f36-8fbe-c3e02869e35d",
            "reservation_id": "r-npxv0e40",
            "state": "active",
            "tags": [
             "tag"
            ],
            "task_state": null,
            "tenant_id": "6f70656e737461636b20342065766572",
            "terminated_at": null,
            "trusted_image_certificates": [
             "cert-id-1",
             "cert-id-2"
            ],
            "updated_at": "2012-10-29T13:42:11Z",
            "user_id": "fake",
            "uuid": "178b0921-8f85-4257-88b6-2e743b5a975c"
        },
        "nova_object.name": "InstanceCreatePayload",
        "nova_object.namespace": "nova",
        "nova_object.version": "1.12"
    },
    "priority": "INFO",
    "publisher_id": "nova-compute:compute"
}

OpenStack notification messages can be consumed and processed by various corporate systems to integrate MOSK clouds into the company infrastructure and business processes.
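
As a minimal illustration of such processing, the following Python sketch extracts a few fields from a notification shaped like the sample above. It assumes that the consumer has already received the message body as a JSON string and that any transport-level envelope has been removed. The function name and the choice of fields are hypothetical.

```python
import json

# Trimmed copy of the sample instance.create.end notification.
SAMPLE_NOTIFICATION = """
{
    "event_type": "instance.create.end",
    "payload": {
        "nova_object.data": {
            "display_name": "some-server",
            "host": "compute",
            "state": "active",
            "tenant_id": "6f70656e737461636b20342065766572",
            "uuid": "178b0921-8f85-4257-88b6-2e743b5a975c"
        },
        "nova_object.name": "InstanceCreatePayload"
    },
    "priority": "INFO",
    "publisher_id": "nova-compute:compute"
}
"""


def summarize_notification(raw_message):
    """Return the fields that an audit or statistics system might record."""
    message = json.loads(raw_message)
    data = message["payload"]["nova_object.data"]
    return {
        "event_type": message["event_type"],
        "publisher_id": message["publisher_id"],
        "instance_uuid": data["uuid"],
        "project_id": data["tenant_id"],
        "state": data["state"],
        "host": data["host"],
    }


summary = summarize_notification(SAMPLE_NOTIFICATION)
print(summary)
```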

The list of the most common use cases includes:

  • Using notification history for retrospective security audit

  • Using the real-time aggregation of notification messages to gather statistics on cloud resource consumption for further capacity planning

Cloud billing considerations

Notifications alone should not be considered a source of data for any kind of financial reporting. Message delivery cannot be guaranteed for various technical reasons. For example, messages can be lost if an external consumer does not fetch them from the queue fast enough.

Mirantis strongly recommends that your cloud billing solutions rely on the combination of the following data sources:

  • Periodic polling of the OpenStack API as a reliable source of information about allocated resources

  • Subscription to notifications to receive timely updates about the resource status change
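The combination of the two data sources can be sketched as follows. This is an illustrative Python model only, not a billing implementation: the UsageTracker class, its method names, and the resource states are hypothetical. Notifications update the local view immediately, while each API poll replaces the view with the authoritative snapshot and reports what had drifted, for example, because a notification was lost.

```python
class UsageTracker:
    """Keeps a local view of allocated resources (hypothetical helper)."""

    def __init__(self):
        self.resources = {}  # resource ID -> last known state

    def on_notification(self, resource_id, state):
        """Apply a timely update that may occasionally be lost."""
        if state == "deleted":
            self.resources.pop(resource_id, None)
        else:
            self.resources[resource_id] = state

    def on_poll(self, snapshot):
        """Replace the local view with the API snapshot.

        Returns the sorted IDs of resources whose local state differed
        from the snapshot, that is, the corrections made by the poll.
        """
        all_ids = set(self.resources) | set(snapshot)
        drift = sorted(
            rid for rid in all_ids
            if self.resources.get(rid) != snapshot.get(rid)
        )
        self.resources = dict(snapshot)
        return drift


tracker = UsageTracker()
tracker.on_notification("vm-1", "active")
tracker.on_notification("vm-2", "active")
# The notifications about deleting vm-1 and creating vm-3 were lost.
print(tracker.on_poll({"vm-2": "active", "vm-3": "active"}))
```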

If you are looking for a ready-to-use billing solution for your cloud, contact Mirantis or one of our partners.

A cloud administrator can securely expose part of a MOSK cluster message bus to the outside world. This enables an external consumer to subscribe to the notification messages emitted by the cluster services.

Important

The latest OpenStack release available in MOSK supports notifications from the following services:

  • Block storage (OpenStack Cinder)

  • DNS (OpenStack Designate)

  • Image (OpenStack Glance)

  • Orchestration (OpenStack Heat)

  • Bare Metal (OpenStack Ironic)

  • Identity (OpenStack Keystone)

  • Shared Filesystems (OpenStack Manila)

  • Instance High Availability (OpenStack Masakari)

  • Networking (OpenStack Neutron)

  • Compute (OpenStack Nova)

To enable the external notification endpoint, add the following structure to the OpenStackDeployment custom resource. For example:

spec:
  features:
    messaging:
      notifications:
        external:
          enabled: true
          topics:
            - external-consumer-A
            - external-consumer-2

For each topic name specified in the topics field, MOSK creates a topic exchange in its RabbitMQ cluster together with a set of queues bound to this topic. All enabled MOSK services will publish their notification messages to all configured topics so that multiple consumers can receive the same messages in parallel.

A topic name must follow the standard Kubernetes format for object names and IDs, that is, contain only lowercase alphanumeric characters, hyphens (-), or periods (.). The topic name notifications is reserved for internal use.
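The naming rule can be checked before the custom resource is updated. The following Python snippet is a minimal sketch that encodes only the two constraints stated in this section: the allowed character set and the reserved name. The is_valid_topic helper is hypothetical, and MOSK may enforce additional constraints.

```python
import re

# Lowercase alphanumeric characters, hyphens, and periods only.
_TOPIC_RE = re.compile(r"[a-z0-9.-]+")

# Reserved for internal use according to this section.
RESERVED_TOPICS = {"notifications"}


def is_valid_topic(name):
    """Return True if the name satisfies the rules stated in this section."""
    return bool(_TOPIC_RE.fullmatch(name)) and name not in RESERVED_TOPICS


for candidate in ("external-consumer-2", "notifications", "Billing_Feed"):
    print(candidate, is_valid_topic(candidate))
```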

MOSK supports the connection to the message bus (RabbitMQ) through an encrypted or non-encrypted endpoint. Once connected, it supports authentication through either a plain text user name and password or mutual TLS authentication using encrypted X.509 client certificates.

Each topic exchange is protected by automatically generated authentication credentials and certificates for secure connection that are stored as a secret in the openstack-external namespace of a MOSK underlying Kubernetes cluster. A secret is identified by the name of the topic. The list of attributes for the secret object includes:

  • hosts

    The IP addresses which an external notification endpoint is available on

  • port_amqp, port_amqp-tls

    The TCP ports which external notification endpoint is available on

  • vhost

    The name of the RabbitMQ virtual host which the topic queues are created on

  • username, password

    Authentication data

  • ca_cert

    The client CA certificate

  • client_cert

    The client certificate

  • client_key

    The client private key
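A consumer can turn these attributes into connection parameters. The following Python snippet is a sketch under stated assumptions: it expects the data mapping of the secret as returned by the Kubernetes API, where values are base64-encoded, and it assumes that hosts holds a comma-separated list of addresses. The exact value format is not described in this section, so verify it against a real secret. The helper names are hypothetical.

```python
import base64
from urllib.parse import quote


def decode_secret_data(data):
    """Decode the base64-encoded values of a Kubernetes secret."""
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in data.items()
    }


def amqp_url(fields, tls=False):
    """Build an AMQP URI from the decoded secret attributes."""
    scheme, port_key = ("amqps", "port_amqp-tls") if tls else ("amqp", "port_amqp")
    # Assumption: comma-separated addresses; the first one is used here.
    host = fields["hosts"].split(",")[0].strip()
    return "{}://{}:{}@{}:{}/{}".format(
        scheme,
        quote(fields["username"], safe=""),
        quote(fields["password"], safe=""),
        host,
        fields[port_key],
        quote(fields["vhost"], safe=""),
    )


# Made-up values for illustration only.
raw = {
    "hosts": "10.0.0.10,10.0.0.11",
    "port_amqp": "5672",
    "port_amqp-tls": "5671",
    "vhost": "openstack",
    "username": "user",
    "password": "p@ss",
}
encoded = {k: base64.b64encode(v.encode()).decode() for k, v in raw.items()}
fields = decode_secret_data(encoded)
print(amqp_url(fields, tls=True))
```

For mutual TLS authentication, the ca_cert, client_cert, and client_key attributes would additionally be written to files and passed to the TLS settings of the AMQP client in use.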

For the configuration example above, the following objects will be created:

kubectl -n openstack-external get secret

NAME                                            TYPE           DATA   AGE
openstack-external-consumer-A-notifications     Opaque         4      4m51s
openstack-external-consumer-2-notifications     Opaque         4      4m51s

Tungsten Fabric

Tungsten Fabric provides basic L2/L3 networking to an OpenStack environment running on the MKE cluster and includes the IP address management, security groups, floating IP addresses, and routing policies functionality. Tungsten Fabric is based on overlay networking, where all virtual machines are connected to a virtual network with encapsulation (MPLSoGRE, MPLSoUDP, VXLAN). This enables you to separate the underlay Kubernetes management network. A workload requires an external gateway, such as a hardware EdgeRouter or a simple gateway to route the outgoing traffic.

The Tungsten Fabric vRouter uses different gateways for the control and data planes.

Tungsten Fabric cluster

All services of Tungsten Fabric are delivered as separate containers, which are deployed by the Tungsten Fabric Operator (TFO). Each container has an INI-based configuration file that is available on the host system. The configuration file is generated automatically upon the container start and is based on environment variables provided by the TFO through Kubernetes ConfigMaps.

The main Tungsten Fabric containers run with the host network as DaemonSets, without using the Kubernetes networking layer. The services listen directly on the host network interface.

The following diagram describes the minimum production installation of Tungsten Fabric with a Mirantis OpenStack for Kubernetes (MOSK) deployment.

_images/tf-architecture.png

For the details about the Tungsten Fabric services included in MOSK deployments and the types of traffic and traffic flow directions, see the subsections below.

Tungsten Fabric cluster components

This section describes the Tungsten Fabric services and their distribution across the Mirantis OpenStack for Kubernetes (MOSK) deployment.

The Tungsten Fabric services run mostly as DaemonSets in separate containers for each service. The deployment and update processes are managed by the Tungsten Fabric Operator. However, Kubernetes manages the probe checks and restart of broken containers.

Configuration and control services

All configuration and control services run on the Tungsten Fabric Controller nodes.

Service name

Service description

config-api

Exposes a REST-based interface for the Tungsten Fabric API.

config-provisioner

Provisions the node for execution of configuration services.

control

Communicates with the cluster gateways using BGP and with the vRouter agents using XMPP, as well as redistributes appropriate networking information.

control-provisioner

Provisions the node for execution of control services.

device-manager

Manages physical networking devices using netconf or ovsdb. In multi-node deployments, it operates in the active-backup mode.

dns

Using the named service, provides the DNS service to the VMs spawned on different compute nodes. Each vRouter node connects to two Tungsten Fabric Controller containers that run the dns process.

named

The customized Berkeley Internet Name Domain (BIND) daemon of Tungsten Fabric that manages DNS zones for the dns service.

schema

Listens to configuration changes performed by a user and generates corresponding system configuration objects. In multi-node deployments, it works in the active-backup mode.

svc-monitor

Listens to configuration changes of service-template and service-instance, as well as spawns and monitors virtual machines for the firewall, analyzer services, and so on. In multi-node deployments, it works in the active-backup mode.

webui

Consists of the webserver and jobserver services. Provides the Tungsten Fabric web UI.

Analytics services

Deprecated since MOSK 24.1

All analytics services run on Tungsten Fabric analytics nodes.

Service name

Service description

alarm-gen

Evaluates and manages the alarms rules.

analytics-api

Provides a REST API to interact with the Cassandra analytics database.

analytics-nodemgr

Collects all Tungsten Fabric analytics process data and sends this information to the Tungsten Fabric collector.

analytics-database-nodemgr

Provisions the init model if needed. Collects data of the database process and sends it to the Tungsten Fabric collector.

collector

Collects and analyzes data from all Tungsten Fabric services.

query-engine

Handles the queries to access data from the Cassandra database.

snmp-collector

Receives the authorization and configuration of the physical routers from the config-nodemgr service, polls the physical routers using the Simple Network Management Protocol (SNMP), and uploads the data to the Tungsten Fabric collector.

topology

Reads the SNMP information from the physical router user-visible entities (UVEs), creates a neighbor list, and writes the neighbor information to the physical router UVEs. The Tungsten Fabric web UI uses the neighbor list to display the physical topology.

vRouter

The Tungsten Fabric vRouter provides data forwarding to an OpenStack tenant instance and reports statistics to the Tungsten Fabric analytics service. The Tungsten Fabric vRouter is installed on all OpenStack compute nodes. Mirantis OpenStack for Kubernetes (MOSK) supports the kernel-based deployment of the Tungsten Fabric vRouter.

vRouter services on the OpenStack compute nodes

Service name

Service description

vrouter-agent

Connects to the Tungsten Fabric Controller container and the Tungsten Fabric DNS system using the Extensible Messaging and Presence Protocol (XMPP). The vRouter Agent acts as a local control plane. Each Tungsten Fabric vRouter Agent is connected to at least two Tungsten Fabric controllers in an active-active redundancy mode.

The Tungsten Fabric vRouter Agent is responsible for all networking-related functions including routing instances, routes, and others.

The Tungsten Fabric vRouter uses different gateways for the control and data planes. For example, the Linux system gateway is located on the management network, and the Tungsten Fabric gateway is located on the data plane network.

vrouter-provisioner

Provisions the node for the vRouter agent execution.

The following diagram illustrates the Tungsten Fabric kernel vRouter set up by the TF operator:

_images/tf_vrouter.png

In the diagram above, the following types of network interfaces are used:

  • eth0 - for the management (PXE) network (eth1 and eth2 are the slave interfaces of Bond0)

  • Bond0.x - for the MKE control plane network

  • Bond0.y - for the MKE data plane network

Third-party services

Service name

Service description

cassandra

  • On the Tungsten Fabric control plane nodes, maintains the configuration data of the Tungsten Fabric cluster.

  • On the Tungsten Fabric analytics nodes, stores the collector service data.

cassandra-operator

The Kubernetes operator that enables the Cassandra clusters creation and management.

kafka

Handles the messaging bus and generates alarms across the Tungsten Fabric analytics containers.

kafka-operator

The Kubernetes operator that enables Kafka clusters creation and management.

redis

Provides storage for the physical router UVEs and serves as a messaging bus for event notifications.

redis-operator

The Kubernetes operator that enables Redis clusters creation and management.

zookeeper

Holds the active-backup status for the device-manager, svc-monitor, and the schema-transformer services. This service is also used for mapping of the Tungsten Fabric resources names to UUIDs.

zookeeper-operator

The Kubernetes operator that enables ZooKeeper clusters creation and management.

rabbitmq

Exchanges messages between API servers and original request senders.

rabbitmq-operator

The Kubernetes operator that enables RabbitMQ clusters creation and management.

Plugin services

All Tungsten Fabric plugin services are installed on the OpenStack Controller (Rockoon) nodes.

Service name

Service description

neutron-server

The Neutron server that includes the Tungsten Fabric plugin.

octavia-api

The Octavia API that includes the Tungsten Fabric Octavia driver.

heat-api

The Heat API that includes the Tungsten Fabric Heat resources and templates.

Image precaching DaemonSets

Along with the Tungsten Fabric services, MOSK deploys and updates special image precaching DaemonSets when the resource of kind TFOperator is created or the image references in it are updated. These DaemonSets precache container images on Kubernetes nodes, minimizing possible downtime when updating container images. A cloud operator can disable image precaching through the TFOperator resource.

Tungsten Fabric traffic flow

This section describes the types of traffic and traffic flow directions in a Mirantis OpenStack for Kubernetes (MOSK) cluster.

User interface and API traffic

The following diagram illustrates all types of UI and API traffic in a Mirantis OpenStack for Kubernetes cluster, including the monitoring and OpenStack API traffic. The OpenStack Dashboard pod hosts Horizon and acts as a proxy for all other types of traffic. TLS termination is also performed for this type of traffic.

_images/tf-traffic_flow_ui_api.png

SDN traffic

SDN or Tungsten Fabric traffic goes through the overlay Data network and processes east-west and north-south traffic for applications that run in a MOSK cluster. This network segment typically contains tenant networks as separate MPLS-over-GRE and MPLS-over-UDP tunnels. The traffic load depends on the workload.

The control traffic between the Tungsten Fabric controllers, edge routers, and vRouters uses the XMPP with TLS and iBGP protocols. Both protocols produce low traffic that does not affect MPLS over GRE and MPLS over UDP traffic. However, this traffic is critical and must be reliably delivered. Mirantis recommends configuring higher QoS for this type of traffic.

The following diagram displays both MPLS over GRE/MPLS over UDP and iBGP and XMPP traffic examples in a MOSK cluster:

_images/tf-traffic_flow_sdn.png

Tungsten Fabric lifecycle management

Mirantis OpenStack for Kubernetes (MOSK) provides the Tungsten Fabric lifecycle management including pre-deployment custom configurations, updates, data backup and restoration, as well as handling partial failure scenarios, by means of the Tungsten Fabric operator.

This section is intended for the cloud operators who want to gain insight into the capabilities provided by the Tungsten Fabric operator along with the understanding of how its architecture allows for easy management while addressing the concerns of users of Tungsten Fabric-based MOSK clusters.

Tungsten Fabric Operator

The Tungsten Fabric Operator (TFO) is based on the Kubernetes operator SDK project. The Kubernetes operator SDK is a framework that uses the controller-runtime library to make writing operators easier by providing the following:

  • High-level APIs and abstractions to write the operational logic more intuitively.

  • Tools for scaffolding and code generation to bootstrap a new project fast.

  • Extensions to cover common operator use cases.

The TFO deploys the following sub-operators. Each sub-operator handles a separate part of a TF deployment:

TFO sub-operators

Sub-operator

Description

TFControl

Deploys the Tungsten Fabric control services, such as:

  • Control

  • DNS

  • Control Provisioner 0

TFConfig

Deploys the Tungsten Fabric configuration services, such as:

  • API

  • Service monitor

  • Schema transformer

  • Device manager

  • Configuration Provisioner 0

  • Database Provisioner 0

TFAnalytics

Deploys the Tungsten Fabric analytics services, such as:

  • API

  • Collector

  • Alarm

  • Alarm-gen

  • SNMP

  • Topology

  • Alarm Provisioner 0

  • Database Provisioner 0

  • SNMP Provisioner 0

TFVrouter

Deploys a vRouter on each compute node with the following services:

  • vRouter Agent

  • Provisioner 0

TFWebUI

Deploys the following web UI services:

  • Web server

  • Job server

TFTool

Deploys the following tools for debug purposes:

  • TF-CLI

  • CTools

TFTest

An operator to run Tempest tests.

Footnote 0 (Provisioner services): Since MOSK 24.3, Provisioner is a separate component for the vRouter, deployed as the tf-vrouter-provisioner DaemonSet. The NodeManager service is no longer deployed in TF setups.

Besides the sub-operators that deploy TF services, TFO uses operators to deploy and maintain third-party services, such as different types of storage, cache, message system, and so on. The following table describes all third-party operators:

TFO third-party sub-operators

Sub-operator

Description

cassandra-operator

An upstream operator that automates the Cassandra HA storage operations for the configuration and analytics data.

zookeeper-operator

An upstream operator for deployment and automation of a ZooKeeper cluster.

kafka-operator

An operator for the Kafka cluster used by analytics services.

redis-operator

An upstream operator that automates the Redis cluster deployment and keeps it healthy.

rabbitmq-operator

An operator for the messaging system based on RabbitMQ.

The following diagram illustrates a simplified TFO workflow:

_images/tf-operator-workflow.png

TFOperator custom resource

The resource of kind TFOperator is a custom resource defined by a resource of kind CustomResourceDefinition.

The CustomResourceDefinition resource in Kubernetes uses the OpenAPI Specification version 3 to specify the schema of the defined resource. The Kubernetes API rejects resources that do not pass this schema validation. Along with schema validation, TFOperator uses ValidatingAdmissionWebhook for extended validations when a custom resource is created or updated.

Important

Since 24.1, MOSK introduces technical preview support for the API v2 for the Tungsten Fabric Operator. This version of the Tungsten Fabric Operator API aligns with the OpenStack Controller API and provides a better interface for advanced configurations. Refer to Key differences between TFOperator API v1alpha1 and v2 for details.

For the list of configuration options available to a cloud operator, refer to Tungsten Fabric configuration. Also, check out the Tungsten Fabric Operator resources document of the MOSK version that your cluster has been deployed with.

TFOperator custom resource validation

Tungsten Fabric Operator uses ValidatingAdmissionWebhook to validate environment variables set to Tungsten Fabric components upon the TFOperator object creation or update. The following validations are performed:

  • Environment variables passed to the Tungsten Fabric components containers

  • Mapping between tfVersion and tfImageTag, if defined

  • Schedule for dbBackup

  • Data capacity format

  • Feature variable values

  • Availability of the dataStorageClass class

If required, you can disable ValidatingAdmissionWebhook through the TFOperator HelmBundle resource:

apiVersion: lcm.mirantis.com/v1alpha1
kind: HelmBundle
metadata:
  name: tungstenfabric-operator
  namespace: tf
spec:
  releases:
  - name: tungstenfabric-operator
    values:
      admission:
        enabled: false

Environment variables for Tungsten Fabric components

Warning

The features section of the TFOperator specification allows for easy configuration of all Tungsten Fabric features. Mirantis does not recommend updating the environment variables through envSettings directly.

Allowed environment variables for Tungsten Fabric components (TFOperator API v2)

Environment variables

Tungsten Fabric service and envSettings name

  • INTROSPECT_LISTEN_ALL

analytics (alarmGen, api, collector, nodeMgr, query, snmp, topology); config (api, db-nodemgr, nodeMgr); control (control, dns, nodeMgr); vRouter (agent, nodeMgr)

  • PROVISION_DELAY

  • PROVISION_RETRIES

  • BGP_ASN

  • ENCAP_PRIORITY

  • VXLAN_VN_ID_MODE

analytics (provisioner); config (provisioner); control (provisioner); agent (provisioner); agentDPDK (provisioner)

  • CONFIG_API_LIST_OPTIMIZATION_ENABLED

  • CONFIG_API_WORKER_COUNT

  • CONFIG_API_MAX_REQUESTS

  • FWAAS_ENABLE

  • RABBITMQ_HEARTBEAT_INTERVAL

  • DISABLE_VNC_API_STATS

config (api)

  • DNS_NAMED_MAX_CACHE_SIZE

  • DNS_NAMED_MAX_RETRANSMISSIONS

  • DNS_RETRANSMISSION_INTERVAL

control (dns)

  • WEBUI_LOG_LEVEL

  • WEBUI_STATIC_AUTH_PASSWORD

  • WEBUI_STATIC_AUTH_ROLE

  • WEBUI_STATIC_AUTH_USER

webui (job, web)

  • ANALYTICS_CONFIG_AUDIT_TTL

  • ANALYTICS_DATA_TTL

  • ANALYTICS_FLOW_TTL

  • ANALYTICS_STATISTICS_TTL

  • COLLECTOR_disk_usage_percentage_high_watermark0

  • COLLECTOR_disk_usage_percentage_high_watermark1

  • COLLECTOR_disk_usage_percentage_high_watermark2

  • COLLECTOR_disk_usage_percentage_low_watermark0

  • COLLECTOR_disk_usage_percentage_low_watermark1

  • COLLECTOR_disk_usage_percentage_low_watermark2

  • COLLECTOR_high_watermark0_message_severity_level

  • COLLECTOR_high_watermark1_message_severity_level

  • COLLECTOR_high_watermark2_message_severity_level

  • COLLECTOR_low_watermark0_message_severity_level

  • COLLECTOR_low_watermark1_message_severity_level

  • COLLECTOR_low_watermark2_message_severity_level

  • COLLECTOR_pending_compaction_tasks_high_watermark0

  • COLLECTOR_pending_compaction_tasks_high_watermark1

  • COLLECTOR_pending_compaction_tasks_high_watermark2

  • COLLECTOR_pending_compaction_tasks_low_watermark0

  • COLLECTOR_pending_compaction_tasks_low_watermark1

  • COLLECTOR_pending_compaction_tasks_low_watermark2

  • COLLECTOR_LOG_FILE_COUNT

  • COLLECTOR_LOG_FILE_SIZE

analytics (collector)

  • ANALYTICS_DATA_TTL

  • QUERYENGINE_MAX_SLICE

  • QUERYENGINE_MAX_TASKS

  • QUERYENGINE_START_TIME

analytics (query)

  • SNMPCOLLECTOR_FAST_SCAN_FREQUENCY

  • SNMPCOLLECTOR_SCAN_FREQUENCY

analytics (snmp)

  • TOPOLOGY_SCAN_FREQUENCY

analytics (topology)

  • DPDK_UIO_DRIVER

  • PHYSICAL_INTERFACE

  • SRIOV_PHYSICAL_INTERFACE

  • SRIOV_PHYSICAL_NETWORK

  • SRIOV_VF

  • TSN_AGENT_MODE

  • TSN_NODES

  • AGENT_MODE

  • FABRIC_SNAT_HASH_TABLE_SIZE

  • PRIORITY_BANDWIDTH

  • PRIORITY_ID

  • PRIORITY_SCHEDULING

  • PRIORITY_TAGGING

  • QOS_DEF_HW_QUEUE

  • QOS_LOGICAL_QUEUES

  • QOS_QUEUE_ID

  • VROUTER_GATEWAY

  • HUGE_PAGES_2MB

  • HUGE_PAGES_1GB

  • DISABLE_TX_OFFLOAD

  • DISABLE_STATS_COLLECTION

vRouter (agent)

  • CPU_CORE_MASK

  • SERVICE_CORE_MASK

  • DPDK_CTRL_THREAD_MASK

  • DPDK_COMMAND_ADDITIONAL_ARGS

  • DPDK_MEM_PER_SOCKET

  • DPDK_UIO_DRIVER

  • HUGE_PAGES

  • HUGE_PAGES_DIR

  • NIC_OFFLOAD_ENABLE

  • DPDK_ENABLE_VLAN_FWRD

vRouter (agentDPDK)
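The mapping in the table above can be read as a lookup from a service and its envSettings name to the set of variables that the service accepts. The following Python snippet is an illustrative sketch of such a check for a few rows of the table. It is not the implementation of the admission webhook, and the ALLOWED mapping and the rejected_variables helper are hypothetical.

```python
# A subset of the table: (service, envSettings name) -> allowed variables.
ALLOWED = {
    ("control", "dns"): {
        "INTROSPECT_LISTEN_ALL",
        "DNS_NAMED_MAX_CACHE_SIZE",
        "DNS_NAMED_MAX_RETRANSMISSIONS",
        "DNS_RETRANSMISSION_INTERVAL",
    },
    ("analytics", "snmp"): {
        "INTROSPECT_LISTEN_ALL",
        "SNMPCOLLECTOR_FAST_SCAN_FREQUENCY",
        "SNMPCOLLECTOR_SCAN_FREQUENCY",
    },
    ("analytics", "topology"): {
        "INTROSPECT_LISTEN_ALL",
        "TOPOLOGY_SCAN_FREQUENCY",
    },
}


def rejected_variables(service, name, env):
    """Return the sorted names of variables not allowed for the service.

    The env argument is a list of {"name": ..., "value": ...} items.
    Services missing from ALLOWED have every variable reported.
    """
    allowed = ALLOWED.get((service, name), set())
    return sorted(var["name"] for var in env if var["name"] not in allowed)


env = [
    {"name": "DNS_NAMED_MAX_CACHE_SIZE", "value": "64M"},
    {"name": "BGP_ASN", "value": "64512"},
]
print(rejected_variables("control", "dns", env))
```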

Allowed environment variables for Tungsten Fabric components (TFOperator API v1alpha1)

Environment variables

Tungsten Fabric components and containers

  • INTROSPECT_LISTEN_ALL

  • LOG_DIR

  • LOG_LEVEL

  • LOG_LOCAL

tf-analytics (alarm-gen, api, collector, alarm-nodemgr, db-nodemgr, nodemgr, snmp-nodemgr, query-engine, snmp, topology); tf-config (api, db-nodemgr, nodemgr); tf-control (control, dns, nodemgr); tf-vrouter (agent, dpdk-nodemgr, nodemgr)

  • LOG_DIR

  • LOG_LEVEL

  • LOG_LOCAL

tf-config (config, devicemgr, schema, svc-monitor)

  • PROVISION_DELAY

  • PROVISION_RETRIES

  • BGP_ASN

  • ENCAP_PRIORITY

  • VXLAN_VN_ID_MODE

tf-analytics (alarm-provisioner, db-provisioner, provisioner, snmp-provisioner); tf-config (db-provisioner, provisioner); tf-control (provisioner); tf-vrouter (dpdk-provisioner, provisioner)

  • CONFIG_API_LIST_OPTIMIZATION_ENABLED

  • CONFIG_API_WORKER_COUNT

  • CONFIG_API_MAX_REQUESTS

  • FWAAS_ENABLE

  • RABBITMQ_HEARTBEAT_INTERVAL

  • DISABLE_VNC_API_STATS

tf-config (config)

  • DNS_NAMED_MAX_CACHE_SIZE

  • DNS_NAMED_MAX_RETRANSMISSIONS

  • DNS_RETRANSMISSION_INTERVAL

tf-control (dns)

  • WEBUI_LOG_LEVEL

  • WEBUI_STATIC_AUTH_PASSWORD

  • WEBUI_STATIC_AUTH_ROLE

  • WEBUI_STATIC_AUTH_USER

tf-webui (job, web)

  • ANALYTICS_CONFIG_AUDIT_TTL

  • ANALYTICS_DATA_TTL

  • ANALYTICS_FLOW_TTL

  • ANALYTICS_STATISTICS_TTL

  • COLLECTOR_disk_usage_percentage_high_watermark0

  • COLLECTOR_disk_usage_percentage_high_watermark1

  • COLLECTOR_disk_usage_percentage_high_watermark2

  • COLLECTOR_disk_usage_percentage_low_watermark0

  • COLLECTOR_disk_usage_percentage_low_watermark1

  • COLLECTOR_disk_usage_percentage_low_watermark2

  • COLLECTOR_high_watermark0_message_severity_level

  • COLLECTOR_high_watermark1_message_severity_level

  • COLLECTOR_high_watermark2_message_severity_level

  • COLLECTOR_low_watermark0_message_severity_level

  • COLLECTOR_low_watermark1_message_severity_level

  • COLLECTOR_low_watermark2_message_severity_level

  • COLLECTOR_pending_compaction_tasks_high_watermark0

  • COLLECTOR_pending_compaction_tasks_high_watermark1

  • COLLECTOR_pending_compaction_tasks_high_watermark2

  • COLLECTOR_pending_compaction_tasks_low_watermark0

  • COLLECTOR_pending_compaction_tasks_low_watermark1

  • COLLECTOR_pending_compaction_tasks_low_watermark2

  • COLLECTOR_LOG_FILE_COUNT

  • COLLECTOR_LOG_FILE_SIZE

tf-analytics (collector)

  • ANALYTICS_DATA_TTL

  • QUERYENGINE_MAX_SLICE

  • QUERYENGINE_MAX_TASKS

  • QUERYENGINE_START_TIME

tf-analytics (query-engine)

  • SNMPCOLLECTOR_FAST_SCAN_FREQUENCY

  • SNMPCOLLECTOR_SCAN_FREQUENCY

tf-analytics (snmp)

  • TOPOLOGY_SCAN_FREQUENCY

tf-analytics (topology)

  • DPDK_UIO_DRIVER

  • PHYSICAL_INTERFACE

  • SRIOV_PHYSICAL_INTERFACE

  • SRIOV_PHYSICAL_NETWORK

  • SRIOV_VF

  • TSN_AGENT_MODE

  • TSN_NODES

  • AGENT_MODE

  • FABRIC_SNAT_HASH_TABLE_SIZE

  • PRIORITY_BANDWIDTH

  • PRIORITY_ID

  • PRIORITY_SCHEDULING

  • PRIORITY_TAGGING

  • QOS_DEF_HW_QUEUE

  • QOS_LOGICAL_QUEUES

  • QOS_QUEUE_ID

  • VROUTER_GATEWAY

  • HUGE_PAGES_2MB

  • HUGE_PAGES_1GB

  • DISABLE_TX_OFFLOAD

  • DISABLE_STATS_COLLECTION

tf-vrouter (agent)

  • CPU_CORE_MASK

  • SERVICE_CORE_MASK

  • DPDK_CTRL_THREAD_MASK

  • DPDK_COMMAND_ADDITIONAL_ARGS

  • DPDK_MEM_PER_SOCKET

  • DPDK_UIO_DRIVER

  • HUGE_PAGES

  • HUGE_PAGES_DIR

  • NIC_OFFLOAD_ENABLE

  • DPDK_ENABLE_VLAN_FWRD

tf-vrouter (agent-dpdk)

Key differences between TFOperator API v1alpha1 and v2

This section outlines the main differences between the v1alpha1 and v2 versions of the TFOperator API:

  • Introduction of the features section:

    • All non-default Tungsten Fabric and Tungsten Fabric Operator features can now be set in the features section.

    • Setting environment variables is no longer necessary but can still be done using the envSettings field in each Tungsten Fabric service section.

  • Relocation of CustomSpec from the vRouter agent specification to the nodes section.

  • Reorganization of the controllers section:

    • The controllers section has been integrated into the services section.

    • The services section is now divided into groups: analytics, config, control, vRouter, and webUI.

    • Configuration of third-party services can be performed through the analytics or config sections.

  • Configuration of the logging levels can be performed using the logging field, which is a separate field in each Tungsten Fabric services configuration.

  • Movement of the dataStorageClass and tfVersion fields to the upper level of the specification.

  • Introduction of the devOptions section enabling the setup of experimental development-related options.

Tungsten Fabric configuration

Mirantis OpenStack for Kubernetes (MOSK) allows you to easily adapt your Tungsten Fabric deployment to the needs of your environment through the TFOperator custom resource.

This section includes custom configuration details available to you.

Important

Since 24.1, MOSK introduces technical preview support for the API v2 for the Tungsten Fabric Operator. This version of the Tungsten Fabric Operator API aligns with the OpenStack Controller API and provides a better interface for advanced configurations. In MOSK 24.1, the API v2 is available only for new product deployments with Tungsten Fabric.

Since 24.2, the API v2 is the default for new product deployments and includes the ability to convert an existing v1alpha1 TFOperator to v2 during update.

During the update to the 24.3 series, the old Tungsten Fabric cluster configuration API v1alpha1 is automatically converted and replaced with the v2 version. Therefore, since MOSK 24.3, start using the v2 TFOperator custom resource for any updates. The v1alpha1 TFOperator custom resource remains in the cluster but is no longer reconciled and will be automatically removed in MOSK 25.1.

Cassandra configuration

This section describes the Cassandra configuration through the Tungsten Fabric Operator custom resource.

Cassandra resource limits configuration

By default, Tungsten Fabric Operator sets up the following resource limits for Cassandra analytics and configuration StatefulSets:

Limits:
  cpu:     8
  memory:  32Gi
Requests:
  cpu:     1
  memory:  16Gi

This is a verified configuration suitable for most cases. However, if nodes are under a heavy load, the KubeContainerCPUThrottlingHigh StackLight alert may fire for Tungsten Fabric Pods of the tf-cassandra-analytics and tf-cassandra-config StatefulSets. If such alerts appear constantly, you can increase the limits through the TFOperator custom resource. For example:

spec:
  services:
    analytics:
      enabled: true
      cassandra:
        resources:
          limits:
            cpu: "12"
            memory: 32Gi
          requests:
            cpu: "2"
            memory: 16Gi
    config:
      cassandra:
        resources:
          limits:
            cpu: "12"
            memory: 32Gi
          requests:
            cpu: "2"
            memory: 16Gi

The equivalent configuration for the TFOperator API v1alpha1:

spec:
  controllers:
    cassandra:
      deployments:
      - name: tf-cassandra-config
        resources:
          limits:
            cpu: "12"
            memory: 32Gi
          requests:
            cpu: "2"
            memory: 16Gi
      - name: tf-cassandra-analytics
        resources:
          limits:
            cpu: "12"
            memory: 32Gi
          requests:
            cpu: "2"
            memory: 16Gi

Custom configuration

To specify custom configurations for Cassandra clusters, use the configOptions settings in the TFOperator custom resource. For example, you may need to increase the file cache size in case of a heavy load on the nodes labeled with tfanalyticsdb=enabled or tfconfigdb=enabled:

spec:
  services:
    analytics:
      enabled: true
      cassandra:
        configOptions:
          file_cache_size_in_mb: 1024

The equivalent configuration for the TFOperator API v1alpha1:

spec:
  controllers:
    cassandra:
      deployments:
      - name: tf-cassandra-analytics
        configOptions:
          file_cache_size_in_mb: 1024

Custom vRouter settings

TechPreview

Depending on the Tungsten Fabric Operator API version in use, proceed with one of the following options:

TFOperator API v2:

To specify custom settings for the Tungsten Fabric vRouter nodes, for example, to change the name of the tunnel network interface or enable debug level logging on some subset of nodes, use the nodes settings in the TFOperator custom resource.

For example, to enable debug level logging on a specific node or multiple nodes:

spec:
  nodes:
    <CUSTOMSPEC-NAME>:
      labels:
        name: <NODE-LABEL>
        value: <NODE-LABEL-VALUE>
      nodeVRouter:
        enabled: true
        envSettings:
          agent:
            env:
            - name: LOG_LEVEL
              value: SYS_DEBUG

TFOperator API v1alpha1:

To specify custom settings for the Tungsten Fabric vRouter nodes, for example, to change the name of the tunnel network interface or enable debug level logging on some subset of nodes, use the customSpecs settings in the TFOperator custom resource.

For example, to enable debug level logging on a specific node or multiple nodes:

spec:
  controllers:
    tf-vrouter:
      agent:
        customSpecs:
        - name: <CUSTOMSPEC-NAME>
          label:
            name: <NODE-LABEL>
            value: <NODE-LABEL-VALUE>
          containers:
          - name: agent
            env:
            - name: LOG_LEVEL
              value: SYS_DEBUG

Caution

The customSpecs:name value must follow the RFC 1123 naming format. Verify that the name of the DaemonSet object is a valid DNS subdomain name.
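A name can be checked against the DNS subdomain rules before it is applied. The following Python snippet is a minimal sketch based on the general RFC 1123 conventions as used for Kubernetes object names: at most 253 characters in total, dot-separated labels of lowercase alphanumeric characters and hyphens, each label starting and ending with an alphanumeric character. The is_dns_subdomain helper is hypothetical, and the 63-character label limit comes from the DNS RFCs rather than from this document.

```python
import re

# One DNS label: alphanumeric at both ends, hyphens allowed inside.
_LABEL_RE = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def is_dns_subdomain(name):
    """Return True if the name is a valid RFC 1123 DNS subdomain name."""
    if not name or len(name) > 253:
        return False
    return all(
        len(label) <= 63 and _LABEL_RE.fullmatch(label) is not None
        for label in name.split(".")
    )


for candidate in ("debug-rack1", "Debug_Rack", "-rack1"):
    print(candidate, is_dns_subdomain(candidate))
```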

The customSpecs parameter inherits all settings for the tf-vrouter containers that are set on the spec:controllers:agent level and overrides or adds additional parameters. The example configuration above overrides the logging level from SYS_INFO, which is the default logging level, to SYS_DEBUG.
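The inherit-and-override behavior can be pictured as a merge of environment variable lists keyed by variable name. The sketch below only illustrates that idea and is not the operator's actual merge logic. The `merge_env` helper and the `VROUTER_GATEWAY` value are hypothetical.

```python
def merge_env(base, overrides):
    """Merge two lists of {"name", "value"} entries; overrides win by name."""
    merged = {item["name"]: item["value"] for item in base}
    for item in overrides:
        # An existing variable is overridden, a new one is appended.
        merged[item["name"]] = item["value"]
    return [{"name": name, "value": value} for name, value in merged.items()]


# Settings assumed to be defined for all vRouter agents (illustrative values).
base_env = [
    {"name": "LOG_LEVEL", "value": "SYS_INFO"},
    {"name": "VROUTER_GATEWAY", "value": "10.0.0.1"},
]
# Settings defined in one customSpecs entry.
custom_env = [{"name": "LOG_LEVEL", "value": "SYS_DEBUG"}]

if __name__ == "__main__":
    print(merge_env(base_env, custom_env))
```

In this picture, the nodes matched by the custom spec label keep the inherited gateway setting and only change the logging level.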

For clusters with a multi-rack architecture, you may need to redefine the gateway IP for the Tungsten Fabric vRouter nodes using the VROUTER_GATEWAY parameter. For details, see Multi-rack architecture.

Control plane traffic interface

By default, the TF control service uses the management interface for the BGP and XMPP traffic. You can change the control service interface using the controlInterface parameter in the TFOperator custom resource, for example, to combine the BGP and XMPP traffic with the data (tenant) traffic:

spec:
  features:
    control:
      controlInterface: <tunnel-interface>

spec:
  settings:
    controlInterface: <tunnel-interface>

Traffic encapsulation

Tungsten Fabric implements cloud tenants’ virtual networks as Layer 3 overlays. Tenant traffic gets encapsulated into one of the supported protocols and is carried over the infrastructure network between two compute nodes or between a compute node and an edge router device.

In addition, Tungsten Fabric is capable of exchanging encapsulated traffic with external systems in order to build advanced virtual networking topologies, for example, BGP VPN connectivity between two MOSK clouds or between a MOSK cloud and a cloud tenant premises.

MOSK supports the following encapsulation protocols:

  • MPLS over Generic Routing Encapsulation (GRE)

    A traditional encapsulation method supported by several router vendors, including Cisco and Juniper. The feature is applicable when other encapsulation methods are not available. For example, an SDN gateway runs software that does not support MPLS over UDP.

  • MPLS over User Datagram Protocol (UDP)

    A variation of the MPLS over GRE mechanism. It is the default and the most frequently used option in MOSK. MPLS over UDP replaces the GRE header with a UDP header. In this case, the UDP source port stores a hash of the packet payload (entropy), which provides a significant benefit for equal-cost multi-path (ECMP) routing load balancing. MPLS over UDP and MPLS over GRE transfer Layer 3 traffic only.

  • Virtual Extensible LAN (VXLAN) TechPreview

    The combination of VXLAN and EVPN technologies is often used for creating advanced cloud networking topologies. For example, it can provide transparent Layer 2 interconnections between Virtual Network Functions running on top of the cloud and physical traffic generator appliances hosted somewhere else.
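To illustrate why carrying entropy in a UDP port helps ECMP, the sketch below derives a UDP source port from a hash of the inner flow. It is a simplified model and not the vRouter implementation. The CRC32 hash and the flow key layout are assumptions, and the port range 49152-65535 is the dynamic range that RFC 7510 recommends for MPLS over UDP.

```python
import zlib

# Dynamic port range recommended by RFC 7510 for the MPLS-over-UDP source port.
PORT_MIN, PORT_MAX = 49152, 65535


def entropy_source_port(src_ip, dst_ip, proto, src_port, dst_port):
    """Map an inner flow to a stable outer UDP source port."""
    flow_key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    flow_hash = zlib.crc32(flow_key)  # assumption: any stable hash would do
    return PORT_MIN + flow_hash % (PORT_MAX - PORT_MIN + 1)


if __name__ == "__main__":
    # Illustrative tenant flows between the same pair of compute nodes.
    print(entropy_source_port("192.168.1.10", "192.168.2.20", "tcp", 40000, 443))
    print(entropy_source_port("192.168.1.11", "192.168.2.20", "tcp", 40001, 443))
```

All packets of one flow get the same outer source port, so they follow one path and stay in order, while different flows can be spread across equal-cost paths by routers that hash on the outer UDP header.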

Encapsulation priority

The ENCAP_PRIORITY parameter defines the order in which the encapsulation protocols are attempted when setting up the BGP VPN connectivity between the cloud and external systems.

By default, the encapsulation order is set to MPLSoUDP,MPLSoGRE,VXLAN. The cloud operator can change it depending on their needs in the TFOperator custom resource as illustrated in Configuring encapsulation.

The list of supported encapsulation methods along with their order is shared between BGP peers as part of the capabilities information exchange when establishing a BGP session. Both parties must support the same encapsulation methods to build a tunnel for the network traffic.

For example, if the cloud operator wants to set up a Layer 2 VPN between the cloud and their network infrastructure, they configure the cloud’s virtual networks with VXLAN identifiers (VNIs) and do the same on the other side, for example, on a network switch. Also, VXLAN must be set in the first position in encapsulation priority order. Otherwise, VXLAN tunnels will not get established between endpoints, even though both endpoints may support the VXLAN protocol.

However, setting VXLAN first in the encapsulation priority order will not enforce VXLAN encapsulation between compute nodes or between compute nodes and gateway routers that use Layer 3 VPNs for communication.
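Before changing the priority order, it may help to validate the value. The sketch below is a hypothetical pre-flight check, not part of MOSK tooling. It assumes that the value is a comma-separated list of the three method names mentioned in this section and checks whether VXLAN is in the first position, as the Layer 2 VPN use case requires.

```python
SUPPORTED_ENCAPSULATIONS = ("MPLSoUDP", "MPLSoGRE", "VXLAN")


def parse_encap_priority(value):
    """Split a priority string into methods, rejecting unknown or repeated names."""
    methods = [m.strip() for m in value.split(",") if m.strip()]
    unknown = [m for m in methods if m not in SUPPORTED_ENCAPSULATIONS]
    if unknown:
        raise ValueError(f"unsupported encapsulation methods: {unknown}")
    if len(set(methods)) != len(methods):
        raise ValueError("encapsulation methods must not repeat")
    return methods


def vxlan_is_preferred(value):
    """Return True if VXLAN is in the first position of the priority order."""
    methods = parse_encap_priority(value)
    return bool(methods) and methods[0] == "VXLAN"


if __name__ == "__main__":
    print(vxlan_is_preferred("MPLSoUDP,MPLSoGRE,VXLAN"))  # default order
    print(vxlan_is_preferred("VXLAN,MPLSoUDP,MPLSoGRE"))  # Layer 2 VPN order
```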

Configuring encapsulation

The TFOperator custom resource allows you to define encapsulation settings for your Tungsten Fabric cluster.

Important

The TFOperator custom resource must be the only place to configure the cluster encapsulation. Performing these configurations through the Tungsten Fabric web UI, CLI, or API does not provide the configuration persistency, and the settings defined this way may get reset to defaults during the cluster services restart or update.

Note

Defining the default values for encapsulation parameters in the TFOperator custom resource is unnecessary.

Depending on the Tungsten Fabric Operator API version in use, proceed with one of the following options:

Encapsulation settings

| Parameter | Default value | Description |
|---|---|---|
| encapPriority | MPLSoUDP,MPLSoGRE,VXLAN | Defines the encapsulation priority order. |
| vxlanVnIdMode | automatic | Defines the Virtual Network ID type. Possible values: automatic, to assign the VXLAN identifier to virtual networks automatically; configured, to make cloud users explicitly provide the VXLAN identifier for the virtual networks. |

Example configuration:

features:
  config:
    vxlanVnIdMode: automatic
    encapPriority: VXLAN,MPLSoUDP,MPLSoGRE

Encapsulation settings

| Parameter | Default value | Description |
|---|---|---|
| ENCAP_PRIORITY | MPLSoUDP,MPLSoGRE,VXLAN | Defines the encapsulation priority order. |
| VXLAN_VN_ID_MODE | automatic | Defines the Virtual Network ID type. Possible values: automatic, to assign the VXLAN identifier to virtual networks automatically; configured, to make cloud users explicitly provide the VXLAN identifier for the virtual networks. |

Typically, for a Layer 2 VPN use case, the VXLAN_VN_ID_MODE parameter is set to configured.

Example configuration:

controllers:
  tf-config:
    provisioner:
      containers:
      - env:
        - name: VXLAN_VN_ID_MODE
          value: automatic
        - name: ENCAP_PRIORITY
          value: VXLAN,MPLSoUDP,MPLSoGRE
        name: provisioner

Autonomous System Number (ASN)

In the routing fabric of a data center, a MOSK cluster with Tungsten Fabric enabled can be represented either by a separate Autonomous System (AS) or as part of a bigger autonomous system. In either case, Tungsten Fabric needs to participate in the BGP peering, exchanging routes with external devices and within the cloud.

The Tungsten Fabric Controller acts as an internal (iBGP) route reflector for the cloud AS by populating /32 routes pointing to VMs across all compute nodes as well as the cloud’s edge gateway devices in case they belong to the same AS. Apart from being an iBGP route reflector for the cloud AS, the Tungsten Fabric Controller can act as a BGP peer for autonomous systems external to the cloud, for example, for the AS configured across the leaf-spine fabric of the data center.

The Autonomous System Number (ASN) setting contains the unique identifier of the autonomous system that the MOSK cluster with Tungsten Fabric belongs to. The ASN does not affect the internal iBGP communication between vRouters running on the compute nodes. Such communication works regardless of the ASN settings. However, any network appliance that is not managed by the Tungsten Fabric control plane has BGP configured manually. Therefore, the ASN settings must be configured consistently on both sides. Otherwise, BGP sessions cannot be established, regardless of whether the external device peers with Tungsten Fabric over iBGP or eBGP.
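When choosing an ASN, it can be useful to check whether the number fits into the 2-byte format and whether it belongs to a range reserved for private use. The following sketch relies on the private-use ranges defined in RFC 6996 (64512-65534 and 4200000000-4294967294). The `describe_asn` helper is hypothetical and is not part of MOSK.

```python
MAX_2BYTE_ASN = 65535
MAX_4BYTE_ASN = 4294967295
# Private-use ranges reserved by RFC 6996.
PRIVATE_RANGES = ((64512, 65534), (4200000000, 4294967294))


def describe_asn(asn):
    """Report whether an ASN needs the 4-byte format and whether it is private."""
    if not 0 <= asn <= MAX_4BYTE_ASN:
        raise ValueError(f"ASN out of range: {asn}")
    return {
        "needs_4byte": asn > MAX_2BYTE_ASN,
        "private": any(low <= asn <= high for low, high in PRIVATE_RANGES),
    }


if __name__ == "__main__":
    print(describe_asn(64512))       # default ASN mentioned in this section
    print(describe_asn(4200000001))  # illustrative 4-byte private ASN
```

An ASN for which `needs_4byte` is true cannot be represented without enabling the 4-byte ASN format on both peers.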

Configuring ASNs

The TFOperator custom resource enables you to define ASN settings for your Tungsten Fabric cluster.

Important

The TFOperator CR must be the only place to configure the cluster ASN. Performing these configurations through the Tungsten Fabric web UI, CLI, or API does not provide the configuration persistency, and the settings defined this way may get reset to defaults during the cluster services restart or update.

Note

Defining the default values for ASN parameters in the Tungsten Fabric Operator custom resource is unnecessary.

Depending on the Tungsten Fabric Operator API version in use, proceed with one of the following options:

ASN settings

| Parameter | Default value | Description |
|---|---|---|
| bgpAsn | 64512 | Defines the ASN of the control node. |
| enable4ByteAS | false | Enables the 4-byte ASN format. |

Example configuration:

features:
  config:
    bgpAsn: 64515
    enable4ByteAS: true

ASN settings

| Parameter | Default value | Description |
|---|---|---|
| BGP_ASN | 64512 | Defines the ASN of the control node. |
| ENABLE_4BYTE_AS | FALSE | Enables the 4-byte ASN format. |

Example configuration:

controllers:
  tf-config:
    provisioner:
      containers:
      - env:
        - name: BGP_ASN
          value: "64515"
        - name: ENABLE_4BYTE_AS
          value: "true"
        name: provisioner
  tf-control:
    provisioner:
      containers:
      - env:
        - name: BGP_ASN
          value: "64515"
        name: provisioner

Access to external DNS

By default, the Tungsten Fabric tf-control-dns-external service is created to expose the Tungsten Fabric control DNS. You can disable creation of this service through the enableDNSExternal parameter in the TFOperator custom resource. For example:

spec:
  features:
    control:
      enableDNSExternal: false

spec:
  controllers:
    tf-control:
      dns:
        enableDNSExternal: false

Gateway for vRouter data plane network

If an edge router is accessible from the data plane through a gateway, define the vRouter gateway in the TFOperator custom resource. Otherwise, the default system gateway is used.

Depending on the Tungsten Fabric Operator API version in use, proceed with one of the following configurations:

Define the vRouterGateway parameter in the features section of the TFOperator custom resource:

spec:
  features:
    vRouter:
      vRouterGateway: <data-plane-network-gateway>

You can also configure the parameter for Tungsten Fabric vRouter in the DPDK mode:

spec:
  services:
    vRouter:
      agentDPDK:
        enabled: true
        envSettings:
          agent:
            env:
            - name: VROUTER_GATEWAY
              value: <data-plane-network-gateway>

Define the VROUTER_GATEWAY parameter in the TFOperator custom resource:

spec:
  controllers:
    tf-vrouter:
      agent:
        containers:
        - name: agent
          env:
          - name: VROUTER_GATEWAY
            value: <data-plane-network-gateway>

You can also configure the parameter for Tungsten Fabric vRouter in the DPDK mode:

spec:
  controllers:
    tf-vrouter:
      agent-dpdk:
        enabled: true
        containers:
        - name: agent
          env:
          - name: VROUTER_GATEWAY
            value: <data-plane-network-gateway>

Tungsten Fabric image precaching

By default, MOSK deploys image precaching DaemonSets to minimize possible downtime when updating container images. You can disable creation of these DaemonSets by setting the imagePreCaching parameter in the TFOperator custom resource to false:

spec:
  features:
    imagePreCaching: false

spec:
  settings:
    imagePreCaching: false

Graceful restart and long-lived graceful restart

Available since MOSK 23.2 for Tungsten Fabric 21.4 only TechPreview

Graceful restart and long-lived graceful restart are vital mechanisms within BGP (Border Gateway Protocol) routing, designed to optimize the routing tables convergence in scenarios where a BGP router restarts or a networking failure is experienced, leading to interruptions of router peering.

During a graceful restart, a router can signal its BGP peers about its impending restart, requesting them to retain the routes it had previously advertised as active. This allows for seamless network operation and minimal disruption to data forwarding during the router downtime.

The long-lived aspect of the long-lived graceful restart extends the graceful restart effectiveness beyond the usual restart duration. This extension provides an additional layer of resilience and stability to BGP routing updates, bolstering the network ability to manage unforeseen disruptions.

Caution

Mirantis does not generally recommend using the graceful restart and long-lived graceful restart features with the Tungsten Fabric XMPP helper, unless the configuration is done by proficient operators with at-scale expertise in the networking domain and exclusively to address specific corner cases.

Configuring graceful restart and long-lived graceful restart

Tungsten Fabric Operator allows for easy enablement and configuration of the graceful restart and long-lived graceful restart features through the TFOperator custom resource:

spec:
  features:
    control:
      gracefulRestart:
        enabled: <BOOLEAN>
        bgpHelperEnabled: <BOOLEAN>
        xmppHelperEnabled: <BOOLEAN>
        restartTime: <TIME_IN_SECONDS>
        llgrRestartTime: <TIME_IN_SECONDS>
        endOfRibTimeout: <TIME_IN_SECONDS>

spec:
  settings:
    settings:
      gracefulRestart:
        enabled: <BOOLEAN>
        bgpHelperEnabled: <BOOLEAN>
        xmppHelperEnabled: <BOOLEAN>
        restartTime: <TIME_IN_SECONDS>
        llgrRestartTime: <TIME_IN_SECONDS>
        endOfRibTimeout: <TIME_IN_SECONDS>

Graceful restart and long-lived graceful restart settings

| Parameter | Default value | Description |
|---|---|---|
| enabled | false | Enables or disables the graceful restart and long-lived graceful restart features. |
| bgpHelperEnabled | false | Enables the Tungsten Fabric control services to act as a graceful restart helper to the edge router or any other BGP peer by retaining the routes learned from this peer and advertising them to the rest of the network as applicable. Note: the BGP peer must support graceful restart and have it configured for all of the address families in use. |
| xmppHelperEnabled | false | Enables the datapath agent to retain the last route path from the Tungsten Fabric Controller when an XMPP-based connection is lost. |
| restartTime | 300 | Configures a non-zero restart time, in seconds, that is advertised to peers as part of the graceful restart capability. |
| llgrRestartTime | 300 | Specifies the amount of time, in seconds, that the vRouter datapath keeps the routes advertised by the Tungsten Fabric control services when the XMPP connection between the control and vRouter agent services is lost. Note: when graceful restart and long-lived graceful restart are both configured, the duration of the long-lived graceful restart timer is the sum of both timers. |
| endOfRibTimeout | 300 | Specifies the amount of time, in seconds, that a control node waits before removing stale routes from a vRouter agent Routing Information Base (RIB). |
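The note on timer interaction can be expressed as simple arithmetic. The sketch below assumes that both graceful restart and long-lived graceful restart are configured and uses the default values from the settings above. The function name is hypothetical.

```python
def effective_llgr_seconds(restart_time=300, llgr_restart_time=300):
    """Return the long-lived graceful restart duration when both timers are set."""
    if restart_time < 0 or llgr_restart_time < 0:
        raise ValueError("timers must be non-negative")
    # Per the note on llgrRestartTime, the effective duration is the sum of both timers.
    return restart_time + llgr_restart_time


if __name__ == "__main__":
    print(effective_llgr_seconds())         # both defaults
    print(effective_llgr_seconds(60, 600))  # illustrative custom values
```

With the default values, stale routes are therefore kept for up to 600 seconds in total.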

Configuring the protocol for connecting to Cassandra clusters

To streamline and improve the efficiency of communication between clients and the database, Cassandra is transitioning away from the Thrift protocol in favor of the Cassandra Query Language (CQL) protocol starting with MOSK 24.1. Since MOSK 24.2, Cassandra uses the CQL protocol by default.

CQL provides a more user-friendly and SQL-like interface for interacting with the database. With the move towards CQL, the Thrift-based client drivers are no longer actively supported, which encourages users to migrate to CQL-based client drivers to take advantage of new features and improvements in Cassandra.

If your cluster is running MOSK 24.1.x, you can enable the CQL protocol proceeding with one of the options below depending on the Tungsten Fabric Operator API version in use.

During update to MOSK 24.2, switching from Thrift to CQL is performed automatically. While it is possible to switch back to Thrift, Mirantis does not recommend it. If you choose to do so, specify thrift instead of cql in the configuration examples below.

Define the cassandraDriver parameter in the devOptions section of the TFOperator custom resource:

spec:
  devOptions:
    cassandraDriver: cql

Define the CONFIGDB_CASSANDRA_DRIVER variable for the tf-analytics, tf-config, and tf-control controllers in the TFOperator custom resource:

spec:
  controllers:
    tf-analytics:
      alarm-gen:
        containers:
          - env:
              - name: CONFIGDB_CASSANDRA_DRIVER
                value: cql
            name: alarm-gen
      api:
        containers:
          - env:
              - name: CONFIGDB_CASSANDRA_DRIVER
                value: cql
            name: api
      collector:
        containers:
          - env:
              - name: CONFIGDB_CASSANDRA_DRIVER
                value: cql
            name: collector
      snmp:
        containers:
          - env:
              - name: CONFIGDB_CASSANDRA_DRIVER
                value: cql
            name: snmp
      topology:
        containers:
          - env:
              - name: CONFIGDB_CASSANDRA_DRIVER
                value: cql
            name: topology
    tf-config:
      api:
        containers:
          - env:
              - name: CONFIGDB_CASSANDRA_DRIVER
                value: cql
            name: api
      devicemgr:
        containers:
          - env:
              - name: CONFIGDB_CASSANDRA_DRIVER
                value: cql
            name: devicemgr
      schema:
        containers:
          - env:
              - name: CONFIGDB_CASSANDRA_DRIVER
                value: cql
            name: schema
      svc-monitor:
        containers:
          - env:
              - name: CONFIGDB_CASSANDRA_DRIVER
                value: cql
            name: svc-monitor
    tf-control:
      control:
        containers:
          - env:
              - name: CONFIGDB_CASSANDRA_DRIVER
                value: cql
            name: control
      dns:
        containers:
          - env:
              - name: CONFIGDB_CASSANDRA_DRIVER
                value: cql
            name: dns
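Because the structure above repeats the same variable for every container, it can be generated instead of typed by hand. The following sketch builds the same nesting as a Python dictionary and prints it as JSON, which is also valid YAML. The controller and container names are taken from the example above, and the helper name is hypothetical.

```python
import json

# Controllers and their containers, as listed in the example above.
SERVICES = {
    "tf-analytics": ["alarm-gen", "api", "collector", "snmp", "topology"],
    "tf-config": ["api", "devicemgr", "schema", "svc-monitor"],
    "tf-control": ["control", "dns"],
}


def cassandra_driver_overrides(driver="cql"):
    """Build the spec:controllers structure that sets CONFIGDB_CASSANDRA_DRIVER."""
    if driver not in ("cql", "thrift"):
        raise ValueError(f"unknown driver: {driver}")
    env = [{"name": "CONFIGDB_CASSANDRA_DRIVER", "value": driver}]
    controllers = {
        controller: {
            service: {"containers": [{"name": service, "env": list(env)}]}
            for service in services
        }
        for controller, services in SERVICES.items()
    }
    return {"spec": {"controllers": controllers}}


if __name__ == "__main__":
    print(json.dumps(cassandra_driver_overrides("cql"), indent=2))
```

Review the generated fragment before merging it into the TFOperator custom resource, since your resource may already define other settings for the same containers.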
SR-IOV Spoof Check control for Tungsten Fabric

Available since MOSK 24.2 TechPreview

MOSK provides the capability to enable SR-IOV Spoof Check control with the Neutron Tungsten Fabric backend.

The capability can be useful for certain network configurations. For example, you might need to allow traffic from a virtual function interface even when its MAC address does not match the MAC address inside the virtual machine. In this scenario, known as MAC spoofing, disabling spoof check enables the traffic to pass through regardless of the MAC address mismatch.

Caution

Certain NICs and drivers may not handle the spoofchk setting. For example, the Intel 82599ES NIC paired with the ixgbe driver disregards the spoofchk setting when VLAN tagging is enabled. Therefore, ensure compatibility with your hardware configuration regarding spoofchk handling before proceeding.

To enable SR-IOV Spoof Check control for Tungsten Fabric, enable handling of SR-IOV interfaces by the Nova os-vif plugin in the OpenStackDeployment custom resource:

services:
  compute:
    nova:
      values:
        conf:
          nova:
            workarounds:
              pass_hwveb_ports_to_os_vif_plugin: true

Now, you can enable and disable spoof checking for certain SR-IOV ports through the OpenStack CLI. To disable spoof checking on an SR-IOV port:

openstack port set --no-security-group --disable-port-security <SRIOV-PORT>

To enable spoof checking on an SR-IOV port:

openstack port set --enable-port-security <SRIOV-PORT>

Availability zones

The Tungsten Fabric Operator provides a capability to configure the netns_availability_zone parameter of the Tungsten Fabric svc-monitor service through the netnsAZ parameter. This configuration enables MOSK users to specify an availability zone for Tungsten Fabric instances, such as HAProxy (load balancer instances) or SNAT routers.

Configuration:

spec:
  features:
    config:
      netnsAZ: <NETNS_AVAILABILITY_ZONE>

Tungsten Fabric database

Tungsten Fabric (TF) uses Cassandra and ZooKeeper to store its data. Cassandra is a fault-tolerant and horizontally scalable database that provides persistent storage of configuration and analytics data. ZooKeeper is used by TF for allocation of unique object identifiers and implementation of transactions.

To prevent data loss, Mirantis recommends that you simultaneously back up the ZooKeeper database dedicated to configuration services and the Cassandra database.

The database backup must be consistent across all systems because the state of the Tungsten Fabric databases is associated with other system databases, such as OpenStack databases.

Periodic Tungsten Fabric database backups

MOSK enables you to perform the automatic TF data backup in the JSON format using the tf-dbbackup-job cron job. By default, it is disabled. To back up the TF databases, enable tf-dbBackup in the TF Operator custom resource:

spec:
  features:
    dbBackup:
      enabled: true

spec:
  controllers:
    tf-dbBackup:
      enabled: true

By default, the tf-dbbackup-job job is scheduled for weekly execution, allocates a PVC of 5 Gi for storing backups, and keeps the 5 previous backups. To configure the backup parameters according to the needs of your cluster, use the following structure:

spec:
  features:
    dbBackup:
      enabled: true
      dataCapacity: 30Gi
      schedule: "0 0 13 * 5"
      storedBackups: 10

spec:
  controllers:
    tf-dbBackup:
      enabled: true
      dataCapacity: 30Gi
      schedule: "0 0 13 * 5"
      storedBackups: 10
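The schedule value in the example above can be read field by field. The sketch below assumes that the parameter follows the standard five-field cron syntax, which the source does not state explicitly, and only labels the fields without evaluating the schedule. The helper name is hypothetical.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def label_cron_fields(expression):
    """Map each field of a five-field cron expression to its name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


if __name__ == "__main__":
    # The schedule value from the example configuration.
    for field, value in label_cron_fields("0 0 13 * 5").items():
        print(f"{field}: {value}")
```

Under this reading, the example sets minute 0, hour 0, day 13 of the month, any month, and day 5 of the week, which is Friday in standard cron numbering.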

To temporarily disable tf-dbbackup-job, suspend the job:

spec:
  features:
    dbBackup:
      enabled: true
      suspend: true

spec:
  controllers:
    tf-dbBackup:
      enabled: true
      suspend: true

To delete the tf-dbbackup-job job, disable tf-dbBackup in the TF Operator custom resource:

spec:
  features:
    dbBackup:
      enabled: false

spec:
  controllers:
    tf-dbBackup:
      enabled: false

Remote storage for Tungsten Fabric database backups

Available since MOSK 23.2 TechPreview

MOSK supports configuring a remote NFS storage for TF data backups through the TF Operator custom resource:

spec:
  features:
    dbBackup:
      enabled: true
      backupType: "pv_nfs"
      nfsOptions:
        path: <PATH_TO_SHARE_FOLDER_ON_SERVER>
        server: <IP_ADDRESS/DNS_NAME_OF_SERVER>

spec:
  controllers:
    tf-dbBackup:
      enabled: true
      backupType: "pv_nfs"
      nfsOptions:
        path: <PATH_TO_SHARE_FOLDER_ON_SERVER>
        server: <IP_ADDRESS/DNS_NAME_OF_SERVER>

If PVC backups were used previously, the old PVC will not be utilized. You can delete it with the following command:

kubectl -n tf delete pvc <TF_DB_BACKUP_PVC>

Tungsten Fabric services

This section explains the specifics of the Tungsten Fabric services provided by Mirantis OpenStack for Kubernetes (MOSK). The list of the services and their supported features included in this section is not exhaustive and is being constantly amended based on the complexity of the architecture and use of a particular service.

Tungsten Fabric load balancing (HAProxy)

Note

Since 23.1, MOSK provides technology preview for Octavia Amphora load balancing. To start experimenting with the new load balancing solution, refer to Octavia Amphora load balancing.

MOSK integrates Octavia with Tungsten Fabric through the OpenStack Octavia driver that uses Tungsten Fabric HAProxy as a backend.

The Tungsten Fabric-based MOSK deployment supports creation, update, and deletion operations with the following standard load balancing API entities:

  • Load balancers

    Note

    For a load balancer creation operation, the driver supports only the vip-subnet-id argument. The vip-network-id argument is not supported.

  • Listeners

  • Pools

  • Health monitors

The Tungsten Fabric-based MOSK deployment does not support the following load balancing capabilities:

  • L7 load balancing capabilities, such as L7 policies, L7 rules, and others

  • Setting specific availability zones for load balancers and their resources

  • Use of the UDP protocol

  • Operations with Octavia quotas

  • Operations with Octavia flavors

Warning

The Tungsten Fabric-based MOSK deployment enables you to manage the load balancer resources by means of the OpenStack CLI or OpenStack Horizon. Do not perform any manipulations with the load balancer resources through the Tungsten Fabric web UI because in this case the changes will not be reflected on the OpenStack API side.

Octavia Amphora load balancing

Available since MOSK 23.1 TechPreview

Octavia Amphora (Amphora v2) load balancing provides a scalable and flexible solution for load balancing in cloud environments. MOSK deploys Amphora load balancer on each node of the OpenStack environment ensuring that load balancing services are easily accessible, highly scalable, and highly reliable.

Compared to the Octavia Tungsten Fabric driver for LBaaS v2 solution, Amphora offers several advanced features including:

  • Full compatibility with the Octavia API, which provides a standardized interface for load balancing in MOSK OpenStack environments. This makes it easier to manage and integrate with other OpenStack services.

  • Layer 7 policies and rules, which allow for more granular control over traffic routing and load balancing decisions. This enables users to optimize their application performance and improve the user experience.

  • Support for the UDP protocol, which is commonly used for real-time communications and other high-performance applications. This enables users to deploy a wider range of applications with the same load balancing infrastructure.

Enabling Octavia Amphora load balancing

By default, MOSK uses the Octavia Tungsten Fabric load balancing. Once Octavia Amphora load balancing is enabled, the existing Octavia Tungsten Fabric driver load balancers will continue to function normally. However, you cannot migrate your load balancer workloads from the old LBaaS v2 solution to Amphora.

Note

As long as MOSK provides Octavia Amphora load balancing as a technology preview feature, Mirantis cannot guarantee the stability of this solution and does not provide a migration path from Tungsten Fabric load balancing (HAProxy), which is used by default.

To enable Octavia Amphora load balancing:

  1. Assign openstack-gateway: enabled labels to the compute nodes in either order.

    Caution

    Assigning the openstack-gateway: enabled labels on compute nodes is crucial for the effective operation of Octavia Amphora load balancing within an OpenStack environment. Double-check the labels assignment to guarantee proper configuration.

  2. To make Amphora the default provider, specify it in the OpenStackDeployment custom resource:

    spec:
      features:
        octavia:
          default_provider: amphorav2
    
  3. Verify that the OpenStack Controller (Rockoon) has scheduled new Octavia pods that include health manager, worker, and housekeeping pods:

    kubectl get pods -n openstack -l 'application=octavia,component in (worker, health_manager, housekeeping)'
    

    Example of output for an environment with two compute nodes:

    NAME                                    READY   STATUS    RESTARTS   AGE
    octavia-health-manager-default-48znl    1/1     Running   0          4h32m
    octavia-health-manager-default-jk82v    1/1     Running   0          4h34m
    octavia-housekeeping-7bdf9cbd6c-24vc4   1/1     Running   0          4h34m
    octavia-housekeeping-7bdf9cbd6c-h9ccv   1/1     Running   0          4h34m
    octavia-housekeeping-7bdf9cbd6c-rptvv   1/1     Running   0          4h34m
    octavia-worker-665f84fc7-8kdqd          1/1     Running   0          4h34m
    octavia-worker-665f84fc7-j6jn9          1/1     Running   0          4h31m
    octavia-worker-665f84fc7-kqf9t          1/1     Running   0          4h33m
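The verification in the last step can also be scripted. The sketch below parses output in the format shown above and returns the pods that are not fully ready. It assumes the default column layout of kubectl get pods. The sample data is a shortened copy of the example output, and the helper name is hypothetical.

```python
SAMPLE_OUTPUT = """\
NAME                                    READY   STATUS    RESTARTS   AGE
octavia-health-manager-default-48znl    1/1     Running   0          4h32m
octavia-housekeeping-7bdf9cbd6c-24vc4   1/1     Running   0          4h34m
octavia-worker-665f84fc7-8kdqd          1/1     Running   0          4h34m
"""


def not_ready_pods(kubectl_output):
    """Return names of pods that are not Running or not fully ready."""
    problems = []
    # Skip the header line, then read the NAME, READY, and STATUS columns.
    for line in kubectl_output.strip().splitlines()[1:]:
        name, ready, status = line.split()[:3]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            problems.append(name)
    return problems


if __name__ == "__main__":
    print(not_ready_pods(SAMPLE_OUTPUT))
```

An empty list means that every listed Octavia pod is running with all of its containers ready.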
    
Creating new load balancers

The workflow for creating new load balancers with Amphora is identical to the workflow for creating load balancers with Octavia Tungsten Fabric driver for LBaaS v2. You can do it either through the OpenStack Horizon UI or OpenStack CLI.

If you have not defined amphorav2 as the default provider in the OpenStackDeployment custom resource, you can specify it explicitly when creating a load balancer using the provider argument:

openstack loadbalancer create --provider amphorav2

Tungsten Fabric known limitations

This section contains a summary of the Tungsten Fabric upstream features and use cases not supported in MOSK, features and use cases offered as Technology Preview in the current product release if any, and known limitations of Tungsten Fabric in integration with other product components.

Feature or use case

Status

Description

Tungsten Fabric web UI

Provided as is

MOSK provides the TF web UI as is and does not include this service in the support Service Level Agreement

Automatic generation of network port records in DNSaaS (Designate)

Not supported

As a workaround, you can use the Tungsten Fabric built-in DNS service that enables virtual machines to resolve each other's names

Secret management (Barbican)

Not supported

It is not possible to use the certificates stored in Barbican to terminate HTTPS on a load balancer in a Tungsten Fabric deployment

Role Based Access Control (RBAC) for Neutron objects

Not supported

Advanced Tungsten Fabric features

Provided as is

MOSK provides the following advanced Tungsten Fabric features as is and does not include them in the support Service Level Agreement:

  • Service Function Chaining

  • Production ready multi-site SDN

  • Layer 3 multihoming

  • Long-Lived Graceful Restart (LLGR)

Technical Preview

DPDK

Tungsten Fabric and OpenStack Octavia Amphora integration

Technical Preview

Due to Tungsten Fabric Simple Virtual Gateway restriction, each virtual network can have only one VGW interface. As a result, MOSK should be limited to a single compute node with the openstack-gateway=enabled label. This limitation prevents OpenStack Octavia Amphora from functioning in a multi-rack deployment.

Tungsten Fabric integration with OpenStack

OpenStack and Tungsten Fabric (TF) are integrated at two levels: controllers and services.

Controllers integration

The integration between the OpenStack and TF controllers is implemented through the shared Kubernetes openstack-tf-shared namespace. Both controllers have access to this namespace to read and write the Kubernetes kind: Secret objects.

The OpenStack Controller (Rockoon) posts the data required by the TF services into the openstack-tf-shared namespace. The TF controller watches this namespace. Once an appropriate secret is created, the TF controller loads it into its internal data structures for further processing.

The OpenStack Controller includes the following data for the TF Controller:

  • tunnel_inteface

    Name of the network interface for the TF data plane. This interface is used by TF for the encapsulated traffic for overlay networks.

  • Keystone authorization information

    Keystone Administrator credentials and an up-and-running IAM service are required for the TF Controller to initiate the deployment process.

  • Nova metadata information

    Required for the TF vRouter agent service.

Also, the OpenStack Controller watches the openstack-tf-shared namespace for the vrouter_port parameter that defines the vRouter port number and passes it to the nova-compute pod.

Services integration

The list of the OpenStack services that are integrated with TF through their API includes:

  • neutron-server - integration is provided by the contrail-neutron-plugin component that is used by the neutron-server service to transform the API calls into TF API-compatible requests.

  • nova-compute - integration is provided by the contrail-nova-vif-driver and contrail-vrouter-api packages used by the nova-compute service to interact with the TF vRouter when managing the network ports.

  • octavia-api - integration is provided by the Octavia TF Driver that enables you to use OpenStack CLI and Horizon for operations with load balancers. See Tungsten Fabric load balancing (HAProxy) for details.

Warning

TF is not integrated with the following OpenStack services:

  • DNS service (Designate)

  • Key management (Barbican)

Tungsten Fabric IPv6 support

Tungsten Fabric allows running IPv6-enabled OpenStack tenant networks on top of the IPv4 underlay. You can create an IPv6 virtual network through the Tungsten Fabric web UI or OpenStack CLI in the same way as an IPv4 virtual network. The IPv6 functionality is enabled out of the box and does not require major changes in the cloud configuration. This section lists the IPv6 capabilities supported by MOSK, as well as those available and unavailable in the upstream OpenContrail software.

The following IPv6 features are supported and verified in MOSK:

  • Virtual machines with IPv6 and IPv4 interfaces

  • Virtual machines with IPv6-only interfaces

  • DHCPv6 and neighbor discovery

  • Policy and security groups

  • IPv6 flow set up, tear down, and aging

  • Flow set up and tear down based on a TCP state machine

  • Fat flow

  • Allowed address pair configuration with IPv6 addresses

  • Equal Cost Multi-Path (ECMP)

Additionally, the following IPv6 features are available in upstream OpenContrail according to its official documentation:

  • Protocol-based flow aging

  • IPv6 service chaining

  • Connectivity with gateway (MX Series device)

  • Virtual Domain Name Services (vDNS), name-to-IPv6 address resolution

The following IPv6 features are not available in upstream OpenContrail:

  • Any IPv6 Network Address Translation (NAT)

  • Load Balancing as a Service (LBaaS)

  • IPv6 fragmentation

  • Floating IPv6 address

  • Link-local and metadata services

  • Diagnostics for IPv6

  • Contrail Device Manager

  • Virtual customer premises equipment (vCPE)

Networking

Depending on the size of an OpenStack environment and the components that you use, you may want to have a single or multiple network interfaces, as well as run different types of traffic on a single or multiple VLANs.

This section provides the recommendations for planning the network configuration and optimizing the cloud performance.

Networking overview

Mirantis OpenStack for Kubernetes (MOSK) cluster networking is complex and defined by the security requirements and performance considerations. It is based on the Kubernetes cluster networking provided by Mirantis Container Cloud and expanded to facilitate the demands of the OpenStack virtualization platform.

A Container Cloud Kubernetes cluster provides a platform for MOSK and is considered a part of its control plane. All networks that serve Kubernetes and related traffic are considered control plane networks. The Kubernetes cluster networking is typically focused on connecting pods of different nodes as well as exposing the Kubernetes API and services running in pods into an external network.

The OpenStack networking connects virtual machines to each other and the outside world. Most of the OpenStack-related networks are considered a part of the data plane in an OpenStack cluster. Ceph networks are considered data plane networks for the purpose of this reference architecture.

When planning your OpenStack environment, consider the types of traffic that your workloads generate and design your network accordingly. If you anticipate that certain types of traffic, such as storage replication, will likely consume a significant amount of network bandwidth, you may want to move that traffic to a dedicated network interface to avoid performance degradation.

The following diagram provides a simplified overview of the underlay networking in a MOSK environment:

cluster-networking
Management cluster networking

This page summarizes the recommended networking architecture of a Mirantis Container Cloud management cluster for a Mirantis OpenStack for Kubernetes (MOSK) cluster.

We recommend deploying the management cluster with a dedicated interface for the provisioning (PXE) network. The separation of the provisioning network from the management network ensures additional security and resilience of the solution.

MOSK end users typically should have access to the Keycloak service in the management cluster for authentication to the Horizon web UI. Therefore, we recommend that you connect the management network of the management cluster to an external network through an IP router. The default route on the management cluster nodes must be configured with the default gateway in the management network.

If you deploy the multi-rack configuration, ensure that the provisioning network of the management cluster is connected to an IP router that connects it to the provisioning networks of all racks.

MOSK cluster networking

Mirantis OpenStack for Kubernetes (MOSK) clusters managed by Mirantis Container Cloud use the following networks to serve different types of traffic:

MOSK network types

Network role

Description

Provisioning (PXE) network

Facilitates the iPXE boot of all bare metal machines in a MOSK cluster and provisioning of the operating system to machines.

This network is only used during provisioning of the host. It must not be configured on an operational MOSK node.

Life-cycle management (LCM) network

Connects LCM Agents running on the hosts to the Container Cloud LCM API. The LCM API is provided by the management cluster. The LCM network is also used for communication between kubelet and the Kubernetes API server inside a Kubernetes cluster. The MKE components use this network for communication inside a swarm cluster.

The LCM subnet(s) provides IP addresses that are statically allocated by the IPAM service to bare metal hosts. This network must be connected to the Kubernetes API endpoint of the management cluster through an IP router. LCM Agents running on MOSK clusters will connect to the management cluster API through this router. LCM subnets may be different per MOSK cluster as long as this connection requirement is satisfied.

You can use more than one LCM network segment in a MOSK cluster. In this case, separated L2 segments and interconnected L3 subnets are still used to serve LCM and API traffic.

All IP subnets in the LCM networks must be connected to each other by IP routes. These routes must be configured on the hosts through L2 templates.

All IP subnets in the LCM network must be connected to the Kubernetes API endpoints of the management cluster through an IP router.

You can manually select the load balancer IP address for external access to the cluster API and specify it in the Cluster object configuration. Alternatively, you can allocate a dedicated IP range for a virtual IP of the cluster API load balancer by adding a Subnet object with a special annotation. Mirantis recommends that this subnet stays unique per MOSK cluster. For details, see Create subnets.

Note

When using the ARP announcement of the IP address for the cluster API load balancer, the following limitations apply:

  • Only one of the LCM networks can contain the API endpoint. This network is called API/LCM throughout this documentation. It consists of a VLAN segment stretched between all Kubernetes master nodes in the cluster and the IP subnet that provides IP addresses allocated to these nodes.

  • The load balancer IP address must be allocated from the same subnet CIDR address that the LCM subnet uses.

When using the BGP announcement of the IP address for the cluster API load balancer, which is available as Technology Preview since MOSK 23.2.2, no segment stretching is required between Kubernetes master nodes. Also, in this scenario, the load balancer IP address is not required to match the LCM subnet CIDR address.
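
The ARP limitation on the load balancer IP address can be checked before deployment. The following Python sketch tests whether a candidate address belongs to the LCM subnet CIDR. The function name and all addresses are hypothetical; the check is an illustration, not a Container Cloud tool.

```python
import ipaddress


def lb_address_fits_arp(lb_ip, lcm_cidr):
    """Return True if lb_ip belongs to the LCM subnet CIDR."""
    return ipaddress.ip_address(lb_ip) in ipaddress.ip_network(lcm_cidr)


# Hypothetical API/LCM subnet and candidate load balancer addresses.
print(lb_address_fits_arp("10.0.10.90", "10.0.10.0/24"))
print(lb_address_fits_arp("10.0.20.90", "10.0.10.0/24"))
```

With the BGP announcement, this check is not needed because the load balancer IP address is not required to match the LCM subnet CIDR.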

Kubernetes workloads network

Serves as an underlay network for traffic between pods in the MOSK cluster. Do not share this network between clusters.

There might be more than one Kubernetes pods network segment in the cluster. In this case, they must be connected through an IP router.

The Kubernetes workloads network does not need external access.

The Kubernetes workloads subnet(s) provides IP addresses that are statically allocated by the IPAM service to all nodes and that are used by Calico for cross-node communication inside a cluster. By default, VXLAN overlay is used for Calico cross-node communication.

Kubernetes external network

Provides access to the OpenStack endpoints in a MOSK cluster.

When using the ARP announcement of the external endpoints of load-balanced services, the network must contain a VLAN segment extended to all MOSK nodes connected to this network.

When using the BGP announcement of the external endpoints of load-balanced services, which is available as Technology Preview since MOSK 23.2.2, there is no requirement of having a single VLAN segment extended to all MOSK nodes connected to this network.

A typical MOSK cluster only has one external network.

The external network must include at least two IP address ranges defined by separate Subnet objects in Container Cloud API:

  • MOSK services address range

    Provides IP addresses for externally available load-balanced services, including OpenStack API endpoints.

  • External address range

    Provides IP addresses to be assigned to network interfaces on all cluster nodes that are connected to this network. MetalLB speakers must run on the same nodes. For details, see Configure node selectors for MetalLB speakers.

    This is required for external traffic to return to the originating client. The default route on the MOSK nodes that are connected to the external network must be configured with the default gateway in the external network.

Storage access network

Serves for the storage access traffic from and to Ceph OSD services.

A MOSK cluster may have more than one VLAN segment and IP subnet in the storage access network. All IP subnets of this network in a single cluster must be connected by an IP router.

The storage access network does not require external access unless you want to directly expose Ceph to the clients outside of a MOSK cluster.

Note

Direct access to Ceph by clients outside of a MOSK cluster is technically possible but not supported by Mirantis. Use at your own risk.

The IP addresses from subnets in this network are statically allocated by the IPAM service to Ceph nodes. The Ceph OSD services bind to these addresses on their respective nodes.

This is a public network in Ceph terms. 1

Storage replication network

Serves for the storage replication traffic between Ceph OSD services.

A MOSK cluster may have more than one VLAN segment and IP subnet in this network as long as the subnets are connected by an IP router.

This network does not require external access.

The IP addresses from subnets in this network are statically allocated by the IPAM service to Ceph nodes. The Ceph OSD services bind to these addresses on their respective nodes.

This is a cluster network in Ceph terms. 1

Out-of-Band (OOB) network

Connects Baseboard Management Controllers (BMCs) of the bare metal hosts. Must not be accessible from a MOSK cluster.

1(1,2)

For more details about Ceph networks, see Ceph Network Configuration Reference.

The following diagram illustrates the networking schema of the Container Cloud deployment on bare metal with a MOSK cluster using ARP announcements:

_images/network-multirack.png

Since 23.2.2, MOSK supports full L3 networking topology in the Technology Preview scope. The following diagram illustrates the networking schema of the Container Cloud deployment on bare metal with a MOSK cluster using BGP announcements:

_images/network-multirack-bgp.png
Network types

This section describes network types for Layer 3 networks used for Kubernetes and Mirantis OpenStack for Kubernetes (MOSK) clusters along with requirements for each network type.

Note

Only IPv4 is currently supported by Container Cloud and IPAM for infrastructure networks. Both IPv4 and IPv6 are supported for OpenStack workloads.

The following diagram provides an overview of the underlay networks in a MOSK environment:

_images/os-cluster-l3-networking.png
L3 networks for Kubernetes

A MOSK deployment typically requires the following types of networks:

  • Provisioning network

    Used for provisioning of bare metal servers.

  • Management network

    Used for management of the Container Cloud infrastructure and for communication between containers in Kubernetes.

  • LCM/API network

    Since 23.2.2, MOSK supports full L3 networking topology in the Technology Preview scope. This enables deployment of specific cluster segments in dedicated racks without the need for L2 layer extension between them. For configuration procedure, see Configure BGP announcement for cluster API LB address and Configure BGP announcement of external addresses of Kubernetes load-balanced services in Deployment Guide.

    If BGP announcement is configured for the MOSK cluster API LB address, the API/LCM network is not required. Announcement of the cluster API LB address is done using the LCM network.

    If you configure ARP announcement of the load-balancer IP address for the MOSK cluster API, the API/LCM network must be configured on the Kubernetes manager nodes of the cluster. This network contains the Kubernetes API endpoint with the VRRP virtual IP address.

  • LCM network

    Enables communication between the MKE cluster nodes. Multiple VLAN segments and IP subnets can be created for a multi-rack architecture. Each server must be connected to one of the LCM segments and have an IP from the corresponding subnet.

  • External network

    Used to expose the OpenStack, StackLight, and other services of the MOSK cluster.

  • Kubernetes workloads network

    Used for communication between containers in Kubernetes.

  • Storage access network (Ceph)

    Used for accessing the Ceph storage. In Ceph terms, this is a public network 0. We recommend placing it on a dedicated hardware interface.

  • Storage replication network (Ceph)

    Used for Ceph storage replication. In Ceph terms, this is a cluster network 0. To ensure low latency and fast access, place the network on a dedicated hardware interface.

0(1,2)

For details about Ceph networks, see Ceph Network Configuration Reference.

L3 networks for MOSK

The MOSK deployment additionally requires the following networks.

L3 networks for MOSK

Service name

Network

Description

VLAN name

Networking

Provider networks

Typically, a routable network used to provide the external access to OpenStack instances (a floating network). Can be used by the OpenStack services such as Ironic, Manila, and others, to connect their management resources.

pr-floating

Networking

Overlay networks (virtual networks)

The network used to provide isolated, secure tenant networks with the help of a tunneling mechanism (VLAN/GRE/VXLAN). If VXLAN or GRE encapsulation is used, the IP address assignment is required on interfaces at the node level.

neutron-tunnel

Compute

Live migration network

The network used by the OpenStack compute service (Nova) to transfer data during live migration. Depending on the cloud needs, it can be placed on a dedicated physical network not to affect other networks during live migration. The IP address assignment is required on interfaces at the node level.

lm-vlan

The mapping of the logical networks described above to physical networks and interfaces on nodes depends on the cloud size and configuration. We recommend placing OpenStack networks on a dedicated physical interface (bond) that is not shared with the storage and Kubernetes management networks to minimize their influence on each other.

L3 networks requirements

The following tables describe networking requirements for a Container Cloud management cluster, a MOSK cluster, and a Ceph cluster.

Container Cloud management cluster networking requirements

Network type

Provisioning

Management

Suggested interface name

N/A

k8s-lcm

Minimum number of VLANs

1

1

Minimum number of IP subnets

3

2

Minimum recommended IP subnet size

  • 8 IP addresses (Container Cloud management cluster hosts)

  • 8 IP addresses (MetalLB for provisioning services)

  • 16 IP addresses (DHCP range for directly connected servers)

  • 8 IP addresses (Container Cloud management cluster hosts, API VIP)

  • 16 IP addresses (MetalLB for Container Cloud services)

External routing

Not required

Required, may use proxy server

Multiple segments/stretch segment

Stretch segment for management cluster due to MetalLB Layer 2 limitations 1

Stretch segment due to VRRP, MetalLB Layer 2 limitations

Internal routing

Routing to separate DHCP segments, if in use

  • Routing to API endpoints of managed clusters for LCM

  • Routing to MetalLB ranges of managed clusters for StackLight authentication

  • Default route from Container Cloud management cluster hosts

1

Multiple VLAN segments with IP subnets can be added to the cluster configuration for separate DHCP domains.

Since 23.2.2, MOSK supports full L3 networking topology in the Technology Preview scope. This enables deployment of specific cluster segments in dedicated racks without the need for L2 layer extension between them. For configuration procedure, see Configure BGP announcement for cluster API LB address and Configure BGP announcement of external addresses of Kubernetes load-balanced services in Deployment Guide.

If you configure BGP announcement of the load-balancer IP address for a MOSK cluster API and for load-balanced services of the cluster:

Networking requirements for a MOSK cluster

Network type

Provisioning

LCM

External

Kubernetes workloads

Minimum number of VLANs

1 (optional)

1

1

1

Suggested interface name

N/A

k8s-lcm

k8s-ext-v

k8s-pods 2

Minimum number of IP subnets

1 (optional)

1

2

1

Minimum recommended IP subnet size

16 IPs (DHCP range)

  • IP per cluster node

  • 1 IP for the API endpoint VIP

  • 1 IP per MOSK controller node

  • 16 IPs (MetalLB for StackLight, OpenStack services)

1 IP per cluster node

Stretch or multiple segments

Multiple

Multiple

Multiple. For details, see Configure node selectors for MetalLB speakers.

Multiple

External routing

Not required

Not required

Required, default route

Not required

Internal routing

Routing to the provisioning network of the management cluster

  • Routing to the IP subnet of the Container Cloud management network

  • Routing to all LCM IP subnets of the same MOSK cluster

Routing to the IP subnet of the Container Cloud Management API

Routing to all IP subnets of Kubernetes workloads

If you configure ARP announcement of the load-balancer IP address for a MOSK cluster API and for load-balanced services of the cluster:

Networking requirements for a MOSK cluster

Network type

Provisioning

LCM/API

LCM

External

Kubernetes workloads

Minimum number of VLANs

1 (optional)

1

1 (optional)

1

1

Suggested interface name

N/A

k8s-lcm

k8s-lcm

k8s-ext-v

k8s-pods 2

Minimum number of IP subnets

1 (optional)

1

1 (optional)

2

1

Minimum recommended IP subnet size

16 IPs (DHCP range)

  • 3 IPs for Kubernetes manager nodes

  • 1 IP for the API endpoint VIP

1 IP per MOSK node (Kubernetes worker)

  • 1 IP per MOSK controller node

  • 16 IPs (MetalLB for StackLight, OpenStack services)

1 IP per cluster node

Stretch or multiple segments

Multiple

Stretch due to VRRP limitations

Multiple

Stretch connected to all MOSK controller nodes. For details, see Configure node selectors for MetalLB speakers.

Multiple

External routing

Not required

Not required

Not required

Required, default route

Not required

Internal routing

Routing to the provisioning network of the management cluster

  • Routing to the IP subnet of the Container Cloud management network

  • Routing to all LCM IP subnets of the same MOSK cluster, if in use

  • Routing to the IP subnet of the LCM/API network

  • Routing to all IP subnets of the LCM network, if in use

Routing to the IP subnet of the Container Cloud Management API

Routing to all IP subnets of Kubernetes workloads

2(1,2)

The bridge interface with this name is mandatory if you need to separate Kubernetes workloads traffic. You can configure this bridge over the VLAN or directly over the bonded or single interface.
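
The minimum subnet sizes listed in the tables above can be converted into IPv4 prefix lengths. The following Python sketch is one way to do the arithmetic. It assumes that two addresses per subnet are reserved for the network and broadcast addresses and does not count a gateway address. The function name and the node count of 50 are hypothetical.

```python
def min_prefix_len(required_ips):
    """Return the longest IPv4 prefix whose subnet holds required_ips hosts.

    Two extra addresses are reserved for the network and broadcast
    addresses. A gateway address, if needed, must be added by the caller.
    """
    total = required_ips + 2
    host_bits = (total - 1).bit_length()
    return 32 - host_bits


# LCM/API network with ARP announcement: 3 manager nodes + 1 API VIP.
print(min_prefix_len(3 + 1))

# Kubernetes workloads network: 1 IP per node, assuming 50 nodes.
print(min_prefix_len(50))
```

Treat the result as a lower bound. A larger subnet leaves room for adding nodes later.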

Networking requirements for a Ceph cluster

Network type

Storage access

Storage replication

Minimum number of VLANs

1

1

Suggested interface name

stor-public 3

stor-cluster 3

Minimum number of IP subnets

1

1

Minimum recommended IP subnet size

1 IP per cluster node

1 IP per cluster node

Stretch or multiple segments

Multiple

Multiple

External routing

Not required

Not required

Internal routing

Routing to all IP subnets of the Storage access network

Routing to all IP subnets of the Storage replication network

Note

When selecting externally routable subnets, ensure that the subnet ranges do not overlap with the internal subnets ranges. Otherwise, internal resources of users will not be available from the MOSK cluster.

3(1,2)

For details about Ceph networks, see Ceph Network Configuration Reference.
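
The note above about externally routable subnets can be verified with a short script. The following Python sketch reports every pair of overlapping subnets in a plan. The subnet names and CIDR values are hypothetical examples.

```python
import ipaddress
from itertools import combinations


def find_overlaps(subnets):
    """Return the name pairs of subnets whose address ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical address plan: the external range collides with an internal one.
plan = {
    "external": "10.0.0.0/16",
    "internal": "10.0.5.0/24",
    "storage": "172.16.0.0/24",
}
print(find_overlaps(plan))
```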

Multi-rack architecture

TechPreview

Mirantis OpenStack for Kubernetes (MOSK) enables you to deploy a cluster with a multi-rack architecture, where every data center cabinet (a rack) incorporates its own Layer 2 network infrastructure that does not extend beyond its top-of-rack switch. The architecture allows a MOSK cloud to integrate natively with the Layer 3-centric networking topologies such as Spine-Leaf that are commonly seen in modern data centers.

The architecture eliminates the need to stretch and manage VLANs across parts of a single data center, or to build VPN tunnels between the segments of a geographically distributed cloud.

The set of networks present in each rack depends on the backend used by the OpenStack networking service.

multi-rack-overview.html
Bare metal provisioning network

In the Mirantis Container Cloud and MOSK multi-rack reference architecture, every rack has its own L2 segment (VLAN) to bootstrap and install servers.

Segmentation of the provisioning network requires additional configuration of the underlay networking infrastructure and certain Container Cloud API objects. You need to configure a DHCP Relay agent on the border of each VLAN in the provisioning network. The agent handles broadcast DHCP requests coming from the bare metal servers in the rack and forwards them as unicast packets across L3 fabric of the data center to a Container Cloud management cluster.

multi-rack-bm.html

From the standpoint of Container Cloud API, you need to configure per-rack DHCP ranges by adding Subnet resources in Container Cloud as described in Configure multiple DHCP address ranges.

The DHCP server of Container Cloud automatically leases a temporary IP address from the DHCP range to the requester host depending on the address of the DHCP agent that relays the request.
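
The selection logic described above can be modeled as a lookup of the relay agent address in the configured rack subnets. The following Python sketch is a simplified illustration of that idea, not the Container Cloud DHCP server implementation. All subnets, ranges, and names are hypothetical.

```python
import ipaddress

# Hypothetical per-rack provisioning subnets and their DHCP ranges.
RACK_DHCP_RANGES = {
    "10.0.1.0/24": ("10.0.1.100", "10.0.1.200"),
    "10.0.2.0/24": ("10.0.2.100", "10.0.2.200"),
}


def select_dhcp_range(relay_ip, ranges=RACK_DHCP_RANGES):
    """Return the DHCP range of the rack whose subnet contains relay_ip."""
    relay = ipaddress.ip_address(relay_ip)
    for cidr, dhcp_range in ranges.items():
        if relay in ipaddress.ip_network(cidr):
            return dhcp_range
    return None  # No range is configured for this relay agent.


print(select_dhcp_range("10.0.2.1"))
```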

Multi-rack MOSK cluster

To deploy a MOSK cluster with multi-rack reference architecture, you need to create a dedicated set of subnets and L2 templates for every rack in your cluster.

Every specific host type in the rack, which is defined by the role in the MOSK cluster and network-related hardware configuration, may require a specific L2 template.

Note

Since 23.2.2, MOSK supports full L3 networking topology in the Technology Preview scope. This enables deployment of specific cluster segments in dedicated racks without the need for L2 layer extension between them. For configuration procedure, see Configure BGP announcement for cluster API LB address and Configure BGP announcement of external addresses of Kubernetes load-balanced services in Deployment Guide.

For MOSK 23.1 and older versions, due to the Container Cloud limitations, you need to configure the following networks to have L2 segments (VLANs) stretch across racks to all hosts of certain types in a multi-rack environment:

LCM/API network

Must be configured on the Kubernetes manager nodes of the MOSK cluster. Contains a Kubernetes API endpoint with a VRRP virtual IP address. Enables MKE cluster nodes to communicate with each other.

External network

Exposes OpenStack, StackLight, and other services of the MOSK cluster to external clients.

For details, see Underlay networking: routing configuration.

When planning the IP address space for your cluster, pick a large IP range for each type of network. You can then split these ranges into per-rack subnets.

For example, if you allocate a /20 address block for LCM network, then you can create up to 16 Subnet objects with the /24 address block each for up to 16 racks. This way you can simplify routing on your hosts using the large /20 IP subnet as an aggregated route destination. For details, see Underlay networking: routing configuration.
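
The split in this example can be reproduced with the Python standard library. The following sketch divides a /20 block into /24 per-rack subnets and confirms that each one is covered by the aggregated /20 route. The 10.100.0.0/20 block is a hypothetical value chosen for illustration.

```python
import ipaddress

# Hypothetical /20 address block allocated for the LCM network.
lcm_block = ipaddress.ip_network("10.100.0.0/20")

# One /24 subnet per rack.
rack_subnets = list(lcm_block.subnets(new_prefix=24))
print(len(rack_subnets))
print(rack_subnets[0], rack_subnets[-1])

# Every per-rack subnet falls within the /20 block, so a single route
# to the /20 destination covers all racks.
covered = all(subnet.subnet_of(lcm_block) for subnet in rack_subnets)
print(covered)
```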

Multi-rack MOSK cluster with Tungsten Fabric

A typical medium-sized or larger MOSK cloud consists of three or more racks that can generally be divided into the following major categories:

  • Compute/Storage racks that contain the hypervisors and instances running on top of them. Additionally, they contain nodes that store cloud applications’ block, ephemeral, and object data as part of the Ceph cluster.

  • Control plane racks that incorporate all the components needed by the cloud operator to manage its life cycle. Also, they include the services through which the cloud users interact with the cloud to deploy their applications, such as cloud APIs and web UI.

    A control plane rack may also contain additional compute and storage nodes.

The diagram below will help you to plan the networking layout of a multi-rack MOSK cloud with Tungsten Fabric.

multi-rack-tf.html

Note

Since 23.2.2, MOSK supports full L3 networking topology in the Technology Preview scope. This enables deployment of specific cluster segments in dedicated racks without the need for L2 layer extension between them. For configuration procedure, see Configure BGP announcement for cluster API LB address and Configure BGP announcement of external addresses of Kubernetes load-balanced services in Deployment Guide.

For MOSK 23.1 and older versions, Kubernetes masters (3 nodes) either need to be placed into a single rack or, if distributed across multiple racks for better availability, require stretching of the L2 segment of the management network across these racks. This requirement is caused by the Mirantis Kubernetes Engine underlay for MOSK relying on the Layer 2 VRRP protocol to ensure high availability of the Kubernetes API endpoint.

The table below provides a mapping between the racks and the network types participating in a multi-rack MOSK cluster with the Tungsten Fabric backend.

Networks and VLANs for a multi-rack MOSK cluster with TF

Network

VLAN name

Control Plane rack

Compute/Storage rack

Common/PXE

lcm-nw

Yes

Yes

Management

lcm-nw

Yes

Yes

External (MetalLB)

k8s-ext-v

Yes

No

Kubernetes workloads

k8s-pods-v

Yes

Yes

Storage access (Ceph)

stor-frontend

Yes

Yes

Storage replication (Ceph)

stor-backend

Yes

Yes

Overlay

tenant-vlan

Yes

Yes

Live migration

lm-vlan

Yes

Yes

Physical networks layout

This section summarizes the requirements for the physical layout of underlay network and VLANs configuration for the multi-rack architecture of Mirantis OpenStack for Kubernetes (MOSK).

Physical networking of a Container Cloud management cluster

Due to limitations of virtual IP address for Kubernetes API and of MetalLB load balancing in Container Cloud, the management cluster nodes must share VLAN segments in the provisioning and management networks.

In the multi-rack architecture, the management cluster nodes may be placed in a single rack or spread across three racks. In either case, provisioning and management network VLANs must be stretched across ToR switches of the racks.

The following diagram illustrates physical and L2 connections of the Container Cloud management cluster.

_images/os-cluster-mgmt-physical.png
Physical networking of a MOSK cluster
External network

Since 23.2.2, MOSK supports full L3 networking topology in the Technology Preview scope. This enables deployment of specific cluster segments in dedicated racks without the need for L2 layer extension between them. For configuration procedure, see Configure BGP announcement for cluster API LB address and Configure BGP announcement of external addresses of Kubernetes load-balanced services in Deployment Guide.

If you configure BGP announcement for IP addresses of load-balanced services of a MOSK cluster, the external network can consist of multiple VLAN segments connected to all nodes of a MOSK cluster where MetalLB speaker components are configured to announce IP addresses for Kubernetes load-balanced services. Mirantis recommends that you use OpenStack controller nodes for this purpose.

If you configure ARP announcement for IP addresses of load-balanced services of a MOSK cluster, the external network must consist of a single VLAN stretched to the ToR switches of all the racks where MOSK nodes connected to the external network are located. Those are the nodes where MetalLB speaker components are configured to announce IP addresses for Kubernetes load-balanced services. Mirantis recommends that you use OpenStack controller nodes for this purpose.

Kubernetes manager nodes

Note

Since 23.2.2, MOSK supports full L3 networking topology in the Technology Preview scope. This enables deployment of specific cluster segments in dedicated racks without the need for L2 layer extension between them. For configuration procedure, see Configure BGP announcement for cluster API LB address and Configure BGP announcement of external addresses of Kubernetes load-balanced services in Deployment Guide.

If BGP announcement is configured for MOSK cluster API LB address, Kubernetes manager nodes have no requirement to share the single stretched VLAN segment in the API/LCM network. All VLANs may be configured per rack.

If ARP announcement is configured for MOSK cluster API LB address, Kubernetes manager nodes must share the VLAN segment in the API/LCM network. In the multi-rack architecture, Kubernetes manager nodes may be spread across three racks. The API/LCM network VLAN must be stretched to the ToR switches of the racks. All other VLANs may be configured per rack. This requirement is caused by the Mirantis Kubernetes Engine underlay for MOSK relying on the Layer 2 VRRP protocol to ensure high availability of the Kubernetes API endpoint.

The following diagram illustrates physical and L2 network connections of the Kubernetes manager nodes in a MOSK cluster.

Caution

Such configuration does not apply to a compact control plane MOSK installation. See Create a MOSK cluster.

_images/os-cluster-k8s-mgr-physical.png
OpenStack controller nodes

The following diagram illustrates physical and L2 network connections of the control plane nodes in a MOSK cluster.

_images/os-cluster-control-physical.png
OpenStack compute nodes

All VLANs for OpenStack compute nodes may be configured per rack. No VLAN should be stretched across multiple racks.

The following diagram illustrates physical and L2 network connections of the compute nodes in a MOSK cluster.

_images/os-cluster-compute-physical.png
OpenStack storage nodes

All VLANs for OpenStack storage nodes may be configured per rack. No VLAN should be stretched across multiple racks.

The following diagram illustrates physical and L2 network connections of the storage nodes in a MOSK cluster.

_images/os-cluster-storage-physical.png
Underlay networking: routing configuration

This section describes requirements for the configuration of the underlay network for a MOSK cluster in a multi-rack reference configuration. The infrastructure operator must configure the underlay network according to these guidelines. Mirantis Container Cloud will not configure routing on the network devices.

Provisioning network

In the multi-rack reference architecture, every server rack has its own layer-2 segment (VLAN) for network bootstrap and installation of physical servers.

You need to configure top-of-rack (ToR) switches in each rack with the default gateway for the provisioning network VLAN. This gateway must also serve as a DHCP Relay Agent on the border of the VLAN. The agent handles broadcast DHCP requests coming from the bare metal servers in the rack and forwards them as unicast packets across the data center L3 fabric to the provisioning network of a Container Cloud management cluster.

Therefore, each ToR gateway must have an IP route to the IP subnet of the provisioning network of the management cluster. The provisioning network gateway, in turn, must have routes to all IP subnets of all racks.

The hosts of the management cluster must have routes to all IP subnets in the provisioning network through the gateway in the provisioning network of the management cluster.

All hosts in the management cluster must have IP addresses from the same IP subnet of the provisioning network. Even if the hosts of the management cluster are mounted to different racks, they must share a single provisioning VLAN segment.

Management network

All hosts of a management cluster must have IP addresses from the same subnet of the management network. Even if hosts of a management cluster are mounted to different racks, they must share a single management VLAN segment.

The gateway in this network is used as the default route on the nodes in a Container Cloud management cluster. This gateway must connect to external Internet networks directly or through a proxy server. If the Internet is accessible through a proxy server, you must configure Container Cloud bootstrap to use it as well. For details, see Deploy a management cluster.

This network connects a Container Cloud management cluster to Kubernetes API endpoints of MOSK clusters. It also connects LCM agents of MOSK nodes to the Kubernetes API endpoint of the management cluster.

The network gateway must have routes to all API/LCM subnets of all MOSK clusters.

LCM network

This network may include multiple VLANs, typically, one VLAN per rack. Each VLAN may have one or more IP subnets with gateways configured on ToR switches.

Each ToR gateway must provide routes to all other IP subnets in all other VLANs in the LCM network to enable communication between nodes in the cluster.
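
The routing requirement above can be sketched as a small calculation. The following Python example is only an illustration and is not part of the product: the rack names and subnets are hypothetical, and the required_routes helper simply lists the remote LCM subnets that each ToR gateway would need a route to.

```python
from typing import Dict, List


def required_routes(rack_subnets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """For each rack, list the LCM subnets that belong to all other racks.

    The ToR gateway of a rack needs a route to every subnet returned for it.
    """
    routes = {}
    for rack in rack_subnets:
        routes[rack] = sorted(
            subnet
            for other_rack, subnets in rack_subnets.items()
            if other_rack != rack
            for subnet in subnets
        )
    return routes


# Hypothetical layout: one LCM VLAN with one IP subnet per rack.
lcm_subnets = {
    "rack-1": ["10.10.1.0/24"],
    "rack-2": ["10.10.2.0/24"],
    "rack-3": ["10.10.3.0/24"],
}

for rack, remote in required_routes(lcm_subnets).items():
    print(f"{rack}: routes needed to {', '.join(remote)}")
```

Such a list can serve as a checklist when reviewing the routing tables of the ToR switches. How the routes are actually configured depends on the network equipment.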

Note

Since 23.2.2, MOSK supports full L3 networking topology in the Technology Preview scope. This enables deployment of specific cluster segments in dedicated racks without the need for L2 layer extension between them. For the configuration procedure, see Configure BGP announcement for cluster API LB address and Configure BGP announcement of external addresses of Kubernetes load-balanced services in the Deployment Guide.

If you configure BGP announcement of the load-balancer IP address for a MOSK cluster API:

  • All nodes of a MOSK cluster must be connected to the LCM network. Each host connected to this network must have routes to all IP subnets in the LCM network and to the management subnet of the management cluster, through the ToR gateway for the rack of this host.

  • It is not required to configure a separate API/LCM network. Announcement of the IP address of the load balancer is done using the LCM network.

If you configure ARP announcement of the load-balancer IP address for a MOSK cluster API:

  • All nodes of a MOSK cluster excluding manager nodes must be connected to the LCM network. Each host connected to this network must have routes to all IP subnets in the LCM network, including the API/LCM network of this MOSK cluster and to the Management subnet of the management cluster, through the ToR gateway for the rack of this host.

  • It is required to configure a separate API/LCM network. All manager nodes of a MOSK cluster must be connected to the API/LCM network. IP address announcement for load balancing is done using the API/LCM network.

API/LCM network

Note

Since 23.2.2, MOSK supports full L3 networking topology in the Technology Preview scope. This enables deployment of specific cluster segments in dedicated racks without the need for L2 layer extension between them. For the configuration procedure, see Configure BGP announcement for cluster API LB address and Configure BGP announcement of external addresses of Kubernetes load-balanced services in the Deployment Guide.

If BGP announcement is configured for the MOSK cluster API LB address, the API/LCM network is not required. Announcement of the cluster API LB address is done using the LCM network.

If you configure ARP announcement of the load-balancer IP address for the MOSK cluster API, the API/LCM network must be configured on the Kubernetes manager nodes of the cluster. This network contains the Kubernetes API endpoint with the VRRP virtual IP address.

This network consists of a single VLAN shared between all MOSK manager nodes in a MOSK cluster, even if the nodes are spread across multiple racks. All manager nodes of a MOSK cluster must be connected to this network and have IP addresses from the same subnet in this network.

The gateway in the API/LCM network for a MOSK cluster must have a route to the Management subnet of the management cluster. This is required to ensure symmetric traffic flow between the management and MOSK clusters.

The gateway in this network must also have routes to all IP subnets in the LCM network of this MOSK cluster.

The load-balancer IP address for the cluster API must be allocated from the same CIDR range that the API/LCM subnet uses.
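
A quick way to verify this requirement before deployment is to test the planned address against the subnet. The following Python sketch uses the standard ipaddress module. The addresses and the lb_address_in_subnet helper are hypothetical and serve only as an illustration.

```python
import ipaddress


def lb_address_in_subnet(lb_address: str, subnet_cidr: str) -> bool:
    """Return True if the load-balancer address belongs to the given subnet."""
    return ipaddress.ip_address(lb_address) in ipaddress.ip_network(subnet_cidr)


# Hypothetical API/LCM subnet and candidate load-balancer addresses.
api_lcm_subnet = "10.0.10.0/24"

for candidate in ("10.0.10.90", "10.0.11.90"):
    verdict = "ok" if lb_address_in_subnet(candidate, api_lcm_subnet) else "outside the subnet"
    print(f"{candidate}: {verdict}")
```

The check covers only subnet membership. It does not detect conflicts with addresses that are already assigned to hosts.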

External network

Note

Since 23.2.2, MOSK supports full L3 networking topology in the Technology Preview scope. This enables deployment of specific cluster segments in dedicated racks without the need for L2 layer extension between them. For the configuration procedure, see Configure BGP announcement for cluster API LB address and Configure BGP announcement of external addresses of Kubernetes load-balanced services in the Deployment Guide.

If you configure BGP announcement for IP addresses of load-balanced services of a MOSK cluster, the external network can consist of multiple VLAN segments connected to all nodes of a MOSK cluster where MetalLB speaker components are configured to announce IP addresses for Kubernetes load-balanced services. Mirantis recommends that you use OpenStack controller nodes for this purpose.

If you configure ARP announcement for IP addresses of load-balanced services of a MOSK cluster, the external network must consist of a single VLAN stretched to the ToR switches of all the racks where MOSK nodes connected to the external network are located. Those are the nodes where MetalLB speaker components are configured to announce IP addresses for Kubernetes load-balanced services. Mirantis recommends that you use OpenStack controller nodes for this purpose.

The IP gateway in this network is used as the default route on all nodes in the MOSK cluster that are connected to this network. This allows external users to connect to the OpenStack endpoints exposed as Kubernetes load-balanced services.

Dedicated IP ranges from this network must be configured as address pools for the MetalLB service. MetalLB allocates addresses from these address pools to Kubernetes load-balanced services.

Ceph public network

This network may include multiple VLANs and IP subnets, typically, one VLAN and IP subnet per rack. All IP subnets in this network must be connected by IP routes on the ToR switches.

Typically, every node in a MOSK cluster is connected to this network and has routes to all IP subnets from this network through its rack IP gateway.

This network is not connected to the external networks.

Ceph cluster network

This network may include multiple VLANs and IP subnets, typically, one VLAN and IP subnet per rack. All IP subnets in this network must be connected by IP routes on the ToR switches.

Every Ceph OSD node in a MOSK cluster must be connected to this network and have routes to all IP subnets from this network through its rack IP gateway.

This network is not connected to the external networks.

Kubernetes workloads network

This network may include multiple VLANs and IP subnets, typically, one VLAN and IP subnet per rack. All IP subnets in this network must be connected by IP routes on the ToR switches.

All nodes in a MOSK cluster must be connected to this network and have routes to all IP subnets from this network through their rack IP gateways.

This network is not connected to the external networks.

Performance optimization

The following recommendations apply to all types of nodes in the Mirantis OpenStack for Kubernetes (MOSK) clusters.

Jumbo frames

To improve the goodput, we recommend that you enable jumbo frames where possible. Jumbo frames must be enabled along the entire path that the packets traverse. If one of the network components cannot handle jumbo frames, the network path uses the smallest MTU.
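
As a minimal sketch of the last point, the usable MTU of a path can be modeled as the minimum MTU across all devices on that path. The device names and MTU values in the following Python example are hypothetical.

```python
from typing import Dict


def path_mtu(hop_mtus: Dict[str, int]) -> int:
    """Return the effective MTU of a path, limited by its smallest hop."""
    return min(hop_mtus.values())


# Hypothetical path where one device was left at the default MTU.
hops = {
    "compute node bond": 9000,
    "ToR switch": 9000,
    "spine switch": 1500,
    "storage node bond": 9000,
}

limit = path_mtu(hops)
bottlenecks = [name for name, mtu in hops.items() if mtu == limit]
print(f"Effective path MTU: {limit} (limited by {', '.join(bottlenecks)})")
```

In this model, a single device with a standard MTU removes the benefit of jumbo frames for the whole path, which is why every device on the path must be checked.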

Bonding

To tolerate the failure of a single NIC, we recommend using link aggregation, such as bonding. Link aggregation is useful for linear scaling of bandwidth, load balancing, and fault protection. Depending on the hardware equipment, different types of bonds might be supported. Use multi-chassis link aggregation, for example, MLAG on Arista equipment or vPC on Cisco equipment, as it provides fault tolerance at the device level.

The Linux kernel supports several bonding modes, including:

  • active-backup

  • balance-xor

  • 802.3ad (LACP)

  • balance-tlb

  • balance-alb

Since LACP is the IEEE 802.3ad standard supported by the majority of network platforms, we recommend using the Link Aggregation Control Protocol (LACP) bonding mode with MC-LAG domains configured on ToR switches. This corresponds to the 802.3ad bond mode on hosts.

Additionally, follow these recommendations regarding bond interfaces:

  • Use ports from different multi-port NICs when creating bonds. This makes network connections redundant if failure of a single NIC occurs.

  • Configure the ports that connect servers to the PXE network with PXE VLAN as native or untagged. On these ports, configure LACP fallback to ensure that the servers can reach the DHCP server and boot over the network.

Spanning tree portfast mode

Configure Spanning Tree Protocol (STP) settings on the network switch ports to ensure that the ports start forwarding packets as soon as the link comes up. This helps avoid iPXE timeout issues and ensures reliable boot over the network.

Storage

A MOSK cluster uses Ceph as a distributed storage system for file, block, and object storage. This section provides an overview of a Ceph cluster deployed by Container Cloud.

Ceph overview

Mirantis Container Cloud deploys Ceph on MOSK using Helm charts with the following components:

Rook Ceph Operator

A storage orchestrator that deploys Ceph on top of a Kubernetes cluster. Also known as Rook or Rook Operator. Rook operations include:

  • Deploying and managing a Ceph cluster based on provided Rook CRs such as CephCluster, CephBlockPool, CephObjectStore, and so on.

  • Orchestrating the state of the Ceph cluster and all its daemons.

KaaSCephCluster custom resource (CR)

Represents the customization of a Kubernetes installation and allows you to define the required Ceph configuration through the Container Cloud web UI before deployment. For example, you can define the failure domain, Ceph pools, Ceph node roles, number of Ceph components such as Ceph OSDs, and so on. The ceph-kcc-controller controller on the Container Cloud management cluster manages the KaaSCephCluster CR.

Ceph Controller

A Kubernetes controller that obtains the parameters from Container Cloud through a CR, creates CRs for Rook, and updates its CR status based on the Ceph cluster deployment progress. It creates users, pools, and keys for OpenStack and Kubernetes and provides Ceph configurations and keys to access them. Also, Ceph Controller eventually obtains the data from the OpenStack Controller (Rockoon) for the Keystone integration and updates the Ceph Object Gateway services configurations to use Keystone for user authentication.

The Ceph Controller operations include:

  • Transforming user parameters from the Container Cloud Ceph CR into Rook CRs and deploying a Ceph cluster using Rook.

  • Providing integration of the Ceph cluster with Kubernetes.

  • Providing data for OpenStack to integrate with the deployed Ceph cluster.

Ceph Status Controller

A Kubernetes controller that collects all valuable parameters from the current Ceph cluster, its daemons, and entities and exposes them into the KaaSCephCluster status. Ceph Status Controller operations include:

  • Collecting all statuses from a Ceph cluster and corresponding Rook CRs.

  • Collecting additional information on the health of Ceph daemons.

  • Providing information to the status section of the KaaSCephCluster CR.

Ceph Request Controller

A Kubernetes controller that obtains the parameters from Container Cloud through a CR and manages Ceph OSD lifecycle management (LCM) operations. It allows for a safe Ceph OSD removal from the Ceph cluster. Ceph Request Controller operations include:

  • Providing an ability to perform Ceph OSD LCM operations.

  • Obtaining specific CRs to remove Ceph OSDs and executing them.

  • Pausing the regular Ceph Controller reconciliation until all requests are completed.

A typical Ceph cluster consists of the following components:

  • Ceph Monitors - three or, in rare cases, five Ceph Monitors.

  • Ceph Managers - one Ceph Manager in a regular cluster.

  • Ceph Object Gateway (radosgw) - Mirantis recommends having three or more radosgw instances for HA.

  • Ceph OSDs - the number of Ceph OSDs may vary according to the deployment needs.

    Warning

    • A Ceph cluster with 3 Ceph nodes does not provide hardware fault tolerance and is not eligible for recovery operations, such as a disk or an entire Ceph node replacement.

    • A Ceph cluster uses the replication factor that equals 3. If the number of Ceph OSDs is less than 3, a Ceph cluster moves to the degraded state with the write operations restriction until the number of alive Ceph OSDs equals the replication factor again.

The placement of Ceph Monitors and Ceph Managers is defined in the KaaSCephCluster CR.

The following diagram illustrates the way a Ceph cluster is deployed in Container Cloud:

_images/ceph-deployment.png

The following diagram illustrates the processes within a deployed Ceph cluster:

_images/ceph-data-flow.png

Ceph limitations

A Ceph cluster configuration in MOSK has the following limitations, among others:

  • Only one Ceph Controller per MOSK cluster and only one Ceph cluster per Ceph Controller are supported.

  • The replication size for any Ceph pool must be set to more than 1.

  • Only one CRUSH tree per cluster. The separation of devices per Ceph pool is supported through device classes with only one pool of each type for a device class.

  • All CRUSH rules must have the same failure_domain.

  • Only the following types of CRUSH buckets are supported:

    • topology.kubernetes.io/region

    • topology.kubernetes.io/zone

    • topology.rook.io/datacenter

    • topology.rook.io/room

    • topology.rook.io/pod

    • topology.rook.io/pdu

    • topology.rook.io/row

    • topology.rook.io/rack

    • topology.rook.io/chassis

  • RBD mirroring is not supported.

  • Consuming an existing Ceph cluster is not supported.

  • Lifted since MOSK 23.1: CephFS is unsupported. Multiple CephFS are supported since MOSK 25.1.

  • Only IPv4 is supported.

  • If two or more Ceph OSDs are located on the same device, there must be no dedicated WAL or DB for this class.

  • Only a full collocation or dedicated WAL and DB configurations are supported.

  • The minimum size of any defined Ceph OSD device is 5 GB.

  • Ceph OSDs support only raw disks as data devices, meaning that no dm or lvm devices are allowed.

  • Lifted since MOSK 23.3: A Ceph cluster does not support removable devices (with hotplug enabled) for deploying Ceph OSDs.

  • When adding a Ceph node with the Ceph Monitor role, if any issues occur with the Ceph Monitor, rook-ceph removes it and adds a new Ceph Monitor instead, named using the next alphabetic character in order. Therefore, the Ceph Monitor names may not follow the alphabetical order. For example, a, b, d, instead of a, b, c.

  • Reducing the number of Ceph Monitors is not supported and causes the Ceph Monitor daemons removal from random nodes.

  • Removal of the mgr role in the nodes section of the KaaSCephCluster CR does not remove Ceph Managers. To remove a Ceph Manager from a node, remove it from the nodes spec and manually delete the mgr pod in the Rook namespace.

  • Lifted since MOSK 24.1: Ceph does not support allocation of Ceph RGW pods on nodes where the Federal Information Processing Standard (FIPS) mode is enabled.

Ceph integration with OpenStack

The integration between Ceph and OpenStack (Rockoon) Controllers is implemented through the shared Kubernetes openstack-ceph-shared namespace. Both controllers have access to this namespace to read and write the Kubernetes kind: Secret objects.

_images/osctl-ceph-integration.png

As Ceph is the required and only supported backend for several OpenStack services, all necessary Ceph pools must be specified in the configuration of the kind: MiraCeph custom resource as part of the deployment. Once the Ceph cluster is deployed, the Ceph Controller posts the information required by the OpenStack services to be properly configured as a kind: Secret object into the openstack-ceph-shared namespace. The OpenStack Controller watches this namespace. Once the corresponding secret is created, the OpenStack Controller transforms this secret to the data structures expected by the OpenStack-Helm charts. Even if an OpenStack installation is triggered at the same time as a Ceph cluster deployment, the OpenStack Controller halts the deployment of the OpenStack services that depend on Ceph availability until the secret in the shared namespace is created by the Ceph Controller.

For the configuration of Ceph Object Gateway as an OpenStack Object Storage, the reverse process takes place. The OpenStack Controller waits for the OpenStack-Helm to create a secret with OpenStack Identity (Keystone) credentials that Ceph Object Gateway must use to validate the OpenStack Identity tokens, and posts it back to the same openstack-ceph-shared namespace in the format suitable for consumption by the Ceph Controller. The Ceph Controller then reads this secret and reconfigures Ceph Object Gateway accordingly.

Mirantis StackLight

StackLight is the logging, monitoring, and alerting solution that provides a single pane of glass for cloud maintenance and day-to-day operations as well as offers critical insights into cloud health including operational information about the components deployed with Mirantis OpenStack for Kubernetes (MOSK). StackLight is based on Prometheus, an open-source monitoring solution and a time series database, and OpenSearch, the logs and notifications storage.

Deployment architecture

Mirantis OpenStack for Kubernetes (MOSK) deploys the StackLight stack as a release of a Helm chart that contains the helm-controller and HelmBundle custom resources. The StackLight HelmBundle consists of a set of Helm charts describing the StackLight components. Apart from the OpenStack-specific components below, StackLight also includes the components described in Mirantis Container Cloud Reference Architecture: Deployment architecture. By default, the StackLight logging stack is disabled.

During the StackLight configuration when deploying a MOSK cluster, you can define the HA or non-HA StackLight architecture type. Non-HA StackLight requires a backend storage provider, for example, a Ceph cluster. For details, see Mirantis Container Cloud Reference Architecture: StackLight database modes.

OpenStack-specific StackLight components overview

StackLight component

Description

Prometheus native exporters and endpoints

Export the existing metrics as Prometheus metrics and include:

  • libvirt-exporter

  • memcached-exporter

  • mysql-exporter

  • rabbitmq-exporter

  • tungstenfabric-exporter

Telegraf OpenStack plugin

Collects and processes the OpenStack metrics.

Monitored components

StackLight measures, analyzes, and reports in a timely manner about failures that may occur in the following Mirantis OpenStack for Kubernetes (MOSK) components and their sub-components. Apart from the components below, StackLight also monitors the components listed in Mirantis Container Cloud Reference Architecture: Monitored components.

  • Libvirt

  • Memcached

  • MariaDB

  • NTP

  • OpenStack (Barbican, Cinder, Designate, Glance, Heat, Horizon, Ironic, Keystone, Neutron, Nova, Octavia)

  • OpenStack SSL certificates

  • Tungsten Fabric (Cassandra, Kafka, Redis, ZooKeeper)

  • RabbitMQ

OpenSearch and Prometheus storage sizing

Caution

Calculations in this document are based on numbers from a real-scale test cluster with 34 nodes. The exact space required for metrics and logs must be calculated depending on the ongoing cluster operations. Some operations force the generation of additional metrics and logs. The values below are approximate. Use them only as recommendations.

During the deployment of a new cluster, you must specify the OpenSearch retention time and Persistent Volume Claim (PVC) size, Prometheus PVC, retention time, and retention size. When configuring an existing cluster, you can only set OpenSearch retention time, Prometheus retention time, and retention size.

The following table describes the recommendations for both OpenSearch and Prometheus retention size and PVC size for a cluster with 34 nodes. Retention time depends on the space allocated for the data. To calculate the required retention time, use the {retention time} = {retention size} / {amount of data per day} formula.

Service

Required space per day

Description

OpenSearch

StackLight in non-HA mode:
  • 202 - 253 GB for the entire cluster

  • ~6 - 7.5 GB for a single node

StackLight in HA mode:
  • 404 - 506 GB for the entire cluster

  • ~12 - 15 GB for a single node

When setting Persistent Volume Claim Size for OpenSearch during the cluster creation, take into account that it defines the PVC size for a single instance of the OpenSearch cluster. StackLight in HA mode has 3 OpenSearch instances. Therefore, for a total OpenSearch capacity, multiply the PVC size by 3.

Prometheus

  • 11 GB for the entire cluster

  • ~400 MB for a single node

Every Prometheus instance stores the entire database. Multiple replicas store multiple copies of the same data. Therefore, treat the Prometheus PVC size as the capacity of Prometheus in the cluster. Do not sum them up.

Prometheus has built-in retention mechanisms based on the database size and time series duration stored in the database. Therefore, if you miscalculate the PVC size, retention size set to ~1 GB less than the PVC size will prevent disk overfilling.
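
The sizing rules in this section can be combined into a short calculation. The following Python sketch applies the retention time formula, the multiplication of the OpenSearch PVC size by the number of instances in HA mode, and the recommendation to keep the Prometheus retention size about 1 GB below the PVC size. The PVC sizes in the example are hypothetical, the per-day figures come from the table for the 34-node test cluster, and the function names are illustrative only.

```python
def retention_days(retention_size_gb: float, data_per_day_gb: float) -> float:
    """{retention time} = {retention size} / {amount of data per day}."""
    if data_per_day_gb <= 0:
        raise ValueError("amount of data per day must be positive")
    return retention_size_gb / data_per_day_gb


def opensearch_total_capacity_gb(pvc_size_gb: float, instances: int = 3) -> float:
    """Total OpenSearch capacity: the PVC size applies to each instance."""
    return pvc_size_gb * instances


def prometheus_retention_size_gb(pvc_size_gb: float, headroom_gb: float = 1.0) -> float:
    """Retention size set slightly below the PVC size to avoid disk overfilling."""
    return pvc_size_gb - headroom_gb


# Hypothetical PVC sizes for a cluster comparable to the 34-node test cluster.
opensearch_pvc_gb = 2000  # per OpenSearch instance, StackLight in HA mode
prometheus_pvc_gb = 200   # each Prometheus replica stores the full database

# Use the upper bound of the HA estimate (506 GB per day) to stay conservative.
opensearch_days = retention_days(opensearch_total_capacity_gb(opensearch_pvc_gb), 506)
prometheus_days = retention_days(prometheus_retention_size_gb(prometheus_pvc_gb), 11)

print(f"OpenSearch retention: about {opensearch_days:.1f} days")
print(f"Prometheus retention: about {prometheus_days:.1f} days")
```

Treat the results as rough estimates. As noted in the caution above, the actual daily volume depends on the ongoing cluster operations.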

StackLight integration with OpenStack

StackLight integration with OpenStack includes automatic discovery of RabbitMQ credentials for notifications and OpenStack credentials for OpenStack API metrics. For details, see the openstack.rabbitmq.credentialsConfig and openstack.telegraf.credentialsConfig parameters description in StackLight configuration parameters.

Workload monitoring

Lifecycle management operations of a MOSK cluster may affect its workloads and, specifically, may cause network connectivity interruptions for instances running in OpenStack. To make sure that the downtime of cloud applications still fits into Service Level Agreements (SLAs), MOSK provides the tooling to measure the network availability of instances.

Additionally, continuous monitoring of the network connectivity in the cluster is essential for early detection of infrastructure problems.

MOSK enables cloud operators to oversee the availability of workloads hosted in their OpenStack infrastructure on several levels:

  • Monitoring of floating IP addresses through the Cloudprober service

  • Monitoring of network ports availability through the Portprober service

Floating IP address availability monitoring (Cloudprober)

Available since MOSK 23.2 TechPreview

The floating IP address availability monitoring service (Cloudprober) is a special probing agent that starts on controller nodes and periodically pings selected floating IP addresses. As of today, the agent supports only Internet Control Message Protocol (ICMP) to determine the IP address availability.

instance_availability_arch

To monitor the availability of floating IP addresses, your MOSK cluster and workloads need to meet the following requirements:

  • There must be layer-3 connectivity between the floating IP networks of the cluster and the nodes running the OpenStack control plane.

  • The guest operating system of the monitored OpenStack instances must allow the ICMP ingress and egress traffic.

  • OpenStack security groups used by the monitored instances must allow the ICMP ingress and egress traffic.

To enable the floating IP address availability monitoring service, use the following OpenStackDeployment definition:

spec:
  features:
    services:
      - cloudprober

For the detailed configuration procedure of the floating IP address availability monitoring service, refer to Configure monitoring of cloud workload availability.

Network port availability monitoring (Portprober)

Available since MOSK 24.2 TechPreview

The network port availability monitoring service (Portprober) is implemented as an extension to the OpenStack Neutron service. The extension is enabled automatically together with the Cloudprober service described above.

Also, you can enable Portprober explicitly, regardless of whether Cloudprober is enabled or not. To do so, specify the following structure in the OpenStackDeployment custom resource:

spec:
  features:
    neutron:
      extensions:
        portprober:
          enabled: true

The Portprober service is supported only for the following cloud configurations:

  • OpenStack version is Antelope or newer

  • Neutron OVS backend for networking (Tungsten Fabric and OVN backends are not supported)

portprober

The Portprober agent automatically connects to all OpenStack virtual networks and probes all the ports that are plugged in there and are in the bound state, meaning they are associated with an instance or a network service.

The service makes no difference between private and external networks and also reports the availability of the ports that belong to virtual routers.

The service relies on the ARP protocol to determine port availability and does not require any security groups to be assigned to monitored instances, as opposed to the floating IP address availability monitoring service (Cloudprober).

Known limitations

Among the known limitations of the network port availability monitoring service is the lack of support for IPv6. The service ignores the ports that do not have IPv4 addresses associated with them.

StackLight logging indices

Available since MCC 2.26.0 (17.1.0 and 16.1.0)

StackLight logging indices are managed by OpenSearch data streams, which were introduced in OpenSearch 2.6. Data streams are a convenient way to manage insert-only pipelines such as log message collection. The solution consists of the following elements:

  • Data stream objects that can be referred to by alias:

    • Audit - dedicated for Container Cloud, MKE, and host audit logs, ensuring data integrity and security.

    • System - replaces Logstash for system logs, provides a streamlined approach to log management.

  • Write index - current index where ingestion can be performed without removing a data stream.

  • Read indices - indices created after the rollover mechanism is applied.

  • Rollover policy - creates a new write index for a data stream based on the size of shards.

Example of an initial index list:

health status index               uuid                    pri rep docs.count docs.deleted store.size pri.store.size
green  open   .ds-audit-000001    30q4HLGmR0KmpRR8Kvy5jw    1   1    2961719            0    496.3mb          248mb
green  open   .ds-system-000001   5_eFtMAFQa6aFB7nttHjkA    1   1       2476            0      6.1mb            3mb

Example of the index list after the rollover is applied to the audit index:

health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .ds-audit-000001    30q4HLGmR0KmpRR8Kvy5jw   1   1    9819913            0      1.5gb        784.8mb
green  open   .ds-audit-000002    U1fbs0i9TJmOsAOoR7cERg   1   1    2961719            0    496.3mb          248mb
green  open   .ds-system-000001   5_eFtMAFQa6aFB7nttHjkA   1   1       2476            0      6.1mb            3mb
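
The backing index names in the listings above appear to follow the .ds-<data stream>-<six-digit generation> pattern, with the generation incremented on each rollover. The following Python sketch relies on that assumption, inferred only from the examples in this section, to derive the name of the next write index. The next_write_index helper is hypothetical and is not part of StackLight or OpenSearch.

```python
import re

# Assumed naming pattern, inferred from the example listings: .ds-<stream>-<generation>
_BACKING_INDEX = re.compile(r"^\.ds-(?P<stream>.+)-(?P<generation>\d{6})$")


def next_write_index(current_write_index: str) -> str:
    """Return the expected name of the write index created by the next rollover."""
    match = _BACKING_INDEX.match(current_write_index)
    if match is None:
        raise ValueError(f"unexpected backing index name: {current_write_index}")
    generation = int(match.group("generation")) + 1
    return f".ds-{match.group('stream')}-{generation:06d}"


print(next_write_index(".ds-audit-000001"))
print(next_write_index(".ds-system-000001"))
```

Such a helper can be handy in scripts that compare index listings taken before and after a rollover.
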
Audit and system index templates

The following table contains a simplified template of the audit and system indices. The user can perform aggregation queries over keyword fields.

Audit and system template

Field

Type

Description

@timestamp

date

Time when a log event was produced, if available in the parsed message. Otherwise, the time when the event was ingested.

container.id

keyword

Identifier of the Docker container that the application generating the event was running in.

container.image

text

Name of the Docker image defined as <registry>/<repo>:<tag>.

container.name

keyword

Name of the Docker container that the application generating the event was running in.

event.source

keyword

Source of the event: "file", "journal", or "container".

event.provider

keyword

Name of the application that produced the message.

host.hostname

keyword

Name of the host that the message was collected from.

log.file.path

keyword

Path on the host to the source file for the message if the message was not produced by the application running in the container or system unit.

log.level

keyword

Severity level of the event taken from the parsed message content.

message

text

Unparsed content of the event message.

orchestrator.labels

flat_object

Kubernetes metadata labels of the pod that runs the Docker container of the application.

orchestrator.namespace

keyword

Kubernetes namespace where the application pod was running.

orchestrator.pod

keyword

Kubernetes pod name of the pod running the application Docker container.

orchestrator.type

keyword

Type of orchestrator: "mke" or "kubernetes". Empty for host file logs and journal logs.
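
To illustrate an aggregation query over a keyword field, the following Python sketch builds a request body that uses the standard OpenSearch terms aggregation. The set of keyword fields is copied from the table above. The terms_aggregation helper and the by_value aggregation name are hypothetical. Sending the body to the search endpoint of the audit or system data stream is not shown because it depends on how OpenSearch is exposed in your environment.

```python
import json

# Fields of the keyword type, as listed in the audit and system template table.
KEYWORD_FIELDS = {
    "container.id",
    "container.name",
    "event.source",
    "event.provider",
    "host.hostname",
    "log.file.path",
    "log.level",
    "orchestrator.namespace",
    "orchestrator.pod",
    "orchestrator.type",
}


def terms_aggregation(field: str, size: int = 10) -> dict:
    """Build a search body that counts log events per value of a keyword field."""
    if field not in KEYWORD_FIELDS:
        raise ValueError(f"{field} is not a keyword field of the template")
    return {
        "size": 0,  # return only the aggregation buckets, not individual events
        "aggs": {"by_value": {"terms": {"field": field, "size": size}}},
    }


body = terms_aggregation("log.level")
print(json.dumps(body, indent=2))
```

The guard in the helper reflects the note above: fields of the text type, such as message, are not suitable for this kind of aggregation.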

The following table contains a simplified template of extra fields for the system index that are not present in the audit template.

System template - extra fields

Field

Type

Description

http.destination.address

keyword

IP address of the HTTP request destination.

http.destination.domain

keyword

Name of the OpenStack service that the HTTP request was sent to.

http.request.duration

long

Request duration in nanoseconds.

http.request.id

keyword

Request ID generated by OpenStack.

http.request.method

keyword

HTTP request method.

http.request.path

keyword

Path of the HTTP URL request.

http.response.status_code

long

HTTP status code of the response.

http.source.address

keyword

IP address of the HTTP request source.

System index mapping to the Logstash index

The following table maps the system index fields to the Logstash ones:

System index fields mapped to Logstash index fields

System

Logstash - removed in MCC 2.26.0 (17.1.0 and 16.1.0)

@timestamp

@timestamp

container.id

docker.container_id

container.image

kubernetes.container_image

container.name

kubernetes.container_name

event.source

n/a

event.provider

logger

host.hostname

hostname

http.destination.address

parsed.upstream_addr

http.destination.domain

parsed.upstream_name

http.request.duration

parsed.duration

http.request.id

parsed.req_id

http.request.method

parsed.method

http.request.path

parsed.path

http.response.status_code

parsed.code

http.source.address

parsed.host

log.file.path

n/a

log.level

severity_label

message

message

orchestrator.labels

kubernetes.labels

orchestrator.namespace

kubernetes.namespace_name

orchestrator.pod

kubernetes.pod_name

orchestrator.type

n/a
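
When migrating saved queries or dashboards from the Logstash index to the system index, the mapping above can be expressed as a lookup table. The following is a minimal sketch: the dictionary content is copied from the table, while the function names are hypothetical. System fields marked n/a have no Logstash counterpart and are therefore absent from the dictionary.

```python
# Logstash field name -> system index field name, taken from the table above.
LOGSTASH_TO_SYSTEM = {
    "@timestamp": "@timestamp",
    "docker.container_id": "container.id",
    "kubernetes.container_image": "container.image",
    "kubernetes.container_name": "container.name",
    "logger": "event.provider",
    "hostname": "host.hostname",
    "parsed.upstream_addr": "http.destination.address",
    "parsed.upstream_name": "http.destination.domain",
    "parsed.duration": "http.request.duration",
    "parsed.req_id": "http.request.id",
    "parsed.method": "http.request.method",
    "parsed.path": "http.request.path",
    "parsed.code": "http.response.status_code",
    "parsed.host": "http.source.address",
    "severity_label": "log.level",
    "message": "message",
    "kubernetes.labels": "orchestrator.labels",
    "kubernetes.namespace_name": "orchestrator.namespace",
    "kubernetes.pod_name": "orchestrator.pod",
}


def translate_field(logstash_field: str) -> str:
    """Return the system index field for a Logstash field name.

    Unknown names are returned unchanged so that the caller can review them.
    """
    return LOGSTASH_TO_SYSTEM.get(logstash_field, logstash_field)


def translate_fields(fields):
    """Translate a list of Logstash field names, preserving the order."""
    return [translate_field(name) for name in fields]
```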

Blueprints

This section contains a collection of Mirantis OpenStack for Kubernetes (MOSK) architecture blueprints that include common cluster topology and configuration patterns that you can refer to when building a MOSK cloud. Every blueprint is validated by Mirantis and is known to work. You can use these blueprints alone or in combination, although the interoperability of all possible combinations cannot be guaranteed.

The section provides information on the target use cases, pros and cons of every blueprint and outlines the extents of its applicability. However, do not hesitate to reach out to Mirantis if you have any questions or doubts on whether a specific blueprint can be applied when designing your cloud.

Remote compute nodes
Introduction to edge computing

Although a classic cloud approach allows resources to be distributed across multiple regions, it still needs powerful data centers to host control planes and compute clusters. Such regional centralization poses challenges when the number of data consumers grows. It becomes hard to access the resources hosted in the cloud even though the resources are located in the same geographic region. The solution would be to bring the data closer to the consumer. And this is exactly what edge computing provides.

Edge computing is a paradigm that brings computation and data storage closer to the sources of data or the consumer. It is designed to improve response time and save bandwidth.

A few examples of use cases for edge computing include:

  • Hosting a video stream processing application on premises of a large stadium during the Super Bowl match

  • Placing the inventory or augmented reality services directly in the industrial facilities, such as storage, powerplant, shipyard, and so on

  • Deploying a small computation node in a remote village supermarket to host an application for store automation and accounting

These and many other use cases could be solved by deploying multiple edge clusters managed from a single central place. The idea of centralized management plays a significant role for the business efficiency of the edge cloud environment:

  • Cloud operators obtain a single management console for the cloud that simplifies the Day-1 provisioning of new edge sites and Day-2 operations across multiple geographically distributed points of presence

  • Cloud users get the ability to transparently connect their edge applications with central databases or business logic components hosted in data centers or public clouds

Depending on the size, location, and target use case, the points of presence comprising an edge cloud environment can be divided into five major categories. Mirantis OpenStack powered by Mirantis Container Cloud offers reference architectures to address the centralized management in core and regional data centers as well as edge sites.

Overview of the remote compute nodes approach

Remote compute nodes is one of the approaches to the implementation of the edge computing concept offered by MOSK. The topology consists of a MOSK cluster residing in a data center, which is extended with multiple small groups of compute nodes deployed in geographically distanced remote sites. Remote compute nodes are integrated into the MOSK cluster just like the nodes in the central site with their configuration and life cycle managed through the same means.

Along with compute nodes, remote sites need to incorporate network gateway components that allow application users to consume edge services directly without looping the traffic through the central site.

Design considerations for a remote site

Deployment of an edge cluster managed from a single central place starts with proper planning. This section provides recommendations on how to approach the deployment design.

Compute nodes aggregation into availability zones

Mirantis recommends organizing nodes in each remote site into separate Availability Zones in the MOSK Compute (OpenStack Nova), Networking (OpenStack Neutron), and Block Storage (OpenStack Cinder) services. This enables the cloud users to be aware of the failure domain represented by a remote site and distribute the parts of their applications accordingly.

Storage

Typically, high latency between the central control plane and remote sites makes it infeasible to rely on Ceph as storage for the instance root/ephemeral and block data.

Mirantis recommends that you configure the remote sites to use the following backends:

  • Local storage (LVM or QCOW2) as a storage backend for the MOSK Compute service. See Image storage backend for the configuration details.

  • LVM on iSCSI backend for the MOSK Block Storage service. See Enable LVM block storage for the enablement procedure.

To maintain the small size of a remote site, the compute nodes need to be hyper-converged and combine the compute and block storage functions.

Site sizing

There is no limitation on the number of the remote sites and their size. However, when planning the cluster, ensure consistency between the total number of nodes managed by a single control plane and the value of the size parameter set in the OpenStackDeployment custom resource. For the list of supported sizes, refer to Main elements.

Additionally, the sizing of the remote site needs to take into account the characteristics of the networking channel with the main site.

Typically, an edge site consists of 3-7 compute nodes installed in a single, usually rented, rack.

Network latency and bandwidth

Mirantis recommends keeping the network latency between the main and remote sites as low as possible. For stable interoperability of cluster components, the latency needs to be around 30-70 milliseconds. However, depending on the cluster configuration and the dynamism of the workloads running in the remote site, cluster stability can be preserved with a latency of up to 190 milliseconds.

The bandwidth of the communication channel between the main and remote sites needs to be sufficient to run the following traffic:

  • The control plane and management traffic, such as OpenStack messaging, database access, MOSK underlay Kubernetes cluster control plane, and so on. A single remote compute node in the idle state requires at minimum 1.5 Mbit/s of bandwidth to perform the non-data plane communications.

  • The data plane traffic, such as OpenStack image operations, instances VNC console traffic, and so on, that heavily depend on the profile of the workloads and other aspects of the cloud usage.

In general, Mirantis recommends having a minimum of 100 Mbit/s bandwidth between the main and remote sites.
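
The figures above can be combined into a quick planning check. The following is a minimal sketch, not a sizing tool: the function names and the category labels are hypothetical, treating 70 milliseconds as the upper bound of the recommended range is an interpretation of "around 30-70 milliseconds", and the bandwidth estimate covers only the non-data plane traffic of idle nodes.

```python
IDLE_NODE_CONTROL_MBIT = 1.5  # minimum per idle remote compute node (from the text)
RECOMMENDED_LINK_MBIT = 100   # general minimum recommended for the link (from the text)


def min_control_plane_bandwidth(remote_nodes: int) -> float:
    """Lower bound, in Mbit/s, for non-data plane traffic of idle remote nodes."""
    return remote_nodes * IDLE_NODE_CONTROL_MBIT


def classify_latency(latency_ms: float) -> str:
    """Classify a measured main-to-remote latency against the documented figures."""
    if latency_ms <= 70:
        return "recommended"
    if latency_ms <= 190:
        # Depends on the cluster configuration and workload dynamism.
        return "tolerable"
    return "beyond documented range"
```

For example, a remote site of seven idle compute nodes needs at least 7 x 1.5 = 10.5 Mbit/s for control plane and management traffic alone, before any data plane traffic is taken into account.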

Loss of connectivity to the central site

MOSK remote compute nodes architecture is designed to tolerate a temporary loss of connectivity between the main cluster and the remote sites. In case of a disconnection, the instances running on remote compute nodes keep running normally and preserve their ability to read and write ephemeral and block storage data, provided that it is located in the same site, as well as their connectivity to their neighbours and edge application users. However, the instances will not have access to any cloud services or applications located outside of their remote site.

Since the MOSK control plane communicates with remote compute nodes through the same network channel, cloud users will not be able to perform any manipulations, for example, instance creation, deletion, snapshotting, and so on, over their edge applications until the connectivity gets restored. MOSK services providing high availability to cloud applications, such as the Instance HA service and Network service, need to be connected to the remote compute nodes to perform a failover of application components running in the remote site.

Once the connectivity between the main and the remote site is restored, all functions become available again. The period during which an edge application can sustain normal function after a connectivity loss is determined by multiple factors including the selected networking backend for the MOSK cluster. Mirantis recommends that a cloud operator performs a set of test manipulations over the cloud resources hosted in the remote site to ensure that it has been fully restored.

Long-lived graceful restart in Tungsten Fabric

When configured in Tungsten Fabric-powered clouds, the Graceful restart and long-lived graceful restart feature significantly improves the MOSK ability to sustain the connectivity of workloads running at remote sites in situations when a site experiences a loss of connection to the central hosting location of the control plane.

Extensive testing has demonstrated that remote sites can effectively withstand a 72-hour control plane disconnection with zero impact on the running applications.

Security of cross-site communication

Given that a remote site communicates with its main MOSK cluster across a wide area network (WAN), it becomes important to protect sensitive data from being intercepted and viewed by a third party. Specifically, you should ensure the protection of the data belonging to the following cloud components:

  • Mirantis Container Cloud life-cycle management plane

    Bare metal servers provisioning and control, Kubernetes cluster deployment and management, Mirantis StackLight telemetry

  • MOSK control plane

    Communication between the components of OpenStack, Tungsten Fabric, and Mirantis Ceph

  • MOSK data plane

    Cloud application traffic

The most reliable way to protect the data is to configure the network equipment in the data center and the remote site to encapsulate all remote-to-main communications into an encrypted VPN tunnel. Alternatively, Mirantis Container Cloud and MOSK can be configured to force encryption of specific types of network traffic, such as:

  • Kubernetes networking for MOSK underlying Kubernetes cluster that handles the vast majority of in-MOSK communications

  • OpenStack tenant networking that carries all the cloud application traffic

The ability to enforce traffic encryption depends on the specific version of the Mirantis Container Cloud and MOSK in use, as well as the selected SDN backend for OpenStack.

Remote compute nodes with Tungsten Fabric

TechPreview

In MOSK, the main cloud that controls remote computes can be the regional site that hosts the regional cluster and the MOSK control plane. Additionally, it can contain local storage and compute nodes.

The remote computes implementation in MOSK considers Tungsten Fabric as an SDN solution.

Remote compute bare metal servers are configured as Kubernetes workers hosting the deployments for:

  • Tungsten Fabric vRouter-gateway service

  • Nova-compute

  • Local storage (LVM with iSCSI block storage)

Large clusters

This section describes a validated MOSK cluster architecture that is capable of handling 10,000 instances under a single control plane.

Hardware characteristics
Node roles layout

Role

Nodes count

Server specification

Management cluster Kubernetes nodes

3

  • 16 vCPU 3.4 GHz

  • 32 GB RAM

  • 2 x 480 GB SSD drives

  • 2 x 10 Gbps NICs

MOSK cluster Kubernetes master nodes

3

  • 16 vCPU 3.4 GHz

  • 32 GB RAM

  • 2 x 480 GB SSD drives

  • 2 x 10 Gbps NICs

OpenStack controller nodes

5

  • 64 vCPU 2.5 GHz

  • 256 GB RAM

  • 2 x 240 GB SSD drives

  • 2 x 3.8 TB NVMe drives

  • 2 x 25 Gbps NICs

OpenStack compute and storage nodes

Up to 500 total

  • 64 vCPU 2.5 GHz

  • 256 GB RAM

  • 2 x 240 GB SSD drives

  • 2 x 3.8 TB NVMe drives

  • 2 x 25 Gbps NICs

StackLight nodes

3

  • 64 vCPU 2.5 GHz

  • 256 GB RAM

  • 2 x 240 GB SSD drives

  • 2 x 3.8 TB NVMe drives

  • 2 x 25 Gbps NICs

Cluster architecture
Cluster architecture

Configuration

Value

Dedicated StackLight nodes

Yes

Dedicated Ceph storage nodes

Yes

Dedicated control plane Kubernetes nodes

Yes

Dedicated OpenStack gateway nodes

No, collocated with OpenStack controller nodes

OpenStack networking backend

Open vSwitch, no Distributed Virtual Router

Cluster size in the OpenStackDeployment CR

medium

Cluster validation

The architecture validation is performed by simultaneously creating multiple OpenStack resources of various types and executing functional tests against each resource. The number of resources hosted in the cluster at the moment when a certain threshold of non-operational resources starts being observed is described below as the cluster capacity limit.

Note

A successfully created resource has the Active status in the API and passes the functional tests, for example, its floating IP address is accessible. The MOSK cluster is considered to be able to handle the created resources if it successfully performs the LCM operations including the OpenStack services restart, both on the control and data plane.

Note

The key limiting factor for creating more OpenStack objects in this illustrative setup is hardware resources (vCPU and RAM) available on the compute nodes.

OpenStack resource capacity limits

OpenStack resource

Limit

Instances

11101

Network ports - instances

37337

Network ports - service (avg. per gateway node)

3517

Volumes

2784

Routers

2448

Networks

3383

Orchestration stacks

2419
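
To compare a planned deployment against the validated numbers, the limits above can be captured in a small lookup. The following is a minimal sketch: the dictionary keys and the function name are hypothetical, and the limits apply only to the validated hardware and configuration described in this section.

```python
# Capacity limits observed on the validated cluster, copied from the table above.
VALIDATED_LIMITS = {
    "instances": 11101,
    "instance_ports": 37337,
    "volumes": 2784,
    "routers": 2448,
    "networks": 3383,
    "stacks": 2419,
}


def over_limit(planned: dict) -> dict:
    """Return the planned resources that exceed the validated limit.

    The result maps a resource name to the number of objects above the limit.
    """
    return {
        name: count - VALIDATED_LIMITS[name]
        for name, count in planned.items()
        if name in VALIDATED_LIMITS and count > VALIDATED_LIMITS[name]
    }
```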

Hardware resources utilization
Consumed hardware resources by a filled up cluster in the idle state

Node role

Load average

vCPU

RAM in GB

OpenStack controller + gateway

10

10

100

OpenStack compute

30

25

160

Ceph storage

2

2

15

StackLight

10

8

102

Kubernetes master

10

6

13

Cephless cloud

Available since MOSK 23.2 TechPreview

Persistent storage is a key component of any MOSK deployment. Out of the box, MOSK includes an open-source software-defined storage solution (Ceph), which hosts various kinds of cloud application data, such as root and ephemeral disks for virtual machines, virtual machine images, attachable virtual block storage, and object data. In addition, a Ceph cluster usually acts as a storage for the internal MOSK components, such as Kubernetes, OpenStack, StackLight, and so on.

Being distributed and redundant by design, Ceph requires a certain minimum number of servers, also known as OSD or storage nodes, to work. A production-grade Ceph cluster typically consists of at least nine storage nodes, while a development and test environment may include four to six servers. For details, refer to MOSK cluster hardware requirements.

It is possible to reduce the overall footprint of a MOSK cluster by collocating the Ceph components with hypervisors on the same physical servers; this is also known as hyper-converged design. However, this architecture still may not satisfy the requirements of certain use cases for the cloud.

Standalone telco-edge MOSK clouds typically consist of three to seven servers hosted in a single rack, where every piece of CPU, memory, and disk resources is strictly accounted for and is better dedicated to the cloud workloads rather than to the control plane. For such clouds, where the cluster footprint is more important than the resiliency of the application data storage, it makes sense either not to have a Ceph cluster at all or to replace it with some primitive non-redundant solution.

Enterprise virtualization infrastructure with third-party storage is not a rare strategy among large companies that rely on proprietary storage appliances, provided by NetApp, Dell, HPE, Pure Storage, and other major players in the data storage sector. These industry leaders offer a variety of storage solutions meticulously designed to suit various enterprise demands. Many companies, having already invested substantially in proprietary storage infrastructure, prefer integrating MOSK with their existing storage systems. This approach allows them to leverage this investment rather than incurring new costs and logistical complexities associated with migrating to Ceph.

Architecture
Cephless-architecture

Kind of data

MOSK component

Data storage in Cephless architecture

Configuration

Root and ephemeral disks of instances

Compute service (OpenStack Nova)

  • Compute node local file system (QCOW2 images).

  • Compute node local storage devices (LVM volumes).

    You can select QCOW2 and LVM backend per compute node.

  • Volumes through the “boot from volume” feature of the Compute service.

    You can select the Boot from volume option when spinning up a new instance as a cloud user.

Volumes

Block Storage service (OpenStack Cinder)

  • MOSK standard LVM+iSCSI backend for the Block Storage service. This aligns in a seamless manner with the concept of hyper-converged design, wherein the LVM volumes are collocated on the compute nodes.

  • Third-party storage.

Volume configuration

Volumes backups

Block Storage service (OpenStack Cinder)

  • External NFS share TechPreview

  • External S3 endpoint TechPreview

Alternatively, you can disable the volume backup functionality.

Backup configuration

Tungsten Fabric database backups

Tungsten Fabric (Cassandra, ZooKeeper)

External NFS share TechPreview

Alternatively, you can disable the Tungsten Fabric database backups functionality.

Tungsten Fabric database

OpenStack database backups

OpenStack (MariaDB)

  • External NFS share TechPreview

  • External S3-compatible storage TechPreview

  • Local file system of one of the MOSK controller nodes. By default, database backups are stored on the local file system on the node where the MariaDB service is running. This imposes a risk to cloud security and resiliency. For enterprise environments, it is a common requirement to store all the backup data externally.

Alternatively, you can disable the database backup functionality.

Results of functional testing

OpenStack Tempest

Local file system of MOSK controller nodes.

The openstack-tempest-run-tests job responsible for running the Tempest suite stores the results of its execution in a volume requested through the pvc-tempest PersistentVolumeClaim (PVC). The subject volume can be created by the local volume provisioner on the same Kubernetes worker node, where the job runs. Usually, it is a MOSK controller node.

Run Tempest tests

Instance images and snapshots

Image service (OpenStack Glance)

You can configure the Block Storage service (OpenStack Cinder) to be used as a storage backend for images and snapshots. In this case, each image is represented as a volume.

Important

Representing images as volumes implies a hard requirement for the selected block storage backend to support the multi-attach capability, that is, concurrent reads and writes to and from a single volume.

Enable Cinder backend for Glance

Application object data

Object storage service (Ceph RADOS Gateway)

External S3, Swift, or any other third-party storage solutions compatible with object access protocols.

Note

An external object storage solution will not be integrated into the MOSK identity service (OpenStack Keystone). Therefore, the cloud applications will need to manage access to their object data themselves.

If no Ceph is deployed as part of a cluster, the MOSK built-in Object Storage service API endpoints are disabled automatically.

Logs, metrics, alerts

Mirantis StackLight (Prometheus, Alertmanager, Patroni, OpenSearch)

Local file system of MOSK controller nodes.

StackLight must be deployed in the HA mode, in which all its data is stored on the local file system of the nodes running StackLight services. In this mode, StackLight components are configured to handle the data replication themselves.

StackLight deployment architecture

Limitations
  • The determination of whether a MOSK cloud will include Ceph or not should take place during its planning and design phase. Once the deployment is complete, reconfiguring the cloud to switch between Ceph and non-Ceph architectures becomes impossible.

  • Mirantis recommends avoiding substitution of Ceph-backed persistent volumes in the MOSK underlying Kubernetes cluster with local volumes (local volume provisioner) for production environments. MOSK does not support such configuration unless the components that rely on these volumes can replicate their data themselves, for example, StackLight. Volumes provided by the local volume provisioner are not redundant, as they are bound to just a single node and can only be mounted from the Kubernetes pods running on the same nodes.

Node maintenance API

This section describes internal implementation of the node maintenance API and how OpenStack and Tungsten Fabric controllers communicate with LCM and each other during a managed cluster update.

Node maintenance API objects

The node maintenance API consists of the following objects:

  • Cluster level:

    • ClusterWorkloadLock

    • ClusterMaintenanceRequest

  • Node level:

    • NodeWorkloadLock

    • NodeMaintenanceRequest

WorkloadLock objects

The WorkloadLock objects are created by each Application Controller. These objects prevent LCM from performing any changes on the cluster or node level while the lock is in the active state. The inactive state of the lock means that the Application Controller has finished its work and the LCM can proceed with the node or cluster maintenance.

ClusterWorkloadLock object example configuration
apiVersion: lcm.mirantis.com/v1alpha1
kind: ClusterWorkloadLock
metadata:
  name: cluster-1-openstack
spec:
  controllerName: openstack
status:
  state: active # inactive;active;failed (default: active)
  errorMessage: ""
  release: "6.16.0+21.3"
NodeWorkloadLock object example configuration
apiVersion: lcm.mirantis.com/v1alpha1
kind: NodeWorkloadLock
metadata:
  name: node-1-openstack
spec:
  nodeName: node-1
  controllerName: openstack
status:
  state: active # inactive;active;failed (default: active)
  errorMessage: ""
  release: "6.16.0+21.3"
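
The gating role of the lock objects can be sketched as follows. This is an illustrative model, not the LCM implementation: the function name is hypothetical, and the locks are plain dictionaries that mirror the structure of the NodeWorkloadLock example above.

```python
def lcm_may_proceed(node_name, node_workload_locks):
    """Return True when every NodeWorkloadLock for the node is inactive.

    Each lock is a dict shaped like the NodeWorkloadLock example:
    {"spec": {"nodeName": ..., "controllerName": ...}, "status": {"state": ...}}.
    A node without any lock is treated as not blocked (an assumption of this sketch).
    """
    locks_for_node = [
        lock for lock in node_workload_locks
        if lock["spec"]["nodeName"] == node_name
    ]
    return all(lock["status"]["state"] == "inactive" for lock in locks_for_node)
```

With one lock per Application Controller, a single lock that is still active or failed is enough to hold back the maintenance of that node.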
MaintenanceRequest objects

The MaintenanceRequest objects are created by LCM. These objects notify Application Controllers about the upcoming maintenance of a cluster or a specific node.

ClusterMaintenanceRequest object example configuration
apiVersion: lcm.mirantis.com/v1alpha1
kind: ClusterMaintenanceRequest
metadata:
  name: cluster-1
spec:
  scope: drain # drain;os
NodeMaintenanceRequest object example configuration
apiVersion: lcm.mirantis.com/v1alpha1
kind: NodeMaintenanceRequest
metadata:
  name: node-1
spec:
  nodeName: node-1
  scope: drain # drain;os

The scope parameter in the object specification defines the impact on the managed cluster or node. The list of available options includes:

  • drain

    A regular managed cluster update. Each node in the cluster goes through a drain procedure. No node reboot takes place. The maximum impact includes a restart of services on the node, including Docker, which causes the restart of all containers present in the cluster.

  • os

    A node might be rebooted during the update. Triggers the workload evacuation by the OpenStack Controller (Rockoon).

When the MaintenanceRequest object is created, an Application Controller executes a handler to prepare workloads for maintenance and put appropriate WorkloadLock objects into the inactive state.

When maintenance is over, LCM removes MaintenanceRequest objects, and the Application Controllers move their WorkloadLock objects into the active state.
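
The interaction between MaintenanceRequest and WorkloadLock objects can be modeled as a small state machine. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, and moving the lock to the failed state when the handler raises an error is an assumption, since the text lists the failed state but does not describe when it is set.

```python
class ApplicationControllerSketch:
    """Toy model of an Application Controller that owns one WorkloadLock."""

    def __init__(self, name):
        self.name = name
        self.lock_state = "active"  # default state per the lock examples

    def on_maintenance_request_created(self, scope, prepare):
        """Prepare workloads for maintenance, then release the lock."""
        try:
            prepare(scope)  # for example, drain-related or os-related preparation
        except Exception:
            # Assumption: a failed handler is reflected in the lock state.
            self.lock_state = "failed"
        else:
            self.lock_state = "inactive"

    def on_maintenance_request_removed(self):
        """Maintenance is over: take the lock back."""
        self.lock_state = "active"
```

A typical cycle in this model is active, then inactive after the handler succeeds, then active again once LCM removes the request.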

OpenStack Controller maintenance API

When LCM creates the ClusterMaintenanceRequest object, the OpenStack Controller (Rockoon) ensures that all OpenStack components are in the Healthy state, which means that the pods are up and running, and the readiness probes are passing.

ClusterMaintenanceRequest object creation flow
ClusterMaintenanceRequest - create

When LCM creates the NodeMaintenanceRequest, the OpenStack Controller:

  1. Prepares components on the node for maintenance by removing nova-compute from scheduling.

  2. If the reboot of a node is possible, the instance migration workflow is triggered. The Operator can configure the instance migration flow through the Kubernetes node annotation and should define the required option before the managed cluster update. For configuration details, refer to Instance migration configuration for hosts.

    Also, since MOSK 25.1, cloud users can mark their instances for LCM to handle them individually during host maintenance operations. This allows for greater flexibility during cluster updates, especially for workloads that are sensitive to live migration. For details, refer to Configure per-instance migration mode.

  3. If the OpenStack Controller cannot migrate instances due to errors, it is suspended until all instances are migrated manually or the openstack.lcm.mirantis.com/instance_migration_mode annotation is set to skip.
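
The suspension rule from the last step can be sketched as a predicate. This is an illustrative simplification: the function and parameter names are hypothetical, and only the annotation key and its skip value are taken from the text.

```python
MIGRATION_MODE_ANNOTATION = "openstack.lcm.mirantis.com/instance_migration_mode"


def controller_suspended(node_annotations, instances_left, migration_failed):
    """Return True if the controller would wait for Operator action.

    node_annotations: dict of Kubernetes node annotations.
    instances_left: number of instances still hosted on the node.
    migration_failed: True if the automatic migration ended with errors.
    """
    if not migration_failed:
        return False
    if node_annotations.get(MIGRATION_MODE_ANNOTATION) == "skip":
        return False  # the Operator chose to skip the migration
    return instances_left > 0  # all instances migrated manually -> resume
```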

NodeMaintenanceRequest object creation flow
NodeMaintenanceRequest - create

When the node maintenance is over, LCM removes the NodeMaintenanceRequest object and the OpenStack Controller:

  • Verifies that the Kubernetes Node becomes Ready.

  • Verifies that all OpenStack components on a given node are Healthy, which means that the pods are up and running, and the readiness probes are passing.

  • Ensures that the OpenStack components are connected to RabbitMQ. For example, the Neutron Agents become alive on the node, and compute instances are in the UP state.

Note

The OpenStack Controller allows only one NodeWorkloadLock object at a time to be in the inactive state. Therefore, the update process for nodes is sequential.

NodeMaintenanceRequest object removal flow
NodeMaintenanceRequest - delete

When the cluster maintenance is over, the OpenStack Controller sets the ClusterWorkloadLock object back to active and the update completes.

ClusterMaintenanceRequest object removal flow
ClusterMaintenanceRequest - delete
Tungsten Fabric Controller maintenance API

The Tungsten Fabric (TF) Controller creates and uses both types of WorkloadLock objects: ClusterWorkloadLock and NodeWorkloadLock.

When the ClusterMaintenanceRequest object is created, the TF Controller verifies the TF cluster health status and proceeds as follows:

  • If the cluster is Ready, the TF Controller moves the ClusterWorkloadLock object to the inactive state.

  • Otherwise, the TF Controller keeps the ClusterWorkloadLock object in the active state.

When the NodeMaintenanceRequest object is created, the TF Controller verifies the vRouter pod state on the corresponding node and proceeds as follows:

  • If all containers are Ready, the TF Controller moves the NodeWorkloadLock object to the inactive state.

  • Otherwise, the TF Controller keeps the NodeWorkloadLock in the active state.

Note

If there is a NodeWorkloadLock object in the inactive state present in the cluster, the TF Controller does not process the NodeMaintenanceRequest object for other nodes until this inactive NodeWorkloadLock object becomes active.

When the cluster LCM removes the MaintenanceRequest object, the TF Controller waits for the vRouter pods to become ready and proceeds as follows:

  • If all containers are in the Ready state, the TF Controller moves the NodeWorkloadLock object to the active state.

  • Otherwise, the TF Controller keeps the NodeWorkloadLock object in the inactive state.
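
The node-level decisions of the TF Controller described above can be summarized in two small functions. This is a minimal sketch with hypothetical function and parameter names; it models only the resulting NodeWorkloadLock state and not the health checks themselves.

```python
def tf_node_lock_on_request(vrouter_containers_ready, other_inactive_lock_exists):
    """NodeWorkloadLock state after a NodeMaintenanceRequest is created."""
    if other_inactive_lock_exists:
        # The request for this node is not processed until the other
        # inactive NodeWorkloadLock becomes active again.
        return "active"
    return "inactive" if vrouter_containers_ready else "active"


def tf_node_lock_on_removal(vrouter_containers_ready):
    """NodeWorkloadLock state after LCM removes the MaintenanceRequest."""
    return "active" if vrouter_containers_ready else "inactive"
```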

Cluster update flow

This section describes the flow of a MOSK cluster update to product releases that contain major updates and require a node reboot, for example, to support a new Linux kernel.

The diagram below illustrates the sequence of operations that LCM controls under the hood during the update. We assume that the ClusterWorkloadLock and NodeWorkloadLock objects present in the cluster are in the active state before the cloud operator triggers the update.

Cluster update flow

See also

For details about the Application Controllers flow during different maintenance stages, refer to:

Phase 1: The Operator triggers the update
  1. The Operator sets appropriate annotations on nodes and selects suitable migration mode for workloads.

  2. The Operator triggers the managed cluster update through the Mirantis Container Cloud web UI as described in Step 2. Initiate MOSK cluster update.

  3. LCM creates the ClusterMaintenanceRequest object and notifies the application controllers about planned maintenance.

Phase 2: LCM triggers the OpenStack and Ceph update
  1. The OpenStack update starts.

  2. Ceph waits for the OpenStack ClusterWorkloadLock object to become inactive.

  3. When the OpenStack update is finalized, the OpenStack Controller marks ClusterWorkloadLock as inactive.

  4. The Ceph Controller triggers an update of the Ceph cluster.

  5. When the Ceph update is finalized, Ceph marks the ClusterWorkloadLock object as inactive.

Phase 3: LCM initiates the Kubernetes master nodes update
  1. If a master node has collocated roles, LCM creates NodeMaintenanceRequest for the node.

  2. All Application Controllers mark their NodeWorkloadLock objects for this node as inactive.

  3. LCM starts draining the node by gracefully moving out all pods from the node. The DaemonSet pods are not evacuated and left running.

  4. LCM downloads the new version of the LCM Agent and runs its states.

    Note

    While running Ansible states, the services on the node may be restarted.

  5. The above flow is applied to all Kubernetes master nodes one by one.

  6. LCM removes NodeMaintenanceRequest.

Phase 4: LCM initiates the Kubernetes worker nodes update
  1. LCM creates NodeMaintenanceRequest for the node, specifying the scope.

  2. Application Controllers start preparing the node according to the scope.

  3. LCM waits until all Application Controllers mark their NodeWorkloadLock objects for this node as inactive.

  4. All pods are evacuated from the node by draining it. This does not apply to the DaemonSet pods, which cannot be removed.

  5. LCM downloads the new version of the LCM Agent and runs its states.

    Note

    While running Ansible states, the services on the node may be restarted.

  6. The above flow is applied to all Kubernetes worker nodes one by one.

  7. LCM removes NodeMaintenanceRequest.

Phase 5: Finalization
  1. LCM triggers the update for all other applications present in the cluster, such as StackLight, Tungsten Fabric, and others.

  2. LCM removes ClusterMaintenanceRequest.

After a while, the cluster update completes and the cluster becomes fully operable again.
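
The ordering of the phases can be condensed into a short outline generator. This is a simplified sketch for orientation only: the function name is hypothetical, master and worker nodes are treated alike, nodes are processed strictly one by one, and the creation of NodeMaintenanceRequest for every master node is a simplification of the collocated-roles condition described in Phase 3.

```python
def plan_update(master_nodes, worker_nodes, scope="os"):
    """Return the ordered list of high-level update steps."""
    steps = [
        "LCM creates ClusterMaintenanceRequest",
        "OpenStack update",
        "Ceph update",  # starts after the OpenStack ClusterWorkloadLock is inactive
    ]
    # Master nodes are updated before worker nodes, one node at a time.
    for node in list(master_nodes) + list(worker_nodes):
        steps += [
            f"{node}: LCM creates NodeMaintenanceRequest (scope={scope})",
            f"{node}: wait until all NodeWorkloadLock objects are inactive",
            f"{node}: drain (DaemonSet pods keep running)",
            f"{node}: run the new LCM Agent states",
            f"{node}: LCM removes NodeMaintenanceRequest",
        ]
    steps += [
        "Update of other applications (StackLight, Tungsten Fabric, and others)",
        "LCM removes ClusterMaintenanceRequest",
    ]
    return steps
```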

Parallelizing node update operations

Available since MOSK 23.2 TechPreview

OpenStack Controller (Rockoon)

Since MOSK 25.1, the OpenStack Controller has been open-sourced under the name Rockoon and is maintained as an independent open-source project going forward.

As part of this transition, all openstack-controller pods are named rockoon pods across the MOSK documentation and deployments. This change does not affect functionality. However, use the new naming for pods and other related artifacts accordingly.

MOSK enables you to parallelize node update operations, significantly improving the efficiency of your deployment. This capability applies to any operation that utilizes the Node Maintenance API, such as cluster updates or graceful node reboots.

The core implementation of parallel updates is handled by the LCM Controller ensuring seamless execution of parallel operations. LCM starts performing an operation on the node only when all NodeWorkloadLock objects for the node are marked as inactive. By default, the LCM Controller creates one NodeMaintenanceRequest at a time.

Each application controller, including Ceph, OpenStack, and Tungsten Fabric Controllers, manages parallel NodeMaintenanceRequest objects independently. The controllers determine how to handle and execute parallel node maintenance requests based on specific requirements of their respective applications. To understand the workflow of the Node Maintenance API, refer to WorkloadLock objects.

Enhancing parallelism during node updates
  1. Set the nodes update order.

    You can optimize parallel updates by setting the order in which nodes are updated. You can accomplish this by configuring upgradeIndex of the Machine object. For the procedure, refer to Change the upgrade order of a machine.

  2. Increase parallelism.

    Boost parallelism by adjusting the maximum number of worker node updates that are allowed during LCM operations using the spec.providerSpec.value.maxWorkerUpgradeCount configuration parameter, which is set to 1 by default.

    For configuration details, refer to Configure the parallel update of worker nodes.

  3. Execute LCM operations.

    Run LCM operations, such as cluster updates, taking advantage of the increased parallelism.
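The parameter path mentioned in step 2 can be expressed as a JSON merge patch. The Python sketch below only builds the patch body. The helper name is hypothetical, and the lower bound of 1 is an assumption derived from the default value. For the object to patch and the supported procedure, refer to Configure the parallel update of worker nodes.

```python
import json


def build_parallelism_patch(max_worker_upgrade_count: int) -> str:
    """Build a JSON merge patch that sets maxWorkerUpgradeCount."""
    # Assumption: values below the default of 1 make no sense.
    if max_worker_upgrade_count < 1:
        raise ValueError("maxWorkerUpgradeCount must be at least 1")
    patch = {
        "spec": {
            "providerSpec": {
                "value": {"maxWorkerUpgradeCount": max_worker_upgrade_count}
            }
        }
    }
    return json.dumps(patch)


if __name__ == "__main__":
    print(build_parallelism_patch(3))
```

The resulting string has the shape that kubectl patch accepts through the -p option together with --type merge.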

OpenStack nodes update

By default, the OpenStack Controller handles the NodeMaintenanceRequest objects as follows:

  • Updates the OpenStack controller nodes sequentially (one by one).

  • Updates the gateway nodes sequentially. Technically, you can increase the number of gateway node upgrades allowed in parallel using the nwl_parallel_max_gateway parameter, but Mirantis does not recommend doing so.

  • Updates the compute nodes in parallel. The default number of allowed parallel updates is 30. You can adjust this value through the nwl_parallel_max_compute parameter.

    Parallelism considerations for compute nodes

    When considering parallelism for compute nodes, take into account that during certain pod restarts, for example, the openvswitch-vswitchd pods, a brief instance downtime may occur. Select a suitable level of parallelism to minimize the impact on workloads and prevent excessive load on the control plane nodes.

    If your cloud environment is distributed across failure domains, which are represented by Nova availability zones, you can limit the parallel updates of nodes to only those within the same availability zone. This behavior is controlled by the respect_nova_az option in the OpenStack Controller.

The OpenStack Controller configuration is stored in the rockoon-config ConfigMap of the osh-system namespace. The options are picked up automatically after update. To learn more about the OpenStack Controller (Rockoon) configuration parameters, refer to OpenStack Controller configuration.
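To reason about how the compute parallelism limit and the availability zone restriction interact, consider the following Python sketch. It is not the algorithm of the OpenStack Controller. The ComputeNode type, the function name, and the rule for choosing the active zone are assumptions made for illustration, while the default of 30 and the two option names come from the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ComputeNode:
    name: str
    availability_zone: str


def select_compute_batch(
    pending: List[ComputeNode],
    in_maintenance: List[ComputeNode],
    max_parallel: int = 30,       # mirrors nwl_parallel_max_compute
    respect_nova_az: bool = True, # mirrors respect_nova_az
) -> List[ComputeNode]:
    """Pick the pending compute nodes that may enter maintenance now."""
    free_slots = max_parallel - len(in_maintenance)
    if free_slots <= 0 or not pending:
        return []
    if not respect_nova_az:
        return list(pending[:free_slots])
    # Assumption: stay within the zone that is already being updated,
    # or start with the zone of the first pending node.
    reference = in_maintenance[0] if in_maintenance else pending[0]
    same_zone = [n for n in pending if n.availability_zone == reference.availability_zone]
    return same_zone[:free_slots]
```

With the zone restriction enabled, nodes from other availability zones wait until the current zone is finished, which lowers the effective parallelism in small zones.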

Ceph nodes update

By default, the Ceph Controller handles the NodeMaintenanceRequest objects as follows:

  • Updates the non-storage nodes sequentially. Non-storage nodes include all nodes that have mon, mgr, rgw, or mds roles.

  • Updates storage nodes in parallel. The default number of allowed parallel updates is calculated automatically based on the minimal failure domain in a Ceph cluster.

    Parallelism calculations for storage nodes

    The Ceph Controller automatically calculates the parallelism number in the following way:

    • Finds the minimal failure domain for a Ceph cluster. For example, the minimal failure domain is rack.

    • Filters all currently requested nodes by minimal failure domain. For example, parallelism equals 5, and LCM requests 3 nodes from the rack1 rack and 2 nodes from the rack2 rack.

    • Handles each filtered node group one by one. For example, the controller handles in parallel all nodes from rack1 before processing nodes from rack2.

The Ceph Controller handles non-storage nodes before the storage ones. If there are node requests for both node types, the Ceph Controller first handles the non-storage nodes sequentially. Therefore, Mirantis recommends setting the upgrade index of a higher priority for the non-storage nodes to decrease the total upgrade time.

If the minimal failure domain is host, the Ceph Controller updates only one storage node per failure domain unit. This results in updating all Ceph nodes sequentially, despite the potential for increased parallelism.
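The grouping of storage nodes described in this section can be sketched in Python as follows. This is a simplified model and not the Ceph Controller code. The function name, the node names, and the input format of node and failure domain unit pairs are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_storage_batches(requested: List[Tuple[str, str]]) -> List[List[str]]:
    """Group requested storage nodes by failure domain unit.

    requested holds (node_name, failure_domain_unit) pairs in the order
    in which LCM requested them. Each returned batch is handled in
    parallel, and the batches themselves are handled one by one.
    """
    batches: Dict[str, List[str]] = {}
    for node, unit in requested:
        batches.setdefault(unit, []).append(node)
    return list(batches.values())


if __name__ == "__main__":
    # Minimal failure domain is rack: two batches.
    by_rack = [("osd-1", "rack1"), ("osd-2", "rack1"), ("osd-3", "rack1"),
               ("osd-4", "rack2"), ("osd-5", "rack2")]
    print(plan_storage_batches(by_rack))
    # Minimal failure domain is host: every node forms its own batch.
    by_host = [("osd-1", "osd-1"), ("osd-2", "osd-2"), ("osd-3", "osd-3")]
    print(plan_storage_batches(by_host))
```

The second call shows why a host failure domain leads to a sequential update: every batch contains exactly one node.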

Tungsten Fabric nodes update

By default, the Tungsten Fabric Controller handles the NodeMaintenanceRequest objects as follows:

  • Updates the Tungsten Fabric Controller and gateway nodes sequentially.

  • Updates the vRouter nodes in parallel. The Tungsten Fabric Controller allows updating up to 30 vRouter nodes in parallel.

    Maximum number of vRouter nodes in maintenance

    While the Tungsten Fabric Controller has the capability to process up to 30 NodeMaintenanceRequest objects targeted to vRouter nodes, the actual number may be lower. This is due to a check that ensures OpenStack readiness to unlock the relevant nodes for maintenance. If OpenStack allows for maintenance, the Tungsten Fabric Controller verifies the vRouter pods. Upon successful verification, the NodeWorkloadLock object is switched to the maintenance mode.

Deployment Guide

Mirantis OpenStack for Kubernetes (MOSK) enables the operator to create, scale, update, and upgrade OpenStack deployments on Kubernetes through a declarative API.

The Kubernetes built-in features, such as flexibility, scalability, and declarative resource definition, make MOSK a robust solution.

Plan the deployment

The detailed plan of any Mirantis OpenStack for Kubernetes (MOSK) deployment is determined on a per-cloud basis. For the MOSK reference architecture and design overview, see Reference Architecture.

Also, read through Mirantis Container Cloud Reference Architecture: Container Cloud bare metal, because a MOSK cluster is deployed on top of a bare metal cluster managed by Mirantis Container Cloud.

Note

One of the industry best practices is to verify every new update or configuration change in a non-customer-facing environment before applying it to production. Therefore, Mirantis recommends having a staging cloud, deployed and maintained along with the production clouds. The recommendation is especially applicable to the environments that:

  • Receive updates often and use continuous delivery. For example, any non-isolated deployment of Mirantis Container Cloud.

  • Have significant deviations from the reference architecture or third party extensions installed.

  • Are managed under the Mirantis OpsCare program.

  • Run business-critical workloads where even the slightest application downtime is unacceptable.

A typical staging cloud is a complete copy of the production environment including the hardware and software configurations, but with a bare minimum of compute and storage capacity.

Provision and deploy a management cluster

The bare metal management system enables the Infrastructure Operator to deploy a Container Cloud management cluster on a set of bare metal servers. It also enables Container Cloud to deploy MOSK clusters on bare metal servers without a pre-provisioned operating system.

This section instructs you on how to provision and deploy a Container Cloud management cluster.

Deploy a management cluster

Note

The deprecated bootstrap procedure using Bootstrap v1 was removed in favor of Bootstrap v2 in Container Cloud 2.26.0.

Introduction

Mirantis Container Cloud Bootstrap v2 provides the best user experience to set up Container Cloud. Using Bootstrap v2, you can provision and operate management clusters using required objects through the Container Cloud API.

Basic concepts and components of Bootstrap v2 include:

  • Bootstrap cluster

    Bootstrap cluster is any kind-based Kubernetes cluster that contains a minimal set of Container Cloud bootstrap components allowing the user to prepare the configuration for management cluster deployment and start the deployment. The list of these components includes:

    • Bootstrap Controller

      Controller that is responsible for:

      1. Configuration of a bootstrap cluster with provider charts through the bootstrap Helm bundle.

      2. Configuration and deployment of a management cluster and its related objects.

    • Helm Controller

      Operator that manages Helm chart releases. It installs the Container Cloud bootstrap and provider charts configured in the bootstrap Helm bundle.

    • Public API charts

      Helm charts that contain custom resource definitions for Container Cloud resources.

    • Admission Controller

      Controller that performs mutations and validations for the Container Cloud resources including cluster and machines configuration.

    Currently one bootstrap cluster can be used for deployment of only one management cluster. For example, to add a new management cluster with different settings, a new bootstrap cluster must be created from scratch.

  • Bootstrap region

    BootstrapRegion is the first object to create in the bootstrap cluster for the Bootstrap Controller to identify and install provider components onto the bootstrap cluster. Afterward, the user can prepare and deploy a management cluster with related resources.

    The bootstrap region is a starting point for the cluster deployment. The user needs to approve the BootstrapRegion object. Otherwise, the Bootstrap Controller will not be triggered for the cluster deployment.

  • Bootstrap Helm bundle

    Helm bundle that contains charts configuration for the bootstrap cluster. This object is managed by the Bootstrap Controller that updates the provider bundle in the BootstrapRegion object. The Bootstrap Controller always configures provider charts listed in the regional section of the Container Cloud release for the provider. Depending on the cluster configuration, the Bootstrap Controller may update or reconfigure this bundle even after the cluster deployment starts. For example, the Bootstrap Controller enables the provider in the bootstrap cluster only after the bootstrap region is approved for the deployment.

Overview of the deployment workflow

Management cluster deployment consists of several sequential stages. Each stage finishes when a specific condition is met or specific configuration applies to a cluster or its machines.

In case of issues at any deployment stage, you can identify the problem and fix it on the fly. Because the infinite-timeout option is enabled by default in Bootstrap v2, the cluster deployment does not abort until all stages complete.

Infinite timeout prevents the bootstrap failure due to timeout. This option is useful in the following cases:

  • The network speed is slow for artifacts downloading

  • The infrastructure configuration does not allow fast booting

  • The inspection of a bare metal node presupposes more than two HDD/SATA disks to attach to a machine

You can track the status of each stage in the bootstrapStatus section of the Cluster object that is updated by the Bootstrap Controller.

The Bootstrap Controller starts deploying the cluster after you approve the BootstrapRegion configuration.

The following table describes deployment states of a management cluster that apply in the strict order.

Deployment states of a management cluster

| Step | State | Description |
|------|-------|-------------|
| 1 | ProxySettingsHandled | Verifies proxy configuration in the Cluster object. If the bootstrap cluster was created without a proxy, no actions are applied to the cluster. |
| 2 | ClusterSSHConfigured | Verifies SSH configuration for the cluster and machines. You can provide any number of SSH public keys, which are added to cluster machines. But the Bootstrap Controller always adds the bootstrap-key SSH public key to the cluster configuration. The Bootstrap Controller uses this SSH key to manage the lcm-agent configuration on cluster machines. The bootstrap-key SSH key is copied to a bootstrap-key-<clusterName> object containing the cluster name in its name. |
| 3 | ProviderUpdatedInBootstrap | Synchronizes the provider and settings of its components between the Cluster object and bootstrap Helm bundle. Settings provided in the cluster configuration have higher priority than the default settings of the bootstrap cluster, except CDN. |
| 4 | ProviderEnabledInBootstrap | Enables the provider and its components if any were disabled by the Bootstrap Controller during preparation of the bootstrap region. A cluster and machines deployment starts after the provider enablement. |
| 5 | Nodes readiness | Waits for the provider to complete nodes deployment that comprises VMs creation and MKE installation. |
| 6 | ObjectsCreated | Creates required namespaces and IAM secrets. |
| 7 | ProviderConfigured | Verifies the provider configuration in the provisioned cluster. |
| 8 | HelmBundleReady | Verifies the Helm bundle readiness for the provisioned cluster. |
| 9 | ControllersDisabledBeforePivot | Collects the list of deployment controllers and disables them to prepare for pivot. |
| 10 | PivotDone | Moves all cluster-related objects from the bootstrap cluster to the provisioned cluster. The copies of Cluster and Machine objects remain in the bootstrap cluster to provide the status information to the user. About every minute, the Bootstrap Controller reconciles the status of the Cluster and Machine objects of the provisioned cluster to the bootstrap cluster. |
| 11 | ControllersEnabledAfterPivot | Enables controllers in the provisioned cluster. |
| 12 | MachinesLCMAgentUpdated | Updates the lcm-agent configuration on machines to target LCM agents to the provisioned cluster. |
| 13 | HelmControllerDisabledBeforeConfig | Disables the Helm Controller before reconfiguration. |
| 14 | HelmControllerConfigUpdated | Updates the Helm Controller configuration for the provisioned cluster. |
| 15 | Cluster readiness | Contains information about the global cluster status. The Bootstrap Controller verifies that OIDC, Helm releases, and all Deployments are ready. Once the cluster is ready, the Bootstrap Controller stops managing the cluster. |
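Because the states apply in strict order, the first state that is not yet reported as completed tells you where the deployment currently stands. The following Python sketch illustrates this idea. The state names are taken from the table above, while the function name and the input format, a plain set of completed state names, are hypothetical. The actual layout of the bootstrapStatus section is not described here.

```python
from typing import Optional, Set, Tuple

# Deployment states of a management cluster in the order of the table.
DEPLOYMENT_STATES = [
    "ProxySettingsHandled",
    "ClusterSSHConfigured",
    "ProviderUpdatedInBootstrap",
    "ProviderEnabledInBootstrap",
    "Nodes readiness",
    "ObjectsCreated",
    "ProviderConfigured",
    "HelmBundleReady",
    "ControllersDisabledBeforePivot",
    "PivotDone",
    "ControllersEnabledAfterPivot",
    "MachinesLCMAgentUpdated",
    "HelmControllerDisabledBeforeConfig",
    "HelmControllerConfigUpdated",
    "Cluster readiness",
]


def next_pending_state(completed: Set[str]) -> Optional[Tuple[int, str]]:
    """Return (step, state) of the first state that is not completed yet."""
    for step, state in enumerate(DEPLOYMENT_STATES, start=1):
        if state not in completed:
            return step, state
    return None  # every state is completed


if __name__ == "__main__":
    done = {"ProxySettingsHandled", "ClusterSSHConfigured"}
    print(next_pending_state(done))
```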

Set up a bootstrap cluster

The setup of a bootstrap cluster comprises preparation of the seed node, configuration of environment variables, acquisition of the Container Cloud license file, and execution of the bootstrap script.

To set up a bootstrap cluster:

  1. Prepare the seed node:

    1. Verify that the hardware allocated for the installation meets the minimal requirements described in Container Cloud Reference Architecture.

    2. Install basic Ubuntu 22.04 server using standard installation images of the operating system on the bare metal seed node.

    3. Log in to the seed node that is running Ubuntu 22.04.

    4. Configure the operating system and network:

      Operating system and network configuration
      1. Establish a virtual bridge using an IP address of the PXE network on the seed node. Use the following netplan-based configuration file as an example:

        # cat /etc/netplan/config.yaml
        network:
          version: 2
          renderer: networkd
          ethernets:
            ens3:
                dhcp4: false
                dhcp6: false
          bridges:
              br0:
                  addresses:
                  # Replace with IP address from PXE network to create a virtual bridge
                  - 10.0.0.15/24
                  dhcp4: false
                  dhcp6: false
                  # Adjust for your environment
                  gateway4: 10.0.0.1
                  interfaces:
                  # Interface name may be different in your environment
                  - ens3
                  nameservers:
                      addresses:
                      # Adjust for your environment
                      - 8.8.8.8
                  parameters:
                      forward-delay: 4
                      stp: false
        
      2. Apply the new network configuration using netplan:

        sudo netplan apply
        
      3. Verify the new network configuration:

        sudo apt update && sudo apt install -y bridge-utils
        sudo brctl show
        

        Example of system response:

        bridge name     bridge id               STP enabled     interfaces
        br0             8000.fa163e72f146       no              ens3
        

        Verify that the interface connected to the PXE network belongs to the previously configured bridge.

      4. Install the current Docker version available for Ubuntu 22.04:

        sudo apt-get update
        sudo apt-get install docker.io
        
      5. Grant your current user access to the Docker daemon by adding the user to the docker group:

        sudo usermod -aG docker $USER
        
      6. Verify that the br_netfilter kernel module is loaded and, if it is not, load it:

        grep -q br_netfilter /proc/modules || sudo modprobe br_netfilter
        
      7. Log out and log in again to the seed node to apply the changes.

      8. Verify that Docker is configured correctly and has access to Container Cloud CDN. For example:

        docker run --rm alpine sh -c "apk add --no-cache curl; \
        curl https://binary.mirantis.com"
        

        The system output must contain a JSON file with no error messages. In case of errors, follow the steps provided in Troubleshoot the bootstrap node configuration.

        Note

        If you require all Internet access to go through a proxy server for security and audit purposes, configure Docker proxy settings as described in the official Docker documentation.

        To verify that Docker is configured correctly and has access to Container Cloud CDN:

        docker run --rm alpine sh -c "export http_proxy=http://<proxy_ip:proxy_port>; \
        sed -i 's/https/http/g' /etc/apk/repositories; \
        apk add --no-cache wget; \
        wget http://binary.mirantis.com; \
        cat index.html"
        
    5. Verify that the seed node has direct access to the Baseboard Management Controller (BMC) of each bare metal host. All target hardware nodes must be in the power off state.

      For example, using the IPMI tool:

      sudo apt install ipmitool
      ipmitool -I lanplus -H 'IPMI IP' -U 'IPMI Login' -P 'IPMI password' \
      chassis power status
      

      Example of system response:

      Chassis Power is off
      
  2. Prepare the bootstrap script:

    1. Download and run the Container Cloud bootstrap script:

      sudo apt-get update
      sudo apt-get install wget
      wget https://binary.mirantis.com/releases/get_container_cloud.sh
      chmod 0755 get_container_cloud.sh
      ./get_container_cloud.sh
      
    2. Change the directory to the kaas-bootstrap folder created by the script.

  3. Obtain a Container Cloud license file required for the bootstrap:

    Obtain a Container Cloud license
    1. Select from the following options:

      • Open the email from support@mirantis.com with the subject Mirantis Container Cloud License File or Mirantis OpenStack License File

      • In the Mirantis CloudCare Portal, open the Account or Cloud page

    2. Download the License File and save it as mirantis.lic under the kaas-bootstrap directory on the bootstrap node.

    3. Verify that mirantis.lic contains the previously downloaded Container Cloud license by decoding the license JWT token, for example, using jwt.io.

      Example of a valid decoded Container Cloud license data with the mandatory license field:

      {
          "exp": 1652304773,
          "iat": 1636669973,
          "sub": "demo",
          "license": {
              "dev": false,
              "limits": {
                  "clusters": 10,
                  "workers_per_cluster": 10
              },
              "openstack": null
          }
      }
      

    Warning

    The MKE license does not apply to mirantis.lic. For details about MKE license, see MKE documentation.
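If you prefer to inspect the license locally instead of using an online decoder, the payload of a JWT can be decoded with the Python standard library. The sketch below does not verify the signature, and the function names are hypothetical. It also assumes that mirantis.lic contains the raw token as text, which is not stated in this procedure.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload (second segment) of a JWT without verifying it."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # JWT segments use URL-safe Base64 without padding, so restore it.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def has_license_field(token: str) -> bool:
    """Check that the decoded data carries the mandatory license field."""
    return isinstance(decode_jwt_payload(token).get("license"), dict)


if __name__ == "__main__":
    # Assumption: the file holds nothing but the token itself.
    with open("mirantis.lic", encoding="utf-8") as license_file:
        data = decode_jwt_payload(license_file.read())
    print(json.dumps(data, indent=4))
```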

  4. Export mandatory parameters:

    Bare metal network mandatory parameters

    Export the following mandatory parameters using the commands and table below:

    export KAAS_BM_ENABLED="true"
    #
    export KAAS_BM_PXE_IP="172.16.59.5"
    export KAAS_BM_PXE_MASK="24"
    export KAAS_BM_PXE_BRIDGE="br0"
    
    Bare metal prerequisites data

    | Parameter | Description | Example value |
    |-----------|-------------|---------------|
    | KAAS_BM_PXE_IP | The provisioning IP address in the PXE network. This address will be assigned on the seed node to the interface defined by the KAAS_BM_PXE_BRIDGE parameter described below. The PXE service of the bootstrap cluster uses this address to network boot bare metal hosts. | 172.16.59.5 |
    | KAAS_BM_PXE_MASK | The PXE network address prefix length to be used with the KAAS_BM_PXE_IP address when assigning it to the seed node interface. | 24 |
    | KAAS_BM_PXE_BRIDGE | The PXE network bridge name that must match the name of the bridge created on the seed node during preparation of the system and network configuration described earlier in this procedure. | br0 |
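Before exporting the values, you may want to confirm that the provisioning address and the prefix length describe a usable host address in the intended PXE network. The following Python sketch performs such a check with the standard ipaddress module. The function name and the pxe_cidr argument are hypothetical, and the check itself is a suggestion that the bootstrap script is not known to perform.

```python
import ipaddress


def check_pxe_address(pxe_ip: str, pxe_mask: str, pxe_cidr: str) -> bool:
    """Check that pxe_ip/pxe_mask is a host address inside pxe_cidr."""
    interface = ipaddress.IPv4Interface(f"{pxe_ip}/{pxe_mask}")
    network = ipaddress.IPv4Network(pxe_cidr)
    if interface.network != network:
        return False
    # The network and broadcast addresses cannot be assigned to a host.
    return interface.ip not in (network.network_address, network.broadcast_address)


if __name__ == "__main__":
    # Values of KAAS_BM_PXE_IP and KAAS_BM_PXE_MASK from the example above.
    print(check_pxe_address("172.16.59.5", "24", "172.16.59.0/24"))
```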

  5. Optional. Configure proxy settings to bootstrap the cluster using proxy:

    Proxy configuration

    Add the following environment variables:

    • HTTP_PROXY

    • HTTPS_PROXY

    • NO_PROXY

    • PROXY_CA_CERTIFICATE_PATH

    Example snippet:

    export HTTP_PROXY=http://proxy.example.com:3128
    export HTTPS_PROXY=http://user:pass@proxy.example.com:3128
    export NO_PROXY=172.18.10.0,registry.internal.lan
    export PROXY_CA_CERTIFICATE_PATH="/home/ubuntu/.mitmproxy/mitmproxy-ca-cert.cer"
    

    The following formats of variables are accepted:

    Proxy configuration data

    | Variable | Format |
    |----------|--------|
    | HTTP_PROXY, HTTPS_PROXY | http://proxy.example.com:port for anonymous access. http://user:password@proxy.example.com:port for restricted access. |
    | NO_PROXY | Comma-separated list of IP addresses or domain names. |
    | PROXY_CA_CERTIFICATE_PATH | Optional. Absolute path to the proxy CA certificate for man-in-the-middle (MITM) proxies. Must be placed on the bootstrap node to be trusted. For details, see Install a CA certificate for a MITM proxy on a bootstrap node. |

    Warning

    If you require Internet access to go through a MITM proxy, ensure that the proxy has streaming enabled as described in Enable streaming for MITM.

    For implementation details, see Container Cloud Reference Architecture: Proxy and cache support.

    After the bootstrap cluster is set up, the bootstrap-proxy object is created with the provided proxy settings. You can use this object later for the Cluster object configuration.
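The accepted variable formats listed in this step can be checked before running the bootstrap script. The Python sketch below is one possible way to do that with urllib.parse. The function names are hypothetical, and treating the two documented URL formats as the only valid ones is an assumption.

```python
from typing import List
from urllib.parse import urlsplit


def is_accepted_proxy_url(value: str) -> bool:
    """Check a proxy URL against the two formats listed in this step."""
    parts = urlsplit(value)
    try:
        port = parts.port  # raises ValueError for a non-numeric port
    except ValueError:
        return False
    if parts.scheme != "http" or not parts.hostname or port is None:
        return False
    # Credentials are optional, but a user name requires a password.
    if parts.username is not None and not parts.password:
        return False
    return True


def split_no_proxy(value: str) -> List[str]:
    """Split NO_PROXY into its comma-separated entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


if __name__ == "__main__":
    print(is_accepted_proxy_url("http://user:pass@proxy.example.com:3128"))
    print(split_no_proxy("172.18.10.0,registry.internal.lan"))
```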

  6. Deploy the bootstrap cluster:

    ./bootstrap.sh bootstrapv2
    
  7. Verify that port 80 is open for localhost, as security requirements applied to the seed node may keep it closed:

    Note

    Kind uses port mapping for the master node.

    telnet localhost 80
    

    Example of a positive system response:

    Connected to localhost.
    

    Example of a negative system response:

    telnet: connect to address ::1: Connection refused
    telnet: Unable to connect to remote host
    

    To open port 80:

    sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT
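If telnet is not installed on the seed node, the same connectivity check from the last step can be done with a few lines of Python. This is a minimal sketch with a hypothetical function name. It only tests whether a TCP connection to the port succeeds within the timeout.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution errors.
        return False


if __name__ == "__main__":
    print("open" if is_port_open() else "closed")
```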
    
Deploy a management cluster

This section contains an overview of the cluster-related objects along with the configuration procedure of these objects during deployment of a management cluster using Bootstrap v2 through the Container Cloud API.

Deploy a management cluster using CLI

The following procedure describes how to prepare and deploy a management cluster using Bootstrap v2 and the YAML templates available in the kaas-bootstrap/templates/ folder.

To deploy a management cluster using CLI:

  1. Set up a bootstrap cluster.

  2. Export kubeconfig of the kind cluster:

    export KUBECONFIG=<pathToKindKubeconfig>
    

    By default, <pathToKindKubeconfig> is $HOME/.kube/kind-config-clusterapi.

  3. Configure BIOS on a bare metal host.

  4. Navigate to kaas-bootstrap/templates/bm.

    Warning

    The kubectl apply command automatically saves the applied data as plain text into the kubectl.kubernetes.io/last-applied-configuration annotation of the corresponding object. This may result in revealing sensitive data in this annotation when creating or modifying objects containing credentials. Such Container Cloud objects include:

    • BareMetalHostCredential

    • ClusterOIDCConfiguration

    • License

    • Proxy

    • ServiceUser

    • TLSConfig

    Therefore, do not use kubectl apply on these objects. Use kubectl create, kubectl patch, or kubectl edit instead.

    If you used kubectl apply on these objects, you can remove the kubectl.kubernetes.io/last-applied-configuration annotation from the objects using kubectl edit.

  5. Create the BootstrapRegion object by modifying bootstrapregion.yaml.template.

    Configuration of bootstrapregion.yaml.template
    1. Set provider: baremetal and use the default <regionName>, which is region-one.

      apiVersion: kaas.mirantis.com/v1alpha1
      kind: BootstrapRegion
      metadata:
        name: region-one
        namespace: default
      spec:
        provider: baremetal
      
    2. Create the object:

      ./kaas-bootstrap/bin/kubectl create -f \
          kaas-bootstrap/templates/bm/bootstrapregion.yaml.template
      

    Note

    In the following steps, apply the changes to objects using the commands below with the required template name:

    ./kaas-bootstrap/bin/kubectl create -f \
        kaas-bootstrap/templates/bm/<templateName>.yaml.template
    
  6. Create the ServiceUser object by modifying serviceusers.yaml.template.

    Configuration of serviceusers.yaml.template

    Service user is the initial user to create in Keycloak for access to a newly deployed management cluster. By default, it has the global-admin, operator (namespaced), and bm-pool-operator (namespaced) roles.

    You can delete serviceuser after setting up other required users with specific roles or after any integration with an external identity provider, such as LDAP.

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: ServiceUserList
    items:
    - apiVersion: kaas.mirantis.com/v1alpha1
      kind: ServiceUser
      metadata:
        name: <USERNAME>
      spec:
        password:
          value: <PASSWORD>
    
  7. Optional. Prepare any number of additional SSH keys using the following example:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: PublicKey
    metadata:
      name: <SSHKeyName>
      namespace: default
    spec:
      publicKey: |
        <insert your public key here>
    
  8. Optional. Add the Proxy object using the example below:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: Proxy
    metadata:
      name: <proxyName>
      namespace: default
    spec:
      ...
    
  9. Inspect the default bare metal host profile definition in baremetalhostprofiles.yaml.template and adjust it to fit your hardware configuration. For details, see Customize the default bare metal host profile.

    Warning

    All data will be wiped during cluster deployment on devices defined directly or indirectly in the fileSystems list of BareMetalHostProfile. For example:

    • A raw device partition with a file system on it

    • A device partition in a volume group with a logical volume that has a file system on it

    • An mdadm RAID device with a file system on it

    • An LVM RAID device with a file system on it

    The wipe field is always considered true for these devices. The false value is ignored.

    Therefore, to prevent data loss, move the necessary data from these file systems to another server beforehand, if required.

  10. In baremetalhosts.yaml.template, update the bare metal host definitions according to your environment configuration. Use the reference table below to manually set all parameters that start with SET_.

    Mandatory parameters for a bare metal host template

    | Parameter | Description | Example value |
    |-----------|-------------|---------------|
    | SET_MACHINE_0_IPMI_USERNAME | The IPMI user name to access the BMC. Requires a value in plain text. | user |
    | SET_MACHINE_0_IPMI_PASSWORD | The IPMI password to access the BMC. Requires a value in plain text. | password |
    | SET_MACHINE_0_MAC | The MAC address of the first master node in the PXE network. | ac:1f:6b:02:84:71 |
    | SET_MACHINE_0_BMC_ADDRESS | The IP address of the BMC endpoint for the first master node in the cluster. Must be an address from the OOB network that is accessible through the management network gateway. | 192.168.100.11 |
    | SET_MACHINE_1_IPMI_USERNAME | The IPMI user name to access the BMC. Requires a value in plain text. | user |
    | SET_MACHINE_1_IPMI_PASSWORD | The IPMI password to access the BMC. Requires a value in plain text. | password |
    | SET_MACHINE_1_MAC | The MAC address of the second master node in the PXE network. | ac:1f:6b:02:84:72 |
    | SET_MACHINE_1_BMC_ADDRESS | The IP address of the BMC endpoint for the second master node in the cluster. Must be an address from the OOB network that is accessible through the management network gateway. | 192.168.100.12 |
    | SET_MACHINE_2_IPMI_USERNAME | The IPMI user name to access the BMC. Requires a value in plain text. | user |
    | SET_MACHINE_2_IPMI_PASSWORD | The IPMI password to access the BMC. Requires a value in plain text. | password |
    | SET_MACHINE_2_MAC | The MAC address of the third master node in the PXE network. | ac:1f:6b:02:84:73 |
    | SET_MACHINE_2_BMC_ADDRESS | The IP address of the BMC endpoint for the third master node in the cluster. Must be an address from the OOB network that is accessible through the management network gateway. | 192.168.100.13 |
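Since every parameter that starts with SET_ must be replaced manually, a quick scan for leftover placeholders can catch omissions before you create the objects. The following Python sketch is one way to do it. The function name is hypothetical, and the naming pattern of the placeholders, uppercase letters, digits, and underscores after the SET_ prefix, is inferred from the parameter names in this procedure.

```python
import re
from typing import List

# Matches names such as SET_MACHINE_0_MAC or SET_VLAN_ID.
PLACEHOLDER = re.compile(r"\bSET_[A-Z0-9_]+\b")


def find_unset_placeholders(template_text: str) -> List[str]:
    """Return the sorted, de-duplicated SET_ placeholders left in a template."""
    return sorted(set(PLACEHOLDER.findall(template_text)))


if __name__ == "__main__":
    with open("baremetalhosts.yaml.template", encoding="utf-8") as template:
        for name in find_unset_placeholders(template.read()):
            print(f"not set yet: {name}")
```

An empty result means that no SET_ placeholder remains in the file. It does not validate the values that you provided.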

  11. Configure cluster network:

    Important

    Bootstrap v2 supports only separated PXE and LCM networks.

    • Update the network object definition in ipam-objects.yaml.template according to the environment configuration. By default, this template implies the use of separate PXE and life-cycle management (LCM) networks.

    • Manually set all parameters that start with SET_.

    • To ensure successful bootstrap, enable asymmetric routing on the interfaces of the management cluster nodes. This is required because the seed node relies on one network by default, which can potentially cause traffic asymmetry.

      In the kernelParameters section of baremetalhostprofiles.yaml.template, set rp_filter to 2. This enables loose mode as defined in RFC 3704.

      Example configuration of asymmetric routing
      ...
      kernelParameters:
        ...
        sysctl:
          # Enables the "Loose mode" for the "k8s-lcm" interface (management network)
          net.ipv4.conf.k8s-lcm.rp_filter: "2"
          # Enables the "Loose mode" for the "bond0" interface (PXE network)
          net.ipv4.conf.bond0.rp_filter: "2"
          ...
      

      Note

      More complex solutions, which are not described in this manual, eliminate traffic asymmetry altogether. For example:

      • Configure source routing on management cluster nodes.

      • Plug the seed node into the same networks as the management cluster nodes, which requires custom configuration of the seed node.

    For configuration details of bond network interface for the PXE and management network, see Configure NIC bonding.

    Example of the default L2 template snippet for a management cluster
    bonds:
      bond0:
        interfaces:
          - {{ nic 0 }}
          - {{ nic 1 }}
        parameters:
          mode: active-backup
          primary: {{ nic 0 }}
        dhcp4: false
        dhcp6: false
        addresses:
          - {{ ip "bond0:mgmt-pxe" }}
    vlans:
      k8s-lcm:
        id: SET_VLAN_ID
        link: bond0
        addresses:
          - {{ ip "k8s-lcm:kaas-mgmt" }}
        nameservers:
          addresses: {{ nameservers_from_subnet "kaas-mgmt" }}
        routes:
          - to: 0.0.0.0/0
            via: {{ gateway_from_subnet "kaas-mgmt" }}
    

    In this example, the following configuration applies:

    • A bond of two NIC interfaces

    • A static address in the PXE network set on the bond

    • An isolated L2 segment for the LCM network is configured using the k8s-lcm VLAN with the static address in the LCM network

    • The default gateway address is in the LCM network

    For general concepts of configuring separate PXE and LCM networks for a management cluster, see Separate PXE and management networks. For current object templates and variable names to use, see the following tables.

    Network parameters mapping overview

    | Deployment file name | Parameters list to update manually |
    |----------------------|------------------------------------|
    | ipam-objects.yaml.template | SET_LB_HOST, SET_MGMT_ADDR_RANGE, SET_MGMT_CIDR, SET_MGMT_DNS, SET_MGMT_NW_GW, SET_MGMT_SVC_POOL, SET_PXE_ADDR_POOL, SET_PXE_ADDR_RANGE, SET_PXE_CIDR, SET_PXE_SVC_POOL, SET_VLAN_ID |
    | bootstrap.env | KAAS_BM_PXE_IP, KAAS_BM_PXE_MASK, KAAS_BM_PXE_BRIDGE |

    Mandatory network parameters of the IPAM object template

    The following table contains examples of mandatory parameter values to set in ipam-objects.yaml.template for the network scheme that has the following networks:

    • 172.16.59.0/24 - PXE network

    • 172.16.61.0/25 - LCM network

    Parameter

    Description

    Example value

    SET_PXE_CIDR

    The IP address of the PXE network in the CIDR notation. The minimum recommended network size is 256 addresses (/24 prefix length).

    172.16.59.0/24

    SET_PXE_SVC_POOL

    The IP address range to use for endpoints of load balancers in the PXE network for the Container Cloud services: Ironic-API, DHCP server, HTTP server, and caching server. The minimum required range size is 5 addresses.

    172.16.59.6-172.16.59.15

    SET_PXE_ADDR_POOL

    The IP address range in the PXE network to use for dynamic address allocation for hosts during inspection and provisioning.

    The minimum recommended range size is 30 addresses for management cluster nodes if it is located in a separate PXE network segment. Otherwise, it depends on the number of managed cluster nodes to deploy in the same PXE network segment as the management cluster nodes.

    172.16.59.51-172.16.59.200

    SET_PXE_ADDR_RANGE

    The IP address range in the PXE network to use for static address allocation on each management cluster node. The minimum recommended range size is 6 addresses.

    172.16.59.41-172.16.59.50

    SET_MGMT_CIDR

    The IP address of the LCM network for the management cluster in the CIDR notation. If managed clusters will have their separate LCM networks, those networks must be routable to the LCM network. The minimum recommended network size is 128 addresses (/25 prefix length).

    172.16.61.0/25

    SET_MGMT_NW_GW

    The default gateway address in the LCM network. This gateway must provide access to the OOB network of the Container Cloud cluster and to the Internet to download the Mirantis artifacts.

    172.16.61.1

    SET_LB_HOST

    The IP address of the externally accessible MKE API endpoint of the cluster in the CIDR notation. This address must be within the management SET_MGMT_CIDR network but must NOT overlap with any other addresses or address ranges within this network. External load balancers are not supported.

    172.16.61.5/32

    SET_MGMT_DNS

    An external (non-Kubernetes) DNS server accessible from the LCM network.

    8.8.8.8

    SET_MGMT_ADDR_RANGE

    The IP address range that includes addresses to be allocated to bare metal hosts in the LCM network for the management cluster.

    When this network is shared with managed clusters, the size of this range limits the number of hosts that can be deployed in all clusters sharing this network.

    When this network is solely used by a management cluster, the range must include at least 6 addresses for bare metal hosts of the management cluster.

    172.16.61.30-172.16.61.40

    SET_MGMT_SVC_POOL

    The IP address range to use for the externally accessible endpoints of load balancers in the LCM network for the Container Cloud services, such as Keycloak, web UI, and so on. The minimum required range size is 19 addresses.

    172.16.61.10-172.16.61.29

    SET_VLAN_ID

    The VLAN ID used for isolation of the LCM network. The bootstrap.sh process and the seed node must have routable access to the network in this VLAN.

    3975

    While using separate PXE and LCM networks, the management cluster services are exposed in different networks using two separate MetalLB address pools:

    • Services exposed through the PXE network are as follows:

      • Ironic API as a bare metal provisioning server

      • HTTP server that provides images for network boot and server provisioning

      • Caching server for accessing the Container Cloud artifacts deployed on hosts

    • Services exposed through the LCM network are all other Container Cloud services, such as Keycloak, web UI, and so on.

    The default MetalLB configuration described in the MetalLBConfig object template of metallbconfig.yaml.template uses two separate MetalLB address pools. Also, it uses the interfaces selector in its l2Advertisements template.

    Caution

    When you change the L2Template object template in ipam-objects.yaml.template, ensure that interfaces listed in the interfaces field of the MetalLBConfig.spec.l2Advertisements section match those used in your L2Template. For details about the interfaces selector, see MetalLBConfig spec.

    See Configure and verify MetalLB for details on MetalLB configuration.

  12. In cluster.yaml.template:

    1. Set the mandatory label:

      labels:
        kaas.mirantis.com/provider: baremetal
      
    2. Update the cluster-related settings to fit your deployment.

  13. Optional. Technology Preview. Deprecated since Container Cloud 2.29.0 (Cluster release 16.4.0). Enable WireGuard for traffic encryption on the Kubernetes workloads network.

    WireGuard configuration
    1. Ensure that the Calico MTU size is at least 60 bytes smaller than the interface MTU size of the workload network. IPv4 WireGuard uses a 60-byte header. For details, see Set the MTU size for Calico.

    2. In cluster.yaml.template, enable WireGuard by adding the secureOverlay parameter:

      spec:
        ...
        providerSpec:
          value:
            ...
            secureOverlay: true
      

      Caution

      Changing this parameter on a running cluster causes a downtime that can vary depending on the cluster size.

    For more details about WireGuard, see Calico documentation: Encrypt in-cluster pod traffic.
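
The MTU requirement from the WireGuard configuration step can be checked with simple arithmetic. The following Python sketch is only an illustration: it assumes the 60-byte IPv4 WireGuard header stated above, and the helper names max_calico_mtu and calico_mtu_ok are hypothetical, not part of Container Cloud or Calico.

```python
# Overhead of the IPv4 WireGuard header in bytes, as stated in the step above.
WIREGUARD_IPV4_OVERHEAD = 60


def max_calico_mtu(interface_mtu, overhead=WIREGUARD_IPV4_OVERHEAD):
    """Return the largest Calico MTU that fits into the interface MTU."""
    if interface_mtu <= overhead:
        raise ValueError("interface MTU is too small for the WireGuard overhead")
    return interface_mtu - overhead


def calico_mtu_ok(calico_mtu, interface_mtu):
    """Return True if the Calico MTU leaves room for the WireGuard header."""
    return calico_mtu <= max_calico_mtu(interface_mtu)


# Hypothetical interface MTU values of the workload network.
print(max_calico_mtu(1500))       # 1440
print(max_calico_mtu(9000))       # 8940
print(calico_mtu_ok(1450, 1500))  # False
```

For example, with a 1500-byte interface MTU on the workload network, the Calico MTU must not exceed 1440 bytes.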

  14. Configure StackLight. For parameters description, see StackLight configuration parameters.

  15. Optional. Configure additional cluster settings as described in Configure optional settings.

  16. In machines.yaml.template:

    1. Add the following mandatory machine labels:

      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: <clusterName>
        cluster.sigs.k8s.io/control-plane: "true"
      
    2. Adjust spec and labels sections of each entry according to your deployment.

    3. Adjust the spec.providerSpec.value.hostSelector values to match BareMetalHostInventory corresponding to each machine. For details, see spec:providerSpec for instance configuration.

  17. Monitor the inspection of the bare metal hosts and wait until all hosts are in the available state:

    kubectl get bmh -o go-template='{{- range .items -}} {{.status.provisioning.state}}{{"\n"}} {{- end -}}'
    

    Example of system response:

    available
    available
    available
    
  18. Monitor the BootstrapRegion object status and wait until it is ready.

    kubectl get bootstrapregions -o go-template='{{(index .items 0).status.ready}}{{"\n"}}'
    

    To obtain more granular status details, monitor status.conditions:

    kubectl get bootstrapregions -o go-template='{{(index .items 0).status.conditions}}{{"\n"}}'
    

    For a more user-friendly system response, consider using dedicated tools such as jq or yq and adjust the -o flag to output in the json or yaml format accordingly.

  19. Change the directory to /kaas-bootstrap/.

  20. Approve the BootstrapRegion object to start the cluster deployment:

    ./container-cloud bootstrap approve all
    

    Caution

    Once you approve the BootstrapRegion object, no cluster or machine modification is allowed.

    Warning

    Do not manually restart or power off any of the bare metal hosts during the bootstrap process.

  21. Monitor the deployment progress. For description of deployment stages, see Overview of the deployment workflow.

  22. Verify that network addresses used on your clusters do not overlap with the following default MKE network addresses for Swarm and MCR:

    • 10.0.0.0/16 is used for Swarm networks. IP addresses from this network are virtual.

    • 10.99.0.0/16 is used for MCR networks. IP addresses from this network are allocated on hosts.

    Verification of Swarm and MCR network addresses

    To verify Swarm and MCR network addresses, run on any master node:

    docker info
    

    Example of system response:

    Server:
     ...
     Swarm:
      ...
      Default Address Pool: 10.0.0.0/16
      SubnetSize: 24
      ...
     Default Address Pools:
       Base: 10.99.0.0/16, Size: 20
     ...
    

    Usually, not all Swarm and MCR addresses are in use. By default, one Swarm Ingress network is created and occupies the 10.0.0.0/24 address block. Also, three MCR networks are created by default and occupy three address blocks: 10.99.0.0/20, 10.99.16.0/20, 10.99.32.0/20.

    To verify the actual networks state and addresses in use, run:

    docker network ls
    docker network inspect <networkName>
    
  23. Optional. If you plan to use multiple L2 segments for provisioning of managed cluster nodes, consider the requirements specified in Configure multiple DHCP address ranges.
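
The overlap check against the default MKE networks for Swarm and MCR, described in the verification step above, can be sketched in Python with the standard ipaddress module. This is a minimal illustration under stated assumptions: the find_overlaps helper and the sample CIDRs are hypothetical, and the reserved networks are the defaults listed in this procedure. If docker info reports different pools on your nodes, adjust the values accordingly.

```python
import ipaddress

# Default MKE address pools as listed in the verification step above.
MKE_DEFAULT_NETWORKS = {
    "Swarm": ipaddress.ip_network("10.0.0.0/16"),
    "MCR": ipaddress.ip_network("10.99.0.0/16"),
}


def find_overlaps(cluster_cidrs):
    """Return (cidr, reserved_name) pairs for every CIDR that overlaps a default pool."""
    conflicts = []
    for cidr in cluster_cidrs:
        network = ipaddress.ip_network(cidr)
        for name, reserved in MKE_DEFAULT_NETWORKS.items():
            if network.overlaps(reserved):
                conflicts.append((cidr, name))
    return conflicts


# PXE and LCM networks from the IPAM parameter examples: no conflicts expected.
print(find_overlaps(["172.16.59.0/24", "172.16.61.0/25"]))

# Hypothetical network that collides with the default MCR pool.
print(find_overlaps(["10.99.16.0/20"]))
```

An empty list means that none of the checked networks intersects the default Swarm or MCR pools.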

Configure bare metal settings

During creation of a bare metal management cluster using Bootstrap v2, configure several cluster settings to fit your deployment.

Configure BIOS on a bare metal host

Note

Before the management cluster is updated to Container Cloud 2.29.0 (Cluster release 16.4.0), use the BareMetalHost object instead of BareMetalHostInventory. For details, see BareMetalHost resource.

Caution

While the Cluster release of the management cluster is 16.4.0, BareMetalHostInventory operations are allowed to m:kaas@management-admin only. This limitation is lifted once the management cluster is updated to the Cluster release 16.4.1 or later.

Before adding new BareMetalHostInventory objects, configure hardware hosts to correctly boot them over the PXE network.

Important

Consider the following common requirements for hardware hosts configuration:

  • Update firmware for BIOS and Baseboard Management Controller (BMC) to the latest available version, especially if you are going to apply the UEFI configuration.

    Container Cloud uses the ipxe.efi binary loader that might not be compatible with old firmware and might have vendor-related issues with UEFI booting, for example, the Supermicro issue. In this case, we recommend using the legacy booting format.

  • Configure all or at least the PXE NIC on switches.

    If the hardware host has more than one PXE NIC to boot, we strongly recommend setting up only one in the boot order. It speeds up the provisioning phase significantly.

    Some hardware vendors require a host to be rebooted during BIOS configuration changes from legacy to UEFI or vice versa for the extra option with NIC settings to appear in the menu.

  • Connect only one Ethernet port on a host to the PXE network at any given time. Collect the physical address (MAC) of this interface and use it to configure the BareMetalHostInventory object describing the host.

To configure BIOS on a bare metal host, follow the procedure that matches the boot mode of the host.

Legacy boot mode:

  1. Enable the global BIOS mode using BIOS > Boot > boot mode select > legacy. Reboot the host if required.

  2. Enable the LAN-PXE-OPROM support using the following menus:

    • BIOS > Advanced > PCI/PCIe Configuration > LAN OPROM TYPE > legacy

    • BIOS > Advanced > PCI/PCIe Configuration > Network Stack > enabled

    • BIOS > Advanced > PCI/PCIe Configuration > IPv4 PXE Support > enabled

  3. Set up the configured boot order:

    1. BIOS > Boot > Legacy-Boot-Order#1 > Hard Disk

    2. BIOS > Boot > Legacy-Boot-Order#2 > NIC

  4. Save changes and power off the host.

UEFI boot mode:

  1. Enable the global BIOS mode using BIOS > Boot > boot mode select > UEFI. Reboot the host if required.

  2. Enable the LAN-PXE-OPROM support using the following menus:

    • BIOS > Advanced > PCI/PCIe Configuration > LAN OPROM TYPE > uefi

    • BIOS > Advanced > PCI/PCIe Configuration > Network Stack > enabled

    • BIOS > Advanced > PCI/PCIe Configuration > IPv4 PXE Support > enabled

    Note

    UEFI support might not apply to all NICs, but at least the built-in network interfaces should support it.

  3. Set up the configured boot order:

    1. BIOS > Boot > UEFI-Boot-Order#1 > UEFI Hard Disk

    2. BIOS > Boot > UEFI-Boot-Order#2 > UEFI Network

  4. Save changes and power off the host.

Customize the default bare metal host profile

This section describes the bare metal host profile settings and explains how to configure this profile before deploying Mirantis Container Cloud on physical servers.

The bare metal host profile is a Kubernetes custom resource. It allows the infrastructure operator to define how the storage devices and the operating system are provisioned and configured.

The bootstrap templates for a bare metal deployment include the template for the default BareMetalHostProfile object, which defines the default bare metal host profile, in the following file:

templates/bm/baremetalhostprofiles.yaml.template

Note

Using BareMetalHostProfile, you can configure LVM or mdadm-based software RAID support during a management or managed cluster creation. For details, see Configure RAID support.

Warning

All data will be wiped during cluster deployment on devices defined directly or indirectly in the fileSystems list of BareMetalHostProfile. For example:

  • A raw device partition with a file system on it

  • A device partition in a volume group with a logical volume that has a file system on it

  • An mdadm RAID device with a file system on it

  • An LVM RAID device with a file system on it

The wipe field is always considered true for these devices. The false value is ignored.

Therefore, to prevent data loss, move the necessary data from these file systems to another server beforehand, if required.

The customization procedure of BareMetalHostProfile is almost the same for the management and managed clusters, with the following differences:

  • For a management cluster, the customization automatically applies to machines during bootstrap. For a managed cluster, you apply the changes using kubectl before creating the managed cluster.

  • For a management cluster, you edit the default baremetalhostprofiles.yaml.template. For a managed cluster, you create a new BareMetalHostProfile with the necessary configuration.

For the procedure details, see Create a custom bare metal host profile. Use this procedure for both types of clusters considering the differences described above.

Configure NIC bonding

You can configure L2 templates for the management cluster to set up a bond network interface for the PXE and management network.

Apply this configuration to the bootstrap templates before you run the bootstrap script to deploy the management cluster.

Configuration requirements for NIC bonding

  • Add at least two physical interfaces to each host in your management cluster.

  • Connect at least two interfaces per host to an Ethernet switch that supports Link Aggregation Control Protocol (LACP) port groups and LACP fallback.

  • Configure an LACP group on the ports connected to the NICs of a host.

  • Configure the LACP fallback on the port group to ensure that the host can boot over the PXE network before the bond interface is set up on the host operating system.

  • Configure server BIOS for both NICs of a bond to be PXE-enabled.

  • If the server does not support booting from multiple NICs, configure the port of the LACP group that is connected to the PXE-enabled NIC of a server to be the primary port. With this setting, the port becomes active in the fallback mode.

  • Configure the ports that connect servers to the PXE network with the PXE VLAN as native or untagged.

For reference configuration of network fabric in a baremetal-based cluster, see Container Cloud Reference Architecture: Network fabric.

To configure a bond interface that aggregates two interfaces for the PXE and management network:

  1. In kaas-bootstrap/templates/bm/ipam-objects.yaml.template:

    1. Verify that only the following parameters for the declaration of {{nic 0}} and {{nic 1}} are set, as shown in the example below:

      • dhcp4

      • dhcp6

      • match

      • set-name

      Remove other parameters.

    2. Verify that the declaration of the bond interface bond0 has the interfaces parameter listing both Ethernet interfaces.

    3. Verify that the node address in the PXE network (ip "bond0:mgmt-pxe" in the example below) is bound to the bond interface or to the virtual bridge interface tied to that bond.

      Caution

      Do not configure any VLAN ID for the PXE network on the host side.

    4. Configure bonding options using the parameters field. The only mandatory option is mode. See the example below for details.

      Note

      You can set any mode supported by netplan and your hardware.

      Important

      Bond monitoring is disabled in Ubuntu by default. However, Mirantis highly recommends enabling it using the Media Independent Interface (MII) monitoring by setting the mii-monitor-interval parameter to a non-zero value. For details, see Linux documentation: bond monitoring.

  2. Verify your configuration using the following example:

    kind: L2Template
    metadata:
      name: kaas-mgmt
      ...
    spec:
      ...
      l3Layout:
        - subnetName: kaas-mgmt
          scope:      namespace
      npTemplate: |
        version: 2
        ethernets:
          {{nic 0}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 0}}
            set-name: {{nic 0}}
          {{nic 1}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 1}}
            set-name: {{nic 1}}
        bonds:
          bond0:
            interfaces:
              - {{nic 0}}
              - {{nic 1}}
            parameters:
              mode: 802.3ad
              mii-monitor-interval: 100
            dhcp4: false
            dhcp6: false
            addresses:
              - {{ ip "bond0:mgmt-pxe" }}
        vlans:
          k8s-lcm:
            id: SET_VLAN_ID
            link: bond0
            addresses:
              - {{ ip "k8s-lcm:kaas-mgmt" }}
            nameservers:
              addresses: {{ nameservers_from_subnet "kaas-mgmt" }}
            routes:
              - to: 0.0.0.0/0
                via: {{ gateway_from_subnet "kaas-mgmt" }}
        ...
    
  3. Proceed to bootstrap your management cluster as described in Deploy a management cluster using CLI.
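
Before bootstrapping, the bond declaration can be checked against the requirements listed in this procedure. The following Python sketch is only an illustration: the check_bond helper is hypothetical, it works on a bond definition that is already loaded into a dictionary, and the list of modes reflects the bond modes that netplan documents. Whether a given mode works still depends on your hardware and switch configuration.

```python
# Bond modes documented by netplan. Hardware and switch support is not verified here.
NETPLAN_BOND_MODES = {
    "balance-rr", "active-backup", "balance-xor", "broadcast",
    "802.3ad", "balance-tlb", "balance-alb",
}


def check_bond(bond):
    """Return a list of findings for one bond definition given as a dict."""
    findings = []
    if len(bond.get("interfaces", [])) < 2:
        findings.append("error: the bond must list both Ethernet interfaces")
    params = bond.get("parameters", {})
    mode = params.get("mode")
    if mode is None:
        findings.append("error: parameters.mode is mandatory")
    elif str(mode) not in NETPLAN_BOND_MODES:
        findings.append(f"error: unknown bond mode {mode!r}")
    # A missing or zero interval means that MII monitoring stays disabled.
    if params.get("mii-monitor-interval") in (None, 0, "0"):
        findings.append("warning: MII monitoring is disabled")
    return findings


# The bond0 definition from the example above, with template macros replaced
# by hypothetical interface names.
bond0 = {
    "interfaces": ["eno1", "eno2"],
    "parameters": {"mode": "802.3ad", "mii-monitor-interval": 100},
}
print(check_bond(bond0))  # []
```

An empty list means that the bond satisfies the checks covered by this sketch. It does not replace validation by netplan itself.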

Separate PXE and management networks

This section describes how to configure a dedicated PXE network for a management bare metal cluster. A separate PXE network allows isolating the sensitive bare metal provisioning process from end users. The users still have access to Container Cloud services, such as Keycloak, to authenticate workloads in managed clusters, such as Horizon in a Mirantis OpenStack for Kubernetes cluster.

Note

This additional configuration procedure must be completed as part of the main Deploy a management cluster using CLI procedure. It substitutes or appends some configuration parameters and templates that are used in the main procedure for the management cluster to use two networks, PXE and management, instead of one PXE/management network. Mirantis recommends considering the main procedure first.

The following table describes the overall network mapping scheme with all L2/L3 parameters, using two example networks: PXE (CIDR 10.0.0.0/24) and management (CIDR 10.0.11.0/24):

Network mapping overview

Deployment file name

Network

Parameters and values

cluster.yaml

Management

  • SET_LB_HOST=10.0.11.90

  • SET_METALLB_ADDR_POOL=10.0.11.61-10.0.11.80

ipam-objects.yaml

PXE

  • SET_IPAM_CIDR=10.0.0.0/24

  • SET_PXE_NW_GW=10.0.0.1

  • SET_PXE_NW_DNS=8.8.8.8

  • SET_IPAM_POOL_RANGE=10.0.0.100-10.0.0.109

  • SET_METALLB_PXE_ADDR_POOL=10.0.0.61-10.0.0.70

ipam-objects.yaml

Management

  • SET_LCM_CIDR=10.0.11.0/24

  • SET_LCM_RANGE=10.0.11.100-10.0.11.199

  • SET_LB_HOST=10.0.11.90

  • SET_METALLB_ADDR_POOL=10.0.11.61-10.0.11.80

bootstrap.sh

PXE

  • KAAS_BM_PXE_IP=10.0.0.20

  • KAAS_BM_PXE_MASK=24

  • KAAS_BM_PXE_BRIDGE=br0

  • KAAS_BM_BM_DHCP_RANGE=10.0.0.30,10.0.0.59,255.255.255.0

  • BOOTSTRAP_METALLB_ADDRESS_POOL=10.0.0.61-10.0.0.80


When using separate PXE and management networks, the management cluster services are exposed in different networks using two separate MetalLB address pools:

  • Services exposed through the PXE network are as follows:

    • Ironic API as a bare metal provisioning server

    • HTTP server that provides images for network boot and server provisioning

    • Caching server for accessing the Container Cloud artifacts deployed on hosts

  • Services exposed through the management network are all other Container Cloud services, such as Keycloak, web UI, and so on.

To configure separate PXE and management networks:

  1. Review the guidelines for configuring the Subnet object as a MetalLB address pool, as described in MetalLB configuration guidelines for subnets.

  2. To ensure successful bootstrap, enable asymmetric routing on the interfaces of the management cluster nodes. This is required because the seed node relies on one network by default, which can potentially cause traffic asymmetry.

    In the kernelParameters section of baremetalhostprofiles.yaml.template, set rp_filter to 2. This enables loose mode as defined in RFC3704.

    Example configuration of asymmetric routing
    ...
    kernelParameters:
      ...
      sysctl:
        # Enables the "Loose mode" for the "k8s-lcm" interface (management network)
        net.ipv4.conf.k8s-lcm.rp_filter: "2"
        # Enables the "Loose mode" for the "bond0" interface (PXE network)
        net.ipv4.conf.bond0.rp_filter: "2"
        ...
    

    Note

    Alternatively, you can eliminate traffic asymmetry altogether. Such solutions are more complicated and are not described in this manual, for example:

    • Configure source routing on management cluster nodes.

    • Plug the seed node into the same networks as the management cluster nodes, which requires custom configuration of the seed node.

  3. In kaas-bootstrap/templates/bm/ipam-objects.yaml.template:

    • Substitute all Subnet object templates with the new ones as described in the example template below

    • Update the L2 template spec.l3Layout and spec.npTemplate fields as described in the example template below

    Example of the Subnet object templates
    # Subnet object that provides IP addresses for bare metal hosts of
    # management cluster in the PXE network.
    apiVersion: "ipam.mirantis.com/v1alpha1"
    kind: Subnet
    metadata:
      name: mgmt-pxe
      namespace: default
      labels:
        kaas.mirantis.com/provider: baremetal
        kaas-mgmt-pxe-subnet: ""
    spec:
      cidr: SET_IPAM_CIDR
      gateway: SET_PXE_NW_GW
      nameservers:
        - SET_PXE_NW_DNS
      includeRanges:
        - SET_IPAM_POOL_RANGE
      excludeRanges:
        - SET_METALLB_PXE_ADDR_POOL
    ---
    # Subnet object that provides IP addresses for bare metal hosts of
    # management cluster in the management network.
    apiVersion: "ipam.mirantis.com/v1alpha1"
    kind: Subnet
    metadata:
      name: mgmt-lcm
      namespace: default
      labels:
        kaas.mirantis.com/provider: baremetal
        kaas-mgmt-lcm-subnet: ""
        ipam/SVC-k8s-lcm: "1"
        ipam/SVC-ceph-cluster: "1"
        ipam/SVC-ceph-public: "1"
        cluster.sigs.k8s.io/cluster-name: CLUSTER_NAME
    spec:
      cidr: {{ SET_LCM_CIDR }}
      includeRanges:
        - {{ SET_LCM_RANGE }}
      excludeRanges:
        - SET_LB_HOST
        - SET_METALLB_ADDR_POOL
    ---
    # Deprecated since 2.27.0. Subnet object that provides configuration
    # for "services-pxe" MetalLB address pool that will be used to expose
    # services LB endpoints in the PXE network.
    apiVersion: "ipam.mirantis.com/v1alpha1"
    kind: Subnet
    metadata:
      name: mgmt-pxe-lb
      namespace: default
      labels:
        kaas.mirantis.com/provider: baremetal
        metallb/address-pool-name: services-pxe
        metallb/address-pool-protocol: layer2
        metallb/address-pool-auto-assign: "false"
        cluster.sigs.k8s.io/cluster-name: CLUSTER_NAME
    spec:
      cidr: SET_IPAM_CIDR
      includeRanges:
        - SET_METALLB_PXE_ADDR_POOL
    
    Example of the L2 template spec
    kind: L2Template
    ...
    spec:
      ...
      l3Layout:
        - scope: namespace
          subnetName: kaas-mgmt-pxe
          labelSelector:
            kaas.mirantis.com/provider: baremetal
            kaas-mgmt-pxe-subnet: ""
        - scope: namespace
          subnetName: kaas-mgmt-lcm
          labelSelector:
            kaas.mirantis.com/provider: baremetal
            kaas-mgmt-lcm-subnet: ""
      npTemplate: |
        version: 2
        renderer: networkd
        ethernets:
          {{nic 0}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 0}}
            set-name: {{nic 0}}
          {{nic 1}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 1}}
            set-name: {{nic 1}}
        bridges:
          bm-pxe:
            interfaces:
             - {{ nic 0 }}
            dhcp4: false
            dhcp6: false
            addresses:
              - {{ ip "bm-pxe:kaas-mgmt-pxe" }}
            nameservers:
              addresses: {{ nameservers_from_subnet "kaas-mgmt-pxe" }}
            routes:
              - to: 0.0.0.0/0
                via: {{ gateway_from_subnet "kaas-mgmt-pxe" }}
          k8s-lcm:
            interfaces:
             - {{ nic 1 }}
            dhcp4: false
            dhcp6: false
            addresses:
              - {{ ip "k8s-lcm:kaas-mgmt-lcm" }}
            nameservers:
              addresses: {{ nameservers_from_subnet "kaas-mgmt-lcm" }}
    

    Deprecated since Container Cloud 2.27.0 (Cluster releases 17.2.0 and 16.2.0): the last Subnet template named mgmt-pxe-lb in the example above will be used to configure the MetalLB address pool in the PXE network. The bare metal provider will automatically configure MetalLB with address pools using the Subnet objects identified by specific labels.

    Warning

    The bm-pxe address must have a separate interface with only one address on this interface.

  4. Verify the current MetalLB configuration that is stored in MetalLB objects:

    kubectl -n metallb-system get ipaddresspools,l2advertisements
    

    For the example configuration described above, the system response is similar to the following:

    NAME                                    AGE
    ipaddresspool.metallb.io/default        129m
    ipaddresspool.metallb.io/services-pxe   129m
    
    NAME                                      AGE
    l2advertisement.metallb.io/default        129m
    l2advertisement.metallb.io/services-pxe   129m
    

    To verify the MetalLB objects:

    kubectl -n metallb-system get <object> -o json | jq '.spec'
    

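    For the example configuration described above, the system response for the ipaddresspool objects is similar to the following:

    $ kubectl -n metallb-system get ipaddresspool.metallb.io/default -o json | jq '.spec'
    {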
      "addresses": [
        "10.0.11.61-10.0.11.80"
      ],
      "autoAssign": true,
      "avoidBuggyIPs": false
    }
    $ kubectl -n metallb-system get ipaddresspool.metallb.io/services-pxe -o json | jq '.spec'
    {
      "addresses": [
        "10.0.0.61-10.0.0.70"
      ],
      "autoAssign": false,
      "avoidBuggyIPs": false
    }
    

    The autoAssign parameter is set to false for all address pools except the default one. Therefore, a service gets an address from such an address pool only if its Service object has the metallb.universe.tf/address-pool annotation that points to the specific address pool name.

    Note

    It is expected that every Container Cloud service on a management cluster will be assigned to one of the address pools. The current approach is to have two MetalLB address pools:

    • services-pxe is a reserved address pool name to use for the Container Cloud services in the PXE network (Ironic API, HTTP server, caching server).

      The bootstrap cluster also uses the services-pxe address pool for its provisioning services so that the management cluster nodes can be provisioned from the bootstrap cluster. After the management cluster is deployed, the bootstrap cluster is deleted and that address pool is solely used by the newly deployed cluster.

    • default is an address pool to use for all other Container Cloud services in the management network. No annotation is required on the Service objects in this case.
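
The pool assignment behavior described in this step can be modeled as follows. This Python sketch is a simplified illustration of the rule stated above and not the MetalLB implementation: the select_pool helper is hypothetical and takes a mapping of pool names to their autoAssign flags.

```python
# Annotation named in the step above.
ADDRESS_POOL_ANNOTATION = "metallb.universe.tf/address-pool"


def select_pool(service_annotations, pools):
    """Return the pool name for a service.

    pools maps each address pool name to its autoAssign flag.
    """
    requested = service_annotations.get(ADDRESS_POOL_ANNOTATION)
    if requested is not None:
        if requested not in pools:
            raise KeyError(f"unknown address pool: {requested}")
        return requested
    # Without the annotation, only pools with autoAssign enabled are eligible.
    for name, auto_assign in pools.items():
        if auto_assign:
            return name
    return None


# Pools from the example configuration above.
pools = {"default": True, "services-pxe": False}

print(select_pool({}, pools))                                         # default
print(select_pool({ADDRESS_POOL_ANNOTATION: "services-pxe"}, pools))  # services-pxe
```

In this model, a service without the annotation never receives an address from services-pxe because that pool has autoAssign set to false.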

  5. In addition to the network parameters defined in Deploy a management cluster using CLI, configure the following ones by replacing them in templates/bm/ipam-objects.yaml.template:

    New subnet template parameters

    Parameter

    Description

    Example value

    SET_LCM_CIDR

    Address of a management network for the management cluster in the CIDR notation. You can later share this network with managed clusters where it will act as the LCM network. If managed clusters have their separate LCM networks, those networks must be routable to the management network.

    10.0.11.0/24

    SET_LCM_RANGE

    Address range that includes addresses to be allocated to bare metal hosts in the management network for the management cluster. When this network is shared with managed clusters, the size of this range limits the number of hosts that can be deployed in all clusters that share this network. When this network is solely used by a management cluster, the range should include at least 3 IP addresses for bare metal hosts of the management cluster.

    10.0.11.100-10.0.11.109

    SET_METALLB_PXE_ADDR_POOL

    Address range to be used for LB endpoints of the Container Cloud services: Ironic-API, HTTP server, and caching server. This range must be within the PXE network. The minimum required range is 5 IP addresses.

    10.0.0.61-10.0.0.70

    The following parameters will now be tied to the management network while their meaning remains the same as described in Deploy a management cluster using CLI:

    Subnet template parameters migrated to management network

    Parameter

    Description

    Example value

    SET_LB_HOST

    IP address of the externally accessible API endpoint of the management cluster. This address must be within the management network but must NOT be within the SET_METALLB_ADDR_POOL range. External load balancers are not supported.

    10.0.11.90

    SET_METALLB_ADDR_POOL

    The address range to be used for the externally accessible LB endpoints of the Container Cloud services, such as Keycloak, web UI, and so on. This range must be within the management network. The minimum required range is 19 IP addresses.

    10.0.11.61-10.0.11.80

  6. Proceed to further steps in Deploy a management cluster using CLI.
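
The address constraints stated in the parameter tables of this procedure can be verified before you edit the templates. The following Python sketch is a minimal illustration that uses the example values from the network mapping overview. The check_plan helper and its messages are hypothetical, and only the constraints explicitly listed in the tables are checked.

```python
import ipaddress


def parse_range(text):
    """Parse 'A-B' or a single address 'A' into a (first, last) pair."""
    first, _, last = text.partition("-")
    start = ipaddress.ip_address(first.strip())
    end = ipaddress.ip_address(last.strip()) if last else start
    if start > end:
        raise ValueError(f"range start is after range end: {text}")
    return start, end


def size(rng):
    return int(rng[1]) - int(rng[0]) + 1


def inside(rng, cidr):
    network = ipaddress.ip_network(cidr)
    return rng[0] in network and rng[1] in network


def overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]


def check_plan(p):
    """Return a list of violated constraints for the given parameter values."""
    problems = []
    lb_host = parse_range(p["SET_LB_HOST"])
    lcm_pool = parse_range(p["SET_METALLB_ADDR_POOL"])
    pxe_pool = parse_range(p["SET_METALLB_PXE_ADDR_POOL"])
    lcm_range = parse_range(p["SET_LCM_RANGE"])
    if not inside(lb_host, p["SET_LCM_CIDR"]):
        problems.append("SET_LB_HOST is outside the management network")
    if overlap(lb_host, lcm_pool):
        problems.append("SET_LB_HOST is inside SET_METALLB_ADDR_POOL")
    if not inside(lcm_pool, p["SET_LCM_CIDR"]):
        problems.append("SET_METALLB_ADDR_POOL is outside the management network")
    if size(lcm_pool) < 19:
        problems.append("SET_METALLB_ADDR_POOL has fewer than 19 addresses")
    if not inside(pxe_pool, p["SET_IPAM_CIDR"]):
        problems.append("SET_METALLB_PXE_ADDR_POOL is outside the PXE network")
    if size(pxe_pool) < 5:
        problems.append("SET_METALLB_PXE_ADDR_POOL has fewer than 5 addresses")
    if not inside(lcm_range, p["SET_LCM_CIDR"]):
        problems.append("SET_LCM_RANGE is outside the management network")
    if size(lcm_range) < 3:
        problems.append("SET_LCM_RANGE has fewer than 3 addresses")
    return problems


# Example values from the network mapping overview.
example = {
    "SET_IPAM_CIDR": "10.0.0.0/24",
    "SET_METALLB_PXE_ADDR_POOL": "10.0.0.61-10.0.0.70",
    "SET_LCM_CIDR": "10.0.11.0/24",
    "SET_LCM_RANGE": "10.0.11.100-10.0.11.199",
    "SET_LB_HOST": "10.0.11.90",
    "SET_METALLB_ADDR_POOL": "10.0.11.61-10.0.11.80",
}
print(check_plan(example))  # []
```

An empty list means that the values satisfy the constraints covered by the sketch. Other requirements, such as routing between networks, are not checked.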

Configure multiple DHCP address ranges

To facilitate multi-rack and other types of distributed bare metal datacenter topologies, the dnsmasq DHCP server used for host provisioning in Container Cloud supports working with multiple L2 segments through network routers that support DHCP relay.

Container Cloud has its own DHCP relay running on one of the management cluster nodes. This DHCP relay proxies DHCP requests in the same L2 domain where the management cluster nodes are located.

Caution

Networks used for hosts provisioning of a managed cluster must have routes to the PXE network of the management cluster. This configuration enables hosts to have access to the management cluster services that are used during host provisioning.

Management cluster nodes must have routes through the PXE network to PXE network segments used on a managed cluster. The following example contains L2 template fragments for a management cluster node:

Configuration example extract
l3Layout:
  # PXE/static subnet for a management cluster
  - scope: namespace
    subnetName: kaas-mgmt-pxe
    labelSelector:
      kaas-mgmt-pxe-subnet: "1"
  # management (LCM) subnet for a management cluster
  - scope: namespace
    subnetName: kaas-mgmt-lcm
    labelSelector:
      kaas-mgmt-lcm-subnet: "1"
  # PXE/dhcp subnets for a managed cluster
  - scope: namespace
    subnetName: managed-dhcp-rack-1
  - scope: namespace
    subnetName: managed-dhcp-rack-2
  - scope: namespace
    subnetName: managed-dhcp-rack-3
  ...
npTemplate: |
  ...
  bonds:
    bond0:
      interfaces:
        - {{ nic 0 }}
        - {{ nic 1 }}
      parameters:
        mode: active-backup
        primary: {{ nic 0 }}
        mii-monitor-interval: 100
      dhcp4: false
      dhcp6: false
      addresses:
        # static address on management node in the PXE network
        - {{ ip "bond0:kaas-mgmt-pxe" }}
      routes:
        # routes to managed PXE network segments
        - to: {{ cidr_from_subnet "managed-dhcp-rack-1" }}
          via: {{ gateway_from_subnet "kaas-mgmt-pxe" }}
        - to: {{ cidr_from_subnet "managed-dhcp-rack-2" }}
          via: {{ gateway_from_subnet "kaas-mgmt-pxe" }}
        - to: {{ cidr_from_subnet "managed-dhcp-rack-3" }}
          via: {{ gateway_from_subnet "kaas-mgmt-pxe" }}
        ...

To configure DHCP ranges for dnsmasq, create the Subnet objects tagged with the ipam/SVC-dhcp-range label while setting up subnets for a managed cluster using CLI.

Caution

Support of multiple DHCP ranges has the following limitations:

  • Using custom DNS server addresses for servers that boot over PXE is not supported.

  • The Subnet objects for DHCP ranges cannot be associated with any specific cluster because the DHCP server configuration applies only to the management cluster where the DHCP server is running. The cluster.sigs.k8s.io/cluster-name label is ignored.

Configure DHCP ranges for dnsmasq
  1. Create the Subnet objects tagged with the ipam/SVC-dhcp-range label.

    Caution

    For cluster-specific subnets, create Subnet objects in the same namespace (project) as the related Cluster object. For shared subnets, create Subnet objects in the default namespace.

    To create the Subnet objects, refer to Create subnets.

    Use the following Subnet object example to specify DHCP ranges and DHCP options to pass the default route address:

    apiVersion: "ipam.mirantis.com/v1alpha1"
    kind: Subnet
    metadata:
      name: mgmt-dhcp-range
      namespace: default
      labels:
        ipam/SVC-dhcp-range: ""
        kaas.mirantis.com/provider: baremetal
    spec:
      cidr: 10.11.0.0/24
      gateway: 10.11.0.1
      includeRanges:
        - 10.11.0.121-10.11.0.125
        - 10.11.0.191-10.11.0.199
    

    Note

    Setting custom nameservers in the DHCP subnet is not supported.

    After you create the Subnet object, the provided data is used to render the Dnsmasq object that configures the dnsmasq deployment. You do not have to edit the Dnsmasq object manually.

  2. Verify that the changes are applied to the Dnsmasq object:

    kubectl --kubeconfig <pathToMgmtClusterKubeconfig> \
    -n kaas get dnsmasq dnsmasq-dynamic-config -o json
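
Before you create the Subnet object, you can sanity-check its address fields. The following Python sketch uses the example spec from step 1 and only verifies that the gateway and every range fall within the CIDR. The function name is hypothetical, and the checks are basic arithmetic rather than the validation that Container Cloud itself performs.

```python
import ipaddress

# Address fields copied from the Subnet example in step 1.
subnet_spec = {
    "cidr": "10.11.0.0/24",
    "gateway": "10.11.0.1",
    "includeRanges": ["10.11.0.121-10.11.0.125", "10.11.0.191-10.11.0.199"],
}


def validate_dhcp_subnet(spec):
    """Return (number of DHCP addresses, list of problems) for a Subnet spec."""
    net = ipaddress.ip_network(spec["cidr"])
    errors = []
    total = 0
    if ipaddress.ip_address(spec["gateway"]) not in net:
        errors.append("gateway is outside cidr")
    for text in spec["includeRanges"]:
        first, last = (ipaddress.ip_address(part.strip()) for part in text.split("-"))
        if first > last:
            errors.append(f"range start is after range end: {text}")
            continue
        if first not in net or last not in net:
            errors.append(f"range is outside cidr: {text}")
        total += int(last) - int(first) + 1
    return total, errors


print(validate_dhcp_subnet(subnet_spec))
```

The first element of the result shows how many hosts can obtain an address from the configured ranges at the same time, which is useful when sizing ranges for a rack.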
    
Configure DHCP relay on ToR switches

For servers to access the DHCP server across the L2 segment boundaries, for example, from another rack with a different VLAN for the PXE network, you must configure the DHCP relay (agent) service on the border switch of the segment, such as a top-of-rack (ToR) or leaf (distribution) switch, depending on the data center network topology.

Warning

To ensure predictable routing for the relay of DHCP packets, Mirantis strongly advises against the use of chained DHCP relay configurations. This precaution limits the number of hops for DHCP packets, with an optimal scenario being a single hop.

This approach is justified by the unpredictable nature of chained relay configurations and potential incompatibilities between software and hardware relay implementations.

The dnsmasq server listens on the PXE network of the management cluster by using the dhcp-lb Kubernetes Service.

To configure the DHCP relay service, specify the external address of the dhcp-lb Kubernetes Service as the upstream address for the relayed DHCP requests, that is, as the IP helper address for DHCP. The dnsmasq deployment behind this service accepts only relayed DHCP requests.

Container Cloud has its own DHCP relay running on one of the management cluster nodes. This relay proxies DHCP requests within the L2 domain where the management cluster nodes are located.

To obtain the actual IP address issued to the dhcp-lb Kubernetes Service:

kubectl -n kaas get service dhcp-lb
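
If you need the address in a script, for example, to generate the IP helper configuration for switches, you can request the Service in the JSON format by adding -o json to the command above and read the status.loadBalancer.ingress field. The following Python sketch assumes that output format. The function name and the sample address are hypothetical.

```python
import json


def load_balancer_ips(service_json):
    """Return the external IP addresses of a LoadBalancer Service.

    service_json is the text printed by:
    kubectl -n kaas get service dhcp-lb -o json
    """
    service = json.loads(service_json)
    status = service.get("status") or {}
    ingress = (status.get("loadBalancer") or {}).get("ingress") or []
    return [entry["ip"] for entry in ingress if "ip" in entry]


# Hypothetical sample that mimics the relevant part of the kubectl output.
sample = json.dumps({"status": {"loadBalancer": {"ingress": [{"ip": "10.0.24.121"}]}}})
print(load_balancer_ips(sample))
```

An empty list means that no external address has been assigned to the Service yet.
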
Migration of DHCP configuration for existing management clusters

Note

This section applies only to existing management clusters that were created before Container Cloud 2.24.0 (Cluster release 14.0.0).

Caution

Since Container Cloud 2.24.0, you can only remove the deprecated dnsmasq.dhcp_range, dnsmasq.dhcp_ranges, dnsmasq.dhcp_routers, and dnsmasq.dhcp_dns_servers values from the cluster spec.

The Admission Controller does not accept any other changes in these values. This configuration is completely superseded by the Subnet object.

The DHCP configuration was automatically migrated from the cluster spec to Subnet objects after the cluster upgrade to Container Cloud 2.21.0 (Cluster release 11.5.0).

To remove the deprecated dnsmasq parameters from the cluster spec:

  1. Open the management cluster spec for editing.

  2. In the baremetal-operator release values, remove the dnsmasq.dhcp_range, dnsmasq.dhcp_ranges, dnsmasq.dhcp_routers, and dnsmasq.dhcp_dns_servers parameters. For example:

    regional:
    - helmReleases:
      - name: baremetal-operator
        values:
          dnsmasq:
            dhcp_range: 10.204.1.0,10.204.5.255,255.255.255.0
    

    Caution

    The dnsmasq.dhcp_<name> parameters of the baremetal-operator Helm chart values in the Cluster spec are deprecated since the Cluster release 11.5.0 and removed in the Cluster release 14.0.0.

  3. Ensure that the required DHCP ranges and options are set in the Subnet objects. For configuration details, see Configure DHCP ranges for dnsmasq.

The dnsmasq configuration options dhcp-option=3 and dhcp-option=6 are absent in the default configuration. So, by default, dnsmasq will send the DNS server and default route to DHCP clients as defined in the dnsmasq official documentation:

  • The netmask and broadcast address are the same as on the host running dnsmasq.

  • The DNS server and default route are set to the address of the host running dnsmasq.

  • If the domain name option is set, this name is sent to DHCP clients.

Enable dynamic IP allocation

Available since MCC 2.26.0 (Cluster release 16.1.0)

This section describes how to enable the dynamic IP allocation feature to increase the number of bare metal hosts that can be provisioned in parallel on managed clusters.

With this feature, you can deploy a large managed cluster by provisioning up to 100 hosts simultaneously. In addition to dynamic IP allocation, this feature disables the ping check in the DHCP server. Therefore, if you plan to deploy large managed clusters, enable this feature during the management cluster bootstrap.

Caution

Before using this feature, familiarize yourself with Container Cloud Reference Architecture: DHCP range requirements for PXE.

To enable dynamic IP allocation for large managed clusters:

In the Cluster object of the management cluster, modify the configuration of baremetal-operator by setting dynamic_bootp to true:

spec:
  ...
  providerSpec:
    value:
      kaas:
        ...
        regional:
          - helmReleases:
            - name: baremetal-operator
              values:
                dnsmasq:
                  dynamic_bootp: true
            provider: baremetal
          ...
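
If you prepare the Cluster object programmatically, the same change can be expressed as an update of a nested structure. The following Python sketch works on a plain dictionary that mirrors the YAML layout above. The helper name and the minimal sample object are hypothetical, and loading or saving the actual manifest is out of scope.

```python
import copy


def set_release_value(cluster, release_name, key_path, value):
    """Return a copy of the Cluster dict with values.<key_path> set for a Helm release."""
    updated = copy.deepcopy(cluster)
    regional = updated["spec"]["providerSpec"]["value"]["kaas"]["regional"]
    for region in regional:
        for release in region.get("helmReleases", []):
            if release.get("name") == release_name:
                node = release.setdefault("values", {})
                # Create intermediate keys as needed, then set the leaf value.
                for key in key_path[:-1]:
                    node = node.setdefault(key, {})
                node[key_path[-1]] = value
                return updated
    raise KeyError(f"Helm release not found: {release_name}")


# Minimal stand-in for a Cluster object; real objects contain many more fields.
cluster = {"spec": {"providerSpec": {"value": {"kaas": {"regional": [
    {"provider": "baremetal", "helmReleases": [{"name": "baremetal-operator"}]},
]}}}}}

patched = set_release_value(cluster, "baremetal-operator", ["dnsmasq", "dynamic_bootp"], True)
print(patched["spec"]["providerSpec"]["value"]["kaas"]["regional"][0]["helmReleases"][0])
```

The helper returns a modified copy and leaves the input untouched, which makes it easy to compare the two versions before applying the change.
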
Set a custom external IP address for the DHCP service

Available since MCC 2.25.0 (Cluster release 16.0.0)

This section instructs you on how to set a custom external IP address for the dhcp-lb service so that it remains the same during management cluster upgrades and other LCM operations.

A change of the dhcp-lb service address may require reconfiguring the DHCP relays on ToR switches. The described procedure allows you to avoid such changes. This configuration is useful when you use multiple DHCP address ranges in your deployment. See Configure multiple DHCP address ranges for details.

To set a custom external IP address for the dhcp-lb service:

  1. In the Cluster object of the management cluster, modify the configuration of the baremetal-operator release by setting dnsmasq.dedicated_udp_service_address_pool to true:

    spec:
      ...
      providerSpec:
        value:
          kaas:
            ...
            regional:
              - helmReleases:
                ...
                - name: baremetal-operator
                  values:
                    dnsmasq:
                      dedicated_udp_service_address_pool: true
                      ...
                provider: baremetal
              ...
    
  2. In the MetalLBConfig object of the management cluster, modify the ipAddressPools object list by adding the dhcp-lb object and the serviceAllocation parameters for the default object:

    ipAddressPools:
    - name: default
      spec:
        addresses:
        - 112.181.11.41-112.181.11.60
        autoAssign: true
        avoidBuggyIPs: false
        serviceAllocation:
          serviceSelectors:
          - matchExpressions:
            - key: app.kubernetes.io/name
              operator: NotIn
              values:
              - dhcp-lb
    - name: services-pxe
      spec:
        addresses:
        - 10.0.24.122-10.0.24.140
        autoAssign: false
        avoidBuggyIPs: false
    - name: dhcp-lb
      spec:
        addresses:
        - 10.0.24.121/32
        autoAssign: true
        avoidBuggyIPs: false
        serviceAllocation:
          namespaces:
          - kaas
          serviceSelectors:
          - matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - dhcp-lb
    

    Select non-overlapping IP addresses for all the ipAddressPools that you use: default, services-pxe, and dhcp-lb.

  3. In the MetalLBConfig object of the management cluster, modify the l2Advertisements object list by adding dhcp-lb to the ipAddressPools section in the pxe object spec:

    Note

    A cluster may have a different L2Advertisement object name instead of pxe.

    l2Advertisements:
    ...
    - name: pxe
      spec:
        ipAddressPools:
        - services-pxe
        - dhcp-lb
        ...
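
The requirement in step 2 to select non-overlapping addresses for all pools can be verified with simple range arithmetic. The following Python sketch accepts both notations used in the example (first-last ranges and CIDR blocks). The function names are hypothetical, and the pool data is copied from the example above.

```python
import ipaddress
from itertools import combinations


def to_bounds(entry):
    """Convert 'first-last' or CIDR notation into inclusive integer bounds."""
    if "-" in entry:
        first, last = (int(ipaddress.ip_address(part.strip())) for part in entry.split("-"))
    else:
        net = ipaddress.ip_network(entry, strict=False)
        first, last = int(net.network_address), int(net.broadcast_address)
    return first, last


def find_overlaps(pools):
    """Return sorted pairs of pool names whose address sets intersect."""
    spans = [(name, to_bounds(entry)) for name, entries in pools.items() for entry in entries]
    clashes = set()
    for (name_a, (a1, a2)), (name_b, (b1, b2)) in combinations(spans, 2):
        if name_a != name_b and a1 <= b2 and b1 <= a2:
            clashes.add(tuple(sorted((name_a, name_b))))
    return sorted(clashes)


pools = {
    "default": ["112.181.11.41-112.181.11.60"],
    "services-pxe": ["10.0.24.122-10.0.24.140"],
    "dhcp-lb": ["10.0.24.121/32"],
}
print(find_overlaps(pools))
```

An empty result means that no two pools share an address.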
    
Configure optional settings

Note

Consider this section as part of the Bootstrap v2 CLI procedure.

During creation of a management cluster, you can configure optional cluster settings using the Container Cloud API by modifying cluster.yaml.template.

To configure optional cluster settings:

  1. Technology Preview. Enable custom host names for cluster machines. When enabled, any machine host name in a particular region matches the related Machine object name. For example, instead of the default kaas-node-<UID>, a machine host name will be master-0. The custom naming format is more convenient and easier to operate with.

    Configuration for custom host names on the management and its future managed clusters
    1. In cluster.yaml.template, find the spec.providerSpec.value.kaas.regional.helmReleases.name: baremetal-provider section.

    2. Under values.config, add customHostnamesEnabled: true:

      regional:
       - helmReleases:
         - name: baremetal-provider
           values:
             config:
               allInOneAllowed: false
               customHostnamesEnabled: true
               internalLoadBalancers: false
         provider: baremetal-provider
      
  2. Optional. Technology Preview. Enable the Linux Audit daemon auditd to monitor activity of cluster processes and prevent potential malicious activity.

    Configuration for auditd

    In the Cluster object or cluster.yaml.template, add the auditd parameters:

    spec:
      providerSpec:
        value:
          audit:
            auditd:
              enabled: <bool>
              enabledAtBoot: <bool>
              backlogLimit: <int>
              maxLogFile: <int>
              maxLogFileAction: <string>
              maxLogFileKeep: <int>
              mayHaltSystem: <bool>
              presetRules: <string>
              customRules: <string>
              customRulesX32: <text>
              customRulesX64: <text>
    

    Configuration parameters for auditd:

    enabled

    Boolean, default - false. Enables the auditd role to install the auditd packages and configure rules. CIS rules: 4.1.1.1, 4.1.1.2.

    enabledAtBoot

    Boolean, default - false. Configures grub to audit processes that can be audited even if they start up prior to auditd startup. CIS rule: 4.1.1.3.

    backlogLimit

    Integer, default - none. Configures the backlog to hold records. If during boot audit=1 is configured, the backlog holds 64 records. If more than 64 records are created during boot, auditd records are lost and potential malicious activity may go undetected. CIS rule: 4.1.1.4.

    maxLogFile

    Integer, default - none. Configures the maximum size of the audit log file. Once the log reaches the maximum size, it is rotated and a new log file is created. CIS rule: 4.1.2.1.

    maxLogFileAction

    String, default - none. Defines handling of the audit log file reaching the maximum file size. Allowed values:

    • keep_logs - rotate logs but never delete them

    • rotate - add a cron job to compress rotated log files and keep maximum 5 compressed files.

    • compress - compress log files and keep them under the /var/log/auditd/ directory. Requires auditd_max_log_file_keep to be enabled.

    CIS rule: 4.1.2.2.

    maxLogFileKeep

    Integer, default - 5. Defines the number of compressed log files to keep under the /var/log/auditd/ directory. Requires auditd_max_log_file_action=compress. CIS rules - none.

    mayHaltSystem

    Boolean, default - false. Halts the system when the audit logs are full. Applies the following configuration:

    • space_left_action = email

    • action_mail_acct = root

    • admin_space_left_action = halt

    CIS rule: 4.1.2.3.

    customRules

    String, default - none. Base64-encoded content of the 60-custom.rules file for any architecture. CIS rules - none.

    customRulesX32

    String, default - none. Base64-encoded content of the 60-custom.rules file for the i386 architecture. CIS rules - none.

    customRulesX64

    String, default - none. Base64-encoded content of the 60-custom.rules file for the x86_64 architecture. CIS rules - none.

    presetRules

    String, default - none. Comma-separated list of the following built-in preset rules:

    • access

    • actions

    • delete

    • docker

    • identity

    • immutable

    • logins

    • mac-policy

    • modules

    • mounts

    • perm-mod

    • privileged

    • scope

    • session

    • system-locale

    • time-change

    Since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0) in the Technology Preview scope, you can collect some of the preset rules indicated above as groups and use them in presetRules:

    • ubuntu-cis-rules - this group contains rules to comply with the Ubuntu CIS Benchmark recommendations, including the following CIS Ubuntu 20.04 v2.0.1 rules:

      • scope - 5.2.3.1

      • actions - same as 5.2.3.2

      • time-change - 5.2.3.4

      • system-locale - 5.2.3.5

      • privileged - 5.2.3.6

      • access - 5.2.3.7

      • identity - 5.2.3.8

      • perm-mod - 5.2.3.9

      • mounts - 5.2.3.10

      • session - 5.2.3.11

      • logins - 5.2.3.12

      • delete - 5.2.3.13

      • mac-policy - 5.2.3.14

      • modules - 5.2.3.19

    • docker-cis-rules - this group contains rules to comply with the Docker CIS Benchmark recommendations, including the docker rule that covers the Docker CIS v1.6.0 rules 1.1.3 - 1.1.18.

    You can also use two additional keywords inside presetRules:

    • none - select no built-in rules.

    • all - select all built-in rules. With this keyword, you can exclude individual rules by adding the ! prefix to a rule name. The ! prefix is valid only if all is the first entry in the list, and every rule with the ! prefix must follow the all keyword.

    Example configurations:

    • presetRules: none - disable all preset rules

    • presetRules: docker - enable only the docker rules

    • presetRules: access,actions,logins - enable only the access, actions, and logins rules

    • presetRules: ubuntu-cis-rules - enable all rules from the ubuntu-cis-rules group

    • presetRules: docker-cis-rules,actions - enable all rules from the docker-cis-rules group and the actions rule

    • presetRules: all - enable all preset rules

    • presetRules: all,!immutable,!session - enable all preset rules except immutable and session


    CIS controls: 4.1.3 (time-change), 4.1.4 (identity), 4.1.5 (system-locale), 4.1.6 (mac-policy), 4.1.7 (logins), 4.1.8 (session), 4.1.9 (perm-mod), 4.1.10 (access), 4.1.11 (privileged), 4.1.12 (mounts), 4.1.13 (delete), 4.1.14 (scope), 4.1.15 (actions), 4.1.16 (modules), 4.1.17 (immutable)

    Docker CIS controls: 1.1.4, 1.1.8, 1.1.10, 1.1.12, 1.1.13, 1.1.15, 1.1.16, 1.1.17, 1.1.18, 1.2.3, 1.2.4, 1.2.5, 1.2.6, 1.2.7, 1.2.10, 1.2.11

  3. Configure OIDC integration with LDAP or Google OAuth. For details, see Configure LDAP for IAM or Configure Google OAuth IdP for IAM.

  4. Configure the NTP server. NTP is enabled by default. You can disable it so that MOSK stops managing the chrony configuration and you can manage chrony with your own system. Otherwise, configure the regional NTP server parameters as described below.

    NTP configuration

    Configure the regional NTP server parameters to be applied to all machines of managed clusters.

    In cluster.yaml.template or the Cluster object, add the ntp:servers section with the list of required server names:

    spec:
      ...
      providerSpec:
        value:
          ...
          ntpEnabled: true
          kaas:
            ...
            regional:
              - helmReleases:
                - name: baremetal-provider
                  values:
                    config:
                      lcm:
                        ...
                        ntp:
                          servers:
                          - 0.pool.ntp.org
                          ...
                provider: baremetal
                ...
    

    To disable NTP:

    spec:
      ...
      providerSpec:
        value:
          ...
          ntpEnabled: false
          ...
    
  5. Applies since Container Cloud 2.26.0 (Cluster release 16.1.0). If you plan to deploy large managed clusters, enable dynamic IP allocation to increase the number of bare metal hosts that can be provisioned in parallel. For details, see Enable dynamic IP allocation.
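
The presetRules syntax described in the auditd step above (rule names, groups, and the none and all keywords with the ! prefix) can be illustrated with a small resolver. The following Python sketch is one possible reading of the documented semantics, not the implementation used by Container Cloud. The function name is hypothetical, the mapping of docker-cis-rules to the docker rule is an assumption, and so is the handling of none, which is accepted only when it is the sole entry.

```python
RULES = [
    "access", "actions", "delete", "docker", "identity", "immutable",
    "logins", "mac-policy", "modules", "mounts", "perm-mod", "privileged",
    "scope", "session", "system-locale", "time-change",
]

GROUPS = {
    "ubuntu-cis-rules": [
        "scope", "actions", "time-change", "system-locale", "privileged",
        "access", "identity", "perm-mod", "mounts", "session", "logins",
        "delete", "mac-policy", "modules",
    ],
    # Assumption: this group maps to the single docker preset rule.
    "docker-cis-rules": ["docker"],
}


def resolve_preset_rules(value):
    """Expand a presetRules string into a sorted list of rule names."""
    tokens = [token.strip() for token in value.split(",") if token.strip()]
    if tokens in ([], ["none"]):
        return []
    selected = set()
    for token in tokens:
        if token == "all":
            selected.update(RULES)
        elif token.startswith("!"):
            # Exclusions are valid only when 'all' is the first entry.
            if tokens[0] != "all":
                raise ValueError("the ! prefix requires 'all' as the first entry")
            if token[1:] not in RULES:
                raise ValueError(f"unknown preset rule: {token[1:]}")
            selected.discard(token[1:])
        elif token in GROUPS:
            selected.update(GROUPS[token])
        elif token in RULES:
            selected.add(token)
        else:
            raise ValueError(f"unknown preset rule: {token}")
    return sorted(selected)


print(resolve_preset_rules("docker-cis-rules,actions"))
print(resolve_preset_rules("all,!immutable"))
```

Rejecting unknown names helps catch misspelled rules before the value is placed into the Cluster object.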

Now, proceed with completing the bootstrap process using the Container Cloud Bootstrap API as described in Deploy a management cluster.

Post-deployment steps

After bootstrapping the management cluster, collect and save the following cluster details in a secure location:

  1. Obtain the management cluster kubeconfig:

    ./container-cloud get cluster-kubeconfig \
    --kubeconfig <pathToKindKubeconfig> \
    --cluster-name <clusterName>
    

    By default, pathToKindKubeconfig is $HOME/.kube/kind-config-clusterapi.

  2. Obtain the Keycloak credentials as described in Access the Keycloak Admin Console.

  3. Obtain MariaDB credentials for IAM.

  4. Remove the kind cluster:

    ./bin/kind delete cluster -n <kindClusterName>
    

    By default, kindClusterName is clusterapi.

Now, you can proceed with operating your management cluster through the Container Cloud web UI and deploying MOSK clusters as described in Operations Guide.

Create initial users after a management cluster bootstrap

Once you bootstrap your management cluster, create Keycloak users for access to the Container Cloud web UI.

Mirantis recommends creating at least two users, user and operator, that are required for a typical MOSK deployment.

To create the user for access to the Container Cloud web UI:

./container-cloud bootstrap user add \
    --username <userName> \
    --roles <roleName> \
    --kubeconfig <pathToMgmtKubeconfig>

Note

You will be asked for the user password interactively.

User creation parameters

Flag

Description

--username

Required. Name of the user to create.

--roles

Required. Comma-separated list of roles to assign to the user.

  • If you run the command without the --namespace flag, you can assign the following roles:

    • global-admin - read and write access for global role bindings

    • writer - read and write access

    • reader - view access

    • operator - create and manage access to the BareMetalHost and BareMetalHostInventory (since Container Cloud 2.29.1, Cluster release 16.4.1) objects

    • management-admin - full access to the management cluster, available since Container Cloud 2.25.0 (Cluster release 16.0.0)

  • If you run the command for a specific project using the --namespace flag, you can assign the following roles:

    • operator or writer - read and write access

    • user or reader - view access

    • member - read and write access (excluding IAM objects)

    • bm-pool-operator - create and manage access to the BareMetalHost and BareMetalHostInventory (since Container Cloud 2.29.1, Cluster release 16.4.1) objects

--kubeconfig

Required. Path to the management cluster kubeconfig generated during the management cluster bootstrap.

--namespace

Optional. Name of the Container Cloud project where the user will be created. If not set, a global user will be created for all Container Cloud projects with the corresponding role access to view or manage all public objects.

--password-stdin

Optional. Flag to provide the user password through stdin:

echo "$PASSWORD" | ./container-cloud bootstrap user add \
    --username <userName> \
    --roles <roleName> \
    --kubeconfig <pathToMgmtKubeconfig> \
    --password-stdin

To delete the user:

./container-cloud bootstrap user delete --username <userName> --kubeconfig <pathToMgmtKubeconfig>
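
If you create several users from a script, you can assemble the command as an argument list and pass the password through stdin, which avoids shell quoting issues. The following Python sketch uses only the flags described in the table above. The function names, user names, and paths are hypothetical.

```python
import subprocess


def build_user_add_command(username, roles, kubeconfig, namespace=None,
                           password_stdin=False, binary="./container-cloud"):
    """Build the argument list for 'container-cloud bootstrap user add'."""
    cmd = [
        binary, "bootstrap", "user", "add",
        "--username", username,
        "--roles", ",".join(roles),
        "--kubeconfig", kubeconfig,
    ]
    if namespace:
        cmd += ["--namespace", namespace]
    if password_stdin:
        cmd.append("--password-stdin")
    return cmd


def add_user(username, roles, kubeconfig, password, namespace=None):
    """Run the command and send the password on stdin, like 'echo ... |' does."""
    cmd = build_user_add_command(username, roles, kubeconfig,
                                 namespace=namespace, password_stdin=True)
    return subprocess.run(cmd, input=password + "\n", text=True, check=True)


# Printing the command only; add_user() requires the container-cloud binary.
print(build_user_add_command("operator", ["operator"], "kubeconfig"))
```

Because the arguments are passed as a list, the password and user names never go through shell expansion.
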
Requirements for a MITM proxy

Note

For MOSK clusters, the feature is generally available since MOSK 23.1.

While bootstrapping a Container Cloud management cluster using a proxy, you may require Internet access to go through a man-in-the-middle (MITM) proxy. Such a configuration requires that you enable streaming and install a CA certificate on the bootstrap node.

Enable streaming for MITM

Ensure that the MITM proxy is configured with enabled streaming. For example, if you use mitmproxy, enable the stream_large_bodies=1 option:

./mitmdump --set stream_large_bodies=1
Install a CA certificate for a MITM proxy on a bootstrap node
  1. Log in to the bootstrap node.

  2. Install ca-certificates:

    sudo apt install ca-certificates
    
  3. Copy your CA certificate to the /usr/local/share/ca-certificates/ directory. For example:

    sudo cp ~/.mitmproxy/mitmproxy-ca-cert.cer /usr/local/share/ca-certificates/mitmproxy-ca-cert.crt
    

    Replace ~/.mitmproxy/mitmproxy-ca-cert.cer with the path to your CA certificate.

    Caution

    The target CA certificate file must be in the PEM format with the .crt extension.

  4. Apply the changes:

    sudo update-ca-certificates
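
Before you copy the certificate in step 3, you can check that the file meets the requirement from the caution above: the PEM format and the .crt extension. The following Python sketch is a rough check only. It looks for the PEM header and footer lines and does not validate the certificate itself. The function name is hypothetical.

```python
from pathlib import Path

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"
PEM_END = "-----END CERTIFICATE-----"


def check_ca_file(path):
    """Return a list of problems with a CA certificate file."""
    path = Path(path)
    problems = []
    if path.suffix != ".crt":
        problems.append("file name must end with .crt")
    # Binary (DER) files are read with replacement characters and fail the check.
    text = path.read_text(encoding="utf-8", errors="replace")
    if PEM_BEGIN not in text or PEM_END not in text:
        problems.append("file does not look like a PEM certificate")
    return problems
```

For example, check_ca_file("/usr/local/share/ca-certificates/mitmproxy-ca-cert.crt") returns an empty list if the target file from step 3 has the expected extension and contains the PEM markers.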
    

Now, proceed with bootstrapping your management cluster.

Configure external identity provider for IAM

This section describes how to configure authentication for the management cluster depending on the type of the external identity provider integrated into your deployment.

Configure LDAP for IAM

If you integrate LDAP for IAM with Mirantis OpenStack for Kubernetes, add the required LDAP configuration to cluster.yaml.template during the management cluster bootstrap.

Note

The example below defines the recommended non-anonymous authentication type. If you require anonymous authentication, replace the following parameters with authType: "none":

authType: "simple"
bindCredential: ""
bindDn: ""

To configure LDAP for IAM:

  1. Open templates/bm/cluster.yaml.template.

  2. Configure the keycloak:userFederation:providers: and keycloak:userFederation:mappers: sections as required:

    spec:
      providerSpec:
        value:
          kaas:
            management:
              helmReleases:
              - name: iam
                values:
                  keycloak:
                    userFederation:
                      providers:
                        - displayName: "<LDAP_NAME>"
                          providerName: "ldap"
                          priority: 1
                          fullSyncPeriod: -1
                          changedSyncPeriod: -1
                          config:
                            pagination: "true"
                            debug: "false"
                            searchScope: "1"
                            connectionPooling: "true"
                            usersDn: "<DN>" # "ou=People, o=<ORGANIZATION>, dc=<DOMAIN_COMPONENT>"
                            userObjectClasses: "inetOrgPerson,organizationalPerson"
                            usernameLDAPAttribute: "uid"
                            rdnLDAPAttribute: "uid"
                            vendor: "ad"
                            editMode: "READ_ONLY"
                            uuidLDAPAttribute: "uid"
                            connectionUrl: "ldap://<LDAP_DNS>"
                            syncRegistrations: "false"
                            authType: "simple"
                            bindCredential: ""
                            bindDn: ""
                      mappers:
                        - name: "username"
                          federationMapperType: "user-attribute-ldap-mapper"
                          federationProviderDisplayName: "<LDAP_NAME>"
                          config:
                            ldap.attribute: "uid"
                            user.model.attribute: "username"
                            is.mandatory.in.ldap: "true"
                            read.only: "true"
                            always.read.value.from.ldap: "false"
                        - name: "full name"
                          federationMapperType: "full-name-ldap-mapper"
                          federationProviderDisplayName: "<LDAP_NAME>"
                          config:
                            ldap.full.name.attribute: "cn"
                            read.only: "true"
                            write.only: "false"
                        - name: "last name"
                          federationMapperType: "user-attribute-ldap-mapper"
                          federationProviderDisplayName: "<LDAP_NAME>"
                          config:
                            ldap.attribute: "sn"
                            user.model.attribute: "lastName"
                            is.mandatory.in.ldap: "true"
                            read.only: "true"
                            always.read.value.from.ldap: "true"
                        - name: "email"
                          federationMapperType: "user-attribute-ldap-mapper"
                          federationProviderDisplayName: "<LDAP_NAME>"
                          config:
                            ldap.attribute: "mail"
                            user.model.attribute: "email"
                            is.mandatory.in.ldap: "false"
                            read.only: "true"
                            always.read.value.from.ldap: "true"
    
    • Verify that the userFederation section is located on the same level as the initUsers section.

    • Verify that all attributes set in the mappers section are defined for users in the specified LDAP system. Missing attributes may cause authorization issues.
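
In the example above, every mapper refers to the provider through the same <LDAP_NAME> placeholder. The following Python sketch checks that each mapper points to an existing provider in a userFederation structure loaded as a dictionary. It assumes that federationProviderDisplayName must equal the displayName of a provider, which is inferred from the example rather than stated in this guide. The function name and sample names are hypothetical.

```python
def unmatched_mappers(user_federation):
    """Return names of mappers that reference an undefined provider."""
    provider_names = {
        provider["displayName"]
        for provider in user_federation.get("providers", [])
    }
    return [
        mapper["name"]
        for mapper in user_federation.get("mappers", [])
        if mapper.get("federationProviderDisplayName") not in provider_names
    ]


# Hypothetical, shortened configuration with one misspelled reference.
user_federation = {
    "providers": [{"displayName": "corp-ldap", "providerName": "ldap"}],
    "mappers": [
        {"name": "username", "federationProviderDisplayName": "corp-ldap"},
        {"name": "email", "federationProviderDisplayName": "corp-ldpa"},
    ],
}
print(unmatched_mappers(user_federation))
```
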

Now, return to the bootstrap instruction for your management cluster.

Configure Google OAuth IdP for IAM

Caution

The instruction below applies to the DNS-based management clusters. If you bootstrap a non-DNS-based management cluster, configure Google OAuth IdP for Keycloak after bootstrap using the official Keycloak documentation.

If you integrate the Google OAuth external identity provider for IAM with Mirantis OpenStack for Kubernetes, create the authorization credentials for IAM in your Google OAuth account and configure cluster.yaml.template during the bootstrap of the management cluster.

To configure Google OAuth IdP for IAM:

  1. Create Google OAuth credentials for IAM:

    1. Log in to https://console.developers.google.com with your Google account.

    2. Navigate to Credentials.

    3. In the APIs Credentials menu, select OAuth client ID.

    4. In the window that opens:

      1. In the Application type menu, select Web application.

      2. In the Authorized redirect URIs field, type in <keycloak-url>/auth/realms/iam/broker/google/endpoint, where <keycloak-url> is the corresponding DNS address.

      3. Press Enter to add the URI.

      4. Click Create.

      A page with your client ID and client secret opens. Save these credentials for further usage.

  2. Log in to the bootstrap node.

  3. Open templates/bm/cluster.yaml.template.

  4. In the keycloak:externalIdP: section, add the following snippet with your credentials created in previous steps:

    keycloak:
      externalIdP:
        google:
          enabled: true
          config:
            clientId: <Google_OAuth_client_ID>
            clientSecret: <Google_OAuth_client_secret>
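
The redirect URI from step 1 is built from the Keycloak address and a fixed path. The following Python sketch composes it and avoids a doubled slash when the address ends with a trailing slash. The function name and the sample host name are hypothetical.

```python
REDIRECT_PATH = "/auth/realms/iam/broker/google/endpoint"


def google_redirect_uri(keycloak_url):
    """Compose the Authorized redirect URI for the Google OAuth client."""
    return keycloak_url.rstrip("/") + REDIRECT_PATH


# keycloak.example.com is a placeholder for the DNS address of Keycloak.
print(google_redirect_uri("https://keycloak.example.com/"))
```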
    

Now, return to the bootstrap instruction for your management cluster.

Create a managed cluster

After bootstrapping your baremetal-based Mirantis Container Cloud management cluster, you can create a baremetal-based managed cluster to deploy Mirantis OpenStack for Kubernetes using the Container Cloud API.

Create a project for MOSK clusters

Note

The procedure below applies only to the Container Cloud web UI users with the m:kaas@global-admin or m:kaas@writer access role assigned by the infrastructure operator.

The default project (Kubernetes namespace) is dedicated for management clusters only. MOSK clusters require a separate project. You can create as many projects as required by your company infrastructure.

To create a project for MOSK clusters:

  1. Log in to the Container Cloud web UI as m:kaas@global-admin or m:kaas@writer.

  2. In the Projects tab, click Create.

  3. Type the new project name.

  4. Click Create.

Note

Due to the known issue 50168, access to the newly created project becomes available in five minutes after project creation.

Add a bare metal host

Before creating a bare metal managed cluster, add the required number of bare metal hosts either using the Container Cloud web UI for a default configuration or using CLI for an advanced configuration.

Add a bare metal host using web UI

This section describes how to add bare metal hosts using the Container Cloud web UI during a MOSK cluster creation.

Before you proceed with adding a bare metal host:

To add a bare metal host to a MOSK cluster:

  1. Optional. Create a custom bare metal host profile depending on your needs as described in Create a custom bare metal host profile.

    Note

    You can view the created profiles in the BM Host Profiles tab of the Container Cloud web UI.

  2. Log in to the Container Cloud web UI with the m:kaas@operator or m:kaas:namespace@bm-pool-operator permissions.

  3. Switch to the required non-default project using the Switch Project action icon located on top of the main left-side navigation panel.

    Caution

    Do not create a MOSK cluster in the default project (Kubernetes namespace), which is dedicated for the management cluster only. If no projects are defined, first create a new mosk project as described in Create a project for MOSK clusters.

  4. Optional. Available since Container Cloud 2.24.0 (Cluster releases 15.0.1 and 14.0.1). In the Credentials tab, click Add Credential and add the IPMI user name and password of the bare metal host to access the Baseboard Management Controller (BMC).

  5. Select one of the following options:

    1. In the Baremetal tab, click Create Host.

    2. Fill out the Create baremetal host form as required:

      • Name

        Specify the name of the new bare metal host.

      • Boot Mode

        Specify the BIOS boot mode. Available options: Legacy, UEFI, or UEFISecureBoot.

      • MAC Address

        Specify the MAC address of the PXE network interface.

      • Baseboard Management Controller (BMC)

        Specify the following BMC details:

        • IP Address

          Specify the IP address to access the BMC.

        • Credential Name

          Specify the name of the previously added bare metal host credentials to associate with the current host.

        • Cert Validation

          Enable validation of the BMC API certificate. Applies only to the redfish+http BMC protocol. Disabled by default.

        • Power off host after creation

          Experimental. Select to power off the bare metal host after creation.

          Caution

          This option is experimental and intended only for testing and evaluation purposes. Do not use it for production deployments.

    1. In the Baremetal tab, click Add BM host.

    2. Fill out the Add new BM host form as required:

      • Baremetal host name

        Specify the name of the new bare metal host.

      • Provider Credential

        Optional. Available since Container Cloud 2.24.0 (Cluster releases 15.0.1 and 14.0.1). Specify the name of the previously added bare metal host credentials to associate with the current host.

      • Add New Credential

        Optional. Available since Container Cloud 2.24.0 (Cluster releases 15.0.1 and 14.0.1). Applies if you did not add bare metal host credentials using the Credentials tab. Add the bare metal host credentials:

        • Username

          Specify the name of the IPMI user to access the BMC.

        • Password

          Specify the IPMI password of the user to access the BMC.

      • Boot MAC address

        Specify the MAC address of the PXE network interface.

      • IP Address

        Specify the IP address to access the BMC.

      • Label

        Assign the machine label to the new host that defines which type of machine may be deployed on this bare metal host. Only one label can be assigned to a host. The supported labels include:

        • Manager

          This label is selected and set by default. Assign this label to the bare metal hosts that can be used to deploy machines with the manager type. These hosts must match the CPU and RAM requirements described in MOSK cluster hardware requirements.

        • Worker

          The host with this label may be used to deploy the worker machine type. Assign this label to the bare metal hosts that have sufficient CPU and RAM resources, as described in MOSK cluster hardware requirements.

        • Storage

          Assign this label to the bare metal hosts that have sufficient storage devices to match MOSK cluster hardware requirements. Hosts with this label will be used to deploy machines with the storage type that run Ceph OSDs.

  6. Click Create.

    While adding the bare metal host, Container Cloud discovers and inspects the hardware of the bare metal host and adds it to BareMetalHost.status for future reference.

    During provisioning, baremetal-operator inspects the bare metal host and moves it to the Preparing state. The host becomes ready to be linked to a bare metal machine.

  7. Verify the results of the hardware inspection to avoid unexpected errors during the host usage:

    1. Select one of the following options:

      In the left sidebar, click Baremetal. The Hosts page opens.

      In the left sidebar, click BM Hosts.

    2. Verify that the bare metal host is registered and switched to one of the following statuses:

      • Preparing for a newly added host

      • Ready for a previously used host or for a host that is already linked to a machine

    3. Select one of the following options:

      On the Hosts page, click the host kebab menu and select Host info.

      On the BM Hosts page, click the name of the newly added bare metal host.

    4. In the window with the host details, scroll down to the Hardware section.

    5. Review the section and make sure that the number and models of disks, network interface cards, and CPUs match the hardware specification of the server.

      • If the hardware details are consistent with the physical server specifications for all your hosts, proceed to Create a MOSK cluster.

      • If you find any discrepancies in the hardware inspection results, it might indicate that the server has hardware issues or is not compatible with Container Cloud.

Add a bare metal host using CLI

This section describes how to add bare metal hosts using the Container Cloud CLI during a managed cluster creation.

To add a bare metal host:

  1. Create a project for MOSK clusters.

  2. Configure BIOS on a bare metal host.

  3. Log in to the host where your management cluster kubeconfig is located and where kubectl is installed.

  4. Describe the unique credentials of the new bare metal host:

    Create a YAML file that describes the unique credentials of the new bare metal host as a BareMetalHostCredential object.

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: BareMetalHostCredential
    metadata:
      labels:
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: <bare-metal-host-credential-unique-name>
      namespace: <managed-cluster-project-name>
    spec:
      username: <ipmi-user-name>
      password:
        value: <ipmi-user-password>
    
    • In the metadata section, add a unique credentials name and the name of the non-default project (namespace) dedicated for the managed cluster being created.

    • In the spec section, add the IPMI user name and password in plain text to access the Baseboard Management Controller (BMC). The password will not be stored in the BareMetalHostCredential object but will be erased and saved in an underlying Secret object.

    Caution

    Each bare metal host must have a unique BareMetalHostCredential. For details about the BareMetalHostCredential object, refer to BareMetalHostCredential resource.

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

    Create a secret YAML file that describes the unique credentials of the new bare metal host. Example of the bare metal host secret:

    apiVersion: v1
    data:
      password: <credentials-password>
      username: <credentials-user-name>
    kind: Secret
    metadata:
      labels:
        kaas.mirantis.com/credentials: "true"
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: <credentials-name>
      namespace: <managed-cluster-project-name>
    type: Opaque
    
    • In the data section, add the IPMI user name and password in the base64 encoding to access the BMC. To obtain the base64-encoded credentials, you can use the following command in your Linux console:

      echo -n <username|password> | base64
      

      Caution

      Each bare metal host must have a unique Secret.

    • In the metadata section, add the unique name of credentials and the name of the non-default project (namespace) dedicated for the managed cluster being created.

  5. Apply this secret YAML file to your deployment:

    Warning

    The kubectl apply command automatically saves the applied data as plain text into the kubectl.kubernetes.io/last-applied-configuration annotation of the corresponding object. This may result in revealing sensitive data in this annotation when creating or modifying the object.

    Therefore, do not use kubectl apply on this object. Use kubectl create, kubectl patch, or kubectl edit instead.

    If you used kubectl apply on this object, you can remove the kubectl.kubernetes.io/last-applied-configuration annotation from the object using kubectl edit.

    kubectl create -n <managedClusterProjectName> -f <bmh-cred-file-name>.yaml
    
  6. Create a YAML file that contains a description of the new bare metal host:

    m:kaas@management-admin only. This limitation is lifted once the management cluster is updated to the Cluster release 16.4.1 or later.

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: BareMetalHostInventory
    metadata:
      annotations:
        inspect.metal3.io/hardwaredetails-storage-sort-term: hctl ASC, wwn ASC, by_id ASC, name ASC
      labels:
        kaas.mirantis.com/baremetalhost-id: <unique-bare-metal-host-hardware-node-id>
        kaas.mirantis.com/provider: baremetal
      name: <bare-metal-host-unique-name>
      namespace: <managed-cluster-project-name>
    spec:
      bmc:
        address: <ip-address-for-bmc-access>
        bmhCredentialsName: <bare-metal-host-credential-unique-name>
      bootMACAddress: <bare-metal-host-boot-mac-address>
      online: true
    

    Note

    If you have a limited amount of free and unused IP addresses for server provisioning, you can add the baremetalhost.metal3.io/detached annotation that pauses automatic host management to manually allocate an IP address for the host. For details, see Manually allocate IP addresses for bare metal hosts.

    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHost
    metadata:
      annotations:
        kaas.mirantis.com/baremetalhost-credentials-name: <bare-metal-host-credential-unique-name>
      labels:
        kaas.mirantis.com/baremetalhost-id: <unique-bare-metal-host-hardware-node-id>
        hostlabel.bm.kaas.mirantis.com/worker: "true"
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: <bare-metal-host-unique-name>
      namespace: <managed-cluster-project-name>
    spec:
      bmc:
        address: <ip-address-for-bmc-access>
        credentialsName: ''
      bootMACAddress: <bare-metal-host-boot-mac-address>
      online: true
    

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

    Note

    If you have a limited amount of free and unused IP addresses for server provisioning, you can add the baremetalhost.metal3.io/detached annotation that pauses automatic host management to manually allocate an IP address for the host. For details, see Manually allocate IP addresses for bare metal hosts.

    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHost
    metadata:
      labels:
        kaas.mirantis.com/baremetalhost-id: <unique-bare-metal-host-hardware-node-id>
        hostlabel.bm.kaas.mirantis.com/worker: "true"
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: <bare-metal-host-unique-name>
      namespace: <managed-cluster-project-name>
    spec:
      bmc:
        address: <ip-address-for-bmc-access>
        credentialsName: <credentials-name>
      bootMACAddress: <bare-metal-host-boot-mac-address>
      online: true
    

    For a detailed description of the fields, see API Reference: BareMetalHostInventory resource and BareMetalHost resource.

  7. Apply this configuration YAML file to your deployment:

    kubectl create -n <managedClusterProjectName> -f <bare-metal-host-config-file-name>.yaml
    

    During provisioning, baremetal-operator inspects the bare metal host and moves it to the Preparing state. The host becomes ready to be linked to a bare metal machine.

    Caution

    If you need to change or add DHCP subnets to bootstrap new nodes, wait until the dnsmasq pod becomes ready after the change, and only then create the bare metal host objects as described above.

    For details about the related issue, refer to Inspection error on bare metal hosts after dnsmasq restart.

  8. Verify the new bare metal host object status:

    kubectl -n <managed-cluster-project-name> get bmh -o wide <bare-metal-host-unique-name>
    

    Example of system response:

    NAMESPACE    NAME   STATUS   STATE      CONSUMER  BMC                        BOOTMODE  ONLINE  ERROR  REGION
    my-project   bmh1   OK       preparing            ip-address-for-bmc-access  legacy    true           region-one
    

    During provisioning, the status changes as follows:

    1. registering

    2. inspecting

    3. preparing

  9. After the bare metal host object switches to the preparing stage, the inspecting phase finishes and you can verify that hardware information is available in the object status and matches the MOSK cluster hardware requirements. For example:

    • Verify the status of hardware NICs:

      kubectl -n <managed-cluster-project-name> get bmh <bare-metal-host-unique-name> -o json |  jq -r '[.status.hardware.nics]'
      

      Example of system response:

      [
        [
          {
            "ip": "172.18.171.32",
            "mac": "ac:1f:6b:02:81:1a",
            "model": "0x8086 0x1521",
            "name": "eno1",
            "pxe": true
          },
          {
            "ip": "fe80::225:90ff:fe33:d5ac%ens1f0",
            "mac": "00:25:90:33:d5:ac",
            "model": "0x8086 0x10fb",
            "name": "ens1f0"
          },
       ...
      
    • Verify the status of RAM:

      kubectl -n <managed-cluster-project-name> get bmh <bare-metal-host-unique-name> -o json |  jq -r '[.status.hardware.ramMebibytes]'
      

      Example of system response:

      [
        98304
      ]
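
The Secret-based variant in step 4 above expects base64-encoded values in the data section. The following is a minimal sketch of one way to produce them, not a required tool. It uses printf instead of echo -n because the behavior of echo -n is not the same in every shell. The b64_encode helper and the sample values are hypothetical and serve only as an illustration.

```shell
# Encode a single value for the data section of a Secret.
# printf '%s' writes the value without a trailing newline;
# tr removes line wraps that some base64 implementations add to long output.
b64_encode() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Hypothetical sample credentials, for illustration only.
b64_encode 'admin'; echo
b64_encode 'p@ssw0rd'; echo
```

Paste each printed string into the corresponding username or password field of the Secret. A value encoded together with a trailing newline decodes to a different string than the one you intended to store.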
      

Now, proceed with Create a custom bare metal host profile.

Create a custom bare metal host profile

The bare metal host profile is a Kubernetes custom resource. It enables the operator to define how the storage devices and the operating system are provisioned and configured.

This section describes the bare metal host profile default settings and configuration of custom profiles for managed clusters using Container Cloud API. The section also applies to a management cluster with a few differences described in Customize the default bare metal host profile.

Default configuration of the host system storage

The default host profile requires three storage devices in the following strict order:

  1. Boot device and operating system storage

    This device contains boot data and operating system data. It is partitioned using the GUID Partition Table (GPT) labels. The root file system is an ext4 file system created on top of an LVM logical volume. For a detailed layout, refer to the table below.

  2. Local volumes device

    This device contains an ext4 file system with directories mounted as persistent volumes to Kubernetes. These volumes are used by the Mirantis Container Cloud services to store their data, including monitoring and identity databases.

  3. Ceph storage device

    This device is used as a Ceph datastore or Ceph OSD on managed clusters.

The following table summarizes the default configuration of the host system storage set up by the Container Cloud bare metal management.

Default configuration of the bare metal host storage

Device/partition | Name/Mount point | Recommended size | Description
/dev/sda1 | bios_grub | 4 MiB | The mandatory GRUB boot partition required for non-UEFI systems.
/dev/sda2 | UEFI -> /boot/efi | 0.2 GiB | The boot partition required for the UEFI boot mode.
/dev/sda3 | config-2 | 64 MiB | The mandatory partition for the cloud-init configuration. Used during the first host boot for initial configuration.
/dev/sda4 | lvm_root_part | 100% of the remaining free space in the LVM volume group | The main LVM physical volume that is used to create the root file system.
/dev/sdb | lvm_lvp_part -> /mnt/local-volumes | 100% of the remaining free space in the LVM volume group | The LVM physical volume that is used to create the file system for LocalVolumeProvisioner.
/dev/sdc | - | 100% of the remaining free space in the LVM volume group | Clean raw disk that will be used for the Ceph storage backend on managed clusters.

If required, you can customize the default host storage configuration. For details, see Create MOSK host profiles.

Wipe a device or partition

Available since MCC 2.26.0 (17.1.0 and 16.1.0)

Before deploying a cluster, you may need to erase existing data from hardware devices to be used for deployment. You can either erase an existing partition or remove all existing partitions from a physical device. For this purpose, use the wipeDevice structure that configures cleanup behavior during configuration of a custom bare metal host profile described in Create MOSK host profiles.

The wipeDevice structure contains the following options:

  • eraseMetadata

    Configures metadata cleanup of a device

  • eraseDevice

    Configures a complete cleanup of a device

Erase metadata from a device

When you enable the eraseMetadata option, which is disabled by default, the Ansible provisioner attempts to clean up the existing metadata from the target device. Examples of metadata include:

  • Existing file system

  • Logical Volume Manager (LVM) or Redundant Array of Independent Disks (RAID) configuration

The behavior of metadata erasure varies depending on the target device:

  • If a device is part of other logical devices, for example, a partition, logical volume, or MD RAID volume, such logical device is disassembled and its file system metadata is erased. On the final erasure step, the file system metadata of the target device is erased as well.

  • If a device is a physical disk, then all its nested partitions along with their nested logical devices, if any, are erased and disassembled. On the final erasure step, all partitions and metadata of the target device are removed.

Caution

None of the eraseMetadata actions include overwriting the target device with data patterns. For this purpose, use the eraseDevice option as described in Erase a device.

To enable the eraseMetadata option, use the wipeDevice field in the spec:devices section of the BareMetalHostProfile object. For a detailed description of the option, see BareMetalHostProfile resource.

Erase a device

If you require not only disassembling of existing logical volumes but also removing of all data ever written to the target device, configure the eraseDevice option, which is disabled by default. This option is not applicable to partitions, LVM, or MD RAID logical volumes because such volumes may use caching that prevents a physical device from being erased properly.

Important

The eraseDevice option does not replace the secure erase.

To configure the eraseDevice option, use the wipeDevice field in the spec:devices section of the BareMetalHostProfile object. For a detailed description of the option, see BareMetalHostProfile resource.

Create MOSK host profiles

Different types of MOSK nodes require differently configured host storage. This section describes how to create custom host profiles for different types of MOSK nodes.

You can create custom profiles for managed clusters using Container Cloud API.

Note

The procedure below also applies to management clusters.

You can use flexible size units throughout bare metal host profiles. For example, you can now use either sizeGiB: 0.1 or size: 100Mi when specifying a device size.

Mirantis recommends using only one parameter name type and units throughout the configuration files. If both sizeGiB and size are used, sizeGiB is ignored during deployment and the suffix is adjusted accordingly. For example, 1.5Gi will be serialized as 1536Mi. The size without units is counted in bytes. For example, size: 120 means 120 bytes.
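
To illustrate the unit arithmetic described above, the following sketch converts a size value to mebibytes. It is only an illustration of the conversion, not the logic that the bare metal provider actually runs. The size_to_mi helper is hypothetical and handles only the Gi and Mi suffixes and plain byte values mentioned in this section.

```shell
# Convert a size such as 1.5Gi, 100Mi, or 120 (bytes) to mebibytes.
# awk is used because POSIX shell arithmetic does not support fractions.
size_to_mi() {
  awk -v s="$1" 'BEGIN {
    n = s; sub(/[A-Za-z]+$/, "", n)   # numeric part
    u = s; sub(/^[0-9.]+/, "", u)     # unit suffix, empty for bytes
    if (u == "Gi")      m = n * 1024
    else if (u == "Mi") m = n
    else if (u == "")   m = n / 1048576
    else { print "unsupported unit: " u > "/dev/stderr"; exit 1 }
    printf "%gMi\n", m
  }'
}

size_to_mi 1.5Gi   # prints 1536Mi
size_to_mi 100Mi   # prints 100Mi
```

The first call reproduces the 1.5Gi to 1536Mi example from the paragraph above.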

Warning

All data will be wiped during cluster deployment on devices defined directly or indirectly in the fileSystems list of BareMetalHostProfile. For example:

  • A raw device partition with a file system on it

  • A device partition in a volume group with a logical volume that has a file system on it

  • An mdadm RAID device with a file system on it

  • An LVM RAID device with a file system on it

The wipe field is always considered true for these devices. The false value is ignored.

Therefore, to prevent data loss, move the necessary data from these file systems to another server beforehand, if required.

To create MOSK bare metal host profiles:

  1. Select from the following options:

    • For a management cluster, log in to the bare metal seed node that will be used to bootstrap the management cluster.

    • For a managed cluster, log in to the local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created automatically during the last stage of the management cluster bootstrap.

  2. Select from the following options:

    • For a management cluster, open templates/bm/baremetalhostprofiles.yaml.template for editing.

    • For a managed cluster, create a new bare metal host profile for MOSK compute nodes in a YAML file under the templates/bm/ directory.

  3. Edit the host profile using the example template below to meet your hardware configuration requirements:

    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHostProfile
    metadata:
      name: <PROFILE_NAME>
      namespace: <PROJECT_NAME>
    spec:
      devices:
      # From the HW node, obtain the first device whose size is at least 60 GiB
      - device:
          workBy: "by_id,by_wwn,by_path,by_name"
          minSize: 60Gi
          type: ssd
          wipe: true
        partitions:
        - name: bios_grub
          partflags:
          - bios_grub
          size: 4Mi
          wipe: true
        - name: uefi
          partflags:
          - esp
          size: 200Mi
          wipe: true
        - name: config-2
          size: 64Mi
          wipe: true
        # This partition is only required on compute nodes if you plan to
        # use LVM ephemeral storage.
        - name: lvm_nova_part
          wipe: true
          size: 100Gi
        - name: lvm_root_part
          size: 0
          wipe: true
      # From the HW node, obtain the second device whose size is at least 60 GiB
      # If a device exists but does not fit the size,
      # the BareMetalHostProfile will not be applied to the node
      - device:
          workBy: "by_id,by_wwn,by_path,by_name"
          minSize: 60Gi
          type: ssd
          wipe: true
      # From the HW node, obtain the disk device with the exact name
      - device:
          workBy: "by_id,by_wwn,by_path,by_name"
          minSize: 60Gi
          wipe: true
        partitions:
        - name: lvm_lvp_part
          size: 0
          wipe: true
      # Example of wiping a device without partitioning it.
      # Mandatory if the disk will later be used for the Ceph backend.
      - device:
          workBy: "by_id,by_wwn,by_path,by_name"
          wipe: true
      fileSystems:
      - fileSystem: vfat
        partition: config-2
      - fileSystem: vfat
        mountPoint: /boot/efi
        partition: uefi
      - fileSystem: ext4
        logicalVolume: root
        mountPoint: /
      - fileSystem: ext4
        logicalVolume: lvp
        mountPoint: /mnt/local-volumes/
      logicalVolumes:
      - name: root
        size: 0
        vg: lvm_root
      - name: lvp
        size: 0
        vg: lvm_lvp
      postDeployScript: |
        #!/bin/bash -ex
        echo $(date) 'post_deploy_script done' >> /root/post_deploy_done
      preDeployScript: |
        #!/bin/bash -ex
        echo $(date) 'pre_deploy_script done' >> /root/pre_deploy_done
      volumeGroups:
      - devices:
        - partition: lvm_root_part
        name: lvm_root
      - devices:
        - partition: lvm_lvp_part
        name: lvm_lvp
      grubConfig:
        defaultGrubOptions:
        - GRUB_DISABLE_RECOVERY="true"
        - GRUB_PRELOAD_MODULES=lvm
        - GRUB_TIMEOUT=20
      kernelParameters:
        sysctl:
        # For the list of options prohibited to change, refer to
        # https://docs.mirantis.com/mke/3.7/install/predeployment/set-up-kernel-default-protections.html
          kernel.dmesg_restrict: "1"
          kernel.core_uses_pid: "1"
          fs.file-max: "9223372036854775807"
          fs.aio-max-nr: "1048576"
          fs.inotify.max_user_instances: "4096"
          vm.max_map_count: "262144"
    
  4. Add or edit the mandatory parameters in the new BareMetalHostProfile object. For a description of the parameters, see BareMetalHostProfile spec.

    Note

    If asymmetric traffic is expected on some of the managed cluster nodes, enable the loose mode for the corresponding interfaces on those nodes by setting the net.ipv4.conf.<interface-name>.rp_filter parameter to "2" in the kernelParameters.sysctl section. For example:

    kernelParameters:
      sysctl:
        net.ipv4.conf.k8s-lcm.rp_filter: "2"
    
  5. Configure required disks for the Ceph cluster as described in Configure Ceph disks in a host profile.

  6. Optional. Configure wiping of the target device or partition to be used for cluster deployment as described in Wipe a device or partition.

  7. Optional. Configure multiple devices for LVM volume using the example template extract below for reference.

    Caution

    The following template extract contains only sections relevant to LVM configuration with multiple PVs. Expand the main template described in the previous step with the configuration below if required.

    spec:
      devices:
        ...
        - device:
          ...
          partitions:
            - name: lvm_lvp_part1
              size: 0
              wipe: true
        - device:
          ...
          partitions:
            - name: lvm_lvp_part2
              size: 0
              wipe: true
      volumeGroups:
        ...
        - devices:
          - partition: lvm_lvp_part1
          - partition: lvm_lvp_part2
          name: lvm_lvp
      logicalVolumes:
        ...
        - name: root
          size: 0
          vg: lvm_lvp
      fileSystems:
        ...
        - fileSystem: ext4
          logicalVolume: root
          mountPoint: /
    
  8. Optional. Technology Preview. Configure support of the Redundant Array of Independent Disks (RAID), which allows, for example, installing a cluster operating system on a RAID device. For details, refer to Configure RAID support.

  9. Optional. Configure the RX/TX buffer size for physical network interfaces and txqueuelen for any network interfaces.

    This configuration can greatly benefit high-load and high-performance network interfaces. You can configure these parameters using the udev rules. For example:

    postDeployScript: |
      #!/bin/bash -ex
      ...
      echo 'ACTION=="add|change", SUBSYSTEM=="net", KERNEL=="eth*|en*", RUN+="/sbin/ethtool -G $name rx 4096 tx 4096"' > /etc/udev/rules.d/59-net.ring.rules
    
      echo 'ACTION=="add|change", SUBSYSTEM=="net", KERNEL=="eth*|en*|bond*|k8s-*|v*", ATTR{tx_queue_len}="10000"' > /etc/udev/rules.d/58-net.txqueue.rules
    
  10. Select from the following options:

    • For a management cluster, proceed with the cluster bootstrap procedure as described in Deploy a management cluster.

    • For a managed cluster, select from the following options:

      1. Log in to the Container Cloud web UI with the operator permissions.

      2. Switch to the required non-default project using the Switch Project action icon located on top of the main left-side navigation panel.

        Caution

        Do not create a MOSK cluster in the default project (Kubernetes namespace), which is dedicated for the management cluster only. If no projects are defined, first create a new mosk project as described in Create a project for MOSK clusters.

      3. In the left sidebar, navigate to Baremetal and click the Host Profiles tab.

      4. Click Create Host Profile.

      5. Fill out the Create host profile form:

        • Name

          Name of the bare metal host profile.

        • Specification

          BareMetalHostProfile object specification in the YAML format that you have previously created. Click Edit to edit the BareMetalHostProfile object if required.

          Note

          Before Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0), the field name is YAML file, and you can upload the required YAML file instead of inserting and editing it.

        • Labels

          Available since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0). Key-value pairs attached to BareMetalHostProfile.

      1. Add the bare metal host profile to your management cluster:

        kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <managedClusterProjectName> apply -f <pathToBareMetalHostProfileFile>
        
      2. If required, further modify the host profile:

        kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <managedClusterProjectName> edit baremetalhostprofile <hostProfileName>
        
  11. Repeat the steps above to create host profiles for other OpenStack node roles such as control plane nodes and storage nodes.
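
If several interfaces need the loose mode described in the note of step 4, you can generate the kernelParameters fragment instead of typing it by hand. The following is a minimal sketch under that assumption. The rp_filter_fragment helper is hypothetical, and k8s-ext is a made-up second interface name used next to k8s-lcm from the example above.

```shell
# Print a kernelParameters fragment that sets rp_filter to "2" (loose mode)
# for every interface name passed as an argument.
rp_filter_fragment() {
  echo 'kernelParameters:'
  echo '  sysctl:'
  for iface in "$@"; do
    printf '    net.ipv4.conf.%s.rp_filter: "2"\n' "$iface"
  done
}

# k8s-lcm is taken from the example above; k8s-ext is hypothetical.
rp_filter_fragment k8s-lcm k8s-ext
```

Merge the printed keys into the existing kernelParameters.sysctl section of the profile rather than adding a second kernelParameters key.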

Now, proceed to Enable huge pages in a host profile.

Configure Ceph disks in a host profile

This section describes how to configure devices for the Ceph cluster in the BareMetalHostProfile object of a managed cluster.

To configure disks for a Ceph cluster:

  1. Open the BareMetalHostProfile object of a managed cluster for editing.

  2. In the spec.devices section, add each disk intended for use as a Ceph OSD data device with size: 0 and wipe: true.

    Example configuration for sde - sdh disks to use as Ceph OSDs:

    spec:
      devices:
      ...
      - device:
          byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:1
          size: 0
          wipe: true
      - device:
          byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:2
          size: 0
          wipe: true
      - device:
          byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:3
          size: 0
          wipe: true
      - device:
          byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:4
          size: 0
          wipe: true
    
  3. Since MOSK 23.2, if you plan to use a separate metadata device for Ceph OSD, configure the spec.devices section as described below.

    Important

    Mirantis highly recommends configuring disk partitions for Ceph OSD metadata using BareMetalHostProfile.

    Configuration of a separate metadata device for Ceph OSD
    1. Add the device to spec.devices with a single partition that will use the entire disk size.

      For example, if you plan to use four Ceph OSDs with a separate metadata device for each Ceph OSD, configure the spec.devices section as follows:

      spec:
        devices:
        ...
        - device:
            byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:5
            wipe: true
          partitions:
          - name: ceph_meta
            size: 0
            wipe: true
      
    2. Create a volume group on top of the defined partition and create the required number of logical volumes (LVs) on top of the created volume group (VG). Add one logical volume per one Ceph OSD on the node.

      Example snippet of an LVM configuration for a Ceph metadata disk:

      spec:
        ...
        volumeGroups:
        ...
        - devices:
          - partition: ceph_meta
          name: bluedb
        logicalVolumes:
        ...
        - name: meta_1
          size: 25%VG
          vg: bluedb
        - name: meta_2
          size: 25%VG
          vg: bluedb
        - name: meta_3
          size: 25%VG
          vg: bluedb
        - name: meta_4
          size: 25%VG
          vg: bluedb
      

      Important

      Plan LVs of a separate metadata device thoroughly. Any logical volume misconfiguration causes redeployment of all Ceph OSDs that use this disk as metadata devices.

      Note

      General Ceph recommendation is to size a metadata device between 1% and 4% of the Ceph OSD data size. Mirantis highly recommends at least 4% of the Ceph OSD data size.

      If you plan to use a disk as a separate metadata device for 10 Ceph OSDs, size the LV for each Ceph OSD between 1% and 4% of the corresponding Ceph OSD data size. If RADOS Gateway is enabled, the metadata size must be at least 4% of the data size. For details, see Ceph documentation: Bluestore config reference.

      For example, if the total data size of 10 Ceph OSDs equals 1 TB with 100 GB each, assign a metadata disk of at least 10 GB with 1 GB per each LV. The recommended size is 40 GB with 4 GB per each LV.

      After applying BareMetalHostProfile, the bare metal provider creates an LVM partitioning for the metadata disk and places these volumes as /dev paths, for example, /dev/bluedb/meta_1 or /dev/bluedb/meta_3.

      Example template of a host profile configuration for Ceph

      spec:
        ...
        devices:
        ...
        - device:
            byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:1
            wipe: true
        - device:
            byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:2
            wipe: true
        - device:
            byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:3
            wipe: true
        - device:
            byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:4
            wipe: true
        - device:
            byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:5
            wipe: true
          partitions:
          - name: ceph_meta
            size: 0
            wipe: true
        volumeGroups:
        ...
        - devices:
          - partition: ceph_meta
          name: bluedb
        logicalVolumes:
        ...
        - name: meta_1
          size: 25%VG
          vg: bluedb
        - name: meta_2
          size: 25%VG
          vg: bluedb
        - name: meta_3
          size: 25%VG
          vg: bluedb
        - name: meta_4
          size: 25%VG
          vg: bluedb
      

    After applying such BareMetalHostProfile to a node, the nodes spec of the KaaSCephCluster object contains the following storageDevices section:

    spec:
      cephClusterSpec:
        ...
        nodes:
          ...
          machine-1:
            ...
            storageDevices:
            - fullPath: /dev/disk/by-id/scsi-SATA_ST4000NM002A-2HZ_WS20NNKC
              config:
                metadataDevice: /dev/bluedb/meta_1
            - fullPath: /dev/disk/by-id/ata-ST4000NM002A-2HZ101_WS20NEGE
              config:
                metadataDevice: /dev/bluedb/meta_2
            - fullPath: /dev/disk/by-id/scsi-0ATA_ST4000NM002A-2HZ_WS20LEL3
              config:
                metadataDevice: /dev/bluedb/meta_3
            - fullPath: /dev/disk/by-id/ata-HGST_HUS724040ALA640_PN1334PEDN9SSU
              config:
                metadataDevice: /dev/bluedb/meta_4

    The same section with devices referenced by name instead of fullPath:
    
    spec:
      cephClusterSpec:
        ...
        nodes:
          ...
          machine-1:
            ...
            storageDevices:
            - name: sde
              config:
                metadataDevice: /dev/bluedb/meta_1
            - name: sdf
              config:
                metadataDevice: /dev/bluedb/meta_2
            - name: sdg
              config:
                metadataDevice: /dev/bluedb/meta_3
            - name: sdh
              config:
                metadataDevice: /dev/bluedb/meta_4
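
The metadata sizing guidance from the note in this procedure can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the product: the function name metadata_lv_sizes_gb is hypothetical, and the only inputs are the 1% and 4% figures and the 10 x 100 GB example quoted above.

```python
def metadata_lv_sizes_gb(osd_data_gb, osd_count, percent):
    """Return (size of one metadata LV, total metadata disk size) in GB."""
    per_lv = osd_data_gb * percent / 100
    return per_lv, per_lv * osd_count


# 10 Ceph OSDs with 100 GB of data each, as in the example above
minimum = metadata_lv_sizes_gb(100, 10, 1)      # general Ceph lower bound
recommended = metadata_lv_sizes_gb(100, 10, 4)  # Mirantis recommendation

print(minimum)      # (1.0, 10.0)
print(recommended)  # (4.0, 40.0)
```

The first tuple corresponds to the minimum of 1 GB per LV on a 10 GB metadata disk, the second to the recommended 4 GB per LV on a 40 GB disk.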
    
Enable huge pages in a host profile

The BareMetalHostProfile API allows configuring a host to use the huge pages feature of the Linux kernel on managed clusters. The procedure included in this section applies to both new and existing cluster deployments.

Note

Huge pages is a mode of operation of the Linux kernel. With huge pages enabled, the kernel allocates the RAM in bigger chunks, or pages. This allows KVM (Kernel-based Virtual Machine) and the virtual machines running on it to use the host RAM more efficiently, which improves the performance of the virtual machines.

To enable huge pages in a custom bare metal host profile for a managed cluster:

  1. Log in to the local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created automatically during the last stage of the management cluster bootstrap.

  2. Open for editing or create a new bare metal host profile under the templates/bm/ directory.

  3. Edit the grubConfig section of the host profile spec using the example below to configure the kernel boot parameters and enable huge pages:

    spec:
      grubConfig:
        defaultGrubOptions:
        - GRUB_DISABLE_RECOVERY="true"
        - GRUB_PRELOAD_MODULES=lvm
        - GRUB_TIMEOUT=20
        - GRUB_CMDLINE_LINUX_DEFAULT="hugepagesz=1G hugepages=N"
    

    The example configuration above allocates N huge pages of 1 GB each at server boot. The last hugepagesz parameter value is the default unless default_hugepagesz is defined. For details about possible values, see official Linux kernel documentation.

  4. Add the bare metal host profile to your management cluster:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <projectName> apply -f <pathToBareMetalHostProfileFile>
    
  5. If required, further modify the host profile:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <projectName> edit baremetalhostprofile <hostProfileName>
    
  6. Proceed to Create a MOSK cluster.
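
To choose the N value for the grubConfig entry in the procedure above, you can derive it from the amount of RAM to reserve. The following is a minimal sketch under stated assumptions: the helper name hugepages_grub_option is hypothetical, and only the 2M and 1G page sizes common on x86_64 are handled.

```python
def hugepages_grub_option(reserve_gib, page_size="1G"):
    """Build the GRUB option that reserves `reserve_gib` GiB as huge pages."""
    page_mib = {"2M": 2, "1G": 1024}[page_size]  # assumed set of page sizes
    pages = (reserve_gib * 1024) // page_mib
    return f'GRUB_CMDLINE_LINUX_DEFAULT="hugepagesz={page_size} hugepages={pages}"'


print(hugepages_grub_option(64))
# GRUB_CMDLINE_LINUX_DEFAULT="hugepagesz=1G hugepages=64"
```

The resulting string matches the format of the last defaultGrubOptions item in the example profile.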

Configure RAID support

TechPreview

During creation of a management or MOSK cluster, you can configure support for a software-based Redundant Array of Independent Disks (RAID) using BareMetalHostProfile to set up an LVM-based RAID level 1 (raid1) or an mdadm-based RAID level 0, 1, or 10 (raid0, raid1, or raid10).

If required, you can further configure RAID in the same profile, for example, to install a cluster operating system onto a RAID device.

Caution

  • RAID configuration on already provisioned bare metal machines or on an existing cluster is not supported.

    To start using any kind of RAID, reprovision the machines with a new BareMetalHostProfile.

  • Mirantis supports the raid1 type of RAID devices both for LVM and mdadm.

  • Mirantis supports the raid0 type for the mdadm RAID to be on par with the LVM linear type.

  • Mirantis recommends having at least two physical disks for raid0 and raid1 devices to prevent unnecessary complexity.

  • Mirantis supports the raid10 type for mdadm RAID. At least four physical disks are required for this type of RAID.

  • Only an even number of disks can be used for a raid1 or raid10 device.
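
The disk-count rules from the list above can be expressed as a small pre-flight check. The following is a minimal sketch: the names MIN_DISKS, EVEN_ONLY, and check_raid are hypothetical and are not part of the BareMetalHostProfile API.

```python
MIN_DISKS = {"raid0": 2, "raid1": 2, "raid10": 4}
EVEN_ONLY = {"raid1", "raid10"}


def check_raid(level, disk_count):
    """Return a list of problems for the given RAID level and member count."""
    if level not in MIN_DISKS:
        return [f"unsupported level: {level}"]
    problems = []
    if disk_count < MIN_DISKS[level]:
        problems.append(f"{level} needs at least {MIN_DISKS[level]} disks")
    if level in EVEN_ONLY and disk_count % 2:
        problems.append(f"{level} needs an even number of disks")
    return problems


print(check_raid("raid1", 2))   # []
print(check_raid("raid10", 3))
# ['raid10 needs at least 4 disks', 'raid10 needs an even number of disks']
```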

Create an LVM software RAID (raid1)

TechPreview

Warning

The EFI system partition partflags: ['esp'] must be a physical partition in the main partition table of the disk, not under LVM or mdadm software RAID.

During configuration of your custom bare metal host profile, you can create an LVM-based software RAID device raid1 by adding type: raid1 to the logicalVolume spec in BareMetalHostProfile.

Caution

The logicalVolume spec of the raid1 type requires at least two devices (partitions) in volumeGroup where you build a logical volume. For an LVM of the linear type, one device is enough.

You can use flexible size units throughout bare metal host profiles. For example, you can now use either sizeGiB: 0.1 or size: 100Mi when specifying a device size.

Mirantis recommends using only one parameter name type and units throughout the configuration files. If both sizeGiB and size are used, sizeGiB is ignored during deployment and the suffix is adjusted accordingly. For example, 1.5Gi will be serialized as 1536Mi. The size without units is counted in bytes. For example, size: 120 means 120 bytes.
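
The unit handling described above can be illustrated with a short conversion. This is a minimal sketch, assuming binary suffixes (Ki, Mi, Gi, Ti); the helper names size_to_bytes and bytes_to_mi are hypothetical, and the exact set of suffixes accepted by the bare metal provider is not confirmed here.

```python
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def size_to_bytes(size):
    """Convert a size such as '1.5Gi', '100Mi', or 120 to bytes."""
    text = str(size)
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(float(text[:-len(suffix)]) * factor)
    return int(text)  # no suffix: the value is already in bytes


def bytes_to_mi(num_bytes):
    """Render a byte count as whole mebibytes."""
    return f"{num_bytes // 1024**2}Mi"


print(bytes_to_mi(size_to_bytes("1.5Gi")))  # 1536Mi
print(size_to_bytes(120))                   # 120
```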

Note

The LVM raid1 requires additional space to store the raid1 metadata on a volume group, roughly 4 MB for each partition. Therefore, you cannot create a logical volume of exactly the same size as the partitions it works on.

For example, if you have two partitions of 10 GiB, the corresponding raid1 logical volume size will be less than 10 GiB. For that reason, you can either set size: 0 to use all available space on the volume group, or set a smaller size than the partition size. For example, use size: 9.9Gi instead of size: 10Gi for the logical volume.
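
The effect of the raid1 metadata overhead can be estimated as follows. This is a rough sketch that takes the approximately 4 MB per partition figure from the note above at face value; the constant and function names are hypothetical, and actual LVM overhead may differ.

```python
RAID1_METADATA_MIB = 4  # rough per-partition overhead quoted in the note above


def max_raid1_lv_mib(partition_sizes_mib):
    """Estimate the largest raid1 logical volume that fits on the partitions."""
    return min(partition_sizes_mib) - RAID1_METADATA_MIB


limit = max_raid1_lv_mib([10 * 1024, 10 * 1024])  # two 10 GiB partitions
print(limit)                # 10236
print(9.9 * 1024 <= limit)  # True: size: 9.9Gi fits
print(10 * 1024 <= limit)   # False: size: 10Gi does not fit
```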

The following example illustrates an extract of BareMetalHostProfile with / on the LVM raid1.

...
devices:
  - device:
      workBy: "by_id,by_wwn,by_path,by_name"
      minSize: 200Gi
      type: hdd
      wipe: true
    partitions:
      - name: root_part1
        size: 120Gi
      - name: rest_sda
        size: 0
  - device:
      workBy: "by_id,by_wwn,by_path,by_name"
      minSize: 200Gi
      type: hdd
      wipe: true
    partitions:
      - name: root_part2
        size: 120Gi
      - name: rest_sdb
        size: 0
volumeGroups:
  - name: vg-root
    devices:
      - partition: root_part1
      - partition: root_part2
  - name: vg-data
    devices:
      - partition: rest_sda
      - partition: rest_sdb
logicalVolumes:
  - name: root
    type: raid1  ## <-- LVM raid1
    vg: vg-root
    size: 119.9Gi
  - name: data
    type: linear
    vg: vg-data
    size: 0
fileSystems:
  - fileSystem: ext4
    logicalVolume: root
    mountPoint: /
    mountOpts: "noatime,nodiratime"
  - fileSystem: ext4
    logicalVolume: data
    mountPoint: /mnt/data

Warning

All data will be wiped during cluster deployment on devices defined directly or indirectly in the fileSystems list of BareMetalHostProfile. For example:

  • A raw device partition with a file system on it

  • A device partition in a volume group with a logical volume that has a file system on it

  • An mdadm RAID device with a file system on it

  • An LVM RAID device with a file system on it

The wipe field is always considered true for these devices. The false value is ignored.

Therefore, to prevent data loss, move the necessary data from these file systems to another server beforehand, if required.

Create LVM volume groups on top of RAID devices

TechPreview

You can configure an LVM volume group on top of mdadm-based RAID devices as physical volumes using the BareMetalHostProfile resource. List the required RAID devices in a separate field of the volumeGroups definition within the storage configuration of BareMetalHostProfile.

You can use flexible size units throughout bare metal host profiles. For example, you can now use either sizeGiB: 0.1 or size: 100Mi when specifying a device size.

Mirantis recommends using only one parameter name type and units throughout the configuration files. If both sizeGiB and size are used, sizeGiB is ignored during deployment and the suffix is adjusted accordingly. For example, 1.5Gi will be serialized as 1536Mi. The size without units is counted in bytes. For example, size: 120 means 120 bytes.

Warning

All data will be wiped during cluster deployment on devices defined directly or indirectly in the fileSystems list of BareMetalHostProfile. For example:

  • A raw device partition with a file system on it

  • A device partition in a volume group with a logical volume that has a file system on it

  • An mdadm RAID device with a file system on it

  • An LVM RAID device with a file system on it

The wipe field is always considered true for these devices. The false value is ignored.

Therefore, to prevent data loss, move the necessary data from these file systems to another server beforehand, if required.

The following example illustrates an extract of BareMetalHostProfile with a volume group named lvm_nova to be created on top of an mdadm-based RAID device raid1:

...
devices:
  - device:
      workBy: "by_id,by_wwn,by_path,by_name"
      minSize: 60Gi
      type: ssd
      wipe: true
    partitions:
      - name: bios_grub
        partflags:
          - bios_grub
        size: 4Mi
      - name: uefi
        partflags:
          - esp
        size: 200Mi
      - name: config-2
        size: 64Mi
  - device:
      workBy: "by_id,by_wwn,by_path,by_name"
      minSize: 30Gi
      type: ssd
      wipe: true
    partitions:
      - name: md0_part1
  - device:
      workBy: "by_id,by_wwn,by_path,by_name"
      minSize: 30Gi
      type: ssd
      wipe: true
    partitions:
      - name: md0_part2
softRaidDevices:
  - devices:
      - partition: md0_part1
      - partition: md0_part2
    level: raid1
    metadata: "1.0"
    name: /dev/md0
volumeGroups:
  - devices:
      - softRaidDevice: /dev/md0
    name: lvm_nova
...
Create an mdadm software RAID (raid0, raid1, raid10)

TechPreview

Warning

The EFI system partition partflags: ['esp'] must be a physical partition in the main partition table of the disk, not under LVM or mdadm software RAID.

During configuration of your custom bare metal host profile as described in Create a custom bare metal host profile, you can create an mdadm-based software RAID device of the raid0 or raid1 type by describing the mdadm devices under the softRaidDevices field in BareMetalHostProfile. For example:

...
softRaidDevices:
- name: /dev/md0
  devices:
  - partition: sda1
  - partition: sdb1
- name: raid-name
  devices:
  - partition: sda2
  - partition: sdb2
...

You can also use the raid10 type for the mdadm-based software RAID devices. This type requires an even number of storage devices, at least four, available on your servers. For example:

softRaidDevices:
- name: /dev/md0
  level: raid10
  devices:
    - partition: sda1
    - partition: sdb1
    - partition: sdc1
    - partition: sdd1

The following fields in softRaidDevices describe RAID devices:

  • name

    Name of the RAID device to refer to throughout BareMetalHostProfile.

  • level

    Type or level of RAID used to create a device, defaults to raid1. Set to raid0 or raid10 to create a device of the corresponding type.

  • devices

    List of physical devices or partitions used to build a software RAID device. It must include at least two partitions or devices to build a raid0 or raid1 device and at least four to build a raid10 device.

For the rest of the mdadm RAID parameters, see BareMetalHostProfile spec.

Caution

The mdadm RAID devices cannot be created on top of LVM devices.

You can use flexible size units throughout bare metal host profiles. For example, you can now use either sizeGiB: 0.1 or size: 100Mi when specifying a device size.

Mirantis recommends using only one parameter name type and units throughout the configuration files. If both sizeGiB and size are used, sizeGiB is ignored during deployment and the suffix is adjusted accordingly. For example, 1.5Gi will be serialized as 1536Mi. The size without units is counted in bytes. For example, size: 120 means 120 bytes.

Warning

All data will be wiped during cluster deployment on devices defined directly or indirectly in the fileSystems list of BareMetalHostProfile. For example:

  • A raw device partition with a file system on it

  • A device partition in a volume group with a logical volume that has a file system on it

  • An mdadm RAID device with a file system on it

  • An LVM RAID device with a file system on it

The wipe field is always considered true for these devices. The false value is ignored.

Therefore, to prevent data loss, move the necessary data from these file systems to another server beforehand, if required.

The following example illustrates an extract of BareMetalHostProfile with / on the mdadm raid1 and some data storage on raid0:

Example with / on the mdadm raid1 and data storage on raid0
...
devices:
  - device:
      workBy: "by_id,by_wwn,by_path,by_name"
      type: nvme
      wipe: true
    partitions:
      - name: root_part1
        size: 120Gi
      - name: rest_sda
        size: 0
  - device:
      workBy: "by_id,by_wwn,by_path,by_name"
      type: nvme
      wipe: true
    partitions:
      - name: root_part2
        size: 120Gi
      - name: rest_sdb
        size: 0
softRaidDevices:
  - name: root
    level: raid1  ## <-- mdadm raid1
    devices:
      - partition: root_part1
      - partition: root_part2
  - name: data
    level: raid0  ## <-- mdadm raid0
    devices:
      - partition: rest_sda
      - partition: rest_sdb
fileSystems:
  - fileSystem: ext4
    softRaidDevice: root
    mountPoint: /
    mountOpts: "noatime,nodiratime"
  - fileSystem: ext4
    softRaidDevice: data
    mountPoint: /mnt/data
...
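
Because partitions, RAID devices, and logical volumes are referenced by name across several sections of a profile, a mismatch is easy to miss. The following is a minimal sketch of a consistency check over a profile spec loaded into a Python dictionary. The function name undefined_references is hypothetical, and the check covers only the partition, softRaidDevice, and logicalVolume keys that appear in the examples of this section.

```python
def undefined_references(spec):
    """Return (key, name) pairs that reference names not defined in the spec."""
    partitions = {p["name"] for d in spec.get("devices", [])
                  for p in d.get("partitions", [])}
    raids = {r["name"] for r in spec.get("softRaidDevices", [])}
    volumes = {v["name"] for v in spec.get("logicalVolumes", [])}
    known = {"partition": partitions, "softRaidDevice": raids,
             "logicalVolume": volumes}

    # Members of RAID devices and volume groups, plus file system entries
    members = [m for section in ("softRaidDevices", "volumeGroups")
               for item in spec.get(section, [])
               for m in item.get("devices", [])]
    missing = []
    for entry in members + spec.get("fileSystems", []):
        for key, names in known.items():
            if key in entry and entry[key] not in names:
                missing.append((key, entry[key]))
    return missing


spec = {
    "devices": [
        {"partitions": [{"name": "root_part1"}, {"name": "rest_sda"}]},
        {"partitions": [{"name": "root_part2"}, {"name": "rest_sdb"}]},
    ],
    "softRaidDevices": [
        {"name": "root", "devices": [{"partition": "root_part1"},
                                     {"partition": "root_part2"}]},
        {"name": "data", "devices": [{"partition": "rest_sda"},
                                     {"partition": "rest_sdb"}]},
    ],
    "fileSystems": [
        {"fileSystem": "ext4", "softRaidDevice": "root", "mountPoint": "/"},
        {"fileSystem": "ext4", "softRaidDevice": "data", "mountPoint": "/mnt/data"},
    ],
}
print(undefined_references(spec))  # []
```

An empty list means that every referenced name is defined somewhere in the spec.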

The following example illustrates an extract of BareMetalHostProfile with data storage on a raid10 device:

Example with data storage on the mdadm raid10
...
devices:
  - device:
      workBy: "by_id,by_wwn,by_path,by_name"
      minSize: 60Gi
      type: ssd
      wipe: true
    partitions:
      - name: bios_grub1
        partflags:
          - bios_grub
        size: 4Mi
        wipe: true
      - name: uefi
        partflags:
          - esp
        size: 200Mi
        wipe: true
      - name: config-2
        size: 64Mi
        wipe: true
      - name: lvm_root
        size: 0
        wipe: true
  - device:
      workBy: "by_id,by_wwn,by_path,by_name"
      minSize: 60Gi
      type: nvme
      wipe: true
    partitions:
      - name: md_part1
        partflags:
          - raid
        size: 40Gi
        wipe: true
  - device:
      workBy: "by_id,by_wwn,by_path,by_name"
      minSize: 60Gi
      type: nvme
      wipe: true
    partitions:
      - name: md_part2
        partflags:
          - raid
        size: 40Gi
        wipe: true
  - device:
      workBy: "by_id,by_wwn,by_path,by_name"
      minSize: 60Gi
      type: nvme
      wipe: true
    partitions:
      - name: md_part3
        partflags:
          - raid
        size: 40Gi
        wipe: true
  - device:
      workBy: "by_id,by_wwn,by_path,by_name"
      minSize: 60Gi
      type: nvme
      wipe: true
    partitions:
      - name: md_part4
        partflags:
          - raid
        size: 40Gi
        wipe: true
fileSystems:
  - fileSystem: vfat
    partition: config-2
  - fileSystem: vfat
    mountPoint: /boot/efi
    partition: uefi
  - fileSystem: ext4
    mountOpts: rw,noatime,nodiratime,lazytime,nobarrier,commit=240,data=ordered
    mountPoint: /
    partition: lvm_root
  - fileSystem: ext4
    mountPoint: /var
    softRaidDevice: /dev/md0
softRaidDevices:
  - devices:
      - partition: md_part1
      - partition: md_part2
      - partition: md_part3
      - partition: md_part4
    level: raid10
    metadata: "1.2"
    name: /dev/md0
...
Create an mdadm software RAID level 10 (raid10)

TechPreview

Warning

The EFI system partition partflags: ['esp'] must be a physical partition in the main partition table of the disk, not under LVM or mdadm software RAID.

You can deploy MOSK on local software-based Redundant Array of Independent Disks (RAID) devices to withstand failure of one device at a time.

Using a custom bare metal host profile, you can configure and create an mdadm-based software RAID device of type raid10 if you have an even number of devices available on your servers. At least four storage devices are required for such a RAID device.

During configuration of your custom bare metal host profile as described in Create a custom bare metal host profile, create an mdadm-based software RAID device raid10 by describing the mdadm devices under the softRaidDevices field. For example:

...
softRaidDevices:
- name: /dev/md0
  level: raid10
  devices:
    - partition: sda1
    - partition: sdb1
    - partition: sdc1
    - partition: sdd1
...

The following fields in softRaidDevices describe RAID devices:

  • name

    Name of the RAID device to refer to throughout BareMetalHostProfile.

  • devices

    List of physical devices or partitions used to build a software RAID device. It must include at least four partitions or devices to build a raid10 device.

  • level

    Type or level of RAID used to create a device. Set to raid10 or raid1 to create a device of the corresponding type.

For the rest of the mdadm RAID parameters, see BareMetalHostProfile spec.

Caution

The mdadm RAID devices cannot be created on top of an LVM device.

The following example illustrates an extract of BareMetalHostProfile with data storage on a raid10 device:

...
devices:
  - device:
      minSize: 60Gi
      wipe: true
    partitions:
      - name: bios_grub1
        partflags:
          - bios_grub
        size: 4Mi
        wipe: true
      - name: uefi
        partflags:
          - esp
        size: 200Mi
        wipe: true
      - name: config-2
        size: 64Mi
        wipe: true
      - name: lvm_root
        size: 0
        wipe: true
  - device:
      minSize: 60Gi
      wipe: true
    partitions:
      - name: md_part1
        partflags:
          - raid
        size: 40Gi
        wipe: true
  - device:
      minSize: 60Gi
      wipe: true
    partitions:
      - name: md_part2
        partflags:
          - raid
        size: 40Gi
        wipe: true
  - device:
      minSize: 60Gi
      wipe: true
    partitions:
      - name: md_part3
        partflags:
          - raid
        size: 40Gi
        wipe: true
  - device:
      minSize: 60Gi
      wipe: true
    partitions:
      - name: md_part4
        partflags:
          - raid
        size: 40Gi
        wipe: true
fileSystems:
  - fileSystem: vfat
    partition: config-2
  - fileSystem: vfat
    mountPoint: /boot/efi
    partition: uefi
  - fileSystem: ext4
    mountOpts: rw,noatime,nodiratime,lazytime,nobarrier,commit=240,data=ordered
    mountPoint: /
    partition: lvm_root
  - fileSystem: ext4
    mountPoint: /var
    softRaidDevice: /dev/md0
softRaidDevices:
  - devices:
      - partition: md_part1
      - partition: md_part2
      - partition: md_part3
      - partition: md_part4
    level: raid10
    metadata: "1.2"
    name: /dev/md0
...

Warning

When building the raid10 array on top of device partitions, make sure that only one partition per device is used for a given array.

Although having two partitions located on the same physical device as array members is technically possible, it may lead to data loss if mdadm selects both partitions of the same drive to be mirrored. In such case, redundancy against entire drive failure is lost.
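
The one-partition-per-device rule from the warning above can be verified before deployment. The following is a minimal sketch over a profile spec loaded into a Python dictionary. The function name shared_drive_members is hypothetical, and devices are identified only by their position in the devices list.

```python
from collections import Counter


def shared_drive_members(spec):
    """Map each RAID device name to the indexes of devices that contribute
    more than one partition to that array."""
    owner = {p["name"]: index
             for index, d in enumerate(spec.get("devices", []))
             for p in d.get("partitions", [])}
    result = {}
    for raid in spec.get("softRaidDevices", []):
        counts = Counter(owner[m["partition"]] for m in raid["devices"]
                         if m.get("partition") in owner)
        repeated = sorted(i for i, n in counts.items() if n > 1)
        if repeated:
            result[raid["name"]] = repeated
    return result


profile = {
    "devices": [
        {"partitions": [{"name": "md_part1"}, {"name": "md_part2"}]},
        {"partitions": [{"name": "md_part3"}]},
        {"partitions": [{"name": "md_part4"}]},
    ],
    "softRaidDevices": [
        {"name": "/dev/md0", "level": "raid10",
         "devices": [{"partition": "md_part1"}, {"partition": "md_part2"},
                     {"partition": "md_part3"}, {"partition": "md_part4"}]},
    ],
}
print(shared_drive_members(profile))  # {'/dev/md0': [0]}
```

In this sample, the first device contributes two partitions to /dev/md0, which is the layout the warning advises against.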

Warning

All data will be wiped during cluster deployment on devices defined directly or indirectly in the fileSystems list of BareMetalHostProfile. For example:

  • A raw device partition with a file system on it

  • A device partition in a volume group with a logical volume that has a file system on it

  • An mdadm RAID device with a file system on it

  • An LVM RAID device with a file system on it

The wipe field is always considered true for these devices. The false value is ignored.

Therefore, to prevent data loss, move the necessary data from these file systems to another server beforehand, if required.

Create a MOSK cluster

With L2 networking templates, you can create MOSK clusters with advanced host networking configurations. For example, you can create bond interfaces on top of physical interfaces on the host or use multiple subnets to separate different types of network traffic.

You can use several host-specific L2 templates per one cluster to support different hardware configurations. For example, you can create L2 templates with a different number and layout of NICs to be applied to specific machines of one cluster.

You can also use multiple L2 templates to support different roles for nodes in a MOSK installation. You can create L2 templates with different logical interfaces and assign them to individual machines based on their roles in a MOSK cluster.

Caution

Modification of L2 templates in use is only allowed with a mandatory validation step from the infrastructure operator to prevent accidental cluster failures due to unsafe changes. The list of risks posed by modifying L2 templates includes:

  • Services running on hosts cannot reconfigure automatically to switch to the new IP addresses and/or interfaces.

  • Connections between services are interrupted unexpectedly, which can cause data loss.

  • Incorrect configurations on hosts can lead to irrevocable loss of connectivity between services and unexpected cluster partition or disassembly.

For details, see Modify network configuration on an existing machine.

Since MOSK 23.2.2, in the Technology Preview scope, you can create a MOSK cluster with the multi-rack topology, where cluster nodes including Kubernetes masters are distributed across multiple racks without L2 layer extension between them, and use BGP for announcement of the cluster API load balancer address and external addresses of Kubernetes load-balanced services.

Implementation of the multi-rack topology implies the use of Rack and MultiRackCluster objects that support configuration of BGP announcement of the cluster API load balancer address. For the configuration procedure, refer to Configure BGP announcement for cluster API LB address. For configuring the BGP announcement of external addresses of Kubernetes load-balanced services, refer to Configure and verify MetalLB.

Follow the procedures described in the subsections below to configure initial settings and advanced network objects for your managed clusters.

Create a managed bare metal cluster

This section instructs you on how to configure and deploy a managed cluster that is based on the bare metal-based management cluster through the Mirantis Container Cloud web UI.

Note

Due to the known issue 50181, creation of a compact managed cluster or addition of any labels to the control plane nodes is not available through the Container Cloud web UI.

To create a managed cluster on bare metal:

  1. Available since the Cluster release 16.1.0 on the management cluster. If you plan to deploy a large managed cluster, enable dynamic IP allocation to increase the number of bare metal hosts to be provisioned in parallel. For details, see Enable dynamic IP allocation.

  2. Available since Container Cloud 2.24.0 (Cluster release 14.0.0). Optional. Technology Preview. Enable custom host names for cluster machines. When enabled, any machine host name in a particular region matches the related Machine object name. For example, instead of the default kaas-node-<UID>, a machine host name will be master-0. The custom naming format is more convenient and easier to operate with.

    For details, see Configure host names for cluster machines.

    Skip this step if you enabled this feature during management cluster bootstrap, because custom host names will be automatically enabled on the related managed cluster as well.

  3. Log in to the Container Cloud web UI with the writer permissions.

  4. Switch to the required non-default project using the Switch Project action icon located on top of the main left-side navigation panel.

    Caution

    Do not create a MOSK cluster in the default project (Kubernetes namespace), which is dedicated for the management cluster only. If no projects are defined, first create a new mosk project as described in Create a project for MOSK clusters.

  5. In the SSH keys tab, click Add SSH Key to upload the public SSH key that will be used for the SSH access to the bare metal hosts.

  6. Optional. In the Proxies tab, enable proxy access to the managed cluster:

    1. Click Add Proxy.

    2. In the Add New Proxy wizard, fill out the form with the following parameters:

      Proxy configuration

      Parameter

      Description

      Proxy Name

      Name of the proxy server to use during a managed cluster creation.

      Region (removed in MOSK 24.1)

      From the drop-down list, select the required region.

      HTTP Proxy

      Add the HTTP proxy server domain name in the following format:

      • http://proxy.example.com:port - for anonymous access

      • http://user:password@proxy.example.com:port - for restricted access

      HTTPS Proxy

      Add the HTTPS proxy server domain name in the same format as for HTTP Proxy.

      No Proxy

      Comma-separated list of IP addresses or domain names.

    Note

    The possibility to use a MITM proxy with a CA certificate is available since MOSK 23.1.

    For the list of Mirantis resources and IP addresses to be accessible from the Container Cloud clusters, see Reference Architecture: Requirements.

  7. In the Clusters tab, click Create Cluster.

  8. Configure the new cluster in the Create New Cluster wizard that opens:

    1. Define general and Kubernetes parameters:

      Create new cluster: General, Provider, and Kubernetes

      Section

      Parameter name

      Description

      General settings

      Cluster name

      The cluster name.

      Provider

      Select Baremetal.

      Region (removed in MOSK 24.1)

      From the drop-down list, select Baremetal.

      Release version

      Select a Container Cloud version with the OpenStack label tag. Otherwise, you will not be able to deploy MOSK on this managed cluster.

      Proxy

      Optional. From the drop-down list, select the proxy server name that you have previously created.

      SSH keys

      From the drop-down list, select the SSH key name that you have previously added for SSH access to the bare metal hosts.

      Container Registry

      From the drop-down list, select the Docker registry name that you have previously added using the Container Registries tab. For details, see Define a custom CA certificate for a private Docker registry.

      Enable WireGuard

      Optional. Technology Preview. Deprecated since Container Cloud 2.29.0 (Cluster releases 17.4.0 and 16.4.0). Available since Container Cloud 2.24.0 (Cluster release 14.0.0). Enable WireGuard for traffic encryption on the Kubernetes workloads network.

      WireGuard configuration
      1. Ensure that the Calico MTU size is at least 60 bytes smaller than the interface MTU size of the workload network. IPv4 WireGuard uses a 60-byte header. For details, see Set the MTU size for Calico.

      2. Enable WireGuard by selecting the Enable WireGuard check box.

        Caution

        Changing this parameter on a running cluster causes a downtime that can vary depending on the cluster size.

        Note

        This parameter was renamed from Enable Secure Overlay to Enable WireGuard in Container Cloud 2.25.0 (Cluster releases 17.0.0 and 16.0.0).

      For more details about WireGuard, see Calico documentation: Encrypt in-cluster pod traffic.

      Parallel Upgrade Of Worker Machines

      Optional. Available since Container Cloud 2.25.0 (Cluster releases 17.0.0 and 16.0.0).

      The maximum number of the worker nodes to update simultaneously. It serves as an upper limit on the number of machines that are drained at a given moment of time. Defaults to 1.

      You can also configure this option after deployment before the cluster update.

      Parallel Preparation For Upgrade Of Worker Machines

      Optional. Available since Container Cloud 2.25.0 (Cluster releases 17.0.0 and 16.0.0)

      The maximum number of worker nodes being prepared at a given moment of time, which includes downloading of new artifacts. It serves as a limit for the network load that can occur when downloading the files to the nodes. Defaults to 50.

      You can also configure this option after deployment before the cluster update.

      Provider

      LB host IP

      The IP address of the load balancer endpoint that will be used to access the Kubernetes API of the new cluster. This IP address must be in the LCM network if a separate LCM network is in use and if L2 (ARP) announcement of cluster API load balancer IP is in use.

      LB address range (removed in MOSK 24.3)

      The range of IP addresses that can be assigned to load balancers for Kubernetes Services by MetalLB. For a more flexible MetalLB configuration, refer to Configure and verify MetalLB.

      Note

      Since MOSK 24.3, MetalLB configuration must be added after cluster creation.

      Kubernetes

      Services CIDR blocks

      The Kubernetes Services CIDR blocks. For example, 10.233.0.0/18.

      Pods CIDR blocks

      The Kubernetes pods CIDR blocks. For example, 10.233.64.0/18.

      Note

      The network subnet size of Kubernetes pods influences the number of nodes that can be deployed in the cluster.

      The default subnet size /18 is enough to create a cluster with up to 256 nodes. Each node uses /26 address blocks (64 addresses each), and at least one address block is allocated per node. These addresses are used by the Kubernetes pods with hostNetwork: false. The cluster size may be limited further when some nodes use more than one address block.
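
      The sizing rule in the note above can be checked with a short calculation. The following Python sketch is illustrative only: the max_nodes function name is hypothetical, and the sketch assumes that every node consumes exactly one /26 address block, which is the best case described above.

```python
import ipaddress


def max_nodes(pods_cidr: str, block_prefix: int = 26) -> int:
    """Upper bound on nodes when each node takes exactly one address block."""
    network = ipaddress.ip_network(pods_cidr)
    if block_prefix < network.prefixlen:
        raise ValueError("address block is larger than the pods subnet")
    # Each extra prefix bit doubles the number of blocks that fit in the subnet.
    return 2 ** (block_prefix - network.prefixlen)


if __name__ == "__main__":
    print(max_nodes("10.233.64.0/18"))  # 256, the figure given in the note
    print(max_nodes("10.233.64.0/20"))  # 64
```

      Nodes that use more than one address block lower this bound accordingly.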

    2. Configure StackLight:

      Note

      If StackLight is enabled in non-HA mode but Ceph is not deployed yet, StackLight will not be installed and will be stuck in the Yellow state waiting for a successful Ceph installation. Once the Ceph cluster is deployed, the StackLight installation resumes. To deploy a Ceph cluster, refer to Add a Ceph cluster.

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable Monitoring

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration and StackLight configuration procedure.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Deployment architecture.

      Note

      The logging mechanism performance depends on the cluster log load. In case of a high load, you may need to increase the default resource requests and limits for fluentdLogs. For details, see StackLight resource limits.

      HA Mode

      Select to enable StackLight monitoring in High Availability (HA) mode. For differences between HA and non-HA modes, see Deployment architecture. If disabled, StackLight requires a Ceph cluster. To deploy a Ceph cluster, refer to Add a Ceph cluster.

      StackLight Default Logs Severity Level

      Log severity (verbosity) level for all StackLight components. The default value for this parameter is Default component log level that respects original defaults of each StackLight component. For details about severity levels, see StackLight log verbosity.

      StackLight Component Logs Severity Level

      The severity level of logs for a specific StackLight component that overrides the value of the StackLight Default Logs Severity Level parameter. For details about severity levels, see StackLight log verbosity. Expand the drop-down menu for a specific component to display its list of available log levels.

      OpenSearch

      Logstash Retention Time (removed in MOSK 24.1)

      Available if you select Enable Logging. Specifies the logstash-* index retention time.

      Events Retention Time

      Available if you select Enable Logging. Specifies the kubernetes_events-* index retention time.

      Notifications Retention Time

      Available if you select Enable Logging. Specifies the notification-* index retention time.

      Persistent Volume Claim Size

      Available if you select Enable Logging. The OpenSearch persistent volume claim size.

      Collected Logs Severity Level

      Available if you select Enable Logging. The minimum severity of all Container Cloud components logs collected in OpenSearch. For details about severity levels, see StackLight logging.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  9. Click Create.

    To monitor cluster readiness, see Verify cluster status.

  10. Available since Container Cloud 2.24.0 (Cluster releases 14.0.0 and 15.0.1). Optional. Technology Preview. Enable the Linux Audit daemon auditd to monitor activity of cluster processes and prevent potential malicious activity.

    Configuration for auditd

    In the Cluster object or cluster.yaml.template, add the auditd parameters:

    spec:
      providerSpec:
        value:
          audit:
            auditd:
              enabled: <bool>
              enabledAtBoot: <bool>
              backlogLimit: <int>
              maxLogFile: <int>
              maxLogFileAction: <string>
              maxLogFileKeep: <int>
              mayHaltSystem: <bool>
              presetRules: <string>
              customRules: <string>
              customRulesX32: <text>
              customRulesX64: <text>
    

    Configuration parameters for auditd:

    enabled

    Boolean, default - false. Enables the auditd role to install the auditd packages and configure rules. CIS rules: 4.1.1.1, 4.1.1.2.

    enabledAtBoot

    Boolean, default - false. Configures grub to audit processes that can be audited even if they start up prior to auditd startup. CIS rule: 4.1.1.3.

    backlogLimit

    Integer, default - none. Configures the backlog to hold records. If during boot audit=1 is configured, the backlog holds 64 records. If more than 64 records are created during boot, auditd records will be lost, and potential malicious activity may go undetected. CIS rule: 4.1.1.4.

    maxLogFile

    Integer, default - none. Configures the maximum size of the audit log file. Once the log reaches the maximum size, it is rotated and a new log file is created. CIS rule: 4.1.2.1.

    maxLogFileAction

    String, default - none. Defines handling of the audit log file reaching the maximum file size. Allowed values:

    • keep_logs - rotate logs but never delete them.

    • rotate - add a cron job to compress rotated log files and keep a maximum of 5 compressed files.

    • compress - compress log files and keep them under the /var/log/auditd/ directory. Requires auditd_max_log_file_keep to be enabled.

    CIS rule: 4.1.2.2.

    maxLogFileKeep

    Integer, default - 5. Defines the number of compressed log files to keep under the /var/log/auditd/ directory. Requires auditd_max_log_file_action=compress. CIS rules - none.

    mayHaltSystem

    Boolean, default - false. Halts the system when the audit logs are full. Applies the following configuration:

    • space_left_action = email

    • action_mail_acct = root

    • admin_space_left_action = halt

    CIS rule: 4.1.2.3.

    customRules

    String, default - none. Base64-encoded content of the 60-custom.rules file for any architecture. CIS rules - none.

    customRulesX32

    String, default - none. Base64-encoded content of the 60-custom.rules file for the i386 architecture. CIS rules - none.

    customRulesX64

    String, default - none. Base64-encoded content of the 60-custom.rules file for the x86_64 architecture. CIS rules - none.

    presetRules

    String, default - none. Comma-separated list of the following built-in preset rules:

    • access

    • actions

    • delete

    • docker

    • identity

    • immutable

    • logins

    • mac-policy

    • modules

    • mounts

    • perm-mod

    • privileged

    • scope

    • session

    • system-locale

    • time-change

    Since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0) in the Technology Preview scope, you can collect some of the preset rules indicated above as groups and use them in presetRules:

    • ubuntu-cis-rules - this group contains rules to comply with the Ubuntu CIS Benchmark recommendations, including the following CIS Ubuntu 20.04 v2.0.1 rules:

      • scope - 5.2.3.1

      • actions - same as 5.2.3.2

      • time-change - 5.2.3.4

      • system-locale - 5.2.3.5

      • privileged - 5.2.3.6

      • access - 5.2.3.7

      • identity - 5.2.3.8

      • perm-mod - 5.2.3.9

      • mounts - 5.2.3.10

      • session - 5.2.3.11

      • logins - 5.2.3.12

      • delete - 5.2.3.13

      • mac-policy - 5.2.3.14

      • modules - 5.2.3.19

    • docker-cis-rules - this group contains rules to comply with the Docker CIS Benchmark recommendations, including the docker preset rule, which covers the Docker CIS v1.6.0 rules 1.1.3 - 1.1.18.

    You can also use two additional keywords inside presetRules:

    • none - select no built-in rules.

    • all - select all built-in rules. When using this keyword, you can add the ! prefix to a rule name to exclude some rules. You can use the ! prefix for rules only if you add the all keyword as the first rule. Place a rule with the ! prefix only after the all keyword.

    Example configurations:

    • presetRules: none - disable all preset rules

    • presetRules: docker - enable only the docker rules

    • presetRules: access,actions,logins - enable only the access, actions, and logins rules

    • presetRules: ubuntu-cis-rules - enable all rules from the ubuntu-cis-rules group

    • presetRules: docker-cis-rules,actions - enable all rules from the docker-cis-rules group and the actions rule

    • presetRules: all - enable all preset rules

    • presetRules: all,!immutable,!session - enable all preset rules except immutable and session
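
    The presetRules syntax described above can be checked before it is added to the Cluster object. The following Python sketch is not part of the product: the validate_preset_rules function is hypothetical, and it assumes that the ! prefix applies to individual rule names only, because the text above mentions exclusion only for rules.

```python
PRESET_RULES = {
    "access", "actions", "delete", "docker", "identity", "immutable",
    "logins", "mac-policy", "modules", "mounts", "perm-mod", "privileged",
    "scope", "session", "system-locale", "time-change",
}
RULE_GROUPS = {"ubuntu-cis-rules", "docker-cis-rules"}
KEYWORDS = {"all", "none"}


def validate_preset_rules(value: str) -> list:
    """Return a list of problems found in a presetRules string."""
    items = [item.strip() for item in value.split(",")]
    problems = []
    for item in items:
        if item.startswith("!"):
            # Exclusions are meaningful only when "all" is the first item.
            if items[0] != "all":
                problems.append(f"{item}: the ! prefix requires 'all' as the first item")
            if item[1:] not in PRESET_RULES:
                problems.append(f"{item}: unknown rule name")
        elif item not in PRESET_RULES | RULE_GROUPS | KEYWORDS:
            problems.append(f"{item}: unknown rule, group, or keyword")
    return problems


if __name__ == "__main__":
    for example in ("docker-cis-rules,actions", "!immutable,all", "access,acess"):
        print(example, "->", validate_preset_rules(example) or "ok")
```

    An empty list means that the string uses only the names, groups, and keywords listed in this section.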


    CIS controls
    4.1.3 (time-change)
    4.1.4 (identity)
    4.1.5 (system-locale)
    4.1.6 (mac-policy)
    4.1.7 (logins)
    4.1.8 (session)
    4.1.9 (perm-mod)
    4.1.10 (access)
    4.1.11 (privileged)
    4.1.12 (mounts)
    4.1.13 (delete)
    4.1.14 (scope)
    4.1.15 (actions)
    4.1.16 (modules)
    4.1.17 (immutable)
    Docker CIS controls
    1.1.4
    1.1.8
    1.1.10
    1.1.12
    1.1.13
    1.1.15
    1.1.16
    1.1.17
    1.1.18
    1.2.3
    1.2.4
    1.2.5
    1.2.6
    1.2.7
    1.2.10
    1.2.11
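
    The customRules, customRulesX32, and customRulesX64 parameters described above expect Base64-encoded file content. The following Python sketch shows one way to produce such a value. The encode_rules function and the sample audit rule, including its hosts_changes key, are hypothetical and serve only as an illustration.

```python
import base64


def encode_rules(rules_text: str) -> str:
    """Base64-encode the text of a 60-custom.rules file as a single line."""
    return base64.b64encode(rules_text.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    # Hypothetical rule: watch /etc/hosts for writes and attribute changes.
    rules = "-w /etc/hosts -p wa -k hosts_changes\n"
    encoded = encode_rules(rules)
    print(encoded)
    # Decoding must return the original text unchanged.
    assert base64.b64decode(encoded).decode("utf-8") == rules
```

    The printed string is the value to place in the corresponding customRules field.
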
  11. Optional. Colocate the OpenStack control plane with the managed cluster Kubernetes manager nodes by adding the following field to the Cluster object spec:

    spec:
      providerSpec:
        value:
          dedicatedControlPlane: false
    

    Note

    This feature is available as Technology Preview. Use such a configuration for testing and evaluation purposes only.

  12. Optional. Customize MetalLB speakers that are deployed on all Kubernetes nodes except master nodes by default. For details, see Configure node selectors for MetalLB speakers.

  13. Configure the MetalLB parameters related to IP address allocation and announcement for load-balanced cluster services. For details, see Configure and verify MetalLB.

  14. Proceed to Obtain and use details about network interfaces.

Note

Once you have created a MOSK cluster, some StackLight alerts may fire as false positives until you deploy the Mirantis OpenStack environment.

Configure MetalLB

Before configuring subnets for a MOSK cluster, set up and verify MetalLB parameters as described in the following subsections.

Configure and verify MetalLB

This section describes how to set up and verify MetalLB parameters before configuring subnets for a MOSK cluster.

Caution

This section also applies to the bootstrap procedure of a management cluster with the following differences:

  • Instead of the Cluster object, configure templates/bm/cluster.yaml.template.

  • Instead of the MetalLBConfig object, configure templates/bm/metallbconfig.yaml.template.

  • Instead of creating specific IPAM objects such as Subnet and L2Template (as well as Rack and MultiRackCluster when using BGP configuration), add their settings to templates/bm/ipam-objects.yaml.template.

Configuration rules for the ‘MetalLBConfig’ object

Caution

The use of the MetalLBConfig object is mandatory after your management cluster upgrade to the Cluster release 16.0.0.

The following rules and requirements apply to configuration of the MetalLBConfig object:

  • Define one MetalLBConfig object per cluster.

  • Define the following mandatory labels:

    cluster.sigs.k8s.io/cluster-name

    Specifies the cluster name where the MetalLB address pool is used.

    kaas.mirantis.com/region

    Specifies the region name of the cluster where the MetalLB address pool is used.

    kaas.mirantis.com/provider

    Specifies the provider of the cluster where the MetalLB address pool is used.

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

  • Intersection of IP address ranges within any single MetalLB address pool is not allowed.

  • At least one MetalLB address pool must have the auto-assign policy enabled so that unannotated services can have load balancer IP addresses allocated for them.

  • When configuring multiple address pools with the auto-assign policy enabled, keep in mind that the pool used to allocate an IP address for a particular unannotated service is not determined in advance.
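
Two of the rules above, no intersecting ranges within a pool and at least one pool with auto-assign enabled, can be checked before the object is created. The following Python sketch is illustrative only. The check_pools function and the shape of its input are hypothetical and do not mirror the MetalLBConfig schema. The sketch also assumes that all ranges belong to one IP family and that a missing autoAssign key counts as enabled.

```python
import ipaddress
from itertools import combinations


def parse_range(text):
    """Return (first, last) integer bounds for 'start-end' or CIDR notation."""
    text = text.strip()
    if "-" in text:
        start, end = (ipaddress.ip_address(p.strip()) for p in text.split("-", 1))
    else:
        network = ipaddress.ip_network(text, strict=False)
        start, end = network[0], network[-1]
    if int(start) > int(end):
        raise ValueError(f"range start is after range end: {text}")
    return int(start), int(end)


def check_pools(pools):
    """Return a list of rule violations; an empty list means the checks passed.

    `pools` is a hypothetical structure: a list of dicts with the keys
    'name', 'addresses' (list of strings), and 'autoAssign' (bool).
    """
    problems = []
    for pool in pools:
        bounds = [parse_range(item) for item in pool["addresses"]]
        for (a_first, a_last), (b_first, b_last) in combinations(bounds, 2):
            # Two closed intervals intersect when each starts before the other ends.
            if a_first <= b_last and b_first <= a_last:
                problems.append(f"pool '{pool['name']}' has intersecting ranges")
    # Assumption: a missing autoAssign key counts as enabled.
    if not any(pool.get("autoAssign", True) for pool in pools):
        problems.append("no address pool has auto-assign enabled")
    return problems


if __name__ == "__main__":
    pools = [
        {"name": "default", "addresses": ["10.0.11.61-10.0.11.80"], "autoAssign": True},
        {"name": "services-pxe", "addresses": ["10.0.0.61-10.0.0.70"], "autoAssign": False},
    ]
    print(check_pools(pools) or "ok")
```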

Note

You can optimize address announcement for load-balanced services using the interfaces selector for the l2Advertisements object. This selector allows for address announcement only on selected host interfaces. For details, see MetalLBConfig spec.

Configuration rules for MetalLBConfigTemplate (obsolete since 24.2)

Caution

The MetalLBConfigTemplate object is deprecated since MOSK 24.2 and unsupported since MOSK 24.3. For details, see Deprecation notes.

  • All rules described above for MetalLBConfig also apply to MetalLBConfigTemplate.

  • Optional. Define one MetalLBConfigTemplate object per cluster. The use of this object without MetalLBConfig is not allowed.

  • When using MetalLBConfigTemplate:

    • MetalLBConfig must reference MetalLBConfigTemplate by name:

      spec:
        templateName: <managed-metallb-template>
      
    • You can use Subnet objects for defining MetalLB address pools. Refer to MetalLB configuration guidelines for subnets for guidelines on configuring MetalLB address pools using Subnet objects.

    • You can optimize address announcement for load-balanced services using the interfaces selector for the l2Advertisements object. This selector allows for address announcement only on selected host interfaces. For details, see MetalLBConfigTemplate spec.

Configure and verify MetalLB using the web UI

Available since MOSK 24.3

Note

The BGP configuration is not yet supported in the Container Cloud web UI. In the meantime, use the CLI for this purpose. For details, see Configure and verify MetalLB using the CLI.

  1. Read the MetalLB configuration guidelines described in Configuration rules for the ‘MetalLBConfig’ object.

  2. Optional. Configure parameters related to MetalLB components life cycle such as deployment and update using the metallb Helm chart values in the Cluster spec section. For example:

  3. Log in to the Container Cloud web UI with the writer permissions.

  4. Switch to the required non-default project using the Switch Project action icon located on top of the main left-side navigation panel.

    Caution

    Do not create a MOSK cluster in the default project (Kubernetes namespace), which is dedicated for the management cluster only. If no projects are defined, first create a new mosk project as described in Create a project for MOSK clusters.

  5. In the Networks section, click the MetalLB Configs tab.

  6. Click Create MetalLB Config.

  7. Fill out the Create MetalLB Config form as required:

    • Name

      Name of the MetalLB object being created.

    • Cluster

      Name of the cluster that the MetalLB object is being created for.

    • IP Address Pools

      List of MetalLB IP address pool descriptions that will be used to create the MetalLB IPAddressPool objects. Click the + button on the right side of the section to add more objects.

      • Name

        IP address pool name.

      • Addresses

        Comma-separated ranges of the IP addresses included into the address pool.

      • Auto Assign

        Enable auto-assign policy for unannotated services to have load balancer IP addresses allocated to them. At least one MetalLB address pool must have the auto-assign policy enabled.

      • Service Allocation

        IP address pool allocation to services. Click Edit to insert a service allocation object with required label selectors for services in the YAML format. For example:

        serviceSelectors:
        - matchExpressions:
          - key: app.kubernetes.io/name
            operator: NotIn
            values:
            - dhcp-lb
        

        For details on the MetalLB IPAddressPool object type, see MetalLB documentation.

      • L2 Advertisements

        List of MetalLBL2Advertisement objects to create MetalLB L2Advertisement objects.

        The l2Advertisements object allows defining interfaces to optimize the announcement. When you use the interfaces selector, LB addresses are announced only on selected host interfaces.

        Mirantis recommends using the interfaces selector if nodes use separate host networks for different types of traffic. Such a configuration reduces unnecessary announcement traffic on other interfaces and networks and limits the chances of reaching IP addresses of load-balanced services from irrelevant interfaces and networks.

        Caution

        Interface names in the interfaces list must match those on the corresponding nodes.

        Add the following parameters:

        • Name

          Name of the l2Advertisements object.

        • Interfaces

          Optional. Comma-separated list of interface names that must match the ones on the corresponding nodes. These names are defined in L2 templates that are linked to the selected cluster.

        • IP Address Pools

          Select the IP address pool to use for the l2Advertisements object.

        • Node Selectors

          Optional. Match labels and values for the Kubernetes node selector to limit the nodes announced as next hops for the LoadBalancer IP. If you do not provide any labels, all nodes are announced as next hops.

        For details on the MetalLB L2Advertisements object type, see MetalLB documentation.

  8. Click Create.

  9. In Networks > MetalLB Configs, verify the status of the created MetalLB object:

    • Ready - object is operational.

    • Error - object is non-operational. Hover over the status to obtain details of the issue.

    Note

    To verify the object details, in Networks > MetalLB Configs, click the More action icon in the last column of the required object section and select MetalLB Config info.

  10. Proceed to creating cluster subnets as described in Create subnets.

Configure and verify MetalLB using the CLI
  1. Optional. Configure parameters related to MetalLB components life cycle such as deployment and update using the metallb Helm chart values in the Cluster spec section. For example:

  2. Configure the MetalLB parameters related to IP address allocation and announcement for load-balanced cluster services:

    Since MOSK 24.2

    Mandatory after a management cluster upgrade to the Cluster release 17.2.0. Recommended and default since MOSK 24.2.

    Create the MetalLBConfig object:

    In the Technology Preview scope, you can use BGP for announcement of external addresses of Kubernetes load-balanced services for a MOSK cluster. To configure the BGP announcement mode for MetalLB, use the MetalLBConfig object.

    The use of BGP is required to announce IP addresses for load-balanced services when using MetalLB on nodes that are distributed across multiple racks. In this case, you must set rack-id labels on nodes. These labels are used in node selectors for the BGPPeer objects, the BGPAdvertisement objects, or both, to properly configure BGP connections from each node.

    Configuration example of the Machine object for the BGP announcement mode
    apiVersion: cluster.k8s.io/v1alpha1
    kind: Machine
    metadata:
      name: test-cluster-compute-1
      namespace: mosk-ns
      labels:
        cluster.sigs.k8s.io/cluster-name: test-cluster
        ipam/RackRef: rack-1  # reference to the "rack-1" Rack
        kaas.mirantis.com/provider: baremetal
    spec:
      providerSpec:
        value:
          ...
          nodeLabels:
          - key: rack-id   # node label can be used in "nodeSelectors" inside
            value: rack-1  # "BGPPeer" and/or "BGPAdvertisement" MetalLB objects
      ...
    
    Configuration example of the MetalLBConfig object for the BGP announcement mode
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: MetalLBConfig
    metadata:
      name: test-cluster-metallb-config
      namespace: mosk-ns
      labels:
        cluster.sigs.k8s.io/cluster-name: test-cluster
        kaas.mirantis.com/provider: baremetal
    spec:
      ...
      bgpPeers:
        - name: svc-peer-1
          spec:
            holdTime: 0s
            keepaliveTime: 0s
            peerAddress: 10.77.42.1
            peerASN: 65100
            myASN: 65101
            nodeSelectors:
              - matchLabels:
                  rack-id: rack-1  # references the nodes having
                                   # the "rack-id=rack-1" label
      bgpAdvertisements:
        - name: services
          spec:
            aggregationLength: 32
            aggregationLengthV6: 128
            ipAddressPools:
              - services
            peers:
              - svc-peer-1
              ...
    

    The bgpPeers and bgpAdvertisements fields are used to configure BGP announcement instead of l2Advertisements.

    The use of BGP for announcement also allows for better balancing of service traffic between cluster nodes as well as gives more configuration control and flexibility for infrastructure administrators. For configuration examples, refer to MetalLB configuration examples. For configuration procedure, refer to Configure BGP announcement for cluster API LB address.

    Since MOSK 23.2

    Select from the following options:

    • Deprecated since MOSK 24.2 and unsupported since MOSK 24.3. Mandatory after a management cluster upgrade to the Cluster release 16.0.0. Recommended and default since MOSK 23.2 in the Technology Preview scope. Create MetalLBConfig and MetalLBConfigTemplate objects. This method allows using the Subnet object to define MetalLB address pools.

      Since MOSK 23.2.2, in the Technology Preview scope, you can use BGP for announcement of external addresses of Kubernetes load-balanced services for a MOSK cluster. To configure the BGP announcement mode for MetalLB, use the MetalLBConfig and MetalLBConfigTemplate objects.

      The use of BGP is required to announce IP addresses for load-balanced services when using MetalLB on nodes that are distributed across multiple racks. In this case, you must set rack-id labels on nodes. These labels are used in node selectors for the BGPPeer objects, the BGPAdvertisement objects, or both, to properly configure BGP connections from each node.

      Configuration example of the Machine object for the BGP announcement mode
      apiVersion: cluster.k8s.io/v1alpha1
      kind: Machine
      metadata:
        name: test-cluster-compute-1
        namespace: mosk-ns
        labels:
          cluster.sigs.k8s.io/cluster-name: test-cluster
          ipam/RackRef: rack-1  # reference to the "rack-1" Rack
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
      spec:
        providerSpec:
          value:
            ...
            nodeLabels:
            - key: rack-id   # node label can be used in "nodeSelectors" inside
              value: rack-1  # "BGPPeer" and/or "BGPAdvertisement" MetalLB objects
        ...
      

      Note

      The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

      Configuration example of the MetalLBConfigTemplate object for the BGP announcement mode
      apiVersion: ipam.mirantis.com/v1alpha1
      kind: MetalLBConfigTemplate
      metadata:
        name: test-cluster-metallb-config-template
        namespace: mosk-ns
        labels:
          cluster.sigs.k8s.io/cluster-name: test-cluster
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
      spec:
        templates:
          ...
          bgpPeers: |
            - name: svc-peer-1
              spec:
                peerAddress: 10.77.42.1
                peerASN: 65100
                myASN: 65101
                nodeSelectors:
                  - matchLabels:
                      rack-id: rack-1  # references the nodes having
                                       # the "rack-id=rack-1" label
          bgpAdvertisements: |
            - name: services
              spec:
                ipAddressPools:
                  - services
                peers:
                  - svc-peer-1
                  ...
      

      Note

      The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

      The bgpPeers and bgpAdvertisements fields are used to configure BGP announcement instead of l2Advertisements.

      The use of BGP for announcement also allows for better balancing of service traffic between cluster nodes as well as gives more configuration control and flexibility for infrastructure administrators. For configuration examples, refer to MetalLBConfigTemplate. For configuration procedure, refer to Configure BGP announcement for cluster API LB address.

    • Not recommended. Configure the configInline value in the MetalLB chart of the Cluster object.

      Warning

      This option is deprecated since MOSK 23.2 and is removed during the management cluster upgrade to the Cluster release 16.0.0, which is introduced in Container Cloud 2.25.0.

      Therefore, this option becomes unavailable on MOSK 23.2 clusters after the parent management cluster upgrade to 2.25.0.

    • Not recommended. Configure the Subnet objects without MetalLBConfigTemplate.

      Warning

      This option is deprecated since MOSK 23.2 and is removed during the management cluster upgrade to the Cluster release 16.0.0, which is introduced in Container Cloud 2.25.0.

      Therefore, this option becomes unavailable on MOSK 23.2 clusters after the parent management cluster upgrade to 2.25.0.

    Caution

    If the MetalLBConfig object is not used for MetalLB configuration related to address allocation and announcement for load-balanced services, then automated migration applies during cluster creation or update to MOSK 23.2.

    During automated migration, the MetalLBConfig and MetalLBConfigTemplate objects are created and the contents of the MetalLB chart configInline value are converted to the parameters of the MetalLBConfigTemplate object.

    Any change to the configInline value made on a MOSK 23.2 cluster will be reflected in the MetalLBConfigTemplate object.

    This automated migration is removed during your management cluster upgrade to the Cluster release 16.0.0, which is introduced in Container Cloud 2.25.0, together with the possibility to use the configInline value of the MetalLB chart. After that, any changes in MetalLB configuration related to address allocation and announcement for load-balanced services are applied using the MetalLBConfig, MetalLBConfigTemplate, and Subnet objects only.

    Before MOSK 23.2

    Select from the following options:

    • Configure Subnet objects. For details, see MetalLB configuration guidelines for subnets.

    • Configure the configInline value for the MetalLB chart in the Cluster object.

    • Configure both the configInline value for the MetalLB chart and Subnet objects.

      The resulting MetalLB address pools configuration will contain address ranges from both cluster specification and Subnet objects. All address ranges for L2 address pools will be aggregated into a single L2 address pool and sorted as strings.
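
    Sorting address ranges as strings, as mentioned above, produces a different order than sorting them by numeric address value. The following Python sketch only illustrates that difference. It does not reproduce how the product merges ranges, and the sample ranges are hypothetical.

```python
import ipaddress


def sort_as_strings(ranges):
    """Lexicographic order: characters are compared one by one."""
    return sorted(ranges)


def sort_numerically(ranges):
    """Order by the numeric value of the first address in each range."""
    return sorted(ranges, key=lambda r: int(ipaddress.ip_address(r.split("-")[0])))


if __name__ == "__main__":
    ranges = [
        "10.0.9.1-10.0.9.10",
        "10.0.11.61-10.0.11.80",
        "10.0.0.61-10.0.0.70",
    ]
    # As strings, "10.0.11..." sorts before "10.0.9..." because "1" < "9".
    print(sort_as_strings(ranges))
    print(sort_numerically(ranges))
```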

    Changes to be applied since MOSK 23.2

    The configuration options above are deprecated since MOSK 23.2, and automated migration of MetalLB parameters applies during cluster creation or update to MOSK 23.2.

    During automated migration, the MetalLBConfig and MetalLBConfigTemplate objects are created and the contents of the MetalLB chart configInline value are converted to the parameters of the MetalLBConfigTemplate object.

    Any change to the configInline value made on a MOSK 23.2 cluster will be reflected in the MetalLBConfigTemplate object.

    This automated migration is removed during your management cluster upgrade to Container Cloud 2.25.0 together with the possibility to use the configInline value of the MetalLB chart. After that, any changes in MetalLB configuration related to address allocation and announcement for load-balanced services will be applied using the MetalLBConfigTemplate and Subnet objects only.

  3. Verify the current MetalLB configuration:

    Verify the MetalLB configuration that is stored in MetalLB objects:

    kubectl -n metallb-system get ipaddresspools,l2advertisements
    

    The example system output:

    NAME                                    AGE
    ipaddresspool.metallb.io/default        129m
    ipaddresspool.metallb.io/services-pxe   129m
    
    NAME                                      AGE
    l2advertisement.metallb.io/default        129m
    l2advertisement.metallb.io/services-pxe   129m
    

    Verify one of the MetalLB objects listed above:

    kubectl -n metallb-system get <object> -o json | jq '.spec'
    

    The example system output for ipaddresspool objects:

    $ kubectl -n metallb-system get ipaddresspool.metallb.io/default -o json | jq '.spec'
    {
      "addresses": [
        "10.0.11.61-10.0.11.80"
      ],
      "autoAssign": true,
      "avoidBuggyIPs": false
    }
    $ kubectl -n metallb-system get ipaddresspool.metallb.io/services-pxe -o json | jq '.spec'
    {
      "addresses": [
        "10.0.0.61-10.0.0.70"
      ],
      "autoAssign": false,
      "avoidBuggyIPs": false
    }
    

    Verify the MetalLB configuration that is stored in the ConfigMap object:

    kubectl -n metallb-system get cm metallb -o jsonpath='{.data.config}'
    

    An example of a successful output:

    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 10.0.11.61-10.0.11.80
    - name: services-pxe
      protocol: layer2
      auto-assign: false
      addresses:
      - 10.0.0.61-10.0.0.70
    

    The auto-assign parameter will be set to false for all address pools except the default one. So, a particular service will get an address from such an address pool only if the Service object has a special metallb.universe.tf/address-pool annotation that points to the specific address pool name.

    Note

    It is expected that every Kubernetes service on a management cluster will be assigned to one of the address pools. Current consideration is to have two MetalLB address pools:

    • services-pxe is a reserved address pool name to use for the Kubernetes services in the PXE network (Ironic API, HTTP server, caching server).

    • default is an address pool to use for all other Kubernetes services in the management network. No annotation is required on the Service objects in this case.

  4. Proceed to creating cluster subnets as described in Create subnets.
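
To illustrate the annotation mentioned in the verification step above, the following minimal sketch shows a Service object that requests an address from the services-pxe address pool. The Service name, namespace, selector, and ports are hypothetical placeholders. Only the metallb.universe.tf/address-pool annotation and the pool name are taken from this section.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-pxe-service          # hypothetical name
  namespace: example-namespace       # hypothetical namespace
  annotations:
    # Selects a pool that has auto-assign set to false
    metallb.universe.tf/address-pool: services-pxe
spec:
  type: LoadBalancer
  selector:
    app: example-pxe-service         # hypothetical selector
  ports:
  - port: 80                         # hypothetical port mapping
    targetPort: 8080
    protocol: TCP
```

A Service without this annotation is expected to get its address from a pool that has auto-assign enabled, such as default in the example output above.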

Configure node selectors for MetalLB speakers

By default, MetalLB speakers are deployed on all Kubernetes nodes except master nodes. You can configure MetalLB to run its speakers on a particular set of nodes. This decreases the number of nodes that must be connected to the external network. In this scenario, only a few nodes are exposed for ingress traffic from the outside world.

To customize a node selector for a MetalLB speaker:

  1. Using kubeconfig of the Container Cloud management cluster, open the MOSK Cluster object for editing:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <OSClusterNamespace> edit cluster <OSClusterName>
    
  2. In the spec:providerSpec:value:helmReleases section, add the speaker.nodeSelector field for metallb:

    spec:
      ...
      providerSpec:
        value:
          ...
          helmReleases:
          - name: metallb
            values:
              ...
              speaker:
                nodeSelector:
                  metallbSpeakerEnabled: "true"
    

    The metallbSpeakerEnabled: "true" parameter in this example is the label on Kubernetes nodes where MetalLB speakers will be deployed. It can be an already existing node label or a new one.

    Note

    The issue [24435] MetalLB speaker fails to announce the LB IP for the Ingress service, which is related to collocation of MetalLB speakers and the OpenStack Ingress service pods, is addressed in MOSK 22.5. For details, see Release Notes: Set externalTrafficPolicy=Local for the OpenStack Ingress service.

    You can add user-defined labels to nodes using the nodeLabels field.

    This field contains the list of node labels to be attached to a node for the user to run certain components on separate cluster nodes. The list of allowed node labels is located in the Cluster object status providerStatus.releaseRef.current.allowedNodeLabels field.

    If the value field is not defined in allowedNodeLabels, a label can have any value. For example:

    allowedNodeLabels:
    - displayName: Stacklight
      key: stacklight
    

    Before or after a machine deployment, add the required label from the allowed node labels list with the corresponding value to spec.providerSpec.value.nodeLabels in machine.yaml. For example:

    nodeLabels:
    - key: stacklight
      value: enabled
    

    Adding a node label that is not in the list of allowed node labels is restricted.

MetalLB configuration guidelines for subnets

Note

Consider this section as obsolete since MOSK 24.2 due to the MetalLBConfigTemplate object deprecation. For details, see Deprecation Notes: MetalLBConfigTemplate object.

Caution

This section also applies to the bootstrap procedure of a management cluster with the following difference: instead of creating the Subnet object, add its configuration to ipam-objects.yaml.template located in kaas-bootstrap/templates/bm/.

The Kubernetes Subnet object is created for a management cluster from templates during bootstrap.

Each Subnet object can define either a MetalLB address range or MetalLB address pool. A MetalLB address pool may contain one or several address ranges. The following rules apply to creation of address ranges or pools:

  • To designate a subnet as a MetalLB address pool or range, use the ipam/SVC-MetalLB label key. Set the label value to "1".

  • The object must contain the cluster.sigs.k8s.io/cluster-name label to reference the name of the target cluster where the MetalLB address pool is used.

  • You may create multiple subnets with the ipam/SVC-MetalLB label to define multiple IP address ranges or multiple address pools for MetalLB in the cluster.

  • The IP addresses of the MetalLB address pool are not assigned to the interfaces on hosts. This subnet is virtual. Do not include such subnets in the L2 template definitions for your cluster.

  • If a Subnet object defines a MetalLB address range, no additional object properties are required.

  • You can use any number of Subnet objects that define a single MetalLB address range. In this case, all address ranges are aggregated into a single MetalLB L2 address pool named services having the auto-assign policy enabled.

  • Intersection of IP address ranges within any single MetalLB address pool is not allowed.

    The bare metal provider verifies intersection of IP address ranges. If it detects intersection, the MetalLB configuration is blocked and the provider logs contain corresponding error messages.
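
As an illustration of the rules above, the following sketch shows a Subnet object that defines a single MetalLB address range. It applies only to releases where this section is still valid. The object name, namespace, CIDR, and range are hypothetical. The spec.cidr and spec.includeRanges field names are an assumption about the Subnet resource, so verify them against the Subnet resource reference for your release.

```yaml
apiVersion: ipam.mirantis.com/v1alpha1
kind: Subnet
metadata:
  name: metallb-range-1                # hypothetical name
  namespace: mosk-ns
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    ipam/SVC-MetalLB: "1"              # designates the subnet for MetalLB
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one   # do not add since MOSK 24.1
spec:
  cidr: 10.0.11.0/24                   # hypothetical network
  includeRanges:                       # assumed field name
  - 10.0.11.61-10.0.11.80
```

Because the object has no metallb/address-pool-* labels, its range is expected to be aggregated into the services address pool with the auto-assign policy enabled, as described in the list above.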

Use the following labels to identify the Subnet object as a MetalLB address pool and configure the name and protocol for that address pool. All labels below are mandatory for the Subnet object that configures a MetalLB address pool.

Mandatory Subnet labels for a MetalLB address pool

Label

Description

Labels to link Subnet to the target MOSK clusters within a management cluster.

cluster.sigs.k8s.io/cluster-name

Specifies the cluster name where the MetalLB address pool is used.

kaas.mirantis.com/region

Specifies the region name of the cluster where the MetalLB address pool is used.

kaas.mirantis.com/provider

Specifies the provider of the cluster where the MetalLB address pool is used.

Note

The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

ipam/SVC-MetalLB

Defines that the Subnet object will be used to provide a new address pool or range for MetalLB.

metallb/address-pool-name

Every address pool must have a distinct name.

The services-pxe address pool is mandatory when configuring a dedicated PXE network in the management cluster. This name will be used in annotations for services exposed through the PXE network.

A bootstrap cluster also uses the services-pxe address pool for its provision services so that management cluster nodes can be provisioned from the bootstrap cluster. After a management cluster is deployed, the bootstrap cluster is deleted and that address pool is solely used by the newly deployed cluster.

metallb/address-pool-auto-assign

Configures the auto-assign policy of an address pool. Boolean.

Caution

For the address pools defined using the MetalLB Helm chart values in the Cluster spec section, auto-assign policy is set to true and is not configurable.

For any service that does not have a specific MetalLB annotation configured, MetalLB allocates external IPs from arbitrary address pools that have the auto-assign policy set to true.

Only for the service that has a specific MetalLB annotation with the address pool name, MetalLB allocates external IPs from the address pool having the auto-assign policy set to false.

metallb/address-pool-protocol

Sets the address pool protocol. The only supported value is layer2 (default).

Caution

Do not set the same address pool name for two or more Subnet objects. Otherwise, the corresponding MetalLB address pool configuration fails with a warning message in the bare metal provider log.

Caution

For the auto-assign policy, the following configuration rules apply:

  • At least one MetalLB address pool must have the auto-assign policy enabled so that unannotated services can have load balancer IPs allocated for them. To satisfy this requirement, either configure one of the address pools using the Subnet object with metallb/address-pool-auto-assign: "true" or configure address range(s) using the Subnet object(s) without metallb/address-pool-* labels.

  • When configuring multiple address pools with the auto-assign policy enabled, keep in mind that it is not determined in advance which pool of those multiple address pools is used to allocate an IP for a particular unannotated service.
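
The mandatory labels described above can be combined as in the following sketch of a Subnet object that defines the services-pxe address pool with the auto-assign policy disabled. The object name, namespace, CIDR, and range are hypothetical, and the spec.cidr and spec.includeRanges field names are an assumption about the Subnet resource.

```yaml
apiVersion: ipam.mirantis.com/v1alpha1
kind: Subnet
metadata:
  name: metallb-pool-pxe               # hypothetical name
  namespace: default                   # hypothetical; project of the target cluster
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    ipam/SVC-MetalLB: "1"
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one   # do not add since MOSK 24.1
    metallb/address-pool-name: services-pxe
    metallb/address-pool-auto-assign: "false"   # label values are strings
    metallb/address-pool-protocol: layer2
spec:
  cidr: 10.0.0.0/24                    # hypothetical network
  includeRanges:                       # assumed field name
  - 10.0.0.61-10.0.0.70
```

With this configuration, at least one other address pool or address range must provide the auto-assign policy so that unannotated services can still obtain load balancer IPs.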

Configure BGP announcement for cluster API LB address

Available since MOSK 23.2.2 TechPreview

When you create a MOSK cluster with the multi-rack topology, where Kubernetes masters are distributed across multiple racks without an L2 layer extension between them, you must configure BGP announcement of the cluster API load balancer address.

For clusters where Kubernetes masters are in the same rack or with an L2 layer extension between masters, you can configure either BGP or L2 (ARP) announcement of the cluster API load balancer address. The L2 (ARP) announcement is used by default and its configuration is covered in Create a managed bare metal cluster.

Caution

Create the Rack and MultiRackCluster objects, which are described in the procedure below, before initiating the provisioning of master nodes to ensure that both BGP and netplan configurations are applied simultaneously during the provisioning process.

To enable the use of BGP announcement for the cluster API LB address:

  1. In the Cluster object, set the useBGPAnnouncement parameter to true:

    spec:
      providerSpec:
        value:
          useBGPAnnouncement: true
    
  2. Create the MultiRackCluster object that is mandatory when configuring BGP announcement for the cluster API LB address. This object enables you to set cluster-wide parameters for configuration of BGP announcement.

    In this scenario, the MultiRackCluster object must be bound to the corresponding Cluster object using the cluster.sigs.k8s.io/cluster-name label.

    Container Cloud uses the bird BGP daemon for announcement of the cluster API LB address. For this reason, set the corresponding bgpdConfigFileName and bgpdConfigFilePath parameters in the MultiRackCluster object, so that bird can locate the configuration file. For details, see the configuration example below.

    The bgpdConfigTemplate parameter contains the default configuration file template for the bird BGP daemon, which you can override in Rack objects.

    The defaultPeer parameter contains default parameters of the BGP connection from master nodes to infrastructure BGP peers, which you can override in Rack objects.

    Configuration example for MultiRackCluster
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: MultiRackCluster
    metadata:
      name: multirack-test-cluster
      namespace: mosk-ns
      labels:
        cluster.sigs.k8s.io/cluster-name: test-cluster
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
    spec:
      bgpdConfigFileName: bird.conf
      bgpdConfigFilePath: /etc/bird
      bgpdConfigTemplate: |
        ...
      defaultPeer:
        localASN: 65101
        neighborASN: 65100
        neighborIP: ""
        password: deadbeef
    

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

    For the object description, see MultiRackCluster.

  3. Create the Rack object(s). This object is mandatory when configuring BGP announcement for the cluster API LB address and it allows you to configure BGP announcement parameters for each rack.

    In this scenario, Rack objects must be bound to Machine objects corresponding to master nodes of the cluster. Each Rack object describes the configuration for the bird BGP daemon used to announce the cluster API LB address from a particular master node or from several master nodes in the same rack.

    The Rack object fields are described in Rack.

  4. Set a reference to the Rack object used to configure the bird BGP daemon for a particular master node to announce the cluster API LB IP:

    In the Machine objects for all master nodes, set the ipam/RackRef label with the value equal to the name of the corresponding Rack object. For example:

    apiVersion: cluster.k8s.io/v1alpha1
    kind: Machine
    metadata:
      labels:
        ipam/RackRef: rack-master-1 # reference to the "rack-master-1" Rack
    ...
    

    In the BareMetalHost objects for all cluster nodes, set the ipam.mirantis.com/rack-ref annotation with the value equal to the name of the corresponding Rack object. For example:

    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHost
    metadata:
      annotations:
        ipam.mirantis.com/rack-ref: rack-master-1 # reference to the "rack-master-1" Rack
    ...
    
  5. Optional. Using the Machine object, define the rack-id node label that is not used for BGP announcement of the cluster API LB IP but can be used for MetalLB.

    The rack-id node label is required for MetalLB node selectors when MetalLB is used to announce LB IP addresses on nodes that are distributed across multiple racks. In this scenario, the L2 (ARP) announcement mode cannot be used for MetalLB because master nodes are in different L2 segments. So, the BGP announcement mode must be used for MetalLB, and node selectors are required to properly configure BGP connections from each node. See Configure and verify MetalLB for details.

    The L2Template object includes the lo interface configuration to set the IP address for the bird BGP daemon that will be advertised as the cluster API LB address. The {{ cluster_api_lb_ip }} function is used in npTemplate to obtain the cluster API LB address value.

    Configuration example for Rack
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Rack
    metadata:
      name: rack-master-1
      namespace: mosk-ns
      labels:
        cluster.sigs.k8s.io/cluster-name: test-cluster
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
    spec:
      bgpdConfigTemplate: |  # optional
        ...
      peeringMap:
        lcm-rack-control-1:
          peers:
          - neighborIP: 10.77.31.2  # "localASN" & "neighborASN" are taken from
          - neighborIP: 10.77.31.3  # "MultiRackCluster.spec.defaultPeer" if
                                    # not set here
    

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

    Configuration example for Machine
    apiVersion: cluster.k8s.io/v1alpha1
    kind: Machine
    metadata:
      name: test-cluster-master-1
      namespace: mosk-ns
      annotations:
        metal3.io/BareMetalHost: mosk-ns/test-cluster-master-1
      labels:
        cluster.sigs.k8s.io/cluster-name: test-cluster
        cluster.sigs.k8s.io/control-plane: controlplane
        hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
        ipam/RackRef: rack-master-1  # reference to the "rack-master-1" Rack
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
    spec:
      providerSpec:
        value:
          kind: BareMetalMachineProviderSpec
          apiVersion: baremetal.k8s.io/v1alpha1
          hostSelector:
            matchLabels:
              kaas.mirantis.com/baremetalhost-id: test-cluster-master-1
          l2TemplateSelector:
            name: test-cluster-master-1
          nodeLabels:            # optional. it is not used for BGP announcement
          - key: rack-id         # of the cluster API LB IP but it can be used
            value: rack-master-1 # for MetalLB if "nodeSelectors" are required
      ...
    

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

    Configuration example for L2Template
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: L2Template
    metadata:
      labels:
        cluster.sigs.k8s.io/cluster-name: test-cluster
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: test-cluster-master-1
      namespace: mosk-ns
    spec:
      ...
      l3Layout:
        - subnetName: lcm-rack-control-1  # this network is referenced
          scope:      namespace           # in the "rack-master-1" Rack
        - subnetName: ext-rack-control-1  # optional. this network is used
          scope:      namespace           # for k8s services traffic and
                                          # MetalLB BGP connections
      ...
      npTemplate: |
        ...
        ethernets:
          lo:
            addresses:
              - {{ cluster_api_lb_ip }}  # function for cluster API LB IP
            dhcp4: false
            dhcp6: false
        ...
    

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

    The configuration example for the scenario where Kubernetes masters are in the same rack or with an L2 layer extension between masters is described in Single rack configuration example.

    The configuration example for the scenario where Kubernetes masters are distributed across multiple racks without L2 layer extension between them is described in Multiple rack configuration example.

Obtain and use details about network interfaces

To simplify operations with L2 templates, before you start creating them, review the general workflow of gathering and processing network interface names.

Network interface naming workflow:

  1. The operator creates a BareMetalHostInventory object.

    Note

    Before update of the management cluster to Container Cloud 2.29.0 (Cluster release 16.4.0), instead of BareMetalHostInventory, use the BareMetalHost object. For details, see BareMetalHost resource.

    Caution

    While the Cluster release of the management cluster is 16.4.0, BareMetalHostInventory operations are allowed to m:kaas@management-admin only. This limitation is lifted once the management cluster is updated to the Cluster release 16.4.1 or later.

  2. The BareMetalHostInventory object executes the introspection stage and becomes ready.

  3. The operator collects information about NIC count, naming, and so on for further changes in the mapping logic.

    At this stage, the NIC order in the object may randomly change during each introspection, but the NIC names are always the same. For more details, see Predictable Network Interface Names.

    For example:

    # Example commands:
    # kubectl -n managed-ns get bmh baremetalhost1 -o custom-columns='NAME:.metadata.name,STATE:.status.provisioning.state'
    # NAME            STATE
    # baremetalhost1  ready
    
    # kubectl -n managed-ns get bmh baremetalhost1 -o yaml
    # Example output:
    
    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHost
    ...
    status:
    ...
        nics:
        - ip: fe80::ec4:7aff:fe6a:fb1f%eno2
          mac: 0c:c4:7a:6a:fb:1f
          model: 0x8086 0x1521
          name: eno2
          pxe: false
        - ip: fe80::ec4:7aff:fe1e:a2fc%ens1f0
          mac: 0c:c4:7a:1e:a2:fc
          model: 0x8086 0x10fb
          name: ens1f0
          pxe: false
        - ip: fe80::ec4:7aff:fe1e:a2fd%ens1f1
          mac: 0c:c4:7a:1e:a2:fd
          model: 0x8086 0x10fb
          name: ens1f1
          pxe: false
        - ip: 192.168.1.151 # Temp. PXE network address
          mac: 0c:c4:7a:6a:fb:1e
          model: 0x8086 0x1521
          name: eno1
          pxe: true
     ...
    
  4. The operator selects from the following options:

  5. The operator creates a Machine or Subnet object.

  6. The baremetal-provider service links the Machine object to the BareMetalHostInventory object.

  7. The kaas-ipam and baremetal-provider services collect hardware information from the BareMetalHostInventory object and use it to configure host networking and services.

  8. The kaas-ipam service:

    1. Spawns the IpamHost object.

    2. Renders the L2Template object.

    3. Spawns the ipaddr object.

    4. Updates the IpamHost object status with all rendered and linked information.

  9. The baremetal-provider service collects the rendered networking information from the IpamHost object.

  10. The baremetal-provider service proceeds with the IpamHost object provisioning.

Now proceed to Create subnets.

Create subnets

After creating the MetalLB configuration as described in Configure and verify MetalLB and before creating L2 templates, ensure that you have the required subnets that can be used in the L2 template to allocate IP addresses for the MOSK cluster nodes. Where required, create a number of subnets for a particular project using the Subnet CR.

Each subnet used in an L2 template has its logical scope that is set using the scope parameter in the corresponding L2Template.spec.l3Layout section. One of the following logical scopes is used for each subnet referenced in an L2 template:

  • global - CR uses the default namespace. A subnet can be used for any cluster located in any project.

  • namespaced - CR uses the namespace that corresponds to a particular project where MOSK clusters are located. A subnet can be used for any cluster located in the same project.

  • cluster - Unsupported since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0). CR uses the namespace where the referenced cluster is located. A subnet is only accessible to the cluster that L2Template.metadata.labels:cluster.sigs.k8s.io/cluster-name (mandatory since MOSK 23.3) or L2Template.spec.clusterRef (deprecated since MOSK 23.3) refers to. The Subnet objects with the cluster scope will be created for every new cluster.
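
As a minimal sketch of how the scopes above are set, the following L2Template fragment references one subnet from the project of the cluster and one subnet from the default namespace. The second subnet name is hypothetical. The namespace value follows the spelling used in the L2Template configuration example in Configure BGP announcement for cluster API LB address, and the global value is assumed to be spelled as in the list above.

```yaml
spec:
  l3Layout:
  # Subnet located in the same project (namespace) as the cluster
  - subnetName: lcm-rack-control-1
    scope: namespace
  # Subnet located in the default namespace, usable by any cluster
  - subnetName: shared-ext-subnet      # hypothetical name
    scope: global
```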

Note

The use of the ipam/SVC-MetalLB label in Subnet objects is unsupported as part of the MetalLBConfigTemplate object deprecation since Container Cloud 2.27.0 (Cluster releases 17.2.0 and 16.2.0). No actions are required for existing objects. A Subnet object containing this label will be ignored by baremetal-provider after cluster update to the mentioned Cluster releases.

You can have subnets with the same name in different projects. In this case, the subnet that has the same project as the cluster will be used. One L2 template may often reference several subnets. In this case, those subnets may have different scopes.

The IP address objects (IPaddr CR) that are allocated from subnets always have the same project as their corresponding IpamHost objects, regardless of the subnet scope.

You can create subnets using either the Container Cloud web UI or CLI.

Service labels and their life cycle

Any Subnet object may contain ipam/SVC-<serviceName> labels. All IP addresses allocated from the Subnet object that has service labels defined inherit those labels.

When a particular IpamHost uses IP addresses allocated from such labeled Subnet objects, the ServiceMap field in IpamHost.Status contains information about which IPs and interfaces correspond to which service labels (that have been set in the Subnet objects). Using ServiceMap, you can understand what IPs and interfaces of a particular host are used for network traffic of a given service.

Container Cloud uses the following service labels that allow using specific subnets for particular Container Cloud services:

  • ipam/SVC-k8s-lcm

  • ipam/SVC-ceph-cluster

  • ipam/SVC-ceph-public

  • ipam/SVC-dhcp-range

  • ipam/SVC-MetalLB (unsupported since 24.3)

  • ipam/SVC-LBhost

Caution

The use of the ipam/SVC-k8s-lcm label is mandatory for every cluster.

Important

A label value is not mandatory and can be empty, but it must match the value in the related L2Template object in which the corresponding subnet is used. Otherwise, network configuration for the related hosts will not be rendered because the subnets cannot be found.
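
The following sketch illustrates the matching requirement with a Subnet object and an L2Template fragment that use the same value for the ipam/SVC-k8s-lcm label. The object names, CIDR, gateway, and range are hypothetical. The labelSelector field in l3Layout and the spec.cidr, spec.gateway, and spec.includeRanges fields are assumptions about the resource schemas, so verify them against the L2Template and Subnet resource references for your release.

```yaml
apiVersion: ipam.mirantis.com/v1alpha1
kind: Subnet
metadata:
  name: lcm-rack-control-1
  namespace: mosk-ns
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    ipam/SVC-k8s-lcm: "1"              # value must match the L2Template
    kaas.mirantis.com/provider: baremetal
spec:
  cidr: 10.0.1.0/24                    # hypothetical network
  gateway: 10.0.1.1                    # hypothetical gateway
  includeRanges:                       # assumed field name
  - 10.0.1.50-10.0.1.100
---
apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  name: test-cluster-master-1
  namespace: mosk-ns
  labels:
    cluster.sigs.k8s.io/cluster-name: test-cluster
    kaas.mirantis.com/provider: baremetal
spec:
  l3Layout:
  - subnetName: lcm-rack-control-1
    scope: namespace
    labelSelector:                     # assumed field name
      ipam/SVC-k8s-lcm: "1"            # same value as in the Subnet label
```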

You can also add custom service labels to the Subnet objects the same way you add Container Cloud service labels. The mapping of IPs and interfaces to the defined services is displayed in IpamHost.Status.ServiceMap.

You can assign multiple service labels to one network. You can also assign the ceph-* and dhcp-range services to multiple networks. In the latter case, the system sorts the IP addresses in ascending order:

serviceMap:
  ipam/SVC-ceph-cluster:
    - ifName: ceph-br2
      ipAddress: 10.0.10.11
    - ifName: ceph-br1
      ipAddress: 10.0.12.22
  ipam/SVC-ceph-public:
    - ifName: ceph-public
      ipAddress: 10.1.1.15
  ipam/SVC-k8s-lcm:
    - ifName: k8s-lcm
      ipAddress: 10.0.1.52

You can add service labels during creation of subnets as described in Create subnets.

Create subnets for a managed cluster using web UI

After creating the MetalLB configuration as described in Configure and verify MetalLB and before creating an L2 template, create the required subnets to use in the L2 template to allocate IP addresses for the managed cluster nodes.

To create subnets for a managed cluster using web UI:

  1. Log in to the Container Cloud web UI with the operator permissions.

  2. Switch to the required non-default project using the Switch Project action icon located on top of the main left-side navigation panel.

    Caution

    Do not create a MOSK cluster in the default project (Kubernetes namespace), which is dedicated for the management cluster only. If no projects are defined, first create a new mosk project as described in Create a project for MOSK clusters.

  3. Create basic cluster settings as described in Create a managed bare metal cluster.

  4. Select one of the following options:

    1. In the left sidebar, navigate to Networks. The Subnets tab opens.

    2. Click Create Subnet.

    3. Fill out the Create subnet form as required:

      • Name

        Subnet name.

      • Subnet Type

        Subnet type:

        • DHCP Optional

          DHCP subnet that configures DHCP address ranges used by the DHCP server on the management cluster. For details, see Configure multiple DHCP address ranges.

        • LB

          Cluster API LB subnet.

        • LCM

          LCM subnet(s).

        • Storage access Optional

          Available in the web UI since Container Cloud 2.28.0 (17.3.0 and 16.3.0). Storage access subnet.

        • Storage replication Optional

          Available in the web UI since Container Cloud 2.28.0 (17.3.0 and 16.3.0). Storage replication subnet.

        • Custom Optional

          Custom subnet. For example, external or for Kubernetes workloads. For details, see optional steps in Create subnets for a managed cluster using CLI.

        • MetalLB

          Services subnet(s).

          Warning

          Since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0), disregard this parameter during subnet creation. Configure MetalLB separately as described in Configure and verify MetalLB.

          This parameter is removed from the Container Cloud web UI in Container Cloud 2.29.0 (Cluster releases 17.4.0 and 16.4.0).

        For description of subnet types in a managed cluster, see MOSK cluster networking.

      • Cluster

        Cluster name that the subnet is being created for. Required for all subnet types except DHCP.

      • CIDR

        A valid IPv4 address of the subnet in the CIDR notation, for example, 10.11.0.0/24.

      • Include Ranges Optional

        A comma-separated list of IP address ranges within the given CIDR that should be used in the allocation of IPs for nodes. The gateway, network, broadcast, and DNS addresses will be excluded (protected) automatically if they intersect with one of the ranges. The IPs outside the given ranges will not be used in the allocation. Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77.

        Warning

        Do not use values that are out of the given CIDR.

      • Exclude Ranges Optional

        A comma-separated list of IP address ranges within the given CIDR that should not be used in the allocation of IPs for nodes. The IPs within the given CIDR but outside the given ranges will be used in the allocation. The gateway, network, broadcast, and DNS addresses will be excluded (protected) automatically if they are included in the CIDR. Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77.

        Warning

        Do not use values that are out of the given CIDR.

      • Gateway Optional

        A valid IPv4 gateway address, for example, 10.11.0.9. Does not apply to the MetalLB subnet.

      • Nameservers

        IP addresses of nameservers separated by a comma. Does not apply to the DHCP and MetalLB subnet types.

      • Use whole CIDR

        Optional. Select to use the whole IPv4 address range that is set in the CIDR field. Useful when defining a single IP address (/32), for example, in the Cluster API load balancer (LB) subnet.

        If not set, the network address and broadcast address in the IP subnet are excluded from the address allocation.

      • Labels

        Key-value pairs attached to the selected subnet:

        Caution

        The values of the created subnet labels must match the ones in the spec.l3Layout section of the corresponding L2Template object.

        • Optional user-defined labels to distinguish different subnets of the same type. For an example of user-defined labels, see Expand IP addresses capacity in an existing cluster.

          The following special values define the storage subnets:

          • ipam/SVC-ceph-cluster

          • ipam/SVC-ceph-public

          For more examples of label usage, see Service labels and their life cycle and Create subnets for a managed cluster using CLI.

          Click Add a label and assign the first custom label with the required name and value. To assign consecutive labels, use the + button located on the right side of the Labels section.

        • MetalLB:

          Warning

          Since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0), disregard this label during subnet creation. Configure MetalLB separately as described in Configure and verify MetalLB.

          The label will be removed from the Container Cloud web UI in one of the following releases.

          • metallb/address-pool-name

            Name of the subnet address pool. Example values: services, default, external, services-pxe.

            The latter value is dedicated to management clusters only. For details about address pool names of a management cluster, see Separate PXE and management networks.

          • metallb/address-pool-auto-assign

            Enables automatic assignment of address pool. Boolean.

          • metallb/address-pool-protocol

            Defines the address pool protocol. Possible values:

            • layer2 - announcement using the ARP protocol.

            • bgp - announcement using the BGP protocol. Technology Preview.

            For a description of these protocols, refer to the MetalLB documentation.

    4. Click Create.

    5. In the Networks tab, verify the status of the created subnet:

      • Ready - object is operational.

      • Error - object is non-operational. Hover over the status to obtain details of the issue.

      Note

      To verify subnet details, in the Networks tab, click the More action icon in the last column of the required subnet and select Subnet info.

    1. In the Clusters tab, click the required cluster and scroll down to the Subnets section.

    2. Click Add Subnet.

    3. Fill out the Add new subnet form as required:

      • Subnet Name

        Subnet name.

      • CIDR

        A valid IPv4 CIDR, for example, 10.11.0.0/24.

      • Include Ranges Optional

        A comma-separated list of IP address ranges within the given CIDR that should be used in the allocation of IPs for nodes. The gateway, network, broadcast, and DNS addresses will be excluded (protected) automatically if they intersect with one of the ranges. The IPs outside the given ranges will not be used in the allocation. Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77.

        Warning

        Do not use values that are out of the given CIDR.

      • Exclude Ranges Optional

        A comma-separated list of IP address ranges within the given CIDR that should not be used in the allocation of IPs for nodes. The IPs within the given CIDR but outside the given ranges will be used in the allocation. The gateway, network, broadcast, and DNS addresses will be excluded (protected) automatically if they are included in the CIDR. Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77.

        Warning

        Do not use values that are out of the given CIDR.

      • Gateway Optional

        A valid gateway address, for example, 10.11.0.9.

    4. Click Create.

Proceed to creating L2 templates as described in Create L2 templates.

Create subnets for a managed cluster using CLI

After creating the MetalLB configuration as described in Configure and verify MetalLB and before creating an L2 template, create the required subnets to use in the L2 template to allocate IP addresses for the managed cluster nodes.

Prerequisites for a multi-rack cluster
Create subnets using CLI
  1. Create a cluster using one of the following options:

  2. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  3. Create the subnet.yaml file with a number of global or namespaced subnets depending on the configuration of your cluster, and apply it to the management cluster:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> apply -f <SubnetFileName.yaml>
    

    Note

    In the command above and in the steps below, substitute the parameters enclosed in angle brackets with the corresponding values.

    Example of a subnet.yaml file:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: demo
      namespace: demo-namespace
      labels:
        kaas.mirantis.com/provider: baremetal
    spec:
      cidr: 10.11.0.0/24
      gateway: 10.11.0.9
      includeRanges:
      - 10.11.0.5-10.11.0.70
      nameservers:
      - 172.18.176.6
    

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

    Specification fields of the Subnet object

    Parameter

    Description

    cidr (singular)

    A valid IPv4 CIDR, for example, 10.11.0.0/24.

    includeRanges (list)

    A comma-separated list of IP address ranges within the given CIDR that should be used in the allocation of IPs for nodes. The gateway, network, broadcast, and DNS addresses will be excluded (protected) automatically if they intersect with one of the ranges. The IPs outside the given ranges will not be used in the allocation. Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77.

    Warning

    Do not use values that are out of the given CIDR.

    excludeRanges (list)

    A comma-separated list of IP address ranges within the given CIDR that should not be used in the allocation of IPs for nodes. The IPs within the given CIDR but outside the given ranges will be used in the allocation. The gateway, network, broadcast, and DNS addresses will be excluded (protected) automatically if they are included in the CIDR. Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77.

    Warning

    Do not use values that are out of the given CIDR.

    useWholeCidr (boolean)

    If set to true, the subnet address (10.11.0.0 in the example above) and the broadcast address (10.11.0.255 in the example above) are included in the address allocation for nodes. Otherwise (false by default), the subnet address and broadcast address are excluded from the address allocation.

    gateway (singular)

    A valid gateway address, for example, 10.11.0.9.

    nameservers (list)

    A list of the IP addresses of name servers. Each element of the list is a single address, for example, 172.18.176.6.
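
    The interplay of cidr, includeRanges, excludeRanges, and useWholeCidr described above can be illustrated with a minimal Python sketch. It is not the product IPAM implementation. The function names are hypothetical, and two behaviors are assumptions that the field descriptions do not state: exclude ranges are applied after include ranges when both are set, and the gateway and name server addresses stay protected even when useWholeCidr is true.

```python
import ipaddress


def parse_range(item):
    """Turn '10.11.0.5-10.11.0.70' or '10.11.0.77' into (first, last) integers."""
    first, _, last = item.partition("-")
    low = int(ipaddress.IPv4Address(first.strip()))
    high = int(ipaddress.IPv4Address(last.strip())) if last else low
    if low > high:
        raise ValueError(f"range start is after range end: {item}")
    return low, high


def allocatable_addresses(cidr, include_ranges=(), exclude_ranges=(),
                          gateway=None, nameservers=(), use_whole_cidr=False):
    """Return the addresses left for nodes as sorted dotted-quad strings."""
    net = ipaddress.IPv4Network(cidr)
    net_low, net_high = int(net.network_address), int(net.broadcast_address)

    included = [parse_range(r) for r in include_ranges]
    excluded = [parse_range(r) for r in exclude_ranges]
    for low, high in included + excluded:
        # Mirrors the warning: do not use values that are out of the given CIDR.
        if low < net_low or high > net_high:
            raise ValueError(f"range is outside {cidr}")

    # Without includeRanges, the whole CIDR is the starting pool.
    pool = set()
    for low, high in included or [(net_low, net_high)]:
        pool.update(range(low, high + 1))
    # Assumption: exclude ranges are subtracted after include ranges.
    for low, high in excluded:
        pool.difference_update(range(low, high + 1))

    # Protected addresses: gateway and DNS always (assumption), network and
    # broadcast addresses unless useWholeCidr is set.
    protected = {int(ipaddress.IPv4Address(a)) for a in nameservers}
    if gateway:
        protected.add(int(ipaddress.IPv4Address(gateway)))
    if not use_whole_cidr:
        protected.update((net_low, net_high))

    return [str(ipaddress.IPv4Address(a)) for a in sorted(pool - protected)]
```

    Under these assumptions, the demo subnet from the example above (include range 10.11.0.5-10.11.0.70 with the gateway 10.11.0.9) yields 65 addresses, because the gateway falls inside the include range and is skipped. A /32 CIDR yields an address only when use_whole_cidr is True.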

    Configuration rules:

    • The subnet for the LCM network must contain the ipam/SVC-k8s-lcm: "1" label. For details, see Service labels and their life cycle.

    • Each cluster must use at least one subnet for its LCM network. Every node must have the address allocated in the LCM network using such subnet(s).

    • Each node of every cluster must have one and only one IP address in the LCM network that is allocated from one of the Subnet objects having the ipam/SVC-k8s-lcm label defined. Therefore, all Subnet objects used for LCM networks must have the ipam/SVC-k8s-lcm label defined.

    • You can use any interface name for the LCM network traffic. The Subnet objects for the LCM network must have the ipam/SVC-k8s-lcm label. For details, see Service labels and their life cycle.

    Note

    You may use different subnets to allocate IP addresses to different Container Cloud components in your cluster. Add a label with the ipam/SVC- prefix to each subnet that is used to configure a Container Cloud service. For details, see Service labels and their life cycle and the optional steps below.

  4. Configure DHCP relay agents on the edges of the broadcast domains in the provisioning network, as needed.

    Make sure to assign the IP address ranges you want to allocate to the hosts using DHCP for discovery and inspection. Create subnets using these IP parameters. Specify the IP address of your DHCP relay as the default gateway in the corresponding Subnet object.

    Caution

    Support of multiple DHCP ranges has the following limitations:

    • Using custom DNS server addresses for servers that boot over PXE is not supported.

    • The Subnet objects for DHCP ranges cannot be associated with any specific cluster, as the DHCP server configuration is only applicable to the management cluster where the DHCP server is running. The cluster.sigs.k8s.io/cluster-name label will be ignored.

    Configuration examples:

    Single-rack cluster
    apiVersion: "ipam.mirantis.com/v1alpha1"
    kind: Subnet
    metadata:
      name: mgmt-dhcp
      namespace: default
      labels:
        kaas.mirantis.com/provider: baremetal
        ipam/SVC-dhcp-range: "presents"
    spec:
      cidr: 10.20.10.0/24
      includeRanges:
        - 10.20.10.10-10.20.10.20
    
    Multi-rack cluster
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-1-dhcp
      namespace: default
      labels:
        ipam/SVC-dhcp-range: "1"
        kaas.mirantis.com/provider: baremetal
    spec:
      cidr: 10.20.101.0/24
      gateway: 10.20.101.1
      includeRanges:
        - 10.20.101.16-10.20.101.127
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-2-dhcp
      namespace: default
      labels:
        ipam/SVC-dhcp-range: "1"
        kaas.mirantis.com/provider: baremetal
    spec:
      cidr: 10.20.102.0/24
      gateway: 10.20.102.1
      includeRanges:
        - 10.20.102.16-10.20.102.127
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-3-dhcp
      namespace: default
      labels:
        ipam/SVC-dhcp-range: "1"
        kaas.mirantis.com/provider: baremetal
    spec:
      cidr: 10.20.103.0/24
      gateway: 10.20.103.1
      includeRanges:
        - 10.20.103.16-10.20.103.127
    ---
    # Add more Subnet object templates as required using the above example
    # (one subnet per rack)
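
    Because the per-rack DHCP subnets above differ only in the rack number, they can be generated instead of copied by hand. The following Python sketch is one possible way to do that. The helper names are hypothetical, and the addressing scheme (10.20.<100 + rack>.0/24, gateway .1, range .16-.127) is only inferred from the three example objects. The output is JSON, which is also valid YAML, so no third-party YAML library is needed.

```python
import json


def rack_dhcp_subnet(rack, namespace="default"):
    """Build a DHCP-range Subnet manifest for the given rack number."""
    if not 1 <= rack <= 155:
        raise ValueError("rack number must keep the third octet within 101-255")
    prefix = f"10.20.{100 + rack}"
    return {
        "apiVersion": "ipam.mirantis.com/v1alpha1",
        "kind": "Subnet",
        "metadata": {
            "name": f"rack-{rack}-dhcp",
            "namespace": namespace,
            "labels": {
                # Label values are strings, as in the examples above.
                "ipam/SVC-dhcp-range": "1",
                "kaas.mirantis.com/provider": "baremetal",
            },
        },
        "spec": {
            "cidr": f"{prefix}.0/24",
            # The DHCP relay address of the rack acts as the default gateway.
            "gateway": f"{prefix}.1",
            "includeRanges": [f"{prefix}.16-{prefix}.127"],
        },
    }


def render_manifests(racks):
    """Join one JSON document per rack with the YAML document separator."""
    return "\n---\n".join(
        json.dumps(rack_dhcp_subnet(rack), indent=2) for rack in racks
    )


if __name__ == "__main__":
    print(render_manifests([1, 2, 3]))
```

    Redirect the printed output to a file and review it before applying it with the kubectl apply command shown in step 3.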
    
  5. Optional. Add subnets for configuring multiple DHCP ranges. For details, see Configure multiple DHCP address ranges.

  6. Add one or more subnets for the LCM network:

    • Set the ipam/SVC-k8s-lcm label with the value "1" to create a subnet that will be used to assign IP addresses in the LCM network.

    • Optional. Set the cluster.sigs.k8s.io/cluster-name label to the name of the target cluster during the subnet creation.

    • Use this subnet in the L2 template for cluster nodes.

    • Using the L2 template, assign this subnet to the interface connected to your LCM network.

    Precautions for the LCM network usage

    • Each cluster must use at least one subnet for its LCM network. Every node must have the address allocated in the LCM network using such subnet(s).

    • Each node of every cluster must have one and only one IP address in the LCM network that is allocated from one of the Subnet objects having the ipam/SVC-k8s-lcm label defined. Therefore, all Subnet objects used for LCM networks must have the ipam/SVC-k8s-lcm label defined.

    • You can use any interface name for the LCM network traffic. The Subnet objects for the LCM network must have the ipam/SVC-k8s-lcm label. For details, see Service labels and their life cycle.

    Configuration examples:

    Single-rack cluster
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      labels:
        kaas.mirantis.com/provider: baremetal
        ipam/SVC-k8s-lcm: "1"
      name: lcm-nw
      namespace: <MOSKClusterNamespace>
    spec:
      cidr: 172.16.43.0/24
      gateway: 172.16.43.1
      includeRanges:
      - 172.16.43.10-172.16.43.100
      nameservers:
        - 8.8.8.8
    
    Multi-rack cluster
    Example mosk-racks-lcm-subnets.yaml

    Note

    Subnet labels such as rack-x-lcm, rack-api-lcm, and so on are optional. You can use them in L2 templates to select Subnet objects by label.

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-1-lcm
      namespace: mosk-namespace-name
      labels:
        ipam/SVC-k8s-lcm: "1"
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-1-lcm: "true"
    spec:
      cidr: 10.20.111.0/24
      gateway: 10.20.111.1
      includeRanges:
        - 10.20.111.16-10.20.111.255
      nameservers:
        - 8.8.8.8
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-2-lcm
      namespace: mosk-namespace-name
      labels:
        ipam/SVC-k8s-lcm: "1"
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-2-lcm: "true"
    spec:
      cidr: 10.20.112.0/24
      gateway: 10.20.112.1
      includeRanges:
        - 10.20.112.16-10.20.112.255
      nameservers:
        - 8.8.8.8
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-3-lcm
      namespace: mosk-namespace-name
      labels:
        ipam/SVC-k8s-lcm: "1"
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-3-lcm: "true"
    spec:
      cidr: 10.20.113.0/24
      gateway: 10.20.113.1
      includeRanges:
        - 10.20.113.16-10.20.113.255
      nameservers:
        - 8.8.8.8
    ---
    # Add more subnet object templates as required using the above example
    # (one subnet per rack)
    
    Example mosk-racks-api-lcm-subnet.yaml

    Note

    Since 23.2.2, MOSK supports full L3 networking topology in the Technology Preview scope. This enables deployment of specific cluster segments in dedicated racks without the need for L2 layer extension between them. For the configuration procedure, see Configure BGP announcement for cluster API LB address and Configure BGP announcement of external addresses of Kubernetes load-balanced services in the Deployment Guide.

    If BGP announcement is configured for the MOSK cluster API LB address, the API/LCM network is not required. Announcement of the cluster API LB address is done using the LCM network.

    If you configure ARP announcement of the load-balancer IP address for the MOSK cluster API, the API/LCM network must be configured on the Kubernetes manager nodes of the cluster. This network contains the Kubernetes API endpoint with the VRRP virtual IP address.

    This is the IP address space that Container Cloud uses to ensure communication between the LCM agents and the management API. These addresses are also used by Kubernetes nodes for communication. The addresses from the subnet are assigned to all Kubernetes manager nodes of the MOSK cluster.

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-api-lcm
      namespace: mosk-namespace-name
      labels:
        ipam/SVC-k8s-lcm: "1"
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-api-lcm: "true"
    spec:
      cidr: 10.20.110.0/24
      gateway: 10.20.110.1
      includeRanges:
        - 10.20.110.16-10.20.110.25
      nameservers:
        - 8.8.8.8
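
    Before applying a set of Subnet objects, some of the LCM precautions above can be checked offline. The Python sketch below is a hypothetical helper, not a product tool. It assumes the manifests are already loaded as dictionaries, and it checks only two things: that every label value is a string, which Kubernetes requires, and that at least one subnet usable by the given cluster carries the ipam/SVC-k8s-lcm label. A subnet without the optional cluster-name label is treated as usable by any cluster, which is an assumption.

```python
LCM_LABEL = "ipam/SVC-k8s-lcm"
CLUSTER_LABEL = "cluster.sigs.k8s.io/cluster-name"


def check_subnets(manifests, cluster_name):
    """Return a list of human-readable problems found in Subnet manifests."""
    problems = []
    lcm_found = False
    for manifest in manifests:
        metadata = manifest.get("metadata", {})
        labels = metadata.get("labels", {})
        name = metadata.get("name", "<unnamed>")

        # An unquoted true or 1 in YAML loads as a bool or int, not a string.
        for key, value in labels.items():
            if not isinstance(value, str):
                problems.append(
                    f"{name}: label {key} has non-string value {value!r}"
                )

        # The cluster-name label is optional on LCM subnets (assumption:
        # a subnet without it can serve any cluster).
        owner = labels.get(CLUSTER_LABEL, cluster_name)
        if LCM_LABEL in labels and owner == cluster_name:
            lcm_found = True

    if not lcm_found:
        problems.append(f"no subnet with the {LCM_LABEL} label for {cluster_name}")
    return problems
```

    An empty list means that neither problem was detected. It does not prove that every node receives exactly one LCM address, because that also depends on the L2 templates.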
    
  7. Optional. Add a subnet for external connection to the Kubernetes services exposed by the MOSK cluster. The network is used to expose the OpenStack, StackLight, and other MOSK services. Configuration examples:

    Single-rack cluster
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      labels:
        kaas.mirantis.com/provider: baremetal
      name: k8s-ext-subnet
      namespace: <MOSKClusterNamespace>
    spec:
      cidr: 172.16.45.0/24
      gateway: 172.16.45.1
      includeRanges:
      - 172.16.45.10-172.16.45.100
      nameservers:
        - 8.8.8.8
    
    Multi-rack cluster

    Note

    Since 23.2.2, MOSK supports full L3 networking topology in the Technology Preview scope. This enables deployment of specific cluster segments in dedicated racks without the need for L2 layer extension between them. For the configuration procedure, see Configure BGP announcement for cluster API LB address and Configure BGP announcement of external addresses of Kubernetes load-balanced services in the Deployment Guide.

    If you configure BGP announcement for IP addresses of load-balanced services of a MOSK cluster, the external network can consist of multiple VLAN segments connected to all nodes of a MOSK cluster where MetalLB speaker components are configured to announce IP addresses for Kubernetes load-balanced services. Mirantis recommends that you use OpenStack controller nodes for this purpose.

    If you configure ARP announcement for IP addresses of load-balanced services of a MOSK cluster, the external network must consist of a single VLAN stretched to the ToR switches of all the racks where MOSK nodes connected to the external network are located. Those are the nodes where MetalLB speaker components are configured to announce IP addresses for Kubernetes load-balanced services. Mirantis recommends that you use OpenStack controller nodes for this purpose.

    The subnets are used to assign addresses to the external interfaces of the MOSK controller nodes and will be used to assign the default gateway to these hosts. The default gateway for other hosts of the MOSK cluster is assigned using the LCM and optionally API/LCM subnets.

    Example of a subnet where a single VLAN segment is stretched to all MOSK controller nodes:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: k8s-external
      namespace: mosk-namespace-name
      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        k8s-external: "true"
    spec:
      cidr: 10.20.120.0/24
      gateway: 10.20.120.1 # This will be the default gateway on hosts
      includeRanges:
        - 10.20.120.16-10.20.120.20
      nameservers:
        - 8.8.8.8
    

    Example of subnets where separate VLAN segments per rack are used:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-1-k8s-ext
      namespace: mosk-namespace-name
      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-1-k8s-ext: "true"
    spec:
      cidr: 10.20.121.0/24
      gateway: 10.20.121.1 # This will be the default gateway on hosts
      includeRanges:
        - 10.20.121.16-10.20.121.20
      nameservers:
        - 8.8.8.8
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-2-k8s-ext
      namespace: mosk-namespace-name
      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-2-k8s-ext: "true"
    spec:
      cidr: 10.20.122.0/24
      gateway: 10.20.122.1 # This will be the default gateway on hosts
      includeRanges:
        - 10.20.122.16-10.20.122.20
      nameservers:
        - 8.8.8.8
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-3-k8s-ext
      namespace: mosk-namespace-name
      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-3-k8s-ext: "true"
    spec:
      cidr: 10.20.123.0/24
      gateway: 10.20.123.1 # This will be the default gateway on hosts
      includeRanges:
        - 10.20.123.16-10.20.123.20
      nameservers:
        - 8.8.8.8
    

    Configuration rules:

    • Make sure that loadBalancerHost is set to "" (empty string) in the Cluster spec.

      spec:
        providerSpec:
          value:
            apiVersion: baremetal.k8s.io/v1alpha1
            kind: BaremetalClusterProviderSpec
            ...
            loadBalancerHost: ""
      
    • Create a subnet with the ipam/SVC-LBhost label having the "1" value to make the baremetal-provider use this subnet for allocation of addresses for cluster API endpoints. One IP address will be allocated for each cluster to serve its Kubernetes/MKE API endpoint.

    • Make sure that master nodes have host local-link addresses in the same subnet as the cluster API endpoint address. These host IP addresses will be used for VRRP traffic. The cluster API endpoint address will be assigned to the same interface on one of the master nodes where these host IP addresses are assigned.

    • Mirantis highly recommends that you assign the cluster API endpoint address from the LCM or external network. For details on cluster network types, refer to MOSK cluster networking.

    To add an address allocation scope of API endpoints, create a subnet in the corresponding namespace with a reference to the target cluster using the cluster.sigs.k8s.io/cluster-name label. For example:

    apiVersion: "ipam.mirantis.com/v1alpha1"
    kind: Subnet
    metadata:
      name: lbhost-mgmt-cluster
      namespace: default
      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mgmt-cluster
        ipam/SVC-LBhost: "1"
    spec:
      cidr: "10.0.30.100/32"
      useWholeCidr: true
    
  8. Optional. Add one or more subnets for the storage access network. Ceph will automatically use this subnet for its external connections. A Ceph OSD will look for and bind to an address from this subnet when it is started on a machine. Configuration examples:

    Single-rack cluster
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      labels:
        kaas.mirantis.com/provider: baremetal
        ipam/SVC-ceph-public: "1"
        cluster.sigs.k8s.io/cluster-name: <MOSKClusterName>
      name: ceph-public-subnet
      namespace: <MOSKClusterNamespace>
    spec:
      cidr: 10.12.0.0/24
    
    Multi-rack cluster

    This network may have per-rack VLANs and IP subnets. The addresses from the subnets are assigned to all MOSK cluster nodes besides Kubernetes manager nodes.

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-1-ceph-public
      namespace: mosk-namespace-name
      labels:
        ipam/SVC-ceph-public: "1"
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-1-ceph-public: "true"
    spec:
      cidr: 10.20.131.0/24
      gateway: 10.20.131.1
      includeRanges:
        - 10.20.131.16-10.20.131.255
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-2-ceph-public
      namespace: mosk-namespace-name
      labels:
        ipam/SVC-ceph-public: "1"
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-2-ceph-public: "true"
    spec:
      cidr: 10.20.132.0/24
      gateway: 10.20.132.1
      includeRanges:
        - 10.20.132.16-10.20.132.255
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-3-ceph-public
      namespace: mosk-namespace-name
      labels:
        ipam/SVC-ceph-public: "1"
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-3-ceph-public: "true"
    spec:
      cidr: 10.20.133.0/24
      gateway: 10.20.133.1
      includeRanges:
        - 10.20.133.16-10.20.133.255
    ---
    # Add more Subnet object templates as required using the above example
    # (one subnet per rack)
    

    Configuration rules:

    • Set the ipam/SVC-ceph-public label with the value "1" to create a subnet that will be used to configure the Ceph public network.

    • Set the cluster.sigs.k8s.io/cluster-name label to the name of the target cluster during the subnet creation.

    • Use this subnet in the L2 template for all cluster nodes except Kubernetes manager nodes.

    • Assign this subnet to the interface connected to your storage access network.

  9. Optional. Add one or more subnets for the storage replication network. Ceph will automatically use this network for its internal replication traffic. Configuration examples:

    Single-rack cluster
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      labels:
        kaas.mirantis.com/provider: baremetal
        ipam/SVC-ceph-cluster: "1"
        cluster.sigs.k8s.io/cluster-name: <MOSKClusterName>
      name: ceph-cluster-subnet
      namespace: <MOSKClusterNamespace>
    spec:
      cidr: 10.12.1.0/24
    
    Multi-rack cluster

    This network may have per-rack VLANs and IP subnets. The addresses from the subnets are assigned to storage nodes in the MOSK cluster.

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-1-ceph-cluster
      namespace: mosk-namespace-name
      labels:
        ipam/SVC-ceph-cluster: "1"
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-1-ceph-cluster: "true"
    spec:
      cidr: 10.20.141.0/24
      gateway: 10.20.141.1
      includeRanges:
        - 10.20.141.16-10.20.141.255
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-2-ceph-cluster
      namespace: mosk-namespace-name
      labels:
        ipam/SVC-ceph-cluster: "1"
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-2-ceph-cluster: "true"
    spec:
      cidr: 10.20.142.0/24
      gateway: 10.20.142.1
      includeRanges:
        - 10.20.142.16-10.20.142.255
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-3-ceph-cluster
      namespace: mosk-namespace-name
      labels:
        ipam/SVC-ceph-cluster: "1"
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-3-ceph-cluster: "true"
    spec:
      cidr: 10.20.143.0/24
      gateway: 10.20.143.1
      includeRanges:
        - 10.20.143.16-10.20.143.255
    ---
    # Add more Subnet object templates as required using the above example
    # (one subnet per rack)
    

    Configuration rules:

    • Set the ipam/SVC-ceph-cluster label with the value "1" to create a subnet that will be used to configure the Ceph cluster network.

    • Set the cluster.sigs.k8s.io/cluster-name label to the name of the target cluster during the subnet creation.

    • Use this subnet in the L2 template for storage nodes.

    • Assign this subnet to the interface connected to your storage replication network.

  10. Optional. Add a subnet for the Kubernetes Pods traffic. The addresses from this subnet are assigned to interfaces connected to the Kubernetes workloads network and used by Calico CNI as underlay for traffic between the pods in the Kubernetes cluster. Configuration examples:

    Single-rack cluster
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      labels:
        kaas.mirantis.com/provider: baremetal
      name: k8s-pods-subnet
      namespace: <MOSKClusterNamespace>
    spec:
      cidr: 10.12.3.0/24
      includeRanges:
      - 10.12.3.10-10.12.3.100
    
    Multi-rack cluster

    This network may include multiple per-rack VLANs and IP subnets. The addresses from the subnets are assigned to all MOSK cluster nodes. For details, see Network types.

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-1-k8s-pods
      namespace: mosk-namespace-name
      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-1-k8s-pods: "true"
    spec:
      cidr: 10.20.151.0/24
      gateway: 10.20.151.1
      includeRanges:
        - 10.20.151.16-10.20.151.255
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-2-k8s-pods
      namespace: mosk-namespace-name
      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-2-k8s-pods: "true"
    spec:
      cidr: 10.20.152.0/24
      gateway: 10.20.152.1
      includeRanges:
        - 10.20.152.16-10.20.152.255
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-3-k8s-pods
      namespace: mosk-namespace-name
      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-3-k8s-pods: "true"
    spec:
      cidr: 10.20.153.0/24
      gateway: 10.20.153.1
      includeRanges:
        - 10.20.153.16-10.20.153.255
    ---
    # Add more Subnet object templates as required using the above example
    # (one subnet per rack)
    

    Configuration rules:

    • Use this subnet in the L2 template for all nodes in the cluster.

    • Use the npTemplate.bridges.k8s-pods bridge name in the L2 template. This bridge name is reserved for the Kubernetes workloads network. When the k8s-pods bridge is defined in an L2 template, Calico CNI uses that network for routing the Pods traffic between nodes.

  11. Optional. Add a subnet for the MOSK overlay network. This is the underlay network for VXLAN tunnels for the MOSK tenant traffic. If deployed with Tungsten Fabric, it is used for the MPLS over UDP+GRE traffic. Configuration examples:

    Single-rack cluster
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      labels:
        kaas.mirantis.com/provider: baremetal
      name: neutron-tunnel-subnet
      namespace: <MOSKClusterNamespace>
    spec:
      cidr: 10.12.2.0/24
      includeRanges:
      - 10.12.2.10-10.12.2.100
    
    Multi-rack cluster
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-1-tenant-tunnel
      namespace: mosk-namespace-name
      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-1-tenant-tunnel: "true"
    spec:
      cidr: 10.20.161.0/24
      gateway: 10.20.161.1
      includeRanges:
        - 10.20.161.16-10.20.161.255
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-2-tenant-tunnel
      namespace: mosk-namespace-name
      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-2-tenant-tunnel: "true"
    spec:
      cidr: 10.20.162.0/24
      gateway: 10.20.162.1
      includeRanges:
        - 10.20.162.16-10.20.162.255
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-3-tenant-tunnel
      namespace: mosk-namespace-name
      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-3-tenant-tunnel: "true"
    spec:
      cidr: 10.20.163.0/24
      gateway: 10.20.163.1
      includeRanges:
        - 10.20.163.16-10.20.163.255
    ---
    # Add more Subnet object templates as required using the above example
    # (one subnet per rack)
    

    Configuration rules:

    • Use this subnet in the L2 template for the compute and gateway (controller) nodes in the MOSK cluster.

    • Assign this subnet to the interface connected to your MOSK overlay network.

    • This network is used to provide isolated and secure tenant networks with the help of the tunneling mechanism (VLAN/GRE/VXLAN). If VXLAN or GRE encapsulation is used, the IP address assignment is required on interfaces at the node level. In Tungsten Fabric deployments, this network is used for MPLS over UDP+GRE traffic.

  12. Optional. Add a subnet for the MOSK live migration network. This subnet is used by the Compute service (OpenStack Nova) to transfer data during live migration. Depending on the cloud needs, you can place it on a dedicated physical network not to affect other networks during live migration. The IP address assignment is required on interfaces at the node level. Configuration examples:

    Single-rack cluster
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      labels:
        kaas.mirantis.com/provider: baremetal
      name: live-migration-subnet
      namespace: <MOSKClusterNamespace>
    spec:
      cidr: 10.12.7.0/24
      includeRanges:
      - 10.12.7.10-10.12.7.100
    
    Multi-rack cluster
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-1-live-migration
      namespace: mosk-namespace-name
      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-1-live-migration: "true"
    spec:
      cidr: 10.20.171.0/24
      gateway: 10.20.171.1
      includeRanges:
        - 10.20.171.16-10.20.171.255
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-2-live-migration
      namespace: mosk-namespace-name
      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-2-live-migration: "true"
    spec:
      cidr: 10.20.172.0/24
      gateway: 10.20.172.1
      includeRanges:
        - 10.20.172.16-10.20.172.255
    ---
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: rack-3-live-migration
      namespace: mosk-namespace-name
      labels:
        kaas.mirantis.com/provider: baremetal
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack-3-live-migration: "true"
    spec:
      cidr: 10.20.173.0/24
      gateway: 10.20.173.1
      includeRanges:
        - 10.20.173.16-10.20.173.255
    ---
    # Add more Subnet object templates as required using the above example
    # (one subnet per rack)
    

    Configuration rules:

    • Use this subnet in the L2 template for compute nodes in the MOSK cluster.

    • Assign this subnet to the interface connected to your live migration network.

  13. Verify that the subnet is successfully created:

    kubectl get subnet kaas-mgmt -oyaml
    

    In the system output, verify the Subnet object status.

    Status fields of the Subnet object

    Parameter

    Description

    state Since 23.1

    Contains a short state description and a more detailed one if applicable. The short status values are as follows:

    • OK - object is operational.

    • ERR - object is non-operational. This status has a detailed description in the messages list.

    • TERM - object was deleted and is terminating.

    messages Since 23.1

    Contains error or warning messages if the object state is ERR. For example, ERR: Wrong includeRange for CIDR….

    statusMessage

    Deprecated since MOSK 23.1 and will be removed in one of the following releases in favor of state and messages. Since MOSK 23.2, this field is not set for the objects of newly created clusters.

    cidr

    Reflects the actual CIDR, has the same meaning as spec.cidr.

    gateway

    Reflects the actual gateway, has the same meaning as spec.gateway.

    nameservers

    Reflects the actual name servers, has the same meaning as spec.nameservers.

    ranges

    Specifies the address ranges that are calculated using the fields from spec: cidr, includeRanges, excludeRanges, gateway, useWholeCidr. These ranges are directly used for node IP allocation.

    allocatable

    Includes the number of currently available IP addresses that can be allocated for nodes from the subnet.

    allocatedIPs

    Specifies the list of IPv4 addresses with the corresponding IPaddr object IDs that were already allocated from the subnet.

    capacity

    Contains the total number of IP addresses held by ranges, which equals the sum of the allocatable value and the number of allocatedIPs entries.

    objCreated

    Date, time, and IPAM version of the Subnet CR creation.

    objStatusUpdated

    Date, time, and IPAM version of the last update of the status field in the Subnet CR.

    objUpdated

    Date, time, and IPAM version of the last Subnet CR update by kaas-ipam.

    Example of a successfully created subnet:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      labels:
        ipam/UID: 6039758f-23ee-40ba-8c0f-61c01b0ac863
        kaas.mirantis.com/provider: baremetal
        ipam/SVC-k8s-lcm: "1"
      name: kaas-mgmt
      namespace: default
    spec:
      cidr: 10.0.0.0/24
      excludeRanges:
      - 10.0.0.100
      - 10.0.0.101-10.0.0.120
      gateway: 10.0.0.1
      includeRanges:
      - 10.0.0.50-10.0.0.90
      nameservers:
      - 172.18.176.6
    status:
      allocatable: 38
      allocatedIPs:
      - 10.0.0.50:0b50774f-ffed-11ea-84c7-0242c0a85b02
      - 10.0.0.51:1422e651-ffed-11ea-84c7-0242c0a85b02
      - 10.0.0.52:1d19912c-ffed-11ea-84c7-0242c0a85b02
      capacity: 41
      cidr: 10.0.0.0/24
      gateway: 10.0.0.1
      objCreated: 2021-10-21T19:09:32Z  by  v5.1.0-20210930-121522-f5b2af8
      objStatusUpdated: 2021-10-21T19:14:18.748114886Z  by  v5.1.0-20210930-121522-f5b2af8
      objUpdated: 2021-10-21T19:09:32.606968024Z  by  v5.1.0-20210930-121522-f5b2af8
      nameservers:
      - 172.18.176.6
      ranges:
      - 10.0.0.50-10.0.0.90
    
  14. Proceed to creating L2 templates as described in Create L2 templates.
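
The counters in the example status from the verification step can be cross-checked by hand: the single range 10.0.0.50-10.0.0.90 holds 41 addresses, which matches capacity, and allocatable (38) plus the three allocatedIPs entries also gives 41. The following Python sketch repeats that arithmetic. The helper name range_size is illustrative only and is not part of kaas-ipam.

```python
import ipaddress


def range_size(ip_range: str) -> int:
    """Count addresses in a 'first-last' range or in a single address."""
    first, _, last = ip_range.partition("-")
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last or first))
    return end - start + 1


# Values copied from the status section of the kaas-mgmt example.
status = {
    "allocatable": 38,
    "allocatedIPs": [
        "10.0.0.50:0b50774f-ffed-11ea-84c7-0242c0a85b02",
        "10.0.0.51:1422e651-ffed-11ea-84c7-0242c0a85b02",
        "10.0.0.52:1d19912c-ffed-11ea-84c7-0242c0a85b02",
    ],
    "capacity": 41,
    "ranges": ["10.0.0.50-10.0.0.90"],
}

# Capacity derived two ways: from the ranges and from the counters.
capacity_from_ranges = sum(range_size(r) for r in status["ranges"])
capacity_from_counters = status["allocatable"] + len(status["allocatedIPs"])

print(capacity_from_ranges, capacity_from_counters, status["capacity"])
```

All three printed numbers are 41 for the example object. A mismatch on a real Subnet object would be a reason to inspect the state and messages fields.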

Create L2 templates

After you create subnets for the MOSK cluster as described in Create subnets, follow the procedure below to create L2 templates for different types of OpenStack nodes in the cluster.

See the following subsections for templates that implement the MOSK Reference Architecture: Networking. You may adjust the templates according to the requirements of your architecture using the last two subsections of this section. They explain mandatory parameters of the templates and supported configuration options.

Create an L2 template for a new cluster

After you create subnets for one or more MOSK clusters or projects as described in Create subnets, follow the procedure below to create L2 templates for a MOSK cluster.

L2 templates are used directly during provisioning. This way, a hardware node obtains and applies a complete network configuration during the first system boot.

Caution

Update any L2 template created before Container Cloud 2.9.0 (Cluster releases 6.14.0, 5.15.0, or earlier) to the new format as described in Container Cloud Release Notes: Switch L2 templates to the new format.

Create an L2 template for a new MOSK cluster

Caution

Create L2 templates before adding any machines to your new MOSK cluster.

  1. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  2. Create a set of L2Template YAML files specific to your deployment using the example templates provided in Create L2 templates.

    Note

    You can create several L2 templates with different configurations to be applied to different nodes of the same cluster. See Assign L2 templates to machines for details.

  3. Add or edit the mandatory labels and parameters in the new L2 template. For description of mandatory labels and parameters, see L2Template.

  4. Optional. To designate an L2 template as default, assign the ipam/DefaultForCluster label to it. Only one L2 template in a cluster can have this label. It will be used for machines that do not have an L2 template explicitly assigned to them.

    Note

    You may skip this step and add the default label along with other custom labels using the Container Cloud web UI, as described below in this procedure.

    To assign the default template to the cluster:

    • Since Container Cloud 2.25.0: use the mandatory cluster.sigs.k8s.io/cluster-name label in the L2 template metadata section.

    • Before Container Cloud 2.25.0: use the cluster.sigs.k8s.io/cluster-name label or the clusterRef parameter in the L2 template spec section. During cluster update to 2.25.0, the deprecated clusterRef parameter is automatically migrated to the cluster.sigs.k8s.io/cluster-name label.

  5. Optional. Add custom labels to the L2 template. You can refer to these labels to assign the L2 template to machines.

  6. Add the L2 template to your management cluster. Select one of the following options:

    • Using the CLI:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> apply -f <pathToL2TemplateYamlFile>
    
    • Using the Container Cloud web UI:

    1. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

    2. Switch to the required non-default project using the Switch Project action icon located on top of the main left-side navigation panel.

      Caution

      Do not create a MOSK cluster in the default project (Kubernetes namespace), which is dedicated for the management cluster only. If no projects are defined, first create a new mosk project as described in Create a project for MOSK clusters.

    3. In the left sidebar, navigate to Networks and click the L2 Templates tab.

    4. Click Create L2 Template.

    5. Fill out the Create L2 Template form as required:

      • Name

        L2 template name.

      • Cluster

        Cluster name that the L2 template is being added for. To set the L2 template as default for all machines, also select Set default for the cluster.

      • Specification

        L2 specification in the YAML format that you have previously created. Click Edit to edit the L2 template if required.

        Note

        Before Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0), the field name is YAML file, and you can upload the required YAML file instead of inserting and editing it.

      • Labels

        Available since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0). Key-value pairs attached to the L2 template. For details, see L2Template metadata.

  7. Optional. Further modify the template, if required. For description of parameters, see L2Template.

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    -n <ProjectNameForNewMOSKCluster> edit l2template <L2templateName>
    

    Caution

    Modification of L2 templates in use is only allowed with a mandatory validation step from the infrastructure operator to prevent accidental cluster failures due to unsafe changes. The list of risks posed by modifying L2 templates includes:

    • Services running on hosts cannot reconfigure automatically to switch to the new IP addresses and/or interfaces.

    • Connections between services are interrupted unexpectedly, which can cause data loss.

    • Incorrect configurations on hosts can lead to irrevocable loss of connectivity between services and unexpected cluster partition or disassembly.

    For details, see Modify network configuration on an existing machine.

  8. Proceed with Add a machine. The resulting L2 template will be used to render the netplan configuration for the MOSK cluster machines.

Workflow of the netplan configuration using an L2 template
  1. The kaas-ipam service uses the data from BareMetalHostInventory, L2Template, and Subnet objects to generate the netplan configuration for every cluster machine.

    Note

    Before update of the management cluster to Container Cloud 2.29.0 (Cluster release 16.4.0), instead of BareMetalHostInventory, use the BareMetalHost object. For details, see BareMetalHost resource.

    Caution

    While the Cluster release of the management cluster is 16.4.0, BareMetalHostInventory operations are allowed to m:kaas@management-admin only. This limitation is lifted once the management cluster is updated to the Cluster release 16.4.1 or later.

  2. The generated netplan configuration is saved in the status.netconfigFiles section of the IpamHost object. If the status.netconfigFilesState field of the IpamHost object is OK, the configuration was rendered in the IpamHost object successfully. Otherwise, the status contains an error message.

    Caution

    The following fields of the IpamHost status are renamed since MOSK 23.1 in the scope of the L2Template and IpamHost objects refactoring:

    • netconfigV2 to netconfigCandidate

    • netconfigV2state to netconfigCandidateState

    • netconfigFilesState to netconfigFilesStates (per file)

    No user actions are required after renaming.

    The format of netconfigFilesState changed after renaming. The netconfigFilesStates field contains a dictionary of statuses of the network configuration files stored in netconfigFiles. The dictionary keys are file paths, and each value has the same meaning for its file that netconfigFilesState had for the whole configuration:

    • For a successfully rendered configuration file: OK: <timestamp> <sha256-hash-of-rendered-file>, where a timestamp is in the RFC 3339 format.

    • For a failed rendering: ERR: <error-message>.

  3. The baremetal-provider service copies data from status.netconfigFiles of the IpamHost object to the Spec.StateItemsOverwrites['deploy']['bm_ipam_netconfigv2'] parameter of LCMMachine.

  4. The lcm-agent service on every host synchronizes the LCMMachine data to its host. The lcm-agent service runs a playbook to update the netplan configuration on the host during the pre-download and deploy phases.
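
The per-file status format described in step 2 (OK: <timestamp> <sha256-hash-of-rendered-file> or ERR: <error-message>) is simple enough to check in a script. The following Python sketch splits such values and lists the files that failed to render. The dictionary content, including file paths, timestamp, hash, and error text, is made up for illustration, and the function name parse_file_state is hypothetical.

```python
def parse_file_state(value: str) -> dict:
    """Split one netconfigFilesStates value into its parts."""
    state, _, rest = value.partition(": ")
    if state == "OK":
        timestamp, _, sha256 = rest.partition(" ")
        return {"ok": True, "timestamp": timestamp, "sha256": sha256}
    if state == "ERR":
        return {"ok": False, "error": rest}
    raise ValueError(f"unexpected state value: {value!r}")


# Hypothetical dictionary shaped like status.netconfigFilesStates.
# The file paths, timestamp, hash, and error text are invented.
netconfig_files_states = {
    "/etc/netplan/60-example.yaml": "OK: 2023-01-23T09:27:22Z " + "a" * 64,
    "/etc/example/other.conf": "ERR: example error message",
}

# Collect the paths of files whose rendering failed.
failed = [
    path
    for path, value in netconfig_files_states.items()
    if not parse_file_state(value)["ok"]
]
print(failed)
```

In practice, the input dictionary would come from the status section of an IpamHost object retrieved with kubectl.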

Create L2 templates for a multi-rack MOSK cluster

For a multi-rack MOSK cluster, you need to create one L2 template for each type of server in each rack. This may result in a large number of L2 templates in your configuration.

For example, if you have a three-rack deployment of MOSK with 4 types of nodes evenly distributed across three racks, you have to create at least the following L2 templates:

  • rack-1-k8s-manager, rack-2-k8s-manager, rack-3-k8s-manager for Kubernetes control plane nodes, unless you use the compact control plane option.

  • rack-1-mosk-control, rack-2-mosk-control, rack-3-mosk-control for OpenStack controller nodes in each rack.

  • rack-1-mosk-compute, rack-2-mosk-compute, rack-3-mosk-compute for OpenStack compute nodes in each rack.

  • rack-1-mosk-storage, rack-2-mosk-storage, rack-3-mosk-storage for OpenStack storage nodes in each rack.

In total, twelve L2 templates are required for this relatively simple cluster. In the following sections, the examples cover only one rack, but can be easily expanded to more racks.
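
The count follows directly from multiplying racks by node types. As a rough illustration, the Python sketch below enumerates the template names from the list above. The naming pattern is taken from that list, and the variable names are arbitrary.

```python
from itertools import product

racks = [1, 2, 3]
roles = ["k8s-manager", "mosk-control", "mosk-compute", "mosk-storage"]

# One L2 template per node type in each rack.
template_names = [f"rack-{rack}-{role}" for role, rack in product(roles, racks)]

print(len(template_names))
for name in template_names:
    print(name)
```

With the compact control plane option, the k8s-manager role drops out of the roles list, which leaves nine templates for the same three racks.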

Note

Three servers are required for Kubernetes control plane and for the OpenStack control plane. So, you might not need more L2 templates for these roles when expanding beyond three racks.

Now, proceed to creating L2 templates for your cluster, starting from Create an L2 template for a Kubernetes manager node.

Create an L2 template for a Kubernetes manager node

Caution

Modification of L2 templates in use is only allowed with a mandatory validation step from the infrastructure operator to prevent accidental cluster failures due to unsafe changes. The list of risks posed by modifying L2 templates includes:

  • Services running on hosts cannot reconfigure automatically to switch to the new IP addresses and/or interfaces.

  • Connections between services are interrupted unexpectedly, which can cause data loss.

  • Incorrect configurations on hosts can lead to irrevocable loss of connectivity between services and unexpected cluster partition or disassembly.

For details, see Modify network configuration on an existing machine.

According to the reference architecture, the Kubernetes manager nodes in the MOSK cluster must be connected to the following networks:

  • PXE network

  • API/LCM network (if you configure ARP announcement of the load-balancer IP address for the MOSK cluster API)

  • LCM network (if you configure BGP announcement of the load-balancer IP address for the MOSK cluster API)

  • Kubernetes workloads network

Caution

If you plan to deploy a MOSK cluster with the compact control plane option, skip this section entirely and proceed with Create an L2 template for a MOSK controller node.

To create L2 templates for Kubernetes manager nodes:

  1. Create or open the mosk-l2templates.yml file that contains the L2 templates you are preparing.

  2. Add L2 templates using the following example. Adjust the values of specific parameters according to the specifications of your environment, specifically the name of your project (namespace) and cluster, IP address ranges and networks, subnet names.

    L2 template example for Kubernetes manager node
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: L2Template
    metadata:
      labels:
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack1-mosk-manager: "true"
      name: rack1-mosk-manager
      namespace: mosk-namespace-name
    spec:
      autoIfMappingPrio:
      - provision
      - eno
      - ens
      - enp
      l3Layout:
      - subnetName: api-lcm
        scope: namespace
      - subnetName: rack1-k8s-pods
        scope: namespace
      npTemplate: |-
        version: 2
        ethernets:
          {{nic 0}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 0}}
            set-name: {{nic 0}}
            mtu: 9000
          {{nic 1}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 1}}
            set-name: {{nic 1}}
            mtu: 9000
          {{nic 2}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 2}}
            set-name: {{nic 2}}
            mtu: 9000
          {{nic 3}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 3}}
            set-name: {{nic 3}}
            mtu: 9000
        bonds:
          bond0:
            mtu: 9000
            parameters:
              mode: 802.3ad
              mii-monitor-interval: 100
            interfaces:
            - {{nic 0}}
            - {{nic 1}}
        vlans:
          k8s-lcm-v:
            id: 403
            link: bond0
            mtu: 9000
          k8s-pods-v:
            id: 408
            link: bond0
            mtu: 9000
        bridges:
          k8s-lcm:
            interfaces: [k8s-lcm-v]
            addresses:
            - {{ ip "k8s-lcm:api-lcm" }}
            nameservers:
              addresses: {{nameservers_from_subnet "api-lcm"}}
            gateway4: {{ gateway_from_subnet "api-lcm" }}
          k8s-pods:
            interfaces: [k8s-pods-v]
            addresses:
            - {{ip "k8s-pods:rack1-k8s-pods"}}
            mtu: 9000
            routes:
              - to: 10.199.0.0/22 # aggregated address space for Kubernetes workloads
                via: {{gateway_from_subnet "rack1-k8s-pods"}}
    

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

    Note

    Before MOSK 23.3, an L2 template requires clusterRef: <clusterName> in the spec section. Since MOSK 23.3, this parameter is deprecated and automatically migrated to the cluster.sigs.k8s.io/cluster-name: <clusterName> label.

    To create L2 templates for other racks, change the rack identifier in the names and labels above.

  3. Proceed with Create an L2 template for a MOSK controller node. The resulting L2 templates will be used to render the netplan configuration for the managed cluster machines.
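
To make the role of the {{nic N}} and {{mac N}} macros in the npTemplate example more concrete, the following Python sketch substitutes them in a small fragment. It is a toy model only, not the kaas-ipam implementation. It assumes that {{nic N}} resolves to the name and {{mac N}} to the MAC address of the N-th interface in the host interface mapping. The interface names and MAC addresses are invented.

```python
import re

# Hypothetical interface mapping for one host: (name, MAC address) per index.
interfaces = [
    ("eno1", "0c:c4:7a:00:00:01"),
    ("eno2", "0c:c4:7a:00:00:02"),
]

# Matches {{nic N}} and {{mac N}}, with optional spaces inside the braces.
MACRO = re.compile(r"\{\{\s*(nic|mac)\s+(\d+)\s*\}\}")


def render(template: str) -> str:
    """Replace {{nic N}} and {{mac N}} with values from the mapping."""
    def substitute(match):
        kind, index = match.group(1), int(match.group(2))
        name, mac = interfaces[index]
        return name if kind == "nic" else mac

    return MACRO.sub(substitute, template)


fragment = """\
{{nic 0}}:
  match:
    macaddress: {{mac 0}}
  set-name: {{nic 0}}
"""
print(render(fragment))
```

The point of the sketch is that one template can serve many hosts: only the interface mapping differs from host to host, while the template text stays the same.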

Create an L2 template for a MOSK controller node

Caution

Modification of L2 templates in use is only allowed with a mandatory validation step from the infrastructure operator to prevent accidental cluster failures due to unsafe changes. The list of risks posed by modifying L2 templates includes:

  • Services running on hosts cannot reconfigure automatically to switch to the new IP addresses and/or interfaces.

  • Connections between services are interrupted unexpectedly, which can cause data loss.

  • Incorrect configurations on hosts can lead to irrevocable loss of connectivity between services and unexpected cluster partition or disassembly.

For details, see Modify network configuration on an existing machine.

According to the reference architecture, MOSK controller nodes must be connected to the following networks:

  • PXE network

  • LCM network

  • Kubernetes workloads network

  • Storage access network (if deploying with Ceph as a backend for ephemeral storage)

  • Floating IP and provider networks (not required for deployment with Tungsten Fabric)

  • Tenant underlay networks (if deploying with VXLAN networking or with Tungsten Fabric). In the latter case, the BGP service is configured over this network.

To create L2 templates for MOSK controller nodes:

  1. Create or open the mosk-l2templates.yml file that contains the L2 templates.

  2. Add L2 templates using the following example. Adjust the values of specific parameters according to the specification of your environment.

    Example of an L2 template for a MOSK controller node
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: L2Template
    metadata:
      labels:
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack1-mosk-controller: "true"
      name: rack1-mosk-controller
      namespace: mosk-namespace-name
    spec:
      autoIfMappingPrio:
      - provision
      - eno
      - ens
      - enp
      l3Layout:
      - subnetName: mgmt-lcm
        scope: global
      - subnetName: rack1-k8s-lcm
        scope: namespace
      - subnetName: k8s-external
        scope: namespace
      - subnetName: rack1-k8s-pods
        scope: namespace
      - subnetName: rack1-ceph-public
        scope: namespace
      - subnetName: rack1-tenant-tunnel
        scope: namespace
      npTemplate: |-
        version: 2
        ethernets:
          {{nic 0}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 0}}
            set-name: {{nic 0}}
            mtu: 9000
          {{nic 1}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 1}}
            set-name: {{nic 1}}
            mtu: 9000
          {{nic 2}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 2}}
            set-name: {{nic 2}}
            mtu: 9000
          {{nic 3}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 3}}
            set-name: {{nic 3}}
            mtu: 9000
        bonds:
          bond0:
            mtu: 9000
            parameters:
              mode: 802.3ad
              mii-monitor-interval: 100
            interfaces:
            - {{nic 0}}
            - {{nic 1}}
          bond1:
            mtu: 9000
            parameters:
              mode: 802.3ad
              mii-monitor-interval: 100
            interfaces:
            - {{nic 2}}
            - {{nic 3}}
        vlans:
          k8s-lcm-v:
            id: 403
            link: bond0
            mtu: 9000
          k8s-ext-v:
            id: 409
            link: bond0
            mtu: 9000
          k8s-pods-v:
            id: 408
            link: bond0
            mtu: 9000
          pr-floating:
            id: 407
            link: bond1
            mtu: 9000
          stor-frontend:
            id: 404
            link: bond0
            addresses:
            - {{ip "stor-frontend:rack1-ceph-public"}}
            mtu: 9000
            routes:
            - to: 10.199.16.0/22 # aggregated address space for Ceph public network
              via: {{ gateway_from_subnet "rack1-ceph-public" }}
          tenant-tunnel:
            id: 406
            link: bond1
            addresses:
            - {{ip "tenant-tunnel:rack1-tenant-tunnel"}}
            mtu: 9000
            routes:
            - to: 10.195.0.0/22 # aggregated address space for tenant networks
              via: {{ gateway_from_subnet "rack1-tenant-tunnel" }}
        bridges:
          k8s-lcm:
            interfaces: [k8s-lcm-v]
            addresses:
            - {{ ip "k8s-lcm:rack1-k8s-lcm" }}
            nameservers:
              addresses: {{nameservers_from_subnet "rack1-k8s-lcm"}}
            routes:
            - to: 10.197.0.0/21 # aggregated address space for LCM and API/LCM networks
              via: {{ gateway_from_subnet "rack1-k8s-lcm" }}
            - to: {{ cidr_from_subnet "mgmt-lcm" }}
              via: {{ gateway_from_subnet "rack1-k8s-lcm" }}
          k8s-ext:
            interfaces: [k8s-ext-v]
            addresses:
            - {{ip "k8s-ext:k8s-external"}}
            nameservers:
              addresses: {{nameservers_from_subnet "k8s-external"}}
            gateway4: {{ gateway_from_subnet "k8s-external" }}
            mtu: 9000
          k8s-pods:
            interfaces: [k8s-pods-v]
            addresses:
            - {{ip "k8s-pods:rack1-k8s-pods"}}
            mtu: 9000
            routes:
            - to: 10.199.0.0/22 # aggregated address space for Kubernetes workloads
              via: {{ gateway_from_subnet "rack1-k8s-pods" }}
    

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

    Note

    Before MOSK 23.3, an L2 template requires clusterRef: <clusterName> in the spec section. Since MOSK 23.3, this parameter is deprecated and automatically migrated to the cluster.sigs.k8s.io/cluster-name: <clusterName> label.

Caution

If you plan to deploy a MOSK cluster with the compact control plane option and configure ARP announcement of the load-balancer IP address for the MOSK cluster API, the API/LCM network will be used for MOSK controller nodes. Therefore, change the rack1-k8s-lcm subnet to the api-lcm one in the corresponding L2Template object:

spec:
  ...
  l3Layout:
  ...
  - subnetName: api-lcm
    scope: namespace
  ...
  npTemplate: |-
  ...
    bridges:
      k8s-lcm:
        interfaces: [k8s-lcm-v]
        addresses:
        - {{ ip "k8s-lcm:api-lcm" }}
        nameservers:
          addresses: {{nameservers_from_subnet "api-lcm"}}
        routes:
        - to: 10.197.0.0/21 # aggregated address space for LCM and API/LCM networks
          via: {{ gateway_from_subnet "api-lcm" }}
        - to: {{ cidr_from_subnet "mgmt-lcm" }}
          via: {{ gateway_from_subnet "api-lcm" }}
  ...
  3. Proceed with Create an L2 template for a MOSK compute node.
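
The routes in the controller example point at aggregated address spaces, such as 10.197.0.0/21 for the LCM and API/LCM networks. Every per-rack subnet of that type is expected to fall inside the aggregate. The following Python sketch shows one way to check this. The per-rack CIDRs are hypothetical placeholders, not values from this guide, so substitute the CIDRs of your own Subnet objects.

```python
import ipaddress

# Aggregate taken from the route comment in the controller node example.
lcm_aggregate = ipaddress.ip_network("10.197.0.0/21")

# Hypothetical per-rack LCM subnets; replace with your Subnet object CIDRs.
rack_lcm_subnets = {
    "rack1-k8s-lcm": "10.197.1.0/24",
    "rack2-k8s-lcm": "10.197.2.0/24",
    "rack9-k8s-lcm": "10.197.9.0/24",  # deliberately outside the aggregate
}

# Subnets that the aggregated route would not cover.
outside = [
    name
    for name, cidr in rack_lcm_subnets.items()
    if not ipaddress.ip_network(cidr).subnet_of(lcm_aggregate)
]

print(lcm_aggregate[0], lcm_aggregate[-1], outside)
```

A /21 prefix starting at 10.197.0.0 ends at 10.197.7.255, so the deliberately misplaced 10.197.9.0/24 subnet is reported as not covered by the route.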

Create an L2 template for a MOSK compute node

Caution

Modification of L2 templates in use is only allowed with a mandatory validation step from the infrastructure operator to prevent accidental cluster failures due to unsafe changes. The list of risks posed by modifying L2 templates includes:

  • Services running on hosts cannot reconfigure automatically to switch to the new IP addresses and/or interfaces.

  • Connections between services are interrupted unexpectedly, which can cause data loss.

  • Incorrect configurations on hosts can lead to irrevocable loss of connectivity between services and unexpected cluster partition or disassembly.

For details, see Modify network configuration on an existing machine.

According to the reference architecture, MOSK compute nodes must be connected to the following networks:

  • PXE network

  • LCM network

  • Kubernetes workloads network

  • Storage access network (if deploying with Ceph as a backend for ephemeral storage)

  • Floating IP and provider networks (if deploying OpenStack with DVR)

  • Tenant underlay networks

To create L2 templates for MOSK compute nodes:

  1. Add L2 templates to the mosk-l2templates.yml file using the following example. Adjust the values of parameters according to the specification of your environment.

    Example of an L2 template for a MOSK compute node
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: L2Template
    metadata:
      labels:
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack1-mosk-compute: "true"
      name: rack1-mosk-compute
      namespace: mosk-namespace-name
    spec:
      autoIfMappingPrio:
      - provision
      - eno
      - ens
      - enp
      l3Layout:
      - subnetName: rack1-k8s-lcm
        scope: namespace
      - subnetName: rack1-k8s-pods
        scope: namespace
      - subnetName: rack1-ceph-public
        scope: namespace
      - subnetName: rack1-tenant-tunnel
        scope: namespace
      npTemplate: |-
        version: 2
        ethernets:
          {{nic 0}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 0}}
            set-name: {{nic 0}}
            mtu: 9000
          {{nic 1}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 1}}
            set-name: {{nic 1}}
            mtu: 9000
          {{nic 2}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 2}}
            set-name: {{nic 2}}
            mtu: 9000
          {{nic 3}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 3}}
            set-name: {{nic 3}}
            mtu: 9000
        bonds:
          bond0:
            mtu: 9000
            parameters:
              mode: 802.3ad
              mii-monitor-interval: 100
            interfaces:
            - {{nic 0}}
            - {{nic 1}}
          bond1:
            mtu: 9000
            parameters:
              mode: 802.3ad
              mii-monitor-interval: 100
            interfaces:
            - {{nic 2}}
            - {{nic 3}}
        vlans:
          k8s-lcm-v:
            id: 403
            link: bond0
            mtu: 9000
          k8s-pods-v:
            id: 408
            link: bond0
            mtu: 9000
          pr-floating:
            id: 407
            link: bond1
            mtu: 9000
          stor-frontend:
            id: 404
            link: bond0
            addresses:
            - {{ip "stor-frontend:rack1-ceph-public"}}
            mtu: 9000
            routes:
            - to: 10.199.16.0/22 # aggregated address space for Ceph public network
              via: {{ gateway_from_subnet "rack1-ceph-public" }}
          tenant-tunnel:
            id: 406
            link: bond1
            addresses:
            - {{ip "tenant-tunnel:rack1-tenant-tunnel"}}
            mtu: 9000
            routes:
            - to: 10.195.0.0/22 # aggregated address space for tenant networks
              via: {{ gateway_from_subnet "rack1-tenant-tunnel" }}
        bridges:
          k8s-lcm:
            interfaces: [k8s-lcm-v]
            addresses:
            - {{ ip "k8s-lcm:rack1-k8s-lcm" }}
            nameservers:
              addresses: {{nameservers_from_subnet "rack1-k8s-lcm"}}
            gateway4: {{ gateway_from_subnet "rack1-k8s-lcm" }}
          k8s-pods:
            interfaces: [k8s-pods-v]
            addresses:
            - {{ip "k8s-pods:rack1-k8s-pods"}}
            mtu: 9000
            routes:
              - to: 10.199.0.0/22 # aggregated address space for Kubernetes workloads
                via: {{gateway_from_subnet "rack1-k8s-pods"}}
    

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

    Note

    Before MOSK 23.3, an L2 template requires clusterRef: <clusterName> in the spec section. Since MOSK 23.3, this parameter is deprecated and automatically migrated to the cluster.sigs.k8s.io/cluster-name: <clusterName> label.

  2. Proceed with Create an L2 template for a MOSK storage node.
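
Every subnet name used in an npTemplate macro should also appear in the l3Layout section of the same L2 template. A quick consistency check before applying a template can catch a misspelled subnet name. The Python sketch below is one possible way to do this with a regular expression. It is not a Mirantis tool, and the template fragment is shortened and contains a deliberate mistake (typo-subnet).

```python
import re

# Subnet names declared in l3Layout of the compute node example.
l3_layout = {
    "rack1-k8s-lcm",
    "rack1-k8s-pods",
    "rack1-ceph-public",
    "rack1-tenant-tunnel",
}

# Shortened npTemplate fragment; "typo-subnet" is a deliberate mistake.
np_template = """\
addresses:
- {{ ip "k8s-lcm:rack1-k8s-lcm" }}
nameservers:
  addresses: {{nameservers_from_subnet "rack1-k8s-lcm"}}
gateway4: {{ gateway_from_subnet "rack1-k8s-lcm" }}
addresses:
- {{ip "k8s-pods:typo-subnet"}}
"""

MACRO = re.compile(
    r'\{\{\s*(ip|gateway_from_subnet|nameservers_from_subnet|cidr_from_subnet)'
    r'\s+"([^"]+)"\s*\}\}'
)


def referenced_subnets(template: str) -> set:
    """Return the subnet names referenced by macros in the template."""
    names = set()
    for macro, argument in MACRO.findall(template):
        # The ip macro takes "<interface>:<subnet>"; the others take "<subnet>".
        names.add(argument.split(":", 1)[1] if macro == "ip" else argument)
    return names


undeclared = referenced_subnets(np_template) - l3_layout
print(sorted(undeclared))
```

An empty result means that every macro refers to a subnet declared in l3Layout. The check covers only the four macros that appear in the examples of this section.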

Create an L2 template for a MOSK storage node

Caution

Modification of L2 templates in use is only allowed with a mandatory validation step from the infrastructure operator to prevent accidental cluster failures due to unsafe changes. The list of risks posed by modifying L2 templates includes:

  • Services running on hosts cannot reconfigure automatically to switch to the new IP addresses and/or interfaces.

  • Connections between services are interrupted unexpectedly, which can cause data loss.

  • Incorrect configurations on hosts can lead to irrevocable loss of connectivity between services and unexpected cluster partition or disassembly.

For details, see Modify network configuration on an existing machine.

According to the reference architecture, MOSK storage nodes in the MOSK cluster must be connected to the following networks:

  • PXE network

  • LCM network

  • Kubernetes workloads network

  • Storage access network

  • Storage replication network

To create L2 templates for MOSK storage nodes:

  1. Add L2 templates to the mosk-l2templates.yml file using the following example. Adjust the values of parameters according to the specification of your environment.

    Example of an L2 template for a MOSK storage node
    apiVersion: ipam.mirantis.com/v1alpha1
    kind: L2Template
    metadata:
      labels:
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
        cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
        rack1-mosk-storage: "true"
      name: rack1-mosk-storage
      namespace: mosk-namespace-name
    spec:
      autoIfMappingPrio:
      - provision
      - eno
      - ens
      - enp
      l3Layout:
      - subnetName: rack1-k8s-lcm
        scope: namespace
      - subnetName: rack1-k8s-pods
        scope: namespace
      - subnetName: rack1-ceph-public
        scope: namespace
      - subnetName: rack1-ceph-cluster
        scope: namespace
      npTemplate: |-
        version: 2
        ethernets:
          {{nic 0}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 0}}
            set-name: {{nic 0}}
            mtu: 9000
          {{nic 1}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 1}}
            set-name: {{nic 1}}
            mtu: 9000
          {{nic 2}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 2}}
            set-name: {{nic 2}}
            mtu: 9000
          {{nic 3}}:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 3}}
            set-name: {{nic 3}}
            mtu: 9000
        bonds:
          bond0:
            mtu: 9000
            parameters:
              mode: 802.3ad
              mii-monitor-interval: 100
            interfaces:
            - {{nic 0}}
            - {{nic 1}}
          bond1:
            mtu: 9000
            parameters:
              mode: 802.3ad
              mii-monitor-interval: 100
            interfaces:
            - {{nic 2}}
            - {{nic 3}}
        vlans:
          k8s-lcm-v:
            id: 403
            link: bond0
            mtu: 9000
          k8s-pods-v:
            id: 408
            link: bond0
            mtu: 9000
          stor-frontend:
            id: 404
            link: bond0
            addresses:
            - {{ip "stor-frontend:rack1-ceph-public"}}
            mtu: 9000
            routes:
            - to: 10.199.16.0/22 # aggregated address space for Ceph public network
              via: {{ gateway_from_subnet "rack1-ceph-public" }}
          stor-backend:
            id: 405
            link: bond1
            addresses:
            - {{ip "stor-backend:rack1-ceph-cluster"}}
            mtu: 9000
            routes:
            - to: 10.199.32.0/22 # aggregated address space for Ceph cluster network
              via: {{ gateway_from_subnet "rack1-ceph-cluster" }}
        bridges:
          k8s-lcm:
            interfaces: [k8s-lcm-v]
            addresses:
            - {{ ip "k8s-lcm:rack1-k8s-lcm" }}
            nameservers:
              addresses: {{nameservers_from_subnet "rack1-k8s-lcm"}}
            gateway4: {{ gateway_from_subnet "rack1-k8s-lcm" }}
          k8s-pods:
            interfaces: [k8s-pods-v]
            addresses:
            - {{ip "k8s-pods:rack1-k8s-pods"}}
            mtu: 9000
            routes:
              - to: 10.199.0.0/22 # aggregated address space for Kubernetes workloads
                via: {{gateway_from_subnet "rack1-k8s-pods"}}
    

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

    Note

    Before MOSK 23.3, an L2 template requires clusterRef: <clusterName> in the spec section. Since MOSK 23.3, this parameter is deprecated and automatically migrated to the cluster.sigs.k8s.io/cluster-name: <clusterName> label.

  2. Proceed with the L2 template configuration procedure described in Create an L2 template for a new cluster.
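
The routes in the example above point to aggregated address spaces, such as 10.199.16.0/22 for the Ceph public network. As a minimal sketch, the following Python snippet checks that per-rack subnets fit into the aggregated range used in a route. The per-rack CIDR values and the outside_aggregate helper are hypothetical and serve only as an illustration. The snippet is not part of the MOSK tooling.

```python
import ipaddress


def outside_aggregate(aggregate, rack_cidrs):
    """Return the rack CIDRs that are not covered by the aggregated range."""
    aggregate_net = ipaddress.ip_network(aggregate)
    return [
        cidr for cidr in rack_cidrs
        if not ipaddress.ip_network(cidr).subnet_of(aggregate_net)
    ]


# Aggregated range taken from the route in the example template.
ceph_public_aggregate = "10.199.16.0/22"
# Hypothetical per-rack subnets; replace with the CIDRs of your Subnet objects.
rack_subnets = ["10.199.16.0/24", "10.199.17.0/24", "10.199.20.0/24"]

not_covered = outside_aggregate(ceph_public_aggregate, rack_subnets)
print("Subnets not covered by the route:", not_covered)
```

A subnet reported by the check would not be reachable through the aggregated route and would need either a wider aggregate or a separate route entry.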

L2 template example with bonds and bridges

This section contains an exemplary L2 template that demonstrates how to set up bonds and bridges on hosts for your managed clusters.


Parameters of the bond interface

Configure bonding options using the parameters field. The only mandatory option is mode. See the example below for details.

Note

You can set any mode supported by netplan and your hardware.

Important

Bond monitoring is disabled in Ubuntu by default. However, Mirantis highly recommends enabling it using the Media Independent Interface (MII) monitoring by setting the mii-monitor-interval parameter to a non-zero value. For details, see Linux documentation: bond monitoring.
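
The two requirements above, a mandatory mode and a non-zero mii-monitor-interval, can be verified before a template is applied. The following sketch assumes that the bonds section of npTemplate is already rendered and loaded into a Python dictionary. The check_bond helper and the sample bond definitions are hypothetical.

```python
def check_bond(name, bond):
    """Return a list of problems found in one netplan bond definition."""
    problems = []
    params = bond.get("parameters") or {}
    if not params.get("mode"):
        problems.append(f"{name}: parameters.mode is mandatory")
    # A missing or zero interval leaves MII link monitoring disabled.
    if params.get("mii-monitor-interval") in (None, 0, "0"):
        problems.append(f"{name}: mii-monitor-interval must be non-zero")
    return problems


# Hypothetical bond definitions for illustration only.
bonds = {
    "bond0": {
        "interfaces": ["ten10gbe0s0", "ten10gbe0s1"],
        "parameters": {"mode": "802.3ad", "mii-monitor-interval": 100},
    },
    "bond1": {
        "interfaces": ["ten10gbe1s0", "ten10gbe1s1"],
        "parameters": {"mode": "active-backup"},
    },
}

report = [problem for name, bond in bonds.items() for problem in check_bond(name, bond)]
for line in report:
    print(line)
```

In this sample, bond0 passes both checks, while bond1 is reported because it does not enable MII monitoring.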

Kubernetes LCM network

The Kubernetes LCM network connects LCM Agents running on nodes to the LCM API of the management cluster. It is also used for communication between kubelet and Kubernetes API server inside a Kubernetes cluster. The MKE components use this network for communication inside a swarm cluster.

To configure each node with an IP address that will be used for LCM traffic, use the npTemplate.bridges.k8s-lcm bridge in the L2 template, as demonstrated in the example below.

  • Each node of every cluster must have exactly one IP address in the LCM network, allocated from one of the Subnet objects that have the ipam/SVC-k8s-lcm label defined. Therefore, all Subnet objects used for LCM networks must have this label. For details, see Service labels and their life cycle.

  • You can use any interface name for the LCM network traffic.

Dedicated network for the Kubernetes pods traffic

If you want to use a dedicated network for Kubernetes pods traffic, configure each node with an IPv4 address that will be used to route the pods traffic between nodes. To accomplish that, use the npTemplate.bridges.k8s-pods bridge in the L2 template, as demonstrated in the example below. As defined in Container Cloud Reference Architecture: Host networking, this bridge name is reserved for the Kubernetes pods network. When the k8s-pods bridge is defined in an L2 template, Calico CNI uses that network for routing the pods traffic between nodes.

Dedicated network for the Kubernetes services traffic (MetalLB)

You can use a dedicated network for external connection to the Kubernetes services exposed by the cluster. If enabled, MetalLB will listen and respond on the dedicated virtual bridge. To accomplish that, configure each node where metallb-speaker is deployed with an IPv4 address. For details on selecting nodes for metallb-speaker, see Configure node selectors for MetalLB speakers. Both the MetalLB IP address ranges and the IP addresses configured on those nodes must fit in the same CIDR.

The default route on the MOSK nodes that are connected to the external network must be configured with the default gateway in the external network.

Caution

The IP address ranges of the corresponding subnet used in L2Template for the dedicated virtual bridge must be excluded from the MetalLB address ranges.
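
Both address rules for the dedicated MetalLB network can be checked with the Python standard library: the node addresses and the MetalLB ranges must fit in the same CIDR, and the two ranges must not overlap. The following is a minimal sketch. All addresses and helper names are hypothetical and must be replaced with the values of your environment.

```python
import ipaddress


def parse_range(text):
    """Parse 'first-last' into a pair of IP address objects."""
    first, last = (ipaddress.ip_address(part.strip()) for part in text.split("-"))
    if first > last:
        raise ValueError(f"range start is after range end: {text}")
    return first, last


def in_cidr(ip_range, cidr):
    """Return True if both ends of the range belong to the CIDR."""
    network = ipaddress.ip_network(cidr)
    return ip_range[0] in network and ip_range[1] in network


def overlaps(range_a, range_b):
    """Return True if two inclusive address ranges share any address."""
    return range_a[0] <= range_b[1] and range_b[0] <= range_a[1]


# Hypothetical values for illustration only.
external_cidr = "10.0.12.0/24"
node_range = parse_range("10.0.12.10-10.0.12.60")       # addresses of the bridge on nodes
metallb_range = parse_range("10.0.12.100-10.0.12.150")  # addresses announced by MetalLB

same_cidr = in_cidr(node_range, external_cidr) and in_cidr(metallb_range, external_cidr)
excluded = not overlaps(node_range, metallb_range)
print(f"same CIDR: {same_cidr}, node range excluded from MetalLB: {excluded}")
```

Both values must be True for a configuration that follows the rules described in this section.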

Dedicated networks for the Ceph distributed storage traffic

You can configure dedicated networks for the Ceph cluster access and replication traffic. Set labels on the Subnet CRs for the corresponding networks, as described in Create subnets. Container Cloud automatically configures Ceph to use the addresses from these subnets. Ensure that the addresses are assigned to the storage nodes.

The Subnet objects used to assign IP addresses to these networks must have corresponding labels ipam/SVC-ceph-public for the Ceph public (storage access) network and ipam/SVC-ceph-cluster for the Ceph cluster (storage replication) network.

Example of an L2 template with interfaces bonding
apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  name: test-managed
  namespace: managed-ns
  labels:
    cluster.sigs.k8s.io/cluster-name: mosk-cluster-name
spec:
  autoIfMappingPrio:
    - provision
    - eno
    - ens
    - enp
  l3Layout:
    - subnetName: mgmt-lcm
      scope: global
    - subnetName: demo-lcm
      scope: namespace
    - subnetName: demo-ext
      scope: namespace
    - subnetName: demo-pods
      scope: namespace
    - subnetName: demo-ceph-cluster
      scope: namespace
    - subnetName: demo-ceph-public
      scope: namespace
  npTemplate: |
    version: 2
    ethernets:
      ten10gbe0s0:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 2}}
        set-name: {{nic 2}}
      ten10gbe0s1:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 3}}
        set-name: {{nic 3}}
    bonds:
      bond0:
        interfaces:
          - ten10gbe0s0
          - ten10gbe0s1
        mtu: 9000
        parameters:
          mode: 802.3ad
          mii-monitor-interval: 100
    vlans:
      k8s-lcm-vlan:
        id: 1009
        link: bond0
      k8s-ext-vlan:
        id: 1001
        link: bond0
      k8s-pods-vlan:
        id: 1002
        link: bond0
      stor-frontend:
        id: 1003
        link: bond0
      stor-backend:
        id: 1004
        link: bond0
    bridges:
      k8s-lcm:
        interfaces: [k8s-lcm-vlan]
        addresses:
          - {{ip "k8s-lcm:demo-lcm"}}
        routes:
          - to: {{ cidr_from_subnet "mgmt-lcm" }}
            via: {{ gateway_from_subnet "demo-lcm" }}
      k8s-ext:
        interfaces: [k8s-ext-vlan]
        addresses:
          - {{ip "k8s-ext:demo-ext"}}
        nameservers:
          addresses: {{nameservers_from_subnet "demo-ext"}}
        gateway4: {{ gateway_from_subnet "demo-ext" }}
      k8s-pods:
        interfaces: [k8s-pods-vlan]
        addresses:
          - {{ip "k8s-pods:demo-pods"}}
      ceph-cluster:
        interfaces: [stor-backend]
        addresses:
          - {{ip "ceph-cluster:demo-ceph-cluster"}}
      ceph-public:
        interfaces: [stor-frontend]
        addresses:
          - {{ip "ceph-public:demo-ceph-public"}}

Note

Before MOSK 23.3, an L2 template requires clusterRef: <clusterName> in the spec section. Since MOSK 23.3, this parameter is deprecated and automatically migrated to the cluster.sigs.k8s.io/cluster-name: <clusterName> label.

L2 template example for automatic multiple subnet creation

Unsupported since MCC 2.28.0 (17.3.0 and 16.3.0)

Warning

The SubnetPool object is unsupported since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0). For details, see Deprecation Notes: SubnetPool resource management.

This section contains an exemplary L2 template for automatic multiple subnet creation as described in Automate multiple subnet creation using SubnetPool. The template also contains the l3Layout section that defines the Subnet scopes and enables auto-creation of the Subnet objects from the SubnetPool objects.

For details on how to create L2 templates, see Create an L2 template for a new cluster.

Caution

Do not explicitly assign an IP address to the PXE NIC (nic 0). This prevents IP address duplication during updates. The bootstrapping engine assigns the IP address automatically.

Example of an L2 template for multiple subnets:

apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  name: test-managed
  namespace: managed-ns
  labels:
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
    cluster.sigs.k8s.io/cluster-name: my-cluster
spec:
  autoIfMappingPrio:
    - provision
    - eno
    - ens
    - enp
  l3Layout:
    - subnetName: lcm-subnet
      scope:      namespace
    - subnetName: subnet-1
      subnetPool: kaas-mgmt
      scope:      namespace
    - subnetName: subnet-2
      subnetPool: kaas-mgmt
      scope:      cluster
  npTemplate: |
    version: 2
    ethernets:
      onboard1gbe0:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 0}}
        set-name: {{nic 0}}
        # IMPORTANT: do not assign an IP address here explicitly
        # to prevent IP duplication issues. The IP will be assigned
        # automatically by the bootstrapping engine.
        # addresses: []
      onboard1gbe1:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 1}}
        set-name: {{nic 1}}
      ten10gbe0s0:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 2}}
        set-name: {{nic 2}}
        addresses:
          - {{ip "2:subnet-1"}}
      ten10gbe0s1:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 3}}
        set-name: {{nic 3}}
        addresses:
          - {{ip "3:subnet-2"}}
    bridges:
      k8s-lcm:
        interfaces: [onboard1gbe0]
        addresses:
          - {{ip "k8s-lcm:lcm-subnet"}}
        gateway4: {{gateway_from_subnet "lcm-subnet"}}
        nameservers:
          addresses: {{nameservers_from_subnet "lcm-subnet"}}

Note

The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

In the template above, the following networks are defined in the l3Layout section:

  • lcm-subnet - the subnet name to use for the LCM network in npTemplate. This subnet is shared between the project clusters because it has the namespace scope.

    • Since a subnet pool is not in use, manually create the corresponding Subnet object before machines are attached to the cluster. For details, see Create subnets for a managed cluster using CLI.

    • Mark this Subnet with the ipam/SVC-k8s-lcm label. The L2 template must contain the definition of the virtual Linux bridge (k8s-lcm in the L2 template example) that is used to set up the LCM network interface. IP addresses for the defined bridge must be assigned from the LCM subnet, which is marked with the ipam/SVC-k8s-lcm label.

      • Each node of every cluster must have exactly one IP address in the LCM network, allocated from one of the Subnet objects that have the ipam/SVC-k8s-lcm label defined. Therefore, all Subnet objects used for LCM networks must have this label. For details, see Service labels and their life cycle.

      • You can use any interface name for the LCM network traffic.

  • subnet-1 - unless already created, this subnet will be created from the kaas-mgmt subnet pool. The subnet name must be unique within the project. This subnet is shared between the project clusters.

  • subnet-2 - will be created from the kaas-mgmt subnet pool. This subnet has the cluster scope. Therefore, the real name of the Subnet CR object consists of the subnet name defined in l3Layout and the cluster UID. However, the npTemplate section of the L2 template must contain only the subnet name defined in l3Layout. Subnets of the cluster scope are not shared between clusters.

Caution

Using the l3Layout section, define all subnets that are used in the npTemplate section. Defining only part of subnets is not allowed.

If labelSelector is used in l3Layout, use any custom label name that differs from system names. This allows for easier cluster scaling in case of adding new subnets as described in Expand IP addresses capacity in an existing cluster.

Mirantis recommends using a unique label prefix such as user-defined/.
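
The requirement to define every subnet used in npTemplate in the l3Layout section can be pre-checked with a short script. The following sketch extracts subnet names from the template macros that appear in the examples of this section (ip, gateway_from_subnet, nameservers_from_subnet, and cidr_from_subnet) and compares them with the l3Layout subnet names. The referenced_subnets helper, the regular expression, and the sample data are assumptions made for illustration. The sketch does not reproduce the actual template rendering logic.

```python
import re

# Matches macros such as {{ip "k8s-lcm:lcm-subnet"}} or
# {{ gateway_from_subnet "lcm-subnet" }}.
MACRO = re.compile(
    r'\{\{\s*(ip|gateway_from_subnet|nameservers_from_subnet|cidr_from_subnet)'
    r'\s+"([^"]+)"\s*\}\}'
)


def referenced_subnets(np_template):
    """Return the set of subnet names referenced by macros in npTemplate."""
    names = set()
    for func, arg in MACRO.findall(np_template):
        # The ip macro argument has the "<interface>:<subnet>" form.
        names.add(arg.rsplit(":", 1)[-1] if func == "ip" else arg)
    return names


# Hypothetical fragment of npTemplate; subnet-3 is intentionally undefined.
np_template = """
ethernets:
  ten10gbe0s0:
    addresses:
      - {{ip "2:subnet-1"}}
  ten10gbe0s1:
    addresses:
      - {{ip "3:subnet-3"}}
bridges:
  k8s-lcm:
    addresses:
      - {{ip "k8s-lcm:lcm-subnet"}}
    gateway4: {{gateway_from_subnet "lcm-subnet"}}
"""
l3_layout = ["lcm-subnet", "subnet-1", "subnet-2"]

missing = sorted(referenced_subnets(np_template) - set(l3_layout))
print("Subnets missing from l3Layout:", missing)
```

An empty list means that every subnet referenced in the template fragment has a matching l3Layout entry.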

Example of a complete template configuration for cluster creation

The following example contains all required objects of an advanced network and host configuration for a managed cluster.

The procedure below contains:

  • Various .yaml objects to be applied with a managed cluster kubeconfig

  • Useful comments inside the .yaml example files

  • Example hardware and configuration data, such as network, disk, auth, that must be updated accordingly to fit your cluster configuration

  • Example templates, such as l2template and baremetalhostprofile, that illustrate how to implement a specific configuration

Caution

The exemplary configuration described below is not production ready and is provided for illustration purposes only.

For illustration purposes, all files provided in this exemplary procedure are named by the Kubernetes object types:

Note

Before update of the management cluster to Container Cloud 2.29.0 (Cluster release 16.4.0), instead of BareMetalHostInventory, use the BareMetalHost object. For details, see BareMetalHost resource.

Caution

While the Cluster release of the management cluster is 16.4.0, BareMetalHostInventory operations are allowed to m:kaas@management-admin only. This limitation is lifted once the management cluster is updated to the Cluster release 16.4.1 or later.

managed-ns_BareMetalHostInventory_cz7700-managed-cluster-control-noefi.yaml
managed-ns_BareMetalHostInventory_cz7741-managed-cluster-control-noefi.yaml
managed-ns_BareMetalHostInventory_cz7743-managed-cluster-control-noefi.yaml
managed-ns_BareMetalHostInventory_cz812-managed-cluster-storage-worker-noefi.yaml
managed-ns_BareMetalHostInventory_cz813-managed-cluster-storage-worker-noefi.yaml
managed-ns_BareMetalHostInventory_cz814-managed-cluster-storage-worker-noefi.yaml
managed-ns_BareMetalHostInventory_cz815-managed-cluster-worker-noefi.yaml
managed-ns_BareMetalHostProfile_bmhp-cluster-default.yaml
managed-ns_BareMetalHostProfile_worker-storage1.yaml
managed-ns_Cluster_managed-cluster.yaml
managed-ns_KaaSCephCluster_ceph-cluster-managed-cluster.yaml
managed-ns_L2Template_bm-1490-template-controls-netplan-cz7700-pxebond.yaml
managed-ns_L2Template_bm-1490-template-controls-netplan.yaml
managed-ns_L2Template_bm-1490-template-workers-netplan.yaml
managed-ns_Machine_cz7700-managed-cluster-control-noefi-.yaml
managed-ns_Machine_cz7741-managed-cluster-control-noefi-.yaml
managed-ns_Machine_cz7743-managed-cluster-control-noefi-.yaml
managed-ns_Machine_cz812-managed-cluster-storage-worker-noefi-.yaml
managed-ns_Machine_cz813-managed-cluster-storage-worker-noefi-.yaml
managed-ns_Machine_cz814-managed-cluster-storage-worker-noefi-.yaml
managed-ns_Machine_cz815-managed-cluster-worker-noefi-.yaml
managed-ns_PublicKey_managed-cluster-key.yaml
managed-ns_cz7700-cred.yaml
managed-ns_cz7741-cred.yaml
managed-ns_cz7743-cred.yaml
managed-ns_cz812-cred.yaml
managed-ns_cz813-cred.yaml
managed-ns_cz814-cred.yaml
managed-ns_cz815-cred.yaml
managed-ns_Subnet_lcm-nw.yaml
managed-ns_Subnet_metallb-public-for-managed.yaml (obsolete)
managed-ns_Subnet_metallb-public-for-extiface.yaml
managed-ns_MetalLBConfig-lb-managed.yaml
managed-ns_MetalLBConfigTemplate-lb-managed-template.yaml (obsolete)
managed-ns_Subnet_storage-backend.yaml
managed-ns_Subnet_storage-frontend.yaml
default_Namespace_managed-ns.yaml

Caution

The procedure below presumes that you apply each new .yaml file using kubectl create -f <file_name.yaml>.

To create an example configuration for managed cluster creation:

  1. Verify that you have configured the following items:

    1. All bmh nodes for PXE boot as described in Add a bare metal host using CLI

    2. All physical NICs of the bmh nodes

    3. All required physical subnets and routing

  2. Create a .yaml file with the Namespace object:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: managed-ns
    
  3. Select from the following options:

    Create the required number of .yaml files with the BareMetalHostCredential objects for each bmh node with the unique name and authentication data. The following example contains one BareMetalHostCredential object:

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

    managed-ns_cz815-cred.yaml
    apiVersion: kaas.mirantis.com/v1alpha1
    kind: BareMetalHostCredential
    metadata:
      name: cz815-cred
      namespace: managed-ns
      labels:
        kaas.mirantis.com/region: region-one
    spec:
      username: admin
      password:
        value: supersecret
    

    Create the required number of .yaml files with the Secret objects for each bmh node with the unique name and authentication data. The following example contains one Secret object:

    managed-ns_cz815-cred.yaml
    apiVersion: v1
    data:
      password: YWRtaW4=
      username: ZW5naW5lZXI=
    kind: Secret
    metadata:
      labels:
        kaas.mirantis.com/credentials: 'true'
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: cz815-cred
      namespace: managed-ns
    
  4. Create a set of files with the bmh nodes configuration:

    • managed-ns_BareMetalHostInventory_cz7700-managed-cluster-control-noefi.yaml
      apiVersion: kaas.mirantis.com/v1alpha1
      kind: BareMetalHostInventory
      metadata:
        annotations:
          inspect.metal3.io/hardwaredetails-storage-sort-term: hctl ASC, wwn ASC, by_id ASC, name ASC
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          # This label links the Machine object to the exact bmh node
          kaas.mirantis.com/baremetalhost-id: cz7700
          kaas.mirantis.com/provider: baremetal
        name: cz7700-managed-cluster-control-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.12
          bmhCredentialsName: 'cz7700-cred'
        bootMACAddress: 0c:c4:7a:34:52:04
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHostInventory_cz7741-managed-cluster-control-noefi.yaml
      apiVersion: kaas.mirantis.com/v1alpha1
      kind: BareMetalHostInventory
      metadata:
        annotations:
          inspect.metal3.io/hardwaredetails-storage-sort-term: hctl ASC, wwn ASC, by_id ASC, name ASC
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          kaas.mirantis.com/baremetalhost-id: cz7741
          kaas.mirantis.com/provider: baremetal
        name: cz7741-managed-cluster-control-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.76
          bmhCredentialsName: 'cz7741-cred'
        bootMACAddress: 0c:c4:7a:34:92:f4
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHostInventory_cz7743-managed-cluster-control-noefi.yaml
      apiVersion: kaas.mirantis.com/v1alpha1
      kind: BareMetalHostInventory
      metadata:
        annotations:
          inspect.metal3.io/hardwaredetails-storage-sort-term: hctl ASC, wwn ASC, by_id ASC, name ASC
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          kaas.mirantis.com/baremetalhost-id: cz7743
          kaas.mirantis.com/provider: baremetal
        name: cz7743-managed-cluster-control-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.78
          bmhCredentialsName: 'cz7743-cred'
        bootMACAddress: 0c:c4:7a:34:66:fc
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHostInventory_cz812-managed-cluster-storage-worker-noefi.yaml
      apiVersion: kaas.mirantis.com/v1alpha1
      kind: BareMetalHostInventory
      metadata:
        annotations:
          inspect.metal3.io/hardwaredetails-storage-sort-term: hctl ASC, wwn ASC, by_id ASC, name ASC
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          kaas.mirantis.com/baremetalhost-id: cz812
          kaas.mirantis.com/provider: baremetal
        name: cz812-managed-cluster-storage-worker-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.182
          bmhCredentialsName: 'cz812-cred'
        bootMACAddress: 0c:c4:7a:bc:ff:2e
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHostInventory_cz813-managed-cluster-storage-worker-noefi.yaml
      apiVersion: kaas.mirantis.com/v1alpha1
      kind: BareMetalHostInventory
      metadata:
        annotations:
          inspect.metal3.io/hardwaredetails-storage-sort-term: hctl ASC, wwn ASC, by_id ASC, name ASC
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          kaas.mirantis.com/baremetalhost-id: cz813
          kaas.mirantis.com/provider: baremetal
        name: cz813-managed-cluster-storage-worker-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.183
          bmhCredentialsName: 'cz813-cred'
        bootMACAddress: 0c:c4:7a:bc:fe:36
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHostInventory_cz814-managed-cluster-storage-worker-noefi.yaml
      apiVersion: kaas.mirantis.com/v1alpha1
      kind: BareMetalHostInventory
      metadata:
        annotations:
          inspect.metal3.io/hardwaredetails-storage-sort-term: hctl ASC, wwn ASC, by_id ASC, name ASC
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          kaas.mirantis.com/baremetalhost-id: cz814
          kaas.mirantis.com/provider: baremetal
        name: cz814-managed-cluster-storage-worker-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.184
          bmhCredentialsName: 'cz814-cred'
        bootMACAddress: 0c:c4:7a:bc:fb:20
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHostInventory_cz815-managed-cluster-worker-noefi.yaml
      apiVersion: kaas.mirantis.com/v1alpha1
      kind: BareMetalHostInventory
      metadata:
        annotations:
          inspect.metal3.io/hardwaredetails-storage-sort-term: hctl ASC, wwn ASC, by_id ASC, name ASC
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          kaas.mirantis.com/baremetalhost-id: cz815
          kaas.mirantis.com/provider: baremetal
        name: cz815-managed-cluster-worker-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.185
          bmhCredentialsName: 'cz815-cred'
        bootMACAddress: 0c:c4:7a:bc:fc:3e
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHost_cz7700-managed-cluster-control-noefi.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
          # This label links the Machine object to the exact bmh node
          kaas.mirantis.com/baremetalhost-id: cz7700
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        annotations:
          kaas.mirantis.com/baremetalhost-credentials-name: cz7700-cred
        name: cz7700-managed-cluster-control-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.12
          # credentialsName is updated automatically during cluster deployment
          credentialsName: ''
        bootMACAddress: 0c:c4:7a:34:52:04
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHost_cz7741-managed-cluster-control-noefi.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
          kaas.mirantis.com/baremetalhost-id: cz7741
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        annotations:
          kaas.mirantis.com/baremetalhost-credentials-name: cz7741-cred
        name: cz7741-managed-cluster-control-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.76
          credentialsName: ''
        bootMACAddress: 0c:c4:7a:34:92:f4
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHost_cz7743-managed-cluster-control-noefi.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
          kaas.mirantis.com/baremetalhost-id: cz7743
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        annotations:
          kaas.mirantis.com/baremetalhost-credentials-name: cz7743-cred
        name: cz7743-managed-cluster-control-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.78
          credentialsName: ''
        bootMACAddress: 0c:c4:7a:34:66:fc
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHost_cz812-managed-cluster-storage-worker-noefi.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/worker: worker
          kaas.mirantis.com/baremetalhost-id: cz812
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        annotations:
          kaas.mirantis.com/baremetalhost-credentials-name: cz812-cred
        name: cz812-managed-cluster-storage-worker-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.182
          credentialsName: ''
        bootMACAddress: 0c:c4:7a:bc:ff:2e
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHost_cz813-managed-cluster-storage-worker-noefi.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/worker: worker
          kaas.mirantis.com/baremetalhost-id: cz813
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        annotations:
          kaas.mirantis.com/baremetalhost-credentials-name: cz813-cred
        name: cz813-managed-cluster-storage-worker-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.183
          credentialsName: ''
        bootMACAddress: 0c:c4:7a:bc:fe:36
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHost_cz814-managed-cluster-storage-worker-noefi.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/worker: worker
          kaas.mirantis.com/baremetalhost-id: cz814
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        annotations:
          kaas.mirantis.com/baremetalhost-credentials-name: cz814-cred
        name: cz814-managed-cluster-storage-worker-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.184
          credentialsName: ''
        bootMACAddress: 0c:c4:7a:bc:fb:20
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHost_cz815-managed-cluster-worker-noefi.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/worker: worker
          kaas.mirantis.com/baremetalhost-id: cz815
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        annotations:
          kaas.mirantis.com/baremetalhost-credentials-name: cz815-cred
        name: cz815-managed-cluster-worker-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.185
          credentialsName: ''
        bootMACAddress: 0c:c4:7a:bc:fc:3e
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHost_cz7700-managed-cluster-control-noefi.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
          # This label is used to link a machine to the exact bmh node
          kaas.mirantis.com/baremetalhost-id: cz7700
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: cz7700-managed-cluster-control-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.12
          # The secret for credentials requires the username and password
          # keys in the Base64 encoding.
          credentialsName: cz7700-cred
        bootMACAddress: 0c:c4:7a:34:52:04
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHost_cz7741-managed-cluster-control-noefi.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
          kaas.mirantis.com/baremetalhost-id: cz7741
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: cz7741-managed-cluster-control-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.76
          credentialsName: cz7741-cred
        bootMACAddress: 0c:c4:7a:34:92:f4
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHost_cz7743-managed-cluster-control-noefi.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
          kaas.mirantis.com/baremetalhost-id: cz7743
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: cz7743-managed-cluster-control-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.78
          credentialsName: cz7743-cred
        bootMACAddress: 0c:c4:7a:34:66:fc
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHost_cz812-managed-cluster-storage-worker-noefi.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/worker: worker
          kaas.mirantis.com/baremetalhost-id: cz812
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: cz812-managed-cluster-storage-worker-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.182
          credentialsName: cz812-cred
        bootMACAddress: 0c:c4:7a:bc:ff:2e
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHost_cz813-managed-cluster-storage-worker-noefi.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/worker: worker
          kaas.mirantis.com/baremetalhost-id: cz813
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: cz813-managed-cluster-storage-worker-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.183
          credentialsName: cz813-cred
        bootMACAddress: 0c:c4:7a:bc:fe:36
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHost_cz814-managed-cluster-storage-worker-noefi.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/worker: worker
          kaas.mirantis.com/baremetalhost-id: cz814
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: cz814-managed-cluster-storage-worker-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.184
          credentialsName: cz814-cred
        bootMACAddress: 0c:c4:7a:bc:fb:20
        bootMode: legacy
        online: true
      
    • managed-ns_BareMetalHost_cz815-managed-cluster-worker-noefi.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHost
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/worker: worker
          kaas.mirantis.com/baremetalhost-id: cz815
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: cz815-managed-cluster-worker-noefi
        namespace: managed-ns
      spec:
        bmc:
          address: 192.168.1.185
          credentialsName: cz815-cred
        bootMACAddress: 0c:c4:7a:bc:fc:3e
        bootMode: legacy
        online: true
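
    The comment in the cz7700 example above states that the credentials secret requires the username and password keys in the Base64 encoding. The following Python sketch shows one way to produce such values. The helper name encode_bmc_credentials and the sample credentials are hypothetical and are not part of the product:

```python
import base64


def encode_bmc_credentials(username: str, password: str) -> dict:
    """Return Base64-encoded values for the username and password keys."""
    return {
        "username": base64.b64encode(username.encode("utf-8")).decode("ascii"),
        "password": base64.b64encode(password.encode("utf-8")).decode("ascii"),
    }


# Placeholder credentials, for illustration only.
encoded = encode_bmc_credentials("admin", "secret")
print(encoded)
```

    Base64 is an encoding, not encryption, so treat the resulting values as sensitive.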
      
  5. Verify that the inspecting phase has started:

    KUBECONFIG=kubeconfig kubectl -n managed-ns get bmh -o wide
    

    Example of system response:

    NAME                                       STATUS STATE CONSUMER BMC           BOOTMODE ONLINE ERROR REGION
    cz7700-managed-cluster-control-noefi       OK     inspecting     192.168.1.12  legacy   true         region-one
    cz7741-managed-cluster-control-noefi       OK     inspecting     192.168.1.76  legacy   true         region-one
    cz7743-managed-cluster-control-noefi       OK     inspecting     192.168.1.78  legacy   true         region-one
    cz812-managed-cluster-storage-worker-noefi OK     inspecting     192.168.1.182 legacy   true         region-one
    

    Wait for inspection to complete. Usually, it takes up to 15 minutes.
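
    Instead of rereading the table manually, you can evaluate the JSON form of the same command (-o json). The Python sketch below assumes that the value of the STATE column is exposed under status.provisioning.state, as in upstream Metal3 BareMetalHost objects; verify the field path on your cluster. The function name and the trimmed sample are hypothetical:

```python
import json


def hosts_not_in_state(bmh_list: dict, expected: str = "available") -> list:
    """Return names of hosts whose provisioning state differs from expected."""
    pending = []
    for item in bmh_list.get("items", []):
        state = item.get("status", {}).get("provisioning", {}).get("state", "")
        if state != expected:
            pending.append(item["metadata"]["name"])
    return pending


# Trimmed sample shaped like `kubectl -n managed-ns get bmh -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "cz7700-managed-cluster-control-noefi"},
   "status": {"provisioning": {"state": "available"}}},
  {"metadata": {"name": "cz7741-managed-cluster-control-noefi"},
   "status": {"provisioning": {"state": "inspecting"}}}
]}
""")
print(hosts_not_in_state(sample))
```

    An empty list means that every host has reached the expected state.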

  6. Collect the bmh hardware information to create the L2Template and BareMetalHostProfile objects:

    KUBECONFIG=kubeconfig kubectl -n managed-ns get bmh -o wide
    

    Example of system response:

    NAME                                       STATUS STATE CONSUMER BMC           BOOTMODE ONLINE ERROR REGION
    cz7700-managed-cluster-control-noefi       OK     available      192.168.1.12  legacy   true         region-one
    cz7741-managed-cluster-control-noefi       OK     available      192.168.1.76  legacy   true         region-one
    cz7743-managed-cluster-control-noefi       OK     available      192.168.1.78  legacy   true         region-one
    cz812-managed-cluster-storage-worker-noefi OK     available      192.168.1.182 legacy   true         region-one
    
    KUBECONFIG=kubeconfig kubectl -n managed-ns get bmh cz7700-managed-cluster-control-noefi -o yaml | less
    

    Example of system response:

    ...
    nics:
    - ip: ""
      mac: 0c:c4:7a:1d:f4:a6
      model: 0x8086 0x10fb
      # discovered interfaces
      name: ens4f0
      pxe: false
      # temporary PXE address discovered from baremetal-mgmt
    - ip: 172.16.170.30
      mac: 0c:c4:7a:34:52:04
      model: 0x8086 0x1521
      name: enp9s0f0
      pxe: true
      # duplicates temporary PXE address discovered from baremetal-mgmt
      # since we have fallback-bond configured on host
    - ip: 172.16.170.33
      mac: 0c:c4:7a:34:52:05
      model: 0x8086 0x1521
      # discovered interfaces
      name: enp9s0f1
      pxe: false
    ...
    storage:
    - by_path: /dev/disk/by-path/pci-0000:00:1f.2-ata-1
      model: Samsung SSD 850
      name: /dev/sda
      rotational: false
      sizeBytes: 500107862016
    - by_path: /dev/disk/by-path/pci-0000:00:1f.2-ata-2
      model: Samsung SSD 850
      name: /dev/sdb
      rotational: false
      sizeBytes: 500107862016
    ...
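
    The discovered nics and storage entries are the input for the interface mapping and the disk layout in the following steps. The Python sketch below works on the values from the response above: it picks the interface flagged as PXE and checks the disk sizes against a minSize value. It assumes that the Gi and Mi suffixes denote binary multiples; the helper names are hypothetical:

```python
UNITS = {"Mi": 2**20, "Gi": 2**30}


def size_to_bytes(size: str) -> int:
    """Convert a size such as '120Gi' to bytes (binary units are assumed)."""
    for suffix, factor in UNITS.items():
        if size.endswith(suffix):
            return int(size[: -len(suffix)]) * factor
    return int(size)


def pxe_nic_names(nics: list) -> list:
    """Return names of interfaces flagged with pxe: true."""
    return [nic["name"] for nic in nics if nic.get("pxe")]


def disks_meeting_min_size(storage: list, min_size: str) -> list:
    """Return device names whose sizeBytes is at least min_size."""
    threshold = size_to_bytes(min_size)
    return [disk["name"] for disk in storage if disk["sizeBytes"] >= threshold]


# Values copied from the example system response.
nics = [
    {"name": "ens4f0", "mac": "0c:c4:7a:1d:f4:a6", "pxe": False},
    {"name": "enp9s0f0", "mac": "0c:c4:7a:34:52:04", "pxe": True},
    {"name": "enp9s0f1", "mac": "0c:c4:7a:34:52:05", "pxe": False},
]
storage = [
    {"name": "/dev/sda", "sizeBytes": 500107862016},
    {"name": "/dev/sdb", "sizeBytes": 500107862016},
]
print(pxe_nic_names(nics))
print(disks_meeting_min_size(storage, "120Gi"))
```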
    
  7. Create bare metal host profiles:

    • managed-ns_BareMetalHostProfile_bmhp-cluster-default.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHostProfile
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          # This label marks the profile as the default one in the
          # namespace, so machines without an explicitly selected profile
          # use this template
          kaas.mirantis.com/defaultBMHProfile: 'true'
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: bmhp-cluster-default
        namespace: managed-ns
      spec:
        devices:
        - device:
            byPath: /dev/disk/by-path/pci-0000:00:1f.2-ata-1
            minSize: 120Gi
            wipe: true
          partitions:
          - name: bios_grub
            partflags:
            - bios_grub
            size: 4Mi
            wipe: true
          - name: uefi
            partflags:
            - esp
            size: 200Mi
            wipe: true
          - name: config-2
            size: 64Mi
            wipe: true
          - name: lvm_dummy_part
            size: 1Gi
            wipe: true
          - name: lvm_root_part
            size: 0
            wipe: true
        - device:
            byPath: /dev/disk/by-path/pci-0000:00:1f.2-ata-2
            minSize: 30Gi
            wipe: true
        - device:
            byPath: /dev/disk/by-path/pci-0000:00:1f.2-ata-3
            minSize: 30Gi
            wipe: true
          partitions:
          - name: lvm_lvp_part
            size: 0
            wipe: true
        - device:
            byPath: /dev/disk/by-path/pci-0000:00:1f.2-ata-4
            wipe: true
        fileSystems:
        - fileSystem: vfat
          partition: config-2
        - fileSystem: vfat
          mountPoint: /boot/efi
          partition: uefi
        - fileSystem: ext4
          logicalVolume: root
          mountPoint: /
        - fileSystem: ext4
          logicalVolume: lvp
          mountPoint: /mnt/local-volumes/
        grubConfig:
          defaultGrubOptions:
          - GRUB_DISABLE_RECOVERY="true"
          - GRUB_PRELOAD_MODULES=lvm
          - GRUB_TIMEOUT=30
        kernelParameters:
          modules:
          - content: 'options kvm_intel nested=1'
            filename: kvm_intel.conf
          sysctl:
          # For the list of options prohibited to change, refer to
          # https://docs.mirantis.com/mke/3.7/install/predeployment/set-up-kernel-default-protections.html
            fs.aio-max-nr: '1048576'
            fs.file-max: '9223372036854775807'
            fs.inotify.max_user_instances: '4096'
            kernel.core_uses_pid: '1'
            kernel.dmesg_restrict: '1'
            net.ipv4.conf.all.rp_filter: '0'
            net.ipv4.conf.default.rp_filter: '0'
            net.ipv4.conf.k8s-ext.rp_filter: '0'
            net.ipv4.conf.m-pub.rp_filter: '0'
            vm.max_map_count: '262144'
        logicalVolumes:
        - name: root
          size: 0
          vg: lvm_root
        - name: lvp
          size: 0
          vg: lvm_lvp
        postDeployScript: |
          #!/bin/bash -ex
          # used for test-debug only!
          echo "root:r00tme" | sudo chpasswd
          echo 'ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"' > /etc/udev/rules.d/60-ssd-scheduler.rules
          echo $(date) 'post_deploy_script done' >> /root/post_deploy_done
      
        preDeployScript: |
          #!/bin/bash -ex
          echo "$(date) pre_deploy_script done" >> /root/pre_deploy_done
        volumeGroups:
        - devices:
          - partition: lvm_root_part
          name: lvm_root
        - devices:
          - partition: lvm_lvp_part
          name: lvm_lvp
        - devices:
          - partition: lvm_dummy_part
          # Here we create an LVM volume group but do not format or mount it anywhere
          name: lvm_forawesomeapp
      
    • managed-ns_BareMetalHostProfile_worker-storage1.yaml
      apiVersion: metal3.io/v1alpha1
      kind: BareMetalHostProfile
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: worker-storage1
        namespace: managed-ns
      spec:
        devices:
        - device:
            minSize: 120Gi
            wipe: true
          partitions:
          - name: bios_grub
            partflags:
            - bios_grub
            size: 4Mi
            wipe: true
          - name: uefi
            partflags:
            - esp
            size: 200Mi
            wipe: true
          - name: config-2
            size: 64Mi
            wipe: true
          # Create a dummy partition without mounting it
          - name: lvm_dummy_part
            size: 1Gi
            wipe: true
          - name: lvm_root_part
            size: 0
            wipe: true
        - device:
            # Will be used for Ceph, so required to be wiped
            byPath: /dev/disk/by-path/pci-0000:00:1f.2-ata-1
            minSize: 30Gi
            wipe: true
        - device:
            byPath: /dev/disk/by-path/pci-0000:00:1f.2-ata-2
            minSize: 30Gi
            wipe: true
          partitions:
          - name: lvm_lvp_part
            size: 0
            wipe: true
        - device:
            byPath: /dev/disk/by-path/pci-0000:00:1f.2-ata-3
            wipe: true
        - device:
            byPath: /dev/disk/by-path/pci-0000:00:1f.2-ata-4
            minSize: 30Gi
            wipe: true
          partitions:
            - name: lvm_lvp_part_sdf
              wipe: true
              size: 0
        fileSystems:
        - fileSystem: vfat
          partition: config-2
        - fileSystem: vfat
          mountPoint: /boot/efi
          partition: uefi
        - fileSystem: ext4
          logicalVolume: root
          mountPoint: /
        - fileSystem: ext4
          logicalVolume: lvp
          mountPoint: /mnt/local-volumes/
        grubConfig:
          defaultGrubOptions:
          - GRUB_DISABLE_RECOVERY="true"
          - GRUB_PRELOAD_MODULES=lvm
          - GRUB_TIMEOUT=30
        kernelParameters:
          modules:
          - content: 'options kvm_intel nested=1'
            filename: kvm_intel.conf
          sysctl:
          # For the list of options prohibited to change, refer to
          # https://docs.mirantis.com/mke/3.6/install/predeployment/set-up-kernel-default-protections.html
            fs.aio-max-nr: '1048576'
            fs.file-max: '9223372036854775807'
            fs.inotify.max_user_instances: '4096'
            kernel.core_uses_pid: '1'
            kernel.dmesg_restrict: '1'
            net.ipv4.conf.all.rp_filter: '0'
            net.ipv4.conf.default.rp_filter: '0'
            net.ipv4.conf.k8s-ext.rp_filter: '0'
            net.ipv4.conf.m-pub.rp_filter: '0'
            vm.max_map_count: '262144'
        logicalVolumes:
        - name: root
          size: 0
          vg: lvm_root
        - name: lvp
          size: 0
          vg: lvm_lvp
        postDeployScript: |
          #!/bin/bash -ex
          # Used for test-debug only! Allows the operator to log in via TTY.
          echo "root:r00tme" | sudo chpasswd
          # Just an example of enforcing "ssd" disks to use the "deadline" I/O scheduler.
          echo 'ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"' > /etc/udev/rules.d/60-ssd-scheduler.rules
          echo $(date) 'post_deploy_script done' >> /root/post_deploy_done
      
        preDeployScript: |
          #!/bin/bash -ex
          echo "$(date) pre_deploy_script done" >> /root/pre_deploy_done
      
        volumeGroups:
        - devices:
          - partition: lvm_root_part
          name: lvm_root
        - devices:
          - partition: lvm_lvp_part
          - partition: lvm_lvp_part_sdf
          name: lvm_lvp
        - devices:
          - partition: lvm_dummy_part
          name: lvm_forawesomeapp
      

    Note

    If you mount the /var directory, review Mounting recommendations for the /var directory before configuring BareMetalHostProfile.
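
    In the profiles above, volumeGroups and fileSystems refer to partitions by name, and fileSystems and logicalVolumes refer to logical volumes and volume groups by name. The Python sketch below is a hypothetical pre-flight check, not a product tool: it lists names that are used but not defined. The spec dictionary is a trimmed copy of the worker-storage1 example:

```python
def undefined_references(spec: dict) -> list:
    """List names used in a profile spec that are not defined in it."""
    partitions = {
        part["name"]
        for dev in spec.get("devices", [])
        for part in dev.get("partitions", [])
    }
    groups = {vg["name"] for vg in spec.get("volumeGroups", [])}
    volumes = {lv["name"] for lv in spec.get("logicalVolumes", [])}

    problems = []
    for vg in spec.get("volumeGroups", []):
        for dev in vg.get("devices", []):
            if dev["partition"] not in partitions:
                problems.append(
                    f"volumeGroup {vg['name']}: unknown partition {dev['partition']}")
    for lv in spec.get("logicalVolumes", []):
        if lv["vg"] not in groups:
            problems.append(
                f"logicalVolume {lv['name']}: unknown volume group {lv['vg']}")
    for fs in spec.get("fileSystems", []):
        if "partition" in fs and fs["partition"] not in partitions:
            problems.append(f"fileSystem: unknown partition {fs['partition']}")
        if "logicalVolume" in fs and fs["logicalVolume"] not in volumes:
            problems.append(
                f"fileSystem: unknown logical volume {fs['logicalVolume']}")
    return problems


spec = {
    "devices": [
        {"partitions": [{"name": "bios_grub"}, {"name": "uefi"},
                        {"name": "config-2"}, {"name": "lvm_dummy_part"},
                        {"name": "lvm_root_part"}]},
        {"partitions": [{"name": "lvm_lvp_part"}]},
        {"partitions": [{"name": "lvm_lvp_part_sdf"}]},
    ],
    "volumeGroups": [
        {"name": "lvm_root", "devices": [{"partition": "lvm_root_part"}]},
        {"name": "lvm_lvp", "devices": [{"partition": "lvm_lvp_part"},
                                        {"partition": "lvm_lvp_part_sdf"}]},
        {"name": "lvm_forawesomeapp", "devices": [{"partition": "lvm_dummy_part"}]},
    ],
    "logicalVolumes": [{"name": "root", "vg": "lvm_root"},
                       {"name": "lvp", "vg": "lvm_lvp"}],
    "fileSystems": [{"partition": "config-2"}, {"partition": "uefi"},
                    {"logicalVolume": "root"}, {"logicalVolume": "lvp"}],
}
print(undefined_references(spec))
```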

  8. Create the L2Template objects:

    • managed-ns_L2Template_bm-1490-template-controls-netplan.yaml
      apiVersion: ipam.mirantis.com/v1alpha1
      kind: L2Template
      metadata:
        labels:
          bm-1490-template-controls-netplan: anymagicstring
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: bm-1490-template-controls-netplan
        namespace: managed-ns
      spec:
        ifMapping:
        - enp9s0f0
        - enp9s0f1
        - eno1
        - ens3f1
        l3Layout:
        - scope: namespace
          subnetName: lcm-nw
        - scope: namespace
          subnetName: storage-frontend
        - scope: namespace
          subnetName: storage-backend
        - scope: namespace
          subnetName: metallb-public-for-extiface
        npTemplate: |-
          version: 2
          ethernets:
            {{nic 0}}:
              dhcp4: false
              dhcp6: false
              match:
                macaddress: {{mac 0}}
              set-name: {{nic 0}}
              mtu: 1500
            {{nic 1}}:
              dhcp4: false
              dhcp6: false
              match:
                macaddress: {{mac 1}}
              set-name: {{nic 1}}
              mtu: 1500
            {{nic 2}}:
              dhcp4: false
              dhcp6: false
              match:
                macaddress: {{mac 2}}
              set-name: {{nic 2}}
              mtu: 1500
            {{nic 3}}:
              dhcp4: false
              dhcp6: false
              match:
                macaddress: {{mac 3}}
              set-name: {{nic 3}}
              mtu: 1500
          bonds:
            bond0:
              parameters:
                mode: 802.3ad
                #transmit-hash-policy: layer3+4
                #mii-monitor-interval: 100
              interfaces:
                - {{ nic 0 }}
                - {{ nic 1 }}
            bond1:
              parameters:
                mode: 802.3ad
                #transmit-hash-policy: layer3+4
                #mii-monitor-interval: 100
              interfaces:
                - {{ nic 2 }}
                - {{ nic 3 }}
          vlans:
            stor-f:
              id: 1494
              link: bond1
              addresses:
                - {{ip "stor-f:storage-frontend"}}
            stor-b:
              id: 1489
              link: bond1
              addresses:
                - {{ip "stor-b:storage-backend"}}
            m-pub:
              id: 1491
              link: bond0
          bridges:
            k8s-ext:
              interfaces: [m-pub]
              addresses:
                - {{ ip "k8s-ext:metallb-public-for-extiface" }}
            k8s-lcm:
              dhcp4: false
              dhcp6: false
              gateway4: {{ gateway_from_subnet "lcm-nw" }}
              addresses:
                - {{ ip "k8s-lcm:lcm-nw" }}
              nameservers:
                addresses: [ 172.18.176.6 ]
              interfaces:
                - bond0
      
    • managed-ns_L2Template_bm-1490-template-workers-netplan.yaml
      apiVersion: ipam.mirantis.com/v1alpha1
      kind: L2Template
      metadata:
        labels:
          bm-1490-template-workers-netplan: anymagicstring
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: bm-1490-template-workers-netplan
        namespace: managed-ns
      spec:
        ifMapping:
        - eno1
        - eno2
        - ens7f0
        - ens7f1
        l3Layout:
        - scope: namespace
          subnetName: lcm-nw
        - scope: namespace
          subnetName: storage-frontend
        - scope: namespace
          subnetName: storage-backend
        - scope: namespace
          subnetName: metallb-public-for-extiface
        npTemplate: |-
          version: 2
          ethernets:
            {{nic 0}}:
              match:
                macaddress: {{mac 0}}
              set-name: {{nic 0}}
              mtu: 1500
            {{nic 1}}:
              dhcp4: false
              dhcp6: false
              match:
                macaddress: {{mac 1}}
              set-name: {{nic 1}}
              mtu: 1500
            {{nic 2}}:
              dhcp4: false
              dhcp6: false
              match:
                macaddress: {{mac 2}}
              set-name: {{nic 2}}
              mtu: 1500
            {{nic 3}}:
              dhcp4: false
              dhcp6: false
              match:
                macaddress: {{mac 3}}
              set-name: {{nic 3}}
              mtu: 1500
          bonds:
            bond0:
              interfaces:
                - {{ nic 1 }}
            bond1:
              parameters:
                mode: 802.3ad
                #transmit-hash-policy: layer3+4
                #mii-monitor-interval: 100
              interfaces:
                - {{ nic 2 }}
                - {{ nic 3 }}
          vlans:
            stor-f:
              id: 1494
              link: bond1
              addresses:
                - {{ip "stor-f:storage-frontend"}}
            stor-b:
              id: 1489
              link: bond1
              addresses:
                - {{ip "stor-b:storage-backend"}}
            m-pub:
              id: 1491
              link: {{ nic 1 }}
          bridges:
            k8s-lcm:
              interfaces:
                - {{ nic 0 }}
              gateway4: {{ gateway_from_subnet "lcm-nw" }}
              addresses:
                - {{ ip "k8s-lcm:lcm-nw" }}
              nameservers:
                addresses: [ 172.18.176.6 ]
            k8s-ext:
              interfaces: [m-pub]
      
    • managed-ns_L2Template_bm-1490-template-controls-netplan-cz7700-pxebond.yaml
      apiVersion: ipam.mirantis.com/v1alpha1
      kind: L2Template
      metadata:
        labels:
          bm-1490-template-controls-netplan-cz7700-pxebond: anymagicstring
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: bm-1490-template-controls-netplan-cz7700-pxebond
        namespace: managed-ns
      spec:
        ifMapping:
        - enp9s0f0
        - enp9s0f1
        - eno1
        - ens3f1
        l3Layout:
        - scope: namespace
          subnetName: lcm-nw
        - scope: namespace
          subnetName: storage-frontend
        - scope: namespace
          subnetName: storage-backend
        - scope: namespace
          subnetName: metallb-public-for-extiface
        npTemplate: |-
          version: 2
          ethernets:
            {{nic 0}}:
              dhcp4: false
              dhcp6: false
              match:
                macaddress: {{mac 0}}
              set-name: {{nic 0}}
              mtu: 1500
            {{nic 1}}:
              dhcp4: false
              dhcp6: false
              match:
                macaddress: {{mac 1}}
              set-name: {{nic 1}}
              mtu: 1500
            {{nic 2}}:
              dhcp4: false
              dhcp6: false
              match:
                macaddress: {{mac 2}}
              set-name: {{nic 2}}
              mtu: 1500
            {{nic 3}}:
              dhcp4: false
              dhcp6: false
              match:
                macaddress: {{mac 3}}
              set-name: {{nic 3}}
              mtu: 1500
          bonds:
            bond0:
              parameters:
                mode: 802.3ad
                #transmit-hash-policy: layer3+4
                #mii-monitor-interval: 100
              interfaces:
                - {{ nic 0 }}
                - {{ nic 1 }}
            bond1:
              parameters:
                mode: 802.3ad
                #transmit-hash-policy: layer3+4
                #mii-monitor-interval: 100
              interfaces:
                - {{ nic 2 }}
                - {{ nic 3 }}
          vlans:
            stor-f:
              id: 1494
              link: bond1
              addresses:
                - {{ip "stor-f:storage-frontend"}}
            stor-b:
              id: 1489
              link: bond1
              addresses:
                - {{ip "stor-b:storage-backend"}}
            m-pub:
              id: 1491
              link: bond0
          bridges:
            k8s-ext:
              interfaces: [m-pub]
              addresses:
                - {{ ip "k8s-ext:metallb-public-for-extiface" }}
            k8s-lcm:
              dhcp4: false
              dhcp6: false
              gateway4: {{ gateway_from_subnet "lcm-nw" }}
              addresses:
                - {{ ip "k8s-lcm:lcm-nw" }}
              nameservers:
                addresses: [ 172.18.176.6 ]
              interfaces:
                - bond0
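
    The npTemplate fields above use the {{nic N}} and {{mac N}} macros. The Python sketch below is a simplified illustration of the substitution and is not the actual IPAM controller logic. It assumes that {{nic N}} resolves to entry N of ifMapping and {{mac N}} to the MAC address discovered for that interface during inspection. The render function is hypothetical, and other macros, such as {{ip ...}}, are left untouched:

```python
import re

# Matches {{nic 0}}, {{ nic 0 }}, {{mac 1}}, and so on.
MACRO = re.compile(r"\{\{\s*(nic|mac)\s+(\d+)\s*\}\}")


def render(template: str, if_mapping: list, macs: dict) -> str:
    """Expand nic and mac macros using the interface mapping list."""
    def expand(match):
        kind, index = match.group(1), int(match.group(2))
        name = if_mapping[index]
        return name if kind == "nic" else macs[name]
    return MACRO.sub(expand, template)


template = """\
{{nic 0}}:
  match:
    macaddress: {{mac 0}}
  set-name: {{nic 0}}
bond0:
  interfaces:
    - {{ nic 0 }}
    - {{ nic 1 }}
"""
# Interface names and MAC addresses taken from the cz7700 inspection output.
if_mapping = ["enp9s0f0", "enp9s0f1"]
macs = {"enp9s0f0": "0c:c4:7a:34:52:04", "enp9s0f1": "0c:c4:7a:34:52:05"}

rendered = render(template, if_mapping, macs)
print(rendered)
```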
      
  9. Create the Subnet objects:

    • managed-ns_Subnet_lcm-nw.yaml
      apiVersion: ipam.mirantis.com/v1alpha1
      kind: Subnet
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          ipam/SVC-k8s-lcm: '1'
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: lcm-nw
        namespace: managed-ns
      spec:
        cidr: 172.16.170.0/24
        excludeRanges:
        - 172.16.170.150
        gateway: 172.16.170.1
        includeRanges:
        - 172.16.170.150-172.16.170.250
      
    • managed-ns_Subnet_metallb-public-for-extiface.yaml
      apiVersion: ipam.mirantis.com/v1alpha1
      kind: Subnet
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: metallb-public-for-extiface
        namespace: managed-ns
      spec:
        cidr: 172.16.168.0/24
        gateway: 172.16.168.1
        includeRanges:
        - 172.16.168.10-172.16.168.30
      
    • managed-ns_Subnet_storage-backend.yaml
      apiVersion: ipam.mirantis.com/v1alpha1
      kind: Subnet
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          ipam/SVC-ceph-cluster: '1'
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: storage-backend
        namespace: managed-ns
      spec:
        cidr: 10.12.0.0/24
      
    • managed-ns_Subnet_storage-frontend.yaml
      apiVersion: ipam.mirantis.com/v1alpha1
      kind: Subnet
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          ipam/SVC-ceph-public: '1'
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: storage-frontend
        namespace: managed-ns
      spec:
        cidr: 10.12.1.0/24
      
  10. Create MetalLB configuration objects:

    managed-ns_MetalLBConfig-lb-managed.yaml
    apiVersion: kaas.mirantis.com/v1alpha1
    kind: MetalLBConfig
    metadata:
      labels:
        cluster.sigs.k8s.io/cluster-name: managed-cluster
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: lb-managed
      namespace: managed-ns
    spec:
      ipAddressPools:
      - name: services
        spec:
          addresses:
          - 172.16.168.31-172.16.168.50
          autoAssign: true
          avoidBuggyIPs: false
      l2Advertisements:
      - name: services
        spec:
          interfaces:
          - k8s-ext
          ipAddressPools:
          - services
    
    • managed-ns_Subnet_metallb-public-for-managed.yaml
      apiVersion: ipam.mirantis.com/v1alpha1
      kind: Subnet
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          ipam/SVC-MetalLB: '1'
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: metallb-public-for-managed
        namespace: managed-ns
      spec:
        cidr: 172.16.168.0/24
        includeRanges:
        - 172.16.168.31-172.16.168.50
      
    • managed-ns_MetalLBConfig-lb-managed.yaml

      Note

      Applies since Container Cloud 2.21.0 and 2.21.1 for MOSK as Technology Preview and since 2.24.0 as GA for management clusters. For managed clusters, it is generally available since Container Cloud 2.25.0.

      apiVersion: kaas.mirantis.com/v1alpha1
      kind: MetalLBConfig
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: lb-managed
        namespace: managed-ns
      spec:
        templateName: lb-managed-template
      
    • managed-ns_MetalLBConfigTemplate-lb-managed-template.yaml

      Note

      The MetalLBConfigTemplate object is available as Technology Preview since Container Cloud 2.24.0 (Cluster release 14.0.0) and is generally available since Container Cloud 2.25.0 (Cluster releases 17.0.0 and 16.0.0).

      apiVersion: ipam.mirantis.com/v1alpha1
      kind: MetalLBConfigTemplate
      metadata:
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        name: lb-managed-template
        namespace: managed-ns
      spec:
        templates:
          l2Advertisements: |
            - name: services
              spec:
                ipAddressPools:
                  - services
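
    In these examples, the metallb-public-for-extiface node range and the MetalLB services range share the 172.16.168.0/24 CIDR but use separate address ranges. The Python sketch below checks that each range lies inside the CIDR and that the two ranges do not overlap. It assumes that the ranges are meant to stay disjoint, as they are in the examples; the helper names are hypothetical:

```python
import ipaddress


def parse_range(text: str):
    """Parse 'A-B' (or a single address) into a (first, last) tuple."""
    first, _, last = text.partition("-")
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last or first)
    return start, end


def within(cidr: str, rng) -> bool:
    """Check that both ends of the range belong to the CIDR."""
    net = ipaddress.ip_network(cidr)
    return rng[0] in net and rng[1] in net


def overlap(a, b) -> bool:
    """Check whether two inclusive ranges share at least one address."""
    return a[0] <= b[1] and b[0] <= a[1]


cidr = "172.16.168.0/24"
extiface = parse_range("172.16.168.10-172.16.168.30")  # node addresses on k8s-ext
metallb = parse_range("172.16.168.31-172.16.168.50")   # MetalLB services pool
print(within(cidr, extiface), within(cidr, metallb), overlap(extiface, metallb))
```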
      
    
  11. Create the PublicKey object for a managed cluster connection. For details, see PublicKey resource.

    managed-ns_PublicKey_managed-cluster-key.yaml
    apiVersion: kaas.mirantis.com/v1alpha1
    kind: PublicKey
    metadata:
      name: managed-cluster-key
      namespace: managed-ns
    spec:
      publicKey: ssh-rsa AAEXAMPLEXXX
    
  12. Create the Cluster object. For details, see Cluster resource.

    managed-ns_Cluster_managed-cluster.yaml
    apiVersion: cluster.k8s.io/v1alpha1
    kind: Cluster
    metadata:
      annotations:
        kaas.mirantis.com/lcm: 'true'
      labels:
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: managed-cluster
      namespace: managed-ns
    spec:
      clusterNetwork:
        pods:
          cidrBlocks:
          - 192.169.0.0/16
        serviceDomain: ''
        services:
          cidrBlocks:
          - 10.232.0.0/18
      providerSpec:
        value:
          apiVersion: baremetal.k8s.io/v1alpha1
          dedicatedControlPlane: false
          helmReleases:
          - name: ceph-controller
          - enabled: true
            name: stacklight
            values:
              alertmanagerSimpleConfig:
                email:
                  enabled: false
                slack:
                  enabled: false
              highAvailabilityEnabled: false
              logging:
                enabled: false
                persistentVolumeClaimSize: 30Gi
              prometheusServer:
                customAlerts: []
                persistentVolumeClaimSize: 16Gi
                retentionSize: 15GB
                retentionTime: 15d
                watchDogAlertEnabled: false
          - name: metallb
            values: {}
          kind: BaremetalClusterProviderSpec
          loadBalancerHost: 172.16.168.3
          publicKeys:
          - name: managed-cluster-key
          region: region-one
          release: mke-5-16-0-3-3-6
    
  13. Create the Machine objects linked to each bmh node. For details, see Machine resource.

    • managed-ns_Machine_cz7700-managed-cluster-control-noefi-.yaml
      apiVersion: cluster.k8s.io/v1alpha1
      kind: Machine
      metadata:
        generateName: cz7700-managed-cluster-control-noefi-
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          cluster.sigs.k8s.io/control-plane: controlplane
          hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        namespace: managed-ns
      spec:
        providerSpec:
          value:
            apiVersion: baremetal.k8s.io/v1alpha1
            hostSelector:
              matchLabels:
                kaas.mirantis.com/baremetalhost-id: cz7700
            kind: BareMetalMachineProviderSpec
            l2TemplateSelector:
              label: bm-1490-template-controls-netplan-cz7700-pxebond
            publicKeys:
            - name: managed-cluster-key
      
    • managed-ns_Machine_cz7741-managed-cluster-control-noefi-.yaml
      apiVersion: cluster.k8s.io/v1alpha1
      kind: Machine
      metadata:
        generateName: cz7741-managed-cluster-control-noefi-
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          cluster.sigs.k8s.io/control-plane: controlplane
          hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        namespace: managed-ns
      spec:
        providerSpec:
          value:
            apiVersion: baremetal.k8s.io/v1alpha1
            bareMetalHostProfile:
              name: bmhp-cluster-default
              namespace: managed-ns
            hostSelector:
              matchLabels:
                kaas.mirantis.com/baremetalhost-id: cz7741
            kind: BareMetalMachineProviderSpec
            l2TemplateSelector:
              label: bm-1490-template-controls-netplan
            publicKeys:
            - name: managed-cluster-key
      
    • managed-ns_Machine_cz7743-managed-cluster-control-noefi-.yaml
      apiVersion: cluster.k8s.io/v1alpha1
      kind: Machine
      metadata:
        generateName: cz7743-managed-cluster-control-noefi-
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          cluster.sigs.k8s.io/control-plane: controlplane
          hostlabel.bm.kaas.mirantis.com/controlplane: controlplane
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        namespace: managed-ns
      spec:
        providerSpec:
          value:
            apiVersion: baremetal.k8s.io/v1alpha1
            bareMetalHostProfile:
              name: bmhp-cluster-default
              namespace: managed-ns
            hostSelector:
              matchLabels:
                kaas.mirantis.com/baremetalhost-id: cz7743
            kind: BareMetalMachineProviderSpec
            l2TemplateSelector:
              label: bm-1490-template-controls-netplan
            publicKeys:
            - name: managed-cluster-key
      
    • managed-ns_Machine_cz812-managed-cluster-storage-worker-noefi-.yaml
      apiVersion: cluster.k8s.io/v1alpha1
      kind: Machine
      metadata:
        generateName: cz812-managed-cluster-storage-worker-noefi-
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/storage: storage
          hostlabel.bm.kaas.mirantis.com/worker: worker
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        namespace: managed-ns
      spec:
        providerSpec:
          value:
            apiVersion: baremetal.k8s.io/v1alpha1
            bareMetalHostProfile:
              name: worker-storage1
              namespace: managed-ns
            hostSelector:
              matchLabels:
                kaas.mirantis.com/baremetalhost-id: cz812
            kind: BareMetalMachineProviderSpec
            l2TemplateSelector:
              label: bm-1490-template-workers-netplan
            publicKeys:
            - name: managed-cluster-key
      
    • managed-ns_Machine_cz813-managed-cluster-storage-worker-noefi-.yaml
      apiVersion: cluster.k8s.io/v1alpha1
      kind: Machine
      metadata:
        generateName: cz813-managed-cluster-storage-worker-noefi-
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/storage: storage
          hostlabel.bm.kaas.mirantis.com/worker: worker
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        namespace: managed-ns
      spec:
        providerSpec:
          value:
            apiVersion: baremetal.k8s.io/v1alpha1
            bareMetalHostProfile:
              name: worker-storage1
              namespace: managed-ns
            hostSelector:
              matchLabels:
                kaas.mirantis.com/baremetalhost-id: cz813
            kind: BareMetalMachineProviderSpec
            l2TemplateSelector:
              label: bm-1490-template-workers-netplan
            publicKeys:
            - name: managed-cluster-key
      
    • managed-ns_Machine_cz814-managed-cluster-storage-worker-noefi-.yaml
      apiVersion: cluster.k8s.io/v1alpha1
      kind: Machine
      metadata:
        generateName: cz814-managed-cluster-storage-worker-noefi-
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/storage: storage
          hostlabel.bm.kaas.mirantis.com/worker: worker
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
        namespace: managed-ns
      spec:
        providerSpec:
          value:
            apiVersion: baremetal.k8s.io/v1alpha1
            bareMetalHostProfile:
              name: worker-storage1
              namespace: managed-ns
            hostSelector:
              matchLabels:
                kaas.mirantis.com/baremetalhost-id: cz814
            kind: BareMetalMachineProviderSpec
            l2TemplateSelector:
              label: bm-1490-template-workers-netplan
            publicKeys:
            - name: managed-cluster-key
      
    • managed-ns_Machine_cz815-managed-cluster-worker-noefi-.yaml
      apiVersion: cluster.k8s.io/v1alpha1
      kind: Machine
      metadata:
        generateName: cz815-managed-cluster-worker-noefi-
        labels:
          cluster.sigs.k8s.io/cluster-name: managed-cluster
          hostlabel.bm.kaas.mirantis.com/worker: worker
          kaas.mirantis.com/provider: baremetal
          kaas.mirantis.com/region: region-one
          si-role/node-for-delete: 'true'
        namespace: managed-ns
      spec:
        providerSpec:
          value:
            apiVersion: baremetal.k8s.io/v1alpha1
            bareMetalHostProfile:
              name: worker-storage1
              namespace: managed-ns
            hostSelector:
              matchLabels:
                kaas.mirantis.com/baremetalhost-id: cz815
            kind: BareMetalMachineProviderSpec
            l2TemplateSelector:
              label: bm-1490-template-workers-netplan
            publicKeys:
            - name: managed-cluster-key
      
  14. Verify that the bmh nodes are in the provisioning state:

    KUBECONFIG=kubeconfig kubectl -n managed-ns get bmh -o wide
    

    Example of system response:

    NAME                                  STATUS STATE          CONSUMER                                    BMC          BOOTMODE   ONLINE  ERROR REGION
    cz7700-managed-cluster-control-noefi  OK     provisioning   cz7700-managed-cluster-control-noefi-8bkqw  192.168.1.12  legacy     true          region-one
    cz7741-managed-cluster-control-noefi  OK     provisioning   cz7741-managed-cluster-control-noefi-42tp2  192.168.1.76  legacy     true          region-one
    cz7743-managed-cluster-control-noefi  OK     provisioning   cz7743-managed-cluster-control-noefi-8cwpw  192.168.1.78  legacy     true          region-one
    ...
    

    Wait until all bmh nodes are in the provisioned state.

  15. Verify that the lcmmachine phase has started:

    KUBECONFIG=kubeconfig kubectl -n managed-ns get lcmmachines  -o wide
    

    Example of system response:

    NAME                                       CLUSTERNAME       TYPE      STATE   INTERNALIP     HOSTNAME                                         AGENTVERSION
    cz7700-managed-cluster-control-noefi-8bkqw managed-cluster   control   Deploy  172.16.170.153 kaas-node-803721b4-227c-4675-acc5-15ff9d3cfde2   v0.2.0-349-g4870b7f5
    cz7741-managed-cluster-control-noefi-42tp2 managed-cluster   control   Prepare 172.16.170.152 kaas-node-6b8f0d51-4c5e-43c5-ac53-a95988b1a526   v0.2.0-349-g4870b7f5
    cz7743-managed-cluster-control-noefi-8cwpw managed-cluster   control   Prepare 172.16.170.151 kaas-node-e9b7447d-5010-439b-8c95-3598518f8e0a   v0.2.0-349-g4870b7f5
    ...
    
  16. Verify that the lcmmachine phase is complete and the Kubernetes cluster is created:

    KUBECONFIG=kubeconfig kubectl -n managed-ns get lcmmachines  -o wide
    

    Example of system response:

    NAME                                       CLUSTERNAME       TYPE     STATE  INTERNALIP      HOSTNAME                                        AGENTVERSION
    cz7700-managed-cluster-control-noefi-8bkqw  managed-cluster  control  Ready  172.16.170.153  kaas-node-803721b4-227c-4675-acc5-15ff9d3cfde2  v0.2.0-349-g4870b7f5
    cz7741-managed-cluster-control-noefi-42tp2  managed-cluster  control  Ready  172.16.170.152  kaas-node-6b8f0d51-4c5e-43c5-ac53-a95988b1a526  v0.2.0-349-g4870b7f5
    cz7743-managed-cluster-control-noefi-8cwpw  managed-cluster  control  Ready  172.16.170.151  kaas-node-e9b7447d-5010-439b-8c95-3598518f8e0a  v0.2.0-349-g4870b7f5
    ...
    
  17. Create the KaaSCephCluster object:

    managed-ns_KaaSCephCluster_ceph-cluster-managed-cluster.yaml
    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephCluster
    metadata:
      name: ceph-cluster-managed-cluster
      namespace: managed-ns
    spec:
      cephClusterSpec:
        nodes:
          # Use the exact node names.
          # Obtain each name from the CONSUMER field of the "get bmh -o wide" output.
          cz812-managed-cluster-storage-worker-noefi-58spl:
            roles:
            - mgr
            - mon
            # All disk configuration must be reflected in baremetalhostprofile.
            storageDevices:
            - config:
                deviceClass: ssd
              fullPath: /dev/disk/by-id/scsi-1ATA_WDC_WDS100T2B0A-00SM50_200231434939
          cz813-managed-cluster-storage-worker-noefi-lr4k4:
            roles:
            - mgr
            - mon
            storageDevices:
            - config:
                deviceClass: ssd
              fullPath: /dev/disk/by-id/scsi-1ATA_WDC_WDS100T2B0A-00SM50_200231440912
          cz814-managed-cluster-storage-worker-noefi-z2m67:
            roles:
            - mgr
            - mon
            storageDevices:
            - config:
                deviceClass: ssd
              fullPath: /dev/disk/by-id/scsi-1ATA_WDC_WDS100T2B0A-00SM50_200231443409
        pools:
        - default: true
          deviceClass: ssd
          name: kubernetes
          replicated:
            size: 3
          role: kubernetes
      k8sCluster:
        name: managed-cluster
        namespace: managed-ns
    

    Note

    The storageDevices[].fullPath field is available since Container Cloud 2.25.0 (Cluster releases 17.0.0 and 16.0.0). For the clusters running earlier product versions, define the /dev/disk/by-id symlinks using storageDevices[].name instead.

  18. Obtain kubeconfig of the newly created managed cluster:

    KUBECONFIG=kubeconfig kubectl -n managed-ns get secrets managed-cluster-kubeconfig -o jsonpath='{.data.admin\.conf}' | base64 -d |  tee managed.kubeconfig
    
  19. Verify the status of the Ceph cluster in your managed cluster:

    KUBECONFIG=managed.kubeconfig kubectl -n rook-ceph exec -it $(KUBECONFIG=managed.kubeconfig kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph -s
    

    Example of system response:

    cluster:
      id:     e75c6abd-c5d5-4ae8-af17-4711354ff8ef
      health: HEALTH_OK
    services:
      mon: 3 daemons, quorum a,b,c (age 55m)
      mgr: a(active, since 55m)
      osd: 3 osds: 3 up (since 54m), 3 in (since 54m)
    data:
      pools:   1 pools, 32 pgs
      objects: 273 objects, 555 MiB
      usage:   4.0 GiB used, 1.6 TiB / 1.6 TiB avail
      pgs:     32 active+clean
    io:
      client:   51 KiB/s wr, 0 op/s rd, 4 op/s wr
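
The waits in steps 14 through 16 can also be scripted. The following is a minimal sketch, not a supported tool. It assumes that the state is reported in the third column of the get bmh -o wide output and in the fourth column of the get lcmmachines -o wide output, as in the example responses above. The all_rows_match and wait_for_state helper names and the MAX_ATTEMPTS and POLL_SECONDS variables are hypothetical.

```shell
# Succeed only if column $1 equals $2 in every row read from stdin.
# Empty input counts as a failure, so a failed kubectl call keeps the wait going.
all_rows_match() {
  awk -v col="$1" -v want="$2" \
    '$col != want { bad++ } END { exit (bad > 0 || NR == 0) }'
}

# Poll "kubectl get <resource>" until every row reports the expected state.
# $1 - namespace, $2 - resource, $3 - column number of the state, $4 - expected state.
wait_for_state() {
  attempts=0
  until kubectl -n "$1" get "$2" --no-headers -o wide | all_rows_match "$3" "$4"; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge "${MAX_ATTEMPTS:-60}" ]; then
      return 1
    fi
    sleep "${POLL_SECONDS:-30}"
  done
  return 0
}
```

For example, after exporting KUBECONFIG=kubeconfig, run wait_for_state managed-ns bmh 3 provisioned for step 14 and wait_for_state managed-ns lcmmachines 4 Ready for step 16. The function returns a non-zero status if the expected state is not reached within the configured number of attempts.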
    
Automate multiple subnet creation using SubnetPool

Unsupported since MCC 2.28.0 (17.3.0 and 16.3.0)

Warning

The SubnetPool object is unsupported since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0). For details, see Deprecation Notes: SubnetPool resource management.

Operators of Mirantis Container Cloud who provide on-demand self-service Kubernetes deployments typically want their users to create networks without extensive knowledge of the network topology or IP addresses. For that purpose, the Operator can prepare L2 network templates in advance so that users can assign these templates to machines in their clusters.

The Operator can ensure that the users’ clusters have separate IP address spaces using the SubnetPool resource.

SubnetPool allows for automatic creation of Subnet objects that will consume blocks from the parent SubnetPool CIDR IP address range. The SubnetPool blockSize setting defines the IP address block size to allocate to each child Subnet. SubnetPool has a global scope, so any SubnetPool can be used to create the Subnet objects for any namespace and for any cluster.
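
As a worked example of this relation, the number of child Subnet blocks that fit into a pool follows from the two prefix lengths. The sketch below is plain shell arithmetic for IPv4 with hypothetical helper names (subnet_blocks and block_addresses) and sample prefix values. It counts raw addresses only and makes no assumption about which addresses in a block are reserved, for example, for a gateway.

```shell
# Number of child blocks of size $2 that fit into a pool with prefix $1.
# Both arguments are prefix lengths, with or without the leading slash.
subnet_blocks() {
  pool_prefix=${1#/}
  block_prefix=${2#/}
  echo $(( 1 << (block_prefix - pool_prefix) ))
}

# Number of IPv4 addresses in one block with prefix $1.
block_addresses() {
  block_prefix=${1#/}
  echo $(( 1 << (32 - block_prefix) ))
}

subnet_blocks /16 /25   # prints 512
block_addresses /25     # prints 128
```

With a /16 pool CIDR and a /25 block size, the pool can therefore be split into 512 child subnets of 128 addresses each.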

You can use the SubnetPool resource in the L2Template resources to automatically allocate IP addresses from an appropriate IP range that corresponds to a specific cluster, or create a Subnet resource if it does not exist yet. This way, every cluster will use subnets that do not overlap with other clusters.

To automate multiple subnet creation using SubnetPool:

  1. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  2. Create the subnetpool.yaml file with a number of subnet pools:

    Note

    You can define either or both subnets and subnet pools, depending on the use case. A single L2 template can use either or both subnets and subnet pools.

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> apply -f <SubnetFileName.yaml>
    

    Note

    In the command above and in the steps below, substitute the parameters enclosed in angle brackets with the corresponding values.

    Example of a subnetpool.yaml file:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: SubnetPool
    metadata:
      name: kaas-mgmt
      namespace: default
      labels:
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
    spec:
      cidr: 10.10.0.0/16
      blockSize: /25
      nameservers:
      - 172.18.176.6
      gatewayPolicy: first
    

    For the specification fields description of the SubnetPool object, see SubnetPool spec.

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

  3. Verify that the subnet pool is successfully created:

    kubectl get subnetpool kaas-mgmt -oyaml
    

    In the system output, verify the status fields of the SubnetPool object. For the description of the status fields, see SubnetPool status.

  4. Proceed to creating an L2 template for one or multiple managed clusters as described in Create L2 templates. In this procedure, select the exemplary L2 template for multiple subnets.

    Caution

    Using the l3Layout section, define all subnets that are used in the npTemplate section. Defining only a part of the subnets is not allowed.

    If labelSelector is used in l3Layout, use any custom label name that differs from system names. This allows for easier cluster scaling in case of adding new subnets as described in Expand IP addresses capacity in an existing cluster.

    Mirantis recommends using a unique label prefix such as user-defined/.

Add a machine

This section describes how to add a machine to a managed MOSK cluster using CLI for advanced configuration.

Create a machine using CLI

This section describes adding machines to a new MOSK cluster using Mirantis Container Cloud CLI.

If you need to add more machines to an existing MOSK cluster, see Add a controller node and Add a compute node.

To add a machine to the MOSK cluster:

  1. Log in to the host where your management cluster kubeconfig is located and where kubectl is installed.

  2. Create a new text file mosk-cluster-machines.yaml and add the YAML definitions of the Machine resources to it. Use the following example and see the field descriptions below:

    apiVersion: cluster.k8s.io/v1alpha1
    kind: Machine
    metadata:
      name: mosk-node-role-name
      namespace: mosk-project
      labels:
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
        cluster.sigs.k8s.io/cluster-name: mosk-cluster
    spec:
      providerSpec:
        value:
          apiVersion: baremetal.k8s.io/v1alpha1
          kind: BareMetalMachineProviderSpec
          bareMetalHostProfile:
            name: mosk-k8s-mgr
            namespace: mosk-project
          l2TemplateSelector:
            name: mosk-k8s-mgr
          hostSelector: {}
          l2TemplateIfMappingOverride: []
    

    Note

    The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

  3. Add the top level fields:

    • apiVersion

      API version of the object that is cluster.k8s.io/v1alpha1.

    • kind

      Object type that is Machine.

    • metadata

      This section will contain the metadata of the object.

    • spec

      This section will contain the configuration of the object.

  4. Add mandatory fields to the metadata section of the Machine object definition.

    • name

      The name of the Machine object.

    • namespace

      The name of the Project where the Machine will be created.

    • labels

      This section contains additional metadata of the machine. Set the following mandatory labels for the Machine object.

      • kaas.mirantis.com/provider

        Set to "baremetal".

      • kaas.mirantis.com/region

        Region name that matches the region name in the Cluster object.

      • cluster.sigs.k8s.io/cluster-name

        The name of the cluster to add the machine to.

      Note

      The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

  5. Configure the mandatory parameters of the Machine object in the spec field. Add the providerSpec field that contains the parameters for deployment on bare metal in the form of a Kubernetes subresource.

  6. In the providerSpec section, add the following mandatory configuration parameters:

    • apiVersion

      API version of the subresource that is baremetal.k8s.io/v1alpha1.

    • kind

      Object type that is BareMetalMachineProviderSpec.

    • bareMetalHostProfile

      Reference to a configuration profile of a bare metal host. It helps to pick a bare metal host with a suitable configuration for the machine. This section includes two parameters:

      • name

        Name of a bare metal host profile.

      • namespace

        Project in which the bare metal host profile is created.

    • l2TemplateSelector

      If specified, contains the name (first priority) or label of the L2 template that will be applied during the machine creation. Note that changing this field after the Machine object is created does not affect the host network configuration of the machine.

      Assign one of the templates that you defined in Create L2 templates to the machine. If no suitable template exists, create one as described in Create L2 templates.

    • hostSelector

      This parameter defines matching criteria for picking a bare metal host for the machine by label.

      Any custom label that is assigned to one or more bare metal hosts using API can be used as a host selector. If the bare metal host objects with the specified label are missing, the Machine object will not be deployed until at least one bare metal host with the specified label is available.

      See Deploy a machine to a specific bare metal host for details.

    • l2TemplateIfMappingOverride

      This parameter contains a list of names of the host network interfaces. It allows you to override the default naming and ordering of the network interfaces defined in the L2 template referenced by l2TemplateSelector. This ordering informs the L2 template how to generate the host network configuration.

      See Override network interfaces naming and order for details.

  7. Depending on the role of the machine in the MOSK cluster, add labels to the nodeLabels field.

    This field contains the list of node labels to be attached to a node for the user to run certain components on separate cluster nodes. The list of allowed node labels is located in the Cluster object status providerStatus.releaseRef.current.allowedNodeLabels field.

    If the value field is not defined in allowedNodeLabels, a label can have any value. For example:

    allowedNodeLabels:
    - displayName: Stacklight
      key: stacklight
    

    Before or after a machine deployment, add the required label from the allowed node labels list with the corresponding value to spec.providerSpec.value.nodeLabels in the Machine object definition. For example:

    nodeLabels:
    - key: stacklight
      value: enabled
    

    You cannot add a node label that is not in the list of allowed node labels.

  8. If you are NOT deploying MOSK with the compact control plane, add 3 dedicated Kubernetes manager nodes.

    1. Add 3 Machine objects for Kubernetes manager nodes using the following label:

      metadata:
        labels:
          cluster.sigs.k8s.io/control-plane: "true"
      

      Note

      The value of the label might be any non-empty string. On a worker node, this label must be omitted entirely.

    2. Add 3 Machine objects for MOSK controller nodes using the following labels:

      spec:
        providerSpec:
          value:
            nodeLabels:
            - key: openstack-control-plane
              value: enabled
            - key: openstack-gateway
              value: enabled
      
  9. If you are deploying MOSK with the compact control plane, add Machine objects for 3 combined control plane nodes using the following metadata labels and nodeLabels values:

    metadata:
      labels:
        cluster.sigs.k8s.io/control-plane: "true"
    spec:
      providerSpec:
        value:
          nodeLabels:
          - key: openstack-control-plane
            value: enabled
          - key: openstack-gateway
            value: enabled
          - key: openvswitch
            value: enabled
    
  10. Add Machine objects for as many compute nodes as you want to install using the following labels:

    spec:
      providerSpec:
        value:
          nodeLabels:
          - key: openstack-compute-node
            value: enabled
          - key: openvswitch
            value: enabled
    
  11. Save the text file and repeat the process to create configuration for all machines in your MOSK cluster.

  12. Create the machines in the cluster using the following command:

    kubectl create -f mosk-cluster-machines.yaml
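
Before creating the machines, the label restriction from step 7 can be checked from the command line. The following sketch is an illustration under stated assumptions rather than a documented procedure. It reads the allowed keys from the Cluster object status path named in step 7, assumes that the object is reachable through kubectl as the cluster resource, and uses the hypothetical helper names allowed_label_keys and is_label_allowed.

```shell
# Print the keys of the allowed node labels, separated by spaces.
# $1 - project (namespace) of the cluster, $2 - cluster name.
allowed_label_keys() {
  kubectl -n "$1" get cluster "$2" \
    -o jsonpath='{.status.providerStatus.releaseRef.current.allowedNodeLabels[*].key}'
}

# Succeed if the label key $3 is in the allowed list of cluster $2 in project $1.
is_label_allowed() {
  for allowed_key in $(allowed_label_keys "$1" "$2"); do
    if [ "$allowed_key" = "$3" ]; then
      return 0
    fi
  done
  return 1
}
```

For example, is_label_allowed mosk-project mosk-cluster stacklight succeeds only if the stacklight key is present in the allowedNodeLabels list of the cluster.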
    

Proceed to Add a Ceph cluster.

Create a machine using web UI

After you add bare metal hosts and create a managed cluster as described in Create a MOSK cluster, proceed with associating Kubernetes machines of your cluster with the previously added bare metal hosts using the Container Cloud web UI.

To add a Kubernetes machine to a MOSK cluster:

  1. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with the Machines list opens.

  4. Click the Create Machine button.

  5. Fill out the Create New Machine form as required:

    • Name

      New machine name. If empty, a name is automatically generated in the <clusterName>-<machineType>-<uniqueSuffix> format.

    • Type

      Machine type. Select Manager or Worker to create a Kubernetes manager or worker node.

      Caution

      The required minimum number of machines:

      • 3 manager nodes for HA

      • 3 worker storage nodes for a minimal Ceph cluster

    • L2 Template

      From the drop-down list, select the previously created L2 template, if any. For details, see Create L2 templates. Otherwise, leave the default selection to use the default L2 template of the cluster.

      Note

      Before Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0), if you leave the default selection in the drop-down list, a preinstalled L2 template is used. Preinstalled templates are removed in the above-mentioned releases.

    • Distribution

      Operating system to provision the machine. From the drop-down list, select Ubuntu 22.04 Jammy as the machine distribution.

      Warning

      Use the obsolete Ubuntu 20.04 distribution only on existing clusters based on Ubuntu 20.04, not on greenfield deployments. Ubuntu 20.04 reaches end-of-life in April 2025. The MOSK 24.3 release series is the last one to support Ubuntu 20.04 as the host operating system.

      You cannot update management or MOSK clusters running Ubuntu 20.04 to the following major product release, where Ubuntu 22.04 is the only supported version.

    • Upgrade Index

      Optional. A positive numeric value that defines the order of machine upgrade during a cluster update.

      Note

      You can change the upgrade order later on an existing cluster. For details, see Change the upgrade order of a machine.

      Consider the following upgrade index specifics:

      • The first machine to upgrade is always one of the control plane machines with the lowest upgradeIndex. Other control plane machines are upgraded one by one according to their upgrade indexes.

      • If the Cluster spec dedicatedControlPlane field is false, worker machines are upgraded only after the upgrade of all control plane machines finishes. Otherwise, they are upgraded after the first control plane machine, concurrently with other control plane machines.

      • If several machines have the same upgrade index, they have the same priority during upgrade.

      • If the value is not set, the machine is automatically assigned a value of the upgrade index.

    • Host Configuration

      Configuration settings of the bare metal host to be used for the machine:

      • Host

        From the drop-down list, select the previously created custom bare metal host to be used for the new machine.

      • Host Profile

        From the drop-down list, select the previously created custom bare metal host profile, if any. For details, see Create a custom bare metal host profile. Otherwise, leave the default selection.

    • Labels

      Add the required node labels for the worker machine to run certain components on a specific node. For example, for the StackLight nodes that run OpenSearch and require more resources than a standard node, add the StackLight label. The list of available node labels is obtained from allowedNodeLabels of your current Cluster release.

      If the value field is not defined in allowedNodeLabels, from the drop-down list, select the required label and define an appropriate custom value for this label to be set to the node. For example, the node-type label can have the storage-ssd value to meet the service scheduling logic on a particular machine.

      Note

      Due to the known issue 23002 fixed in Container Cloud 2.21.0 (Cluster releases 7.11.0 and 11.5.0), a custom value for a predefined node label cannot be set using the Container Cloud web UI. For a workaround, refer to the issue description.

      Caution

      If you deploy StackLight in the HA mode (recommended):

      • Add the StackLight label to at least three worker nodes. Otherwise, StackLight will not be deployed until the required number of worker nodes is configured with the StackLight label.

      • Removing the StackLight label from worker nodes, as well as removing worker nodes that have the StackLight label, can cause the StackLight components to become inaccessible. It is important to correctly maintain the worker nodes where the StackLight local volumes were provisioned. For details, see Delete a cluster machine.

        To obtain the list of nodes where StackLight is deployed, refer to Container Cloud Release Notes: Upgrade managed clusters with StackLight deployed in HA mode.

      If you move the StackLight label to a new worker machine on an existing cluster, manually deschedule all StackLight components from the old worker machine from which you remove the StackLight label. For details, see Deschedule StackLight Pods from a worker machine.

      Note

      To add node labels after deploying a worker machine, navigate to the Machines page, click the More action icon in the last column of the required machine field, and select Configure machine.

    • Count

      Specify the number of machines to create. If you create a machine pool, specify the replicas count of the pool.

    • Manager

      Select Manager or Worker to create a Kubernetes manager or worker node.

      Caution

      The required minimum number of machines:

      • 3 manager nodes for HA

      • 3 worker storage nodes for a minimal Ceph cluster

    • BareMetal Host Label

      Assign the role to the new machine(s) to link the machine to a previously created bare metal host with the corresponding label. You can assign one role type per machine. The supported labels include:

      • Manager

        This node hosts the manager services of a managed cluster. For reliability reasons, Container Cloud does not permit running end-user workloads on the manager nodes or using them as storage nodes.

      • Worker

        The default role for any node in a managed cluster. Only the kubelet service is running on the machines of this type.

    • Upgrade Index

      Optional. A positive numeric value that defines the order of machine upgrade during a cluster update.

      Note

      You can change the upgrade order later on an existing cluster. For details, see Change the upgrade order of a machine.

      Consider the following upgrade index specifics:

      • The first machine to upgrade is always one of the control plane machines with the lowest upgradeIndex. Other control plane machines are upgraded one by one according to their upgrade indexes.

      • If the Cluster spec dedicatedControlPlane field is false, worker machines are upgraded only after the upgrade of all control plane machines finishes. Otherwise, they are upgraded after the first control plane machine, concurrently with other control plane machines.

      • If several machines have the same upgrade index, they have the same priority during upgrade.

      • If the value is not set, the machine is automatically assigned a value of the upgrade index.

    • Distribution

      Operating system to provision the machine. From the drop-down list, select the required Ubuntu distribution.

    • L2 Template

      From the drop-down list, select the previously created L2 template, if any. For details, see Create L2 templates. Otherwise, leave the default selection to use the default L2 template of the cluster.

      Note

      Before Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0), if you leave the default selection in the drop-down list, a preinstalled L2 template is used. Preinstalled templates are removed in the above-mentioned releases.

    • BM Host Profile

      From the drop-down list, select the previously created custom bare metal host profile, if any. For details, see Create a custom bare metal host profile. Otherwise, leave the default selection.

    • Node Labels

      Add the required node labels for the worker machine to run certain components on a specific node. For example, for the StackLight nodes that run OpenSearch and require more resources than a standard node, add the StackLight label. The list of available node labels is obtained from allowedNodeLabels of your current Cluster release.

      If the value field is not defined in allowedNodeLabels, from the drop-down list, select the required label and define an appropriate custom value for this label to be set to the node. For example, the node-type label can have the storage-ssd value to meet the service scheduling logic on a particular machine.

      Note

      Due to the known issue 23002 fixed in Container Cloud 2.21.0 (Cluster releases 7.11.0 and 11.5.0), a custom value for a predefined node label cannot be set using the Container Cloud web UI. For a workaround, refer to the issue description.

      Caution

      If you deploy StackLight in the HA mode (recommended):

      • Add the StackLight label to a minimum of three worker nodes. Otherwise, StackLight will not be deployed until the required number of worker nodes is configured with the StackLight label.

      • Removal of the StackLight label from worker nodes along with removal of worker nodes with StackLight label can cause the StackLight components to become inaccessible. It is important to correctly maintain the worker nodes where the StackLight local volumes were provisioned. For details, see Delete a cluster machine.

        To obtain the list of nodes where StackLight is deployed, refer to Container Cloud Release Notes: Upgrade managed clusters with StackLight deployed in HA mode.

      If you move the StackLight label to a new worker machine on an existing cluster, manually deschedule all StackLight components from the old worker machine from which you removed the StackLight label. For details, see Deschedule StackLight Pods from a worker machine.

      Note

      To add node labels after deploying a worker machine, navigate to the Machines page, click the More action icon in the last column of the required machine field, and select Configure machine.

  6. Click Create.

    At this point, Container Cloud adds the new machine object to the specified cluster, and the Bare Metal Operator Controller links the machine to a bare metal host whose labels match the assigned role.

    Provisioning of the newly created machine starts when the machine object is created and includes the following stages:

    1. Creation of partitions on the local disks as required by the operating system and the Container Cloud architecture.

    2. Configuration of the network interfaces on the host as required by the operating system and the Container Cloud architecture.

    3. Installation and configuration of the Container Cloud LCM Agent.

  7. Repeat the steps above for the remaining machines.

  8. Verify machine status.
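
The upgrade index rules listed under Upgrade Index in the procedure above can be sketched as a small ordering function. This is only an illustration of the documented rules, not the Container Cloud implementation: the Machine dataclass, the plan_upgrade function, and the returned keys are hypothetical names, and machines with equal indexes simply keep their input order here because the documentation says only that they have the same priority.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Machine:
    # Hypothetical, simplified view of a machine for illustration only.
    name: str
    control_plane: bool
    upgrade_index: int


def plan_upgrade(machines: List[Machine], dedicated_control_plane: bool) -> Dict[str, object]:
    """Order machines according to the documented upgrade index rules."""
    control = sorted((m for m in machines if m.control_plane),
                     key=lambda m: m.upgrade_index)
    workers = sorted((m for m in machines if not m.control_plane),
                     key=lambda m: m.upgrade_index)
    if not control:
        raise ValueError("at least one control plane machine is required")
    return {
        # The control plane machine with the lowest index always goes first.
        "first": control[0].name,
        # The other control plane machines follow one by one.
        "remaining_control_plane": [m.name for m in control[1:]],
        "workers": [m.name for m in workers],
        # With dedicatedControlPlane set to false, workers wait for all
        # control plane machines; otherwise only for the first one.
        "workers_wait_for_all_control_plane": not dedicated_control_plane,
    }


machines = [
    Machine("cp-a", True, 2),
    Machine("cp-b", True, 1),
    Machine("cp-c", True, 3),
    Machine("w-a", False, 5),
    Machine("w-b", False, 4),
]
plan = plan_upgrade(machines, dedicated_control_plane=False)
print(plan["first"], plan["remaining_control_plane"], plan["workers"])
```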

Now, proceed to Add a Ceph cluster.

Assign L2 templates to machines

To install MOSK on bare metal with Container Cloud, you must create L2 templates for each node type in the MOSK cluster. Additionally, you may have to create separate templates for nodes of the same type when they have different configurations.

To assign specific L2 templates to machines in a cluster:

  1. Select from the following options to assign the templates to the cluster:

    • Since MOSK 23.3, use the cluster.sigs.k8s.io/cluster-name label in the labels section.

    • Before MOSK 23.3, use the clusterRef parameter in the spec section.

  2. Add a unique identifier label to every L2 template. Typically, that would be the name of the MOSK node role, for example l2template-compute, or l2template-compute-5nics.

  3. Assign an L2 template to a machine. Set the l2TemplateSelector field in the machine spec to the name of the label added in the previous step. IPAM Controller uses this field to select a specific L2 template for the corresponding machine.

    Alternatively, you may set the l2TemplateSelector field to the name of the L2 template.

Consider the following examples of an L2 template assignment to a machine.

Example of an L2Template resource
apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  name: example-node-netconfig
  namespace: my-project
  labels:
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
    l2template-example-node-netconfig: "1"
    cluster.sigs.k8s.io/cluster-name: my-cluster
    ...

Note

The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

Note

Before MOSK 23.3, an L2 template requires clusterRef: <clusterName> in the spec section. Since MOSK 23.3, this parameter is deprecated and automatically migrated to the cluster.sigs.k8s.io/cluster-name: <clusterName> label.


Example of a Machine resource with the label-based L2 template selector
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: machine1
  namespace: my-project
  labels:
    cluster.sigs.k8s.io/cluster-name: my-cluster
    ...
...
spec:
  providerSpec:
    value:
      l2TemplateSelector:
        label: l2template-example-node-netconfig
...

Example of a Machine resource with the name-based L2 template selector
apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: machine1
  namespace: my-project
  labels:
    cluster.sigs.k8s.io/cluster-name: my-cluster
    ...
...
spec:
  providerSpec:
    value:
      l2TemplateSelector:
        name: example-node-netconfig
...
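
The two selector variants above can be modeled with a short lookup function. This is a minimal sketch of the matching idea only, not the IPAM Controller code: select_l2_template is a hypothetical helper, the objects are plain dictionaries shaped like the manifests above, and restricting the search to templates carrying the same cluster-name label is an assumption based on the label-based assignment described in step 1.

```python
from typing import Dict, List, Optional

CLUSTER_LABEL = "cluster.sigs.k8s.io/cluster-name"


def select_l2_template(templates: List[dict], cluster_name: str,
                       selector: Dict[str, str]) -> Optional[dict]:
    """Return the first template of the cluster that satisfies the selector."""
    for template in templates:
        metadata = template["metadata"]
        labels = metadata.get("labels", {})
        if labels.get(CLUSTER_LABEL) != cluster_name:
            continue  # The template belongs to another cluster.
        if "name" in selector and metadata["name"] == selector["name"]:
            return template  # Name-based selector.
        if "label" in selector and selector["label"] in labels:
            return template  # Label-based selector: the label key must exist.
    return None


templates = [{
    "metadata": {
        "name": "example-node-netconfig",
        "labels": {
            "l2template-example-node-netconfig": "1",
            CLUSTER_LABEL: "my-cluster",
        },
    },
}]

by_label = select_l2_template(
    templates, "my-cluster", {"label": "l2template-example-node-netconfig"})
by_name = select_l2_template(
    templates, "my-cluster", {"name": "example-node-netconfig"})
print(by_label is by_name)
```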

Now, proceed to Deploy a machine to a specific bare metal host.

Deploy a machine to a specific bare metal host

A machine in a MOSK cluster requires a dedicated bare metal host for deployment. In the Mirantis Container Cloud management API, bare metal hosts are represented by the BareMetalHost objects that are automatically generated by the related BareMetalHostInventory objects.

Note

The BareMetalHostInventory resource is available since the update of the management cluster to the Cluster release 16.4.0 (Container Cloud 2.29.0). Before this release, the BareMetalHost object is used.

Since the above-mentioned release, BareMetalHost is only used for internal purposes of the Container Cloud private API. All configuration changes must be applied using the BareMetalHostInventory objects.

For any existing BareMetalHost object, a BareMetalHostInventory object is created automatically during cluster update.

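While the management cluster runs the Cluster release 16.4.0, operations on BareMetalHostInventory objects are allowed to m:kaas@management-admin only. This limitation is lifted once the management cluster is updated to the Cluster release 16.4.1 or later.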

All BareMetalHostInventory objects must be labeled upon creation with a label that allows identifying the host and assigning it to a machine.

The labels may be unique or applied to a group of hosts based on similarities in their capacity, capabilities, and hardware configuration, on their location, suitable role, or a combination thereof.

In some cases, you may need to deploy a machine to a specific bare metal host. This is especially useful when some of your bare metal hosts have different hardware configuration than the rest.

To deploy a machine to a specific bare metal host:

  1. Log in to the host where your management cluster kubeconfig is located and where kubectl is installed.

  2. Identify the bare metal host that you want to associate with the specific machine. For example, host host-1.

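    Since Container Cloud 2.29.0 (Cluster release 16.4.0):

    kubectl get baremetalhostinventory host-1 -o yaml

    Before Container Cloud 2.29.0 (Cluster release 16.4.0):

    kubectl get baremetalhost host-1 -o yaml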
    
  3. Add a label that will uniquely identify this host, for example, by the name of the host and machine that you want to deploy on it.

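    Since Container Cloud 2.29.0 (Cluster release 16.4.0):

    kubectl edit baremetalhostinventory host-1

    Note

    For details about labels, see BareMetalHostInventory resource.

    Before Container Cloud 2.29.0 (Cluster release 16.4.0):

    kubectl edit baremetalhost host-1

    Note

    For details about labels, see BareMetalHost resource.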

    Configuration example since Container Cloud 2.29.0 (Cluster release 16.4.0):

    kind: BareMetalHostInventory
    metadata:
      name: host-1
      namespace: myProjectName
      labels:
        kaas.mirantis.com/baremetalhost-id: host-1-worker-HW11-cad5
        ...

    Configuration example before Container Cloud 2.29.0 (Cluster release 16.4.0):

    kind: BareMetalHost
    metadata:
      name: host-1
      namespace: myProjectName
      labels:
        kaas.mirantis.com/baremetalhost-id: host-1-worker-HW11-cad5
        ...
    
  4. Open the text file with the YAML definition of the Machine object, created in Create a machine using CLI.

  5. Add a host selector that matches the label you have added to the BareMetalHostInventory or BareMetalHost object in the previous step. For example:

    kind: Machine
    metadata:
      name: worker-HW11-cad5
      namespace: myProjectName
    spec:
      ...
      providerSpec:
        value:
          apiVersion: baremetal.k8s.io/v1alpha1
          kind: BareMetalMachineProviderSpec
          ...
          hostSelector:
            matchLabels:
              kaas.mirantis.com/baremetalhost-id: host-1-worker-HW11-cad5
      ...
    

Once created, this machine will be associated with the specified bare metal host, and you can return to Create a machine using CLI.
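
The way a matchLabels host selector picks a host can be illustrated with plain dictionaries. The sketch below follows the general Kubernetes matchLabels semantics, where every key-value pair of the selector must be present on the object. The matches and find_hosts helpers are hypothetical names, and the second host with its label value is invented for the illustration.

```python
from typing import Dict, List


def matches(match_labels: Dict[str, str], host_labels: Dict[str, str]) -> bool:
    """True if every selector key-value pair is present in the host labels."""
    return all(host_labels.get(key) == value
               for key, value in match_labels.items())


def find_hosts(hosts: List[dict], match_labels: Dict[str, str]) -> List[str]:
    """Return the names of all hosts whose labels satisfy the selector."""
    return [host["metadata"]["name"] for host in hosts
            if matches(match_labels, host["metadata"].get("labels", {}))]


hosts = [
    {"metadata": {"name": "host-1", "labels": {
        "kaas.mirantis.com/baremetalhost-id": "host-1-worker-HW11-cad5"}}},
    # Hypothetical second host with a different identifier.
    {"metadata": {"name": "host-2", "labels": {
        "kaas.mirantis.com/baremetalhost-id": "host-2-worker-HW12-0a1b"}}},
]
selector = {"kaas.mirantis.com/baremetalhost-id": "host-1-worker-HW11-cad5"}
print(find_hosts(hosts, selector))
```

Because the label value is unique to one host, the selector resolves to exactly that host. A label shared by a group of hosts would return several candidates instead.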

Caution

The required minimum number of machines:

  • 3 manager nodes for HA

  • 3 worker storage nodes for a minimal Ceph cluster

Override network interfaces naming and order

An L2 template contains the ifMapping field that allows you to identify Ethernet interfaces for the template. The Machine object API enables the Operator to override the mapping from the L2 template by enforcing a specific order of names of the interfaces when applied to the template.

The l2TemplateIfMappingOverride field in the spec of the Machine object contains a list of interface names. The order of the interface names in the list is important because the L2Template object will be rendered with NICs ordered as per this list.

Note

Changes in the l2TemplateIfMappingOverride field will apply only once when the Machine and corresponding IpamHost objects are created. Further changes to l2TemplateIfMappingOverride will not reset the interfaces assignment and configuration.

Caution

The l2TemplateIfMappingOverride field must contain the names of all interfaces of the bare metal host.

The following example illustrates how to include the override field in the Machine object. In this example, we configure the interface eno1, which is the second on-board interface of the server, to precede the first on-board interface eno0.

apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  finalizers:
  - foregroundDeletion
  - machine.cluster.sigs.k8s.io
  labels:
    cluster.sigs.k8s.io/cluster-name: kaas-mgmt
    cluster.sigs.k8s.io/control-plane: "true"
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
spec:
  providerSpec:
    value:
      apiVersion: baremetal.k8s.io/v1alpha1
      hostSelector:
        matchLabels:
          kaas.mirantis.com/baremetalhost-id: hw-master-0
      image: {}
      kind: BareMetalMachineProviderSpec
      l2TemplateIfMappingOverride:
      - eno1
      - eno0
      - enp0s1
      - enp0s2

Note

The kaas.mirantis.com/region label is removed from all MOSK objects in 24.1. Therefore, do not add the label starting with this release. On existing clusters updated to this release, or if added manually, MOSK ignores this label.

As a result of the configuration above, when used with the example L2 template for bonds and bridges described in Create L2 templates, the enp0s1 and enp0s2 interfaces will be in a predictable, ordered state. This state will be used to create subinterfaces for the Kubernetes networks (k8s-pods) and for the Kubernetes external network (k8s-ext).

Alternatively, you can use a case-insensitive list of NIC MAC addresses instead of the list of NIC names. For example:

apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
...
spec:
  providerSpec:
    value:
      ...
      kind: BareMetalMachineProviderSpec
      l2TemplateIfMappingOverride:
      - b4:96:91:6f:2e:10
      - b4:96:91:6f:2e:11
      - b5:a6:c1:6f:ee:02
      - b5:a6:c1:6f:ee:03
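
Before applying an override list, it can be checked against the interfaces known for the host. The following is a minimal validation sketch and not part of the product: check_override is a hypothetical helper, the host NIC inventory is passed in as (name, MAC) pairs that you would collect yourself, and the MAC addresses of the two hosts in the usage example are illustrative. It covers the two documented constraints: the list must name all interfaces of the host, and MAC addresses are compared case-insensitively.

```python
import re
from typing import List, Tuple

MAC_RE = re.compile(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$")


def _normalize(entry: str) -> str:
    """Lowercase MAC addresses; leave interface names untouched."""
    entry = entry.strip()
    return entry.lower() if MAC_RE.match(entry.lower()) else entry


def check_override(override: List[str],
                   host_nics: List[Tuple[str, str]]) -> List[str]:
    """Return a list of problems found in an l2TemplateIfMappingOverride list."""
    problems = []
    entries = [_normalize(e) for e in override]
    duplicates = sorted({e for e in entries if entries.count(e) > 1})
    if duplicates:
        problems.append("duplicate entries: " + ", ".join(duplicates))
    # Treat the list as MAC-based only if every entry looks like a MAC.
    use_macs = bool(entries) and all(MAC_RE.match(e) for e in entries)
    known = [mac.lower() if use_macs else name for name, mac in host_nics]
    missing = [k for k in known if k not in entries]
    if missing:
        problems.append("missing interfaces: " + ", ".join(missing))
    return problems


nics = [("eno0", "B4:96:91:6F:2E:10"), ("eno1", "B4:96:91:6F:2E:11")]
print(check_override(["eno1", "eno0"], nics))
print(check_override(["eno1"], nics))
```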

Manually allocate IP addresses for bare metal hosts

Available since MCC 2.26.0 (16.1.0 and 17.1.0)

You can force the DHCP server to assign a particular IP address for a bare metal host during PXE provisioning by adding the host.dnsmasqs.metal3.io/address annotation with the desired IP address value to the required bare metal host.

If you have a limited number of free IP addresses for server provisioning, you can manually create bare metal hosts one by one and provision servers in small, manually managed batches.

For batching in small chunks, you can use the host.dnsmasqs.metal3.io/address annotation to manually allocate IP addresses along with the baremetalhost.metal3.io/detached annotation to pause automatic host management by the bare metal Operator.

To pause bare metal hosts for a manual IP allocation during provisioning:

  1. Set the baremetalhost.metal3.io/detached annotation on all bare metal hosts to pause host management.

    Note

    If the host provisioning has already started or completed, addition of this annotation deletes the information about the host from Ironic without triggering deprovisioning. The bare metal Operator recreates the host in Ironic once you remove the annotation. For details, see Metal3 documentation.

  2. Add the host.dnsmasqs.metal3.io/address annotation with corresponding IP address values to a batch of bare metal hosts.

  3. Remove the baremetalhost.metal3.io/detached annotation from the batch used in the previous step.

  4. Repeat steps 2 and 3 until all hosts are provisioned.
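
The batching procedure above can be laid out as an ordered plan of annotation changes before you run anything against the cluster. This is a planning sketch only: plan_batches is a hypothetical helper, the host names and IP addresses are illustrative, and the empty string used as the value of the detached annotation is a placeholder assumption, so check the Metal3 documentation for the value appropriate to your release.

```python
from typing import Dict, List, Optional, Tuple

DETACHED = "baremetalhost.metal3.io/detached"
ADDRESS = "host.dnsmasqs.metal3.io/address"

Step = Tuple[str, str, str, Optional[str]]  # (verb, host, annotation, value)


def plan_batches(addresses: Dict[str, str], batch_size: int) -> List[Step]:
    """Build the ordered annotation changes for batched provisioning."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    hosts = list(addresses)
    # Step 1: pause management of every host (placeholder value, see above).
    plan: List[Step] = [("set", host, DETACHED, "") for host in hosts]
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        # Step 2: pin the IP address for each host in the batch.
        plan += [("set", host, ADDRESS, addresses[host]) for host in batch]
        # Step 3: resume management for the same batch.
        plan += [("remove", host, DETACHED, None) for host in batch]
    return plan


addresses = {"host-1": "10.0.0.11", "host-2": "10.0.0.12", "host-3": "10.0.0.13"}
for step in plan_batches(addresses, batch_size=2):
    print(step)
```

Each tuple corresponds to one annotation change that you would then apply with your usual tooling, waiting for the batch to finish provisioning before moving on to the next one.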

Add a Ceph cluster

After you add machines to your new bare metal managed cluster as described in Add a machine, create a Ceph cluster on top of this managed cluster.

For an advanced configuration through the KaaSCephCluster CR, see Ceph advanced configuration. To configure Ceph Controller through Kubernetes templates to manage Ceph node resources, see Enable Ceph tolerations and resources management.

The procedure below enables you to create a Ceph cluster with a minimum of three Ceph nodes that provides persistent volumes to the Kubernetes workloads in the managed cluster.

Create a Ceph cluster using the CLI
  1. Verify that the overall status of the managed cluster is ready with all conditions in the Ready state:

    kubectl -n <managedClusterProject> get cluster <clusterName> -o yaml
    

    Substitute <managedClusterProject> and <clusterName> with the corresponding managed cluster namespace and name.

    Example of system response:

    status:
      providerStatus:
        ready: true
        conditions:
        - message: Helm charts are successfully installed(upgraded).
          ready: true
          type: Helm
        - message: Kubernetes objects are fully up.
          ready: true
          type: Kubernetes
        - message: All requested nodes are ready.
          ready: true
          type: Nodes
        - message: Maintenance state of the cluster is false
          ready: true
          type: Maintenance
        - message: TLS configuration settings are applied
          ready: true
          type: TLS
        - message: Kubelet is Ready on all nodes belonging to the cluster
          ready: true
          type: Kubelet
        - message: Swarm is Ready on all nodes belonging to the cluster
          ready: true
          type: Swarm
        - message: All provider instances of the cluster are Ready
          ready: true
          type: ProviderInstance
        - message: LCM agents have the latest version
          ready: true
          type: LCMAgent
        - message: StackLight is fully up.
          ready: true
          type: StackLight
        - message: OIDC configuration has been applied.
          ready: true
          type: OIDC
        - message: Load balancer 10.100.91.150 for kubernetes API has status HEALTHY
          ready: true
          type: LoadBalancer
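
Instead of reading the conditions by eye, the check can be scripted. The sketch below assumes that you requested the same object in JSON form (kubectl supports -o json) and parsed it into a dictionary, for example with json.loads. The not_ready_conditions helper is a hypothetical name, and the field layout is taken from the example response above.

```python
from typing import List


def not_ready_conditions(cluster: dict) -> List[str]:
    """Return the types of all conditions that are not ready.

    The string "overall" is prepended if status.providerStatus.ready
    itself is not true.
    """
    status = cluster.get("status", {}).get("providerStatus", {})
    failing = [condition.get("type", "<unknown>")
               for condition in status.get("conditions", [])
               if not condition.get("ready")]
    if not status.get("ready"):
        failing.insert(0, "overall")
    return failing


# Trimmed-down stand-in for the parsed Cluster object.
cluster = {"status": {"providerStatus": {
    "ready": True,
    "conditions": [
        {"type": "Helm", "ready": True},
        {"type": "Kubernetes", "ready": True},
        {"type": "Nodes", "ready": True},
    ],
}}}
print(not_ready_conditions(cluster))
```

An empty list means that the cluster and all of its conditions report ready, so you can continue with the next step.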
    
  2. Create a YAML file with the Ceph cluster specification:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephCluster
    metadata:
      name: <cephClusterName>
      namespace: <managedClusterProject>
    spec:
      k8sCluster:
        name: <clusterName>
        namespace: <managedClusterProject>
    

    Substitute <cephClusterName> with the required name of the Ceph cluster. This name will be used in the Ceph LCM operations.

  3. Select from the following options:

    • Add explicit network configuration of the Ceph cluster using the network section:

      spec:
        cephClusterSpec:
          network:
            publicNet: <publicNet>
            clusterNet: <clusterNet>
      

      Substitute the following values:

      • <publicNet> is a CIDR definition or comma-separated list of CIDR definitions (if the managed cluster uses multiple networks) of public network for the Ceph data. The values should match the corresponding values of the cluster Subnet object.

      • <clusterNet> is a CIDR definition or comma-separated list of CIDR definitions (if the managed cluster uses multiple networks) of replication network for the Ceph data. The values should match the corresponding values of the cluster Subnet object.

    • Configure Subnet objects for the Storage access network by setting ipam/SVC-ceph-public: "1" and ipam/SVC-ceph-cluster: "1" labels to the corresponding Subnet objects. For more details, refer to Create subnets for a managed cluster using CLI, Step 5.

  4. Configure the Ceph Manager and Ceph Monitor roles to select the nodes that must host the Ceph Monitor and Ceph Manager daemons:

    1. Obtain the names of the machines on which to place the Ceph Monitor and Ceph Manager daemons:

      kubectl -n <managedClusterProject> get machine
      
    2. Add the nodes section with mon and mgr roles defined:

      spec:
        cephClusterSpec:
          nodes:
            <mgr-node-1>:
              roles:
              - <role-1>
              - <role-2>
              ...
            <mgr-node-2>:
              roles:
              - <role-1>
              - <role-2>
              ...
      

      Substitute <mgr-node-X> with the corresponding Machine object names and <role-X> with the corresponding roles of daemon placement, for example, mon or mgr.

      For other optional node parameters, see Ceph advanced configuration.

  5. Configure Ceph OSD daemons for Ceph cluster data storage:

    Note

    This step involves the deployment of Ceph Monitor and Ceph Manager daemons on nodes that are different from the ones hosting Ceph cluster OSDs. However, you can also colocate Ceph OSDs, Ceph Monitor, and Ceph Manager daemons on the same nodes by configuring the roles and storageDevices sections accordingly. This kind of configuration flexibility is particularly useful in scenarios such as hyper-converged clusters.

    Warning

    The minimal production cluster requires at least three nodes for Ceph Monitor daemons and three nodes for Ceph OSDs.

    1. Obtain the names of machines with disks intended for storing Ceph data:

      kubectl -n <managedClusterProject> get machine
      
    2. For each machine, use status.providerStatus.hardware.storage to obtain information about node disks:

      kubectl -n <managedClusterProject> get machine <machineName> -o yaml
      
      Example of system response with machine hardware details
      status:
        providerStatus:
          hardware:
            storage:
            - byID: /dev/disk/by-id/wwn-0x05ad99618d66a21f
              byIDs:
              - /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_05ad99618d66a21f
              - /dev/disk/by-id/scsi-305ad99618d66a21f
              - /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_05ad99618d66a21f
              - /dev/disk/by-id/wwn-0x05ad99618d66a21f
              byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:0
              byPaths:
              - /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:0
              name: /dev/sda
              serialNumber: 05ad99618d66a21f
              size: 61
              type: hdd
            - byID: /dev/disk/by-id/wwn-0x26d546263bd312b8
              byIDs:
              - /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_26d546263bd312b8
              - /dev/disk/by-id/scsi-326d546263bd312b8
              - /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_26d546263bd312b8
              - /dev/disk/by-id/wwn-0x26d546263bd312b8
              byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:2
              byPaths:
              - /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:2
              name: /dev/sdb
              serialNumber: 26d546263bd312b8
              size: 32
              type: hdd
            - byID: /dev/disk/by-id/wwn-0x2e52abb48862dbdc
              byIDs:
              - /dev/disk/by-id/lvm-pv-uuid-MncrcO-6cel-0QsB-IKaY-e8UK-6gDy-k2hOtf
              - /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_2e52abb48862dbdc
              - /dev/disk/by-id/scsi-32e52abb48862dbdc
              - /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_2e52abb48862dbdc
              - /dev/disk/by-id/wwn-0x2e52abb48862dbdc
              byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:1
              byPaths:
              - /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:1
              name: /dev/sdc
              serialNumber: 2e52abb48862dbdc
              size: 61
              type: hdd
      
    3. Select by-id symlinks on the disks to be used in the Ceph cluster. The symlinks must meet the following requirements:

      • A by-id symlink must contain status.providerStatus.hardware.storage.serialNumber

      • A by-id symlink must not contain wwn

      For the example above, to use the sdc disk to store Ceph data on it, select the /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_2e52abb48862dbdc symlink. It is persistent and will not be affected by node reboot.

      Note

      For details about storage device formats, see Mirantis Container Cloud Reference Architecture: Addressing storage devices.
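
The two symlink requirements above translate directly into a filter over the byIDs list of a disk. This is a minimal sketch: pick_by_id_symlinks is a hypothetical helper, and the disk dictionary mirrors the sdc entry from the example system response.

```python
from typing import List


def pick_by_id_symlinks(disk: dict) -> List[str]:
    """Keep by-id symlinks that contain the serial number and no 'wwn'."""
    serial = disk["serialNumber"]
    return [path for path in disk.get("byIDs", [])
            if serial in path and "wwn" not in path.rsplit("/", 1)[-1]]


sdc = {
    "name": "/dev/sdc",
    "serialNumber": "2e52abb48862dbdc",
    "byIDs": [
        "/dev/disk/by-id/lvm-pv-uuid-MncrcO-6cel-0QsB-IKaY-e8UK-6gDy-k2hOtf",
        "/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_2e52abb48862dbdc",
        "/dev/disk/by-id/scsi-32e52abb48862dbdc",
        "/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_2e52abb48862dbdc",
        "/dev/disk/by-id/wwn-0x2e52abb48862dbdc",
    ],
}
for candidate in pick_by_id_symlinks(sdc):
    print(candidate)
```

For this disk, the filter leaves the three scsi-* symlinks as candidates, including the scsi-SQEMU_QEMU_HARDDISK_2e52abb48862dbdc symlink selected in the text above. The lvm-pv-uuid entry is dropped because it lacks the serial number, and the wwn entry is dropped by the second rule.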

    4. Specify by-id symlinks:

      Specify the selected by-id symlinks in the spec.cephClusterSpec.nodes.storageDevices.fullPath field along with the spec.cephClusterSpec.nodes.storageDevices.config.deviceClass field:

      spec:
        cephClusterSpec:
          nodes:
            <storage-node-1>:
              storageDevices:
              - fullPath: <byIDSymlink-1>
                config:
                  deviceClass: <deviceClass-1>
              - fullPath: <byIDSymlink-2>
                config:
                  deviceClass: <deviceClass-1>
              - fullPath: <byIDSymlink-3>
                config:
                  deviceClass: <deviceClass-2>
              ...
            <storage-node-2>:
              storageDevices:
              - fullPath: <byIDSymlink-4>
                config:
                  deviceClass: <deviceClass-1>
              - fullPath: <byIDSymlink-5>
                config:
                  deviceClass: <deviceClass-1>
              - fullPath: <byIDSymlink-6>
                config:
                  deviceClass: <deviceClass-2>
            <storage-node-3>:
              storageDevices:
              - fullPath: <byIDSymlink-7>
                config:
                  deviceClass: <deviceClass-1>
              - fullPath: <byIDSymlink-8>
                config:
                  deviceClass: <deviceClass-1>
              - fullPath: <byIDSymlink-9>
                config:
                  deviceClass: <deviceClass-2>
      

      Specify the selected by-id symlinks in the spec.cephClusterSpec.nodes.storageDevices.name field along with the spec.cephClusterSpec.nodes.storageDevices.config.deviceClass field:

      spec:
        cephClusterSpec:
          nodes:
            <storage-node-1>:
              storageDevices:
              - name: <byIDSymlink-1>
                config:
                  deviceClass: <deviceClass-1>
              - name: <byIDSymlink-2>
                config:
                  deviceClass: <deviceClass-1>
              - name: <byIDSymlink-3>
                config:
                  deviceClass: <deviceClass-2>
              ...
            <storage-node-2>:
              storageDevices:
              - name: <byIDSymlink-4>
                config:
                  deviceClass: <deviceClass-1>
              - name: <byIDSymlink-5>
                config:
                  deviceClass: <deviceClass-1>
              - name: <byIDSymlink-6>
                config:
                  deviceClass: <deviceClass-2>
            <storage-node-3>:
              storageDevices:
              - name: <byIDSymlink-7>
                config:
                  deviceClass: <deviceClass-1>
              - name: <byIDSymlink-8>
                config:
                  deviceClass: <deviceClass-1>
              - name: <byIDSymlink-9>
                config:
                  deviceClass: <deviceClass-2>
      

      Substitute the following values:

      • <storage-node-X> with the corresponding Machine object names

      • <byIDSymlink-X> with the by-id symlinks obtained from status.providerStatus.hardware.storage.byIDs

      • <deviceClass-X> with the disk types obtained from status.providerStatus.hardware.storage.type
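
Once the nodes section is assembled, the minimal production layout from the warning in this step (at least three Ceph Monitor nodes and at least three nodes with Ceph OSDs) can be checked mechanically. The sketch below is an illustration only: check_minimal_layout is a hypothetical helper, the node names and symlink placeholders are invented, and the nodes dictionary stands for the spec.cephClusterSpec.nodes section loaded into Python.

```python
from typing import List


def check_minimal_layout(nodes: dict) -> List[str]:
    """Report violations of the documented minimal production layout."""
    mon_nodes = [name for name, spec in nodes.items()
                 if "mon" in spec.get("roles", [])]
    osd_nodes = [name for name, spec in nodes.items()
                 if spec.get("storageDevices")]
    problems = []
    if len(mon_nodes) < 3:
        problems.append(f"Ceph Monitor nodes: {len(mon_nodes)}, need at least 3")
    if len(osd_nodes) < 3:
        problems.append(f"Ceph OSD nodes: {len(osd_nodes)}, need at least 3")
    return problems


# Hypothetical layout: three monitor/manager nodes, two storage nodes.
nodes = {
    "master-1": {"roles": ["mon", "mgr"]},
    "master-2": {"roles": ["mon", "mgr"]},
    "master-3": {"roles": ["mon", "mgr"]},
    "worker-1": {"storageDevices": [{"fullPath": "<byIDSymlink-1>"}]},
    "worker-2": {"storageDevices": [{"fullPath": "<byIDSymlink-4>"}]},
}
print(check_minimal_layout(nodes))
```

Nodes that carry both roles and storageDevices count toward both totals, which matches the colocated layout mentioned in the note of this step.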

  6. Configure the pools for Image, Block Storage, and Compute services:

    Note

    Ceph validates the specified pools. Therefore, do not omit any of the following pools.

    spec:
      cephClusterSpec:
        pools:
        - default: true
          deviceClass: hdd
          name: kubernetes
          replicated:
            size: 3
          role: kubernetes
        - default: false
          deviceClass: hdd
          name: volumes
          replicated:
            size: 3
          role: volumes
        - default: false
          deviceClass: hdd
          name: vms
          replicated:
            size: 3
          role: vms
        - default: false
          deviceClass: hdd
          name: backup
          replicated:
            size: 3
          role: backup
        - default: false
          deviceClass: hdd
          name: images
          replicated:
            size: 3
          role: images
    

    Each Ceph pool, depending on its role, has the default targetSizeRatio value that defines the expected consumption of the total Ceph cluster capacity. The default ratio values for MOSK pools are as follows:

    • 20.0% for a Ceph pool with the role volumes

    • 40.0% for a Ceph pool with the role vms

    • 10.0% for a Ceph pool with the role images

    • 10.0% for a Ceph pool with the role backup
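
To get a feel for what the default ratios mean in absolute terms, the percentages above can be applied to a given capacity. This is plain arithmetic on the documented defaults and does not model how Ceph itself evaluates targetSizeRatio. The expected_consumption helper is a hypothetical name and the 100 TiB capacity is an arbitrary example value.

```python
from typing import Dict

# Default percentages listed above for the MOSK pool roles.
DEFAULT_TARGET_SIZE_PERCENT = {
    "volumes": 20.0,
    "vms": 40.0,
    "images": 10.0,
    "backup": 10.0,
}


def expected_consumption(total_capacity_tib: float) -> Dict[str, float]:
    """Split a capacity figure according to the default ratios."""
    shares = {role: total_capacity_tib * percent / 100
              for role, percent in DEFAULT_TARGET_SIZE_PERCENT.items()}
    # Whatever is not covered by the four roles stays unassigned.
    shares["unassigned"] = total_capacity_tib - sum(shares.values())
    return shares


print(expected_consumption(100))
```

With these defaults, the four roles account for 80% of the capacity figure, which leaves 20% for other pools such as the kubernetes pool or object storage.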

  7. Optional. Configure Ceph Block Pools to use RBD. For the detailed configuration, refer to Pool parameters.

    Example configuration:

    spec:
      cephClusterSpec:
        pools:
        - name: kubernetes
          role: kubernetes
          deviceClass: hdd
          replicated:
            size: 3
            targetSizeRatio: 10.0
          default: true
    
  8. Configure Ceph Object Storage to use OpenStack Swift Object Storage. For details, see RADOS Gateway parameters. Example configuration:

    spec:
      cephClusterSpec:
        objectStorage:
          rgw:
            dataPool:
              deviceClass: hdd
              erasureCoded:
                codingChunks: 1
                dataChunks: 2
              failureDomain: host
            gateway:
              instances: 3
              port: 80
              securePort: 8443
            metadataPool:
              deviceClass: hdd
              failureDomain: host
              replicated:
                size: 3
            name: object-store
            preservePoolsOnDelete: false
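
The dataChunks and codingChunks values in the example determine how much raw capacity the erasure-coded data pool needs and how many failures it tolerates. The sketch below applies the general erasure coding arithmetic (k data chunks plus m coding chunks) and compares it with a replicated pool. The helper names are hypothetical and nothing here is specific to the Mirantis tooling.

```python
from fractions import Fraction
from typing import Dict


def erasure_coded_profile(data_chunks: int, coding_chunks: int) -> Dict[str, object]:
    """Summarize a k+m erasure-coded pool."""
    total = data_chunks + coding_chunks
    return {
        # Share of raw capacity that holds user data.
        "usable_fraction": Fraction(data_chunks, total),
        # Number of chunks that can be lost without losing data.
        "tolerated_failures": coding_chunks,
        # With failureDomain: host, each chunk needs its own host.
        "min_failure_domains": total,
    }


def replicated_usable_fraction(size: int) -> Fraction:
    """Share of raw capacity that holds user data in a replicated pool."""
    return Fraction(1, size)


print(erasure_coded_profile(data_chunks=2, coding_chunks=1))
print(replicated_usable_fraction(3))
```

For the values in the example, the data pool stores two data chunks and one coding chunk per object, so two thirds of the raw capacity is usable and three hosts are needed. The replicated metadata pool with size 3 uses one third of its raw capacity for data.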
    
  9. Optional. Configure Ceph Shared Filesystem to use CephFS. For the detailed configuration, refer to Configure Ceph Shared File System (CephFS). Example configuration:

    spec:
      cephClusterSpec:
        sharedFilesystem:
          cephFS:
          - name: cephfs-store
            dataPools:
            - name: cephfs-pool-1
              deviceClass: hdd
              replicated:
                size: 3
              failureDomain: host
            metadataPool:
              deviceClass: nvme
              replicated:
                size: 3
              failureDomain: host
            metadataServer:
              activeCount: 1
              activeStandby: false
    
  10. When the Ceph cluster specification is complete, apply the built YAML file on the management cluster:

    kubectl apply -f <kcc-template>.yaml
    

    Substitute <kcc-template> with the name of the file containing the KaaSCephCluster specification.

    The resulting example of the KaaSCephCluster template
    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephCluster
    metadata:
      name: kaas-ceph
      namespace: mosk-namespace
    spec:
      k8sCluster:
        name: mosk-cluster
        namespace: mosk-namespace
      cephClusterSpec:
        network:
          publicNet: 10.10.0.0/24
          clusterNet: 10.11.0.0/24
        nodes:
          master-1:
            roles:
            - mon
            - mgr
          master-2:
            roles:
            - mon
            - mgr
          master-3:
            roles:
            - mon
            - mgr
          worker-1:
            storageDevices:
            - fullPath: /dev/disk/by-id/scsi-1ATA_WDC_WDS100T2B0A-00SM50_200231443409
              config:
                deviceClass: ssd
          worker-2:
            storageDevices:
            - fullPath: /dev/disk/by-id/scsi-1ATA_WDC_WDS100T2B0A-00SM50_200231440912
              config:
                deviceClass: ssd
          worker-3:
            storageDevices:
            - fullPath: /dev/disk/by-id/scsi-1ATA_WDC_WDS100T2B0A-00SM50_200231434939
              config:
                deviceClass: ssd
        pools:
        - default: true
          deviceClass: ssd
          name: kubernetes
          replicated:
            size: 3
          role: kubernetes
        - default: false
          deviceClass: ssd
          name: volumes
          replicated:
            size: 3
          role: volumes
        - default: false
          deviceClass: ssd
          name: vms
          replicated:
            size: 3
          role: vms
        - default: false
          deviceClass: ssd
          name: backup
          replicated:
            size: 3
          role: backup
        - default: false
          deviceClass: ssd
          name: images
          replicated:
            size: 3
          role: images
        objectStorage:
          rgw:
            dataPool:
              deviceClass: ssd
              erasureCoded:
                codingChunks: 1
                dataChunks: 2
              failureDomain: host
            gateway:
              instances: 3
              port: 80
              securePort: 8443
            metadataPool:
              deviceClass: ssd
              failureDomain: host
              replicated:
                size: 3
            name: object-store
            preservePoolsOnDelete: false
    
  11. Wait until the KaaSCephCluster status appears and status.shortClusterInfo.state becomes Ready:

    kubectl -n <managedClusterProject> get kcc -o yaml
    
  12. Verify your Ceph cluster as described in Verify Ceph.

  13. Once all pools are created, verify that an appropriate secret required for a successful deployment of the OpenStack services that rely on Ceph is created in the openstack-ceph-shared namespace:

    kubectl -n openstack-ceph-shared get secrets openstack-ceph-keys
    

    Example of a positive system response:

    NAME                  TYPE     DATA   AGE
    openstack-ceph-keys   Opaque   7      36m
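
The readiness check from the procedure above, which waits for status.shortClusterInfo.state to become Ready, can be scripted. The following Python sketch is only an illustration: it assumes that the object is retrieved with -o json instead of -o yaml so that the standard json module can parse it, and the sample document is a hypothetical, heavily trimmed output rather than a real system response.

```python
import json


def ceph_state(kcc: dict) -> str:
    """Return status.shortClusterInfo.state, or an empty string if it is not set yet."""
    return kcc.get("status", {}).get("shortClusterInfo", {}).get("state", "")


def all_ready(kubectl_json: str) -> bool:
    """Return True if every KaaSCephCluster in the output reports the Ready state."""
    doc = json.loads(kubectl_json)
    # "kubectl get kcc -o json" without an object name returns a List with "items".
    # A single object is wrapped into a one-element list to share the same code path.
    items = doc.get("items", [doc])
    return bool(items) and all(ceph_state(item) == "Ready" for item in items)


# Hypothetical, trimmed output used only to exercise the helpers.
sample = json.dumps({
    "kind": "List",
    "items": [
        {"kind": "KaaSCephCluster",
         "status": {"shortClusterInfo": {"state": "Ready"}}},
    ],
})
print(all_ready(sample))
```

In practice, the string passed to all_ready would be the captured output of the kubectl command, polled until the function returns True.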
    
Create a Ceph cluster using the web UI

Warning

Mirantis highly recommends adding a Ceph cluster using the CLI instead of the web UI.

The web UI capabilities for adding a Ceph cluster are limited and lack flexibility in defining Ceph cluster specifications. For example, if an error occurs while adding a Ceph cluster using the web UI, usually you can address it only through the CLI.

The web UI functionality for managing Ceph clusters is going to be deprecated in one of the upcoming releases.

  1. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The Cluster page with the Machines and Ceph clusters lists opens.

  4. In the Ceph Clusters block, click Create Cluster.

  5. Configure the Ceph cluster in the Create New Ceph Cluster wizard that opens:

    Create new Ceph cluster

    Section

    Parameter name

    Description

    General settings

    Name

    The Ceph cluster name.

    Cluster Network

    Replication network for Ceph OSDs. Must contain the CIDR definition and match the corresponding values of the cluster L2Template object or the environment network values.

    Public Network

    Public network for Ceph data. Must contain the CIDR definition and match the corresponding values of the cluster L2Template object or the environment network values.

    Enable OSDs LCM

    Select to enable LCM for Ceph OSDs.

    Machines / Machine #1-3

    Select machine

    Select the name of the Kubernetes machine that will host the corresponding Ceph node in the Ceph cluster.

    Manager, Monitor

    Select the required Ceph services to install on the Ceph node.

    Devices

    Select the disk that Ceph will use.

    Warning

    Do not select the device for system services, for example, sda.

    Warning

    A Ceph cluster does not support removable devices, that is, disks with the hotplug functionality enabled. To use devices as Ceph OSD data devices, make them non-removable or disable the hotplug functionality in the BIOS settings for the disks that are configured to be used as Ceph OSD data devices.

    Enable Object Storage

    Select to enable the single-instance RGW Object Storage.

  6. To add more Ceph nodes to the new Ceph cluster, click + next to any Ceph Machine title in the Machines tab. Configure a Ceph node as required.

    Warning

    Do not add more than 3 Manager and/or Monitor services to the Ceph cluster.

  7. After you add and configure all nodes in your Ceph cluster, click Create.

  8. Open the KaaSCephCluster CR for editing as described in Ceph advanced configuration.

  9. Verify that the following snippet is present in the KaaSCephCluster configuration:

    network:
      clusterNet: 10.10.10.0/24
      publicNet: 10.10.11.0/24
    
  10. Configure the pools for Image, Block Storage, and Compute services.

    Note

    Ceph validates the specified pools. Therefore, do not omit any of the following pools.

    spec:
      pools:
        - default: true
          deviceClass: hdd
          name: kubernetes
          replicated:
            size: 3
          role: kubernetes
        - default: false
          deviceClass: hdd
          name: volumes
          replicated:
            size: 3
          role: volumes
        - default: false
          deviceClass: hdd
          name: vms
          replicated:
            size: 3
          role: vms
        - default: false
          deviceClass: hdd
          name: backup
          replicated:
            size: 3
          role: backup
        - default: false
          deviceClass: hdd
          name: images
          replicated:
            size: 3
          role: images
    

    Each Ceph pool, depending on its role, has a default targetSizeRatio value that defines the expected consumption of the total Ceph cluster capacity. The default ratio values for MOSK pools are as follows:

    • 20.0% for a Ceph pool with role volumes

    • 40.0% for a Ceph pool with role vms

    • 10.0% for a Ceph pool with role images

    • 10.0% for a Ceph pool with role backup

  11. Once all pools are created, verify that an appropriate secret required for a successful deployment of the OpenStack services that rely on Ceph is created in the openstack-ceph-shared namespace:

    kubectl -n openstack-ceph-shared get secrets openstack-ceph-keys
    

    Example of a positive system response:

    NAME                  TYPE     DATA   AGE
    openstack-ceph-keys   Opaque   7      36m
    
  12. Verify your Ceph cluster as described in Verify Ceph.
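
The default targetSizeRatio values listed in the procedure above can be translated into expected capacity figures. The following Python sketch is a simplified illustration: it treats each ratio as a plain percentage of the total cluster capacity, as the description above does, and the 100 TiB total is a hypothetical value. Replication overhead is not taken into account.

```python
# Default targetSizeRatio values per pool role, in percent, as listed above.
DEFAULT_RATIOS = {"volumes": 20.0, "vms": 40.0, "images": 10.0, "backup": 10.0}


def expected_consumption(total_capacity_tib: float) -> dict:
    """Map each pool role to its expected share of the total capacity, in TiB."""
    return {
        role: total_capacity_tib * percent / 100.0
        for role, percent in DEFAULT_RATIOS.items()
    }


total_tib = 100.0  # hypothetical total capacity of the Ceph cluster
plan = expected_consumption(total_tib)
for role, tib in plan.items():
    print(f"{role}: {tib:.1f} TiB")
# Capacity that is not attributed to any of the four roles above.
print(f"other: {total_tib - sum(plan.values()):.1f} TiB")
```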

Deploy OpenStack

This section instructs you on how to deploy OpenStack on top of Kubernetes as well as how to troubleshoot the deployment and access your OpenStack environment after deployment.

Deploy an OpenStack cluster

This section instructs you on how to deploy OpenStack on top of Kubernetes using the OpenStack Controller (Rockoon) and openstackdeployments.lcm.mirantis.com (OsDpl) CR.

To deploy an OpenStack cluster:

  1. Verify that you have pre-configured the networking according to Networking.

  2. Verify that the TLS certificates that will be required for the OpenStack cluster deployment have been pre-generated.

    Note

    The Transport Layer Security (TLS) protocol is mandatory on public endpoints.

    Caution

    Subsequent OpenStack updates may introduce additional services with new public endpoints. To avoid renewing certificates in such cases, we recommend using wildcard SSL certificates for public endpoints. For example, *.it.just.works, where it.just.works is a cluster public domain.

    The sample code block below illustrates how to generate a self-signed certificate for the it.just.works domain. The procedure presumes the cfssl and cfssljson tools are installed on the machine.

    mkdir cert && cd cert
    
    tee ca-config.json << EOF
    {
      "signing": {
        "default": {
          "expiry": "8760h"
        },
        "profiles": {
          "kubernetes": {
            "usages": [
              "signing",
              "key encipherment",
              "server auth",
              "client auth"
            ],
            "expiry": "8760h"
          }
        }
      }
    }
    EOF
    
    tee ca-csr.json << EOF
    {
      "CN": "kubernetes",
      "key": {
        "algo": "rsa",
        "size": 2048
      },
      "names":[{
        "C": "<country>",
        "ST": "<state>",
        "L": "<city>",
        "O": "<organization>",
        "OU": "<organization unit>"
      }]
    }
    EOF
    
    cfssl gencert -initca ca-csr.json | cfssljson -bare ca
    
    tee server-csr.json << EOF
    {
        "CN": "*.it.just.works",
        "hosts":     [
            "*.it.just.works"
        ],
        "key":     {
            "algo": "rsa",
            "size": 2048
        },
        "names": [    {
            "C": "US",
            "L": "San Francisco",
            "ST": "CA"
        }]
    }
    EOF
    cfssl gencert -ca=ca.pem -ca-key=ca-key.pem --config=ca-config.json -profile=kubernetes server-csr.json | cfssljson -bare server
    
  3. Create the openstackdeployment.yaml file that will include the OpenStack cluster deployment configuration. For the configuration details, refer to OpenStack configuration and OpenStack Operator resources.

    Note

    The resource of kind OpenStackDeployment (OsDpl) is a custom resource defined by a resource of kind CustomResourceDefinition. The resource is validated with the help of the OpenAPI v3 schema.

  4. Configure the OsDpl resource depending on the needs of your deployment. For the configuration details, refer to OpenStack configuration.

    Example of an OpenStackDeployment CR of minimum configuration
    apiVersion: lcm.mirantis.com/v1alpha1
    kind: OpenStackDeployment
    metadata:
      name: openstack-cluster
      namespace: openstack
    spec:
      openstack_version: victoria
      preset: compute
      size: tiny
      internal_domain_name: cluster.local
      public_domain_name: it.just.works
      features:
        neutron:
          tunnel_interface: ens3
          external_networks:
            - physnet: physnet1
              interface: veth-phy
              bridge: br-ex
              network_types:
               - flat
              vlan_ranges: null
              mtu: null
          floating_network:
            enabled: False
        nova:
          live_migration_interface: ens3
          images:
            backend: local
    
  5. If required, enable DPDK, huge pages, and other supported Telco features as described in Advanced OpenStack configuration (optional).

  6. To the openstackdeployment object, add information about the TLS certificates:

    • ssl:public_endpoints:ca_cert - CA certificate content (ca.pem)

    • ssl:public_endpoints:api_cert - server certificate content (server.pem)

    • ssl:public_endpoints:api_key - server private key (server-key.pem)

  7. Verify that the Load Balancer network does not overlap your corporate or internal Kubernetes networks, for example, Calico IP pools. Also, verify that the address pool of the Load Balancer network is large enough to provide IP addresses for all Amphora VMs (load balancers).

    If required, reconfigure the Octavia network settings using the following sample structure:

    spec:
      services:
        load-balancer:
          octavia:
            values:
              octavia:
                settings:
                  lbmgmt_cidr: "10.255.0.0/16"
                  lbmgmt_subnet_start: "10.255.1.0"
                  lbmgmt_subnet_end: "10.255.255.254"
    
  8. If you are using the default backend to store OpenStack database backups, which is Ceph, you may want to increase the default size of the allocated storage since there is no automatic way to resize the backup volume once the cloud is deployed.

    For the default sizes and configuration details, refer to Size of a backup storage.

  9. Trigger the OpenStack deployment:

    kubectl apply -f openstackdeployment.yaml
    
  10. Monitor the status of your OpenStack deployment:

    kubectl -n openstack get pods
    kubectl -n openstack describe osdpl openstack-cluster
    
  11. Assess the current status of the OpenStack deployment using the status section output in the OsDpl resource:

    1. Get the OsDpl YAML file:

      kubectl -n openstack get osdpl openstack-cluster -o yaml
      
    2. Analyze the status output using the detailed description in OpenStack configuration.

  12. Verify that the OpenStack cluster has been deployed:

    client_pod_name=$(kubectl -n openstack get pods -l application=keystone,component=client | grep keystone-client | head -1 | awk '{print $1}')
    kubectl -n openstack exec -it "$client_pod_name" -- openstack service list
    

    Example of a positive system response:

    +----------------------------------+---------------+----------------+
    | ID                               | Name          | Type           |
    +----------------------------------+---------------+----------------+
    | 159f5c7e59784179b589f933bf9fc6b0 | cinderv3      | volumev3       |
    | 6ad762f04eb64a31a9567c1c3e5a53b4 | keystone      | identity       |
    | 7e265e0f37e34971959ce2dd9eafb5dc | heat          | orchestration  |
    | 8bc263babe9944cdb51e3b5981a0096b | nova          | compute        |
    | 9571a49d1fdd4a9f9e33972751125f3f | placement     | placement      |
    | a3f9b25b7447436b85158946ca1c15e2 | neutron       | network        |
    | af20129d67a14cadbe8d33ebe4b147a8 | heat-cfn      | cloudformation |
    | b00b5ad18c324ac9b1c83d7eb58c76f5 | radosgw-swift | object-store   |
    | b28217da1116498fa70e5b8d1b1457e5 | cinderv2      | volumev2       |
    | e601c0749ce5425c8efb789278656dd4 | glance        | image          |
    +----------------------------------+---------------+----------------+
    
  13. Register a record on the customer DNS:

    Caution

    The DNS component is mandatory to access OpenStack public endpoints.

    1. Obtain the full list of endpoints:

      kubectl -n openstack get ingress -o custom-columns='NAME:.metadata.name,HOSTS:spec.rules[*].host' | awk '/namespace-fqdn/ {print $2}'
      

      Example of system response:

      barbican.<spec:public_domain_name>
      cinder.<spec:public_domain_name>
      cloudformation.<spec:public_domain_name>
      designate.<spec:public_domain_name>
      glance.<spec:public_domain_name>
      heat.<spec:public_domain_name>
      horizon.<spec:public_domain_name>
      keystone.<spec:public_domain_name>
      metadata.<spec:public_domain_name>
      neutron.<spec:public_domain_name>
      nova.<spec:public_domain_name>
      novncproxy.<spec:public_domain_name>
      octavia.<spec:public_domain_name>
      placement.<spec:public_domain_name>
      
    2. Obtain the public endpoint IP:

      kubectl -n openstack get services ingress
      

      In the system response, capture EXTERNAL-IP.

      Example of system response:

      NAME      TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)                                      AGE
      ingress   LoadBalancer   10.96.32.97   10.172.1.101   80:34234/TCP,443:34927/TCP,10246:33658/TCP   4h56m
      
    3. Ask the customer to create DNS records that point the public endpoints obtained earlier in this procedure to the EXTERNAL-IP of the Ingress service.
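
Step 7 of the procedure above asks you to verify that the Load Balancer network does not overlap other networks and is large enough for all Amphora VMs. The following Python sketch shows one possible way to run such a check with the standard ipaddress module. The check_lb_network helper and the list of other networks are hypothetical and serve only as an illustration. The Octavia values are taken from the sample structure in step 7.

```python
import ipaddress


def check_lb_network(cidr, start, end, other_networks):
    """Return (problems, capacity) for the Octavia management network settings."""
    net = ipaddress.ip_network(cidr)
    first = ipaddress.ip_address(start)
    last = ipaddress.ip_address(end)
    problems = []
    if first not in net or last not in net:
        problems.append("allocation range is outside lbmgmt_cidr")
    if first > last:
        problems.append("lbmgmt_subnet_start is greater than lbmgmt_subnet_end")
    for other in other_networks:
        if net.overlaps(ipaddress.ip_network(other)):
            problems.append(f"{cidr} overlaps {other}")
    # Number of addresses in the allocation range, both ends included.
    capacity = int(last) - int(first) + 1
    return problems, capacity


problems, capacity = check_lb_network(
    "10.255.0.0/16", "10.255.1.0", "10.255.255.254",
    # Hypothetical networks to compare against, for example, Calico IP pools.
    ["192.168.0.0/16", "10.96.0.0/16"],
)
print(problems or "no overlaps found")
print(f"addresses available for Amphora VMs: {capacity}")
```

Replace the hypothetical list with the networks that are actually in use in your environment before relying on the result.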

See also

Networking

Advanced OpenStack configuration (optional)

This section includes configuration information for available advanced Mirantis OpenStack for Kubernetes features that include DPDK with the Neutron OVS backend, huge pages, CPU pinning, and other Enhanced Platform Awareness (EPA) capabilities.

Enable LVM ephemeral storage

TechPreview

Note

Consider this section as part of Deploy an OpenStack cluster.

This section instructs you on how to configure LVM as a backend for the VM disks and ephemeral storage.

You can use flexible size units throughout bare metal host profiles. For example, you can now use either sizeGiB: 0.1 or size: 100Mi when specifying a device size.

Mirantis recommends using only one parameter name type and units throughout the configuration files. If both sizeGiB and size are used, sizeGiB is ignored during deployment and the suffix is adjusted accordingly. For example, 1.5Gi will be serialized as 1536Mi. The size without units is counted in bytes. For example, size: 120 means 120 bytes.
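
The unit handling described above can be illustrated with a short Python sketch. It is not the product implementation: the helper names are hypothetical, and the set of accepted suffixes (Ki, Mi, Gi, Ti) is an assumption made for the example. The sketch reproduces two statements from the text: 1.5Gi corresponds to 1536Mi, and a size without units is counted in bytes.

```python
from decimal import Decimal

# Binary unit suffixes assumed for this example.
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def size_to_bytes(size) -> int:
    """Convert a size value such as '1.5Gi', '100Mi', or '120' to bytes."""
    text = str(size).strip()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            # Decimal avoids binary floating-point rounding for values like 1.5.
            return int(Decimal(text[: -len(suffix)]) * factor)
    return int(text)  # no suffix: the value is already in bytes


def to_mebibytes(size) -> str:
    """Express a size in whole mebibytes, for example, '1.5Gi' as '1536Mi'."""
    return f"{size_to_bytes(size) // UNITS['Mi']}Mi"


print(to_mebibytes("1.5Gi"))
print(size_to_bytes("120"))
```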

Warning

All data will be wiped during cluster deployment on devices defined directly or indirectly in the fileSystems list of BareMetalHostProfile. For example:

  • A raw device partition with a file system on it

  • A device partition in a volume group with a logical volume that has a file system on it

  • An mdadm RAID device with a file system on it

  • An LVM RAID device with a file system on it

The wipe field is always considered true for these devices. The false value is ignored.

Therefore, to prevent data loss, move the necessary data from these file systems to another server beforehand, if required.

Warning

Usage of more than one nonvolatile memory express (NVMe) drive per node may cause update issues and is therefore not supported.

To enable LVM ephemeral storage:

  1. In BareMetalHostProfile in the spec:volumeGroups section, add the following configuration for the OpenStack compute nodes:

    spec:
      devices:
        - device:
            byName: /dev/nvme0n1
            minSize: 30Gi
            wipe: true
          partitions:
            - name: lvm_nova_vol
              size: 0
              wipe: true
      volumeGroups:
        - devices:
          - partition: lvm_nova_vol
          name: nova-vol
    

    For details about BareMetalHostProfile, see Operations Guide: Create a custom host profile.

  2. Configure the OpenStackDeployment CR to deploy OpenStack with LVM ephemeral storage. For example:

    spec:
      features:
        nova:
          images:
            backend: lvm
            lvm:
              volume_group: "nova-vol"
    
  3. Optional. Enable encryption for the LVM ephemeral storage by adding the following metadata in the OpenStackDeployment CR:

    spec:
      features:
        nova:
          images:
            encryption:
              enabled: true
              cipher: "aes-xts-plain64"
              key_size: 256
    

    Caution

    Neither live nor cold migration is supported for such instances.

Enable LVM block storage

TechPreview

Note

Consider this section as part of Deploy an OpenStack cluster.

This section instructs you on how to configure LVM as a backend for the OpenStack Block Storage service.

You can use flexible size units throughout bare metal host profiles. For example, you can now use either sizeGiB: 0.1 or size: 100Mi when specifying a device size.

Mirantis recommends using only one parameter name type and units throughout the configuration files. If both sizeGiB and size are used, sizeGiB is ignored during deployment and the suffix is adjusted accordingly. For example, 1.5Gi will be serialized as 1536Mi. The size without units is counted in bytes. For example, size: 120 means 120 bytes.

Warning

All data will be wiped during cluster deployment on devices defined directly or indirectly in the fileSystems list of BareMetalHostProfile. For example:

  • A raw device partition with a file system on it

  • A device partition in a volume group with a logical volume that has a file system on it

  • An mdadm RAID device with a file system on it

  • An LVM RAID device with a file system on it

The wipe field is always considered true for these devices. The false value is ignored.

Therefore, to prevent data loss, move the necessary data from these file systems to another server beforehand, if required.

To enable LVM block storage:

  1. Open BareMetalHostProfile for editing.

  2. In the spec:volumeGroups section, specify the following data for the OpenStack compute nodes. In the following example, we deploy a Cinder volume with LVM on compute nodes. However, you can use dedicated nodes for this purpose.

    spec:
      devices:
        - device:
            byName: /dev/nvme0n1
            minSize: 30Gi
            wipe: true
          partitions:
            - name: lvm_cinder_vol
              size: 0
              wipe: true
      volumeGroups:
        - devices:
          - partition: lvm_cinder_vol
          name: cinder-vol
    

    Important

    Since MOSK 23.1, the open-iscsi package is not installed by default on bare metal hosts. Install it manually during cluster deployment in BareMetalHostProfile in the spec:postDeployScript section:

    spec:
      postDeployScript: |
        #!/bin/bash -ex
        apt-get update
        apt-get install --no-install-recommends -y open-iscsi
        echo $(date) 'post_deploy_script done' >> /root/post_deploy_done
    

    For details about BareMetalHostProfile, see Operations Guide: Create a custom host profile.

  3. Configure the OpenStackDeployment CR to deploy OpenStack with LVM block storage. For example:

    spec:
      nodes:
        rockoon-openstack-compute-node::enabled:
          features:
            cinder:
              volume:
                backends:
                  lvm:
                    lvm:
                      volume_group: "cinder-vol"
    
Enable DPDK with OVS

TechPreview

Note

Consider this section as part of Deploy an OpenStack cluster.

This section instructs you on how to enable DPDK with the Neutron OVS back end.

Warning

Usage of third-party software, which is not part of Mirantis-supported configurations, for example, the use of custom DPDK modules, may block upgrade of an operating system distribution. Users are fully responsible for ensuring the compatibility of such custom components with the latest supported Ubuntu version.

To enable DPDK with OVS:

  1. Verify that your deployment meets the following requirements:

  2. Enable DPDK in the OsDpl custom resource through the node-specific overrides settings. For example:

    spec:
      nodes:
        <NODE-LABEL>::<NODE-LABEL-VALUE>:
          features:
            neutron:
              dpdk:
                bridges:
                - ip_address: 10.12.2.80/24
                  name: br-phy
                driver: igb_uio
                enabled: true
                nics:
                - bridge: br-phy
                  name: nic01
                  pci_id: "0000:05:00.0"
              tunnel_interface: br-phy
    
Enable SR-IOV with OVS

Note

Consider this section as part of Deploy an OpenStack cluster.

This section instructs you on how to enable SR-IOV with the Neutron OVS back end.

To enable SR-IOV with OVS:

  1. Verify that your deployment meets the following requirements:

    • NICs with the SR-IOV support are installed

    • SR-IOV and VT-d are enabled in BIOS

  2. Enable IOMMU in the kernel by configuring intel_iommu=on in the GRUB configuration file. Specify the parameter for compute nodes in BareMetalHostProfile in the grubConfig section:

    spec:
      grubConfig:
          defaultGrubOptions:
            - 'GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX intel_iommu=on"'
    
  3. Configure the OpenStackDeployment CR to deploy OpenStack with the VLAN tenant network encapsulation.

    Caution

    To ensure correct application of the configuration changes, configure VLAN segmentation during the initial OpenStack deployment.

    Configuration example:

    spec:
      features:
        neutron:
          external_networks:
          - bridge: br-ex
            interface: pr-floating
            mtu: null
            network_types:
            - flat
            physnet: physnet1
            vlan_ranges: null
          - bridge: br-tenant
            interface: bond0
            network_types:
              - vlan
            physnet: tenant
            vlan_ranges: 490:499,1420:1459
          tenant_network_types:
            - vlan
    
  4. Enable SR-IOV in the OpenStackDeployment CR through the node-specific overrides settings. For example:

    spec:
      nodes:
        <NODE-LABEL>::<NODE-LABEL-VALUE>:
          features:
            neutron:
              sriov:
                enabled: true
                nics:
                - device: enp10s0f1
                  num_vfs: 7
                  physnet: tenant
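
The vlan_ranges value in the configuration example above is a comma-separated list of <min>:<max> pairs. The following Python sketch parses such a value and counts the VLAN IDs available for tenant networks. The helper names are hypothetical, and the sketch assumes that both ends of each range are inclusive.

```python
def parse_vlan_ranges(spec: str):
    """Parse a value such as '490:499,1420:1459' into a list of (min, max) tuples."""
    ranges = []
    for chunk in spec.split(","):
        low, high = (int(part) for part in chunk.strip().split(":"))
        # Valid VLAN IDs are 1 through 4094.
        if not 1 <= low <= high <= 4094:
            raise ValueError(f"invalid VLAN range: {chunk}")
        ranges.append((low, high))
    return ranges


def vlan_count(spec: str) -> int:
    """Count the VLAN IDs covered by all ranges, both ends included."""
    return sum(high - low + 1 for low, high in parse_vlan_ranges(spec))


print(parse_vlan_ranges("490:499,1420:1459"))
print(vlan_count("490:499,1420:1459"))
```

Such a count gives a rough upper bound on the number of VLAN tenant networks that the tenant physnet can provide.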
    
Enable BGP VPN

TechPreview

Note

Consider this section as part of Deploy an OpenStack cluster.

The BGP VPN service is an extra OpenStack Neutron plugin that enables connection of OpenStack Virtual Private Networks with external VPN sites through either BGP/MPLS IP VPNs or E-VPN.

To enable the BGP VPN service:

Enable BGP VPN in the OsDpl custom resource through the node-specific overrides settings. For example:

spec:
  features:
    neutron:
      bgpvpn:
        enabled: true
        route_reflector:
          # Enable deployment of the FRR route reflector
          enabled: true
          # Local AS number
          as_number: 64512
          # List of subnets allowed to connect to
          # the route reflector over BGP
          neighbor_subnets:
            - 10.0.0.0/8
            - 172.16.0.0/16
  nodes:
    rockoon-openstack-compute-node::enabled:
      features:
        neutron:
          bgpvpn:
            enabled: true

When the service is enabled, a route reflector is scheduled on nodes with the openstack-frrouting: enabled label. Mirantis recommends collocating the route reflector nodes with the OpenStack controller nodes. By default, two replicas are deployed.
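
To reason about which BGP peers fall into the neighbor_subnets list from the example above, a CIDR membership check can help. The following Python sketch only illustrates the subnet arithmetic with the standard ipaddress module. The peer_allowed helper is hypothetical and does not describe how the route reflector itself evaluates the list.

```python
import ipaddress

# Subnets from the neighbor_subnets list in the example above.
NEIGHBOR_SUBNETS = ["10.0.0.0/8", "172.16.0.0/16"]


def peer_allowed(peer_ip: str, subnets=NEIGHBOR_SUBNETS) -> bool:
    """Return True if the peer address belongs to at least one listed subnet."""
    address = ipaddress.ip_address(peer_ip)
    return any(address in ipaddress.ip_network(subnet) for subnet in subnets)


for peer in ("10.12.2.80", "172.16.4.1", "172.17.0.1"):
    print(peer, peer_allowed(peer))
```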

Encrypt the east-west traffic

TechPreview

Note

Consider this section as part of Deploy an OpenStack cluster.

MOSK allows configuring Internet Protocol Security (IPSec) encryption for the east-west tenant traffic between the OpenStack compute nodes and gateways. The feature uses the strongSwan open source IPSec solution. Authentication is accomplished through a pre-shared key (PSK). However, other authentication methods are upcoming.

To encrypt the east-west tenant traffic, enable ipsec in the spec:features:neutron settings of the OpenStackDeployment CR:

spec:
  features:
    neutron:
      ipsec:
        enabled: true

Caution

Enabling IPSec adds extra headers to the tenant traffic. The header size varies depending on IPSec configuration.

Therefore, Mirantis recommends decreasing the network MTU for virtual networks and reserving 73 bytes of overhead for the worst-case scenario as described in Cisco documentation: Configuring IPSec VPN Fragmentation and MTU.
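
The MTU recommendation above comes down to simple arithmetic. The following Python sketch subtracts the 73-byte worst-case overhead from the MTU currently used for a virtual network. The helper name and the sample MTU values are hypothetical and serve only as an illustration.

```python
IPSEC_WORST_CASE_OVERHEAD = 73  # bytes, from the recommendation above


def mtu_with_ipsec(current_mtu: int, overhead: int = IPSEC_WORST_CASE_OVERHEAD) -> int:
    """Return the MTU to configure once the IPSec overhead is reserved."""
    reduced = current_mtu - overhead
    if reduced <= 0:
        raise ValueError(f"MTU {current_mtu} is too small to reserve {overhead} bytes")
    return reduced


# Hypothetical MTU values of virtual networks before IPSec is enabled.
for mtu in (1500, 1450, 9000):
    print(mtu, "->", mtu_with_ipsec(mtu))
```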

Enable Cinder backend for Glance

TechPreview

Note

Consider this section as part of Deploy an OpenStack cluster.

This section instructs you on how to configure a Cinder backend for Glance images through the OpenStackDeployment CR.

Note

This feature depends heavily on Cinder multi-attach, which enables you to simultaneously attach volumes to multiple instances. Therefore, only the block storage backends that support multi-attach can be used.

To configure a Cinder backend for Glance, define the backend identity in the OpenStackDeployment CR. This identity will be used as a name for the backend section in the Glance configuration file.

When defining the backend:

  • Configure one of the backends as default.

  • Configure each backend to use a specific Cinder volume type.

    Note

    You can use the cinder_volume_type parameter instead of backend_name. If so, you have to create this volume type beforehand and take into account that the bootstrap script does not manage such volume types.

The blockstore identity definition example:

spec:
  features:
    glance:
      backends:
        cinder:
          blockstore:
            default: true
            backend_name: <volume_type:volume_name>
            # e.g. backend_name: lvm:lvm_store

The same definition with cinder_volume_type instead of backend_name:

spec:
  features:
    glance:
      backends:
        cinder:
          blockstore:
            default: true
            cinder_volume_type: netapp

Enable Cinder volume encryption

TechPreview

Note

Consider this section as part of Deploy an OpenStack cluster.

This section instructs you on how to enable Cinder volume encryption through the OpenStackDeployment CR using Linux Unified Key Setup (LUKS) and store the encryption keys in Barbican. For details, see Volume encryption.

To enable Cinder volume encryption:

  1. In the OpenStackDeployment CR, specify the LUKS volume type and configure the required encryption parameters for the storage system to encrypt or decrypt the volume.

    The volume_types definition example:

    spec:
      services:
        block-storage:
          cinder:
            values:
              bootstrap:
                volume_types:
                  volumes-hdd-luks:
                    arguments:
                      encryption-cipher: aes-xts-plain64
                      encryption-control-location: front-end
                      encryption-key-size: 256
                      encryption-provider: luks
                    volume_backend_name: volumes-hdd
    
  2. To create an encrypted volume as a non-admin user and store keys in the Barbican storage, assign the creator role to the user since the default Barbican policy allows only the admin or creator role:

    openstack role add --project <PROJECT-ID> --user <USER-ID> creator
    
  3. Optional. To define an encrypted volume as a default one, specify volumes-hdd-luks in default_volume_type in the Cinder configuration:

    spec:
      services:
        block-storage:
          cinder:
            values:
              conf:
                cinder:
                  DEFAULT:
                    default_volume_type: volumes-hdd-luks
    
Advanced configuration for OpenStack compute nodes

Note

Consider this section as part of Deploy an OpenStack cluster.

This section describes how to perform advanced configuration of the OpenStack compute nodes. Such configuration can be required in specific use cases, such as the usage of DPDK, SR-IOV, or huge pages.

Configuration recommendations for compute node types

This section contains recommendations for configuration of an OpenStackDeployment custom resource for the compute nodes of the following types:

  • Compute nodes with the default configuration, without local NVMe storage and SR-IOV network interface cards (NICs)

  • Compute nodes with the NVMe local storage

  • Compute nodes with the SR-IOV NICs

  • Compute nodes with both the NVMe local storage and SR-IOV NICs

Note

If the local NVMe storage is enabled, Mirantis recommends using it and enabling SR-IOV if possible.

Caution

Before using the NVMe local storage and mount point, define them in BareMetalHostProfile. For example:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHostProfile
...
spec:
  devices:
    ...
  - device:
      byName: /dev/nvme0n1
      minSizeGiB: 30
      wipe: true
    partitions:
    - name: local-volumes-partition
      sizeGiB: 0
      wipe: true
    ...
  fileSystems:
    ...
  - fileSystem: ext4
    partition: local-volumes-partition
    # mountpoint for Nova images
    mountPoint: /var/lib/nova

Note

If you mount the /var directory, review Mounting recommendations for the /var directory before configuring BareMetalHostProfile.

Caution

To control the storage type (local NVMe or Ceph) for virtual machines, add the node to an OpenStack host aggregate. For details, see OpenStack documentation: Host aggregates.

As defined in Node-specific configuration, each node with a non-default configuration must be configured separately. Each Machine object must have a configuration-specific label. For example, for a compute node with the local NVMe storage:

apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
...
spec:
  providerSpec:
    value:
      nodeLabels:
      - key: node-type
        value: compute-nvme

Caution

The <NODE-LABEL> value must match one of the allowed labels defined in the nodeLabels section of the Cluster object:

nodeLabels:
- key: <NODE-LABEL>
  value: <NODE-LABEL-VALUE>

Mirantis recommends using node-type as the <NODE-LABEL> key. To view the full list of allowed node labels:

kubectl \
  -n <CLUSTER-NAMESPACE> \
  get cluster <CLUSTER-NAME> \
  -o json \
  | jq .status.providerStatus.releaseRefs.current.allowedNodeLabels

The list of node labels is read-only and cannot be modified.

For compute nodes with the SR-IOV NIC, use compute-sriov as the node-type value of nodeLabels:

apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
...
spec:
  providerSpec:
    value:
      nodeLabels:
      - key: node-type
        value: compute-sriov

For compute nodes with the local NVMe storage and SR-IOV NICs, use compute-nvme-sriov as the node-type value of nodeLabels:

apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
...
spec:
  providerSpec:
    value:
      nodeLabels:
      - key: node-type
        value: compute-nvme-sriov

In the examples above, compute-sriov, compute-nvme-sriov, and compute-nvme are human-readable string identifiers. You can use any unique string value for each type of compute node.

In the OpenStackDeployment object, define a separate section for each node group. The key of each section has the <NODE-LABEL>::<NODE-LABEL-VALUE> format:

apiVersion: lcm.mirantis.com/v1alpha1
kind: OpenStackDeployment
...
spec:
  ...
  nodes:
    node-type::compute-nvme:
      features:
        nova:
          images:
            backend: local
    node-type::compute-sriov:
      features:
        neutron:
          sriov:
            enabled: true
            nics:
            - device: enp10s0f1
              num_vfs: 7
              physnet: tenant
    node-type::compute-nvme-sriov:
      features:
        nova:
          images:
            backend: local
        neutron:
          sriov:
            enabled: true
            nics:
            - device: enp10s0f1
              num_vfs: 7
              physnet: tenant
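
The section keys above follow a simple pattern. As a minimal illustrative sketch, the following Python snippet composes the key from a label key and a label value, which may help when you generate the nodes section programmatically. The override_key helper is hypothetical and is not part of MOSK.

```python
def override_key(label_key: str, label_value: str) -> str:
    """Compose the '<NODE-LABEL>::<NODE-LABEL-VALUE>' key used under spec.nodes."""
    if not label_key or not label_value:
        raise ValueError("both the label key and the label value must be non-empty")
    return f"{label_key}::{label_value}"


# Node groups from the examples above.
node_groups = ["compute-nvme", "compute-sriov", "compute-nvme-sriov"]
keys = [override_key("node-type", value) for value in node_groups]
print(keys)
```

Each resulting key must correspond to a label that is actually set on the Machine objects of the node group.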

Configure the CPU model

Note

Consider this section as part of Deploy an OpenStack cluster.

Mirantis OpenStack for Kubernetes (MOSK) enables you to configure the vCPU model for all instances managed by the OpenStack Compute service (Nova) using the following osdpl definition:

spec:
  features:
    nova:
      vcpu_type: host-model

For the supported values and configuration examples, see Virtual CPU.

Enable huge pages for OpenStack

Note

Consider this section as part of Deploy an OpenStack cluster.

Note

The instruction provided in this section applies to both OpenStack with OVS and OpenStack with Tungsten Fabric topologies.

The huge pages OpenStack feature provides essential performance improvements for applications that are highly memory IO-bound. Enable huge pages on a per-compute-node basis. By default, NUMATopologyFilter is enabled.

To activate the feature, you need to enable huge pages on the dedicated bare metal host as described in Enable huge pages in a host profile during the predeployment bare metal configuration.

Note

The multi-size huge pages are not fully supported by Kubernetes versions before 1.19. Therefore, define only one size in kernel parameters.

Configure CPU isolation for an instance

TechPreview

Note

Consider this section as part of Deploy an OpenStack cluster.

Warning

The below procedure applies only to deployments based on deprecated Ubuntu 20.04. For Ubuntu 22.04 that supports cgroup v2, use the cpushield module. For the procedure details, see Host operating system configuration.

CPU isolation is a way to force the system scheduler to use only some logical CPU cores for processes. For compute hosts, you should typically isolate system processes and virtual guests on different cores through the cpusets mechanism in the Linux kernel.

The Linux kernel and cpuset provide a mechanism to run tasks by limiting the resources defined by a cpuset. The tasks can be moved from one cpuset to another to use the resources defined in other cpusets. The cset Python tool is a command-line interface to work with cpusets.

To configure CPU isolation using cpusets:

  1. Configure core isolation:

    Note

    You can also automate this step during deployment by using the postDeploy script as described in Create MOSK host profiles.

    cat <<-"EOF" > /usr/bin/setup-cgroups.sh
    #!/bin/bash
    
    set -x
    
    UNSHIELDED_CPUS=${UNSHIELDED_CPUS:-"0-3"}
    UNSHIELDED_MEM_NUMAS=${UNSHIELDED_MEM_NUMAS:-0}
    SHIELD_CPUS=${SHIELD_CPUS:-"4-15"}
    SHIELD_MODE=${SHIELD_MODE:-"cpuset"} # One of isolcpu or cpuset
    
    DOCKER_CPUS=${DOCKER_CPUS:-$UNSHIELDED_CPUS}
    DOCKER_MEM_NUMAS=${DOCKER_MEM_NUMAS:-$UNSHIELDED_MEM_NUMAS}
    KUBERNETES_CPUS=${KUBERNETES_CPUS:-$UNSHIELDED_CPUS}
    KUBERNETES_MEM_NUMAS=${KUBERNETES_MEM_NUMAS:-$UNSHIELDED_MEM_NUMAS}
    CSET_CMD=${CSET_CMD:-"python3 /usr/bin/cset"}
    
    if [[ ${SHIELD_MODE} == "cpuset" ]]; then
        ${CSET_CMD} set -c ${UNSHIELDED_CPUS} -m ${UNSHIELDED_MEM_NUMAS} -s system
        ${CSET_CMD} proc -m -f root -t system
        ${CSET_CMD} proc -k -f root -t system
    fi
    
    ${CSET_CMD} set --cpu=${DOCKER_CPUS} --mem=${DOCKER_MEM_NUMAS} --set=docker
    ${CSET_CMD} set --cpu=${KUBERNETES_CPUS} --mem=${KUBERNETES_MEM_NUMAS} --set=kubepods
    ${CSET_CMD} set --cpu=${DOCKER_CPUS} --mem=${DOCKER_MEM_NUMAS} --set=com.docker.ucp
    
    EOF
    chmod +x /usr/bin/setup-cgroups.sh
    
    cat <<-"EOF" > /etc/systemd/system/shield-cpus.service
    [Unit]
    Description=Shield CPUs
    DefaultDependencies=no
    After=systemd-udev-settle.service
    Before=lvm2-activation-early.service
    Wants=systemd-udev-settle.service
    [Service]
    ExecStart=/usr/bin/setup-cgroups.sh
    RemainAfterExit=true
    Type=oneshot
    # Restart the service on failure
    Restart=on-failure
    # Retry every five seconds until the service succeeds
    RestartSec=5s
    [Install]
    WantedBy=basic.target
    EOF
    
    systemctl enable shield-cpus
    
    reboot
    
  2. As root user, verify that isolation has been applied:

    cset set -l
    

    Example of system response:

    cset:
          Name       CPUs-X     MEMs-X    Tasks Subs   Path
      ------------ ---------- - ------- - ----- ---- ----------
      root             0-15 y       0 y     165    4  /
      kubepods         0-3 n        0 n       0    2  /kubepods
      docker           0-3 n        0 n       0    0  /docker
      system           0-3 n        0 n      65    0  /system
      com.docker.ucp   0-3 n        0 n       0    0  /com.docker.ucp
    
  3. Run the cpustress container:

    docker run -it --name cpustress --rm containerstack/cpustress --cpu 4 --timeout 30s --metrics-brief
    
  4. Verify that isolated cores are not affected:

    htop
    

    Example of system response highlighting the load created on all available Docker cores:

    _images/cpu-isolation-htop.png

Configure custom CPU topologies

Note

Consider this section as part of Deploy an OpenStack cluster.

The majority of CPU topology features are activated by NUMATopologyFilter, which is enabled by default. Such features do not require any further service configuration and can be used directly on a vanilla MOSK deployment. The list of CPU topology features includes, for example:

  • NUMA placement policies

  • CPU pinning policies

  • CPU thread pinning policies

  • CPU topologies

To enable libvirt CPU pinning through the node-specific overrides in the OpenStackDeployment custom resource, use the following sample configuration structure:

spec:
  nodes:
    <NODE-LABEL>::<NODE-LABEL-VALUE>:
      services:
        compute:
          nova:
            nova_compute:
              values:
                conf:
                  nova:
                    compute:
                      cpu_dedicated_set: 2-17
                      cpu_shared_set: 18-47
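
Before applying such a configuration, it can be useful to confirm that the dedicated and shared CPU sets do not overlap, since Nova is expected to reject overlapping sets. The following Python snippet is a minimal sketch of such a check. The parse_cpu_set helper is hypothetical and handles only comma-separated IDs and ranges, not the full syntax that Nova accepts.

```python
def parse_cpu_set(spec: str) -> set[int]:
    """Expand a CPU list such as '2-17' or '0,2,4-6' into a set of CPU IDs."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part}")
            cpus.update(range(start, end + 1))
        else:
            cpus.add(int(part))
    return cpus


# Values from the sample configuration above.
dedicated = parse_cpu_set("2-17")
shared = parse_cpu_set("18-47")
overlap = dedicated & shared
print(f"dedicated: {len(dedicated)} CPUs, shared: {len(shared)} CPUs, overlap: {sorted(overlap)}")
```

An empty overlap list indicates that the two sets are disjoint.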

Configure PCI passthrough for guests

Note

Consider this section as part of Deploy an OpenStack cluster.

The Peripheral Component Interconnect (PCI) passthrough feature in OpenStack allows full access and direct control over physical PCI devices in guests. The mechanism applies to any kind of PCI devices including a Network Interface Card (NIC), Graphics Processing Unit (GPU), and any other device that can be attached to a PCI bus. The only requirement for the guest to properly use the device is to correctly install the driver.

To enable PCI passthrough in a MOSK deployment:

  1. For Linux X86 compute nodes, verify that the following features are enabled on the host:

  2. Configure the nova-api service that is scheduled on OpenStack controller nodes. To generate the alias for PCI in nova.conf, add the alias configuration through the OpenStackDeployment CR.

    Note

    When configuring PCI with SR-IOV on the same host, the values specified in alias take precedence. Therefore, add the SR-IOV devices to passthrough_whitelist explicitly.

    For example:

    spec:
      services:
        compute:
          nova:
            values:
              conf:
                nova:
                  pci:
                    alias: '{ "vendor_id":"8086", "product_id":"154d", "device_type":"type-PF", "name":"a1" }'
    
  3. Configure the nova-compute service that is scheduled on OpenStack compute nodes. To enable Nova to pass PCI devices to virtual machines, configure the passthrough_whitelist section in nova.conf through the node-specific overrides in the OpenStackDeployment CR. For example:

    spec:
      nodes:
        <NODE-LABEL>::<NODE-LABEL-VALUE>:
          services:
            compute:
              nova:
                nova_compute:
                  values:
                    conf:
                      nova:
                        pci:
                          alias: '{ "vendor_id":"8086", "product_id":"154d", "device_type":"type-PF", "name":"a1" }'
                          passthrough_whitelist: |
                            [{"devname":"enp216s0f0","physical_network":"sriovnet0"}, { "vendor_id": "8086", "product_id": "154d" }]
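
Both alias and passthrough_whitelist take JSON-formatted strings, so a quoting mistake is easy to make when editing them by hand. As a minimal sketch, the following Python snippet generates both strings from dictionaries and parses them back to confirm that they are valid JSON. The values are taken from the example above; the variable names are illustrative only.

```python
import json

# Values taken from the example above.
alias = {
    "vendor_id": "8086",
    "product_id": "154d",
    "device_type": "type-PF",
    "name": "a1",
}
whitelist = [
    {"devname": "enp216s0f0", "physical_network": "sriovnet0"},
    {"vendor_id": alias["vendor_id"], "product_id": alias["product_id"]},
]

alias_value = json.dumps(alias)
whitelist_value = json.dumps(whitelist)

# Parse the strings back to confirm that both are valid JSON.
round_trip_ok = json.loads(alias_value) == alias and json.loads(whitelist_value) == whitelist
print(alias_value)
print(whitelist_value)
print(f"valid JSON: {round_trip_ok}")
```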
    
Configure initial resource oversubscription

Available since MOSK 23.1

MOSK enables you to configure initial oversubscription through the OpenStackDeployment custom resource. For configuration details and oversubscription considerations, refer to Configuring initial resource oversubscription.

By default, the following values are applied:

  • 8.0 for the number of CPUs

  • 1.6 for the disk space

  • 1.0 for the amount of RAM

    Note

    In MOSK 22.5 and earlier, the effective default value of RAM allocation ratio is 1.1.

Changing oversubscription configuration after deployment will only affect the newly added compute nodes and will not change oversubscription for already existing compute nodes. You can change oversubscription for existing compute nodes through the placement API as described in Change oversubscription settings for existing compute nodes.
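
To illustrate the effect of the default ratios, the following Python snippet is a minimal sketch that applies the capacity formula used by the placement service, (total - reserved) * allocation_ratio, to a hypothetical compute node. The host sizes and the schedulable_capacity helper are assumptions made for illustration.

```python
def schedulable_capacity(total: float, reserved: float, allocation_ratio: float) -> float:
    """Capacity that the scheduler can allocate: (total - reserved) * allocation_ratio."""
    return (total - reserved) * allocation_ratio


# Default ratios listed above; the host sizes are hypothetical.
DEFAULT_RATIOS = {"VCPU": 8.0, "DISK_GB": 1.6, "MEMORY_MB": 1.0}
host = {"VCPU": 48, "DISK_GB": 1000, "MEMORY_MB": 262144}

for resource, total in host.items():
    capacity = schedulable_capacity(total, reserved=0, allocation_ratio=DEFAULT_RATIOS[resource])
    print(f"{resource}: {capacity:g}")
```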

Limit HW resources for hyperconverged OpenStack compute nodes

Note

Consider this section as part of Deploy an OpenStack cluster.

Hyperconverged architecture combines OpenStack compute nodes with Ceph nodes. To avoid node overloading, which can cause Ceph performance degradation and outages, limit the hardware resource consumption of the OpenStack compute services.

You can reserve hardware resources for non-workload related consumption using the following nova-compute parameters. For details, see OpenStack documentation: Overcommitting CPU and RAM and OpenStack documentation: Configuration Options.

  • cpu_allocation_ratio - in case of a hyperconverged architecture, the value depends on the number of vCPU used for non-workload related operations, total number of vCPUs of a hyperconverged node, and on workload vCPU consumption:

    cpu_allocation_ratio = (${vCPU_count_on_a_hyperconverged_node} -
    ${vCPU_used_for_non_OpenStack_related_tasks}) /
    ${vCPU_count_on_a_hyperconverged_node} / ${workload_vCPU_utilization}
    

    To define the vCPU count used for non-OpenStack related tasks, use the following formula, considering the storage data plane performance tests:

    vCPU_used_for_non_OpenStack_related_tasks = 2 * SSDs_per_hyperconverged_node +
    1 * Ceph_OSDs_per_hyperconverged_node + 0.8 * Ceph_OSDs_per_hyperconverged_node
    

    Consider the following example with 5 SSD disks for Ceph OSDs per hyperconverged node and 2 Ceph OSDs per disk, which results in 10 Ceph OSDs per node:

    vCPU_used_for_non_OpenStack_related_tasks = 2 * 5 + 1 * 10 + 0.8 * 10 = 28
    

    In this case, if there are 40 vCPUs per hyperconverged node, 28 vCPUs are required for non-workload related calculations, and a workload consumes 50% of the allocated CPU time: cpu_allocation_ratio = (40-28) / 40 / 0.5 = 0.6.


  • reserved_host_memory_mb - a dedicated variable in the OpenStack Nova configuration, to reserve memory for non-OpenStack related VM activities:

    reserved_host_memory_mb = 13 GB * Ceph_OSDs_per_hyperconverged_node
    

    For example, for 10 Ceph OSDs per hyperconverged node: reserved_host_memory_mb = 13 GB * 10 = 130 GB = 133120 MB


  • ram_allocation_ratio - the allocation ratio of virtual RAM to physical RAM. To completely exclude the possibility of memory overcommitting, set to 1.
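
The calculations above can be sketched in Python as follows. This is a minimal illustration that reproduces the example values from this section. The function names are hypothetical, and the 13 GB per Ceph OSD figure is taken from the formula above.

```python
def non_openstack_vcpus(ssds: int, ceph_osds: int) -> float:
    """vCPUs reserved for non-OpenStack tasks, per the formula above."""
    return 2 * ssds + 1 * ceph_osds + 0.8 * ceph_osds


def cpu_allocation_ratio(total_vcpus: int, reserved_vcpus: float, workload_utilization: float) -> float:
    """Share of vCPUs left for workloads, scaled by the workload vCPU utilization."""
    return (total_vcpus - reserved_vcpus) / total_vcpus / workload_utilization


def reserved_host_memory_mb(ceph_osds: int, gb_per_osd: int = 13) -> int:
    """Memory reserved for Ceph OSDs, converted from GB to MB (1 GB = 1024 MB)."""
    return gb_per_osd * ceph_osds * 1024


ssds = 5
osds = ssds * 2  # 2 Ceph OSDs per SSD disk
reserved = non_openstack_vcpus(ssds, osds)
print(reserved)
print(round(cpu_allocation_ratio(40, reserved, 0.5), 2))
print(reserved_host_memory_mb(osds))
```

With 5 SSDs, 40 vCPUs per node, and 50% workload utilization, the sketch yields the same inputs for the OpenStackDeployment CR as the worked example: 28 reserved vCPUs, a CPU allocation ratio of 0.6, and 133120 MB of reserved memory.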

To limit HW resources for hyperconverged OpenStack compute nodes:

In the OpenStackDeployment CR, specify the cpu_allocation_ratio, ram_allocation_ratio, and reserved_host_memory_mb parameters as required using the calculations described above.

For example:

apiVersion: lcm.mirantis.com/v1alpha1
kind: OpenStackDeployment
spec:
  services:
    compute:
      nova:
        values:
          conf:
            nova:
              DEFAULT:
                cpu_allocation_ratio: 0.6
                ram_allocation_ratio: 1
                reserved_host_memory_mb: 133120

Note

For an existing OpenStack deployment:

  1. Obtain the name of your OpenStackDeployment CR:

    kubectl -n openstack get osdpl
    
  2. Open the OpenStackDeployment CR for editing and specify the parameters as required.

    kubectl -n openstack edit osdpl <osdpl name>
    
Configure GPU virtualization

Available since MOSK 24.1 TechPreview

This section delves into virtual GPU configuration. It is specifically tailored for NVIDIA physical GPUs, with a focus on the A100 40GB GPU and NVIDIA AIE 4.1 drivers.

While setup procedures may vary among different cards and vendors, MOSK can generally ensure compatibility between the MOSK Compute service (Nova) and vGPU functionality, as long as the drivers for the physical GPU expose a VFIO mdev-compatible interface to the Linux host.

For configuration specifics of other physical GPUs, refer to the official documentation provided by the vendor.

Obtain drivers

Visit NVIDIA AI Enterprise documentation for comprehensive guidance on how to download the required drivers.

Also, if you have access to the NVIDIA NGC Catalog, search for the latest Infra Release that provides NVIDIA vGPU Host Driver there.

NVIDIA licensing

To fully utilize the capabilities of NVIDIA GPU virtualization, you may need to set up and configure the NVIDIA licensing server.

Install drivers

To install the acquired drivers within your cluster, add a custom postDeployScript script to the custom BareMetalHostProfile object used for the compute nodes with GPUs.

Note

For the instruction on how to create a BareMetalHostProfile object, refer to Operations Guide: Create a custom host profile.

This script must perform the following tasks:

  • Download and install the drivers, if needed

  • Configure the physical GPU according to your cluster requirements

  • Configure a startup task to reconfigure the physical GPU after node reboots

Example postDeployScript script:

#!/bin/bash -ex
# Create a one time script that will initialize physical GPU right now and self-destruct
cat << EOF > /root/test_postdeploy_job.sh
#!/bin/bash -ex
systemctl enable initialize-vgpu
systemctl start --no-block initialize-vgpu
crontab -l | grep -v test_postdeploy_job.sh | crontab -
rm /root/test_postdeploy_job.sh
EOF
mkdir -p /var/spool/cron/crontabs/ && echo "*/1 * * * * sudo /root/test_postdeploy_job.sh >> /var/log/test_postdeploy_job.log 2>&1" >> /var/spool/cron/crontabs/root
chmod +x /root/test_postdeploy_job.sh

# Create a systemd unit that will re-initialize physical GPU on restart
cat << EOF > /etc/systemd/system/initialize-vgpu.service
[Unit]
Description=Configure VGPU
After=systemd-modules-load.service

[Service]
Type=oneshot
ExecStart=/root/initialize_vgpu.sh
RemainAfterExit=true
StandardOutput=journal
[Install]
RequiredBy=multi-user.target
EOF
# Quote EOF so that $(uname -r) is evaluated when the script runs, not when it is created
cat << 'EOF' > /root/initialize_vgpu.sh
#!/bin/bash
set -ex
while ! docker inspect ucp-kubelet;
    do echo "Waiting for lcm-agent to finish.";
    sleep 1;
done
# Download and install the driver, dependencies and tools
# if the driver package has not been downloaded yet
if [[ ! -f /root/gpu-driver-x-y-z.deb ]]; then
    apt update
    apt install -y dkms unzip gcc libc-dev make linux-headers-$(uname -r) pciutils lshw mdevctl
    wget https://my.intra.net//root/gpu-driver-x-y-z.deb -O /root/gpu-driver-x-y-z.deb
    apt install -y /root/gpu-driver-x-y-z.deb
    systemctl enable initialize-vgpu
fi
systemctl restart nvidia-vgpud.service
# Enable SR-IOV mode for the pGPU
/usr/lib/nvidia/sriov-manage -e <PCI-ADDRESS-OF-NVIDIA-CARD>
# Enable MIG mode for pGPU
nvidia-smi -i 0 -mig 1
systemctl enable nvidia-vgpu-mgr.service
systemctl start nvidia-vgpu-mgr.service
EOF
chmod +x /root/initialize_vgpu.sh

Manage virtual GPU types

Virtual GPU types are similar to compute flavors as they determine the resources allocated to each virtual GPU. This allows for efficient allocation and optimization of GPU resources in virtualized environments.

Each physical GPU has a maximum number of virtual GPUs of a specific type that can be created on it, with no possibility for overallocation. In the time-sliced vGPU configuration, each particular physical GPU can only instantiate vGPUs of the same selected type. In the Multi-Instance GPU (MIG), a single physical GPU may be partitioned into several differently sized virtual GPUs.

Either way, prior to accepting workloads, Mirantis recommends determining the virtual GPU types that each of your physical GPUs will provide. Altering these settings afterward requires terminating every virtual machine currently running on the physical GPU that you intend to reconfigure or repurpose for another virtual GPU type.

Partition to Multi-Instance GPUs

This section outlines the process for partitioning physical GPUs into Multi-Instance GPUs (MIG) using the nvidia-smi tool provided by the NVIDIA Host GPU driver.

To list available virtual GPU instance profiles:

nvidia-smi mig -lgip

Example system response:

+-----------------------------------------------------------------------------+
| GPU instance profiles:                                                      |
| GPU   Name             ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                              Free/Total   GiB              CE    JPEG  OFA  |
|=============================================================================|
|   0  MIG 1g.5gb        19     7/7        4.75       No     14     0     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.5gb+me     20     1/1        4.75       No     14     1     0   |
|                                                             1     1     1   |
+-----------------------------------------------------------------------------+
|   0  MIG 1g.10gb       15     4/4        9.75       No     14     1     0   |
|                                                             1     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 2g.10gb       14     3/3        9.75       No     28     1     0   |
|                                                             2     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 3g.20gb        9     2/2        19.62      No     42     2     0   |
|                                                             3     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 4g.20gb        5     1/1        19.62      No     56     2     0   |
|                                                             4     0     0   |
+-----------------------------------------------------------------------------+
|   0  MIG 7g.40gb        0     1/1        39.50      No     98     5     0   |
|                                                             7     1     1   |
+-----------------------------------------------------------------------------+

To create seven MIG vGPUs of the smallest size, which is the maximum number of instances according to the system response above:

nvidia-smi mig -cgi 19,19,19,19,19,19,19

To create three differently sized vGPUs of 4g.20gb, 2g.10gb, and 1g.5gb sizes:

nvidia-smi mig -cgi 5,14,19

Caution

Keep in mind that not all combinations of differently sized vGPU instances are supported. Additionally, the order in which you create vGPUs is important.

For example configurations, see NVIDIA documentation.
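
Because the -cgi option takes numeric profile IDs, a small lookup table can make partitioning commands easier to read and review. The following Python snippet is a minimal sketch that builds the commands shown above from profile names. The IDs are copied from the example system response for the A100 40GB GPU and may differ on other cards; the mig_create_command helper is hypothetical. The sketch does not verify whether a combination is supported by the GPU.

```python
# Profile IDs copied from the nvidia-smi mig -lgip response above.
PROFILE_IDS = {
    "1g.5gb": 19,
    "1g.5gb+me": 20,
    "1g.10gb": 15,
    "2g.10gb": 14,
    "3g.20gb": 9,
    "4g.20gb": 5,
    "7g.40gb": 0,
}


def mig_create_command(profiles: list[str]) -> str:
    """Build the nvidia-smi command that creates GPU instances in the given order."""
    ids = [str(PROFILE_IDS[name]) for name in profiles]
    return "nvidia-smi mig -cgi " + ",".join(ids)


print(mig_create_command(["4g.20gb", "2g.10gb", "1g.5gb"]))
print(mig_create_command(["1g.5gb"] * 7))
```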

Find mdev class of virtual GPU type

To correctly configure the MOSK Compute service, you need to correlate the following naming schemes related to virtual GPU types:

  • The GPU instance profile as reported by nvidia-smi mig. For example, MIG 1g.5gb.

  • The vGPU type as reported by the driver. For example, GRID A100-1-5C.

  • The mdev class that corresponds to the vGPU type. For example, nvidia-474.

For the compatibility between GPU instance profiles and virtual GPU types, refer to NVIDIA documentation: Virtual GPU Types for Supported GPUs.

To determine the mdev class supported by a specific virtual GPU type listed by a PCI device address, verify the output of the mdevctl types command executed on the compute node that has a physical GPU available on it:

mdevctl types

Example system response for MIGs:

0000:42:00.4
  nvidia-1053
    Available instances: 0
    Device API: vfio-pci
    Name: GRID A100-1-10C
    Description: num_heads=1, frl_config=60, framebuffer=10240M, max_resolution=4096x2400, max_instance=4
  ...
  nvidia-474
    Available instances: 1
    Device API: vfio-pci
    Name: GRID A100-1-5C
    Description: num_heads=1, frl_config=60, framebuffer=5120M, max_resolution=4096x2400, max_instance=7
  ...

The Name field from the example system output above corresponds to the supported virtual GPU type, linking the GPU instance profile with the mdev class supported by your physical GPU.

In the example above, the MIG 1g.5gb GPU instance profile corresponds to the GRID A100-1-5C vGPU type as per NVIDIA documentation, and according to the mdevctl types output, this vGPU type corresponds to the nvidia-474 mdev class.

Note

Notice that Available instances is zero for vGPU types that are not actually supported by this given card and configuration. For MIGs, the Available instances will be non-zero only for the virtual GPU types for which the MIG virtual GPU instances have already been created. See Partition to Multi-Instance GPUs.
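
To correlate vGPU type names with mdev classes on nodes with many PCI devices, you can process the mdevctl types output with a script. The following Python snippet is a minimal sketch that assumes the output layout matches the example system response above: unindented PCI addresses, indented mdev classes, and further indented key-value fields. The parse_mdevctl_types helper is hypothetical.

```python
def parse_mdevctl_types(output: str) -> list[dict[str, str]]:
    """Return one record per mdev class, including its parent PCI address."""
    records: list[dict[str, str]] = []
    pci_address = ""
    for line in output.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "...":
            continue
        if not line[0].isspace():
            # Unindented line: a PCI device address
            pci_address = stripped
        elif ":" not in stripped:
            # Indented line without a colon: an mdev class such as nvidia-474
            records.append({"pci_address": pci_address, "mdev_class": stripped})
        elif records:
            # Key-value field that belongs to the most recent mdev class
            key, value = stripped.split(":", 1)
            records[-1][key.strip()] = value.strip()
    return records


SAMPLE = """\
0000:42:00.4
  nvidia-1053
    Available instances: 0
    Device API: vfio-pci
    Name: GRID A100-1-10C
  nvidia-474
    Available instances: 1
    Device API: vfio-pci
    Name: GRID A100-1-5C
"""

records = parse_mdevctl_types(SAMPLE)
# Keep only the vGPU types that currently have available instances.
usable = {r["Name"]: r["mdev_class"] for r in records if int(r["Available instances"]) > 0}
print(usable)
```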

Configure the Compute service

The parameters you need to define for the nova-compute service on each compute node with physical GPUs you want to expose as virtual GPUs include:

  • [devices]enabled_mdev_types

    Required. List of the mdev classes, see the previous step for details.

  • [devices]cleanup_mdev_devices

    Optional. By default, the Compute service does not delete created mdev devices but reuses them instead. While this speeds up processes, it may pose challenges when reconfiguring the enabled_mdev_types parameter. Set cleanup_mdev_devices to True for the Compute service to auto-delete created mdev devices upon instance deletion.

Time-sliced vGPU

If you plan to use only time-sliced vGPUs and provide a single virtual GPU type across the entire cloud, you only need to configure the options mentioned above once, globally for all compute nodes, through the spec.services section of the OpenStackDeployment custom resource.

With the configuration below, the Compute service will auto-detect all PCI devices that provide this mdev type and automatically create required resource providers in the placement service with the resource class VGPU.

Example configuration for the nvidia-474 mdev type:

kind: OpenStackDeployment
spec:
  services:
    compute:
      nova:
        values:
          conf:
            nova:
              devices:
                enabled_mdev_types: nvidia-474
                cleanup_mdev_devices: true

If you plan to provide multiple time-sliced vGPU types, simplify the configuration by grouping the nodes based on a node label (not necessarily aggregates). Ensure that each group exposes only one mdev type using the Node-specific configuration settings. Additionally, use custom resource classes to facilitate flavor creation, ensuring consistent use of the CUSTOM_ prefix for custom mdev_class.

For example, if you want to provide the nvidia-474 and nvidia-475 mdev types, label your nodes with the vgpu-type=nvidia-474 and vgpu-type=nvidia-475 labels and use the following node-specific settings:

kind: OpenStackDeployment
spec:
  nodes:
    vgpu-type::nvidia-474:
      services:
        compute:
          nova:
            nova_compute:
              values:
                conf:
                  nova:
                    devices:
                      enabled_mdev_types: nvidia-474
                      cleanup_mdev_devices: true
                    mdev_nvidia-474:
                      mdev_class: CUSTOM_VGPU_A100_1_5C
    vgpu-type::nvidia-475:
      services:
        compute:
          nova:
            nova_compute:
              values:
                conf:
                  nova:
                    devices:
                      enabled_mdev_types: nvidia-475
                      cleanup_mdev_devices: true
                    mdev_nvidia-475:
                      mdev_class: CUSTOM_VGPU_A100_2_10C

The configuration above creates corresponding resource providers in the placement service that provide CUSTOM_VGPU_A100_1_5C or CUSTOM_VGPU_A100_2_10C resources. You can use these resources during the definition of flavors for instances with corresponding vGPU types.

In some cases, you may need to provide different vGPU types from a single compute node, for example, if the compute node has 2 physical GPUs and you want to create two different types of vGPU on them. For such scenarios, you should provide explicit PCI device addresses of these physical GPUs in the settings. This makes such configuration verbose in heterogeneous hardware environments where physical GPUs have different PCI addresses on each node. For example, when targeting node-specific settings by node name:

kind: OpenStackDeployment
spec:
  nodes:
    kubernetes.io/hostname::kaas-node-7af9aab1-596d-4ba3-a717-846653aa441a:
      services:
        compute:
          nova:
            nova_compute:
              values:
                conf:
                  nova:
                    devices:
                      enabled_mdev_types: nvidia-474,nvidia-475
                      cleanup_mdev_devices: true
                    mdev_nvidia-474:
                      device_addresses: 0000:42:00.0
                      mdev_class: CUSTOM_VGPU_A100_1_5C
                    mdev_nvidia-475:
                      device_addresses: 0000:43:00.0
                      mdev_class: CUSTOM_VGPU_A100_2_10C

Multi-Instance GPU (MIG)

In the SR-IOV mode, the driver typically creates more virtual functions than the maximum capacity of the physical GPU, even for the smallest virtual GPU type. Each virtual function can hold only one single virtual GPU. This leads to resource over-reporting to the placement service.

Therefore, to ensure efficient resource allocation and utilization within a homogeneous hardware environment, assuming that each compute node in it has the same PCI address for the physical GPU and the physical GPU has been partitioned to the MIG GPU instances identically:

  1. Identify the number of instances created of each MIG profile.

  2. Select arbitrary but non-overlapping sets of PCI addresses from the list of virtual functions of the physical GPU. The number of addresses in each set must correspond to the number of instances created of each MIG profile.

  3. Assign the mdev type to the selected devices.

For example, for the environment with the following configuration:

  • 3 MIG instances of MIG 1g.5gb and 2 MIG instances of MIG 2g.10gb

  • 16 virtual functions created for the physical GPU with the PCI address range from 0000:42:00.0 to 0000:42:01.7

Pick 3 and 2 random PCI addresses from that pool and assign them to CUSTOM_VGPU_A100_1_5C and CUSTOM_VGPU_A100_2_10C mdev classes respectively:

spec:
  services:
    compute:
      nova:
        values:
          conf:
            nova:
              devices:
                enabled_mdev_types: nvidia-474,nvidia-475
                cleanup_mdev_devices: true
              mdev_nvidia-474:
                device_addresses: 0000:42:00.0,0000:42:00.1,0000:42:00.2
                mdev_class: CUSTOM_VGPU_A100_1_5C
              mdev_nvidia-475:
                device_addresses: 0000:42:01.0,0000:42:01.1
                mdev_class: CUSTOM_VGPU_A100_2_10C

In a heterogeneous hardware environment, use node-specific settings to group nodes with the same PCI addresses and intended vGPU configuration, or configure each node explicitly by targeting node-specific settings at individual nodes.
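
The address selection described in this section can be sketched in Python as follows. This minimal illustration enumerates the 16 virtual function addresses from the example and assigns non-overlapping subsets to each mdev type. The helper names are hypothetical, and the sketch takes the addresses sequentially, which is one valid choice among many.

```python
def vf_addresses(domain: int, bus: int, devices: range, functions: range = range(8)) -> list[str]:
    """List PCI addresses in the domain:bus:device.function form."""
    return [f"{domain:04x}:{bus:02x}:{dev:02x}.{fn:x}" for dev in devices for fn in functions]


def split_addresses(pool: list[str], counts: dict[str, int]) -> dict[str, list[str]]:
    """Assign non-overlapping slices of the pool to each mdev type."""
    if sum(counts.values()) > len(pool):
        raise ValueError("not enough virtual functions for the requested instances")
    result: dict[str, list[str]] = {}
    offset = 0
    for mdev_type, count in counts.items():
        result[mdev_type] = pool[offset:offset + count]
        offset += count
    return result


pool = vf_addresses(0x0000, 0x42, range(0, 2))  # 0000:42:00.0 to 0000:42:01.7
assignment = split_addresses(pool, {"nvidia-474": 3, "nvidia-475": 2})
for mdev_type, addresses in assignment.items():
    print(f"mdev_{mdev_type}: device_addresses = {','.join(addresses)}")
```

The printed values can then be used in the device_addresses parameters of the corresponding mdev_<type> sections.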

Verify resource providers

This section provides guidelines for verifying that virtual GPUs are correctly accounted for in the OpenStack Placement service, ensuring proper scheduling of instances that utilize virtual GPUs.

First, verify that resource providers have been created with accurate inventories. For each PCI device associated with a virtual GPU, including virtual instances in the case of MIG/SR-IOV, there should be a nested resource provider under the resource provider of the corresponding compute node. The name of this nested resource provider should follow the format <node-name>_pci_<pci-address-with-underscores>:

openstack resource provider list --resource CUSTOM_VGPU_A100_1_5C=1 -f yaml

Example system response:

- generation: 1
  name: kaas-node-9d18b7c8-7ea8-4b13-abe9-0e76ee8db596.kaas-kubernetes-294cbb1cbf084789b931ebc54d3f9b05_pci_0000_42_00_4
  parent_provider_uuid: c922488b-69eb-42a8-afad-dc7d3d56b8fd
  root_provider_uuid: c922488b-69eb-42a8-afad-dc7d3d56b8fd
  uuid: 963bb3ce-3ed1-421f-a186-a808c3460c48
  ...
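
To look up the resource provider for a particular device, you can derive the expected name from the node name and the PCI address. The following Python snippet is a minimal sketch of the naming format described above. The nested_provider_name helper and the node name are hypothetical.

```python
def nested_provider_name(node_name: str, pci_address: str) -> str:
    """Build '<node-name>_pci_<pci-address-with-underscores>'."""
    pci_part = pci_address.replace(":", "_").replace(".", "_")
    return f"{node_name}_pci_{pci_part}"


# The node name below is a hypothetical placeholder.
print(nested_provider_name("compute-1.example", "0000:42:00.4"))
```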

Also, examine the inventory of each resource provider. It should exclusively consist of the VGPU resource or any custom resource name configured in the Compute service settings. The total capacity of the resource should match the capacity reported by the mdevctl types output, reflecting the capabilities of the PCI device for the specified mdev class. In the case of MIG, this total capacity will always be 1.

openstack resource provider inventory list 963bb3ce-3ed1-421f-a186-a808c3460c48 -f yaml

Example system response:

- allocation_ratio: 1.0
  max_unit: 1
  min_unit: 1
  reserved: 0
  resource_class: CUSTOM_VGPU_A100_1_5C
  step_size: 1
  total: 1
  used: 0

Create required resources

This section provides instructions for creating a flavor that requests a specific virtual GPU resource, using the mdev classes configured in the Compute service and registered in the placement service.

To create the flavor, use the openstack flavor create command. Ensure that the flavor properties match the configured mdev classes in the Compute service. For example, to request one vGPU of type nvidia-474 using the resource class from the previous examples:

openstack flavor create --ram 1024 --vcpus 2 --disk 5 --property resources:CUSTOM_VGPU_A100_1_5C=1 <FLAVOR-NAME>

Replace <FLAVOR-NAME> with the name of the new flavor and the --property resources:CUSTOM_VGPU_A100_1_5C=1 parameter with the property that matches the desired virtual GPU type and quantity.

Once the flavor is created, you can start launching instances using the created flavor as usual.

Enable image signature verification

TechPreview

Note

Consider this section as part of Deploy an OpenStack cluster.

Mirantis OpenStack for Kubernetes (MOSK) enables you to perform image signature verification when booting an OpenStack instance, uploading a Glance image with signature metadata fields set, and creating a volume from an image.

To enable signature verification, use the following osdpl definition:

spec:
  features:
    glance:
      signature:
        enabled: true

When signature verification is enabled during the initial deployment, all internal images, such as Amphora, Ironic, and test (CirrOS, Fedora, Ubuntu) images, are signed with a self-signed certificate.

Configure LoadBalancer for PowerDNS

Note

Consider this section as part of Deploy an OpenStack cluster.

Mirantis OpenStack for Kubernetes (MOSK) allows configuring LoadBalancer for the Designate PowerDNS backend. For example, you can expose a TCP port for zone transfers using the following example osdpl definition:

spec:
 designate:
   backend:
     external_ip: 10.172.1.101
     protocol: udp
     type: powerdns

For the supported values, see LoadBalancer type for PowerDNS.

Access OpenStack after deployment

This section contains the guidelines on how to access your MOSK OpenStack environment.

Configure DNS to access OpenStack

DNS is a mandatory component for a MOSK deployment. All records must be created on the customer DNS server. The OpenStack services are exposed through the Ingress NGINX controller.

Warning

This document describes how to temporarily configure DNS. The workflow contains non-permanent changes that will be rolled back during a managed cluster update or reconciliation loop. Therefore, proceed at your own risk.

To configure DNS to access your OpenStack environment:

  1. Obtain the external IP address of the Ingress service:

    kubectl -n openstack get services ingress
    

    Example of system response:

    NAME      TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)                                      AGE
    ingress   LoadBalancer   10.96.32.97   10.172.1.101   80:34234/TCP,443:34927/TCP,10246:33658/TCP   4h56m
    
  2. Select from the following options:

    • If you have a corporate DNS server, update your corporate DNS service and create appropriate DNS records for all OpenStack public endpoints.

      To obtain the full list of public endpoints:

      kubectl -n openstack get ingress -ocustom-columns='NAME:.metadata.name,HOSTS:spec.rules[*].host' | awk '/namespace-fqdn/ {print $2}'
      

      Example of system response:

      barbican.it.just.works
      cinder.it.just.works
      cloudformation.it.just.works
      designate.it.just.works
      glance.it.just.works
      heat.it.just.works
      horizon.it.just.works
      keystone.it.just.works
      neutron.it.just.works
      nova.it.just.works
      novncproxy.it.just.works
      octavia.it.just.works
      placement.it.just.works
      
    • If you do not have a corporate DNS server, perform one of the following steps:

      • Add the appropriate records to /etc/hosts locally. For example:

        10.172.1.101 barbican.it.just.works
        10.172.1.101 cinder.it.just.works
        10.172.1.101 cloudformation.it.just.works
        10.172.1.101 designate.it.just.works
        10.172.1.101 glance.it.just.works
        10.172.1.101 heat.it.just.works
        10.172.1.101 horizon.it.just.works
        10.172.1.101 keystone.it.just.works
        10.172.1.101 neutron.it.just.works
        10.172.1.101 nova.it.just.works
        10.172.1.101 novncproxy.it.just.works
        10.172.1.101 octavia.it.just.works
        10.172.1.101 placement.it.just.works
        
      • Deploy your DNS server on top of Kubernetes:

        1. Deploy a standalone CoreDNS server by adding the following configuration to coredns.yaml:

          apiVersion: lcm.mirantis.com/v1alpha1
          kind: HelmBundle
          metadata:
            name: coredns
            namespace: osh-system
          spec:
            repositories:
            - name: hub_stable
              url: https://charts.helm.sh/stable
            releases:
            - name: coredns
              chart: hub_stable/coredns
              version: 1.8.1
              namespace: coredns
              values:
                image:
                  repository: mirantis.azurecr.io/openstack/extra/coredns
                  tag: "1.6.9"
                isClusterService: false
                servers:
                - zones:
                  - zone: .
                    scheme: dns://
                    use_tcp: false
                  port: 53
                  plugins:
                  - name: cache
                    parameters: 30
                  - name: errors
                  # Serves a /health endpoint on :8080, required for livenessProbe
                  - name: health
                  # Serves a /ready endpoint on :8181, required for readinessProbe
                  - name: ready
                  # Required to query kubernetes API for data
                  - name: kubernetes
                    parameters: cluster.local
                  - name: loadbalance
                    parameters: round_robin
                  # Serves a /metrics endpoint on :9153, required for serviceMonitor
                  - name: prometheus
                    parameters: 0.0.0.0:9153
                  - name: forward
                    parameters: . /etc/resolv.conf
                  - name: file
                    parameters: /etc/coredns/it.just.works.db it.just.works
                serviceType: LoadBalancer
                zoneFiles:
                - filename: it.just.works.db
                  domain: it.just.works
                  contents: |
                    it.just.works.            IN      SOA     sns.dns.icann.org. noc.dns.icann.org. 2015082541 7200 3600 1209600 3600
                    it.just.works.            IN      NS      b.iana-servers.net.
                    it.just.works.            IN      NS      a.iana-servers.net.
                    it.just.works.            IN      A       1.2.3.4
                    *.it.just.works.           IN      A      1.2.3.4
          
        2. Replace the placeholder IP address in coredns.yaml with the public IP address of the Ingress service and apply the configuration:

          sed -i 's/1.2.3.4/10.172.1.101/' coredns.yaml
          kubectl apply -f coredns.yaml
          
        3. Verify that the DNS resolution works properly:

          1. Assign an external IP to the service:

            kubectl -n coredns patch service coredns-coredns --type='json' -p='[{"op": "replace", "path": "/spec/ports", "value": [{"name": "udp-53", "port": 53, "protocol": "UDP", "targetPort": 53}]}]'
            kubectl -n coredns patch service coredns-coredns --type='json' -p='[{"op": "replace", "path": "/spec/type", "value":"LoadBalancer"}]'
            
          2. Obtain the external IP address of CoreDNS:

            kubectl -n coredns get service coredns-coredns
            

            Example of system response:

            NAME              TYPE        CLUSTER-IP     EXTERNAL-IP    PORT(S)         AGE
            coredns-coredns   ClusterIP   10.96.178.21   10.172.1.102   53/UDP,53/TCP   25h
            
        4. Configure your machine to use the CoreDNS server as its DNS server. In the example system response above, its address is 10.172.1.102.

        5. If you plan to launch Tempest tests or use the OpenStack client from a keystone-client-XXX pod, verify that the Kubernetes built-in DNS service is configured to resolve your public FQDN records by adding your public domain to Corefile. For example, to add the it.just.works domain:

          kubectl -n kube-system get configmap coredns -oyaml
          

          Example of system response:

          apiVersion: v1
          data:
            Corefile: |
              .:53 {
                  errors
                  health
                  ready
                  kubernetes cluster.local in-addr.arpa ip6.arpa {
                    pods insecure
                    fallthrough in-addr.arpa ip6.arpa
                  }
                  prometheus :9153
                  forward . /etc/resolv.conf
                  cache 30
                  loop
                  reload
                  loadbalance
              }
              it.just.works:53 {
                  errors
                  cache 30
                  forward . 10.96.178.21
              }
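
The /etc/hosts option described above can be scripted. The following is a minimal sketch, assuming that the list of public endpoints arrives on standard input, one host name per line; the hosts_entries function name is hypothetical.

```shell
# Hypothetical helper: read host names on stdin and print
# "<ip> <host>" lines suitable for /etc/hosts.
# Argument: the external IP address of the Ingress service.
hosts_entries() {
  # NF skips empty lines; $1 is the host name on each line.
  awk -v ip="$1" 'NF { print ip, $1 }'
}
```

For example, you could pipe the output of the command that lists the public endpoints into hosts_entries 10.172.1.101, review the result, and then append it to /etc/hosts with sudo tee -a /etc/hosts.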
          
Access your OpenStack environment

This section explains how to access your OpenStack environment as the admin user.

Before you proceed, make sure that you can access the Kubernetes API and that you either have privileges to read secrets from the openstack-external namespace in Kubernetes or are able to exec into the pods in the openstack namespace.

Access OpenStack using the Kubernetes built-in admin CLI

You can use the built-in admin CLI client and execute the openstack commands from a dedicated pod deployed in the openstack namespace:

kubectl -n openstack exec \
  $(kubectl -n openstack get pod -l application=keystone,component=client -ojsonpath='{.items[*].metadata.name}') \
  -ti -- bash

This pod has python-openstackclient and all required plugins already installed. The python-openstackclient command-line client is configured to use the admin user credentials. You can view the detailed configuration for the openstack command in the /etc/openstack/clouds.yaml file in the pod.

Access an OpenStack environment through Horizon
  1. Configure the external DNS resolution for OpenStack services as described in Configure DNS to access OpenStack.

  2. Obtain the admin user credentials from the openstack-identity-credentials secret in the openstack-external namespace:

    kubectl -n openstack-external get secrets openstack-identity-credentials -o jsonpath='{.data.clouds\.yaml}' | base64 -d
    

    Example of a system response:

    clouds:
      admin:
        auth:
          auth_url: https://keystone.it.just.works/
          password: <ADMIN_PWD>
          project_domain_name: <ADMIN_PROJECT_DOMAIN>
          project_name: <ADMIN_PROJECT>
          user_domain_name: <ADMIN_USER_DOMAIN>
          username: <ADMIN_USER_NAME>
        endpoint_type: public
        identity_api_version: 3
        interface: public
        region_name: CustomRegion
      admin-system:
        auth:
          auth_url: https://keystone.it.just.works/
          password: <ADMIN_PWD>
          system_scope: all
          user_domain_name: <ADMIN_USER_DOMAIN>
          username: <ADMIN_USER_NAME>
        endpoint_type: public
        identity_api_version: 3
        interface: public
        region_name: CustomRegion
    
  3. Access Horizon through your browser using its public service. For example, https://horizon.it.just.works.

    To log in, specify the user name and domain name obtained in the previous step from the <ADMIN_USER_NAME> and <ADMIN_USER_DOMAIN> values.

    If the OpenStack Identity service has been deployed with the OpenID Connect integration:

    1. From the Authenticate using drop-down menu, select OpenID Connect.

    2. Click Connect. You will be redirected to your identity provider to proceed with the authentication.

    Note

    If OpenStack has been deployed with self-signed TLS certificates for public endpoints, you may get a warning about an untrusted certificate. To proceed, allow the connection.

Access OpenStack through CLI from your local machine

To be able to access your OpenStack environment through the CLI, you need to configure the openstack client environment using either an openstackrc environment file or clouds.yaml file.

To use an openstackrc environment file:

  1. Log in to Horizon as described in Access an OpenStack environment through Horizon.

  2. Download the openstackrc file from the web UI.

  3. On any shell from which you want to run OpenStack commands, source the environment file for the respective project.

To use the clouds.yaml file:

  1. Obtain clouds.yaml:

    mkdir -p ~/.config/openstack
    kubectl -n openstack-external get secrets openstack-identity-credentials -o jsonpath='{.data.clouds\.yaml}' | base64 -d > ~/.config/openstack/clouds.yaml
    

    The OpenStack client looks for clouds.yaml in the following locations: current directory, ~/.config/openstack, and /etc/openstack.

  2. Export the OS_CLOUD environment variable:

    export OS_CLOUD=admin
    

Now, you can use the openstack CLI as usual. For example:

openstack user list

Example of an expected system response:

+----------------------------------+-----------------+
| ID                               | Name            |
+----------------------------------+-----------------+
| dc23d2d5ee3a4b8fae322e1299f7b3e6 | internal_cinder |
| 8d11133d6ef54349bd014681e2b56c7b | admin           |
+----------------------------------+-----------------+

Note

If OpenStack was deployed with self-signed TLS certificates for public endpoints, you may need to use the openstack command-line client with certificate validation disabled. For example:

openstack --insecure user list
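
Before exporting OS_CLOUD, it can help to confirm which cloud entries the downloaded clouds.yaml actually defines. The following sketch is an assumption-laden illustration: the list_clouds function is hypothetical, and it relies on the file using two-space indentation under the top-level clouds key, as in the example system response above.

```shell
# Hypothetical helper: print the cloud names defined in a clouds.yaml
# file supplied on stdin. Assumes two-space indentation.
list_clouds() {
  awk '
    /^clouds:/ { inside = 1; next }   # start of the clouds section
    /^[^ ]/    { inside = 0 }         # any other top-level key ends it
    inside && /^  [^ ]+:/ {           # keys indented by exactly two spaces
      name = $1
      sub(/:.*$/, "", name)
      print name
    }
  '
}
```

For example, list_clouds < ~/.config/openstack/clouds.yaml would print admin and admin-system for the example file, and either name can be used as the OS_CLOUD value.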
Troubleshoot an OpenStack deployment

This section provides general debugging instructions for your OpenStack on Kubernetes deployment. Start troubleshooting by identifying the failing component, which can be the OpenStack Controller (Rockoon), Helm, or a particular pod or service.

Debugging the Helm releases

OpenStack Controller (Rockoon)

Since MOSK 25.1, the OpenStack Controller has been open-sourced under the name Rockoon and is maintained as an independent open-source project going forward.

As part of this transition, all openstack-controller pods are named rockoon pods across the MOSK documentation and deployments. This change does not affect functionality. Use the new naming for pods and other related artifacts accordingly.

Note

MOSK uses direct communication with Helm 3.

Verify the Helm releases statuses
  1. Log in to the rockoon pod, where the Helm v3 client is installed, or download the Helm v3 binary locally:

    kubectl -n osh-system get pods | grep rockoon
    

    Example of a system response:

    rockoon-5c5965688b-knck5             9/9     Running   0          21h
    rockoon-admission-6795d594b4-rp7kf   1/1     Running   0          22h
    rockoon-exporter-6f66547b67-pcxhh    1/1     Running   0          22h
    
  2. Verify the Helm releases statuses:

    helm3 --namespace openstack list --all
    

    Example of a system response:

    NAME                            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
    etcd                            openstack       4               2021-07-09 11:06:25.377538008 +0000 UTC deployed        etcd-0.1.0-mcp-2735
    ingress-openstack               openstack       4               2021-07-09 11:06:24.892822083 +0000 UTC deployed        ingress-0.1.0-mcp-2735
    openstack-barbican              openstack       4               2021-07-09 11:06:25.733684392 +0000 UTC deployed        barbican-0.1.0-mcp-3890
    openstack-ceph-rgw              openstack       4               2021-07-09 11:06:25.045759981 +0000 UTC deployed        ceph-rgw-0.1.0-mcp-2735
    openstack-cinder                openstack       4               2021-07-09 11:06:42.702963544 +0000 UTC deployed        cinder-0.1.0-mcp-3890
    openstack-designate             openstack       4               2021-07-09 11:06:24.400555027 +0000 UTC deployed        designate-0.1.0-mcp-3890
    openstack-glance                openstack       4               2021-07-09 11:06:25.5916904 +0000 UTC deployed        glance-0.1.0-mcp-3890
    openstack-heat                  openstack       4               2021-07-09 11:06:25.3998706 +0000 UTC deployed        heat-0.1.0-mcp-3890
    openstack-horizon               openstack       4               2021-07-09 11:06:23.27538297 +0000 UTC deployed        horizon-0.1.0-mcp-3890
    openstack-iscsi                 openstack       4               2021-07-09 11:06:37.891858343 +0000 UTC deployed        iscsi-0.1.0-mcp-2735            v1.0.0
    openstack-keystone              openstack       4               2021-07-09 11:06:24.878052272 +0000 UTC deployed        keystone-0.1.0-mcp-3890
    openstack-libvirt               openstack       4               2021-07-09 11:06:38.185312907 +0000 UTC deployed        libvirt-0.1.0-mcp-2735
    openstack-mariadb               openstack       4               2021-07-09 11:06:24.912817378 +0000 UTC deployed        mariadb-0.1.0-mcp-2735
    openstack-memcached             openstack       4               2021-07-09 11:06:24.852840635 +0000 UTC deployed        memcached-0.1.0-mcp-2735
    openstack-neutron               openstack       4               2021-07-09 11:06:58.96398517 +0000 UTC deployed        neutron-0.1.0-mcp-3890
    openstack-neutron-rabbitmq      openstack       4               2021-07-09 11:06:51.454918432 +0000 UTC deployed        rabbitmq-0.1.0-mcp-2735
    openstack-nova                  openstack       4               2021-07-09 11:06:44.277976646 +0000 UTC deployed        nova-0.1.0-mcp-3890
    openstack-octavia               openstack       4               2021-07-09 11:06:24.775069513 +0000 UTC deployed        octavia-0.1.0-mcp-3890
    openstack-openvswitch           openstack       4               2021-07-09 11:06:55.271711021 +0000 UTC deployed        openvswitch-0.1.0-mcp-2735
    openstack-placement             openstack       4               2021-07-09 11:06:21.954550107 +0000 UTC deployed        placement-0.1.0-mcp-3890
    openstack-rabbitmq              openstack       4               2021-07-09 11:06:25.431404853 +0000 UTC deployed        rabbitmq-0.1.0-mcp-2735
    openstack-tempest               openstack       2               2021-07-09 11:06:21.330801212 +0000 UTC deployed        tempest-0.1.0-mcp-3890
    

    If a Helm release is not in the deployed state, obtain the details from the output of the following command:

    helm3 --namespace openstack history <release-name>
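
On a long release list, the releases that need attention can be filtered out of the helm3 list output. The sketch below is illustrative only: the non_deployed_releases function is hypothetical, and it assumes the column layout of the example system response above, where the UPDATED column spans four whitespace-separated fields so that STATUS is the eighth field.

```shell
# Hypothetical helper: read "helm3 list" output on stdin and print
# the name and status of every release that is not "deployed".
non_deployed_releases() {
  # NR > 1 skips the header line; $8 is the STATUS column.
  awk 'NR > 1 && $8 != "deployed" { print $1, $8 }'
}
```

For example: helm3 --namespace openstack list --all | non_deployed_releases. An empty result means that every listed release reports the deployed status.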
    
Verify the status of a Helm release

To verify the status of a Helm release:

helm3 --namespace openstack status <release-name>

Example of a system response:

NAME: openstack-memcached
LAST DEPLOYED: Fri Jul  9 11:06:24 2021
NAMESPACE: openstack
STATUS: deployed
REVISION: 4
TEST SUITE: None
Debugging the OpenStack Controller

The OpenStack Controller (Rockoon) is running in several containers in the rockoon-xxxx pod in the osh-system namespace. For the full list of containers and their roles, refer to OpenStack Controller (Rockoon).

To verify the status of the OpenStack Controller, run:

kubectl -n osh-system get pods

Example of a system response:

NAME                                 READY   STATUS    RESTARTS   AGE
rockoon-5c5965688b-knck5             9/9     Running   0          21h
rockoon-admission-6795d594b4-rp7kf   1/1     Running   0          22h
rockoon-exporter-6f66547b67-pcxhh    1/1     Running   0          22h

To view the logs of the osdpl container, run:

kubectl -n osh-system logs -f <rockoon-xxxx> -c osdpl
Debugging the OsDpl CR

This section describes how to mitigate the most common issues with the OsDpl CR. We assume that you have already debugged the Helm releases and OpenStack Controller to rule out possible failures with these components as described in Debugging the Helm releases and Debugging the OpenStack Controller.

The osdpl has DEPLOYED=false

Possible root cause: One or more Helm releases have not been deployed successfully.

To determine if you are affected:

Verify the status of the osdpl object:

kubectl -n openstack get osdpl osh-dev

Example of a system response:

NAME      AGE   DEPLOYED   DRAFT
osh-dev   22h   false      false

To debug the issue:

  1. Identify the failed release by assessing the status:children section in the OsDpl resource:

    1. Get the OsDpl YAML file:

      kubectl -n openstack get osdpl osh-dev -o yaml
      
    2. Analyze the status output using the detailed description in OpenStackDeploymentStatus custom resource.

  2. For further debugging, refer to Debugging the Helm releases.

Some pods are stuck in Init

Possible root cause: MOSK uses the Kubernetes entrypoint init container to resolve dependencies between objects. If the pod is stuck in Init:0/X, this pod may be waiting for its dependencies.

To debug the issue:

Verify the missing dependencies:

kubectl -n openstack logs -f placement-api-84669d79b5-49drw -c init

Example of a system response:

Entrypoint WARNING: 2020/04/21 11:52:50 entrypoint.go:72: Resolving dependency Job placement-ks-user in namespace openstack failed: Job Job placement-ks-user in namespace openstack is not completed yet .
Entrypoint WARNING: 2020/04/21 11:52:52 entrypoint.go:72: Resolving dependency Job placement-ks-endpoints in namespace openstack failed: Job Job placement-ks-endpoints in namespace openstack is not completed yet .
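
Because the init container repeats these warnings on every retry, it can be convenient to reduce the log to a unique list of pending dependencies. The following is a rough sketch that assumes the message format shown in the example system response; the unresolved_dependencies function name is hypothetical.

```shell
# Hypothetical helper: read entrypoint init container logs on stdin
# and print each unresolved dependency once, for example
# "Job placement-ks-user".
unresolved_dependencies() {
  sed -n 's/.*Resolving dependency \(.*\) in namespace \([^ ]*\) failed:.*/\1/p' | sort -u
}
```

For example: kubectl -n openstack logs <pod-name> -c init | unresolved_dependencies. You can then inspect the listed jobs or services to find out why they have not completed.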
Some Helm releases are not present

Possible root cause: Some OpenStack services depend on Ceph. These services include OpenStack Image, OpenStack Compute, and OpenStack Block Storage. If the Helm releases for these services are not present, the openstack-ceph-keys secret may be missing in the openstack-ceph-shared namespace.

To debug the issue:

Verify that the Ceph Controller has created the openstack-ceph-keys secret in the openstack-ceph-shared namespace:

kubectl -n openstack-ceph-shared get secrets openstack-ceph-keys

Example of a positive system response:

NAME                  TYPE     DATA   AGE
openstack-ceph-keys   Opaque   7      23h

If the secret is not present, create one manually.

Support dump

TechPreview

The support dump described in this section specifically targets OpenStack components, providing valuable insights for troubleshooting OpenStack-related problems.

To generate a support dump for your MOSK environment, use the osctl sos report tool present within the rockoon image.

This section focuses only on the essential capabilities of the tool. For all available parameters, consult osctl sos report --help.

Collectors

The support dump is modular. Each module is responsible for specific functionality. To enable or disable specific modules during support dump creation, use the --collector option. If not specified, all collectors are used.

Support dump collectors

Collector

Description

elastic

Collects logs from StackLight by connecting to the OpenSearch API.

k8s

Collects data about objects from Kubernetes.

nova

Collects metadata associated with the Compute service (OpenStack Nova) from the OpenStack nodes. This encompasses a wide range of data, including instance details, general libvirt information, and so on.

neutron

Collects metadata associated with the Networking service (OpenStack Neutron) from the OpenStack nodes. This encompasses a wide range of data, including Open vSwitch statistics, list of namespaces, IP address statistics in namespaces, Open vSwitch flows, and so on.

Components

Given the substantial amount of information, you can manage the components included in a support dump using the mutually exclusive --component or --all-components options. Within the elastic collector, you can specify which loggers to gather logs for. For example, the --component nova option restricts log collection to pods related to Nova, whose names start with nova-* and libvirt-*.

Hosts

Another filtering criterion is the host for which you intend to collect support information. To set it, use the mutually exclusive --host or --all-hosts options. This feature is particularly valuable for limiting the volume of data included in the support dump.

Modes

Support dump works in the following modes:

  • report

    Creates a generic report without filtering by specific information such as a resource UUID. The tool collects as much information as possible.

  • trace

    Provides more sophisticated filtering criteria than the report mode. For example, you can search for specific message patterns in OpenSearch.

Usage

Since MOSK 23.2, you can execute the osctl sos commands directly in the rockoon pod. For example:

kubectl -n osh-system exec -it deployment/rockoon -- bash

osctl sos --since 1d \
          --all-hosts \
          --component neutron \
          --collector elastic \
          --workspace /tmp/ report

To get trace for a specific resource UUID in Neutron for a specific host, use the following command as an example:

kubectl -n osh-system exec -it deployment/rockoon -- bash

osctl sos --since 1d \
          --host kaas-node-fe0734de-20e8-4493-9f7d-52c4f8a8a98c \
          --component neutron \
          --workspace /workspace/ \
          --collector elastic trace --message ".4a055675-89b0-45c2-a3b3-a10dffa07f31."

For older MOSK versions, to start generating support dumps, execute the osctl sos commands from a manually started Docker container on any node of your cluster. For example, to create a generic report for the Neutron component:

docker run -v /home/ubuntu/sosreport/:/workspace -v /root/.kube/config:/root/.kube/config -it mirantis.azurecr.io/openstack/rockoon:1.0.1 bash

osctl sos --since 1d \
          --elastic-url http://172.16.37.11:9200 \
          --all-hosts \
          --component neutron \
          --collector elastic \
          --workspace /workspace/ report

Note

172.16.37.11 is the IP address of the opensearch-master StackLight service. To obtain it, run:

kubectl -n stacklight get svc opensearch-master -o jsonpath='{.spec.clusterIP}'

Deploy Tungsten Fabric

This section describes how to deploy Tungsten Fabric as a backend for networking for your MOSK environment.

Caution

Before you proceed with the Tungsten Fabric deployment, read through Tungsten Fabric known limitations.

Tungsten Fabric deployment prerequisites

Before you proceed with the actual Tungsten Fabric (TF) deployment, verify that your deployment meets the following prerequisites:

  1. Your MOSK OpenStack cluster is deployed as described in Deploy an OpenStack cluster with the Tungsten Fabric backend enabled for Neutron using the following structure:

    spec:
      features:
        neutron:
          backend: tungstenfabric
    
  2. Your MOSK OpenStack cluster uses the correct value of features:neutron:tunnel_interface in the openstackdeployment object. The TF Operator will consume this value through the shared secret and use it as a network interface from the underlay network to create encapsulated tunnels with the tenant networks.

    Considerations for tunnel_interface

    • Plan this interface as a dedicated physical interface for TF overlay networks. TF uses features:neutron:tunnel_interface to create the vhost0 virtual interface and transfers the IP configuration from the tunnel_interface to the virtual one.

    • Do not use bridges from L2 templates as tunnel_interface. Such usage might lead to networking performance degradation and data plane downtime.

  3. The Kubernetes nodes are labeled according to the TF node roles:

    Tungsten Fabric (TF) node roles

    Node role

    Description

    Kubernetes labels

    Minimal count

    TF control plane

    Hosts the TF control plane services such as database, messaging, api, svc, config.

    tfconfig=enabled
    tfcontrol=enabled
    tfwebui=enabled
    tfconfigdb=enabled

    3

    TF analytics

    Hosts the TF analytics services.

    tfanalytics=enabled
    tfanalyticsdb=enabled

    3

    TF vRouter

    Hosts the TF vRouter module and vRouter Agent.

    tfvrouter=enabled

    Varies

    TF vRouter DPDK Technical Preview

    Hosts the TF vRouter Agent in DPDK mode.

    tfvrouter-dpdk=enabled

    Varies

    Note

    TF supports only Kubernetes OpenStack workloads. Therefore, you should label OpenStack compute nodes with the tfvrouter=enabled label.

    Note

    Do not specify the openvswitch=enabled label for the OpenStack deployments with TF as a networking backend.
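
The label sets from the TF node roles table above can be applied with kubectl label node. The following sketch only prints the commands so that you can review them first. The tf_label_commands function and the short role names (control, analytics, vrouter, vrouter-dpdk) are hypothetical shorthand introduced for this illustration; the labels themselves are taken from the table.

```shell
# Hypothetical helper: print "kubectl label node" commands for a TF role.
# Arguments: <role> <node-name>...
tf_label_commands() {
  role="$1"
  shift
  case "$role" in
    control)      labels="tfconfig=enabled tfcontrol=enabled tfwebui=enabled tfconfigdb=enabled" ;;
    analytics)    labels="tfanalytics=enabled tfanalyticsdb=enabled" ;;
    vrouter)      labels="tfvrouter=enabled" ;;
    vrouter-dpdk) labels="tfvrouter-dpdk=enabled" ;;
    *) echo "unknown role: $role" >&2; return 1 ;;
  esac
  # Print one command per node instead of executing it.
  for node in "$@"; do
    echo "kubectl label node $node $labels"
  done
}
```

For example, tf_label_commands vrouter <compute-node-name> prints the command that labels a compute node for the TF vRouter role. Run the printed commands only after you have verified the node names.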

Deploy Tungsten Fabric

Deployment of Tungsten Fabric is managed by the tungstenfabric-operator Helm resource in a respective ClusterRelease.

To deploy Tungsten Fabric:

  1. Optional. Configure the ASN and encapsulation settings if you need custom values for these parameters. For configuration details, see Autonomous System Number (ASN).

  2. Verify that you have completed all prerequisite steps as described in Tungsten Fabric deployment prerequisites.

  3. Create the tungstenfabric.yaml file with the Tungsten Fabric resource configuration. For example:

    apiVersion: tf.mirantis.com/v2
    kind: TFOperator
    metadata:
      name: openstack-tf
      namespace: tf
    spec:
      dataStorageClass: tungstenfabric-operator-bind-mounts
    
  4. Configure the TFOperator custom resource according to the needs of your deployment. For the configuration details, refer to TFOperator custom resource and Tungsten Fabric Operator resources.

  5. Trigger the Tungsten Fabric deployment:

    kubectl apply -f tungstenfabric.yaml
    
  6. Verify that Tungsten Fabric has been successfully deployed:

    kubectl get pods -n tf
    

    The successfully deployed TF services should appear in the Running status in the system response.

  7. If you have enabled StackLight, enable Tungsten Fabric monitoring by setting tungstenFabricMonitoring.enabled to true as described in StackLight configuration procedure.

    Since MOSK 23.1, tungstenFabricMonitoring.enabled is set to true by default during the Tungsten Fabric deployment. Therefore, skip this step on MOSK 23.1 and later.
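
To spot problems quickly in the verification step above, you can filter the kubectl get pods output for pods that are not in the Running status. This is a minimal sketch with a hypothetical pods_not_running function; it assumes the default kubectl column layout, where STATUS is the third column.

```shell
# Hypothetical helper: read "kubectl get pods" output on stdin and
# print the name and status of every pod that is not Running.
pods_not_running() {
  # NR > 1 skips the header line; $3 is the STATUS column.
  awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
}
```

For example: kubectl get pods -n tf | pods_not_running. Note that pods of finished jobs, which report the Completed status, are also listed by this filter.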

Advanced Tungsten Fabric configuration (optional)

This section includes configuration information for available advanced Mirantis OpenStack for Kubernetes features that include SR-IOV and DPDK with the Neutron Tungsten Fabric backend.

Enable huge pages for OpenStack with Tungsten Fabric

Note

The instructions provided in this section apply to both OpenStack with OVS and OpenStack with Tungsten Fabric topologies.

The huge pages OpenStack feature provides essential performance improvements for applications that are highly memory IO-bound. Huge pages should be enabled on a per compute node basis. By default, NUMATopologyFilter is enabled.

To activate the feature, you need to enable huge pages on the dedicated bare metal host as described in Enable huge pages in a host profile during the predeployment bare metal configuration.

Note

The multi-size huge pages are not fully supported by Kubernetes versions before 1.19. Therefore, define only one size in kernel parameters.

Enable DPDK for Tungsten Fabric

TechPreview

This section describes how to enable DPDK mode for the Tungsten Fabric (TF) vRouter.

To enable DPDK for TF, follow one of the procedures below depending on the API version used:

If your TF Operator custom resource uses the tf.mirantis.com/v2 API:

  1. Install the vfio-pci (recommended) or uio_pci_generic driver on the host operating system. For more information about drivers, see Linux Drivers.

    An example of the DPDK_UIO_DRIVER configuration:

    spec:
      services:
        vRouter:
          agentDPDK:
            enabled: true
            envSettings:
              dpdk:
              - name: DPDK_UIO_DRIVER
                value: vfio-pci
    
  2. Open the TF Operator custom resource for editing:

    kubectl -n tf edit tfoperators.tf.mirantis.com openstack-tf
    
  3. Enable DPDK:

    spec:
      services:
        vRouter:
          agentDPDK:
            enabled: true
    
If the custom resource belongs to the operator.tf.mirantis.com API group:

  1. Install the vfio-pci (recommended) or uio_pci_generic driver on the host operating system. For more information about drivers, see Linux Drivers.

    An example of the DPDK_UIO_DRIVER configuration:

    spec:
      controllers:
        tf-vrouter:
          agent-dpdk:
            enabled: true
            containers:
            - name: dpdk
              env:
              - name: DPDK_UIO_DRIVER
                value: vfio-pci
    
  2. Verify that DPDK NICs are not used on the host operating system.

    Note

    For use in the Linux user space, DPDK NICs are bound to specific Linux drivers required by PMDs. Bound NICs are therefore not available to standard Linux network utilities. Allocate dedicated NICs for the vRouter deployment in DPDK mode.

  3. Enable huge pages on the host as described in Enable huge pages in a host profile.

  4. Mark the hosts for deployment with DPDK with the tfvrouter-dpdk=enabled label.

  5. Open the TF Operator custom resource for editing:

    kubectl -n tf edit tfoperators.operator.tf.mirantis.com openstack-tf
    
  6. Enable DPDK:

    spec:
      controllers:
        tf-vrouter:
          agent-dpdk:
            enabled: true
    
Enable SR-IOV for Tungsten Fabric

This section instructs you on how to enable SR-IOV with the Neutron Tungsten Fabric (TF) backend.

To enable SR-IOV for TF:

  1. Verify that your deployment meets the following requirements:

    • NICs with the SR-IOV support are installed

    • SR-IOV and VT-d are enabled in BIOS

  2. Enable IOMMU in the kernel by configuring intel_iommu=on in the GRUB configuration file. Specify the parameter for compute nodes in BareMetalHostProfile in the grubConfig section:

    spec:
      grubConfig:
        defaultGrubOptions:
          - 'GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX intel_iommu=on"'
    
  3. Enable SR-IOV in the OpenStackDeployment CR through the node-specific overrides settings. For example:

    spec:
      nodes:
        <NODE-LABEL>::<NODE-LABEL-VALUE>:
          features:
            neutron:
              sriov:
                enabled: true
                nics:
                - device: enp10s0f1
                  num_vfs: 7
                  physnet: tenant
    

    Warning

    After the OpenStackDeployment CR modification, the TF Operator generates a separate vRouter DaemonSet with specified settings. The tf-vrouter-agent-<XXXXX> pods will be automatically restarted on the affected nodes causing the network services interruption on virtual machines running on these hosts.

  4. Optional. To modify a vRouter DaemonSet according to the SR-IOV definition in the OpenStackDeployment CR, add vRouter custom specs to the TF Operator CR with the node label specified in the OpenStackDeployment CR. Use the structure that matches the API group of your TFOperator resource.

    Example for the tf.mirantis.com API group:

    spec:
      nodes:
        sriov:
          labels:
            name: <NODE-LABEL>
            value: <NODE-LABEL-VALUE>
          nodeVRouter:
            enabled: true
            envSettings:
              agent:
              - name: VROUTER_GATEWAY
                value: <VROUTER-GATEWAY-IP>

    Example for the operator.tf.mirantis.com API group:

    spec:
      controllers:
        tf-vrouter:
          agent:
            customSpecs:
            - name: sriov
              label:
                name: <NODE-LABEL>
                value: <NODE-LABEL-VALUE>
              containers:
              - name: agent
                env:
                - name: VROUTER_GATEWAY
                  value: <VROUTER-GATEWAY-IP>
    
Configure multiple Contrail API workers

TechPreview

Tungsten Fabric MOSK deployments use six workers of the contrail-api service by default. This section instructs you on how to change the default configuration if needed.

To configure the number of Contrail API workers on a TF deployment:

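  1. Specify the required number of workers in the TFOperator custom resource, using the structure that matches the API group of your resource.

    Example for the tf.mirantis.com API group:

    spec:
      features:
        config:
          configApiWorkerCount: 7

    Example for the operator.tf.mirantis.com API group:

    spec:
      controllers:
        tf-config:
          api:
            containers:
            - name: api
              env:
              - name: CONFIG_API_WORKER_COUNT
                value: "7"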
    
  2. Wait until all tf-config-* pods are restarted.

  3. Verify the number of workers inside the running API container:

    kubectl -n tf exec -ti tf-config-rclzq -c api -- ps aux --width 500
    kubectl -n tf exec -ti tf-config-rclzq -c api -- ls /etc/contrail/
    

    Verify that the ps output lists one API process with PID "1" and the number of workers set in the TFOperator custom resource.

  4. In /etc/contrail/, verify that the number of configuration files contrail-api-X.conf matches the number of workers set in the TFOperator custom resource.
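
The file count check from step 4 can be sketched as a small helper. It assumes that you capture the plain-text output of the ls /etc/contrail/ command and that X in contrail-api-X.conf is a number, which the procedure does not state explicitly. Both function names are hypothetical.

```python
import re
from typing import Iterable

# Assumption: worker configuration files are named contrail-api-<number>.conf.
WORKER_CONF = re.compile(r"contrail-api-\d+\.conf")


def count_api_worker_configs(file_names: Iterable[str]) -> int:
    """Count the file names that look like per-worker configuration files."""
    return sum(1 for name in file_names if WORKER_CONF.fullmatch(name.strip()))


def worker_count_matches(ls_output: str, expected_workers: int) -> bool:
    """Compare the number of worker files in `ls` output with the expected count."""
    # split() handles both one-name-per-line and multi-column listings.
    return count_api_worker_configs(ls_output.split()) == expected_workers
```

For the example above, where the custom resource requests 7 workers, worker_count_matches(ls_output, 7) should return True once all tf-config-* pods are restarted.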

Disable Tungsten Fabric analytics services

Available since MOSK 23.3 TechPreview

By default, analytics services are part of basic setups for Tungsten Fabric deployments. To obtain a more lightweight setup, you can disable these services through the custom resource of the Tungsten Fabric Operator.

Warning

Disabling the Tungsten Fabric analytics services requires a restart of the data plane services in existing environments and must be planned in advance. When calculating the maintenance window for this operation, take into account the deletion of the analytics DaemonSets and the automatic restart of the tf-config, tf-control, and tf-webui pods.

To disable Tungsten Fabric analytics services:

  1. Open the TFOperator custom resource for editing, using the command that matches the API group of your resource:

    kubectl -n tf edit tfoperators.tf.mirantis.com openstack-tf

    or

    kubectl -n tf edit tfoperators.operator.tf.mirantis.com openstack-tf

  2. Disable Tungsten Fabric analytics services in the TFOperator custom resource.

    For the tf.mirantis.com API group:

    spec:
      services:
        analytics:
          enabled: false

    For the operator.tf.mirantis.com API group:

    spec:
      settings:
        disableTFAnalytics: true
    
  3. Clean up the Kubernetes resources. To free up the space that has been used by Cassandra, ZooKeeper, and Kafka analytics storage, manually delete the related PVC:

    kubectl -n tf delete pvc -l app=cassandracluster,cassandracluster=tf-cassandra-analytics
    kubectl -n tf delete pvc -l app=tf-zookeeper-nal
    kubectl -n tf delete pvc -l app=tf-kafka
    
  4. Remove the tfanalytics=enabled and tfanalyticsdb=enabled labels from nodes, as they are not required by the Tungsten Fabric Operator anymore.

  5. Manually restart the vRouter pods:

    Note

    To avoid network disruption, restart the vRouter pods in chunks.

    kubectl -n tf get pod -l app=tf-vrouter-agent
    kubectl -n tf delete pod <POD_NAME>
    
  6. Delete terminated nodes from the Tungsten Fabric configuration through the Tungsten Fabric web UI:

    Caution

    With disabled Tungsten Fabric analytics, the Tungsten Fabric web UI may not work properly.

    1. Log in to the Tungsten Fabric web UI.

    2. On Configure > Infrastructure > Nodes > Analytics Nodes, delete all terminated analytics nodes.

    3. On Configure > Infrastructure > Nodes > Database Analytics Nodes, delete all terminated database analytics nodes.

  7. Depending on the MOSK version, proceed with one of the following options:

    If your MOSK version does not configure the setting automatically, disable monitoring of the Tungsten Fabric analytics services in StackLight by setting the following parameter in the StackLight values of the Cluster object to false:

    tungstenFabricMonitoring:
      analyticsEnabled: false

    When done, monitoring of the Tungsten Fabric analytics components becomes disabled, and Kafka alerts along with the Kafka dashboard disappear from StackLight.

    If your MOSK version configures the tungstenFabricMonitoring.analyticsEnabled setting automatically, the setting follows the state of the Tungsten Fabric analytics services, enabled or disabled.

    However, you can still override this setting. If set manually, the configuration overrides the default behavior and does not reflect the actual state of Tungsten Fabric analytics.

Now, with the Tungsten Fabric analytics services successfully disabled, you have optimized resource utilization and system performance. While these services are deactivated, related alerts may still be present in StackLight. However, do not consider such alerts as indicative of the actual status of the analytics services.
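
The chunked restart of the vRouter pods from step 5 can be planned with a short helper. This is a sketch only: the chunked function name and the chunk size are hypothetical, and the procedure does not prescribe a particular chunk size. The helper takes the pod names returned by kubectl -n tf get pod -l app=tf-vrouter-agent and groups them so that you can delete one group at a time.

```python
from typing import List, Sequence


def chunked(pod_names: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split pod names into consecutive groups of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        list(pod_names[start:start + chunk_size])
        for start in range(0, len(pod_names), chunk_size)
    ]
```

A smaller chunk size keeps more vRouter agents available at any moment but lengthens the overall procedure.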

Troubleshoot the Tungsten Fabric deployment

This section provides the general debugging instructions for your Tungsten Fabric (TF) on Kubernetes deployment.

Enable debug logs for the Tungsten Fabric services

To enable debug logging for the Tungsten Fabric (TF) services:

If the custom resource belongs to the tf.mirantis.com API group:

  1. Open the TF custom resource for modification:

    kubectl -n tf edit tfoperators.tf.mirantis.com openstack-tf

  2. Set the logLevel variable to the SYS_DEBUG value for the required TF service. For example, for the config-api service:

    spec:
      services:
        config:
          tf:
            logging:
              api:
                logLevel: SYS_DEBUG
    
If the custom resource belongs to the operator.tf.mirantis.com API group:

  1. Open the TF custom resource for modification:

    kubectl -n tf edit tfoperators.operator.tf.mirantis.com openstack-tf

  2. Set the LOG_LEVEL variable to the SYS_DEBUG value for the required TF service. For example, for the config-api service:

    spec:
      controllers:
        tf-config:
          api:
            containers:
            - name: api
              env:
              - name: LOG_LEVEL
                value: SYS_DEBUG
    

Warning

After the TF custom resource modification, the pods related to the affected services will be restarted. This rule does not apply to the tf-vrouter-agent-<XXXXX> pods as their update strategy differs. Therefore, if you enable the debug logging for the services in a tf-vrouter-agent-<XXXXX> pod, restart this pod manually after you modify the custom resource.

Troubleshoot access to the Tungsten Fabric web UI

If you cannot access the Tungsten Fabric (TF) web UI service, verify that the FQDN of the TF web UI is resolvable on your PC by running one of the following commands:

host tf-webui.it.just.works
# or
ping tf-webui.it.just.works
# or
dig tf-webui.it.just.works

Each of the commands above should resolve the web UI domain name to an IP address that belongs to the EXTERNAL-IPs subnet dedicated to Kubernetes.
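
To confirm that a resolved address falls into the expected subnet, you can use a check like the sketch below. It relies only on the Python standard library. The function names are hypothetical, and the addresses in the usage note are placeholders, so substitute the EXTERNAL-IPs subnet of your cluster.

```python
import ipaddress
import socket


def resolve(fqdn: str) -> str:
    """Resolve a host name to an IPv4 address using the local resolver."""
    return socket.gethostbyname(fqdn)


def ip_in_subnet(ip: str, cidr: str) -> bool:
    """Return True if the IP address belongs to the subnet in CIDR notation."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=False)
```

For example, ip_in_subnet(resolve("tf-webui.it.just.works"), "10.0.0.0/24") returns True only if the name resolves into the placeholder subnet 10.0.0.0/24.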

If the TF web UI domain name has not been resolved to the IP address, your PC is using a different DNS or the DNS does not contain the record for the TF web UI service. To resolve the issue, define the IP address of the Ingress service from the openstack namespace of Kubernetes in the hosts file of your machine. To obtain the Ingress IP address:

kubectl -n openstack get svc ingress -o custom-columns='HOSTS:.status.loadBalancer.ingress[*].ip'

If the web UI domain name is resolvable but you still cannot access the service, verify the connectivity to the cluster.

Disable TX offloading on NICs used by vRouter

In the following cases, a TCP-based service may not work on VMs:

  • If the setup has nested VMs.

  • If VMs are running in the ESXi hypervisor.

  • If the Network Interface Cards (NICs) do not support the IP checksum calculation and generate an incorrect checksum. For example, the Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe NICs.

To resolve the issue, disable the transmit (TX) offloading on all OpenStack compute nodes for the affected NIC used by the vRouter as described below.

To identify the issue:

  1. Verify whether ping is working between VMs on different hypervisor hosts and the TCP services are working.

  2. Run the following command for the vRouter Agent and verify whether the output includes the number of Checksum errors:

    kubectl -n tf exec tf-vrouter-agent-XXXXX -c agent -- dropstats
    
  3. Run the following command and verify if the output includes the cksum incorrect entries:

    kubectl -n tf exec tf-vrouter-agent-XXXXX -c agent -- tcpdump -i <tunnel interface> -v -nn | grep -i incorrect
    

    Example of system response:

    tcpdump: listening on <tunnel interface>, link-type EN10MB (Ethernet), capture size 262144 bytes
    <src ip.port> > <dst ip.port>: Flags [S.], cksum 0x43bf (incorrect -> 0xb8dc), \
    seq 1901889431, ack 1081063811, win 28960, options [mss 1420,sackOK,\
    TS val 456361578 ecr 41455995,nop,wscale 7], length 0
    <src ip.port> > <dst ip.port>: Flags [S.], cksum 0x43bf (incorrect -> 0xb8dc), \
    seq 1901889183, ack 1081063811, win 28960, options [mss 1420,sackOK,\
    TS val 456361826 ecr 41455995,nop,wscale 7], length 0
    <src ip.port> > <dst ip.port>: Flags [S.], cksum 0x43bf (incorrect -> 0xb8dc), \
    seq 1901888933, ack 1081063811, win 28960, options [mss 1420,sackOK,\
    TS val 456362076 ecr 41455995,nop,wscale 7], length 0
    
  4. Run the following command for the vRouter Agent container and verify whether the output includes the information about a drop for an unknown reason:

    kubectl -n tf exec tf-vrouter-agent-XXXXX -c agent -- flow -l
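
The checksum check from step 3 can be automated when you save the tcpdump output to a file. The sketch below counts the packets that tcpdump marks with an incorrect checksum, based on the line format shown in the example system response. The function name is hypothetical.

```python
import re

# Matches fragments such as: cksum 0x43bf (incorrect -> 0xb8dc)
INCORRECT_CKSUM = re.compile(
    r"cksum 0x[0-9a-fA-F]+ \(incorrect -> 0x[0-9a-fA-F]+\)"
)


def count_incorrect_checksums(tcpdump_output: str) -> int:
    """Count the packets reported by tcpdump with an incorrect checksum."""
    return len(INCORRECT_CKSUM.findall(tcpdump_output))
```

A non-zero result on the tunnel interface is one of the symptoms described in this section. Treat it together with the dropstats and flow output rather than on its own.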
    

To disable the TX offloading on NICs used by vRouter:

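  1. Open the TFOperator custom resource (CR) for editing, using the command that matches the API group of your resource:

    kubectl -n tf edit tfoperators.tf.mirantis.com openstack-tf

    or

    kubectl -n tf edit tfoperators.operator.tf.mirantis.com openstack-tf

  2. Disable TX offloading for the vRouter Agent, using the structure that matches the API group of your resource.

    For the tf.mirantis.com API group, set the disableTXOffload feature to true:

    spec:
      features:
        vRouter:
          disableTXOffload: true

    For the operator.tf.mirantis.com API group, specify the DISABLE_TX_OFFLOAD variable with the "YES" value for the vRouter Agent container:

    spec:
      controllers:
        tf-vrouter:
          agent:
            containers:
            - name: agent
              env:
              - name: DISABLE_TX_OFFLOAD
                value: "YES"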
    

    Warning

    Once you modify the TFOperator CR, the tf-vrouter-agent-<XXXXX> pods will not restart automatically because they use the OnDelete update strategy. Restart such pods manually, keeping in mind that restarting the vRouter pods interrupts network services for the VMs hosted on the affected nodes.

  3. To disable TX offloading on a specific subset of nodes, use custom vRouter settings. For details, see Custom vRouter settings.

    Warning

    Once you add a new CustomSpec, a new daemon set will be generated and the tf-vrouter-agent-<XXXXX> pods will be automatically restarted. Restarting the vRouter pods interrupts network services for the VMs hosted on the affected node. Therefore, plan this procedure accordingly.

Operations Guide

This guide outlines the post-deployment Day-2 operations for a Mirantis OpenStack for Kubernetes environment. It describes how to configure and manage the MOSK components, perform different types of cloud verification, and enable additional features depending on your cloud needs. The guide also contains day-to-day maintenance procedures such as how to back up and restore, update and upgrade, or troubleshoot your MOSK cluster.

Cluster update

Updating a MOSK cluster ensures that the system remains secure, efficient, and up-to-date with the latest features and performance improvements, as well as receives fixes for the known CVEs. This section provides comprehensive details and step-by-step procedures to guide you through the process of updating your cluster.

Update to a major version

This section describes the workflow you as a cloud operator need to follow to correctly update your Mirantis OpenStack for Kubernetes (MOSK) cluster to a major release version.

Note

This guide applies to clusters running MOSK 23.1 and later. If you run an older version and plan to update, contact Mirantis support to get instructions valid for your cluster.

The instructions below are generic and apply to any MOSK cluster regardless of its configuration specifics. However, every major release may have its own update peculiarities. Therefore, to accurately plan and successfully perform an update, in addition to this document, read the update-related section in the Release Notes of the target MOSK version.

Depending on the payload of a target release, the update mechanism can perform the changes on different levels of the stack, from the configuration of the host operating system to the code of OpenStack itself. The update mechanism is designed to avoid the impact on the workloads and cloud users as much as possible. The life-cycle management logic minimizes the downtime for the cloud API by means of smart management of the cluster components under the hood and only requests your involvement when a human decision is required to proceed.

Though the update mechanism may change the internal components of the cluster, it will always preserve the major versions of OpenStack, that is, the APIs that cloud users and workloads deal with. After the cluster is successfully updated, you can initiate a separate upgrade procedure to obtain the latest supported OpenStack version.

Before you begin

Before starting an update, we recommend that you closely study the Release Compatibility Matrix document and Release Notes of the target release, as well as thoroughly plan maintenance windows for each update phase depending on the configuration of your cluster.

Read the release notes

Carefully read the Release Compatibility Matrix and Release Notes of the target MOSK version, paying particular attention to the following:

  • Current Mirantis Container Cloud software version and the need to first update to the latest cluster release version

  • Update notes provided in the Release notes for the target MOSK version

  • New product features that will get enabled in your cloud by default

  • New product features that may have already been configured in your cloud as customizations and now need to be properly re-enabled to be eligible for further support

  • Any changes in the behavior of the product features enabled in your cloud

  • List of the addressed and known issues in the target MOSK version

Warning

If your cloud is known to have any custom configuration that was not explicitly approved by Mirantis, make sure to bring this up with your dedicated Mirantis representative before proceeding with the update. Mirantis cannot guarantee the safe updating of a customized cloud.

Plan the cluster update

Depending on the payload brought by a particular target release, a generic cluster update includes from three to six major phases.

The first three phases are present in any update. They focus on the containerized components of the software stack and have minimal impact on the cloud users and workloads.

The remaining phases are only present if any changes need to be made to the foundation layers: the underlay Kubernetes cluster and host operating system. For the changes to take effect, you may need to reboot the cluster nodes. This procedure imposes a severe impact on cloud workloads and, therefore, needs to be thoroughly planned across several sequential maintenance windows.

Important

To effectively plan a cluster update, keep in mind the architecture of your specific cloud. Depending on the selected design, the components of a MOSK cluster may have different distribution across the nodes (physical servers) comprising the underlay bare metal Kubernetes cluster. The more components are collocated on a single node, the greater the impact on the functions of the cloud when the changes are applied.

The tables below will help you to plan your cluster update and include the following information for each mandatory and additional update phase:

  • What happens during the phase

    Includes the phase milestones. The nature of changes that are going to be applied is important to understand in order to estimate the exact impact the update is going to have on your cluster.

    Consult the Update notes section of the target MOSK release for detailed information about the changes it brings and the impact these changes will have when applied to your cluster.

  • Impact

    Describes possible impact on cloud users and workloads.

    The provided information about the impact represents the worst-case scenario in the cluster architectures that imply a combination of several roles on the same physical servers, such as hyper-converged compute nodes and clusters with a compact control plane.

    The impact estimation presumes that your cluster uses one of the standard architectures provided by the product and follows Mirantis design guidelines.

  • Time to complete

    Provides a rough estimation of the time required to complete the phase.

    The estimates for a phase timeline presume that your cluster uses one of the standard architectures provided by the product and follows Mirantis design guidelines.

Warning

During the update, try to prevent users from performing write operations on the cloud resources. Any intensive manipulations may lead to workload corruption.

Phase 1: Life-cycle management modules update

Important

This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.

Life-cycle management modules update

What happens during the phase

New versions of OpenStack, Tungsten Fabric, and Ceph controllers downloaded and installed. OpenStack and Tungsten Fabric images precached.

Impact

None

Time to complete

Depending on the quality of the Internet connectivity, up to 45 minutes.

Phase 2: OpenStack and Tungsten Fabric components update

Important

This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.

OpenStack and Tungsten Fabric components update

What happens during the phase

New versions of OpenStack and Tungsten Fabric container images downloaded, services restarted sequentially.

Impact

  • Some of the running cloud operations may fail over the course of the phase due to minor unavailability of the cloud API.

  • Workloads may experience temporary loss of the North-South connectivity in the clusters with Open vSwitch networking backend. The downtime depends on the type of virtual routers in use.

Time to complete

  • 20 minutes per network gateway node (Open vSwitch)

  • 5 minutes for a Tungsten Fabric cluster

  • 15 minutes per compute node

Phase 3: Ceph cluster update and upgrade

Important

This phase is mandatory. It is always present in the update flow regardless of the contents of the target release.

Ceph cluster update and upgrade

What happens during the phase

New versions of Ceph components downloaded, services restarted. If applicable, Ceph switched to the latest major version.

Impact

Workloads may experience IO performance degradation for the virtual storage devices backed by Ceph.

Time to complete

The update of a Ceph cluster with 30 storage nodes can take up to 35 minutes. Additionally, 15 minutes are required for the major Ceph version upgrade, if any.

Phase 4a: Host operating system update on Kubernetes master nodes

Important

This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.

Host operating system update on Kubernetes master nodes

What happens during the phase

New system packages downloaded and installed on the host operating system, other major changes get applied.

Impact

None

Time to complete

The nodes are updated sequentially. Up to 15 minutes per node.

Phase 4b: Kubernetes components update on Kubernetes master nodes

Important

This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.

Kubernetes cluster update on Kubernetes master nodes

What happens during the phase

New versions of Kubernetes control plane components downloaded and installed.

Impact

For clusters with the compact control plane, some of the running cloud operations may fail over the course of the phase due to minor unavailability of the cloud API.

For the compact control plane with gateway nodes collocated (Open vSwitch networking backend), workloads can experience temporary loss of the North-South connectivity. The downtime depends on the type of virtual routers in use.

Time to complete

Up to 40 minutes total

Phases 5a and 5b: Host operating system and Kubernetes cluster update on Kubernetes worker nodes

Important

Both phases, 5a and 5b, are applied together, either node by node (default) or to several nodes in parallel. The parallel updating is available since 23.1.

Take this into consideration when estimating the impact and planning the maintenance window.

Host operating system and Kubernetes cluster

What happens during the phase

During the host operating system update:

  • New packages for host operating system downloaded and installed, including kernel, and other system components.

  • Any other major configuration changes get applied.

  • Nodes rebooted manually. The cloud operator can postpone the reboot to a later maintenance window.

During the Kubernetes cluster update:

  • New versions of Kubernetes control plane components, including container runtime, downloaded and installed

  • Containers get restarted

Impact

  • For the storage nodes:

    • Minor impact on Ceph cluster availability, depending on the number of storage nodes getting updated in parallel. See Enable parallel update of Kubernetes worker nodes

    • Loss of connectivity to the volumes for the nodes hosting LVM with iSCSI volumes.

  • For dedicated control plane nodes, some of the running cloud operations may fail over the course of the phase due to minor unavailability of the cloud API.

  • For dedicated gateway nodes (Open vSwitch), workloads can experience minor loss of the North-South connectivity.

  • For compute nodes, there can be up to 5 minutes of downtime on the network connectivity for the workloads running on the node, due to the restart of the containers hosting the components of the cloud data plane.

    For clusters running MOSK 24.1.2 and above, the downtime is up to 2 minutes per node.

Time to complete

By default, the nodes are updated sequentially as follows:

  • For the host operating system update, up to 15 minutes per node.

  • For the Kubernetes cluster update, up to 40 minutes per node.

Starting from the MOSK 23.1 to 23.2 update, you can reduce the update time by enabling parallel node update. The procedure is described further in the Enable parallel update of Kubernetes worker nodes subsection.
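
The per-node figures above translate into a rough worst-case estimate for phases 5a and 5b. The sketch below is illustrative arithmetic only. It assumes that the nodes of one parallel batch are updated simultaneously and that a batch takes as long as a single node, which is a simplification and not a product guarantee. The function name and the max_parallel parameter, which stands for the number of worker nodes updated in parallel, are hypothetical.

```python
import math

OS_UPDATE_MINUTES_PER_NODE = 15   # "up to 15 minutes per node"
K8S_UPDATE_MINUTES_PER_NODE = 40  # "up to 40 minutes per node"


def worker_update_minutes(node_count: int, max_parallel: int = 1) -> int:
    """Estimate the worst-case duration of phases 5a and 5b in minutes."""
    if node_count < 0 or max_parallel < 1:
        raise ValueError("node_count must be >= 0 and max_parallel >= 1")
    # Assumption: each batch of up to max_parallel nodes takes one node's time.
    batches = math.ceil(node_count / max_parallel)
    return batches * (OS_UPDATE_MINUTES_PER_NODE + K8S_UPDATE_MINUTES_PER_NODE)
```

Under these assumptions, 10 worker nodes updated sequentially take up to 550 minutes, while updating 3 nodes in parallel reduces the estimate to 220 minutes.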

Phase 6: Cluster nodes reboot

Important

This phase is optional. The presence of this phase in the update flow depends on the contents of the target release.

Important

An update to a newer MOSK version may require a reboot of the cluster nodes for the changes to take effect. Although you can decide when to restart each particular node, an update cannot be considered complete until all nodes are restarted.

To determine whether the reboot is required, consult the Step 4. Reboot the nodes with optional instance migration section.

What happens during the phase

  1. You put the cluster into the maintenance mode.

  2. For each node in the cluster:

    1. Optional. You configure an instance migration policy.

    2. You initiate the node reboot.

    3. The node is gracefully restarted with automatic or manual migration of cloud workloads running on it.

Impact

  • For the storage nodes:

    • No impact on the nodes hosting the Ceph cluster data

    • Loss of connectivity to the volumes for the nodes hosting LVM with iSCSI volumes

  • For the control plane nodes, some of the running cloud operations may fail over the course of the phase due to minor unavailability of the cloud API.

  • For the network gateway nodes (Open vSwitch), workloads can experience minor loss of the North-South connectivity depending on the type of virtual routers in use.

  • For the compute nodes, no or controllable impact on the workloads depending on the configured instance migration policy. See Configure instance migration policy for cluster nodes.

Time to complete

  • Optional. Time to migrate instances across compute nodes.

  • Up to 10 minutes per node to reboot. Depends on the hardware and BIOS configuration. Several nodes can be rebooted in parallel.

Step 1. Verify that the Container Cloud management cluster is up-to-date

MOSK relies on Mirantis Container Cloud to manage the underlying software stack for a cluster, as well as to deliver updates for all the components.

Since every MOSK release is tightly coupled with a Container Cloud release, a MOSK cluster update becomes possible once the management cluster is known to run the latest Container Cloud version. The management cluster periodically verifies public Mirantis repositories and updates itself automatically when a newer version becomes available. If any of the managed clusters, including MOSK, runs an outdated Container Cloud version, the management cluster cannot update itself automatically.

To identify the current version of the Container Cloud software your management cluster is running, refer to the Container Cloud web UI. You can also verify your management cluster status using CLI as described in Verify the management cluster status before MOSK update.

Step 2. Initiate MOSK cluster update

Silence alerts

During an update of a MOSK cluster, numerous alerts may be seen in StackLight. This is expected behavior. Therefore, ignore or temporarily mute the alerts as described in Silence alerts.

Caution

During the update, the false-positive CalicoDataplaneFailuresHigh alert may fire. Disregard this alert. It disappears once the update succeeds.

The observed behavior is typical for calico-node during upgrades, as workload changes occur frequently. Consequently, there is a possibility of temporary desynchronization in the Calico dataplane. This can occasionally result in throttling when applying workload changes to the Calico dataplane.

Verify Ceph configuration

If you update MOSK to 23.1, verify that the KaaSCephCluster custom resource does not contain the following entries. If they exist, remove them.

  • In the spec.cephClusterSpec section, the external section.

  • In the spec.cephClusterSpec.rookConfig section, the ms_crc_data or ms crc data configuration key. After you remove the key, wait for rook-ceph-mon pods to restart on the MOSK cluster.
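
The check above can be expressed as a small helper that inspects a KaaSCephCluster object loaded into a dictionary, for example, from the JSON output of kubectl. This is a sketch under the assumption that the keys appear in rookConfig exactly as named above. The function name is hypothetical.

```python
from typing import Any, Dict, List


def find_entries_to_remove(kaas_ceph_cluster: Dict[str, Any]) -> List[str]:
    """List the entries that must be removed before updating to MOSK 23.1."""
    ceph_spec = kaas_ceph_cluster.get("spec", {}).get("cephClusterSpec") or {}
    found = []
    if "external" in ceph_spec:
        found.append("spec.cephClusterSpec.external")
    rook_config = ceph_spec.get("rookConfig") or {}
    # Both spellings of the key are named in the procedure.
    for key in ("ms_crc_data", "ms crc data"):
        if key in rook_config:
            found.append(f"spec.cephClusterSpec.rookConfig[{key!r}]")
    return found
```

An empty result means that neither entry is present and no changes to the custom resource are needed for this check.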

Enable parallel update of Kubernetes worker nodes

Optional. Starting from MOSK 23.1 to 23.2 update, you can enable and configure parallel node update to reduce update time and minimize downtime:

  • To enable parallel update of Kubernetes worker nodes, set the spec.providerSpec.value.maxWorkerUpgradeCount configuration parameter in the Mirantis Container Cloud management cluster as described in conf-upd-count.

  • Consider the specifics of handling of parallel node updates by OpenStack, Ceph, and Tungsten Fabric Controllers to properly plan the maintenance window. For handling details and possible configuration, refer to Parallelizing node update operations.

Enable automatic node reboot in update groups

TechPreview

Optional. Starting from MOSK 24.3, you can enable automatic node reboot of an update group, which contains a set of controller or worker machines. This option applies when a Cluster release update requires node reboot, for example, when kernel version update is available in the target Cluster release. The option reduces manual intervention and overall downtime during cluster update.

To enable automatic node reboot in an update group, set spec.rebootIfUpdateRequires in the required UpdateGroup object. For details, see UpdateGroup resource.

Caution

During a distribution upgrade, machines are always rebooted, overriding rebootIfUpdateRequires: false.

Trigger the update

  1. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, find the managed MOSK cluster.

  4. Click the More action icon to see whether a new release is available. If that is the case, click Update cluster.

  5. In the Release Update window, select the required Cluster release to update your managed cluster to.

    The Description section contains the list of components versions to be installed with a new Cluster release.

  6. Click Update.

    Before the cluster update starts, Container Cloud performs a backup of MKE and Docker Swarm. The backup directory is located under:

    • /srv/backup/swarm on every Container Cloud node for Docker Swarm

    • /srv/backup/ucp on one of the controller nodes for MKE

    To view the update status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the update is complete.

Step 3. Watch the cluster update
Watch the update process through the web UI

To view the update status through the Container Cloud web UI, navigate to the Clusters page. Once the orange blinking dot next to the cluster name disappears, the cluster update is complete.

Also, you can see the general status of each node during the update on the Container Cloud cluster view page.

Follow the update process through logs

The whole update process is controlled by lcm-controller, which runs in the kaas namespace of the Container Cloud management cluster. Follow its logs to watch the progress of the update and to discover and debug any issues.

Watch the state of the cluster and nodes update through the CLI

The lcmclusterstate and lcmmachines objects in the mos namespace of the Container Cloud management cluster provide detailed information about the current phase of the update process in the context of the managed cluster overall as well as specific nodes.

The lcmmachine object being in the Ready state indicates that a node has been successfully updated.

To display the detailed view of the cluster update state, run:

kubectl -n child-ns get lcmclusterstates -o wide

Example system response:

NAME                                            CLUSTERNAME   TYPE              ARG                                          VALUE   ACTUALVALUE   ATTEMPT   MESSAGE
cd-cz7506-child-cl-storage-worker-noefi-rgxhk   child-cl      cordon-drain      cz7506-child-cl-storage-worker-noefi-rgxhk   true                  0         Error: following NodeWorkloadLocks are still active - ceph: UpdatingController,openstack: InProgress
sd-cz7506-child-cl-storage-worker-noefi-rgxhk   child-cl      swarm-drain       cz7506-child-cl-storage-worker-noefi-rgxhk   true                  0         Error: waiting for kubernetes node kaas-node-5222a92f-5523-457c-8c69-b7aa0ffc235c to be drained first

To display the detailed view of the nodes update state, run:

kubectl -n child-ns get lcmmachines

Example system response:

NAME                                                 CLUSTERNAME   TYPE      STATE
cz5018-child-cl-storage-worker-noefi-dzttw           child-cl      worker    Prepare
cz5019-child-cl-storage-worker-noefi-vxcm9           child-cl      worker    Prepare
cz7500-child-cl-control-storage-worker-noefi-nkk9t   child-cl      control   Ready
cz7501-child-cl-control-storage-worker-noefi-7pcft   child-cl      control   Ready
cz7502-child-cl-control-storage-worker-noefi-c7k6f   child-cl      control   Ready
cz7503-child-cl-storage-worker-noefi-5lvd7           child-cl      worker    Prepare
cz7505-child-cl-storage-worker-noefi-jh4mc           child-cl      worker    Prepare
cz7506-child-cl-storage-worker-noefi-rgxhk           child-cl      worker    Prepare
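
On a large cluster, it can help to filter the table above for machines that have not reached the Ready state yet. The following is a minimal sketch, assuming the plain-text column layout shown in the example response. The function names are illustrative and not part of any Mirantis tooling.

```python
def parse_lcmmachines(table: str) -> list[dict[str, str]]:
    """Parse the plain-text output of `kubectl get lcmmachines` into row dicts."""
    lines = [line for line in table.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def pending_machines(table: str) -> list[str]:
    """Return the names of machines whose STATE column is not Ready."""
    return [row["NAME"] for row in parse_lcmmachines(table) if row.get("STATE") != "Ready"]


# Two rows copied from the example system response above.
SAMPLE = """\
NAME                                                 CLUSTERNAME   TYPE      STATE
cz5018-child-cl-storage-worker-noefi-dzttw           child-cl      worker    Prepare
cz7500-child-cl-control-storage-worker-noefi-nkk9t   child-cl      control   Ready
"""

print(pending_machines(SAMPLE))
```

The sketch relies on whitespace-separated columns, so it is only suitable for the default output without free-text columns such as MESSAGE.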
Step 4. Reboot the nodes with optional instance migration

Depending on the target release content, you may need to reboot the cluster nodes for the changes to take effect. Running a MOSK cluster in a semi-updated state for an extended period may result in unpredictable behavior of the cloud and impact users and workloads. Therefore, when it is required, you need to reboot the cluster nodes as soon as possible to avoid potential risks.

Note

If you enabled rebootIfUpdateRequires as described in Enable automatic node reboot in update groups, nodes will be automatically rebooted in update groups during a Cluster release update that requires a reboot, for example, when kernel version update is available in the target Cluster release. For a distribution upgrade, continue reading the following subsections.

Determine if the node needs to be rebooted

Verify the YAML definitions of the LCMMachine and Machine objects. The node must be rebooted if the rebootRequired flag is set to true. In addition, the objects explicitly specify the reason for rebooting. For example:

  • The LCMMachine object of the node that requires rebooting:

    ...
    status:
      hostInfo:
        rebootRequired: true
        rebootReason: "linux-image-5.13.0-51-generic"
    
  • The Machine object of the node that does not require rebooting:

    ...
    status:
      ...
      providerStatus:
        ...
        reboot:
          reason: ""
          required: false
        status: Ready
    

Since MOSK 23.1, you can also use the Mirantis Container Cloud web UI to identify the nodes requiring reboot:

  1. In the Clusters tab, click the required cluster name. The page with Machines opens.

  2. Hover over the status of every machine. A machine to reboot contains the Reboot > The machine requires a reboot notification in the Status tooltip.
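
The reboot check on the object definitions described above can also be scripted. The sketch below assumes that the object has already been loaded into a Python dictionary, for example, from the JSON output of kubectl. It reads only the fields shown in the LCMMachine and Machine examples. The helper name is hypothetical.

```python
def reboot_info(obj: dict) -> tuple[bool, str]:
    """Return (required, reason) for a parsed LCMMachine or Machine object."""
    status = obj.get("status", {})
    if "hostInfo" in status:
        # LCMMachine layout: status.hostInfo.rebootRequired / rebootReason
        host = status["hostInfo"]
        return bool(host.get("rebootRequired", False)), host.get("rebootReason", "")
    # Machine layout: status.providerStatus.reboot.required / reason
    reboot = status.get("providerStatus", {}).get("reboot", {})
    return bool(reboot.get("required", False)), reboot.get("reason", "")


lcmmachine = {"status": {"hostInfo": {"rebootRequired": True,
                                      "rebootReason": "linux-image-5.13.0-51-generic"}}}
machine = {"status": {"providerStatus": {"reboot": {"reason": "", "required": False},
                                         "status": "Ready"}}}

print(reboot_info(lcmmachine))
print(reboot_info(machine))
```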

Configure instance migration policy for cluster nodes

Restarting the cluster causes downtime of the cloud services running on the nodes. While the MOSK control plane is built for high availability and can tolerate temporary loss of at least 1/3 of services without a significant impact on user experience, rebooting nodes that host the elements of cloud data plane, such as network gateway nodes and compute nodes, has a detrimental effect on the cloud workloads, if not performed gracefully.

To configure the instance migration policy:

  1. Edit the target compute node resource. For example:

    kubectl edit node kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
    
  2. To mitigate the potential impact on the cloud workloads, define the migration mode and the number of attempts the OpenStack Controller should make to migrate a single instance running on the node:

    Instance migration configuration for hosts

    Node annotation

    Default

    Description

    instance_migration_mode

    live

    Defines the instance migration mode for the host. The available options are:

    • live: the OpenStack Controller live migrates instances automatically. The update mechanism tries to move the memory and local storage of all instances on the node to another node without interrupting them before applying any changes to the node. By default, the update mechanism makes three attempts to migrate each instance before falling back to the manual mode. [0]

    • manual: the OpenStack Controller waits for the Operator to migrate instances from the host. When it is time to update the host, the update mechanism asks you to manually migrate the instances and proceeds only once you confirm the node is safe to update. [1]

    • skip: the OpenStack Controller skips the instance check on the node and reboots it.

    instance_migration_attempts

    3

    Defines the number of times the OpenStack Controller attempts to live-migrate a single instance before falling back to the manual mode.

    [0] Success of live migration depends on many factors, including the selected vCPU type and model, the amount of data that needs to be transferred, the intensity of the disk IO and memory writes, the type of the local storage, and others. Instances using the following product features are known to have issues with live migration:

    • LVM-based ephemeral storage with and without encryption

    • Encrypted block storage volumes

    • CPU and NUMA node pinning

    [1] For the clouds relying on the converged LVM with iSCSI block storage that offer persistent volumes in a remote edge sub-region, it is important to keep in mind that applying a major change to a compute node may impact not only the instances running on this node but also the instances attached to the LVM devices hosted there. Mirantis recommends that in such environments you perform the update procedure in the manual mode with mitigation measures taken by the Operator for each compute node. Otherwise, all the instances that have LVM with iSCSI volumes attached would need a reboot to restore connectivity.

    Configuration example that sets the instance migration mode to live and the number of attempts to live-migrate to 5:

    apiVersion: v1
    kind: Node
    metadata:
     name: kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
     selfLink: /api/v1/nodes/kaas-node-03ab613d-cf79-4830-ac70-ed735453481a
     uid: 54be5139-aba7-47e7-92bf-5575773a12a6
     resourceVersion: '299734609'
     creationTimestamp: '2021-03-24T16:03:11Z'
     labels:
       ...
       openstack-compute-node: enabled
       openvswitch: enabled
     annotations:
       openstack.lcm.mirantis.com/instance_migration_mode: "live"
       openstack.lcm.mirantis.com/instance_migration_attempts: "5"
       ...
    
  3. If needed, as a cloud user, mark the instances that require individual handling during instance migration using the openstack.lcm.mirantis.com:maintenance_action=<ACTION-TAG> server tag. For details, refer to Configure per-instance migration mode.
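
If you prefer to generate the node annotations instead of editing the node manually, the following sketch builds them from the mode and the number of attempts. The annotation keys and the list of modes come from the table above. The validation rules (rejecting unknown modes and negative attempt counts) are assumptions of this example, not documented behavior of the OpenStack Controller.

```python
import json

PREFIX = "openstack.lcm.mirantis.com"
MODES = ("live", "manual", "skip")


def migration_annotations(mode: str = "live", attempts: int = 3) -> dict[str, str]:
    """Build the instance migration annotations for a compute node."""
    if mode not in MODES:
        raise ValueError(f"unknown migration mode: {mode!r}")
    if attempts < 0:
        # Assumption of this sketch: a negative number of attempts is meaningless.
        raise ValueError("attempts must not be negative")
    return {
        f"{PREFIX}/instance_migration_mode": mode,
        f"{PREFIX}/instance_migration_attempts": str(attempts),
    }


# A merge patch body that carries the annotations from the configuration example.
patch = {"metadata": {"annotations": migration_annotations("live", 5)}}
print(json.dumps(patch))
```

The printed JSON can serve as the body of a merge patch for the Node object, for example, with kubectl patch node <NODE-NAME> --type merge -p '<JSON>'.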

Reboot MOSK cluster

Since MOSK 23.1, you can reboot several cluster nodes in one go by using the Graceful reboot mechanism provided by Mirantis Container Cloud. The mechanism restarts the selected nodes one by one, honoring the instance migration policies.

For older versions of MOSK, you need to reboot each node manually as follows:

  1. Enable maintenance mode for the MOSK cluster.

  2. For each node in the cluster:

    1. Enable maintenance mode for the node.

    2. If manual instance migration policy is configured for the node, perform manual migration once the node is ready to reboot (see below).

    3. Reboot the node using cluster life-cycle management.

    4. Disable maintenance mode for the node.

  3. Disable maintenance mode for the MOSK cluster.

Perform manual actions before node reboot

When a node that has a manual instance migration policy is ready to be restarted, the life-cycle management mechanism notifies you by creating a NodeMaintenanceRequest object for the node and setting the active status attribute for the corresponding NodeWorkloadLock object.

Note

Verify the status:errorMessage attribute before proceeding.

To view the NodeWorkloadLock objects details for a specific node, run:

kubectl get nodeworkloadlocks <NODE-NAME> -o yaml

Example system response:

apiVersion: lcm.mirantis.com/v1alpha1
kind: NodeWorkloadLock
metadata:
  annotations:
    inner_state: active
  creationTimestamp: "2022-02-04T13:24:48Z"
  generation: 1
  name: openstack-kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5
  resourceVersion: "173934"
  uid: 0cb4428f-dd0d-401d-9d5e-e9e97e077422
spec:
  controllerName: openstack
  nodeName: kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5
status:
  errorMessage: 2022-02-04 14:43:52.674125 Some servers ['0ab4dd8f-ef0d-401d-9d5e-e9e97e077422'] are still present on host kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5.
    Waiting unless all of them are migrated manually or instance_migration_mode is set to 'skip'
  release: 8.5.0-rc+22.1
  state: active

Note

For a MOSK compute node, you need to manually shut down all instances running on it, or perform cold or live migration of the instances.
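
To find out which instances still block the reboot, you can extract their IDs from the errorMessage attribute of the NodeWorkloadLock object. The sketch below assumes the message wording shown in the example system response. This wording is not a stable interface and may differ between releases, so treat the helper as illustrative only.

```python
import re

UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def servers_blocking_reboot(lock: dict) -> list[str]:
    """Return server IDs listed in status.errorMessage of an active lock."""
    status = lock.get("status", {})
    if status.get("state") != "active":
        return []
    # Only look inside the bracketed list, because the host name also contains a UUID.
    match = re.search(r"Some servers \[(.*?)\]", status.get("errorMessage", ""))
    return UUID_RE.findall(match.group(1)) if match else []


lock = {
    "status": {
        "state": "active",
        "errorMessage": (
            "2022-02-04 14:43:52.674125 Some servers "
            "['0ab4dd8f-ef0d-401d-9d5e-e9e97e077422'] are still present on host "
            "kaas-node-b2a55089-5b03-4698-9879-8756e2e81df5. Waiting unless all of "
            "them are migrated manually or instance_migration_mode is set to 'skip'"
        ),
    }
}

print(servers_blocking_reboot(lock))
```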

After the update

Once your MOSK cluster update is complete, proceed with the following:

  1. Perform the post-update steps recommended in the update notes of the target release if any.

  2. Use the standard configuration mechanisms to re-enable the new product features that previously existed in your cloud as a custom configuration.

  3. To ensure the cluster operability, execute a set of smoke tests as described in Run Tempest tests.

  4. Optional. Proceed with the upgrade of OpenStack.

  5. If necessary, expire alert silences in StackLight as described in Silence alerts.

What to do if the update hangs or fails

If an update phase takes significantly longer than expected according to the tables included in Plan the cluster update, you should consider the update process hung.

If you observe errors that are not described explicitly in the documentation, immediately contact Mirantis support.

Troubleshoot issues

To see any issues that might have occurred during the update, verify the logs of the lcm-controller pods in the kaas namespace of the Container Cloud management cluster.

To troubleshoot the update that involves the operating system upgrade with host reboot, refer to Troubleshoot an operating system upgrade with host restart.

Roll back the changes

The Container Cloud and MOSK life-cycle management mechanisms do not provide a way to perform a cluster-wide rollback of an update.

Update to a patch version

Patch releases aim to significantly shorten the cycle of CVE fixes delivery onto your MOSK deployments to help you avoid cyber threats and data breaches.

Your management bare-metal cluster obtains patch releases automatically the same way as major releases. A new patch MOSK release version becomes available through the Container Cloud web UI after the automatic upgrade of the management cluster.

It is not possible to update between the patch releases that belong to different release series in one go. For example, you can update from MOSK 23.1.1 to 23.1.2, but you cannot immediately update from MOSK 23.1.x to 23.2.x because you need to update to the major MOSK 23.2 release first.

Caution

If you delay the Container Cloud upgrade and schedule it at a later time as described in Schedule Mirantis Container Cloud updates, make sure to schedule a longer maintenance window as the upgrade queue can include several patch releases along with the major release upgrade.
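
The release series rule described above can be expressed as a small check. This is a sketch under simplifying assumptions: a version with two components, such as 23.2, is treated as a major release, and a version with three components as a patch release. The check does not verify that the series are adjacent or that a particular update path is actually published, so always confirm the path in the release notes.

```python
def parse(version: str) -> tuple[int, ...]:
    """Split a version string such as '23.1.2' into integer components."""
    return tuple(int(part) for part in version.split("."))


def can_update_in_one_go(current: str, target: str) -> bool:
    """Apply the series rule: patch-to-patch updates only within one series."""
    cur, tgt = parse(current), parse(target)
    if tgt <= cur:
        return False  # not an update
    if cur[:2] == tgt[:2]:
        return True  # same release series
    # Different series: only the major release of the target series qualifies.
    return len(tgt) == 2


print(can_update_in_one_go("23.1.1", "23.1.2"))
print(can_update_in_one_go("23.1.1", "23.2.1"))
print(can_update_in_one_go("23.1.2", "23.2"))
```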

Pre-update actions
Estimate the update impact

Read the Update notes part of the target MOSK release notes to understand the changes it brings and the impact these changes are going to have on your cloud users and workloads.

Determine if cluster nodes need to be rebooted

Applying a patch release may not require a reboot of the cluster nodes. However, your cluster can contain nodes that still require a reboot after the last update to a major release, and this requirement remains after the update to any of the subsequent patch releases. Therefore, Mirantis strongly recommends that, before you update to the next patch release, you determine whether such nodes exist in your cluster and reboot them, if any, as described in Step 4. Reboot the nodes with optional instance migration.

Avoid network downtime for cloud workloads

For some MOSK versions, applying a patch release may require a restart of the containers that host the elements of the cloud data plane. In case of Open vSwitch-based clusters, this may result in up to 5 minutes of workload network connectivity downtime for each compute node.

For MOSK prior to 24.1 series, you can determine whether applying a patch release is going to require the restart of the data plane by consulting the Release artifacts part of the release notes of the current and target MOSK releases. The data plane restart will only happen if there are new versions of the container images related to the data plane.

It is possible to avoid the downtime for the cloud data plane by explicitly pinning the image versions of the following components:

  • Open vSwitch

  • Kubernetes entrypoint

However, pinning these images will result in the cloud data plane not receiving any security fixes or bug fixes during the update.

To pin the images:

  1. Depending on the proxy configuration, the image base URL differs. To obtain the list of currently used images on the cluster, run:

    kubectl -n openstack get ds openvswitch-openvswitch-vswitchd-default -o yaml | grep "image:" | sort -u
    

    Example of system response:

    image: mirantis.azurecr.io/general/openvswitch:2.13-focal-20230211095312
    image: mirantis.azurecr.io/openstack/extra/kubernetes-entrypoint:v1.0.1-48d1e8a-20220919122849
    
  2. Add the openvswitch and kubernetes-entrypoint images used on your cluster:

    Create a ConfigMap in the openstack namespace with the following content, replacing <OPENSTACKDEPLOYMENT-NAME> with the name of your OpenStackDeployment custom resource:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      labels:
        openstack.lcm.mirantis.com/watch: "true"
      name: <OPENSTACKDEPLOYMENT-NAME>-artifacts
      namespace: openstack
    data:
      caracal: |
        dep_check: <KUBERNETES-ENTRYPOINT-IMAGE-URL>
        openvswitch_db_server: <OPENVSWITCH-IMAGE-URL>
        openvswitch_vswitchd: <OPENVSWITCH-IMAGE-URL>
    

    Edit the OpenStackDeployment custom resource as follows:

    spec:
      services:
        networking:
          openvswitch:
            values:
              images:
                tags:
                  dep_check: <KUBERNETES-ENTRYPOINT-IMAGE-URL>
                  openvswitch_db_server: <OPENVSWITCH-IMAGE-URL>
                  openvswitch_vswitchd: <OPENVSWITCH-IMAGE-URL>
    

    For example:

    spec:
      services:
        networking:
          openvswitch:
            values:
              images:
                tags:
                  dep_check: mirantis.azurecr.io/openstack/extra/kubernetes-entrypoint:v1.0.1-48d1e8a-20220919122849
                  openvswitch_db_server: mirantis.azurecr.io/general/openvswitch:2.13-focal-20230211095312
                  openvswitch_vswitchd: mirantis.azurecr.io/general/openvswitch:2.13-focal-20230211095312
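
To avoid copy-paste mistakes when filling in the image URLs, you can derive the three tag values from the output of the image listing command. The following sketch assumes that the output contains exactly the two kinds of images shown in the example system response and maps them to the tag keys in the same way as the example above. The function name is hypothetical.

```python
def pinned_tags(grep_output: str) -> dict[str, str]:
    """Map the listed images to the dep_check and openvswitch_* tag keys."""
    images = [
        line.split("image:", 1)[1].strip()
        for line in grep_output.splitlines()
        if "image:" in line
    ]
    # next() raises StopIteration if an expected image is missing from the output.
    entrypoint = next(i for i in images if "kubernetes-entrypoint" in i)
    openvswitch = next(i for i in images if "/openvswitch:" in i)
    return {
        "dep_check": entrypoint,
        "openvswitch_db_server": openvswitch,
        "openvswitch_vswitchd": openvswitch,
    }


OUTPUT = """\
image: mirantis.azurecr.io/general/openvswitch:2.13-focal-20230211095312
image: mirantis.azurecr.io/openstack/extra/kubernetes-entrypoint:v1.0.1-48d1e8a-20220919122849
"""

for key, value in pinned_tags(OUTPUT).items():
    print(f"{key}: {value}")
```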
    
Update a patch Cluster release of a MOSK cluster
  1. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click Upgrade next to the More action icon located in the last column for each cluster where available.

    Note

    If Upgrade is greyed out, the cluster is in maintenance mode, which must be disabled before you can proceed with the cluster update. For details, see Disable maintenance mode on a cluster and machine.

    If Upgrade does not display, your cluster is up-to-date.

  4. In the Release update window, select the required patch Cluster release to update your managed cluster to.

    The release notes for patch Cluster releases are available in Container Cloud Release Notes: Patch releases.

  5. Click Update. To monitor update readiness, refer to Verify cluster status.

Note

Since Container Cloud 2.26.1 (patch Cluster releases 17.1.1 and 16.1.1), the update of Ubuntu packages with kernel minor version update may apply in certain releases.

In this case, cordon-drain and reboot of machines do not apply automatically, and all machines have the Reboot is required notification after the cluster update. You can manually handle the reboot of machines during a convenient maintenance window as described in Perform a graceful reboot of a cluster.

Verify the management cluster status before MOSK update

Before you start updating your managed clusters, Mirantis recommends verifying that the associated management cluster is upgraded successfully.

To verify that the management cluster is upgraded successfully:

  1. Using kubeconfig of the management cluster, verify the Cluster release version of the management cluster machines:

    for i in $(kubectl get lcmmachines | awk '{print $1}' | sed '1d'); do echo $i; kubectl get lcmmachines $i -o yaml | grep release | tail -1; done
    

    Example of system response:

    master-0
      release: 14.0.0+3.6.5
    master-1
      release: 14.0.0+3.6.5
    master-2
      release: 14.0.0+3.6.5
    
  2. Obtain the name of the latest available Container Cloud release object:

    kubectl get kaasrelease
    

    Example of system response:

    NAME          AGE
    kaas-2-15-0   63m
    kaas-2-14-0   40d
    
  3. Using the name of the latest Container Cloud release object, obtain the latest available Cluster release version:

    kubectl get -o yaml clusterrelease $(kubectl get kaasrelease kaas-2-15-0 -o yaml | egrep "^ +clusterRelease:" | cut -d: -f2 | tr -d ' ') | egrep "^  version:"
    

    Example of system response:

    version: 14.0.0+3.6.4
    
  4. Compare the output obtained in the previous step with the output from the first step. The Cluster releases must match. If this is not the case, contact Mirantis support for further details.

  5. Proceed to Step 2. Initiate MOSK cluster update.
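
The comparison in the procedure above can be automated once you have captured the command output. The sketch below assumes the alternating name and release lines printed by the loop in the first step and a version string taken from the third step. Both helper names are illustrative.

```python
def parse_machine_releases(loop_output: str) -> dict[str, str]:
    """Parse alternating '<machine>' / 'release: <version>' lines into a dict."""
    releases: dict[str, str] = {}
    name = None
    for raw in loop_output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("release:"):
            if name is not None:
                releases[name] = line.split(":", 1)[1].strip()
        else:
            name = line
    return releases


def mismatched_machines(releases: dict[str, str], cluster_release: str) -> list[str]:
    """Return the machines whose release differs from the Cluster release version."""
    return sorted(name for name, rel in releases.items() if rel != cluster_release)


LOOP_OUTPUT = """\
master-0
  release: 14.0.0+3.6.5
master-1
  release: 14.0.0+3.6.5
master-2
  release: 14.0.0+3.6.5
"""

releases = parse_machine_releases(LOOP_OUTPUT)
print(mismatched_machines(releases, "14.0.0+3.6.5"))
```

An empty list means that all machines run the expected Cluster release.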

Change the upgrade order of a machine

You can define the upgrade sequence for existing machines to allow prioritized machines to be upgraded first during a cluster update.

Consider the following upgrade index specifics:

  • The first machine to upgrade is always one of the control plane machines with the lowest upgradeIndex. Other control plane machines are upgraded one by one according to their upgrade indexes.

  • If the Cluster spec dedicatedControlPlane field is false, worker machines are upgraded only after the upgrade of all control plane machines finishes. Otherwise, they are upgraded after the first control plane machine, concurrently with other control plane machines.

  • If several machines have the same upgrade index, they have the same priority during upgrade.

  • If the value is not set, the machine is automatically assigned a value of the upgrade index.
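
The ordering rules above can be sketched as follows. The machine representation (a dictionary with the name, type, and upgradeIndex keys) is an assumption made for this example and does not mirror the Machine object schema. Machines with equal indexes keep their input order here, whereas the product only states that they have the same priority.

```python
def control_plane_order(machines: list[dict]) -> list[str]:
    """Return control plane machine names sorted by ascending upgradeIndex."""
    control = [m for m in machines if m["type"] == "control"]
    return [m["name"] for m in sorted(control, key=lambda m: m["upgradeIndex"])]


def workers_wait_for(machines: list[dict], dedicated_control_plane: bool):
    """Return the control plane machine that must finish before workers start."""
    order = control_plane_order(machines)
    if not order:
        return None
    # dedicatedControlPlane=false: workers wait for all control plane machines.
    # Otherwise: workers start after the first control plane machine.
    return order[0] if dedicated_control_plane else order[-1]


machines = [
    {"name": "cp-b", "type": "control", "upgradeIndex": 2},
    {"name": "cp-a", "type": "control", "upgradeIndex": 1},
    {"name": "w-1", "type": "worker", "upgradeIndex": 1},
]

print(control_plane_order(machines))
print(workers_wait_for(machines, dedicated_control_plane=False))
```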

To define the upgrade order of an existing machine:

  1. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with the Machines list opens.

  4. In the settings menu of one of the Unassigned machines, select Change upgrade index.

  5. In the Configure Upgrade Priority window that opens, use the Up and Down arrows in the Upgrade Index field to configure the upgrade sequence of a machine. Click Update to apply changes.

  6. Using the Pool info or Machine info options in the machine settings menu, verify that the Upgrade Priority Index contains the updated value.

Configure the parallel update of worker nodes

Available since MCC 2.25.0 (17.0.0 and 16.0.0)

Note

You can start using the procedure below during cluster update from 23.1 to 23.2. For details, see Parallelizing node update operations.

By default, worker machines are upgraded sequentially, which includes node draining, software upgrade, services restart, and so on. However, MOSK enables you to parallelize node upgrade operations, significantly improving the efficiency of your deployment, especially on large clusters.

For upgrade workflow of the control plane, see Change the upgrade order of a machine.

Configure the parallel update of worker nodes using web UI

Available since MCC 2.25.0 (17.0.0 and 16.0.0)

  1. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with the Machines list opens.

  4. On the Clusters page, click the More action icon in the last column of the required cluster and select Configure cluster.

  5. In General Settings of the Configure cluster window, define the following parameters:

    Parallel Upgrade Of Worker Machines

    The maximum number of the worker nodes to update simultaneously. It serves as an upper limit on the number of machines that are drained at a given moment of time. Defaults to 1.

    You can configure this option after deployment before the cluster update.

    Parallel Preparation For Upgrade Of Worker Machines

    The maximum number of worker nodes being prepared at a given moment of time, which includes downloading of new artifacts. It serves as a limit for the network load that can occur when downloading the files to the nodes. Defaults to 50.

Configure the parallel update of worker nodes using CLI

Available since MCC 2.24.0 (15.0.1 and 14.0.1)

  1. Open the Cluster object for editing.

  2. Adjust the following parameters as required:

    Configuration of the parallel node update

    Parameter

    Default

    Description

    spec.providerSpec.maxWorkerUpgradeCount

    1

    The maximum number of the worker nodes to update simultaneously. It serves as an upper limit on the number of machines that are drained at a given moment of time.

    Caution

    Since Container Cloud 2.27.0 (Cluster releases 17.2.0 and 16.2.0), maxWorkerUpgradeCount is deprecated and will be removed in one of the following releases. Use the concurrentUpdates parameter in the UpdateGroup object instead. For details, see Create update groups for worker machines.

    spec.providerSpec.maxWorkerPrepareCount

    50

    The maximum number of workers being prepared at a given moment of time, which includes downloading of new artifacts. It serves as a limit for the network load that can occur when downloading the files to the nodes.

  3. Save the Cluster object to apply the change.
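
To get a rough feeling for the effect of maxWorkerUpgradeCount, you can estimate the minimum number of drain rounds for a given number of workers. This is simple arithmetic under the assumption that nodes are updated in full batches. The real update does not necessarily proceed in lockstep, and the controllers of OpenStack, Ceph, and Tungsten Fabric may further limit concurrency, so treat the result as a lower bound.

```python
import math


def min_drain_rounds(worker_count: int, max_worker_upgrade_count: int = 1) -> int:
    """Lower bound on sequential rounds needed to drain all workers."""
    if max_worker_upgrade_count < 1:
        raise ValueError("max_worker_upgrade_count must be at least 1")
    return math.ceil(worker_count / max_worker_upgrade_count)


print(min_drain_rounds(10))      # default of 1: one worker at a time
print(min_drain_rounds(10, 3))   # up to three workers drained simultaneously
```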

Create update groups for worker machines

Available since MCC 2.27.0 (17.2.0 and 16.2.0)

The use of update groups provides enhanced control over update of worker machines by allowing granular concurrency settings for specific machine groups. This feature uses the UpdateGroup object to decouple the concurrency settings from the global cluster level, providing flexibility based on the workload characteristics of different machine sets.

The UpdateGroup objects are processed sequentially based on their indexes. Update groups with the same indexes are processed concurrently. The control update group is always processed first.

Note

The update order of a machine within the same group is determined by the upgrade index of a specific machine. For details, see Change the upgrade order of a machine.

The maxWorkerUpgradeCount parameter of the Cluster object is inherited by the default update group. Changing maxWorkerUpgradeCount leads to changing the concurrentUpdates parameter of the default update group.

Note

The maxWorkerUpgradeCount parameter of the Cluster object is deprecated and will be removed in one of the following Container Cloud releases. You can still use this parameter to change the concurrentUpdates value of the default update group. However, Mirantis recommends changing this value directly in the UpdateGroup object.
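
The processing order of update groups described above can be illustrated with a short sketch: the control update group goes first, then the remaining groups follow in ascending index order, and groups that share an index form one concurrent wave. Representing groups as (name, index) pairs is an assumption of this example, and the custom group names are hypothetical.

```python
from itertools import groupby


def processing_waves(groups: list[tuple[str, int]], control_group: str) -> list[list[str]]:
    """Group update groups into waves in the order they would be processed."""
    waves = [[control_group]] if any(name == control_group for name, _ in groups) else []
    others = sorted((g for g in groups if g[0] != control_group), key=lambda g: g[1])
    for _, members in groupby(others, key=lambda g: g[1]):
        # Groups with the same index are processed concurrently.
        waves.append(sorted(name for name, _ in members))
    return waves


groups = [
    ("example-cluster-control", 1),
    ("example-cluster-default", 1),
    ("example-cluster-storage", 2),  # hypothetical custom group
    ("example-cluster-gpu", 2),      # hypothetical custom group
]

for wave in processing_waves(groups, "example-cluster-control"):
    print(wave)
```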

Update group for controller nodes

Available since MCC 2.28.0 (17.3.0 and 16.3.0) TechPreview

The update group for controller nodes is automatically generated during initial cluster creation with the following settings:

  • name: <cluster-name>-control

  • index: 1

  • concurrentUpdates: 1

  • rebootIfUpdateRequires: false

Caution

During a distribution upgrade, machines are always rebooted, overriding rebootIfUpdateRequires: false.

All control plane machines are automatically assigned to the update group for controller nodes with no possibility to change it.

Note

On existing clusters created before Container Cloud 2.28.0 (Cluster releases 17.2.0, 16.2.0, or earlier), the update group for controller nodes is created after Container Cloud upgrade to 2.28.0 (Cluster release 16.3.0) on the management cluster.

Caution

The index and concurrentUpdates parameters of the update group for controller nodes are hardcoded and cannot be changed.

Example of the update group for controller nodes:

apiVersion: kaas.mirantis.com/v1alpha1
kind: UpdateGroup
metadata:
  name: example-cluster-control
  namespace: example-ns
  labels:
    cluster.sigs.k8s.io/cluster-name: example-cluster
spec:
  index: 1
  concurrentUpdates: 1
  rebootIfUpdateRequires: false
Default update group

The default update group is automatically created during initial cluster creation with the following settings:

  • name: <cluster-name>-default

  • index: 1

  • rebootIfUpdateRequires: false

  • concurrentUpdates: inherited from the maxWorkerUpgradeCount parameter set in the Cluster object

Caution

During a distribution upgrade, machines are always rebooted, overriding rebootIfUpdateRequires: false.

Note

On existing clusters created before Container Cloud 2.27.0 (Cluster releases 17.1.0, 16.1.0, or earlier), the default update group is created after Container Cloud upgrade to 2.27.0 (Cluster release 16.2.0) on the management cluster.

Example of the default update group:

apiVersion: kaas.mirantis.com/v1alpha1
kind: UpdateGroup
metadata:
  name: example-cluster-default
  namespace: example-ns
  labels:
    cluster.sigs.k8s.io/cluster-name: example-cluster
spec:
  index: 1
  concurrentUpdates: 1
  rebootIfUpdateRequires: false

If you require custom update settings for worker machines, create one or several custom UpdateGroup objects as described below.

Assign a machine to an update group using CLI

Note

All worker machines that are not assigned to any update group are automatically assigned to the default update group.

  1. Create an UpdateGroup object with the required specification. For description of the object fields, see UpdateGroup resource.

  2. Label the machines to associate them with the newly created UpdateGroup object:

    kubectl label machine <machineName> kaas.mirantis.com/update-group=<UpdateGroupObjectName>
    

    To change the update group of a machine, update the kaas.mirantis.com/update-group label of the machine with the new update group name. Removing this label from a machine automatically assigns the machine to the default update group.

Note

After creation of a custom UpdateGroup object, if you plan to add a new machine that requires a non-default update group, manually add the corresponding label to the machine as described above. Otherwise, the default update group is applied to such machine.

Note

Before removing the UpdateGroup object, reassign all machines to another update group.
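
The label-based assignment described above, including the fallback to the default update group, can be summarized in a few lines. The sketch applies to worker machines only and assumes the <cluster-name>-default naming of the default group. The function name is illustrative.

```python
UPDATE_GROUP_LABEL = "kaas.mirantis.com/update-group"


def effective_update_group(machine_labels: dict[str, str], cluster_name: str) -> str:
    """Return the update group of a worker machine based on its labels."""
    # A missing or empty label falls back to the default update group.
    return machine_labels.get(UPDATE_GROUP_LABEL) or f"{cluster_name}-default"


print(effective_update_group({UPDATE_GROUP_LABEL: "example-cluster-storage"}, "example-cluster"))
print(effective_update_group({}, "example-cluster"))
```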

Granularly update a managed cluster using the ClusterUpdatePlan object

Available since MCC 2.27.0 (17.2.0) TechPreview

You can control the process of a managed cluster update by manually launching update stages using the ClusterUpdatePlan custom resource. Between the update stages, a cluster remains functional from the perspective of cloud users and workloads.

A ClusterUpdatePlan object contains the following functionality:

  • The object is automatically created by the bare metal provider when a new Cluster release becomes available for your cluster.

  • The object is created in the management cluster for the same namespace that the corresponding managed cluster refers to.

  • The object contains a list of self-descriptive update steps that are cluster-specific. These steps are defined in the spec section of the object with information about their impact on the cluster.

  • The object starts cluster update when the operator manually changes the commence field of the first update step to true. All steps have the commence flag initially set to false so that the operator can decide when to pause or resume the update process.

  • The object has the following naming convention: <managedClusterName>-<targetClusterReleaseVersion>.

  • Since Container Cloud 2.28.0 (Cluster release 17.3.0), the object contains several StackLight alerts to notify the operator about the update progress and potential update issues. For details, see StackLight alerts: Container Cloud.
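
As an illustration of the naming convention and the commence flags described above, the following sketch derives the object name and finds the first step that has not been started yet. It assumes that the object is available as a Python dictionary with the spec.steps list, where each step has the id and commence fields. The second step ID in the sample is hypothetical.

```python
def plan_name(managed_cluster_name: str, target_release_version: str) -> str:
    """Build the name: <managedClusterName>-<targetClusterReleaseVersion>."""
    return f"{managed_cluster_name}-{target_release_version}"


def next_step_to_commence(plan: dict):
    """Return the id of the first step with commence set to false, if any."""
    for step in plan.get("spec", {}).get("steps", []):
        if not step.get("commence", False):
            return step.get("id")
    return None


plan = {
    "metadata": {"name": plan_name("mosk", "17.4.0")},
    "spec": {
        "steps": [
            {"id": "openstack", "commence": True},
            {"id": "next-step", "commence": False},  # hypothetical step ID
        ]
    },
}

print(plan["metadata"]["name"])
print(next_step_to_commence(plan))
```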

Granularly update a managed cluster using CLI
  1. Verify that the management cluster is upgraded successfully as described in Verify the management cluster status before MOSK update.

  2. Optional. Available since Container Cloud 2.29.0 (Cluster release 17.4.0) as Technology Preview. Enable update auto-pause to be triggered by specific StackLight alerts. For details, see Configure update auto-pause.

  3. Open the ClusterUpdatePlan object for editing.

  4. Start cluster update by changing the spec:steps:commence field of the first update step to true.

    Once done, the following actions are applied to the cluster:

    1. The Cluster release in the corresponding Cluster spec is changed to the target Cluster version defined in the ClusterUpdatePlan spec.

    2. The cluster update starts and pauses before the next update step with commence: false set in the ClusterUpdatePlan spec.

    Caution

    Cancelling an already started update step is not supported.

    The following example illustrates the ClusterUpdatePlan object of a MOSK cluster update that has completed:

    Example of a completed ClusterUpdatePlan object
    Object:
      apiVersion: kaas.mirantis.com/v1alpha1
      kind: ClusterUpdatePlan
      metadata:
        creationTimestamp: "2025-02-06T16:53:51Z"
        generation: 11
        name: mosk-17.4.0
        namespace: child
        resourceVersion: "6072567"
        uid: 82c072be-1dc5-43dd-b8cf-bc643206d563
      spec:
        cluster: mosk
        releaseNotes: https://docs.mirantis.com/mosk/latest/25.1-series.html
        source: mosk-17-3-0-24-3
        steps:
        - commence: true
          description:
          - install new version of OpenStack and Tungsten Fabric life cycle management
            modules
          - OpenStack and Tungsten Fabric container images pre-cached
          - OpenStack and Tungsten Fabric control plane components restarted in parallel
          duration:
            estimated: 1h30m0s
            info:
            - 15 minutes to cache the images and update the life cycle management modules
            - 1h to restart the components
          granularity: cluster
          id: openstack
          impact:
            info:
            - some of the running cloud operations may fail due to restart of API services
              and schedulers
            - DNS might be affected
            users: minor
            workloads: minor
          name: Update OpenStack and Tungsten Fabric
        - commence: true
          description:
          - Ceph version update
          - restart Ceph monitor, manager, object gateway (radosgw), and metadata services
          - restart OSD services node-by-node, or rack-by-rack depending on the cluster
            configuration
          duration:
            estimated: 8m30s
            info:
            - 15 minutes for the Ceph version update
            - around 40 minutes to update Ceph cluster of 30 nodes
          granularity: cluster
          id: ceph
          impact:
            info:
            - 'minor unavailability of object storage APIs: S3/Swift'
            - workloads may experience IO performance degradation for the virtual storage
              devices backed by Ceph
            users: minor
            workloads: minor
          name: Update Ceph
        - commence: true
          description:
          - new host OS kernel and packages get installed
          - host OS configuration re-applied
          - container runtime version gets bumped
          - new versions of Kubernetes components installed
          duration:
            estimated: 1h40m0s
            info:
            - about 20 minutes to update host OS per a Kubernetes controller, nodes updated
              one-by-one
            - Kubernetes components update takes about 40 minutes, all nodes in parallel
          granularity: cluster
          id: k8s-controllers
          impact:
            users: none
            workloads: none
          name: Update host OS and Kubernetes components on master nodes
        - commence: true
          description:
          - new host OS kernel and packages get installed
          - host OS configuration re-applied
          - container runtime version gets bumped
          - new versions of Kubernetes components installed
          - data plane components (Open vSwitch and Neutron L3 agents, TF agents and vrouter)
            restarted on gateway and compute nodes
          - storage nodes put to “no-out” mode to prevent rebalancing
          - by default, nodes are updated one-by-one, a node group can be configured to
            update several nodes in parallel
          duration:
            estimated: 8h0m0s
            info:
            - host OS update - up to 15 minutes per node (not including host OS configuration
              modules)
            - Kubernetes components update - up to 15 minutes per node
            - OpenStack controllers and gateways updated one-by-one
            - nodes hosting Ceph OSD, monitor, manager, metadata, object gateway (radosgw)
              services updated one-by-one
          granularity: machine
          id: k8s-workers-vdrok-child-default
          impact:
            info:
            - 'OpenStack controller nodes: some running OpenStack operations might not
              complete due to restart of components'
            - 'OpenStack compute nodes: minor loss of the East-West connectivity with
              the Open vSwitch networking back end that causes approximately 5 min of
              downtime'
            - 'OpenStack gateway nodes: minor loss of the North-South connectivity with
              the Open vSwitch networking back end: a non-distributed HA virtual router
              needs up to 1 minute to fail over; a non-distributed and non-HA virtual
              router failover time depends on many factors and may take up to 10 minutes'
            users: major
            workloads: major
          name: Update host OS and Kubernetes components on worker nodes, group vdrok-child-default
        - commence: true
          description:
          - restart of StackLight, MetalLB services
          - restart of auxiliary controllers and charts
          duration:
            estimated: 1h30m0s
          granularity: cluster
          id: mcc-components
          impact:
            info:
            - minor cloud API downtime due restart of MetalLB components
            users: minor
            workloads: none
          name: Auxiliary components update
        target: mosk-17-4-0-25-1
      status:
        completedAt: "2025-02-07T19:24:51Z"
        startedAt: "2025-02-07T17:07:02Z"
        status: Completed
        steps:
        - duration: 26m36.355605528s
          id: openstack
          message: Ready
          name: Update OpenStack and Tungsten Fabric
          startedAt: "2025-02-07T17:07:02Z"
          status: Completed
        - duration: 6m1.124356485s
          id: ceph
          message: Ready
          name: Update Ceph
          startedAt: "2025-02-07T17:33:38Z"
          status: Completed
        - duration: 24m3.151554465s
          id: k8s-controllers
          message: Ready
          name: Update host OS and Kubernetes components on master nodes
          startedAt: "2025-02-07T17:39:39Z"
          status: Completed
        - duration: 1h19m9.359184228s
          id: k8s-workers-vdrok-child-default
          message: Ready
          name: Update host OS and Kubernetes components on worker nodes, group vdrok-child-default
          startedAt: "2025-02-07T18:03:42Z"
          status: Completed
        - duration: 2m0.772243006s
          id: mcc-components
          message: Ready
          name: Auxiliary components update
          startedAt: "2025-02-07T19:22:51Z"
          status: Completed
    
  5. Monitor the message and status fields of the first step. The message field contains information about the progress of the current step. The status field can have the following values:

    • NotStarted

    • Scheduled Since MCC 2.28.0 (17.3.0)

    • InProgress

    • AutoPaused TechPreview since MCC 2.29.0 (17.4.0)

    • Stuck

    • Completed

    The Scheduled status indicates that a step is already triggered but its execution has not started yet.

    The AutoPaused status indicates that the update process is paused by a firing StackLight alert defined in the UpdateAutoPause object. For details, see Configure update auto-pause.

    The Stuck status indicates an issue that prevents the step from completing within the ETA defined in the duration field of this step. The ETA for each step is defined statically and does not change depending on the cluster.

    Caution

    The status is not populated for the ClusterUpdatePlan objects that have not been started by adding the commence: true flag to the first object step. Therefore, always start updating the object from the first step.

  6. Optional. Available since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0). Add or remove update groups of worker nodes on the fly. You cannot remove a group whose update has already been scheduled, and you cannot add a group whose index is lower than or equal to the index of a group that is already scheduled. These changes are reflected in ClusterUpdatePlan.

    You can also reassign a machine to a different update group while the cluster is being updated, but only if the new update group has an index higher than the index of the last scheduled worker update group. Disabled machines are considered as updated immediately.

    Note

    Depending on the number of update groups for worker nodes present in the cluster, the number of steps in spec differs. Each update group for worker nodes that has at least one machine will be represented by a step with the ID k8s-workers-<UpdateGroupName>.

  7. Proceed with changing the commence flag of the following update steps granularly depending on the cluster update requirements.

    Caution

    Launch the update steps sequentially. A consecutive step is not started until the previous step is completed.
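
The sequential rule described in this procedure can be sketched in Python as follows. This is a minimal illustration, not part of the product: it assumes that the ClusterUpdatePlan object has already been loaded into a dictionary (for example, from the YAML or JSON output of kubectl), and the function names next_step_to_commence and commence_patch are hypothetical. The sketch finds the first step that has not been commenced, refuses to proceed while an earlier commenced step is not Completed, and builds a JSON Patch document that sets commence to true for that step.

```python
import json


def next_step_to_commence(plan):
    """Return the index of the next spec step that can be commenced, or None."""
    statuses = {
        step["id"]: step.get("status")
        for step in plan.get("status", {}).get("steps", [])
    }
    for index, step in enumerate(plan["spec"]["steps"]):
        if step.get("commence"):
            # A commenced step must complete before the following one starts.
            if statuses.get(step["id"]) != "Completed":
                return None
            continue
        return index
    return None  # every step has already been commenced


def commence_patch(index):
    """Build a JSON Patch (RFC 6902) that sets commence to true for one step."""
    # "add" also overwrites an existing member, so it works whether or not
    # the commence field is currently present on the step.
    return json.dumps(
        [{"op": "add", "path": f"/spec/steps/{index}/commence", "value": True}]
    )


# Hypothetical, heavily trimmed plan used only to exercise the functions.
plan = {
    "spec": {"steps": [
        {"id": "openstack", "commence": True},
        {"id": "ceph", "commence": False},
        {"id": "k8s-controllers", "commence": False},
    ]},
    "status": {"steps": [
        {"id": "openstack", "status": "Completed"},
        {"id": "ceph", "status": "NotStarted"},
    ]},
}

index = next_step_to_commence(plan)
if index is not None:
    print(commence_patch(index))
```

The resulting patch could then be passed to kubectl patch with the --type=json option against the ClusterUpdatePlan object in the management cluster. The exact resource name to use with kubectl is not stated in this section, so verify it in your environment before relying on this approach.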

Granularly update a managed cluster using the Container Cloud web UI

Available since MCC 2.29.0 (17.4.0 and 16.4.0)

  1. Verify that the management cluster is upgraded successfully as described in Verify the management cluster status before MOSK update.

  2. Optional. Available since Container Cloud 2.29.0 (Cluster release 17.4.0) as Technology Preview. Enable update auto-pause to be triggered by specific StackLight alerts. For details, see Configure update auto-pause.

  3. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

  4. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  5. On the Clusters page, in the Updates column of the required cluster, click the Available link. The Updates tab opens.

    Note

    If the Updates column is absent, it indicates that the cluster is up-to-date.

    Note

    For your convenience, the Cluster updates menu is also available in the right-side kebab menu of the cluster on the Clusters page.

  6. On the Updates page, click the required version in the Target column to open update details, including the list of update steps, current and target cluster versions, and estimated update time.

  7. In the Target version section of the Cluster update window, click Release notes and carefully read the updates about the target release, including the Update notes section that contains important pre-update and post-update steps.

  8. Expand each step to verify information about update impact and other useful details.

  9. Select one of the following options:

    • Enable Auto-commence all at the top-right of the first update step section and click Start Update to launch update and start each step automatically.

    • Click Start Update to only launch the first update step.

      Note

      This option allows you to auto-commence consecutive steps while the current step is in progress. Enable the Auto-commence toggle for required steps and click Save to launch the selected steps automatically. You will only be prompted to confirm the consecutive step. All remaining steps will be launched without a manual confirmation.

    Before launching the update, you will be prompted to manually type in the target Cluster release name and confirm that you have read the release notes for the target release.

    Caution

    Cancelling an already started update step is not supported.

  10. Monitor the status of each step by hovering over the In Progress icon at the top-right of the step window. While the step is in progress, its current status is updated every minute.

    Once the required step is completed, the Waiting for input status at the top of the update window is displayed requiring you to confirm the next step.

The update history is retained in the Updates tab with the completion timestamp. The update plans that were not started and can no longer be used are cleaned up automatically.

Configure update auto-pause

Available since MOSK 25.1 TechPreview

Using the UpdateAutoPause object, the operator can define specific StackLight alerts that trigger auto-pause of an update phase execution in a MOSK cluster. The feature enhances update management of MOSK clusters by preventing harmful changes from being propagated across the entire cloud.

Note

The feature is not available for management clusters.

When an update auto-pause is configured on a cluster, the following workflow applies:

  • During cluster updates, the system continuously monitors for the alerts defined in the UpdateAutoPause object

  • If any configured alert fires:

    • The update process automatically pauses

    • The commence field is removed from all steps that have not started

    • The commence field is removed from the steps related to Update host OS and Kubernetes components on worker nodes even if the step is in progress, and the step is paused

    • The ClusterUpdatePlan status changes to AutoPaused

    • The firing alerts are recorded in the UpdateAutoPause status

    • A condition is added to the Cluster object indicating the pause state
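
The workflow above can be modeled with a short Python sketch. It only simulates the documented effects on a ClusterUpdatePlan dictionary and does not reflect the actual controller implementation. The function name simulate_auto_pause is hypothetical, and treating Scheduled steps as not started is an assumption that this section does not confirm.

```python
import copy

# Step ID convention for worker update groups: k8s-workers-<UpdateGroupName>
WORKER_STEP_PREFIX = "k8s-workers-"


def simulate_auto_pause(plan, configured_alerts, firing_alerts):
    """Return (plan_after_pause, triggering_alerts) without mutating the input."""
    triggering = sorted(set(configured_alerts) & set(firing_alerts))
    if not triggering:
        return plan, []

    paused = copy.deepcopy(plan)
    statuses = {
        step["id"]: step.get("status", "NotStarted")
        for step in paused.get("status", {}).get("steps", [])
    }
    for step in paused["spec"]["steps"]:
        state = statuses.get(step["id"], "NotStarted")
        # Assumption: Scheduled steps are treated as "not started".
        not_started = state in ("NotStarted", "Scheduled")
        # Worker steps lose the commence field even while in progress.
        worker_in_progress = (
            step["id"].startswith(WORKER_STEP_PREFIX) and state == "InProgress"
        )
        if not_started or worker_in_progress:
            step.pop("commence", None)
    paused.setdefault("status", {})["status"] = "AutoPaused"
    return paused, triggering
```

Only alerts that are both listed in the UpdateAutoPause object and currently firing trigger the pause. Steps that are already completed, and non-worker steps that are in progress, keep their commence field in this model.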

Configure auto-pausing of a MOSK cluster update
  1. Verify that StackLight is enabled on the MOSK cluster.

  2. Create an UpdateAutoPause object with the name that matches your cluster name within the cluster namespace. For example:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: UpdateAutoPause
    metadata:
      name: managed-cluster-example    # Must match cluster name
      namespace: managed-cluster-ns   # Must match cluster namespace
    spec:
      alerts:
        - AlertName1
        - AlertName2
    

    The list of alerts can include standard and custom StackLight alerts previously configured for the cluster.

    For the object spec, see UpdateAutoPause resource.

  3. Apply the configuration:

    kubectl apply -f update-autopause.yaml
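
If you prefer to generate the manifest rather than write it by hand, the following Python sketch builds the same UpdateAutoPause structure as the example in step 2. The helper name build_update_auto_pause is hypothetical. The sketch prints JSON, which kubectl apply accepts in the same way as YAML.

```python
import json


def build_update_auto_pause(cluster_name, namespace, alerts):
    """Build an UpdateAutoPause manifest as a plain dictionary."""
    return {
        "apiVersion": "kaas.mirantis.com/v1alpha1",
        "kind": "UpdateAutoPause",
        "metadata": {
            "name": cluster_name,    # must match the cluster name
            "namespace": namespace,  # must match the cluster namespace
        },
        # Drop duplicate alert names while preserving their order.
        "spec": {"alerts": list(dict.fromkeys(alerts))},
    }


manifest = build_update_auto_pause(
    "managed-cluster-example",
    "managed-cluster-ns",
    ["AlertName1", "AlertName2"],
)
print(json.dumps(manifest, indent=2))
```

Save the output to a file, for example update-autopause.yaml, and apply it as described in step 3.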
    
Resume paused updates
  1. Select one of the following options:

    • Investigate and resolve the conditions that triggered the alerts, then wait for the alerts to clear automatically

    • Remove the problematic alert from the UpdateAutoPause configuration

  2. Set the commence field to true for the relevant ClusterUpdatePlan steps to resume the update.

Caution

Admission Controller blocks attempts to set commence: true while alerts defined in the UpdateAutoPause object are still firing.

Monitor the status of an update auto-pause

You can monitor the status of an update auto-pause using the following resources:

  • The UpdateAutoPause object status:

    kubectl get updateautopause <cluster-name> -n <namespace> -o yaml
    
  • The ClusterUpdatePlan object status that displays the following details:

    • The AutoPaused status when updates are paused.

    • Messages indicating which alerts caused the pause and other relevant information.

  • StackLight alerts:

    • ClusterUpdateAutoPaused, which indicates that an update is currently paused.

    • ClusterUpdateStepAutoPaused, which describes specific steps that are paused.

    For alert details, see StackLight alerts: Container Cloud.

Calculate a maintenance window duration for update Deprecated

Deprecation notice

The maintenance window duration calculator is deprecated. Starting from MOSK 25.1, cloud operators should use the ClusterUpdatePlan API instead. For details, refer to ClusterUpdatePlan resource.

This section provides an online calculator for quick calculation of the approximate time required to update your MOSK cluster that uses Open vSwitch as a networking backend.

Additionally, for a more accurate calculation, consider any cluster-specific factors that can have a large impact on the update time in some edge cases, such as number of routers, frequency of CPU, and so on.
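
As a rough replacement for the calculator, the estimated durations of all steps in a ClusterUpdatePlan object can be summed. The Python sketch below assumes that the duration.estimated values use the hour, minute, and second notation shown in the ClusterUpdatePlan example (such as 1h30m0s). The function names parse_duration and total_estimate are hypothetical.

```python
import re
from datetime import timedelta

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)([hms])")


def parse_duration(text):
    """Parse a duration such as '1h30m0s' or '26m36.35s' into a timedelta."""
    total = 0.0
    position = 0
    for match in _TOKEN.finditer(text):
        if match.start() != position:
            break  # unexpected characters between tokens
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    if position == 0 or position != len(text):
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=total)


def total_estimate(plan):
    """Sum duration.estimated over all steps in the plan spec."""
    return sum(
        (parse_duration(step["duration"]["estimated"])
         for step in plan["spec"]["steps"]),
        timedelta(),
    )


# Estimates copied from the ClusterUpdatePlan example in this guide.
plan = {"spec": {"steps": [
    {"id": "openstack", "duration": {"estimated": "1h30m0s"}},
    {"id": "ceph", "duration": {"estimated": "8m30s"}},
    {"id": "k8s-controllers", "duration": {"estimated": "1h40m0s"}},
    {"id": "k8s-workers-vdrok-child-default", "duration": {"estimated": "8h0m0s"}},
    {"id": "mcc-components", "duration": {"estimated": "1h30m0s"}},
]}}
print(total_estimate(plan))
```

Because the ETA of each step is defined statically, treat the sum as an upper-level approximation and adjust it for the cluster-specific factors mentioned above.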

Getting access

This section contains instructions on how to get access to different systems of a MOSK cluster.

To obtain endpoints of the MKE web UI and StackLight web UIs such as Prometheus, Alertmanager, Alerta, OpenSearch Dashboards, and Grafana, in the Clusters tab of the Container Cloud web UI, navigate to More > Cluster info.

Note

The Alertmanager web UI displays alerts received by all configured receivers, which can be mistaken for duplicates. To only display the alerts received by a particular receiver, use the Receivers filter.

Generate a kubeconfig for a MOSK cluster using API

This section describes how to generate a MOSK cluster kubeconfig using the Container Cloud API. You can also download a MOSK cluster kubeconfig using the Download Kubeconfig option in the Container Cloud web UI. For details, see Connect to a MOSK cluster.

To generate a MOSK cluster kubeconfig using API:

  1. Obtain the following details:

    • Your <username> with the corresponding password that were created after the management cluster bootstrap as described in Create initial users after a management cluster bootstrap.

    • The kubeconfig of your <username> that you can download through the Container Cloud web UI using Download Kubeconfig located under your <username> on the top-left of the page.

  2. Obtain the <cluster> object of the <cluster_name> MOSK cluster:

    kubectl get cluster <cluster_name> -n <project_name> -o yaml
    
  3. Obtain the access token from Keycloak for the <username> user:

    curl -d 'client_id=<cluster.status.providerStatus.oidc.clientId>' \
      --data-urlencode 'username=<username>' \
      --data-urlencode 'password=<password>' \
      -d 'grant_type=password' \
      -d 'response_type=id_token' \
      -d 'scope=openid' \
      <cluster.status.providerStatus.oidc.issuerURL>/protocol/openid-connect/token
    
  4. Generate the MOSK cluster kubeconfig using the data from <cluster.status> and <token> obtained in the previous steps. Use the following template as an example:

    apiVersion: v1
    clusters:
      - name: <cluster_name>
        cluster:
          certificate-authority-data: <cluster.status.providerStatus.apiServerCertificate>
          server: https://<cluster.status.providerStatus.loadBalancerHost>:443
    contexts:
      - context:
          cluster: <cluster_name>
          user: <username>
        name: <username>@<cluster_name>
    current-context: <username>@<cluster_name>
    kind: Config
    preferences: {}
    users:
      - name: <username>
        user:
          auth-provider:
            config:
              client-id: <cluster.status.providerStatus.oidc.clientId>
              idp-certificate-authority-data: <cluster.status.providerStatus.oidc.certificate>
              idp-issuer-url: <cluster.status.providerStatus.oidc.issuerUrl>
              refresh-token: <token.refresh_token>
              id-token: <token.id_token>
            name: oidc
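
The procedure above can also be scripted. The following Python sketch assembles the token request and the kubeconfig structure from the providerStatus section of the Cluster object and from the Keycloak token response. It is an illustration only: the function names are hypothetical, the key names follow the template in step 4 (including the issuerUrl casing, which is an assumption), and the sketch does not send any HTTP request.

```python
import json
from urllib.parse import urlencode


def token_request(provider_status, username, password):
    """Return the token endpoint URL and the form-encoded request body."""
    oidc = provider_status["oidc"]
    # Assumption: the issuer key is spelled as in the kubeconfig template.
    url = oidc["issuerUrl"].rstrip("/") + "/protocol/openid-connect/token"
    body = urlencode({
        "client_id": oidc["clientId"],
        "username": username,
        "password": password,
        "grant_type": "password",
        "response_type": "id_token",
        "scope": "openid",
    })
    return url, body


def build_kubeconfig(cluster_name, username, provider_status, token):
    """Fill in the kubeconfig template using cluster status and token data."""
    oidc = provider_status["oidc"]
    context_name = f"{username}@{cluster_name}"
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "preferences": {},
        "clusters": [{
            "name": cluster_name,
            "cluster": {
                "certificate-authority-data": provider_status["apiServerCertificate"],
                "server": f"https://{provider_status['loadBalancerHost']}:443",
            },
        }],
        "contexts": [{
            "name": context_name,
            "context": {"cluster": cluster_name, "user": username},
        }],
        "current-context": context_name,
        "users": [{
            "name": username,
            "user": {"auth-provider": {
                "name": "oidc",
                "config": {
                    "client-id": oidc["clientId"],
                    "idp-certificate-authority-data": oidc["certificate"],
                    "idp-issuer-url": oidc["issuerUrl"],
                    "refresh-token": token["refresh_token"],
                    "id-token": token["id_token"],
                },
            }},
        }],
    }
```

Serializing the returned dictionary with json.dumps produces a file that can be used as a kubeconfig, because JSON is a subset of YAML.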
    
Connect to a MOSK cluster

Note

The Container Cloud web UI communicates with Keycloak to authenticate users. Keycloak is exposed using HTTPS with self-signed TLS certificates that are not trusted by web browsers.

To use your own TLS certificates for Keycloak, refer to Configure TLS certificates for cluster applications.

After you deploy a MOSK management or managed cluster, connect to the cluster to verify the availability and status of the nodes as described below.

To connect to a MOSK cluster:

  1. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with the Machines list opens.

  4. Verify the status of the manager nodes. Once the first manager node is deployed and has the Ready status, the Download Kubeconfig option for the cluster being deployed becomes active.

  5. Open the Clusters tab.

  6. Click the More action icon in the last column of the required cluster and select Download Kubeconfig:

    1. Enter your user password.

    2. Not recommended. Select Offline Token to generate an offline IAM token. Otherwise, for security reasons, the kubeconfig token expires every 30 minutes of the Container Cloud API idle time and you have to download kubeconfig again with a newly generated token.

    3. Click Download.

  7. Verify the availability of the managed cluster machines:

    1. Export the kubeconfig parameters to your local machine with access to kubectl. For example:

      export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml
      
    2. Obtain the list of available machines:

      kubectl get nodes -o wide
      

      The system response must contain the details of the nodes in the READY status.

To connect to a management cluster:

  1. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  2. Obtain the list of available management cluster machines:

    kubectl get nodes -o wide
    

    The system response must contain the details of the nodes in the READY status.
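
To check node readiness programmatically instead of reading the table, you can process the JSON form of the same command. The Python sketch below is a minimal example with a hypothetical function name, not_ready_nodes. It expects the output of kubectl get nodes -o json and returns the names of nodes whose Ready condition is not True.

```python
import json


def not_ready_nodes(nodes_json):
    """Return the names of nodes that do not report Ready=True."""
    document = json.loads(nodes_json)
    result = []
    for item in document.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        ready = any(
            condition.get("type") == "Ready" and condition.get("status") == "True"
            for condition in conditions
        )
        if not ready:
            result.append(item["metadata"]["name"])
    return result
```

An empty list means that all nodes returned by the command are in the Ready status.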

Access the Keycloak Admin Console

Using the Keycloak Admin Console, you can create or delete a user as well as grant or revoke roles to or from a user. The Keycloak administrator is responsible for assigning roles to users depending on the level of access they need in a cluster.

Obtain access credentials using the CLI

Available since MCC 2.22.0 (Cluster release 11.6.0)

./container-cloud get keycloak-creds --mgmt-kubeconfig <pathToManagementClusterKubeconfig>

Optionally, use the --output key to save credentials in a YAML file.

Example of system response:

Keycloak admin credentials:
Address: https://<keycloak-ip-address>/auth
Login: keycloak
Password: foobar

Obtain access credentials using kubectl

kubectl get cluster <mgmtClusterName> -o=jsonpath='{.status.providerStatus.helm.releases.iam.keycloak.url}'

The system response contains the URL to access the Keycloak Admin Console. The user name is keycloak by default. The password is located in passwords.yaml generated during bootstrap.

You can also obtain the password from the iam-api-secrets secret in the kaas namespace of the management cluster and decode the content of the keycloak_password key:

kubectl get secret iam-api-secrets -n kaas -o=jsonpath='{.data.keycloak_password}' | base64 -d
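
The same decoding can be done in a script. The Python sketch below assumes that the secret was retrieved in JSON format, for example, with kubectl get secret iam-api-secrets -n kaas -o json. The function name decode_secret_value is hypothetical. The sketch relies only on the fact that Kubernetes stores the values of the data section of a Secret in base64 encoding.

```python
import base64
import json


def decode_secret_value(secret_json, key):
    """Decode one base64-encoded value from the data section of a Secret."""
    secret = json.loads(secret_json)
    return base64.b64decode(secret["data"][key]).decode("utf-8")


# Hypothetical secret content used only for illustration.
example = json.dumps({"data": {"keycloak_password": "Zm9vYmFy"}})
print(decode_secret_value(example, "keycloak_password"))
```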

Access the Tungsten Fabric web UI

The Tungsten Fabric (TF) web UI allows for easy and fast TF resources configuration, monitoring, and debugging. You can access the TF web UI through either the Ingress service or the Kubernetes Service directly. TLS termination for the https protocol is performed through the Ingress service.

Note

Mirantis OpenStack for Kubernetes provides the TF web UI as is and does not include this service in the support Service Level Agreement.

To access the TF web UI through Ingress:

  1. Log in to a local machine where kubectl is installed.

  2. Obtain and export kubeconfig of your managed cluster as described in Connect to a MOSK cluster.

  3. Obtain the password of the Admin user:

    kubectl -n openstack get secret keystone-keystone-admin -ojsonpath='{.data.OS_PASSWORD}' | base64 -d
    
  4. Obtain the external IP address of the Ingress service:

    kubectl -n openstack get services ingress
    

    Example of system response:

    NAME      TYPE           CLUSTER-IP    EXTERNAL-IP    PORT(S)                                      AGE
    ingress   LoadBalancer   10.96.32.97   10.172.1.101   80:34234/TCP,443:34927/TCP,10246:33658/TCP   4h56m
    

    Note

    Do not use the EXTERNAL-IP value to directly access the TF web UI. Instead, use the FQDN from the list below.

  5. Obtain the FQDN of tf-webui:

    Note

    The command below outputs all host names assigned to the TF web UI service. Use one of them.

    kubectl -n tf get ingress tf-webui -o custom-columns=HOSTS:.spec.rules[*].host
    
  6. Configure DNS to access the TF web UI host as described in Configure DNS to access OpenStack.

  7. Use your favorite browser to access the TF web UI at https://<FQDN-WEBUI>.
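
If you retrieve the tf-webui Ingress object in JSON format (kubectl -n tf get ingress tf-webui -o json), the candidate URLs can be derived with a few lines of Python. This is a sketch with a hypothetical function name, webui_urls. It reads the same spec.rules[*].host fields as the custom-columns command in step 5.

```python
import json


def webui_urls(ingress_json):
    """Return https URLs for every host defined in the Ingress rules."""
    ingress = json.loads(ingress_json)
    rules = ingress.get("spec", {}).get("rules", [])
    hosts = [rule["host"] for rule in rules if rule.get("host")]
    # Keep the first occurrence of each host and preserve the order.
    return [f"https://{host}" for host in dict.fromkeys(hosts)]
```

Any of the returned URLs can be used once DNS is configured as described in step 6.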

Add a cluster to Lens

For quick and easy inspection and monitoring, you can add a MOSK cluster to Lens using the Container Cloud web UI. The following options are available in the More action icon menu of each cluster:

  • Add cluster to Lens

  • Open cluster in Lens

Before you can start monitoring your clusters in Lens, install the Container Cloud Lens extension as described below.

Install the Container Cloud Lens extension
  1. Start Lens.

  2. Verify that your Lens version is 4.2.4 or later.

  3. Select Lens > Extensions.

  4. Copy and paste the following text into the Install Extension field:

    @mirantis/lens-extension-cc
    
  5. Click Install.

  6. Verify that the Container Cloud Lens extension appears in the Installed Extensions section.

Add a cluster to Lens
  1. Enable your browser to open pop-ups for the Container Cloud web UI.

  2. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

  3. Open the Clusters tab.

  4. Verify that the target cluster is successfully deployed and is in the Ready status.

  5. In the last column of the target cluster area, click the More action icon and select Add cluster to Lens.

  6. In the Add Cluster To Lens window, click Add. The system redirects you to Lens that now contains the previously added cluster.

    Caution

    If prompted, allow your browser to open Lens.

Open a cluster in Lens
  1. Add the target cluster to Lens as described above.

  2. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

  3. Open the Clusters tab.

  4. In the last column of the target cluster area, click the More action icon and select Open cluster in Lens.

OpenStack operations

This section covers the management aspects of an OpenStack cluster deployed on Kubernetes.

Upgrade OpenStack

This section provides instructions on how to upgrade OpenStack to a major version on a MOSK cluster.

Note

The update of the OpenStack components within the same major OpenStack version is performed seamlessly as part of the MOSK cluster update.

Prerequisites
  1. Verify that your OpenStack cloud is running on the latest MOSK release. See Release Compatibility Matrix for the release matrix and supported upgrade paths.

  2. Just before the upgrade, back up your OpenStack databases. See the following documentation for details:

  3. Verify that OpenStack is healthy and operational. All OpenStack components in the health group in the OpenStackDeploymentStatus CR should be in the Ready state. See OpenStackDeploymentStatus custom resource for details.

  4. Verify the workability of your OpenStack deployment by running Tempest against the OpenStack cluster as described in Run Tempest tests. Verification of the testing pass rate before upgrading will help you measure your cloud quality before and after upgrade.

  5. Read carefully through the Release Notes of your MOSK version paying attention to the Known issues section and the OpenStack upstream release notes for the target OpenStack version.

  6. Calculate the maintenance window using Plan the cluster update and the deprecated Calculate a maintenance window duration for update, and notify users.

  7. When upgrading to OpenStack Yoga, remove the Panko service from the cloud by removing the event entry from the spec:features:services structure in the OpenStackDeployment resource as described in Remove an OpenStack service.

    Note

    The OpenStack Panko service has been removed from the product and is no longer maintained in the upstream OpenStack. See the project repository page for details.

Perform the upgrade

To start the OpenStack upgrade, change the value of the spec:openstack_version parameter in the OpenStackDeployment object to the target OpenStack release.

After you change the value of the spec:openstack_version parameter, the OpenStack Controller initializes the upgrade process.

To verify the upgrade status, use:

  • Logs from the osdpl container in the OpenStack Controller rockoon pod.

  • The OpenStackDeploymentStatus object.

    When the upgrade starts, the OPENSTACK VERSION field content changes to the target OpenStack version, and STATE displays APPLYING:

    kubectl -n openstack get osdplst
    

    Example of system output:

    NAME      OPENSTACK VERSION   CONTROLLER VERSION   STATE
    osh-dev   antelope            0.15.6               APPLYING
    

    When the upgrade finishes, the STATE field should display APPLIED:

    kubectl -n openstack get osdplst
    

    Example of system output:

    NAME      OPENSTACK VERSION   CONTROLLER VERSION   STATE
    osh-dev   antelope            0.15.6               APPLIED
    

The maintenance window for the OpenStack upgrade usually takes from two to four hours, depending on the cloud size.
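
When the upgrade is monitored from a script, the tabular output shown above can be parsed to decide whether the upgrade has finished. The Python sketch below is one possible approach, with hypothetical function names. It assumes that columns are separated by at least two spaces, as in the example output, because the OPENSTACK VERSION and CONTROLLER VERSION headers themselves contain a single space.

```python
import re

_COLUMN_SEPARATOR = re.compile(r"\s{2,}")


def parse_osdplst(output):
    """Parse the table printed by 'kubectl -n openstack get osdplst'."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = _COLUMN_SEPARATOR.split(lines[0])
    return [dict(zip(header, _COLUMN_SEPARATOR.split(line))) for line in lines[1:]]


def upgrade_finished(output, target_version):
    """Return True if every row reports the target version in the APPLIED state."""
    rows = parse_osdplst(output)
    return bool(rows) and all(
        row.get("OPENSTACK VERSION") == target_version
        and row.get("STATE") == "APPLIED"
        for row in rows
    )
```

For example, upgrade_finished(output, "antelope") returns False for the first example output above, where STATE is APPLYING, and True for the second one.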

Verify the upgrade
  1. Verify that OpenStack is healthy and operational. All OpenStack components in the health group in the OpenStackDeploymentStatus CR should be in the Ready state. See OpenStackDeploymentStatus custom resource for details.

  2. Verify the workability of your OpenStack deployment by running Tempest against the OpenStack cluster as described in Run Tempest tests.

Upgrade from Antelope to Caracal

Before upgrading, verify that you have completed the Prerequisites and removed the domains from federation mappings as described below.

Warning

If your MOSK cluster is running version 24.3 and includes the Instance High Availability service (OpenStack Masakari), the OpenStack upgrade will fail due to an incorrect migration of the Masakari database from legacy SQLAlchemy Migrate to Alembic caused by a misconfigured alembic_table. To avoid this issue, follow the workaround steps outlined in [47603] Masakari fails during the OpenStack upgrade to Caracal before proceeding with the upgrade.

Warning

If you initially deployed your MOSK cluster with OpenStack Victoria or earlier releases and gradually upgraded it to Antelope, and you do not perform periodic cleanups of OpenStack databases from soft-deleted rows, the upgrade will get stuck due to the failing cinder-db-sync job.

To prevent, diagnose, and fix this issue, perform the workaround steps outlined in [47695] Cinder database sync job fails during upgrade from Antelope to Caracal before proceeding with the upgrade.

MOSK enables you to upgrade directly from Antelope to Caracal without the need to upgrade to the intermediate Bobcat release. To upgrade the cloud, complete the steps described in Perform the upgrade, changing the value of the spec:openstack_version parameter in the OpenStackDeployment object from antelope to caracal.

Remove domains from federation mappings

Important

Remove the domains from the federation mappings if your MOSK cluster configuration includes a federated identity management system, such as IAM or any other supported identity provider.

Before Caracal, Keystone does not properly handle domain specifications for users in mappings. Even though domains are specified for users, Keystone always creates users in the domain associated with the identity provider the user logs in from.

Starting with Caracal, Keystone honors the domains specified for users in mappings. Many example mappings, including the previous default mapping in MOSK, use domain specifications. After upgrading to Caracal, new users logging in through federation may be assigned to a different Keystone domain, while existing users will retain their current domain. This behavior may negatively impact monitoring, compliance, and overall cluster operations.

To maintain the same functionality after the upgrade, edit the previous default mappings: remove the domain element from the local.user element, and remove the local element that sets the default domain values for the user and group elements.

You can use the openstack mapping commands to manage mappings:

  • To list available mappings: openstack mapping list

  • To display the mapping rules: openstack mapping show <name>

  • To modify the mapping rules: openstack mapping set <name> --rules <rules>

Example mapping rules in Antelope:

[
  {
    "local": [
      {
        "user": {
          "name": "{0}",
          "email": "{1}",
          "domain": {
            "name": "Default"
          }
        }
      },
      {
        "groups": "{2}",
        "domain": {
          "name": "Default"
        }
      },
      {
        "domain": {
          "name": "Default"
        }
      }
    ],
    "remote": [
      {
        "type": "OIDC-iam_username"
      },
      {
        "type": "OIDC-email"
      },
      {
        "type": "OIDC-iam_roles"
      }
    ]
  }
]

Example mapping rules in Caracal:

[
  {
    "local": [
      {
        "user": {
          "name": "{0}",
          "email": "{1}"
        }
      },
      {
        "groups": "{2}",
        "domain": {
          "name": "Default"
        }
      }
    ],
    "remote": [
      {
        "type": "OIDC-iam_username"
      },
      {
        "type": "OIDC-email"
      },
      {
        "type": "OIDC-iam_roles"
      }
    ]
  }
]
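
The difference between the two examples above can also be produced programmatically. The following Python sketch is an illustration only and is not part of MOSK: the function name strip_user_domains and the ANTELOPE_RULES constant are hypothetical, and the code assumes that the mapping rules are available as a JSON list with the structure shown above. It removes the domain element from every local.user element and drops local elements that contain nothing but a default domain, while keeping the domain of the groups element, as in the Caracal example.

```python
import copy
import json

# Rules copied from the Antelope example above.
ANTELOPE_RULES = [
    {
        "local": [
            {"user": {"name": "{0}", "email": "{1}", "domain": {"name": "Default"}}},
            {"groups": "{2}", "domain": {"name": "Default"}},
            {"domain": {"name": "Default"}},
        ],
        "remote": [
            {"type": "OIDC-iam_username"},
            {"type": "OIDC-email"},
            {"type": "OIDC-iam_roles"},
        ],
    }
]


def strip_user_domains(rules):
    """Return a copy of the mapping rules without user and default domains."""
    cleaned = copy.deepcopy(rules)
    for rule in cleaned:
        if "local" not in rule:
            continue
        kept = []
        for element in rule["local"]:
            # Remove the domain nested in the local.user element.
            if isinstance(element.get("user"), dict):
                element["user"].pop("domain", None)
            # Skip a local element that only sets the default domain.
            if set(element) == {"domain"}:
                continue
            kept.append(element)
        rule["local"] = kept
    return cleaned


print(json.dumps(strip_user_domains(ANTELOPE_RULES), indent=2))
```

Review the printed rules before supplying them to the openstack mapping set command listed above. The input rules are left unmodified, so you can compare both versions.
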
Upgrade from Yoga to Antelope

MOSK enables you to upgrade directly from Yoga to Antelope without the need to upgrade to the intermediate Zed release.

Before upgrading, verify that you have completed the Prerequisites.

Important

There are several known issues affecting MOSK clusters running OpenStack Antelope that can disrupt the network connectivity of the cloud workloads.

If your cluster is still running OpenStack Yoga, update to the MOSK 24.2.1 patch release first and only then upgrade to OpenStack Antelope. If you have not been applying patch releases previously and would prefer to switch back to major releases-only mode, you will be able to do this when MOSK 24.3 is released.

If you have updated your cluster to OpenStack Antelope, apply the workarounds described in Release notes: OpenStack known issues for the following issues:

  • [45879] [Antelope] Incorrect packet handling between instance and its gateway

  • [44813] Traffic disruption observed on trunk ports

To upgrade the cloud, complete the upgrade steps, changing the value of the spec:openstack_version parameter in the OpenStackDeployment object from yoga to antelope.

Upgrade from Victoria to Yoga

Caution

If your cluster is running on top of the MOSK 23.1.2 patch version, the OpenStack upgrade to Yoga may fail due to the delay in the Cinder start. For the workaround, see 23.1.2 known issues: OpenStack upgrade failure.

Before upgrading, verify that you have completed the Prerequisites.

If your cloud runs on top of the OpenStack Victoria release, you must first upgrade to the technical OpenStack releases Wallaby and Xena before upgrading to Yoga.

Caution

The Wallaby and Xena releases are not recommended for long-term production use. These versions are transitional, so-called technical releases with limited testing scopes. For the OpenStack versions support cycle, refer to OpenStack support cycle.

To upgrade the cloud, complete the upgrade steps for each release in the following strict order:

  1. Upgrade the cloud from victoria to wallaby

  2. Upgrade the cloud from wallaby to xena

  3. Upgrade the cloud from xena to yoga
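
The upgrade order described in this section can be summarized in a small lookup table. The following Python sketch is an illustration only: the names UPGRADE_STEPS and upgrade_path are hypothetical, and the table encodes only the transitions mentioned above (Victoria through Wallaby and Xena to Yoga, then directly to Antelope and Caracal). It returns the sequence of spec:openstack_version values to apply, one upgrade at a time.

```python
# Supported single-step upgrades as described in this section.
# Wallaby and Xena are transitional releases; Zed and Bobcat are skipped.
UPGRADE_STEPS = {
    "victoria": "wallaby",
    "wallaby": "xena",
    "xena": "yoga",
    "yoga": "antelope",
    "antelope": "caracal",
}


def upgrade_path(current, target):
    """Return the ordered openstack_version values to apply, one at a time."""
    path = []
    release = current
    while release != target:
        if release not in UPGRADE_STEPS:
            raise ValueError(f"no upgrade path from {current} to {target}")
        release = UPGRADE_STEPS[release]
        path.append(release)
    return path


print(upgrade_path("victoria", "yoga"))  # ['wallaby', 'xena', 'yoga']
```

Each returned value corresponds to one complete run of the upgrade steps, including the prerequisites and warnings listed for that particular upgrade.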

Backup and restore OpenStack databases

Mirantis OpenStack for Kubernetes (MOSK) relies on the MariaDB Galera cluster to provide its OpenStack components with reliable storage of persistent data. Mirantis recommends backing up your OpenStack databases daily to ensure the safety of your cloud data. Also, you should always create an instant backup before updating your cloud or performing any kind of potentially disruptive experiment.

MOSK has a built-in automated backup routine that can be triggered manually or by schedule. Periodic backups are suspended by default but you can easily enable them through the OpenStackDeployment custom resource. For the details about enablement and configuration of the periodic backups, refer to Periodic OpenStack database backups in the Reference Architecture.

This section includes more intricate procedures that involve additional steps beyond editing the OpenStackDeployment custom resource, such as restoring the OpenStack database from a backup or configuring a remote storage for backups.

Enable OpenStack database remote backups

TechPreview

By default, MOSK stores the OpenStack database backups locally in the Mirantis Ceph cluster, which is a part of the same cloud.

Alternatively, MOSK enables you to save the backup data to an external storage. This section contains the details on how you, as a cloud operator, can configure a remote storage backend for OpenStack database backups.

In general, the built-in automated backup mechanism saves the data to the mariadb-phy-backup-data PersistentVolumeClaim (PVC), which is provisioned from the StorageClass specified in the spec.persistent_volume_storage_class parameter of the OpenStackDeployment custom resource (CR).

Configure a remote NFS storage for OpenStack backups
  1. If your MOSK cluster was originally deployed with the default backup storage, proceed with this step. Otherwise, skip it.

    1. Copy the already existing backup data to a storage different from the mariadb-phy-backup-data PVC.

    2. Remove the mariadb-phy-backup-data PVC manually:

      kubectl -n openstack delete pvc mariadb-phy-backup-data
      
  2. Enable the NFS backend in the OpenStackDeployment object by editing the backup section of the OpenStackDeployment CR as follows:

    spec:
      features:
        database:
          backup:
            enabled: true
            backend: pv_nfs
            pv_nfs:
              server: <ip-address/dns-name-of-the-server>
              path: <path-to-the-share-folder-on-the-server>
    
  3. Optional. Set the required mount options for the NFS mount command. You can set as many options of mount as you need. For example:

    spec:
      services:
        database:
          mariadb:
            values:
              volume:
                phy_backup:
                  nfs:
                    mountOptions:
                      - "nfsvers=4"
                      - "hard"
    
  4. Verify the mariadb-phy-backup-data PVC and NFS persistent volume (PV):

    kubectl -n openstack get pvc mariadb-phy-backup-data -o wide
    
    kubectl -n openstack get pv mariadb-phy-backup-data-nfs-pv -o yaml
    

    An example of a positive system response:

    NAME                      STATUS   VOLUME                           CAPACITY   ACCESS MODES   STORAGECLASS   AGE     VOLUMEMODE
    mariadb-phy-backup-data   Bound    mariadb-phy-backup-data-nfs-pv   20Gi       RWO                           5m40s   Filesystem
    
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      annotations:
        meta.helm.sh/release-name: openstack-mariadb
        meta.helm.sh/release-namespace: openstack
      <<<skipped>>>>
      name: mariadb-phy-backup-data-nfs-pv
      resourceVersion: "2279204"
      uid: 60db9f89-afc4-417b-bf44-8acab844f17e
    spec:
      accessModes:
      - ReadWriteOnce
      capacity:
        storage: 20Gi
      claimRef:
        apiVersion: v1
        kind: PersistentVolumeClaim
        name: mariadb-phy-backup-data
        namespace: openstack
        resourceVersion: "2279201"
        uid: e0e08d73-e56f-425a-ad4e-e5393aa3cdc1
      mountOptions:
      - nfsvers=4
      - hard
      nfs:
        path: /
        server: 10.10.0.116
      persistentVolumeReclaimPolicy: Retain
      volumeMode: Filesystem
    status:
      phase: Bound
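
The two OpenStackDeployment fragments from steps 2 and 3 above can be assembled into a single document. The following Python sketch is an illustration only: the function name nfs_backup_patch is hypothetical, and how you apply the result (for example, as a merge patch or by pasting it while editing the object) is an assumption left to your workflow. The keys and values are taken from the YAML examples above.

```python
import json


def nfs_backup_patch(server, path, mount_options=None):
    """Build the OpenStackDeployment fragment that enables the NFS backend."""
    patch = {
        "spec": {
            "features": {
                "database": {
                    "backup": {
                        "enabled": True,
                        "backend": "pv_nfs",
                        "pv_nfs": {"server": server, "path": path},
                    }
                }
            }
        }
    }
    if mount_options:
        # Optional NFS mount options go to the MariaDB chart values.
        patch["spec"]["services"] = {
            "database": {
                "mariadb": {
                    "values": {
                        "volume": {
                            "phy_backup": {
                                "nfs": {"mountOptions": list(mount_options)}
                            }
                        }
                    }
                }
            }
        }
    return patch


# Server address and path reuse the values from the example PV above.
print(json.dumps(nfs_backup_patch("10.10.0.116", "/", ["nfsvers=4", "hard"]), indent=2))
```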
    
Switch back to the local storage for OpenStack backups
  1. Remove NFS PVC and PV:

    kubectl -n openstack delete pvc mariadb-phy-backup-data
    
    kubectl -n openstack delete pv mariadb-phy-backup-data-nfs-pv
    
  2. Re-enable the local backup in the OpenStackDeployment CR:

    spec:
      features:
        database:
          backup:
            enabled: true
            backend: pvc
    
  3. Verify that the mariadb-phy-backup-data PVC uses the default PV:

    kubectl -n openstack get pvc mariadb-phy-backup-data
    

    An example of a positive system response:

    NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS              AGE
    mariadb-phy-backup-data   Bound    pvc-a4f6e24b-c05b-4a76-bca2-bb6a5c8ef5b5   20Gi       RWO            mirablock-k8s-block-hdd   80m
    
Restore OpenStack databases from a backup

During the OpenStack database restoration, the MariaDB cluster is unavailable due to the MariaDB StatefulSet being scaled down to 0 replicas. Therefore, to safely restore the state of the OpenStack database, plan the maintenance window thoroughly and in accordance with the database size.

The duration of the maintenance window may depend on the following:

  • Network throughput

  • Performance of the storage where backups are kept, which is Mirantis Ceph by default

  • Local disk performance of the nodes where MariaDB data resides

To restore OpenStack databases:

  1. Obtain an image of the MariaDB container:

    kubectl -n openstack get pods mariadb-server-0 -o jsonpath='{.spec.containers[0].image}'
    
  2. Create the check_pod.yaml file to create the helper pod required to view the backup volume content:

    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: check-backup-helper
      namespace: openstack
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: check-backup-helper
      namespace: openstack
      labels:
        application: check-backup-helper
    spec:
      nodeSelector:
        openstack-control-plane: enabled
      containers:
        - name: helper
          securityContext:
            allowPrivilegeEscalation: false
            runAsUser: 0
            readOnlyRootFilesystem: true
          command:
            - sleep
            - infinity
          image: << image of mariadb container >>
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: pod-tmp
              mountPath: /tmp
            - mountPath: /var/backup
              name: mysql-backup
      restartPolicy: Never
      serviceAccount: check-backup-helper
      serviceAccountName: check-backup-helper
      volumes:
        - name: pod-tmp
          emptyDir: {}
        - name: mariadb-secrets
          secret:
            secretName: mariadb-secrets
            defaultMode: 0444
        - name: mariadb-bin
          configMap:
            name: mariadb-bin
            defaultMode: 0555
        - name: mysql-backup
          persistentVolumeClaim:
            claimName: mariadb-phy-backup-data
    
  3. Create the helper pod:

    kubectl -n openstack apply -f check_pod.yaml
    
  4. Obtain the name of the backup to restore:

    kubectl -n openstack exec -t check-backup-helper -- tree /var/backup
    

    Example of system response:

    /var/backup
    |-- base
    |   `-- 2020-09-09_11-35-48
    |       |-- backup.stream.gz
    |       |-- backup.successful
    |       |-- grastate.dat
    |       |-- xtrabackup_checkpoints
    |       `-- xtrabackup_info
    |-- incr
    |   `-- 2020-09-09_11-35-48
    |       |-- 2020-09-10_01-02-36
    |       |-- 2020-09-11_01-02-02
    |       |-- 2020-09-12_01-01-54
    |       |-- 2020-09-13_01-01-55
    |       `-- 2020-09-14_01-01-55
    `-- lost+found
    
    10 directories, 5 files
    

    If you want to restore the full backup, the name from the example above is 2020-09-09_11-35-48. To restore a specific incremental backup, the name from the example above is 2020-09-09_11-35-48/2020-09-12_01-01-54.

    In the example above, the backups will be restored in the following strict order:

    1. 2020-09-09_11-35-48 - full backup, path /var/backup/base/2020-09-09_11-35-48

    2. 2020-09-10_01-02-36 - incremental backup, path /var/backup/incr/2020-09-09_11-35-48/2020-09-10_01-02-36

    3. 2020-09-11_01-02-02 - incremental backup, path /var/backup/incr/2020-09-09_11-35-48/2020-09-11_01-02-02

    4. 2020-09-12_01-01-54 - incremental backup, path /var/backup/incr/2020-09-09_11-35-48/2020-09-12_01-01-54

  5. Delete the helper pod:

    kubectl -n openstack delete -f check_pod.yaml
    
  6. Pass the following parameters to the mariadb_resque.py script from the OsDpl object:

    Parameter

    Type

    Default

    Description

    --backup-name

    String

    Name of a folder with backup in <BASE_BACKUP> or <BASE_BACKUP>/<INCREMENTAL_BACKUP>.

    --replica-restore-timeout

    Integer

    3600

    Timeout in seconds for the data of one replica to be restored to the mysql data directory. Also includes the time for spawning a rescue runner pod in Kubernetes and extracting data from a backup archive.

  7. Edit the OpenStackDeployment object as follows:

    spec:
      services:
        database:
          mariadb:
            values:
              manifests:
                job_mariadb_phy_restore: true
              conf:
                phy_restore:
                  backup_name: "2020-09-09_11-35-48/2020-09-12_01-01-54"
                  replica_restore_timeout: 7200
    
  8. Wait until the mariadb-phy-restore job succeeds:

    kubectl -n openstack get jobs mariadb-phy-restore -o jsonpath='{.status}'
    

    Important

    If mariadb-phy-restore fails, the MariaDB Pods do not start automatically. For example, the failure may occur due to a discrepancy between the current and backup versions of MariaDB, a broken backup archive, and so on.

    Assess the mariadb-phy-restore job log to identify the issue:

    kubectl -n openstack logs --tail=10000 -l application=mariadb-phy-restore,job-name=mariadb-phy-restore
    

    If the restoration process does not start due to the MariaDB version discrepancy:

    • Use another backup file with the corresponding MariaDB version for restoration, if any.

    • Start MariaDB Pods without restoration:

      kubectl scale --replicas=3 sts/mariadb-server -n openstack
      

    The command above restores the previous cluster state.

  9. The mariadb-phy-restore job is an immutable object. Therefore, remove the job after each successful execution. To correctly remove the job, clean up all the settings from the OpenStackDeployment object that you have configured during step 7 of this procedure. This will remove all related pods as well.

  10. Resolve database discrepancies by analysing the following resources that may be inconsistent in the restored snapshot as opposed to the original environment:

    • Leftover VMs, volumes, images, and other dynamic resources. For example:

      • A VM is removed after a snapshot for restoration is created. Such VM will be present as an orphan entry in the database and the OpenStack API after restoration.

      • A VM is created after a snapshot for restoration is created. Such VM will disappear from the OpenStack API after database restoration but will still be present as a process on the compute host.

    • Broken Octavia Amphorae that may become unresponsive after restoration, potentially requiring LoadBalancer failover

    • Other broken or leftover resources
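
The restore order described in step 4 of this procedure can be derived from the backup name. The following Python sketch is an illustration only: the function name restore_chain is hypothetical, and it assumes that incremental backup directories are named with sortable timestamps, as in the example output above. It lists the full backup first and then every incremental backup up to and including the requested one.

```python
def restore_chain(backup_name, incrementals, root="/var/backup"):
    """List backup paths in the order they are applied for backup_name.

    backup_name is <BASE_BACKUP> or <BASE_BACKUP>/<INCREMENTAL_BACKUP>.
    incrementals are the directory names found under incr/<BASE_BACKUP>.
    """
    base, _, target = backup_name.partition("/")
    chain = [f"{root}/base/{base}"]
    if not target:
        # Only the full backup was requested.
        return chain
    ordered = sorted(incrementals)  # timestamped names sort chronologically
    if target not in ordered:
        raise ValueError(f"incremental backup {target} not found for {base}")
    for name in ordered:
        chain.append(f"{root}/incr/{base}/{name}")
        if name == target:
            break
    return chain


# Directory names taken from the example output in step 4.
incrementals = [
    "2020-09-10_01-02-36",
    "2020-09-11_01-02-02",
    "2020-09-12_01-01-54",
    "2020-09-13_01-01-55",
    "2020-09-14_01-01-55",
]
for path in restore_chain("2020-09-09_11-35-48/2020-09-12_01-01-54", incrementals):
    print(path)
```

The result can help you estimate how much data the restore job has to process when you plan the maintenance window.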

Verify the periodic backup jobs for the OpenStack database
  1. Verify pods in the openstack namespace. After the backup jobs have succeeded, the pods stay in the Completed state:

    kubectl -n openstack get pods -l application=mariadb-phy-backup
    

    Example of a positive system response:

    NAME                                  READY   STATUS      RESTARTS   AGE
    mariadb-phy-backup-1599613200-n7jqv   0/1     Completed   0          43h
    mariadb-phy-backup-1599699600-d79nc   0/1     Completed   0          30h
    mariadb-phy-backup-1599786000-d5kc7   0/1     Completed   0          6h17m
    

    Note

    By default, the system keeps the three latest successful pods and the latest failed pod.

  2. Obtain an image of the MariaDB container:

    kubectl -n openstack get pods mariadb-server-0 -o jsonpath='{.spec.containers[0].image}'
    
  3. Create the check_pod.yaml file to create the helper pod required to view the backup volume content.

    Configuration example:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: check-backup-helper
      namespace: openstack
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: check-backup-helper
      namespace: openstack
      labels:
        application: check-backup-helper
    spec:
      nodeSelector:
        openstack-control-plane: enabled
      containers:
        - name: helper
          securityContext:
            allowPrivilegeEscalation: false
            runAsUser: 0
            readOnlyRootFilesystem: true
          command:
            - sleep
            - infinity
          image: << image of mariadb container >>
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: pod-tmp
              mountPath: /tmp
            - mountPath: /var/backup
              name: mysql-backup
      restartPolicy: Never
      serviceAccount: check-backup-helper
      serviceAccountName: check-backup-helper
      volumes:
        - name: pod-tmp
          emptyDir: {}
        - name: mariadb-secrets
          secret:
            secretName: mariadb-secrets
            defaultMode: 0444
        - name: mariadb-bin
          configMap:
            name: mariadb-bin
            defaultMode: 0555
        - name: mysql-backup
          persistentVolumeClaim:
            claimName: mariadb-phy-backup-data
    
  4. Apply the helper service account and pod resources:

    kubectl -n openstack apply -f check_pod.yaml
    kubectl -n openstack get pods -l application=check-backup-helper
    

    Example of a positive system response:

    NAME                  READY   STATUS    RESTARTS   AGE
    check-backup-helper   1/1     Running   0          27s
    
  5. Verify the directory structure within the /var/backup directory of the spawned pod:

    kubectl -n openstack exec -t check-backup-helper -- tree /var/backup
    

    Example of a system response:

    /var/backup
    |-- base
    |   `-- 2020-09-09_11-35-48
    |       |-- backup.stream.gz
    |       |-- backup.successful
    |       |-- grastate.dat
    |       |-- xtrabackup_checkpoints
    |       `-- xtrabackup_info
    |-- incr
    |   `-- 2020-09-09_11-35-48
    |       |-- 2020-09-10_01-02-36
    |       |   |-- backup.stream.gz
    |       |   |-- backup.successful
    |       |   |-- grastate.dat
    |       |   |-- xtrabackup_checkpoints
    |       |   `-- xtrabackup_info
    |       `-- 2020-09-11_01-02-02
    |           |-- backup.stream.gz
    |           |-- backup.successful
    |           |-- grastate.dat
    |           |-- xtrabackup_checkpoints
    |           `-- xtrabackup_info
    

    The base directory contains full backups. Each directory in the incr folder contains incremental backups related to a certain full backup in the base folder. All incremental backups always have the base backup name as parent folder.

  6. Delete the helper pod:

    kubectl delete -f check_pod.yaml
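
The layout rules described in step 5 can be checked automatically. The following Python sketch is an illustration only: the function name check_backup_tree is hypothetical, the backup directory is assumed to be reachable as a local path, and treating the backup.successful file as a marker of a completed backup is an assumption based on its name in the example output. The sketch reports incremental backup folders without a matching full backup and backup directories without the marker file.

```python
import tempfile
from pathlib import Path


def check_backup_tree(root):
    """Return a list of problems found in a backup directory layout."""
    root = Path(root)
    base_dir, incr_dir = root / "base", root / "incr"
    problems = []
    base_names = set()
    if base_dir.is_dir():
        base_names = {p.name for p in base_dir.iterdir() if p.is_dir()}
    for name in sorted(base_names):
        if not (base_dir / name / "backup.successful").is_file():
            problems.append(f"base/{name}: no backup.successful marker")
    if incr_dir.is_dir():
        for parent in sorted(p for p in incr_dir.iterdir() if p.is_dir()):
            # Incremental backups must have the full backup name as parent.
            if parent.name not in base_names:
                problems.append(f"incr/{parent.name}: no matching full backup")
            for incr in sorted(p for p in parent.iterdir() if p.is_dir()):
                if not (incr / "backup.successful").is_file():
                    problems.append(
                        f"incr/{parent.name}/{incr.name}: no backup.successful marker"
                    )
    return problems


# Demonstration on a temporary directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    full = Path(tmp, "base", "2020-09-09_11-35-48")
    full.mkdir(parents=True)
    (full / "backup.successful").touch()
    incr = Path(tmp, "incr", "2020-09-09_11-35-48", "2020-09-10_01-02-36")
    incr.mkdir(parents=True)  # marker file intentionally missing
    print(check_backup_tree(tmp))
```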
    
Add a controller node

This section describes how to add a new control plane node to the existing MOSK deployment.

To add an OpenStack controller node:

  1. Add a bare metal host to the MOSK cluster as described in Add a bare metal host.

    When adding the bare metal host YAML file, specify the following OpenStack control plane node labels for the OpenStack control plane services such as database, messaging, API, schedulers, conductors, L3 and L2 agents:

    • openstack-control-plane=enabled

    • openstack-gateway=enabled

    • openvswitch=enabled

  2. Create a Kubernetes machine in your cluster as described in Add a machine.

    When adding the machine, verify that the OpenStack control plane node has the following labels:

    • openstack-control-plane=enabled

    • openstack-gateway=enabled

    • openvswitch=enabled

    Note

    Depending on the applications that were colocated on the failed controller node, you may need to specify some additional labels, for example, ceph_role_mgr=true and ceph_role_mon=true. To successfully replace a failed mon and mgr node, refer to Ceph operations.

  3. Verify that the node is in the Ready state through the Kubernetes API:

    kubectl get node <NODE-NAME> -o wide | grep Ready
    
  4. Verify that the node has all required labels described in the previous steps:

    kubectl get nodes --show-labels
    
  5. Configure new Octavia health manager resources:

    1. Rerun the octavia-create-resources job:

      kubectl -n osh-system exec -t <OS-CONTROLLER-POD> -c osdpl osctl-job-rerun octavia-create-resources openstack
      
    2. Wait until the Octavia health manager pod on the newly added control plane node appears in the Running state:

      kubectl -n openstack get pods -o wide | grep <NODE_ID> | grep octavia-health-manager
      

      Note

      If the pod is in the CrashLoopBackOff state, remove it:

      kubectl -n openstack delete pod <OCTAVIA-HEALTH-MANAGER-POD-NAME>
      
    3. Verify that an OpenStack port for the node has been created and the node is in the Active state:

      kubectl -n openstack exec -t <KEYSTONE-CLIENT-POD-NAME> openstack port show octavia-health-manager-listen-port-<NODE-NAME>
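
The label verification from steps 2 and 4 of this procedure can be automated. The following Python sketch is an illustration only: the names REQUIRED_LABELS and missing_labels are hypothetical, and the input is assumed to be a node object in the JSON form that kubectl get node <NODE-NAME> -o json returns. The node document in the example is shortened and made up for demonstration.

```python
import json

# Control plane labels listed in steps 1 and 2 of this procedure.
REQUIRED_LABELS = {
    "openstack-control-plane": "enabled",
    "openstack-gateway": "enabled",
    "openvswitch": "enabled",
}


def missing_labels(node, required=REQUIRED_LABELS):
    """Return required labels that are absent or have another value."""
    labels = node.get("metadata", {}).get("labels", {})
    return {key: value for key, value in required.items() if labels.get(key) != value}


# Shortened, hypothetical node document for demonstration purposes.
node_json = """
{
  "metadata": {
    "name": "example-node",
    "labels": {
      "openstack-control-plane": "enabled",
      "openvswitch": "enabled"
    }
  }
}
"""
print(missing_labels(json.loads(node_json)))  # {'openstack-gateway': 'enabled'}
```

An empty result means that all three labels are present with the expected values. Additional labels, such as the Ceph ones mentioned in the note above, can be passed through the required argument.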
      
Replace a failed controller node

This section describes how to replace a failed control plane node in your MOSK deployment. The procedure applies to the control plane nodes that are, for example, permanently failed due to a hardware failure and appear in the NotReady state:

kubectl get nodes <CONTAINER-CLOUD-NODE-NAME>

Example of system response:

NAME                          STATUS     ROLES    AGE   VERSION
<CONTAINER-CLOUD-NODE-NAME>   NotReady   <none>   10d   v1.18.8-mirantis-1

To replace a failed controller node:

  1. Remove the Kubernetes labels from the failed node by editing the .metadata.labels field of the node object:

    kubectl edit node <CONTAINER-CLOUD-NODE-NAME>
    
  2. If your cluster is deployed with a compact control plane, inspect precautions for a cluster machine deletion.

  3. Add the control plane node to your deployment as described in Add a controller node.

  4. Identify all stateful applications present on the failed node:

    node=<CONTAINER-CLOUD-NODE-NAME>
    claims=$(kubectl -n openstack get pv -o jsonpath="{.items[?(@.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0] == '${node}')].spec.claimRef.name}")
    for i in $claims; do echo $i; done
    

    Example of system response:

    mysql-data-mariadb-server-2
    openstack-operator-bind-mounts-rfr-openstack-redis-1
    etcd-data-etcd-etcd-0
    
  5. For MOSK 23.3 series or earlier, reschedule stateful applications pods to healthy controller nodes as described in Reschedule stateful applications. For the newer versions, MOSK performs the rescheduling of stateful applications automatically.

  6. If the failed controller node had the StackLight label, fix the StackLight volume node affinity conflict as described in Delete a cluster machine.

  7. Remove the OpenStack port related to the Octavia health manager pod of the failed node:

    kubectl -n openstack exec -t <KEYSTONE-CLIENT-POD-NAME> openstack port delete octavia-health-manager-listen-port-<NODE-NAME>
    
  8. For clouds using Open Virtual Network (OVN) as the networking backend, remove the Northbound and Southbound database members for the failed node:

    1. Log in to the running openvswitch-ovn-db-XX pod.

    2. Remove an old Northbound database member:

      1. Identify the member to be removed:

        ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
        

        Example of system response:

        5d02
        Name: OVN_Northbound
        Cluster ID: 4d61 (4d61fde5-6cd5-449e-9846-34fcb470687b)
        Server ID: 5d02 (5d022977-982b-4de7-b125-e679746ece8d)
        Address: tcp:openvswitch-ovn-db-0.ovn-discovery.openstack.svc.cluster.local:6643
        Status: cluster member
        Role: follower
        Term: 5402
        Leader: c617
        Vote: c617
        
        Election timer: 10000
        Log: [22917, 26535]
        Entries not yet committed: 0
        Entries not yet applied: 0
        Connections: ->c617 ->4d1e <-c617 <-0e28
        Disconnections: 0
        Servers:
            c617 (c617 at tcp:openvswitch-ovn-db-2.ovn-discovery.openstack.svc.cluster.local:6643) last msg 1153 ms ago
            4d1e (4d1e at tcp:openvswitch-ovn-db-1.ovn-discovery.openstack.svc.cluster.local:6643)
            0e28 (0e28 at tcp:openvswitch-ovn-db-1.ovn-discovery.openstack.svc.cluster.local:6643) last msg 109828 ms ago
            5d02 (5d02 at tcp:openvswitch-ovn-db-0.ovn-discovery.openstack.svc.cluster.local:6643) (self)
        

        In the above example output, the 4d1e member belongs to the failed node.

      2. Remove the old member:

        ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/kick OVN_Northbound 4d1e
        sent removal request to leader
        
      3. Verify that the old member has been removed successfully:

        ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
        

        Example of a successful system response:

        5d02
        Name: OVN_Northbound
        Cluster ID: 4d61 (4d61fde5-6cd5-449e-9846-34fcb470687b)
        Server ID: 5d02 (5d022977-982b-4de7-b125-e679746ece8d)
        Address: tcp:openvswitch-ovn-db-0.ovn-discovery.openstack.svc.cluster.local:6643
        Status: cluster member
        Role: follower
        Term: 5402
        Leader: c617
        Vote: c617
        
        Election timer: 10000
        Log: [22917, 26536]
        Entries not yet committed: 0
        Entries not yet applied: 0
        Connections: ->c617 <-c617 <-0e28 ->0e28
        Disconnections: 1
        Servers:
            c617 (c617 at tcp:openvswitch-ovn-db-2.ovn-discovery.openstack.svc.cluster.local:6643) last msg 3321 ms ago
            0e28 (0e28 at tcp:openvswitch-ovn-db-1.ovn-discovery.openstack.svc.cluster.local:6643) last msg 134877 ms ago
            5d02 (5d02 at tcp:openvswitch-ovn-db-0.ovn-discovery.openstack.svc.cluster.local:6643) (self)
        
    3. Remove an old Southbound database member by following the same steps used to remove an old Northbound database member:

      1. Identify the member to be removed:

        ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
        
      2. Remove the old member:

        ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/kick OVN_Southbound <SERVER-ID>
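
To narrow down the member to remove, you can scan the Servers section of the cluster/status output. The following Python sketch is an illustration only: the function name silent_members is hypothetical, and the heuristic (a server entry that is neither marked as self nor shows a last msg value) is an assumption derived from the single example output above, where only the 4d1e entry matches. Always confirm the candidate manually before running cluster/kick.

```python
import re

# Matches lines such as "c617 (c617 at tcp:host:6643) last msg 1153 ms ago".
SERVER_RE = re.compile(r"^\s*([0-9a-f]{4}) \(\1 at (\S+)\)(.*)$")


def silent_members(status_output):
    """Return (server_id, address) pairs with no 'last msg' that are not self."""
    members = []
    in_servers = False
    for line in status_output.splitlines():
        if line.strip() == "Servers:":
            in_servers = True
            continue
        if not in_servers:
            continue
        match = SERVER_RE.match(line)
        if not match:
            continue
        server_id, address, rest = match.groups()
        if "(self)" in rest or "last msg" in rest:
            continue
        members.append((server_id, address))
    return members


# Servers section copied from the Northbound example output above.
example = """Servers:
    c617 (c617 at tcp:openvswitch-ovn-db-2.ovn-discovery.openstack.svc.cluster.local:6643) last msg 1153 ms ago
    4d1e (4d1e at tcp:openvswitch-ovn-db-1.ovn-discovery.openstack.svc.cluster.local:6643)
    0e28 (0e28 at tcp:openvswitch-ovn-db-1.ovn-discovery.openstack.svc.cluster.local:6643) last msg 109828 ms ago
    5d02 (5d02 at tcp:openvswitch-ovn-db-0.ovn-discovery.openstack.svc.cluster.local:6643) (self)
"""
print(silent_members(example))
```

The same function applies to the OVN_Southbound output because both commands print the Servers section in the same format.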
        
Add a compute node

This section describes how to add a new compute node to your existing Mirantis OpenStack for Kubernetes deployment.

To add a compute node:

  1. Add a bare metal host to the MOSK cluster as described in Add a bare metal host.

  2. Create a Kubernetes machine in your cluster as described in Add a machine.

    When adding the machine, specify the node labels as required for an OpenStack compute node:

    OpenStack node roles

    Node role

    Description

    Kubernetes labels

    Minimal count

    OpenStack control plane

    Hosts the OpenStack control plane services such as database, messaging, API, schedulers, conductors, L3 and L2 agents.

    openstack-control-plane=enabled
    openstack-gateway=enabled
    openvswitch=enabled

    3

    OpenStack compute

    Hosts the OpenStack compute services such as libvirt and L2 agents.

    openstack-compute-node=enabled
    openvswitch=enabled (for a deployment with Open vSwitch as a backend for networking)

    Varies

  3. If required, configure the compute host to enable DPDK, huge pages, SR-IOV, and other advanced features in your MOSK deployment. See Advanced OpenStack configuration (optional) for details.

  4. Once the node is available in Kubernetes and when the nova-compute and neutron pods are running on the node, verify that the compute service and Neutron Agents are healthy in OpenStack API.

    In the keystone-client pod, run:

    openstack network agent list --host <cmp_host_name>
    
    openstack compute service list --host <cmp_host_name>
    
  5. Verify that the compute service is mapped to a cell.

    The OpenStack Controller triggers the nova-cell-setup job once it detects a new compute pod in the Ready state. This job sets mapping for new compute services to cells.

    In the nova-api-osapi pod, run:

    nova-manage cell_v2 list_hosts | grep <cmp_host_name>
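
The health check from step 4 can be scripted. The following Python sketch is an illustration only: the function name compute_service_ok is hypothetical, and it assumes that you captured the output of openstack compute service list with the -f json formatter and that the entries carry the Binary, Host, Status, and State keys. The sample data is made up for demonstration.

```python
import json


def compute_service_ok(services, host):
    """Return True if nova-compute on the host exists, is enabled, and is up."""
    entries = [
        s for s in services
        if s.get("Host") == host and s.get("Binary") == "nova-compute"
    ]
    # A host without any nova-compute entry is not considered healthy.
    return bool(entries) and all(
        s.get("Status") == "enabled" and s.get("State") == "up" for s in entries
    )


# Hypothetical sample of the JSON-formatted command output.
sample = json.loads("""
[
  {"Binary": "nova-compute", "Host": "cmp-1", "Status": "enabled", "State": "up"},
  {"Binary": "nova-compute", "Host": "cmp-2", "Status": "enabled", "State": "down"}
]
""")
print(compute_service_ok(sample, "cmp-1"))  # True
print(compute_service_ok(sample, "cmp-2"))  # False
```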
    
Change oversubscription settings for existing compute nodes

Available since MOSK 23.1

MOSK enables you to control the oversubscription of compute node resources through the placement service API.

To manage the oversubscription through the placement API:

  1. Obtain the host name of the hypervisor in question:

    openstack hypervisor list -f yaml
    

    Example of system response:

    - Host IP: 10.10.0.78
      Hypervisor Hostname: ps-ps-obnqilm4xxlu-0-gdy3x46euaeu-server-ftp6p7j6pyjl.cluster.local
      Hypervisor Type: QEMU
      ID: 1
      State: up
    - Host IP: 10.10.0.118
      Hypervisor Hostname: ps-ps-obnqilm4xxlu-1-n36gax6zqgef-server-xtby2leuercd.cluster.local
      Hypervisor Type: QEMU
      ID: 7
      State: up
    
  2. Determine the resource provider that corresponds to the hypervisor:

    openstack resource provider list -f yaml --name <hypervisor_hostname>
    

    Example of system response:

    - generation: 4
      name: ps-ps-obnqilm4xxlu-1-n36gax6zqgef-server-xtby2leuercd.cluster.local
      parent_provider_uuid: null
      root_provider_uuid: b16e9094-3f0e-4b8e-a138-e0b1f0a980db
      uuid: b16e9094-3f0e-4b8e-a138-e0b1f0a980db
    
  3. Verify the current values in the resource provider by its UUID:

    openstack resource provider inventory list <provider_uuid> -f yaml
    

    Example of system response:

    - allocation_ratio: 8.0
      max_unit: 8
      min_unit: 1
      reserved: 0
      resource_class: VCPU
      step_size: 1
      total: 8
      used: 0
    - allocation_ratio: 1.0
      max_unit: 7956
      min_unit: 1
      reserved: 512
      resource_class: MEMORY_MB
      step_size: 1
      total: 7956
      used: 0
    - allocation_ratio: 1.6
      max_unit: 145
      min_unit: 1
      reserved: 0
      resource_class: DISK_GB
      step_size: 1
      total: 145
      used: 0
    
  4. Update the allocation ratio for the required resource class in the resource provider and inspect the system response to verify that the change has been applied:

    openstack resource provider inventory set <provider_uuid> --amend --resource VCPU:allocation_ratio=10
    

    Caution

    To ensure accurate resource updates, it is crucial to specify the --amend argument when making requests. Failure to do so will require the inclusion of values for all fields associated with the resource provider.

    Example of system response:

    - allocation_ratio: 10.0
      max_unit: 8
      min_unit: 1
      reserved: 0
      resource_class: VCPU
      step_size: 1
      total: 8
      used: 0
    - allocation_ratio: 1.0
      max_unit: 7956
      min_unit: 1
      reserved: 512
      resource_class: MEMORY_MB
      step_size: 1
      total: 7956
      used: 0
    - allocation_ratio: 1.6
      max_unit: 145
      min_unit: 1
      reserved: 0
      resource_class: DISK_GB
      step_size: 1
      total: 145
      used: 0
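
The effect of the allocation ratio can be estimated from the inventory values. The OpenStack Placement service calculates the capacity of a resource class as (total - reserved) * allocation_ratio. The following minimal sketch reproduces that arithmetic; the placement_capacity helper name is hypothetical and is not part of any MOSK or OpenStack tooling:

```shell
# Hypothetical helper: prints the capacity of one resource class,
# calculated as (total - reserved) * allocation_ratio and truncated
# to an integer.
#   $1 - total, $2 - reserved, $3 - allocation_ratio
placement_capacity() {
  local total="$1" reserved="$2" ratio="$3"
  awk -v t="$total" -v r="$reserved" -v a="$ratio" \
    'BEGIN { printf "%d\n", (t - r) * a }'
}
```

For the example responses above, placement_capacity 8 0 8.0 prints 64, and placement_capacity 8 0 10 prints 80. Raising the VCPU allocation ratio from 8 to 10 therefore adds 16 schedulable vCPUs on this hypervisor.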
    
Delete a compute node

Since MOSK 23.2, the OpenStack-related metadata is automatically removed during the graceful machine deletion through the Mirantis Container Cloud web UI. For the procedure, refer to Delete a cluster machine.

During the graceful machine deletion, the OpenStack Controller (Rockoon) performs the following operations:

  • Disables the OpenStack Compute and Block Storage services on the node to prevent further scheduling of workloads to it.

  • Verifies if any resources are present on the node, for example, instances and volumes. By default, the OpenStack Controller blocks the removal process until the resources are removed by the user. To adjust this behavior to the needs of your cluster, refer to OpenStack Controller configuration.

  • Removes OpenStack services metadata including compute services, Neutron agents, and volume services.

Caution

You can collocate the OpenStack compute node with other cluster components, such as Ceph. If you do so, refer to the removal steps of the collocated components when planning the maintenance window.

If your cluster runs MOSK 23.1 or an older version, perform the following steps before you remove the node from the cluster through the web UI to correctly remove the OpenStack-related metadata from it:

  1. Disable the compute service to prevent spawning of new instances. In the keystone-client pod, run:

    openstack compute service set --disable <cmp_host_name> nova-compute --disable-reason "Compute is going to be removed."
    
  2. Migrate all workloads from the node. For more information, follow Nova official documentation: Migrate instances.

  3. Ensure that there are no pods running on the node to delete by draining the node as instructed in the Kubernetes official documentation: Safely drain node.

  4. Delete the compute service using the OpenStack API. In the keystone-client pod, run:

    openstack compute service delete <service_id>
    

    Note

    To obtain <service_id>, run:

    openstack compute service list --host <cmp_host_name>
    
  5. Depending on the networking backend in use, proceed with one of the following:

    • For a backend that runs Neutron agents:

      1. Obtain the network agent ID:

        openstack network agent list --host <cmp_host_name>
        
      2. Delete the Neutron agent by running the following command in the keystone-client pod:

        openstack network agent delete <agent_id>
        
    • For Tungsten Fabric:

      1. Log in to the Tungsten Fabric web UI.

      2. Navigate to Configure > Infrastructure > Virtual Routers.

      3. Select the target compute node.

      4. Click Delete.
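
The compute service commands from steps 1 and 4 can be wrapped into small functions so that they can be reviewed before execution. The following is a minimal sketch, not a supported tool: the run, disable_compute, and delete_compute_services names and the DRY_RUN variable are hypothetical. The functions are meant to be used in the keystone-client pod, with the workload migration and node draining from steps 2 and 3 performed in between.

```shell
# Hypothetical wrapper: prints the command when DRY_RUN=1 (the default
# in this sketch) and executes it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1: prevent spawning of new instances on the host.
disable_compute() {
  local host="$1"
  run openstack compute service set --disable "$host" nova-compute \
    --disable-reason "Compute is going to be removed."
}

# Step 4: delete compute service records.
# Reads service IDs from stdin, one per line; empty lines are skipped.
delete_compute_services() {
  local id
  while read -r id; do
    [ -n "$id" ] && run openstack compute service delete "$id"
  done
  return 0
}
```

For example, openstack compute service list --host <cmp_host_name> -f value -c ID | delete_compute_services prints the delete commands for the host. Setting DRY_RUN=0 executes them.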

Reschedule stateful applications

Note

The procedure applies to the MOSK clusters running MOSK 23.3 series or earlier versions. Starting from 24.1, MOSK performs the rescheduling of stateful applications automatically.

The rescheduling of stateful applications may be required when replacing a permanently failed node, decommissioning a node, migrating applications to nodes with a more suitable set of hardware, and in several other use cases.

MOSK deployment profiles include the following stateful applications:

  • OpenStack database (MariaDB)

  • OpenStack coordination (etcd)

  • OpenStack Time Series Database backend (Redis)

Each stateful application from the list above has a persistent volume claim (PVC) based on a local persistent volume per pod. Each control plane node has a set of local volumes available. To migrate an application pod to another node, recreate a PVC with the persistent volume from the target node.

Caution

A stateful application pod can only be migrated to a node that does not contain other pods of this application.

Caution

When a PVC is removed, all data present in the related persistent volume is removed from the node as well.

Reschedule pods to another control plane node

This section describes how to reschedule pods for MariaDB, etcd, and Redis to another control plane node.

Reschedule pods for MariaDB

Important

Reschedule the pods if you have to move a PVC to another node and the current node is still present in the cluster. If the current node has already been removed, MOSK reschedules the pods automatically when a node with the required labels is present in the cluster.

  1. Recreate PVCs as described in Recreate a PVC on another control plane node.

  2. Remove the pod:

    Note

    To remove a pod from a node in the NotReady state, add --grace-period=0 --force to the following command.

    kubectl -n openstack delete pod <STATEFULSET-NAME>-<NUMBER>
    
  3. Wait until the pod appears in the Ready state.

    When the rescheduling is finalized, the <STATEFULSET-NAME>-<NUMBER> pod rejoins the Galera cluster with a clean MySQL data directory and requests the Galera state transfer from the available nodes.

Reschedule pods for Redis

Important

Reschedule the pods if you have to move a PVC to another node and the current node is still present in the cluster. If the current node has already been removed, MOSK reschedules the pods automatically when a node with the required labels is present in the cluster.

  1. Recreate PVCs as described in Recreate a PVC on another control plane node.

  2. Remove the pod:

    Note

    To remove a pod from a node in the NotReady state, add --grace-period=0 --force to the following command.

    kubectl -n openstack-redis delete pod <STATEFULSET-NAME>-<NUMBER>
    
  3. Wait until the pod is in the Ready state.

Reschedule pods for etcd

Warning

During the rescheduling of etcd pods, a short cluster downtime is expected.

  1. Before MOSK 23.1:

    1. Identify the etcd replica ID that is a numeric suffix in a pod name. For example, the ID of the etcd-etcd-0 pod is 0. This ID is required during the reschedule procedure.

      kubectl -n openstack get pods | grep etcd
      

      Example of a system response:

      etcd-etcd-0                    0/1     Pending                 0          3m52s
      etcd-etcd-1                    1/1     Running                 0          39m
      etcd-etcd-2                    1/1     Running                 0          39m
      
    2. If the replica ID is 1 or higher:

      1. Add the coordination section to the spec.services section of the OsDpl object:

        spec:
          services:
            coordination:
              etcd:
                values:
                  conf:
                    etcd:
                      ETCD_INITIAL_CLUSTER_STATE: existing
        
      2. Wait until the etcd StatefulSet is updated with the new state parameter:

        kubectl -n openstack get sts etcd-etcd -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="ETCD_INITIAL_CLUSTER_STATE")].value}'
        
  2. Scale down the etcd StatefulSet to 0 replicas. Verify that no replicas are running on the failed node.

    kubectl -n openstack scale sts etcd-etcd --replicas=0
    
  3. Select from the following options:

    • If the current node is still present in the cluster and the PVC should be moved to another node, recreate the PVC as described in Recreate a PVC on another control plane node.

    • If the current node has been removed, remove the PVC related to the etcd replica of the failed node:

      kubectl -n <NAMESPACE> delete pvc <PVC-NAME>
      

      The PVC will be recreated automatically after the etcd StatefulSet is scaled to the initial number of replicas.

  4. Scale the etcd StatefulSet to the initial number of replicas:

    kubectl -n openstack scale sts etcd-etcd --replicas=<NUMBER-OF-REPLICAS>
    
  5. Wait until all etcd pods are in the Ready state.

  6. Verify that the etcd cluster is healthy:

    kubectl -n openstack exec -t etcd-etcd-1 -- etcdctl -w table endpoint --cluster status
    
  7. Before MOSK 23.1, if the replica ID is 1 or higher:

    1. Remove the coordination section from the spec.services section of the OsDpl object.

    2. Wait until all etcd pods appear in the Ready state.

    3. Verify that the etcd cluster is healthy:

      kubectl -n openstack exec -t etcd-etcd-1 -- etcdctl -w table endpoint --cluster status
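
The replica ID check from step 1 of the procedure above can be scripted. The following minimal sketch uses hypothetical helper names and relies only on the pod naming shown in the example system response, where the replica ID is the numeric suffix of the pod name. The check is relevant only for clusters running versions earlier than MOSK 23.1.

```shell
# Hypothetical helper: prints the replica ID, that is, the numeric suffix
# of a StatefulSet pod name such as etcd-etcd-0.
etcd_replica_id() {
  local pod="$1"
  echo "${pod##*-}"
}

# Succeeds when the replica ID is 1 or higher, that is, when the
# ETCD_INITIAL_CLUSTER_STATE override described in step 1 is required.
needs_existing_state() {
  [ "$(etcd_replica_id "$1")" -ge 1 ]
}
```

For example, needs_existing_state etcd-etcd-0 fails because the replica ID is 0, so the override is not required for that pod.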
      
Recreate a PVC on another control plane node

This section describes how to recreate a PVC of a stateful application on another control plane node.

To recreate a PVC on another control plane node:

  1. Select one of the persistent volumes available on the node:

    Caution

    A stateful application pod can only be migrated to the node that does not contain other pods of this application.

    NODE_NAME=<NODE-NAME>
    STORAGE_CLASS=$(kubectl -n openstack get osdpl <OSDPL_OBJECT_NAME> -o jsonpath='{.spec.local_volume_storage_class}')
    kubectl -n openstack get pv -o json | jq --arg NODE_NAME "$NODE_NAME" --arg STORAGE_CLASS "$STORAGE_CLASS" -r '.items[] | select(.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0] == $NODE_NAME and .spec.storageClassName == $STORAGE_CLASS and .status.phase == "Available") | .metadata.name'
    
  2. As the new PVC should contain the same parameters as the deleted one except for volumeName, save the old PVC configuration in YAML:

    kubectl -n <NAMESPACE> get pvc <PVC-NAME> -o yaml > <OLD-PVC>.yaml
    

    Note

    <NAMESPACE> is a Kubernetes namespace where the PVC is created. For Redis, specify openstack-redis, for other applications specify openstack.

  3. Delete the old PVC:

    kubectl -n <NAMESPACE> delete pvc <PVC-NAME>
    

    Note

    If a PVC is stuck in the Terminating state, run kubectl -n <NAMESPACE> edit pvc <PVC-NAME> and remove the finalizers section from the metadata of the PVC.

  4. Create a PVC with a new persistent volume:

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: <PVC-NAME>
      namespace: <NAMESPACE>
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: <STORAGE-SIZE>
      storageClassName: <STORAGE-CLASS>
      volumeMode: Filesystem
      volumeName: <PV-NAME>
    EOF
    

    Caution

    <STORAGE-SIZE>, <STORAGE-CLASS>, and <NAMESPACE> should correspond to the storage, storageClassName, and namespace values from the <OLD-PVC>.yaml file with the old PVC configuration.
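
If you recreate several PVCs, the manifest from step 4 can be generated from parameters instead of being edited by hand. The following is a minimal sketch; the render_pvc function name is hypothetical, and the manifest fields are the ones used in step 4.

```shell
# Hypothetical helper: prints a PVC manifest bound to a specific
# persistent volume.
#   $1 - PVC name, $2 - namespace, $3 - storage size,
#   $4 - storage class name, $5 - persistent volume name
render_pvc() {
  local name="$1" namespace="$2" size="$3" storage_class="$4" pv_name="$5"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: ${size}
  storageClassName: ${storage_class}
  volumeMode: Filesystem
  volumeName: ${pv_name}
EOF
}
```

To apply the result, pipe it to kubectl, for example: render_pvc <PVC-NAME> <NAMESPACE> <STORAGE-SIZE> <STORAGE-CLASS> <PV-NAME> | kubectl apply -f -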

Run Tempest tests

The OpenStack Integration Test Suite (Tempest) is a set of integration tests to be run against a live OpenStack environment. This section instructs you on how to verify the workability of your OpenStack deployment using Tempest.

To verify an OpenStack deployment using Tempest:

  1. Configure the Tempest run parameters using the spec:services:tempest structure in the OpenStackDeployment custom resource.

    Note

    To perform the smoke testing of your deployment, no additional configuration is required.

    Configuration examples:

    • To perform the full Tempest testing:

      spec:
        services:
          tempest:
            tempest:
              values:
                conf:
                  script: |
                    tempest run --config-file /etc/tempest/tempest.conf --concurrency 4 --blacklist-file /etc/tempest/test-blacklist --regex test
      
    • To set the image build timeout to 600:

      spec:
        services:
          tempest:
            tempest:
              values:
                conf:
                  tempest:
                    image:
                      build_timeout: 600
      
  2. Run Tempest. The OpenStack Tempest is deployed like other OpenStack services in a dedicated openstack-tempest Helm release by adding tempest to spec:features:services in the OpenStackDeployment custom resource:

    spec:
      features:
        services:
          - tempest
    
  3. Wait until Tempest is ready. The Tempest tests are launched by the openstack-tempest-run-tests job. To keep track of the tests execution, run:

    kubectl -n openstack logs -l application=tempest,component=run-tests
    
  4. Get the Tempest results. The Tempest results can be stored in a pvc-tempest PersistentVolumeClaim (PVC). To get them from a PVC, use:

    # Run pod and mount pvc to it
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: tempest-test-results-pod
      namespace: openstack
    spec:
      nodeSelector:
        openstack-control-plane: enabled
      volumes:
        - name: tempest-pvc-storage
          persistentVolumeClaim:
            claimName: pvc-tempest
      containers:
        - name: tempest-pvc-container
          image: ubuntu
          command: ['sh', '-c', 'sleep infinity']
          volumeMounts:
            - mountPath: "/var/lib/tempest/data"
              name: tempest-pvc-storage
    EOF
    
  5. If required, copy the results locally:

    kubectl -n openstack cp tempest-test-results-pod:/var/lib/tempest/data/report_file.xml .
    
  6. Remove the Tempest test results pod:

    kubectl -n openstack delete pod tempest-test-results-pod
    
  7. To rerun Tempest:

    1. Remove Tempest from the list of enabled services.

    2. Wait until Tempest jobs are removed.

    3. Add Tempest back to the list of the enabled services.

Remove an OpenStack cluster

This section instructs you on how to remove an OpenStack cluster, deployed on top of Kubernetes, by deleting the openstackdeployments.lcm.mirantis.com (OsDpl) CR.

To remove an OpenStack cluster:

  1. Verify that the OsDpl object is present:

    kubectl get osdpl -n openstack
    
  2. Delete the OsDpl object:

    kubectl delete osdpl osh-dev -n openstack
    

    The deletion may take a certain amount of time.

  3. Verify that all pods and jobs have been deleted and no objects are present in the command output:

    kubectl get pods,jobs -n openstack
    
  4. Delete Persistent Volume Claims (PVCs) using the following snippet. Deletion of PVCs causes data deletion on Persistent Volumes. The volumes themselves will become available for further operations.

    Caution

    Before deleting PVCs, save valuable data in a safe place.

    #!/bin/bash
    PVCS=$(kubectl get pvc  --all-namespaces |egrep "openstack|openstack-redis" | egrep "redis|etcd|mariadb" | awk '{print $1" "$2" "$4}'| column -t )
    echo  "$PVCS" | while read line; do
    PVC_NAMESPACE=$(echo "$line" | awk '{print $1}')
    PVC_NAME=$(echo "$line" | awk '{print $2}')
    echo "Deleting PVC ${PVC_NAME}"
    kubectl delete pvc ${PVC_NAME} -n ${PVC_NAMESPACE}
    done
    

    Note

    Deletion of PVCs may get stuck if a resource that uses the PVC is still running. Once the resource is deleted, the PVC deletion process will proceed.

  5. Delete the MariaDB state ConfigMap:

    kubectl delete configmap openstack-mariadb-mariadb-state -n openstack
    
  6. Delete secrets using the following snippet:

    #!/bin/bash
    SECRETS=$(kubectl get secret  -n openstack | awk '{print $1}'| column -t | awk 'NR>1')
    echo  "$SECRETS" | while read line; do
    echo "Deleting Secret ${line}"
    kubectl delete secret ${line} -n openstack
    done
    
  7. Verify that OpenStack ConfigMaps and secrets have been deleted:

    kubectl get configmaps,secrets -n openstack
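
Before deleting PVCs in step 4, you may want to review which PVCs the snippet selects. The following minimal sketch is a stricter variant of that filter: it matches the namespace column exactly instead of searching the whole line. The select_stateful_pvcs function name is hypothetical.

```shell
# Hypothetical helper: reads the output of
# "kubectl get pvc --all-namespaces" from stdin and prints
# "<namespace> <name>" for PVCs of Redis, etcd, and MariaDB located in
# the openstack and openstack-redis namespaces.
select_stateful_pvcs() {
  awk '($1 == "openstack" || $1 == "openstack-redis") &&
       $2 ~ /redis|etcd|mariadb/ { print $1, $2 }'
}
```

For example, kubectl get pvc --all-namespaces | select_stateful_pvcs lists the candidates without deleting anything.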
    
Remove an OpenStack service

OpenStack Controller (Rockoon)

Since MOSK 25.1, the OpenStack Controller has been open-sourced under the name Rockoon and is maintained as an independent open-source project going forward.

As part of this transition, all openstack-controller pods are now named rockoon pods across the MOSK documentation and deployments. This change does not affect functionality. Use the new naming for pods and other related artifacts.

This section instructs you on how to remove an OpenStack service deployed on top of Kubernetes. A service is typically removed by deleting a corresponding entry in the spec.features.services section of the openstackdeployments.lcm.mirantis.com (OsDpl) CR.

Caution

You cannot remove the default services built into the preset section.

Remove a service
  1. Verify that the spec.features.services section is present in the OsDpl object:

    kubectl -n openstack get osdpl osh-dev -o jsonpath='{.spec.features.services}'
    

    Example of system output:

    [instance-ha object-storage]
    
  2. Obtain the user name of the service database. You will need it to substitute <SERVICE-DB-USERNAME> during Clean up OpenStack database leftovers after the service removal:

    Note

    For example, the <SERVICE-NAME> for the instance-ha service type is masakari.

    kubectl -n osh-system exec -t <ROCKOON-POD-NAME> -- helm3 -n openstack get values openstack-<SERVICE-NAME> -o json | jq -r .endpoints.oslo_db.auth.<SERVICE-NAME>.username
    
  3. Delete the service from the spec.features.services section of the OsDpl CR:

    kubectl -n openstack edit osdpl osh-dev
    

    The deletion may take a certain amount of time.

  4. Verify that all related objects have been deleted and no objects are present in the output of the following command:

    for i in $(kubectl api-resources --namespaced -o name | grep -v event); do kubectl -n openstack get $i 2>/dev/null | grep <SERVICE-NAME>; done
    
Clean up OpenStack API leftovers after the service removal
  1. Log in to the Keystone client pod shell:

    kubectl -n openstack exec -it <KEYSTONE-CLIENT-POD-NAME> -- bash
    
  2. Remove service endpoints from the Keystone catalog:

    for i in $(openstack endpoint list --service <SERVICE-NAME> -f value -c ID); do openstack endpoint delete $i; done
    
  3. Remove the service user from Keystone:

    openstack user list --project service | grep <SERVICE-NAME>
    openstack user delete <SERVICE-USER-ID>
    
  4. Remove the service from the catalog:

    openstack service list | grep <SERVICE-NAME>
    openstack service delete <SERVICE-ID>
    
Clean up OpenStack database leftovers after the service removal

Caution

The procedure below will permanently destroy the data of the removed service.

  1. Log in to the mariadb-server pod shell:

    kubectl -n openstack exec -it mariadb-server-0 -- bash
    
  2. Remove the service database user and its permissions:

    Note

    Use the user name for the service database obtained during the Remove a service procedure to substitute <SERVICE-DB-USERNAME>:

    mysql -u root -p${MYSQL_DBADMIN_PASSWORD} -e "REVOKE ALL PRIVILEGES, GRANT OPTION FROM '<SERVICE-DB-USERNAME>'@'%';"
    mysql -u root -p${MYSQL_DBADMIN_PASSWORD} -e "DROP USER '<SERVICE-DB-USERNAME>'@'%';"
    
  3. Remove the service database:

    mysql -u root -p${MYSQL_DBADMIN_PASSWORD} -e "DROP DATABASE <SERVICE-NAME>;"
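
Because these statements are destructive, it may help to print them for review before execution. The following minimal sketch generates the same three statements from two parameters; the render_cleanup_sql function name is hypothetical.

```shell
# Hypothetical helper: prints the cleanup statements for one removed
# service so that they can be reviewed before execution.
#   $1 - service database user name, $2 - service database name
render_cleanup_sql() {
  local db_user="$1" db_name="$2"
  cat <<EOF
REVOKE ALL PRIVILEGES, GRANT OPTION FROM '${db_user}'@'%';
DROP USER '${db_user}'@'%';
DROP DATABASE ${db_name};
EOF
}
```

After the review, the output can be piped to the client in the mariadb-server pod, for example: render_cleanup_sql <SERVICE-DB-USERNAME> <SERVICE-NAME> | mysql -u root -p${MYSQL_DBADMIN_PASSWORD}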
    
Enable uploading of an image through Horizon with untrusted SSL certificates

By default, the OpenStack Dashboard (Horizon) is configured to load images directly into Glance. However, if a MOSK cluster is deployed using untrusted certificates for public API endpoints and Horizon, uploading of images to Glance through the Horizon web UI may fail.

When accessing the Horizon web UI of such MOSK deployment for the first time, a warning informs you that the site is insecure and you must force trust the certificate of this site. However, when trying to upload an image directly from a web browser, the certificate of the Glance API is still not considered by the web browser as a trusted one since host:port of the site is different. In this case, you must explicitly trust the certificate of the Glance API.

To enable uploading of an image through Horizon with untrusted SSL certificates:

  1. Navigate to the Horizon web UI.

  2. Configure your web browser to trust the Horizon certificate if you have not done so yet:

    • In Google Chrome or Chromium, click Advanced > Proceed to <URL> (unsafe).

    • In Mozilla Firefox, navigate to Advanced > Add Exception, enter the URL in the Location field, and click Confirm Security Exception.

    Note

    For other web browsers, the steps may vary slightly.

  3. Navigate to Project > API Access.

  4. Copy the Service Endpoint URL of the Image service.

  5. Open this URL in a new window or tab of the same web browser.

  6. Configure your web browser to trust the certificate of this site as described in step 2.

    As a result, the version discovery document should appear with contents that vary depending on the OpenStack version. For example, for OpenStack Victoria:

    {"versions": [{"id": "v2.9", "status": "CURRENT", "links": \
    [{"rel": "self", "href": "https://glance.ic-eu.ssl.mirantis.net/v2/"}]}, \
    {"id": "v2.7", "status": "SUPPORTED", "links": \
    [{"rel": "self", "href": "https://glance.ic-eu.ssl.mirantis.net/v2/"}]}, \
    {"id": "v2.6", "status": "SUPPORTED", "links": \
    [{"rel": "self", "href": "https://glance.ic-eu.ssl.mirantis.net/v2/"}]}, \
    {"id": "v2.5", "status": "SUPPORTED", "links": \
    [{"rel": "self", "href": "https://glance.ic-eu.ssl.mirantis.net/v2/"}]}, \
    {"id": "v2.4", "status": "SUPPORTED", "links": \
    [{"rel": "self", "href": "https://glance.ic-eu.ssl.mirantis.net/v2/"}]}, \
    {"id": "v2.3", "status": "SUPPORTED", "links": \
    [{"rel": "self", "href": "https://glance.ic-eu.ssl.mirantis.net/v2/"}]}, \
    {"id": "v2.2", "status": "SUPPORTED", "links": \
    [{"rel": "self", "href": "https://glance.ic-eu.ssl.mirantis.net/v2/"}]}, \
    {"id": "v2.1", "status": "SUPPORTED", "links": \
    [{"rel": "self", "href": "https://glance.ic-eu.ssl.mirantis.net/v2/"}]}, \
    {"id": "v2.0", "status": "SUPPORTED", "links": \
    [{"rel": "self", "href": "https://glance.ic-eu.ssl.mirantis.net/v2/"}]}]}
    

Once done, you should be able to upload an image through Horizon with untrusted SSL certificates.

Rotate OpenStack credentials

The credential rotation procedure is designed to minimize the impact on service availability and workload downtime. It depends on the credential type and is based on the following principles:

  • Credentials for OpenStack admin database and messaging are immediately changed during one rotation cycle, without a transition period.

  • Credentials for OpenStack admin identity are rotated with a transition period of one extra rotation cycle. This means that the credentials become invalid after two rotations. MOSK exposes the latest valid credentials to the openstack-external namespace. For details, refer to Access OpenStack through CLI from your local machine.

  • Credentials for OpenStack service users, including those for messaging, identity, and database, undergo a transition period of one extra rotation cycle during rotation.

Note

If immediate inactivation of credentials is required, initiate the rotation procedure twice.
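
The principles above can be summarized as the number of rotations after which a credential stops being valid. The following minimal sketch encodes them; the function name and the credential kind labels are hypothetical and are used for illustration only.

```shell
# Hypothetical helper: prints the number of rotations after which
# a credential of the given kind becomes invalid.
#   admin-db, admin-messaging: no transition period     -> 1
#   admin-identity, service:   one extra rotation cycle -> 2
rotations_until_invalid() {
  case "$1" in
    admin-db|admin-messaging) echo 1 ;;
    admin-identity|service)   echo 2 ;;
    *) echo "unknown credential kind: $1" >&2; return 1 ;;
  esac
}
```

A result of 2 corresponds to the note above: to inactivate such credentials immediately, the rotation has to be initiated twice.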

Impact on workloads availability

The restarts of the Networking service may cause workload downtimes. The exact lengths of these downtimes depend on the cloud density and scale.

Impact on APIs availability

Rotating both administrator and service credentials can potentially result in certain API operations failing.

Rotation prerequisites
  • Verify that the current state of the LCM action in OpenStackDeploymentStatus is APPLIED:

    kubectl -n openstack get osdplst -o yaml
    

    Example of an expected system response:

    kind: OpenStackDeploymentStatus
    metadata:
      name: osh-dev
      namespace: openstack
    spec: {}
    status:
      ...
      osdpl:
        cause: update
        changes: '((''add'', (''status'',), None, {''watched'': {''ceph'': {''secret'':
          {''hash'': ''0fc01c5e2593bc6569562b451b28e300517ec670809f72016ff29b8cbaf3e729''}}}}),)'
        controller_version: 0.5.3.dev12
        fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
        openstack_version: ussuri
        state: APPLIED
        timestamp: "2021-09-08 17:01:45.633143"
    
  • Verify that there are no other LCM operations running on the OpenStack cluster.

  • Thoroughly plan the maintenance window taking into account the following considerations:

    • All OpenStack control plane services, components of the Networking service (OpenStack Neutron) responsible for the data plane and messaging services are restarted during service credentials rotation.

    • OpenStack database and OpenStack messaging services are restarted during administrator credentials rotation, as well as some of the OpenStack control plane services, including the Instance High Availability service (OpenStack Masakari), Dashboard (OpenStack Horizon), and Identity service (OpenStack Keystone).

    For the approximate maintenance window duration, refer to Calculate a maintenance window duration for update (deprecated).

Rotate the credentials
  1. Log in to the osdpl container in the rockoon pod:

    kubectl -n osh-system exec -it <rockoon-pod> -c osdpl -- bash
    
  2. Use the osctl utility to trigger credentials rotation:

    osctl credentials rotate --osdpl <osdpl-object-name> --type <credentials-type>
    

    Where the <credentials-type> value is either admin or service.

    Note

    Mirantis recommends rotating both admin and service credentials simultaneously to decrease the duration of the maintenance window and number of service restarts. You can do this by passing the --type argument twice:

    osctl credentials rotate --osdpl <osdpl-object-name> --type service --type admin
    
  3. Wait until the OpenStackDeploymentStatus object has state APPLIED and all OpenStack components in the health group in the OpenStackDeploymentStatus custom resource are in the Ready state.

    Alternatively, you can launch the rotation command with the --wait flag.

Now, the latest admin password for your OpenStack environment is available in the openstack-identity-credentials secret in the openstack-external namespace.
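
The wait in step 3 can be automated with a simple polling loop. The following is a minimal sketch: the wait_for_state function name is hypothetical, and the loop checks only the state value, not the health of individual OpenStack components.

```shell
# Hypothetical helper: runs a command repeatedly until it prints the
# expected state or the attempts are exhausted.
#   $1 - expected state, for example APPLIED
#   $2 - number of attempts
#   $3 - delay between attempts, in seconds
#   remaining arguments - the command that prints the current state
wait_for_state() {
  local expected="$1" attempts="$2" delay="$3"
  shift 3
  local i=0 state
  while [ "$i" -lt "$attempts" ]; do
    state="$("$@" 2>/dev/null)"
    [ "$state" = "$expected" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, assuming the status layout shown in the rotation prerequisites: wait_for_state APPLIED 60 30 kubectl -n openstack get osdplst <osdpl-object-name> -o jsonpath='{.status.osdpl.state}'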

Customize OpenStack container images

This section provides instructions on how to customize the functionality of your MOSK OpenStack services by installing custom system or Python packages into their container images.

The MOSK services are running in Ubuntu-based containers, which can be extended to meet specific requirements or implement specific use cases, for example:

  • Enabling third-party storage driver for OpenStack Cinder

  • Implementing a custom scheduler for OpenStack Nova

  • Adding a custom dashboard to OpenStack Horizon

  • Building your own image importing workflow for OpenStack Glance

Warning

Mirantis cannot be held responsible for any consequences arising from using customized container images. Mirantis does not provide support for such actions, and any modifications to production systems are made at your own risk.

Note

Custom images are pinned in the OpenStackDeployment custom resource. These images do not undergo automatic updates or upgrades. The cloud administrator is responsible for updating these images during OpenStack updates and upgrades.

Build a custom container image
  1. Create a new directory and switch to it:

    mkdir my-custom-image
    cd my-custom-image
    
  2. Create a Dockerfile:

    touch Dockerfile
    
  3. Specify the location for the base image in the Dockerfile.

    A custom image can be derived from any OpenStack image shipped with MOSK. For locations of the images comprising a specific MOSK release, refer to a corresponding release artifacts page in the Release Notes.

    ARG FROM=<images-base-url>/openstack/<image-name>:<tag>
    FROM $FROM
    

    Note

    Presuming the custom image will need to get rebuilt for every new MOSK release, Mirantis recommends parametrizing the location of its base by introducing the $FROM argument to the Dockerfile.

  4. Instruct the Dockerfile to install additional system packages:

    RUN apt-get update ;\
          apt-get install --no-install-recommends -y <package name> ;\
          apt-get clean -y
    

    If you need to install packages from a third-party repository:

    • Make sure that the add-apt-repository utility is installed:

      RUN apt-get install --no-install-recommends -y software-properties-common
      
    • Add the third-party repository:

      RUN add-apt-repository <repository_name>
      
  5. Instruct the Dockerfile to install additional Python packages:

    Caution

    Rules to comply with when extending MOSK container images with Python packages:

    • Use only the Python wheel packaging standard. The older *.egg package type is not supported.

    • Honor upper constraints that MOSK defines for its OpenStack packages prerequisites

    1. OpenStack components in every MOSK release are shipped together with their requirements packaged as Python wheels and constraints file. Download and extract these artifacts from the corresponding requirements container image, so that they can be used for building your packages as well. Use the requirements image with the same tag as the base image that you plan to customize:

      docker pull <images-base-url>/openstack/requirements:<tag>
      docker save -o requirements.tar <images-base-url>/openstack/requirements:<tag>
      
      mkdir requirements
      tar -xf requirements.tar -C requirements
      tar -xf requirements/<shasum>/layer.tar -C requirements
      
    2. Build your Python wheel packages using one of the commands below depending on the place where the source code is stored:

      • Build from source:

        pip wheel --no-binary <package> <package> --wheel-dir=custom-wheels -c requirements/dev-upper-constraints.txt
        
      • Build from an upstream pip repository:

        pip wheel <package> --wheel-dir=custom-wheels -c requirements/dev-upper-constraints.txt
        
      • Build from a custom repository:

        pip wheel <package> --extra-index-url <Repo-URL> --wheel-dir=custom-wheels -c requirements/dev-upper-constraints.txt
        
    3. Include the built custom wheel packages and packages for the requirements into the Dockerfile:

      COPY custom-wheels /tmp/custom-wheels
      COPY requirements /tmp/wheels
      
    4. Install the necessary wheel packages to be stored along with other OpenStack components:

      RUN source /var/lib/openstack/bin/activate  ;\
          pip install <package> --no-cache-dir --only-binary :all: --no-compile --find-links /tmp/wheels --find-links /tmp/custom-wheels -c /tmp/wheels/dev-upper-constraints.txt
      
  6. Build the container image from your Dockerfile:

    docker build --tag <user-name>/<repo-name>:<tag> .
    

    When selecting the name for your image, Mirantis recommends following the common practice across major public Docker repositories, such as Docker Hub. The image name should be <user-name>/<repo-name>, where <user-name> is a unique identifier of the user who authored it and <repo-name> is the name of the software shipped.

    Specify the current directory as the build context. Also, use the --tag option to assign the tag to your image. Assigning a tag :<tag> enables you to add multiple versions of the same image to the repository. Unless you assign a tag, it defaults to latest.

    If you are adding Python packages, you can minimize the size of the custom image by building it with the --squash flag. It merges all the image layers into one and instructs the system not to store the cache layers of the wheel packages.

  7. Verify that the image has been built and is present on your system:

    docker image ls
    
  8. Publish the image to the designated registry by its name and tag:

    Note

    Before pushing the image, make sure that you have authenticated with the registry using the docker login command.

    docker push <user-name>/<repo-name>:<tag>
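
The naming and tagging rules from the build step can be illustrated with a short Python sketch. It splits an image reference into its name and tag and falls back to latest when no tag is given. The function name split_image_reference is hypothetical, and the sketch does not handle digest references (name@sha256:...).

```python
from typing import Tuple


def split_image_reference(reference: str) -> Tuple[str, str]:
    """Split an image reference into (name, tag), defaulting the tag to 'latest'."""
    # A tag separator is a colon that appears after the last slash.
    # A colon before the last slash belongs to a registry host:port prefix.
    last_slash = reference.rfind("/")
    colon = reference.rfind(":")
    if colon > last_slash:
        name, tag = reference[:colon], reference[colon + 1:]
    else:
        name, tag = reference, "latest"
    if not name or not tag:
        raise ValueError(f"malformed image reference: {reference!r}")
    return name, tag


print(split_image_reference("customopenstackimages/cinder-purestorage:yoga-1.0.0"))
print(split_image_reference("customopenstackimages/cinder-purestorage"))
```

A check like this can help catch an accidentally untagged reference before you push, since an untagged image is published as latest.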
    
Attach a private Docker registry to MOSK Kubernetes underlay

To ensure that the Kubernetes worker nodes in your MOSK cluster can locate and download the custom image, it should be published to a container image registry that the cluster is configured to use.

To configure the MOSK Kubernetes underlay to use your private registry, you need to create a ContainerRegistry resource in the Mirantis Container Cloud API with the registry domain and CA certificate in it, and specify the resource in the Cluster object that corresponds to MOSK.

For the details, refer to Define a custom CA certificate for a private Docker registry and ContainerRegistry resource.

Inject a custom image into MOSK cluster

To inject a customized OpenStack container into your MOSK cluster:

Create a ConfigMap in the openstack namespace with the following content, replacing <OPENSTACKDEPLOYMENT-NAME> with the name of your OpenStackDeployment custom resource:

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    openstack.lcm.mirantis.com/watch: "true"
  name: <OPENSTACKDEPLOYMENT-NAME>-artifacts
  namespace: openstack
data:
  caracal: |
    horizon: <image_path>

Update the spec:services section in the OpenStackDeployment custom resource to override the default location of the container image with your own:

  1. Open the OpenStackDeployment custom resource for editing:

    kubectl -n openstack edit osdpl osh-dev
    
  2. Specify the path to your custom image:

    • For MOSK Dashboard (OpenStack Horizon):

      spec:
        services:
          dashboard:
            horizon:
              values:
                images:
                  tags:
                    horizon: <PATH-TO-IMAGE>
      
    • For the MOSK Block Storage service (OpenStack Cinder):

      spec:
        services:
          block-storage:
            cinder:
              values:
                images:
                  tags:
                    <IMAGE-NAME>: <PATH-TO-IMAGE>
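
Both overrides above share the same nesting: spec, services, service name, chart name, values, images, tags. The following Python sketch builds that fragment for any service. The function name image_override and the registry path are hypothetical and used only for illustration.

```python
import json
from typing import Any, Dict


def image_override(service: str, chart: str, image_name: str, image_path: str) -> Dict[str, Any]:
    """Return the spec fragment that points one chart image at a custom path."""
    return {
        "spec": {
            "services": {
                service: {
                    chart: {"values": {"images": {"tags": {image_name: image_path}}}}
                }
            }
        }
    }


# Hypothetical registry path, used only for illustration.
fragment = image_override(
    "dashboard",
    "horizon",
    "horizon",
    "registry.example.com/customopenstackimages/horizon-fwaas:yoga-1.0.0",
)
print(json.dumps(fragment, indent=2))
```

The printed JSON is also valid YAML, so you can compare it against the spec:services section while editing the OpenStackDeployment custom resource.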
      
How-to examples

To help you better understand the process, this section provides a few examples illustrating how to add various plugins to MOSK services.

Warning

Mirantis cannot be held responsible for any consequences arising from using storage drivers, plugins, or features that are not explicitly tested or documented with MOSK. Mirantis does not provide support for such configurations as a part of standard product subscription.

Pure Storage driver for OpenStack Cinder

Although the Pure Storage driver itself is already included in the cinder system package, you need to install additional dependencies to make it work:

  • System packages: nfs-common

  • Python packages: purestorage

The base image is the MOSK Cinder image cinder:yoga-focal-20230227093206.

Procedure:

  1. Download and extract the requirements from the requirements container image that corresponds to the base image that you plan to customize:

    docker pull mirantis.azurecr.io/openstack/requirements:yoga-focal-20230227093206
    docker save -o requirements.tar mirantis.azurecr.io/openstack/requirements:yoga-focal-20230227093206
    
    mkdir requirements
    tar -xf requirements.tar -C requirements
    tar -xf requirements/f533a79f1d92ad487e5261ed1086a2da75b14c6f91d28fe42dbcf0ceaf0dea70/layer.tar -C requirements
    
  2. Build Python wheels:

    pip wheel purestorage --wheel-dir=custom-wheels -c requirements/dev-upper-constraints.txt
    
  3. Create the Dockerfile:

    ARG FROM=mirantis.azurecr.io/openstack/cinder:yoga-focal-20230227093206
    FROM $FROM
    COPY requirements /tmp/wheels
    COPY custom-wheels /tmp/custom-wheels
    RUN apt-get update ;\
        apt-get install --no-install-recommends -y nfs-common ;\
        source /var/lib/openstack/bin/activate  ;\
        pip install purestorage --no-cache-dir --only-binary :all: --no-compile --find-links /tmp/wheels --find-links /tmp/custom-wheels -c /tmp/wheels/dev-upper-constraints.txt ;\
        apt-get clean -y ;\
        rm -rf \
         /var/cache/debconf/* \
         /var/lib/apt/lists/* \
         /var/log/* \
         /tmp/* \
         /var/tmp/*
    
  4. Build the custom image:

    docker build --squash -t customopenstackimages/cinder-purestorage:yoga-1.0.0 .
    
TrilioVault plugin for OpenStack Horizon

The base image is the MOSK Horizon image horizon:yoga-focal-20230227093206.

Procedure:

  1. Download and extract the requirements from the requirements container image that corresponds to the base image that you plan to customize:

    docker pull mirantis.azurecr.io/openstack/requirements:yoga-focal-20230227093206
    docker save -o requirements.tar mirantis.azurecr.io/openstack/requirements:yoga-focal-20230227093206
    
    mkdir requirements
    tar -xf requirements.tar -C requirements
    tar -xf requirements/f533a79f1d92ad487e5261ed1086a2da75b14c6f91d28fe42dbcf0ceaf0dea70/layer.tar -C requirements
    
  2. Build wheels. This step is performed automatically because the Trilio repository provides Python packages as tar archives, and pip builds the wheel binaries from them on installation.

  3. Create the Dockerfile:

    ARG FROM=mirantis.azurecr.io/openstack/horizon:yoga-focal-20230227093206
    FROM $FROM
    COPY requirements /tmp/wheels
    RUN source /var/lib/openstack/bin/activate ;\
        pip install tvault-horizon-plugin workloadmgrclient contegoclient --extra-index-url https://pypi.fury.io/triliodata-dev-5-0-beta --no-cache-dir --no-compile --find-links /tmp/wheels -c /tmp/wheels/dev-upper-constraints.txt ;\
        cp /var/lib/openstack/lib/python3.8/site-packages/dashboards/local/enabled/*.py /var/lib/openstack/lib/python3.8/site-packages/openstack_dashboard/enabled/ ;\
        cp /var/lib/openstack/lib/python3.8/site-packages/dashboards/templatetags/*.py /var/lib/openstack/lib/python3.8/site-packages/openstack_dashboard/templatetags/ ;\
        cp /var/lib/openstack/lib/python3.8/site-packages/dashboards/local/local_settings.d/_001_trilio_dashboard.py  /var/lib/openstack/lib/python3.8/site-packages/openstack_dashboard/local/local_settings.d/ ;\
        rm -rf \
          /var/log/* \
          /tmp/* \
          /var/tmp/*
    
  4. Build the custom image:

    docker build --squash -t customopenstackimages/horizon-trilio:yoga-1.0.0 .
    
FWaaS dashboard for OpenStack Horizon

The base image is the MOSK Horizon image horizon:yoga-focal-20230227093206.

You can install the dashboard source from the neutron-fwaas-dashboard GitHub repository.

Procedure:

  1. Download and extract the requirements from the requirements container image that corresponds to the base image that you plan to customize:

    docker pull mirantis.azurecr.io/openstack/requirements:yoga-focal-20230227093206
    docker save -o requirements.tar mirantis.azurecr.io/openstack/requirements:yoga-focal-20230227093206
    
    mkdir requirements
    tar -xf requirements.tar -C requirements
    tar -xf requirements/f533a79f1d92ad487e5261ed1086a2da75b14c6f91d28fe42dbcf0ceaf0dea70/layer.tar -C requirements
    
  2. Build Python wheels:

    git clone https://opendev.org/openstack/neutron-fwaas-dashboard
    pip wheel neutron-fwaas-dashboard/ neutron-fwaas-dashboard --wheel-dir=custom-wheels --no-deps -c requirements/dev-upper-constraints.txt
    
  3. Create the Dockerfile:

    ARG FROM=mirantis.azurecr.io/openstack/horizon:yoga-focal-20230227093206
    FROM $FROM
    COPY requirements /tmp/wheels
    COPY custom-wheels /tmp/custom-wheels
    RUN source /var/lib/openstack/bin/activate ;\
        pip install neutron-fwaas-dashboard  --no-cache-dir --no-compile --find-links /tmp/wheels --find-links /tmp/custom-wheels -c /tmp/wheels/dev-upper-constraints.txt ;\
        cp /var/lib/openstack/lib/python3.8/site-packages/neutron_fwaas_dashboard/enabled/_70*_*.py /var/lib/openstack/lib/python3.8/site-packages/openstack_dashboard/enabled/ ;\
        rm -rf \
          /var/log/* \
          /tmp/* \
          /var/tmp/*
    
  4. Build the custom image:

    docker build --squash -t customopenstackimages/horizon-fwaas:yoga-1.0.0 .
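
In all three examples above, the requirements image is located under the same registry path as the base image and carries the same tag. Assuming this convention also holds for the image you plan to customize (verify it before relying on it), the reference can be derived as in the following Python sketch. The function name requirements_image_for is hypothetical.

```python
def requirements_image_for(base_image: str) -> str:
    """Swap the last repository component of base_image for 'requirements'."""
    repository, sep, tag = base_image.rpartition(":")
    # Reject untagged references: without a tag, the text after the last colon
    # is either missing or part of a registry host:port prefix.
    if not sep or "/" in tag:
        raise ValueError("expected a reference of the form <registry>/<path>/<name>:<tag>")
    prefix, slash, _name = repository.rpartition("/")
    return f"{prefix}{slash}requirements:{tag}"


print(requirements_image_for("mirantis.azurecr.io/openstack/cinder:yoga-focal-20230227093206"))
```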
    
Troubleshoot orphaned resource allocations

Available since MOSK 24.3

OpenStack Controller (Rockoon)

Since MOSK 25.1, the OpenStack Controller has been open-sourced under the name Rockoon and is maintained as an independent open-source project going forward.

As part of this transition, all openstack-controller pods are now named rockoon pods across the MOSK documentation and deployments. This change does not affect functionality. Use the new naming for pods and other related artifacts accordingly.

Orphaned resource allocations are entries in the Placement database that track resource consumption, but the corresponding consumer (instance) no longer exists on the compute nodes. As a result, the Nova scheduler mistakenly believes that compute nodes have more resources allocated than they actually have.

For example, orphaned resource allocations may occur when an instance is evacuated from a hypervisor while the related nova-compute service is down.

This section provides instructions on how to resolve orphaned resource allocations in Nova if they are detected on compute nodes.

Detect orphaned allocations

Orphaned allocations are detected by the nova-placement-audit CronJob that runs every four hours.

The osdpl-exporter service processes the nova-placement-audit CronJob output and exports the current number of orphaned allocations to StackLight as the osdpl_nova_audit_orphaned_allocations metric. If the value of this metric is greater than 0, StackLight raises the major NovaOrphanedAllocationsDetected alert.
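
To illustrate how a count can be derived from the audit output, the following Python sketch parses a report shaped like the one used later in this section, where orphaned_allocations.detected maps each resource provider UUID to a list of orphaned consumers. This is not the osdpl-exporter code. The function names are hypothetical, and whether the exported metric counts allocations exactly this way is an assumption.

```python
import json
from typing import Dict, List

# Sample shaped like the audit report in this section; other report fields are omitted.
SAMPLE_REPORT = """
{"orphaned_allocations": {"detected": {
  "12ed66d0-00d8-40e5-a28b-19cdecd2211d": [
    {"consumer": "1e616d60-bc5b-436d-8d71-503d15de5c55",
     "resources": {"DISK_GB": 5, "MEMORY_MB": 512, "VCPU": 1}}
  ]
}}}
"""


def orphaned_consumers(report_json: str) -> Dict[str, List[str]]:
    """Map each resource provider UUID to the consumer UUIDs orphaned on it."""
    detected = json.loads(report_json)["orphaned_allocations"]["detected"]
    return {rp: [entry["consumer"] for entry in entries] for rp, entries in detected.items()}


def orphaned_count(report_json: str) -> int:
    """Total number of orphaned allocations across all resource providers."""
    return sum(len(consumers) for consumers in orphaned_consumers(report_json).values())


print(orphaned_consumers(SAMPLE_REPORT))
print(orphaned_count(SAMPLE_REPORT))
```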

Collect logging data from the cluster
  1. Obtain the mapping with IDs of resource providers and related orphaned consumers:

    kubectl -n openstack logs -l application=nova,component=placement-audit -c placement-audit-report | \
       jq .orphaned_allocations.detected
    

    Example of a system response:

    {
       "12ed66d0-00d8-40e5-a28b-19cdecd2211d": [
          {
             "consumer": "1e616d60-bc5b-436d-8d71-503d15de5c55",
             "resources": {
                "DISK_GB": 5,
                "MEMORY_MB": 512,
                "VCPU": 1
             }
          }
       ]
    }
    
  2. Obtain the list of the nova-compute services that have issues with orphaned allocations:

    1. Obtain the UUIDs of the resource providers containing orphaned allocations:

      rp_uuids=$(kubectl -n openstack logs -l application=nova,component=placement-audit -c placement-audit-report | \
         jq -c '.orphaned_allocations.detected|keys')
      
    2. Obtain the hostnames of the compute nodes that correspond to the resource providers obtained in the previous step:

      cmp_fqdns=$(kubectl -n openstack exec -t deployment/keystone-client -- \
         openstack resource provider list -f json | \
         jq --argjson rp_uuids "$rp_uuids" -c ' .[] | select( [.uuid] | inside($rp_uuids) ) | .name')
      cmp_hostnames=$(for n in $(echo ${cmp_fqdns} | tr -d \"); do echo ${n%%.*}; done)
      
    3. List the nova-compute services that contain orphaned allocations:

      kubectl -n openstack exec -t deployment/keystone-client -- \
         openstack compute service list --service nova-compute --long -f json | \
         jq --arg hosts "$cmp_hostnames" -r '.[] | select( .Host | inside($hosts) )'
      

      Example of a system response:

      [{
         "ID": "14a1685a-798e-40f1-b490-a09a5c8f6f66",
         "Binary": "nova-compute",
         "Host": "mk-ps-7bqjjdq7o53q-0-rlocum3rumf4-server-ir4n3ag33erk",
         "Zone": "nova",
         "Status": "enabled",
         "State": "down",
         "Updated At": "2024-08-12T07:52:46.000000",
         "Disabled Reason": null,
         "Forced Down": true
      }]
      
  3. Analyze the list of the nova-compute services obtained during the previous step:

    • For the nova-compute services in the down state, instances were most probably evacuated from the corresponding nodes while the services were down. If this is the case, proceed directly to Remove orphaned allocations. Otherwise, proceed with collecting the logs.

      To verify if the evacuations were performed:

      openstack server migration list --type evacuation --host <CMP_HOSTNAME> -f json
      

      Example of a system response:

      [{
         "Id": 3,
         "UUID": "d7c29e99-2f69-4f85-80ed-72e1ef71c099",
         "Source Node": "mk-ps-7bqjjdq7o53q-1-snwv3ahiip6i-server-3axzquptwsao.cluster.local",
         "Dest Node": "mk-ps-7bqjjdq7o53q-0-rlocum3rumf4-server-ir4n3ag33erk.cluster.local",
         "Source Compute": "mk-ps-7bqjjdq7o53q-1-snwv3ahiip6i-server-3axzquptwsao",
         "Dest Compute": "mk-ps-7bqjjdq7o53q-0-rlocum3rumf4-server-ir4n3ag33erk",
         "Dest Host": "10.10.0.61",
         "Status": "completed",
         "Server UUID": "1e616d60-bc5b-436d-8d71-503d15de5c55",
         "Old Flavor": null,
         "New Flavor": null,
         "Type": "evacuation",
         "Created At": "2024-08-07T09:01:08.000000",
         "Updated At": "2024-08-07T16:11:54.000000"
      }]
      
    • For the nova-compute services in the up state, proceed with collecting the logs.

  4. Collect the following logs from the environment:

    Caution

    The log data can be significant in size. Ensure that there is sufficient space available in the /tmp/ directory of the OpenStack Controller (Rockoon) pod. Create a separate report for each node.

    • Logs from compute nodes for a 3-day period around the time of the alert:

      • From the node with the orphaned allocation

      • From the node with the actual allocation (where the instance exists, if any)

      kubectl -n osh-system exec -it deployment/rockoon -- bash
      
      osctl sos --between <REPORT_PERIOD_TIMESTAMPS> \
         --host <CMP_HOSTNAME> \
         --component nova \
         --collector elastic \
         --collector nova \
         --workspace /tmp/ report
      

      For example, if the alert was raised on 2024-08-12, set <REPORT_PERIOD_TIMESTAMPS> to 2024-08-11,2024-08-13.

    • Logs from the nova-scheduler, nova-api, nova-conductor, placement-api pods for a 3-day period around the time of the alert:

      ctl_nodes=$(kubectl get nodes -l openstack-control-plane=enabled -o name)
      
      kubectl -n osh-system exec -it deployment/rockoon -- bash
      
      # for each node in ctl_nodes execute:
      osctl sos --between <REPORT_PERIOD_TIMESTAMPS> \
         --host <CTL_HOSTNAME> \
         --component nova \
         --component placement \
         --collector elastic \
         --workspace /tmp/ report
      
    • Logs from the Kubernetes objects:

      kubectl -n osh-system exec -it deployment/rockoon -- bash
      osctl sos --collector k8s --workspace /tmp/ report
      
    • Nova service data from the API:

      kubectl -n openstack exec -it deployment/keystone-client -- bash
      
      openstack server migration list
      openstack compute service list --long
      openstack resource provider list
      # Get the server event list for each orphaned consumer id
      openstack server event list <SERVER_ID>
      

      Note

      SERVER_ID is the orphaned consumer ID from the nova-placement-audit logs.

  5. Create a support case and attach the obtained information.
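
The <REPORT_PERIOD_TIMESTAMPS> value used in step 4 covers the day before and the day after the alert. The following Python sketch derives the value from the alert date, matching the 2024-08-11,2024-08-13 example above. The function name report_period is hypothetical.

```python
from datetime import date, timedelta


def report_period(alert_date: str, days_each_side: int = 1) -> str:
    """Return '<start>,<end>' centred on alert_date, given as YYYY-MM-DD."""
    day = date.fromisoformat(alert_date)
    delta = timedelta(days=days_each_side)
    return f"{(day - delta).isoformat()},{(day + delta).isoformat()}"


print(report_period("2024-08-12"))
```

Using date arithmetic instead of editing the string by hand keeps the window correct across month and year boundaries.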

Remove orphaned allocations
  1. Log in to the nova-api-osapi pod:

    kubectl -n openstack exec -it deployment/nova-api-osapi -- bash
    
  2. Remove orphaned allocations:

    • To remove all found orphaned allocations:

      nova-manage placement audit --verbose --delete
      
    • To remove orphaned allocations on a specific resource provider:

      nova-manage placement audit --verbose --delete --resource_provider <RESOURCE_PROVIDER_UUID>
      
  3. Verify that no orphaned allocations exist:

    nova-manage placement audit --verbose
    
Start monitoring IP address capacity

Available since MOSK 25.1

Note

The MOSK deployments with Tungsten Fabric do not support IP address capacity monitoring.

Monitoring IP address capacity helps cloud operators allocate routable IP addresses efficiently for dynamic workloads in the clouds. This capability provides insights for predicting future needs for IP addresses, ensuring seamless communication between workloads, users, and services while optimizing IP address usage.

By monitoring IP address capacity, cloud operators can:

  • Predict when to add new IP address blocks to prevent service disruptions.

  • Identify networks or subnets nearing capacity to prevent issues.

  • Optimize the allocation of costly external IP address pools.

To start monitoring IP address capacity in your cloud:

  1. Verify that all required networks and subnets are monitored.

    By default, MOSK monitors IP address capacity for the external networks that have the router:external=External attribute and segmentation type of vlan or flat.

    To include additional networks and subnets in the monitoring:

    • Tag the network with the openstack.lcm.mirantis.com:prometheus tag. When a network is tagged, all its subnets are automatically included in the monitoring:

      openstack network set <NETWORK-ID> --tag openstack.lcm.mirantis.com:prometheus
      
    • Tag individual subnets with the openstack.lcm.mirantis.com:prometheus tag. This includes the subnet in the monitoring regardless of the network tagging:

      openstack subnet set <SUBNET_ID> --tag openstack.lcm.mirantis.com:prometheus
      
  2. View and analyze the monitoring data.

    The metrics of IP address capacity are collected by the OpenStack Exporter and are available in Prometheus within the StackLight monitoring suite as:

    • osdpl_neutron_network_total_ips

    • osdpl_neutron_network_free_ips

    • osdpl_neutron_subnet_total_ips

    • osdpl_neutron_subnet_free_ips
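
The total and free counters can be combined into a utilization figure to spot networks or subnets nearing capacity. The following Python sketch shows the arithmetic on plain numbers. The function names and the 90 percent threshold are hypothetical, and the sketch assumes that the free count never exceeds the total count.

```python
def ip_utilization_percent(total_ips: float, free_ips: float) -> float:
    """Share of addresses in use, as a percentage of the total."""
    if total_ips <= 0:
        raise ValueError("total_ips must be positive")
    if not 0 <= free_ips <= total_ips:
        raise ValueError("free_ips must be between 0 and total_ips")
    return 100.0 * (total_ips - free_ips) / total_ips


def nearing_capacity(total_ips: float, free_ips: float, threshold_percent: float = 90.0) -> bool:
    """True when utilization reaches the (hypothetical) threshold."""
    return ip_utilization_percent(total_ips, free_ips) >= threshold_percent


# Example values, as could be read from osdpl_neutron_subnet_total_ips and
# osdpl_neutron_subnet_free_ips for one subnet.
print(ip_utilization_percent(256, 64))
print(nearing_capacity(256, 64))
```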

OpenStack services configuration

This section covers post-deployment configuration of OpenStack services and is intended for cloud operators responsible for maintaining a functional cloud infrastructure for end users. It focuses on more complex procedures that require additional steps beyond simply editing the OpenStackDeployment custom resource.

For an overview of the capabilities provided by MOSK OpenStack services and instructions on enabling and configuring them at the OpenStackDeployment level, refer to Cloud services.

Configure high availability with Masakari

The Instances High Availability Service, or Masakari, is an OpenStack project designed to ensure high availability of instances and compute processes running on hosts.

Before the end user can start enjoying the benefits of Masakari, the cloud operator has to configure the service properly. This section explains how to create segments and hosts through the Masakari API and lists additional settings that can be useful in certain use cases.

Group compute nodes into segments

The segment object is a logical grouping of compute nodes into zones also known as availability zones. The segment object enables the cloud operator to list, create, show details for, update, and delete segments.

To create a segment named allcomputes with service_type = compute and recovery_method = auto, run:

openstack segment create allcomputes auto compute

Example of a positive system response:

+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| created_at      | 2021-07-06T07:34:23.000000           |
| updated_at      | None                                 |
| uuid            | b8b0d7ca-1088-49db-a1e2-be004522f3d1 |
| name            | allcomputes                          |
| description     | None                                 |
| id              | 2                                    |
| service_type    | compute                              |
| recovery_method | auto                                 |
+-----------------+--------------------------------------+
Create hosts under segments

The host object represents compute service hypervisors. A host belongs to a segment. The host can be any kind of virtual machine that has compute service running on it. The host object enables the operator to list, create, show details for, update, and delete hosts.

To create a host under a given segment:

  1. Obtain the hypervisor hostname:

    openstack hypervisor list
    

    Example of a positive system response:

    +----+-------------------------------------------------------+-----------------+------------+-------+
    | ID | Hypervisor Hostname                                   | Hypervisor Type | Host IP    | State |
    +----+-------------------------------------------------------+-----------------+------------+-------+
    |  2 | vs-ps-vyvsrkrdpusv-1-w2mtagbeyhel-server-cgpejthzbztt | QEMU            | 10.10.0.39 | up    |
    |  5 | vs-ps-vyvsrkrdpusv-0-ukqbpy2pkcuq-server-s4u2thvgxdfi | QEMU            | 10.10.0.14 | up    |
    +----+-------------------------------------------------------+-----------------+------------+-------+
    
  2. Create the host under the previously created segment. For example, with uuid = b8b0d7ca-1088-49db-a1e2-be004522f3d1:

    Caution

    The segment under which you create a host must exist.

    openstack segment host create \
        vs-ps-vyvsrkrdpusv-1-w2mtagbeyhel-server-cgpejthzbztt \
        compute \
        SSH \
        b8b0d7ca-1088-49db-a1e2-be004522f3d1
    

    Example of a positive system response:

    +---------------------+-------------------------------------------------------+
    | Field               | Value                                                 |
    +---------------------+-------------------------------------------------------+
    | created_at          | 2021-07-06T07:37:26.000000                            |
    | updated_at          | None                                                  |
    | uuid                | 6f1bd5aa-0c21-446a-b6dd-c1b4d09759be                  |
    | name                | vs-ps-vyvsrkrdpusv-1-w2mtagbeyhel-server-cgpejthzbztt |
    | type                | compute                                               |
    | control_attributes  | SSH                                                   |
    | reserved            | False                                                 |
    | on_maintenance      | False                                                 |
    | failover_segment_id | b8b0d7ca-1088-49db-a1e2-be004522f3d1                  |
    +---------------------+-------------------------------------------------------+
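
To register several hypervisors under one segment, you can generate one command per host. The following Python sketch assumes the JSON form of the hypervisor list (openstack hypervisor list -f json), in which each entry carries a "Hypervisor Hostname" key matching the column header above. It reuses the argument order from step 2. The function name host_create_commands is hypothetical, and the sketch only prints the commands without running them.

```python
import json
import shlex
from typing import List

# Sample shaped like the JSON output of the hypervisor list; fields are trimmed.
SAMPLE_HYPERVISORS = """
[
  {"ID": 2, "Hypervisor Hostname": "vs-ps-vyvsrkrdpusv-1-w2mtagbeyhel-server-cgpejthzbztt", "State": "up"},
  {"ID": 5, "Hypervisor Hostname": "vs-ps-vyvsrkrdpusv-0-ukqbpy2pkcuq-server-s4u2thvgxdfi", "State": "up"}
]
"""


def host_create_commands(
    hypervisors_json: str,
    segment_uuid: str,
    host_type: str = "compute",
    control_attributes: str = "SSH",
) -> List[str]:
    """Build one 'openstack segment host create' command per hypervisor."""
    commands = []
    for hypervisor in json.loads(hypervisors_json):
        args = [
            "openstack", "segment", "host", "create",
            hypervisor["Hypervisor Hostname"],
            host_type,
            control_attributes,
            segment_uuid,
        ]
        # Quote every argument so the line is safe to paste into a shell.
        commands.append(" ".join(shlex.quote(arg) for arg in args))
    return commands


for command in host_create_commands(SAMPLE_HYPERVISORS, "b8b0d7ca-1088-49db-a1e2-be004522f3d1"):
    print(command)
```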
    
Enable notifications

The alerting API is used by Masakari monitors to notify about a failure of either a host, process, or instance. The notification object enables the operator to list, create, and show details of notifications.

Useful tunings

The list of useful tunings for the Masakari service includes:

  • [host_failure]\evacuate_all_instances

    Enables the operator to decide whether to evacuate all instances or only the instances that have [host_failure]\ha_enabled_instance_metadata_key set to True. By default, the parameter is set to False.

  • [host_failure]\ha_enabled_instance_metadata_key

    Enables the operator to decide on the instance metadata key naming that affects the per-instance behavior of [host_failure]\evacuate_all_instances. The default is the same for both failure types, which include host and instance, but the value can be overridden to make the metadata key different per failure type.

  • [host_failure]\ignore_instances_in_error_state

    Enables the operator to decide whether instances in the error state are evacuated from a failed source compute node. If set to True, such instances are skipped during evacuation. Otherwise, they are evacuated along with other instances from the failed source compute node.

  • Available since MOSK 24.2 [host_failure]\ha_enabled_project_tag

    By default, instances belonging to any project are evacuated. However, if the operator needs to restrict this functionality to specific projects, they can tag these projects with a designated tag and pass this tag as the value for this Masakari option. Consequently, instances from projects that do not have the specified tag are not considered for evacuation, even if they have the corresponding metadata key and value set.

  • [instance_failure]\process_all_instances

    Enables the operator to decide whether all instances or only the ones that have [instance_failure]\ha_enabled_instance_metadata_key set to True should be recovered from instance failure events. If set to True, instance failure recovery actions are executed for every instance, regardless of whether the instance has [instance_failure]\ha_enabled_instance_metadata_key set to True. Otherwise, recovery actions are executed only for the instances that have [instance_failure]\ha_enabled_instance_metadata_key set to True.

  • [instance_failure]\ha_enabled_instance_metadata_key

    Enables the operator to decide on the instance metadata key naming that affects the per-instance behavior of [instance_failure]\process_all_instances. The default is the same for both failure types, which include host and instance, but you can override the value to make the metadata key different per failure type.
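
The interplay of the [host_failure] options can be summarized as a decision function. The following Python sketch is a simplified reading of the descriptions above and not the Masakari implementation. The function name, the order of the checks, the case-insensitive comparison of the metadata value, and the HA_Enabled key name in the usage lines are all assumptions.

```python
from typing import Dict, Optional, Set


def is_evacuation_candidate(
    instance_metadata: Dict[str, str],
    instance_in_error: bool,
    project_tags: Set[str],
    *,
    evacuate_all_instances: bool,
    ha_enabled_instance_metadata_key: str,
    ignore_instances_in_error_state: bool,
    ha_enabled_project_tag: Optional[str] = None,
) -> bool:
    """Decide whether an instance on a failed host is considered for evacuation."""
    # Project restriction: when a tag is configured, instances of projects
    # without it are skipped even if they carry the metadata key.
    if ha_enabled_project_tag is not None and ha_enabled_project_tag not in project_tags:
        return False
    # Instances in the error state are skipped when the operator asks for it.
    if ignore_instances_in_error_state and instance_in_error:
        return False
    if evacuate_all_instances:
        return True
    # Otherwise only instances explicitly marked through metadata qualify.
    value = instance_metadata.get(ha_enabled_instance_metadata_key, "")
    return value.strip().lower() == "true"


# Usage with an assumed metadata key name.
print(
    is_evacuation_candidate(
        {"HA_Enabled": "True"},
        instance_in_error=False,
        project_tags=set(),
        evacuate_all_instances=False,
        ha_enabled_instance_metadata_key="HA_Enabled",
        ignore_instances_in_error_state=False,
    )
)
```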

Configure monitoring of cloud workload availability

MOSK enables cloud operators to oversee the availability of workloads hosted in their OpenStack infrastructure through the monitoring of floating IP address availability (Cloudprober) and network port availability (Portprober).

For the feature description and usage requirements, refer to Workload monitoring.

Configure floating IP address availability monitoring

Available since MOSK 23.2 TechPreview

MOSK allows you to monitor the floating IP address availability through the Cloudprober service. This section explains the details of the service configuration.

Enable the Cloudprober service
  1. Enable the Cloudprober service in the OpenStackDeployment custom resource:

    spec:
      features:
        services:
          - cloudprober
    
  2. Wait until the OpenStackDeployment state becomes APPLIED:

    kubectl -n openstack get osdplst
    

    Example of a positive system response:

    NAME      OPENSTACK VERSION   CONTROLLER VERSION   STATE
    osh-dev   yoga                0.13.1.dev54         APPLIED
    
  3. Verify that the Cloudprober service is running:

    kubectl -n openstack get pods -l application=cloudprober
    

    Example of a positive system response:

    NAME                                     READY   STATUS    RESTARTS   AGE
    openstack-cloudprober-587b4bf7c4-lwmxx   2/2     Running   2          3d1h
    openstack-cloudprober-587b4bf7c4-v9tt9   2/2     Running   0          3d1h
    
  4. Verify that the Cloudprober service is sending data to StackLight:

    1. Log in to the StackLight Prometheus web UI.

    2. Navigate to Status > Targets.

    3. Search for the openstack-cloudprober target and verify that it is UP.

Configure security groups

By default, for outgoing traffic, the IP address for the Cloudprober Pod is translated to the node IP address. In this procedure, we assume no further translation of that node IP address on the path between the node and floating network.

  1. Identify the node IP address used for the traffic destined for the floating network by selecting an IP address from the floating network and running the following command on each OpenStack control plane node:

    ip r get <floating ip> | grep -E -o '(src .*)' | awk '{print $2}'
    
  2. In the project where monitored virtual machines are running, create a security group:

    openstack security group create --project <project_id> instance-monitoring
    
  3. Create a rule for each IP address obtained in step 1:

    openstack security group rule create --proto icmp --ingress --remote-ip <node ip> instance-monitoring
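
The pipeline in step 1 extracts the address that follows the src keyword in the route lookup output. The following Python sketch performs the same extraction on captured output from several nodes and removes duplicates, which yields the list of addresses that need a rule in step 3. The sample outputs and the function names are hypothetical.

```python
import re
from typing import Iterable, List, Optional

# Matches the token that follows the 'src' keyword, as the grep and awk pipeline does.
SRC_PATTERN = re.compile(r"\bsrc\s+(\S+)")


def source_address(route_lookup_output: str) -> Optional[str]:
    """Return the source address from one route lookup output, if present."""
    match = SRC_PATTERN.search(route_lookup_output)
    return match.group(1) if match else None


def unique_source_addresses(outputs: Iterable[str]) -> List[str]:
    """Collect distinct source addresses, preserving first-seen order."""
    seen: List[str] = []
    for output in outputs:
        address = source_address(output)
        if address and address not in seen:
            seen.append(address)
    return seen


# Hypothetical outputs captured from three control plane nodes.
SAMPLE_OUTPUTS = [
    "10.11.12.122 via 10.10.0.1 dev ens3 src 10.10.0.61 uid 0\n    cache",
    "10.11.12.122 via 10.10.0.1 dev ens3 src 10.10.0.62 uid 0\n    cache",
    "10.11.12.122 via 10.10.0.1 dev ens3 src 10.10.0.61 uid 0\n    cache",
]
print(unique_source_addresses(SAMPLE_OUTPUTS))
```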
    
Mark instances with floating IPs for monitoring
  1. Log in to the keystone-client Pod and assign the openstack.lcm.mirantis.com:prober tag to each instance to be added to monitoring:

    openstack --os-compute-api-version 2.26 server set --tag openstack.lcm.mirantis.com:prober <INSTANCE_ID>
    
  2. Assign the instance-monitoring security group to the server:

    openstack server add security group <SERVER_ID> <SECURITY_GROUP_ID>
    
  3. Verify that the instances have been added successfully.

    Cloudprober discovers instances automatically on a periodic basis. Therefore, wait for the discovery interval to pass (defaults to 600 seconds) and execute the following command inside the keystone-client Pod:

    curl -s http://cloudprober.openstack.svc.cluster.local:9313/metrics | grep <INSTANCE_ID>
    

    Example of a positive system response:

    cloudprober_total{ptype="ping",probe="openstack-instances-icmp-probe",dst="d34a0c6b-91a2-4bd3-95ea-772da49b90c3-10.11.12.122",openstack_hypervisor_hostname="mk-ps-xp4m27lfl56j-1-w74pc7cinu67-server-42kx24m22xop.cluster.local",openstack_instance_id="d34a0c6b-91a2-4bd3-95ea-772da49b90c3",openstack_instance_name="test-vm-proj",openstack_project_id="1eb031db8add42fda2fdb0ef2c2ad8d7"} 266388 1685963215202
    cloudprober_success{ptype="ping",probe="openstack-instances-icmp-probe",dst="d34a0c6b-91a2-4bd3-95ea-772da49b90c3-10.11.12.122",openstack_hypervisor_hostname="mk-ps-xp4m27lfl56j-1-w74pc7cinu67-server-42kx24m22xop.cluster.local",openstack_instance_id="d34a0c6b-91a2-4bd3-95ea-772da49b90c3",openstack_instance_name="test-vm-proj",openstack_project_id="1eb031db8add42fda2fdb0ef2c2ad8d7"} 266386 1685963215202
    cloudprober_latency{ptype="ping",probe="openstack-instances-icmp-probe",dst="d34a0c6b-91a2-4bd3-95ea-772da49b90c3-10.11.12.122",openstack_hypervisor_hostname="mk-ps-xp4m27lfl56j-1-w74pc7cinu67-server-42kx24m22xop.cluster.local",openstack_instance_id="d34a0c6b-91a2-4bd3-95ea-772da49b90c3",openstack_instance_name="test-vm-proj",openstack_project_id="1eb031db8add42fda2fdb0ef2c2ad8d7"} 315484742.137 1685963215202
    cloudprober_validation_failure{ptype="ping",probe="openstack-instances-icmp-probe",dst="d34a0c6b-91a2-4bd3-95ea-772da49b90c3-10.11.12.122",openstack_hypervisor_hostname="mk-ps-xp4m27lfl56j-1-w74pc7cinu67-server-42kx24m22xop.cluster.local",openstack_instance_id="d34a0c6b-91a2-4bd3-95ea-772da49b90c3",openstack_instance_name="test-vm-proj",openstack_project_id="1eb031db8add42fda2fdb0ef2c2ad8d7",validator="data-integrity"} 0 1685963215202
    

    Note

    You can adjust the instance auto-discovery interval in the OpenStackDeployment object. However, Mirantis does not recommend setting it too low to avoid high load on the OpenStack API:

    spec:
      features:
        cloudprober:
          discovery:
            interval: 300
    

Now, you can view the availability of instance floating IP addresses per OpenStack compute node and project, as well as the probe statistics for individual instance floating IP addresses, through the OpenStack Instances Availability dashboard in Grafana.
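
Outside of Grafana, the cloudprober_total and cloudprober_success counters from the metrics endpoint shown above can be combined into a success ratio per instance. The following Python sketch parses lines in that format. The sample uses a trimmed label set, the function name is hypothetical, and the parser does not handle escaped quotes inside label values. Since the counters are cumulative, the result covers the whole lifetime of the probe and not a recent time window.

```python
import re
from typing import Dict

# One exposition line: metric name, label set, value, optional timestamp.
LINE = re.compile(r'^(?P<name>\w+)\{(?P<labels>.*)\}\s+(?P<value>\S+)(?:\s+\d+)?\s*$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

# Trimmed sample based on the system response in this section.
SAMPLE_METRICS = """\
cloudprober_total{ptype="ping",probe="openstack-instances-icmp-probe",openstack_instance_id="d34a0c6b-91a2-4bd3-95ea-772da49b90c3"} 266388 1685963215202
cloudprober_success{ptype="ping",probe="openstack-instances-icmp-probe",openstack_instance_id="d34a0c6b-91a2-4bd3-95ea-772da49b90c3"} 266386 1685963215202
"""


def success_ratio_by_instance(metrics_text: str) -> Dict[str, float]:
    """Return successful probes divided by total probes for each instance ID."""
    totals: Dict[str, float] = {}
    successes: Dict[str, float] = {}
    for raw_line in metrics_text.splitlines():
        match = LINE.match(raw_line.strip())
        if not match:
            continue  # comments, blank lines, or metrics without labels
        labels = dict(LABEL.findall(match.group("labels")))
        instance = labels.get("openstack_instance_id")
        if instance is None:
            continue
        if match.group("name") == "cloudprober_total":
            totals[instance] = float(match.group("value"))
        elif match.group("name") == "cloudprober_success":
            successes[instance] = float(match.group("value"))
    return {
        instance: successes.get(instance, 0.0) / total
        for instance, total in totals.items()
        if total > 0
    }


print(success_ratio_by_instance(SAMPLE_METRICS))
```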

Enable network port availability monitoring

Available since MOSK 24.2 TechPreview

MOSK allows you to monitor the network port availability through the Portprober service.

The Portprober service is enabled by default when the Cloudprober service is enabled as described above, on clouds running OpenStack Antelope or a newer version and using the Neutron OVS backend for networking.

Also, you can enable Portprober explicitly, regardless of whether Cloudprober is enabled or not. See Network port availability monitoring (Portprober) for details.

When the service is enabled, you can monitor the network port availability through the OpenStack PortProber dashboard in Grafana.

Ceph operations

This section outlines Ceph LCM operations such as adding Ceph Monitor, Ceph nodes, and RADOS Gateway nodes to an existing Ceph cluster or removing them, as well as removing or replacing Ceph OSDs. The section also includes OpenStack-specific operations for Ceph.

The following sections describe the Ceph cluster configuration options:

Ceph default configuration options

Ceph Controller provides the capability to specify configuration options for the Ceph cluster through the spec.cephClusterSpec.rookConfig key-value parameter of the KaaSCephCluster resource as if they were set in a usual ceph.conf file. For details, see Ceph advanced configuration.

However, if rookConfig is empty, Ceph Controller still specifies the following default configuration options for each Ceph cluster:

  • Required network parameters that you can change through the spec.cephClusterSpec.network section:

    cluster network = <spec.cephClusterSpec.network.clusterNet>
    public network = <spec.cephClusterSpec.network.publicNet>
    
  • General default configuration options that you can override using the rookConfig parameter:

    mon target pg per osd = 200
    mon max pg per osd = 600
    
    # Workaround configuration option to avoid the
    # https://github.com/rook/rook/issues/7573 issue
    # when updating to Rook 1.6.x versions:
    rgw_data_log_backing = omap
    

If rookConfig is empty but the spec.cephClusterSpec.objectStorage.rgw section is defined, Ceph Controller specifies the following OpenStack-related default configuration options for each Ceph cluster:

  • Ceph Object Gateway options, which you can override using the rookConfig parameter:

    rgw swift account in url = true
    rgw keystone accepted roles = '_member_, Member, member, swiftoperator'
    rgw keystone accepted admin roles = admin
    rgw keystone implicit tenants = true
    rgw swift versioning enabled = true
    rgw enforce swift acls = true
    rgw_max_attr_name_len = 64
    rgw_max_attrs_num_in_req = 32
    rgw_max_attr_size = 1024
    rgw_bucket_quota_ttl = 0
    rgw_user_quota_bucket_sync_interval = 0
    rgw_user_quota_sync_interval = 0
    rgw s3 auth use keystone = true
    
  • Additional parameters for the Keystone integration:

    Warning

    All values with the keystone prefix are programmatically specified for each MOSK deployment. Do not modify these parameters manually.

    rgw keystone api version = 3
    rgw keystone url = <keystoneAuthURL>
    rgw keystone admin user = <keystoneUser>
    rgw keystone admin password = <keystonePassword>
    rgw keystone admin domain = <keystoneProjectDomain>
    rgw keystone admin project = <keystoneProjectName>
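
The override behavior described above can be pictured as a dictionary merge. The following Python sketch is an illustration only and is not Ceph Controller code: the function names are hypothetical, only the general defaults are modeled, and rookConfig keys that carry a section prefix are not handled. It assumes that spaces, dashes, and underscores in Ceph option names are interchangeable, as in a usual ceph.conf file.

```python
# General defaults copied from the list above.
DEFAULTS = {
    "mon target pg per osd": "200",
    "mon max pg per osd": "600",
    "rgw_data_log_backing": "omap",
}


def normalize(option: str) -> str:
    """Map 'mon max pg per osd' and 'mon-max-pg-per-osd' to 'mon_max_pg_per_osd'."""
    return option.strip().replace(" ", "_").replace("-", "_")


def effective_config(rook_config: dict) -> dict:
    """Return the defaults with any rookConfig values applied on top."""
    merged = {normalize(key): value for key, value in DEFAULTS.items()}
    merged.update({normalize(key): value for key, value in rook_config.items()})
    return merged


if __name__ == "__main__":
    # An override written with underscores replaces the default written with spaces.
    print(effective_config({"mon_max_pg_per_osd": "300"}))
```

With an empty rookConfig, the sketch returns the defaults unchanged.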
    
Ceph advanced configuration

This section describes how to configure a Ceph cluster through the KaaSCephCluster (kaascephclusters.kaas.mirantis.com) CR during or after the deployment of a MOSK cluster.

The KaaSCephCluster CR spec has two sections, cephClusterSpec and k8sCluster, and specifies the nodes to deploy as Ceph components. Based on the role definitions in the KaaSCephCluster CR, Ceph Controller automatically labels nodes for Ceph Monitors and Managers. Ceph OSDs are deployed based on the storageDevices parameter defined for each Ceph node.

For a default KaaSCephCluster CR, see Example of a complete template configuration for cluster creation.

Configure a Ceph cluster
  1. Select from the following options:

    • If you do not have a cluster yet, open kaascephcluster.yaml.template for editing.

    • If the cluster is already deployed, open the KaaSCephCluster CR for editing:

      kubectl edit kaascephcluster -n <ClusterProjectName>
      

      Substitute <ClusterProjectName> with a corresponding value.

  2. Using the tables below, configure the Ceph cluster as required.

  3. Select from the following options:

    • If you are creating a cluster, save the updated KaaSCephCluster template to the corresponding file and proceed with the cluster creation.

    • If you are configuring KaaSCephCluster of an existing cluster, exit the text editor to apply the change.

Ceph configuration options
High-level parameters

Parameter

Description

cephClusterSpec

Describes a Ceph cluster in the MOSK cluster. For details on cephClusterSpec parameters, see the tables below.

k8sCluster

Defines the cluster on which the KaaSCephCluster depends. Use the k8sCluster parameter if the name or namespace of the corresponding MOSK cluster differs from the default one:

spec:
  k8sCluster:
    name: kaas-mgmt
    namespace: default
General parameters

Parameter

Description

network

Specifies networks for the Ceph cluster:

  • clusterNet - specifies a Classless Inter-Domain Routing (CIDR) for the Ceph OSD replication network.

    Warning

    To avoid ambiguous behavior of Ceph daemons, do not specify 0.0.0.0/0 in clusterNet. Otherwise, Ceph daemons can select an incorrect public interface that can cause the Ceph cluster to become unavailable. The bare metal provider automatically translates the 0.0.0.0/0 network range to the default LCM IPAM subnet if it exists.

    Note

    The clusterNet and publicNet parameters support multiple IP networks. For details, see Enable Ceph multinetwork.

  • publicNet - specifies a CIDR for communication between the service and operator.

    Warning

    To avoid ambiguous behavior of Ceph daemons, do not specify 0.0.0.0/0 in publicNet. Otherwise, Ceph daemons can select an incorrect public interface that can cause the Ceph cluster to become unavailable. The bare metal provider automatically translates the 0.0.0.0/0 network range to the default LCM IPAM subnet if it exists.

    Note

    The clusterNet and publicNet parameters support multiple IP networks. For details, see Enable Ceph multinetwork.

nodes

Specifies the list of Ceph nodes. For details, see Node parameters. The nodes parameter is a map with machine names as keys and Ceph node specifications as values, for example:

nodes:
  master-0:
    <node spec>
  master-1:
    <node spec>
  ...
  worker-0:
    <node spec>

nodeGroups

Specifies the list of Ceph nodes grouped by node lists or node labels. For details, see NodeGroups parameters. The nodeGroups parameter is a map with group names as keys and Ceph node specifications for defined nodes or node labels as values. For example:

nodeGroups:
  group-1:
    spec: <node spec>
    nodes: ["master-0", "master-1"]
  group-2:
    spec: <node spec>
    label: <nodeLabelExpression>
  ...
  group-3:
    spec: <node spec>
    nodes: ["worker-2", "worker-3"]

The <nodeLabelExpression> must be a valid Kubernetes label selector expression.

pools

Specifies the list of Ceph pools. For details, see Pool parameters.

objectStorage

Specifies the parameters for Object Storage, such as RADOS Gateway, the Ceph Object Storage. Also specifies the RADOS Gateway Multisite configuration. For details, see RADOS Gateway parameters and Multisite parameters.

rookConfig

Optional. String key-value parameter that allows overriding Ceph configuration options.

Since MOSK 24.2, use the | delimiter to specify the section where a parameter must be placed, for example, mon or osd. If required, use the . delimiter to specify the exact number of the Ceph OSD or Ceph Monitor, which applies the option to that specific mon or osd and overrides the configuration of the corresponding section.

Specifying a section restarts only the daemons related to that section. If you do not specify a section, the parameter is set in the global section, which triggers a restart of all Ceph daemons except Ceph OSDs.

For example:

rookConfig:
  "osd_max_backfills": "64"
  "mon|mon_health_to_clog":  "true"
  "osd|osd_journal_size": "8192"
  "osd.14|osd_journal_size": "6250"

extraOpts

Available since MOSK 23.3. Enables specification of extra options for a setup, includes the deviceLabels parameter. For details, see ExtraOpts parameters.

ingress

Automatically replaced with ingressConfig in MOSK 25.1. Enables a custom ingress rule for public access on Ceph services, for example, Ceph RADOS Gateway. For details, see Configure Ceph Object Gateway TLS.

ingressConfig

Available since MOSK 25.1 to automatically replace the ingress section. Enables a custom ingress rule for public access on Ceph services, for example, Ceph RADOS Gateway. For details, see Configure Ceph Object Gateway TLS.

rbdMirror

Enables pools mirroring between two interconnected clusters. For details, see Enable Ceph RBD mirroring.

clients

List of Ceph clients. For details, see Clients parameters.

disableOsSharedKeys

Disables autogeneration of shared Ceph values for OpenStack deployments. Set to false by default.

mgr

Contains the mgrModules parameter that should list the following keys:

  • name - Ceph Manager module name

  • enabled - flag that defines whether the Ceph Manager module is enabled

  • settings.balancerMode - available since MOSK 25.1. Allows defining balancer mode for the Ceph Manager balancer module. Possible values are crush-compat or upmap.

For example:

mgr:
  mgrModules:
  - name: balancer
    enabled: true
    settings:
      balancerMode: upmap
  - name: pg_autoscaler
    enabled: true

The balancer and pg_autoscaler Ceph Manager modules are enabled by default and cannot be disabled.

Note

Most Ceph Manager modules require additional configuration that you can perform through the ceph-tools pod on a MOSK cluster.

healthCheck

Configures health checks and liveness probe settings for Ceph daemons. For details, see HealthCheck parameters.

Example configuration
spec:
  cephClusterSpec:
    network:
      clusterNet: 10.10.10.0/24
      publicNet: 10.10.11.0/24
    nodes:
      master-0:
        <node spec>
      ...
    pools:
    - <pool spec>
    ...
    rookConfig:
      "mon max pg per osd": "600"
      ...
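
The section and daemon delimiters that the rookConfig description introduces can be illustrated with a short parser. This Python sketch is a reading aid under stated assumptions, not the logic that Ceph Controller runs: the RookConfigKey type and the parse_rook_config_key function are hypothetical names, and keys without the | delimiter are assumed to land in the global section, as the description states.

```python
from typing import NamedTuple, Optional


class RookConfigKey(NamedTuple):
    section: str              # "global", "mon", "osd", and so on
    daemon_id: Optional[str]  # "14" for osd.14, None for the whole section
    option: str               # the Ceph option name itself


def parse_rook_config_key(key: str) -> RookConfigKey:
    """Split a rookConfig key of the form [<section>[.<id>]|]<option>."""
    if "|" not in key:
        # No section given: the option is placed in the global section.
        return RookConfigKey("global", None, key)
    target, option = key.split("|", 1)
    section, _, daemon_id = target.partition(".")
    return RookConfigKey(section, daemon_id or None, option)


if __name__ == "__main__":
    for key in ("osd_max_backfills", "mon|mon_health_to_clog",
                "osd|osd_journal_size", "osd.14|osd_journal_size"):
        print(key, "->", parse_rook_config_key(key))
```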
Node parameters

Parameter

Description

roles

Specifies the mon, mgr, or rgw daemon to be installed on a Ceph node. You can place the daemons on any nodes of your choice. Consider the following recommendations:

  • The recommended number of Ceph Monitors in a Ceph cluster is 3. Therefore, at least 3 Ceph nodes must contain the mon item in the roles parameter.

  • The number of Ceph Monitors must be odd.

  • Do not add more than 2 Ceph Monitors at a time and wait until the Ceph cluster is Ready before adding more daemons.

  • For better HA and fault tolerance, the number of mgr roles must equal the number of mon roles. Therefore, we recommend labeling at least 3 Ceph nodes with the mgr role.

  • If rgw roles are not specified, all rgw daemons will spawn on the same nodes with mon daemons.

If a Ceph node contains a mon role, the Ceph Monitor Pod deploys on this node.

If a Ceph node contains a mgr role, it informs the Ceph Controller that a Ceph Manager can be deployed on the node. Rook Operator selects the first available node to deploy the Ceph Manager on it:

  • Before MOSK 23.1, only one Ceph Manager is deployed on a cluster.

  • Since MOSK 23.1, two Ceph Managers, active and stand-by, are deployed on a cluster.

    If you assign the mgr role to three recommended Ceph nodes, one back-up Ceph node is available to redeploy a failed Ceph Manager in case of a server outage.

storageDevices

Specifies the list of devices to use for Ceph OSD deployment. Includes the following parameters:

Note

Since MOSK 23.3, Mirantis recommends migrating all storageDevices items to by-id symlinks as persistent device identifiers.

For details, refer to Container Cloud documentation: Addressing storage devices.

  • fullPath - a storage device symlink. Accepts the following values:

    • Since MOSK 23.3, the device by-id symlink that contains the serial number of the physical device and does not contain wwn. For example, /dev/disk/by-id/nvme-SAMSUNG_MZ1LB3T8HMLA-00007_S46FNY0R394543. The by-id symlink should match one of the entries in the status.providerStatus.hardware.storage.byIDs list of the Machine status. Mirantis recommends using this field for defining by-id symlinks.

    • The device by-path symlink. For example, /dev/disk/by-path/pci-0000:00:11.4-ata-3. Since MOSK 23.3, Mirantis does not recommend specifying storage devices with device by-path symlinks because such identifiers are not persistent and can change at node boot.

      This parameter is mutually exclusive with name.

  • name - a storage device name. Accepts the following values:

    • The device name, for example, sdc. Since MOSK 23.3, Mirantis does not recommend specifying storage devices with device names because such identifiers are not persistent and can change at node boot.

    • The device by-id symlink that contains the serial number of the physical device and does not contain wwn. For example, /dev/disk/by-id/nvme-SAMSUNG_MZ1LB3T8HMLA-00007_S46FNY0R394543.

      The by-id symlink should match one of the entries in the status.providerStatus.hardware.storage.byIDs list of the Machine status. Since MOSK 23.3, Mirantis recommends using the fullPath field for defining by-id symlinks instead.

    This parameter is mutually exclusive with fullPath.

  • config - a map of device configurations that must contain the mandatory deviceClass parameter set to hdd, ssd, or nvme. The device class must also be defined in a pool. The map can optionally specify a metadata device, for example:

    storageDevices:
    - name: /dev/disk/by-id/scsi-SATA_HGST_HUS724040AL_PN1334PEHN18ZS
      config:
        deviceClass: hdd
        metadataDevice: nvme01
        osdsPerDevice: "2"
    

    The underlying storage format to use for Ceph OSDs is BlueStore.

    The metadataDevice parameter accepts a device name or logical volume path for the BlueStore device. Mirantis recommends using logical volume paths created on nvme devices. For devices partitioning on logical volumes, see Create a custom bare metal host profile.

    The osdsPerDevice parameter accepts a natural number specified as a string and allows splitting one device among several Ceph OSD daemons. Mirantis recommends using this parameter only for ssd or nvme disks.

crush

Specifies the explicit key-value CRUSH topology for a node. For details, see Ceph official documentation: CRUSH maps. Includes the following parameters:

  • datacenter - a physical data center that consists of rooms and handles data.

  • room - a room that accommodates one or more racks with hosts.

  • pdu - a power distribution unit (PDU) device that has multiple outputs and distributes electric power to racks located within a data center.

  • row - a row of computing racks inside room.

  • rack - a computing rack that accommodates one or more hosts.

  • chassis - a bare metal structure that houses or physically assembles hosts.

  • region - the geographic location of one or more Ceph Object instances within one or more zones.

  • zone - a logical group that consists of one or more Ceph Object instances.

Example configuration:

crush:
  datacenter: dc1
  room: room1
  pdu: pdu1
  row: row1
  rack: rack1
  chassis: ch1
  region: region1
  zone: zone1

monitorIP

Optional. Available since MOSK 25.1. Specifies the custom monitor endpoint for the node on which the monitor is placed. The custom monitor endpoint can be equal, for example, to an IP address from the Ceph public network range.

Example configuration:

monitorIP: "192.168.13.1"
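
The recommendations for roles and the constraints on storageDevices listed in this table can be checked before the CR is applied. The following Python sketch is one possible pre-flight check and is not part of the product: the check_nodes function is a hypothetical helper, it expects the nodes map already loaded into a dictionary, and it assumes that no custom device classes are in use.

```python
DEFAULT_DEVICE_CLASSES = {"hdd", "ssd", "nvme"}


def check_nodes(nodes: dict) -> list:
    """Return human-readable findings for a cephClusterSpec.nodes map."""
    findings = []
    mons = sum("mon" in spec.get("roles", []) for spec in nodes.values())
    mgrs = sum("mgr" in spec.get("roles", []) for spec in nodes.values())
    if mons < 3:
        findings.append(f"{mons} mon role(s) found, 3 are recommended")
    if mons % 2 == 0:
        findings.append("the number of Ceph Monitors must be odd")
    if mgrs != mons:
        findings.append("the number of mgr roles should equal the number of mon roles")
    for node_name, spec in nodes.items():
        for device in spec.get("storageDevices", []):
            # name and fullPath are mutually exclusive.
            if "name" in device and "fullPath" in device:
                findings.append(f"{node_name}: name and fullPath are mutually exclusive")
            device_class = device.get("config", {}).get("deviceClass")
            if device_class not in DEFAULT_DEVICE_CLASSES:
                findings.append(f"{node_name}: deviceClass must be hdd, ssd, or nvme")
    return findings


if __name__ == "__main__":
    example = {
        f"master-{i}": {
            "roles": ["mon", "mgr"],
            "storageDevices": [
                {"fullPath": "/dev/disk/by-id/<unique_ID>",
                 "config": {"deviceClass": "hdd"}},
            ],
        }
        for i in range(3)
    }
    print(check_nodes(example) or "no findings")
```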
NodeGroups parameters

Parameter

Description

spec

Specifies a Ceph node specification. For the entire spec, see Node parameters.

nodes

Specifies a list of names of machines to which the Ceph node spec must be applied. Mutually exclusive with the label parameter. For example:

nodeGroups:
  group-1:
    spec: <node spec>
    nodes:
    - master-0
    - master-1
    - worker-0

label

Specifies a string with a valid label selector expression to select machines to which the node spec must be applied. Mutually exclusive with nodes parameter. For example:

nodeGroups:
  group-2:
    spec: <node spec>
    label: "ceph-storage-node=true,!ceph-control-node"
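
To make the label expression in the example above easier to read, the following Python sketch evaluates a selector against a set of labels. It is a simplified illustration rather than the Kubernetes implementation: the matches function is a hypothetical name, and only equality-based requirements (key=value, key==value, key!=value) and existence checks (key, !key) are handled, not set-based expressions such as in or notin.

```python
def matches(selector: str, labels: dict) -> bool:
    """Return True if all comma-separated requirements hold for the labels."""
    for requirement in (part.strip() for part in selector.split(",")):
        if not requirement:
            continue
        if "!=" in requirement:
            key, value = requirement.split("!=", 1)
            # A missing key also satisfies key!=value.
            if labels.get(key.strip()) == value.strip():
                return False
        elif "=" in requirement:
            key, value = requirement.replace("==", "=", 1).split("=", 1)
            if labels.get(key.strip()) != value.strip():
                return False
        elif requirement.startswith("!"):
            if requirement[1:].strip() in labels:
                return False
        elif requirement not in labels:
            return False
    return True


if __name__ == "__main__":
    selector = "ceph-storage-node=true,!ceph-control-node"
    print(matches(selector, {"ceph-storage-node": "true"}))
    print(matches(selector, {"ceph-storage-node": "true", "ceph-control-node": "true"}))
```

In this example, the selector picks machines that carry the ceph-storage-node=true label and do not carry the ceph-control-node label at all.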
Pool parameters

Parameter

Description

name

Mandatory. Specifies the pool name as a prefix for each Ceph block pool. The resulting Ceph block pool name will be <name>-<deviceClass>.

useAsFullName

Optional. Enables Ceph block pool to use only the name value as a name. The resulting Ceph block pool name will be <name> without the deviceClass suffix.

role

Mandatory. Specifies the pool role and is used mostly for MOSK pools.

default

Mandatory. Defines if the pool and dependent StorageClass should be set as default. Must be enabled only for one pool.

deviceClass

Mandatory. Specifies the device class for the defined pool. Possible values are hdd, ssd, and nvme.

replicated

Mandatory, mutually exclusive with erasureCoded. Includes the following parameters:

  • size - the number of pool replicas.

  • targetSizeRatio - Optional. A float percentage from 0.0 to 1.0, which specifies the expected consumption of the total Ceph cluster capacity. The default values are as follows:

    • The default ratio of the Ceph Object Storage dataPool is 10.0%.

    • For the pools ratio for MOSK, see Add a Ceph cluster.

erasureCoded

Mandatory, mutually exclusive with replicated. Enables the erasure-coded pool. For details, see Rook documentation: Erasure coded and Ceph documentation: Erasure coded pool.

failureDomain

Mandatory. The failure domain across which the replicas or chunks of data will be spread. Set to host by default. The list of possible recommended values includes: host, rack, room, and datacenter.

Caution

Mirantis does not recommend using the following intermediate topology keys: pdu, row, chassis. Consider the rack topology instead. The osd failure domain is prohibited.

mirroring

Optional. Enables the mirroring feature for the defined pool. Includes the mode parameter that can be set to pool or image. For details, see Enable Ceph RBD mirroring.

allowVolumeExpansion

Optional. Not updatable as it applies only once. Enables expansion of persistent volumes based on StorageClass of a corresponding pool. For details, see Kubernetes documentation: Resizing persistent volumes using Kubernetes.

Note

A Kubernetes cluster only supports increase of storage size.

rbdDeviceMapOptions

Optional. Not updatable as it applies only once. Specifies custom rbd device map options to use with StorageClass of a corresponding pool. Allows customizing the Kubernetes CSI driver interaction with Ceph RBD for the defined StorageClass. For the available options, see Ceph documentation: Kernel RBD (KRBD) options.

parameters

Optional. Available since MOSK 23.1. Specifies the key-value map for the parameters of the Ceph pool. For details, see Ceph documentation: Set Pool values.

reclaimPolicy

Optional. Available since MOSK 23.3. Specifies reclaim policy for the underlying StorageClass of the pool. Accepts Retain and Delete values. Default is Delete if not set.

Example configuration:

pools:
- name: kubernetes
  role: kubernetes
  deviceClass: hdd
  replicated:
    size: 3
    targetSizeRatio: 10.0
  default: true

To configure additional required pools for MOSK, see Add a Ceph cluster.

Caution

Since Ceph Pacific, Ceph CSI driver does not propagate the 777 permission on the mount point of persistent volumes based on any StorageClass of the Ceph pool.
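
The naming rule that the name and useAsFullName parameters describe can be summarized in a few lines. This Python sketch only restates that rule for a pool item loaded into a dictionary. The block_pool_name function is a hypothetical helper, not a product API.

```python
def block_pool_name(pool: dict) -> str:
    """Return the resulting Ceph block pool name for one pools item."""
    if pool.get("useAsFullName", False):
        # The name is used as is, without the deviceClass suffix.
        return pool["name"]
    return f"{pool['name']}-{pool['deviceClass']}"


if __name__ == "__main__":
    pool = {"name": "kubernetes", "role": "kubernetes", "deviceClass": "hdd"}
    print(block_pool_name(pool))
    print(block_pool_name({**pool, "useAsFullName": True}))
```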

Clients parameters

Parameter

Description

name

Ceph client name.

caps

Key-value parameter with Ceph client capabilities. For details about caps, refer to Ceph documentation: Authorization (capabilities).

Example configuration:

clients:
- name: glance
  caps:
    mon: allow r, allow command "osd blacklist"
    osd: profile rbd pool=images
RADOS Gateway parameters

Parameter

Description

name

Ceph Object Storage instance name.

dataPool

Mutually exclusive with the zone parameter. Object storage data pool spec that should only contain replicated or erasureCoded and failureDomain parameters. The failureDomain parameter may be set to osd or host, defining the failure domain across which the data will be spread. For dataPool, Mirantis recommends using an erasureCoded pool. For details, see Rook documentation: Erasure coding. For example:

cephClusterSpec:
  objectStorage:
    rgw:
      dataPool:
        erasureCoded:
          codingChunks: 1
          dataChunks: 2

metadataPool

Mutually exclusive with the zone parameter. Object storage metadata pool spec that should only contain replicated and failureDomain parameters. The failureDomain parameter may be set to osd or host, defining the failure domain across which the data will be spread. Can use only replicated settings. For example:

cephClusterSpec:
  objectStorage:
    rgw:
      metadataPool:
        replicated:
          size: 3
        failureDomain: host

where replicated.size is the number of full copies of data on multiple nodes.

Warning

If a Ceph pool uses the non-recommended replicated.size of less than 3, Ceph OSD removal cannot be performed. The minimal replica size equals half of the specified replicated.size, rounded up.

For example, if replicated.size is 2, the minimal replica size is 1, and if replicated.size is 3, then the minimal replica size is 2. The replica size of 1 allows Ceph having PGs with only one Ceph OSD in the acting state, which may cause a PG_TOO_DEGRADED health warning that blocks Ceph OSD removal. Mirantis recommends setting replicated.size to 3 for each Ceph pool.

gateway

The gateway settings corresponding to the rgw daemon settings. Includes the following parameters:

  • port - the port on which the Ceph RGW service will be listening on HTTP.

  • securePort - the port on which the Ceph RGW service will be listening on HTTPS.

  • instances - the number of pods in the Ceph RGW ReplicaSet. If allNodes is set to true, a DaemonSet is created instead.

    Note

    Mirantis recommends using 2 instances for Ceph Object Storage.

  • allNodes - defines whether to start the Ceph RGW pods as a DaemonSet on all nodes. The instances parameter is ignored if allNodes is set to true.

For example:

cephClusterSpec:
  objectStorage:
    rgw:
      gateway:
        allNodes: false
        instances: 1
        port: 80
        securePort: 8443

preservePoolsOnDelete

Defines whether to delete the data and metadata pools in the rgw section if the object storage is deleted. Set this parameter to true if you need to store data even if the object storage is deleted. However, Mirantis recommends setting this parameter to false.

objectUsers and buckets

Optional. To create new Ceph RGW resources, such as buckets or users, specify the following keys. Ceph Controller will automatically create the specified object storage users and buckets in the Ceph cluster.

  • objectUsers - a list of user specifications to create for object storage. Contains the following fields:

    • name - a user name to create.

    • displayName - the Ceph user name to display.

    • capabilities - user capabilities:

      • user - admin capabilities to read/write Ceph Object Store users.

      • bucket - admin capabilities to read/write Ceph Object Store buckets.

      • metadata - admin capabilities to read/write Ceph Object Store metadata.

      • usage - admin capabilities to read/write Ceph Object Store usage.

      • zone - admin capabilities to read/write Ceph Object Store zones.

      The available options are: *, read, write, and the combined value read, write. For details, see Ceph documentation: Add/remove admin capabilities.

    • quotas - user quotas:

      • maxBuckets - the maximum bucket limit for the Ceph user. Integer, for example, 10.

      • maxSize - the maximum size limit of all objects across all the buckets of a user. String size, for example, 10G.

      • maxObjects - the maximum number of objects across all buckets of a user. Integer, for example, 10.

      For example:

      objectUsers:
      - capabilities:
          bucket: '*'
          metadata: read
          user: read
        displayName: test-user
        name: test-user
        quotas:
          maxBuckets: 10
          maxSize: 10G
      
  • users - a list of strings that contain user names to create for object storage.

    Note

    This field is deprecated. Use objectUsers instead. If users is specified, it will be automatically transformed to the objectUsers section.

  • buckets - a list of strings that contain bucket names to create for object storage.

zone

Optional. Mutually exclusive with metadataPool and dataPool. Defines the Ceph Multisite zone where the object storage must be placed. Includes the name parameter that must be set to one of the zones items. For details, see Enable multisite for Ceph RGW Object Storage.

For example:

cephClusterSpec:
  objectStorage:
    multisite:
      zones:
      - name: master-zone
      ...
    rgw:
      zone:
        name: master-zone

SSLCert

Optional. Custom TLS certificate parameters used to access the Ceph RGW endpoint. If not specified, a self-signed certificate will be generated.

For example:

cephClusterSpec:
  objectStorage:
    rgw:
      SSLCert:
        cacert: |
          -----BEGIN CERTIFICATE-----
          ca-certificate here
          -----END CERTIFICATE-----
        tlsCert: |
          -----BEGIN CERTIFICATE-----
          private TLS certificate here
          -----END CERTIFICATE-----
        tlsKey: |
          -----BEGIN RSA PRIVATE KEY-----
          private TLS key here
          -----END RSA PRIVATE KEY-----

SSLCertInRef

Optional. Available since MOSK 25.1. Flag to determine that a TLS certificate for accessing the Ceph RGW endpoint is used but not exposed in spec. For example:

cephClusterSpec:
  objectStorage:
    rgw:
      SSLCertInRef: true

The operator must manually provide TLS configuration using the rgw-ssl-certificate secret in the rook-ceph namespace of the managed cluster. The secret object must have the following structure:

data:
  cacert: <base64encodedCaCertificate>
  cert: <base64encodedCertificate>

When removing an already existing SSLCert block, no additional actions are required, because this block uses the same rgw-ssl-certificate secret in the rook-ceph namespace.

When adding a new secret directly without exposing it in spec, the following rules apply:

  • cert - base64 representation of a file with the server TLS key, server TLS cert, and cacert.

  • cacert - base64 representation of a cacert only.

For configuration example, see Enable Ceph RGW Object Storage.
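
The cert and cacert rules listed for the rgw-ssl-certificate secret can be sketched as follows. This Python example is a minimal illustration under stated assumptions: the rgw_tls_secret_data function is a hypothetical helper, the PEM contents are passed in as strings, and the concatenation order (key, certificate, CA certificate) simply follows the order in which the rule lists them.

```python
import base64


def rgw_tls_secret_data(tls_key_pem: str, tls_cert_pem: str, cacert_pem: str) -> dict:
    """Build the data section of the secret from PEM-encoded strings."""

    def with_newline(pem: str) -> str:
        # Make sure concatenated PEM blocks do not run into each other.
        return pem if pem.endswith("\n") else pem + "\n"

    def encode(text: str) -> str:
        return base64.b64encode(text.encode("utf-8")).decode("ascii")

    bundle = "".join(with_newline(p) for p in (tls_key_pem, tls_cert_pem, cacert_pem))
    return {
        "cacert": encode(cacert_pem),  # CA certificate only
        "cert": encode(bundle),        # key + certificate + CA certificate
    }


if __name__ == "__main__":
    data = rgw_tls_secret_data("<tls key>", "<tls cert>", "<ca cert>")
    for field, value in data.items():
        print(f"{field}: {value}")
```

The resulting values fit the cacert and cert fields of the secret structure shown above.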

ExtraOpts parameters

Parameter

Description

deviceLabels

Available since MOSK 23.3. A key-value setting used to assign a specification label to any available device on a specific node. These labels can then be utilized within nodeGroups or node definitions to eliminate the need to specify different devices for each node individually. Additionally, it helps in avoiding the use of device names, facilitating the grouping of nodes with similar labels.

Usage:

extraOpts:
  deviceLabels:
    <node-name>:
      <dev-label>: /dev/disk/by-id/<unique_ID>
    ...
    <node-name-n>:
      <dev-label-n>: /dev/disk/by-id/<unique_ID>
nodeGroups:
  <group-name>:
    spec:
      storageDevices:
        - devLabel: <dev_label>
        - devLabel: <dev_label_n>
    nodes:
      - <node_name>
      - <node_name_n>

Before MOSK 23.3, you need to specify the devices for each node separately:

nodes:
  <node-name>:
  - storageDevices:
    - fullPath: /dev/disk/by-id/<unique_ID>
  <node-name-n>:
  - storageDevices:
    - fullPath: /dev/disk/by-id/<unique_ID>

customDeviceClasses

Available since MOSK 23.3 as TechPreview. A list of custom device class names to use in the specification. Enables you to specify the custom names different from the default ones, which include ssd, hdd, and nvme, and use them in nodes and pools definitions.

Usage:

extraOpts:
  customDeviceClasses:
  - <custom_class_name>
nodes:
  kaas-node-5bgk6:
    storageDevices:
    - config: # existing item
        deviceClass: <custom_class_name>
      fullPath: /dev/disk/by-id/<unique_ID>
pools:
- default: false
  deviceClass: <custom_class_name>
  erasureCoded:
    codingChunks: 1
    dataChunks: 2
  failureDomain: host

Before MOSK 23.3, you cannot specify custom class names in the specification.
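
The way deviceLabels lets one node group cover nodes with different physical devices can be illustrated by resolving the labels by hand. The following Python sketch is a conceptual model only, not the Ceph Controller implementation. The resolve_devices function and the sample node names, labels, and device IDs are hypothetical.

```python
def resolve_devices(device_labels: dict, group: dict) -> dict:
    """Map each node of a node group to the device paths its labels point to."""
    resolved = {}
    for node in group["nodes"]:
        node_labels = device_labels.get(node, {})
        # A KeyError here means that a label used in the group is not
        # defined for this node in extraOpts.deviceLabels.
        resolved[node] = [
            node_labels[device["devLabel"]]
            for device in group["spec"]["storageDevices"]
        ]
    return resolved


if __name__ == "__main__":
    device_labels = {
        "node-a": {"data-disk": "/dev/disk/by-id/<unique_ID_a>"},
        "node-b": {"data-disk": "/dev/disk/by-id/<unique_ID_b>"},
    }
    group = {
        "spec": {"storageDevices": [{"devLabel": "data-disk"}]},
        "nodes": ["node-a", "node-b"],
    }
    print(resolve_devices(device_labels, group))
```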

Multisite parameters

Parameter

Description

realms Technical Preview

List of realms to use, represents the realm namespaces. Includes the following parameters:

  • name - the realm name.

  • pullEndpoint - optional, required only when the master zone is in a different storage cluster. The endpoint, access key, and system key of the system user from the realm to pull from. Includes the following parameters:

    • endpoint - the endpoint of the master zone in the master zone group.

    • accessKey - the access key of the system user from the realm to pull from.

    • secretKey - the system key of the system user from the realm to pull from.

zoneGroups Technical Preview

The list of zone groups for realms. Includes the following parameters:

  • name - the zone group name.

  • realmName - the realm namespace name to which the zone group belongs.

zones Technical Preview

The list of zones used within one zone group. Includes the following parameters:

  • name - the zone name.

  • metadataPool - the settings used to create the Object Storage metadata pools. Must use replication. For details, see Pool parameters.

  • dataPool - the settings to create the Object Storage data pool. Can use replication or erasure coding. For details, see Pool parameters.

  • zoneGroupName - the zone group name.

  • endpointsForZone - available since MOSK 24.2. The list of all endpoints in the zone group. If you use ingress proxy for RGW, the list of endpoints must contain that FQDN/IP address to access RGW. By default, if no ingress proxy is used, the list of endpoints is set to the IP address of the RGW external service. Endpoints must follow the HTTP URL format.

For configuration example, see Enable multisite for Ceph RGW Object Storage.
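
Because the endpointsForZone items must follow the HTTP URL format, a quick format check before editing the CR can catch bare IP addresses or host names. This Python sketch uses only the standard library. The is_http_url function is a hypothetical helper, and accepting the https scheme in addition to http is an assumption.

```python
from urllib.parse import urlparse


def is_http_url(endpoint: str) -> bool:
    """Return True for values such as http://<host>[:<port>]."""
    parsed = urlparse(endpoint)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


if __name__ == "__main__":
    for endpoint in ("http://10.11.0.4:80", "10.11.0.4", "rgw.example.com:8080"):
        print(endpoint, is_http_url(endpoint))
```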

HealthCheck parameters

Parameter

Description

daemonHealth

Specifies health check settings for Ceph daemons. Contains the following parameters:

  • status - configures health check settings for Ceph health

  • mon - configures health check settings for Ceph Monitors

  • osd - configures health check settings for Ceph OSDs

Each parameter allows defining the following settings:

  • disabled - a flag that disables the health check.

  • interval - an interval in seconds or minutes for the health check to run. For example, 60s for 60 seconds.

  • timeout - a timeout for the health check in seconds or minutes. For example, 60s for 60 seconds.

livenessProbe

Key-value parameter with liveness probe settings for the defined daemon types. Can be one of the following: mgr, mon, osd, or mds. Includes the disabled flag and the probe parameter. The probe parameter accepts the following options:

  • initialDelaySeconds - the number of seconds after the container has started before the liveness probes are initiated. Integer.

  • timeoutSeconds - the number of seconds after which the probe times out. Integer.

  • periodSeconds - the frequency (in seconds) to perform the probe. Integer.

  • successThreshold - the minimum consecutive successful probes for the probe to be considered successful after a failure. Integer.

  • failureThreshold - the minimum consecutive failures for the probe to be considered failed after having succeeded. Integer.

Note

Ceph Controller specifies the following livenessProbe defaults for mon, mgr, osd, and mds (if CephFS is enabled):

  • 5 for timeoutSeconds

  • 5 for failureThreshold

startupProbe

Key-value parameter with startup probe settings for the defined daemon types. Can be one of the following: mgr, mon, osd, or mds. Includes the disabled flag and the probe parameter. The probe parameter accepts the following options:

  • timeoutSeconds - the number of seconds after which the probe times out. Integer.

  • periodSeconds - the frequency (in seconds) to perform the probe. Integer.

  • successThreshold - the minimum consecutive successful probes for the probe to be considered successful after a failure. Integer.

  • failureThreshold - the minimum consecutive failures for the probe to be considered failed after having succeeded. Integer.

Example configuration
healthCheck:
  daemonHealth:
    mon:
      disabled: false
      interval: 45s
      timeout: 600s
    osd:
      disabled: false
      interval: 60s
    status:
      disabled: true
  livenessProbe:
    mon:
      disabled: false
      probe:
        timeoutSeconds: 10
        periodSeconds: 3
        successThreshold: 3
    mgr:
      disabled: false
      probe:
        timeoutSeconds: 5
        failureThreshold: 5
    osd:
      probe:
        initialDelaySeconds: 5
        timeoutSeconds: 10
        failureThreshold: 7
  startupProbe:
    mon:
      disabled: true
    mgr:
      probe:
        successThreshold: 3
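
The interval and timeout values above can be sanity-checked before you apply the configuration. The following Python snippet is a minimal sketch, not part of the product: it assumes that durations are written as an integer followed by s (seconds) or m (minutes), as the 60s example suggests, and the duration_to_seconds and check_daemon_health helper names are hypothetical.

```python
import re

# Assumption: durations are an integer plus a unit suffix,
# "s" for seconds or "m" for minutes, as in the 60s example.
_DURATION = re.compile(r"(\d+)([sm])")


def duration_to_seconds(value: str) -> int:
    """Convert a string such as "45s" or "10m" to seconds."""
    match = _DURATION.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * (60 if unit == "m" else 1)


def check_daemon_health(daemon_health: dict) -> list:
    """Return warnings for enabled checks whose values cannot be parsed."""
    warnings = []
    for name, settings in daemon_health.items():
        if settings.get("disabled", False):
            continue  # disabled checks are not validated
        for key in ("interval", "timeout"):
            if key in settings:
                try:
                    duration_to_seconds(settings[key])
                except ValueError as err:
                    warnings.append(f"{name}.{key}: {err}")
    return warnings


# Values copied from the daemonHealth part of the example configuration.
daemon_health = {
    "mon": {"disabled": False, "interval": "45s", "timeout": "600s"},
    "osd": {"disabled": False, "interval": "60s"},
    "status": {"disabled": True},
}
print(check_daemon_health(daemon_health))
```

An empty list means that every enabled check uses a value the sketch can parse. The sketch does not verify what the Ceph Controller itself accepts.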

The following sections describe the OpenStack-related Ceph operations:

Configure Ceph Object Gateway TLS

Once you enable Ceph Object Gateway (radosgw) as described in Enable Ceph RGW Object Storage, you can configure the Transport Layer Security (TLS) protocol for a Ceph Object Gateway public endpoint using the following options:

  • Using MOSK TLS, if it is enabled and exposes its certificates and domain for Ceph. In this case, Ceph Object Gateway will automatically create an ingress rule with MOSK certificates and domain to access the Ceph Object Gateway public endpoint. Therefore, you only need to reach the Ceph Object Gateway public and internal endpoints and set the CA certificates for a trusted TLS connection.

  • Using custom ingress specified in the KaaSCephCluster CR. In this case, the Ceph Object Gateway public endpoint will use the public domain specified using the ingress parameters.

Caution

External Ceph Object Gateway service is not supported and will be deleted during update. If your system already uses endpoints of an external Ceph Object Gateway service, reconfigure them to the ingress endpoints.

Caution

When using a custom or OpenStack ingress, configure the DNS name for RGW to target an external IP address of that ingress. If there is no OpenStack or custom ingress available, point the DNS to an external load balancer of RGW.

Note

Since MOSK 23.3, if the cluster has tls-proxy enabled, TLS certificates specified in ingress objects, including those configured in the KaaSCephCluster specification, are disregarded. Instead, common certificates are applied to all ingresses from the OpenStackDeployment object. This implies that tlsCert and other ingress certificates specified in KaaSCephCluster are ignored, and the common certificate from the OpenStackDeployment object is used.

This section also describes how to specify a custom public endpoint for the Object Storage service.

To configure Ceph Object Gateway TLS:

  1. Verify whether MOSK TLS is enabled. The spec.features.ssl.public_endpoints section should be specified in the OpenStackDeployment CR.

  2. To generate an SSL certificate for internal usage, verify that the gateway securePort parameter is specified in the KaaSCephCluster CR. For details, see Enable Ceph RGW Object Storage.

  3. Select from the following options:

    Configure TLS for Ceph Object Gateway using a custom ingressConfig:

    1. Open the KaaSCephCluster CR for editing.

    2. Specify the ingressConfig parameters:

      Description of the tlsConfig section parameters

      certs

      TLS configuration for ingress including certificates. Contains the following parameters:

      cacert

      The Certificate Authority (CA) certificate, used for the ingress rule TLS support.

      tlsCert

      The TLS certificate, used for the ingress rule TLS support.

      tlsKey

      The TLS private key, used for the ingress rule TLS support.

      publicDomain

      Mandatory. The domain name to use for public endpoints.

      Caution

      The default ingress controller does not support publicDomain values different from the OpenStack ingress public domain. Therefore, if you intend to use the default OpenStack Ingress Controller for your Ceph Object Storage public endpoint, plan to use the same public domain as your OpenStack endpoints.

      hostname

      Custom name to override the Objectstore RGW name for public RGW access. The public RGW endpoint has the https://<hostname>.<publicDomain> format.

      tlsSecretRefName

      Optional. Name of the secret with TLS certificates, prepared by the operator in the rook-ceph namespace of the managed cluster. Allows avoiding exposure of the certificates directly in the spec. The secret must have the following format:

      data:
        ca.cert: <base64encodedCaCertificate>
        tls.crt: <base64encodedTlsCert>
        tls.key: <base64encodedTlsKey>
      

      Caution

      When using tlsSecretRefName, remove the following fields: cacert, tlsCert, and tlsKey.

      Description of optional parameters in the ingressConfig section

      controllerClassName

      Name of the custom Ingress Controller. By default, the openstack-ingress-nginx class name is specified and Ceph uses the OpenStack Ingress Controller based on NGINX.

      annotations

      Extra annotations for the ingress proxy that are a key-value mapping of strings to add or override ingress rule annotations. For details, see NGINX Ingress Controller: Annotations.

      By default, the following annotations are set:

      • nginx.ingress.kubernetes.io/rewrite-target is set to /

      • nginx.ingress.kubernetes.io/upstream-vhost is set to <rgwName>.rook-ceph.svc

      The value for <rgwName> is located in spec.cephClusterSpec.objectStorage.rgw.name.

      Optional annotations:

      • nginx.ingress.kubernetes.io/proxy-request-buffering: "off" - disables request buffering for the ingress to prevent the 413 (Request Entity Too Large) error when uploading large files using radosgw.

      • nginx.ingress.kubernetes.io/proxy-body-size: <size> - increases the default upload size limit to prevent the 413 (Request Entity Too Large) error when uploading large files using radosgw. Set the value in MB (m) or KB (k). For example, 100m.

      Note

      By default, an ingress rule is created with an internal Ceph Object Gateway service endpoint as a backend. Also, rgw dns name is specified in the Ceph configuration and is set to <rgwName>.rook-ceph.svc by default.

      You can override rgw dns name using the spec.cephClusterSpec.rookConfig key-value parameter. In this case, also change the corresponding ingress annotation.

      Configuration example with the rgw dns name override
      spec:
        cephClusterSpec:
          objectStorage:
            rgw:
              name: rgw-store
          ingressConfig:
            tlsConfig:
              publicDomain: public.domain.name
              certs:
                cacert: |
                  -----BEGIN CERTIFICATE-----
                  ...
                  -----END CERTIFICATE-----
                tlsCert: |
                  -----BEGIN CERTIFICATE-----
                  ...
                  -----END CERTIFICATE-----
                tlsKey: |
                  -----BEGIN RSA PRIVATE KEY-----
                  ...
                  -----END RSA PRIVATE KEY-----
            controllerClassName: openstack-ingress-nginx
            annotations:
              nginx.ingress.kubernetes.io/rewrite-target: /
              nginx.ingress.kubernetes.io/upstream-vhost: rgw-store.public.domain.name
              nginx.ingress.kubernetes.io/proxy-body-size: 100m
          rookConfig:
            "rgw dns name": rgw-store.public.domain.name
      

      For clouds with the publicDomain parameter specified, align the upstream-vhost ingress annotation with the name of the Ceph Object Storage and the specified public domain.

      Ceph Object Storage requires the upstream-vhost and rgw dns name parameters to be equal. Therefore, override the default rgw dns name with the corresponding ingress annotation value.

    Configure Ceph Object Gateway TLS using a custom ingress:

    Warning

    The rgw section is deprecated and the ingress parameters are moved under cephClusterSpec.ingress. If you continue using rgw.ingress, it will be automatically translated into cephClusterSpec.ingress during the MOSK cluster release update.

    1. Open the KaaSCephCluster CR for editing.

    2. Specify the ingress parameters:

      • publicDomain - domain name to use for the external service.

        Caution

        Since MOSK 23.3, the default ingress controller does not support publicDomain values different from the OpenStack ingress public domain. Therefore, if you intend to use the default OpenStack ingress controller for your Ceph Object Storage public endpoint, plan to use the same public domain as your OpenStack endpoints.

      • cacert - Certificate Authority (CA) certificate, used for the ingress rule TLS support.

      • tlsCert - TLS certificate, used for the ingress rule TLS support.

      • tlsKey - TLS private key, used for the ingress rule TLS support.

      • customIngress - Optional. Includes the following custom Ingress Controller parameters:

        • className - the custom Ingress Controller class name. If not specified, the openstack-ingress-nginx class name is used by default.

        • annotations - extra annotations for the ingress proxy. For details, see NGINX Ingress Controller: Annotations.

          By default, the following annotations are set:

          • nginx.ingress.kubernetes.io/rewrite-target is set to /

          • nginx.ingress.kubernetes.io/upstream-vhost is set to <rgwName>.rook-ceph.svc.

            The value for <rgwName> is spec.cephClusterSpec.objectStorage.rgw.name.

          Optional annotations:

          • nginx.ingress.kubernetes.io/proxy-request-buffering: "off" - disables request buffering for the ingress to prevent the 413 (Request Entity Too Large) error when uploading large files using radosgw.

          • nginx.ingress.kubernetes.io/proxy-body-size: <size> - increases the default upload size limit to prevent the 413 (Request Entity Too Large) error when uploading large files using radosgw. Set the value in MB (m) or KB (k). For example, 100m.

          For example:

          customIngress:
            className: openstack-ingress-nginx
            annotations:
              nginx.ingress.kubernetes.io/rewrite-target: /
              nginx.ingress.kubernetes.io/upstream-vhost: openstack-store.rook-ceph.svc
              nginx.ingress.kubernetes.io/proxy-body-size: 100m
          

          Note

          An ingress rule is by default created with an internal Ceph Object Gateway service endpoint as a backend. Also, rgw dns name is specified in the Ceph configuration and is set to <rgwName>.rook-ceph.svc by default. You can override this option using the spec.cephClusterSpec.rookConfig key-value parameter. In this case, also change the corresponding ingress annotation.

          For example:

          spec:
            cephClusterSpec:
              objectStorage:
                rgw:
                  name: rgw-store
              ingress:
                publicDomain: public.domain.name
                cacert: |
                  -----BEGIN CERTIFICATE-----
                  ...
                  -----END CERTIFICATE-----
                tlsCert: |
                  -----BEGIN CERTIFICATE-----
                  ...
                  -----END CERTIFICATE-----
                tlsKey: |
                  -----BEGIN RSA PRIVATE KEY-----
                  ...
                  -----END RSA PRIVATE KEY-----
                customIngress:
                  annotations:
                    "nginx.ingress.kubernetes.io/upstream-vhost": rgw-store.public.domain.name
              rookConfig:
                "rgw dns name": rgw-store.public.domain.name
          

          Warning

          • For clouds with the publicDomain parameter specified, align the upstream-vhost ingress annotation with the name of the Ceph Object Storage and the specified public domain.

          • Ceph Object Storage requires the upstream-vhost and rgw dns name parameters to be equal. Therefore, override the default rgw dns name with the corresponding ingress annotation value.

    Obtain the MOSK CA certificate for a trusted connection:

    kubectl -n openstack-ceph-shared get secret openstack-rgw-creds -o jsonpath="{.data.ca_cert}" | base64 -d
    
  4. Access internal and public Ceph Object Gateway endpoints by selecting one of the following options:

    Note

    If you are using the HTTP scheme instead of HTTPS:

    In the KaaSCephCluster object on the management cluster, find the ingressConfig section and add custom annotations:

    spec:
      cephClusterSpec:
        ingressConfig:
          annotations:
            "nginx.ingress.kubernetes.io/force-ssl-redirect": "false"
            "nginx.ingress.kubernetes.io/ssl-redirect": "false"
    

    To apply the default settings provided by OpenStack, you can omit the TLS configuration.

    If both HTTP and HTTPS must be used, apply the following configuration in the KaaSCephCluster object:

    spec:
      cephClusterSpec:
        ingressConfig:
          tlsConfig:
            publicDomain: public.domain.name
            certs:
              cacert: |
                -----BEGIN CERTIFICATE-----
                ...
                -----END CERTIFICATE-----
              tlsCert: |
                -----BEGIN CERTIFICATE-----
                ...
                -----END CERTIFICATE-----
              tlsKey: |
                -----BEGIN RSA PRIVATE KEY-----
                ...
                -----END RSA PRIVATE KEY-----
          annotations:
            "nginx.ingress.kubernetes.io/force-ssl-redirect": "false"
            "nginx.ingress.kubernetes.io/ssl-redirect": "false"
    

    In the KaaSCephCluster object on the management cluster, find the ingress section and add custom annotations:

    spec:
      cephClusterSpec:
        ingress:
          publicDomain: public.domain.name
          cacert: |
            -----BEGIN CERTIFICATE-----
            ...
            -----END CERTIFICATE-----
          tlsCert: |
            -----BEGIN CERTIFICATE-----
            ...
            -----END CERTIFICATE-----
          tlsKey: |
            -----BEGIN RSA PRIVATE KEY-----
            ...
            -----END RSA PRIVATE KEY-----
          customIngress:
            annotations:
              "nginx.ingress.kubernetes.io/force-ssl-redirect": "false"
              "nginx.ingress.kubernetes.io/ssl-redirect": "false"
    

    In the KaaSCephCluster object on the management cluster, find the rookConfig section and add the rgw_remote_addr_param parameter:

    spec:
      cephClusterSpec:
        rookConfig:
          rgw_remote_addr_param: "HTTP_X_FORWARDED_FOR"
    
    To access the public endpoint:

    1. Obtain the Ceph Object Gateway public endpoint:

      kubectl -n rook-ceph get ingress
      
    2. Obtain the public endpoint TLS CA certificate:

      kubectl -n rook-ceph get secret $(kubectl -n rook-ceph get ingress -o jsonpath='{.items[0].spec.tls[0].secretName}{"\n"}') -o jsonpath='{.data.ca\.crt}' | base64 -d; echo
      
    To access the internal endpoint:

    1. Obtain the internal endpoint name for Ceph Object Gateway:

      kubectl -n rook-ceph get svc -l app=rook-ceph-rgw
      

      The internal endpoint for Ceph Object Gateway has the https://<internal-svc-name>.rook-ceph.svc:<rgw-secure-port>/ format, where <rgw-secure-port> is spec.rgw.gateway.securePort specified in the KaaSCephCluster CR.

    2. Obtain the internal endpoint TLS CA certificate:

      kubectl -n rook-ceph get secret rgw-ssl-certificate -o jsonpath="{.data.cacert}" | base64 -d
      
  5. Update the zonegroup hostnames of Ceph Object Gateway:

    The zonegroup hostnames of Ceph Object Gateway are updated automatically, and you can skip this step, if one of the following requirements is met:

    • The public hostname matches the public domain name set by the spec.cephClusterSpec.ingressConfig.tlsConfig.publicDomain field

    • The OpenStack configuration has been applied

    Otherwise, complete the steps for MOSK 22.4 and earlier product versions.

    Note

    Prior to MOSK 25.1, use the spec.cephClusterSpec.ingress.publicDomain field instead.

    1. Enter the rook-ceph-tools pod:

      kubectl -n rook-ceph exec -it deployment/rook-ceph-tools -- bash
      
    2. Obtain Ceph Object Gateway default zonegroup configuration:

      radosgw-admin zonegroup get --rgw-zonegroup=<objectStorageName> --rgw-zone=<objectStorageName> | tee zonegroup.json
      

      Substitute <objectStorageName> with the Ceph Object Storage name from spec.cephClusterSpec.objectStorage.rgw.name.

    3. Inspect zonegroup.json and verify that the hostnames key is a list that contains two endpoints: an internal endpoint and a custom public endpoint:

      "hostnames": ["rook-ceph-rgw-<objectStorageName>.rook-ceph.svc", <customPublicEndpoint>]
      

      Substitute <objectStorageName> with the Ceph Object Storage name and <customPublicEndpoint> with the public endpoint with a custom public domain.

    4. If one or both endpoints are omitted in the list, add the missing endpoints to the hostnames list in the zonegroup.json file and update Ceph Object Gateway zonegroup configuration:

      radosgw-admin zonegroup set --rgw-zonegroup=<objectStorageName> --rgw-zone=<objectStorageName> --infile zonegroup.json
      radosgw-admin period update --commit
      
    5. Verify that the hostnames list contains both the internal and custom public endpoint:

      radosgw-admin --rgw-zonegroup=<objectStorageName> --rgw-zone=<objectStorageName> zonegroup get | jq -r ".hostnames"
      

      Example of system response:

      [
        "rook-ceph-rgw-obj-store.rook-ceph.svc",
        "obj-store.mcc1.cluster1.example.com"
      ]
      
    6. Exit the rook-ceph-tools pod:

      exit
      

Once done, Ceph Object Gateway becomes available through the custom public endpoint using an S3 API client, OpenStack Swift CLI, or the OpenStack Horizon Containers plugin.
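
The manual edit of zonegroup.json in step 5 of the procedure above can also be scripted. The following Python snippet is a minimal sketch under stated assumptions: the ensure_hostnames helper is hypothetical, and the sample document is trimmed to the hostnames key, whereas a real zonegroup.json contains many more fields that must be preserved.

```python
import json


def ensure_hostnames(zonegroup: dict, required: list) -> bool:
    """Append missing endpoints to zonegroup["hostnames"].

    Returns True if the list was changed, which means that the
    zonegroup must be re-applied with radosgw-admin zonegroup set.
    """
    hostnames = zonegroup.setdefault("hostnames", [])
    missing = [name for name in required if name not in hostnames]
    hostnames.extend(missing)
    return bool(missing)


# Hypothetical zonegroup.json content with only the internal endpoint.
zonegroup = json.loads(
    '{"name": "obj-store",'
    ' "hostnames": ["rook-ceph-rgw-obj-store.rook-ceph.svc"]}'
)
required = [
    "rook-ceph-rgw-obj-store.rook-ceph.svc",  # internal endpoint
    "obj-store.mcc1.cluster1.example.com",    # custom public endpoint
]

changed = ensure_hostnames(zonegroup, required)
print("update required:", changed)
print(json.dumps(zonegroup["hostnames"], indent=2))
```

If the helper reports a change, write the updated document back to zonegroup.json and run the radosgw-admin zonegroup set and period update commands from step 5.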

Use object storage server-side encryption

TechPreview

When you use Ceph Object Gateway server-side encryption (SSE), unencrypted data sent over HTTPS is stored encrypted by the Ceph Object Gateway in the Ceph cluster. The current implementation integrates Barbican as a key management service.

The object storage SSE feature is enabled by default in MOSK deployments with Barbican. To use object storage SSE, use the AWS CLI S3 client.

To use object storage server-side encryption:

  1. Create Amazon Elastic Compute Cloud (EC2) credentials:

    openstack ec2 credentials create
    
  2. Configure AWS CLI with the access key and secret key created in the previous step:

    aws configure
    
  3. Create a secret key in Barbican:

    openstack secret order create --name <name> --algorithm <algorithm> --mode <mode> --bit-length 256 --payload-content-type=<payload-content-type> key
    

    Substitute the parameters enclosed in angle brackets:

    • <name> - human-friendly name.

    • <algorithm> - algorithm to use with the requested key. For example, aes.

    • <mode> - algorithm mode to use with the requested key. For example, ctr.

    • <payload-content-type> - type/format of the secret to generate. For example, application/octet-stream.

  4. Verify that the key has been created:

    openstack secret order get <order-href>
    

    Substitute <order-href> with the corresponding value from the output of the previous command.

  5. Specify the ceph-rgw user in the Barbican secret Access Control List (ACL):

    1. Obtain the list of ceph-rgw users:

      openstack user list --domain service  | grep ceph-rgw
      

      Example output:

      | c63b70134e0845a2b13c3f947880f66a | ceph-rgwZ6ycK3dY         |
      

      In the output, capture the first value as the <user-id>, which is c63b70134e0845a2b13c3f947880f66a in the above example.

    2. Specify the ceph-rgw user in the Barbican ACL:

      openstack acl user add --user <user-id> <secret-href>
      

      Substitute <user-id> with the corresponding value from the output of the previous command and <secret-href> with the corresponding value obtained in step 3.

  6. Create an S3 bucket:

    aws --endpoint-url <rgw-endpoint-url> --ca-bundle <ca-bundle> s3api create-bucket --bucket <bucket-name>
    

    Substitute the parameters enclosed in angle brackets:

    • <rgw-endpoint-url> - Ceph Object Gateway endpoint DNS name

    • <ca-bundle> - CA Certificate Bundle

    • <bucket-name> - human-friendly bucket name

  7. Upload a file using object storage SSE:

    aws --endpoint-url <rgw-endpoint-url> --ca-bundle <ca-bundle> s3 cp <path-to-file> "s3://<bucket-name>/<filename>" --sse aws:kms --sse-kms-key-id <key-id>
    

    Substitute the parameters enclosed in angle brackets:

    • <path-to-file> - path to the file that you want to upload

    • <filename> - name under which the uploaded file will be stored in the bucket

    • <key-id> - Barbican secret key ID

  8. Select from the following options to download the file:

    • Download the file using a key:

      aws --endpoint-url <rgw-endpoint-url> --ca-bundle <ca-bundle> s3 cp "s3://<bucket-name>/<filename>" <path-to-output-file> --sse aws:kms --sse-kms-key-id <key-id>
      

      Substitute <path-to-output-file> with the path where you want to save the downloaded file.

    • Download the file without a key:

      aws --endpoint-url <rgw-endpoint-url> --ca-bundle <ca-bundle> s3 cp "s3://<bucket-name>/<filename>" <output-filename>
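
Step 5 of the procedure above asks you to capture the first value of the table row as the <user-id>. If you script that step, the extraction can be sketched in Python as follows. The first_column helper name is hypothetical, and the row is the example output from step 5.

```python
def first_column(row: str) -> str:
    """Return the first cell of a table row delimited with "|"."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    return cells[0]


# Row copied from the example output of `openstack user list`.
row = "| c63b70134e0845a2b13c3f947880f66a | ceph-rgwZ6ycK3dY         |"
user_id = first_column(row)
print(user_id)
```

The sketch handles a single data row only. It does not skip table borders or header rows, so filter the output with grep first, as in step 5.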
      
Set an Amazon S3 bucket policy

This section explains how to create an Amazon Simple Storage Service (Amazon S3 or S3) bucket and set an S3 bucket policy between two Ceph Object Storage users.

Create Ceph Object Storage users

Ceph Object Storage users can create Amazon S3 buckets and bucket policies that grant access to other users.

This section describes how to create two Ceph Object Storage users and configure their S3 credentials.

To create and configure Ceph Object Storage users:

  1. Open the KaaSCephCluster CR:

    kubectl --kubeconfig <managementKubeconfig> -n <managedClusterProject> edit kaascephcluster
    

    Substitute <managementKubeconfig> with a management cluster kubeconfig file and <managedClusterProject> with a managed cluster project name.

  2. In the cephClusterSpec section, add new Ceph Object Storage users.

    Caution

    For the user name, use the UUID format with no capital letters.

    For example:

    spec:
      cephClusterSpec:
        objectStorage:
          rgw:
            objectUsers:
            - name: user-a
              displayName: user-a
              capabilities:
                bucket: "*"
                user: read
            - name: user-t
              displayName: user-t
              capabilities:
                bucket: "*"
                user: read
    
  3. Verify that rgwUserSecrets are created for both users:

    kubectl --kubeconfig <managementKubeconfig> -n <managedClusterProject> get kaascephcluster -o yaml
    

    Substitute <managementKubeconfig> with a management cluster kubeconfig file and <managedClusterProject> with a managed cluster project name.

    Example of a positive system response:

    status:
      miraCephSecretsInfo:
        secretInfo:
          rgwUserSecrets:
          - name: user-a
            secretName: <user-aCredSecretName>
            secretNamespace: <user-aCredSecretNamespace>
          - name: user-t
            secretName: <user-tCredSecretName>
            secretNamespace: <user-tCredSecretNamespace>
    
  4. Obtain S3 user credentials from the cluster secrets. Specify an access key and a secret key for both users:

    kubectl --kubeconfig <managedKubeconfig> -n <user-aCredSecretNamespace> get secret <user-aCredSecretName> -o jsonpath='{.data.AccessKey}' | base64 -d
    kubectl --kubeconfig <managedKubeconfig> -n <user-aCredSecretNamespace> get secret <user-aCredSecretName> -o jsonpath='{.data.SecretKey}' | base64 -d
    kubectl --kubeconfig <managedKubeconfig> -n <user-tCredSecretNamespace> get secret <user-tCredSecretName> -o jsonpath='{.data.AccessKey}' | base64 -d
    kubectl --kubeconfig <managedKubeconfig> -n <user-tCredSecretNamespace> get secret <user-tCredSecretName> -o jsonpath='{.data.SecretKey}' | base64 -d
    

    Substitute <managedKubeconfig> with a managed cluster kubeconfig and specify the corresponding secretNamespace and secretName for both users.

  5. Obtain Ceph Object Storage public endpoint from the KaaSCephCluster status:

    kubectl --kubeconfig <managementKubeconfig> -n <managedClusterProject> get kaascephcluster -o yaml | grep PublicEndpoint
    

    Substitute <managementKubeconfig> with a management cluster kubeconfig file and <managedClusterProject> with a managed cluster project name.

    Example of a positive system response:

    objectStorePublicEndpoint: https://object-storage.mirantis.example.com
    
  6. Obtain the CA certificate to use an HTTPS endpoint:

    kubectl --kubeconfig <managedKubeconfig> -n rook-ceph get secret $(kubectl --kubeconfig <managedKubeconfig> -n rook-ceph get ingress -o jsonpath='{.items[0].spec.tls[0].secretName}{"\n"}') -o jsonpath='{.data.ca\.crt}' | base64 -d; echo
    

    Save the output to ca.crt.
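
The credentials in step 4 are stored base64-encoded in the AccessKey and SecretKey fields of each secret. As an alternative to running one kubectl command per field, you can decode both fields from a single JSON dump of the secret. The following Python snippet is a minimal sketch: the decode_s3_credentials helper is hypothetical, and the sample values are placeholders that stand in for the output of kubectl get secret <secretName> -o json.

```python
import base64
import json


def decode_s3_credentials(secret_json: str) -> dict:
    """Decode AccessKey and SecretKey from a Secret in JSON form."""
    data = json.loads(secret_json)["data"]
    return {
        key: base64.b64decode(data[key]).decode("utf-8")
        for key in ("AccessKey", "SecretKey")
    }


# Placeholder secret content, not real credentials.
sample = json.dumps({
    "data": {
        "AccessKey": base64.b64encode(b"EXAMPLEACCESSKEY").decode("ascii"),
        "SecretKey": base64.b64encode(b"example-secret-key").decode("ascii"),
    }
})
print(decode_s3_credentials(sample))
```

Treat the decoded values as sensitive and avoid writing them to logs.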

Set a bucket policy for a Ceph Object Storage user

Available since 2.23.1 (Cluster release 12.7.0)

Amazon S3 is an object storage service with different access policies. A bucket policy is a resource-based policy that grants permissions to a bucket and objects in it. For more details, see Amazon S3 documentation: Using bucket policies.

The following procedure illustrates the process of setting a bucket policy for a bucket (test01) stored in a Ceph Object Storage. The bucket policy requires at least two users: a bucket owner (user-a) and a bucket user (user-t). The bucket owner creates the bucket and sets the policy that regulates access for the bucket user.

Caution

For the user name, use the UUID format with no capital letters.

To configure an Amazon S3 bucket policy:

Note

The s3cmd is a free command-line tool and client for uploading, retrieving, and managing data in Amazon S3 and other cloud storage service providers that use the S3 protocol. You can download the s3cmd CLI tool from Amazon S3 tools: Download s3cmd.

  1. Configure the s3cmd client with the user-a credentials:

    s3cmd --configure --ca-certs=ca.crt
    

    Specify the bucket access parameters as required:

    Bucket access parameters

    Parameter

    Description

    Comment

    Access Key

    Public part of access credentials.

    Specify a user access key.

    Secret Key

    Secret part of access credentials.

    Specify a user secret key.

    Default Region

    Region of AWS servers where requests are sent by default.

    Use the default value.

    S3 Endpoint

    Connection point to the Ceph Object Storage.

    Specify the Ceph Object Storage public endpoint.

    DNS-style bucket+hostname:port template for accessing a bucket

    Bucket location.

    Specify the Ceph Object Storage public endpoint.

    Path to GPG program

    Path to the GNU Privacy Guard encryption suite.

    Use the default value.

    Use HTTPS protocol

    HTTPS protocol switch.

    Specify Yes.

    HTTP Proxy server name

    HTTP Proxy server name.

    Skip this parameter.

    When configured correctly, the s3cmd tool connects to the Ceph Object Storage. Save new settings when prompted by the system.

  2. As user-a, create a new bucket test01:

    s3cmd mb s3://test01
    

    Example of a positive system response:

    Bucket 's3://test01/' created
    
  3. Upload an object to the bucket:

    touch test.txt
    s3cmd put test.txt s3://test01
    

    Example of a positive system response:

    upload: 'test.txt' -> 's3://test01/test.txt'  [1 of 1]
    0 of 0     0% in    0s     0.00 B/s  done
    
  4. Verify that the object is in the test01 bucket:

    s3cmd ls s3://test01
    

    Example of a positive system response:

    2022-09-02 13:06            0  s3://test01/test.txt
    
  5. Create the policy.json bucket policy file that allows user-t to list the bucket and to read and upload its objects:

    {
      "Version": "2012-10-17",
      "Id": "S3Policy1",
      "Statement": [
        {
          "Sid": "BucketAllow",
          "Effect": "Allow",
          "Principal": {
            "AWS": ["arn:aws:iam:::user/user-t"]
          },
          "Action": [
            "s3:ListBucket",
            "s3:PutObject",
            "s3:GetObject"
          ],
          "Resource": [
            "arn:aws:s3:::test01",
            "arn:aws:s3:::test01/*"
          ]
        }
      ]
    }
    
  6. Set the bucket policy for the test01 bucket:

    s3cmd setpolicy policy.json s3://test01
    

    Example of a positive system response:

    s3://test01/: Policy updated
    
  7. Verify that user-t has access to the test01 bucket by reconfiguring the s3cmd client with the user-t credentials:

    s3cmd --configure --ca-certs=ca.crt
    

    Specify the bucket access parameters as described in step 1.

    When configured correctly, the s3cmd tool connects to the Ceph Object Storage. Save new settings when prompted by the system.

    Verify that user-t can read the content of the test01 bucket:

    s3cmd ls s3://test01
    

    Example of a positive system response:

    2022-09-02 13:06            0  s3://test01/test.txt
    
  8. Download the object from the test01 bucket:

    s3cmd get s3://test01/test.txt check.txt
    

    Example of a positive system response:

    download: 's3://test01/test.txt' -> 'check.txt'  [1 of 1]
     0 of 0     0% in    0s     0.00 B/s  done
    
  9. Upload a new object to the test01 bucket:

    s3cmd put check.txt s3://test01
    

    Example of a positive system response:

    upload: 'check.txt' -> 's3://test01/check.txt'  [1 of 1]
     0 of 0     0% in    0s     0.00 B/s  done
    
  10. Verify that the object is in the test01 bucket:

    s3cmd ls s3://test01
    

    Example of a positive system response:

    2022-09-02 14:41            0  s3://test01/check.txt
    2022-09-02 13:06            0  s3://test01/test.txt
    
  11. Verify the new object by reconfiguring the s3cmd client with the user-a credentials:

    s3cmd --configure --ca-certs=ca.crt
    
  12. List test01 bucket objects:

    s3cmd ls s3://test01
    

    Example of a positive system response:

    2022-09-02 14:41            0  s3://test01/check.txt
    2022-09-02 13:06            0  s3://test01/test.txt
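
If you need the same policy for several buckets or users, you can generate the policy document from step 5 instead of editing it by hand. The following Python snippet is a minimal sketch: the bucket_policy helper is hypothetical, and it reproduces only the structure of the example policy used in this procedure.

```python
import json


def bucket_policy(bucket: str, user: str, actions: list) -> dict:
    """Build a policy that allows the user to run the actions on the bucket."""
    return {
        "Version": "2012-10-17",
        "Id": "S3Policy1",
        "Statement": [
            {
                "Sid": "BucketAllow",
                "Effect": "Allow",
                "Principal": {"AWS": [f"arn:aws:iam:::user/{user}"]},
                "Action": list(actions),
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # the bucket itself
                    f"arn:aws:s3:::{bucket}/*",  # objects in the bucket
                ],
            }
        ],
    }


# Same bucket, user, and actions as in the procedure above.
policy = bucket_policy(
    "test01", "user-t", ["s3:ListBucket", "s3:PutObject", "s3:GetObject"]
)
print(json.dumps(policy, indent=2))
```

Save the printed document as policy.json and apply it with s3cmd setpolicy as described in step 6.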
    
Set a bucket policy for OpenStack users

The following procedure illustrates the process of setting a bucket policy for a bucket between two OpenStack users.

Due to specifics of the Ceph integration with OpenStack projects, you should configure the bucket policy for OpenStack users indirectly through setting permissions for corresponding OpenStack projects.

For illustration purposes, we use the following names in the procedure:

  • test01 for the bucket

  • user-a, user-t for the OpenStack users

  • project-a, project-t for the OpenStack projects

To configure an Amazon S3 bucket policy for OpenStack users:

  1. Specify the rookConfig parameter in the cephClusterSpec section of the KaaSCephCluster custom resource:

    spec:
      cephClusterSpec:
        rookConfig:
          rgw keystone implicit tenants: "swift"
    
  2. Prepare the Ceph Object Storage similarly to the procedure described in Create Ceph Object Storage users.

  3. Create two OpenStack projects:

    openstack project create project-a
    openstack project create project-t
    

    Example of system response:

    +-------------+----------------------------------+
    | Field       | Value                            |
    +-------------+----------------------------------+
    | description |                                  |
    | domain_id   | default                          |
    | enabled     | True                             |
    | id          | faf957b776874a2e80384cb882ebf6ab |
    | is_domain   | False                            |
    | name        | project-a                        |
    | options     | {}                               |
    | parent_id   | default                          |
    | tags        | []                               |
    +-------------+----------------------------------+
    

    You can also use existing projects. Save the ID of each project for the bucket policy specification.

    Note

    For details on how to access the OpenStack CLI, refer to Access your OpenStack environment.

  4. Create an OpenStack user for each project:

    openstack user create user-a --project project-a
    openstack user create user-t --project project-t
    

    Example of system response:

    +---------------------+----------------------------------+
    | Field               | Value                            |
    +---------------------+----------------------------------+
    | default_project_id  | faf957b776874a2e80384cb882ebf6ab |
    | domain_id           | default                          |
    | enabled             | True                             |
    | id                  | cc2607dc383e4494948d68eeb556f03b |
    | name                | user-a                           |
    | options             | {}                               |
    | password_expires_at | None                             |
    +---------------------+----------------------------------+
    

    You can also use existing project users.

  5. Assign the member role to the OpenStack users:

    openstack role add member --user user-a --project project-a
    openstack role add member --user user-t --project project-t
    
  6. Verify that the OpenStack users have obtained the member roles paying attention to the role IDs:

    openstack role show member
    

    Example of system response:

    +-------------+----------------------------------+
    | Field       | Value                            |
    +-------------+----------------------------------+
    | description | None                             |
    | domain_id   | None                             |
    | id          | 8f0ce4f6cd61499c809d6169b2b5bd93 |
    | name        | member                           |
    | options     | {'immutable': True}              |
    +-------------+----------------------------------+
    
  7. List the role assignments for the user-a and user-t:

    openstack role assignment list --user user-a --project project-a
    openstack role assignment list --user user-t --project project-t
    

    Example of system response:

    +----------------------------------+----------------------------------+-------+----------------------------------+--------+--------+-----------+
    | Role                             | User                             | Group | Project                          | Domain | System | Inherited |
    +----------------------------------+----------------------------------+-------+----------------------------------+--------+--------+-----------+
    | 8f0ce4f6cd61499c809d6169b2b5bd93 | cc2607dc383e4494948d68eeb556f03b |       | faf957b776874a2e80384cb882ebf6ab |        |        | False     |
    +----------------------------------+----------------------------------+-------+----------------------------------+--------+--------+-----------+
    
  8. Create Amazon EC2 credentials for user-a and user-t:

    openstack ec2 credentials create --user user-a --project project-a
    openstack ec2 credentials create --user user-t --project project-t
    

    Example of system response:

    +------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Field      | Value                                                                                                                                                          |
    +------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | access     | d03971aedc2442dd9a79b3b409c32046                                                                                                                               |
    | links      | {'self': 'http://keystone-api.openstack.svc.cluster.local:5000/v3/users/cc2607dc383e4494948d68eeb556f03b/credentials/OS-EC2/d03971aedc2442dd9a79b3b409c32046'} |
    | project_id | faf957b776874a2e80384cb882ebf6ab                                                                                                                               |
    | secret     | 0a9fd8d9e0d24aecacd6e75951154d0f                                                                                                                               |
    | trust_id   | None                                                                                                                                                           |
    | user_id    | cc2607dc383e4494948d68eeb556f03b                                                                                                                               |
    +------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
    

    Obtain the values from the access and secret fields to connect to Ceph Object Storage through the s3cmd tool.

    Note

    The s3cmd is a free command-line tool for uploading, retrieving, and managing data in Amazon S3 and other cloud storage service providers that use the S3 protocol. You can download the s3cmd tool from Amazon S3 tools: Download s3cmd.

  9. Create bucket users and configure a bucket policy for the project-t OpenStack project similarly to the procedure described in Set a bucket policy for a Ceph Object Storage user. Ceph integration does not allow providing permissions for OpenStack users directly. Therefore, you need to set permissions for the project that corresponds to the user:

    {
      "Version": "2012-10-17",
      "Id": "S3Policy1",
      "Statement": [
        {
         "Sid": "BucketAllow",
         "Effect": "Allow",
         "Principal": {
           "AWS": ["arn:aws:iam::<PROJECT-T_ID>:root"]
         },
         "Action": [
           "s3:ListBucket",
           "s3:PutObject",
           "s3:GetObject"
         ],
         "Resource": [
           "arn:aws:s3:::test01",
           "arn:aws:s3:::test01/*"
         ]
        }
      ]
    }
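
The access and secret values from step 8 can also be extracted programmatically. The following Python sketch assumes that the credentials command was run with the `-f json` output formatter of the OpenStack CLI and that its output was saved as text. The function name is hypothetical, and the sample values are copied from the example response above.

```python
import json


def ec2_keys(credentials_json):
    """Return (access, secret) from JSON-formatted EC2 credentials output."""
    data = json.loads(credentials_json)
    return data["access"], data["secret"]


# Sample output trimmed to the fields used here (values from the example above).
sample = json.dumps({
    "access": "d03971aedc2442dd9a79b3b409c32046",
    "secret": "0a9fd8d9e0d24aecacd6e75951154d0f",
    "project_id": "faf957b776874a2e80384cb882ebf6ab",
})

access_key, secret_key = ec2_keys(sample)
print(f"Access Key: {access_key}")
print(f"Secret Key: {secret_key}")
```

Enter the two values when `s3cmd --configure` prompts for the access key and the secret key.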
    
Ceph Object Storage bucket policy examples

You can configure different bucket policies for various situations. See examples below.

Provide access to a bucket from one OpenStack project to another
{
  "Version": "2012-10-17",
  "Id": "S3Policy1",
  "Statement": [
    {
     "Sid": "BucketAllow",
     "Effect": "Allow",
     "Principal": {
       "AWS": ["arn:aws:iam::<osProjectId>:root"]
     },
     "Action": [
       "s3:ListBucket",
       "s3:PutObject",
       "s3:GetObject"
     ],
     "Resource": [
       "arn:aws:s3:::<bucketName>",
       "arn:aws:s3:::<bucketName>/*"
     ]
    }
  ]
}

Substitute the following parameters:

  • <osProjectId> - the target OpenStack project ID

  • <bucketName> - the target bucket name where the policy will be set

Provide access to a bucket from a Ceph Object Storage user to an OpenStack project
{
  "Version": "2012-10-17",
  "Id": "S3Policy1",
  "Statement": [
    {
     "Sid": "BucketAllow",
     "Effect": "Allow",
     "Principal": {
       "AWS": ["arn:aws:iam::<osProjectId>:root"]
     },
     "Action": [
       "s3:ListBucket",
       "s3:PutObject",
       "s3:GetObject"
     ],
     "Resource": [
       "arn:aws:s3:::<bucketName>",
       "arn:aws:s3:::<bucketName>/*"
     ]
    }
  ]
}

Substitute the following parameters:

  • <osProjectId> - the target OpenStack project ID

  • <bucketName> - the target bucket name where the policy will be set

Provide access to a bucket from an OpenStack user to a Ceph Object Storage user
{
  "Version": "2012-10-17",
  "Id": "S3Policy1",
  "Statement": [
    {
     "Sid": "BucketAllow",
     "Effect": "Allow",
     "Principal": {
       "AWS": ["arn:aws:iam:::user/<userName>"]
     },
     "Action": [
       "s3:ListBucket",
       "s3:PutObject",
       "s3:GetObject"
     ],
     "Resource": [
       "arn:aws:s3:::<bucketName>",
       "arn:aws:s3:::<bucketName>/*"
     ]
    }
  ]
}

Substitute the following parameters:

  • <userName> - the target Ceph Object Storage user name

  • <bucketName> - the target bucket name where the policy will be set

Provide access to a bucket from one Ceph Object Storage user to another
{
  "Version": "2012-10-17",
  "Id": "S3Policy1",
  "Statement": [
    {
     "Sid": "BucketAllow",
     "Effect": "Allow",
     "Principal": {
       "AWS": ["arn:aws:iam:::user/<userName>"]
     },
     "Action": [
       "s3:ListBucket",
       "s3:PutObject",
       "s3:GetObject"
     ],
     "Resource": [
       "arn:aws:s3:::<bucketName>",
       "arn:aws:s3:::<bucketName>/*"
     ]
    }
  ]
}

Substitute the following parameters:

  • <userName> - the target Ceph Object Storage user name

  • <bucketName> - the target bucket name where the policy will be set
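
The four examples above differ only in the Principal element: an OpenStack project is addressed as `arn:aws:iam::<osProjectId>:root`, and a Ceph Object Storage user as `arn:aws:iam:::user/<userName>`. The following Python sketch generates such a policy document. The function names are hypothetical helpers, not part of any Ceph or MOSK tooling, and the project ID is the one from the example response earlier in this section.

```python
import json


def project_principal(os_project_id):
    """Principal ARN that grants access to an OpenStack project."""
    return f"arn:aws:iam::{os_project_id}:root"


def user_principal(user_name):
    """Principal ARN that grants access to a Ceph Object Storage user."""
    return f"arn:aws:iam:::user/{user_name}"


def bucket_policy(bucket_name, principal_arn, actions=None):
    """Build a bucket policy document shaped like the examples above."""
    if actions is None:
        actions = ["s3:ListBucket", "s3:PutObject", "s3:GetObject"]
    return {
        "Version": "2012-10-17",
        "Id": "S3Policy1",
        "Statement": [
            {
                "Sid": "BucketAllow",
                "Effect": "Allow",
                "Principal": {"AWS": [principal_arn]},
                "Action": list(actions),
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }


policy = bucket_policy(
    "test01", project_principal("faf957b776874a2e80384cb882ebf6ab")
)
print(json.dumps(policy, indent=2))
```

Save the printed JSON to a file and apply it as described in Set a bucket policy for a Ceph Object Storage user.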

Calculate target ratio for Ceph pools

Ceph pool target ratio tells the Placement Group (PG) autoscaler how much data each pool is expected to acquire over time relative to the other pools. You can set initial PG values for each Ceph pool. Otherwise, the autoscaler starts with the minimum value and scales up, causing a lot of data to move in the background.

You can allocate several pools to use the same device class, which is a solid block of available capacity in Ceph. For example, if three pools (kubernetes-hdd, images-hdd, and volumes-hdd) are set to use the same device class hdd, you can set the target ratio for Ceph pools to provide 80% of capacity to the volumes-hdd pool and distribute the remaining capacity evenly between the two other pools. This way, Ceph pool target ratio instructs Ceph on when to warn that a pool is running out of free space and, at the same time, instructs Ceph on how many placement groups Ceph should allocate/autoscale for a pool for better data distribution.

Ceph pool target ratio is not a constant value and you can change it according to new capacity plans. Once you specify target ratio, if the PG number of a pool scales, other pools with specified target ratio will automatically scale accordingly.

For details, see Ceph Documentation: Autoscaling Placement Groups.

To calculate target ratio for each Ceph pool:

  1. Define raw capacity of the entire storage by device class:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o name) -- ceph df
    

    For illustration purposes, the procedure below uses raw capacity of 185 TB or 189440 GB.

  2. Design the Ceph pools of the considered device class by defining the upper bound of the possible capacity for each pool. For example, consider the hdd device class that contains the following pools:

    • The kubernetes-hdd pool will contain not more than 2048 GB.

    • The images-hdd pool will contain not more than 2048 GB.

    • The volumes-hdd pool will contain 50 GB per VM. The upper bound of used VMs on the cloud is 204, the pool replicated size is 3. Therefore, calculate the upper bounds for volumes-hdd:

      50 GB per VM * 204 VMs * 3 replicas = 30600 GB
      
    • The backup-hdd pool can be sized relative to volumes-hdd. For example, 5 backup-hdd storage units per 1 volumes-hdd unit.

    • The vms-hdd pool is used for ephemeral Copy on Write (CoW) storage. We recommend estimating the amount of ephemeral data that the pool should store. For example purposes, we use 500 GB. However, in reality, despite the CoW data reduction, this value is very optimistic.

    Note

    If dataPool is replicated and Ceph Object Store is planned for intensive use, also calculate upper bounds for dataPool.

  3. Calculate target ratio for each considered pool. For example:

    Example bounds and capacity

    Pools upper bounds

    Pools capacity

    • kubernetes-hdd = 2048 GB

    • images-hdd = 2048 GB

    • volumes-hdd = 30600 GB

    • backup-hdd = 30600 GB * 5 = 153000 GB

    • vms-hdd = 500 GB

    • Summary capacity = 188196 GB

    • Total raw capacity = 189440 GB

    1. Calculate pools fit factor using the (total raw capacity) / (pools summary capacity) formula. For example:

      pools fit factor = 189440 / 188196 = 1.0066
      
    2. Calculate pools upper bounds size using the (pool upper bounds) * (pools fit factor) formula. For example:

      kubernetes-hdd = 2048 GB * 1.0066   = 2061.5168 GB
      images-hdd     = 2048 GB * 1.0066   = 2061.5168 GB
      volumes-hdd    = 30600 GB * 1.0066  = 30801.96 GB
      backup-hdd     = 153000 GB * 1.0066 = 154009.8 GB
      vms-hdd        = 500 GB * 1.0066    = 503.3 GB
      
    3. Calculate pool target ratio using the (pool upper bounds) * 100 / (total raw capacity) formula. For example:

      kubernetes-hdd = 2061.5168 GB * 100 / 189440 GB = 1.088
      images-hdd     = 2061.5168 GB * 100 / 189440 GB = 1.088
      volumes-hdd    = 30801.96 GB * 100 / 189440 GB  = 16.259
      backup-hdd     = 154009.8 GB * 100 / 189440 GB  = 81.297
      vms-hdd        = 503.3 GB * 100 / 189440 GB     = 0.266
      
  4. If required, calculate the target ratio for erasure-coded pools.

    Because erasure-coded pools split each object into K data parts and M coding parts, the total used storage for each object is less than that in replicated pools. M is equal to the number of OSDs that can be missing from the cluster without the cluster experiencing data loss. This means that planned data is stored with an overhead factor of (K+M)/K on the Ceph cluster. For example, if an erasure-coded data pool with K=2, M=2 planned capacity is 200 GB, then the total used capacity is 200*(2+2)/2, which is 400 GB.

  5. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  6. In the spec.cephClusterSpec.pools section, specify the calculated ratios as targetSizeRatio for each considered replicated pool. For example:

    spec:
      cephClusterSpec:
        pools:
        - name: kubernetes
          deviceClass: hdd
          ...
          replicated:
            size: 3
            targetSizeRatio: 1.088
        - name: images
          deviceClass: hdd
          ...
          replicated:
            size: 3
            targetSizeRatio: 1.088
        - name: volumes
          deviceClass: hdd
          ...
          replicated:
            size: 3
            targetSizeRatio: 16.259
        - name: backup
          deviceClass: hdd
          ...
          replicated:
            size: 3
            targetSizeRatio: 81.297
        - name: vms
          deviceClass: hdd
          ...
          replicated:
            size: 3
            targetSizeRatio: 0.266
    

    If Ceph Object Store dataPool is replicated and a proper value is calculated, also specify it:

    spec:
      cephClusterSpec:
        objectStorage:
          rgw:
            name: rgw-store
            ...
            dataPool:
              deviceClass: hdd
              ...
              replicated:
                size: 3
                targetSizeRatio: <relative>
    
  7. In the spec.cephClusterSpec.pools section, specify the calculated ratios as parameters.target_size_ratio for each considered erasure-coded pool. For example:

    Note

    The parameters section is a key-value mapping where the value is of the string type and should be quoted.

    spec:
      cephClusterSpec:
        pools:
        - name: ec-pool
          deviceClass: hdd
          ...
          parameters:
            target_size_ratio: "<relative>"
    

    If Ceph Object Store dataPool is erasure-coded and a proper value is calculated, also specify it:

    spec:
      cephClusterSpec:
        objectStorage:
          rgw:
            name: rgw-store
            ...
            dataPool:
              deviceClass: hdd
              ...
              parameters:
                target_size_ratio: "<relative>"
    
  8. Verify that all target ratios have been successfully applied to the Ceph cluster:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o name) -- ceph osd pool autoscale-status
    

    Example of system response:

    POOL                                SIZE  TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
    device_health_metrics               0                  2.0   149.9G        0.0000                                 1.0   1                   on
    kubernetes-hdd                      2068               2.0   149.9G        0.0000  1.088         1.0885           1.0   32                  on
    volumes-hdd                         19                 2.0   149.9G        0.0000  16.259        16.2591          1.0   256                 on
    vms-hdd                             19                 2.0   149.9G        0.0000  0.266         0.2661           1.0   128                 on
    backup-hdd                          19                 2.0   149.9G        0.0000  81.297        81.2972          1.0   256                 on
    images-hdd                          888.8M             2.0   149.9G        0.0116  1.088         1.0881           1.0   32                  on
    
  9. Optional. Repeat the steps above for other device classes.
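
The arithmetic of steps 2 through 4 can be sketched in Python as follows. The function names are hypothetical. The sketch rounds the pools fit factor to four decimals and the ratios to three decimals to match the worked example, and it uses the general (K+M)/K overhead for erasure-coded pools, which gives the same 400 GB for the K=2, M=2 example.

```python
def target_ratios(upper_bounds_gb, raw_capacity_gb):
    """Return {pool: target ratio} for pools that share one device class."""
    summary_gb = sum(upper_bounds_gb.values())
    # Pools fit factor, rounded to four decimals as in the worked example.
    fit_factor = round(raw_capacity_gb / summary_gb, 4)
    ratios = {}
    for pool, bound_gb in upper_bounds_gb.items():
        scaled_gb = bound_gb * fit_factor  # pool upper bound scaled to fit
        ratios[pool] = round(scaled_gb * 100 / raw_capacity_gb, 3)
    return ratios


def erasure_coded_raw_gb(planned_gb, k, m):
    """Raw capacity used by an erasure-coded pool with K data and M coding parts."""
    return planned_gb * (k + m) / k


volumes_gb = 50 * 204 * 3  # 50 GB per VM * 204 VMs * 3 replicas
upper_bounds_gb = {
    "kubernetes-hdd": 2048,
    "images-hdd": 2048,
    "volumes-hdd": volumes_gb,
    "backup-hdd": volumes_gb * 5,
    "vms-hdd": 500,
}

for pool, ratio in target_ratios(upper_bounds_gb, 189440).items():
    print(f"{pool}: {ratio}")
print(f"erasure-coded raw capacity: {erasure_coded_raw_gb(200, 2, 2)} GB")
```

With the example inputs, the computed ratios match the targetSizeRatio values used in step 6.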

Ceph pools for Cinder multi-backend

Available since MOSK 23.2

The KaaSCephCluster object supports multiple Ceph pools with the volumes role to configure Cinder multiple backends.

To define Ceph pools for Cinder multiple backends:

  1. In the KaaSCephCluster object, add the desired number of Ceph pools to the pools section with the volumes role:

    kubectl -n <MOSKClusterProject> edit kaascephcluster
    

    Substitute <MOSKClusterProject> with the corresponding namespace of the MOSK cluster.

    Example configuration:

    spec:
      cephClusterSpec:
        pools:
        - default: false
          deviceClass: hdd
          name: volumes
          replicated:
            size: 3
          role: volumes
        - default: false
          deviceClass: hdd
          name: volumes-backend-1
          replicated:
            size: 3
          role: volumes
        - default: false
          deviceClass: hdd
          name: volumes-backend-2
          replicated:
            size: 3
          role: volumes
    
  2. Verify that Cinder backend pools are created and ready:

    kubectl -n <MOSKClusterProject> get kaascephcluster -o yaml
    

    Example output:

    status:
      fullClusterStatus:
        blockStorageStatus:
          poolsStatus:
            volumes-hdd:
              present: true
              status:
                observedGeneration: 1
                phase: Ready
            volumes-backend-1-hdd:
              present: true
              status:
                observedGeneration: 1
                phase: Ready
            volumes-backend-2-hdd:
              present: true
              status:
                observedGeneration: 1
                phase: Ready
    
  3. Verify that the added Ceph pools are accessible from the Cinder service. For example:

    kubectl -n openstack exec -it cinder-volume-0 -- rbd ls -p volumes-backend-1-hdd -n client.cinder
    kubectl -n openstack exec -it cinder-volume-0 -- rbd ls -p volumes-backend-2-hdd -n client.cinder
    

After the Ceph pool becomes available, it is automatically specified as an additional Cinder backend and registered as a new volume type, which you can use to create Cinder volumes.
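
The verification in step 2 can be sketched as a small Python check. It assumes that the Ceph pool name is formed as `<name>-<deviceClass>`, which is inferred from the example output above and is not a documented guarantee. The function names are hypothetical, and the dictionaries mirror only the fields shown in the examples.

```python
def ceph_pool_name(pool):
    """Pool name as it appears in the status example: <name>-<deviceClass>."""
    return f"{pool['name']}-{pool['deviceClass']}"


def not_ready_volume_pools(spec_pools, pools_status):
    """Return names of volumes-role pools that are missing or not Ready."""
    problems = []
    for pool in spec_pools:
        if pool.get("role") != "volumes":
            continue
        name = ceph_pool_name(pool)
        entry = pools_status.get(name, {})
        phase = entry.get("status", {}).get("phase")
        if not entry.get("present") or phase != "Ready":
            problems.append(name)
    return problems


# Data copied from the spec and status examples above.
spec_pools = [
    {"name": "volumes", "deviceClass": "hdd", "role": "volumes"},
    {"name": "volumes-backend-1", "deviceClass": "hdd", "role": "volumes"},
    {"name": "volumes-backend-2", "deviceClass": "hdd", "role": "volumes"},
]
ready = {"present": True, "status": {"observedGeneration": 1, "phase": "Ready"}}
pools_status = {
    "volumes-hdd": ready,
    "volumes-backend-1-hdd": ready,
    "volumes-backend-2-hdd": ready,
}

print(not_ready_volume_pools(spec_pools, pools_status))
```

An empty list means that every pool with the volumes role is present and in the Ready phase.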

The following sections describe how to configure, manage, and verify specific aspects of a Ceph cluster.

Caution

Before you proceed with any reading or writing operation, first verify the cluster status using the ceph tool as described in Verify the Ceph core services.

Automated Ceph LCM

This section describes the supported automated Ceph lifecycle management (LCM) operations.

High-level workflow of Ceph OSD or node removal

The Ceph LCM automated operations such as Ceph OSD or Ceph node removal are performed by creating a corresponding KaaSCephOperationRequest CR that creates separate CephOsdRemoveRequest requests. It allows for automated removal of healthy or non-healthy Ceph OSDs from a Ceph cluster and covers the following scenarios:

  • Reducing hardware - all Ceph OSDs are up/in but you want to decrease the number of Ceph OSDs by reducing the number of disks or hosts.

  • Hardware issues. For example, if a host unexpectedly goes down and will not be restored, or if a disk on a host goes down and requires replacement.

This section describes the KaaSCephOperationRequest CR creation workflow, specification, and request status.

For step-by-step procedures, refer to Automated Ceph LCM.

Creating a Ceph OSD removal request

The workflow of creating a Ceph OSD removal request includes the following steps:

  1. Removing obsolete nodes or disks from the spec.nodes section of the KaaSCephCluster CR as described in Ceph advanced configuration.

    Note

    Note the names of the removed nodes, devices or their paths exactly as they were specified in KaaSCephCluster for further usage.

  2. Creating a YAML template for the KaaSCephOperationRequest CR. For details, see KaaSCephOperationRequest OSD removal specification.

    • If KaaSCephOperationRequest contains information about Ceph OSDs to remove in a proper format, the information will be validated to eliminate human error and avoid a wrong Ceph OSD removal.

    • If the osdRemove.nodes section of KaaSCephOperationRequest is empty, the Ceph Request Controller will automatically detect Ceph OSDs for removal, if any. Auto-detection is based not only on the information provided in the KaaSCephCluster but also on the information from the Ceph cluster itself.

    Once the validation or auto-detection completes, the entire information about the Ceph OSDs to remove appears in the KaaSCephOperationRequest object: hosts they belong to, OSD IDs, disks, partitions, and so on. The request then moves to the ApproveWaiting phase until the Operator manually specifies the approve flag in the spec.

  3. Manually adding an affirmative approve flag in the KaaSCephOperationRequest spec. Once done, the Ceph Status Controller reconciliation pauses until the request is handled, and the request processing does the following:

    • Stops regular Ceph Controller reconciliation

    • Removes Ceph OSDs

    • Runs batch jobs to clean up the device, if possible

    • Removes host information from the Ceph cluster if the entire Ceph node is removed

    • Marks the request with an appropriate result and a description of any issues that occurred

    Note

    If the request completes successfully, Ceph Controller reconciliation resumes. Otherwise, it remains paused until the issue is resolved.

  4. Reviewing the Ceph OSD removal status. For details, see KaaSCephOperationRequest OSD removal status.

  5. Manual removal of device cleanup jobs.

    Note

    Device cleanup jobs are not removed automatically and are kept in the ceph-lcm-mirantis namespace along with pods containing information about the executed actions. The jobs have the following labels:

    labels:
      app: miraceph-cleanup-disks
      host: <HOST-NAME>
      osd: <OSD-ID>
      rook-cluster: <ROOK-CLUSTER-NAME>
    

    Additionally, jobs are labeled with disk names that will be cleaned up, such as vdb=true. You can remove a single job or a group of jobs using any label described above, such as host, disk, and so on.
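
To select cleanup jobs by the labels described in step 5, you can combine them into a standard Kubernetes label selector. The following Python sketch builds such a selector string. The function name and the sample host and disk values are illustrative.

```python
def cleanup_job_selector(host=None, osd=None, disk=None):
    """Build a Kubernetes label selector for device cleanup jobs."""
    labels = {"app": "miraceph-cleanup-disks"}
    if host is not None:
        labels["host"] = host
    if osd is not None:
        labels["osd"] = str(osd)
    if disk is not None:
        # Jobs carry one label per cleaned-up disk, such as vdb=true.
        labels[disk] = "true"
    return ",".join(f"{key}={value}" for key, value in labels.items())


print(cleanup_job_selector(host="worker-3", disk="vdb"))
# prints: app=miraceph-cleanup-disks,host=worker-3,vdb=true
```

Pass the resulting string to the `-l` option of `kubectl -n ceph-lcm-mirantis get jobs` or `kubectl -n ceph-lcm-mirantis delete jobs`.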

Example of KaaSCephOperationRequest resource
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: remove-osd-3-4-request
  namespace: managed-namespace
spec:
  osdRemove:
    approve: true
    nodes:
      worker-3:
        cleanupByDevice:
        - name: sdb
        - path: /dev/disk/by-path/pci-0000:00:1t.9
  kaasCephCluster:
    name: ceph-cluster-managed-cluster
    namespace: managed-namespace
Example of Ceph OSDs ready for removal
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  generateName: remove-osds
  namespace: managed-ns
spec:
  osdRemove: {}
  kaasCephCluster:
    name: ceph-cluster-managed-cl
    namespace: managed-ns
KaaSCephOperationRequest OSD removal specification

This section describes the KaaSCephOperationRequest CR specification used to automatically create a CephOsdRemoveRequest request. For the procedure workflow, see Creating a Ceph OSD removal request.


KaaSCephOperationRequest high-level parameters spec

Parameter

Description

osdRemove

Describes the definition for the CephOsdRemoveRequest spec. For details on the osdRemove parameters, see the tables below.

kaasCephCluster

Defines the KaaSCephCluster on which the KaaSCephOperationRequest depends. Use the kaasCephCluster parameter if the name or project of the corresponding Container Cloud cluster differs from the default one:

spec:
  kaasCephCluster:
    name: kaas-mgmt
    namespace: default
KaaSCephOperationRequest ‘osdRemove’ parameters spec

Parameter

Description

nodes

Map of Kubernetes nodes that specifies how to remove Ceph OSDs: by host-devices or OSD IDs. For details, see KaaSCephOperationRequest ‘nodes’ parameters spec.

approve

Flag that indicates whether a request is ready to execute removal. Can only be manually enabled by the Operator. For example:

spec:
  osdRemove:
    approve: true

keepOnFail

Flag used to keep requests in handling and not to proceed to the next request if the Validation or Processing phase fails. The request remains in the InputWaiting state until the flag or the request itself is removed, or until the request spec is updated.

If the Validation phase fails, you can update the spec.osdRemove.nodes section in KaaSCephOperationRequest to avoid issues and re-run the validation. If the Processing phase fails, you can resolve the issues and apply the actions required to keep cluster data before the Ceph Controller reconciliation resumes and the next request is processed.

For example:

spec:
  osdRemove:
    keepOnFail: true

resolved

Optional. Flag that marks a finished request, even if it failed, to keep it in history. It allows resuming the Ceph Controller reconciliation without removing the failed request. The flag is used only by Ceph Controller and has no effect on request processing. Can only be manually specified. For example:

spec:
  osdRemove:
    resolved: true

resumeFailed

Optional. Flag used to resume a failed request and proceed with Ceph OSD removal if keepOnFail is set and the request status is InputWaiting. For example:

spec:
  osdRemove:
    resumeFailed: true
KaaSCephOperationRequest ‘nodes’ parameters spec

Parameter

Description

completeCleanUp

Flag used to clean up an entire node and drop it from the CRUSH map. Mutually exclusive with cleanupByDevice and cleanupByOsdId.

cleanupByDevice

List that describes devices to clean up by name or device path as they were specified in KaaSCephCluster. Mutually exclusive with completeCleanUp and cleanupByOsdId. Includes the following parameters:

  • name - name of the device to remove from the Ceph cluster. Mutually exclusive with path.

  • path - by-path of the device to remove from the Ceph cluster. Mutually exclusive with name. Supports device removal with by-id.

Warning

Since MOSK 23.3, Mirantis does not recommend setting device name or device by-path symlink in the cleanupByDevice field as these identifiers are not persistent and can change at node boot. Remove Ceph OSDs with by-id symlinks specified in the path field or use cleanupByOsdId instead.

For details, see Container Cloud documentation: Addressing storage devices.

cleanupByOsdId

List of Ceph OSD IDs to remove. Mutually exclusive with completeCleanUp and cleanupByDevice.

Example of KaaSCephOperationRequest with spec.osdRemove.nodes
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: remove-osd-request
  namespace: default
spec:
  kaasCephCluster:
    name: kaas-mgmt
    namespace: default
  osdRemove:
    nodes:
      "node-a":
        completeCleanUp: true
      "node-b":
        cleanupByOsdId: [1, 15, 25]
      "node-c":
        cleanupByDevice:
        - name: "sdb"
        - path: "/dev/disk/by-path/pci-0000:00:1c.5"
        - path: "/dev/disk/by-id/scsi-SATA_HGST_HUS724040AL_PN1334PEHN18ZS"

The example above includes the following actions:

  • For node-a, full cleanup, including all OSDs on the node, node drop from the CRUSH map, and cleanup of all disks used for Ceph OSDs on this node.

  • For node-b, cleanup of Ceph OSDs with IDs 1, 15, and 25 along with the related disk information.

  • For node-c, cleanup of the device named sdb, the device with the by-path /dev/disk/by-path/pci-0000:00:1c.5, and the device with the by-id /dev/disk/by-id/scsi-SATA_HGST_HUS724040AL_PN1334PEHN18ZS, along with removal of the Ceph OSDs running on these devices.
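
The mutual exclusivity rules from the 'nodes' parameters table can be checked before you submit a request. The following Python sketch is a client-side illustration only. It is not the validation that the Ceph controllers perform, and the function name is hypothetical.

```python
CLEANUP_KEYS = ("completeCleanUp", "cleanupByDevice", "cleanupByOsdId")


def validate_nodes(nodes):
    """Return a list of problems found in an osdRemove.nodes mapping."""
    problems = []
    for node, spec in nodes.items():
        used = [key for key in CLEANUP_KEYS if spec.get(key)]
        if len(used) > 1:
            problems.append(f"{node}: {', '.join(used)} are mutually exclusive")
        for index, device in enumerate(spec.get("cleanupByDevice") or []):
            # Each device entry must set exactly one of name or path.
            if ("name" in device) == ("path" in device):
                problems.append(
                    f"{node}: cleanupByDevice[{index}] needs exactly one of name or path"
                )
    return problems


# The osdRemove.nodes mapping from the example above.
nodes = {
    "node-a": {"completeCleanUp": True},
    "node-b": {"cleanupByOsdId": [1, 15, 25]},
    "node-c": {
        "cleanupByDevice": [
            {"name": "sdb"},
            {"path": "/dev/disk/by-path/pci-0000:00:1c.5"},
            {"path": "/dev/disk/by-id/scsi-SATA_HGST_HUS724040AL_PN1334PEHN18ZS"},
        ]
    },
}

print(validate_nodes(nodes))
```

For the example mapping, the function reports no problems.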

KaaSCephOperationRequest OSD removal status

This section describes the status.osdRemoveStatus.removeInfo fields of the KaaSCephOperationRequest CR that you can use to review the phases of a Ceph OSD or node removal. The following diagram represents the phases flow:

_images/ceph-osd-remove-phases-flow.png
KaaSCephOperationRequest high-level parameters status

Parameter

Description

osdRemoveStatus

Describes the status of the current CephOsdRemoveRequest. For details, see KaaSCephOperationRequest ‘osdRemoveStatus’ parameters status.

childNodesMapping

The key-value mapping that reflects the management cluster machine names with their corresponding Kubernetes node names.

KaaSCephOperationRequest ‘osdRemoveStatus’ parameters status

Parameter

Description

phase

Describes the current request phase that can be one of:

  • Pending - the request is created and placed in the request queue.

  • Validation - the request is taken from the queue and the provided information is being validated.

  • ApproveWaiting - the request passed the validation phase, is ready to execute, and is waiting for user confirmation through the approve flag.

  • Processing - the request is executing and goes through the following phases:

    • Pending - marking the current Ceph OSD for removal.

    • Rebalancing - the Ceph OSD is moved out, waiting until it is rebalanced. If the current Ceph OSD is down or already out, the next phase takes place.

    • Removing - purging the Ceph OSD and its authorization key.

    • Removed - the Ceph OSD has been successfully removed.

    • Failed - the Ceph OSD failed to remove.

  • Completed - the request executed with no issues.

  • CompletedWithWarnings - the request executed with non-critical issues. Review the output because action may be required.

  • InputWaiting - during the Validation or Processing phases, critical issues occurred that require attention. If issues occurred during validation, update osdRemove information, if present, and re-run validation. If issues occurred during processing, review the reported issues and manually resolve them.

  • Failed - the request failed during the Validation or Processing phases.

removeInfo

The overall information about the Ceph OSDs to remove: final removal map, issues, and warnings. Once the Processing phase succeeds, removeInfo will be extended with the removal status for each node and Ceph OSD. In case of an entire node removal, the status will contain the status itself and an error message, if any.

The removeInfo.osdMapping field contains information about:

  • Ceph OSDs removal status.

  • Batch job reference for the device cleanup: its name, status, and error, if any. The batch job status for the device cleanup will be either Failed, Completed, or Skipped. The Skipped status is used when a host is down, a disk has crashed, or an error occurred when obtaining the ceph-volume information.

  • Ceph OSD deployment removal status and the related Ceph OSD name. The status will be either Failed or Removed.

messages

Informational messages describing the reason for the request transition to the next phase.

conditions

History of spec updates for the request.
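The phases above can be read as a small decision table for the operator. The following Python sketch is illustrative only: the phase names come from the table, while the next_action helper and the wording of its hints are assumptions and not part of the product API.

```python
# Phase names are taken from the table above. The hints are an
# illustrative assumption about what an operator would do next.
PHASE_HINTS = {
    "Pending": "wait",
    "Validation": "wait",
    "Processing": "wait",
    "ApproveWaiting": "review removeInfo and set the approve flag",
    "InputWaiting": "resolve the reported issues",
    "Completed": "done",
    "CompletedWithWarnings": "done, review the warnings",
    "Failed": "inspect messages and controller logs",
}


def next_action(phase: str) -> str:
    """Return a short hint for the given request phase."""
    try:
        return PHASE_HINTS[phase]
    except KeyError:
        raise ValueError(f"unknown phase: {phase}") from None


if __name__ == "__main__":
    for phase in ("Pending", "ApproveWaiting", "CompletedWithWarnings"):
        print(phase, "->", next_action(phase))
```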

Example of status.osdRemoveStatus.removeInfo after successful Validation
removeInfo:
  cleanUpMap:
    "node-a":
      completeCleanUp: true
      osdMapping:
        "2":
          deviceMapping:
            "sdb":
              path: "/dev/disk/by-path/pci-0000:00:0a.0"
              partition: "/dev/ceph-a-vg_sdb/osd-block-b-lv_sdb"
              type: "block"
              class: "hdd"
              zapDisk: true
        "6":
          deviceMapping:
            "sdc":
              path: "/dev/disk/by-path/pci-0000:00:0c.0"
              partition: "/dev/ceph-a-vg_sdc/osd-block-b-lv_sdc-1"
              type: "block"
              class: "hdd"
              zapDisk: true
        "11":
          deviceMapping:
            "sdc":
              path: "/dev/disk/by-path/pci-0000:00:0c.0"
              partition: "/dev/ceph-a-vg_sdc/osd-block-b-lv_sdc-2"
              type: "block"
              class: "hdd"
              zapDisk: true
    "node-b":
      osdMapping:
        "1":
          deviceMapping:
            "sdb":
              path: "/dev/disk/by-path/pci-0000:00:0a.0"
              partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdb"
              type: "block"
              class: "ssd"
              zapDisk: true
        "15":
          deviceMapping:
            "sdc":
              path: "/dev/disk/by-path/pci-0000:00:0b.1"
              partition: "/dev/ceph-b-vg_sdc/osd-block-b-lv_sdc"
              type: "block"
              class: "ssd"
              zapDisk: true
        "25":
          deviceMapping:
            "sdd":
              path: "/dev/disk/by-path/pci-0000:00:0c.2"
              partition: "/dev/ceph-b-vg_sdd/osd-block-b-lv_sdd"
              type: "block"
              class: "ssd"
              zapDisk: true
    "node-c":
      osdMapping:
        "0":
          deviceMapping:
            "sdb":
              path: "/dev/disk/by-path/pci-0000:00:1t.9"
              partition: "/dev/ceph-c-vg_sdb/osd-block-c-lv_sdb"
              type: "block"
              class: "hdd"
              zapDisk: true
        "8":
          deviceMapping:
            "sde":
              path: "/dev/disk/by-path/pci-0000:00:1c.5"
              partition: "/dev/ceph-c-vg_sde/osd-block-c-lv_sde"
              type: "block"
              class: "hdd"
              zapDisk: true
            "sdf":
              path: "/dev/disk/by-path/pci-0000:00:5a.5"
              partition: "/dev/ceph-c-vg_sdf/osd-db-c-lv_sdf-1"
              type: "db"
              class: "ssd"

The example above is based on the example spec provided in KaaSCephOperationRequest OSD removal specification. During the Validation phase, the provided information was validated and reflects the final map of the Ceph OSDs to remove:

  • For node-a, Ceph OSDs with IDs 2, 6, and 11 will be removed with the related disk and its information: all block devices, names, paths, and disk class.

  • For node-b, the Ceph OSDs with IDs 1, 15, and 25 will be removed with the related disk information.

  • For node-c, the Ceph OSD with ID 0, which is placed on the specified sdb device, and the Ceph OSD with ID 8, which is placed on the sde device, will be removed. The related partition on the sdf disk, which is used as the BlueStore metadata device, will be cleaned up while the disk itself and its other partitions are left untouched.
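To review a large removal map before approval, the cleanUpMap can be condensed into one line per Ceph OSD. The following Python sketch assumes the removeInfo structure shown in the example above has already been loaded into a dictionary. The summarize_cleanup_map helper is hypothetical, and the sample data is a trimmed copy of the node-c part of the example.

```python
def summarize_cleanup_map(cleanup_map: dict) -> dict:
    """Map each node to {osd_id: [(device, type), ...]} for a quick review."""
    summary = {}
    for node, node_info in cleanup_map.items():
        osds = {}
        for osd_id, osd_info in node_info.get("osdMapping", {}).items():
            devices = osd_info.get("deviceMapping", {})
            # Device names are unique per OSD, so sorting by name is stable.
            osds[osd_id] = sorted(
                (device, info.get("type")) for device, info in devices.items()
            )
        summary[node] = osds
    return summary


# Trimmed copy of the node-c entry from the example above.
EXAMPLE_CLEANUP_MAP = {
    "node-c": {
        "osdMapping": {
            "0": {"deviceMapping": {"sdb": {"type": "block", "class": "hdd"}}},
            "8": {
                "deviceMapping": {
                    "sde": {"type": "block", "class": "hdd"},
                    "sdf": {"type": "db", "class": "ssd"},
                }
            },
        }
    }
}

if __name__ == "__main__":
    for node, osds in summarize_cleanup_map(EXAMPLE_CLEANUP_MAP).items():
        for osd_id, devices in osds.items():
            print(node, "osd", osd_id, devices)
```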

Example of removeInfo with removeStatus succeeded
removeInfo:
  cleanUpMap:
    "node-a":
      completeCleanUp: true
      hostRemoveStatus:
        status: Removed
      osdMapping:
        "2":
          removeStatus:
            osdRemoveStatus:
              status: Removed
            deploymentRemoveStatus:
              status: Removed
              name: "rook-ceph-osd-2"
            deviceCleanUpJob:
              status: Finished
              name: "job-name-for-osd-2"
          deviceMapping:
            "sdb":
              path: "/dev/disk/by-path/pci-0000:00:0a.0"
              partition: "/dev/ceph-a-vg_sdb/osd-block-b-lv_sdb"
              type: "block"
              class: "hdd"
              zapDisk: true
Example of removeInfo with removeStatus failed
removeInfo:
  cleanUpMap:
    "node-a":
      completeCleanUp: true
      osdMapping:
        "2":
          removeStatus:
            osdRemoveStatus:
              errorReason: "retries for cmd ‘ceph osd ok-to-stop 2’ exceeded"
              status: Failed
          deviceMapping:
            "sdb":
              path: "/dev/disk/by-path/pci-0000:00:0a.0"
              partition: "/dev/ceph-a-vg_sdb/osd-block-b-lv_sdb"
              type: "block"
              class: "hdd"
              zapDisk: true
Example of removeInfo with removeStatus failed by timeout
removeInfo:
  cleanUpMap:
    "node-a":
      completeCleanUp: true
      osdMapping:
        "2":
          removeStatus:
            osdRemoveStatus:
              errorReason: Timeout (30m0s) reached for waiting pg rebalance for osd 2
              status: Failed
          deviceMapping:
            "sdb":
              path: "/dev/disk/by-path/pci-0000:00:0a.0"
              partition: "/dev/ceph-a-vg_sdb/osd-block-b-lv_sdb"
              type: "block"
              class: "hdd"
              zapDisk: true

Note

In case of failures similar to the examples above, review the ceph-request-controller logs and the Ceph cluster status. Such failures may simply indicate timeout and retry issues. If no other issues were found, re-create the request with a new name and skip adding successfully removed Ceph OSDs or Ceph nodes.
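When re-creating a request after a partial failure, it helps to separate the Ceph OSDs that were already removed from those that failed. The following Python sketch assumes the removeInfo.cleanUpMap structure from the examples above, loaded into a dictionary. The split_by_result helper and the sample data are illustrative and not part of the product.

```python
def split_by_result(cleanup_map: dict):
    """Return (removed, failed) lists built from osdRemoveStatus entries.

    removed holds (node, osd_id) pairs; failed holds
    (node, osd_id, errorReason) triples.
    """
    removed, failed = [], []
    for node, node_info in cleanup_map.items():
        for osd_id, osd_info in node_info.get("osdMapping", {}).items():
            osd_status = osd_info.get("removeStatus", {}).get("osdRemoveStatus", {})
            status = osd_status.get("status")
            if status == "Removed":
                removed.append((node, osd_id))
            elif status == "Failed":
                failed.append((node, osd_id, osd_status.get("errorReason", "")))
    return removed, failed


# Sample data combining the succeeded and timeout examples above.
# OSD ID 6 is used here only to show both outcomes side by side.
SAMPLE = {
    "node-a": {
        "osdMapping": {
            "2": {"removeStatus": {"osdRemoveStatus": {"status": "Removed"}}},
            "6": {
                "removeStatus": {
                    "osdRemoveStatus": {
                        "status": "Failed",
                        "errorReason": "Timeout (30m0s) reached for waiting pg rebalance for osd 6",
                    }
                }
            },
        }
    }
}

if __name__ == "__main__":
    removed, failed = split_by_result(SAMPLE)
    print("skip in the new request:", removed)
    print("retry in the new request:", failed)
```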

Add, remove, or reconfigure Ceph nodes

Mirantis Ceph Controller simplifies Ceph cluster management by automating LCM operations. This section describes how to add, remove, or reconfigure Ceph nodes.

Note

When adding a Ceph node with the Ceph Monitor role, if any issues occur with the Ceph Monitor, rook-ceph removes it and adds a new Ceph Monitor instead, named using the next alphabetic character in order. Therefore, the Ceph Monitor names may not follow the alphabetical order. For example, a, b, d, instead of a, b, c.

Add Ceph nodes on a managed cluster
  1. Prepare a new machine for the required managed cluster as described in Add a machine. During machine preparation, update the settings of the related bare metal host profile for the Ceph node being replaced with the required machine devices as described in Create a custom bare metal host profile.

  2. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  3. In the nodes section, specify the parameters for a Ceph node as required. For the parameters description, see Node parameters.

    Example configurations of the nodes section with the new node, first using a by-id symlink in fullPath and then using a device name:

    nodes:
      kaas-node-5bgk6:
        roles:
        - mon
        - mgr
        storageDevices:
        - config:
            deviceClass: hdd
          fullPath: /dev/disk/by-id/scsi-SATA_HGST_HUS724040AL_PN1334PEHN18ZS
    
    nodes:
      kaas-node-5bgk6:
        roles:
        - mon
        - mgr
        storageDevices:
        - config:
            deviceClass: hdd
          name: sdb
    

    Warning

    Since MOSK 23.3, Mirantis highly recommends using the non-wwn by-id symlinks to specify storage devices in the storageDevices list.

    For details, see Container Cloud documentation: Addressing storage devices.

    Note

    • To use a new Ceph node for a Ceph Monitor or Ceph Manager deployment, also specify the roles parameter.

    • Reducing the number of Ceph Monitors is not supported and causes removal of Ceph Monitor daemons from random nodes.

    • Removal of the mgr role in the nodes section of the KaaSCephCluster CR does not remove Ceph Managers. To remove a Ceph Manager from a node, remove it from the nodes spec and manually delete the mgr pod in the Rook namespace.

  4. Verify that all new Ceph daemons for the specified node have been successfully deployed in the Ceph cluster. The fullClusterInfo section should not contain any issues.

    kubectl -n <managedClusterProjectName> get kaascephcluster -o yaml
    
    Example of system response
    status:
      fullClusterInfo:
        daemonsStatus:
          mgr:
            running: a is active mgr
            status: Ok
          mon:
            running: '3/3 mons running: [a b c] in quorum'
            status: Ok
          osd:
            running: '3/3 running: 3 up, 3 in'
            status: Ok
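The verification in the last step can also be scripted. The following Python sketch assumes that the fullClusterInfo section of the system response has been loaded into a dictionary with the same keys as in the example above. The daemons_with_issues helper is hypothetical.

```python
def daemons_with_issues(full_cluster_info: dict) -> list:
    """Return the names of daemon groups whose status is not Ok."""
    daemons = full_cluster_info.get("daemonsStatus", {})
    return sorted(
        name for name, info in daemons.items() if info.get("status") != "Ok"
    )


# Copy of the example system response above.
EXAMPLE_INFO = {
    "daemonsStatus": {
        "mgr": {"running": "a is active mgr", "status": "Ok"},
        "mon": {"running": "3/3 mons running: [a b c] in quorum", "status": "Ok"},
        "osd": {"running": "3/3 running: 3 up, 3 in", "status": "Ok"},
    }
}

if __name__ == "__main__":
    issues = daemons_with_issues(EXAMPLE_INFO)
    print("daemons with issues:", issues or "none")
```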
    
Remove a Ceph node from a managed cluster

Note

Ceph node removal requires a KaaSCephOperationRequest CR. For the workflow overview and the description of the spec and phases, see High-level workflow of Ceph OSD or node removal.

Note

To remove a Ceph node with a mon role, first move the Ceph Monitor to another node and remove the mon role from the Ceph node as described in Move a Ceph Monitor daemon to another node.

  1. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  2. In the spec.cephClusterSpec.nodes section, remove the required Ceph node specification.

    For example:

    spec:
      cephClusterSpec:
        nodes:
          worker-5: # remove the entire entry for the required node
            storageDevices: {...}
            roles: [...]
    
  3. Create a YAML template for the KaaSCephOperationRequest CR. For example:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephOperationRequest
    metadata:
      name: remove-osd-worker-5
      namespace: <managedClusterProjectName>
    spec:
      osdRemove:
        nodes:
          worker-5:
            completeCleanUp: true
      kaasCephCluster:
        name: <kaasCephClusterName>
        namespace: <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding cluster namespace and <kaasCephClusterName> with the corresponding KaaSCephCluster name.

  4. Apply the template on the management cluster in the corresponding namespace:

    kubectl apply -f remove-osd-worker-5.yaml
    
  5. Verify that the corresponding request has been created:

    kubectl get kaascephoperationrequest remove-osd-worker-5 -n <managedClusterProjectName>
    
  6. Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest remove-osd-worker-5 -o yaml
    
    Example of system response
    status:
      childNodesMapping:
        kaas-node-d4aac64d-1721-446c-b7df-e351c3025591: worker-5
      osdRemoveStatus:
        removeInfo:
          cleanUpMap:
            kaas-node-d4aac64d-1721-446c-b7df-e351c3025591:
              osdMapping:
                "10":
                  deviceMapping:
                    sdb:
                      path: "/dev/disk/by-path/pci-0000:00:1t.9"
                      partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdb"
                      type: "block"
                      class: "hdd"
                      zapDisk: true
                "16":
                  deviceMapping:
                    sdc:
                      path: "/dev/disk/by-path/pci-0000:00:1t.10"
                      partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdc"
                      type: "block"
                      class: "hdd"
                      zapDisk: true
    
  7. Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest remove-osd-worker-5 -o yaml
    

    Example of system response:

    status:
      phase: ApproveWaiting
    
  8. Edit the KaaSCephOperationRequest CR and set the approve flag to true:

    kubectl -n <managedClusterProjectName> edit kaascephoperationrequest remove-osd-worker-5
    

    For example:

    spec:
      osdRemove:
        approve: true
    
  9. Review the status of the KaaSCephOperationRequest request processing. The most relevant fields are as follows:

    • status.phase - the current state of request processing

    • status.messages - the description of the current phase

    • status.conditions - full history of request processing before the current phase

    • status.osdRemoveStatus.removeInfo.issues and status.osdRemoveStatus.removeInfo.warnings - error and warning messages that occurred during request processing, if any

  10. Verify that the KaaSCephOperationRequest has been completed. For example:

    status:
      phase: Completed # or CompletedWithWarnings if there are non-critical issues
    
  11. Remove the device cleanup jobs:

    kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
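The check performed before setting the approve flag can be expressed as a small guard. The following Python sketch assumes that the request status has been loaded into a dictionary shaped like the example responses above. It also assumes, based on those examples, that childNodesMapping translates cleanUpMap keys into the names used in the request spec. The unexpected_nodes and ready_to_approve helpers are hypothetical.

```python
def unexpected_nodes(status: dict, expected: set) -> set:
    """Return nodes planned for cleanup that the operator did not request."""
    mapping = status.get("childNodesMapping", {})
    cleanup_map = (
        status.get("osdRemoveStatus", {}).get("removeInfo", {}).get("cleanUpMap", {})
    )
    # Fall back to the raw key when no mapping entry exists.
    planned = {mapping.get(key, key) for key in cleanup_map}
    return planned - expected


def ready_to_approve(status: dict, expected: set) -> bool:
    """True when validation finished and only the expected nodes are affected."""
    return status.get("phase") == "ApproveWaiting" and not unexpected_nodes(
        status, expected
    )


# Trimmed copy of the example system responses above.
NODE = "kaas-node-d4aac64d-1721-446c-b7df-e351c3025591"
EXAMPLE_STATUS = {
    "phase": "ApproveWaiting",
    "childNodesMapping": {NODE: "worker-5"},
    "osdRemoveStatus": {
        "removeInfo": {"cleanUpMap": {NODE: {"osdMapping": {"10": {}, "16": {}}}}}
    },
}

if __name__ == "__main__":
    print(ready_to_approve(EXAMPLE_STATUS, {"worker-5"}))
```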
    
Reconfigure a Ceph node on a managed cluster

There is no hot reconfiguration procedure for existing Ceph OSDs and Ceph Monitors. To reconfigure an existing Ceph node, follow the steps below:

  1. Remove the Ceph node from the Ceph cluster as described in Remove a Ceph node from a managed cluster.

  2. Add the same Ceph node but with a modified configuration as described in Add Ceph nodes on a managed cluster.

Add, remove, or reconfigure Ceph OSDs

Mirantis Ceph Controller simplifies Ceph cluster management by automating LCM operations. This section describes how to add, remove, or reconfigure Ceph OSDs.

Add a Ceph OSD on a managed cluster
  1. Manually prepare the required machine devices with LVM2 on the existing node because BareMetalHostProfile does not support in-place changes.

    To add a Ceph OSD to an existing or hot-plugged raw device

    If you want to add a Ceph OSD on top of a raw device that already exists on a node or is hot-plugged, add the required device using the following guidelines:

    • You can add a raw device to a node during node deployment.

    • If a node supports adding devices without node reboot, you can hot plug a raw device to a node.

    • If a node does not support adding devices without node reboot, you can attach a raw device while the node is shut down. In this case, complete the following steps:

      1. Enable maintenance mode on the managed cluster.

      2. Turn off the required node.

      3. Attach the required raw device to the node.

      4. Turn on the required node.

      5. Disable maintenance mode on the managed cluster.

  2. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  3. In the nodes.<machineName>.storageDevices section, specify the parameters for a Ceph OSD as required. For the parameters description, see Node parameters.

    Example configurations of the nodes section with the new Ceph OSD, first using by-id symlinks in fullPath and then using device names:

    nodes:
      kaas-node-5bgk6:
        roles:
        - mon
        - mgr
        storageDevices:
        - config: # existing item
            deviceClass: hdd
          fullPath: /dev/disk/by-id/scsi-SATA_HGST_HUS724040AL_PN1334PEHN18ZS
        - config: # new item
            deviceClass: hdd
          fullPath: /dev/disk/by-id/scsi-0ATA_HGST_HUS724040AL_PN1334PEHN1VBC
    
    nodes:
      kaas-node-5bgk6:
        roles:
        - mon
        - mgr
        storageDevices:
        - config: # existing item
            deviceClass: hdd
          name: sdb
        - config: # new item
            deviceClass: hdd
          name: sdc
    

    Warning

    Since MOSK 23.3, Mirantis highly recommends using the non-wwn by-id symlinks to specify storage devices in the storageDevices list.

    For details, see Container Cloud documentation: Addressing storage devices.

  4. Verify that the Ceph OSD on the specified node is successfully deployed. The fullClusterInfo section should not contain any issues.

    kubectl -n <managedClusterProjectName> get kaascephcluster -o yaml
    

    For example:

    status:
      fullClusterInfo:
        daemonsStatus:
          ...
          osd:
            running: '3/3 running: 3 up, 3 in'
            status: Ok
    

    Note

    Since MOSK 23.2, cephDeviceMapping is removed because its large size can potentially exceed the Kubernetes 1.5 MB quota.

  5. Verify the Ceph OSD on the managed cluster:

    kubectl -n rook-ceph get pod -l app=rook-ceph-osd -o wide | grep <machineName>
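To confirm that the number of Ceph OSDs grew as expected, the running string from the fullClusterInfo output can be parsed. The following Python sketch assumes the string always follows the format shown in the example above ('3/3 running: 3 up, 3 in'). This format is inferred from the example only, and the helper names are hypothetical.

```python
import re

# Pattern inferred from the example output; treat it as an assumption.
_RUNNING_RE = re.compile(r"(\d+)/(\d+) running: (\d+) up, (\d+) in")


def parse_osd_running(running: str) -> dict:
    """Split the osd 'running' string into integer counters."""
    match = _RUNNING_RE.fullmatch(running.strip())
    if match is None:
        raise ValueError(f"unexpected format: {running!r}")
    running_count, total, up, in_cluster = map(int, match.groups())
    return {"running": running_count, "total": total, "up": up, "in": in_cluster}


def all_osds_healthy(running: str, expected_total: int) -> bool:
    """True when every counter equals the expected number of Ceph OSDs."""
    counters = parse_osd_running(running)
    return all(value == expected_total for value in counters.values())


if __name__ == "__main__":
    print(parse_osd_running("3/3 running: 3 up, 3 in"))
    print(all_osds_healthy("4/4 running: 4 up, 4 in", 4))
```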
    
Remove a Ceph OSD from a managed cluster

Note

Ceph OSD removal requires a KaaSCephOperationRequest CR. For the workflow overview and the description of the spec and phases, see High-level workflow of Ceph OSD or node removal.

Warning

When a Ceph pool uses the non-recommended replicated.size of less than 3, Ceph OSD removal cannot be performed. The minimal replica size equals half of the specified replicated.size, rounded up.

For example, if replicated.size is 2, the minimal replica size is 1, and if replicated.size is 3, the minimal replica size is 2. A replica size of 1 allows Ceph to have PGs with only one Ceph OSD in the acting state, which may cause a PG_TOO_DEGRADED health warning that blocks Ceph OSD removal. Mirantis recommends setting replicated.size to 3 for each Ceph pool.
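The rounding rule from the warning above can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only. The min_replica_size function name is hypothetical and does not correspond to any product API.

```python
import math


def min_replica_size(replicated_size: int) -> int:
    """Return half of replicated_size, rounded up."""
    if replicated_size < 1:
        raise ValueError("replicated_size must be a positive integer")
    return math.ceil(replicated_size / 2)


if __name__ == "__main__":
    for size in (2, 3):
        print(f"replicated.size={size} -> minimal replica size {min_replica_size(size)}")
```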

  1. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  2. Remove the required Ceph OSD specification from the spec.cephClusterSpec.nodes.<machineName>.storageDevices list:

    Example configurations of the nodes section with the Ceph OSD to remove, first using by-id symlinks in fullPath and then using device names:

    nodes:
      kaas-node-5bgk6:
        roles:
        - mon
        - mgr
        storageDevices:
        - config:
            deviceClass: hdd
          fullPath: /dev/disk/by-id/scsi-SATA_HGST_HUS724040AL_PN1334PEHN18ZS
        - config: # remove the entire item entry from storageDevices list
            deviceClass: hdd
          fullPath: /dev/disk/by-id/scsi-0ATA_HGST_HUS724040AL_PN1334PEHN1VBC
    
    nodes:
      kaas-node-5bgk6:
        roles:
        - mon
        - mgr
        storageDevices:
        - config:
            deviceClass: hdd
          name: sdb
        - config: # remove the entire item entry from storageDevices list
            deviceClass: hdd
          name: sdc
    
  3. Create a YAML template for the KaaSCephOperationRequest CR. Select from the following options:

    • Remove Ceph OSD by device name, by-path symlink, or by-id symlink:

      apiVersion: kaas.mirantis.com/v1alpha1
      kind: KaaSCephOperationRequest
      metadata:
        name: remove-osd-<machineName>-sdb
        namespace: <managedClusterProjectName>
      spec:
        osdRemove:
          nodes:
            <machineName>:
              cleanupByDevice:
              - name: sdb
        kaasCephCluster:
          name: <kaasCephClusterName>
          namespace: <managedClusterProjectName>
      

      Substitute <managedClusterProjectName> with the corresponding cluster namespace and <kaasCephClusterName> with the corresponding KaaSCephCluster name.

      Warning

      Since MOSK 23.3, Mirantis does not recommend setting device name or device by-path symlink in the cleanupByDevice field as these identifiers are not persistent and can change at node boot. Remove Ceph OSDs with by-id symlinks specified in the path field or use cleanupByOsdId instead.

      For details, see Container Cloud documentation: Addressing storage devices.

      Note

      • Since MOSK 23.1, cleanupByDevice is not supported if a device was physically removed from a node. Therefore, use cleanupByOsdId instead. For details, see Remove a failed Ceph OSD by Ceph OSD ID.

      • Before MOSK 23.1, if the storageDevice item was specified with by-id, specify the path parameter in the cleanupByDevice section instead of name.

      • If the storageDevice item was specified with a by-path device path, specify the path parameter in the cleanupByDevice section instead of name.

    • Remove Ceph OSD by OSD ID:

      apiVersion: kaas.mirantis.com/v1alpha1
      kind: KaaSCephOperationRequest
      metadata:
        name: remove-osd-<machineName>-sdb
        namespace: <managedClusterProjectName>
      spec:
        osdRemove:
          nodes:
            <machineName>:
              cleanupByOsdId:
              - 2
        kaasCephCluster:
          name: <kaasCephClusterName>
          namespace: <managedClusterProjectName>
      

      Substitute <managedClusterProjectName> with the corresponding cluster namespace and <kaasCephClusterName> with the corresponding KaaSCephCluster name.

  4. Apply the template on the management cluster in the corresponding namespace:

    kubectl apply -f remove-osd-<machineName>-sdb.yaml
    
  5. Verify that the corresponding request has been created:

    kubectl get kaascephoperationrequest remove-osd-<machineName>-sdb -n <managedClusterProjectName>
    
  6. Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest remove-osd-<machineName>-sdb -o yaml
    

    Example of system response:

    status:
      childNodesMapping:
        kaas-node-d4aac64d-1721-446c-b7df-e351c3025591: <machineName>
      osdRemoveStatus:
        removeInfo:
          cleanUpMap:
            kaas-node-d4aac64d-1721-446c-b7df-e351c3025591:
              osdMapping:
                "10":
                  deviceMapping:
                    sdb:
                      path: "/dev/disk/by-path/pci-0000:00:1t.9"
                      partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdb"
                      type: "block"
                      class: "hdd"
                      zapDisk: true
    
  7. Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest remove-osd-<machineName>-sdb -o yaml
    

    Example of system response:

    status:
      phase: ApproveWaiting
    
  8. Edit the KaaSCephOperationRequest CR and set the approve flag to true:

    kubectl -n <managedClusterProjectName> edit kaascephoperationrequest remove-osd-<machineName>-sdb
    

    For example:

    spec:
      osdRemove:
        approve: true
    
  9. Review the following status fields of the KaaSCephOperationRequest CR request processing:

    • status.phase - current state of request processing

    • status.messages - description of the current phase

    • status.conditions - full history of request processing before the current phase

    • status.osdRemoveStatus.removeInfo.issues and status.osdRemoveStatus.removeInfo.warnings - error and warning messages that occurred during request processing, if any

  10. Verify that the KaaSCephOperationRequest has been completed. For example:

    status:
      phase: Completed # or CompletedWithWarnings if there are non-critical issues
    
  11. Remove the device cleanup jobs:

    kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
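The request templates from this procedure can also be generated programmatically and passed to kubectl apply as JSON. The following Python sketch builds the manifest using only the fields shown in the templates above. The build_osd_remove_request helper is hypothetical, and for simplicity it accepts either by-id paths or Ceph OSD IDs, not both.

```python
import json


def build_osd_remove_request(name, namespace, cluster_name, machine,
                             *, by_id_paths=(), osd_ids=()):
    """Build a KaaSCephOperationRequest manifest as a dictionary."""
    # Simplification of this sketch: exactly one selector type per request.
    if bool(by_id_paths) == bool(osd_ids):
        raise ValueError("specify either by_id_paths or osd_ids")
    if by_id_paths:
        for path in by_id_paths:
            # Follows the recommendation above to prefer by-id symlinks.
            if not path.startswith("/dev/disk/by-id/"):
                raise ValueError(f"not a by-id symlink: {path}")
        node_spec = {"cleanupByDevice": [{"path": path} for path in by_id_paths]}
    else:
        node_spec = {"cleanupByOsdId": list(osd_ids)}
    return {
        "apiVersion": "kaas.mirantis.com/v1alpha1",
        "kind": "KaaSCephOperationRequest",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "osdRemove": {"nodes": {machine: node_spec}},
            "kaasCephCluster": {"name": cluster_name, "namespace": namespace},
        },
    }


if __name__ == "__main__":
    # All names below are placeholders for illustration.
    request = build_osd_remove_request(
        "remove-osd-2", "managed-ns", "ceph-cluster", "worker-5", osd_ids=[2]
    )
    print(json.dumps(request, indent=2))
```

The printed JSON can be saved to a file and applied in the same way as the YAML template.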
    
Reconfigure a Ceph OSD on a managed cluster

There is no hot reconfiguration procedure for existing Ceph OSDs. To reconfigure an existing Ceph OSD, follow the steps below:

  1. Remove a Ceph OSD from the Ceph cluster as described in Remove a Ceph OSD from a managed cluster.

  2. Add the same Ceph OSD but with a modified configuration as described in Add a Ceph OSD on a managed cluster.

Add, remove, or reconfigure Ceph OSDs with metadata devices

Mirantis Ceph Controller simplifies Ceph cluster management by automating LCM operations. This section describes how to add, remove, or reconfigure Ceph OSDs with a separate metadata device.

Add a Ceph OSD with a metadata device
  1. From the Ceph disks defined in the BareMetalHostProfile object that was configured using the Configure Ceph disks in a host profile procedure, select one disk for data and one logical volume for metadata of a Ceph OSD to be added to the Ceph cluster.

    Note

    If you add a new disk after machine provisioning, manually prepare the required machine devices using Logical Volume Manager (LVM) 2 on the existing node because BareMetalHostProfile does not support in-place changes.

    To add a Ceph OSD to an existing or hot-plugged raw device

    If you want to add a Ceph OSD on top of a raw device that already exists on a node or is hot-plugged, add the required device using the following guidelines:

    • You can add a raw device to a node during node deployment.

    • If a node supports adding devices without node reboot, you can hot plug a raw device to a node.

    • If a node does not support adding devices without node reboot, you can attach a raw device while the node is shut down. In this case, complete the following steps:

      1. Enable maintenance mode on the managed cluster.

      2. Turn off the required node.

      3. Attach the required raw device to the node.

      4. Turn on the required node.

      5. Disable maintenance mode on the managed cluster.

  2. Open the KaaSCephCluster object for editing:

    kubectl -n <managedClusterProjectName> edit kaascephcluster
    

    Substitute <managedClusterProjectName> with the corresponding value.

  3. In the nodes.<machineName>.storageDevices section, specify the parameters for a Ceph OSD as required. For the parameters description, see Node parameters.

    Example configurations of the nodes section with the new Ceph OSD and its metadata device, first using by-id symlinks in fullPath and then using device names:

    nodes:
      kaas-node-5bgk6:
        roles:
        - mon
        - mgr
        storageDevices:
        - config: # existing item
            deviceClass: hdd
          fullPath: /dev/disk/by-id/scsi-SATA_HGST_HUS724040AL_PN1334PEHN18ZS
        - config: # new item
            deviceClass: hdd
            metadataDevice: /dev/bluedb/meta_1
          fullPath: /dev/disk/by-id/scsi-0ATA_HGST_HUS724040AL_PN1334PEHN1VBC
    
    nodes:
      kaas-node-5bgk6:
        roles:
        - mon
        - mgr
        storageDevices:
        - config: # existing item
            deviceClass: hdd
          name: sdb
        - config: # new item
            deviceClass: hdd
            metadataDevice: /dev/bluedb/meta_1
          name: sdc
    

    Warning

    Since MOSK 23.3, Mirantis highly recommends using the non-wwn by-id symlinks to specify storage devices in the storageDevices list.

    For details, see Container Cloud documentation: Addressing storage devices.

  4. Verify that the Ceph OSD is successfully deployed on the specified node:

    kubectl -n <managedClusterProjectName> get kaascephcluster -o yaml
    

    In the system response, the fullClusterInfo section should not contain any issues.

    Example of a successful system response:

    status:
      fullClusterInfo:
        daemonsStatus:
          ...
          osd:
            running: '4/4 running: 4 up, 4 in'
            status: Ok
    
  5. Obtain the name of the node on which the machine with the Ceph OSD is running:

    kubectl -n <managedClusterProjectName> get machine <machineName> -o jsonpath='{.status.nodeRef.name}'
    

    Substitute <managedClusterProjectName> and <machineName> with corresponding values.

  6. Verify the Ceph OSD status:

    kubectl -n rook-ceph get pod -l app=rook-ceph-osd -o wide | grep <nodeName>
    

    Substitute <nodeName> with the value obtained in the previous step.

    Example of system response:

    rook-ceph-osd-0-7b8d4d58db-f6czn   1/1     Running   0          42h   10.100.91.6   kaas-node-6c5e76f9-c2d2-4b1a-b047-3c299913a4bf   <none>           <none>
    rook-ceph-osd-1-78fbc47dc5-px9n2   1/1     Running   0          21h   10.100.91.6   kaas-node-6c5e76f9-c2d2-4b1a-b047-3c299913a4bf   <none>           <none>
    rook-ceph-osd-3-647f8d6c69-87gxt   1/1     Running   0          21h   10.100.91.6   kaas-node-6c5e76f9-c2d2-4b1a-b047-3c299913a4bf   <none>           <none>
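The pod listing from the last step can be checked automatically. The following Python sketch assumes the default column layout of kubectl get pod -o wide as shown in the example above (name, ready, status, and so on) with no header line because of grep. The pods_not_running helper is hypothetical.

```python
def pods_not_running(listing: str) -> list:
    """Return names of pods that are not Running or not fully ready."""
    problems = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        name, ready, status = fields[0], fields[1], fields[2]
        current, _, wanted = ready.partition("/")
        if status != "Running" or current != wanted:
            problems.append(name)
    return problems


# Shortened copy of the example system response above.
EXAMPLE_LISTING = """\
rook-ceph-osd-0-7b8d4d58db-f6czn   1/1     Running   0          42h   10.100.91.6
rook-ceph-osd-1-78fbc47dc5-px9n2   1/1     Running   0          21h   10.100.91.6
rook-ceph-osd-3-647f8d6c69-87gxt   1/1     Running   0          21h   10.100.91.6
"""

if __name__ == "__main__":
    print("pods with problems:", pods_not_running(EXAMPLE_LISTING) or "none")
```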
    
Remove a Ceph OSD with a metadata device

Note

Ceph OSD removal requires a KaaSCephOperationRequest custom resource (CR). For the workflow overview and the description of the spec and phases, see High-level workflow of Ceph OSD or node removal.

Warning

When a Ceph pool uses the non-recommended replicated.size of less than 3, Ceph OSD removal cannot be performed. The minimal replica size equals half of the specified replicated.size, rounded up.

For example, if replicated.size is 2, the minimal replica size is 1, and if replicated.size is 3, the minimal replica size is 2. A replica size of 1 allows Ceph to have PGs with only one Ceph OSD in the acting state, which may cause a PG_TOO_DEGRADED health warning that blocks Ceph OSD removal. Mirantis recommends setting replicated.size to 3 for each Ceph pool.

  1. Open the KaaSCephCluster object of the managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  2. Remove the required Ceph OSD specification from the spec.cephClusterSpec.nodes.<machineName>.storageDevices list:

    Example configurations of the nodes section with the Ceph OSD to remove, first using by-id symlinks in fullPath and then using device names:

    nodes:
      kaas-node-5bgk6:
        roles:
        - mon
        - mgr
        storageDevices:
        - config:
            deviceClass: hdd
          fullPath: /dev/disk/by-id/scsi-SATA_HGST_HUS724040AL_PN1334PEHN18ZS
        - config: # remove the entire item entry from storageDevices list
            deviceClass: hdd
            metadataDevice: /dev/bluedb/meta_1
          fullPath: /dev/disk/by-id/scsi-0ATA_HGST_HUS724040AL_PN1334PEHN1VBC
    
    nodes:
      kaas-node-5bgk6:
        roles:
        - mon
        - mgr
        storageDevices:
        - config:
            deviceClass: hdd
          name: sdb
        - config: # remove the entire item entry from storageDevices list
            deviceClass: hdd
            metadataDevice: /dev/bluedb/meta_1
          name: sdc
    
  3. Create a YAML template for the KaaSCephOperationRequest CR. For example:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephOperationRequest
    metadata:
      name: remove-osd-<machineName>-sdb
      namespace: <managedClusterProjectName>
    spec:
      osdRemove:
        nodes:
          <machineName>:
            cleanupByDevice:
            - name: sdb
      kaasCephCluster:
        name: <kaasCephClusterName>
        namespace: <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding cluster namespace and <kaasCephClusterName> with the corresponding KaaSCephCluster name.

    Warning

    Since MOSK 23.3, Mirantis does not recommend setting device name or device by-path symlink in the cleanupByDevice field as these identifiers are not persistent and can change at node boot. Remove Ceph OSDs with by-id symlinks specified in the path field or use cleanupByOsdId instead.

    For details, see Container Cloud documentation: Addressing storage devices.

    Note

    • Since MOSK 23.1, cleanupByDevice is not supported if a device was physically removed from a node. Therefore, use cleanupByOsdId instead. For details, see Remove a failed Ceph OSD by Ceph OSD ID.

    • Before MOSK 23.1, if the storageDevice item was specified with by-id, specify the path parameter in the cleanupByDevice section instead of name.

    • If the storageDevice item was specified with a by-path device path, specify the path parameter in the cleanupByDevice section instead of name.

  4. Apply the template on the management cluster in the corresponding namespace:

    kubectl apply -f remove-osd-<machineName>-sdb.yaml
    
  5. Verify that the corresponding request has been created:

    kubectl get kaascephoperationrequest remove-osd-<machineName>-sdb -n <managedClusterProjectName>
    
  6. Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest remove-osd-<machineName>-sdb -o yaml
    

    Example of system response:

    status:
      childNodesMapping:
        kaas-node-d4aac64d-1721-446c-b7df-e351c3025591: <machineName>
      osdRemoveStatus:
        removeInfo:
          cleanUpMap:
            kaas-node-d4aac64d-1721-446c-b7df-e351c3025591:
              osdMapping:
                "10":
                  deviceMapping:
                    sdb:
                      path: "/dev/disk/by-path/pci-0000:00:1t.9"
                      partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdb"
                      type: "block"
                      class: "hdd"
                      zapDisk: true
                "5":
                  deviceMapping:
                    /dev/sdc:
                      deviceClass: hdd
                      devicePath: /dev/disk/by-path/pci-0000:00:0f.0
                      devicePurpose: block
                      usedPartition: /dev/ceph-2d11bf90-e5be-4655-820c-fb4bdf7dda63/osd-block-e41ce9a8-4925-4d52-aae4-e45167cfcf5c
                      zapDisk: true
                    /dev/sdf:
                      deviceClass: hdd
                      devicePath: /dev/disk/by-path/pci-0000:00:12.0
                      devicePurpose: db
                      usedPartition: /dev/bluedb/meta_1
    
  7. Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest remove-osd-<machineName>-sdb -o yaml
    

    Example of system response:

    status:
      phase: ApproveWaiting
    
  8. In the KaaSCephOperationRequest CR, set the approve flag to true:

    kubectl -n <managedClusterProjectName> edit kaascephoperationrequest remove-osd-<machineName>-sdb
    

    Configuration snippet:

    spec:
      osdRemove:
        approve: true
    
  9. Review the following status fields of the KaaSCephOperationRequest CR request processing:

    • status.phase - current state of request processing

    • status.messages - description of the current phase

    • status.conditions - full history of request processing before the current phase

    • status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing, if any

  10. Verify that the KaaSCephOperationRequest has been completed.

    Example of the positive status.phase field:

    status:
      phase: Completed # or CompletedWithWarnings if there are non-critical issues
    
  11. Remove the device cleanup jobs:

    kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
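
The waiting and approval steps of the procedure above can also be scripted. The following is a minimal sketch, not part of the product tooling: the function names wait_for_phase and approve_osd_remove, the attempt count, and the delay are assumptions made for illustration. The sketch relies only on the status.phase and spec.osdRemove.approve fields shown above and on the standard kubectl jsonpath output and merge patch options. Always review the cleanUpMap section manually before approving.

```shell
# Hypothetical helpers; names and default timings are illustrative only.

# Poll .status.phase of a KaaSCephOperationRequest until it equals the
# expected phase or the attempts run out.
# Usage: wait_for_phase <namespace> <requestName> <phase> [attempts] [delaySeconds]
wait_for_phase() {
  ns=$1; name=$2; want=$3; attempts=${4:-60}; delay=${5:-10}
  phase=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    phase=$(kubectl -n "$ns" get kaascephoperationrequest "$name" \
      -o jsonpath='{.status.phase}' 2>/dev/null) || phase=""
    if [ "$phase" = "$want" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for phase $want (last seen: ${phase:-none})" >&2
  return 1
}

# Set spec.osdRemove.approve to true without opening an editor.
# Usage: approve_osd_remove <namespace> <requestName>
approve_osd_remove() {
  kubectl -n "$1" patch kaascephoperationrequest "$2" \
    --type merge -p '{"spec":{"osdRemove":{"approve":true}}}'
}

# Example (not executed here):
#   wait_for_phase <managedClusterProjectName> remove-osd-<machineName>-sdb ApproveWaiting
#   approve_osd_remove <managedClusterProjectName> remove-osd-<machineName>-sdb
#   wait_for_phase <managedClusterProjectName> remove-osd-<machineName>-sdb Completed
```

The final wait in the example checks only for Completed. A request that ends in CompletedWithWarnings would need a separate check.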
    
Reconfigure a partition of a Ceph OSD metadata device

There is no hot reconfiguration procedure for existing Ceph OSDs. To reconfigure an existing Ceph node, remove and re-add a Ceph OSD with a metadata device using the following options:

  • Since Container Cloud 2.24.0, if metadata device partitions are specified in the BareMetalHostProfile object as described in Configure Ceph disks in a host profile, the metadata device definition is an LVM path in metadataDevice of the KaaSCephCluster object.

    Therefore, automated LCM cleans up the logical volume without removing it, so the volume can be reused. For this reason, to reconfigure a partition of a Ceph OSD metadata device:

    1. Remove a Ceph OSD from the Ceph cluster as described in Remove a Ceph OSD with a metadata device.

    2. Add the same Ceph OSD but with a modified configuration as described in Add a Ceph OSD with a metadata device.

  • Before MOSK 23.2 or if metadata device partitions are not specified in the BareMetalHostProfile object as described in Configure Ceph disks in a host profile, the most common definition of a metadata device is a full device name (by-path or by-id) in metadataDevice of the KaaSCephCluster object for Ceph OSD. For example, metadataDevice: /dev/nvme0n1. In this case, to reconfigure a partition of a Ceph OSD metadata device:

    1. Remove a Ceph OSD from the Ceph cluster as described in Remove a Ceph OSD with a metadata device. Automated LCM will clean up the data device and will remove the metadata device partition for the required Ceph OSD.

    2. Reconfigure the metadata device partition manually to use it during addition of a new Ceph OSD.

      Manual reconfiguration of a metadata device partition
      1. Log in to the Ceph node running a Ceph OSD to reconfigure.

      2. Find the metadata device used by the Ceph OSDs. Its LVM partitions contain the osd--db substring:

        lsblk
        

        Example of system response:

        ...
        vdf               252:80   0   32G  0 disk
        ├─ceph--7831901d--398e--415d--8941--e78486f3b019-osd--db--4bdbb0a0--e613--416e--ab97--272f237b7eab
        │                 253:3    0   16G  0 lvm
        └─ceph--7831901d--398e--415d--8941--e78486f3b019-osd--db--8f439d5c--1a19--49d5--b71f--3c25ae343303
                          253:5    0   16G  0 lvm
        

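        Capture the volume group UUID and logical volume sizes. In the example above, the volume group UUID is ceph--7831901d--398e--415d--8941--e78486f3b019 and the size is 16G. Note that lsblk displays device-mapper names, in which every hyphen of the volume group name is doubled.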

      3. Find the volume group of the metadata device:

        vgs
        

        Example of system response:

        VG                                        #PV #LV #SN Attr   VSize   VFree
        ceph-508c7a6d-db01-4873-98c3-52ab204b5ca8   1   1   0 wz--n- <32.00g    0
        ceph-62d84b29-8de5-440c-a6e9-658e8e246af7   1   1   0 wz--n- <32.00g    0
        ceph-754e0772-6d0f-4629-bf1d-24cb79f3ee82   1   1   0 wz--n- <32.00g    0
        ceph-7831901d-398e-415d-8941-e78486f3b019   1   2   0 wz--n- <48.00g <17.00g
        lvm_root                                    1   1   0 wz--n- <61.03g    0
        

        Capture the volume group with the name that matches the prefix of LVM partitions of the metadata device. In the example above, the required volume group is ceph-7831901d-398e-415d-8941-e78486f3b019.

      4. Manually create an LVM partition for the new Ceph OSD. Create a new logical volume in the obtained volume group:

        lvcreate -L <lvSize> -n <lvName> <vgName>
        

        Substitute the following parameters:

        • <lvSize> with the previously obtained logical volume size. In the example above, it is 16G.

        • <lvName> with a new logical volume name. For example, meta_1.

        • <vgName> with the previously obtained volume group name. In the example above, it is ceph-7831901d-398e-415d-8941-e78486f3b019.

        Note

        Manually created partitions can be removed only manually, or during a complete metadata disk removal, or during the Machine object removal or re-provisioning.

    3. Add the same Ceph OSD but with a modified configuration and manually created logical volume of the metadata device as described in Add a Ceph OSD with a metadata device.

      For example, instead of metadataDevice: /dev/bluedb/meta_1 define metadataDevice: /dev/ceph-7831901d-398e-415d-8941-e78486f3b019/meta_1 that was manually created in the previous step.
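
The name translation used in the manual reconfiguration above can be sketched as follows. lsblk prints device-mapper names, where every hyphen inside a volume group or logical volume name is doubled and a single hyphen separates the two, while vgs, lvcreate, and the metadataDevice path use the plain names. The function names dm_name_to_vg and metadata_device_path are hypothetical, and the sketch assumes that the input is the bare device-mapper name without the tree-drawing characters that lsblk prints in front of it.

```shell
# Hypothetical helpers for translating lsblk output into LVM names.

# Print the volume group part of a device-mapper name as shown by lsblk.
# Doubled hyphens are protected with '#', which LVM does not allow in names,
# then everything from the first remaining single hyphen (the VG/LV
# separator) is cut, and the protected hyphens are restored.
dm_name_to_vg() {
  printf '%s\n' "$1" | sed -e 's/--/#/g' -e 's/-.*$//' -e 's/#/-/g'
}

# Build the metadataDevice value from a volume group and a logical volume name.
metadata_device_path() {
  printf '/dev/%s/%s\n' "$1" "$2"
}

vg=$(dm_name_to_vg 'ceph--7831901d--398e--415d--8941--e78486f3b019-osd--db--4bdbb0a0--e613--416e--ab97--272f237b7eab')
metadata_device_path "$vg" meta_1
# prints /dev/ceph-7831901d-398e-415d-8941-e78486f3b019/meta_1
```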

Replace a failed Ceph OSD

After a physical disk replacement, you can use Ceph LCM API to redeploy a failed Ceph OSD. The common flow of replacing a failed Ceph OSD is as follows:

  1. Remove the obsolete Ceph OSD from the Ceph cluster by device name, by Ceph OSD ID, or by path.

  2. Add a new Ceph OSD on the new disk to the Ceph cluster.

Note

Ceph OSD replacement presupposes usage of a KaaSCephOperationRequest CR. For workflow overview, spec and phases description, see High-level workflow of Ceph OSD or node removal.

Remove a failed Ceph OSD by device name, path, or ID

Warning

The procedure below presupposes that the operator knows the exact device name, by-path, or by-id of the replaced device, as well as on which node the replacement occurred.

Warning

Since Container Cloud 2.23.1 (Cluster release 12.7.0), a Ceph OSD removal using by-path, by-id, or device name is not supported if a device was physically removed from a node. Therefore, use cleanupByOsdId instead. For details, see Remove a failed Ceph OSD by Ceph OSD ID.

Warning

Since MOSK 23.3, Mirantis does not recommend setting device name or device by-path symlink in the cleanupByDevice field as these identifiers are not persistent and can change at node boot. Remove Ceph OSDs with by-id symlinks specified in the path field or use cleanupByOsdId instead.

For details, see Container Cloud documentation: Addressing storage devices.

  1. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  2. In the nodes section, remove the required device:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>:
            storageDevices:
            - name: <deviceName>  # remove the entire item from storageDevices list
              # fullPath: <deviceByPath> if device is specified with symlink instead of name
              config:
                deviceClass: hdd
    

    Substitute <machineName> with the machine name of the node where the device <deviceName> or <deviceByPath> is going to be replaced.

  3. Save KaaSCephCluster and close the editor.

  4. Create a KaaSCephOperationRequest CR template and save it as replace-failed-osd-<machineName>-<deviceName>-request.yaml:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephOperationRequest
    metadata:
      name: replace-failed-osd-<machineName>-<deviceName>
      namespace: <managedClusterProjectName>
    spec:
      osdRemove:
        nodes:
          <machineName>:
            cleanupByDevice:
            - name: <deviceName>
              # If a device is specified with by-path or by-id instead of
              # name, path: <deviceByPath> or <deviceById>.
      kaasCephCluster:
        name: <kaasCephClusterName>
        namespace: <managedClusterProjectName>
    

    Substitute <kaasCephClusterName> with the corresponding KaaSCephCluster resource from the <managedClusterProjectName> namespace.

  5. Apply the template to the cluster:

    kubectl apply -f replace-failed-osd-<machineName>-<deviceName>-request.yaml
    
  6. Verify that the corresponding request has been created:

    kubectl get kaascephoperationrequest -n <managedClusterProjectName>
    
  7. Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName> -o yaml
    

    Example of system response:

    status:
      childNodesMapping:
        <nodeName>: <machineName>
      osdRemoveStatus:
        removeInfo:
          cleanUpMap:
            <nodeName>:
              osdMapping:
                <osdId>:
                  deviceMapping:
                    <dataDevice>:
                      deviceClass: hdd
                      devicePath: <dataDeviceByPath>
                      devicePurpose: block
                      usedPartition: /dev/ceph-d2d3a759-2c22-4304-b890-a2d87e056bd4/osd-block-ef516477-d2da-492f-8169-a3ebfc3417e2
                      zapDisk: true
    

    Definition of values in angle brackets:

    • <machineName> - name of the machine on which the device is being replaced, for example, worker-1

    • <nodeName> - underlying node name of the machine, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af

    • <osdId> - Ceph OSD ID for the device being replaced, for example, 1

    • <dataDeviceByPath> - by-path of the device placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9

    • <dataDevice> - name of the device placed on the node, for example, /dev/sdb

  8. Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName> -o yaml
    

    Example of system response:

    status:
      phase: ApproveWaiting
    
  9. Edit the KaaSCephOperationRequest CR and set the approve flag to true:

    kubectl -n <managedClusterProjectName> edit kaascephoperationrequest replace-failed-osd-<machineName>-<deviceName>
    

    For example:

    spec:
      osdRemove:
        approve: true
    
  10. Review the following status fields of the KaaSCephOperationRequest CR request processing:

    • status.phase - current state of request processing

    • status.messages - description of the current phase

    • status.conditions - full history of request processing before the current phase

    • status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing, if any

  11. Verify that the KaaSCephOperationRequest has been completed. For example:

    status:
      phase: Completed # or CompletedWithWarnings if there are non-critical issues
    
  12. Remove the device cleanup jobs:

    kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
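
The request template from the procedure above can be generated instead of edited by hand. The following is a sketch under stated assumptions: render_osd_remove_by_device is a hypothetical helper, and choosing the path field for any value under /dev/disk/ and the name field otherwise is an illustration of the note in the template, not product behavior.

```shell
# Hypothetical helper: print a KaaSCephOperationRequest manifest that removes
# a Ceph OSD by device. A by-id or by-path symlink goes to the "path" field,
# a plain device name such as sdb goes to the "name" field.
# Usage: render_osd_remove_by_device <namespace> <kaasCephClusterName> \
#          <machineName> <nameSuffix> <device>
render_osd_remove_by_device() {
  ns=$1; cluster=$2; machine=$3; suffix=$4; device=$5
  case $device in
    /dev/disk/*) field=path ;;
    *)           field=name ;;
  esac
  cat <<EOF
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-osd-${machine}-${suffix}
  namespace: ${ns}
spec:
  osdRemove:
    nodes:
      ${machine}:
        cleanupByDevice:
        - ${field}: ${device}
  kaasCephCluster:
    name: ${cluster}
    namespace: ${ns}
EOF
}

# Example (not executed here):
#   render_osd_remove_by_device <managedClusterProjectName> <kaasCephClusterName> \
#     worker-1 sdb sdb > replace-failed-osd-worker-1-sdb-request.yaml
```

The name suffix is passed separately from the device because a by-id or by-path symlink contains characters that are not valid in a Kubernetes object name.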
    
Remove a failed Ceph OSD by Ceph OSD ID

Caution

The procedure below presupposes that the operator knows only the failed Ceph OSD ID.

  1. Identify the node and device names used by the affected Ceph OSD:

    Using the Ceph CLI in the rook-ceph-tools Pod, run:

    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd metadata <osdId>
    

    Substitute <osdId> with the affected OSD ID.

    Example output:

    {
      "id": 1,
      ...
      "bluefs_db_devices": "vdc",
      ...
      "bluestore_bdev_devices": "vde",
      ...
      "devices": "vdc,vde",
      ...
      "hostname": "kaas-node-6c5e76f9-c2d2-4b1a-b047-3c299913a4bf",
      ...
    },
    

    In the example above, hostname is the node name and devices are all devices used by the affected Ceph OSD.

    In the status section of the KaaSCephCluster CR, obtain the osd-device mapping:

    kubectl get kaascephcluster -n <managedClusterProjectName> -o yaml
    

    Substitute <managedClusterProjectName> with the corresponding value.

    For example:

    status:
      fullClusterInfo:
        cephDetails:
          cephDeviceMapping:
            <nodeName>:
              <osdId>: <deviceName>
    

    In the system response, capture the following parameters:

    • <nodeName> - the corresponding node name that hosts the Ceph OSD

    • <osdId> - the ID of the Ceph OSD to replace

    • <deviceName> - an actual device name to replace

  2. Obtain <machineName> for <nodeName> where the Ceph OSD is placed:

    kubectl -n rook-ceph get node -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.metadata.labels.kaas\.mirantis\.com\/machine-name}{"\n"}{end}'
    
  3. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  4. In the nodes section, remove the required device:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>:
            storageDevices:
            - name: <deviceName>  # remove the entire item from storageDevices list
              config:
                deviceClass: hdd
    

    Substitute <machineName> with the machine name of the node where the device <deviceName> is going to be replaced.

  5. Save KaaSCephCluster and close the editor.

  6. Create a KaaSCephOperationRequest CR template and save it as replace-failed-<machineName>-osd-<osdId>-request.yaml:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephOperationRequest
    metadata:
      name: replace-failed-<machineName>-osd-<osdId>
      namespace: <managedClusterProjectName>
    spec:
      osdRemove:
        nodes:
          <machineName>:
            cleanupByOsdId:
            - <osdId>
      kaasCephCluster:
        name: <kaasCephClusterName>
        namespace: <managedClusterProjectName>
    

    Substitute <kaasCephClusterName> with the corresponding KaaSCephCluster resource from the <managedClusterProjectName> namespace.

  7. Apply the template to the cluster:

    kubectl apply -f replace-failed-<machineName>-osd-<osdId>-request.yaml
    
  8. Verify that the corresponding request has been created:

    kubectl get kaascephoperationrequest -n <managedClusterProjectName>
    
  9. Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-osd-<osdId> -o yaml
    

    Example of system response:

    status:
      childNodesMapping:
        <nodeName>: <machineName>
      osdRemoveStatus:
        removeInfo:
          cleanUpMap:
            <nodeName>:
              osdMapping:
                <osdId>:
                  deviceMapping:
                    <dataDevice>:
                      deviceClass: hdd
                      devicePath: <dataDeviceByPath>
                      devicePurpose: block
                      usedPartition: /dev/ceph-d2d3a759-2c22-4304-b890-a2d87e056bd4/osd-block-ef516477-d2da-492f-8169-a3ebfc3417e2
                      zapDisk: true
    

    Definition of values in angle brackets:

    • <machineName> - name of the machine on which the device is being replaced, for example, worker-1

    • <nodeName> - underlying node name of the machine, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af

    • <osdId> - Ceph OSD ID for the device being replaced, for example, 1

    • <dataDeviceByPath> - by-path of the device placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9

    • <dataDevice> - name of the device placed on the node, for example, /dev/sdb

  10. Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-osd-<osdId> -o yaml
    

    Example of system response:

    status:
      phase: ApproveWaiting
    
  11. Edit the KaaSCephOperationRequest CR and set the approve flag to true:

    kubectl -n <managedClusterProjectName> edit kaascephoperationrequest replace-failed-<machineName>-osd-<osdId>
    

    For example:

    spec:
      osdRemove:
        approve: true
    
  12. Review the following status fields of the KaaSCephOperationRequest CR request processing:

    • status.phase - current state of request processing

    • status.messages - description of the current phase

    • status.conditions - full history of request processing before the current phase

    • status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing, if any

  13. Verify that the KaaSCephOperationRequest has been completed. For example:

    status:
      phase: Completed # or CompletedWithWarnings if there are non-critical issues
    
  14. Remove the device cleanup jobs:

    kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
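
The lookups in steps 1 and 2 of the procedure above can be narrowed down with small text filters. This is a minimal sketch: osd_metadata_field and machine_for_node are hypothetical names, and the first filter assumes the pretty-printed, one-field-per-line layout of the ceph osd metadata output shown above. It is not a general JSON parser.

```shell
# Hypothetical filters that read command output from standard input.

# Print the value of a string field, for example hostname or devices, from
# the output of "ceph osd metadata <osdId>".
osd_metadata_field() {
  sed -n 's/^[[:space:]]*"'"$1"'":[[:space:]]*"\([^"]*\)".*$/\1/p'
}

# Print the machine name for a node from "<nodeName> <machineName>" lines,
# which is the format produced by the jsonpath query in step 2.
machine_for_node() {
  awk -v node="$1" '$1 == node { print $2 }'
}

# Example (not executed here; -it is dropped because the output is piped):
#   kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd metadata <osdId> \
#     | osd_metadata_field hostname
```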
    
Deploy a new device after removal of a failed one

Note

You can spawn a Ceph OSD on a raw device, but it must be clean and without any data or partitions. If you want to add a device that was in use, also ensure it is raw and clean. To clean up all data and partitions from a device, refer to the official Rook documentation.

  1. If you want to add a Ceph OSD on top of a raw device that already exists on a node or is hot-plugged, add the required device using the following guidelines:

    • You can add a raw device to a node during node deployment.

    • If a node supports adding devices without node reboot, you can hot plug a raw device to a node.

    • If a node does not support adding devices without node reboot, you can attach a raw device while the node is shut down. In this case, complete the following steps:

      1. Enable maintenance mode on the managed cluster.

      2. Turn off the required node.

      3. Attach the required raw device to the node.

      4. Turn on the required node.

      5. Disable maintenance mode on the managed cluster.

  2. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  3. In the nodes section, add a new device:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>:
            storageDevices:
            - fullPath: <deviceByID> # Since 2.25.0 (17.0.0) if device is supposed to be added with by-id
              # name: <deviceByID> # Prior MCC 2.25.0 if device is supposed to be added with by-id
              # fullPath: <deviceByPath> # if device is supposed to be added with by-path
              config:
                deviceClass: hdd
    

    Substitute <machineName> with the machine name of the node where device <deviceByID> or <deviceByPath> is going to be added as a Ceph OSD.

  4. Verify that the new Ceph OSD has appeared in the Ceph cluster and is in and up. The fullClusterInfo section should not contain any issues.

    kubectl -n <managedClusterProjectName> get kaascephcluster -o yaml
    

    For example:

    status:
      fullClusterInfo:
        daemonStatus:
          osd:
            running: '3/3 running: 3 up, 3 in'
            status: Ok
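
The check in the last step can be automated by parsing the running string. The sketch below assumes that the string always has the form shown in the example above ("3/3 running: 3 up, 3 in"). The function name osds_all_up_in is hypothetical.

```shell
# Hypothetical helper: succeed only if every Ceph OSD is running, up, and in.
# Usage: osds_all_up_in "<value of status.fullClusterInfo.daemonStatus.osd.running>"
osds_all_up_in() {
  # Split "R/T running: U up, I in" into the four numbers R T U I.
  set -- $(printf '%s\n' "$1" | sed -n \
    's|^\([0-9][0-9]*\)/\([0-9][0-9]*\) running: \([0-9][0-9]*\) up, \([0-9][0-9]*\) in$|\1 \2 \3 \4|p')
  # Any string that does not match the expected form yields no numbers.
  [ "$#" -eq 4 ] && [ "$1" -eq "$2" ] && [ "$3" -eq "$2" ] && [ "$4" -eq "$2" ]
}

# Example (not executed here):
#   running=$(kubectl -n <managedClusterProjectName> get kaascephcluster <kaasCephClusterName> \
#     -o jsonpath='{.status.fullClusterInfo.daemonStatus.osd.running}')
#   osds_all_up_in "$running" && echo "all Ceph OSDs are up and in"
```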
    
Replace a failed Ceph OSD with a metadata device

The document describes various scenarios of a Ceph OSD outage and recovery or replacement. More specifically, this section describes how to replace a failed Ceph OSD with a metadata device:

  • If the metadata device is specified as a logical volume in the BareMetalHostProfile object and defined in the KaaSCephCluster object as a logical volume path

  • If the metadata device is specified in the KaaSCephCluster object as a device name

Note

Ceph OSD replacement implies the usage of the KaaSCephOperationRequest custom resource (CR). For workflow overview, spec and phases description, see High-level workflow of Ceph OSD or node removal.

Replace a failed Ceph OSD with a metadata device as a logical volume path

You can apply the procedure below in the following cases:

  • A Ceph OSD failed without data or metadata device outage. In this case, first remove a failed Ceph OSD and clean up all corresponding disks and partitions. Then add a new Ceph OSD to the same data and metadata paths.

  • A Ceph OSD failed with data or metadata device outage. In this case, you also first remove a failed Ceph OSD and clean up all corresponding disks and partitions. Then add a new Ceph OSD to a newly replaced data device with the same metadata path.

Note

The procedure below also applies to manually created metadata partitions.

Remove a failed Ceph OSD by ID with a defined metadata device
  1. Identify the ID of Ceph OSD related to a failed device. For example, use the Ceph CLI in the rook-ceph-tools Pod:

    ceph osd metadata
    

    Example of system response:

    {
        "id": 0,
        ...
        "bluestore_bdev_devices": "vdc",
        ...
        "devices": "vdc",
        ...
        "hostname": "kaas-node-6c5e76f9-c2d2-4b1a-b047-3c299913a4bf",
        ...
        "pod_name": "rook-ceph-osd-0-7b8d4d58db-f6czn",
        ...
    },
    {
        "id": 1,
        ...
        "bluefs_db_devices": "vdf",
        ...
        "bluestore_bdev_devices": "vde",
        ...
        "devices": "vde,vdf",
        ...
        "hostname": "kaas-node-6c5e76f9-c2d2-4b1a-b047-3c299913a4bf",
        ...
        "pod_name": "rook-ceph-osd-1-78fbc47dc5-px9n2",
        ...
    },
    ...
    
  2. Open the KaaSCephCluster custom resource (CR) for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  3. In the nodes section:

    1. Find and capture the metadataDevice path to reuse it during re-creation of the Ceph OSD.

    2. Remove the required device:

    Example configuration snippet:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>:
            storageDevices:
            - name: <deviceName>  # remove the entire item from the storageDevices list
              # fullPath: <deviceByPath> if device is specified using by-path instead of name
              config:
                deviceClass: hdd
                metadataDevice: /dev/bluedb/meta_1
    

    In the example above, <machineName> is the name of the machine that relates to the node on which the device <deviceName> or <deviceByPath> must be replaced.

  4. Create a KaaSCephOperationRequest CR template and save it as replace-failed-osd-<machineName>-<osdID>-request.yaml:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephOperationRequest
    metadata:
      name: replace-failed-osd-<machineName>-<osdID>
      namespace: <managedClusterProjectName>
    spec:
      osdRemove:
        nodes:
          <machineName>:
            cleanupByOsdId:
            - <osdID>
      kaasCephCluster:
        name: <kaasCephClusterName>
        namespace: <managedClusterProjectName>
    

    Substitute the following parameters:

    • <machineName> with the machine name from the previous step

    • <managedClusterProjectName> with the project name of the related managed cluster

    • <osdID> with the ID of the affected Ceph OSD

    • <kaasCephClusterName> with the KaaSCephCluster resource name

  5. Apply the template to the cluster:

    kubectl apply -f replace-failed-osd-<machineName>-<osdID>-request.yaml
    
  6. Verify that the corresponding request has been created:

    kubectl get kaascephoperationrequest -n <managedClusterProjectName>
    
  7. Verify that the status section of KaaSCephOperationRequest contains the removeInfo section:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<osdID> -o yaml
    

    Example of system response:

    childNodesMapping:
      <nodeName>: <machineName>
    removeInfo:
      cleanUpMap:
        <nodeName>:
          osdMapping:
            "<osdID>":
              deviceMapping:
                <dataDevice>:
                  deviceClass: hdd
                  devicePath: <dataDeviceByPath>
                  devicePurpose: block
                  usedPartition: /dev/ceph-d2d3a759-2c22-4304-b890-a2d87e056bd4/osd-block-ef516477-d2da-492f-8169-a3ebfc3417e2
                  zapDisk: true
                <metadataDevice>:
                  deviceClass: hdd
                  devicePath: <metadataDeviceByPath>
                  devicePurpose: db
                  usedPartition: /dev/bluedb/meta_1
              uuid: ef516477-d2da-492f-8169-a3ebfc3417e2
    

    Definition of values in angle brackets:

    • <machineName> - name of the machine on which the device is being replaced, for example, worker-1

    • <nodeName> - underlying node name of the machine, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af

    • <osdID> - Ceph OSD ID for the device being replaced, for example, 1

    • <dataDeviceByPath> - by-path of the device placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9

    • <dataDevice> - name of the device placed on the node, for example, /dev/vde

    • <metadataDevice> - metadata name of the device placed on the node, for example, /dev/vdf

    • <metadataDeviceByPath> - metadata by-path of the device placed on the node, for example, /dev/disk/by-path/pci-0000:00:12.0

    Note

    The partitions that are manually created or configured using the BareMetalHostProfile object can be removed only manually, or during a complete metadata disk removal, or during the Machine object removal or re-provisioning.

  8. Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-osd-<machineName>-<osdID> -o yaml
    

    Example of system response:

    status:
      phase: ApproveWaiting
    
  9. In the KaaSCephOperationRequest CR, set the approve flag to true:

    kubectl -n <managedClusterProjectName> edit kaascephoperationrequest replace-failed-osd-<machineName>-<osdID>
    

    Configuration snippet:

    spec:
      osdRemove:
        approve: true
    
  10. Review the following status fields of the KaaSCephOperationRequest CR request processing:

    • status.phase - current state of request processing

    • status.messages - description of the current phase

    • status.conditions - full history of request processing before the current phase

    • status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing, if any

  11. Verify that the KaaSCephOperationRequest has been completed. For example:

    status:
      phase: Completed # or CompletedWithWarnings if there are non-critical issues
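
When verifying the removeInfo section in the procedure above, it helps to confirm that the metadata partition listed for removal is the metadataDevice path captured earlier. The following filter is a sketch, not product tooling. db_partitions is a hypothetical name, and the filter assumes that devicePurpose precedes usedPartition within each device entry, as in the example response above.

```shell
# Hypothetical filter: read the YAML of a KaaSCephOperationRequest from
# standard input and print usedPartition of every device entry whose
# devicePurpose is db. Values are printed as-is, including any quotes.
db_partitions() {
  awk '
    $1 == "devicePurpose:" { purpose = $2 }
    $1 == "usedPartition:" && purpose == "db" { print $2 }
    $1 == "usedPartition:" { purpose = "" }
  '
}

# Example (not executed here):
#   kubectl -n <managedClusterProjectName> get kaascephoperationrequest \
#     replace-failed-osd-<machineName>-<osdID> -o yaml | db_partitions
```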
    
Re-create a Ceph OSD with the same metadata partition

Note

You can spawn a Ceph OSD on a raw device, but it must be clean and without any data or partitions. If you want to add a device that was in use, also ensure it is raw and clean. To clean up all data and partitions from a device, refer to the official Rook documentation.

  1. If you want to add a Ceph OSD on top of a raw device that already exists on a node or is hot-plugged, add the required device using the following guidelines:

    • You can add a raw device to a node during node deployment.

    • If a node supports adding devices without node reboot, you can hot plug a raw device to a node.

    • If a node does not support adding devices without node reboot, you can attach a raw device while the node is shut down. In this case, complete the following steps:

      1. Enable maintenance mode on the managed cluster.

      2. Turn off the required node.

      3. Attach the required raw device to the node.

      4. Turn on the required node.

      5. Disable maintenance mode on the managed cluster.

  2. Open the KaaSCephCluster CR for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  3. In the nodes section, add the replaced device with the same metadataDevice path as on the removed Ceph OSD. For example:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>:
            storageDevices:
            - name: <deviceByID> # Recommended. Add a new device by ID, for example, /dev/disk/by-id/...
              #fullPath: <deviceByPath> # Add a new device by path, for example, /dev/disk/by-path/...
              config:
                deviceClass: hdd
                metadataDevice: /dev/bluedb/meta_1 # Must match the value of the previously removed OSD
    

    Substitute <machineName> with the machine name of the node where the new device <deviceByID> or <deviceByPath> must be added.

  4. Wait for the replacement disk to join the Ceph cluster as a new Ceph OSD.

    You can monitor the progress either in the status section of the KaaSCephCluster CR or from the rook-ceph-tools Pod:

    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s
    
Replace a failed Ceph OSD disk with a metadata device as a device name

Use the following procedure if a Ceph OSD fails because of a data disk outage and the metadata partition is not specified in the BareMetalHostProfile custom resource (CR). In this scenario, the Ceph cluster automatically creates the required metadata logical volume on the desired device.

Remove a Ceph OSD with a metadata device as a device name

To remove the affected Ceph OSD with a metadata device as a device name, follow the Remove a failed Ceph OSD by ID with a defined metadata device procedure and capture the following details:

  • While editing KaaSCephCluster in the nodes section, capture the metadataDevice path to reuse it during re-creation of the Ceph OSD.

    Example of the spec.cephClusterSpec.nodes section:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>:
            storageDevices:
            - name: <deviceName>  # remove the entire item from the storageDevices list
              # fullPath: <deviceByPath> if device is specified using by-path instead of name
              config:
                deviceClass: hdd
                metadataDevice: /dev/nvme0n1
    

    In the example above, save the metadataDevice device name /dev/nvme0n1.

  • During verification of removeInfo, capture the usedPartition value of the metadata device located in the deviceMapping.<metadataDevice> section.

    Example of the removeInfo section:

    removeInfo:
      cleanUpMap:
        <nodeName>:
          osdMapping:
            "<osdID>":
              deviceMapping:
                <dataDevice>:
                  deviceClass: hdd
                  devicePath: <dataDeviceByPath>
                  devicePurpose: block
                  usedPartition: /dev/ceph-d2d3a759-2c22-4304-b890-a2d87e056bd4/osd-block-ef516477-d2da-492f-8169-a3ebfc3417e2
                  zapDisk: true
                <metadataDevice>:
                  deviceClass: hdd
                  devicePath: <metadataDeviceByPath>
                  devicePurpose: db
                  usedPartition: /dev/ceph-b0c70c72-8570-4c9d-93e9-51c3ab4dd9f9/osd-db-ecf64b20-1e07-42ac-a8ee-32ba3c0b7e2f
              uuid: ef516477-d2da-492f-8169-a3ebfc3417e2
    

    In the example above, capture the following values from the <metadataDevice> section:

    • ceph-b0c70c72-8570-4c9d-93e9-51c3ab4dd9f9 - name of the volume group that contains all metadata partitions on the <metadataDevice> disk

    • osd-db-ecf64b20-1e07-42ac-a8ee-32ba3c0b7e2f - name of the logical volume that relates to a failed Ceph OSD
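
The two names can also be split out of the usedPartition value with plain shell parameter expansion. The following is a minimal sketch, assuming the value always has the /dev/<vgName>/<lvName> form shown in the example above. The variable names are illustrative and are not part of the product.

```shell
# Value of usedPartition copied from the <metadataDevice> section of removeInfo.
used_partition="/dev/ceph-b0c70c72-8570-4c9d-93e9-51c3ab4dd9f9/osd-db-ecf64b20-1e07-42ac-a8ee-32ba3c0b7e2f"

# The last path component is the logical volume name.
lv_name="${used_partition##*/}"

# Drop the logical volume, then keep the last remaining component:
# the volume group name.
vg_path="${used_partition%/*}"
vg_name="${vg_path##*/}"

echo "volume group:   ${vg_name}"
echo "logical volume: ${lv_name}"
```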

Re-create the metadata partition on the existing metadata disk

After you remove the Ceph OSD disk, manually create a separate logical volume for the metadata partition in an existing volume group on the metadata device:

lvcreate -l 100%FREE -n meta_1 <vgName>

Substitute <vgName> with the name of the volume group captured in the usedPartition parameter.

Note

If you removed more than one OSD, replace -l 100%FREE with -L <partitionSize>, because the -l option accepts only a number of extents or a percentage, while -L accepts an absolute size. For example:

lvcreate -L <partitionSize> -n meta_1 <vgName>

Substitute <partitionSize> with the size of the other partitions placed on the affected metadata drive, for example, 16G. To obtain this value, use the output of the lvs command.

During execution of the lvcreate command, the system asks you to wipe the found bluestore label on a metadata device. For example:

WARNING: ceph_bluestore signature detected on /dev/ceph-b0c70c72-8570-4c9d-93e9-51c3ab4dd9f9/meta_1 at offset 0. Wipe it? [y/n]:

Using the interactive shell, answer n to keep all metadata partitions alive. After answering n, the system outputs the following:

Aborted wiping of ceph_bluestore.
1 existing signature left on the device.
Logical volume "meta_1" created.
Re-create the Ceph OSD with the re-created metadata partition

Note

You can spawn a Ceph OSD on a raw device, but the device must be clean and contain no data or partitions. If you want to add a device that was previously in use, also ensure that it is raw and clean. To clean up all data and partitions from a device, refer to the official Rook documentation.

  1. If you want to add a Ceph OSD on top of a raw device that already exists on a node or is hot-plugged, add the required device using the following guidelines:

    • You can add a raw device to a node during node deployment.

    • If a node supports adding devices without node reboot, you can hot plug a raw device to a node.

    • If a node does not support adding devices without node reboot, you can hot plug a raw device during node shutdown. In this case, complete the following steps:

      1. Enable maintenance mode on the managed cluster.

      2. Turn off the required node.

      3. Attach the required raw device to the node.

      4. Turn on the required node.

      5. Disable maintenance mode on the managed cluster.

  2. Open the KaaSCephCluster CR for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  3. In the nodes section, add the replacement device and set metadataDevice to the path of the logical volume that you re-created on the existing metadata disk:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>:
            storageDevices:
            - fullPath: <deviceByID> # Recommended since MCC 2.25.0 (17.0.0).
                                     # Add a new device by-id symlink, for example, /dev/disk/by-id/...
              #name: <deviceByID> # Add a new device by ID, for example, /dev/disk/by-id/...
              #fullPath: <deviceByPath> # Add a new device by path, for example, /dev/disk/by-path/...
              config:
                deviceClass: hdd
                metadataDevice: /dev/<vgName>/meta_1
    

    Substitute <machineName> with the machine name of the node where the new device <deviceByID> or <deviceByPath> must be added. Also specify metadataDevice with the path to the logical volume created during the Re-create the metadata partition on the existing metadata disk procedure.

  4. Wait for the replaced disk to apply to the Ceph cluster as a new Ceph OSD.

    You can monitor the application state either in the status section of the KaaSCephCluster CR or from the rook-ceph-tools Pod:

    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s
    
Replace a failed metadata device

This section describes the scenario when an underlying metadata device fails with all related Ceph OSDs. In this case, the only solution is to remove all Ceph OSDs related to the failed metadata device, then attach a device that will be used as a new metadata device, and re-create all affected Ceph OSDs.

Caution

If you used BareMetalHostProfile to automatically partition the failed device, you must create a manual partition of the new device because BareMetalHostProfile does not support hot-load changes and creates an automatic device partition only during node provisioning.

Remove failed Ceph OSDs with the affected metadata device
  1. Save the KaaSCephCluster specification of all Ceph OSDs affected by the failed metadata device to re-use this specification during re-creation of Ceph OSDs after disk replacement.

  2. Identify Ceph OSD IDs related to the failed metadata device, for example, using Ceph CLI in the rook-ceph-tools Pod:

    ceph osd metadata
    

    Example of system response:

    {
        "id": 11,
        ...
        "bluefs_db_devices": "vdc",
        ...
        "bluestore_bdev_devices": "vde",
        ...
        "devices": "vdc,vde",
        ...
        "hostname": "kaas-node-6c5e76f9-c2d2-4b1a-b047-3c299913a4bf",
        ...
    },
    {
        "id": 12,
        ...
        "bluefs_db_devices": "vdd",
        ...
        "bluestore_bdev_devices": "vde",
        ...
        "devices": "vdd,vde",
        ...
        "hostname": "kaas-node-6c5e76f9-c2d2-4b1a-b047-3c299913a4bf",
        ...
    },
    {
        "id": 13,
        ...
        "bluefs_db_devices": "vdf",
        ...
        "bluestore_bdev_devices": "vde",
        ...
        "devices": "vde,vdf",
        ...
        "hostname": "kaas-node-6c5e76f9-c2d2-4b1a-b047-3c299913a4bf",
        ...
    },
    ...
    
  3. Open the KaaSCephCluster custom resource (CR) for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  4. In the nodes section, remove all storageDevices items that relate to the failed metadata device. For example:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>:
            storageDevices:
            - name: <deviceName1>  # remove the entire item from the storageDevices list
              # fullPath: <deviceByPath> if device is specified using symlink instead of name
              config:
                deviceClass: hdd
                metadataDevice: <metadataDevice>
            - name: <deviceName2>  # remove the entire item from the storageDevices list
              config:
                deviceClass: hdd
                metadataDevice: <metadataDevice>
            - name: <deviceName3>  # remove the entire item from the storageDevices list
              config:
                deviceClass: hdd
                metadataDevice: <metadataDevice>
            ...
    

    In the example above, <machineName> is the machine name of the node where the metadata device <metadataDevice> must be replaced.

  5. Create a KaaSCephOperationRequest CR template and save it as replace-failed-meta-<machineName>-<metadataDevice>-request.yaml:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephOperationRequest
    metadata:
      name: replace-failed-meta-<machineName>-<metadataDevice>
      namespace: <managedClusterProjectName>
    spec:
      osdRemove:
        nodes:
          <machineName>:
            cleanupByOsdId:
            - <osdID-1>
            - <osdID-2>
            ...
      kaasCephCluster:
        name: <kaasCephClusterName>
        namespace: <managedClusterProjectName>
    

    Substitute the following parameters:

    • <machineName> and <metadataDevice> with the machine and device names from the previous step

    • <managedClusterProjectName> with the project name of the related managed cluster

    • <osdID-*> with IDs of the affected Ceph OSDs

    • <kaasCephClusterName> with the KaaSCephCluster CR name

  6. Apply the template to the cluster:

    kubectl apply -f replace-failed-meta-<machineName>-<metadataDevice>-request.yaml
    
  7. Verify that the corresponding request has been created:

    kubectl get kaascephoperationrequest -n <managedClusterProjectName>
    
  8. Verify that the removeInfo section is present in the KaaSCephOperationRequest CR status and that the cleanUpMap section matches the required removal:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-meta-<machineName>-<metadataDevice> -o yaml
    

    Example of system response:

    childNodesMapping:
      <nodeName>: <machineName>
    removeInfo:
      cleanUpMap:
        <nodeName>:
          osdMapping:
            "<osdID-1>":
              deviceMapping:
                <dataDevice-1>:
                  deviceClass: hdd
                  devicePath: <dataDeviceByPath-1>
                  devicePurpose: block
                  usedPartition: <dataLvPartition-1>
                  zapDisk: true
                <metadataDevice>:
                  deviceClass: hdd
                  devicePath: <metadataDeviceByPath>
                  devicePurpose: db
                  usedPartition: /dev/ceph-b0c70c72-8570-4c9d-93e9-51c3ab4dd9f9/osd-db-ecf64b20-1e07-42ac-a8ee-32ba3c0b7e2f
              uuid: ef516477-d2da-492f-8169-a3ebfc3417e2
            "<osdID-2>":
              deviceMapping:
                <dataDevice-2>:
                  deviceClass: hdd
                  devicePath: <dataDeviceByPath-2>
                  devicePurpose: block
                  usedPartition: <dataLvPartition-2>
                  zapDisk: true
                <metadataDevice>:
                  deviceClass: hdd
                  devicePath: <metadataDeviceByPath>
                  devicePurpose: db
                  usedPartition: /dev/ceph-b0c70c72-8570-4c9d-93e9-51c3ab4dd9f9/osd-db-ecf64b20-1e07-42ac-a8ee-32ba3c0b7e2f
              uuid: ef516477-d2da-492f-8169-a3ebfc3417e2
            ...
    

    Definition of values in angle brackets:

    • <machineName> - name of the machine on which the device is being replaced, for example, worker-1

    • <nodeName> - underlying node name of the machine, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af

    • <osdID> - Ceph OSD ID for the device being replaced, for example, 1

    • <dataDeviceByPath> - by-path of the device placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9

    • <dataDevice> - name of the device placed on the node, for example, /dev/vdc

    • <metadataDevice> - metadata name of the device placed on the node, for example, /dev/vde

    • <metadataDeviceByPath> - metadata by-path of the device placed on the node, for example, /dev/disk/by-path/pci-0000:00:12.0

    • <dataLvPartition> - logical volume partition of the data device

  9. Wait for the ApproveWaiting phase to appear in status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-meta-<machineName>-<metadataDevice> -o yaml
    

    Example of system response:

    status:
      phase: ApproveWaiting
    
  10. In the KaaSCephOperationRequest CR, set the approve flag to true:

    kubectl -n <managedClusterProjectName> edit kaascephoperationrequest replace-failed-meta-<machineName>-<metadataDevice>
    

    Configuration snippet:

    spec:
      osdRemove:
        approve: true
    
  11. Review the following status fields of the KaaSCephOperationRequest CR request processing:

    • status.phase - current state of request processing

    • status.messages - description of the current phase

    • status.conditions - full history of request processing before the current phase

    • status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing, if any

  12. Verify that the KaaSCephOperationRequest has been completed. For example:

    status:
      phase: Completed # or CompletedWithWarnings if there are non-critical issues
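
The waiting in steps 9 and 12 can be scripted instead of re-running the get command by hand. The following is a minimal sketch of a polling helper, not a part of the product tooling. The wait_for_phase function and the POLL_INTERVAL variable are hypothetical names, and the sketch assumes that the phase is exposed at .status.phase, as in the examples of system response above.

```shell
# wait_for_phase WANT TRIES CMD [ARGS...]
# Runs CMD up to TRIES times until its output equals WANT.
# POLL_INTERVAL is the delay between attempts in seconds (default: 5).
wait_for_phase() {
  want="$1"
  tries="$2"
  shift 2
  attempt=0
  while [ "$attempt" -lt "$tries" ]; do
    # Treat a failing command as "no phase reported yet".
    phase="$("$@" 2>/dev/null)" || phase=""
    if [ "$phase" = "$want" ]; then
      echo "$phase"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  echo "timed out waiting for phase ${want}, last seen: ${phase:-none}" >&2
  return 1
}

# Example with the placeholders used in this procedure:
#   wait_for_phase ApproveWaiting 60 \
#     kubectl -n <managedClusterProjectName> get kaascephoperationrequest \
#     replace-failed-meta-<machineName>-<metadataDevice> \
#     -o jsonpath='{.status.phase}'
```

Because the helper only compares command output, the same function can wait for the Completed phase after the request is approved.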
    
Prepare the replaced metadata device for Ceph OSD re-creation

Note

This section describes how to partition a metadata disk into N logical volumes. To create one partition on a metadata disk, refer to Reconfigure a partition of a Ceph OSD metadata device.

  1. Plan to partition the replaced metadata device into N logical volumes (LVs), where N is the number of Ceph OSDs previously located on the failed metadata device.

    Calculate the percentage of the volume group capacity for each new metadata LV using the 100 / N formula.

  2. Log in to the node with the replaced metadata disk.

  3. Create an LVM physical volume atop the replaced metadata device:

    pvcreate <metadataDisk>
    

    Substitute <metadataDisk> with the replaced metadata device.

  4. Create an LVM volume group atop the physical volume:

    vgcreate bluedb <metadataDisk>
    

    Substitute <metadataDisk> with the replaced metadata device.

  5. Create N LVM logical volumes with the calculated capacity for each volume:

    lvcreate -l <X>%VG -n meta_<i> bluedb
    

    Substitute <X> with the result of the 100 / N formula and <i> with the sequence number of the metadata partition, from 1 to N.

As a result, the replaced metadata device will have N LVM paths, for example, /dev/bluedb/meta_1.
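
The calculation and the repeated lvcreate calls from the steps above can be sketched as a small dry-run helper. This is an illustrative sketch only: print_meta_lv_commands is a hypothetical function name, bluedb is the volume group name used in this procedure, and the helper prints the commands instead of running them so that you can review them first.

```shell
# print_meta_lv_commands N [VG]
# Prints, without running them, the lvcreate commands for N metadata volumes.
print_meta_lv_commands() {
  count="$1"
  vg="${2:-bluedb}"
  if [ "$count" -lt 1 ]; then
    echo "N must be a positive integer" >&2
    return 1
  fi
  # Integer division rounds down, so the volumes never exceed the volume group.
  percent=$((100 / count))
  i=1
  while [ "$i" -le "$count" ]; do
    echo "lvcreate -l ${percent}%VG -n meta_${i} ${vg}"
    i=$((i + 1))
  done
}

# Three Ceph OSDs shared the failed metadata device.
print_meta_lv_commands 3
```

For N equal to 3, the helper prints three commands that use 33%VG and the names meta_1, meta_2, and meta_3.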

Re-create a Ceph OSD on the replaced metadata device

Note

You can spawn a Ceph OSD on a raw device, but the device must be clean and contain no data or partitions. If you want to add a device that was previously in use, also ensure that it is raw and clean. To clean up all data and partitions from a device, refer to the official Rook documentation.

  1. Open the KaaSCephCluster CR for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  2. In the nodes section, add the cleaned Ceph OSD device with the replaced LVM paths of the metadata device from previous steps. For example:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>:
            storageDevices:
            - name: <deviceByID-1> # Recommended. Add the new device by ID /dev/disk/by-id/...
              #fullPath: <deviceByPath-1> # Add a new device by path /dev/disk/by-path/...
              config:
                deviceClass: hdd
                metadataDevice: /dev/<vgName>/<lvName-1>
            - name: <deviceByID-2> # Recommended. Add the new device by ID /dev/disk/by-id/...
              #fullPath: <deviceByPath-2> # Add a new device by path /dev/disk/by-path/...
              config:
                deviceClass: hdd
                metadataDevice: /dev/<vgName>/<lvName-2>
            - name: <deviceByID-3> # Recommended. Add the new device by ID /dev/disk/by-id/...
              #fullPath: <deviceByPath-3> # Add a new device by path /dev/disk/by-path/...
              config:
                deviceClass: hdd
                metadataDevice: /dev/<vgName>/<lvName-3>
    
    • Substitute <machineName> with the machine name of the node where the metadata device has been replaced.

    • Add all data devices for the re-created Ceph OSDs and set metadataDevice to the path of the corresponding previously created logical volume. Substitute <vgName> with the name of the volume group that contains the N logical volumes <lvName-i>.

  3. Wait for the re-created Ceph OSDs to apply to the Ceph cluster.

    You can monitor the application state either in the status section of the KaaSCephCluster CR or from the rook-ceph-tools Pod:

    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s
    
Replace a failed Ceph node

After a physical node replacement, you can use the Ceph LCM API to redeploy failed Ceph nodes. The common flow of replacing a failed Ceph node is as follows:

  1. Remove the obsolete Ceph node from the Ceph cluster.

  2. Add a new Ceph node with the same configuration to the Ceph cluster.

Note

Ceph OSD node replacement presupposes usage of a KaaSCephOperationRequest CR. For workflow overview, spec and phases description, see High-level workflow of Ceph OSD or node removal.

Remove a failed Ceph node
  1. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  2. In the nodes section, remove the entire entry of the node to replace:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>: # remove the entire entry for the node to replace
            storageDevices: {...}
            role: [...]
    

    Substitute <machineName> with the machine name to replace.

  3. Save KaaSCephCluster and close the editor.

  4. Create a KaaSCephOperationRequest CR template and save it as replace-failed-<machineName>-request.yaml:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephOperationRequest
    metadata:
      name: replace-failed-<machineName>-request
      namespace: <managedClusterProjectName>
    spec:
      osdRemove:
        nodes:
          <machineName>:
            completeCleanUp: true
      kaasCephCluster:
        name: <kaasCephClusterName>
        namespace: <managedClusterProjectName>
    

    Substitute <kaasCephClusterName> with the corresponding KaaSCephCluster resource from the <managedClusterProjectName> namespace.

  5. Apply the template to the cluster:

    kubectl apply -f replace-failed-<machineName>-request.yaml
    
  6. Verify that the corresponding request has been created:

    kubectl get kaascephoperationrequest -n <managedClusterProjectName>
    
  7. Verify that the removeInfo section appeared in the KaaSCephOperationRequest CR status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request -o yaml
    

    Example of system response:

    status:
      childNodesMapping:
        <nodeName>: <machineName>
      osdRemoveStatus:
        removeInfo:
          cleanUpMap:
            <nodeName>:
              osdMapping:
                ...
                <osdId>:
                  deviceMapping:
                    ...
                    <deviceName>:
                      path: <deviceByPath>
                      partition: "/dev/ceph-b-vg_sdb/osd-block-b-lv_sdb"
                      type: "block"
                      class: "hdd"
                      zapDisk: true
    

    Definition of values in angle brackets:

    • <machineName> - machine name where the replacement occurs, for example, worker-1.

    • <nodeName> - underlying machine node name, for example, kaas-node-5a74b669-7e53-4535-aabd-5b509ec844af.

    • <osdId> - actual Ceph OSD ID for the device being replaced, for example, 1.

    • <deviceName> - actual device name placed on the node, for example, sdb.

    • <deviceByPath> - actual device by-path placed on the node, for example, /dev/disk/by-path/pci-0000:00:1t.9.

  8. Verify that the cleanUpMap section matches the required removal and wait for the ApproveWaiting phase to appear in status:

    kubectl -n <managedClusterProjectName> get kaascephoperationrequest replace-failed-<machineName>-request -o yaml
    

    Example of system response:

    status:
      phase: ApproveWaiting
    
  9. Edit the KaaSCephOperationRequest CR and set the approve flag to true:

    kubectl -n <managedClusterProjectName> edit kaascephoperationrequest replace-failed-<machineName>-request
    

    For example:

    spec:
      osdRemove:
        approve: true
    
  10. Review the following status fields of the KaaSCephOperationRequest CR request processing:

    • status.phase - current state of request processing

    • status.messages - description of the current phase

    • status.conditions - full history of request processing before the current phase

    • status.removeInfo.issues and status.removeInfo.warnings - error and warning messages that occurred during request processing, if any

  11. Verify that the KaaSCephOperationRequest has been completed. For example:

    status:
      phase: Completed # or CompletedWithWarnings if there are non-critical issues
    
  12. Remove the device cleanup jobs:

    kubectl delete jobs -n ceph-lcm-mirantis -l app=miraceph-cleanup-disks
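
The request template from step 4 of this procedure can also be generated from shell variables instead of being edited by hand. The following is a minimal sketch that only prints the manifest. The render_node_removal_request function and the sample values worker-1, managed-ns, and ceph-cluster-managed are hypothetical, while the manifest fields are copied from the template in step 4.

```shell
# render_node_removal_request MACHINE PROJECT CLUSTER
# Prints a KaaSCephOperationRequest manifest for removing a whole Ceph node.
render_node_removal_request() {
  machine="$1"
  project="$2"
  cluster="$3"
  cat <<EOF
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: replace-failed-${machine}-request
  namespace: ${project}
spec:
  osdRemove:
    nodes:
      ${machine}:
        completeCleanUp: true
  kaasCephCluster:
    name: ${cluster}
    namespace: ${project}
EOF
}

# Hypothetical values; replace them with the names used in your environment.
render_node_removal_request worker-1 managed-ns ceph-cluster-managed
```

Redirect the output to replace-failed-<machineName>-request.yaml and review the file before applying it to the cluster.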
    
Deploy a new Ceph node after removal of a failed one

Note

You can spawn a Ceph OSD on a raw device, but the device must be clean and contain no data or partitions. If you want to add a device that was previously in use, also ensure that it is raw and clean. To clean up all data and partitions from a device, refer to the official Rook documentation.

  1. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  2. In the nodes section, add the configuration of the new node with its storage devices:

    spec:
      cephClusterSpec:
        nodes:
          <machineName>: # add new configuration for replaced Ceph node
            storageDevices:
            - fullPath: <deviceByID> # Recommended since MCC 2.25.0 (17.0.0), non-wwn by-id symlink
              # name: <deviceByID> # Prior to MCC 2.25.0, non-wwn by-id symlink
              # fullPath: <deviceByPath> # if device is supposed to be added with by-path
              config:
                deviceClass: hdd
              ...
    

    Substitute <machineName> with the machine name of the replaced node and configure it as required.

    Warning

    Since MCC 2.25.0 (17.0.0), Mirantis highly recommends using non-wwn by-id symlinks only to specify storage devices in the storageDevices list.

    For details, see Container Cloud documentation: Addressing storage devices.

  3. Verify that all Ceph daemons from the replaced node have appeared on the Ceph cluster and are in and up. The fullClusterInfo section should not contain any issues.

    kubectl -n <managedClusterProjectName> get kaascephcluster -o yaml
    

    Example of system response:

    status:
      fullClusterInfo:
        clusterStatus:
          ceph:
            health: HEALTH_OK
            ...
        daemonStatus:
          mgr:
            running: a is active mgr
            status: Ok
          mon:
            running: '3/3 mons running: [a b c] in quorum'
            status: Ok
          osd:
            running: '3/3 running: 3 up, 3 in'
            status: Ok
    
  4. Verify the Ceph node on the managed cluster:

    kubectl -n rook-ceph get pod -o wide | grep <machineName>
    
Remove Ceph OSD manually

You may need to manually remove a Ceph OSD, for example, in the following cases:

  • If you have removed a device or node from the KaaSCephCluster spec.cephClusterSpec.nodes or spec.cephClusterSpec.nodeGroups section with manageOsds set to false.

  • If you do not want to rely on Ceph LCM operations and want to manage the Ceph OSDs life cycle manually.

To safely remove one or multiple Ceph OSDs from a Ceph cluster, perform the following procedure for each Ceph OSD one by one.

Warning

The procedure presupposes cleanup of the Ceph OSD disk or of its logical volume partitions.

To remove a Ceph OSD manually:

  1. Edit the KaaSCephCluster resource on a management cluster:

    kubectl --kubeconfig <mgmtKubeconfig> -n <managedClusterProjectName> edit kaascephcluster
    

    Substitute <mgmtKubeconfig> with the management cluster kubeconfig and <managedClusterProjectName> with the project name of the managed cluster.

  2. In the spec.cephClusterSpec.nodes section, remove the required storageDevices item of the corresponding node spec. If after removal storageDevices becomes empty and the node spec has no roles specified, also remove the node spec.

  3. Obtain kubeconfig of the managed cluster and provide it as an environment variable:

    export KUBECONFIG=<pathToManagedKubeconfig>
    
  4. Verify that all Ceph OSDs are up and in, the Ceph cluster is healthy, and no rebalance or recovery is in progress:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l \
    app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph -s
    

    Example of system response:

    cluster:
      id:     8cff5307-e15e-4f3d-96d5-39d3b90423e4
      health: HEALTH_OK
      ...
      osd: 4 osds: 4 up (since 10h), 4 in (since 10h)
    
  5. Stop the rook-ceph/rook-ceph-operator deployment to avoid premature reorchestration of the Ceph cluster:

    kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
    
  6. Enter the rook-ceph-tools pod:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l \
    app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- bash
    
  7. Mark the required Ceph OSD as out:

    ceph osd out osd.<ID>
    

    Note

    In the command above and in the steps below, substitute <ID> with the number of the Ceph OSD to remove.

  8. Wait until data backfilling to other OSDs is complete:

    ceph -s
    

    Once all of the PGs are active+clean, backfilling is complete and it is safe to remove the disk.

    Note

    For additional information on PGs backfilling, run ceph pg dump_stuck.

  9. Exit from the rook-ceph-tools pod:

    exit
    
  10. Scale the rook-ceph/rook-ceph-osd-<ID> deployment to 0 replicas:

    kubectl -n rook-ceph scale deploy rook-ceph-osd-<ID> --replicas 0
    
  11. Enter the rook-ceph-tools pod:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l \
    app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- bash
    
  12. Verify that the number of Ceph OSDs that are up and in has decreased by one daemon:

    ceph -s
    

    Example of system response:

    osd: 4 osds: 3 up (since 1h), 3 in (since 5s)
    
  13. Remove the Ceph OSD from the Ceph cluster:

    ceph osd purge <ID> --yes-i-really-mean-it
    
  14. Delete the Ceph OSD auth entry, if present. Otherwise, skip this step.

    ceph auth del osd.<ID>
    
  15. If you have removed the last Ceph OSD on the node and want to remove this node from the Ceph cluster, remove the CRUSH map entry:

    ceph osd crush remove <nodeName>
    

    Substitute <nodeName> with the name of the node where the removed Ceph OSD was placed.

  16. Verify that the failure domain within Ceph OSDs has been removed from the CRUSH map:

    ceph osd tree
    

    If you have removed the node, it will be removed from the CRUSH map.

  17. Exit from the rook-ceph-tools pod:

    exit
    
  18. Clean up the disk used by the removed Ceph OSD. For details, see official Rook documentation.

    Warning

    If you are using multiple Ceph OSDs per device or metadata device, make sure that you can clean up the entire disk. Otherwise, clean up only the logical volume partitions for the volume group by running lvremove <lvpartition_uuid> in any Ceph OSD pod that belongs to the same host as the removed Ceph OSD.

  19. Delete the rook-ceph/rook-ceph-osd-<ID> deployment previously scaled to 0 replicas:

    kubectl -n rook-ceph delete deploy rook-ceph-osd-<ID>
    

    Substitute <ID> with the number of the removed Ceph OSD.

  20. Scale the rook-ceph/rook-ceph-operator deployment to 1 replica and wait for the orchestration to complete:

    kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
    kubectl -n rook-ceph get pod -w
    

    Once done, Ceph OSD removal is complete.
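
The up and in checks in steps 4 and 12 of this procedure can be automated by parsing the osd line of the ceph -s output. The following is a minimal sketch, assuming the line has the format shown in the examples of system response above. The osd_counts and all_osds_up_and_in function names are hypothetical.

```shell
# osd_counts reads `ceph -s` output on stdin and prints "<total> <up> <in>".
osd_counts() {
  sed -n 's/.*osd: \([0-9][0-9]*\) osds: \([0-9][0-9]*\) up[^,]*, \([0-9][0-9]*\) in.*/\1 \2 \3/p'
}

# all_osds_up_and_in succeeds only when the total, up, and in counts are equal.
all_osds_up_and_in() {
  # Split the three counts into positional parameters.
  set -- $(osd_counts)
  [ "$#" -eq 3 ] && [ "$1" -eq "$2" ] && [ "$1" -eq "$3" ]
}

# Sample line taken from the example of system response in step 4.
if echo "osd: 4 osds: 4 up (since 10h), 4 in (since 10h)" | all_osds_up_and_in; then
  echo "all Ceph OSDs are up and in"
fi
```

In a live environment, pipe the output of ceph -s from the rook-ceph-tools pod into the same functions. The check covers only the OSD counters and does not replace the review of cluster health and placement group states.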

Migrate Ceph cluster to address storage devices using by-id

The by-id identifier is the only persistent device identifier for a Ceph cluster that remains stable after the cluster upgrade or any other maintenance. Therefore, Mirantis recommends using device by-id symlinks rather than device names or by-path symlinks.

Container Cloud uses the device by-id identifier as the default method of addressing the underlying devices of Ceph OSDs. Thus, you should migrate all existing Ceph clusters, which are still utilizing the device names or device by-path symlinks, to the by-id format.

This section explains how to configure the KaaSCephCluster specification to use the by-id symlinks instead of disk names and by-path identifiers as the default method of addressing storage devices.

Note

Mirantis recommends avoiding the use of wwn symlinks as by-id identifiers due to their lack of persistence expressed in inconsistent discovery during node boot.

Besides migrating to by-id, consider using the fullPath field instead of the name field for the by-id symlinks in the spec.cephClusterSpec.nodes.storageDevices section. This approach makes the purpose of each field clearer.

Note

MOSK enables you to use fullPath for the by-id symlinks since MCC 2.25.0 (Cluster release 17.0.0). For earlier product versions, use the name field instead.

Migrate the Ceph nodes section to by-id identifiers

Available since MCC 2.25.0 (Cluster release 17.0.0)

  1. Make sure that your managed cluster is not currently running an upgrade or any other maintenance process.

  2. Obtain the list of all KaaSCephCluster storage devices that use disk names or disk by-path as identifiers of Ceph node storage devices:

    kubectl -n <managedClusterProject> get kcc -o yaml
    

    Substitute <managedClusterProject> with the corresponding managed cluster namespace.

    Output example:

    spec:
      cephClusterSpec:
        nodes:
          ...
          managed-worker-1:
            storageDevices:
            - config:
                deviceClass: hdd
              name: sdc
            - config:
                deviceClass: hdd
              fullPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:2
          managed-worker-2:
            storageDevices:
            - config:
                deviceClass: hdd
              name: /dev/disk/by-id/wwn-0x26d546263bd312b8
            - config:
                deviceClass: hdd
              name: /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_2e52abb48862dsdc
          managed-worker-3:
            storageDevices:
            - config:
                deviceClass: nvme
              name: nvme3n1
            - config:
                deviceClass: hdd
              fullPath: /dev/disk/by-id/scsi-SATA_HGST_HUS724040AL_PN1334PEHN18ZS
    
  3. Verify the items from the storageDevices sections to be moved to the by-id symlinks. The list of the items to migrate includes:

    • A disk name in the name field. For example, sdc, nvme3n1, and so on.

    • A disk /dev/disk/by-path symlink in the fullPath field. For example, /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:2.

    • A disk /dev/disk/by-id symlink in the name field.

      Note

      This condition applies since MCC 2.25.0 (Cluster release 17.0.0).

    • A disk /dev/disk/by-id/wwn symlink, which is programmatically calculated at boot. For example, /dev/disk/by-id/wwn-0x26d546263bd312b8.

    For the example above, we have to migrate both items of managed-worker-1, both items of managed-worker-2, and the first item of managed-worker-3. The second item of managed-worker-3 has already been configured in the required format, therefore, we are leaving it as is.

  4. To migrate all affected storageDevices items to by-id symlinks, open the KaaSCephCluster custom resource for editing:

    kubectl -n <managedClusterProject> edit kcc
    
  5. For each affected node from the spec.cephClusterSpec.nodes section, obtain a corresponding status.providerStatus.hardware.storage section from the Machine custom resource:

    kubectl -n <managedClusterProject> get machine <machineName> -o yaml
    

    Substitute <managedClusterProject> with the corresponding cluster namespace and <machineName> with the machine name.

    Output example for managed-worker-1:

    status:
      providerStatus:
        hardware:
          storage:
          - byID: /dev/disk/by-id/wwn-0x05ad99618d66a21f
            byIDs:
            - /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_05ad99618d66a21f
            - /dev/disk/by-id/scsi-305ad99618d66a21f
            - /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_05ad99618d66a21f
            - /dev/disk/by-id/wwn-0x05ad99618d66a21f
            byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:0
            byPaths:
            - /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:0
            name: /dev/sda
            serialNumber: 05ad99618d66a21f
            size: 61
            type: hdd
          - byID: /dev/disk/by-id/wwn-0x26d546263bd312b8
            byIDs:
            - /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_26d546263bd312b8
            - /dev/disk/by-id/scsi-326d546263bd312b8
            - /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_26d546263bd312b8
            - /dev/disk/by-id/wwn-0x26d546263bd312b8
            byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:2
            byPaths:
            - /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:2
            name: /dev/sdb
            serialNumber: 26d546263bd312b8
            size: 32
            type: hdd
          - byID: /dev/disk/by-id/wwn-0x2e52abb48862dbdc
            byIDs:
            - /dev/disk/by-id/lvm-pv-uuid-MncrcO-6cel-0QsB-IKaY-e8UK-6gDy-k2hOtf
            - /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_2e52abb48862dbdc
            - /dev/disk/by-id/scsi-32e52abb48862dbdc
            - /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_2e52abb48862dbdc
            - /dev/disk/by-id/wwn-0x2e52abb48862dbdc
            byPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:1
            byPaths:
            - /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:1
            name: /dev/sdc
            serialNumber: 2e52abb48862dbdc
            size: 61
            type: hdd
    
  6. For each affected storageDevices item of the Machine in question, obtain a correct by-id symlink from status.providerStatus.hardware.storage.byIDs. The selected by-id symlink must contain the value of status.providerStatus.hardware.storage.serialNumber and must not contain wwn.

    For managed-worker-1, according to the example output above, we can use the following by-id symlinks:

    • Replace the first item of storageDevices that contains name: sdc with fullPath: /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_2e52abb48862dbdc.

    • Replace the second item of storageDevices that contains fullPath: /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:2 with fullPath: /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_26d546263bd312b8.

  7. Replace all affected storageDevices items in KaaSCephCluster with the obtained ones.

    Note

    Prior to MCC 2.25.0 (Cluster release 17.0.0), place the by-id symlinks in the name field instead of the fullPath field.

    The resulting example of the storage device identifier migration:

    spec:
      cephClusterSpec:
        nodes:
          ...
          managed-worker-1:
            storageDevices:
            - config:
                deviceClass: hdd
              fullPath: /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_2e52abb48862dbdc
            - config:
                deviceClass: hdd
              fullPath: /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_26d546263bd312b8
          managed-worker-2:
            storageDevices:
            - config:
                deviceClass: hdd
              fullPath: /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_031d9054c9b48f79
            - config:
                deviceClass: hdd
              fullPath: /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_2e52abb48862dsdc
          managed-worker-3:
            storageDevices:
            - config:
                deviceClass: nvme
              fullPath: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB3T8HMLA-00007_S46FNY0R394543
            - config:
                deviceClass: hdd
              fullPath: /dev/disk/by-id/scsi-SATA_HGST_HUS724040AL_PN1334PEHN18ZS
    
  8. Save and quit editing the KaaSCephCluster custom resource.

After the migration, re-orchestration occurs. The procedure should not cause any actual changes to the Ceph OSDs or to the overall Ceph cluster state.
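
The selection rule from step 6 can be sketched in Python as follows. This is a minimal illustration rather than a supported tool: the function names find_disk and by_id_candidates are hypothetical, and the input is assumed to be the status.providerStatus.hardware.storage list of a Machine loaded as dictionaries. The sketch only filters the byIDs entries by the two stated conditions. It does not rank the remaining candidates, whereas the example in the procedure picks the scsi-SQEMU_QEMU_HARDDISK_ variant.

```python
from typing import List, Optional


def find_disk(storage: List[dict], identifier: str) -> Optional[dict]:
    """Find the Machine storage entry matching a disk name or by-path symlink."""
    for disk in storage:
        known = {disk.get("name"), disk.get("byPath"), *disk.get("byPaths", [])}
        # A bare disk name such as "sdc" is reported by the Machine as "/dev/sdc".
        if identifier in known or f"/dev/{identifier}" in known:
            return disk
    return None


def by_id_candidates(disk: dict) -> List[str]:
    """Return by-id symlinks that contain the serial number and are not wwn-based."""
    serial = disk["serialNumber"]
    return [
        link for link in disk.get("byIDs", [])
        if serial in link and "wwn" not in link
    ]


# Trimmed copy of the third disk from the example Machine output above.
storage = [
    {
        "byIDs": [
            "/dev/disk/by-id/lvm-pv-uuid-MncrcO-6cel-0QsB-IKaY-e8UK-6gDy-k2hOtf",
            "/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_2e52abb48862dbdc",
            "/dev/disk/by-id/scsi-32e52abb48862dbdc",
            "/dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_2e52abb48862dbdc",
            "/dev/disk/by-id/wwn-0x2e52abb48862dbdc",
        ],
        "byPath": "/dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:1",
        "byPaths": ["/dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:1"],
        "name": "/dev/sdc",
        "serialNumber": "2e52abb48862dbdc",
    },
]

disk = find_disk(storage, "sdc")
if disk is not None:
    for link in by_id_candidates(disk):
        print(link)
```

For the sdc disk, the lvm-pv-uuid entry is dropped because it does not contain the serial number, and the wwn entry is dropped by the second condition.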

Migrate the Ceph nodeGroups section to by-id identifiers

Available since MCC 2.25.0 (Cluster release 17.0.0)

Besides the nodes section, your cluster may contain the nodeGroups section specified with disk names instead of by-id symlinks. Unlike the in-place replacement of storage device identifiers in the nodes section, nodeGroups requires a different approach because the same spec section applies to several nodes.

To migrate the nodeGroups storage devices, use the deviceLabels section to assign the same labels to the corresponding disks of different nodes, and then use these labels in the node groups. For the deviceLabels section specification, refer to Ceph advanced configuration: extraOpts.

The following procedure describes how to keep the nodeGroups section but use unique by-id identifiers instead of disk names.

To migrate the Ceph nodeGroups section to by-id identifiers:

  1. Make sure that your managed cluster is not currently running an upgrade or any other maintenance process.

  2. Obtain the list of all KaaSCephCluster storage devices that use disk names or by-path symlinks as identifiers of Ceph node group storage devices:

    kubectl -n <managedClusterProject> get kcc -o yaml
    

    Substitute <managedClusterProject> with the corresponding managed cluster namespace.

    Output example of the KaaSCephCluster nodeGroups section with disk names used as identifiers:

    spec:
      cephClusterSpec:
        nodeGroups:
          ...
          rack-1:
            nodes:
            - node-1
            - node-2
            spec:
              crush:
                rack: "rack-1"
              storageDevices:
              - name: nvme0n1
                config:
                  deviceClass: nvme
              - name: nvme1n1
                config:
                  deviceClass: nvme
              - name: nvme2n1
                config:
                  deviceClass: nvme
          rack-2:
            nodes:
            - node-3
            - node-4
            spec:
              crush:
                rack: "rack-2"
              storageDevices:
              - name: nvme0n1
                config:
                  deviceClass: nvme
              - name: nvme1n1
                config:
                  deviceClass: nvme
              - name: nvme2n1
                config:
                  deviceClass: nvme
          rack-3:
            nodes:
            - node-5
            - node-6
            spec:
              crush:
                rack: "rack-3"
              storageDevices:
              - name: nvme0n1
                config:
                  deviceClass: nvme
              - name: nvme1n1
                config:
                  deviceClass: nvme
              - name: nvme2n1
                config:
                  deviceClass: nvme
    
  3. Verify the items from the storageDevices sections to be moved to by-id symlinks. The list of the items to migrate includes:

    • A disk name in the name field. For example, sdc, nvme3n1, and so on.

    • A disk /dev/disk/by-path symlink in the fullPath field. For example, /dev/disk/by-path/pci-0000:00:05.0-scsi-0:0:0:2.

    • A disk /dev/disk/by-id symlink in the name field.

      Note

      This condition applies since MCC 2.25.0 (Cluster release 17.0.0).

    • A disk /dev/disk/by-id/wwn symlink, which is programmatically calculated at boot. For example, /dev/disk/by-id/wwn-0x26d546263bd312b8.

    All storageDevices sections in the example above contain disk names in the name field. Therefore, you need to replace them with by-id symlinks.

  4. Open the KaaSCephCluster custom resource for editing to start migration of all affected storageDevices items to by-id symlinks:

    kubectl -n <managedClusterProject> edit kcc
    
  5. For each affected Ceph node group in the nodeGroups section, add disk labels to the deviceLabels section for every affected storage device of each node listed in the nodes field of that node group. Make sure that each disk label points to the by-id symlink of the corresponding disk.

    For example, if the node group rack-1 contains two nodes, node-1 and node-2, and its spec contains three items with name, obtain the proper by-id symlinks for these disk names on both nodes and record them under the same disk labels. The following example contains the labels for the by-id symlinks of the nvme0n1, nvme1n1, and nvme2n1 disks on node-1 and node-2, respectively:

    spec:
      cephClusterSpec:
        extraOpts:
          deviceLabels:
            node-1:
              nvme-1: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB3T8HMLA-00007_S46FNY0R394543
              nvme-2: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB3T8HMLA-00007_S46FNY0R372150
              nvme-3: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB3T8HMLA-00007_S46FNY0R183266
            node-2:
              nvme-1: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB4040ALR-00007_S46FNY0R900128
              nvme-2: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB4040ALR-00007_S46FNY0R805840
              nvme-3: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB4040ALR-00007_S46FNY0R848469
    

    Note

    Keep device labels identical across all nodes of a node group. This allows a single unified spec to address the different by-id symlinks of different nodes.

    Example of the full deviceLabels section for the nodeGroups section:

    spec:
      cephClusterSpec:
        extraOpts:
          deviceLabels:
            node-1:
              nvme-1: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB3T8HMLA-00007_S46FNY0R394543
              nvme-2: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB3T8HMLA-00007_S46FNY0R372150
              nvme-3: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB3T8HMLA-00007_S46FNY0R183266
            node-2:
              nvme-1: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB4040ALR-00007_S46FNY0R900128
              nvme-2: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB4040ALR-00007_S46FNY0R805840
              nvme-3: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB4040ALR-00007_S46FNY0R848469
            node-3:
              nvme-1: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB00T2B0A-00007_S46FNY0R900128
              nvme-2: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB00T2B0A-00007_S46FNY0R805840
              nvme-3: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB00T2B0A-00007_S46FNY0R848469
            node-4:
              nvme-1: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB00Z4SA0-00007_S46FNY0R286212
              nvme-2: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB00Z4SA0-00007_S46FNY0R350024
              nvme-3: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB00Z4SA0-00007_S46FNY0R300756
            node-5:
              nvme-1: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB8UK0QBD-00007_S46FNY0R577024
              nvme-2: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB8UK0QBD-00007_S46FNY0R718411
              nvme-3: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB8UK0QBD-00007_S46FNY0R831424
            node-6:
              nvme-1: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB01DAU34-00007_S46FNY0R908440
              nvme-2: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB01DAU34-00007_S46FNY0R945405
              nvme-3: /dev/disk/by-id/nvme-SAMSUNG_MZ1LB01DAU34-00007_S46FNY0R224911
    
  6. For each affected node group in the nodeGroups section, replace the field that contains the outdated disk identifier with the devLabel field that contains the corresponding disk label from the deviceLabels section.

    For the example above, the updated nodeGroups section looks as follows:

    spec:
      cephClusterSpec:
        nodeGroups:
          ...
          rack-1:
            nodes:
            - node-1
            - node-2
            spec:
              crush:
                rack: "rack-1"
              storageDevices:
              - devLabel: nvme-1
                config:
                  deviceClass: nvme
              - devLabel: nvme-2
                config:
                  deviceClass: nvme
              - devLabel: nvme-3
                config:
                  deviceClass: nvme
          rack-2:
            nodes:
            - node-3
            - node-4
            spec:
              crush:
                rack: "rack-2"
              storageDevices:
              - devLabel: nvme-1
                config:
                  deviceClass: nvme
              - devLabel: nvme-2
                config:
                  deviceClass: nvme
              - devLabel: nvme-3
                config:
                  deviceClass: nvme
          rack-3:
            nodes:
            - node-5
            - node-6
            spec:
              crush:
                rack: "rack-3"
              storageDevices:
              - devLabel: nvme-1
                config:
                  deviceClass: nvme
              - devLabel: nvme-2
                config:
                  deviceClass: nvme
              - devLabel: nvme-3
                config:
                  deviceClass: nvme
    
  7. Save and quit editing the KaaSCephCluster custom resource.

After the migration, re-orchestration occurs. The procedure should not cause any actual changes to the Ceph OSDs or to the overall Ceph cluster state.
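
Because every node of a node group must define every label that the shared spec refers to, it can help to cross-check the two sections before saving. The following Python sketch shows one possible check. The function name missing_labels is hypothetical, and the inputs are assumed to be the nodeGroups and extraOpts.deviceLabels sections loaded as dictionaries.

```python
from typing import Dict, List


def missing_labels(node_groups: Dict[str, dict],
                   device_labels: Dict[str, Dict[str, str]]) -> List[str]:
    """List every group/node/label combination that deviceLabels does not cover."""
    problems = []
    for group_name, group in node_groups.items():
        wanted = [
            dev["devLabel"]
            for dev in group["spec"].get("storageDevices", [])
            if "devLabel" in dev
        ]
        for node in group.get("nodes", []):
            labels = device_labels.get(node, {})
            for label in wanted:
                path = labels.get(label, "")
                # A label counts as covered only if it maps to a by-id symlink.
                if not path.startswith("/dev/disk/by-id/"):
                    problems.append(f"{group_name}/{node}: {label}")
    return problems


node_groups = {
    "rack-1": {
        "nodes": ["node-1", "node-2"],
        "spec": {
            "storageDevices": [
                {"devLabel": "nvme-1", "config": {"deviceClass": "nvme"}},
                {"devLabel": "nvme-2", "config": {"deviceClass": "nvme"}},
            ],
        },
    },
}
device_labels = {
    "node-1": {
        "nvme-1": "/dev/disk/by-id/nvme-SAMSUNG_MZ1LB3T8HMLA-00007_S46FNY0R394543",
        "nvme-2": "/dev/disk/by-id/nvme-SAMSUNG_MZ1LB3T8HMLA-00007_S46FNY0R372150",
    },
    # node-2 deliberately lacks nvme-2 to show what the check reports.
    "node-2": {
        "nvme-1": "/dev/disk/by-id/nvme-SAMSUNG_MZ1LB4040ALR-00007_S46FNY0R900128",
    },
}
print(missing_labels(node_groups, device_labels))
```

An empty list means that each node in each group resolves all labels used by the group spec to by-id symlinks.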

Increase Ceph cluster storage size

This section describes how to increase the overall storage size for all Ceph pools of the same device class: hdd, ssd, or nvme. The procedure presupposes adding a new Ceph OSD. The overall storage size for the required device class automatically increases once the Ceph OSD becomes available in the Ceph cluster.

To increase the overall storage size for a device class:

  1. Identify the current storage size for the required device class:

    kubectl --kubeconfig <managedClusterKubeconfig> -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
    

    Substitute <managedClusterKubeconfig> with a managed cluster kubeconfig.

    Example of system response:

    --- RAW STORAGE ---
    CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
    hdd    128 GiB  101 GiB  23 GiB    27 GiB      21.40
    TOTAL  128 GiB  101 GiB  23 GiB    27 GiB      21.40
    
    --- POOLS ---
    POOL                   ID  PGS  STORED  OBJECTS  USED    %USED  MAX AVAIL
    device_health_metrics   1    1     0 B        0     0 B      0     30 GiB
    kubernetes-hdd          2   32  12 GiB    3.13k  23 GiB  20.57     45 GiB
    
  2. Identify the number of Ceph OSDs with the required device class:

    kubectl --kubeconfig <managedClusterKubeconfig> -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd df <deviceClass>
    

    Substitute the following parameters:

    • <managedClusterKubeconfig> with a managed cluster kubeconfig

    • <deviceClass> with the required device class: hdd, ssd, or nvme

    Example of system response for the hdd device class:

    ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP      META      AVAIL    %USE   VAR   PGS  STATUS
     1    hdd  0.03119   1.00000   32 GiB  5.8 GiB  4.8 GiB   1.5 MiB  1023 MiB   26 GiB  18.22  0.85   14      up
     3    hdd  0.03119   1.00000   32 GiB  6.9 GiB  5.9 GiB   1.1 MiB  1023 MiB   25 GiB  21.64  1.01   17      up
     0    hdd  0.03119   0.84999   32 GiB  6.8 GiB  5.8 GiB  1013 KiB  1023 MiB   25 GiB  21.24  0.99   16      up
     2    hdd  0.03119   1.00000   32 GiB  7.9 GiB  6.9 GiB   1.2 MiB  1023 MiB   24 GiB  24.55  1.15   20      up
                           TOTAL  128 GiB   27 GiB   23 GiB   4.8 MiB   4.0 GiB  101 GiB  21.41
    MIN/MAX VAR: 0.85/1.15  STDDEV: 2.29
    
  3. Follow Add a Ceph OSD on a managed cluster to add a new device with a supported device class: hdd, ssd, or nvme.

  4. Wait for the new Ceph OSD pod to start Running:

    kubectl --kubeconfig <managedClusterKubeconfig> -n rook-ceph get pod -l app=rook-ceph-osd
    

    Substitute <managedClusterKubeconfig> with a managed cluster kubeconfig.

  5. Verify that the new Ceph OSD has rebalanced and Ceph health is HEALTH_OK:

    kubectl --kubeconfig <managedClusterKubeconfig> -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s
    

    Substitute <managedClusterKubeconfig> with a managed cluster kubeconfig.

  6. Verify that the new Ceph OSD has been added to the list of device class OSDs:

    kubectl --kubeconfig <managedClusterKubeconfig> -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd df <deviceClass>
    

    Substitute the following parameters:

    • <managedClusterKubeconfig> with a managed cluster kubeconfig

    • <deviceClass> with the required device class: hdd, ssd, or nvme

    Example of system response for the hdd device class after adding a new Ceph OSD:

    ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP      META      AVAIL    %USE   VAR   PGS  STATUS
     1    hdd  0.03119   1.00000   32 GiB  4.5 GiB  3.5 GiB   1.5 MiB  1023 MiB   28 GiB  13.93  0.78   10      up
     3    hdd  0.03119   1.00000   32 GiB  5.5 GiB  4.5 GiB   1.1 MiB  1023 MiB   26 GiB  17.22  0.96   13      up
     0    hdd  0.03119   0.84999   32 GiB  6.5 GiB  5.5 GiB  1013 KiB  1023 MiB   25 GiB  20.32  1.14   15      up
     2    hdd  0.03119   1.00000   32 GiB  7.5 GiB  6.5 GiB   1.2 MiB  1023 MiB   24 GiB  23.43  1.31   19      up
     4    hdd  0.03119   1.00000   32 GiB  4.6 GiB  3.6 GiB       0 B     1 GiB   27 GiB  14.45  0.81   10      up
                           TOTAL  160 GiB   29 GiB   24 GiB   4.8 MiB   5.0 GiB  131 GiB  17.87
    MIN/MAX VAR: 0.78/1.31  STDDEV: 3.62
    
  7. Verify that the total storage capacity has increased for the entire device class:

    kubectl --kubeconfig <managedClusterKubeconfig> -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
    

    Substitute <managedClusterKubeconfig> with a managed cluster kubeconfig.

    Example of system response:

    --- RAW STORAGE ---
    CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
    hdd    160 GiB  131 GiB  24 GiB    29 GiB      17.97
    TOTAL  160 GiB  131 GiB  24 GiB    29 GiB      17.97
    
    --- POOLS ---
    POOL                   ID  PGS  STORED  OBJECTS  USED    %USED  MAX AVAIL
    device_health_metrics   1    1     0 B        0     0 B      0     38 GiB
    kubernetes-hdd          2   32  12 GiB    3.18k  24 GiB  17.17     57 GiB
    
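
To compare the capacity before and after the procedure without reading the tables by eye, the RAW STORAGE part of the ceph df response can be parsed with a few lines of Python. This is a sketch under stated assumptions: the function name raw_sizes_gib is hypothetical, and the parser relies on the plain-text layout shown in the examples above, which may differ between Ceph versions.

```python
from typing import Dict

# Binary unit multipliers, expressed in GiB.
_UNITS_IN_GIB = {
    "B": 1 / 1024 ** 3, "KiB": 1 / 1024 ** 2, "MiB": 1 / 1024,
    "GiB": 1.0, "TiB": 1024.0, "PiB": 1024.0 ** 2,
}


def raw_sizes_gib(ceph_df_output: str) -> Dict[str, float]:
    """Return the SIZE column of the RAW STORAGE table, keyed by device class."""
    sizes = {}
    in_raw_section = False
    for line in ceph_df_output.splitlines():
        fields = line.split()
        if line.strip().startswith("---"):
            in_raw_section = "RAW STORAGE" in line
            continue
        # Skip blank lines, the column header, and the TOTAL row.
        if not in_raw_section or len(fields) < 3 or fields[0] in ("CLASS", "TOTAL"):
            continue
        device_class, value, unit = fields[0], fields[1], fields[2]
        sizes[device_class] = float(value) * _UNITS_IN_GIB[unit]
    return sizes


before = """\
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
hdd    128 GiB  101 GiB  23 GiB    27 GiB      21.40
TOTAL  128 GiB  101 GiB  23 GiB    27 GiB      21.40

--- POOLS ---
POOL                   ID  PGS  STORED  OBJECTS  USED    %USED  MAX AVAIL
kubernetes-hdd          2   32  12 GiB    3.13k  23 GiB  20.57     45 GiB
"""
# Stand-in for the response after adding one 32 GiB Ceph OSD.
after = before.replace("128 GiB", "160 GiB")

growth = raw_sizes_gib(after)["hdd"] - raw_sizes_gib(before)["hdd"]
print(f"hdd raw capacity grew by {growth:g} GiB")
```

In the example responses, the hdd class grows from 128 GiB to 160 GiB, which matches the size of the single 32 GiB Ceph OSD that was added.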
Move a Ceph Monitor daemon to another node

This document describes how to migrate a Ceph Monitor daemon from one node to another without changing the total number of Ceph Monitors in the cluster. In the Ceph Controller concept, migration of a Ceph Monitor means manually removing it from one node and adding it to another.

Consider the following example placement scheme of Ceph Monitors in the nodes spec of the KaaSCephCluster CR:

nodes:
  node-1:
    roles:
    - mon
    - mgr
  node-2:
    roles:
    - mgr

Using the example above, if you want to move the Ceph Monitor from node-1 to node-2 without changing the number of Ceph Monitors, the roles table of the nodes spec must look as follows:

nodes:
  node-1:
    roles:
    - mgr
  node-2:
    roles:
    - mgr
    - mon
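
The two snippets above can also be compared programmatically to confirm that the number of mon roles stays the same and to note which node loses the role. The following Python sketch is only an illustration. The function names mon_nodes and describe_mon_move are hypothetical, and the nodes specs are assumed to be loaded as dictionaries.

```python
from typing import Dict, List, Set


def mon_nodes(nodes: Dict[str, dict]) -> Set[str]:
    """Return the names of nodes that carry the mon role."""
    return {name for name, spec in nodes.items() if "mon" in spec.get("roles", [])}


def describe_mon_move(old: Dict[str, dict], new: Dict[str, dict]) -> Dict[str, List[str]]:
    """Compare two nodes specs; fail if the number of mon roles changed."""
    before, after = mon_nodes(old), mon_nodes(new)
    if len(before) != len(after):
        raise ValueError(f"mon count changed from {len(before)} to {len(after)}")
    return {
        "removed_from": sorted(before - after),
        "added_to": sorted(after - before),
    }


old_nodes = {
    "node-1": {"roles": ["mon", "mgr"]},
    "node-2": {"roles": ["mgr"]},
}
new_nodes = {
    "node-1": {"roles": ["mgr"]},
    "node-2": {"roles": ["mgr", "mon"]},
}
print(describe_mon_move(old_nodes, new_nodes))
```

The removed_from list corresponds to the obsolete node whose name is needed later when looking up the rook-ceph-mon deployment.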

However, due to a Rook limitation related to the Kubernetes architecture, the changes do not apply automatically once you move the Ceph Monitor through the KaaSCephCluster CR. This is caused by the following Rook behavior:

  • Rook creates Ceph Monitor resources as deployments with nodeSelector, which binds Ceph Monitor pods to a requested node.

  • Rook does not recreate new Ceph Monitors with the new node placement if the current mon quorum works.

Therefore, to move a Ceph Monitor to another node, you must also manually apply the new Ceph Monitors placement to the Ceph cluster as described below.

To move a Ceph Monitor to another node:

  1. Open the KaaSCephCluster CR of a managed cluster:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  2. In the nodes spec of the KaaSCephCluster CR, change the mon roles placement without changing the total number of mon roles. For details, see the example above. Note the nodes on which the mon roles have been removed.

  3. Wait until the corresponding MiraCeph resource is updated with the new nodes spec:

    kubectl --kubeconfig <kubeconfig> -n ceph-lcm-mirantis get miraceph -o yaml
    

    Substitute <kubeconfig> with the Container Cloud cluster kubeconfig that hosts the required Ceph cluster.

  4. In the MiraCeph resource, determine which node has been changed in the nodes spec. Save the name value of the node where the mon role has been removed for further usage.

    kubectl -n <managedClusterProjectName> get machine -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeRef.name}{"\n"}{end}'
    

    Substitute <managedClusterProjectName> with the corresponding value.

  5. If you perform a managed cluster update, follow additional steps:

    1. Verify that the following conditions are met before proceeding to the next step:

      • There are at least 2 running and available Ceph Monitors so that the Ceph cluster is accessible during the Ceph Monitor migration:

        kubectl -n rook-ceph get pod -l app=rook-ceph-mon
        kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph -s
        
      • The MiraCeph object on the managed cluster has the required node with the mon role added in the nodes section of spec:

        kubectl -n ceph-lcm-mirantis get miraceph -o yaml
        
      • The Ceph NodeWorkloadLock for the required node is created:

        kubectl --kubeconfig child-kubeconfig get nodeworkloadlock -o jsonpath='{range .items[?(@.spec.nodeName == "<desiredNodeName>")]}{@.metadata.name}{"\n"}{end}' | grep ceph
        
    2. Scale the ceph-maintenance-controller deployment to 0 replicas:

      kubectl -n ceph-lcm-mirantis scale deploy ceph-maintenance-controller --replicas 0
      
    3. Manually edit the managed cluster node labels: remove the ceph_role_mon label from the obsolete node and add this label to the new node:

      kubectl label node <obsoleteNodeName> ceph_role_mon-
      kubectl label node <newNodeName> ceph_role_mon=true
      
    4. Make sure that the rook-ceph-operator deployment is scaled to 0 replicas:

      kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
      
  6. Obtain the rook-ceph-mon deployment name placed on the obsolete node using the previously obtained node name:

    kubectl -n rook-ceph get deploy -l app=rook-ceph-mon -o jsonpath="{.items[?(@.spec.template.spec.nodeSelector['kubernetes\.io/hostname'] == '<nodeName>')].metadata.name}"
    

    Substitute <nodeName> with the name of the node where you removed the mon role.

  7. Back up the rook-ceph-mon deployment placed on the obsolete node:

    kubectl -n rook-ceph get deploy <rook-ceph-mon-name> -o yaml > <rook-ceph-mon-name>-backup.yaml
    
  8. Remove the rook-ceph-mon deployment placed on the obsolete node:

    kubectl -n rook-ceph delete deploy <rook-ceph-mon-name>
    
  9. If you perform a managed cluster update, follow additional steps:

    1. Enter the ceph-tools pod:

      kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
      
    2. Remove the Ceph Monitor from the Ceph monmap by letter:

      ceph mon rm <monLetter>
      

      Substitute <monLetter> with the old Ceph Monitor letter. For example, mon-b has the letter b.

    3. Verify that the Ceph cluster does not have any information about the removed Ceph Monitor:

      ceph mon dump
      ceph -s
      
    4. Exit the ceph-tools pod.

    5. Scale up the rook-ceph-operator deployment to 1 replica:

      kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
      
    6. Wait for the missing Ceph Monitor failover process to start:

      kubectl -n rook-ceph logs -l app=rook-ceph-operator -f
      

      Example of log extract:

      2024-03-01 12:33:08.741215 W | op-mon: mon b NOT found in ceph mon map, failover
      2024-03-01 12:33:08.741244 I | op-mon: marking mon "b" out of quorum
      ...
      2024-03-01 12:33:08.766822 I | op-mon: Failing over monitor "b"
      2024-03-01 12:33:08.766881 I | op-mon: starting new mon...
      
  10. Select one of the following options:

    • If you do not perform a managed cluster update, wait approximately 10 minutes until rook-ceph-operator performs a failover of the Pending mon pod. Inspect the logs during the failover process:

      kubectl -n rook-ceph logs -l app=rook-ceph-operator -f

      Example of log extract:

      2021-03-15 17:48:23.471978 W | op-mon: mon "a" not found in quorum, waiting for timeout (554 seconds left) before failover

      Note

      If the failover process fails:

      1. Scale down the rook-ceph-operator deployment to 0 replicas.

      2. Apply the backed-up rook-ceph-mon deployment.

      3. Scale back the rook-ceph-operator deployment to 1 replica.

    • If you perform a managed cluster update:

      1. Scale the rook-ceph-operator deployment to 0 replicas:

        kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0

      2. Scale the ceph-maintenance-controller deployment to 3 replicas:

        kubectl -n ceph-lcm-mirantis scale deploy ceph-maintenance-controller --replicas 3
      

Once done, Rook removes the obsolete Ceph Monitor from the node and creates a new one on the specified node with a new letter. For example, if the a, b, and c Ceph Monitors were in quorum and mon-c was obsolete, Rook removes mon-c and creates mon-d. In this case, the new quorum includes the a, b, and d Ceph Monitors.

Migrate a Ceph Monitor before machine replacement

Available since MCC 2.25.0 (Cluster release 17.0.0)

This document describes how to migrate a Ceph Monitor to another machine on baremetal-based clusters before node replacement as described in Delete a cluster machine using web UI.

Warning

  • Remove the Ceph Monitor role before the machine removal.

  • Make sure that the Ceph cluster always has an odd number of Ceph Monitors.

The procedure of a Ceph Monitor migration assumes that you temporarily move the Ceph Manager/Monitor to a worker machine. After a node replacement, we recommend migrating the Ceph Manager/Monitor to the new manager machine.

To migrate a Ceph Monitor to another machine:

  1. Move the Ceph Manager/Monitor daemon from the affected machine to one of the worker machines as described in Move a Ceph Monitor daemon to another node.

  2. Delete the affected machine as described in Delete a cluster machine.

  3. Add a new manager machine without the Monitor and Manager roles as described in Create a machine using CLI.

    Warning

    The addition of a new machine with the Monitor and Manager roles breaks the odd number quorum of Ceph Monitors.

  4. Move the previously migrated Ceph Manager/Monitor daemon to the new manager machine as described in Move a Ceph Monitor daemon to another node.
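
The odd-number requirement can be traced through the four steps with a small Python sketch. The cluster layout and machine names below are hypothetical, as are the helper names mon_count and check_odd. The point is only to show that the number of Ceph Monitors stays odd after every step when the replacement machine is added without the mon role.

```python
from typing import Dict, List


def mon_count(machines: Dict[str, List[str]]) -> int:
    """Count machines that carry the mon role."""
    return sum("mon" in roles for roles in machines.values())


def check_odd(step: str, machines: Dict[str, List[str]]) -> None:
    """Raise if the number of Ceph Monitors is even; otherwise print it."""
    count = mon_count(machines)
    if count % 2 == 0:
        raise ValueError(f"{step}: {count} Ceph Monitors, expected an odd number")
    print(f"{step}: {count} Ceph Monitors")


# Hypothetical cluster: three manager machines with mon/mgr and one worker.
machines = {
    "manager-1": ["mon", "mgr"],
    "manager-2": ["mon", "mgr"],
    "manager-3": ["mon", "mgr"],
    "worker-1": [],
}

# Step 1: move the daemons from the affected machine to a worker.
machines["manager-1"], machines["worker-1"] = [], ["mon", "mgr"]
check_odd("after step 1", machines)

# Step 2: delete the affected machine.
del machines["manager-1"]
check_odd("after step 2", machines)

# Step 3: add the replacement machine without the mon and mgr roles.
machines["manager-4"] = []
check_odd("after step 3", machines)

# Step 4: move the daemons from the worker to the new manager machine.
machines["worker-1"], machines["manager-4"] = [], ["mon", "mgr"]
check_odd("after step 4", machines)
```

If the replacement machine were added with the mon role in step 3, the count would become four and the check would fail, which is the situation the warning in step 3 describes.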

Enable Ceph RGW Object Storage

Ceph Controller enables you to deploy RADOS Gateway (RGW) Object Storage instances and automatically manage their resources, such as users and buckets. In MOSK, Ceph Object Storage is integrated with OpenStack Object Storage (Swift).

To enable the RGW Object Storage:

  1. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with a corresponding value.

  2. Using the following table, update the cephClusterSpec.objectStorage.rgw section specification as required:

    Caution

    Since MCC 2.24.0 (Cluster releases 15.0.1 and 14.0.1), explicitly specify the deviceClass parameter for dataPool and metadataPool.

    Warning

    Since Container Cloud 2.6.0, the spec.rgw section is deprecated and its parameters are moved under objectStorage.rgw. If you continue using spec.rgw, it is automatically translated into objectStorage.rgw during the Container Cloud update to 2.6.0.

    We strongly recommend changing spec.rgw to objectStorage.rgw in all KaaSCephCluster CRs before spec.rgw becomes unsupported and is deleted.

    RADOS Gateway parameters

    Parameter

    Description

    name

    Ceph Object Storage instance name.

    dataPool

    Mutually exclusive with the zone parameter. Object storage data pool spec that should only contain replicated or erasureCoded and failureDomain parameters. The failureDomain parameter may be set to osd or host, defining the failure domain across which the data will be spread. For dataPool, Mirantis recommends using an erasureCoded pool. For details, see Rook documentation: Erasure coding. For example:

    cephClusterSpec:
      objectStorage:
        rgw:
          dataPool:
            erasureCoded:
              codingChunks: 1
              dataChunks: 2
    

    metadataPool

    Mutually exclusive with the zone parameter. Object storage metadata pool spec that should only contain the replicated and failureDomain parameters. The metadata pool supports only replicated settings. The failureDomain parameter may be set to osd or host, defining the failure domain across which the data will be spread. For example:

    cephClusterSpec:
      objectStorage:
        rgw:
          metadataPool:
            replicated:
              size: 3
            failureDomain: host
    

    where replicated.size is the number of full copies of data on multiple nodes.

    Warning

    If you use the non-recommended replicated.size of less than 3 for Ceph pools, Ceph OSD removal cannot be performed. The minimal replica size equals half of the specified replicated.size, rounded up.

    For example, if replicated.size is 2, the minimal replica size is 1, and if replicated.size is 3, the minimal replica size is 2. The replica size of 1 allows Ceph to have PGs with only one Ceph OSD in the acting state, which may cause a PG_TOO_DEGRADED health warning that blocks Ceph OSD removal. Mirantis recommends setting replicated.size to 3 for each Ceph pool.

    gateway

    The gateway settings corresponding to the rgw daemon settings. Includes the following parameters:

    • port - the port on which the Ceph RGW service will be listening on HTTP.

    • securePort - the port on which the Ceph RGW service will be listening on HTTPS.

    • instances - the number of pods in the Ceph RGW ReplicaSet. If allNodes is set to true, a DaemonSet is created instead.

      Note

      Mirantis recommends using 2 instances for Ceph Object Storage.

    • allNodes - defines whether to start the Ceph RGW pods as a DaemonSet on all nodes. The instances parameter is ignored if allNodes is set to true.

    For example:

    cephClusterSpec:
      objectStorage:
        rgw:
          gateway:
            allNodes: false
            instances: 1
            port: 80
            securePort: 8443
    

    preservePoolsOnDelete

    Defines whether to delete the data and metadata pools in the rgw section if the object storage is deleted. Set this parameter to true if you need to store data even if the object storage is deleted. However, Mirantis recommends setting this parameter to false.

    objectUsers and buckets

    Optional. To create new Ceph RGW resources, such as buckets or users, specify the following keys. Ceph Controller will automatically create the specified object storage users and buckets in the Ceph cluster.

    • objectUsers - a list of user specifications to create for object storage. Contains the following fields:

      • name - a user name to create.

      • displayName - the Ceph user name to display.

      • capabilities - user capabilities:

        • user - admin capabilities to read/write Ceph Object Store users.

        • bucket - admin capabilities to read/write Ceph Object Store buckets.

        • metadata - admin capabilities to read/write Ceph Object Store metadata.

        • usage - admin capabilities to read/write Ceph Object Store usage.

        • zone - admin capabilities to read/write Ceph Object Store zones.

        The available options are: *, read, write, or the combined "read, write". For details, see Ceph documentation: Add/remove admin capabilities.

      • quotas - user quotas:

        • maxBuckets - the maximum bucket limit for the Ceph user. Integer, for example, 10.

        • maxSize - the maximum size limit of all objects across all the buckets of a user. String size, for example, 10G.

        • maxObjects - the maximum number of objects across all buckets of a user. Integer, for example, 10.

        For example:

        objectUsers:
        - capabilities:
            bucket: '*'
            metadata: read
            user: read
          displayName: test-user
          name: test-user
          quotas:
            maxBuckets: 10
            maxSize: 10G
        
    • users - a list of strings that contain user names to create for object storage.

      Note

      This field is deprecated. Use objectUsers instead. If users is specified, it will be automatically transformed to the objectUsers section.

    • buckets - a list of strings that contain bucket names to create for object storage.

    zone

    Optional. Mutually exclusive with metadataPool and dataPool. Defines the Ceph Multisite zone where the object storage must be placed. Includes the name parameter that must be set to one of the zones items. For details, see Enable multisite for Ceph RGW Object Storage.

    For example:

    cephClusterSpec:
      objectStorage:
        multisite:
          zones:
          - name: master-zone
          ...
        rgw:
          zone:
            name: master-zone
    

    SSLCert

    Optional. Custom TLS certificate parameters used to access the Ceph RGW endpoint. If not specified, a self-signed certificate will be generated.

    For example:

    cephClusterSpec:
      objectStorage:
        rgw:
          SSLCert:
            cacert: |
              -----BEGIN CERTIFICATE-----
              ca-certificate here
              -----END CERTIFICATE-----
            tlsCert: |
              -----BEGIN CERTIFICATE-----
              private TLS certificate here
              -----END CERTIFICATE-----
            tlsKey: |
              -----BEGIN RSA PRIVATE KEY-----
              private TLS key here
              -----END RSA PRIVATE KEY-----
    

    SSLCertInRef

    Optional. Available since MOSK 25.1. Flag to determine that a TLS certificate for accessing the Ceph RGW endpoint is used but not exposed in spec. For example:

    cephClusterSpec:
      objectStorage:
        rgw:
          SSLCertInRef: true
    

    The operator must manually provide TLS configuration using the rgw-ssl-certificate secret in the rook-ceph namespace of the managed cluster. The secret object must have the following structure:

    data:
      cacert: <base64encodedCaCertificate>
      cert: <base64encodedCertificate>
    

    When removing an already existing SSLCert block, no additional actions are required, because this block uses the same rgw-ssl-certificate secret in the rook-ceph namespace.

    When adding a new secret directly without exposing it in spec, the following rules apply:

    • cert - base64 representation of a file with the server TLS key, server TLS cert, and cacert.

    • cacert - base64 representation of a cacert only.

    For example:

    cephClusterSpec:
      objectStorage:
        rgw:
          name: rgw-store
          dataPool:
            deviceClass: hdd
            erasureCoded:
              codingChunks: 1
              dataChunks: 2
            failureDomain: host
          metadataPool:
            deviceClass: hdd
            failureDomain: host
            replicated:
              size: 3
          gateway:
            allNodes: false
            instances: 1
            port: 80
            securePort: 8443
          preservePoolsOnDelete: false
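
For the SSLCertInRef case described earlier in this section, the data fields of the rgw-ssl-certificate secret could be prepared as in the following sketch. It is an assumption that the bundle in cert is concatenated in the order listed above (server TLS key, server TLS certificate, CA certificate). The function name and the placeholder PEM strings are hypothetical.

```python
import base64


def build_rgw_ssl_secret_data(tls_key_pem: str, tls_cert_pem: str, ca_cert_pem: str) -> dict:
    """Build the data section of the secret (hypothetical helper).

    Assumption: the cert bundle is key + certificate + CA certificate, in that order.
    """
    def with_newline(pem: str) -> str:
        # Make sure each PEM block ends with a newline before concatenation.
        return pem if pem.endswith("\n") else pem + "\n"

    def encode(text: str) -> str:
        return base64.b64encode(text.encode("utf-8")).decode("ascii")

    bundle = "".join(with_newline(p) for p in (tls_key_pem, tls_cert_pem, ca_cert_pem))
    return {"cacert": encode(ca_cert_pem), "cert": encode(bundle)}


# Placeholder strings stand in for real PEM content.
data = build_rgw_ssl_secret_data("KEY\n", "CERT\n", "CA\n")
print(data)
```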
    
Enable multisite for Ceph RGW Object Storage

TechPreview

The Ceph multisite feature allows object storage to replicate its data over multiple Ceph clusters. Using multisite, such object storage is independent and isolated from another object storage in the cluster. Only the multi-zone multisite setup is currently supported. For more details, see Ceph documentation: Multisite.

Enable the multisite RGW Object Storage
  1. Select from the following options:

    • If you do not have a Container Cloud cluster yet, open kaascephcluster.yaml.template for editing.

    • If the Container Cloud cluster is already deployed, open the KaaSCephCluster CR of a managed cluster for editing:

      kubectl edit kaascephcluster -n <managedClusterProjectName>
      

      Substitute <managedClusterProjectName> with a corresponding value.

  2. Using the following table, update the cephClusterSpec.objectStorage.multisite section specification as required:

    Multisite parameters

    Parameter

    Description

    realms Technical Preview

    List of realms to use, represents the realm namespaces. Includes the following parameters:

    • name - the realm name.

    • pullEndpoint - optional, required only when the master zone is in a different storage cluster. The endpoint, access key, and system key of the system user from the realm to pull from. Includes the following parameters:

      • endpoint - the endpoint of the master zone in the master zone group.

      • accessKey - the access key of the system user from the realm to pull from.

      • secretKey - the system key of the system user from the realm to pull from.

    zoneGroups Technical Preview

    The list of zone groups for realms. Includes the following parameters:

    • name - the zone group name.

    • realmName - the realm namespace name to which the zone group belongs.

    zones Technical Preview

    The list of zones used within one zone group. Includes the following parameters:

    • name - the zone name.

    • metadataPool - the settings used to create the Object Storage metadata pools. Must use replication. For details, see Pool parameters.

    • dataPool - the settings to create the Object Storage data pool. Can use replication or erasure coding. For details, see Pool parameters.

    • zoneGroupName - the zone group name.

    • endpointsForZone - available since MOSK 24.2. The list of all endpoints in the zone group. If you use ingress proxy for RGW, the list of endpoints must contain that FQDN/IP address to access RGW. By default, if no ingress proxy is used, the list of endpoints is set to the IP address of the RGW external service. Endpoints must follow the HTTP URL format.

    Caution

    The multisite configuration requires master and secondary zones to be reachable from each other.
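
Because endpointsForZone values must follow the HTTP URL format, a quick pre-check before editing the spec can catch bare IP addresses or host names. The following is a minimal sketch. The function name is hypothetical, and accepting https in addition to http is an assumption.

```python
from urllib.parse import urlparse


def is_http_url(endpoint: str) -> bool:
    """Check that an endpoint has an http(s) scheme and a host part (hypothetical helper)."""
    parsed = urlparse(endpoint)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


for candidate in ("http://10.11.0.75:8080", "10.11.0.75:8080"):
    print(candidate, "->", "valid" if is_http_url(candidate) else "invalid")
```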

  3. Select from the following options:

    • If you do not need to replicate data from a different storage cluster, and the current cluster represents the master zone, modify the current objectStorage section to use the multisite mode:

      1. Configure the zone RADOS Gateway (RGW) parameter by setting it to the RGW Object Storage name.

        Note

        Leave dataPool and metadataPool empty. These parameters are ignored because the zone block in the multisite configuration specifies the pools parameters. Other RGW parameters do not require changes.

        For example:

        objectStorage:
          rgw:
            dataPool: {}
            gateway:
              allNodes: false
              instances: 2
              port: 80
              securePort: 8443
            healthCheck: {}
            metadataPool: {}
            name: openstack-store
            preservePoolsOnDelete: false
            zone:
              name: openstack-store
        
      2. Create the multiSite section where the names of realm, zone group, and zone must match the current RGW name.

        Since MCC 2.27.0 (Cluster release 17.2.0), specify the endpointsForZone parameter according to your configuration:

        • If you use ingress proxy, which is defined in the spec.cephClusterSpec.ingress section, add the FQDN endpoint.

        • If you do not use any ingress proxy and access the RGW API using the default RGW external service, add the IP address of the external service or leave this parameter empty.

        The following example illustrates a complete objectStorage section:

        objectStorage:
          multiSite:
            realms:
            - name: openstack-store
            zoneGroups:
            - name: openstack-store
              realmName: openstack-store
            zones:
            - name: openstack-store
              zoneGroupName: openstack-store
              endpointsForZone: http://10.11.0.75:8080
              metadataPool:
                failureDomain: host
                replicated:
                  size: 3
              dataPool:
                erasureCoded:
                  codingChunks: 1
                  dataChunks: 2
                failureDomain: host
          rgw:
            dataPool: {}
            gateway:
              allNodes: false
              instances: 2
              port: 80
              securePort: 8443
            healthCheck: {}
            metadataPool: {}
            name: openstack-store
            preservePoolsOnDelete: false
            zone:
              name: openstack-store
        
    • If you use a different storage cluster, and its object storage data must be replicated, specify the realm and zone group names along with the pullEndpoint parameter. Additionally, specify the endpoint, access key, and secret key of the system user of the realm from which you need to replicate data. For details, see step 2 of this procedure.

      • To obtain the endpoint of the cluster zone that must be replicated, run the following command by specifying the zone group name of the required master zone on the master zone side:

        radosgw-admin zonegroup get --rgw-zonegroup=<ZONE_GROUP_NAME> | jq -r '.endpoints'
        

        The endpoint is located in the endpoints field.

      • To obtain the name of the system user, which has your RGW Object Storage name as a prefix, list the users on the required Ceph cluster:

        radosgw-admin user list
        
      • To obtain the access key and the secret key of the system user, run the following command with the obtained user name:

        radosgw-admin user info --uid="<USER_NAME>" | jq -r '.keys'
        

      For example:

      objectStorage:
        multiSite:
          realms:
          - name: openstack-store
            pullEndpoint:
              endpoint: http://10.11.0.75:8080
              accessKey: DRND5J2SVC9O6FQGEJJF
              secretKey: qpjIjY4lRFOWh5IAnbrgL5O6RTA1rigvmsqRGSJk
          zoneGroups:
          - name: openstack-store
            realmName: openstack-store
          zones:
          - name: openstack-store-backup
            zoneGroupName: openstack-store
            metadataPool:
              failureDomain: host
              replicated:
                size: 3
            dataPool:
              erasureCoded:
                codingChunks: 1
                dataChunks: 2
              failureDomain: host
      

      Note

      Mirantis recommends using the same metadataPool and dataPool settings as you use in the master zone.
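
The key lookup for pullEndpoint can also be scripted. The following sketch assumes that the full JSON output of radosgw-admin user info (without the jq filter) is saved to a string, and that the first entry of its keys list holds the access_key and secret_key of the system user. The function name is hypothetical, and the sample reuses the example keys from this procedure.

```python
import json


def system_user_keys(user_info_json: str) -> tuple:
    """Extract (access_key, secret_key) from radosgw-admin user info output (hypothetical helper)."""
    info = json.loads(user_info_json)
    keys = info.get("keys", [])
    if not keys:
        raise ValueError("the user has no S3 keys")
    # Assumption: the first key pair is the one to use for pullEndpoint.
    return keys[0]["access_key"], keys[0]["secret_key"]


# Trimmed sample of the command output, containing only the fields used here.
sample = json.dumps({
    "user_id": "<USER_NAME>",
    "keys": [{
        "user": "<USER_NAME>",
        "access_key": "DRND5J2SVC9O6FQGEJJF",
        "secret_key": "qpjIjY4lRFOWh5IAnbrgL5O6RTA1rigvmsqRGSJk",
    }],
})
access_key, secret_key = system_user_keys(sample)
print(access_key, secret_key)
```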

  4. Configure the zone RGW parameter and leave dataPool and metadataPool empty. These parameters are ignored because the zone section in the multisite configuration specifies the pools parameters.

    Also, you can split the RGW daemons into daemons that serve clients and daemons that run synchronization. To enable this option, set splitDaemonForMultisiteTrafficSync to true in the gateway section.

    For example:

    objectStorage:
      multiSite:
        realms:
        - name: openstack-store
          pullEndpoint:
            endpoint: http://10.11.0.75:8080
            accessKey: DRND5J2SVC9O6FQGEJJF
            secretKey: qpjIjY4lRFOWh5IAnbrgL5O6RTA1rigvmsqRGSJk
        zoneGroups:
        - name: openstack-store
          realmName: openstack-store
        zones:
        - name: openstack-store-backup
          zoneGroupName: openstack-store
          metadataPool:
            failureDomain: host
            replicated:
              size: 3
          dataPool:
            erasureCoded:
              codingChunks: 1
              dataChunks: 2
            failureDomain: host
      rgw:
        dataPool: {}
        gateway:
          allNodes: false
          instances: 2
          splitDaemonForMultisiteTrafficSync: true
          port: 80
          securePort: 8443
        healthCheck: {}
        metadataPool: {}
        name: openstack-store-backup
        preservePoolsOnDelete: false
        zone:
          name: openstack-store-backup
    
  5. On the ceph-tools pod, verify the multisite status:

    radosgw-admin sync status
    

Once done, ceph-operator will create the required resources and Rook will handle the multisite configuration. For details, see: Rook documentation: Object Multisite.

Configure and clean up a multisite configuration

Warning

Rook does not handle multisite configuration changes and cleanup. Therefore, once you enable multisite for Ceph RGW Object Storage, perform these operations manually in the ceph-tools pod. For details, see Rook documentation: Multisite cleanup.

If automatic update of zone group hostnames is disabled, manually specify all required hostnames and update the zone group. In the ceph-tools pod, run the following script:

/usr/local/bin/zonegroup_hostnames_update.sh --rgw-zonegroup <ZONEGROUP_NAME> --hostnames fqdn1[,fqdn2]

If the multisite setup is completely cleaned up, manually execute the following steps on the ceph-tools pod:

  1. Remove the .rgw.root pool:

    ceph osd pool rm .rgw.root .rgw.root --yes-i-really-really-mean-it
    

    Some other RGW pools may also require a removal after cleanup.

  2. Remove the related RGW crush rules:

    ceph osd crush rule ls | grep rgw | xargs -I% ceph osd crush rule rm %
    
Manage Ceph RBD or CephFS clients and RGW users

Available since 2.23.1 (Cluster release 12.7.0)

The section describes how to create, access, and remove Ceph RADOS Block Device (RBD) or Ceph File System (CephFS) clients and RADOS Gateway (RGW) users.

Manage Ceph RBD or CephFS clients

Available since 2.23.1 (Cluster release 12.7.0)

The KaaSCephCluster resource allows managing custom Ceph RADOS Block Device (RBD) or Ceph File System (CephFS) clients. This section describes how to create, access, and remove Ceph RBD or CephFS clients.

For all supported parameters of Ceph clients, refer to Clients parameters.

Create an RBD or CephFS client
  1. Edit the KaaSCephCluster resource by adding a new Ceph client to the spec section:

    kubectl -n <managedClusterProject> edit kaascephcluster
    

    Substitute <managedClusterProject> with the corresponding Container Cloud project where the managed cluster was created.

    Example of adding an RBD client to the kubernetes-ssd pool:

    spec:
      cephClusterSpec:
        clients:
        - name: rbd-client
          caps:
            mon: allow r, allow command "osd blacklist"
            osd: profile rbd pool=kubernetes-ssd
    

    Example of adding a CephFS client to the cephfs-1 Ceph File System:

    spec:
      cephClusterSpec:
        clients:
        - name: cephfs-1-client
          caps:
            mds: allow rwp
            mon: allow r, allow command "osd blacklist"
            osd: allow rw tag cephfs data=cephfs-1 metadata=*
    

    For details about caps, refer to Ceph documentation: Authorization (capabilities).

    Note

    Ceph supports providing client access only to the whole Ceph File System, including all data pools in it.

  2. Wait for created clients to become ready in the KaaSCephCluster status:

    kubectl -n <managedClusterProject> get kaascephcluster -o yaml
    

    Example output:

    status:
      fullClusterInfo:
        blockStorageStatus:
          clientsStatus:
            rbd-client:
              present: true
              status: Ready
            cephfs-1-client:
              present: true
              status: Ready
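
Waiting for the clients can be automated by inspecting the same status fields. The sketch below assumes that a single KaaSCephCluster object was exported in the JSON format (for example, with -o json for one named resource) and loaded into a dictionary. The function name is hypothetical.

```python
def not_ready_clients(kaascephcluster: dict) -> list:
    """Return the names of clients that are not present or not Ready (hypothetical helper)."""
    clients = (
        kaascephcluster.get("status", {})
        .get("fullClusterInfo", {})
        .get("blockStorageStatus", {})
        .get("clientsStatus", {})
    )
    return sorted(
        name
        for name, state in clients.items()
        if not (state.get("present") and state.get("status") == "Ready")
    )


# Sample mirroring the example output above, with one client still pending.
sample = {
    "status": {"fullClusterInfo": {"blockStorageStatus": {"clientsStatus": {
        "rbd-client": {"present": True, "status": "Ready"},
        "cephfs-1-client": {"present": False, "status": "Pending"},
    }}}}
}
print(not_ready_clients(sample))
```

An empty list means that all clients listed in the status are ready.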
    
Access data using an RBD or CephFS client
  1. Using the KaaSCephCluster status, obtain secretInfo with the Ceph client credentials:

    kubectl -n <managedClusterProject> get kaascephcluster -o yaml
    

    Example output:

    status:
      miraCephSecretsInfo:
        secretInfo:
          clientSecrets:
          - name: rbd-client
            secretName: rook-ceph-client-rbd-client
            secretNamespace: rook-ceph
          - name: cephfs-1-client
            secretName: rook-ceph-client-cephfs-1-client
            secretNamespace: rook-ceph
    
  2. Use secretName and secretNamespace to access the Ceph client credentials from a managed cluster:

    kubectl --kubeconfig <managedClusterKubeconfig> -n <secretNamespace> get secret <secretName> -o jsonpath='{.data.<clientName>}' | base64 -d; echo
    

    Substitute the following parameters:

    • <managedClusterKubeconfig> with a managed cluster kubeconfig

    • <secretNamespace> with secretNamespace from the previous step

    • <secretName> with secretName from the previous step

    • <clientName> with the Ceph RBD or CephFS client name set in spec.cephClusterSpec.clients of the KaaSCephCluster resource, for example, rbd-client

    Example output:

    AQAGHDNjxWYXJhAAjafCn3EtC6KgzgI1x4XDlg==
    
  3. Using the obtained credentials, create two configuration files on the required workloads to connect them with Ceph pools or file systems:

    • /etc/ceph/ceph.conf:

      [default]
         mon_host = <mon1IP>:6789,<mon2IP>:6789,...,<monNIP>:6789
      

      where mon_host are the comma-separated IP addresses with 6789 ports of the current Ceph Monitors. For example, 10.10.0.145:6789,10.10.0.153:6789,10.10.0.235:6789.

    • /etc/ceph/ceph.client.<clientName>.keyring:

      [client.<clientName>]
          key = <cephClientCredentials>
      
      • <clientName> is a client name set in spec.cephClusterSpec.clients of the KaaSCephCluster resource, for example, rbd-client

      • <cephClientCredentials> are the client credentials obtained in the previous steps. For example, AQAGHDNjxWYXJhAAjafCn3EtC6KgzgI1x4XDlg==
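
The two files can be generated from the collected values. The following sketch only renders the file contents in the layout shown above, including the section name used in the ceph.conf example. The function names are hypothetical, and writing the result to /etc/ceph is left to the operator.

```python
def render_ceph_conf(mon_ips: list, port: int = 6789) -> str:
    """Render ceph.conf with a comma-separated mon_host list (hypothetical helper)."""
    mon_host = ",".join(f"{ip}:{port}" for ip in mon_ips)
    # Section name follows the ceph.conf example above.
    return f"[default]\n   mon_host = {mon_host}\n"


def render_keyring(client_name: str, key: str) -> str:
    """Render ceph.client.<clientName>.keyring (hypothetical helper)."""
    return f"[client.{client_name}]\n    key = {key}\n"


conf = render_ceph_conf(["10.10.0.145", "10.10.0.153", "10.10.0.235"])
keyring = render_keyring("rbd-client", "AQAGHDNjxWYXJhAAjafCn3EtC6KgzgI1x4XDlg==")
print(conf)
print(keyring)
```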

  4. If the client caps parameters contain mon: allow r, verify the client access using the following command:

    ceph -n client.<clientName> -s
    
Remove an RBD or CephFS client
  1. Edit the KaaSCephCluster resource by removing the Ceph client from spec.cephClusterSpec.clients:

    kubectl -n <managedClusterProject> edit kaascephcluster
    
  2. Wait for the client to be removed from the KaaSCephCluster status in status.fullClusterInfo.blockStorageStatus.clientsStatus:

    kubectl -n <managedClusterProject> get kaascephcluster -o yaml
    
Manage Ceph Object Storage users

Available since 2.23.1 (Cluster release 12.7.0)

The KaaSCephCluster resource allows managing custom Ceph Object Storage users. This section describes how to create, access, and remove Ceph Object Storage users.

For all supported parameters of Ceph Object Storage users, refer to RADOS Gateway parameters.

Create a Ceph Object Storage user
  1. Edit the KaaSCephCluster resource by adding a new Ceph Object Storage user to the spec section:

    kubectl -n <managedClusterProject> edit kaascephcluster
    

    Substitute <managedClusterProject> with the corresponding Container Cloud project where the managed cluster was created.

    Example of adding the Ceph Object Storage user user-a:

    Caution

    For user name, apply the UUID format with no capital letters.

    spec:
      cephClusterSpec:
        objectStorage:
          rgw:
            objectUsers:
            - capabilities:
                bucket: '*'
                metadata: read
                user: read
              displayName: user-a
              name: user-a
              quotas:
                maxBuckets: 10
                maxSize: 10G
    
  2. Wait for the created user to become ready in the KaaSCephCluster status:

    kubectl -n <managedClusterProject> get kaascephcluster -o yaml
    

    Example output:

    status:
      fullClusterInfo:
        objectStorageStatus:
          objectStoreUsers:
            user-a:
              present: true
              phase: Ready
    
Access data using a Ceph Object Storage user
  1. Using the KaaSCephCluster status, obtain secretInfo with the Ceph user credentials:

    kubectl -n <managedClusterProject> get kaascephcluster -o yaml
    

    Example output:

    status:
      miraCephSecretsInfo:
        secretInfo:
          rgwUserSecrets:
          - name: user-a
            secretName: rook-ceph-object-user-<objstoreName>-<username>
            secretNamespace: rook-ceph
    

    Substitute <objstoreName> with a Ceph Object Storage name and <username> with a Ceph Object Storage user name.

  2. Use secretName and secretNamespace to access the Ceph Object Storage user credentials from a managed cluster. The secret contains Amazon S3 access and secret keys.

    • To obtain the user S3 access key:

      kubectl --kubeconfig <managedClusterKubeconfig> -n <secretNamespace> get secret <secretName> -o jsonpath='{.data.AccessKey}' | base64 -d; echo
      

      Substitute the following parameters in the commands above and below:

      • <managedClusterKubeconfig> with a managed cluster kubeconfig

      • <secretNamespace> with secretNamespace from the previous step

      • <secretName> with secretName from the previous step

      Example output:

      D49G060HQ86U5COBTJ13
      
    • To obtain the user S3 secret key:

      kubectl --kubeconfig <managedClusterKubeconfig> -n <secretNamespace> get secret <secretName> -o jsonpath='{.data.SecretKey}' | base64 -d; echo
      

      Example output:

      bpuYqIieKvzxl6nzN0sd7L06H40kZGXNStD4UNda
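
Both keys can be decoded in one pass if the secret is exported in the JSON format (for example, using -o json instead of -o jsonpath). The following sketch assumes such an export. The function name is hypothetical, and the sample secret is built from the example keys above.

```python
import base64
import json


def s3_credentials(secret_json: str) -> dict:
    """Decode AccessKey and SecretKey from a Kubernetes secret in JSON format (hypothetical helper)."""
    data = json.loads(secret_json)["data"]
    return {
        name: base64.b64decode(data[name]).decode("utf-8")
        for name in ("AccessKey", "SecretKey")
    }


def _b64(text: str) -> str:
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Sample secret containing only the fields used here.
sample_secret = json.dumps({"data": {
    "AccessKey": _b64("D49G060HQ86U5COBTJ13"),
    "SecretKey": _b64("bpuYqIieKvzxl6nzN0sd7L06H40kZGXNStD4UNda"),
}})
creds = s3_credentials(sample_secret)
print(creds["AccessKey"])
```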
      
  3. Configure the S3 client with the access and secret keys of the created user. You can access the S3 client using various tools such as s3cmd or awscli.

Remove a Ceph Object Storage user
  1. Edit the KaaSCephCluster resource by removing the required Ceph Object Storage user from spec.cephClusterSpec.objectStorage.rgw.objectUsers:

    kubectl -n <managedClusterProject> edit kaascephcluster
    
  2. Wait for the user to be removed from the KaaSCephCluster status in status.fullClusterInfo.objectStorageStatus.objectStoreUsers:

    kubectl -n <managedClusterProject> get kaascephcluster -o yaml
    
Verify Ceph

This section describes how to verify the components of a Ceph cluster after deployment. For troubleshooting, verify Ceph Controller and Rook logs as described in Verify Ceph Controller and Rook.

Verify the Ceph core services

To confirm that all Ceph components including mon, mgr, osd, and rgw have joined your cluster properly, analyze the logs for each pod and verify the Ceph status:

kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
ceph -s

Example of a positive system response:

cluster:
    id:     4336ab3b-2025-4c7b-b9a9-3999944853c8
    health: HEALTH_OK

services:
    mon: 3 daemons, quorum a,b,c (age 20m)
    mgr: a(active, since 19m)
    osd: 6 osds: 6 up (since 16m), 6 in (since 16m)
    rgw: 1 daemon active (miraobjstore.a)

data:
    pools:   12 pools, 216 pgs
    objects: 201 objects, 3.9 KiB
    usage:   6.1 GiB used, 174 GiB / 180 GiB avail
    pgs:     216 active+clean
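
For scripted checks, the same status can be requested in the JSON format with ceph -s --format json and evaluated programmatically. The sketch below assumes that this output is captured into a string and reads only its health.status field. The function name is hypothetical.

```python
import json


def ceph_is_healthy(status_json: str) -> bool:
    """Return True if the reported cluster health is HEALTH_OK (hypothetical helper)."""
    status = json.loads(status_json)
    return status.get("health", {}).get("status") == "HEALTH_OK"


# Trimmed samples containing only the field used here.
healthy = json.dumps({"health": {"status": "HEALTH_OK"}})
degraded = json.dumps({"health": {"status": "HEALTH_WARN"}})
print(ceph_is_healthy(healthy), ceph_is_healthy(degraded))
```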
Verify rook-discover

To ensure that rook-discover is running properly, verify that the local-device ConfigMap has been created for each Ceph node specified in the cluster configuration:

  1. Obtain the list of local devices:

    kubectl get configmap -n rook-ceph | grep local-device
    

    Example of a system response:

    local-device-01      1      30m
    local-device-02      1      29m
    local-device-03      1      30m
    
  2. Verify that each device from the list contains information about available devices for the Ceph node deployment:

    kubectl describe configmap local-device-01 -n rook-ceph
    

    Example of a positive system response:

    Name:         local-device-01
    Namespace:    rook-ceph
    Labels:       app=rook-discover
                  rook.io/node=01
    Annotations:  <none>
    
    Data
    ====
    devices:
    ----
    [{"name":"vdd","parent":"","hasChildren":false,"devLinks":"/dev/disk/by-id/virtio-41d72dac-c0ff-4f24-b /dev/disk/by-path/virtio-pci-0000:00:09.0","size":32212254720,"uuid":"27e9cf64-85f4-48e7-8862-faa7270202ed","serial":"41d72dac-c0ff-4f24-b","type":"disk","rotational":true,"readOnly":false,"Partitions":null,"filesystem":"","vendor":"","model":"","wwn":"","wwnVendorExtension":"","empty":true,"cephVolumeData":"{\"path\":\"/dev/vdd\",\"available\":true,\"rejected_reasons\":[],\"sys_api\":{\"size\":32212254720.0,\"scheduler_mode\":\"none\",\"rotational\":\"1\",\"vendor\":\"0x1af4\",\"human_readable_size\":\"30.00 GB\",\"sectors\":0,\"sas_device_handle\":\"\",\"rev\":\"\",\"sas_address\":\"\",\"locked\":0,\"sectorsize\":\"512\",\"removable\":\"0\",\"path\":\"/dev/vdd\",\"support_discard\":\"0\",\"model\":\"\",\"ro\":\"0\",\"nr_requests\":\"128\",\"partitions\":{}},\"lvs\":[]}","label":""},{"name":"vdb","parent":"","hasChildren":false,"devLinks":"/dev/disk/by-path/virtio-pci-0000:00:07.0","size":67108864,"uuid":"988692e5-94ac-4c9a-bc48-7b057dd94fa4","serial":"","type":"disk","rotational":true,"readOnly":false,"Partitions":null,"filesystem":"","vendor":"","model":"","wwn":"","wwnVendorExtension":"","empty":true,"cephVolumeData":"{\"path\":\"/dev/vdb\",\"available\":false,\"rejected_reasons\":[\"Insufficient space (\\u003c5GB)\"],\"sys_api\":{\"size\":67108864.0,\"scheduler_mode\":\"none\",\"rotational\":\"1\",\"vendor\":\"0x1af4\",\"human_readable_size\":\"64.00 MB\",\"sectors\":0,\"sas_device_handle\":\"\",\"rev\":\"\",\"sas_address\":\"\",\"locked\":0,\"sectorsize\":\"512\",\"removable\":\"0\",\"path\":\"/dev/vdb\",\"support_discard\":\"0\",\"model\":\"\",\"ro\":\"0\",\"nr_requests\":\"128\",\"partitions\":{}},\"lvs\":[]}","label":""},{"name":"vdc","parent":"","hasChildren":false,"devLinks":"/dev/disk/by-id/virtio-e8fdba13-e24b-41f0-9 
/dev/disk/by-path/virtio-pci-0000:00:08.0","size":32212254720,"uuid":"190a50e7-bc79-43a9-a6e6-81b173cd2e0c","serial":"e8fdba13-e24b-41f0-9","type":"disk","rotational":true,"readOnly":false,"Partitions":null,"filesystem":"","vendor":"","model":"","wwn":"","wwnVendorExtension":"","empty":true,"cephVolumeData":"{\"path\":\"/dev/vdc\",\"available\":true,\"rejected_reasons\":[],\"sys_api\":{\"size\":32212254720.0,\"scheduler_mode\":\"none\",\"rotational\":\"1\",\"vendor\":\"0x1af4\",\"human_readable_size\":\"30.00 GB\",\"sectors\":0,\"sas_device_handle\":\"\",\"rev\":\"\",\"sas_address\":\"\",\"locked\":0,\"sectorsize\":\"512\",\"removable\":\"0\",\"path\":\"/dev/vdc\",\"support_discard\":\"0\",\"model\":\"\",\"ro\":\"0\",\"nr_requests\":\"128\",\"partitions\":{}},\"lvs\":[]}","label":""}]
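
The devices value is a JSON list in which cephVolumeData is itself a JSON-encoded string, which makes it hard to read by eye. The following sketch extracts the availability of each device, assuming the devices value is copied into a string. The function name is hypothetical, and the sample keeps only the fields used here.

```python
import json


def summarize_devices(devices_json: str) -> list:
    """List each device with its availability and rejection reasons (hypothetical helper)."""
    summary = []
    for device in json.loads(devices_json):
        # cephVolumeData is a nested JSON document stored as a string.
        volume = json.loads(device.get("cephVolumeData") or "{}")
        summary.append({
            "name": device["name"],
            "available": bool(volume.get("available")),
            "rejected_reasons": volume.get("rejected_reasons", []),
        })
    return summary


# Trimmed sample based on the vdd and vdb entries of the response above.
sample_devices = json.dumps([
    {"name": "vdd", "cephVolumeData": json.dumps(
        {"path": "/dev/vdd", "available": True, "rejected_reasons": []})},
    {"name": "vdb", "cephVolumeData": json.dumps(
        {"path": "/dev/vdb", "available": False,
         "rejected_reasons": ["Insufficient space (<5GB)"]})},
])
devices = summarize_devices(sample_devices)
for entry in devices:
    print(entry)
```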
    
Verify Ceph cluster state through CLI

Verifying the Ceph cluster state is the starting point for investigating issues. This section describes how to verify the Ceph state using the KaaSCephCluster, MiraCeph, and MiraCephHealth resources.

Note

Before MOSK 25.1, use MiraCephLog instead of MiraCephHealth.

Verify Ceph cluster state

To verify the state of a Ceph cluster, Ceph Controller provides special sections in KaaSCephCluster.status. The resource contains information about the state of the Ceph cluster components, their health, and potentially problematic components.

To verify the Ceph cluster state from a managed cluster:

  1. Obtain kubeconfig of a managed cluster and provide it as an environment variable:

    export KUBECONFIG=<pathToManagedKubeconfig>
    
  2. Obtain the MiraCeph resource in YAML format:

    kubectl -n ceph-lcm-mirantis get miraceph -o yaml
    

    Information from MiraCeph.status is passed to the miraCephInfo section of the KaaSCephCluster CR. For details, see KaaSCephCluster.status miraCephInfo specification.

  3. Obtain the MiraCephHealth resource in YAML format:

    kubectl -n ceph-lcm-mirantis get miracephhealth -o yaml
    

    Information from MiraCephHealth is passed to the fullClusterInfo and shortClusterInfo sections of the KaaSCephCluster CR. For details, see KaaSCephCluster.status shortClusterInfo specification and KaaSCephCluster.status fullClusterInfo specification.

    Note

    Before MOSK 25.1, use MiraCephLog instead of MiraCephHealth as the resource name and in the command above.

To verify the Ceph cluster state from a management cluster:

  1. Obtain the KaaSCephCluster resource in the YAML format:

    kubectl -n <projectName> get kaascephcluster -o yaml
    

    Substitute <projectName> with the project name of the managed cluster.

  2. Verify the state of the required component using KaaSCephCluster.status description.

KaaSCephCluster.status description

KaaSCephCluster.status allows you to learn the current health of a Ceph cluster and identify potentially problematic components. This section describes KaaSCephCluster.status and its fields. To view KaaSCephCluster.status, perform the steps described in Verify Ceph cluster state through CLI.

KaaSCephCluster.status specification

Field

Description

kaasCephState

Available since MCC 2.25.0 (Cluster release 17.0.0). Describes the current state of KaaSCephCluster and reflects any errors during object reconciliation, including spec generation, object creation on a managed cluster, and status retrieval.

miraCephInfo

Describes the current phase of Ceph spec reconciliation and the spec validation result. The miraCephInfo section contains information about the current validation and reconciliation of the KaaSCephCluster and MiraCeph resources. It helps to understand whether the specified configuration is valid to create a Ceph cluster and informs about the current phase of applying this configuration. For miraCephInfo fields description, see KaaSCephCluster.status miraCephInfo specification.

shortClusterInfo

Represents a short version of fullClusterInfo and contains a summary of the Ceph cluster state collecting process and potential issues. It helps to quickly verify whether fullClusterInfo is up to date and whether any errors occurred while collecting the information. For shortClusterInfo fields description, see KaaSCephCluster.status shortClusterInfo specification.

fullClusterInfo

Contains complete Ceph cluster information, including the health of the cluster, Ceph resources, and daemons. It helps to reveal potentially problematic components. For fullClusterInfo fields description, see KaaSCephCluster.status fullClusterInfo specification.

miraCephSecretsInfo

Contains information about secrets of the managed cluster that are used in the Ceph cluster, such as keyrings, Ceph clients, RADOS Gateway user credentials, and so on. For miraCephSecretsInfo fields description, see KaaSCephCluster.status miraCephSecretsInfo specification.

The following tables describe all sections of KaaSCephCluster.status.

KaaSCephCluster.status miraCephInfo specification

Field

Description

phase

Contains the current phase of handling the applied Ceph cluster spec. Possible values are Creating, Deploying, Validation, Ready, Deleting, and Failed.

message

Contains a detailed description of the current phase or an error message if the phase is Failed.

validation

Contains the KaaSCephCluster/MiraCeph spec validation result (Succeed or Failed) with a list of messages, if any. The validation section includes the following fields:

validation:
  result: Succeed or Failed
  messages: ["error", "messages", "list"]

KaaSCephCluster.status shortClusterInfo specification

Field

Description

state

Current Ceph cluster collector status:

  • Ready if information collecting works as expected

  • Failed if an error occurs

lastCheck

Date and time when the cluster was last verified.

lastUpdate

Date and time when the Ceph cluster information was last updated.

messages

List of error or warning messages found when gathering the facts about the Ceph cluster.

KaaSCephCluster.status fullClusterInfo specification

Field

Description

clusterStatus

General information from Rook about the Ceph cluster health and current state. The clusterStatus field contains the following fields:

clusterStatus:
  state: <rook ceph cluster common status>
  phase: <rook ceph cluster spec reconcile phase>
  message: <rook ceph cluster phase details>
  conditions: <history of rook ceph cluster
              reconcile steps>
  ceph: <ceph cluster health>
  storage:
    deviceClasses: <list of used device classes
                   in ceph cluster>
  version:
    image: <ceph image used in ceph cluster>
    version: <ceph version of ceph cluster>

operatorStatus

Status of the Rook Ceph Operator pod: Ok or Not running.

daemonsStatus

Map of statuses for each Ceph cluster daemon type. Indicates the expected and actual number of Ceph daemons on the cluster. Available daemon types are: mgr, mon, osd, and rgw. The daemonsStatus field contains the following fields:

daemonsStatus:
  <daemonType>:
    status: <daemons status>
    running: <number of running daemons with
             details>

For example:

daemonsStatus:
  mgr:
    running: a is active mgr ([] standBy)
    status: Ok
  mon:
    running: '3/3 mons running: [a c d] in quorum'
    status: Ok
  osd:
    running: '4/4 running: 4 up, 4 in'
    status: Ok
  rgw:
    running: 2/2 running
             ([openstack.store.a openstack.store.b])
    status: Ok

blockStorageStatus

State of the Ceph cluster block storage resources. Includes the following fields:

  • pools - status map for each CephBlockPool resource. The map includes the following fields:

    pools:
      <cephBlockPoolName>:
        present: <flag whether desired pool is
                 present in ceph cluster>
        status: <rook ceph block pool resource status>
    
  • clients - status map for each Ceph client resource. The map includes the following fields:

    clients:
      <cephClientName>:
        present: <flag whether desired client is
                 present in ceph cluster>
        status: <rook ceph client resource status>
    

objectStorageStatus

State of the Ceph cluster object storage resources. Includes the following fields:

  • objectStoreStatus - status of the Rook Ceph Object Store. Information comes from Rook.

  • objectStoreUsers - status map for each Ceph Object User resource. The map includes the following fields:

    objectStoreUsers:
      <cephObjectUserName>:
        present: <flag whether desired rgw user is
                 present in ceph cluster>
        phase: <rook ceph object user resource phase>
    
  • objectStoreBuckets - status map for each Ceph Object Bucket resource. The map includes the following fields:

    objectStoreBuckets:
      <cephObjectBucketName>:
        present: <flag whether desired rgw bucket is
                 present in ceph cluster>
        phase: <rook ceph object bucket resource phase>
    

cephDetails

Verbose details of the Ceph cluster state. cephDetails includes the following fields:

  • diskUsage - the used, available, and total storage size for each deviceClass and pool.

    cephDetails:
      diskUsage:
        deviceClass:
          <deviceClass>:
            # The amount of raw storage consumed by user data (excluding bluestore database).
            bytesUsed: "<number>"
            # The amount of free space available in the cluster.
            bytesAvailable: "<number>"
            # The amount of storage capacity managed by the cluster.
            bytesTotal: "<number>"
        pools:
          <poolName>:
            # The space allocated for a pool over all OSDs. This includes replication,
            # allocation granularity, and erasure-coding overhead. Compression savings
            # and object content gaps are also taken into account. BlueStore database
            # is not included in this amount.
            bytesUsed: "<number>"
            # The notional percentage of storage used per pool.
            usedPercentage: "<number>"
            # Number calculated with the formula: bytesTotal - bytesUsed.
            bytesAvailable: "<number>"
            # An estimate of the notional amount of data that can be written to this pool.
            bytesTotal: "<number>"
    
  • cephDeviceMapping - a key-value mapping of which node contains which Ceph OSD and which Ceph OSD uses which disk.

    cephDetails:
      cephDeviceMapping:
        <kubernetes node name>:
          osd.<ID>: <deviceName>
    

    Note

    In MCC 2.24.2 (Cluster release 15.0.1), cephDeviceMapping is removed because its large size can potentially exceed the Kubernetes 1.5 MB quota.

cephCSIPluginDaemonsStatus

Contains information, similar to the daemonsStatus format, for each Ceph CSI plugin deployed in the Ceph cluster: rbd and, if enabled, cephfs. The cephCSIPluginDaemonsStatus field contains the following fields:

cephCSIPluginDaemonsStatus:
  <csiPlugin>:
    running: <number of running daemons with details>
    status: <csi plugin status>

For example:

cephCSIPluginDaemonsStatus:
  csi-rbdplugin:
    running: 1/3 running
    status: Some csi-rbdplugin daemons are not ready
  csi-cephfsplugin:
    running: 3/3 running
    status: Ok
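
The running strings in the daemonsStatus and cephCSIPluginDaemonsStatus examples above mostly start with an <actual>/<expected> pair, so a script can compare the two numbers instead of relying on the free-form status text. The following is a minimal sketch under that assumption. The all_daemons_running function name is hypothetical, and the mgr string from the example (a is active mgr ...) does not follow this pattern, so the function reports it separately with return code 2.

```shell
# Succeeds (returns 0) when a "running" string reports that all expected
# daemons are running. Returns 1 when fewer daemons run than expected and
# 2 when the string does not start with an "<actual>/<expected>" pair.
all_daemons_running() {
    counts=${1%% *}            # "1/3 running" -> "1/3"
    case "$counts" in
        */*) ;;
        *) return 2 ;;
    esac
    actual=${counts%%/*}       # "1/3" -> "1"
    expected=${counts##*/}     # "1/3" -> "3"
    case "$actual" in ''|*[!0-9]*) return 2 ;; esac
    case "$expected" in ''|*[!0-9]*) return 2 ;; esac
    [ "$actual" -eq "$expected" ]
}
```

For example, all_daemons_running "1/3 running" fails for the csi-rbdplugin entry shown above, while all_daemons_running "3/3 running" succeeds for csi-cephfsplugin.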

KaaSCephCluster.status miraCephSecretsInfo specification

Available since MCC 2.23.1 (Cluster release 12.7.0)

Field

Description

state

Current state of the secret collector on the Ceph cluster:

  • Ready - secrets information is collected successfully

  • Failed - secrets information fails to be collected

lastSecretCheck

Date and time when the Ceph cluster secrets were last verified.

lastSecretUpdate

Date and time when the Ceph cluster secrets were last updated.

secretsInfo

List of secrets for Ceph clients and RADOS Gateway users:

  • clientSecrets - details on secrets for Ceph clients

  • rgwUserSecrets - details on secrets for Ceph RADOS Gateway users

For example:

lastSecretCheck: "2022-09-05T07:05:35Z"
lastSecretUpdate: "2022-09-05T06:02:00Z"
secretInfo:
  clientSecrets:
  - name: client.admin
    secretName: rook-ceph-admin-keyring
    secretNamespace: rook-ceph
state: Ready

messages

List of error or warning messages, if any, found when collecting information about the Ceph cluster.
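
Instead of reading the full YAML output, you can query only the state fields described in the tables above. The following is a minimal sketch, not an official tool: the ceph_status_summary function name is hypothetical, and the .items[0] selector assumes that the project contains exactly one KaaSCephCluster object.

```shell
# Print the key state fields of the KaaSCephCluster object of a project.
# Usage: ceph_status_summary <projectName>
# Assumption: the project contains a single KaaSCephCluster object,
# which is why the jsonpath expression selects .items[0].
ceph_status_summary() {
    project="$1"
    for path in \
        status.miraCephInfo.phase \
        status.miraCephInfo.validation.result \
        status.shortClusterInfo.state \
        status.miraCephSecretsInfo.state
    do
        value=$(kubectl -n "$project" get kaascephcluster \
            -o jsonpath="{.items[0].${path}}")
        # An empty value means that the field is not set in the status.
        printf '%s=%s\n' "$path" "${value:-<empty>}"
    done
}
```

Run ceph_status_summary <projectName> against the management cluster. Any value other than Ready or Succeed points to the status section to inspect in detail.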

View Ceph cluster summary through the Container Cloud web UI

Warning

Mirantis highly recommends verifying a Ceph cluster using the CLI instead of the web UI. For details, see Verify Ceph cluster state through CLI.

The web UI capabilities for adding and managing a Ceph cluster are limited and lack flexibility in defining Ceph cluster specifications. For example, if an error occurs while adding a Ceph cluster using the web UI, usually you can address it only through the CLI.

The web UI functionality for managing Ceph clusters will be deprecated in one of the upcoming releases.

Verifying the Ceph cluster state is the entry point for investigating issues. Through the Ceph Clusters page of the Container Cloud web UI, you can view a detailed summary of all deployed Ceph clusters, including the cluster name and ID, health status, number of Ceph OSDs, and so on.

To view Ceph cluster summary:

  1. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The page with cluster details opens.

  4. In the Ceph Clusters tab, verify the overall cluster health and rebalancing statuses.

  5. Available since MCC 2.25.0 (Cluster release 17.0.0). Click Cluster Details:

    • The Machines tab contains the list of deployed Ceph machines with the following details:

      • Status - deployment status

      • Role - role assigned to a machine, manager or monitor

      • Storage devices - number of storage devices assigned to a machine

      • UP OSDs and IN OSDs - number of up and in Ceph OSDs belonging to a machine

      Note

      To obtain details about a specific machine used for Ceph deployment, in the Clusters > <clusterName> > Machines tab, click the required machine name containing the storage label.

    • The OSDs tab contains the list of Ceph OSDs comprising the Ceph cluster with the following details:

      • OSD - Ceph OSD ID

      • Storage Device ID - storage device ID assigned to a Ceph OSD

      • Type - type of storage device assigned to a Ceph OSD

      • Partition - partition name where Ceph OSD is located

      • Machine - machine name where Ceph OSD is located

      • UP/DOWN - status of a Ceph OSD in a cluster

      • IN/OUT - service state of a Ceph OSD in a cluster

Verify Ceph Controller and Rook

The starting point for Ceph troubleshooting is the ceph-controller and rook-operator logs. Once you locate the component that causes issues, verify the logs of the related pod. This section describes how to verify the Ceph Controller and Rook objects of a Ceph cluster.

To verify Ceph Controller and Rook:

  1. Verify the Ceph cluster status:

    1. Verify that the status of each pod in the ceph-lcm-mirantis and rook-ceph namespaces is Running:

      • For ceph-lcm-mirantis:

        kubectl get pod -n ceph-lcm-mirantis
        
      • For rook-ceph:

        kubectl get pod -n rook-ceph
        
  2. Verify Ceph Controller. Ceph Controller prepares the configuration that Rook uses to deploy the Ceph cluster, managed using the KaaSCephCluster resource. If Rook cannot finish the deployment, verify the Rook Operator logs as described in step 3.

    1. List the pods:

      kubectl -n ceph-lcm-mirantis get pods
      
    2. Verify the logs of the required pod:

      kubectl -n ceph-lcm-mirantis logs <ceph-controller-pod-name>
      
    3. Verify the configuration:

      kubectl get kaascephcluster -n <managedClusterProjectName> -o yaml
      
    4. On the managed cluster, verify the MiraCeph subresource:

      kubectl get miraceph -n ceph-lcm-mirantis -o yaml
      
  3. Verify the Rook Operator logs. Rook deploys a Ceph cluster based on custom resources created by the Ceph Controller, such as pools, clients, cephcluster, and so on. Rook logs contain details about components orchestration. For details about the Ceph cluster status and to get access to CLI tools, connect to the ceph-tools pod as described in step 4.

    1. Verify the Rook Operator logs:

      kubectl -n rook-ceph logs -l app=rook-ceph-operator
      
    2. Verify the CephCluster configuration:

      Note

      The Ceph Controller manages the CephCluster CR. Open the CephCluster CR only for verification and do not modify it manually.

      kubectl get cephcluster -n rook-ceph -o yaml
      
  4. Verify the ceph-tools pod:

    1. Connect to the ceph-tools pod:

      kubectl --kubeconfig <pathToManagedClusterKubeconfig> -n rook-ceph exec -it $(kubectl --kubeconfig <pathToManagedClusterKubeconfig> -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- bash
      
    2. Verify that CLI commands can run on the ceph-tools pod:

      ceph -s
      
  5. Verify hardware:

    1. Through the ceph-tools pod, obtain the required device in your cluster:

      ceph osd tree
      
    2. Enter all Ceph OSD pods in the rook-ceph namespace one by one:

      kubectl exec -it -n rook-ceph <osd-pod-name> -- bash
      
    3. Verify that the ceph-volume tool is available on all pods running on the target node:

      ceph-volume lvm list
      
  6. Verify data access. Ceph volumes can be consumed directly by Kubernetes workloads and internally, for example, by OpenStack services. To verify the Kubernetes storage:

    1. Verify the available storage classes. The storage classes that are automatically managed by Ceph Controller use the rook-ceph.rbd.csi.ceph.com provisioner.

      kubectl get storageclass
      

      Example of system response:

      NAME                            PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
      kubernetes-ssd (default)        rook-ceph.rbd.csi.ceph.com     Delete          Immediate              false                  55m
      stacklight-alertmanager-data    kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  55m
      stacklight-elasticsearch-data   kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  55m
      stacklight-postgresql-db        kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  55m
      stacklight-prometheus-data      kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  55m
      
    2. Verify that volumes are properly connected to the Pod:

      1. Obtain the list of volumes in all namespaces or use a particular one:

        kubectl get persistentvolumeclaims -A
        

        Example of system response:

        NAMESPACE   NAME       STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS     AGE
        rook-ceph   app-test   Bound    pv-test   1Gi        RWO            kubernetes-ssd   11m
        
      2. For each volume, verify the connection. For example:

        kubectl describe pvc app-test -n rook-ceph
        

        Example of a positive system response:

        Name:          app-test
        Namespace:     kaas
        StorageClass:  rook-ceph
        Status:        Bound
        Volume:        pv-test
        Labels:        <none>
        Annotations:   pv.kubernetes.io/bind-completed: yes
                       pv.kubernetes.io/bound-by-controller: yes
                       volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
        Finalizers:    [kubernetes.io/pvc-protection]
        Capacity:      1Gi
        Access Modes:  RWO
        VolumeMode:    Filesystem
        Events:        <none>
        

        In case of connection issues, inspect the Pod description for the volume information:

        kubectl describe pod <crashloopbackoff-pod-name>
        

        Example of system response:

        ...
        Events:
          FirstSeen LastSeen Count From    SubObjectPath Type     Reason           Message
          --------- -------- ----- ----    ------------- -------- ------           -------
          1h        1h       3     default-scheduler     Warning  FailedScheduling PersistentVolumeClaim is not bound: "app-test" (repeated 2 times)
          1h        35s      36    kubelet, 172.17.8.101 Warning  FailedMount      Unable to mount volumes for pod "wordpress-mysql-918363043-50pjr_default(08d14e75-bd99-11e7-bc4c-001c428b9fc8)": timeout expired waiting for volumes to attach/mount for pod "default"/"wordpress-mysql-918363043-50pjr". list of unattached/unmounted volumes=[mysql-persistent-storage]
          1h        35s      36    kubelet, 172.17.8.101 Warning  FailedSync       Error syncing pod
        
    3. Verify that the CSI provisioner plugins started properly and are in the Running status:

      1. Obtain the list of CSI provisioner plugins:

        kubectl -n rook-ceph get pod -l app=csi-rbdplugin-provisioner
        
      2. Verify the logs of the required CSI provisioner:

        kubectl logs -n rook-ceph <csi-provisioner-plugin-name> csi-provisioner
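
The pod and volume checks from the procedure above can be narrowed down to the objects that need attention. The following is a minimal sketch with hypothetical function names. It relies on the status.phase field selector of kubectl and on the column order of kubectl get persistentvolumeclaims -A shown in the example above. A pod in CrashLoopBackOff can still report the Running phase, so treat an empty result as a quick first pass only.

```shell
# List pods of a namespace that are neither Running nor Succeeded.
# Succeeded pods are excluded because completed jobs are expected.
# Usage: pods_not_running <namespace>
pods_not_running() {
    kubectl -n "$1" get pods --no-headers \
        --field-selector=status.phase!=Running,status.phase!=Succeeded
}

# List PVCs from all namespaces whose STATUS column is not Bound.
# With -A, the columns are NAMESPACE, NAME, STATUS, and so on.
pvcs_not_bound() {
    kubectl get persistentvolumeclaims -A --no-headers |
        awk '$3 != "Bound" { print $1 "/" $2 ": " $3 }'
}
```

For example, run pods_not_running ceph-lcm-mirantis and pods_not_running rook-ceph, then pvcs_not_bound. Each reported object is a candidate for kubectl describe.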
        
Enable Ceph tolerations and resources management

This section describes how to configure Ceph Controller to manage Ceph nodes resources.

Enable Ceph tolerations and resources management

Warning

This document does not provide any specific recommendations on requests and limits for Ceph resources. It describes the native Ceph resources configuration that is available for any MOSK cluster.

You can configure Ceph Controller to manage Ceph resources by specifying their requirements and constraints. To configure the resources consumption for the Ceph nodes, consider the following options that are based on different Helm release configuration values:

  • Configuring tolerations for tainted nodes for the Ceph Monitor, Ceph Manager, and Ceph OSD daemons. For details, see Taints and Tolerations.

  • Configuring nodes resources requests or limits for the Ceph daemons and for each Ceph OSD device class such as HDD, SSD, or NVMe. For details, see Managing Resources for Containers.

To enable Ceph tolerations and resources management:

  1. To avoid Ceph cluster health issues while changing the daemon configuration, set the Ceph noout, nobackfill, norebalance, and norecover flags through the ceph-tools pod before editing Ceph tolerations and resources:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l \
    "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
    ceph osd set noout
    ceph osd set nobackfill
    ceph osd set norebalance
    ceph osd set norecover
    exit
    

    Note

    Skip this step if you are only configuring the PG rebalance timeout and replicas count parameters.

  2. Edit the KaaSCephCluster resource of a managed cluster:

    kubectl -n <managedClusterProjectName> edit kaascephcluster
    

    Substitute <managedClusterProjectName> with the project name of the required managed cluster.

  3. Specify the parameters in the hyperconverge section as required. The hyperconverge section includes the following parameters:

    Ceph tolerations and resources parameters

    Parameter

    Description

    Example values

    tolerations

    Specifies tolerations for tainted nodes for the defined daemon type. Each daemon type key contains the following parameters:

    cephClusterSpec:
      hyperconverge:
        tolerations:
          <daemonType>:
            rules:
            - key: ""
              operator: ""
              value: ""
              effect: ""
              tolerationSeconds: 0
    

    Possible values for <daemonType> are osd, mon, mgr, and rgw. The following values are also supported:

    • all - specifies general toleration rules for all daemons if no separate daemon rule is specified.

    • mds - specifies the CephFS Metadata Server daemons.

    hyperconverge:
      tolerations:
        mon:
          rules:
          - effect: NoSchedule
            key: node-role.kubernetes.io/controlplane
            operator: Exists
        mgr:
          rules:
          - effect: NoSchedule
            key: node-role.kubernetes.io/controlplane
            operator: Exists
        osd:
          rules:
          - effect: NoSchedule
            key: node-role.kubernetes.io/controlplane
            operator: Exists
        rgw:
          rules:
          - effect: NoSchedule
            key: node-role.kubernetes.io/controlplane
            operator: Exists
    

    resources

    Specifies resources requests or limits. The parameter is a map with the daemon type as a key and the following structure as a value:

    hyperconverge:
      resources:
        <daemonType>:
          requests: <kubernetes valid spec of daemon resource requests>
          limits: <kubernetes valid spec of daemon resource limits>
    

    Possible values for <daemonType> are mon, mgr, osd, osd-hdd, osd-ssd, osd-nvme, prepareosd, rgw, and mds. The osd-hdd, osd-ssd, and osd-nvme resource requirements handle only the Ceph OSDs with a corresponding device class.

    hyperconverge:
      resources:
        mon:
          requests:
            memory: 1Gi
            cpu: 2
          limits:
            memory: 2Gi
            cpu: 3
        mgr:
          requests:
            memory: 1Gi
            cpu: 2
          limits:
            memory: 2Gi
            cpu: 3
        osd:
          requests:
            memory: 1Gi
            cpu: 2
          limits:
            memory: 2Gi
            cpu: 3
        osd-hdd:
          requests:
            memory: 1Gi
            cpu: 2
          limits:
            memory: 2Gi
            cpu: 3
        osd-ssd:
          requests:
            memory: 1Gi
            cpu: 2
          limits:
            memory: 2Gi
            cpu: 3
        osd-nvme:
          requests:
            memory: 1Gi
            cpu: 2
          limits:
            memory: 2Gi
            cpu: 3
    
  4. For the Ceph node specific resources settings, specify the resources section in the corresponding nodes spec of KaaSCephCluster:

    spec:
      cephClusterSpec:
        nodes:
          <nodeName>:
            resources:
              requests: <kubernetes valid spec of daemon resource requests>
              limits: <kubernetes valid spec of daemon resource limits>
    

    Substitute <nodeName> with the node requested for specific resources. For example:

    spec:
      cephClusterSpec:
        nodes:
          <nodeName>:
            resources:
              requests:
                memory: 1Gi
                cpu: 2
              limits:
                memory: 2Gi
                cpu: 3
    
  5. For the RADOS Gateway instances specific resources settings, specify the resources section in the rgw spec of KaaSCephCluster:

    spec:
      cephClusterSpec:
        objectStorage:
          rgw:
            gateway:
              resources:
                requests: <kubernetes valid spec of daemon resource requests>
                limits: <kubernetes valid spec of daemon resource limits>
    

    For example:

    spec:
      cephClusterSpec:
        objectStorage:
          rgw:
            gateway:
              resources:
                requests:
                  memory: 1Gi
                  cpu: 2
                limits:
                  memory: 2Gi
                  cpu: 3
    
  6. Save the reconfigured KaaSCephCluster resource and wait for ceph-controller to apply the updated Ceph configuration. It will recreate Ceph Monitors, Ceph Managers, or Ceph OSDs according to the specified hyperconverge configuration.

  7. If you have specified any osd tolerations, additionally specify tolerations for the rook instances:

    1. Open the Cluster resource of the required Ceph cluster on a management cluster:

      kubectl -n <ClusterProjectName> edit cluster
      

      Substitute <ClusterProjectName> with the project name of the required cluster.

    2. Specify the parameters in the ceph-controller section of spec.providerSpec.value.helmReleases:

      1. Specify the hyperconverge.tolerations.rook parameter as required:

        hyperconverge:
          tolerations:
            rook: |
             <yamlFormattedKubernetesTolerations>
        

        In <yamlFormattedKubernetesTolerations>, specify YAML-formatted tolerations from cephClusterSpec.hyperconverge.tolerations.osd.rules of the KaaSCephCluster spec. For example:

        hyperconverge:
          tolerations:
            rook: |
              - effect: NoSchedule
                key: node-role.kubernetes.io/controlplane
                operator: Exists
        
      2. In controllers.cephRequest.parameters.pgRebalanceTimeoutMin, specify the PG rebalance timeout for requests. The default is 30 minutes. For example:

        controllers:
          cephRequest:
            parameters:
              pgRebalanceTimeoutMin: 35
        
      3. In controllers.cephController.replicas, controllers.cephRequest.replicas, and controllers.cephStatus.replicas, specify the replicas count. The default is 3 replicas. For example:

        controllers:
          cephController:
            replicas: 1
          cephRequest:
            replicas: 1
          cephStatus:
            replicas: 1
        
    3. Save the reconfigured Cluster resource and wait for the ceph-controller Helm release update. It will recreate Ceph CSI and discover pods according to the specified hyperconverge.tolerations.rook configuration.

  8. Specify tolerations for different Rook resources using the following chart-based options:

    • hyperconverge.tolerations.rook - general toleration rules for each Rook service if no exact rules are specified

    • hyperconverge.tolerations.csiplugin - for tolerations of the ceph-csi plugins DaemonSets

    • hyperconverge.tolerations.csiprovisioner - for the ceph-csi provisioner deployment tolerations

    • hyperconverge.nodeAffinity.csiprovisioner - provides the ceph-csi provisioner node affinity with a value section

  9. After a successful Ceph reconfiguration, unset the flags set in step 1 through the ceph-tools pod:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l \
    "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- bash
    ceph osd unset noout
    ceph osd unset nobackfill
    ceph osd unset norebalance
    ceph osd unset norecover
    exit
    

    Note

    Skip this step if you have only configured the PG rebalance timeout and replicas count parameters.

Once done, proceed to Verify Ceph tolerations and resources management.
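
Steps 1 and 9 of the procedure above set and unset the same four flags, so a small non-interactive helper can reduce the risk of missing one of them. The following is a minimal sketch. The ceph_maintenance_flags function and the CEPH_MAINTENANCE_FLAGS variable are hypothetical names, and the helper assumes that the first pod with the app=rook-ceph-tools label is the ceph-tools pod.

```shell
# Flags that the procedure above sets before the reconfiguration
# and unsets after it.
CEPH_MAINTENANCE_FLAGS="noout nobackfill norebalance norecover"

# Run "ceph osd <set|unset> <flag>" for every flag through the ceph-tools pod.
# Usage: ceph_maintenance_flags set
#        ceph_maintenance_flags unset
ceph_maintenance_flags() {
    action="$1"
    case "$action" in
        set|unset) ;;
        *) echo "usage: ceph_maintenance_flags set|unset" >&2; return 2 ;;
    esac
    tools_pod=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name | head -n 1)
    [ -n "$tools_pod" ] || { echo "ceph-tools pod not found" >&2; return 1; }
    for flag in $CEPH_MAINTENANCE_FLAGS; do
        # Stop at the first failure so that a partial change is noticed.
        kubectl -n rook-ceph exec "$tools_pod" -- ceph osd "$action" "$flag" || return 1
    done
}
```

Run ceph_maintenance_flags set before editing the tolerations or resources and ceph_maintenance_flags unset after the reconfiguration completes. To confirm which flags are currently set, inspect the output of ceph -s.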

Verify Ceph tolerations and resources management

After you enable Ceph resources management as described in Enable Ceph tolerations and resources management, perform the steps below to verify that the configured tolerations, requests, or limits have been successfully specified in the Ceph cluster.

To verify Ceph tolerations and resources management:

  • To verify that the required tolerations are specified in the Ceph cluster, inspect the output of the following commands:

    kubectl -n rook-ceph get $(kubectl -n rook-ceph get cephcluster -o name) -o jsonpath='{.spec.placement.mon.tolerations}'
    kubectl -n rook-ceph get $(kubectl -n rook-ceph get cephcluster -o name) -o jsonpath='{.spec.placement.mgr.tolerations}'
    kubectl -n rook-ceph get $(kubectl -n rook-ceph get cephcluster -o name) -o jsonpath='{.spec.placement.osd.tolerations}'
    
  • To verify RADOS Gateway tolerations:

    kubectl -n rook-ceph get $(kubectl -n rook-ceph get cephobjectstore -o name) -o jsonpath='{.spec.gateway.placement.tolerations}'
    
  • To verify that the required resources requests or limits are specified for the Ceph mon, mgr, or osd daemons, inspect the output of the following command:

    kubectl -n rook-ceph get $(kubectl -n rook-ceph get cephcluster -o name) -o jsonpath='{.spec.resources}'
    
  • To verify that the required resources requests and limits are specified for the RADOS Gateway daemons, inspect the output of the following command:

    kubectl -n rook-ceph get $(kubectl -n rook-ceph get cephobjectstore -o name) -o jsonpath='{.spec.gateway.resources}'
    
  • To verify that the required resources requests or limits are specified for the Ceph OSDs hdd, ssd, or nvme device classes, perform the following steps:

    1. Identify which Ceph OSDs belong to the <deviceClass> device class in question:

      kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name) -- ceph osd crush class ls-osd <deviceClass>
      
    2. For each <osdID> obtained in the previous step, run the following command. Compare the output with the desired result.

      kubectl -n rook-ceph get deploy rook-ceph-osd-<osdID> -o jsonpath='{.spec.template.spec.containers[].resources}'
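
The two steps above can be combined into one loop that prints the resources of every Ceph OSD in a device class. The following is a minimal sketch. The osd_resources_by_class function name is hypothetical, the sketch assumes that ceph osd crush class ls-osd prints one OSD ID per line, and it uses the containers[*] jsonpath form to select all containers of the deployment.

```shell
# Print the container resources of every Ceph OSD deployment that belongs
# to the given device class.
# Usage: osd_resources_by_class <deviceClass>
osd_resources_by_class() {
    device_class="$1"
    tools_pod=$(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name | head -n 1)
    kubectl -n rook-ceph exec "$tools_pod" -- \
        ceph osd crush class ls-osd "$device_class" |
    while read -r osd_id; do
        [ -n "$osd_id" ] || continue
        printf 'osd.%s: ' "$osd_id"
        # Redirect stdin so that kubectl cannot consume the list of OSD IDs.
        kubectl -n rook-ceph get deploy "rook-ceph-osd-${osd_id}" \
            -o jsonpath='{.spec.template.spec.containers[*].resources}' </dev/null
        printf '\n'
    done
}
```

For example, osd_resources_by_class hdd prints one line per Ceph OSD that you can compare with the osd-hdd values of the hyperconverge section.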
      
Enable Ceph multinetwork

Ceph allows establishing multiple IP networks and subnet masks for clusters with configured L3 network rules. In MOSK, you can configure multinetwork through the network section of the KaaSCephCluster CR. Ceph Controller uses this section to specify the Ceph networks for external access and internal daemon communication. The parameters in the network section use the CIDR notation, for example, 10.0.0.0/24.

Before enabling multiple networks for a Ceph cluster, consider the following requirements:

  • Do not confuse the IP addresses you define with the public-facing IP addresses the network clients may use to access the services.

  • If you define more than one IP address and subnet mask for the public or cluster network, ensure that the subnets within the network can route to each other.

  • Add each IP address or subnet from the network section to iptables and open the required ports for them.

  • The pods of the Ceph OSD and RadosGW daemons use cross-pod health checks to verify that the entire Ceph cluster is healthy. Therefore, each CIDR must be accessible inside Ceph pods.

  • Avoid using the 0.0.0.0/0 CIDR in the network section. With a zero range in publicNet and/or clusterNet, the behavior of the Ceph daemons is unpredictable.

To enable multinetwork for Ceph:

  1. Select from the following options:

    • If the Ceph cluster is not deployed on a managed cluster yet, edit the deployment KaaSCephCluster YAML template.

    • If the Ceph cluster is already deployed on a managed cluster, open KaaSCephCluster for editing:

      kubectl -n <managedClusterProjectName> edit kaascephcluster
      

      Substitute <managedClusterProjectName> with a corresponding value.

  2. In the clusterNet and/or publicNet parameters of the cephClusterSpec.network section, define a comma-separated array of CIDRs. For example:

    network:
      publicNet:  10.12.0.0/24,10.13.0.0/24
      clusterNet: 10.10.0.0/24,10.11.0.0/24
    
  3. Select from the following options:

    • If you are creating a managed cluster, save the updated KaaSCephCluster template to the corresponding file and proceed with the managed cluster creation.

    • If you are configuring KaaSCephCluster of an existing managed cluster, exiting the text editor will apply the changes.

Once done, the specified network CIDRs will be passed to the Ceph daemons pods through the rook-config-override ConfigMap.
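
Because a malformed or zero-range CIDR leads to unpredictable daemon behavior, it can help to check the publicNet and clusterNet values before saving them. The following is a minimal sketch of such a pre-check, not part of the product. The validate_ceph_networks function name is hypothetical, and the sketch covers only IPv4 CIDRs without spaces around the commas.

```shell
# Succeeds when the argument is a comma-separated list of IPv4 CIDRs
# and none of them is a zero range such as 0.0.0.0/0.
# Usage: validate_ceph_networks "10.12.0.0/24,10.13.0.0/24"
validate_ceph_networks() {
    rest="$1,"
    while [ -n "$rest" ]; do
        cidr=${rest%%,*}       # first entry of the remaining list
        rest=${rest#*,}        # drop that entry and its comma
        printf '%s\n' "$cidr" | awk -F'[./]' '
            $0 !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\/[0-9]+$/ { exit 1 }
            {
                for (i = 1; i <= 4; i++) if ($i + 0 > 255) exit 1
                # A /0 prefix covers every address, the same as 0.0.0.0/0.
                if ($5 + 0 < 1 || $5 + 0 > 32) exit 1
            }' || { echo "invalid or zero-range CIDR: '$cidr'" >&2; return 1; }
    done
    return 0
}
```

For example, validate_ceph_networks "10.12.0.0/24,10.13.0.0/24" succeeds for the publicNet value shown above, while validate_ceph_networks "0.0.0.0/0" fails.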

Enable Ceph RBD mirroring

This section describes how to configure and use RADOS Block Device (RBD) mirroring for Ceph pools using the rbdMirror section in the KaaSCephCluster CR. The feature may be useful if, for example, you have two interconnected managed clusters. Once you enable RBD mirroring, the images in the specified pools will be replicated and if a cluster becomes unreachable, the second one will provide users with instant access to all images. For details, see Ceph Documentation: RBD Mirroring.

Note

Ceph Controller only supports bidirectional mirroring.

To enable Ceph RBD mirroring, follow the procedure below and use the following rbdMirror parameters description:

Ceph rbdMirror section parameters

Parameter

Description

daemonsCount

Count of rbd-mirror daemons to spawn. Mirantis recommends using one instance of the rbd-mirror daemon.

peers

Optional. List of mirroring peers of an external cluster to connect to. Only a single peer is supported. The peer section includes the following parameters:

  • site - the label of a remote Ceph cluster associated with the token.

  • token - the token that will be used by one site (Ceph cluster) to pull images from the other site. To obtain the token, use the rbd mirror pool peer bootstrap create command.

  • pools - optional, a list of pool names to mirror.

To enable Ceph RBD mirroring:

  1. In KaaSCephCluster CRs of both Ceph clusters where you want to enable mirroring, specify positive daemonsCount in the spec.cephClusterSpec.rbdMirror section:

    spec:
      cephClusterSpec:
        rbdMirror:
          daemonsCount: 1
    
  2. On both Ceph clusters where you want to enable mirroring, wait for the Ceph RBD Mirror daemons to start running:

    kubectl -n rook-ceph get pod -l app=rook-ceph-rbd-mirror
    
  3. In KaaSCephCluster of both Ceph clusters where you want to enable mirroring, specify the spec.cephClusterSpec.pools.mirroring.mode parameter for all pools that must be mirrored.

    Mirroring mode recommendations

    • Mirantis recommends using the pool mode for mirroring. For the pool mode, explicitly enable journaling for each image.

    • To use the image mirroring mode, explicitly enable mirroring as described in step 8.

    spec:
      cephClusterSpec:
        pools:
        - name: image-hdd
          ...
          mirroring:
            mode: pool
        - name: volumes-hdd
          ...
          mirroring:
            mode: pool
    
  4. Obtain the name of an external site to mirror with. On pools with mirroring enabled, the name is typically ceph fsid:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l \
    "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
    rbd mirror pool info <mirroringPoolName>
    # or
    ceph fsid
    

    Substitute <mirroringPoolName> with the name of a pool to be mirrored.

  5. On an external site to mirror with, create a new bootstrap peer token. Execute the following command within the ceph-tools pod CLI:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l \
    "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
    rbd mirror pool peer bootstrap create <mirroringPoolName> --site-name <siteName>
    

    Substitute <mirroringPoolName> with the name of a pool to be mirrored. In <siteName>, assign a label for the external Ceph cluster that will be used along with mirroring.

    For details, see Ceph documentation: Bootstrap peers.

  6. In KaaSCephCluster on the cluster that should mirror pools, specify spec.cephClusterSpec.rbdMirror.peers with the obtained peer and pools to mirror:

    spec:
      cephClusterSpec:
        rbdMirror:
          ...
          peers:
          - site: <siteName>
            token: <bootstrapPeer>
            pools: [<mirroringPoolName1>, <mirroringPoolName2>, ...]
    

    Substitute <siteName> with the label assigned to the external Ceph cluster, <bootstrapPeer> with the token obtained in the previous step, and <mirroringPoolName> with names of pools that have the mirroring.mode parameter defined.

    For example:

    spec:
      cephClusterSpec:
        rbdMirror:
          ...
          peers:
          - site: cluster-b
            token: <base64-string>
            pools:
            - images-hdd
            - volumes-hdd
            - special-pool-ssd
    
  7. Verify that mirroring is enabled and each pool with spec.cephClusterSpec.pools.mirroring.mode defined has an external peer site:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l \
    "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
    rbd mirror pool info <mirroringPoolName>
    

    Substitute <mirroringPoolName> with the name of a pool with mirroring enabled.

  8. If you have set the image mirroring mode in the pools section, explicitly enable mirroring for each image with rbd within the pool:

    Note

    Execute the following command within the ceph-tools pod with ceph and rbd CLI.

    rbd mirror image enable <poolName>/<imageName> <imageMirroringMode>
    

    Substitute <poolName> with the name of a pool with the image mirroring mode, <imageName> with the name of an image stored in the specified pool. Substitute <imageMirroringMode> with one of:

    • journal - for mirroring to use the RBD journaling image feature to replicate the image contents. If the RBD journaling image feature is not yet enabled on the image, it will be enabled automatically.

    • snapshot - for mirroring to use RBD image mirror-snapshots to replicate the image contents. Once enabled, an initial mirror-snapshot will automatically be created. To create additional RBD image mirror-snapshots, use the rbd command.

    For details, see Ceph Documentation: Enable image mirroring.
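
The verification in step 7 can also be run non-interactively for several pools at once. The sketch below assumes that the tools pod is reachable as deploy/rook-ceph-tools, the form used elsewhere in this guide; mirror_pool_info is a hypothetical helper name, and the pool names in the example call are taken from the examples above.

```shell
# Hypothetical helper: prints "rbd mirror pool info" for every pool passed
# as an argument. Assumes kubectl access to the cluster that runs Ceph and
# that the tools pod is exposed as deploy/rook-ceph-tools.
mirror_pool_info() {
  for pool in "$@"; do
    echo "== $pool =="
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- \
      rbd mirror pool info "$pool" || return 1
  done
}

# Example call:
# mirror_pool_info images-hdd volumes-hdd
```

Running the helper on both clusters gives a quick view of whether each mirrored pool reports a peer site.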

Configure Ceph Shared File System (CephFS)

Available since MCC 2.23.1 (Cluster release 12.7.0)

Caution

Since Ceph Pacific, Ceph CSI driver does not propagate the 777 permission on the mount point of persistent volumes based on any StorageClass of the CephFS data pool.

The Ceph Shared File System, or CephFS, provides the capability to create read/write shared file system Persistent Volumes (PVs). These PVs support the ReadWriteMany access mode for the FileSystem volume mode. CephFS deploys its own daemons called MetaData Servers or Ceph MDS. For details, see Ceph Documentation: Ceph File System.

Note

By design, CephFS data pool and metadata pool must be replicated only.

Limitations

  • CephFS is supported as a Kubernetes CSI plugin that only supports creating Kubernetes Persistent Volumes based on the FileSystem volume mode. For a complete modes support matrix, see Ceph CSI: Support Matrix.

  • Before MOSK 25.1, Ceph Controller supports only one CephFS installation per Ceph cluster.

  • Re-creating the CephFS instance in a cluster requires a different value for the name parameter.

CephFS specification

The KaaSCephCluster CR includes the spec.cephClusterSpec.sharedFilesystem.cephFS section with the following CephFS parameters:

CephFS specification

Parameter

Description

name

CephFS instance name.

dataPools

A list of CephFS data pool specifications. Each spec contains the name, replicated or erasureCoded, deviceClass, and failureDomain parameters. The first pool in the list is treated as the default data pool for CephFS and must always be replicated. The failureDomain parameter may be set to osd or host, defining the failure domain across which the data will be spread. The number of data pools is unlimited, but the default pool must always be present. For example:

cephClusterSpec:
  sharedFilesystem:
    cephFS:
    - name: cephfs-store
      dataPools:
      - name: default-pool
        deviceClass: ssd
        replicated:
          size: 3
        failureDomain: host
      - name: second-pool
        deviceClass: hdd
        erasureCoded:
          dataChunks: 2
          codingChunks: 1

Where replicated.size is the number of full copies of data on multiple nodes.

Warning

When using the non-recommended Ceph pools replicated.size of less than 3, Ceph OSD removal cannot be performed. The minimal replica size equals a rounded up half of the specified replicated.size.

For example, if replicated.size is 2, the minimal replica size is 1, and if replicated.size is 3, then the minimal replica size is 2. The replica size of 1 allows Ceph having PGs with only one Ceph OSD in the acting state, which may cause a PG_TOO_DEGRADED health warning that blocks Ceph OSD removal. Mirantis recommends setting replicated.size to 3 for each Ceph pool.

Warning

Modifying dataPools on a deployed CephFS has no effect. You can manually adjust pool settings through the Ceph CLI. However, for any changes in dataPools, Mirantis recommends re-creating CephFS.

metadataPool

CephFS metadata pool spec that should only contain replicated, deviceClass, and failureDomain parameters. The failureDomain parameter may be set to osd or host, defining the failure domain across which the data will be spread. Can use only replicated settings. For example:

cephClusterSpec:
  sharedFilesystem:
    cephFS:
     - name: cephfs-store
       metadataPool:
         deviceClass: nvme
         replicated:
           size: 3
         failureDomain: host

Where replicated.size is the number of full copies of data on multiple nodes.

Warning

Modifying metadataPool on a deployed CephFS has no effect. You can manually adjust pool settings through the Ceph CLI. However, for any changes in metadataPool, Mirantis recommends re-creating CephFS.

preserveFilesystemOnDelete

Defines whether to delete the data and metadata pools if CephFS is deleted. Set to true to avoid accidental data loss in case of human error. However, for security reasons, Mirantis recommends setting preserveFilesystemOnDelete to false.

metadataServer

Metadata Server settings correspond to the Ceph MDS daemon settings. Contains the following fields:

  • activeCount - the number of active Ceph MDS instances. As load increases, CephFS will automatically partition the file system across the Ceph MDS instances. Rook will create double the number of Ceph MDS instances requested by activeCount. The extra instances will be in the standby mode for failover. Mirantis recommends setting this parameter to 1 and increasing the MDS daemons count only in case of high load.

  • activeStandby - defines whether the extra Ceph MDS instances will be in active standby mode and will keep a warm cache of the file system metadata for faster failover. The instances will be assigned by CephFS in failover pairs. If false, the extra Ceph MDS instances will all be in passive standby mode and will not maintain a warm cache of the metadata. The default value is false.

  • resources - represents Kubernetes resource requirements for Ceph MDS pods.

For example:

cephClusterSpec:
  sharedFilesystem:
    cephFS:
    - name: cephfs-store
      metadataServer:
        activeCount: 1
        activeStandby: false
        resources: # example, non-prod values
          requests:
            memory: 1Gi
            cpu: 1
          limits:
            memory: 2Gi
            cpu: 2
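
The arithmetic behind two of the parameters above can be sketched in shell. This is only an illustration of the rules stated in this section (the minimal replica size is half of replicated.size rounded up, and Rook creates double the activeCount of Ceph MDS instances); the function names are hypothetical.

```shell
# Minimal replica size: half of replicated.size, rounded up.
min_replica_size() {
  echo $(( ($1 + 1) / 2 ))
}

# Total number of Ceph MDS instances: Rook creates double the activeCount,
# the extra instances stay in standby.
mds_instance_count() {
  echo $(( $1 * 2 ))
}

min_replica_size 2      # prints 1
min_replica_size 3      # prints 2
mds_instance_count 1    # prints 2
```

With replicated.size set to 2, the result of 1 is the case that the warning above describes as blocking Ceph OSD removal.
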
Configure CephFS
  1. Optional. Override the CSI CephFS gRPC and liveness metrics ports, for example, if an application already uses the default CephFS ports 9092 and 9082, which may cause conflicts on the node.

    1. Open the Cluster CR of a managed cluster for editing:

      kubectl -n <managedClusterProjectName> edit cluster
      

      Substitute <managedClusterProjectName> with the corresponding value.

    2. In the spec.providerSpec.helmReleases section, configure csiCephFsGPCMetricsPort and csiCephFsLivenessMetricsPort as required. For example:

      spec:
        providerSpec:
          helmReleases:
          ...
          - name: ceph-controller
            ...
            values:
              ...
              rookExtraConfig:
                csiCephFsEnabled: true
                csiCephFsGPCMetricsPort: "9092" # should be a string
                csiCephFsLivenessMetricsPort: "9082" # should be a string
      

    Rook will enable the CephFS CSI plugin and provisioner.

  2. Open the KaaSCephCluster CR of a managed cluster for editing:

    kubectl edit kaascephcluster -n <managedClusterProjectName>
    

    Substitute <managedClusterProjectName> with the corresponding value.

  3. In the sharedFilesystem section, specify parameters according to CephFS specification. For example:

    spec:
      cephClusterSpec:
        sharedFilesystem:
          cephFS:
          - name: cephfs-store
            dataPools:
            - name: cephfs-pool-1
              deviceClass: hdd
              replicated:
                size: 3
              failureDomain: host
            metadataPool:
              deviceClass: nvme
              replicated:
                size: 3
              failureDomain: host
            metadataServer:
              activeCount: 1
              activeStandby: false
    
  4. Define the mds role for the corresponding nodes where Ceph MDS daemons should be deployed. Mirantis recommends labeling only one node with the mds role. For example:

    spec:
      cephClusterSpec:
        nodes:
          ...
          worker-1:
            roles:
            ...
            - mds
    

Once CephFS is specified in the KaaSCephCluster CR, Ceph Controller will validate it and request Rook to create CephFS. Then Ceph Controller will create a Kubernetes StorageClass, required to start provisioning the storage, which will operate the CephFS CSI driver to create Kubernetes PVs.
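
As an illustration of how the resulting StorageClass might be consumed, the following shell sketch prints a PersistentVolumeClaim manifest that requests a ReadWriteMany volume in the Filesystem volume mode. The helper name render_cephfs_pvc, the claim name, and the size are hypothetical, and the StorageClass name is left as a placeholder because this section does not state it; the real name can be looked up with kubectl get storageclass on the managed cluster.

```shell
# Hypothetical helper: prints a PVC manifest for a shared CephFS volume.
# Arguments: <claimName> <storageClassName> <size>
render_cephfs_pvc() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  accessModes:
  - ReadWriteMany
  volumeMode: Filesystem
  storageClassName: $2
  resources:
    requests:
      storage: $3
EOF
}

# Example call; replace the placeholder with the StorageClass created for CephFS:
# render_cephfs_pvc shared-data <cephfsStorageClass> 1Gi | kubectl apply -f -
```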

Share Ceph across two managed clusters

Available since MCC 2.23.1 (Cluster release 12.7.0) TechPreview

This section describes how to share a Ceph cluster with another managed cluster of the same management cluster and how to manage such Ceph cluster.

A shared Ceph cluster allows connecting a consumer cluster to a producer cluster. The consumer cluster uses the Ceph cluster deployed on the producer to store the necessary data. In other words, the producer cluster contains the Ceph cluster with the mon, mgr, osd, and mds daemons, and the consumer cluster contains clients that require access to the Ceph storage.

For example, an NGINX application that runs in a cluster without storage requires a persistent volume to store data. In this case, such a cluster can connect to a Ceph cluster and use it as a block or file storage.

Limitations

  • Before MCC 2.24.2 (Cluster release 15.0.1), connection to a shared Ceph cluster is possible only through the client.admin user.

  • The producer and consumer clusters must be located in the same management cluster.

  • The LCM network of the producer cluster must be available in the consumer cluster.

Plan a shared Ceph cluster

To plan a shared Ceph cluster, select resources to share on the producer Ceph cluster:

  • Select the RADOS Block Device (RBD) pools to share from the Ceph cluster

  • Select the CephFS name to share from the Ceph cluster

To obtain resources to share on the producer Ceph cluster:

  1. Open the KaaSCephCluster object.

  2. In spec.cephClusterSpec.pools, identify the Ceph cluster pools assigned to RBD pools.

    To obtain full names of RBD pools:

    kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd lspools
    

    Example of system response:

    ...
    2 kubernetes-hdd
    3 anotherpool-hdd
    ...
    

    In the example above, kubernetes-hdd and anotherpool-hdd are RBD pools.

  3. In spec.cephClusterSpec.sharedFilesystem, identify the CephFS name, for example:

    spec:
     cephClusterSpec:
       sharedFilesystem:
         cephFS:
         - name: cephfs-store
           dataPools:
           - name: cephfs-pool-1
             deviceClass: hdd
             replicated:
               size: 3
             failureDomain: host
           metadataPool:
             deviceClass: nvme
             replicated:
               size: 3
             failureDomain: host
           metadataServer:
             activeCount: 1
             activeStandby: false
    

    In the example above, the CephFS name is cephfs-store.

Create a Ceph non-admin client for a shared Ceph cluster

Available since MCC 2.24.2 (Cluster release 15.0.1)

Note

Before MCC 2.24.2 (Cluster release 15.0.1), skip this section and proceed to Connect the producer to the consumer.

Ceph requires a non-admin client to share the producer cluster resources with the consumer cluster. To connect the consumer cluster with the producer cluster, the Ceph client requires the following caps (permissions):

  • Read-write access to Ceph Managers

  • Read and role-definer access to Ceph Monitors

  • Read-write access to Ceph Metadata servers if CephFS pools must be shared

  • Profile access to the shared RBD and CephFS pools for Ceph OSDs

To create a Ceph non-admin client, add the following snippet to the clients section of the KaaSCephCluster object:

spec:
  cephClusterSpec:
    clients:
    - name: <nonAdminClientName>
      caps:
        mgr: "allow rw"
        mon: "allow r, profile role-definer"
        mds: "allow rw" # if CephFS must be shared
        osd: <poolsProfileCaps>

Substitute <nonAdminClientName> with a Ceph non-admin client name and <poolsProfileCaps> with a comma-separated profile list of RBD and CephFS pools in the following format:

  • profile rbd pool=<rbdPoolName> for each RBD pool

  • allow rw tag cephfs data=<cephFsName> for each CephFS pool

For example:

spec:
  cephClusterSpec:
    clients:
    - name: non-admin-client
      caps:
        mgr: "allow rw"
        mon: "allow r, profile role-definer"
        mds: "allow rw"
        osd: "profile rbd pool=kubernetes-hdd,profile rbd pool=anotherpool-hdd,allow rw tag cephfs data=cephfs-store"
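
The osd value can be assembled mechanically from the lists of shared RBD pools and CephFS names. The following shell sketch shows one way to do it; build_osd_caps is a hypothetical helper, not a product command, and it only concatenates the two profile formats listed above.

```shell
# Hypothetical helper: builds the comma-separated osd caps string.
# Arguments: "<RBD pool names, space-separated>" "<CephFS names, space-separated>"
build_osd_caps() (
  caps=""
  for pool in ${1:-}; do
    caps="${caps:+$caps,}profile rbd pool=$pool"
  done
  for fs in ${2:-}; do
    caps="${caps:+$caps,}allow rw tag cephfs data=$fs"
  done
  printf '%s\n' "$caps"
)

build_osd_caps "kubernetes-hdd anotherpool-hdd" "cephfs-store"
# prints: profile rbd pool=kubernetes-hdd,profile rbd pool=anotherpool-hdd,allow rw tag cephfs data=cephfs-store
```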

To verify the status of the created Ceph client, inspect the status section of the KaaSCephCluster object. For example:

status:
  fullClusterInfo:
    blockStorageStatus:
      clientsStatus:
        non-admin-client:
          present: true
          status: Ready
  ...
  miraCephSecretsInfo:
     lastSecretCheck: "2023-05-19T12:18:16Z"
     lastSecretUpdate: "2023-05-19T12:18:16Z"
     secretInfo:
       clientSecrets:
       ...
       - name: client.non-admin-client
         secretName: rook-ceph-client-non-admin-client
         secretNamespace: rook-ceph
     state: Ready
Connect the producer to the consumer
  1. Enable the ceph-controller Helm release in the consumer cluster:

    1. Open the Cluster object for editing:

      kubectl -n <consumerClusterProjectName> edit cluster <consumerClusterName>
      
    2. In the spec section, add the ceph-controller Helm release:

      spec:
        providerSpec:
          value:
            helmReleases:
            - name: ceph-controller
              values: {}
      
  2. Obtain namespace/name of the consumer cluster:

    kubectl -n <consumerClusterProjectName> get cluster -o jsonpath='{range .items[*]}{@.metadata.namespace}{"/"}{@.metadata.name}{"\n"}{end}'
    

    Example output:

    managed-ns/managed-cluster
    
  3. Since MCC 2.24.2 (Cluster release 15.0.1), obtain the previously created Ceph non-admin client as described in Create a Ceph non-admin client for a shared Ceph cluster to use it as <clientName> in the following step.

    Note

    For backward compatibility, the Ceph client.admin client is available as <clientName>. However, Mirantis does not recommend using client.admin for security reasons.

  4. Connect to the producer cluster and generate connectionString. Proceed according to the MCC version used:

    1. Create a KaaSCephOperationRequest resource in a managed cluster namespace of the management cluster:

      apiVersion: kaas.mirantis.com/v1alpha1
      kind: KaaSCephOperationRequest
      metadata:
        name: test-share-request
        namespace: <managedClusterProject>
      spec:
        k8sCluster:
          name: <managedClusterName>
          namespace: <managedClusterProject>
        kaasCephCluster:
          name: <managedKaaSCephClusterName>
          namespace: <managedClusterProject>
        share:
          clientName: <clientName>
          clusterID: <namespace/name>
          opts:
            cephFS: true # if the consumer cluster will use the CephFS storage
      
    2. After KaaSCephOperationRequest is applied, wait until the Prepared state displays in the status.shareStatus section.

    3. Obtain connectionString from the status.shareStatus section. The example of the status section:

      status:
        kaasRequestState: ok
        phase: Completed
        shareStatus:
          connectionString: |
            674a68494da7d135e5416f6566818c0b5da72e5cc44127308ba670a591db30824e814aa9cc45b6f07176d3f907de4f89292587cbd0e8f8fd71ec508dc9ed9ee36a8b87db3e3aa9c0688af916091b938ac0bd825d18fbcd548adb8821859c1d3edaf5f4a37ad93891a294fbcc39e3dc40e281ba19548f5b751fab2023a8e1a340d6e884514b478832880766e80ab047bf07e69f9c598b43820cc5d9874790e0f526851d3d2f3ce1897d98b02d560180f6214164aee04f20286d595cec0c54a2a7bd0437e906fc9019ab06b00e1ba1b1c47fe611bb759c0e0ff251181cb57672dd76c2bf3ca6dd0e8625c84102eeb88769a86d712eb1a989a5c895bd42d47107bc8105588d34860fadaa71a927329fc961f82e2737fe07b68d7239b3a9817014337096bcb076051c5e2a0ee83bf6c1cc2cb494f57fef9c5306361b6c0143501467f0ec14e4f58167a2d97f2efcb0a49630c2f1a066fe4796b41ae73fe8df4213de3a39b7049e6a186dda0866d2535bbf943cb7d7bb178ad3f5f12e3351194808af687de79986c137d245ceeb4fbc3af1b625aa83e2b269f24b56bc100c0890c7c9a4e02cf1aa9565b64e86a038af2b0b9d2eeaac1f9e5e2daa086c00bf404e5a4a5c0aeb6e91fe983efda54a6aa983f50b94e181f88577f6a8029250f6f884658ceafbc915f54efc8fd3db993a51ea5a094a5d7db71ae556b8fa6864682baccc2118f3971e8c4010f6f23cc7b727f569d0
          state: Prepared
      

    Alternatively, depending on the MCC version used, connect to the producer cluster and generate connectionString in the ceph-controller Pod, specifying the Ceph client name:

    Note

    If the consumer cluster will use the CephFS storage, add the --cephfs-enabled flag to the ceph-cluster-connector command.

    kubectl -n ceph-lcm-mirantis exec -it deploy/ceph-controller -c ceph-controller -- sh
    ceph-cluster-connector --cluster-id <clusterNamespacedName> --client-name <clientName> --verbose
    

    Substitute the following parameters:

    • <clusterNamespacedName> with namespace/name of the consumer cluster

    • <clientName> with the Ceph client name from the previous step in the client.<name> format. For example, client.non-admin-client.

    Example of a positive system response:

    I1221 14:20:29.921024     139 main.go:17] Connector code version: 1.0.0-mcc-dev-ebcd6677
    I1221 14:20:29.921085     139 main.go:18] Go Version: go1.18.8
    I1221 14:20:29.921097     139 main.go:19] Go OS/Arch: linux/amd64
    I1221 14:20:30.801832     139 connector.go:71] Your connection string is:
    d0e64654d0551e7c3a940b8f460838261248193365a7115e54a3424aa2ad122e9a85bd12ec453ca5a092c37f6238e81142cf839fd15a4cd6aafa1238358cb50133d21b1656641541bd6c3bbcad220e8a959512ef11461d14fb11fd0c6110a54ed7e9a5f61eb677771cd5c8e6a6275eb7185e0b3e49e934c0ee08c6c2f37a669fc1754570cfdf893d0918fa91d802c2d36045dfc898803e423639994c2f21b03880202dfb9ed6e784f058ccf172d1bee78d7b20674652132886a80b0a8c806e23d9f69e9d0c7473d8caf24aaf014625727cbe08146e744bf0cf8f37825521d038
    

    Before MCC 2.24.2 (Cluster release 15.0.1), connect to the producer cluster and generate connectionString in the ceph-controller Pod without specifying the Ceph client name:

    Note

    If the consumer cluster will use the CephFS storage, add the --cephfs-enabled flag to the ceph-cluster-connector command.

    kubectl -n ceph-lcm-mirantis exec -it deploy/ceph-controller -c ceph-controller -- sh
    ceph-cluster-connector --cluster-id <clusterNamespacedName>
    

    Substitute <clusterNamespacedName> with namespace/name of the consumer cluster.

    Example of a positive system response:

    I1221 14:20:29.921024     139 main.go:17] Connector code version: 1.0.0-mcc-dev-ebcd6677
    I1221 14:20:29.921085     139 main.go:18] Go Version: go1.18.8
    I1221 14:20:29.921097     139 main.go:19] Go OS/Arch: linux/amd64
    I1221 14:20:30.801832     139 connector.go:71] Your connection string is:
    d0e64654d0551e7c3a940b8f460838261248193365a7115e54a3424aa2ad122e9a85bd12ec453ca5a092c37f6238e81142cf839fd15a4cd6aafa1238358cb50133d21b1656641541bd6c3bbcad220e8a959512ef11461d14fb11fd0c6110a54ed7e9a5f61eb677771cd5c8e6a6275eb7185e0b3e49e934c0ee08c6c2f37a669fc1754570cfdf893d0918fa91d802c2d36045dfc898803e423639994c2f21b03880202dfb9ed6e784f058ccf172d1bee78d7b20674652132886a80b0a8c806e23d9f69e9d0c7473d8caf24aaf014625727cbe08146e744bf0cf8f37825521d038
    
  5. Create the consumer KaaSCephCluster object file, for example, consumer-kcc.yaml with the following content:

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephCluster
    metadata:
      name: <clusterName>
      namespace: <consumerClusterProjectName>
    spec:
      cephClusterSpec:
        external:
          enable: true
          connectionString: <generatedConnectionString>
        network:
          clusterNet: <clusterNetCIDR>
          publicNet: <publicNetCIDR>
        nodes: {}
      k8sCluster:
        name: <clusterName>
        namespace: <consumerClusterProjectName>
    

    Specify the following values:

    • <consumerClusterProjectName> is the project name of the consumer managed cluster on the management cluster.

    • <clusterName> is the consumer managed cluster name.

    • <generatedConnectionString> is the connection string generated in the previous step.

    • <clusterNetCIDR> and <publicNetCIDR> are values that must match the same values in the producer KaaSCephCluster object.

    Note

    The spec.cephClusterSpec.network and spec.cephClusterSpec.nodes parameters are mandatory.

    The connectionString parameter is specified in the spec.cephClusterSpec.external section of the KaaSCephCluster CR. The parameter contains an encrypted string with all the configurations needed to connect the consumer cluster to the shared Ceph cluster.

  6. Apply consumer-kcc.yaml on the management cluster:

    kubectl apply -f consumer-kcc.yaml
    

Once the Ceph cluster is specified in the KaaSCephCluster CR of the consumer cluster, Ceph Controller validates it and requests Rook to connect the consumer and producer.
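
When connectionString is generated through a KaaSCephOperationRequest resource, waiting for the Prepared state and reading the string can be scripted. The sketch below is an assumption-laden illustration: get_connection_string is a hypothetical helper, the resource is addressed as kaascephoperationrequest on the assumption that kubectl resolves the lowercase kind name, and the status paths are taken from the status example above. Run it against the management cluster.

```shell
# Hypothetical helper: polls the share request until
# status.shareStatus.state is Prepared, then prints
# status.shareStatus.connectionString.
# Arguments: <managedClusterProject> <requestName> [<attempts>, default 30]
get_connection_string() (
  ns=$1
  name=$2
  tries=${3:-30}
  while [ "$tries" -gt 0 ]; do
    state=$(kubectl -n "$ns" get kaascephoperationrequest "$name" \
      -o jsonpath='{.status.shareStatus.state}') || exit 1
    if [ "$state" = "Prepared" ]; then
      kubectl -n "$ns" get kaascephoperationrequest "$name" \
        -o jsonpath='{.status.shareStatus.connectionString}'
      exit
    fi
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then
      sleep 10   # arbitrary polling interval
    fi
  done
  echo "share request $name did not reach the Prepared state" >&2
  exit 1
)

# Example call; the request name comes from the example manifest above:
# get_connection_string <managedClusterProject> test-share-request
```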

Consume pools from the Ceph cluster
  1. Open the KaaSCephCluster CR of the consumer cluster for editing:

    kubectl -n <managedClusterProjectName> edit kaascephcluster
    

    Substitute <managedClusterProjectName> with the corresponding value.

  2. In spec.cephClusterSpec.pools, specify pools from the producer cluster to be used by the consumer cluster. For example:

    Caution

    Each name in the pools section must match the corresponding full pool name of the producer cluster. You can find full pool names in the KaaSCephCluster CR by the following path: status.fullClusterInfo.blockStorageStatus.poolsStatus.

    spec:
      cephClusterSpec:
        pools:
        - default: true
          deviceClass: ssd
          useAsFullName: true
          name: kubernetes-ssd
          role: kubernetes-ssd
        - default: false
          deviceClass: hdd
          useAsFullName: true
          name: volumes-hdd
          role: volumes
    

After specifying pools in the consumer KaaSCephCluster CR, Ceph Controller creates a corresponding StorageClass for each specified pool, which can be used for creating ReadWriteOnce persistent volumes (PVs) in the consumer cluster.

Enable CephFS on a consumer Ceph cluster
  1. Open the KaaSCephCluster CR of the consumer cluster for editing:

    kubectl -n <managedClusterProjectName> edit kaascephcluster
    

    Substitute <managedClusterProjectName> with the corresponding value.

  2. In the sharedFilesystem section of the consumer cluster, specify the dataPools to share.

    Note

    Sharing CephFS also requires specifying the metadataPool and metadataServer sections similarly to the corresponding sections of the producer cluster. For details, see CephFS specification.

    For example:

    spec:
      cephClusterSpec:
        sharedFilesystem:
          cephFS:
          - name: cephfs-store
            dataPools:
            - name: cephfs-pool-1
              replicated:
                size: 3
              failureDomain: host
            metadataPool:
              replicated:
                size: 3
              failureDomain: host
            metadataServer:
              activeCount: 1
              activeStandby: false
    

After specifying CephFS in the KaaSCephCluster CR of the consumer cluster, Ceph Controller creates a corresponding StorageClass that allows creating ReadWriteMany (RWX) PVs in the consumer cluster.

Specify placement of Ceph cluster daemons

If you need to configure the placement of Rook daemons on nodes, you can add extra values in the Cluster providerSpec section of the ceph-controller Helm release.

The procedures in this section describe how to specify the placement of rook-ceph-operator, rook-discover, and csi-rbdplugin.

To specify rook-ceph-operator placement:

  1. On the management cluster, edit the Cluster resource of the target managed cluster:

    kubectl -n <managedClusterProjectName> edit cluster
    
  2. Add the following parameters to the ceph-controller Helm release values:

    spec:
      providerSpec:
        value:
          helmReleases:
          - name: ceph-controller
            values:
              rookOperatorPlacement:
                affinity: <rookOperatorAffinity>
                nodeSelector: <rookOperatorNodeSelector>
                tolerations: <rookOperatorTolerations>
    
    • <rookOperatorAffinity> is a key-value mapping that contains a valid Kubernetes affinity specification

    • <rookOperatorNodeSelector> is a key-value mapping that contains a valid Kubernetes nodeSelector specification

    • <rookOperatorTolerations> is a list that contains valid Kubernetes toleration items

  3. Wait for some time and verify on a managed cluster that the changes have been applied:

    kubectl -n rook-ceph get deploy rook-ceph-operator -o yaml
    

To specify rook-discover and csi-rbdplugin placement simultaneously:

  1. On the management cluster, edit the desired Cluster resource:

    kubectl -n <managedClusterProjectName> edit cluster
    
  2. Add the following parameters to the ceph-controller Helm release values:

    spec:
      providerSpec:
        value:
          helmReleases:
          - name: ceph-controller
            values:
              rookExtraConfig:
                extraDaemonsetLabels: <labelSelector>
    

    Substitute <labelSelector> with a valid Kubernetes label selector expression to place the rook-discover and csi-rbdplugin DaemonSet pods.

  3. Wait for some time and verify on a managed cluster that the changes have been applied:

    kubectl -n rook-ceph get ds rook-discover -o yaml
    kubectl -n rook-ceph get ds csi-rbdplugin -o yaml
    

To specify rook-discover and csi-rbdplugin placement separately:

  1. On the management cluster, edit the desired Cluster resource:

    kubectl -n <managedClusterProjectName> edit cluster
    
  2. If required, add the following parameters to the ceph-controller Helm release values:

    spec:
      providerSpec:
        value:
          helmReleases:
          - name: ceph-controller
            values:
              hyperconverge:
                nodeAffinity:
                  csiplugin: <labelSelector1>
                  rookDiscover: <labelSelector2>
    

    Substitute <labelSelectorX> with a valid Kubernetes label selector expression to place the rook-discover and csi-rbdplugin DaemonSet pods. For example, "role=storage-node; discover=true".

  3. Wait for some time and verify on the managed cluster that the changes have been applied:

    kubectl -n rook-ceph get ds rook-discover -o yaml
    kubectl -n rook-ceph get ds csi-rbdplugin -o yaml
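
Instead of reading the full YAML output, the placement-related fields can be printed directly. The following shell sketch uses the standard Deployment and DaemonSet pod template paths; show_rook_placement is a hypothetical helper, and which of these fields Ceph Controller actually populates for each setting is an assumption to verify on your cluster.

```shell
# Hypothetical helper: prints the scheduling fields of the Rook workloads
# whose placement is configured in this section.
show_rook_placement() {
  echo "rook-ceph-operator:"
  kubectl -n rook-ceph get deploy rook-ceph-operator \
    -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}{.spec.template.spec.tolerations}{"\n"}{.spec.template.spec.affinity}{"\n"}' || return 1
  for ds in rook-discover csi-rbdplugin; do
    echo "$ds:"
    kubectl -n rook-ceph get ds "$ds" \
      -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}{.spec.template.spec.affinity}{"\n"}' || return 1
  done
}

# Example call on the managed cluster:
# show_rook_placement
```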
    
Migrate Ceph pools from one failure domain to another

This section describes how to change the failure domain of an already deployed Ceph cluster.

Note

This section focuses on changing the failure domain from a smaller to a wider one, for example, from host to rack. Using the same instructions, you can move the failure domain from a wider to a smaller scale.

Caution

Data movement implies the Ceph cluster rebalancing that may impact cluster performance, depending on the cluster size.

The high-level overview of the procedure includes the following steps:

  1. Set correct labels on the nodes.

  2. Create the new bucket hierarchy.

  3. Move nodes to new buckets.

  4. Modify the CRUSH rules.

  5. Add the manual changes to the KaaSCephCluster spec.

  6. Scale the Ceph controllers.

Prerequisites
  1. Verify that the Ceph cluster has enough space for multiple copies of data to migrate. Mirantis highly recommends that the Ceph cluster has a minimum of 25% of free space for the procedure to succeed.

    Note

    The migration procedure implies data movement and optional modification of CRUSH rules that cause a large amount of data (depending on the cluster size) to be first copied to a new location in the Ceph cluster before data removal.

  2. Create a backup of the current KaaSCephCluster object from the managed namespace of the management cluster:

    kubectl -n <managedClusterProject> get kaascephcluster -o yaml > kcc-backup.yaml
    

    Substitute <managedClusterProject> with the corresponding managed cluster namespace of the management cluster.

  3. In the rook-ceph-tools pod on a managed cluster, obtain a backup of the CRUSH map:

    ceph osd getcrushmap -o /tmp/crush-map-orig
    crushtool -d /tmp/crush-map-orig -o /tmp/crush-map-orig.txt
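
The 25% free space recommendation from the first prerequisite can be checked with simple integer arithmetic. The following is a minimal sketch, not a product tool: the function name has_enough_free_space is hypothetical, and the two capacity values are assumed to be the total and available raw capacity that ceph df reports, expressed in the same unit.

```shell
# Hypothetical helper: succeeds when at least <minFreePercent> (default 25)
# percent of the raw cluster capacity is free.
# Arguments: <total> <available> [minFreePercent], all non-negative integers,
# with <total> and <available> in the same unit (for example, GiB).
has_enough_free_space() {
  local total="$1" avail="$2" min_percent="${3:-25}" v
  # Reject missing or non-numeric arguments.
  for v in "$total" "$avail" "$min_percent"; do
    case "$v" in
      ''|*[!0-9]*) echo "arguments must be non-negative integers" >&2; return 2 ;;
    esac
  done
  # Reject a zero total capacity.
  [ "$total" -gt 0 ] || { echo "total capacity must be greater than 0" >&2; return 2; }
  # Compare avail/total with min_percent/100 without using division.
  [ $((avail * 100)) -ge $((total * min_percent)) ]
}
```

For example, has_enough_free_space 1000 300 succeeds, while has_enough_free_space 1000 200 fails because only 20% of the capacity is free.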
    
Migrate Ceph pools

This procedure contains an example of moving failure domains of all pools from host to rack. Using the same instruction, you can migrate pools from other types of failure domains, migrate pools separately, and so on.

To migrate Ceph pools from one failure domain to another:

  1. Set the required CRUSH topology in the KaaSCephCluster object for each defined node. For details on the crush parameter, see Node parameters.

    Setting the CRUSH topology to each node causes the Ceph Controller to set proper Kubernetes labels on the nodes.

    Example of adding the rack CRUSH topology key for each node in the nodes section
    spec:
      cephClusterSpec:
        nodes:
          machine1:
            crush:
              rack: rack-1
          machine2:
            crush:
              rack: rack-1
          machine3:
            crush:
              rack: rack-2
          machine4:
            crush:
              rack: rack-2
          machine5:
            crush:
              rack: rack-3
          machine6:
            crush:
              rack: rack-3
    
  2. On the managed cluster, verify that the required buckets and bucket types are present in the Ceph hierarchy:

    1. Enter the ceph-tools pod:

      kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
      
    2. Verify that the required bucket type is present by default:

      ceph osd getcrushmap -o /tmp/crush-map
      crushtool -d /tmp/crush-map -o /tmp/crush-map.txt
      cat /tmp/crush-map.txt # Look for the section named "# types"
      

      Example of system response:

      # types
      type 0 osd
      type 1 host
      type 2 chassis
      type 3 rack
      type 4 row
      type 5 pdu
      type 6 pod
      type 7 room
      type 8 datacenter
      type 9 zone
      type 10 region
      type 11 root
      
    3. Verify that the buckets with the required bucket type are present:

      cat /tmp/crush-map.txt # Look for the section named "# buckets"
      

      Example of system response of an existing rack bucket:

      # buckets
      rack rack-1 {
        id -15
        id -16 class hdd
        # weight 0.00000
        alg straw2
        hash 0
      }
      
    4. If the required buckets are not created, create new ones with the required bucket type:

      ceph osd crush add-bucket <bucketName> <bucketType> root=default
      

      For example:

      ceph osd crush add-bucket rack-1 rack root=default
      ceph osd crush add-bucket rack-2 rack root=default
      ceph osd crush add-bucket rack-3 rack root=default
      
    5. Exit the ceph-tools pod.

  3. Optional. Order buckets as required:

    1. On the managed cluster, add the first Ceph CRUSH smaller bucket to its respective wider bucket:

      kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
      ceph osd crush move <smallerBucketName> <bucketType>=<widerBucketName>
      

      Substitute the following parameters:

      • <smallerBucketName> with the name of the smaller bucket, for example host name

      • <bucketType> with the required bucket type, for example rack

      • <widerBucketName> with the name of the wider bucket, for example rack name

      For example:

      ceph osd crush move kaas-node-1 rack=rack-1 root=default
      

      Warning

      Mirantis highly recommends moving one bucket at a time.

      For more details, refer to official Ceph documentation: CRUSH Maps: Moving a bucket.

    2. After the bucket is moved to the new location in the CRUSH hierarchy, verify that no data rebalancing is in progress:

      ceph -s
      

      Caution

      Wait for rebalancing to complete before proceeding to the next step.

    3. Add the remaining Ceph CRUSH smaller buckets to their respective wider buckets one by one.

  4. Scale the Ceph Controller and Rook Operator deployments to 0 replicas:

    kubectl -n ceph-lcm-mirantis scale deploy --all --replicas 0
    kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
    
  5. On the managed cluster, manually modify the CRUSH rules for Ceph pools to enable data placement on a new failure domain:

    1. Enter the ceph-tools pod:

      kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
      
    2. List the CRUSH rules and erasure code profiles for the pools:

      ceph osd pool ls detail
      
      Example output
      pool 1 'mirablock-k8s-block-hdd' replicated size 2 min_size 1 crush_rule 9 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1193 lfor 0/0/85 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.31
      pool 2 '.mgr' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 70 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 6.06
      pool 3 'openstack-store.rgw.otp' replicated size 2 min_size 1 crush_rule 11 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 1197 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 2.27
      pool 4 'openstack-store.rgw.meta' replicated size 2 min_size 1 crush_rule 12 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 1197 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 1.50
      pool 5 'openstack-store.rgw.log' replicated size 2 min_size 1 crush_rule 10 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 1197 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 3.00
      pool 6 'openstack-store.rgw.buckets.non-ec' replicated size 2 min_size 1 crush_rule 13 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 1197 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 1.50
      pool 7 'openstack-store.rgw.buckets.index' replicated size 2 min_size 1 crush_rule 15 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 1197 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 2.25
      pool 8 '.rgw.root' replicated size 2 min_size 1 crush_rule 14 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 1197 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 3.75
      pool 9 'openstack-store.rgw.control' replicated size 2 min_size 1 crush_rule 16 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 1197 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw read_balance_score 3.00
      pool 10 'other-hdd' replicated size 2 min_size 1 crush_rule 19 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1179 lfor 0/0/85 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.69
      pool 11 'openstack-store.rgw.buckets.data' erasure profile openstack-store.rgw.buckets.data_ecprofile size 3 min_size 2 crush_rule 18 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1198 lfor 0/0/86 flags hashpspool,ec_overwrites stripe_width 8192 application rook-ceph-rgw
      pool 12 'vms-hdd' replicated size 2 min_size 1 crush_rule 21 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode on last_change 1182 lfor 0/0/95 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.4 application rbd read_balance_score 1.24
      pool 13 'volumes-hdd' replicated size 2 min_size 1 crush_rule 23 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 1185 lfor 0/0/89 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.2 application rbd read_balance_score 1.31
      pool 14 'backup-hdd' replicated size 2 min_size 1 crush_rule 25 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1188 lfor 0/0/90 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.1 application rbd read_balance_score 2.06
      pool 15 'images-hdd' replicated size 2 min_size 1 crush_rule 27 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1191 lfor 0/0/90 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.1 application rbd read_balance_score 1.50
      
    3. For each replicated Ceph pool:

      1. Obtain the current CRUSH rule name:

        ceph osd crush rule dump <oldCrushRuleName>
        
      2. Create a new CRUSH rule that uses the same root and device class as the current rule but the new bucket type:

        ceph osd crush rule create-replicated <newCrushRuleName> <root> <bucketType> <deviceClass>
        

        For example:

        ceph osd crush rule create-replicated images-hdd-rack default rack hdd
        

        For more details, refer to official Ceph documentation: CRUSH Maps: Creating a rule for a replicated pool.

      3. Apply the new CRUSH rule to the Ceph pool:

        ceph osd pool set <poolName> crush_rule <newCrushRuleName>
        

        For example:

        ceph osd pool set images-hdd crush_rule images-hdd-rack
        
      4. Wait for data to be rebalanced after moving the Ceph pool under the new failure domain (bucket type) by monitoring Ceph health:

        ceph -s
        

        Caution

        Update the next Ceph pool only after data rebalancing completes for the current Ceph pool.

      5. Verify that the old CRUSH rule is not used anymore:

        ceph osd pool ls detail
        

        The rule ID is located in the CRUSH map and must match the rule ID in the output of ceph osd dump.

      6. Remove the old unused CRUSH rule and rename the new one to the original name:

        ceph osd crush rule rm <oldCrushRuleName>
        ceph osd crush rule rename <newCrushRuleName> <oldCrushRuleName>
        
    4. For each erasure-coded Ceph pool:

      Note

      Erasure-coded pools require a different number of buckets to store data. Instead of the number of replicas in replicated pools, erasure-coded pools require the coding chunks + data chunks number of buckets existing in the Ceph cluster. For example, if an erasure-coded pool has 2 coding chunks and 2 data chunks configured, then the pool requires 4 different buckets, for example, 4 racks, to store data.

      1. Obtain the current parameters of the erasure-coded profile:

        ceph osd erasure-code-profile get <ecProfile>
        
      2. In the profile, add the new bucket type as the failure domain using the crush-failure-domain parameter:

        ceph osd erasure-code-profile set <ecProfile> k=<int> m=<int> crush-failure-domain=<bucketType> crush-device-class=<deviceClass>
        
      3. Create a new CRUSH rule based on the profile:

        ceph osd crush rule create-erasure <newEcCrushRuleName> <ecProfile>
        
      4. Apply the new CRUSH rule to the pool:

        ceph osd pool set <poolName> crush_rule <newEcCrushRuleName>
        
      5. Wait for data to be rebalanced after moving the Ceph pool under the new failure domain (bucket type) by monitoring Ceph health:

        ceph -s
        

        Caution

        Update the next Ceph pool only after data rebalancing completes for the current Ceph pool.

      6. Verify that the old CRUSH rule is not used anymore:

        ceph osd pool ls detail
        

        The rule ID is located in the CRUSH map and must match the rule ID in the output of ceph osd dump.

      7. Remove the old unused CRUSH rule and rename the new one to the original name:

        ceph osd crush rule rm <oldCrushRuleName>
        ceph osd crush rule rename <newEcCrushRuleName> <oldCrushRuleName>
        

        Note

        New erasure-coded profiles cannot be renamed, so they will not be removed automatically during pools cleanup. Remove them manually, if needed.

    5. Exit the ceph-tools pod.

  6. In the management cluster, update the KaaSCephCluster object by setting the failureDomain: rack parameter for each pool. The configuration from the Rook perspective must match the manually created configuration. For example:

    spec:
      cephClusterSpec:
        pools:
        - name: images
          ...
          failureDomain: rack
        - name: volumes
          ...
          failureDomain: rack
        ...
        objectStorage:
          rgw:
            dataPool:
              failureDomain: rack
              ...
            metadataPool:
              failureDomain: rack
              ...
    
  7. Monitor the Ceph cluster health and wait until rebalancing is completed:

    ceph -s
    

    Example of a successful system response:

    HEALTH_OK
    
  8. Scale back the Ceph Controller and Rook Operator deployments:

    kubectl -n ceph-lcm-mirantis scale deploy --all --replicas 3
    kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
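
When many hosts have to be moved, it may help to generate the bucket commands first and review them before running anything. The following is a minimal sketch under stated assumptions: the function names plan_crush_moves and wait_for_health_ok are hypothetical, the target bucket type is rack, and the root is default, as in the examples above. The first function only prints commands. The second one is meant to be run in the ceph-tools pod between moves and assumes that HEALTH_OK is an acceptable signal that rebalancing has completed.

```shell
# Hypothetical dry-run helper: reads "<hostBucket> <rackBucket>" pairs from
# stdin and prints the add-bucket and move commands without executing them.
plan_crush_moves() {
  local host rack _rest seen=" " moves=""
  while read -r host rack _rest || [ -n "$host" ]; do
    # Skip blank lines and comment lines.
    [ -z "$host" ] && continue
    case "$host" in '#'*) continue ;; esac
    if [ -z "$rack" ]; then
      echo "missing rack for host '$host'" >&2
      return 1
    fi
    # Print the add-bucket command only once per rack.
    case "$seen" in
      *" $rack "*) ;;
      *)
        echo "ceph osd crush add-bucket $rack rack root=default"
        seen="$seen$rack "
        ;;
    esac
    # Collect the move commands so that they are printed after all buckets.
    moves="${moves}ceph osd crush move $host rack=$rack
"
  done
  printf '%s' "$moves"
}

# Hypothetical helper: blocks until "ceph health" reports HEALTH_OK.
# Argument: [intervalSeconds], 30 by default.
wait_for_health_ok() {
  local interval="${1:-30}"
  until ceph health | grep -q '^HEALTH_OK'; do
    sleep "$interval"
  done
}
```

For example, feeding the lines kaas-node-1 rack-1 and kaas-node-2 rack-1 to plan_crush_moves prints one add-bucket command for rack-1 followed by two move commands. Run the printed move commands one at a time and call wait_for_health_ok after each of them, in line with the recommendation to move one bucket at a time.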
    
Enable periodic Ceph performance testing

TechPreview

Warning

Performance testing affects the overall Ceph cluster performance. Do not run it unless you are sure that user load will not be affected.

This section describes how to configure periodic Ceph performance testing using Kubernetes batch or cron jobs that execute a fio process in a separate container with a connection to the Ceph cluster. The test results can then be stored in a persistent volume attached to the container.

Ceph performance testing is managed by the KaaSCephOperationRequest CR that creates separate CephPerfTestRequest requests to handle the test run. Once you configure the perfTest section of the KaaSCephOperationRequest spec, it propagates to CephPerfTestRequest on the managed cluster in the ceph-lcm-mirantis namespace. You can create a performance test for a single run or for scheduled runs.

Create a Ceph performance test request

TechPreview

Warning

Performance testing affects the overall Ceph cluster performance. Do not run it unless you are sure that user load will not be affected.

This section describes how to create a Ceph performance test request through the KaaSCephOperationRequest CR.

To create a Ceph performance test request:

  1. Create an RBD image with the required parameters. For example, run the following command in ceph-tools-container to allow execution of the perftest example below on a managed cluster:

    kubectl exec -ti -n rook-ceph <ceph-tools-pod> -- bash
    rbd create <pool_name>/<image_name> --size 10G
    

    Substitute <ceph-tools-pod> with the ceph-tools Pod ID, <pool_name> and <image_name> with pool and image names, and specify the size. In the example below, mirablock-k8s-block-hdd is used as pool name and tests as image name:

    kubectl exec -ti -n rook-ceph rook-ceph-tools-94985cd9f-tjl29 -- bash
    rbd create mirablock-k8s-block-hdd/tests --size 10G
    
  2. Create a YAML template for the KaaSCephOperationRequest CR and apply it. For details, see KaaSCephOperationRequest CR perftest specification.

    kubectl apply -f <example_file_name>.yaml
    
    Example of the KaaSCephOperationRequest resource
    apiVersion: kaas.mirantis.com/v1alpha1
    kind: KaaSCephOperationRequest
    metadata:
      name: test-perf-req
      namespace: managed-ns
    spec:
      kaasCephCluster:
        name: ceph-kaas-managed
        namespace: managed-ns
      perfTest:
        parameters:
        - --ioengine=rbd
        - --pool=mirablock-k8s-block-hdd
        - --rbdname=tests
        - --name=single_perftest
        - --rw=randrw:16k
        - --rwmixread=40
        - --bs=4k
        - --size=500M
        - --iodepth=32
        - --numjobs=8
        - --group_reporting
        - --direct=1
        - --fsync=32
        - --buffered=0
        - --exitall
    
  3. Review the KaaSCephOperationRequest status. For details, see KaaSCephOperationRequest perftest status.

    kubectl get kaascephoperationrequest test-perf-req -n managed-ns
    

    Example of system response:

    NAME            KAASCEPHCLUSTER    CLUSTER       AGE   PHASE       MESSAGE
    test-perf-req   ceph-kaas-managed  kaas-managed  20m   Completed
    
  4. Review the CephPerfTestRequest status on the managed cluster.

    kubectl get cephperftestrequest -n ceph-lcm-mirantis
    

    Example of system response:

    NAME            AGE   PHASE      START TIME             DURATION   JOB STATUS   SCHEDULE
    test-perf-req   55m   Finished   2022-06-17T09:29:57Z   5m53s      Completed
    
  5. Review the performance test result by inspecting logs for the corresponding job Pod on the managed cluster:

    kubectl --kubeconfig <managedKubeconfig> -n rook-ceph logs -l app=ceph-perftest,perftest=<name>
    

    Substitute <managedKubeconfig> with the managed cluster kubeconfig and <name> with the KaaSCephOperationRequest metadata.name, for example, test-perf-req.

  6. Optional. Remove the KaaSCephOperationRequest. Removal of KaaSCephOperationRequest also removes the CephPerfTestRequest CR propagated to the managed cluster.
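
Instead of checking the request status by hand, you can poll it until the test run ends. The following is a minimal sketch for a single run, not a product tool: the function names is_final_phase and wait_for_perftest are hypothetical, and the status.perfTestStatus.phase field path and the phase names are taken from KaaSCephOperationRequest perftest status.

```shell
# Hypothetical helper: succeeds when the given perfTestStatus phase is final
# for a single run, that is, Finished or Failed.
is_final_phase() {
  case "$1" in
    Finished|Failed) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: polls the request on the management cluster until the
# phase is final, then prints the phase.
# Arguments: <requestName> <namespace> [intervalSeconds]
wait_for_perftest() {
  local name="$1" ns="$2" interval="${3:-30}" phase
  while true; do
    phase="$(kubectl get kaascephoperationrequest "$name" -n "$ns" \
      -o jsonpath='{.status.perfTestStatus.phase}')"
    if is_final_phase "$phase"; then
      echo "$phase"
      return 0
    fi
    sleep "$interval"
  done
}
```

For example, wait_for_perftest test-perf-req managed-ns prints Finished or Failed once the run ends. After that, inspect the job Pod logs as described in step 5.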

KaaSCephOperationRequest CR perftest specification

TechPreview

This section describes the KaaSCephOperationRequest CR specification used to automatically create a CephPerfTestRequest request. For the procedure workflow, see Enable periodic Ceph performance testing.

Spec of the KaaSCephOperationRequest perftest high-level parameters

Parameter

Description

perfTest

Describes the definition for the CephPerfTestRequest spec. For details on the perfTest parameters, see the tables below.

kaasCephCluster

Defines the KaaSCephCluster object on which the KaaSCephOperationRequest depends. Use the kaasCephCluster parameter if the name or project of the corresponding Container Cloud cluster differs from the default one:

spec:
  kaasCephCluster:
    name: ceph-kaas-mgmt
    namespace: default

k8sCluster

Defines the cluster on which the KaaSCephOperationRequest depends. Use the k8sCluster parameter if the name or project of the corresponding Container Cloud cluster differs from the default one:

spec:
  k8sCluster:
    name: kaas-mgmt
    namespace: default

If you omit this parameter, ceph-kcc-controller will set it automatically.

Ceph performance test parameters

Parameter

Description

parameters

A list of command arguments for a performance test execution. For all available parameters, see fio documentation.

Note

Performance test results will be saved on a PVC if the test run parameters contain an argument to save to a file. Otherwise, test results will be saved only as Pod logs. For example, for the default fio image, use the --output=/results/<fileName> option to redirect to a file that will be saved on the attached PVC. Configuring a mount point is not supported.

command

Optional. Entrypoint command to run performance test in the container. If the performance image is updated, you may also update the command. By default, equals the image entry point.

image

Container image to use for jobs. By default, vineethac/fio_image. Mirantis recommends using the default fio image as it supports a multitude of I/O engines. For details, see fio man page.

periodic

Configuration of the performance test runs as periodic jobs. Leave empty if a single run is required. For details, see Ceph performance periodic parameters.

saveResultOnPvc

Option that enables saving of the performance test results on a PVC. Contains the following fields:

  • pvcName - PVC name to use. If not specified, a PVC name for the performance test will be created automatically. Namespace is static and equals rook-ceph.

  • pvcStorageClass - StorageClass to use for PVC. If not specified, the default storage class is used.

  • pvcSize - PVC size, defaults to 10Gi.

  • preserveOnDelete - PVC preservation after removal of the performance test.

Ceph performance periodic parameters

Parameter

Description

schedule

Required. Schedule in base cron format. For example, * * 30 2 *. The field format follows the Cron schedule syntax.

suspended

Pause CronJob scheduling to prevent performance test execution. Only for future scheduling.

runsToKeep

Number of runs to keep in history. Supported only by keeping old run Pods with their outputs.

Example of KaaSCephOperationRequest
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: test-managed-req
  namespace: managed-ns
spec:
  kaasCephCluster:
    name: ceph-cluster-managed-cluster
    namespace: managed-ns
  perfTest:
    parameters:
    - --ioengine=rbd
    - --pool=mirablock-k8s-block-hdd
    - --rbdname=tests
    - --name=single_perftest
    - --rw=randrw:16k
    - --rwmixread=40
    - --bs=4k
    - --size=500M
    - --iodepth=32
    - --numjobs=8
    - --group_reporting
    - --direct=1
    - --fsync=32
    - --buffered=0
    - --exitall
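
The example above describes a single run. For scheduled runs, the periodic and saveResultOnPvc parameters from the tables above can be combined in the same request. The following is a minimal sketch that writes such a manifest to a file. It is an assumption-laden illustration: the function name write_periodic_perftest, the request name, the schedule, the output file name, and all field values are hypothetical, and the nesting of periodic and saveResultOnPvc directly under perfTest as well as the boolean and integer value types are assumed from the parameter tables.

```shell
# Hypothetical helper: writes a periodic perftest manifest to the given path.
# Assumptions: periodic and saveResultOnPvc are nested under spec.perfTest,
# runsToKeep is an integer, suspended and preserveOnDelete are booleans.
write_periodic_perftest() {
  local path="$1"
  cat > "$path" <<'EOF'
apiVersion: kaas.mirantis.com/v1alpha1
kind: KaaSCephOperationRequest
metadata:
  name: test-periodic-req        # hypothetical name
  namespace: managed-ns
spec:
  kaasCephCluster:
    name: ceph-cluster-managed-cluster
    namespace: managed-ns
  perfTest:
    parameters:
    - --ioengine=rbd
    - --pool=mirablock-k8s-block-hdd
    - --rbdname=tests
    - --name=periodic_perftest
    - --rw=randrw
    - --bs=4k
    - --size=500M
    - --output=/results/periodic_perftest.log
    periodic:
      schedule: "0 2 * * 0"      # assumed example: 02:00 every Sunday
      runsToKeep: 3
      suspended: false
    saveResultOnPvc:
      pvcSize: 10Gi
      preserveOnDelete: true
EOF
}
```

For example, run write_periodic_perftest periodic-perftest.yaml, review the generated file, and then apply it with kubectl apply -f periodic-perftest.yaml on the management cluster.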
KaaSCephOperationRequest perftest status

TechPreview

This section describes the status.perfTestStatus fields of the KaaSCephOperationRequest CR that you can use to check the status of a Ceph performance test request.

Note

Performance test results will be saved on PVC if the test run parameters contain the saveResultOnPvc option. Otherwise, test results will be saved only as Pod logs. For details, see Ceph performance test parameters.

Status of the KaaSCephOperationRequest high-level parameters

Parameter

Description

perfTestStatus

Describes the status of the current CephPerfTestRequest. For details, see Status of the KaaSCephOperationRequest perfTestStatus parameters.

Status of the KaaSCephOperationRequest perfTestStatus parameters

Parameter

Description

phase

Describes the current request phase:

  • Pending - the request is created and placed in the request queue.

  • Scheduling - the performance test is handled, waiting for a Pod to be scheduled for the run.

  • WaitingNextRun - the performance test is waiting for the next run of the periodic job.

  • Running - the performance test is executing.

  • Finished - the performance test executed successfully.

  • Suspended - the performance test is suspended. Only for periodic jobs.

  • Failed - the performance test failed.

lastStartTime

The last start time of the performance test execution.

lastDurationTime

The duration of the last successful performance test.

lastJobStatus

The execution status of the last performance test.

messages

Issues or warnings found during the performance test run.

results

Location of the performance test result. Contains the following fields:

  • perftestReference - reference to the job or cron job with the performance test run.

  • referenceNamespace - namespace of the job or cron job with the performance test run.

  • storedOnPvc - location of the performance test results on a PVC with pvcName in pvcNamespace if the test run parameters contain the saveResultOnPvc option.

statusHistory

History of statuses and timings for cron jobs:

  • StartTime - start time of the previous performance test

  • JobStatus - last status of the performance test

  • DurationTime - last duration of the performance test

  • Messages - issues that occurred during the previous performance test

Example of status.perfTestStatus
status:
  perfTestStatus:
    lastDurationTime: 5m53s
    lastJobStatus: Completed
    lastStartTime: "2022-06-17T09:29:57Z"
    phase: Finished
    results:
      storedOnPvc:
        pvcName: test-perf-req-cephperftest
        pvcNamespace: rook-ceph
  phase: Completed

StackLight operations

This section covers the StackLight management aspects.

Configure StackLight

This section describes how to configure StackLight in your Mirantis OpenStack for Kubernetes deployment and includes the description of StackLight parameters and their verification.

StackLight configuration procedure

This section describes the StackLight configuration workflow.

To configure StackLight:

  1. Download your management cluster kubeconfig:

    1. Log in to the Container Cloud web UI with the m:kaas:namespace@operator or m:kaas:namespace@writer permissions.

    2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

    3. Expand the menu of the tab with your user name.

    4. Click Download kubeconfig to download kubeconfig of your management cluster.

    5. Log in to any local machine with kubectl installed.

    6. Copy the downloaded kubeconfig to this machine.

  2. Run one of the following commands:

    • For a management cluster:

      kubectl --kubeconfig <mgmtClusterKubeconfigPath> edit -n default cluster <mgmtClusterName>
      
    • For a managed cluster:

      kubectl --kubeconfig <mgmtClusterKubeconfigPath> edit -n <managedClusterProjectName> cluster <managedClusterName>
      
  3. In the following section of the opened manifest, configure the required StackLight parameters as described in StackLight configuration parameters.

    spec:
      providerSpec:
        value:
          helmReleases:
          - name: stacklight
            values:
    
  4. Verify StackLight after configuration.

StackLight configuration parameters

This section describes the StackLight configuration keys that you can specify in the values section to change StackLight settings as required. Prior to making any changes to StackLight configuration, perform the steps described in StackLight configuration procedure. After changing StackLight configuration, verify the changes as described in Verify StackLight after configuration.

Important

Some parameters are marked as mandatory. Failure to specify values for such parameters causes the Admission Controller to reject cluster creation.

OpenStack cluster configuration parameters

This section describes the OpenStack-related StackLight configuration keys. For MOSK cluster configuration keys, see MOSK cluster configuration parameters.


Gnocchi

Key

Description

Example values

openstack.gnocchi.enabled (bool)

Enables Gnocchi monitoring. Set to false by default.

true or false

Ironic

Key

Description

Example values

openstack.ironic.enabled (bool)

Enables Ironic monitoring. Set to false by default.

true or false

OpenStack

Key

Description

Example values

openstack.enabled (bool)

Enables OpenStack monitoring. Set to true by default.

true or false

openstack.namespace (string)

Defines the namespace within which the OpenStack virtualized control plane is installed. Set to openstack by default.

openstack

RabbitMQ

Key

Description

Example values

openstack.rabbitmq.credentialsConfig (map)

Defines the RabbitMQ credentials to use if credentials discovery is disabled or some required parameters were not found during the discovery.

credentialsConfig:
  username: "stacklight"
  password: "stacklight"
  host: "rabbitmq.openstack.svc"
  queue: "notifications"
  vhost: "openstack"

openstack.rabbitmq.credentialsDiscovery (map)

Enables the credentials discovery to obtain the username and password from the secret object.

credentialsDiscovery:
  enabled: true
  namespace: openstack
  secretName: os-rabbitmq-user-credentials
SSL certificates

Key

Description

Example values

openstack.externalFQDN (string) Deprecated

External FQDN used to communicate with OpenStack services for certificates monitoring. The option is deprecated. Use openstack.externalFQDNs.enabled instead.

https://os.ssl.mirantis.net/

openstack.externalFQDNs.enabled (bool)

External FQDN used to communicate with OpenStack services. Used for certificates monitoring. Set to false by default.

true or false

openstack.insecure (map)

Defines whether to verify the trust chain of the OpenStack endpoint SSL certificates during monitoring.

insecure:
  internal: true
  external: false
Telegraf

Key

Description

Example values

openstack.telegraf.credentialsConfig (map)

Specifies the OpenStack credentials to use if the credentials discovery is disabled or some required parameters were not found during the discovery.

credentialsConfig:
  identityEndpoint: "" # "http://keystone-api.openstack.svc:5000/v3"
  domain: "" # "default"
  password: "" # "workshop"
  project: "" # "admin"
  region: "" # "RegionOne"
  username: "" # "admin"

openstack.telegraf.credentialsDiscovery (map)

Enables the credentials discovery to obtain all required parameters from the secret object.

credentialsDiscovery:
  enabled: true
  namespace: openstack
  secretName: keystone-keystone-admin

openstack.telegraf.interval (string)

Specifies the interval of metrics gathering from the OpenStack API. Set to 1m by default.

1m, 3m

openstack.telegraf.insecure (bool)

Enables or disables the server certificate chain and host name verification. Set to true by default.

true or false

openstack.telegraf.skipPublicEndpoints (bool)

Enables or disables HTTP probes for public endpoints from the OpenStack service catalog. Set to false by default, meaning that Telegraf verifies all endpoints from the OpenStack service catalog, including the public, admin, and internal endpoints.

true or false

Tungsten Fabric

Key

Description

Example values

tungstenFabricMonitoring.enabled (bool)

Enables Tungsten Fabric monitoring.

Since MOSK 23.1, the parameter is set to true by default if Tungsten Fabric is deployed.

Before MOSK 23.1, the parameter is set to false by default. Set it to true only if Tungsten Fabric is deployed.

true or false

tungstenFabricMonitoring.exportersTimeout (string)

Available since MOSK 23.3. Defines the timeout of the tungstenfabric-exporter client requests. Set to 5s by default.

tungstenFabricMonitoring:
  exportersTimeout: "5s"

tungstenFabricMonitoring.analyticsEnabled (bool)

Available since MOSK 24.1. Enables or disables monitoring of the Tungsten Fabric analytics services.

In MOSK 24.1, defaults to true.

Since MOSK 24.2, the default value is set automatically based on the real state of the Tungsten Fabric analytics services (enabled or disabled) in the Tungsten Fabric cluster.

true or false

MOSK cluster configuration parameters

This section describes the MOSK cluster StackLight configuration keys. For OpenStack cluster configuration keys, see OpenStack cluster configuration parameters.


Alert configuration

Key

Description

Example values

prometheusServer.customAlerts (slice)

Defines custom alerts. Also, modifies or disables existing alert configurations. For the list of predefined alerts, see StackLight alerts. While adding or modifying alerts, follow the Alerting rules.

customAlerts:
# To add a new alert:
- alert: ExampleAlert
  annotations:
    description: Alert description
    summary: Alert summary
  expr: example_metric > 0
  for: 5m
  labels:
    severity: warning
# To modify an existing alert expression:
- alert: AlertmanagerFailedReload
  expr: alertmanager_config_last_reload_successful == 5
# To disable an existing alert:
- alert: TargetDown
  enabled: false

An optional enabled field is accepted in the alert body. Set it to false to disable an existing alert. All fields specified using the customAlerts definition override the default predefined definitions in the charts’ values.

Alerta

Key

Description

Example values

alerta.enabled (bool)

Enables or disables Alerta. Using the Alerta web UI, you can view the most recent or watched alerts as well as group and filter alerts. Set to true by default.

true or false

Alertmanager integrations

On managed clusters with limited Internet access, a proxy is required for the StackLight components that use HTTP or HTTPS and need external access. Such components are disabled by default. Examples include the Salesforce integration and Alertmanager notifications external rules.

Key

Description

Example values

alertmanagerSimpleConfig.genericReceivers (slice)

Provides a generic template for notifications receiver configurations. For a list of supported receivers, see Prometheus Alertmanager documentation: Receiver.

For example, to enable notifications to OpsGenie:

alertmanagerSimpleConfig:
  genericReceivers:
  - name: HTTP-opsgenie
    enabled: true # optional
    opsgenie_configs:
    - api_url: "https://example.app.eu.opsgenie.com/"
      api_key: "secret-key"
      send_resolved: true

alertmanagerSimpleConfig.genericRoutes (slice)

Provides a template for notifications route configuration. For details, see Prometheus Alertmanager documentation: Route.

genericRoutes:
- receiver: HTTP-opsgenie
  enabled: true # optional
  matchers:
  - severity=~"major|critical"
  continue: true

alertmanagerSimpleConfig.inhibitRules.enabled (bool)

Enables or disables alert inhibition rules. If enabled, Alertmanager reduces alert noise by suppressing notifications for dependent alerts, which provides a clearer view of the cloud status and simplifies troubleshooting. Enabled by default. For details, see Alert dependencies. For details on inhibition rules, see Prometheus documentation.

true or false

Alertmanager: notifications to email

Key

Description

Example values

alertmanagerSimpleConfig.email.enabled (bool)

Enables or disables Alertmanager integration with email. Set to false by default.

true or false

alertmanagerSimpleConfig.email (map)

Defines the notification parameters for Alertmanager integration with email. For details, see Prometheus Alertmanager documentation: Email configuration.

email:
  enabled: false
  send_resolved: true
  to: "to@test.com"
  from: "from@test.com"
  smarthost: smtp.gmail.com:587
  auth_username: "from@test.com"
  auth_password: password
  auth_identity: "from@test.com"
  require_tls: true

alertmanagerSimpleConfig.email.route (map)

Defines the route for Alertmanager integration with email. For details, see Prometheus Alertmanager documentation: Route.

route:
  matchers: []
  routes: []

Alertmanager: notifications to Microsoft Teams

On managed clusters with limited Internet access, a proxy is required for StackLight components that use HTTP or HTTPS, are disabled by default, and need external access once enabled. The Microsoft Teams integration depends on Internet access through HTTPS.

Key

Description

Example values

alertmanagerSimpleConfig.msteams.enabled (bool)

Enables or disables Alertmanager integration with Microsoft Teams. Requires a configured Microsoft Teams channel and a channel connector. Set to false by default.

true or false

alertmanagerSimpleConfig.msteams.url (string)

Defines the URL of an Incoming Webhook connector of a Microsoft Teams channel. For details about channel connectors, see Microsoft documentation.

https://example.webhook.office.com/webhookb2/UUID

alertmanagerSimpleConfig.msteams.route (map)

Defines the notifications route for Alertmanager integration with MS Teams. For details, see Prometheus Alertmanager documentation: Route.

route:
  matchers: []
  routes: []
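
As a minimal sketch, the three Microsoft Teams keys above can be combined into a single fragment. The webhook URL is the placeholder from the table, not a working endpoint, and the matcher is an illustrative assumption that mirrors the default Salesforce route:

```yaml
alertmanagerSimpleConfig:
  msteams:
    enabled: true
    # Placeholder: replace with the Incoming Webhook URL of your channel.
    url: https://example.webhook.office.com/webhookb2/UUID
    route:
      # Assumed filter for illustration: forward only critical alerts.
      matchers:
      - severity="critical"
      routes: []
```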

Alertmanager: notifications to Salesforce

On managed clusters with limited Internet access, a proxy is required for StackLight components that use HTTP or HTTPS, are disabled by default, and need external access once enabled. The Salesforce integration depends on Internet access through HTTPS.

Key

Description

Example values

clusterId (string)

Unique cluster identifier clusterId="<Cluster Project>/<Cluster Name>/<UID>", generated for each cluster using Cluster Project, Cluster Name, and cluster UID, separated by a slash. Used for both sf-notifier and sf-reporter services.

The clusterId is automatically defined for each cluster. Do not set or modify it manually.

Do not modify clusterId.

alertmanagerSimpleConfig.salesForce.enabled (bool)

Enables or disables Alertmanager integration with Salesforce using the sf-notifier service. Disabled by default.

true or false

alertmanagerSimpleConfig.salesForce.auth (map)

Defines the Salesforce parameters and credentials for integration with Alertmanager.

auth:
  url: "<SF instance URL>"
  username: "<SF account email address>"
  password: "<SF password>"
  environment_id: "<Cloud identifier>"
  organization_id: "<Organization identifier>"
  sandbox_enabled: "<Set to true or false>"

alertmanagerSimpleConfig.salesForce.route (map)

Defines the notifications route for Alertmanager integration with Salesforce. For details, see Prometheus Alertmanager documentation: Route.

route:
  matchers:
  - severity="critical"
  routes: []

Note

By default, only Critical alerts will be sent to Salesforce.

alertmanagerSimpleConfig.salesForce.feed_enabled (bool)

Enables or disables feed update in Salesforce. To save API calls, this parameter is set to false by default.

true or false

alertmanagerSimpleConfig.salesForce.link_prometheus (bool)

Enables or disables links to the Prometheus web UI in alerts sent to Salesforce. To simplify troubleshooting, set to true by default.

true or false
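
The Salesforce keys above could be assembled as in the following sketch. All values in angle brackets are the placeholders from the table and must be replaced. The remaining values repeat the documented defaults, except enabled, which is switched on for illustration:

```yaml
alertmanagerSimpleConfig:
  salesForce:
    enabled: true
    auth:
      url: "<SF instance URL>"
      username: "<SF account email address>"
      password: "<SF password>"
      environment_id: "<Cloud identifier>"
      organization_id: "<Organization identifier>"
      sandbox_enabled: "<Set to true or false>"
    route:
      # Documented default: only critical alerts reach Salesforce.
      matchers:
      - severity="critical"
      routes: []
    # Documented defaults, listed explicitly for readability.
    feed_enabled: false
    link_prometheus: true
```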

Alertmanager: notifications to ServiceNow

Caution

Prior to configuring the integration with ServiceNow, perform the following prerequisite steps using the ServiceNow documentation of the required version.

  1. In a new or existing Incident table, add the Alert ID field as described in Add fields to a table. To avoid alerts duplication, select Unique.

  2. Create an Access Control List (ACL) with read/write permissions for the Incident table as described in Securing table records.

  3. Set up a service account.

Key

Description

Example values

alertmanagerSimpleConfig.serviceNow.enabled (bool)

Enables or disables Alertmanager integration with ServiceNow. Set to false by default. Requires a configured ServiceNow account and compliance with the Incident table requirements above.

true or false

alertmanagerSimpleConfig.serviceNow (map)

Defines the ServiceNow parameters and credentials for integration with Alertmanager:

  • incident_table - name of the table created in ServiceNow. Do not confuse with the table label.

  • api_version - version of the ServiceNow HTTP API. By default, v1.

  • alert_id_field - name of the unique string field configured in ServiceNow to hold Prometheus alert IDs. Do not confuse with the table label.

  • auth.instance - URL of the instance.

  • auth.username - name of the ServiceNow user account with access to Incident table.

  • auth.password - password of the ServiceNow user account.

serviceNow:
  enabled: true
  incident_table: "incident"
  api_version: "v1"
  alert_id_field: "u_alert_id"
  auth:
    instance: "https://dev00001.service-now.com"
    username: "testuser"
    password: "testpassword"

Alertmanager: notifications to Slack

On managed clusters with limited Internet access, a proxy is required for StackLight components that use HTTP or HTTPS, are disabled by default, and need external access once enabled. The Slack integration depends on Internet access through HTTPS.

Key

Description

Example values

alertmanagerSimpleConfig.slack.enabled (bool)

Enables or disables Alertmanager integration with Slack. For details, see Prometheus Alertmanager documentation: Slack configuration. Set to false by default.

true or false

alertmanagerSimpleConfig.slack.api_url (string)

Defines the Slack webhook URL.

http://localhost:8888

alertmanagerSimpleConfig.slack.channel (string)

Defines the Slack channel or user to send notifications to.

monitoring

alertmanagerSimpleConfig.slack.route (map)

Defines the notifications route for Alertmanager integration with Slack. For details, see Prometheus Alertmanager documentation: Route.

route:
  matchers: []
  routes: []
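
For orientation, the Slack keys above might be combined as follows. This is only a sketch: the webhook URL and channel are the example values from the table, and the matcher is an assumed filter added for illustration:

```yaml
alertmanagerSimpleConfig:
  slack:
    enabled: true
    # Example value from the table: replace with your Slack webhook URL.
    api_url: http://localhost:8888
    # Channel or user that receives the notifications.
    channel: monitoring
    route:
      # Assumed filter for illustration: forward only critical alerts.
      matchers:
      - severity="critical"
      routes: []
```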

Alertmanager: Watchdog alert

Key

Description

Example values

prometheusServer.watchDogAlertEnabled (bool)

Enables or disables the Watchdog alert that constantly fires as long as the entire alerting pipeline is functional. You can use this alert to verify that Alertmanager notifications properly flow to the Alertmanager receivers. Set to true by default.

true or false

Byte limit for Telemeter client

For internal StackLight use only

Key

Description

Example values

telemetry.telemeterClient.limitBytes (string)

Specifies the size limit of the incoming data length in bytes for the Telemeter client. Defaults to 1048576.

4194304

Cluster size

Key

Description

Example values

clusterSize (string)

Specifies the approximate expected cluster size. Set to small by default. Other possible values include medium and large. Depending on the choice, appropriate resource limits are passed according to the resources or resourcesPerClusterSize parameter.

Caution

Since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0), resourcesPerClusterSize is deprecated and is overridden by the resources parameter. Therefore, use the resources parameter instead.

The values differ by the OpenSearch and Prometheus resource limits:

  • small (default) - 2 CPU, 6 Gi RAM for OpenSearch, 1 CPU, 8 Gi RAM for Prometheus. Use small only for testing and evaluation purposes with no workloads expected.

  • medium - 4 CPU, 16 Gi RAM for OpenSearch, 3 CPU, 16 Gi RAM for Prometheus.

  • large - 8 CPU, 32 Gi RAM for OpenSearch, 6 CPU, 32 Gi RAM for Prometheus. Set to large only in case of lack of resources for OpenSearch and Prometheus.

small, medium, or large

Grafana

Key

Description

Example values

grafana.renderer.enabled (bool)

Removed in Container Cloud 2.27.0 (Cluster releases 17.2.0 and 16.2.0). Enables or disables Grafana Image Renderer. Enabled by default. Set to false to disable the renderer, for example, in resource-limited environments.

true or false

grafana.homeDashboard (string)

Defines the home dashboard. Set to kubernetes-cluster by default. You can define any of the available dashboards.

kubernetes-cluster

High availability

Key

Description

Example values

highAvailabilityEnabled (bool) Mandatory

Enables or disables StackLight multiserver mode. For details, see StackLight database modes in Container Cloud Reference Architecture: StackLight deployment architecture. On managed clusters, set to false by default. On management clusters, true is mandatory.

true or false

Kubernetes network policies

Available since MCC 2.25.1 (Cluster releases 17.0.1 and 16.0.1)

Key

Description

Example values

networkPolicies.enabled (bool)

Enables or disables the Kubernetes Network Policy resource that allows controlling network connections to and from Pods deployed in the stacklight namespace. Enabled by default.

For the list of network policy rules, refer to StackLight rules for Kubernetes network policies. Customization of network policies is not supported.

true or false

Kubernetes tolerations

Key

Description

Example values

tolerations.default (slice)

Kubernetes tolerations to add to all StackLight components.

default:
- key: "com.docker.ucp.manager"
  operator: "Exists"
  effect: "NoSchedule"

tolerations.component (map)

Defines Kubernetes tolerations (overrides the default ones) for any StackLight component.

component:
  # elasticsearch:
  opensearch:
  - key: "com.docker.ucp.manager"
    operator: "Exists"
    effect: "NoSchedule"
  postgresql:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"

Log filtering for namespaces

Available since MCC 2.25.0 (Cluster releases 17.0.0 and 16.0.0)

Key

Description

Example values

logging.namespaceFiltering.logs.enabled (bool)

Limits the number of namespaces for Pods log collection. Enabled by default with the following list of monitored Kubernetes namespaces:

Kubernetes namespaces monitored by default:

  • ceph (if Ceph is enabled)

  • ceph-lcm-mirantis (if Ceph is enabled)

  • default

  • kaas

  • kube-node-lease

  • kube-public

  • kube-system

  • lcm-system

  • local-path-storage

  • metallb

  • metallb-system

  • node-feature-discovery

  • openstack

  • openstack-ceph-shared (if Ceph is enabled)

  • openstack-lma-shared

  • openstack-provider-system

  • openstack-redis

  • openstack-tf-share (if Tungsten Fabric is enabled)

  • openstack-vault

  • osh-system

  • rook-ceph (if Ceph is enabled)

  • stacklight

  • system

  • tf (if Tungsten Fabric is enabled)

true or false

logging.namespaceFiltering.logs.extraNamespaces (slice)

Adds extra namespaces to collect Kubernetes Pod logs from. Requires logging.enabled and logging.namespaceFiltering.logs.enabled set to true. Defines a YAML-formatted list of namespaces, which is empty by default.

logging:
  namespaceFiltering:
    logs:
      enabled: true
      extraNamespaces:
      - custom-ns-1

logging.namespaceFiltering.events.enabled (bool)

Limits the number of namespaces for Kubernetes events collection. Disabled by default for two reasons: the sysdig scanner is present on some MOSK clusters, and cluster-scoped objects produce events in the default namespace by default, which is not passed to the StackLight configuration. Requires logging.enabled set to true.

true or false

logging.namespaceFiltering.events.extraNamespaces (slice)

Adds extra namespaces to collect Kubernetes events from. Requires logging.enabled and logging.namespaceFiltering.events.enabled set to true. Defines a YAML-formatted list of namespaces, which is empty by default.

logging:
  namespaceFiltering:
    events:
      enabled: true
      extraNamespaces:
      - custom-ns-1

Log verbosity

Key

Description

Example values

stacklightLogLevels.default (string)

Defines the log verbosity level for all StackLight components if not defined using component. To use the component default log verbosity level, leave the string empty.

  • trace - most verbose log messages, generates large amounts of data

  • debug - messages typically of use only for debugging purposes

  • info - informational messages describing common processes such as service starting or stopping; can be ignored during normal system operation but may provide additional input for investigation

  • warn - messages about conditions that may require attention

  • error - messages on error conditions that prevent normal system operation and require action

  • crit - messages on critical conditions indicating that a service is not working, working incorrectly or is unusable, requiring immediate attention

Since Container Cloud 2.25.0 (Cluster releases 17.0.0 and 16.0.0), the NO_SEVERITY severity label is automatically added to a log with no severity label in the message. This enables greater control over determining which logs Fluentd processes and which ones are skipped by mistake.

stacklightLogLevels.component (map)

Defines (overrides the default value) the log verbosity level for any StackLight component separately. To use the component default log verbosity, leave the string empty.

component:
  kubeStateMetrics: ""
  prometheusAlertManager: ""
  prometheusBlackboxExporter: ""
  prometheusNodeExporter: ""
  prometheusServer: ""
  alerta: ""
  alertmanagerWebhookServicenow: ""
  elasticsearchCurator: ""
  postgresql: ""
  prometheusEsExporter: ""
  sfNotifier: ""
  sfReporter: ""
  fluentd: ""
  # fluentdElasticsearch: ""
  fluentdLogs: ""
  telemeterClient: ""
  telemeterServer: ""
  tfControllerExporter: ""
  tfVrouterExporter: ""
  telegrafDs: ""
  telegrafS: ""
  # elasticsearch: ""
  opensearch: ""
  # kibana: ""
  grafana: ""
  opensearchDashboards: ""
  metricbeat: ""
  prometheusMsTeams: ""
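
As a sketch of how the two keys interact, the following fragment sets a cluster-wide level and overrides it for a single component. The chosen levels are illustrative assumptions, not recommended values:

```yaml
stacklightLogLevels:
  # Applies to every StackLight component that has no entry
  # under component.
  default: "warn"
  component:
    # Illustrative override: more verbose Fluentd logs,
    # for example, while debugging log collection.
    fluentd: "debug"
```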

Logging

Key

Description

Example values

logging.enabled (bool) Mandatory

Enables or disables the StackLight logging stack. For details about the logging components, see Container Cloud Reference Architecture: StackLight Deployment architecture. Set to true by default. On management clusters, true is mandatory.

true or false

logging.level (string)

Removed in Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0). Sets the least important level of log messages to send to OpenSearch. Requires logging.enabled set to true.

The default logging level is INFO, meaning that StackLight will drop log messages for the lower DEBUG and TRACE levels. Levels from WARNING to EMERGENCY require attention.

Note

The FLUENTD_ERROR logs are of special type and cannot be dropped.

  • TRACE - the most verbose logs. Such level generates large amounts of data.

  • DEBUG - messages typically of use only for debugging purposes.

  • INFO - informational messages describing common processes such as service starting or stopping. Can be ignored during normal system operation but may provide additional input for investigation.

  • NOTICE - normal but significant conditions that may require special handling.

  • WARNING - messages on unexpected conditions that may require attention.

  • ERROR - messages on error conditions that prevent normal system operation and require action.

  • CRITICAL - messages on critical conditions indicating that a service is not working or working incorrectly.

  • ALERT - messages on severe events indicating that action is needed immediately.

  • EMERGENCY - messages indicating that a service is unusable.

logging.metricQueries (map)

Allows configuring OpenSearch queries for the data present in OpenSearch. Prometheus Elasticsearch Exporter then queries the OpenSearch database and exposes such metrics in the Prometheus format. For details, see Create logs-based metrics. Includes the following parameters:

  • indices - specifies the index pattern

  • interval and timeout - specify in seconds how often to send the query to OpenSearch and how long it can last before timing out

  • onError and onMissing - modify the prometheus-es-exporter behavior on query error and missing index. For details, see Prometheus Elasticsearch Exporter.

For usage example, see Create logs-based metrics.

logging.retentionTime (map)

Removed in Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0). Specifies the retention time per index. Includes the following parameters:

  • logstash - specifies the logstash-* index retention time.

  • events - specifies the kubernetes_events-* index retention time.

  • notifications - specifies the notification-* index retention time.

The allowed values include integers (days) and numbers with suffixes: y, m, w, d, h, including capital letters.

logging:
  retentionTime:
    logstash: 3
    events: "2w"
    notifications: "1M"

Logging: Enforce OOPS compression

Available since MCC 2.25.0 (Cluster releases 17.0.0 and 16.0.0)

Key

Description

Example values

logging.enforceOopsCompression (bool)

Enforces 32 GB of heap size, unless the defined memory limit allows using 50 GB of heap. Requires logging.enabled set to true. Enabled by default. When disabled, StackLight computes heap as ⅘ of the set memory limit for any resulting heap value. For more details, see Tune OpenSearch performance.

logging:
  enforceOopsCompression: true

Logging to external outputs

Available since MCC 2.23.0 (Cluster release 11.7.0)

Key

Description

Example values

logging.externalOutputs (map)

Specifies external Elasticsearch, OpenSearch, and syslog destinations as fluentd-logs outputs. Requires logging.enabled: true. For configuration procedure, see Enable log forwarding to external destinations.

logging:
  externalOutputs:
    elasticsearch:
      # disabled: false
      type: elasticsearch
      level: info
      plugin_log_level: info
      tag_exclude: '{fluentd-logs,systemd}'
      host: elasticsearch-host
      port: 9200
      logstash_date_format: '%Y.%m.%d'
      logstash_format: true
      logstash_prefix: logstash
      ...
      buffer:
        # disabled: false
        chunk_limit_size: 16m
        flush_interval: 15s
        flush_mode: interval
        overflow_action: block
        ...
    opensearch:
      disabled: true
      type: opensearch
      ...

Logging to external outputs: secrets

Available since MCC 2.23.0 (Cluster release 11.7.0)

Key

Description

Example values

logging.externalOutputSecretMounts (map)

Specifies authentication secret mounts for external log destinations. Requires logging.externalOutputs to be enabled and a Kubernetes secret to be created under the stacklight namespace. Contains the following values:

  • secretName

    Mandatory. Kubernetes secret name.

  • mountPath

    Mandatory. Mount path of the Kubernetes secret defined in secretName.

  • defaultMode

    Optional. Decimal number defining secret permissions, 420 by default.

Secret mount configuration:

logging:
  externalOutputSecretMounts:
  - secretName: elasticsearch-certs
    mountPath: /tmp/elasticsearch-certs
    defaultMode: 420
  - secretName: opensearch-certs
    mountPath: /tmp/opensearch-certs

Elasticsearch configuration for the above secret mount:

logging:
  externalOutputs:
    elasticsearch:
      ...
      ca_file: /tmp/elasticsearch-certs/ca.pem
      client_cert: /tmp/elasticsearch-certs/client.pem
      client_key: /tmp/elasticsearch-certs/client.key
      client_key_pass: password

Logging to syslog

Deprecated since MCC 2.23.0 (Cluster release 11.7.0)

Note

Since Container Cloud 2.23.0 (Cluster release 11.7.0), logging.syslog is deprecated in favor of logging.externalOutputs. For details, see Logging to external outputs.

Key

Description

Example values

logging.syslog.enabled (bool)

Enables or disables remote logging to syslog. Disabled by default. Requires logging.enabled set to true. For details and configuration example, see Enable remote logging to syslog.

true or false

logging.syslog.host (string)

Specifies the remote syslog host.

remote-syslog.svc

logging.syslog.level (string)

Removed in Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0). Specifies logging level for the syslog output.

INFO

logging.syslog.port (string)

Specifies the remote syslog port.

514

logging.syslog.packetSize (string)

Defines the packet size in bytes for the syslog logging output. Set to 1024 by default. May be useful for syslog setups allowing packet size larger than 1 kB. Mirantis recommends that you tune this parameter to allow sending full log lines.

1024

logging.syslog.protocol (string)

Specifies the remote syslog protocol. Set to udp by default.

tcp or udp

logging.syslog.tls.enabled (bool)

Optional. Disabled by default. Enables or disables TLS. Use TLS only for the TCP protocol. TLS will not be enabled if you set a protocol other than TCP.

true or false

logging.syslog.tls.verify_mode (int)

Optional. Configures TLS verification.

  • 0 for OpenSSL::SSL::VERIFY_NONE

  • 1 for OpenSSL::SSL::VERIFY_PEER

  • 2 for OpenSSL::SSL::VERIFY_FAIL_IF_NO_PEER_CERT

  • 4 for OpenSSL::SSL::VERIFY_CLIENT_ONCE

logging.syslog.tls.certificate (string)

Defines how to pass the certificate. secret takes precedence over hostPath.

  • secret - specifies the name of the secret holding the certificate.

  • hostPath - specifies an absolute host path to the PEM certificate.

certificate:
  secret: ""
  hostPath: "/etc/ssl/certs/ca-bundle.pem"
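
The logging.syslog keys above could be combined as in the following sketch, which assumes a TCP destination secured with TLS. The host, port, and certificate path are the example values from the table. Because logging.syslog is deprecated, treat this fragment as a reference for existing setups only:

```yaml
logging:
  enabled: true            # required for syslog output
  syslog:
    enabled: true
    host: remote-syslog.svc
    port: 514
    packetSize: 1024       # documented default, in bytes
    protocol: tcp          # TLS is applied only with TCP
    tls:
      enabled: true
      verify_mode: 1       # OpenSSL::SSL::VERIFY_PEER
      certificate:
        # secret takes precedence over hostPath when set.
        secret: ""
        hostPath: "/etc/ssl/certs/ca-bundle.pem"
```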

tag_exclude (string)

Since MCC 2.23.0 (Cluster release 11.7.0)

Optional. Overrides tag_include. Sets logs by tags to exclude from the destination output. For example, to exclude all logs with the test tag, set tag_exclude: '/.*test.*/'.

How to obtain tags for logs

Select from the following options:

  • In the main OpenSearch output, use the logger field that equals the tag.

  • Use logs of a particular Pod or container by following the below order, with the first match winning:

    1. The value of the app Pod label. For example, for app=opensearch-master, use opensearch-master as the log tag.

    2. The value of the k8s-app Pod label.

    3. The value of the app.kubernetes.io/name Pod label.

    4. If a release_group Pod label exists and the component Pod label starts with app, use the value of the component label as the tag. Otherwise, the tag is the application label joined to the component label with a -.

    5. The name of the container from which the log is taken.

The values for tag_exclude and tag_include are placed into <match> directives of Fluentd and only accept regex types that are supported by the <match> directive of Fluentd. For details, refer to the Fluentd official documentation.

'{fluentd-logs,systemd}'

tag_include (string)

Since MCC 2.23.0 (Cluster release 11.7.0)

Optional. Is overridden by tag_exclude. Sets logs by tags to include to the destination output. For example, to include all logs with the auth tag, set tag_include: '/.*auth.*/'.

'/.*auth.*/'
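
As a hedged illustration, the fragment below places tag_include in an external output definition. It assumes that tag_include sits at the same level as tag_exclude in the logging.externalOutputs example. The host and port are the example values used there:

```yaml
logging:
  externalOutputs:
    elasticsearch:
      type: elasticsearch
      host: elasticsearch-host
      port: 9200
      # Forward only logs whose tag contains "auth".
      # If tag_exclude were also set, it would override tag_include.
      tag_include: '/.*auth.*/'
```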

Monitoring of Ceph

Key

Description

Example values

ceph.enabled (bool)

Enables or disables Ceph monitoring on managed clusters. Set to false by default.

true or false

Monitoring of external endpoint

Key

Description

Example values

externalEndpointMonitoring.enabled (bool)

Enables or disables HTTP endpoints monitoring. If enabled, the monitoring tool performs the probes against the defined endpoints every 15 seconds. Set to false by default.

true or false

externalEndpointMonitoring.certificatesHostPath (string)

Defines the directory path with external endpoints certificates on host.

/etc/ssl/certs/

externalEndpointMonitoring.domains (slice)

Defines the list of HTTP endpoints to monitor. The endpoints must successfully respond to a liveness probe. For success, a request to a specific endpoint must result in a 2xx HTTP response code.

domains:
- https://prometheus.io/health
- http://example.com:8080/status
- http://example.net:8080/pulse
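
Taken together, the external endpoint monitoring keys might look as follows. This is a minimal sketch that reuses the example values from the table:

```yaml
externalEndpointMonitoring:
  enabled: true
  # Host directory with the certificates of the external endpoints.
  certificatesHostPath: /etc/ssl/certs/
  # Each endpoint is probed every 15 seconds and must return a 2xx code.
  domains:
  - https://prometheus.io/health
  - http://example.com:8080/status
```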

Monitoring of Ironic

Key

Description

Example values

ironic.endpoint (string)

Enables or disables monitoring of bare metal Ironic. To enable, specify the Ironic API URL.

http://ironic-api-http.kaas.svc:6385/v1

ironic.insecure (bool)

Defines whether to skip the chain and host verification. Set to false by default.

true or false
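
A minimal sketch of the two Ironic keys combined, using the example URL from the table:

```yaml
ironic:
  # Specifying the Ironic API URL enables the monitoring.
  endpoint: http://ironic-api-http.kaas.svc:6385/v1
  # Documented default: keep chain and host verification enabled.
  insecure: false
```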

Monitoring of Mirantis Kubernetes Engine

Key

Description

Example values

mke.enabled (bool)

Enables or disables Mirantis Kubernetes Engine (MKE) monitoring. Set to true by default.

true or false

mke.dockerdDataRoot (string)

Defines the dockerd data root directory of persistent Docker state. For details, see Docker documentation: Daemon CLI (dockerd).

/var/lib/docker

Monitoring of SSL certificates

Key

Description

Example values

sslCertificateMonitoring.enabled (bool)

Enables or disables StackLight to monitor and alert on the expiration date of the TLS certificate of an HTTPS endpoint. If enabled, the monitoring tool performs the probes against the defined endpoints every hour. Set to false by default.

true or false

sslCertificateMonitoring.domains (slice)

Defines the list of HTTPS endpoints to monitor the certificates from.

domains:
- https://prometheus.io
- https://example.com:8080
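
Combined, the certificate monitoring keys could be set as in this sketch, which reuses the example endpoints from the table:

```yaml
sslCertificateMonitoring:
  enabled: true
  # Certificates of these HTTPS endpoints are probed every hour.
  domains:
  - https://prometheus.io
  - https://example.com:8080
```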

Monitoring of workload

Key

Description

Example values

metricFilter (map)

On clusters that run large-scale workloads, workload monitoring generates a large number of resource-consuming metrics. To prevent generation of excessive metrics, you can disable workload monitoring in the StackLight metrics and monitor only the infrastructure.

The metricFilter parameter enables the cAdvisor (Container Advisor) and kubeStateMetrics metric ingestion filters for Prometheus. Disabled by default (enabled is set to false). If enabled is set to true, you can define the namespaces to which the filter applies. The parameter is designed for managed clusters.

metricFilter:
  enabled: true
  action: keep
  namespaces:
  - kaas
  - kube-system
  - stacklight

  • enabled - enable or disable metricFilter using true or false

  • action - action to take by Prometheus:

    • keep - keep only metrics from namespaces that are defined in the namespaces list

    • drop - ignore metrics from namespaces that are defined in the namespaces list

  • namespaces - list of namespaces to keep or drop metrics from regardless of the boolean value for every namespace
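
The opposite of the keep example can be sketched with the drop action. The namespace names custom-ns-1 and custom-ns-2 are hypothetical workload namespaces used only for illustration:

```yaml
metricFilter:
  enabled: true
  # Ignore metrics from the listed namespaces and keep all others.
  action: drop
  namespaces:
  - custom-ns-1   # hypothetical workload namespace
  - custom-ns-2   # hypothetical workload namespace
```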

NodeSelector

Key

Description

Example values

nodeSelector.default (map)

Defines the NodeSelector to use for most StackLight pods (except some pods that refer to DaemonSets) if the NodeSelector of a component is not defined.

default:
  role: stacklight

nodeSelector.component (map)

Defines the NodeSelector to use for particular StackLight component pods. Overrides nodeSelector.default.

component:
  alerta:
    role: stacklight
    component: alerta
  # kibana:
  #   role: stacklight
  #   component: kibana
  opensearchDashboards:
    role: stacklight
    component: opensearchdashboards

OpenSearch

Key

Description

Example values

elasticsearch.retentionTime (map)

Removed in Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0). Specifies the retention time per index. Includes the following parameters:

  • logstash - specifies the logstash-* index retention time.

  • events - specifies the kubernetes_events-* index retention time.

  • notifications - specifies the notification-* index retention time.

The allowed values include integers (days) and numbers with suffixes: y, m, w, d, h, including capital letters.

By default, values set in elasticsearch.logstashRetentionTime are used. However, the elasticsearch.retentionTime parameters, if defined, take precedence over elasticsearch.logstashRetentionTime.

elasticsearch:
  retentionTime:
    logstash: 3
    events: "2w"
    notifications: "1M"

elasticsearch.logstashRetentionTime (int)

Removed in Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0).

Defines the OpenSearch (Elasticsearch) logstash-* index retention time in days. The logstash-* index stores all logs gathered from all nodes and containers. Set to 1 by default.

Note

Due to the known issue 27732-2, a custom setting for this parameter is ignored during cluster deployment and reset to the default of one day. Refer to the known issue description for the affected Cluster releases and available workaround.

1, 5, 15

elasticsearch.persistentVolumeClaimSize (string) Mandatory

Specifies the size of the OpenSearch (Elasticsearch) PVCs. The number of PVCs depends on the StackLight database mode. In HA mode, three PVCs are created, each of the size specified in this parameter. In non-HA mode, one PVC of the specified size is created.

Important

You cannot modify this parameter after cluster creation.

Note

Due to the known issue 27732-1, which is fixed in Container Cloud 2.22.0 (Cluster releases 11.6.0 and 12.7.0), the OpenSearch PVC size configuration is ignored during cluster deployment. Refer to the known issue description for affected Cluster releases and available workarounds.

elasticsearch:
  persistentVolumeClaimSize: 30Gi

elasticsearch.persistentVolumeUsableStorageSizeGB (integer)

Available since Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0). Optional. Specifies the number of gigabytes that is exclusively available for the OpenSearch data.

  • Since Container Cloud 2.29.0 (Cluster releases 17.4.0 and 16.4.0), defines the ceiling for storage-based retention, though only a portion of this storage is available for indices, depending on the total size and cluster configuration.

  • Before Container Cloud 2.29.0 (Cluster releases 17.3.0, 16.3.0, or earlier), defines the ceiling for storage-based retention, where 80% of the defined value is assumed to be the disk space available for normal functioning of an OpenSearch node.

If not set (by default), the number of gigabytes from elasticsearch.persistentVolumeClaimSize is used.

This parameter is useful in the following cases:

  • The real storage behind the volume is shared between multiple consumers. As a result, OpenSearch cannot use all elasticsearch.persistentVolumeClaimSize.

  • The real volume size is bigger than elasticsearch.persistentVolumeClaimSize. As a result, OpenSearch can use more than elasticsearch.persistentVolumeClaimSize.

elasticsearch:
  persistentVolumeUsableStorageSizeGB: 160

OpenSearch Dashboards extra settings

Key

Description

Example values

logging.dashboardsExtraConfig (map)

Additional configuration for opensearch_dashboards.yml.

logging:
  dashboardsExtraConfig:
    opensearch.requestTimeout: 60000

OpenSearch extra settings

Key

Description

Example values

logging.extraConfig (map)

Additional configuration for opensearch.yml that allows setting various OpenSearch parameters, including logging settings, node watermarks, and other cluster-level configurations.

Since Container Cloud 2.29.0 and MOSK 25.1, StackLight manages watermarks by default (low/high/flood: 150/100/50 GB). If logging.extraConfig sets any watermark, StackLight stops managing them. In this case, explicitly set all watermarks using absolute values instead of percentages. While percentages are accepted, they may cause unexpected behavior, especially in clusters that use LVP as a storage provisioner, where OpenSearch shares storage with other components.

logging:
  extraConfig:
    cluster.max_shards_per_node: 5000
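A minimal sketch of setting all three watermarks explicitly with absolute values. It assumes the standard OpenSearch cluster.routing.allocation.disk.watermark.* settings and reuses the 150/100/50 GB figures quoted above purely for illustration. Choose values that fit the storage actually available to OpenSearch.

```yaml
logging:
  extraConfig:
    # Absolute values express the free disk space that must remain,
    # so low > high > flood_stage.
    # Set all three together, because StackLight stops managing
    # watermarks once any of them is overridden.
    cluster.routing.allocation.disk.watermark.low: 150gb
    cluster.routing.allocation.disk.watermark.high: 100gb
    cluster.routing.allocation.disk.watermark.flood_stage: 50gb
```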
Prometheus

Key

Description

Example values

prometheusServer.alertResendDelay (string)

Defines the minimum amount of time for Prometheus to wait before resending an alert to Alertmanager. Passed to the --rules.alert.resend-delay flag. Set to 2m by default.

2m, 90s

prometheusServer.alertsCommonLabels (dict)

Available since Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0). Defines the labels to be injected into firing alerts when they are sent to Alertmanager. Empty by default.

The following labels are reserved for internal purposes and cannot be overridden: cluster_id, service, severity.

Caution

When new labels are injected, Prometheus sends alert updates with a new set of labels, which can potentially cause Alertmanager to have duplicated alerts for a short period of time if the cluster currently has firing alerts.

alertsCommonLabels:
  region: west
  environment: prod

prometheusServer.persistentVolumeClaimSize (string) Mandatory

Specifies the Prometheus PVC(s) size. The number of PVCs depends on the StackLight database mode. For HA, three PVCs are created, each of the size specified in this parameter. For non-HA, one PVC of the specified size is created.

Important

You cannot modify this parameter after cluster creation.

prometheusServer:
  persistentVolumeClaimSize: 16Gi

prometheusServer.queryConcurrency (string)

Available since Container Cloud 2.24.0 (Cluster release 14.0.0). Defines the maximum number of concurrent queries. Passed to the --query.max-concurrency flag. Set to 20 by default.

25

prometheusServer.retentionSize (string)

Defines the Prometheus database retention size. Passed to the --storage.tsdb.retention.size flag. Set to 15GB by default.

15GB, 512MB

prometheusServer.retentionTime (string)

Defines the Prometheus database retention period. Passed to the --storage.tsdb.retention.time flag. Set to 15d by default.

15d, 1000h, 10d12h
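The Prometheus keys in this table all nest under prometheusServer. A minimal sketch that combines several of them, reusing example values listed above. Quoting the values is an assumption based on the documented string type of each key.

```yaml
prometheusServer:
  alertResendDelay: "2m"     # passed to --rules.alert.resend-delay
  queryConcurrency: "25"     # passed to --query.max-concurrency
  retentionSize: "15GB"      # passed to --storage.tsdb.retention.size
  retentionTime: "15d"       # passed to --storage.tsdb.retention.time
```

When adjusting retentionSize, it is generally prudent to keep it below the size set in prometheusServer.persistentVolumeClaimSize so that the database does not fill the volume.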

Prometheus Blackbox Exporter

Key

Description

Example values

blackboxExporter.customModules (map)

Specifies a set of custom Blackbox Exporter modules. For details, see Blackbox Exporter configuration: module. The http_2xx, http_2xx_verify, http_openstack, http_openstack_insecure, tls, tls_verify names are reserved for internal usage and any overrides will be discarded.

customModules:
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{}'

blackboxExporter.timeoutOffset (string)

Specifies the offset to subtract from timeout in seconds (--timeout-offset), upper bounded by 5.0 to comply with the built-in StackLight functionality. If nothing is specified, the Blackbox Exporter default value is used. For example, for Blackbox Exporter v0.19.0, the default value is 0.5.

timeoutOffset: "0.1"

Prometheus custom recording rules

Key

Description

Example values

prometheusServer.customRecordingRules (slice)

Defines custom Prometheus recording rules. Overriding of existing recording rules is not supported.

customRecordingRules:
- name: ExampleRule.http_requests_total
  rules:
  - expr: sum by(job) (rate(http_requests_total[5m]))
    record: job:http_requests:rate5m
  - expr: avg_over_time(job:http_requests:rate5m[1w])
    record: job:http_requests:rate5m:avg_over_time_1w

Prometheus custom scrape configurations

Key

Description

Example values

prometheusServer.customScrapeConfigs (map)

Defines custom Prometheus scrape configurations. For details, see Prometheus documentation: scrape_config. The names of default StackLight scrape configurations, which you can view in the Status > Targets tab of the Prometheus web UI, are reserved for internal usage and any overrides will be discarded. Therefore, provide unique names to avoid overrides.

customScrapeConfigs:
  custom-grafana:
    scrape_interval: 10s
    scrape_timeout: 5s
    kubernetes_sd_configs:
    - role: endpoints
    relabel_configs:
    - source_labels:
      - __meta_kubernetes_service_label_app
      - __meta_kubernetes_endpoint_port_name
      regex: grafana;service
      action: keep
    - source_labels:
      - __meta_kubernetes_pod_name
      target_label: pod
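For a target that is not discovered through Kubernetes, a scrape configuration can list its address statically. The sketch below assumes that each map entry is passed to Prometheus as a scrape_config body keyed by the job name, as the example above suggests. The job name, host, and port are hypothetical.

```yaml
prometheusServer:
  customScrapeConfigs:
    # Hypothetical job name; it must not match a default StackLight job.
    custom-app-metrics:
      scrape_interval: 30s
      metrics_path: /metrics
      static_configs:
      - targets:
        - custom-app.example.com:9100   # hypothetical host and port
```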
Prometheus metrics filtering

Available since Container Cloud 2.24.0 (Cluster release 14.0.0)

Key

Description

Example values

metricsFiltering.enabled (bool)

Configuration for managing Prometheus metrics filtering. When enabled (default), only actively used and explicitly white-listed metrics get scraped by Prometheus.

prometheusServer:
  metricsFiltering:
    enabled: true

metricsFiltering.extraMetricsInclude (map)

List of extra metrics to whitelist that would otherwise be dropped by default. Contains the following parameters:

  • <job name> - the scrape job name, used as the key under which the extra metrics to whitelist are listed. For the list of job names, see White list of Prometheus scrape jobs. If a job name is not present in this list, its target metrics are not dropped and are collected by Prometheus by default.

    You can also use group key names to add metrics to more than one job using _group-<key name>. The following list combines jobs by groups:

    List of jobs by groups
    _group-blackbox-metrics
     - blackbox
     - blackbox-external-endpoint
     - kubernetes-master-api
     - mcc-blackbox
     - mke-manager-api
     - msr-api
     - openstack-blackbox-ext
     - openstack-dns-probe # Since MOSK 24.3
     - refapp
    
    _group-controller-runtime-metrics
     - helm-controller
     - kaas-exporter
     - kubelet
     - kubernetes-apiservers
     - mcc-controllers
     - mcc-providers
     - rabbitmq-operator-metrics
    
    _group-etcd-metrics
     - etcd-server
     - ucp-kv
    
    _group-go-collector-metrics
     - cadvisor
     - calico
     - etcd-server
     - helm-controller
     - ironic
     - kaas-exporter
     - kubelet
     - kubernetes-apiservers
     - mcc-cache
     - mcc-controllers
     - mcc-providers
     - mke-metrics-controller
     - mke-metrics-engine
     - openstack-ingress-controller
     - postgresql
     - prometheus-alertmanager
     - prometheus-elasticsearch-exporter
     - prometheus-grafana
     - prometheus-libvirt-exporter
     - prometheus-memcached-exporter
     - prometheus-msteams
     - prometheus-mysql-exporter
     - prometheus-node-exporter
     - prometheus-rabbitmq-exporter # Deprecated since MOSK 25.1
     - prometheus-relay
     - prometheus-server
     - rabbitmq-operator-metrics
     - telegraf-docker-swarm
     - telemeter-client
     - telemeter-server
     - tf-control
     - tf-redis
     - tf-vrouter
     - ucp-kv
    
    _group-process-collector-metrics
     - alertmanager-webhook-servicenow
     - cadvisor
     - calico
     - etcd-server
     - helm-controller
     - ironic
     - kaas-exporter
     - kubelet
     - kubernetes-apiservers
     - mcc-cache
     - mcc-controllers
     - mcc-providers
     - mke-metrics-controller
     - mke-metrics-engine
     - openstack-ingress-controller
     - patroni
     - postgresql
     - prometheus-alertmanager
     - prometheus-elasticsearch-exporter
     - prometheus-grafana
     - prometheus-libvirt-exporter
     - prometheus-memcached-exporter
     - prometheus-msteams
     - prometheus-mysql-exporter
     - prometheus-node-exporter
     - prometheus-rabbitmq-exporter # Deprecated since MOSK 25.1
     - prometheus-relay
     - prometheus-server
     - rabbitmq-operator-metrics
     - sf-notifier
     - telegraf-docker-swarm
     - telemeter-client
     - telemeter-server
     - tf-control
     - tf-redis
     - tf-vrouter
     - tf-zookeeper
     - ucp-kv
    
    _group-rest-client-metrics
     - helm-controller
     - kaas-exporter
     - mcc-controllers
     - mcc-providers
    
    _group-service-handler-metrics
     - mcc-controllers
     - mcc-providers
    
    _group-service-http-metrics
     - mcc-cache
     - mcc-controllers
    
    _group-service-reconciler-metrics
     - mcc-controllers
     - mcc-providers
    

    Note

    The prometheus-coredns job from the go-collector-metrics and process-collector-metrics groups is removed in Container Cloud 2.25.0 (Cluster releases 17.0.0 and 16.0.0).

  • <list of metrics to collect> - extra metrics of <job name> to be white-listed.

prometheusServer:
  metricsFiltering:
    enabled: true
    extraMetricsInclude:
      cadvisor:
        - container_memory_failcnt
        - container_network_transmit_errors_total
      calico:
        - felix_route_table_per_iface_sync_seconds_sum
        - felix_bpf_dataplane_endpoints
      _group-go-collector-metrics:
        - go_gc_heap_goal_bytes
        - go_gc_heap_objects_objects

Prometheus Node Exporter

Key

Description

Example values

nodeExporter.netDeviceExclude (string)

Excludes the network devices that match the specified regular expression from monitoring. The number of network interface-related metrics is significant and may cause increased Prometheus RAM usage in big clusters. Therefore, Prometheus Node Exporter collects information only for a basic set of interfaces (both host and container) and excludes the following interfaces from monitoring:

  • veth/cali - the host-side part of the container-host Ethernet tunnel

  • o-hm0 - the OpenStack Octavia management interface for communication with the amphora machine

  • tap, qg-, qr-, ha- - the Open vSwitch virtual bridge ports

  • br-(ex|int|tun) - the Open vSwitch virtual bridges

  • docker0, br- - the Docker bridge (master for the veth interfaces)

  • ovs-system - the Open vSwitch interface (mapping interfaces to bridges)

To collect information for any of the interfaces above, remove it from the list of excluded devices.

nodeExporter:
  netDeviceExclude: "^(veth.+|cali.+|o-hm0|tap.+|qg-.+|qr-.+|ha-.+|br-.+|ovs-system|docker0)$"

nodeExporter.extraCollectorsEnabled (slice)

Enables Node Exporter collectors. For a list of available collectors, see Node Exporter Collectors. The following collectors are enabled by default in StackLight:

  • arp

  • conntrack

  • cpu

  • diskstats

  • entropy

  • filefd

  • filesystem

  • hwmon

  • loadavg

  • meminfo

  • netdev

  • netstat

  • nfs

  • stat

  • sockstat

  • textfile

  • time

  • timex

  • uname

  • vmstat

extraCollectorsEnabled:
  - bcache
  - bonding
  - softnet
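Both Node Exporter keys nest under nodeExporter. As a sketch of editing the exclusion list, the pattern below is the example pattern above with o-hm0 removed, so that the Octavia management interface is monitored again. Enabling the bonding collector alongside it is only an illustration.

```yaml
nodeExporter:
  # Example pattern from above without the o-hm0 alternative.
  netDeviceExclude: "^(veth.+|cali.+|tap.+|qg-.+|qr-.+|ha-.+|br-.+|ovs-system|docker0)$"
  extraCollectorsEnabled:
    - bonding
```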
Prometheus Relay

Note

Prometheus Relay is set up as an endpoint in the Prometheus datasource in Grafana. Therefore, all requests from Grafana are sent to Prometheus through Prometheus Relay. If Prometheus Relay reports request timeouts or exceeds the response size limits, you can configure the parameters below. In this case, Prometheus Relay resource limits may also require tuning.

Key

Description

Example values

prometheusRelay.clientTimeout (string)

Specifies the client timeout in seconds. If empty, defaults to a value determined by the cluster size: 10 for small, 30 for medium, 60 for large.

Note

The cluster size parameters are available since Container Cloud 2.24.0 (Cluster release 14.0.0).

10

prometheusRelay.responseLimitBytes (string)

Specifies the response size limit in bytes. If empty, defaults to a value determined by the cluster size: 6291456 for small, 18874368 for medium, 37748736 for large.

Note

The cluster size parameters are available since Container Cloud 2.24.0 (Cluster release 14.0.0).

1048576
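If Prometheus Relay reports request timeouts or exceeded response size limits, one possible starting point is to raise both parameters to the values documented above for a large cluster. The values are quoted on the assumption that both keys are strings, as their documented type indicates.

```yaml
prometheusRelay:
  clientTimeout: "60"              # seconds; documented default for large clusters
  responseLimitBytes: "37748736"   # bytes; documented default for large clusters
```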

Prometheus remote write

Allows sending metrics from Prometheus to a custom monitoring endpoint. For details, see Prometheus Documentation: remote_write.

Key

Description

Example values

prometheusServer.remoteWriteSecretMounts (slice)

Skip this parameter if your remote server does not require authorization. Defines additional mounts for remoteWrites secrets. Secret objects with credentials needed to access the remote endpoint must be precreated in the stacklight namespace. For details, see Kubernetes Secrets.

Note

To create more than one file for the same remote write endpoint, for example, to configure TLS connections, use a single secret object with multiple keys in the data field. Using the following example configuration, two files will be created, cert_file and key_file:

...
  data:
    cert_file: aWx1dnRlc3Rz
    key_file: dGVzdHVzZXI=
...
remoteWriteSecretMounts:
- secretName: prom-secret-files
  mountPath: /etc/config/remote_write

prometheusServer.remoteWrites (slice)

Defines the configuration of a custom remote_write endpoint for sending Prometheus samples.

Note

If the remote server uses authorization, first create secret(s) in the stacklight namespace and mount them to Prometheus through prometheusServer.remoteWriteSecretMounts. Then define the created secret in the authorization field.

remoteWrites:
-  url: http://remote_url/push
   authorization:
     credentials_file: /etc/config/remote_write/key_file
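The two parameters above can be combined for a TLS-protected endpoint. The sketch below shows a complete Secret object followed by the matching StackLight values as two separate YAML documents. It assumes that each remoteWrites entry is passed to Prometheus as a remote_write block, so the standard tls_config fields apply. The placeholders in angle brackets must be replaced with your own base64-encoded data.

```yaml
# Document 1: Secret precreated in the stacklight namespace.
apiVersion: v1
kind: Secret
metadata:
  name: prom-secret-files
  namespace: stacklight
type: Opaque
data:
  cert_file: <base64-encoded client certificate>
  key_file: <base64-encoded client key>
---
# Document 2: StackLight values that mount the Secret and reference its files.
prometheusServer:
  remoteWriteSecretMounts:
  - secretName: prom-secret-files
    mountPath: /etc/config/remote_write
  remoteWrites:
  - url: https://remote_url/push
    tls_config:
      cert_file: /etc/config/remote_write/cert_file
      key_file: /etc/config/remote_write/key_file
```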
Resource limits

Key

Description

Example values

resourcesPerClusterSize (map)

Provides the capability to override the default resource requests or limits for any StackLight component for the predefined cluster sizes.

Caution

Since Container Cloud 2.28.0 (Cluster releases 17.3.0 and 16.3.0), resourcesPerClusterSize is deprecated and is overridden by the resources parameter. Therefore, use the resources parameter instead.

StackLight components for resource limits customization

Note

The below list has the componentName: <podNamePrefix>/<containerName> format.

alerta: alerta/alerta
alertmanager: prometheus-alertmanager/prometheus-alertmanager
alertmanagerWebhookServicenow: alertmanager-webhook-servicenow/alertmanager-webhook-servicenow
blackboxExporter: prometheus-blackbox-exporter/blackbox-exporter
elasticsearch: opensearch-master/opensearch # Deprecated
elasticsearchCurator: elasticsearch-curator/elasticsearch-curator
elasticsearchExporter: elasticsearch-exporter/elasticsearch-exporter
fluentdElasticsearch: fluentd-logs/fluentd-logs # Deprecated
fluentdLogs: fluentd-logs/fluentd-logs
fluentdNotifications: fluentd-notifications/fluentd
grafana: grafana/grafana
grafanaRenderer: grafana/grafana-renderer # Removed in MCC 2.27.0 (17.2.0 and 16.2.0)
iamProxy: iam-proxy/iam-proxy # Deprecated
iamProxyAlerta: iam-proxy-alerta/iam-proxy
iamProxyAlertmanager: iam-proxy-alertmanager/iam-proxy
iamProxyGrafana: iam-proxy-grafana/iam-proxy
iamProxyKibana: iam-proxy-kibana/iam-proxy # Deprecated
iamProxyOpenSearchDashboards: iam-proxy-kibana/iam-proxy
iamProxyPrometheus: iam-proxy-prometheus/iam-proxy
kibana: opensearch-dashboards/opensearch-dashboards # Deprecated
kubeStateMetrics: prometheus-kube-state-metrics/prometheus-kube-state-metrics
libvirtExporter: prometheus-libvirt-exporter/prometheus-libvirt-exporter
metricCollector: metric-collector/metric-collector
metricbeat: metricbeat/metricbeat
nodeExporter: prometheus-node-exporter/prometheus-node-exporter
opensearch: opensearch-master/opensearch
opensearchDashboards: opensearch-dashboards/opensearch-dashboards
patroniExporter: patroni/patroni-patroni-exporter
pgsqlExporter: patroni/patroni-pgsql-exporter
postgresql: patroni/patroni
prometheusEsExporter: prometheus-es-exporter/prometheus-es-exporter
prometheusMsTeams: prometheus-msteams/prometheus-msteams
prometheusRelay: prometheus-relay/prometheus-relay
prometheusServer: prometheus-server/prometheus-server
sfNotifier: sf-notifier/sf-notifier
sfReporter: sf-reporter/sf-reporter
stacklightHelmControllerController: stacklight-helm-controller/controller
telegrafDockerSwarm: telegraf-docker-swarm/telegraf-docker-swarm
telegrafDs: telegraf-ds-smart/telegraf-ds-smart # Deprecated
telegrafDsSmart: telegraf-ds-smart/telegraf-ds-smart
telegrafOpenstack: telegraf-openstack/telegraf-openstack # replaced with osdpl-exporter in 24.1
telegrafS: telegraf-docker-swarm/telegraf-docker-swarm # Deprecated
telemeterClient: telemeter-client/telemeter-client
telemeterServer: telemeter-server/telemeter-server
telemeterServerAuthServer: telemeter-server/telemeter-server-authorization-server
tfControllerExporter: prometheus-tf-controller-exporter/prometheus-tungstenfabric-exporter
tfVrouterExporter: prometheus-tf-vrouter-exporter/prometheus-tungstenfabric-exporter

resourcesPerClusterSize:
  # elasticsearch:
  opensearch:
    small:
      limits:
        cpu: "1000m"
        memory: "4Gi"
    medium:
      limits:
        cpu: "2000m"
        memory: "8Gi"
      requests:
        cpu: "1000m"
        memory: "4Gi"
    large:
      limits:
        cpu: "4000m"
        memory: "16Gi"

resources (map)

Provides the capability to override the container resource requests or limits for any StackLight component.

StackLight components for resource limits customization

Note

The below list has the componentName: <podNamePrefix>/<containerName> format.

alerta: alerta/alerta
alertmanager: prometheus-alertmanager/prometheus-alertmanager
alertmanagerWebhookServicenow: alertmanager-webhook-servicenow/alertmanager-webhook-servicenow
blackboxExporter: prometheus-blackbox-exporter/blackbox-exporter
elasticsearch: opensearch-master/opensearch # Deprecated
elasticsearchCurator: elasticsearch-curator/elasticsearch-curator
elasticsearchExporter: elasticsearch-exporter/elasticsearch-exporter
fluentdElasticsearch: fluentd-logs/fluentd-logs # Deprecated
fluentdLogs: fluentd-logs/fluentd-logs
fluentdNotifications: fluentd-notifications/fluentd
grafana: grafana/grafana
grafanaRenderer: grafana/grafana-renderer # Removed in MCC 2.27.0 (17.2.0 and 16.2.0)
iamProxy: iam-proxy/iam-proxy # Deprecated
iamProxyAlerta: iam-proxy-alerta/iam-proxy
iamProxyAlertmanager: iam-proxy-alertmanager/iam-proxy
iamProxyGrafana: iam-proxy-grafana/iam-proxy
iamProxyKibana: iam-proxy-kibana/iam-proxy # Deprecated
iamProxyOpenSearchDashboards: iam-proxy-kibana/iam-proxy
iamProxyPrometheus: iam-proxy-prometheus/iam-proxy
kibana: opensearch-dashboards/opensearch-dashboards # Deprecated
kubeStateMetrics: prometheus-kube-state-metrics/prometheus-kube-state-metrics
libvirtExporter: prometheus-libvirt-exporter/prometheus-libvirt-exporter
metricCollector: metric-collector/metric-collector
metricbeat: metricbeat/metricbeat
nodeExporter: prometheus-node-exporter/prometheus-node-exporter
opensearch: opensearch-master/opensearch
opensearchDashboards: opensearch-dashboards/opensearch-dashboards
patroniExporter: patroni/patroni-patroni-exporter
pgsqlExporter: patroni/patroni-pgsql-exporter
postgresql: patroni/patroni
prometheusEsExporter: prometheus-es-exporter/prometheus-es-exporter
prometheusMsTeams: prometheus-msteams/prometheus-msteams
prometheusRelay: prometheus-relay/prometheus-relay
prometheusServer: prometheus-server/prometheus-server
sfNotifier: sf-notifier/sf-notifier
sfReporter: sf-reporter/sf-reporter
stacklightHelmControllerController: stacklight-helm-controller/controller
telegrafDockerSwarm: telegraf-docker-swarm/telegraf-docker-swarm
telegrafDs: telegraf-ds-smart/telegraf-ds-smart # Deprecated
telegrafDsSmart: telegraf-ds-smart/telegraf-ds-smart
telegrafOpenstack: telegraf-openstack/telegraf-openstack # replaced with osdpl-exporter in 24.1
telegrafS: telegraf-docker-swarm/telegraf-docker-swarm # Deprecated
telemeterClient: telemeter-client/telemeter-client
telemeterServer: telemeter-server/telemeter-server
telemeterServerAuthServer: telemeter-server/telemeter-server-authorization-server
tfControllerExporter: prometheus-tf-controller-exporter/prometheus-tungstenfabric-exporter
tfVrouterExporter: prometheus-tf-vrouter-exporter/prometheus-tungstenfabric-exporter

resources:
  alerta:
    requests:
      cpu: "50m"
      memory: "200Mi"
    limits:
      memory: "500Mi"

Using the example above, each pod in the alerta service requests 50 millicores of CPU and 200 MiB of memory and is hard-limited to 500 MiB of memory usage. Each configuration key is optional.

Note

The logging mechanism performance depends on the cluster log load. If the cluster components send an excessive amount of logs, the default resource requests and limits for fluentdLogs (or fluentdElasticsearch) may be insufficient, which may cause its pods to be OOMKilled and trigger the KubePodCrashLooping alert. In such a case, increase the default resource requests and limits for fluentdLogs. For example:

resources:
  # fluentdElasticsearch:
  fluentdLogs:
    requests:
      memory: "500Mi"
    limits:
      memory: "1500Mi"
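Because resourcesPerClusterSize is deprecated, an override written per cluster size can be carried over to resources. The sketch below assumes a medium cluster and reuses the medium values from the resourcesPerClusterSize example above. Judging by the examples in this section, resources has no cluster size level, so the values are set directly under the component name.

```yaml
resources:
  opensearch:
    requests:
      cpu: "1000m"
      memory: "4Gi"
    limits:
      cpu: "2000m"
      memory: "8Gi"
```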
Salesforce reporter

On managed clusters with limited Internet access, a proxy is required for StackLight components that use HTTP or HTTPS, are disabled by default, and need external access once enabled. The Salesforce reporter requires Internet access through HTTPS.

Key

Description

Example values

clusterId (string)

Unique cluster identifier clusterId="<Cluster Project>/<Cluster Name>/<UID>", generated for each cluster using Cluster Project, Cluster Name, and cluster UID, separated by a slash. Used for both sf-reporter and sf-notifier services.

The clusterId key is automatically defined for each cluster. Do not set or modify it manually.

Do not modify clusterId.

sfReporter.enabled (bool)

Enables or disables reporting of Prometheus metrics to Salesforce. For details, see Container Cloud Reference Architecture: StackLight Deployment architecture. Disabled by default.

true or false

sfReporter.salesForceAuth (map)

Salesforce parameters and credentials for the metrics reporting integration.

Note

Modify this parameter if sf-notifier is not configured or if you want to use a different Salesforce user account to send reports to.

salesForceAuth:
  url: "<SF instance URL>"
  username: "<SF account email address>"
  password: "<SF password>"
  environment_id: "<Cloud identifier>"
  organization_id: "<Organization identifier>"
  sandbox_enabled: "<Set to true or false>"

sfReporter.cronjob (map)

Defines the Kubernetes cron job for sending metrics to Salesforce. By default, reports are sent at midnight server time.

cronjob:
  schedule: "0 0 * * *"
  concurrencyPolicy: "Allow"
  failedJobsHistoryLimit: ""
  successfulJobsHistoryLimit: ""
  startingDeadlineSeconds: 200
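The Salesforce reporter keys nest under sfReporter. A minimal sketch that enables the reporter and moves the report to a different time of day. The 06:00 schedule is a hypothetical value, and the sketch assumes that cron job keys left out keep their defaults. Depending on your setup, sfReporter.salesForceAuth may also need to be defined as described above.

```yaml
sfReporter:
  enabled: true
  cronjob:
    # Five-field cron syntax: minute hour day-of-month month day-of-week.
    schedule: "0 6 * * *"   # hypothetical: every day at 06:00 server time
```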
Storage class

In an HA StackLight setup, when highAvailabilityEnabled is set to true, all StackLight Persistent Volumes (PVs) use the Local Volume Provisioner (LVP) storage class so as not to rely on dynamic provisioners such as Ceph, which are not available in every deployment. In a non-HA StackLight setup, when no storage class is specified, PVs use the default storage class of a cluster.

Key

Description

Example values

storage.defaultStorageClass (string)

Defines the StorageClass to use for all StackLight Persistent Volume Claims (PVCs) if a component StorageClass is not defined using componentStorageClasses. To use the default storage class, leave the string empty.

lvp, standard

storage.componentStorageClasses (map)

Defines the storage class for individual StackLight components, overriding the defaultStorageClass value. To use the default storage class, leave the string empty.

componentStorageClasses:
  elasticsearch: ""
  opensearch: ""
  fluentd: ""
  postgresql: ""
  prometheusAlertManager: ""
  prometheusServer: ""
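Both storage keys nest under storage. The following sketch sets a default class for all StackLight PVCs and overrides it for a single component. The fast-ssd class name is hypothetical and stands for any StorageClass that exists in your cluster.

```yaml
storage:
  defaultStorageClass: lvp
  componentStorageClasses:
    # Hypothetical StorageClass name used only for this component.
    prometheusServer: "fast-ssd"
```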
Verify StackLight after configuration

This section describes how to verify StackLight after configuring its parameters as described in StackLight configuration procedure and StackLight configuration parameters. Perform the verification procedure described for a particular modified StackLight key.

Verify StackLight configuration of an OpenStack cluster

Key

Verification procedure

  • externalFQDNs.enabled

  • openstack.insecure

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the blackbox-external-endpoint target contains the configured domains (URLs).

  • openstack.enabled

  • openstack.namespace

  1. In the Grafana web UI, verify that the OpenStack dashboards are present and not empty.

  2. In the Prometheus web UI, click Alerts and verify that the OpenStack alerts are present in the list of alerts.

openstack.gnocchi.enabled

  1. In the Grafana web UI, verify that the Gnocchi dashboard is present and not empty. Alternatively, verify that the Gnocchi dashboard ConfigMap is present:

    kubectl get cm -n stacklight \
    grafana-dashboards-default-gnocchi
    
  2. In the OpenSearch Dashboards web UI, verify that logs for the gnocchi-metricd and gnocchi-api loggers are present.

openstack.ironic.enabled

  1. In the Grafana web UI, verify that the Ironic dashboard is present and not empty.

  2. In the Prometheus web UI, click Alerts and verify that the Ironic* alerts are present in the list of alerts.

  • openstack.rabbitmq.credentialsConfig

  • openstack.rabbitmq.credentialsDiscovery

In the OpenSearch Dashboards web UI, click Discover and verify that the audit-* and notifications-* indexes contain documents.

  • openstack.telegraf.credentialsConfig

  • openstack.telegraf.credentialsDiscovery

  • openstack.telegraf.interval

  • openstack.telegraf.insecure

  • openstack.telegraf.skipPublicEndpoints

In the Grafana web UI, verify that the OpenStack dashboards are present and not empty.

tungstenFabricMonitoring.enabled

  1. In the Grafana web UI, verify that the Tungsten Fabric dashboards are present and not empty.

  2. In the Prometheus web UI, click Alerts and verify that the Tungsten Fabric alerts are present in the list of alerts.

Verify StackLight configuration of a MOSK cluster

Key

Verification procedure

alerta.enabled

Verify that Alerta is present in the list of StackLight resources. An empty output indicates that Alerta is disabled.

kubectl get all -n stacklight -l app=alerta

  • alertmanagerSimpleConfig.email

  • alertmanagerSimpleConfig.email.enabled

  • alertmanagerSimpleConfig.email.route

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the Email receiver and route.

alertmanagerSimpleConfig.genericReceivers

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the intended receiver(s).

alertmanagerSimpleConfig.genericRoutes

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the intended route(s).

alertmanagerSimpleConfig.inhibitRules.enabled

Run the following command. An empty output indicates either a failure or that the feature is disabled.

kubectl get cm -n stacklight prometheus-alertmanager -o \
yaml | grep -A 6 inhibit_rules

  • alertmanagerSimpleConfig.msteams.enabled

  • alertmanagerSimpleConfig.msteams.url

  • alertmanagerSimpleConfig.msteams.route

  1. Verify that the Prometheus Microsoft Teams pod is up and running:

    kubectl get pods -n stacklight -l \
    'app=prometheus-msteams'
    
  2. Verify that the Prometheus Microsoft Teams pod logs have no errors:

    kubectl logs -f -n stacklight -l \
    'app=prometheus-msteams'
    
  3. Verify that notifications are being sent to the Microsoft Teams channel.

  • alertmanagerSimpleConfig.salesForce.enabled

  • alertmanagerSimpleConfig.salesForce.auth

  • alertmanagerSimpleConfig.salesForce.route

  1. Verify that sf-notifier is enabled. The output must include the sf-notifier pod name, 1/1 in the READY field and Running in the STATUS field.

    kubectl get pods -n stacklight
    
  2. Verify that sf-notifier successfully authenticates to Salesforce. The output must include the Salesforce authentication successful line.

    kubectl logs -f -n stacklight <sf-notifier-pod-name>
    
  3. In the Alertmanager web UI, navigate to Status and verify that the Config section contains the HTTP-salesforce receiver and route.

alertmanagerSimpleConfig.salesForce.feed_enabled

  • Verify that the sf-notifier pod logs include Creating feed item messages. For such messages to appear in logs, the DEBUG logging level must be set.

  • Verify through Salesforce:

    1. Log in to the Salesforce web UI.

    2. Click the Feed tab for a case created by sf-notifier.

    3. Verify that All Messages gets updated.

alertmanagerSimpleConfig.salesForce.link_prometheus

Verify that SF_NOTIFIER_ADD_LINKS has changed to true or false according to your customization:

kubectl get deployment sf-notifier -n stacklight \
-o=jsonpath='{.spec.template.spec.containers[0].env}' | jq .

alertmanagerSimpleConfig.serviceNow

  1. Verify that the alertmanager-webhook-servicenow pod is up and running:

    kubectl get pods -n stacklight -l \
    'app=alertmanager-webhook-servicenow'
    
  2. Verify that authentication to ServiceNow was successful. The output should include ServiceNow authentication successful. In case of authentication failure, the ServiceNowAuthFailure alert is raised.

    kubectl logs -f -n stacklight \
    <alertmanager-webhook-servicenow-pod-name>
    
  3. In your ServiceNow instance, verify that the Watchdog alert appears in the Incident table. Once the incident is created, the pod logs should include a line similar to Created Incident: bef260671bdb2010d7b540c6cc4bcbed.

In case of any failure:

  • Verify that your ServiceNow instance is not in hibernation.

  • Verify that the service user credentials, table name, and alert_id_field are correct.

  • Verify that the ServiceNow user has access to the table with permission to read, create, and update records.

  • alertmanagerSimpleConfig.slack.enabled

  • alertmanagerSimpleConfig.slack.api_url

  • alertmanagerSimpleConfig.slack.channel

  • alertmanagerSimpleConfig.slack.route

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the HTTP-slack receiver and route.

blackboxExporter.customModules

  1. Verify that your module is present in the list of modules. It can take up to 10 minutes for the module to appear in the ConfigMap.

    kubectl get cm prometheus-blackbox-exporter -n stacklight \
    -o=jsonpath='{.data.blackbox\.yaml}'
    
  2. Review the configmap-reload container logs to verify that the reload happened successfully. It can take up to 1 minute for reload to happen after the module appears in the ConfigMap.

    kubectl logs -l app=prometheus-blackbox-exporter -n stacklight -c \
    configmap-reload
    

blackboxExporter.timeoutOffset

Verify that the args parameter of the blackbox-exporter container contains the specified --timeout-offset:

kubectl get deployment.apps/prometheus-blackbox-exporter -n stacklight \
-o=jsonpath='{.spec.template.spec.containers[?(@.name=="blackbox-exporter")].args}'

For example, for blackboxExporter.timeoutOffset set to 0.1, the output should include ["--config.file=/config/blackbox.yaml","--timeout-offset=0.1"]. It can take up to 10 minutes for the parameter to be populated.

ceph.enabled

  1. In the Grafana web UI, verify that Ceph dashboards are present in the list of dashboards and are populated with data.

  2. In the Prometheus web UI, click Alerts and verify that the list of alerts contains Ceph* alerts.

  • clusterSize

  • resourcesPerClusterSize Deprecated

  • resources

  1. Obtain the list of pods:

    kubectl get po -n stacklight
    
  2. Verify that the desired resource limits or requests are set in the resources section of every container in the pod:

    kubectl get po <pod_name> -n stacklight -o yaml
    
elasticsearch.logstashRetentionTime
Removed in MCC 2.26.0 (17.1.0, 16.1.0)

Verify that the unit_count parameter contains the desired number of days:

kubectl get cm elasticsearch-curator-config -n \
stacklight -o=jsonpath='{.data.action_file\.yml}'

elasticsearch.persistentVolumeClaimSize

Verify that the PVC capacity is equal to the specified size or higher (in case of statically provisioned volumes):

kubectl get pvc -n stacklight -l "app=opensearch-master"
  • elasticsearch.retentionTime

  • logging.retentionTime
    Removed in MCC 2.26.0 (17.1.0, 16.1.0)
  1. Verify that the ConfigMap includes the new data. The output should include the changed values.

    kubectl get cm elasticsearch-curator-config -n stacklight --kubeconfig=<pathToKubeconfig> -o yaml
    
  2. Verify that the elasticsearch-curator-<jobID>-<podID> job has completed successfully:

    kubectl logs elasticsearch-curator-<jobID>-<podID> -n stacklight --kubeconfig=<pathToKubeconfig>
    
  • externalEndpointMonitoring.enabled

  • externalEndpointMonitoring.domains

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the blackbox-external-endpoint target contains the configured domains (URLs).

grafana.homeDashboard

In the Grafana web UI, verify that the desired dashboard is set as a home dashboard.

grafana.renderer.enabled
Removed in MCC 2.27.0 (17.2.0, 16.2.0)

Verify the Grafana Image Renderer. If set to true, the output should include HTTP Server started, listening at http://localhost:8081.

kubectl logs -f -n stacklight -l app=grafana \
--container grafana-renderer

highAvailabilityEnabled

Verify the number of service replicas for the HA or non-HA StackLight mode. For details, see Deployment architecture.

kubectl get sts -n stacklight
  • ironic.endpoint

  • ironic.insecure

In the Grafana web UI, verify that the Ironic BM dashboard displays valuable data (no false-positive or empty panels).

logging.dashboardsExtraConfig

Verify that the customization has been applied:

kubectl -n stacklight get cm opensearch-dashboards -o=jsonpath='{.data}'

Example of system response:

{"opensearch_dashboards.yml":"opensearch.hosts: http://opensearch-master:9200\
\nopensearch.requestTimeout: 60000\
\nopensearchDashboards.defaultAppId: dashboard/2d53aa40-ad1f-11e9-9839-052bda0fdf49\
\nserver:\
\n  host: 0.0.0.0\
\n  name: opensearch-dashboards\n"}

logging.enabled

Verify that OpenSearch, Fluentd, and OpenSearch Dashboards are present in the list of StackLight resources. An empty output indicates that the StackLight logging stack is disabled.

kubectl get all -n stacklight -l 'app in
(opensearch-master,opensearchDashboards,fluentd-logs)'

logging.externalOutputs

  1. Verify the fluentd-logs Kubernetes configmap in the stacklight namespace:

    kubectl get cm -n stacklight fluentd-logs -o \
    "jsonpath={.data['output-logs\.conf']}"
    

    The output must contain an additional output stream according to configured external outputs.

  2. After restart of the fluentd-logs pods, verify that their logs do not contain any delivery error messages. For example:

    kubectl logs -n stacklight -f <fluentd-logs-pod-name> | grep '\[error\]'
    

    Example output with a missing parameter:

    [...]
    2023-07-25 09:39:33 +0000 [error]: config error file="/etc/fluentd/fluent.conf" error_class=Fluent::ConfigError error="host or host_with_port is required"
    

    If a parameter is missing, verify the configuration as described in Enable log forwarding to external destinations.

  3. Verify that the log messages are appearing in the external server database.
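
The error check from step 2 can be turned into a one-shot count instead of a live stream. The following is a minimal sketch under stated assumptions: the count_fluentd_errors helper name is hypothetical, and the command omits the -f flag so that the log read terminates.

```shell
# Hypothetical helper: counts the log lines on stdin that carry the
# [error] tag, as in the example output above.
count_fluentd_errors() {
  # grep -c prints 0 and exits non-zero when nothing matches;
  # "|| true" keeps the helper usable in scripts that run with set -e.
  grep -c '\[error\]' || true
}

# Usage against a live cluster (without -f, so the command returns):
# kubectl logs -n stacklight <fluentd-logs-pod-name> | count_fluentd_errors
```

A result of 0 only shows that no lines with the [error] tag were logged so far. It does not replace step 3, which confirms that messages actually reach the external server.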

To troubleshoot issues with Splunk, refer to No logs are forwarded to Splunk.

logging.externalOutputSecretMounts

Verify that files were created for the specified path in the Fluentd container:

kubectl get pods -n stacklight -o name | grep fluentd-logs | \
xargs -I{} kubectl exec -i {} -c fluentd-logs -n stacklight -- \
ls <logging.externalOutputSecretMounts.mountPath>

logging.extraConfig

Verify that the customization has been applied:

kubectl -n stacklight get cm opensearch-master-config -o=jsonpath='{.data}'

Example of system response:

{"opensearch.yml":"cluster.name: opensearch\
\nnetwork.host: 0.0.0.0\
\nplugins.security.disabled: true\
\nplugins.index_state_management.enabled: false\
\npath.data: /usr/share/opensearch/data\
\ncompatibility.override_main_response_version: true\
\ncluster.max_shards_per_node: 5000\n"}
logging.level
Removed in MCC 2.26.0 (17.1.0, 16.1.0)
  1. Inspect the fluentd-logs Kubernetes configmap in the stacklight namespace:

    kubectl get cm -n stacklight fluentd-logs \
    -o "jsonpath={.data['output-logs\.conf']}"
    
  2. In the output, locate the following grep filter section. The pattern should contain all logging levels below the expected one.

    @type grep
    <exclude>
     key severity_label
     pattern /^<pattern>$/
    </exclude>
    

logging.metricQueries

For details, see steps 4.2 and 4.3 in Create logs-based metrics.

logging.syslog.enabled

  1. Verify the fluentd-logs Kubernetes configmap in the stacklight namespace:

    kubectl get cm -n stacklight fluentd-logs -o \
    "jsonpath={.data['output-logs\.conf']}"
    

    The output must contain an additional container with the remote syslog configuration.

  2. After restart of the fluentd-logs pods, verify that their logs do not contain any delivery error messages.

  3. Verify that the log messages are appearing in the remote syslog database.

logging.syslog.packetSize

Verify that packetSize has changed according to your customization:

kubectl get cm -n stacklight fluentd-logs -o \
yaml | grep packet_size

metricFilter

  1. In the Prometheus web UI, navigate to Status > Configuration.

  2. Verify that the following fields in the metric_relabel_configs section for the kubernetes-nodes-cadvisor and prometheus-kube-state-metrics scrape jobs have the required configuration:

    • action is set to keep or drop

    • regex contains a regular expression with configured namespaces delimited by |

    • source_labels is set to [namespace]
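
To reason about which namespaces such a rule affects, note that Prometheus anchors relabeling regular expressions at both ends, so a namespace must match the whole expression. The following is a minimal sketch that emulates this for simple |-delimited lists. The namespace_matches helper name and the example namespaces are assumptions, and grep uses POSIX extended syntax rather than the RE2 syntax of Prometheus, so the two agree only for plain alternations like this one.

```shell
# Hypothetical helper: emulates how a metric_relabel_configs rule with
# source_labels [namespace] evaluates one namespace value. grep -x requires
# the whole line to match, mirroring the anchored regex in Prometheus.
namespace_matches() {
  regex=$1
  namespace=$2
  printf '%s\n' "$namespace" | grep -Eqx -- "$regex"
}

# Example values (assumptions): a filter configured for two namespaces.
regex='stacklight|kube-system'
namespace_matches "$regex" stacklight && echo "stacklight: matched"
namespace_matches "$regex" stacklight-dev || echo "stacklight-dev: not matched"
```

With action set to keep, only series from matched namespaces are retained. With action set to drop, series from matched namespaces are discarded.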

mke.dockerdDataRoot

In the Prometheus web UI, navigate to Alerts and verify that the MKEAPIDown alert is not firing as a false positive because of a missing certificate.

mke.enabled

  1. In the Grafana web UI, verify that the MKE Cluster and MKE Containers dashboards are present and not empty.

  2. In the Prometheus web UI, navigate to Alerts and verify that the MKE* alerts are present in the list of alerts.

nodeExporter.extraCollectorsEnabled

In the Prometheus web UI, run the following PromQL queries. The result should not be empty.

node_scrape_collector_duration_seconds{collector="<COLLECTOR_NAME>"}
node_scrape_collector_success{collector="<COLLECTOR_NAME>"}

nodeExporter.netDeviceExclude

  1. Verify the DaemonSet configuration of the Node Exporter:

    kubectl get daemonset -n stacklight prometheus-node-exporter \
    -o=jsonpath='{.spec.template.spec.containers[0].args}' | jq .
    

    Expected system response:

    [
      "--path.procfs=/host/proc",
      "--path.sysfs=/host/sys",
      "--collector.netclass.ignored-devices=<paste_your_excluding_regexp_here>",
      "--collector.netdev.device-blacklist=<paste_your_excluding_regexp_here>",
      "--no-collector.ipvs"
    ]
    
  2. In the Prometheus web UI, run the following PromQL query. The expected result is 1.

    absent(node_network_transmit_bytes_total{device=~"<paste_your_excluding_regexp_here>"})
    
  • nodeSelector.component

  • nodeSelector.default

  • tolerations.component

  • tolerations.default

Verify that the pods of the appropriate components are located on the intended nodes:

kubectl get pod -o=custom-columns=NAME:.metadata.name,\
STATUS:.status.phase,NODE:.spec.nodeName -n stacklight
  • prometheusRelay.clientTimeout

  • prometheusRelay.responseLimitBytes

  1. Verify that the Prometheus Relay pod is up and running:

    kubectl get pods -n stacklight -l 'component=relay'
    
  2. Verify that the values have changed according to your customization. Replace <prometheus-relay-pod-name> with the pod name obtained in the previous step:

    kubectl get pods -n stacklight <prometheus-relay-pod-name> \
    -o=jsonpath='{.spec.containers[0].env}' | jq .
    

prometheusServer.alertsCommonLabels

  1. In the Prometheus web UI, navigate to Status > Configuration.

  2. Verify that the alerting.alert_relabel_configs section contains the customization for common labels that you added in prometheusServer.alertsCommonLabels during StackLight configuration.

prometheusServer.customAlerts

In the Prometheus web UI, navigate to Alerts and verify that the list of alerts has changed according to your customization.

prometheusServer.customRecordingRules

  1. In the Prometheus web UI, navigate to Status > Rules.

  2. Verify that the list of Prometheus recording rules has changed according to your customization.

prometheusServer.customScrapeConfigs

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the required target has appeared in the list of targets.

It may take up to 10 minutes for the change to apply.

prometheusServer.persistentVolumeClaimSize

Verify that the PVC capacity is equal to the specified size or higher (in case of statically provisioned volumes):

kubectl get pvc -n stacklight -l "app=prometheus,component=server"
  • prometheusServer.alertResendDelay

  • prometheusServer.queryConcurrency

  • prometheusServer.retentionSize

  • prometheusServer.retentionTime

  1. In the Prometheus web UI, navigate to Status > Command-Line Flags.

  2. Verify the values for the following flags:

    • rules.alert.resend-delay

    • query.max-concurrency

    • storage.tsdb.retention.size

    • storage.tsdb.retention.time

prometheusServer.remoteWrites

  1. Inspect the remote_write configuration in the Status > Configuration section of the Prometheus web UI.

  2. Inspect the Prometheus server logs for errors:

    kubectl logs prometheus-server-0 -c prometheus-server -n stacklight
    

prometheusServer.remoteWriteSecretMounts

Verify that files were created for the specified path in the Prometheus container:

kubectl exec -it prometheus-server-0 -c prometheus-server -n \
stacklight -- ls <remoteWriteSecretMounts.mountPath>

prometheusServer.watchDogAlertEnabled

In the Prometheus web UI, navigate to Alerts and verify that the list of alerts contains the Watchdog alert.

  • sfReporter.cronjob

  • sfReporter.enabled

  • sfReporter.salesForce

  1. Verify that Salesforce reporter is enabled. The SUSPEND field in the output must be False.

    kubectl get cronjob -n stacklight
    
  2. Verify that the Salesforce reporter configuration includes all expected queries:

    kubectl get configmap -n stacklight \
    sf-reporter-config -o yaml
    
  3. After cron job execution (by default, at midnight server time), obtain the Salesforce reporter pod name. The output should include the Salesforce reporter pod name and STATUS must be Completed.

    kubectl get pods -n stacklight
    
  4. Verify that Salesforce reporter successfully authenticates to Salesforce and creates records. The output must include the Salesforce authentication successful, Created record or Duplicate record and Updated record lines.

    kubectl logs -n stacklight <sf-reporter-pod-name>
    
  • sslCertificateMonitoring.domains

  • sslCertificateMonitoring.enabled

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the blackbox target contains the configured domains (URLs).

  • storage.componentStorageClasses

  • storage.defaultStorageClass

Verify that the PVCs of the appropriate components have been created according to the configured StorageClass:

kubectl get pvc -n stacklight
Tune OpenSearch performance

The following hardware recommendations and software settings apply for better OpenSearch performance in a MOSK cluster.

To tune OpenSearch performance:

  1. Depending on your cluster size, set the required disk and CPU size along with memory limit and heap size.

    Heap size is calculated in StackLight as ⅘ of the specified memory limit. If the calculated heap size even slightly exceeds 32 GB, a significant amount of memory is wasted because of the loss of Ordinary Object Pointers (OOPS) compression, which allows storing 64-bit pointers in 32 bits.

    Since Container Cloud 2.25.0 (Cluster releases 17.0.0 and 16.0.0), to prevent this behavior, for the memory limit in the 31-50 GB range, the heap size is fixed at 31 GB using the enforceOopsCompression parameter, which is enabled by default. For details, see Logging: Enforce OOPS compression. Beyond this range, the benefit of OOPS compression is lost, so the ⅘ formula applies again.

    OpenSearch is write-heavy, so SSD is preferable as a disk type.

    Hardware recommendations for OpenSearch

    Cluster size   Memory limit (GB)   Heap size (GB)   CPU (# of cores)
    Small          16                  12.8             2
    Medium         32                  25.6             4
    Large          64                  51.2             8

    To configure hardware settings for OpenSearch, refer to Resource limits in the StackLight configuration procedure section.

  2. Configure the maximum count of mmap files. OpenSearch uses mmapfs to map shards stored on disk. The vm.max_map_count kernel parameter, which limits the number of memory map areas, is set to 65530 by default.

    To verify max_map_count:

    sysctl -n vm.max_map_count
    

    To increase max_map_count, follow the Create MOSK host profiles procedure.

    Example configuration:

    kernelParameters:
      sysctl:
        vm.max_map_count: "<value>"
    

    Extended retention periods, which depend on open shards, require increasing this value significantly, for example, to 262144.

  3. Limit swap usage because swapping significantly degrades performance. Lower swappiness to 1 or 0 (to disable swap). For details, see the Create MOSK host profiles procedure.

    Example configuration:

    kernelParameters:
      sysctl:
        vm.swappiness: "<value>"
    
  4. Configure the kernel I/O scheduler to improve timing of disk writing operations. Change it to one of the following options:

    • none - applies the FIFO queue.

    • mq-deadline - applies three queues: FIFO read, FIFO write, and sorted.

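    You can also change the I/O scheduler through BareMetalHostProfile. However, the specific implementation highly depends on the disk type used. To check the scheduler of a disk, for example, sda, run the following command. The active scheduler is displayed in square brackets: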

    cat /sys/block/sda/queue/scheduler
    
    mq-deadline kyber bfq [none]
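
The ⅘ heap size rule from step 1 of this procedure can be sketched as a small shell helper for planning a memory limit. This is an illustration only: the heap_size_gb name is hypothetical, and the sketch deliberately does not model the enforceOopsCompression override, so it reproduces just the plain formula used in the hardware recommendations table.

```shell
# Hypothetical helper: applies the 4/5 rule to a memory limit given in GB.
# It does not model the enforceOopsCompression override described in step 1.
heap_size_gb() {
  awk -v limit="$1" 'BEGIN { printf "%.1f\n", limit * 4 / 5 }'
}

# Memory limits of the Small, Medium, and Large cluster sizes.
for limit in 16 32 64; do
  echo "memory limit ${limit} GB -> heap $(heap_size_gb "$limit") GB"
done
```

For these three limits, the helper yields 12.8, 25.6, and 51.2 GB, which match the Heap size column of the table.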
    
Export logs from OpenSearch Dashboards to CSV

Available since MCC 2.23.0 (12.7.0 and 11.7.0)

This section describes how to export logs from the OpenSearch Dashboards navigation panel to the CSV format.

Caution

The export is limited to 10 000 rows, regardless of the resulting file size.

Note

The following instruction describes how to export all logs from the opensearch-master-0 node of an OpenSearch cluster.

To export logs from the OpenSearch Dashboards navigation panel to CSV:

  1. Log in to the OpenSearch Dashboards web UI as described in Getting access.

  2. Navigate to the Discover page.

  3. In the left navigation panel, select the required log index pattern from the top drop-down menu. For example, system* for system logs and audit* for audit logs.

  4. In the middle top menu, click Add filter and add the required filters. For example:

    • event.provider matches the opensearch-master logger

    • orchestrator.pod matches the opensearch-master-0 node name

  5. In Search field names, search for required fields to be present in the resulting CSV file. For example:

    • orchestrator.pod for opensearch-master-0

    • message for the log message

  6. In the top right menu:

    1. Click Save to save the filter after naming it.

    2. Click Reporting > Generate CSV.

    When the report generation completes, download the file depending on your browser settings.

OpenSearch Dashboards

This section describes OpenSearch Dashboards, which enables you to observe a visual representation of logs and Kubernetes events of your cluster.

View OpenSearch Dashboards

OpenSearch Dashboards is part of the StackLight logging stack. Using the OpenSearch Dashboards web UI, you can view the visual representation of your OpenStack deployment notifications, logs, Kubernetes events, and other cluster notifications related to your deployment.

Note

By default, the StackLight logging stack, including OpenSearch Dashboards, is disabled. For details, see Mirantis Container Cloud Reference Architecture: Deployment architecture.

To view the OpenSearch Dashboards:

  1. Log in to the OpenSearch Dashboards web UI as described in Getting access.

  2. Click the required dashboard to inspect the visualizations or perform a search:

    Dashboard

    Description

    Notifications

    Provides visualizations on the number of notifications over time per source and severity, host, and breakdowns. The dashboard includes search.

    K8s events

    Provides visualizations on the number of Kubernetes events per type, and top event-producing resources and namespaces by reason and event type. Includes search.

    System Logs

    Available for clusters created since Container Cloud 2.26.0 (Cluster releases 17.1.x, 16.1.x, or later).

    Provides visualizations on the number of log messages per severity, source, and top log-producing host, namespaces, containers, and applications. Includes search.

    Caution

    Due to a known issue, this dashboard does not exist in Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0). The issue is addressed in Container Cloud 2.26.1 (Cluster releases 17.1.1 and 16.1.1). To work around the issue in 2.26.0, you can map the fields of the logstash index to the system one and view logs in the deprecated Logs dashboard. For mapping details, see System index fields mapped to Logstash index fields.

    Logs Deprecated in 2.26.0 (17.1.0 and 16.1.0)

    Available only for clusters created before Container Cloud 2.26.0 (Cluster releases 17.0.x, 16.0.x, or earlier).

    Analogous to System Logs but contains logs generated only for the mentioned Cluster releases.

Search in OpenSearch Dashboards

OpenSearch Dashboards provides the following search tools:

  • Filters

  • Queries

  • Full-text search

Filters enable you to organize the output information using the interface tools. You can search for information by a set of indexed fields using a variety of logical operators.

Queries enable you to construct search commands using OpenSearch query domain-specific language (DSL) expressions. These expressions allow you to search by the fields not included in the index.

In addition to filters and queries, you can use the Search input field for full-text search.

Create a filter
  1. From the dashboard view, click Add filter.

  2. In the dialog that opens, select the field of search in the Field drop-down menu.

  3. Select the logical operator in the Operator drop-down menu.

  4. Type or select the filter value from the Value drop-down menu.

Create a filter using the ‘flat object’ field type

Available since MCC 2.23.0 (12.7.0 and 11.7.0)

For the orchestrator.labels field of the system and audit log indices, you can use the flat_object field type to apply the filtering using value or valueAndPath. For example:

  • Using value: to obtain all logs produced by iam-proxy, add the following filters:

    • orchestrator.type that matches kubernetes

    • orchestrator.labels._value that matches iam-proxy

  • Using valueAndPath: to obtain all logs produced by the OpenSearch cluster, add the following filters:

    • orchestrator.type that matches kubernetes

    • orchestrator.labels._valueAndPath that matches orchestrator.labels.app=opensearch-master

Create a query
  1. From the dashboard view, click Add filter.

  2. In the dialog that opens, click Edit as Query DSL and type in the search request.

Learn more

OpenSearch documentation:

View Grafana dashboards

Using the Grafana web UI, you can view the visual representation of the metric graphs based on the time series databases.

Most Grafana dashboards include a View logs in OpenSearch Dashboards link to immediately view relevant logs in the OpenSearch Dashboards web UI. The OpenSearch Dashboards web UI displays logs filtered using the Grafana dashboard variables, such as the drop-downs. Once you amend the variables, wait for Grafana to generate a new URL.

Note

Due to a known issue, the View logs in OpenSearch Dashboards link does not work in Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0). The issue is addressed in Container Cloud 2.26.1 (Cluster releases 17.1.1 and 16.1.1).

Caution

Drop-down lists in Grafana dashboards are limited to 1000 lines. Therefore, if you require data on a specific item, use the filter by name instead.

Note

Grafana dashboards that present node data have an additional Node identifier drop-down menu. By default, it is set to machine to display short names for Kubernetes nodes. To display Kubernetes node name labels, change this option to node.

To view the Grafana dashboards:

  1. Log in to the Grafana web UI as described in Getting access.

  2. From the drop-down list, select the required dashboard to inspect the status and statistics of the corresponding service in your management or MOSK cluster:

    Component

    Dashboard

    Description

    Ceph cluster

    Ceph Cluster

    Provides the overall health status of the Ceph cluster, capacity, latency, and recovery metrics.

    Ceph Nodes

    Provides an overview of the host-related metrics, such as the number of Ceph Monitors, Ceph OSD hosts, average usage of resources across the cluster, network and hosts load.

    This dashboard is deprecated since Container Cloud 2.25.0 (Cluster releases 17.0.0 and 16.0.0) and is removed in Container Cloud 2.26.0 (Cluster releases 17.1.0 and 16.1.0).

    Therefore, Mirantis recommends switching to the following dashboards in the current release:

    • For Ceph stats, use the Ceph Cluster dashboard.

    • For resource utilization, use the System dashboard, which includes filtering by Ceph node labels, such as ceph_role_osd, ceph_role_mon, and ceph_role_mgr.

    Ceph OSDs

    Provides metrics for Ceph OSDs, including the Ceph OSD read and write latencies, distribution of PGs per Ceph OSD, Ceph OSDs and physical device performance.

    Ceph Pools

    Provides metrics for Ceph pools, including the client IOPS and throughput by pool and pools capacity usage.

    Ironic

    Ironic BM

    Provides graphs on Ironic health, HTTP API availability, provisioned nodes by state and installed ironic-conductor backend drivers.

    Container Cloud

    Clusters Overview

    Represents the main cluster capacity statistics for all clusters of a Container Cloud deployment where StackLight is installed.

    Note

    Due to a known issue, the Prometheus Targets Unavailable panel of the Clusters Overview dashboard does not display data for managed clusters of the 11.7.0, 11.7.4, 12.5.0, and 12.7.x series Cluster releases after update to Container Cloud 2.24.0.

    Etcd

    Available since Container Cloud 2.21.0 (Cluster release 11.5.0). Provides graphs on database size, leader elections, requests duration, incoming and outgoing traffic.

    MCC Applications Performance

    Available since Container Cloud 2.23.0 (Cluster release 11.7.0). Provides information on the Container Cloud internals work based on Golang, controller runtime, and custom metrics. You can use it to verify performance of applications and for troubleshooting purposes.

    Kubernetes resources

    Kubernetes Calico

    Provides metrics of the entire Calico cluster usage, including the cluster status, host status, and Felix resources.

    Kubernetes Cluster

    Provides metrics for the entire Kubernetes cluster, including the cluster status, host status, and resources consumption.

    Kubernetes Containers

    Provides charts showing resource consumption per deployed Pod containers running on Kubernetes nodes.

    Kubernetes Deployments

    Provides information on the desired and current state of all service replicas deployed on a Container Cloud cluster.

    Kubernetes Namespaces

    Provides the Pods state summary and the CPU, MEM, network, and IOPS resources consumption per namespace.

    Kubernetes Nodes

    Provides charts showing resources consumption per Container Cloud cluster node.

    Kubernetes Pods

    Provides charts showing resources consumption per deployed Pod.

    NGINX

    NGINX

    Provides the overall status of the NGINX cluster and information about NGINX requests and connections.

    OpenStack

    OpenStack - Overview

    Provides general information on OpenStack services resources consumption, API errors, deployed OpenStack compute nodes and block storage usage.

    OpenStack Ingress controller

    Available since MOSK 23.3. Monitors the number of requests, response times and statuses, as well as the number of Ingress SSL certificates including expiration time and resources usage.

    OpenStack Instances Availability

    Available since MOSK 23.2. Provides information about the availability of instance floating IPs per OpenStack compute node and project. Also, enables monitoring of probe statistics for individual instance floating IPs.

    OpenStack Network IP Capacity

    Available since MOSK 25.1. Provides information about the statistics of IP address allocation for external networks and subnets on non-Tungsten Fabric based MOSK clusters. For configuration details, see Start monitoring IP address capacity.

    OpenStack PortProber

    Available since MOSK 24.2. Provides information about the availability of Neutron ports per OpenStack compute node, project, and port owner.

    OpenStack PortProber [Deprecated]

    Available since MOSK 25.1. Provides information about the availability of Neutron ports per OpenStack compute node, project, and port owner. Deprecated in favor of the OpenStack PortProber dashboard.

    Use this deprecated dashboard only to access old data collected before MOSK 25.1.

    OpenStack PowerDNS

    Available since MOSK 24.3. Provides various statistics about OpenStack PowerDNS servers, such as connections, resources, queries, rings, errors, and others.

    OpenStack Usage Efficiency

    Available since MOSK 23.3. Provides information about requested (allocated) CPU and memory usage efficiency on a per-project and per-flavor basis. Aims to identify flavors that specific projects are not effectively using, with allocations significantly exceeding actual usage. Also, evaluates per-instance underuse for specific projects.

    KPI - Provisioning

    Provides provisioning statistics for OpenStack compute instances, including graphs on VM creation results by day.

    Cinder

    Provides graphs on the OpenStack Block Storage service health, HTTP API availability, pool capacity and utilization, number of created volumes and snapshots.

    Glance

    Provides graphs on the OpenStack Image service health, HTTP API availability, number of created images and snapshots.

    Gnocchi

    Provides panels and graphs on the Gnocchi health and HTTP API availability.

    Heat

    Provides graphs on the OpenStack Orchestration service health, HTTP API availability and usage.

    Ironic OpenStack

    Provides graphs on the OpenStack Bare Metal Provisioning service health, HTTP API availability, provisioned nodes by state and installed ironic-conductor backend drivers.

    Keystone

    Provides graphs on the OpenStack Identity service health, HTTP API availability, number of tenants and users by state.

    Neutron

    Provides graphs on the OpenStack networking service health, HTTP API availability, agents status and usage of Neutron L2 and L3 resources.

    NGINX Ingress controller

    Not recommended. Deprecated since MOSK 23.3 and removed in MOSK 24.1. Use OpenStack Ingress controller instead.

    Monitors the number of requests, response times and statuses, as well as the number of Ingress SSL certificates including expiration time and resources usage.

    Nova - Availability Zones

    Provides detailed graphs on the OpenStack availability zones and hypervisor usage.

    Nova - Hypervisor Overview

    Provides a set of single-stat panels presenting resources usage by host.

    Nova - Instances

    Provides graphs on libvirt Prometheus exporter health and resources usage. Monitors the number of running instances and tasks and allows sorting the metrics by top instances.

    Nova - Overview

    Provides graphs on the OpenStack compute services (nova-scheduler, nova-conductor, and nova-compute) health, as well as HTTP API availability.

    Nova - Tenants

    Provides graphs on CPU, RAM, disk throughput, IOPS, and space usage and allocation and allows sorting the metrics by top tenants.

    Nova - Users

    Provides graphs on CPU, RAM, disk throughput, IOPS, and space usage and allocation and allows sorting the metrics by top users.

    Nova - Utilization

    Provides detailed graphs on Nova hypervisor resources capacity and consumption.

    Memcached

    Memcached Prometheus exporter dashboard. Monitors Kubernetes Memcached pods and displays memory usage, hit rate, evicts and reclaims rate, items in cache, network statistics, and commands rate.

    MySQL

    MySQL Prometheus exporter dashboard. Monitors Kubernetes MySQL pods, resources usage and provides details on current connections and database performance.

    RabbitMQ [Deprecated]

    Not recommended. Deprecated since MOSK 25.1. RabbitMQ Prometheus exporter dashboard. Monitors Kubernetes RabbitMQ pods, resources usage and provides details on cluster utilization and performance.

    Caution

    This dashboard is renamed from RabbitMQ to RabbitMQ [Deprecated] in MOSK 25.1 and will be removed in one of the following releases in favor of the RabbitMQ Overview and RabbitMQ Erlang dashboards.

    For deprecation details, see Deprecation notes: RabbitMQ Prometheus Exporter.

    RabbitMQ Erlang

    Available since MOSK 25.1. Monitors RabbitMQ BEAM performance, memory details, load and distribution metrics using native Prometheus plugin metrics.

    RabbitMQ Overview

    Available since MOSK 25.1. Monitors RabbitMQ node performance, resource usage, message queue, channel, and connection statistics using native Prometheus plugin metrics.

    Cassandra

    Provides graphs on Cassandra clusters’ health, ongoing operations, and resource consumption.

    Kafka

    Provides graphs on Kafka clusters’ and broker health, as well as broker and topic usage.

    Redis

    Provides graphs on Redis clusters’ and pods’ health, connections, command calls, and resource consumption.

    Tungsten Fabric

    Tungsten Fabric Controller

    Provides graphs on the overall Tungsten Fabric Controller cluster processes and usage.

    Tungsten Fabric vRouter

    Provides graphs on the overall Tungsten Fabric vRouter cluster processes and usage.

    ZooKeeper

    Provides graphs on ZooKeeper clusters’ quorum health and resource consumption.

    StackLight

    Alertmanager

    Provides performance metrics on the overall health status of the Prometheus Alertmanager service, the number of firing and resolved alerts received for various periods, the rate of successful and failed notifications, and the resources consumption.

    OpenSearch

    Provides information about the overall health status of the OpenSearch cluster, including the resources consumption, number of operations and their performance.

    OpenSearch Indices

    Provides detailed information about the state of indices, including their size, the number and the size of segments.

    Grafana

    Provides performance metrics for the Grafana service, including the total number of Grafana entities, CPU and memory consumption.

    PostgreSQL

    Provides PostgreSQL statistics, including read (DQL) and write (DML) row operations, transaction and lock, replication lag and conflict, and checkpoint statistics, as well as PostgreSQL performance metrics.

    Prometheus

    Provides the availability and performance behavior of the Prometheus servers, the sample ingestion rate, and system usage statistics per server. Also, provides statistics about the overall status and uptime of the Prometheus service, the chunks number of the local storage memory, target scrapes, and queries duration.

    Prometheus Relay

    Provides service status and resource consumption metrics.

    Telemeter Server

    Provides statistics and the overall health status of the Telemeter service.

    Note

    Due to a known issue, the Telemeter Client Status panel of the Telemeter Server dashboard does not display data for managed clusters of the 11.7.0, 11.7.4, 12.5.0, and 12.7.x series Cluster releases after an update to Container Cloud 2.24.0.

    System

    System

    Provides detailed resource consumption and operating system information per Container Cloud cluster node.

    Mirantis Kubernetes Engine (MKE)

    MKE Cluster

    Provides a global overview of an MKE cluster: statistics about the number of worker and manager nodes, containers, images, and Swarm services.

    MKE Containers

    Provides per-container resource consumption metrics for the MKE containers, such as CPU, RAM, and network.

Export data from Table panels of Grafana dashboards to CSV

This section describes how to export data from Table panels of Grafana dashboards to .csv files.

Note

Grafana performs data exports for individual panels on a dashboard, not the entire dashboard.

To export data from Table panels of Grafana dashboards to CSV:

  1. Log in to the Grafana web UI as described in Getting access.

  2. In the upper-right corner of the required Table panel, click the kebab menu icon and select Inspect > Data.

  3. In Data options of the Data tab, configure export options:

    • Enable Apply panel transformation

    • Leave Formatted data enabled

    • Enable Download for Excel, if required

  4. Click Download CSV.
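After the download, the exported file can be processed with any CSV-aware tool. The following is a minimal Python sketch, assuming a hypothetical export with the Time, Node, and Value columns. The actual column names depend on the panel query and the transformations applied, and the sample data is illustrative only.

```python
import csv
import io

# Hypothetical sample of an exported Table panel. Real column names
# depend on the panel query and the transformations applied.
SAMPLE_EXPORT = """Time,Node,Value
2023-08-01 10:00:00,node-1,42.5
2023-08-01 10:00:00,node-2,17.0
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def max_value_row(rows, column="Value"):
    """Return the row with the largest numeric value in the given column."""
    return max(rows, key=lambda row: float(row[column]))


rows = load_rows(SAMPLE_EXPORT)
print(max_value_row(rows)["Node"])
```

To process a real export, read the downloaded file with open() and pass its contents to load_rows().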

StackLight alerts

This section provides an overview of the available predefined StackLight alerts, including OpenStack, Tungsten Fabric, Container Cloud, Ceph, StackLight, MKE, and other alerts that can contain information about both OpenStack and MOSK clusters.

To view the alerts, use the Prometheus web UI. To view the firing alerts, use the Alertmanager or Alerta web UI.

For alert troubleshooting guidelines, see Troubleshoot alerts.
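Firing alerts can also be post-processed outside the web UI. The following is a minimal Python sketch that filters a list of alerts by severity. The payload is a hypothetical sample modeled on the list returned by the Alertmanager API v2 alerts endpoint. The lowercase severity label values and the node name are assumptions made for illustration.

```python
import json

# Hypothetical payload modeled on the list returned by the Alertmanager
# API v2 alerts endpoint. Label values are illustrative assumptions.
SAMPLE_ALERTS = json.loads("""
[
  {"labels": {"alertname": "LibvirtDown", "severity": "critical", "node": "cmp-1"},
   "status": {"state": "active"}},
  {"labels": {"alertname": "MariadbGaleraOutOfSync", "severity": "warning"},
   "status": {"state": "active"}},
  {"labels": {"alertname": "RedisDown", "severity": "critical"},
   "status": {"state": "suppressed"}}
]
""")


def active_alerts(alerts, severity):
    """Return the names of active alerts that carry the given severity label."""
    return [
        alert["labels"]["alertname"]
        for alert in alerts
        if alert["status"]["state"] == "active"
        and alert["labels"].get("severity") == severity
    ]


print(active_alerts(SAMPLE_ALERTS, "critical"))
```

Suppressed alerts, for example silenced ones, are skipped so that the result contains only the alerts that currently require attention.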

OpenStack

This section describes the alerts available for OpenStack services.

OpenStack system services

This section describes the alerts available for the OpenStack system services.

Libvirt

This section lists the alerts for the libvirt service.


LibvirtDown

Severity

Critical

Summary

Failure to gather libvirt metrics.

Description

Libvirt Exporter fails to gather metrics on the {{ $labels.node }} node for 2 minutes.

LibvirtExporterTargetDown

Severity

Major

Summary

Libvirt Exporter Prometheus target is down.

Description

Prometheus fails to scrape metrics from the Libvirt Exporter endpoint on the {{ $labels.node }} node.

LibvirtExporterTargetsOutage

Severity

Critical

Summary

Libvirt Exporter Prometheus targets outage.

Description

Prometheus fails to scrape metrics from all Libvirt Exporter endpoints.
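The summaries and descriptions in this reference contain placeholders such as {{ $labels.node }} and {{ $value }}, which are replaced with the label values and the expression value of the firing alert. Prometheus renders them using Go templating. The following Python sketch only imitates the substitution of these two simple placeholder forms to show how a rendered description looks. It does not handle template functions such as humanizeTimestamp, and the cmp-1 node name is hypothetical.

```python
import re

# Matches the two simple placeholder forms: {{ $labels.<name> }} and {{ $value }}.
PLACEHOLDER = re.compile(r"\{\{\s*\$(labels\.(\w+)|value)\s*\}\}")


def render(template, labels, value=None):
    """Substitute label and value placeholders in an alert description."""
    def replace(match):
        if match.group(1) == "value":
            return str(value)
        # Unknown labels render as an empty string in this sketch.
        return labels.get(match.group(2), "")
    return PLACEHOLDER.sub(replace, template)


description = (
    "Libvirt Exporter fails to gather metrics on the "
    "{{ $labels.node }} node for 2 minutes."
)
print(render(description, {"node": "cmp-1"}))
```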

MariaDB

This section lists the alerts for the MariaDB service.


MariadbClusterDown

Severity

Critical

Summary

MariaDB cluster is down.

Description

The MariaDB {{ $labels.cluster }} cluster in the {{ $labels.namespace }} namespace is down.

MariadbExporterClusterTargetsOutage

Replaced with MariadbExporterTargetDown in 23.3

Severity

Critical

Summary

MariaDB Exporter cluster Prometheus targets outage.

Description

Prometheus fails to scrape metrics from 2/3 of the {{ $labels.cluster }} cluster exporter endpoints (more than 1/10 failed scrapes).

MariadbExporterTargetDown

Since 23.3 to replace MariadbExporterClusterTargetsOutage

Severity

Critical

Summary

MariaDB Exporter cluster Prometheus target down.

Description

Prometheus fails to scrape metrics from the {{ $labels.pod }} Pod of the {{ $labels.cluster }} cluster on the {{ $labels.node }} node.

MariadbGaleraDonorFallingBehind

Severity

Warning

Summary

MariaDB node is falling behind.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} MariaDB node in the {{ $labels.cluster }} cluster is falling behind (queue size {{ $value }}).

MariadbGaleraNotReady

Severity

Major

Summary

MariaDB cluster node is not ready.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} MariaDB node in the {{ $labels.cluster }} cluster is not ready to accept queries.

MariadbGaleraOutOfSync

Severity

Warning

Summary

MariaDB cluster node is out of sync.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} MariaDB node in the {{ $labels.cluster }} cluster is not in sync ({{ $value }} != 4).

MariadbInnodbLogWaits

Severity

Warning

Summary

MariaDB InnoDB log writes are stalling.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} MariaDB InnoDB logs are waiting for disk at a rate of {{ $value }} per second (more than 10).

MariadbTableLockWaitHigh

Severity

Warning

Summary

MariaDB table lock waits are high.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} MariaDB node in the {{ $labels.cluster }} cluster has high table lock waits of {{ $value }} percentage (more than 30).
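The MariadbGaleraOutOfSync description compares the reported value with 4. Assuming that the value corresponds to the Galera wsrep_local_state status variable, where 4 means Synced, the following Python sketch maps the numeric state to its name. Whether the alert is based on exactly this variable is an assumption.

```python
# Galera node states as reported by the wsrep_local_state status variable.
GALERA_STATES = {1: "Joining", 2: "Donor/Desynced", 3: "Joined", 4: "Synced"}
SYNCED = 4


def describe_state(value):
    """Return the state name and whether the node is in sync."""
    state = int(value)
    return GALERA_STATES.get(state, "Unknown"), state == SYNCED


print(describe_state(2))
print(describe_state(4))
```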

Memcached

This section lists the alerts for the Memcached service.


MemcachedClusterDown

Severity

Critical

Summary

Memcached cluster is down.

Description

The Memcached {{ $labels.cluster }} cluster in the {{ $labels.namespace }} namespace is down.

MemcachedConnectionsNoneWarning

Severity

Warning

Summary

Memcached has no open connections.

Description

The Memcached database cluster {{ $labels.cluster }} in the {{ $labels.namespace }} namespace has no open connections.

MemcachedConnectionsNoneMajor

Severity

Warning

Summary

Memcached has no open connections on all nodes.

Description

The Memcached database cluster {{ $labels.cluster }} in the {{ $labels.namespace }} namespace has no open connections on all nodes.

MemcachedEvictionsLimit

Severity

Warning

Summary

10 Memcached evictions.

Description

An average of {{ $value }} evictions occurred in the Memcached database cluster {{ $labels.cluster }} in the {{ $labels.namespace }} namespace during the last minute.

MemcachedExporterTargetDown

Since 23.3 to replace MemcachedExporterClusterTargetsOutage

Severity

Critical

Summary

Memcached Exporter cluster Prometheus target down.

Description

Prometheus fails to scrape metrics from the {{ $labels.pod }} Pod of the {{ $labels.cluster }} cluster on the {{ $labels.node }} node.

MemcachedExporterClusterTargetsOutage

Replaced with MemcachedExporterTargetDown in 23.3

Severity

Critical

Summary

Memcached Exporter cluster Prometheus targets outage.

Description

Prometheus fails to scrape metrics from 2/3 of the {{ $labels.cluster }} cluster exporter endpoints (more than 1/10 failed scrapes).

SSL certificates

This section describes the alerts for the OpenStack SSL certificates. By default, these alerts are disabled. To enable them, set openstack.externalFQDNs.enabled to true. For details, see Configuration options for SSL certificates.


OpenstackSSLCertExpirationHigh

Severity

Critical

Summary

SSL certificate for an OpenStack service expires on {{ $value | humanizeTimestamp }}

Description

The SSL certificate for the OpenStack {{ $labels.namespace }}/{{ $labels.service_name }} service endpoints expires on {{ $value | humanizeTimestamp }}, less than 10 days are left.

OpenstackSSLCertExpirationMedium

Severity

Major

Summary

SSL certificate for an OpenStack service expires on {{ $value | humanizeTimestamp }}

Description

The SSL certificate for the OpenStack {{ $labels.namespace }}/{{ $labels.service_name }} service endpoints expires on {{ $value | humanizeTimestamp }}, less than 30 days are left.

OpenstackSSLProbesFailing

Severity

Critical

Summary

SSL certificate probes for an OpenStack service are failing.

Description

The SSL certificate probes for the OpenStack {{ $labels.namespace }}/{{ $labels.service_name }} service endpoints are failing.

OpenstackSSLProbesTargetOutage

Severity

Critical

Summary

OpenStack {{ $labels.service_name }} SSL ingress target outage.

Description

Prometheus fails to probe the OpenStack {{ $labels.service_name }} service SSL ingress target.
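The two expiration alerts in this section differ only in the number of days left before the certificate expires: less than 10 days for OpenstackSSLCertExpirationHigh and less than 30 days for OpenstackSSLCertExpirationMedium. The following Python sketch illustrates these thresholds only. It is not the actual alert rule expression, and returning a single alert name per certificate is a simplification.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def cert_alert(expires_at, now):
    """Return the name of the most severe matching alert, or None."""
    days_left = (expires_at - now).total_seconds() / SECONDS_PER_DAY
    if days_left < 10:
        return "OpenstackSSLCertExpirationHigh"
    if days_left < 30:
        return "OpenstackSSLCertExpirationMedium"
    return None


# Illustrative dates only.
now = datetime(2023, 8, 1, tzinfo=timezone.utc)
print(cert_alert(datetime(2023, 8, 20, tzinfo=timezone.utc), now))
```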

RabbitMQ

This section lists the alerts for the RabbitMQ service.


RabbitMQUnreachablePeersDetected

Note

Before Container Cloud 2.29.0 (Cluster releases 17.4.0 and 16.4.0), this alert was named RabbitMQNetworkPartitionsDetected.

Severity

Major

Summary

RabbitMQ unreachable peers detected.

Description

The {{ $labels.pod }} RabbitMQ pod in the {{ $labels.namespace }} namespace has {{ $value }} unreachable peers.

RabbitMQDown

Deprecated since MCC 2.29.0 (17.4.0 and 16.4.0)

Severity

Critical

Summary

RabbitMQ is down.

Description

The {{ $labels.cluster }} RabbitMQ cluster in the {{ $labels.namespace }} namespace is down for the last 2 minutes.

RabbitMQExporterTargetDown

Deprecated since MCC 2.29.0 (17.4.0 and 16.4.0)

Severity

Critical

Summary

{{ $labels.service_name }} RabbitMQ Exporter Prometheus target is down.

Description

Prometheus fails to scrape metrics from the {{ $labels.pod }} Pod of the {{ $labels.namespace }}/{{ $labels.service_name }} on the {{ $labels.node }} node.

RabbitMQOperatorTargetDown

Severity

Major

Summary

RabbitMQ operator Prometheus target is down.

Description

Prometheus fails to scrape metrics from the {{ $labels.pod }} Pod on the {{ $labels.node }} node.

RabbitMQFileDescriptorUsageWarning

Severity

Warning

Summary

RabbitMQ file descriptors usage is high for the last 10 minutes.

Description

The {{ $labels.pod }} RabbitMQ pod in the {{ $labels.namespace }} namespace has high file descriptor usage of {{ $value }} percent.

RabbitMQNodeDiskFreeAlarm

Severity

Warning

Summary

RabbitMQ disk space usage is high.

Description

The {{ $labels.pod }} RabbitMQ pod in the {{ $labels.namespace }} namespace has low disk free space available.

RabbitMQNodeMemoryAlarm

Severity

Major

Summary

RabbitMQ memory usage is high.

Description

The {{ $labels.pod }} RabbitMQ pod in the {{ $labels.namespace }} namespace has low free memory.

RabbitMQTargetDown

Available since MCC 2.29.0 (17.4.0 and 16.4.0)

Severity

Major

Summary

{{ $labels.pod }} RabbitMQ Prometheus target is down.

Description

Prometheus fails to scrape metrics from the {{ $labels.pod }} Pod of the {{ $labels.namespace }}/{{ $labels.service_name }} on the {{ $labels.node }} node.

OpenStack core services

This section describes the alerts available for the OpenStack core services.

OpenStack services API

This section describes the alerts for the OpenStack services API:

OpenstackIngressControllerTargetsOutage

Severity

Critical

Summary

OpenStack ingress controller Prometheus targets outage.

Description

Prometheus fails to scrape metrics from all OpenStack ingress controller endpoints.

OpenstackAPI401Critical

Severity

Critical

Summary

OpenStack API responds with HTTP 401.

Description

The OpenStack API {{ $labels.component }} responds with HTTP 401 for more than 5% of requests for the last 10 minutes.

OpenstackAPI5xxCritical

Severity

Critical

Summary

OpenStack API responds with HTTP 5xx.

Description

The OpenStack API {{ $labels.component }} responds with HTTP 5xx for more than 1% of requests for the last 10 minutes.

OpenstackApiServiceDown

Available since MOSK 24.1

Severity

Critical

Summary

OpenStack {{ $labels.url }} API outage

Description

The OpenStack {{ $labels.url }} API is not accessible.

OpenstackPublicAPI401Critical

Severity

Critical

Summary

OpenStack public API responds with HTTP 401.

Description

The OpenStack {{ $labels.ingress }} public ingress responds with HTTP 401 for more than 5% of requests for the last 10 minutes.

OpenstackPublicAPI5xxCritical

Severity

Critical

Summary

OpenStack Public API responds with HTTP 5xx.

Description

The OpenStack {{ $labels.ingress }} public ingress responds with HTTP 5xx for more than 1% of requests for the last 10 minutes.

OpenstackServiceInternalApiOutage

Removed in MOSK 24.1

Severity

Critical

Summary

OpenStack {{ $labels.service_name }} internal API outage.

Description

The OpenStack {{ $labels.service_name }} internal API is not accessible.

OpenstackServicePublicApiOutage

Removed in MOSK 24.1

Severity

Critical

Summary

OpenStack {{ $labels.service_name }} public API outage.

Description

The OpenStack {{ $labels.service_name }} public API is not accessible.
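The HTTP 401 and HTTP 5xx alerts in this section compare the share of matching responses with a threshold over the last 10 minutes: more than 5% for HTTP 401 and more than 1% for HTTP 5xx. The following Python sketch shows this ratio check on hypothetical request counts. How the counts are collected over the 10-minute window is outside the scope of the sketch.

```python
# Share of requests above which the corresponding alert condition is met.
THRESHOLDS = {"401": 0.05, "5xx": 0.01}


def exceeds(kind, matching, total):
    """Return True if the share of matching responses is above the threshold."""
    if total == 0:
        # No requests in the window, so there is nothing to compare.
        return False
    return matching / total > THRESHOLDS[kind]


# Hypothetical counts for a 10-minute window: 3 of 200 responses are HTTP 5xx.
print(exceeds("5xx", 3, 200))
```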

Cinder

This section lists the alerts for Cinder:

CinderServiceDisabled

Severity

Critical

Summary

{{ $labels.binary }} service is disabled.

Description

The {{ $labels.binary }} service is disabled in the {{ $labels.zone }} zone on all hosts.

CinderServiceDown

Severity

Major

Summary

{{ $labels.binary }} service is down.

Description

The {{ $labels.binary }} service is in the down state in the {{ $labels.zone }} zone on the {{ $labels.host }} host.

CinderServiceOutage

Severity

Critical

Summary

{{ $labels.binary }} service outage.

Description

The {{ $labels.binary }} service is down in the {{ $labels.zone }} zone on all hosts where it is enabled.

Ironic

This section lists the alerts for Ironic.

IronicDriverMissing

Severity

Major

Summary

ironic-conductor {{ $labels.driver }} backend driver missing.

Description

The {{ $labels.driver }} backend driver of the ironic-conductor container is missing on {{ $value }} node(s).

Neutron

This section lists the alerts for Neutron:

NeutronAgentDisabled

Severity

Critical

Summary

{{ $labels.binary }} agent is disabled.

Description

The {{ $labels.binary }} agent is disabled in the {{ $labels.zone }} zone on all hosts.

NeutronAgentDown

Severity

Critical

Summary

{{ $labels.binary }} agent is down.

Description

The {{ $labels.binary }} agent is in the down state in the {{ $labels.zone }} zone on the {{ $labels.host }} host.

NeutronAgentOutage

Severity

Critical

Summary

{{ $labels.binary }} agent outage.

Description

The {{ $labels.binary }} agent is down in the {{ $labels.zone }} zone on all hosts where it is enabled.

Nova

This section lists the alerts for Nova:

NovaOrphanedAllocationsDetected

Available since MOSK 24.3

Severity

Major

Summary

OpenStack Nova orphaned allocations detected.

Description

Orphaned resource allocations are detected on compute nodes.

NovaServiceDisabled

Severity

Critical

Summary

{{ $labels.binary }} service is disabled.

Description

The {{ $labels.binary }} service is disabled in the {{ $labels.zone }} zone on all hosts.

NovaServiceDown

Severity

Critical

Summary

{{ $labels.binary }} service is down.

Description

The {{ $labels.binary }} service is in the down state in the {{ $labels.zone }} zone on the {{ $labels.host }} host.

NovaServiceOutage

Severity

Critical

Summary

{{ $labels.binary }} service outage.

Description

The {{ $labels.binary }} service is down in the {{ $labels.zone }} zone on all hosts where it is enabled.

Cloudprober

This section lists the alerts for the Cloudprober service:

OpenstackCloudproberTargetDown

Since 23.3 to replace OpenstackCloudproberTargetsOutage

TechPreview

Severity

Major

Summary

OpenStack Cloudprober Prometheus target down.

Description

Prometheus fails to scrape metrics from the {{ $labels.pod }} Pod on the {{ $labels.node }} node.

OpenstackCloudproberTargetsOutage

Replaced with OpenstackCloudproberTargetDown in 23.3

Available since 23.2

TechPreview

Severity

Major

Summary

OpenStack Cloudprober Prometheus targets outage.

Description

Prometheus fails to scrape metrics from all OpenStack Cloudprober endpoints (more than 1/10 failed scrapes).

Portprober

This section lists the alerts for the Portprober service:

OpenstackPortproberTargetsOutage

Available since MOSK 24.2

TechPreview

Severity

Major

Summary

OpenStack Portprober Prometheus targets outage.

Description

Prometheus fails to scrape metrics from more than one OpenStack Portprober endpoint.

PowerDNS

Available since MOSK 24.3

This section lists the alerts for the PowerDNS service:

OpenstackPowerDNSProbeFailure

Severity

Critical

Summary

DNS probe failure for {{ $labels.target_name }} {{ $labels.target_type }}

Description

The DNS probe failed at least 3 times for the DNS {{ $labels.target_type }} {{ $labels.target_name }} using the {{ $labels.protocol }} protocol in the last 20 minutes.

OpenstackPowerDNSProbesTargetOutage

Severity

Critical

Summary

DNS probe target experienced outage for {{ $labels.target_name }} {{ $labels.target_type }}

Description

Prometheus failed to probe the DNS {{ $labels.target_type }} {{ $labels.target_name }} 3 times using the {{ $labels.protocol }} protocol in the last 20 minutes.

OpenstackPowerDNSQueryDurationHigh

Severity

Warning

Summary

High DNS query duration for {{ $labels.target_name }} {{ $labels.target_type }}

Description

The DNS query duration for the DNS {{ $labels.target_type }} {{ $labels.target_name }} using the {{ $labels.protocol }} protocol exceeded 3 seconds at least 3 times in the last 20 minutes.

OpenstackPowerDNSTargetDown

Severity

Critical

Summary

PowerDNS Prometheus target is down

Description

Prometheus fails to scrape metrics from the {{ $labels.pod }} Pod on the {{ $labels.node }} node.

OpenstackPowerDNSUDPInCsumErrors

Severity

Critical

Summary

OpenStack PowerDNS UDP checksum errors detected

Description

The number of UDP checksum errors has increased by {{ printf "%.0f" $value }} on the {{ $labels.pod }} Pod over the last 2 hours.

OpenstackPowerDNSUDPInErrors

Severity

Critical

Summary

OpenStack PowerDNS UDP input errors detected

Description

The number of UDP input errors has increased by {{ printf "%.0f" $value }} on the {{ $labels.pod }} Pod over the last 2 hours.

OpenstackPowerDNSUDPRecvBufErrors

Severity

Critical

Summary

OpenStack PowerDNS UDP receive buffer errors detected

Description

The number of UDP receive buffer errors has increased by {{ printf "%.0f" $value }} on the {{ $labels.pod }} Pod over the last 2 hours.

OpenstackPowerDNSUDPSndBufErrors

Severity

Critical

Summary

OpenStack PowerDNS UDP send buffer errors detected

Description

The number of UDP send buffer errors has increased by {{ printf "%.0f" $value }} on the {{ $labels.pod }} Pod over the last 2 hours.
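Several PowerDNS alerts in this section count events in a sliding time window, for example, at least 3 failed probes in the last 20 minutes for OpenstackPowerDNSProbeFailure. The following Python sketch illustrates this kind of check on a hypothetical list of failure timestamps. It is not the actual alert rule expression.

```python
from datetime import datetime, timedelta


def failures_in_window(failure_times, now, window=timedelta(minutes=20)):
    """Count the failures that occurred within the window ending at now."""
    return sum(1 for moment in failure_times if now - window <= moment <= now)


def probe_failure_fires(failure_times, now, min_failures=3):
    """Return True if enough failures occurred in the last 20 minutes."""
    return failures_in_window(failure_times, now) >= min_failures


# Hypothetical failure timestamps.
now = datetime(2024, 1, 1, 12, 0)
failures = [
    datetime(2024, 1, 1, 11, 45),
    datetime(2024, 1, 1, 11, 50),
    datetime(2024, 1, 1, 11, 59),
]
print(probe_failure_fires(failures, now))
```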

Credential rotation

Available since MOSK 24.1

This section lists the alerts for OpenStack credential rotation:

Note

You can adjust thresholds for the alerts included in this section. If needed, refer to Alerts configuration.

OpenstackAdminCredentialsRotationOverdue

Severity

Warning

Summary

OpenStack administrator credentials are overdue for rotation

Description

The OpenStack administrator credentials have not been rotated since {{ $value | humanizeTimestamp }}, for more than 30 days.

OpenstackServiceCredentialsRotationOverdue

Severity

Warning

Summary

OpenStack service user credentials are overdue for rotation

Description

The OpenStack service user credentials have not been rotated since {{ $value | humanizeTimestamp }}, for more than 30 days.

OpenStack Controller (Rockoon)

This section describes the alerts available for the OpenStack Controller (Rockoon).


OsDplExporterCollectorFailure

Available since MOSK 24.3

Severity

Major

Summary

Collector failure in the OpenStackDeployment Exporter.

Description

The {{$labels.collector}} collector of the OpenStackDeployment Exporter fails to retrieve data for the last 2 minutes.

OsDplExporterTargetDown

Severity

Critical

Summary

OpenStackDeployment Exporter Prometheus target is down.

Description

Prometheus fails to scrape metrics from the OpenStackDeployment Exporter endpoint.

OsDplSSLCertExpirationHigh

Severity

Warning

Summary

SSL certificate for an OpenStack service expires on {{ $value | humanizeTimestamp }}

Description

The SSL certificate {{ $labels.identifier }} from the OpenStackDeployment expires on {{ $value | humanizeTimestamp }}, less than 10 days are left.

OsDplSSLCertExpirationMedium

Severity

Major

Summary

SSL certificate for an OpenStack service expires on {{ $value | humanizeTimestamp }}

Description

The SSL certificate {{ $labels.identifier }} from the OpenStackDeployment expires on {{ $value | humanizeTimestamp }}, less than 30 days are left.

Tungsten Fabric

This section describes the alerts available for the Tungsten Fabric services.

Cassandra

This section lists the alerts for Cassandra.


CassandraAuthFailures

Severity

Warning

Summary

Cassandra authentication failures.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster reports an increased number of authentication failures.

CassandraCacheHitRateTooLow

Severity

Major

Summary

Cassandra cache hit rate is too low.

Description

The average hit rate for the {{ $labels.cache }} cache in the {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster is below 85%.

CassandraClientRequestFailure

Severity

Major

Summary

Cassandra client {{ $labels.operation }} request failure.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster reports an increased number of {{ $labels.operation }} operation failures. A failure is a non-timeout exception.

CassandraClientRequestUnavailable

Severity

Critical

Summary

Cassandra client {{ $labels.operation }} request is unavailable.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster reports an increased number of {{ $labels.operation }} operations ending with UnavailableException. There are not enough replicas alive to perform the {{ $labels.operation }} query with the requested consistency level.

CassandraClusterTargetDown

Available since 23.3 to replace CassandraClusterTargetsOutage

Severity

Critical

Summary

Cassandra cluster target down.

Description

Prometheus fails to scrape metrics from the {{ $labels.pod }} Pod of the {{ $labels.cluster }} cluster on the {{ $labels.node }} node.

CassandraClusterTargetsOutage

Replaced by CassandraClusterTargetDown in 23.3

Severity

Critical

Summary

Cassandra cluster Prometheus targets outage.

Description

Prometheus fails to scrape metrics from 2/3 of the {{ $labels.cluster }} cluster endpoints (more than 1/10 failed scrapes).

CassandraCommitlogTasksPending

Severity

Warning

Summary

Cassandra commitlog has too many pending tasks.

Description

The commitlog in the {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster reached 15 pending tasks.

CassandraCompactionExecutorTasksBlocked

Severity

Warning

Summary

Cassandra compaction executor tasks are blocked.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster reports that {{ $value }} compaction executor tasks are blocked.

CassandraCompactionTasksPending

Severity

Warning

Summary

Cassandra has too many pending compactions.

Description

The pending compaction tasks in the {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster reached the threshold of 100 on average as measured over 30 minutes. This may occur due to insufficient cluster I/O capacity.

CassandraConnectionTimeouts

Severity

Critical

Summary

Cassandra connection timeouts.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster reports an increased number of connection timeouts between nodes.

CassandraFlushWriterTasksBlocked

Severity

Warning

Summary

Cassandra flush writer tasks are blocked.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster reports that {{ $value }} flush writer tasks are blocked.

CassandraHintsTooMany

Severity

Major

Summary

Cassandra has too many hints.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster reports an increased number of hints. Replica nodes are not available to accept mutation due to a failure or maintenance.

CassandraRepairTasksBlocked

Severity

Warning

Summary

Cassandra repair tasks are blocked.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster reports that {{ $value }} repair tasks are blocked.

CassandraStorageExceptions

Severity

Critical

Summary

Cassandra storage exceptions.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster reports an increased number of storage exceptions.

CassandraTombstonesTooManyCritical

Severity

Critical

Summary

Cassandra scanned 50000 tombstones.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster scanned {{ $value }} tombstones in 99% of read queries.

CassandraTombstonesTooManyMajor

Severity

Major

Summary

Cassandra scanned 25000 tombstones.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster scanned {{ $value }} tombstones in 99% of read queries.

CassandraTombstonesTooManyWarning

Severity

Warning

Summary

Cassandra scanned 10000 tombstones.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster scanned {{ $value }} tombstones in 99% of read queries.

CassandraViewWriteLatencyTooHigh

Severity

Warning

Summary

Cassandra high view/write latency.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Cassandra Pod in the {{ $labels.cassandra_cluster }} cluster reports over 1-second view/write latency for 99% of requests.

Kafka

This section lists the alerts for Kafka.


KafkaClusterTargetDown

Since 23.3 to replace KafkaClusterTargetsOutage

Severity

Critical

Summary

Kafka cluster Prometheus target down.

Description

Prometheus fails to scrape metrics from the {{ $labels.pod }} Pod of the {{ $labels.cluster }} cluster on the {{ $labels.node }} node.

KafkaClusterTargetsOutage

Replaced with KafkaClusterTargetDown in 23.3

Severity

Critical

Summary

Kafka cluster Prometheus targets outage.

Description

Prometheus fails to scrape metrics from 2/3 of the {{ $labels.cluster }} cluster endpoints (more than 1/10 failed scrapes).

KafkaInsufficientBrokers

Severity

Critical

Summary

Kafka cluster has missing brokers.

Description

The {{ $labels.cluster }} Kafka cluster in the {{ $labels.namespace }} namespace has missing brokers.

KafkaMissingController

Severity

Critical

Summary

Kafka cluster controller is missing.

Description

The {{ $labels.cluster }} Kafka cluster in the {{ $labels.namespace }} namespace has no controllers.

KafkaOfflinePartitionsDetected

Severity

Critical

Summary

Unavailable partitions in Kafka cluster.

Description

Partitions without a primary replica have been detected in the {{ $labels.cluster }} Kafka cluster in the {{ $labels.namespace }} namespace.

KafkaTooManyControllers

Severity

Critical

Summary

Kafka cluster has too many controllers.

Description

The {{ $labels.cluster }} Kafka cluster in the {{ $labels.namespace }} namespace has too many controllers.

KafkaUncleanLeaderElectionOccured

Severity

Major

Summary

Unclean Kafka broker was elected as cluster leader.

Description

A Kafka broker that has not finished the replication state has been elected as leader in {{ $labels.cluster }} within the {{ $labels.namespace }} namespace.

KafkaUnderReplicatedPartitions

Severity

Warning

Summary

Kafka cluster has underreplicated partitions.

Description

The topics in the {{ $labels.cluster }} Kafka cluster in the {{ $labels.namespace }} namespace have insufficient replica partitions.

Redis

This section lists the alerts for Redis.


RedisClusterFlapping

Severity

Major

Summary

Redis cluster is flapping.

Description

Changes have been detected in the replica connections of the {{ $labels.cluster }} Redis cluster in the {{ $labels.namespace }} namespace.

RedisClusterTargetDown

Since 23.3 to replace RedisClusterTargetsOutage

Severity

Major

Summary

Redis cluster Prometheus targets outage.

Description

Prometheus fails to scrape metrics from the {{ $labels.pod }} Pod of the {{ $labels.cluster }} cluster on the {{ $labels.node }} node.

RedisClusterTargetsOutage

Replaced with RedisClusterTargetDown in 23.3

Severity

Major

Summary

Redis cluster Prometheus targets outage.

Description

Prometheus fails to scrape metrics from 2/3 of the {{ $labels.cluster }} cluster endpoints (more than 1/10 failed scrapes).

RedisDisconnectedReplicas

Severity

Warning

Summary

Redis has disconnected replicas.

Description

The {{ $labels.cluster }} Redis cluster in the {{ $labels.namespace }} namespace is not replicating to all replicas. Consider verifying the Redis replication status.

RedisDown

Severity

Critical

Summary

Redis Pod is down.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Redis Pod in the {{ $labels.cluster }} cluster is down.

RedisMissingPrimary

Severity

Critical

Summary

Redis cluster has no primary node.

Description

The {{ $labels.cluster }} Redis cluster in the {{ $labels.namespace }} namespace has no node marked as primary.

RedisMultiplePrimaries

Severity

Major

Summary

Redis has multiple primaries.

Description

The {{ $labels.cluster }} Redis cluster in the {{ $labels.namespace }} namespace has {{ $value }} nodes marked as primary.

RedisRejectedConnections

Severity

Major

Summary

Redis cluster has rejected connections.

Description

Some connections to the {{ $labels.namespace }}/{{ $labels.pod }} Redis Pod in the {{ $labels.cluster }} cluster have been rejected.

RedisReplicationBroken

Severity

Major

Summary

Redis replication is broken.

Description

An instance of the {{ $labels.cluster }} Redis cluster in the {{ $labels.namespace }} namespace lost a replica.

Tungsten Fabric Operator

This section lists alerts for the Tungsten Fabric Operator.

TungstenFabricOperatorTargetDown

Available since MOSK 23.3

Severity

Critical

Summary

Tungsten Fabric Operator Prometheus target is down.

Description

Prometheus fails to scrape metrics from the Tungsten Fabric Operator metrics service.

Tungsten Fabric

This section lists the alerts for Tungsten Fabric.


TungstenFabricAPI401Critical

Severity

Critical

Summary

Tungsten Fabric API responds with HTTP 401.

Description

The Tungsten Fabric API responds with HTTP 401 for more than 5% of requests for the last 10 minutes.

TungstenFabricAPI5xxCritical

Severity

Critical

Summary

Tungsten Fabric API responds with HTTP 5xx.

Description

The Tungsten Fabric API responds with HTTP 5xx for more than 1% of requests for the last 10 minutes.

TungstenFabricBGPSessionsDown

Severity

Warning

Summary

Tungsten Fabric BGP sessions are down.

Description

{{ $value }} Tungsten Fabric BGP sessions on the {{ $labels.node }} node are down for 2 minutes.

TungstenFabricBGPSessionsNoActive

Severity

Warning

Summary

No active Tungsten Fabric BGP sessions.

Description

There are no active Tungsten Fabric BGP sessions on the {{ $labels.node }} node for 2 minutes.

TungstenFabricBGPSessionsNoEstablished

Severity

Warning

Summary

No established Tungsten Fabric BGP sessions.

Description

There are no established Tungsten Fabric BGP sessions on the {{ $labels.node }} node for 2 minutes.

TungstenFabricControllerDown

Severity

Warning

Summary

Tungsten Fabric Controller is down.

Description

The Tungsten Fabric Controller on the {{ $labels.node }} node is down for 2 minutes.

TungstenFabricControllerOutage

Severity

Critical

Summary

All Tungsten Fabric Controllers are down.

Description

All Tungsten Fabric Controllers are down for 2 minutes.

TungstenFabricControllerTargetsOutage

Severity

Critical

Summary

Tungsten Fabric Controller Prometheus targets outage.

Description

Prometheus fails to scrape metrics from 2/3 of the Tungsten Fabric Controller exporter endpoints.

TungstenFabricVrouterDown

Severity

Warning

Summary

Tungsten Fabric vRouter is down.

Description

The Tungsten Fabric vRouter on the {{ $labels.node }} node is down for 2 minutes.

TungstenFabricVrouterLLSSessionsChangesTooHigh

Severity

Warning

Summary

Tungsten Fabric vRouter LLS session changes reached the limit of 5.

Description

The Tungsten Fabric vRouter LLS sessions on the {{ $labels.node }} node have changed {{ $value }} times.

TungstenFabricVrouterLLSSessionsTooHigh

Severity

Warning

Summary

Tungsten Fabric vRouter LLS sessions reached the limit of 10.

Description

{{ $value }} Tungsten Fabric vRouter LLS sessions are open on the {{ $labels.node }} node for 2 minutes.

TungstenFabricVrouterMetadataCheck

Severity

Critical

Summary

Tungsten Fabric metadata is unavailable.

Description

The Tungsten Fabric metadata on the {{ $labels.node }} node is unavailable for 15 minutes.

TungstenFabricVrouterOutage

Severity

Critical

Summary

All Tungsten Fabric vRouters are down.

Description

All Tungsten Fabric vRouters are down for 2 minutes.

TungstenFabricVrouterTargetDown

Severity

Major

Summary

Tungsten Fabric vRouter Prometheus target is down.

Description

Prometheus fails to scrape metrics from the Tungsten Fabric vRouter exporter endpoint on the {{ $labels.node }} node.

TungstenFabricVrouterTargetsOutage

Severity

Critical

Summary

Tungsten Fabric vRouter Prometheus targets outage.

Description

Prometheus fails to scrape metrics from all Tungsten Fabric vRouter exporter endpoints.

TungstenFabricVrouterXMPPSessionsChangesTooHigh

Severity

Warning

Summary

Tungsten Fabric vRouter XMPP session changes reached the limit of 5.

Description

The Tungsten Fabric vRouter XMPP sessions on the {{ $labels.node }} node have changed {{ $value }} times.

TungstenFabricVrouterXMPPSessionsTooHigh

Severity

Warning

Summary

Tungsten Fabric vRouter XMPP sessions reached the limit of 10.

Description

{{ $value }} Tungsten Fabric vRouter XMPP sessions are open on the {{ $labels.node }} node for 2 minutes.

TungstenFabricVrouterXMPPSessionsZero

Severity

Warning

Summary

No Tungsten Fabric vRouter XMPP sessions.

Description

There are no Tungsten Fabric vRouter XMPP sessions on the {{ $labels.node }} node for 2 minutes.

TungstenFabricXMPPSessionsChangesTooHigh

Severity

Warning

Summary

Tungsten Fabric XMPP session changes reached the limit of 100.

Description

The Tungsten Fabric XMPP sessions on the {{ $labels.node }} node have changed {{ $value }} times.

TungstenFabricXMPPSessionsDown

Severity

Warning

Summary

Tungsten Fabric XMPP sessions are down.

Description

{{ $value }} Tungsten Fabric XMPP sessions on the {{ $labels.node }} node are down for 2 minutes.

TungstenFabricXMPPSessionsMissing

Severity

Warning

Summary

Missing Tungsten Fabric XMPP sessions.

Description

{{ $value }} Tungsten Fabric XMPP sessions are missing on the compute cluster for 2 minutes.

TungstenFabricXMPPSessionsMissingEstablished

Severity

Warning

Summary

Missing established Tungsten Fabric XMPP sessions.

Description

{{ $value }} established Tungsten Fabric XMPP sessions are missing on the compute cluster for 2 minutes.

TungstenFabricXMPPSessionsTooHigh

Severity

Warning

Summary

Tungsten Fabric XMPP sessions reached the limit of 500.

Description

{{ $value }} Tungsten Fabric XMPP sessions on the {{ $labels.node }} node are open for 2 minutes.

ZooKeeper

This section lists the alerts for ZooKeeper.