Mirantis OpenStack for Kubernetes Documentation

This documentation provides information on how to deploy and operate a Mirantis OpenStack for Kubernetes (MOSK) environment. It is intended to help operators understand the core concepts of the product and provides sufficient information to deploy and operate the solution.

The information provided in this documentation set is constantly improved and amended based on feedback and requests from the consumers of MOSK.

The following table lists the guides included in the documentation set you are reading:

Guide list

Guide

Purpose

Reference Architecture

Learn the fundamentals of MOSK reference architecture to appropriately plan your deployment

Deployment Guide

Deploy a MOSK environment of a preferred configuration using supported deployment profiles tailored to the demands of specific business cases

Operations Guide

Operate your MOSK environment

Release Notes

Learn about new features and bug fixes in the current MOSK version

Intended audience

This documentation is intended for engineers who have basic knowledge of Linux, virtualization and containerization technologies, the Kubernetes API and CLI, Helm and Helm charts, Mirantis Kubernetes Engine (MKE), and OpenStack.

Documentation history

The following table contains the released revision of the documentation set you are reading.

Release date

Release name

August, 2023

MOSK 23.2 series

Conventions

This documentation set uses the following conventions in the HTML format:

Documentation conventions

Convention

Description

boldface font

Inline CLI tools and commands, titles of the procedures and system response examples, table titles

monospaced font

File names and paths, Helm chart parameters and their values, package names, node names and labels, and so on

italic font

Information that distinguishes some concept or term

Links

External links and cross-references, footnotes

Main menu > menu item

GUI elements that include any part of interactive user interface and menu navigation

Superscript

Some extra, brief information

Note

The Note block

Messages of generic meaning that may be useful to the user

Caution

The Caution block

Information that helps the user avoid mistakes and undesirable consequences when following the procedures

Warning

The Warning block

Messages with details that can be easily missed but should not be ignored by the user and are valuable before proceeding

See also

The See also block

List of references that may be helpful for understanding related tools, concepts, and so on

Learn more

The Learn more block

Used in the Release Notes to wrap a list of internal references to the reference architecture, deployment and operation procedures specific to a newly implemented product feature

Product Overview

Mirantis OpenStack for Kubernetes (MOSK) combines the power of Mirantis Container Cloud for delivering and managing Kubernetes clusters, with the industry standard OpenStack APIs, enabling you to build your own cloud infrastructure.

The advantages of running all of the OpenStack components as a Kubernetes application are numerous and include the following:

  • Zero downtime, non-disruptive updates

  • Fully automated Day-2 operations

  • Full-stack management from bare metal through the operating system and all the necessary components

The list of the most common use cases includes:

Software-defined data center

The traditional data center requires multiple requests and interactions to deploy new services. By abstracting the data center functionality behind a standardized set of APIs, services can be deployed faster and more efficiently. MOSK enables you to define all your data center resources behind the industry standard OpenStack APIs, allowing you to automate the deployment of applications or simply request resources through the UI to quickly and efficiently provision virtual machines, storage, networking, and other resources.

Virtual Network Functions (VNFs)

VNFs require high performance systems that can be accessed on demand in a standardized way, with assurances that they will have access to the necessary resources and performance guarantees when needed. MOSK provides extensive support for VNF workloads, enabling easy access to functionality such as Intel EPA (NUMA, CPU pinning, Huge Pages) as well as the consumption of specialized network interface cards to support SR-IOV and DPDK. The centralized management model of MOSK and Mirantis Container Cloud also enables the easy management of multiple MOSK deployments with full lifecycle management.

Legacy workload migration

With the industry moving toward cloud-native technologies, many older or legacy applications cannot be moved easily, and often it does not make financial sense to transform them into cloud-native applications. MOSK provides a stable cloud platform that can cost-effectively host legacy applications whilst still providing the expected levels of control, customization, and uptime.

Reference Architecture

Mirantis OpenStack for Kubernetes (MOSK) is a virtualization platform that provides an infrastructure for cloud-ready applications, in combination with reliability and full control over the data.

MOSK combines OpenStack, an open-source cloud infrastructure software, with application management techniques used in the Kubernetes ecosystem that include container isolation, state enforcement, declarative definition of deployments, and others.

MOSK integrates with Mirantis Container Cloud to rely on its capabilities for bare-metal infrastructure provisioning, Kubernetes cluster management, and continuous delivery of the stack components.

MOSK simplifies the work of a cloud operator by automating all major cloud life cycle management routines including cluster updates and upgrades.

Deployment profiles

A Mirantis OpenStack for Kubernetes (MOSK) deployment profile is a thoroughly tested and officially supported reference architecture that is guaranteed to work at a specific scale and is tailored to the demands of a specific business case, such as generic IaaS cloud, Network Function Virtualization infrastructure, Edge Computing, and others.

A deployment profile is defined as a combination of:

  • Services and features the cloud offers to its users.

  • Non-functional characteristics that users and operators should expect when running the profile on top of a reference hardware configuration. Including, but not limited to:

    • Performance characteristics, such as an average network throughput between VMs in the same virtual network.

    • Reliability characteristics, such as the cloud API error response rate when recovering a failed controller node.

    • Scalability characteristics, such as the total number of virtual routers tenants can run simultaneously.

  • Hardware requirements - the specification of physical servers and networking equipment required to run the profile in production.

  • Deployment parameters that a cloud operator can tweak within a certain range without the risk of breaking the cloud or losing support.

In addition, the following items may be included in a definition:

  • Compliance-driven technical requirements, such as TLS encryption of all external API endpoints.

  • Foundation-level software components, such as Tungsten Fabric or Open vSwitch as a back end for the networking service.

Note

Mirantis reserves the right to revise the technical implementation of any profile at will while preserving its definition - the functional and non-functional characteristics that operators and users are known to rely on.

MOSK supports a wide range of deployment profiles to address a variety of business cases. The table below includes the profiles for the most common use cases.

Note

Some components of a MOSK cluster are mandatory and are installed by Container Cloud during the managed cluster deployment regardless of the deployment profile in use. StackLight is one of the cluster components that are enabled by default. See Container Cloud Operations Guide for details.

Supported deployment profiles

Profile

OpenStackDeployment CR Preset

Description

Cloud Provider Infrastructure (CPI)

compute

Provides the core set of services an IaaS vendor would need, including some extra functionality. The profile is designed to support up to 50-70 compute nodes and a reasonable number of storage nodes. 0

The core set of services provided by the profile includes:

  • Compute (Nova)

  • Images (Glance)

  • Networking (Neutron with Open vSwitch as a back end)

  • Identity (Keystone)

  • Block Storage (Cinder)

  • Orchestration (Heat)

  • Load balancing (Octavia)

  • DNS (Designate)

  • Secret Management (Barbican)

  • Web front end (Horizon)

  • Bare metal provisioning (Ironic) 1 2

  • Telemetry (aodh, Ceilometer, and Gnocchi) 3

CPI with Tungsten Fabric

compute-tf

A variation of the CPI profile 1 with Tungsten Fabric as a back end for networking.

0

The supported node count is approximate and may vary depending on the hardware, cloud configuration, and planned workload.

1(1,2)

Ironic is an optional component for the CPI profile. See Bare Metal service for details.

2

Ironic is not supported for the CPI with Tungsten Fabric profile. See Tungsten Fabric known limitations for details.

3

Telemetry services are optional components with the Technology preview status and should be enabled together through the list of services to be deployed in the OpenStackDeployment CR as described in Deploy an OpenStack cluster.

Components overview

Mirantis OpenStack for Kubernetes (MOSK) includes the following key design elements.

HelmBundle Operator

The HelmBundle Operator is the realization of the Kubernetes Operator pattern that provides a Kubernetes custom resource of the HelmBundle kind and code running inside a pod in Kubernetes. This code handles changes, such as creation, update, and deletion, in the Kubernetes resources of this kind by deploying, updating, and deleting groups of Helm releases from specified Helm charts with specified values.

OpenStack

The OpenStack platform manages virtual infrastructure resources, including virtual servers, storage devices, networks, and networking services, such as load balancers, as well as provides management functions to the tenant users.

Various OpenStack services are running as pods in Kubernetes and are represented as appropriate native Kubernetes resources, such as Deployments, StatefulSets, and DaemonSets.

For a simple, resilient, and flexible deployment of OpenStack and related services on top of a Kubernetes cluster, MOSK uses OpenStack-Helm that provides a required collection of the Helm charts.

Also, MOSK uses OpenStack Operator as the realization of the Kubernetes Operator pattern. The OpenStack Operator provides a custom Kubernetes resource of the OpenStackDeployment kind and code running inside a pod in Kubernetes. This code handles changes such as creation, update, and deletion in the Kubernetes resources of this kind by deploying, updating, and deleting groups of the Helm releases.

Ceph

Ceph is a distributed storage platform that provides storage resources, such as objects and virtual block devices, to virtual and physical infrastructure.

MOSK uses Rook as the implementation of the Kubernetes Operator pattern that manages resources of the CephCluster kind to deploy and manage Ceph services as pods on top of Kubernetes to provide Ceph-based storage to the consumers, which include OpenStack services, such as Volume and Image services, and underlying Kubernetes through Ceph CSI (Container Storage Interface).

The Ceph Controller is the implementation of the Kubernetes Operator pattern, that manages resources of the MiraCeph kind to simplify management of the Rook-based Ceph clusters.

StackLight Logging, Monitoring, and Alerting

The StackLight component is responsible for collection, analysis, and visualization of critical monitoring data from physical and virtual infrastructure, as well as alerting and error notifications through a configured communication system, such as email. StackLight includes the following key sub-components:

  • Prometheus

  • OpenSearch

  • OpenSearch Dashboards

  • Fluentd

Requirements

MOSK cluster hardware requirements

This section provides hardware requirements for the Mirantis Container Cloud management cluster with a managed Mirantis OpenStack for Kubernetes (MOSK) cluster.

For installing MOSK, the Mirantis Container Cloud management cluster and managed cluster must be deployed with the bare metal provider.

Important

A MOSK cluster is dedicated to the deployment of an OpenStack cluster and its components. Deployment of third-party workloads on a MOSK cluster is neither allowed nor supported.

Note

One of the industry best practices is to verify every new update or configuration change in a non-customer-facing environment before applying it to production. Therefore, Mirantis recommends having a staging cloud, deployed and maintained along with the production clouds. The recommendation is especially applicable to the environments that:

  • Receive updates often and use continuous delivery. For example, any non-isolated deployment of Mirantis Container Cloud.

  • Have significant deviations from the reference architecture or third party extensions installed.

  • Are managed under the Mirantis OpsCare program.

  • Run business-critical workloads where even the slightest application downtime is unacceptable.

A typical staging cloud is a complete copy of the production environment including the hardware and software configurations, but with a bare minimum of compute and storage capacity.

The table below describes the node types the MOSK reference architecture includes.

MOSK node types

Node type

Description

Mirantis Container Cloud management cluster nodes

The Container Cloud management cluster architecture on bare metal requires three physical servers for manager nodes. On these hosts, we deploy a Kubernetes cluster with services that provide Container Cloud control plane functions.

OpenStack control plane node and StackLight node

Host OpenStack control plane services such as database, messaging, API, schedulers, conductors, and L3 and L2 agents, as well as the StackLight components.

Note

MOSK enables the cloud operator to collocate the OpenStack control plane with the managed cluster master nodes on OpenStack deployments of a small size. This capability is available as a technical preview. Use such a configuration for testing and evaluation purposes only.

Tenant gateway node

Optional. Hosts OpenStack gateway services including L2, L3, and DHCP agents. The tenant gateway nodes can be combined with OpenStack control plane nodes. The strict requirement is a dedicated physical network (bond) for tenant network traffic.

Tungsten Fabric control plane node

Required only if Tungsten Fabric is enabled as a back end for the OpenStack networking. These nodes host the TF control plane services such as Cassandra database, messaging, API, control, and configuration services.

Tungsten Fabric analytics node

Required only if Tungsten Fabric is enabled as a back end for the OpenStack networking. These nodes host the TF analytics services such as Cassandra, ZooKeeper, and collector.

Compute node

Hosts the OpenStack Compute services such as QEMU, L2 agents, and others.

Infrastructure nodes

Run the underlying Kubernetes cluster management services. The MOSK reference configuration requires a minimum of three infrastructure nodes.

The table below specifies the hardware resources the MOSK reference architecture recommends for each node type.

Hardware requirements

Node type

# of servers

CPU cores # per server

RAM per server, GB

Disk space per server, GB

NICs # per server

Mirantis Container Cloud management cluster node

3 0

16

128

1 SSD x 960
1 SSD x 1900 1

3 2

OpenStack control plane, gateway 3, and StackLight nodes

3 or more

32

128

1 SSD x 500
2 SSD x 1000 6

5

Tenant gateway (optional)

0-3

32

128

1 SSD x 500

5

Tungsten Fabric control plane nodes 4

3

16

64

1 SSD x 500

1

Tungsten Fabric analytics nodes 4

3

32

64

1 SSD x 1000

1

Compute node

3 (varies)

16

64

1 SSD x 500 7

5

Infrastructure node (Kubernetes cluster management)

3 8

16

64

1 SSD x 500

5

Infrastructure node (Ceph) 5 TBV

3

16

64

1 SSD x 500
2 HDDs x 2000

5

Note

The exact hardware specifications and number of the control plane and gateway nodes depend on a cloud configuration and scaling needs. For example, for the clouds with more than 12,000 Neutron ports, Mirantis recommends increasing the number of gateway nodes.

0

Adding more than 3 nodes to a management or regional cluster is not supported.

1

In total, at least 2 disks are required:

  • sda - system storage, minimum 60 GB

  • sdb - Container Cloud services storage, not less than 110 GB. The exact capacity requirements depend on StackLight data retention period.

See Management cluster storage for details.

2

OOB management (IPMI) port is not included.

3

OpenStack gateway services can optionally be moved to separate nodes.

4(1,2)

TF control plane and analytics nodes can be combined on the same hardware hosts with a respective addition of RAM, CPU, and disk space. However, Mirantis does not recommend such a configuration for production environments because it increases the risk of cluster downtime if one of the nodes unexpectedly fails.

5
  • A Ceph cluster with 3 Ceph nodes does not provide hardware fault tolerance and is not eligible for recovery operations, such as a disk or an entire node replacement.

  • A Ceph cluster uses a replication factor of 3. If the number of Ceph OSDs is less than 3, the Ceph cluster moves to the degraded state with write operations restricted until the number of alive Ceph OSDs equals the replication factor again.

6
  • 1 SSD x 500 for operating system

  • 1 SSD x 1000 for OpenStack LVP

  • 1 SSD x 1000 for StackLight LVP

7

When Nova is used with local folders, additional capacity is required depending on the size of VM images.

8

For nodes hardware requirements, refer to Container Cloud Reference Architecture: Managed cluster hardware configuration.

Note

If you would like to evaluate the MOSK capabilities and do not have much hardware at your disposal, you can deploy it in a virtual environment, for example, on top of another OpenStack cloud using the sample Heat templates.

Keep in mind that the tooling is provided for reference only and is not part of the product itself. Mirantis does not guarantee its interoperability with any MOSK version.

Management cluster storage

The management cluster requires a minimum of two storage devices per node. Each device is used for a different type of storage:

  • One storage device for boot partitions and root file system. SSD is recommended. A RAID device is not supported.

  • One storage device per server is reserved for local persistent volumes. These volumes are served by the Local Storage Static Provisioner (local-volume-provisioner) and used by many services of Mirantis Container Cloud.

You can configure host storage devices using BareMetalHostProfile resources. For details, see Create a custom bare metal host profile.

System requirements for the seed node

The seed node is only necessary to deploy the management cluster. When the bootstrap is complete, the bootstrap node can be discarded and added back to the MOSK cluster as a node of any type.

The minimum reference system requirements for a baremetal-based bootstrap seed node are as follows:

  • Basic Ubuntu 18.04 server with the following configuration:

    • Kernel of version 4.15.0-76.86 or later

    • 8 GB of RAM

    • 4 CPU

    • 10 GB of free disk space for the bootstrap cluster cache

  • No DHCP or TFTP servers on any NIC networks

  • Routable access to the IPMI network of the hardware servers

  • Internet access for downloading all required artifacts

    If you use a firewall or proxy, make sure that the bootstrap, management, and regional clusters have access to the following IP ranges and domain names:

    • IP ranges:

    • Domain names:

      • mirror.mirantis.com and repos.mirantis.com for packages

      • binary.mirantis.com for binaries and Helm charts

      • mirantis.azurecr.io for Docker images

      • mcc-metrics-prod-ns.servicebus.windows.net:9093 for Telemetry (port 443 if proxy is enabled)

      • mirantis.my.salesforce.com for Salesforce alerts

    Note

    • Access to Salesforce is required from any Container Cloud cluster type.

    • If any additional Alertmanager notification receiver is enabled, for example, Slack, its endpoint must also be accessible from the cluster.

Components collocation

MOSK uses Kubernetes labels to place components onto hosts. For the default locations of components, see MOSK cluster hardware requirements. Additionally, MOSK supports component collocation. This is mostly useful for OpenStack compute and Ceph nodes. For component collocation, consider the following recommendations:

  • When calculating hardware requirements for nodes, consider the requirements for all collocated components.

  • When performing maintenance on a node with collocated components, execute the maintenance plan for all of them.

  • When combining other services with the OpenStack compute host, verify that the reserved_host_* settings are increased according to the needs of the collocated components by using node-specific overrides for the compute service, as illustrated in the sketch below.
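
For illustration only, below is a minimal sketch of such a node-specific override. The services:compute:nova:values:conf path is an assumption about the override schema, and the reserved_host_* values are placeholders; size them for the actual collocated components and verify the exact structure against the MOSK node-specific configuration documentation.

spec:
  nodes:
    openstack-compute-node::enabled:
      services:
        compute:
          nova:
            values:
              conf:
                nova:
                  DEFAULT:
                    # Example values only: reserve RAM and CPU for collocated Ceph OSDs
                    reserved_host_memory_mb: 16384
                    reserved_host_cpus: 4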

Infrastructure requirements

This section lists the infrastructure requirements for the Mirantis OpenStack for Kubernetes (MOSK) reference architecture.

Infrastructure requirements

Service

Description

MetalLB

MetalLB exposes external IP addresses of cluster services to access applications in a Kubernetes cluster.

DNS

The Kubernetes Ingress NGINX controller is used to expose OpenStack services outside of a Kubernetes deployment. Access to the Ingress services is allowed only by FQDN. Therefore, DNS is a mandatory infrastructure service for an OpenStack on Kubernetes deployment.

Automatic upgrade of a host operating system

To keep the operating system on a bare metal host up to date with the latest security updates, the operating system requires periodic software package upgrades that may or may not require a host reboot.

Mirantis Container Cloud uses life cycle management tools to update the operating system packages on the bare metal hosts.

In a management cluster, software package upgrade and host restart are applied automatically when a new Container Cloud version that includes kernel or software package upgrades is released.

In a managed cluster, package upgrade and host restart are applied as part of a regular cluster update, when applicable. To start planning the maintenance window and proceed with the managed cluster update, see Update a MOSK cluster to a major release version.

Operating system upgrade and host restart are applied to cluster nodes one by one. If Ceph is installed in the cluster, the Container Cloud orchestration securely pauses the Ceph OSDs on the node before restart. This allows avoiding degradation of the storage service.

Cloud services

Each section below is dedicated to a particular service provided by MOSK and contains configuration details and usage samples of supported capabilities provided through the custom resources.

Note

The list of services and their supported features included in this section is not exhaustive and is constantly amended based on the complexity of the architecture and use of a particular service.

Core cloud services

Compute service

Mirantis OpenStack for Kubernetes (MOSK) provides instance management capabilities through the Compute service (OpenStack Nova). The Compute service interacts with other components of an OpenStack environment to provide life-cycle management of the virtual machine instances.

Resource oversubscription

The Compute service (OpenStack Nova) enables you to spawn instances that can collectively consume more resources than what is physically available on a compute node through resource oversubscription, also known as overcommit or allocation ratio.

Resources available for oversubscription on a compute node include the number of CPUs, amount of RAM, and amount of available disk space. When making a scheduling decision, the scheduler of the Compute service takes into account the actual amount of resources multiplied by the allocation ratio. Thereby, the service allocates resources based on the assumption that not all instances will be using their full allocation of resources at the same time.

Oversubscription enables you to increase the density of workloads and compute resource utilization and, thus, achieve better Return on Investment (ROI) on compute hardware. In addition, oversubscription can also help avoid the need to create too many fine-grained flavors, which is commonly known as flavor explosion.

Configuring initial resource oversubscription

Available since MOSK 23.1

There are two ways to control the oversubscription values for compute nodes:

  • The legacy approach entails utilizing the {cpu,disk,ram}_allocation_ratio configuration options offered by the Compute service. A drawback of this method is that restarting the Compute service is mandatory to apply the new configuration. This introduces the risk of possible interruptions of cloud user operations, for example, instance build failures.

  • The modern and recommended approach, adopted in MOSK 23.1, involves using the initial_{cpu,disk,ram}_allocation_ratio configuration options, which are employed exclusively during the initial provisioning of a compute node. This may occur during the initial deployment of the cluster or when new compute nodes are added subsequently. Any further alterations can be performed dynamically using the OpenStack Placement service API without necessitating the restart of the service.

There is no definitive method for selecting optimal oversubscription values. As a cloud operator, you should continuously monitor your workloads, ideally have a comprehensive understanding of their nature, and experimentally determine the maximum values that do not impact performance. This approach ensures maximum workload density and cloud resource utilization.

To configure the initial compute resource oversubscription in MOSK, specify the spec:features:nova:allocation_ratios parameter in the OpenStackDeployment custom resource as explained in the table below.

Resource oversubscription configuration

Parameter

spec:features:nova:allocation_ratios

Configuration

Configure initial oversubscription of CPU, disk space, and RAM resources on compute nodes. By default, the following values are applied:

  • cpu: 8.0

  • disk: 1.6

  • ram: 1.0

Note

In MOSK 22.5 and earlier, the effective default value of RAM allocation ratio is 1.1.

Warning

Mirantis strongly advises against oversubscribing RAM, by any amount. See Preventing resource overconsumption for details.

Changing the resource oversubscription configuration through the OpenStackDeployment resource after cloud deployment will only affect the newly added compute nodes and will not change oversubscription for already existing compute nodes. To change oversubscription for already existing compute nodes, use the Placement service API as described in Change oversubscription settings for existing compute nodes.

Usage

Configuration example:

kind: OpenStackDeployment
spec:
  features:
    nova:
      allocation_ratios:
        cpu: 8
        disk: 1.6
        ram: 1.0

Configuration example of setting different oversubscription values for specific nodes:

spec:
  nodes:
    compute-type::hi-perf:
      features:
        nova:
          allocation_ratios:
            cpu: 2.0
            disk: 1.0

In the example configuration above, the compute nodes labeled with compute-type=hi-perf will use less intense oversubscription on CPU and no oversubscription on disk.

Preventing resource overconsumption

When using oversubscription, it is important to conduct thorough cloud management and monitoring to avoid system overloading and performance degradation. If many or all instances on a compute node start using all allocated resources at once and, thereby, overconsume physical resources, failure scenarios depend on the resource being exhausted.

Symptoms of resource exhaustion

Affected resource

Symptoms

CPU

Workloads are getting slower as they actively compete for physical CPU usage. A useful indicator is the steal time as reported inside the workload, which is a percentage of time the operating system in the workload is waiting for actual physical CPU core availability to run instructions.

To verify the steal time in the Linux-based workload, use the top command:

top -bn1 | head | grep st$ | awk -F ',' '{print $NF}'

Generally, steal times of >10 for 20-30 minutes are considered alarming.

RAM

The operating system on the compute node starts to aggressively use physical swap space, which significantly slows the workloads down. Sometimes, when the swap is also exhausted, the operating system of a compute node can outright OOM-kill the most offending processes, which can cause major disruptions to workloads or the compute node itself.

Warning

While it may seem like a good idea to make the most of available resources, oversubscribing RAM can lead to various issues and is generally not recommended due to potential performance degradation, reduced stability, and security risks for the workloads.

Mirantis strongly advises against oversubscribing RAM, by any amount.

Disk space

Depends on the physical layout of storage. Virtual root and ephemeral storage devices that are hosted on a compute node itself are put in the read-only mode, negatively affecting workloads. Additionally, the file system used by the operating system on a compute node may become read-only as well, blocking the compute node operability.

There are workload types that are not suitable for running in an oversubscribed environment, especially those with high performance, latency-sensitive, or real-time requirements. Such workloads are better suited for compute nodes with dedicated CPUs, ensuring that only processes of a single instance run on each CPU core.

vCPU type

Parameter

spec:features:nova:vcpu_type

Usage

Configures the type of vCPU that Nova will create instances with. The default CPU model configured for all instances managed by Nova is host-model, the same as in Nova for the KVM or QEMU hypervisor.

Supported CPU models

The supported CPU models include:

  • host-model (default) - mimics the host CPU and provides for decent performance, good security, and moderate compatibility with live migrations.

    With this mode, libvirt finds an available predefined CPU model that best matches the host CPU, and then explicitly adds the missing CPU feature flags to closely match the host CPU features. To mitigate known security flaws, libvirt automatically adds critical CPU flags, supported by installed libvirt, QEMU, kernel, and CPU microcode versions.

    This is a safe choice if your OpenStack compute node CPUs are of the same generation. If your OpenStack compute node CPUs are sufficiently different, for example, span multiple CPU generations, Mirantis strongly recommends setting explicit CPU models supported by all of your OpenStack compute node CPUs or organizing your OpenStack compute nodes into host aggregates and availability zones that have largely identical CPUs.

    Note

    The host-model model does not guarantee two-way live migrations between nodes.

When migrating instances, the libvirt domain XML is first copied as is to the destination OpenStack compute node. Once the instance is hard rebooted or shut down and started again, the domain XML will be re-generated. If versions of libvirt, kernel, CPU microcode, or BIOS firmware differ from those on the source compute node where the instance was originally started, libvirt may pick up additional CPU feature flags, making it impossible to live-migrate back to the original compute node.

  • host-passthrough - provides maximum performance, especially when nested virtualization is required or if live migration support is not a concern for workloads. Live migration requires exactly the same CPU on all OpenStack compute nodes, including the CPU microcode and kernel versions. Therefore, for live migrations support, organize your compute nodes into host aggregates and availability zones. For workload migration between non-identical OpenStack compute nodes, contact Mirantis support.

  • A comma-separated list of exact QEMU CPU models to create and emulate. Specify the common and less advanced CPU models first. All explicit CPU models provided must be compatible with the OpenStack compute node CPUs.

    To specify an exact CPU model, review the available CPU models and their features. List and inspect the /usr/share/libvirt/cpu_map/*.xml files in the libvirt containers of pods of the libvirt DaemonSet, or multiple DaemonSets if you are using node-specific settings.

    Review the available CPU models
    1. Identify the available libvirt DaemonSets:

      kubectl -n openstack get ds -l application=libvirt --show-labels
      

      Example of system response:

      NAME                     DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR                   AGE  LABELS
      libvirt-libvirt-default  2        2        2      2           2          openstack-compute-node=enabled  34d  app.kubernetes.io/managed-by=Helm,application=libvirt,component=libvirt,release_group=openstack-libvirt
      
    2. Identify the pods of libvirt DaemonSets:

      kubectl -n openstack get po -l application=libvirt,release_group=openstack-libvirt
      

      Example of system response:

      NAME                           READY  STATUS   RESTARTS  AGE
      libvirt-libvirt-default-5zs8m  2/2    Running  0         8d
      libvirt-libvirt-default-vt8wd  2/2    Running  0         3d14h
      
    3. List and review the available CPU model definition files. For example:

      kubectl -n openstack exec -ti libvirt-libvirt-default-5zs8m -c libvirt -- ls /usr/share/libvirt/cpu_map/*.xml
      
    4. List and review the content of all CPU model definition files. For example:

      kubectl -n openstack exec -ti libvirt-libvirt-default-5zs8m -c libvirt -- bash -c 'for f in `ls /usr/share/libvirt/cpu_map/*.xml`; do echo $f; cat $f; done'
      
Configuration examples

For example, to set the host-passthrough CPU model for all OpenStack compute nodes:

spec:
  features:
    nova:
      vcpu_type: host-passthrough

For nodes that are labeled with processor=amd-epyc, set a custom EPYC CPU model:

spec:
  nodes:
    processor::amd-epyc:
      features:
        nova:
          vcpu_type: EPYC
Live migration

Parameter

Usage

features:nova:live_migration_interface

Specifies the name of the NIC device on the actual host that will be used by Nova for the live migration of instances.

Mirantis recommends setting up your Kubernetes hosts in such a way that networking is configured identically on all of them, and names of the interfaces serving the same purpose or plugged into the same network are consistent across all physical nodes.

Also, set the option to vhost0 in the following cases:

  • The Neutron service uses Tungsten Fabric.

  • Nova migrates instances through the interface specified by the Neutron tunnel_interface parameter.
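
For illustration, a minimal OpenStackDeployment snippet that sets the live migration interface to a dedicated NIC; the interface name ens3 is only a placeholder for the device that carries live migration traffic in your environment:

spec:
  features:
    nova:
      live_migration_interface: ens3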

features:nova:libvirt:tls

Available since MOSK 23.2. If set to true, enables the live migration over TLS:

spec:
  features:
    nova:
      libvirt:
        tls:
          enabled: true

See also Securing live migration data.

Image storage back end

Parameter

features:nova:images:backend

Usage

Defines the type of storage for Nova to use on the compute hosts for the images that back up the instances.

The list of supported options includes the following (see the configuration sketch after this list):

  • local

    The local storage is used. The pros include faster operation and independence from the failure domain of the external storage. The cons include local space consumption and less performant and robust live migration with block migration.

  • ceph

    Instance images are stored in a Ceph pool shared across all Nova hypervisors. The pros include faster image start as well as faster and more robust live migration. The cons include considerably slower IO performance and direct dependency of workload operations on Ceph cluster availability and performance.

  • lvm TechPreview

    Instance images and ephemeral images are stored on a local Logical Volume. If specified, features:nova:images:lvm:volume_group must be set to an available LVM Volume Group, by default, nova-vol. For details, see Enable LVM ephemeral storage.
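
For illustration, the snippet below selects the LVM back end together with the volume group parameter mentioned above; the values are examples only:

spec:
  features:
    nova:
      images:
        backend: lvm
        lvm:
          volume_group: nova-vol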

Encrypted data transfer for noVNC

Available since MOSK 23.1

Parameter

features:nova:console:novnc:tls:enabled

Usage

The noVNC client provides remote control or remote desktop access to guest virtual machines through the Virtual Network Computing (VNC) system. The MOSK Compute service users can access their instances using the noVNC clients through the noVNC proxy server. MOSK uses TLS to secure public-facing VNC access on networks between a noVNC client and noVNC proxy server.

The features:nova:console:novnc:tls:enabled parameter ensures that the data transferred between the instance and the noVNC proxy server is encrypted. Both servers use the VeNCrypt authentication scheme for the data encryption.

To enable the encrypted data transfer for noVNC, use the following structure in the OpenStackDeployment custom resource:

kind: OpenStackDeployment
spec:
  features:
    nova:
      console:
        novnc:
          tls:
            enabled: true
Networking service

Mirantis OpenStack for Kubernetes (MOSK) Networking service (OpenStack Neutron) provides cloud applications with Connectivity-as-a-Service enabling instances to communicate with each other and the outside world.

The API provided by the service abstracts all the nuances of implementing a virtual network infrastructure on top of your own physical network infrastructure. The service allows cloud users to create advanced virtual network topologies that may include load balancing, virtual private networking, traffic filtering, and other services.

MOSK Networking service supports Open vSwitch and Tungsten Fabric SDN technologies as back ends.

General configuration

MOSK offers the Networking service as a part of its core setup. You can configure the service through the spec:features:neutron section of the OpenStackDeployment custom resource.

Tunnel interface

Parameter

features:neutron:tunnel_interface

Usage

Defines the name of the NIC device on the actual host that will be used for Neutron.

Mirantis recommends setting up your Kubernetes hosts in such a way that networking is configured identically on all of them, and names of the interfaces serving the same purpose or plugged into the same network are consistent across all physical nodes.
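
For illustration, a minimal snippet that sets the tunnel interface; the interface name bond0 is only a placeholder for the device that carries tenant tunnel traffic in your environment:

spec:
  features:
    neutron:
      tunnel_interface: bond0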

DNS servers

Parameter

features:neutron:dns_servers

Usage

Defines the list of IPs of DNS servers that are accessible from virtual networks. Used as default DNS servers for VMs.
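
For illustration, a minimal snippet that defines two public DNS resolvers as the defaults for virtual machines; the addresses are examples only:

spec:
  features:
    neutron:
      dns_servers:
        - 8.8.8.8
        - 8.8.4.4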

External networks

Parameter

features:neutron:external_networks

Usage

Contains the data structure that defines external (provider) networks on top of which the Neutron networking will be created.
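
For illustration, a sketch of an external network definition for a typical flat provider network; treat the field names and values below as an example rather than the authoritative schema and refer to the product configuration reference for the full set of supported fields:

spec:
  features:
    neutron:
      external_networks:
        - physnet: physnet1
          interface: veth-phy
          bridge: br-ex
          network_types:
            - flat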

Floating IP networks

Parameter

features:neutron:floating_network

Usage

If enabled, must contain the data structure defining the floating IP network that will be created for Neutron to provide external access to your Nova instances.
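
For illustration, a sketch of a floating IP network definition; the addresses and field values are examples only, and the exact set of supported fields should be verified against the product configuration reference:

spec:
  features:
    neutron:
      floating_network:
        enabled: true
        physnet: physnet1
        subnet:
          range: 10.11.12.0/24
          pool_start: 10.11.12.100
          pool_end: 10.11.12.200
          gateway: 10.11.12.1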

BGP dynamic routing

Available since MOSK 23.2 TechPreview

The BGP dynamic routing extension to the Networking service (OpenStack Neutron) is particularly useful for the MOSK clouds where private networks managed by cloud users need to be transparently integrated into the networking of the data center.

For example, the BGP dynamic routing is a common requirement for IPv6-enabled environments, where clients need to seamlessly access cloud workloads using dedicated IP addresses with no address translation involved in between the cloud and the external network.


BGP dynamic routing changes the way self-service (private) network prefixes are communicated to BGP-compatible physical network devices, such as routers, present in the data center. It eliminates the traditional reliance on static routes or ICMP-based advertising by enabling the direct passing of private network prefix information to router devices.

Note

To effectively use the BGP dynamic routing feature, Mirantis recommends acquiring good understanding of OpenStack address scopes and how they work.

The components of the OpenStack BGP dynamic routing are:

  • Service plugin

    An extension to the Networking service (OpenStack Neutron) that implements the logic for orchestration of BGP-related entities and provides the cloud user-facing API. A cloud administrator creates and configures a BGP speaker using the CLI or API and manually schedules it to one or more hosts running the agent, as illustrated in the example after this list.

  • Agent

    Manages BGP peering sessions. In MOSK, the BGP agent runs on nodes labeled with openstack-gateway=enabled.
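
For illustration, a typical sequence a cloud administrator could run to create a speaker, associate it with an external network, peer with a data center router, and schedule the speaker to an agent. The commands come from the standard neutron-dynamic-routing OpenStack CLI plugin; the names, AS numbers, and addresses are placeholders:

openstack bgp speaker create --local-as 64512 --ip-version 4 bgp-speaker-1
openstack bgp speaker add network bgp-speaker-1 public-net
openstack bgp peer create --peer-ip 10.11.12.1 --remote-as 64513 dc-router
openstack bgp speaker add peer bgp-speaker-1 dc-router
openstack bgp dragent add speaker <BGP_AGENT_UUID> bgp-speaker-1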

Prefix advertisement depends on the binding of external networks to a BGP speaker and the address scope of external and internal IP address ranges or subnets.

Prefix advertisement

BGP dynamic routing advertises prefixes for self-service networks and host routes for floating IP addresses.

To successfully advertise a self-service network, you need to fulfill the following conditions:

  • External and self-service networks reside in the same address scope.

  • The router contains an interface on the self-service subnet and a gateway on the external network.

  • The BGP speaker associates with the external network that provides a gateway on the router.

  • The BGP speaker has the advertise_tenant_networks attribute set to True.

To successfully advertise a floating IP address, you need to fulfill the following conditions:

  • The router with the floating IP address binding contains a gateway on an external network with the BGP speaker association.

  • The BGP speaker has the advertise_floating_ip_host_routes attribute set to true.

The diagram below is an example of the BGP dynamic routing in the non-DVR mode with self-service networks and the following advertisements:

  • B>* 192.168.0.0/25 [200/0] through 10.11.12.1

  • B>* 192.168.0.128/25 [200/0] through 10.11.12.2

  • B>* 10.11.12.234/32 [200/0] through 10.11.12.1

[Diagram: BGP dynamic routing in the non-DVR mode]
Operation in the Distributed Virtual Router (DVR) mode

For both floating IP and IPv4 fixed IP addresses, the BGP speaker advertises the gateway of the floating IP agent on the corresponding compute node as the next-hop IP address. When using IPv6 fixed IP addresses, the BGP speaker advertises the DVR SNAT node as the next-hop IP address.

The diagram below is an example of the BGP dynamic routing in the DVR mode with self-service networks and the following advertisements:

  • B>* 192.168.0.0/25 [200/0] through 10.11.12.1

  • B>* 192.168.0.128/25 [200/0] through 10.11.12.2

  • B>* 10.11.12.234/32 [200/0] through 10.11.12.12

[Diagram: BGP dynamic routing in the DVR mode]
DVR incompatibility with ARP announcements and VRRP

Due to the known issue #1774459 in the upstream implementation, Mirantis does not recommend using Distributed Virtual Routing (DVR) routers in the same networks as load balancers or other applications that utilize the Virtual Router Redundancy Protocol (VRRP) such as Keepalived. The issue prevents the DVR functionality from working correctly with network protocols that rely on the Address Resolution Protocol (ARP) announcements such as VRRP.

The issue occurs when updating permanent ARP entries for allowed_address_pair IP addresses in DVR routers because DVR performs the ARP table update through the control plane and does not allow any ARP entry to leave the node to prevent the router IP/MAC from contaminating the network.

This results in various network failover mechanisms not functioning in virtual networks that have a distributed virtual router plugged in. For instance, the default back end for MOSK Load Balancing service, represented by OpenStack Octavia with the OpenStack Amphora back end when deployed in the HA mode in a DVR-connected network, is not able to redirect the traffic from a failed active service instance to a standby one without interruption.

Block Storage service

Mirantis OpenStack for Kubernetes (MOSK) provides volume management capability through the Block Storage service (OpenStack Cinder).

Backup configuration

MOSK provides support for the following back ends for the Block Storage service (OpenStack Cinder):

Support status of storage back ends for Cinder

Back end

Support status

Ceph

Full support, default

NFS

  • TechPreview for Yoga and newer OpenStack releases

  • Available since MOSK 23.2

S3

  • TechPreview for Yoga and newer OpenStack releases

  • Available since MOSK 23.2

In MOSK, Cinder backup is enabled and uses the Ceph back end for Cinder by default. The backup configuration is stored in the spec:features:cinder:backup structure in the OpenStackDeployment custom resource. If necessary, you can disable the backup feature in Cinder as follows:

kind: OpenStackDeployment
spec:
  features:
    cinder:
      backup:
        enabled: false

Using this structure, you can also configure another backup driver supported by MOSK for Cinder as described below. At any given time, only one back end can be enabled.

Configuring an NFS driver

Available since MOSK 23.2 TechPreview

MOSK supports NFS Unix authentication exclusively. To use an NFS driver with MOSK, ensure you have a preconfigured NFS server with an NFS share accessible to a Unix Cinder user. This user must be the owner of the exported NFS folder, and the folder must have the permission value set to 775.

All Cinder services run as the same user by default. To obtain the Unix user ID:

kubectl -n openstack get pod -l application=cinder,component=api -o jsonpath='{.items[0].spec.securityContext.runAsUser}'

Note

The NFS server must be accessible through the network from all OpenStack control plane nodes of the cluster.

To enable the NFS storage for Cinder backup, configure the following structure in the OpenStackDeployment object:

spec:
  features:
    cinder:
      backup:
        drivers:
          <BACKEND_NAME>:
            type: nfs
            enabled: true
            backup_share: <URL_TO_NFS_SHARE>

You can specify the backup_share parameter in the following formats: hostname:path, ipv4addr:path, or [ipv6addr]:path. For example: 1.2.3.4:/cinder_backup.

Configuring an S3 driver

Available since MOSK 23.2 TechPreview

To use an S3 driver with MOSK, ensure you have a preconfigured S3 storage with a user account created for access.

Note

The S3 storage must be accessible through the network from all OpenStack control plane nodes of the cluster.

To enable the S3 storage for Cinder backup:

  1. Create a dedicated secret in Kubernetes to securely store the credentials required for accessing the S3 storage:

    ---
    apiVersion: v1
    kind: Secret
    metadata:
      labels:
        openstack.lcm.mirantis.com/osdpl_secret: "true"
      name: cinder-backup-s3-hidden
      namespace: openstack
    type: Opaque
    data:
      access_key: <ACCESS_KEY_FOR_S3_ACCOUNT>
      secret_key: <SECRET_KEY_FOR_S3_ACCOUNT>
    
  2. Configure the following structure in the OpenStackDeployment object:

    spec:
      features:
        cinder:
          backup:
            drivers:
              <BACKEND_NAME>:
                type: s3
                enabled: true
                endpoint_url: <URL_TO_S3_STORAGE>
                store_bucket: <S3_BUCKET_NAME>
                store_access_key:
                  value_from:
                    secret_key_ref:
                      key: access_key
                      name: cinder-backup-s3-hidden
                store_secret_key:
                  value_from:
                    secret_key_ref:
                      key: secret_key
                      name: cinder-backup-s3-hidden
    
Volume encryption

TechPreview

The Block Storage service (OpenStack Cinder) supports volume encryption using a key stored in the Key Manager service (OpenStack Barbican). Such a configuration uses Linux Unified Key Setup (LUKS) to create an encrypted volume type and attach it to the Compute service (OpenStack Nova) instances. Nova retrieves the asymmetric key from Barbican and stores it on the OpenStack compute node as a libvirt key to encrypt the volume locally or on the back end and only after that transfers it to Cinder.
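
For illustration, a cloud administrator could define a LUKS-encrypted volume type with the standard OpenStack CLI; the type name and cipher parameters below are examples only:

openstack volume type create \
  --encryption-provider luks \
  --encryption-cipher aes-xts-plain64 \
  --encryption-key-size 256 \
  --encryption-control-location front-end \
  LUKS

Volumes created with this type, for example, through openstack volume create --size 10 --type LUKS encrypted-vol, have their data encrypted with a key stored in the Key Manager service.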

Note

  • To create an encrypted volume under a non-admin user, the creator role must be assigned to the user.

  • When planning your cloud, consider that encryption may impact CPU performance.

Identity service

Mirantis OpenStack for Kubernetes (MOSK) provides authentication, service discovery, and distributed multi-tenant authorization through the OpenStack Identity service, aka Keystone.

Integration with Mirantis Container Cloud IAM

MOSK integrates with Mirantis Container Cloud Identity and Access Management (IAM) subsystem to allow centralized management of users and their permissions across multiple clouds.

The core component of Container Cloud IAM is Keycloak, the open-source identity and access management software. Its primary function is to perform secure authentication of cloud users against its built-in or various external identity databases, such as LDAP directories, OpenID Connect or SAML compatible identity providers.

By default, every MOSK cluster is integrated with the Keycloak running in the Container Cloud management cluster. The integration automatically provisions the necessary configuration on the MOSK and Container Cloud IAM sides, such as the os client object in Keycloak. However, for the federated users to get proper permissions after logging in, the cloud operator needs to define the role mapping rules specific to each MOSK environment.

Connecting to Keycloak

Parameter

features:keystone:keycloak

Usage

Defines parameters to connect to the Keycloak identity provider

Regions

A region in MOSK represents a complete OpenStack cluster that has a dedicated control plane and set of API endpoints. It is not uncommon for operators of large clouds to offer their users several OpenStack regions, which differ by their geographical location or purpose. In order to easily navigate in a multi-region environment, cloud users need a way to distinguish clusters by their names.

The region_name parameter of an OpenStackDeployment custom resource specifies the name of the region that will be configured in all the OpenStack services comprising the MOSK cluster upon the initial deployment.

Important

Once the cluster is up and running, the cloud operator cannot set or change the name of the region. Therefore, Mirantis recommends selecting a meaningful name for the new region before the deployment starts. For example, the region name can be based on the name of the data center the cluster is located in.

Usage sample:

apiVersion: lcm.mirantis.com/v1alpha1
kind: OpenStackDeployment
metadata:
  name: openstack-cluster
  namespace: openstack
spec:
  region_name: <your-region-name>
Application credentials

Application credentials is a mechanism in the MOSK Identity service that enables application automation tools, such as shell scripts, Terraform modules, Python programs, and others, to securely perform various actions in the cloud API in order to deploy and manage application components.

Application credentials is a modern alternative to the legacy approach where every application owner had to request several technical user accounts to ensure their tools could authenticate in the cloud.

For the details on how to create and authenticate with application credentials, refer to Manage application credentials.
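
For illustration, a cloud user could create a credential for an automation tool with the standard OpenStack CLI; the credential name, role, and expiration date are examples only:

openstack application credential create \
  --role member \
  --expiration 2025-12-31T23:59:59 \
  terraform-ci

The returned credential ID and secret can then be used by the tool to authenticate, for example, through the v3applicationcredential authentication type in clouds.yaml.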

Application credentials must be explicitly enabled for federated users

By default, cloud users logging in to the cloud through the Mirantis Container Cloud IAM or any external identity provider cannot use the application credentials mechanism.

An application credential is heavily tied to the account of the cloud user owning it. An application automation tool that is a consumer of the credential acts on behalf of the human user who created the credential. Each action that the application automation tool performs gets authorized against the permissions, including roles and groups, the user currently has.

The source of truth about a federated user’s permissions is the identity provider. This information gets temporarily transferred to the cloud’s Identity service inside a token once the user authenticates. By default, if such a user creates an application credential and passes it to the automation tool, there is no data to validate the tool’s actions on the user’s behalf.

However, a cloud operator can configure the authorization_ttl parameter for an identity provider object to enable caching of its users’ authorization data. The parameter defines how long, in minutes, the information about user permissions is preserved in the database after the user successfully logs in to the cloud.

Warning

Authorization data caching has security implications. If a federated user account is revoked or their permissions change in the identity provider, the cloud Identity service will still allow performing actions on the user’s behalf until the cached data expires or the user re-authenticates in the cloud.

To set authorization_ttl to, for example, 60 minutes for the keycloak identity provider in Keystone:

  1. Log in to the keystone-client Pod:

    kubectl -n openstack exec $(kubectl -n openstack get po -l application=keystone,component=client -oname) -ti -c keystone-client -- bash
    
  2. Inside the Pod, run the following command:

    openstack identity provider set keycloak --authorization-ttl 60
    
Domain-specific configuration

Parameter

features:keystone:domain_specific_configuration

Usage

Defines the domain-specific configuration and is useful for integration with LDAP. An example of OsDpl with LDAP integration, which will create a separate domain.with.ldap domain and configure it to use LDAP as an identity driver:

spec:
  features:
    keystone:
      domain_specific_configuration:
        enabled: true
        domains:
          domain.with.ldap:
            enabled: true
            config:
              assignment:
                driver: keystone.assignment.backends.sql.Assignment
              identity:
                driver: ldap
              ldap:
                chase_referrals: false
                group_desc_attribute: description
                group_id_attribute: cn
                group_member_attribute: member
                group_name_attribute: ou
                group_objectclass: groupOfNames
                page_size: 0
                password: XXXXXXXXX
                query_scope: sub
                suffix: dc=mydomain,dc=com
                url: ldap://ldap01.mydomain.com,ldap://ldap02.mydomain.com
                user: uid=openstack,ou=people,o=mydomain,dc=com
                user_enabled_attribute: enabled
                user_enabled_default: false
                user_enabled_invert: true
                user_enabled_mask: 0
                user_id_attribute: uid
                user_mail_attribute: mail
                user_name_attribute: uid
                user_objectclass: inetOrgPerson
Image service

Mirantis OpenStack for Kubernetes (MOSK) provides the image management capability through the OpenStack Image service, aka Glance.

The Image service enables you to discover, register, and retrieve virtual machine images. Using the Glance API, you can query virtual machine image metadata and retrieve actual images.

MOSK deployment profiles include the Image service in the core set of services. You can configure the Image service through the spec:features definition in the OpenStackDeployment custom resource.

Image signature verification

TechPreview

MOSK can automatically verify the cryptographic signatures associated with images to ensure the integrity of their data. A signed image has a few additional properties set in its metadata that include img_signature, img_signature_hash_method, img_signature_key_type, and img_signature_certificate_uuid. You can find more information about these properties and their values in the upstream OpenStack documentation.

MOSK performs image signature verification during the following operations:

  • A cloud user or a service creates an image in the store and starts to upload its data. If the signature metadata properties are set on the image, its content gets verified against the signature. The Image service accepts non-signed image uploads.

  • A cloud user spawns a new instance from an image. The Compute service ensures that the data it downloads from the image storage matches the image signature. If the signature is missing or does not match the data, the operation fails. Limitations apply, see Known limitations.

  • A cloud user boots an instance from a volume, or creates a new volume from an image. If the image is signed, the Block Storage service compares the downloaded image data against the signature. If there is a mismatch, the operation fails. The service will accept a non-signed image as a source for a volume. Limitations apply, see Known limitations.

Configuration example
spec:
  features:
    glance:
      signature:
        enabled: true
Signing pre-built images

Every MOSK cloud is pre-provisioned with a baseline set of images containing the most popular operating systems, such as Ubuntu, Fedora, and CirrOS.

In addition, a few services in MOSK rely on the creation of service instances to provide their functions, namely the Load Balancer service and the Bare Metal service, and require corresponding images to exist in the image store.

When image signature verification is enabled during the cloud deployment, all these images get automatically signed with a pre-generated self-signed certificate. Enabling the feature in an already existing cloud requires manual signing of all of the images stored in it. Consult the OpenStack documentation for an example of the image signing procedure.
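
For reference, the following is a condensed sketch of the image signing flow adapted from the upstream OpenStack documentation. The file names, certificate subject, and image name are examples only; adjust them to your environment:

# Generate a signing key and a self-signed certificate
openssl genrsa -out signing_key.pem 4096
openssl req -new -x509 -key signing_key.pem -out signing_cert.pem -days 365 -subj "/CN=image-signing"

# Store the certificate in the Key Manager service (Barbican) and note its UUID
openstack secret store --name image-signing-cert \
  --algorithm RSA --secret-type certificate \
  --payload-content-type "application/octet-stream" \
  --payload-content-encoding base64 \
  --payload "$(base64 -w 0 signing_cert.pem)"

# Sign the image data and upload the image with the signature metadata set
openssl dgst -sha256 -sign signing_key.pem -sigopt rsa_padding_mode:pss -out image.signature image.qcow2
openstack image create --disk-format qcow2 --container-format bare \
  --property img_signature="$(base64 -w 0 image.signature)" \
  --property img_signature_hash_method='SHA-256' \
  --property img_signature_key_type='RSA-PSS' \
  --property img_signature_certificate_uuid=<certificate UUID from Barbican> \
  --file image.qcow2 my-signed-image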

Supported storage back ends

The image signature verification is supported for LVM and local back ends for ephemeral storage.

The functionality is not compatible with Ceph-backed ephemeral storage combined with RAW-formatted images. The Ceph copy-on-write mechanism enables the user to create instance virtual disks without downloading the image to a compute node; the data is handled entirely on the Ceph cluster side. This enables you to spin up instances almost instantly but makes it impossible to verify the image data before creating an instance from it.

Known limitations
  • The Image service does not enforce the presence of a signature in the metadata when the user creates a new image. The service accepts non-signed image uploads.

  • The Image service does not verify the correctness of an image signature upon update of the image metadata.

  • MOSK does not validate whether the certificate used to sign an image is trusted; it only ensures the correctness of the signature itself. Cloud users are allowed to use self-signed certificates.

  • The Compute service does not verify image signature for Ceph back end when the RAW image format is used as described in Supported storage back ends.

  • The Compute service does not verify image signature if the image is already cached on the target compute node.

  • The Instance HA service may experience issues when auto-evacuating instances created from signed images if it does not have access to the corresponding secrets in the Key Manager service.

  • The Block Storage service does not perform image signature verification when a Ceph back end is used and the images are in the RAW format.

  • The Block Storage service does not enforce the presence of a signature on the images.

Object Storage service

Ceph Object Gateway provides Object Storage (Swift) API for end users in MOSK deployments. For the API compatibility, refer to Ceph Documentation: Ceph Object Gateway Swift API.

Object storage enablement

Parameter

features:services:object-storage

Usage

Enables the object storage and provides a RADOS Gateway Swift API that is compatible with the OpenStack Swift API.

To enable the service, add object-storage to the service list:

spec:
  features:
    services:
    - object-storage

To create the RADOS Gateway pool in Ceph, see Container Cloud Operations Guide: Enable Ceph RGW Object Storage.

Object storage server-side encryption

TechPreview

Ceph Object Gateway also provides Amazon S3 compatible API. For details, see Ceph Documentation: Ceph Object Gateway S3 API. Using integration with the OpenStack Key Manager service (Barbican), the objects uploaded through S3 API can be encrypted by Ceph Object Gateway according to the AWS Documentation: Protecting data using server-side encryption with customer-provided encryption keys (SSE-C) specification.

Instead of Swift, such a configuration uses an S3 client to upload server-side encrypted objects. With server-side encryption, the data is sent over a secure HTTPS connection in an unencrypted form, and the Ceph Object Gateway stores it in the Ceph cluster in an encrypted form.
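
For illustration, a server-side encrypted upload through the AWS CLI could look as follows. The endpoint URL, bucket, and object names are placeholders, and the AWS CLI is assumed to be configured with EC2-style credentials issued by the Identity service (openstack ec2 credentials create):

# Generate a 256-bit customer-provided key and its MD5 digest
KEY=$(openssl rand -base64 32)
KEY_MD5=$(echo -n "$KEY" | base64 -d | openssl dgst -md5 -binary | base64)

# Upload an object that Ceph Object Gateway encrypts with the provided key
aws --endpoint-url https://<rgw-endpoint> s3api put-object \
  --bucket <bucket> --key <object-name> --body <local-file> \
  --sse-customer-algorithm AES256 \
  --sse-customer-key "$KEY" \
  --sse-customer-key-md5 "$KEY_MD5"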

Dashboard

MOSK Dashboard (OpenStack Horizon) provides a web-based interface for users to access the functions of the cloud services.

Custom theme

Parameter

features:horizon:themes

Usage

Defines the list of custom OpenStack Dashboard themes. Content of the archive file with a theme depends on the level of customization and can include static files, Django templates, and other artifacts. For the details, refer to OpenStack official documentation: Customizing Horizon Themes.

spec:
  features:
    horizon:
      themes:
        - name: theme_name
          description: The brand new theme
          url: https://<path to .tgz file with the contents of custom theme>
          sha256summ: <SHA256 checksum of the archive above>

Auxiliary cloud services

Telemetry services

TechPreview

The Telemetry services are part of OpenStack services available in Mirantis OpenStack for Kubernetes (MOSK). The Telemetry services monitor OpenStack components, collect and store the telemetry data from them, and perform responsive actions upon this data. See OpenStack cluster for details about OpenStack services in MOSK.

OpenStack Ceilometer is a service that collects data from various OpenStack components. The service can also collect and process notifications from different OpenStack services. Ceilometer stores the data in the Gnocchi database. The service is specified as metering in the OpenStackDeployment custom resource (CR).

Gnocchi is an open-source time series database. One of the advantages of this database is the ability to pre-aggregate the telemetry data while storing it. Gnocchi is specified as metric in the OpenStackDeployment CR.

OpenStack Aodh is part of the Telemetry project. Aodh provides a service that creates alarms based on various metric values or specific events and triggers response actions. The service uses data collected and stored by Ceilometer and Gnocchi. Aodh is specified as alarming in the OpenStackDeployment CR.

Enabling Telemetry services

The Telemetry feature in MOSK has a single mode, autoscaling, which provides settings for telemetry data collection and storage. The OpenStackDeployment CR must have this mode specified for the OpenStack Telemetry services to work correctly. The autoscaling mode has the following notable configurations:

  • Gnocchi stores cache and data using the Redis storage driver.

  • The metric service (Gnocchi) stores data for one hour with a resolution of 1 minute.

The Telemetry services are disabled by default in MOSK. You have to enable them in the openstackdeployment.yaml file (the OpenStackDeployment CR). The following code block provides an example of deploying the Telemetry services as part of MOSK:

kind: OpenStackDeployment
spec:
  features:
    services:
    - alarming
    - metering
    - metric
    telemetry:
      mode: autoscaling
Advanced configuration
Gnocchi

Gnocchi is not an OpenStack service, so the settings related to its functioning should be included in the spec:common:infra section of the OpenStackDeployment CR.
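
For example, a minimal sketch of passing values to the Gnocchi Helm chart through this section could look as follows. The conf:gnocchi:metricd structure and the workers setting are assumptions about the chart values and must be verified against the actual chart before use:

spec:
  common:
    infra:
      values:
        conf:
          gnocchi:
            metricd:
              workers: 4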

Ceilometer

The Ceilometer configuration files contain many list structures. Overriding list elements in YAML files is context-dependent and error-prone. Therefore, to override these configuration files, define the spec:services structure in the OpenStackDeployment CR. The spec:services structure provides the ability to use a complete file as text and not as YAML data structure.

Overriding through the spec:services structure is possible for the following files:

  • pipeline.yaml

  • polling.yaml

  • meters.yaml

  • gnocchi_resources.yaml

  • event_pipeline.yaml

  • event_definitions.yaml

An example of overriding through the OpenStackDeployment CR

By default, the autoscaling mode collects the data related to CPU, disk, and memory every minute, and the rest of the available metrics every hour.

The following example shows the overriding of the polling.yaml configuration file through the spec:services structure of the OpenStackDeployment CR.

  1. Get the current configuration file:

    kubectl -n openstack get secret ceilometer-etc -ojsonpath="{.data['polling\.yaml']}" | base64 -d
    sources:
    - interval: 60
      meters:
      - cpu
      - disk*
      - memory*
      name: ascale_pollsters
    - interval: 3600
      meters:
      - '*'
      name: all_pollsters
    
  2. Add the network* meters to the list of meters collected with the 60-second interval.

  3. Copy and paste the edited polling.yaml file content to the spec:services:metering section of the OpenStackDeployment CR:

    spec:
      services:
        metering:
          ceilometer:
            conf:
              polling: | # Obligatory. The "|" indicator denotes the literal style. See https://yaml.org/spec/1.2-old/spec.html#id2795688 for details.
                sources:
                - interval: 60
                  meters:
                  - cpu
                  - disk*
                  - memory*
                  - network*
                  name: ascale_pollsters
                - interval: 3600
                  meters:
                  - '*'
                  name: all_pollsters
    
Bare Metal service

The Bare Metal service (Ironic) is an extra OpenStack service that can be deployed by the OpenStack Operator. This section provides the baremetal-specific configuration options of the OpenStackDeployment resource.

Enabling the Bare Metal service

The Bare Metal service is not included into the core set of services and needs to be explicitly enabled in the OpenStackDeployment custom resource.

To install bare metal services, add the baremetal keyword to the spec:features:services list:

spec:
  features:
    services:
      - baremetal

Note

All bare metal services are scheduled to the nodes with the openstack-control-plane: enabled label.

Ironic agent deployment images

To provision a user image onto a bare metal server, Ironic boots a node with a ramdisk image. Depending on the node’s deploy interface and hardware, the ramdisk may require different drivers (agents). MOSK provides tinyIPA-based ramdisk images and uses the direct deploy interface with the ipmitool power interface.

Example of agent_images configuration:

spec:
  features:
    ironic:
       agent_images:
         base_url: https://binary.mirantis.com/openstack/bin/ironic/tinyipa
         initramfs: tinyipa-stable-ussuri-20200617101427.gz
         kernel: tinyipa-stable-ussuri-20200617101427.vmlinuz

Since the bare metal node hardware may require additional drivers, you may need to build a deploy ramdisk for your particular hardware. For more information, see Ironic Python Agent Builder. Be sure to create a ramdisk image with the version of Ironic Python Agent appropriate for your OpenStack release.
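
For reference, a minimal ramdisk build with the upstream Ironic Python Agent Builder could look as follows. The distribution and output names are examples only:

pip install ironic-python-agent-builder
# Builds a deploy ramdisk for Ubuntu and produces my-ipa.kernel and my-ipa.initramfs
ironic-python-agent-builder ubuntu -o my-ipa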

Bare metal networking

Ironic supports the flat and multitenancy networking modes.

The flat networking mode assumes that all bare metal nodes are pre-connected to a single network that cannot be changed during the virtual machine provisioning. This network with bridged interfaces for Ironic should be spread across all nodes, including compute nodes, to allow regular virtual machines to plug into the Ironic network. In its turn, the interface defined as provisioning_interface should be spread across gateway nodes. The cloud operator can perform all this underlying configuration through the L2 templates.

Example of the OsDpl resource illustrating the configuration for the flat network mode:

spec:
  features:
    services:
      - baremetal
    neutron:
      external_networks:
        - bridge: ironic-pxe
          interface: <baremetal-interface>
          network_types:
            - flat
          physnet: ironic
          vlan_ranges: null
    ironic:
       # The name of neutron network used for provisioning/cleaning.
       baremetal_network_name: ironic-provisioning
       networks:
         # Neutron baremetal network definition.
         baremetal:
           physnet: ironic
           name: ironic-provisioning
           network_type: flat
           external: true
           shared: true
           subnets:
             - name: baremetal-subnet
               range: 10.13.0.0/24
               pool_start: 10.13.0.100
               pool_end: 10.13.0.254
               gateway: 10.13.0.11
       # The name of interface where provision services like tftp and ironic-conductor
       # are bound.
       provisioning_interface: br-baremetal

The multitenancy network mode uses the neutron Ironic network interface to share physical connection information with Neutron. This information is handled by Neutron ML2 drivers when plugging a Neutron port to a specific network. MOSK supports the networking-generic-switch Neutron ML2 driver out of the box.

Example of the OsDpl resource illustrating the configuration for the multitenancy network mode:

spec:
  features:
    services:
      - baremetal
    neutron:
      tunnel_interface: ens3
      external_networks:
        - physnet: physnet1
          interface: <physnet1-interface>
          bridge: br-ex
          network_types:
            - flat
          vlan_ranges: null
          mtu: null
        - physnet: ironic
          interface: <physnet-ironic-interface>
          bridge: ironic-pxe
          network_types:
            - vlan
          vlan_ranges: 1000:1099
    ironic:
      # The name of interface where provision services like tftp and ironic-conductor
      # are bound.
      provisioning_interface: <baremetal-interface>
      baremetal_network_name: ironic-provisioning
      networks:
        baremetal:
          physnet: ironic
          name: ironic-provisioning
          network_type: vlan
          segmentation_id: 1000
          external: true
          shared: false
          subnets:
            - name: baremetal-subnet
              range: 10.13.0.0/24
              pool_start: 10.13.0.100
              pool_end: 10.13.0.254
              gateway: 10.13.0.11
DNS service

Mirantis OpenStack for Kubernetes (MOSK) provides the DNS record management capability through the DNS service (OpenStack Designate).

LoadBalancer type for PowerDNS

The supported back end for Designate is PowerDNS. If required, you can specify an external IP address for the PowerDNS LoadBalancer service in Kubernetes, as well as the protocol to use: UDP, TCP, or TCP + UDP.

To configure LoadBalancer for PowerDNS, use the spec:features:designate definition in the OpenStackDeployment custom resource.

The list of supported options includes:

  • external_ip - Optional. An IP address for the LoadBalancer service. If not defined, LoadBalancer allocates the IP address.

  • protocol - A protocol for the Designate back end in Kubernetes. Can only be udp, tcp, or tcp+udp.

  • type - The type of the back end for Designate. Can only be powerdns.

For example:

spec:
  features:
    designate:
      backend:
        external_ip: 10.172.1.101
        protocol: udp
        type: powerdns
DNS service known limitations
Inability to set up a secondary DNS zone

Due to an issue in the dnspython library, Asynchronous Transfer Full Range (AXFR) requests do not work, which makes it impossible to set up a secondary DNS zone. The issue affects OpenStack Victoria and will be fixed in the Yoga release.

Key Manager service

MOSK Key Manager service (OpenStack Barbican) provides secure storage, provisioning, and management of cloud application secret data, such as Symmetric Keys, Asymmetric Keys, Certificates, and raw binary data.

Configuring the Vault back end

Parameter

features:barbican:backends:vault

Usage

Specifies the object containing the Vault parameters to connect to Barbican.

The list of supported options includes:

  • enabled - boolean parameter indicating that the Vault back end is enabled

  • approle_role_id - Vault app role ID

  • approle_secret_id - secret ID created for the app role

  • vault_url - URL of the Vault server

  • use_ssl - enables the SSL encryption. Since MOSK does not currently support the Vault SSL encryption, the use_ssl parameter should be set to false

  • kv_mountpoint TechPreview - optional, specifies the mountpoint of a Key-Value store in Vault to use

  • namespace TechPreview - optional, specifies the Vault namespace to use with all requests to Vault

    Note

    The Vault namespaces feature is available only in Vault Enterprise.

    Note

    Vault namespaces are supported only starting from the OpenStack Victoria release.

If the Vault back end is used, configure it properly using the following parameters:

spec:
  features:
    barbican:
      backends:
        vault:
          enabled: true
          approle_role_id: <APPROLE_ROLE_ID>
          approle_secret_id: <APPROLE_SECRET_ID>
          vault_url: <VAULT_SERVER_URL>
          use_ssl: false

Mirantis recommends hiding the approle_role_id and approle_secret_id keys as described in Hiding sensitive information.
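
For example, assuming MOSK 23.1 or later and a pre-created Kubernetes secret, here named barbican-vault-hidden, labeled with openstack.lcm.mirantis.com/osdpl_secret: "true" in the openstack namespace, the sensitive values can be referenced instead of being stored in plain text:

spec:
  features:
    barbican:
      backends:
        vault:
          enabled: true
          vault_url: <VAULT_SERVER_URL>
          use_ssl: false
          approle_role_id:
            value_from:
              secret_key_ref:
                key: approle_role_id
                name: barbican-vault-hidden
          approle_secret_id:
            value_from:
              secret_key_ref:
                key: approle_secret_id
                name: barbican-vault-hidden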

Note

Since MOSK does not currently support the Vault SSL encryption, set the use_ssl parameter to false.

Instance High Availability service

TechPreview

Instance High Availability service (OpenStack Masakari) enables cloud users to ensure that their instances get automatically evacuated from a failed hypervisor.

Architecture of the Instance HA service

The service consists of the following components:

  • API receives requests from users and events from monitors, and sends them to the engine

  • Engine executes the recovery workflow

  • Monitors detect failures and notify the API. MOSK uses monitors of the following types:

    • Instance monitor verifies the liveness of instance processes

    • Host monitor verifies the liveness of a compute host and runs as part of the Node controller from the OpenStack Controller

    Note

    The Processes monitor is not present in MOSK since HA for the compute processes is handled by Kubernetes.

Enabling the Instance HA service

The Instance HA service is not included into the core set of services and needs to be explicitly enabled in the OpenStackDeployment custom resource.

Parameter

features:services:instance-ha

Usage

Enables Masakari, the OpenStack service that ensures high availability of instances running on a host. To enable the service, add instance-ha to the service list:

spec:
  features:
    services:
    - instance-ha
Shared Filesystems service

Available since MOSK 22.5 TechPreview

MOSK Shared Filesystems service (OpenStack Manila) provides Shared Filesystems as a service. The Shared Filesystems service enables you to create and manage shared filesystems in your multi-project cloud environments.

Service architecture

The Shared Filesystems service consists of the manila-api, manila-scheduler, and manila-share services. All these services communicate with each other through the AMQP protocol and store their data in the MySQL database.

manila-api

Provides a stable RESTful API, authenticates and routes requests throughout the Shared Filesystem service

manila-scheduler

Responsible for scheduling and routing requests to the appropriate manila-share service by determining which back end should serve as the destination for a share creation request

manila-share

Responsible for managing Shared Filesystems service devices, specifically the back-end ones

The diagram below illustrates how the Shared Filesystems services communicate with each other.

Untitled Diagram
Shared Filesystems drivers

MOSK ensures support for different kinds of equipment and shared filesystems by means of special drivers that are part of the manila-share service. These drivers also determine the ability to restrict access to data stored on a shared filesystem, the list of operations with Manila shares, and the types of connections to the client network.

Driver Handles Share Servers (DHSS) is one of the main parameters that define the Manila workflow, including the way the Manila driver provides clients with access to shared filesystems. Some drivers support only one DHSS mode, for example, the LVM share driver. Others support both modes, for example, the Generic driver. If DHSS is set to False in the driver configuration, the driver does not prepare a share server that provides access to the shared filesystems, and the administrator must perform the server and network setup. In this case, the Shared Filesystems service only manages the server defined in its own configuration.

Untitled Diagram

If the driver configuration includes DHSS=True, the driver creates a service virtual machine that provides access to shared filesystems. Also, when DHSS=True, the Shared Filesystems service performs a network setup to provide clients with access to the created service virtual machine. To work with the service virtual machine, the Shared Filesystems service requires a separate service network, which must be included in the driver's configuration as well.

Consider the Generic driver as an example of the DHSS=True case. There are two network topologies for connecting the client's network to the service virtual machine, which depend on the connect_share_server_to_tenant_network parameter. If the connect_share_server_to_tenant_network parameter is set to False, which is the default, the client must create a shared network connected to a public router. IP addresses from this network are granted access to the created shared filesystem. The Shared Filesystems service creates a subnet in its service network to which the network port of the new service virtual machine and the network port of the client's router are connected. When a new shared filesystem is created, the client's machine is granted access to it through the router.

Untitled Diagram

If the connect_share_server_to_tenant_network parameter is set to True, the Shared Filesystems service creates the service virtual machines with two network interfaces. One of them is connected to the service network while the other one is connected to the client’s network.

Untitled Diagram
Enabling the Shared Filesystems service

The Shared Filesystems service is not included into the core set of services and needs to be explicitly enabled in the OpenStackDeployment custom resource.

To install the OpenStack Manila services, add the shared-file-system keyword to the spec:features:services list:

spec:
  features:
    services:
      - shared-file-system

OpenStack

OpenStack cluster

OpenStack and auxiliary services are running as containers in the kind: Pod Kubernetes resources. All long-running services are governed by one of the ReplicationController-enabled Kubernetes resources, which include either kind: Deployment, kind: StatefulSet, or kind: DaemonSet.

The placement of the services is mostly governed by the Kubernetes node labels. The labels affecting the OpenStack services include:

  • openstack-control-plane=enabled - the node hosting most of the OpenStack control plane services.

  • openstack-compute-node=enabled - the node serving as a hypervisor for Nova. The virtual machines with tenant workloads are created there.

  • openvswitch=enabled - the node hosting Neutron L2 agents and Open vSwitch pods that manage the L2 connectivity of the OpenStack networks.

  • openstack-gateway=enabled - the node hosting Neutron L3, Metadata and DHCP agents, Octavia Health Manager, Worker and Housekeeping components.
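
For example, to list the nodes that are assigned a particular role, filter the nodes by the corresponding label:

kubectl get nodes -l openstack-control-plane=enabled
kubectl get nodes -l openstack-compute-node=enabled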

_images/os-k8s-pods-layout.png

Note

OpenStack is an infrastructure management platform. Mirantis OpenStack for Kubernetes (MOSK) uses Kubernetes mostly for orchestration and dependency isolation. As a result, multiple OpenStack services are running as privileged containers with host PIDs and Host Networking enabled. You must ensure that at least the user with the credentials used by Helm/Tiller (administrator) is capable of creating such Pods.

Infrastructure services

Service

Description

Storage

While the underlying Kubernetes cluster is configured to use Ceph CSI for providing persistent storage for container workloads, for some types of workloads such networked storage is suboptimal due to latency.

This is why the separate local-volume-provisioner CSI is deployed and configured as an additional storage class. Local Volume Provisioner is deployed as kind: DaemonSet.

Database

A single WSREP (Galera) cluster of MariaDB is deployed as the SQL database to be used by all OpenStack services. It uses the storage class provided by Local Volume Provisioner to store the actual database files. The service is deployed as kind: StatefulSet of a given size, which is no less than 3, on any openstack-control-plane node. For details, see OpenStack database architecture.

Messaging

RabbitMQ is used as a messaging bus between the components of the OpenStack services.

A separate instance of RabbitMQ is deployed for each OpenStack service that needs a messaging bus for intercommunication between its components.

An additional, separate RabbitMQ instance is deployed to serve as a notification messages bus for OpenStack services to post their own and listen to notifications from other services. StackLight also uses this message bus to collect notifications for monitoring purposes.

Each RabbitMQ instance is a single node and is deployed as kind: StatefulSet.

Caching

A single multi-instance Memcached service is deployed to be used by all OpenStack services that need caching, which are mostly HTTP API services.

Coordination

A separate instance of etcd is deployed to be used by Cinder, which requires Distributed Lock Management for coordination between its components.

Ingress

Is deployed as kind: DaemonSet.

Image pre-caching

A special kind: DaemonSet is deployed and updated each time the kind: OpenStackDeployment resource is created or updated. Its purpose is to pre-cache container images on Kubernetes nodes, and thus, to minimize possible downtime when updating container images.

This is especially useful for containers used in kind: DaemonSet resources, as during the image update Kubernetes starts to pull the new image only after the container with the old image is shut down.

OpenStack services

Service

Description

Identity (Keystone)

Uses MySQL back end by default.

keystoneclient - a separate kind: Deployment with a pod that has the OpenStack CLI client as well as relevant plugins installed, and OpenStack admin credentials mounted. Can be used by administrator to manually interact with OpenStack APIs from within a cluster.

Image (Glance)

Supported back end is RBD (Ceph is required).

Volume (Cinder)

Supported back end is RBD (Ceph is required).

Network (Neutron)

Supported back ends are Open vSwitch and Tungsten Fabric.

Placement

Compute (Nova)

Supported hypervisor is Qemu/KVM through libvirt library.

Dashboard (Horizon)

DNS (Designate)

Supported back end is PowerDNS.

Load Balancer (Octavia)

Ceph Object Gateway (SWIFT)

Provides the object storage and a Ceph Object Gateway Swift API that is compatible with the OpenStack Swift API. You can manually enable the service in the OpenStackDeployment CR as described in Deploy an OpenStack cluster.

Instance HA (Masakari)

An OpenStack service that ensures high availability of instances running on a host. You can manually enable Masakari in the OpenStackDeployment CR as described in Deploy an OpenStack cluster.

Orchestration (Heat)

Key Manager (Barbican)

The supported back ends include:

  • The built-in Simple Crypto, which is used by default

  • Vault

    Vault by HashiCorp is a third-party system and is not installed by MOSK. Hence, the Vault storage back end should be available elsewhere on the user environment and accessible from the MOSK deployment.

    If the Vault back end is used, you can configure Vault in the OpenStackDeployment CR as described in Deploy an OpenStack cluster.

Tempest

Runs tests against a deployed OpenStack cloud. You can manually enable Tempest in the OpenStackDeployment CR as described in Deploy an OpenStack cluster.

Telemetry

Telemetry services include alarming (aodh), metering (Ceilometer), and metric (Gnocchi). All services should be enabled together through the list of services to be deployed in the OpenStackDeployment CR as described in Deploy an OpenStack cluster.

OpenStack database architecture

A complete setup of a MariaDB Galera cluster for OpenStack is illustrated in the following image:

_images/os-k8s-mariadb-galera.png

MariaDB server pods are running a Galera multi-master cluster. Client requests are forwarded by the Kubernetes mariadb service to the mariadb-server pod that has the primary label. Other pods from the mariadb-server StatefulSet have the backup label. Labels are managed by the mariadb-controller pod.

The MariaDB Controller periodically checks the readiness of the mariadb-server pods and sets the primary label on a pod if the following requirements are met:

  • The primary label has not already been set on the pod.

  • The pod is in the ready state.

  • The pod is not being terminated.

  • The pod name has the lowest integer suffix among other ready pods in the StatefulSet. For example, between mariadb-server-1 and mariadb-server-2, the pod with the mariadb-server-1 name is preferred.

Otherwise, the MariaDB Controller sets the backup label. This means that all SQL requests are passed to only one node, while the other two nodes are in the backup state and replicate the state from the primary node. The MariaDB clients connect to the mariadb service.
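
For example, to see which mariadb-server pod currently carries the primary label, display the pod labels:

kubectl -n openstack get pods --show-labels | grep mariadb-server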

OpenStack lifecycle management

The OpenStack Operator component is a combination of the following entities:

OpenStack Controller

The OpenStack Controller runs in a set of containers in a pod in Kubernetes. The OpenStack Controller is deployed as a Deployment with 1 replica only. The failover is provided by Kubernetes that automatically restarts the failed containers in a pod.

However, given the recommendation to use a separate Kubernetes cluster for each OpenStack deployment, the controller is expected to manage only a single OpenStackDeployment resource, which makes proper HA much less of an issue.

The OpenStack Controller is written in Python using Kopf, as a Python framework to build Kubernetes operators, and Pykube, as a Kubernetes API client.

Using Kubernetes API, the controller subscribes to changes to resources of kind: OpenStackDeployment, and then reacts to these changes by creating, updating, or deleting appropriate resources in Kubernetes.

The basic child resources managed by the controller are Helm releases. They are rendered from templates taking into account an appropriate values set from the main and features fields in the OpenStackDeployment resource.

Then, the common fields are merged to resulting data structures. Lastly, the services fields are merged providing the final and precise override for any value in any Helm release to be deployed or upgraded.

The constructed values are then used by the OpenStack Controller during a Helm release installation.

OpenStack Controller containers

Container

Description

osdpl

The core container that handles changes in the osdpl object.

helmbundle

The container that watches the helmbundle objects and reports their statuses to the osdpl object in status:children. See OpenStackDeploymentStatus custom resource for details.

health

The container that watches all Kubernetes native resources, such as Deployments, Daemonsets, Statefulsets, and reports their statuses to the osdpl object in status:health. See OpenStackDeploymentStatus custom resource for details.

secrets

The container that provides data exchange between different components such as Ceph.

node

The container that handles the node events.

_images/openstack_controller.png
OpenStackDeployment Admission Controller

The CustomResourceDefinition resource in Kubernetes uses the OpenAPI Specification version 2 to specify the schema of the resource defined. The Kubernetes API outright rejects the resources that do not pass this schema validation.

The language of the schema, however, is not expressive enough to define a specific validation logic that may be needed for a given resource. For this purpose, Kubernetes enables the extension of its API with Dynamic Admission Control.

For the OpenStackDeployment (OsDpl) CR the ValidatingAdmissionWebhook is a natural choice. It is deployed as part of OpenStack Controller by default and performs specific extended validations when an OsDpl CR is created or updated.

The non-exhaustive list of additional validations includes:

  • Deny the OpenStack version downgrade

  • Deny the OpenStack version skip-level upgrade

  • Deny the OpenStack master version deployment

  • Deny upgrade to the OpenStack master version

  • Deny upgrade if any part of an OsDpl CR specification changes along with the OpenStack version

Under specific circumstances, it may be viable to disable the Admission Controller, for example, when you attempt to deploy or upgrade to the master version of OpenStack.

Warning

Mirantis does not support MOSK deployments performed without the OpenStackDeployment Admission Controller enabled. Disabling of the OpenStackDeployment Admission Controller is only allowed in staging non-production environments.

To disable the Admission Controller, ensure that the following structures and values are present in the openstack-controller HelmBundle resource:

apiVersion: lcm.mirantis.com/v1alpha1
kind: HelmBundle
metadata:
  name: openstack-operator
  namespace: osh-system
spec:
  releases:
  - name: openstack-operator
    values:
      admission:
        enabled: false

At that point, all safeguards except for those expressed by the CR definition are disabled.

OpenStack configuration

MOSK provides its configuration capabilities through a number of custom resources. This section provides a detailed overview of these custom resources and their possible configuration.

OpenStackDeployment custom resource

Detailed information about the schema of an OpenStackDeployment custom resource can be obtained by running:

kubectl get crd openstackdeployments.lcm.mirantis.com -o yaml

The definition of a particular OpenStack deployment can be obtained by running:

kubectl -n openstack get osdpl -o yaml
Example of an OpenStackDeployment CR of minimum configuration
apiVersion: lcm.mirantis.com/v1alpha1
kind: OpenStackDeployment
metadata:
  name: openstack-cluster
  namespace: openstack
spec:
  openstack_version: victoria
  preset: compute
  size: tiny
  internal_domain_name: cluster.local
  public_domain_name: it.just.works
  features:
    neutron:
      tunnel_interface: ens3
      external_networks:
        - physnet: physnet1
          interface: veth-phy
          bridge: br-ex
          network_types:
           - flat
          vlan_ranges: null
          mtu: null
      floating_network:
        enabled: False
    nova:
      live_migration_interface: ens3
      images:
        backend: local
Hiding sensitive information

Available since MOSK 23.1

The OpenStackDeployment custom resource enables you to securely store sensitive fields in Kubernetes secrets. To do that, verify that the reference secret is present in the same namespace as the OpenStackDeployment object and the openstack.lcm.mirantis.com/osdpl_secret label is set to true. The list of fields that can be hidden from OpenStackDeployment is limited and defined by the OpenStackDeployment schema.

For example, to hide spec:features:ssl:public_endpoints:api_cert, use the following structure:

spec:
  features:
    ssl:
      public_endpoints:
        api_cert:
          value_from:
            secret_key_ref:
              key: api_cert
              name: osh-dev-hidden
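
The referenced secret itself is a regular Kubernetes secret carrying the required label, with the sensitive data base64-encoded. A minimal example matching the structure above:

apiVersion: v1
kind: Secret
metadata:
  name: osh-dev-hidden
  namespace: openstack
  labels:
    openstack.lcm.mirantis.com/osdpl_secret: "true"
type: Opaque
data:
  api_cert: <base64-encoded certificate content>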

Note

The fields that were used to store confidential settings in OpenStackDeploymentSecret and OpenStackDeployment before MOSK 23.1 include:

spec:
  features:
    ssl:
      public_endpoints:
        - ca_cert
        - api_cert
        - api_key
    barbican:
      backends:
        vault:
          - approle_role_id
          - approle_secret_id
          - ssl_ca_crt_file
    baremetal:
      ngs:
        hardware:
          *:
            - username
            - password
            - ssh_private_key
            - secret
Main elements
Main elements of OpenStackDeployment custom resource

Element

Sub-element

Description

apiVersion

n/a

Specifies the version of the Kubernetes API that is used to create this object

kind

n/a

Specifies the kind of the object

metadata

name

Specifies the name of metadata. Should be set in compliance with the Kubernetes resource naming limitations

namespace

Specifies the metadata namespace. While it is technically possible to deploy OpenStack on top of Kubernetes in a namespace other than openstack, such a configuration is not included in the MOSK system integration test plans. Therefore, Mirantis does not recommend such a scenario.

Warning

Both OpenStack and Kubernetes platforms provide resources to applications. When OpenStack is running on top of Kubernetes, Kubernetes is completely unaware of OpenStack-native workloads, such as virtual machines, for example.

For better results and stability, Mirantis recommends using a dedicated Kubernetes cluster for OpenStack, so that OpenStack and auxiliary services, Ceph, and StackLight are the only Kubernetes applications running in the cluster.

spec

openstack_version

Specifies the OpenStack release to deploy

preset

String that specifies the name of the preset, a predefined configuration for the OpenStack cluster. A preset includes:

  • A set of enabled services that includes virtualization, bare metal management, secret management, and others

  • Major features provided by the services, such as VXLAN encapsulation of the tenant traffic

  • Integration of services

Every supported deployment profile incorporates an OpenStack preset. Refer to Deployment profiles for the list of possible values.

size

String that specifies the size category for the OpenStack cluster. The size category defines the internal configuration of the cluster such as the number of replicas for service workers and timeouts, etc.

The list of supported sizes includes:

  • tiny - for approximately 10 OpenStack compute nodes

  • small - for approximately 50 OpenStack compute nodes

  • medium - for approximately 100 OpenStack compute nodes

public_domain_name

Specifies the public DNS name for OpenStack services. This is a base DNS name that must be accessible and resolvable by API clients of your OpenStack cloud. It will be present in the OpenStack endpoints as presented by the OpenStack Identity service catalog.

The TLS certificates used by the OpenStack services (see below) must also be issued to this DNS name.

persistent_volume_storage_class

Specifies the Kubernetes storage class name used for services to create persistent volumes. For example, backups of MariaDB. If not specified, the storage class marked as default will be used.

features

Contains the top-level collections of settings for the OpenStack deployment that potentially target several OpenStack services. This is the section where customizations should take place.

The features:services element contains a list of extra OpenStack services to deploy. Extra OpenStack services are services that are not included in the preset.

region_name

TechPreview

The name of the region used for deployment, defaults to RegionOne.

features:policies

Defines the list of custom policies for OpenStack services.

Configuration structure:

spec:
  features:
    policies:
      nova:
        custom_policy: custom_value

The list of services available for configuration includes: Cinder, Nova, Designate, Keystone, Glance, Neutron, Heat, Octavia, Barbican, Placement, Ironic, aodh, Gnocchi, and Masakari.

Caution

Mirantis is not responsible for cloud operability in case of default policies modifications but provides API to pass the required configuration to the core OpenStack services.

features:policies:strict_admin

TechPreview

Enables a tested set of policies that limits the global admin role to only the user with the admin role in the admin project or the user with the service role. The latter should be used only for service users utilized for communication between OpenStack services.

Configuration structure:

spec:
  features:
    policies:
      strict_admin:
        enabled: true
  services:
    identity:
      keystone:
        values:
          conf:
            keystone:
              resource:
                admin_project_name: admin
                admin_project_domain_name: Default

Note

The spec.services part of the above section will become redundant in one of the following releases.

artifacts

A low-level section that defines the base URI prefixes for images and binary artifacts.

common

A low-level section that defines values that will be passed to all OpenStack (spec:common:openstack) or auxiliary (spec:common:infra) services Helm charts.

Configuration structure:

spec:
  artifacts:
  common:
    openstack:
      values:
    infra:
      values:
services

The lowest-level section, which enables the definition of specific values to pass to individual Helm charts on a one-by-one basis.

Warning

Mirantis does not recommend changing the default settings for spec:artifacts, spec:common, and spec:services elements. Customizations can compromise the OpenStack deployment update and upgrade processes. However, you may need to edit the spec:services section to limit hardware resources in case of a hyperconverged architecture as described in Limit HW resources for hyperconverged OpenStack compute nodes.

Logging

Parameter

features:logging:<service>:level

Usage

Specifies the standard logging levels for OpenStack services that include the following, at increasing severity: TRACE, DEBUG, INFO, AUDIT, WARNING, ERROR, and CRITICAL.

Configuration example:

spec:
  features:
    logging:
      nova:
        level: DEBUG
Node-specific configuration

Depending on the use case, you may need to configure the same application components differently on different hosts. MOSK enables you to easily perform the required configuration through node-specific overrides at the OpenStack Controller side.

The limitation of node-specific overrides is that they override only the configuration settings, while other components, such as startup scripts, may need to be reconfigured as well.

Caution

The overrides have been implemented in a similar way to the OpenStack node and node label specific DaemonSet configurations. However, the OpenStack Controller node-specific settings conflict with the upstream OpenStack node and node label specific DaemonSet configurations. Therefore, Mirantis does not recommend configuring node and node label overrides.

The list of allowed node labels is located in the Cluster object status providerStatus.releaseRef.current.allowedNodeLabels field.

If the value field is not defined in allowedNodeLabels, a label can have any value.

Before or after a machine deployment, add the required label from the allowed node labels list with the corresponding value to spec.providerSpec.value.nodeLabels in machine.yaml. For example:

nodeLabels:
- key: <NODE-LABEL>
  value: <NODE-LABEL-VALUE>

The addition of a node label that is not available in the list of allowed node labels is restricted.

The node-specific settings are activated through the spec:nodes section of the OsDpl CR. The spec:nodes section contains the following subsections:

  • features - implements overrides for a limited subset of fields and is constructed similarly to spec::features

  • services - similarly to spec::services, enables you to override settings in general for the components running as DaemonSets.

Example configuration:

spec:
  nodes:
    <NODE-LABEL>::<NODE-LABEL-VALUE>:
      features:
        # Detailed information about features might be found at
        # openstack_controller/admission/validators/nodes/schema.yaml
      services:
        <service>:
          <chart>:
            <chart_daemonset_name>:
              values:
                # Any value from specific helm chart
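
For illustration only, a hypothetical override that passes a custom Nova option to the nova-compute DaemonSet on the nodes labeled processor=amd-epyc could look as follows. The chart and DaemonSet keys (nova, nova_compute) must match the actual Helm chart being overridden:

spec:
  nodes:
    processor::amd-epyc:
      services:
        compute:
          nova:
            nova_compute:
              values:
                conf:
                  nova:
                    DEFAULT:
                      reserved_host_memory_mb: 8192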
Tempest

Parameter

features:services:tempest

Usage

Enables tests against a deployed OpenStack cloud:

spec:
  features:
    services:
    - tempest

See also

API Reference

OpenStackDeploymentSecret custom resource

Deprecated in MOSK 23.1

The resource of kind OpenStackDeploymentSecret (OsDplSecret) is a custom resource that is intended to aggregate cloud’s confidential settings such as SSL/TLS certificates, external systems access credentials, and other secrets.

To obtain detailed information about the schema of an OsDplSecret custom resource, run:

kubectl get crd openstackdeploymentsecret.lcm.mirantis.com -o yaml
Usage

The resource has a structure similar to the OpenStackDeployment custom resource and enables the user to set a limited subset of fields that contain sensitive data.

Important

If you are migrating the related fields from the OpenStackDeployment custom resource, refer to Migrating secrets from OpenStackDeployment to OpenStackDeploymentSecret CR.

Example of an OpenStackDeploymentSecret custom resource of minimum configuration:

apiVersion: lcm.mirantis.com/v1alpha1
kind: OpenStackDeploymentSecret
metadata:
  name: osh-dev
  namespace: openstack
spec:
  features:
    ssl:
      public_endpoints:
        ca_cert: |-
          -----BEGIN CERTIFICATE-----
          ...
          -----END CERTIFICATE-----
        api_cert: |-
          -----BEGIN CERTIFICATE-----
          ...
          -----END CERTIFICATE-----
        api_key: |-
          -----BEGIN RSA PRIVATE KEY-----
          ...
          -----END RSA PRIVATE KEY-----
    barbican:
      backends:
        vault:
          approle_role_id: f6f0f775-...-cc00a1b7d0c3
          approle_secret_id: 2b5c4b87-...-9bfc6d796f8c
Public endpoints certificates
features:ssl

Contains the content of SSL/TLS certificates (server, key, and CA bundle) used to enable secure communication to public OpenStack API services.

These certificates must be issued to the DNS domain specified in the public_domain_name field.

Vault back end for Barbican
features:barbican:backends:vault

Specifies the object containing parameters used to connect to a Hashicorp Vault instance. The list of supported configurations includes:

  • approle_role_id – Vault app role ID

  • approle_secret_id – Secret ID created for the app role

OpenStackDeploymentStatus custom resource

The resource of kind OpenStackDeploymentStatus (OsDplSt) is a custom resource that describes the status of an OpenStack deployment. To obtain detailed information about the schema of an OpenStackDeploymentStatus (OsDplSt) custom resource, run:

kubectl get crd openstackdeploymentstatus.lcm.mirantis.com -o yaml

To obtain the status definition for a particular OpenStack deployment, run:

kubectl -n openstack get osdplst -o yaml
Example of an OpenStackDeploymentStatus custom resource configuration
kind: OpenStackDeploymentStatus
metadata:
  name: osh-dev
  namespace: openstack
spec: {}
status:
  handle:
    lastStatus: update
  health:
    barbican:
      api:
        generation: 2
        status: Ready
    cinder:
      api:
        generation: 2
        status: Ready
      backup:
        generation: 1
        status: Ready
      scheduler:
        generation: 1
        status: Ready
      volume:
        generation: 1
        status: Ready
  osdpl:
    cause: update
    changes: '((''add'', (''status'',), None, {''watched'': {''ceph'': {''secret'':
      {''hash'': ''0fc01c5e2593bc6569562b451b28e300517ec670809f72016ff29b8cbaf3e729''}}}}),)'
    controller_version: 0.5.3.dev12
    fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
    openstack_version: ussuri
    state: APPLIED
    timestamp: "2021-09-08 17:01:45.633143"
  services:
    baremetal:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:54.081353"
    block-storage:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:57.306669"
    compute:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:18.853068"
    coordination:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:00.593719"
    dashboard:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:57.652145"
    database:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:00.233777"
    dns:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:56.540886"
    identity:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:00.961175"
    image:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:58.976976"
    ingress:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:01.440757"
    key-manager:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:51.822997"
    load-balancer:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:02.462824"
    memcached:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:03.165045"
    messaging:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:58.637506"
    networking:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:35.553483"
    object-storage:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:01.828834"
    orchestration:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:01:02.846671"
    placement:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:58.039210"
    redis:
      controller_version: 0.5.3.dev12
      fingerprint: a112a4a7d00c0b5b79e69a2c78c3b50b0caca76a15fe7d79a6ad1305b19ee5ec
      openstack_version: ussuri
      state: APPLIED
      timestamp: "2021-09-08 17:00:36.562673"
Health structure

The health subsection provides a brief output on services health.

OsDpl structure

The osdpl subsection describes the overall status of the OpenStack deployment.

OsDpl structure elements

Element

Description

cause

The cause that triggered the LCM action: update when OsDpl is updated, resume when the OpenStack Controller is restarted

changes

A string representation of changes in the OpenStackDeployment object

controller_version

The version of openstack-controller that handles the LCM action

fingerprint

The SHA sum of the OpenStackDeployment object spec section

openstack_version

The current OpenStack version specified in the osdpl object

state

The current state of the LCM action. Possible values include:

  • APPLYING - not all operations are completed

  • APPLIED - all operations are completed

timestamp

The timestamp of the status:osdpl section update

Services structure

The services subsection provides detailed information of LCM performed with a specific service. This is a dictionary where keys are service names, for example, baremetal or compute and values are dictionaries with the following items.

Services structure elements

Element

Description

controller_version

The version of the openstack-controller that handles the LCM action on a specific service

fingerprint

The SHA sum of the OpenStackDeployment object spec section used when performing the LCM on a specific service

openstack_version

The OpenStack version specified in the osdpl object used when performing the LCM action on a specific service

state

The current state of the LCM action performed on a service. Possible values include:

  • WAITING - waiting for dependencies.

  • APPLYING - not all operations are completed.

  • APPLIED - all operations are completed.

timestamp

The timestamp of the status:services:<SERVICE-NAME> section update.
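
For example, to check the LCM state of a particular service without printing the whole object, you can use a JSONPath query. The osh-dev name is taken from the example above:

kubectl -n openstack get osdplst osh-dev -o jsonpath='{.status.services.compute.state}'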

OpenStack Controller configuration

Available since MOSK 23.2

The OpenStack Controller enables you to modify its configuration at runtime without restarting. MOSK stores the controller configuration in the openstack-controller-config ConfigMap in the osh-system namespace of your cluster.

To retrieve the OpenStack Controller configuration ConfigMap, run:

kubectl -n osh-system get configmaps openstack-controller-config -o yaml
Example of an OpenStack Controller configuration ConfigMap
apiVersion: v1
data:
  extra_conf.ini: |
    [maintenance]
    respect_nova_az = false
kind: ConfigMap
metadata:
  annotations:
    openstackdeployments.lcm.mirantis.com/skip_update: "true"
  name: openstack-controller-config
  namespace: osh-system
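
To change the settings at runtime, edit the ConfigMap in place, for example:

kubectl -n osh-system edit configmap openstack-controller-config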
OpenStack Controller extra configuration parameters

Section

Parameter

Default value

Description

[osctl]

wait_application_ready_timeout

1200

The number of seconds to wait for all application components to become ready.

wait_application_ready_delay

10

The number of seconds before going to the sleep mode between attempts to verify if the application is ready.

node_not_ready_flapping_timeout

120

The amount of time to wait for the flapping node.

[helmbundle]

manifest_enable_timeout

600

The number of seconds to wait until the values set in the manifest are propagated to the dependent objects.

manifest_enable_delay

10

The number of seconds between attempts to verify if the values were applied.

manifest_disable_timeout

600

The number of seconds to wait until the values are removed from the manifest and propagated to the child objects.

manifest_disable_delay

10

The number of seconds between attempts to verify if the values were removed from the release.

manifest_purge_timeout

600

The number of seconds to wait until the Kubernetes object is removed.

manifest_purge_delay

10

The number of seconds between attempts to verify if the Kubernetes object is removed.

manifest_apply_delay

10

The number of seconds to pause for the Helm bundle changes.

[maintenance]

instance_migrate_concurrency

1

The number of instances to migrate concurrently.

nwl_parallel_max_compute

30

The maximum number of compute nodes allowed for a parallel update.

nwl_parallel_max_gateway

1

The maximum number of gateway nodes allowed for a parallel update.

respect_nova_az

true

Respect Nova availability zone (AZ). The true value allows the parallel update only for the compute nodes in the same AZ. See the configuration example after this table.

ndr_skip_instance_check

false

The flag to skip the instance verification on a host before proceeding with the node removal. With the default false value, the node removal is blocked while at least one instance still exists on the host.

ndr_skip_volume_check

false

The flag to skip the volume verification on a host before proceeding with the node removal. With the default false value, the node removal is blocked while at least one volume still exists on the host. A volume is tied to a specific host only for the LVM back end.
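
For example, to allow the parallel update of compute nodes regardless of their availability zone, you could set respect_nova_az to false under the [maintenance] section of the extra_conf.ini key, following the ConfigMap example above. This is a minimal sketch; the controller picks up the change at runtime without a restart:

kubectl -n osh-system edit configmap openstack-controller-config

# In the editor, adjust the extra_conf.ini key, for example:
#   [maintenance]
#   respect_nova_az = false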

OpenStack database

MOSK relies on the MariaDB Galera cluster to provide its OpenStack components with reliable storage for persistent data.

For successful long-term operations of a MOSK cloud, it is crucial to ensure the healthy state of the OpenStack database as well as the safety of the data stored in it. To help you with that, MOSK provides built-in automated procedures for OpenStack database maintenance, backup, and restoration. This chapter describes the internal mechanisms and configuration details for the provided tools.

Overview of the OpenStack database backup and restoration

MOSK relies on the MariaDB Galera cluster to provide its OpenStack components with a reliable storage for persistent data. Mirantis recommends backing up your OpenStack databases daily to ensure the safety of your cloud data. Also, you should always create an instant backup before updating your cloud or performing any kind of potentially disruptive experiment.

MOSK has a built-in automated backup routine that can be triggered manually or by schedule. For detailed information about the process of MariaDB Galera cluster backup, refer to Workflows of the OpenStack database backup and restoration.
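
Because the backup routine is implemented as a Kubernetes CronJob, described in Periodic OpenStack database backups, one way to trigger an instant backup is to create a Job from that CronJob. The following is a minimal sketch with an arbitrary Job name; refer to the Operations Guide of your MOSK version for the officially supported procedure:

kubectl -n openstack create job --from=cronjob/mariadb-phy-backup mariadb-phy-backup-manual-001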

Backup and restoration can only be performed against the OpenStack database as a whole. Granular per-service or per-table procedures are not supported by MOSK.

Periodic backups

By default, periodic backups are turned off. However, a cloud operator can easily enable this capability by adding the following structure to the OpenStackDeployment custom resource:

spec:
  features:
    database:
      backup:
        enabled: true

For the configuration details, refer to Periodic OpenStack database backups.

Database restoration

Along with the automated backup routine, MOSK provides the Mariabackup tool for the OpenStack database restoration. For the database restoration procedure, refer to Restore OpenStack databases from a backup. For more information about the restoration process, consult Workflows of the OpenStack database backup and restoration.

Storage for backup data

By default, the MOSK backup routine stores the OpenStack database data in the Mirantis Ceph cluster, which is a part of the same cloud. This is sufficient for the vast majority of clouds. However, you may want to store the backup data off the cloud to comply with specific enterprise practices for infrastructure recovery and data safety.

To achieve that, MOSK enables you to point the backup routine to an external data volume. For details, refer to Remote storage for OpenStack database backups.

Size of a backup storage

The size of a backup storage volume depends directly on the size of the MOSK cluster, which can be determined through the size parameter in the OpenStackDeployment CR.

The list of the recommended sizes for a minimal backup volume includes:

  • 20 GB for the tiny cluster size

  • 40 GB for the small cluster size

  • 80 GB for the medium cluster size

If required, you can change the default size of a database backup volume. However, make sure that you configure the volume size before OpenStack deployment is complete. This is because there is no automatic way to resize the backup volume once the cloud is deployed. Also, only the local backup storage (Ceph) supports the configuration of the volume size.

To change the default size of the backup volume, use the following structure in the OpenStackDeployment CR:

spec:
  services:
    database:
      mariadb:
        values:
          volume:
            phy_backup:
              size: "200Gi"
Local backup storage - default

To store the backup data to a local Mirantis Ceph, the MOSK underlying Kubernetes cluster needs to have a preconfigured storage class for Kubernetes persistent volumes with the Ceph cluster as a storage back end.

When restoring the OpenStack database from a local Ceph storage, the cron job restores the state on each MariaDB node sequentially. It is not possible to perform parallel restoration because Ceph Kubernetes volumes do not support concurrent mounting from multiple places.

Remote backup storage

MOSK provides you with a capability to store the OpenStack database data outside of the cloud, on an external storage device that supports common data access protocols, such as third-party NAS appliances.

Refer to Remote storage for OpenStack database backups for the configuration details.

Workflows of the OpenStack database backup and restoration

This section provides technical details about the internal implementation of the automated backup and restoration routines built into MOSK. The information below is helpful for troubleshooting issues related to the process and for understanding the impact these procedures have on a running cloud.

Backup workflow

The OpenStack database backup workflow consists of the following phases.

Backup phase 1

The mariadb-phy-backup job launches the mariadb-phy-backup-<TIMESTAMP> pod. This pod contains the main backup script, which is responsible for:

  • Basic sanity checks and choosing the right node for the backup

  • Verifying the wsrep status and changing the wsrep_desync parameter settings

  • Managing the mariadb-phy-backup-runner pod

During the first backup phase, the following actions take place:

  1. Sanity check: verification of the Kubernetes status and wsrep status of each MariaDB pod. If some pods have wrong statuses, the backup job fails unless the --allow-unsafe-backup parameter is passed to the main script in the Kubernetes backup job.

    Note

    • Since MOSK 22.4, the --allow-unsafe-backup functionality is removed from the product for security and backup procedure simplification purposes.

    • Mirantis does not recommend setting the --allow-unsafe-backup parameter unless it is absolutely required. To ensure the consistency of a backup, verify that the MariaDB Galera cluster is in a working state before you proceed with the backup.

  2. Select the replica to back up. The system selects the replica with the highest number in its name as a target replica. For example, if the MariaDB server pods have the mariadb-server-0, mariadb-server-1, and mariadb-server-2 names, the mariadb-server-2 replica will be backed up.

  3. Desynchronize the replica from the Galera cluster. The script connects to the target replica and sets the wsrep_desync variable to ON. Then, the replica stops receiving write-sets and receives the wsrep status Donor/Desynced. The Kubernetes health check of that mariadb-server pod fails and the Kubernetes status of that pod becomes Not ready. If the pod has the primary label, the MariaDB Controller sets the backup label to it and the pod is removed from the endpoints list of the MariaDB service.

_images/os-k8s-mariadb-backup-phase1.png
Backup phase 2
  1. The main script in the mariadb-phy-backup pod launches the Kubernetes pod mariadb-phy-backup-runner-<TIMESTAMP> on the same node where the target mariadb-server replica is running, which is node X in the example.

  2. The mariadb-phy-backup-runner pod has both the MySQL data directory and the backup directory mounted. The pod performs the following actions:

    1. Verifies that there is enough space in the /var/backup folder to perform the backup. The amount of available space in the folder must be greater than <DB-SIZE> * <MARIADB-BACKUP-REQUIRED-SPACE-RATIO> in KB.

    2. Performs the actual backup using the mariabackup tool.

    3. If the number of current backups is greater than the value of the MARIADB_BACKUPS_TO_KEEP job parameter, the script removes all old backups exceeding the allowed number of backups.

    4. Exits with 0 code.

  3. The script waits until the mariadb-phy-backup-runner pod is completed and collects its logs.

  4. The script puts the backed up replica back in sync with the Galera cluster by setting wsrep_desync to OFF and waits for the replica to become Ready in Kubernetes.

_images/os-k8s-mariadb-backup-phase2.png
Restoration workflow

The OpenStack database restoration workflow consists of the following phases.

Restoration phase 1

The mariadb-phy-restore job launches the mariadb-phy-restore pod. This pod contains the main restore script, which is responsible for:

  • Scaling the mariadb-server StatefulSet

  • Verifying the statuses of the mariadb-server pods

  • Managing the openstack-mariadb-phy-restore-runner pods

Caution

During the restoration, the database is unavailable to OpenStack services, which means a complete outage of all OpenStack services.

During the first phase, the following actions are performed:

  1. Save the list of mariadb-server persistent volume claims (PVC).

  2. Scale the mariadb-server StatefulSet to 0 replicas. At this point, the database becomes unavailable for OpenStack services.

_images/os-k8s-mariadb-restore-phase1.png
Restoration phase 2
  1. The mariadb-phy-restore pod launches openstack-mariadb-phy-restore-runner with the first mariadb-server replica PVC mounted to the /var/lib/mysql folder and the backup PVC mounted to /var/backup. The openstack-mariadb-phy-restore-runner pod performs the following actions:

    1. Unarchives the database backup files to a temporary directory within /var/backup.

    2. Executes mariabackup --prepare on the unarchived data.

    3. Creates the .prepared file in the temporary directory in /var/backup.

    4. Restores the backup to /var/lib/mysql.

    5. Exits with 0.

  2. The script in the mariadb-phy-restore pod collects the logs from the openstack-mariadb-phy-restore-runner pod and removes the pod. Then, the script launches the next openstack-mariadb-phy-restore-runner pod for the next mariadb-server replica PVC. The openstack-mariadb-phy-restore-runner pod restores the backup to /var/lib/mysql and exits with 0.

    Step 2 is repeated for every mariadb-server replica PVC sequentially.

  3. When the last replica’s data is restored, the last openstack-mariadb-phy-restore-runner pod removes the .prepared file and the temporary folder with unarchived data from /var/backup.

_images/os-k8s-mariadb-restore-phase2.png
Restoration phase 3
  1. The mariadb-phy-restore pod scales the mariadb-server StatefulSet back to the configured number of replicas.

  2. The mariadb-phy-restore pod waits until all mariadb-server replicas are ready.

_images/os-k8s-mariadb-restore-phase3.png
OpenStack database auto-cleanup

By design, when deleting a cloud resource, for example, an instance, volume, or router, an OpenStack service does not immediately delete its data but marks it as removed so that it can later be picked up by the garbage collector.

Given that an OpenStack resource is often represented by more than one record in the database, deletion of all of them right away could affect the overall responsiveness of the cloud API. On the other hand, an OpenStack database being severely clogged with stale data is one of the most typical reasons for the cloud slowness.

To keep the OpenStack database small and fast, MOSK is preconfigured to automatically clean up removed database records older than 30 days. By default, the cleanup is performed for the following MOSK services every Monday according to the schedule below:

The default database cleanup schedule by OpenStack service

Service

Service identifier

Clean up time

Block Storage (OpenStack Cinder)

cinder

12:01 a.m.

Compute (OpenStack Nova)

nova

01:01 a.m.

Image (OpenStack Glance)

glance

02:01 a.m.

Instance HA (OpenStack Masakari)

masakari

03:01 a.m.

Key Manager (OpenStack Barbican)

barbican

04:01 a.m.

Orchestration (OpenStack Heat)

heat

05:01 a.m.

If required, you can adjust the cleanup schedule for the OpenStack database by adding the features:database:cleanup setting to the OpenStackDeployment CR following the example below. The schedule parameter must contain a valid cron expression. The age parameter specifies the number of days after which a stale record gets cleaned up.

spec:
  features:
    database:
      cleanup:
        <os-service-identifier>:
          enabled: true
          schedule: "1 0 * * 1"
          age: 30
          batch: 1000
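
Assuming the cleanup is implemented as per-service CronJob objects in the openstack namespace, you can review their schedules after adjusting the settings by listing the CronJob objects; the exact job names depend on the MOSK version:

kubectl -n openstack get cronjobs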
Periodic OpenStack database backups

MOSK uses the Mariabackup utility to back up the MariaDB Galera cluster data where the OpenStack data is stored. Mariabackup is launched periodically as a Kubernetes CronJob that is included in any MOSK deployment and is suspended by default.

Note

If you are using the default back end to store the backup data, which is Ceph, you can increase the default size of a backup volume. However, make sure to configure the volume size before you deploy OpenStack.

For the default sizes and configuration details, refer to Size of a backup storage.

Enabling the periodic backup

MOSK enables you to configure the periodic backup of the OpenStack database through the OpenStackDeployment object. To enable the backup, use the following structure:

spec:
  features:
    database:
      backup:
        enabled: true

By default, the backup job:

  • Runs backup on a daily basis at 01:00 AM

  • Creates incremental backups daily and full backups weekly

  • Keeps 10 latest full backups

  • Stores backups in the mariadb-phy-backup-data PVC

  • Has a backup timeout of 3600 seconds

  • Uses the incremental backup type

To verify the configuration of the mariadb-phy-backup CronJob object, run:

kubectl -n openstack get cronjob mariadb-phy-backup -o yaml
Example of a mariadb-phy-backup CronJob object
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  annotations:
    openstackhelm.openstack.org/release_uuid: ""
  creationTimestamp: "2020-09-08T14:13:48Z"
  managedFields:
  <<<skipped>>>>
  name: mariadb-phy-backup
  namespace: openstack
  resourceVersion: "726449"
  selfLink: /apis/batch/v1beta1/namespaces/openstack/cronjobs/mariadb-phy-backup
  uid: 88c9be21-a160-4de1-afcf-0853697dd1a1
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
      labels:
        application: mariadb-phy-backup
        component: backup
        release_group: openstack-mariadb
    spec:
      activeDeadlineSeconds: 4200
      backoffLimit: 0
      completions: 1
      parallelism: 1
      template:
        metadata:
          creationTimestamp: null
          labels:
            application: mariadb-phy-backup
            component: backup
            release_group: openstack-mariadb
        spec:
          containers:
          - command:
            - /tmp/mariadb_resque.py
            - backup
            - --backup-timeout
            - "3600"
            - --backup-type
            - incremental
            env:
            - name: MARIADB_BACKUPS_TO_KEEP
              value: "10"
            - name: MARIADB_BACKUP_PVC_NAME
              value: mariadb-phy-backup-data
            - name: MARIADB_FULL_BACKUP_CYCLE
              value: "604800"
            - name: MARIADB_REPLICAS
              value: "3"
            - name: MARIADB_BACKUP_REQUIRED_SPACE_RATIO
              value: "1.2"
            - name: MARIADB_RESQUE_RUNNER_IMAGE
              value: docker-dev-kaas-local.docker.mirantis.net/general/mariadb:10.4.14-bionic-20200812025059
            - name: MARIADB_RESQUE_RUNNER_SERVICE_ACCOUNT
              value: mariadb-phy-backup-runner
            - name: MARIADB_RESQUE_RUNNER_POD_NAME_PREFIX
              value: openstack-mariadb
            - name: MARIADB_POD_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            image: docker-dev-kaas-local.docker.mirantis.net/general/mariadb:10.4.14-bionic-20200812025059
            imagePullPolicy: IfNotPresent
            name: phy-backup
            resources: {}
            securityContext:
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /tmp
              name: pod-tmp
            - mountPath: /tmp/mariadb_resque.py
              name: mariadb-bin
              readOnly: true
              subPath: mariadb_resque.py
            - mountPath: /tmp/resque_runner.yaml.j2
              name: mariadb-bin
              readOnly: true
              subPath: resque_runner.yaml.j2
            - mountPath: /etc/mysql/admin_user.cnf
              name: mariadb-secrets
              readOnly: true
              subPath: admin_user.cnf
          dnsPolicy: ClusterFirst
          initContainers:
          - command:
            - kubernetes-entrypoint
            env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: INTERFACE_NAME
              value: eth0
            - name: PATH
              value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/
            - name: DEPENDENCY_SERVICE
            - name: DEPENDENCY_DAEMONSET
            - name: DEPENDENCY_CONTAINER
            - name: DEPENDENCY_POD_JSON
            - name: DEPENDENCY_CUSTOM_RESOURCE
            image: docker-dev-kaas-local.docker.mirantis.net/openstack/extra/kubernetes-entrypoint:v1.0.0-20200311160233
            imagePullPolicy: IfNotPresent
            name: init
            resources: {}
            securityContext:
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
              runAsUser: 65534
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          nodeSelector:
            openstack-control-plane: enabled
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext:
            runAsUser: 999
          serviceAccount: mariadb-phy-backup
          serviceAccountName: mariadb-phy-backup
          terminationGracePeriodSeconds: 30
          volumes:
          - emptyDir: {}
            name: pod-tmp
          - name: mariadb-secrets
            secret:
              defaultMode: 292
              secretName: mariadb-secrets
          - configMap:
              defaultMode: 365
              name: mariadb-bin
            name: mariadb-bin
  schedule: 0 1 * * *
  successfulJobsHistoryLimit: 3
  suspend: false
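
To check the outcome of recent backup runs spawned by this CronJob, you can list its Jobs using the application label shown in the job template above. A minimal sketch:

kubectl -n openstack get jobs -l application=mariadb-phy-backup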
Overriding the default configuration

To override the default configuration, set the parameters and environment variables that are passed to the CronJob as described in the tables below.

MariaDB backup: Configuration parameters

Parameter

Type

Default

Description

--backup-type

String

incremental

Type of a backup. The list of possible values includes:

  • incremental

    If the newest full backup is older than the value of the full_backup_cycle parameter, the system performs a full backup. Otherwise, the system performs an incremental backup based on the newest full backup.

  • full

    Always performs only a full backup.

Usage example:

spec:
  features:
    database:
      backup:
        backup_type: incremental

--backup-timeout

Integer

21600

Timeout in seconds for the system to wait for the backup operation to succeed.

Usage example:

spec:
  services:
    database:
      mariadb:
        values:
          conf:
            phy_backup:
              backup_timeout: 30000

--allow-unsafe-backup

Boolean

false

Not recommended, removed since MOSK 22.4.

If set to true, enables the MariaDB cluster backup in a not fully operational cluster, where:

  • The current number of ready pods is not equal to MARIADB_REPLICAS.

  • Some replicas do not have healthy wsrep statuses.

Usage example:

spec:
  services:
    database:
      mariadb:
        values:
          conf:
            phy_backup:
              allow_unsafe_backup: true
MariaDB backup: Environment variables

Variable

Type

Default

Description

MARIADB_BACKUPS_TO_KEEP

Integer

10

Number of full backups to keep.

Usage example:

spec:
  features:
    database:
      backup:
        backups_to_keep: 3

MARIADB_BACKUP_PVC_NAME

String

mariadb-phy-backup-data

Persistent volume claim used to store backups.

Usage example:

spec:
  services:
    database:
      mariadb:
        values:
          conf:
            phy_backup:
              backup_pvc_name: mariadb-phy-backup-data

MARIADB_FULL_BACKUP_CYCLE

Integer

604800

Number of seconds that defines a period between 2 full backups. During this period, incremental backups are performed. The parameter is taken into account only if backup_type is set to incremental. Otherwise, it is ignored. For example, with full_backup_cycle set to 604800 seconds, a full backup is taken weekly and, if cron is set to 0 0 * * *, an incremental backup is performed on a daily basis.

Usage example:

spec:
  features:
    database:
      backup:
        full_backup_cycle: 70000

MARIADB_BACKUP_REQUIRED_SPACE_RATIO

Floating

1.2

Multiplier for the database size to predict the space required to create a backup, either full or incremental, and perform a restoration keeping the uncompressed backup files on the same file system as the compressed ones.

To estimate the size of MARIADB_BACKUP_REQUIRED_SPACE_RATIO, use the following formula: size of (1 uncompressed full backup + all related incremental uncompressed backups + 1 full compressed backup) in KB <= (DB_SIZE * MARIADB_BACKUP_REQUIRED_SPACE_RATIO) in KB. See the worked example after this table.

The DB_SIZE is the disk space allocated in the MySQL data directory, which is /var/lib/mysql, for the databases data excluding the galera.cache and ib_logfile* files. This parameter prevents the backup PVC from filling up in the middle of the restoration and backup procedures. If the currently available space is lower than DB_SIZE * MARIADB_BACKUP_REQUIRED_SPACE_RATIO, the backup script fails before the system starts the actual backup and the overall status of the backup job is failed.

Usage example:

spec:
  services:
    database:
      mariadb:
        values:
          conf:
            phy_backup:
              backup_required_space_ratio: 1.4
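
To illustrate the space requirement with the default ratio: if the databases occupy 50 GB in /var/lib/mysql, excluding the galera.cache and ib_logfile* files, and MARIADB_BACKUP_REQUIRED_SPACE_RATIO is left at 1.2, the backup script requires at least 50 GB * 1.2 = 60 GB of available space in /var/backup before it starts.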

For example, to perform full backups monthly and incremental backups daily at 02:30 AM and keep the backups for the last six months, configure the database backup in your OpenStackDeployment object as follows:

spec:
  features:
    database:
      backup:
        enabled: true
        backups_to_keep: 6
        schedule_time: '30 2 * * *'
        full_backup_cycle: 2628000
Remote storage for OpenStack database backups

By default, MOSK stores the OpenStack database backups locally in the Mirantis Ceph cluster, which is a part of the same cloud.

Alternatively, MOSK provides you with a capability to create remote backups using an external storage. This section contains configuration details for a remote back end to be used for the OpenStack data backup.

In general, the built-in automated backup routine saves the data to the mariadb-phy-backup-data PersistentVolumeClaim (PVC), which is provisioned from the StorageClass specified in the spec.persistent_volume_storage_class parameter of the OpenStackDeployment custom resource (CR).
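
To verify that the backup volume exists and to check its capacity and storage class, you can inspect the PVC directly. A minimal sketch that assumes the default PVC name:

kubectl -n openstack get pvc mariadb-phy-backup-data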

Remote NFS storage for OpenStack database backups

TechPreview

Requirements
  • A preconfigured NFS server with an NFS share that the Unix backup and restore user has access to. By default, this is the same user that runs the MySQL server in the MariaDB image.

    To get the Unix user ID, run:

    kubectl -n openstack get cronjob mariadb-phy-backup -o jsonpath='{.spec.jobTemplate.spec.template.spec.securityContext.runAsUser}'
    

    Note

    Verify that the NFS server is accessible through the network from all of the OpenStack control plane nodes of the cluster.

  • The nfs-common package installed on all OpenStack control plane nodes.
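
    For example, on Ubuntu-based control plane nodes, you could install the package as follows; use the package manager appropriate for your host operating system:

    sudo apt-get update && sudo apt-get install -y nfs-common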

Limitations
  • Only NFS Unix authentication is supported.

  • Removal of the NFS persistent volume does not automatically remove the data.

  • No validation of mount options. If mount options are specified incorrectly in the OpenStackDeployment CR, the mount command fails upon the creation of a backup runner pod.

Enabling the NFS back end

To enable the NFS back end, configure the following structure in the OpenStackDeployment object:

spec:
  features:
    database:
      backup:
        enabled: true
        backend: pv_nfs
        pv_nfs:
          server: <ip-address/dns-name-of-the-server>
          path: <path-to-the-share-folder-on-the-server>

Optionally, MOSK enables you to set the required mount options for the NFS mount command. You can set as many mount options as you need. For example:

spec:
  services:
    database:
      mariadb:
        values:
          volume:
            phy_backup:
              nfs:
                mountOptions:
                  - "nfsvers=4"
                  - "hard"
OpenStack message bus

The internal components of Mirantis OpenStack for Kubernetes (MOSK) coordinate their operations and exchange status information using the cluster’s message bus (RabbitMQ).

Exposable OpenStack notifications

Available since MOSK 22.5

MOSK enables you to configure OpenStack services to emit notification messages to the MOSK cluster messaging bus (RabbitMQ) every time an OpenStack resource, for example, an instance, image, and so on, changes its state due to a cloud user action or through its lifecycle. For example, MOSK Compute service (OpenStack Nova) can publish the instance.create.end notification once a newly created instance is up and running.

Sample of an instance.create.end notification
{
    "event_type": "instance.create.end",
    "payload": {
        "nova_object.data": {
            "action_initiator_project": "6f70656e737461636b20342065766572",
            "action_initiator_user": "fake",
            "architecture": "x86_64",
            "auto_disk_config": "MANUAL",
            "availability_zone": "nova",
            "block_devices": [],
            "created_at": "2012-10-29T13:42:11Z",
            "deleted_at": null,
            "display_description": "some-server",
            "display_name": "some-server",
            "fault": null,
            "flavor": {
             "nova_object.data": {
              "description": null,
              "disabled": false,
              "ephemeral_gb": 0,
              "extra_specs": {
                  "hw:watchdog_action": "disabled"
              },
              "flavorid": "a22d5517-147c-4147-a0d1-e698df5cd4e3",
              "is_public": true,
              "memory_mb": 512,
              "name": "test_flavor",
              "projects": null,
              "root_gb": 1,
              "rxtx_factor": 1.0,
              "swap": 0,
              "vcpu_weight": 0,
              "vcpus": 1
             },
             "nova_object.name": "FlavorPayload",
             "nova_object.namespace": "nova",
             "nova_object.version": "1.4"
            },
            "host": "compute",
            "host_name": "some-server",
            "image_uuid": "155d900f-4e14-4e4c-a73d-069cbf4541e6",
            "instance_name": "instance-00000001",
            "ip_addresses": [
             {
              "nova_object.data": {
                  "address": "192.168.1.3",
                  "device_name": "tapce531f90-19",
                  "label": "private",
                  "mac": "fa:16:3e:4c:2c:30",
                  "meta": {},
                  "port_uuid": "ce531f90-199f-48c0-816c-13e38010b442",
                  "version": 4
              },
              "nova_object.name": "IpPayload",
              "nova_object.namespace": "nova",
              "nova_object.version": "1.0"
             }
            ],
            "kernel_id": "",
            "key_name": "my-key",
            "keypairs": [
             {
              "nova_object.data": {
                  "fingerprint": "1e:2c:9b:56:79:4b:45:77:f9:ca:7a:98:2c:b0:d5:3c",
                  "name": "my-key",
                  "public_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgQDx8nkQv/zgGgB4rMYmIf+6A4l6Rr+o/6lHBQdW5aYd44bd8JttDCE/F/pNRr0lRE+PiqSPO8nDPHw0010JeMH9gYgnnFlyY3/OcJ02RhIPyyxYpv9FhY+2YiUkpwFOcLImyrxEsYXpD/0d3ac30bNH6Sw9JD9UZHYcpSxsIbECHw== Generated-by-Nova",
                  "type": "ssh",
                  "user_id": "fake"
              },
              "nova_object.name": "KeypairPayload",
              "nova_object.namespace": "nova",
              "nova_object.version": "1.0"
             }
            ],
            "launched_at": "2012-10-29T13:42:11Z",
            "locked": false,
            "locked_reason": null,
            "metadata": {},
            "node": "fake-mini",
            "os_type": null,
            "power_state": "running",
            "progress": 0,
            "ramdisk_id": "",
            "request_id": "req-5b6c791d-5709-4f36-8fbe-c3e02869e35d",
            "reservation_id": "r-npxv0e40",
            "state": "active",
            "tags": [
             "tag"
            ],
            "task_state": null,
            "tenant_id": "6f70656e737461636b20342065766572",
            "terminated_at": null,
            "trusted_image_certificates": [
             "cert-id-1",
             "cert-id-2"
            ],
            "updated_at": "2012-10-29T13:42:11Z",
            "user_id": "fake",
            "uuid": "178b0921-8f85-4257-88b6-2e743b5a975c"
        },
        "nova_object.name": "InstanceCreatePayload",
        "nova_object.namespace": "nova",
        "nova_object.version": "1.12"
    },
    "priority": "INFO",
    "publisher_id": "nova-compute:compute"
}

OpenStack notification messages can be consumed and processed by various corporate systems to integrate MOSK clouds into the company infrastructure and business processes.

The list of the most common use cases includes:

  • Using notification history for retrospective security audit

  • Using the real-time aggregation of notification messages to gather statistics on cloud resource consumption for further capacity planning

Cloud billing considerations

Notifications alone should not be considered a source of data for any kind of financial reporting. The delivery of the messages cannot be guaranteed due to various technical reasons. For example, messages can be lost if an external consumer does not fetch them from the queue fast enough.

Mirantis strongly recommends that your cloud billing solutions rely on the combination of the following data sources:

  • Periodic polling of the OpenStack API as a reliable source of information about allocated resources

  • Subscription to notifications to receive timely updates about the resource status change

If you are looking for a ready-to-use billing solution for your cloud, contact Mirantis or one of our partners.

A cloud administrator can securely expose part of a MOSK cluster message bus to the outside world. This enables an external consumer to subscribe to the notification messages emitted by the cluster services.

Important

The latest OpenStack release available in MOSK supports notifications from the following services:

  • Block storage (OpenStack Cinder)

  • DNS (OpenStack Designate)

  • Image (OpenStack Glance)

  • Orchestration (OpenStack Heat)

  • Bare Metal (OpenStack Ironic)

  • Identity (OpenStack Keystone)

  • Shared Filesystems (OpenStack Manila)

  • Instance High Availability (OpenStack Masakari)

  • Networking (OpenStack Neutron)

  • Compute (OpenStack Nova)

To enable the external notification endpoint, add the following structure to the OpenStackDeployment custom resource. For example:

spec:
  features:
    messaging:
      notifications:
        external:
          enabled: true
          topics:
            - external-consumer-A
            - external-consumer-2

For each topic name specified in the topics field, MOSK creates a topic exchange in its RabbitMQ cluster together with a set of queues bound to this topic. All enabled MOSK services will publish their notification messages to all configured topics so that multiple consumers can receive the same messages in parallel.

A topic name must follow the Kubernetes standard format for object names and IDs, that is, contain only lowercase alphanumeric characters, hyphens (-), and periods (.). The topic name notifications is reserved for internal use.

MOSK supports connecting to the message bus (RabbitMQ) using either a plain-text user name and password or an encrypted X.509 certificate.

Each topic exchange is protected by automatically generated authentication credentials and certificates for secure connection that are stored as a secret in the openstack-external namespace of a MOSK underlying Kubernetes cluster. A secret is identified by the name of the topic. The list of attributes for the secret object includes:

  • hosts

    The IP addresses on which the external notification endpoint is available

  • port_amqp, port_amqp-tls

    The TCP ports on which the external notification endpoint is available

  • vhost

    The name of the RabbitMQ virtual host which the topic queues are created on

  • username, password

    Authentication data

  • ca_cert

    The client CA certificate

  • client_cert

    The client certificate

  • client_key

    The client private key

For the configuration example above, the following objects will be created:

kubectl -n openstack-external get secret

NAME                                            TYPE           DATA   AGE
openstack-external-consumer-A-notifications     Opaque         4      4m51s
openstack-external-consumer-2-notifications     Opaque         4      4m51s
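
To retrieve the connection parameters for a particular topic, you can read the corresponding secret and decode its attributes. A minimal sketch that assumes the attributes listed above are stored as data keys of the secret:

kubectl -n openstack-external get secret openstack-external-consumer-A-notifications -o jsonpath='{.data.username}' | base64 -d
kubectl -n openstack-external get secret openstack-external-consumer-A-notifications -o jsonpath='{.data.hosts}' | base64 -d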

Tungsten Fabric

Tungsten Fabric provides basic L2/L3 networking to an OpenStack environment running on the MKE cluster and includes the IP address management, security groups, floating IP addresses, and routing policies functionality. Tungsten Fabric is based on overlay networking, where all virtual machines are connected to a virtual network with encapsulation (MPLSoGRE, MPLSoUDP, VXLAN). This enables the tenant traffic to be separated from the underlay Kubernetes management network. A workload requires an external gateway, such as a hardware EdgeRouter or a simple gateway, to route the outgoing traffic.

The Tungsten Fabric vRouter uses different gateways for the control and data planes.

Tungsten Fabric cluster

All services of Tungsten Fabric are delivered as separate containers, which are deployed by the Tungsten Fabric Operator (TFO). Each container has an INI-based configuration file that is available on the host system. The configuration file is generated automatically upon the container start and is based on environment variables provided by the TFO through Kubernetes ConfigMaps.

The main Tungsten Fabric containers run as DaemonSets with the host network, without using the Kubernetes networking layer. The services listen directly on the host network interface.

The following diagram describes the minimum production installation of Tungsten Fabric with a Mirantis OpenStack for Kubernetes (MOSK) deployment.

_images/tf-architecture.png

For the details about the Tungsten Fabric services included in MOSK deployments and the types of traffic and traffic flow directions, see the subsections below.

Tungsten Fabric cluster components

This section describes the Tungsten Fabric services and their distribution across the Mirantis OpenStack for Kubernetes (MOSK) deployment.

The Tungsten Fabric services run mostly as DaemonSets in separate containers for each service. The deployment and update processes are managed by the Tungsten Fabric Operator. However, Kubernetes manages the probe checks and restart of broken containers.

Configuration and control services

All configuration and control services run on the Tungsten Fabric Controller nodes.

Service name

Service description

config-api

Exposes a REST-based interface for the Tungsten Fabric API.

config-nodemgr

Collects data of the Tungsten Fabric configuration processes and sends it to the Tungsten Fabric collector.

control

Communicates with the cluster gateways using BGP and with the vRouter agents using XMPP, as well as redistributes appropriate networking information.

control-nodemgr

Collects the Tungsten Fabric Controller process data and sends this information to the Tungsten Fabric collector.

device-manager

Manages physical networking devices using netconf or ovsdb. In multi-node deployments, it operates in the active-backup mode.

dns

Using the named service, provides the DNS service to the VMs spawned on different compute nodes. Each vRouter node connects to two Tungsten Fabric Controller containers that run the dns process.

named

The customized Berkeley Internet Name Domain (BIND) daemon of Tungsten Fabric that manages DNS zones for the dns service.

schema

Listens to configuration changes performed by a user and generates corresponding system configuration objects. In multi-node deployments, it works in the active-backup mode.

svc-monitor

Listens to configuration changes of service-template and service-instance, as well as spawns and monitors virtual machines for the firewall, analyzer services, and so on. In multi-node deployments, it works in the active-backup mode.

webui

Consists of the webserver and jobserver services. Provides the Tungsten Fabric web UI.

Analytics services

All analytics services run on Tungsten Fabric analytics nodes.

Service name

Service description

alarm-gen

Evaluates and manages the alarms rules.

analytics-api

Provides a REST API to interact with the Cassandra analytics database.

analytics-nodemgr

Collects all Tungsten Fabric analytics process data and sends this information to the Tungsten Fabric collector.

analytics-database-nodemgr

Provisions the init model if needed. Collects data of the database process and sends it to the Tungsten Fabric collector.

collector

Collects and analyzes data from all Tungsten Fabric services.

query-engine

Handles the queries to access data from the Cassandra database.

snmp-collector

Receives the authorization and configuration of the physical routers from the config-nodemgr service, polls the physical routers using the Simple Network Management Protocol (SNMP), and uploads the data to the Tungsten Fabric collector.

topology

Reads the SNMP information from the physical router user-visible entities (UVEs), creates a neighbor list, and writes the neighbor information to the physical router UVEs. The Tungsten Fabric web UI uses the neighbor list to display the physical topology.

vRouter

The Tungsten Fabric vRouter provides data forwarding to an OpenStack tenant instance and reports statistics to the Tungsten Fabric analytics service. The Tungsten Fabric vRouter is installed on all OpenStack compute nodes. Mirantis OpenStack for Kubernetes (MOSK) supports the kernel-based deployment of the Tungsten Fabric vRouter.

vRouter services on the OpenStack compute nodes

Service name

Service description

vrouter-agent

Connects to the Tungsten Fabric Controller container and the Tungsten Fabric DNS system using the Extensible Messaging and Presence Protocol (XMPP). The vRouter Agent acts as a local control plane. Each Tungsten Fabric vRouter Agent is connected to at least two Tungsten Fabric controllers in an active-active redundancy mode.

The Tungsten Fabric vRouter Agent is responsible for all networking-related functions including routing instances, routes, and others.

The Tungsten Fabric vRouter uses different gateways for the control and data planes. For example, the Linux system gateway is located on the management network, and the Tungsten Fabric gateway is located on the data plane network.

vrouter-nodemgr

Collects the supervisor vrouter data and sends it to the Tungsten Fabric collector.

The following diagram illustrates the Tungsten Fabric kernel vRouter set up by the TF operator:

_images/tf_vrouter.png

On the diagram above, the following types of network interfaces are used:

  • eth0 - for the management (PXE) network (eth1 and eth2 are the slave interfaces of Bond0)

  • Bond0.x - for the MKE control plane network

  • Bond0.y - for the MKE data plane network

Third-party services

Service name

Service description

cassandra

  • On the Tungsten Fabric control plane nodes, maintains the configuration data of the Tungsten Fabric cluster.

  • On the Tungsten Fabric analytics nodes, stores the collector service data.

cassandra-operator

The Kubernetes operator that enables the Cassandra clusters creation and management.

kafka

Handles the messaging bus and generates alarms across the Tungsten Fabric analytics containers.

kafka-operator

The Kubernetes operator that enables Kafka clusters creation and management.

redis

Stores the physical router UVE storage and serves as a messaging bus for event notifications.

redis-operator

The Kubernetes operator that enables Redis clusters creation and management.

zookeeper

Holds the active-backup status for the device-manager, svc-monitor, and schema-transformer services. This service is also used for mapping the Tungsten Fabric resource names to UUIDs.

zookeeper-operator

The Kubernetes operator that enables ZooKeeper clusters creation and management.

rabbitmq

Exchanges messages between API servers and original request senders.

rabbitmq-operator

The Kubernetes operator that enables RabbitMQ clusters creation and management.

Plugin services

All Tungsten Fabric plugin services are installed on the OpenStack controller nodes.

Service name

Service description

neutron-server

The Neutron server that includes the Tungsten Fabric plugin.

octavia-api

The Octavia API that includes the Tungsten Fabric Octavia driver.

heat-api

The Heat API that includes the Tungsten Fabric Heat resources and templates.

Image precaching DaemonSets

Along with the Tungsten Fabric services, MOSK deploys and updates special image precaching DaemonSets when a resource of kind TFOperator is created or the image references in it get updated. These DaemonSets precache container images on Kubernetes nodes, minimizing possible downtime when updating container images. The cloud operator can disable image precaching through the TFOperator resource.

Tungsten Fabric traffic flow

This section describes the types of traffic and traffic flow directions in a Mirantis OpenStack for Kubernetes (MOSK) cluster.

User interface and API traffic

The following diagram illustrates all types of UI and API traffic in a Mirantis OpenStack for Kubernetes cluster, including the monitoring and OpenStack API traffic. The OpenStack Dashboard pod hosts Horizon and acts as a proxy for all other types of traffic. TLS termination is also performed for this type of traffic.

_images/tf-traffic_flow_ui_api.png
SDN traffic

SDN or Tungsten Fabric traffic goes through the overlay Data network and processes east-west and north-south traffic for applications that run in a MOSK cluster. This network segment typically contains tenant networks as separate MPLS-over-GRE and MPLS-over-UDP tunnels. The traffic load depends on the workload.

The control traffic between the Tungsten Fabric controllers, edge routers, and vRouters uses the XMPP with TLS and iBGP protocols. Both protocols produce low traffic that does not affect MPLS over GRE and MPLS over UDP traffic. However, this traffic is critical and must be reliably delivered. Mirantis recommends configuring higher QoS for this type of traffic.

The following diagram displays both MPLS over GRE/MPLS over UDP and iBGP and XMPP traffic examples in a MOSK cluster:

_images/tf-traffic_flow_sdn.png
Tungsten Fabric lifecycle management

Mirantis OpenStack for Kubernetes (MOSK) provides the Tungsten Fabric lifecycle management including pre-deployment custom configurations, updates, data backup and restoration, as well as handling partial failure scenarios, by means of the Tungsten Fabric operator.

This section is intended for the cloud operators who want to gain insight into the capabilities provided by the Tungsten Fabric operator along with the understanding of how its architecture allows for easy management while addressing the concerns of users of Tungsten Fabric-based MOSK clusters.

Tungsten Fabric Operator

The Tungsten Fabric Operator (TFO) is based on the Kubernetes operator SDK project. The Kubernetes operator SDK is a framework that uses the controller-runtime library to make writing operators easier by providing the following:

  • High-level APIs and abstractions to write the operational logic more intuitively.

  • Tools for scaffolding and code generation to bootstrap a new project fast.

  • Extensions to cover common operator use cases.

The TFO deploys the following sub-operators. Each sub-operator handles a separate part of a TF deployment:

TFO sub-operators

Sub-operator

Description

TFControl

Deploys the Tungsten Fabric control services, such as:

  • Control

  • DNS

  • Control NodeManager

TFConfig

Deploys the Tungsten Fabric configuration services, such as:

  • API

  • Service monitor

  • Schema transformer

  • Device manager

  • Configuration NodeManager

  • Database NodeManager

TFAnalytics

Deploys the Tungsten Fabric analytics services, such as:

  • API

  • Collector

  • Alarm

  • Alarm-gen

  • SNMP

  • Topology

  • Alarm NodeManager

  • Database NodeManager

  • SNMP NodeManager

TFVrouter

Deploys a vRouter on each compute node with the following services:

  • vRouter Agent

  • NodeManager

TFWebUI

Deploys the following web UI services:

  • Web server

  • Job server

TFTool

Deploys the following tools for debug purposes:

  • TF-CLI

  • CTools

TFTest

An operator to run Tempest tests.

Besides the sub-operators that deploy TF services, TFO uses operators to deploy and maintain third-party services, such as different types of storage, cache, message system, and so on. The following table describes all third-party operators:

TFO third-party sub-operators

Operator

Description

cassandra-operator

An upstream operator that automates the Cassandra HA storage operations for the configuration and analytics data.

zookeeper-operator

An upstream operator for deployment and automation of a ZooKeeper cluster.

kafka-operator

An operator for the Kafka cluster used by analytics services.

redis-operator

An upstream operator that automates the Redis cluster deployment and keeps it healthy.

rabbitmq-operator

An operator for the messaging system based on RabbitMQ.

The following diagram illustrates a simplified TFO workflow:

_images/tf-operator-workflow.png
TFOperator custom resource

The resource of kind TFOperator (TFO) is a custom resource (CR) defined by a resource of kind CustomResourceDefinition.

The CustomResourceDefinition resource in Kubernetes uses the OpenAPI Specification (OAS) version 2 to specify the schema of the defined resource. The Kubernetes API outright rejects the resources that do not pass this schema validation. Along with schema validation, TFOperator uses ValidatingAdmissionWebhook for extended validations when a CR is created or updated.

For the list of configuration options available to a cloud operator, refer to Tungsten Fabric configuration. Also, check out the Tungsten Fabric API Reference document of the MOSK version that your cluster has been deployed with.

TFOperator custom resource validation

Tungsten Fabric Operator uses ValidatingAdmissionWebhook to validate environment variables set to Tungsten Fabric components upon the TFOperator object creation or update. The following validations are performed:

  • Environment variables passed to TF components containers

  • Mapping between tfVersion and tfImageTag, if defined

  • Schedule and data capacity format for tf-dbBackup

If required, you can disable ValidatingAdmissionWebhook through the TFOperator HelmBundle resource:

apiVersion: lcm.mirantis.com/v1alpha1
kind: HelmBundle
metadata:
  name: tungstenfabric-operator
  namespace: tf
spec:
  releases:
  - name: tungstenfabric-operator
    values:
      admission:
        enabled: false
Allowed environment variables for TF components

Environment variables

TF components and containers

  • INTROSPECT_LISTEN_ALL

  • LOG_DIR

  • LOG_LEVEL

  • LOG_LOCAL

  • tf-analytics (alarm-gen, api, collector, alarm-nodemgr, db-nodemgr, nodemgr, snmp-nodemgr, query-engine, snmp, topology)

  • tf-config (api, db-nodemgr, nodemgr)

  • tf-control (control, dns, nodemgr)

  • tf-vrouter (agent, dpdk-nodemgr, nodemgr)

  • LOG_DIR

  • LOG_LEVEL

  • LOG_LOCAL

tf-config (config, devicemgr, schema, svc-monitor)

  • PROVISION_DELAY

  • PROVISION_RETRIES

  • BGP_ASN

  • ENCAP_PRIORITY

  • VXLAN_VN_ID_MODE

  • tf-analytics (alarm-provisioner, db-provisioner, provisioner, snmp-provisioner)

  • tf-config (db-provisioner, provisioner)

  • tf-control (provisioner)

  • tf-vrouter (dpdk-provisioner, provisioner)

  • CONFIG_API_LIST_OPTIMIZATION_ENABLED

  • CONFIG_API_WORKER_COUNT

  • CONFIG_API_MAX_REQUESTS

  • FWAAS_ENABLE

  • RABBITMQ_HEARTBEAT_INTERVAL

  • DISABLE_VNC_API_STATS

tf-config (config)

  • DNS_NAMED_MAX_CACHE_SIZE

  • DNS_NAMED_MAX_RETRANSMISSIONS

  • DNS_RETRANSMISSION_INTERVAL

tf-control (dns)

  • WEBUI_LOG_LEVEL

  • WEBUI_STATIC_AUTH_PASSWORD

  • WEBUI_STATIC_AUTH_ROLE

  • WEBUI_STATIC_AUTH_USER

tf-webui (job, web)

  • ANALYTICS_CONFIG_AUDIT_TTL

  • ANALYTICS_DATA_TTL

  • ANALYTICS_FLOW_TTL

  • ANALYTICS_STATISTICS_TTL

  • COLLECTOR_disk_usage_percentage_high_watermark0

  • COLLECTOR_disk_usage_percentage_high_watermark1

  • COLLECTOR_disk_usage_percentage_high_watermark2

  • COLLECTOR_disk_usage_percentage_low_watermark0

  • COLLECTOR_disk_usage_percentage_low_watermark1

  • COLLECTOR_disk_usage_percentage_low_watermark2

  • COLLECTOR_high_watermark0_message_severity_level

  • COLLECTOR_high_watermark1_message_severity_level

  • COLLECTOR_high_watermark2_message_severity_level

  • COLLECTOR_low_watermark0_message_severity_level

  • COLLECTOR_low_watermark1_message_severity_level

  • COLLECTOR_low_watermark2_message_severity_level

  • COLLECTOR_pending_compaction_tasks_high_watermark0

  • COLLECTOR_pending_compaction_tasks_high_watermark1

  • COLLECTOR_pending_compaction_tasks_high_watermark2

  • COLLECTOR_pending_compaction_tasks_low_watermark0

  • COLLECTOR_pending_compaction_tasks_low_watermark1

  • COLLECTOR_pending_compaction_tasks_low_watermark2

  • COLLECTOR_LOG_FILE_COUNT

  • COLLECTOR_LOG_FILE_SIZE

tf-analytics (collector)

  • ANALYTICS_DATA_TTL

  • QUERYENGINE_MAX_SLICE

  • QUERYENGINE_MAX_TASKS

  • QUERYENGINE_START_TIME

tf-analytics (query-engine)

  • SNMPCOLLECTOR_FAST_SCAN_FREQUENCY

  • SNMPCOLLECTOR_SCAN_FREQUENCY

tf-analytics (snmp)

TOPOLOGY_SCAN_FREQUENCY

tf-analytics (topology)

  • DPDK_UIO_DRIVER

  • PHYSICAL_INTERFACE

  • SRIOV_PHYSICAL_INTERFACE

  • SRIOV_PHYSICAL_NETWORK

  • SRIOV_VF

  • TSN_AGENT_MODE

  • TSN_NODES

  • AGENT_MODE

  • FABRIC_SNAT_HASH_TABLE_SIZE

  • PRIORITY_BANDWIDTH

  • PRIORITY_ID

  • PRIORITY_SCHEDULING

  • PRIORITY_TAGGING

  • QOS_DEF_HW_QUEUE

  • QOS_LOGICAL_QUEUES

  • QOS_QUEUE_ID

  • VROUTER_GATEWAY

  • HUGE_PAGES_2MB

  • HUGE_PAGES_1GB

  • DISABLE_TX_OFFLOAD

  • DISABLE_STATS_COLLECTION

tf-vrouter (agent)

  • CPU_CORE_MASK

  • SERVICE_CORE_MASK

  • DPDK_CTRL_THREAD_MASK

  • DPDK_COMMAND_ADDITIONAL_ARGS

  • DPDK_MEM_PER_SOCKET

  • DPDK_UIO_DRIVER

  • HUGE_PAGES

  • HUGE_PAGES_DIR

  • NIC_OFFLOAD_ENABLE

  • DPDK_ENABLE_VLAN_FWRD

tf-vrouter (agent-dpdk)

See also

API Reference

Tungsten Fabric configuration

Mirantis OpenStack for Kubernetes (MOSK) allows you to easily adapt your Tungsten Fabric deployment to the needs of your environment through the TFOperator custom resource.

This section includes custom configuration details available to you.

Cassandra configuration

This section describes the Cassandra configuration through the Tungsten Fabric Operator custom resource.

Cassandra resource limits configuration

By default, Tungsten Fabric Operator sets up the following resource limits for Cassandra analytics and configuration StatefulSets:

Limits:
  cpu:     8
  memory:  32Gi
Requests:
  cpu:     1
  memory:  16Gi

This is a verified configuration suitable for most cases. However, if nodes are under a heavy load, the KubeContainerCPUThrottlingHigh StackLight alert may be raised for the Tungsten Fabric Pods of the tf-cassandra-analytics and tf-cassandra-config StatefulSets. If such alerts appear constantly, you can increase the limits through the TFOperator CR. For example:

spec:
  controllers:
    cassandra:
      deployments:
      - name: tf-cassandra-config
        resources:
          limits:
            cpu: "12"
            memory: 32Gi
          requests:
            cpu: "2"
            memory: 16Gi
      - name: tf-cassandra-analytics
        resources:
          limits:
            cpu: "12"
            memory: 32Gi
          requests:
            cpu: "2"
            memory: 16Gi
Custom configuration

To specify custom configurations for Cassandra clusters, use the configOptions settings in the TFOperator CR. For example, you may need to increase the file cache size in case of a heavy load on the nodes labeled with tfanalyticsdb=enabled or tfconfigdb=enabled:

spec:
  controllers:
    cassandra:
      deployments:
      - name: tf-cassandra-analytics
        configOptions:
          file_cache_size_in_mb: 1024
Custom vRouter settings

TechPreview

To specify custom settings for the Tungsten Fabric (TF) vRouter nodes, for example, to change the name of the tunnel network interface or enable debug level logging on some subset of nodes, use the customSpecs settings in the TFOperator CR.

For example, to enable debug level logging on a specific node or multiple nodes:

spec:
  controllers:
    tf-vrouter:
      agent:
        customSpecs:
        - name: <CUSTOMSPEC-NAME>
          label:
            name: <NODE-LABEL>
            value: <NODE-LABEL-VALUE>
          containers:
          - name: agent
            env:
            - name: LOG_LEVEL
              value: SYS_DEBUG

Caution

The customSpecs:name value must follow the RFC 1123 format. Verify that the resulting name of the DaemonSet object is a valid DNS subdomain name.

The customSpecs parameter inherits all settings for the tf-vrouter containers that are set on the spec:controllers:agent level and overrides or adds additional parameters. The example configuration above overrides the logging level from SYS_INFO, which is the default logging level, to SYS_DEBUG.

For clusters with a multi-rack architecture, you may need to redefine the gateway IP for the Tungsten Fabric vRouter nodes using the VROUTER_GATEWAY parameter. For details, see Multi-rack architecture.

Control plane traffic interface

By default, the TF control service uses the management interface for the BGP and XMPP traffic. You can change the control service interface using the controlInterface parameter in the TFOperator CR, for example, to combine the BGP and XMPP traffic with the data (tenant) traffic:

spec:
  settings:
    controlInterface: <tunnel-interface>
Traffic encapsulation

Tungsten Fabric implements cloud tenants’ virtual networks as Layer 3 overlays. Tenant traffic gets encapsulated into one of the supported protocols and is carried over the infrastructure network between two compute nodes or between a compute node and an edge router device.

In addition, Tungsten Fabric is capable of exchanging encapsulated traffic with external systems to build advanced virtual networking topologies, for example, BGP VPN connectivity between two MOSK clouds or between a MOSK cloud and a cloud tenant’s premises.

MOSK supports the following encapsulation protocols:

  • MPLS over Generic Routing Encapsulation (GRE)

    A traditional encapsulation method supported by several router vendors, including Cisco and Juniper. The feature is applicable when other encapsulation methods are not available, for example, when an SDN gateway runs software that does not support MPLS over UDP.

  • MPLS over User Datagram Protocol (UDP)

    A variation of the MPLS over GRE mechanism and the default, most frequently used option in MOSK. MPLS over UDP replaces the GRE header with a UDP header whose source port carries a hash of the inner packet (entropy). This provides a significant benefit for equal-cost multi-path (ECMP) load balancing. Both MPLS over UDP and MPLS over GRE transfer Layer 3 traffic only.

  • Virtual Extensible LAN (VXLAN) TechPreview

    The combination of VXLAN and EVPN technologies is often used for creating advanced cloud networking topologies. For example, it can provide transparent Layer 2 interconnections between Virtual Network Functions running on top of the cloud and physical traffic generator appliances hosted somewhere else.

Encapsulation priority

The ENCAP_PRIORITY parameter defines the order in which the encapsulation protocols are attempted when establishing BGP VPN connectivity between the cloud and external systems.

By default, the encapsulation order is set to MPLSoUDP,MPLSoGRE,VXLAN. The cloud operator can change it in the TFOperator custom resource depending on their needs, as illustrated in Configuring encapsulation.

The list of supported encapsulation methods, along with their order, is shared between BGP peers as part of the capabilities exchange when establishing a BGP session. Both parties must support the same encapsulation methods to build a tunnel for the network traffic.

For example, if the cloud operator wants to set up a Layer 2 VPN between the cloud and their network infrastructure, they configure the cloud’s virtual networks with VXLAN identifiers (VNIs) and do the same on the other side, for example, on a network switch. Also, VXLAN must be set to the first position in the encapsulation priority order. Otherwise, VXLAN tunnels will not get established between the endpoints, even though both endpoints may support the VXLAN protocol.

However, setting VXLAN first in the encapsulation priority order will not enforce VXLAN encapsulation between compute nodes or between compute nodes and gateway routers that use Layer 3 VPNs for communication.

Configuring encapsulation

The TFOperator custom resource allows you to define encapsulation settings for your Tungsten Fabric cluster.

Important

The TFOperator CR must be the only place to configure the cluster encapsulation. Performing these configurations through the TF web UI, CLI, or API does not provide configuration persistence, and the settings defined this way may be reset to defaults during a cluster services restart or update.

Note

Defining the default values for encapsulation parameters in the TF operator CR is unnecessary.

Encapsulation settings

Parameter

Default value

Description

ENCAP_PRIORITY

MPLSoUDP,MPLSoGRE,VXLAN

Defines the encapsulation priority order.

VXLAN_VN_ID_MODE

automatic

Defines the Virtual Network ID type. The list of possible values includes:

  • automatic - to assign the VXLAN identifier to virtual networks automatically

  • configured - to make cloud users explicitly provide the VXLAN identifier for the virtual networks.

Typically, for a Layer 2 VPN use case, the VXLAN_VN_ID_MODE parameter is set to configured.

Example configuration:

controllers:
  tf-config:
    provisioner:
      containers:
      - env:
        - name: VXLAN_VN_ID_MODE
          value: automatic
        - name: ENCAP_PRIORITY
          value: VXLAN,MPLSoUDP,MPLSoGRE
        name: provisioner
Autonomous System Number (ASN)

In the routing fabric of a data center, a MOSK cluster with Tungsten Fabric enabled can be represented either by a separate Autonomous System (AS) or as part of a bigger autonomous system. In either case, Tungsten Fabric needs to participate in the BGP peering, exchanging routes with external devices and within the cloud.

The Tungsten Fabric Controller acts as an internal (iBGP) route reflector for the cloud’s AS by populating /32 routes pointing to VMs across all compute nodes as well as the cloud’s edge gateway devices in case they belong to the same AS. Apart from being an iBGP route reflector for the cloud’s AS, the Tungsten Fabric Controller can act as a BGP peer for autonomous systems external to the cloud, for example, for the AS configured across the data center’s leaf-spine fabric.

The Autonomous System Number (ASN) setting contains the unique identifier of the autonomous system that the MOSK cluster with Tungsten Fabric belongs to. The ASN does not affect the internal iBGP communication between vRouters running on the compute nodes; such communication works regardless of the ASN settings. However, any network appliance that is not managed by the Tungsten Fabric control plane has its BGP configuration, including the ASN, set manually. Therefore, the ASN settings must be configured consistently on both sides. Otherwise, BGP sessions cannot be established, regardless of whether the external device peers with Tungsten Fabric over iBGP or eBGP.

Configuring ASNs

The TFOperator custom resource enables you to define ASN settings for your Tungsten Fabric cluster.

Important

The TFOperator CR must be the only place to configure the cluster ASN. Performing these configurations through the TF web UI, CLI, or API does not provide configuration persistence, and the settings defined this way may be reset to defaults during a cluster services restart or update.

Note

Defining the default values for ASN parameters in the TF operator CR is unnecessary.

ASN settings

Parameter

Default value

Description

BGP_ASN

64512

Defines the control node’s Autonomous System Number (ASN).

ENABLE_4BYTE_AS

FALSE

Enables the 4-byte ASN format.

Example configuration:

controllers:
  tf-config:
    provisioner:
      containers:
      - env:
        - name: BGP_ASN
          value: "64515"
        - name: ENABLE_4BYTE_AS
          value: "true"
        name: provisioner
  tf-control:
    provisioner:
      containers:
      - env:
        - name: BGP_ASN
          value: "64515"
        name: provisioner
Access to external DNS

By default, the tf-control-dns-external service is created to expose the TF control DNS service. You can disable the creation of this service using the enableDNSExternal parameter in the TFOperator CR. For example:

spec:
  controllers:
    tf-control:
      dns:
        enableDNSExternal: false
Gateway for vRouter data plane network

If an edge router is accessible from the data plane through a gateway, define the VROUTER_GATEWAY parameter in the TFOperator custom resource. Otherwise, the default system gateway is used.

spec:
  controllers:
    tf-vrouter:
      agent:
        containers:
        - name: agent
          env:
          - name: VROUTER_GATEWAY
            value: <data-plane-network-gateway>

You can also configure the parameter for Tungsten Fabric vRouter in the DPDK mode:

spec:
  controllers:
    tf-vrouter:
      agent-dpdk:
        enabled: true
        containers:
        - name: agent
          env:
          - name: VROUTER_GATEWAY
            value: <data-plane-network-gateway>
Tungsten Fabric image precaching

By default, MOSK deploys image precaching DaemonSets to minimize possible downtime when updating container images. You can disable creation of these DaemonSets by setting the imagePreCaching parameter in the TFOperator custom resource to false:

spec:
  settings:
     imagePreCaching: false

When you disable imagePreCaching, the Tungsten Fabric Operator does not automatically remove the image precaching DaemonSets that have already been created. These DaemonSets do not affect the cluster setup. To remove them manually:

kubectl -n tf delete daemonsets.apps -l app=tf-image-pre-caching
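
To verify whether any image precaching DaemonSets remain, list them by the same label:

kubectl -n tf get daemonsets -l app=tf-image-pre-caching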
Graceful restart and long-lived graceful restart

Available since MOSK 23.2 for Tungsten Fabric 21.4 only TechPreview

Graceful restart and long-lived graceful restart are vital mechanisms within BGP (Border Gateway Protocol) routing, designed to optimize routing table convergence in scenarios where a BGP router restarts or a networking failure occurs, leading to interruptions of router peering.

During a graceful restart, a router can signal its BGP peers about its impending restart, requesting them to retain the routes it had previously advertised as active. This allows for seamless network operation and minimal disruption to data forwarding during the router downtime.

The long-lived aspect of the long-lived graceful restart extends the graceful restart effectiveness beyond the usual restart duration. This extension provides an additional layer of resilience and stability to BGP routing updates, bolstering the network’s ability to manage unforeseen disruptions.

Caution

Mirantis does not generally recommend using the graceful restart and long-lived graceful restart features with the Tungsten Fabric XMPP helper, unless the configuration is performed by proficient operators with at-scale expertise in the networking domain, and exclusively to address specific corner cases.

Configuring graceful restart and long-lived graceful restart

Tungsten Fabric Operator allows for easy enablement and configuration of the graceful restart and long-lived graceful restart features through the TFOperator custom resource:

spec:
  settings:
    gracefulRestart:
      enabled: <BOOLEAN>
      bgpHelperEnabled: <BOOLEAN>
      xmppHelperEnabled: <BOOLEAN>
      restartTime: <TIME_IN_SECONDS>
      llgrRestartTime: <TIME_IN_SECONDS>
      endOfRibTimeout: <TIME_IN_SECONDS>
Graceful restart and long-lived graceful restart settings

Parameter

Default value

Description

enabled

false

Enables or disables graceful restart and long-lived graceful restart features.

bgpHelperEnabled

false

Enables the Tungsten Fabric control services to act as a graceful restart helper to the edge router or any other BGP peer by retaining the routes learned from this peer and advertising them to the rest of the network as applicable.

Note

The BGP peer must support graceful restart and be configured with it for all of the address families in use.

xmppHelperEnabled

false

Enables the vRouter agent (datapath) to retain the last routes received from the Tungsten Fabric Controller when an XMPP connection is lost.

restartTime

300

Configures a non-zero restart time, in seconds, advertised to BGP peers as part of the graceful restart capability.

llgrRestartTime

300

Specifies the amount of time, in seconds, the vRouter datapath keeps the routes advertised by the Tungsten Fabric control services when an XMPP connection between the control and vRouter agent services is lost.

Note

When graceful restart and long-lived graceful restart are both configured, the duration of the long-lived graceful restart timer is the sum of both timers.

endOfRibTimeout

300

Specifies the amount of time in seconds a control node waits to remove stale routes from a vRouter agent Routing Information Base (RIB).

Tungsten Fabric database

Tungsten Fabric (TF) uses Cassandra and ZooKeeper to store its data. Cassandra is a fault-tolerant and horizontally scalable database that provides persistent storage of configuration and analytics data. ZooKeeper is used by TF for the allocation of unique object identifiers and for implementing transactions.

To prevent data loss, Mirantis recommends that you simultaneously back up the ZooKeeper database dedicated to configuration services and the Cassandra database.

The database backup must be consistent across all systems because the state of the Tungsten Fabric databases is associated with other system databases, such as the OpenStack databases.

Periodic Tungsten Fabric database backups

MOSK enables you to perform the automatic TF data backup in the JSON format using the tf-dbbackup-job cron job. By default, it is disabled. To back up the TF databases, enable tf-dbBackup in the TF Operator custom resource:

spec:
  controllers:
    tf-dbBackup:
      enabled: true

By default, the tf-dbbackup-job job is scheduled for weekly execution, allocates a 5 Gi PVC for storing backups, and keeps 5 previous backups. To configure the backup parameters according to the needs of your cluster, use the following structure:

spec:
  controllers:
    tf-dbBackup:
      enabled: true
      dataCapacity: 30Gi
      schedule: "0 0 13 * 5"
      storedBackups: 10

To temporarily disable tf-dbbackup-job, suspend the job:

spec:
  controllers:
    tf-dbBackup:
      enabled: true
      suspend: true

To delete the tf-dbbackup-job job, disable tf-dbBackup in the TF Operator custom resource:

spec:
  controllers:
    tf-dbBackup:
      enabled: false
Remote storage for Tungsten Fabric database backups

Available since MOSK 23.2 TechPreview

MOSK supports configuring a remote NFS storage for TF data backups through the TF Operator custom resource:

spec:
  controllers:
    tf-dbBackup:
      enabled: true
      backupType: "pv_nfs"
      nfsOptions:
        path: <PATH_TO_SHARE_FOLDER_ON_SERVER>
        server: <IP_ADDRESS/DNS_NAME_OF_SERVER>

If PVC-based backups were used previously, the old PVC is no longer used but is not removed automatically. You can delete it with the following command:

kubectl -n tf delete pvc <TF_DB_BACKUP_PVC>
Tungsten Fabric services

This section explains the specifics of the Tungsten Fabric services provided by Mirantis OpenStack for Kubernetes (MOSK). The list of services and their supported features in this section is not exhaustive and is constantly amended based on the complexity of the architecture and the use of a particular service.

Tungsten Fabric load balancing (HAProxy)

Note

Since 23.1, MOSK provides Octavia Amphora load balancing as a technology preview. To start experimenting with the new load balancing solution, refer to Octavia Amphora load balancing.

MOSK integrates Octavia with Tungsten Fabric through the OpenStack Octavia Tungsten Fabric driver, which uses Tungsten Fabric HAProxy as a back end.

The Tungsten Fabric-based MOSK deployment supports creation, update, and deletion operations with the following standard load balancing API entities:

  • Load balancers

    Note

    For a load balancer creation operation, the driver supports only the vip-subnet-id argument; the vip-network-id argument is not supported.

  • Listeners

  • Pools

  • Health monitors

The Tungsten Fabric-based MOSK deployment does not support the following load balancing capabilities:

  • L7 load balancing capabilities, such as L7 policies, L7 rules, and others

  • Setting specific availability zones for load balancers and their resources

  • Use of the UDP protocol

  • Operations with Octavia quotas

  • Operations with Octavia flavors

Warning

The Tungsten Fabric-based MOSK deployment enables you to manage load balancer resources through the OpenStack CLI or OpenStack Horizon. Do not manipulate load balancer resources through the Tungsten Fabric web UI because such changes are not reflected on the OpenStack API side.
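
For illustration, the following OpenStack CLI sketch creates a load balancer with the supported entities listed above. The resource names and the subnet ID are placeholders; note that only --vip-subnet-id can be used to specify the VIP placement:

openstack loadbalancer create --name lb-demo --vip-subnet-id <subnet-id>
openstack loadbalancer listener create --name listener-demo --protocol HTTP --protocol-port 80 lb-demo
openstack loadbalancer pool create --name pool-demo --listener listener-demo --protocol HTTP --lb-algorithm ROUND_ROBIN
openstack loadbalancer healthmonitor create --name hm-demo --type HTTP --delay 5 --timeout 5 --max-retries 3 pool-demo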

Octavia Amphora load balancing

Available since MOSK 23.1 TechPreview

Octavia Amphora (Amphora v2) load balancing provides a scalable and flexible solution for load balancing in cloud environments. MOSK deploys an Amphora load balancer on each node of the OpenStack environment, ensuring that load balancing services are easily accessible, highly scalable, and highly reliable.

Compared to the Octavia Tungsten Fabric driver for LBaaS v2 solution, Amphora offers several advanced features including:

  • Full compatibility with the Octavia API, which provides a standardized interface for load balancing in MOSK OpenStack environments. This makes it easier to manage and integrate with other OpenStack services.

  • Layer 7 policies and rules, which allow for more granular control over traffic routing and load balancing decisions. This enables users to optimize their application performance and improve the user experience.

  • Support for the UDP protocol, which is commonly used for real-time communications and other high-performance applications. This enables users to deploy a wider range of applications with the same load balancing infrastructure.

Enabling Octavia Amphora load balancing

By default, MOSK uses the Octavia Tungsten Fabric load balancing. Once Octavia Amphora load balancing is enabled, the existing Octavia Tungsten Fabric driver load balancers will continue to function normally. However, you cannot migrate your load balancer workloads from the old LBaaS v2 solution to Amphora.

Note

As long as MOSK provides Octavia Amphora load balancing as a technology preview feature, Mirantis cannot guarantee the stability of this solution and does not provide a migration path from Tungsten Fabric load balancing (HAProxy), which is used by default.

To enable Octavia Amphora load balancing:

  1. Assign the openstack-gateway: enabled label to the compute nodes. This step and the next one can be performed in either order.

  2. To make Amphora the default provider, specify it in the OpenStackDeployment custom resource:

    spec:
      features:
        octavia:
          default_provider: amphorav2
    
  3. Verify that the OpenStack Controller has scheduled new Octavia pods that include health manager, worker, and housekeeping pods.

    kubectl get pods -n openstack -l 'application=octavia,component in (worker, health_manager, housekeeping)'
    

    Example of output for an environment with two compute nodes:

    NAME                                    READY   STATUS    RESTARTS   AGE
    octavia-health-manager-default-48znl    1/1     Running   0          4h32m
    octavia-health-manager-default-jk82v    1/1     Running   0          4h34m
    octavia-housekeeping-7bdf9cbd6c-24vc4   1/1     Running   0          4h34m
    octavia-housekeeping-7bdf9cbd6c-h9ccv   1/1     Running   0          4h34m
    octavia-housekeeping-7bdf9cbd6c-rptvv   1/1     Running   0          4h34m
    octavia-worker-665f84fc7-8kdqd          1/1     Running   0          4h34m
    octavia-worker-665f84fc7-j6jn9          1/1     Running   0          4h31m
    octavia-worker-665f84fc7-kqf9t          1/1     Running   0          4h33m
    
Creating new load balancers

The workflow for creating new load balancers with Amphora is identical to the workflow for creating load balancers with Octavia Tungsten Fabric driver for LBaaS v2. You can do it either through the OpenStack Horizon UI or OpenStack CLI.

If you have not defined amphorav2 as the default provider in the OpenStackDeployment custom resource, you can specify it explicitly when creating a load balancer using the --provider argument:

openstack loadbalancer create --provider amphorav2
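
The command above is abbreviated for clarity. A complete invocation also requires the VIP placement and, typically, a name; the values below are placeholders:

openstack loadbalancer create --name amphora-lb-demo --provider amphorav2 --vip-subnet-id <subnet-id>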
Tungsten Fabric known limitations

This section contains a summary of the Tungsten Fabric upstream features and use cases not supported in MOSK, features and use cases offered as Technology Preview in the current product release if any, and known limitations of Tungsten Fabric in integration with other product components.

Feature or use case

Status

Description

Tungsten Fabric web UI

Provided as is

MOSK provides the TF web UI as is and does not include this service in the support Service Level Agreement

Automatic generation of network port records in DNSaaS (Designate)

Not supported

As a workaround, you can use the Tungsten Fabric built-in DNS service that enables virtual machines to resolve each other's names

Secret management (Barbican)

Not supported

It is not possible to use the certificates stored in Barbican to terminate HTTPS on a load balancer in a Tungsten Fabric deployment

Role Based Access Control (RBAC) for Neutron objects

Not supported

Advanced Tungsten Fabric features

Provided as is

MOSK provides the following advanced Tungsten Fabric features as is and does not include them in the support Service Level Agreement:

  • Service Function Chaining

  • Production ready multi-site SDN

  • Layer 3 multihoming

  • Long-Lived Graceful Restart (LLGR)

Technical Preview

DPDK

Monitoring of tf-rabbitmq

Not supported

Due to a known issue, tf-rabbitmq is not monitored on new MOSK 22.5 clusters. The existing clusters updated to MOSK 22.5 are not affected.

Tungsten Fabric integration with OpenStack

OpenStack and Tungsten Fabric (TF) are integrated at two levels: the controllers level and the services level.

Controllers integration

The integration between the OpenStack and TF controllers is implemented through the shared Kubernetes openstack-tf-shared namespace. Both controllers have access to this namespace to read and write the Kubernetes kind: Secret objects.

The OpenStack Controller posts the data required by the TF services into the openstack-tf-shared namespace. The TF Controller watches this namespace. Once an appropriate secret is created, the TF Controller reads it into its internal data structures for further processing.

The OpenStack Controller includes the following data for the TF Controller:

  • tunnel_interface

    Name of the network interface for the TF data plane. This interface is used by TF for the encapsulated traffic for overlay networks.

  • Keystone authorization information

    Keystone Administrator credentials and an up-and-running IAM service are required for the TF Controller to initiate the deployment process.

  • Nova metadata information

    Required for the TF vRouter agent service.

Also, the OpenStack Controller watches the openstack-tf-shared namespace for the vrouter_port parameter that defines the vRouter port number and passes it to the nova-compute pod.
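
For troubleshooting, you can inspect the contents of the shared namespace directly with standard kubectl commands, for example:

kubectl -n openstack-tf-shared get secrets
kubectl -n openstack-tf-shared get secret <secret-name> -o yaml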

Services integration

The following OpenStack services are integrated with TF through their APIs:

  • neutron-server - integration is provided by the contrail-neutron-plugin component that is used by the neutron-server service to transform API calls into TF API-compatible requests.

  • nova-compute - integration is provided by the contrail-nova-vif-driver and contrail-vrouter-api packages used by the nova-compute service to interact with the TF vRouter when plugging instance network ports.

  • octavia-api - integration is provided by the Octavia TF Driver that enables you to use OpenStack CLI and Horizon for operations with load balancers. See Tungsten Fabric load balancing (HAProxy) for details.

Warning

TF is not integrated with the following OpenStack services:

  • DNS service (Designate)

  • Key management (Barbican)

Networking

Depending on the size of an OpenStack environment and the components that you use, you may want to have a single or multiple network interfaces, as well as run different types of traffic on a single or multiple VLANs.

This section provides the recommendations for planning the network configuration and optimizing the cloud performance.

Networking overview

Mirantis OpenStack for Kubernetes (MOSK) cluster networking is complex and defined by the security requirements and performance considerations. It is based on the Kubernetes cluster networking provided by Mirantis Container Cloud and expanded to facilitate the demands of the OpenStack virtualization platform.

A Container Cloud Kubernetes cluster provides a platform for MOSK and is considered a part of its control plane. All networks that serve Kubernetes and related traffic are considered control plane networks. The Kubernetes cluster networking is typically focused on connecting pods of different nodes as well as exposing the Kubernetes API and services running in pods into an external network.

The OpenStack networking connects virtual machines to each other and the outside world. Most of the OpenStack-related networks are considered a part of the data plane in an OpenStack cluster. Ceph networks are considered data plane networks for the purpose of this reference architecture.

When planning your OpenStack environment, consider the types of traffic that your workloads generate and design your network accordingly. If you anticipate that certain types of traffic, such as storage replication, will likely consume a significant amount of network bandwidth, you may want to move that traffic to a dedicated network interface to avoid performance degradation.

The following diagram provides a simplified overview of the underlay networking in a MOSK environment:

cluster-networking
Management cluster networking

This page summarizes the recommended networking architecture of a Mirantis Container Cloud management cluster for a Mirantis OpenStack for Kubernetes (MOSK) cluster.

We recommend deploying the management cluster with a dedicated interface for the provisioning (PXE) network. The separation of the provisioning network from the management network ensures additional security and resilience of the solution.

MOSK end users typically should have access to the Keycloak service in the management cluster for authentication to the Horizon web UI. Therefore, we recommend that you connect the management network of the management cluster to an external network through an IP router. The default route on the management cluster nodes must be configured with the default gateway in the management network.

If you deploy the multi-rack configuration, ensure that the provisioning network of the management cluster is connected to an IP router that connects it to the provisioning networks of all racks.

MOSK cluster networking

Mirantis OpenStack for Kubernetes (MOSK) clusters managed by Mirantis Container Cloud use the following networks to serve different types of traffic:

MOSK network types

Network role

Description

Provisioning (PXE) network

Facilitates the iPXE boot of all bare metal machines in a MOSK cluster and provisioning of the operating system to machines.

This network is only used during provisioning of the host. It must not be configured on an operational MOSK node.

Life-cycle management (LCM) network

Connects LCM Agents running on the hosts to the Container Cloud LCM API. The LCM API is provided by the regional or management cluster. The LCM network is also used for communication between kubelet and the Kubernetes API server inside a Kubernetes cluster. The MKE components use this network for communication inside a swarm cluster.

The LCM subnet(s) provides IP addresses that are statically allocated by the IPAM service to bare metal hosts. This network must be connected to the Kubernetes API endpoint of the regional cluster through an IP router. LCM Agents running on MOSK clusters will connect to the regional cluster API through this router. LCM subnets may be different per MOSK cluster as long as this connection requirement is satisfied.

You can use more than one LCM network segment in a MOSK cluster. In this case, separated L2 segments and interconnected L3 subnets are still used to serve LCM and API traffic.

All IP subnets in the LCM networks must be connected to each other by IP routes. These routes must be configured on the hosts through L2 templates.

All IP subnets in the LCM network must be connected to the Kubernetes API endpoints of the management or regional cluster through an IP router.

You can manually select the load balancer IP address for external access to the cluster API and specify it in the Cluster object configuration. Alternatively, you can allocate a dedicated IP range for a virtual IP of the cluster API load balancer by adding a Subnet object with a special annotation. Mirantis recommends that this subnet stays unique per MOSK cluster. For details, see Create subnets.

Note

When using the L2 announcement of the IP address for the cluster API load balancer, the following limitations apply:

  • Only one of the LCM networks can contain the API endpoint. This network is called API/LCM throughout this documentation. It consists of a VLAN segment stretched between all Kubernetes master nodes in the cluster and the IP subnet that provides IP addresses allocated to these nodes.

  • The load balancer IP address must be allocated from the same subnet CIDR address that the LCM subnet uses.

When using the BGP announcement of the IP address for the cluster API load balancer, which is available as Technology Preview since MOSK 23.2.2, no segment stretching is required between Kubernetes master nodes. Also, in this scenario, the load balancer IP address is not required to match the LCM subnet CIDR address.

Kubernetes workloads network

Serves as an underlay network for traffic between pods in the MOSK cluster. Do not share this network between clusters.

There might be more than one Kubernetes pods network segment in the cluster. In this case, they must be connected through an IP router.

The Kubernetes workloads network does not need external access.

The Kubernetes workloads subnet(s) provides IP addresses that are statically allocated by the IPAM service to all nodes and that are used by Calico for cross-node communication inside a cluster. By default, VXLAN overlay is used for Calico cross-node communication.

Kubernetes external network

Provides access to the OpenStack endpoints in a MOSK cluster.

When using the L2 (ARP) announcement of the external endpoints of load-balanced services, the network must contain a VLAN segment extended to all MOSK nodes connected to this network.

When using the BGP announcement of the external endpoints of load-balanced services, which is available as Technology Preview since MOSK 23.2.2, there is no requirement of having a single VLAN segment extended to all MOSK nodes connected to this network.

A typical MOSK cluster only has one external network.

The external network must include at least two IP address ranges defined by separate Subnet objects in Container Cloud API:

  • MOSK services address range

    Provides IP addresses for externally available load-balanced services, including OpenStack API endpoints.

  • External address range

    Provides IP addresses to be assigned to network interfaces on all cluster nodes that are connected to this network. MetalLB speakers must run on the same nodes. For details, see Configure the MetalLB speaker node selector.

    This is required for external traffic to return to the originating client. The default route on the MOSK nodes that are connected to the external network must be configured with the default gateway in the external network.

Storage access network

Serves the storage access traffic to and from the Ceph OSD services.

A MOSK cluster may have more than one VLAN segment and IP subnet in the storage access network. All IP subnets of this network in a single cluster must be connected by an IP router.

The storage access network does not require external access unless you want to directly expose Ceph to the clients outside of a MOSK cluster.

Note

Direct access to Ceph by clients outside of a MOSK cluster is technically possible but not supported by Mirantis. Use it at your own risk.

The IP addresses from subnets in this network are statically allocated by the IPAM service to Ceph nodes. The Ceph OSD services bind to these addresses on their respective nodes.

This is a public network in Ceph terms. 1

Storage replication network

Serves the storage replication traffic between the Ceph OSD services.

A MOSK cluster may have more than one VLAN segment and IP subnet in this network as long as the subnets are connected by an IP router.

This network does not require external access.

The IP addresses from subnets in this network are statically allocated by the IPAM service to Ceph nodes. The Ceph OSD services bind to these addresses on their respective nodes.

This is a cluster network in Ceph terms. 1

Out-of-Band (OOB) network

Connects Baseboard Management Controllers (BMCs) of the bare metal hosts. Must not be accessible from a MOSK cluster.

1(1,2)

For more details about Ceph networks, see Ceph Network Configuration Reference.

The following diagram illustrates the networking schema of the Container Cloud deployment on bare metal with a MOSK cluster:

_images/network-multirack.png
Network types

This section describes network types for Layer 3 networks used for Kubernetes and Mirantis OpenStack for Kubernetes (MOSK) clusters along with requirements for each network type.

Note

Only IPv4 is currently supported by Container Cloud and IPAM for infrastructure networks. IPv6 is not supported and not used in Container Cloud and MOSK underlay infrastructure networks.

The following diagram provides an overview of the underlay networks in a MOSK environment:

_images/os-cluster-l3-networking.png
L3 networks for Kubernetes

A MOSK deployment typically requires the following types of networks:

  • Provisioning network

    Used for provisioning of bare metal servers.

  • Management network

    Used for management of the Container Cloud infrastructure and for communication between containers in Kubernetes.

  • LCM/API network

    Must be configured on the Kubernetes manager nodes of the cluster. Contains the Kubernetes API endpoint with the VRRP virtual IP address. Enables communication between the MKE cluster nodes.

  • LCM network

    Enables communication between the MKE cluster nodes. Multiple VLAN segments and IP subnets can be created for a multi-rack architecture. Each server must be connected to one of the LCM segments and have an IP from the corresponding subnet.

  • External network

    Used to expose the OpenStack, StackLight, and other services of the MOSK cluster.

  • Kubernetes workloads network

    Used for communication between containers in Kubernetes.

  • Storage access network (Ceph)

    Used for accessing the Ceph storage. In Ceph terms, this is a public network 0. We recommend placing it on a dedicated hardware interface.

  • Storage replication network (Ceph)

    Used for Ceph storage replication. In Ceph terms, this is a cluster network 0. To ensure low latency and fast access, place the network on a dedicated hardware interface.

0(1,2)

For details about Ceph networks, see Ceph Network Configuration Reference.

L3 networks for MOSK

The MOSK deployment additionally requires the following networks.

L3 networks for MOSK

Service name

Network

Description

VLAN name

Networking

Provider networks

Typically, a routable network used to provide external access to OpenStack instances (a floating network). Can be used by OpenStack services such as Ironic, Manila, and others to connect their management resources.

pr-floating

Networking

Overlay networks (virtual networks)

The network used to provide isolated, secure tenant networks with the help of a tunneling mechanism (VLAN/GRE/VXLAN). If VXLAN or GRE encapsulation is used, IP address assignment is required on interfaces at the node level.

neutron-tunnel

Compute

Live migration network

The network used by the OpenStack compute service (Nova) to transfer data during live migration. Depending on the cloud needs, it can be placed on a dedicated physical network so that it does not affect other networks during live migration. IP address assignment is required on interfaces at the node level.

lm-vlan

How the logical networks described above map to physical networks and interfaces on nodes depends on the cloud size and configuration. We recommend placing the OpenStack networks on a dedicated physical interface (bond) that is not shared with the storage and Kubernetes management networks to minimize their influence on each other.

L3 networks requirements

The following tables describe networking requirements for a MOSK cluster, Container Cloud management and Ceph clusters.

Container Cloud management cluster networking requirements

Network type

Provisioning

Management

Suggested interface name

bm-pxe

lcm-nw

Minimum number of VLANs

1

1

Minimum number of IP subnets

3

2

Minimum recommended IP subnet size

  • 8 IP addresses (Container Cloud management cluster hosts)

  • 8 IP addresses (MetalLB for provisioning services)

  • 16 IP addresses (DHCP range for directly connected servers)

  • 8 IP addresses (Container Cloud management cluster hosts, API VIP)

  • 16 IP addresses (MetalLB for Container Cloud services)

External routing

Not required

Required, may use proxy server

Multiple segments/stretch segment

Stretch segment for management cluster due to MetalLB Layer 2 limitations 1

Stretch segment due to VRRP, MetalLB Layer 2 limitations

Internal routing

Routing to separate DHCP segments, if in use

  • Routing to API endpoints of managed clusters for LCM

  • Routing to MetalLB ranges of managed clusters for StackLight authentication

  • Default route from Container Cloud management cluster hosts

1

Multiple VLAN segments with IP subnets can be added to the cluster configuration for separate DHCP domains.

MOSK cluster networking requirements

Network type

Provisioning

LCM/API

LCM

External

Kubernetes workloads

Minimum number of VLANs

1 (optional)

1

1 (optional)

1

1

Suggested interface name

N/A

lcm-nw

lcm-nw

k8s-ext-v

k8s-pods 2

Minimum number of IP subnets

1 (optional)

1

1 (optional)

2

1

Minimum recommended IP subnet size

16 IPs (DHCP range)

  • 3 IPs for Kubernetes manager nodes

  • 1 IP for the API endpoint VIP

1 IP per MOSK node (Kubernetes worker)

  • 1 IP per cluster node

  • 16 IPs (MetalLB for StackLight, OpenStack services)

1 IP per cluster node

Stretch or multiple segments

Multiple

Stretch due to VRRP limitations

Multiple

Stretch connected to all MOSK controller nodes. For details, see Configure the MetalLB speaker node selector.

Multiple

External routing

Not required

Not required

Not required

Required, default route

Not required

Internal routing

Routing to the provisioning network of the management cluster

  • Routing to the IP subnet of the Container Cloud management network

  • Routing to all LCM IP subnets of the same MOSK cluster, if in use

  • Routing to the IP subnet of the LCM/API network

  • Routing to all IP subnets of the LCM network, if in use

Routing to the IP subnet of the Container Cloud Management API

Routing to all IP subnets of Kubernetes workloads

2

The bridge interface with this name is mandatory if you need to separate Kubernetes workloads traffic. You can configure this bridge over the VLAN or directly over the bonded or single interface.

MOSK Ceph cluster networking requirements

Network type

Storage access

Storage replication

Minimum number of VLANs

1

1

Suggested interface name

stor-public 3

stor-cluster 3

Minimum number of IP subnets

1

1

Minimum recommended IP subnet size

1 IP per cluster node

1 IP per cluster node

Stretch or multiple segments

Multiple

Multiple

External routing

Not required

Not required

Internal routing

Routing to all IP subnets of the Storage access network

Routing to all IP subnets of the Storage replication network

Note

When selecting externally routable subnets, ensure that the subnet ranges do not overlap with the internal subnets ranges. Otherwise, internal resources of users will not be available from the MOSK cluster.

3(1,2)

For details about Ceph networks, see Ceph Network Configuration Reference.

Multi-rack architecture

TechPreview

Mirantis OpenStack for Kubernetes (MOSK) enables you to deploy a cluster with a multi-rack architecture, where every data center cabinet (a rack), incorporates its own Layer 2 network infrastructure that does not extend beyond its top-of-rack switch. The architecture allows a MOSK cloud to integrate natively with the Layer 3-centric networking topologies such as Spine-Leaf that are commonly seen in modern data centers.

The architecture eliminates the need to stretch and manage VLANs across parts of a single data center, or to build VPN tunnels between the segments of a geographically distributed cloud.

The set of networks present in each rack depends on the OpenStack networking service back end in use.

multi-rack-overview.html
Bare metal provisioning

The multi-rack architecture in Mirantis Container Cloud and MOSK requires additional configuration of networking infrastructure. Every Layer 2 domain, or rack, needs to have a DHCP relay agent configured on its dedicated segment of the Common/PXE network (lcm-nw VLAN). The agent handles all Layer-2 DHCP requests incoming from the bare metal servers living in the rack and forwards them as Layer-3 packets across the data center fabric to a Mirantis Container Cloud regional cluster.

multi-rack-bm.html

You need to configure per-rack DHCP ranges by defining Subnet resources in Mirantis Container Cloud as described in Mirantis Container Cloud documentation: Configure multiple DHCP ranges using Subnet resources.

Based on the address of the DHCP agent that relays a request from a server, Mirantis Container Cloud will automatically allocate an IP address in the corresponding subnet.

For the networks types other than Common/PXE, you need to define subnets using the Mirantis Container Cloud L2 templates. Every rack needs to have a dedicated set of L2 templates, each template representing a specific server role and configuration.

Multi-rack MOSK cluster with Tungsten Fabric

A typical medium-sized or larger MOSK cloud consists of three or more racks that can generally be divided into the following major categories:

  • Compute/Storage racks that contain the hypervisors and instances running on top of them. Additionally, they contain nodes that store cloud applications’ block, ephemeral, and object data as part of the Ceph cluster.

  • Control plane racks that incorporate all the components needed by the cloud operator to manage the cloud's life cycle. They also include the services through which cloud users interact with the cloud to deploy their applications, such as the cloud APIs and web UI.

    A control plane rack may also contain additional compute and storage nodes.

The diagram below will help you to plan the networking layout of a multi-rack MOSK cloud with Tungsten Fabric.

multi-rack-tf.html

Note

Since 23.2.2, MOSK supports full L3 networking topology in the Technology Preview scope. This enables deployment of specific cluster segments in dedicated racks without the need for L2 layer extension between them. For configuration procedure, see Configure BGP announcement for cluster API LB address and Configure BGP announcement for cluster services LB addresses in Deployment Guide.

For MOSK 23.1 and older versions, Kubernetes masters (3 nodes) either need to be placed into a single rack or, if distributed across multiple racks for better availability, require stretching of the L2 segment of the management network across these racks. This requirement is caused by the Mirantis Kubernetes Engine underlay for MOSK relying on the Layer 2 VRRP protocol to ensure high availability of the Kubernetes API endpoint.

The table below provides a mapping between the racks and the network types participating in a multi-rack MOSK cluster with the Tungsten Fabric back end.

Networks and VLANs for a multi-rack MOSK cluster with TF

Network

VLAN name

Control Plane rack

Compute/Storage rack

Common/PXE

lcm-nw

Yes

Yes

Management

lcm-nw

Yes

Yes

External (MetalLB)

k8s-ext-v

Yes

No

Kubernetes workloads

k8s-pods-v

Yes

Yes

Storage access (Ceph)

stor-frontend

Yes

Yes

Storage replication (Ceph)

stor-backend

Yes

Yes

Overlay

tenant-vlan

Yes

Yes

Live migration

lm-vlan

Yes

Yes

Physical networks layout

This section summarizes the requirements for the physical layout of underlay network and VLANs configuration for the multi-rack architecture of Mirantis OpenStack for Kubernetes (MOSK).

Physical networking of a Container Cloud management cluster

Due to limitations of virtual IP address for Kubernetes API and of MetalLB load balancing in Container Cloud, the management cluster nodes must share VLAN segments in the provisioning and management networks.

In the multi-rack architecture, the management cluster nodes may be placed into a single rack or spread across three racks. In either case, the provisioning and management network VLANs must be stretched across the ToR switches of the racks.

The following diagram illustrates physical and L2 connections of the Container Cloud management cluster.

_images/os-cluster-mgmt-physical.png
Physical networking of a MOSK cluster
External network

Due to limitations of MetalLB load balancing, all MOSK cluster nodes connected to the external network must share the VLAN segment in the external network.

In the multi-rack architecture, the external network VLAN must be stretched to the ToR switches of all the racks where nodes connected to the external network are located. All other VLANs may be configured per rack.

Kubernetes manager nodes

Due to limitations of using a virtual IP address for Kubernetes API, the Kubernetes manager nodes must share the VLAN segment in the API/LCM network.

In the multi-rack architecture, Kubernetes manager nodes may be spread across three racks. The API/LCM network VLAN must be stretched to the ToR switches of the racks. All other VLANs may be configured per rack.

The following diagram illustrates physical and L2 network connections of the Kubernetes manager nodes in a MOSK cluster.

Caution

Such configuration does not apply to a compact control plane MOSK installation. See Create a MOSK cluster.

_images/os-cluster-k8s-mgr-physical.png
OpenStack controller nodes

The following diagram illustrates physical and L2 network connections of the control plane nodes in a MOSK cluster.

_images/os-cluster-control-physical.png
OpenStack compute nodes

All VLANs for OpenStack compute nodes may be configured per rack. No VLAN should be stretched across multiple racks.

The following diagram illustrates physical and L2 network connections of the compute nodes in a MOSK cluster.

_images/os-cluster-compute-physical.png
OpenStack storage nodes

All VLANs for OpenStack storage nodes may be configured per rack. No VLAN should be stretched across multiple racks.

The following diagram illustrates physical and L2 network connections of the storage nodes in a MOSK cluster.

_images/os-cluster-storage-physical.png
Performance optimization

The following recommendations apply to all types of nodes in the Mirantis OpenStack for Kubernetes (MOSK) clusters.

Jumbo frames

To improve the goodput, we recommend that you enable jumbo frames where possible. Jumbo frames have to be enabled along the whole path that the packets traverse. If one of the network components cannot handle jumbo frames, the network path uses the smallest MTU.
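
To check the MTU on a host interface and, for testing purposes, change it temporarily, you can use standard Linux tooling. The interface name below is illustrative; the persistent MTU configuration of MOSK nodes is defined in the Container Cloud L2 templates:

ip link show bond0                 # display the current MTU of the interface
ip link set dev bond0 mtu 9000     # temporarily enable jumbo frames for testing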

Bonding

To protect against the failure of a single NIC, we recommend using link aggregation, such as bonding. Link aggregation is useful for linear scaling of bandwidth, load balancing, and fault protection. Depending on the hardware equipment, different types of bonds might be supported. Use multi-chassis link aggregation as it provides fault tolerance at the device level, for example, MLAG on Arista equipment or vPC on Cisco equipment.

The Linux kernel supports the following bonding modes:

  • active-backup

  • balance-xor

  • 802.3ad (LACP)

  • balance-tlb

  • balance-alb

Since LACP is the IEEE 802.3ad standard supported by the majority of network platforms, we recommend using this bonding mode. Use the Link Aggregation Control Protocol (LACP) bonding mode with MC-LAG domains configured on the ToR switches. This corresponds to the 802.3ad bond mode on hosts. See the example sketch after the recommendations below.

Additionally, follow these recommendations in regards to bond interfaces:

  • Use ports from different multi-port NICs when creating bonds. This makes network connections redundant if failure of a single NIC occurs.

  • Configure the ports that connect servers to the PXE network with PXE VLAN as native or untagged. On these ports, configure LACP fallback to ensure that the servers can reach DHCP server and boot over network.
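
For illustration only, the following netplan-style sketch shows an 802.3ad (LACP) bond that follows the recommendations above. The interface names are placeholders; on MOSK nodes, the actual bond configuration is defined through the Container Cloud L2 templates rather than edited on hosts directly:

network:
  version: 2
  ethernets:
    eno1: {}
    ens1f0: {}
  bonds:
    bond0:
      interfaces: [eno1, ens1f0]      # ports from two different NICs for redundancy
      parameters:
        mode: 802.3ad                 # LACP, matching the MC-LAG domain on the ToR switches
        lacp-rate: fast
        transmit-hash-policy: layer3+4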

Spanning tree portfast mode

Configure the Spanning Tree Protocol (STP) settings, such as the portfast mode, on the network switch ports to ensure that the ports start forwarding packets as soon as the link comes up. This helps avoid iPXE timeout issues and ensures a reliable boot over the network.

Storage

A MOSK cluster uses Ceph as a distributed storage system for file, block, and object storage. This section provides an overview of a Ceph cluster deployed by Container Cloud.

Ceph overview

Mirantis Container Cloud deploys Ceph on MOSK using Helm charts with the following components:

Rook Ceph Operator

A storage orchestrator that deploys Ceph on top of a Kubernetes cluster. Also known as Rook or Rook Operator. Rook operations include:

  • Deploying and managing a Ceph cluster based on provided Rook CRs such as CephCluster, CephBlockPool, CephObjectStore, and so on.

  • Orchestrating the state of the Ceph cluster and all its daemons.

KaaSCephCluster custom resource (CR)

Represents the customization of a Kubernetes installation and allows you to define the required Ceph configuration through the Container Cloud web UI before deployment. For example, you can define the failure domain, Ceph pools, Ceph node roles, number of Ceph components such as Ceph OSDs, and so on. The ceph-kcc-controller controller on the Container Cloud management cluster manages the KaaSCephCluster CR.

Ceph Controller

A Kubernetes controller that obtains the parameters from Container Cloud through a CR, creates CRs for Rook and updates its CR status based on the Ceph cluster deployment progress. It creates users, pools, and keys for OpenStack and Kubernetes and provides Ceph configurations and keys to access them. Also, Ceph Controller eventually obtains the data from the OpenStack Controller for the Keystone integration and updates the Ceph Object Gateway services configurations to use Kubernetes for user authentication. Ceph Controller operations include:

  • Transforming user parameters from the Container Cloud Ceph CR into Rook CRs and deploying a Ceph cluster using Rook.

  • Providing integration of the Ceph cluster with Kubernetes.

  • Providing data for OpenStack to integrate with the deployed Ceph cluster.

Ceph Status Controller

A Kubernetes controller that collects all valuable parameters from the current Ceph cluster, its daemons, and entities and exposes them into the KaaSCephCluster status. Ceph Status Controller operations include:

  • Collecting all statuses from a Ceph cluster and corresponding Rook CRs.

  • Collecting additional information on the health of Ceph daemons.

  • Providing information to the status section of the KaaSCephCluster CR.

Ceph Request Controller

A Kubernetes controller that obtains the parameters from Container Cloud through a CR and handles Ceph OSD life cycle management (LCM) operations. It allows for a safe Ceph OSD removal from the Ceph cluster. Ceph Request Controller operations include:

  • Providing an ability to perform Ceph OSD LCM operations.

  • Obtaining specific CRs to remove Ceph OSDs and executing them.

  • Pausing the regular Ceph Controller reconciliation until all requests are completed.

A typical Ceph cluster consists of the following components:

  • Ceph Monitors - three or, in rare cases, five Ceph Monitors.

  • Ceph Managers - one Ceph Manager in a regular cluster.

  • Ceph Object Gateway (radosgw) - Mirantis recommends having three or more radosgw instances for HA.

  • Ceph OSDs - the number of Ceph OSDs may vary according to the deployment needs.

    Warning

    • A Ceph cluster with 3 Ceph nodes does not provide hardware fault tolerance and is not eligible for recovery operations, such as a disk or an entire Ceph node replacement.

    • A Ceph cluster uses the replication factor that equals 3. If the number of Ceph OSDs is less than 3, a Ceph cluster moves to the degraded state with the write operations restriction until the number of alive Ceph OSDs equals the replication factor again.

The placement of Ceph Monitors and Ceph Managers is defined in the KaaSCephCluster CR.

The following diagram illustrates the way a Ceph cluster is deployed in Container Cloud:

_images/ceph-deployment.png

The following diagram illustrates the processes within a deployed Ceph cluster:

_images/ceph-data-flow.png
Ceph limitations

A Ceph cluster configuration in MOSK includes but is not limited to the following limitations:

  • Only one Ceph Controller per MOSK cluster and only one Ceph cluster per Ceph Controller are supported.

  • The replication size for any Ceph pool must be set to more than 1.

  • Only one CRUSH tree per cluster. The separation of devices per Ceph pool is supported through device classes with only one pool of each type for a device class.

  • All CRUSH rules must have the same failure_domain.

  • Only the following types of CRUSH buckets are supported:

    • topology.kubernetes.io/region

    • topology.kubernetes.io/zone

    • topology.rook.io/datacenter

    • topology.rook.io/room

    • topology.rook.io/pod

    • topology.rook.io/pdu

    • topology.rook.io/row

    • topology.rook.io/rack

    • topology.rook.io/chassis

  • RBD mirroring is not supported.

  • Consuming an existing Ceph cluster is not supported.

  • CephFS is not supported.

  • Only IPv4 is supported.

  • If two or more Ceph OSDs are located on the same device, there must be no dedicated WAL or DB for this class.

  • Only a full collocation or dedicated WAL and DB configurations are supported.

  • The minimum size of any defined Ceph OSD device is 5 GB.

  • Reducing the number of Ceph Monitors is not supported and causes the Ceph Monitor daemons removal from random nodes.

  • Ceph cluster does not support removable devices (with hotplug enabled) for deploying Ceph OSDs.

  • When adding a Ceph node with the Ceph Monitor role, if any issues occur with the Ceph Monitor, rook-ceph removes it and adds a new Ceph Monitor instead, named using the next alphabetic character in order. Therefore, the Ceph Monitor names may not follow the alphabetical order. For example, a, b, d, instead of a, b, c.

Ceph integration with OpenStack

The integration between Ceph and OpenStack controllers is implemented through the shared Kubernetes openstack-ceph-shared namespace. Both controllers have access to this namespace to read and write the Kubernetes kind: Secret objects.

_images/osctl-ceph-integration.png

As Ceph is the required and only supported back end for several OpenStack services, all necessary Ceph pools must be specified in the configuration of the kind: MiraCeph custom resource as part of the deployment. Once the Ceph cluster is deployed, the Ceph Controller posts the information required by the OpenStack services to be properly configured as a kind: Secret object into the openstack-ceph-shared namespace. The OpenStack Controller watches this namespace. Once the corresponding secret is created, the OpenStack Controller transforms this secret into the data structures expected by the OpenStack-Helm charts. Even if an OpenStack installation is triggered at the same time as a Ceph cluster deployment, the OpenStack Controller halts the deployment of the OpenStack services that depend on Ceph availability until the secret in the shared namespace is created by the Ceph Controller.

For the configuration of Ceph Object Gateway as an OpenStack Object Storage, the reverse process takes place. The OpenStack Controller waits for the OpenStack-Helm to create a secret with OpenStack Identity (Keystone) credentials that Ceph Object Gateway must use to validate the OpenStack Identity tokens, and posts it back to the same openstack-ceph-shared namespace in the format suitable for consumption by the Ceph Controller. The Ceph Controller then reads this secret and reconfigures Ceph Object Gateway accordingly.
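
You can observe this exchange by listing the secrets in the shared namespace with standard kubectl commands, for example:

kubectl -n openstack-ceph-shared get secrets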

Mirantis StackLight

StackLight is the logging, monitoring, and alerting solution that provides a single pane of glass for cloud maintenance and day-to-day operations as well as offers critical insights into cloud health including operational information about the components deployed with Mirantis OpenStack for Kubernetes (MOSK). StackLight is based on Prometheus, an open-source monitoring solution and a time series database, and OpenSearch, the logs and notifications storage.

Deployment architecture

Mirantis OpenStack for Kubernetes (MOSK) deploys the StackLight stack as a release of a Helm chart that contains the helm-controller and HelmBundle custom resources. The StackLight HelmBundle consists of a set of Helm charts describing the StackLight components. Apart from the OpenStack-specific components below, StackLight also includes the components described in Mirantis Container Cloud Reference Architecture: Deployment architecture. By default, StackLight logging stack is disabled.

During the StackLight configuration when deploying a MOSK cluster, you can define the HA or non-HA StackLight architecture type. For details, see Mirantis Container Cloud Reference Architecture: StackLight database modes.

OpenStack-specific StackLight components overview

StackLight component

Description

Prometheus native exporters and endpoints

Export the existing metrics as Prometheus metrics and include:

  • libvirt-exporter

  • memcached-exporter

  • mysql-exporter

  • rabbitmq-exporter

  • tungstenfabric-exporter

Telegraf OpenStack plugin

Collects and processes the OpenStack metrics.

Monitored components

StackLight measures, analyzes, and reports in a timely manner about failures that may occur in the following Mirantis OpenStack for Kubernetes (MOSK) components and their sub-components. Apart from the components below, StackLight also monitors the components listed in Mirantis Container Cloud Reference Architecture: Monitored components.

  • Libvirt

  • Memcached

  • MariaDB

  • NTP

  • OpenStack (Barbican, Cinder, Designate, Glance, Heat, Horizon, Ironic, Keystone, Neutron, Nova, Octavia)

  • OpenStack SSL certificates

  • Tungsten Fabric (Cassandra, Kafka, Redis, ZooKeeper)

  • RabbitMQ

    Note

    Due to a known issue, tf-rabbitmq is not monitored on new MOSK 22.5 clusters. The existing clusters updated to MOSK 22.5 are not affected.

OpenSearch and Prometheus storage sizing

Caution

Calculations in this document are based on numbers from a real-scale test cluster with 34 nodes. The exact space required for metrics and logs must be calculated depending on the ongoing cluster operations. Some operations force the generation of additional metrics and logs. The values below are approximate. Use them only as recommendations.

During the deployment of a new cluster, you must specify the OpenSearch retention time and Persistent Volume Claim (PVC) size, as well as the Prometheus PVC size, retention time, and retention size. When configuring an existing cluster, you can only set the OpenSearch retention time and the Prometheus retention time and retention size.

The following table describes the recommendations for both OpenSearch and Prometheus retention size and PVC size for a cluster with 34 nodes. Retention time depends on the space allocated for the data. To calculate the required retention time, use the {retention time} = {retention size} / {amount of data per day} formula.

Service

Required space per day

Description

OpenSearch

StackLight in non-HA mode:
  • 202 - 253 GB for the entire cluster

  • ~6 - 7.5 GB for a single node

StackLight in HA mode:
  • 404 - 506 GB for the entire cluster

  • ~12 - 15 GB for a single node

When setting Persistent Volume Claim Size for OpenSearch during the cluster creation, take into account that it defines the PVC size for a single instance of the OpenSearch cluster. StackLight in HA mode has 3 OpenSearch instances. Therefore, for a total OpenSearch capacity, multiply the PVC size by 3.

Prometheus

  • 11 GB for the entire cluster

  • ~400 MB for a single node

Every Prometheus instance stores the entire database. Multiple replicas store multiple copies of the same data. Therefore, treat the Prometheus PVC size as the capacity of Prometheus in the cluster. Do not sum them up.

Prometheus has built-in retention mechanisms based on the database size and time series duration stored in the database. Therefore, even if you miscalculate the PVC size, setting the retention size to approximately 1 GB less than the PVC size prevents disk overfilling.
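
For example, applying the formula above to the Prometheus numbers from the table, with a hypothetical 400 GB PVC per Prometheus replica and the retention size set to approximately 1 GB below it:

# {retention time} = {retention size} / {amount of data per day}
# 399 GB of retention size / 11 GB per day = ~36 days of metrics retention
echo $(( 399 / 11 ))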

StackLight integration with OpenStack

StackLight integration with OpenStack includes automatic discovery of RabbitMQ credentials for notifications and OpenStack credentials for OpenStack API metrics. For details, see the openstack.rabbitmq.credentialsConfig and openstack.telegraf.credentialsConfig parameters description in StackLight configuration parameters.

Workload monitoring

Available since MOSK 23.2 TechPreview

LCM operations may require measuring the downtime of cloud end user instances to assess if SLA commitments regarding workload downtime are being met. Additionally, continuous monitoring of network connectivity is essential for early problem detection.

To address these needs, MOSK provides the OpenStack workload monitoring feature through the Cloudprober exporter. Presently, MOSK supports monitoring of floating IP addresses exclusively through the Internet Control Message Protocol (ICMP).

[Diagram: instance availability monitoring architecture]

To be able to monitor instance availability, your cluster should meet the following requirements:

  • IP connectivity between the network used to assign floating IP addresses and all OpenStack control plane nodes

  • ICMP ingress and egress traffic allowed in operating systems on the monitored virtual machines

  • ICMP ingress and egress traffic allowed in the OpenStack project by configuring security groups (see the example rule after this list)
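
The following OpenStack CLI sketch allows ICMP ingress traffic in a project security group. The default group name is an assumption; use the security group actually attached to the monitored instances:

# Allow inbound ICMP (ping) traffic to the monitored instances;
# egress ICMP traffic is typically allowed by the default egress rules
openstack security group rule create --protocol icmp --ingress default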

To enable the workload monitoring service, use the following OpenStackDeployment definition:

spec:
  features:
    services:
      - cloudprober

For the detailed configuration procedure of the instance availability monitoring, refer to Configure monitoring of instance availability.

Blueprints

This section contains a collection of Mirantis OpenStack for Kubernetes (MOSK) architecture blueprints that include common cluster topology and configuration patterns that can be referred to when building a MOSK cloud. Every blueprint is validated by Mirantis and is known to work. You can use these blueprints alone or in combination, although the interoperability of all possible combinations cannot be guaranteed.

The section provides information on the target use cases, pros and cons of every blueprint and outlines the extent of its applicability. However, do not hesitate to reach out to Mirantis if you have any questions or doubts about whether a specific blueprint can be applied when designing your cloud.

Remote compute nodes
Introduction to edge computing

Although a classic cloud approach allows resources to be distributed across multiple regions, it still needs powerful data centers to host control planes and compute clusters. Such regional centralization poses challenges as the number of data consumers grows: it becomes hard to access the resources hosted in the cloud even though they are located in the same geographic region. The solution is to bring the data closer to the consumer, and this is exactly what edge computing provides.

Edge computing is a paradigm that brings computation and data storage closer to the sources of data or the consumer. It is designed to improve response time and save bandwidth.

A few examples of use cases for edge computing include:

  • Hosting a video stream processing application on premises of a large stadium during the Super Bowl match

  • Placing inventory or augmented reality services directly in industrial facilities, such as a storage facility, power plant, shipyard, and so on

  • A small computation node deployed in a remote village supermarket to host an application for store automation and accounting

These and many other use cases can be addressed by deploying multiple edge clusters managed from a single central place. The idea of centralized management plays a significant role in the business efficiency of an edge cloud environment:

  • Cloud operators obtain a single management console for the cloud that simplifies the Day-1 provisioning of new edge sites and Day-2 operations across multiple geographically distributed points of presence

  • Cloud users get the ability to transparently connect their edge applications with central databases or business logic components hosted in data centers or public clouds

Depending on the size, location, and target use case, the points of presence comprising an edge cloud environment can be divided into five major categories. Mirantis OpenStack powered by Mirantis Container Cloud offers reference architectures to address the centralized management in core and regional data centers as well as edge sites.

Overview of the remote compute nodes approach

Remote compute nodes is one of the approaches to the implementation of the edge computing concept offered by MOSK. The topology consists of a MOSK cluster residing in a data center, which is extended with multiple small groups of compute nodes deployed in geographically distant remote sites. Remote compute nodes are integrated into the MOSK cluster just like the nodes in the central site, with their configuration and life cycle managed through the same means.

Along with compute nodes, remote sites need to incorporate network gateway components that allow application users to consume edge services directly without looping the traffic through the central site.

Design considerations for a remote site

Deployment of an edge cluster managed from a single central place starts with proper planning. This section provides recommendations on how to approach the deployment design.

Compute nodes aggregation into availability zones

Mirantis recommends organizing nodes in each remote site into separate Availability Zones in the MOSK Compute (OpenStack Nova), Networking (OpenStack Neutron), and Block Storage (OpenStack Cinder) services. This enables the cloud users to be aware of the failure domain represented by a remote site and distribute the parts of their applications accordingly.
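
To illustrate the Compute service part of this recommendation, a remote site can be exposed as a Nova availability zone through a host aggregate. The following is a minimal sketch; the remote-site-1 zone name and the host name are placeholders:

# Create a host aggregate exposed as the "remote-site-1" availability zone
openstack aggregate create --zone remote-site-1 remote-site-1

# Add the remote compute nodes to the aggregate
openstack aggregate add host remote-site-1 <remote-compute-hostname>

# Verify the resulting Compute availability zones
openstack availability zone list --compute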

Storage

Typically, high latency between the central control plane and remote sites makes it unfeasible to rely on Ceph as storage for the instance root/ephemeral disks and block data.

Mirantis recommends that you configure the remote sites to use the following back ends:

  • Local storage (LVM or QCOW2) as a storage back end for the MOSK Compute service. See images-storage-back-end for the configuration details.

  • LVM on iSCSI back end for the MOSK Block Storage service. See Enable LVM block storage for the enablement procedure.

To maintain the small size of a remote site, the compute nodes need to be hyper-converged and combine the compute and block storage functions.

Site sizing

There is no limitation on the number of remote sites or their size. However, when planning the cluster, ensure consistency between the total number of nodes managed by a single control plane and the value of the size parameter set in the OpenStackDeployment custom resource. For the list of supported sizes, refer to Main elements.
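
For reference, the size parameter is defined at the top level of the OpenStackDeployment specification. The following fragment is illustrative only; the object name and the selected value depend on your deployment:

apiVersion: lcm.mirantis.com/v1alpha1
kind: OpenStackDeployment
metadata:
  name: osh-dev        # hypothetical object name
  namespace: openstack
spec:
  size: medium         # must cover the total number of nodes, including remote sites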

Additionally, the sizing of the remote site needs to take into account the characteristics of the networking channel with the main site.

Typically, an edge site consists of 3-7 compute nodes installed in a single, usually rented, rack.

Network latency and bandwidth

Mirantis recommends keeping the network latency between the main and remote sites as low as possible. For stable interoperability of cluster components, the latency needs to be around 30-70 milliseconds. However, depending on the cluster configuration and the dynamism of the workloads running in the remote site, the stability of the cluster can be preserved with a latency of up to 190 milliseconds.

The bandwidth of the communication channel between the main and remote sites needs to be sufficient to run the following traffic:

  • The control plane and management traffic, such as OpenStack messaging, database access, MOSK underlay Kubernetes cluster control plane, and so on. A single remote compute node in the idle state requires at minimum 1.5 Mbit/s of bandwidth to perform the non-data plane communications.

  • The data plane traffic, such as OpenStack image operations, instances VNC console traffic, and so on, that heavily depend on the profile of the workloads and other aspects of the cloud usage.

In general, Mirantis recommends having a minimum of 100 Mbit/s bandwidth between the main and remote sites.

Loss of connectivity to the central site

MOSK remote compute nodes architecture is designed to tolerate a temporary loss of connectivity between the main cluster and the remote sites. In case of a disconnection, the instances running on remote compute nodes will keep running normally, preserving their ability to read and write ephemeral and block storage data, provided that it is located in the same site, as well as the connectivity to their neighbors and edge application users. However, the instances will not have access to any cloud services or applications located outside of their remote site.

Since the MOSK control plane communicates with remote compute nodes through the same network channel, cloud users will not be able to perform any manipulations, for example, instance creation, deletion, snapshotting, and so on, over their edge applications until the connectivity gets restored. MOSK services providing high availability to cloud applications, such as the Instance HA service and Network service, need to be connected to the remote compute nodes to perform a failover of application components running in the remote site.

Once the connectivity between the main and the remote site is restored, all functions become available again. The period during which an edge application can sustain normal functioning after a connectivity loss is determined by multiple factors, including the selected networking back end for the MOSK cluster. Mirantis recommends that a cloud operator perform a set of test manipulations over the cloud resources hosted in the remote site to ensure that it has been fully restored.

Long-lived graceful restart in Tungsten Fabric

When configured in Tungsten Fabric-powered clouds, the graceful restart and long-lived graceful restart feature significantly improves the ability of MOSK to sustain the connectivity of workloads running at remote sites when a site loses connection to the central location hosting the control plane.

Extensive testing has demonstrated that remote sites can effectively withstand a 72-hour control plane disconnection with zero impact on the running applications.

Security of cross-site communication

Given that a remote site communicates with its main MOSK cluster across a wide area network (WAN), it becomes important to protect sensitive data from being intercepted and viewed by a third party. Specifically, you should ensure the protection of the data belonging to the following cloud components:

  • Mirantis Container Cloud life-cycle management plane

    Bare metal servers provisioning and control, Kubernetes cluster deployment and management, Mirantis StackLight telemetry

  • MOSK control plane

    Communication between the components of OpenStack, Tungsten Fabric, and Mirantis Ceph

  • MOSK data plane

    Cloud application traffic

The most reliable way to protect the data is to configure the network equipment in the data center and the remote site to encapsulate all the bypassing remote-to-main communications into an encrypted VPN tunnel. Alternatively, Mirantis Container Cloud and MOSK can be configured to force encryption of specific types of network traffic, such as:

  • Kubernetes networking for MOSK underlying Kubernetes cluster that handles the vast majority of in-MOSK communications

  • OpenStack tenant networking that carries all the cloud application traffic

The ability to enforce traffic encryption depends on the specific version of the Mirantis Container Cloud and MOSK in use, as well as the selected SDN back end for OpenStack.

Remote compute nodes with Tungsten Fabric

TechPreview

In MOSK, the main cloud that controls the remote compute nodes can be the regional site that hosts the regional cluster and the MOSK control plane. Additionally, it can contain local storage and compute nodes.

The remote compute nodes implementation in MOSK uses Tungsten Fabric as the SDN solution.

The bare metal servers of remote compute nodes are configured as Kubernetes workers hosting the deployments for:

  • Tungsten Fabric vRouter-gateway service

  • Nova-compute

  • Local storage (LVM with iSCSI block storage)

Large clusters

This section describes a validated MOSK cluster architecture that is capable of handling 10,000 instances under a single control plane.

Hardware characteristics
Node roles layout

Role

Nodes count

Server specification

Management cluster Kubernetes nodes

3

  • 16 vCPU 3.4 GHz

  • 32 GB RAM

  • 2 x 480 GB SSD drives

  • 2 x 10 Gbps NICs

MOSK cluster Kubernetes master nodes

3

  • 16 vCPU 3.4 GHz

  • 32 GB RAM

  • 2 x 480 GB SSD drives

  • 2 x 10 Gbps NICs

OpenStack controller nodes

5

  • 64 vCPU 2.5 GHz

  • 256 GB RAM

  • 2 x 240 GB SSD drives

  • 2 x 3.8 TB NVMe drives

  • 2 x 25 Gbps NICs

OpenStack compute and storage nodes

Up to 500 total

  • 64 vCPU 2.5 GHz

  • 256 GB RAM

  • 2 x 240 GB SSD drives

  • 2 x 3.8 TB NVMe drives

  • 2 x 25 Gbps NICs

StackLight nodes

3

  • 64 vCPU 2.5 GHz

  • 256 GB RAM

  • 2 x 240 GB SSD drives

  • 2 x 3.8 TB NVMe drives

  • 2 x 25 Gbps NICs

Cluster architecture
Cluster architecture

Configuration

Value

Dedicated StackLight nodes

Yes

Dedicated Ceph storage nodes

Yes

Dedicated control plane Kubernetes nodes

Yes

Dedicated OpenStack gateway nodes

No, collocated with OpenStack controller nodes

OpenStack networking back end

Open vSwitch, no Distributed Virtual Router

Cluster size in the OpenStackDeployment CR

medium

Cluster validation

The architecture validation is performed by means of simultaneous creation of multiple OpenStack resources of various types and execution of functional tests against each resource. The number of resources hosted in the cluster at the moment when a certain threshold of non-operational resources starts being observed is referred to below as the cluster capacity limit.

Note

A successfully created resource has the Active status in the API and passes the functional tests, for example, its floating IP address is accessible. The MOSK cluster is considered to be able to handle the created resources if it successfully performs the LCM operations including the OpenStack services restart, both on the control and data plane.

Note

The key limiting factor for creating more OpenStack objects in this illustrative setup is hardware resources (vCPU and RAM) available on the compute nodes.

OpenStack resource capacity limits

OpenStack resource

Limit

Instances

11101

Network ports - instances

37337

Network ports - service (avg. per gateway node)

3517

Volumes

2784

Routers

2448

Networks

3383

Orchestration stacks

2419

Hardware resources utilization
Consumed hardware resources by a filled up cluster in the idle state

Node role

Load average

vCPU

RAM in GB

OpenStack controller + gateway

10

10

100

OpenStack compute

30

25

160

Ceph storage

2

2

15

StackLight

10

8

102

Kubernetes master

10

6

13

Cephless cloud

Persistent storage is a key component of any MOSK deployment. Out of the box, MOSK includes an open-source software-defined storage solution (Ceph), which hosts various kinds of cloud application data, such as root and ephemeral disks for virtual machines, virtual machine images, attachable virtual block storage, and object data. In addition, a Ceph cluster usually acts as a storage for the internal MOSK components, such as Kubernetes, OpenStack, StackLight, and so on.

Being distributed and redundant by design, Ceph requires a certain minimum amount of servers, also known as OSD or storage nodes, to work. A production-grade Ceph cluster typically consists of at least nine storage nodes, while a development and test environment may include four to six servers. For details, refer to MOSK cluster hardware requirements.

It is possible to reduce the overall footprint of a MOSK cluster by collocating the Ceph components with hypervisors on the same physical servers; this is also known as hyper-converged design. However, this architecture still may not satisfy the requirements of certain use cases for the cloud.

Standalone telco-edge MOSK clouds typically consist of three to seven servers hosted in a single rack, where every bit of CPU, memory, and disk resources is strictly accounted for and is better dedicated to the cloud workloads rather than to the control plane. For such clouds, where the cluster footprint is more important than the resiliency of the application data storage, it makes sense either not to have a Ceph cluster at all or to replace it with a primitive non-redundant solution.

Enterprise virtualization infrastructure with third-party storage is not a rare strategy among large companies that rely on proprietary storage appliances, provided by NetApp, Dell, HPE, Pure Storage, and other major players in the data storage sector. These industry leaders offer a variety of storage solutions meticulously designed to suit various enterprise demands. Many companies, having already invested substantially in proprietary storage infrastructure, prefer integrating MOSK with their existing storage systems. This approach allows them to leverage this investment rather than incurring new costs and logistical complexities associated with migrating to Ceph.

Architecture
Cephless-architecture

Kind of data

MOSK component

Data storage in Cephless architecture

Configuration

Root and ephemeral disks of instances

Compute service (OpenStack Nova)

  • Compute node local file system (QCOW2 images).

  • Compute node local storage devices (LVM volumes).

    You can select the QCOW2 or LVM back end per compute node.

  • Volumes through the “boot from volume” feature of the Compute service.

    You can select the Boot from volume option when spinning up a new instance as a cloud user.

Volumes

Block Storage service (OpenStack Cinder)

  • MOSK standard LVM+iSCSI back end for the Block Storage service. This aligns in a seamless manner with the concept of hyper-converged design, wherein the LVM volumes are collocated on the compute nodes.

  • Third-party storage.

Enable LVM block storage

Volumes backups

Block Storage service (OpenStack Cinder)

  • External NFS share TechPreview

  • External S3 endpoint TechPreview

Alternatively, you can disable the volume backup functionality.

Backup configuration

Tungsten Fabric database backups

Tungsten Fabric (Cassandra, ZooKeeper)

External NFS share TechPreview

Alternatively, you can disable the Tungsten Fabric database backups functionality.

Tungsten Fabric database

OpenStack database backups

OpenStack (MariaDB)

  • External NFS share TechPreview

  • External S3-compatible storage TechPreview

  • Local file system of one of the MOSK controller nodes. By default, database backups are stored on the local file system on the node where the MariaDB service is running. This imposes a risk to cloud security and resiliency. For enterprise environments, it is a common requirement to store all the backup data externally.

Alternatively, you can disable the database backup functionality.

Results of functional testing

OpenStack Tempest

Local file system of MOSK controller nodes.

The openstack-tempest-run-tests job responsible for running the Tempest suite stores the results of its execution in a volume requested through the pvc-tempest PersistentVolumeClaim (PVC). The subject volume can be created by the local volume provisioner on the same Kubernetes worker node, where the job runs. Usually, it is a MOSK controller node.

Run Tempest tests

Instance images and snapshots

Image service (OpenStack Glance)

You can configure the Block Storage service (OpenStack Cinder) to be used as a storage back end for images and snapshots. In this case, each image is represented as a volume.

Important

Representing images as volumes implies a hard requirement for the selected block storage back end to support the multi-attach capability, that is, concurrent reads and writes to and from a single volume.

Enable Cinder back end for Glance

Application object data

Object storage service (Ceph RADOS Gateway)

External S3, Swift, or any other third-party storage solutions compatible with object access protocols.

Note

An external object storage solution will not be integrated into the MOSK identity service (OpenStack Keystone); the cloud applications will need to manage access to their object data themselves.

If Ceph is not deployed as part of a cluster, the MOSK built-in Object Storage service API endpoints are disabled automatically.

Logs, metrics, alerts

Mirantis StackLight (Prometheus, Alertmanager, Patroni, OpenSearch)

Local file system of MOSK controller nodes.

StackLight must be deployed in the HA mode, in which all its data is stored on the local file system of the nodes running StackLight services. In this mode, StackLight components are configured to handle the data replication themselves.

StackLight deployment architecture

Limitations
  • The determination of whether a MOSK cloud will include Ceph or not should take place during its planning and design phase. Once the deployment is complete, reconfiguring the cloud to switch between Ceph and non-Ceph architectures becomes impossible.

  • Mirantis recommends avoiding substitution of Ceph-backed persistent volumes in the MOSK underlying Kubernetes cluster with local volumes (local volume provisioner) for production environments. MOSK does not support such a configuration unless the components that rely on these volumes can replicate their data themselves, for example, StackLight. Volumes provided by the local volume provisioner are not redundant, as they are bound to a single node and can only be mounted by the Kubernetes pods running on that node.

Node maintenance API

This section describes the internal implementation of the node maintenance API and how the OpenStack and Tungsten Fabric Controllers communicate with LCM and each other during a managed cluster update.

Node maintenance API objects

The node maintenance API consists of the following objects:

  • Cluster level:

    • ClusterWorkloadLock

    • ClusterMaintenanceRequest

  • Node level:

    • NodeWorkloadLock

    • NodeMaintenanceRequest

WorkloadLock objects

The WorkloadLock objects are created by each Application Controller. These objects prevent LCM from performing any changes on the cluster or node level while the lock is in the active state. The inactive state of the lock means that the Application Controller has finished its work and the LCM can proceed with the node or cluster maintenance.

ClusterWorkloadLock object example configuration
apiVersion: lcm.mirantis.com/v1alpha1
kind: ClusterWorkloadLock
metadata:
  name: cluster-1-openstack
spec:
  controllerName: openstack
status:
  state: active # inactive;active;failed (default: active)
  errorMessage: ""
  release: "6.16.0+21.3"
NodeWorkloadLock object example configuration
apiVersion: lcm.mirantis.com/v1alpha1
kind: NodeWorkloadLock
metadata:
  name: node-1-openstack
spec:
  nodeName: node-1
  controllerName: openstack
status:
  state: active # inactive;active;failed (default: active)
  errorMessage: ""
  release: "6.16.0+21.3"
MaintenanceRequest objects

The MaintenanceRequest objects are created by LCM. These objects notify Application Controllers about the upcoming maintenance of a cluster or a specific node.

ClusterMaintenanceRequest object example configuration
apiVersion: lcm.mirantis.com/v1alpha1
kind: ClusterMaintenanceRequest
metadata:
  name: cluster-1
spec:
  scope: drain # drain;os
NodeMaintenanceRequest object example configuration
 apiVersion: lcm.mirantis.com/v1alpha1
 kind: NodeMaintenanceRequest
 metadata:
   name: node-1
 spec:
   nodeName: node-1
   scope: drain # drain;os

The scope parameter in the object specification defines the impact on the managed cluster or node. The list of available options includes:

  • drain

    A regular managed cluster update. Each node in the cluster goes over a drain procedure. No node reboot takes place; the maximum impact includes a restart of services on the node, including Docker, which causes the restart of all containers present in the cluster.

  • os

    A node might be rebooted during the update. Triggers the workload evacuation by the OpenStack Controller.

When the MaintenanceRequest object is created, an Application Controller executes a handler to prepare workloads for maintenance and put appropriate WorkloadLock objects into the inactive state.

When maintenance is over, LCM removes the MaintenanceRequest objects, and the Application Controllers move their WorkloadLock objects into the active state.
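
During maintenance, a cloud operator can observe this handshake directly in the cluster API. The following is a minimal sketch, assuming the standard plural resource names of the custom resources shown above:

# Watch the cluster-level and node-level locks and their states
kubectl get clusterworkloadlocks -o custom-columns=NAME:.metadata.name,STATE:.status.state
kubectl get nodeworkloadlocks -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,STATE:.status.state

# List the maintenance requests created by LCM
kubectl get clustermaintenancerequests
kubectl get nodemaintenancerequests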

OpenStack Controller maintenance API

When LCM creates the ClusterMaintenanceRequest object, the OpenStack Controller ensures that all OpenStack components are in the Healthy state, which means that the pods are up and running, and the readiness probes are passing.

The ClusterMaintenanceRequest object creation flow:

ClusterMaintenanceRequest - create

When LCM creates the NodeMaintenanceRequest, the OpenStack Controller:

  1. Prepares components on the node for maintenance by removing nova-compute from scheduling.

  2. If the reboot of a node is possible, the instance migration workflow is triggered. The Operator can configure the instance migration flow through Kubernetes node annotations and should define the required option before the managed cluster update. A sketch of setting these annotations with kubectl is provided after this list.

    To mitigate the potential impact on the cloud workloads, you can define the instance migration flow for the compute nodes running the most valuable instances.

    The list of available options for the instance migration configuration includes:

    • The openstack.lcm.mirantis.com/instance_migration_mode annotation:

      • live

        Default. The OpenStack Controller live migrates instances automatically. The update mechanism tries to move the memory and local storage of all instances on the node to another node without interruption before applying any changes to the node. By default, the update mechanism makes three attempts to migrate each instance before falling back to the manual mode.

        Note

        Success of live migration depends on many factors including the selected vCPU type and model, the amount of data that needs to be transferred, the intensity of the disk IO and memory writes, the type of the local storage, and others. Instances using the following product features are known to have issues with live migration:

        • LVM-based ephemeral storage with and without encryption

        • Encrypted block storage volumes

        • CPU and NUMA node pinning

      • manual

        The OpenStack Controller waits for the Operator to migrate instances from the compute node. When it is time to update the compute node, the update mechanism asks you to manually migrate the instances and proceeds only once you confirm the node is safe to update.

      • skip

        The OpenStack Controller skips the instance check on the node and reboots it.

        Note

        For the clouds relying on the converged LVM with iSCSI block storage that offers persistent volumes in a remote edge sub-region, it is important to keep in mind that applying a major change to a compute node may impact not only the instances running on this node but also the instances attached to the LVM devices hosted there. Mirantis recommends that in such environments you perform the update procedure in the manual mode with mitigation measures taken by the Operator for each compute node. Otherwise, all the instances that have LVM with iSCSI volumes attached would require a reboot to restore connectivity.

    • The openstack.lcm.mirantis.com/instance_migration_attempts annotation

      Defines the number of times the OpenStack Controller attempts to migrate a single instance before giving up. Defaults to 3.

    Note

    You can also use annotations to control the update of non-compute nodes if they represent critical points of a specific cloud architecture. For example, setting the instance_migration_mode to manual on a controller node with a collocated gateway (Open vSwitch) will allow the Operator to gracefully shut down all the virtual routers hosted on this node.

  3. If the OpenStack Controller cannot migrate instances due to errors, the update process is suspended until all instances are migrated manually or the openstack.lcm.mirantis.com/instance_migration_mode annotation is set to skip.
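
The following sketch shows how the annotations described above can be set with kubectl; the node name is a placeholder and the selected values are only an example:

# Require manual instance migration for a specific compute node
kubectl annotate node <node-name> openstack.lcm.mirantis.com/instance_migration_mode=manual

# Increase the number of automatic migration attempts per instance
kubectl annotate node <node-name> --overwrite openstack.lcm.mirantis.com/instance_migration_attempts=5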

The NodeMaintenanceRequest object creation flow:

NodeMaintenanceRequest - create

When the node maintenance is over, LCM removes the NodeMaintenanceRequest object and the OpenStack Controller:

  • Verifies that the Kubernetes Node becomes Ready.

  • Verifies that all OpenStack components on a given node are Healthy, which means that the pods are up and running, and the readiness probes are passing.

  • Ensures that the OpenStack components are connected to RabbitMQ. For example, the Neutron Agents become alive on the node, and the compute services are in the UP state.

Note

The OpenStack Controller allows you to have only one NodeWorkloadLock object at a time in the inactive state. Therefore, the update process for nodes is sequential.

The NodeMaintenanceRequest object removal flow:

NodeMaintenanceRequest - delete

When the cluster maintenance is over, the OpenStack Controller sets the ClusterWorkloadLock object back to active and the update completes.

The ClusterMaintenanceRequest object removal flow:

ClusterMaintenanceRequest - delete
Tungsten Fabric Controller maintenance API

The Tungsten Fabric (TF) Controller creates and uses both types of workload locks: ClusterWorkloadLock and NodeWorkloadLock.

When the ClusterMaintenanceRequest object is created, the TF Controller verifies the TF cluster health status and proceeds as follows:

  • If the cluster is Ready, the TF Controller moves the ClusterWorkloadLock object to the inactive state.

  • Otherwise, the TF Controller keeps the ClusterWorkloadLock object in the active state.

When the NodeMaintenanceRequest object is created, the TF Controller verifies the vRouter pod state on the corresponding node and proceeds as follows:

  • If all containers are Ready, the TF Controller moves the NodeWorkloadLock object to the inactive state.

  • Otherwise, the TF Controller keeps the NodeWorkloadLock in the active state.

Note

If there is a NodeWorkloadLock object in the inactive state present in the cluster, the TF Controller does not process the NodeMaintenanceRequest object for other nodes until this inactive NodeWorkloadLock object becomes active.

When the cluster LCM removes the MaintenanceRequest object, the TF Controller waits for the vRouter pods to become ready and proceeds as follows:

  • If all containers are in the Ready state, the TF Controller moves the NodeWorkloadLock object to the active state.

  • Otherwise, the TF Controller keeps the NodeWorkloadLock object in the inactive state.

Cluster update flow

This section describes the MOSK cluster update flow to the product releases that contain major updates and require a node reboot, such as support for a new Linux kernel.

The diagram below illustrates the sequence of operations controlled by LCM that take place during the update under the hood. We assume that the ClusterWorkloadLock and NodeWorkloadLock objects present in the cluster are in the active state before the cloud operator triggers the update.

Cluster update flow

See also

For details about the Application Controllers flow during different maintenance stages, refer to:

Phase 1: The Operator triggers the update
  1. The Operator sets appropriate annotations on nodes and selects a suitable migration mode for workloads.

  2. The Operator triggers the managed cluster update through the Mirantis Container Cloud web UI as described in Step 2. Initiate MOSK cluster update.

  3. LCM creates the ClusterMaintenanceRequest object and notifies the Application Controllers about the planned maintenance.

Phase 2: LCM triggers the OpenStack and Ceph update
  1. The OpenStack update starts.

  2. Ceph is waiting for the OpenStack ClusterWorkloadLock object to become inactive.

  3. When the OpenStack update is finalized, the OpenStack Controller marks ClusterWorkloadLock as inactive.

  4. The Ceph Controller triggers an update of the Ceph cluster.

  5. When the Ceph update is finalized, Ceph marks the ClusterWorkloadLock object as inactive.

Phase 3: LCM initiates the Kubernetes master nodes update
  1. If a master node has collocated roles, LCM creates a NodeMaintenanceRequest for the node.

  2. All Application Controllers mark their NodeWorkloadLock objects for this node as inactive.

  3. LCM starts draining the node by gracefully moving out all pods from the node. The DaemonSet pods are not evacuated and are left running.

  4. LCM downloads the new version of the LCM Agent and runs its states.

    Note

    While running Ansible states, the services on the node may be restarted.

  5. The above flow is applied to all Kubernetes master nodes one by one.

  6. LCM removes the NodeMaintenanceRequest.

Phase 4: LCM initiates the Kubernetes worker nodes update
  1. LCM creates a NodeMaintenanceRequest for the node, specifying the scope.

  2. Application Controllers start preparing the node according to the scope.

  3. LCM waits until all Application Controllers mark their NodeWorkloadLock objects for this node as inactive.

  4. All pods are evacuated from the node by draining it. This does not apply to the DaemonSet pods, which cannot be removed.

  5. LCM downloads the new version of the LCM Agent and runs its states.

    Note

    While running Ansible states, the services on the node may be restarted.

  6. The above flow is applied to all Kubernetes worker nodes one by one.

  7. LCM removes the NodeMaintenanceRequest.

Phase 5: Finalization
  1. LCM triggers the update for all other applications present in the cluster, such as StackLight, Tungsten Fabric, and others.

  2. LCM removes ClusterMaintenanceRequest.

After a while, the cluster update completes and the cluster becomes fully operable again.

Parallelizing node update operations

Available since MOSK 23.2 TechPreview

MOSK enables you to parallelize node update operations, significantly improving the efficiency of your deployment. This capability applies to any operation that utilizes the Node Maintenance API, such as cluster updates or graceful node reboots.

The core implementation of parallel updates is handled by the LCM Controller, ensuring seamless execution of parallel operations. LCM starts performing an operation on a node only when all NodeWorkloadLock objects for that node are marked as inactive. By default, the LCM Controller creates one NodeMaintenanceRequest at a time.

Each application controller, including Ceph, OpenStack, and Tungsten Fabric Controllers, manages parallel NodeMaintenanceRequest objects independently. The controllers determine how to handle and execute parallel node maintenance requests based on specific requirements of their respective applications. To understand the workflow of the Node Maintenance API, refer to WorkloadLock objects.

Enhancing parallelism during node updates
  1. Set the nodes update order.

    You can optimize parallel updates by setting the order in which nodes are updated. You can accomplish this by configuring upgradeIndex of the Machine object. For the procedure, refer to Mirantis Container Cloud: Change upgrade order for machines.

  2. Increase parallelism.

    Boost parallelism by adjusting the maximum number of worker node updates allowed during LCM operations using the spec.providerSpec.value.maxWorkerUpgradeCount configuration parameter, which is set to 1 by default (see the sketch after this list).

    For configuration details, refer to Mirantis Container Cloud: Configure the parallel update of worker nodes.

  3. Execute LCM operations.

    Run LCM operations, such as cluster updates, taking advantage of the increased parallelism.
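
The following sketch shows how the parameter from step 2 can be adjusted by patching the Cluster object of the managed cluster in the management cluster API; the project and cluster names are placeholders and the value of 3 is only an example:

kubectl -n <managed-cluster-project-name> patch cluster <managed-cluster-name> \
  --type merge -p '{"spec":{"providerSpec":{"value":{"maxWorkerUpgradeCount":3}}}}'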

OpenStack nodes update

By default, the OpenStack Controller handles the NodeMaintenanceRequest objects as follows:

  • Updates the OpenStack controller nodes sequentially (one by one).

  • Updates the gateway nodes sequentially. Technically, you can increase the number of gateway node upgrades allowed in parallel using the nwl_parallel_max_gateway parameter, but Mirantis does not recommend doing so.

  • Updates the compute nodes in parallel. The default number of allowed parallel updates is 30. You can adjust this value through the nwl_parallel_max_compute parameter.

    Parallelism considerations for compute nodes

    When considering parallelism for compute nodes, take into account that during certain pod restarts, for example, the openvswitch-vswitchd pods, a brief instance downtime may occur. Select a suitable level of parallelism to minimize the impact on workloads and prevent excessive load on the control plane nodes.

    If your cloud environment is distributed across failure domains, which are represented by Nova availability zones, you can limit the parallel updates of nodes to only those within the same availability zone. This behavior is controlled by the respect_nova_az option in the OpenStack Controller.

The OpenStack Controller configuration is stored in the openstack-controller-config configMap of the osh-system namespace. The options are picked up automatically after update. To learn more about the OpenStack Controller configuration parameters, refer to OpenStack Controller configuration.
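
For example, you can review and change the current values directly in this ConfigMap; the exact layout of the keys inside it may vary between releases:

# Review the current OpenStack Controller settings, including nwl_parallel_max_compute
kubectl -n osh-system get configmap openstack-controller-config -o yaml

# Edit the settings; the OpenStack Controller picks up the changes automatically
kubectl -n osh-system edit configmap openstack-controller-config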

Ceph nodes update

By default, the Ceph Controller handles the NodeMaintenanceRequest objects as follows:

  • Updates the non-storage nodes sequentially. Non-storage nodes include all nodes that have mon, mgr, rgw, or mds roles.

  • Updates storage nodes in parallel. The default number of allowed parallel updates is calculated automatically based on the minimal failure domain in a Ceph cluster.

    Parallelism calculations for storage nodes

    The Ceph Controller automatically calculates the parallelism number in the following way:

    • Finds the minimal failure domain for a Ceph cluster. For example, the minimal failure domain is rack.

    • Filters all currently requested nodes by the minimal failure domain. For example, parallelism equals 5, and LCM requests 3 nodes from the rack1 rack and 2 nodes from the rack2 rack.

    • Handles each filtered node group one by one. For example, the controller handles all nodes from rack1 in parallel before processing nodes from rack2.

The Ceph Controller handles non-storage nodes before the storage ones. If there are node requests for both node types, the Ceph Controller handles sequentially the non-storage nodes first. Therefore, Mirantis recommends setting the upgrade index of a higher priority for the non-storage nodes to decrease the total upgrade time.

If the minimal failure domain is host, the Ceph Controller updates only one storage node per failure domain unit. This results in updating all Ceph nodes sequentially, despite the potential for increased parallelism.

Tungsten Fabric nodes update

By default, the Tungsten Fabric Controller handles the NodeMaintenanceRequest objects as follows:

  • Updates the Tungsten Fabric Controller and gateway nodes sequentially.

  • Updates the vRouter nodes in parallel. The Tungsten Fabric Controller allows updating up to 30 vRouter nodes in parallel.

    Maximum amount of vRouter nodes in maintenance

    While the Tungsten Fabric Controller has the capability to process up to 30 NodeMaintenanceRequest objects targeted to vRouter nodes, the actual amount may be lower. This is due to a check that ensures OpenStack readiness to unlock the relevant nodes for maintenance. If OpenStack allows for maintenance, the Tungsten Fabric Controller verifies the vRouter pods. Upon successful verification, the NodeWorkloadLock object is switched to the maintenance mode.

Deployment Guide

Mirantis OpenStack for Kubernetes (MOSK) enables the operator to create, scale, update, and upgrade OpenStack deployments on Kubernetes through a declarative API.

The Kubernetes built-in features, such as flexibility, scalability, and declarative resource definition make MOSK a robust solution.

Plan the deployment

The detailed plan of any Mirantis OpenStack for Kubernetes (MOSK) deployment is determined on a per-cloud basis. For the MOSK reference architecture and design overview, see Reference Architecture.

Also, read through Mirantis Container Cloud Reference Architecture: Container Cloud bare metal as a MOSK cluster is deployed on top of a bare metal cluster managed by Mirantis Container Cloud.

Note

One of the industry best practices is to verify every new update or configuration change in a non-customer-facing environment before applying it to production. Therefore, Mirantis recommends having a staging cloud, deployed and maintained along with the production clouds. The recommendation is especially applicable to the environments that:

  • Receive updates often and use continuous delivery. For example, any non-isolated deployment of Mirantis Container Cloud.

  • Have significant deviations from the reference architecture or third party extensions installed.

  • Are managed under the Mirantis OpsCare program.

  • Run business-critical workloads where even the slightest application downtime is unacceptable.

A typical staging cloud is a complete copy of the production environment including the hardware and software configurations, but with a bare minimum of compute and storage capacity.

Provision a Container Cloud bare metal management cluster

The bare metal management system enables the Infrastructure Operator to deploy Container Cloud on a set of bare metal servers. It also enables Container Cloud to deploy MOSK clusters on bare metal servers without a pre-provisioned operating system.

To provision your bare metal management cluster, refer to Mirantis Container Cloud Deployment Guide: Deploy a baremetal-based management cluster.

Create a managed cluster

After bootstrapping your baremetal-based Mirantis Container Cloud management cluster, you can create a baremetal-based managed cluster to deploy Mirantis OpenStack for Kubernetes using the Container Cloud API.

Add a bare metal host

Before creating a bare metal managed cluster, add the required number of bare metal hosts using CLI and YAML files for configuration. This section describes how to add bare metal hosts using the Container Cloud CLI during a managed cluster creation.

To add a bare metal host:

  1. Verify that you configured each bare metal host as follows:

    • Enable the boot NIC support for UEFI load. Usually, at least the built-in network interfaces support it.

    • Enable the UEFI-LAN-OPROM support in BIOS -> Advanced -> PCI/PCIe.

    • Enable the IPv4-PXE stack.

    • Set the following boot order:

      1. UEFI-DISK

      2. UEFI-PXE

    • If your PXE network is not configured to use the first network interface, fix the UEFI-PXE boot order to speed up node discovery by selecting only one required network interface.

    • Power off all bare metal hosts.

    Warning

    Only one Ethernet port on a host must be connected to the Common/PXE network at any given time. The physical address (MAC) of this interface must be noted and used to configure the BareMetalHost object describing the host.

  2. Log in to the host where your management cluster kubeconfig is located and where kubectl is installed.

  3. Describe the unique credentials of the new bare metal host:

    Create a YAML file that describes the unique credentials of the new bare metal host as a BareMetalHostCredential object.

    apiVersion: kaas.mirantis.com/v1alpha1
    kind: BareMetalHostCredential
    metadata:
      labels:
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: <bare-metal-host-credential-unique-name>
      namespace: <managed-cluster-project-name>
    spec:
      username: <ipmi-user-name>
      password:
        value: <ipmi-user-password>
    
    • In the metadata section, add a unique credentials name and the name of the non-default project (namespace) dedicated for the managed cluster being created.

    • In the spec section, add the IPMI user name and password in plain text to access the Baseboard Management Controller (BMC). The password will not be stored in the BareMetalHostCredential object but will be erased and saved in an underlying Secret object.

    Caution

    Each bare metal host must have a unique BareMetalHostCredential. For details about the BareMetalHostCredential object, refer to Mirantis Container Cloud API Reference: BareMetalHostCredential.

    Alternatively, create a secret YAML file that describes the unique credentials of the new bare metal host. Example of the bare metal host secret:

    apiVersion: v1
    data:
      password: <credentials-password>
      username: <credentials-user-name>
    kind: Secret
    metadata:
      labels:
        kaas.mirantis.com/credentials: "true"
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: <credentials-name>
      namespace: <managed-cluster-project-name>
    type: Opaque
    
    • In the data section, add the IPMI user name and password in the base64 encoding to access the BMC. To obtain the base64-encoded credentials, you can use the following command in your Linux console:

      echo -n <username|password> | base64
      

      Caution

      Each bare metal host must have a unique Secret.

    • In the metadata section, add the unique name of credentials and the name of the non-default project (namespace) dedicated for the managed cluster being created. To create a project, refer to Mirantis Container Cloud Operations Guide: Create a project for managed clusters.

  4. Apply this secret YAML file to your deployment:

    kubectl apply -f ${<bmh-cred-file-name>}.yaml
    
  5. Create a YAML file that contains a description of the new bare metal host:

    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHost
    metadata:
      annotations:
        kaas.mirantis.com/baremetalhost-credentials-name: <bare-metal-host-credential-unique-name>
      labels:
        kaas.mirantis.com/baremetalhost-id: <unique-bare-metal-host-hardware-node-id>
        hostlabel.bm.kaas.mirantis.com/worker: "true"
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: <bare-metal-host-unique-name>
      namespace: <managed-cluster-project-name>
    spec:
      bmc:
        address: <ip-address-for-bmc-access>
        credentialsName: <credentials-name>
      bootMACAddress: <bare-metal-host-boot-mac-address>
      online: true
    
    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHost
    metadata:
      labels:
        kaas.mirantis.com/baremetalhost-id: <unique-bare-metal-host-hardware-node-id>
        hostlabel.bm.kaas.mirantis.com/worker: "true"
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: <bare-metal-host-unique-name>
      namespace: <managed-cluster-project-name>
    spec:
      bmc:
        address: <ip-address-for-bmc-access>
        credentialsName: <credentials-name>
      bootMACAddress: <bare-metal-host-boot-mac-address>
      online: true
    

    For a detailed fields description, see Mirantis Container Cloud API Reference: BareMetalHost.

  6. Apply this configuration YAML file to your deployment:

    kubectl create -f ${<bare-metal-host-config-file-name>}.yaml
    
  7. Verify the new BareMetalHost object status:

    kubectl get -n <managed-cluster-project-name> bmh -o wide <bare-metal-host-unique-name>
    

    Example of system response:

    NAMESPACE    NAME   STATUS   STATE      CONSUMER  BMC                        BOOTMODE  ONLINE  ERROR  REGION
    my-project   bmh1   OK       preparing