This documentation provides information on how to deploy and operate a
Mirantis OpenStack for Kubernetes (MOSK) environment.
The documentation is intended to help operators understand the core
concepts of the product and provides sufficient information to deploy
and operate the solution.
The information provided in this documentation set is constantly
improved and amended based on the feedback and requests from
MOSK consumers.
The following table lists the guides included in the documentation set you
are reading:
This documentation is intended for engineers who have the basic knowledge of
Linux, virtualization and containerization technologies, Kubernetes API and
CLI, Helm and Helm charts, Mirantis Kubernetes Engine (MKE), and OpenStack.
GUI elements that include any part of the interactive user interface and
menu navigation
Superscript
Some extra, brief information
Note
The Note block
Messages of a generic meaning that may be useful for the user
Caution
The Caution block
Information that prevents a user from making mistakes and encountering
undesirable consequences when following the procedures
Warning
The Warning block
Messages that include details that can be easily missed, but should not
be ignored by the user and are valuable to review before proceeding
See also
The See also block
List of references that may be helpful for understanding related
tools, concepts, and so on
Learn more
The Learn more block
Used in the Release Notes to wrap a list of internal references to
the reference architecture, deployment and operation procedures specific
to a newly implemented product feature
Mirantis OpenStack for Kubernetes (MOSK) combines the power of
Mirantis Container Cloud for delivering and managing Kubernetes clusters, with
the industry standard OpenStack APIs, enabling you to build your own cloud
infrastructure.
The advantages of running all of the OpenStack components as a Kubernetes
application are numerous and include the following:
Zero downtime, non-disruptive updates
Fully automated Day-2 operations
Full-stack management from bare metal through the operating system and
all the necessary components
The list of the most common use cases includes:
Software-defined data center
The traditional data center requires multiple requests and interactions
to deploy new services. By abstracting the data center functionality
behind a standardized set of APIs, services can be deployed faster and
more efficiently. MOSK enables you to define all your
data center resources behind the industry-standard OpenStack APIs, allowing you
to automate the deployment of applications or simply request resources
through the UI to quickly and efficiently provision virtual machines,
storage, networking, and other resources.
Virtual Network Functions (VNFs)
VNFs require high-performance systems that can be accessed on demand in
a standardized way, with assurances that they will have access to the
necessary resources and performance guarantees when needed.
MOSK provides extensive support for VNF workloads, enabling
easy access to functionality
such as Intel EPA (NUMA, CPU pinning, Huge Pages), as well as the consumption
of specialized network interface cards to support SR-IOV and DPDK.
The centralized management model of MOSK and Mirantis
Container Cloud also enables the easy management of multiple
MOSK deployments with full lifecycle management.
Legacy workload migration
With the industry moving toward cloud-native technologies, many older or
legacy applications cannot be moved easily, and often it does not
make financial sense to transform them into cloud-native
applications. MOSK provides a stable cloud platform that
can cost-effectively host legacy applications while still providing the
expected levels of control, customization, and uptime.
Mirantis OpenStack for Kubernetes (MOSK) is a virtualization
platform that provides an infrastructure for cloud-ready applications,
in combination with reliability and full control over the data.
MOSK combines OpenStack, an open-source cloud
infrastructure software, with application management techniques used
in the Kubernetes ecosystem that include container isolation, state
enforcement, declarative definition of deployments, and others.
MOSK integrates with Mirantis Container Cloud to rely
on its capabilities for bare-metal infrastructure provisioning, Kubernetes
cluster management, and continuous delivery of the stack components.
MOSK simplifies the work of a cloud operator by
automating all major cloud life cycle management routines including
cluster updates and upgrades.
A Mirantis OpenStack for Kubernetes (MOSK) deployment profile is a
thoroughly tested and officially supported reference architecture that is
guaranteed to work at a specific scale and is tailored to the demands of
a specific business case, such as generic IaaS cloud, Network Function
Virtualization infrastructure, Edge Computing, and others.
A deployment profile is defined as a combination of:
Services and features the cloud offers to its users.
Non-functional characteristics that users and operators should expect when
running the profile on top of a reference hardware configuration. Including,
but not limited to:
Performance characteristics, such as an average network throughput between
VMs in the same virtual network.
Reliability characteristics, such as the cloud API error response rate when
recovering a failed controller node.
Scalability characteristics, such as the total number of virtual routers
that tenants can run simultaneously.
Hardware requirements - the specification of physical servers and
networking equipment required to run the profile in production.
Deployment parameters that a cloud operator can tweak within a
certain range without the risk of breaking the cloud or losing support.
In addition, the following items may be included in a definition:
Compliance-driven technical requirements, such as TLS encryption of all
external API endpoints.
Foundation-level software components, such as Tungsten Fabric or
Open vSwitch as a back end for the networking service.
Note
Mirantis reserves the right to revise the technical implementation of any
profile at will while preserving its definition - the functional
and non-functional characteristics that operators and users are known
to rely on.
MOSK supports a number of deployment profiles
to address a wide variety of business tasks. The table below includes the
profiles for the most common use cases.
Note
Some components of a MOSK cluster are mandatory and
are installed during the managed cluster deployment by Container Cloud
regardless of the deployment profile in use. StackLight is one of the
cluster components that are enabled by default. See Container Cloud
Operations Guide for details.
Provides the core set of services an IaaS vendor would need,
including some extra functionality. The profile is designed to
support up to 50-70 compute nodes and a reasonable number of
storage nodes.
The core set of services provided by the profile includes:
Compute (Nova)
Images (Glance)
Networking (Neutron with Open vSwitch as a back end)
Telemetry services are optional components with the Technology preview
status and should be enabled together through the list of services to be
deployed in the OpenStackDeployment CR as described in
Deploy an OpenStack cluster.
The HelmBundle Operator is the realization of the Kubernetes Operator
pattern that provides a Kubernetes custom resource of the HelmBundle
kind and code running inside a pod in Kubernetes. This code handles changes,
such as creation, update, and deletion, in the Kubernetes resources of this
kind by deploying, updating, and deleting groups of Helm releases from
specified Helm charts with specified values.
The OpenStack platform manages virtual infrastructure resources, including
virtual servers, storage devices, networks, and networking services, such as
load balancers, as well as provides management functions to the tenant users.
Various OpenStack services are running as pods in Kubernetes and are
represented as appropriate native Kubernetes resources, such as
Deployments, StatefulSets, and DaemonSets.
For a simple, resilient, and flexible deployment of OpenStack and related
services on top of a Kubernetes cluster, MOSK uses
OpenStack-Helm, which provides the required collection of Helm charts.
Also, MOSK uses OpenStack Operator as the realization
of the Kubernetes Operator pattern. The OpenStack Operator provides a custom
Kubernetes resource of the OpenStackDeployment kind and code running
inside a pod in Kubernetes. This code handles changes such as creation,
update, and deletion in the Kubernetes resources of this kind by
deploying, updating, and deleting groups of the Helm releases.
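For illustration only, a minimal OpenStackDeployment resource may look like the
following sketch. The apiVersion, preset, and size values shown here are
assumptions for the purpose of the example; refer to the deployment
documentation for the authoritative schema and the full set of mandatory fields:

apiVersion: lcm.mirantis.com/v1alpha1
kind: OpenStackDeployment
metadata:
  name: osh-dev
  namespace: openstack
spec:
  openstack_version: victoria
  preset: compute
  size: tiny
  public_domain_name: cloud.example.org
  features:
    services: []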
Ceph is a distributed storage platform that provides storage resources,
such as objects and virtual block devices, to virtual and physical
infrastructure.
MOSK uses Rook as the implementation of the
Kubernetes Operator pattern that manages resources of the CephCluster
kind to deploy and
manage Ceph services as pods on top of Kubernetes to provide Ceph-based
storage to the consumers, which include OpenStack services, such as Volume
and Image services, and underlying Kubernetes through Ceph CSI (Container
Storage Interface).
The Ceph Controller is the implementation of the Kubernetes Operator
pattern that manages resources of the MiraCeph kind to simplify
management of the Rook-based Ceph clusters.
The StackLight component is responsible for collection, analysis, and
visualization of critical monitoring data from physical and virtual
infrastructure, as well as alerting and error notifications through
a configured communication system, such as email. StackLight includes
the following key sub-components:
This section provides hardware requirements for the Mirantis Container
Cloud management cluster with a managed Mirantis OpenStack for Kubernetes
(MOSK) cluster.
For installing MOSK, the Mirantis Container Cloud management
cluster and managed cluster must be deployed with the bare metal provider.
Important
A MOSK cluster is to be used for a
deployment of an OpenStack cluster and its components. Deployment of
third-party workloads on a MOSK cluster is neither
allowed nor supported.
Note
One of the industry best practices is to verify every new update or
configuration change in a non-customer-facing environment before
applying it to production. Therefore, Mirantis recommends
having a staging cloud, deployed and maintained along with the production
clouds. The recommendation is especially applicable to the environments
that:
Receive updates often and use continuous delivery. For example,
any non-isolated deployment of Mirantis Container Cloud.
Have significant deviations from the reference architecture or
third party extensions installed.
Are managed under the Mirantis OpsCare program.
Run business-critical workloads where even the slightest application
downtime is unacceptable.
A typical staging cloud is a complete copy of the production environment
including the hardware and software configurations, but with a bare minimum
of compute and storage capacity.
The table below describes the node types the MOSK reference
architecture includes.
The Container Cloud management cluster architecture on bare metal
requires three physical servers for manager nodes. On these hosts,
we deploy a Kubernetes cluster with services that provide Container
Cloud control plane functions.
OpenStack control plane node and StackLight node
Host OpenStack control plane services such as database, messaging, API,
schedulers conductors, and L3 and L2 agents, as well as the StackLight
components.
Note
MOSK enables the cloud operator to
collocate the OpenStack control plane with the managed cluster master
nodes on OpenStack deployments of a small size. This capability
is available as technical preview. Use such a configuration for testing
and evaluation purposes only.
Tenant gateway node
Optional. Hosts OpenStack gateway services including L2, L3, and DHCP
agents. The tenant gateway nodes can be combined with OpenStack control
plane nodes. The strict requirement is a dedicated physical network
(bond) for tenant network traffic.
Tungsten Fabric control plane node
Required only if Tungsten Fabric is enabled as a back end for the
OpenStack networking. These nodes host the TF control plane services
such as Cassandra database, messaging, API, control, and configuration
services.
Tungsten Fabric analytics node
Required only if Tungsten Fabric is enabled as a back end for the
OpenStack networking. These nodes host the TF analytics services
such as Cassandra, ZooKeeper, and collector.
Compute node
Hosts the OpenStack Compute services such as QEMU, L2 agents, and
others.
Infrastructure nodes
Run the underlying Kubernetes cluster management services.
The MOSK reference configuration requires a minimum of
three infrastructure nodes.
The table below specifies the hardware resources the MOSK
reference architecture recommends for each node type.
The exact hardware specifications and number of the control plane
and gateway nodes depend on a cloud configuration and scaling needs.
For example, for the clouds with more than 12,000 Neutron ports, Mirantis
recommends increasing the number of gateway nodes.
TF control plane and analytics nodes can be combined with a respective
addition of RAM, CPU, and disk space to the hardware hosts. However,
Mirantis does not recommend such a configuration for production environments
because it increases the risk of cluster downtime if one of the nodes
unexpectedly fails.
A Ceph cluster with 3 Ceph nodes does not provide hardware fault
tolerance and is not eligible for recovery operations,
such as a disk or an entire node replacement.
A Ceph cluster uses the replication factor that equals 3.
If the number of Ceph OSDs is less than 3, a Ceph cluster moves
to the degraded state with the write operations restriction until
the number of alive Ceph OSDs equals the replication factor again.
If you would like to evaluate the MOSK
capabilities and do not have much hardware at your disposal,
you can deploy it in a virtual environment. For example, on
top of another OpenStack cloud using the sample Heat templates.
Note that the tooling is provided for reference only and is not
part of the product itself. Mirantis does not guarantee its
interoperability with any MOSK version.
The management cluster requires a minimum of two storage devices per node.
Each device is used for a different type of storage:
One storage device for boot partitions and root file system.
SSD is recommended. A RAID device is not supported.
One storage device per server is reserved for local persistent
volumes. These volumes are served by the Local Storage Static Provisioner,
that is local-volume-provisioner, and used by many services of Mirantis
Container Cloud.
The seed node is only necessary to deploy the management cluster.
When the bootstrap is complete, the bootstrap node can be
discarded and added back to the MOSK cluster as a node of
any type.
The minimum reference system requirements for a baremetal-based bootstrap
seed node are as follows:
Basic Ubuntu 18.04 server with the following configuration:
Kernel of version 4.15.0-76.86 or later
8 GB of RAM
4 CPU
10 GB of free disk space for the bootstrap cluster cache
No DHCP or TFTP servers on any NIC networks
Routable access to the IPMI network of the hardware servers
Internet access for downloading all required artifacts
If you use a firewall or proxy, make sure that the bootstrap, management,
and regional clusters have access to the following IP ranges and domain names:
MOSK uses Kubernetes labels to place components onto hosts.
For the default locations of components, see MOSK cluster hardware requirements. Additionally,
MOSK supports component collocation. This is mostly useful
for OpenStack compute and Ceph nodes. For component collocation, consider
the following recommendations:
When calculating hardware requirements for nodes, consider the requirements
for all collocated components.
When performing maintenance on a node with collocated components, execute the
maintenance plan for all of them.
When combining other services with the OpenStack compute host, verify that
the reserved_host_* options are increased according to the needs of the
collocated components by using node-specific overrides for the compute
service, as illustrated in the sketch below.
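As an illustration only, a node-specific override of the Nova reserved host
resources could look like the following sketch in the OpenStackDeployment
custom resource. The node label (compute-type::hi-perf) and the exact override
path are assumptions; consult the node-specific configuration reference for
the authoritative schema:

spec:
  nodes:
    compute-type::hi-perf:
      services:
        compute:
          nova:
            nova_compute:
              values:
                conf:
                  nova:
                    DEFAULT:
                      reserved_host_memory_mb: 8192
                      reserved_host_cpus: 4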
MetalLB exposes external IP addresses of cluster services to access
applications in a Kubernetes cluster.
DNS
The Kubernetes Ingress NGINX controller is used to expose OpenStack
services outside of a Kubernetes deployment. Access to the Ingress
services is allowed only by its FQDN. Therefore, DNS is a mandatory
infrastructure service for an OpenStack on Kubernetes deployment.
To keep the operating system on a bare metal host up to date with the latest
security updates, the operating system requires periodic software
package upgrades that may or may not require a host reboot.
Mirantis Container Cloud uses life cycle management tools to update
the operating system packages on the bare metal hosts.
In a management cluster, software package upgrade and host restart are
applied automatically when a new Container Cloud version
with available kernel or software packages upgrade is released.
In a managed cluster, package upgrade and host restart are applied as part of
usual cluster update, when applicable. To start planning the maintenance
window and proceed with the managed cluster update, see Update a MOSK cluster to a major release version.
Operating system upgrade and host restart are applied to cluster
nodes one by one. If Ceph is installed in the cluster, the Container
Cloud orchestration securely pauses the Ceph OSDs on the node before
restart. This allows avoiding degradation of the storage service.
Each section below is dedicated to a particular service provided by
MOSK and contains configuration details and usage
samples of supported capabilities provided through the custom resources.
Note
The list of the services and their supported features included in
this section is not exhaustive and is constantly amended based on the
complexity of the architecture and use of a particular service.
Mirantis OpenStack for Kubernetes (MOSK) provides instances management
capability through the Compute service (OpenStack Nova). The Compute service
interacts with other OpenStack components of an OpenStack environment to
provide life-cycle management of the virtual machine instances.
The Compute service (OpenStack Nova) enables you to spawn instances that can
collectively consume more resources than what is physically available on a
compute node through resource oversubscription, also known as overcommit
or allocation ratio.
Resources available for oversubscription on a compute node include the number
of CPUs, amount of RAM, and amount of available disk space. When making a
scheduling decision, the scheduler of the Compute service takes into account
the actual amount of resources multiplied by the allocation ratio. Thereby,
the service allocates resources based on the assumption that not all instances
will be using their full allocation of resources at the same time.
Oversubscription enables you to increase the density of workloads and compute
resource utilization and, thus, achieve better Return on Investment (ROI) on
compute hardware. In addition, oversubscription can also help avoid the need
to create too many fine-grained flavors, which is commonly known as
flavor explosion.
There are two ways to control the oversubscription values for compute
nodes:
The legacy approach entails utilizing the
{cpu,disk,ram}_allocation_ratio configuration options offered by the
Compute service. A drawback of this method is that restarting the Compute
service is mandatory to apply the new configuration. This introduces the
risk of possible interruptions of cloud user operations, for example,
instance build failures.
The modern and recommended approach, adopted in MOSK
23.1, involves using the initial_{cpu,disk,ram}_allocation_ratio
configuration options, which are employed exclusively during the initial
provisioning of a compute node. This may occur during the initial deployment
of the cluster or when new compute nodes are added subsequently. Any further
alterations can be performed dynamically using the OpenStack Placement
service API without necessitating the restart of the service.
There is no definitive method for selecting optimal oversubscription values.
As a cloud operator, you should continuously monitor your workloads, ideally
have a comprehensive understanding of their nature, and experimentally
determine the maximum values that do not impact performance. This approach
ensures maximum workload density and cloud resource utilization.
To configure the initial compute resource oversubscription in
MOSK, specify the spec:features:nova:allocation_ratios
parameter in the OpenStackDeployment custom resource as explained in the
table below.
Changing the resource oversubscription configuration through the
OpenStackDeployment resource after cloud deployment will only
affect the newly added compute nodes and will not change
oversubscription for already existing compute nodes.
To change oversubscription for already existing compute nodes, use the
placement service API as described in Change oversubscription settings for existing compute nodes.
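The following sketch illustrates such a configuration. It sets cluster-wide
initial oversubscription values and a stricter override for nodes labeled
compute-type=hi-perf. The values and the node-override key format
(label::value) are examples based on the node-specific settings mechanism and
may need adjustment for your environment:

spec:
  features:
    nova:
      allocation_ratios:
        cpu: 8.0
        disk: 1.6
        ram: 1.0
  nodes:
    compute-type::hi-perf:
      features:
        nova:
          allocation_ratios:
            cpu: 2.0
            disk: 1.0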
In the example configuration above, the compute nodes labeled with
compute-type=hi-perf will use less intense oversubscription
of CPU and no oversubscription of disk.
When using oversubscription, it is important to conduct thorough cloud
management and monitoring to avoid system overloading and performance
degradation. If many or all instances on a compute node start using all
allocated resources at once and, thereby, overconsume physical resources,
failure scenarios depend on the resource being exhausted.
CPU
Workloads get slower as they actively compete for physical CPU
usage. A useful indicator is the steal time as reported inside the
workload, which is the percentage of time the operating system in the
workload spends waiting for a physical CPU core to become available to run
its instructions.
To verify the steal time in the Linux-based workload, use the
top command:
top -bn1 | head | grep st$ | awk -F',' '{print $NF}'
Generally, steal times of >10 for 20-30 minutes are considered
alarming.
RAM
The operating system on the compute node starts to aggressively use the
physical swap space, which significantly slows the workloads down. Sometimes,
when the swap is also exhausted, the operating system of a compute node can
outright OOM-kill the most offending processes, which can cause major
disruptions to workloads or the compute node itself.
Warning
While it may seem like a good idea to make the most of
available resources, oversubscribing RAM can lead to various issues and
is generally not recommended due to potential performance degradation,
reduced stability, and security risks for the workloads.
Mirantis strongly advises against oversubscribing RAM by any amount.
Disk space
Depends on the physical layout of storage. Virtual root and ephemeral
storage devices that are hosted on the compute node itself are put into
read-only mode, negatively affecting workloads. Additionally,
the file system used by the operating system on the compute node may
also become read-only, blocking the compute node operability.
There are workload types that are not suitable for running in an oversubscribed
environment, especially those with high performance, latency-sensitive, or
real-time requirements. Such workloads are better suited for compute nodes
with dedicated CPUs, ensuring that only processes of a single instance run
on each CPU core.
Configures the type of vCPU that Nova will create instances with.
The default CPU model configured for all instances managed by Nova is
host-model, the same as in Nova for the KVM or QEMU hypervisor.
host-model (default) - mimics the host CPU and provides for decent
performance, good security, and moderate compatibility with live migrations.
With this mode, libvirt finds an available predefined CPU model that best
matches the host CPU, and then explicitly adds the missing CPU feature
flags to closely match the host CPU features. To mitigate known security
flaws, libvirt automatically adds critical CPU flags, supported by
installed libvirt, QEMU, kernel, and CPU microcode versions.
This is a safe choice if your OpenStack compute node CPUs are of the same
generation. If your OpenStack compute node CPUs are sufficiently different,
for example, span multiple CPU generations, Mirantis strongly recommends
setting explicit CPU models supported by all of your OpenStack compute node
CPUs or organizing your OpenStack compute nodes into host aggregates and
availability zones that have largely identical CPUs.
Note
The host-model model does not guarantee two-way live migrations
between nodes.
When migrating instances, the libvirt domain XML is first copied as is to
the destination OpenStack compute node. Once the instance is hard rebooted
or shut down and started again, the domain XML will be re-generated. If
versions of libvirt, kernel, CPU microcode, or BIOS firmware differ from
those on the source compute node where the instance was originally started,
libvirt may pick up additional CPU feature flags, making it impossible to
live-migrate the instance back to the original compute node.
host-passthrough - provides maximum performance, especially when nested
virtualization is required or if live migration support is not a concern for
workloads. Live migration requires exactly the same CPU on all
OpenStack compute nodes, including the CPU microcode and kernel versions.
Therefore, for live migrations support, organize your compute nodes into host
aggregates and availability zones. For workload migration between
non-identical OpenStack compute nodes, contact Mirantis support.
A comma-separated list of exact QEMU CPU models to create and emulate.
Specify the common and less advanced CPU models first. All explicit CPU
models provided must be compatible with the OpenStack compute node CPUs.
To specify an exact CPU model, review the available CPU models and their
features. List and inspect the /usr/share/libvirt/cpu_map/*.xml files in
the libvirt containers of pods of the libvirt DaemonSet, or multiple
DaemonSets if you are using node-specific settings.
Specifies the name of the NIC device on the actual host that will be
used by Nova for the live migration of instances.
Mirantis recommends setting up your Kubernetes hosts in such a way
that networking is configured identically on all of them,
and names of the interfaces serving the same purpose or plugged into
the same network are consistent across all physical nodes.
Also, set the option to vhost0 in the following cases:
The Neutron service uses Tungsten Fabric.
Nova migrates instances through the interface specified by
the Neutron tunnel_interface parameter.
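For example, assuming the parameter path is
features:nova:live_migration_interface, a Tungsten Fabric deployment could set:

spec:
  features:
    nova:
      live_migration_interface: vhost0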
features:nova:libvirt:tls
Available since MOSK 23.2.
If set to true, enables the live migration over TLS:
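A minimal sketch of enabling this option, following the parameter path named
above (depending on the release, the option may be a plain boolean or a nested
enabled key):

spec:
  features:
    nova:
      libvirt:
        tls: true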
Defines the type of storage for Nova to use on the compute hosts for
the images that back the instances.
The list of supported options includes:
local
The local storage is used.
The pros include faster operation and failure-domain independence
from the external storage. The cons include local space consumption
and less performant and less robust live migration through block migration.
ceph
Instance images are stored in a Ceph pool shared across all
Nova hypervisors. The pros include faster image start and faster,
more robust live migration. The cons include considerably slower
I/O performance and the direct dependency of workload operations on the
Ceph cluster availability and performance.
lvm (Technology Preview)
Instance images and ephemeral images are stored on a local Logical
Volume. If specified, features:nova:images:lvm:volume_group must
be set to an available LVM Volume Group, by default, nova-vol.
For details, see Enable LVM ephemeral storage.
The noVNC client provides remote control or remote desktop access to guest
virtual machines through the Virtual Network Computing (VNC) system.
The MOSK Compute service users can access their
instances using the noVNC clients through the noVNC proxy server.
MOSK uses TLS to secure public-facing VNC access
on networks between a noVNC client and noVNC proxy server.
The features:nova:console:novnc:tls:enabled parameter ensures that the data
transferred between the instance and the noVNC proxy server is encrypted.
Both servers use the VeNCrypt authentication scheme for the data
encryption.
To enable the encrypted data transfer for noVNC, use the following
structure in the OpenStackDeployment custom resource:
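A minimal example that follows the features:nova:console:novnc:tls:enabled
path mentioned above:

spec:
  features:
    nova:
      console:
        novnc:
          tls:
            enabled: true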
Mirantis OpenStack for Kubernetes (MOSK) Networking service
(OpenStack Neutron) provides cloud applications with
Connectivity-as-a-Service enabling instances to communicate with each
other and the outside world.
The API provided by the service abstracts all the nuances of implementing
a virtual network infrastructure on top of your own physical network
infrastructure. The service allows cloud users to create advanced virtual
network topologies that may include load balancing, virtual private
networking, traffic filtering, and other services.
MOSK Networking service supports Open vSwitch and
Tungsten Fabric SDN technologies as back ends.
MOSK offers the Networking service as a part of its
core setup. You can configure the service through the
spec:features:neutron section of the OpenStackDeployment custom
resource.
Defines the name of the NIC device on the actual host that will be
used for Neutron.
Mirantis recommends setting up your Kubernetes hosts in such a way
that networking is configured identically on all of them,
and names of the interfaces serving the same purpose or plugged into
the same network are consistent across all physical nodes.
If enabled, must contain the data structure defining the floating IP
network that will be created for Neutron to provide external access to
your Nova instances.
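An illustrative sketch of the spec:features:neutron section that combines the
tunnel interface, external networks, and floating IP network definitions. All
values are placeholders, and the exact floating_network schema may differ
between releases:

spec:
  features:
    neutron:
      tunnel_interface: ens3
      external_networks:
        - physnet: physnet1
          interface: <physnet1-interface>
          bridge: br-ex
          network_types:
            - flat
          vlan_ranges: null
          mtu: null
      floating_network:
        enabled: true
        physnet: physnet1
        subnet:
          range: 10.11.12.0/24
          pool_start: 10.11.12.100
          pool_end: 10.11.12.200
          gateway: 10.11.12.1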
The BGP dynamic routing extension to the Networking service (OpenStack Neutron)
is particularly useful for the MOSK clouds where private
networks managed by cloud users need to be transparently integrated into the
networking of the data center.
For example, the BGP dynamic routing is a common requirement for IPv6-enabled
environments, where clients need to seamlessly access cloud workloads using
dedicated IP addresses with no address translation involved in between the
cloud and the external network.
BGP dynamic routing changes the way self-service (private) network prefixes
are communicated to BGP-compatible physical network devices, such as routers,
present in the data center. It eliminates the traditional reliance on static
routes or ICMP-based advertising by enabling the direct passing of private
network prefix information to router devices.
Note
To effectively use the BGP dynamic routing feature, Mirantis
recommends acquiring good understanding of OpenStack address scopes
and how they work.
The components of the OpenStack BGP dynamic routing are:
Service plugin
An extension to the Networking service (OpenStack Neutron) that implements
the logic for BGP-related entities orchestration and provides the cloud
user-facing API. A cloud administrator creates and configures a BGP speaker
using the CLI or API and manually schedules it to one or more hosts running
the agent.
Agent
Manages BGP peering sessions. In MOSK, the BGP agent
runs on nodes labeled with openstack-gateway=enabled.
Prefix advertisement depends on the binding of external networks to a BGP
speaker and the address scope of external and internal IP address ranges or
subnets.
BGP dynamic routing advertises prefixes for self-service networks and host
routes for floating IP addresses.
To successfully advertise a self-service network, you need to fulfill
the following conditions:
External and self-service networks reside in the same address scope.
The router contains an interface on the self-service subnet and a gateway
on the external network.
The BGP speaker associates with the external network that provides
a gateway on the router.
The BGP speaker has the advertise_tenant_networks attribute set
to True.
To successfully advertise a floating IP address, you need to fulfill
the following conditions:
The router with the floating IP address binding contains a gateway on
an external network with the BGP speaker association.
The BGP speaker has the advertise_floating_ip_host_routes attribute
set to true.
The diagram below is an example of the BGP dynamic routing in the non-DVR mode
with self-service networks and the following advertisements:
B>*192.168.0.0/25[200/0] through 10.11.12.1
B>*192.168.0.128/25[200/0] through 10.11.12.2
B>*10.11.12.234/32[200/0] through 10.11.12.1
Operation in the Distributed Virtual Router (DVR) mode
For both floating IP and IPv4 fixed IP addresses, the BGP speaker advertises
the gateway of the floating IP agent on the corresponding compute node as
the next-hop IP address. When using IPv6 fixed IP addresses, the BGP speaker
advertises the DVR SNAT node as the next-hop IP address.
The diagram below is an example of the BGP dynamic routing in the DVR mode
with self-service networks and the following advertisements:
DVR incompatibility with ARP announcements and VRRP
Due to the known issue
#1774459 in the upstream
implementation, Mirantis does not recommend using Distributed Virtual Routing
(DVR) routers in the same networks as load balancers or other applications
that utilize the Virtual Router Redundancy Protocol (VRRP) such as Keepalived.
The issue prevents the DVR functionality from working correctly with network
protocols that rely on the Address Resolution Protocol (ARP) announcements
such as VRRP.
The issue occurs when updating permanent ARP entries for
allowed_address_pair IP addresses in DVR routers because DVR performs
the ARP table update through the control plane and does not allow any
ARP entry to leave the node to prevent the router IP/MAC from
contaminating the network.
This results in various network failover mechanisms not functioning in virtual
networks that have a distributed virtual router plugged in. For instance, the
default back end for MOSK Load Balancing service,
represented by OpenStack Octavia with the OpenStack Amphora back end when
deployed in the HA mode in a DVR-connected network, is not able to redirect
the traffic from a failed active service instance to a standby one without
interruption.
In MOSK, Cinder backup is enabled and uses the Ceph back
end for Cinder by default. The backup configuration is stored
in the spec:features:cinder:backup structure in the
OpenStackDeployment custom resource. If necessary, you can disable
the backup feature in Cinder as follows:
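A minimal sketch of disabling the backup feature, assuming the structure
exposes an enabled flag:

spec:
  features:
    cinder:
      backup:
        enabled: false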
Using this structure, you can also configure another backup driver supported
by MOSK for Cinder as described below. At any given time,
only one back end can be enabled.
MOSK supports NFS Unix authentication exclusively.
To use an NFS driver with MOSK, ensure you have
a preconfigured NFS server with an NFS share accessible to a Unix
Cinder user. This user must be the owner of the exported NFS folder,
and the folder must have the permission value set to 775.
All Cinder services run as the same user by default.
To obtain the Unix user ID:
You can specify the backup_share parameter in the following formats:
hostname:path, ipv4addr:path, or [ipv6addr]:path.
For example: 1.2.3.4:/cinder_backup.
The Block Storage service (OpenStack Cinder) supports volume encryption using a
key stored in the Key Manager service (OpenStack Barbican). Such configuration
uses Linux Unified Key Setup (LUKS) to create an encrypted volume type and
attach it to the Compute service (OpenStack Nova) instances.
Nova retrieves the asymmetric key from Barbican and stores it
on the OpenStack compute node as a libvirt key to encrypt the volume
locally or on the back end and only after that transfers it to Cinder.
Note
To create an encrypted volume under a non-admin user, the
creator role must be assigned to the user.
When planning your cloud, consider that encryption may impact CPU.
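As an illustration, an encrypted volume type can be created with the standard
OpenStack CLI similar to the following; the cipher and key size are example
values:

openstack volume type create \
  --encryption-provider luks \
  --encryption-cipher aes-xts-plain64 \
  --encryption-key-size 256 \
  --encryption-control-location front-end \
  LUKS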
Mirantis OpenStack for Kubernetes (MOSK) provides authentication,
service discovery, and distributed multi-tenant authorization through the
OpenStack Identity service, aka Keystone.
MOSK integrates with Mirantis Container Cloud Identity and
Access Management (IAM) subsystem to allow centralized management of users and
their permissions across multiple clouds.
The core component of Container Cloud IAM is Keycloak, the open-source identity
and access management software. Its primary function is to perform secure
authentication of cloud users against its built-in or various external
identity databases, such as LDAP directories, OpenID Connect or SAML
compatible identity providers.
By default, every MOSK cluster is integrated with the
Keycloak running in the Container Cloud management cluster. The integration
automatically provisions the necessary configuration on the
MOSK and Container Cloud IAM sides, such as the os
client object in Keycloak. However, for the federated users to get proper
permissions after logging in, the cloud operator needs to define the role
mapping rules specific to each MOSK environment.
A region in MOSK represents a complete OpenStack cluster
that has a dedicated control plane and set of API endpoints. It is not uncommon
for operators of large clouds to offer their users several OpenStack regions,
which differ by their geographical location or purpose. In order to easily
navigate in a multi-region environment, cloud users need a way to distinguish
clusters by their names.
The region_name parameter of an OpenStackDeployment custom resource
specifies the name of the region that will be configured in all the OpenStack
services comprising the MOSK cluster upon the initial
deployment.
Important
Once the cluster is up and running, the cloud operator cannot
set or change the name of the region. Therefore, Mirantis recommends
selecting a meaningful name for the new region before the deployment starts.
For example, the region name can be based on the name of the data center the
cluster is located in.
Application credentials is a mechanism in the MOSK
Identity service that enables application automation tools, such as shell
scripts, Terraform modules, Python programs, and others, to securely
perform various actions in the cloud API in order to deploy and manage
application components.
Application credentials is a modern alternative to the legacy approach where
every application owner had to request several technical user accounts
to ensure their tools could authenticate in the cloud.
For the details on how to create and authenticate with application credentials,
refer to Manage application credentials.
Application credentials must be explicitly enabled for federated users
By default, cloud users logging in to the cloud through the Mirantis Container
Cloud IAM or any external identity provider cannot use the application
credentials mechanism.
An application credential is heavily tied to the account of the cloud user
owning it. An application automation tool that is a consumer of the credential
acts on behalf of the human user who created the credential. Each action that
the application automation tool performs gets authorized against the
permissions, including roles and groups, the user currently has.
The source of truth about the permissions of a federated user is the identity
provider. This information gets temporarily transferred to the cloud’s
Identity service inside a token once the user authenticates. By default,
if such a user creates an application credential and passes it to the
automation tool, there is no data to validate the tool’s actions on
the user’s behalf.
However, a cloud operator can configure the authorization_ttl parameter
for an identity provider object to enable caching of the authorization
data of its users. The parameter defines how long, in minutes, the information
about user permissions is preserved in the database after the user successfully
logs in to the cloud.
Warning
Authorization data caching has security implications. If a
federated user account is revoked or its permissions change in the identity
provider, the cloud Identity service will still allow performing actions
on the user’s behalf until the cached data expires or the user
re-authenticates in the cloud.
To set authorization_ttl to, for example, 60 minutes for the keycloak
identity provider in Keystone:
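Assuming your OpenStack client version supports the --authorization-ttl option,
the setting can be applied similar to the following; otherwise, set the
attribute directly through the Identity API:

openstack identity provider set --authorization-ttl 60 keycloak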
Defines the domain-specific configuration and is useful for integration
with LDAP. An example of an OsDpl resource with LDAP integration, which
creates a separate domain.with.ldap domain and configures it to use LDAP as
an identity driver, is shown below.
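This is an illustrative sketch only; the key names under spec:features:keystone
and the LDAP options are assumptions with placeholder values, so consult the
reference documentation for the authoritative schema:

spec:
  features:
    keystone:
      domain_specific_configuration:
        enabled: true
        domains:
          - name: domain.with.ldap
            config:
              identity:
                driver: ldap
              ldap:
                url: ldap://ldap.example.com
                user: uid=openstack,ou=people,o=example.com
                password: <LDAP bind password>
                suffix: dc=example,dc=com
                user_tree_dn: ou=people,o=example.com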
Mirantis OpenStack for Kubernetes (MOSK) provides the image management
capability through the OpenStack Image service, aka Glance.
The Image service enables you to discover, register, and retrieve virtual
machine images. Using the Glance API, you can query virtual machine image
metadata and retrieve actual images.
MOSK deployment profiles include the Image service in the
core set of services. You can configure the Image service through the
spec:features definition in the OpenStackDeployment custom resource.
MOSK can automatically verify the cryptographic signatures
associated with images to ensure the integrity of their data. A signed image
has a few additional properties set in its metadata that include
img_signature, img_signature_hash_method, img_signature_key_type,
and img_signature_certificate_uuid. You can find more information about
these properties and their values in the upstream OpenStack documentation.
MOSK performs image signature verification during the
following operations:
A cloud user or a service creates an image in the store and starts
to upload its data. If the signature metadata properties are set
on the image, its content gets verified against the signature.
The Image service accepts non-signed image uploads.
A cloud user spawns a new instance from an image. The Compute service
ensures that the data it downloads from the image storage matches
the image signature. If the signature is missing or does not match the
data, the operation fails. Limitations apply, see
Known limitations.
A cloud user boots an instance from a volume, or creates a new volume from
an image. If the image is signed, the Block Storage service compares the
downloaded image data against the signature. If there is a mismatch, the
operation fails.
The service will accept a non-signed image as a source for a volume.
Limitations apply, see Known limitations.
Every MOSK cloud is pre-provisioned with a baseline set of
images containing the most popular operating systems, such as Ubuntu, Fedora,
and CirrOS.
In addition, a few services in MOSK rely on the creation
of service instances to provide their functions, namely the Load Balancer
service and the Bare Metal service, and require corresponding images to exist
in the image store.
When image signature verification is enabled during the cloud deployment,
all these images get automatically signed with a pre-generated self-signed
certificate. Enabling the feature in an already existing cloud requires manual
signing of all of the images stored in it. Consult the OpenStack documentation
for an example of the image signing procedure.
The image signature verification is supported for LVM and local back ends for
ephemeral storage.
The functionality is not compatible with Ceph-backed ephemeral storage
combined with RAW-formatted images. The Ceph copy-on-write mechanism enables
the user to create instance virtual disks without downloading the image to
a compute node; the data is handled completely on the Ceph cluster side.
This enables you to spin up instances almost instantly but makes it
impossible to verify the image data before creating an instance from it.
The Image service does not enforce the presence of a signature in
the metadata when the user creates a new image. The service will accept the
non-signed image uploads.
The Image service does not verify the correctness of an image signature
upon update of the image metadata.
MOSK does not validate if the certificate used to sign an
image is trusted,
it only ensures the correctness of the signature itself. Cloud users are
allowed to use self-signed certificates.
The Compute service does not verify image signature for Ceph back end when
the RAW image format is used as described in
Supported storage back ends.
The Compute service does not verify image signature if the image is already
cached on the target compute node.
The Instance HA service may experience issues when auto-evacuating instances
created from signed images if it does not have access to the corresponding
secrets in the Key Manager service.
The Block Storage service does not perform image signature verification
when a Ceph back end is used and the images are in the RAW format.
The Block Storage service does not enforce the presence of a signature on
the images.
Instead of Swift, such configuration uses an S3 client to upload server-side
encrypted objects. Using server-side encryption, the data is sent over a secure
HTTPS connection in an unencrypted form and the Ceph Object Gateway stores that
data in the Ceph cluster in an encrypted form.
Defines the list of custom OpenStack Dashboard themes.
Content of the archive file with a theme depends on the level of
customization and can include static files, Django templates, and other
artifacts. For the details, refer to OpenStack official documentation:
Customizing Horizon Themes.
spec:
  features:
    horizon:
      themes:
        - name: theme_name
          description: The brand new theme
          url: https://<path to .tgz file with the contents of custom theme>
          sha256summ: <SHA256 checksum of the archive above>
The Telemetry services are part of OpenStack services
available in Mirantis OpenStack for Kubernetes (MOSK). The Telemetry
services monitor OpenStack components, collect and store the telemetry data
from them, and perform responsive actions upon this data. See
OpenStack cluster for details about OpenStack services in
MOSK.
OpenStack Ceilometer is a service that collects data from various OpenStack
components. The service can also collect and process notifications from
different OpenStack services. Ceilometer stores the data in the Gnocchi
database. The service is specified as metering in the
OpenStackDeployment custom resource (CR).
Gnocchi is an open-source time series database. One of the advantages of this
database is the ability to pre-aggregate the telemetry data while storing it.
Gnocchi is specified as metric in the OpenStackDeployment CR.
OpenStack Aodh is part of the Telemetry project. Aodh provides a service that
creates alarms based on various metric values or specific events and triggers
response actions. The service uses data collected and stored by Ceilometer
and Gnocchi. Aodh is specified as alarming in the OpenStackDeployment
CR.
The Telemetry feature in MOSK has a single mode.
The autoscaling mode provides settings for telemetry data collection and
storage. The OpenStackDeployment CR must have this mode specified for the
OpenStack Telemetry services to work correctly. The autoscaling mode has
the following notable configurations:
Gnocchi stores cache and data using the Redis storage driver.
Metric stores data for one hour with a resolution of 1 minute.
The Telemetry services are disabled by default in MOSK.
You have to enable them
in the openstackdeployment.yaml file (the OpenStackDeployment CR).
The following code block provides an example of deploying the Telemetry
services as part of MOSK:
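A sketch of such a configuration follows. The service names come from the
descriptions above; the telemetry mode key is an assumption and may differ in
your release:

spec:
  features:
    services:
      - alarming
      - metering
      - metric
    telemetry:
      mode: autoscaling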
Gnocchi is not an OpenStack service, so the settings related to its
functioning should be included in the spec:common:infra section of the
OpenStackDeployment CR.
The Ceilometer configuration files contain many list structures. Overriding
list elements in YAML files is context-dependent and error-prone. Therefore,
to override these configuration files, define the spec:services
structure in the OpenStackDeployment CR.
The spec:services structure provides the ability to use a complete file as
text and not as YAML data structure.
Overriding through the spec:services structure is possible for the
following files:
pipeline.yaml
polling.yaml
meters.yaml
gnocchi_resources.yaml
event_pipeline.yaml
event_definitions.yaml
An example of overriding through the OpenStackDeployment CR
By default, the autoscaling mode collects the data related to
CPU, disk, and memory every minute. The autoscaling mode collects the rest of
the available metrics every hour.
The following example shows the overriding of the polling.yaml
configuration file through the spec:services structure of the
OpenStackDeployment CR.
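The snippet below is an illustrative sketch only. It assumes that the
Ceilometer chart exposes polling.yaml as conf:polling under
spec:services:telemetry:ceilometer:values, and changes the polling interval
for all pollsters to 300 seconds:

spec:
  services:
    telemetry:
      ceilometer:
        values:
          conf:
            polling:
              sources:
                - name: all_pollsters
                  interval: 300
                  meters:
                    - "*"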
The Bare Metal service (Ironic) is an extra OpenStack service that can be
deployed by the OpenStack Operator. This section provides the
baremetal-specific configuration options of the OpenStackDeployment
resource.
To provision a user image onto a bare metal server, Ironic boots a node with
a ramdisk image. Depending on the node’s deploy interface and hardware, the
ramdisk may require different drivers (agents). MOSK
provides tinyIPA-based ramdisk images and uses the direct deploy interface
with the ipmitool power interface.
Since the bare metal nodes hardware may require additional drivers,
you may need to build a deploy ramdisk for particular hardware. For more
information, see Ironic Python Agent Builder.
Be sure to create a ramdisk image with the version of Ironic Python Agent
appropriate for your OpenStack release.
Ironic supports the flat and multitenancy networking modes.
The flat networking mode assumes that all bare metal nodes are
pre-connected to a single network that cannot be changed during the
virtual machine provisioning. This network, with bridged interfaces
for Ironic, should span all nodes, including compute nodes, so that
regular virtual machines can plug into the Ironic network.
In turn, the interface defined as provisioning_interface should
span the gateway nodes. The cloud operator can perform
all this underlying configuration through the L2 templates.
Example of the OsDpl resource illustrating the configuration for the flat
network mode:
spec:
  features:
    services:
      - baremetal
    neutron:
      external_networks:
        - bridge: ironic-pxe
          interface: <baremetal-interface>
          network_types:
            - flat
          physnet: ironic
          vlan_ranges: null
    ironic:
      # The name of neutron network used for provisioning/cleaning.
      baremetal_network_name: ironic-provisioning
      networks:
        # Neutron baremetal network definition.
        baremetal:
          physnet: ironic
          name: ironic-provisioning
          network_type: flat
          external: true
          shared: true
          subnets:
            - name: baremetal-subnet
              range: 10.13.0.0/24
              pool_start: 10.13.0.100
              pool_end: 10.13.0.254
              gateway: 10.13.0.11
      # The name of interface where provision services like tftp and ironic-conductor
      # are bound.
      provisioning_interface: br-baremetal
The multitenancy network mode uses the neutron Ironic network
interface to share physical connection information with Neutron. This
information is handled by Neutron ML2 drivers when plugging a Neutron port
to a specific network. MOSK supports the
networking-generic-switch Neutron ML2 driver out of the box.
Example of the OsDpl resource illustrating the configuration for the
multitenancy network mode:
spec:
  features:
    services:
      - baremetal
    neutron:
      tunnel_interface: ens3
      external_networks:
        - physnet: physnet1
          interface: <physnet1-interface>
          bridge: br-ex
          network_types:
            - flat
          vlan_ranges: null
          mtu: null
        - physnet: ironic
          interface: <physnet-ironic-interface>
          bridge: ironic-pxe
          network_types:
            - vlan
          vlan_ranges: 1000:1099
    ironic:
      # The name of interface where provision services like tftp and ironic-conductor
      # are bound.
      provisioning_interface: <baremetal-interface>
      baremetal_network_name: ironic-provisioning
      networks:
        baremetal:
          physnet: ironic
          name: ironic-provisioning
          network_type: vlan
          segmentation_id: 1000
          external: true
          shared: false
          subnets:
            - name: baremetal-subnet
              range: 10.13.0.0/24
              pool_start: 10.13.0.100
              pool_end: 10.13.0.254
              gateway: 10.13.0.11
The supported back end for Designate is PowerDNS. If required, you can specify
an external IP address and the protocol (UDP, TCP, or TCP + UDP) for the
PowerDNS Kubernetes service.
To configure LoadBalancer for PowerDNS, use the spec:features:designate
definition in the OpenStackDeployment custom resource.
The list of supported options includes:
external_ip - Optional. An IP address for the LoadBalancer service. If
not defined, LoadBalancer allocates the IP address.
protocol - A protocol for the Designate back end in Kubernetes. Can only
be udp, tcp, or tcp+udp.
type - The type of the back end for Designate. Can only be powerdns.
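An illustrative sketch of this configuration; the IP address is a placeholder
and the exact nesting under spec:features:designate is an assumption:

spec:
  features:
    designate:
      backend:
        type: powerdns
        protocol: udp
        external_ip: 10.172.1.101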
Due to an issue in the dnspython library, Asynchronous Transfer Full Range
(AXFR) requests do not work, which makes it impossible to set up a secondary
DNS zone. The issue affects OpenStack Victoria and will be fixed in the Yoga
release.
MOSK Key Manager service (OpenStack Barbican) provides
secure storage, provisioning, and management of cloud application secret data,
such as Symmetric Keys, Asymmetric Keys, Certificates, and raw binary data.
Instance High Availability service (OpenStack Masakari) enables cloud users
to ensure that their instances get automatically evacuated from a failed
hypervisor.
The Instance HA service is not included into the core set of services and needs
to be explicitly enabled in the OpenStackDeployment custom resource.
Parameter
features:services:instance-ha
Usage
Enables Masakari, the OpenStack service that ensures high availability
of instances running on a host. To enable the service, add
instance-ha to the service list:
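For example, the following snippet adds instance-ha to the list of deployed
services:

spec:
  features:
    services:
      - instance-ha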
MOSK Shared Filesystems service (OpenStack Manila) provides
Shared Filesystems as a service. The Shared Filesystems service enables you to
create and manage shared filesystems in your multi-project cloud environments.
The Shared FileSystems service consists of manila-api,
manila-scheduler, and manila-share services. All these services
communicate with each other through the AMQP protocol and store their data
in the MySQL database.
manila-api
Provides a stable RESTful API, authenticates and routes requests
throughout the Shared Filesystem service
manila-scheduler
Responsible for scheduling and routing requests to the appropriate
manila-share service by determining which back end should
serve as the destination for a share creation request
manila-share
Responsible for managing Shared Filesystems service devices, specifically
the back-end ones
The diagram below illustrates how the Shared FileSystems services communicate
with each other.
MOSK ensures support for different kinds of equipment and
shared filesystems by means of special drivers that are part of the
manila-share service. These drivers also determine the ability to restrict
access to data stored on a shared filesystem, the list of operations with
Manila volumes, and the types of connections to the client network.
Driver Handles Share Servers (DHSS) is one of the main parameters that
define the Manila workflow including the way the Manila driver makes clients
access shared filesystems. Some drivers support only one DHSS mode,
for example, the LVM share driver. Others support both modes, for example,
the Generic driver. If DHSS is set to False in the driver
configuration, the driver does not prepare the share server that provides
access to the shared filesystems, and the server and network setup should be
performed by the administrator. In this case, the Shared Filesystems service
only manages the server in its own configuration.
If the driver configuration includes DHSS=True, the driver creates a
service virtual machine that provides access to shared filesystems.
Also, when DHSS=True, the Shared Filesystems service performs a network
setup to provide client’s access to the created service virtual machine.
For working with the service virtual machine, the Shared Filesystems service
requires a separate service network that must be included in the driver’s
configuration as well.
Consider the Generic driver as an example of the DHSS=True case.
There are two network topologies for connecting the client’s network to the
service virtual machine, which depend on the
connect_share_server_to_tenant_network parameter.
If the connect_share_server_to_tenant_network parameter is set to
False, which is the default, the client must create a shared network connected
to a public router. IP addresses from this network will be granted access to
the created shared filesystem. The Shared Filesystems service creates a subnet
in its service network to which the network port of the new service virtual
machine and the network port of the client’s router will be connected. When a
new shared filesystem is created, the client’s machine is granted access to it
through the router.
If the connect_share_server_to_tenant_network parameter is set to True,
the Shared Filesystems service creates the service virtual machines with two
network interfaces. One of them is connected to the service network while the
other one is connected to the client’s network.
The Shared Filesystems service is not included into the core set of services
and needs to be explicitly enabled in the OpenStackDeployment custom
resource.
To install the OpenStack Manila services, add the shared-file-system
keyword to the spec:features:services list:
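For illustration, a minimal snippet that adds the service to the list (the rest of the object is omitted):

spec:
  features:
    services:
      - shared-file-system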
OpenStack and auxiliary services are running as containers in the kind: Pod
Kubernetes resources. All long-running services are governed by one of
the ReplicationController-enabled Kubernetes resources, which include
either kind: Deployment, kind: StatefulSet, or kind: DaemonSet.
The placement of the services is mostly governed by the Kubernetes node labels.
The labels affecting the OpenStack services include:
openstack-control-plane=enabled - the node hosting most of the OpenStack
control plane services.
openstack-compute-node=enabled - the node serving as a hypervisor for
Nova. The virtual machines with tenants workloads are created there.
openvswitch=enabled - the node hosting Neutron L2 agents and Open vSwitch
pods that manage the L2 connectivity of the OpenStack networks.
openstack-gateway=enabled - the node hosting Neutron L3, Metadata and
DHCP agents, Octavia Health Manager, Worker and Housekeeping components.
Note
OpenStack is an infrastructure management platform. Mirantis OpenStack
for Kubernetes (MOSK) uses Kubernetes mostly for
orchestration and dependency isolation. As a result, multiple OpenStack
services are running as privileged containers with host PIDs and Host
Networking enabled. You must ensure that at least the user with the
credentials used by Helm/Tiller (administrator) is capable of creating
such Pods.
While the underlying Kubernetes cluster is configured to use Ceph CSI
for providing persistent storage for container workloads, for some
types of workloads such networked storage is suboptimal due to latency.
This is why the separate local-volume-provisioner CSI is
deployed and configured as an additional storage class.
Local Volume Provisioner is deployed as kind: DaemonSet.
Database
A single WSREP (Galera) cluster of MariaDB is deployed as the SQL
database to be used by all OpenStack services. It uses the storage class
provided by Local Volume Provisioner to store the actual database files.
The service is deployed as kind: StatefulSet of a given size, which
is no less than 3, on any openstack-control-plane node. For details,
see OpenStack database architecture.
Messaging
RabbitMQ is used as a messaging bus between the components of the
OpenStack services.
A separate instance of RabbitMQ is deployed for each OpenStack service
that needs a messaging bus for intercommunication between its
components.
An additional, separate RabbitMQ instance is deployed to serve as
a notification messages bus for OpenStack services to post their own
and listen to notifications from other services.
StackLight also uses this message bus to collect notifications for
monitoring purposes.
Each RabbitMQ instance is a single node and is deployed as
kind: StatefulSet.
Caching
A single multi-instance Memcached service is deployed to be used
by all OpenStack services that need caching, which are mostly HTTP API
services.
Coordination
A separate instance of etcd is deployed to be used by Cinder,
which requires Distributed Lock Management for coordination between its
components.
Ingress
Is deployed as kind: DaemonSet.
Image pre-caching
A special kind: DaemonSet is deployed and updated each time the
kind: OpenStackDeployment resource is created or updated.
Its purpose is to pre-cache container images on Kubernetes nodes, and
thus, to minimize possible downtime when updating container images.
This is especially useful for containers used in kind: DaemonSet
resources, as during the image update Kubernetes starts to pull the
new image only after the container with the old image is shut down.
keystoneclient - a separate kind: Deployment with a pod that
has the OpenStack CLI client as well as relevant plugins installed,
and OpenStack admin credentials mounted. Can be used by
administrator to manually interact with OpenStack APIs from within a
cluster.
Image (Glance)
Supported back end is RBD (Ceph is required).
Volume (Cinder)
Supported back end is RBD (Ceph is required).
Network (Neutron)
Supported back ends are Open vSwitch and Tungsten Fabric.
Placement
Compute (Nova)
Supported hypervisor is Qemu/KVM through libvirt library.
Dashboard (Horizon)
DNS (Designate)
Supported back end is PowerDNS.
Load Balancer (Octavia)
Ceph Object Gateway (SWIFT)
Provides the object storage and a Ceph Object Gateway Swift API that is
compatible with the OpenStack Swift API. You can manually enable the
service in the OpenStackDeployment CR as described in
Deploy an OpenStack cluster.
Instance HA (Masakari)
An OpenStack service that ensures high availability of instances running
on a host. You can manually enable Masakari in the
OpenStackDeployment CR as described in Deploy an OpenStack cluster.
Orchestration (Heat)
Key Manager (Barbican)
The supported back ends include:
The built-in Simple Crypto, which is used by default
Vault
Vault by HashiCorp is a third-party system and is not
installed by MOSK. Hence,
the Vault storage back end should be
available elsewhere on the user environment and accessible from
the MOSK deployment.
If the Vault back end is used, you can configure Vault in the
OpenStackDeployment CR as described in
Deploy an OpenStack cluster.
Tempest
Runs tests against a deployed OpenStack cloud. You can manually enable
Tempest in the OpenStackDeployment CR as described in
Deploy an OpenStack cluster.
Telemetry
Telemetry services include alarming (aodh), metering (Ceilometer),
and metric (Gnocchi). All services should be enabled together through
the list of services to be deployed in the OpenStackDeployment CR
as described in Deploy an OpenStack cluster.
A complete setup of a MariaDB Galera cluster for OpenStack is illustrated
in the following image:
MariaDB server pods run a Galera multi-master cluster. Client
requests are forwarded by the Kubernetes mariadb service to the
mariadb-server pod that has the primary label. Other pods from
the mariadb-server StatefulSet have the backup label. Labels are
managed by the mariadb-controller pod.
The MariaDB Controller periodically checks the readiness of the
mariadb-server pods and sets the primary label on a pod if the following
requirements are met:
The primary label has not already been set on the pod.
The pod is in the ready state.
The pod is not being terminated.
The pod name has the lowest integer suffix among other ready pods in
the StatefulSet. For example, between mariadb-server-1 and
mariadb-server-2, the pod with the mariadb-server-1 name is
preferred.
Otherwise, the MariaDB Controller sets the backup label. This means that
all SQL requests are passed to only one node, while the other two nodes are in
the backup state and replicate the state from the primary node.
MariaDB clients connect to the mariadb service.
The OpenStack Controller runs in a set of containers in a pod in Kubernetes.
The OpenStack Controller is deployed as a Deployment with 1 replica only.
The failover is provided by Kubernetes that automatically restarts the
failed containers in a pod.
However, given the recommendation to use a separate Kubernetes cluster
for each OpenStack deployment, the controller is expected to manage only a
single OpenStackDeployment resource in normal operation, which makes
proper HA much less of an issue.
The OpenStack Controller is written in Python using Kopf, as a Python
framework to build Kubernetes operators, and Pykube, as a Kubernetes API
client.
Using Kubernetes API, the controller subscribes to changes to resources of
kind: OpenStackDeployment, and then reacts to these changes by creating,
updating, or deleting appropriate resources in Kubernetes.
The basic child resources managed by the controller are Helm releases.
They are rendered from templates taking into account
an appropriate values set from the main and features fields in the
OpenStackDeployment resource.
Then, the common fields are merged to resulting data structures.
Lastly, the services fields are merged providing the final and precise override
for any value in any Helm release to be deployed or upgraded.
The constructed values are then used by the OpenStack Controller during a
Helm release installation.
osdpl
The core container that handles changes in the osdpl object.
helmbundle
The container that watches the helmbundle objects
and reports their statuses to the osdpl object in
status:children. See OpenStackDeploymentStatus custom resource for details.
health
The container that watches all Kubernetes native
resources, such as Deployments, Daemonsets, Statefulsets,
and reports their statuses to the osdpl object in
status:health. See OpenStackDeploymentStatus custom resource for details.
secrets
The container that provides data exchange between different
components such as Ceph.
The CustomResourceDefinition resource in Kubernetes uses the
OpenAPI Specification version 2 to specify the schema of the resource
defined. The Kubernetes API outright rejects the resources that do not
pass this schema validation.
The language of the schema, however, is not expressive enough to define a
specific validation logic that may be needed for a given resource. For this
purpose, Kubernetes enables the extension of its API with
Dynamic Admission Control.
For the OpenStackDeployment (OsDpl) CR the ValidatingAdmissionWebhook
is a natural choice. It is deployed as part of OpenStack Controller
by default and performs specific extended validations when an OsDpl CR is
created or updated.
The inexhaustive list of additional validations includes:
Deny the OpenStack version downgrade
Deny the OpenStack version skip-level upgrade
Deny the OpenStack master version deployment
Deny upgrade to the OpenStack master version
Deny upgrade if any part of an OsDpl CR specification
changes along with the OpenStack version
Under specific circumstances, it may be viable to disable the Admission
Controller, for example, when you attempt to deploy or upgrade to the master
version of OpenStack.
Warning
Mirantis does not support MOSK deployments
performed without the OpenStackDeployment Admission Controller enabled.
Disabling of the OpenStackDeployment Admission Controller is only
allowed in staging non-production environments.
To disable the Admission Controller, ensure that the following structures and
values are present in the openstack-controller HelmBundle resource:
MOSK provides configuration capabilities through a
number of custom resources. This section provides a detailed
overview of these custom resources and their possible configuration.
The OpenStackDeployment custom resource enables you to securely store
sensitive fields in Kubernetes secrets. To do that, verify that the
reference secret is present in the same namespace as the
OpenStackDeployment object and the
openstack.lcm.mirantis.com/osdpl_secret label is set to true.
The list of fields that can be hidden from OpenStackDeployment is limited
and defined by the OpenStackDeployment schema.
For example, to hide spec:features:ssl:public_endpoints:api_cert, use the
following structure:
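The following sketch illustrates the idea; the exact sub-fields allowed under api_cert (the value_from and secret_key_ref keys and the secret name shown here) are assumptions and must be verified against the OpenStackDeployment schema:

spec:
  features:
    ssl:
      public_endpoints:
        api_cert:
          value_from:
            secret_key_ref:
              name: osdpl-public-api-cert   # hypothetical secret name; the secret must carry the osdpl_secret label
              key: api_cert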
Main elements of OpenStackDeployment custom resource
Element
Sub-element
Description
apiVersion
n/a
Specifies the version of the Kubernetes API that is used to create
this object
kind
n/a
Specifies the kind of the object
metadata
name
Specifies the name of metadata. Should be set in compliance with the
Kubernetes resource naming limitations
namespace
Specifies the metadata namespace. While technically it is possible to
deploy OpenStack on top of Kubernetes in other than openstack
namespace, such configuration is not included in the
MOSK system integration test plans. Therefore,
Mirantis does not recommend such a scenario.
Warning
Both OpenStack and Kubernetes platforms provide resources
to applications. When OpenStack is running on top of Kubernetes,
Kubernetes is completely unaware of OpenStack-native workloads,
such as virtual machines, for example.
For better results and stability, Mirantis recommends using a
dedicated Kubernetes cluster for OpenStack, so that OpenStack and
auxiliary services, Ceph, and StackLight are the only Kubernetes
applications running in the cluster.
spec
openstack_version
Specifies the OpenStack release to deploy
preset
String that specifies the name of the preset, a predefined
configuration for the OpenStack cluster. A preset includes:
A set of enabled services that includes virtualization, bare
metal management, secret management, and others
Major features provided by the services, such as VXLAN encapsulation
of the tenant traffic
Integration of services
Every supported deployment profile incorporates an OpenStack preset.
Refer to Deployment profiles for the list of possible values.
size
String that specifies the size category for the OpenStack cluster.
The size category defines the internal configuration of the cluster
such as the number of replicas for service workers and timeouts, etc.
The list of supported sizes includes:
tiny - for approximately 10 OpenStack compute nodes
small - for approximately 50 OpenStack compute nodes
medium - for approximately 100 OpenStack compute nodes
public_domain_name
Specifies the public DNS name for OpenStack services. This is a base
DNS name that must be accessible and resolvable by API clients of your
OpenStack cloud. It will be present in the OpenStack endpoints as
presented by the OpenStack Identity service catalog.
The TLS certificates used by the OpenStack services (see below) must
also be issued to this DNS name.
persistent_volume_storage_class
Specifies the Kubernetes storage class name used for services to create
persistent volumes. For example, backups of MariaDB. If not specified,
the storage class marked as default will be used.
features
Contains the top-level collections of settings for the OpenStack
deployment that potentially target several OpenStack services. The
section where the customizations should take place.
The features:services element contains a list of extra OpenStack
services to deploy. Extra OpenStack services are services that are not
included in the preset.
The list of services available for configuration includes: Cinder, Nova,
Designate, Keystone, Glance, Neutron, Heat, Octavia, Barbican, Placement,
Ironic, aodh, Gnocchi, and Masakari.
Caution
Mirantis is not responsible for cloud operability in case
of default policies modifications but provides API to pass the required
configuration to the core OpenStack services.
Enables a tested set of policies that limits the global admin role to
only the users with the admin role in the admin project or users with the
service role. The latter should be used only for the service users utilized
for communication between OpenStack services.
A low-level section that defines values that will be passed to all
OpenStack (spec:common:openstack) or auxiliary
(spec:common:infra) services Helm charts.
The lowest-level section, which enables the definition of
specific values to pass to specific Helm charts on a one-by-one basis:
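As an illustration only, the structure mirrors the per-chart layout used for the node-specific overrides later in this document; the service and chart names are placeholders:

spec:
  services:
    <service>:
      <chart>:
        values:
          # any value from the specific Helm chart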
Warning
Mirantis does not recommend changing the default settings for
spec:artifacts, spec:common, and spec:services elements.
Customizations can compromise the OpenStack deployment update and upgrade
processes.
However, you may need to edit the spec:services section to limit
hardware resources in case of a hyperconverged architecture as described in
Limit HW resources for hyperconverged OpenStack compute nodes.
Specifies the standard logging levels for OpenStack services, which
include the following, in the order of increasing severity: TRACE, DEBUG,
INFO, AUDIT, WARNING, ERROR, and CRITICAL.
Depending on the use case, you may need to configure the same application
components differently on different hosts. MOSK enables
you to easily perform the required configuration through node-specific
overrides at the OpenStack Controller side.
The limitation of using the node-specific overrides is that they override
only the configuration settings, while other components, such as startup
scripts, may need to be reconfigured as well.
Caution
The overrides have been implemented in a similar way to the
OpenStack node and node label specific DaemonSet configurations.
However, the OpenStack Controller node-specific settings conflict
with the upstream OpenStack node and node label specific DaemonSet
configurations. Therefore, Mirantis does not recommend configuring node and
node label overrides.
The list of allowed node labels is located in the Cluster object status
providerStatus.releaseRef.current.allowedNodeLabels field.
If the value field is not defined in allowedNodeLabels, a label can
have any value.
Before or after a machine deployment, add the required label from the allowed
node labels list with the corresponding value to
spec.providerSpec.value.nodeLabels in machine.yaml. For example:
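For illustration, assuming openstack-compute-node is among the allowed node labels, the machine.yaml fragment could look as follows:

spec:
  providerSpec:
    value:
      nodeLabels:
      - key: openstack-compute-node
        value: enabled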
The addition of a node label that is not available in the list of allowed node
labels is restricted.
The node-specific settings are activated through the spec:nodes
section of the OsDpl CR. The spec:nodes section contains the following
subsections:
features - implements overrides for a limited subset of fields and is
constructed similarly to spec::features
services - similarly to spec::services, enables you to override
settings for the components running as DaemonSets.
Example configuration:
spec:
  nodes:
    <NODE-LABEL>::<NODE-LABEL-VALUE>:
      features:
        # Detailed information about features might be found at
        # openstack_controller/admission/validators/nodes/schema.yaml
      services:
        <service>:
          <chart>:
            <chart_daemonset_name>:
              values:
                # Any value from specific helm chart
The resource of kind OpenStackDeploymentSecret (OsDplSecret) is a custom
resource that is intended to aggregate cloud’s confidential settings such
as SSL/TLS certificates, external systems access credentials, and other
secrets.
To obtain detailed information about the schema of an OsDplSecret custom
resource, run:
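For example, you can look up the corresponding CustomResourceDefinition and inspect its schema; the exact CRD name may differ in your environment, so it is looked up first:

kubectl get crd | grep -i openstackdeploymentsecret
kubectl describe crd <CRD-NAME-FROM-THE-PREVIOUS-OUTPUT>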
The resource has similar structure as the OpenStackDeployment custom
resource and enables the user to set a limited subset of fields that
contain sensitive data.
The resource of kind OpenStackDeploymentStatus (OsDplSt) is a custom
resource that describes the status of an OpenStack deployment.
To obtain detailed information about the schema of an
OpenStackDeploymentStatus (OsDplSt) custom resource, run:
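For example, similarly to the OsDplSecret resource, look up the corresponding CustomResourceDefinition and inspect its schema:

kubectl get crd | grep -i openstackdeploymentstatus
kubectl describe crd <CRD-NAME-FROM-THE-PREVIOUS-OUTPUT>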
The services subsection provides detailed information of LCM performed with
a specific service. This is a dictionary where keys are service names, for
example, baremetal or compute and values are dictionaries with the
following items.
The OpenStack Controller enables you to modify its configuration at runtime
without restarting. MOSK stores the controller configuration
in the openstack-controller-config ConfigMap in the osh-system
namespace of your cluster.
To retrieve the OpenStack Controller configuration ConfigMap, run:
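For example, using the namespace and object name given above:

kubectl -n osh-system get configmap openstack-controller-config -o yaml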
OpenStack Controller extra configuration parameters
Section
Parameter
Default value
Description
[osctl]
wait_application_ready_timeout
1200
The number of seconds to wait for all application components
to become ready.
wait_application_ready_delay
10
The number of seconds before going to the sleep mode between attempts
to verify if the application is ready.
node_not_ready_flapping_timeout
120
The amount of time to wait for the flapping node.
[helmbundle]
manifest_enable_timeout
600
The number of seconds to wait until the values set in the manifest
are propagated to the dependent objects.
manifest_enable_delay
10
The number of seconds between attempts to verify if the values
were applied.
manifest_disable_timeout
600
The number of seconds to wait until the values are removed from
the manifest and propagated to the child objects.
manifest_disable_delay
10
The number of seconds between attempts to verify if the values were
removed from the release.
manifest_purge_timeout
600
The number of seconds to wait until the Kubernetes object is removed.
manifest_purge_delay
10
The number of seconds between attempts to verify if the Kubernetes
object is removed.
manifest_apply_delay
10
The number of seconds to pause for the Helm bundle changes.
[maintenance]
instance_migrate_concurrency
1
The number of instances to migrate concurrently.
nwl_parallel_max_compute
30
The maximum number of compute nodes allowed for a parallel update.
nwl_parallel_max_gateway
1
The maximum number of gateway nodes allowed for a parallel update.
respect_nova_az
true
Respect Nova availability zone (AZ). The true value allows
the parallel update only for the compute nodes in the same AZ.
ndr_skip_instance_check
false
The flag to skip the instance verification on a host before proceeding
with the node removal. The false value blocks the node removal
if at least one instance exists on the host.
ndr_skip_volume_check
false
The flag to skip the volume verification on a host before proceeding
with the node removal. The false value blocks the node removal
if at least one volume exists on the host. A volume is tied to
a specific host only for the LVM back end.
MOSK relies on the MariaDB Galera cluster to provide
its OpenStack components with a reliable storage of persistent data.
For successful long-term operations of a MOSK cloud, it
is crucial to ensure the healthy state of the OpenStack database as well as the
safety of the data stored in it. To help you with that, MOSK
provides built-in automated procedures for OpenStack database maintenance,
backup, and restoration. This chapter describes the internal mechanisms
and configuration details for the provided tools.
Overview of the OpenStack database backup and restoration
MOSK relies on the MariaDB Galera cluster to provide
its OpenStack components with a reliable storage for persistent data.
Mirantis recommends backing up your OpenStack databases daily to ensure
the safety of your cloud data. Also, you should always create an instant
backup before updating your cloud or performing any kind of potentially
disruptive experiment.
MOSK has a built-in automated backup routine that can be
triggered manually or by schedule. For detailed information about the process
of MariaDB Galera cluster backup, refer to Workflows of the OpenStack database backup and restoration.
Backup and restoration can only be performed against the OpenStack database
as a whole. Granular per-service or per-table procedures are not supported
by MOSK.
By default, periodic backups are turned off. Though, a cloud operator can
easily enable this capability by adding the following structure to the
OpenStackDeployment custom resource:
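A minimal example of such a structure, which is also shown later in this chapter:

spec:
  features:
    database:
      backup:
        enabled: true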
By default, MOSK backup routine stores the OpenStack
database data into the Mirantis Ceph cluster, which is a part of the same
cloud. This is sufficient for the vast majority of clouds. However, you may
want to have the backup data stored off the cloud to comply with specific
enterprise practices for infrastructure recovery and data safety.
The size of a backup storage volume depends directly on the size of the
MOSK cluster, which can be determined through the
size parameter in the OpenStackDeployment CR.
The list of the recommended sizes for a minimal backup volume includes:
20 GB for the tiny cluster size
40 GB for the small cluster size
80 GB for the medium cluster size
If required, you can change the default size of a database backup volume.
However, make sure that you configure the volume size before OpenStack
deployment is complete. This is because there is no automatic way to
resize the backup volume once the cloud is deployed. Also, only the local
backup storage (Ceph) supports the configuration of the volume size.
To change the default size of the backup volume, use the following structure
in the OpenStackDeployment CR:
To store the backup data to a local Mirantis Ceph, the MOSK
underlying Kubernetes cluster needs to have a preconfigured storage class for
Kubernetes persistent volumes with the Ceph cluster as a storage back end.
When restoring the OpenStack database from a local Ceph storage, the cron job
restores the state on each MariaDB node sequentially. It is not possible to
perform parallel restoration because Ceph Kubernetes volumes do not support
concurrent mounting from multiple places.
MOSK provides you with a capability to store the OpenStack
database data outside of the cloud, on an external storage device that supports
common data access protocols, such as third-party NAS appliances.
Workflows of the OpenStack database backup and restoration
This section provides technical details about the internal implementation
of automated backup and restoration routines built into
MOSK. The information below is helpful for
troubleshooting issues related to the process and for understanding the
impact these procedures have on a running cloud.
The mariadb-phy-backup job launches the
mariadb-phy-backup-<TIMESTAMP> pod. This pod contains the main backup
script, which is responsible for:
Basic sanity checks and choosing the right node for backup
Verifying the wsrep status and changing the wsrep_desync parameter
settings
Managing the mariadb-phy-backup-runner pod
During the first backup phase, the following actions take place:
Sanity check: verification of the Kubernetes status and wsrep status of
each MariaDB pod. If some pods have wrong statuses, the backup job
fails unless the --allow-unsafe-backup parameter is passed to
the main script in the Kubernetes backup job.
Note
Since MOSK 22.4, the --allow-unsafe-backup
functionality is removed from the product for security and backup
procedure simplification purposes.
Mirantis does not recommend setting the --allow-unsafe-backup
parameter unless it is absolutely required. To ensure the consistency
of a backup, verify that the MariaDB Galera cluster is in a working
state before you proceed with the backup.
Select the replica to back up. The system selects the replica with the
highest number in its name as a target replica. For example, if the
MariaDB server pods have the mariadb-server-0, mariadb-server-1,
and mariadb-server-2 names, the mariadb-server-2 replica will
be backed up.
Desynchronize the replica from the Galera cluster. The script connects
the target replica and sets the wsrep_desync variable to ON.
Then, the replica stops receiving write-sets and receives the wsrep
status Donor/Desynced. The Kubernetes health check of that
mariadb-server pod fails and the Kubernetes status of that pod
becomes NotReady. If the pod has the primary label, the MariaDB
Controller sets the backup label to it and the pod is removed from
the endpoints list of the MariaDB service.
The main script in the mariadb-phy-backup pod launches the
Kubernetes pod mariadb-phy-backup-runner-<TIMESTAMP>
on the same node where the target mariadb-server replica is running,
which is node X in the example.
The mariadb-phy-backup-runner pod has both mysql data directory
and backup directory mounted. The pod performs the following
actions:
Verifies that there is enough space in the /var/backup folder to
perform the backup. The amount of available space in the folder
should be greater than
<DB-SIZE> * <MARIADB-BACKUP-REQUIRED-SPACE-RATIO> in KB.
Performs the actual backup using the mariabackup tool.
If the number of current backups is greater than the value of the
MARIADB_BACKUPS_TO_KEEP job parameter, the script removes all
old backups exceeding the allowed number of backups.
Exits with 0 code.
The script waits until the mariadb-phy-backup-runner pod is
completed and collects its logs.
The script puts the backed up replica back to sync with the Galera
cluster by setting wsrep_desync to OFF and waits for
the replica to become Ready in Kubernetes.
The mariadb-phy-restore pod launches
openstack-mariadb-phy-restore-runner with the first
mariadb-server replica PVC mounted to the /var/lib/mysql
folder and the backup PVC mounted to /var/backup.
The openstack-mariadb-phy-restore-runner pod performs the
following actions:
Unarchives the database backup files to a temporary directory within
/var/backup.
Executes mariabackup --prepare on the unarchived data.
Creates the .prepared file in the temporary directory in
/var/backup.
Restores the backup to /var/lib/mysql.
Exits with 0.
The script in the mariadb-phy-restore pod collects the logs
from the openstack-mariadb-phy-restore-runner pod and removes
the pod. Then, the script launches the next
openstack-mariadb-phy-restore-runner pod for the next
mariadb-server replica PVC.
The openstack-mariadb-phy-restore-runner pod restores the backup
to /var/lib/mysql and exits with 0.
Step 2 is repeated for every mariadb-server replica PVC
sequentially.
When the last replica’s data is restored, the last
openstack-mariadb-phy-restore-runner pod removes the .prepared
file and the temporary folder with unarchived data from /var/backup.
By design, when deleting a cloud resource, for example, an instance, volume,
or router, an OpenStack service does not immediately delete its data but
marks it as removed so that it can later be picked up by the garbage
collector.
Given that an OpenStack resource is often represented by more than one record
in the database, deletion of all of them right away could affect the overall
responsiveness of the cloud API. On the other hand, an OpenStack database
being severely clogged with stale data is one of the most typical reasons for
the cloud slowness.
To keep the OpenStack database small and its performance fast,
MOSK is preconfigured to automatically clean up removed
database records older than 30 days. By default, the cleanup is performed for
the following MOSK services every Monday according to the
schedule below:
The default database cleanup schedule by OpenStack service
Service
Service identifier
Clean up time
Block Storage (OpenStack Cinder)
cinder
12:01 a.m.
Compute (OpenStack Nova)
nova
01:01 a.m.
Image (OpenStack Glance)
glance
02:01 a.m.
Instance HA (OpenStack Masakari)
masakari
03:01 a.m.
Key Manager (OpenStack Barbican)
barbican
04:01 a.m.
Orchestration (OpenStack Heat)
heat
05:01 a.m.
If required, you can adjust the cleanup schedule for the OpenStack database by
adding the features:database:cleanup setting to the OpenStackDeployment
CR following the example below. The schedule parameter must contain a
valid cron expression. The age parameter specifies the number of days after
which a stale record gets cleaned up.
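For illustration, a sketch that adjusts the schedule for the Compute service; keying the configuration by the service identifier and the exact sub-fields are assumptions based on the parameters described above:

spec:
  features:
    database:
      cleanup:
        nova:
          schedule: "1 2 * * 1"   # a valid cron expression
          age: 30                 # days after which a stale record is removed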
MOSK uses the Mariabackup utility to back up the MariaDB
Galera cluster data where the OpenStack data is stored. Mariabackup is
launched on a periodic basis as a part of the Kubernetes CronJob, which is
included in any MOSK deployment and is suspended by default.
Note
If you are using the default back end to store the backup data,
which is Ceph, you can increase the default size of a backup volume.
However, make sure to configure the volume size before you deploy
OpenStack.
MOSK enables you to configure the periodic backup of the
OpenStack database through the OpenStackDeployment object. To enable the
backup, use the following structure:
spec:
  features:
    database:
      backup:
        enabled: true
By default, the backup job:
Runs backup on a daily basis at 01:00 AM
Creates incremental backups daily and full backups weekly
Keeps 10 latest full backups
Stores backups in the mariadb-phy-backup-data PVC
Has the backup timeout of 3600 seconds
Has the incremental backup type
To verify the configuration of the mariadb-phy-backup CronJob
object, run:
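For example, assuming OpenStack is deployed in the openstack namespace:

kubectl -n openstack get cronjob mariadb-phy-backup -o yaml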
Type of a backup. The list of possible values includes:
incremental
If the newest full backup is older than the value of
the full_backup_cycle parameter, the system performs a full
backup. Otherwise, the system performs an incremental backup of
the newest full backup.
Number of seconds that defines a period between 2 full backups.
During this period, incremental backups are performed. The parameter
is taken into account only if backup_type is set to
incremental. Otherwise, it is ignored.
For example, with full_backup_cycle set to 604800 seconds,
a full backup is taken weekly and, if the cron schedule is set to 0 0 * * *,
an incremental backup is performed on a daily basis.
Multiplier for the database size to predict the space required to
create a backup, either full or incremental, and perform a
restoration keeping the uncompressed backup files on the same file
system as the compressed ones.
To estimate the size of MARIADB_BACKUP_REQUIRED_SPACE_RATIO, use
the following formula: size of (1 uncompressed full backup + all
related incremental uncompressed backups + 1 full compressed backup)
in KB <= (DB_SIZE * MARIADB_BACKUP_REQUIRED_SPACE_RATIO) in
KB.
The DB_SIZE is the disk space allocated in the MySQL data
directory, which is /var/lib/mysql, for databases data excluding
galera.cache and ib_logfile* files. This parameter prevents
the backup PVC from being full in the middle of the restoration and
backup procedures. If the current available space is lower than
DB_SIZE * MARIADB_BACKUP_REQUIRED_SPACE_RATIO, the backup
script fails before the system starts the actual backup and the
overall status of the backup job is failed.
For example, to perform full backups monthly and incremental backups
daily at 02:30 AM and keep the backups for the last six months,
configure the database backup in your OpenStackDeployment object
as follows:
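The sketch below illustrates the idea; backup_type and full_backup_cycle are described above, while the schedule_time and backups_to_keep parameter names are assumptions to be verified against the OpenStackDeployment schema:

spec:
  features:
    database:
      backup:
        enabled: true
        backup_type: incremental
        full_backup_cycle: 2592000    # seconds, roughly one month between full backups
        schedule_time: '30 2 * * *'   # assumed name of the cron schedule parameter
        backups_to_keep: 6            # assumed name of the retention parameter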
By default, MOSK stores the OpenStack database backups
locally in the Mirantis Ceph cluster, which is a part of the same cloud.
Alternatively, MOSK provides you with a capability to create
remote backups using an external storage. This section contains configuration
details for a remote back end to be used for the OpenStack data backup.
In general, the built-in automated backup routine saves the data to the
mariadb-phy-backup-data PersistentVolumeClaim (PVC), which is provisioned
from StorageClass specified in the spec.persistent_volume_storage_class
parameter of the OpenstackDeployment custom resource (CR).
Remote NFS storage for OpenStack database backups
A preconfigured NFS server with NFS share that a Unix backup and
restore user has access to. By default, it is the same user that runs
MySQL server in a MariaDB image.
Removal of the NFS persistent volume does not automatically remove the data.
No validation of mount options. If mount options are specified incorrectly in
the OpenStackDeployment CR, the mount command fails upon the
creation of a backup runner pod.
Optionally, MOSK enables you to set the required mount
options for the NFS mount command. You can set as many mount options
as you need. For example:
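A sketch of the idea only; the backend, pv_nfs, and mount_options key names are assumptions rather than the authoritative schema, and the values show typical NFS mount options:

spec:
  features:
    database:
      backup:
        enabled: true
        backend: pv_nfs               # assumed name of the remote NFS back end
        pv_nfs:
          server: <NFS-SERVER-ADDRESS>
          path: <NFS-SHARE-PATH>
          mount_options:              # assumed name of the mount options list
            - "nfsvers=4"
            - "hard"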
The internal components of Mirantis OpenStack for Kubernetes (MOSK)
coordinate their operations and exchange status information using the
cluster’s message bus (RabbitMQ).
MOSK enables you to configure OpenStack services to emit
notification messages to the MOSK cluster messaging bus
(RabbitMQ) every time an OpenStack resource, for example, an instance, image,
and so on, changes its state due to a cloud user action or through its
lifecycle. For example, MOSK Compute service (OpenStack
Nova) can publish the instance.create.end notification once a newly created
instance is up and running.
OpenStack notification messages can be consumed and processed by various
corporate systems to integrate MOSK clouds into the
company infrastructure and business processes.
The list of the most common use cases includes:
Using notification history for retrospective security audit
Using the real-time aggregation of notification messages to gather
statistics on cloud resource consumption for further capacity planning
Cloud billing considerations
Notifications alone should not be considered as a source of data for any
kind of financial reporting. The delivery of the messages cannot be
guaranteed due to various technical reasons. For example, messages can
be lost if an external consumer is not fetching them from the queue fast
enough.
Mirantis strongly recommends that your cloud billing solutions rely on the
combination of the following data sources:
Periodic polling of the OpenStack API as a reliable source of information
about allocated resources
Subscription to notifications to receive timely updates about the resource
status change
A cloud administrator can securely expose part of a MOSK
cluster message bus to the outside world. This enables an external consumer
to subscribe to the notification messages emitted by the cluster services.
Important
The latest OpenStack release available in MOSK supports
notifications from the following services:
Block storage (OpenStack Cinder)
DNS (OpenStack Designate)
Image (OpenStack Glance)
Orchestration (OpenStack Heat)
Bare Metal (OpenStack Ironic)
Identity (OpenStack Keystone)
Shared Filesystems (OpenStack Manila)
Instance High Availability (OpenStack Masakari)
Networking (OpenStack Neutron)
Compute (OpenStack Nova)
To enable the external notification endpoint, add the following structure
to the OpenStackDeployment custom resource. For example:
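A sketch of such a structure, with placeholder topic names; the exact nesting under features is an assumption to be verified against the OpenStackDeployment schema:

spec:
  features:
    messaging:
      notifications:
        external:
          enabled: true
          topics:
            - external-consumer-a
            - security-audit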
For each topic name specified in the topics field, MOSK
creates a topic exchange in its RabbitMQ cluster together with a set of queues
bound to this topic. All enabled MOSK services will
publish their notification messages to all configured topics so that
multiple consumers can receive the same messages in parallel.
A topic name must follow the Kubernetes standard format for object names and
IDs, that is, it can contain only lowercase alphanumeric characters, -, and .
characters. The topic name notifications is reserved for internal use.
MOSK supports connections to the message bus (RabbitMQ)
through either a plain-text user name and password or an encrypted X.509
certificate.
Each topic exchange is protected by automatically generated authentication
credentials and certificates for secure connection that are stored as a secret
in the openstack-external namespace of a MOSK underlying
Kubernetes cluster. A secret is identified by the name of the topic. The list
of attributes for the secret object includes:
hosts
The IP addresses on which the external notification endpoint is available
port_amqp, port_amqp-tls
The TCP ports on which the external notification endpoint is available
vhost
The name of the RabbitMQ virtual host on which the topic queues are created
username, password
Authentication data
ca_cert
The client CA certificate
client_cert
The client certificate
client_key
The client private key
For the configuration example above, the following objects will be created:
Tungsten Fabric provides basic L2/L3 networking to an OpenStack environment
running on the MKE cluster and includes the IP address management, security
groups, floating IP addresses, and routing policies functionality.
Tungsten Fabric is based on overlay networking, where all virtual machines are
connected to a virtual network with encapsulation (MPLSoGRE, MPLSoUDP, VXLAN).
This enables you to separate the underlay Kubernetes management network. A
workload requires an external gateway, such as a hardware EdgeRouter or a
simple gateway to route the outgoing traffic.
The Tungsten Fabric vRouter uses different gateways for the control and data
planes.
All services of Tungsten Fabric are delivered as separate containers, which
are deployed by the Tungsten Fabric Operator (TFO). Each container has an
INI-based configuration file that is available on the host system. The
configuration file is generated automatically upon the container start and is
based on environment variables provided by the TFO through Kubernetes
ConfigMaps.
The main Tungsten Fabric containers run with the host network as
DaemonSets, without using the Kubernetes networking layer. The services
listen directly on the host network interface.
The following diagram describes the minimum production installation of
Tungsten Fabric with a Mirantis OpenStack for Kubernetes (MOSK)
deployment.
For the details about the Tungsten Fabric services included in
MOSK deployments and the types of traffic and traffic
flow directions, see the subsections below.
This section describes the Tungsten Fabric services and their distribution
across the Mirantis OpenStack for Kubernetes (MOSK) deployment.
The Tungsten Fabric services run mostly as DaemonSets in separate containers
for each service. The deployment and update processes are managed by the
Tungsten Fabric Operator. However, Kubernetes manages the probe checks and
restart of broken containers.
All configuration and control services run on the Tungsten Fabric Controller
nodes.
Service name
Service description
config-api
Exposes a REST-based interface for the Tungsten Fabric API.
config-nodemgr
Collects data of the Tungsten Fabric configuration processes and sends
it to the Tungsten Fabric collector.
control
Communicates with the cluster gateways using BGP and with the vRouter
agents using XMPP, as well as redistributes appropriate networking
information.
control-nodemgr
Collects the Tungsten Fabric Controller process data and sends
this information to the Tungsten Fabric collector.
device-manager
Manages physical networking devices using netconf or ovsdb.
In multi-node deployments, it operates in the active-backup mode.
dns
Using the named service, provides the DNS service to the VMs spawned
on different compute nodes. Each vRouter node connects to two
Tungsten Fabric Controller containers that run the dns process.
named
The customized Berkeley Internet Name Domain (BIND) daemon of
Tungsten Fabric that manages DNS zones for the dns service.
schema
Listens to configuration changes performed by a user and generates
corresponding system configuration objects. In multi-node deployments,
it works in the active-backup mode.
svc-monitor
Listens to configuration changes of service-template and
service-instance, as well as spawns and monitors virtual machines
for the firewall, analyzer services, and so on. In multi-node
deployments, it works in the active-backup mode.
webui
Consists of the webserver and jobserver services. Provides
the Tungsten Fabric web UI.
All analytics services run on Tungsten Fabric analytics nodes.
Service name
Service description
alarm-gen
Evaluates and manages the alarms rules.
analytics-api
Provides a REST API to interact with the Cassandra analytics
database.
analytics-nodemgr
Collects all Tungsten Fabric analytics process data and sends
this information to the Tungsten Fabric collector.
analytics-database-nodemgr
Provisions the init model if needed. Collects data of the database
process and sends it to the Tungsten Fabric collector.
collector
Collects and analyzes data from all Tungsten Fabric services.
query-engine
Handles the queries to access data from the Cassandra database.
snmp-collector
Receives the authorization and configuration of the physical routers
from the config-nodemgr service, polls the physical routers using
the Simple Network Management Protocol (SNMP), and uploads the data to
the Tungsten Fabric collector.
topology
Reads the SNMP information from the physical router user-visible
entities (UVEs), creates a neighbor list, and writes the neighbor
information to the physical router UVEs. The Tungsten Fabric web UI uses
the neighbor list to display the physical topology.
The Tungsten Fabric vRouter provides data forwarding to an OpenStack tenant
instance and reports statistics to the Tungsten Fabric analytics service. The
Tungsten Fabric vRouter is installed on all OpenStack compute nodes.
Mirantis OpenStack for Kubernetes (MOSK) supports the kernel-based
deployment of the Tungsten Fabric vRouter.
vrouter-agent
Connects to the Tungsten Fabric Controller container and the Tungsten
Fabric DNS system using the Extensible Messaging and Presence Protocol
(XMPP). The vRouter Agent acts as a local control plane. Each Tungsten
Fabric vRouter Agent is connected to at least two Tungsten Fabric
controllers in an active-active redundancy mode.
The Tungsten Fabric vRouter Agent is responsible for all
networking-related functions including routing instances, routes,
and others.
The Tungsten Fabric vRouter uses different gateways for the control
and data planes. For example, the Linux system gateway is located
on the management network, and the Tungsten Fabric gateway is located
on the data plane network.
vrouter-nodemgr
Collects the supervisor vrouter data and sends it
to the Tungsten Fabric collector.
The following diagram illustrates the Tungsten Fabric kernel vRouter set up by
the TF operator:
The diagram above uses the following types of network interfaces:
eth0 - for the management (PXE) network (eth1 and eth2 are the
slave interfaces of Bond0)
cassandra
On the Tungsten Fabric control plane nodes, maintains the
configuration data of the Tungsten Fabric cluster.
On the Tungsten Fabric analytics nodes, stores the collector
service data.
cassandra-operator
The Kubernetes operator that enables the Cassandra clusters creation
and management.
kafka
Handles the messaging bus and generates alarms across the Tungsten
Fabric analytics containers.
kafka-operator
The Kubernetes operator that enables Kafka clusters creation and
management.
redis
Stores the physical router UVE storage and serves as a messaging bus
for event notifications.
redis-operator
The Kubernetes operator that enables Redis clusters creation and
management.
zookeeper
Holds the active-backup status for the device-manager,
svc-monitor, and the schema-transformer services. This service
is also used for mapping of the Tungsten Fabric resources names to
UUIDs.
zookeeper-operator
The Kubernetes operator that enables ZooKeeper clusters creation and
management.
rabbitmq
Exchanges messages between API servers and original request senders.
rabbitmq-operator
The Kubernetes operator that enables RabbitMQ clusters creation and
management.
Along with the Tungsten Fabric services, MOSK deploys and
updates special image precaching DaemonSets when the kind: TFOperator
resource is created or image references in it get updated.
These DaemonSets precache container images on Kubernetes nodes, minimizing
possible downtime when updating container images. The cloud operator can
disable image precaching through the TFOperator resource.
The following diagram illustrates all types of UI
and API traffic in a Mirantis OpenStack for Kubernetes
cluster, including the monitoring and OpenStack API traffic. The OpenStack
Dashboard pod hosts Horizon and acts as a proxy for all other types of
traffic. TLS termination is also performed for this type of traffic.
SDN or Tungsten Fabric traffic goes through the overlay Data network and
processes east-west and north-south traffic for applications that run in a
MOSK cluster. This network segment typically contains
tenant networks as separate MPLS-over-GRE and MPLS-over-UDP tunnels.
The traffic load depends on the workload.
The control traffic between the Tungsten Fabric controllers, edge routers, and
vRouters uses the XMPP with TLS and iBGP protocols. Both protocols produce low
traffic that does not affect MPLS over GRE and MPLS over UDP traffic.
However, this traffic is critical and must be reliably delivered. Mirantis
recommends configuring higher QoS for this type of traffic.
The following diagram displays both MPLS over GRE/MPLS over UDP and iBGP and
XMPP traffic examples in a MOSK cluster:
Mirantis OpenStack for Kubernetes (MOSK) provides the Tungsten Fabric
lifecycle management including pre-deployment custom configurations, updates,
data backup and restoration, as well as handling partial failure scenarios,
by means of the Tungsten Fabric operator.
This section is intended for the cloud operators who want to gain insight into
the capabilities provided by the Tungsten Fabric operator along with the
understanding of how its architecture allows for easy management while
addressing the concerns of users of Tungsten Fabric-based
MOSK clusters.
The Tungsten Fabric Operator (TFO) is based on the Kubernetes operator
SDK project. The Kubernetes operator SDK is a framework that uses the
controller-runtime library to make writing operators easier by providing
the following:
High-level APIs and abstractions to write the operational logic more
intuitively.
Tools for scaffolding and code generation to bootstrap a new project fast.
Extensions to cover common operator use cases.
The TFO deploys the following sub-operators. Each sub-operator handles a
separate part of a TF deployment:
TFControl
Deploys the Tungsten Fabric control services, such as:
Control
DNS
Control NodeManager
TFConfig
Deploys the Tungsten Fabric configuration services, such as:
API
Service monitor
Schema transformer
Device manager
Configuration NodeManager
Database NodeManager
TFAnalytics
Deploys the Tungsten Fabric analytics services, such as:
API
Collector
Alarm
Alarm-gen
SNMP
Topology
Alarm NodeManager
Database NodeManager
SNMP NodeManager
TFVrouter
Deploys a vRouter on each compute node with the following services:
vRouter Agent
NodeManager
TFWebUI
Deploys the following web UI services:
Web server
Job server
TFTool
Deploys the following tools for debug purposes:
TF-CLI
CTools
TFTest
An operator to run Tempest tests.
Besides the sub-operators that deploy TF services, TFO uses operators to deploy
and maintain third-party services, such as different types of storage, cache,
message system, and so on. The following table describes all third-party
operators:
The resource of kind TFOperator (TFO) is a custom resource (CR) defined by
a resource of kind CustomResourceDefinition.
The CustomResourceDefinition resource in Kubernetes uses the OpenAPI
Specification (OAS) version 2 to specify the schema of the defined resource.
The Kubernetes API outright rejects the resources that do not pass this schema
validation. Along with schema validation, TFOperator uses
ValidatingAdmissionWebhook for extended validations when a CR is created
or updated.
Tungsten Fabric Operator uses ValidatingAdmissionWebhook to validate
environment variables set to Tungsten Fabric components upon the TFOperator
object creation or update. The following validations are performed:
Environment variables passed to TF components containers
Mapping between tfVersion and tfImageTag, if defined
Schedule and data capacity format for tf-dbBackup
If required, you can disable ValidatingAdmissionWebhook through the
TFOperator HelmBundle resource:
Mirantis OpenStack for Kubernetes (MOSK) allows you to easily adapt
your Tungsten Fabric deployment to the needs of your environment through the
TFOperator custom resource.
This section includes custom configuration details available to you.
By default, Tungsten Fabric Operator sets up the following resource limits for
Cassandra analytics and configuration StatefulSets:
Limits:
  cpu: 8
  memory: 32Gi
Requests:
  cpu: 1
  memory: 16Gi
This is a verified configuration suitable for most cases. However, if nodes
are under a heavy load, the KubeContainerCPUThrottlingHigh StackLight alert
may raise for Tungsten Fabric Pods of the tf-cassandra-analytics and
tf-cassandra-config StatefulSets. If such alerts appear constantly, you can
increase the limits through the TFOperator CR. For example:
To specify custom configurations for Cassandra clusters, use the
configOptions settings in the TFOperator CR. For example, you may need
to increase the file cache size in case of a heavy load on the nodes labeled
with tfanalyticsdb=enabled or tfconfigdb=enabled:
To specify custom settings for the Tungsten Fabric (TF) vRouter nodes, for
example, to change the name of the tunnel network interface or enable debug
level logging on some subset of nodes, use the customSpecs settings in
the TFOperator CR.
For example, to enable debug level logging on a specific node or multiple
nodes:
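The following sketch illustrates the idea; the exact field names inside a customSpecs entry (label, containers, env) are assumptions based on the description below:

spec:
  controllers:
    agent:
      customSpecs:
        - name: vrouter-debug           # must be a valid DNS subdomain name
          label:
            name: <NODE-LABEL>
            value: <NODE-LABEL-VALUE>
          containers:
            - name: agent
              env:
                - name: LOG_LEVEL
                  value: SYS_DEBUG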
The customSpecs:name value must follow the RFC 1123
international format. Verify that the name of a DaemonSet object
is a valid DNS subdomain name.
The customSpecs parameter inherits all settings for the tf-vrouter
containers that are set on the spec:controllers:agent level and overrides
or adds additional parameters. The example configuration above overrides the
logging level from SYS_INFO, which is the default logging level, to
SYS_DEBUG.
For clusters with a multi-rack architecture, you may need to redefine the
gateway IP for the Tungsten Fabric vRouter nodes using the VROUTER_GATEWAY
parameter. For details, see Multi-rack architecture.
By default, the TF control service uses the management interface for
the BGP and XMPP traffic. You can change the control service interface
using the controlInterface parameter in the TFOperator CR, for example,
to combine the BGP and XMPP traffic with the data (tenant) traffic:
Tungsten Fabric implements cloud tenants’ virtual networks as Layer 3 overlays.
Tenant traffic gets encapsulated into one of the supported protocols and is
carried over the infrastructure network between 2 compute nodes or a compute
node and an edge router device.
In addition, Tungsten Fabric is capable of exchanging encapsulated traffic with
external systems in order to build advanced virtual networking topologies,
for example, BGP VPN connectivity between 2 MOSK clouds or a
MOSK cloud and a cloud tenant premises.
MOSK supports the following encapsulation protocols:
MPLS over Generic Routing Encapsulation (GRE)
A traditional encapsulation method supported by several router vendors,
including Cisco and Juniper. The feature is applicable when other
encapsulation methods are not available. For example, an SDN gateway
runs software that does not support MPLS over UDP.
MPLS over User Datagram Protocol (UDP)
A variation of the MPLS over GRE mechanism. It is the default and the most
frequently used option in MOSK. MPLS over UDP replaces
the GRE headers with UDP headers. In this case, the UDP source port stores
a hash of the packet payload (entropy), which provides a significant benefit
for equal-cost multi-path (ECMP) routing load balancing. MPLS over UDP and
MPLS over GRE transfer Layer 3 traffic only.
Virtual Extensible LAN (VXLAN) TechPreview
The combination of VXLAN and EVPN technologies is often used for creating
advanced cloud networking topologies. For example, it can provide
transparent Layer 2 interconnections between Virtual Network Functions
running on top of the cloud and physical traffic generator appliances hosted
somewhere else.
The ENCAP_PRIORITY parameter defines the priority in which the
encapsulation protocols are attempted when setting up the BGP VPN
connectivity between the cloud and external systems.
By default, the encapsulation order is set to MPLSoUDP,MPLSoGRE,VXLAN.
The cloud operator can change it depending on their needs in the TFOperator
custom resource, as illustrated in Configuring encapsulation.
The list of supported encapsulated methods along with their order is shared
between BGP peers as part of the capabilities information exchange when
establishing a BGP session. Both parties must support the same encapsulation
methods to build a tunnel for the network traffic.
For example, if the cloud operator wants to set up a Layer 2 VPN between the
cloud and their network infrastructure, they configure the cloud’s virtual
networks with VXLAN identifiers (VNIs) and do the same on the other side,
for example, on a network switch. Also, VXLAN must be set in the first position
in encapsulation priority order. Otherwise, VXLAN tunnels will not get
established between endpoints, even though both endpoints may support the VXLAN
protocol.
However, setting VXLAN first in the encapsulation priority order will not
enforce VXLAN encapsulation between compute nodes or between compute nodes and
gateway routers that use Layer 3 VPNs for communication.
The TFOperator custom resource allows you to define encapsulation settings
for your Tungsten Fabric cluster.
Important
The TFOperator CR must be the only place to configure
the cluster encapsulation. Performing these configurations through
the TF web UI, CLI, or API does not provide the configuration persistency,
and the settings defined this way may get reset to defaults during the
cluster services restart or update.
Note
Defining the default values for encapsulation parameters in the TF operator
CR is unnecessary.
In the routing fabric of a data center, a MOSK cluster
with Tungsten Fabric enabled can be represented either by a separate
Autonomous System (AS)
or as part of a bigger autonomous system. In either case, Tungsten Fabric
needs to participate in the BGP peering, exchanging routes with external
devices and within the cloud.
The Tungsten Fabric Controller acts as an internal (iBGP) route reflector for
the cloud’s AS by populating /32 routes pointing to VMs across all compute
nodes as well as the cloud’s edge gateway devices in case they belong to the
same AS. Apart from being an iBGP route reflector for the cloud’s AS, the
Tungsten Fabric Controller can act as a BGP peer for autonomous systems
external to the cloud, for example, for the AS configured across the data
center’s leaf-spine fabric.
The Autonomous System Number (ASN) setting contains the unique identifier
of the autonomous system that the MOSK cluster with
Tungsten Fabric belongs to. The ASN number does not affect the internal
iBGP communication between vRouters running on the compute nodes. Such
communication will work regardless of the ASN settings. However,
any network appliance that is not managed by the Tungsten Fabric control plane
must have its BGP configuration set manually. Therefore, the ASN settings must
be configured consistently on both sides. Otherwise, it will be impossible to
establish BGP sessions, regardless of whether the external device
peers with Tungsten Fabric over iBGP or eBGP.
The TFOperator custom resource enables you to define ASN settings for
your Tungsten Fabric cluster.
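As an illustration only, the ASN could be set through an environment variable
of the Tungsten Fabric configuration provisioner, similarly to the
encapsulation example above. The parameter placement and the value below are
assumptions and may differ between TFOperator API versions:

spec:
  controllers:
    tf-config:
      provisioner:
        containers:
          - name: provisioner
            env:
              - name: BGP_ASN
                value: "64512"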
Important
The TFOperator CR must be the only place to configure
the cluster ASN. Performing these configurations through the TF web UI,
CLI, or API does not provide the configuration persistency, and the
settings defined this way may get reset to defaults during the cluster
services restart or update.
Note
Defining the default values for ASN parameters in the TF operator
CR is unnecessary.
By default, the TF tf-control-dns-external service is created to expose
the DNS service of the Tungsten Fabric control plane. You can disable creation
of this service using the enableDNSExternal parameter in the TFOperator CR.
For example:
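The sketch below assumes that the parameter resides in the settings section of
the TFOperator spec; verify the exact path against the TFOperator CR reference
for your release:

spec:
  settings:
    enableDNSExternal: false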
If an edge router is accessible from the data plane through a gateway, define
the VROUTER_GATEWAY parameter in the TFOperator custom resource.
Otherwise, the default system gateway is used.
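For example, the gateway could be passed as an environment variable of the
vRouter agent containers. The container layout and the IP address below are
illustrative assumptions:

spec:
  controllers:
    tf-vrouter:
      agent:
        containers:
          - name: agent
            env:
              - name: VROUTER_GATEWAY
                value: "10.32.12.1"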
By default, MOSK deploys image precaching DaemonSets
to minimize possible downtime when updating container images. You can disable
creation of these DaemonSets by setting the imagePreCaching parameter in
the TFOperator custom resource to false:
spec:
  settings:
    imagePreCaching: false
When you disable imagePreCaching, the Tungsten Fabric Operator does not
automatically remove the image precaching DaemonSets that have already been
created. These DaemonSets do not affect the cluster setup. To remove them
manually:
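For example, assuming the DaemonSets reside in the Tungsten Fabric namespace
and can be identified by their names (both the namespace and the name pattern
below are illustrative), you can locate and delete them with kubectl:

# List DaemonSets in the Tungsten Fabric namespace and locate the precaching ones
kubectl -n tf get daemonsets
# Delete a precaching DaemonSet by its name (name is illustrative)
kubectl -n tf delete daemonset <precaching-daemonset-name>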
Available since MOSK 23.2 for Tungsten Fabric 21.4 only. TechPreview
Graceful restart and long-lived graceful restart are vital mechanisms
within BGP (Border Gateway Protocol) routing, designed to optimize
the convergence of routing tables in scenarios where a BGP router restarts or
a networking failure is experienced, leading to interruptions of router
peering.
During a graceful restart, a router can signal its BGP peers about its
impending restart, requesting them to retain the routes it had previously
advertised as active. This allows for seamless network operation and minimal
disruption to data forwarding during the router downtime.
The long-lived aspect of the long-lived graceful restart extends
the graceful restart effectiveness beyond the usual restart duration.
This extension provides an additional layer of resilience and stability
to BGP routing updates, bolstering the network's ability to manage
unforeseen disruptions.
Caution
Mirantis does not generally recommend using the graceful restart
and long-lived graceful restart features with the Tungsten Fabric XMPP
helper, unless the configuration is done by proficient operators with
at-scale expertise in the networking domain and exclusively to address specific
corner cases.
Configuring graceful restart and long-lived graceful restart
Tungsten Fabric Operator allows for easy enablement and configuration
of the graceful restart and long-lived graceful restart features through
the TFOperator custom resource:
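A minimal sketch of such a configuration is shown below. The grouping of the
parameters under a gracefulRestart section is an assumption; consult the
TFOperator CR reference for your release for the exact schema. The individual
parameters are described in the table that follows.

spec:
  settings:
    gracefulRestart:
      enabled: true
      bgpHelperEnabled: false
      xmppHelperEnabled: false
      restartTime: 300
      llgrRestartTime: 300
      endOfRibTimeout: 300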
Graceful restart and long-lived graceful restart settings
Parameter
Default value
Description
enabled
false
Enables or disables graceful restart and long-lived graceful
restart features.
bgpHelperEnabled
false
Specifies the time interval during which the Tungsten Fabric control services
act as a graceful restart helper to the edge router or any other BGP
peer by retaining the routes learned from this peer and advertising
them to the rest of the network as applicable.
Note
The BGP peer should support and be configured with graceful
restart for all of the address families used.
xmppHelperEnabled
false
Specifies the time interval during which the datapath agent should retain
the last route path from the Tungsten Fabric Controller when an
XMPP-based connection is lost.
restartTime
300
Configures a non-zero restart time, in seconds, that is advertised to peers
as part of the graceful restart capability.
llgrRestartTime
300
Specifies the amount of time in seconds the vRouter datapath should keep
advertised routes from the Tungsten Fabric control services, when
an XMPP connection between the control and vRouter agent services is lost.
Note
When graceful restart and long-lived graceful restart
are both configured, the duration of the long-lived graceful
restart timer is the sum of both timers.
endOfRibTimeout
300
Specifies the amount of time in seconds a control node waits to remove
stale routes from a vRouter agent Routing Information Base (RIB).
Tungsten Fabric (TF) uses Cassandra and ZooKeeper to store its data.
Cassandra is a fault-tolerant and horizontally scalable database that provides
persistent storage of configuration and analytics data. ZooKeeper is used by
TF for allocation of unique object identifiers and transactions implementation.
To prevent data loss, Mirantis recommends that you simultaneously back up
the ZooKeeper database dedicated to configuration services and the Cassandra
database.
The database backup must be consistent across all systems
because the state of the Tungsten Fabric databases is associated with
other system databases, such as OpenStack databases.
MOSK enables you to perform the automatic TF
data backup in the JSON format using the tf-dbbackup-job cron job.
By default, it is disabled. To back up the TF databases, enable
tf-dbBackup in the TF Operator custom resource:
spec:
  controllers:
    tf-dbBackup:
      enabled: true
By default, the tf-dbbackup-job job is scheduled for weekly execution,
allocating PVC of 5 Gi size for storing backups and keeping 5 previous
backups. To configure the backup parameters according to the needs of your
cluster, use the following structure:
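A sketch of such a configuration is shown below. The parameter names
dataCapacity, schedule, and storedBackups are assumptions used for
illustration; verify them against the TFOperator CR reference for your release:

spec:
  controllers:
    tf-dbBackup:
      enabled: true
      dataCapacity: 10Gi      # PVC size allocated for backups
      schedule: "0 2 * * 0"   # cron format, weekly by default
      storedBackups: 5        # number of previous backups to keep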
This section explains the specifics of the Tungsten Fabric services provided
by Mirantis OpenStack for Kubernetes (MOSK). The list of the services
and their supported features included in this section is not exhaustive and is
constantly amended based on the complexity of the architecture and use of
a particular service.
MOSK provides the integration of Octavia with Tungsten Fabric
through the OpenStack Octavia driver that uses Tungsten Fabric HAProxy as
a back end.
The Tungsten Fabric-based MOSK deployment supports
creation, update, and deletion operations with the following standard
load balancing API entities:
Load balancers
Note
For a load balancer creation operation, the driver supports
only the vip-subnet-id argument; the vip-network-id argument is
not supported.
Listeners
Pools
Health monitors
The Tungsten Fabric-based MOSK deployment does not support
the following load balancing capabilities:
L7 load balancing capabilities, such as L7 policies, L7 rules, and others
Setting specific availability zones for load balancers and their resources
Use of the UDP protocol
Operations with Octavia quotas
Operations with Octavia flavors
Warning
The Tungsten Fabric-based MOSK deployment
enables you to manage the load balancer resources by means of the OpenStack
CLI or OpenStack Horizon. Do not perform any manipulations with the load
balancer resources through the Tungsten Fabric web UI because in this case
the changes will not be reflected on the OpenStack API side.
Octavia Amphora (Amphora v2) load balancing provides a scalable and flexible
solution for load balancing in cloud environments. MOSK
deploys Amphora load balancer on each node of the OpenStack environment
ensuring that load balancing services are easily accessible, highly scalable,
and highly reliable.
Compared to the Octavia Tungsten Fabric driver for LBaaS v2 solution, Amphora
offers several advanced features including:
Full compatibility with the Octavia API, which provides a standardized
interface for load balancing in MOSK OpenStack
environments. This makes it easier to manage and integrate with other
OpenStack services.
Layer 7 policies and rules, which allow for more granular control over
traffic routing and load balancing decisions. This enables users to
optimize their application performance and improve the user experience.
Support for the UDP protocol, which is commonly used for real-time
communications and other high-performance applications. This enables
users to deploy a wider range of applications with the same load
balancing infrastructure.
By default, MOSK uses the Octavia Tungsten Fabric load
balancing. Once Octavia Amphora load balancing is enabled, the existing Octavia
Tungsten Fabric driver load balancers will continue to function normally.
However, you cannot migrate your load balancer workloads from the old LBaaS
v2 solution to Amphora.
Note
As long as MOSK provides Octavia Amphora load
balancing as a technology preview feature, Mirantis
cannot guarantee the stability of this solution and does not provide
a migration path from Tungsten Fabric load balancing (HAProxy), which
is used by default.
To enable Octavia Amphora load balancing:
Assign openstack-gateway:enabled labels to the compute nodes in either
order.
To make Amphora the default provider, specify it in the
OpenStackDeployment custom resource:
spec:
  features:
    octavia:
      default_provider: amphorav2
Verify that the OpenStack Controller has scheduled new Octavia pods that
include health manager, worker, and housekeeping pods.
kubectl get pods -n openstack -l 'application=octavia,component in (worker, health_manager, housekeeping)'
Example of output for an environment with two compute nodes:
The workflow for creating new load balancers with Amphora is identical
to the workflow for creating load balancers with Octavia Tungsten Fabric
driver for LBaaS v2. You can do it either through the OpenStack Horizon
UI or OpenStack CLI.
If you have not defined amphorav2 as default provider in the
OpenStackDeployment custom resource, you can specify it explicitly
when creating a load balancer using the provider argument:
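For example, the following OpenStack CLI call creates a load balancer with the
amphorav2 provider; the load balancer name and subnet are illustrative:

openstack loadbalancer create --name test-lb \
  --vip-subnet-id <subnet-id> \
  --provider amphorav2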
This section contains a summary of the Tungsten Fabric upstream features
and use cases not supported in MOSK,
features and use cases offered as Technology Preview
in the current product release if any, and known limitations of Tungsten
Fabric in integration with other product components.
Feature or use case
Status
Description
Tungsten Fabric web UI
Provided as is
MOSK provides the TF web UI as is and does not
include this service in the support Service Level Agreement
Automatic generation of network port records in DNSaaS
(Designate)
Not supported
As a workaround, you can use the Tungsten Fabric
built-in DNS service that enables virtual machines to resolve
each other's names
Secret management (Barbican)
Not supported
It is not possible to use the certificates stored in Barbican
to terminate HTTPs on a load balancer in a Tungsten Fabric deployment
Role Based Access Control (RBAC) for Neutron objects
Not supported
Advanced Tungsten Fabric features
Provided as is
MOSK provides the following advanced Tungsten Fabric
features as is and does not include them in the support Service
Level Agreement:
Service Function Chaining
Production ready multi-site SDN
Layer 3 multihoming
Long-Lived Graceful Restart (LLGR)
Technical Preview
DPDK
Monitoring of tf-rabbitmq
Not supported
Due to a known issue, tf-rabbitmq is not monitored on new
MOSK 22.5 clusters. The existing clusters updated to
MOSK 22.5 are not affected.
The integration between the OpenStack and TF controllers is
implemented through the shared Kubernetes openstack-tf-shared namespace.
Both controllers have access to this namespace to read and write the Kubernetes
kind: Secret objects.
The OpenStack Controller posts the data into the openstack-tf-shared
namespace required by the TF services. The TF controller watches this
namespace. Once an appropriate secret is created, the TF controller obtains it
into the internal data structures for further processing.
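For reference, a cloud operator can inspect the shared objects with standard
kubectl commands, for example:

kubectl -n openstack-tf-shared get secrets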
The OpenStack Controller includes the following data for the TF Controller:
tunnel_interface
Name of the network interface for the TF data plane. This interface
is used by TF for the encapsulated traffic for overlay networks.
Keystone authorization information
Keystone Administrator credentials and an up-and-running IAM service
are required for the TF Controller to initiate the deployment process.
Nova metadata information
Required for the TF vRouter agent service.
Also, the OpenStack Controller watches the openstack-tf-shared namespace
for the vrouter_port parameter that defines the vRouter port number and
passes it to the nova-compute pod.
The OpenStack services that are integrated with TF through their APIs
include:
neutron-server - integration is provided by the
contrail-neutron-plugin component that is used by the neutron-server
service for transformation of the API calls to the TF API compatible
requests.
nova-compute - integration is provided by the
contrail-nova-vif-driver and contrail-vrouter-api packages used
by the nova-compute service for interaction with the TF vRouter when
plugging in network ports.
octavia-api - integration is provided by the Octavia TF Driver that
enables you to use OpenStack CLI and Horizon for operations with load
balancers. See Tungsten Fabric load balancing (HAProxy) for details.
Warning
TF is not integrated with the following OpenStack services:
Depending on the size of an OpenStack environment and the components
that you use, you may want to have a single or multiple network interfaces,
as well as run different types of traffic on a single or multiple VLANs.
This section provides the recommendations for planning the network
configuration and optimizing the cloud performance.
Mirantis OpenStack for Kubernetes (MOSK) cluster networking is
complex and defined by the security requirements and performance
considerations. It is based on the Kubernetes cluster networking provided by
Mirantis Container Cloud and expanded to facilitate the demands of the
OpenStack virtualization platform.
A Container Cloud Kubernetes cluster provides a platform for
MOSK and is considered a part of its control plane. All
networks that serve Kubernetes and related traffic are considered control
plane networks. The Kubernetes cluster networking is typically focused on
connecting pods of different nodes as well as exposing the Kubernetes API and
services running in pods into an external network.
The OpenStack networking connects virtual machines to each other and the
outside world. Most of the OpenStack-related networks are considered a part of
the data plane in an OpenStack cluster. Ceph networks are considered data
plane networks for the purpose of this reference architecture.
When planning your OpenStack environment, consider the types of traffic that
your workloads generate and design your network accordingly. If you
anticipate that certain types of traffic, such as storage replication,
will likely consume a significant amount of network bandwidth, you may
want to move that traffic to a dedicated network interface to avoid
performance degradation.
The following diagram provides a simplified overview of the underlay
networking in a MOSK environment:
This page summarizes the recommended networking architecture of a Mirantis
Container Cloud management cluster for a Mirantis OpenStack for Kubernetes
(MOSK) cluster.
We recommend deploying the management cluster with a dedicated interface
for the provisioning (PXE) network. The separation of the provisioning network
from the management network ensures additional security and resilience of
the solution.
MOSK end users typically should have access to the Keycloak
service in the management cluster for authentication to the Horizon web UI.
Therefore, we recommend that you connect the management network of the
management cluster to an external network through an IP router. The default
route on the management cluster nodes must be configured with the default
gateway in the management network.
If you deploy the multi-rack configuration, ensure that the provisioning
network of the management cluster is connected to an IP router that connects
it to the provisioning networks of all racks.
Provisioning (PXE) network
Facilitates the iPXE boot of all bare metal machines in a
MOSK cluster and provisioning of the operating system
to machines.
This network is only used during provisioning of the host. It must not
be configured on an operational MOSK node.
Life-cycle management (LCM) network
Connects LCM Agents running on the hosts to the Container Cloud LCM API.
The LCM API is provided by the regional or management cluster.
The LCM network is also used for communication between kubelet
and the Kubernetes API server inside a Kubernetes cluster. The MKE
components use this network for communication inside a swarm cluster.
The LCM subnet(s) provides IP addresses that are statically allocated
by the IPAM service to bare metal hosts. This network must be connected
to the Kubernetes API endpoint of the regional cluster through an
IP router. LCM Agents running on MOSK clusters will
connect to the regional cluster API through this router. LCM subnets
may be different per MOSK cluster as long as this
connection requirement is satisfied.
You can use more than one LCM network segment in a MOSK
cluster. In this case, separated L2 segments and interconnected L3
subnets are still used to serve LCM and API traffic.
All IP subnets in the LCM networks must be connected to each other
by IP routes. These routes must be configured on the hosts through
L2 templates.
All IP subnets in the LCM network must be connected to the Kubernetes
API endpoints of the management or regional cluster through an IP
router.
You can manually select the load balancer IP address for external
access to the cluster API and specify it in the Cluster object
configuration. Alternatively, you can allocate a dedicated IP range
for a virtual IP of the cluster API load balancer by adding a Subnet
object with a special annotation. Mirantis recommends that this subnet
stays unique per MOSK cluster.
For details, see Create subnets.
Note
When using the L2 announcement of the IP address for the
cluster API load balancer, the following limitations apply:
Only one of the LCM networks can contain the API endpoint.
This network is called API/LCM throughout this documentation.
It consists of a VLAN segment stretched between all Kubernetes
master nodes in the cluster and the IP subnet that provides
IP addresses allocated to these nodes.
The load balancer IP address must be allocated from the same
subnet CIDR address that the LCM subnet uses.
When using the BGP announcement of the IP address for the cluster API
load balancer, which is available as Technology Preview since
MOSK 23.2.2, no segment stretching is required
between Kubernetes master nodes. Also, in this scenario, the load
balancer IP address is not required to match the LCM subnet CIDR address.
Kubernetes workloads network
Serves as an underlay network for traffic between pods in
the MOSK cluster. Do not share this network between
clusters.
There might be more than one Kubernetes pods network segment in the cluster.
In this case, they must be connected through an IP router.
Kubernetes workloads network does not need external access.
The Kubernetes workloads subnet(s) provides IP addresses that
are statically allocated by the IPAM service to all nodes and that
are used by Calico for cross-node communication inside a cluster.
By default, VXLAN overlay is used for Calico cross-node communication.
Kubernetes external network
Serves for access to the OpenStack endpoints in a MOSK
cluster.
When using the L2 (ARP) announcement of the external endpoints of
load-balanced services, the network must contain a VLAN segment
extended to all MOSK nodes connected to this network.
When using the BGP announcement of the external endpoints of
load-balanced services, which is available as Technology Preview since
MOSK 23.2.2, there is no requirement of having
a single VLAN segment extended to all MOSK nodes
connected to this network.
A typical MOSK cluster only has one external network.
The external network must include at least two IP address ranges
defined by separate Subnet objects in Container Cloud API:
MOSK services address range
Provides IP addresses for externally available
load-balanced services, including OpenStack API endpoints.
External address range
Provides IP addresses to be assigned to network interfaces
on all cluster nodes that are connected to this network.
MetalLB speakers must run on the same nodes. For details, see
Configure the MetalLB speaker node selector.
This is required for external traffic to return to the originating
client. The default route on the MOSK nodes that
are connected to the external network must be configured with the
default gateway in the external network.
Storage access network
Serves for the storage access traffic from and to Ceph OSD services.
A MOSK cluster may have more than one VLAN segment
and IP subnet in the storage access network. All IP subnets of this
network in a single cluster must be connected by an IP router.
The storage access network does not require external access unless
you want to directly expose Ceph to the clients outside of a
MOSK cluster.
Note
A direct access to Ceph by the clients outside of a
MOSK cluster is technically possible but not
supported by Mirantis. Use at your own risk.
The IP addresses from subnets in this network are statically allocated
by the IPAM service to Ceph nodes. The Ceph OSD services bind to these
addresses on their respective nodes.
Storage replication network
Serves for the storage replication traffic between Ceph OSD services.
A MOSK cluster may have more than one VLAN segment
and IP subnet in this network as long as the subnets are connected
by an IP router.
This network does not require external access.
The IP addresses from subnets in this network are statically allocated
by the IPAM service to Ceph nodes.
The Ceph OSD services bind to these addresses on their respective nodes.
This section describes network types for Layer 3 networks used for Kubernetes
and Mirantis OpenStack for Kubernetes (MOSK) clusters along with
requirements for each network type.
Note
Only IPv4 is currently supported by Container Cloud and IPAM
for infrastructure networks. IPv6 is not supported and not used
in Container Cloud and MOSK underlay
infrastructure networks.
The following diagram provides an overview of the underlay networks in a
MOSK environment:
A MOSK deployment typically requires the following types of
networks:
Provisioning network
Used for provisioning of bare metal servers.
Management network
Used for management of the Container Cloud infrastructure and for
communication between containers in Kubernetes.
LCM/API network
Must be configured on the Kubernetes manager nodes of the cluster. Contains
the Kubernetes API endpoint with the VRRP virtual IP address. Enables
communication between the MKE cluster nodes.
LCM network
Enables communication between the MKE cluster nodes. Multiple VLAN
segments and IP subnets can be created for a multi-rack architecture. Each
server must be connected to one of the LCM segments and have an IP from
the corresponding subnet.
External network
Used to expose the OpenStack, StackLight, and other services of the
MOSK cluster.
Kubernetes workloads network
Used for communication between containers in Kubernetes.
Storage access network (Ceph)
Used for accessing the Ceph storage. In Ceph terms, this is a public
network. We recommend placing it on a dedicated hardware
interface.
Storage replication network (Ceph)
Used for Ceph storage replication. In Ceph terms, this is a cluster
network. To ensure low latency and fast access, place the network on a
dedicated hardware interface.
Provider networks (Networking service)
Typically, a routable network used to provide the external access to
OpenStack instances (a floating network). Can be used by the OpenStack
services, such as Ironic, Manila, and others, to connect their
management resources.
Recommended bridge name: pr-floating
Overlay networks, or virtual networks (Networking service)
The network used to provide isolated, secure tenant networks with the
help of the tunneling mechanism (VLAN/GRE/VXLAN). If the VXLAN or GRE
encapsulation takes place, the IP address assignment is required on
interfaces at the node level.
Recommended bridge name: neutron-tunnel
Live migration network (Compute service)
The network used by the OpenStack compute service (Nova) to transfer
data during live migration. Depending on the cloud needs, it can be
placed on a dedicated physical network not to affect other networks
during live migration. The IP address assignment is required on
interfaces at the node level.
Recommended bridge name: lm-vlan
How the logical networks described above map to physical
networks and interfaces on nodes depends on the cloud size and configuration.
We recommend placing OpenStack networks on a dedicated physical interface
(bond) that is not shared with storage and Kubernetes management network
to minimize the influence on each other.
The bridge interface with this name is mandatory if you need to separate
Kubernetes workloads traffic. You can configure this bridge over the VLAN or
directly over the bonded or single interface.
Routing to all IP subnets of the Storage access network
Routing to all IP subnets of the Storage replication network
Note
When selecting externally routable subnets, ensure that the subnet
ranges do not overlap with the internal subnets ranges. Otherwise, internal
resources of users will not be available from the MOSK
cluster.
Mirantis OpenStack for Kubernetes (MOSK) enables you to deploy a
cluster with a multi-rack architecture, where every data center cabinet
(a rack), incorporates its own Layer 2 network infrastructure that does not
extend beyond its top-of-rack switch. The architecture allows a
MOSK cloud to integrate natively with the Layer 3-centric
networking topologies such as Spine-Leaf
that are commonly seen in modern data centers.
The architecture eliminates the need to stretch and manage VLANs across
parts of a single data center, or to build VPN tunnels between the segments of
a geographically distributed cloud.
The set of networks present in each rack depends on the OpenStack
networking service back end in use.
The multi-rack architecture in Mirantis Container Cloud and
MOSK requires
additional configuration of networking infrastructure. Every Layer 2 domain,
or rack, needs to have a DHCP relay agent configured on its dedicated
segment of the Common/PXE network (lcm-nw VLAN). The agent
handles all Layer-2 DHCP requests incoming from the bare metal servers living
in the rack and forwards them as Layer-3 packets across the data center fabric
to a Mirantis Container Cloud regional cluster.
Based on the address of the DHCP agent that relays a request from
a server, Mirantis Container Cloud will automatically allocate
an IP address in the corresponding subnet.
For the networks types other than Common/PXE, you need to define subnets using
the Mirantis Container Cloud L2 templates.
Every rack needs to have a dedicated set of L2 templates,
each template representing a specific server role and configuration.
A typical medium-sized or larger MOSK cloud consists of three
or more racks that can generally be divided into the following major
categories:
Compute/Storage racks that contain the hypervisors and instances running on
top of them. Additionally, they contain nodes that store cloud applications’
block, ephemeral, and object data as part of the Ceph cluster.
Control plane racks that incorporate all the components needed by the cloud
operator to manage its life cycle. Also, they include the services through
which the cloud users interact with the cloud to deploy their applications,
such as cloud APIs and web UI.
A control plane rack may also contain additional compute and storage nodes.
The diagram below will help you to plan the networking layout of a multi-rack
MOSK cloud with Tungsten Fabric.
For MOSK 23.1 and older versions, Kubernetes masters
(3 nodes) either need to be placed into a single rack or, if distributed
across multiple racks for better availability, require stretching of the
L2 segment of the management network across these racks.
This requirement is caused by the Mirantis Kubernetes Engine underlay for
MOSK relying on the Layer 2 VRRP protocol to ensure high
availability of the Kubernetes API endpoint.
The table below provides a mapping between the racks and the network types
participating in a multi-rack MOSK cluster with the
Tungsten Fabric back end.
Networks and VLANs for a multi-rack MOSK
cluster with TF
This section summarizes the requirements for the physical layout of underlay
network and VLANs configuration for the multi-rack architecture of
Mirantis OpenStack for Kubernetes (MOSK).
Physical networking of a Container Cloud management cluster
Due to limitations of virtual IP address for Kubernetes API and of MetalLB
load balancing in Container Cloud, the management cluster nodes must share
VLAN segments in the provisioning and management networks.
In the multi-rack architecture, the management cluster nodes may be placed into
a single rack or spread across three racks. In either case, provisioning and
management network VLANs must be stretched across ToR switches of the racks.
The following diagram illustrates physical and L2 connections of
the Container Cloud management cluster.
Due to limitations of MetalLB load balancing, all MOSK
cluster nodes connected to the external network must share the VLAN segment
in the external network.
In the multi-rack architecture, the external network VLAN must be
stretched to the ToR switches of all the racks where nodes connected to the
external network are located. All other VLANs may be configured per rack.
Due to limitations of using a virtual IP address for Kubernetes API, the
Kubernetes manager nodes must share the VLAN segment in the API/LCM network.
In the multi-rack architecture, Kubernetes manager nodes may be spread across
three racks. The API/LCM network VLAN must be stretched to the ToR switches
of the racks. All other VLANs may be configured per rack.
The following diagram illustrates physical and L2 network connections
of the Kubernetes manager nodes in a MOSK cluster.
Caution
Such configuration does not apply to a compact control plane
MOSK installation. See Create a MOSK cluster.
To improve the goodput, we recommend that you enable jumbo frames where
possible. Jumbo frames have to be enabled along the whole path that the packets
traverse. If one of the network components cannot handle jumbo frames, the
network path uses the smallest MTU.
To provide fault tolerance of a single NIC, we recommend using the link
aggregation, such as bonding. The link aggregation is useful for linear
scaling of bandwidth, load balancing, and fault protection. Depending
on the hardware equipment, different types of bonds might be supported.
Use multi-chassis link aggregation as it provides fault tolerance
at the device level. For example, MLAG on Arista equipment or vPC on
Cisco equipment.
The Linux kernel supports the following bonding modes:
active-backup
balance-xor
802.3ad (LACP)
balance-tlb
balance-alb
Since LACP is the IEEE standard 802.3ad supported by the majority of
network platforms, we recommend using this bonding mode.
Use the Link Aggregation Control Protocol (LACP) bonding mode
with MC-LAG domains configured on ToR switches. This corresponds to
the 802.3ad bond mode on hosts.
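For reference, a minimal netplan-style bond definition of the kind used in
Container Cloud L2 templates could look as follows; the interface names are
illustrative assumptions:

bonds:
  bond0:
    interfaces:
      - enp9s0f0   # port on the first multi-port NIC (illustrative name)
      - enp10s0f0  # port on the second multi-port NIC (illustrative name)
    parameters:
      mode: 802.3ad             # LACP, matching the MC-LAG domain on ToR switches
      lacp-rate: fast
      mii-monitor-interval: 100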
Additionally, follow these recommendations in regards to bond interfaces:
Use ports from different multi-port NICs when creating bonds. This makes
network connections redundant if failure of a single NIC occurs.
Configure the ports that connect servers to the PXE network with PXE VLAN
as native or untagged. On these ports, configure LACP fallback to ensure
that the servers can reach DHCP server and boot over network.
Configure Spanning Tree Protocol (STP) settings on the network switch ports to
ensure that the ports start forwarding packets as soon as the link comes up.
It helps avoid iPXE timeout issues and ensures reliable boot over network.
A MOSK cluster uses Ceph as a distributed storage system
for file, block, and object storage. This section provides an overview of a
Ceph cluster deployed by Container Cloud.
Mirantis Container Cloud deploys Ceph on MOSK using Helm
charts with the following components:
Rook Ceph Operator
A storage orchestrator that deploys Ceph on top of a Kubernetes cluster. Also
known as Rook or RookOperator. Rook operations include:
Deploying and managing a Ceph cluster based on provided Rook CRs such as
CephCluster, CephBlockPool, CephObjectStore, and so on.
Orchestrating the state of the Ceph cluster and all its daemons.
KaaSCephCluster custom resource (CR)
Represents the customization of a Kubernetes installation and allows you to
define the required Ceph configuration through the Container Cloud web UI
before deployment. For example, you can define the failure domain, Ceph pools,
Ceph node roles, number of Ceph components such as Ceph OSDs, and so on.
The ceph-kcc-controller controller on the Container Cloud management
cluster manages the KaaSCephCluster CR.
Ceph Controller
A Kubernetes controller that obtains the parameters from Container Cloud
through a CR, creates CRs for Rook and updates its CR status based on the Ceph
cluster deployment progress. It creates users, pools, and keys for OpenStack
and Kubernetes and provides Ceph configurations and keys to access them. Also,
Ceph Controller eventually obtains the data from the OpenStack Controller for
the Keystone integration and updates the Ceph Object Gateway services
configurations to use Kubernetes for user authentication. Ceph Controller
operations include:
Transforming user parameters from the Container Cloud Ceph CR into Rook CRs
and deploying a Ceph cluster using Rook.
Providing integration of the Ceph cluster with Kubernetes.
Providing data for OpenStack to integrate with the deployed Ceph cluster.
Ceph Status Controller
A Kubernetes controller that collects all valuable parameters from the current
Ceph cluster, its daemons, and entities and exposes them into the
KaaSCephCluster status. Ceph Status Controller operations include:
Collecting all statuses from a Ceph cluster and corresponding Rook CRs.
Collecting additional information on the health of Ceph daemons.
Providing information to the status section of the KaaSCephCluster
CR.
Ceph Request Controller
A Kubernetes controller that obtains the parameters from Container Cloud
through a CR and handles Ceph OSD lifecycle management (LCM) operations. It
allows for a safe Ceph OSD removal from the Ceph cluster. Ceph Request
Controller operations include:
Providing an ability to perform Ceph OSD LCM operations.
Obtaining specific CRs to remove Ceph OSDs and executing them.
Pausing the regular Ceph Controller reconciliation until all requests are
completed.
A typical Ceph cluster consists of the following components:
Ceph Monitors - three or, in rare cases, five Ceph Monitors.
Ceph Managers - one Ceph Manager in a regular cluster.
Ceph Object Gateway (radosgw) - Mirantis recommends having three or more
radosgw instances for HA.
Ceph OSDs - the number of Ceph OSDs may vary according to the deployment
needs.
Warning
A Ceph cluster with 3 Ceph nodes does not provide
hardware fault tolerance and is not eligible
for recovery operations,
such as a disk or an entire Ceph node replacement.
A Ceph cluster uses the replication factor that equals 3.
If the number of Ceph OSDs is less than 3, a Ceph cluster
moves to the degraded state with the write operations
restriction until the number of alive Ceph OSDs
equals the replication factor again.
The placement of Ceph Monitors and Ceph Managers is defined in the
KaaSCephCluster CR.
The following diagram illustrates the way a Ceph cluster is deployed in
Container Cloud:
The following diagram illustrates the processes within a deployed Ceph cluster:
A Ceph cluster configuration in MOSK is subject to,
but not limited to, the following limitations:
Only one Ceph Controller per MOSK cluster
and only one Ceph cluster per Ceph Controller are supported.
The replication size for any Ceph pool must be set to more than 1.
Only one CRUSH tree per cluster. The separation of devices per Ceph pool is
supported through device classes
with only one pool of each type for a device class.
All CRUSH rules must have the same failure_domain.
Only the following types of CRUSH buckets are supported:
topology.kubernetes.io/region
topology.kubernetes.io/zone
topology.rook.io/datacenter
topology.rook.io/room
topology.rook.io/pod
topology.rook.io/pdu
topology.rook.io/row
topology.rook.io/rack
topology.rook.io/chassis
RBD mirroring is not supported.
Consuming an existing Ceph cluster is not supported.
CephFS is not supported.
Only IPv4 is supported.
If two or more Ceph OSDs are located on the same device, there must be no
dedicated WAL or DB for this class.
Only a full collocation or dedicated WAL and DB configurations are supported.
The minimum size of any defined Ceph OSD device is 5 GB.
Reducing the number of Ceph Monitors is not supported and causes the Ceph
Monitor daemons removal from random nodes.
Ceph cluster does not support removable devices (with hotplug enabled) for
deploying Ceph OSDs.
When adding a Ceph node with the Ceph Monitor role, if any issues occur with
the Ceph Monitor, rook-ceph removes it and adds a new Ceph Monitor instead,
named using the next alphabetic character in order. Therefore, the Ceph Monitor
names may not follow the alphabetical order. For example, a, b, d,
instead of a, b, c.
The integration between Ceph and OpenStack controllers is implemented
through the shared Kubernetes openstack-ceph-shared namespace.
Both controllers have access to this namespace to read and write
the Kubernetes kind: Secret objects.
As Ceph is the required and only supported back end for several OpenStack
services, all necessary Ceph pools must be specified in the configuration
of the kind: MiraCeph custom resource as part of the deployment.
Once the Ceph cluster is deployed, the Ceph Controller posts the
information required by the OpenStack services to be properly configured
as a kind: Secret object into the openstack-ceph-shared namespace.
The OpenStack Controller watches this namespace. Once the corresponding
secret is created, the OpenStack Controller transforms this secret to the
data structures expected by the OpenStack-Helm charts. Even if an OpenStack
installation is triggered at the same time as a Ceph cluster deployment, the
OpenStack Controller halts the deployment of the OpenStack services that
depend on Ceph availability until the secret in the shared namespace is
created by the Ceph Controller.
For the configuration of Ceph Object Gateway as an OpenStack Object
Storage, the reverse process takes place. The OpenStack Controller waits
for the OpenStack-Helm to create a secret with OpenStack Identity
(Keystone) credentials that Ceph Object Gateway must use to validate the
OpenStack Identity tokens, and posts it back to the same
openstack-ceph-shared namespace in the format suitable for
consumption by the Ceph Controller. The Ceph Controller then reads this
secret and reconfigures Ceph Object Gateway accordingly.
StackLight is the logging, monitoring, and alerting solution that provides a
single pane of glass for cloud maintenance and day-to-day operations as well
as offers critical insights into cloud health including operational
information about the components deployed with Mirantis OpenStack for
Kubernetes (MOSK). StackLight is based on Prometheus, an
open-source monitoring solution and a time series database, and OpenSearch, the
logs and notifications storage.
Mirantis OpenStack for Kubernetes (MOSK) deploys the StackLight stack
as a release of a Helm chart that contains the helm-controller and HelmBundle
custom resources. The StackLight HelmBundle consists of a set of Helm charts
describing the StackLight components. Apart from the OpenStack-specific
components below, StackLight also includes the components described in
Mirantis Container Cloud Reference Architecture: Deployment architecture.
By default, the StackLight logging stack is disabled.
StackLight measures, analyzes, and reports in a timely manner about failures
that may occur in the following Mirantis OpenStack for Kubernetes
(MOSK)
components and their sub-components. Apart from the components below,
StackLight also monitors the components listed in
Mirantis Container Cloud Reference Architecture: Monitored components.
Calculations in this document are based on numbers from a
real-scale test cluster with 34 nodes. The exact space required for metrics
and logs must be calculated depending on the ongoing cluster operations.
Some operations force the generation of additional metrics and logs. The
values below are approximate. Use them only as recommendations.
During the deployment of a new cluster, you must specify the OpenSearch
retention time and Persistent Volume Claim (PVC) size, Prometheus PVC,
retention time, and retention size.
When configuring an existing cluster, you can only set OpenSearch
retention time, Prometheus retention time, and retention size.
The following table describes the recommendations for both OpenSearch
and Prometheus retention size and PVC size for a cluster with 34 nodes.
Retention time depends on the space allocated for the data. To calculate
the required retention time, use the following formula:
retention time = retention size / amount of data per day
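For example, using the Prometheus figures from the table below, a cluster that
generates approximately 11 GB of data per day and has 110 GB of retention size
results in a retention time of roughly 10 days.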
Service
Required space per day
Description
OpenSearch
StackLight in non-HA mode:
202 - 253 GB for the entire cluster
~6 - 7.5 GB for a single node
StackLight in HA mode:
404 - 506 GB for the entire cluster
~12 - 15 GB for a single node
When setting Persistent Volume Claim Size for OpenSearch
during the cluster creation, take into account that it defines the PVC
size for a single instance of the OpenSearch cluster. StackLight in HA
mode has 3 OpenSearch instances. Therefore, for a total OpenSearch
capacity, multiply the PVC size by 3.
Prometheus
11 GB for the entire cluster
~400 MB for a single node
Every Prometheus instance stores the entire database. Multiple replicas
store multiple copies of the same data. Therefore, treat the Prometheus
PVC size as the capacity of Prometheus in the cluster. Do not sum them
up.
Prometheus has built-in retention mechanisms based on the database size
and time series duration stored in the database. Therefore, if you
miscalculate the PVC size, retention size set to ~1 GB less than the PVC
size will prevent disk overfilling.
StackLight integration with OpenStack includes automatic discovery of RabbitMQ
credentials for notifications and OpenStack credentials for OpenStack API
metrics. For details, see the
openstack.rabbitmq.credentialsConfig and
openstack.telegraf.credentialsConfig parameters description in
StackLight configuration parameters.
LCM operations may require measuring the downtime of cloud end user instances
to assess if SLA commitments regarding workload downtime are being met.
Additionally, continuous monitoring of network connectivity is essential
for early problem detection.
To address these needs, MOSK provides the
OpenStack workload monitoring feature through the Cloudprober exporter.
Presently, MOSK supports monitoring of floating
IP addresses exclusively through the Internet Control Message Protocol
(ICMP).
To be able to monitor instance availability, your cluster should meet
the following requirements:
IP connectivity between the network used to assign floating IP addresses
and all OpenStack control plane nodes
ICMP ingress and egress traffic allowed in operating systems on the monitored
virtual machines
ICMP ingress and egress traffic allowed in the OpenStack project by
configuring security groups
To enable the workload monitoring service, use the following
OpenStackDeployment definition:
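For example, a sketch of enabling the Cloudprober-based workload monitoring
through the OpenStackDeployment custom resource; verify the exact parameter
path against the OpenStackDeployment reference for your release:

spec:
  features:
    services:
      - cloudprober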
This section contains a collection of Mirantis OpenStack for Kubernetes
(MOSK) architecture blueprints that include common cluster
topology and configuration patterns that can be referred to when building a
MOSK cloud. Every blueprint is validated by Mirantis and
is known to work. You can use these blueprints alone or in combination,
although the interoperability of all possible combinations can not be
guaranteed.
The section provides information on the target use cases, pros and cons of
every blueprint and outlines the extents of its applicability. However, do
not hesitate to reach out to Mirantis if you have any questions or doubts
on whether a specific blueprint can be applied when designing your cloud.
Although a classic cloud approach allows resources to be distributed across
multiple regions, it still needs powerful data centers to host control planes
and compute clusters. Such regional centralization poses challenges when the
number of data consumers grows. It becomes hard to access the resources hosted
in the cloud even though the resources are located in the same geographic
region. The solution would be to bring the data closer to the consumer.
And this is exactly what edge computing provides.
Edge computing is a paradigm that brings computation and data storage closer to
the sources of data or the consumer. It is designed to improve response time
and save bandwidth.
A few examples of use cases for edge computing include:
Hosting a video stream processing application on premises of a large stadium
during the Super Bowl match
Placing the inventory or augmented reality services directly in the
industrial facilities, such as storage, power plant, shipyard, and so on
A small computation node deployed in a remote village supermarket to
host an application for store automation and accounting
These and many other use cases could be solved by deploying multiple edge
clusters managed from a single central place. The idea of centralized
management plays a significant role in the business efficiency of the edge
cloud environment:
Cloud operators obtain a single management console for the cloud that
simplifies the Day-1 provisioning of new edge sites and Day-2 operations
across multiple geographically distributed points of presence
Cloud users get the ability to transparently connect their edge applications
with central databases or business logic components hosted in data centers
or public clouds
Depending on the size, location, and target use case, the points of presence
comprising an edge cloud environment can be divided into five major categories.
Mirantis OpenStack powered by Mirantis Container Cloud offers reference
architectures to address the centralized management in core and regional data
centers as well as edge sites.
Remote compute nodes is one of the approaches to the implementation of the
edge computing concept offered by MOSK. The topology
consists of a MOSK cluster residing in a data center,
which is extended with multiple small groups of compute nodes deployed in
geographically distanced remote sites. Remote compute nodes are integrated
into the MOSK cluster just like the nodes in the central
site with their configuration and life cycle managed through the same means.
Along with compute nodes, remote sites need to incorporate network gateway
components that allow application users to consume edge services directly
without looping the traffic through the central site.
Deployment of an edge cluster managed from a single central place starts with
proper planning. This section provides recommendations on how to approach
the deployment design.
Compute nodes aggregation into availability zones
Mirantis recommends organizing nodes in each remote site into separate
Availability Zones in the MOSK Compute (OpenStack Nova),
Networking (OpenStack Neutron), and Block Storage (OpenStack Cinder)
services. This enables the cloud users to be aware of the failure domain
represented by a remote site and distribute the parts of their applications
accordingly.
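As an illustration, the compute nodes of a remote site can be aggregated into
a Nova availability zone with standard OpenStack CLI commands; the aggregate,
zone, and host names below are illustrative:

openstack aggregate create --zone edge-site-1 edge-site-1
openstack aggregate add host edge-site-1 cmp-edge-1-001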
Typically, high latency between the central control plane and remote sites
makes it infeasible to rely on Ceph as a storage for the instance
root/ephemeral and block data.
Mirantis recommends that you configure the remote sites to use the following
back ends:
Local storage (LVM or QCOW2) as a storage back end for the
MOSK Compute service. See images-storage-back-end
for the configuration details.
LVM on iSCSI back end for the MOSK Block Storage service.
See Enable LVM block storage for the enablement procedure.
To maintain the small size of a remote site, the compute nodes need to be
hyper-converged and combine the compute and block storage functions.
There is no limitation on the number of the remote sites and their size.
However, when planning the cluster, ensure consistency between the total number
of nodes managed by a single control plane and the value of the size
parameter set in the OpenStackDeployment custom resource. For the list of
supported sizes, refer to Main elements.
Additionally, the sizing of the remote site needs to take into account the
characteristics of the networking channel with the main site.
Typically, an edge site consists of 3-7 compute nodes installed in a single,
usually rented, rack.
Mirantis recommends keeping the network latency between the main and remote
sites as low as possible. For stable interoperability of cluster components,
the latency needs to be around 30-70 milliseconds. However, depending on the
cluster configuration and dynamism of the workloads running in the remote site,
the stability of the cluster can be preserved with the latency of up to 190
milliseconds.
The bandwidth of the communication channel between the main and remote sites
needs to be sufficient to run the following traffic:
The control plane and management traffic, such as OpenStack messaging,
database access, MOSK underlay Kubernetes cluster control
plane, and so on. A single remote compute node in the idle state requires at
minimum 1.5 Mbit/s of bandwidth to perform the non-data plane communications.
The data plane traffic, such as OpenStack image operations, instances VNC
console traffic, and so on, that heavily depend on the profile of the
workloads and other aspects of the cloud usage.
In general, Mirantis recommends having a minimum of 100 MBit/s bandwidth
between the main and remote sites.
MOSK remote compute nodes architecture is designed to
tolerate a temporary loss of connectivity between the main cluster and
the remote sites. In case of a disconnection, the instances running
on remote compute nodes will keep running normally preserving their
ability to read and write ephemeral and block storage data presuming it
is located in the same site, as well as connectivity to their neighbours
and edge application users. However, the instances will not have access
to any cloud services or applications located outside of their remote site.
Since the MOSK control plane communicates with remote
compute nodes through the same network channel, cloud users will not be able
to perform any manipulations, for example, instance creation, deletion,
snapshotting, and so on, over their edge applications until the
connectivity gets restored. MOSK services providing high
availability to cloud applications, such as the Instance HA service and Network
service, need to be connected to the remote compute nodes to perform a failover
of application components running in the remote site.
Once the connectivity between the main and the remote site restores, all
functions become available again. The period during which an edge application
can sustain normal function after a connectivity loss is determined by multiple
factors including the selected networking back end for the
MOSK cluster. Mirantis recommends that a cloud operator
performs a set of test manipulations over the cloud resources hosted in the
remote site to ensure that it has been fully restored.
When configured in Tungsten Fabric-powered clouds, the Graceful restart and long-lived graceful restart
feature significantly improves the MOSK ability to sustain
the connectivity of workloads running at remote sites in situations when
a site experiences a loss of connection to the central hosting location of
the control plane.
Extensive testing has demonstrated that remote sites can effectively withstand
a 72-hour control plane disconnection with zero impact on the running
applications.
Given that a remote site communicates with its main MOSK
cluster across a wide area network (WAN), it becomes important to protect
sensitive data from being intercepted and viewed by a third party.
Specifically, you should ensure the protection of the data belonging to the
following cloud components:
Bare metal servers provisioning and control, Kubernetes cluster deployment
and management, Mirantis StackLight telemetry
MOSK control plane
Communication between the components of OpenStack, Tungsten Fabric, and
Mirantis Ceph
MOSK data plane
Cloud application traffic
The most reliable way to protect the data is to configure the network equipment
in the data center and the remote site to encapsulate all the bypassing
remote-to-main communications into an encrypted VPN tunnel. Alternatively,
Mirantis Container Cloud and MOSK can be configured to force
encryption of specific types of network traffic, such as:
Kubernetes networking for MOSK underlying Kubernetes
cluster that handles the vast majority of in-MOSK
communications
OpenStack tenant networking that carries all the cloud application traffic
The ability to enforce traffic encryption depends on the specific version of
the Mirantis Container Cloud and MOSK in use, as well as
the selected SDN back end for OpenStack.
In MOSK, the main cloud that controls remote computes can be
the regional site that hosts the regional cluster and the
MOSK control plane. Additionally, it can contain a local
storage and compute nodes.
The remote computes implementation in MOSK relies on
Tungsten Fabric as the SDN solution.
Remote computes bare metal servers are configured as Kubernetes workers
hosting the deployments for:
The architecture validation is performed by means of simultaneous creation of
multiple OpenStack resources of various types and execution of functional tests
against each resource. The amount of resources hosted in the cluster at the
moment when a certain threshold of non-operational resources starts being
observed, is described below as cluster capacity limit.
Note
A successfully created resource has the Active status in the API
and passes the functional tests, for example, its floating IP address is
accessible. The MOSK cluster is considered to be able to
handle the created resources if it successfully performs the LCM operations
including the OpenStack services restart, both on the control and data
plane.
Note
The key limiting factor for creating more OpenStack objects in this
illustrative setup is hardware resources (vCPU and RAM) available on the
compute nodes.
Persistent storage is a key component of any MOSK
deployment. Out of the box, MOSK includes an open-source
software-defined storage solution (Ceph), which hosts various kinds of
cloud application data, such as root and ephemeral disks for virtual machines,
virtual machine images, attachable virtual block storage, and object data.
In addition, a Ceph cluster usually acts as a storage for the internal
MOSK components, such as Kubernetes, OpenStack, StackLight,
and so on.
Being distributed and redundant by design, Ceph requires a certain minimum
amount of servers, also known as OSD or storage nodes, to work.
A production-grade Ceph cluster typically consists of at least nine storage
nodes, while a development and test environment may include four to six
servers. For details, refer to MOSK cluster hardware requirements.
It is possible to reduce the overall footprint of a MOSK
cluster by collocating the Ceph components with hypervisors on the same
physical servers; this is also known as hyper-converged design. However,
this architecture still may not satisfy the requirements of certain use cases
for the cloud.
Standalone telco-edge MOSK clouds typically consist of
three to seven servers hosted in a single rack, where every piece of CPU,
memory, and disk resources is strictly accounted for and is better dedicated
to the cloud workloads rather than to the control plane. For such clouds,
where the cluster footprint is more important than the resiliency of
the application data storage, it makes sense either not to have a Ceph
cluster at all or to replace it with some primitive non-redundant solution.
Enterprise virtualization infrastructure with third-party storage is
a common strategy among large companies that rely on proprietary storage
appliances provided by NetApp, Dell, HPE, Pure Storage, and other major
players in the data storage sector. These industry leaders offer a variety
of storage solutions designed to suit diverse enterprise demands.
Many companies, having already invested substantially in proprietary storage
infrastructure, prefer integrating MOSK with their existing
storage systems. This approach allows them to leverage this investment rather
than incurring new costs and logistical complexities associated with
migrating to Ceph.
The MOSK standard LVM+iSCSI back end for the Block
Storage service. This back end aligns seamlessly with the concept
of hyper-converged design, wherein the LVM volumes are collocated
with the hypervisors on the compute nodes.
Local file system of one of the MOSK controller
nodes. By default, database backups are stored on the local file
system on the node where the MariaDB service is running. This imposes
a risk to cloud security and resiliency. For enterprise environments,
it is a common requirement to store all the backup data externally.
Alternatively, you can disable the database backup functionality.
Results of functional testing (OpenStack Tempest)
Local file system of MOSK controller nodes.
The openstack-tempest-run-tests job responsible for running
the Tempest suite stores the results of its execution in a volume
requested through the pvc-tempest PersistentVolumeClaim
(PVC). The subject volume can be created by the local volume provisioner
on the same Kubernetes worker node, where the job runs. Usually, it is
a MOSK controller node.
You can configure the Block Storage service (OpenStack Cinder)
to be used as a storage back end for images and snapshots.
In this case, each image is represented as a volume.
Important
Representing images as volumes implies a hard
requirement for the selected block storage back end to support
the multi-attach capability, that is, concurrent reads and writes to
and from a single volume.
External S3, Swift, or any other third-party storage solutions
compatible with object access protocols.
Note
An external object storage solution will not be integrated
into the MOSK identity service (OpenStack
Keystone); the cloud applications will need to manage
access to their object data themselves.
If no Ceph is deployed as part of a cluster, the MOSK
built-in Object Storage service API endpoints are disabled automatically.
StackLight must be deployed in the HA mode, in which all its data is
stored on the local file system of the nodes running StackLight
services. In this mode, StackLight components are configured
to handle the data replication themselves.
Whether a MOSK cloud will include Ceph must be
determined during its planning and design phase. Once the deployment
is complete, reconfiguring the cloud to switch between the Ceph and
non-Ceph architectures is impossible.
Mirantis recommends avoiding substitution of Ceph-backed persistent volumes
in the MOSK underlying Kubernetes cluster with local
volumes (local volume provisioner) for production environments.
MOSK does not support such configuration unless
the components that rely on these volumes can replicate
their data themselves, for example, StackLight. Volumes provided by
the local volume provisioner are not redundant, as they are bound
to a single node and can only be mounted by the Kubernetes
pods running on that node.
This section describes the internal implementation of the node maintenance API
and how the OpenStack and Tungsten Fabric controllers communicate with LCM and
each other during a managed cluster update.
The WorkloadLock objects are created by each Application Controller.
These objects prevent LCM from performing any changes on the cluster or node
level while the lock is in the active state. The inactive state of the lock
means that the Application Controller has finished its work and the LCM can
proceed with the node or cluster maintenance.
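For illustration, a NodeWorkloadLock object created by the OpenStack Controller
may look similar to the sketch below. The apiVersion and field names are
assumptions based on the behavior described above and may differ between
product versions; check the CRDs installed in your cluster for the
authoritative schema:

    apiVersion: lcm.mirantis.com/v1alpha1   # assumed API group and version
    kind: NodeWorkloadLock
    metadata:
      name: openstack-kaas-node-example
    spec:
      nodeName: kaas-node-example       # the node this lock protects
      controllerName: openstack         # the Application Controller that owns the lock
    status:
      state: active                     # active blocks LCM; inactive allows maintenance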
The MaintenanceRequest objects are created by LCM. These objects notify
Application Controllers about the upcoming maintenance of a cluster or
a specific node.
ClusterMaintenanceRequest object example configuration
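The following is a minimal sketch of such an object, assuming the
lcm.mirantis.com/v1alpha1 API group; the available scope values are
described below:

    apiVersion: lcm.mirantis.com/v1alpha1   # assumed API group and version
    kind: ClusterMaintenanceRequest
    metadata:
      name: cluster-maintenance-request
    spec:
      scope: os    # "drain" for a regular update, "os" if a node reboot is possible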
The scope parameter in the object specification defines the impact on
the managed cluster or node. The list of available options includes:
drain
A regular managed cluster update. Each node in the cluster
goes through a drain procedure. No node reboot takes place; the maximum impact
is a restart of services on the node, including Docker, which causes
the restart of all containers present in the cluster.
os
A node might be rebooted during the update. Triggers the workload
evacuation by the OpenStack Controller.
When the MaintenanceRequest object is created, an Application Controller
executes a handler to prepare workloads for maintenance and put appropriate
WorkloadLock objects into the inactive state.
When the maintenance is over, LCM removes the MaintenanceRequest objects,
and the Application Controllers move their WorkloadLock objects back into
the active state.
When LCM creates the ClusterMaintenanceRequest object, the OpenStack
Controller ensures that all OpenStack components are in the Healthy
state, which means that the pods are up and running, and the readiness
probes are passing.
The ClusterMaintenanceRequest object creation flow:
ClusterMaintenanceRequest - create
When LCM creates the NodeMaintenanceRequest, the OpenStack Controller:
Prepares components on the node for maintenance by removing
nova-compute from scheduling.
If the reboot of a node is possible, the instance migration workflow is
triggered. The Operator can configure the instance migration flow
through the Kubernetes node annotation and should define the required option
before the managed cluster update.
To mitigate the potential impact on the cloud workloads, you can define
the instance migration flow for the compute nodes running the most valuable
instances.
The list of available options for the instance migration configuration
includes the following annotations (an illustrative node annotation snippet
is provided below):
The openstack.lcm.mirantis.com/instance_migration_mode annotation:
live
Default. The OpenStack Controller live migrates instances
automatically. The update mechanism tries to move the memory
and local storage of all instances on the node to another
node without interruption before applying any changes to the node.
By default, the update mechanism makes three attempts to migrate
each instance before falling back to the manual mode.
Note
Success of live migration depends on many factors
including the selected vCPU type and model, the amount of
data that needs to be transferred, the intensity of the disk
IO and memory writes, the type of the local storage, and others.
Instances using the following product features are known to have
issues with live migration:
LVM-based ephemeral storage with and without encryption
Encrypted block storage volumes
CPU and NUMA node pinning
manual
The OpenStack Controller waits for the Operator to migrate instances
from the compute node. When it is time to update the compute node,
the update mechanism asks you to manually migrate the instances and
proceeds only once you confirm the node is safe to update.
skip
The OpenStack Controller skips the instance check on the
node and reboots it.
Note
For the clouds relying on the converged LVM with iSCSI block
storage that offer persistent volumes in a remote edge sub-region,
it is important to keep in mind that applying a major change to a
compute node may impact not only the instances running on this node
but also the instances attached to the LVM devices hosted there.
We recommend that in such environments you perform the update procedure
in the manual mode with mitigation measures taken by the Operator
for each compute node. Otherwise, all the instances that have LVM with
iSCSI volumes attached would require a reboot to restore connectivity.
The openstack.lcm.mirantis.com/instance_migration_attempts annotation:
Defines the number of times the OpenStack Controller attempts
to migrate a single instance before giving up.
Defaults to 3.
Note
You can also use annotations to control the update of
non-compute nodes if they represent critical points of a specific
cloud architecture. For example, setting the instance_migration_mode
to manual on a controller node with a collocated gateway (Open vSwitch)
will allow the Operator to gracefully shut down all the virtual routers
hosted on this node.
If the OpenStack Controller cannot migrate instances due to errors, it
suspends the operation until all instances are migrated manually or
the openstack.lcm.mirantis.com/instance_migration_mode annotation
is set to skip.
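For illustration, the annotations described above can be set on the Kubernetes
Node object of a compute node before triggering the update. The node name
below is hypothetical; only the annotation keys are defined by the product:

    apiVersion: v1
    kind: Node
    metadata:
      name: kaas-node-compute-001    # hypothetical node name
      annotations:
        # Wait for the Operator to migrate instances from this node manually
        openstack.lcm.mirantis.com/instance_migration_mode: "manual"
        # Number of automatic migration attempts per instance (used in live mode)
        openstack.lcm.mirantis.com/instance_migration_attempts: "5"

The same result can be achieved with the kubectl annotate command against the
corresponding Node object.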
The NodeMaintenanceRequest object creation flow:
NodeMaintenanceRequest - create
When the node maintenance is over, LCM removes the NodeMaintenanceRequest
object and the OpenStack Controller:
Verifies that the Kubernetes Node becomes Ready.
Verifies that all OpenStack components on a given node are Healthy,
which means that the pods are up and running, and the readiness probes
are passing.
Ensures that the OpenStack components are connected to RabbitMQ.
For example, the Neutron Agents become alive on the node, and compute
instances are in the UP state.
Note
The OpenStack Controller allows only one
NodeWorkloadLock object at a time to be in the inactive state. Therefore,
the update process for nodes is sequential.
The NodeMaintenanceRequest object removal flow:
NodeMaintenanceRequest - delete
When the cluster maintenance is over, the OpenStack Controller sets the
ClusterWorkloadLock object back to the active state, and the update completes.
The ClusterMaintenanceRequest object removal flow:
The Tungsten Fabric (TF) Controller creates and uses both types of
workload locks: ClusterWorkloadLock and NodeWorkloadLock.
When the ClusterMaintenanceRequest object is created, the TF Controller
verifies the TF cluster health status and proceeds as follows:
If the cluster is Ready, the TF Controller moves the
ClusterWorkloadLock object to the inactive state.
Otherwise, the TF Controller keeps the ClusterWorkloadLock object
in the active state.
When the NodeMaintenanceRequest object is created, the TF Controller
verifies the vRouter pod state on the corresponding node and proceeds as
follows:
If all containers are Ready, the TF Controller moves the
NodeWorkloadLock object to the inactive state.
Otherwise, the TF Controller keeps the NodeWorkloadLock in the active
state.
Note
If there is a NodeWorkloadLock object in the inactive state
present in the cluster, the TF Controller does not process the
NodeMaintenanceRequest object for other nodes until this inactive
NodeWorkloadLock object becomes active.
When the cluster LCM removes the MaintenanceRequest object, the TF
Controller waits for the vRouter pods to become ready and proceeds as follows:
If all containers are in the Ready state, the TF Controller moves
the NodeWorkloadLock object to the active state.
Otherwise, the TF Controller keeps the NodeWorkloadLock object in the
inactive state.
This section describes the MOSK cluster update
flow for the product releases that contain major updates and require a node
reboot, for example, to support a new Linux kernel.
The diagram below illustrates the sequence of operations controlled by
LCM and taking place during the update under the hood. We assume that the
ClusterWorkloadLock and NodeWorkloadLock objects present in the cluster
are in the active state before the cloud operator triggers the update.
Cluster update flow
See also
For details about the Application Controllers flow during different
maintenance stages, refer to:
MOSK enables you to parallelize node update operations,
significantly improving the efficiency of your deployment. This capability
applies to any operation that utilizes the Node Maintenance API, such as
cluster updates or graceful node reboots.
The core implementation of parallel updates is handled by the LCM Controller,
which ensures seamless execution of parallel operations. LCM starts performing
an operation on a node only when all NodeWorkloadLock objects for that node
are marked as inactive. By default, the LCM Controller creates one
NodeMaintenanceRequest at a time.
Each application controller, including Ceph, OpenStack, and Tungsten Fabric
Controllers, manages parallel NodeMaintenanceRequest objects independently.
The controllers determine how to handle and execute parallel node maintenance
requests based on specific requirements of their respective applications.
To understand the workflow of the Node Maintenance API, refer to
WorkloadLock objects.
You can optimize parallel updates by setting the order in which nodes are
updated. You can accomplish this by configuring upgradeIndex of
the Machine object. For the procedure, refer to
Mirantis Container Cloud: Change upgrade order for machines.
Increase parallelism.
Boost parallelism by adjusting the maximum number of worker node updates
allowed during LCM operations using the
spec.providerSpec.value.maxWorkerUpgradeCount configuration parameter,
which is set to 1 by default. An illustrative sketch covering both options
is provided after this list.
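The sketch below illustrates both options. The object kinds follow the
Mirantis Container Cloud API conventions, but the exact field placement is an
assumption and must be verified against the Container Cloud documentation for
your release:

    ---
    # Allow up to three worker nodes to be updated in parallel (Cluster object)
    apiVersion: cluster.k8s.io/v1alpha1   # assumed API group and version
    kind: Cluster
    metadata:
      name: managed-cluster
      namespace: managed-ns
    spec:
      providerSpec:
        value:
          maxWorkerUpgradeCount: 3        # default is 1
    ---
    # Update this machine earlier than machines with a higher index (Machine object)
    apiVersion: cluster.k8s.io/v1alpha1   # assumed API group and version
    kind: Machine
    metadata:
      name: managed-cluster-worker-0
      namespace: managed-ns
    spec:
      providerSpec:
        value:
          upgradeIndex: 1                 # lower index is updated earlier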
By default, the OpenStack Controller handles the NodeMaintenanceRequest
objects as follows:
Updates the OpenStack controller nodes sequentially (one by one).
Updates the gateway nodes sequentially. Technically, you can increase
the number of gateway node upgrades allowed in parallel using the
nwl_parallel_max_gateway parameter, but Mirantis does not recommend
doing so.
Updates the compute nodes in parallel. The default number of allowed
parallel updates is 30. You can adjust this value through
the nwl_parallel_max_compute parameter.
Parallelism considerations for compute nodes
When considering parallelism for compute nodes, take into account that
during certain pod restarts, for example, the openvswitch-vswitchd
pods, a brief instance downtime may occur. Select a suitable level
of parallelism to minimize the impact on workloads and prevent excessive
load on the control plane nodes.
If your cloud environment is distributed across failure domains, which are
represented by Nova availability zones, you can limit the parallel updates
of nodes to only those within the same availability zone. This behavior is
controlled by the respect_nova_az option in the OpenStack Controller.
The OpenStack Controller configuration is stored in the
openstack-controller-config ConfigMap of the osh-system namespace.
The options are picked up automatically after the update. To learn more about
the OpenStack Controller configuration parameters,
refer to OpenStack Controller configuration.
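The sketch below is a hypothetical example of such a ConfigMap. Only the
ConfigMap name, namespace, and the parameter names come from this
documentation; the data key and the section layout are assumptions, so refer
to OpenStack Controller configuration for the authoritative format:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: openstack-controller-config
      namespace: osh-system
    data:
      # The data key and INI-style layout below are assumptions for illustration only
      extra_conf.ini: |
        [maintenance]
        nwl_parallel_max_compute = 10
        nwl_parallel_max_gateway = 1
        respect_nova_az = true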
By default, the Ceph Controller handles the NodeMaintenanceRequest
objects as follows:
Updates the non-storage nodes sequentially. Non-storage nodes include all
nodes that have mon, mgr, rgw, or mds roles.
Updates storage nodes in parallel. The default number of allowed
parallel updates is calculated automatically based on the minimal
failure domain in a Ceph cluster.
Parallelism calculations for storage nodes
The Ceph Controller automatically calculates the parallelism number
in the following way:
Finds the minimal failure domain for a Ceph cluster. For example,
the minimal failure domain is rack.
Filters all currently requested nodes by the minimal failure domain.
For example, parallelism equals 5, and LCM requests 3 nodes from
the rack1 rack and 2 nodes from the rack2 rack.
Handles each filtered node group one by one. For example, the controller
handles in parallel all nodes from rack1 before processing nodes
from rack2.
The Ceph Controller handles non-storage nodes before the storage
ones. If there are node requests for both node types, the Ceph Controller
first handles the non-storage nodes sequentially. Therefore, Mirantis
recommends assigning a higher-priority upgrade index to the non-storage
nodes to decrease the total upgrade time.
If the minimal failure domain is host, the Ceph Controller updates only
one storage node per failure domain unit. This results in updating all Ceph
nodes sequentially, despite the potential for increased parallelism.
By default, the Tungsten Fabric Controller handles the
NodeMaintenanceRequest objects as follows:
Updates the Tungsten Fabric Controller and gateway nodes sequentially.
Updates the vRouter nodes in parallel. The Tungsten Fabric Controller
allows updating up to 30 vRouter nodes in parallel.
Maximum number of vRouter nodes in maintenance
While the Tungsten Fabric Controller has the capability to process up
to 30 NodeMaintenanceRequest objects targeted at vRouter nodes,
the actual number may be lower. This is due to a check that ensures
OpenStack readiness to unlock the relevant nodes for maintenance.
If OpenStack allows for maintenance, the Tungsten Fabric Controller
verifies the vRouter pods. Upon successful verification,
the NodeWorkloadLock object is switched to the maintenance mode.
Mirantis OpenStack for Kubernetes (MOSK) enables the operator to
create, scale, update, and upgrade OpenStack deployments on Kubernetes through
a declarative API.
The Kubernetes built-in features, such as flexibility, scalability, and
declarative resource definition, make MOSK a robust solution.
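To illustrate the declarative approach, an OpenStack deployment is described
by a single OpenStackDeployment custom resource similar to the hedged sketch
below; the field values are examples only, and the schema may differ between
releases, so see Reference Architecture for the authoritative definition:

    apiVersion: lcm.mirantis.com/v1alpha1   # assumed API group and version
    kind: OpenStackDeployment
    metadata:
      name: osh-dev
      namespace: openstack
    spec:
      openstack_version: yoga               # target OpenStack release (example)
      preset: compute                       # overall cloud flavor (example)
      size: tiny                            # sizing profile (example)
      public_domain_name: cloud.example.com # example public endpoint domain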
The detailed plan of any Mirantis OpenStack for Kubernetes (MOSK)
deployment is determined on a per-cloud basis. For the MOSK
reference architecture and design overview, see Reference Architecture.
One of the industry best practices is to verify every new update or
configuration change in a non-customer-facing environment before
applying it to production. Therefore, Mirantis recommends
having a staging cloud, deployed and maintained along with the production
clouds. The recommendation is especially applicable to the environments
that:
Receive updates often and use continuous delivery. For example,
any non-isolated deployment of Mirantis Container Cloud.
Have significant deviations from the reference architecture or
third-party extensions installed.
Are managed under the Mirantis OpsCare program.
Run business-critical workloads where even the slightest application
downtime is unacceptable.
A typical staging cloud is a complete copy of the production environment
including the hardware and software configurations, but with a bare minimum
of compute and storage capacity.
Provision a Container Cloud bare metal management cluster
The bare metal management system enables the Infrastructure Operator to
deploy Container Cloud on a set of bare metal servers. It also enables
Container Cloud to deploy MOSK clusters on bare
metal servers without a pre-provisioned operating system.
After bootstrapping your baremetal-based Mirantis Container Cloud
management cluster, you can create a baremetal-based managed cluster
to deploy Mirantis OpenStack for Kubernetes using the Container Cloud API.
Before creating a bare metal managed cluster, add the required number
of bare metal hosts using CLI and YAML files for configuration.
This section describes how to add bare metal hosts using the Container Cloud
CLI during a managed cluster creation.
To add a bare metal host:
Verify that you configured each bare metal host as follows:
Enable the boot NIC support for UEFI load. Usually, at least the built-in
network interfaces support it.
Enable the UEFI-LAN-OPROM support in
BIOS -> Advanced -> PCI/PCIe.
Enable the IPv4-PXE stack.
Set the following boot order:
UEFI-DISK
UEFI-PXE
If your PXE network is not configured to use the first network interface,
fix the UEFI-PXE boot order to speed up node discovery
by selecting only one required network interface.
Power off all bare metal hosts.
Warning
Only one Ethernet port on a host must be connected to the
Common/PXE network at any given time. The physical address
(MAC) of this interface must be noted and used to configure
the BareMetalHost object describing the host.
Log in to the host where your management cluster kubeconfig is located
and where kubectl is installed.
Describe the unique credentials of the new bare metal host:
MOSK 22.5
Create a YAML file that describes the unique credentials of the new
bare metal host as a BareMetalHostCredential object.
In the metadata section, add a unique credentials name and the
name of the non-default project (namespace) dedicated for the
managed cluster being created.
In the spec section, add the IPMI user name and password in plain
text to access the Baseboard Management Controller (BMC). The password
will not be stored in the BareMetalHostCredential object but will
be erased and saved in an underlying Secret object.
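For illustration, such a BareMetalHostCredential object may look similar to
the sketch below. The apiVersion and the exact layout of the password field
are assumptions; verify them against the Container Cloud API reference for
your release:

    apiVersion: kaas.mirantis.com/v1alpha1   # assumed API group and version
    kind: BareMetalHostCredential
    metadata:
      name: worker-1-credential
      namespace: managed-ns        # the non-default project of the managed cluster
    spec:
      username: admin              # BMC (IPMI) user name in plain text
      password:
        value: supersecret         # moved by the controller to an underlying Secret object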
In the data section, add the IPMI user name and password in the
base64 encoding to access the BMC. To obtain the base64-encoded
credentials, you can use the following command in your Linux console: