
Mirantis Container Cloud Operations Guide

Preface

This documentation provides information on how to deploy and operate Mirantis Container Cloud.

About this documentation set

The documentation is intended to help operators understand the core concepts of the product.

The information provided in this documentation set is constantly improved and amended based on feedback and requests from our software consumers. This documentation set describes the features that are supported within the two latest Container Cloud minor releases, with a corresponding Available since release note.

The following table lists the guides included in the documentation set you are reading:

Guides list

Guide

Purpose

Reference Architecture

Learn the fundamentals of Container Cloud reference architecture to plan your deployment.

Deployment Guide

Deploy Container Cloud of a preferred configuration using supported deployment profiles tailored to the demands of specific business cases.

Operations Guide

Deploy and operate the Container Cloud managed clusters.

Release Compatibility Matrix

Deployment compatibility of the Container Cloud component versions for each product release.

Release Notes

Learn about new features and bug fixes in the current Container Cloud version as well as in the Container Cloud minor releases.

QuickStart Guides

Easy and lightweight instructions to get started with Container Cloud.

For your convenience, we provide all guides from this documentation set in HTML (default), single-page HTML, PDF, and ePUB formats. To use the preferred format of a guide, select the required option from the Formats menu next to the guide title on the Container Cloud documentation home page.

Intended audience

This documentation assumes that the reader is familiar with network and cloud concepts and is intended for the following users:

  • Infrastructure Operator

    • Is a member of the IT operations team

    • Has working knowledge of Linux, virtualization, Kubernetes API and CLI, and OpenStack to support the application development team

    • Accesses Mirantis Container Cloud and Kubernetes through a local machine or web UI

    • Provides verified artifacts through a central repository to the Tenant DevOps engineers

  • Tenant DevOps engineer

    • Is a member of the application development team and reports to the line-of-business (LOB)

    • Has working knowledge of Linux, virtualization, Kubernetes API and CLI to support application owners

    • Accesses Container Cloud and Kubernetes through a local machine or web UI

    • Consumes artifacts from a central repository approved by the Infrastructure Operator

Conventions

This documentation set uses the following conventions in the HTML format:

Documentation conventions

Convention

Description

boldface font

Inline CLI tools and commands, titles of the procedures and system response examples, table titles.

monospaced font

File names and paths, Helm chart parameters and their values, package names, node names and labels, and so on.

italic font

Information that distinguishes some concept or term.

Links

External links and cross-references, footnotes.

Main menu > menu item

GUI elements that include any part of interactive user interface and menu navigation.

Superscript

Some extra, brief information. For example, if a feature is available from a specific release or if a feature is in the Technology Preview development stage.

Note

The Note block

Messages of a generic meaning that may be useful to the user.

Caution

The Caution block

Information that prevents a user from mistakes and undesirable consequences when following the procedures.

Warning

The Warning block

Messages that include details that can be easily missed, but should not be ignored by the user and are valuable before proceeding.

See also

The See also block

List of references that may be helpful for understanding of some related tools, concepts, and so on.

Learn more

The Learn more block

Used in the Release Notes to wrap a list of internal references to the reference architecture, deployment and operation procedures specific to a newly implemented product feature.

Technology Preview support scope

This documentation set includes descriptions of the Technology Preview features. A Technology Preview feature provides early access to upcoming product innovations, allowing customers to experience the functionality and provide feedback during the development process. Technology Preview features may be privately or publicly available but are not intended for production use. While Mirantis will provide support for such features through official channels, normal Service Level Agreements do not apply. Customers may be supported by Mirantis Customer Support or Mirantis Field Support.

As Mirantis considers making future iterations of Technology Preview features generally available, we will attempt to resolve any issues that customers experience when using these features.

During the development of a Technology Preview feature, additional components may become available to the public for testing. Because Technology Preview features are still under development, Mirantis cannot guarantee the stability of such features. As a result, if you are using Technology Preview features, you may not be able to seamlessly upgrade to subsequent releases of that feature. Mirantis makes no guarantees that Technology Preview features will be graduated to a generally available product release.

The Mirantis Customer Success Organization may create bug reports on behalf of support cases filed by customers. These bug reports will then be forwarded to the Mirantis Product team for possible inclusion in a future release.

Documentation history

The documentation set refers to Mirantis Container Cloud GA as the latest released GA version of the product. For details about the Container Cloud GA minor release dates, refer to Container Cloud releases.

Mirantis Container Cloud CLI

The Mirantis Container Cloud APIs are implemented using the Kubernetes CustomResourceDefinitions (CRDs) that enable you to expand the Kubernetes API. For details, see Mirantis Container Cloud API.

You can operate Container Cloud using the kubectl command-line tool that is based on the Kubernetes API. For the kubectl reference, see the official Kubernetes documentation.
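
For example, once you obtain the management cluster kubeconfig, you can list the Container Cloud objects of a project in the same way as any other Kubernetes resources. The command below is an illustrative sketch; substitute the placeholders with your values:

kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
-n <projectName> get clusters,machines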

The Container Cloud Operations Guide mostly contains manuals that describe the Container Cloud web UI, which is intuitive and easy to get started with. Some sections are divided into a web UI instruction and an analogous but more advanced CLI one. Certain Container Cloud operations can be performed only using the CLI, with the corresponding steps described in dedicated sections. For details, refer to the required component section of this guide.

Create and operate managed clusters

Note

This tutorial applies only to the Container Cloud web UI users with the writer access role assigned by the Infrastructure Operator. To add a bare metal host, the operator access role is also required.

After you deploy the Mirantis Container Cloud management cluster, you can start creating managed clusters based on the same cloud provider type as the management cluster: OpenStack, AWS, bare metal, or VMware vSphere.

The deployment procedure is performed using the Container Cloud web UI and comprises the following steps:

  1. For a baremetal-based managed cluster, create and configure bare metal hosts with corresponding labels for machines such as worker, manager, or storage.

  2. Create an initial cluster configuration depending on the provider type.

  3. Add the required number of machines with the corresponding configuration to the managed cluster.

  4. For a baremetal-based managed cluster, add a Ceph cluster.

Create and operate a baremetal-based managed cluster

After bootstrapping your baremetal-based Mirantis Container Cloud management cluster as described in Deployment Guide: Deploy a baremetal-based management cluster, you can start creating the baremetal-based managed clusters using the Container Cloud web UI.

Add a bare metal host

Before creating a bare metal managed cluster, add the required number of bare metal hosts either using the Container Cloud web UI for a default configuration or using CLI for an advanced configuration.

Add a bare metal host using web UI

This section describes how to add bare metal hosts using the Container Cloud web UI during a managed cluster creation.

Before you proceed with adding a bare metal host:

  • Enable the boot NIC support for UEFI load. Usually, at least the built-in network interfaces support it.

  • Enable the UEFI-LAN-OPROM support in BIOS -> Advanced -> PCI/PCIe.

  • Enable the IPv4-PXE stack.

  • Set the following boot order:

    1. UEFI-DISK

    2. UEFI-PXE

  • If your PXE network is not configured to use the first network interface, fix the UEFI-PXE boot order to speed up node discovery by selecting only the required network interface.

  • Power off all bare metal hosts.

Warning

Only one Ethernet port on a host must be connected to the Common/PXE network at any given time. The physical address (MAC) of this interface must be noted and used to configure the BareMetalHost object describing the host.

To add a bare metal host to a baremetal-based managed cluster:

  1. Optional. Create a custom bare metal host profile depending on your needs as described in Create a custom bare metal host profile.

  2. Log in to the Container Cloud web UI with the operator permissions.

  3. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  4. In the Baremetal tab, click Add BM host.

  5. Fill out the Add new BM host form as required:

    • Baremetal host name

      Specify the name of the new bare metal host.

    • Username

      Specify the name of the user for accessing the BMC (IPMI user).

    • Password

      Specify the password of the user for accessing the BMC (IPMI password).

    • Boot MAC address

      Specify the MAC address of the PXE network interface.

    • IP Address

      Specify the IP address to access the BMC.

    • Label

      Assign the machine label to the new host that defines which type of machine may be deployed on this bare metal host. Only one label can be assigned to a host. The supported labels are Worker, Manager, and Storage.

  6. Click Create.

    While adding the bare metal host, Container Cloud discovers and inspects the hardware of the bare metal host and adds it to BareMetalHost.status for future reference.
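
    If you also have CLI access to the management cluster, you can review the inspection results by querying the BareMetalHost object. The following command is an illustrative sketch; substitute the placeholders with your values:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    -n <managed-cluster-project-name> get baremetalhosts <bare-metal-host-name> -o yaml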

Now, you can proceed to Create a managed cluster.

Add a bare metal host using CLI

This section describes how to add bare metal hosts using the Container Cloud CLI during a managed cluster creation.

To add a bare metal host using CLI:

  1. Verify that you configured each bare metal host as follows:

    • Enable the boot NIC support for UEFI load. Usually, at least the built-in network interfaces support it.

    • Enable the UEFI-LAN-OPROM support in BIOS -> Advanced -> PCI/PCIe.

    • Enable the IPv4-PXE stack.

    • Set the following boot order:

      1. UEFI-DISK

      2. UEFI-PXE

    • If your PXE network is not configured to use the first network interface, fix the UEFI-PXE boot order to speed up node discovery by selecting only the required network interface.

    • Power off all bare metal hosts.

    Warning

    Only one Ethernet port on a host must be connected to the Common/PXE network at any given time. The physical address (MAC) of this interface must be noted and used to configure the BareMetalHost object describing the host.

  2. Optional. Create a custom bare metal host profile depending on your needs as described in Create a custom bare metal host profile.

  3. Log in to the host where your management cluster kubeconfig is located and where kubectl is installed.

  4. Create a secret YAML file that describes the unique credentials of the new bare metal host.

    Example of the bare metal host secret:

    apiVersion: v1
    data:
      password: <credentials-password>
      username: <credentials-user-name>
    kind: Secret
    metadata:
      labels:
        kaas.mirantis.com/credentials: "true"
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: <credentials-name>
      namespace: <managed-cluster-project-name>
    type: Opaque
    

    In the data section, add the IPMI user name and password in the base64 encoding to access the BMC. To obtain the base64-encoded credentials, you can use the following command in your Linux console:

    echo -n <username|password> | base64
    

    Caution

    Each bare metal host must have a unique Secret.

  5. Apply this secret YAML file to your deployment:

    kubectl apply -f <bmh-cred-file-name>.yaml
    
  6. Create a YAML file that contains a description of the new bare metal host.

    Example of the bare metal host configuration file with the worker role:

    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHost
    metadata:
      labels:
        kaas.mirantis.com/baremetalhost-id: <unique-bare-metal-host-hardware-node-id>
        hostlabel.bm.kaas.mirantis.com/worker: "true"
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: <bare-metal-host-unique-name>
      namespace: <managed-cluster-project-name>
    spec:
      bmc:
        address: <ip_address_for-bmc-access>
        credentialsName: <credentials-name>
      bootMACAddress: <bare-metal-host-boot-mac-address>
      online: true
    

    For a detailed fields description, see BareMetalHost.

  7. Apply this configuration YAML file to your deployment:

    kubectl apply -f <bare-metal-host-config-file-name>.yaml
    

Now, proceed with Create a managed cluster.

Create a custom bare metal host profile

The bare metal host profile is a Kubernetes custom resource. It allows the operator to define how the storage devices and the operating system are provisioned and configured.

This section describes the bare metal host profile default settings and configuration of custom profiles for managed clusters using Mirantis Container Cloud API. This procedure also applies to a management cluster with a few differences described in Deployment Guide: Customize the default bare metal host profile.

Default configuration of the host system storage

The default host profile requires three storage devices in the following strict order:

  1. Boot device and operating system storage

    This device contains boot data and operating system data. It is partitioned using the GUID Partition Table (GPT) labels. The root file system is an ext4 file system created on top of an LVM logical volume. For a detailed layout, refer to the table below.

  2. Local volumes device

    This device contains an ext4 file system with directories mounted as persistent volumes to Kubernetes. These volumes are used by the Mirantis Container Cloud services to store their data, including monitoring and identity databases.

  3. Ceph storage device

    This device is used as a Ceph datastore or Ceph OSD.

The following table summarizes the default configuration of the host system storage set up by the Container Cloud bare metal management.

Default configuration of the bare metal host storage

Device/partition

Name/Mount point

Recommended size

Description

/dev/sda1

bios_grub

4 MiB

The mandatory GRUB boot partition required for non-UEFI systems.

/dev/sda2

UEFI -> /boot/efi

0.2 GiB

The boot partition required for the UEFI boot mode.

/dev/sda3

config-2

64 MiB

The mandatory partition for the cloud-init configuration. Used during the first host boot for initial configuration.

/dev/sda4

lvm_root_part

100% of the remaining free space in the LVM volume group

The main LVM physical volume that is used to create the root file system.

/dev/sdb

lvm_lvp_part -> /mnt/local-volumes

100% of the remaining free space in the LVM volume group

The LVM physical volume that is used to create the file system for LocalVolumeProvisioner.

/dev/sdc

-

100% of the remaining free space in the LVM volume group

Clean raw disk that will be used for the Ceph storage back end.

If required, you can customize the default host storage configuration. For details, see Create a custom host profile.
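
To review the storage layout that the default profile defines, you can inspect the default BareMetalHostProfile object in the management cluster. The following command is an illustrative sketch; the namespace of the default profile may differ in your environment:

kubectl --kubeconfig <pathToManagementClusterKubeconfig> get baremetalhostprofile --all-namespaces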

Create a custom host profile

In addition to the default BareMetalHostProfile object installed with Mirantis Container Cloud, you can create custom profiles for managed clusters using Container Cloud API.

Note

The procedure below also applies to the Container Cloud management clusters.

To create a custom bare metal host profile:

  1. Select from the following options:

    • For a management cluster, log in to the bare metal seed node that will be used to bootstrap the management cluster.

    • For a managed cluster, log in to the local machine where your management cluster kubeconfig is located and where kubectl is installed.

      Note

      The management cluster kubeconfig is created automatically during the last stage of the management cluster bootstrap.

  2. Select from the following options:

    • For a management cluster, open templates/bm/baremetalhostprofiles.yaml.template for editing.

    • For a managed cluster, create a new bare metal host profile under the templates/bm/ directory.

  3. Edit the host profile using the example template below to meet your hardware configuration requirements:

    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHostProfile
    metadata:
      name: <PROFILE_NAME>
      namespace: <PROJECT_NAME>
    spec:
      devices:
      # From the HW node, obtain the first device whose size is at least 60 GiB
      - device:
          minSizeGiB: 60
          wipe: true
        partitions:
        - name: bios_grub
          partflags:
          - bios_grub
          sizeGiB: 0.00390625
          wipe: true
        - name: uefi
          partflags:
          - esp
          sizeGiB: 0.2
          wipe: true
        - name: config-2
          sizeGiB: 0.0625
          wipe: true
        - name: lvm_root_part
          sizeGiB: 0
          wipe: true
      # From the HW node, obtain the second device whose size is at least 30 GiB
      # If a device exists but does not fit the size,
      # the BareMetalHostProfile will not be applied to the node
      - device:
          minSizeGiB: 30
          wipe: true
      # From the HW node, obtain the disk device with the exact name
      - device:
          byName: /dev/nvme0n1
          minSizeGiB: 30
          wipe: true
        partitions:
        - name: lvm_lvp_part
          sizeGiB: 0
          wipe: true
      # Example of wiping a device without partitioning it.
      # Mandatory for the case when a disk is supposed to be used for Ceph back end
      # later
      - device:
          byName: /dev/sde
          wipe: true
      fileSystems:
      - fileSystem: vfat
        partition: config-2
      - fileSystem: vfat
        mountPoint: /boot/efi
        partition: uefi
      - fileSystem: ext4
        logicalVolume: root
        mountPoint: /
      - fileSystem: ext4
        logicalVolume: lvp
        mountPoint: /mnt/local-volumes/
      logicalVolumes:
      - name: root
        sizeGiB: 0
        vg: lvm_root
      - name: lvp
        sizeGiB: 0
        vg: lvm_lvp
      postDeployScript: |
        #!/bin/bash -ex
        echo $(date) 'post_deploy_script done' >> /root/post_deploy_done
      preDeployScript: |
        #!/bin/bash -ex
        echo $(date) 'pre_deploy_script done' >> /root/pre_deploy_done
      volumeGroups:
      - devices:
        - partition: lvm_root_part
        name: lvm_root
      - devices:
        - partition: lvm_lvp_part
        name: lvm_lvp
      grubConfig:
        defaultGrubOptions:
        - GRUB_DISABLE_RECOVERY="true"
        - GRUB_PRELOAD_MODULES=lvm
        - GRUB_TIMEOUT=20
      kernelParameters:
        sysctl:
          kernel.panic: "900"
          kernel.dmesg_restrict: "1"
          kernel.core_uses_pid: "1"
          fs.file-max: "9223372036854775807"
          fs.aio-max-nr: "1048576"
          fs.inotify.max_user_instances: "4096"
          vm.max_map_count: "262144"
    
  4. Add or edit the mandatory parameters in the new BareMetalHostProfile object. For the parameters description, see API: BareMetalHostProfile spec.

  5. Select from the following options:

    • For a management cluster, proceed with the cluster bootstrap procedure as described in Deployment Guide: Bootstrap a management cluster.

    • For a managed cluster:

      1. Add the bare metal host profile to your management cluster:

        kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <projectName> apply -f <pathToBareMetalHostProfileFile>
        
      2. If required, further modify the host profile:

        kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <projectName> edit baremetalhostprofile <hostProfileName>
        
      3. Proceed with adding bare metal hosts either using web UI or CLI.

Enable huge pages in a host profile

The BareMetalHostProfile API allows configuring a host to use the huge pages feature of the Linux kernel on managed clusters.

Note

Huge pages is a mode of operation of the Linux kernel. With huge pages enabled, the kernel allocates the RAM in bigger chunks, or pages. This allows a KVM (kernel-based virtual machine) and VMs running on it to use the host RAM more efficiently and improves the performance of VMs.

To enable huge pages in a custom bare metal host profile for a managed cluster:

  1. Log in to the local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created automatically during the last stage of the management cluster bootstrap.

  2. Open for editing or create a new bare metal host profile under the templates/bm/ directory.

  3. Edit the grubConfig section of the host profile spec using the example below to configure the kernel boot parameters and enable huge pages:

    spec:
      grubConfig:
        defaultGrubOptions:
        - GRUB_DISABLE_RECOVERY="true"
        - GRUB_PRELOAD_MODULES=lvm
        - GRUB_TIMEOUT=20
        - GRUB_CMDLINE_LINUX_DEFAULT="hugepagesz=1G hugepages=N"
    

    The example configuration above allocates N huge pages of 1 GB each at server boot. The last hugepagesz parameter value is used as the default huge page size unless default_hugepagesz is defined. For details about possible values, see the official Linux kernel documentation.
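
    After a machine is deployed using this profile, you can verify the result directly on the node. The commands below are an illustrative sketch of such a check:

    # Verify that the kernel was booted with the huge pages parameters
    cat /proc/cmdline
    # Verify that the requested number of huge pages was allocated
    grep -i HugePages /proc/meminfo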

  4. Add the bare metal host profile to your management cluster:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <projectName> apply -f <pathToBareMetalHostProfileFile>
    
  5. If required, further modify the host profile:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <projectName> edit baremetalhostprofile <hostProfileName>
    
  6. Proceed with adding bare metal hosts.

Create a managed cluster

This section instructs you on how to configure and deploy a managed cluster that is based on the baremetal-based management cluster through the Mirantis Container Cloud web UI.

To create a managed cluster on bare metal:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the SSH keys tab, click Add SSH Key to upload the public SSH key that will be used for the SSH access to VMs.

  4. Available since 2.7.0. Optional. In the Proxies tab, enable proxy access to the managed cluster:

    1. Click Add Proxy.

    2. In the Add New Proxy wizard, fill out the form with the following parameters:

      Proxy configuration

      Parameter

      Description

      Proxy Name

      Name of the proxy server to use during a managed cluster creation.

      Region

      From the drop-down list, select the required region.

      HTTP Proxy

      Add the HTTP proxy server domain name in the following format:

      • http://proxy.example.com:port - for anonymous access

      • http://user:password@proxy.example.com:port - for restricted access

      HTTPS Proxy

      Add the HTTPS proxy server domain name in the same format as for HTTP Proxy.

      No Proxy

      Comma-separated list of IP addresses or domain names.

    For the list of Mirantis resources and IP addresses to be accessible from the Container Cloud clusters, see Reference Architecture: Hardware and system requirements.

  5. In the Clusters tab, click Create Cluster.

  6. Configure the new cluster in the Create New Cluster wizard that opens:

    1. Define general and Kubernetes parameters:

      Create new cluster: General, Provider, and Kubernetes

      Section

      Parameter name

      Description

      General settings

      Cluster name

      The cluster name.

      Provider

      Select Baremetal.

      Region

      From the drop-down list, select Baremetal.

      Release version

      The Container Cloud version.

      Proxy Available since 2.7.0

      Optional. From the drop-down list, select the proxy server name that you have previously created.

      SSH keys

      From the drop-down list, select the SSH key name that you have previously added for SSH access to the bare metal hosts.

      Provider

      LB host IP

      The IP address of the load balancer endpoint that will be used to access the Kubernetes API of the new cluster. This IP address must be on the Combined/PXE network.

      LB address range

      The range of IP addresses that can be assigned to load balancers for Kubernetes Services by MetalLB.

      Kubernetes

      Services CIDR blocks

      The Kubernetes Services CIDR blocks. For example, 10.233.0.0/18.

      Pods CIDR blocks

      The Kubernetes pods CIDR blocks. For example, 10.233.64.0/18.

    2. Configure StackLight:

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable Monitoring

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration or Configure StackLight.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

      Note

      The logging mechanism performance depends on the cluster log load. In case of a high load, you may need to increase the default resource requests and limits for fluentdElasticsearch. For details, see StackLight configuration parameters: Resource limits.

      HA Mode

      Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see Reference Architecture: StackLight deployment architecture.

      Elasticsearch

      Retention Time

      The Elasticsearch logs retention period in Logstash.

      Persistent Volume Claim Size

      The Elasticsearch persistent volume claim size.

      Logs Severity Level Available since 2.6.0

      The severity level of logs to collect. For details about severity levels, see Logging.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: Available StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  7. Click Create.

    To monitor the cluster readiness, hover over the status icon of a specific cluster in the Status column of the Clusters page.

    Once the orange blinking status icon is green and Ready, the cluster deployment or update is complete.

    Starting from Container Cloud 2.7.0, you can monitor live deployment status of the following cluster components:

    Component

    Description

    Bastion

    For the OpenStack and AWS-based clusters, the Bastion node IP address status that confirms the Bastion node creation

    Helm

    Installation or upgrade status of all Helm releases

    Kubelet

    Readiness of the node in a Kubernetes cluster, as reported by kubelet

    Kubernetes

    Readiness of all requested Kubernetes objects

    Nodes

    Equality of the requested number of nodes in the cluster to the number of ready nodes

    OIDC

    Readiness of the cluster OIDC configuration

    StackLight

    Health of all StackLight-related objects in a Kubernetes cluster

    Swarm

    Readiness of all nodes in a Docker Swarm cluster

  8. Recommended. Configure an L2 template for a new cluster as described in Advanced networking configuration. You may skip this step if you do not require L2 separation for network traffic.

    Note

    This step is mandatory for Mirantis OpenStack for Kubernetes (MOS) clusters.

Now, proceed to Add a machine.

Advanced networking configuration

By default, Mirantis Container Cloud configures a single interface on the cluster nodes, leaving all other physical interfaces intact.

With L2 networking templates, you can create advanced host networking configurations for your clusters. For example, you can create bond interfaces on top of physical interfaces on the host or use multiple subnets to separate different types of network traffic.

You can use several host-specific L2 templates per cluster to support different hardware configurations. For example, you can create L2 templates with different numbers and layouts of NICs to be applied to specific machines of one cluster.

When you create a baremetal-based project, the exemplary templates with the ipam/PreInstalledL2Template label are copied to this project. These templates are preinstalled during the management cluster bootstrap.

Follow the procedures below to create L2 templates for your managed clusters.

Create subnets

Before creating an L2 template, ensure that you have the required subnets that can be used in the L2 template to allocate IP addresses for the managed cluster nodes. Where required, create a number of subnets for a particular project using the Subnet CR. A subnet has three logical scopes:

  • global - CR uses the default namespace. A subnet can be used for any cluster located in any project.

  • namespaced - CR uses the namespace that corresponds to a particular project where managed clusters are located. A subnet can be used for any cluster located in the same project.

  • cluster - CR uses the namespace where the referenced cluster is located. A subnet is only accessible to the cluster that L2Template.spec.clusterRef refers to. The Subnet objects with the cluster scope will be created for every new cluster.

You can have subnets with the same name in different projects. In this case, the subnet that has the same project as the cluster will be used. One L2 template may reference several subnets, and those subnets may have different scopes.

The IP address objects (IPaddr CR) that are allocated from subnets always have the same project as their corresponding IpamHost objects, regardless of the subnet scope.

To create subnets:

  1. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  2. Create the subnet.yaml file with a number of global or namespaced subnets:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> apply -f <SubnetFileName.yaml>
    

    Note

    In the command above and in the steps below, substitute the parameters enclosed in angle brackets with the corresponding values.

    Example of a subnet.yaml file:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: demo
      namespace: demo-namespace
    spec:
      cidr: 10.11.0.0/24
      gateway: 10.11.0.9
      includeRanges:
      - 10.11.0.5-10.11.0.70
      nameservers:
      - 172.18.176.6
    
    Specification fields of the Subnet object

    Parameter

    Description

    cidr (singular)

    A valid IPv4 CIDR, for example, 10.11.0.0/24.

    includeRanges (list)

    A list of IP address ranges within the given CIDR that should be used in the allocation of IPs for nodes (excluding the gateway address). The IPs outside the given ranges will not be used in the allocation. Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77. In the example above, the addresses 10.11.0.5-10.11.0.70 (excluding the gateway address 10.11.0.9) will be allocated for nodes. The includeRanges parameter is mutually exclusive with excludeRanges.

    excludeRanges (list)

    A list of IP address ranges within the given CIDR that should not be used in the allocation of IPs for nodes. The IPs within the given CIDR but outside the given ranges will be used in the allocation (excluding gateway address). Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77. The excludeRanges parameter is mutually exclusive with includeRanges.

    useWholeCidr (boolean)

    If set to true, the subnet address (10.11.0.0 in the example above) and the broadcast address (10.11.0.255 in the example above) are included into the address allocation for nodes. Otherwise (false by default), the subnet address and the broadcast address are excluded from the address allocation.

    gateway (singular)

    A valid gateway address, for example, 10.11.0.9.

    nameservers (list)

    A list of the IP addresses of name servers. Each element of the list is a single address, for example, 172.18.176.6.

    Caution

    The subnet for the PXE network is automatically created during deployment and must contain the ipam/DefaultSubnet: "1" label. Each bare metal region must have only one subnet with this label.

    The following labels in metadata describe or change the subnet functioning:

    metadata.labels fields of the Subnet object

    Parameter

    Description

    cluster.sigs.k8s.io/cluster-name (singular)

    The name of the cluster that the subnet belongs to.

    ipam/SVC-MetalLB (singular) Technology Preview

    When set to "1", the subnet provides additional address ranges for MetalLB. No IP address objects (IPaddr CR) will be generated using this subnet. When using this label, set the cluster.sigs.k8s.io/cluster-name label to the name of the target cluster during the subnet creation to provide additional address ranges for MetalLB using the subnet.

    ipam/SVC-ceph-public (singular) Available since 2.7.0, Technology Preview

    When set to "1", the subnet with this label will be used to configure Ceph networking. Ceph will automatically use this subnet for its external connections. A Ceph OSD will look for and bind to an address from this subnet when it is started on a machine. Use this subnet in the L2 template for storage nodes. Assign this subnet to the interface connected to your storage access network.

    When using this label, set the cluster.sigs.k8s.io/cluster-name label to the name of the target cluster during the subnet creation.

    ipam/SVC-ceph-cluster (singular) Available since 2.7.0, Technology Preview

    When set to "1", the subnet with this label will be used to configure Ceph networking. Ceph will automatically use this subnet for its internal replication traffic. Use this subnet in the L2 template for storage nodes. Assign this subnet to the interface connected to your storage replication network.

    When using this label, set the cluster.sigs.k8s.io/cluster-name label to the name of the target cluster during the subnet creation.

    Caution

    Using a dedicated network for the Kubernetes pods traffic, for external connection to the Kubernetes services exposed by the cluster, and for the Ceph cluster access and replication traffic is available as Technology Preview. Use such configurations for testing and evaluation purposes only. For details about the Mirantis Technology Preview support scope, see the Preface section of this guide.

    The following feature is still under development and will be announced in one of the following Container Cloud releases:

    • Switching Kubernetes API to listen to the specified IP address on the node
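
    For example, a Subnet object that provides an additional MetalLB address range for a specific cluster can carry the labels described above. The manifest below is an illustrative sketch; all names and addresses are placeholders:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: demo-metallb
      namespace: demo-namespace
      labels:
        ipam/SVC-MetalLB: "1"
        cluster.sigs.k8s.io/cluster-name: demo-cluster
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
    spec:
      cidr: 10.11.1.0/24
      includeRanges:
      - 10.11.1.100-10.11.1.120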

  3. Verify that the subnet is successfully created:

    kubectl get subnet kaas-mgmt -oyaml
    

    In the system output, verify the status fields of the Subnet object using the table below.

    Status fields of the Subnet object

    Parameter

    Description

    statusMessage

    Contains a short state description and a more detailed one if applicable. The short status values are as follows:

    • OK - operational.

    • ERR - non-operational. This status has a detailed description, for example, ERR: Wrong includeRange for CIDR….

    cidr

    Reflects the actual CIDR, has the same meaning as spec.cidr.

    gateway

    Reflects the actual gateway, has the same meaning as spec.gateway.

    nameservers

    Reflects the actual name servers, has the same meaning as spec.nameservers.

    ranges

    Specifies the address ranges that are calculated using the fields from spec: cidr, includeRanges, excludeRanges, gateway, useWholeCidr. These ranges are directly used for node IP allocation.

    lastUpdate

    Includes the date and time of the latest update of the Subnet CR.

    allocatable

    Includes the number of currently available IP addresses that can be allocated for nodes from the subnet.

    allocatedIPs

    Specifies the list of IPv4 addresses with the corresponding IPaddr object IDs that were already allocated from the subnet.

    capacity

    Contains the total number of IP addresses held by ranges, which equals the sum of the allocatable and allocatedIPs values.

    versionIpam

    Contains the version of the kaas-ipam component that made the latest changes to the Subnet CR.

    Example of a successfully created subnet:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      labels:
        ipam/UID: 6039758f-23ee-40ba-8c0f-61c01b0ac863
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: kaas-mgmt
      namespace: default
    spec:
      cidr: 10.0.0.0/24
      excludeRanges:
      - 10.0.0.100
      - 10.0.0.101-10.0.0.120
      gateway: 10.0.0.1
      includeRanges:
      - 10.0.0.50-10.0.0.90
      nameservers:
      - 172.18.176.6
    status:
      allocatable: 38
      allocatedIPs:
      - 10.0.0.50:0b50774f-ffed-11ea-84c7-0242c0a85b02
      - 10.0.0.51:1422e651-ffed-11ea-84c7-0242c0a85b02
      - 10.0.0.52:1d19912c-ffed-11ea-84c7-0242c0a85b02
      capacity: 41
      cidr: 10.0.0.0/24
      gateway: 10.0.0.1
      lastUpdate: "2020-09-26T11:40:44Z"
      nameservers:
      - 172.18.176.6
      ranges:
      - 10.0.0.50-10.0.0.90
      statusMessage: OK
      versionIpam: v3.0.999-20200807-130909-44151f8
    
  4. Proceed to creating an L2 template for one or multiple managed clusters as described in Create L2 templates.

Automate multiple subnet creation using SubnetPool

Before creating an L2 template, ensure that you have the required subnets that can be used in the L2 template to allocate IP addresses for the managed cluster nodes. You can also create multiple subnets using the SubnetPool object to separate different types of network traffic. SubnetPool allows for automatic creation of Subnet objects that will consume blocks from the parent SubnetPool CIDR IP address range. The SubnetPool blockSize setting defines the IP address block size to allocate to each child Subnet. SubnetPool has a global scope, so any SubnetPool can be used to create the Subnet objects for any namespace and for any cluster.

To automate multiple subnet creation using SubnetPool:

  1. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  2. Create the subnetpool.yaml file with a number of subnet pools:

    Note

    You can define either or both subnets and subnet pools, depending on the use case. A single L2 template can use either or both subnets and subnet pools.

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> apply -f <SubnetFileName.yaml>
    

    Note

    In the command above and in the steps below, substitute the parameters enclosed in angle brackets with the corresponding values.

    Example of a subnetpool.yaml file:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: SubnetPool
    metadata:
      name: kaas-mgmt
      namespace: default
      labels:
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
    spec:
      cidr: 10.10.0.0/16
      blockSize: /25
      nameservers:
      - 172.18.176.6
      gatewayPolicy: first
    

    For the specification fields description of the SubnetPool object, see SubnetPool spec.

  3. Verify that the subnet pool is successfully created:

    kubectl get subnetpool kaas-mgmt -oyaml
    

    In the system output, verify the status fields of the SubnetPool object. For the status fields description, see SubnetPool status.
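
    After L2 templates that reference this pool are processed, you can also check which child Subnet objects were allocated from it. The command below is an illustrative sketch:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> get subnet --all-namespaces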

  4. Proceed to creating an L2 template for one or multiple managed clusters as described in Create L2 templates. In this procedure, select the exemplary L2 template for multiple subnets that contains the l3Layout section.

    Caution

    If you use the l3Layout section, define all subnets of the cluster in it; defining only part of the subnets is not allowed. Otherwise, do not use the l3Layout section at all.

Create L2 templates

After you create subnets for one or more managed clusters or projects as described in Create subnets or Automate multiple subnet creation using SubnetPool, follow the procedure below to create L2 templates for a managed cluster. This procedure contains exemplary L2 templates for the following use cases:

L2 template example with bonds and bridges

This section contains an exemplary L2 template that demonstrates how to set up bonds and bridges on hosts for your managed clusters as described in Create L2 templates.

Caution

Using a dedicated network for the Kubernetes pods traffic, for external connection to the Kubernetes services exposed by the cluster, and for the Ceph cluster access and replication traffic is available as Technology Preview. Use such configurations for testing and evaluation purposes only. For details about the Mirantis Technology Preview support scope, see the Preface section of this guide.

The following feature is still under development and will be announced in one of the following Container Cloud releases:

  • Switching Kubernetes API to listen to the specified IP address on the node

Dedicated network for the Kubernetes pods traffic

If you want to use a dedicated network for Kubernetes pods traffic, configure each node with an IPv4 and/or IPv6 address that will be used to route the pods traffic between nodes. To accomplish that, use the npTemplate.bridges.k8s-pods bridge in the L2 template, as demonstrated in the example below. As defined in Reference Architecture: Host networking, this bridge name is reserved for the Kubernetes pods network. When the k8s-pods bridge is defined in an L2 template, Calico CNI uses that network for routing the pods traffic between nodes.

Dedicated network for the Kubernetes services traffic (MetalLB)

You can use a dedicated network for external connection to the Kubernetes services exposed by the cluster. If enabled, MetalLB will listen and respond on the dedicated virtual bridge. To accomplish that, configure each node where metallb-speaker is deployed with an IPv4 or IPv6 address. Both the MetalLB IP address ranges and the IP addresses configured on those nodes must fit in the same CIDR.

Use the npTemplate.bridges.k8s-ext bridge in the L2 template, as demonstrated in the example below. This bridge name is reserved for the Kubernetes external network. The Subnet object that corresponds to the k8s-ext bridge must have explicitly excluded IP address ranges that are in use by MetalLB.

Dedicated network for the Ceph distributed storage traffic

Starting from Container Cloud 2.7.0, you can configure dedicated networks for the Ceph cluster access and replication traffic. Set labels on the Subnet CRs for the corresponding networks, as described in Create subnets. Container Cloud automatically configures Ceph to use the addresses from these subnets. Ensure that the addresses are assigned to the storage nodes.

Use the npTemplate.bridges.ceph-cluster and npTemplate.bridges.ceph-replication bridges in the L2 template, as demonstrated in the example below. These names are reserved for the Ceph cluster access and replication networks.

The Subnet objects used to assign IP addresses to these bridges must have corresponding labels ipam/SVC-ceph-public for the ceph-cluster bridge and ipam/SVC-ceph-cluster for the ceph-replication bridge.

Example of an L2 template with interfaces bonding:

apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  name: test-managed
  namespace: managed-ns
spec:
  clusterRef: managed-cluster
  autoIfMappingPrio:
    - provision
    - eno
    - ens
    - enp
  npTemplate: |
    version: 2
    ethernets:
      ten10gbe0s0:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 2}}
        set-name: {{nic 2}}
      ten10gbe0s1:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 3}}
        set-name: {{nic 3}}
    bonds:
      bond0:
        interfaces:
          - ten10gbe0s0
          - ten10gbe0s1
    vlans:
      k8s-ext-vlan:
        id: 1001
        link: bond0
      k8s-pods-vlan:
        id: 1002
        link: bond0
      ceph-cluster-vlan:
        id: 1003
        link: bond0
      ceph-replication-vlan:
        id: 1004
        link: bond0
    bridges:
      k8s-ext:
        interfaces: [k8s-ext-vlan]
        addresses:
          - {{ip "k8s-ext:demo-ext"}}
      k8s-pods:
        interfaces: [k8s-pods-vlan]
        addresses:
          - {{ip "k8s-pods:demo-pods"}}
      ceph-cluster:
        interfaces: [ceph-cluster-vlan]
        addresses:
          - {{ip "ceph-cluster:demo-ceph-cluster"}}
      ceph-replication:
        interfaces: [ceph-replication-vlan]
        addresses:
          - {{ip "ceph-replication:demo-ceph-replication"}}

L2 template example for automatic multiple subnet creation

This section contains an exemplary L2 template for automatic multiple subnet creation as described in Automate multiple subnet creation using SubnetPool. This template also contains the L3Layout section that allows defining the Subnet scopes and enables optional auto-creation of the Subnet objects from the SubnetPool objects.

For details on how to create L2 templates, see Create L2 templates.

Caution

Do not explicitly assign an IP address to the PXE NIC ({{nic 0}}) to prevent IP duplication during updates. The IP address is automatically assigned by the bootstrapping engine.

Example of an L2 template for multiple subnets:

apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  name: test-managed
  namespace: managed-ns
spec:
  clusterRef: managed-cluster
  autoIfMappingPrio:
    - provision
    - eno
    - ens
    - enp
  l3Layout:
    - subnetName: pxe-subnet
      scope:      global
    - subnetName: subnet-1
      subnetPool: kaas-mgmt
      scope:      namespace
    - subnetName: subnet-2
      subnetPool: kaas-mgmt
      scope:      cluster
  npTemplate: |
    version: 2
    ethernets:
      onboard1gbe0:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 0}}
        set-name: {{nic 0}}
        # IMPORTANT: do not assign an IP address here explicitly
        # to prevent IP duplication issues. The IP will be assigned
        # automatically by the bootstrapping engine.
        # addresses: []
      onboard1gbe1:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 1}}
        set-name: {{nic 1}}
      ten10gbe0s0:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 2}}
        set-name: {{nic 2}}
        addresses:
          - {{ip "2:subnet-1"}}
      ten10gbe0s1:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 3}}
        set-name: {{nic 3}}
        addresses:
          - {{ip "3:subnet-2"}}

In the template above, the following networks are defined in the l3Layout section:

  • pxe-subnet - global PXE network that already exists. A subnet name must refer to the PXE subnet created for the region.

  • subnet-1 - unless already created, this subnet will be created from the kaas-mgmt subnet pool. The subnet name must be unique within the project. This subnet is shared between the project clusters.

  • subnet-2 - will be created from the kaas-mgmt subnet pool. This subnet has the cluster scope. Therefore, the real name of the Subnet CR object consists of the subnet name defined in l3Layout and the cluster UID. But the npTemplate section of the L2 template must contain only the subnet name defined in l3Layout. The subnets of the cluster scope are not shared between clusters.

Caution

If you use the l3Layout section, define all subnets of the cluster in it; defining only part of the subnets is not allowed. Otherwise, do not use the l3Layout section at all.


To create an L2 template for a new managed cluster:

  1. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  2. Inspect the existing L2 templates to select the one that fits your deployment:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    get l2template -n <ProjectNameForNewManagedCluster>
    
  3. Create an L2 YAML template specific to your deployment using one of the exemplary templates:

    Note

    You can create several L2 templates with different configurations to be applied to different nodes of the same cluster. See Assign L2 templates to machines for details.

  4. Add or edit the mandatory parameters in the new L2 template. The following tables provide the description of the mandatory and the l3Layout section parameters in the example templates mentioned in the previous step.

    L2 template mandatory parameters

    Parameter

    Description

    clusterRef

    References the Cluster object that this template is applied to. The default value is used to apply the given template to all clusters within a particular project, unless an L2 template that references a specific cluster name exists.

    Caution

    • An L2 template must have the same namespace as the referenced cluster.

    • A cluster can be associated with many L2 templates. Only one of them can have the ipam/DefaultForCluster label. Every L2 template that does not have the ipam/DefaultForCluster label must be assigned to a particular machine using l2TemplateSelector.

    • A project (Kubernetes namespace) can have only one default L2 template (L2Template with Spec.clusterRef: default).

    ifMapping or autoIfMappingPrio

    • ifMapping is a list of interface names for the template. The interface mapping is defined globally for all bare metal hosts in the cluster but can be overridden at the host level, if required, by editing the IpamHost object for a particular host.

    • autoIfMappingPrio is a list of prefixes, such as eno, ens, and so on, used to match the interfaces and automatically create an interface list for the template. If you are not aware of any specific ordering of interfaces on the nodes, use the default ordering from the Predictable Network Interface Names specification for systemd. You can also override the default NIC list per host using the IfMappingOverride parameter of the corresponding IpamHost. The provision value corresponds to the network interface that was used to provision a node. Usually, it is the first NIC found on a particular node. It is defined explicitly to ensure that this interface will not be reconfigured accidentally.

    npTemplate

    A netplan-compatible configuration with special lookup functions that defines the networking settings for the cluster hosts, where physical NIC names and details are parameterized. This configuration will be processed using Go templates. Instead of specifying IP and MAC addresses, interface names, and other network details specific to a particular host, the template supports use of special lookup functions. These lookup functions, such as nic, mac, ip, and so on, return host-specific network information when the template is rendered for a particular host. For details about netplan, see the official netplan documentation.

    Caution

    All rules and restrictions of the netplan configuration also apply to L2 templates. For details, see the official netplan documentation.

    For more details about the L2Template custom resource (CR), see the L2Template API section.

    l3Layout section parameters

    Parameter

    Description

    subnetName

    Name of the Subnet object that will be used in the npTemplate section to allocate IP addresses from. All Subnet names must be unique across a single L2 template.

    subnetPool

    Optional. Default: none. Name of the parent SubnetPool object that will be used to create a Subnet object with a given subnetName and scope. If a corresponding Subnet object already exists, nothing will be created and the existing object will be used. If no SubnetPool is provided, no new Subnet object will be created.

    scope

    Logical scope of the Subnet object with a corresponding subnetName. Possible values:

    • global - the Subnet object is accessible globally, for any Container Cloud project and cluster in the region, for example, the PXE subnet.

    • namespace - the Subnet object is accessible within the same project and region where the L2 template is defined.

    • cluster - the Subnet object is only accessible to the cluster that L2Template.spec.clusterRef refers to. The Subnet objects with the cluster scope will be created for every new cluster.

    The following table describes the main lookup functions for an L2 template.

    Lookup function

    Description

    {{nic N}}

    Name of a NIC number N. NIC numbers correspond to the interface mapping list.

    {{mac N}}

    MAC address of a NIC number N registered during a host hardware inspection.

    {{ip “N:subnet-a”}}

    IP address and mask for a NIC number N. The address will be auto-allocated from the given subnet if the address does not exist yet.

    {{ip "br0:subnet-x"}}

    IP address and mask for a virtual interface, “br0” in this example. The address will be auto-allocated from the given subnet if the address does not exist yet.

    {{gateway_from_subnet "subnet-a"}}

    IPv4 default gateway address from the given subnet.

    {{nameservers_from_subnet "subnet-a"}}

    List of the IP addresses of name servers from the given subnet.

    Note

    Every subnet referenced in an L2 template can have either a global or namespaced scope. In the latter case, the subnet must exist in the same project where the corresponding cluster and L2 template are located.
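
    The following snippet is a minimal sketch of an npTemplate fragment that uses these lookup functions for the first NIC of a host. The subnet name demo-lcm is a hypothetical placeholder and must match a subnet referenced in the l3Layout section:

    npTemplate: |
      version: 2
      ethernets:
        {{nic 0}}:
          dhcp4: false
          dhcp6: false
          match:
            macaddress: {{mac 0}}
          set-name: {{nic 0}}
          addresses:
            - {{ip "0:demo-lcm"}}
          gateway4: {{gateway_from_subnet "demo-lcm"}}
          nameservers:
            addresses: {{nameservers_from_subnet "demo-lcm"}}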

  5. Add the L2 template to your management cluster:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> apply -f <pathToL2TemplateYamlFile>
    
  6. Optional. Further modify the template:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    -n <ProjectNameForNewManagedCluster> edit l2template <L2templateName>
    
  7. Proceed with Add a machine. The resulting L2 template will be used to render the netplan configuration for the managed cluster machines.


The workflow of the netplan configuration using an L2 template is as follows:

  1. The kaas-ipam service uses the data from BareMetalHost, the L2 template, and subnets to generate the netplan configuration for every cluster machine.

  2. The generated netplan configuration is saved in the status.netconfigV2 section of the IpamHost resource. If the status.l2RenderResult field of the IpamHost resource is OK, the configuration was rendered in the IpamHost resource successfully. Otherwise, the status contains an error message.

  3. The baremetal-provider service copies data from the status.netconfigV2 of IpamHost to the Spec.StateItemsOverwrites['deploy']['bm_ipam_netconfigv2'] parameter of LCMMachine.

  4. The lcm-agent service on every host synchronizes the LCMMachine data to its host. The lcm-agent service runs a playbook to update the netplan configuration on the host during the pre-download and deploy phases.
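
To verify that the rendering succeeded for a particular host, you can inspect the IpamHost resource on the management cluster. The following command is a sketch; the project and host names are placeholders:

kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
  -n <ProjectNameForNewManagedCluster> get ipamhost <hostName> \
  -o jsonpath='{.status.l2RenderResult}'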

Assign L2 templates to machines

You can create multiple L2 templates with different configurations and apply them to different machines in the same cluster.

To assign a specific L2 template to machines in a cluster:

  1. Create the default L2 template for the cluster. It will be used for machines that do not have an L2 template explicitly assigned.

    To designate an L2 template as default, assign the ipam/DefaultForCluster label to it. Only one L2 template in a cluster can have this label. A minimal sketch of a default L2 template is provided after this procedure.

  2. Create other required L2 templates for the cluster. Use the clusterRef parameter in the L2 template spec to assign the templates to the cluster.

  3. Add the l2template-<NAME> label to every L2 template. Replace the <NAME> parameter with the unique name of the L2 template.

  4. Assign an L2 template to a machine. Set the l2TemplateSelector field in the machine spec to the name of the label added in the previous step. The IPAM controller uses this field to select a specific L2 template for the corresponding machine.

    Alternatively, you may set the l2TemplateSelector field to the name of the L2 template. In this case, the template is used exclusively for the corresponding machine.
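
The following snippet is a minimal sketch of a default L2 template described in step 1. The template name is hypothetical, and the ipam/DefaultForCluster label value shown here is an assumption; verify the exact value required by your Container Cloud release:

apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  name: default-for-mycluster
  namespace: MyProject
  labels:
    ipam/DefaultForCluster: "1"
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
...
spec:
  clusterRef: MyCluster
...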

Consider the following examples of an L2 template assignment to a machine.

Example of an L2Template resource:

apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  name: ExampleNetConfig
  namespace: MyProject
  labels:
    l2template-ExampleNetConfig: ""
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
...
spec:
  clusterRef: MyCluster
...

Example of a Machine resource with the label-based L2 template selector:

apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: Machine1
  namespace: MyProject
...
spec:
  l2TemplateSelector:
    label: l2template-ExampleNetConfig
...

Example of a Machine resource with the name-based L2 template selector:

apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: Machine1
  namespace: MyProject
...
spec:
  l2TemplateSelector:
    name: ExampleNetConfig
...

Add a machine

This section describes how to add a machine to a newly created managed cluster using either the Mirantis Container Cloud web UI or CLI for an advanced configuration.

Create a machine using web UI

After you add bare metal hosts and create a managed cluster as described in Create a managed cluster, proceed with associating Kubernetes machines of your cluster with the previously added bare metal hosts using the Mirantis Container Cloud web UI.

To add a Kubernetes machine to a baremetal-based managed cluster:

  1. Log in to the Mirantis Container Cloud web UI with the operator or writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with the Machines list opens.

  4. Click the Create Machine button.

  5. Fill out the Create New Machine form as required:

    • Count

      Specify the number of machines to add.

    • Manager

      Select Manager or Worker to create a Kubernetes manager or worker node. The required minimum number of machines is three for the manager nodes HA and two for the Container Cloud workloads.

    • BareMetal Host Label

      Assign the role to the new machine(s) to link the machine to a previously created bare metal host with the corresponding label. You can assign one role type per machine. The supported labels include:

      • Worker

        The default role for any node in a managed cluster. Only the kubelet service is running on the machines of this type.

      • Manager

        This node hosts the manager services of a managed cluster. For reliability reasons, Container Cloud does not permit running end user workloads on the manager nodes or using them as storage nodes.

      • Storage

        This node is a worker node that also hosts Ceph OSDs and provides its disk resources to Ceph. Container Cloud permits end users to run workloads on storage nodes by default.

    • Node Labels

      Select the required node labels for the machine to run certain components on a specific node. For example, for the StackLight nodes that run Elasticsearch and require more resources than a standard node, select the StackLight label. The list of available node labels is obtained from your current Cluster release.

      Caution

      If you deploy StackLight in the HA mode (recommended), add the StackLight label to minimum three nodes.

      Note

      You can configure node labels after deploying a machine. On the Machines page, click the More action icon in the last column of the required machine field and select Configure machine.

  6. Click Create.

    At this point, Container Cloud adds the new machine object to the specified managed cluster, and the Bare Metal Operator controller creates the relation to the BareMetalHost object with the labels matching the roles.

    Provisioning of the newly created machine starts when the machine object is created and includes the following stages:

    1. Creation of partitions on the local disks as required by the operating system and the Container Cloud architecture.

    2. Configuration of the network interfaces on the host as required by the operating system and the Container Cloud architecture.

    3. Installation and configuration of the Container Cloud LCM agent.

  7. Repeat the steps above for the remaining machines.

    Monitor the deploy or update live status of the machine:

    • Quick status

      On the Clusters page, in the Managers or Workers columns. The green status icon indicates that the machine is Ready, and the orange status icon indicates that the machine is Updating.

    • Detailed status Available since 2.7.0

      In the Machines section of a particular cluster page, in the Status column. Hover over a particular machine status icon to verify the deploy or update status of a specific machine component.

    You can monitor the status of the following machine components:

    • Kubelet

      Verify that a node is ready in a Kubernetes cluster, as reported by kubelet

    • Swarm

      Verify that a node is healthy and belongs to Docker Swarm in the cluster

    The machine creation starts with the Provision status. During provisioning, the machine is not expected to be accessible since its infrastructure (VM, network, and so on) is being created.

    Other machine statuses are the same as the LCMMachine object states described in Reference Architecture: LCM controller.

    Once the status changes to Ready, the deployment of the managed cluster components on this machine is complete.
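
    If you prefer the CLI, you can also track the same states through the LCMMachine objects on the management cluster. The following command is a sketch; the project name is a placeholder:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
      -n <ProjectNameForNewManagedCluster> get lcmmachines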

Now, proceed to Add a Ceph cluster.

Create a machine using CLI

This section describes a bare metal host and machine configuration using Mirantis Container Cloud CLI.

Deploy a machine to a specific bare metal host

A Kubernetes machine requires a dedicated bare metal host for deployment. The bare metal hosts are represented by the BareMetalHost objects in Kubernetes API. All BareMetalHost objects are labeled by the Operator when created. A label reflects the hardware capabilities of a host. As a result of labeling, all bare metal hosts are divided into three types: Control Plane, Worker, and Storage.

In some cases, you may need to deploy a machine to a specific bare metal host. This is especially useful when some of your bare metal hosts have a different hardware configuration from the rest.

To deploy a machine to a specific bare metal host:

  1. Log in to the host where your management cluster kubeconfig is located and where kubectl is installed.

  2. Identify the bare metal host that you want to associate with the specific machine. For example, host host-1.

    kubectl get baremetalhost host-1 -o yaml
    
  3. Add a label that will uniquely identify this host, for example, by the name of the host and machine that you want to deploy on it.

    Caution

    Do not remove any existing labels from the BareMetalHost resource. For more details about labels, see BareMetalHost.

    kubectl edit baremetalhost host-1
    

    Configuration example:

    kind: BareMetalHost
    metadata:
      name: host-1
      namespace: myProjectName
      labels:
        kaas.mirantis.com/baremetalhost-id: host-1-worker-HW11-cad5
        ...
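
    Alternatively, instead of editing the object interactively, you can set the same label with a single command. The following one-liner is a sketch; adjust the namespace and label value to your environment:

    kubectl -n myProjectName label baremetalhost host-1 \
      kaas.mirantis.com/baremetalhost-id=host-1-worker-HW11-cad5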
    
  4. Create a new text file with the YAML definition of the Machine object, as defined in Machine.

  5. Add a label selector that matches the label you have added to the BareMetalHost object in the previous step.

    Example:

    kind: Machine
    metadata:
      name: worker-HW11-cad5
      namespace: myProjectName
    spec:
      ...
      providerSpec:
        value:
          apiVersion: baremetal.k8s.io/v1alpha1
          kind: BareMetalMachineProviderSpec
          ...
          hostSelector:
            matchLabels:
              kaas.mirantis.com/baremetalhost-id: host-1-worker-HW11-cad5
      ...
    
  6. Specify the details of the machine configuration in the object created in the previous step. For example:

    • Add a reference to a custom BareMetalHostProfile object, as defined in Machine.

    • Specify an override for the ordering and naming of the NICs for the machine. For details, see Override network interfaces naming and order.

    • If you use a specific L2 template for the machine, set the unique name or label of the corresponding L2 template in the l2TemplateSelector section of the Machine object.

  7. Add the configured machine to the cluster:

    kubectl create -f worker-HW11-cad5.yaml
    

    Once done, this machine will be associated with the specified bare metal host.
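
    To verify the association, you can check which consumer claimed the bare metal host. The following command is a sketch and assumes that the consumerRef field of the BareMetalHost object is populated once the host is claimed by a machine:

    kubectl -n myProjectName get baremetalhost host-1 \
      -o jsonpath='{.spec.consumerRef.name}'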

Override network interfaces naming and order

An L2 template contains the ifMapping field that allows you to identify Ethernet interfaces for the template. The Machine object API enables the Operator to override the mapping from the L2 template by enforcing a specific order of names of the interfaces when applied to the template.

The field l2TemplateIfMappingOverride in the spec of the Machine object contains a list of interface names. The order of the interface names in the list is important because the L2Template object will be rendered with NICs ordered as per this list.

Note

Changes in the l2TemplateIfMappingOverride field will apply only once when the Machine and corresponding IpamHost objects are created. Further changes to l2TemplateIfMappingOverride will not reset the interfaces assignment and configuration.

Caution

The l2TemplateIfMappingOverride field must contain the names of all interfaces of the bare metal host.

The following example illustrates how to include the override field in the Machine object. In this example, we configure the interface eno1, which is the second on-board interface of the server, to precede the first on-board interface eno0.

apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  finalizers:
  - foregroundDeletion
  - machine.cluster.sigs.k8s.io
  labels:
    cluster.sigs.k8s.io/cluster-name: kaas-mgmt
    cluster.sigs.k8s.io/control-plane: "true"
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
spec:
  providerSpec:
    value:
      apiVersion: baremetal.k8s.io/v1alpha1
      hostSelector:
        matchLabels:
          baremetal: hw-master-0
      image: {}
      kind: BareMetalMachineProviderSpec
      l2TemplateIfMappingOverride:
      - eno1
      - eno0
      - enp0s1
      - enp0s2

As a result of the configuration above, when used with the example L2 template for bonds and bridges described in Create L2 templates, the enp0s1 and enp0s2 interfaces will be bonded and that bond will be used to create subinterfaces for Kubernetes networks (k8s-pods) and for Kubernetes external network (k8s-ext).

See also

Delete a machine

Add a Ceph cluster

After you add machines to your new bare metal managed cluster as described in Add a machine, create a Ceph cluster on top of this managed cluster using the Mirantis Container Cloud web UI.

For an advanced configuration through the KaaSCephCluster CR, see Ceph advanced configuration. To configure Ceph controller through Kubernetes templates to manage Ceph nodes resources, see Enable Ceph tolerations and resources management.

The procedure below enables you to create a Ceph cluster with minimum three Ceph nodes that provides persistent volumes to the Kubernetes workloads in the managed cluster.

To create a Ceph cluster in the managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The Cluster page with the Machines and Ceph clusters lists opens.

  4. In the Ceph Clusters block, click Create Cluster.

  5. Configure the Ceph cluster in the Create New Ceph Cluster wizard that opens:

    Create new Ceph cluster

    Section

    Parameter name

    Description

    General settings

    Name

    The Ceph cluster name.

    Cluster Network

    Replication network for Ceph OSDs. Must contain the CIDR definition and match the corresponding values of the cluster L2Template object or the environment network values.

    Public Network

    Public network for Ceph data. Must contain the CIDR definition and match the corresponding values of the cluster L2Template object or the environment network values.

    Enable OSDs LCM

    Select to enable LCM for Ceph OSDs.

    Machines / Machine #1-3

    Select machine

    Select the name of the Kubernetes machine that will host the corresponding Ceph node in the Ceph cluster.

    Manager, Monitor

    Select the required Ceph services to install on the Ceph node.

    Devices

    Select the disk that Ceph will use.

    Warning

    Do not select the device for system services, for example, sda.

    Enable Object Storage

    Select to enable the single-instance RGW Object Storage.

  6. To add more Ceph nodes to the new Ceph cluster, click + next to any Ceph Machine title in the Machines tab. Configure a Ceph node as required.

    Warning

    Do not add more than 3 Manager and/or Monitor services to the Ceph cluster.

  7. After you add and configure all nodes in your Ceph cluster, click Create.

Once done, verify your Ceph cluster as described in Verify Ceph.
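
As a quick CLI check before the full verification procedure, you can review the status of the KaaSCephCluster resource mentioned above. The following command is a sketch and assumes that the object is created in the managed cluster project on the management cluster:

kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
  -n <ProjectNameForNewManagedCluster> get kaascephcluster -o yaml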

Delete a managed cluster

Due to a development limitation in the baremetal operator, deletion of a managed cluster requires preliminary deletion of the worker machines running on the cluster.

Using the Container Cloud web UI, first delete worker machines one by one until you reach the minimum of 2 workers required for an operational cluster. After that, you can delete the cluster with the remaining workers and managers.

To delete a baremetal-based managed cluster:

  1. Log in to the Mirantis Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name to open the list of machines running on it.

  4. Click the More action icon in the last column of the worker machine you want to delete and select Delete. Confirm the deletion.

  5. Repeat the step above until you have 2 workers left.

  6. In the Clusters tab, click the More action icon in the last column of the required cluster and select Delete.

  7. Verify the list of machines to be removed. Confirm the deletion.

  8. Optional. If you do not plan to reuse the credentials of the deleted cluster, delete them:

    1. In the Credentials tab, click the Delete credential action icon next to the name of the credentials to be deleted.

    2. Confirm the deletion.

    Warning

    You can delete credentials only after deleting the managed cluster they relate to.

Deleting a cluster automatically frees up the resources allocated for this cluster, for example, instances, load balancers, networks, floating IPs, and so on.

Create and operate an OpenStack-based managed cluster

After bootstrapping your OpenStack-based Mirantis Container Cloud management cluster as described in Deployment Guide: Deploy an OpenStack-based management cluster, you can create the OpenStack-based managed clusters using the Container Cloud web UI.

Create a managed cluster

This section describes how to create an OpenStack-based managed cluster using the Mirantis Container Cloud web UI of the OpenStack-based management cluster.

To create an OpenStack-based managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the SSH Keys tab, click Add SSH Key to upload the public SSH key that will be used for the OpenStack VMs creation.

  4. In the Credentials tab:

    1. Click Add Credential to add your OpenStack credentials. You can either upload your OpenStack clouds.yaml configuration file or fill in the fields manually. A minimal clouds.yaml sketch is provided after these substeps.

    2. Verify that the new credentials status is Ready. If the status is Error, hover over the status to determine the reason of the issue.
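
    The following snippet is a minimal sketch of a standard OpenStack clouds.yaml file. All values are placeholders; use the endpoints and credentials of your OpenStack environment:

    clouds:
      openstack:
        auth:
          auth_url: https://keystone.example.com:5000/v3
          username: <userName>
          password: <password>
          project_name: <projectName>
          user_domain_name: Default
          project_domain_name: Default
        region_name: RegionOne
        interface: public
        identity_api_version: 3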

  5. Optional. In the Proxies tab, enable proxy access to the managed cluster:

    1. Click Add Proxy.

    2. In the Add New Proxy wizard, fill out the form with the following parameters:

      Proxy configuration

      Parameter

      Description

      Proxy Name

      Name of the proxy server to use during a managed cluster creation.

      Region

      From the drop-down list, select the required region.

      HTTP Proxy

      Add the HTTP proxy server domain name in the following format:

      • http://proxy.example.com:port - for anonymous access

      • http://user:password@proxy.example.com:port - for restricted access

      HTTPS Proxy

      Add the HTTPS proxy server domain name in the same format as for HTTP Proxy.

      No Proxy

      Comma-separated list of IP addresses or domain names.

    For the list of Mirantis resources and IP addresses to be accessible from the Container Cloud clusters, see Reference Architecture: Hardware and system requirements.

  6. In the Clusters tab, click Create Cluster and fill out the form with the following parameters as required:

    1. Configure general settings and the Kubernetes parameters:

      Managed cluster configuration

      Section

      Parameter

      Description

      General Settings

      Name

      Cluster name

      Provider

      Select OpenStack

      Provider Credential

      From the drop-down list, select the OpenStack credentials name that you have previously created.

      Release Version

      The Container Cloud version.

      Proxy

      Optional. From the drop-down list, select the proxy server name that you have previously created.

      SSH Keys

      From the drop-down list, select the SSH key name that you have previously added for SSH access to VMs.

      Provider

      External Network

      Type of the external network in the OpenStack cloud provider.

      DNS Name Servers

      Comma-separated list of DNS host IP addresses for the OpenStack VMs configuration.

      Kubernetes

      Node CIDR

      The Kubernetes nodes CIDR block. For example, 10.10.10.0/24.

      Services CIDR Blocks

      The Kubernetes Services CIDR block. For example, 10.233.0.0/18.

      Pods CIDR Blocks

      The Kubernetes Pods CIDR block. For example, 10.233.64.0/18.

    2. Configure StackLight:

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable Monitoring

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration or Configure StackLight.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

      Note

      The logging mechanism performance depends on the cluster log load. In case of a high load, you may need to increase the default resource requests and limits for fluentdElasticsearch. For details, see StackLight configuration parameters: Resource limits.

      HA Mode

      Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see Reference Architecture: StackLight deployment architecture.

      Elasticsearch

      Retention Time

      The Elasticsearch logs retention period in Logstash.

      Persistent Volume Claim Size

      The Elasticsearch persistent volume claim size.

      Logs Severity Level Available since 2.6.0

      The severity level of logs to collect. For details about severity levels, see Logging.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: Available StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  7. Click Create.

    To monitor the cluster readiness, hover over the status icon of a specific cluster in the Status column of the Clusters page.

    Once the orange blinking status icon is green and Ready, the cluster deployment or update is complete.

    Starting from Container Cloud 2.7.0, you can monitor live deployment status of the following cluster components:

    Component

    Description

    Bastion

    For the OpenStack and AWS-based clusters, the Bastion node IP address status that confirms the Bastion node creation

    Helm

    Installation or upgrade status of all Helm releases

    Kubelet

    Readiness of the node in a Kubernetes cluster, as reported by kubelet

    Kubernetes

    Readiness of all requested Kubernetes objects

    Nodes

    Equality of the requested nodes number in the cluster to the number of ready nodes

    OIDC

    Readiness of the cluster OIDC configuration

    StackLight

    Health of all StackLight-related objects in a Kubernetes cluster

    Swarm

    Readiness of all nodes in a Docker Swarm cluster

  8. Proceed with Add a machine.

Add a machine

After you create a new OpenStack-based Mirantis Container Cloud managed cluster as described in Create a managed cluster, proceed with adding machines to this cluster using the Container Cloud web UI.

You can also use the instruction below to scale up an existing managed cluster.

To add a machine to an OpenStack-based managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with Machines list opens.

  4. On the cluster page, click Create Machine.

  5. Fill out the form with the following parameters as required:

    Container Cloud machine configuration

    Parameter

    Description

    Count

    Specify the number of machines to create.

    The required minimum number of machines is three for the manager nodes HA and two for the Container Cloud workloads.

    Select Manager or Worker to create a Kubernetes manager or worker node.

    Flavor

    From the drop-down list, select the required hardware configuration for the machine. The list of available flavors corresponds to the one in your OpenStack environment.

    For the hardware requirements, see: Reference Architecture: Requirements for an OpenStack-based cluster.

    Image

    From the drop-down list, select the cloud image with Ubuntu 18.04. If you do not have this image in the list, add it to your OpenStack environment using the Horizon web UI by downloading the image from the Ubuntu official website.

    Availability zone

    From the drop-down list, select the availability zone from which the new machine will be launched.

    Node Labels

    Select the required node labels for the machine to run certain components on a specific node. For example, for the StackLight nodes that run Elasticsearch and require more resources than a standard node, select the StackLight label. The list of available node labels is obtained from your current Cluster release.

    Caution

    If you deploy StackLight in the HA mode (recommended), add the StackLight label to minimum three nodes.

    Note

    You can configure node labels after deploying a machine. On the Machines page, click the More action icon in the last column of the required machine field and select Configure machine.

  6. Click Create.

  7. Repeat the steps above for the remaining machines.

    Monitor the deploy or update live status of the machine:

    • Quick status

      On the Clusters page, in the Managers or Workers columns. The green status icon indicates that the machine is Ready, and the orange status icon indicates that the machine is Updating.

    • Detailed status Available since 2.7.0

      In the Machines section of a particular cluster page, in the Status column. Hover over a particular machine status icon to verify the deploy or update status of a specific machine component.

    You can monitor the status of the following machine components:

    • Kubelet

      Verify that a node is ready in a Kubernetes cluster, as reported by kubelet

    • Swarm

      Verify that a node is healthy and belongs to Docker Swarm in the cluster

    The machine creation starts with the Provision status. During provisioning, the machine is not expected to be accessible since its infrastructure (VM, network, and so on) is being created.

    Other machine statuses are the same as the LCMMachine object states described in Reference Architecture: LCM controller.

    Once the status changes to Ready, the deployment of the managed cluster components on this machine is complete.

  8. Verify the status of the cluster nodes as described in Connect to a Mirantis Container Cloud cluster.

Warning

An operational managed cluster deployment must contain a minimum of 3 Kubernetes manager nodes and 2 Kubernetes worker nodes. The deployment of the cluster does not start until the minimum number of nodes is created.

To meet the etcd quorum and to prevent the deployment failure, deletion of the manager nodes is prohibited.

A machine with the manager node role is automatically deleted during the managed cluster deletion.

See also

Delete a machine

Delete a managed cluster

Deleting a managed cluster does not require a preliminary deletion of VMs that run on this cluster.

To delete an OpenStack-based managed cluster:

  1. Log in to the Mirantis Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required cluster and select Delete.

  4. Verify the list of machines to be removed. Confirm the deletion.

    Deleting a cluster automatically frees up the resources allocated for this cluster, for example, instances, load balancers, networks, floating IPs.

  5. If the cluster deletion hangs and the The cluster is being deleted message does not disappear for a while:

    1. Expand the menu of the tab with your username.

    2. Click Download kubeconfig to download kubeconfig of your management cluster.

    3. Log in to any local machine with kubectl installed.

    4. Copy the downloaded kubeconfig to this machine.

    5. Run the following command:

      kubectl --kubeconfig <KUBECONFIG_PATH> edit -n <PROJECT_NAME> cluster <MANAGED_CLUSTER_NAME>
      
    6. In the editor that opens, remove the following lines from the Cluster object:

      finalizers:
      - cluster.cluster.k8s.io
      
  6. If you are going to remove the associated regional cluster or if you do not plan to reuse the credentials of the deleted cluster, delete them:

    1. In the Credentials tab, verify that the required credentials are not in the In Use status.

    2. Click the Delete credential action icon next to the name of the credentials to be deleted.

    3. Confirm the deletion.

    Warning

    You can delete credentials only after deleting the managed cluster they relate to.

Create and operate an AWS-based managed cluster

After bootstrapping your AWS-based Mirantis Container Cloud management cluster as described in Deployment Guide: Deploy an AWS-based management cluster, you can create the AWS-based managed clusters using the Container Cloud web UI.

Create a managed cluster

This section describes how to create an AWS-based managed cluster using the Mirantis Container Cloud web UI of the AWS-based management cluster.

To create an AWS-based managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the SSH Keys tab, click Add SSH Key to upload the public SSH key that will be configured on each AWS instance to provide user access.

  4. In the Credentials tab:

    1. Click Add Credential and fill in the required fields to add your AWS credentials.

    2. Verify that the new credentials status is Ready. If the status is Error, hover over the status to determine the reason of the issue.

  5. Available since 2.7.0. Optional. In the Proxies tab, enable proxy access to the managed cluster:

    1. Click Add Proxy.

    2. In the Add New Proxy wizard, fill out the form with the following parameters:

      Proxy configuration

      Parameter

      Description

      Proxy Name

      Name of the proxy server to use during a managed cluster creation.

      Region

      From the drop-down list, select the required region.

      HTTP Proxy

      Add the HTTP proxy server domain name in the following format:

      • http://proxy.example.com:port - for anonymous access

      • http://user:password@proxy.example.com:port - for restricted access

      HTTPS Proxy

      Add the HTTPS proxy server domain name in the same format as for HTTP Proxy.

      No Proxy

      Comma-separated list of IP addresses or domain names.

    For the list of Mirantis resources and IP addresses to be accessible from the Container Cloud clusters, see Reference Architecture: Hardware and system requirements.

  6. In the Clusters tab, click Create Cluster and fill out the form with the following parameters as required:

    1. Configure general settings and the Kubernetes parameters:

      Managed cluster configuration

      Section

      Parameter

      Description

      General settings

      Name

      Cluster name

      Provider

      Select AWS

      Provider credential

      From the drop-down list, select the previously created AWS credentials name.

      Release version

      The Container Cloud version.

      Proxy Available since 2.7.0

      Optional. From the drop-down list, select the proxy server name that you have previously created.

      SSH keys

      From the drop-down list, select the SSH key name that you have previously added for SSH access to VMs.

      Provider

      AWS region

      From the drop-down list, select the AWS Region for the managed cluster. For example, us-east-2.

      Kubernetes

      Services CIDR blocks

      The Kubernetes Services CIDR block. For example, 10.233.0.0/18.

      Pods CIDR blocks

      The Kubernetes Pods CIDR block. For example, 10.233.64.0/18.

    2. Configure StackLight:

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable Monitoring

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration or Configure StackLight.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

      Note

      The logging mechanism performance depends on the cluster log load. In case of a high load, you may need to increase the default resource requests and limits for fluentdElasticsearch. For details, see StackLight configuration parameters: Resource limits.

      HA Mode

      Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see Reference Architecture: StackLight deployment architecture.

      Elasticsearch

      Retention Time

      The Elasticsearch logs retention period in Logstash.

      Persistent Volume Claim Size

      The Elasticsearch persistent volume claim size.

      Logs Severity Level Available since 2.6.0

      The severity level of logs to collect. For details about severity levels, see Logging.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: Available StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  7. Click Create.

    To monitor the cluster readiness, hover over the status icon of a specific cluster in the Status column of the Clusters page.

    Once the orange blinking status icon is green and Ready, the cluster deployment or update is complete.

    Starting from Container Cloud 2.7.0, you can monitor live deployment status of the following cluster components:

    Component

    Description

    Bastion

    For the OpenStack and AWS-based clusters, the Bastion node IP address status that confirms the Bastion node creation

    Helm

    Installation or upgrade status of all Helm releases

    Kubelet

    Readiness of the node in a Kubernetes cluster, as reported by kubelet

    Kubernetes

    Readiness of all requested Kubernetes objects

    Nodes

    Equality of the requested nodes number in the cluster to the number of ready nodes

    OIDC

    Readiness of the cluster OIDC configuration

    StackLight

    Health of all StackLight-related objects in a Kubernetes cluster

    Swarm

    Readiness of all nodes in a Docker Swarm cluster

  8. Proceed with Add a machine.

Add a machine

After you create a new AWS-based managed cluster as described in Create a managed cluster, proceed with adding machines to this cluster using the Mirantis Container Cloud web UI.

You can also use the instruction below to scale up an existing managed cluster.

To add a machine to an AWS-based managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with the Machines list opens.

  4. Click Create Machine.

  5. Fill out the form with the following parameters as required:

    Container Cloud machine configuration

    Parameter

    Description

    Count

    Specify the number of machines to create.

    The required minimum number of machines is three for the manager nodes HA and two for the Container Cloud workloads.

    Select Manager or Worker to create a Kubernetes manager or worker node.

    Instance type

    From the drop-down list, select the required AWS instance type. For production deployments, Mirantis recommends:

    • c5d.2xlarge for worker nodes

    • c5d.4xlarge for manager nodes

    • r5.4xlarge for nodes where the StackLight server components run

    For more details about requirements, see Reference Architecture: AWS system requirements.

    AMI ID

    From the drop-down list, select the required AMI ID of Ubuntu 18.04. For example, ubuntu-bionic-18.04-amd64-server-20200729.

    Root device size

    Select the required root device size, 40 by default.

    Node Labels

    Select the required node labels for the machine to run certain components on a specific node. For example, for the StackLight nodes that run Elasticsearch and require more resources than a standard node, select the StackLight label. The list of available node labels is obtained from your current Cluster release.

    Caution

    If you deploy StackLight in the HA mode (recommended), add the StackLight label to minimum three nodes.

    Note

    You can configure node labels after deploying a machine. On the Machines page, click the More action icon in the last column of the required machine field and select Configure machine.

  6. Click Create.

  7. Repeat the steps above for the remaining machines.

    Monitor the deploy or update live status of the machine:

    • Quick status

      On the Clusters page, in the Managers or Workers columns. The green status icon indicates that the machine is Ready, and the orange status icon indicates that the machine is Updating.

    • Detailed status Available since 2.7.0

      In the Machines section of a particular cluster page, in the Status column. Hover over a particular machine status icon to verify the deploy or update status of a specific machine component.

    You can monitor the status of the following machine components:

    • Kubelet

      Verify that a node is ready in a Kubernetes cluster, as reported by kubelet

    • Swarm

      Verify that a node is healthy and belongs to Docker Swarm in the cluster

    The machine creation starts with the Provision status. During provisioning, the machine is not expected to be accessible since its infrastructure (VM, network, and so on) is being created.

    Other machine statuses are the same as the LCMMachine object states described in Reference Architecture: LCM controller.

    Once the status changes to Ready, the deployment of the managed cluster components on this machine is complete.

  8. Verify the status of the cluster nodes as described in Connect to a Mirantis Container Cloud cluster.

Warning

An operational managed cluster deployment must contain a minimum of 3 Kubernetes manager nodes and 2 Kubernetes worker nodes. The deployment of the cluster does not start until the minimum number of nodes is created.

To meet the etcd quorum and to prevent the deployment failure, deletion of the manager nodes is prohibited.

A machine with the manager node role is automatically deleted during the managed cluster deletion.

See also

Delete a machine

Delete a managed cluster

Deleting a managed cluster does not require a preliminary deletion of VMs that run on this cluster.

To delete an AWS-based managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required cluster and select Delete.

  4. Verify the list of machines to be removed. Confirm the deletion.

    Deleting a cluster automatically removes the Amazon Virtual Private Cloud (VPC) connected with this cluster and frees up the resources allocated for this cluster, for example, instances, load balancers, networks, floating IPs.

  5. If you are going to remove the associated regional cluster or if you do not plan to reuse the credentials of the deleted cluster, delete them:

    1. In the Credentials tab, verify that the required credentials are not in the In Use status.

    2. Click the Delete credential action icon next to the name of the credentials to be deleted.

    3. Confirm the deletion.

    Warning

    You can delete credentials only after deleting the managed cluster they relate to.

Create and operate a VMware vSphere-based managed cluster

After bootstrapping your VMware vSphere-based Mirantis Container Cloud management cluster as described in Deployment Guide: Deploy a VMware vSphere-based management cluster, you can create vSphere-based managed clusters using the Container Cloud web UI.

Create a managed cluster

This section describes how to create a VMware vSphere-based managed cluster using the Mirantis Container Cloud web UI of the vSphere-based management cluster.

To create a vSphere-based managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the SSH Keys tab, click Add SSH Key to upload the public SSH key that will be used for the vSphere VMs creation.

  4. In the Credentials tab:

    1. Click Add Credential to add your vSphere credentials. You can either upload your vSphere vsphere.yaml configuration file or fill in the fields manually.

    2. Verify that the new credentials status is Ready. If the status is Error, hover over the status to determine the reason of the issue.

  5. Available since 2.7.0. Optional. In the Proxies tab, enable proxy access to the managed cluster:

    1. Click Add Proxy.

    2. In the Add New Proxy wizard, fill out the form with the following parameters:

      Proxy configuration

      Parameter

      Description

      Proxy Name

      Name of the proxy server to use during a managed cluster creation.

      Region

      From the drop-down list, select the required region.

      HTTP Proxy

      Add the HTTP proxy server domain name in the following format:

      • http://proxy.example.com:port - for anonymous access

      • http://user:password@proxy.example.com:port - for restricted access

      HTTPS Proxy

      Add the HTTPS proxy server domain name in the same format as for HTTP Proxy.

      No Proxy

      Comma-separated list of IP addresses or domain names.

    For the list of Mirantis resources and IP addresses to be accessible from the Container Cloud clusters, see Reference Architecture: Hardware and system requirements.

  6. In the RHEL Licenses tab, click Add RHEL License and fill out the form with the following parameters:

    RHEL license parameters

    Parameter

    Description

    RHEL License Name

    RHEL license name

    Username

    User name to access the RHEL license

    Password

    Password to access the RHEL license

    Pool IDs

    Optional. Specify the pool IDs for RHEL licenses for Virtual Datacenters. Otherwise, Subscription Manager will select a subscription from the list of those available and appropriate for the machines.

  7. In the Clusters tab, click Create Cluster and fill out the form with the following parameters as required:

    1. Configure general settings and Kubernetes parameters:

      Managed cluster configuration

      Section

      Parameter

      Description

      General Settings

      Name

      Cluster name

      Provider

      Select vSphere

      Provider Credential

      From the drop-down list, select the vSphere credentials name that you have previously added.

      Release Version

      Container Cloud version.

      Proxy Available since 2.7.0

      Optional. From the drop-down list, select the proxy server name that you have previously created.

      SSH Keys

      From the drop-down list, select the SSH key name that you have previously added for SSH access to VMs.

      Provider

      LB Host IP

      IP address of the load balancer endpoint that will be used to access the Kubernetes API of the new cluster.

      LB Address Range

      MetalLB range of IP addresses that can be assigned to load balancers for Kubernetes Services.

      vSphere

      Machine Folder Path

      Full path to a folder that will store the cluster machines metadata.

      Network Path

      Full path to a network for cluster machines.

      Resource Pool Path

      Full path to a resource pool in which VMs will be created.

      Datastore For Cluster

      Full path to a storage for virtual machines disks.

      Datastore For Cloud Provider

      Full path to a storage for Kubernetes volumes.

      SCSI Controller Type

      SCSI controller type for virtual machines. Leave pvscsi as default.

      Enable IPAM Available since 2.6.0

      Enables IPAM. Set to true if a vSphere network has no DHCP server. Also, provide the following additional parameters for a proper network setup on machines using embedded IP address management (IPAM):

      Network CIDR

      CIDR of the provided vSphere network. For example, 10.20.0.0/16.

      Network Gateway

      Gateway of the provided vSphere network.

      DNS Name Servers

      List of nameservers for the provided vSphere network.

      Include Ranges

      IP range for the cluster machines. Specify a range within the provided CIDR. For example, 10.20.0.100-10.20.0.200.

      Exclude Ranges

      Optional. IP ranges to be excluded from being assigned to the cluster machines. The MetalLB range and the load balancer IP address should not intersect with the addresses for IPAM. For example, 10.20.0.150-10.20.0.170.

      Kubernetes

      Node CIDR

      Kubernetes nodes CIDR block. For example, 10.10.10.0/24.

      Services CIDR Blocks

      Kubernetes Services CIDR block. For example, 10.233.0.0/18.

      Pods CIDR Blocks

      Kubernetes pods CIDR block. For example, 10.233.64.0/18.

    2. Configure StackLight:

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable Monitoring

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration or Configure StackLight.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

      Note

      The logging mechanism performance depends on the cluster log load. In case of a high load, you may need to increase the default resource requests and limits for fluentdElasticsearch. For details, see StackLight configuration parameters: Resource limits.

      HA Mode

      Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see Reference Architecture: StackLight deployment architecture.

      Elasticsearch

      Retention Time

      The Elasticsearch logs retention period in Logstash.

      Persistent Volume Claim Size

      The Elasticsearch persistent volume claim size.

      Logs Severity Level Available since 2.6.0

      The severity level of logs to collect. For details about severity levels, see Logging.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: Available StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  8. Click Create.

    To monitor the cluster readiness, hover over the status icon of a specific cluster in the Status column of the Clusters page.

    Once the orange blinking status icon is green and Ready, the cluster deployment or update is complete.

    Starting from Container Cloud 2.7.0, you can monitor live deployment status of the following cluster components:

    Component

    Description

    Bastion

    For the OpenStack and AWS-based clusters, the Bastion node IP address status that confirms the Bastion node creation

    Helm

    Installation or upgrade status of all Helm releases

    Kubelet

    Readiness of the node in a Kubernetes cluster, as reported by kubelet

    Kubernetes

    Readiness of all requested Kubernetes objects

    Nodes

    Equality of the requested nodes number in the cluster to the number of ready nodes

    OIDC

    Readiness of the cluster OIDC configuration

    StackLight

    Health of all StackLight-related objects in a Kubernetes cluster

    Swarm

    Readiness of all nodes in a Docker Swarm cluster

  9. Proceed with Add a machine.

Add a machine

After you create a new VMware vSphere-based Mirantis Container Cloud managed cluster as described in Create a managed cluster, proceed with adding machines to this cluster using the Container Cloud web UI.

You can also use the instruction below to scale up an existing managed cluster.

To add a machine to a vSphere-based managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with Machines list opens.

  4. On the cluster page, click Create Machine.

  5. Fill out the form with the following parameters as required:

    Container Cloud machine configuration

    Parameter

    Description

    Count

    Number of machines to create.

    The required minimum number of machines is three for the manager nodes HA and two for the Container Cloud workloads.

    Select Manager or Worker to create a Kubernetes manager or worker node.

    Template Path

    Path to the prepared OVF template.

    RHEL License

    From the drop-down list, select the RHEL license that you previously added for the cluster being deployed.

    Node Labels

    Select the required node labels for the machine to run certain components on a specific node. For example, for the StackLight nodes that run Elasticsearch and require more resources than a standard node, select the StackLight label. The list of available node labels is obtained from your current Cluster release.

    Caution

    If you deploy StackLight in the HA mode (recommended), add the StackLight label to a minimum of three nodes.

    Note

    You can configure node labels after deploying a machine. On the Machines page, click the More action icon in the last column of the required machine field and select Configure machine.

  6. Click Create.

  7. Repeat the steps above for the remaining machines.

    Monitor the deploy or update live status of the machine:

    • Quick status

      On the Clusters page, in the Managers or Workers columns. The green status icon indicates that the machine is Ready; the orange status icon indicates that the machine is Updating.

    • Detailed status Available since 2.7.0

      In the Machines section of a particular cluster page, in the Status column. Hover over a particular machine status icon to verify the deploy or update status of a specific machine component.

    You can monitor the status of the following machine components:

    • Kubelet

      Verify that a node is ready in a Kubernetes cluster, as reported by kubelet

    • Swarm

      Verify that a node is healthy and belongs to Docker Swarm in the cluster

    The machine creation starts with the Provision status. During provisioning, the machine is not expected to be accessible since its infrastructure (VM, network, and so on) is being created.

    Other machine statuses are the same as the LCMMachine object states described in Reference Architecture: LCM controller.

    Once the status changes to Ready, the deployment of the managed cluster components on this machine is complete.
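
    You can also track machine deployment from the command line of the management cluster. This is a minimal sketch that assumes the management cluster kubeconfig is exported and the project is named child; resource names and output columns may differ between Container Cloud releases:

      # List the Machine objects of the project with their current state
      kubectl -n child get machines

      # Inspect the LCMMachine objects that back the detailed machine status
      kubectl -n child get lcmmachines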

  8. Verify the status of the cluster nodes as described in Connect to a Mirantis Container Cloud cluster.

Warning

An operational managed cluster deployment must contain a minimum of 3 Kubernetes manager nodes and 2 Kubernetes worker nodes. The deployment of the cluster does not start until the minimum number of nodes is created.

To meet the etcd quorum and to prevent the deployment failure, deletion of the manager nodes is prohibited.

A machine with the manager node role is automatically deleted during the managed cluster deletion.

See also

Delete a machine

Delete a managed cluster

Deleting a managed cluster does not require a preliminary deletion of VMs that run on this cluster.

To delete a VMware vSphere-based managed cluster:

  1. Log in to the Mirantis Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required cluster and select Delete.

  4. Verify the list of machines to be removed. Confirm the deletion.

  5. Deleting a cluster automatically turns the machines off. Therefore, clean up the hosts manually in the vSphere web UI. The machines will be automatically released from the RHEL subscription.

  6. If you are going to remove the associated regional cluster or if you do not plan to reuse the credentials of the deleted cluster, delete them:

    1. In the Credentials tab, verify that the required credentials are not in the In Use status.

    2. Click the Delete credential action icon next to the name of the credentials to be deleted.

    3. Confirm the deletion.

    Warning

    You can delete credentials only after deleting the managed cluster they relate to.

Change a cluster configuration

After deploying a managed cluster, you can configure the following cluster settings:

  • Enable or disable StackLight and configure its parameters if enabled. Alternatively, you can configure StackLight through kubeconfig as described in Configure StackLight.

  • Add or remove SSH keys

To change a cluster configuration:

  1. Log in to the Mirantis Container Cloud web UI with the writer permissions.

  2. Select the required project.

  3. On the Clusters page, click the More action icon in the last column of the required cluster and select Configure cluster.

  4. In the Configure cluster window:

    • Available since 2.7.0 In the General Settings tab, you can:

      • Using the drop-down menu, select the required previously created SSH key to add it to the running cluster

      • Using the Add SSH Key action icon, add a new field with the SSH keys drop-down menu if your cluster requires several keys

      • Using the Remove SSH Key action icon, remove the fields with unused SSH keys, if any

        Note

        To delete an SSH key, use the SSH Keys tab of the main menu.

    • In the StackLight tab, select or deselect StackLight and configure its parameters if enabled.

  5. Click Update to apply the changes.

Update a managed cluster

A Mirantis Container Cloud management cluster automatically upgrades to a new available Container Cloud release version that supports new Cluster releases. Once done, a newer version of a Cluster release becomes available for managed clusters that you update using the Container Cloud web UI.

Caution

Make sure to update the Cluster release version of your managed cluster before the current Cluster release version becomes unsupported by a new Container Cloud release version. Otherwise, Container Cloud stops auto-upgrade and eventually Container Cloud itself becomes unsupported.

This section describes how to update a managed cluster of any provider type using the Container Cloud web UI.

To update a managed cluster:

  1. For bare metal clusters, set the maintenance flag for Ceph:

    1. Open the KaasCephCluster CR for editing:

      kubectl edit kaascephcluster
      
    2. Enable the maintenance flag:

      spec:
        cephClusterSpec:
          maintenance: true
      
  2. Log in to the Container Cloud web UI with the writer permissions.

  3. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  4. In the Clusters tab, click the More action icon in the last column of the required cluster and select Update cluster where available.

  5. In the Release Update window, select the required Cluster release to update your managed cluster to.

    The Description section contains the list of component versions to be installed with a new Cluster release. The release notes for each Container Cloud and Cluster release are available at Release Notes: Container Cloud releases and Release Notes: Cluster releases.

  6. Click Update.

    Before the cluster update starts, Container Cloud performs a backup of MKE and Docker Swarm. The backup directory is located under:

    • /srv/backup/swarm on every Container Cloud node for Docker Swarm

    • /srv/backup/ucp on one of the controller nodes for MKE
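
    If needed, you can verify that the backups exist by listing these directories on the corresponding nodes, for example:

      # On any Container Cloud node
      ls -l /srv/backup/swarm

      # On one of the controller nodes
      ls -l /srv/backup/ucp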

    To monitor the cluster readiness, hover over the status icon of a specific cluster in the Status column of the Clusters page.

    Once the blinking orange status icon turns green and the cluster status is Ready, the cluster deployment or update is complete.

    Starting from Container Cloud 2.7.0, you can monitor live deployment status of the following cluster components:

    Component

    Description

    Bastion

    For the OpenStack and AWS-based clusters, the Bastion node IP address status that confirms the Bastion node creation

    Helm

    Installation or upgrade status of all Helm releases

    Kubelet

    Readiness of the node in a Kubernetes cluster, as reported by kubelet

    Kubernetes

    Readiness of all requested Kubernetes objects

    Nodes

    Equality of the number of requested nodes in the cluster to the number of ready nodes

    OIDC

    Readiness of the cluster OIDC configuration

    StackLight

    Health of all StackLight-related objects in a Kubernetes cluster

    Swarm

    Readiness of all nodes in a Docker Swarm cluster

    Note

    If the update hangs with the following error in logs, restart lcm-agent using the service lcm-agent-* restart command on the affected nodes:

    lcmAgentUpgradeStatus:
        error: 'failed to download agent binary: Get https://<mcc-cache-address>/bin/lcm/bin/lcm-agent/v0.2.0-289-gd7e9fa9c/lcm-agent:
          x509: certificate signed by unknown authority'
    
  7. For bare metal clusters, disable the maintenance flag for Ceph from the KaasCephCluster CR once the update is complete and all nodes are in the Ready status:

    spec:
      cephClusterSpec:
        maintenance: false
    

Caution

Due to development limitations, the MCR upgrade to version 19.03.13 or 19.03.14 on existing Container Cloud clusters is not supported.

Note

In rare cases, after a managed cluster upgrade, Grafana may stop working due to issues with helm-controller.

The development team is working on the issue, which will be addressed in an upcoming release.

Note

MKE and Kubernetes API may return short-term 50x errors during the upgrade process. Ignore these errors.

Delete a machine

This section instructs you on how to scale down an existing managed cluster through the Mirantis Container Cloud web UI.

Warning

An operational managed cluster deployment must contain a minimum of 3 Kubernetes manager nodes and 2 Kubernetes worker nodes. The deployment of the cluster does not start until the minimum number of nodes is created.

To meet the etcd quorum and to prevent the deployment failure, deletion of the manager nodes is prohibited.

A machine with the manager node role is automatically deleted during the managed cluster deletion.

To delete a machine from a managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click on the required cluster name to open the list of machines running on it.

  4. Click the More action icon in the last column of the machine you want to delete and select Delete. Confirm the deletion.

Deleting a machine automatically frees up the resources allocated to this machine.
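
To confirm that the corresponding node has left the cluster, you can list the remaining nodes using the managed cluster kubeconfig obtained as described in Connect to a Mirantis Container Cloud cluster. The deleted machine must no longer appear in the output:

  kubectl get nodes -o wide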

Attach an existing Mirantis Kubernetes Engine cluster

Starting from Mirantis Kubernetes Engine (MKE) 3.3.4, you can attach an existing MKE cluster that is not deployed by Mirantis Container Cloud to a management cluster. This feature allows for visualization of all your MKE cluster details in one place, including cluster health, capacity, and usage.

For supported configurations of existing MKE clusters that are not deployed by Container Cloud, see MKE, MSR, and MCR Compatibility Matrix.

Note

Using the free Mirantis license, you can create up to three Container Cloud managed clusters with three worker nodes on each cluster. Within the same quota, you can also attach existing MKE clusters that are not deployed by Container Cloud. If you need to increase this quota, contact Mirantis support for further details.

Using the instruction below, you can also install StackLight to your existing MKE cluster during the attach procedure. For the StackLight system requirements, refer to the Reference Architecture: Requirements of the corresponding cloud provider.

You can also update all your MKE clusters to the latest version once your management cluster automatically updates to a newer version where a new MKE Cluster release with the latest MKE version is available. For details, see Update a managed cluster.

Caution

An MKE cluster can be attached to only one management cluster. Attachment of a Container Cloud-based MKE cluster to another management cluster is not supported.

To attach an existing MKE cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, expand the Create Cluster menu and click Attach Existing MKE Cluster.

  4. In the wizard that opens, fill out the form with the following parameters as required:

    1. Configure general settings:

      MKE cluster configuration

      Section

      Parameter

      Description

      General Settings

      Cluster Name

      Specify the cluster name.

      Region

      Select the required cloud provider: OpenStack, AWS, or bare metal.

    2. Upload the MKE client bundle or fill in the fields manually. To download the MKE client bundle, refer to MKE user access: Download client certificates.

    3. Configure StackLight:

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable Monitoring

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration or Configure StackLight.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

      Note

      The logging mechanism performance depends on the cluster log load. In case of a high load, you may need to increase the default resource requests and limits for fluentdElasticsearch. For details, see StackLight configuration parameters: Resource limits.

      HA Mode

      Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see Reference Architecture: StackLight deployment architecture.

      Elasticsearch

      Retention Time

      The Elasticsearch logs retention period in Logstash.

      Persistent Volume Claim Size

      The Elasticsearch persistent volume claim size.

      Logs Severity Level Available since 2.6.0

      The severity level of logs to collect. For details about severity levels, see Logging.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: Available StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  5. Click Create.

    To monitor the cluster readiness, hover over the status icon of a specific cluster in the Status column of the Clusters page.

    Once the blinking orange status icon turns green and the cluster status is Ready, the cluster deployment or update is complete.

    Starting from Container Cloud 2.7.0, you can monitor live deployment status of the following cluster components:

    Component

    Description

    Bastion

    For the OpenStack and AWS-based clusters, the Bastion node IP address status that confirms the Bastion node creation

    Helm

    Installation or upgrade status of all Helm releases

    Kubelet

    Readiness of the node in a Kubernetes cluster, as reported by kubelet

    Kubernetes

    Readiness of all requested Kubernetes objects

    Nodes

    Equality of the number of requested nodes in the cluster to the number of ready nodes

    OIDC

    Readiness of the cluster OIDC configuration

    StackLight

    Health of all StackLight-related objects in a Kubernetes cluster

    Swarm

    Readiness of all nodes in a Docker Swarm cluster

Connect to the Mirantis Kubernetes Engine web UI

After you deploy a new or attach an existing Mirantis Kubernetes Engine (MKE) cluster to a management cluster, start managing your cluster using the MKE web UI.

To connect to the MKE web UI:

  1. Log in to the Mirantis Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required MKE cluster and select Cluster info.

  4. In the dialog box with the cluster information, copy the MKE UI endpoint.

  5. Paste the copied IP to a web browser and use the same credentials that you use to access the Container Cloud web UI.

Warning

To ensure the Container Cloud stability in managing the Container Cloud-based MKE clusters, a number of MKE API functions are not available for the Container Cloud-based MKE clusters as compared to the attached MKE clusters that are not deployed by Container Cloud. Use the Container Cloud web UI or CLI for this functionality instead.

See Reference Architecture: MKE API limitations for details.

Caution

The MKE web UI contains help links that lead to the MKE, MSR, and MCR documentation suite. Besides MKE and Mirantis Container Runtime (MCR), which are integrated with Container Cloud, that documentation suite covers other MKE, MSR, and MCR components and cannot be fully applied to the Container Cloud-based MKE clusters. Therefore, to avoid misconceptions, before you proceed with the MKE web UI documentation, read Reference Architecture: MKE API limitations and make sure you are using the documentation of the supported MKE version as per Release Compatibility Matrix.

Connect to a Mirantis Container Cloud cluster

After you deploy a Mirantis Container Cloud management or managed cluster, connect to the cluster to verify the availability and status of the nodes as described below.

This section also describes how to SSH to a node of a cluster where a Bastion host is used for SSH access, for example, on an OpenStack-based management cluster or AWS-based management and managed clusters.

To connect to a managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with the Machines list opens.

  4. Verify the status of the manager nodes. Once the first manager node is deployed and has the Ready status, the Download Kubeconfig option for the cluster being deployed becomes active.

  5. Open the Clusters tab.

  6. Click the More action icon in the last column of the required cluster and select Download Kubeconfig:

    1. Enter your user password.

    2. Not recommended. Select Offline Token to generate an offline IAM token. Otherwise, for security reasons, the kubeconfig token expires after 30 minutes of Container Cloud API idle time and you have to download kubeconfig again with a newly generated token.

    3. Click Download.

  7. Verify the availability of the managed cluster machines:

    1. Export the kubeconfig parameters to your local machine with access to kubectl. For example:

      export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml
      
    2. Obtain the list of available Container Cloud machines:

      kubectl get nodes -o wide
      

      The system response must contain the details of the nodes in the Ready status.

To connect to a management cluster:

  1. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  2. Obtain the list of available management cluster machines:

    kubectl get nodes -o wide
    

    The system response must contain the details of the nodes in the Ready status.

To SSH to a Container Cloud cluster node if Bastion is used:

  1. Obtain kubeconfig of the management or managed cluster as described in the procedures above.

  2. Obtain the internal IP address of a node you require access to:

    kubectl get nodes -o wide
    
  3. Obtain the Bastion public IP:

    kubectl get cluster -o jsonpath='{.status.providerStatus.bastion.publicIp}' \
    -n <project_name> <cluster_name>
    
  4. Run the following command:

    ssh -i <private_key> mcc-user@<node_internal_ip> -o "proxycommand ssh -W %h:%p \
    -i <private_key> mcc-user@<bastion_public_ip>"
    

    Substitute the parameters enclosed in angle brackets with the corresponding values of your cluster obtained in previous steps.

    The <private_key> for a management cluster is ssh_key created during bootstrap in the same directory as the bootstrap script. For a managed cluster, this is the SSH Key that you added in the Container Cloud web UI before the managed cluster creation.
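
    For example, with hypothetical values, where ssh_key is the bootstrap SSH key, 10.10.10.15 is the node internal IP, and 172.16.10.20 is the Bastion public IP:

      ssh -i ssh_key mcc-user@10.10.10.15 -o "proxycommand ssh -W %h:%p \
      -i ssh_key mcc-user@172.16.10.20"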

Operate management and regional clusters

The Mirantis Container Cloud web UI enables you to perform the following operations with the Container Cloud management and regional clusters:

  • View the cluster details (such as cluster ID, creation date, node count, and so on) as well as obtain a list of the cluster endpoints, including the StackLight components, depending on your deployment configuration.

    To view generic cluster details, in the Clusters tab, click the More action icon in the last column of the required cluster and select Cluster info.

    Note

    • Adding more than 3 nodes or deleting nodes from a management or regional cluster is not supported.

    • Removing a management or regional cluster using the Container Cloud web UI is not supported. Use the dedicated cleanup script instead. For details, see Remove a management cluster and Remove a regional cluster.

    • Before removing a regional cluster, delete the credentials of the deleted managed clusters associated with the region.

  • Verify the current release version of the cluster including the list of installed components with their versions and the cluster release change log.

    To view a cluster release version details, in the Clusters tab, click the version in the Release column next to the name of the required cluster.

This section outlines the operations that can be performed with a management or regional cluster.

Automatic upgrade workflow

A management cluster upgrade to a newer version is performed automatically once a new Container Cloud version is released. Regional clusters also upgrade automatically along with the management cluster. For more details about the Container Cloud release upgrade mechanism, see: Reference Architecture: Container Cloud release controller.

Container Cloud remains operational during the management and regional clusters upgrade. Managed clusters are not affected during this upgrade. For the list of components that are updated during the Container Cloud upgrade, see the Components versions section of the corresponding Container Cloud release in Release Notes.

When Mirantis announces support of the newest versions of Mirantis Container Runtime (MCR) and Mirantis Kubernetes Engine (MKE), Container Cloud automatically upgrades these components as well. For the maintenance window best practices before upgrade of these components, see MKE and MCR Documentation.

Caution

Due to development limitations, the MCR upgrade to version 19.03.13 or 19.03.14 on existing Container Cloud clusters is not supported.

Note

MKE and Kubernetes API may return short-term 50x errors during the upgrade process. Ignore these errors.

Configure NTP server for a regional cluster

If you did not add the NTP server parameters during the management cluster bootstrap, configure them on the existing regional cluster as required. These parameters are applied to all machines of regional and managed clusters in the specified region.

Warning

The procedure below triggers an upgrade of all clusters in a specific region, which may lead to workload disruption during node cordoning and draining.

To configure an NTP server for a regional cluster:

  1. Download your management cluster kubeconfig:

    1. Log in to the Mirantis Container Cloud web UI with the writer permissions.

    2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

    3. Expand the menu of the tab with your user name.

    4. Click Download kubeconfig to download kubeconfig of your management cluster.

    5. Log in to any local machine with kubectl installed.

    6. Copy the downloaded kubeconfig to this machine.

  2. Use the downloaded kubeconfig to edit the management cluster:

    kubectl --kubeconfig <kubeconfigPath> edit -n <projectName> cluster <managementClusterName>
    

    In the command above and the step below, replace the parameters enclosed in angle brackets with the corresponding values of your cluster.

  3. In the regional section, add the ntp:servers section with the list of the required server names:

    spec:
      ...
      providerSpec:
        value:
          kaas:
          ...
            regional:
              - helmReleases:
                - name: <providerName>
                  values:
                    config:
                      lcm:
                        ...
                        ntp:
                          servers:
                          - 0.pool.ntp.org
                          ...
    

Remove a management cluster

This section describes how to remove a management cluster.

To remove a management cluster:

  1. Verify that you have successfully removed all managed clusters that run on top of the management cluster to be removed. For details, see the corresponding Delete a managed cluster section depending on your cloud provider in Create and operate managed clusters.

  2. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  3. Run the following script:

    bootstrap.sh cleanup
    

Note

Removing a management or regional cluster using the Container Cloud web UI is not supported.

Remove a regional cluster

This section describes how to remove a regional cluster.

To remove a regional cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the project with the managed clusters of the regional cluster to remove using the Switch Project action icon located on top of the main left-side navigation panel.

  3. Verify that you have successfully deleted all managed clusters that run on top of the regional cluster to be removed. For details, see the corresponding Delete a managed cluster section depending on your cloud provider in Create and operate managed clusters.

  4. Delete the credentials associated with the region:

    1. In the Credentials tab, click the first credentials name.

    2. In the window that opens, capture the Region Name field.

    3. Repeat the two previous steps for the remaining credentials in the list.

    4. Delete all credentials with the name of the region that you are going to remove.

  5. Log in to a local machine where your management and regional clusters kubeconfig files are located and where kubectl is installed.

    Note

    The management or regional cluster kubeconfig files are created during the last stage of the management or regional cluster bootstrap.

  6. Run the following script with the corresponding values of your cluster:

    REGIONAL_CLUSTER_NAME=<regionalClusterName> REGIONAL_KUBECONFIG=<pathToRegionalClusterKubeconfig> KUBECONFIG=<mgmtClusterKubeconfig> ./bootstrap.sh destroy_regional
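
    # For example, with hypothetical cluster names and kubeconfig paths
    # (replace them with the actual values of your deployment):
    REGIONAL_CLUSTER_NAME=region-one REGIONAL_KUBECONFIG=./kubeconfig-region-one.yml KUBECONFIG=./kubeconfig-mgmt.yml ./bootstrap.sh destroy_regional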
    

Note

Removing a management or regional cluster using the Container Cloud web UI is not supported.

Manage IAM

IAM CLI

IAM CLI is a user-facing command-line tool for managing scopes, roles, and grants. Using your personal credentials, you can perform different IAM operations through the iamctl tool. For example, you can verify the current status of the IAM service, request or revoke service tokens, verify your own grants within Mirantis Container Cloud as well as your token details.

Configure IAM CLI

The iamctl command-line interface uses the iamctl.yaml configuration file to interact with IAM.

To create the IAM CLI configuration file:

  1. Log in to the management cluster.

  2. If you do not have iamctl, install it using the download link for the latest version available in the Artifacts section of the current Container Cloud release. For details, see Container Cloud Release notes.

  3. Change the directory to one of the following:

    • $HOME/.iamctl

    • $HOME

    • $HOME/etc

    • /etc/iamctl

  4. Create iamctl.yaml with the following exemplary parameters and values that correspond to your deployment:

    server: <IAM_API_ADDRESS>
    timeout: 60
    verbose: 99 # Verbosity level, from 0 to 99
    
    tls:
        enabled: true
        ca: <PATH_TO_CA_BUNDLE>
    
    auth:
        issuer: <IAM_REALM_IN_KEYCLOAK>
        ca: <PATH_TO_CA_BUNDLE>
        client_id: iam
        client_secret:
    
    • The <IAM_API_ADDRESS> value has the <ip>:<port> or <dns-name> format.

    • The <IAM_REALM_IN_KEYCLOAK> value has the <keycloak-url>/auth/realms/<realm-name> format, where <realm-name> defaults to iam.
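
    After the configuration file is created, you can verify that iamctl reaches IAM, for example, by logging in and requesting your account details (see the command reference in Available IAM CLI commands below):

      iamctl account login
      iamctl account info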

Available IAM CLI commands

Using iamctl, you can perform different role-based access control operations in your managed cluster. For example:

  • Grant or revoke access to a managed cluster and a specific user for troubleshooting

  • Grant or revoke access to a Mirantis Container Cloud project that contains several managed clusters

  • Create or delete tokens for the Container Cloud services with a specific set of grants as well as identify when a service token was last used

The following tables describe the sets of commands available in the iamctl command-line interface.

General commands

Usage

Description

iamctl --help, iamctl help

Output the list of available commands.

iamctl help <command>

Output the description of a specific command.

Account information commands

Usage

Description

iamctl account info

Output detailed account information such as user email, user name, the details of their active and offline sessions, token statuses and expiration dates.

iamctl account login

Log in the current user. The system prompts to enter your authentication credentials. After a successful login, your user token is added to the $HOME/.iamctl directory.

iamctl account logout

Log out the current user. Once done, the user information is removed from $HOME/.iamctl.

Scope commands

Usage

Description

iamctl scope list

List the IAM scopes available for the current environment.

Example output:

+---------------+--------------------------+
|     NAME      |   DESCRIPTION            |
+---------------+--------------------------+
| m:iam         | IAM scope                |
| m:kaas        | Container Cloud scope    |
| m:k8s:managed |                          |
| m:k8s         | Kubernetes scope         |
| m:cloud       | Cloud scope              |
+---------------+--------------------------+

iamctl scope list [prefix]

Output the specified scope list. For example: iamctl scope list m:k8s.

Role commands

Usage

Description

iamctl role list <scope>

List the roles for the specified scope in IAM.

iamctl role show <scope> <role>

Output the details of the specified scope role including the role name (admin, viewer, reader), its description, and an example of the grant command. For example: iamctl role show m:iam admin.

Grant commands

Usage

Description

iamctl grant give [username] [scope] [role]

Provide a user with a role in a scope. For example, the iamctl grant give jdoe m:iam admin command provides the IAM admin role in the m:iam scope to John Doe.

For the list of supported IAM scopes and roles, see: Role list.

Note

To lock or disable a user, use LDAP or Google OAuth depending on the external provider integrated into your deployment.

iamctl grant list <username>

List the grants provided to the specified user. For example: iamctl grant list jdoe.

Example output:

+--------+--------+---------------+
| SCOPE  |  ROLE  |   GRANT FQN   |
+--------+--------+---------------+
| m:iam  | admin  | m:iam@admin   |
| m:sl   | viewer | m:sl@viewer   |
| m:kaas | writer | m:kaas@writer |
+--------+--------+---------------+
  • m:iam@admin - admin rights in all IAM-related applications

  • m:sl@viewer - viewer rights in all StackLight-related applications

  • m:kaas@writer - writer rights in Container Cloud

iamctl grant revoke [username] [scope] [role]

Revoke the grants provided to the user.

Service token commands

Usage

Description

iamctl servicetoken list [--all]

List the details of all service tokens created by the current user. The output includes the following service token details:

  • ID

  • Alias, for example, nova, jenkins-ci

  • Creation date and time

  • Creation owner

  • Grants

  • Last refresh date and time

  • IP address

iamctl servicetoken show [ID]

Output the details of a service token with the specified ID.

iamctl servicetoken create [alias] [service] [grant1 grant2...]

Create a token for a specific service with the specified set of grants. For example, iamctl servicetoken create new-token iam m:iam@viewer.

iamctl servicetoken delete [ID1 ID2...]

Delete the service tokens with the specified IDs.

User commands

Usage

Description

iamctl user list

List user names and emails of all current users.

iamctl user show <username>

Output the details of the specified user.

Role list

Mirantis Container Cloud creates the IAM roles in scopes. For each application type, such as iam, k8s, or kaas, Container Cloud creates a scope in Keycloak, and every scope contains a set of roles such as admin, user, viewer. The default IAM roles can be changed during a managed cluster deployment. You can grant or revoke role access using the IAM CLI. For details, see: IAM CLI.

Example of the structure of a cluster-admin role in a managed cluster:

m:k8s:kaas-tenant-name:k8s-cluster-name@cluster-admin
  • m - prefix for all IAM roles in Container Cloud

  • k8s - application type, Kubernetes

  • kaas-tenant-name:k8s-cluster-name - a managed cluster identifier in Container Cloud (CLUSTER_ID)

  • @ - delimiter between a scope and role

  • cluster-admin - name of the role within the Kubernetes scope
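
For example, combining this structure with the grant commands from Available IAM CLI commands, granting such a role could look as follows (the user name and the cluster identifier are hypothetical):

  iamctl grant give jdoe m:k8s:kaas-tenant-name:k8s-cluster-name cluster-admin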


The following tables include the scopes and their roles descriptions by Container Cloud components:

Container Cloud

Scope identifier

Role name

Grant example

Role description

m:kaas

reader

m:kaas@reader 0

List the managed clusters within the Container Cloud scope.

writer

m:kaas@writer 0

Create or delete the managed clusters within the Container Cloud scope.

operator

m:kaas@operator

Add or delete a bare metal host and machine within the Container Cloud scope, create a project.

m:kaas:$<CLUSTER_ID>

reader

m:kaas:$<CLUSTER_ID>@reader

List the managed clusters within the specified Container Cloud cluster ID.

writer

m:kaas:$<CLUSTER_ID>@writer

Create or delete the managed clusters within the specified Container Cloud cluster ID.

0

Grant is available by default. Other grants can be added during a management and managed cluster deployment.

Kubernetes

Scope identifier

Role name

Grant example

Role description

m:k8s:<CLUSTER_ID>

cluster-admin

m:k8s:<CLUSTER_ID>@cluster-admin

Allow the super-user access to perform any action on any resource on the cluster level. When used in a ClusterRoleBinding, provide full control over every resource in the cluster and in all Kubernetes namespaces.

StackLight

Scope identifier

Role name

Grant example

Role description

m:sl:$<CLUSTER_ID> or m:sl:$<CLUSTER_ID>:<SERVICE_NAME>

admin

  • m:sl:$<CLUSTER_ID>@admin

  • m:sl:$<CLUSTER_ID>:alerta@admin

  • m:sl:$<CLUSTER_ID>:alertmngmnt@admin

  • m:sl:$<CLUSTER_ID>:kibana@admin

  • m:sl:$<CLUSTER_ID>:grafana@admin

  • m:sl:$<CLUSTER_ID>:prometheus@admin

Access the specified web UI(s) within the scope.

The m:sl:$<CLUSTER_ID>@admin grant provides access to all StackLight web UIs: Prometheus, Alerta, Alertmanager, Kibana, Grafana.

Change passwords for IAM users

For security reasons, Mirantis strongly recommends changing the default passwords for IAM users on publicly accessible Mirantis Container Cloud deployments.

To change the IAM passwords:

  1. Obtain the Keycloak admin password:

    kubectl get secret -n kaas iam-api-secrets -o jsonpath='{.data.keycloak_password}' | base64 -d ; echo
    
  2. Obtain the Keycloak load balancer IP:

    kubectl get svc -n kaas iam-keycloak-http
    
  3. Log in to the Keycloak web UI using the following link format with the default keycloak admin user and the Keycloak credentials obtained in the previous steps:

    https://<Keycloak-LB-IP>/auth/admin/master/console/#/realms/iam/users

  4. In the Manage > Users menu, select the required user.

  5. Open the Credentials tab.

  6. Using the Reset password form, update the password as required.

    Note

    To change the password permanently, toggle the Temporary switch to the OFF position. Otherwise, the user will be prompted to change the password after the next login.

Manage StackLight

Using StackLight, you can monitor the components deployed in Mirantis Container Cloud and be quickly notified of critical conditions that may occur in the system to prevent service downtimes.

Access StackLight web UIs

By default, StackLight provides five web UIs including Prometheus, Alertmanager, Alerta, Kibana, and Grafana. This section describes how to access any of these web UIs. To use an optional Cerebro web UI, which is disabled by default, to debug the Elasticsearch clusters, see Access Elasticsearch clusters using Cerebro.

To access a StackLight web UI:

  1. Log in to the Mirantis Container Cloud web UI.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required cluster and select Cluster info.

  4. In the dialog box with the cluster information, copy the required endpoint IP from the StackLight Endpoints section.

  5. Paste the copied IP to a web browser and use the default credentials to log in to the web UI. Once done, you are automatically authenticated to all StackLight web UIs.
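
Alternatively, you can list the StackLight services and their external addresses from the command line. This is a minimal sketch that assumes StackLight is deployed in the stacklight namespace and the cluster kubeconfig is exported; the set of services depends on your StackLight configuration:

  kubectl get svc -n stacklight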

Note

The Alertmanager web UI displays alerts received by all configured receivers, which can be mistaken for duplicates. To only display the alerts received by a particular receiver, use the Receivers filter.

View Grafana dashboards

Using the Grafana web UI, you can view the visual representation of the metric graphs based on the time series databases. Most Grafana dashboards include a View logs in Kibana link to immediately view relevant logs in the Kibana web UI.

To view the Grafana dashboards:

  1. Log in to the Grafana web UI as described in Access StackLight web UIs.

  2. From the drop-down list, select the required dashboard to inspect the status and statistics of the corresponding service in your management or managed cluster:

    Component

    Dashboard

    Description

    Ceph cluster

    Ceph Cluster

    Provides the overall health status of the Ceph cluster, capacity, latency, and recovery metrics.

    Ceph Nodes

    Provides an overview of the host-related metrics, such as the number of Ceph Monitors, Ceph OSD hosts, average usage of resources across the cluster, network and hosts load.

    Ceph OSD

    Provides metrics for Ceph OSDs, including the Ceph OSD read and write latencies, distribution of PGs per Ceph OSD, Ceph OSDs and physical device performance.

    Ceph Pools

    Provides metrics for Ceph pools, including the client IOPS and throughput by pool and pools capacity usage.

    Ironic bare metal

    Ironic BM

    Provides graphs on Ironic health, HTTP API availability, provisioned nodes by state and installed ironic-conductor back-end drivers.

    Container Cloud clusters

    Clusters Overview

    Represents the main cluster capacity statistics for all clusters of a Mirantis Container Cloud deployment where StackLight is installed.

    Kubernetes resources

    Kubernetes Calico

    Provides metrics of the entire Calico cluster usage, including the cluster status, host status, and Felix resources.

    Kubernetes Cluster

    Provides metrics for the entire Kubernetes cluster, including the cluster status, host status, and resources consumption.

    Kubernetes Deployments

    Provides information on the desired and current state of all service replicas deployed on a Container Cloud cluster.

    Kubernetes Namespaces

Provides the pods state summary and the CPU, MEM, network, and IOPS resources consumption per namespace.

    Kubernetes Nodes

    Provides charts showing resources consumption per Container Cloud cluster node.

    Kubernetes Pods

    Provides charts showing resources consumption per deployed pod.

    NGINX

    NGINX

    Provides the overall status of the NGINX cluster and information about NGINX requests and connections.

    StackLight

    Alertmanager

    Provides performance metrics on the overall health status of the Prometheus Alertmanager service, the number of firing and resolved alerts received for various periods, the rate of successful and failed notifications, and the resources consumption.

    Elasticsearch

    Provides information about the overall health status of the Elasticsearch cluster, including the resources consumption and the state of the shards.

    Grafana

    Provides performance metrics for the Grafana service, including the total number of Grafana entities, CPU and memory consumption.

    PostgreSQL

    Provides PostgreSQL statistics, including read (DQL) and write (DML) row operations, transaction and lock, replication lag and conflict, and checkpoint statistics, as well as PostgreSQL performance metrics.

    Prometheus

    Provides the availability and performance behavior of the Prometheus servers, the sample ingestion rate, and system usage statistics per server. Also, provides statistics about the overall status and uptime of the Prometheus service, the chunks number of the local storage memory, target scrapes, and queries duration.

    Pushgateway

    Provides performance metrics and the overall health status of the service, the rate of samples received for various periods, and the resources consumption.

    Prometheus Relay

    Provides service status and resources consumption metrics.

    Telemeter Server

    Provides statistics and the overall health status of the Telemeter service.

    System

    System

    Provides a detailed resource consumption and operating system information per Container Cloud cluster node.

    Mirantis Kubernetes Engine (MKE)

    MKE Cluster

    Provides a global overview of an MKE cluster: statistics about the number of the worker and manager nodes, containers, images, Swarm services.

    MKE Containers

    Provides per container resources consumption metrics for the MKE containers such as CPU, RAM, network.

View Kibana dashboards

Using the Kibana web UI, you can view the visual representation of logs and Kubernetes events of your deployment.

To view the Kibana dashboards:

  1. Log in to the Kibana web UI as described in Access StackLight web UIs.

  2. Click the required dashboard to inspect the visualizations or perform a search:

    Dashboard

    Description

    Logs

    Provides visualizations on the number of log messages per severity, source, and top log-producing host, namespaces, containers, and applications. Includes search.

    Kubernetes events

    Provides visualizations on the number of Kubernetes events per type, and top event-producing resources and namespaces by reason and event type. Includes search.

Available StackLight alerts

This section provides an overview of the available predefined StackLight alerts. To view the alerts, use the Prometheus web UI. To view the firing alerts, use Alertmanager or Alerta web UI.

Alert dependencies

Using alert inhibition rules, Alertmanager decreases alert noise by suppressing notifications for dependent alerts to provide a clearer view of the cloud status and simplify troubleshooting. Alert inhibition rules are enabled by default.

The following table describes the dependency between alerts. Once an alert from the Alert column is raised, the alert from the Silences column is suppressed and obtains the Inhibited status in the Alertmanager web UI.

Alert

Silences

CephClusterFullCritical

CephClusterFullWarning

CephClusterHealthCritical

CephClusterHealthMinor

CephOSDDiskUnavailable

CephOSDDiskNotResponding

CephOSDPgNumTooHighCritical

CephOSDPgNumTooHighWarning

DockerSwarmServiceReplicasFlapping

DockerSwarmServiceReplicasDown

DockerSwarmServiceReplicasOutage

DockerSwarmServiceReplicasDown

ElasticClusterStatusCritical

ElasticClusterStatusWarning

ElasticHeapUsageCritical

ElasticHeapUsageWarning

TargetFlapping

TargetDown

FileDescriptorUsageCritical

FileDescriptorUsageMajor

FileDescriptorUsageWarning

FileDescriptorUsageMajor

FileDescriptorUsageWarning

SystemDiskFullMajor

SystemDiskFullWarning

SystemDiskInodesFullMajor

SystemDiskInodesFullWarning

SystemLoadTooHighCritical

SystemLoadTooHighWarning

SystemMemoryFullMajor

SystemMemoryFullWarning

KubePersistentVolumeUsageCritical

KubePersistentVolumeFullInFourDays

KubeAPIErrorsHighMajor

KubeAPIErrorsHighWarning

KubeAPILatencyHighMajor

KubeAPILatencyHighWarning

KubeAPIOutage

KubeAPIDown

KubeAPIResourceErrorsHighMajor

KubeAPIResourceErrorsHighWarning

KubeClientCertificateExpirationInOneDay

KubeClientCertificateExpirationInSevenDays

MKEAPIOutage

MKEAPIDown

MKENodeDiskFullCritical

MKENodeDiskFullWarning

PostgresqlPrimaryDown

PostgresqlPatroniClusterUnlocked

PostgresqlReplicationNonStreamingReplicas

PostgresqlReplicationPaused

PostgresqlReplicaDown

PostgresqlReplicationNonStreamingReplicas

PostgresqlReplicationPaused

PostgresqlReplicationSlowWalApplication

PostgresqlReplicationSlowWalDownload

PostgresqlReplicationWalArchiveWriteFailing

PrometheusErrorSendingAlertsMajor

PrometheusErrorSendingAlertsWarning

SSLCertExpirationMajor

SSLCertExpirationWarning

MCCSSLCertExpirationMajor

MCCSSLCertExpirationWarning
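
For reference, Alertmanager implements such dependencies through inhibition rules. The following snippet is only an illustrative sketch of the mechanism based on one pair of alerts from the table above; it is not the exact StackLight configuration, and the label listed in equal is an assumption:

  inhibit_rules:
    - source_match:
        alertname: CephClusterFullCritical
      target_match:
        alertname: CephClusterFullWarning
      equal:
        - cluster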

Alertmanager

This section describes the alerts for the Alertmanager service.


AlertmanagerFailedReload

Severity

Warning

Summary

Failure to reload the Alertmanager configuration.

Description

Reloading the Alertmanager configuration failed for the {{ $labels.namespace }}/{{ $labels.pod }} Pod.


AlertmanagerMembersInconsistent

Severity

Major

Summary

Alertmanager cluster members are not found.

Description

Alertmanager has not found all other members of the cluster.


AlertmanagerNotificationFailureWarning

Severity

Warning

Summary

Alertmanager has failed notifications.

Description

An average of {{ $value }} Alertmanager {{ $labels.integration }} notifications on the {{ $labels.namespace }}/{{ $labels.pod }} Pod fail for 2 minutes.


AlertmanagerAlertsInvalidWarning

Severity

Warning

Summary

Alertmanager has invalid alerts.

Description

An average of {{ $value }} Alertmanager {{ $labels.integration }} alerts on the {{ $labels.namespace }}/{{ $labels.pod }} Pod are invalid for 2 minutes.

Calico

This section describes the alerts for Calico.


CalicoDataplaneFailuresHigh

Severity

Warning

Summary

High number of data plane failures within Felix.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Felix Pod on the {{ $labels.node }} node has {{ $value }} data plane failures within the last hour.


CalicoDataplaneAddressMsgBatchSizeHigh

Severity

Warning

Summary

Felix address message batch size is higher than 5.

Description

The size of the data plane address message batch on the {{ $labels.namespace }}/{{ $labels.pod }} Felix Pod on the {{ $labels.node }} node is {{ $value }}.


CalicoDatapaneIfaceMsgBatchSizeHigh

Severity

Warning

Summary

Felix interface message batch size is higher than 5.

Description

The size of the data plane interface message batch on the {{ $labels.namespace }}/{{ $labels.pod }} Felix Pod on the {{ $labels.node }} node is {{ $value }}.


CalicoIPsetErrorsHigh

Severity

Warning

Summary

More than 5 IPset errors occur in Felix per hour.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Felix Pod on the {{ $labels.node }} node has {{ $value }} IPset errors within the last hour.


CalicoIptablesSaveErrorsHigh

Severity

Warning

Summary

More than 5 iptable save errors occur in Felix per hour.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Felix Pod on the {{ $labels.node }} node has {{ $value }} iptable save errors within the last hour.


CalicoIptablesRestoreErrorsHigh

Severity

Warning

Summary

More than 5 iptable restore errors occur in Felix per hour.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Felix Pod on the {{ $labels.node }} node has {{ $value }} iptable restore errors within the last hour.

Ceph

This section describes the alerts for the Ceph cluster.


CephClusterHealthMinor

Severity

Minor

Summary

Ceph cluster health is WARNING.

Description

The Ceph cluster is in the WARNING state. For details, run ceph -s.


CephClusterHealthCritical

Severity

Critical

Summary

Ceph cluster health is CRITICAL.

Description

The Ceph cluster is in the CRITICAL state. For details, run ceph -s.


CephMonQuorumAtRisk

Severity

Major

Summary

Ceph cluster quorum is at risk.

Description

The Ceph cluster quorum is low.


CephOSDDownMinor

Severity

Minor

Summary

Ceph OSDs are down.

Description

{{ $value }} of Ceph OSDs in the Ceph cluster are down. For details, run ceph osd tree.


CephOSDDiskNotResponding

Severity

Critical

Summary

Disk is not responding.

Description

The {{ $labels.device }} disk device is not responding on the {{ $labels.host }} host.


CephOSDDiskUnavailable

Severity

Critical

Summary

Disk is not accessible.

Description

The {{ $labels.device }} disk device is not accessible on the {{ $labels.host }} host.


CephClusterFullWarning

Severity

Warning

Summary

Ceph cluster is nearly full.

Description

The Ceph cluster utilization has crossed 85%, expansion is required.


CephClusterFullCritical

Severity

Critical

Summary

Ceph cluster is full.

Description

The Ceph cluster utilization has crossed 95%, immediate expansion is required.


CephOSDPgNumTooHighWarning

Severity

Warning

Summary

Some Ceph OSDs have more than 200 PGs.

Description

Some Ceph OSDs contain more than 200 Placement Groups. This may have a negative impact on the cluster performance. For details, run ceph pg dump.


CephOSDPgNumTooHighCritical

Severity

Critical

Summary

Some Ceph OSDs have more than 300 PGs.

Description

Some Ceph OSDs contain more than 300 Placement Groups. This may have a negative impact on the cluster performance. For details, run ceph pg dump.


CephMonHighNumberOfLeaderChanges

Severity

Warning

Summary

Too many leader changes occur in the Ceph cluster.

Description

{{ $value }} leader changes per minute occur for the {{ $labels.instance }} instance of the {{ $labels.job }} Ceph Monitor.


CephNodeDown

Severity

Critical

Summary

Ceph node {{ $labels.node }} went down.

Description

The {{ $labels.node }} Ceph node is down and requires immediate verification.


CephOSDVersionMismatch

Severity

Warning

Summary

Multiple versions of Ceph OSDs are running.

Description

{{ $value }} different versions of Ceph OSD components are running.


CephMonVersionMismatch

Severity

Warning

Summary

Multiple versions of Ceph Monitors are running.

Description

{{ $value }} different versions of Ceph Monitor components are running.


CephPGInconsistent

Severity

Minor

Summary

Too many inconsistent Ceph PGs.

Description

The Ceph cluster detects inconsistencies in one or more replicas of an object in {{ $value }} Placement Groups.


CephPGUndersized

Severity

Minor

Summary

Too many undersized Ceph PGs.

Description

The Ceph cluster reports {{ $value }} Placement Groups have fewer copies than the configured pool replication level.

Docker Swarm

This section describes the alerts for the Docker Swarm service.


DockerSwarmLeadElectionLoop

Severity

Major

Summary

Docker Swarm Manager leadership election loop.

Description

More than 2 Docker Swarm leader elections occur for the last 10 minutes.


DockerSwarmNetworkUnhealthy

Severity

Warning

Summary

Docker Swarm network is unhealthy.

Description

The qLen size and NetMsg showed unexpected output for the last 10 minutes. Verify the NetworkDb Stats output for the qLen size and NetMsg using journalctl -u docker.

Note

For the DockerSwarmNetworkUnhealthy alert, StackLight collects metrics from logs. Therefore, this alert is available only if logging is enabled.


DockerSwarmNodeFlapping

Severity

Major

Summary

Docker Swarm node is flapping.

Description

The {{ $labels.node_name }} Docker Swarm node has changed the state more than 3 times for the last 10 minutes.


DockerSwarmServiceReplicasDown

Severity

Major

Summary

Docker Swarm replica is down.

Description

The {{ $labels.service_name }} Docker Swarm service replica is down for 2 minutes.


DockerSwarmServiceReplicasFlapping

Severity

Major

Summary

Docker Swarm service replica is flapping.

Description

The {{ $labels.service_name }} Docker Swarm service replica is flapping for 15 minutes.


DockerSwarmServiceReplicasOutage

Severity

Critical

Summary

Docker Swarm service outage.

Description

All {{ $labels.service_name }} Docker Swarm service replicas are down for 2 minutes.

Elasticsearch

This section describes the alerts for the Elasticsearch service.


ElasticHeapUsageCritical

Severity

Critical

Summary

Elasticsearch heap usage is too high (>90%).

Description

Elasticsearch heap usage is over 90% for 5 minutes.


ElasticHeapUsageWarning

Severity

Warning

Summary

Elasticsearch heap usage is high (>80%).

Description

Elasticsearch heap usage is over 80% for 5 minutes.


ElasticClusterStatusCritical

Severity

Critical

Summary

Elasticsearch critical status.

Description

The Elasticsearch cluster status has changed to RED.


ElasticClusterStatusWarning

Severity

Warning

Summary

Elasticsearch warning status.

Description

The Elasticsearch cluster status has changed to YELLOW. The alert persists for the cluster in the RED status.


NumberOfRelocationShards

Severity

Warning

Summary

Shards relocation takes more than 20 minutes.

Description

Elasticsearch has {{ $value }} relocating shards for 20 minutes.


NumberOfInitializingShards

Severity

Warning

Summary

Shards initialization takes more than 10 minutes.

Description

Elasticsearch has {{ $value }} shards being initialized for 10 minutes.


NumberOfUnassignedShards

Severity

Major

Summary

Shards have unassigned status for 5 minutes.

Description

Elasticsearch has {{ $value }} unassigned shards for 5 minutes.


NumberOfPendingTasks

Severity

Warning

Summary

Tasks have pending state for 10 minutes.

Description

Elasticsearch has {{ $value }} pending tasks for 10 minutes. The cluster works slowly.


ElasticNoNewDataCluster

Severity

Major

Summary

Elasticsearch cluster has no new data for 30 minutes.

Description

No new data has arrived to the Elasticsearch cluster for 30 minutes.


ElasticNoNewDataNode

Severity

Warning

Summary

Elasticsearch node has no new data for 30 minutes.

Description

No new data has arrived to the {{ $labels.name }} Elasticsearch node for 30 minutes. The alert also indicates Elasticsearch node cordoning.

etcd

This section describes the alerts for the etcd service.


etcdInsufficientMembers

Severity

Critical

Summary

The etcd cluster has insufficient members.

Description

The {{ $labels.job }} etcd cluster has {{ $value }} insufficient members.


etcdNoLeader

Severity

Critical

Summary

The etcd cluster has no leader.

Description

The {{ $labels.instance }} member of the {{ $labels.job }} etcd cluster has no leader.


etcdHighNumberOfLeaderChanges

Severity

Warning

Summary

More than 3 leader changes occurred in the etcd cluster within the last hour.

Description

The {{ $labels.instance }} instance of the {{ $labels.job }} etcd cluster has {{ $value }} leader changes within the last hour.


etcdGRPCRequestsSlow

Severity

Warning

Summary

The etcd cluster has slow gRPC requests.

Description

The gRPC requests to {{ $labels.grpc_method }} take {{ $value }}s on {{ $labels.instance }} instance of the {{ $labels.job }} etcd cluster.


etcdMemberCommunicationSlow

Severity

Warning

Summary

The etcd cluster has slow member communication.

Description

The member communication with {{ $labels.To }} on the {{ $labels.instance }} instance of the {{ $labels.job }} etcd cluster takes {{ $value }}s.


etcdHighNumberOfFailedProposals

Severity

Warning

Summary

The etcd cluster has more than 5 proposal failures.

Description

The {{ $labels.job }} etcd cluster has {{ $value }} proposal failures on the {{ $labels.instance }} etcd instance within the last hour.


etcdHighFsyncDurations

Severity

Warning

Summary

The etcd cluster has high fsync duration.

Description

The duration of 99% of all fsync operations on the {{ $labels.instance }} of the {{ $labels.job }} etcd cluster is {{ $value }}s.


etcdHighCommitDurations

Severity

Warning

Summary

The etcd cluster has high commit duration.

Description

The duration of 99% of all commit operations on the {{ $labels.instance }} of the {{ $labels.job }} etcd cluster is {{ $value }}s.

External endpoint

This section describes the alerts for external endpoints.


ExternalEndpointDown

Severity

Critical

Summary

External endpoint is down.

Description

The {{ $labels.instance }} external endpoint is not accessible for the last 2 minutes.


ExternalEndpointTCPFailure

Severity

Critical

Summary

Failure to establish a TCP or TLS connection.

Description

The system cannot establish a TCP or TLS connection to {{ $labels.instance }}.

General alerts

This section lists the general available alerts.


TargetDown

Severity

Critical

Summary

The {{ $labels.job }} target is down.

Description

The {{ $labels.job }}/{{ $labels.instance }} target is down.


TargetFlapping

Severity

Critical

Summary

The {{ $labels.job }} target is flapping.

Description

The {{ $labels.job }}/{{ $labels.instance }} target has been changing its state between UP and DOWN for 30 minutes, at least once within every 15-minute time range.


NodeDown

Severity

Critical

Summary

The {{ $labels.node }} node is down.

Description

The {{ $labels.node }} node is down. Kubernetes treats the node as Not Ready and kubelet is not accessible from Prometheus.


Watchdog

Severity

None

Summary

Watchdog alert that is always firing.

Description

This alert ensures that the entire alerting pipeline is functional. This alert should always be firing in Alertmanager against a receiver. Some integrations with various notification mechanisms can send a notification when this alert is not firing. For example, the DeadMansSnitch integration in PagerDuty.

General node alerts

This section lists the general alerts for Kubernetes nodes.


FileDescriptorUsageCritical

Severity

Critical

Summary

Node uses 95% of file descriptors.

Description

The {{ $labels.node }} node uses 95% of file descriptors.


FileDescriptorUsageMajor

Severity

Major

Summary

Node uses 90% of file descriptors.

Description

The {{ $labels.node }} node uses 90% of file descriptors.


FileDescriptorUsageWarning

Severity

Warning

Summary

Node uses 80% of file descriptors.

Description

The {{ $labels.node }} node uses 80% of file descriptors.


SystemCpuFullWarning

Severity

Warning

Summary

High CPU consumption.

Description

The average CPU consumption on the {{ $labels.node }} node is {{ $value }}% for 2 minutes.


SystemLoadTooHighWarning

Severity

Warning

Summary

System load is more than 1 per CPU.

Description

The system load per CPU on the {{ $labels.node }} node is {{ $value }} for 5 minutes.


SystemLoadTooHighCritical

Severity

Critical

Summary

System load is more than 2 per CPU.

Description

The system load per CPU on the {{ $labels.node }} node is {{ $value }} for 5 minutes.


SystemDiskFullWarning

Severity

Warning

Summary

Disk partition {{ $labels.mountpoint }} is 85% full.

Description

The {{ $labels.device }} disk partition {{ $labels.mountpoint }} on the {{ $labels.node }} node is {{ $value }}% full for 2 minutes.


SystemDiskFullMajor

Severity

Major

Summary

Disk partition {{ $labels.mountpoint }} is 95% full.

Description

The {{ $labels.device }} disk partition {{ $labels.mountpoint }} on the {{ $labels.node }} node is {{ $value }}% full for 2 minutes.


SystemMemoryFullWarning

Severity

Warning

Summary

More than 90% of memory is used or less than 8 GB is available.

Description

The {{ $labels.node }} node consumes {{ $value }}% of memory for 2 minutes.


SystemMemoryFullMajor

Severity

Major

Summary

More than 95% of memory is used or less than 4 GB of memory is available.

Description

The {{ $labels.node }} node consumes {{ $value }}% of memory for 2 minutes.


SystemDiskInodesFullWarning

Severity

Warning

Summary

The {{ $labels.mountpoint }} volume uses 85% of inodes.

Description

The {{ $labels.device }} disk on the {{ $labels.node }} node consumes {{ $value }}% of disk inodes in the {{ $labels.mountpoint }} volume for 2 minutes.


SystemDiskInodesFullMajor

Severity

Major

Summary

The {{ $labels.mountpoint }} volume uses 95% of inodes.

Description

The {{ $labels.device }} disk on the {{ $labels.node }} node consumes {{ $value }}% of disk inodes in the {{ $labels.mountpoint }} volume for 2 minutes.


SystemDiskErrorsTooHigh

Severity

Warning

Summary

The {{ $labels.device }} disk is failing.

Description

The {{ $labels.device }} disk on the {{ $labels.node }} node is reporting errors for 5 minutes.

Ironic

This section describes the alerts for Ironic bare metal. The alerted events include Ironic API availability and Ironic processes availability.


IronicBmMetricsMissing

Severity

Major

Summary

Ironic metrics missing.

Description

Metrics retrieved from the Ironic API are not available for 2 minutes.


IronicBmApiOutage

Severity

Critical

Summary

Ironic API outage.

Description

The Ironic API is not accessible.

Kubernetes applications

This section lists the alerts for Kubernetes applications.


KubePodCrashLooping

Severity

Critical

Summary

The {{ $labels.pod }} Pod is in a crash loop status.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Pod container {{ $labels.container }} was restarted at least twice during the last 5 minutes.


KubePodNotReady

Severity

Critical

Summary

The {{ $labels.pod }} Pod is in the non-ready state.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Pod state is not Ready for longer than 15 minutes.


KubeDeploymentGenerationMismatch

Severity

Major

Summary

The {{ $labels.deployment }} Deployment generation does not match the metadata.

Description

The {{ $labels.namespace }}/{{ $labels.deployment }} Deployment generation does not match the metadata, indicating that the deployment failed but has not been rolled back.


KubeDeploymentReplicasMismatch

Severity

Major

Summary

The {{ $labels.deployment }} Deployment has a wrong number of replicas.

Description

The {{ $labels.namespace }}/{{ $labels.deployment }} Deployment does not match the expected number of replicas for longer than 10 minutes.


KubeStatefulSetReplicasMismatch

Severity

Major

Summary

The {{ $labels.statefulset }} StatefulSet has a wrong number of replicas.

Description

The {{ $labels.namespace }}/{{ $labels.statefulset }} StatefulSet does not match the expected number of replicas for longer than 10 minutes.


KubeStatefulSetGenerationMismatch

Severity

Critical

Summary

The {{ $labels.statefulset }} StatefulSet generation does not match the metadata.

Description

The {{ $labels.namespace }}/{{ $labels.statefulset }} StatefulSet generation does not match the metadata, indicating that the StatefulSet failed but has not been rolled back.


KubeStatefulSetUpdateNotRolledOut

Severity

Major

Summary

The {{ $labels.statefulset }} StatefulSet update has not been rolled out.

Description

The {{ $labels.namespace }}/{{ $labels.statefulset }} StatefulSet update has not been rolled out.


KubeDaemonSetRolloutStuck

Severity

Major

Summary

The {{ $labels.daemonset }} DaemonSet is not ready.

Description

{{ $value }} Pods of the {{ $labels.namespace }}/{{ $labels.daemonset }} DaemonSet are scheduled but not ready.


KubeDaemonSetNotScheduled

Severity

Warning

Summary

The {{ $labels.daemonset }} DaemonSet has not scheduled Pods.

Description

The {{ $labels.namespace }}/{{ $labels.daemonset }} DaemonSet has {{ $value }} not scheduled Pods.


KubeDaemonSetMisScheduled

Severity

Warning

Summary

The {{ $labels.daemonset }} DaemonSet has incorrectly scheduled Pods.

Description

The {{ $labels.namespace }}/{{ $labels.daemonset }} DaemonSet has {{ $value }} Pods running where they are not supposed to run.


KubeCronJobRunning

Severity

Warning

Summary

The {{ $labels.cronjob }} CronJob is not ready.

Description

The {{ $labels.namespace }}/{{ $labels.cronjob }} CronJob takes more than 15 minutes to complete.


KubeJobCompletion

Severity

Minor

Summary

The {{ $labels.job_name }} job is not completed.

Description

The {{ $labels.namespace }}/{{ $labels.job_name }} job takes more than 15 minutes to complete.


KubeJobFailed

Severity

Minor

Summary

The {{ $labels.job_name }} job failed.

Description

The {{ $labels.namespace }}/{{ $labels.job_name }} job failed to complete.

Kubernetes resources

This section lists the alerts for Kubernetes resources.


KubeCPUOvercommitPods

Severity

Warning

Summary

Kubernetes has overcommitted CPU requests.

Description

The Kubernetes cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.


KubeMemOvercommitPods

Severity

Warning

Summary

Kubernetes has overcommitted memory requests.

Description

The Kubernetes cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.


KubeCPUOvercommitNamespaces

Severity

Warning

Summary

Kubernetes has overcommitted CPU requests for namespaces.

Description

The Kubernetes cluster has overcommitted CPU resource requests for namespaces.


KubeMemOvercommitNamespaces

Severity

Warning

Summary

Kubernetes has overcommitted memory requests for namespaces.

Description

The Kubernetes cluster has overcommitted memory resource requests for namespaces.


KubeQuotaExceeded

Severity

Warning

Summary

The {{ $labels.namespace }} namespace consumes more than 90% of its {{ $labels.resource }} quota.

Description

The {{ $labels.namespace }} namespace consumes {{ printf "%0.0f" $value }}% of its {{ $labels.resource }} quota.


CPUThrottlingHigh

Severity

Warning

Summary

The {{ $labels.pod_name }} Pod has CPU throttling.

Description

The {{ $labels.container }} container in the {{ $labels.namespace }}/{{ $labels.pod }} Pod has {{ printf "%0.0f" $value }}% of CPU throttling.

Kubernetes storage

This section lists the alerts for Kubernetes storage.

Caution

Due to an upstream bug in Kubernetes, metrics for the KubePersistentVolumeUsageCritical and KubePersistentVolumeFullInFourDays alerts that are collected for persistent volumes provisioned by cinder-csi-plugin are not available.


KubePersistentVolumeUsageCritical

Severity

Critical

Summary

The {{ $labels.persistentvolumeclaim }} PersistentVolume has less than 3% of free space.

Description

The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in the {{ $labels.namespace }} namespace is only {{ printf "%0.2f" $value }}% free.


KubePersistentVolumeFullInFourDays

Severity

Warning

Summary

The {{ $labels.persistentvolumeclaim }} PersistentVolume is expected to fill up in 4 days.

Description

Based on the recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in the {{ $labels.namespace }} namespace is expected to fill up within four days. Currently, {{ printf "%0.2f" $value }}% of free space is available.


KubePersistentVolumeErrors

Severity

Critical

Summary

The status of the {{ $labels.persistentvolume }} PersistentVolume is {{ $labels.phase }}.

Description

The status of the {{ $labels.persistentvolume }} PersistentVolume is {{ $labels.phase }}.

Kubernetes system

This section lists the alerts for the Kubernetes system.


KubeNodeNotReady

Severity

Warning

Summary

The {{ $labels.node }} node is not ready.

Description

The Kubernetes {{ $labels.node }} node is not ready for more than one hour.


KubeVersionMismatch

Severity

Warning

Summary

Kubernetes components have mismatching versions.

Description

Kubernetes has components with {{ $value }} different semantic versions running.


KubeClientErrors

Severity

Warning

Summary

Kubernetes API client has more than 1% of error requests.

Description

The {{ $labels.job }}/{{ $labels.instance }} Kubernetes API server client has {{ printf "%0.0f" $value }}% errors.


KubeletTooManyPods

Severity

Warning

Summary

kubelet reached 90% of Pods limit.

Description

The {{ $labels.instance }}/{{ $labels.node }} kubelet runs {{ $value }} Pods, nearly 90% of possible allocation.


KubeAPIDown

Severity

Critical

Summary

Kubernetes API endpoint is down.

Description

The Kubernetes API endpoint {{ $labels.instance }} is not accessible for the last 3 minutes.


KubeAPIOutage

Severity

Critical

Summary

Kubernetes API is down.

Description

The Kubernetes API is not accessible for the last 30 seconds.


KubeAPILatencyHighWarning

Severity

Warning

Summary

The API server has a 99th percentile latency of more than 1 second.

Description

The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.


KubeAPILatencyHighMajor

Severity

Major

Summary

The API server has a 99th percentile latency of more than 4 seconds.

Description

The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.


KubeAPIErrorsHighMajor

Severity

Major

Summary

API server returns errors for more than 3% of requests.

Description

The API server returns errors for {{ $value }}% of requests.


KubeAPIErrorsHighWarning

Severity

Warning

Summary

API server returns errors for more than 1% of requests.

Description

The API server returns errors for {{ $value }}% of requests.


KubeAPIResourceErrorsHighMajor

Severity

Major

Summary

API server returns errors for more than 10% of requests.

Description

The API server returns errors for {{ $value }}% of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}.


KubeAPIResourceErrorsHighWarning

Severity

Warning

Summary

API server returns errors for more than 5% of requests.

Description

The API server returns errors for {{ $value }}% of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}.


KubeClientCertificateExpirationInSevenDays

Severity

Warning

Summary

A client certificate expires in 7 days.

Description

A client certificate used to authenticate to the API server expires in less than 7 days.


KubeClientCertificateExpirationInOneDay

Severity

Critical

Summary

A client certificate expires in 24 hours.

Description

A client certificate used to authenticate to the API server expires in less than 24 hours.


ContainerScrapeError

Severity

Warning

Summary

Failure to get Kubernetes container metrics.

Description

Prometheus was not able to scrape metrics from the container on the {{ $labels.node }} Kubernetes node.

Netchecker

This section lists the alerts for the Netchecker service.


NetCheckerAgentErrors

Severity

Warning

Summary

Netchecker has a high number of errors.

Description

The {{ $labels.agent }} Netchecker agent had {{ $value }} errors within the last hour.


NetCheckerReportsMissing

Severity

Warning

Summary

The number of agent reports is lower than expected.

Description

The {{ $labels.agent }} Netchecker agent has not reported anything for the last 5 minutes.


NetCheckerTCPServerDelay

Severity

Warning

Summary

The TCP connection to the Netchecker server takes too much time.

Description

The {{ $labels.agent }} Netchecker agent TCP connection time to the Netchecker server has increased by {{ $value }} within the last 5 minutes.


NetCheckerDNSSlow

Severity

Warning

Summary

The DNS lookup time is too high.

Description

The DNS lookup time on the {{ $labels.agent }} Netchecker agent has increased by {{ $value }} within the last 5 minutes.

NGINX

This section lists the alerts for the NGINX service.


NginxServiceDown

Severity

Critical

Summary

The NGINX service is down.

Description

The NGINX service on the {{ $labels.node }} node is down.


NginxDroppedIncomingConnections

Severity

Minor

Summary

NGINX drops incoming connections.

Description

The NGINX service on the {{ $labels.node }} node drops {{ $value }} accepted connections per second for 5 minutes.

Node network

This section lists the alerts for a Kubernetes node network.


SystemRxPacketsErrorTooHigh

Severity

Warning

Summary

The {{ $labels.node }} node has packet receive errors.

Description

The {{ $labels.device }} network interface has receive errors on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter Pod.


SystemTxPacketsErrorTooHigh

Severity

Warning

Summary

The {{ $labels.node }} node has packet transmit errors.

Description

The {{ $labels.device }} network interface has transmit errors on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter Pod.


SystemRxPacketsDroppedTooHigh

Severity

Warning

Summary

60 or more received packets were dropped.

Description

{{ $value | printf "%.2f" }} packets received by the {{ $labels.device }} interface on the {{ $labels.node }} node were dropped during the last minute.


SystemTxPacketsDroppedTooHigh

Severity

Warning

Summary

100 or more transmitted packets were dropped.

Description

{{ $value | printf "%.2f" }} packets transmitted by the {{ $labels.device }} interface on the {{ $labels.node }} node were dropped during the last minute.


NodeNetworkInterfaceFlapping

Severity

Warning

Summary

The {{ $labels.node }} node has a flapping interface.

Description

The {{ $labels.device }} network interface often changes its UP status on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter.

Node time

This section lists the alerts for a Kubernetes node time.


ClockSkewDetected

Severity

Warning

Summary

The NTP offset reached the limit of 0.03 seconds.

Description

Clock skew was detected on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter Pod. Verify that NTP is configured correctly on this host.

PostgreSQL

This section lists the alerts for the PostgreSQL and Patroni services.


PostgresqlDataPageCorruption

Severity

Major

Summary

Patroni cluster member is experiencing data page corruption.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Patroni Pod in the {{ $labels.cluster }} cluster fails to calculate the data page checksum due to a possible hardware fault.


PostgresqlDeadlocksDetected

Severity

Warning

Summary

PostgreSQL transaction deadlocks.

Description

The transactions submitted to the Patroni {{ $labels.cluster }} cluster in the {{ $labels.namespace }} Namespace are experiencing deadlocks.


PostgresqlInsufficientWorkingMemory

Severity

Warning

Summary

Insufficient memory for PostgreSQL queries.

Description

The query data does not fit into working memory on the {{ $labels.cluster }} Patroni cluster in the {{ $labels.namespace }} Namespace.


PostgresqlPatroniClusterSplitBrain

Severity

Critical

Summary

Patroni cluster split-brain detected.

Description

The {{ $labels.cluster }} Patroni cluster in the {{ $labels.namespace }} Namespace has multiple primaries, split-brain detected.


PostgresqlPatroniClusterUnlocked

Severity

Major

Summary

Patroni cluster primary node is missing.

Description

The primary node of the {{ $labels.cluster }} Patroni cluster in the {{ $labels.namespace }} Namespace is missing.


PostgresqlPrimaryDown

Severity

Critical

Summary

PostgreSQL is down on the cluster primary node.

Description

The {{ $labels.cluster }} Patroni cluster in the {{ $labels.namespace }} Namespace is down due to a missing primary node.


PostgresqlReplicaDown

Severity

Minor

Summary

Patroni cluster has replicas with inoperable PostgreSQL.

Description

The {{ $labels.cluster }} Patroni cluster in the {{ $labels.namespace }} Namespace has {{ $value }}% of replicas with inoperable PostgreSQL.


PostgresqlReplicationNonStreamingReplicas

Severity

Warning

Summary

Patroni cluster has non-streaming replicas.

Description

The {{ $labels.cluster }} Patroni cluster in the {{ $labels.namespace }} Namespace has replicas not streaming the segments from the primary node.


PostgresqlReplicationPaused

Severity

Major

Summary

Replication has stopped.

Description

Replication has stopped on the {{ $labels.namespace }}/{{ $labels.pod }} replica Pod in the {{ $labels.cluster }} cluster.


PostgresqlReplicationSlowWalApplication

Severity

Warning

Summary

WAL segment application is slow.

Description

Slow replication while applying WAL segments on the {{ $labels.namespace }}/{{ $labels.pod }} replica Pod in the {{ $labels.cluster }} cluster.


PostgresqlReplicationSlowWalDownload

Severity

Warning

Summary

Streaming replication is slow.

Description

Slow replication while downloading WAL segments for the {{ $labels.namespace }}/{{ $labels.pod }} replica Pod in the {{ $labels.cluster }} cluster.


PostgresqlReplicationWalArchiveWriteFailing

Severity

Major

Summary

Patroni cluster WAL segment writes are failing.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Patroni Pod in the {{ $labels.cluster }} cluster fails to write replication segments.

Prometheus

This section describes the alerts for the Prometheus service.


PrometheusConfigReloadFailed

Severity

Warning

Summary

Failure to reload the Prometheus configuration.

Description

Reloading of the Prometheus configuration has failed for the {{ $labels.namespace }}/{{ $labels.pod }} Pod.


PrometheusNotificationQueueRunningFull

Severity

Warning

Summary

Prometheus alert notification queue is running full.

Description

The Prometheus alert notification queue is running full for the {{ $labels.namespace }}/{{ $labels.pod }} Pod.


PrometheusErrorSendingAlertsWarning

Severity

Warning

Summary

Errors occur while sending alerts from Prometheus.

Description

Errors occur while sending alerts from the {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod to Alertmanager {{ $labels.Alertmanager }}.


PrometheusErrorSendingAlertsMajor

Severity

Major

Summary

Errors occur while sending alerts from Prometheus.

Description

Errors occur while sending alerts from the {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod to Alertmanager {{ $labels.Alertmanager }}.


PrometheusNotConnectedToAlertmanagers

Severity

Minor

Summary

Prometheus is not connected to Alertmanager.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod is not connected to any Alertmanager instance.


PrometheusTSDBReloadsFailing

Severity

Warning

Summary

Prometheus has issues reloading data blocks from disk.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod had {{ $value | humanize }} reload failures over the last 12 hours.


PrometheusTSDBCompactionsFailing

Severity

Warning

Summary

Prometheus has issues compacting sample blocks.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod had {{ $value | humanize }} compaction failures over the last 12 hours.


PrometheusTSDBWALCorruptions

Severity

Warning

Summary

Prometheus encountered WAL corruptions.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod has write-ahead log (WAL) corruptions in the time series database (TSDB) for the last 5 minutes.


PrometheusNotIngestingSamples

Severity

Warning

Summary

Prometheus does not ingest samples.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod does not ingest samples.


PrometheusTargetScrapesDuplicate

Severity

Warning

Summary

Prometheus has many rejected samples.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod has many rejected samples because of duplicate timestamps but different values.


PrometheusRuleEvaluationsFailed

Severity

Warning

Summary

Prometheus failed to evaluate recording rules.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod has failed evaluations for recording rules. Verify the rules state in the Status/Rules section of the Prometheus Web UI.

Salesforce notifier

This section lists the alerts for the Salesforce notifier service.


SfNotifierAuthFailure

Severity

Critical

Summary

Failure to authenticate to Salesforce.

Description

The sf-notifier service fails to authenticate to Salesforce for 1 minute.

SMART disks

This section describes the alerts for SMART disks.


SystemSMARTDiskUDMACrcErrorsTooHigh

Severity

Warning

Summary

The {{ $labels.device }} disk has UDMA CRC errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting SMART UDMA CRC errors for 5 minutes.


SystemSMARTDiskHealthStatus

Severity

Warning

Summary

The {{ $labels.device }} disk has bad health.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting a bad health status for 1 minute.


SystemSMARTDiskReadErrorRate

Severity

Warning

Summary

The {{ $labels.device }} disk has read errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting an increased read error rate for 5 minutes.


SystemSMARTDiskSeekErrorRate

Severity

Warning

Summary

The {{ $labels.device }} disk has seek errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting an increased seek error rate for 5 minutes.


SystemSMARTDiskTemperatureHigh

Severity

Warning

Summary

The {{ $labels.device }} disk temperature is high.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has a temperature of {{ $value }}C for 5 minutes.


SystemSMARTDiskReallocatedSectorsCount

Severity

Major

Summary

The {{ $labels.device }} disk has reallocated sectors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has reallocated {{ $value }} sectors.


SystemSMARTDiskCurrentPendingSectors

Severity

Major

Summary

The {{ $labels.device }} disk has current pending sectors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} current pending sectors.


SystemSMARTDiskReportedUncorrectableErrors

Severity

Major

Summary

The {{ $labels.device }} disk has reported uncorrectable errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} reported uncorrectable errors.


SystemSMARTDiskOfflineUncorrectableSectors

Severity

Major

Summary

The {{ $labels.device }} disk has offline uncorrectable sectors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} offline uncorrectable sectors.


SystemSMARTDiskEndToEndError

Severity

Major

Summary

The {{ $labels.device }} disk has end-to-end errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} end-to-end errors.

SSL certificates

This section lists the alerts for SSL certificates.


SSLCertExpirationWarning

Severity

Warning

Summary

SSL certificate expires in 30 days.

Description

The SSL certificate for {{ $labels.instance }} expires in 30 days.


SSLCertExpirationMajor

Severity

Major

Summary

SSL certificate expires in 10 days.

Description

The SSL certificate for {{ $labels.instance }} expires in 10 days.


SSLProbesFailing

Severity

Critical

Summary

SSL certificate probes are failing.

Description

The SSL certificate probes for the {{ $labels.instance }} service endpoint are failing.


MCCSSLCertExpirationMajor

Severity

Major

Summary

SSL certificate for a Container Cloud service expires in 10 days.

Description

The SSL certificate for the Container Cloud {{ $labels.service }} service endpoint {{ $labels.instance }} expires in 10 days.


MCCSSLCertExpirationWarning

Severity

Warning

Summary

SSL certificate for a Container Cloud service expires in 30 days.

Description

The SSL certificate for the Container Cloud {{ $labels.service }} service endpoint {{ $labels.instance }} expires in 30 days.


MCCSSLProbesFailing

Severity

Critical

Summary

SSL certificate probes for a Container Cloud service are failing.

Description

The SSL certificate probes for the Container Cloud {{ $labels.instance }} service endpoint are failing.

Telegraf

This section lists the alerts for the Telegraf service.


TelegrafGatherErrors

Severity

Major

Summary

Telegraf failed to gather metrics.

Description

Telegraf has gathering errors for the last 10 minutes.

Telemeter

This section describes the alerts for the Telemeter service.


TelemeterClientFederationFailed

Severity

Warning

Summary

Telemeter client failed to send data to the server.

Description

Telemeter client has failed to send data to the Telemeter server twice for the last 30 minutes. Verify the telemeter-client container logs.

Mirantis Kubernetes Engine

This section describes the alerts for the Mirantis Kubernetes Engine (MKE) cluster.


MKEAPIDown

Severity

Critical

Summary

MKE API endpoint is down.

Description

The MKE API endpoint {{ $labels.instance }} is not accessible for the last 3 minutes.


MKEAPIOutage

Severity

Critical

Summary

MKE API is down.

Description

The MKE API (port 443) is not accessible for the last minute.


MKEContainerUnhealthy

Severity

Major

Summary

MKE container is in the Unhealthy state.

Description

The {{ $labels.name }} MKE container is in the Unhealthy state.


MKENodeDiskFullCritical

Severity

Critical

Summary

MKE node disk is 95% full.

Description

The {{ $labels.instance }} MKE node disk is 95% full.


MKENodeDiskFullWarning

Severity

Warning

Summary

MKE node disk is 85% full.

Description

The {{ $labels.instance }} MKE node disk is 85% full.


MKENodeDown

Severity

Critical

Summary

MKE node is down.

Description

The {{ $labels.instance }} MKE node is down.

Configure StackLight

This section describes the initial steps required for StackLight configuration. For a detailed description of StackLight configuration options, see StackLight configuration parameters.

  1. Download your management cluster kubeconfig:

    1. Log in to the Mirantis Container Cloud web UI with the writer permissions.

    2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

    3. Expand the menu of the tab with your user name.

    4. Click Download kubeconfig to download kubeconfig of your management cluster.

    5. Log in to any local machine with kubectl installed.

    6. Copy the downloaded kubeconfig to this machine.

  2. Run one of the following commands:

    • For a management cluster:

      kubectl --kubeconfig <KUBECONFIG_PATH> edit -n <PROJECT_NAME> cluster <MANAGEMENT_CLUSTER_NAME>
      
    • For a managed cluster:

      kubectl --kubeconfig <KUBECONFIG_PATH> edit -n <PROJECT_NAME> cluster <MANAGED_CLUSTER_NAME>
      
  3. In the following section of the opened manifest, configure the required StackLight parameters as described in StackLight configuration parameters. A complete example of this manifest fragment is provided after this procedure.

    spec:
      providerSpec:
        value:
          helmReleases:
          - name: stacklight
            values:
    
  4. Verify StackLight after configuration.
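
For example, after step 3, the edited fragment of the cluster object may look as follows. This is a minimal sketch only: the clusterSize and logging values are illustrative, and any key described in StackLight configuration parameters can be set under values in the same way.

spec:
  providerSpec:
    value:
      helmReleases:
      - name: stacklight
        values:
          # Illustrative settings; replace with the parameters you need
          clusterSize: medium
          logging:
            enabled: true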

StackLight configuration parameters

This section describes the StackLight configuration keys that you can specify in the values section to change StackLight settings as required. Prior to making any changes to StackLight configuration, perform the steps described in Configure StackLight. After changing StackLight configuration, verify the changes as described in Verify StackLight after configuration.


Alerta

Key

Description

Example values

alerta.enabled (bool)

Enables or disables Alerta. Set to true by default.

true or false


Elasticsearch

Key

Description

Example values

elasticsearch.logstashRetentionTime (int)

Defines the Elasticsearch logstash-* index retention time in days. The logstash-* index stores all logs gathered from all nodes and containers. Set to 1 by default.

1, 5, 15
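
For example, a minimal sketch of setting this key under the StackLight values section, assuming the dotted key name nests as YAML maps; the 5-day retention value is illustrative only:

elasticsearch:
  logstashRetentionTime: 5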


Grafana

Key

Description

Example values

grafana.renderer.enabled (bool)

Disables Grafana Image Renderer. For example, for resource-limited environments. Enabled by default.

true or false

grafana.homeDashboard (string)

Defines the home dashboard. Set to kubernetes-cluster by default. You can define any of the available dashboards.

kubernetes-cluster


Logging

Key

Description

Example values

logging.enabled (bool)

Enables or disables the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture. Set to true by default.

true or false

logging.level (string) Available since 2.6.0

Sets the least important level of log messages to send to Elasticsearch. Requires logging.enabled set to true.

The default logging level is INFO, meaning that StackLight will drop log messages for the lower DEBUG and TRACE levels. Levels from WARNING to EMERGENCY require attention.

Note

The FLUENTD_ERROR logs are of a special type and cannot be dropped.

  • TRACE - The most verbose logs. Such level generates large amounts of data.

  • DEBUG - Messages that contain information typically of use only for debugging purposes.

  • INFO - Standard informational messages describing common processes such as service starting or stopping. Can be ignored during normal system operation but may provide additional input for investigation.

  • NOTICE - Normal but significant conditions that may require special handling.

  • WARNING - Messages on unexpected conditions that may require attention.

  • ERROR - Messages on error conditions that prevent normal system operation and require action.

  • CRITICAL - Messages on critical conditions indicating that a service is not working or working incorrectly.

  • ALERT - Messages on severe events indicating that action is needed immediately.

  • EMERGENCY - Messages indicating that a service is unusable.

logging.cerebro (bool)

Enables or disables Cerebro, a web UI for interacting with the Elasticsearch cluster that stores logs. To access the Cerebro web UI, see Access Elasticsearch clusters using Cerebro.

Note

Prior to enabling Cerebro, verify that your Container Cloud cluster has a minimum of 0.5-1 GB of free RAM and 1 vCPU available.

true or false
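
For example, the logging keys above may be combined under the StackLight values section as follows; a minimal sketch with illustrative values:

logging:
  enabled: true
  level: INFO
  cerebro: false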


Logging to syslog

Available since 2.6.0

Key

Description

Example values

logging.syslog.enabled (bool)

Enables or disables remote logging to syslog. Disabled by default. Requires logging.enabled set to true. For details and configuration example, see Enable remote logging to syslog.

true or false

logging.syslog.host (string)

Specifies the remote syslog host.

remote-syslog.svc

logging.syslog.port (string)

Specifies the remote syslog port.

514

logging.syslog.protocol (string)

Specifies the remote syslog protocol. Set to udp by default.

tcp or udp

logging.syslog.tls.enabled (bool)

Optional. Enables or disables TLS. Disabled by default. Use TLS only with the TCP protocol; TLS will not be enabled if you set a protocol other than TCP.

true or false

logging.syslog.tls.verify (bool)

Enables or disables TLS verification.

true or false

logging.syslog.tls.certificate (string)

Defines how to pass the certificate. secret takes precedence over hostPath.

  • secret - specifies the name of the secret holding the certificate.

  • hostPath - specifies an absolute host path to the PEM certificate.

certificate:
  secret: ""
  hostPath: "/etc/ssl/certs/ca-bundle.pem"

High availability

Key

Description

Example values

highAvailabilityEnabled (bool)

Enables or disables StackLight multiserver mode. For details, see StackLight database modes in Reference Architecture: StackLight deployment architecture. Set to false by default.

true or false


Prometheus

Key

Description

Example values

prometheusServer.retentionTime (string)

Defines the Prometheus database retention period. Passed to the --storage.tsdb.retention.time flag. Set to 15d by default.

15d, 1000h, 10d12h

prometheusServer.retentionSize (string)

Defines the Prometheus database retention size. Passed to the --storage.tsdb.retention.size flag. Set to 15GB by default.

15GB, 512MB

prometheusServer.alertResendDelay (string)

Defines the minimum amount of time for Prometheus to wait before resending an alert to Alertmanager. Passed to the --rules.alert.resend-delay flag. Set to 2m by default.

2m, 90s
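
For example, a minimal sketch of the Prometheus server keys under the StackLight values section, using the documented defaults as illustrative values:

prometheusServer:
  retentionTime: "15d"
  retentionSize: "15GB"
  alertResendDelay: "2m"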


Cluster size

Key

Description

Example values

clusterSize (string)

Specifies the approximate expected cluster size. Set to small by default. Other possible values include medium and large. Depending on the choice, appropriate resource limits are passed according to the resourcesPerClusterSize parameter. The values differ by the Elasticsearch and Prometheus resource limits:

  • small (default) - 2 CPU, 6 Gi RAM for Elasticsearch, 1 CPU, 8 Gi RAM for Prometheus. Use small only for testing and evaluation purposes with no workloads expected.

  • medium - 4 CPU, 16 Gi RAM for Elasticsearch, 3 CPU, 16 Gi RAM for Prometheus.

  • large - 8 CPU, 32 Gi RAM for Elasticsearch, 6 CPU, 32 Gi RAM for Prometheus. Set to large only if Elasticsearch and Prometheus lack resources.

small, medium, or large


Resource limits

Key

Description

Example values

resourcesPerClusterSize (map)

Provides the capability to override the default resource requests or limits for any StackLight component for the predefined cluster sizes. For a list of StackLight components, see Components versions in Release Notes: Cluster releases.

resourcesPerClusterSize:
  elasticsearch:
    small:
      limits:
        cpu: "1000m"
        memory: "4Gi"
    medium:
      limits:
        cpu: "2000m"
        memory: "8Gi"
      requests:
        cpu: "1000m"
        memory: "4Gi"
    large:
      limits:
        cpu: "4000m"
        memory: "16Gi"

resources (map)

Provides the capability to override the containers resource requests or limits for any StackLight component. For a list of StackLight components, see Components versions in Release Notes: Cluster releases.

resources:
  alerta:
    requests:
      cpu: "50m"
      memory: "200Mi"
    limits:
      memory: "500Mi"

Using the example above, each pod in the alerta service will be requesting 50 millicores of CPU and 200 MiB of memory, while being hard-limited to 500 MiB of memory usage. Each configuration key is optional.

Note

The logging mechanism performance depends on the cluster log load. If the cluster components send an excessive amount of logs, the default resource requests and limits for fluentdElasticsearch may be insufficient, which may cause its pods to be OOMKilled and trigger the KubePodCrashLooping alert. In such case, increase the default resource requests and limits for fluentdElasticsearch. For example:

resources:
  fluentdElasticsearch:
    requests:
      memory: "500Mi"
    limits:
      memory: "1500Mi"

Kubernetes tolerations

Key

Description

Example values

tolerations.default (slice)

Kubernetes tolerations to add to all StackLight components.

default:
- key: "com.docker.ucp.manager"
  operator: "Exists"
  effect: "NoSchedule"

tolerations.component (map)

Defines Kubernetes tolerations (overrides the default ones) for any StackLight component.

component:
  elasticsearch:
  - key: "com.docker.ucp.manager"
    operator: "Exists"
    effect: "NoSchedule"
  postgresql:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"

Storage class

Key

Description

Example values

storage.defaultStorageClass (string)

Defines the StorageClass to use for all StackLight Persistent Volume Claims (PVCs) if a component StorageClass is not defined using the componentStorageClasses. To use the cluster default storage class, leave the string empty.

lvp, standard

storage.componentStorageClasses (map)

Defines (overrides the defaultStorageClass value) the storage class for any StackLight component separately. To use the cluster default storage class, leave the string empty.

componentStorageClasses:
  elasticsearch: ""
  fluentd: ""
  postgresql: ""
  prometheusAlertManager: ""
  prometheusPushGateway: ""
  prometheusServer: ""

NodeSelector

Key

Description

Example values

nodeSelector.default (map)

Defines the NodeSelector to use for most StackLight pods (except some pods that belong to DaemonSets) if the NodeSelector of a component is not defined.

default:
  role: stacklight

nodeSelector.component (map)

Defines the NodeSelector to use for particular StackLight component pods. Overrides nodeSelector.default.

component:
  alerta:
    role: stacklight
    component: alerta
  kibana:
    role: stacklight
    component: kibana

Salesforce reporter

On managed clusters with limited Internet access, a proxy is required for StackLight components that use HTTP and HTTPS, are disabled by default, and need external access when enabled. The Salesforce reporter depends on Internet access through HTTPS.

Key

Description

Example values

clusterId (string)

Unique cluster identifier clusterId="<Cluster Project>/<Cluster Name>/<UID>", generated for each cluster using Cluster Project, Cluster Name, and cluster UID, separated by a slash. Used for both sf-reporter and sf-notifier services.

The clusterId key is automatically defined for each cluster. Do not set or modify it manually.

Do not modify clusterId.

sfReporter.enabled (bool)

Enables or disables reporting of Prometheus metrics to Salesforce. For details, see StackLight deployment architecture. Disabled by default.

true or false

sfReporter.salesForceAuth (map)

Salesforce parameters and credentials for the metrics reporting integration.

Note

Modify this parameter if sf-notifier is not configured or if you want to use a different Salesforce user account to send reports to.

salesForceAuth:
  url: "<SF instance URL>"
  username: "<SF account email address>"
  password: "<SF password>"
  environment_id: "<Cloud identifier>"
  organization_id: "<Organization identifier>"
  sandbox_enabled: "<Set to true or false>"

sfReporter.cronjob (map)

Defines the Kubernetes cron job for sending metrics to Salesforce. By default, reports are sent at midnight server time.

cronjob:
  schedule: "0 0 * * *"
  concurrencyPolicy: "Allow"
  failedJobsHistoryLimit: ""
  successfulJobsHistoryLimit: ""
  startingDeadlineSeconds: 200

Ceph monitoring

Key

Description

Example values

ceph.enabled (bool)

Enables or disables Ceph monitoring. Set to false by default.

true or false


External endpoint monitoring

Key

Description

Example values

externalEndpointMonitoring.enabled (bool)

Enables or disables HTTP endpoints monitoring. If enabled, the monitoring tool performs the probes against the defined endpoints every 15 seconds. Set to false by default.

true or false

externalEndpointMonitoring.certificatesHostPath (string)

Defines the directory path with external endpoints certificates on host.

/etc/ssl/certs/

externalEndpointMonitoring.domains (slice)

Defines the list of HTTP endpoints to monitor. The endpoints must successfully respond to a liveness probe. For success, a request to a specific endpoint must result in a 2xx HTTP response code.

domains:
- https://prometheus.io/health
- http://example.com:8080/status
- http://example.net:8080/pulse
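
For example, a minimal sketch of an external endpoint monitoring configuration under the StackLight values section; the domains are illustrative:

externalEndpointMonitoring:
  enabled: true
  certificatesHostPath: /etc/ssl/certs/
  domains:
  - https://prometheus.io/health
  - http://example.com:8080/status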

Ironic monitoring

Key

Description

Example values

ironic.endpoint (string)

Enables or disables monitoring of bare metal Ironic. To enable, specify the Ironic API URL.

http://ironic-api-http.kaas.svc:6385/v1

ironic.insecure (bool)

Defines whether to skip the chain and host verification. Set to false by default.

true or false
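
For example, a minimal sketch of enabling Ironic monitoring under the StackLight values section:

ironic:
  endpoint: http://ironic-api-http.kaas.svc:6385/v1
  insecure: false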


SSL certificates monitoring

Key

Description

Example values

sslCertificateMonitoring.enabled (bool)

Enables or disables StackLight to monitor and alert on the expiration date of the TLS certificate of an HTTPS endpoint. If enabled, the monitoring tool performs the probes against the defined endpoints every hour. Set to false by default.

true or false

sslCertificateMonitoring.domains (slice)

Defines the list of HTTPS endpoints to monitor the certificates from.

domains:
- https://prometheus.io
- https://example.com:8080

Workload monitoring

Key

Description

Example values

metricFilter (map)

On the clusters that run large-scale workloads, workload monitoring generates a large amount of resource-consuming metrics. To prevent generation of excessive metrics, you can disable workload monitoring in the StackLight metrics and monitor only the infrastructure.

The metricFilter parameter enables the cAdvisor (Container Advisor) and kubeStateMetrics metric ingestion filters for Prometheus. Set to false by default. If set to true, you can define the namespaces to which the filter will apply.

metricFilter:
  enabled: true
  action: keep
  namespaces:
  - kaas
  - kube-system
  - stacklight
  • enabled - enable or disable metricFilter using true or false

  • action - action to take by Prometheus:

    • keep - keep only metrics from namespaces that are defined in the namespaces list

    • drop - ignore metrics from namespaces that are defined in the namespaces list

  • namespaces - list of namespaces to keep or drop metrics from regardless of the boolean value for every namespace


Mirantis Kubernetes Engine monitoring

Key

Description

Example values

mke.enabled (bool)

Enables or disables Mirantis Kubernetes Engine (MKE) monitoring. Set to true by default.

true or false

mke.dockerdDataRoot (string)

Defines the dockerd data root directory of persistent Docker state. For details, see Docker documentation: Daemon CLI (dockerd).

/var/lib/docker
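
For example, a minimal sketch of the MKE monitoring keys under the StackLight values section:

mke:
  enabled: true
  dockerdDataRoot: /var/lib/docker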


Alerts configuration

Key

Description

Example values

prometheusServer.customAlerts (slice)

Defines custom alerts. Also, modifies or disables existing alert configurations. For the list of predefined alerts, see Available StackLight alerts. While adding or modifying alerts, follow the Alerting rules.

customAlerts:
# To add a new alert:
- alert: ExampleAlert
  annotations:
    description: Alert description
    summary: Alert summary
  expr: example_metric > 0
  for: 5m
  labels:
    severity: warning
# To modify an existing alert expression:
- alert: AlertmanagerFailedReload
  expr: alertmanager_config_last_reload_successful == 5
# To disable an existing alert:
- alert: TargetDown
  enabled: false

An optional enabled field is accepted in the alert body to disable an existing alert by setting it to false. All fields specified using the customAlerts definition override the default predefined definitions in the charts’ values.


Watchdog alert

Key

Description

Example values

prometheusServer.watchDogAlertEnabled (bool)

Enables or disables the Watchdog alert that constantly fires as long as the entire alerting pipeline is functional. You can use this alert to verify that Alertmanager notifications properly flow to the Alertmanager receivers. Set to true by default.

true or false


Alertmanager integrations

On managed clusters with limited Internet access, a proxy is required for StackLight components that use HTTP and HTTPS, are disabled by default, and need external access when enabled, for example, for the Salesforce integration and external Alertmanager notification rules.

Key

Description

Example values

alertmanagerSimpleConfig.genericReceivers (slice)

Provides a generic template for notifications receiver configurations. For a list of supported receivers, see Prometheus Alertmanager documentation: Receiver.

For example, to enable notifications to OpsGenie:

alertmanagerSimpleConfig:
  genericReceivers:
  - name: HTTP-opsgenie
    enabled: true # optional
    opsgenie_configs:
    - api_url: "https://example.app.eu.opsgenie.com/"
      api_key: "secret-key"
      send_resolved: true

alertmanagerSimpleConfig.genericRoutes (slice)

Provides a template for notifications route configuration. For details, see Prometheus Alertmanager documentation: Route.

genericRoutes:
- receiver: HTTP-opsgenie
  enabled: true # optional
  match_re:
    severity: major|critical
  continue: true

alertmanagerSimpleConfig.inhibitRules.enabled (bool)

Disables or enables alert inhibition rules. If enabled, Alertmanager decreases alert noise by suppressing dependent alerts notifications to provide a clearer view on the cloud status and simplify troubleshooting. Enabled by default. For details, see Alert dependencies. For details on inhibition rules, see Prometheus documentation.

true or false


Notifications to email

Key

Description

Example values

alertmanagerSimpleConfig.email.enabled (bool)

Enables or disables Alertmanager integration with email. Set to false by default.

true or false

alertmanagerSimpleConfig.email (map)

Defines the notification parameters for Alertmanager integration with email. For details, see Prometheus Alertmanager documentation: Email configuration.

email:
  enabled: false
  send_resolved: true
  to: "to@test.com"
  from: "from@test.com"
  smarthost: smtp.gmail.com:587
  auth_username: "from@test.com"
  auth_password: password
  auth_identity: "from@test.com"
  require_tls: true

alertmanagerSimpleConfig.email.route (map)

Defines the route for Alertmanager integration with email. For details, see Prometheus Alertmanager documentation: Route.

route:
  match: {}
  match_re: {}
  routes: []

Notifications to Salesforce

On managed clusters with limited Internet access, a proxy is required for StackLight components that use HTTP and HTTPS, are disabled by default, and need external access when enabled. The Salesforce integration depends on Internet access through HTTPS.

Key

Description

Example values

clusterId (string)

Unique cluster identifier clusterId="<Cluster Project>/<Cluster Name>/<UID>", generated for each cluster using Cluster Project, Cluster Name, and cluster UID, separated by a slash. Used for both sf-notifier and sf-reporter services.

The clusterId is automatically defined for each cluster. Do not set or modify it manually.

Do not modify clusterId.

alertmanagerSimpleConfig.salesForce.enabled (bool)

Enables or disables Alertmanager integration with Salesforce using the sf-notifier service. Disabled by default.

true or false

alertmanagerSimpleConfig.salesForce.auth (map)

Defines the Salesforce parameters and credentials for integration with Alertmanager.

auth:
  url: "<SF instance URL>"
  username: "<SF account email address>"
  password: "<SF password>"
  environment_id: "<Cloud identifier>"
  organization_id: "<Organization identifier>"
  sandbox_enabled: "<Set to true or false>"

alertmanagerSimpleConfig.salesForce.route (map)

Defines the notifications route for Alertmanager integration with Salesforce. For details, see Prometheus Alertmanager documentation: Route.

route:
  match: {}
  match_re: {}
  routes: []

Notifications to Slack

On managed clusters with limited Internet access, a proxy is required for StackLight components that use HTTP and HTTPS, are disabled by default, and need external access when enabled. The Slack integration depends on Internet access through HTTPS.

Key

Description

Example values

alertmanagerSimpleConfig.slack.enabled (bool)

Enables or disables Alertmanager integration with Slack. For details, see Prometheus Alertmanager documentation: Slack configuration. Set to false by default.

true or false

alertmanagerSimpleConfig.slack.api_url (string)

Defines the Slack webhook URL.

http://localhost:8888

alertmanagerSimpleConfig.slack.channel (string)

Defines the Slack channel or user to send notifications to.

monitoring

alertmanagerSimpleConfig.slack.route (map)

Defines the notifications route for Alertmanager integration with Slack. For details, see Prometheus Alertmanager documentation: Route.

route:
  match: {}
  match_re: {}
  routes: []
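
For example, a minimal sketch of enabling Slack notifications under the StackLight values section. The severity filter in the route is illustrative only; by default, the route match maps are empty.

alertmanagerSimpleConfig:
  slack:
    enabled: true
    api_url: http://localhost:8888
    channel: monitoring
    route:
      match_re:
        severity: major|critical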

Verify StackLight after configuration

This section describes how to verify StackLight after configuring its parameters as described in Configure StackLight and StackLight configuration parameters. Perform the verification procedure described for a particular modified StackLight key.

To verify StackLight after configuration:

Key

Verification procedure

alerta.enabled

Verify that Alerta is present in the list of StackLight resources. An empty output indicates that Alerta is disabled.

kubectl get all -n stacklight -l app=alerta

elasticsearch.logstashRetentionTime

Verify that the unit_count parameter contains the desired number of days:

kubectl get cm elasticsearch-curator-config -n \
stacklight -o=jsonpath='{.data.action_file\.yml}'

grafana.renderer.enabled

Verify the Grafana Image Renderer. If set to true, the output should include HTTP Server started, listening at http://localhost:8081.

kubectl logs -f -n stacklight -l app=grafana \
--container grafana-renderer

grafana.homeDashboard

In the Grafana web UI, verify that the desired dashboard is set as a home dashboard.

logging.enabled

Verify that Elasticsearch, Fluentd, and Kibana are present in the list of StackLight resources. An empty output indicates that the StackLight logging stack is disabled.

kubectl get all -n stacklight -l 'app in
(elasticsearch-master,kibana,fluentd-elasticsearch)'

logging.syslog.enabled

  1. Verify the fluentd-elasticsearch Kubernetes configmap in the stacklight namespace:

    kubectl get cm -n stacklight fluentd-elasticsearch -o \
    "jsonpath={.data['output-logs\.conf']}"
    

    The output must contain an additional section with the remote syslog configuration.

  2. After restart of the fluentd-elasticsearch pods, verify that their logs do not contain any delivery error messages.

  3. Verify that the log messages are appearing in the remote syslog database.

logging.level Available since 2.6.0

  1. Run the following command to inspect the fluentd-elasticsearch Kubernetes configmap in the stacklight namespace:

    kubectl get cm -n stacklight fluentd-elasticsearch \
    -o "jsonpath={.data['output-logs\.conf']}"
    
  2. Grep the output for the following configuration block. The pattern should contain all logging levels below the expected one.

    @type grep
    <exclude>
     key severity_label
     pattern /^<pattern>$/
    </exclude>
    

highAvailabilityEnabled

Verify the number of service replicas for the HA or non-HA StackLight mode. For details, see StackLight deployment architecture.

kubectl get sts -n stacklight
  • prometheusServer.retentionTime

  • prometheusServer.retentionSize

  • prometheusServer.alertResendDelay

  1. In the Prometheus web UI, navigate to Status > Command-Line Flags.

  2. Verify the values for the following flags:

    • storage.tsdb.retention.time

    • storage.tsdb.retention.size

    • rules.alert.resend-delay

  • clusterSize

  • resourcesPerClusterSize

  • resources

  1. Obtain the list of pods:

    kubectl get po -n stacklight
    
  2. Verify that the desired resource limits or requests are set in the resources section of every container in the pod:

    kubectl get po <pod_name> -n stacklight -o yaml
    
  • nodeSelector.default

  • nodeSelector.component

  • tolerations.default

  • tolerations.component

Verify that the appropriate components pods are located on the intended nodes:

kubectl get pod -o=custom-columns=NAME:.metadata.name,\
STATUS:.status.phase,NODE:.spec.nodeName -n stacklight

  • storage.defaultStorageClass

  • storage.componentStorageClasses

Verify that the appropriate components PVCs have been created according to the configured StorageClass:

kubectl get pvc -n stacklight

  • sfReporter.enabled

  • sfReporter.salesForce

  • sfReporter.cronjob

  1. Verify that Salesforce reporter is enabled. The SUSPEND field in the output must be False.

    kubectl get cronjob -n stacklight
    
  2. Verify that the Salesforce reporter configuration includes all expected queries:

    kubectl get configmap -n stacklight \
    sf-reporter-config -o yaml
    
  3. After cron job execution (by default, at midnight server time), obtain the Salesforce reporter pod name. The output should include the Salesforce reporter pod name and STATUS must be Completed.

    kubectl get pods -n stacklight
    
  4. Verify that Salesforce reporter successfully authenticates to Salesforce and creates records. The output must include the Salesforce authentication successful line as well as the Created record or Duplicate record and Updated record lines.

    kubectl logs -n stacklight <sf-reporter-pod-name>
    

ceph.enabled

  1. In the Grafana web UI, verify that Ceph dashboards are present in the list of dashboards and are populated with data.

  2. In the Prometheus web UI, click Alerts and verify that the list of alerts contains Ceph* alerts.

  • externalEndpointMonitoring.enabled

  • externalEndpointMonitoring.domains

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the blackbox-external-endpoint target contains the configured domains (URLs).

  • ironic.endpoint

  • ironic.insecure

In the Grafana web UI, verify that the Ironic BM dashboard displays valid data (no false-positive or empty panels).

metricFilter

  1. In the Prometheus web UI, navigate to Status > Configuration.

  2. Verify that the following fields in the metric_relabel_configs section for the kubernetes-nodes-cadvisor and prometheus-kube-state-metrics scrape jobs have the required configuration (see the example after this list):

    • action is set to keep or drop

    • regex contains a regular expression with configured namespaces delimited by |

    • source_labels is set to [namespace]
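
    For example, the section may look similar to the following snippet; the namespace list below is illustrative:

    metric_relabel_configs:
    - action: keep
      regex: kaas|kube-system|stacklight
      source_labels:
      - namespace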

  • sslCertificateMonitoring.enabled

  • sslCertificateMonitoring.domains

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the blackbox target contains the configured domains (URLs).

mke.enabled

  1. In the Grafana web UI, verify that the MKE Cluster and MKE Containers dashboards are present and not empty.

  2. In the Prometheus web UI, navigate to Alerts and verify that the MKE* alerts are present in the list of alerts.

mke.dockerdDataRoot

In the Prometheus web UI, navigate to Alerts and verify that the MKEAPIDown alert is not firing false-positively due to a missing certificate.

prometheusServer.customAlerts

In the Prometheus web UI, navigate to Alerts and verify that the list of alerts has changed according to your customization.

prometheusServer.watchDogAlertEnabled

In the Prometheus web UI, navigate to Alerts and verify that the list of alerts contains the Watchdog alert.

alertmanagerSimpleConfig.genericReceivers

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the intended receiver(s).

alertmanagerSimpleConfig.genericRoutes

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the intended route(s).

alertmanagerSimpleConfig.inhibitRules.enabled

Run the following command. An empty output indicates either a failure or that the feature is disabled.

kubectl get cm -n stacklight prometheus-alertmanager -o \
yaml | grep -A 6 inhibit_rules

  • alertmanagerSimpleConfig.email.enabled

  • alertmanagerSimpleConfig.email

  • alertmanagerSimpleConfig.email.route

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the Email receiver and route.

  • alertmanagerSimpleConfig.salesForce.enabled

  • alertmanagerSimpleConfig.salesForce.auth

  • alertmanagerSimpleConfig.salesForce.route

  1. Verify that sf-notifier is enabled. The output must include the sf-notifier pod name, 1/1 in the READY field and Running in the STATUS field.

    kubectl get pods -n stacklight
    
  2. Verify that sf-notifier successfully authenticates to Salesforce. The output must include the Salesforce authentication successful line.

    kubectl logs -f -n stacklight <sf-notifier-pod-name>
    
  3. In the Alertmanager web UI, navigate to Status and verify that the Config section contains the HTTP-salesforce receiver and route.

  • alertmanagerSimpleConfig.slack.enabled

  • alertmanagerSimpleConfig.slack.api_url

  • alertmanagerSimpleConfig.slack.channel

  • alertmanagerSimpleConfig.slack.route

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the HTTP-slack receiver and route.

Access Elasticsearch clusters using Cerebro

Note

Prior to enabling Cerebro, verify that your Container Cloud cluster has a minimum of 0.5-1 GB of free RAM and 1 vCPU available.

Cerebro is a web UI for Elasticsearch that allows you to visually inspect and interact with an Elasticsearch cluster, which is useful for evaluating its health and for debugging. Cerebro is disabled by default. Mirantis recommends enabling Cerebro only when needed, for example, if your Elasticsearch cluster encounters an issue, and disabling it afterward.

To enable or disable Cerebro, set the logging.cerebro parameter to true or false as described in Configure StackLight and StackLight configuration parameters.
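
For example, assuming that the StackLight logging stack is enabled and the parameter keeps the logging.cerebro path described above, a minimal sketch of the corresponding values:

logging:
  enabled: true
  cerebro: true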

To access the Cerebro web UI:

  1. Obtain the host IP address:

    1. Log in to the Container Cloud web UI with writer or operator permissions and switch to the required project.

    2. From the Clusters page, click the required cluster.

    3. Click the action icon in the last column of any machine of the manager type.

    4. Click Machine info and copy the Host IP.

  2. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

  3. Obtain the cluster network CIDR:

    kubectl get cluster kaas-mgmt -o jsonpath='{.spec.clusterNetwork.services.cidrBlocks}'
    
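
    Example of a system response (the CIDR value depends on your cluster configuration):

    ["10.96.0.0/16"]
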
  4. Create an SSH tunnel to the host, for example, using sshuttle:

    Note

    This step requires SSH access to Container Cloud hosts.

    sshuttle -r ubuntu@<HOST_IP> <CIDR>
    
  5. Obtain the Cerebro IP address:

    kubectl get svc -n stacklight cerebro -o jsonpath='{.spec.clusterIP}'
    
  6. Paste the Cerebro IP address in a web browser.

Enable generic metric scraping

StackLight can scrape metrics from any service that exposes Prometheus metrics and is running on the Kubernetes cluster. Such metrics appear in Prometheus under the {job="stacklight-generic",service="<service_name>",namespace="<service_namespace>"} set of labels. If the Kubernetes service is backed by Kubernetes pods, the set of labels also includes {pod="<pod_name>"}.

To enable the functionality, define at least one of the following annotations in the service metadata:

  • "generic.stacklight.mirantis.com/scrape-port" - the HTTP endpoint port. By default, the port number is discovered through Kubernetes service discovery, usually __meta_kubernetes_pod_container_port_number. If no port is discovered, the default port for the chosen scheme is used.

  • "generic.stacklight.mirantis.com/scrape-path" - the HTTP endpoint path, related to the Prometheus scrape_config.metrics_path option. By default, /metrics.

  • "generic.stacklight.mirantis.com/scrape-scheme" - the HTTP endpoint scheme between HTTP and HTTPS, related to the Prometheus scrape_config.scheme option. By default, http.

For example:

metadata:
  annotations:
    "generic.stacklight.mirantis.com/scrape-path": "/metrics"

metadata:
  annotations:
    "generic.stacklight.mirantis.com/scrape-port": "8080"

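A complete Service definition with the scrape annotations may look as follows; the service name, namespace, selector, and port below are placeholders:

apiVersion: v1
kind: Service
metadata:
  name: my-exporter
  namespace: my-namespace
  annotations:
    "generic.stacklight.mirantis.com/scrape-port": "9100"
    "generic.stacklight.mirantis.com/scrape-path": "/metrics"
    "generic.stacklight.mirantis.com/scrape-scheme": "http"
spec:
  selector:
    app: my-exporter
  ports:
  - name: metrics
    port: 9100
    targetPort: 9100
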
Enable remote logging to syslog

Caution

This feature is available starting from the Container Cloud release 2.6.0.

By default, StackLight sends logs to Elasticsearch. However, you can configure StackLight to forward all logs to an external syslog server. In this case, StackLight will send logs both to the syslog server and to Elasticsearch. Prior to enabling the functionality, consider the following requirements:

  • StackLight logging must be enabled

  • A remote syslog server must be deployed outside Container Cloud

  • Container Cloud proxy must not be enabled since it supports only HTTP(S) traffic

To enable sending of logs to syslog:

  1. Perform the steps 1-2 described in Configure StackLight.

  2. In the stacklight.values section of the opened manifest, configure the logging.syslog parameters as described in StackLight configuration parameters.

    For example:

    logging:
      enabled: true
      syslog:
        enabled: true
        host: remote-syslog.svc
        port: 514
        protocol: tcp
        tls:
          enabled: true
          certificate:
            secret: ""
            hostPath: "/etc/ssl/certs/ca-bundle.pem"
          verify: true
    

    Note

    The hostname field in the remote syslog database will be set based on clusterId specified in the StackLight chart values. For example, if clusterId is ns/cluster/example-uid, the hostname will transform to ns_cluster_example-uid. For details, see clusterId in StackLight configuration parameters.

  3. Verify remote logging to syslog as described in Verify StackLight after configuration.

Manage Ceph

This section outlines Ceph LCM operations such as adding Ceph Monitors, Ceph nodes, and RADOS Gateway nodes to an existing Ceph cluster or removing them, as well as removing or replacing Ceph OSDs and updating your Ceph cluster.

Enable automated Ceph LCM

Ceph controller can automatically redeploy Ceph OSDs in case of significant configuration changes such as changing the block.db device or replacing Ceph OSDs. Ceph controller can also clean disks and configuration during a Ceph OSD removal.

To remove a single Ceph OSD or an entire Ceph node, manually remove its definition from the KaaSCephCluster CR.

To enable automated management of Ceph OSDs:

  1. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

  2. Obtain and export kubeconfig of the management cluster as described in Connect to a Mirantis Container Cloud cluster.

  3. Open the KaasCephCluster CR for editing. Choose from the following options:

    • For a management cluster:

      kubectl edit kaascephcluster
      
    • For a managed cluster:

      kubectl edit kaascephcluster -n <managedClusterProjectName>
      

      Substitute <managedClusterProjectName> with the corresponding value.

  4. Set the manageOsds parameter to true:

    spec:
      cephClusterSpec:
        manageOsds: true
    

Once done, all Ceph OSDs with a modified configuration will be redeployed. Mirantis recommends modifying only one Ceph node at a time. For details about supported configuration parameters, see OSD Configuration Settings.

Add, remove, or reconfigure Ceph nodes

Mirantis Ceph controller simplifies Ceph cluster management by automating LCM operations. To modify Ceph components, update only the MiraCeph custom resource (CR). Once you update the MiraCeph CR, the Ceph controller automatically adds, removes, or reconfigures Ceph nodes as required.

Note

When adding a Ceph node with the Ceph Monitor role, if any issues occur with the Ceph Monitor, rook-ceph removes it and adds a new Ceph Monitor instead, named using the next alphabetic character in order. Therefore, the Ceph Monitor names may not follow the alphabetical order. For example, a, b, d, instead of a, b, c.

To add, remove, or reconfigure Ceph nodes on a management or managed cluster:

  1. To modify Ceph OSDs, verify that the manageOsds parameter is set to true in the KaasCephCluster CR as described in Enable automated Ceph LCM.

  2. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

  3. Obtain and export kubeconfig of the management cluster as described in Connect to a Mirantis Container Cloud cluster.

  4. Open the KaasCephCluster CR for editing. Choose from the following options:

    • For a management cluster:

      kubectl edit kaascephcluster
      
    • For a managed cluster:

      kubectl edit kaascephcluster -n <managedClusterProjectName>
      

      Substitute <managedClusterProjectName> with the corresponding value.

  5. In the nodes section, specify or remove the parameters for a Ceph OSD as required. For the parameters description, see OSD Configuration Settings.

    For example:

    nodes:
      kaas-mgmt-node-5bgk6:
        roles:
        - mon
        - mgr
        storageDevices:
        - config:
            storeType: bluestore
        name: sdb
    

    Note

    • To use a new Ceph node for a Ceph Monitor or Ceph Manager deployment, also specify the roles parameter.

    • Reducing the number of Ceph Monitors is not supported and causes removal of the Ceph Monitor daemons from random nodes.

    • Removal of the mgr role in the nodes section of the KaaSCephCluster CR does not remove Ceph Managers. To remove a Ceph Manager from a node, remove it from the nodes spec and manually delete the mgr pod in the Rook namespace.

  6. If you are making changes for your managed cluster, obtain and export kubeconfig of the managed cluster as described in Connect to a Mirantis Container Cloud cluster. Otherwise, skip this step.

  7. Monitor the status of your Ceph cluster deployment. For example:

    kubectl -n rook-ceph get pods
    
    kubectl -n ceph-lcm-mirantis logs ceph-controller-78c95fb75c-dtbxk
    
    kubectl -n rook-ceph logs rook-ceph-operator-56d6b49967-5swxr
    
  8. Connect to the terminal of the ceph-tools pod:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
    
  9. Verify that the Ceph node has been successfully added, removed, or reconfigured:

    1. Verify that the Ceph cluster status is healthy:

      ceph status
      

      Example of a positive system response:

      cluster:
        id:     0868d89f-0e3a-456b-afc4-59f06ed9fbf7
        health: HEALTH_OK
      
      services:
        mon: 3 daemons, quorum a,b,c (age 20h)
        mgr: a(active, since 20h)
        osd: 9 osds: 9 up (since 20h), 9 in (since 2d)
      
      data:
        pools:   1 pools, 32 pgs
        objects: 0 objects, 0 B
        usage:   9.1 GiB used, 231 GiB / 240 GiB avail
        pgs:     32 active+clean
      
    2. Verify that the status of the Ceph OSDs is up:

      ceph osd tree
      

      Example of a positive system response:

      ID  CLASS WEIGHT  TYPE NAME                   STATUS REWEIGHT PRI-AFF
      -1       0.23424 root default
      -3       0.07808             host osd1
       1   hdd 0.02930                 osd.1           up  1.00000 1.00000
       3   hdd 0.01949                 osd.3           up  1.00000 1.00000
       6   hdd 0.02930                 osd.6           up  1.00000 1.00000
      -15       0.07808             host osd2
       2   hdd 0.02930                 osd.2           up  1.00000 1.00000
       5   hdd 0.01949                 osd.5           up  1.00000 1.00000
       8   hdd 0.02930                 osd.8           up  1.00000 1.00000
      -9       0.07808             host osd3
       0   hdd 0.02930                 osd.0           up  1.00000 1.00000
       4   hdd 0.01949                 osd.4           up  1.00000 1.00000
       7   hdd 0.02930                 osd.7           up  1.00000 1.00000
      

Replace a failed Ceph OSD

After a physical disk replacement, you can use Rook to redeploy a failed Ceph OSD by restarting rook-operator, which triggers the reconfiguration of the management or managed cluster.

To redeploy a failed Ceph OSD:

  1. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

  2. Obtain and export kubeconfig of the required management or managed cluster as described in Connect to a Mirantis Container Cloud cluster.

  3. Identify the failed Ceph OSD ID:

    ceph osd tree
    
  4. Remove the Ceph OSD deployment from the management or managed cluster:

    kubectl delete deployment -n rook-ceph rook-ceph-osd-<ID>
    
  5. Connect to the terminal of the ceph-tools pod:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
    
  6. Remove the failed Ceph OSD from the Ceph cluster:

    ceph osd purge osd.<ID>
    
  7. Replace the failed disk.

  8. Restart the Rook operator:

    kubectl delete pod $(kubectl -n rook-ceph get pod -l "app=rook-ceph-operator" \
    -o jsonpath='{.items[0].metadata.name}') -n rook-ceph
    
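
Once the Rook operator is restarted and the reconfiguration completes, you can verify that the new Ceph OSD has joined the cluster, for example, by running the following command from the ceph-tools pod (see the step 5 above):

ceph osd tree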

Update Ceph cluster

You can update the Ceph cluster to the latest minor version of Ceph Nautilus by triggering an update of the existing Ceph cluster.

To update Ceph cluster:

  1. Verify that your management cluster is automatically upgraded to the latest Mirantis Container Cloud release:

    1. Log in to the Container Cloud web UI with the writer permissions.

    2. On the bottom of the page, verify the Container Cloud version number.

  2. Verify that your managed clusters are updated to the latest Cluster release. For details, see Update a managed cluster.

  3. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

  4. Obtain and export kubeconfig of the management cluster as described in Connect to a Mirantis Container Cloud cluster.

  5. Open the KaasCephCluster CR for editing:

    kubectl edit kaascephcluster
    
  6. Update the version parameter. For example:

    version: 14.2.9
    
  7. Obtain and export kubeconfig of the managed clusters as described in Connect to a Mirantis Container Cloud cluster.

  8. Repeat the steps 5-7 to update Ceph on every managed cluster.
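
Once done, you can verify the resulting Ceph version, for example, by running the following command from the ceph-tools pod described in Add, remove, or reconfigure Ceph nodes:

ceph versions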

Enable Ceph RGW Object Storage

Ceph controller enables you to deploy RADOS Gateway (RGW) Object Storage instances and automatically manages their resources such as users and buckets. Ceph Object Storage integrates with OpenStack Object Storage (Swift) in Mirantis OpenStack for Kubernetes (MOS).

To enable the RGW Object Storage:

  1. Select from the following options:

    • If you do not have a management cluster yet, open kaascephcluster.yaml.template for editing.

    • If the management cluster is already deployed, open the KaasCephCluster CR for editing. Select from the following options:

      • If the Ceph cluster is placed in the management cluster:

        kubectl edit kaascephcluster
        
      • If the Ceph cluster is placed in a managed cluster:

        kubectl edit kaascephcluster -n <managedClusterProjectName>
        

        Substitute <managedClusterProjectName> with a corresponding value.

  2. Using the following table, update the cephClusterSpec.objectStorage.rgw section specification as required:

    Warning

    Starting from Container Cloud 2.6.0, the spec.rgw section is deprecated and its parameters are moved under objectStorage.rgw. If you continue using spec.rgw, it will be automatically translated into objectStorage.rgw during the Container Cloud update to 2.6.0.

    RADOS Gateway parameters

    Parameter

    Description

    name

    Ceph Object Storage instance name.

    dataPool

    Mutually exclusive with the zone parameter. Object storage data pool spec that should only contain replicated or erasureCoded and failureDomain parameters. The failureDomain parameter may be set to osd or host, defining the failure domain across which the data will be spread. For dataPool, Mirantis recommends using an erasureCoded pool. For details, see Rook documentation: Erasure coding. For example:

    cephClusterSpec:
      objectStorage:
        rgw:
          dataPool:
            erasureCoded:
              codingChunks: 1
              dataChunks: 2
    

    metadataPool

    Mutually exclusive with the zone parameter. Object storage metadata pool spec that should only contain replicated and failureDomain parameters. The failureDomain parameter may be set to osd or host, defining the failure domain across which the data will be spread. Can use only replicated settings. For example:

    cephClusterSpec:
      objectStorage:
        rgw:
          metadataPool:
            replicated:
              size: 3
            failureDomain: host
    

    where replicated.size is the number of full copies of data on multiple nodes.

    gateway

    The gateway settings corresponding to the rgw daemon settings. Includes the following parameters:

    • port - the port on which the Ceph RGW service will be listening on HTTP.

    • securePort - the port on which the Ceph RGW service will be listening on HTTPS.

    • instances - the number of pods in the Ceph RGW ReplicaSet. If allNodes is set to true, a DaemonSet is created instead.

      Note

      Mirantis recommends using 2 instances for Ceph Object Storage.

    • allNodes - defines whether to start the Ceph RGW pods as a DaemonSet on all nodes. The instances parameter is ignored if allNodes is set to true.

    For example:

    cephClusterSpec:
      objectStorage:
        rgw:
          gateway:
            allNodes: false
            instances: 1
            port: 80
            securePort: 8443
    

    preservePoolsOnDelete

    Defines whether to delete the data and metadata pools in the rgw section if the object storage is deleted. Set this parameter to true if you need to store data even if the object storage is deleted. However, Mirantis recommends setting this parameter to false.

    users and buckets

    Optional. To create new Ceph RGW resources, such as buckets or users, specify the following keys. Ceph controller will automatically create the specified object storage users and buckets in the Ceph cluster.

    • users - a list of strings that contain user names to create for object storage.

    • buckets - a list of strings that contain bucket names to create for object storage.
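
    For example (the user and bucket names below are placeholders):

    cephClusterSpec:
      objectStorage:
        rgw:
          users:
          - user-test
          buckets:
          - bucket-test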

    zone Available since 2.7.0

    Optional. Mutually exclusive with metadataPool and dataPool. Defines the Ceph Multisite zone where the object storage must be placed. Includes the name parameter that must be set to one of the zones items. For details, see Enable Multisite for Ceph RGW Object Storage.

    For example:

    cephClusterSpec:
      objectStorage:
        multisite:
          zones:
          - name: master-zone
          ...
        rgw:
          zone:
            name: master-zone
    

    For example:

    cephClusterSpec:
      objectStorage:
        rgw:
          name: rgw-store
          dataPool:
            erasureCoded:
              codingChunks: 1
              dataChunks: 2
            failureDomain: host
          metadataPool:
            failureDomain: host
            replicated:
              size: 3
          gateway:
            allNodes: false
            instances: 1
            port: 80
            securePort: 8443
          preservePoolsOnDelete: false
    

Enable Multisite for Ceph RGW Object Storage

Caution

This feature is available starting from the Container Cloud release 2.7.0.

Caution

This feature is available as Technology Preview. Use such configuration for testing and evaluation purposes only. For details about the Mirantis Technology Preview support scope, see the Preface section of this guide.

The Ceph Multisite feature allows object storage to replicate its data over multiple Ceph clusters. With Multisite, such object storage is independent of and isolated from other object storages in the cluster. For more details, see Ceph documentation: Multisite.

Warning

Rook does not handle Multisite configuration changes and removal. Therefore, once you enable Multisite for Ceph RGW Object Storage, perform such operations manually through the ceph-tools pod. For details, see Rook documentation: Multisite cleanup.

To enable the Multisite RGW Object Storage:

  1. Select from the following options:

    • If you do not have a management cluster yet, open kaascephcluster.yaml.template for editing.

    • If the management cluster is already deployed, open the KaasCephCluster CR for editing. Select from the following options:

      • If the Ceph cluster is placed in the management cluster:

        kubectl edit kaascephcluster
        
      • If the Ceph cluster is placed in a managed cluster:

        kubectl edit kaascephcluster -n <managedClusterProjectName>
        

        Substitute <managedClusterProjectName> with a corresponding value.

  2. Using the following table, update the cephClusterSpec.objectStorage.multisite section specification as required:

    Multisite parameters

    Parameter

    Description

    realms Available since 2.7.0, Technical Preview

    List of realms to use, represents the realm namespaces. Includes the following parameters:

    • name - the realm name.

    • pullEndpoint - optional, required only when the master zone is in a different storage cluster. The endpoint, access key, and system key of the system user from the realm to pull from. Includes the following parameters:

      • endpoint - the endpoint of the master zone in the master zone group.

      • accessKey - the access key of the system user from the realm to pull from.

      • secretKey - the system key of the system user from the realm to pull from.

    zoneGroups Available since 2.7.0, Technical Preview

    The list of zone groups for realms. Includes the following parameters:

    • name - the zone group name.

    • realmName - the realm namespace name to which the zone group belongs.

    zones Available since 2.7.0, Technical Preview

    The list of zones used within one zone group. Includes the following parameters:

    • name - the zone name.

    • metadataPool - the settings used to create the Object Storage metadata pools. Must use replication. For details, see Pool parameters.

    • dataPool - the settings to create the Object Storage data pool. Can use replication or erasure coding. For details, see Pool parameters.

    • zoneGroupName - the zone group name.

    For example:

    objectStorage:
      multiSite:
        realms:
        - name: realm_from_cluster
        zoneGroups:
        - name: zonegroup_from_cluster
          realmName: realm_from_cluster
        zones:
        - name: secondary-zone
          zoneGroupName: zonegroup_from_cluster
          metadataPool:
            failureDomain: host
            replicated:
              size: 3
          dataPool:
            erasureCoded:
              codingChunks: 1
              dataChunks: 2
            failureDomain: host
    
  3. Select from the following options:

    • If you do not need to replicate data from a different storage cluster, do not specify the pullEndpoint parameter. The current zone used in the ObjectStorage RGW in KaaSCephCluster will be the master zone.

    • If a different storage cluster exists and its object storage data must be replicated, specify the same realm and zone group names and the pullEndpoint parameter. Additionally, specify the endpoint, access key, and system key of the system user of the realm from which you need to replicate data. For details, see the step 2.

      1. To obtain the endpoint of the cluster zone that must be replicated, run the following command specifying the realm and zone group names of the required master zone:

        radosgw-admin zonegroup get --rgw-realm=<REALM_NAME> --rgw-zonegroup=<ZONE_GROUP_NAME>
        
      2. To obtain the access key and the secret key of the system user, run the following command on the required Ceph cluster:

        radosgw-admin user info --uid="<USER_NAME>"
        

      For example:

      objectStorage:
        multiSite:
          realms:
          - name: realm_from_cluster
            pullEndpoint:
              endpoint: http://10.11.0.75:8080
              accessKey: DRND5J2SVC9O6FQGEJJF
              secretKey: qpjIjY4lRFOWh5IAnbrgL5O6RTA1rigvmsqRGSJk
          zoneGroups:
          - name: zonegroup_from_cluster
            realmName: realm_from_cluster
          zones:
          - name: secondary-zone
            zoneGroupName: zonegroup_from_cluster
            metadataPool:
              failureDomain: host
              replicated:
                size: 3
            dataPool:
              erasureCoded:
                codingChunks: 1
                dataChunks: 2
              failureDomain: host
      
  4. Configure the zone RADOS Gateway parameter as described in Enable Ceph RGW Object Storage. Leave dataPool and metadataPool empty. These parameters will be ignored because the zone block in the Multisite configuration specifies the pools parameters.

    Note

    If Ceph RGW Object Storage in your cluster is not set up for Multisite, see Ceph documentation: Migrating a single site system to multi-site.

    For example:

    rgw:
      dataPool: {}
      gateway:
        allNodes: false
        instances: 2
        port: 80
        securePort: 8443
      healthCheck:
        bucket:
          disabled: true
      metadataPool: {}
      name: store-test-pull
      preservePoolsOnDelete: false
      zone:
        name: "secondary-zone"
    

Once done, ceph-operator will create the required resources and Rook will handle the Multisite configuration. For details, see Rook documentation: Object Multisite.

Verify Ceph

This section describes how to verify the components of a Ceph cluster after deployment. For troubleshooting, verify Ceph controller and Rook logs as described in Verify Ceph controller and Rook.

Verify the Ceph core services

To confirm that all Ceph components including mon, mgr, osd, and rgw have joined your cluster properly, analyze the logs for each pod and verify the Ceph status:

kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
ceph -s

Example of a positive system response:

cluster:
    id:     4336ab3b-2025-4c7b-b9a9-3999944853c8
    health: HEALTH_OK

services:
    mon: 3 daemons, quorum a,b,c (age 20m)
    mgr: a(active, since 19m)
    osd: 6 osds: 6 up (since 16m), 6 in (since 16m)
    rgw: 1 daemon active (miraobjstore.a)

data:
    pools:   12 pools, 216 pgs
    objects: 201 objects, 3.9 KiB
    usage:   6.1 GiB used, 174 GiB / 180 GiB avail
    pgs:     216 active+clean

Verify rook-discover

To ensure that rook-discover is running properly, verify that the local-device configmap has been created for each Ceph node specified in the cluster configuration:

  1. Obtain the list of local devices:

    kubectl get configmap -n rook-ceph | grep local-device
    

    Example of a system response:

    local-device-01      1      30m
    local-device-02      1      29m
    local-device-03      1      30m
    
  2. Verify that each device from the list contains information about available devices for the Ceph node deployment:

    kubectl describe configmap local-device-01 -n rook-ceph
    

    Example of a positive system response:

    Name:         local-device-01
    Namespace:    rook-ceph
    Labels:       app=rook-discover
                  rook.io/node=01
    Annotations:  <none>
    
    Data
    ====
    devices:
    ----
    [{"name":"vdd","parent":"","hasChildren":false,"devLinks":"/dev/disk/by-id/virtio-41d72dac-c0ff-4f24-b /dev/disk/by-path/virtio-pci-0000:00:09.0","size":32212254720,"uuid":"27e9cf64-85f4-48e7-8862-faa7270202ed","serial":"41d72dac-c0ff-4f24-b","type":"disk","rotational":true,"readOnly":false,"Partitions":null,"filesystem":"","vendor":"","model":"","wwn":"","wwnVendorExtension":"","empty":true,"cephVolumeData":"{\"path\":\"/dev/vdd\",\"available\":true,\"rejected_reasons\":[],\"sys_api\":{\"size\":32212254720.0,\"scheduler_mode\":\"none\",\"rotational\":\"1\",\"vendor\":\"0x1af4\",\"human_readable_size\":\"30.00 GB\",\"sectors\":0,\"sas_device_handle\":\"\",\"rev\":\"\",\"sas_address\":\"\",\"locked\":0,\"sectorsize\":\"512\",\"removable\":\"0\",\"path\":\"/dev/vdd\",\"support_discard\":\"0\",\"model\":\"\",\"ro\":\"0\",\"nr_requests\":\"128\",\"partitions\":{}},\"lvs\":[]}","label":""},{"name":"vdb","parent":"","hasChildren":false,"devLinks":"/dev/disk/by-path/virtio-pci-0000:00:07.0","size":67108864,"uuid":"988692e5-94ac-4c9a-bc48-7b057dd94fa4","serial":"","type":"disk","rotational":true,"readOnly":false,"Partitions":null,"filesystem":"","vendor":"","model":"","wwn":"","wwnVendorExtension":"","empty":true,"cephVolumeData":"{\"path\":\"/dev/vdb\",\"available\":false,\"rejected_reasons\":[\"Insufficient space (\\u003c5GB)\"],\"sys_api\":{\"size\":67108864.0,\"scheduler_mode\":\"none\",\"rotational\":\"1\",\"vendor\":\"0x1af4\",\"human_readable_size\":\"64.00 MB\",\"sectors\":0,\"sas_device_handle\":\"\",\"rev\":\"\",\"sas_address\":\"\",\"locked\":0,\"sectorsize\":\"512\",\"removable\":\"0\",\"path\":\"/dev/vdb\",\"support_discard\":\"0\",\"model\":\"\",\"ro\":\"0\",\"nr_requests\":\"128\",\"partitions\":{}},\"lvs\":[]}","label":""},{"name":"vdc","parent":"","hasChildren":false,"devLinks":"/dev/disk/by-id/virtio-e8fdba13-e24b-41f0-9 /dev/disk/by-path/virtio-pci-0000:00:08.0","size":32212254720,"uuid":"190a50e7-bc79-43a9-a6e6-81b173cd2e0c","serial":"e8fdba13-e24b-41f0-9","type":"disk","rotational":true,"readOnly":false,"Partitions":null,"filesystem":"","vendor":"","model":"","wwn":"","wwnVendorExtension":"","empty":true,"cephVolumeData":"{\"path\":\"/dev/vdc\",\"available\":true,\"rejected_reasons\":[],\"sys_api\":{\"size\":32212254720.0,\"scheduler_mode\":\"none\",\"rotational\":\"1\",\"vendor\":\"0x1af4\",\"human_readable_size\":\"30.00 GB\",\"sectors\":0,\"sas_device_handle\":\"\",\"rev\":\"\",\"sas_address\":\"\",\"locked\":0,\"sectorsize\":\"512\",\"removable\":\"0\",\"path\":\"/dev/vdc\",\"support_discard\":\"0\",\"model\":\"\",\"ro\":\"0\",\"nr_requests\":\"128\",\"partitions\":{}},\"lvs\":[]}","label":""}]
    

Verify Ceph cluster state

To verify the state of a Ceph cluster, Ceph controller provides a Kubernetes API that includes a custom MiraCephLog resource. The resource contains information about the state of different components of your Ceph cluster.

To verify the Ceph cluster state:

  1. Obtain kubeconfig of the management or managed cluster and provide it as an environment variable:

    export KUBECONFIG=<path-to-kubeconfig>
    
  2. Obtain MiraCephLog:

    kubectl get miracephlog rook-ceph -n ceph-lcm-mirantis -o yaml
    
  3. Verify the state of the required component using the MiraCephLog specification description below.

    Specification fields of the MiraCephLog object

    Field

    Description

    lastLogs

    The tail of the ceph-operator pod logs. Use the logs for investigation and troubleshooting.

    osdStatus

    A string that describes the current state of Ceph OSDs. If all Ceph OSDs operate properly, the value is ALL OK.

    pools

    The list of Ceph block pools. Use this list to verify whether all defined pools have been created properly.

Verify Ceph controller and Rook

The starting point for Ceph troubleshooting is the ceph-controller and rook-operator logs. Once you locate the component that causes issues, verify the logs of the related pod. This section describes how to verify the Ceph controller and Rook objects of a Ceph cluster.

To verify Ceph controller and Rook:

  1. Verify data access. Ceph volumes can be consumed directly by Kubernetes workloads and internally, for example, by OpenStack services. To verify the Kubernetes storage:

    1. Verify the available storage classes. The storage classes that are automatically managed by Ceph controller use the rook-ceph.rbd.csi.ceph.com provisioner.

      kubectl get storageclass
      

      Example of system response:

      NAME                            PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
      iam-kaas-iam-data               kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  64m
      kubernetes-ssd (default)        rook-ceph.rbd.csi.ceph.com     Delete          Immediate              false                  55m
      stacklight-alertmanager-data    kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  55m
      stacklight-elasticsearch-data   kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  55m
      stacklight-postgresql-db        kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  55m
      stacklight-prometheus-data      kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false                  55m
      
    2. Verify that volumes are properly connected to the pod:

      1. Obtain the list of volumes:

        kubectl get persistentvolumeclaims -n kaas
        

        Example of system response:

        NAME                         STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS       AGE
        ironic-aio-pvc               Bound   pvc-9132beb2-6a17-4877-af40-06031d52da47  5Gi       RWO           kubernetes-ssd     62m
        ironic-inspector-pvc         Bound   pvc-e84e9a9e-51b8-4c57-b116-0e1e6a9e7e94  1Gi       RWO           kubernetes-ssd     62m
        mariadb-pvc                  Bound   pvc-fb0dbf01-ee4b-4c88-8b08-901080ee8e14  2Gi       RWO           kubernetes-ssd     62m
        mysql-data-mariadb-server-0  Bound   local-pv-d1ecc89d                         457Gi     RWO           iam-kaas-iam-data  62m
        mysql-data-mariadb-server-1  Bound   local-pv-1f385d17                         457Gi     RWO           iam-kaas-iam-data  62m
        mysql-data-mariadb-server-2  Bound   local-pv-79a820d7                         457Gi     RWO           iam-kaas-iam-data  62m
        
      2. For each volume, verify the connection. For example:

        kubectl describe pvc ironic-aio-pvc -n kaas
        

        Example of a positive system response:

        Name:          ironic-aio-pvc
        Namespace:     kaas
        StorageClass:  kubernetes-ssd
        Status:        Bound
        Volume:        pvc-9132beb2-6a17-4877-af40-06031d52da47
        Labels:        <none>
        Annotations:   pv.kubernetes.io/bind-completed: yes
                       pv.kubernetes.io/bound-by-controller: yes
                       volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
        Finalizers:    [kubernetes.io/pvc-protection]
        Capacity:      5Gi
        Access Modes:  RWO
        VolumeMode:    Filesystem
        Events:        <none>
        Mounted By:    dnsmasq-dbd84d496-6fcz4
                       httpd-0
                       ironic-555bff5dd8-kb8p2
        

        In case of connection issues, inspect the pod description for the volume information:

        kubectl describe pod <crashloopbackoff-pod-name>
        

        Example of system response:

        ...
        Events:
          FirstSeen LastSeen Count From    SubObjectPath Type     Reason           Message
          --------- -------- ----- ----    ------------- -------- ------           -------
          1h        1h       3     default-scheduler     Warning  FailedScheduling PersistentVolumeClaim is not bound: "mysql-pv-claim" (repeated 2 times)
          1h        35s      36    kubelet, 172.17.8.101 Warning  FailedMount      Unable to mount volumes for pod "wordpress-mysql-918363043-50pjr_default(08d14e75-bd99-11e7-bc4c-001c428b9fc8)": timeout expired waiting for volumes to attach/mount for pod "default"/"wordpress-mysql-918363043-50pjr". list of unattached/unmounted volumes=[mysql-persistent-storage]
          1h        35s      36    kubelet, 172.17.8.101 Warning  FailedSync       Error syncing pod
        
    3. Verify that the CSI provisioner plugins were started properly and have the Running status:

      1. Obtain the list of CSI provisioner plugins:

        kubectl -n rook-ceph get pod -l app=csi-rbdplugin-provisioner
        
      2. Verify the logs of the required CSI provisioner:

        kubectl logs -n rook-ceph <csi-provisioner-plugin-name> csi-provisioner
        
  2. Verify the Ceph cluster status:

    1. Verify that the status of each pod in the ceph-lcm-mirantis and rook-ceph namespaces is Running:

      • For ceph-lcm-mirantis:

        kubectl get pod -n ceph-lcm-mirantis
        
      • For rook-ceph:

        kubectl get pod -n rook-ceph
        
  3. Verify Ceph controller. Ceph controller prepares the configuration that Rook uses to deploy the Ceph cluster, managed using the KaasCephCluster resource. If Rook cannot finish the deployment, verify the Rook operator logs as described in the step 4.

    1. List the pods:

      kubectl -n ceph-lcm-mirantis get pods
      
    2. Verify the logs of the required pod:

      kubectl -n ceph-lcm-mirantis logs <ceph-controller-pod-name>
      
    3. Verify the configuration:

      kubectl get kaascephcluster -n <managedClusterProjectName> -o yaml
      
    4. On the managed cluster, verify the MiraCeph subresource:

      kubectl get miraceph -n ceph-lcm-mirantis -o yaml
      
  4. Verify the Rook operator logs. Rook deploys a Ceph cluster based on custom resources created by the MiraCeph controller, such as pools, clients, cephcluster, and so on. Rook logs contain details about components orchestration. For details about the Ceph cluster status and to get access to CLI tools, connect to the ceph-tools pod as described in the step 5.

    1. Verify the Rook operators logs:

      kubectl -n rook-ceph logs -l app=rook-ceph-operator
      
    2. Verify the CephCluster configuration:

      Note

      The MiraCeph controller manages the CephCluster CR. Open the CephCluster CR only for verification and do not modify it manually.

      kubectl get cephcluster -n rook-ceph -o yaml
      
  5. Verify the ceph-tools pod:

    1. Execute the ceph-tools pod:

      kubectl --kubeconfig <pathToManagedClusterKubeconfig> -n rook-ceph exec -it $(kubectl --kubeconfig <pathToManagedClusterKubeconfig> -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') bash
      
    2. Verify that CLI commands can run on the ceph-tools pod:

      ceph -s
      
  6. Verify hardware:

    1. Through the ceph-tools pod, obtain the required device in your cluster:

      ceph osd tree
      
    2. Enter all Ceph OSD pods in the rook-ceph namespace one by one:

      kubectl exec -it -n rook-ceph <osd-pod-name> bash
      
    3. Verify that the ceph-volume tool is available on all pods running on the target node:

      ceph-volume lvm list
      

Ceph advanced configuration

This section describes how to configure a Ceph cluster through the KaaSCephCluster (kaascephclusters.kaas.mirantis.com) CR during or after the deployment of a management or managed cluster.

The KaaSCephCluster CR spec has two sections, cephClusterSpec and k8sCluster, and specifies the nodes to deploy as Ceph components. Based on the roles definitions in the KaaSCephCluster CR, Ceph controller automatically labels nodes for Ceph Monitors and Managers. Ceph OSDs are deployed based on the storageDevices parameter defined for each Ceph node.

For a default KaaSCephCluster CR, see templates/bm/kaascephcluster.yaml.template. For details on how to configure the default template for a baremetal-based cluster bootstrap, see Deployment Guide: Bootstrap a management cluster.

To configure a Ceph cluster:

  1. Select from the following options:

    • If you do not have a management cluster yet, open kaascephcluster.yaml.template for editing.

    • If the management cluster is already deployed, open the KaasCephCluster CR for editing:

      • If the Ceph cluster is placed in the management cluster:

        kubectl edit kaascephcluster
        
      • If the Ceph cluster is placed in a managed cluster:

        kubectl edit kaascephcluster -n <managedClusterProjectName>
        

        Substitute <managedClusterProjectName> with a corresponding value.

  2. Using the tables below, configure the Ceph cluster as required.

    High-level parameters

    Parameter

    Description

    cephClusterSpec

    Describes a Ceph cluster in the management cluster. For details on cephClusterSpec parameters, see the tables below.

    k8sCluster

    Defines the management cluster on which the KaaSCephCluster depends. Use the k8sCluster parameter if the name or namespace of the management cluster differs from the default one:

    spec:
      k8sCluster:
        name: kaas-mgmt
        namespace: default
    
    General parameters

    Parameter

    Description

    manageOsds

    Recommended. Enables automated management of Ceph OSDs. For details, see Enable automated Ceph LCM.

    clusterNet

    Specifies the CIDR for the Ceph OSD replication network.

    publicNet

    Specifies the CIDR for communication between the service and operator.

    nodes

    Specifies the list of Ceph nodes. For details, see Node parameters. The nodes parameter is a map with machine names as keys and Ceph node specifications as values, for example:

    nodes:
      master-0:
        <node spec>
      master-1:
        <node spec>
      ...
      worker-0:
        <node spec>
    

    pools

    Specifies the list of Ceph pools. For details, see Pool parameters.

    objectStorage Available since 2.6.0

    Specifies the parameters for Object Storage, such as RADOS Gateway, the Ceph Object Storage. Starting from Container Cloud 2.7.0, also specifies the RADOS Gateway Multisite configuration. For details, see RADOS Gateway parameters and Multisite parameters.

    rgw Deprecated since 2.6.0

    Specifies RADOS Gateway, the Ceph Object Storage. For details, see RADOS Gateway parameters.

    maintenance Available since 2.6.0

    Enables or disables the noout, norebalance, and nobackfill flags on the entire Ceph cluster. Set to false by default. Mirantis strongly recommends not using this parameter on production deployments except during an update.

    Example configuration:

    spec:
      cephClusterSpec:
        manageOsds: true
        network:
          clusterNet: 10.10.10.0/24
          publicNet: 10.10.11.0/24
        nodes:
          master-0:
            <node spec>
          ...
        pools:
        - <pool spec>
        ...
    
    Node parameters

    Parameter

    Description

    roles

    Specifies the mon or mgr daemon to be installed on a Ceph node. You can place the daemons on any nodes as required. Consider the following recommendations:

    • The recommended number of Ceph Monitors in a Ceph cluster is 3. Therefore, at least 3 Ceph nodes must contain the mon item in the roles parameter.

    • The number of Ceph Monitors must be odd. For example, if the KaaSCephCluster spec contains 3 Ceph monitors and you need to add more, the number of Ceph monitors must equal 5, 7, 9, and so on.

    • Do not add more than 2 Ceph monitors at a time and wait until the Ceph cluster is Ready.

    • For better HA, the number of mgr roles must equal the number of mon roles.

      If a Ceph node contains a mon role, the Ceph Monitor Pod will be deployed on it. If the Ceph node contains a mgr role, it informs the Ceph Controller that a Ceph Manager can be deployed on that node. However, only one Ceph Manager must be deployed on a node.

    storageDevices

    Specifies the list of devices to use for Ceph OSD deployment. Includes the following parameters:

    • name - the device name placed in the /dev folder. For example, vda.

    • config - a map of device configurations that must contain a device class. The device class must be defined in a pool and can contain a metadata device:

      storageDevices:
      - name: sdc
        config:
          deviceClass: hdd
          metadataDevice: nvme01
      

      The underlying storage format to use for Ceph OSDs is BlueStore.

    crush Available since 2.7.0

    Specifies the explicit key-value CRUSH topology for a node. For details, see Ceph official documentation: CRUSH maps. Includes the following parameters:

    • datacenter - a physical data center that consists of rooms and handles data.

    • room - a room that accommodates one or more racks with hosts.

    • pdu - a power distribution unit (PDU) device that has multiple outputs and distributes electric power to racks located within a data center.

    • row - a row of computing racks inside a room.

    • rack - a computing rack that accommodates one or more hosts.

    • chassis - a bare metal structure that houses or physically assembles hosts.

    • region - the geographic location of one or more Ceph Object instances within one or more zones.

    • zone - a logical group that consists of one or more Ceph Object instances.

    Example configuration:

    crush:
      datacenter: dc1
      room: room1
      pdu: pdu1
      row: row1
      rack: rack1
      chassis: ch1
      region: region1
      zone: zone1
    
    Pool parameters

    Parameter

    Description

    name

    Specifies the pool name as a prefix for each Ceph block pool.

    role

    Specifies the pool role and is used mostly for Mirantis OpenStack for Kubernetes (MOS) pools.

    default

    Defines if the pool and dependent StorageClass should be set as default. Must be enabled only for one pool.

    deviceClass

    Specifies the device class for the defined pool. Possible values are HDD, SSD, and NVMe.

    replicated

    The number of pool replicas. The replicated parameter is mutually exclusive with erasureCoded.

    erasureCoded

    Enables the erasure-coded pool. For details, see Rook documentation: Erasure coded and Ceph documentation: Erasure coded pool. The erasureCoded parameter is mutually exclusive with replicated.

    failureDomain

    The failure domain across which the replicas or chunks of data will be spread. Set to host by default. The possible values are osd or host.

    Example configuration:

    pools:
    - name: kubernetes
      role: kubernetes
      deviceClass: hdd
      replicated:
        size: 3
      default: true
    

    To configure additional required pools for MOS, see MOS Deployment Guide: Deploy a Ceph cluster.

    RADOS Gateway parameters

    Parameter

    Description

    name

    Ceph Object Storage instance name.

    dataPool

    Mutually exclusive with the zone parameter. Object storage data pool spec that should only contain replicated or erasureCoded and failureDomain parameters. The failureDomain parameter may be set to osd or host, defining the failure domain across which the data will be spread. For dataPool, Mirantis recommends using an erasureCoded pool. For details, see Rook documentation: Erasure coding. For example:

    cephClusterSpec:
      objectStorage:
        rgw:
          dataPool:
            erasureCoded:
              codingChunks: 1
              dataChunks: 2
    

    metadataPool

    Mutually exclusive with the zone parameter. Object storage metadata pool spec that should only contain replicated and failureDomain parameters. The failureDomain parameter may be set to osd or host, defining the failure domain across which the data will be spread. Can use only replicated settings. For example:

    cephClusterSpec:
      objectStorage:
        rgw:
          metadataPool:
            replicated:
              size: 3
            failureDomain: host
    

    where replicated.size is the number of full copies of data on multiple nodes.

    gateway

    The gateway settings corresponding to the rgw daemon settings. Includes the following parameters:

    • port - the port on which the Ceph RGW service will be listening on HTTP.

    • securePort - the port on which the Ceph RGW service will be listening on HTTPS.

    • instances - the number of pods in the Ceph RGW ReplicaSet. If allNodes is set to true, a DaemonSet is created instead.

      Note

      Mirantis recommends using 2 instances for Ceph Object Storage.

    • allNodes - defines whether to start the Ceph RGW pods as a DaemonSet on all nodes. The instances parameter is ignored if allNodes is set to true.

    For example:

    cephClusterSpec:
      objectStorage:
        rgw:
          gateway:
            allNodes: false
            instances: 1
            port: 80
            securePort: 8443
    

    preservePoolsOnDelete

    Defines whether to delete the data and metadata pools in the rgw section if the object storage is deleted. Set this parameter to true if you need to store data even if the object storage is deleted. However, Mirantis recommends setting this parameter to false.

    users and buckets

    Optional. To create new Ceph RGW resources, such as buckets or users, specify the following keys. Ceph controller will automatically create the specified object storage users and buckets in the Ceph cluster.

    • users - a list of strings that contain user names to create for object storage.

    • buckets - a list of strings that contain bucket names to create for object storage.

    zone Available since 2.7.0

    Optional. Mutually exclusive with metadataPool and dataPool. Defines the Ceph Multisite zone where the object storage must be placed. Includes the name parameter that must be set to one of the zones items. For details, see Enable Multisite for Ceph RGW Object Storage.

    For example:

    cephClusterSpec:
      objectStorage:
        multisite:
          zones:
          - name: master-zone
          ...
        rgw:
          zone:
            name: master-zone
    

    For configuration example, see Enable Ceph RGW Object Storage.

    Multisite parameters

    Parameter

    Description

    realms Available since 2.7.0, Technical Preview

    List of realms to use, represents the realm namespaces. Includes the following parameters:

    • name - the realm name.

    • pullEndpoint - optional, required only when the master zone is in a different storage cluster. The endpoint, access key, and system key of the system user from the realm to pull from. Includes the following parameters:

      • endpoint - the endpoint of the master zone in the master zone group.

      • accessKey - the access key of the system user from the realm to pull from.

      • secretKey - the system key of the system user from the realm to pull from.

    zoneGroups Available since 2.7.0, Technical Preview

    The list of zone groups for realms. Includes the following parameters:

    • name - the zone group name.

    • realmName - the realm namespace name to which the zone group belongs.

    zones Available since 2.7.0, Technical Preview

    The list of zones used within one zone group. Includes the following parameters:

    • name - the zone name.

    • metadataPool - the settings used to create the Object Storage metadata pools. Must use replication. For details, see Pool parameters.

    • dataPool - the settings to create the Object Storage data pool. Can use replication or erasure coding. For details, see Pool parameters.

    • zoneGroupName - the zone group name.

    For configuration example, see Enable Multisite for Ceph RGW Object Storage.

  3. Select from the following options:

    • If you are bootstrapping a management cluster, save the updated KaaSCephCluster template to the templates/bm/kaascephcluster.yaml.template file and proceed with the bootstrap.

    • If you are creating a managed cluster, save the updated KaaSCephCluster template to the corresponding file and proceed with the managed cluster creation.

    • If you are configuring KaaSCephCluster of an existing management cluster, run the following command:

      kubectl apply
      
    • If you are configuring KaaSCephCluster of an existing managed cluster, run the following command:

      kubectl apply -n <managedClusterProjectName>
      

      Substitute <managedClusterProjectName> with the corresponding value.

Enable Ceph tolerations and resources management

Caution

This feature is available starting from the Container Cloud release 2.6.0.

Caution

This feature is available as Technology Preview. Use such configuration for testing and evaluation purposes only. For details about the Mirantis Technology Preview support scope, see the Preface section of this guide.

This section describes how to configure Ceph controller to manage Ceph nodes resources.


Note

This document does not provide any specific recommendations on requests and limits for Ceph resources. It describes a native Helm release-based configuration of Ceph resources for any cluster with Mirantis Container Cloud or Mirantis OpenStack for Kubernetes (MOS).

You can configure Ceph Controller to manage Ceph resources by specifying their requirements and constraints. To configure the resources consumption for the Ceph nodes, consider the following options that are based on different Helm release configuration values:

  • Configuring tolerations for taint nodes for the Ceph Monitor, Ceph Manager, and Ceph OSD daemons.

  • Configuring node resource requests or limits for the Ceph daemons and for each Ceph OSD device class, such as HDD, SSD, or NVMe. For details, see Managing Resources for Containers.

Warning

Mirantis recommends enabling Ceph resources management when bootstrapping a new management or managed cluster with Ceph. Enabling Ceph resources management on an existing Ceph cluster may cause downtime.

To enable Ceph tolerations and resources management:

  1. Open templates/bm/kaascephcluster.yaml.template for editing.

  2. In the ceph-controller section of spec.providerSpec.value.helmReleases, specify the hyperconverge.tolerations or hyperconverge.resources parameters as required:

    Ceph resource management parameters

    Parameter

    Description

    Example values

    tolerations

    Specifies tolerations for taint nodes.

    hyperconverge:
      tolerations:
        # Array of correct k8s
        # toleration rules for
        # mon/mgr/osd daemon pods
        mon:
        mgr:
        osd:
    

    Note

    Use vertical bars after tolerations keys. The mon, mgr, and osd values are strings that contain YAML-formatted arrays of Kubernetes toleration rules.

    hyperconverge:
      tolerations:
        mon: |
          - effect: NoSchedule
            key: node-role.kubernetes.io/controlplane
            operator: Exists
        mgr: |
          - effect: NoSchedule
            key: node-role.kubernetes.io/controlplane
            operator: Exists
        osd: |
          - effect: NoSchedule
            key: node-role.kubernetes.io/controlplane
            operator: Exists
    

    resources

    Specifies resources requests or limits. The hdd, ssd, and nvme resource requirements handle only the Ceph OSDs with a defined device class.

    Note

    Use vertical bars after the resources requirement keys. The mon, mgr, osd, hdd, ssd, and nvme values are strings that contain YAML-formatted maps of requests and limits for each component type.

    hyperconverge:
      resources:
        # resources requirements
        # for Ceph daemons
        mon:
        mgr:
        osd:
        # resources requirements
        # for Ceph OSD device
        # classes
        hdd:
        ssd:
        nvme:
    
    hyperconverge:
      resources:
        mon: |
          requests:
            memory: 1Gi
            cpu: 2
          limits:
            memory: 2Gi
            cpu: 3
        mgr: |
          requests:
            memory: 1Gi
            cpu: 2
          limits:
            memory: 2Gi
            cpu: 3
        osd: |
          requests:
            memory: 1Gi
            cpu: 2
          limits:
            memory: 2Gi
            cpu: 3
        hdd: |
          requests:
            memory: 1Gi
            cpu: 2
          limits:
            memory: 2Gi
            cpu: 3
        ssd: |
          requests:
            memory: 1Gi
            cpu: 2
          limits:
            memory: 2Gi
            cpu: 3
        nvme: |
          requests:
            memory: 1Gi
            cpu: 2
          limits:
            memory: 2Gi
            cpu: 3
    
  3. Save the reconfigured cluster resource and wait for the ceph-controller Helm release upgrade. It will recreate Ceph Monitors, Ceph Managers, or Ceph OSDs according to the specified hyperconverge configuration. The Ceph cluster may experience a short downtime.

Once done, proceed to Verify Ceph tolerations and resources management.

Verify Ceph tolerations and resources management

Caution

This feature is available starting from the Container Cloud release 2.6.0.

Caution

This feature is available as Technology Preview. Use such configuration for testing and evaluation purposes only. For details about the Mirantis Technology Preview support scope, see the Preface section of this guide.

After you enable Ceph resources management as described in Enable Ceph tolerations and resources management, perform the steps below to verify that the configured tolerations, requests, or limits have been successfully specified in the Ceph cluster.

To verify Ceph tolerations and resources management:

  • To verify that the required tolerations are specified in the Ceph cluster, inspect the output of the following commands:

    kubectl -n rook-ceph get $(kubectl -n rook-ceph get cephcluster -o name) -o jsonpath='{.spec.placement.mon.tolerations}'
    kubectl -n rook-ceph get $(kubectl -n rook-ceph get cephcluster -o name) -o jsonpath='{.spec.placement.mgr.tolerations}'
    kubectl -n rook-ceph get $(kubectl -n rook-ceph get cephcluster -o name) -o jsonpath='{.spec.placement.osd.tolerations}'
    
  • To verify that the required resources requests or limits are specified for the Ceph mon, mgr, or osd daemons, inspect the output of the following command:

    kubectl -n rook-ceph get $(kubectl -n rook-ceph get cephcluster -o name) -o jsonpath='{.spec.resources}'
    
  • To verify that the required resources requests or limits are specified for the Ceph OSDs hdd, ssd, or nvme device classes, perform the following steps:

    1. Identify which Ceph OSDs belong to the <deviceClass> device class in question:

      kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o name) -- ceph osd crush class ls-osd <deviceClass>
      
    2. For each <osdID> obtained in the previous step, run the following command. Compare the output with the desired result.

      kubectl -n rook-ceph get deploy rook-ceph-osd-<osdID> -o jsonpath='{.spec.template.spec.containers[].resources}'
      

Troubleshooting

This section provides solutions to the issues that can occur while operating a Mirantis Container Cloud management, regional, or managed cluster.

Collect cluster logs

While operating your management, regional, or managed cluster, you may require collecting and inspecting the cluster logs to analyze cluster events or troubleshoot issues. For the logs structure, see Deployment Guide: Collect the bootstrap logs.

To collect cluster logs:

  1. Choose from the following options:

    • If you did not delete the kaas-bootstrap folder from the bootstrap node, log in to the bootstrap node.

    • If you deleted the kaas-bootstrap folder:

      1. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

      2. Download and run the Container Cloud bootstrap script:

        wget https://binary.mirantis.com/releases/get_container_cloud.sh
        
        chmod 0755 get_container_cloud.sh
        
        ./get_container_cloud.sh
        
  2. Obtain kubeconfig of the required cluster. The management or regional cluster kubeconfig files are created during the last stage of the management or regional cluster bootstrap. To obtain a managed cluster kubeconfig, see Connect to a Mirantis Container Cloud cluster.

  3. Obtain the private SSH key of the required cluster. For a management or regional cluster, ssh_key is created during bootstrap of the corresponding cluster in the same directory as the bootstrap script. For a managed cluster, this is an SSH key added in the Container Cloud web UI before the managed cluster creation.

  4. Depending on the cluster type that you require logs from, run the corresponding command:

    • For a management cluster:

      kaas collect logs --management-kubeconfig <pathToMgmtClusterKubeconfig> \
      --key-file <pathToMgmtClusterPrivateSshKey> \
      --cluster-name <clusterName> --cluster-namespace <clusterProject>
      
    • For a regional cluster:

      kaas collect logs --management-kubeconfig <pathToMgmtClusterKubeconfig> \
      --key-file <pathToRegionalClusterSshKey> --kubeconfig <pathToRegionalClusterKubeconfig> \
      --cluster-name <clusterName> --cluster-namespace <clusterProject>
      
    • For a managed cluster:

      kaas collect logs --management-kubeconfig <pathToMgmtClusterKubeconfig> \
      --key-file <pathToManagedClusterSshKey> --kubeconfig <pathToManagedClusterKubeconfig> \
      --cluster-name <clusterName> --cluster-namespace <clusterProject>
      

    Substitute the parameters enclosed in angle brackets with the corresponding values of your cluster.

    Optionally, add the --output-dir option to specify a directory path for the collected logs. The default value is logs/, for example, logs/<clusterName>/events.log.
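
    For example, to collect the managed cluster logs into a custom directory, extend the command above with --output-dir (the path is illustrative):

      kaas collect logs --management-kubeconfig <pathToMgmtClusterKubeconfig> \
      --key-file <pathToManagedClusterSshKey> --kubeconfig <pathToManagedClusterKubeconfig> \
      --cluster-name <clusterName> --cluster-namespace <clusterProject> \
      --output-dir /tmp/<clusterName>-logs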

Troubleshoot Ceph

This section provides solutions to the issues that may occur during Ceph usage.

Ceph disaster recovery

This section describes how to recover a failed or accidentally removed Ceph cluster in the following cases:

  • If the Ceph controller underlying a running Rook Ceph cluster has failed and you want to install a new Ceph controller Helm release and recover the failed Ceph cluster onto the new Ceph controller.

  • To migrate the data of an existing Ceph cluster to a new Container Cloud or Mirantis OpenStack for Kubernetes (MOS) deployment if downtime can be tolerated.

Consider the common state of a failed or removed Ceph cluster:

  • The rook-ceph namespace does not contain pods or they are in the Terminating state.

  • The rook-ceph and/or ceph-lcm-mirantis namespaces are in the Terminating state.

  • The ceph-operator is in the FAILED state:

    • For Container Cloud: the state of the ceph-operator Helm release in the management HelmBundle, such as default/kaas-mgmt, has switched from DEPLOYED to FAILED.

    • For MOS: the state of the osh-system/ceph-operator HelmBundle, or a related namespace, has switched from DEPLOYED to FAILED.

  • The Rook CephCluster, CephBlockPool, CephObjectStore CRs in the rook-ceph namespace cannot be found or have the deletionTimestamp parameter in the metadata section.

Note

Prior to recovering the Ceph cluster, verify that your deployment meets the following prerequisites:

  1. The Ceph cluster fsid exists.

  2. The Ceph cluster Monitor keyrings exist.

  3. The Ceph cluster devices exist and include the data previously handled by Ceph OSDs.

Overview of the recovery procedure workflow:

  1. Create a backup of the remaining data and resources.

  2. Clean up the failed or removed ceph-operator Helm release.

  3. Deploy a new ceph-operator Helm release with the previously used KaaSCephCluster and one Ceph Monitor.

  4. Replace the ceph-mon data with the old cluster data.

  5. Replace fsid in secrets/rook-ceph-mon with the old one.

  6. Fix the Monitor map in the ceph-mon database.

  7. Fix the Ceph Monitor authentication key and disable authentication.

  8. Start the restored cluster and inspect the recovery.

  9. Fix the admin authentication key and enable authentication.

  10. Restart the cluster.

To recover a failed or removed Ceph cluster:

  1. Back up the remaining resources. Skip the commands for the resources that have already been removed:

    kubectl -n rook-ceph get cephcluster <clusterName> -o yaml > backup/cephcluster.yaml
    # perform this for each cephblockpool
    kubectl -n rook-ceph get cephblockpool <cephBlockPool-i> -o yaml > backup/<cephBlockPool-i>.yaml
    # perform this for each client
    kubectl -n rook-ceph get cephclient <cephclient-i> -o yaml > backup/<cephclient-i>.yaml
    kubectl -n rook-ceph get cephobjectstore <cephObjectStoreName> -o yaml > backup/<cephObjectStoreName>.yaml
    # perform this for each secret
    kubectl -n rook-ceph get secret <secret-i> -o yaml > backup/<secret-i>.yaml
    # perform this for each configMap
    kubectl -n rook-ceph get cm <cm-i> -o yaml > backup/<cm-i>.yaml
    
  2. SSH to each node where the Ceph Monitors or Ceph OSDs were placed before the failure and back up the valuable data:

    mv /var/lib/rook /var/lib/rook.backup
    mv /etc/ceph /etc/ceph.backup
    mv /etc/rook /etc/rook.backup
    

    Once done, close the SSH connection.

  3. Clean up the previous installation of ceph-operator. For details, see Rook documentation: Cleaning up a cluster.

    1. Delete the ceph-lcm-mirantis/ceph-controller deployment:

      kubectl -n ceph-lcm-mirantis delete deployment ceph-controller
      
    2. Delete all deployments, DaemonSets, and jobs from the rook-ceph namespace, if any:

      kubectl -n rook-ceph delete deployment --all
      kubectl -n rook-ceph delete daemonset --all
      kubectl -n rook-ceph delete job --all
      
    3. Edit the MiraCeph and MiraCephLog CRs of the ceph-lcm-mirantis namespace and remove the finalizer parameter from the metadata section:

      kubectl -n ceph-lcm-mirantis edit miraceph
      kubectl -n ceph-lcm-mirantis edit miracephlog
      
    4. Edit the CephCluster, CephBlockPool, CephClient, and CephObjectStore CRs of the rook-ceph namespace and remove the finalizer parameter from the metadata section:

      kubectl -n rook-ceph edit cephclusters
      kubectl -n rook-ceph edit cephblockpools
      kubectl -n rook-ceph edit cephclients
      kubectl -n rook-ceph edit cephobjectstores
      kubectl -n rook-ceph edit cephobjectusers
      
    5. Once you clean up all resources related to the Ceph release, open the Cluster CR for editing:

      kubectl -n <projectName> edit cluster <clusterName>
      

      Substitute <projectName> with default for the management cluster or with a related project name for the managed cluster.

    6. Remove the ceph-controller Helm release item from the spec.providerSpec.value.helmReleases array and save the Cluster CR:

      - name: ceph-controller
        values: {}
      
    7. Verify that ceph-controller has disappeared from the corresponding HelmBundle:

      kubectl -n <projectName> get helmbundle -o yaml
      
  4. Open the KaaSCephCluster CR of the related management or managed cluster for editing:

    kubectl -n <projectName> edit kaascephcluster
    

    Substitute <projectName> with default for the management cluster or with a related project name for the managed cluster.

  5. Edit the node roles so that the entire nodes spec contains only one mon role. Save KaaSCephCluster after editing.
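
    The following is a hedged sketch only, assuming that the nodes section of cephClusterSpec maps node names to lists of roles (the node names are illustrative):

    spec:
      cephClusterSpec:
        nodes:
          kaas-node-1:
            roles:
            - mon
            - mgr
          kaas-node-2:
            roles:
            - mgr
          ...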

  6. Open the Cluster CR for editing:

    kubectl -n <projectName> edit cluster <clusterName>
    

    Substitute <projectName> with default for the management cluster or with a related project name for the managed cluster.

  7. Add ceph-controller back to spec.providerSpec.value.helmReleases to restore the ceph-controller Helm release and save the Cluster CR:

    - name: ceph-controller
      values: {}
    
  8. Verify that the ceph-controller Helm release is deployed:

    1. Inspect the Rook operator logs and wait until the orchestration has settled:

      kubectl -n rook-ceph logs -l app=rook-ceph-operator
      
    2. Verify that the rook-ceph-mon-a, rook-ceph-mgr-a, and all auxiliary pods in the rook-ceph namespace are up and running, and that no rook-ceph-osd-<ID>-xxxxxx pods are running:

      kubectl -n rook-ceph get pod
      
    3. Verify the Ceph state. The output must indicate that one mon and one mgr are running, all Ceph OSDs are down, and all PGs are in the Unknown state.

      kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph -s
      

      Note

      Rook should not start any Ceph OSD daemon because all devices belong to the old cluster that has a different fsid. To verify the Ceph OSD daemons, inspect the osd-prepare pods logs:

      kubectl -n rook-ceph logs -l app=rook-ceph-osd-prepare
      
  9. Connect to the terminal of the rook-ceph-mon-a pod:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l app=rook-ceph-mon -o jsonpath='{.items[0].metadata.name}') bash
    
  10. Output the keyring file and save it for further usage:

    cat /etc/ceph/keyring-store/keyring
    exit
    
  11. Obtain and save the nodeName of mon-a for further usage:

    kubectl -n rook-ceph get pod $(kubectl -n rook-ceph get pod \
    -l app=rook-ceph-mon -o jsonpath='{.items[0].metadata.name}') -o jsonpath='{.spec.nodeName}'
    
  12. Obtain and save the cephImage used in the Ceph cluster for further usage:

    kubectl -n ceph-lcm-mirantis get cm ccsettings -o jsonpath='{.data.cephImage}'
    
  13. Stop the Rook operator and scale the deployment replicas to 0:

    kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
    
  14. Remove the Rook deployments generated with the Rook operator:

    kubectl -n rook-ceph delete deploy -l app=rook-ceph-mon
    kubectl -n rook-ceph delete deploy -l app=rook-ceph-mgr
    kubectl -n rook-ceph delete deploy -l app=rook-ceph-osd
    kubectl -n rook-ceph delete deploy -l app=rook-ceph-crashcollector
    
  15. Using the saved nodeName, SSH to the host where rook-ceph-mon-a in the new Kubernetes cluster is placed and perform the following steps:

    1. Remove /var/lib/rook/mon-a or copy it to another folder:

      mv /var/lib/rook/mon-a /var/lib/rook/mon-a.new
      
    2. Pick a healthy rook-ceph-mon-<ID> directory (/var/lib/rook.backup/mon-<ID>) from the previous backup and copy it to /var/lib/rook/mon-a:

      cp -rp /var/lib/rook.backup/mon-<ID> /var/lib/rook/mon-a
      

      Substitute ID with any healthy mon node ID of the old cluster.

    3. Replace /var/lib/rook/mon-a/keyring with the previously saved keyring, preserving only the [mon.] section. Remove the [client.admin] section.

    4. Run the cephImage Docker container using the previously saved cephImage image:

      docker run -it --rm -v /var/lib/rook:/var/lib/rook <cephImage> bash
      
    5. Inside the container, create /etc/ceph/ceph.conf for a stable operation of ceph-mon:

      touch /etc/ceph/ceph.conf
      
    6. Change the directory to /var/lib/rook and edit monmap by replacing the existing mon hosts with the new mon-a endpoints:

      cd /var/lib/rook
      rm /var/lib/rook/mon-a/data/store.db/LOCK # make sure the quorum lock file does not exist
      ceph-mon --extract-monmap monmap --mon-data ./mon-a/data  # Extract monmap from old ceph-mon db and save as monmap
      monmaptool --print monmap  # Print the monmap content, which reflects the old cluster ceph-mon configuration.
      monmaptool --rm a monmap  # Delete `a` from monmap.
      monmaptool --rm b monmap  # Repeat, and delete `b` from monmap.
      monmaptool --rm c monmap  # Repeat this pattern until all the old ceph-mon entries are removed from monmap.
      monmaptool --addv a [v2:<nodeIP>:3300,v1:<nodeIP>:6789] monmap   # Add the new rook-ceph-mon-a endpoints.
      ceph-mon --inject-monmap monmap --mon-data ./mon-a/data  # Replace monmap in ceph-mon db with our modified version.
      rm monmap
      exit
      

      Substitute <nodeIP> with the IP address of the current <nodeName> node.
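
      If needed, you can look up the IP address of the saved nodeName from any machine with kubectl access to the cluster, for example:

      kubectl get node <nodeName> -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'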

    7. Close the SSH connection.

  16. Change fsid to the original one to run Rook as an old cluster:

    kubectl -n rook-ceph edit secret/rook-ceph-mon
    

    Note

    The fsid is base64 encoded and must not contain a trailing carriage return. For example:

    echo -n a811f99a-d865-46b7-8f2c-f94c064e4356 | base64  # Replace with the fsid from the old cluster.
    
  17. Scale the ceph-lcm-mirantis/ceph-controller deployment replicas to 0:

    kubectl -n ceph-lcm-mirantis scale deployment ceph-controller --replicas 0
    
  18. Disable authentication:

    1. Open the cm/rook-config-override ConfigMap for editing:

      kubectl -n rook-ceph edit cm/rook-config-override
      
    2. Add the following content:

      data:
        config: |
          [global]
          ...
          auth cluster required = none
          auth service required = none
          auth client required = none
          auth supported = none
      
  19. Start the Rook operator by scaling its deployment replicas to 1:

    kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
    
  20. Inspect the Rook operator logs and wait until the orchestration has settled:

    kubectl -n rook-ceph logs -l app=rook-ceph-operator
    
  21. Verify that the rook-ceph-mon-a, rook-ceph-mgr-a, and all auxiliary pods in the rook-ceph namespace are up and running, and that the number of running rook-ceph-osd-<ID>-xxxxxx pods is greater than zero:

    kubectl -n rook-ceph get pod
    
  22. Verify the Ceph state. The output must indicate that one mon, one mgr, and all Ceph OSDs are up and running and all PGs are either in the Active or Degraded state:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph -s
    
  23. Enter the ceph-tools pod and import the authentication key:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') bash
    vi key
    [paste the keyring content saved earlier, preserving only the `[client.admin]` section]
    ceph auth import -i key
    rm key
    exit
    
  24. Stop the Rook operator by scaling the deployment to 0 replicas:

    kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 0
    
  25. Re-enable authentication:

    1. Open the cm/rook-config-override ConfigMap for editing:

      kubectl -n rook-ceph edit cm/rook-config-override
      
    2. Remove the following content:

      data:
        config: |
          [global]
          ...
          auth cluster required = none
          auth service required = none
          auth client required = none
          auth supported = none
      
  26. Remove all Rook deployments generated with the Rook operator:

    kubectl -n rook-ceph delete deploy -l app=rook-ceph-mon
    kubectl -n rook-ceph delete deploy -l app=rook-ceph-mgr
    kubectl -n rook-ceph delete deploy -l app=rook-ceph-osd
    kubectl -n rook-ceph delete deploy -l app=rook-ceph-crashcollector
    
  27. Start the Ceph controller by scaling its deployment replicas to 1:

    kubectl -n ceph-lcm-mirantis scale deployment ceph-controller --replicas 1
    
  28. Start the Rook operator by scaling its deployment replicas to 1:

    kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas 1
    
  29. Inspect the Rook operator logs and wait until the orchestration has settled:

    kubectl -n rook-ceph logs -l app=rook-ceph-operator
    
  30. Verify that the rook-ceph-mon-a, rook-ceph-mgr-a, and all auxiliary pods in the rook-ceph namespace are up and running, and that the number of running rook-ceph-osd-<ID>-xxxxxx pods is greater than zero:

    kubectl -n rook-ceph get pod
    
  31. Verify the Ceph state. The output must indicate that one mon, one mgr, and all Ceph OSDs are up and running, and that the overall stored data size equals the old cluster data size:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph -s
    
  32. Edit the MiraCeph CR and add two more mon and mgr roles to the corresponding nodes:

    kubectl -n ceph-lcm-mirantis edit miraceph
    
  33. Inspect the Rook namespace and wait until all Ceph Monitors are in the Running state:

    kubectl -n rook-ceph get pod -l app=rook-ceph-mon
    
  34. Verify the Ceph state. The output must indicate that three mon (in quorum), one mgr, and all Ceph OSDs are up and running, and that the overall stored data size equals the old cluster data size:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}') -- ceph -s
    

Once done, the data from the failed or removed Ceph cluster is restored and ready to use.

Mirantis Container Cloud API

Warning

This section is intended only for advanced Infrastructure Operators who are familiar with Kubernetes Cluster API.

Mirantis currently supports only those Mirantis Container Cloud API features that are implemented in the Container Cloud web UI. Use other Container Cloud API features for testing and evaluation purposes only.

The Container Cloud APIs are implemented using the Kubernetes CustomResourceDefinitions (CRDs) that enable you to expand the Kubernetes API. Different types of resources are grouped in the dedicated files, such as cluster.yaml or machines.yaml.

This section contains descriptions and examples of the Container Cloud API resources for the bare metal cloud provider.

Note

The API documentation for the OpenStack, AWS, and VMware vSphere resources will be added in the upcoming Container Cloud releases.

Public key resources

This section describes the PublicKey resource used in Mirantis Container Cloud API for all supported providers: OpenStack, AWS, and bare metal. This resource is used to provide SSH access to every machine of a Container Cloud cluster.

The Container Cloud PublicKey CR contains the following fields:

  • apiVersion

    API version of the object that is kaas.mirantis.com/v1alpha1

  • kind

    Object type that is PublicKey

  • metadata

    The metadata object field of the PublicKey resource contains the following fields:

    • name

      Name of the public key

    • namespace

      Project where the public key is created

  • spec

    The spec object field of the PublicKey resource contains the publicKey field that is an SSH public key value.

The PublicKey resource example:

apiVersion: kaas.mirantis.com/v1alpha1
kind: PublicKey
metadata:
  name: demokey
  namespace: test
spec:
  publicKey: |
    ssh-rsa AAAAB3NzaC1yc2EAAAA…

Bare metal resources

This section contains descriptions and examples of the baremetal-based Kubernetes resources for Mirantis Container Cloud.

Cluster

This section describes the Cluster resource used in the Mirantis Container Cloud API that describes the cluster-level parameters.

For demonstration purposes, the Container Cloud Cluster custom resource (CR) is split into the following major sections:

Warning

The fields of the Cluster resource that are located under the status section including providerStatus are available for viewing only. They are automatically generated by the bare metal cloud provider and must not be modified using Container Cloud API.

metadata

The Container Cloud Cluster CR contains the following fields:

  • apiVersion

    API version of the object that is cluster.k8s.io/v1alpha1.

  • kind

    Object type that is Cluster.

The metadata object field of the Cluster resource contains the following fields:

  • name

    Name of a cluster. A managed cluster name is specified under the Cluster Name field in the Create Cluster wizard of the Container Cloud web UI. The management and regional cluster names are configurable in the bootstrap script.

  • namespace

    Project in which the cluster object was created. The management and regional clusters are created in the default project. The managed cluster project equals the selected project name.

  • labels

    Key-value pairs attached to the object:

    • kaas.mirantis.com/provider

      Provider type that is baremetal for the baremetal-based clusters.

    • kaas.mirantis.com/region

      Region name. The default region name for the management cluster is region-one. For the regional cluster, it is configurable using the REGION parameter in the bootstrap script.

Configuration example:

apiVersion: cluster.k8s.io/v1alpha1
kind: Cluster
metadata:
  name: demo
  namespace: test
  labels:
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one

spec:providerSpec

The spec object field of the Cluster object represents the BaremetalClusterProviderSpec subresource that contains a complete description of the desired bare metal cluster state and all details to create the cluster-level resources. It also contains the fields required for LCM deployment and integration of the Container Cloud components.

The providerSpec object field is custom for each cloud provider and contains the following generic fields for the bare metal provider:

  • apiVersion

    API version of the object that is baremetal.k8s.io/v1alpha1

  • kind

    Object type that is BaremetalClusterProviderSpec

Configuration example:

spec:
  ...
  providerSpec:
    value:
      apiVersion: baremetal.k8s.io/v1alpha1
      kind: BaremetalClusterProviderSpec

spec:providerSpec common

The providerSpec object field of the Cluster resource contains the following common fields for all Container Cloud providers:

  • publicKeys

    List of the SSH public key references

  • release

    Name of the ClusterRelease object to install on a cluster

  • helmReleases

    List of the enabled Helm releases from the Release object that run on a Container Cloud cluster

Configuration example:

spec:
  ...
  providerSpec:
    value:
      publicKeys:
        - name: bootstrap-key
      release: ucp-5-7-0-3-3-3-tp11
      helmReleases:
        - name: metallb
          values:
            configInline:
              address-pools:
                - addresses:
                  - 10.0.0.101-10.0.0.120
                  name: default
                  protocol: layer2
        ...
        - name: stacklight

spec:providerSpec configuration

This section represents the Container Cloud components that are enabled on a cluster. It contains the following fields:

  • management

    Configuration for the management cluster components:

    • enabled

      Management cluster enabled (true) or disabled (false).

    • helmReleases

      List of the management cluster Helm releases that will be installed on the cluster. A Helm release includes the name and values fields. The specified values will be merged with relevant Helm release values of the management cluster in the Release object.

  • regional

    List of regional clusters components on the Container Cloud cluster for each configured provider available for a specific region:

    • provider

      Provider type that is baremetal.

    • helmReleases

      List of the regional Helm releases that will be installed on the cluster. A Helm release includes the name and values fields. The specified values will be merged with relevant regional Helm release values in the Release object.

  • release

    Name of the Container Cloud Release object.

Configuration example:

spec:
  ...
  providerSpec:
     value:
       kaas:
         management:
           enabled: true
           helmReleases:
             - name: kaas-ui
               values:
                 serviceConfig:
                   server: https://10.0.0.117
         regional:
           - helmReleases:
             - name: baremetal-provider
               values: {}
             provider: baremetal
           - helmReleases:
             - name: byo-provider
               values: {}
             provider: byo
         release: kaas-2-0-0

status:providerStatus common

Must not be modified using API

The common providerStatus object field of the Cluster resource contains the following fields:

  • apiVersion

    API version of the object that is baremetal.k8s.io/v1alpha1

  • kind

    Object type that is BaremetalClusterProviderStatus

  • loadBalancerHost

    Load balancer IP or host name of the Container Cloud cluster

  • apiServerCertificate

    Server certificate of Kubernetes API

  • ucpDashboard

    URL of the Mirantis Kubernetes Engine (MKE) Dashboard

Configuration example:

status:
  providerStatus:
    apiVersion: baremetal.k8s.io/v1alpha1
    kind: BaremetalClusterProviderStatus
    loadBalancerHost: 10.0.0.100
    apiServerCertificate: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS…
    ucpDashboard: https://10.0.0.100:6443

status:providerStatus for cluster readiness

Must not be modified using API

The providerStatus object field of the Cluster resource that reflects the cluster readiness contains the following fields:

  • persistentVolumesProviderProvisioned

    Status of the persistent volumes provisioning. Prevents the Helm releases that require persistent volumes from being installed until some default StorageClass is added to the Cluster object.

  • helm

    Details about the deployed Helm releases:

    • ready

      Status of the deployed Helm releases. The true value indicates that all Helm releases are deployed successfully.

    • releases

      List of the enabled Helm releases that run on the Container Cloud cluster:

      • releaseStatuses

        List of the deployed Helm releases. The success: true field indicates that the release is deployed successfully.

      • stacklight

        Status of the StackLight deployment. Contains URLs of all StackLight components. The success: true field indicates that StackLight is deployed successfully.

  • nodes

    Details about the cluster nodes:

    • ready

      Number of nodes that completed the deployment or update.

    • requested

      Total number of nodes. If the number of ready nodes does not match the number of requested nodes, it means that a cluster is being currently deployed or updated.

  • notReadyObjects

    The list of the services, deployments, and statefulsets Kubernetes objects that are not in the Ready state yet. A service is not ready if its external address has not been provisioned yet. A deployment or statefulset is not ready if the number of ready replicas is not equal to the number of desired replicas. Both objects contain the name and namespace of the object and the number of ready and desired replicas (for controllers). If all objects are ready, the notReadyObjects list is empty.

Configuration example:

status:
  providerStatus:
    persistentVolumesProviderProvisioned: true
    helm:
      ready: true
      releases:
        releaseStatuses:
          iam:
            success: true
          ...
        stacklight:
          alerta:
            url: http://10.0.0.106
          alertmanager:
            url: http://10.0.0.107
          grafana:
            url: http://10.0.0.108
          kibana:
            url: http://10.0.0.109
          prometheus:
            url: http://10.0.0.110
          success: true
    nodes:
      ready: 3
      requested: 3
    notReadyObjects:
      services:
        - name: testservice
          namespace: default
      deployments:
        - name: baremetal-provider
          namespace: kaas
          replicas: 3
          readyReplicas: 2
      statefulsets: {}

status:providerStatus for Open ID Connect

Must not be modified using API

The oidc section of the providerStatus object field in the Cluster resource reflects the Open ID Connect configuration details. It contains the required details to obtain a token for a Container Cloud cluster and consists of the following fields:

  • certificate

    Base64-encoded OIDC certificate.

  • clientId

    Client ID for OIDC requests.

  • groupsClaim

    Name of an OIDC groups claim.

  • issuerUrl

    Issuer URL to obtain the representation of the realm.

  • ready

    OIDC status relevance. If true, the status corresponds to the LCMCluster OIDC configuration.

Configuration example:

status:
  providerStatus:
    oidc:
      certificate: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUREekNDQWZ...
      clientId: kaas
      groupsClaim: iam_roles
      issuerUrl: https://10.0.0.117/auth/realms/iam
      ready: true

status:providerStatus for cluster releases

Must not be modified using API

The releaseRefs section of the providerStatus object field in the Cluster resource provides the current Cluster release version as well as the one available for upgrade. It contains the following fields:

  • current

    Details of the currently installed Cluster release:

    • lcmType

      Type of the Cluster release (ucp).

    • name

      Name of the Cluster release resource.

    • version

      Version of the Cluster release.

    • unsupportedSinceKaaSVersion

      Indicates that a Container Cloud release newer than the current one exists and that it does not support the current Cluster release.

  • available

    List of the releases available for upgrade. Contains the name and version fields.

Configuration example:

status:
  providerStatus:
    releaseRefs:
      available:
        - name: ucp-5-5-0-3-4-0-dev
          version: 5.5.0+3.4.0-dev
      current:
        lcmType: ucp
        name: ucp-5-4-0-3-3-0-beta1
        version: 5.4.0+3.3.0-beta1

Machine

This section describes the Machine resource used in Mirantis Container Cloud API for bare metal provider. The Machine resource describes the machine-level parameters.

For demonstration purposes, the Container Cloud Machine custom resource (CR) is split into the following major sections:

metadata

The Container Cloud Machine CR contains the following fields:

  • apiVersion

    API version of the object that is cluster.k8s.io/v1alpha1.

  • kind

    Object type that is Machine.

The metadata object field of the Machine resource contains the following fields:

  • name

    Name of the Machine object.

  • namespace

    Project in which the Machine object is created.

  • annotations

    Key-value pair to attach arbitrary metadata to the object:

    • metal3.io/BareMetalHost

      Annotation attached to the Machine object to reference the corresponding BareMetalHost object in the <BareMetalHostProjectName/BareMetalHostName> format.

  • labels

    Key-value pairs that are attached to the object:

    • kaas.mirantis.com/provider

      Provider type that matches the provider type in the Cluster object and must be baremetal.

    • kaas.mirantis.com/region

      Region name that matches the region name in the Cluster object.

    • cluster.sigs.k8s.io/cluster-name

      Cluster name that the Machine object is linked to.

    • cluster.sigs.k8s.io/control-plane

      For the control plane role of a machine, this label contains any value, for example, "true". For the worker role, this label is absent or does not contain any value.

Configuration example:

apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: example-control-plane
  namespace: example-ns
  annotations:
    metal3.io/BareMetalHost: default/master-0
  labels:
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
    cluster.sigs.k8s.io/cluster-name: example-cluster
    cluster.sigs.k8s.io/control-plane: "true" # remove for worker

spec:providerSpec for instance configuration

The spec object field of the Machine object represents the BareMetalMachineProviderSpec subresource with all required details to create a bare metal instance. It contains the following fields:

  • apiVersion

    API version of the object that is baremetal.k8s.io/v1alpha1.

  • kind

    Object type that is BareMetalMachineProviderSpec.

  • bareMetalHostProfile

    Configuration profile of a bare metal host:

    • name

      Name of a bare metal host profile

    • namespace

      Project in which the bare metal host profile is created.

  • l2TemplateIfMappingOverride

    If specified, overrides the interface mapping value for the corresponding L2Template object.

  • l2TemplateSelector

    If specified, contains the name (first priority) or label of the L2 template that will be applied during a machine creation. The l2TemplateSelector field is copied from the Machine providerSpec object to the IpamHost object only once, during a machine creation. To modify l2TemplateSelector after creation of a Machine CR, edit the IpamHost object.

  • hostSelector

    Specifies the matching criteria for labels on the bare metal hosts. Limits the set of the BareMetalHost objects considered for claiming for the Machine object. The following selector labels can be added when creating a machine using the Container Cloud web UI:

    • hostlabel.bm.kaas.mirantis.com/controlplane

    • hostlabel.bm.kaas.mirantis.com/worker

    • hostlabel.bm.kaas.mirantis.com/storage

    Any custom label that is assigned to one or more bare metal hosts using API can be used as a host selector. If the BareMetalHost objects with the specified label are missing, the Machine object will not be deployed until at least one bare metal host with the specified label is available.

  • nodeLabels

    List of node labels to be attached to the corresponding node. Enables running of certain components on separate cluster nodes. The list of allowed node labels is defined in the providerStatus.releaseRef.current.allowedNodeLabels cluster status. Addition of any unsupported node label not from this list is restricted.

Configuration example:

spec:
  ...
  providerSpec:
    value:
      apiVersion: baremetal.k8s.io/v1alpha1
      kind: BareMetalMachineProviderSpec
      bareMetalHostProfile:
        name: default
        namespace: default
      l2TemplateIfMappingOverride:
        - eno1
        - enp0s0
      l2TemplateSelector:
        label: l2-template1-label-1
      hostSelector:
        matchLabels:
          baremetal: hw-master-0
      nodeLabels:
      - key: stacklight
        value: enabled

Machine status

The status object field of the Machine object represents the BareMetalMachineProviderStatus subresource that describes the current bare metal instance state and contains the following fields:

  • apiVersion

    API version of the object that is cluster.k8s.io/v1alpha1.

  • kind

    Object type that is BareMetalMachineProviderStatus.

  • hardware

    Provides the machine hardware information:

    • cpu

      Number of CPUs.

    • ram

      RAM capacity in GB.

    • storage

      List of hard drives mounted on the machine. Contains the disk name and size in GB.

  • status

    Represents the current status of a machine:

    • Provision

      Machine is yet to obtain a status.

    • Uninitialized

      Machine is yet to obtain a node IP address and hostname.

    • Pending

      Machine is yet to receive the deployment instructions. It is either not booted yet or waits for the LCM controller to be deployed.

    • Prepare

      Machine is running the Prepare phase, during which Docker images and packages are predownloaded.

    • Deploy

      Machine is processing the LCM controller instructions.

    • Reconfigure

      Some configurations are being updated on a machine.

    • Ready

      Machine is deployed and the supported Mirantis Kubernetes Engine (MKE) version is set.

Configuration example:

status:
  providerStatus:
    apiVersion: baremetal.k8s.io/v1alpha1
    kind: BareMetalMachineProviderStatus
    hardware:
      cpu: 11
      ram: 16
      storage:
      - name: /dev/vda
        size: 61
      - name: /dev/vdb
        size: 32
      - name: /dev/vdc
        size: 32
    status: Ready

BareMetalHostProfile

This section describes the BareMetalHostProfile resource used in Mirantis Container Cloud API to define how the storage devices and operating system are provisioned and configured.

For demonstration purposes, the Container Cloud BareMetalHostProfile custom resource (CR) is split into the following major sections:

metadata

The Container Cloud BareMetalHostProfile CR contains the following fields:

  • apiVersion

    API version of the object that is metal3.io/v1alpha1.

  • kind

    Object type that is BareMetalHostProfile.

  • metadata

    The metadata field contains the following subfields:

    • name

      Name of the bare metal host profile.

    • namespace

      Project in which the bare metal host profile was created.

Configuration example:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHostProfile
metadata:
  name: default
  namespace: default

spec

The spec field of BareMetalHostProfile object contains the fields to customize your hardware configuration:

  • devices

    List of definitions of the physical storage devices. To configure more than three storage devices per host, add additional devices to this list. Each device in the list may have one or more partitions defined by the list in the partitions field.

  • fileSystems

    List of file systems. Each file system can be created on top of either device, partition, or logical volume. If more file systems are required for additional devices, define them in this field.

  • logicalVolumes

    List of LVM logical volumes. Every logical volume belongs to a volume group from the volumeGroups list and has the sizeGiB attribute for size in gigabytes.

  • volumeGroups

    List of definitions of LVM volume groups. Each volume group contains one or more devices or partitions from the devices list.

  • preDeployScript

    Shell script that executes on a host before provisioning the target operating system inside the ramfs system.

  • postDeployScript

    Shell script that executes on a host after deploying the operating system inside the ramfs system that is chrooted to the target operating system.

  • grubConfig

    List of options passed to the Linux GRUB bootloader. Each string in the list defines one parameter.

  • kernelParameters:sysctl

    List of kernel sysctl options passed to /etc/sysctl.d/999-baremetal.conf during a bmh provisioning.

  • kernelParameters:modules

    List of kernel modules options passed to /etc/modprobe.d/{filename} during a bmh provisioning.

Configuration example:

spec:
  devices:
  - device:
      wipe: true
    partitions:
    - dev: ""
      name: bios_grub
      partflags:
      - bios_grub
      sizeGiB: 0.00390625
      ...
  - device:
      wipe: true
    partitions:
    - dev: ""
      name: lvm_lvp_part
  fileSystems:
  - fileSystem: vfat
    partition: config-2
  - fileSystem: vfat
    mountPoint: /boot/efi
    partition: uefi
    ...
  - fileSystem: ext4
    logicalVolume: lvp
    mountPoint: /mnt/local-volumes/
  logicalVolumes:
  - name: root
    sizeGiB: 0
    vg: lvm_root
  - name: lvp
    sizeGiB: 0
    vg: lvm_lvp
  postDeployScript: |
    #!/bin/bash -ex

    echo $(date) 'post_deploy_script done' >> /root/post_deploy_done
  preDeployScript: |
    #!/bin/bash -ex

    echo 'ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"' > /etc/udev/rules.d/60-ssd-scheduler.rules
    echo $(date) 'pre_deploy_script done' >> /root/pre_deploy_done
  volumeGroups:
  - devices:
    - partition: lvm_root_part
    name: lvm_root
  - devices:
    - partition: lvm_lvp_part
    name: lvm_lvp
  grubConfig:
    defaultGrubOptions:
    - GRUB_DISABLE_RECOVERY="true"
    - GRUB_PRELOAD_MODULES=lvm
    - GRUB_TIMEOUT=20
  kernelParameters:
    sysctl:
      kernel.panic: "900"
      kernel.dmesg_restrict: "1"
      kernel.core_uses_pid: "1"
      fs.file-max: "9223372036854775807"
      fs.aio-max-nr: "1048576"
      fs.inotify.max_user_instances: "4096"
      vm.max_map_count: "262144"
    modules:
      - filename: kvm_intel.conf
        content: |
          options kvm_intel nested=1

BareMetalHost

This section describes the BareMetalHost resource used in the Mirantis Container Cloud API. A BareMetalHost object is created for each Machine and contains all information about the machine hardware configuration, which is required to select a machine for deployment. When a machine is created, the provider assigns a BareMetalHost to that machine based on labels and the BareMetalHostProfile configuration.

For demonstration purposes, the Container Cloud BareMetalHost custom resource (CR) can be split into the following major sections:

BareMetalHost metadata

The Container Cloud BareMetalHost CR contains the following fields:

  • apiVersion

    API version of the object that is metal3.io/v1alpha1.

  • kind

    Object type that is BareMetalHost.

  • metadata

    The metadata field contains the following subfields:

    • name

      Name of the BareMetalHost object.

    • namespace

      Project in which the BareMetalHost object was created.

    • labels

      Labels used by the bare metal provider to find a matching BareMetalHost object to deploy a machine:

      • hostlabel.bm.kaas.mirantis.com/controlplane

      • hostlabel.bm.kaas.mirantis.com/worker

      • hostlabel.bm.kaas.mirantis.com/storage

      Each BareMetalHost object added using the Container Cloud web UI will be assigned one of these labels. If the BareMetalHost and Machine objects are created using API, any label may be used to match these objects for a bare metal host to deploy a machine.

Configuration example:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: master-0
  namespace: default
  labels:
    baremetal: hw-master-0

BareMetalHost configuration

The spec section for the BareMetalHost object defines the desired state of BareMetalHost. It contains the following fields:

  • bmc

    Details for communication with the Baseboard Management Controller (bmc) module on a host:

    • address

      URL for accessing bmc in the network.

    • credentialsName

      Name of the secret containing the bmc credentials. The secret requires the username and password keys in the Base64 encoding.

  • bootMACAddress

    MAC address for booting.

  • bootUEFI

    UEFI boot mode enabled (true) or disabled (false).

  • online

    Defines whether the server must be online after inspection.

Configuration example:

spec:
  bmc:
    address: 5.43.227.106:623
    credentialsName: master-0-bmc-secret
  bootMACAddress: 0c:c4:7a:a8:d3:44
  bootUEFI: true
  consumerRef:
    apiVersion: cluster.k8s.io/v1alpha1
    kind: Machine
    name: master-0
    namespace: default
  online: true

BareMetalHost status

The status field of the BareMetalHost object defines the current state of BareMetalHost. It contains the following fields:

  • errorMessage

    Last error message reported by the provisioning subsystem.

  • goodCredentials

    Last credentials that were validated.

  • hardware

    Hardware discovered on the host. Contains information about the storage, CPU, host name, firmware, and so on.

  • operationalStatus

    Status of the host:

    • OK

      Host is configured correctly and is manageable.

    • discovered

      Host is only partially configured. For example, the bmc address is discovered but not the login credentials.

    • error

      Host has any sort of error.

  • poweredOn

    Host availability status: powered on (true) or powered off (false).

  • provisioning

    State information tracked by the provisioner:

    • state

      Current action being done with the host by the provisioner.

    • id

      UUID of a machine.

  • triedCredentials

    Details of the last credentials sent to the provisioning back end.

Configuration example:

status:
  errorMessage: ""
  goodCredentials:
    credentials:
      name: master-0-bmc-secret
      namespace: default
    credentialsVersion: "13404"
  hardware:
    cpu:
      arch: x86_64
      clockMegahertz: 3000
      count: 32
      flags:
      - 3dnowprefetch
      - abm
      ...
      model: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
    firmware:
      bios:
        date: ""
        vendor: ""
        version: ""
    hostname: ipa-fcab7472-892f-473c-85a4-35d64e96c78f
    nics:
    - ip: ""
      mac: 0c:c4:7a:a8:d3:45
      model: 0x8086 0x1521
      name: enp8s0f1
      pxe: false
      speedGbps: 0
      vlanId: 0
      ...
    ramMebibytes: 262144
    storage:
    - by_path: /dev/disk/by-path/pci-0000:00:1f.2-ata-1
      hctl: "4:0:0:0"
      model: Micron_5200_MTFD
      name: /dev/sda
      rotational: false
      serialNumber: 18381E8DC148
      sizeBytes: 1920383410176
      vendor: ATA
      wwn: "0x500a07511e8dc148"
      wwnWithExtension: "0x500a07511e8dc148"
      ...
    systemVendor:
      manufacturer: Supermicro
      productName: SYS-6018R-TDW (To be filled by O.E.M.)
      serialNumber: E16865116300188
  operationalStatus: OK
  poweredOn: true
  provisioning:
    state: provisioned
  triedCredentials:
    credentials:
      name: master-0-bmc-secret
      namespace: default
    credentialsVersion: "13404"

IpamHost

This section describes the IpamHost resource used in Mirantis Container Cloud API. The kaas-ipam controller monitors the current state of the bare metal Machine and verifies that the BareMetalHost object is successfully created and inspection is completed. Then the kaas-ipam controller fetches the information about the network card, creates the IpamHost object, and requests the IP address.

The IpamHost object is created for each Machine and contains all configuration of the host network interfaces and IP address. It also contains the information about associated BareMetalHost, Machine, and MAC addresses.

For demonstration purposes, the Container Cloud IpamHost custom resource (CR) is split into the following major sections:

IpamHost metadata

The Container Cloud IpamHost CR contains the following fields:

  • apiVersion

    API version of the object that is ipam.mirantis.com/v1alpha1

  • kind

    Object type that is IpamHost

  • metadata

    The metadata field contains the following subfields:

    • name

      Name of the IpamHost object

    • namespace

      Project in which the IpamHost object has been created

    • labels

      Key-value pairs that are attached to the object:

      • cluster.sigs.k8s.io/cluster-name

        References the Cluster object name that IpamHost is assigned to

      • ipam/BMHostID

        Unique ID of the associated BareMetalHost object

      • ipam/MAC-XX-XX-XX-XX-XX-XX: "1"

        Number of NICs of the host that the corresponding MAC address is assigned to

      • ipam/MachineID

        Unique ID of the associated Machine object

      • ipam/UID

        Unique ID of the IpamHost object

Configuration example:

apiVersion: ipam.mirantis.com/v1alpha1
kind: IpamHost
metadata:
  name: master-0
  namespace: default
  labels:
    cluster.sigs.k8s.io/cluster-name: kaas-mgmt
    ipam/BMHostID: 57250885-f803-11ea-88c8-0242c0a85b02
    ipam/MAC-0C-C4-7A-1E-A9-5C: "1"
    ipam/MAC-0C-C4-7A-1E-A9-5D: "1"
    ipam/MachineID: 573386ab-f803-11ea-88c8-0242c0a85b02
    ipam/UID: 834a2fc0-f804-11ea-88c8-0242c0a85b02

IpamHost configuration

The spec field of the IpamHost resource describes the desired state of the object. It contains the following fields:

  • nicMACmap

    Represents an unordered list of all NICs of the host. Each NIC entry contains fields such as name, mac, ip, and so on. The primary field defines whether the current NIC is the primary one. Only one NIC can be primary.

  • l2TemplateSelector

    If specified, contains the name (first priority) or label of the L2 template that will be applied during a machine creation. The l2TemplateSelector field is copied from the Machine providerSpec object to the IpamHost object only once, during a machine creation. To modify l2TemplateSelector after creation of a Machine CR, edit the IpamHost object.

Configuration example:

spec:
  nicMACmap:
  - mac: 0c:c4:7a:1e:a9:5c
    name: ens11f0
  - ip: 172.16.48.157
    mac: 0c:c4:7a:1e:a9:5d
    name: ens11f1
    primary: true
  l2TemplateSelector:
    label: xxx

IpamHost status

The status field of the IpamHost resource describes the observed state of the object. It contains the following fields:

  • ipAllocationResult

    Status of IP allocation for the primary NIC (PXE boot). Possible values are OK or ERR if no IP address was allocated.

  • l2RenderResult

    Result of the L2 template rendering, if applicable. Possible values are OK or an error message.

  • lastUpdated

    Date and time of the last IpamHost status update.

  • nicMACmap

    Unordered list of all NICs of host with a detailed description. Each nicMACmap entry contains additional fields such as ipRef, nameservers, online, and so on.

  • osMetadataNetwork

    Configuration of the host OS metadata network. This configuration is used in the cloud-init tool and is applicable to the primary NIC only. It is added when the IP address is allocated and the ipAllocationResult status is OK.

  • versionIpam

    IPAM version used during the last update of the object.

Configuration example:

status:
  ipAllocationResult: OK
  l2RenderResult: There are no available L2Templates
  lastUpdated: "2020-09-16T11:02:39Z"
  nicMACmap:
  - mac: 0C:C4:7A:1E:A9:5C
    name: ens11f0
  - gateway: 172.16.48.1
    ip: 172.16.48.200/24
    ipRef: default/auto-0c-c4-7a-a8-d3-44
    mac: 0C:C4:7A:1E:A9:5D
    name: ens11f1
    nameservers:
    - 172.18.176.6
    online: true
    primary: true
  osMetadataNetwork:
    links:
    - ethernet_mac_address: 0C:C4:7A:A8:D3:44
      id: enp8s0f0
      type: phy
    networks:
    - ip_address: 172.16.48.200
      link: enp8s0f0
      netmask: 255.255.255.0
      routes:
      - gateway: 172.16.48.1
        netmask: 0.0.0.0
        network: 0.0.0.0
      type: ipv4
    services:
    - address: 172.18.176.6
      type: dns
  versionIpam: v3.0.999-20200807-130909-44151f8

Subnet

This section describes the Subnet resource used in Mirantis Container Cloud API to allocate IP addresses for the cluster nodes.

For demonstration purposes, the Container Cloud Subnet custom resource (CR) can be split into the following major sections:

Subnet metadata

The Container Cloud Subnet CR contains the following fields:

  • apiVersion

    API version of the object that is ipam.mirantis.com/v1alpha1.

  • kind

    Object type that is Subnet

  • metadata

    This field contains the following subfields:

    • name

      Name of the Subnet object.

    • namespace

      Project in which the Subnet object was created.

    • labels

      Key-value pairs that are attached to the object:

      • ipam/DefaultSubnet: "1"

        Indicates that the subnet was automatically created for the PXE network. The subnet with this label is unique for a specific region and global for all clusters and projects in the region.

      • ipam/UID

        Unique ID of a subnet.

      • kaas.mirantis.com/provider

        Provider type.

      • kaas.mirantis.com/region

        Region name.

Configuration example:

apiVersion: ipam.mirantis.com/v1alpha1
kind: Subnet
metadata:
  name: kaas-mgmt
  namespace: default
  labels:
    ipam/DefaultSubnet: "1"
    ipam/UID: 1bae269c-c507-4404-b534-2c135edaebf5
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
Subnet spec

The spec field of the Subnet resource describes the desired state of a subnet. It contains the following fields:

  • cidr

    A valid IPv4 CIDR, for example, 10.11.0.0/24.

  • gateway

    A valid gateway address, for example, 10.11.0.9.

  • includeRanges

    A list of IP address ranges within the given CIDR that should be used in the allocation of IPs for nodes. The gateway, network, broadcast, and DNS addresses will be excluded (protected) automatically if they intersect with one of the ranges. The IPs outside the given ranges will not be used in the allocation. Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77. The includeRanges parameter is mutually exclusive with excludeRanges. An includeRanges-based sketch is provided after the configuration example below.

  • excludeRanges

    A list of IP address ranges within the given CIDR that should not be used in the allocation of IPs for nodes. The IPs within the given CIDR but outside the given ranges will be used in the allocation. The gateway, network, broadcast, and DNS addresses will be excluded (protected) automatically if they are included in the CIDR. Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77. The excludeRanges parameter is mutually exclusive with includeRanges.

  • useWholeCidr

    If set to false (default), the subnet address and the broadcast address are excluded from the address allocation. If set to true, the subnet address and the broadcast address are included in the address allocation for nodes.

  • nameservers

    The list of IP addresses of name servers. Each element of the list is a single address, for example, 172.18.176.6.

Configuration example:

spec:
  cidr: 172.16.48.0/24
  excludeRanges:
  - 172.16.48.99
  - 172.16.48.101-172.16.48.145
  gateway: 172.16.48.1
  nameservers:
  - 172.18.176.6
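
For comparison, the following is a sketch of a similar subnet defined through includeRanges instead of excludeRanges. The range is illustrative; remember that the two parameters cannot be combined in one Subnet object:

spec:
  cidr: 172.16.48.0/24
  includeRanges:
  # Only addresses from these ranges are handed out to nodes; the gateway,
  # network, broadcast, and DNS addresses are still protected automatically.
  - 172.16.48.200-172.16.48.253
  gateway: 172.16.48.1
  nameservers:
  - 172.18.176.6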
Subnet status

The status field of the Subnet resource describes the actual state of a subnet. It contains the following fields:

  • allocatable

    The number of IP addresses that are available for allocation.

  • allocatedIPs

    The list of allocated IP addresses in the IP:<IPAddr object UID> format.

  • capacity

    The total number of IP addresses available for allocation, that is, the sum of the allocatable and already allocated IP addresses.

  • cidr

    The IPv4 CIDR for a subnet.

  • gateway

    The gateway address for a subnet.

  • nameservers

    The list of IP addresses of name servers.

  • ranges

    The list of IP address ranges within the given CIDR that are used in the allocation of IPs for nodes.

  • lastUpdate

    Date and time of the last Subnet status update.

  • statusMessage

    Message that reflects the current status of the Subnet resource.

Configuration example:

status:
  allocatable: 51
  allocatedIPs:
  - 172.16.48.200:24e94698-f726-11ea-a717-0242c0a85b02
  - 172.16.48.201:2bb62373-f726-11ea-a717-0242c0a85b02
  - 172.16.48.202:37806659-f726-11ea-a717-0242c0a85b02
  capacity: 54
  cidr: 172.16.48.0/24
  gateway: 172.16.48.1
  lastUpdate: "2020-09-15T12:27:58Z"
  nameservers:
  - 172.18.176.6
  ranges:
  - 172.16.48.200-172.16.48.253
  statusMessage: OK

SubnetPool

This section describes the SubnetPool resource used in Mirantis Container Cloud API to manage a pool of addresses from which subnets can be allocated.

For demonstration purposes, the Container Cloud SubnetPool custom resource (CR) is split into the following major sections:

SubnetPool metadata

The Container Cloud SubnetPool CR contains the following fields:

  • apiVersion

    API version of the object that is ipam.mirantis.com/v1alpha1.

  • kind

    Object type that is SubnetPool.

  • metadata

    The metadata field contains the following subfields:

    • name

      Name of the SubnetPool object.

    • namespace

      Project in which the SubnetPool object was created.

    • labels

      Key-value pairs that are attached to the object:

      • kaas.mirantis.com/provider

        Provider type that is baremetal.

      • kaas.mirantis.com/region

        Region name.

Configuration example:

apiVersion: ipam.mirantis.com/v1alpha1
kind: SubnetPool
metadata:
  name: kaas-mgmt
  namespace: default
  labels:
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
SubnetPool spec

The spec field of the SubnetPool resource describes the desired state of a subnet pool. It contains the following fields:

  • cidr

    Valid IPv4 CIDR. For example, 10.10.0.0/16.

  • blockSize

    IP address block size to use when assigning an IP address block to every new child Subnet object. For example, if you set /25, every new child Subnet will have 128 IPs to allocate. Possible values are from /29 to the cidr size. Immutable. A sketch that uses the smallest block size is provided after the configuration example below.

  • nameservers

    Optional. List of IP addresses of name servers to use for every new child Subnet object. Each element of the list is a single address, for example, 172.18.176.6. Default: empty.

  • gatewayPolicy

    Optional. Method of assigning a gateway address to new child Subnet objects. Default: none. Possible values are:

    • first - first IP of the IP address block assigned to a child Subnet, for example, 10.11.10.1.

    • last - last IP of the IP address block assigned to a child Subnet, for example, 10.11.10.254.

    • none - no gateway address.

Configuration example:

spec:
  cidr: 10.10.0.0/16
  blockSize: /25
  nameservers:
  - 172.18.176.6
  gatewayPolicy: first
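
For comparison, the following is a minimal sketch of a pool that relies on the defaults described above. The CIDR is illustrative, and the counts in the comments are derived from the field descriptions, not API output:

spec:
  cidr: 192.168.0.0/24   # illustrative pool network
  blockSize: /29         # smallest allowed block: each child Subnet receives a /29 (2^(32-29) = 8 addresses)
  gatewayPolicy: none    # optional, none is the default, so child Subnets get no gateway
  # nameservers is omitted, so child Subnets get an empty name server list.
  # A /24 pool split into /29 blocks can produce up to 2^(29-24) = 32 child Subnets.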
SubnetPool status

The status field of the SubnetPool resource describes the actual state of a subnet pool. It contains the following fields:

  • statusMessage

    Message that reflects the current status of the SubnetPool resource. Possible values are:

    • OK - a subnet pool is active.

    • ERR: <error message> - a subnet pool is in the Failure state.

    • TERM - a subnet pool is terminating.

  • allocatedSubnets

    List of allocated subnets. Each subnet has the <CIDR>:<SUBNET_UID> format.

  • blockSize

    Block size to use for IP address assignments from the defined pool.

  • capacity

    Total number of IP addresses to be allocated. Includes the number of allocatable and already allocated IP addresses.

  • allocatable

    Number of blockSize-sized subnets that are still available for allocation from the pool.

  • lastUpdate

    Date and time of the last SubnetPool status update.

  • versionIpam

    IPAM version used during the last object update.

Example:

status:
  allocatedSubnets:
  - 10.10.0.0/24:0272bfa9-19de-11eb-b591-0242ac110002
  blockSize: /24
  capacity: 54
  allocatable: 51
  lastUpdate: "2020-09-15T08:30:08Z"
  versionIpam: v3.0.999-20200807-130909-44151f8
  statusMessage: OK

IPaddr

This section describes the IPaddr resource used in Mirantis Container Cloud API. The IPAddr object describes an IP address and contains all information about the associated MAC address.

For demonstration purposes, the Container Cloud IPaddr custom resource (CR) is split into the following major sections:

IPaddr metadata

The Container Cloud IPaddr CR contains the following fields:

  • apiVersion

    API version of the object that is ipam.mirantis.com/v1alpha1

  • kind

    Object type that is IPaddr

  • metadata

    The metadata field contains the following subfields:

    • name

      Name of the IPaddr object in the auto-XX-XX-XX-XX-XX-XX format where XX-XX-XX-XX-XX-XX is the associated MAC address

    • namespace

      Project in which the IPaddr object was created

    • labels

      Key-value pairs that are attached to the object:

      • ipam/IP

        IPv4 address

      • ipam/IpamHostID

        Unique ID of the associated IpamHost object

      • ipam/MAC

        MAC address

      • ipam/SubnetID

        Unique ID of the Subnet object

      • ipam/UID

        Unique ID of the IPAddr object

Configuration example:

apiVersion: ipam.mirantis.com/v1alpha1
kind: IPaddr
metadata:
  name: auto-0c-c4-7a-a8-b8-18
  namespace: default
  labels:
    ipam/IP: 172.16.48.201
    ipam/IpamHostID: 848b59cf-f804-11ea-88c8-0242c0a85b02
    ipam/MAC: 0C-C4-7A-A8-B8-18
    ipam/SubnetID: 572b38de-f803-11ea-88c8-0242c0a85b02
    ipam/UID: 84925cac-f804-11ea-88c8-0242c0a85b02
IPAddr spec

The spec object field of the IPAddr resource contains the associated MAC address and the reference to the Subnet object:

  • mac

    MAC address in the XX:XX:XX:XX:XX:XX format

  • subnetRef

    Reference to the Subnet resource in the <subnetProjectName>/<subnetName> format

Configuration example:

spec:
  mac: 0C:C4:7A:A8:B8:18
  subnetRef: default/kaas-mgmt
IPAddr status

The status object field of the IPAddr resource reflects the actual state of the IPAddr object. It contains the following fields:

  • address

    IP address.

  • cidr

    IPv4 CIDR for the Subnet.

  • gateway

    Gateway address for the Subnet.

  • lastUpdate

    Date and time of the last IPAddr status update.

  • mac

    MAC address in the XX:XX:XX:XX:XX:XX format.

  • nameservers

    List of the IP addresses of name servers of the Subnet. Each element of the list is a single address, for example, 172.18.176.6.

  • phase

    Current phase of the IP address. Possible values: Active, Failed, or Terminating.

  • versionIpam

    IPAM version used during the last update of the object.

Configuration example:

status:
  address: 172.16.48.201
  cidr: 172.16.48.201/24
  gateway: 172.16.48.1
  lastUpdate: "2020-09-16T10:08:07Z"
  mac: 0C:C4:7A:A8:B8:18
  nameservers:
  - 172.18.176.6
  phase: Active
  versionIpam: v3.0.999-20200807-130909-44151f8

L2Template

This section describes the L2Template resource used in Mirantis Container Cloud API.

By default, Container Cloud configures a single interface on cluster nodes, leaving all other physical interfaces intact. With L2Template, you can create advanced host networking configurations for your clusters. For example, you can create bond interfaces on top of physical interfaces on the host.

For demonstration purposes, the Container Cloud L2Template custom resource (CR) is split into the following major sections:

L2Template metadata

The Container Cloud L2Template CR contains the following fields:

  • apiVersion

    API version of the object that is ipam.mirantis.com/v1alpha1.

  • kind

    Object type that is L2Template.

  • metadata

    The metadata field contains the following subfields:

    • name

      Name of the L2Template object.

    • namespace

      Project in which the L2Template object was created.

    • labels

      Key-value pairs that are attached to the object:

      Caution

      All ipam/* labels, except ipam/DefaultForCluster, are set automatically and must not be configured manually.

      • ipam/Cluster

        References the Cluster object name that this template is applied to. The process of selecting the L2Template object for a specific cluster is as follows:

        1. The kaas-ipam controller monitors the L2Template objects with the ipam/Cluster:<clusterName> label.

        2. The L2Template object with the ipam/Cluster: <clusterName> label is assigned to a cluster with Name: <clusterName>, if available. Otherwise, the default L2Template object with the ipam/Cluster: default label is assigned to the cluster (a sketch of such a template follows the configuration example below).

      • ipam/PreInstalledL2Template: "1"

        Is automatically added during a management or regional cluster deployment. Indicates that the current L2Template object was preinstalled. Represents L2 templates that are automatically copied to a project once it is created. Once the L2 templates are copied, the ipam/PreInstalledL2Template label is removed.

      • ipam/DefaultForCluster

        This label is unique per cluster. When you use several L2 templates per cluster, only the first template is automatically labeled as the default one. All subsequent templates must be referenced in the Machine configuration using l2TemplateSelector. You can manually configure this label if required.

      • ipam/UID

        Unique ID of an object.

      • kaas.mirantis.com/provider

        Provider type.

      • kaas.mirantis.com/region

        Region type.

Configuration example:

apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  name: l2template-test
  namespace: default
  labels:
    ipam/Cluster: test
    ipam/DefaultForCluster: "1"
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
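
For comparison, the following is a minimal sketch of the metadata of a project-default template that is selected through the ipam/Cluster: default fallback described above. The object name is an illustrative assumption:

apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  name: l2template-default       # illustrative name
  namespace: default
  labels:
    ipam/Cluster: default        # fallback template for clusters in this project that have no dedicated template
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one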
L2Template configuration

The spec field of the L2Template resource describes the desired state of the object. It contains the following fields:

  • clusterRef

    The Cluster object that this template is applied to. The default value is used to apply the given template to all clusters within a particular project, unless an L2 template that references a specific cluster name exists.

    Caution

    • An L2 template must have the same namespace as the referenced cluster.

    • A cluster can be associated with many L2 templates. Only one of them can have the ipam/DefaultForCluster label. Every L2 template that does not have the ipam/DefaultForCluster label must be assigned to a particular machine using l2TemplateSelector.

    • A project (Kubernetes namespace) can have only one default L2 template (L2Template with Spec.clusterRef: default).

  • ifMapping

    The list of interface names for the template. The interface mapping is defined globally for all bare metal hosts in the cluster but can be overridden at the host level, if required, by editing the IpamHost object for a particular host. The ifMapping parameter is mutually exclusive with autoIfMappingPrio. An ifMapping-based sketch with a bond interface is provided after the configuration example below.

  • autoIfMappingPrio

    The list of prefixes, such as eno, ens, and so on, that are used to match host interfaces and automatically build the interface list for the template. The result of the generation can be overridden at the host level using ifMappingOverride in the corresponding IpamHost spec. The autoIfMappingPrio parameter is mutually exclusive with ifMapping.

  • npTemplate

    A netplan-compatible configuration that defines the networking settings for the cluster hosts, with physical NIC names and other host-specific details parameterized. This configuration is processed using Go templates. Instead of hardcoding IP and MAC addresses, interface names, and other network details specific to a particular host, the template uses special lookup functions, such as nic, mac, ip, and so on, that return host-specific network information when the template is rendered for a particular host.

    Caution

    All rules and restrictions of the netplan configuration also apply to L2 templates. For details, see the official netplan documentation.

Configuration example:

spec:
  autoIfMappingPrio:
  - provision
  - eno
  - ens
  - enp
  l3Layout: null
  npTemplate: |
    version: 2
    ethernets:
      {{nic 0}}:
        dhcp4: false
        dhcp6: false
        addresses:
          - {{ip "0:kaas-mgmt"}}
        gateway4: {{gateway_from_subnet "kaas-mgmt"}}
        nameservers:
          addresses: {{nameservers_from_subnet "kaas-mgmt"}}
        match:
          macaddress: {{mac 0}}
        set-name: {{nic 0}}
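
For comparison, the following is a sketch of an ifMapping-based template that bonds two physical NICs, as mentioned at the beginning of this section. The interface names, the bond mode, and the use of the bond name in the ip lookup are illustrative assumptions; all netplan bonding rules apply:

spec:
  clusterRef: test
  ifMapping:                      # explicit mapping instead of autoIfMappingPrio
  - eno1
  - eno2
  npTemplate: |
    version: 2
    ethernets:
      {{nic 0}}:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 0}}
        set-name: {{nic 0}}
      {{nic 1}}:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 1}}
        set-name: {{nic 1}}
    bonds:
      bond0:
        interfaces:
        - {{nic 0}}
        - {{nic 1}}
        parameters:
          mode: 802.3ad           # assumed bond mode, adjust to your environment
        addresses:
        - {{ip "bond0:kaas-mgmt"}}    # assumes the ip lookup accepts the bond name
        gateway4: {{gateway_from_subnet "kaas-mgmt"}}
        nameservers:
          addresses: {{nameservers_from_subnet "kaas-mgmt"}}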
L2Template status

The status field of the L2Template resource reflects the actual state of the L2Template object and contains the following fields:

  • phase

    Current phase of the L2Template object. Possible values: Ready, Failed, or Terminating.

  • reason

    Detailed error message in case L2Template has the Failed status.

  • lastUpdate

    Date and time of the last L2Template status update.

  • versionIpam

    IPAM version used during the last update of the object.

Configuration example:

status:
  lastUpdate: "2020-09-15T08:30:08Z"
  phase: Failed
  reason: The kaas-mgmt subnet in the terminating state.
  versionIpam: v3.0.999-20200807-130909-44151f8