Mirantis Container Cloud Operations Guide latest documentation

Mirantis Container Cloud Operations Guide

Preface

This documentation provides information on how to deploy and operate Mirantis Container Cloud.

About this documentation set

The documentation is intended to help operators understand the core concepts of the product.

The information provided in this documentation set is being constantly improved and amended based on the feedback and kind requests from our software consumers. This documentation set describes the features that are supported within the two latest Container Cloud minor releases, with a corresponding Available since release note.

The following table lists the guides included in the documentation set you are reading:

Guides list

Guide

Purpose

Reference Architecture

Learn the fundamentals of Container Cloud reference architecture to plan your deployment.

Deployment Guide

Deploy Container Cloud of a preferred configuration using supported deployment profiles tailored to the demands of specific business cases.

Operations Guide

Operate your Container Cloud deployment.

Release Compatibility Matrix

Deployment compatibility of the Container Cloud component versions for each product release.

Release Notes

Learn about new features and bug fixes in the current Container Cloud version as well as in the Container Cloud minor releases.

For your convenience, we provide all guides from this documentation set in HTML (default), single-page HTML, PDF, and ePUB formats. To use the preferred format of a guide, select the required option from the Formats menu next to the guide title on the Container Cloud documentation home page.

Intended audience

This documentation assumes that the reader is familiar with network and cloud concepts and is intended for the following users:

  • Infrastructure Operator

    • Is a member of the IT operations team

    • Has working knowledge of Linux, virtualization, Kubernetes API and CLI, and OpenStack to support the application development team

    • Accesses Mirantis Container Cloud and Kubernetes through a local machine or web UI

    • Provides verified artifacts through a central repository to the Tenant DevOps engineers

  • Tenant DevOps engineer

    • Is a member of the application development team and reports to the line-of-business (LOB)

    • Has working knowledge of Linux, virtualization, Kubernetes API and CLI to support application owners

    • Accesses Container Cloud and Kubernetes through a local machine or web UI

    • Consumes artifacts from a central repository approved by the Infrastructure Operator

Conventions

This documentation set uses the following conventions in the HTML format:

Documentation conventions

Convention

Description

boldface font

Inline CLI tools and commands, titles of the procedures and system response examples, table titles.

monospaced font

File names and paths, Helm chart parameters and their values, package names, node names and labels, and so on.

italic font

Information that distinguishes some concept or term.

Links

External links and cross-references, footnotes.

Main menu > menu item

GUI elements that include any part of interactive user interface and menu navigation.

Superscript

Some extra, brief information. For example, if a feature is available from a specific release or if a feature is in the Technology Preview development stage.

Note

The Note block

Messages of a generic meaning that may be useful to the user.

Caution

The Caution block

Information that helps the user avoid mistakes and undesirable consequences when following the procedures.

Warning

The Warning block

Messages that include details that can be easily missed, but should not be ignored by the user and are valuable before proceeding.

See also

The See also block

List of references that may be helpful for understanding some related tools, concepts, and so on.

Learn more

The Learn more block

Used in the Release Notes to wrap a list of internal references to the reference architecture, deployment and operation procedures specific to a newly implemented product feature.

Technology Preview support scope

This documentation set includes descriptions of the Technology Preview features. A Technology Preview feature provides early access to upcoming product innovations, allowing customers to experience the functionality and provide feedback during the development process. Technology Preview features may be privately or publicly available, but in neither case are they intended for production use. While Mirantis will provide support for such features through official channels, normal Service Level Agreements do not apply. Customers may be supported by Mirantis Customer Support or Mirantis Field Support.

As Mirantis considers making future iterations of Technology Preview features generally available, we will attempt to resolve any issues that customers experience when using these features.

During the development of a Technology Preview feature, additional components may become available to the public for testing. Because Technology Preview features are still under development, Mirantis cannot guarantee the stability of such features. As a result, if you are using Technology Preview features, you may not be able to seamlessly upgrade to subsequent releases of that feature. Mirantis makes no guarantees that Technology Preview features will be graduated to a generally available product release.

The Mirantis Customer Success Organization may create bug reports on behalf of support cases filed by customers. These bug reports will then be forwarded to the Mirantis Product team for possible inclusion in a future release.

Documentation history

The documentation set refers to Mirantis Container Cloud GA as the latest released GA version of the product. For details about the Container Cloud GA minor release dates, refer to Container Cloud releases.

Operate managed clusters

Note

This tutorial applies only to the Container Cloud web UI users with the writer access role assigned by the Infrastructure Operator. To add a bare metal host, the operator access role is also required.

After you deploy the Mirantis Container Cloud management cluster, you can start creating managed clusters that will be based on the same cloud provider type that you have for the management cluster: OpenStack, AWS, bare metal, or VMware vSphere.

The deployment procedure is performed using the Container Cloud web UI and comprises the following steps:

  1. Create an initial cluster configuration depending on the provider type.

  2. For a baremetal-based managed cluster, create and configure bare metal hosts with corresponding labels for machines such as worker, manager, or storage.

  3. Add the required number of machines with the corresponding configuration to the managed cluster.

  4. For a baremetal-based managed cluster, add a Ceph cluster.

Create and operate a baremetal-based managed cluster

After bootstrapping your baremetal-based Mirantis Container Cloud management cluster as described in Deployment Guide: Deploy a baremetal-based management cluster, you can start creating baremetal-based managed clusters using the Container Cloud web UI.

Create a managed cluster

This section instructs you on how to configure and deploy a managed cluster that is based on the baremetal-based management cluster through the Mirantis Container Cloud web UI.

To create a managed cluster on bare metal:

  1. Recommended. Verify that you have successfully configured an L2 template for a new cluster as described in Advanced networking configuration. You may skip this step if you do not require L2 separation for network traffic.

  2. Optional. Create a custom bare metal host profile depending on your needs as described in Create a custom bare metal host profile.

  3. Log in to the Container Cloud web UI with the writer permissions.

  4. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  5. In the SSH keys tab, click Add SSH Key to upload the public SSH key that will be used for SSH access to the cluster machines.

  6. In the Clusters tab, click Create Cluster.

  7. Configure the new cluster in the Create New Cluster wizard that opens:

    1. Define general and Kubernetes parameters:

      Create new cluster: General, Provider, and Kubernetes

      Section

      Parameter name

      Description

      General settings

      Cluster name

      The cluster name.

      Provider

      Select Baremetal.

      Region

      From the drop-down list, select Baremetal.

      Release version

      The Container Cloud version.

      SSH keys

      From the drop-down list, select the SSH key name that you have previously added for SSH access to the bare metal hosts.

      Provider

      LB host IP

      The IP address of the load balancer endpoint that will be used to access the Kubernetes API of the new cluster. This IP address must be from the same subnet as used for DHCP in Metal³.

      LB address range

      The range of IP addresses that can be assigned to load balancers for Kubernetes Services by MetalLB.

      Kubernetes

      Node CIDR

      The Kubernetes worker nodes CIDR block. For example, 10.10.10.0/24.

      Services CIDR blocks

      The Kubernetes Services CIDR blocks. For example, 10.233.0.0/18.

      Pods CIDR blocks

      The Kubernetes pods CIDR blocks. For example, 10.233.64.0/18.

    2. Configure StackLight:

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable StackLight

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration or Configure StackLight.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

      Multiserver Mode

      Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see Reference Architecture: StackLight deployment architecture.

      Elasticsearch

      Retention Time

      The Elasticsearch logs retention period in Logstash.

      Persistent Volume Claim Size

      The Elasticsearch persistent volume claim size.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: Available StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  8. Click Create.

    To view the deployment status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the deployment is complete.
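
If you prefer to track the deployment from the command line, you can also watch the cluster and machine objects on the management cluster. The commands below are a sketch that assumes kubectl access to the management cluster kubeconfig and the standard Cluster API resource names; substitute the placeholders with your values:

kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
-n <ProjectNameForNewManagedCluster> get cluster <ClusterName>

kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
-n <ProjectNameForNewManagedCluster> get machines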

Now, proceed to Add a bare metal host.

Add a bare metal host

This section describes how to add a bare metal host to a newly created managed cluster using either the Container Cloud web UI or CLI for an advanced configuration.

Add a bare metal host using web UI

After you create a managed cluster as described in Create a managed cluster, proceed with adding a bare metal host through the Mirantis Container Cloud web UI using the instruction below.

Before you proceed with adding a bare metal host, verify that the physical network on the server has been configured correctly. See Reference Architecture: Network fabric for details.

To add a bare metal host to a baremetal-based managed cluster:

  1. Log in to the Container Cloud web UI with the operator permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Baremetal tab, click Add BM host.

  4. Fill out the Add new BM host form as required:

    • Baremetal host name

      Specify the name of the new bare metal host.

    • Username

      Specify the name of the user for accessing the BMC (IPMI user).

    • Password

      Specify the password of the user for accessing the BMC (IPMI password).

    • Boot MAC address

      Specify the MAC address of the PXE network interface.

    • Address

      Specify the URL to access the BMC. The URL must start with https://.

    • Label

      Assign the machine label to the new host that defines which type of machine may be deployed on this bare metal host. Only one label can be assigned to a host. The supported labels are Worker, Manager, and Storage, matching the machine roles described in Create a machine using web UI.

  5. Click Create.

    While adding the bare metal host, Container Cloud discovers and inspects the hardware of the bare metal host and adds the results to BareMetalHost.status for future reference.
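
To review the inspection results, you can query the BareMetalHost object from the management cluster. The command below is a sketch that assumes kubectl access to the management cluster kubeconfig; the project and host names are placeholders:

kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
-n <ProjectName> get baremetalhost <BareMetalHostName> -o yaml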

Now, you can proceed to Create a machine using web UI.

Add a bare metal host using CLI

After you create a managed cluster as described in Create a managed cluster, proceed with adding bare metal hosts through the Mirantis Container Cloud CLI using the instruction below.

To add a bare metal host using API:

  1. Log in to the host where your management cluster kubeconfig is located and where kubectl is installed.

  2. Create a secret YAML file that describes the credentials of the new bare metal host.

    Example of the bare metal host secret:

    apiVersion: v1
    data:
      password: <credentials-password>
      username: <credentials-user-name>
    kind: Secret
    metadata:
      labels:
        kaas.mirantis.com/credentials: "true"
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: <credentials-name>
      namespace: <managed-cluster-namespace-name>
    type: Opaque
    

    In the data section, add the IPMI user name and password in the base64 encoding to access the BMC. To obtain the base64-encoded credentials, you can use the following command in your Linux console:

    echo -n <username|password> | base64
    
  3. Apply the secret YAML file to your deployment:

    kubectl apply -f <bmh-cred-file-name>.yaml
    
  4. Create a YAML file that contains a description of the new bare metal host.

    Example of the bare metal host configuration file with the worker role:

    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHost
    metadata:
      labels:
        kaas.mirantis.com/baremetalhost-id: <unique-bare-metal-host-hardware-node-id>
        hostlabel.bm.kaas.mirantis.com/worker: "true"
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: <bare-metal-host-unique-name>
      namespace: <managed-cluster-namespace-name>
    spec:
      bmc:
        address: <ip_address_for-bmc-access>
        credentialsName: <credentials-name>
      bootMACAddress: <bare-metal-host-boot-mac-address>
      online: true
    

    For a detailed fields description, see BareMetalHost.

  5. Apply the bare metal host YAML file to your deployment:

    kubectl apply -f <bare-metal-host-config-file-name>.yaml
    

Now, proceed with Deploy a machine to a specific bare metal host.

Add a machine

This section describes how to add a machine to a newly created managed cluster using either the Mirantis Container Cloud web UI or CLI for an advanced configuration.

Create a machine using web UI

After you add a bare metal host to the managed cluster as described in Add a bare metal host using web UI, you can create a Kubernetes machine in your cluster using the Mirantis Container Cloud web UI.

To add a Kubernetes machine to a baremetal-based managed cluster:

  1. Log in to the Mirantis Container Cloud web UI with the operator or writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with the Machines list opens.

  4. Click the Create Machine button.

  5. Fill out the Create New Machine form as required:

    • Count

      Specify the number of machines to add.

    • Manager

      Select Manager or Worker to create a Kubernetes manager or worker node. The required minimum number of machines is three manager nodes for HA and two worker nodes for the Container Cloud workloads.

    • BareMetal Host Label

      Assign the role to the new machine(s) to link the machine to a previously created bare metal host with the corresponding label. You can assign one role type per machine. The supported labels include:

      • Worker

        The default role for any node in a managed cluster. Only the kubelet service is running on the machines of this type.

      • Manager

        This node hosts the manager services of a managed cluster. For reliability reasons, Container Cloud does not permit running end user workloads on the manager nodes or using them as storage nodes.

      • Storage

        This node is a worker node that also hosts Ceph OSDs and provides its disk resources to Ceph. Container Cloud permits end users to run workloads on storage nodes by default.

    • Node Labels Available since 2.1.0

      Select the required node labels for the machine to run certain components on a specific node. For example, for the StackLight nodes that run Elasticsearch and require more resources than a standard node, select the StackLight label. The list of available node labels is obtained from your current Cluster release.

      Caution

      If you deploy StackLight in the HA mode (recommended), add the StackLight label to a minimum of three nodes.

      Note

      You can configure node labels after deploying a machine. On the Machines page, click the More action icon in the last column of the required machine field and select Configure machine.

  6. Click Create.

At this point, Container Cloud adds the new machine object to the specified managed cluster, and the Bare Metal Operator controller creates the relation to BareMetalHost with the labels matching the roles.

Provisioning of the newly created machine starts when the machine object is created and includes the following stages:

  1. Creation of partitions on the local disks as required by the operating system and the Container Cloud architecture.

  2. Configuration of the network interfaces on the host as required by the operating system and the Container Cloud architecture.

  3. Installation and configuration of the Container Cloud LCM agent.
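
To track these stages from the management cluster, you can watch the Machine and LCMMachine objects. The commands below are a sketch that assumes kubectl access to the management cluster kubeconfig; the resource plural names may differ in your deployment:

kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
-n <ProjectName> get machines

kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
-n <ProjectName> get lcmmachines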

Now, proceed to Add a Ceph cluster.

Create a machine using CLI

This section describes a bare metal host and machine configuration using Mirantis Container Cloud CLI.

Deploy a machine to a specific bare metal host

A Kubernetes machine requires a dedicated bare metal host for deployment. The bare metal hosts are represented by the BareMetalHost objects in Kubernetes API. All BareMetalHost objects are labeled by the Operator when created. A label reflects the hardware capabilities of a host. As a result of labeling, all bare metal hosts are divided into three types: Control Plane, Worker, and Storage.

In some cases, you may need to deploy a machine to a specific bare metal host. This is especially useful when some of your bare metal hosts have a different hardware configuration than the rest.

To deploy a machine to a specific bare metal host:

  1. Log in to the host where your management cluster kubeconfig is located and where kubectl is installed.

  2. Identify the bare metal host that you want to associate with the specific machine. For example, host host-1.

    kubectl get baremetalhost host-1 -o yaml
    
  3. Add a label that will uniquely identify this host, for example, by the name of the host and machine that you want to deploy on it.

    Caution

    Do not remove any existing labels from the BareMetalHost resource. For more details about labels, see BareMetalHost.

    kubectl edit baremetalhost host-1
    

    Configuration example:

    kind: BareMetalHost
    metadata:
      name: host-1
      namespace: myproject
      labels:
        kaas.mirantis.com/baremetalhost-id: host-1-worker-HW11-cad5
        ...
    
  4. Create a new text file with the YAML definition of the Machine object, as defined in Machine.

  5. Add a label selector that matches the label you have added to the BareMetalHost object in the previous step.

    Example:

    kind: Machine
    metadata:
      name: worker-HW11-cad5
      namespace: myproject
    spec:
      kind: BareMetalMachineProviderSpec
      hostSelector:
        matchLabels:
          kaas.mirantis.com/baremetalhost-id: host-1-worker-HW11-cad5
    ...
    
  6. Specify the details of the machine configuration in the object created in the previous step. For example, add a reference to a custom BareMetalHostProfile object, as defined in Machine. Or specify an override for the ordering and naming of the NICs for the machine. For details, see Override network interfaces naming and order.

  7. Add the configured machine to the cluster:

    kubectl create -f worker-HW11-cad5.yaml
    

    Once done, this machine will be associated with the specified bare metal host.

Override network interfaces naming and order

An L2 template contains the ifMapping field that allows you to identify Ethernet interfaces for the template. The Machine object API enables the Operator to override the mapping from the L2 template by enforcing a specific order of names of the interfaces when applied to the template.

The l2TemplateIfMappingOverride field in the spec of the Machine object contains a list of interface names. The order of the interface names in the list is important because the L2Template object will be rendered with the NICs ordered as per this list.

Note

Changes in the l2TemplateIfMappingOverride field will apply only once when the Machine and corresponding IpamHost objects are created. Further changes to l2TemplateIfMappingOverride will not reset the interface assignment and configuration.

Caution

The l2TemplateIfMappingOverride field must contain the names of all interfaces of the bare metal host.

The following example illustrates how to add the override field to the Machine object. In this example, we configure the interface eno1, which is the second on-board interface of the server, to precede the first on-board interface eno0.

apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  finalizers:
  - foregroundDeletion
  - machine.cluster.sigs.k8s.io
  labels:
    cluster.sigs.k8s.io/cluster-name: kaas-mgmt
    cluster.sigs.k8s.io/control-plane: "true"
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
spec:
  providerSpec:
    value:
      apiVersion: baremetal.k8s.io/v1alpha1
      hostSelector:
        matchLabels:
          baremetal: hw-master-0
      image: {}
      kind: BareMetalMachineProviderSpec
      l2TemplateIfMappingOverride:
      - eno1
      - eno0
      - enp0s1
      - enp0s2

As a result of the configuration above, when used with the example L2 template for bonds and bridges described in Create L2 templates, the eno1 interface will be used for the bm-pxe bridge, while the eno0 interface will be used to create subinterfaces for Kubernetes networks.

See also

Delete a machine

Add a Ceph cluster

After you add machines to your new bare metal managed cluster as described in Add a machine, you can create a Ceph cluster on top of this managed cluster using the Mirantis Container Cloud web UI.

The procedure below enables you to create a Ceph cluster with a minimum of three Ceph nodes that provides persistent volumes to the Kubernetes workloads in the managed cluster.

To create a Ceph cluster in the managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The Cluster page with the Machines and Ceph clusters lists opens.

  4. In the Ceph Clusters block, click Create Cluster.

  5. Configure the Ceph cluster in the Create New Ceph Cluster wizard that opens:

    Create new Ceph cluster

    Section

    Parameter name

    Description

    General settings

    Name

    The Ceph cluster name.

    Cluster Network

    Replication network for Ceph OSDs.

    Public Network

    Public network for Ceph data.

    Enable OSDs LCM

    Select to enable LCM for Ceph OSDs.

    Machines / Machine #1-3

    Select machine

    Select the name of the Kubernetes machine that will host the corresponding Ceph node in the Ceph cluster.

    Manager, Monitor

    Select the required Ceph services to install on the Ceph node.

    Devices

    Select the disk that Ceph will use.

    Warning

    Do not select the device used for system services, for example, sda.

  6. To add more Ceph nodes to the new Ceph cluster, click + next to any Ceph Machine title in the Machines tab. Configure a Ceph node as required.

    Warning

    Do not add more than three Manager and/or Monitor services to the Ceph cluster.

  7. After you add and configure all nodes in your Ceph cluster, click Create.

Once done, verify your Ceph cluster as described in Verify Ceph components.
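
As a quick sanity check before the full verification procedure, you can also list the Ceph pods on the managed cluster. The command below is a sketch that assumes Ceph is deployed by Rook into the rook-ceph namespace and that the managed cluster kubeconfig is downloaded:

kubectl --kubeconfig <pathToManagedClusterKubeconfig> -n rook-ceph get pods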

Delete a managed cluster

Deleting a managed cluster does not require a preliminary deletion of the machines running on the cluster.

To delete a baremetal-based managed cluster:

  1. Log in to the Mirantis Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required cluster and select Delete.

  4. Verify the list of machines to be removed. Confirm the deletion.

  5. Optional. If you do not plan to reuse the credentials of the deleted cluster, delete them:

    1. In the Credentials tab, click the Delete credential action icon next to the name of the credentials to be deleted.

    2. Confirm the deletion.

    Warning

    You can delete credentials only after deleting the managed cluster they relate to.

Deleting a cluster automatically frees up the resources allocated for this cluster, for example, instances, load balancers, networks, floating IPs, and so on.

Advanced networking configuration

By default, Mirantis Container Cloud configures a single interface on the cluster nodes, leaving all other physical interfaces intact.

With L2 networking templates, you can create advanced host networking configurations for your clusters. For example, you can create bond interfaces on top of physical interfaces on the host or use multiple subnets to separate different types of network traffic.

When you create a baremetal-based project, the exemplary templates with the ipam/PreInstalledL2Template label are copied to this project. These templates are preinstalled during the management cluster bootstrap.
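
For example, assuming the ipam/PreInstalledL2Template label mentioned above, you can list the exemplary templates copied to your project with a command similar to the following:

kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
-n <ProjectName> get l2template -l ipam/PreInstalledL2Template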

Follow the procedures below to create L2 templates for your managed clusters.

Create subnets

Before creating an L2 template, ensure that you have the required subnets that can be used in the L2 template to allocate IP addresses for the managed cluster nodes. Where required, create a number of subnets for a particular project using the Subnet CR. A subnet has three logical scopes:

  • global - CR uses the default namespace. A subnet can be used for any cluster located in any project.

  • namespaced - CR uses the namespace that corresponds to a particular project where managed clusters are located. A subnet can be used for any cluster located in the same project.

  • cluster - CR uses the namespace where the referenced cluster is located. A subnet is only accessible to the cluster that L2Template.spec.clusterRef refers to. The Subnet objects with the cluster scope will be created for every new cluster.

You can have subnets with the same name in different projects. In this case, the subnet that has the same project as the cluster will be used. One L2 template may reference several subnets, and in this case those subnets may have different scopes.

The IP address objects (IPaddr CR) that are allocated from subnets always have the same project as their corresponding IpamHost objects, regardless of the subnet scope.
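
To inspect the allocation results for a particular project, you can list the IpamHost and IPaddr objects in the corresponding namespace. The commands below are a sketch; the resource names are assumptions based on the object kinds and may differ in your deployment:

kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <ProjectName> get ipamhost

kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <ProjectName> get ipaddr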

To create subnets:

  1. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  2. Create the subnet.yaml file with a number of global or namespaced subnets and apply it to your management cluster:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> apply -f <SubnetFileName.yaml>
    

    Note

    In the command above and in the steps below, substitute the parameters enclosed in angle brackets with the corresponding values.

    Example of a subnet.yaml file:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: demo
      namespace: demo-namespace
    spec:
      cidr: 10.11.0.0/24
      gateway: 10.11.0.9
      includeRanges:
      - 10.11.0.5-10.11.0.70
      nameservers:
      - 172.18.176.6
    
    Specification fields of the Subnet object

    Parameter

    Description

    cidr (singular)

    A valid IPv4 CIDR, for example, 10.11.0.0/24.

    includeRanges (list)

    A list of IP address ranges within the given CIDR that should be used in the allocation of IPs for nodes (excluding the gateway address). The IPs outside the given ranges will not be used in the allocation. Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77. In the example above, the addresses 10.11.0.5-10.11.0.70 (excluding the gateway address 10.11.0.9) will be allocated for nodes. The includeRanges parameter is mutually exclusive with excludeRanges.

    excludeRanges (list)

    A list of IP address ranges within the given CIDR that should not be used in the allocation of IPs for nodes. The IPs within the given CIDR but outside the given ranges will be used in the allocation (excluding gateway address). Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77. The excludeRanges parameter is mutually exclusive with includeRanges.

    useWholeCidr (boolean)

    If set to true, the subnet address (10.11.0.0 in the example above) and the broadcast address (10.11.0.255 in the example above) are included in the address allocation for nodes. Otherwise (false by default), the subnet address and broadcast address will be excluded from the address allocation.

    gateway (singular)

    A valid gateway address, for example, 10.11.0.9.

    nameservers (list)

    A list of the IP addresses of name servers. Each element of the list is a single address, for example, 172.18.176.6.

    Caution

    The subnet for the PXE network is automatically created during deployment and must contain the ipam/DefaultSubnet: "1" label. Each bare metal region must have only one subnet with this label.

  3. Verify that the subnet is successfully created:

    kubectl get subnet kaas-mgmt -oyaml
    

    In the system output, verify the status fields of the Subnet object using the table below.

    Status fields of the Subnet object

    Parameter

    Description

    statusMessage

    Contains a short state description and a more detailed one if applicable. The short status values are as follows:

    • OK - operational.

    • ERR - non-operational. This status has a detailed description, for example, ERR: Wrong includeRange for CIDR….

    cidr

    Reflects the actual CIDR, has the same meaning as spec.cidr.

    gateway

    Reflects the actual gateway, has the same meaning as spec.gateway.

    nameservers

    Reflects the actual name servers, has the same meaning as spec.nameservers.

    ranges

    Specifies the address ranges that are calculated using the fields from spec: cidr, includeRanges, excludeRanges, gateway, useWholeCidr. These ranges are directly used to allocate IP addresses for nodes.

    lastUpdate

    Includes the date and time of the latest update of the Subnet CR.

    allocatable

    Includes the number of currently available IP addresses that can be allocated for nodes from the subnet.

    allocatedIPs

    Specifies the list of IPv4 addresses with the corresponding IPaddr object IDs that were already allocated from the subnet.

    capacity

    Contains the total number of IP addresses held by ranges, which equals the sum of the allocatable and allocatedIPs parameters values.

    versionIpam

    Contains the version of the kaas-ipam component that made the latest changes to the Subnet CR.

    Example of a successfully created subnet:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      labels:
        ipam/UID: 6039758f-23ee-40ba-8c0f-61c01b0ac863
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
      name: kaas-mgmt
      namespace: default
    spec:
      cidr: 10.0.0.0/24
      excludeRanges:
      - 10.0.0.100
      - 10.0.0.101-10.0.0.120
      gateway: 10.0.0.1
      includeRanges:
      - 10.0.0.50-10.0.0.90
      nameservers:
      - 172.18.176.6
    status:
      allocatable: 38
      allocatedIPs:
      - 10.0.0.50:0b50774f-ffed-11ea-84c7-0242c0a85b02
      - 10.0.0.51:1422e651-ffed-11ea-84c7-0242c0a85b02
      - 10.0.0.52:1d19912c-ffed-11ea-84c7-0242c0a85b02
      capacity: 41
      cidr: 10.0.0.0/24
      gateway: 10.0.0.1
      lastUpdate: "2020-09-26T11:40:44Z"
      nameservers:
      - 172.18.176.6
      ranges:
      - 10.0.0.50-10.0.0.90
      statusMessage: OK
      versionIpam: v3.0.999-20200807-130909-44151f8
    
  4. Proceed to creating an L2 template for one or multiple managed clusters as described in Create L2 templates.

Automate multiple subnet creation using SubnetPool

Caution

This feature is available starting from the Container Cloud release 2.2.0.

Before creating an L2 template, ensure that you have the required subnets that can be used in the L2 template to allocate IP addresses for the managed cluster nodes. You can also create multiple subnets using the SubnetPool object to separate different types of network traffic. SubnetPool allows for automatic creation of Subnet objects that will consume blocks from the parent SubnetPool CIDR IP address range. The SubnetPool blockSize setting defines the IP address block size to allocate to each child Subnet. SubnetPool has a global scope, so any SubnetPool can be used to create the Subnet objects for any namespace and for any cluster.

To automate multiple subnet creation using SubnetPool:

  1. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  2. Create the subnetpool.yaml file with a number of subnet pools:

    Note

    You can define either or both subnets and subnet pools, depending on the use case. A single L2 template can use either or both subnets and subnet pools.

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> apply -f <SubnetFileName.yaml>
    

    Note

    In the command above and in the steps below, substitute the parameters enclosed in angle brackets with the corresponding values.

    Example of a subnetpool.yaml file:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: SubnetPool
    metadata:
      name: kaas-mgmt
      namespace: default
      labels:
        kaas.mirantis.com/provider: baremetal
        kaas.mirantis.com/region: region-one
    spec:
      cidr: 10.10.0.0/16
      blockSize: /25
      nameservers:
      - 172.18.176.6
      gatewayPolicy: first
    

    For the specification fields description of the SubnetPool object, see SubnetPool spec.

  3. Verify that the subnet pool is successfully created:

    kubectl get subnetpool kaas-mgmt -oyaml
    

    In the system output, verify the status fields of the SubnetPool object. For the status fields description, see SubnetPool status.

  4. Proceed to creating an L2 template for one or multiple managed clusters as described in Create L2 templates. In this procedure, select the exemplary L2 template for multiple subnets that contains the l3Layout section.

    Caution

    If you use the l3Layout section, define all subnets of a cluster in it; defining only a part of the subnets is not allowed. Otherwise, do not use the l3Layout section at all.

Create L2 templates

After you create subnets for one or more managed clusters or projects as described in Create subnets or Automate multiple subnet creation using SubnetPool, follow the procedure below that contains exemplary L2 templates for the following use cases:

L2 template example with bonds and bridges

This section contains an exemplary L2 template that demonstrates how to set up bonds and bridges on hosts for your managed clusters as described in Create L2 templates.

Example of an L2 template with interfaces bonding:

apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  name: test-managed
  namespace: managed-ns
spec:
  clusterRef: child-cluster
  autoIfMappingPrio:
    - provision
    - eno
    - ens
    - enp
  npTemplate: |
    version: 2
    ethernets:
      ten10gbe0s0:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 2}}
        set-name: {{nic 2}}
      ten10gbe0s1:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 3}}
        set-name: {{nic 3}}
    bonds:
      bond0:
        interfaces:
          - ten10gbe0s0
          - ten10gbe0s1
    bridges:
      bm-ceph:
        interfaces: [bond0]
        addresses:
          - {{ip "bm-ceph:demo"}}

L2 template example for automatic multiple subnet creation

Caution

This feature is available starting from the Container Cloud release 2.2.0.

This section contains an exemplary L2 template for automatic multiple subnet creation as described in Automate multiple subnet creation using SubnetPool. This template also contains the l3Layout section that allows defining the Subnet scopes and enables optional auto-creation of the Subnet objects from the SubnetPool objects.

For details on how to create L2 templates, see Create L2 templates.

Example of an L2 template for multiple subnets:

apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  name: test-managed
  namespace: managed-ns
spec:
  clusterRef: child-cluster
  autoIfMappingPrio:
    - provision
    - eno
    - ens
    - enp
  l3Layout:
    - subnetName: pxe-subnet
      scope:      global
    - subnetName: subnet-1
      subnetPool: kaas-mgmt
      scope:      namespace
    - subnetName: subnet-2
      subnetPool: kaas-mgmt
      scope:      cluster
  npTemplate: |
    version: 2
    ethernets:
      onboard1gbe0:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 0}}
        set-name: {{nic 0}}
        addresses:
          - {{ip "0:pxe-subnet"}}
        nameservers:
          addresses: {{nameservers_from_subnet "pxe-subnet"}}
        gateway4: {{gateway_from_subnet "pxe-subnet"}}
      onboard1gbe1:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 1}}
        set-name: {{nic 1}}
      ten10gbe0s0:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 2}}
        set-name: {{nic 2}}
        addresses:
          - {{ip "2:subnet-1"}}
      ten10gbe0s1:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 3}}
        set-name: {{nic 3}}
        addresses:
          - {{ip "3:subnet-2"}}

In the template above, the following networks are defined in the l3Layout section:

  • pxe-subnet - global PXE network that already exists. A subnet name must refer to the PXE subnet created for the region.

  • subnet-1 - unless already created, this subnet will be created from the kaas-mgmt subnet pool. The subnet name must be unique within the project. This subnet is shared between the project clusters.

  • subnet-2 - will be created from the kaas-mgmt subnet pool. This subnet has the cluster scope. Therefore, the real name of the Subnet CR object consists of the subnet name defined in l3Layout and the cluster UID. But the npTemplate section of the L2 template must contain only the subnet name defined in l3Layout. The subnets of the cluster scope are not shared between clusters.

Caution

If you use the l3Layout section, define all subnets of a cluster in it; defining only a part of the subnets is not allowed. Otherwise, do not use the l3Layout section at all.


To create an L2 template for a new managed cluster:

  1. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  2. Inspect the existing L2 templates to select the one that fits your deployment:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    get l2template -n <ProjectNameForNewManagedCluster>
    
  3. Create an L2 YAML template specific to your deployment using one of the exemplary templates:

  4. Add or edit the mandatory parameters in the new L2 template. The following tables provide the description of the mandatory and the l3Layout section parameters in the example templates mentioned in the previous step.

    L2 template mandatory parameters

    Parameter

    Description

    clusterRef

    References the Cluster object that this template is applied to. The default value is used to apply the given template to all clusters in the corresponding project, unless an L2 template that references a specific cluster name exists.

    Caution

    • A cluster can be associated with only one template.

    • An L2 template must have the same namespace as the referenced cluster.

    • A project can have only one default L2 template.

    ifMapping or autoIfMappingPrio

    • ifMapping is a list of interface names for the template. The interface mapping is defined globally for all bare metal hosts in the cluster but can be overridden at the host level, if required, by editing the IpamHost object for a particular host.

    • autoIfMappingPrio is a list of prefixes, such as eno, ens, and so on, to match the interfaces to automatically create a list for the template. If you are not aware of any specific ordering of interfaces on the nodes, use the default ordering from the Predictable Network Interface Names specification for systemd. You can also override the default NIC list per host using the IfMappingOverride parameter of the corresponding IpamHost. The provision value corresponds to the network interface that was used to provision a node. Usually, it is the first NIC found on a particular node. It is defined explicitly to ensure that this interface will not be reconfigured accidentally.

    npTemplate

    A netplan-compatible configuration with special lookup functions that defines the networking settings for the cluster hosts, where physical NIC names and details are parameterized. This configuration will be processed using Go templates. Instead of specifying IP and MAC addresses, interface names, and other network details specific to a particular host, the template supports use of special lookup functions. These lookup functions, such as nic, mac, ip, and so on, return host-specific network information when the template is rendered for a particular host. For details about netplan, see the official netplan documentation.

    Caution

    All rules and restrictions of the netplan configuration also apply to L2 templates. For details, see the official netplan documentation.

    l3Layout section parameters

    Parameter

    Description

    subnetName

    Name of the Subnet object that will be used in the npTemplate section to allocate IP addresses from. All Subnet names must be unique within a single L2 template.

    subnetPool

    Optional. Default: none. Name of the parent SubnetPool object that will be used to create a Subnet object with a given subnetName and scope. If a corresponding Subnet object already exists, nothing will be created and the existing object will be used. If no SubnetPool is provided, no new Subnet object will be created.

    scope

    Logical scope of the Subnet object with a corresponding subnetName. Possible values:

    • global - the Subnet object is accessible globally, for any Container Cloud project and cluster in the region, for example, the PXE subnet.

    • namespace - the Subnet object is accessible within the same project and region where the L2 template is defined.

    • cluster - the Subnet object is only accessible to the cluster that L2Template.spec.clusterRef refers to. The Subnet objects with the cluster scope will be created for every new cluster.

    The following table describes the main lookup functions for an L2 template.

    Lookup function

    Description

    {{nic N}}

    Name of a NIC number N. NIC numbers correspond to the interface mapping list.

    {{mac N}}

    MAC address of a NIC number N registered during a host hardware inspection.

    {{ip "N:subnet-a"}}

    IP address and mask for a NIC number N. The address will be auto-allocated from the given subnet if the address does not exist yet.

    {{ip "br0:subnet-x"}}

    IP address and mask for a virtual interface, "br0" in this example. The address will be auto-allocated from the given subnet if the address does not exist yet.

    {{gateway_from_subnet "subnet-a"}}

    IPv4 default gateway address from the given subnet.

    {{nameservers_from_subnet "subnet-a"}}

    List of the IP addresses of name servers from the given subnet.

    Note

    Every subnet referenced in an L2 template can have either a global or namespaced scope. In the latter case, the subnet must exist in the same project where the corresponding cluster and L2 template are located.

  5. Add the L2 template to your management cluster:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> apply -f <pathToL2TemplateYamlFile>
    
  6. If required, further modify the template:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    -n <ProjectNameForNewManagedCluster> edit l2template <L2templateName>
    
  7. Proceed with creating a managed cluster as described in Create a managed cluster. The resulting L2 template will be used to render the netplan configuration for the managed cluster machines.


The workflow of the netplan configuration using an L2 template is as follows:

  1. The kaas-ipam service uses the data from BareMetalHost, the L2 template, and subnets to generate the netplan configuration for every cluster machine.

  2. The generated netplan configuration is saved in the status.netconfigV2 section of the IpamHost resource. If the status.l2RenderResult field of IpamHost is OK, the configuration was rendered successfully. Otherwise, the status contains an error message. See the verification example after this list.

  3. The baremetal-provider service copies data from the status.netconfigV2 of IpamHost to the Spec.StateItemsOverwrites['deploy']['bm_ipam_netconfigv2'] parameter of LCMMachine.

  4. The lcm-agent service on every host synchronizes the LCMMachine data to its host. The lcm-agent service runs a playbook to update the netplan configuration on the host during the pre-download and deploy phases.
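
For example, to check whether the netplan configuration was rendered successfully for a particular host, you can inspect the IpamHost status fields described above. This is a sketch that assumes kubectl access to the management cluster kubeconfig:

kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
-n <ProjectName> get ipamhost <IpamHostName> -o jsonpath='{.status.l2RenderResult}'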

Create a custom bare metal host profile

The bare metal host profile is a Kubernetes custom resource. It allows the operator to define how the storage devices and the operating system are provisioned and configured.

This section describes the bare metal host profile default settings and configuration of custom profiles for managed clusters using Mirantis Container Cloud API. This procedure also applies to a management cluster with a few differences described in Deployment Guide: Customize the default bare metal host profile.

Default configuration of the host system storage

The default host profile requires three storage devices in the following strict order:

  1. Boot device and operating system storage

    This device contains boot data and operating system data. It is partitioned using the GUID Partition Table (GPT) labels. The root file system is an ext4 file system created on top of an LVM logical volume. For a detailed layout, refer to the table below.

  2. Local volumes device

    This device contains an ext4 file system with directories mounted as persistent volumes to Kubernetes. These volumes are used by the Mirantis Container Cloud services to store their data, including monitoring and identity databases.

  3. Ceph storage device

    This device is used as a Ceph datastore or Ceph OSD.

The following table summarizes the default configuration of the host system storage set up by the Container Cloud bare metal management.

Default configuration of the bare metal host storage

Device/partition

Name/Mount point

Recommended size

Description

/dev/sda1

bios_grub

4 MiB

The mandatory GRUB boot partition required for non-UEFI systems.

/dev/sda2

UEFI -> /boot/efi

0.2 GiB

The boot partition required for the UEFI boot mode.

/dev/sda3

config-2

64 MiB

The mandatory partition for the cloud-init configuration. Used during the first host boot for initial configuration.

/dev/sda4

lvm_root_part

100% of the remaining free space in the LVM volume group

The main LVM physical volume that is used to create the root file system.

/dev/sdb

lvm_lvp_part -> /mnt/local-volumes

100% of the remaining free space in the LVM volume group

The LVM physical volume that is used to create the file system for LocalVolumeProvisioner.

/dev/sdc

-

100% of the remaining free space in the LVM volume group

Clean raw disk that will be used for the Ceph storage back end.

If required, you can customize the default host storage configuration. For details, see Create a custom host profile.

Create a custom host profile

In addition to the default BareMetalHostProfile object installed with Mirantis Container Cloud, you can create custom profiles for managed clusters using Container Cloud API.

Note

The procedure below also applies to the Container Cloud management clusters.

To create a custom bare metal host profile:

  1. Select from the following options:

    • For a management cluster, log in to the bare metal seed node that will be used to bootstrap the management cluster.

    • For a managed cluster, log in to the local machine where your management cluster kubeconfig is located and where kubectl is installed.

      Note

      The management cluster kubeconfig is created automatically during the last stage of the management cluster bootstrap.

  2. Select from the following options:

    • For a management cluster, open templates/bm/baremetalhostprofiles.yaml.template for editing.

    • For a managed cluster, create a new bare metal host profile under the templates/bm/ directory.

  3. Edit the host profile using the example template below to meet your hardware configuration requirements:

    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHostProfile
    metadata:
      name: <PROFILE_NAME>
      namespace: <PROJECT_NAME>
    spec:
      devices:
      # From the HW node, obtain the first device whose size is at least 60 GiB
      - device:
          minSizeGiB: 60
          wipe: true
        partitions:
        - name: bios_grub
          partflags:
          - bios_grub
          sizeGiB: 0.00390625
          wipe: true
        - name: uefi
          partflags:
          - esp
          sizeGiB: 0.2
          wipe: true
        - name: config-2
          sizeGiB: 0.0625
          wipe: true
        - name: lvm_root_part
          sizeGiB: 0
          wipe: true
      # From the HW node, obtain the second device whose size is at least 30 GiB
      # If a device exists but does not fit the size,
      # the BareMetalHostProfile will not be applied to the node
      - device:
          minSizeGiB: 30
          wipe: true
      # From the HW node, obtain the disk device with the exact name
      - device:
          byName: /dev/nvme0n1
          minSizeGiB: 30
          wipe: true
        partitions:
        - name: lvm_lvp_part
          sizeGiB: 0
          wipe: true
      # Example of wiping a device without partitioning it.
      # Mandatory for the case when a disk is supposed to be used for Ceph back end
      # later
      - device:
          byName: /dev/sde
          wipe: true
      fileSystems:
      - fileSystem: vfat
        partition: config-2
      - fileSystem: vfat
        mountPoint: /boot/efi
        partition: uefi
      - fileSystem: ext4
        logicalVolume: root
        mountPoint: /
      - fileSystem: ext4
        logicalVolume: lvp
        mountPoint: /mnt/local-volumes/
      logicalVolumes:
      - name: root
        sizeGiB: 0
        vg: lvm_root
      - name: lvp
        sizeGiB: 0
        vg: lvm_lvp
      postDeployScript: |
        #!/bin/bash -ex
        echo $(date) 'post_deploy_script done' >> /root/post_deploy_done
      preDeployScript: |
        #!/bin/bash -ex
        echo $(date) 'pre_deploy_script done' >> /root/pre_deploy_done
      volumeGroups:
      - devices:
        - partition: lvm_root_part
        name: lvm_root
      - devices:
        - partition: lvm_lvp_part
        name: lvm_lvp
      grubConfig:
        defaultGrubOptions:
        - GRUB_DISABLE_RECOVERY="true"
        - GRUB_PRELOAD_MODULES=lvm
        - GRUB_TIMEOUT=20
      kernelParameters:
        sysctl:
          kernel.panic: "900"
          kernel.dmesg_restrict: "1"
          kernel.core_uses_pid: "1"
          fs.file-max: "9223372036854775807"
          fs.aio-max-nr: "1048576"
          fs.inotify.max_user_instances: "4096"
          vm.max_map_count: "262144"
    
  4. Add or edit the mandatory parameters in the new BareMetalHostProfile object. For the parameters description, see API: BareMetalHostProfile spec.

  5. Select from the following options:

    • For a management cluster, proceed with the cluster bootstrap procedure as described in Deployment Guide: Bootstrap a management cluster.

    • For a managed cluster:

      1. Add the bare metal host profile to your management cluster:

        kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <projectName> apply -f <pathToBareMetalHostProfileFile>
        
      2. If required, further modify the host profile:

        kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <projectName> edit baremetalhostprofile <hostProfileName>
        
      3. Proceed with creating a managed cluster as described in Create a managed cluster.

Enable huge pages in a host profile

The BareMetalHostProfile API allows configuring a host to use the huge pages feature of the Linux kernel on managed clusters.

Note

Huge pages is a Linux kernel memory management feature. With huge pages enabled, the kernel allocates RAM in larger chunks, or pages. This allows a KVM (kernel-based virtual machine) and the VMs running on it to use the host RAM more efficiently and improves the performance of VMs.

To enable huge pages in a custom bare metal host profile for a managed cluster:

  1. Log in to the local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created automatically during the last stage of the management cluster bootstrap.

  2. Open for editing or create a new bare metal host profile under the templates/bm/ directory.

  3. Edit the grubConfig section of the host profile spec using the example below to configure the kernel boot parameters and enable huge pages:

    spec:
      grubConfig:
        defaultGrubOptions:
        - GRUB_DISABLE_RECOVERY="true"
        - GRUB_PRELOAD_MODULES=lvm
        - GRUB_TIMEOUT=20
        - GRUB_CMDLINE_LINUX_DEFAULT="hugepagesz=1G hugepages=N"
    

    The example configuration above allocates N huge pages of 1 GB each at server boot. The last hugepagesz parameter value is used as the default unless default_hugepagesz is defined. For details about possible values, see the official Linux kernel documentation.
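
    Once a machine is deployed with this profile, you can optionally verify the allocation directly on the node. This is a generic Linux check rather than a Container Cloud-specific step; the 1 GB pool is also reported under /sys/kernel/mm/hugepages/:

    grep -i huge /proc/meminfo
    cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages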

  4. Add the bare metal host profile to your management cluster:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <projectName> apply -f <pathToBareMetalHostProfileFile>
    
  5. If required, further modify the host profile:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <projectName> edit baremetalhostprofile <hostProfileName>
    
  6. Proceed with creating a managed cluster as described in Create a managed cluster.

Create and operate an OpenStack-based managed cluster

After bootstrapping your OpenStack-based Mirantis Container Cloud management cluster as described in Deployment Guide: Deploy an OpenStack-based management cluster, you can create the OpenStack-based managed clusters using the Container Cloud web UI.

Create a managed cluster

This section describes how to create an OpenStack-based managed cluster using the Mirantis Container Cloud web UI of the OpenStack-based management cluster.

To create an OpenStack-based managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the SSH Keys tab, click Add SSH Key to upload the public SSH key that will be used for the OpenStack VMs creation.

  4. In the Credentials tab:

    1. Click Add Credential to add your OpenStack credentials. You can either upload your OpenStack clouds.yaml configuration file or fill in the fields manually.

    2. Available since 2.1.0 Verify that the new credentials status is Ready. If the status is Error, hover over the status to determine the cause of the issue.

  5. In the Clusters tab, click Create Cluster and fill out the form with the following parameters as required:

    1. Configure general settings and the Kubernetes parameters:

      Managed cluster configuration

      Section

      Parameter

      Description

      General settings

      Name

      Cluster name

      Provider

      Select OpenStack

      Provider credential

      From the drop-down list, select the OpenStack credentials name that you created in the previous step.

      Release version

      The Container Cloud version.

      SSH keys

      From the drop-down list, select the SSH key name that you have previously added for SSH access to VMs.

      Provider

      External network

      Type of the external network in the OpenStack cloud provider.

      DNS name servers

      Comma-separated list of the DNS host IP addresses for the OpenStack VMs configuration.

      Kubernetes

      Node CIDR

      The Kubernetes nodes CIDR block. For example, 10.10.10.0/24.

      Services CIDR blocks

      The Kubernetes Services CIDR block. For example, 10.233.0.0/18.

      Pods CIDR blocks

      The Kubernetes Pods CIDR block. For example, 10.233.64.0/18.

    2. Configure StackLight:

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable StackLight

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration or Configure StackLight.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

      Multiserver Mode

      Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see Reference Architecture: StackLight deployment architecture.

      Elasticsearch

      Retention Time

      The Elasticsearch logs retention period in Logstash.

      Persistent Volume Claim Size

      The Elasticsearch persistent volume claim size.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: Available StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  6. Click Create.

    To view the deployment status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the deployment is complete.

  7. Proceed with Add a machine.

Add a machine

After you create a new OpenStack-based Mirantis Container Cloud managed cluster as described in Create a managed cluster, proceed with adding machines to this cluster using the Container Cloud web UI.

You can also use the instruction below to scale up an existing managed cluster.

To add a machine to an OpenStack-based managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with Machines list opens.

  4. On the cluster page, click Create Machine.

  5. Fill out the form with the following parameters as required:

    Container Cloud machine configuration

    Parameter

    Description

    Count

    Specify the number of machines to create.

    The required minimum number of machines is three for the manager nodes HA and two for the Container Cloud workloads.

    Select Manager or Worker to create a Kubernetes manager or worker node.

    Flavor

    From the drop-down list, select the required hardware configuration for the machine. The list of available flavors corresponds to the one in your OpenStack environment.

    For the hardware requirements, see: Reference Architecture: Requirements for an OpenStack-based cluster.

    Image

    From the drop-down list, select the cloud image with Ubuntu 18.04. If you do not have this image in the list, add it to your OpenStack environment using the Horizon web UI by downloading the image from the Ubuntu official website.

    Availability zone

    From the drop-down list, select the availability zone from which the new machine will be launched.

    Node Labels Available since 2.1.0

    Select the required node labels for the machine to run certain components on a specific node. For example, for the StackLight nodes that run Elasticsearch and require more resources than a standard node, select the StackLight label. The list of available node labels is obtained from your current Cluster release.

    Caution

    If you deploy StackLight in the HA mode (recommended), add the StackLight label to a minimum of three nodes.

    Note

    You can configure node labels after deploying a machine. On the Machines page, click the More action icon in the last column of the required machine field and select Configure machine.

  6. Click Create.

  7. Repeat the steps above for the remaining machines.

    To view the deployment status, monitor the machines status in the Managers and Workers columns on the Clusters page. Once the status changes to Ready, the deployment is complete. For the statuses description, see Reference Architecture: LCM controller.

  8. Verify the status of the cluster nodes as described in Connect to a Mirantis Container Cloud cluster.

Warning

The operational managed cluster should contain a minimum of 3 Kubernetes manager nodes and 2 Kubernetes worker nodes. To meet the etcd quorum and prevent a deployment failure, scaling down the manager nodes is prohibited.

See also

Delete a machine

Delete a managed cluster

Deleting a managed cluster does not require a preliminary deletion of VMs that run on this cluster.

To delete an OpenStack-based managed cluster:

  1. Log in to the Mirantis Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required cluster and select Delete.

  4. Verify the list of machines to be removed. Confirm the deletion.

    Deleting a cluster automatically frees up the resources allocated for this cluster, for example, instances, load balancers, networks, floating IPs.

  5. If the cluster deletion hangs and the The cluster is being deleted message does not disappear for a while:

    1. Expand the menu of the tab with your username.

    2. Click Download kubeconfig to download kubeconfig of your management cluster.

    3. Log in to any local machine with kubectl installed.

    4. Copy the downloaded kubeconfig to this machine.

    5. Run the following command:

      kubectl --kubeconfig <KUBECONFIG_PATH> edit -n <PROJECT_NAME> cluster <MANAGED_CLUSTER_NAME>
      
    6. In the Cluster object that opened for editing, remove the following lines:

      finalizers:
      - cluster.cluster.k8s.io
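
      Alternatively, you can remove the finalizer with a single command instead of editing the object interactively. The following is a sketch using the same placeholders as in the previous step:

      kubectl --kubeconfig <KUBECONFIG_PATH> -n <PROJECT_NAME> patch cluster <MANAGED_CLUSTER_NAME> \
      --type merge -p '{"metadata":{"finalizers":null}}'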
      
  6. If you are going to remove the associated regional cluster or if you do not plan to reuse the credentials of the deleted cluster, delete them:

    1. In the Credentials tab, verify that the required credentials are not in the In Use status.

    2. Click the Delete credential action icon next to the name of the credentials to be deleted.

    3. Confirm the deletion.

    Warning

    You can delete credentials only after deleting the managed cluster they relate to.

Create and operate an AWS-based managed cluster

After bootstrapping your AWS-based Mirantis Container Cloud management cluster as described in Deployment Guide: Deploy an AWS-based management cluster, you can create the AWS-based managed clusters using the Container Cloud web UI.

Create a managed cluster

This section describes how to create an AWS-based managed cluster using the Mirantis Container Cloud web UI of the AWS-based management cluster.

To create an AWS-based managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the SSH Keys tab, click Add SSH Key to upload the public SSH key that will be configured on each AWS instance to provide user access.

  4. In the Credentials tab:

    1. Click Add Credential and fill in the required fields to add your AWS credentials.

    2. Available since 2.1.0 Verify that the new credentials status is Ready. If the status is Error, hover over the status to determine the cause of the issue.

  5. In the Clusters tab, click Create Cluster and fill out the form with the following parameters as required:

    1. Configure general settings and the Kubernetes parameters:

      Managed cluster configuration

      Section

      Parameter

      Description

      General settings

      Name

      Cluster name

      Provider

      Select AWS

      Provider credential

      From the drop-down list, select the previously created AWS credentials name.

      Release version

      The Container Cloud version.

      SSH keys

      From the drop-down list, select the SSH key name that you have previously added for SSH access to VMs.

      Provider

      AWS region

      From the drop-down list, select the AWS Region for the managed cluster. For example, us-east-2.

      Kubernetes

      Services CIDR blocks

      The Kubernetes Services CIDR block. For example, 10.233.0.0/18.

      Pods CIDR blocks

      The Kubernetes Pods CIDR block. For example, 10.233.64.0/18.

    2. Configure StackLight:

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable StackLight

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration or Configure StackLight.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

      Multiserver Mode

      Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see Reference Architecture: StackLight deployment architecture.

      Elasticsearch

      Retention Time

      The Elasticsearch logs retention period in Logstash.

      Persistent Volume Claim Size

      The Elasticsearch persistent volume claim size.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: Available StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  6. Click Create.

    To view the deployment status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the deployment is complete.

  7. Proceed with Add a machine.

Add a machine

After you create a new AWS-based managed cluster as described in Create a managed cluster, proceed with adding machines to this cluster using the Mirantis Container Cloud web UI.

You can also use the instruction below to scale up an existing managed cluster.

To add a machine to an AWS-based managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with the Machines list opens.

  4. Click Create Machine.

  5. Fill out the form with the following parameters as required:

    Container Cloud machine configuration

    Parameter

    Description

    Count

    Specify the number of machines to create.

    The required minimum number of machines is three for the manager nodes HA and two for the Container Cloud workloads.

    Select Manager or Worker to create a Kubernetes manager or worker node.

    Instance type

    From the drop-down list, select the required AWS instance type. For production deployments, Mirantis recommends:

    • c5d.2xlarge for worker nodes

    • c5d.4xlarge for manager nodes

    • r5.4xlarge for nodes where the StackLight server components run

    For more details about requirements, see Reference architecture: AWS system requirements.

    AMI ID

    From the drop-down list, select the required AMI ID of Ubuntu 18.04. For example, ubuntu-bionic-18.04-amd64-server-20200729.

    Root device size

    Select the required root device size, 40 by default.

    Node Labels Available since 2.1.0

    Select the required node labels for the machine to run certain components on a specific node. For example, for the StackLight nodes that run Elasticsearch and require more resources than a standard node, select the StackLight label. The list of available node labels is obtained from your current Cluster release.

    Caution

    If you deploy StackLight in the HA mode (recommended), add the StackLight label to a minimum of three nodes.

    Note

    You can configure node labels after deploying a machine. On the Machines page, click the More action icon in the last column of the required machine field and select Configure machine.

  6. Click Create.

  7. Repeat the steps above for the remaining machines.

    To view the deployment status, monitor the machines status in the Managers and Workers columns on the Clusters page. Once the status changes to Ready, the deployment is complete. For the statuses description, see Reference Architecture: LCM controller.

  8. Verify the status of the cluster nodes as described in Connect to a Mirantis Container Cloud cluster.

Warning

The operational managed cluster should contain a minimum of 3 Kubernetes manager nodes and 2 Kubernetes worker nodes. To meet the etcd quorum and prevent a deployment failure, scaling down the manager nodes is prohibited.

See also

Delete a machine

Delete a managed cluster

Deleting a managed cluster does not require a preliminary deletion of VMs that run on this cluster.

To delete an AWS-based managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required cluster and select Delete.

  4. Verify the list of machines to be removed. Confirm the deletion.

    Deleting a cluster automatically removes the Amazon Virtual Private Cloud (VPC) connected with this cluster and frees up the resources allocated for this cluster, for example, instances, load balancers, networks, floating IPs.

  5. If you are going to remove the associated regional cluster or if you do not plan to reuse the credentials of the deleted cluster, delete them:

    1. In the Credentials tab, verify that the required credentials are not in the In Use status.

    2. Click the Delete credential action icon next to the name of the credentials to be deleted.

    3. Confirm the deletion.

    Warning

    You can delete credentials only after deleting the managed cluster they relate to.

Create and operate a VMWare vSphere-based managed cluster

Caution

This feature is available as Technology Preview. Use such configuration for testing and evaluation purposes only. For details about the Mirantis Technology Preview support scope, see the Preface section of this guide.

Caution

This feature is available starting from the Container Cloud release 2.2.0.

After bootstrapping your VMWare vSphere-based Mirantis Container Cloud management cluster as described in Deployment Guide: Deploy a VMWare vSphere-based management cluster, you can create VMWare vSphere-based managed clusters using the Container Cloud web UI.

Create a managed cluster

This section describes how to create a VMWare vSphere-based managed cluster using the Mirantis Container Cloud web UI of the VMWare vSphere-based management cluster.

To create a VMWare vSphere-based managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the SSH Keys tab, click Add SSH Key to upload the public SSH key that will be used for the VMWare vSphere VMs creation.

  4. In the Credentials tab:

    1. Click Add Credential to add your VMWare vSphere credentials. You can either upload your VMWare vSphere vsphere.yaml configuration file or fill in the fields manually.

    2. Verify that the new credentials status is Ready. If the status is Error, hover over the status to determine the cause of the issue.

  5. In the RHEL Licenses tab, click Add RHEL License and fill out the form with the following parameters:

    RHEL license parameters

    Parameter

    Description

    RHEL License Name

    RHEL license name

    Username

    User name to access the RHEL license

    Password

    Password to access the RHEL license

    Pool IDs

    Optional. Specify the pool IDs for RHEL licenses for Virtual Datacenters. Otherwise, Subscription Manager selects a subscription from the list of those available and appropriate for the machines.

  6. In the Clusters tab, click Create Cluster and fill out the form with the following parameters as required:

    1. Configure general settings and Kubernetes parameters:

      Managed cluster configuration

      Section

      Parameter

      Description

      General settings

      Name

      Cluster name

      Provider

      Select VMWare vSphere

      Provider Credential

      From the drop-down list, select the VMWare vSphere credentials name that you have previously added.

      Release Version

      The Container Cloud version.

      SSH keys

      From the drop-down list, select the SSH key name that you have previously added for SSH access to VMs.

      Provider

      LB Host IP

      The IP address of the load balancer endpoint that will be used to access the Kubernetes API of the new cluster.

      LB address range

      The range of IP addresses that can be assigned to load balancers for Kubernetes Services.

      Kubernetes

      Node CIDR

      The Kubernetes nodes CIDR block. For example, 10.10.10.0/24.

      Services CIDR Blocks

      The Kubernetes Services CIDR block. For example, 10.233.0.0/18.

      Pods CIDR blocks

      The Kubernetes Pods CIDR block. For example, 10.233.64.0/18.

    2. Configure StackLight:

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable StackLight

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration or Configure StackLight.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

      Multiserver Mode

      Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see Reference Architecture: StackLight deployment architecture.

      Elasticsearch

      Retention Time

      The Elasticsearch logs retention period in Logstash.

      Persistent Volume Claim Size

      The Elasticsearch persistent volume claim size.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: Available StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  7. Click Create.

    To view the deployment status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the deployment is complete.

  8. Proceed with Add a machine.

Add a machine

After you create a new VMWare vSphere-based Mirantis Container Cloud managed cluster as described in Create a managed cluster, proceed with adding machines to this cluster using the Container Cloud web UI.

You can also use the instruction below to scale up an existing managed cluster.

To add a machine to a VMWare vSphere-based managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with Machines list opens.

  4. On the cluster page, click Create Machine.

  5. Fill out the form with the following parameters as required:

    Container Cloud machine configuration

    Parameter

    Description

    Count

    Number of machines to create.

    The required minimum number of machines is three for the manager nodes HA and two for the Container Cloud workloads.

    Select Manager or Worker to create a Kubernetes manager or worker node.

    Template Path

    Path to the prepared OVF template.

    SSH Username

    SSH user name to access the node. Defaults to cloud-user.

    RHEL License

    From the drop-down list, select the RHEL license that you previously added for the cluster being deployed.

    Node Labels

    Select the required node labels for the machine to run certain components on a specific node. For example, for the StackLight nodes that run Elasticsearch and require more resources than a standard node, select the StackLight label. The list of available node labels is obtained from your current Cluster release.

    Caution

    If you deploy StackLight in the HA mode (recommended), add the StackLight label to a minimum of three nodes.

    Note

    You can configure node labels after deploying a machine. On the Machines page, click the More action icon in the last column of the required machine field and select Configure machine.

  6. Click Create.

  7. Repeat the steps above for the remaining machines.

    To view the deployment status, monitor the machines status in the Managers and Workers columns on the Clusters page. Once the status changes to Ready, the deployment is complete. For the statuses description, see Reference Architecture: LCM controller.

  8. Verify the status of the cluster nodes as described in Connect to a Mirantis Container Cloud cluster.

Warning

The operational managed cluster should contain a minimum of 3 Kubernetes manager nodes and 2 Kubernetes worker nodes. To meet the etcd quorum and prevent a deployment failure, scaling down the manager nodes is prohibited.

See also

Delete a machine

Delete a managed cluster

Deleting a managed cluster does not require a preliminary deletion of VMs that run on this cluster.

To delete a VMWare vSphere-based managed cluster:

  1. Log in to the Mirantis Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required cluster and select Delete.

  4. Verify the list of machines to be removed. Confirm the deletion.

  5. Deleting a cluster automatically turns off the machines but does not remove them. Therefore, clean up the hosts manually in the VMWare vSphere web UI. The machines will be automatically released from the RHEL subscription once the deletion succeeds.

  6. If you are going to remove the associated regional cluster or if you do not plan to reuse the credentials of the deleted cluster, delete them:

    1. In the Credentials tab, verify that the required credentials are not in the In Use status.

    2. Click the Delete credential action icon next to the name of the credentials to be deleted.

    3. Confirm the deletion.

    Warning

    You can delete credentials only after deleting the managed cluster they relate to.

Change a cluster configuration

After deploying a managed cluster, you can enable or disable StackLight and configure its parameters if enabled. Alternatively, you can configure StackLight through kubeconfig as described in Configure StackLight.

To change a cluster configuration:

  1. Log in to the Mirantis Container Cloud web UI with the writer permissions.

  2. Select the required project.

  3. On the Clusters page, click the More action icon in the last column of the required cluster and select Configure cluster.

  4. In the Configure cluster window, select or deselect StackLight and configure its parameters if enabled.

  5. Click Update to apply the changes.

Update a managed cluster

A Mirantis Container Cloud management cluster automatically upgrades to a new available Container Cloud release version that supports new Cluster releases. Once done, a newer version of a Cluster release becomes available for managed clusters that you update using the Container Cloud web UI.

Caution

Make sure to update the Cluster release version of your managed cluster before the current Cluster release version becomes unsupported by a new Container Cloud release version. Otherwise, Container Cloud stops auto-upgrade and eventually Container Cloud itself becomes unsupported.

This section describes how to update a managed cluster of any provider type using the Container Cloud web UI.

To update a managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click More action icon in the last column for each cluster and select Update cluster where available.

  4. In the Release Update window, select the required Cluster release to update your managed cluster to.

    The Description section contains the list of components versions to be installed with a new Cluster release. The release notes for each Container Cloud and Cluster release are available at Release Notes: Container Cloud releases and Release Notes: Cluster releases.

  5. Click Update.

    To view the update status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the update is complete.

Note

In rare cases, after a managed cluster upgrade, Grafana may stop working due to issues with helm-controller.

The development team is working on the issue, which will be addressed in an upcoming release.

Delete a machine

This section instructs you on how to scale down an existing managed cluster through the Mirantis Container Cloud web UI.

Warning

A machine with the manager node role cannot be deleted manually. A machine with such role is automatically deleted during the managed cluster deletion.

To delete a machine from a managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click on the required cluster name to open the list of machines running on it.

  4. Click the More action icon in the last column of the machine you want to delete and select Delete. Confirm the deletion.

Deleting a machine automatically frees up the resources allocated to this machine.

Warning

The operational managed cluster should contain a minimum of 3 Kubernetes manager nodes and 2 Kubernetes worker nodes. To meet the etcd quorum and prevent a deployment failure, scaling down the manager nodes is prohibited.

Operate management and regional clusters

The Mirantis Container Cloud web UI enables you to perform the following operations with the Container Cloud management and regional clusters:

  • View the cluster details (such as cluster ID, creation date, nodes count, and so on) as well as obtain a list of the cluster endpoints including the StackLight components, depending on your deployment configuration.

    To view generic cluster details, in the Clusters tab, click the More action icon in the last column of the required cluster and select Cluster info.

    Note

    • Adding more than 3 nodes or deleting nodes from a management or regional cluster is not supported.

    • Removing a management or regional cluster using the Container Cloud web UI is not supported. Use the dedicated cleanup script instead. For details, see Remove a management cluster and Remove a regional cluster.

    • Before removing a regional cluster, delete the credentials of the deleted managed clusters associated with the region.

  • Verify the current release version of the cluster including the list of installed components with their versions and the cluster release change log.

    To view a cluster release version details, in the Clusters tab, click the version in the Release column next to the name of the required cluster.

    A management cluster upgrade to a newer version is performed automatically once a new Container Cloud version is released. Regional clusters also upgrade automatically along with the management cluster. For more details about the Container Cloud release upgrade mechanism, see: Reference Architecture: Container Cloud release controller.

This section outlines the operations that can be performed with a management or regional cluster.

Remove a management cluster

This section describes how to remove a management cluster.

To remove a management cluster:

  1. Verify that you have successfully removed all managed clusters that run on top of the management cluster to be removed. For details, see the corresponding Delete a managed cluster section depending on your cloud provider in Operate managed clusters.

  2. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  3. Run the following script:

    bootstrap.sh cleanup
    

Note

Removing a management or regional cluster using the Container Cloud web UI is not supported.

Remove a regional cluster

This section describes how to remove a regional cluster.

To remove a regional cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the project with the managed clusters of the regional cluster to remove using the Switch Project action icon located on top of the main left-side navigation panel.

  3. Verify that you have successfully deleted all managed clusters that run on top of the regional cluster to be removed. For details, see the corresponding Delete a managed cluster section depending on your cloud provider in Operate managed clusters.

  4. Delete the credentials associated with the region:

    1. In the Credentials tab, click the first credentials name.

    2. In the window that opens, capture the Region Name field.

    3. Repeat the two previous steps for the remaining credentials in the list.

    4. Delete all credentials with the name of the region that you are going to remove.

  5. Log in to a local machine where your management and regional clusters kubeconfig files are located and where kubectl is installed.

    Note

    The management or regional cluster kubeconfig files are created during the last stage of the management or regional cluster bootstrap.

  6. Run the following script with the corresponding values of your cluster:

    REGIONAL_CLUSTER_NAME=<regionalClusterName> REGIONAL_KUBECONFIG=<pathToRegionalClusterKubeconfig> KUBECONFIG=<mgmtClusterKubeconfig> ./bootstrap.sh destroy_regional
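
    For example, assuming a regional cluster named region-one and kubeconfig files located in the current directory (all names are hypothetical):

    REGIONAL_CLUSTER_NAME=region-one REGIONAL_KUBECONFIG=./kubeconfig-region-one KUBECONFIG=./kubeconfig ./bootstrap.sh destroy_regional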
    

Note

Removing a management or regional cluster using the Container Cloud web UI is not supported.

Attach an existing Mirantis Kubernetes Engine cluster

Starting from Mirantis Kubernetes Engine (MKE) 3.3.3, you can attach an existing MKE cluster that is not deployed by Mirantis Container Cloud to a management cluster. This feature allows you to view the details of all your MKE clusters in one place, including cluster health, capacity, and usage.

For supported configurations of existing MKE clusters that are not deployed by Container Cloud, see Docker Enterprise Compatibility Matrix.

Note

Using the free Mirantis license, you can create up to three Container Cloud managed clusters with three worker nodes on each cluster. Within the same quota, you can also attach existing MKE clusters that are not deployed by Container Cloud. If you need to increase this quota, contact Mirantis support for further details.

Using the instruction below, you can also install StackLight to your existing MKE cluster during the attach procedure. For the StackLight system requirements, refer to the Reference Architecture: Requirements of the corresponding cloud provider.

You can also update all your MKE clusters to the latest version once your management cluster automatically upgrades to a newer Container Cloud version in which a new Cluster release with the latest MKE version is available. For details, see Update a managed cluster.

Caution

  • An MKE cluster can be attached to only one management cluster. Attaching a Container Cloud-based MKE cluster to another management cluster is not supported.

  • Due to development limitations, if you detach an MKE cluster that is not deployed by Container Cloud, the Helm controller and OIDC integration are not deleted.

  • Detaching a Container Cloud-based MKE cluster is not supported.

To attach an existing MKE cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, expand the Create Cluster menu and click Attach Existing MKE Cluster.

  4. In the wizard that opens, fill out the form with the following parameters as required:

    1. Configure general settings:

      MKE cluster configuration

      Section

      Parameter

      Description

      General Settings

      Cluster Name

      Specify the cluster name.

      Region

      Select the required cloud provider: OpenStack, AWS, or bare metal.

    2. Upload the MKE client bundle or fill in the fields manually. To download the MKE client bundle, refer to MKE user access: Download client certificates.

    3. Configure StackLight:

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable StackLight

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration or Configure StackLight.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

      Multiserver Mode

      Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see Reference Architecture: StackLight deployment architecture.

      Elasticsearch

      Retention Time

      The Elasticsearch logs retention period in Logstash.

      Persistent Volume Claim Size

      The Elasticsearch persistent volume claim size.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: Available StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  5. Click Create.

    To view the deployment status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the deployment is complete.

Connect to the Mirantis Kubernetes Engine web UI

After you deploy a new or attach an existing Mirantis Kubernetes Engine (MKE) cluster to a management cluster, start managing your cluster using the MKE web UI.

To connect to the MKE web UI:

  1. Log in to the Mirantis Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required MKE cluster and select Cluster info.

  4. In the dialog box with the cluster information, copy the MKE UI endpoint.

  5. Paste the copied endpoint into a web browser and log in using the same credentials that you use to access the Container Cloud web UI.

Warning

To ensure the Container Cloud stability in managing the Container Cloud-based MKE clusters, a number of MKE API functions are not available for the Container Cloud-based MKE clusters as compared to the attached MKE clusters that are not deployed by Container Cloud. Use the Container Cloud web UI or CLI for this functionality instead.

See Reference Architecture: MKE API limitations for details.

Caution

The MKE web UI contains help links that lead to the Docker Enterprise documentation suite. Besides MKE and Mirantis Container Runtime (MCR), which are integrated with Container Cloud, that documentation suite covers other Docker Enterprise components and therefore does not fully apply to the Container Cloud-based MKE clusters. To avoid misconceptions, before you proceed with the MKE web UI documentation, read Reference Architecture: MKE API limitations and make sure you are using the documentation of the supported MKE version as per the Release Compatibility Matrix.

Connect to a Mirantis Container Cloud cluster

After you deploy a Mirantis Container Cloud management or managed cluster, connect to the cluster to verify the availability and status of the nodes as described below.

This section also describes how to SSH to a node of a cluster where a Bastion host is used for SSH access, for example, an OpenStack-based management cluster or AWS-based management and managed clusters.

To connect to a managed cluster:

  1. Log in to the Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with the Machines list opens.

  4. Verify the status of the manager nodes. Once the first manager node is deployed and has the Ready status, the Download Kubeconfig option for the cluster being deployed becomes active.

  5. Open the Clusters tab.

  6. Click the More action icon in the last column of the required cluster and select Download Kubeconfig:

    1. Enter your user password.

    2. Not recommended. Select Offline Token to generate an offline IAM token. Otherwise, for security reasons, the kubeconfig token expires after 30 minutes of Container Cloud API idle time and you have to download kubeconfig again with a newly generated token.

    3. Click Download.

  7. Verify the availability of the managed cluster machines:

    1. Export the kubeconfig parameters to your local machine with access to kubectl. For example:

      export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml
      
    2. Obtain the list of available Container Cloud machines:

      kubectl get nodes -o wide
      

      The system response must contain the details of the nodes in the READY status.

To connect to a management cluster:

  1. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  2. Obtain the list of available management cluster machines:

    kubectl get nodes -o wide
    

    The system response must contain the details of the nodes in the READY status.

To SSH to a Container Cloud cluster node if Bastion is used:

  1. Obtain kubeconfig of the management or managed cluster as described in the procedures above.

  2. Obtain the internal IP address of a node you require access to:

    kubectl get nodes -o wide
    
  3. Obtain the Bastion public IP:

    kubectl get cluster -o jsonpath='{.status.providerStatus.bastion.publicIp}' \
    -n <project_name> <cluster_name>
    
  4. Run the following command:

    ssh -i <private_key> ubuntu@<node_internal_ip> -o "proxycommand ssh -W %h:%p \
    -i <private_key> ubuntu@<bastion_public_ip>"
    

    Substitute the parameters enclosed in angle brackets with the corresponding values of your cluster obtained in previous steps. The <private_key> for a management cluster is located at ~/.ssh/openstack_tmp. For a managed cluster, this is the SSH Key that you added in the Container Cloud web UI before the managed cluster creation.
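
    For example, for a management cluster node with the internal IP 10.10.10.15 behind a Bastion host with the public IP 172.16.10.5 (both addresses are hypothetical):

    ssh -i ~/.ssh/openstack_tmp ubuntu@10.10.10.15 -o "proxycommand ssh -W %h:%p \
    -i ~/.ssh/openstack_tmp ubuntu@172.16.10.5"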

Manage IAM

IAM CLI

IAM CLI is a user-facing command-line tool for managing scopes, roles, and grants. Using your personal credentials, you can perform different IAM operations through the iamctl tool. For example, you can verify the current status of the IAM service, request or revoke service tokens, verify your own grants within Mirantis Container Cloud as well as your token details.

Configure IAM CLI

The iamctl command-line interface uses the iamctl.yaml configuration file to interact with IAM.

To create the IAM CLI configuration file:

  1. Log in to the management cluster.

  2. Change the directory to one of the following:

    • $HOME/.iamctl

    • $HOME

    • $HOME/etc

    • /etc/iamctl

  3. Create iamctl.yaml with the following exemplary parameters and values that correspond to your deployment:

    server: <IAM_API_ADDRESS>
    timeout: 60
    verbose: 99 # Verbosity level, from 0 to 99
    
    tls:
        enabled: true
        ca: <PATH_TO_CA_BUNDLE>
    
    auth:
        issuer: <IAM_REALM_IN_KEYCLOAK>
        ca: <PATH_TO_CA_BUNDLE>
        client_id: iam
        client_secret:
    

    The <IAM_REALM_IN_KEYCLOAK> value has the <keycloak-url>/auth/realms/<realm-name> format, where <realm-name> defaults to iam.
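
    For example, with Keycloak available at https://keycloak.example.com (a hypothetical URL) and the default realm name, the auth section would contain the following issuer value:

    auth:
        issuer: https://keycloak.example.com/auth/realms/iam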

Available IAM CLI commands

Using iamctl, you can perform different role-based access control operations in your managed cluster. For example:

  • Grant or revoke access to a managed cluster and a specific user for troubleshooting

  • Grant or revoke access to a Mirantis Container Cloud project that contains several managed clusters

  • Create or delete tokens for the Container Cloud services with a specific set of grants as well as identify when a service token was last used

The following tables describe the available iamctl commands with their descriptions.

General commands

Usage

Description

iamctl --help, iamctl help

Output the list of available commands.

iamctl help <command>

Output the description of a specific command.

Account information commands

Usage

Description

iamctl account info

Output detailed account information such as user email, user name, the details of their active and offline sessions, tokens statuses and expiration dates.

iamctl account login

Log in the current user. The system prompts to enter your authentication credentials. After a successful login, your user token is added to the $HOME/.iamctl directory.

iamctl account logout

Log out the current user. Once done, the user information is removed from $HOME/.iamctl.

Scope commands

Usage

Description

iamctl scope list

List the IAM scopes available for the current environment.

Example output:

+---------------+--------------------------+
|     NAME      |   DESCRIPTION            |
+---------------+--------------------------+
| m:iam         | IAM scope                |
| m:kaas        | Container Cloud scope    |
| m:k8s:managed |                          |
| m:k8s         | Kubernetes scope         |
| m:cloud       | Cloud scope              |
+---------------+--------------------------+

iamctl scope list [prefix]

Output the scopes that match the specified prefix. For example: iamctl scope list m:k8s.

Role commands

Usage

Description

iamctl role list <scope>

List the roles for the specified scope in IAM.

iamctl role show <scope> <role>

Output the details of the specified scope role including the role name (admin, viewer, reader), its description, and an example of the grant command. For example: iamctl role show m:iam admin.

Grant commands

Usage

Description

iamctl grant give [username] [scope] [role]

Provide a user with a role in a scope. For example, the iamctl grant give jdoe m:iam admin command provides the IAM admin role in the m:iam scope to John Doe.

For the list of supported IAM scopes and roles, see: Role list.

Note

To lock or disable a user, use LDAP or Google OAuth depending on the external provider integrated to your deployment.

iamctl grant list <username>

List the grants provided to the specified user. For example: iamctl grant list jdoe.

Example output:

+--------+--------+---------------+
| SCOPE  |  ROLE  |   GRANT FQN   |
+--------+--------+---------------+
| m:iam  | admin  | m:iam@admin   |
| m:sl   | viewer | m:sl@viewer   |
| m:kaas | writer | m:kaas@writer |
+--------+--------+---------------+
  • m:iam@admin - admin rights in all IAM-related applications

  • m:sl@viewer - viewer rights in all StackLight-related applications

  • m:kaas@writer - writer rights in Container Cloud

iamctl grant revoke [username] [scope] [role]

Revoke the grants provided to the user.

Service token commands

Usage

Description

iamctl servicetoken list [--all]

List the details of all service tokens created by the current user. The output includes the following service token details:

  • ID

  • Alias, for example, nova, jenkins-ci

  • Creation date and time

  • Creation owner

  • Grants

  • Last refresh date and time

  • IP address

iamctl servicetoken show [ID]

Output the details of a service token with the specified ID.

iamctl servicetoken create [alias] [service] [grant1 grant2...]

Create a token for a specific service with the specified set of grants. For example, iamctl servicetoken create new-token iam m:iam@viewer.

iamctl servicetoken delete [ID1 ID2...]

Delete one or more service tokens with the specified IDs.
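
For example, a minimal service token lifecycle using the documented example values; replace <ID> with the token identifier returned by the list command:

iamctl servicetoken create new-token iam m:iam@viewer
iamctl servicetoken list
iamctl servicetoken delete <ID>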

User commands

Usage

Description

iamctl user list

List user names and emails of all current users.

iamctl user show <username>

Output the details of the specified user.

Role list

Mirantis Container Cloud creates the IAM roles in scopes. For each application type, such as iam, k8s, or kaas, Container Cloud creates a scope in Keycloak. Each scope contains a set of roles, such as admin, user, or viewer. The default IAM roles can be changed during a managed cluster deployment. You can grant or revoke role access using the IAM CLI. For details, see: IAM CLI.

Example of the structure of a cluster-admin role in a managed cluster:

m:k8s:kaas-tenant-name:k8s-cluster-name@cluster-admin
  • m - prefix for all IAM roles in Container Cloud

  • k8s - application type, Kubernetes

  • kaas-tenant-name:k8s-cluster-name - a managed cluster identifier in Container Cloud (CLUSTER_ID)

  • @ - delimiter between a scope and role

  • cluster-admin - name of the role within the Kubernetes scope
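
For example, to grant this role using the IAM CLI, pass the scope and the role to iamctl grant give. The tenant and cluster names below are the placeholders from the structure above:

iamctl grant give jdoe m:k8s:kaas-tenant-name:k8s-cluster-name cluster-admin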


The following tables describe the scopes and their roles by Container Cloud component:

Container Cloud

Scope identifier

Role name

Grant example

Role description

m:kaas

reader

m:kaas@reader 0

List the managed clusters within the Container Cloud scope.

writer

m:kaas@writer 0

Create or delete the managed clusters within the Container Cloud scope.

operator

m:kaas@operator

Add or delete a bare metal host and machine within the Container Cloud scope, create a project.

m:kaas:$<CLUSTER_ID>

reader

m:kaas:$<CLUSTER_ID>@reader

List the managed clusters within the specified Container Cloud cluster ID.

writer

m:kaas:$<CLUSTER_ID>@writer

Create or delete the managed clusters within the specified Container Cloud cluster ID.

0

Grant is available by default. Other grants can be added during management and managed cluster deployment.

Kubernetes

Scope identifier

Role name

Grant example

Role description

m:k8s:<CLUSTER_ID>

cluster-admin

m:k8s:<CLUSTER_ID>@cluster-admin

Allows super-user access to perform any action on any resource at the cluster level. When used in a ClusterRoleBinding, provides full control over every resource in the cluster and in all Kubernetes namespaces.

StackLight

Scope identifier

Role name

Grant example

Role description

m:sl:$<CLUSTER_ID> or m:sl:$<CLUSTER_ID>:<SERVICE_NAME>

admin

  • m:sl:$<CLUSTER_ID>@admin

  • m:sl:$<CLUSTER_ID>:alerta@admin

  • m:sl:$<CLUSTER_ID>:alertmngmnt@admin

  • m:sl:$<CLUSTER_ID>:kibana@admin

  • m:sl:$<CLUSTER_ID>:grafana@admin

  • m:sl:$<CLUSTER_ID>:prometheus@admin

Access the specified web UI(s) within the scope.

The m:sl:$<CLUSTER_ID>@admin grant provides access to all StackLight web UIs: Prometheus, Alerta, Alertmanager, Kibana, Grafana.

Manage StackLight

Using StackLight, you can monitor the components deployed in Mirantis Container Cloud and quickly get notified of critical conditions in the system to prevent service downtime.

Access StackLight web UIs

StackLight provides five web UIs including Prometheus, Alertmanager, Alerta, Kibana, and Grafana. This section describes how to access any of these web UIs.

To access a StackLight web UI:

  1. Log in to the Mirantis Container Cloud web UI.

  2. Switch to the required project using the Switch Project action icon located at the top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required cluster and select Cluster info.

  4. In the dialog box with the cluster information, copy the required endpoint IP from the StackLight Endpoints section.

  5. Paste the copied IP into a web browser and use the default credentials to log in to the web UI. Once done, you are automatically authenticated to all StackLight web UIs.

View Grafana dashboards

Using the Grafana web UI, you can view visual representations of metric graphs based on the time series databases.

To view the Grafana dashboards:

  1. Log in to the Grafana web UI as described in Access StackLight web UIs.

  2. From the drop-down list, select the required dashboard to inspect the status and statistics of the corresponding service in your management or managed cluster:

    Component

    Dashboard

    Description

    Ceph cluster

    Ceph Cluster

    Provides the overall health status of the Ceph cluster, capacity, latency, and recovery metrics.

    Ceph Nodes

    Provides an overview of the host-related metrics, such as the number of Ceph Monitors, Ceph OSD hosts, average usage of resources across the cluster, network and hosts load.

    Ceph OSD

    Provides metrics for Ceph OSDs, including the Ceph OSD read and write latencies, distribution of PGs per Ceph OSD, Ceph OSDs and physical device performance.

    Ceph Pools

    Provides metrics for Ceph pools, including the client IOPS and throughput by pool and pools capacity usage.

    Ironic bare metal

    Ironic BM

    Provides graphs on Ironic health, HTTP API availability, provisioned nodes by state and installed ironic-conductor back-end drivers.

    Container Cloud clusters

    Clusters Overview

    Represents the main cluster capacity statistics for all clusters of a Mirantis Container Cloud deployment where StackLight is installed.

    Kubernetes resources

    Kubernetes Calico

    Provides metrics of the entire Calico cluster usage, including the cluster status, host status, and Felix resources.

    Kubernetes Cluster

    Provides metrics for the entire Kubernetes cluster, including the cluster status, host status, and resources consumption.

    Kubernetes Deployments

    Provides information on the desired and current state of all service replicas deployed on a Container Cloud cluster.

    Kubernetes Namespaces

    Provides the pods state summary and the CPU, MEM, network, and IOPS resources consumption per namespace.

    Kubernetes Nodes

    Provides charts showing resources consumption per Container Cloud cluster node.

    Kubernetes Pods

    Provides charts showing resources consumption per deployed pod.

    NGINX

    NGINX

    Provides the overall status of the NGINX cluster and information about NGINX requests and connections.

    StackLight

    Alertmanager

    Provides performance metrics on the overall health status of the Prometheus Alertmanager service, the number of firing and resolved alerts received for various periods, the rate of successful and failed notifications, and the resources consumption.

    Elasticsearch

    Provides information about the overall health status of the Elasticsearch cluster, including the resources consumption and the state of the shards.

    Grafana

    Provides performance metrics for the Grafana service, including the total number of Grafana entities, CPU and memory consumption.

    PostgreSQL

    Provides PostgreSQL statistics, including read (DQL) and write (DML) row operations, transaction and lock, replication lag and conflict, and checkpoint statistics, as well as PostgreSQL performance metrics.

    Prometheus

    Provides the availability and performance behavior of the Prometheus servers, the sample ingestion rate, and system usage statistics per server. Also, provides statistics about the overall status and uptime of the Prometheus service, the chunks number of the local storage memory, target scrapes, and queries duration.

    Pushgateway

    Provides performance metrics and the overall health status of the service, the rate of samples received for various periods, and the resources consumption.

    Prometheus Relay

    Provides service status and resources consumption metrics.

    Telemeter Server

    Provides statistics and the overall health status of the Telemeter service.

    System

    System

    Provides a detailed resource consumption and operating system information per Container Cloud cluster node.

    Mirantis Kubernetes Engine (MKE)

    UCP Cluster

    Provides a global overview of an MKE cluster: statistics about the number of worker and manager nodes, containers, images, and Swarm services.

    UCP Containers

    Provides per-container resource consumption metrics (CPU, RAM, network) for the MKE containers.

View Kibana dashboards

Using the Kibana web UI, you can view the visual representation of logs and Kubernetes events of your deployment.

To view the Kibana dashboards:

  1. Log in to the Kibana web UI as described in Access StackLight web UIs.

  2. Click the required dashboard to inspect the visualizations or perform a search:

    Dashboard

    Description

    Logs

    Provides visualizations on the number of log messages per severity and source, as well as the top log-producing hosts, namespaces, containers, and applications. Includes search.

    Kubernetes events

    Provides visualizations on the number of Kubernetes events per type, and top event-producing resources and namespaces by reason and event type. Includes search.

Available StackLight alerts

This section provides an overview of the available predefined StackLight alerts. To view the alerts, use the Prometheus web UI. To view the firing alerts, use the Alertmanager or Alerta web UI.

Alertmanager

This section describes the alerts for the Alertmanager service.


AlertmanagerFailedReload

Severity

Warning

Summary

Failure to reload the Alertmanager configuration.

Description

Reloading the Alertmanager configuration failed for the {{ $labels.namespace }}/{{ $labels.pod }} Pod.


AlertmanagerMembersInconsistent

Severity

Major

Summary

Alertmanager cluster members are not found.

Description

Alertmanager has not found all other members of the cluster.


AlertmanagerNotificationFailureWarning

Severity

Warning

Summary

Alertmanager has failed notifications.

Description

An average of {{ $value }} Alertmanager {{ $labels.integration }} notifications on the {{ $labels.namespace }}/{{ $labels.pod }} Pod fail for 2 minutes.


AlertmanagerAlertsInvalidWarning

Severity

Warning

Summary

Alertmanager has invalid alerts.

Description

An average of {{ $value }} Alertmanager {{ $labels.integration }} alerts on the {{ $labels.namespace }}/{{ $labels.pod }} Pod are invalid for 2 minutes.

Calico

This section describes the alerts for Calico.


CalicoDataplaneFailuresHigh

Severity

Warning

Summary

High number of data plane failures within Felix.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Felix Pod on the {{ $labels.node }} node has {{ $value }} data plane failures within the last hour.


CalicoDataplaneAddressMsgBatchSizeHigh

Severity

Warning

Summary

Felix address message batch size is higher than 5.

Description

The size of the data plane address message batch on the {{ $labels.namespace }}/{{ $labels.pod }} Felix Pod on the {{ $labels.node }} node is {{ $value }}.


CalicoDatapaneIfaceMsgBatchSizeHigh

Severity

Warning

Summary

Felix interface message batch size is higher than 5.

Description

The size of the data plane interface message batch on the {{ $labels.namespace }}/{{ $labels.pod }} Felix Pod on the {{ $labels.node }} node is {{ $value }}.


CalicoIPsetErrorsHigh

Severity

Warning

Summary

More than 5 IPset errors occur in Felix per hour.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Felix Pod on the {{ $labels.node }} node has {{ $value }} IPset errors within the last hour.


CalicoIptablesSaveErrorsHigh

Severity

Warning

Summary

More than 5 iptable save errors occur in Felix per hour.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Felix Pod on the {{ $labels.node }} node has {{ $value }} iptable save errors within the last hour.


CalicoIptablesRestoreErrorsHigh

Severity

Warning

Summary

More than 5 iptable restore errors occur in Felix per hour.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Felix Pod on the {{ $labels.node }} node has {{ $value }} iptable restore errors within the last hour.

Ceph

This section describes the alerts for the Ceph cluster.


CephClusterHealthMinor

Severity

Minor

Summary

Ceph cluster health is WARNING.

Description

The Ceph cluster is in the WARNING state. For details, run ceph -s.


CephClusterHealthCritical

Severity

Critical

Summary

Ceph cluster health is CRITICAL.

Description

The Ceph cluster is in the CRITICAL state. For details, run ceph -s.


CephMonQuorumAtRisk

Severity

Major

Summary

Storage quorum is at risk.

Description

The storage cluster quorum is low.


CephOSDDownMinor

Severity

Minor

Summary

Ceph OSDs are down.

Description

{{ $value }} of Ceph OSDs in the Ceph cluster are down. For details, run ceph osd tree.


CephOSDDiskNotResponding

Severity

Critical

Summary

Disk is not responding.

Description

The {{ $labels.device }} disk device is not responding on the {{ $labels.host }} host.


CephOSDDiskUnavailable

Severity

Critical

Summary

Disk is not accessible.

Description

The {{ $labels.device }} disk device is not accessible on the {{ $labels.host }} host.


CephClusterNearFull

Severity

Warning

Summary

Storage cluster is nearly full. Expansion is required.

Description

The storage cluster utilization has crossed 85%.


CephClusterCriticallyFull

Severity

Critical

Summary

Storage cluster is critically full and needs immediate expansion.

Description

The storage cluster utilization has crossed 95%.


CephOSDPgNumTooHighWarning

Severity

Warning

Summary

Some Ceph OSDs have more than 200 PGs.

Description

Some Ceph OSDs contain more than 200 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump.


CephOSDPgNumTooHighCritical

Severity

Critical

Summary

Some Ceph OSDs have more than 300 PGs.

Description

Some Ceph OSDs contain more than 300 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump.


CephMonHighNumberOfLeaderChanges

Severity

Warning

Summary

Many leader changes occur in the storage cluster.

Description

{{ $value }} leader changes per minute occur for the {{ $labels.instance }} instance of the {{ $labels.job }} Ceph Monitor.


CephNodeDown

Severity

Critical

Summary

Ceph node {{ $labels.node }} went down.

Description

The {{ $labels.node }} Ceph node is down and requires immediate verification.


CephDataRecoveryTakingTooLong

Severity

Warning

Summary

Data recovery is slow.

Description

Data recovery has been active for more than two hours.


CephPGRepairTakingTooLong

Severity

Warning

Summary

Self-heal issues detected.

Description

The self-heal operations take an excessive amount of time.


CephOSDVersionMismatch

Severity

Warning

Summary

Multiple versions of storage services are running.

Description

{{ $value }} different versions of Ceph OSD components are running.


CephMonVersionMismatch

Severity

Warning

Summary

Multiple versions of storage services are running.

Description

{{ $value }} different versions of Ceph Monitor components are running.

Elasticsearch

This section describes the alerts for the Elasticsearch service.


ElasticHeapUsageCritical

Severity

Critical

Summary

Elasticsearch heap usage is too high (>90%).

Description

Elasticsearch heap usage is over 90% for 5 minutes.


ElasticHeapUsageWarning

Severity

Warning

Summary

Elasticsearch heap usage is high (>80%).

Description

Elasticsearch heap usage is over 80% for 5 minutes.


ElasticClusterStatusCritical

Severity

Critical

Summary

Elasticsearch critical status.

Description

The Elasticsearch cluster status has changed to RED.


ElasticClusterStatusWarning

Severity

Warning

Summary

Elasticsearch warning status.

Description

The Elasticsearch cluster status has changed to YELLOW. The alert persists for the cluster in the RED status.


NumberOfRelocationShards

Severity

Warning

Summary

Shards relocation takes more than 20 minutes.

Description

Elasticsearch has {{ $value }} relocating shards for 20 minutes.


NumberOfInitializingShards

Severity

Warning

Summary

Shards initialization takes more than 10 minutes.

Description

Elasticsearch has {{ $value }} shards being initialized for 10 minutes.


NumberOfUnassignedShards

Severity

Major

Summary

Shards have unassigned status for 5 minutes.

Description

Elasticsearch has {{ $value }} unassigned shards for 5 minutes.


NumberOfPendingTasks

Severity

Warning

Summary

Tasks have pending state for 10 minutes.

Description

Elasticsearch has {{ $value }} pending tasks for 10 minutes. The cluster works slowly.


ElasticNoNewDataCluster

Severity

Major

Summary

Elasticsearch cluster has no new data for 30 minutes.

Description

No new data has arrived to the Elasticsearch cluster for 30 minutes.


ElasticNoNewDataNode

Severity

Warning

Summary

Elasticsearch node has no new data for 30 minutes.

Description

No new data has arrived to the {{ $labels.name }} Elasticsearch node for 30 minutes. The alert also indicates Elasticsearch node cordoning.

etcd

This section describes the alerts for the etcd service.


etcdInsufficientMembers

Severity

Critical

Summary

The etcd cluster has insufficient members.

Description

The {{ $labels.job }} etcd cluster has {{ $value }} insufficient members.


etcdNoLeader

Severity

Critical

Summary

The etcd cluster has no leader.

Description

The {{ $labels.instance }} member of the {{ $labels.job }} etcd cluster has no leader.


etcdHighNumberOfLeaderChanges

Severity

Warning

Summary

More than 3 leader changes occurred in the etcd cluster within the last hour.

Description

The {{ $labels.instance }} instance of the {{ $labels.job }} etcd cluster has {{ $value }} leader changes within the last hour.


etcdGRPCRequestsSlow

Severity

Warning

Summary

The etcd cluster has slow gRPC requests.

Description

The gRPC requests to {{ $labels.grpc_method }} take {{ $value }}s on {{ $labels.instance }} instance of the {{ $labels.job }} etcd cluster.


etcdMemberCommunicationSlow

Severity

Warning

Summary

The etcd cluster has slow member communication.

Description

The member communication with {{ $labels.To }} on the {{ $labels.instance }} instance of the {{ $labels.job }} etcd cluster takes {{ $value }}s.


etcdHighNumberOfFailedProposals

Severity

Warning

Summary

The etcd cluster has more than 5 proposal failures.

Description

The {{ $labels.job }} etcd cluster has {{ $value }} proposal failures on the {{ $labels.instance }} etcd instance within the last hour.


etcdHighFsyncDurations

Severity

Warning

Summary

The etcd cluster has high fsync duration.

Description

The duration of 99% of all fsync operations on the {{ $labels.instance }} of the {{ $labels.job }} etcd cluster is {{ $value }}s.


etcdHighCommitDurations

Severity

Warning

Summary

The etcd cluster has high commit duration.

Description

The duration of 99% of all commit operations on the {{ $labels.instance }} of the {{ $labels.job }} etcd cluster is {{ $value }}s.

External endpoint

This section describes the alerts for external endpoints.


ExternalEndpointDown

Severity

Critical

Summary

External endpoint is down.

Description

The {{ $labels.instance }} external endpoint is not accessible for the last 2 minutes.


ExternalEndpointTCPFailure

Severity

Critical

Summary

Failure to establish a TCP or TLS connection.

Description

The system cannot establish a TCP or TLS connection to {{ $labels.instance }}.

General alerts

This section lists the general available alerts.


TargetDown

Severity

Critical

Summary

The {{ $labels.job }} target is down.

Description

The {{ $labels.job }}/{{ $labels.instance }} target is down.


TargetFlapping

Severity

Critical

Summary

The {{ $labels.job }} target is flapping.

Description

The {{ $labels.job }}/{{ $labels.instance }} target has been changing its state between UP and DOWN for 30 minutes, at least once within the 15-minute time range.


NodeDown

Severity

Critical

Summary

The {{ $labels.node }} node is down.

Description

The {{ $labels.node }} node is down. Kubernetes treats the node as Not Ready and kubelet is not accessible from Prometheus.


Watchdog

Severity

None

Summary

Watchdog alert that is always firing.

Description

This alert ensures that the entire alerting pipeline is functional. This alert should always be firing in Alertmanager against a receiver. Some integrations with various notification mechanisms can send a notification when this alert is not firing. For example, the DeadMansSnitch integration in PagerDuty.

General node alerts

This section lists the general alerts for Kubernetes nodes.


FileDescriptorUsageCritical

Available since 2.2.0

Severity

Critical

Summary

Node uses 95% of file descriptors.

Description

The {{ $labels.node }} node uses 95% of file descriptors.


FileDescriptorUsageMajor

Available since 2.2.0

Severity

Major

Summary

Node uses 90% of file descriptors.

Description

The {{ $labels.node }} node uses 90% of file descriptors.


FileDescriptorUsageWarning

Available since 2.2.0

Severity

Warning

Summary

Node uses 80% of file descriptors.

Description

The {{ $labels.node }} node uses 80% of file descriptors.


SystemCpuFullWarning

Severity

Warning

Summary

High CPU consumption.

Description

The average CPU consumption on the {{ $labels.node }} node is {{ $value }}% for 2 minutes.


SystemLoadTooHighWarning

Severity

Warning

Summary

System load is more than 1 per CPU.

Description

The system load per CPU on the {{ $labels.node }} node is {{ $value }} for 5 minutes.


SystemLoadTooHighCritical

Severity

Critical

Summary

System load is more than 2 per CPU.

Description

The system load per CPU on the {{ $labels.node }} node is {{ $value }} for 5 minutes.


SystemDiskFullWarning

Severity

Warning

Summary

Disk partition {{ $labels.mountpoint }} is 85% full.

Description

The {{ $labels.device }} disk partition {{ $labels.mountpoint }} on the {{ $labels.node }} node is {{ $value }}% full for 2 minutes.


SystemDiskFullMajor

Severity

Major

Summary

Disk partition {{ $labels.mountpoint }} is 95% full.

Description

The {{ $labels.device }} disk partition {{ $labels.mountpoint }} on the {{ $labels.node }} node is {{ $value }}% full for 2 minutes.


SystemMemoryFullWarning

Severity

Warning

Summary

More than 90% of memory is used or less than 8 GB is available.

Description

The {{ $labels.node }} node consumes {{ $value }}% of memory for 2 minutes.


SystemMemoryFullMajor

Severity

Major

Summary

More than 95% of memory is used or less than 4 GB of memory is available.

Description

The {{ $labels.node }} node consumes {{ $value }}% of memory for 2 minutes.


SystemDiskInodesFullWarning

Severity

Warning

Summary

The {{ $labels.mountpoint }} volume uses 85% of inodes.

Description

The {{ $labels.device }} disk on the {{ $labels.node }} node consumes {{ $value }}% of disk inodes in the {{ $labels.mountpoint }} volume for 2 minutes.


SystemDiskInodesFullMajor

Severity

Major

Summary

The {{ $labels.mountpoint }} volume uses 95% of inodes.

Description

The {{ $labels.device }} disk on the {{ $labels.node }} node consumes {{ $value }}% of disk inodes in the {{ $labels.mountpoint }} volume for 2 minutes.


SystemDiskErrorsTooHigh

Severity

Warning

Summary

The {{ $labels.device }} disk is failing.

Description

The {{ $labels.device }} disk on the {{ $labels.node }} node is reporting errors for 5 minutes.

Ironic

This section describes the alerts for Ironic bare metal. The alerts cover Ironic API availability and Ironic process availability.


IronicBmMetricsMissing

Severity

Major

Summary

Ironic metrics missing.

Description

Metrics retrieved from the Ironic API are not available for 2 minutes.


IronicBmApiOutage

Severity

Critical

Summary

Ironic API outage.

Description

The Ironic API is not accessible.

Kubernetes applications

This section lists the alerts for Kubernetes applications.


KubePodCrashLooping

Severity

Critical

Summary

The {{ $labels.pod }} Pod is in a crash loop status.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Pod container {{ $labels.container }} was restarted at least twice during the last 5 minutes.


KubePodNotReady

Severity

Critical

Summary

The {{ $labels.pod }} Pod is in the non-ready state.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Pod state is not Ready for longer than 15 minutes.


KubeDeploymentGenerationMismatch

Severity

Major

Summary

The {{ $labels.deployment }} Deployment generation does not match the metadata.

Description

The {{ $labels.namespace }}/{{ $labels.deployment }} Deployment generation does not match the metadata, indicating that the deployment failed but has not been rolled back.


KubeDeploymentReplicasMismatch

Severity

Major

Summary

The {{ $labels.deployment }} Deployment has a wrong number of replicas.

Description

The {{ $labels.namespace }}/{{ $labels.deployment }} Deployment does not match the expected number of replicas for longer than 10 minutes.


KubeStatefulSetReplicasMismatch

Severity

Major

Summary

The {{ $labels.statefulset }} StatefulSet has a wrong number of replicas.

Description

The {{ $labels.namespace }}/{{ $labels.statefulset }} StatefulSet does not match the expected number of replicas for longer than 10 minutes.


KubeStatefulSetGenerationMismatch

Severity

Critical

Summary

The {{ $labels.statefulset }} StatefulSet generation does not match the metadata.

Description

The {{ $labels.namespace }}/{{ $labels.statefulset }} StatefulSet generation does not match the metadata, indicating that the StatefulSet failed but has not been rolled back.


KubeStatefulSetUpdateNotRolledOut

Severity

Major

Summary

The {{ $labels.statefulset }} StatefulSet update has not been rolled out.

Description

The {{ $labels.namespace }}/{{ $labels.statefulset }} StatefulSet update has not been rolled out.


KubeDaemonSetRolloutStuck

Severity

Major

Summary

The {{ $labels.daemonset }} DaemonSet is not ready.

Description

Only {{ $value }}% of the desired Pods of the {{ $labels.namespace }}/{{ $labels.daemonset }} DaemonSet are scheduled and ready.


KubeDaemonSetNotScheduled

Severity

Warning

Summary

The {{ $labels.daemonset }} DaemonSet has not scheduled Pods.

Description

The {{ $labels.namespace }}/{{ $labels.daemonset }} DaemonSet has {{ $value }} not scheduled Pods.


KubeDaemonSetMisScheduled

Severity

Warning

Summary

The {{ $labels.daemonset }} DaemonSet has incorrectly scheduled Pods.

Description

The {{ $labels.namespace }}/{{ $labels.daemonset }} DaemonSet has {{ $value }} Pods running where they are not supposed to run.


KubeCronJobRunning

Severity

Warning

Summary

The {{ $labels.cronjob }} CronJob is not ready.

Description

The {{ $labels.namespace }}/{{ $labels.cronjob }} CronJob takes more than 15 minutes to complete.


KubeJobCompletion

Severity

Minor

Summary

The {{ $labels.job_name }} job is not completed.

Description

The {{ $labels.namespace }}/{{ $labels.job_name }} job takes more than 15 minutes to complete.


KubeJobFailed

Severity

Minor

Summary

The {{ $labels.job_name }} job failed.

Description

The {{ $labels.namespace }}/{{ $labels.job_name }} job failed to complete.

Kubernetes resources

This section lists the alerts for Kubernetes resources.


KubeCPUOvercommitPods

Severity

Warning

Summary

Kubernetes has overcommitted CPU requests.

Description

The Kubernetes cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.


KubeMemOvercommitPods

Severity

Warning

Summary

Kubernetes has overcommitted memory requests.

Description

The Kubernetes cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.


KubeCPUOvercommitNamespaces

Severity

Warning

Summary

Kubernetes has overcommitted CPU requests for namespaces.

Description

The Kubernetes cluster has overcommitted CPU resource requests for namespaces.


KubeMemOvercommitNamespaces

Severity

Warning

Summary

Kubernetes has overcommitted memory requests for namespaces.

Description

The Kubernetes cluster has overcommitted memory resource requests for namespaces.


KubeQuotaExceeded

Severity

Warning

Summary

The {{ $labels.namespace }} namespace consumes more than 90% of its {{ $labels.resource }} quota.

Description

The {{ $labels.namespace }} namespace consumes {{ printf "%0.0f" $value }}% of its {{ $labels.resource }} quota.


CPUThrottlingHigh

Severity

Warning

Summary

The {{ $labels.pod_name }} Pod has CPU throttling.

Description

The {{ $labels.container }} container in the {{ $labels.namespace }}/{{ $labels.pod }} Pod has {{ printf "%0.0f" $value }}% of CPU throttling.

Kubernetes storage

This section lists the alerts for Kubernetes storage.

Caution

Due to the upstream bug in Kubernetes, metrics for the KubePersistentVolumeUsageCritical and KubePersistentVolumeFullInFourDays alerts that are collected for persistent volumes provisioned by cinder-csi-plugin are not available.


KubePersistentVolumeUsageCritical

Severity

Critical

Summary

The {{ $labels.persistentvolumeclaim }} PersistentVolume has less than 3% of free space.

Description

The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in the {{ $labels.namespace }} namespace is only {{ printf "%0.2f" $value }}% free.


KubePersistentVolumeFullInFourDays

Severity

Warning

Summary

The {{ $labels.persistentvolumeclaim }} PersistentVolume is expected to fill up in 4 days.

Description

Based on the recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in the {{ $labels.namespace }} namespace is expected to fill up within four days. Currently, {{ printf "%0.2f" $value }}% of free space is available.


KubePersistentVolumeErrors

Severity

Critical

Summary

The status of the {{ $labels.persistentvolume }} PersistentVolume is {{ $labels.phase }}.

Description

The status of the {{ $labels.persistentvolume }} PersistentVolume is {{ $labels.phase }}.

Kubernetes system

This section lists the alerts for the Kubernetes system.


KubeNodeNotReady

Severity

Warning

Summary

The {{ $labels.node }} node is not ready.

Description

The Kubernetes {{ $labels.node }} node is not ready for more than one hour.


KubeVersionMismatch

Severity

Warning

Summary

Kubernetes components have mismatching versions.

Description

Kubernetes has components with {{ $value }} different semantic versions running.


KubeClientErrors

Severity

Warning

Summary

Kubernetes API client has more than 1% of error requests.

Description

The {{ $labels.job }}/{{ $labels.instance }} Kubernetes API server client has {{ printf "%0.0f" $value }}% errors.


KubeletTooManyPods

Severity

Warning

Summary

The kubelet has reached 90% of the Pod limit.

Description

The {{ $labels.instance }}/{{ $labels.node }} kubelet runs {{ $value }} Pods, close to the limit of 110.


KubeAPIDown

Severity

Critical

Summary

Kubernetes API endpoint is down.

Description

The Kubernetes API endpoint {{ $labels.instance }} is not accessible for the last 3 minutes.


KubeAPIOutage

Severity

Critical

Summary

Kubernetes API is down.

Description

The Kubernetes API is not accessible for the last 30 seconds.


KubeAPILatencyHighWarning

Severity

Warning

Summary

The API server has a 99th percentile latency of more than 1 second.

Description

The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.


KubeAPILatencyHighMajor

Severity

Major

Summary

The API server has a 99th percentile latency of more than 4 seconds.

Description

The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.


KubeAPIErrorsHighMajor

Severity

Major

Summary

API server returns errors for more than 3% of requests.

Description

The API server returns errors for {{ $value }}% of requests.


KubeAPIErrorsHighWarning

Severity

Warning

Summary

API server returns errors for more than 1% of requests.

Description

The API server returns errors for {{ $value }}% of requests.


KubeAPIResourceErrorsHighMajor

Severity

Major

Summary

API server returns errors for 10% of requests.

Description

The API server returns errors for {{ $value }}% of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}.


KubeAPIResourceErrorsHighWarning

Severity

Warning

Summary

API server returns errors for 5% of requests.

Description

The API server returns errors for {{ $value }}% of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}.


KubeClientCertificateExpirationInSevenDays

Severity

Warning

Summary

A client certificate expires in 7 days.

Description

A client certificate used to authenticate to the API server expires in less than 7 days.


KubeClientCertificateExpirationInOneDay

Severity

Critical

Summary

A client certificate expires in 24 hours.

Description

A client certificate used to authenticate to the API server expires in less than 24 hours.


ContainerScrapeError

Severity

Warning

Summary

Failure to get Kubernetes container metrics.

Description

Prometheus was not able to scrape metrics from the container on the {{ $labels.node }} Kubernetes node.

Netchecker

This section lists the alerts for the Netchecker service.


NetCheckerAgentErrors

Severity

Warning

Summary

Netchecker has a high number of errors.

Description

The {{ $labels.agent }} Netchecker agent had {{ $value }} errors within the last hour.


NetCheckerReportsMissing

Severity

Warning

Summary

The number of agent reports is lower than expected.

Description

The {{ $labels.agent }} Netchecker agent has not reported anything for the last 5 minutes.


NetCheckerTCPServerDelay

Severity

Warning

Summary

The TCP connection to Netchecker server takes too much time.

Description

The {{ $labels.agent }} Netchecker agent TCP connection time to the Netchecker server has increased by {{ $value }} within the last 5 minutes.


NetCheckerDNSSlow

Severity

Warning

Summary

The DNS lookup time is too high.

Description

The DNS lookup time on the {{ $labels.agent }} Netchecker agent has increased by {{ $value }} within the last 5 minutes.

NGINX

This section lists the alerts for the NGINX service.


NginxServiceDown

Severity

Critical

Summary

The NGINX service is down.

Description

The NGINX service on the {{ $labels.node }} node is down.


NginxDroppedIncomingConnections

Severity

Minor

Summary

NGINX drops incoming connections.

Description

The NGINX service on the {{ $labels.node }} node drops {{ $value }} accepted connections per second for 5 minutes.

Node network

This section lists the alerts for a Kubernetes node network.


SystemRxPacketsErrorTooHigh

Severity

Warning

Summary

The {{ $labels.node }} node has packet receive errors.

Description

The {{ $labels.device }} network interface has receive errors on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter Pod.


SystemTxPacketsErrorTooHigh

Severity

Warning

Summary

The {{ $labels.node }} node has packet transmit errors.

Description

The {{ $labels.device }} network interface has transmit errors on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter Pod.


SystemRxPacketsDroppedTooHigh

Severity

Warning

Summary

60 or more received packets were dropped.

Description

{{ $value | printf "%.2f" }} packets received by the {{ $labels.device }} interface on the {{ $labels.node }} node were dropped during the last minute.


SystemTxPacketsDroppedTooHigh

Severity

Warning

Summary

100 or more transmitted packets were dropped.

Description

{{ $value | printf "%.2f" }} packets transmitted by the {{ $labels.device }} interface on the {{ $labels.node }} node were dropped during the last minute.


NodeNetworkInterfaceFlapping

Severity

Warning

Summary

The {{ $labels.node }} node has a flapping interface.

Description

The {{ $labels.device }} network interface often changes its UP status on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter.

Node time

This section lists the alerts for a Kubernetes node time.


ClockSkewDetected

Severity

Warning

Summary

The NTP offset reached the limit of 0.03 seconds.

Description

Clock skew was detected on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter Pod. Verify that NTP is configured correctly on this host.

PostgreSQL

This section lists the alerts for the PostgreSQL and Patroni services.


PostgresqlDataPageCorruption

Severity

Major

Summary

Patroni cluster member is experiencing data page corruption.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Patroni Pod in the {{ $labels.cluster }} cluster fails to calculate the data page checksum due to a possible hardware fault.


PostgresqlDeadlocksDetected

Severity

Warning

Summary

PostgreSQL transaction deadlocks detected.

Description

The transactions submitted to the Patroni {{ $labels.cluster }} cluster in the {{ $labels.namespace }} Namespace are experiencing deadlocks.


PostgresqlInsufficientWorkingMemory

Severity

Warning

Summary

Insufficient memory for PostgreSQL queries.

Description

The query data does not fit into working memory on the {{ $labels.cluster }} Patroni cluster in the {{ $labels.namespace }} Namespace.


PostgresqlPatroniClusterSplitBrain

Severity

Critical

Summary

Patroni cluster split-brain detected.

Description

The {{ $labels.cluster }} Patroni cluster in the {{ $labels.namespace }} Namespace has multiple primaries, split-brain detected.


PostgresqlPatroniClusterUnlocked

Severity

Major

Summary

Patroni cluster primary node is missing.

Description

The primary node of the {{ $labels.cluster }} Patroni cluster in the {{ $labels.namespace }} Namespace is missing.


PostgresqlPrimaryDown

Severity

Critical

Summary

PostgreSQL is down on the cluster primary node.

Description

The {{ $labels.cluster }} Patroni cluster in the {{ $labels.namespace }} Namespace is down due to missing primary node.


PostgresqlReplicaDown

Severity

Minor

Summary

Patroni cluster has replicas with inoperable PostgreSQL.

Description

The {{ $labels.cluster }} Patroni cluster in the {{ $labels.namespace }} Namespace has {{ $value }}% of replicas with inoperable PostgreSQL.


PostgresqlReplicationNonStreamingReplicas

Severity

Warning

Summary

Patroni cluster has non-streaming replicas.

Description

The {{ $labels.cluster }} Patroni cluster in the {{ $labels.namespace }} Namespace has replicas not streaming the segments from the primary node.


PostgresqlReplicationPaused

Severity

Major

Summary

Replication has stopped.

Description

Replication has stopped on the {{ $labels.namespace }}/{{ $labels.pod }} replica Pod in the {{ $labels.cluster }} cluster.


PostgresqlReplicationSlowWalApplication

Severity

Warning

Summary

WAL segment application is slow.

Description

Slow replication while applying WAL segments on the {{ $labels.namespace }}/{{ $labels.pod }} replica Pod in the {{ $labels.cluster }} cluster.


PostgresqlReplicationSlowWalDownload

Severity

Warning

Summary

Streaming replication is slow.

Description

Slow replication while downloading WAL segments for the {{ $labels.namespace }}/{{ $labels.pod }} replica Pod in the {{ $labels.cluster }} cluster.


PostgresqlReplicationWalArchiveWriteFailing

Severity

Major

Summary

Patroni cluster WAL segment writes are failing.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Patroni Pod in the {{ $labels.cluster }} cluster fails to write replication segments.

Prometheus

This section describes the alerts for the Prometheus service.


PrometheusConfigReloadFailed

Severity

Warning

Summary

Failure to reload the Prometheus configuration.

Description

Reloading of the Prometheus configuration has failed for the {{ $labels.namespace }}/{{ $labels.pod }} Pod.


PrometheusNotificationQueueRunningFull

Severity

Warning

Summary

Prometheus alert notification queue is running full.

Description

The Prometheus alert notification queue is running full for the {{ $labels.namespace }}/{{ $labels.pod }} Pod.


PrometheusErrorSendingAlertsWarning

Severity

Warning

Summary

Errors occur while sending alerts from Prometheus.

Description

Errors occur while sending alerts from the {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod to Alertmanager {{ $labels.Alertmanager }}.


PrometheusErrorSendingAlertsMajor

Severity

Major

Summary

Errors occur while sending alerts from Prometheus.

Description

Errors occur while sending alerts from the {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod to Alertmanager {{ $labels.Alertmanager }}.


PrometheusNotConnectedToAlertmanagers

Severity

Minor

Summary

Prometheus is not connected to Alertmanager.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod is not connected to any Alertmanager instance.


PrometheusTSDBReloadsFailing

Severity

Warning

Summary

Prometheus has issues reloading data blocks from disk.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod had {{ $value | humanize }} reload failures over the last 12 hours.


PrometheusTSDBCompactionsFailing

Severity

Warning

Summary

Prometheus has issues compacting sample blocks.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod had {{ $value | humanize }} compaction failures over the last 12 hours.


PrometheusTSDBWALCorruptions

Severity

Warning

Summary

Prometheus encountered WAL corruptions.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod has write-ahead log (WAL) corruptions in the time series database (TSDB) for the last 5 minutes.


PrometheusNotIngestingSamples

Severity

Warning

Summary

Prometheus does not ingest samples.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod does not ingest samples.


PrometheusTargetScrapesDuplicate

Severity

Warning

Summary

Prometheus has many rejected samples.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod has many rejected samples because of duplicate timestamps but different values.


PrometheusRuleEvaluationsFailed

Severity

Warning

Summary

Prometheus failed to evaluate recording rules.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Prometheus Pod has failed evaluations for recording rules. Verify the rules state in the Status/Rules section of the Prometheus Web UI.

SMART disks

This section describes the alerts for SMART disks.


SystemSMARTDiskUDMACrcErrorsTooHigh

Severity

Warning

Summary

The {{ $labels.device }} disk has UDMA CRC errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting SMART UDMA CRC errors for 5 minutes.


SystemSMARTDiskHealthStatus

Severity

Warning

Summary

The {{ $labels.device }} disk has bad health.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting a bad health status for 1 minute.


SystemSMARTDiskReadErrorRate

Severity

Warning

Summary

The {{ $labels.device }} disk has read errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting an increased read error rate for 5 minutes.


SystemSMARTDiskSeekErrorRate

Severity

Warning

Summary

The {{ $labels.device }} disk has seek errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting an increased seek error rate for 5 minutes.


SystemSMARTDiskTemperatureHigh

Severity

Warning

Summary

The {{ $labels.device }} disk temperature is high.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has a temperature of {{ $value }}C for 5 minutes.


SystemSMARTDiskReallocatedSectorsCount

Severity

Major

Summary

The {{ $labels.device }} disk has reallocated sectors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has reallocated {{ $value }} sectors.


SystemSMARTDiskCurrentPendingSectors

Severity

Major

Summary

The {{ $labels.device }} disk has current pending sectors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} current pending sectors.


SystemSMARTDiskReportedUncorrectableErrors

Severity

Major

Summary

The {{ $labels.device }} disk has reported uncorrectable errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} reported uncorrectable errors.


SystemSMARTDiskOfflineUncorrectableSectors

Severity

Major

Summary

The {{ $labels.device }} disk has offline uncorrectable sectors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} offline uncorrectable sectors.


SystemSMARTDiskEndToEndError

Severity

Major

Summary

The {{ $labels.device }} disk has end-to-end errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} end-to-end errors.

SSL certificates

This section lists the alerts for SSL certificates.


SSLCertExpirationWarning

Severity

Warning

Summary

SSL certificate expires in 30 days.

Description

The SSL certificate for {{ $labels.instance }} expires in 30 days.


SSLCertExpirationMajor

Severity

Major

Summary

SSL certificate expires in 10 days.

Description

The SSL certificate for {{ $labels.instance }} expires in 10 days.


KaasSSLCertExpirationMajor

Severity

Major

Summary

SSL certificate for a Container Cloud service expires in 10 days.

Description

The SSL certificate for the Container Cloud {{ $labels.service }} service endpoint {{ $labels.instance }} expires in 10 days.


KaasSSLCertExpirationWarning

Severity

Warning

Summary

SSL certificate for a Container Cloud service expires in 30 days.

Description

The SSL certificate for the Container Cloud {{ $labels.service }} service endpoint {{ $labels.instance }} expires in 30 days.


SSLProbesFailing

Available since 2.2.0

Severity

Critical

Summary

SSL certificate probes are failing.

Description

The SSL certificate probes for the {{ $labels.instance }} service endpoint are failing.

Telemeter

This section describes the alerts for the Telemeter service.


TelemeterClientFederationFailed

Severity

Warning

Summary

Telemeter client failed to send data to the server.

Description

Telemeter client has failed to send data to the Telemeter server twice for the last 30 minutes. Verify the telemeter-client container logs.

Mirantis Kubernetes Engine

This section describes the alerts for the Mirantis Kubernetes Engine (MKE) cluster.


DockerNetworkUnhealthy

Severity

Warning

Summary

Docker network is unhealthy.

Description

The qLen size and NetMsg showed unexpected output for the last 10 minutes. Verify the NetworkDb Stats output for the qLen size and NetMsg using journalctl -u docker.

Note

For the DockerNetworkUnhealthy alert, StackLight collects metrics from logs. Therefore, this alert is available only if logging is enabled.


DockerNodeFlapping

Severity

Major

Summary

Docker node is flapping.

Description

The {{ $labels.node_name }} Docker node has changed the state more than 3 times for the last 10 minutes.


DockerServiceReplicasDown

Severity

Major

Summary

Docker Swarm replica is down.

Description

The {{ $labels.service_name }} Docker Swarm service replica is down for 2 minutes.


DockerServiceReplicasFlapping

Severity

Major

Summary

Docker Swarm service replica is flapping.

Description

The {{ $labels.service_name }} Docker Swarm service replica is flapping for 15 minutes.


DockerServiceReplicasOutage

Severity

Critical

Summary

Docker Swarm service outage.

Description

All {{ $labels.service_name }} Docker Swarm service replicas are down for 2 minutes.


DockerUCPAPIDown

Severity

Critical

Summary

MKE API endpoint is down.

Description

The MKE API endpoint {{ $labels.instance }} is not accessible for the last 3 minutes.


DockerUCPAPIOutage

Severity

Critical

Summary

MKE API is down.

Description

The MKE API (port 443) is not accessible for the last minute.


DockerUCPContainerUnhealthy

Severity

Major

Summary

MKE container is in the Unhealthy state.

Description

The {{ $labels.name }} MKE container is in the Unhealthy state.


DockerUCPLeadElectionLoop

Severity

Major

Summary

MKE Manager leadership election loop.

Description

More than 2 MKE leader elections occurred during the last 10 minutes.


DockerUCPNodeDiskFullCritical

Severity

Critical

Summary

MKE node disk is 95% full.

Description

The {{ $labels.instance }} MKE node disk is 95% full.


DockerUCPNodeDiskFullWarning

Severity

Warning

Summary

MKE node disk is 85% full.

Description

The {{ $labels.instance }} MKE node disk is 85% full.


DockerUCPNodeDown

Severity

Critical

Summary

MKE node is down.

Description

The {{ $labels.instance }} MKE node is down.

Configure StackLight

This section describes the initial steps required for StackLight configuration. For a detailed description of StackLight configuration options, see StackLight configuration parameters.

  1. Log in to the Mirantis Container Cloud web UI with writer permissions.

  2. Switch to the required project using the Switch Project action icon located at the top of the main left-side navigation panel.

  3. Expand the menu of the tab with your username.

  4. Click Download kubeconfig to download the kubeconfig file of your management cluster.

  5. Log in to any local machine with kubectl installed.

  6. Copy the downloaded kubeconfig to this machine.

  7. Run one of the following commands:

    • For a management cluster:

      kubectl --kubeconfig <KUBECONFIG_PATH> edit -n <PROJECT_NAME> cluster <MANAGEMENT_CLUSTER_NAME>
      
    • For a managed cluster:

      kubectl --kubeconfig <KUBECONFIG_PATH> edit -n <PROJECT_NAME> cluster <MANAGED_CLUSTER_NAME>
      
  8. In the following section of the opened manifest, configure the required StackLight parameters as described in StackLight configuration parameters. An illustrative example follows this procedure.

    spec:
      providerSpec:
        value:
          helmReleases:
          - name: stacklight
            values:
    
  9. Verify StackLight after configuration.
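
The following sketch expands step 8 with two documented parameters, logging.enabled and prometheusServer.retentionTime; the values shown are examples only:

spec:
  providerSpec:
    value:
      helmReleases:
      - name: stacklight
        values:
          logging:
            enabled: true
          prometheusServer:
            retentionTime: "15d"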

StackLight configuration parameters

This section describes the StackLight configuration keys that you can specify in the values section to change StackLight settings as required. Prior to making any changes to StackLight configuration, perform the steps described in Configure StackLight. After changing StackLight configuration, verify the changes as described in Verify StackLight after configuration.


Alerta

Key

Description

Example values

alerta.enabled (bool)

Enables or disables Alerta. Set to true by default.

true or false


Elasticsearch

Key

Description

Example values

elasticsearch.logstashRetentionTime (int)

Defines the Elasticsearch logstash-* index retention time in days. The logstash-* index stores all logs gathered from all nodes and containers. Set to 1 by default.

1, 5, 15
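
For example, a minimal sketch of the values section that keeps logs for five days:

elasticsearch:
  logstashRetentionTime: 5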


Grafana

Available since 2.1.0

Key

Description

Example values

grafana.renderer.enabled (bool)

Enables or disables Grafana Image Renderer. Disable it, for example, in resource-limited environments. Enabled by default.

true or false

grafana.homeDashboard (string)

Defines the home dashboard. Set to kubernetes-cluster by default. You can define any of the available dashboards.

kubernetes-cluster
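
For example, a minimal sketch that disables Grafana Image Renderer and keeps the default home dashboard:

grafana:
  renderer:
    enabled: false
  homeDashboard: kubernetes-cluster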


Logging

Key

Description

Example values

logging.enabled (bool)

Enables or disables the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture. Set to true by default.

true or false


High availability

Key

Description

Example values

highAvailabilityEnabled (bool)

Enables or disables StackLight multiserver mode. For details, see StackLight database modes in Reference Architecture: StackLight deployment architecture. Set to false by default.

true or false


Metric collector

Key

Description

Example values

metricCollector.enabled (bool)

Disables or enables the metric collector. Modify this parameter for the management cluster only. Set to false by default.

false or true


Prometheus

Key

Description

Example values

prometheusServer.retentionTime (string)

Defines the Prometheus database retention period. Passed to the --storage.tsdb.retention.time flag. Set to 15d by default.

15d, 1000h, 10d12h

prometheusServer.retentionSize (string)

Defines the Prometheus database retention size. Passed to the --storage.tsdb.retention.size flag. Set to 15GB by default.

15GB, 512MB

prometheusServer.alertResendDelay (string)

Defines the minimum amount of time for Prometheus to wait before resending an alert to Alertmanager. Passed to the --rules.alert.resend-delay flag. Set to 2m by default.

2m, 90s
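
For example, a minimal sketch that combines the three parameters using their documented default values:

prometheusServer:
  retentionTime: "15d"
  retentionSize: "15GB"
  alertResendDelay: "2m"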


Cluster size

Key

Description

Example values

clusterSize (string)

Specifies the approximate expected cluster size. Set to small by default. Other possible values include medium and large. Depending on the choice, appropriate resource limits are passed according to the resourcesPerClusterSize parameter. The sizes differ in the Elasticsearch and Prometheus resource limits:

  • small (default) - 2 CPU, 6 Gi RAM for Elasticsearch, 1 CPU, 8 Gi RAM for Prometheus. Use small only for testing and evaluation purposes with no workloads expected.

  • medium - 4 CPU, 16 Gi RAM for Elasticsearch, 3 CPU, 16 Gi RAM for Prometheus.

  • large - 8 CPU, 32 Gi RAM for Elasticsearch, 6 CPU, 32 Gi RAM for Prometheus. Set to large only if the medium size does not provide enough resources for Elasticsearch and Prometheus.

small, medium, or large
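
For example, a minimal sketch that switches the cluster size from the default:

clusterSize: medium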


Resource limits

Key

Description

Example values

resourcesPerClusterSize (map)

Provides the capability to override the default resource requests or limits for any StackLight component for the predefined cluster sizes. For a list of StackLight components, see Components versions in Release Notes: Cluster releases.

resourcesPerClusterSize:
  elasticsearch:
    small:
      limits:
        cpu: "1000m"
        memory: "4Gi"
    medium:
      limits:
        cpu: "2000m"
        memory: "8Gi"
      requests:
        cpu: "1000m"
        memory: "4Gi"
    large:
      limits:
        cpu: "4000m"
        memory: "16Gi"

resources (map)

Provides the capability to override the containers resource requests or limits for any StackLight component. For a list of StackLight components, see Components versions in Release Notes: Cluster releases.

resources:
  alerta:
    requests:
      cpu: "50m"
      memory: "200Mi"
    limits:
      memory: "500Mi"

Using the example above, each pod in the alerta service will be requesting 50 millicores of CPU and 200 MiB of memory, while being hard-limited to 500 MiB of memory usage. Each configuration key is optional.


Kubernetes tolerations

Key

Description

Example values

tolerations.default (slice)

Kubernetes tolerations to add to all StackLight components.

default:
- key: "com.docker.ucp.manager"
  operator: "Exists"
  effect: "NoSchedule"

tolerations.component (map)

Defines Kubernetes tolerations (overrides the default ones) for any StackLight component.

component:
  elasticsearch:
  - key: "com.docker.ucp.manager"
    operator: "Exists"
    effect: "NoSchedule"
  postgresql:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"

Storage class

Key

Description

Example values

storage.defaultStorageClass (string)

Defines the StorageClass to use for all StackLight Persistent Volume Claims (PVCs) if a component StorageClass is not defined using the componentStorageClasses. To use the cluster default storage class, leave the string empty.

lvp, standard

storage.componentStorageClasses (map)

Defines (overrides the defaultStorageClass value) the storage class for any StackLight component separately. To use the cluster default storage class, leave the string empty.

componentStorageClasses:
  elasticsearch: ""
  fluentd: ""
  postgresql: ""
  prometheusAlertManager: ""
  prometheusPushGateway: ""
  prometheusServer: ""

NodeSelector

Key

Description

Example values

nodeSelector.default (map)

Defines the NodeSelector to use for most StackLight pods (except some pods that belong to DaemonSets) if the NodeSelector of a component is not defined.

default:
  role: stacklight

nodeSelector.component (map)

Defines the NodeSelector to use for particular StackLight component pods. Overrides nodeSelector.default.

component:
  alerta:
    role: stacklight
    component: alerta
  kibana:
    role: stacklight
    component: kibana

Ceph monitoring

Key

Description

Example values

ceph.enabled (bool)

Enables or disables Ceph monitoring. Set to false by default.

true or false


External endpoint monitoring

Key

Description

Example values

externalEndpointMonitoring.enabled (bool)

Enables or disables HTTP endpoints monitoring. If enabled, the monitoring tool performs the probes against the defined endpoints every 15 seconds. Set to false by default.

true or false

externalEndpointMonitoring.certificatesHostPath (string)

Defines the directory path with external endpoints certificates on host.

/etc/ssl/certs/

externalEndpointMonitoring.domains (slice)

Defines the list of HTTP endpoints to monitor.

domains:
- https://prometheus.io_health
- http://example.com:8080_status
- http://example.net:8080_pulse

Ironic monitoring

Key

Description

Example values

ironic.endpoint (string)

Enables or disables monitoring of bare metal Ironic. To enable, specify the Ironic API URL.

http://ironic-api-http.kaas.svc:6385/v1


SSL certificates monitoring

Key

Description

Example values

sslCertificateMonitoring.enabled (bool)

Enables or disables StackLight monitoring and alerting on the expiration date of the TLS certificate of an HTTPS endpoint. If enabled, the monitoring tool performs the probes against the defined endpoints every hour. Set to false by default.

true or false

sslCertificateMonitoring.domains (slice)

Defines the list of HTTPS endpoints to monitor the certificates from.

domains:
- https://prometheus.io
- http://example.com:8080

Workload monitoring

Key

Description

Example values

metricFilter (map)

On the clusters that run large-scale workloads, workload monitoring generates a large volume of resource-consuming metrics. To prevent the generation of excessive metrics, you can disable workload monitoring in the StackLight metrics and monitor only the infrastructure.

The metricFilter parameter enables the cAdvisor (Container Advisor) and kubeStateMetrics metric ingestion filters for Prometheus. Set to false by default. If set to true, you can define the namespaces to which the filter will apply.

metricFilter:
  enabled: true
  action: keep
  namespaces:
  - kaas
  - kube-system
  - stacklight
  • enabled - enable or disable metricFilter using true or false

  • action - action to take by Prometheus:

    • keep - keep only metrics from namespaces that are defined in the namespaces list

    • drop - ignore metrics from namespaces that are defined in the namespaces list

  • namespaces - list of namespaces to keep or drop metrics from, regardless of the boolean value set for each namespace


Mirantis Kubernetes Engine monitoring

Key

Description

Example values

ucp.enabled (bool)

Enables or disables Mirantis Kubernetes Engine (MKE) monitoring. Set to false by default.

true or false

ucp.dockerdDataRoot (string)

Defines the dockerd data root directory of persistent Docker state. For details, see Docker documentation: Daemon CLI (dockerd).

/var/lib/docker


Alerts configuration

Key

Description

Example values

prometheusServer.customAlerts (slice)

Defines custom alerts. Also, modifies or disables existing alert configurations. For the list of predefined alerts, see Available StackLight alerts. While adding or modifying alerts, follow the Alerting rules.

customAlerts:
# To add a new alert:
- alert: ExampleAlert
  annotations:
    description: Alert description
    summary: Alert summary
  expr: example_metric > 0
  for: 5m
  labels:
    severity: warning
# To modify an existing alert expression:
- alert: AlertmanagerFailedReload
  expr: alertmanager_config_last_reload_successful == 5
# To disable an existing alert:
- alert: TargetDown
  enabled: false

An optional enabled field is accepted in the alert body; set it to false to disable an existing alert. All fields specified using the customAlerts definition override the default predefined definitions in the charts' values.


Watchdog alert

Key

Description

Example values

prometheusServer.watchDogAlertEnabled (bool)

Enables or disables the Watchdog alert that constantly fires as long as the entire alerting pipeline is functional. You can use this alert to verify that Alertmanager notifications properly flow to the Alertmanager receivers. Set to true by default.

true or false


Alertmanager integrations

Key

Description

Example values

alertmanagerSimpleConfig.genericReceivers (slice)

Provides a generic template for notifications receiver configurations. For a list of supported receivers, see Prometheus Alertmanager documentation: Receiver.

For example, to enable notifications to OpsGenie:

alertmanagerSimpleConfig:
  genericReceivers:
  - name: HTTP-opsgenie
    enabled: true # optional
    opsgenie_configs:
    - api_url: "https://example.app.eu.opsgenie.com/"
      api_key: "secret-key"
      send_resolved: true

Notifications to email

Key

Description

Example values

alertmanagerSimpleConfig.email.enabled (bool)

Enables or disables Alertmanager integration with email. Set to false by default.

true or false

alertmanagerSimpleConfig.email (map)

Defines the notification parameters for Alertmanager integration with email. For details, see Prometheus Alertmanager documentation: Email configuration.

email:
  enabled: false
  send_resolved: true
  to: "to@test.com"
  from: "from@test.com"
  smarthost: smtp.gmail.com:587
  auth_username: "from@test.com"
  auth_password: password
  auth_identity: "from@test.com"
  require_tls: true

alertmanagerSimpleConfig.email.route (map)

Defines the route for Alertmanager integration with email. For details, see Prometheus Alertmanager documentation: Route.

route:
  match: {}
  match_re: {}
  routes: []

Notifications to Slack

Key

Description

Example values

alertmanagerSimpleConfig.slack.enabled (bool)

Enables or disables Alertmanager integration with Slack. For details, see Prometheus Alertmanager documentation: Slack configuration. Set to false by default.

true or false

alertmanagerSimpleConfig.slack.api_url (string)

Defines the Slack webhook URL.

http://localhost:8888

alertmanagerSimpleConfig.slack.channel (string)

Defines the Slack channel or user to send notifications to.

monitoring

alertmanagerSimpleConfig.slack.route (map)

Defines the notifications route for Alertmanager integration with Slack. For details, see Prometheus Alertmanager documentation: Route.

route:
  match: {}
  match_re: {}
  routes: []
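
Taken together, the Slack keys above form a single configuration block. The following is a minimal sketch that reuses the example values from this table (the webhook URL and channel are placeholders only):

alertmanagerSimpleConfig:
  slack:
    enabled: true
    api_url: http://localhost:8888
    channel: monitoring
    route:
      match: {}
      match_re: {}
      routes: []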

Notifications routing

Key

Description

Example values

alertmanagerSimpleConfig.genericRoutes (slice)

Template for notifications route configuration. For details, see Prometheus Alertmanager documentation: Route.

genericRoutes:
- receiver: HTTP-opsgenie
  enabled: true # optional
  match_re:
    severity: major|critical
  continue: true

Verify StackLight after configuration

This section describes how to verify StackLight after configuring its parameters as described in Configure StackLight and StackLight configuration parameters. Perform the verification procedure described for a particular modified StackLight key.

To verify StackLight after configuration:

Key

Verification procedure

alerta.enabled

Verify that Alerta is present in the list of StackLight resources. An empty output indicates that Alerta is disabled.

kubectl get all -n stacklight -l app=alerta

elasticsearch.logstashRetentionTime

Verify that the unit_count parameter contains the desired number of days:

kubectl get cm elasticsearch-curator-config -n \
stacklight -o=jsonpath='{.data.action_file\.yml}'

grafana.renderer.enabled Available since 2.1.0

Verify the Grafana Image Renderer. If set to true, the output should include HTTP Server started, listening at http://localhost:8081.

kubectl logs -f -n stacklight -l app=grafana --container grafana-renderer

grafana.homeDashboard Available since 2.1.0

In the Grafana web UI, verify that the desired dashboard is set as a home dashboard.

logging.enabled

Verify that Elasticsearch, Fluentd, and Kibana are present in the list of StackLight resources. An empty output indicates that the StackLight logging stack is disabled.

kubectl get all -n stacklight -l 'app in
(elasticsearch-master,kibana,fluentd-elasticsearch)'

highAvailabilityEnabled

Run kubectl get sts -n stacklight. The output includes the number of service replicas for the HA or non-HA StackLight modes. For details, see StackLight deployment architecture.

metricCollector.enabled

Verify that metric collector is present in the list of StackLight resources. An empty output indicates that metric collector is disabled.

kubectl get all -n stacklight -l app=mcc-metric-collector
  • prometheusServer.retentionTime

  • prometheusServer.retentionSize

  • prometheusServer.alertResendDelay

  1. In the Prometheus web UI, navigate to Status > Command-Line Flags.

  2. Verify the values for the following flags:

    • storage.tsdb.retention.time

    • storage.tsdb.retention.size

    • rules.alert.resend-delay

  • clusterSize

  • resourcesPerClusterSize

  • resources

  1. Obtain the list of pods:

    kubectl get po -n stacklight
    
  2. Verify that the desired resource limits or requests are set in the resources section of every container in the pod:

    kubectl get po <pod_name> -n stacklight -o yaml
    
  • nodeSelector.default

  • nodeSelector.component

  • tolerations.default

  • tolerations.component

Verify that the appropriate component pods are located on the intended nodes:

kubectl get pod -o=custom-columns=NAME:.metadata.name,\
STATUS:.status.phase,NODE:.spec.nodeName -n stacklight
  • storage.defaultStorageClass

  • storage.componentStorageClasses

Verify that the appropriate component PVCs have been created according to the configured StorageClass:

kubectl get pvc -n stacklight

ceph.enabled

  1. In the Grafana web UI, verify that Ceph dashboards are present in the list of dashboards and are populated with data.

  2. In the Prometheus web UI, click Alerts and verify that the list of alerts contains Ceph* alerts.

  • externalEndpointMonitoring.enabled

  • externalEndpointMonitoring.domains

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the blackbox-external-endpoint target contains the configured domains (URLs).

ironic.endpoint

In the Grafana web UI, verify that the Ironic BM dashboard displays valuable data (no false-positive or empty panels).

metricFilter

  1. In the Prometheus web UI, navigate to Status > Configuration.

  2. Verify that the following fields in the metric_relabel_configs section for the kubernetes-nodes-cadvisor and prometheus-kube-state-metrics scrape jobs have the required configuration:

    • action is set to keep or drop

    • regex contains a regular expression with configured namespaces delimited by |

    • source_labels is set to [namespace]

  • sslCertificateMonitoring.enabled

  • sslCertificateMonitoring.domains

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the blackbox target contains the configured domains (URLs).

ucp.enabled

  1. In the Grafana web UI, verify that the UCP Cluster and UCP Containers dashboards are present and not empty.

  2. In the Prometheus web UI, navigate to Alerts and verify that the DockerUCP* alerts are present in the list of alerts.

ucp.dockerdDataRoot

In the Prometheus web UI, navigate to Alerts and verify that the DockerUCPAPIDown alert is not firing as a false positive due to the certificate absence.

prometheusServer.customAlerts

In the Prometheus web UI, navigate to Alerts and verify that the list of alerts has changed according to your customization.

prometheusServer.watchDogAlertEnabled

In the Prometheus web UI, navigate to Alerts and verify that the list of alerts contains the Watchdog alert.

alertmanagerSimpleConfig.genericReceivers

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the intended receiver(s).

alertmanagerSimpleConfig.genericRoutes

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the intended route(s).

  • alertmanagerSimpleConfig.email.enabled

  • alertmanagerSimpleConfig.email

  • alertmanagerSimpleConfig.email.route

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the Email receiver and route.

  • alertmanagerSimpleConfig.slack.enabled

  • alertmanagerSimpleConfig.slack.api_url

  • alertmanagerSimpleConfig.slack.channel

  • alertmanagerSimpleConfig.slack.route

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the HTTP-slack receiver and route.

Enable generic metric scraping

StackLight can scrape metrics from any service that exposes Prometheus metrics and is running on the Kubernetes cluster. Such metrics appear in Prometheus under the {job="stacklight-generic",service="<service_name>",namespace="<service_namespace>"} set of labels. If the Kubernetes service is backed by Kubernetes pods, the set of labels also includes {pod="<pod_name>"}.

To enable the functionality, define at least one of the following annotations in the service metadata:

  • "generic.stacklight.mirantis.com/scrape-port" - the HTTP endpoint port. By default, the port number found through Kubernetes service discovery, usually __meta_kubernetes_pod_container_port_number. If none discovered, use the default port for the chosen scheme.

  • "generic.stacklight.mirantis.com/scrape-path" - the HTTP endpoint path, related to the Prometheus scrape_config.metrics_path option. By default, /metrics.

  • "generic.stacklight.mirantis.com/scrape-scheme" - the HTTP endpoint scheme between HTTP and HTTPS, related to the Prometheus scrape_config.scheme option. By default, http.

For example:

metadata:
  annotations:
    "generic.stacklight.mirantis.com/scrape-path": "/metrics"
metadata:
  annotations:
    "generic.stacklight.mirantis.com/scrape-port": "8080"

Manage Ceph

This section outlines Ceph LCM operations such as adding Ceph Monitor, Ceph nodes, and RADOS Gateway nodes to an existing Ceph cluster or removing them, as well as removing or replacing Ceph OSDs or updating your Ceph cluster.

Enable automated Ceph LCM

The Ceph controller can automatically redeploy Ceph OSDs in case of significant configuration changes, such as changing the block.db device or replacing Ceph OSDs. The Ceph controller can also clean up the disks and configuration during a Ceph OSD removal.

To remove a single Ceph OSD or an entire Ceph node, manually remove its definition from the KaasCephCluster CR.

To enable automated management of Ceph OSDs:

  1. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

  2. Obtain and export kubeconfig of the management cluster as described in Connect to a Mirantis Container Cloud cluster.

  3. Open the KaasCephCluster CR for editing. Choose from the following options:

    • For a management cluster:

      kubectl edit kaascephcluster
      
    • For a managed cluster:

      kubectl edit kaascephcluster -n <managedClusterProjectName>
      

      Substitute <managedClusterProjectName> with the corresponding value.

  4. Set the manageOsds parameter to true:

    spec:
      cephClusterSpec:
        manageOsds: true
    

Once done, all Ceph OSDs with a modified configuration will be redeployed. Mirantis recommends modifying only one Ceph node at a time. For details about supported configuration parameters, see OSD Configuration Settings.

Add, remove, or reconfigure Ceph nodes

The Mirantis Ceph controller simplifies Ceph cluster management by automating LCM operations. To modify the Ceph components, you only need to update the MiraCeph custom resource (CR). Once you update the MiraCeph CR, the Ceph controller automatically adds, removes, or reconfigures Ceph nodes as required.

Note

When adding a Ceph node with the Ceph Monitor role, if any issues occur with the Ceph Monitor, rook-ceph removes it and adds a new Ceph Monitor instead, named using the next alphabetic character in order. Therefore, the Ceph Monitor names may not follow the alphabetical order. For example, a, b, d, instead of a, b, c.

To add, remove, or reconfigure Ceph nodes on a management or managed cluster:

  1. To modify Ceph OSDs, verify that the manageOsds parameter is set to true in the KaasCephCluster CR as described in Enable automated Ceph LCM.

  2. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

  3. Obtain and export kubeconfig of the management cluster as described in Connect to a Mirantis Container Cloud cluster.

  4. Open the KaasCephCluster CR for editing. Choose from the following options:

    • For a management cluster:

      kubectl edit kaascephcluster
      
    • For a managed cluster:

      kubectl edit kaascephcluster -n <managedClusterProjectName>
      

      Substitute <managedClusterProjectName> with the corresponding value.

  5. In the nodes section, specify or remove the parameters for a Ceph OSD as required. For the parameters description, see OSD Configuration Settings.

    For example:

    nodes:
      kaas-mgmt-node-5bgk6:
        roles:
        - mon
        - mgr
        storageDevices:
        - config:
            storeType: bluestore
        name: sdb
    

    Note

    To use a new Ceph node for a Ceph Monitor or Ceph Manager deployment, also specify the roles parameter.

  6. If you are making changes for your managed cluster, obtain and export kubeconfig of the managed cluster as described in Connect to a Mirantis Container Cloud cluster. Otherwise, skip this step.

  7. Monitor the status of your Ceph cluster deployment. For example:

    kubectl -n rook-ceph get pods
    
    kubectl -n ceph-lcm-mirantis logs ceph-controller-78c95fb75c-dtbxk
    
    kubectl -n rook-ceph logs rook-ceph-operator-56d6b49967-5swxr
    
  8. Connect to the terminal of the ceph-tools pod:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
    
  9. Verify that the Ceph node has been successfully added, removed, or reconfigured:

    1. Verify that the Ceph cluster status is healthy:

      ceph status
      

      Example of a positive system response:

      cluster:
        id:     0868d89f-0e3a-456b-afc4-59f06ed9fbf7
        health: HEALTH_OK
      
      services:
        mon: 3 daemons, quorum a,b,c (age 20h)
        mgr: a(active, since 20h)
        osd: 9 osds: 9 up (since 20h), 9 in (since 2d)
      
      data:
        pools:   1 pools, 32 pgs
        objects: 0 objects, 0 B
        usage:   9.1 GiB used, 231 GiB / 240 GiB avail
        pgs:     32 active+clean
      
    2. Verify that the status of the Ceph OSDs is up:

      ceph osd tree
      

      Example of a positive system response:

      ID  CLASS WEIGHT  TYPE NAME                   STATUS REWEIGHT PRI-AFF
      -1       0.23424 root default
      -3       0.07808             host osd1
       1   hdd 0.02930                 osd.1           up  1.00000 1.00000
       3   hdd 0.01949                 osd.3           up  1.00000 1.00000
       6   hdd 0.02930                 osd.6           up  1.00000 1.00000
      -15       0.07808             host osd2
       2   hdd 0.02930                 osd.2           up  1.00000 1.00000
       5   hdd 0.01949                 osd.5           up  1.00000 1.00000
       8   hdd 0.02930                 osd.8           up  1.00000 1.00000
      -9       0.07808             host osd3
       0   hdd 0.02930                 osd.0           up  1.00000 1.00000
       4   hdd 0.01949                 osd.4           up  1.00000 1.00000
       7   hdd 0.02930                 osd.7           up  1.00000 1.00000
      

Replace a failed Ceph OSD

After a physical disk replacement, you can use Rook to redeploy the failed Ceph OSD by restarting the Rook operator, which triggers the reconfiguration of the management or managed cluster.

To redeploy a failed Ceph OSD:

  1. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

  2. Obtain and export kubeconfig of the required management or managed cluster as described in Connect to a Mirantis Container Cloud cluster.

  3. Identify the failed Ceph OSD ID:

    ceph osd tree
    
  4. Remove the Ceph OSD deployment from the management or managed cluster:

    kubectl delete deployment -n rook-ceph rook-ceph-osd-<ID>
    
  5. Connect to the terminal of the ceph-tools pod:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
    
  6. Remove the failed Ceph OSD from the Ceph cluster:

    ceph osd purge osd.<ID>
    
  7. Replace the failed disk.

  8. Restart the Rook operator:

    kubectl delete pod $(kubectl -n rook-ceph get pod -l "app=rook-ceph-operator" \
    -o jsonpath='{.items[0].metadata.name}') -n rook-ceph
    

Update Ceph cluster

You can update the Ceph cluster to the latest minor version of Ceph Nautilus by triggering an update of the existing Ceph cluster.

To update Ceph cluster:

  1. Verify that your management cluster is automatically upgraded to the latest Mirantis Container Cloud release:

    1. Log in to the Container Cloud web UI with the writer permissions.

    2. At the bottom of the page, verify the Container Cloud version number.

  2. Verify that your managed clusters are updated to the latest Cluster release. For details, see Update a managed cluster.

  3. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

  4. Obtain and export kubeconfig of the management cluster as described in Connect to a Mirantis Container Cloud cluster.

  5. Open the KaasCephCluster CR for editing:

    kubectl edit kaascephcluster
    
  6. Update the version parameter (a sketch of the parameter placement is provided after this procedure). For example:

    version: 14.2.9
    
  7. Obtain and export kubeconfig of the managed clusters as described in Connect to a Mirantis Container Cloud cluster.

  8. Repeat steps 5-7 to update Ceph on every managed cluster.
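
For reference, the following sketch shows a possible placement of the version parameter, assuming it resides in the same cephClusterSpec section as the manageOsds parameter described in Enable automated Ceph LCM. Verify the exact structure in your KaasCephCluster CR before editing:

spec:
  cephClusterSpec:
    version: 14.2.9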

Verify Ceph components

This section describes how to verify the Ceph cluster components.

Verify the Ceph core services

To confirm that all Ceph components including mon, mgr, osd, and rgw have joined your cluster properly, analyze the logs for each pod and verify the Ceph status:

kubectl exec -it rook-ceph-tools-5748bc69c6-cpzf8 -n rook-ceph bash
ceph -s

Example of a positive system response:

cluster:
    id:     4336ab3b-2025-4c7b-b9a9-3999944853c8
    health: HEALTH_OK

services:
    mon: 3 daemons, quorum a,b,c (age 20m)
    mgr: a(active, since 19m)
    osd: 6 osds: 6 up (since 16m), 6 in (since 16m)
    rgw: 1 daemon active (miraobjstore.a)

data:
    pools:   12 pools, 216 pgs
    objects: 201 objects, 3.9 KiB
    usage:   6.1 GiB used, 174 GiB / 180 GiB avail
    pgs:     216 active+clean

Verify rook-discover

To ensure that rook-discover is running properly, verify that the local-device configmap has been created for each Ceph node specified in the cluster configuration:

  1. Obtain the list of local devices:

    kubectl get configmap -n rook-ceph | grep local-device
    

    Example of a system response:

    local-device-01      1      30m
    local-device-02      1      29m
    local-device-03      1      30m
    
  2. Verify that each configmap from the list contains information about the devices available for the Ceph node deployment:

    kubectl describe configmap local-device-01 -n rook-ceph
    

    Example of a positive system response:

    Name:         local-device-01
    Namespace:    rook-ceph
    Labels:       app=rook-discover
                  rook.io/node=01
    Annotations:  <none>
    
    Data
    ====
    devices:
    ----
    [{"name":"vdd","parent":"","hasChildren":false,"devLinks":"/dev/disk/by-id/virtio-41d72dac-c0ff-4f24-b /dev/disk/by-path/virtio-pci-0000:00:09.0","size":32212254720,"uuid":"27e9cf64-85f4-48e7-8862-faa7270202ed","serial":"41d72dac-c0ff-4f24-b","type":"disk","rotational":true,"readOnly":false,"Partitions":null,"filesystem":"","vendor":"","model":"","wwn":"","wwnVendorExtension":"","empty":true,"cephVolumeData":"{\"path\":\"/dev/vdd\",\"available\":true,\"rejected_reasons\":[],\"sys_api\":{\"size\":32212254720.0,\"scheduler_mode\":\"none\",\"rotational\":\"1\",\"vendor\":\"0x1af4\",\"human_readable_size\":\"30.00 GB\",\"sectors\":0,\"sas_device_handle\":\"\",\"rev\":\"\",\"sas_address\":\"\",\"locked\":0,\"sectorsize\":\"512\",\"removable\":\"0\",\"path\":\"/dev/vdd\",\"support_discard\":\"0\",\"model\":\"\",\"ro\":\"0\",\"nr_requests\":\"128\",\"partitions\":{}},\"lvs\":[]}","label":""},{"name":"vdb","parent":"","hasChildren":false,"devLinks":"/dev/disk/by-path/virtio-pci-0000:00:07.0","size":67108864,"uuid":"988692e5-94ac-4c9a-bc48-7b057dd94fa4","serial":"","type":"disk","rotational":true,"readOnly":false,"Partitions":null,"filesystem":"","vendor":"","model":"","wwn":"","wwnVendorExtension":"","empty":true,"cephVolumeData":"{\"path\":\"/dev/vdb\",\"available\":false,\"rejected_reasons\":[\"Insufficient space (\\u003c5GB)\"],\"sys_api\":{\"size\":67108864.0,\"scheduler_mode\":\"none\",\"rotational\":\"1\",\"vendor\":\"0x1af4\",\"human_readable_size\":\"64.00 MB\",\"sectors\":0,\"sas_device_handle\":\"\",\"rev\":\"\",\"sas_address\":\"\",\"locked\":0,\"sectorsize\":\"512\",\"removable\":\"0\",\"path\":\"/dev/vdb\",\"support_discard\":\"0\",\"model\":\"\",\"ro\":\"0\",\"nr_requests\":\"128\",\"partitions\":{}},\"lvs\":[]}","label":""},{"name":"vdc","parent":"","hasChildren":false,"devLinks":"/dev/disk/by-id/virtio-e8fdba13-e24b-41f0-9 /dev/disk/by-path/virtio-pci-0000:00:08.0","size":32212254720,"uuid":"190a50e7-bc79-43a9-a6e6-81b173cd2e0c","serial":"e8fdba13-e24b-41f0-9","type":"disk","rotational":true,"readOnly":false,"Partitions":null,"filesystem":"","vendor":"","model":"","wwn":"","wwnVendorExtension":"","empty":true,"cephVolumeData":"{\"path\":\"/dev/vdc\",\"available\":true,\"rejected_reasons\":[],\"sys_api\":{\"size\":32212254720.0,\"scheduler_mode\":\"none\",\"rotational\":\"1\",\"vendor\":\"0x1af4\",\"human_readable_size\":\"30.00 GB\",\"sectors\":0,\"sas_device_handle\":\"\",\"rev\":\"\",\"sas_address\":\"\",\"locked\":0,\"sectorsize\":\"512\",\"removable\":\"0\",\"path\":\"/dev/vdc\",\"support_discard\":\"0\",\"model\":\"\",\"ro\":\"0\",\"nr_requests\":\"128\",\"partitions\":{}},\"lvs\":[]}","label":""}]
    

Troubleshooting

This section provides solutions to the issues that may occur while operating a Mirantis Container Cloud management, regional, or managed cluster.

Collect cluster logs

Caution

This feature is available starting from the Container Cloud release 2.2.0.

While operating your management, regional, or managed cluster, you may require collecting and inspecting the cluster logs to analyze cluster events or troubleshoot issues. For the logs structure, see Deployment Guide: Collect the bootstrap logs.

To collect cluster logs:

  1. Choose from the following options:

    • If you did not delete the kaas-bootstrap folder from the bootstrap node, log in to the bootstrap node.

    • If you deleted the kaas-bootstrap folder:

      1. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

      2. Download and run the Container Cloud bootstrap script:

        wget https://binary.mirantis.com/releases/get_container_cloud.sh
        
        chmod 0755 get_container_cloud.sh
        
        ./get_container_cloud.sh
        
  2. Obtain kubeconfig of the required cluster. The management or regional cluster kubeconfig files are created during the last stage of the management or regional cluster bootstrap. To obtain a managed cluster kubeconfig, see Connect to a Mirantis Container Cloud cluster.

  3. Obtain the private SSH key of the required cluster. For a management or regional cluster, this key is created during bootstrap of a management cluster in ~/.ssh/openstack_tmp. For a managed cluster, this is an SSH key added in the Container Cloud web UI before the managed cluster creation.

  4. Depending on the cluster type that you require logs from, run the corresponding command:

    • For a management cluster:

      kaas collect logs --management-kubeconfig <pathToMgmtClusterKubeconfig> \
      --key-file <pathToMgmtClusterPrivateSshKey> \
      --cluster-name <clusterName> --cluster-namespace <clusterProject>
      
    • For a regional cluster:

      kaas collect logs --management-kubeconfig <pathToMgmtClusterKubeconfig> \
      --key-file <pathToRegionalClusterSshKey> --kubeconfig <pathToRegionalClusterKubeconfig> \
      --cluster-name <clusterName> --cluster-namespace <clusterProject>
      
    • For a managed cluster:

      kaas collect logs --management-kubeconfig <pathToMgmtClusterKubeconfig> \
      --key-file <pathToManagedClusterSshKey> --kubeconfig <pathToManagedClusterKubeconfig> \
      --cluster-name <clusterName> --cluster-namespace <clusterProject>
      

    Substitute the parameters enclosed in angle brackets with the corresponding values of your cluster.

    Optionally, add the --output-dir flag to define the directory path for the collected logs. The default value is logs/. For example, logs/<clusterName>/events.log.

Mirantis Container Cloud API

Warning

This section is intended only for advanced Infrastructure Operators who are familiar with Kubernetes Cluster API.

Mirantis currently supports only those Mirantis Container Cloud API features that are implemented in the Container Cloud web UI. Use other Container Cloud API features for testing and evaluation purposes only.

The Container Cloud APIs are implemented using the Kubernetes CustomResourceDefinitions (CRDs) that enable you to expand the Kubernetes API. Different types of resources are grouped into dedicated files, such as cluster.yaml or machines.yaml.

This section contains descriptions and examples of the Container Cloud API resources for the bare metal cloud provider.

Note

The API documentation for the OpenStack, AWS, and VMware vSphere resources will be added in the upcoming Container Cloud releases.

Public key resources

This section describes the PublicKey resource used in Mirantis Container Cloud API for all supported providers: OpenStack, AWS, and bare metal. This resource is used to provide SSH access to every machine of a Container Cloud cluster.

The Container Cloud PublicKey CR contains the following fields:

  • apiVersion

    API version of the object that is kaas.mirantis.com/v1alpha1

  • kind

    Object type that is PublicKey

  • metadata

    The metadata object field of the PublicKey resource contains the following fields:

    • name

      Name of the public key

    • namespace

      Project where the public key is created

  • spec

    The spec object field of the PublicKey resource contains the publicKey field that is an SSH public key value.

The PublicKey resource example:

apiVersion: kaas.mirantis.com/v1alpha1
kind: PublicKey
metadata:
  name: demokey
  namespace: test
spec:
  publicKey: |
    ssh-rsa AAAAB3NzaC1yc2EAAAA…

Bare metal resources

This section contains descriptions and examples of the baremetal-based Kubernetes resources for Mirantis Container Cloud.

Cluster

This section describes the Cluster resource used in the Mirantis Container Cloud API to define the cluster-level parameters.

For demonstration purposes, the Container Cloud Cluster custom resource (CR) is split into the following major sections:

Warning

The fields of the Cluster resource that are located under the status section including providerStatus are available for viewing only. They are automatically generated by the bare metal cloud provider and must not be modified using Container Cloud API.

metadata

The Container Cloud Cluster CR contains the following fields:

  • apiVersion

    API version of the object that is cluster.k8s.io/v1alpha1.

  • kind

    Object type that is Cluster.

The metadata object field of the Cluster resource contains the following fields:

  • name

    Name of a cluster. A managed cluster name is specified under the Cluster Name field in the Create Cluster wizard of the Container Cloud web UI. The management and regional cluster names are configurable in the bootstrap script.

  • namespace

    Project in which the cluster object was created. The management and regional clusters are created in the default project. The managed cluster project equals the selected project name.

  • labels

    Key-value pairs attached to the object:

    • kaas.mirantis.com/provider

      Provider type that is baremetal for the baremetal-based clusters.

    • kaas.mirantis.com/region

      Region name. The default region name for the management cluster is region-one. For the regional cluster, it is configurable using the REGION parameter in the bootstrap script.

Configuration example:

apiVersion: cluster.k8s.io/v1alpha1
kind: Cluster
metadata:
  name: demo
  namespace: test
  labels:
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one

spec:providerSpec

The spec object field of the Cluster object represents the BaremetalClusterProviderSpec subresource that contains a complete description of the desired bare metal cluster state and all details to create the cluster-level resources. It also contains the fields required for LCM deployment and integration of the Container Cloud components.

The providerSpec object field is custom for each cloud provider and contains the following generic fields for the bare metal provider:

  • apiVersion

    API version of the object that is baremetal.k8s.io/v1alpha1

  • kind

    Object type that is BaremetalClusterProviderSpec

Configuration example:

spec:
  ...
  providerSpec:
    value:
      apiVersion: baremetal.k8s.io/v1alpha1
      kind: BaremetalClusterProviderSpec

spec:providerSpec common

The providerSpec object field of the Cluster resource contains the following common fields for all Container Cloud providers:

  • publicKeys

    List of the SSH public key references

  • release

    Name of the ClusterRelease object to install on a cluster

  • helmReleases

    List of the enabled Helm releases from the Release object that run on a Container Cloud cluster

Configuration example:

spec:
  ...
  providerSpec:
    value:
      publicKeys:
        - name: bootstrap-key
      release: ucp-5-7-0-3-3-3-tp11
      helmReleases:
        - name: metallb
          values:
            configInline:
              address-pools:
                - addresses:
                  - 10.0.0.101-10.0.0.120
                  name: default
                  protocol: layer2
        ...
        - name: stacklight

spec:providerSpec configuration

This section represents the Container Cloud components that are enabled on a cluster. It contains the following fields:

  • management

    Configuration for the management cluster components:

    • enabled

      Management cluster enabled (true) or disabled (false).

    • helmReleases

      List of the management cluster Helm releases that will be installed on the cluster. A Helm release includes the name and values fields. The specified values will be merged with relevant Helm release values of the management cluster in the Release object.

  • regional

    List of the regional cluster components on the Container Cloud cluster for each provider configured for a specific region:

    • provider

      Provider type that is baremetal.

    • helmReleases

      List of the regional Helm releases that will be installed on the cluster. A Helm release includes the name and values fields. The specified values will be merged with relevant regional Helm release values in the Release object.

  • release

    Name of the Container Cloud Release object.

Configuration example:

spec:
  ...
  providerSpec:
     value:
       kaas:
         management:
           enabled: true
           helmReleases:
             - name: kaas-ui
               values:
                 serviceConfig:
                   server: https://10.0.0.117
         regional:
           - helmReleases:
             - name: baremetal-provider
               values: {}
             provider: baremetal
           - helmReleases:
             - name: byo-provider
               values: {}
             provider: byo
         release: kaas-2-0-0

status:providerStatus common

Must not be modified using API

The common providerStatus object field of the Cluster resource contains the following fields:

  • apiVersion

    API version of the object that is baremetal.k8s.io/v1alpha1

  • kind

    Object type that is BaremetalClusterProviderStatus

  • loadBalancerHost

    Load balancer IP or host name of the Container Cloud cluster

  • apiServerCertificate

    Server certificate of Kubernetes API

  • ucpDashboard

    URL of the Mirantis Kubernetes Engine (MKE) Dashboard

Configuration example:

status:
  providerStatus:
    apiVersion: baremetal.k8s.io/v1alpha1
    kind: BaremetalClusterProviderStatus
    loadBalancerHost: 10.0.0.100
    apiServerCertificate: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS…
    ucpDashboard: https://10.0.0.100:6443

status:providerStatus for cluster readiness

Must not be modified using API

The providerStatus object field of the Cluster resource that reflects the cluster readiness contains the following fields:

  • persistentVolumesProviderProvisioned

    Status of the persistent volumes provisioning. Prevents the Helm releases that require persistent volumes from being installed until some default StorageClass is added to the Cluster object.

  • helm

    Details about the deployed Helm releases:

    • ready

      Status of the deployed Helm releases. The true value indicates that all Helm releases are deployed successfully.

    • releases

      List of the enabled Helm releases that run on the Container Cloud cluster:

      • releaseStatuses

        List of the deployed Helm releases. The success: true field indicates that the release is deployed successfully.

      • stacklight

        Status of the StackLight deployment. Contains URLs of all StackLight components. The success: true field indicates that StackLight is deployed successfully.

  • nodes

    Details about the cluster nodes:

    • ready

      Number of nodes that completed the deployment or update.

    • requested

      Total number of nodes. If the number of ready nodes does not match the number of requested nodes, the cluster is currently being deployed or updated.

  • notReadyObjects

    The list of the services, deployments, and statefulsets Kubernetes objects that are not in the Ready state yet. A service is not ready if its external address has not been provisioned yet. A deployment or statefulset is not ready if the number of ready replicas is not equal to the number of desired replicas. Both objects contain the name and namespace of the object and the number of ready and desired replicas (for controllers). If all objects are ready, the notReadyObjects list is empty.

Configuration example:

status:
  providerStatus:
    persistentVolumesProviderProvisioned: true
    helm:
      ready: true
      releases:
        releaseStatuses:
          iam:
            success: true
          ...
        stacklight:
          alerta:
            url: http://10.0.0.106
          alertmanager:
            url: http://10.0.0.107
          grafana:
            url: http://10.0.0.108
          kibana:
            url: http://10.0.0.109
          prometheus:
            url: http://10.0.0.110
          success: true
    nodes:
      ready: 3
      requested: 3
    notReadyObjects:
      services:
        - name: testservice
          namespace: default
      deployments:
        - name: baremetal-provider
          namespace: kaas
          replicas: 3
          readyReplicas: 2
      statefulsets: {}

status:providerStatus for Open ID Connect

Must not be modified using API

The oidc section of the providerStatus object field in the Cluster resource reflects the Open ID Connect configuration details. It contains the required details to obtain a token for a Container Cloud cluster and consists of the following fields:

  • certificate

    Base64-encoded OIDC certificate.

  • clientId

    Client ID for OIDC requests.

  • groupsClaim

    Name of an OIDC groups claim.

  • issuerUrl

    Issuer URL to obtain the representation of the realm.

  • ready

    OIDC status relevance. If true, the status corresponds to the LCMCluster OIDC configuration.

Configuration example:

status:
  providerStatus:
    oidc:
      certificate: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUREekNDQWZ...
      clientId: kaas
      groupsClaim: iam_roles
      issuerUrl: https://10.0.0.117/auth/realms/iam
      ready: true

status:providerStatus for cluster releases

Must not be modified using API

The releaseRefs section of the providerStatus object field in the Cluster resource provides the current Cluster release version as well as the one available for upgrade. It contains the following fields:

  • current

    Details of the currently installed Cluster release:

    • lcmType

      Type of the Cluster release (ucp).

    • name

      Name of the Cluster release resource.

    • version

      Version of the Cluster release.

    • unsupportedSinceKaaSVersion

      Indicates that a Container Cloud release newer than the current one exists and that it does not support the current Cluster release.

  • available

    List of the releases available for upgrade. Contains the name and version fields.

Configuration example:

status:
  providerStatus:
    releaseRefs:
      available:
        - name: ucp-5-5-0-3-4-0-dev
          version: 5.5.0+3.4.0-dev
      current:
        lcmType: ucp
        name: ucp-5-4-0-3-3-0-beta1
        version: 5.4.0+3.3.0-beta1

Machine

This section describes the Machine resource used in Mirantis Container Cloud API for bare metal provider. The Machine resource describes the machine-level parameters.

For demonstration purposes, the Container Cloud Machine custom resource (CR) is split into the following major sections:

metadata

The Container Cloud Machine CR contains the following fields:

  • apiVersion

    API version of the object that is cluster.k8s.io/v1alpha1.

  • kind

    Object type that is Machine.

The metadata object field of the Machine resource contains the following fields:

  • name

    Name of the Machine object.

  • namespace

    Project in which the Machine object is created.

  • annotations

    Key-value pair to attach arbitrary metadata to the object:

    • metal3.io/BareMetalHost

      Annotation attached to the Machine object to reference the corresponding BareMetalHost object in the <BareMetalHostProjectName/BareMetalHostName> format.

  • labels

    Key-value pairs that are attached to the object:

    • kaas.mirantis.com/provider

      Provider type that matches the provider type in the Cluster object and must be baremetal.

    • kaas.mirantis.com/region

      Region name that matches the region name in the Cluster object.

    • cluster.sigs.k8s.io/cluster-name

      Cluster name that the Machine object is linked to.

    • cluster.sigs.k8s.io/control-plane

      For the control plane role of a machine, this label contains any value, for example, "true". For the worker role, this label is absent or does not contain any value.

Configuration example:

apiVersion: cluster.k8s.io/v1alpha1
kind: Machine
metadata:
  name: example-control-plane
  namespace: example-ns
  annotations:
    metal3.io/BareMetalHost: default/master-0
  labels:
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
    cluster.sigs.k8s.io/cluster-name: example-cluster
    cluster.sigs.k8s.io/control-plane: "true" # remove for worker

spec:providerSpec for instance configuration

The spec object field of the Machine object represents the BareMetalMachineProviderSpec subresource with all required details to create a bare metal instance. It contains the following fields:

  • apiVersion

    API version of the object that is baremetal.k8s.io/v1alpha1.

  • kind

    Object type that is BareMetalMachineProviderSpec.

  • bareMetalHostProfile

    Configuration profile of a bare metal host:

    • name

      Name of a bare metal host profile

    • namespace

      Project in which the bare metal host profile is created.

  • l2TemplateIfMappingOverride

    If specified, overrides the interface mapping value for the corresponding L2Template object.

  • hostSelector

    Specifies the matching criteria for labels on the bare metal hosts. Limits the set of BareMetalHost objects that the Machine object can claim. The following selector labels can be added when creating a machine using the Container Cloud web UI:

    • hostlabel.bm.kaas.mirantis.com/controlplane

    • hostlabel.bm.kaas.mirantis.com/worker

    • hostlabel.bm.kaas.mirantis.com/storage

    Any custom label that is assigned to one or more bare metal hosts using API can be used as a host selector. If the BareMetalHost objects with the specified label are missing, the Machine object will not be deployed until at least one bare metal host with the specified label is available.

  • nodeLabels

    List of node labels to be attached to the corresponding node. Enables running of certain components on separate cluster nodes. The list of allowed node labels is defined in the providerStatus.releaseRef.current.allowedNodeLabels cluster status. Addition of any unsupported node label not from this list is restricted.

Configuration example:

providerSpec:
  value:
    apiVersion: baremetal.k8s.io/v1alpha1
    kind: BareMetalMachineProviderSpec
    bareMetalHostProfile:
      name: default
      namespace: default
    l2TemplateIfMappingOverride:
      - eno1
      - enp0s0
    hostSelector:
      matchLabels:
        baremetal: hw-master-0
    nodeLabels:
    - key: stacklight
      value: enabled

Machine status

The status object field of the Machine object represents the BareMetalMachineProviderStatus subresource that describes the current bare metal instance state and contains the following fields:

  • apiVersion

    API version of the object that is cluster.k8s.io/v1alpha1.

  • kind

    Object type that is BareMetalMachineProviderStatus.

  • hardware

    Provides the machine hardware information:

    • cpu

      Number of CPUs.

    • ram

      RAM capacity in GB.

    • storage

      List of hard drives mounted on the machine. Contains the disk name and size in GB.

  • status

    Represents the current status of a machine:

    • Provision

      Machine is yet to obtain a status.

    • Uninitialized

      Machine is yet to obtain a node IP address and hostname.

    • Pending

      Machine is yet to receive the deployment instructions. It is either not booted yet or waits for the LCM controller to be deployed.

    • Prepare

      Machine is running the Prepare phase, during which mostly Docker images and packages are predownloaded.

    • Deploy

      Machine is processing the LCM controller instructions.

    • Reconfigure

      Some configurations are being updated on a machine.

    • Ready

      Machine is deployed and the supported Mirantis Kubernetes Engine (MKE) version is set.

Configuration example:

status:
  providerStatus:
    apiVersion: baremetal.k8s.io/v1alpha1
    kind: BareMetalMachineProviderStatus
    hardware:
      cpu: 11
      ram: 16
    storage:
      - name: /dev/vda
        size: 61
      - name: /dev/vdb
        size: 32
      - name: /dev/vdc
        size: 32
    status: Ready

BareMetalHostProfile

This section describes the BareMetalHostProfile resource used in Mirantis Container Cloud API to define how the storage devices and operating system are provisioned and configured.

For demonstration purposes, the Container Cloud BareMetalHostProfile custom resource (CR) is split into the following major sections:

metadata

The Container Cloud BareMetalHostProfile CR contains the following fields:

  • apiVersion

    API version of the object that is metal3.io/v1alpha1.

  • kind

    Object type that is BareMetalHostProfile.

  • metadata

    The metadata field contains the following subfields:

    • name

      Name of the bare metal host profile.

    • namespace

      Project in which the bare metal host profile was created.

Configuration example:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHostProfile
metadata:
  name: default
  namespace: default

spec

The spec field of BareMetalHostProfile object contains the fields to customize your hardware configuration:

  • devices

    List of definitions of the physical storage devices. To configure more than three storage devices per host, add additional devices to this list. Each device in the list may have one or more partitions defined by the list in the partitions field.

  • fileSystems

    List of file systems. Each file system can be created on top of a device, partition, or logical volume. If more file systems are required for additional devices, define them in this field.

  • logicalVolumes

    List of LVM logical volumes. Every logical volume belongs to a volume group from the volumeGroups list and has the sizeGiB attribute that defines its size in GiB.

  • volumeGroups

    List of definitions of LVM volume groups. Each volume group contains one or more devices or partitions from the devices list.

  • preDeployScript

    Shell script that is executed on a host before provisioning the target operating system inside the ramfs system.

  • postDeployScript

    Shell script that is executed on a host after deploying the operating system inside the ramfs system that is chrooted to the target operating system.

  • grubConfig

    List of options passed to the Linux GRUB bootloader. Each string in the list defines one parameter.

  • kernelParameters:sysctl Available since 2.2.0

    List of options passed to /etc/sysctl.d/999-baremetal.conf during the bare metal host provisioning.

Configuration example:

spec:
  devices:
  - device:
      wipe: true
    partitions:
    - dev: ""
      name: bios_grub
      partflags:
      - bios_grub
      sizeGiB: 0.00390625
      ...
  - device:
      wipe: true
    partitions:
    - dev: ""
      name: lvm_lvp_part
  fileSystems:
  - fileSystem: vfat
    partition: config-2
  - fileSystem: vfat
    mountPoint: /boot/efi
    partition: uefi
    ...
  - fileSystem: ext4
    logicalVolume: lvp
    mountPoint: /mnt/local-volumes/
  logicalVolumes:
  - name: root
    sizeGiB: 0
    vg: lvm_root
  - name: lvp
    sizeGiB: 0
    vg: lvm_lvp
  postDeployScript: |
    #!/bin/bash -ex
    echo $(date) 'post_deploy_script done' >> /root/post_deploy_done
  preDeployScript: |
    #!/bin/bash -ex
    echo $(date) 'pre_deploy_script done' >> /root/pre_deploy_done
  volumeGroups:
  - devices:
    - partition: lvm_root_part
    name: lvm_root
  - devices:
    - partition: lvm_lvp_part
    name: lvm_lvp
  grubConfig:
    defaultGrubOptions:
    - GRUB_DISABLE_RECOVERY="true"
    - GRUB_PRELOAD_MODULES=lvm
    - GRUB_TIMEOUT=20
  kernelParameters:
    sysctl:
      kernel.panic: "900"
      kernel.dmesg_restrict: "1"
      kernel.core_uses_pid: "1"
      fs.file-max: "9223372036854775807"
      fs.aio-max-nr: "1048576"
      fs.inotify.max_user_instances: "4096"
      vm.max_map_count: "262144"

BareMetalHost

This section describes the BareMetalHost resource used in the Mirantis Container Cloud API. A BareMetalHost object is created for each Machine and contains all information about the machine hardware configuration, which is used to select the machine to deploy. When a machine is created, the provider assigns a BareMetalHost to that machine based on labels and the BareMetalHostProfile configuration.

For demonstration purposes, the Container Cloud BareMetalHost custom resource (CR) can be split into the following major sections:

BareMetalHost metadata

The Container Cloud BareMetalHost CR contains the following fields:

  • apiVersion

    API version of the object that is metal3.io/v1alpha1.

  • kind

    Object type that is BareMetalHost.

  • metadata

    The metadata field contains the following subfields:

    • name

      Name of the BareMetalHost object.

    • namespace

      Project in which the BareMetalHost object was created.

    • labels

      Labels used by the bare metal provider to find a matching BareMetalHost object to deploy a machine:

      • hostlabel.bm.kaas.mirantis.com/controlplane

      • hostlabel.bm.kaas.mirantis.com/worker

      • hostlabel.bm.kaas.mirantis.com/storage

      Each BareMetalHost object added using the Container Cloud web UI will be assigned one of these labels. If the BareMetalHost and Machine objects are created using API, any label may be used to match these objects for a bare metal host to deploy a machine.

Configuration example:

apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: master-0
  namespace: default
  labels:
    baremetal: hw-master-0

BareMetalHost configuration

The spec section for the BareMetalHost object defines the desired state of BareMetalHost. It contains the following fields:

  • bmc

    Details for communication with the Baseboard Management Controller (bmc) module on a host:

    • address

      URL for accessing bmc in the network.

    • credentialsName

      Name of the secret containing the bmc credentials. The secret requires the username and password keys in Base64 encoding. A sketch of such a secret is provided after the configuration example below.

  • bootMACAddress

    MAC address for booting.

  • bootUEFI

    UEFI boot mode enabled (true) or disabled (false).

  • online

    Defines whether the server must be online after inspection.

Configuration example:

spec:
  bmc:
    address: 5.43.227.106:623
    credentialsName: master-0-bmc-secret
  bootMACAddress: 0c:c4:7a:a8:d3:44
  bootUEFI: true
  consumerRef:
    apiVersion: cluster.k8s.io/v1alpha1
    kind: Machine
    name: master-0
    namespace: default
  online: true
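
For reference, the secret referenced by credentialsName is a regular Kubernetes Secret. The following is a minimal sketch that matches the name and namespace from the example above; the username and password values are placeholders only:

apiVersion: v1
kind: Secret
metadata:
  name: master-0-bmc-secret
  namespace: default
data:
  username: YWRtaW4=       # "admin" in Base64, placeholder value
  password: cGFzc3dvcmQ=   # "password" in Base64, placeholder value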

BareMetalHost status

The status field of the BareMetalHost object defines the current state of BareMetalHost. It contains the following fields:

  • errorMessage

    Last error message reported by the provisioning subsystem.

  • goodCredentials

    Last credentials that were validated.

  • hardware

    Hardware discovered on the host. Contains information about the storage, CPU, host name, firmware, and so on.

  • operationalStatus

    Status of the host:

    • OK

      Host is configured correctly and is manageable.

    • discovered

      Host is only partially configured. For example, the bmc address is discovered but not the login credentials.

    • error

      Host has any sort of error.

  • poweredOn

    Host availability status: powered on (true) or powered off (false).

  • provisioning

    State information tracked by the provisioner:

    • state

      Current action being done with the host by the provisioner.

    • id

      UUID of a machine.

  • triedCredentials

    Details of the last credentials sent to the provisioning back end.

Configuration example:

status:
  errorMessage: ""
  goodCredentials:
    credentials:
      name: master-0-bmc-secret
      namespace: default
    credentialsVersion: "13404"
  hardware:
    cpu:
      arch: x86_64
      clockMegahertz: 3000
      count: 32
      flags:
      - 3dnowprefetch
      - abm
      ...
      model: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
    firmware:
      bios:
        date: ""
        vendor: ""
        version: ""
    hostname: ipa-fcab7472-892f-473c-85a4-35d64e96c78f
    nics:
    - ip: ""
      mac: 0c:c4:7a:a8:d3:45
      model: 0x8086 0x1521
      name: enp8s0f1
      pxe: false
      speedGbps: 0
      vlanId: 0
      ...
    ramMebibytes: 262144
    storage:
    - by_path: /dev/disk/by-path/pci-0000:00:1f.2-ata-1
      hctl: "4:0:0:0"
      model: Micron_5200_MTFD
      name: /dev/sda
      rotational: false
      serialNumber: 18381E8DC148
      sizeBytes: 1920383410176
      vendor: ATA
      wwn: "0x500a07511e8dc148"
      wwnWithExtension: "0x500a07511e8dc148"
      ...
    systemVendor:
      manufacturer: Supermicro
      productName: SYS-6018R-TDW (To be filled by O.E.M.)
      serialNumber: E16865116300188
  operationalStatus: OK
  poweredOn: true
  provisioning:
    state: provisioned
  triedCredentials:
    credentials:
      name: master-0-bmc-secret
      namespace: default
    credentialsVersion: "13404"

IpamHost

This section describes the IpamHost resource used in Mirantis Container Cloud API. The kaas-ipam controller monitors the current state of the bare metal Machine and verifies whether the BareMetalHost object is successfully created and its inspection is completed. Then the kaas-ipam controller fetches the information about the network card, creates the IpamHost object, and requests the IP address.

The IpamHost object is created for each Machine and contains the full configuration of the host network interfaces and IP addresses. It also contains the information about the associated BareMetalHost and Machine objects and the MAC addresses.

For demonstration purposes, the Container Cloud IpamHost custom resource (CR) is split into the following major sections:

IpamHost metadata

The Container Cloud IpamHost CR contains the following fields:

  • apiVersion

    API version of the object that is ipam.mirantis.com/v1alpha1

  • kind

    Object type that is IpamHost

  • metadata

    The metadata field contains the following subfields:

    • name

      Name of the IpamHost object

    • namespace

      Project in which the IpamHost object has been created

    • labels

      Key-value pairs that are attached to the object:

      • cluster.sigs.k8s.io/cluster-name

        References the Cluster object name that IpamHost is assigned to

      • ipam/BMHostID

        Unique ID of the associated BareMetalHost object

      • ipam/MAC-XX-XX-XX-XX-XX-XX: "1"

        Number of NICs of the host that the corresponding MAC address is assigned to

      • ipam/MachineID

        Unique ID of the associated Machine object

      • ipam/UID

        Unique ID of the IpamHost object

Configuration example:

apiVersion: ipam.mirantis.com/v1alpha1
kind: IpamHost
metadata:
  name: master-0
  namespace: default
  labels:
    cluster.sigs.k8s.io/cluster-name: kaas-mgmt
    ipam/BMHostID: 57250885-f803-11ea-88c8-0242c0a85b02
    ipam/MAC-0C-C4-7A-1E-A9-5C: "1"
    ipam/MAC-0C-C4-7A-1E-A9-5D: "1"
    ipam/MachineID: 573386ab-f803-11ea-88c8-0242c0a85b02
    ipam/UID: 834a2fc0-f804-11ea-88c8-0242c0a85b02
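
To list the IpamHost objects that belong to a specific cluster, you can filter by the cluster name label described above. A minimal sketch, assuming the ipamhost resource name resolves to this object type:

# Assumes the ipamhost resource name and the kaas-mgmt cluster name.
kubectl -n default get ipamhost -l cluster.sigs.k8s.io/cluster-name=kaas-mgmt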
IpamHost configuration

The spec field of the IpamHost resource describes the desired state of the object. It contains the nicMACmap field that represents an unordered list of all NICs of the host. Each NIC entry contains fields such as name, mac, ip, and so on. The primary field indicates that the corresponding NIC is the primary one. Only one NIC can be primary.

Configuration example:

spec:
  nicMACmap:
  - mac: 0c:c4:7a:1e:a9:5c
    name: ens11f0
  - ip: 172.16.48.157
    mac: 0c:c4:7a:1e:a9:5d
    name: ens11f1
    primary: true
IpamHost status

The status field of the IpamHost resource describes the observed state of the object. It contains the following fields:

  • ipAllocationResult

    Status of IP allocation for the primary NIC (PXE boot). Possible values are OK, or ERR if no IP address was allocated.

  • l2RenderResult

    Result of the L2 template rendering, if applicable. Possible values are OK or an error message.

  • lastUpdated

    Date and time of the last IpamHost status update.

  • nicMACmap

    Unordered list of all NICs of the host with a detailed description. Each nicMACmap entry contains additional fields such as ipRef, nameservers, online, and so on.

  • osMetadataNetwork

    Configuration of the host OS metadata network. This configuration is used by the cloud-init tool and applies to the primary NIC only. It is added when the IP address is allocated and the ipAllocationResult status is OK.

  • versionIpam

    IPAM version used during the last update of the object.

Configuration example:

status:
  ipAllocationResult: OK
  l2RenderResult: There are no available L2Templates
  lastUpdated: "2020-09-16T11:02:39Z"
  nicMACmap:
  - mac: 0C:C4:7A:1E:A9:5C
    name: ens11f0
  - gateway: 172.16.48.1
    ip: 172.16.48.200/24
    ipRef: default/auto-0c-c4-7a-a8-d3-44
    mac: 0C:C4:7A:1E:A9:5D
    name: ens11f1
    nameservers:
    - 172.18.176.6
    online: true
    primary: true
  osMetadataNetwork:
    links:
    - ethernet_mac_address: 0C:C4:7A:A8:D3:44
      id: enp8s0f0
      type: phy
    networks:
    - ip_address: 172.16.48.200
      link: enp8s0f0
      netmask: 255.255.255.0
      routes:
      - gateway: 172.16.48.1
        netmask: 0.0.0.0
        network: 0.0.0.0
      type: ipv4
    services:
    - address: 172.18.176.6
      type: dns
  versionIpam: v3.0.999-20200807-130909-44151f8

Subnet

This section describes the Subnet resource used in Mirantis Container Cloud API to allocate IP addresses for the cluster nodes.

For demonstration purposes, the Container Cloud Subnet custom resource (CR) can be split into the following major sections:

Subnet metadata

The Container Cloud Subnet CR contains the following fields:

  • apiVersion

    API version of the object that is ipam.mirantis.com/v1alpha1.

  • kind

    Object type that is Subnet

  • metadata

    This field contains the following subfields:

    • name

      Name of the Subnet object.

    • namespace

      Project in which the Subnet object was created.

    • labels

      Key-value pairs that are attached to the object:

      • ipam/DefaultSubnet: "1"

        Indicates that the subnet was automatically created for the PXE network. The subnet with this label is unique for a specific region and global for all clusters and projects in the region.

      • ipam/UID

        Unique ID of a subnet.

      • kaas.mirantis.com/provider

        Provider type.

      • kaas.mirantis.com/region

        Region name.

Configuration example:

apiVersion: ipam.mirantis.com/v1alpha1
kind: Subnet
metadata:
  name: kaas-mgmt
  namespace: default
  labels:
    ipam/DefaultSubnet: "1"
    ipam/UID: 1bae269c-c507-4404-b534-2c135edaebf5
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
Subnet spec

The spec field of the Subnet resource describes the desired state of a subnet. It contains the following fields:

  • cidr

    A valid IPv4 CIDR, for example, 10.11.0.0/24.

  • gateway

    A valid gateway address, for example, 10.11.0.9.

  • includeRanges

    A list of IP address ranges within the given CIDR that should be used in the allocation of IPs for nodes. The gateway, network, broadcast, and DNS addresses will be excluded (protected) automatically if they intersect with one of the ranges. The IPs outside the given ranges will not be used in the allocation. Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77. The includeRanges parameter is mutually exclusive with excludeRanges. For an includeRanges example, see the sketch after the configuration example below.

  • excludeRanges

    A list of IP address ranges within the given CIDR that should not be used in the allocation of IPs for nodes. The IPs within the given CIDR but outside the given ranges will be used in the allocation. The gateway, network, broadcast, and DNS addresses will be excluded (protected) automatically if they are included in the CIDR. Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77. The excludeRanges parameter is mutually exclusive with includeRanges.

  • useWholeCidr

    If set to false (default), the subnet address and the broadcast address are excluded from the address allocation. If set to true, they are included in the address allocation for nodes.

  • nameservers

    The list of IP addresses of name servers. Each element of the list is a single address, for example, 172.18.176.6.

Configuration example:

spec:
  cidr: 172.16.48.0/24
  excludeRanges:
  - 172.16.48.99
  - 172.16.48.101-172.16.48.145
  gateway: 172.16.48.1
  nameservers:
  - 172.18.176.6
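
If you prefer to explicitly list the ranges that addresses are allocated from, use includeRanges instead of excludeRanges (the two parameters are mutually exclusive). A minimal sketch with assumed address values:

spec:
  cidr: 172.16.48.0/24
  includeRanges:
  # Example range; adjust to your environment.
  - 172.16.48.200-172.16.48.253
  gateway: 172.16.48.1
  nameservers:
  - 172.18.176.6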
Subnet status

The status field of the Subnet resource describes the actual state of a subnet. It contains the following fields:

  • allocatable

    The number of IP addresses that are available for allocation.

  • allocatedIPs

    The list of allocated IP addresses in the IP:<IPAddr object UID> format.

  • capacity

    The total number of IP addresses to be allocated, that is, the sum of allocatable and already allocated IP addresses.

  • cidr

    The IPv4 CIDR for a subnet.

  • gateway

    The gateway address for a subnet.

  • nameservers

    The list of IP addresses of name servers.

  • ranges

    The list of IP address ranges within the given CIDR that are used in the allocation of IPs for nodes.

Configuration example:

status:
  allocatable: 51
  allocatedIPs:
  - 172.16.48.200:24e94698-f726-11ea-a717-0242c0a85b02
  - 172.16.48.201:2bb62373-f726-11ea-a717-0242c0a85b02
  - 172.16.48.202:37806659-f726-11ea-a717-0242c0a85b02
  capacity: 54
  cidr: 172.16.48.0/24
  gateway: 172.16.48.1
  lastUpdate: "2020-09-15T12:27:58Z"
  nameservers:
  - 172.18.176.6
  ranges:
  - 172.16.48.200-172.16.48.253
  statusMessage: OK

SubnetPool

This section describes the SubnetPool resource used in Mirantis Container Cloud API to manage a pool of addresses from which subnets can be allocated.

For demonstration purposes, the Container Cloud SubnetPool custom resource (CR) is split into the following major sections:

SubnetPool metadata

The Container Cloud SubnetPool CR contains the following fields:

  • apiVersion

    API version of the object that is ipam.mirantis.com/v1alpha1.

  • kind

    Object type that is SubnetPool.

  • metadata

    The metadata field contains the following subfields:

    • name

      Name of the SubnetPool object.

    • namespace

      Project in which the SubnetPool object was created.

    • labels

      Key-value pairs that are attached to the object:

      • kaas.mirantis.com/provider

        Provider type that is baremetal.

      • kaas.mirantis.com/region

        Region name.

Configuration example:

apiVersion: ipam.mirantis.com/v1alpha1
kind: SubnetPool
metadata:
  name: kaas-mgmt
  namespace: default
  labels:
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
SubnetPool spec

The spec field of the SubnetPool resource describes the desired state of a subnet pool. It contains the following fields:

  • cidr

    Valid IPv4 CIDR. For example, 10.10.0.0/16.

  • blockSize

    IP address block size to use when assigning an IP address block to every new child Subnet object. For example, if you set /25, every new child Subnet will have 128 IPs to allocate. Possible values are from /29 to the cidr size. Immutable.

  • nameservers

    Optional. List of IP addresses of name servers to use for every new child Subnet object. Each element of the list is a single address, for example, 172.18.176.6. Default: empty.

  • gatewayPolicy

    Optional. Method of assigning a gateway address to new child Subnet objects. Default: none. Possible values are:

    • first - first IP of the IP address block assigned to a child Subnet, for example, 10.11.10.1.

    • last - last IP of the IP address block assigned to a child Subnet, for example, 10.11.10.254.

    • none - no gateway address.

Configuration example:

spec:
  cidr: 10.10.0.0/16
  blockSize: /25
  nameservers:
  - 172.18.176.6
  gatewayPolicy: first
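
For example, with the configuration above, each new child Subnet object receives a /25 block of 128 IP addresses from 10.10.0.0/16, and gatewayPolicy: first sets its gateway to the first IP of that block. A child Subnet that receives the 10.10.0.0/25 block would therefore get the gateway 10.10.0.1. The specific block shown here is an assumption for illustration; the order in which blocks are assigned is not described in this section.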
SubnetPool status

The status field of the SubnetPool resource describes the actual state of a subnet pool. It contains the following fields:

  • statusMessage

    Message that reflects the current status of the SubnetPool resource. Possible values are:

    • OK - a subnet pool is active.

    • ERR: <error message> - a subnet pool is in the Failure state.

    • TERM - a subnet pool is terminating.

  • allocatedSubnets

    List of allocated subnets. Each subnet has the <CIDR>:<SUBNET_UID> format.

  • blockSize

    Block size to use for IP address assignments from the defined pool.

  • capacity

    Total number of IP addresses to be allocated. Includes the number of allocatable and already allocated IP addresses.

  • allocatable

    Number of subnets of the blockSize size that are still available for allocation from the pool.

  • lastUpdate

    Date and time of the last SubnetPool status update.

  • versionIpam

    IPAM version used during the last object update.

Example:

status:
  allocatedSubnets:
  - 10.10.0.0/24:0272bfa9-19de-11eb-b591-0242ac110002
  blockSize: /24
  capacity: 54
  allocatable: 51
  lastUpdate: "2020-09-15T08:30:08Z"
  versionIpam: v3.0.999-20200807-130909-44151f8
  statusMessage: OK

IPaddr

This section describes the IPaddr resource used in Mirantis Container Cloud API. The IPAddr object describes an IP address and contains all information about the associated MAC address.

For demonstration purposes, the Container Cloud IPaddr custom resource (CR) is split into the following major sections:

IPaddr metadata

The Container Cloud IPaddr CR contains the following fields:

  • apiVersion

    API version of the object that is ipam.mirantis.com/v1alpha1

  • kind

    Object type that is IPaddr

  • metadata

    The metadata field contains the following subfields:

    • name

      Name of the IPaddr object in the auto-XX-XX-XX-XX-XX-XX format where XX-XX-XX-XX-XX-XX is the associated MAC address

    • namespace

      Project in which the IPaddr object was created

    • labels

      Key-value pairs that are attached to the object:

      • ipam/IP

        IPv4 address

      • ipam/IpamHostID

        Unique ID of the associated IpamHost object

      • ipam/MAC

        MAC address

      • ipam/SubnetID

        Unique ID of the Subnet object

      • ipam/UID

        Unique ID of the IPAddr object

Configuration example:

apiVersion: ipam.mirantis.com/v1alpha1
kind: IPaddr
metadata:
  name: auto-0c-c4-7a-a8-b8-18
  namespace: default
  labels:
    ipam/IP: 172.16.48.201
    ipam/IpamHostID: 848b59cf-f804-11ea-88c8-0242c0a85b02
    ipam/MAC: 0C-C4-7A-A8-B8-18
    ipam/SubnetID: 572b38de-f803-11ea-88c8-0242c0a85b02
    ipam/UID: 84925cac-f804-11ea-88c8-0242c0a85b02
IPAddr spec

The spec object field of the IPAddr resource contains the associated MAC address and the reference to the Subnet object:

  • mac

    MAC address in the XX:XX:XX:XX:XX:XX format

  • subnetRef

    Reference to the Subnet resource in the <subnetProjectName>/<subnetName> format

Configuration example:

spec:
  mac: 0C:C4:7A:A8:B8:18
  subnetRef: default/kaas-mgmt
IPAddr status

The status object field of the IPAddr resource reflects the actual state of the IPAddr object. It contains the following fields:

  • address

    IP address.

  • cidr

    IPv4 CIDR for the Subnet.

  • gateway

    Gateway address for the Subnet.

  • lastUpdate

    Date and time of the last IPAddr status update.

  • mac

    MAC address in the XX:XX:XX:XX:XX:XX format.

  • nameservers

    List of the IP addresses of name servers of the Subnet. Each element of the list is a single address, for example, 172.18.176.6.

  • phase

    Current phase of the IP address. Possible values: Active, Failed, or Terminating.

  • versionIpam

    IPAM version used during the last update of the object.

Configuration example:

status:
  address: 172.16.48.201
  cidr: 172.16.48.201/24
  gateway: 172.16.48.1
  lastUpdate: "2020-09-16T10:08:07Z"
  mac: 0C:C4:7A:A8:B8:18
  nameservers:
  - 172.18.176.6
  phase: Active
  versionIpam: v3.0.999-20200807-130909-44151f8

L2Template

This section describes the L2Template resource used in Mirantis Container Cloud API.

By default, Container Cloud configures a single interface on cluster nodes, leaving all other physical interfaces intact. With L2Template, you can create advanced host networking configurations for your clusters. For example, you can create bond interfaces on top of physical interfaces on the host.

For demonstration purposes, the Container Cloud L2Template custom resource (CR) is split into the following major sections:

L2Template metadata

The Container Cloud L2Template CR contains the following fields:

  • apiVersion

    API version of the object that is ipam.mirantis.com/v1alpha1.

  • kind

    Object type that is L2Template.

  • metadata

    The metadata field contains the following subfields:

    • name

      Name of the L2Template object.

    • namespace

      Project in which the L2Template object was created.

    • labels

      Key-value pairs that are attached to the object:

      • ipam/Cluster

        References the Cluster object name that this template is applied to. The process of selecting the L2Template object for a specific cluster is as follows:

        1. The kaas-ipam controller monitors the L2Template objects with the ipam/Cluster: <clusterName> label.

        2. The L2Template object with the ipam/Cluster: <clusterName> label is assigned to a cluster with Name: <clusterName>, if available. Otherwise, the default L2Template object with the ipam/Cluster: default label is assigned to the cluster. To list the templates available to a specific cluster, see the sketch after the configuration example below.

      • ipam/PreInstalledL2Template: "1"

        Indicates that the current L2Template object was preinstalled. Represents L2 templates that are automatically copied to a project once it is created. Once the L2 templates are copied, the ipam/PreInstalledL2Template label is removed. This label is set automatically and cannot be configured manually.

      • ipam/UID

        Unique ID of an object.

      • kaas.mirantis.com/provider

        Provider type.

      • kaas.mirantis.com/region

        Region name.

Configuration example:

apiVersion: ipam.mirantis.com/v1alpha1
kind: L2Template
metadata:
  name: l2template-test
  namespace: default
  labels:
    ipam/Cluster: test
    ipam/PreInstalledL2Template: "1"
    ipam/UID: a74f1e2b-4a09-4b90-bbce-6b78fcef2ba5
    kaas.mirantis.com/provider: baremetal
    kaas.mirantis.com/region: region-one
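
To check which L2 templates can be selected for a specific cluster, you can filter the objects by the ipam/Cluster label. A minimal sketch, assuming the l2template resource name resolves to this object type:

# Assumes the l2template resource name; "test" is the cluster name from the example above.
kubectl -n default get l2template -l ipam/Cluster=test
kubectl -n default get l2template -l ipam/Cluster=default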
L2Template configuration

The spec field of the L2Template resource describes the desired state of the object. It contains the following fields:

  • clusterRef

    The Cluster object that this template is applied to. The default value is used to apply the given template to all clusters unless an L2 template that references a specific cluster name exists.

    Caution

    • A cluster can be associated with only one template.

    • An L2 template must have the same namespace as the referenced cluster.

    • A project can have only one default L2 template.

  • ifMapping

    The list of interface names for the template. The interface mapping is defined globally for all bare metal hosts in the cluster but can be overridden at the host level, if required, by editing the IpamHost object for a particular host. The ifMapping parameter is mutually exclusive with autoIfMappingPrio.

  • autoIfMappingPrio

    The list of prefixes, such as eno, ens, and so on, that are used to match host interfaces and automatically build the interface list for the template. The generated mapping can be overridden at the host level using ifMappingOverride in the corresponding IpamHost spec. The autoIfMappingPrio parameter is mutually exclusive with ifMapping.

  • npTemplate

    A netplan-compatible configuration with special lookup functions that defines the networking settings for the cluster hosts, where physical NIC names and details are parameterized. This configuration is processed using Go templates. Instead of specifying IP and MAC addresses, interface names, and other network details specific to a particular host, the template supports the use of special lookup functions. These lookup functions, such as nic, mac, ip, and so on, return host-specific network information when the template is rendered for a particular host.

    Caution

    All rules and restrictions of the netplan configuration also apply to L2 templates. For details, see the official netplan documentation.

Configuration example:

spec:
  autoIfMappingPrio:
  - provision
  - eno
  - ens
  - enp
  l3Layout: null
  npTemplate: |
   version: 2
   ethernets:
     {{nic 0}}:
       dhcp4: false
       dhcp6: false
       addresses:
         - {{ip "0:kaas-mgmt"}}
       gateway4: {{gateway_from_subnet "kaas-mgmt"}}
       nameservers:
         addresses: {{nameservers_from_subnet "kaas-mgmt"}}
       match:
         macaddress: {{mac 0}}
       set-name: {{nic 0}}
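
The previous example configures a single interface. To illustrate the bond use case mentioned at the beginning of this section, the following npTemplate fragment is a minimal sketch of a bond on top of two physical NICs. The bond-specific lookup argument in {{ip "bond0:kaas-mgmt"}} and the use of a second NIC index are assumptions for illustration only; all netplan bonding rules and restrictions apply.

spec:
  npTemplate: |
    version: 2
    ethernets:
      {{nic 0}}:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 0}}
        set-name: {{nic 0}}
      {{nic 1}}:
        dhcp4: false
        dhcp6: false
        match:
          macaddress: {{mac 1}}
        set-name: {{nic 1}}
    bonds:
      bond0:
        interfaces:
        - {{nic 0}}
        - {{nic 1}}
        addresses:
        # The "bond0:kaas-mgmt" lookup argument is an assumption for illustration.
        - {{ip "bond0:kaas-mgmt"}}
        gateway4: {{gateway_from_subnet "kaas-mgmt"}}
        nameservers:
          addresses: {{nameservers_from_subnet "kaas-mgmt"}}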
L2Template status

The status field of the L2Template resource reflects the actual state of the L2Template object and contains the following fields:

  • phase

    Current phase of the L2Template object. Possible values: Ready, Failed, or Terminating.

  • reason

    Detailed error message in case L2Template has the Failed status.

  • lastUpdate

    Date and time of the last L2Template status update.

  • versionIpam

    IPAM version used during the last update of the object.

Configuration example:

status:
  lastUpdate: "2020-09-15T08:30:08Z"
  phase: Failed
  reason: The kaas-mgmt subnet in the terminating state.
  versionIpam: v3.0.999-20200807-130909-44151f8