
Docker Enterprise Container Cloud Operations Guide

Preface

This documentation provides information on how to use Mirantis products to deploy cloud environments. The information is for reference purposes and is subject to change.

Intended audience

This documentation assumes that the reader is familiar with network and cloud concepts and is intended for the following users:

  • Infrastructure Operator

    • Is a member of the IT operations team

    • Has working knowledge of Linux, virtualization, Kubernetes API and CLI, and OpenStack to support the application development team

    • Accesses Docker Enterprise (DE) Container Cloud and Kubernetes through a local machine or web UI

    • Provides verified artifacts through a central repository to the Tenant DevOps engineers

  • Tenant DevOps engineer

    • Is a member of the application development team and reports to the line of business (LOB)

    • Has working knowledge of Linux, virtualization, Kubernetes API and CLI to support application owners

    • Accesses DE Container Cloud and Kubernetes through a local machine or web UI

    • Consumes artifacts from a central repository approved by the Infrastructure Operator

Documentation history

The documentation set refers to DE Container Cloud GA as the latest released GA version of the product. For details about the DE Container Cloud GA minor release dates, refer to DE Container Cloud releases.

Create and operate a managed cluster

Note

This tutorial applies only to Docker Enterprise Container Cloud web UI users with the writer access role assigned by the Infrastructure Operator. To add a bare metal host, the operator access role is also required.

After you deploy the Docker Enterprise (DE) Container Cloud management cluster, you can start creating managed clusters that will be based on the same cloud provider type that you have for the management cluster: OpenStack, AWS, or bare metal.

The deployment procedure is performed using the DE Container Cloud web UI and comprises the following steps:

  1. Create an initial cluster configuration depending on the provider type.

  2. For a baremetal-based managed cluster, create and configure bare metal hosts with corresponding labels for machines such as worker, manager, or storage.

  3. Add the required number of machines with the corresponding configuration to the managed cluster.

  4. For a baremetal-based managed cluster, add a Ceph cluster.

Create and operate a baremetal-based managed cluster

After bootstrapping your baremetal-based Docker Enterprise (DE) Container Cloud management cluster as described in Deployment Guide: Deploy a baremetal-based management cluster, you can start creating baremetal-based managed clusters using the DE Container Cloud web UI.

Prepare L2 templates

By default, Docker Enterprise (DE) Container Cloud configures a single interface on the cluster nodes, leaving all other physical interfaces intact.

With L2 networking templates, you can create advanced host networking configurations for your clusters. For example, you can create bond interfaces on top of physical interfaces on the host.

Follow the procedures below to create L2 templates for your managed clusters.

Create subnets

Before creating an L2 template, ensure that you have the required subnets that can be used in the L2 template to allocate IP addresses for the managed cluster nodes. Where required, create a number of subnets for a particular project using the Subnet CR. A subnet has two logical scopes:

  • global - CR uses the default namespace. A subnet can be used for any cluster located in any project.

  • namespaced - CR uses the namespace that corresponds to a particular project where managed clusters are located. A subnet can be used for any cluster located in the same project.

You can have subnets with the same name in different projects and different scopes. In this case, the subnet that has the same project as the cluster will be used. One L2 template may reference several subnets, and those subnets may have different scopes.

The IP address objects (IPaddr CR) that are allocated from subnets always have the same project as their corresponding IpamHost objects, regardless of the subnet scope.
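
To check which subnets already exist in each scope, you can list the Subnet objects with kubectl. This is a minimal sketch; it assumes that the Subnet custom resource is served under the subnets resource name and that you substitute the kubeconfig path and project name:

    # Global subnets are created in the default namespace
    kubectl --kubeconfig <pathToManagementClusterKubeconfig> get subnets -n default

    # Namespaced subnets are created in the namespace of a particular project
    kubectl --kubeconfig <pathToManagementClusterKubeconfig> get subnets -n <projectName>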

To create subnets for an L2 template:

  1. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  2. Create the subnet.yaml file with the required number of global or namespaced subnets and apply it to the management cluster:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> apply -f <SubnetFileName.yaml>
    

    Note

    In the command above and in the steps below, substitute the parameters enclosed in angle brackets with the corresponding values.

    Example of a subnet.yaml file:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      finalizers:
      - finalizer.ipam.mirantis.com
      name: demo
      namespace: demo-namespace
    spec:
      cidr: 10.11.0.0/24
      gateway: 10.11.0.9
      includeRanges:
      - 10.11.0.5-10.11.0.70
      nameservers:
      - 172.18.176.6
    
    The subnet.yaml file parameters

    Parameter

    Description

    cidr (singular)

    A valid IPv4 CIDR, for example, 10.11.0.0/24.

    includeRanges (list)

    A list of IP address ranges within the given CIDR that should be used in the allocation of IPs for nodes (excluding the gateway address). The IPs outside the given ranges will not be used in the allocation. Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77. In the example above, the addresses 10.11.0.5-10.11.0.70 (excluding the gateway address 10.11.0.9) will be allocated for nodes. The includeRanges parameter is mutually exclusive with excludeRanges.

    excludeRanges (list)

    A list of IP address ranges within the given CIDR that should not be used in the allocation of IPs for nodes. The IPs within the given CIDR but outside the given ranges will be used in the allocation (excluding the gateway address). Each element of the list can be either an interval 10.11.0.5-10.11.0.70 or a single address 10.11.0.77. The excludeRanges parameter is mutually exclusive with includeRanges. For an example, see the subnet sketch after this table.

    useWholeCidr (boolean)

    If set to true, the subnet address (10.11.0.0 in the example above) and the broadcast address (10.11.0.255 in the example above) are included into the address allocation for nodes. Otherwise (false by default), the subnet and broadcast addresses are excluded from the address allocation.

    gateway (singular)

    A valid gateway address, for example, 10.11.0.9.

    nameservers (list)

    A list of the IP addresses of name servers. Each element of the list is a single address, for example, 172.18.176.6.
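
    For comparison, below is a minimal sketch of a global-scope subnet that uses excludeRanges instead of includeRanges. The name, CIDR, and address values are illustrative only:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: Subnet
    metadata:
      name: demo-global
      namespace: default
    spec:
      cidr: 10.12.0.0/24
      gateway: 10.12.0.1
      excludeRanges:
      - 10.12.0.1-10.12.0.10
      - 10.12.0.200
      nameservers:
      - 172.18.176.6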

    Caution

    The subnet for the PXE network is automatically created during deployment and must contain the ipam/DefaultSubnet: "1" label. The subnet with this label is unique and global for all clusters and projects.

  3. Proceed to creating an L2 template for one or multiple managed clusters as described in Create L2 templates.

Create L2 templates

After you create subnets for one or more managed clusters or projects as described in Create subnets, follow the procedure below, which contains an example L2 template, to create L2 templates for your managed clusters.

To create an L2 template for a new managed cluster:

  1. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  2. Create an L2 YAML template specific to your deployment using the exemplary template below.

    Example of an L2 template:

    apiVersion: ipam.mirantis.com/v1alpha1
    kind: L2Template
    metadata:
      name: test-managed
      namespace: managed-ns
    spec:
      clusterRef: child-cluster
      autoIfMappingPrio:
        - provision
        - eno
        - ens
        - enp
      npTemplate: |
        version: 2
        ethernets:
          onboard1gbe0:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 0}}
            set-name: {{nic 0}}
          onboard1gbe1:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 1}}
            set-name: {{nic 1}}
          ten10gbe0s0:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 2}}
            set-name: {{nic 2}}
          ten10gbe0s1:
            dhcp4: false
            dhcp6: false
            match:
              macaddress: {{mac 3}}
            set-name: {{nic 3}}
        bonds:
          bond0:
            interfaces:
              - ten10gbe0s0
              - ten10gbe0s1
        vlans:
          kaas-lcm-vlan:
            id: 101
            link: onboard1gbe1
          k8s-api-vlan:
            id: 102
            link: onboard1gbe1
          k8s-ext-vlan:
            id: 103
            link: onboard1gbe1
          k8s-pods-vlan:
            id: 104
            link: bond0
          bm-ceph-vlan:
            id: 105
            link: bond0
        bridges:
          bm-pxe:
            interfaces: [onboard1gbe0]
            addresses:
              - {{ip "bm-pxe:pxe-subnet"}}
            gateway4: {{gateway_from_subnet "pxe-subnet"}}
            nameservers:
              addresses: {{nameservers_from_subnet "pxe-subnet"}}
          kaas-lcm:
            interfaces: [kaas-lcm-vlan]
            addresses:
              - {{ip "kaas-lcm:lcm-subnet"}}
          k8s-api:
            interfaces: [k8s-api-vlan]
            addresses:
              - {{ip "k8s-api:k8sapi-subnet"}}
          k8s-ext:
            interfaces: [k8s-ext-vlan]
            addresses:
              - {{ip "k8s-ext:k8s-ext-subnet"}}
          k8s-pods:
            interfaces: [k8s-pods-vlan]
            addresses:
              - {{ip "k8s-pods:pods-subnet"}}
          bm-ceph:
            interfaces: [bm-ceph-vlan]
            addresses:
              - {{ip "bm-ceph:ceph-subnet"}}
    
  3. Add or edit the mandatory parameters in the new L2 template using the table below:

    L2 template mandatory parameters

    Parameter

    Description

    clusterRef

    References the Cluster object that this template is applied to. The default value is used to apply the given template to all clusters in the corresponding project, unless an L2 template that references a specific cluster name exists.

    Caution

    • A cluster can be associated with only one template.

    • A project can have only one default L2 template.

    ifMapping or autoIfMappingPrio

    • ifMapping is a list of interface names for the template. The interface mapping is defined globally for all bare metal hosts in the cluster but can be overridden at the host level, if required, by editing the IpamHost object for a particular host.

    • autoIfMappingPrio is a list of prefixes, such as eno, ens, and so on, to match the interfaces to automatically create a list for the template. If you are not aware of any specific ordering of interfaces on the nodes, use the default ordering from the Predictable Network Interface Names specification for systemd. You can also override the default NIC list per host using the IfMappingOverride parameter of the corresponding IpamHost. The provision value corresponds to the network interface that was used to provision a node. Usually, it is the first NIC found on a particular node. It is defined explicitly to ensure that this interface will not be reconfigured accidentally.

    npTemplate

    A netplan-compatible configuration with special lookup functions that defines the networking settings for the cluster hosts, where physical NIC names and details are parameterized. This configuration will be processed using Go templates. Instead of specifying IP and MAC addresses, interface names, and other network details specific to a particular host, the template supports use of special lookup functions. These lookup functions, such as nic, mac, ip, and so on, return host-specific network information when the template is rendered for a particular host. For details about netplan, see netplan documentation.

    Main lookup functions for an L2 template

    Lookup function

    Description

    {{nic N}}

    Name of a NIC number N. NIC numbers correspond to the interface mapping list.

    {{mac N}}

    MAC address of a NIC number N registered during a host hardware inspection.

    {{ip "N:subnet-a"}}

    IP address and mask for a NIC number N. The address will be auto-allocated from the given subnet if the address does not exist yet.

    {{ip "br0:subnet-x"}}

    IP address and mask for a virtual interface, "br0" in this example. The address will be auto-allocated from the given subnet if the address does not exist yet.

    {{gateway_from_subnet "subnet-a"}}

    IPv4 default gateway address from the given subnet.

    {{nameservers_from_subnet "subnet-a"}}

    List of the IP addresses of name servers from the given subnet.

    Note

    Every subnet referenced in an L2 template can have either a global or namespaced scope. In the latter case, the subnet must exist in the same project where the corresponding cluster and L2 template are located.

  4. Add the L2 template to your management cluster:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> apply -f <pathToL2TemplateYamlFile>
    
  5. If required, further modify the template:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    -n <ProjectNameForNewManagedCluster> edit l2template <L2templateName>
    
  6. Proceed with creating a managed cluster as described in Create a managed cluster. The L2 template edited in previous steps will be used to render the netplan configuration for the managed cluster machines.


The workflow of the netplan configuration using an L2 template is as follows:

  1. The kaas-ipam service uses the data from BareMetalHost, the L2 template, and subnets to generate the netplan configuration for every cluster machine.

  2. The generated netplan configuration is saved in the status.netconfigV2 section of the IpamHost resource. If the status.l2RenderResult field of the IpamHost resource is OK, the configuration is successful. Otherwise, the status contains an error message. For commands that verify these fields, see the example after this workflow.

  3. The baremetal-provider service copies data from the status.netconfigV2 of IpamHost to the Spec.StateItemsOverwrites['deploy']['bm_ipam_netconfigv2'] parameter of LCMMachine.

  4. The lcm-agent service on every host synchronizes the LCMMachine data to its host. The lcm-agent service runs a playbook to update the netplan configuration on the host during the deploy phase.
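
To verify the rendering results described in the workflow above, you can query the L2 template and IpamHost objects of the project. This is a minimal sketch; the l2template resource name is taken from the editing command earlier in this section, while the ipamhosts resource name is an assumption that may differ between releases:

    # List the L2 templates and IpamHost objects of the project
    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    -n <ProjectNameForNewManagedCluster> get l2template

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    -n <ProjectNameForNewManagedCluster> get ipamhosts

    # Check the rendering result of a particular host
    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    -n <ProjectNameForNewManagedCluster> get ipamhost <ipamHostName> \
    -o jsonpath='{.status.l2RenderResult}{"\n"}'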

Create a custom bare metal host profile

The bare metal host profile is a Kubernetes custom resource. It allows the operator to define how the storage devices and the operating system are provisioned and configured.

This section describes the bare metal host profile default settings and configuration of custom profiles for managed clusters using Docker Enterprise (DE) Container Cloud API. This procedure also applies to a management cluster with a few differences described in Deployment Guide: Customize the default bare metal host profile.

Default configuration of the host system storage

The default host profile requires three storage devices in the following strict order:

  1. Boot device and operating system storage

    This device contains boot data and operating system data. It is partitioned using GUID Partition Table (GPT) labels. The root file system is an ext4 file system created on top of an LVM logical volume. For a detailed layout, refer to the default storage configuration below.

  2. Local volumes device

    This device contains an ext4 file system with directories mounted as persistent volumes to Kubernetes. These volumes are used by the DE Container Cloud services to store their data, including the monitoring and identity databases.

  3. Ceph storage device

    This device is used as a Ceph datastore or OSD.

The following table summarizes the default configuration of the host system storage set up by the Docker Enterprise (DE) Container Cloud bare metal management.

Default configuration of the bare metal host storage:

  • /dev/sda1 - bios_grub, 4 MiB. The mandatory GRUB boot partition required for non-UEFI systems.

  • /dev/sda2 - UEFI, mounted at /boot/efi, 0.2 GiB. The boot partition required for the UEFI boot mode.

  • /dev/sda3 - config-2, 64 MiB. The mandatory partition for the cloud-init configuration. Used during the first host boot for initial configuration.

  • /dev/sda4 - lvm_root_part, 100% of the remaining free space on the device. The main LVM physical volume that is used to create the root file system.

  • /dev/sdb - lvm_lvp_part, mounted at /mnt/local-volumes, 100% of the device. The LVM physical volume that is used to create the file system for LocalVolumeProvisioner.

  • /dev/sdc - no partitions, 100% of the device. A clean raw disk that will be used for the Ceph storage back end.

If required, you can customize the default host storage configuration. For details, see Create a custom host profile.

Create a custom host profile

In addition to the default BareMetalHostProfile object installed with Docker Enterprise (DE) Container Cloud, you can create custom profiles for managed clusters using DE Container Cloud API.

Note

The procedure below also applies to the DE Container Cloud management clusters.

To create a custom bare metal host profile:

  1. Select from the following options:

    • For a management cluster, log in to the bare metal seed node that will be used to bootstrap the management cluster.

    • For a managed cluster, log in to the local machine where your management cluster kubeconfig is located and where kubectl is installed.

      Note

      The management cluster kubeconfig is created automatically during the last stage of the management cluster bootstrap.

  2. Select from the following options:

    • For a management cluster, open templates/bm/baremetalhostprofiles.yaml.template for editing.

    • For a managed cluster, create a new bare metal host profile under the templates/bm/ directory.

  3. Edit the host profile using the example template below to meet your hardware configuration requirements:

    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHostProfile
    metadata:
      name: <PROFILE_NAME>
      namespace: <PROJECT_NAME>
    spec:
      devices:
      # From the HW node, obtain the first device whose size is at least 60 GiB
      - device:
          minSizeGiB: 60
          wipe: true
        partitions:
        - name: bios_grub
          partflags:
          - bios_grub
          sizeGiB: 0.00390625
          wipe: true
        - name: uefi
          partflags:
          - esp
          sizeGiB: 0.2
          wipe: true
        - name: config-2
          sizeGiB: 0.0625
          wipe: true
        - name: lvm_root_part
          sizeGiB: 0
          wipe: true
      # From the HW node, obtain the second device whose size is at least 30 GiB
      # If a device exists but does not fit the size,
      # the BareMetalHostProfile will not be applied to the node
      - device:
          minSizeGiB: 30
          wipe: true
      # From the HW node, obtain the disk device with the exact name
      - device:
          byName: /dev/nvme0n1
          minSizeGiB: 30
          wipe: true
        partitions:
        - name: lvm_lvp_part
          sizeGiB: 0
          wipe: true
      # Example of wiping a device without partitioning it.
      # Mandatory for the case when a disk is supposed to be used for Ceph back end
      # later
      - device:
          byName: /dev/sde
          wipe: true
      fileSystems:
      - fileSystem: vfat
        partition: config-2
      - fileSystem: vfat
        mountPoint: /boot/efi
        partition: uefi
      - fileSystem: ext4
        logicalVolume: root
        mountPoint: /
      - fileSystem: ext4
        logicalVolume: lvp
        mountPoint: /mnt/local-volumes/
      grubConfig:
        defaultGrubOptions:
        - GRUB_DISABLE_RECOVERY="true"
        - GRUB_PRELOAD_MODULES=lvm
        - GRUB_TIMEOUT=20
      logicalVolumes:
      - name: root
        sizeGiB: 0
        vg: lvm_root
      - name: lvp
        sizeGiB: 0
        vg: lvm_lvp
      postDeployScript: |
        #!/bin/bash -ex
        echo $(date) 'post_deploy_script done' >> /root/post_deploy_done
      preDeployScript: |
        #!/bin/bash -ex
        echo $(date) 'pre_deploy_script done' >> /root/pre_deploy_done
      volumeGroups:
      - devices:
        - partition: lvm_root_part
        name: lvm_root
      - devices:
        - partition: lvm_lvp_part
        name: lvm_lvp
    
  4. Add or edit the mandatory parameters in the new BareMetalHostProfile object using the table below:

    Bare metal host profile parameters

    Parameter

    Description

    devices

    List of definitions of the physical storage devices. To configure more than three storage devices per host, add the additional devices to this list (see the example fragment after this table). Each device in the list may have one or more partitions defined by a list in the partitions field.

    volumeGroups

    List of definitions of LVM volume groups. Each volume group contains one or more devices or partitions from the devices list.

    logicalVolumes

    List of LVM logical volumes. Every logical volume belongs to a volume group from the volumeGroups list and has the sizeGiB attribute that defines its size in GiB.

    fileSystems

    List of file systems. Each file system can be created on top of either a device, a partition, or a logical volume. If more file systems are required for additional devices, define them in this field.

    preDeployScript

    Shell script that is executed on a host before provisioning the target operating system inside the ramfs system.

    postDeployScript

    Shell script that is executed on a host after deploying the operating system inside the ramfs system that is chrooted to the target operating system.

    grubConfig

    List of options passed to the Linux GRUB bootloader. Each string in the list defines one parameter.
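
    For example, to add a fourth storage device for extra local volumes, extend the devices, volumeGroups, logicalVolumes, and fileSystems lists of the profile above. The following fragment is a sketch only; the /dev/sdd device, the lvm_extra names, and the mount point are hypothetical and must be adjusted to your hardware:

    # Hypothetical entries to merge into the corresponding lists of the profile
    devices:
    - device:
        byName: /dev/sdd
        minSizeGiB: 30
        wipe: true
      partitions:
      - name: lvm_extra_part
        sizeGiB: 0
        wipe: true
    volumeGroups:
    - devices:
      - partition: lvm_extra_part
      name: lvm_extra
    logicalVolumes:
    - name: extra
      sizeGiB: 0
      vg: lvm_extra
    fileSystems:
    - fileSystem: ext4
      logicalVolume: extra
      mountPoint: /mnt/extra-volumes/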

  5. Select from the following options:

    • For a management cluster, proceed with the cluster bootstrap procedure as described in Deployment Guide: Bootstrap a management cluster.

    • For a managed cluster:

      1. Add the bare metal host profile to your management cluster:

        kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <PROJECT_NAME> apply -f <pathToBareMetalHostProfileFile>
        
      2. If required, further modify the host profile:

        kubectl --kubeconfig <pathToManagementClusterKubeconfig> -n <PROJECT_NAME> edit baremetalhostprofile <PROFILE_NAME_HERE>
        
      3. Proceed with creating a managed cluster as described in Create a managed cluster.

Create a managed cluster

This section instructs you on how to configure and deploy a managed cluster that is based on the baremetal-based management cluster through the Docker Enterprise (DE) Container Cloud web UI.

To create a managed cluster on bare metal:

  1. Recommended. Verify that you have successfully configured an L2 template for a new cluster as described in Prepare L2 templates. You may skip this step if you do not require L2 separation for network traffic.

  2. Optional. Create a custom bare metal host profile depending on your needs as described in Create a custom bare metal host profile.

  3. Log in to the DE Container Cloud web UI with the writer permissions.

  4. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  5. In the SSH keys tab, click Add SSH Key to upload the public SSH key that will be used for the SSH access to VMs.

  6. In the Clusters tab, click Create Cluster.

  7. Configure the new cluster in the Create New Cluster wizard that opens:

    1. Define general and Kubernetes parameters:

      Create new cluster: General, Provider, and Kubernetes

      Section

      Parameter name

      Description

      General settings

      Cluster name

      The cluster name.

      Provider

      Select Baremetal.

      Region

      From the drop-down list, select Baremetal.

      Release version

      The DE Container Cloud version.

      SSH keys

      From the drop-down list, select the SSH key name that you have previously added for SSH access to the bare metal hosts.

      Provider

      LB host IP

      The IP address of the load balancer endpoint that will be used to access the Kubernetes API of the new cluster. This IP address must be from the same subnet as used for DHCP in Metal³.

      LB address range

      The range of IP addresses that can be assigned to load balancers for Kubernetes Services by MetalLB.

      Kubernetes

      Node CIDR

      The Kubernetes worker nodes CIDR block. For example, 10.10.10.0/24.

      Services CIDR blocks

      The Kubernetes Services CIDR blocks. For example, 10.233.0.0/18.

      Pods CIDR blocks

      The Kubernetes pods CIDR blocks. For example, 10.233.64.0/18.

    2. Configure StackLight:

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable StackLight

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration or Configure StackLight.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

      Multiserver Mode

      Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see Reference Architecture: StackLight deployment architecture.

      Elasticsearch

      Retention Time

      The Elasticsearch logs retention period in Logstash.

      Persistent Volume Claim Size

      The Elasticsearch persistent volume claim size.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: Available StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  8. Click Create.

    To view the deployment status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the deployment is complete.

Now, proceed to Add a bare metal host.

Add a bare metal host

After you create a managed cluster as described in Create a managed cluster, proceed with adding a bare metal host through the Docker Enterprise (DE) Container Cloud web UI using the instruction below.

Before you proceed with adding a bare metal host, verify that the physical network on the server has been configured correctly. See Reference Architecture: Network fabric for details.

To add a bare metal host to a baremetal-based managed cluster:

  1. Log in to the DE Container Cloud web UI with the operator permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Baremetal tab, click Add BM host.

  4. Fill out the Add new BM host form as required:

    • Baremetal host name

      Specify the name of the new bare metal host.

    • Username

      Specify the name of the user for accessing the BMC (IPMI user).

    • Password

      Specify the password of the user for accessing the BMC (IPMI password).

    • Boot MAC address

      Specify the MAC address of the PXE network interface.

    • Address

      Specify the URL to access the BMC. Should start with https://.

    • Label

      Assign the machine label to the new host that defines which type of machine may be deployed on this bare metal host. Only one label can be assigned to a host. The supported labels are Manager, Worker, and Storage, matching the machine roles described in Add a machine.

  5. Click Create.

    While adding the bare metal host, DE Container Cloud discovers and inspects the hardware of the bare metal host and adds the results to BareMetalHost.status for future reference.
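
    To review the discovered hardware details, you can also inspect the BareMetalHost object from the command line. A minimal sketch, assuming that the host was created in the current project namespace:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    -n <PROJECT_NAME> get baremetalhosts

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    -n <PROJECT_NAME> describe baremetalhost <BareMetalHostName>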

Now, you can proceed to Add a machine.

Add a machine

After you add a bare metal host to the managed cluster as described in Add a bare metal host, you can create a Kubernetes machine in your cluster.

To add a Kubernetes machine to a baremetal-based managed cluster:

  1. Log in to the Docker Enterprise (DE) Container Cloud web UI with the operator or writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with the Machines list opens.

  4. Click the Create Machine button.

  5. Fill out the Create New Machine form as required:

    • Count

      Specify the number of machines to add.

    • Manager

      Select Manager or Worker to create a Kubernetes manager or worker node. The recommended minimum is three machines for manager node HA and two machines for the DE Container Cloud workloads.

    • BareMetal Host Label

      Assign the role to the new machine(s) to link the machine to a previously created bare metal host with the corresponding label. You can assign one role type per machine. The supported labels include:

      • Worker

        The default role for any node in a managed cluster. Only the kubelet service is running on the machines of this type.

      • Manager

        This node hosts the manager services of a managed cluster. For reliability reasons, DE Container Cloud does not permit running end-user workloads on the manager nodes or using them as storage nodes.

      • Storage

        This node is a worker node that also hosts Ceph OSD daemons and provides its disk resources to Ceph. DE Container Cloud permits end users to run workloads on storage nodes by default.

  6. Click Create.

At this point, DE Container Cloud adds the new machine object to the specified managed cluster, and the Bare Metal Operator controller creates the relation to a BareMetalHost with the labels matching the roles.

Provisioning of the newly created machine starts when the machine object is created and includes the following stages (you can monitor the progress as shown in the example after the list):

  1. Creation of partitions on the local disks as required by the operating system and the DE Container Cloud architecture.

  2. Configuration of the network interfaces on the host as required by the operating system and the DE Container Cloud architecture.

  3. Installation and configuration of the DE Container Cloud LCM agent.
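
To monitor these provisioning stages from the command line, you can watch the Machine and LCMMachine objects of the project. This is a minimal sketch; the lcmmachines resource name is an assumption based on the LCMMachine resource mentioned earlier, and the output columns may differ between releases:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    -n <PROJECT_NAME> get machines

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    -n <PROJECT_NAME> get lcmmachines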

Add a Ceph cluster

After you add machines to your new bare metal managed cluster as described in Add a machine, you can create a Ceph cluster on top of this managed cluster using the Docker Enterprise (DE) Container Cloud web UI.

The procedure below enables you to create a Ceph cluster with a minimum of three nodes that provides persistent volumes to the Kubernetes workloads in the managed cluster.

To create a Ceph cluster in the managed cluster:

  1. Log in to the DE Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The Cluster page with the Machines and Ceph clusters lists opens.

  4. In the Ceph Clusters block, click Create Cluster.

  5. Configure the Ceph cluster in the Create New Ceph Cluster wizard that opens:

    Create new Ceph cluster

    Section

    Parameter name

    Description

    General settings

    Name

    The Ceph cluster name.

    Cluster Network

    Replication network for Ceph OSDs.

    Public Network

    Public network for Ceph data.

    Enable OSDs LCM

    Select to enable LCM for Ceph OSDs.

    Machines / Machine #1-3

    Select machine

    Select the name of the Kubernetes machine that will host the corresponding Ceph node in the Ceph cluster.

    Manager, Monitor

    Select the required Ceph services to install on the Ceph node.

    Devices

    Select the disk that Ceph will use.

    Warning

    Do not select a device that is used for system services, for example, sda.

  6. To add more Ceph nodes to the new Ceph cluster, click + next to any Ceph Machine title in the Machines tab. Configure a Ceph node as required.

    Warning

    Do not add more than 3 Manager and/or Monitor services to the Ceph cluster.

  7. After you add and configure all nodes in your Ceph cluster, click Create.

Delete a managed cluster

Deleting a managed cluster does not require a preliminary deletion of the machines running on the cluster.

To delete a baremetal-based managed cluster:

  1. Log in to the Docker Enterprise (DE) Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required cluster and select Delete.

  4. Verify the list of machines to be removed. Confirm the deletion.

  5. Optional. If you do not plan to reuse the credentials of the deleted cluster, delete them:

    1. In the Credentials tab, click the Delete credential action icon next to the name of the credentials to be deleted.

    2. Confirm the deletion.

    Warning

    You can delete credentials only after deleting the managed cluster they relate to.

Deleting a cluster automatically frees up the resources allocated for this cluster, for example, instances, load balancers, networks, floating IPs, and so on.

Create and operate an OpenStack-based managed cluster

After bootstrapping your OpenStack-based Docker Enterprise (DE) Container Cloud management cluster as described in Deployment Guide: Deploy an OpenStack-based management cluster, you can create the OpenStack-based managed clusters using the DE Container Cloud web UI.

Create a managed cluster

This section describes how to create an OpenStack-based managed cluster using the Docker Enterprise (DE) Container Cloud web UI of the OpenStack-based management cluster.

To create an OpenStack-based managed cluster:

  1. Log in to the DE Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the SSH Keys tab, click Add SSH Key to upload the public SSH key that will be used for the OpenStack VMs creation.

  4. In the Credentials tab, click Add Credential to add your OpenStack credentials. You can either upload your OpenStack clouds.yaml configuration file or fill in the fields manually.

  5. In the Clusters tab, click Create Cluster and fill out the form with the following parameters as required:

    1. Configure general settings and the Kubernetes parameters:

      Managed cluster configuration

      Section

      Parameter

      Description

      General settings

      Name

      Cluster name

      Provider

      Select OpenStack

      Provider credential

      From the drop-down list, select the OpenStack credentials name that you created in the previous step.

      Release version

      The DE Container Cloud version.

      SSH keys

      From the drop-down list, select the SSH key name that you have previously added for SSH access to VMs.

      Provider

      External network

      Type of the external network in the OpenStack cloud provider.

      DNS name servers

      Comma-separated list of the DNS host IP addresses for the OpenStack VMs configuration.

      Kubernetes

      Node CIDR

      The Kubernetes nodes CIDR block. For example, 10.10.10.0/24.

      Services CIDR blocks

      The Kubernetes Services CIDR block. For example, 10.233.0.0/18.

      Pods CIDR blocks

      The Kubernetes Pods CIDR block. For example, 10.233.64.0/18.

    2. Configure StackLight:

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable StackLight

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration or Configure StackLight.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

      Multiserver Mode

      Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see Reference Architecture: StackLight deployment architecture.

      Elasticsearch

      Retention Time

      The Elasticsearch logs retention period in Logstash.

      Persistent Volume Claim Size

      The Elasticsearch persistent volume claim size.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: Available StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  6. Click Create.

    To view the deployment status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the deployment is complete.

  7. Proceed with Add a machine.

Add a machine

After you create a new OpenStack-based DE Container Cloud managed cluster as described in Create a managed cluster, proceed with adding machines to this cluster using the DE Container Cloud web UI.

You can also use the instruction below to scale up an existing managed cluster.

To add a machine to an OpenStack-based managed cluster:

  1. Log in to the DE Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with Machines list opens.

  4. On the cluster page, click Create Machine.

  5. Fill out the form with the following parameters as required:

    DE Container Cloud machine configuration

    Parameter

    Description

    Count

    Add the required number of machines to create.

    The recommended minimum is three machines for manager node HA and two machines for the DE Container Cloud workloads.

    Select Manager or Worker to create a Kubernetes manager or worker node.

    Flavor

    From the drop-down list, select the required hardware configuration for the machine. The list of available flavors corresponds to the one in your OpenStack environment.

    For the hardware requirements, see: Reference Architecture: Requirements for an OpenStack-based cluster.

    Image

    From the drop-down list, select the cloud image with Ubuntu 18.04. If you do not have this image in the list, add it to your OpenStack environment using the Horizon web UI by downloading the image from the Ubuntu official website.

    Availability zone

    From the drop-down list, select the availability zone from which the new machine will be launched.

  6. Click Create.

  7. Repeat the steps above for the remaining machines.

    To view the deployment status, monitor the machines status in the Managers and Workers columns on the Clusters page. Once the status changes to Ready, the deployment is complete. For the statuses description, see Reference Architecture: LCM controller.

  8. Verify the status of the cluster nodes as described in Connect to a Docker Enterprise Container Cloud cluster.

Warning

The operational managed cluster should contain a minimum of 3 Kubernetes manager nodes and 2 Kubernetes worker nodes. To maintain the etcd quorum and prevent deployment failure, scaling down the manager nodes is prohibited.

See also

Delete a machine

Delete a managed cluster

Deleting a managed cluster does not require a preliminary deletion of VMs that run on this cluster.

To delete an OpenStack-based managed cluster:

  1. Log in to the Docker Enterprise (DE) Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required cluster and select Delete.

  4. Verify the list of machines to be removed. Confirm the deletion.

    Deleting a cluster automatically frees up the resources allocated for this cluster, for example, instances, load balancers, networks, floating IPs.

  5. If the cluster deletion hangs and the The cluster is being deleted message does not disappear for a while:

    1. Expand the menu of the tab with your username.

    2. Click Download kubeconfig to download kubeconfig of your management cluster.

    3. Log in to any local machine with kubectl installed.

    4. Copy the downloaded kubeconfig to this machine.

    5. Run the following command:

      kubectl --kubeconfig <KUBECONFIG_PATH> edit -n <PROJECT_NAME> cluster <MANAGED_CLUSTER_NAME>
      
    6. In the editor that opens, remove the following lines from the Cluster object:

      finalizers:
      - cluster.cluster.k8s.io
      
  6. Optional. If you do not plan to reuse the credentials of the deleted cluster, delete them:

    1. In the Credentials tab, click the Delete credential action icon next to the name of the credentials to be deleted.

    2. Confirm the deletion.

    Warning

    You can delete credentials only after deleting the managed cluster they relate to.

Create and operate an AWS-based managed cluster

After bootstrapping your AWS-based Docker Enterprise (DE) Container Cloud management cluster as described in Deployment Guide: Deploy an AWS-based management cluster, you can create the AWS-based managed clusters using the DE Container Cloud web UI.

Create a managed cluster

This section describes how to create an AWS-based managed cluster using the Docker Enterprise (DE) Container Cloud web UI of the AWS-based management cluster.

To create an AWS-based managed cluster:

  1. Log in to the DE Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the SSH Keys tab, click Add SSH Key to upload the public SSH key that will be used for the AWS VMs creation.

  4. In the Credentials tab, click Add Credential and fill in the required fields to add your AWS credentials.

  5. In the Clusters tab, click Create Cluster and fill out the form with the following parameters as required:

    1. Configure general settings and the Kubernetes parameters:

      Managed cluster configuration

      Section

      Parameter

      Description

      General settings

      Name

      Cluster name

      Provider

      Select AWS

      Provider credential

      From the drop-down list, select the previously created AWS credentials name.

      Release version

      The DE Container Cloud version.

      SSH keys

      From the drop-down list, select the SSH key name that you have previously added for SSH access to VMs.

      Provider

      AWS region

      Type in the AWS Region for the managed cluster. For example, us-east-2.

      Kubernetes

      Services CIDR blocks

      The Kubernetes Services CIDR block. For example, 10.233.0.0/18.

      Pods CIDR blocks

      The Kubernetes Pods CIDR block. For example, 10.233.64.0/18.

    2. Configure StackLight:

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable StackLight

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration or Configure StackLight.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

      Multiserver Mode

      Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see Reference Architecture: StackLight deployment architecture.

      Elasticsearch

      Retention Time

      The Elasticsearch logs retention period in Logstash.

      Persistent Volume Claim Size

      The Elasticsearch persistent volume claim size.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: Available StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  6. Click Create.

    To view the deployment status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the deployment is complete.

  7. Proceed with Add a machine.

Add a machine

After you create a new AWS-based managed cluster as described in Create a managed cluster, proceed with adding machines to this cluster using the Docker Enterprise (DE) Container Cloud web UI.

You can also use the instruction below to scale up an existing managed cluster.

To add a machine to an AWS-based managed cluster:

  1. Log in to the DE Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with the Machines list opens.

  4. Click Create Machine.

  5. Fill out the form with the following parameters as required:

    DE Container Cloud machine configuration

    Parameter

    Description

    Count

    Add the required number of machines to create.

    The recommended minimum is three machines for manager node HA and two machines for the DE Container Cloud workloads.

    Select Manager or Worker to create a Kubernetes manager or worker node.

    Instance type

    Type in the required AWS instance type. For example, c5d.2xlarge.

    AMI ID

    Type in the required AMI ID of Ubuntu 18.04. For example, ami-033a0960d9d83ead0.

    Root device size

    Select the required root device size, 40 by default.

  6. Click Create.

  7. Repeat the steps above for the remaining machines.

    To view the deployment status, monitor the machines status in the Managers and Workers columns on the Clusters page. Once the status changes to Ready, the deployment is complete. For the statuses description, see Reference Architecture: LCM controller.

  8. Verify the status of the cluster nodes as described in Connect to a Docker Enterprise Container Cloud cluster.

Warning

The operational managed cluster should contain a minimum of 3 Kubernetes manager nodes and 2 Kubernetes worker nodes. To maintain the etcd quorum and prevent deployment failure, scaling down the manager nodes is prohibited.

See also

Delete a machine

Delete a managed cluster

Deleting a managed cluster does not require a preliminary deletion of VMs that run on this cluster.

To delete an AWS-based managed cluster:

  1. Log in to the DE Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required cluster and select Delete.

  4. Verify the list of machines to be removed. Confirm the deletion.

    Deleting a cluster automatically removes the Amazon Virtual Private Cloud (VPC) connected with this cluster and frees up the resources allocated for this cluster, for example, instances, load balancers, networks, floating IPs.

  5. Optional. If you do not plan to reuse the credentials of the deleted cluster, delete them:

    1. In the Credentials tab, click the Delete Credential action icon next to the name of the credentials to be deleted.

    2. Confirm the deletion.

    Warning

    You can delete credentials only after deleting the managed cluster they relate to.

Change a cluster configuration

After deploying a managed cluster, you can enable or disable StackLight and configure its parameters if enabled. Alternatively, you can configure StackLight through kubeconfig as described in Configure StackLight.

To change a cluster configuration:

  1. Log in to the Docker Enterprise (DE) Container Cloud web UI with the writer permissions.

  2. Select the required namespace.

  3. On the Clusters page, click the More action icon in the last column of the required cluster and select Configure cluster.

  4. In the Configure cluster window, select or deselect StackLight and configure its parameters if enabled.

  5. Click Update to apply the changes.

Update a managed cluster

A Docker Enterprise (DE) Container Cloud management cluster automatically upgrades to a new available DE Container Cloud release version that supports new Cluster releases. Once done, a newer version of a Cluster release becomes available for managed clusters that you update using the DE Container Cloud web UI.

Caution

Make sure to update the Cluster release version of your managed cluster before the current Cluster release version becomes unsupported by a new DE Container Cloud release version. Otherwise, DE Container Cloud stops auto-upgrade and eventually DE Container Cloud itself becomes unsupported.

This section describes how to update a managed cluster of any provider type using the DE Container Cloud web UI.

To update a managed cluster:

  1. Log in to the DE Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column for each cluster and select Update cluster where available.

  4. In the Release Update window, select the required Cluster release to update your managed cluster to.

    The Description section contains the list of component versions to be installed with a new Cluster release. The release notes for each DE Container Cloud and Cluster release are available at Release Notes: DE Container Cloud releases and Release Notes: Cluster releases.

  5. Click Update.

    To view the update status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the update is complete.
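
    Alternatively, you can inspect the Cluster object of the managed cluster from the command line. A minimal sketch, assuming the management cluster kubeconfig and the project namespace of the managed cluster; the exact status fields may differ between Cluster releases:

    kubectl --kubeconfig <pathToManagementClusterKubeconfig> \
    -n <PROJECT_NAME> get cluster <MANAGED_CLUSTER_NAME> -o yaml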

Delete a machine

This section instructs you on how to scale down an existing managed cluster through the Docker Enterprise (DE) Container Cloud web UI.

Warning

A machine with the manager node role cannot be deleted manually. Machines with this role are deleted automatically during the managed cluster deletion.

To delete a machine from a managed cluster:

  1. Log in to the DE Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click on the required cluster name to open the list of machines running on it.

  4. Click the More action icon in the last column of the machine you want to delete and select Delete. Confirm the deletion.

Deleting a machine automatically frees up the resources allocated to this machine.

Warning

An operational managed cluster must contain a minimum of 3 Kubernetes manager nodes and 2 Kubernetes worker nodes. To maintain the etcd quorum and prevent deployment failures, scaling down manager nodes is prohibited.

Manage a management cluster

The Docker Enterprise (DE) Container Cloud web UI enables you to perform the following operations with a DE Container Cloud management cluster:

  • View the cluster details (such as the cluster ID, creation date, node count, and so on) and obtain a list of the cluster endpoints, including the StackLight components, depending on your deployment configuration.

    To view generic cluster details, in the Clusters tab, click the More action icon in the last column of the required management cluster and select Cluster info.

  • Verify the current release version of the cluster including the list of installed components with their versions and the cluster release change log.

    To view the cluster release version details, in the Clusters tab, click the version in the Release column next to the name of the required management cluster.

    A management cluster upgrade to a newer version is performed automatically once a new DE Container Cloud version is released. Regional clusters also upgrade automatically along with the management cluster. For more details about the DE Container Cloud release upgrade mechanism, see: Reference Architecture: DE Container Cloud release controller.

This section outlines the operations that can be performed with a management cluster.

Remove a management cluster

This section describes how to remove a management cluster.

To remove a management cluster:

  1. Verify that you have successfully removed all managed clusters that run on top of the management cluster to be removed. For details, see the corresponding Delete a managed cluster section depending on your cloud provider in Create and operate a managed cluster.

  2. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  3. Run the following script:

    bootstrap.sh cleanup
    
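
    A minimal usage sketch follows. It assumes that you run the script from the directory that contains the bootstrap artifacts created during the management cluster deployment and that the KUBECONFIG environment variable points to the management cluster kubeconfig; both are assumptions, so adjust the paths to your environment:

    # Assumption: the cleanup acts on the management cluster referenced by KUBECONFIG
    export KUBECONFIG=<path_to_management_cluster_kubeconfig>
    cd <bootstrap_directory>
    ./bootstrap.sh cleanup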

Attach an existing UCP cluster

Starting from UCP 3.3.3, you can attach an existing UCP cluster that is not deployed by Docker Enterprise (DE) Container Cloud to a management cluster. This feature allows you to view the details of all your UCP clusters in one place, including cluster health, capacity, and usage.

For supported configurations of existing UCP clusters that are not deployed by DE Container Cloud, see Docker Enterprise Compatibility Matrix.

Note

Using the free Mirantis license, you can create up to three DE Container Cloud managed clusters with three worker nodes on each cluster. Within the same quota, you can also attach existing UCP clusters that are not deployed by DE Container Cloud. If you need to increase this quota, contact Mirantis support for further details.

Using the instruction below, you can also install StackLight to your existing UCP cluster during the attach procedure. For the StackLight system requirements, refer to the Reference Architecture: Requirements of the corresponding cloud provider.

You can also update all your UCP clusters to the latest version once your management cluster automatically upgrades to a newer version in which a new Cluster release with the latest UCP version is available. For details, see Update a managed cluster.

Caution

  • A UCP cluster can be attached to only one management cluster. Attaching a DE Container Cloud-based UCP cluster to another management cluster is not supported.

  • Due to development limitations, if you detach a UCP cluster that is not deployed by DE Container Cloud, the Helm controller and OIDC integration are not deleted.

  • Detaching a DE Container Cloud-based UCP cluster is not supported.

To attach an existing UCP cluster:

  1. Log in to the DE Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, expand the Create Cluster menu and click Attach Existing UCP Cluster.

  4. In the wizard that opens, fill out the form with the following parameters as required:

    1. Configure general settings:

      UCP cluster configuration

      Section

      Parameter

      Description

      General Settings

      Cluster Name

      Specify the cluster name.

      Region

      Select the required cloud provider: OpenStack, AWS, or bare metal.

    2. Upload the UCP client bundle or fill in the fields manually. To download the UCP client bundle, refer to UCP user access: Download client certificates.

    3. Configure StackLight:

      StackLight configuration

      Section

      Parameter name

      Description

      StackLight

      Enable StackLight

      Selected by default. Deselect to skip StackLight deployment.

      Note

      You can also enable, disable, or configure StackLight parameters after deploying a managed cluster. For details, see Change a cluster configuration or Configure StackLight.

      Enable Logging

      Select to deploy the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

      Multiserver Mode

      Select to enable StackLight monitoring in the HA mode. For the differences between HA and non-HA modes, see Reference Architecture: StackLight deployment architecture.

      Elasticsearch

      Retention Time

      The Elasticsearch logs retention period in Logstash.

      Persistent Volume Claim Size

      The Elasticsearch persistent volume claim size.

      Prometheus

      Retention Time

      The Prometheus database retention period.

      Retention Size

      The Prometheus database retention size.

      Persistent Volume Claim Size

      The Prometheus persistent volume claim size.

      Enable Watchdog Alert

      Select to enable the Watchdog alert that fires as long as the entire alerting pipeline is functional.

      Custom Alerts

      Specify alerting rules for new custom alerts or upload a YAML file in the following exemplary format:

      - alert: HighErrorRate
        expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High request latency
      

      For details, see Official Prometheus documentation: Alerting rules. For the list of the predefined StackLight alerts, see Operations Guide: Available StackLight alerts.

      StackLight Email Alerts

      Enable Email Alerts

      Select to enable the StackLight email alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Require TLS

      Select to enable transmitting emails through TLS.

      Email alerts configuration for StackLight

      Fill out the following email alerts parameters as required:

      • To - the email address to send notifications to.

      • From - the sender address.

      • SmartHost - the SMTP host through which the emails are sent.

      • Authentication username - the SMTP user name.

      • Authentication password - the SMTP password.

      • Authentication identity - the SMTP identity.

      • Authentication secret - the SMTP secret.

      StackLight Slack Alerts

      Enable Slack alerts

      Select to enable the StackLight Slack alerts.

      Send Resolved

      Select to enable notifications about resolved StackLight alerts.

      Slack alerts configuration for StackLight

      Fill out the following Slack alerts parameters as required:

      • API URL - The Slack webhook URL.

      • Channel - The channel to send notifications to, for example, #channel-for-alerts.

  5. Click Create.

    To view the deployment status, verify the cluster status on the Clusters page. Once the orange blinking dot near the cluster name disappears, the deployment is complete.

Connect to the UCP web UI

After you deploy a new or attach an existing UCP cluster to a management cluster, start managing your cluster using the UCP web UI.

To connect to the UCP web UI:

  1. Log in to the Docker Enterprise (DE) Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required UCP cluster and select Cluster info.

  4. In the dialog box with the cluster information, copy the UCP UI endpoint.

  5. Paste the copied IP to a web browser and use the same credentials that you use to access the DE Container Cloud web UI.

Warning

To ensure DE Container Cloud stability in managing the DE Container Cloud-based UCP clusters, a number of UCP API functions are not available for the DE Container Cloud-based UCP clusters as compared to the attached UCP clusters that are not deployed by DE Container Cloud. Use the DE Container Cloud web UI or CLI for this functionality instead.

See Reference Architecture: UCP API limitations for details.

Caution

The UCP web UI contains help links that lead to the Docker Enterprise documentation suite. Besides UCP and Docker Engine - Enterprise, which are integrated with DE Container Cloud, that documentation suite covers other Docker Enterprise components and does not fully apply to the DE Container Cloud-based UCP clusters. Therefore, to avoid misconceptions, before you proceed with the UCP web UI documentation, read Reference Architecture: UCP API limitations and make sure that you use the documentation for the supported UCP version as per the Release Compatibility Matrix.

Connect to a Docker Enterprise Container Cloud cluster

After you deploy a Docker Enterprise (DE) Container Cloud management or managed cluster, connect to the cluster to verify the availability and status of the nodes as described below.

This section also describes how to SSH to a node of a cluster where Bastion host is used for SSH access. For example, on the OpenStack-based management cluster or AWS-based management and managed clusters.

To connect to a managed cluster:

  1. Log in to the DE Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the required cluster name. The cluster page with the Machines list opens.

  4. Verify the status of the manager nodes. Once the first manager node is deployed and has the Ready status, the Download Kubeconfig option for the cluster being deployed becomes active.

  5. Open the Clusters tab.

  6. Click the More action icon in the last column of the required cluster and select Download Kubeconfig:

    1. Enter your user password.

    2. Not recommended. Select Offline Token to generate an offline IAM token. Otherwise, for security reasons, the kubeconfig token expires after 30 minutes of DE Container Cloud API idle time, and you have to download kubeconfig again with a newly generated token.

    3. Click Download.

  7. Verify the availability of the managed cluster machines:

    1. Export the kubeconfig parameters to your local machine with access to kubectl. For example:

      export KUBECONFIG=~/Downloads/kubeconfig-test-cluster.yml
      
    2. Obtain the list of available DE Container Cloud machines:

      kubectl get nodes -o wide
      

      The system response must contain the details of the nodes in the READY status.

To connect to a management cluster:

  1. Log in to a local machine where your management cluster kubeconfig is located and where kubectl is installed.

    Note

    The management cluster kubeconfig is created during the last stage of the management cluster bootstrap.

  2. Obtain the list of available management cluster machines:

    kubectl get nodes -o wide
    

    The system response must contain the details of the nodes in the READY status.

To SSH to a DE Container Cloud cluster node if Bastion is used:

  1. Obtain kubeconfig of the management or managed cluster as described in the procedures above.

  2. Obtain the internal IP address of a node you require access to:

    kubectl get nodes -o wide
    
  3. Obtain the Bastion public IP:

    kubectl get cluster -o jsonpath='{.status.providerStatus.bastion.publicIp}' \
    -n <project_name> <cluster_name>
    
  4. Run the following command:

    ssh -i <private_key> ubuntu@<node_internal_ip> -o "proxycommand ssh -W %h:%p \
    -i <private_key> ubuntu@<bastion_public_ip>"
    

    Substitute the parameters enclosed in angle brackets with the corresponding values of your cluster obtained in previous steps. The <private_key> for a management cluster is located at ~/.ssh/openstack_tmp. For a managed cluster, this is the SSH Key that you added in the DE Container Cloud web UI before the managed cluster creation.
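
    For example, the following hedged invocation connects to a management cluster node, assuming a hypothetical node internal IP of 192.168.1.15 and a hypothetical Bastion public IP of 203.0.113.10 obtained in the previous steps:

    # Hypothetical values; replace with the IPs obtained in the previous steps
    ssh -i ~/.ssh/openstack_tmp ubuntu@192.168.1.15 -o "proxycommand ssh -W %h:%p \
    -i ~/.ssh/openstack_tmp ubuntu@203.0.113.10"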

Manage IAM

IAM CLI

IAM CLI is a user-facing command-line tool for managing scopes, roles, and grants. Using your personal credentials, you can perform different IAM operations through the iamctl tool. For example, you can verify the current status of the IAM service, request or revoke service tokens, verify your own grants within Docker Enterprise (DE) Container Cloud as well as your token details.

Configure IAM CLI

The iamctl command-line interface uses the iamctl.yaml configuration file to interact with IAM.

To create the IAM CLI configuration file:

  1. Log in to the management cluster.

  2. Change the directory to one of the following:

    • $HOME/.iamctl

    • $HOME

    • $HOME/etc

    • /etc/iamctl

  3. Create iamctl.yaml with the following exemplary parameters and values that correspond to your deployment:

    server: <IAM_API_ADDRESS>
    timeout: 60
    verbose: 99 # Verbosity level, from 0 to 99
    
    tls:
        enabled: true
        ca: <PATH_TO_CA_BUNDLE>
    
    auth:
        issuer: <IAM_REALM_IN_KEYCLOAK>
        ca: <PATH_TO_CA_BUNDLE>
        client_id: iam
        client_secret:
    

    The <IAM_REALM_IN_KEYCLOAK> value has the <keycloak-url>/auth/realms/<realm-name> format, where <realm-name> defaults to iam.

Available IAM CLI commands

Using iamctl, you can perform different role-based access control operations in your managed cluster. For example:

  • Grant or revoke access to a managed cluster and a specific user for troubleshooting

  • Grant or revoke access to a Docker Enterprise (DE) Container Cloud project that contains several managed clusters

  • Create or delete tokens for the DE Container Cloud services with a specific set of grants as well as identify when a service token was used the last time

The following tables describe the available iamctl commands and their usage.

General commands

Usage

Description

iamctl --help, iamctl help

Output the list of available commands.

iamctl help <command>

Output the description of a specific command.

Account information commands

Usage

Description

iamctl account info

Output detailed account information, such as the user email and user name, the details of active and offline sessions, and the token statuses and expiration dates.

iamctl account login

Log in the current user. The system prompts to enter your authentication credentials. After a successful login, your user token is added to the $HOME/.iamctl directory.

iamctl account logout

Log out the current user. Once done, the user information is removed from $HOME/.iamctl.

Scope commands

Usage

Description

iamctl scope list

List the IAM scopes available for the current environment.

Example output:

+---------------+--------------------------+
|     NAME      |   DESCRIPTION            |
+---------------+--------------------------+
| m:iam         | IAM scope                |
| m:kaas        | DE Container Cloud scope |
| m:k8s:managed |                          |
| m:k8s         | Kubernetes scope         |
| m:cloud       | Cloud scope              |
+---------------+--------------------------+

iamctl scope list [prefix]

Output the scope list filtered by the specified prefix. For example: iamctl scope list m:k8s.

Role commands

Usage

Description

iamctl role list <scope>

List the roles for the specified scope in IAM.

iamctl role show <scope> <role>

Output the details of the specified scope role including the role name (admin, viewer, reader), its description, and an example of the grant command. For example: iamctl role show m:iam admin.

Grant commands

Usage

Description

iamctl grant give [username] [scope] [role]

Provide a user with a role in a scope. For example, the iamctl grant give jdoe m:iam admin command provides the IAM admin role in the m:iam scope to John Doe.

For the list of supported IAM scopes and roles, see: Role list.

Note

To lock or disable a user, use LDAP or Google OAuth depending on the external provider integrated to your deployment.

iamctl grant list <username>

List the grants provided to the specified user. For example: iamctl grant list jdoe.

Example output:

+--------+--------+---------------+
| SCOPE  |  ROLE  |   GRANT FQN   |
+--------+--------+---------------+
| m:iam  | admin  | m:iam@admin   |
| m:sl   | viewer | m:sl@viewer   |
| m:kaas | writer | m:kaas@writer |
+--------+--------+---------------+
  • m:iam@admin - admin rights in all IAM-related applications

  • m:sl@viewer - viewer rights in all StackLight-related applications

  • m:kaas@writer - writer rights in DE Container Cloud

iamctl grant revoke [username] [scope] [role]

Revoke the grants provided to the user.
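
For example, the following command revokes the writer role in the m:kaas scope from the jdoe user referenced in the examples above (the user, scope, and role values are illustrative):

iamctl grant revoke jdoe m:kaas writer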

Service token commands

Usage

Description

iamctl servicetoken list [--all]

List the details of all service tokens created by the current user. The output includes the following service token details:

  • ID

  • Alias, for example, nova, jenkins-ci

  • Creation date and time

  • Creation owner

  • Grants

  • Last refresh date and time

  • IP address

iamctl servicetoken show [ID]

Output the details of a service token with the specified ID.

iamctl servicetoken create [alias] [service] [grant1 grant2...]

Create a token for a specific service with the specified set of grants. For example, iamctl servicetoken create new-token iam m:iam@viewer.

iamctl servicetoken delete [ID1 ID2...]

Delete one or more service tokens with the specified IDs.
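
A minimal end-to-end sketch that creates a token for the IAM service with the viewer grant, lists the tokens to obtain the generated ID, and then deletes the token. The jenkins-ci alias and the grant follow the examples above; the <ID> value is environment-specific and is taken from the list output:

iamctl servicetoken create jenkins-ci iam m:iam@viewer
iamctl servicetoken list
iamctl servicetoken delete <ID>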

User commands

Usage

Description

iamctl user list

List user names and emails of all current users.

iamctl user show <username>

Output the details of the specified user.

Role list

Docker Enterprise (DE) Container Cloud creates the IAM roles in scopes. For each application type, such as iam, k8s, or kaas, DE Container Cloud creates a scope in Keycloak, and every scope contains a set of roles, such as admin, user, or viewer. The default IAM roles can be changed during a managed cluster deployment. You can grant or revoke role access using the IAM CLI. For details, see: IAM CLI.

Example of the structure of a cluster-admin role in a managed cluster:

m:k8s:kaas-tenant-name:k8s-cluster-name@cluster-admin
  • m - prefix for all IAM roles in DE Container Cloud

  • k8s - application type, Kubernetes

  • kaas-tenant-name:k8s-cluster-name - a managed cluster identifier in DE Container Cloud (CLUSTER_ID)

  • @ - delimiter between a scope and role

  • cluster-admin - name of the role within the Kubernetes scope
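
Such a scope and role pair can be granted through the IAM CLI. A minimal sketch, assuming a hypothetical jdoe user and the example cluster identifier above:

iamctl grant give jdoe m:k8s:kaas-tenant-name:k8s-cluster-name cluster-admin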


The following tables include the scopes and their roles descriptions by DE Container Cloud components:

IAM

Scope identifier

Role name

Grant example

Role description

m:iam

admin

m:iam@admin [0]

Access Keycloak, the IAM API and web UI.

user

m:iam@user [0]

Access the IAM API and web UI.

viewer

m:iam@viewer [0]

Access the data to be used by the monitoring systems.

DE Container Cloud

Scope identifier

Role name

Grant example

Role description

m:kaas

reader

m:kaas@reader [0]

List the managed clusters within the DE Container Cloud scope.

writer

m:kaas@writer [0]

Create or delete the managed clusters within the DE Container Cloud scope.

m:kaas:$<CLUSTER_ID>

reader

m:kaas:$<CLUSTER_ID>@reader

List the managed clusters within the specified DE Container Cloud cluster ID.

writer

m:kaas:$<CLUSTER_ID>@writer

Create or delete the managed clusters within the specified DE Container Cloud cluster ID.

operator

m:kaas@operator

Add or delete a bare metal host and machine within the DE Container Cloud scope, create a project.

[0] Grant is available by default. Other grants can be added during a management and managed cluster deployment.

Kubernetes

Scope identifier

Role name

Grant example

Role description

m:k8s:<CLUSTER_ID>

cluster-admin

m:k8s:<CLUSTER_ID>@cluster-admin

Allow super-user access to perform any action on any resource at the cluster level. When used in a ClusterRoleBinding, provide full control over every resource in the cluster and in all Kubernetes namespaces.

StackLight

Scope identifier

Role name

Grant example

Role description

m:sl:$<CLUSTER_ID> or m:sl:$<CLUSTER_ID>:<SERVICE_NAME>

admin

  • m:sl:$<CLUSTER_ID>@admin

  • m:sl:$<CLUSTER_ID>:alerta@admin

  • m:sl:$<CLUSTER_ID>:alertmngmnt@admin

  • m:sl:$<CLUSTER_ID>:kibana@admin

  • m:sl:$<CLUSTER_ID>:graphana@admin

  • m:sl:$<CLUSTER_ID>:prometheus@admin

Assign roles to other users within the scope.

viewer

  • m:sl:$<CLUSTER_ID>@viewer

  • m:sl:$<CLUSTER_ID>:alerta@viewer

  • m:sl:$<CLUSTER_ID>:alertmngmnt@viewer

  • m:sl:$<CLUSTER_ID>:kibana@viewer

  • m:sl:$<CLUSTER_ID>:graphana@viewer

  • m:sl:$<CLUSTER_ID>:prometheus@viewer

Access the specified web UI(s) within the scope.

The m:sl:$<CLUSTER_ID>@viewer grant provides access to all StackLight web UIs: Prometheus, Alerta, Alertmanager, Kibana, Grafana.

Manage StackLight

Using StackLight, you can monitor the components deployed in Docker Enterprise (DE) Container Cloud and be quickly notified of critical conditions that may occur in the system to prevent service downtimes.

Access StackLight web UIs

StackLight provides five web UIs including Prometheus, Alertmanager, Alerta, Kibana, and Grafana. This section describes how to access any of these web UIs.

To access a StackLight web UI:

  1. Log in to the Docker Enterprise (DE) Container Cloud web UI.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. In the Clusters tab, click the More action icon in the last column of the required cluster and select Cluster info.

  4. In the dialog box with the cluster information, copy the required endpoint IP from the StackLight Endpoints section.

  5. Paste the copied IP to a web browser and use the default credentials to log in to the web UI. Once done, you are automatically authenticated to all StackLight web UIs.

View Grafana dashboards

Using the Grafana web UI, you can view the visual representation of the metric graphs based on the time series databases.

To view the Grafana dashboards:

  1. Log in to the Grafana web UI as described in Access StackLight web UIs.

  2. From the drop-down list, select the required dashboard to inspect the status and statistics of the corresponding service in your management or managed cluster:

    Component

    Dashboard

    Description

    Ceph cluster

    Ceph Cluster

    Provides the overall health status of the Ceph cluster, capacity, latency, and recovery metrics.

    Ceph Nodes

    Provides an overview of the host-related metrics, such as the number of monitors, OSD hosts, average usage of resources across the cluster, network and hosts load.

    Ceph OSD

    Provides metrics for Ceph OSDs, including the OSD read and write latencies, distribution of PGs per OSD, Ceph OSDs and physical device performance.

    Ceph Pools

    Provides metrics for Ceph pools, including the client IOPS and throughput by pool and pools capacity usage.

    Ironic bare metal

    Ironic BM

    Provides graphs on Ironic health, HTTP API availability, provisioned nodes by state and installed ironic-conductor back-end drivers.

    DE Container Cloud clusters

    Clusters Overview

    Represents the main cluster capacity statistics for all clusters of a Docker Enterprise (DE) Container Cloud deployment where StackLight is installed.

    Kubernetes resources

    Kubernetes Calico

    Provides metrics of the entire Calico cluster usage, including the cluster status, host status, and Felix resources.

    Kubernetes Cluster

    Provides metrics for the entire Kubernetes cluster, including the cluster status, host status, and resources consumption.

    Kubernetes Deployments

    Provides information on the desired and current state of all service replicas deployed on a DE Container Cloud cluster.

    Kubernetes Namespaces

    Provides the pods state summary and the CPU, MEM, network, and IOPS resources consumption per namespace.

    Kubernetes Nodes

    Provides charts showing resources consumption per DE Container Cloud cluster node.

    Kubernetes Pods

    Provides charts showing resources consumption per deployed pod.

    NGINX

    NGINX

    Provides the overall status of the NGINX cluster and information about NGINX requests and connections.

    NGINX Ingress controller

    Monitors the number of requests, response times and statuses, as well as the number of Ingress SSL certificates including expiration time and resources usage.

    StackLight

    Alertmanager

    Provides performance metrics on the overall health status of the Prometheus Alertmanager service, the number of firing and resolved alerts received for various periods, the rate of successful and failed notifications, and the resources consumption.

    Elasticsearch

    Provides information about the overall health status of the Elasticsearch cluster, including the resources consumption and the state of the shards.

    Grafana

    Provides performance metrics for the Grafana service, including the total number of Grafana entities, CPU and memory consumption.

    PostgreSQL

    Provides PostgreSQL statistics, including read (DQL) and write (DML) row operations, transaction and lock, replication lag and conflict, and checkpoint statistics, as well as PostgreSQL performance metrics.

    Prometheus

    Provides the availability and performance behavior of the Prometheus servers, the sample ingestion rate, and system usage statistics per server. Also, provides statistics about the overall status and uptime of the Prometheus service, the chunks number of the local storage memory, target scrapes, and queries duration.

    Pushgateway

    Provides performance metrics and the overall health status of the service, the rate of samples received for various periods, and the resources consumption.

    Prometheus Relay

    Provides service status and resources consumption metrics.

    Telemeter Server

    Provides statistics and the overall health status of the Telemeter service.

    System

    System

    Provides detailed resource consumption and operating system information per DE Container Cloud cluster node.

    UCP

    UCP Cluster

    Provides a global overview of a UCP cluster: statistics about the number of worker and manager nodes, containers, images, and Swarm services.

    UCP Containers

    Provides per container resources consumption metrics for the UCP containers such as CPU, RAM, network.

View Kibana dashboards

Using the Kibana web UI, you can view the visual representation of logs and Kubernetes events of your deployment.

To view the Kibana dashboards:

  1. Log in to the Kibana web UI as described in Access StackLight web UIs.

  2. Click the required dashboard to inspect the visualizations or perform a search:

    Dashboard

    Description

    Logs

    Provides visualizations on the number of log messages per severity and source, and on the top log-producing hosts, namespaces, containers, and applications. Includes search.

    Kubernetes events

    Provides visualizations on the number of Kubernetes events per type, and top event-producing resources and namespaces by reason and event type. Includes search.

Available StackLight alerts

This section provides an overview of the available predefined StackLight alerts. To view the alerts, use the Prometheus, Alertmanager, or Alerta web UI.

Alertmanager

This section describes the alerts for the Alertmanager service.


AlertmanagerFailedReload

Severity

Warning

Summary

Failure to reload the Alertmanager configuration.

Description

Reloading the Alertmanager configuration failed for {{ $labels.namespace }}/{{ $labels.pod }}.


AlertmanagerMembersInconsistent

Severity

Critical

Summary

Alertmanager did not detect all cluster members.

Description

Alertmanager did not detect all other members of the cluster.


AlertmanagerNotificationFailureWarning

Severity

Warning

Summary

Alertmanager has failed notifications.

Description

An average of {{ $value }} Alertmanager {{ $labels.integration }} notifications on the {{ $labels.instance }} instance fail for 2 minutes.


AlertmanagerAlertsInvalidWarning

Severity

Warning

Summary

Alertmanager has invalid alerts.

Description

An average of {{ $value }} Alertmanager {{ $labels.integration }} alerts on the {{ $labels.instance }} instance are invalid for 2 minutes.

Calico

This section describes the alerts for Calico.


CalicoDataplaneFailuresHigh

Severity

Warning

Summary

High number of data plane failures within Felix.

Description

The {{ $labels.instance }} Felix instance has {{ $value }} data plane failures within the last hour.


CalicoDataplaneAddressMsgBatchSizeHigh

Severity

Warning

Summary

Felix address message batch size is higher than 5.

Description

The size of the data plane address message batch on the {{ $labels.instance }} Felix instance is {{ $value }}.


CalicoDatapaneIfaceMsgBatchSizeHigh

Severity

Warning

Summary

Felix interface message batch size is higher than 5.

Description

The size of the data plane interface message batch on the {{ $labels.instance }} Felix instance is {{ $value }}.


CalicoIPsetErrorsHigh

Severity

Warning

Summary

More than 5 IPset errors occur in Felix per hour.

Description

The {{ $labels.instance }} Felix instance has {{ $value }} IPset errors within the last hour.


CalicoIptablesSaveErrorsHigh

Severity

Warning

Summary

More than 5 iptable save errors occur in Felix per hour.

Description

The {{ $labels.instance }} Felix instance has {{ $value }} iptable save errors within the last hour.


CalicoIptablesRestoreErrorsHigh

Severity

Warning

Summary

More than 5 iptable restore errors occur in Felix per hour.

Description

The {{ $labels.instance }} Felix instance has {{ $value }} iptable restore errors within the last hour.

Ceph

This section describes the alerts for the Ceph cluster.


CephClusterHealthMinor

Severity

Minor

Summary

Ceph cluster health is WARNING.

Description

The Ceph cluster is in the WARNING state. For details, run ceph -s.


CephClusterHealthCritical

Severity

Critical

Summary

Ceph cluster health is CRITICAL.

Description

The Ceph cluster is in the CRITICAL state. For details, run ceph -s.


CephMonQuorumAtRisk

Severity

Critical

Summary

Storage quorum is at risk.

Description

The storage cluster quorum is low.


CephOsdDownMinor

Severity

Minor

Summary

Ceph OSDs are down.

Description

{{ $value }} of Ceph OSD nodes in the Ceph cluster are down. For details, run ceph osd tree.


CephOSDDiskNotResponding

Severity

Critical

Summary

Disk is not responding.

Description

The {{ $labels.device }} disk device is not responding on the {{ $labels.host }} host.


CephOSDDiskUnavailable

Severity

Critical

Summary

Disk is not accessible.

Description

The {{ $labels.device }} disk device is not accessible on the {{ $labels.host }} host.


CephClusterNearFull

Severity

Warning

Summary

Storage cluster is nearly full. Expansion is required.

Description

The storage cluster utilization has crossed 85%.


CephClusterCriticallyFull

Severity

Critical

Summary

Storage cluster is critically full and needs immediate expansion.

Description

The storage cluster utilization has crossed 95%.


CephOsdPgNumTooHighWarning

Severity

Warning

Summary

Some Ceph OSDs have more than 200 PGs.

Description

Some Ceph OSDs contain more than 200 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump.


CephOsdPgNumTooHighCritical

Severity

Critical

Summary

Some Ceph OSDs have more than 300 PGs.

Description

Some Ceph OSDs contain more than 300 PGs. This may have a negative impact on the cluster performance. For details, run ceph pg dump.


CephMonHighNumberOfLeaderChanges

Severity

Warning

Summary

Many leader changes occur in the storage cluster.

Description

{{ $value }} leader changes per minute occur for the {{ $labels.instance }} instance of the {{ $labels.job }} Ceph Monitor.


CephNodeDown

Severity

Critical

Summary

Storage node {{ $labels.node }} went down.

Description

The {{ $labels.node }} storage node is down and requires immediate verification.


CephDataRecoveryTakingTooLong

Severity

Warning

Summary

Data recovery is slow.

Description

Data recovery has been active for more than two hours.


CephPGRepairTakingTooLong

Severity

Warning

Summary

Self-heal issues detected.

Description

The self-heal operations take an excessive amount of time.


CephOSDVersionMismatch

Severity

Warning

Summary

Multiple versions of storage services are running.

Description

{{ $value }} different versions of Ceph OSD components are running.


CephMonVersionMismatch

Severity

Warning

Summary

Multiple versions of storage services are running.

Description

{{ $value }} different versions of Ceph Monitor components are running.

Elasticsearch

This section describes the alerts for the Elasticsearch service.


ElasticHeapUsageTooHigh

Severity

Critical

Summary

Elasticsearch heap usage is too high (>90%).

Description

Elasticsearch heap usage is over 90% for 5 minutes.


ElasticHeapUsageWarning

Severity

Warning

Summary

Elasticsearch heap usage is high (>80%).

Description

Elasticsearch heap usage is over 80% for 5 minutes.


ElasticClusterRed

Severity

Critical

Summary

Elasticsearch cluster is RED.

Description

The Elasticsearch cluster status is RED.


ElasticClusterYellow

Severity

Warning

Summary

Elasticsearch cluster is YELLOW.

Description

The Elasticsearch cluster status is YELLOW.


NumberOfRelocationShards

Severity

Critical

Summary

Shards relocation takes more than 20 minutes.

Description

Elasticsearch has {{ $value }} relocating shards for 20 minutes.


NumberOfInitializingShards

Severity

Critical

Summary

Shards initialization takes more than 10 minutes.

Description

Elasticsearch has {{ $value }} shards being initialized for 10 minutes.


NumberOfUnassignedShards

Severity

Critical

Summary

Shards have unassigned status for 5 minutes.

Description

Elasticsearch has {{ $value }} unassigned shards for 5 minutes.


NumberOfPendingTasks

Severity

Warning

Summary

Tasks have pending state for 10 minutes.

Description

Elasticsearch has {{ $value }} pending tasks for 10 minutes. The cluster works slowly.


ElasticNoNewDataCluster

Severity

Critical

Summary

Elasticsearch cluster has no new data for 30 minutes.

Description

No new data has arrived to the Elasticsearch cluster for 30 minutes.


ElasticNoNewDataNode

Severity

Warning

Summary

Elasticsearch node has no new data for 30 minutes.

Description

No new data has arrived to the {{ $labels.name }} Elasticsearch node for 30 minutes. The alert also indicates Elasticsearch node cordoning.

etcd

This section describes the alerts for the etcd service.


etcdInsufficientMembers

Severity

Critical

Summary

The etcd cluster has insufficient members.

Description

The {{ $labels.job }} etcd cluster has {{ $value }} insufficient members.


etcdNoLeader

Severity

Critical

Summary

The etcd cluster has no leader.

Description

The {{ $labels.instance }} member of the {{ $labels.job }} etcd cluster has no leader.


etcdHighNumberOfLeaderChanges

Severity

Warning

Summary

More than 3 leader changes occurred in the etcd cluster within the last hour.

Description

The {{ $labels.instance }} instance of the {{ $labels.job }} etcd cluster has {{ $value }} leader changes within the last hour.


etcdGRPCRequestsSlow

Severity

Critical

Summary

The etcd cluster has slow gRPC requests.

Description

The gRPC requests to {{ $labels.grpc_method }} take {{ $value }}s on {{ $labels.instance }} instance of the {{ $labels.job }} etcd cluster.


etcdMemberCommunicationSlow

Severity

Warning

Summary

The etcd cluster has slow member communication.

Description

The member communication with {{ $labels.To }} on the {{ $labels.instance }} instance of the {{ $labels.job }} etcd cluster takes {{ $value }}s.


etcdHighNumberOfFailedProposals

Severity

Warning

Summary

The etcd cluster has more than 5 proposal failures.

Description

The {{ $labels.job }} etcd cluster has {{ $value }} proposal failures on the {{ $labels.instance }} etcd instance within the last hour.


etcdHighFsyncDurations

Severity

Warning

Summary

The etcd cluster has high fsync duration.

Description

The duration of 99% of all fsync operations on the {{ $labels.instance }} of the {{ $labels.job }} etcd cluster is {{ $value }}s.


etcdHighCommitDurations

Severity

Warning

Summary

The etcd cluster has high commit duration.

Description

The duration of 99% of all commit operations on the {{ $labels.instance }} of the {{ $labels.job }} etcd cluster is {{ $value }}s.

External endpoint

This section describes the alerts for external endpoints.


ExternalEndpointDown

Severity

Critical

Summary

External endpoint is down.

Description

The {{ $labels.instance }} external endpoint is not accessible for the last 3 minutes.


ExternalEndpointTCPFailure

Severity

Critical

Summary

Failure to establish a TCP or TLS connection.

Description

The system cannot establish a TCP or TLS connection to {{ $labels.instance }}.

General alerts

This section lists the general available alerts.


TargetDown

Severity

Critical

Summary

The {{ $labels.job }} target is down.

Description

The {{ $labels.job }}/{{ $labels.instance }} target is down.


TargetFlapping

Severity

Critical

Summary

The {{ $labels.job }} target is flapping.

Description

The {{ $labels.job }}/{{ $labels.instance }} target is changing its state between UP and DOWN for 30 minutes, at least once within a 15-minute time range.


NodeDown

Severity

Critical

Summary

The {{ $labels.node }} node is down.

Description

The {{ $labels.node }} node is down. Kubernetes treats {{ $labels.node }} as not Ready and kubelet is not accessible from Prometheus.


Watchdog

Severity

None

Summary

Watchdog alert that is always firing.

Description

This alert ensures that the entire alerting pipeline is functional. This alert should always be firing in Alertmanager against a receiver. Some integrations with various notification mechanisms can send a notification when this alert is not firing. For example, the DeadMansSnitch integration in PagerDuty.

General node alerts

This section lists the general alerts for Kubernetes nodes.


SystemCpuFullWarning

Severity

Warning

Summary

High CPU consumption.

Description

The average CPU consumption on the {{ $labels.node }} node is {{ $value }}% for 2 minutes.


SystemLoadTooHighWarning

Severity

Warning

Summary

System load is more than 1 per CPU.

Description

The system load per CPU on the {{ $labels.node }} node is {{ $value }} for 5 minutes.


SystemLoadTooHighCritical

Severity

Critical

Summary

System load is more than 2 per CPU.

Description

The system load per CPU on the {{ $labels.node }} node is {{ $value }} for 5 minutes.


SystemDiskFullWarning

Severity

Warning

Summary

Disk partition {{ $labels.mountpoint }} is 85% full.

Description

The {{ $labels.mountpoint }} partition of the {{ $labels.device }} disk on the {{ $labels.node }} node is {{ $value }}% full for 2 minutes.


SystemDiskFullMajor

Severity

Major

Summary

Disk partition {{ $labels.mountpoint }} is 95% full.

Description

The {{ $labels.mountpoint }} partition of the {{ $labels.device }} disk on the {{ $labels.node }} node is {{ $value }}% full for 2 minutes.


SystemMemoryFullWarning

Severity

Warning

Summary

More than 90% of memory is used or less than 8 GB is available.

Description

The {{ $labels.node }} node consumes {{ $value }}% of memory for 2 minutes.


SystemMemoryFullMajor

Severity

Major

Summary

More than 95% of memory is used or less than 4 GB of memory is available.

Description

The {{ $labels.node }} node consumes {{ $value }}% of memory for 2 minutes.


SystemDiskInodesFullWarning

Severity

Warning

Summary

The {{ $labels.mountpoint }} volume uses 85% of inodes.

Description

The {{ $labels.device }} disk on the {{ $labels.node }} node consumes {{ $value }}% of disk inodes in the {{ $labels.mountpoint }} volume for 2 minutes.


SystemDiskInodesFullMajor

Severity

Major

Summary

The {{ $labels.mountpoint }} volume uses 95% of inodes.

Description

The {{ $labels.device }} disk on the {{ $labels.node }} node consumes {{ $value }}% of disk inodes in the {{ $labels.mountpoint }} volume for 2 minutes.


SystemDiskErrorsTooHigh

Severity

Warning

Summary

The {{ $labels.device }} disk is failing.

Description

The {{ $labels.device }} disk on the {{ $labels.node }} node is reporting errors for 5 minutes.

Ironic

This section describes the alerts for Ironic bare metal. The alerted events include Ironic API availability and Ironic processes availability.


IronicBmMetricsMissing

Severity

Major

Summary

Ironic metrics missing.

Description

Metrics retrieved from the Ironic API are not available for 2 minutes.


IronicBmApiOutage

Severity

Critical

Summary

Ironic API outage.

Description

The Ironic API is not accessible.

Kubernetes applications

This section lists the alerts for Kubernetes applications.


KubePodCrashLooping

Severity

Critical

Summary

The {{ $labels.pod }} Pod is restarting.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Pod ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times per 5 minutes.


KubePodNotReady

Severity

Critical

Summary

The {{ $labels.pod }} Pod is in the non-ready state.

Description

The {{ $labels.namespace }}/{{ $labels.pod }} Pod is in the non-ready state for longer than an hour.


KubeDeploymentGenerationMismatch

Severity

Critical

Summary

The {{ $labels.deployment }} deployment generation does not match the metadata.

Description

The deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match the metadata, indicating that the deployment failed but has not been rolled back.


KubeDeploymentReplicasMismatch

Severity

Critical

Summary

The {{ $labels.deployment }} deployment has a wrong number of replicas.

Description

The {{ $labels.namespace }}/{{ $labels.deployment }} deployment does not match the expected number of replicas for longer than one hour.


KubeStatefulSetReplicasMismatch

Severity

Critical

Summary

The {{ $labels.statefulset }} StatefulSet has a wrong number of replicas.

Description

The {{ $labels.namespace }}/{{ $labels.statefulset }} StatefulSet does not match the expected number of replicas for longer than 15 minutes.


KubeStatefulSetGenerationMismatch

Severity

Critical

Summary

The {{ $labels.statefulset }} StatefulSet generation does not match the metadata.

Description

The StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match the metadata, indicating that the StatefulSet failed but has not been rolled back.


KubeStatefulSetUpdateNotRolledOut

Severity

Critical

Summary

The {{ $labels.statefulset }} StatefulSet update has not been rolled out.

Description

The {{ $labels.namespace }}/{{ $labels.statefulset }} StatefulSet update has not been rolled out.


KubeDaemonSetRolloutStuck

Severity

Critical

Summary

The {{ $labels.daemonset }} DaemonSet is not ready.

Description

Only {{ $value }}% of the desired Pods of the {{ $labels.namespace }}/{{ $labels.daemonset }} DaemonSet are scheduled and ready.


KubeDaemonSetNotScheduled

Severity

Warning

Summary

The {{ $labels.daemonset }} DaemonSet has not scheduled Pods.

Description

The {{ $labels.namespace }}/{{ $labels.daemonset }} DaemonSet has {{ $value }} not scheduled Pods.


KubeDaemonSetMisScheduled

Severity

Warning

Summary

The {{ $labels.daemonset }} DaemonSet has incorrectly scheduled Pods.

Description

The {{ $labels.namespace }}/{{ $labels.daemonset }} DaemonSet has {{ $value }} Pods running where they are not supposed to run.


KubeCronJobRunning

Severity

Warning

Summary

The {{ $labels.cronjob }} CronJob is not ready for more than one hour.

Description

The {{ $labels.namespace }}/{{ $labels.cronjob }} CronJob takes more than one hour to complete.


KubeJobCompletion

Severity

Warning

Summary

The {{ $labels.job_name }} job is not ready for more than one hour.

Description

The {{ $labels.namespace }}/{{ $labels.job_name }} job takes more than one hour to complete.


KubeJobFailed

Severity

Warning

Summary

The {{ $labels.job_name }} job failed.

Description

The {{ $labels.namespace }}/{{ $labels.job_name }} job failed to complete.

Kubernetes resources

This section lists the alerts for Kubernetes resources.


KubeCPUOvercommitPods

Severity

Warning

Summary

Cluster has overcommitted CPU requests.

Description

The cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.


KubeMemOvercommitPods

Severity

Warning

Summary

Cluster has overcommitted memory requests.

Description

The cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.


KubeCPUOvercommitNamespaces

Severity

Warning

Summary

Cluster has overcommitted CPU requests for namespaces.

Description

The cluster has overcommitted CPU resource requests for namespaces.


KubeMemOvercommitNamespaces

Severity

Warning

Summary

Cluster has overcommitted memory requests for namespaces.

Description

The cluster has overcommitted memory resource requests for namespaces.


KubeQuotaExceeded

Severity

Warning

Summary

The {{ $labels.namespace }} namespace consumes more than 90% of its {{ $labels.resource }} quota.

Description

The {{ $labels.namespace }} namespace consumes {{ printf "%0.0f" $value }}% of its {{ $labels.resource }} quota.


CPUThrottlingHigh

Severity

Warning

Summary

The {{ $labels.pod_name }} Pod has CPU throttling.

Description

The CPU in the {{ $labels.namespace }} namespace for the {{ $labels.container_name }} container in the {{ $labels.pod_name }} Pod has {{ printf "%0.0f" $value }}% throttling.

Kubernetes storage

This section lists the alerts for Kubernetes storage.

Caution

Due to an upstream bug in Kubernetes, metrics for the KubePersistentVolumeUsageCritical and KubePersistentVolumeFullInFourDays alerts that are collected for persistent volumes provisioned by cinder-csi-plugin are not available.


KubePersistentVolumeUsageCritical

Severity

Critical

Summary

The {{ $labels.persistentvolumeclaim }} PersistentVolume has less than 3% of free space.

Description

The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in the {{ $labels.namespace }} namespace is only {{ printf "%0.2f" $value }}% free.


KubePersistentVolumeFullInFourDays

Severity

Critical

Summary

The {{ $labels.persistentvolumeclaim }} PersistentVolume is expected to fill up in 4 days.

Description

Based on the recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in the {{ $labels.namespace }} namespace is expected to fill up within four days. Currently, {{ printf "%0.2f" $value }}% of free space is available.


KubePersistentVolumeErrors

Severity

Critical

Summary

The status of the {{ $labels.persistentvolume }} PersistentVolume is {{ $labels.phase }}.

Description

The status of the {{ $labels.persistentvolume }} PersistentVolume is {{ $labels.phase }}.

Kubernetes system

This section lists the alerts for the Kubernetes system.


KubeNodeNotReady

Severity

Warning

Summary

The {{ $labels.node }} node is not ready for more than one hour.

Description

The Kubernetes {{ $labels.node }} node is not ready for more than one hour.


KubeVersionMismatch

Severity

Warning

Summary

Kubernetes components have mismatching versions.

Description

Kubernetes has components with {{ $value }} different semantic versions running.


KubeClientErrors

Severity

Warning

Summary

Kubernetes API client has more than 1% of error requests.

Description

The {{ $labels.job }}/{{ $labels.instance }} Kubernetes API server client has {{ printf "%0.0f" $value }}% errors.


KubeletTooManyPods

Severity

Warning

Summary

kubelet reached 90% of Pods limit.

Description

The {{ $labels.instance }}/{{ $labels.node }} kubelet runs {{ $value }} Pods, close to the limit of 110.


KubeAPIDown

Severity

Critical

Summary

Kubernetes API endpoint is down.

Description

The Kubernetes API endpoint {{ $labels.instance }} is not accessible for the last 3 minutes.


KubeAPIOutage

Severity

Critical

Summary

Kubernetes API is down.

Description

The Kubernetes API is not accessible for the last 30 seconds.


KubeAPILatencyHighWarning

Severity

Warning

Summary

The API server has a 99th percentile latency of more than 1 second.

Description

The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.


KubeAPILatencyHighCritical

Severity

Critical

Summary

The API server has a 99th percentile latency of more than 4 seconds.

Description

The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.


KubeAPIErrorsHighCritical

Severity

Critical

Summary

API server returns errors for more than 3% of requests.

Description

The API server returns errors for {{ $value }}% of requests.


KubeAPIErrorsHighWarning

Severity

Warning

Summary

API server returns errors for more than 1% of requests.

Description

The API server returns errors for {{ $value }}% of requests.


KubeAPIResourceErrorsHighCritical

Severity

Critical

Summary

API server returns errors for 10% of requests.

Description

The API server returns errors for {{ $value }}% of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}.


KubeAPIResourceErrorsHighWarning

Severity

Warning

Summary

API server returns errors for 5% of requests.

Description

The API server returns errors for {{ $value }}% of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}.


KubeClientCertificateExpirationInSevenDays

Severity

Warning

Summary

An authentication client certificate for the API server expires in less than 7.0 days.

Description

A client certificate used to authenticate to the API server expires in less than 7.0 days.


KubeClientCertificateExpirationInOneDay

Severity

Critical

Summary

An authentication client certificate for the API server expires in less than 24.0 hours.

Description

A client certificate used to authenticate to the API server expires in less than 24.0 hours.


ContainerScrapeError

Severity

Warning

Summary

Failure to get Kubernetes container metrics.

Description

Prometheus was not able to scrape metrics from the container on the {{ $labels.node }} Kubernetes node.

Netchecker

This section lists the alerts for the Netchecker service.


NetCheckerAgentErrors

Severity

Warning

Summary

Netchecker has a high number of errors.

Description

The {{ $labels.agent }} Netchecker agent had {{ $value }} errors within the last hour.


NetCheckerReportsMissing

Severity

Warning

Summary

The number of agent reports is lower than expected.

Description

The {{ $labels.agent }} Netchecker agent has not reported anything for the last 5 minutes.


NetCheckerTCPServerDelay

Severity

Warning

Summary

The TCP connection to Netchecker server takes too much time.

Description

The {{ $labels.agent }} Netchecker agent TCP connection time to the Netchecker server has increased by {{ $value }} within the last 5 minutes.


NetCheckerDNSSlow

Severity

Warning

Summary

The DNS lookup time is too high.

Description

The DNS lookup time on the {{ $labels.agent }} Netchecker agent has increased by {{ $value }} within the last 5 minutes.

NGINX

This section lists the alerts for the NGINX service.


NginxServiceDown

Severity

Minor

Summary

The NGINX service is down.

Description

The NGINX service on the {{ $labels.node }} node is down.


NginxDroppedIncomingConnections

Severity

Minor

Summary

NGINX drops incoming connections.

Description

NGINX on the {{ $labels.node }} node drops {{ $value }} accepted connections per second for 5 minutes.

Node network

This section lists the alerts for a Kubernetes node network.


SystemRxPacketsErrorTooHigh

Severity

Warning

Summary

The {{ $labels.node }} node has packet receive errors.

Description

The {{ $labels.device }} network interface has receive errors on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter.


SystemTxPacketsErrorTooHigh

Severity

Warning

Summary

The {{ $labels.node }} node has packet transmit errors.

Description

The {{ $labels.device }} network interface has transmit errors on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter.


SystemRxPacketsDroppedTooHigh

Severity

Warning

Summary

60 or more received packets were dropped.

Description

{{ $value }} packets received by the {{ $labels.device }} interface on the {{ $labels.node }} node were dropped during the last minute.


SystemTxPacketsDroppedTooHigh

Severity

Warning

Summary

100 transmitted packets were dropped.

Description

{{ $value }} packets transmitted by the {{ $labels.device }} interface on the {{ $labels.node }} node were dropped during the last minute.


NodeNetworkInterfaceFlapping

Severity

Warning

Summary

The {{ $labels.node }} node has a flapping interface.

Description

The {{ $labels.device }} network interface often changes its UP status on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter.

Node time

This section lists the alerts for a Kubernetes node time.


ClockSkewDetected

Severity

Warning

Summary

The NTP offset reached the limit of 0.03 seconds.

Description

Clock skew was detected on the {{ $labels.namespace }}/{{ $labels.pod }} node exporter. Verify that NTP is configured correctly on this host.

PostgreSQL

This section lists the alerts for the PostgreSQL and Patroni services.


PostgresqlDataPageCorruption

Severity

Critical

Summary

Patroni cluster member is experiencing data page corruption.

Description

The Patroni namespace {{ $labels.namespace }}, cluster {{ $labels.cluster }}, pod {{ $labels.pod }} fails to calculate the data page checksum due to possible hardware fault.


PostgresqlDeadlocksDetected

Severity

Warning

Summary

PostgreSQL transactions deadlocks.

Description

The transactions submitted to the Patroni namespace {{ $labels.namespace }}, cluster {{ $labels.cluster }} are experiencing deadlocks.


PostgresqlInsufficientWorkingMemory

Severity

Warning

Summary

Insufficient memory for PostgreSQL queries.

Description

The query data does not fit into working memory on the Patroni namespace {{ $labels.namespace }}, cluster {{ $labels.cluster }}.


PostgresqlPatroniClusterSplitBrain

Severity

Critical

Summary

Patroni cluster split-brain detected.

Description

The Patroni namespace {{ $labels.namespace }}, cluster {{ $labels.cluster }} has multiple primaries, split-brain detected.


PostgresqlPatroniClusterUnlocked

Severity

Critical

Summary

Patroni cluster primary node is missing.

Description

The Patroni namespace {{ $labels.namespace }}, cluster {{ $labels.cluster }} primary node is missing.


PostgresqlPrimaryDown

Severity

Critical

Summary

PostgreSQL is down on the cluster primary node.

Description

The Patroni namespace {{ $labels.namespace }}, cluster {{ $labels.cluster }} is down due to a missing primary node.


PostgresqlReplicaDown

Severity

Warning

Summary

Patroni cluster has replicas with inoperable PostgreSQL.

Description

The Patroni namespace {{ $labels.namespace }}, cluster {{ $labels.cluster }} has {{ $value }}% of replicas with inoperable PostgreSQL.


PostgresqlReplicationNonStreamingReplicas

Severity

Warning

Summary

Patroni cluster has non-streaming replicas.

Description

The Patroni namespace {{ $labels.namespace }}, cluster {{ $labels.cluster }} has replicas not streaming segments from the primary node.


PostgresqlReplicationPaused

Severity

Critical

Summary

Replication has stopped.

Description

Replication has stopped on namespace {{ $labels.namespace }}, cluster {{ $labels.cluster }}, replica pod {{ $labels.pod }}.


PostgresqlReplicationSlowWalApplication

Severity

Warning

Summary

WAL segment application is slow.

Description

Slow replication while applying WAL segments on namespace {{ $labels.namespace }}, cluster {{ $labels.cluster }}, replica pod {{ $labels.pod }}.


PostgresqlReplicationSlowWalDownload

Severity

Warning

Summary

Streaming replication is slow.

Description

Slow replication while downloading WAL segments on namespace {{ $labels.namespace }}, cluster {{ $labels.cluster }}, replica pod {{ $labels.pod }}.


PostgresqlReplicationWalArchiveWriteFailing

Severity

Critical

Summary

Patroni cluster WAL segment writes are failing.

Description

The Patroni namespace {{ $labels.namespace }}, cluster {{ $labels.cluster }}, pod {{ $labels.pod }} fails to write replication segments.

Prometheus

This section describes the alerts for the Prometheus service.


PrometheusConfigReloadFailed

Severity

Warning

Summary

Failure to reload the Prometheus configuration.

Description

Reloading of the Prometheus configuration failed for {{$labels.namespace}}/{{$labels.pod}}.


PrometheusNotificationQueueRunningFull

Severity

Warning

Summary

Prometheus alert notification queue is running full.

Description

The Prometheus alert notification queue is running full for {{$labels.namespace}}/{{ $labels.pod}}.


PrometheusErrorSendingAlertsWarning

Severity

Warning

Summary

Errors occur while sending alerts from Prometheus.

Description

Errors occur while sending alerts from Prometheus {{$labels.namespace}}/{{ $labels.pod}} to Alertmanager {{$labels.Alertmanager}}.


PrometheusErrorSendingAlertsCritical

Severity

Critical

Summary

Errors occur while sending alerts from Prometheus.

Description

Errors occur while sending alerts from Prometheus {{$labels.namespace}}/{{ $labels.pod}} to Alertmanager {{$labels.Alertmanager}}.


PrometheusNotConnectedToAlertmanagers

Severity

Warning

Summary

Prometheus is not connected to Alertmanager.

Description

Prometheus {{ $labels.namespace }}/{{ $labels.pod}} is not connected to any Alertmanager instance.


PrometheusTSDBReloadsFailing

Severity

Warning

Summary

Prometheus has issues reloading data blocks from disk.

Description

The Prometheus server on the {{$labels.instance}} instance has {{$value | humanize}} reload failures over the last four hours.


PrometheusTSDBCompactionsFailing

Severity

Warning

Summary

Prometheus has issues compacting sample blocks.

Description

The Prometheus server on the {{$labels.instance}} instance has {{$value | humanize}} compaction failures over the last four hours.


PrometheusTSDBWALCorruptions

Severity

Warning

Summary

Prometheus encountered WAL corruptions.

Description

The Prometheus server on the {{$labels.instance}} instance has write-ahead log (WAL) corruptions in the time series database (TSDB) for the last 5 minutes.


PrometheusNotIngestingSamples

Severity

Warning

Summary

Prometheus does not ingest samples.

Description

Prometheus {{ $labels.namespace }}/{{ $labels.pod}} does not ingest samples.


PrometheusTargetScrapesDuplicate

Severity

Warning

Summary

Prometheus has many rejected samples.

Description

Prometheus {{$labels.namespace}}/{{$labels.pod}} has many rejected samples because of duplicate timestamps but different values.


PrometheusRuleEvaluationsFailed

Severity

Warning

Summary

Prometheus failed to evaluate recording rules.

Description

Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed evaluations for recording rules. Verify the rules state in the Status/Rules section of the Prometheus Web UI.

SMART disks

This section describes the alerts for SMART disks.


SystemSMARTDiskUDMACrcErrorsTooHigh

Severity

Warning

Summary

The {{ $labels.device }} disk has UDMA CRC errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting SMART UDMA CRC errors for 5 minutes.


SystemSMARTDiskHealthStatus

Severity

Warning

Summary

The {{ $labels.device }} disk has bad health.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting a bad health status for 1 minute.


SystemSMARTDiskReadErrorRate

Severity

Warning

Summary

The {{ $labels.device }} disk has read errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting an increased read error rate for 5 minutes.


SystemSMARTDiskSeekErrorRate

Severity

Warning

Summary

The {{ $labels.device }} disk has seek errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node is reporting an increased seek error rate for 5 minutes.


SystemSMARTDiskTemperatureHigh

Severity

Warning

Summary

The {{ $labels.device }} disk temperature is high.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has a temperature of {{ $value }}C for 5 minutes.


SystemSMARTDiskReallocatedSectorsCount

Severity

Major

Summary

The {{ $labels.device }} disk has reallocated sectors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has reallocated {{ $value }} sectors.


SystemSMARTDiskCurrentPendingSectors

Severity

Major

Summary

The {{ $labels.device }} disk has current pending sectors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} current pending sectors.


SystemSMARTDiskReportedUncorrectableErrors

Severity

Major

Summary

The {{ $labels.device }} disk has reported uncorrectable errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} reported uncorrectable errors.


SystemSMARTDiskOfflineUncorrectableSectors

Severity

Major

Summary

The {{ $labels.device }} disk has offline uncorrectable sectors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} offline uncorrectable sectors.


SystemSMARTDiskEndToEndError

Severity

Major

Summary

The {{ $labels.device }} disk has end-to-end errors.

Description

The {{ $labels.device }} disk on the {{ $labels.host }} node has {{ $value }} end-to-end errors.

SSL certificates

This section lists the alerts for SSL certificates.


SSLCertExpirationWarning

Severity

Warning

Summary

SSL certificate expires in 30 days.

Description

The SSL certificate for {{ $labels.instance }} expires in 30 days.


SSLCertExpirationCritical

Severity

Critical

Summary

SSL certificate expires in 10 days.

Description

The SSL certificate for {{ $labels.instance }} expires in 10 days.

Telemeter

This section describes the alerts for the Telemeter service.


TelemeterClientFederationFailed

Severity

Warning

Summary

Telemeter client failed to send data to the server.

Description

Telemeter client has failed to send data to the Telemeter server twice for the last 30 minutes. Verify the telemeter-client container logs.

UCP

This section describes the alerts for the UCP cluster.


DockerNetworkUnhealthy

Severity

Warning

Summary

Docker network is unhealthy.

Description

The qLen size and NetMsg values showed unexpected output for the last 10 minutes. Verify the NetworkDb Stats output for the qLen size and NetMsg using journalctl -u docker.

Note

For the DockerNetworkUnhealthy alert, StackLight collects metrics from logs. Therefore, this alert is available only if logging is enabled.


DockerNodeFlapping

Severity

Major

Summary

Docker node is flapping.

Description

The {{ $labels.node_name }} Docker node has changed the state more than 3 times for the last 10 minutes.


DockerServiceReplicasDown

Severity

Major

Summary

Docker Swarm replica is down.

Description

The {{ $labels.service_name }} Docker Swarm service replica is down for 2 minutes.


DockerServiceReplicasFlapping

Severity

Major

Summary

Docker Swarm service replica is flapping.

Description

The {{ $labels.service_name }} Docker Swarm service replica is flapping for 15 minutes.


DockerServiceReplicasOutage

Severity

Critical

Summary

Docker Swarm service outage.

Description

All {{ $labels.service_name }} Docker Swarm service replicas are down for 2 minutes.


DockerUCPAPIDown

Severity

Critical

Summary

UCP API endpoint is down.

Description

The Docker UCP API endpoint {{ $labels.instance }} is not accessible for the last 3 minutes.


DockerUCPAPIOutage

Severity

Critical

Summary

UCP API is down.

Description

The Docker UCP API (port 443) is not accessible for the last 30 seconds.


DockerUCPContainerUnhealthy

Severity

Critical

Summary

UCP engine container is in the Unhealthy state.

Description

The {{ $labels.name }} Docker UCP engine container is in the Unhealthy state.


DockerUCPLeadElectionLoop

Severity

Critical

Summary

UCP Manager leadership election loop.

Description

More than 3 Docker UCP leader elections occur for the last 10 minutes.


DockerUCPNodeDiskFullCritical

Severity

Critical

Summary

UCP node disk is 95% full.

Description

The {{ $labels.instance }} Docker UCP node disk is 95% full.


DockerUCPNodeDiskFullWarning

Severity

Warning

Summary

UCP node disk is 85% full.

Description

The {{ $labels.instance }} Docker UCP node disk is 85% full.


DockerUCPNodeDown

Severity

Critical

Summary

UCP node is down.

Description

The {{ $labels.instance }} Docker UCP node is down.

Configure StackLight

This section describes the initial steps required for StackLight configuration. For a detailed description of StackLight configuration options, see StackLight configuration parameters.

  1. Log in to the Docker Enterprise (DE) Container Cloud web UI with the writer permissions.

  2. Switch to the required project using the Switch Project action icon located on top of the main left-side navigation panel.

  3. Expand the menu of the tab with your username.

  4. Click Download kubeconfig to download kubeconfig of your management cluster.

  5. Log in to any local machine with kubectl installed.

  6. Copy the downloaded kubeconfig to this machine.

  7. Run one of the following commands:

    • For a management cluster:

      kubectl --kubeconfig <KUBECONFIG_PATH> edit -n <PROJECT_NAME> cluster <MANAGEMENT_CLUSTER_NAME>
      
    • For a managed cluster:

      kubectl --kubeconfig <KUBECONFIG_PATH> edit -n <PROJECT_NAME> cluster <MANAGED_CLUSTER_NAME>
      
  8. In the following section of the opened manifest, configure the required StackLight parameters as described in StackLight configuration parameters. An example of the resulting section is provided after this procedure.

    spec:
      providerSpec:
        value:
          helmReleases:
          - name: stacklight
            values:
    
  9. Verify StackLight after configuration.
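
For illustration, after you complete step 8, the edited part of the cluster object may look as follows. The logging.enabled key is used here only as an example; substitute it with any of the keys described in StackLight configuration parameters:

spec:
  providerSpec:
    value:
      helmReleases:
      - name: stacklight
        values:
          logging:
            enabled: true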

StackLight configuration parameters

This section describes the StackLight configuration keys that you can specify in the values section to change StackLight settings as required. Prior to making any changes to StackLight configuration, perform the steps described in Configure StackLight. After changing StackLight configuration, verify the changes as described in Verify StackLight after configuration.


Alerta

Key

Description

Example values

alerta.enabled (bool)

Enables or disables Alerta.

true or false


Elasticsearch

Key

Description

Example values

elasticsearch.logstashRetentionTime (int)

Defines the Elasticsearch logstash-* index retention time in days. The logstash-* index stores all logs gathered from all nodes and containers.

1, 5, 15
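
For example, a sketch of the values section that sets the log retention period to 5 days, assuming the usual nested YAML form of the dotted key:

elasticsearch:
  logstashRetentionTime: 5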


Logging

Key

Description

Example values

logging.enabled (bool)

Enables or disables the StackLight logging stack. For details about the logging components, see Reference Architecture: StackLight deployment architecture.

true or false


High availability

Key

Description

Example values

highAvailabilityEnabled (bool)

Enables or disables StackLight multiserver mode. For details, see StackLight database modes in Reference Architecture: StackLight deployment architecture.

true or false


Metric collector

Key

Description

Example values

metricCollector.enabled (bool)

Disables or enables the metric collector. Modify this parameter for the management cluster only.

false or true


Prometheus

Key

Description

Example values

prometheusServer.retentionTime (string)

Defines the Prometheus database retention period. Passed to the --storage.tsdb.retention.time flag.

15d, 1000h, 10d12h

prometheusServer.retentionSize (string)

Defines the Prometheus database retention size. Passed to the --storage.tsdb.retention.size flag.

15GB, 512MB

prometheusServer.alertResendDelay (string)

Defines the minimum amount of time for Prometheus to wait before resending an alert to Alertmanager. Passed to the --rules.alert.resend-delay flag.

2m, 90s
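
For example, a sketch of the values section that combines the three keys above, using the example values from this table:

prometheusServer:
  retentionTime: "15d"
  retentionSize: "15GB"
  alertResendDelay: "2m"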


Cluster size

Key

Description

Example values

clusterSize (string)

Specifies the cluster size.

small, medium, or large


Resource limits

Key

Description

Example values

resourcesPerClusterSize (map)

Provides the capability to override the default resource requests or limits for any StackLight component for the predefined cluster sizes. For a list of StackLight components, see Components versions in Release Notes: Cluster releases.

resourcesPerClusterSize:
  elasticsearch:
    small:
      limits:
        cpu: "1000m"
        memory: "4Gi"
    medium:
      limits:
        cpu: "2000m"
        memory: "8Gi"
      requests:
        cpu: "1000m"
        memory: "4Gi"
    large:
      limits:
        cpu: "4000m"
        memory: "16Gi"

resources (map)

Provides the capability to override the containers resource requests or limits for any StackLight component. For a list of StackLight components, see Components versions in Release Notes: Cluster releases.

resources:
  alerta:
    requests:
      cpu: "50m"
      memory: "200Mi"
    limits:
      memory: "500Mi"

Using the example above, each pod in the alerta service will be requesting 50 millicores of CPU and 200 MiB of memory, while being hard-limited to 500 MiB of memory usage. Each configuration key is optional.


Kubernetes tolerations

Key

Description

Example values

tolerations.default (slice)

Kubernetes tolerations to add to all StackLight components.

default:
- key: "com.docker.ucp.manager"
  operator: "Exists"
  effect: "NoSchedule"

tolerations.component (map)

Defines Kubernetes tolerations (overrides the default ones) for any StackLight component.

component:
  elasticsearch:
  - key: "com.docker.ucp.manager"
    operator: "Exists"
    effect: "NoSchedule"
  postgresql:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"

Storage class

Key

Description

Example values

storage.defaultStorageClass (string)

Defines the StorageClass to use for all StackLight Persistent Volume Claims (PVCs) if a component StorageClass is not defined in componentStorageClasses. To use the cluster default storage class, leave the string empty.

lvp, standard

storage.componentStorageClasses (map)

Defines the storage class for any StackLight component separately, overriding the defaultStorageClass value. To use the cluster default storage class, leave the string empty.

componentStorageClasses:
  elasticsearch: ""
  fluentd: ""
  postgresql: ""
  prometheusAlertManager: ""
  prometheusPushGateway: ""
  prometheusServer: ""

NodeSelector

Key

Description

Example values

nodeSelector.default (map)

Defines the NodeSelector to use for most of the StackLight pods (except some pods that are managed by DaemonSets) if the NodeSelector of a component is not defined.

default:
  role: stacklight

nodeSelector.component (map)

Defines the NodeSelector to use for particular StackLight component pods. Overrides nodeSelector.default.

component:
  alerta:
    role: stacklight
    component: alerta
  kibana:
    role: stacklight
    component: kibana

Ceph monitoring

Key

Description

Example values

ceph.enabled (bool)

Enables or disables Ceph monitoring.

true or false


External endpoint monitoring

Key

Description

Example values

externalEndpointMonitoring.enabled (bool)

Enables or disables HTTP endpoints monitoring. If enabled, the monitoring tool performs the probes against the defined endpoints every 15 seconds.

true or false

externalEndpointMonitoring.certificatesHostPath (string)

Defines the directory path on the host that contains the external endpoint certificates.

/etc/ssl/certs/

externalEndpointMonitoring.domains (slice)

Defines the list of HTTP endpoints to monitor.

domains:
- https://prometheus.io_health
- http://example.com:8080_status
- http://example.net:8080_pulse

Ironic monitoring

Key

Description

Example values

ironic.endpoint (string)

Enables or disables monitoring of bare metal Ironic. To enable, specify the Ironic API URL.

http://ironic-api-http.kaas.svc:6385/v1
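
For example, a sketch of the values section that enables Ironic monitoring, assuming the usual nested YAML form of the dotted key:

ironic:
  endpoint: http://ironic-api-http.kaas.svc:6385/v1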


SSL certificates monitoring

Key

Description

Example values

sslCertificateMonitoring.enabled (bool)

Enables or disables StackLight to monitor and alert on the expiration date of the TLS certificate of an HTTPS endpoint. If enabled, the monitoring tool performs the probes against the defined endpoints every hour.

true or false

sslCertificateMonitoring.domains (slice)

Defines the list of HTTPS endpoints to monitor the certificates from.

domains:
- https://prometheus.io
- http://example.com:8080

Workload monitoring

Key

Description

Example values

metricFilter (map)

On the clusters that run large-scale workloads, workload monitoring generates a big amount of resource-consuming metrics. To prevent generation of excessive metrics, you can disable workload monitoring in the StackLight metrics and monitor only the infrastructure.

The metricFilter parameter enables the cAdvisor (Container Advisor) and kubeStateMetrics metric ingestion filters for Prometheus. The feature is disabled by default. If enabled, you can define the namespaces to which the filter will apply.

metricFilter:
  enabled: true
  action: keep
  namespaces:
  - kaas
  - kube-system
  - stacklight
  • enabled - enable or disable metricFilter using true or false

  • action - action to take by Prometheus:

    • keep - keep only metrics from namespaces that are defined in the namespaces list

    • drop - ignore metrics from namespaces that are defined in the namespaces list

  • namespaces - the list of namespaces to keep or drop metrics from, depending on the action value


UCP monitoring

Key

Description

Example values

ucp.enabled (bool)

Enables or disables Docker UCP monitoring.

true or false

ucp.dockerdDataRoot (string)

Defines the dockerd data root directory of persistent Docker state. For details, see Docker documentation: Daemon CLI (dockerd).

/var/lib/docker
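
For example, a sketch of the values section that enables UCP monitoring with the default dockerd data root, assuming the usual nested YAML form of the dotted keys:

ucp:
  enabled: true
  dockerdDataRoot: /var/lib/docker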


Alerts configuration

Key

Description

Example values

prometheusServer.customAlerts (slice)

Defines custom alerts. Also, modifies or disables existing alert configurations. For the list of predefined alerts, see Available StackLight alerts. While adding or modifying alerts, follow the Alerting rules.

customAlerts:
# To add a new alert:
- alert: ExampleAlert
  annotations:
    description: Alert description
    summary: Alert summary
  expr: example_metric > 0
  for: 5m
  labels:
    severity: warning
# To modify an existing alert expression:
- alert: AlertmanagerFailedReload
  expr: alertmanager_config_last_reload_successful == 5
# To disable an existing alert:
- alert: TargetDown
  enabled: false

An optional enabled field is accepted in the alert body; set it to false to disable an existing alert. All fields specified using the customAlerts definition override the default predefined definitions in the charts’ values.


Watchdog alert

Key

Description

Example values

prometheusServer.watchDogAlertEnabled (bool)

Enables or disables the Watchdog alert that constantly fires as long as the entire alerting pipeline is functional. You can use this alert to verify that Alertmanager notifications properly flow to the Alertmanager receivers.

true or false


Alertmanager integrations

Key

Description

Example values

alertmanagerSimpleConfig.genericReceivers (slice)

Provides a generic template for notification receiver configurations. For a list of supported receivers, see Prometheus Alertmanager documentation: Receiver.

For example, to enable notifications to OpsGenie:

alertmanagerSimpleConfig:
  genericReceivers:
  - name: HTTP-opsgenie
    enabled: true # optional
    opsgenie_configs:
    - api_url: "https://example.app.eu.opsgenie.com/"
      api_key: "secret-key"
      send_resolved: true

Notifications to email

Key

Description

Example values

alertmanagerSimpleConfig.email.enabled (bool)

Enables or disables Alertmanager integration with email.

true or false

alertmanagerSimpleConfig.email (map)

Defines the notification parameters for Alertmanager integration with email. For details, see Prometheus Alertmanager documentation: Email configuration.

email:
  enabled: false
  send_resolved: true
  to: "to@test.com"
  from: "from@test.com"
  smarthost: smtp.gmail.com:587
  auth_username: "from@test.com"
  auth_password: password
  auth_identity: "from@test.com"
  require_tls: true

alertmanagerSimpleConfig.email.route (map)

Defines the route for Alertmanager integration with email. For details, see Prometheus Alertmanager documentation: Route.

route:
  match: {}
  match_re: {}
  routes: []

Notifications to Slack

Key

Description

Example values

alertmanagerSimpleConfig.slack.enabled (bool)

Enables or disables Alertmanager integration with Slack. For details, see Prometheus Alertmanager documentation: Slack configuration.

true or false

alertmanagerSimpleConfig.slack.api_url (string)

Defines the Slack webhook URL.

http://localhost:8888

alertmanagerSimpleConfig.slack.channel (string)

Defines the Slack channel or user to send notifications to.

monitoring

alertmanagerSimpleConfig.slack.route (map)

Defines the notifications route for Alertmanager integration with Slack. For details, see Prometheus Alertmanager documentation: Route.

route:
  match: {}
  match_re: {}
  routes: []
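
Combining the keys above, a minimal sketch of the Slack integration in the values section, using the example webhook URL and channel from this table:

alertmanagerSimpleConfig:
  slack:
    enabled: true
    api_url: http://localhost:8888
    channel: monitoring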

Notifications routing

Key

Description

Example values

alertmanagerSimpleConfig.genericRoutes (slice)

Provides a generic template for notification route configurations. For details, see Prometheus Alertmanager documentation: Route.

genericRoutes:
- receiver: HTTP-opsgenie
  enabled: true # optional
  match_re:
    severity: major|critical
  continue: true

Verify StackLight after configuration

This section describes how to verify StackLight after configuring its parameters as described in Configure StackLight and StackLight configuration parameters. Perform the verification procedure described for a particular modified StackLight key.

To verify StackLight after configuration:

Key

Verification procedure

alerta.enabled

Verify that Alerta is present in the list of StackLight resources. An empty output indicates that Alerta is disabled.

kubectl get all -n stacklight -l app=alerta

elasticsearch.logstashRetentionTime

Verify that the unit_count parameter contains the desired number of days:

kubectl get cm elasticsearch-curator-config -n \
stacklight -o=jsonpath='{.data.action_file\.yml}'
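
To narrow the output down to the relevant line, you can pipe the same command through grep, for example:

kubectl get cm elasticsearch-curator-config -n \
stacklight -o=jsonpath='{.data.action_file\.yml}' | grep unit_count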

logging.enabled

Verify that Elasticsearch, Fluentd, and Kibana are present in the list of StackLight resources. An empty output indicates that the StackLight logging stack is disabled.

kubectl get all -n stacklight -l 'app in
(elasticsearch-master,kibana,fluentd-elasticsearch)'

highAvailabilityEnabled

Run kubectl get sts -n stacklight. The output includes the number of service replicas for the HA or non-HA StackLight modes. For details, see StackLight deployment architecture.

metricCollector.enabled

Verify that metric collector is present in the list of StackLight resources. An empty output indicates that metric collector is disabled.

kubectl get all -n stacklight -l app=mcc-metric-collector

  • prometheusServer.retentionTime

  • prometheusServer.retentionSize

  • prometheusServer.alertResendDelay

  1. In the Prometheus web UI, navigate to Status > Command-Line Flags.

  2. Verify the values for the following flags:

    • storage.tsdb.retention.time

    • storage.tsdb.retention.size

    • rules.alert.resend-delay

  • clusterSize

  • resourcesPerClusterSize

  • resources

  1. Obtain the list of pods:

    kubectl get po -n stacklight
    
  2. Verify that the desired resource limits or requests are set in the resources section of every container in the pod:

    kubectl get po <pod_name> -n stacklight -o yaml
    
  • nodeSelector.default

  • nodeSelector.component

  • tolerations.default

  • tolerations.component

Verify that the pods of the appropriate components are located on the intended nodes:

kubectl get pod -o=custom-columns=NAME:.metadata.name,\
STATUS:.status.phase,NODE:.spec.nodeName -n stacklight

  • storage.defaultStorageClass

  • storage.componentStorageClasses

Verify that the PVCs of the appropriate components have been created according to the configured StorageClass:

kubectl get pvc -n stacklight

ceph.enabled

  1. In the Grafana web UI, verify that Ceph dashboards are present in the list of dashboards and are populated with data.

  2. In the Prometheus web UI, click Alerts and verify that the list of alerts contains Ceph* alerts.

  • externalEndpointMonitoring.enabled

  • externalEndpointMonitoring.domains

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the blackbox-external-endpoint target contains the configured domains (URLs).

ironic.endpoint

In the Grafana web UI, verify that the Ironic BM dashboard displays valid data (no false-positive or empty panels).

metricFilter

  1. In the Prometheus web UI, navigate to Status > Configuration.

  2. Verify that the following fields in the metric_relabel_configs section for the kubernetes-nodes-cadvisor and prometheus-kube-state-metrics scrape jobs have the required configuration:

    • action is set to keep or drop

    • regex contains a regular expression with configured namespaces delimited by |

    • source_labels is set to [namespace]

  • sslCertificateMonitoring.enabled

  • sslCertificateMonitoring.domains

  1. In the Prometheus web UI, navigate to Status > Targets.

  2. Verify that the blackbox target contains the configured domains (URLs).

ucp.enabled

  1. In the Grafana web UI, verify that the UCP Cluster and UCP Containers dashboards are present and not empty.

  2. In the Prometheus web UI, navigate to Alerts and verify that the DockerUCP* alerts are present in the list of alerts.

ucp.dockerdDataRoot

In the Prometheus web UI, navigate to Alerts and verify that the DockerUCPAPIDown alert is not firing falsely due to a missing certificate.

prometheusServer.customAlerts

In the Prometheus web UI, navigate to Alerts and verify that the list of alerts has changed according to your customization.

prometheusServer.watchDogAlertEnabled

In the Prometheus web UI, navigate to Alerts and verify that the list of alerts contains the Watchdog alert.

alertmanagerSimpleConfig.genericReceivers

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the intended receiver(s).

alertmanagerSimpleConfig.genericRoutes

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the intended route(s).

  • alertmanagerSimpleConfig.email.enabled

  • alertmanagerSimpleConfig.email

  • alertmanagerSimpleConfig.email.route

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the Email receiver and route.

  • alertmanagerSimpleConfig.slack.enabled

  • alertmanagerSimpleConfig.slack.api_url

  • alertmanagerSimpleConfig.slack.channel

  • alertmanagerSimpleConfig.slack.route

In the Alertmanager web UI, navigate to Status and verify that the Config section contains the HTTP-slack receiver and route.

Enable generic metric scraping

StackLight can scrape metrics from any service that exposes Prometheus metrics and is running on the Kubernetes cluster. Such metrics appear in Prometheus under the {job="stacklight-generic",service="<service_name>",namespace="<service_namespace>"} set of labels. If the Kubernetes service is backed by Kubernetes pods, the set of labels also includes {pod="<pod_name>"}.

To enable the functionality, define at least one of the following annotations in the service metadata:

  • "generic.stacklight.mirantis.com/scrape-port" - the HTTP endpoint port. By default, the port number found through Kubernetes service discovery, usually __meta_kubernetes_pod_container_port_number. If none discovered, use the default port for the chosen scheme.

  • "generic.stacklight.mirantis.com/scrape-path" - the HTTP endpoint path, related to the Prometheus scrape_config.metrics_path option. By default, /metrics.

  • "generic.stacklight.mirantis.com/scrape-scheme" - the HTTP endpoint scheme between HTTP and HTTPS, related to the Prometheus scrape_config.scheme option. By default, http.

For example:

metadata:
  annotations:
    "generic.stacklight.mirantis.com/scrape-path": "/metrics"

metadata:
  annotations:
    "generic.stacklight.mirantis.com/scrape-port": "8080"

Manage Ceph

This section outlines Ceph LCM operations such as adding Ceph Monitor, Ceph OSD, and RADOS Gateway nodes to an existing Ceph cluster or removing them, as well as removing or replacing Ceph OSDs or updating your Ceph cluster.

Enable automated Ceph LCM

Ceph controller can automatically redeploy Ceph OSDs in case of significant configuration changes such as changing the block.db device or replacing Ceph OSDs. Ceph controller can also clean disks and configuration during a Ceph OSD removal.

To remove a single Ceph OSD or an entire node, manually remove its definition from the KaasCephCluster CR.

To enable automated management of Ceph OSDs:

  1. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

  2. Obtain and export kubeconfig of the management cluster as described in Connect to a Docker Enterprise Container Cloud cluster.

  3. Open the KaasCephCluster CR for editing. Choose from the following options:

    • For a management cluster:

      kubectl edit kaascephcluster
      
    • For a managed cluster:

      kubectl edit kaascephcluster -n <managedClusterProjectName>
      

      Substitute <managedClusterProjectName> with the corresponding value.

  4. Set the manageOsds parameter to true:

    spec:
      cephClusterSpec:
        manageOsds: true
    

Once done, all OSDs with a modified configuration will be redeployed. Mirantis recommends modifying only one node at a time. For details about supported configuration parameters, see OSD Configuration Settings.

Add, remove, or reconfigure Ceph nodes

The Mirantis Ceph controller simplifies Ceph cluster management by automating LCM operations. To modify Ceph components, you only need to update the MiraCeph custom resource (CR). Once you update the MiraCeph CR, the Ceph controller automatically adds, removes, or reconfigures the nodes as required.

To add, remove, or reconfigure Ceph nodes on a management or managed cluster:

  1. To modify Ceph OSDs, verify that the manageOsds parameter is set to true in the KaasCephCluster CR as described in Enable automated Ceph LCM.

  2. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

  3. Obtain and export kubeconfig of the management cluster as described in Connect to a Docker Enterprise Container Cloud cluster.

  4. Open the KaasCephCluster CR for editing. Choose from the following options:

    • For a management cluster:

      kubectl edit kaascephcluster
      
    • For a managed cluster:

      kubectl edit kaascephcluster -n <managedClusterProjectName>
      

      Substitute <managedClusterProjectName> with the corresponding value.

  5. In the nodes section, specify or remove the parameters for a Ceph OSD as required. For the parameters description, see OSD Configuration Settings.

    For example:

    nodes:
      kaas-mgmt-node-5bgk6:
        roles:
        - mon
        - mgr
        storageDevices:
        - config:
            storeType: bluestore
        name: sdb
    

    Note

    To use a new node for Ceph Monitor or Manager deployment, also specify the roles parameter.

  6. If you are making changes for your managed cluster, obtain and export kubeconfig of the managed cluster as described in Connect to a Docker Enterprise Container Cloud cluster. Otherwise, skip this step.

  7. Monitor the status of your Ceph cluster deployment. For example:

    kubectl -n rook-ceph get pods
    
    kubectl -n ceph-lcm-mirantis logs ceph-controller-78c95fb75c-dtbxk
    
    kubectl -n rook-ceph logs rook-ceph-operator-56d6b49967-5swxr
    
  8. Connect to the terminal of the ceph-tools pod:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
    
  9. Verify that the Ceph node has been successfully added, removed, or reconfigured:

    1. Verify that the Ceph cluster status is healthy:

      ceph status
      

      Example of a positive system response:

      cluster:
        id:     0868d89f-0e3a-456b-afc4-59f06ed9fbf7
        health: HEALTH_OK
      
      services:
        mon: 3 daemons, quorum a,b,c (age 20h)
        mgr: a(active, since 20h)
        osd: 9 osds: 9 up (since 20h), 9 in (since 2d)
      
      data:
        pools:   1 pools, 32 pgs
        objects: 0 objects, 0 B
        usage:   9.1 GiB used, 231 GiB / 240 GiB avail
        pgs:     32 active+clean
      
    2. Verify that the status of the Ceph OSDs is up:

      ceph osd tree
      

      Example of a positive system response:

      ID  CLASS WEIGHT  TYPE NAME                   STATUS REWEIGHT PRI-AFF
      -1       0.23424 root default
      -3       0.07808             host osd1
       1   hdd 0.02930                 osd.1           up  1.00000 1.00000
       3   hdd 0.01949                 osd.3           up  1.00000 1.00000
       6   hdd 0.02930                 osd.6           up  1.00000 1.00000
      -15       0.07808             host osd2
       2   hdd 0.02930                 osd.2           up  1.00000 1.00000
       5   hdd 0.01949                 osd.5           up  1.00000 1.00000
       8   hdd 0.02930                 osd.8           up  1.00000 1.00000
      -9       0.07808             host osd3
       0   hdd 0.02930                 osd.0           up  1.00000 1.00000
       4   hdd 0.01949                 osd.4           up  1.00000 1.00000
       7   hdd 0.02930                 osd.7           up  1.00000 1.00000
      

Replace a failed Ceph OSD

After a physical disk replacement, you can use Rook to redeploy a failed Ceph OSD by restarting the Rook operator, which triggers the reconfiguration of the management or managed cluster.

To redeploy a failed Ceph OSD:

  1. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

  2. Obtain and export kubeconfig of the required management or managed cluster as described in Connect to a Docker Enterprise Container Cloud cluster.

  3. Identify the failed Ceph OSD ID:

    ceph osd tree
    
  4. Remove the Ceph OSD deployment from the management or managed cluster (see the example after this procedure):

    kubectl delete deployment -n rook-ceph rook-ceph-osd-<ID>
    
  5. Connect to the terminal of the ceph-tools pod:

    kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod \
    -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') bash
    
  6. Remove the failed Ceph OSD from the Ceph cluster:

    ceph osd purge osd.<ID>
    
  7. Replace the failed disk.

  8. Restart the Rook operator:

    kubectl delete pod $(kubectl -n rook-ceph get pod -l "app=rook-ceph-operator" \
    -o jsonpath='{.items[0].metadata.name}') -n rook-ceph
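
For example, with a hypothetical failed Ceph OSD ID of 3, the commands in steps 4 and 6 above become the following, run on the local machine and inside the ceph-tools pod respectively:

kubectl delete deployment -n rook-ceph rook-ceph-osd-3

ceph osd purge osd.3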
    

Update Ceph cluster

You can update the Ceph cluster to the latest minor version of Ceph Nautilus by triggering an update of the existing Ceph cluster.

To update the Ceph cluster:

  1. Verify that your management cluster is automatically upgraded to the latest Docker Enterprise (DE) Container Cloud release:

    1. Log in to the DE Container Cloud web UI with the writer permissions.

    2. On the bottom of the page, verify the DE Container Cloud version number.

  2. Verify that your managed clusters are updated to the latest Cluster release. For details, see Update a managed cluster.

  3. Log in to a local machine running Ubuntu 18.04 where kubectl is installed.

  4. Obtain and export kubeconfig of the management cluster as described in Connect to a Docker Enterprise Container Cloud cluster.

  5. Open the KaasCephCluster CR for editing:

    kubectl edit kaascephcluster
    
  6. Update the version parameter (see the sketch after this procedure for its position in the manifest). For example:

    version: 14.2.9
    
  7. Obtain and export kubeconfig of the managed clusters as described in Connect to a Docker Enterprise Container Cloud cluster.

  8. Repeat steps 5-7 to update Ceph on every managed cluster.
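
For reference, the version parameter in step 6 is nested under cephClusterSpec, similarly to the manageOsds example in Enable automated Ceph LCM. A minimal sketch of the edited section, assuming that layout:

spec:
  cephClusterSpec:
    version: 14.2.9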